# Convert to Dataframe Using Multiprocessing Pool

Multiprocessing is a Python standard-library package that lets a program use multiple processors on a machine for faster processing. Its Pool object provides data parallelism: it distributes a function's input values across separate processes. This sample script uses a multiprocessing Pool to parse a large number of XML files.

Multiprocessing is most useful when the same function is applied to a large set of data. With data parallelism, independent processes run simultaneously, each applying the function to its own share of the data without having to communicate with the others.

The script defines a parsing function, creates a Pool object, and then calls the function through that pool so that the XML files are parsed in parallel.
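The three steps (define a function, create a pool, map the function over the inputs) can be sketched on toy data before turning to the XML files. The names `count_words` and `count_all` are hypothetical stand-ins and are not part of this notebook; the sketch assumes only the standard-library `multiprocessing` module.

```python
from multiprocessing import Pool


def count_words(text):
    """Stand-in for a per-file parsing function: count whitespace-separated words."""
    return len(text.split())


def count_all(texts, processes=2):
    """Apply count_words to every item in worker processes; results keep input order."""
    with Pool(processes=processes) as pool:
        return pool.map(count_words, texts)


if __name__ == "__main__":
    samples = ["one two three", "four five", "six"]
    print(count_all(samples))
```

Each worker handles its own inputs independently, which is the data-parallel pattern described above. The function passed to `map` has to be defined at module (or notebook cell) top level so that it can be sent to the worker processes.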

This Jupyter notebook is intended to be run in ProQuest TDM Studio.

## Importing Libraries and Files

In [137]:
# XML Reading and Parsing Libraries
from lxml import etree
from bs4 import BeautifulSoup
import pandas as pd
import os
import dask
import dask.dataframe as dd

# Multiprocessing Module
import multiprocessing as mp
from multiprocessing import Pool

In [138]:
# Start a local Dask client; on this instance it reports 4 workers with 8 threads in total
from dask.distributed import Client
client = Client()
client

Perhaps you already have a cluster running?
Hosting the HTTP server on port 40425 instead
  f"Port {expected} is already in use.\n"


Client: connection method: Cluster object; cluster type: distributed.LocalCluster; dashboard: http://127.0.0.1:40425/status

LocalCluster: workers: 4; total threads: 8; total memory: 31.07 GiB; status: running; using processes: True

Scheduler: comm: tcp://127.0.0.1:41843; workers: 4; total threads: 8; total memory: 31.07 GiB; started: just now

| Worker comm | Dashboard | Nanny | Threads | Memory | Local directory |
|---|---|---|---|---|---|
| tcp://127.0.0.1:40819 | http://127.0.0.1:37001/status | tcp://127.0.0.1:44487 | 2 | 7.77 GiB | /home/ec2-user/SageMaker/dask-worker-space/worker-7_l7nf0e |
| tcp://127.0.0.1:36401 | http://127.0.0.1:45043/status | tcp://127.0.0.1:41519 | 2 | 7.77 GiB | /home/ec2-user/SageMaker/dask-worker-space/worker-kjgdd_jv |
| tcp://127.0.0.1:39641 | http://127.0.0.1:40863/status | tcp://127.0.0.1:33021 | 2 | 7.77 GiB | /home/ec2-user/SageMaker/dask-worker-space/worker-tng0opsc |
| tcp://127.0.0.1:33877 | http://127.0.0.1:37111/status | tcp://127.0.0.1:42049 | 2 | 7.77 GiB | /home/ec2-user/SageMaker/dask-worker-space/worker-reihvs9x |


## Choosing the Dataset for Later Processing


In [139]:
dataset_name = 'WSJ201911202111'

# Defining the dataset path
corpus_directory = '/home/ec2-user/SageMaker/data/'

articles = os.listdir(corpus_directory + dataset_name + '/')
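`os.listdir` returns every entry in the folder, so any stray non-XML file or subfolder would be handed to the parser as well. A minimal sketch of a stricter listing follows; the helper name `list_xml_files` is hypothetical, and whether the dataset folder actually contains anything besides XML files is not known from this notebook.

```python
import os


def list_xml_files(directory):
    """Return the sorted names of the .xml files directly inside `directory`."""
    return sorted(
        name
        for name in os.listdir(directory)
        if name.lower().endswith(".xml")
        and os.path.isfile(os.path.join(directory, name))
    )
```

With the variables defined above, this could be called as `list_xml_files(corpus_directory + dataset_name)`. Sorting also makes the order of `articles` reproducible between runs, which `os.listdir` alone does not guarantee.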

In [140]:
# Verify that the number of articles is correct
len(articles) 

96599

## Define Functions

In [141]:
# Locate the text content in an XML file; used by make_lists below
def getxmlcontent(root):
    # Prefer <HiddenText>, then fall back to <Text>
    for path in ('.//HiddenText', './/Text'):
        node = root.find(path)
        if node is not None:
            return node.text

    return None
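To see the lookup order in action, the function can be tried on small in-memory documents. This sketch repeats the function so that it runs on its own, and it swaps in the standard-library `xml.etree.ElementTree` for `lxml` purely to avoid extra dependencies; the `find` calls used here behave the same way in both. The sample records are hypothetical and far simpler than real TDM Studio files.

```python
import xml.etree.ElementTree as ET


def getxmlcontent(root):
    """Same lookup order as above: HiddenText first, then Text, else None."""
    for path in (".//HiddenText", ".//Text"):
        node = root.find(path)
        if node is not None:
            return node.text
    return None


# Hypothetical records covering the three branches
with_hidden = ET.fromstring(
    "<Record><Obj><HiddenText>full body</HiddenText></Obj><Text>abstract</Text></Record>"
)
text_only = ET.fromstring("<Record><Text>abstract</Text></Record>")
neither = ET.fromstring("<Record><Title>headline</Title></Record>")

print(getxmlcontent(with_hidden))  # HiddenText wins when both are present
print(getxmlcontent(text_only))    # falls back to Text
print(getxmlcontent(neither))      # None when neither element exists
```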

In [142]:
# Extract the necessary goid, text, and date content from an XML file
# Written to handle a single file so that the pool can map it over the corpus
def make_lists(article):
    # Defaults, so the return below works even when parsing fails
    text = 'Error in processing document'
    date = None

    try:
        tree = etree.parse(corpus_directory + dataset_name + '/' + article)
        root = tree.getroot()

        content = getxmlcontent(root)
        if content:
            # Name the parser explicitly; lxml is already imported above
            soup = BeautifulSoup(content, 'lxml')
            # Turn literal backslash-n sequences into real newlines
            text = soup.get_text().replace('\\n', '\n')

        date = root.find('.//NumericDate').text

    except AttributeError:
        # Error logging - shows the filename if there is a problem processing it
        print("Attribute Error: " + article)

    return article, text, date
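The `.replace('\\n', '\n')` call in `make_lists` is easy to misread: it does not touch real line breaks, it converts the two-character sequence backslash + `n` into one. The sketch below illustrates this on a made-up string, assuming (as the original code implies) that such literal sequences can occur in the extracted text.

```python
# Raw string: contains the two characters backslash and n, not a line break
raw = r"First paragraph.\nSecond paragraph."

# '\\n' is the two-character sequence; '\n' is a real newline
fixed = raw.replace('\\n', '\n')

print(repr(raw))
print(repr(fixed))
print(fixed.splitlines())
```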

## Run Multiprocessing to Parse XML Files

In [143]:
# Test function on single article
make_lists(articles[1])

('2384277728.xml',
 '2020-03-30')

In [46]:
# Check core count
num_cores = mp.cpu_count()
print(num_cores)

8
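Using every core is a sensible default, but on a shared instance it can help to leave some capacity free. One possible heuristic is sketched below: the helper `pick_process_count` and its `reserve` parameter are hypothetical, and the choice of reserving one core is an assumption rather than a recommendation from this notebook.

```python
import multiprocessing as mp


def pick_process_count(n_tasks, reserve=1, available=None):
    """Leave `reserve` cores free, never exceed the task count, and use at least 1."""
    if available is None:
        available = mp.cpu_count()
    return max(1, min(available - reserve, n_tasks))


# With 8 cores and a large corpus this gives 7 worker processes
print(pick_process_count(96599, reserve=1, available=8))
```

The result could then be passed as the `processes` argument when the pool is created.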


In [144]:
# When using multiple processes, always shut them down to avoid memory/resource leaks
# Define a process Pool to parse multiple XML files simultaneously
# Defaults to num_cores; adjust the number of processes to suit the instance
p = Pool(processes=num_cores)

try:
    # Apply the function to the corpus with the Pool
    # To limit the number of articles, pass a slice such as articles[:1000]
    processed_lists = p.map(make_lists, articles)

finally:
    # Stop accepting work and wait for the worker processes to exit
    p.close()
    p.join()
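Before building the dataframe, it can be useful to check how many articles fell back to the `'Error in processing document'` placeholder that `make_lists` uses when no text is found. The sketch below works on a hypothetical sample shaped like the `(article, text, date)` tuples returned by `make_lists`; the helper name `split_failures` is not part of the notebook.

```python
ERROR_MARKER = "Error in processing document"  # placeholder string used by make_lists


def split_failures(rows, marker=ERROR_MARKER):
    """Split (article, text, date) tuples into parsed rows and names of articles without text."""
    parsed = [row for row in rows if row[1] != marker]
    failed = [row[0] for row in rows if row[1] == marker]
    return parsed, failed


# Hypothetical results in the same shape that make_lists returns
sample = [
    ("1.xml", "Body text", "2020-03-30"),
    ("2.xml", ERROR_MARKER, "2020-03-31"),
]
parsed, failed = split_failures(sample)
print(len(parsed), "parsed;", len(failed), "without text:", failed)
```

On the real output this would be called as `split_failures(processed_lists)`.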

## View and Save Dataframe

In [145]:
# Transform processed data into a dataframe
df = pd.DataFrame(processed_lists, columns=['Article ID', 'Text', 'Date'])
df

,Article ID,Text,Date
0,2457405332.xml,\nJoe Biden racked up Electoral College votes ...,2020-11-05
1,2384277728.xml,\nDanita Sienknecht was on a car ride with her...,2020-03-30
2,2524054500.xml,\nSimon Heaton said he expected to get a prest...,2021-05-10
3,2325226974.xml,\nHere are some of the companies with shares e...,2019-12-13
4,2599921634.xml,\nNovartis AG bet big on its new cholesterol-b...,2021-11-21
...,...,...,...
96594,2512354895.xml,\n\n\n\n\n\n\n\n\n\n\n\nBiden administration o...,2021-04-14
96595,2577580565.xml,"\nTOKYO—Fumio Kishida, a former foreign minist...",2021-09-30
96596,2443906195.xml,\nU.K. budget carrier EasyJet PLC hired a new ...,2020-09-18
96597,2540658373.xml,"\nMINNEAPOLIS\nDriver Plows Into\nProtest, Kil...",2021-06-15
