# Preprocessing Pipeline Demo

In [1]:
import sys
import os
import pandas as pd
from sklearn.datasets import fetch_20newsgroups

import warnings
warnings.filterwarnings("ignore")

#sys.path.insert(0, os.path.abspath(os.path.join('nlp_library')))
sys.path.insert(0, os.path.abspath(os.path.join('..')))
import nlp.preprocessing as pre

## Read Data

In [2]:
df = pd.DataFrame({'text':fetch_20newsgroups(subset='train')['data']}).iloc[:100]
df.head()

| | text |
|---|---|
| 0 | From: lerxst@wam.umd.edu (where's my thing)\nS... |
| 1 | From: guykuo@carson.u.washington.edu (Guy Kuo)... |
| 2 | From: twillis@ec.ecn.purdue.edu (Thomas E Will... |
| 3 | From: jgreen@amber (Joe Green)\nSubject: Re: W... |
| 4 | From: jcm@head-cfa.harvard.edu (Jonathan McDow... |


## Object Creation
The `Pipeline` class is the central object of the preprocessing library.  
- It is built on spaCy.
- It applies certain cleaning actions before running the spaCy pipeline.
- It also includes a spaCy extension for actions that spaCy does not cover.
- The configuration is passed as a dict and can be changed later.
- It follows standard scikit-learn syntax.

In [3]:
#############################################################
# The configuration below defines the way we process the data.
# Every line indicates a preprocessing action to perform.
# Feel free to modify individual values from 'True' to 'False'

config = {
        'appos': True,
        'slang': True,
        'sep_digit_text': True,
        'emoticons': False,
        'emoticons_del': True,
        'url': True, 
        'email': True,  # mask can be used
        'html': True,
        'proper_noun': True, # mask can be used
        'phone_number': 'PHONE_NUMBER', # mask can be used
        'repeated_chars': True,
        'single_char': False,
        'punct': True,
        'number': True, # mask can be used
        'extra_space': True,
        'stop_words': False,
        'lemmas': False
    }

#############################################################

# Create preprocessing pipeline
pp = pre.Pipeline(config, 
                  n_process=1, # Single process
                  progress_bar=True # Display progress bar
                 )
print(pp)


In [4]:
# Check configuration
pp.config


In [5]:
# Activate stop_words removal in configuration and show it again
pp.stop_words = True
pp.config

{'appos': True,
 'slang': True,
 'sep_digit_text': True,
 'emoticons': False,
 'emoticons_del': True,
 'url': True,
 'email': True,
 'html': True,
 'proper_noun': True,
 'phone_number': 'PHONE_NUMBER',
 'repeated_chars': True,
 'single_char': False,
 'punct': True,
 'number': True,
 'extra_space': True,
 'stop_words': True,
 'lemmas': False,
 'eol_remove': False,
 'eol_replace': False}

In [6]:
# Show a single configuration value
pp.slang

True

In [7]:
# Setting maskable attributes
pp.email = 'email@email.com'
pp.config

{'appos': True,
 'slang': True,
 'sep_digit_text': True,
 'emoticons': False,
 'emoticons_del': True,
 'url': True,
 'email': 'email@email.com',
 'html': True,
 'proper_noun': True,
 'phone_number': 'PHONE_NUMBER',
 'repeated_chars': True,
 'single_char': False,
 'punct': True,
 'number': True,
 'extra_space': True,
 'stop_words': True,
 'lemmas': False,
 'eol_remove': False,
 'eol_replace': False}
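
The cells above read and write configuration entries as plain attributes (`pp.slang`, `pp.email = ...`) and the change shows up in `pp.config`. One way such behaviour can be obtained is sketched below. `ConfigProxy` is a hypothetical stand-in written for this illustration, not the library's `Pipeline` implementation.

```python
class ConfigProxy:
    """Hypothetical object that exposes config keys as attributes."""

    def __init__(self, config):
        # Bypass our own __setattr__ while the config dict is created.
        object.__setattr__(self, "config", dict(config))

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        config = self.__dict__.get("config", {})
        if name in config:
            return config[name]
        raise AttributeError(name)

    def __setattr__(self, name, value):
        # Known config keys are written to the dict, not to the instance.
        if name in self.config:
            self.config[name] = value
        else:
            object.__setattr__(self, name, value)


demo = ConfigProxy({"slang": True, "email": True, "stop_words": False})
demo.stop_words = True            # toggle an action
demo.email = "email@email.com"    # replace a flag with a mask string
print(demo.config)
```

Keeping every setting in a single dict means the attribute view and `config` cannot drift apart.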

## Fit Method
The Pipeline's `fit` method accepts a list (or vector or Series) of texts.  
It first runs the irreversible actions:
- Appos replacement  
- Slang replacement
- Separate digits from text
- Emoticons replacement or removal
- Repeated chars removal
    
Then it runs the spaCy pipeline.
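
As a rough illustration of what three of these irreversible actions might look like, here is a minimal regex-based sketch. The function names, the tiny `APPOS` table and the exact rules are assumptions made for this example; the library's own implementation may differ. These steps rewrite the string itself, which is presumably why they are called irreversible.

```python
import re

# Hypothetical lookup table; the library's real appos list is not shown here.
APPOS = {"what's": "what is", "i'll": "i will"}
_APPOS_RE = re.compile("|".join(re.escape(k) for k in APPOS), flags=re.IGNORECASE)


def expand_appos(text):
    # Replace each contraction with its expanded (lowercase) form.
    return _APPOS_RE.sub(lambda m: APPOS[m.group(0).lower()], text)


def separate_digits(text):
    # Insert a space wherever a digit is directly followed by a letter.
    return re.sub(r"(?<=\d)(?=[A-Za-z])", " ", text)


def squeeze_repeats(text):
    # Collapse runs of three or more identical characters down to two.
    return re.sub(r"(.)\1{2,}", r"\1\1", text)


def clean(text):
    # Apply the cleaning steps in a fixed order.
    for step in (expand_appos, separate_digits, squeeze_repeats):
        text = step(text)
    return text


cleaned = clean("What's it worth? Pops out of 1st, sooooo nice, $3K")
print(cleaned)  # what is it worth? Pops out of 1 st, soo nice, $3 K
```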

In [8]:
pp.fit(df['text'])
print(f'Number of documents processed {len(pp)}')

HBox(children=(HTML(value='Cleaning Data'), FloatProgress(value=0.0), HTML(value='')))




HBox(children=(HTML(value='Running Spacy'), FloatProgress(value=0.0), HTML(value='')))


Number of documents processed 100


The first progress bar above tracks the irreversible actions; the second tracks the spaCy pipeline.  
After calling `fit`, the Pipeline object stores the spaCy objects it has already processed.

## Indexing and Slicing Methods

In [9]:
# The built-in len() returns the length of the corpus
len(pp)

100

In [10]:
# Retrieving an index returns a spacy.Doc object.
# Note that a Doc retrieved this way shows the text without the spaCy preprocessing applied.
pp[10]

From: irwin@cmptrc.lonestar.org (Irwin Arnstein) Subject: Re: Recommendation on Duc Summary: What's it worth? Distribution: usa Expires: Sat, 1 May 1993 05:00:00 GMT Organization: CompuTrac Inc., Richardson TX Keywords: Ducati, GTS, How much? Lines: 13 I have a line on a Ducati 900 GTS 1978 model with 17 ok on the clock. Runs very well, paint is the bronze/brown/orange faded out, leaks a bit of oil and pops out of 1 st with hard accel. The shop will fix trans and oil leak. They sold the bike to the 1 and only owner. They want $3495, and I am thinking more like $3 K. Any opinions out there? Please email me. Thanks. It would be a nice stable mate to the Beemer. Then I'll get a jap bike and call myself Axis Motors! -- -- "Tuba" (Irwin) "I honk therefore I am" CompuTrac-Richardson,Tx irwin@cmptrc.lonestar.org DoD #0826 (R75/6) --

In [11]:
# Retrieving a slice returns another Pipeline object.
# This new Pipeline object is just a view of the original one,
# which means they share the spaCy pipeline and elements.
# Indexing in the new object starts again at 0.
new_pp = pp[10:12]
new_pp[0]

From: irwin@cmptrc.lonestar.org (Irwin Arnstein) Subject: Re: Recommendation on Duc Summary: What's it worth? Distribution: usa Expires: Sat, 1 May 1993 05:00:00 GMT Organization: CompuTrac Inc., Richardson TX Keywords: Ducati, GTS, How much? Lines: 13 I have a line on a Ducati 900 GTS 1978 model with 17 ok on the clock. Runs very well, paint is the bronze/brown/orange faded out, leaks a bit of oil and pops out of 1 st with hard accel. The shop will fix trans and oil leak. They sold the bike to the 1 and only owner. They want $3495, and I am thinking more like $3 K. Any opinions out there? Please email me. Thanks. It would be a nice stable mate to the Beemer. Then I'll get a jap bike and call myself Axis Motors! -- -- "Tuba" (Irwin) "I honk therefore I am" CompuTrac-Richardson,Tx irwin@cmptrc.lonestar.org DoD #0826 (R75/6) --
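
The indexing behaviour shown above (an integer returns one element, a slice returns a new object whose indices restart at 0 and whose elements are shared) can be sketched with `__len__` and `__getitem__`. `Corpus` is a hypothetical container used only for this illustration; the real `Pipeline` holds spaCy `Doc` objects rather than dicts.

```python
class Corpus:
    """Hypothetical container mimicking the indexing behaviour shown above."""

    def __init__(self, docs):
        self._docs = docs

    def __len__(self):
        return len(self._docs)

    def __getitem__(self, key):
        if isinstance(key, slice):
            # Slicing a list copies references only, so the new Corpus
            # shares its elements with the original one.
            return Corpus(self._docs[key])
        return self._docs[key]


corpus = Corpus([{"id": i} for i in range(100)])
subset = corpus[10:12]

print(len(subset))              # 2
print(subset[0] is corpus[10])  # True
```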

## Other Methods
The `fit` method runs the spaCy pipeline but does not output anything. It only tags the words (tokens), and those tags act as filters later on.  
Output is generated by a secondary method, in one of several forms:
- Strings with the preprocessing filters applied according to the current configuration
- Strings without preprocessing
- Doc objects
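
The tag-first, filter-later idea can be sketched in plain Python. The `Token` dataclass and the helper functions below are hypothetical and stand in for spaCy tokens and the library's output methods. The only point is that the tags are set once, while the configuration is consulted each time output is requested.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical token carrying tags assigned during fitting."""
    text: str
    is_stop: bool = False
    is_punct: bool = False


def keep(token, config):
    # A True config value means "remove this kind of token".
    if config.get("stop_words") and token.is_stop:
        return False
    if config.get("punct") and token.is_punct:
        return False
    return True


def tokenize(tokens, config):
    # Filtered output as a list of token strings.
    return [t.text for t in tokens if keep(t, config)]


def text(tokens, config):
    # Filtered output as a single string.
    return " ".join(tokenize(tokens, config))


tokens = [
    Token("It", is_stop=True),
    Token("is", is_stop=True),
    Token("worth"),
    Token("it", is_stop=True),
    Token("?", is_punct=True),
]

print(tokenize(tokens, {"stop_words": True, "punct": True}))   # ['worth']
print(text(tokens, {"stop_words": False, "punct": True}))      # It is worth it
```

Because filtering happens at output time, changing a configuration value does not require running the tagging step again in this sketch.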

### Data Preprocessed

In [12]:
# transform() returns the full cleaned corpus. It is equivalent to calling text() for every document in the corpus
texts = pp.transform()
texts[0]

'email@email.com thing Subject car Nntp Posting Host Organization University Maryland College Park Lines wondering enlighten car saw day door sports car looked late s/ early s. called Bricklin doors small addition bumper separate rest body know tellme model engine specs years production car history information funky looking car e mail Thanks IL brought neighborhood '

In [14]:
# To get a single preprocessed text, call text() with its index
pp.text(10)

'email@email.com Subject Recommendation Duc Summary worth Distribution usa Expires Sat 05:00:00 GMT Organization CompuTrac Inc. Ducati GTS Lines line Ducati GTS model ok clock Runs paint bronze brown orange faded leaks bit oil pops st hard accel shop fix trans oil leak sold bike owner want $ thinking like $ K. opinions email Thanks nice stable mate Beemer jap bike Axis Motors Tuba Irwin honk CompuTrac Richardson Tx email@email.com DoD R75/6 '

In [15]:
# Sometimes we want the preprocessed data but in a list-of-tokens format
tokens = pp.tokenize()
tokens[10]

['email@email.com',
 'Subject',
 'Recommendation',
 'Duc',
 'Summary',
 'worth',
 'Distribution',
 'usa',
 'Expires',
 'Sat',
 '05:00:00',
 'GMT',
 'Organization',
 'CompuTrac',
 'Inc.',
 'Ducati',
 'GTS',
 'Lines',
 'line',
 'Ducati',
 'GTS',
 'model',
 'ok',
 'clock',
 'Runs',
 'paint',
 'bronze',
 'brown',
 'orange',
 'faded',
 'leaks',
 'bit',
 'oil',
 'pops',
 'st',
 'hard',
 'accel',
 'shop',
 'fix',
 'trans',
 'oil',
 'leak',
 'sold',
 'bike',
 'owner',
 'want',
 '$',
 'thinking',
 'like',
 '$',
 'K.',
 'opinions',
 'email',
 'Thanks',
 'nice',
 'stable',
 'mate',
 'Beemer',
 'jap',
 'bike',
 'Axis',
 'Motors',
 'Tuba',
 'Irwin',
 'honk',
 'CompuTrac',
 'Richardson',
 'Tx',
 'email@email.com',
 'DoD',
 'R75/6']

### Data Without Preprocessing

In [16]:
pp[10].text

'From: irwin@cmptrc.lonestar.org (Irwin Arnstein) Subject: Re: Recommendation on Duc Summary: What\'s it worth? Distribution: usa Expires: Sat, 1 May 1993 05:00:00 GMT Organization: CompuTrac Inc., Richardson TX Keywords: Ducati, GTS, How much? Lines: 13 I have a line on a Ducati 900 GTS 1978 model with 17 ok on the clock. Runs very well, paint is the bronze/brown/orange faded out, leaks a bit of oil and pops out of 1 st with hard accel. The shop will fix trans and oil leak. They sold the bike to the 1 and only owner. They want $3495, and I am thinking more like $3 K. Any opinions out there? Please email me. Thanks. It would be a nice stable mate to the Beemer. Then I\'ll get a jap bike and call myself Axis Motors! -- -- "Tuba" (Irwin) "I honk therefore I am" CompuTrac-Richardson,Tx irwin@cmptrc.lonestar.org DoD #0826 (R75/6) --'

### Doc Objects

In [17]:
doc = pp[10]
type(doc)

spacy.tokens.doc.Doc

## Multiprocessing
The Pipeline object supports multiprocessing.  
It is recommended only for large corpora; for small corpora the multiprocessing overhead outweighs the gain.  
The progress bar functionality is limited in multiprocessing mode.

In [18]:
pp = pre.Pipeline(config, 
                  n_process=4, # 4 processes
                  progress_bar=True # Display progress bar
                 )
pp.fit(df['text'])

HBox(children=(HTML(value='Running Spacy'), FloatProgress(value=0.0), HTML(value='')))




<nlp.preprocessing.Pipeline at 0x1c2a596de80>

Only the second progress bar (the one covering the spaCy pipeline) is displayed.  
For short corpora, a single process is normally preferable to multiprocessing.

In [19]:
pp[10]

From: irwin@cmptrc.lonestar.org (Irwin Arnstein) Subject: Re: Recommendation on Duc Summary: What's it worth? Distribution: usa Expires: Sat, 1 May 1993 05:00:00 GMT Organization: CompuTrac Inc., Richardson TX Keywords: Ducati, GTS, How much? Lines: 13 I have a line on a Ducati 900 GTS 1978 model with 17 ok on the clock. Runs very well, paint is the bronze/brown/orange faded out, leaks a bit of oil and pops out of 1 st with hard accel. The shop will fix trans and oil leak. They sold the bike to the 1 and only owner. They want $3495, and I am thinking more like $3 K. Any opinions out there? Please email me. Thanks. It would be a nice stable mate to the Beemer. Then I'll get a jap bike and call myself Axis Motors! -- -- "Tuba" (Irwin) "I honk therefore I am" CompuTrac-Richardson,Tx irwin@cmptrc.lonestar.org DoD #0826 (R75/6) --

## Memory Saving

In [4]:
# fit_transform() can be called in scikit-learn fashion to output a list of preprocessed texts.
pp = pre.Pipeline(config, 
                  n_process=4, # 4 processes
                  progress_bar=True # Display progress bar
                 )
texts = pp.fit_transform(df['text'])
texts[1]

HBox(children=(HTML(value='Running Spacy'), FloatProgress(value=0.0), HTML(value='')))




'From Subject SI Clock Poll Final Call Summary Final call for SI clock reports Keywords SI acceleration clock upgrade Article I.D. shelley.1 qvfo9 INNc3 s Organization University of Washington Lines NNTP Posting Host A fair number of brave souls who upgraded their SI clock oscillator have shared their experiences for this poll Please send a brief message detailing your experiences with the procedure Top speed attained CPU rated speed add on cards and adapters heat sinks hour of usage per day floppy disk functionality with and male floppies are especially requested I will be summarizing in the next days so please add to the network knowledge base if you have done the clock upgrade and have not answered this poll Thanks '

In [5]:
# After fit_transform() the pipeline holds no documents
len(pp)

0
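
Judging from `len(pp)` returning 0, `fit_transform()` appears to return the preprocessed strings without keeping the processed documents in the object. The sketch below illustrates that trade-off with a hypothetical `TinyPipeline` class, where `str.split` stands in for the spaCy processing. It is an assumption about the design, not the library's code.

```python
class TinyPipeline:
    """Hypothetical pipeline contrasting fit() with fit_transform()."""

    def __init__(self):
        self._docs = []

    def _process(self, text):
        # Stand-in for the expensive spaCy processing of one text.
        return text.split()

    def fit(self, texts):
        # Keep every processed document for later indexing and output.
        self._docs = [self._process(t) for t in texts]
        return self

    def transform(self):
        return [" ".join(doc) for doc in self._docs]

    def fit_transform(self, texts):
        # Process each text, keep only the output string, store nothing.
        return [" ".join(self._process(t)) for t in texts]

    def __len__(self):
        return len(self._docs)


texts = ["a  fair   number", "of brave  souls"]

stored = TinyPipeline().fit(texts)
streamed = TinyPipeline()
output = streamed.fit_transform(texts)

print(len(stored), len(streamed))  # 2 0
```

In this sketch both routes give the same strings, but only `fit()` leaves documents behind for indexing.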