### Organizing a data science pipeline for text processing

In this exercise, you will revisit the previous exercises and decompose them into discrete tasks, then describe how those tasks connect into workflows. You will only design the tasks and workflows here, not implement them in code; the actual implementation will be part of your final project.

Import the libraries required by this pipeline application.

In [3]:
# Import libraries

import json, re, os, nltk, string

nltk.download('stopwords')

import pandas as pd

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Anirban\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Text Preprocessing
#### Task 1: Data load

Define a process to load the corpus data in Python.

In [None]:
# Data load

def readfiles(dirdata, infile):
    """
    Read a JSON Lines file (one JSON object per line) into a data frame.
    Args:
        dirdata: directory that contains the data file
        infile: name of the file to read
    Returns:
        pandas DataFrame with one row per JSON record
    """
    full_filename = os.path.join(dirdata, infile)

    data = []
    with open(full_filename, 'r') as fi:
        for line in fi:
            # Skip blank lines so json.loads does not fail on them
            if line.strip():
                data.append(json.loads(line))

    # Create data frame from the JSON records
    return pd.DataFrame(data)

#### Task 2: Preprocessing

Text from the corpus is cleaned and normalized in this step.

In [None]:
# Create cleaning process

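def preprocess_text(text):
    """
    Clean and normalize one document: lowercase it, strip markup, mentions
    and non-letters, drop English stop words, and stem the remaining words.
    """
    ps = PorterStemmer()

    # Create stop words (a set makes the membership test fast)
    sw = set(stopwords.words('english'))

    text = str(text).lower()
    # Replace HTML/XML tags with a space
    text = re.sub(r'</?.*?>', ' ', text)
    # Remove literal "\n" sequences left in the raw text
    text = re.sub(r'\\n', '', text)
    # Remove @mentions
    text = re.sub(r'@\w+', '', text)
    # Replace everything that is not a letter with a space
    text = re.sub('[^a-zA-Z]', ' ', text)
    words = text.split()
    # Remove stop words and stem the remaining words
    words = [ps.stem(word) for word in words if word not in sw]

    return ' '.join(words)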

#### Task 3: Tokenization and Count Vectorization

Break the sentences in each document into separate words, or tokens.
Then count how many times each word occurs in each document using CountVectorizer; each distinct word in the vocabulary becomes one feature.
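
As an illustration of this task, the sketch below runs scikit-learn's `CountVectorizer` on three made-up, already-cleaned documents. The `docs` list is hypothetical and stands in for the preprocessed corpus; the default tokenizer and settings are assumed, not prescribed by the exercise.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical, already-cleaned documents standing in for the corpus.
docs = [
    "good movi great act",
    "bad movi poor act",
    "great plot good end",
]

# fit_transform learns the vocabulary and returns a sparse count matrix:
# one row per document, one column per distinct token.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# vocabulary_ maps each token to its column index in X.
for token, column in sorted(vectorizer.vocabulary_.items()):
    print(column, token)

print(X.toarray())
```

The number of columns in `X` is the number of features passed to the classification and topic modelling tasks.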

### Text Classification
#### Task 4: Train Test Split

Split the data into training and test sets.
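
A minimal sketch of this task using `train_test_split` follows. The `texts` and `labels` lists are placeholders, and the 80/20 ratio, the fixed `random_state`, and the stratified sampling are assumptions chosen for illustration.

```python
from sklearn.model_selection import train_test_split

# Placeholder data: ten documents with alternating binary labels.
texts = [f"document {i}" for i in range(10)]
labels = [0, 1] * 5

# Hold out 20% of the documents for testing. stratify keeps the class
# proportions the same in both splits; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print(len(X_train), "training documents")
print(len(X_test), "test documents")
```

Splitting before fitting the vectorizer keeps test-set vocabulary out of the training features.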

#### Task 5: Binary and multinomial model - application and evaluation

Apply the binary and multinomial models to the data and evaluate accuracy and other model metrics.
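
The sketch below shows one way this task could look, assuming that the binary and multinomial models are the `BernoulliNB` and `MultinomialNB` classifiers imported at the top of this notebook. The review texts and labels are invented for illustration, and accuracy is the only metric computed.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical labelled corpus: 1 = positive review, 0 = negative review.
texts = [
    "great food and friendly staff",
    "loved the pizza great service",
    "excellent coffee and friendly people",
    "amazing burgers would come again",
    "wonderful dinner great atmosphere",
    "best tacos in town loved it",
    "terrible food and rude staff",
    "awful service never again",
    "cold pizza and slow service",
    "bad coffee rude people",
    "horrible dinner dirty tables",
    "worst tacos in town hated it",
]
labels = [1] * 6 + [0] * 6

train_txt, test_txt, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# Fit the vocabulary on the training text only, then reuse it for the test text.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_txt)
X_test = vectorizer.transform(test_txt)

# BernoulliNB binarizes the counts (word present or absent);
# MultinomialNB uses the counts themselves.
models = {"BernoulliNB": BernoulliNB(), "MultinomialNB": MultinomialNB()}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(name, "accuracy:", scores[name])
```

With a corpus this small the accuracy values are not meaningful; the point is the shared fit, predict, and score sequence for both models.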

### Topic Modelling

Create a bag-of-words topic model.
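
The exercise does not name a topic modelling algorithm. As one possible sketch, the code below fits scikit-learn's `LatentDirichletAllocation` to a bag-of-words count matrix. The choice of LDA, the two topics, and the sample documents are all assumptions.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical cleaned documents covering two rough themes.
docs = [
    "pizza pasta cheese sauce",
    "pasta sauce tomato cheese",
    "engine wheel brake tire",
    "tire brake engine oil",
    "cheese pizza tomato",
    "wheel oil engine",
]

# Bag-of-words counts are the input to the topic model.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Fit LDA with an assumed number of topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # one row per document, one column per topic

# List the vocabulary in column order, then show the top words per topic.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
for topic_idx, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[::-1][:3]]
    print("Topic", topic_idx, top_terms)
```

Each row of `doc_topics` is a document's distribution over topics, so it can be used to assign every document to its most likely topic.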

### Workflow creation and execution

Create a workflow that pipelines the tasks.
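
One way to describe how the tasks connect is a dependency graph that is executed in topological order. The sketch below uses `graphlib` from the Python standard library (Python 3.9 or later). The task names and the dependencies between them are assumptions based on the task list above, and each task body is a placeholder that only prints its name.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (assumed dependencies).
workflow = {
    "load_data": set(),
    "preprocess": {"load_data"},
    "vectorize": {"preprocess"},
    "train_test_split": {"vectorize"},
    "classify_and_evaluate": {"train_test_split"},
    "topic_model": {"vectorize"},
}


def run_workflow(graph, task_funcs):
    """Run every task after all of its dependencies; return the execution order."""
    executed = []
    for task in TopologicalSorter(graph).static_order():
        task_funcs[task]()
        executed.append(task)
    return executed


# Placeholder task bodies; the real tasks would call the functions defined earlier.
task_funcs = {
    name: (lambda name=name: print("running", name)) for name in workflow
}

order = run_workflow(workflow, task_funcs)
print(order)
```

Classification and topic modelling both depend only on the vectorized data in this graph, so a workflow engine could run those two branches independently.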