<center>
    <img src="./images/logo.png" width="20%"></img>
</center>
<a id="TOC"></a>

# Data Pre-processing

# Table of Contents
* [Data Pre-processing](#Data-Pre-processing)
	* [Steps to pre-process data](#Steps-to-pre-process-data)
		* [Tokenization](#Tokenization)
		* [Remove Stop words](#Remove-Stop-words)
		* [Stemming](#Stemming)
		* [Word Embedding: Representing Text as Numerical Vectors](#Word-Embedding:-Representing-Text-as-Numerical-Vectors)
* [Split the Dataset into Training and Testing Sets](#Split-the-Dataset-into-Training-and-Testing-Sets)
* [Selecting a Classifier](#Selecting-a-Classifier)
* [Create an Instance of RandomForestClassifier()](#Create-an-Instance-of-RandomForestClassifier%28%29)
* [Fit the Model and Predict the Test Set](#Fit-the-Model-and-Predict-the-Test-Set)
* [Evaluation of Performance](#Evaluation-of-Performance)
* [Bonus material:  lemmatizer exercise](#Bonus-material:--lemmatizer-exercise)


## Steps to pre-process data

Steps 1-3 are the typical steps for cleaning and processing the text. Step 4 turns the cleaned text into features.

1. Tokenize
2. Remove stop words
3. Perform stemming/lemmatization
4. Word embedding

Today, we're going to be working with the text loaded in the following cell for all our pre-processing steps. The Python package pandas is a convenient way to read the data into this notebook.

In [None]:
from pathlib import Path

root = Path(".")
data_path = root / "data" / "preprocess_corpus.txt"

In [None]:
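import pandas as pd

# Read one line of text per row. Recent pandas versions reject sep='\n' in read_csv,
# so read the lines directly and skip blank ones.
lines = [line for line in data_path.read_text().splitlines() if line.strip()]
data = pd.DataFrame({'text': lines})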

In [None]:
data

### Tokenization

**Tokenization**: Segmentation of text into words (a form of feature extraction)
<div align="center">
  <img height="400" width="400" src="./images/tokenize4.jpg">
</div>


Several options for tokenizing are available in the NLTK library. Some tokenizers are more refined than others, and the choice can significantly affect the results of your NLP models, depending on your purpose and the type of data you have.

Two different tokenizers are included in the examples below.

The first example is `word_tokenize` from the `nltk.tokenize` module. It splits a sentence into a list of words, symbols, and numbers.

In [None]:
print(data.iloc[4].text)

In [None]:
#Import word_tokenize
from nltk.tokenize import word_tokenize 

tokens = word_tokenize(data.iloc[4].text) 
print(tokens)

`RegexpTokenizer()` is another useful `nltk` tokenizer. A ***regex*** (or regular expression) is a sequence of characters that defines a search pattern. In this example, the pattern `\w+` matches runs of letters, digits, and underscores, so `RegexpTokenizer()` drops punctuation from the tokens.

In [None]:
#Import the RegexpTokenizer class
from nltk.tokenize import RegexpTokenizer

# Create an instance of RegexpTokenizer with your search pattern. 
tokenizer = RegexpTokenizer(r'\w+')

tokens = tokenizer.tokenize(data.iloc[4].text)

print(tokens)
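
To see what the `\w+` pattern keeps and what it drops, here is a minimal sketch that uses Python's built-in `re` module instead of NLTK. The sample sentence is made up for illustration. For this pattern, `re.findall` should return the same tokens as `RegexpTokenizer(r'\w+')`.

```python
import re

# Hypothetical sample sentence (not from the notebook's dataset)
sample = "Dr. Smith's flight leaves at 9:30, doesn't it?"

# \w+ matches runs of letters, digits, and underscores;
# punctuation and whitespace act as separators and are dropped.
regex_tokens = re.findall(r'\w+', sample)
print(regex_tokens)
# ['Dr', 'Smith', 's', 'flight', 'leaves', 'at', '9', '30', 'doesn', 't', 'it']
```

Note that the pattern also splits contractions and times at the punctuation, which may or may not be what you want for your data.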

<a href="#TOC">Back to top</a>

### Remove Stop words

Stop words are words that carry little information on their own, such as *the*, *is*, and *a*. They are often removed before modeling.
The NLTK library has a list of stop words that can be used to remove such words from your corpus.
Let's look at the English stop words defined in the NLTK library.

In [None]:
from nltk.corpus import stopwords

set(stopwords.words('english'))

**Exercise:** Remove stop words from our list of tokens

In [None]:
#Hint: Use either a for loop or a list comprehension to go through the list of tokens, check whether each token is in the
#list of stop words, and keep it if it is not.

#Your code here:
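
If you get stuck, one possible pattern for this kind of filtering is sketched here on made-up data. The stop word set and the token list are hypothetical stand-ins for `stopwords.words('english')` and the tokenizer output. The sketch lowercases each token before the check, on the assumption that the stop word list is all lowercase.

```python
# Small hand-written stop word set, standing in for stopwords.words('english')
example_stop_words = {'the', 'is', 'a', 'on', 'of'}

# Hypothetical token list, standing in for the output of a tokenizer
example_tokens = ['The', 'cat', 'is', 'on', 'the', 'roof', 'of', 'a', 'house']

# Keep a token only if its lowercase form is not a stop word.
# Using a set makes each membership check fast.
filtered_tokens = [tok for tok in example_tokens if tok.lower() not in example_stop_words]
print(filtered_tokens)
# ['cat', 'roof', 'house']
```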


<a href="#TOC">Back to top</a>

### Stemming

**Stemming**: Reduces words to a root form (the stem), which is not always an actual word.

<div align="center">
  <img height="300" width="300" src="./images/stem2.jpg">
</div>

There are several stemmers available in NLTK, for example `PorterStemmer` and `SnowballStemmer`.

In [None]:
#Import your stemmer of choice
from nltk.stem import PorterStemmer 

# Create an instance of the PorterStemmer()
stemmer = PorterStemmer()  

# Note: A stem need not be a dictionary word or the word's morphological root;
# it is simply a form equal to or shorter than the original word.
for w in tokens_no_sw: 
    print(w, " : ", stemmer.stem(w))
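
As a small illustration of how stemmers can differ, the sketch below runs two NLTK stemmers side by side. The word list is made up for illustration, and neither stemmer needs an extra NLTK data download.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

# Hypothetical sample words (not from the notebook's dataset)
sample_words = ['running', 'studies', 'connection', 'generously']

porter = PorterStemmer()
snowball = SnowballStemmer('english')  # Snowball needs a language name

# Print each word next to the stem produced by each stemmer
for word in sample_words:
    print(f'{word}: porter={porter.stem(word)}, snowball={snowball.stem(word)}')
```

A stem such as `studi` (from `studies`) shows that a stem need not be a dictionary word.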

Note: In some cases, you may need to do more than stemming.  There is a process called **"lemmatization"** that reduces a word to a root in a more sophisticated way.  See the bonus section at the very end of the notebook for examples of a lemmatizer.

Now we will run through all of these steps (tokenization, stop word removal, and stemming) on the dataset we are considering today.

**Exercise:** Tokenize, remove stop words, and stem each line from our data

In [None]:
#load in the twitter dataset.

data = pd.read_csv(...)

**Let's look at what the data looks like**

In [None]:
# Use pandas library to inspect data
data.shape

In [None]:
data.head()

In [None]:
#Create an empty list as our initial corpus.  We will add to it as we pre-process each line.
#Note: this cell takes almost 10 minutes to run with this dataset.  
corpus = []

#loop through each line of data
for sentence in data['tweet']:
    
    #1. Tokenize
    tokens = ...
    
    #2. remove stop words
    tokens_no_sw = ...
    
    #3. stem the remaining words
    stem_words = ...
    
    
    #Join the words back together as one string and append this string to your corpus.
    corpus.append(' '.join(stem_words))
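
The three steps can also be wrapped in a single helper function. The sketch below is one possible arrangement, not the reference solution. The function name `preprocess` and the tiny stop word set are hypothetical, and the sentence is made up.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

# Small hand-written stop word set, standing in for stopwords.words('english')
example_stop_words = {'the', 'are', 'in'}

# Create the tokenizer and stemmer once, outside the function
example_tokenizer = RegexpTokenizer(r'\w+')
example_stemmer = PorterStemmer()

def preprocess(sentence):
    """Tokenize, remove stop words, and stem one sentence (hypothetical helper)."""
    tokens = example_tokenizer.tokenize(sentence.lower())
    tokens_no_sw = [tok for tok in tokens if tok not in example_stop_words]
    stem_words = [example_stemmer.stem(tok) for tok in tokens_no_sw]
    return ' '.join(stem_words)

print(preprocess('The cats are running in the gardens.'))
# cat run garden
```

Creating the tokenizer and stemmer once avoids rebuilding them for every line, which matters when the loop runs over many rows.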

<a href="#TOC">Back to top</a>

Because the above cell takes a while to complete, I saved the resulting corpus in a pickle file for this session.

In [None]:
import pickle
with open('./data/corpus.txt', 'rb') as fp:
    corpus = pickle.load(fp)

In [None]:
corpus[0:2]
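
For reference, a list like this can be written with `pickle.dump` and read back with `pickle.load`. The sketch below is an assumption about how such a file might be produced, not the code that created `corpus.txt`. The two-line corpus and the file name are made up.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical two-line corpus, standing in for the real pre-processed corpus
example_corpus = ['cat run garden', 'dog sleep hous']

# Write to a temporary directory so nothing in ./data is touched
with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = Path(tmp_dir) / 'example_corpus.pkl'

    # Pickle files are binary, so open with 'wb' to write and 'rb' to read
    with open(pickle_path, 'wb') as fp:
        pickle.dump(example_corpus, fp)

    with open(pickle_path, 'rb') as fp:
        restored_corpus = pickle.load(fp)

print(restored_corpus == example_corpus)
# True
```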

### Word Embedding: Representing Text as Numerical Vectors

+ We first need to convert the text into numbers that the learning algorithm can process.
+ To do this, we will use `CountVectorizer` from the `scikit-learn` library. It builds a vocabulary from the corpus and represents each document as a vector of word counts (a bag-of-words representation), so each word becomes a feature.
+ `CountVectorizer` can also lowercase and tokenize the data, but it is good practice to know how to do such preprocessing yourself, as we have done above.

<div align="center">
      <img height="350" width="350" src="./images/one_hot2.jpg">
</div>  

In [None]:
#import the CountVectorizer class
from sklearn.feature_extraction.text import CountVectorizer

# Create a vectorizer instance
vectorizer = ...


We have already pre-processed our data and created a corpus to insert into our CountVectorizer...

In [None]:
print(corpus[0:2])

In [None]:
#the CountVectorizer is expecting a list of strings as a corpus

# The function fit_transform() is used for dataset transformations in scikit-learn. 
X = vectorizer.fit_transform(corpus).toarray()
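# In this instance, the target labels are in the column at position 2 (the third column) of data
y = data.iloc[:,2].values

#extract the feature names (words) to view in our dataframe
#(get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
labels = vectorizer.get_feature_names_out()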

#create a pandas dataframe with the columns being our words (or features)
df = pd.DataFrame(data=X, columns=labels)
df

<a href="#TOC">Back to top</a>

### Split the Dataset into Training and Testing Sets

+ To be able to test the accuracy of Machine Learning models, we need to have a set of data that our model has not seen during training. 
+ To achieve this, we will use the function `train_test_split()` from `sklearn.model_selection` to split the dataset into `train` and `test` sets. 
+ It is common practice to take only 20% of the total data as the test set. However, depending on the nature of your data, you can play with the ratios to see if a better performance can be observed.

<img src="./images/split.png">

In [None]:
from sklearn.model_selection import train_test_split

# X_test and y_test contain 20% of our data, which we reserve for testing.
X_train, X_test, y_train, y_test = ...
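
The effect of an 80/20 split can be checked on a small made-up array. In this sketch the data and the `random_state` value are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each, and binary labels
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# test_size=0.2 reserves 20% of the rows for testing;
# random_state makes the shuffle reproducible
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0
)

print(toy_X_train.shape, toy_X_test.shape)
# (8, 2) (2, 2)
```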

<a href="#TOC">Back to top</a>

### Selecting a Classifier 

+ We will use `RandomForestClassifier()` from the `scikit-learn` library as our classifier for this exercise.
+ Please note that `scikit-learn` provides a number of other classifiers that you can use for classification problems, such as:
     + `AdaBoostClassifier`
     + `GaussianProcessClassifier`

### Create an Instance of RandomForestClassifier() 

`RandomForestClassifier()` accepts many parameters. For our purposes, we will specify only a few:
+ `n_estimators` is the number of trees in the forest.
+ `criterion` is the function used to measure the quality of a split. We can select `gini` or `entropy` for the model.
+ `random_state` controls both the randomness of the bootstrapping of the samples used when building trees and the sampling of the features considered when looking for the best split at each node.

In [None]:
from sklearn.ensemble import RandomForestClassifier

model = ...

### Fit the Model and Predict the Test Set

+ In `scikit-learn`, an estimator for classification is a Python object that implements the methods `fit(X_train, y_train)` and `predict(X_test)`
+ We call the methods `fit(X_train, y_train)` and `predict(X_test)` on our `model`

In [None]:
import sklearn

# fit() trains the model: it builds the trees of the forest from the training data.
model = ...

# model.predict() : given a trained model, predict the label of a new set of data.
y_pred = ...
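
The create, fit, and predict sequence can be tried end to end on synthetic data. The sketch below is only an illustration. The data comes from `make_classification` rather than the tweet corpus, and the parameter values are arbitrary choices, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data, standing in for the word-count features
toy_X, toy_y = make_classification(n_samples=100, n_features=8, random_state=0)
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0
)

# The parameter values here are illustrative choices, not tuned settings
toy_model = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)

# fit() returns the fitted estimator itself, so the result can be reassigned
toy_model = toy_model.fit(toy_X_train, toy_y_train)

# predict() returns one label per row of the test set
toy_y_pred = toy_model.predict(toy_X_test)
print(toy_y_pred.shape)
# (20,)
```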

### Evaluation of Performance

+ **Accuracy** is the total number of correct predictions divided by the total number of predictions. 
+ This metric might not always be a very good indicator of performance, particularly for imbalanced datasets.

In [None]:
from sklearn.metrics import accuracy_score

print('Accuracy Score:', accuracy_score(y_test, y_pred))
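
To see why accuracy can mislead on imbalanced data, consider a made-up case worked out in plain Python. The labels are hypothetical, and the "model" simply predicts the majority class every time.

```python
# Hypothetical imbalanced labels: 95 samples of class 0 and 5 of class 1
true_labels = [0] * 95 + [1] * 5

# A "model" that ignores its input and always predicts the majority class
predicted_labels = [0] * 100

# Accuracy = correct predictions / total predictions
correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
accuracy = correct / len(true_labels)
print(accuracy)
# 0.95

# Recall for class 1 = correctly predicted 1s / actual 1s
true_positives = sum(t == 1 and p == 1 for t, p in zip(true_labels, predicted_labels))
recall_class_1 = true_positives / sum(t == 1 for t in true_labels)
print(recall_class_1)
# 0.0
```

The accuracy looks high even though not a single sample of the minority class is found.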

<a href="#TOC">Back to top</a>

# Bonus material:  lemmatizer exercise

If you are interested in exploring a lemmatizer, see the following examples.

**Lemmatization**: A more sophisticated approach to reducing words to their root (the lemma), here using the WordNet vocabulary. Compare the results below with those of the stemmer.

<div align="center">
    <img height="300" width="300" src="./images/lemma.jpg">
</div>

In [None]:
import nltk
nltk.download('wordnet')

In [None]:
from nltk.stem import WordNetLemmatizer 
  
lemmatizer = WordNetLemmatizer() 

In [None]:
words = ['trespassed', 'argued', 'languages', 'rocks', 'radii', 'punctuate', "car's", 'ran', 'distanced', 'spoke']

In [None]:
for word in words:
    print(f'{word}: {lemmatizer.lemmatize(word)}')

***Notice that some of these words were not reduced to a root word. This is because they are verbs, and the lemmatizer treats every word as a noun by default. You can specify `pos='v'` to get the root of a verb as well.***


In [None]:
lemmatizer.lemmatize('spoke', pos='v')

In [None]:
lemmatizer.lemmatize('ran', pos='v')

In [None]:
lemmatizer.lemmatize('distanced', pos='v')

<a href="#TOC">Back to top</a>

<center>
    <img src="./images/logo.png" width="25%"></img>
</center>
Copyright Quansight LLC 2018-2020