<center>
    <img src="./images/logo.png" width="20%"></img>
</center>
<a id="TOC"></a>

# Data pre-processing

This notebook walks through the same pre-processing steps that we did with NLTK, but now using Spark NLP.  Before getting started, if you are running this notebook in Google Colab, the following 2 cells need to be executed to run the rest of the notebook.

In [None]:
#install the appropriate packages
import os

# Install java
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! java -version

# Install pyspark
! pip install --ignore-installed pyspark==2.4.4

# Install Spark NLP
! pip install --ignore-installed spark-nlp==2.5.0

In [None]:
#give permission to colab to access files from your google drive.  This permission lasts for just this Colab session.
#You will need to click on the link that pops up, give Colab permission, and copy and paste the string into the box
#that pops up down below.

from google.colab import drive
drive.mount("content")

In [None]:
#import sparknlp and confirm all is installed properly with the following commands

import sparknlp
# Start Spark Session with Spark NLP
spark = sparknlp.start()

#print both values explicitly; a notebook cell only displays its last bare expression
print("Spark NLP version:", sparknlp.version())
print("Apache Spark version:", spark.version)

## Steps to pre-process data

Steps 1-3 are some typical steps taken to clean and process the data to prepare our features (step 4).

1. Tokenize
2. Remove stop words
3. Perform stemming/lemmatization
4. Word embedding

Today, we're going to work with the text loaded in the following cell for all our pre-processing steps.  The Python package pandas is a convenient way to read in the data and use it in this notebook.

In [2]:
#this root is set assuming you are using Google Colab.  If you are not, you can set it to root = './data/'
root = '/content/content/My Drive/Colab Notebooks/data/'

data_path = f'{root}preprocess_corpus.txt'

In [None]:
import pandas as pd

In [None]:
data = pd.read_csv(data_path, sep='\n', header = None, names=['text'])

### Data Prep before pre-processing

Before we can do any of our pre-processing steps with Spark NLP, we must put our data into the proper format.  This involves casting our pandas dataframe to a Spark dataframe, and then creating a document object out of each text in our dataset.

In [None]:
data.head()

In [None]:
#Cast our pandas dataframe into a spark dataframe
spark_df = spark.createDataFrame(data.astype(str))

In [None]:
spark_df.show()

In [None]:
#Use the DocumentAssembler to create document objects out of the texts in our dataset
from sparknlp.base import DocumentAssembler

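#1. Instantiate DocumentAssembler
documentAssembler = DocumentAssembler()
#2. Specify input column
documentAssembler.setInputCol('text')
#3. Specify output column
documentAssembler.setOutputCol('document')
#4. Transform the spark dataframe
doc_df = documentAssembler.transform(spark_df)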




Now we have a dataframe that is ready to pre-process!

In [None]:
doc_df.show()
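
Each entry in the `document` column is an annotation: a small record holding the text together with its character offsets and some metadata. The sketch below mimics that shape in plain Python, so it is easier to see what selections such as `token.result` pull out later on. The `Annotation` class and the `to_document` helper are hypothetical stand-ins written for illustration, not Spark NLP classes, and the inclusive `end` offset is an assumption of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """Simplified stand-in for one annotation record (illustration only)."""
    annotator_type: str
    begin: int   # offset of the first character
    end: int     # offset of the last character (inclusive in this sketch)
    result: str  # the annotated text itself
    metadata: dict = field(default_factory=dict)


def to_document(text):
    """Wrap a raw string in a list holding a single 'document' annotation."""
    return [Annotation("document", 0, len(text) - 1, text)]


doc = to_document("Spark NLP")
print(doc[0].annotator_type, doc[0].begin, doc[0].end, doc[0].result)
```

Every later step (tokens, stems, embeddings) adds a new column of records shaped like this, which is why the cells below all select a `.result` field.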

### Tokenization

**Tokenization**: Segmentation of text into words (a form of feature extraction)
<div align="center">
  <img height="400" width="400" src="./images/tokenize4.jpg">
</div>


Spark NLP provides a single `Tokenizer` annotator that is very convenient to use.  Its parameters control how tokenization is done, and they are all set on that one annotator.

First, we will simply tokenize our text by splitting up the sentence into a list of words, symbols and numbers.  

In [None]:
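#Instantiate the tokenizer, set its columns, and tokenize the documents.

from sparknlp.annotator import Tokenizer

tokenizer = Tokenizer()
#set columns
tokenizer.setInputCols(['document'])
tokenizer.setOutputCol('token')
#fit and transform
token_df = tokenizer.fit(doc_df).transform(doc_df)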


In [None]:
token_df.select('token.result').take(1)

Just as we saw last week with NLTK's `RegexpTokenizer()`, the Spark NLP tokenizer can also tokenize with regular expressions.  The following example shows how to exclude punctuation from our tokens.

In [None]:
#Use the same tokenizer instance, but now give it a target pattern that each token must match.
tokenizer.setTargetPattern(r'\w+')
token_df = tokenizer.fit(doc_df).transform(doc_df)

In [None]:
token_df.select('token.result').take(1)
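
The difference between the two tokenization modes can be illustrated with Python's built-in `re` module. This is a plain regex sketch on a made-up sentence, not a reproduction of Spark NLP's tokenization rules, so the exact tokens the library produces may differ.

```python
import re

text = "Hello, world! It's 2020."

# Words, numbers and individual punctuation marks all become tokens
with_punctuation = re.findall(r"\w+|[^\w\s]", text)

# Only runs of word characters are kept, so punctuation disappears
words_only = re.findall(r"\w+", text)

print(with_punctuation)
print(words_only)
```

Note that the pattern describes what a token looks like rather than what to split on, which is why `\w+` is enough to leave punctuation out.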

### Remove Stop words

Stop word removal drops words that carry little information, such as: the, is, a, etc.
Spark NLP's `StopWordsCleaner` falls back to a default English list when none is supplied, but here we provide our own.  I have saved off the list that is available in the NLTK library to use in this example.

In [5]:
import pickle
with open(f'{root}stopwords.txt', 'rb') as file:
    stopwords = pickle.load(file)

In [None]:
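#Create a StopWordsCleaner instance and remove stop words from the tokens.
from sparknlp.annotator import StopWordsCleaner

stop_words_cleaner = StopWordsCleaner()
stop_words_cleaner.setInputCols(['token'])
stop_words_cleaner.setOutputCol('cleanTokens')
stop_words_cleaner.setStopWords(list(stopwords))
clean_token_df = stop_words_cleaner.transform(token_df)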

In [None]:
clean_token_df.select('cleanTokens.result').take(1)
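
Conceptually, stop word removal is a filter over the token list. The sketch below shows the idea in plain Python; `remove_stop_words`, its `case_sensitive` flag, and the three-word stop list are made up for illustration and are much smaller than the NLTK list loaded above.

```python
def remove_stop_words(tokens, stop_words, case_sensitive=False):
    """Return the tokens that do not appear in stop_words."""
    if case_sensitive:
        stop_set = set(stop_words)
        return [t for t in tokens if t not in stop_set]
    # Compare lower-cased forms so that "The" matches "the"
    stop_set = {w.lower() for w in stop_words}
    return [t for t in tokens if t.lower() not in stop_set]


tokens = ["The", "virus", "is", "a", "global", "threat"]
sample_stop_words = ["the", "is", "a"]

print(remove_stop_words(tokens, sample_stop_words))
print(remove_stop_words(tokens, sample_stop_words, case_sensitive=True))
```

With case-sensitive matching, the capitalized "The" survives because only the lower-case form is in the list, which is worth checking whenever a stop word list seems to miss obvious words.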

### Stemming

**Stemming**: Reduces words to their root, but the root might not always result in an actual word.

<div align="center">
  <img height="300" width="300" src="./images/stem2.jpg">
</div>


Spark NLP provides a single `Stemmer` annotator, so we only need to set its input and output columns.

In [None]:
from sparknlp.annotator import Stemmer

stemmer = Stemmer()
stemmer.setInputCols(["cleanTokens"]) 
stemmer.setOutputCol("stem")

stem_df=stemmer.transform(clean_token_df)

In [None]:
stem_df.select('stem.result').take(1)
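
To see why a stem is not always an actual word, here is a deliberately naive suffix stripper. It is a toy written for this illustration, with a hypothetical name and a made-up suffix list; it is not the algorithm Spark NLP's `Stemmer` uses.

```python
def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in ("ing", "ies", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["running", "studies", "jumped", "cats", "is"]
print([toy_stem(w) for w in words])
```

"jumped" and "cats" reduce to real words, while "running" and "studies" become "runn" and "stud". Real stemmers use more careful rules, but they share this trade-off of speed over linguistic correctness.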

Spark NLP also has a lemmatizer available if desired.  

In addition, there is an annotator called `Normalizer` which does several things at once. Per the Spark NLP documentation, a normalizer 'removes all dirty characters from text following a regex pattern and transforms words based on a provided dictionary.'  In other words, it can remove punctuation and reduce words to a root based on a dictionary provided by the user.
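
The two jobs described in that quote, cleaning characters with a regex and replacing words from a dictionary, can be sketched in plain Python. The `normalize` function, its parameter names, and the sample dictionary below are hypothetical and only illustrate the idea; they are not the Spark NLP `Normalizer` API.

```python
import re


def normalize(tokens, cleanup_pattern=r"[^A-Za-z]", dictionary=None):
    """Remove unwanted characters, then map words through a dictionary."""
    dictionary = dictionary or {}
    cleaned = []
    for token in tokens:
        token = re.sub(cleanup_pattern, "", token)  # drop "dirty" characters
        if not token:
            continue  # nothing left, e.g. a token that was only symbols
        cleaned.append(dictionary.get(token.lower(), token))
    return cleaned


tokens = ["Stay", "safe!!", "viruses", "#2020"]
print(normalize(tokens, dictionary={"viruses": "virus"}))
```

Here "safe!!" loses its punctuation, "#2020" disappears entirely because no letters remain, and "viruses" is replaced by the dictionary entry.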

### Word Embedding: Representing Text as Numerical Vectors

+ We first need to represent texts as numbers that the learning algorithm can process. 
+ To represent each word in the dataset, we will use the pre-trained `WordEmbeddingsModel` from the Spark NLP library. 

Word embedding converts each word in a text into a numerical vector.  There are several different methods to do this.  The `WordEmbeddingsModel` is a pre-trained model based on the [GloVe](https://nlp.stanford.edu/projects/glove/) algorithm.  


<div align="center">
      <img height="350" width="350" src="./images/one_hot2.jpg">
</div>  

In [None]:
from sparknlp.annotator import WordEmbeddingsModel

word_embeddings=WordEmbeddingsModel.pretrained()
word_embeddings.setInputCols(['document','stem'])
word_embeddings.setOutputCol('embeddings')

embeddings_df=word_embeddings.transform(stem_df)

In [None]:
embeddings_df.select('embeddings.embeddings').take(1)
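
The benefit of dense embeddings over one-hot vectors shows up when we compare words. In the sketch below, the three-dimensional vectors are made up for illustration (real GloVe vectors have many more dimensions and are learned from data), and `one_hot` and `cosine` are small helper functions written for this example.

```python
import math

vocabulary = ["virus", "disease", "banana"]

# Made-up dense vectors, standing in for pre-trained embeddings
toy_vectors = {
    "virus": [0.9, 0.1, 0.3],
    "disease": [0.8, 0.2, 0.4],
    "banana": [0.1, 0.9, 0.0],
}


def one_hot(word, vocab):
    """Vector with a single 1.0 at the word's position in the vocabulary."""
    return [1.0 if word == v else 0.0 for v in vocab]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# One-hot vectors of different words never overlap
print(cosine(one_hot("virus", vocabulary), one_hot("disease", vocabulary)))

# Dense vectors can place related words close together
print(cosine(toy_vectors["virus"], toy_vectors["disease"]))
print(cosine(toy_vectors["virus"], toy_vectors["banana"]))
```

With one-hot encoding every pair of distinct words has similarity 0, whereas the dense vectors score "virus" and "disease" as far more similar than "virus" and "banana".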

# Using a Pipeline

Now, let's process our actual dataset. We can streamline the process further by setting up a pipeline.  A pipeline is a sequence of stages that the user defines in a specific order.  It takes in the dataframe and runs every stage, in that order, efficiently.
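
Conceptually, a pipeline is an ordered list of stages where the output of one stage becomes the input of the next. The plain-Python sketch below shows only that chaining idea; the stage functions, the tiny stop word set, and `run_pipeline` are all made up for illustration, and a real Spark ML `Pipeline` additionally separates fitting from transforming.

```python
import re

SAMPLE_STOP_WORDS = {"the", "is", "a"}  # tiny made-up list


def tokenize(text):
    """Lower-case the text and keep runs of word characters."""
    return re.findall(r"\w+", text.lower())


def drop_stop_words(tokens):
    """Filter out tokens found in the sample stop word set."""
    return [t for t in tokens if t not in SAMPLE_STOP_WORDS]


def strip_plural(tokens):
    """Naive stemming stand-in: drop a trailing 's' from longer tokens."""
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]


def run_pipeline(data, stages):
    """Feed the data through each stage in the order given."""
    for stage in stages:
        data = stage(data)
    return data


stages = [tokenize, drop_stop_words, strip_plural]
print(run_pipeline("The virus is spreading in cities", stages))
```

The order of the list matters: the stop word filter and the stemming stand-in both expect a list of tokens, so the tokenizer has to run first.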

In [None]:
#load in the covid19 tweet dataset. 
data_path = f"{root}covid19_tweets.csv"
df = pd.read_csv(data_path)

df = df.rename(columns={'tweet': 'text'})

#we need to cast this pandas dataframe as a spark dataframe.
spark_df = spark.createDataFrame(df.astype(str))

spark_df.show()

In [None]:
#import the Pipeline function
from pyspark.ml import Pipeline

#Every other instance needed for pre-processing has already been imported and set up.
#documentAssembler
#tokenizer
#stop_words_cleaner
#stemmer
#word_embeddings

In [None]:
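#set up the pipeline: the stages run in the order listed
nlpPipeline = Pipeline(stages=[documentAssembler, tokenizer, stop_words_cleaner, stemmer, word_embeddings])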


In [None]:
#create the model specified by the pipeline, feed the Spark dataframe into it.
pipelineModel = nlpPipeline.fit(spark_df)

In [None]:
#obtain results of the pipeline
result = pipelineModel.transform(spark_df)
result.show()

Notice how fast this was to process! Keep in mind that Spark evaluates lazily, so `result.show()` only computes the rows it displays.

<center>
    <img src="./images/logo.png" width="25%"></img>
</center>
Copyright Quansight LLC 2018-2020