# Sentiment Analysis

This is a brief analysis of the [Rotten Tomatoes dataset](https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews) using PySpark. PySpark's machine learning pipeline is similar to scikit-learn's, and the purpose of this project was to get familiar with Spark and natural language processing.

PySpark is built around DataFrames. They are similar to Pandas DataFrames, although the Pandas versions seem more feature-rich, so we'll use Pandas as well for reading and writing the data files.

In [1]:
import pandas as pd
import numpy as np

Getting PySpark up and running can be a bit challenging.  On macOS High Sierra, environment variables are a pain, and no amount of `export` in the Terminal let Spark run correctly in Jupyter.  Because of that, I have to set the environment variables manually.

In [2]:
import os
os.environ['JAVA_HOME'] = '/Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/'
os.environ['PYSPARK_SUBMIT_ARGS'] = '--master local[4] pyspark-shell'
import findspark
findspark.init()
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
# Note: the explicit 'local' master here takes precedence over 'local[4]' above
sc = SparkContext('local')
spark = SparkSession(sc)

The data files are in tab-separated values (TSV) format. Loading them with Pandas and then converting to Spark DataFrames is not efficient, but it is simple.

In [3]:
train = pd.read_csv('train.tsv', delimiter='\t')
test = pd.read_csv('test.tsv', delimiter='\t')
train_sdf = spark.createDataFrame(train)
test_sdf = spark.createDataFrame(test)

We divide the training data into a training set and a validation set.

In [4]:
(train_set, val_set) = train_sdf.randomSplit([0.90, 0.10], seed = 2000)
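
To show why the split sizes come out only approximately 90/10, here is a minimal pure-Python sketch of a weighted random split. It is an illustration of the idea, not Spark's implementation: the helper name `random_split` and the toy rows are hypothetical, and the only behaviour assumed from Spark is that the weights are normalized and rows are assigned at random.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split by drawing a uniform number per row."""
    total = sum(weights)
    # Cumulative upper bounds of each split's slice of [0, 1)
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total  # normalize so the weights need not sum to 1
        bounds.append(running)

    rng = random.Random(seed)  # fixed seed makes the split reproducible
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any floating-point remainder
            if draw < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

# Toy data: 1000 hypothetical row ids
train_part, val_part = random_split(list(range(1000)), [0.90, 0.10], seed=2000)
print(len(train_part), len(val_part))
```

Because every row is assigned independently, the two sizes vary around 900 and 100 rather than hitting them exactly, while the fixed seed keeps the result repeatable.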

In [5]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import CountVectorizer, IDF, IndexToString, StringIndexer, Tokenizer

We will create a pipeline through which our data passes:

phrases -> tokenizer -> countvectorizer -> idf -> stringindexer

In [6]:
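# Lowercase each phrase and split it into words
tokenizer = Tokenizer(inputCol="Phrase", outputCol="words")
# Count term occurrences, keeping only terms that appear in at least 12 phrases
cv = CountVectorizer(vocabSize=2**18, inputCol="words", outputCol="cv", minDF=12)
# Rescale the counts by inverse document frequency
idf = IDF(inputCol="cv", outputCol="features", minDocFreq=12)
# Encode the Sentiment column as numeric label indices
label_stringIdx = StringIndexer(inputCol="Sentiment", outputCol="label")
pipeline = Pipeline(stages=[tokenizer, cv, idf, label_stringIdx])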
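
To make the feature stages concrete, here is a minimal pure-Python sketch of what tokenizing, counting and IDF weighting compute. It is a simplified illustration rather than Spark's code. The helper names `fit_idf` and `tf_idf` and the toy phrases are hypothetical, a single `min_df` threshold stands in for both `minDF` and `minDocFreq`, and the IDF formula `log((n + 1) / (df + 1))` is the smoothed form documented for Spark MLlib.

```python
import math
from collections import Counter

def fit_idf(phrases, min_df=1):
    """Learn IDF weights for every term seen in at least min_df phrases."""
    # Approximates the tokenizer: lowercase, then split on whitespace
    tokenized = [p.lower().split() for p in phrases]
    # Document frequency: number of phrases that contain each term
    doc_freq = Counter(term for words in tokenized for term in set(words))
    n_docs = len(tokenized)
    return {
        term: math.log((n_docs + 1) / (df + 1))
        for term, df in doc_freq.items()
        if df >= min_df
    }

def tf_idf(phrase, idf):
    """Weight each in-vocabulary term count of one phrase by its IDF."""
    counts = Counter(phrase.lower().split())
    return {term: count * idf[term] for term, count in counts.items() if term in idf}

# Toy corpus, for illustration only
toy_phrases = ["A series", "A", "series of escapades"]
idf_weights = fit_idf(toy_phrases)
features = tf_idf("A series of", idf_weights)
print(features)
```

Frequent terms such as `a` get a small weight and rare terms a larger one, so rare words have more influence on the feature vector than common ones.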

In [7]:
pipelineFit = pipeline.fit(train_set)
train_df = pipelineFit.transform(train_set)
val_df = pipelineFit.transform(val_set)
train_df.show(10)

+--------+----------+--------------------+---------+--------------------+--------------------+--------------------+-----+
|PhraseId|SentenceId|              Phrase|Sentiment|               words|                  cv|            features|label|
+--------+----------+--------------------+---------+--------------------+--------------------+--------------------+-----+
|       1|         1|A series of escap...|        1|[a, series, of, e...|(8796,[0,1,2,3,5,...|(8796,[0,1,2,3,5,...|  2.0|
|       2|         1|A series of escap...|        2|[a, series, of, e...|(8796,[0,2,3,9,10...|(8796,[0,2,3,9,10...|  0.0|
|       3|         1|            A series|        2|         [a, series]|(8796,[2,325],[1....|(8796,[2,325],[1....|  0.0|
|       4|         1|                   A|        2|                 [a]|    (8796,[2],[1.0])|(8796,[2],[1.6136...|  0.0|
|       5|         1|              series|        2|            [series]|  (8796,[325],[1.0])|(8796,[325],[6.08...|  0.0|
|       6|         1|of 

We use logistic regression to perform the classification. 

In [8]:
lr = LogisticRegression(maxIter=100)
lrModel = lr.fit(train_df)
predictions = lrModel.transform(val_df)
evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")
evaluator.evaluate(predictions)

0.6440515087449549

Judging by the Kaggle leaderboard, a validation accuracy of about 0.64 is actually pretty good.

The next step is to map the label indices created by StringIndexer back to the actual Sentiment values for the model's predictions on the test set.

In [9]:
pred_df = pipelineFit.transform(test_sdf)
test_pred = lrModel.transform(pred_df)

In [10]:
converter = IndexToString(inputCol="prediction", outputCol="Sentiment", labels=['2.0', '3.0', '1.0', '4.0', '0.0'])
converted = converter.transform(test_pred)

I expected that IndexToString would use metadata to automatically infer the correct Sentiment for each label index, but I found that I needed to specify the labels myself (in the right order!). Surely there is a better way to do this.
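
The order matters because StringIndexer, by default, assigns index 0 to the most frequent value, index 1 to the next most frequent, and so on. Here is a minimal pure-Python sketch of that mapping and its inverse. The helper names and toy data are hypothetical, and the alphabetical tie-breaking is an assumption of this sketch.

```python
from collections import Counter

def fit_string_indexer(values):
    """Return label strings ordered by descending frequency.

    Ties are broken alphabetically here; that is an assumption of this sketch.
    """
    counts = Counter(str(v) for v in values)
    return sorted(counts, key=lambda label: (-counts[label], label))

def index_to_string(indices, labels):
    """Map numeric label indices back to the original label strings."""
    return [labels[int(i)] for i in indices]

# Toy sentiments, for illustration only: '2' is most frequent, then '3', then '1'
toy_sentiments = [2, 2, 2, 3, 3, 1]
labels = fit_string_indexer(toy_sentiments)
label_to_index = {label: float(i) for i, label in enumerate(labels)}
recovered = index_to_string([0.0, 2.0, 1.0], labels)
print(labels, label_to_index, recovered)
```

In the notebook itself, the fitted StringIndexer stage exposes its ordering through a `labels` attribute, so passing something like `pipelineFit.stages[-1].labels` to IndexToString may avoid hard-coding the list. This assumes the last pipeline stage is the fitted indexer and is untested against this dataset.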

As the final step, I convert the Spark DataFrame to a Pandas DataFrame to make writing the CSV easier. `toPandas()` collects all of the data onto the driver, which can be slow, so we convert only the two columns that we need.

In [11]:
pred_pdf = converted[['PhraseId','Sentiment']].toPandas()

Pandas complained when I tried to convert `Sentiment` directly to integer type, because the values are strings such as `'2.0'`, so I had to do it in two steps: first to float, then to integer.

In [12]:
pred_pdf['Sentiment'] = pred_pdf['Sentiment'].astype(float)
pred_pdf['Sentiment'] = pred_pdf['Sentiment'].astype(int)
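
The reason for the two steps can be reproduced in plain Python, on the assumption that Pandas hits the same parsing limitation: a string like `'2.0'` is not a valid integer literal, but it is a valid float. The helper name `to_int_label` is hypothetical.

```python
def to_int_label(text):
    """Parse a label such as '2.0' or '2' into the integer 2."""
    try:
        return int(text)         # works for '2', raises ValueError for '2.0'
    except ValueError:
        return int(float(text))  # parse as float first, then convert to int

parsed = [to_int_label(s) for s in ['2.0', '3.0', '1.0', '4.0', '0.0']]
print(parsed)
```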

In [13]:
pred_pdf.to_csv('prediction051118-2.csv',index=False)