## NLP & Binary Classification: Yelp Reviews
https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences#

**Dataset Information:**

1,000 sentences from yelp.com, each labelled with positive or negative sentiment.

**Attribute Information: (1 feature and 1 class)**

- Sentence: the review text
- Score: 1 (positive) or 0 (negative)

**Objective of this project**

Predict the sentiment (positive or negative) of each sentence.

## Data

In [1]:
import findspark
findspark.init('/home/danny/spark-2.2.1-bin-hadoop2.7')

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('yelp').getOrCreate()

In [2]:
# Load Data
df = spark.read.csv('yelp_labelled.txt',
                    inferSchema=True,sep='\t')
# Inspect Data
df.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: integer (nullable = true)



In [3]:
df = df.withColumnRenamed('_c0','text').withColumnRenamed('_c1','label')

In [4]:
df.show(5)

+--------------------+-----+
|                text|label|
+--------------------+-----+
|Wow... Loved this...|    1|
|  Crust is not good.|    0|
|Not tasty and the...|    0|
|Stopped by during...|    1|
|The selection on ...|    1|
+--------------------+-----+
only showing top 5 rows



In [5]:
df.head()

Row(text='Wow... Loved this place.', label=1)

In [6]:
df.describe().show()

+-------+--------------------+------------------+
|summary|                text|             label|
+-------+--------------------+------------------+
|  count|                1000|              1000|
|   mean|                null|               0.5|
| stddev|                null|0.5002501876563867|
|    min|!....THE OWNERS R...|                 0|
|    max|you can watch the...|                 1|
+-------+--------------------+------------------+



In [7]:
df.groupby('label').count().show()

+-----+-----+
|label|count|
+-----+-----+
|    1|  500|
|    0|  500|
+-----+-----+



## Data preprocessing

In [8]:
from pyspark.sql.functions import length
from pyspark.ml.feature import (Tokenizer, StopWordsRemover,
                                CountVectorizer, IDF, VectorAssembler)
from pyspark.ml import Pipeline

**Pipeline for data prep**

In [9]:
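# create a new length feature: character count of each sentence (feature engineering)
df = df.withColumn('length', length(df['text']))
df.show(5)
# average of the numeric columns per class, to compare sentence lengths
df.groupby('label').mean().show()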

+--------------------+-----+------+
|                text|label|length|
+--------------------+-----+------+
|Wow... Loved this...|    1|    24|
|  Crust is not good.|    0|    18|
|Not tasty and the...|    0|    41|
|Stopped by during...|    1|    87|
|The selection on ...|    1|    59|
+--------------------+-----+------+
only showing top 5 rows

+-----+----------+-----------+
|label|avg(label)|avg(length)|
+-----+----------+-----------+
|    1|       1.0|     55.882|
|    0|       0.0|      60.75|
+-----+----------+-----------+



In [10]:
# create bag-of-words model (feature transformations)
tokenizer = Tokenizer(inputCol="text", outputCol="token_text")
stopremove = StopWordsRemover(inputCol='token_text',outputCol='stop_tokens')
count_vec = CountVectorizer(inputCol='stop_tokens',outputCol='c_vec')
idf = IDF(inputCol="c_vec", outputCol="tf_idf")
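
To make the four stages above concrete, here is a minimal pure-Python sketch of what they compute. It is not Spark's implementation: `STOP_WORDS` is a tiny hypothetical subset of a real stop-word list, the third sentence is made up for illustration, and the helper names are invented here. The IDF formula follows the smoothed form given in Spark's documentation, `log((m + 1) / (df + 1))`, where `m` is the number of documents and `df` is the number of documents containing the term.

```python
import math
from collections import Counter

# Hypothetical stop-word subset, for illustration only.
STOP_WORDS = {"is", "the", "this", "and"}

def tokenize(text):
    # Like Tokenizer: lowercase, then split on whitespace (punctuation stays attached).
    return text.lower().split()

def remove_stop_words(tokens):
    # Like StopWordsRemover: drop tokens found in the stop-word list.
    return [t for t in tokens if t not in STOP_WORDS]

def fit_idf(docs):
    # Smoothed inverse document frequency: log((m + 1) / (df + 1)).
    m = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((m + 1) / (df + 1)) for term, df in doc_freq.items()}

def tf_idf(doc, idf):
    # Raw term counts (the CountVectorizer step) scaled by the IDF weights.
    counts = Counter(doc)
    return {term: count * idf[term] for term, count in counts.items()}

sentences = ["Wow... Loved this place.", "Crust is not good.", "The place is good."]
docs = [remove_stop_words(tokenize(s)) for s in sentences]
idf = fit_idf(docs)
vectors = [tf_idf(d, idf) for d in docs]

for sentence, vector in zip(sentences, vectors):
    print(sentence, vector)
```

Because splitting is on whitespace only, `place.` and `place` end up as different vocabulary entries, which is one reason the vocabulary for 1,000 short sentences is as large as it is.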

In [11]:
# combine features into a single column
assembler = VectorAssembler(inputCols=['tf_idf','length'],
                            outputCol='features')

In [12]:
# pipeline
data_prep_pipe = Pipeline(stages=[tokenizer,stopremove,count_vec,idf,assembler])
clean_data = data_prep_pipe.fit(df).transform(df)
final_data = clean_data.select(['label','features'])
final_data.show(5)

+-----+--------------------+
|label|            features|
+-----+--------------------+
|    1|(2584,[18,64,2222...|
|    0|(2584,[14,759,258...|
|    0|(2584,[145,351,19...|
|    1|(2584,[34,64,186,...|
|    1|(2584,[3,72,137,3...|
+-----+--------------------+
only showing top 5 rows
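
The vector size of 2584 shown above is consistent with a 2583-term vocabulary plus one extra slot for `length`, since `VectorAssembler` concatenates its inputs in the order given. The sketch below mimics that concatenation on a sparse vector stored as a plain dict. The `assemble` helper and the TF-IDF weights are hypothetical; the indices 18 and 64 and the length 24 are taken from the first row of the outputs above.

```python
def assemble(sparse_tf_idf, vocab_size, length):
    """Append one numeric value to a sparse vector given as {index: value}."""
    combined = dict(sparse_tf_idf)
    # The appended feature takes the first index after the vocabulary.
    combined[vocab_size] = float(length)
    return vocab_size + 1, combined

# Hypothetical weights; vocab_size=2583 is inferred from the 2584 shown above.
size, features = assemble({18: 2.1, 64: 1.7}, vocab_size=2583, length=24)
print(size, features)
```

Under this reading, index 2583 in any `features` vector is the sentence length rather than a word.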



**Split Train Test sets**

In [13]:
seed = 101 #for reproducibility
train_data,test_data = final_data.randomSplit([0.7,0.3],seed=seed)
train_data.describe().show()
test_data.describe().show()

+-------+------------------+
|summary|             label|
+-------+------------------+
|  count|               690|
|   mean|0.4681159420289855|
| stddev|0.4993443462061108|
|    min|                 0|
|    max|                 1|
+-------+------------------+

+-------+------------------+
|summary|             label|
+-------+------------------+
|  count|               310|
|   mean|0.5709677419354838|
| stddev|0.4957381788788524|
|    min|                 0|
|    max|                 1|
+-------+------------------+
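
`randomSplit` samples rows independently, so neither the split sizes nor the class proportions are exact. A quick arithmetic check on the two summaries above recovers the number of positive sentences on each side; the variable names are just for this sketch.

```python
# Counts and means copied from the two describe() outputs above.
train_count, train_mean = 690, 0.4681159420289855
test_count, test_mean = 310, 0.5709677419354838

# The label is 0/1, so count * mean is the number of positive rows.
train_pos = round(train_count * train_mean)
test_pos = round(test_count * test_mean)

print("train positives:", train_pos, "of", train_count)
print("test positives:", test_pos, "of", test_count)
print("total positives:", train_pos + test_pos)
```

The training set is about 47% positive and the test set about 57% positive, even though the full dataset is balanced 500/500.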



## Naive Bayes Model
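
Spark's `NaiveBayes` defaults to the multinomial variant with a smoothing value of 1.0. As a rough illustration of that idea, here is a textbook multinomial Naive Bayes in pure Python. It is a sketch, not Spark's code: it works on raw token counts rather than TF-IDF weights, leaves the class priors unsmoothed, and the toy documents, labels, and function names are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, smoothing=1.0):
    """Fit per-class log priors and smoothed per-term log likelihoods."""
    vocab = {term for doc in docs for term in doc}
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc)

    model = {}
    for cls, n_docs in class_counts.items():
        total = sum(term_counts[cls].values())
        denom = total + smoothing * len(vocab)
        log_prior = math.log(n_docs / len(docs))
        log_lik = {t: math.log((term_counts[cls][t] + smoothing) / denom)
                   for t in vocab}
        model[cls] = (log_prior, log_lik)
    return model

def predict_nb(model, doc):
    """Return the class with the highest log posterior; unseen terms are skipped."""
    def score(cls):
        log_prior, log_lik = model[cls]
        return log_prior + sum(log_lik[t] for t in doc if t in log_lik)
    return max(model, key=score)

# Hypothetical toy data: 1 = positive, 0 = negative.
docs = [["loved", "place"], ["great", "food"], ["crust", "bad"], ["bad", "service"]]
labels = [1, 1, 0, 0]
model = train_nb(docs, labels)

print(predict_nb(model, ["bad", "food"]))
print(predict_nb(model, ["loved", "food"]))
```

Multinomial Naive Bayes needs non-negative feature values, which the counts, TF-IDF weights, and lengths used in this notebook all satisfy.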

In [14]:
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Train the model
nb = NaiveBayes()
sentiment_predictor = nb.fit(train_data)
# Make predictions
predictions = sentiment_predictor.transform(test_data)
predictions.select(['features','label','prediction']).show()
# Evaluate the model (the evaluator's default metric is f1, so request accuracy explicitly)
acc_eval = MulticlassClassificationEvaluator(metricName='accuracy')
acc = acc_eval.evaluate(predictions)
print("Accuracy of model at predicting sentiment: {:.1f}%".format(acc*100))

+--------------------+-----+----------+
|            features|label|prediction|
+--------------------+-----+----------+
|(2584,[0,2,25,26,...|    0|       0.0|
|(2584,[0,4,25,42,...|    0|       1.0|
|(2584,[0,4,62,68,...|    0|       0.0|
|(2584,[0,5,19,33,...|    0|       0.0|
|(2584,[0,14,2583]...|    0|       1.0|
|(2584,[0,17,143,2...|    0|       0.0|
|(2584,[0,17,621,1...|    0|       0.0|
|(2584,[0,24,153,1...|    0|       1.0|
|(2584,[0,49,145,1...|    0|       1.0|
|(2584,[0,62,139,1...|    0|       0.0|
|(2584,[0,86,225,2...|    0|       1.0|
|(2584,[1,5,31,126...|    0|       0.0|
|(2584,[1,5,127,33...|    0|       0.0|
|(2584,[1,6,7,10,1...|    0|       0.0|
|(2584,[1,8,266,57...|    0|       0.0|
|(2584,[1,9,997,19...|    0|       1.0|
|(2584,[1,15,22,43...|    0|       0.0|
|(2584,[1,26,36,22...|    0|       0.0|
|(2584,[1,36,39,98...|    0|       1.0|
|(2584,[1,69,78,67...|    0|       1.0|
+--------------------+-----+----------+
only showing top 20 rows

Accuracy of mo
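
Accuracy and the `f1` metric that `MulticlassClassificationEvaluator` uses by default are different numbers, so the `metricName` setting decides what the printed figure means. The sketch below computes both by hand on the 20 rows displayed above. Those rows are only the first slice of the test set and all carry label 0, so the values say nothing about the model's overall quality. The helper functions are written for this illustration and are not Spark's implementation.

```python
from collections import Counter

# Labels and predictions copied from the 20 rows shown above.
labels = [0] * 20
preds = [0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1]

def accuracy(labels, preds):
    # Fraction of rows where the prediction matches the label.
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)

def weighted_f1(labels, preds):
    # Per-class F1, averaged with weights proportional to each class's true count.
    support = Counter(labels)
    total = 0.0
    for cls, n in support.items():
        tp = sum(l == cls and p == cls for l, p in zip(labels, preds))
        fp = sum(l != cls and p == cls for l, p in zip(labels, preds))
        fn = n - tp
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        total += f1 * n / len(labels)
    return total

print("accuracy:", accuracy(labels, preds))
print("weighted F1:", weighted_f1(labels, preds))
```

On these rows the two metrics come out at 0.6 and 0.75 respectively, which shows how far apart they can be on the same predictions.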

In [15]:
spark.stop()