# Predicting movie review sentiment with PySpark

This exercise uses simple n-gram models and logistic regression to predict the sentiment of movie reviews.

Adapted from: https://towardsdatascience.com/sentiment-analysis-with-pyspark-bc8e83f80c35

# Install pyspark and download data

In [None]:
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.0.tar.gz (281.3 MB)
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.0-py2.py3-none-any.whl size=281764026 sha256=44f6b83e22be9cd59a0b09183c549d09ad16f8614fae7debc540d8f964e0bede
  Stored in directory: /root/.cache/pip/wheels/7a/8e/1b/f73a52650d2e5f337708d9f6a1750d451a7349a867f928b885
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.5 pyspark-3.3.0


In [None]:
!wget https://github.com/garyongguanjie/learning-pyspark/raw/main/data/imdb-movie-review.zip
!unzip imdb-movie-review.zip

--2022-06-27 07:51:15--  https://github.com/garyongguanjie/learning-pyspark/raw/main/data/imdb-movie-review.zip
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/garyongguanjie/learning-pyspark/main/data/imdb-movie-review.zip [following]
--2022-06-27 07:51:15--  https://raw.githubusercontent.com/garyongguanjie/learning-pyspark/main/data/imdb-movie-review.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 26962657 (26M) [application/zip]
Saving to: ‘imdb-movie-review.zip’


2022-06-27 07:51:16 (168 MB/s) - ‘imdb-movie-review.zip’ saved [26962657/26962657]

Archive:  imdb-movie-review.zip
  inflati

In [None]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName('imdb-nlp-example').getOrCreate()

In [None]:
df = spark.read.csv('IMDB Dataset.csv',header=True,inferSchema=True,escape='"')

In [None]:
df.show(5)

+--------------------+---------+
|              review|sentiment|
+--------------------+---------+
|One of the other ...| positive|
|A wonderful littl...| positive|
|I thought this wa...| positive|
|Basically there's...| negative|
|Petter Mattei's "...| positive|
+--------------------+---------+
only showing top 5 rows



In [None]:
df.printSchema()

root
 |-- review: string (nullable = true)
 |-- sentiment: string (nullable = true)



# EDA

In [None]:
df.groupBy('sentiment').count().show()

+---------+-----+
|sentiment|count|
+---------+-----+
| positive|25000|
| negative|25000|
+---------+-----+



## Count top words for each sentiment

Here we remove stopwords so that the counts reflect meaningful words rather than common function words.

In [None]:
positive_words = df.filter(df.sentiment == "positive").withColumn('tokens', F.split(F.col('review'), ' '))
negative_words = df.filter(df.sentiment == "negative").withColumn('tokens', F.split(F.col('review'), ' '))

In [None]:
from pyspark.ml.feature import StopWordsRemover
# Extend the default English stop words with "/><br", a fragment left over from the HTML line breaks in the reviews.

remover = StopWordsRemover(stopWords=["/><br"]+StopWordsRemover.loadDefaultStopWords('english'),inputCol = 'tokens',outputCol='cleaned')

# To use the default list unchanged, pass StopWordsRemover.loadDefaultStopWords('english') on its own.
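To see where a token like `/><br` comes from, here is a small plain-Python sketch. The review snippet is made up for illustration; it assumes the reviews contain `<br /><br />` line breaks, which is consistent with the `/><br` stop word above and the `/>The` token in the counts below.

```python
# Hypothetical review snippet (not taken from the dataset) with HTML line breaks.
snippet = "A great movie.<br /><br />The plot is simple."

# Splitting on a single space, as F.split(F.col('review'), ' ') does above,
# cuts the <br /> tags into fragments.
tokens = snippet.split(" ")
print(tokens)
```

Splitting on spaces leaves `/><br` and `/>The` as separate tokens, which is why only the first of them disappears once it is added to the stop word list.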

In [None]:
remover.transform(positive_words).withColumn('word',F.explode('cleaned')).groupby('word').count().sort('count',ascending=False).show()

+------+-----+
|  word|count|
+------+-----+
|  film|28855|
| movie|26367|
|   one|21304|
|  like|15665|
|  good|11318|
|   see|10693|
| great|10331|
|really|10138|
| story| 9329|
|     -| 9182|
|  also| 9015|
|  much| 8034|
|  even| 7999|
|   get| 7797|
|  time| 7711|
| first| 7695|
|  well| 7157|
| />The| 7151|
|  many| 6901|
|people| 6818|
+------+-----+
only showing top 20 rows



In [None]:
remover.transform(negative_words).withColumn('word',F.explode('cleaned')).groupby('word').count().sort('count',ascending=False).show()

+------+-----+
|  word|count|
+------+-----+
| movie|34387|
|  film|25417|
|  like|20361|
|   one|20025|
|  even|12978|
|  good|11263|
|really|11183|
|   bad|10101|
|   see| 9669|
|   get| 9534|
|     -| 9019|
|  much| 8792|
|  make| 8730|
|  time| 7596|
|people| 7594|
|  made| 7293|
| />The| 7184|
| story| 7110|
| first| 6588|
| think| 6556|
+------+-----+
only showing top 20 rows



# Modelling

For this exercise we will not remove stopwords, to see whether our model can identify the most informative words without cleaning the data.

In [None]:
from pyspark.ml.feature import Tokenizer, CountVectorizer,StringIndexer
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression

In the first step we encode the sentiment as a numeric label. This is similar to sklearn's LabelEncoder.

In [None]:
si = StringIndexer(inputCol="sentiment",outputCol="label")
df = si.fit(df).transform(df)
df = df.withColumn('label',df['label'].cast('integer'))
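As a rough illustration of how the indices are assigned, the sketch below mimics what we understand to be StringIndexer's default ordering (most frequent value first, ties broken alphabetically) in plain Python. The helper `index_labels` is hypothetical and is not part of PySpark.

```python
from collections import Counter

def index_labels(values):
    """Map each distinct string to an index: most frequent first, ties alphabetical."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: idx for idx, value in enumerate(ordered)}

# Both classes are equally frequent here, as in the balanced dataset,
# so the alphabetical tie-break decides the order.
mapping = index_labels(["positive", "negative", "positive", "negative"])
print(mapping)
```

Under that assumption, a balanced dataset gives `negative` the index 0 and `positive` the index 1.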

In [None]:
df.show()

+--------------------+---------+-----+
|              review|sentiment|label|
+--------------------+---------+-----+
|One of the other ...| positive|    1|
|A wonderful littl...| positive|    1|
|I thought this wa...| positive|    1|
|Basically there's...| negative|    0|
|Petter Mattei's "...| positive|    1|
|Probably my all-t...| positive|    1|
|I sure would like...| positive|    1|
|This show was an ...| negative|    0|
|Encouraged by the...| negative|    0|
|If you like origi...| positive|    1|
|Phil the Alien is...| negative|    0|
|I saw this movie ...| negative|    0|
|So im not a big f...| negative|    0|
|The cast played S...| negative|    0|
|This a fantastic ...| positive|    1|
|Kind of drawn in ...| negative|    0|
|Some films just s...| positive|    1|
|This movie made i...| negative|    0|
|I remember this f...| positive|    1|
|An awful film! It...| negative|    0|
+--------------------+---------+-----+
only showing top 20 rows



Split into training and validation data.

In [None]:
train_df,val_df = df.randomSplit([0.8,0.2])

In [None]:
train_df.show()

+--------------------+---------+-----+
|              review|sentiment|label|
+--------------------+---------+-----+
|\b\b\b\bA Turkish...| positive|    1|
|!!!! MILD SPOILER...| negative|    0|
|" Now in India's ...| positive|    1|
|" Så som i himmel...| positive|    1|
|" While sporadica...| negative|    0|
|"... the beat is ...| positive|    1|
|"2001: A Space Od...| positive|    1|
|"200l: A Space Od...| positive|    1|
|"8 SIMPLE RULES.....| positive|    1|
|"9/11," hosted by...| positive|    1|
|"A Cry in the Dar...| positive|    1|
|"A Guy Thing" may...| positive|    1|
|"A Minute to Pray...| positive|    1|
|"A Slight Case of...| positive|    1|
|"A Tale of Two Si...| positive|    1|
|"A Thief in the N...| positive|    1|
|"A bored televisi...| negative|    0|
|"A death at a col...| negative|    0|
|"A total waste of...| negative|    0|
|"A trio of treasu...| negative|    0|
+--------------------+---------+-----+
only showing top 20 rows



Write the pipeline. First we tokenize, then we use a count vectorizer, followed by logistic regression.

In [None]:
tokenizer = Tokenizer(inputCol="review",outputCol="tokens")
cv = CountVectorizer(binary=True,inputCol="tokens",outputCol="features")
lr = LogisticRegression()
pipeline = Pipeline(stages=[tokenizer,cv,lr])
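To make the first two stages concrete, here is a minimal plain-Python sketch of lowercased whitespace tokenization followed by a binary bag-of-words encoding. The two documents and the helper `binary_vector` are made up for illustration, and the alphabetical tie-break in the vocabulary order is an assumption, not necessarily what CountVectorizer does.

```python
from collections import Counter

# Hypothetical toy corpus.
docs = ["A good good movie", "a bad movie"]

# Lowercase and split on whitespace, roughly what Tokenizer does.
tokenized = [doc.lower().split() for doc in docs]

# Vocabulary ordered by corpus frequency (ties alphabetical: an assumption).
counts = Counter(token for tokens in tokenized for token in tokens)
vocab = sorted(counts, key=lambda t: (-counts[t], t))

def binary_vector(tokens, vocab):
    """1.0 if the term occurs in the document at all, else 0.0 (binary=True)."""
    present = set(tokens)
    return [1.0 if term in present else 0.0 for term in vocab]

vectors = [binary_vector(tokens, vocab) for tokens in tokenized]
print(vocab)
print(vectors)
```

With `binary=True`, the repeated "good" in the first document contributes 1.0 rather than 2.0, so each feature only records whether a word is present.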

In [None]:
model = pipeline.fit(train_df)

In [None]:
pred = model.transform(val_df)

In [None]:
pred.show()

+--------------------+---------+-----+--------------------+--------------------+--------------------+--------------------+----------+
|              review|sentiment|label|              tokens|            features|       rawPrediction|         probability|prediction|
+--------------------+---------+-----+--------------------+--------------------+--------------------+--------------------+----------+
|!!!! MILD SPOILER...| negative|    0|[!!!!, mild, spoi...|(262144,[0,1,2,3,...|[53.1570142981802...|           [1.0,0.0]|       0.0|
|!!!! POSSIBLE MIL...| negative|    0|[!!!!, possible, ...|(262144,[0,1,2,3,...|[-9.3802594344848...|[8.43661920924515...|       1.0|
|"A Gentleman's Ga...| negative|    0|["a, gentleman's,...|(262144,[0,1,2,3,...|[16.6437362268485...|[0.99999994088234...|       0.0|
|"A Mouse in the H...| positive|    1|["a, mouse, in, t...|(262144,[0,1,2,3,...|[-35.424210695566...|[4.12534970159587...|       1.0|
|"A lot of the fil...| negative|    0|["a, lot, of, the...|(26

In [None]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator()
evaluator.evaluate(pred) # gives area under roc by default

0.9256584970353555
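One way to read the area under the ROC curve is as the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The sketch below computes it that way in plain Python on made-up scores. It illustrates the metric and is not how Spark computes it internally.

```python
def auc_by_pairs(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical scores and labels.
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
auc = auc_by_pairs(scores, labels)
print(auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so the 0.93 above means positives are ranked above negatives most of the time.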

In [None]:
# Accuracy
acc = pred.filter(pred.prediction.cast('integer')==pred.label).count()/val_df.count()
print(acc)
# can you find out precision and recall? (hint: filter the denominator before counting)

0.851822346146186
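As a sketch of the hint, precision and recall share the same numerator (true positives) and differ only in the denominator. The example below uses a made-up list of `(prediction, label)` pairs in plain Python rather than the `pred` DataFrame.

```python
# Hypothetical (prediction, label) pairs.
rows = [(1, 1), (1, 0), (0, 1), (0, 0), (1, 1), (1, 0)]

# Numerator for both metrics: predicted positive and actually positive.
true_pos = sum(1 for p, y in rows if p == 1 and y == 1)

# Precision: restrict the denominator to rows predicted positive.
predicted_pos = sum(1 for p, _ in rows if p == 1)
precision = true_pos / predicted_pos

# Recall: restrict the denominator to rows that are actually positive.
actual_pos = sum(1 for _, y in rows if y == 1)
recall = true_pos / actual_pos

print(precision, recall)
```

With the `pred` DataFrame, the corresponding denominators would presumably be `pred.filter(pred.prediction == 1).count()` and `pred.filter(pred.label == 1).count()`.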


In [None]:
from pyspark.ml.feature import NGram, VectorAssembler

For n-grams we need to build the features for each n-gram order separately and then concatenate them using the vector assembler.

In [None]:
def build_ngrams(n=1):
    tokenizer = [Tokenizer(inputCol="review",outputCol="tokens")]
    ngrams = [
              NGram(n=i,inputCol="tokens",outputCol=f"{i}_grams")
              for i in range(1,n+1)
    ]
    
    cv = [
          CountVectorizer(vocabSize=2**14,inputCol=f"{i}_grams",outputCol=f"features_{i}")
          for i in range(1,n+1)
    ]

    assembler =  [
        VectorAssembler(
            inputCols=[f"features_{i}" for i in range(1, n + 1)],
            outputCol="features"
        )
    ]
    lr = [LogisticRegression()]

    return Pipeline(stages=tokenizer+ngrams+cv+assembler+lr)
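The idea behind the pipeline above can be sketched in plain Python: build the n-grams for each order by joining adjacent tokens with a space, count them against a per-order vocabulary, and concatenate the resulting vectors. The tokens, the tiny vocabularies, and the helpers `ngrams` and `count_vector` are all made up for illustration.

```python
def ngrams(tokens, n):
    """Join every run of n adjacent tokens with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_vector(terms, vocab):
    """Count how often each vocabulary entry occurs in terms."""
    return [float(terms.count(entry)) for entry in vocab]

# Hypothetical tokenized review.
tokens = ["!!!!", "mild", "spoilers", "ahead"]
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

# Hypothetical vocabularies, one per n-gram order.
vocab_1 = ["mild", "spoilers"]
vocab_2 = ["mild spoilers", "great film"]

features_1 = count_vector(unigrams, vocab_1)
features_2 = count_vector(bigrams, vocab_2)

# Concatenation, analogous to what VectorAssembler does with features_1 and features_2.
features = features_1 + features_2
print(bigrams)
print(features)
```

The assembled vector is as long as the two vocabularies combined, so with `vocabSize=2**14` per order and `n=2` the final feature vector has 2 * 16384 = 32768 entries.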

In [None]:
pipeline = build_ngrams(2)

In [None]:
model = pipeline.fit(train_df)

In [None]:
pred = model.transform(val_df)

In [None]:
evaluator.evaluate(pred)

0.9424111201549831

In [None]:
acc = pred.filter(pred.prediction.cast('integer')==pred.label).count()/val_df.count()
print(acc)

0.8752240589524


In [None]:
pred.show()

+--------------------+---------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|              review|sentiment|label|              tokens|             1_grams|             2_grams|          features_1|          features_2|            features|       rawPrediction|         probability|prediction|
+--------------------+---------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|!!!! MILD SPOILER...| negative|    0|[!!!!, mild, spoi...|[!!!!, mild, spoi...|[!!!! mild, mild ...|(16384,[0,1,2,3,4...|(16384,[0,1,9,10,...|(32768,[0,1,2,3,4...|[120.446376819478...|           [1.0,0.0]|       0.0|
|!!!! POSSIBLE MIL...| negative|    0|[!!!!, possible, ...|[!!!!, possible, ...|[!!!! possible, p...|(16384,[0,1,2,3,4...|(16384