In [1]:
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


22/07/03 00:22:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/07/03 00:22:10 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


In [2]:
# Each file is read as plain text: one row per line, in a single `value` column.
df_imdb = spark.read.text("./data/sentiment/imdb_labelled.txt")
df_amazon = spark.read.text("./data/sentiment/amazon_cells_labelled.txt")
df_yelp = spark.read.text("./data/sentiment/yelp_labelled.txt")

- Read the `readme.txt` file (in the same directory as the data) to learn how to parse the text.
  - Hint: You might want to use `map`.
- Union all three DataFrames into one.
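
A minimal plain-Python sketch of the parsing step, assuming each line holds a sentence and a 0/1 label separated by a tab (which is what splitting on `'\t'` implies). The helper `parse_line` and the sample lines are hypothetical, for illustration only; a function like this could be passed to `map` over an RDD of lines.

```python
def parse_line(line):
    """Split one raw line into (sentence, label)."""
    # rsplit with maxsplit=1 splits on the LAST tab only, so a stray tab
    # inside the sentence would not break the parse.
    text, label = line.rstrip("\n").rsplit("\t", 1)
    return text.strip(), int(label)


# Hypothetical sample lines in the assumed "sentence<TAB>label" format.
sample_lines = ["Wasted two hours.  \t0\n", "Loved the casting.\t1\n"]
parsed = [parse_line(line) for line in sample_lines]
```

Converting the label to `int` here is a design choice of the sketch; leaving it as a string also works when a later stage indexes it.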

In [3]:
# Stack the three single-column DataFrames, then split each line
# into the sentence and its label on the tab character.
df_raw = df_imdb.union(df_amazon).union(df_yelp)
split_rdd = df_raw.rdd.map(lambda row: row.value.split('\t'))
df_raw = split_rdd.toDF(['text', 'target'])
df_raw.show(10)

                                                                                

+--------------------+------+
|                text|target|
+--------------------+------+
|A very, very, ver...|     0|
|Not sure who was ...|     0|
|Attempting artine...|     0|
|Very little music...|     0|
|The best scene in...|     1|
|The rest of the m...|     0|
| Wasted two hours.  |     0|
|Saw the movie tod...|     1|
|A bit predictable.  |     0|
|Loved the casting...|     1|
+--------------------+------+
only showing top 10 rows



In [4]:
# Split into train (90%) and test (10%) sets with a fixed seed.
(train_set, test_set) = df_raw.randomSplit([0.9, 0.1], seed=33)

After splitting into train and test sets, we will build a pipeline to extract features from the text.

If you remember from the lectures, a model can't work on the raw words themselves, since strings carry no numeric information.

So, we need an informative numeric feature for the words.

We will choose counting as a basic feature that works.

_(Meaning that the value for each word is the number of times it occurs in the given sentence.)_

So, we will use [`pyspark.ml.feature.CountVectorizer`](https://spark.apache.org/docs/latest/ml-features#countvectorizer).
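
To make the counting idea concrete, here is a small plain-Python sketch. It is not how `CountVectorizer` is implemented; it only mimics the result on a hypothetical toy corpus. The vocabulary is sorted alphabetically here for readability, which is an assumption of the sketch and not necessarily the order Spark uses.

```python
from collections import Counter

# Hypothetical, already-tokenized sentences.
docs = [["good", "movie", "good"], ["bad", "movie"]]

# Vocabulary: every distinct word in the corpus (alphabetical order is
# a simplification made by this sketch).
vocab = sorted({word for doc in docs for word in doc})


def count_vector(tokens):
    """Return one count per vocabulary word for a single sentence."""
    counts = Counter(tokens)
    return [counts.get(word, 0) for word in vocab]


vectors = [count_vector(doc) for doc in docs]
```

Each sentence becomes a vector with one slot per vocabulary word, holding how many times that word occurs in that sentence.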

But first, we need to split each sentence into words.
Rather than calling `.split()` ourselves, we will use a pipeline stage that also lowercases the text.

So, we will use [`pyspark.ml.feature.Tokenizer`](https://spark.apache.org/docs/latest/ml-features#tokenizer).
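
As a rough plain-Python approximation, `Tokenizer` lowercases the text and splits it on whitespace. The function `simple_tokenize` below is hypothetical and only illustrates that behaviour; Spark's handling of repeated whitespace may differ from Python's `str.split()`.

```python
def simple_tokenize(text):
    """Lowercase the text and split it on whitespace (approximation)."""
    return text.lower().split()


tokens = simple_tokenize("Wasted two hours.")
```

Note that punctuation stays attached to the word (`"hours."`), so `"hours"` and `"hours."` would count as different vocabulary entries.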

In [14]:
from pyspark.ml.feature import CountVectorizer, Tokenizer, StringIndexer

tokenizer = Tokenizer(inputCol="text", outputCol="words")
counter = CountVectorizer(vocabSize=2**16, inputCol="words", outputCol="features")
label_stringIdx = StringIndexer(inputCol="target", outputCol="label")
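
One thing to keep in mind: by default, `StringIndexer` orders labels by descending frequency, so the most frequent `target` value gets `label` 0.0, which need not match the original 0/1 value. The plain-Python sketch below mimics that ordering on hypothetical data; the alphabetical tie-break is an assumption of the sketch, added only to make it deterministic.

```python
from collections import Counter


def frequency_index(values):
    """Map each distinct value to an index, most frequent value first."""
    counts = Counter(values)
    # Sort by descending count; break ties alphabetically (sketch assumption).
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}


# Hypothetical target column where "1" is the more frequent class.
mapping = frequency_index(["1", "0", "1", "1", "0"])
```

In this toy case the string `"1"` is mapped to `0.0`, so it is worth checking the fitted indexer's label order before interpreting predictions.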

Next, we will use [`pyspark.ml.classification.LogisticRegression`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.LogisticRegression.html) for our ML model.

In [6]:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(maxIter=400)
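
As a reminder of what the model computes, here is a minimal plain-Python sketch of a binary logistic regression prediction: a weighted sum of the word counts passed through the sigmoid function. The weights, bias, and three-word vocabulary (`good`, `bad`, `movie`) are hypothetical; in the notebook they are learned during fitting.

```python
import math


def predict_proba(features, weights, bias):
    """Probability of the positive class for one count vector."""
    margin = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-margin))


# Hypothetical weights for the vocabulary ["good", "bad", "movie"].
weights = [1.5, -2.0, 0.0]
bias = 0.0

# Counts for a sentence containing "good" twice and "movie" once.
p = predict_proba([2, 0, 1], weights, bias)
```

A sentence with no known words gives a margin of zero and therefore a probability of 0.5 under these hypothetical parameters.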

In [7]:
# Create a `pyspark.ml.Pipeline` of the feature extractors and the model.
from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[tokenizer, counter, label_stringIdx, lr])

In [8]:
# Fit the pipeline on the train set
# (like we did with the ML exercises earlier in the course),
# then apply the fitted pipeline to both sets.
pipelineFit = pipeline.fit(train_set)
train_df = pipelineFit.transform(train_set)
test_df = pipelineFit.transform(test_set)
train_df.show(5)

                                                                                

22/07/03 00:22:20 WARN InstanceBuilder$JavaBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.VectorBLAS
22/07/03 00:22:21 WARN InstanceBuilder$NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
22/07/03 00:22:21 WARN InstanceBuilder$NativeBLAS: Failed to load implementation from:dev.ludovic.netlib.blas.ForeignLinkerBLAS
22/07/03 00:22:21 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
22/07/03 00:22:21 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
+--------------------+------+--------------------+--------------------+-----+--------------------+--------------------+----------+
|                text|target|               words|            features|label|       rawPrediction|         probability|prediction|
+--------------------+------+--------------------+--------------------+-----+--------------------+--------------------+----------+
|" But "Storm Troo...|     0|["

We need a metric to understand how well our model performs.

We'll use [`pyspark.ml.evaluation.BinaryClassificationEvaluator`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.evaluation.BinaryClassificationEvaluator.html) to evaluate the model on the train and test sets.

In [12]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# The default metric is areaUnderROC (not accuracy).
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction")
evaluator.evaluate(test_df)

0.8682583304706284
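
With default settings, `BinaryClassificationEvaluator` reports the area under the ROC curve, not accuracy. One way to read that number: it is the chance that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The plain-Python sketch below computes it that way on hypothetical labels and scores; it illustrates the metric and is not how Spark computes it.

```python
def auc_by_pairs(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical true labels and model scores.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
auc = auc_by_pairs(labels, scores)
```

Here three of the four positive/negative pairs are ranked correctly, so the sketch yields 0.75.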

In [13]:
pipelineFit.save('./best_pipeline')

                                                                                