# Multiclass Text Classification: Building a Tone Analyzer

### Bheeni Garg
### August 31, 2017

In [3]:
# check if the spark context is working
sc

In [4]:
# load dependencies
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

# reuse the notebook's Spark session if one exists, otherwise create one
spark = SparkSession.builder.getOrCreate()

In [5]:
# load dataset
df = spark.read.option("header", "true").csv("text_tone.csv")

In [6]:
df.show()

+----------+-----------+---------------+--------------------+
|  tweet_id|  sentiment|         author|             content|
+----------+-----------+---------------+--------------------+
|1956967341|frustration|     xoshayzers|@tiffanylue i kno...|
|1956967666|      anger|      wannamama|Layin n bed with ...|
|1956967696|      anger|      coolfunky|Funeral ceremony....|
|1956967789| excitement|    czareaquino|wants to hang out...|
|1956968416|    neutral|      xkilljoyx|@dannycastillo We...|
|1956968477|frustration|  xxxPEACHESxxx|Re-pinging @ghost...|
|1956968487|      anger|       ShansBee|I should be sleep...|
|1956968636|frustration|       mcsleazy|Hmmm. http://www....|
|1956969035|      anger|    nic0lepaula|@charviray Charle...|
|1956969172|      anger|     Ingenue_Em|@kelcouch I'm sor...|
|1956969456|    neutral|     feinyheiny|    cant fall asleep|
|1956969531|frustration|   dudeitsmanda|Choked on her ret...|
|1956970047|      anger|       Danied32|Ugh! I have to be...|
|1956970

In [7]:
df.printSchema()

root
 |-- tweet_id: string (nullable = true)
 |-- sentiment: string (nullable = true)
 |-- author: string (nullable = true)
 |-- content: string (nullable = true)



We create a new dataset with only the necessary columns: `sentiment` and `content`.

In [8]:
dfNew = df.select('sentiment', 'content')

In [9]:
dfNew.show(truncate = False)

+-----------+------------------------------------------------------------------------------------------------------------------------------------------+
|sentiment  |content                                                                                                                                   |
+-----------+------------------------------------------------------------------------------------------------------------------------------------------+
|frustration|@tiffanylue i know  i was listenin to bad habit earlier and i started freakin at his part =[                                              |
|anger      |Layin n bed with a headache  ughhhh...waitin on your call...                                                                              |
|anger      |Funeral ceremony...gloomy friday...                                                                                                       |
|excitement |wants to hang out with friends SOON!                                 

In [10]:
dfNew.count()

40000

The dataset contains 40,000 tweets (documents), which we use to build our classification model.

***

### Feature Transformer - Tokenizer

Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words). In Spark, the `Tokenizer` class provides this functionality by lowercasing the text and splitting it on whitespace.

`RegexTokenizer` allows more advanced tokenization based on regular expression (regex) matching. By default, its `pattern` parameter (a regex, default `"\\s+"`) is treated as the delimiter used to split the input text. Alternatively, users can set the `gaps` parameter to false, indicating that `pattern` denotes the tokens themselves rather than the gaps between them; all matching occurrences are then returned as the tokenization result.
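To make the two modes concrete, here is a minimal plain-Python sketch of regex tokenization. The function `regex_tokenize` and its parameter names are hypothetical stand-ins that only mimic the behaviour described above; they are not Spark's implementation, and Python's `\W` may treat some non-ASCII characters differently from Java's.

```python
import re

def regex_tokenize(text, pattern=r"\W+", gaps=True, to_lowercase=True):
    """Hypothetical stand-in for a regex tokenizer.

    gaps=True  -> `pattern` describes the delimiters, so we split on it.
    gaps=False -> `pattern` describes the tokens, so we collect every match.
    """
    if to_lowercase:
        text = text.lower()
    if gaps:
        tokens = re.split(pattern, text)
    else:
        tokens = re.findall(pattern, text)
    # drop empty strings produced by leading or trailing delimiters
    return [t for t in tokens if t]

# splitting on runs of non-word characters
tokens_split = regex_tokenize("Layin n bed with a headache  ughhhh...waitin on your call...")

# matching runs of word characters instead of splitting on gaps
tokens_match = regex_tokenize("wants to hang out with friends SOON!", pattern=r"\w+", gaps=False)

print(tokens_split)
print(tokens_match)
```

Both modes give the same kind of result for this input; choosing between them is mostly a matter of whether the delimiter or the token is easier to describe with a regex.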

In [11]:
from pyspark.ml.feature import Tokenizer, RegexTokenizer
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType


In [12]:
# RegexTokenizer splits on runs of non-word characters (\W+) and lowercases tokens by default
regexTokenizer = RegexTokenizer(inputCol="content", outputCol="words", pattern="\\W+")

In [13]:
# count the number of tokens/words in a tweet
countTokens = udf(lambda words: len(words), IntegerType())

In [14]:
regexTokenized = regexTokenizer.transform(dfNew)
dfTokenized = regexTokenized.withColumn("tokens", countTokens(col("words")))

In [15]:
dfTokenized.show(5)

+-----------+--------------------+--------------------+------+
|  sentiment|             content|               words|tokens|
+-----------+--------------------+--------------------+------+
|frustration|@tiffanylue i kno...|[tiffanylue, i, k...|    17|
|      anger|Layin n bed with ...|[layin, n, bed, w...|    11|
|      anger|Funeral ceremony....|[funeral, ceremon...|     4|
| excitement|wants to hang out...|[wants, to, hang,...|     7|
|    neutral|@dannycastillo We...|[dannycastillo, w...|    15|
+-----------+--------------------+--------------------+------+
only showing top 5 rows



***

### StopWordsRemover

Stop words are words that should be excluded from the input, typically because they appear frequently and carry little meaning.

They are mostly common, short function words, such as the, is, at, which, and on.
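The filtering step itself is simple, as the following plain-Python sketch suggests. The `STOP_WORDS` set is a tiny illustrative list chosen for this example (an assumption, far shorter than the default list Spark ships), and `remove_stop_words` is a hypothetical helper name.

```python
# Tiny illustrative stop-word list; an assumption, not Spark's default list.
STOP_WORDS = {"i", "a", "the", "is", "at", "which", "on", "to", "with", "out", "your"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep the tokens that are not stop words, preserving their order.

    The comparison is case-insensitive.
    """
    return [t for t in tokens if t.lower() not in stop_words]

words = ["wants", "to", "hang", "out", "with", "friends", "soon"]
filtered = remove_stop_words(words)
print(filtered)
```

Using a set for the stop-word list keeps each membership check constant-time, which matters once the list grows to a few hundred words.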

In [16]:
from pyspark.ml.feature import StopWordsRemover

remover = StopWordsRemover(inputCol="words", outputCol="filtered")
dfSWRemoved = remover.transform(dfTokenized)
dfSWR = dfSWRemoved.select('sentiment','content','words','filtered')\
     .withColumn("filtokens", countTokens(col("filtered")))
dfSWR.show()

+-----------+--------------------+--------------------+--------------------+---------+
|  sentiment|             content|               words|            filtered|filtokens|
+-----------+--------------------+--------------------+--------------------+---------+
|frustration|@tiffanylue i kno...|[tiffanylue, i, k...|[tiffanylue, know...|        9|
|      anger|Layin n bed with ...|[layin, n, bed, w...|[layin, n, bed, h...|        7|
|      anger|Funeral ceremony....|[funeral, ceremon...|[funeral, ceremon...|        4|
| excitement|wants to hang out...|[wants, to, hang,...|[wants, hang, fri...|        4|
|    neutral|@dannycastillo We...|[dannycastillo, w...|[dannycastillo, w...|        7|
|frustration|Re-pinging @ghost...|[re, pinging, gho...|[re, pinging, gho...|       11|
|      anger|I should be sleep...|[i, should, be, s...|[sleep, im, think...|       12|
|frustration|Hmmm. http://www....|[hmmm, http, www,...|[hmmm, http, www,...|        5|
|      anger|@charviray Charle...|[charvira

In [17]:
dfSWR.select('filtered').show(truncate=False)

+----------------------------------------------------------------------------------------------------+
|filtered                                                                                            |
+----------------------------------------------------------------------------------------------------+
|[tiffanylue, know, listenin, bad, habit, earlier, started, freakin, part]                           |
|[layin, n, bed, headache, ughhhh, waitin, call]                                                     |
|[funeral, ceremony, gloomy, friday]                                                                 |
|[wants, hang, friends, soon]                                                                        |
|[dannycastillo, want, trade, someone, houston, tickets, one]                                        |
|[re, pinging, ghostridah14, didn, go, prom, bc, bf, didn, like, friends]                            |
|[sleep, im, thinking, old, friend, want, married, damn, amp, wants, 2, s

***

### Weighted Bag-of-Words using TF-IDF

Bag-of-words comparisons are not very good when all tokens are treated the same: some tokens are more important than others. Weights give us a way to specify which tokens to favor. With weights, when we compare documents, instead of counting common tokens, we sum up the weights of common tokens. A good heuristic for assigning weights is called "Term-Frequency/Inverse-Document-Frequency," or TF-IDF for short.

#### TF
TF rewards tokens that appear many times in the same document. It is computed as the frequency of a token in a document, that is, if document d contains 100 tokens and token t appears in d 5 times, then the TF weight of t in d is 5/100 = 1/20. The intuition for TF is that if a word occurs often in a document, then it is more important to the meaning of the document.

#### IDF
IDF rewards tokens that are rare overall in a dataset. The intuition is that it is more significant if two documents share a rare word than a common one. IDF weight for a token, t, in a set of documents, U, is computed as follows:

1. Let N be the total number of documents in U.
2. Find n(t), the number of documents in U that contain t.
3. Then IDF(t) = N/n(t).

Note that n(t)/N is the frequency of t in U, and N/n(t) is the inverse frequency. Spark's `IDF` estimator uses a smoothed, logarithmic variant of this ratio, log((N + 1) / (n(t) + 1)).

Note on terminology: Sometimes token weights depend on the document the token belongs to, that is, the same token may have a different weight when it's found in different documents. We call these weights local weights. TF is an example of a local weight, because it depends on the length of the source. On the other hand, some token weights only depend on the token, and are the same everywhere that token is found. We call these weights global, and IDF is one such weight.

#### TF-IDF
Finally, to bring it all together, the total TF-IDF weight for a token in a document is the product of its TF and IDF weights.

[TF-IDF in Spark](https://spark.apache.org/docs/2.1.0/ml-features.html#tf-idf)
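The definitions above can be worked through on a toy corpus. The sketch below follows the plain formulas from this section (TF as relative frequency, IDF as N/n(t)), not Spark's hashed and smoothed variant; the function names and the three small documents are made up for illustration.

```python
from collections import Counter

def term_frequencies(tokens):
    """TF: count of each token divided by the document length."""
    counts = Counter(tokens)
    total = len(tokens)
    return {t: c / total for t, c in counts.items()}

def inverse_document_frequencies(documents):
    """IDF: N / n(t), where n(t) is the number of documents containing t."""
    n_docs = len(documents)
    doc_counts = Counter()
    for tokens in documents:
        doc_counts.update(set(tokens))  # count each token once per document
    return {t: n_docs / n for t, n in doc_counts.items()}

def tf_idf(tokens, idf):
    """TF-IDF: product of the local TF weight and the global IDF weight."""
    return {t: tf * idf[t] for t, tf in term_frequencies(tokens).items()}

# toy corpus of already tokenized documents (illustrative only)
docs = [
    ["funeral", "ceremony", "gloomy", "friday"],
    ["wants", "hang", "friends", "soon"],
    ["gloomy", "friday", "friends", "friday"],
]
idf = inverse_document_frequencies(docs)
weights = tf_idf(docs[2], idf)
print(weights)
```

In the third document, "friday" has TF 2/4 and IDF 3/2, so its weight is 0.75, while "gloomy" and "friends" each get 1/4 × 3/2 = 0.375.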

In [18]:
from pyspark.ml.feature import HashingTF, IDF, Tokenizer

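# implementing TF: HashingTF hashes each token into one of numFeatures buckets
# and stores the raw count per bucket (not the normalized frequency)
hashingTF = HashingTF(inputCol="filtered", outputCol="rawFeatures", numFeatures=50000)
tf = hashingTF.transform(dfSWR)
# alternatively, CountVectorizer can be used to get term frequency vectors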

tf.select('sentiment', 'filtokens', 'rawFeatures').show(truncate=False)

+-----------+---------+---------------------------------------------------------------------------------------------------------------------------------+
|sentiment  |filtokens|rawFeatures                                                                                                                      |
+-----------+---------+---------------------------------------------------------------------------------------------------------------------------------+
|frustration|9        |(50000,[7086,7522,9561,21217,23207,39586,45192,47779,48740],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])                               |
|anger      |7        |(50000,[2199,11359,21655,35146,47985,48354,48602],[1.0,1.0,1.0,1.0,1.0,1.0,1.0])                                                 |
|anger      |4        |(50000,[36480,38834,39290,43328],[1.0,1.0,1.0,1.0])                                                                              |
|excitement |4        |(50000,[4442,15928,31041,48062],[1.0,1.0,1.0,1.0])   

In [20]:
# implementing IDF
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(tf)
tfidf = idfModel.transform(tf)

tfidf.select('sentiment', 'features').show(10, truncate = False)

+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|sentiment  |features                                                                                                                                                                                                                                                                                                  |
+-----------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|frustration|(50000,[7086,7522,9561,21217,23207,39586,45192,4

In [21]:
tfidf.show(5)

+-----------+--------------------+--------------------+--------------------+---------+--------------------+--------------------+
|  sentiment|             content|               words|            filtered|filtokens|         rawFeatures|            features|
+-----------+--------------------+--------------------+--------------------+---------+--------------------+--------------------+
|frustration|@tiffanylue i kno...|[tiffanylue, i, k...|[tiffanylue, know...|        9|(50000,[7086,7522...|(50000,[7086,7522...|
|      anger|Layin n bed with ...|[layin, n, bed, w...|[layin, n, bed, h...|        7|(50000,[2199,1135...|(50000,[2199,1135...|
|      anger|Funeral ceremony....|[funeral, ceremon...|[funeral, ceremon...|        4|(50000,[36480,388...|(50000,[36480,388...|
| excitement|wants to hang out...|[wants, to, hang,...|[wants, hang, fri...|        4|(50000,[4442,1592...|(50000,[4442,1592...|
|    neutral|@dannycastillo We...|[dannycastillo, w...|[dannycastillo, w...|        7|(50000,[177

In [22]:
dataDF = tfidf.select('sentiment', 'content','features')

In [23]:
dataDF.show()

+-----------+--------------------+--------------------+
|  sentiment|             content|            features|
+-----------+--------------------+--------------------+
|frustration|@tiffanylue i kno...|(50000,[7086,7522...|
|      anger|Layin n bed with ...|(50000,[2199,1135...|
|      anger|Funeral ceremony....|(50000,[36480,388...|
| excitement|wants to hang out...|(50000,[4442,1592...|
|    neutral|@dannycastillo We...|(50000,[17712,220...|
|frustration|Re-pinging @ghost...|(50000,[2402,2425...|
|      anger|I should be sleep...|(50000,[3688,3998...|
|frustration|Hmmm. http://www....|(50000,[6757,1379...|
|      anger|@charviray Charle...|(50000,[6240,2107...|
|      anger|@kelcouch I'm sor...|(50000,[7757,1052...|
|    neutral|    cant fall asleep|(50000,[6026,1372...|
|frustration|Choked on her ret...|(50000,[3190,2180...|
|      anger|Ugh! I have to be...|(50000,[6814,7350...|
|      anger|@BrodyJenner if u...|(50000,[2607,1027...|
| excitement|        Got the news|(50000,[16007,

In [25]:
dataDF.printSchema()

root
 |-- sentiment: string (nullable = true)
 |-- content: string (nullable = true)
 |-- features: vector (nullable = true)



***

### Extracting, transforming and selecting features: StringIndexer

We now convert the sentiments, which are strings, into numeric class indices that serve as the labels (responses) for the classification model.

This is achieved using the `StringIndexer` class in Spark, which assigns indices in order of descending label frequency, so the most frequent label gets index 0. [String Indexer](https://spark.apache.org/docs/2.1.0/ml-features.html#stringindexer)
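As a rough illustration of frequency-based label indexing, here is a plain-Python sketch. `fit_string_indexer` is a hypothetical helper, and breaking ties alphabetically is an assumption made only to keep the example deterministic.

```python
from collections import Counter

def fit_string_indexer(labels):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(labels)
    # descending frequency; ties broken alphabetically (an assumption of this sketch)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# toy label column (illustrative, not the real class distribution)
sentiments = ["anger", "neutral", "anger", "happiness", "happiness", "happiness"]
mapping = fit_string_indexer(sentiments)
labels = [mapping[s] for s in sentiments]
print(mapping)
print(labels)
```

Because the mapping depends on label frequencies, it should be fitted once and reused, so that the same string always maps to the same index at training and prediction time.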

In [24]:
from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol="sentiment", outputCol="label")
indexed = indexer.fit(dataDF).transform(dataDF)
indexed.show()

+-----------+--------------------+--------------------+-----+
|  sentiment|             content|            features|label|
+-----------+--------------------+--------------------+-----+
|frustration|@tiffanylue i kno...|(50000,[7086,7522...|  1.0|
|      anger|Layin n bed with ...|(50000,[2199,1135...|  3.0|
|      anger|Funeral ceremony....|(50000,[36480,388...|  3.0|
| excitement|wants to hang out...|(50000,[4442,1592...|  4.0|
|    neutral|@dannycastillo We...|(50000,[17712,220...|  2.0|
|frustration|Re-pinging @ghost...|(50000,[2402,2425...|  1.0|
|      anger|I should be sleep...|(50000,[3688,3998...|  3.0|
|frustration|Hmmm. http://www....|(50000,[6757,1379...|  1.0|
|      anger|@charviray Charle...|(50000,[6240,2107...|  3.0|
|      anger|@kelcouch I'm sor...|(50000,[7757,1052...|  3.0|
|    neutral|    cant fall asleep|(50000,[6026,1372...|  2.0|
|frustration|Choked on her ret...|(50000,[3190,2180...|  1.0|
|      anger|Ugh! I have to be...|(50000,[6814,7350...|  3.0|
|      a

In [26]:
indexed.printSchema()

root
 |-- sentiment: string (nullable = true)
 |-- content: string (nullable = true)
 |-- features: vector (nullable = true)
 |-- label: double (nullable = true)



Here is the label encoding produced by `StringIndexer`:

* Happiness: 0.0 
* Frustration: 1.0 
* Neutral: 2.0 
* Anger: 3.0 
* Excitement: 4.0


***

### Split data 80/20 into training and test data sets

Having completed the Feature Engineering phase, we move on to splitting our dataset into train and test sets.
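A random split assigns each row to one side by sampling, so the resulting sizes are only approximately 80/20, and the split changes between runs unless a seed is fixed. The plain-Python sketch below illustrates both points; `random_split` and the seed value 42 are hypothetical choices for this example, not part of the notebook.

```python
import random

def random_split(rows, weights=(0.8, 0.2), seed=42):
    """Assign each row to train or test by an independent random draw."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train_fraction = weights[0] / sum(weights)  # normalize the weights
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test

rows = list(range(1000))
train_rows, test_rows = random_split(rows)
print(len(train_rows), len(test_rows))
```

Fixing the seed is what makes a reported test accuracy repeatable; without it, each run evaluates on a different 20% of the data.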

In [36]:
trainDF, testDF = indexed.randomSplit([0.8, 0.2])

***

### Fitting the Multinomial Naive Bayes classifier to the training dataframe

Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes’ theorem with strong (naive) independence assumptions between the features. The spark.ml implementation currently supports both multinomial naive Bayes and Bernoulli naive Bayes.

[Paper on Multinomial NB](http://www.cs.cmu.edu/~knigam/papers/multinomial-aaaiws98.pdf)

[Naive Bayes in Spark](https://spark.apache.org/docs/latest/ml-classification-regression.html#naive-bayes)
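To show what the multinomial variant computes, here is a small plain-Python sketch with additive (Laplace) smoothing. It works on raw token counts from a made-up toy corpus, whereas the notebook feeds TF-IDF weights to Spark's implementation; all function names here are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(documents, labels, smoothing=1.0):
    """Estimate log priors and smoothed log likelihoods from token counts."""
    vocabulary = {t for doc in documents for t in doc}
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(documents, labels):
        token_counts[label].update(doc)
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        # denominator: tokens seen in class c plus smoothing mass for every vocabulary entry
        total = sum(token_counts[c].values()) + smoothing * len(vocabulary)
        log_likelihood[c] = {
            t: math.log((token_counts[c][t] + smoothing) / total) for t in vocabulary
        }
    return log_prior, log_likelihood

def predict(model, doc):
    """Return the class with the highest log posterior (up to a constant)."""
    log_prior, log_likelihood = model
    scores = {}
    for c in log_prior:
        # tokens never seen in training are ignored
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][t] for t in doc if t in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# toy training data (illustrative only)
train_docs = [
    ["glad", "happy", "smile"],
    ["happy", "fun", "friends"],
    ["ugh", "headache", "tired"],
    ["tired", "angry", "ugh"],
]
train_labels = ["happiness", "happiness", "anger", "anger"]
nb_model = train_multinomial_nb(train_docs, train_labels)
prediction = predict(nb_model, ["happy", "smile", "today"])
print(prediction)
```

The "naive" assumption appears in `predict`: the per-token log likelihoods are simply summed, as if tokens occurred independently given the class.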

In [37]:
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# fit the naive bayes model to the training set
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")
model = nb.fit(trainDF)

In [38]:
# predicting probabilities and labels on the test set
# select example rows to display.
predictions = model.transform(testDF)
predictions.select('probability', 'prediction').show(10, truncate=False)

+--------------------------------------------------------------------------------------------------------------+----------+
|probability                                                                                                   |prediction|
+--------------------------------------------------------------------------------------------------------------+----------+
|[7.751364105151501E-10,7.165477198045954E-5,5.154244350756452E-8,6.753444062239701E-11,0.9999282928429053]    |4.0       |
|[1.6668974717695789E-12,0.0018377450004126738,0.0013443703748229178,0.9968178846230855,1.1992943613046301E-14]|3.0       |
|[3.650374493438045E-18,0.9999888913420104,1.2347678596587943E-12,1.1108656754684378E-5,3.6286687293873124E-24]|1.0       |
|[4.312817636314641E-28,0.9999999999918163,4.43928990613844E-12,6.457174429445445E-22,3.744276976374241E-12]   |1.0       |
|[9.537300505493539E-14,0.009320853345868328,4.391185329033474E-13,0.990679146639608,1.3989400777009734E-11]   |3.0       |
|[1.0132

In [39]:
# compute accuracy on the test set
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test set accuracy = " + str(accuracy))

Test set accuracy = 0.313319186119


The multinomial NB model gives a test accuracy of about 31%, which is low, though above the 20% expected from uniform random guessing over five classes.
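Accuracy is the fraction of test rows whose predicted label matches the true label. A useful reference point is the accuracy of always predicting the most frequent training label. The plain-Python sketch below shows both computations on toy labels that are not taken from the tweet dataset; the helper names are hypothetical.

```python
from collections import Counter

def accuracy(labels, predictions):
    """Fraction of positions where the prediction equals the true label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most frequent training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return accuracy(test_labels, [majority] * len(test_labels))

# toy labels (illustrative only)
y_true = [0.0, 1.0, 3.0, 3.0, 2.0]
y_pred = [0.0, 3.0, 3.0, 1.0, 2.0]
model_acc = accuracy(y_true, y_pred)
baseline_acc = majority_baseline_accuracy([3.0, 3.0, 1.0, 0.0], y_true)
print(model_acc, baseline_acc)
```

Comparing a model against such a baseline on the real split would show how much of the 31% comes from the features rather than from class imbalance.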

I have built a very crude model to lay the foundation for the analyzer. The next step is to refine the model to improve its accuracy.

An ML pipeline can now be created that fits the model on labeled data and applies it to unlabeled text. In other words, we feed in text/documents and the analyzer outputs the predicted tone of each one.

***

### Creating a Pipeline

In [34]:
from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[regexTokenizer, remover, hashingTF, idf, indexer, nb])

# Fit the pipeline to all labeled documents (dfNew), not only the training split.
model = pipeline.fit(dfNew)


# Prepare test documents: unlabeled strings, loaded into a single "value" column.
test = spark.createDataFrame([
    "i am glad you came",
    "sooo look forward to this match today!!",
    "i gave him an enthusiastic smile.",
    "you were never to step out of that door",
    "it was a horrific rain"
], StringType())

# rename the column to match the pipeline's input column
testdf = test.selectExpr("value as content")

# Make predictions on test documents and print columns of interest.
predicted = model.transform(testdf)
selected = predicted.select("content", "probability", "prediction")
for row in selected.collect():
    content, prob, label = row
    print("'{0}', '{1}', '{2}'".format(content, str(prob), label))

'i am glad you came', '[0.999293805414,0.000290829745061,4.2468694878e-05,0.000205155000072,0.000167741145984]', '0.0'
'sooo look forward to this match today!!', '[0.326912215958,0.657931619476,0.000213437832279,5.63099231622e-05,0.0148864168107]', '1.0'
'i gave him an enthusiastic smile.', '[0.0873458054923,1.24295920149e-14,5.57930922629e-06,0.91264861519,8.86301451627e-12]', '3.0'
'you were never to step out of that door', '[1.03265854052e-05,0.157013476744,0.497622282685,0.00351662579954,0.341837288186]', '2.0'
'it was a horrific rain', '[5.00123142958e-14,0.979025836093,2.33895748233e-12,0.0209741638696,3.48286487033e-11]', '1.0'


The classifier is weak, with a test accuracy of about 31%. It predicts a plausible tone for the first and last sentences (happiness and frustration) but misses on the rest; for example, "i gave him an enthusiastic smile." is labeled as anger. The model needs further refinement, and other algorithms may improve accuracy.

***

### References

1. [How to use spark Naive Bayes classifier for text classification with IDF?](https://stackoverflow.com/questions/32231049/how-to-use-spark-naive-bayes-classifier-for-text-classification-with-idf)

2. [Sentiment Analysis PySpark](https://github.com/nisarg64/Sentiment-Analysis-Pyspark/blob/master/text_analysis.py)

3. [mastering-apache-spark-book](https://github.com/jaceklaskowski/mastering-apache-spark-book/blob/master/spark-mllib/spark-mllib-pipelines-example-classification.adoc)

4. [Apache Spark Documentation](https://spark.apache.org/docs/2.1.0/ml-features.html#tokenizer)