# Ex3: NLP - Tags 

### Requirement: Build a tag filter. Use NLP tools and a classifier to predict the tag for a question. In the future, questions could be auto-tagged by such a classifier, or tags could be recommended to users before posting.
- Dataset: stack-overflow-data.csv. It contains Stack Overflow questions and associated tags.
- Reference link: http://benalexkeen.com/multiclass-text-classification-with-pyspark/

In [1]:
import findspark
findspark.init()

In [2]:
import pyspark

In [3]:
from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark import SparkConf

In [4]:
SparkContext.setSystemProperty('spark.executor.memory', '12g')
sc = SparkContext(master='spark://172.25.51.55:7077', appName='Stack_Overflow')

In [5]:
sc

In [6]:
spark = SparkSession(sc)

In [7]:
file_name = "hdfs://172.25.51.16:19000/stack_overflow_data.csv"
# file_name = "stack-overflow-data.csv"

In [8]:
data = spark.read.csv(file_name, inferSchema=True,header=True)

In [9]:
data.show(5)

+--------------------+-----------+
|                post|       tags|
+--------------------+-----------+
|what is causing t...|         c#|
|have dynamic html...|    asp.net|
|how to convert a ...|objective-c|
|.net framework 4 ...|       .net|
|trying to calcula...|     python|
+--------------------+-----------+
only showing top 5 rows



In [10]:
data.groupby('tags').count().show(30)

+-------------+-----+
|         tags|count|
+-------------+-----+
|       iphone| 2000|
|      android| 2000|
|           c#| 2000|
|         null|20798|
|      asp.net| 2000|
|         html| 2000|
|        mysql| 2000|
|       jquery| 2000|
|   javascript| 2000|
|          css| 2000|
|          sql| 2000|
|          c++| 2000|
|            c| 2000|
|  objective-c| 2000|
|         java| 2000|
|          php| 2000|
|         .net| 2000|
|          ios| 2000|
|       python| 2000|
|    angularjs| 2000|
|ruby-on-rails| 2000|
+-------------+-----+



In [11]:
tags_null_data = data.filter(data.tags.isNull())

In [12]:
tags_null_data.count()

20798

In [13]:
data = data.filter(data.tags.isNotNull())

In [14]:
data.count()

40000

In [15]:
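# Explicit imports avoid shadowing Python built-ins (sum, max, ...) as a wildcard import would
from pyspark.sql.functions import length, udf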

## Clean and Prepare the Data

**Create a new length feature:**

In [16]:
from pyspark.sql.functions import length

In [17]:
data = data.withColumn('length',length(data['post']))

In [18]:
data.show()

+--------------------+-------------+------+
|                post|         tags|length|
+--------------------+-------------+------+
|what is causing t...|           c#|   833|
|have dynamic html...|      asp.net|   804|
|how to convert a ...|  objective-c|   755|
|.net framework 4 ...|         .net|   349|
|trying to calcula...|       python|  1290|
|how to give alias...|      asp.net|   309|
|window.open() ret...|    angularjs|   495|
|identifying serve...|       iphone|   424|
|unknown method ke...|ruby-on-rails|  2022|
|from the include ...|    angularjs|  1279|
|when we need inte...|           c#|   995|
|how to install .i...|          ios|   344|
|dynamic textbox t...|      asp.net|   389|
|rather than bubbl...|            c|  1338|
|site deployed in ...|      asp.net|   349|
|connection in .ne...|         .net|   228|
|how to subtract 1...|  objective-c|    62|
|ror console show ...|ruby-on-rails|  2594|
|distance between ...|       iphone|   336|
|sql query - how t...|          

In [19]:
# Average post length differs noticeably between tags
data.groupby('tags').mean().show()

+-------------+-----------+
|         tags|avg(length)|
+-------------+-----------+
|       iphone|    709.621|
|      android|  1713.4345|
|           c#|  1145.3065|
|      asp.net|     999.95|
|         html|   891.3105|
|        mysql|   1038.561|
|       jquery|   1081.507|
|   javascript|    964.396|
|          css|    954.809|
|          sql|    870.912|
|          c++|   1295.955|
|            c|  1121.1115|
|  objective-c|   972.8925|
|         java|   1357.308|
|          php|  1123.4205|
|         .net|   731.0075|
|          ios|   970.7565|
|       python|  1018.6695|
|    angularjs|  1294.7545|
|ruby-on-rails|  1244.2055|
+-------------+-----------+



## Feature Transformations

In [20]:
from bs4 import BeautifulSoup

from pyspark import keyword_only
from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

In [21]:
class BsTextExtractor(Transformer, HasInputCol, HasOutputCol):

    @keyword_only
    def __init__(self, inputCol=None, outputCol=None):
        super(BsTextExtractor, self).__init__()
        kwargs = self._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, inputCol=None, outputCol=None):
        kwargs = self._input_kwargs
        return self._set(**kwargs)

    def _transform(self, dataset):

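        def f(s):
            # Name the parser explicitly to avoid bs4's "no parser specified" warning;
            # null posts are passed through unchanged.
            if s is None:
                return None
            return BeautifulSoup(s, "html.parser").text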

        t = StringType()
        out_col = self.getOutputCol()
        in_col = dataset[self.getInputCol()]
        return dataset.withColumn(out_col, udf(f, t)(in_col))

In [22]:
from pyspark.ml.feature import Tokenizer,StopWordsRemover, CountVectorizer,IDF,StringIndexer
text_extractor = BsTextExtractor(inputCol="post", outputCol="cleaned_post")
tokenizer = Tokenizer(inputCol="cleaned_post", outputCol="token_text")
stopremove = StopWordsRemover(inputCol='token_text',outputCol='stop_tokens')
count_vec = CountVectorizer(inputCol='stop_tokens',outputCol='c_vec')
idf = IDF(inputCol="c_vec", outputCol="tf_idf")
class_to_num = StringIndexer(inputCol='tags',outputCol='label')
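
The `CountVectorizer` and `IDF` stages above turn token counts into TF-IDF weights. As a rough illustration of what that weighting does, the pure-Python sketch below applies the smoothed formula documented for Spark MLlib, idf = log((m + 1) / (df + 1)), to a tiny corpus. The corpus and the helper names (`spark_style_idf`, `tf_idf`) are hypothetical and are not part of the notebook.

```python
import math

def spark_style_idf(doc_freq, num_docs):
    """Smoothed IDF: log((num_docs + 1) / (doc_freq + 1))."""
    return math.log((num_docs + 1) / (doc_freq + 1))

# Hypothetical toy corpus, already tokenized and stop-word filtered
docs = [
    ["python", "list", "sort"],
    ["python", "dict", "loop"],
    ["css", "margin", "python"],
]

# Document frequency: number of documents that contain each term
doc_freq = {}
for tokens in docs:
    for term in set(tokens):
        doc_freq[term] = doc_freq.get(term, 0) + 1

idf_weights = {t: spark_style_idf(df, len(docs)) for t, df in doc_freq.items()}

def tf_idf(tokens):
    """Raw term counts multiplied by the IDF weight of each term."""
    counts = {}
    for term in tokens:
        counts[term] = counts.get(term, 0) + 1
    return {t: c * idf_weights[t] for t, c in counts.items()}

weights = tf_idf(docs[2])
for term, weight in sorted(weights.items()):
    print(term, round(weight, 4))
```

A term that occurs in every document (here `python`) gets weight zero, so it cannot help separate one tag from another, while rarer terms such as `css` keep a positive weight.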

In [23]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import Vector

In [24]:
clean_up = VectorAssembler(inputCols=['tf_idf','length'],outputCol='features')

### The Model

We'll use Naive Bayes, but feel free to play around with this choice!

In [25]:
from pyspark.ml.classification import NaiveBayes

In [26]:
# Use defaults
nb = NaiveBayes()

### Pipeline

In [27]:
from pyspark.ml import Pipeline

In [28]:
data_prep_pipe = Pipeline(stages=[class_to_num,text_extractor,tokenizer,stopremove,count_vec,idf,clean_up])

In [29]:
cleaner = data_prep_pipe.fit(data)

In [30]:
clean_data = cleaner.transform(data)

### Training and Evaluation!

In [31]:
clean_data = clean_data.select(['label','features'])

In [32]:
clean_data.show() 

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  5.0|(262145,[0,1,2,3,...|
|  3.0|(262145,[0,12,31,...|
| 15.0|(262145,[0,1,2,3,...|
|  0.0|(262145,[0,18,21,...|
| 17.0|(262145,[0,1,4,8,...|
|  3.0|(262145,[0,12,21,...|
|  2.0|(262145,[0,1,3,6,...|
| 10.0|(262145,[0,44,61,...|
| 18.0|(262145,[0,1,14,2...|
|  2.0|(262145,[0,1,3,4,...|
|  5.0|(262145,[0,2,3,6,...|
|  9.0|(262145,[0,18,27,...|
|  3.0|(262145,[0,7,12,1...|
|  4.0|(262145,[0,1,2,3,...|
|  3.0|(262145,[0,11,27,...|
|  0.0|(262145,[0,187,23...|
| 15.0|(262145,[0,10,15,...|
| 18.0|(262145,[0,1,3,12...|
| 10.0|(262145,[0,30,39,...|
| 19.0|(262145,[0,12,15,...|
+-----+--------------------+
only showing top 20 rows



In [33]:
(training,testing) = clean_data.randomSplit([0.7,0.3], seed=142)

In [34]:
#training.cache()

In [35]:
#testing.cache()

In [36]:
#training.groupBy("label").count().show()

In [37]:
#testing.groupBy("label").count().show()

In [38]:
predictor = nb.fit(training)

In [39]:
test_results = predictor.transform(testing)

In [40]:
test_results.show()

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|(262145,[0,1,2,3,...|[-9525.9204663975...|[2.06145944879209...|       5.0|
|  0.0|(262145,[0,1,2,3,...|[-11222.607664018...|[1.09156029556411...|       3.0|
|  0.0|(262145,[0,1,2,3,...|[-5699.5010733911...|[1.27980939167733...|       5.0|
|  0.0|(262145,[0,1,2,3,...|[-8624.8868314121...|[6.68896536535949...|       5.0|
|  0.0|(262145,[0,1,2,3,...|[-3034.0894816783...|[0.99999999417313...|       0.0|
|  0.0|(262145,[0,1,2,3,...|[-3393.1534857104...|[1.65827521693478...|       3.0|
|  0.0|(262145,[0,1,7,8,...|[-6028.1757418936...|[1.0,0.0,0.0,2.19...|       0.0|
|  0.0|(262145,[0,1,8,9,...|[-2239.4267154832...|[7.37356324710894...|       3.0|
|  0.0|(262145,[0,1,8,11...|[-2779.4219594575...|[3.54937613564798...|       3.0|
|  0.0|(262145,[

In [41]:
# Create a confusion matrix
test_results.groupBy('label', 'prediction').count().show()

+-----+----------+-----+
|label|prediction|count|
+-----+----------+-----+
|  8.0|       3.0|   25|
| 12.0|      16.0|   15|
| 16.0|       8.0|   24|
| 19.0|       5.0|    8|
| 10.0|       1.0|    5|
|  2.0|       0.0|    3|
| 15.0|      16.0|    2|
| 12.0|       5.0|   14|
|  0.0|      12.0|    7|
|  1.0|      19.0|    3|
|  0.0|       8.0|    5|
|  1.0|      12.0|    1|
| 19.0|      12.0|    2|
|  7.0|       3.0|    3|
| 15.0|      11.0|    6|
| 11.0|      17.0|    7|
| 17.0|      19.0|    8|
|  6.0|       1.0|    1|
|  8.0|       6.0|    1|
| 16.0|      10.0|    1|
+-----+----------+-----+
only showing top 20 rows
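
The numeric labels in the table above are hard to read without the tag names. The sketch below rebuilds a label-to-tag mapping in plain Python, assuming `StringIndexer` used its default `frequencyDesc` ordering with alphabetical tie-breaking (all 20 tags have 2000 rows, so every position is decided by the tie-break). That assumption is consistent with the labels shown in the `clean_data` output, but on a real run the mapping is better read from the `labels` attribute of the fitted `StringIndexerModel`. The names `index_to_tag` and `decode` are hypothetical helpers.

```python
# Tags as listed in the groupby('tags') output; each has 2000 rows once nulls are dropped
tags = ["iphone", "android", "c#", "asp.net", "html", "mysql", "jquery",
        "javascript", "css", "sql", "c++", "c", "objective-c", "java", "php",
        ".net", "ios", "python", "angularjs", "ruby-on-rails"]
counts = {tag: 2000 for tag in tags}

# Assumed ordering: most frequent first, ties broken alphabetically
ordered = sorted(counts, key=lambda tag: (-counts[tag], tag))
index_to_tag = dict(enumerate(ordered))

def decode(label):
    """Map a numeric label such as 5.0 back to its tag name."""
    return index_to_tag[int(label)]

# A few (label, prediction, count) rows copied from the Naive Bayes table
rows = [(8.0, 3.0, 25), (12.0, 16.0, 15), (16.0, 8.0, 24)]
for label, prediction, count in rows:
    print("{} predicted as {}: {}".format(decode(label), decode(prediction), count))
```

Under this assumption, the first row reads as 25 `html` questions predicted as `asp.net`, which is a plausible confusion between closely related web tags.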



In [42]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [43]:
# MulticlassClassificationEvaluator defaults to metricName="f1" (weighted F1), not accuracy
f1_eval = MulticlassClassificationEvaluator()
f1 = f1_eval.evaluate(test_results)
print("Weighted F1 score of model: {}".format(f1))

Weighted F1 score of model: 0.7175324986751529


In [44]:
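# Save to the local machine
# predictor.save("NB_TagFilters_model")

In [45]:
# Save to HDFS: persist the fitted model (predictor); nb is only the unfitted estimator
predictor.save("hdfs://172.25.51.16:19000/NB_TagFilters_model")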

- The result is not very good (a score of about 0.72; note that `MulticlassClassificationEvaluator` reports weighted F1 by default, not accuracy).
- Possible improvements: try other classification models, or engineer additional features.
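
Because `MulticlassClassificationEvaluator()` with default settings reports a weighted F1 score, it can differ from plain accuracy. The following pure-Python sketch shows the two metrics side by side on a hypothetical list of `(label, prediction)` pairs. The function names and the toy data are illustrative assumptions, not the evaluator's implementation.

```python
from collections import Counter

def accuracy(pairs):
    """Fraction of pairs where the prediction equals the label."""
    return sum(1 for label, pred in pairs if label == pred) / len(pairs)

def weighted_f1(pairs):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(label for label, _ in pairs)
    predicted = Counter(pred for _, pred in pairs)
    correct = Counter(label for label, pred in pairs if label == pred)
    total = 0.0
    for cls, n in support.items():
        tp = correct[cls]
        precision = tp / predicted[cls] if predicted[cls] else 0.0
        recall = tp / n
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += n * f1
    return total / len(pairs)

# Hypothetical toy predictions: class 0 is always right, class 1 is right half the time
pairs = [(0, 0), (0, 0), (0, 0), (0, 0), (1, 0), (1, 1)]
acc_value = accuracy(pairs)
f1_value = weighted_f1(pairs)
print("accuracy:", round(acc_value, 4))
print("weighted F1:", round(f1_value, 4))
```

In PySpark, passing `metricName="accuracy"` to the evaluator would report accuracy instead, so it is worth stating which metric a printed score refers to.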

### Use LogisticRegression/Random Forest

### Logistic Regression

In [46]:
from pyspark.ml.classification import RandomForestClassifier, LogisticRegression

In [47]:
lg = LogisticRegression(maxIter=100, regParam=0.3, elasticNetParam=0)

In [48]:
predictor_1 = lg.fit(training)

In [49]:
test_results_1 = predictor_1.transform(testing)

In [50]:
# Create a confusion matrix
test_results_1.groupBy('label', 'prediction').count().show()

+-----+----------+-----+
|label|prediction|count|
+-----+----------+-----+
|  8.0|       3.0|   13|
| 12.0|      16.0|   13|
| 16.0|       8.0|   80|
| 19.0|       5.0|    6|
| 10.0|       1.0|    6|
|  2.0|       0.0|    2|
| 15.0|      16.0|    1|
| 12.0|       5.0|    9|
|  0.0|      12.0|    4|
|  0.0|       8.0|   26|
|  1.0|      19.0|    3|
|  1.0|      12.0|    3|
|  7.0|       3.0|    2|
| 15.0|      11.0|    4|
| 19.0|      12.0|    2|
| 11.0|      17.0|    8|
| 17.0|      19.0|   15|
| 17.0|       7.0|    2|
| 16.0|      10.0|   20|
|  6.0|       1.0|    3|
+-----+----------+-----+
only showing top 20 rows



In [51]:
# The default metric is weighted F1, not accuracy
f1_eval = MulticlassClassificationEvaluator()
f1_1 = f1_eval.evaluate(test_results_1)
print("Weighted F1 score of model: {}".format(f1_1))

Weighted F1 score of model: 0.6988642446771226


In [52]:
# Not an improvement: the score (~0.70) is lower than Naive Bayes (~0.72)

In [53]:
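# Save to the local machine
# predictor_1.save("LG_TagFilters_model")

In [54]:
# Save to HDFS: persist the fitted model (predictor_1), not the unfitted estimator lg
predictor_1.save("hdfs://172.25.51.16:19000/LG_TagFilters_model")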

### Random forest

In [55]:
rf = RandomForestClassifier(labelCol="label",
                            featuresCol="features",
                            numTrees=500,
                            maxDepth=5,
                            maxBins=64)

In [56]:
predictor_2 = rf.fit(training)

In [57]:
test_results_2 = predictor_2.transform(testing)

In [58]:
# Create a confusion matrix
test_results_2.groupBy('label', 'prediction').count().show()

+-----+----------+-----+
|label|prediction|count|
+-----+----------+-----+
|  8.0|       3.0|   11|
| 12.0|      16.0|    7|
| 16.0|       8.0|   22|
| 19.0|       5.0|    1|
| 10.0|       1.0|    7|
| 15.0|      16.0|    1|
| 12.0|       5.0|    6|
|  1.0|      19.0|    8|
|  0.0|       8.0|    6|
|  0.0|      12.0|    3|
| 15.0|      11.0|    3|
| 17.0|       7.0|    1|
| 11.0|      17.0|   10|
| 17.0|      19.0|    9|
| 16.0|      10.0|   26|
|  8.0|       6.0|    1|
|  4.0|       6.0|   19|
|  3.0|       5.0|   39|
|  9.0|       5.0|    3|
| 15.0|      19.0|    9|
+-----+----------+-----+
only showing top 20 rows



In [59]:
test_results_2.groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|       8.0|  584|
|       0.0|  414|
|       7.0|  659|
|      18.0|  601|
|       1.0|  532|
|       4.0|  614|
|      11.0|  571|
|      14.0|  545|
|      19.0|  819|
|       3.0|  452|
|       2.0|  583|
|      17.0|  640|
|      10.0| 1442|
|      13.0|  624|
|       6.0|  510|
|       5.0|  481|
|      15.0|  587|
|       9.0|  341|
|      16.0|  524|
|      12.0|  398|
+----------+-----+



In [60]:
# The default metric is weighted F1, not accuracy
f1_eval = MulticlassClassificationEvaluator()
f1_2 = f1_eval.evaluate(test_results_2)
print("Weighted F1 score of model: {}".format(f1_2))

Weighted F1 score of model: 0.7502639717792994


In [61]:
# Higher score (~0.75), but not clearly a better model: predictions are skewed toward label 10.0 (1442 predictions)
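
To make the skew in the random forest predictions concrete, the sketch below takes the per-class prediction counts printed by `test_results_2.groupBy('prediction').count()` and compares each with the count a perfectly even spread would give. It assumes the test split is roughly balanced, which is plausible since every tag has 2000 rows before the random split. The variable names are illustrative.

```python
# Prediction counts copied from the groupBy('prediction') output above
prediction_counts = {
    8.0: 584, 0.0: 414, 7.0: 659, 18.0: 601, 1.0: 532,
    4.0: 614, 11.0: 571, 14.0: 545, 19.0: 819, 3.0: 452,
    2.0: 583, 17.0: 640, 10.0: 1442, 13.0: 624, 6.0: 510,
    5.0: 481, 15.0: 587, 9.0: 341, 16.0: 524, 12.0: 398,
}

total = sum(prediction_counts.values())
expected = total / len(prediction_counts)  # count per class if predictions were spread evenly

# Ratio above 1 means the class is predicted more often than an even spread would give
ratios = {label: count / expected for label, count in prediction_counts.items()}
ranked = sorted(ratios.items(), key=lambda item: item[1], reverse=True)

most_label, most_ratio = ranked[0]
least_label, least_ratio = ranked[-1]
print("total predictions:", total)
print("most over-predicted label:", most_label, round(most_ratio, 2))
print("most under-predicted label:", least_label, round(least_ratio, 2))
```

Label 10.0 is predicted about 2.4 times as often as an even spread would suggest, so a per-class precision and recall breakdown would be a more informative check than the single aggregate score.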

In [62]:
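# Save to the local machine
# predictor_2.save("RF_TagFilters_model")

In [63]:
# Save to HDFS: persist the fitted model (predictor_2), not the unfitted estimator rf
predictor_2.save("hdfs://172.25.51.16:19000/RF_TagFilters_model")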

In [64]:
sc.stop()