# Spark MLlib Exercises


http://spark.apache.org/docs/latest/ml-statistics.html

In [1]:
!pip install pyspark



In [2]:
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
spark = SparkSession.builder.getOrCreate()

In [3]:
spark

## 1. Statistics (1p.)

Download the following dataset: https://www.kaggle.com/c/titanic/data?select=train.csv

In [None]:
file = "titanic_train.csv"
titanic_df = spark.read.format("csv").options(inferSchema="true", header="true").load(file)
titanic_df = titanic_df.dropna(how='any')
titanic_df.show(10)
print(titanic_df.dtypes)

+-----------+--------+------+--------------------+------+----+-----+-----+--------+-------+-----------+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|  Ticket|   Fare|      Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+--------+-------+-----------+--------+
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|PC 17599|71.2833|        C85|       C|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|  113803|   53.1|       C123|       S|
|          7|       0|     1|McCarthy, Mr. Tim...|  male|54.0|    0|    0|   17463|51.8625|        E46|       S|
|         11|       1|     3|Sandstrom, Miss. ...|female| 4.0|    1|    1| PP 9549|   16.7|         G6|       S|
|         12|       1|     1|Bonnell, Miss. El...|female|58.0|    0|    0|  113783|  26.55|       C103|       S|
|         22|       1|     2|Beesley, Mr. Lawr...|  male|34.0|    0|    0|  248698|   13.0|     

### Exercise 1.A.
**TODO:** Calculate descriptive statistics for 'Age' and 'Fare' (see https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/sql/DataFrame.html#describe(scala.collection.Seq))

In [None]:
titanic_df.describe("Age", "Fare").show()

+-------+------------------+-----------------+
|summary|               Age|             Fare|
+-------+------------------+-----------------+
|  count|               183|              183|
|   mean|  35.6744262295082|78.68246885245901|
| stddev|15.643865966849717|76.34784270040569|
|    min|              0.92|              0.0|
|    max|              80.0|         512.3292|
+-------+------------------+-----------------+



### Exercise 1.B.

**TODO:** Check whether 'Age' and 'Fare' are normally distributed (see http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/stat/KolmogorovSmirnovTest.html)

In [None]:
from pyspark.ml.stat import KolmogorovSmirnovTest

KolmogorovSmirnovTest.test(titanic_df, "Age", "norm", 0, 1).show()

+--------------------+------------------+
|              pValue|         statistic|
+--------------------+------------------+
|1.943689653671754...|0.9713276975967852|
+--------------------+------------------+





In [None]:
KolmogorovSmirnovTest.test(titanic_df, "Fare", "norm", 0, 1).show()

+--------------------+------------------+
|              pValue|         statistic|
+--------------------+------------------+
|8.816725127758218...|0.9890707515997943|
+--------------------+------------------+





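We can reject the null hypothesis for both variables. Note that the last two arguments of `test` are the mean and standard deviation of the reference distribution, so the null hypothesis here is the standard normal N(0, 1). Strictly speaking, the result shows that neither 'Age' nor 'Fare' follows N(0, 1); to check normality in general, the sample mean and standard deviation should be passed instead of 0 and 1.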
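
To see how much the reference parameters matter, here is a minimal sketch in plain Python of the one-sample Kolmogorov-Smirnov statistic against a normal distribution. It is not Spark's implementation, and the `ages` values are hypothetical, chosen only for illustration. Keep in mind that when the mean and standard deviation are estimated from the same sample, the usual KS p-value is only approximate.

```python
import math
import statistics

def normal_cdf(x, mean, std):
    """CDF of the normal distribution N(mean, std^2)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def ks_statistic(sample, mean, std):
    """Largest gap between the empirical CDF of the sample and the normal CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = normal_cdf(x, mean, std)
        # The empirical CDF jumps from i/n to (i+1)/n at x, so check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

# Hypothetical ages, for illustration only (not taken from the Titanic data).
ages = [4.0, 19.0, 23.0, 28.0, 34.0, 35.0, 38.0, 49.0, 54.0, 58.0]

d_standard = ks_statistic(ages, 0.0, 1.0)
d_fitted = ks_statistic(ages, statistics.mean(ages), statistics.stdev(ages))
print("D against N(0, 1):", d_standard)
print("D against N(sample mean, sample std):", d_fitted)
```

Against N(0, 1) the statistic is close to 1 simply because all values lie far to the right of 0, which is the same effect as in the Spark output above. Against a normal distribution with the sample's own mean and standard deviation the statistic is much smaller.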

### Exercise 1.C.

**TODO:** Calculate Pearson correlation between the following pairs of features:  
* 'Age' and 'Survived'
* 'Sex' and 'Survived' *(remember about encoding 'Sex' attributes as 0s and 1s)*

Which correlation is stronger?

In [None]:
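from pyspark.sql.functions import col, udf

# Encode 'Sex' as a number so it can be used in a correlation:
# "male" -> 1, anything else -> 0.
@udf("integer")
def encode_sex_to_int(sex):
    if sex == "male":
        return 1
    else:
        return 0

# Keep only the columns needed for the two correlations.
new_titanic_df = titanic_df.select("Age", "Survived", "Sex").withColumn("Sex", encode_sex_to_int(col("Sex")))
new_titanic_df.show()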

+----+--------+---+
| Age|Survived|Sex|
+----+--------+---+
|38.0|       1|  0|
|35.0|       1|  0|
|54.0|       0|  1|
| 4.0|       1|  0|
|58.0|       1|  0|
|34.0|       1|  1|
|28.0|       1|  1|
|19.0|       0|  1|
|49.0|       1|  0|
|65.0|       0|  1|
|45.0|       0|  1|
|29.0|       1|  0|
|25.0|       0|  1|
|23.0|       1|  0|
|46.0|       0|  1|
|71.0|       0|  1|
|23.0|       1|  1|
|21.0|       0|  1|
|47.0|       0|  1|
|24.0|       0|  1|
+----+--------+---+
only showing top 20 rows



In [None]:
new_titanic_df.corr("Age", "Survived", method="pearson")

-0.2540847542030532

In [None]:
new_titanic_df.corr("Sex", "Survived", method="pearson")

-0.5324179744538412

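As we can see, the correlation between 'Sex' and 'Survived' (about -0.53) is stronger in absolute value than the one between 'Age' and 'Survived' (about -0.25). Because "male" is encoded as 1, the negative sign means that men were less likely to survive in this sample.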
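
The sign of a correlation with a binary feature depends only on which category is encoded as 1. The following plain-Python sketch computes the Pearson coefficient by hand on a small hypothetical sample (not the Titanic data) and shows that swapping the encoding flips the sign but keeps the strength.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy sample, for illustration only.
sex = ["female", "female", "male", "female", "male", "male", "male", "female"]
survived = [1, 1, 0, 1, 1, 0, 0, 0]

male_as_1 = [1 if s == "male" else 0 for s in sex]
female_as_1 = [1 - v for v in male_as_1]

r_male = pearson(male_as_1, survived)
r_female = pearson(female_as_1, survived)
print("male = 1:  ", r_male)
print("female = 1:", r_female)
```

This is why the strength of the two correlations should be compared by absolute value.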

## 2. Loading data

Doc: http://spark.apache.org/docs/latest/ml-datasource.html 

Download data from https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt and load as DataFrame. 

In [None]:
file = "sample_libsvm_data.txt"

df = spark.read.format("libsvm").option("numFeatures", "780").load(file)
df.show(10)
df.take(1)

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(780,[127,128,129...|
|  1.0|(780,[158,159,160...|
|  1.0|(780,[124,125,126...|
|  1.0|(780,[152,153,154...|
|  1.0|(780,[151,152,153...|
|  0.0|(780,[129,130,131...|
|  1.0|(780,[158,159,160...|
|  1.0|(780,[99,100,101,...|
|  0.0|(780,[154,155,156...|
|  0.0|(780,[127,128,129...|
+-----+--------------------+
only showing top 10 rows



[Row(label=0.0, features=SparseVector(780, {127: 51.0, 128: 159.0, 129: 253.0, 130: 159.0, 131: 50.0, 154: 48.0, 155: 238.0, 156: 252.0, 157: 252.0, 158: 252.0, 159: 237.0, 181: 54.0, 182: 227.0, 183: 253.0, 184: 252.0, 185: 239.0, 186: 233.0, 187: 252.0, 188: 57.0, 189: 6.0, 207: 10.0, 208: 60.0, 209: 224.0, 210: 252.0, 211: 253.0, 212: 252.0, 213: 202.0, 214: 84.0, 215: 252.0, 216: 253.0, 217: 122.0, 235: 163.0, 236: 252.0, 237: 252.0, 238: 252.0, 239: 253.0, 240: 252.0, 241: 252.0, 242: 96.0, 243: 189.0, 244: 253.0, 245: 167.0, 262: 51.0, 263: 238.0, 264: 253.0, 265: 253.0, 266: 190.0, 267: 114.0, 268: 253.0, 269: 228.0, 270: 47.0, 271: 79.0, 272: 255.0, 273: 168.0, 289: 48.0, 290: 238.0, 291: 252.0, 292: 252.0, 293: 179.0, 294: 12.0, 295: 75.0, 296: 121.0, 297: 21.0, 300: 253.0, 301: 243.0, 302: 50.0, 316: 38.0, 317: 165.0, 318: 253.0, 319: 233.0, 320: 208.0, 321: 84.0, 328: 253.0, 329: 252.0, 330: 165.0, 343: 7.0, 344: 178.0, 345: 252.0, 346: 240.0, 347: 71.0, 348: 19.0, 349: 28.0
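
Each line of a LIBSVM file has the form `label index:value index:value ...`, where only non-zero features are listed and the indices in the file are one-based. According to the Spark documentation, the `libsvm` data source converts them to zero-based indices when loading, which is what the `SparseVector` above contains. A minimal sketch of that parsing step in plain Python (the function name and the sample line are hypothetical, and this is not Spark's reader):

```python
def parse_libsvm_line(line):
    """Parse 'label index:value ...' into (label, {zero_based_index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        # File indices start at 1; shift them so that they start at 0.
        features[int(index) - 1] = float(value)
    return label, features

# Hypothetical line, for illustration only.
line = "1 1:0.5 3:-0.25 13:1.0"
label, features = parse_libsvm_line(line)
print(label, features)
```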

### Exercise 2.A
**TODO:** Load wine data from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/wine.scale
Dataset description: http://archive.ics.uci.edu/ml/datasets/Wine

In [None]:
wine_df = spark.read.format("libsvm").load("wine.scale")

wine_df.show()

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  1.0|(13,[0,1,2,3,4,5,...|
|  1.0|(13,[0,1,2,3,4,5,...|
|  1.0|(13,[0,1,2,3,4,5,...|
|  1.0|(13,[0,1,2,3,4,5,...|
|  1.0|(13,[0,1,2,3,4,5,...|
|  1.0|(13,[0,1,2,3,4,5,...|
|  1.0|(13,[0,1,2,3,4,5,...|
|  1.0|(13,[0,1,2,3,4,5,...|
|  1.0|(13,[0,1,2,3,4,5,...|
|  1.0|(13,[0,1,2,3,4,5,...|
|  1.0|(13,[0,1,2,3,4,5,...|
|  1.0|(13,[0,1,2,3,4,5,...|
|  1.0|(13,[0,1,2,3,4,5,...|
|  1.0|(13,[0,1,2,3,4,5,...|
|  1.0|(13,[0,1,2,3,4,5,...|
|  1.0|(13,[0,1,2,3,4,5,...|
|  1.0|(13,[0,1,2,3,4,5,...|
|  1.0|(13,[0,1,2,3,4,5,...|
|  1.0|(13,[0,1,2,3,4,5,...|
|  1.0|(13,[0,1,2,3,5,6,...|
+-----+--------------------+
only showing top 20 rows



## 3. Classification (2p.)

In [None]:
file = "wine.csv" # https://gist.githubusercontent.com/tijptjik/9408623/raw/b237fa5848349a14a14e5d4107dc7897c21951f5/wine.csv

# Remember to remove the dots from the headers of this CSV file!
winedf2 = spark.read.format("csv").options(inferSchema="true", header="true").load(file)
winedf2.show(10)
print(winedf2.dtypes)

+----+-------+-----+----+----+---+-------+----------+--------------------+-------+-----+----+----+-------+
|Wine|Alcohol|Malic| Ash| Acl| Mg|Phenols|Flavanoids|Nonflavanoid_phenols|Proanth|Color| Hue|  OD|Proline|
+----+-------+-----+----+----+---+-------+----------+--------------------+-------+-----+----+----+-------+
|   1|  14.23| 1.71|2.43|15.6|127|    2.8|      3.06|                0.28|   2.29| 5.64|1.04|3.92|   1065|
|   1|   13.2| 1.78|2.14|11.2|100|   2.65|      2.76|                0.26|   1.28| 4.38|1.05| 3.4|   1050|
|   1|  13.16| 2.36|2.67|18.6|101|    2.8|      3.24|                 0.3|   2.81| 5.68|1.03|3.17|   1185|
|   1|  14.37| 1.95| 2.5|16.8|113|   3.85|      3.49|                0.24|   2.18|  7.8|0.86|3.45|   1480|
|   1|  13.24| 2.59|2.87|21.0|118|    2.8|      2.69|                0.39|   1.82| 4.32|1.04|2.93|    735|
|   1|   14.2| 1.76|2.45|15.2|112|   3.27|      3.39|                0.34|   1.97| 6.75|1.05|2.85|   1450|
|   1|  14.39| 1.87|2.45|14.6| 96|   

### Exercise 3.A
**TODO:** 

Remember to remove the dots from the headers of this CSV file and to split the data into training and test sets.


1) Create pipeline with VectorAssembler and DecisionTreeClassifier.

2) Use the pipeline to make predictions.

3) Evaluate predictions using MulticlassClassificationEvaluator.

4) Calculate accuracy and test error

5) Print the structure of the trained decision tree (hint: use toDebugString attribute)

In [5]:
file = "wine.csv"
# Spark interprets dots in column names as nested-field access, so rename those columns.
wine = (
    spark.read.format("csv")
    .options(inferSchema="true", header="true")
    .load(file)
    .withColumnRenamed("Malic.acid", "Malic_acid")
    .withColumnRenamed("Nonflavanoid.phenols", "Nonflavanoid_phenols")
    .withColumnRenamed("Color.int", "Color_int")
)

wine.show()

+----+-------+----------+----+----+---+-------+----------+--------------------+-------+---------+----+----+-------+
|Wine|Alcohol|Malic_acid| Ash| Acl| Mg|Phenols|Flavanoids|Nonflavanoid_phenols|Proanth|Color_int| Hue|  OD|Proline|
+----+-------+----------+----+----+---+-------+----------+--------------------+-------+---------+----+----+-------+
|   1|  14.23|      1.71|2.43|15.6|127|    2.8|      3.06|                0.28|   2.29|     5.64|1.04|3.92|   1065|
|   1|   13.2|      1.78|2.14|11.2|100|   2.65|      2.76|                0.26|   1.28|     4.38|1.05| 3.4|   1050|
|   1|  13.16|      2.36|2.67|18.6|101|    2.8|      3.24|                 0.3|   2.81|     5.68|1.03|3.17|   1185|
|   1|  14.37|      1.95| 2.5|16.8|113|   3.85|      3.49|                0.24|   2.18|      7.8|0.86|3.45|   1480|
|   1|  13.24|      2.59|2.87|21.0|118|    2.8|      2.69|                0.39|   1.82|     4.32|1.04|2.93|    735|
|   1|   14.2|      1.76|2.45|15.2|112|   3.27|      3.39|              

In [16]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline

def make_pipeline(train, test):
    vector_assembler = VectorAssembler(inputCols=train.columns[1:], outputCol="features")
    decision_tree_cls = DecisionTreeClassifier(labelCol="Wine", featuresCol="features")
    pipeline = Pipeline(stages=[vector_assembler, decision_tree_cls]) 
    evaluation = MulticlassClassificationEvaluator(labelCol="Wine", predictionCol="prediction", metricName="accuracy")
    model = pipeline.fit(train)
    predictions = model.transform(test)
    accuracy = evaluation.evaluate(predictions)
    return model, accuracy

wine_train, wine_test = wine.randomSplit([0.7, 0.3], seed=0)
model, accuracy = make_pipeline(wine_train, wine_test)

tree_model = model.stages[1]
print(tree_model.toDebugString)
print("Accuracy:", accuracy)

DecisionTreeClassificationModel: uid=DecisionTreeClassifier_70673fa310a5, depth=5, numNodes=15, numClasses=4, numFeatures=13
  If (feature 6 <= 1.335)
   If (feature 9 <= 3.77)
    Predict: 2.0
   Else (feature 9 > 3.77)
    Predict: 3.0
  Else (feature 6 > 1.335)
   If (feature 12 <= 682.5)
    If (feature 9 <= 6.165)
     If (feature 1 <= 3.875)
      Predict: 2.0
     Else (feature 1 > 3.875)
      If (feature 0 <= 12.059999999999999)
       Predict: 2.0
      Else (feature 0 > 12.059999999999999)
       Predict: 1.0
    Else (feature 9 > 6.165)
     Predict: 3.0
   Else (feature 12 > 682.5)
    If (feature 0 <= 12.364999999999998)
     Predict: 2.0
    Else (feature 0 > 12.364999999999998)
     Predict: 1.0

Accuracy: 0.8936170212765957
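
Step 4 also asks for the test error, which is simply `1 - accuracy` on the test set. A small plain-Python sketch of both quantities on hypothetical labels and predictions (not the output of the pipeline above):

```python
def accuracy_score(labels, predictions):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

# Hypothetical labels and predictions, for illustration only.
labels = [1, 1, 2, 2, 3, 3, 2, 1]
predictions = [1, 1, 2, 3, 3, 3, 2, 2]

accuracy = accuracy_score(labels, predictions)
test_error = 1.0 - accuracy
print("Accuracy:", accuracy)
print("Test error:", test_error)
```

With the pipeline above, the same subtraction can be applied to the value returned by `evaluation.evaluate(predictions)`.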


### Exercise 3.B
**TODO:** 

1) Extend the pipeline from the previous task with QuantileDiscretizer.

2) Try a couple of different numbers of buckets. Which configuration gives the best results?

3) Can you see any difference in the structure of the decision tree?

In [17]:
from pyspark.ml.feature import QuantileDiscretizer

def make_pipeline_2(number_of_buckets, train, test):
    new_cols = ["new_" + str(c) for c in train.columns[1:]]
    quantile_discretizer = QuantileDiscretizer(inputCols=train.columns[1:], outputCols=new_cols, numBuckets=number_of_buckets)
    vector_assembler = VectorAssembler(inputCols=new_cols, outputCol="features")
    decision_tree_cls = DecisionTreeClassifier(labelCol="Wine", featuresCol="features")
    pipeline = Pipeline(stages=[quantile_discretizer, vector_assembler, decision_tree_cls])
    evaluation = MulticlassClassificationEvaluator(labelCol="Wine", predictionCol="prediction", metricName="accuracy") 
    model = pipeline.fit(train)
    predictions = model.transform(test)
    accuracy = evaluation.evaluate(predictions)
    return model, accuracy

for buckets in [2, 3, 4]:
    model, accuracy = make_pipeline_2(buckets, wine_train, wine_test)
    tree_model = model.stages[2]
    print(tree_model.toDebugString)
    print("Accuracy:", accuracy)

DecisionTreeClassificationModel: uid=DecisionTreeClassifier_159c80388b07, depth=5, numNodes=17, numClasses=4, numFeatures=13
  If (feature 6 in {0.0})
   If (feature 10 in {0.0})
    If (feature 9 in {0.0})
     If (feature 0 in {0.0})
      If (feature 7 in {0.0})
       Predict: 3.0
      Else (feature 7 not in {0.0})
       Predict: 2.0
     Else (feature 0 not in {0.0})
      Predict: 3.0
    Else (feature 9 not in {0.0})
     Predict: 3.0
   Else (feature 10 not in {0.0})
    Predict: 2.0
  Else (feature 6 not in {0.0})
   If (feature 12 in {0.0})
    Predict: 2.0
   Else (feature 12 not in {0.0})
    If (feature 0 in {0.0})
     If (feature 1 in {0.0})
      Predict: 2.0
     Else (feature 1 not in {0.0})
      Predict: 1.0
    Else (feature 0 not in {0.0})
     Predict: 1.0

Accuracy: 0.9148936170212766
DecisionTreeClassificationModel: uid=DecisionTreeClassifier_e609f0ce7b11, depth=5, numNodes=15, numClasses=4, numFeatures=13
  If (feature 6 in {0.0})
   If (feature 9 in {0.0})


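The best accuracy, 0.9787234042553192, is obtained with 3 buckets. The tree structure changes slightly with each number of buckets. After discretization the splits test bucket membership (e.g. `feature 6 in {0.0}`) rather than numeric thresholds (e.g. `feature 6 <= 1.335`).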
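
To make the discretization step concrete, here is a rough plain-Python sketch of quantile bucketing: compute split points that cut the values into groups of similar size, then replace each value with the index of its bucket. The helper names and the sample values are hypothetical. Spark's QuantileDiscretizer estimates the quantiles approximately, so its split points may differ from the exact ones computed here.

```python
import bisect
import statistics

def quantile_splits(values, num_buckets):
    """Interior split points that cut the values into num_buckets groups of similar size."""
    return statistics.quantiles(values, n=num_buckets, method="inclusive")

def to_bucket(value, splits):
    """Bucket index for a value; a value equal to a split goes to the upper bucket."""
    return bisect.bisect_right(splits, value)

# Hypothetical feature values, for illustration only.
values = [0.5, 0.8, 1.2, 1.6, 2.0, 2.4, 2.8, 3.1, 3.4]

splits = quantile_splits(values, 3)
buckets = [to_bucket(v, splits) for v in values]
print("splits: ", splits)
print("buckets:", buckets)
```

With few buckets the tree sees only a handful of distinct values per feature, which is consistent with the `in {0.0}` style of splits in the output above.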

## 4. Text classification (2p.)

### Exercise 4
**TODO:** 
Build a pipeline consisting of Tokenizer, HashingTF, IDF, StringIndexer and LogisticRegression, and fit it to the training data: 
http://help.sentiment140.com/for-students/

What is the accuracy of this classifier?
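
Before building the Spark pipeline, it may help to see what the first three stages compute. The sketch below is plain Python, not a solution to the exercise and not Spark's implementation: it lowercases and splits text on whitespace (as Tokenizer does), maps tokens to a fixed number of indices with the hashing trick, and weights the counts with the IDF formula given in the Spark documentation, log((m + 1) / (df + 1)). The use of `zlib.crc32` as hash function (Spark uses MurmurHash3), the tiny `NUM_FEATURES` and the sample sentences are assumptions made for illustration.

```python
import math
import zlib

NUM_FEATURES = 16  # hypothetical and tiny for readability; Spark's HashingTF defaults to 2^18

def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()

def hashing_tf(tokens, num_features=NUM_FEATURES):
    """Term frequencies keyed by hashed token index (collisions are possible)."""
    counts = {}
    for token in tokens:
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] = counts.get(index, 0) + 1
    return counts

def idf_weights(tf_vectors):
    """IDF per index: log((m + 1) / (df + 1)), with m documents and document frequency df."""
    m = len(tf_vectors)
    doc_freq = {}
    for tf in tf_vectors:
        for index in tf:
            doc_freq[index] = doc_freq.get(index, 0) + 1
    return {index: math.log((m + 1) / (df + 1)) for index, df in doc_freq.items()}

# Hypothetical documents, for illustration only.
docs = ["I love this phone", "I hate this phone", "love love love"]

tf_vectors = [hashing_tf(tokenize(d)) for d in docs]
idf = idf_weights(tf_vectors)
tfidf = [{i: count * idf[i] for i, count in tf.items()} for tf in tf_vectors]
print(tfidf)
```

In the actual pipeline, StringIndexer would turn the sentiment labels into numeric class indices and LogisticRegression would be trained on the resulting TF-IDF vectors.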