# Model Tuning Quiz
Use this Jupyter notebook to find the answer to the quiz in the previous section. There is an answer key in the next part of the lesson.

In [1]:
from pyspark.sql import SparkSession

# TODOS: 
# 1) import any other libraries you might need
# 2) run the cells below to read dataset
# 3) follow the steps below to find the answer to the quiz question
from pyspark.ml import Pipeline
from pyspark.ml.feature import CountVectorizer, IDF, Normalizer, PCA, RegexTokenizer, StandardScaler, StopWordsRemover, StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [2]:
spark = SparkSession.builder \
    .master("local") \
    .appName("Creating Features") \
    .getOrCreate()


In [3]:
stack_overflow_data = 'Train_onetag_small.json'

In [4]:
df = spark.read.json(stack_overflow_data)
df.persist()

DataFrame[Body: string, Id: bigint, Tags: string, Title: string, oneTag: string]

# Question
What is the accuracy of the best model trained with the parameter grid described in Step 3 (keeping all other parameters at their default values), computed on the 10% of the data that was set aside?

### Step 1. Train Test Split
As a first step, split the data set into 90% training data and 10% held-out test data. Set the random seed to `42`.

In [5]:
# TODO: write your code for this step
train, test = df.randomSplit([0.9, 0.1], seed=42)
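
A note on the split sizes: `randomSplit` assigns rows by random sampling rather than cutting at an exact row count, so the two parts are only approximately 90% and 10% of the data. The pure-Python sketch below mimics that idea with a hypothetical helper, `random_split`. It is not Spark's implementation and will not reproduce Spark's actual partitions.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to a split by drawing one uniform number per row (hypothetical helper)."""
    total = sum(weights)
    # Cumulative boundaries, e.g. [0.9, 1.0] for weights [0.9, 0.1]
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, b in enumerate(bounds):
            # The last split catches anything left over from rounding
            if u < b or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

rows = list(range(10_000))
train_rows, test_rows = random_split(rows, [0.9, 0.1], seed=42)
print(len(train_rows), len(test_rows))  # close to 9000 and 1000, not necessarily exact
```

Reusing the same seed yields the same assignment, which is why the step fixes it at `42`.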

### Step 2. Build Pipeline

In [6]:
# TODO: write your code for this step
# TF-IDF
regexTokenizer = RegexTokenizer(inputCol='Body', outputCol='words', pattern="\\W")
cv = CountVectorizer(inputCol="words", outputCol="TF", vocabSize=1000)
idf = IDF(inputCol="TF", outputCol="features")

indexer = StringIndexer(inputCol="oneTag", outputCol="label")

# Modeling
lr = LogisticRegression(maxIter=10, regParam=0.0, elasticNetParam=0)

# Pipeline
pipeline = Pipeline(stages=[regexTokenizer, cv, idf, indexer, lr])
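
To make the feature stages concrete, here is a minimal pure-Python sketch of the same idea on two made-up sentences: split on non-word characters, count terms, and scale the counts by an inverse document frequency. The smoothed formula `log((N + 1) / (df + 1))` follows the one documented for Spark's `IDF`. The toy documents and the `tokenize` helper are assumptions for illustration, and vocabulary capping (`vocabSize`) is not modelled.

```python
import math
import re
from collections import Counter

docs = ["How do I parse JSON in Python?", "Parse XML in Java"]  # made-up examples

def tokenize(text):
    # Split on non-word characters and lowercase, dropping empty tokens
    return [t for t in re.split(r"\W", text.lower()) if t]

term_counts = [Counter(tokenize(d)) for d in docs]

# Document frequency: number of documents containing each term
doc_freq = Counter(term for counts in term_counts for term in counts)
n_docs = len(docs)
idf = {term: math.log((n_docs + 1) / (df + 1)) for term, df in doc_freq.items()}

# TF-IDF: raw term count times the term's IDF weight
tfidf = [{term: tf * idf[term] for term, tf in counts.items()} for counts in term_counts]
print(tfidf[0])
```

Terms that occur in every document (here `parse` and `in`) get a weight of zero, while terms unique to one document keep a positive weight.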

### Step 3. Tune Model
On the 90% training split, find the most accurate logistic regression model using 3-fold cross-validation with the following parameter grid:

- CountVectorizer vocabulary size: `[1000, 5000]`
- LogisticRegression regularization parameter: `[0.0, 0.1]`
- LogisticRegression maximum number of iterations: `[10]`

In [7]:
# TODO: write your code for this step
# Parameter grid: 2 vocabulary sizes x 2 regularization values x 1 iteration limit
paramGrid = ParamGridBuilder() \
    .addGrid(cv.vocabSize, [1000, 5000]) \
    .addGrid(lr.regParam, [0.0, 0.1]) \
    .addGrid(lr.maxIter, [10]) \
    .build()

In [8]:
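# Cross-Validation
# Note: MulticlassClassificationEvaluator() scores models with its default metric, F1.
# To select the best model by accuracy instead, pass metricName="accuracy".
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=3)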
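
For a sense of how much work this search involves, the grid from Step 3 can be enumerated in plain Python. The sketch below only counts combinations and model fits; the dictionary keys are labels for this example, not Spark parameter objects.

```python
from itertools import product

grid = {
    "vocabSize": [1000, 5000],
    "regParam": [0.0, 0.1],
    "maxIter": [10],
}
num_folds = 3

# Every combination of one value per parameter
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]
for combo in combinations:
    print(combo)

# Each combination is fitted once per fold; the best one is then refitted on all training data
fits = len(combinations) * num_folds + 1
print(len(combinations), "combinations,", fits, "pipeline fits")
```

Because every fit reruns the whole pipeline, including the vocabulary count, the training cell can take several minutes on a local Spark session.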

### Step 4. Compute Accuracy of Best Model

In [9]:
# TODO: write your code for this step
# Model training
cv_model = crossval.fit(train)

In [10]:
# Training result
cv_model.avgMetrics

[0.3111202060304143,
 0.23108195330468076,
 0.3635594536132107,
 0.28608901695647787]
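
The four values are the evaluator's scores averaged over the 3 folds, one per parameter combination. A small sketch for locating the best entry, with the values copied from the output above. The pairing with parameter values assumes the grid is enumerated in the order the parameters were added (vocabulary size first, then regularization), which is worth confirming in the notebook by inspecting `crossval.getEstimatorParamMaps()`.

```python
avg_metrics = [
    0.3111202060304143,
    0.23108195330468076,
    0.3635594536132107,
    0.28608901695647787,
]

# Assumed ordering of the grid: (vocabSize, regParam)
assumed_grid = [(1000, 0.0), (1000, 0.1), (5000, 0.0), (5000, 0.1)]

# Index of the highest average score
best_index = max(range(len(avg_metrics)), key=avg_metrics.__getitem__)
print(best_index, assumed_grid[best_index], round(avg_metrics[best_index], 4))
```

Under that assumption, the larger vocabulary without regularization scores highest, and adding regularization lowers the score for both vocabulary sizes.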

In [11]:
# Model Evaluation
result = cv_model.transform(test)

In [12]:
print(result.filter(result.label == result.prediction).count())
print(result.count())

3899
10062
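
The two counts are the number of correct predictions and the total number of test rows, so the accuracy asked for in the question is their ratio. As a quick check of the arithmetic:

```python
correct = 3899   # rows where label == prediction
total = 10062    # rows in the test set

accuracy = correct / total
print(f"{accuracy:.4f}")  # 0.3875
```

In other words, the best model classifies roughly 38.75% of the held-out questions correctly.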
