# IST 718: Big Data Analytics

- Professor: Daniel Acuna <deacuna@syr.edu>

## General instructions:

- You are welcome to discuss the problems with your classmates, but __you are not allowed to copy any part of your answers either from your classmates or from the internet__.
- You can put the homework files anywhere you want in your http://notebook.acuna.io workspace but _do not change_ the file names. The TAs and the professor use these names to grade your homework.
- Remove or comment out code that contains `raise NotImplementedError`. This is mainly to make the `assert` statement fail if nothing is submitted.
- The tests shown in some cells (i.e., `assert` and `np.testing.` statements) are used to grade your answers. **However, the professor and TAs will use __additional__ tests for your answer. Think about cases where your code should still run correctly even if it passes all the tests you see.**
- Before downloading and submitting your work through Blackboard, remember to save and press `Validate` (or go to `Kernel`$\rightarrow$`Restart and Run All`).
- Good luck!

In [557]:
# Load the packages needed for this part
# create spark and sparkcontext objects
from pyspark.sql import SparkSession
import numpy as np

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

import pyspark
from pyspark.ml import feature, regression, Pipeline, classification, pipeline, evaluation
from pyspark.sql import functions as fn, Row
from pyspark import sql

import matplotlib.pyplot as plt
import pandas as pd

# Part 2

In this section, you are going to develop an SMS spam detector based on logistic regression. This is the same idea behind sentiment analysis, but instead of predicting positive vs. negative sentiment, you are going to predict whether an SMS text is spam or not.

The dataset will be in `sms_spam_df`.

In [558]:
sms_spam_df = spark.read.csv('sms_spam.csv', header=True, inferSchema=True)

# Question 2.1

Encode the `type` column to be 1 for `spam` and 0 for `ham` and store the result in `sms_spam2_df`. Also, assign the count of spam messages to `spam_count` and the count of ham messages to `ham_count`.
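One thing worth checking for this encoding: by default, `StringIndexer` assigns indices by descending label frequency, so `spam` ends up as 1 only because it is the rarer class. The sketch below mimics that ordering rule in plain Python, using the class counts from the assertions in this notebook. `frequency_index` is a hypothetical helper (not a Spark function), and the alphabetical tie-break is an assumption.

```python
from collections import Counter

def frequency_index(labels):
    """Map each label to an index, most frequent label first.

    Ties are broken alphabetically (an assumption for this sketch).
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Class counts taken from the assertions in this notebook: 4827 ham, 747 spam
labels = ['ham'] * 4827 + ['spam'] * 747
mapping = frequency_index(labels)
print(mapping)
```

If spam were the majority class, this rule would flip the mapping. An explicit expression such as `fn.when(fn.col('type') == 'spam', 1).otherwise(0)` does not depend on class frequencies.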

In [559]:
# create sms_spam2_df below
from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol='type', outputCol='encoded_type')
sms_spam2_df = indexer.fit(sms_spam_df).transform(sms_spam_df)
ham_count = sms_spam2_df.where(fn.col('encoded_type') == 0).count()
spam_count = sms_spam2_df.where(fn.col('encoded_type') == 1).count()

In [560]:
# check here
sms_spam2_df.show()

+----+--------------------+------------+
|type|                text|encoded_type|
+----+--------------------+------------+
| ham|Go until jurong p...|         0.0|
| ham|Ok lar... Joking ...|         0.0|
|spam|Free entry in 2 a...|         1.0|
| ham|U dun say so earl...|         0.0|
| ham|Nah I don't think...|         0.0|
|spam|FreeMsg Hey there...|         1.0|
| ham|Even my brother i...|         0.0|
| ham|As per your reque...|         0.0|
|spam|WINNER!! As a val...|         1.0|
|spam|Had your mobile 1...|         1.0|
| ham|I'm gonna be home...|         0.0|
|spam|SIX chances to wi...|         1.0|
|spam|URGENT! You have ...|         1.0|
| ham|I've been searchi...|         0.0|
| ham|I HAVE A DATE ON ...|         0.0|
|spam|XXXMobileMovieClu...|         1.0|
| ham|Oh k...i'm watchi...|         0.0|
| ham|Eh u remember how...|         0.0|
| ham|Fine if that's th...|         0.0|
|spam|England v Macedon...|         1.0|
+----+--------------------+------------+
only showing top

In [561]:
# (5 pts)
np.testing.assert_array_equal(
    sms_spam2_df.groupBy('type').count().orderBy('type').rdd.map(lambda x: x['count']).collect(),
    [4827, 747])
np.testing.assert_array_equal(spam_count, 747)
np.testing.assert_array_equal(ham_count, 4827)

# Question 2.2: tfidf feature engineering
Create a pipeline that combines a `Tokenizer`, a `CountVectorizer`, and an `IDF` estimator to compute the tfidf vectors of each SMS. Fit this pipeline and assign the fitted pipeline (a transformer) to a variable `tfidf_pipeline`. The `Tokenizer` step should create a column `words`, the `CountVectorizer` step should create a column `tf`, and the `IDF` step should create a column `tfidf`.

$$
\text{tf-idf}_{ij} = f_{ij} \log \frac{|D|+1}{f_i+1}
$$

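where $f_{ij}$ is the number of times word $i$ appears in document $j$, $|D|$ is the total number of documents, and $f_i$ is the number of documents that contain word $i$.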
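As a worked illustration of the formula above, the following sketch computes the smoothed tf-idf by hand for a tiny corpus. The three toy messages are hypothetical, and the natural logarithm is assumed.

```python
import math

def tfidf(term_count, doc_freq, n_docs):
    """Smoothed tf-idf: f_ij * log((|D| + 1) / (f_i + 1)), natural log assumed."""
    return term_count * math.log((n_docs + 1) / (doc_freq + 1))

# Hypothetical toy corpus of 3 tokenized messages
docs = [['free', 'entry', 'free'], ['ok', 'free'], ['ok', 'later']]
n_docs = len(docs)

# f_i: number of documents that contain each word
doc_freq = {}
for doc in docs:
    for word in set(doc):
        doc_freq[word] = doc_freq.get(word, 0) + 1

# tf-idf weights for the first message
first = {w: tfidf(docs[0].count(w), doc_freq[w], n_docs) for w in set(docs[0])}
print(first)
```

With this smoothing, a word that appears in every document gets weight $\log 1 = 0$, so it does not help separate spam from ham.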

In [562]:
from pyspark.ml.feature import Tokenizer
Tokenizer = Tokenizer().setInputCol('text').setOutputCol('words')

In [563]:
from pyspark.ml.feature import CountVectorizer
CounterVectorizer = CountVectorizer(minTF=1., minDF=5., vocabSize=2**17).setInputCol('words').setOutputCol('tf')


In [564]:
from pyspark.ml.feature import IDF
idf = IDF().\
    setInputCol('tf').\
    setOutputCol('tfidf')

tfidf_pipeline = Pipeline(stages=[Tokenizer, CounterVectorizer, idf]).fit(sms_spam2_df)

In [565]:
# Check pipeline result
tfidf_pipeline.transform(sms_spam2_df).show()

+----+--------------------+------------+--------------------+--------------------+--------------------+
|type|                text|encoded_type|               words|                  tf|               tfidf|
+----+--------------------+------------+--------------------+--------------------+--------------------+
| ham|Go until jurong p...|         0.0|[go, until, juron...|(2005,[8,42,51,65...|(2005,[8,42,51,65...|
| ham|Ok lar... Joking ...|         0.0|[ok, lar..., joki...|(2005,[5,74,404,5...|(2005,[5,74,404,5...|
|spam|Free entry in 2 a...|         1.0|[free, entry, in,...|(2005,[0,3,8,20,5...|(2005,[0,3,8,20,5...|
| ham|U dun say so earl...|         0.0|[u, dun, say, so,...|(2005,[5,22,60,14...|(2005,[5,22,60,14...|
| ham|Nah I don't think...|         0.0|[nah, i, don't, t...|(2005,[0,1,66,86,...|(2005,[0,1,66,86,...|
|spam|FreeMsg Hey there...|         1.0|[freemsg, hey, th...|(2005,[0,2,6,10,1...|(2005,[0,2,6,10,1...|
| ham|Even my brother i...|         0.0|[even, my, brothe...|(20

In [566]:
# (5 pts)
np.testing.assert_array_equal([type(s) for s in tfidf_pipeline.stages],
                              [feature.Tokenizer, feature.CountVectorizerModel, feature.IDFModel])

# Question 2.3: uppercase feature

Typical spam messages contain words in upper case. Create a dataframe `sms_spam3_df` with a new column `has_uppercase` that contains the integer `1` if the first sequence of uppercase letters has length 3 or more and the integer `0` otherwise. You can extract sequences of 3 or more uppercase letters with the regular expression `[A-Z]{3,}`. Use the function `fn.regexp_extract` to find those sequences and extract the first one (e.g., with index 0), and then use `fn.length` to compute the length of that sequence.
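The rule described above can be sketched outside Spark with Python's `re` module. This is only an illustration of the logic, assuming that an empty string stands for "no match" (which is what `fn.regexp_extract` returns in that case). The third message is hypothetical.

```python
import re

UPPER_RUN = re.compile(r'[A-Z]{3,}')

def has_uppercase(text):
    """Return 1 if the first run of 3+ uppercase ASCII letters exists, else 0."""
    match = UPPER_RUN.search(text)                 # first match in the text
    first_run = match.group(0) if match else ''    # '' stands for "no match"
    return 1 if len(first_run) >= 3 else 0

messages = [
    'SIX chances to win CASH!',   # prefix of a spam message shown in this notebook
    'U R entitled to Update',     # fragment with single capital letters only
    'ok, see you later',          # hypothetical ham message
]
flags = [has_uppercase(m) for m in messages]
print(flags)
```

Note that `[A-Z]` only covers ASCII letters, so accented capitals would not count toward a run.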

In [567]:
# create sms_spam3_df below

sms = sms_spam2_df.select('type', 'text', fn.regexp_extract('text', '[A-Z]{3,}', 0).alias('uppercase'))

In [568]:
# `uppercase` is '' when no run of 3+ capitals was found, so testing its length
# directly is more robust than relying on StringIndexer's frequency-based indices
sms_spam3_df = sms.select(
    'type',
    'text',
    (fn.length('uppercase') >= 3).cast('int').alias('has_uppercase'))

sms_spam3_df.show()

+----+--------------------+-------------+
|type|                text|has_uppercase|
+----+--------------------+-------------+
| ham|Go until jurong p...|            0|
| ham|Ok lar... Joking ...|            0|
|spam|Free entry in 2 a...|            0|
| ham|U dun say so earl...|            0|
| ham|Nah I don't think...|            0|
|spam|FreeMsg Hey there...|            0|
| ham|Even my brother i...|            0|
| ham|As per your reque...|            0|
|spam|WINNER!! As a val...|            1|
|spam|Had your mobile 1...|            1|
| ham|I'm gonna be home...|            0|
|spam|SIX chances to wi...|            1|
|spam|URGENT! You have ...|            1|
| ham|I've been searchi...|            0|
| ham|I HAVE A DATE ON ...|            1|
|spam|XXXMobileMovieClu...|            1|
| ham|Oh k...i'm watchi...|            0|
| ham|Eh u remember how...|            0|
| ham|Fine if that's th...|            0|
|spam|England v Macedon...|            1|
+----+--------------------+-------

The first three messages with `has_uppercase == 1` are as follows:

```python
sms_spam3_df.where('has_uppercase == 1').take(3)
```

```console
[Row(type=1, text='WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.', has_uppercase=1),
 Row(type=1, text='Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030', has_uppercase=1),
 Row(type=1, text='SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info', has_uppercase=1)]
```

In [569]:
# try it here
sms_spam3_df.where('has_uppercase == 1').take(3)

[Row(type='spam', text='WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.', has_uppercase=1),
 Row(type='spam', text='Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030', has_uppercase=1),
 Row(type='spam', text='SIX chances to win CASH! From 100 to 20,000 pounds txt> CSH11 and send to 87575. Cost 150p/day, 6days, 16+ TsandCs apply Reply HL 4 info', has_uppercase=1)]

In [570]:
# (5 pts)
np.testing.assert_equal(set(sms_spam3_df.columns), {'has_uppercase', 'text', 'type'})
np.testing.assert_equal(type(sms_spam3_df.schema['has_uppercase'].dataType), sql.types.IntegerType)
np.testing.assert_equal(sms_spam3_df.rdd.map(lambda x : x['has_uppercase']).sum(), 891)

# Question 2.4: Compare models

Using the following splits:

In [571]:
training_df, validation_df, testing_df = sms_spam2_df.randomSplit([0.6, 0.3, 0.1], seed=0)

In [572]:
[training_df.count(), validation_df.count(), testing_df.count()]

[3311, 1709, 554]

**(15 pts)** Create pipelines where the first stage is the `tfidf_pipeline` created above and the second stage is a `LogisticRegression` model with different regularization parameters ($\lambda$) and elastic net mixtures ($\alpha$) as follows:
$$\lambda = [0, 0.02, 0.1] $$
$$\alpha = [0.2, 0.4] $$

Try all combinations of $\lambda$ and $\alpha$ (6 combinations) to find the best parameters, using the area under the curve (AUC) as the evaluation metric. Fit those pipelines to the appropriate data split and assign the pipeline with the best model to a variable `best_model`. Also, assign $\lambda$ and $\alpha$ of the best model to `best_model_lambda` and `best_model_alpha`.
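The search over the six combinations can be organized as a small loop. The sketch below shows only the bookkeeping: `select_best` and `toy_score` are hypothetical names, and `toy_score` is a stand-in so the code runs without Spark. It is not a real AUC. In the notebook, the scoring function would presumably fit a pipeline on `training_df` and evaluate it on `validation_df`.

```python
from itertools import product

lambdas = [0, 0.02, 0.1]   # regularization strengths from the assignment
alphas = [0.2, 0.4]        # elastic net mixtures from the assignment

grid = list(product(lambdas, alphas))   # all 6 (lambda, alpha) pairs

def select_best(grid, score_fn):
    """Return the (lambda, alpha) pair with the highest score, plus that score."""
    scored = [(score_fn(lam, alpha), lam, alpha) for lam, alpha in grid]
    best_score, best_lambda, best_alpha = max(scored)
    return best_lambda, best_alpha, best_score

def toy_score(lam, alpha):
    # Hypothetical stand-in so the sketch runs without Spark; NOT a real AUC
    return 1.0 - abs(lam - 0.02) - 0.1 * alpha

best_lambda, best_alpha, best_score = select_best(grid, toy_score)
print(len(grid), best_lambda, best_alpha)
```

Keeping the test split out of this loop matters: if the test data also picks the hyperparameters, the final AUC is no longer an unbiased estimate of generalization.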

For example, the AUC on training of the first model is perfect:

```
evaluator.evaluate(lr_pipeline1.transform(training_df))
```

```console
1.0
```

In [573]:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors
from pyspark.ml.evaluation import BinaryClassificationEvaluator, \
    MulticlassClassificationEvaluator, \
    RegressionEvaluator

pipeline_cv_estimator = Pipeline(stages=[Tokenizer, CounterVectorizer]).fit(sms_spam2_df)
tfidf_pipeline = Pipeline(stages=[pipeline_cv_estimator, idf]).fit(sms_spam2_df)

lr = LogisticRegression().\
    setLabelCol('encoded_type').\
    setFeaturesCol('tfidf').\
    setRegParam(0.0).\
    setMaxIter(100).\
    setElasticNetParam(0.) # logistic regression


lr_pipeline = Pipeline(stages=[tfidf_pipeline, lr]).fit(training_df) # pipeline 
evaluator = BinaryClassificationEvaluator(labelCol='encoded_type')
evaluator.evaluate(lr_pipeline.transform(testing_df))

0.9080584551148206

In [574]:
import pandas as pd
vocabulary = tfidf_pipeline.stages[0].stages[-1].vocabulary
weights = lr_pipeline.stages[-1].coefficients.toArray()
coeffs_df = pd.DataFrame({'word': vocabulary, 'weight': weights})

In [575]:
coeffs_df.sort_values('weight').head(10)

|      | word    |     weight |
|-----:|:--------|-----------:|
|  205 | his     | -26.905969 |
|  357 | ok...   | -25.307068 |
|  452 | later.  | -24.973442 |
| 1201 | snow    | -18.903645 |
|   83 | see     | -18.578804 |
|  335 | called  | -16.627778 |
|  120 | where   | -16.045486 |
| 1483 | then... | -15.56371  |
| 1302 | liked   | -15.550081 |
| 1179 | nobody  | -14.661711 |


In [576]:
lambda_par = 0.02
alpha_par = 1.0
lr2 = LogisticRegression().\
    setLabelCol('encoded_type').\
    setFeaturesCol('tfidf').\
    setRegParam(lambda_par).\
    setMaxIter(100).\
    setElasticNetParam(alpha_par)

lr_model2 = Pipeline(stages=[Tokenizer, CounterVectorizer, idf, lr2]).fit(training_df)
evaluator = BinaryClassificationEvaluator(labelCol='encoded_type')
evaluator.evaluate(lr_model2.transform(testing_df))

0.9745441892832293

In [577]:
lambda_par = 1.0
alpha_par = 1.0
lr3 = LogisticRegression().\
    setLabelCol('encoded_type').\
    setFeaturesCol('tfidf').\
    setRegParam(lambda_par).\
    setMaxIter(100).\
    setElasticNetParam(alpha_par)

lr_model3 = Pipeline(stages=[Tokenizer, CounterVectorizer, idf, lr3]).fit(training_df)
evaluator = BinaryClassificationEvaluator(labelCol='encoded_type')
evaluator.evaluate(lr_model3.transform(testing_df))

0.5

In [578]:
lambda_par = 1.0
alpha_par = 0.02
lr4 = LogisticRegression().\
    setLabelCol('encoded_type').\
    setFeaturesCol('tfidf').\
    setRegParam(lambda_par).\
    setMaxIter(100).\
    setElasticNetParam(alpha_par)

lr_model4 = Pipeline(stages=[Tokenizer, CounterVectorizer, idf, lr4]).fit(training_df)
evaluator = BinaryClassificationEvaluator(labelCol='encoded_type')
evaluator.evaluate(lr_model4.transform(testing_df))

0.9804453723034107

In [579]:
lambda_par = 1.0  # regularization strength; pure ridge because the elastic net mixture is 0
lr5 = LogisticRegression().\
    setLabelCol('encoded_type').\
    setFeaturesCol('tfidf').\
    setRegParam(lambda_par).\
    setMaxIter(100).\
    setElasticNetParam(0.0)

lr_model5 = Pipeline(stages=[Tokenizer, CounterVectorizer, idf, lr5]).fit(training_df)
evaluator = BinaryClassificationEvaluator(labelCol='encoded_type')
evaluator.evaluate(lr_model5.transform(testing_df))

0.9908420320111356

In [580]:
lambda_par = 1.0
lr6= LogisticRegression().\
    setLabelCol('encoded_type').\
    setFeaturesCol('tfidf').\
    setRegParam(lambda_par).\
    setMaxIter(100).\
    setElasticNetParam(0.0)

lr_model6 = Pipeline(stages=[Tokenizer, CounterVectorizer, idf, lr6]).fit(training_df)
evaluator = BinaryClassificationEvaluator(labelCol='encoded_type')
evaluator.evaluate(lr_model6.transform(testing_df))

0.9908420320111356

In [581]:
best_model_lambda = 0.02
best_model_alpha = 0.2

lr7 = LogisticRegression().\
    setLabelCol('encoded_type').\
    setFeaturesCol('tfidf').\
    setRegParam(best_model_lambda).\
    setMaxIter(100).\
    setElasticNetParam(best_model_alpha)

best_model1 = Pipeline(stages=[Tokenizer, CounterVectorizer, idf, lr7]).fit(training_df) # estimator
evaluator = BinaryClassificationEvaluator(labelCol='encoded_type')
AUC_best = evaluator.evaluate(best_model1.transform(testing_df)) #transformer
AUC_best # lower AUC, higher precision (1.0)

0.9835908141962438

In [582]:
best_model_lambda = 0.02
best_model_alpha = 0.2

pipeline_cv_estimator = Pipeline(stages=[Tokenizer, CounterVectorizer]).fit(sms_spam2_df)
tfidf_pipeline = Pipeline(stages=[pipeline_cv_estimator, idf]).fit(sms_spam2_df)

lr8 = LogisticRegression().\
    setLabelCol('encoded_type').\
    setFeaturesCol('tfidf').\
    setRegParam(best_model_lambda).\
    setMaxIter(100).\
    setElasticNetParam(best_model_alpha) # logistic regression



best_model = Pipeline(stages=[tfidf_pipeline, lr8]).fit(training_df) # uses the fitted tfidf pipeline as first stage
evaluator = BinaryClassificationEvaluator(labelCol='encoded_type')
AUC_best = evaluator.evaluate(best_model.transform(testing_df)) #transformer
AUC_best # precision 0.98

0.9846068197633973

In [583]:
# (15 pts)
np.testing.assert_equal(type(best_model), pipeline.PipelineModel)

np.testing.assert_array_equal([type(s) for s in best_model.stages],
                              [pipeline.PipelineModel, classification.LogisticRegressionModel])

np.testing.assert_equal(best_model_lambda, 0.02)
np.testing.assert_equal(best_model_alpha, 0.2)

# Question 2.5: Generalization of best model

Using the right split and the best model selected before, compute the generalization performance and assign it to a variable `AUC_best`
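For intuition about the number being reported, the area under the ROC curve can be read as the probability that a randomly chosen spam message receives a higher score than a randomly chosen ham message. The sketch below computes it by brute force over all pairs. The labels and scores are hypothetical, and `auc` is an illustrative helper, not the method `BinaryClassificationEvaluator` uses internally.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0, 0]                # 1 = spam, 0 = ham
scores = [0.9, 0.4, 0.5, 0.2, 0.1]      # hypothetical spam scores
print(auc(labels, scores))
```

Under this reading, an AUC of 0.5 means the model ranks spam above ham no better than chance, which is what a model whose coefficients are all shrunk to zero would produce.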

In [584]:
best_model_lambda = 0.02
best_model_alpha = 0.2

pipeline_cv_estimator = Pipeline(stages=[Tokenizer, CounterVectorizer]).fit(sms_spam2_df)
tfidf_pipeline = Pipeline(stages=[pipeline_cv_estimator, idf]).fit(sms_spam2_df)

lr8 = LogisticRegression().\
    setLabelCol('encoded_type').\
    setFeaturesCol('tfidf').\
    setRegParam(best_model_lambda).\
    setMaxIter(100).\
    setElasticNetParam(best_model_alpha) # logistic regression



best_model = Pipeline(stages=[tfidf_pipeline, lr8]).fit(training_df) # pipeline 
evaluator = BinaryClassificationEvaluator(labelCol='encoded_type')
AUC_best = evaluator.evaluate(best_model.transform(testing_df)) #transformer
AUC_best

0.9846068197633973

In [585]:
# (5 pts)
np.testing.assert_approx_equal(AUC_best, 0.9883924843423813, significant=2)

Using the same split and the best model, compute and assign `precision`, `recall` and `f1_score`. You should first count the entries of the confusion matrix, and then compute these metrics from the formulas below.

- **Prevalence**: (TP+FN) / everything
- **Precision**: TP / (TP + FP)
- **Sensitivity**, **Recall**, or **True positive rate**: TP / true condition positive
- **Specificity**: TN / true condition negative
- **F1**: $\frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$
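The formulas above can be applied directly to confusion-matrix counts. The sketch below uses the counts from the confusion matrix printed in this notebook, assuming rows are true labels and columns are predictions (0 = ham, 1 = spam). The variable names are illustrative.

```python
# Counts read from the confusion matrix printed in this notebook,
# assuming rows are true labels and columns are predictions (0 = ham, 1 = spam)
tn, fp = 479, 0      # true ham:  predicted ham, predicted spam
fn_, tp = 15, 60     # true spam: predicted ham, predicted spam

total = tp + fp + fn_ + tn
prevalence = (tp + fn_) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn_)          # sensitivity / true positive rate
specificity = tn / (tn + fp)
f1_score = 2 * precision * recall / (precision + recall)

print(round(precision, 4), round(recall, 4), round(f1_score, 4))
```

If a model never predicts spam, `tp + fp` is zero and precision is undefined, so a guard against division by zero is needed in that case.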

In [586]:
from pyspark.mllib.evaluation import MulticlassMetrics

dft = best_model1.transform(testing_df)

df = dft.select(['prediction', 'encoded_type'])
metrics = MulticlassMetrics(df.rdd.map(tuple))

print(metrics.confusionMatrix().toArray())

[[479.   0.]
 [ 15.  60.]]


In [587]:
precision = metrics.precision(1.0)
recall = metrics.recall(1.0)
f1_score = metrics.fMeasure(1.0)

In [588]:
# Cross-check with scikit-learn, reusing the predictions already stored in `dft`.
# `best_model` is not rebound here: it must remain the fitted pipeline.
from sklearn.metrics import classification_report

pdf = dft.select('encoded_type', 'prediction').toPandas()

print(classification_report(pdf['encoded_type'], pdf['prediction']))

              precision    recall  f1-score   support

         0.0       0.97      1.00      0.98       479
         1.0       1.00      0.80      0.89        75

    accuracy                           0.97       554
   macro avg       0.98      0.90      0.94       554
weighted avg       0.97      0.97      0.97       554



In [589]:
# (5 pts)
np.testing.assert_array_almost_equal([precision, recall, f1_score],
    [1.0, 0.7976190476190477, 0.8874172185430463], decimal=2)

# Question 2.6: Inference

Use the best pipeline fitted above (`best_model`) to create Pandas dataframes that contain the most negative words and the most positive words. In particular, create a dataframe `positive_words` with the columns `word` and `weight` with the top 20 positive words, sorted by descending coefficient. Similarly, create a `negative_words` Pandas dataframe with the top 20 negative words, where the coefficients are sorted in ascending order. **Hint: follow the `sentiment_analysis.ipynb` notebook in the repo**
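The ranking step itself is just a sort on the coefficient. The sketch below shows it in plain Python on a few (word, weight) pairs copied from the coefficient tables in this notebook. `top_words` is a hypothetical helper, and `k` is 2 here instead of 20 to keep the example short.

```python
# A few (word, weight) pairs copied from the coefficient tables in this notebook
coeffs = [('call', 0.529299), ('i', -0.2719), ('won', 0.624268),
          ('me', -0.133675), ('txt', 0.454826), ('friends', -0.120932)]

def top_words(coeffs, k, positive=True):
    """Return the k most positive (descending) or most negative (ascending) pairs."""
    ranked = sorted(coeffs, key=lambda pair: pair[1], reverse=positive)
    return ranked[:k]

positive_top = top_words(coeffs, 2, positive=True)
negative_top = top_words(coeffs, 2, positive=False)
print(positive_top, negative_top)
```

Taking the first `k` rows of an ascending sort returns the `k` smallest weights, which would include zero or positive weights if fewer than `k` coefficients were negative.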

In [590]:
vocabulary = best_model1.stages[1].vocabulary
weights = best_model1.stages[-1].coefficients.toArray()
coeffs_df = pd.DataFrame({'word': vocabulary, 'weight': weights})

In [591]:
positive_words = coeffs_df.sort_values('weight', ascending=False).head(20)
positive_words.weight.sum()

7.885048977821447

In [592]:
negative_words = coeffs_df.sort_values('weight', ascending=True).head(20)
negative_words.weight.sum()

-1.4337804407269388

In [593]:
# examine positive vocabulary
positive_words.head(20)

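|      | word     |   weight |
|-----:|:---------|---------:|
|  162 | won      | 0.624268 |
|   15 | call     | 0.529299 |
| 1263 | onto     | 0.479627 |
|  372 | ringtone | 0.466865 |
|   87 | txt      | 0.454826 |
|  671 | sex      | 0.412815 |
|  215 | service  | 0.383503 |
|  736 | order    | 0.382584 |
|  159 | now!     | 0.382461 |
|  923 | games    | 0.374132 |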


In [594]:
# examine solutions
negative_words.head(20)

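|     | word    |    weight |
|----:|:--------|----------:|
|   1 | i       | -0.2719   |
|  11 | me      | -0.133675 |
| 359 | friends | -0.120932 |
| 602 | lose    | -0.099188 |
|  29 | if      | -0.095204 |
|   9 | my      | -0.087518 |
|  75 | i'll    | -0.080973 |
|  26 | but     | -0.072798 |
|  27 | i'm     | -0.071817 |
| 161 | i've    | -0.050437 |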


The `positive_words` and `negative_words` dataframe should look like this:

```python
positive_words.head()
```

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>word</th>
      <th>weight</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>7263</th>
      <td>sexy?</td>
      <td>0.642738</td>
    </tr>
    <tr>
      <th>3555</th>
      <td>widelive.com/index.</td>
      <td>0.588182</td>
    </tr>
    <tr>
      <th>15</th>
      <td>call</td>
      <td>0.537161</td>
    </tr>
    <tr>
      <th>12237</th>
      <td>08714712388</td>
      <td>0.504090</td>
    </tr>
    <tr>
      <th>81</th>
      <td>txt</td>
      <td>0.495005</td>
    </tr>
  </tbody>
</table>

and 

```python
negative_words.head()
```

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>word</th>
      <th>weight</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>1</th>
      <td>i</td>
      <td>-0.183463</td>
    </tr>
    <tr>
      <th>3332</th>
      <td>lose.</td>
      <td>-0.074937</td>
    </tr>
    <tr>
      <th>3371</th>
      <td>fightng</td>
      <td>-0.074937</td>
    </tr>
    <tr>
      <th>3221</th>
      <td>dificult</td>
      <td>-0.074937</td>
    </tr>
    <tr>
      <th>13</th>
      <td>me</td>
      <td>-0.065904</td>
    </tr>
  </tbody>
</table>

In [595]:
# (5 pts)
np.testing.assert_equal(set(positive_words.columns), {'weight', 'word'})
np.testing.assert_equal(set(negative_words.columns), {'weight', 'word'})
np.testing.assert_approx_equal(positive_words.weight.sum(), 8.675741267346245, significant=1)
np.testing.assert_approx_equal(negative_words.weight.sum(), -0.7292131526997235, significant=1)
np.testing.assert_array_less(positive_words.weight.iloc[-1], positive_words.weight.iloc[0])
np.testing.assert_array_less(negative_words.weight.iloc[0], negative_words.weight.iloc[-1])

# Question 2.7
Use the dataframe `sms_spam3_df` to create a model where the first feature is `has_uppercase` and the remaining features are the tfidf of the text. Scale all features using a max absolute scaler ([`MaxAbsScaler`](https://spark.apache.org/docs/2.0.2/ml-features.html#maxabsscaler)). Fit a logistic regression on the resulting scaled features with regularization parameter $\lambda = 0.2$ and elastic net mixture $\alpha=0.1$ using the entire data (all of `sms_spam3_df`). Since all features are scaled to the same range, you can compare their coefficients.

**(5 pts)** with code and comments, answer below

1. Is `has_uppercase` positively or negatively related to an SMS being spam?
2. What is the ratio of the coefficient of `has_uppercase` to the largest positive tfidf coefficient?
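To see why scaling makes the coefficients comparable, here is a minimal plain-Python sketch of max-absolute scaling: each feature column is divided by its largest absolute value, so every feature ends up in $[-1, 1]$. `max_abs_scale` is a hypothetical helper and the rows are made-up values, not data from `sms_spam3_df`.

```python
def max_abs_scale(rows):
    """Divide each column by its largest absolute value so values fall in [-1, 1]."""
    n_cols = len(rows[0])
    max_abs = [max(abs(row[j]) for row in rows) for j in range(n_cols)]
    # Columns that are entirely zero are left unchanged to avoid dividing by zero
    return [[row[j] / max_abs[j] if max_abs[j] else row[j] for j in range(n_cols)]
            for row in rows]

# Hypothetical rows: [has_uppercase, tfidf of word A, tfidf of word B]
rows = [[1, 2.0, 0.0], [0, 4.0, -3.0], [1, 0.0, 1.5]]
scaled = max_abs_scale(rows)
print(scaled)
```

Because this scaling only divides and never shifts the data, zeros stay zeros, which keeps sparse tfidf vectors sparse.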

In [596]:
from pyspark_pipes import pipe

In [597]:
indexer = StringIndexer(inputCol='type', outputCol='encoded_type')
sms_spam3_df = indexer.fit(sms_spam3_df).transform(sms_spam3_df)

In [615]:
# 2. pipeline for tfidf of the text
from pyspark.ml.feature import Tokenizer
Tokenizer = Tokenizer().setInputCol('text').setOutputCol('words')

from pyspark.ml.feature import CountVectorizer
CounterVectorizer = CountVectorizer(minTF=1., minDF=5., vocabSize=2**17).setInputCol('words').setOutputCol('tf')

from pyspark.ml.feature import IDF
idf = IDF().\
    setInputCol('tf').\
    setOutputCol('tfidf')

tfidf_pipeline = Pipeline(stages=[Tokenizer, CounterVectorizer, idf]).fit(sms_spam3_df)

#tfidf = tfidf_pipeline.fit(sms_spam3_df)
tfidf = tfidf_pipeline.transform(sms_spam3_df)



In [609]:
# model 

In [616]:
training, test = tfidf.randomSplit([0.8, 0.2], 0)

In [617]:
model2 = pipe(feature.VectorAssembler(inputCols=['has_uppercase', 'tfidf'], outputCol="features"),
              feature.MaxAbsScaler(inputCol='features', outputCol='Scaledfeatures'))

In [618]:
best_model_lambda = 0.2
best_model_alpha = 0.1


lr2 = LogisticRegression().\
    setLabelCol('encoded_type').\
    setFeaturesCol('Scaledfeatures').\
    setRegParam(best_model_lambda).\
    setMaxIter(100).\
    setElasticNetParam(best_model_alpha)

bestmodel = Pipeline(stages=[model2, lr2]).fit(training) 
evaluator = BinaryClassificationEvaluator(labelCol='encoded_type')
AUC = evaluator.evaluate(bestmodel.transform(test))  # evaluate the fitted pipeline on the held-out split

In [619]:
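coefs = bestmodel.stages[-1].coefficients.toArray()
uppercoef = coefs[0]         # has_uppercase is the first assembled feature
tfidfcoef = coefs[1:].max()  # largest positive coefficient among the tfidf features only
print('1.', uppercoef)  # 1. the coefficient is positive, so has_uppercase is positively related to spam
print('2.', uppercoef / tfidfcoef)  # 2. ratio of has_uppercase to the largest tfidf coefficient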

1. 0.9451844869703176
2. 0.4742783128057046
