### University of Virginia
### DS 7200: Distributed Computing
### Lab: Supervised Learning
### Last Updated: August 20, 2023

---

#### Instructions

This project has two parts:
- Part I: Classification - build and apply a logistic regression model on the Wisconsin Breast Cancer dataset.
- Part II: Regression - build and apply a linear regression model on the California Housing dataset.

**Total Possible Points: 10**

---

#### Part I: Classification (5 POINTS)

Here are the specifications and grading breakdown:

- the target variable is `diagnosis`
- use `f1` and `f2` as predictors (1 PT)
- split the data into a 60% training set and a 40% test set
- standardize the predictors (1 PT)
- use seed=314 whenever a seed is needed
- fit a Logistic Regression model with an intercept (1 PT)
- compute and show the area under the ROC curve for the test set (2 PTS)

In [27]:
from pyspark.sql import SparkSession
DATA_FILEPATH = 'wisc_breast_cancer_w_fields.csv'

spark = SparkSession \
    .builder \
    .appName("Wisc BRCA") \
    .getOrCreate()

In [28]:
training = spark.read.csv(DATA_FILEPATH, inferSchema=True, header=True)

In [29]:
train, test = training.randomSplit([0.6, 0.4], 314)
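
`randomSplit` assigns rows to splits at random, so the 60/40 proportions are approximate rather than exact, and the seed is what makes the split repeatable. The pure-Python sketch below illustrates that idea only; `random_split` and the 1000 placeholder rows are hypothetical and are not Spark's implementation.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split by drawing one uniform number per row."""
    total = sum(weights)
    # Cumulative upper bounds for each split after normalizing the weights
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any floating-point rounding gap
            if u < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


rows = list(range(1000))  # placeholder rows, not the real dataset
train_rows, test_rows = random_split(rows, [0.6, 0.4], seed=314)
print(len(train_rows), len(test_rows))
```

Because each row is drawn independently, the split sizes vary around 600 and 400 rather than hitting them exactly; rerunning with the same seed reproduces the same assignment.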

In [30]:
from pyspark.ml.feature import StringIndexer
stringIndexer = StringIndexer(inputCols=['diagnosis'], outputCols=['diagnosis_numeric'], handleInvalid="skip")

+----+---------+-----+-----+-----+-----+-------+-------+------+-------+------+-------+------+------+-----+-----+--------+-------+-------+--------+-------+--------+-----+-----+-----+------+-------+------+------+-------+------+-------+
|  id|diagnosis|   f1|   f2|   f3|   f4|     f5|     f6|    f7|     f8|    f9|    f10|   f11|   f12|  f13|  f14|     f15|    f16|    f17|     f18|    f19|     f20|  f21|  f22|  f23|   f24|    f25|   f26|   f27|    f28|   f29|    f30|
+----+---------+-----+-----+-----+-----+-------+-------+------+-------+------+-------+------+------+-----+-----+--------+-------+-------+--------+-------+--------+-----+-----+-----+------+-------+------+------+-------+------+-------+
|8670|        M|15.46|19.48|101.7|748.9| 0.1092| 0.1223|0.1466|0.08087|0.1931|0.05796|0.4743|0.7859|3.094|48.31| 0.00624|0.01484|0.02813| 0.01093|0.01397|0.002461|19.26| 26.0|124.9|1156.0| 0.1546|0.2394|0.3791| 0.1514|0.2837|0.08019|
|8913|        B|12.89|13.12|81.89|515.9|0.06955|0.03729|0.0226|0

In [31]:
from pyspark.ml.feature import VectorAssembler

# inputCols takes a list of column names
# outputCol is an arbitrary name for the new column; conventionally "features"

assembler = VectorAssembler(inputCols=["f1", "f2"],
                            outputCol="features")

In [32]:
from pyspark.ml.feature import StandardScaler

# defaults: withStd=True, withMean=False (scale to unit standard deviation without centering)
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")
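
With its default settings, `StandardScaler` divides each feature by its sample standard deviation and does not subtract the mean. The following pure-Python sketch shows the difference between scaling only and full standardization on a single column; `scale_column` is a hypothetical helper and the input values are illustrative, not a reproduction of Spark's code or of the dataset.

```python
import statistics


def scale_column(values, with_mean=False, with_std=True):
    """Optionally center by the mean, then divide by the sample std (n - 1)."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # corrected sample standard deviation
    scaled = []
    for v in values:
        if with_mean:
            v = v - mean
        if with_std and std > 0:
            v = v / std
        scaled.append(v)
    return scaled


f1 = [15.46, 12.89, 20.0, 10.0]  # illustrative values only

scaled_default = scale_column(f1)                   # unit std, mean not removed
scaled_centered = scale_column(f1, with_mean=True)  # zero mean and unit std

print(statistics.mean(scaled_default), statistics.stdev(scaled_default))
print(statistics.mean(scaled_centered), statistics.stdev(scaled_centered))
```

Both variants end up with unit standard deviation; only the centered one has a mean of zero. In Spark, centering would be requested with `withMean=True`.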

In [33]:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline
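# instantiate the model; fitIntercept defaults to True, so an intercept is included
lr = LogisticRegression(labelCol='diagnosis_numeric',
                        featuresCol='scaledFeatures',
                        maxIter=10,
                        regParam=0.3,
                        elasticNetParam=0.8)

# chain label indexing, feature assembly, scaling, and the classifier
pipeline = Pipeline(stages=[stringIndexer, assembler, scaler, lr])
# fit every stage on the training set only
pipelineModel = pipeline.fit(train)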

In [36]:
predDF = pipelineModel.transform(test)
predDF.show(2)

+-----+---------+-----+-----+-----+------+-------+-------+-------+-------+------+-------+------+-----+------+-----+--------+-------+-------+--------+-------+--------+-----+-----+-----+------+------+------+------+-------+------+-------+-----------------+-------------+--------------------+--------------------+--------------------+----------+
|   id|diagnosis|   f1|   f2|   f3|    f4|     f5|     f6|     f7|     f8|    f9|    f10|   f11|  f12|   f13|  f14|     f15|    f16|    f17|     f18|    f19|     f20|  f21|  f22|  f23|   f24|   f25|   f26|   f27|    f28|   f29|    f30|diagnosis_numeric|     features|      scaledFeatures|       rawPrediction|         probability|prediction|
+-----+---------+-----+-----+-----+------+-------+-------+-------+-------+------+-------+------+-----+------+-----+--------+-------+-------+--------+-------+--------+-----+-----+-----+------+------+------+------+-------+------+-------+-----------------+-------------+--------------------+--------------------+-------

In [37]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

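# set up the evaluator
# NOTE: "prediction" holds hard 0/1 labels, so this AUC is based on a single
# threshold; pass "rawPrediction" to score the full ROC curve instead
evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction",
                                          labelCol="diagnosis_numeric",
                                          metricName="areaUnderROC")

# evaluate on the DataFrame holding the test-set predictions and labels
auc = evaluator.evaluate(predDF)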

print("Area under ROC Curve:", auc)

Area under ROC Curve: 0.6959459459459459
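
The evaluator reads its scores from `rawPredictionCol`. When that column holds hard 0/1 labels, as `"prediction"` does, the ROC curve has only one interior point, and the resulting AUC generally differs from the one computed on continuous scores such as `rawPrediction` or `probability`. A minimal pure-Python sketch with made-up labels and scores shows the gap; the `auc` helper is hypothetical and uses the pairwise-ranking definition of AUC, not Spark's implementation.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up test labels and predicted probabilities of the positive class
labels = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.6, 0.4, 0.45, 0.2, 0.1]

# Hard predictions obtained by thresholding the probabilities at 0.5
hard = [1.0 if p >= 0.5 else 0.0 for p in probs]

auc_from_scores = auc(labels, probs)
auc_from_hard_labels = auc(labels, hard)
print(auc_from_scores, auc_from_hard_labels)
```

Thresholding discards the ranking information among rows on the same side of the cutoff, which is why the two numbers differ here.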


#### Enter code and solution

#### Part II: Regression (5 POINTS)

In this part, you will work with the California Housing dataset to train a regression model and predict median house values. Here are the specifications and grading breakdown:

- Scale the response variable `median_house_value` by dividing it by 100000 (1 PT)

- Split data into train set (80%), test set (20%) using seed=314 (1 PT)

- Add a new predictor, `rooms_per_household`, computed as `total_rooms` divided by `households`

- In the training set, select all of these features and standardize them: (1 PT)

feats = ["total_bedrooms", 
         "population", 
         "households", 
         "median_income", 
         "rooms_per_household"]

- Fit a linear regression model on the training set with these parameters:

  - maxIter=10
  - regParam=0.3
  - elasticNetParam=0.8  


- Compute the MSE on the test set (2 PTS)
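
For reference, `regParam` and `elasticNetParam` in the list above jointly define an elastic-net penalty on the coefficients: `regParam * (elasticNetParam * L1 + (1 - elasticNetParam) / 2 * L2²)`. The sketch below evaluates that penalty for a hypothetical coefficient vector; `elastic_net_penalty` and the weights are illustrative assumptions, not values from a fitted model.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Elastic-net penalty: a mix of the L1 norm and half the squared L2 norm."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return reg_param * (
        elastic_net_param * l1 + (1.0 - elastic_net_param) / 2.0 * l2_squared
    )


# Hypothetical coefficients for the five predictors (illustration only)
weights = [0.5, -1.0, 0.0, 2.0, 0.25]

penalty = elastic_net_penalty(weights, reg_param=0.3, elastic_net_param=0.8)
lasso_only = elastic_net_penalty(weights, reg_param=0.3, elastic_net_param=1.0)
ridge_only = elastic_net_penalty(weights, reg_param=0.3, elastic_net_param=0.0)
print(penalty, lasso_only, ridge_only)
```

With `elasticNetParam=0.8` the penalty is mostly L1, which tends to push some coefficients to exactly zero; setting it to 1.0 or 0.0 gives pure lasso or pure ridge, respectively.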

In [65]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

In [66]:
DATA_FILEPATH2 = 'cal_housing_data_preproc_w_header.txt'

In [67]:
data = spark.read.csv(DATA_FILEPATH2, inferSchema=True, header=True)
data.show(2)

+------------------+-------------+------------------+-----------+--------------+----------+----------+--------+---------+
|median_house_value|median_income|housing_median_age|total_rooms|total_bedrooms|population|households|latitude|longitude|
+------------------+-------------+------------------+-----------+--------------+----------+----------+--------+---------+
|          452600.0|       8.3252|              41.0|      880.0|         129.0|     322.0|     126.0|   37.88|  -122.23|
|          358500.0|       8.3014|              21.0|     7099.0|        1106.0|    2401.0|    1138.0|   37.86|  -122.22|
+------------------+-------------+------------------+-----------+--------------+----------+----------+--------+---------+
only showing top 2 rows



In [68]:
data = data.withColumn("median_house_value", data['median_house_value']/100000)
data.show(2)

+------------------+-------------+------------------+-----------+--------------+----------+----------+--------+---------+
|median_house_value|median_income|housing_median_age|total_rooms|total_bedrooms|population|households|latitude|longitude|
+------------------+-------------+------------------+-----------+--------------+----------+----------+--------+---------+
|             4.526|       8.3252|              41.0|      880.0|         129.0|     322.0|     126.0|   37.88|  -122.23|
|             3.585|       8.3014|              21.0|     7099.0|        1106.0|    2401.0|    1138.0|   37.86|  -122.22|
+------------------+-------------+------------------+-----------+--------------+----------+----------+--------+---------+
only showing top 2 rows



In [69]:
data = data.withColumn('rooms_per_household', data['total_rooms']/data['households'])
data.show(2)

+------------------+-------------+------------------+-----------+--------------+----------+----------+--------+---------+-------------------+
|median_house_value|median_income|housing_median_age|total_rooms|total_bedrooms|population|households|latitude|longitude|rooms_per_household|
+------------------+-------------+------------------+-----------+--------------+----------+----------+--------+---------+-------------------+
|             4.526|       8.3252|              41.0|      880.0|         129.0|     322.0|     126.0|   37.88|  -122.23|  6.984126984126984|
|             3.585|       8.3014|              21.0|     7099.0|        1106.0|    2401.0|    1138.0|   37.86|  -122.22|  6.238137082601054|
+------------------+-------------+------------------+-----------+--------------+----------+----------+--------+---------+-------------------+
only showing top 2 rows



In [70]:
train, test = data.randomSplit([0.8, 0.2], 314)

In [74]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StandardScaler, VectorAssembler
from pyspark.ml.regression import LinearRegression

feats = ["total_bedrooms", "population", "households", "median_income", "rooms_per_household"]

# combine the predictors into a single vector column
assembler = VectorAssembler(inputCols=feats, outputCol="features")
# scale each feature to unit standard deviation
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")

# instantiate the model
reg = LinearRegression(labelCol='median_house_value',
                       featuresCol='scaledFeatures',
                       maxIter=10,
                       regParam=0.3,
                       elasticNetParam=0.8)

# chain the stages so the scaler statistics come from the training set only
pipeline = Pipeline(stages=[assembler, scaler, reg])
# fit the pipeline on the training set
pipelineModel = pipeline.fit(train)

In [75]:
predDF = pipelineModel.transform(test)
predDF.show(2)

+------------------+------------------+------------------+-----------+--------------+----------+----------+--------+---------+-------------------+--------------------+--------------------+-----------------+
|median_house_value|     median_income|housing_median_age|total_rooms|total_bedrooms|population|households|latitude|longitude|rooms_per_household|            features|      scaledFeatures|       prediction|
+------------------+------------------+------------------+-----------+--------------+----------+----------+--------+---------+-------------------+--------------------+--------------------+-----------------+
|           0.14999|            4.1932|              52.0|      803.0|         267.0|     628.0|     225.0|   34.24|  -117.86|  3.568888888888889|[267.0,628.0,225....|[0.63739996917453...|2.159373789302095|
|             0.225|0.7916999999999998|              52.0|      107.0|          79.0|     167.0|      53.0|   37.95|  -121.29|  2.018867924528302|[79.0,167.0,53.0,...|[0.18

In [77]:
from pyspark.ml.evaluation import RegressionEvaluator
regressionEvaluator = RegressionEvaluator(
    predictionCol='prediction',
    labelCol='median_house_value',
    metricName='mse')
mse = regressionEvaluator.evaluate(predDF)
print(f"MSE is {mse}")

MSE is 0.7551749809615463
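
Because `median_house_value` was divided by 100000 before fitting, the MSE above is in squared units of $100,000. Taking the square root and rescaling gives a typical error in dollars. The sketch below shows the MSE formula on two toy rows (the predictions are made up) and then converts the reported MSE; `mean_squared_error` is a hypothetical helper, not the Spark evaluator.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between labels and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy example: scaled labels from the first two rows, made-up predictions
y_true = [4.526, 3.585]
y_pred = [4.0, 3.0]
toy_mse = mean_squared_error(y_true, y_pred)

# Convert the MSE reported above back to a dollar-scale RMSE
reported_mse = 0.7551749809615463
rmse_dollars = math.sqrt(reported_mse) * 100000
print(toy_mse, round(rmse_dollars))
```

Under this conversion the reported MSE corresponds to an RMSE of roughly $87,000.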


#### Enter code and solution