### University of Virginia
### DS 7200: Distributed Computing
### Lab: Supervised Learning
### Last Updated: August 20, 2023

---

#### Instructions

This project has two parts:
- Part I: Classification - build and apply a logistic regression model on the Wisconsin Breast Cancer dataset.
- Part II: Regression - build and apply a linear regression model on the California Housing dataset.

**Total Possible Points: 10**

---

#### Part I: Classification (5 POINTS)

Here are the specifications and grading breakdown:

- the target variable is `diagnosis`
- use `f1`, `f2` as predictors (1 PT)
- split data into 60% training set, 40% test set 
- standardize the predictors (1 PT)
- use seed=314 whenever a seed is needed
- fit a Logistic Regression model with an intercept (1 PT)
- compute and show the area under the ROC curve for the test set (2 PTS)

In [1]:
from pyspark.sql import SparkSession
DATA_FILEPATH = 'wisc_breast_cancer_w_fields.csv'

spark = SparkSession \
    .builder \
    .appName("Wisc BRCA") \
    .getOrCreate()

24/09/26 10:23:54 WARN Utils: Your hostname, Eileanors-Laptop.local resolves to a loopback address: 127.0.0.1; using 172.25.161.99 instead (on interface en0)
24/09/26 10:23:54 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/09/26 10:23:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


#### Enter code and solution

In [2]:
# read data into dataframe
df = spark.read.csv(DATA_FILEPATH,  inferSchema=True, header = True)
df.show(5)

24/09/26 10:24:04 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


+--------+---------+-----+-----+-----+------+-------+-------+------+-------+------+-------+------+------+-----+-----+--------+-------+-------+-------+-------+--------+-----+-----+-----+------+------+------+------+------+------+-------+
|      id|diagnosis|   f1|   f2|   f3|    f4|     f5|     f6|    f7|     f8|    f9|    f10|   f11|   f12|  f13|  f14|     f15|    f16|    f17|    f18|    f19|     f20|  f21|  f22|  f23|   f24|   f25|   f26|   f27|   f28|   f29|    f30|
+--------+---------+-----+-----+-----+------+-------+-------+------+-------+------+-------+------+------+-----+-----+--------+-------+-------+-------+-------+--------+-----+-----+-----+------+------+------+------+------+------+-------+
|  842302|        M|17.99|10.38|122.8|1001.0| 0.1184| 0.2776|0.3001| 0.1471|0.2419|0.07871| 1.095|0.9053|8.589|153.4|0.006399|0.04904|0.05373|0.01587|0.03003|0.006193|25.38|17.33|184.6|2019.0|0.1622|0.6656|0.7119|0.2654|0.4601| 0.1189|
|  842517|        M|20.57|17.77|132.9|1326.0|0.08474|0.0

In [3]:
df.select('diagnosis').distinct().collect()

[Row(diagnosis='B'), Row(diagnosis='M')]

In [4]:
from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol="diagnosis", outputCol="diagnosisEncoded")
indexed = indexer.fit(df).transform(df)
indexed.show()

+--------+---------+-----+-----+-----+------+-------+-------+-------+-------+------+-------+------+------+-----+-----+--------+--------+-------+--------+-------+--------+-----+-----+-----+------+------+------+------+-------+------+-------+----------------+
|      id|diagnosis|   f1|   f2|   f3|    f4|     f5|     f6|     f7|     f8|    f9|    f10|   f11|   f12|  f13|  f14|     f15|     f16|    f17|     f18|    f19|     f20|  f21|  f22|  f23|   f24|   f25|   f26|   f27|    f28|   f29|    f30|diagnosisEncoded|
+--------+---------+-----+-----+-----+------+-------+-------+-------+-------+------+-------+------+------+-----+-----+--------+--------+-------+--------+-------+--------+-----+-----+-----+------+------+------+------+-------+------+-------+----------------+
|  842302|        M|17.99|10.38|122.8|1001.0| 0.1184| 0.2776| 0.3001| 0.1471|0.2419|0.07871| 1.095|0.9053|8.589|153.4|0.006399| 0.04904|0.05373| 0.01587|0.03003|0.006193|25.38|17.33|184.6|2019.0|0.1622|0.6656|0.7119| 0.2654|0.460

In [5]:
from pyspark.ml.feature import VectorAssembler

# inputCols take a list of column names
# outputCol is arbitrary name of new column; generally called features

assembler = VectorAssembler(inputCols=["f1", "f2"],
                            outputCol="features")

tr = assembler.transform(indexed)
tr.select("*").show(5, truncate=False)

+--------+---------+-----+-----+-----+------+-------+-------+------+-------+------+-------+------+------+-----+-----+--------+-------+-------+-------+-------+--------+-----+-----+-----+------+------+------+------+------+------+-------+----------------+-------------+
|id      |diagnosis|f1   |f2   |f3   |f4    |f5     |f6     |f7    |f8     |f9    |f10    |f11   |f12   |f13  |f14  |f15     |f16    |f17    |f18    |f19    |f20     |f21  |f22  |f23  |f24   |f25   |f26   |f27   |f28   |f29   |f30    |diagnosisEncoded|features     |
+--------+---------+-----+-----+-----+------+-------+-------+------+-------+------+-------+------+------+-----+-----+--------+-------+-------+-------+-------+--------+-----+-----+-----+------+------+------+------+------+------+-------+----------------+-------------+
|842302  |M        |17.99|10.38|122.8|1001.0|0.1184 |0.2776 |0.3001|0.1471 |0.2419|0.07871|1.095 |0.9053|8.589|153.4|0.006399|0.04904|0.05373|0.01587|0.03003|0.006193|25.38|17.33|184.6|2019.0|0.1622|

In [6]:
# Split the data into train and test
splits = tr.randomSplit([0.6, 0.4], 314)
train = splits[0]
test = splits[1]

In [7]:
from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures") #withMean = True centers the data - not good if sparse data

#Scale Training Data
scalerTrain = scaler.fit(train)
scaledTrain = scalerTrain.transform(train)

#Scale Testing Data
scalerTest = scaler.fit(test)
scaledTest = scalerTrain.transform(test)

scaledTrain.select("diagnosisEncoded","features","scaledFeatures").show(5, truncate=False)
scaledTest.select("diagnosisEncoded","features","scaledFeatures").show(5, truncate=False)

+----------------+-------------+---------------------------------------+
|diagnosisEncoded|features     |scaledFeatures                         |
+----------------+-------------+---------------------------------------+
|1.0             |[15.46,19.48]|[4.32528603980796,4.543962571206202]   |
|0.0             |[12.89,13.12]|[3.6062701845488103,3.0604101095598235]|
|0.0             |[14.96,19.1] |[4.185399686644701,4.455322644252488]  |
|1.0             |[13.17,18.66]|[3.684606542320235,4.352686939358713]  |
|0.0             |[12.18,17.84]|[3.4076315630569827,4.161411307511224] |
+----------------+-------------+---------------------------------------+
only showing top 5 rows

+----------------+-------------+--------------------------------------+
|diagnosisEncoded|features     |scaledFeatures                        |
+----------------+-------------+--------------------------------------+
|0.0             |[12.94,16.17]|[3.620258819865136,3.771862154846216] |
|1.0             |[20.26,23.03

In [8]:
from pyspark.ml.classification import LogisticRegression

# instantiate the model
lr = LogisticRegression(labelCol='diagnosisEncoded',
                        featuresCol='scaledFeatures',
                        maxIter=10, 
                        regParam=0.3, 
                        elasticNetParam=0.8)

# Fit the model
lrModel = lr.fit(scaledTrain)

# Print the coefficients and intercept for logistic regression
print("Coefficients: " + str(lrModel.coefficients))
print("Intercept: " + str(lrModel.intercept))

24/09/26 10:24:38 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS


Coefficients: [0.39066757761187426,0.0]
Intercept: -2.0137943460354157


In [9]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# compute predictions. this will append column "prediction" to dataframe
lrPred = lrModel.transform(scaledTest)
lrPred.select('probability','prediction').show(5,truncate=False)

# set up evaluator
evaluator = BinaryClassificationEvaluator(rawPredictionCol="probability",
                                          labelCol="diagnosisEncoded",
                                          metricName="areaUnderPR")

# pass to evaluator the DF with predictions, labels
aupr = evaluator.evaluate(lrPred)

print("Area under PR Curve:", aupr) #can see not super confident in predictions

+----------------------------------------+----------+
|probability                             |prediction|
+----------------------------------------+----------+
|[0.6455365520110307,0.35446344798896934]|0.0       |
|[0.4500210424472049,0.5499789575527951] |1.0       |
|[0.6532507194634758,0.3467492805365242] |0.0       |
|[0.6118755975910292,0.38812440240897084]|0.0       |
|[0.6224629287211927,0.3775370712788073] |0.0       |
+----------------------------------------+----------+
only showing top 5 rows

Area under PR Curve: 0.9196399670688022


#### Part II: Regression (5 POINTS)

In this project, you will work with the California Home Price dataset to train a regression model and predict median home prices. Here are the specifications and grading breakdown:

- Scale the response variable median_house_value, dividing by 100000 (1 PT)

- Split data into train set (80%), test set (20%) using seed=314 (1 PT)

- Add new predictor: `rooms_per_household`

- In the training set, select all of these features and standardize them: (1 PT)

feats = ["total_bedrooms", 
         "population", 
         "households", 
         "median_income", 
         "rooms_per_household"]

- Fit a linear regression model on the training set with these parameters:

  - maxIter=10
  - regParam=0.3
  - elasticNetParam=0.8  


- Compute the MSE on the test set (2 PTS)

In [18]:
import os
import pandas as pd

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

In [19]:
DATA_FILEPATH2 = 'cal_housing_data_preproc_w_header.txt'

#### Enter code and solution

In [26]:
# read data into dataframe
df = spark.read.csv(DATA_FILEPATH2,  inferSchema=True, header = True)
df.show(5)

+------------------+-----------------+------------------+-----------+--------------+----------+----------+--------+---------+
|median_house_value|    median_income|housing_median_age|total_rooms|total_bedrooms|population|households|latitude|longitude|
+------------------+-----------------+------------------+-----------+--------------+----------+----------+--------+---------+
|          452600.0|           8.3252|              41.0|      880.0|         129.0|     322.0|     126.0|   37.88|  -122.23|
|          358500.0|           8.3014|              21.0|     7099.0|        1106.0|    2401.0|    1138.0|   37.86|  -122.22|
|          352100.0|7.257399999999999|              52.0|     1467.0|         190.0|     496.0|     177.0|   37.85|  -122.24|
|          341300.0|           5.6431|              52.0|     1274.0|         235.0|     558.0|     219.0|   37.85|  -122.25|
|          342200.0|           3.8462|              52.0|     1627.0|         280.0|     565.0|     259.0|   37.85|  -

In [27]:
#Scale the response variable median_house_value, dividing by 100000 

from pyspark.sql.functions import col

df = df.withColumn("median_house_value",col("median_house_value")/100000)
df.show(5)

+------------------+-----------------+------------------+-----------+--------------+----------+----------+--------+---------+
|median_house_value|    median_income|housing_median_age|total_rooms|total_bedrooms|population|households|latitude|longitude|
+------------------+-----------------+------------------+-----------+--------------+----------+----------+--------+---------+
|             4.526|           8.3252|              41.0|      880.0|         129.0|     322.0|     126.0|   37.88|  -122.23|
|             3.585|           8.3014|              21.0|     7099.0|        1106.0|    2401.0|    1138.0|   37.86|  -122.22|
|             3.521|7.257399999999999|              52.0|     1467.0|         190.0|     496.0|     177.0|   37.85|  -122.24|
|             3.413|           5.6431|              52.0|     1274.0|         235.0|     558.0|     219.0|   37.85|  -122.25|
|             3.422|           3.8462|              52.0|     1627.0|         280.0|     565.0|     259.0|   37.85|  -

In [28]:
#Add new predictor: `rooms_per_household`

df = df.withColumn("rooms_per_household",col("total_rooms")/col("households"))
df.show()

+------------------+-----------------+------------------+-----------+--------------+----------+----------+--------+---------+-------------------+
|median_house_value|    median_income|housing_median_age|total_rooms|total_bedrooms|population|households|latitude|longitude|rooms_per_household|
+------------------+-----------------+------------------+-----------+--------------+----------+----------+--------+---------+-------------------+
|             4.526|           8.3252|              41.0|      880.0|         129.0|     322.0|     126.0|   37.88|  -122.23|  6.984126984126984|
|             3.585|           8.3014|              21.0|     7099.0|        1106.0|    2401.0|    1138.0|   37.86|  -122.22|  6.238137082601054|
|             3.521|7.257399999999999|              52.0|     1467.0|         190.0|     496.0|     177.0|   37.85|  -122.24|  8.288135593220339|
|             3.413|           5.6431|              52.0|     1274.0|         235.0|     558.0|     219.0|   37.85|  -122.25

In [36]:
from pyspark.ml.feature import VectorAssembler

# inputCols take a list of column names
# outputCol is arbitrary name of new column; generally called features

feats = ["total_bedrooms", 
         "population", 
         "households", 
         "median_income", 
         "rooms_per_household"]

assembler = VectorAssembler(inputCols=feats,
                            outputCol="features")

tr = assembler.transform(df)
tr.select("*").show(1, truncate=False)

+------------------+-------------+------------------+-----------+--------------+----------+----------+--------+---------+-------------------+--------------------------------------------+
|median_house_value|median_income|housing_median_age|total_rooms|total_bedrooms|population|households|latitude|longitude|rooms_per_household|features                                    |
+------------------+-------------+------------------+-----------+--------------+----------+----------+--------+---------+-------------------+--------------------------------------------+
|4.526             |8.3252       |41.0              |880.0      |129.0         |322.0     |126.0     |37.88   |-122.23  |6.984126984126984  |[129.0,322.0,126.0,8.3252,6.984126984126984]|
+------------------+-------------+------------------+-----------+--------------+----------+----------+--------+---------+-------------------+--------------------------------------------+
only showing top 1 row



In [37]:
# Split the data into train and test
splits = tr.randomSplit([0.8, 0.2], 314)
train = splits[0]
test = splits[1]

In [38]:
#In the training set, select all of these features and standardize them: (1 PT) - why only standardize the training features not the test features??


from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(inputCol='features', outputCol="scaledFeatures") #withMean = True centers the data - not good if sparse data

#Scale Training Data
scalerTrain = scaler.fit(train)
scaledTrain = scalerTrain.transform(train)

scaledTrain.select("median_house_value","features","scaledFeatures").show(5, truncate=False)

#Scale Test Data
scalerTest = scaler.fit(test)
scaledTest = scalerTest.transform(test)

scaledTest.select("median_house_value","features","scaledFeatures").show(5, truncate=False)


+------------------+----------------------------------------------+-----------------------------------------------------------------------------------------------------+
|median_house_value|features                                      |scaledFeatures                                                                                       |
+------------------+----------------------------------------------+-----------------------------------------------------------------------------------------------------+
|0.14999           |[28.0,18.0,8.0,0.536,12.25]                   |[0.06684344246025051,0.016109651840083426,0.021007972177067596,0.28366233505166594,5.132877105254897]|
|0.14999           |[73.0,85.0,38.0,1.6607,6.7105263157894735]    |[0.17427040355708168,0.07607335591150508,0.09978786784107108,0.8788769399632492,2.811780154328676]   |
|0.14999           |[239.0,490.0,164.0,2.1,3.774390243902439]     |[0.570556526714281,0.4385405223133822,0.43066342962988574,1.1113636261352582,1.5815

In [41]:
from pyspark.ml.regression import LinearRegression

lr = LinearRegression(featuresCol='scaledFeatures',         # feature vector name
                      labelCol='median_house_value',  # target variable name
                      maxIter=10,
                      regParam=0.3, 
                      elasticNetParam=0.8)

# Fit the model
lrModel = lr.fit(scaledTrain)

# Print the weights and intercept for linear regression
print("Weights: " + str(lrModel.coefficients))
print("Intercept: " + str(lrModel.intercept))

24/09/26 10:39:51 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.lapack.JNILAPACK


Weights: [0.081755424803359,-0.1614868306276797,0.10560698176227044,0.6345625835282451,-0.02760799353025227]
Intercept: 0.7952155251919881


In [42]:
from pyspark.ml.evaluation import RegressionEvaluator

# compute predictions. this will append column "prediction" to dataframe
lrPred = lrModel.transform(scaledTest)
lrPred.show(5)

ev = RegressionEvaluator(predictionCol="prediction", labelCol="median_house_value")

print('-'*20)
print("METRICS")
print("Mean Squared Error:", ev.evaluate(lrPred, {ev.metricName: "mse"}))
print("R Squared:", ev.evaluate(lrPred, {ev.metricName:'r2'}))

+------------------+------------------+------------------+-----------+--------------+----------+----------+--------+---------+-------------------+--------------------+--------------------+------------------+
|median_house_value|     median_income|housing_median_age|total_rooms|total_bedrooms|population|households|latitude|longitude|rooms_per_household|            features|      scaledFeatures|        prediction|
+------------------+------------------+------------------+-----------+--------------+----------+----------+--------+---------+-------------------+--------------------+--------------------+------------------+
|           0.14999|            4.1932|              52.0|      803.0|         267.0|     628.0|     225.0|   34.24|  -117.86|  3.568888888888889|[267.0,628.0,225....|[0.62024503250489...| 2.158224976508338|
|             0.225|0.7916999999999998|              52.0|      107.0|          79.0|     167.0|      53.0|   37.95|  -121.29|  2.018867924528302|[79.0,167.0,53.0,...|[