# DSE 230: Programming Assignment 3 - Linear Regression

#### Tasks: 

- Linear Regression on the Boston Housing dataset.  
  
- Submission on Gradescope:
  - Submit this Jupyter Notebook as a PDF to "PA3 Notebook"
  - Convert this Notebook to a .py file and submit that to "PA3"

#### Due date: Friday 5/14/2021 at 11:59 PM PST

---

Remember: when in doubt, read the documentation first. It's always helpful to search for the class that you're trying to work with, e.g. pyspark.sql.DataFrame. 

PySpark API Documentation: https://spark.apache.org/docs/latest/api/python/index.html

Spark DataFrame Guide:  https://spark.apache.org/docs/latest/sql-programming-guide.html

Spark MLlib Guide: https://spark.apache.org/docs/latest/ml-guide.html

### Import libraries/functions

In [None]:
<< YOUR CODE HERE >>

### Initialize Spark
* Initialize Spark with 2 cores

In [None]:
<< YOUR CODE HERE >>

### Read the data from Boston_Housing.csv file
* Print the number of rows in the dataframe

In [None]:
<< YOUR CODE HERE >>

### Column names in file and their description

CRIM — per capita crime rate by town.

ZN — proportion of residential land zoned for lots over 25,000 sq.ft.

INDUS — proportion of non-retail business acres per town.

CHAS — Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

NOX — nitrogen oxides concentration (parts per 10 million).

RM — average number of rooms per dwelling.

AGE — proportion of owner-occupied units built prior to 1940.

DIS — weighted mean of distances to five Boston employment centres.

RAD — index of accessibility to radial highways.

TAX — full-value property-tax rate per $10,000.

PTRATIO — pupil-teacher ratio by town.

BLACK — 1000(Bk - 0.63)² where Bk is the proportion of blacks by town.

LSTAT — lower status of the population (percent).

MV — median value of owner-occupied homes in $1000s. This is the target variable.

### See one row of the dataframe

In [None]:
<< YOUR CODE HERE >>

### Helper function for filling columns using mean or median strategy

In [None]:
from pyspark.ml.feature import Imputer

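def fill_na(df, strategy):
    """Return a copy of `df` with the missing values in every column filled in.

    `strategy` is passed to Imputer, e.g. "mean" or "median".
    All columns of `df` must be numeric.
    """
    suffix = "_imputed"
    imputer = Imputer(
        strategy=strategy,
        inputCols=df.columns,
        outputCols=[c + suffix for c in df.columns]
    )

    # Fit on df, then append one "<name>_imputed" column per input column
    new_df = imputer.fit(df).transform(df)

    # Select the newly created columns with all filled values
    new_df = new_df.select([c for c in new_df.columns if c.endswith(suffix)])

    # Restore the original column names
    for name in new_df.columns:
        new_df = new_df.withColumnRenamed(name, name[:-len(suffix)])

    return new_df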

### Feature selection
* Print schema to verify

In [None]:
# These are the column names in the csv file as described above.
col_names = ['CRIM' , 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'BLACK', 'LSTAT', 'MV']

<< YOUR CODE HERE >>

### Drop rows with NAs in the target variable `MV`
* Print the number of remaining rows

In [None]:
<< YOUR CODE HERE >>

### Fill the NAs in the remaining columns using a mean strategy
* Use the `fill_na` function provided above
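
As a rough illustration of what a mean strategy does to a single column, the following pure-Python sketch replaces each missing entry with the mean of the observed entries. The function name `fill_missing_with_mean` and the sample values are hypothetical; this is not the provided `fill_na` helper and does not use Spark.

```python
from statistics import mean


def fill_missing_with_mean(column):
    """Replace None entries with the mean of the non-missing entries."""
    observed = [v for v in column if v is not None]
    fill_value = mean(observed)  # missing entries are ignored when computing the mean
    return [fill_value if v is None else v for v in column]


# Hypothetical column with one missing value
column = [1.0, None, 3.0, 5.0]
filled = fill_missing_with_mean(column)
print(filled)
```

Swapping `mean` for `statistics.median` would give the median strategy under the same assumptions.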

In [None]:
<< YOUR CODE HERE >>

### Create feature vector using VectorAssembler

* Create a vector column composed of _all_ the features
* Don't include the label "MV" here, since the label isn't a feature

In [None]:
from pyspark.ml.feature import VectorAssembler

<< YOUR CODE HERE >>

### Print first 5 rows of the created dataframe

In [None]:
<< YOUR CODE HERE >>

### Rename the column `MV` to `Label`

In [None]:
<< YOUR CODE HERE >>

### Split the dataframe using the randomSplit() function 
 * Create train and test dataframes with a 75:25 split
 * Pass seed=42 to randomSplit() to keep results consistent across all submissions.
 * Print the number of rows in the train and test dataframes
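
The pure-Python sketch below illustrates the general idea of a seeded, weighted random split: each row is assigned to a split by an independent random draw, so the resulting sizes are only approximately 75:25, and a fixed seed makes the assignment repeatable. The function `random_split` and the 1000 dummy rows are hypothetical; this is not Spark's implementation and will not reproduce the rows that randomSplit() selects.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split using a seeded random draw."""
    rng = random.Random(seed)
    total = sum(weights)

    # Cumulative upper bounds, e.g. [0.75, 1.0] for weights [0.75, 0.25]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)

    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any floating-point rounding gap
            if draw < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


rows = list(range(1000))  # hypothetical stand-in for dataframe rows
train, test = random_split(rows, [0.75, 0.25], seed=42)
print(len(train), len(test))
```

Because the assignment is random per row, printing the row counts (as the task asks) is the way to see the actual split sizes.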

In [None]:
train_df, test_df = << YOUR CODE HERE >>

### Use the StandardScaler to standardize your data.
* **IMPORTANT** - Fit the scaler on the training data only
* Standardize values to have zero mean and unit standard deviation
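
To make the "training data only" rule concrete, here is a minimal pure-Python sketch for a single feature: the mean and standard deviation are learned from the training values and then reused unchanged for the test values. The helper names and sample numbers are hypothetical. The sketch uses the sample standard deviation (dividing by n - 1), which is what Spark's StandardScaler documents as its "unit std". Note also that StandardScaler defaults to `withMean=False`, so centering to zero mean has to be enabled explicitly.

```python
from statistics import mean, stdev


def fit_scaler(train_values):
    """Learn the mean and sample standard deviation from training data only."""
    return mean(train_values), stdev(train_values)


def transform(values, mu, sigma):
    """Standardize values with statistics that were learned elsewhere."""
    return [(v - mu) / sigma for v in values]


# Hypothetical values of one feature
train_values = [5.0, 6.0, 7.0]
test_values = [6.5, 8.0]

mu, sigma = fit_scaler(train_values)           # statistics come from train only
train_scaled = transform(train_values, mu, sigma)
test_scaled = transform(test_values, mu, sigma)  # same mu and sigma reused
print(mu, sigma, train_scaled, test_scaled)
```

The scaled training values have zero mean and unit standard deviation; the scaled test values generally do not, which is expected.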

In [None]:
from pyspark.ml.feature import StandardScaler

<< YOUR CODE HERE >>

### Scale both the training and test data using the mean and std learned by the fitted scaler.

In [None]:
<< YOUR CODE HERE >>

### Use `scaler_model.mean`, `scaler_model.std` to see the mean and std for each feature

In [None]:
<< YOUR CODE HERE >>

### Select only the `features` and `label` columns from both the train and test datasets

In [None]:
<< YOUR CODE HERE >>

### Show the first 5 rows of the resulting train dataframe

In [None]:
<< YOUR CODE HERE >>

### Use LinearRegression for training a regression model.
* Use maxIter = 100.
* Try each of the following settings for regParam and elasticNetParam and see which works better.
  1. regParam = 0, elasticNetParam = 0
  2. regParam = 0.3, elasticNetParam = 0.5

Look into the [API](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.ml.regression.LinearRegression.html) specification to get more details.
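
For intuition about the two parameters, the Spark documentation describes the regularization term added to the squared-error loss as regParam * (elasticNetParam * ||w||_1 + (1 - elasticNetParam) / 2 * ||w||_2^2). The pure-Python sketch below evaluates that term for a hypothetical weight vector; the function name `elastic_net_penalty` and the weights are illustrative and not part of the PySpark API.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Evaluate the elastic net regularization term for a weight vector."""
    l1 = sum(abs(w) for w in weights)        # L1 norm
    l2_squared = sum(w * w for w in weights)  # squared L2 norm
    return reg_param * (
        elastic_net_param * l1
        + (1.0 - elastic_net_param) / 2.0 * l2_squared
    )


weights = [1.0, -2.0]  # hypothetical coefficients

# Setting 1: regParam = 0 switches the penalty off entirely
no_penalty = elastic_net_penalty(weights, 0.0, 0.0)

# Setting 2: regParam = 0.3 with an even mix of L1 and L2
mixed_penalty = elastic_net_penalty(weights, 0.3, 0.5)

print(no_penalty, mixed_penalty)
```

With elasticNetParam = 0 the term is a pure L2 (ridge) penalty, and with elasticNetParam = 1 it is a pure L1 (lasso) penalty.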

In [None]:
<< YOUR CODE HERE >>

### Print the coefficients and intercept of the linear regression model

In [None]:
print("Coefficients: " + << YOUR CODE HERE >>)
print("Intercept: " + << YOUR CODE HERE >>)

### Print the training results
* Print the root mean squared error (RMSE) on the training data
* Print the coefficient of determination (r2) on the training data
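
As a reference for what the two metrics measure, the sketch below computes them by hand in pure Python using the standard definitions: RMSE is the square root of the mean squared residual, and r2 is one minus the ratio of the residual sum of squares to the total sum of squares. The function names and the sample labels and predictions are hypothetical; in the assignment these values come from Spark rather than from hand-written code.

```python
from math import sqrt


def rmse(labels, predictions):
    """Root mean squared error between labels and predictions."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return sqrt(sum(squared_errors) / len(labels))


def r2(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot


# Hypothetical labels and predictions
labels = [20.0, 22.0, 24.0, 26.0]
predictions = [21.0, 21.0, 25.0, 25.0]

print("RMSE: %f" % rmse(labels, predictions))
print("r2: %f" % r2(labels, predictions))
```

A lower RMSE and an r2 closer to 1 both indicate a better fit, and RMSE is expressed in the same units as the label.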

In [None]:
<< YOUR CODE HERE >>

print("RMSE: %f" % << YOUR CODE HERE >>)
print("r2: %f" % << YOUR CODE HERE >>)

### Test the model on test data
* Print the RMSE and r2 on test data
* Hint - Refer to [`RegressionEvaluator`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.evaluation.RegressionEvaluator.html)

In [None]:
from pyspark.ml.evaluation import RegressionEvaluator

<< YOUR CODE HERE >>

print("RMSE: %f" % << YOUR CODE HERE >>)
print("r2: %f" % << YOUR CODE HERE >>)

### Plot results on test data (using matplotlib)

 * The test data contains both a label and a prediction for every row.
 * Draw the labels (in blue) and the predictions (in red) on a single scatter plot so that you can compare the predictions with the ground truth.


In [None]:
<< YOUR CODE HERE >>

### Add regularization to model
* Try different values of regularization parameters `regParam` and `elasticNetParam` to see how performance changes.
* Look into the API specification for [regParam](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.LinearRegression.html#pyspark.ml.regression.LinearRegression.regParam) and [elasticNetParam](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.LinearRegression.html#pyspark.ml.regression.LinearRegression.elasticNetParam) to get more details.

In [None]:
print("Best elasticNetParam = ", << YOUR ANSWER HERE >>)

### Stop the Spark session

In [None]:
<< YOUR CODE HERE >>