# Problem 3 - Apache Spark Walkthrough

REMEMBER TO SWITCH OVER TO KAGGLE AT https://www.kaggle.com/mehulraheja/keras-apache-spark-problem-3

In this problem, we'll work with California housing data and use Spark to run parallelized linear regression on some of its columns. The results aren't very accurate, but the exercise is a good introduction to several of Spark's features.

In [None]:
!pip install pyspark

In [None]:
import os
import pandas as pd
import numpy as np

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession, SQLContext

from pyspark.sql.types import *
import pyspark.sql.functions as F
from pyspark.sql.functions import udf, col

from pyspark.ml.regression import LinearRegression
from pyspark.mllib.evaluation import RegressionMetrics

from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, CrossValidatorModel
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.evaluation import RegressionEvaluator

import matplotlib.pyplot as plt

## Housing Data Set

The California Housing data set appeared in a 1997 paper titled *Sparse Spatial Autoregressions*, written by R. Kelley Pace and Ronald Barry and published in the journal Statistics & Probability Letters. The researchers built this data set from the 1990 California census data.

The data contains one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people). In this sample a block group on average includes 1425.5 individuals living in a geographically compact area.

These spatial data contain 20,640 observations on housing prices with 9 economic variables:

<pre><strong>Longitude:</strong> refers to the angular distance of a geographic place east or west of the prime meridian for each block group
<strong>Latitude:</strong> refers to the angular distance of a geographic place north or south of the earth’s equator for each block group
<strong>Housing Median Age:</strong> is the median age of the houses in a block group. Note that the median is the value that lies at the midpoint of a frequency distribution of observed values
<strong>Total Rooms:</strong> is the total number of rooms in the houses per block group
<strong>Total Bedrooms:</strong> is the total number of bedrooms in the houses per block group
<strong>Population:</strong> is the number of inhabitants of a block group
<strong>Households:</strong> refers to units of houses and their occupants per block group
<strong>Median Income:</strong> is the median income of the people that belong to a block group
<strong>Median House Value:</strong> is the dependent variable and refers to the median house value per block group
</pre>

What's more, all block groups that have zero entries for the independent and dependent variables have been excluded from the data.

The Median house value is the dependent variable and will be assigned the role of the target variable in our ML model.

In [None]:
spark = SparkSession.builder.master("local[2]").appName("Linear-Regression-California-Housing").getOrCreate()
path = '../input/hausing-data/cal_housing.data'

schema = StructType([
    StructField("long", FloatType(), nullable=True),
    StructField("lat", FloatType(), nullable=True),
    StructField("medage", FloatType(), nullable=True),
    StructField("totrooms", FloatType(), nullable=True),
    StructField("totbdrms", FloatType(), nullable=True),
    StructField("pop", FloatType(), nullable=True),
    StructField("houshlds", FloatType(), nullable=True),
    StructField("medinc", FloatType(), nullable=True),
    StructField("medhv", FloatType(), nullable=True)]
)

housing_df = spark.read.csv(path=path, schema=schema).cache()
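The schema above assigns its nine names to the file's columns purely by position, so it is worth checking that the order matches the raw file. Here is a minimal sketch in plain Python, assuming each line of `cal_housing.data` holds nine comma-separated numbers in the order listed in the schema. The `parse_line` helper and the `sample_line` values are made up for illustration and are not part of the notebook.

```python
# Column names in the same order as the StructType above.
COLUMNS = ["long", "lat", "medage", "totrooms", "totbdrms",
           "pop", "houshlds", "medinc", "medhv"]

def parse_line(line):
    """Split one comma-separated record and pair each value with its column name."""
    values = [float(v) for v in line.strip().split(",")]
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
    return dict(zip(COLUMNS, values))

# Illustrative record in the assumed format (values are made up for the example).
sample_line = "-122.25,37.85,30.0,1200.0,250.0,600.0,230.0,4.5,250000.0"
record = parse_line(sample_line)
print(record["medinc"], record["medhv"])
```

If a field is missing or the column order differs, the mismatch shows up here rather than as silently mislabeled columns in the DataFrame.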

# PART A: Basic Spark Commands

## (a) Display the first five rows of the Spark dataframe


In [None]:
## YOUR CODE HERE ##

## (b) Create a new 1x1 dataframe, result, which contains the average of the population column

Your result should be around 1425

In [None]:
result = ## YOUR CODE HERE ##
result.show(1)

## (c) Save the pandas version of housing_df as pandas_housing_df

Feel free to look up the documentation on how this is done.

In [None]:
pandas_housing_df = ## YOUR CODE HERE ##

In [None]:
#Checks if your code worked by plotting median age
plt.hist(pandas_housing_df['medage'])
plt.xlabel('Median House Age')
plt.ylabel('Number of Block Groups')  # each row of the data is one block group
plt.title('Histogram of Median House Ages in California Block Groups')

# PART B: Basic Spark Machine Learning

## (d) First, define the set of feature columns that we would like to use as input

For now, let's go with Median Age, Total Bedrooms, Median Income, and Total Rooms.

In [None]:
feature_cols = ## YOUR CODE HERE ##

## (e) Use VectorAssembler to create a column named "features" which contains the desired features

In [None]:
assembler = VectorAssembler(## YOUR CODE HERE ##)
assembled_df = ## YOUR CODE HERE ##

In [None]:
assembled_df.show(10, truncate=False)
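Conceptually, a vector assembler takes the chosen input columns of each row and packs their values, in order, into a single vector column. The plain-Python sketch below illustrates that idea only. The `assemble` helper and the sample rows are hypothetical, and this is not how Spark implements `VectorAssembler` internally.

```python
def assemble(rows, input_cols, output_col="features"):
    """Return copies of the rows with an extra column holding the selected values as a list."""
    assembled = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        new_row[output_col] = [float(row[c]) for c in input_cols]
        assembled.append(new_row)
    return assembled

# Two made-up rows using the notebook's column names.
rows = [
    {"medage": 30.0, "totrooms": 1200.0, "medinc": 4.5, "medhv": 250000.0},
    {"medage": 15.0, "totrooms": 3000.0, "medinc": 6.1, "medhv": 340000.0},
]
out = assemble(rows, ["medage", "medinc"])
print(out[0]["features"])  # [30.0, 4.5]
```

The original columns stay in place next to the new one, which is why the label column can still be read from the assembled data later.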

## (f) Randomly split the data into 80% train data and 20% testing data

In [None]:
train_data, test_data = ## YOUR CODE HERE ##

In [None]:
train_data.show(10)

In [None]:
test_data.show(10)
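A random 80/20 split assigns each row to one side by a random draw, so the resulting sizes are only approximately 80% and 20%, and fixing a seed makes the split repeatable. The sketch below shows the idea in plain Python. The `random_split` helper is hypothetical and only mimics, as an assumption, the per-row sampling behavior of a DataFrame random split.

```python
import random

def random_split(rows, train_fraction=0.8, seed=42):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)  # seeded generator so the split is reproducible
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(1000))
train, test = random_split(rows)
print(len(train), len(test))  # sizes are close to 800 and 200, not exact
```

Every row lands in exactly one of the two sets, so no test row leaks into training.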

## (g) Set linearModel equal to the result of fitting the given linear regression on the training data

In [None]:
lr = (LinearRegression(featuresCol='features', labelCol="medhv", predictionCol='predmedhv', 
                               maxIter=10, regParam=0.3, elasticNetParam=0.8, standardization=False))
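In the estimator above, `regParam` sets the overall regularization strength and `elasticNetParam` mixes the L1 and L2 penalties (0 is pure ridge, 1 is pure lasso). Below is a small sketch of the penalty term, assuming the elastic net form given in the Spark ML documentation, λ(α‖w‖₁ + (1−α)/2 ‖w‖₂²). The `elastic_net_penalty` helper and the example weights are illustrative.

```python
def elastic_net_penalty(weights, reg_param=0.3, elastic_net_param=0.8):
    """Penalty added to the loss: reg_param * (alpha * L1 + (1 - alpha) / 2 * L2^2)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return reg_param * (elastic_net_param * l1
                        + (1.0 - elastic_net_param) / 2.0 * l2_squared)

weights = [1.0, -2.0]  # made-up coefficients
penalty = elastic_net_penalty(weights)
print(round(penalty, 4))  # 0.87
```

With `elasticNetParam=0.8` most of the penalty comes from the L1 term, which tends to push some coefficients to exactly zero.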

In [None]:
linearModel = ## YOUR CODE HERE ##

## (h) Use the linearModel to predict on the test_data

In [None]:
predictions = ## YOUR CODE HERE ##

In [None]:
predictions.show(10)

## (i) Print root mean squared error for the linear model

If you did everything correctly, the RMSE should be around 80,000.

In [None]:
rmse = ## YOUR CODE HERE ##
print("RMSE:", rmse)
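RMSE is the square root of the mean squared error, so it is expressed in the same units as the target (dollars here). That is why a value near 80,000 is plausible, while the MSE itself would be several orders of magnitude larger. A plain-Python sketch with made-up numbers follows. It illustrates the metric only and is not the Spark call the exercise asks for.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def root_mean_squared_error(y_true, y_pred):
    """Square root of the MSE, in the same units as the target."""
    return math.sqrt(mean_squared_error(y_true, y_pred))

# Made-up house values and predictions, in dollars.
y_true = [200000.0, 350000.0, 120000.0]
y_pred = [260000.0, 270000.0, 220000.0]
print(mean_squared_error(y_true, y_pred))
print(root_mean_squared_error(y_true, y_pred))
```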

## (j) Stop Spark

In [None]:
## YOUR CODE HERE ##

### CREDITS:
THIS NOTEBOOK IS HEAVILY INSPIRED BY THE ONE BY FATMAKURSUN WHICH YOU CAN FIND HERE https://www.kaggle.com/fatmakursun/pyspark-ml-tutorial-for-beginners. SOME BLOCKS OF CODE, FOR EXAMPLE LOADING THE DATASET, ARE TAKEN DIRECTLY FROM IT.