### Boston Housing

Using the Boston House-price dataset available at the URL provided below, perform the following tasks using PySpark: 

1. Compute the pairwise correlations of the variables; 
2. Select the top three variables based on the pairwise correlations of the variables; 
3. Create a regression model using a polynomial function of degree two on the three selected variables. Use 70% of the data for training; and
4. Compute the R-squared value of the model using the remaining 30% of the data as the test set.


### Import necessary libraries

First, let's import the necessary libraries; the dataset is then loaded from the CSV file.

In [None]:
!pip install seaborn  # visualisation library
import pandas as pd  # pandas for data manipulation and analysis; used here to draw the scatter plots

# Seaborn is a library for making statistical graphics in Python.
# It builds on top of matplotlib and integrates closely with pandas data structures.
# Seaborn helps you explore and understand your data.
import seaborn as sb

from matplotlib import pyplot as plt  # matplotlib creates the figures and axes used to plot the data
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.types import DoubleType
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import col  # refers to a DataFrame column by its name
from pyspark.sql import SparkSession



In [None]:
spark = SparkSession.builder.getOrCreate()


### Import the Dataset

 #### Variables in order:
 
 * **CRIM**     per capita crime rate by town
 
 * **ZN**       proportion of residential land zoned for lots over 25,000 sq.ft.
 
 * **INDUS**    proportion of non-retail business acres per town
 
 * **CHAS**     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
 
 * **NOX**     nitric oxides concentration (parts per 10 million)
 
 * **RM**       average number of rooms per dwelling
 
 * **AGE**      proportion of owner-occupied units built prior to 1940
 
 * **DIS**      weighted distances to five Boston employment centres
 
 * **RAD**      index of accessibility to radial highways
 
 * **TAX**      full-value property-tax rate per $10,000
 
 * **PTRATIO**  pupil-teacher ratio by town
 
 * **B**        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
 
 * **LSTAT**    % lower status of the population
 
 * **MEDV**     Median value of owner-occupied homes in $1000's

In [None]:
# inferSchema=True lets Spark infer numeric column types; without it every column is read as a string.
boston_housing = spark.read.csv('boston.csv', header=True, inferSchema=True)
boston_housing.show()
print(boston_housing.dtypes)


### Question 1:

Compute the pairwise correlations of the variables;

In [None]:
boston_housing_pandas_dataframe = boston_housing.toPandas()
fig, ax = plt.subplots(figsize=(15, 15))
sb.heatmap(boston_housing_pandas_dataframe.corr(), cmap="Blues", annot=True, ax=ax)


### Question 2:  

Select the top three variables based on the pairwise correlations of the variables; 

In [None]:
fig, ax = plt.subplots(1, 3, figsize=(15, 5))

boston_housing_pandas_dataframe.plot.scatter(x='MEDV', y='LSTAT', ax=ax[0])
boston_housing_pandas_dataframe.plot.scatter(x='MEDV', y='RM', ax=ax[1])
boston_housing_pandas_dataframe.plot.scatter(x='MEDV', y='PTRATIO', ax=ax[2])

plt.show()
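The choice of LSTAT, RM and PTRATIO can also be made programmatically by ranking the variables on the absolute value of their correlation with MEDV. The snippet below is only a sketch of that ranking step: the helper name `top_k_by_abs_correlation` is hypothetical, and the numbers in `corr_with_medv` are placeholder values for illustration, not output of this notebook.

```python
def top_k_by_abs_correlation(correlations, target, k=3):
    """Return the k variables whose correlation with the target is largest in magnitude."""
    # The target always correlates perfectly with itself, so leave it out.
    candidates = {name: r for name, r in correlations.items() if name != target}
    ranked = sorted(candidates, key=lambda name: abs(candidates[name]), reverse=True)
    return ranked[:k]


# Placeholder correlations with MEDV (illustrative values only).
corr_with_medv = {
    "MEDV": 1.0,
    "LSTAT": -0.74,
    "RM": 0.70,
    "PTRATIO": -0.51,
    "INDUS": -0.48,
    "CHAS": 0.18,
}

top_three = top_k_by_abs_correlation(corr_with_medv, target="MEDV")
print(top_three)
```

Ranking on the absolute value matters because a strong negative correlation is just as informative as a strong positive one. With the pandas DataFrame used above, the same idea could probably be expressed as `boston_housing_pandas_dataframe.corr()['MEDV'].drop('MEDV').abs().nlargest(3)`.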



The correlation coefficient ranges from -1 to 1. When it is close to 1, there is a strong positive correlation; for example, the median value (MEDV) tends to go up when the number of rooms (RM) goes up. When the coefficient is close to -1, there is a strong negative correlation; the median value (MEDV) tends to go down when the percentage of the lower status of the population (LSTAT) goes up.
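To make the meaning of the coefficient concrete, here is a small sketch of the Pearson correlation written in plain Python. The function name `pearson_r` and the toy lists are assumptions made for illustration; they are not taken from the Boston data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy data: one series rises with the number of rooms, the other falls.
rooms = [4.0, 5.0, 6.0, 7.0, 8.0]
price_up = [12.0, 17.0, 21.0, 30.0, 41.0]
price_down = [41.0, 30.0, 21.0, 17.0, 12.0]

r_pos = pearson_r(rooms, price_up)    # close to +1
r_neg = pearson_r(rooms, price_down)  # close to -1
print(round(r_pos, 3), round(r_neg, 3))
```

The heat map above shows this same quantity for every pair of columns.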

### Question 3:

Create a regression model using a polynomial function of degree two on the three selected variables. Use 70% of the data for training;

$$ y = a \times x^2 + b \times x + c $$
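The formula above covers a single variable. With three variables, a full degree-two polynomial would usually also contain the cross terms (for example LSTAT times RM). The sketch below lists and evaluates those terms in plain Python; the helper names `degree_two_terms` and `expand_row` and the sample row are hypothetical and serve only to illustrate the idea.

```python
from itertools import combinations_with_replacement


def degree_two_terms(names):
    """List the terms of a full degree-two polynomial: linear, squared and cross terms."""
    linear = [(name,) for name in names]
    quadratic = list(combinations_with_replacement(names, 2))
    return linear + quadratic


def expand_row(row, names):
    """Evaluate every degree-two term for one observation given as a dict."""
    features = {}
    for term in degree_two_terms(names):
        value = 1.0
        for name in term:
            value *= row[name]
        features["*".join(term)] = value
    return features


names = ["LSTAT", "RM", "PTRATIO"]
row = {"LSTAT": 2.0, "RM": 3.0, "PTRATIO": 5.0}  # made-up values, not from the dataset
expanded = expand_row(row, names)
for term, value in expanded.items():
    print(term, value)
```

Three variables give three linear terms, three squared terms and three cross terms, so nine features in total. PySpark ships a `PolynomialExpansion` transformer in `pyspark.ml.feature` that performs this kind of expansion on a vector column, which may be a convenient alternative to building the squared columns by hand.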


In [None]:
# Add the squared terms needed for the degree-two polynomial.
boston_housing = boston_housing.withColumn("LSTAT2", col("LSTAT") * col("LSTAT"))
boston_housing = boston_housing.withColumn("RM2", col("RM") * col("RM"))

# Assemble [x^2, x] into a single feature vector for each variable.
rmAssembler = VectorAssembler(inputCols=['RM2', 'RM'], outputCol='rm_features')
lstatAssembler = VectorAssembler(inputCols=['LSTAT2', 'LSTAT'], outputCol='lstat_features')

df_rm = rmAssembler.transform(boston_housing).select(['MEDV', 'rm_features'])
df_lstat = lstatAssembler.transform(boston_housing).select(['MEDV', 'lstat_features'])

df_rm.show()
df_lstat.show()

# Split each DataFrame into roughly 70% training rows and 30% test rows.
df_training_lstat, df_test_lstat = df_lstat.randomSplit([0.7, 0.3])
df_training_rm, df_test_rm = df_rm.randomSplit([0.7, 0.3])
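One thing to keep in mind about the split: `randomSplit` assigns rows at random instead of cutting at an exact row count, so the result is only approximately 70/30 and differs between runs unless a `seed` is passed. The plain-Python sketch below imitates that behaviour under this assumption; `bernoulli_split` is a hypothetical helper, not part of PySpark.

```python
import random


def bernoulli_split(rows, train_fraction=0.7, seed=42):
    """Send each row to the training set with probability train_fraction."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test


rows = list(range(1000))
train_rows, test_rows = bernoulli_split(rows)
print(len(train_rows), len(test_rows))
```

Because the coefficients and R2 values reported below depend on which rows land in the training set, passing a seed (for example `randomSplit([0.7, 0.3], seed=42)`) is a simple way to make them repeatable.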


### Create a regression model

**maxIter**: the maximum number of iterations the solver performs before stopping.

In [None]:
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

pr = LinearRegression(featuresCol="lstat_features", labelCol="MEDV", maxIter=30)
prModel = pr.fit(df_training_lstat)

print("Coefficients: " + str(prModel.coefficients))
print("Intercept: " + str(prModel.intercept))

# prModel.summary describes the fit on the training data.
print("R2 on training data:", prModel.summary.r2)



$$ y=a \times x^2+ b \times x+ c $$



$$MEDV = 0.055 \times (LSTAT)^2 - 2.55 \times (LSTAT) + 43.44$$
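As a quick sanity check, the fitted equation can be evaluated by hand. The sketch below uses the rounded coefficients from the equation above; the function name `predict_medv` is an assumption for illustration, and a rerun of the notebook may give slightly different coefficients.

```python
# Rounded coefficients copied from the fitted equation.
A, B, C = 0.055, -2.55, 43.44


def predict_medv(lstat):
    """Evaluate MEDV = A * LSTAT^2 + B * LSTAT + C."""
    return A * lstat ** 2 + B * lstat + C


# LSTAT value at which the parabola reaches its minimum.
vertex_lstat = -B / (2 * A)

print(round(predict_medv(10.0), 2))
print(round(vertex_lstat, 2))
```

For LSTAT = 10 this gives 5.5 - 25.5 + 43.44, or about 23.44. The curve bottoms out near LSTAT = 23.2 and rises again beyond that point, which is a property of the quadratic form and not necessarily of the housing market, so predictions for large LSTAT values should be read with care.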

In [None]:
import numpy as np

# 100 evenly spaced values from 0 to 50
x = np.linspace(0, 50, 100)



In [None]:
x

In [None]:
# Evaluate the fitted polynomial a*x^2 + b*x + c at every point in x.
fx = prModel.coefficients[0] * x**2 + prModel.coefficients[1] * x + prModel.intercept



In [None]:
fx

In [None]:
plt.plot(x, fx)
plt.xlabel("LSTAT")
plt.ylabel("Predicted MEDV")
plt.show()

*An R-squared of 0.65 indicates that approximately 65% of the variability in "MEDV" can be explained by the model and the independent variable(s) considered.*
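R-squared compares the squared prediction errors with the spread of the observed values around their mean. The sketch below spells out that definition in plain Python; the function name `r_squared` and the toy numbers are assumptions for illustration only.

```python
def r_squared(actual, predicted):
    """R2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Toy values, not taken from the Boston data.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

r2 = r_squared(actual, predicted)
print(round(r2, 3))
```

A perfect model gives 1, and a model that always predicts the mean gives 0. On held-out data the value can even be negative when the model does worse than predicting the mean.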

#### Compute the R-Squared value of the model using the remaining 30% of the test data

In [None]:
pr_predictions = prModel.transform(df_test_lstat)
pr_predictions.show()

pr_predictions.select("prediction", "MEDV", "lstat_features").show()

pr_evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="MEDV", metricName="r2")

print("R2 on test data:", pr_evaluator.evaluate(pr_predictions))


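Finally, compare the two R2 values: if the R2 on the test data is much lower than the R2 on the training data, the model is likely overfitting.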