# Multiple Linear Regression

In [1]:
import findspark
findspark.init()

In [2]:
from pyspark.sql import SparkSession 

# Name the session `spark`: later cells call spark.read and spark.createDataFrame
spark = SparkSession.builder \
.master("local[4]")\
.appName("MultipleRegression")\
.config("spark.executor.memory","3g")\
.config("spark.driver.memory","3g")\
.getOrCreate()

sc = spark.sparkContext

### Reading the dataset

In [3]:
ad_df = spark.read.format("csv")\
.option("header","True")\
.option("inferSchema", "True")\
.option("sep", ",")\
.load("data/Advertising.csv")

In [4]:
new_attributes = ["id","TV", "Radio","Newspaper","label"]

In [5]:
ad_df2 = ad_df.selectExpr("_c0 as id", "TV","Radio","Newspaper","Sales as label")

In [6]:
ad_df2.toPandas().head()

Unnamed: 0,id,TV,Radio,Newspaper,label
0,1,230.1,37.8,69.2,22.1
1,2,44.5,39.3,45.1,10.4
2,3,17.2,45.9,69.3,9.3
3,4,151.5,41.3,58.5,18.5
4,5,180.8,10.8,58.4,12.9


In [7]:
numeric_attributes = ["TV","Radio","Newspaper"]
label = ["label"]

In [8]:
ad_df2.describe().toPandas().head()

Unnamed: 0,summary,id,TV,Radio,Newspaper,label
0,count,200.0,200.0,200.0,200.0,200.0
1,mean,100.5,147.0425,23.264000000000024,30.553999999999995,14.022500000000004
2,stddev,57.87918451395112,85.85423631490805,14.846809176168728,21.77862083852283,5.217456565710477
3,min,1.0,0.7,0.0,0.3,1.6
4,max,200.0,296.4,49.6,114.0,27.0


The count row of the describe table is 200 for every column, so the dataset has no null values. The remaining rows give the mean, standard deviation, minimum, and maximum of each column.

## Data Preparation

### Transforming by VectorAssembler

In [9]:
from pyspark.ml.feature import VectorAssembler

vector_assembler  = VectorAssembler()\
.setInputCols(numeric_attributes)\
.setOutputCol("features")

### Regression Model

In [10]:
from pyspark.ml.regression import LinearRegression
linear_obj = LinearRegression()\
.setFeaturesCol("features")\
.setLabelCol("label")

### Pipeline

In [11]:
from pyspark.ml import Pipeline

pipeline_obj = Pipeline()\
.setStages([vector_assembler, linear_obj])

### Train-Test splitting

In [12]:
train_df, test_df = ad_df2.randomSplit([0.8, 0.2], seed=142)

### Model training

In [13]:
pipeline_model = pipeline_obj.fit(train_df)

### Model testing

In [14]:
result_df = pipeline_model.transform(test_df)
result_df.toPandas().head()

Unnamed: 0,id,TV,Radio,Newspaper,label,features,prediction
0,3,17.2,45.9,69.3,9.3,"[17.2, 45.9, 69.3]",12.90928
1,6,8.7,48.9,75.0,7.2,"[8.7, 48.9, 75.0]",13.145714
2,9,8.6,2.1,1.0,4.8,"[8.6, 2.1, 1.0]",3.658976
3,10,199.8,2.6,21.2,10.6,"[199.8, 2.6, 21.2]",12.276161
4,17,67.8,36.6,114.0,12.5,"[67.8, 36.6, 114.0]",13.492679


##### Getting the linear model from the pipeline

In [15]:
lr_model = pipeline_model.stages[1]

In [16]:
lr_model.coefficients

DenseVector([0.0441, 0.1964, 0.0039])

In [17]:
lr_model.intercept

2.8630452712927066

In [18]:
lr_model.summary.r2

0.8931175171003486

The model explains 89.31% of the total variability in the training data. This value was 72% in the previous linear regression.

In [19]:
lr_model.summary.rootMeanSquaredError

1.6561100287995882
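For reference, RMSE is the square root of the mean squared difference between actual and predicted values. The sketch below applies that formula in plain Python to the five test rows displayed earlier. The helper name `rmse` is illustrative, and five rows are far too few to represent the test set, so the result is not comparable to the training RMSE reported by `lr_model.summary`.

```python
import math

# (label, prediction) pairs copied from the first five test rows shown earlier
rows = [
    (9.3, 12.90928),
    (7.2, 13.145714),
    (4.8, 3.658976),
    (10.6, 12.276161),
    (12.5, 13.492679),
]

def rmse(pairs):
    """Root mean squared error over (actual, predicted) pairs."""
    squared_errors = [(actual - predicted) ** 2 for actual, predicted in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

sample_rmse = rmse(rows)
print(round(sample_rmse, 2))
```

The summary value above is computed on the training data, so a full evaluation on `test_df` would still be needed to judge how well the model generalizes.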

In [20]:
lr_model.summary.pValues

[0.0, 0.0, 0.5717102604020492, 3.810285420513537e-13]

In [21]:
lr_model.summary.tValues

[27.723526858081737, 20.32452728577851, 0.5667562977703877, 7.950153508044762]

#### Model evaluation is finished. Next, we build a new model based on the p-values.

## Model Selection

We included all variables in the model and computed their p-values, using a significance threshold of 0.05. The p-values are as follows (the fourth entry in the output, 3.81e-13, belongs to the intercept):

TV --> 0.0

Radio --> 0.0

Newspaper --> 0.5717

The Newspaper p-value is well above the threshold, so we drop Newspaper from the model.
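The selection rule can be sketched in plain Python as a simple filter over the p-values. The function name `select_features` and the constant `ALPHA` are illustrative and are not part of Spark. The p-values are copied from the output above, and the trailing intercept p-value is ignored.

```python
features = ["TV", "Radio", "Newspaper"]
# Copied from lr_model.summary.pValues; the last entry belongs to the intercept
p_values = [0.0, 0.0, 0.5717102604020492, 3.810285420513537e-13]
ALPHA = 0.05  # significance threshold used in this notebook

def select_features(names, p_values, alpha):
    """Keep the features whose p-value does not exceed alpha.

    zip() stops at the shorter sequence, so the extra intercept
    p-value at the end of the list is skipped.
    """
    return [name for name, p in zip(names, p_values) if p <= alpha]

kept = select_features(features, p_values, ALPHA)
print(kept)  # ['TV', 'Radio']
```

The resulting list matches the `numeric_attributes` used for the second model.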

#### [ Old Model ] ===> y = 2.863 + (0.0441 * TV) + (0.1964 * Radio) + (0.0039 * Newspaper)

In [22]:
numeric_attributes = ["TV","Radio"]
label = ["label"]

In [23]:
vector_assembler  = VectorAssembler()\
.setInputCols(numeric_attributes)\
.setOutputCol("features")

In [24]:
linear_obj = LinearRegression()\
.setFeaturesCol("features")\
.setLabelCol("label")

In [25]:
pipeline_obj = Pipeline()\
.setStages([vector_assembler, linear_obj])

In [26]:
train_df, test_df = ad_df2.randomSplit([0.8, 0.2], seed=142)

In [27]:
pipeline_model = pipeline_obj.fit(train_df)

In [28]:
result_df = pipeline_model.transform(test_df)
result_df.toPandas().head()

Unnamed: 0,id,TV,Radio,Newspaper,label,features,prediction
0,3,17.2,45.9,69.3,9.3,"[17.2, 45.9]",12.77388
1,6,8.7,48.9,75.0,7.2,"[8.7, 48.9]",12.991416
2,9,8.6,2.1,1.0,4.8,"[8.6, 2.1]",3.73113
3,10,199.8,2.6,21.2,10.6,"[199.8, 2.6]",12.283048
4,17,67.8,36.6,114.0,12.5,"[67.8, 36.6]",13.17162


In [30]:
lr_model = pipeline_model.stages[1]

In [31]:
print("B1 and B2 coefficient: ",lr_model.coefficients)
print("\t Bo Intercept: ", lr_model.intercept)
print("\t\t  R^2: ", lr_model.summary.r2)
print("\t\tRMSE : ", lr_model.summary.rootMeanSquaredError)
print("p-values: ", lr_model.summary.pValues)
print("t-values: ", lr_model.summary.tValues)

B1 and B2 coefficient:  [0.044210411496210966,0.19777489934012493]
	 Bo Intercept:  2.935593134859488
		  R^2:  0.8928931248714045
		RMSE :  1.6578475603790448
p-values:  [0.0, 0.0, 3.774758283725532e-15]
t-values:  [27.918094216203865, 21.216582516976807, 8.740412243937218]


The previous R² was 0.8931, and after removing Newspaper it is 0.8929, which is essentially unchanged. We therefore conclude that the Newspaper variable contributes very little to explaining y (sales). We remove it and write the new model as follows.

#### [ New Model ]  ==>  y = 2.936 + (0.0442 * TV) + (0.1978 * Radio)
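One way to back up the comparison of the two R² values is adjusted R², which penalizes each extra predictor. The sketch below is plain Python with an assumed training-set size of 160 rows (80% of 200). `randomSplit` only approximates the requested proportions, so the exact count may differ. The function name `adjusted_r2` is illustrative.

```python
def adjusted_r2(r2, n_rows, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_rows - 1) / (n_rows - n_predictors - 1)

# Assumption: about 160 training rows; the real count depends on randomSplit
N_TRAIN = 160

# R^2 values copied from the two model summaries in this notebook
adj_full = adjusted_r2(0.8931175171003486, N_TRAIN, 3)     # TV, Radio, Newspaper
adj_reduced = adjusted_r2(0.8928931248714045, N_TRAIN, 2)  # TV, Radio

print(adj_reduced > adj_full)  # True
```

Under this assumption the two-variable model scores slightly higher, which is consistent with dropping Newspaper. Recent Spark versions also expose `r2adj` on the training summary, which would use the actual row count.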

## Prediction

Question: What would sales be if we spent 150,000 on TV ads and 20,000 on Radio ads (entered as 150.0 and 20.0 in the dataset's units)?

In [35]:
import pandas as pd 
data = { "TV" : [150.0], "Radio" : [20.0] } 
pd_df = pd.DataFrame(data)
pd_df.head()

Unnamed: 0,TV,Radio
0,150.0,20.0


In [36]:
predict_df = spark.createDataFrame(pd_df)
predict_df.show()

+-----+-----+
|   TV|Radio|
+-----+-----+
|150.0| 20.0|
+-----+-----+



In [38]:
predict_vector = vector_assembler.transform(predict_df)

In [40]:
lr_model.transform(predict_vector).show()

+-----+-----+------------+------------------+
|   TV|Radio|    features|        prediction|
+-----+-----+------------+------------------+
|150.0| 20.0|[150.0,20.0]|13.522652846093631|
+-----+-----+------------+------------------+



[Interpretation]: The predicted sales value is about 13.52 if we spend 150.0 on TV ads and 20.0 on Radio ads.
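As a sanity check, the same prediction can be reproduced by hand from the fitted equation. The sketch below is plain Python, with the intercept and coefficients copied from the printed output of the second model. The function name `predict_sales` is illustrative.

```python
# Values copied from the second model's printed intercept and coefficients
intercept = 2.935593134859488
coefficients = {"TV": 0.044210411496210966, "Radio": 0.19777489934012493}

def predict_sales(budgets):
    """Evaluate y = b0 + b1 * TV + b2 * Radio for one row of budgets."""
    return intercept + sum(coefficients[name] * budgets[name] for name in coefficients)

manual_prediction = predict_sales({"TV": 150.0, "Radio": 20.0})
print(round(manual_prediction, 2))  # 13.52
```

The result agrees with the value returned by `lr_model.transform`, apart from floating-point rounding.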