## II. PREDICTIVE MODELLING

## Data tools and file handling

In [89]:
# Importing python data analysis tools and machine learning library
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
import pickle

In [90]:
# Opening the csv file containing the baseline data frame whose data we shall use to create and evaluate
# a multiple regression model
model_df = pd.read_csv('/Users/anix/API-deployment/preprocessing/baseline.csv', index_col=0)
# Getting a glimpse of the data frame
model_df.head()

| | Property type | Price | Number of bedrooms | Living area | Surface area land |
|---|---|---|---|---|---|
| 105 | HOUSE | 295000.0 | 1.0 | 70.0 | 417.0 |
| 106 | HOUSE | 235000.0 | 1.0 | 70.0 | 104.0 |
| 107 | HOUSE | 275000.0 | 1.0 | 90.0 | 415.0 |
| 108 | HOUSE | 295000.0 | 1.0 | 70.0 | 417.0 |
| 111 | HOUSE | 239000.0 | 1.0 | 100.0 | 355.0 |


## Testing and training a multiple regression model

In [91]:
# Splitting our dataset into its attributes and labels
# The X variable (i.e. attributes) contains the last three columns of our data frame
# (Number of bedrooms, Living area, Surface area land), while y contains the label (Price).
X = model_df.iloc[:, 2:5].values
y = model_df.iloc[:, 1].values
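
Positional slicing with `iloc` silently picks the wrong columns if the column order of the CSV ever changes. A minimal sketch of the same split done by column name follows; it assumes the column labels shown in the `head()` output, and `sample_df` is a small illustrative frame rebuilt from those five rows, not the full dataset.

```python
import pandas as pd

# Illustrative frame rebuilt from the five rows displayed by model_df.head()
sample_df = pd.DataFrame(
    {
        "Property type": ["HOUSE"] * 5,
        "Price": [295000.0, 235000.0, 275000.0, 295000.0, 239000.0],
        "Number of bedrooms": [1.0, 1.0, 1.0, 1.0, 1.0],
        "Living area": [70.0, 70.0, 90.0, 70.0, 100.0],
        "Surface area land": [417.0, 104.0, 415.0, 417.0, 355.0],
    },
    index=[105, 106, 107, 108, 111],
)

# Selecting attributes and label by name instead of by position
feature_columns = ["Number of bedrooms", "Living area", "Surface area land"]
X_by_name = sample_df[feature_columns].values
y_by_name = sample_df["Price"].values

# The positional slices used in the notebook give the same arrays
X_by_position = sample_df.iloc[:, 2:5].values
y_by_position = sample_df.iloc[:, 1].values
```

Naming the columns also documents which attributes the model is trained on.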

In [92]:
# Splitting the data into a training set (80%) and a test set (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
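
To make the effect of `test_size=0.2` and `random_state` concrete, here is a small sketch on made-up arrays. The ten rows in `X_demo` and `y_demo` are placeholders, not the housing data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 rows with 3 attributes each
X_demo = np.arange(30, dtype=float).reshape(10, 3)
y_demo = np.arange(10, dtype=float)

# 20% of the rows (2 out of 10) are held out for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state reproduces the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
```

Fixing `random_state` keeps the evaluation numbers comparable between runs of the notebook.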

In [93]:
# Scikit-Learn’s LinearRegression class (imported in the first cell) also works
# with several independent variables, i.e. multiple regression.
LR = LinearRegression()

# Training the model by fitting it to the training data
LR.fit(X_train, y_train)

LinearRegression()

In [94]:
# Testing the regressor by using it to predict on our test data.
# We can use our model’s .predict method to do this.
y_prediction = LR.predict(X_test)
y_prediction

array([401836.00165869, 258692.72231955, 332429.1543727 , 523868.37746126,
       291675.25639092, 428397.69363441, 219210.77315412, 335936.88931054,
       327470.64819267, 226623.73480775, 312103.27468268, 265555.69452414,
       289569.9613956 , 238826.54236047, 469480.51601076, 415174.62091844,
       384643.72381273, 288231.70678646, 436390.05875912, 274389.86583529,
       193449.12689835, 293126.86427705, 274094.66470533, 259979.41776633,
       261546.92219445, 251779.80936757, 220409.50626138, 260242.24328729,
       325800.01647263, 268620.73346844, 295740.81106051, 328028.04511882,
       335832.14360365, 275726.92483506, 300615.37350196, 267085.81605212,
       254032.81062927, 219046.06763395, 240762.96567206, 318891.68800619,
       257403.6311704 , 260324.98244927, 234146.03113744, 324921.24570802,
       308843.97311716, 297459.19922837, 449860.93960626, 228026.74960211,
       337731.81373988, 345223.70735734, 289163.4482092 , 297455.81483582,
       471284.04352658, 3

**Now the model’s predictions are stored in the variable y_prediction, which is a NumPy array.**

## Model evaluation using an adjusted RSquared metric

A caveat: since we are using several independent variables, the plain R Squared metric has to be interpreted with care.
The drawback of this evaluation method is that R Squared, when computed on the data the model was fitted to, never decreases when we add an independent variable, even one with no real predictive power; this can lead to a performance rating that is inaccurately high.
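
The inflation effect can be illustrated with a short sketch on synthetic data. Everything below (the generated arrays, the seed, the coefficients) is made up for illustration and is unrelated to the housing dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_samples = 200

# Two informative attributes and a noisy target
X_real = rng.normal(size=(n_samples, 2))
y_demo = 3.0 * X_real[:, 0] - 2.0 * X_real[:, 1] + rng.normal(scale=2.0, size=n_samples)

# A third attribute that is pure noise, unrelated to the target
noise_feature = rng.normal(size=(n_samples, 1))
X_with_noise = np.hstack([X_real, noise_feature])

# .score returns R Squared; here it is evaluated on the same data used for fitting
r2_real = LinearRegression().fit(X_real, y_demo).score(X_real, y_demo)
r2_with_noise = LinearRegression().fit(X_with_noise, y_demo).score(X_with_noise, y_demo)
print(r2_real, r2_with_noise)
```

The second value is never lower than the first, although the extra column carries no information about the target.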

To tackle this obstacle, we must manually implement an **Adjusted R Squared metric since Scikit-Learn does not provide a function to do this.**

Hence our **approach** entails the following:
- First, we find R Squared by using Scikit-Learn’s r2_score function.
- Then, we plug the R Squared value into the adjusted R Squared formula, Adjusted R Squared = 1 - (1 - R Squared) × (n - 1) / (n - p - 1), where n is the number of observations and p is the number of independent variables.
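
The two steps can be sketched as a small helper built on the standard adjusted R Squared formula. The function name `adjusted_r2` and its parameter names are hypothetical and not part of Scikit-Learn.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R Squared: 1 - (1 - r2) * (n - 1) / (n - p - 1).

    r2         -- plain R Squared, e.g. the value returned by r2_score
    n_samples  -- number of observations n used to compute r2
    n_features -- number of independent variables p in the model
    """
    denominator = n_samples - n_features - 1
    if denominator <= 0:
        raise ValueError("n_samples must be greater than n_features + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / denominator


# Worked example with made-up numbers: R Squared 0.5, 11 observations, 2 attributes
example = adjusted_r2(0.5, 11, 2)
print(example)  # 1 - 0.5 * 10 / 8 = 0.375
```

In this notebook, the call would presumably be `adjusted_r2(score, X_test.shape[0], X_test.shape[1])`, using the test set on which `score` is computed.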

In [95]:
# Using Scikit-Learn's functions to compute the R Squared metric and the mean squared error
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

# Evaluating the predictions against the true test labels
score = r2_score(y_test, y_prediction)
mse = mean_squared_error(y_test, y_prediction)
print('r2 score is', score)
print('mean squared error is', mse)
print('root mean squared error is', np.sqrt(mse))

r2 score is 0.2226912365378506
mean squared error is 11953133353.298536
root mean squared error is 109330.38623044618


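**The performance metric, R Squared score, of our multiple regression model is roughly 0.22: on the test set, the model explains only about 22% of the variance in price, with a root mean squared error of about 109,330.**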