## II. DATA MODELLING

## Data tools and file handling

In [None]:
from sklearn.datasets import load_iris
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
import pickle

In [19]:
model_df = pd.read_csv('/Users/anix/API-deployment/preprocessing/baseline.csv', index_col=0)
model_df.head()

Unnamed: 0,Property type,Price,Number of bedrooms,Living area,Surface area land
105,HOUSE,295000.0,1.0,70.0,417.0
106,HOUSE,235000.0,1.0,70.0,104.0
107,HOUSE,275000.0,1.0,90.0,415.0
108,HOUSE,295000.0,1.0,70.0,417.0
111,HOUSE,239000.0,1.0,100.0,355.0


## Testing and training a multiple regression model

In [20]:
# Splitting our dataset into its attributes and labels
# The X variable (i.e. attributes) contains the last three columns of our data frame, while y contains the label.
X = model_df.iloc[:, 2:5].values
y = model_df.iloc[:,1].values

In [21]:
# Splitting the data between training and testing data 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state = 4)
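
To make the effect of `test_size=0.4` and a fixed `random_state` concrete, here is a minimal pure-Python sketch of a shuffled hold-out split. It is not Scikit-Learn's implementation: the helper `shuffled_split` and the stand-in `rows` list are hypothetical, and seeding this sketch with 4 will not reproduce the split that `train_test_split` makes.

```python
import math
import random

def shuffled_split(rows, test_size=0.4, seed=4):
    """Shuffle the row indices, then hold out a test_size fraction of them."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)        # seeded, so the split is reproducible
    n_test = math.ceil(len(rows) * test_size)   # 40% of the rows go to the test set
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(10))                          # stand-in for the rows of model_df
train_rows, test_rows = shuffled_split(rows)
print(len(train_rows), len(test_rows))
```

Fixing the seed means every run evaluates the model on the same held-out rows, which keeps the scores comparable between runs.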

In [22]:
# Scikit-Learn's LinearRegression class also handles several independent variables.
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()

# Train the model by fitting it to the training data.
regressor.fit(X_train, y_train)

LinearRegression()

In [25]:
# Testing the regressor by using it to predict on our test data. 
# We can use our model’s .predict method to do this.
predictions = regressor.predict(X_test)

**The model's predictions are now stored in the variable predictions, which is a NumPy array.**

## Model evaluation using an adjusted R-squared metric

A caveat: because the model uses several independent variables, plain R-squared is a misleading yardstick on its own.
On the data a model was fitted to, R-squared never decreases when another independent variable is added, even an irrelevant one, so the value drifts toward 1 and the performance rating ends up inaccurately high.
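
The inflation can be seen in the smallest possible case: going from no predictors (always predicting the mean, which gives an R-squared of exactly 0) to a single predictor that is pure noise. The sketch below uses synthetic data and a hypothetical helper `simple_r_squared`; it illustrates the effect and is not part of the housing model.

```python
import random

def simple_r_squared(x, y):
    """In-sample R-squared of a one-predictor least-squares fit (equals corr(x, y) ** 2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

random.seed(0)
n = 50
y = [random.gauss(0, 1) for _ in range(n)]       # synthetic target
noise = [random.gauss(0, 1) for _ in range(n)]   # predictor unrelated to y

r2_intercept_only = 0.0                          # predicting the mean of y
r2_with_noise = simple_r_squared(noise, y)       # cannot fall below 0
print(r2_intercept_only, r2_with_noise)
```

Even though `noise` carries no information about `y`, the fitted R-squared does not go down, which is why a penalty for the number of predictors is needed.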

To correct for this, we implement an **adjusted R-squared metric** manually, since Scikit-Learn does not provide a function for it.

The approach has two steps:
- First, compute R-squared with Scikit-Learn's r2_score function.
- Then plug that value into the adjusted R-squared formula, where $N$ is the number of observations and $k$ is the number of predictors:

$$R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(N - 1)}{N - k - 1}$$
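
The adjusted R-squared formula, 1 - (1 - R²)(N - 1) / (N - k - 1), can be wrapped in a small helper. This is a sketch: the function name and the example inputs (an R-squared of 0.5 at two sample sizes) are hypothetical and are not results from the housing data.

```python
def adjusted_r_squared(r_squared, n_samples, n_predictors):
    """Apply 1 - (1 - R^2)(N - 1) / (N - k - 1)."""
    if n_samples - n_predictors - 1 <= 0:
        raise ValueError("need more samples than predictors + 1")
    return 1 - (1 - r_squared) * (n_samples - 1) / (n_samples - n_predictors - 1)

# Hypothetical R-squared of 0.5 with 3 predictors, at two sample sizes
small_sample = adjusted_r_squared(0.5, n_samples=20, n_predictors=3)
large_sample = adjusted_r_squared(0.5, n_samples=2000, n_predictors=3)
print(small_sample, large_sample)
```

The penalty is noticeable when N is small relative to k and almost vanishes for large N, so on a test set with well over a thousand rows the adjusted value should sit very close to the plain R-squared.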

In [26]:
# Use Scikit-Learn's r2_score to compute R-squared on the test set
from sklearn.metrics import r2_score
r_squared = r2_score(y_test, predictions)
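
For reference, the value r2_score returns is the coefficient of determination, 1 - SS_res / SS_tot, so it is already R-squared and can even be negative when the model does worse than predicting the mean. A pure-Python sketch of the same calculation follows; the helper name and the toy numbers are hypothetical.

```python
def r_squared_by_hand(y_true, y_pred):
    """R-squared = 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

actual = [200.0, 250.0, 300.0, 350.0]      # toy prices, not from baseline.csv
perfect = list(actual)                     # predictions that match exactly
mean_only = [275.0] * 4                    # always predict the mean
reversed_order = [350.0, 300.0, 250.0, 200.0]

print(r_squared_by_hand(actual, perfect))
print(r_squared_by_hand(actual, mean_only))
print(r_squared_by_hand(actual, reversed_order))
```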

In [28]:
# Now we need the number of observations in our test dataset.
# len() returns the number of rows in X_test.
N = len(X_test)
print(N)

1855


In [29]:
# Now we need to find the number of predictors in X_test. 
# There were 3 predictors in our dataset: 'Number of bedrooms', 'Living area', and 'Surface area land'
k = 3

In [30]:
# Now we plug these numbers into the formula for the adjusted R-squared metric.
# r2_score already returns R-squared, so r_squared must not be squared again.

adjusted_r_squared = 1 - ((1 - r_squared) * (N - 1)) / (N - k - 1)

print(f'The adjusted R-squared of our model is: {adjusted_r_squared}')
