# Dataset: Boston housing

- CRIM - per capita crime rate by town
- ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS - proportion of non-retail business acres per town.
- CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
- NOX - nitric oxides concentration (parts per 10 million)
- RM - average number of rooms per dwelling
- AGE - proportion of owner-occupied units built prior to 1940
- DIS - weighted distances to five Boston employment centres
- RAD - index of accessibility to radial highways
- TAX - full-value property-tax rate per $10,000
- PTRATIO - pupil-teacher ratio by town
- B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT - % lower status of the population
- MEDV - Median value of owner-occupied homes in $1000's

In [None]:
import pandas as pd

url = "https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"

df = pd.read_csv(url)

df.head()

In [None]:
df.info()

In [None]:
# Let's find a feature that correlates strongly with the target
df.corr()['medv']


## Terminology

#### Data
- Feature: also called input variable or independent variable
- Target: also called dependent variable; the value the model tries to predict

#### Modelling
model - a function, system, or sequence of logic that takes an input and produces an output

loss function - a function that measures the error of the model's predictions against the real results



### Simple case: Linear regression with one input


y = b + w X

X -> input variable; let's take only the number of rooms

y -> value of the dwelling

The goal is to find the w and b that minimize the loss function below:

loss = Sum( ( y - yPred ) ^ 2)
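
For a single feature, the w and b that minimize this loss have a closed-form solution. The sketch below is only an illustration: the data is synthetic (not taken from the Boston dataset) and the names `x_toy` and `y_toy` are made up for this example.

```python
import numpy as np

# Synthetic data lying exactly on the line y = 2 + 3x (illustrative values only)
x_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_toy = 2.0 + 3.0 * x_toy

# Closed-form least-squares solution for one feature
w_toy = np.sum((x_toy - x_toy.mean()) * (y_toy - y_toy.mean())) / np.sum((x_toy - x_toy.mean()) ** 2)
b_toy = y_toy.mean() - w_toy * x_toy.mean()

# Loss: sum of squared differences between true and predicted values
y_toy_pred = b_toy + w_toy * x_toy
loss_toy = np.sum((y_toy - y_toy_pred) ** 2)

print(f'b={b_toy:.3f}, w={w_toy:.3f}, loss={loss_toy:.3f}')
```

Because the toy data has no noise, the recovered line passes through every point and the loss is zero. On real data the loss stays positive and the same formulas give the best-fitting line.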

In [None]:
# Let's focus only on a single feature as opposed to all features
X = df[['rm']]

In [None]:
# We are trying to predict the price, so we will use 'medv'
y = df['medv']

### Train Test split

We keep a portion of the data aside for evaluating the model, typically 15 to 25 percent of it.
This checks that the model makes good predictions on unseen data, since a complex model can be capable of memorizing the training data.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=43)

### Plotting data

In [None]:
import matplotlib.pyplot as plt

plt.scatter(X_train, y_train, label='Training set')
plt.scatter(X_test, y_test, label='Test set')

plt.xlabel('Feature (average number of rooms)')
plt.ylabel('Target (median value)')

plt.title('Real Estate Price based on number of rooms')
plt.legend();

### Training
Training uses the train set only. The fitted model is simply a single line (in general, a hyperplane).

There is so much action happening under the 'fit' method here!

In [None]:
# Doc at https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
from sklearn import linear_model

reg = linear_model.LinearRegression()

reg.fit(X_train, y_train);

In [None]:
# Model coefficients (parameters)
b = reg.intercept_
w = reg.coef_

print(f'b={b} and w={w}')

### Evaluating
We will evaluate the performance of the model for both training and test datasets

The score of a linear regression is the coefficient of determination (R^2), which is 1 - loss / (variance of true targets * size), where size is the number of samples.
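
The formula above can be checked by hand. This is a minimal sketch with made-up numbers; `r2_score_manual` is a hypothetical helper written for this example, not a scikit-learn function.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - loss / (variance of true targets * size)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    loss = np.sum((y_true - y_pred) ** 2)
    # np.var is the population variance, so var * size is the total sum of squares
    total = np.var(y_true) * y_true.size
    return 1.0 - loss / total

y_true_toy = [3.0, 5.0, 7.0, 9.0]

perfect = r2_score_manual(y_true_toy, y_true_toy)      # predictions equal the targets
mean_only = r2_score_manual(y_true_toy, [6.0] * 4)     # always predict the mean of the targets

print(perfect, mean_only)
```

A score of 1 means the predictions match the targets exactly, while a model that always predicts the mean of the targets scores 0. The score can also be negative when a model does worse than that constant prediction.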

In [None]:
train_score = reg.score(X_train, y_train)
test_score = reg.score(X_test, y_test)

print(f'Train score is: {train_score}')
print(f'Test score is: {test_score}')

In [None]:
# Prediction -> the model can be used for the application
num_rooms = 6.
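# The model was fitted on a DataFrame, so we pass one with the same column name
price = reg.predict(pd.DataFrame({'rm': [num_rooms]}))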

print(price)

In [None]:
# Let's plot the fitted line
y_pred = reg.predict(X)

plt.scatter(X['rm'], y, label='All data')
plt.scatter(X['rm'], y_pred, color='red', label='Fitted line')


plt.xlabel('Feature (average number of rooms)')
plt.ylabel('Target (median value)')

plt.title('Real Estate Price based on number of rooms')
plt.legend();

### Reflection
Why did we get such scores?

Could it be that our model is too simple?
No. Based on the scatter plot, a more complex model would not necessarily do better.

Could it be an issue with the data? Yes. Relying on the number of rooms alone does not seem to be a good criterion for predicting the price. Let's include more features and see if we can get a better score.

In [None]:
# Preparing data
X = df.drop(columns=['medv'])
y = df['medv']

In [None]:
# Splitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=43)

In [None]:
# Scaling: we need to bring all the features to the same scale
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training set only (to avoid data leakage); it standardizes each feature to zero mean and unit variance
scaler = StandardScaler().fit(X_train)

# Generating a scaled train and test set
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
# View the scaled values in a DataFrame format
pd.DataFrame(X_train_scaled, columns=[c + '_scaled' for c in X_train.columns])

In [None]:
# Training
reg_full = linear_model.LinearRegression()

reg_full.fit(X_train_scaled, y_train);

In [None]:
train_score = reg_full.score(X_train_scaled, y_train)
test_score = reg_full.score(X_test_scaled, y_test)

print(f'Train score is: {train_score}')
print(f'Test score is: {test_score}')

In [None]:
# For prediction you now have to provide all the features.
# Let's calculate the predictions for all test data and compare them with the true values
y_test_predictions = reg_full.predict(X_test_scaled)

y_test_predictions_series = pd.Series(y_test_predictions, name='medv_pred')

In [None]:
pd.concat([y_test.reset_index(drop=True), y_test_predictions_series], axis = 1)

# Bias-Variance trade off

A key concept in ML is the bias-variance trade-off.

- A model with high bias makes strong assumptions such as a linear relationship
- A model with high variance makes minimal assumptions and conforms to variations in data

High bias and high variance can both hurt predictions. There is a trade-off: lowering one typically raises the other, so finding the right balance is key.
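
The two extremes can be compared side by side. The sketch below is an illustration under assumptions: the data is synthetic (a noisy sine curve), and the `_demo` names are chosen so that nothing in the notebook is overwritten.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic, noisy, non-linear data (assumed for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 6, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=200)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# High bias: a straight line cannot follow the sine shape
linear_demo = LinearRegression().fit(Xd_train, yd_train)
# High variance: with k=1 the model reproduces every training point, noise included
knn_demo = KNeighborsRegressor(n_neighbors=1).fit(Xd_train, yd_train)

for name, model in [('linear', linear_demo), ('kNN k=1', knn_demo)]:
    print(name, model.score(Xd_train, yd_train), model.score(Xd_test, yd_test))
```

The k=1 model scores perfectly on the data it has memorized, so its train score says nothing about how well it generalizes. Comparing the train and test scores of both models is what reveals where each one sits in the trade-off.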

### kNN
K Nearest Neighbors is a low-bias model. It simply memorizes the whole training set. To make a prediction, it looks at the k closest data points in the training set and bases the prediction on those (for regression with scikit-learn's default settings, the average of their target values).
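
The idea fits in a few lines of NumPy. This is a minimal sketch assuming Euclidean distance and a plain average of the neighbors' targets; `knn_predict` and the toy values are hypothetical and only for illustration.

```python
import numpy as np

def knn_predict(X_known, y_known, x_query, k):
    """Predict the target for one query point as the mean of its k nearest neighbors."""
    X_known = np.asarray(X_known, dtype=float)
    y_known = np.asarray(y_known, dtype=float)
    # Euclidean distance from the query to every stored training point
    distances = np.linalg.norm(X_known - np.asarray(x_query, dtype=float), axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    return y_known[nearest].mean()

# Toy training set: one feature (rooms) and a target (price), made-up values
X_toy = [[4.0], [5.0], [6.0], [7.0]]
y_toy = [10.0, 20.0, 30.0, 40.0]

pred_k1 = knn_predict(X_toy, y_toy, [5.9], k=1)
pred_k2 = knn_predict(X_toy, y_toy, [5.9], k=2)

print(pred_k1, pred_k2)
```

With k=1 the prediction copies the single closest point; with k=2 it averages the two closest points, which smooths the prediction.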

In [None]:
# Splitting
X = df[['rm']]
y = df['medv']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=43)

In [None]:
from sklearn.neighbors import KNeighborsRegressor

# The hyperparameter k acts as a form of 'regularization': the higher k, the lower the variance
k = 1

knnRegressor = KNeighborsRegressor(n_neighbors=k)
knnRegressor.fit(X_train, y_train);


# Let's calculate the score
score = knnRegressor.score(X_test, y_test)

import numpy as np
sample = pd.DataFrame([[rm] for rm in np.arange(3,9, 0.001)], columns=['rm'])

plt.scatter(X_train, y_train, label='training set')
plt.plot(sample, knnRegressor.predict(sample), color='red', label=f'knn prediction with k = {k}')

plt.xlabel('Number of rooms')
plt.ylabel('Predicted price')
plt.title(f'Score={score}')
plt.legend();

### Simple Hyperparameter tuning

Hyperparameter tuning means searching for the hyperparameter configuration that maximizes a score.

In our kNN case, we want to find the k that gives the maximum score.

In [None]:
def getScores(k):
    knnRegressor = KNeighborsRegressor(n_neighbors=k)
    knnRegressor.fit(X_train, y_train)
    
    trainScore = knnRegressor.score(X_train, y_train)
    testScore = knnRegressor.score(X_test, y_test)
    
    return trainScore, testScore

In [None]:
ks = range(1,100)
scores = [getScores(k) for k in ks]

train_scores = [score[0] for score in scores]
test_scores = [score[1] for score in scores]

In [None]:
plt.scatter(ks, train_scores, label='Train score')
plt.scatter(ks, test_scores, label='Test score')
plt.xlabel('Hyperparameter K')
plt.ylabel('kNN score')
plt.legend();

### kNN with all features

In [None]:
# Preparing data
X = df.drop(columns=['medv'])
y = df['medv']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=43)

# Scaling here is particularly important due to the notion of 'distance' between neighbors
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Rerun the cells above, making sure to use the scaled version of the features

### Cross validation

Cross validation is an important technique to ensure your analysis is not sensitive to how the data was split. In the example above, try changing random_state to produce different splits. You will see that the scores chart changes dramatically.

![Cross Validation](https://zitaoshen.rbind.io/project/machine_learning/machine-learning-101-cross-vaildation/featured.png)
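
A minimal sketch of k-fold cross validation using scikit-learn's `cross_val_score`. The data here is synthetic and the `_cv` names are made up for this example; the choices of 5 folds and k=5 neighbors are assumptions, not recommendations.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data (assumed for illustration only)
rng = np.random.default_rng(0)
X_cv = rng.uniform(3, 9, size=(200, 1))
y_cv = 5.0 * X_cv[:, 0] + rng.normal(scale=2.0, size=200)

# 5-fold cross validation: each fold serves once as the validation set
cv_scores = cross_val_score(KNeighborsRegressor(n_neighbors=5), X_cv, y_cv, cv=5)

print(cv_scores)
print(f'mean={cv_scores.mean():.3f}, std={cv_scores.std():.3f}')
```

Reporting the mean and spread of the fold scores gives a more stable picture than a single train/test split, because every sample is used for validation exactly once.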

## Polynomial Regression
As opposed to linear regression, which assumes the target is a linear combination of the feature values, polynomial regression assumes a more complex shape. For example, a polynomial of degree 2 with a single feature is simply a parabola:

y = b + w0 X + w1 X ^ 2

To achieve the form above, we can simply consider X^2 as a new feature and convert our problem to a LinearRegression with two features. The goal is to find the best b, w0, and w1
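
The trick can be seen on synthetic data that lies exactly on a known parabola. This is a sketch under assumptions: the values and the `_poly` names are made up, and the data has no noise so the coefficients can be recovered exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data lying exactly on y = 1 + 2x + 3x^2 (illustrative values only)
x_poly = np.linspace(-2, 2, 20)
y_poly = 1.0 + 2.0 * x_poly + 3.0 * x_poly ** 2

# Treat x and x^2 as two separate features of a linear model
features_poly = np.column_stack([x_poly, x_poly ** 2])
reg_poly = LinearRegression().fit(features_poly, y_poly)

print(f'b={reg_poly.intercept_:.3f}, w0={reg_poly.coef_[0]:.3f}, w1={reg_poly.coef_[1]:.3f}')
```

The model is still linear in its parameters b, w0, and w1; only the features are non-linear in X, which is why `LinearRegression` can fit it.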

In [None]:
# Adding a new column for the squared values of rm
X = df[['rm']].copy()  # start from the single feature so the model matches the formula above
X.insert(1, 'rm squared', X['rm']**2)

### Exercise: fit a polynomial of degree 3 to the one-dimensional training set and plot the curve

y = b + w0 X + w1 X ^ 2 + w2 X ^ 3