# Regression

In this tutorial we will investigate how to perform a regression in Python.

Objectives:
- Perform a simple linear regression
- Perform a multiple linear regression

For the tutorial we will use the following libraries:
- pandas
- numpy
- matplotlib
- sklearn (datasets, linear_model)
- seaborn

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import datasets, linear_model
import seaborn as sns

## The Boston house price dataset

For this tutorial we will use a pre-existing dataset from the `sklearn` library.

Here is the official description of the dataset:

Boston House Prices dataset
- `CRIM`     per capita crime rate by town
- `ZN`       proportion of residential land zoned for lots over 25,000 sq.ft.
- `INDUS`    proportion of non-retail business acres per town
- `CHAS`     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- `NOX`      nitric oxides concentration (parts per 10 million)
- `RM`       average number of rooms per dwelling
- `AGE`      proportion of owner-occupied units built prior to 1940
- `DIS`      weighted distances to five Boston employment centres
- `RAD`      index of accessibility to radial highways
- `TAX`      full-value property-tax rate per \$10,000
- `PTRATIO`  pupil-teacher ratio by town
- `B`        $1000(Bk - 0.63)^2$ where Bk is the proportion of blacks by town
- `LSTAT`    % lower status of the population
- `MEDV`     median value of owner-occupied homes in \$1000's

The full description:
https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html

The package description of the dataset: 
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html

In this dataset, the target variable is the house price, i.e. `MEDV`.


In [None]:
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
data = datasets.load_boston()

This dataset contains several parts: the numeric data is in `data` (an array), the names of the variables are in `feature_names`, and the target value $Y_i$ is in `target`. We will therefore do some manipulation to transform the data into a pandas data frame.

In [None]:
data.keys()

In [None]:
data['data'] #the numeric dataset

In [None]:
data['feature_names'] #the name of variables

The `df` variable will store the data frame

In [None]:
df = pd.DataFrame(data.data, columns=data.feature_names) 

The target value, i.e. the house price, will be stored in a separate variable, `house_price`.

In [None]:
house_price=pd.DataFrame(data.target, columns=["MEDV"])

In [None]:
df.head() # this will show the top of the dataset, i.e. the first few rows

In [None]:
house_price.head()
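
For readers whose scikit-learn release does not provide `load_boston`, the same conversion steps can be tried on `load_diabetes`, which exposes the same parts (`data`, `feature_names`, `target`). This is only a minimal sketch; the names `bunch`, `features` and `progression` are illustrative and not part of the tutorial.

```python
import pandas as pd
from sklearn import datasets

# Load a dataset that ships with current scikit-learn releases
bunch = datasets.load_diabetes()

# Numeric data and column names become a data frame
features = pd.DataFrame(bunch.data, columns=bunch.feature_names)

# The target gets its own one-column data frame
progression = pd.DataFrame(bunch.target, columns=["progression"])

print(features.shape, progression.shape)
print(features.head())
```

The rest of the tutorial still refers to the Boston columns, so this sketch only illustrates the data-frame construction.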

## Simple linear regression

A simple linear regression models a linear relationship between two variables, where $a$ is the slope, $b$ the intercept and $e_i$ the error term:

$$Y_i=aX_i+b+e_i$$
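
To make the idea concrete, a noisy linear relationship can be simulated and its slope and intercept estimated by least squares. The sketch below uses only NumPy; the "true" values `true_a` and `true_b` are assumptions chosen for illustration, and the closed-form estimates (covariance divided by variance for the slope) are the standard least-squares formulas for a single explanatory variable.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated data: the "true" slope and intercept are assumptions for this sketch
true_a, true_b = 2.0, 5.0
x = rng.uniform(0, 10, size=500)
e = rng.normal(0, 1, size=500)      # random error term
y = true_a * x + true_b + e

# Least-squares estimates of the slope and intercept
a_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b_hat = y.mean() - a_hat * x.mean()

print(f"slope: {a_hat:.2f}, intercept: {b_hat:.2f}")
```

The estimates should land close to the values used to generate the data, but not exactly on them, because of the error term.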



In [None]:
plt.plot(df['RM'],house_price,'.')
plt.xlabel("Number of rooms")
plt.ylabel("House price")
plt.title("House prices in Boston")

The `seaborn` library can directly plot a regression line through the scatter plot. However, to use seaborn we have to create a dedicated data frame with the two variables we want to plot.

In [None]:
df2=pd.DataFrame({'RM':df.RM,'price':house_price.MEDV }) # transform data into a new dataframe
sns.lmplot(x='RM',y='price',data=df2) # scatter plot with a fitted regression line (keyword arguments are required by recent seaborn releases)

From this plot, how would you describe the relationship between the house price and the number of rooms?

*Write your answer here*

Let's build a simple linear model:
$$price = a \times nb\_ of\_ rooms + b$$

In [None]:
x=df['RM'].values.reshape(-1, 1) # scikit-learn expects a 2-D array of shape (n_samples, n_features)
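
As a small illustration of what `reshape(-1, 1)` does, the sketch below turns a one-dimensional array into a single-column two-dimensional array. The values in `rooms` are made up for the example.

```python
import numpy as np

rooms = np.array([6.5, 6.4, 7.2, 7.0])   # made-up values, 1-D array
print(rooms.shape)                        # one dimension holding 4 values

rooms_2d = rooms.reshape(-1, 1)           # -1 lets NumPy infer the number of rows
print(rooms_2d.shape)                     # 4 rows, 1 column
```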

In [None]:
y=house_price

We will use `linear_model.LinearRegression()` to fit the following equation:
$$y=ax+b$$

In [None]:
model=linear_model.LinearRegression()

In [None]:
model.fit(x,y)

With the `fit` function, we obtain `a` as `coef_` and `b` as `intercept_`.

In [None]:
model.coef_

In [None]:
model.intercept_

In this case the equation is:
$$y=9.10\times x -34.67$$
$$price=9.10\times nb\_ of\_ rooms-34.67$$

How do you interpret these numbers?

*Write your answers here*

This equation matches the line that goes through the data. Let's plot it.

To plot it, we will predict the price of houses with 4 to 9 rooms.

In [None]:
xpred=np.arange(4,10).reshape(-1,1)

In [None]:
ypred=model.predict(xpred)

In [None]:
(xpred,ypred)

In [None]:
plt.plot(df['RM'],house_price,'.')
plt.xlabel("Number of rooms")
plt.ylabel("House price")
plt.title("House prices in Boston")
plt.plot(xpred,ypred,'r')
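
To see that `predict` simply evaluates the fitted equation, the sketch below compares its output with a manual calculation of $a \times x + b$. It uses made-up data (`x_demo`, `y_demo`) rather than the Boston data, so the numbers are illustrative only.

```python
import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(0)

# Made-up data for the sketch: one feature, a noisy linear target
x_demo = rng.uniform(4, 9, size=100).reshape(-1, 1)
y_demo = 9.0 * x_demo[:, 0] - 35.0 + rng.normal(0, 2, size=100)

demo_model = linear_model.LinearRegression().fit(x_demo, y_demo)

# Predict for 4 to 9 rooms, once with predict and once by hand
x_new = np.arange(4, 10).reshape(-1, 1)
by_predict = demo_model.predict(x_new)
by_hand = demo_model.coef_[0] * x_new[:, 0] + demo_model.intercept_

print(np.allclose(by_predict, by_hand))
```

In the tutorial, `y` is a one-column data frame rather than a 1-D array, so `coef_` and `intercept_` come back as arrays with an extra dimension.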

## Multiple regression

The price of houses is not only explained by the number of rooms, but also by the distance from the city center and all the other variables in the dataset. So we need to build a linear model that takes all the variables into account:

$$y_i=a_1 \times x_i^1 + a_2 \times x_i^2 + \dots + a_p \times x_i^p + b$$

$$price=a_1 \times nbroom + a_2 \times distance + \dots + a_p \times stat + b$$
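
As a minimal sketch of what fitting such a model looks like, the example below builds a target from two hypothetical variables with known coefficients and checks that `LinearRegression` recovers them approximately. The column names and coefficient values are assumptions made for the illustration, not estimates from the Boston data.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

rng = np.random.default_rng(1)
n = 500

# Hypothetical explanatory variables
X_demo = pd.DataFrame({
    "rooms": rng.normal(6, 1, size=n),
    "distance": rng.normal(4, 2, size=n),
})

# Target built from known coefficients plus noise
y_demo = (4.0 * X_demo["rooms"] - 1.5 * X_demo["distance"]
          + 10.0 + rng.normal(0, 1, size=n))

demo_model = linear_model.LinearRegression().fit(X_demo, y_demo)

# One coefficient per column, in the same order as the columns
for name, coef in zip(X_demo.columns, demo_model.coef_):
    print(f"{name}: {coef:.2f}")
print(f"intercept: {demo_model.intercept_:.2f}")
```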


In [None]:
model2 = linear_model.LinearRegression()

In [None]:
x=df.copy() # now we include all the variables

In [None]:
model2.fit(x,y)

In [None]:
model2.coef_

In [None]:
x.columns

In [None]:
model2.intercept_

The model is:
$$price=-0.12 \times CRIM +0.046 \times ZN + 0.021 \times INDUS +3.67 \times CHAS -0.178 \times NOX +3.80 \times RM +0.00075 \times AGE -1.48 \times DIS +0.31 \times RAD - 0.012 \times TAX -0.95 \times PTRATIO +0.0094 \times B -0.53 \times LSTAT +36.5$$

Describe the relationship between the price and the pupil-teacher ratio by town, and between the price and the Charles River dummy variable, and interpret what each means.

*Write your answer*

Which of these variables are the most important to predict the price?

*Write your answer*

## Statistical significance of regression

To assess the significance of each variable in the regression, we need to use another library:

In [None]:
import statsmodels.api as sm

In [None]:
X2 = sm.add_constant(x) # the variables of the regression, plus a constant column for the intercept
model3 = sm.OLS(y, X2) # create the model (using X2 so that the intercept is included)
model_final = model3.fit() # fit it
print(model_final.summary()) # information about the model

The output above shows a p-value for each variable (the `P>|t|` column).
For each variable, if $p<0.05$ the variable is considered significant; if $p>0.05$ it is not.
As a rule of thumb, a variable that is not significant should not be included in the model. So we remove the non-significant variables one by one, refitting the model after each removal. Here we remove INDUS and AGE.

#### Remove INDUS

In [None]:
x3=x.drop(["INDUS"],axis=1) # remove INDUS from the dataset

In [None]:
X2 = sm.add_constant(x3) # the  variables of the regression
model3 = sm.OLS(y, X2) # create the model
model_final = model3.fit() # fit it
print(model_final.summary()) # information about the model

#### Remove AGE

In [None]:
x4=x3.drop(["AGE"],axis=1) # remove AGE from the dataset
X2 = sm.add_constant(x4) # the  variables of the regression
model3 = sm.OLS(y, X2) # create the model
model_final = model3.fit() # fit it
print(model_final.summary()) # information about the model

Now the model only contains variables that are significant. 

## Simulations

Simulation is a very useful tool to explore what might happen according to a model. For example, imagine that the following formula is known:

$fuel\_ consumption= 3\times average\_ speed - 20 \times cruising\_ duration + 50 \times flight\_ duration+0.03\times nb\_of\_passenger$

We want to explore what the fuel consumption of a typical Sydney - Los Angeles flight would be.

We know that the flight is on average 14 hours long, with 10 hours of cruising, has an average speed of 800 km/h, and carries on average 360 passengers.


In [None]:
av_speed=800

In [None]:
av_passenger=360

In [None]:
av_duration=14

In [None]:
av_cruising=10

So we can predict that the average fuel consumption is:

In [None]:
3*av_speed - 20*av_cruising + 50*av_duration+0.03*av_passenger

However, to predict how much variation there will be in fuel consumption, we need to run simulations that take into account an estimated distribution for each variable.
We estimate that the variables follow these distributions:
- av_speed N(800,200)
- av_cruising N(10,2)
- av_duration N(14,2)
- av_passenger N(360,100)

`N(x,s)` is a normal distribution with mean `x` and standard deviation `s`

Simulate 1000 flights, and plot the corresponding distribution of fuel consumption. What values of fuel consumption can you predict with 95% confidence?

Hint: use `np.random.normal` to simulate a random sample.
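
As a hedged illustration of the technique (not a solution to the exercise), the sketch below applies it to a toy formula with two made-up inputs. The variable names, the formula `2 * width + 3 * height` and the distribution parameters are all assumptions chosen for the example.

```python
import numpy as np

np.random.seed(123)   # fixed seed so the sketch is reproducible
n_sim = 1000

# Hypothetical inputs, each drawn from a normal distribution N(mean, sd)
width = np.random.normal(10, 1, size=n_sim)
height = np.random.normal(5, 0.5, size=n_sim)

# Toy formula evaluated once per simulated draw
cost = 2 * width + 3 * height

# Central 95% of the simulated values
low, high = np.percentile(cost, [2.5, 97.5])
print(f"mean: {cost.mean():.1f}, 95% interval: [{low:.1f}, {high:.1f}]")
```

A histogram of the simulated values, for instance with `plt.hist(cost, bins=30)`, shows the full distribution rather than only the interval.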



## Cricketers

Use the cricketer dataset to fit a multiple regression on the data.

In [None]:
#Write code

*Write text*

## Diabetes

Using the diabetes dataset, do a multiple linear regression to model the disease progression with only the significant variables.

In [None]:
diabetes=datasets.load_diabetes()
print(diabetes['DESCR'])

In [None]:
# write code

*write text*