# Predicting housing prices

Regression attempts to predict one dependent variable (usually denoted by *Y*) using a series of other changing variables (known as independent variables, usually denoted by *X*).

Let's start by importing the libraries needed:
- Pandas is a library for loading, inspecting and manipulating datasets.
- Numpy is a library for numerical calculations. We will use it for square roots and averages.
- Matplotlib is a library for plotting.
- Seaborn is a library built on top of matplotlib that makes plots look pretty and adds some extra functionality.
- Scipy is a library for scientific computing. Its `stats` module provides statistical functions.
- Scikit-learn, which we import further down, is the library we will use for constructing and validating models.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
%matplotlib inline

Next let us import and check out the data.

In [None]:
USAhousing = pd.read_csv('USA_Housing.csv')
USAhousing.head()

Now it's time to play around with the data and create some visualizations.

In [None]:
sns.pairplot(USAhousing)

## Calculating correlation
Now that we have a dataset, we can calculate correlation between different features. 
The correlation coefficient, or simply the correlation, is an index that ranges from -1 to 1. When the value is near zero, there is no linear relationship. As the correlation gets closer to plus or minus one, the relationship is stronger. A value of one (or negative one) indicates a perfect linear relationship between two variables.

The formula for correlation between x and y:
$$Correlation = \frac{Cov(x,y)}{\sigma_x \sigma_y}$$
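
To make the formula concrete before applying it to the housing data, here is a minimal sketch that computes it by hand with only the standard library. The helper names (`sample_covariance`, `sample_std`, `correlation`) and the toy numbers are made up for illustration and are not part of the dataset. Note that the covariance and both standard deviations use the same $n - 1$ denominator; mixing conventions would slightly distort the result.

```python
import math


def sample_covariance(xs, ys):
    """Sample covariance of two equally long lists (n - 1 denominator)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def sample_std(xs):
    """Sample standard deviation: square root of the covariance of xs with itself."""
    return math.sqrt(sample_covariance(xs, xs))


def correlation(xs, ys):
    """Correlation = Cov(x, y) / (sigma_x * sigma_y)."""
    return sample_covariance(xs, ys) / (sample_std(xs) * sample_std(ys))


# Made-up toy values, not taken from the housing data
population = [10.0, 20.0, 30.0, 40.0, 50.0]
price = [12.0, 24.0, 33.0, 46.0, 55.0]

r = correlation(population, price)
print(round(r, 4))
```

Because the toy prices rise almost in step with the toy population, the result lies just below 1. Reversing the sign of one list would give the same magnitude with a negative sign.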

Let us try to calculate the correlation coefficient between the features 'Price' and 'Area Population' in our housing dataset. How much do these two have in common? Let's find out.

In [None]:
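# First we calculate the sample variance (ddof=1) of every numeric column.
# This is the same n - 1 convention that USAhousing.cov() uses further down.
variance = USAhousing.var(numeric_only=True)
print(variance)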


Since we want to find out the correlation between 'Price' and 'Area Population', we fill these in for $x$ and $y$ in the correlation formula given above as follows:
$$Correlation = \frac{Cov(Price,Area Population)}{\sigma_{Price} \sigma_{Area Population}}$$

So to calculate the correlation, we need the covariance between Price and Area Population and the standard deviation ($\sigma$, sigma) of each of the two features. 
Let's start by calculating the standard deviations of our two features. Here we use the previously calculated variance.

In [None]:
# sigma_Price is the square root of the variance of Price
var_price = variance['Price']
sigma_price = np.sqrt(var_price)
# Now we use float to return sigma as a plain Python number
sigma_price = float(sigma_price)
sigma_price

Now try and do the same for 'Area Population', fill in the gaps (...) by using the code you learned previously:

In [None]:
# Calculate sigma for 'Area Population'
var_area_population = ...
sigma_area_population = ...
# Now use float to return sigma as a number
sigma_area_population = float(sigma_area_population)
sigma_area_population

Now we only need the $Cov(Price, Area Population)$. 
Let's first calculate the covariance.

In [None]:
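# Skip any non-numeric columns before computing the covariance matrix
covariance = USAhousing.select_dtypes(include='number').cov()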
covariance

Here we see the covariances between all the features in USAhousing.  
Now we select the covariance we want, namely the one between Price and Area Population:

In [None]:
# Select the single value at row 'Price', column 'Area Population'
covariance_price_area_population = float(covariance.loc['Price', 'Area Population'])
covariance_price_area_population

Now that we have all the building blocks for our correlation, let's fill in the formula!  
Remember the correlation formula: $Correlation = \frac{Cov(Price,Area Population)}{\sigma_{Price} \sigma_{Area Population}}$

In [None]:
correlation_price_area_population = covariance_price_area_population/(float(sigma_price)*float(sigma_area_population))
float(correlation_price_area_population)

Now let's check the other correlations with the `corr()` method.

In [None]:
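# Skip any non-numeric columns before computing the correlation matrix
correlation = USAhousing.select_dtypes(include='number').corr()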
correlation

And now, let’s plot the correlation using a heatmap:

In [None]:
sns.heatmap(correlation, 
        xticklabels=correlation.columns,
        yticklabels=correlation.columns)

# Training a Linear Regression Model
Let’s now begin to train our regression model! We first need to split our data into an X array that contains the features to train on, and a y array with the target variable, in this case the Price column. 

In [None]:
X = USAhousing[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
                'Area Population']]
y = USAhousing['Price']
X.head()

## Train Test Split
Our goal is to create a model that generalises well to new data. The training set is the data on which we fit the linear regression. The test set is held back and serves as a proxy for new data: once the model is fitted, we evaluate it on the test set. The code for splitting is as follows:

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)

From the above code snippet we can infer that 40% of the data goes to the test set and the remaining 60% stays in the training set.
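
To show what such a split does, here is a rough sketch of a seeded shuffle-and-cut using only the standard library. The helper `shuffle_split` is hypothetical and is not how scikit-learn implements `train_test_split`; it only illustrates the idea that a fixed seed yields the same split every time and that every row ends up in exactly one of the two sets.

```python
import random


def shuffle_split(rows, test_size, seed):
    """Shuffle the row indices with a fixed seed, then cut off a share as the test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_size)
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(10))  # stand-in for 10 rows of data
train, test = shuffle_split(rows, test_size=0.4, seed=101)
print(len(train), len(test))  # 6 4
```

Calling the function again with the same seed returns the same two lists, which is what makes results reproducible between runs.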

## Creating and Training the Model
Let us import the LinearRegression from sklearn and fit the linear regression on the training dataset.

In [None]:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train,y_train)

Congratulations! You have just trained your first model! Now let's check its coefficients...

In [None]:
names = ['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
                'Area Population']

coef = pd.DataFrame(columns=names)
coef.loc[0] = lm.coef_
coef

Here we see the regression coefficients for the different features. A 1 dollar increase in average area income, for example, increases the predicted housing price by about 21.53 dollars, holding the other features constant. Keep in mind though! We are talking about average *area* income, so for it to increase by 1 dollar the whole neighbourhood needs to earn 1 dollar more on average.  

The same holds for house age and number of rooms. 
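
As a small illustration of this reading of a coefficient, the sketch below evaluates a linear model by hand. Only the income coefficient (21.53) is quoted from the text above; the house-age coefficient, the intercept, the feature values and the helper `predict` are hypothetical and chosen purely for the example.

```python
def predict(features, coefficients, intercept):
    """Linear model: intercept plus the sum of coefficient * feature value."""
    return intercept + sum(coefficients[name] * value for name, value in features.items())


# Hypothetical model: only 21.53 comes from the text, the rest is made up
coefficients = {"Avg. Area Income": 21.53, "Avg. Area House Age": 1000.0}
intercept = -5000.0

area = {"Avg. Area Income": 60000.0, "Avg. Area House Age": 6.0}
# The same area, but with an average income that is 1 dollar higher
richer_area = {**area, "Avg. Area Income": area["Avg. Area Income"] + 1}

difference = predict(richer_area, coefficients, intercept) - predict(area, coefficients, intercept)
print(round(difference, 2))  # 21.53
```

The difference equals the income coefficient no matter which starting values are chosen, because all other terms cancel out.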

An important observation here is that all the coefficients are positive. What does this mean?  
Do the features have a positive or a negative influence on housing prices?  
For example, do you expect a higher area income to have a positive or negative effect on housing prices in this area? Does this expectation match the coefficient given?

## Predicting the test set
Now let's predict! We use the trained model to predict prices for the test set.

In [None]:
predictions = lm.predict(X_test)
predictions

Let's visualise the predictions.

In [None]:
plt.scatter(y_test, predictions)
plt.xlabel("Actual Price")
plt.ylabel("Predicted Price")
plt.show()

## Cross-validating your predictions


Now that we have our predictions, it is time to check how accurate they are by cross-validating.  
Let's start with importing the necessities.

In [None]:
# Necessary imports: 
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn import metrics

Now we perform a 6-fold cross validation.
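
Before running it, here is a hedged sketch of the two ideas behind the numbers we are about to see: how the rows can be divided into folds, and what the $R^2$ score measures (it is the default score scikit-learn reports for regressors). The helpers `k_fold_indices` and `r_squared` are illustrative stand-ins written with the standard library only, not scikit-learn's own code.

```python
def k_fold_indices(n_samples, k):
    """Split the row indices 0..n_samples-1 into k consecutive folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # the first `extra` folds get one more row
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


folds = k_fold_indices(n_samples=20, k=6)
print([len(fold) for fold in folds])  # [4, 4, 3, 3, 3, 3]

# Each fold is held out once; a model would be fitted on the remaining rows
for held_out, test_idx in enumerate(folds):
    train_idx = [j for i, fold in enumerate(folds) if i != held_out for j in fold]
    print(held_out, len(train_idx), len(test_idx))

print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 1.0
print(r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # 0.0
```

A score of 1 means perfect predictions, while a score of 0 means the model does no better than always predicting the mean of the target.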

In [None]:
# Perform 6-fold cross validation
scores = cross_val_score(lm, X, y, cv=6)
print('Cross-validated scores Linear Regression:', scores)
# Calculate the mean score (R^2 is the default score for regressors)
np.mean(scores)

Our mean cross-validation score ($R^2$) for Linear Regression is 0.917.  
Now let's see if we can predict housing prices better using another method, the Random Forest.

In [None]:
# Import the model we are using
from sklearn.ensemble import RandomForestRegressor
# Instantiate model with 1000 decision trees
rf = RandomForestRegressor(n_estimators = 1000, random_state = 42)
# Train the model on training data
rf.fit(X_train, y_train)

We have now trained a Random Forest. As with Linear Regression, let's cross-validate and compare!

In [None]:
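# Perform 6-fold cross validation on the same data (X, y) as for the
# Linear Regression, so that the two sets of scores are comparable
scores = cross_val_score(rf, X, y, cv=6)
print('Cross-validated scores Random Forest:', scores)
# Calculate the mean score
np.mean(scores)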

Which method would you choose? The Linear Regression or the Random Forest? And why?