<a href="https://colab.research.google.com/github/michalis0/DataMining_and_MachineLearning/blob/master/week4/regression_students.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Linear Regression

Here we will examine whether we can predict the price of houses in Iowa given some of their features.

## Loading the data

Read the data file into a Pandas DataFrame called `home_data`.

In [None]:
import pandas as pd

# Path of the file to read
data_path = "https://raw.githubusercontent.com/michalis0/DataMining_and_MachineLearning/master/week4/data/housing-prices-dataset/train.csv"
home_data = pd.read_csv(data_path)

print("Setup Complete")
home_data.head()

We will select the **features**, which by convention are called `X`. We will also choose the target variable, which we typically call `y`.

In [None]:
feature_names = ['1stFlrSF']
X = home_data[feature_names]
y = home_data["SalePrice"]

In [None]:
X.head()

In [None]:
y.head()

We create the linear model. 

In [None]:
# do the right imports
from sklearn.linear_model import LinearRegression

# create the model
model = LinearRegression()

# Fit the model
model.fit(X,y)

and we do the predictions:

In [None]:
predictions = model.predict(X)
print(predictions)

and we can plot the data and the regression.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.scatter(X.iloc[:,0], y)
plt.plot(X.iloc[:,0], predictions, 'r')
plt.xlabel(X.columns[0])
plt.ylabel('SalePrice')
plt.show()

What are the model's MAE and $R^2$?

In [None]:
from sklearn.metrics import mean_absolute_error, r2_score

predicted_home_prices = model.predict(X)
mae = mean_absolute_error(y, predicted_home_prices)
r2 = r2_score(y, predicted_home_prices)

print("MAE %.2f" % mae)
print("R^2 %.2f" % r2)
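
The $R^2$ score printed above can also be computed by hand as $1 - SS_{res}/SS_{tot}$. The following is a minimal sketch on a few made-up numbers (the arrays `y_true` and `y_pred` are hypothetical and not taken from the housing data), compared against `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical toy values, not taken from the housing data
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 210.0, 240.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
r2_manual = 1 - ss_res / ss_tot

print("R^2 (manual)  %.4f" % r2_manual)
print("R^2 (sklearn) %.4f" % r2_score(y_true, y_pred))
```

An $R^2$ of 1 means the predictions match the targets exactly, while 0 means the model does no better than always predicting the mean of `y`.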

In [None]:
%reset

# Exercises

Now it's your turn! Make 2 linear regressions:

    A. Between the column `OverallQual` and the `SalePrice`.
    B. Between the column `FullBath` and the `SalePrice`.
    
<br>
**Which has the lowest MAE and the highest $R^2$?**


## Step 0: We load the dataset and the necessary files

In [None]:
# import pandas
import 

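# Path of the raw file on GitHub (same URL as in the example above)
data_path = "https://raw.githubusercontent.com/michalis0/DataMining_and_MachineLearning/master/week4/data/housing-prices-dataset/train.csv"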

# read the file using pandas
home_data = 

In [None]:
# How many observations do we have? Hint: use the `shape` attribute
# How many columns are there?

print("Number of observations", ---)
print("Number of dimensions", ---)




## Step 1: Specify Prediction Target
Select the target variable (what we want to predict), which corresponds to the **sales price**. Save this to a new variable called `y`. You'll need to print a list of the columns to find the name of the column you need.

In [None]:
# print the list of columns in the dataset to find the name of the prediction target


In [None]:
# store in y the column with the target variable 
# y = 

## Step 2: Create X
Now you will create a DataFrame called **`X`** holding the predictive features.

Since you want only some columns from the original data, you'll first create a list with the names of the columns you want in `X`.

There are a number of numerical columns that you can use:
 * LotArea
 * YearBuilt
 * 1stFlrSF
 * 2ndFlrSF
 * FullBath
 * BedroomAbvGr
 * TotRmsAbvGrd
 * OverallQual

However, for now just use either `FullBath` or `OverallQual`.

After you've created that list of features, use it to create the DataFrame that you'll use to fit the model.

In [None]:
# Create the list of features below
feature_names = 

# select data corresponding to features in feature_names
X = 
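
As a side note on why the features are selected with a *list* of column names: scikit-learn expects `X` to be two-dimensional. A minimal sketch on a made-up table (the DataFrame `toy` and its column names are hypothetical) shows the difference between the two ways of indexing:

```python
import pandas as pd

# Hypothetical toy table; the column names are made up for illustration
toy = pd.DataFrame({
    "rooms": [3, 4, 5],
    "area": [70, 95, 120],
    "price": [200, 260, 330],
})

as_frame = toy[["rooms"]]  # list of names -> DataFrame with shape (3, 1)
as_series = toy["rooms"]   # single name   -> Series with shape (3,)

print(type(as_frame).__name__, as_frame.shape)
print(type(as_series).__name__, as_series.shape)
```

The two-dimensional DataFrame is the form that can be passed as `X` to `fit`, while the one-dimensional Series is the usual form for the target `y`.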

## Review Data
Before building a model, take a quick look at **X** to verify that it looks sensible.

In [None]:
# Review data
# print description or statistics from X


# print the top few lines of X


## Step 3: Specify and Fit Model
Create a `LinearRegression` model and save it as `iowa_model`. Ensure you've done the relevant import from sklearn to run this command.

Then fit the model you just created using the data in `X` and `y` that you saved above.

In [None]:
# do the right imports
from sklearn.linear_model import LinearRegression

# create the model
iowa_model = LinearRegression()

# Fit the model
iowa_model.fit(X,y)


What are the **parameters** of the model?

In [None]:
print(f'intercept = {iowa_model.intercept_}')
print(f'coefficients = {iowa_model.coef_}')
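
To see what these two parameters mean, here is a small sketch on synthetic data (the arrays `x` and `y_toy` and the model `toy_model` are hypothetical names, not part of the exercise). With a single feature, the prediction is simply `intercept + coefficient * x`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data lying exactly on the line y = 2 + 3x
x = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = 2.0 + 3.0 * x[:, 0]

toy_model = LinearRegression().fit(x, y_toy)

# Rebuild the predictions by hand from the fitted parameters
manual = toy_model.intercept_ + toy_model.coef_[0] * x[:, 0]

print("intercept =", round(toy_model.intercept_, 6))
print("slope     =", round(toy_model.coef_[0], 6))
print("matches predict():", np.allclose(manual, toy_model.predict(x)))
```

For the housing models, the coefficient can be read in the same way: it is the change in predicted `SalePrice` for a one-unit increase in the chosen feature.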

## Step 4: Make Predictions

Make predictions with the model's `predict` method using `X` as the data. Save the results to a variable called `predictions`.

In [None]:
predictions = _
print(predictions)

## Show the regression

Now show the regression.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.scatter(--, --)
plt.plot(---, predictions, 'r')
plt.xlabel(X.columns[0])
plt.ylabel('SalePrice')
plt.show()

## Model Validation

You've built a model. But how good is it?

The prediction error for each house is:  `error=|actual−predicted|`

So, if a house cost CHF 150'000 and you predicted it would cost CHF 100'000 the error is  CHF 50'000.

To get a single number, we average these absolute errors over all the houses. We call the result the **MAE** (Mean Absolute Error).
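
The averaging step can be sketched by hand with NumPy. The prices below are made up for illustration (only the first pair is the CHF 150'000 / CHF 100'000 example from the text), and the names `actual`, `predicted` and `mae_manual` are hypothetical:

```python
import numpy as np

# Hypothetical prices in CHF; the first pair is the example from the text
actual = np.array([150_000, 200_000, 310_000])
predicted = np.array([100_000, 215_000, 300_000])

errors = np.abs(actual - predicted)  # absolute error for each house
mae_manual = errors.mean()           # average over all the houses

print("errors:", errors)
print("MAE %.2f" % mae_manual)
```

Taking the absolute value first matters: without it, over-predictions and under-predictions would cancel each other out in the average.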

### Compute the MAE and the $R^2$ of the two models. 

For which feature
   - FullBath
   - OverallQual
 
do we have the lowest MAE?

In [None]:
from sklearn.metrics import mean_absolute_error, r2_score

predicted_home_prices = 
mae = 
r2 = 

print("MAE %.2f" % mae)
print("R^2 %.2f" % r2)

So, on average we are off by some CHF 30k-46k on the predicted price. But this is for the "in-sample" points. 

However, in practice we **should always** evaluate the quality of our model on data points that were not used to create the model.

Try to load the test data from the following link:

`https://raw.githubusercontent.com/michalis0/DataMining_and_MachineLearning/master/week4/data/housing-prices-dataset/test.csv`

Using the models you just trained, try to predict the sales prices for the test data. Then compute the MAE and $R^2$ score of each model for the test data. Which model performs better on the (unseen) test data?
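
The same idea of out-of-sample evaluation can be illustrated without the housing files. The sketch below is only an illustration under stated assumptions: it uses synthetic data (`X_syn`, `y_syn`) and a random split via `train_test_split` rather than the separate test file from the exercise. It fits on one part of the data and reports the MAE and $R^2$ on both parts:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): a noisy linear relation
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(200, 1))
y_syn = 5.0 + 2.0 * X_syn[:, 0] + rng.normal(0, 1.0, size=200)

# Hold out 25% of the points; the model never sees them during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0
)

syn_model = LinearRegression().fit(X_train, y_train)

# Report the same two metrics on the seen and the unseen points
for name, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = syn_model.predict(X_part)
    mae_part = mean_absolute_error(y_part, pred)
    r2_part = r2_score(y_part, pred)
    print("%s: MAE %.2f, R^2 %.2f" % (name, mae_part, r2_part))
```

The numbers on the held-out part are the ones that indicate how the model is likely to behave on new houses.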