# Linear Regression
You should build a machine learning pipeline using a linear regression model. In particular, you should do the following:
- Load the `housing` dataset using [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). You can find this dataset in the datasets folder.
- Split the dataset into training and test sets using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). 
- Train and test a linear regression model using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html).
- Check the documentation to identify the most important hyperparameters, attributes, and methods of the model. Use them in practice.
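
As a rough orientation for the last task, the sketch below touches one hyperparameter (`fit_intercept`), two fitted attributes (`coef_`, `intercept_`), and three methods (`fit`, `predict`, `score`) of `LinearRegression`. The toy data is made up for illustration and is not part of the housing dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data that follows y = 2*x + 1 exactly (made up for illustration).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X[:, 0] + 1.0

# Hyperparameter: fit_intercept controls whether a bias term is learned.
model = LinearRegression(fit_intercept=True)

# Method: fit estimates the parameters from the data.
model.fit(X, y)

# Attributes: learned slope(s) and intercept, available only after fit.
slope = model.coef_[0]
intercept = model.intercept_

# Methods: predict returns estimates, score returns R-squared.
prediction = model.predict(np.array([[4.0]]))[0]
r2 = model.score(X, y)

print(slope, intercept, prediction, r2)
```

Because the toy data is perfectly linear, the fitted slope and intercept recover the generating values up to floating-point error.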

## Load Libraries

In [2]:
import pandas as pd
import sklearn.model_selection
import sklearn.linear_model
import sklearn.metrics

## Load Data

In [3]:
csv_route = '/Users/adolfomytr/Documents/Alemania/Master/GISMA/Materias/teaching-main/datasets/housing.csv'
housing_df = pd.read_csv(csv_route)
housing_df = housing_df.set_index('id')
housing_df.head()

id,price,area,bedrooms,bathrooms,stories,stories.1,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
0,13300000,7420,4,2,3,1,0,0,0,1,2,1,1.0
1,12250000,8960,4,4,4,1,0,0,0,1,3,0,1.0
2,12250000,9960,3,2,2,1,0,1,0,0,2,1,0.5
3,12215000,7500,4,2,2,1,0,1,0,1,3,1,1.0
4,11410000,7420,4,1,2,1,1,1,0,1,2,0,1.0
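
The two-step load above (`read_csv` followed by `set_index`) can probably be collapsed into one call with the `index_col` parameter of `pd.read_csv`. The sketch below assumes a small in-memory stand-in for the file (`csv_text`, a hypothetical name), with three rows and three columns copied from the preview.

```python
import io
import pandas as pd

# In-memory stand-in for housing.csv: a few rows and columns from the preview.
csv_text = """id,price,area,bedrooms
0,13300000,7420,4
1,12250000,8960,4
2,12250000,9960,3
"""

# index_col="id" sets the index while reading, so no set_index call is needed.
sample_df = pd.read_csv(io.StringIO(csv_text), index_col="id")

print(sample_df.shape)
print(sample_df.dtypes)
```

Printing `dtypes` right after loading is a cheap way to confirm that every feature is numeric before it reaches the model.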


## Split the Data into Training and Test Sets
Define the variables `x` and `y`: `x` is the feature matrix (every column except `price`) and `y` is the target variable (`price`).

In [4]:
x = housing_df.drop(['price'], axis=1)  # axis=1 drops a column; axis=0 drops a row
y = housing_df['price']
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y)

print('housing_df', housing_df.shape)
print('x_train', x_train.shape)
print('x_test', x_test.shape)
print('y_train', y_train.shape)
print('y_test', y_test.shape)

housing_df (545, 13)
x_train (408, 12)
x_test (137, 12)
y_train (408,)
y_test (137,)
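
The split above relies on the defaults of `train_test_split`, so the rows are shuffled differently on every run and the metrics change with them. A minimal sketch of a reproducible split follows, assuming a synthetic stand-in frame (`demo_df`, a hypothetical name) with the same number of rows and an arbitrary seed of 42.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with 545 rows, matching the row count printed above.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "area": rng.integers(1000, 10000, size=545),
    "bedrooms": rng.integers(1, 6, size=545),
})
demo_df["price"] = demo_df["area"] * 500 + demo_df["bedrooms"] * 100000

x_demo = demo_df.drop(columns=["price"])
y_demo = demo_df["price"]

# test_size=0.25 makes the default explicit; random_state fixes the shuffle.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.25, random_state=42
)
# A second call with the same seed selects exactly the same rows.
x_tr2, x_te2, _, _ = train_test_split(
    x_demo, y_demo, test_size=0.25, random_state=42
)

print(x_tr.shape, x_te.shape)
print(x_te.index.equals(x_te2.index))
```

With 545 rows, a 25% test share yields the same 408/137 split sizes as the output above.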


## Train the Model

In [5]:
model = sklearn.linear_model.LinearRegression()
model.fit(x_train, y_train)

## Test the Model

In [6]:
y_prediction = model.predict(x_test)
mse = sklearn.metrics.mean_squared_error(y_test, y_prediction)
print('MSE', mse)  # Useful for comparing models, but hard to interpret on its own


# All of the following metrics depend on the scale of the target,
# so their raw values are not very intuitive and give little insight on their own:
# - Mean squared error (MSE)
# - Root mean squared error (RMSE)
# - Mean absolute error (MAE)

# R-squared normalizes the error into a bounded range (at most 1), which makes it comparable.


MSE 1195994692675.9238
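
To make the comparison between these metrics concrete, the sketch below computes all four on a tiny set of made-up targets and predictions (not taken from the housing data). It uses `mean_absolute_error` and `r2_score` from `sklearn.metrics` and derives RMSE as the square root of MSE.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Made-up targets and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

mse = mean_squared_error(y_true, y_pred)   # squared units of the target
rmse = np.sqrt(mse)                        # same units as the target
mae = mean_absolute_error(y_true, y_pred)  # same units as the target
r2 = r2_score(y_true, y_pred)              # unitless, at most 1

print('MSE', mse)
print('RMSE', rmse)
print('MAE', mae)
print('R2', r2)
```

Applied to the MSE printed above, the square root is roughly 1.09 million, expressed in the same units as `price`. In the notebook itself, the same functions would take `y_test` and `y_prediction` as arguments.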
