# Knowledge Check 1

### Fill in the details below (required to pass) before submitting on Omniway:
Name: 

Date you presented your work in class: 2024-02-21

If you did not present it in class, state who you reviewed the code with: 

# Task

As a data analyst, there are plenty of opportunities to improve processes or suggest better ways of doing things. When doing so, it is often smart and efficient (time is a scarce resource) to create a POC (Proof of Concept): a small demo that checks whether it is worthwhile going further with something. A POC is also something concrete that facilitates discussion; do not underestimate the power of that. 

In this example, you work at a company that sells houses, and prices are currently set manually by humans. As a Data Scientist, you can improve this process by using Machine Learning. Your task is to create a POC that you will present to your team colleagues and use as a basis for discussing whether or not to continue with more detailed modelling. 

Two quotes to facilitate your reflection on the value of creating a POC: 

"*Premature optimization is the root of all evil*". 

"*Fail fast*".


**More specifically, do the following:**
1. A short EDA (Exploratory Data Analysis) of the housing data set.
2. Drop the column "ocean_proximity", then you only have numeric columns which will simplify your analysis. Remember, this is a POC!
3. Split your data into train and test set.
4. You have missing values in your data. Handle this with [ SimpleImputer(strategy="median") ], check the fantastic Scikit-learn documentation for details.
5. Create one "Linear Regression" model and one "Lasso" model. For the Lasso model, use GridSearchCV to optimize $\alpha$ values, choose yourself which $\alpha$ values to evaluate.
6. Use RMSE as a metric to decide which model to choose. 
7. Evaluate your chosen model on the test set using the root mean squared error (RMSE) as the metric. Conclusions? 

8. Do a short presentation (~ 2-5 min) on your POC that you present to your colleagues (no need to prepare anything particular, just talk from the code). Think of:
- What do you want to highlight/present?
- What is your conclusion?
- What could be the next step? Is the POC convincing enough, or is it not worthwhile continuing? Do we need to dig deeper into this before making any decisions?
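
To make step 4 concrete, here is a minimal sketch of what `SimpleImputer(strategy="median")` does. The toy matrix is made up for illustration and is not taken from the housing data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix (made-up values) with one missing entry per column
X_toy = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [np.nan, 30.0],
    [100.0, 40.0],
])

imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X_toy)

# Medians learned per column from the observed (non-missing) values
print(imputer.statistics_)
# The same matrix with each NaN replaced by its column median
print(X_filled)
```

The median is less sensitive to extreme values than the mean: the value 100.0 in the first column does not pull the imputed value upwards.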

# Code

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns; sns.set_theme()

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

from sklearn.linear_model import LinearRegression, Lasso

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, cross_val_score

In [2]:
# Below, set your own path where you have stored the data file. 
housing = pd.read_csv(r'housing.csv')

FileNotFoundError: [Errno 2] No such file or directory: 'housing.csv'

In [None]:
housing.head()

In [None]:
housing.info()

In [None]:
housing = housing.drop('ocean_proximity', axis=1)

In [None]:
housing.head()

In [None]:
X = housing.drop('median_house_value', axis=1)
y = housing['median_house_value'].copy()

In [None]:
# X.head()

In [None]:
# y.head()

Kalle comes to our real estate agency. 
He says that the area's median income is 8000, there are 900 rooms, ....
--> We predict a reasonable price for the house. 
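
The scenario above could be sketched roughly as follows. Everything here is an assumption made for illustration: the data is synthetic, the two column names and the price formula are hypothetical, and the input values are not in the units of the real data set.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; column names and the price formula are assumptions
rng = np.random.default_rng(42)
n = 200
train = pd.DataFrame({
    "median_income": rng.uniform(1, 10, n),
    "total_rooms": rng.integers(100, 5000, n).astype(float),
})
price = (40_000 * train["median_income"] + 5 * train["total_rooms"]
         + rng.normal(0, 10_000, n))

# Preprocessing and model in one pipeline, so new data is prepared identically
model = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression()),
])
model.fit(train, price)

# One new observation with the same columns as the training data
new_house = pd.DataFrame({"median_income": [8.0], "total_rooms": [900.0]})
predicted_price = model.predict(new_house)[0]
print(f"Predicted price: {predicted_price:.0f}")
```

Putting the regressor inside the pipeline is one possible design: a single `predict` call then applies imputation and scaling to the new observation automatically.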

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

## EDA

In [None]:
X_train.describe()

In [None]:
df = X_train.copy()
df['target'] = y_train
corr_matrix = df.corr()
sns.heatmap(corr_matrix)

In [None]:
# X_train.info()

## Preparing data

In [None]:
# Create pipeline
steps = [
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
]

pipeline = Pipeline(steps=steps)

In [None]:
# Fit pipeline and transform training data
X_train_prepared = pipeline.fit_transform(X_train)

## Validating models

In [None]:
# Instantiate and cross validate linear regression model

linreg = LinearRegression()
linreg_scores = cross_val_score(linreg, X_train_prepared, y_train, cv=5, scoring='neg_mean_squared_error')
linreg_rmses = np.sqrt(-linreg_scores)
print(f'Average Linear Regression RMSE: {np.mean(linreg_rmses)}')

In [None]:
# Instantiate Lasso model, find best alpha value using GridSearch

lasso = Lasso()
params = {
    'alpha': [1, 10, 41, 50]
}
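# Score on (negated) MSE so that alpha is chosen by the same error metric used elsewhere
lasso_reg = GridSearchCV(lasso, params, cv=5, scoring='neg_mean_squared_error')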
lasso_reg.fit(X_train_prepared, y_train)

print(lasso_reg.best_params_)

In [None]:
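# Cross validate Lasso model. lasso_reg is the GridSearchCV object, so the grid search
# is rerun inside each fold. Note: cv=3 here, while linear regression above used cv=5.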

lasso_reg_scores = cross_val_score(lasso_reg, X_train_prepared, y_train, cv=3, scoring='neg_mean_squared_error')
lasso_reg_rmses = np.sqrt(-lasso_reg_scores)
print(f'Average Lasso RMSE: {np.mean(lasso_reg_rmses)}')

### Findings
The Lasso model performs marginally better than the linear regression model: its average cross-validated RMSE is lower by $13.14$.

The Lasso model will be used to predict against the test data.
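
When the gap between two models is this small, it can help to score both on exactly the same folds and to look at the spread across folds. The sketch below shows one way to do that; it uses synthetic data from `make_regression` and an arbitrary `alpha`, so its numbers say nothing about the housing data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data stands in for the prepared housing features
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=20.0,
                                 random_state=0)

# One shared splitter, so every model is scored on exactly the same folds
folds = KFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "linear": LinearRegression(),
    "lasso": Lasso(alpha=1.0),  # alpha chosen arbitrarily for the illustration
}

fold_rmses = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=folds,
                             scoring="neg_root_mean_squared_error")
    fold_rmses[name] = -scores  # the scorer returns negated RMSE values

for name, rmses in fold_rmses.items():
    print(f"{name}: mean RMSE {rmses.mean():.2f} (std across folds {rmses.std():.2f})")

# Per-fold difference; compare its size with the fold-to-fold spread above
print(fold_rmses["linear"] - fold_rmses["lasso"])
```

If the difference between the models is much smaller than the variation between folds, the choice between them is unlikely to matter much for the POC.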

## Final testing of models

In [None]:
# Transform the test data

X_test_prepared = pipeline.transform(X_test)

In [None]:
# Predict using Lasso model

lasso_pred = lasso_reg.predict(X_test_prepared)
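# Take the square root explicitly; this avoids the `squared` argument,
# which newer scikit-learn versions deprecate
lasso_RMSE = np.sqrt(mean_squared_error(y_test, lasso_pred))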

print(lasso_RMSE)
print(lasso_RMSE/y_test.mean())

In [None]:
# y_test.describe()

### Quick conclusion
In cross validation, the Lasso model performs marginally better than the linear regression model.

However, the RMSE of the model on the test set is a bit over $70000, which is about 35% of the mean house value. Further steps to fine-tune the model could be taken.
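
The arithmetic behind a statement such as "RMSE is about 35% of the mean" can be sketched on a tiny made-up example. The three prices and predictions below are invented, and the mean-predicting baseline is just one possible reference point, not something the notebook itself computes.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true prices and predictions, for illustration only
y_true = np.array([100_000.0, 200_000.0, 300_000.0])
y_pred = np.array([110_000.0, 190_000.0, 320_000.0])

# RMSE: square root of the mean squared error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

# Relative RMSE: error expressed as a fraction of the mean target value
relative_rmse = rmse / y_true.mean()

# Naive baseline: always predict the mean of the targets
baseline_pred = np.full_like(y_true, y_true.mean())
baseline_rmse = np.sqrt(mean_squared_error(y_true, baseline_pred))

print(f"RMSE: {rmse:.2f}")
print(f"Relative RMSE: {relative_rmse:.4f}")
print(f"Baseline RMSE: {baseline_rmse:.2f}")
```

Comparing the model's RMSE with such a baseline gives a rough sense of how much the model improves on a constant guess.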

In [None]:
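# cross_val_score only fits clones, so linreg itself must be fitted before predicting
preds = linreg.fit(X_train_prepared, y_train).predict(X_test_prepared)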

###### ---- End ----