# Exploring Real Estate Sales Prices

In [3]:
%matplotlib inline
import mpld3
mpld3.enable_notebook()
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

## 1. Exploration

### Question 1

Load the Ames housing dataset using `pandas`. It is located in `datasets/ames_housing.csv`. Using the functions `head()` and `info()`, which issues do you identify that need to be noted before training a machine learning model?

The dataset is described in https://www.kaggle.com/datasets/prevek18/ames-housing-dataset

In [None]:
data = pd.read_csv('datasets/ames_housing.csv')
data.head()

### Question 2

- Identify the target variable, "SalePrice". What is its type? What are its distributional characteristics?
- Which variables contain the most missing values?
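
As a starting point, here is a minimal sketch of how both bullets could be checked. It runs on a small synthetic frame, not on the real file: the column names `Lot Frontage` and `Alley` and all the values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in (NOT the Ames data): right-skewed prices, two columns with gaps.
toy = pd.DataFrame({
    "SalePrice": rng.lognormal(mean=12, sigma=0.4, size=500).round(),
    "Lot Frontage": rng.normal(70, 20, size=500),
    "Alley": rng.choice(["Pave", "Grvl"], size=500),
})
toy.loc[:99, "Lot Frontage"] = np.nan   # labels 0..99 inclusive: 100 missing
toy.loc[:449, "Alley"] = np.nan         # labels 0..449 inclusive: 450 missing

print(toy["SalePrice"].dtype)           # type of the target
print(toy["SalePrice"].describe())      # location and spread
skewness = toy["SalePrice"].skew()      # > 0 indicates a long right tail
print("skewness:", skewness)

# Share of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

On the real data, the same calls would be applied to `data` instead of `toy`.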

### Question 3
Split the data into features and target variables.
Then split the data into a model selection sample and a model evaluation sample. Use `sklearn.model_selection.train_test_split`.
Hold out 20% of the data for the evaluation sample.

In [17]:
from sklearn.model_selection import train_test_split
target = data["SalePrice"]
features = data.drop(columns="SalePrice")

selection_features, evaluation_features, selection_target, evaluation_target = train_test_split(
    features, target, test_size=.2
)
selection_target.shape

(2344,)

### Question 4
Extract the columns with numerical data using `selection_features.select_dtypes("number")`. Examine their distributions through histograms. What issues do you identify? Then use `selection_features.select_dtypes("object")` and seaborn's `sns.countplot` to analyze the string variables. Identify data types and issues.
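
The split by data type can be sketched as follows, on a synthetic stand-in for `selection_features` with hypothetical columns. The sketch uses `exclude="number"` to pick up the non-numerical columns, which is one possible choice, and prints summaries instead of drawing plots so that it runs without a display.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for selection_features (hypothetical columns and values).
toy = pd.DataFrame({
    "Gr Liv Area": [1710, 1262, 1786, 1717, 2198, 1362],
    "Lot Frontage": [65.0, np.nan, 68.0, 60.0, 84.0, 85.0],
    "Neighborhood": ["NAmes", "NAmes", "Gilbert", "NAmes", "Gilbert", "Somerst"],
})

numeric = toy.select_dtypes("number")          # numerical columns only
strings = toy.select_dtypes(exclude="number")  # everything else

print(numeric.describe())                      # ranges, scales, non-null counts
counts = strings["Neighborhood"].value_counts()
print(counts)                                  # category frequencies

# With matplotlib and seaborn available, numeric.hist(bins=30) and
# sns.countplot(data=strings, x="Neighborhood") show the same information.
```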

# Section 2: Implement a linear regressor using only the numerical variables

### Question 1
Use a `ColumnTransformer` to _just select_ the numerical variables. Build a linear regressor using `sklearn.linear_model.LinearRegression`.

For this we will
* build the column transformer
* build the machine learning pipeline
* evaluate it through cross-validation (using `cross_val_score`)

Does it work? Why?

In [18]:
from sklearn.linear_model import LinearRegression
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
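
The three steps can be sketched like this. The data is synthetic (hypothetical columns `area`, `frontage`, `street`) with missing values injected on purpose, and `error_score="raise"` is used so that any failure surfaces as an exception rather than as NaN scores; both are assumptions of this sketch, not part of the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n = 200
# Synthetic stand-in for the selection sample (hypothetical columns).
X = pd.DataFrame({
    "area": rng.normal(1500, 300, n),
    "frontage": rng.normal(70, 15, n),
    "street": rng.choice(["Pave", "Grvl"], n),
})
y = 100 * X["area"] + 500 * X["frontage"] + rng.normal(0, 5000, n)
X.loc[:19, "frontage"] = np.nan  # inject missing values after building y

# Pass numerical columns through unchanged; other columns are dropped
# because the default is remainder="drop".
select_numeric = make_column_transformer(
    ("passthrough", make_column_selector(dtype_include="number"))
)
model = make_pipeline(select_numeric, LinearRegression())

try:
    scores = cross_val_score(model, X, y, cv=5, error_score="raise")
    failure = None
except ValueError as exc:
    failure = str(exc)  # LinearRegression does not accept NaN inputs

print(failure)
```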

### Question 2
Fix the previous issue, first by dropping the problematic rows, then by using the `SimpleImputer`.


In [19]:
from sklearn.impute import SimpleImputer
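
Both fixes can be compared in a short sketch. The data is again synthetic with hypothetical columns, and the median strategy is an arbitrary choice for illustration. Placing the imputer inside the pipeline means each cross-validation fold computes its fill value from its own training part.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n = 200
# Synthetic stand-in for the selection sample (hypothetical columns).
X = pd.DataFrame({
    "area": rng.normal(1500, 300, n),
    "frontage": rng.normal(70, 15, n),
    "street": rng.choice(["Pave", "Grvl"], n),
})
y = 100 * X["area"] + 500 * X["frontage"] + rng.normal(0, 5000, n)
X.loc[:19, "frontage"] = np.nan  # 20 rows with a missing value

numeric = make_column_selector(dtype_include="number")

# Option 1: drop rows with a missing numerical value (loses 20 rows here).
complete = X.select_dtypes("number").notna().all(axis=1)
drop_model = make_pipeline(
    make_column_transformer(("passthrough", numeric)), LinearRegression()
)
drop_scores = cross_val_score(drop_model, X[complete], y[complete], cv=5)

# Option 2: keep all rows and impute inside the pipeline.
impute_model = make_pipeline(
    make_column_transformer((SimpleImputer(strategy="median"), numeric)),
    LinearRegression(),
)
impute_scores = cross_val_score(impute_model, X, y, cv=5)

print("drop rows:", drop_scores.mean(), drop_scores.std())
print("impute   :", impute_scores.mean(), impute_scores.std())
```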

### Question 3
Now plot the evolution of the mean and standard deviation of the score for test sample sizes of 5%, 10%, 20%, 25%, and 30%. What do you conclude?
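
One possible reading of this question is sketched below: repeat random splits with `ShuffleSplit` for each test size and summarize the resulting R² scores. The use of `ShuffleSplit`, the 30 repetitions, and the synthetic data from `make_regression` are all assumptions of this sketch.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic regression data (NOT the Ames data).
X, y = make_regression(n_samples=300, n_features=10, noise=20.0, random_state=0)

summary = {}
for test_size in [0.05, 0.10, 0.20, 0.25, 0.30]:
    # 30 random train/test splits with the given test fraction.
    cv = ShuffleSplit(n_splits=30, test_size=test_size, random_state=0)
    scores = cross_val_score(LinearRegression(), X, y, cv=cv)
    summary[test_size] = (scores.mean(), scores.std())
    print(f"test_size={test_size:.2f}  mean R2={scores.mean():.3f}  std={scores.std():.3f}")

# plt.errorbar(list(summary), [m for m, s in summary.values()],
#              yerr=[s for m, s in summary.values()]) would plot the result.
```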


### Question 4
Use `sklearn.model_selection.learning_curve` to study the learning curve: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.learning_curve.html?highlight=learning_curve

In [20]:
from sklearn.model_selection import learning_curve
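
A minimal sketch of the call on synthetic data follows (the data and the five training sizes are assumptions). `learning_curve` returns the absolute training sizes together with one row of train and validation scores per size, with one column per cross-validation fold.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Synthetic regression data (NOT the Ames data).
X, y = make_regression(n_samples=400, n_features=20, noise=30.0, random_state=0)

train_sizes, train_scores, valid_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),  # fractions of the largest training set
    cv=5,
)

# Average over the folds (axis=1) to get one value per training size.
for size, train, valid in zip(
    train_sizes, train_scores.mean(axis=1), valid_scores.mean(axis=1)
):
    print(f"n={size:4d}  train R2={train:.3f}  validation R2={valid:.3f}")
```

Plotting the two averaged curves against `train_sizes` gives the usual learning-curve figure.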

### Question 5
Are there correlations between the features? Explore them through the correlation matrix and the `sns.pairplot` plotting tool from seaborn (warning: plotting all variables together might be slow).
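
The correlation matrix itself comes from `DataFrame.corr()`. The sketch below builds a synthetic frame in which two hypothetical columns (`living_area`, `rooms`) are related by construction and a third (`year_built`) is independent, so the expected pattern is known in advance.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
area = rng.normal(1500, 300, 300)
# Synthetic features (hypothetical names): rooms is built from area.
toy = pd.DataFrame({
    "living_area": area,
    "rooms": (area / 250 + rng.normal(0, 0.5, 300)).round(),
    "year_built": rng.integers(1900, 2010, 300),
})

corr = toy.corr()  # Pearson correlation between every pair of columns
print(corr.round(2))

# With seaborn available, sns.heatmap(corr, annot=True) shows the matrix and
# sns.pairplot(toy) shows the pairwise scatter plots for a few columns.
```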

### Question 6
Can we use this correlation to improve the learning curve? This is regularization. Let's try ridge regression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html

Plot the learning curve and compare it with that of the plain linear regression.

In [21]:
from sklearn.linear_model import Ridge
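
The comparison could be set up as in the sketch below. The features are synthetic and strongly correlated by construction, `alpha=10.0` is an arbitrary value, and the `StandardScaler` step is an addition of this sketch (the ridge penalty depends on feature scale).

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300
base = rng.normal(size=(n, 5))
# Synthetic correlated features: four noisy copies of five base columns.
X = np.hstack([base + 0.05 * rng.normal(size=(n, 5)) for _ in range(4)])
y = base.sum(axis=1) + rng.normal(scale=1.0, size=n)

models = {
    "linear": make_pipeline(StandardScaler(), LinearRegression()),
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=10.0)),  # arbitrary alpha
}

curves = {}
for name, model in models.items():
    sizes, train_scores, valid_scores = learning_curve(
        model, X, y, train_sizes=[30, 60, 120, 240], cv=5
    )
    curves[name] = valid_scores.mean(axis=1)  # mean validation R2 per size
    print(name, np.round(curves[name], 3))
```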

### Question 6.1
How did you pick your regularization parameter? Use a grid search now. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html


In [22]:
from sklearn.model_selection import GridSearchCV
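
A grid search over the ridge penalty might look like the sketch below, on synthetic data. The logarithmic grid and the scaling step are assumptions. Note that `make_pipeline` names each step after its lowercased class, which is why the parameter is addressed as `ridge__alpha`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (NOT the Ames data).
X, y = make_regression(n_samples=200, n_features=30, noise=25.0, random_state=0)

pipeline = make_pipeline(StandardScaler(), Ridge())
param_grid = {"ridge__alpha": np.logspace(-3, 3, 7)}  # 0.001 ... 1000

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))  # mean cross-validated R2 of the best alpha
```

After fitting, `search.best_estimator_` is the pipeline refit on all the data with the selected `alpha`.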

### Question 7
Do we need all features? Repeat the previous analysis from Question 6, but with the Lasso, which enforces sparsity: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html
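
To see the sparsity effect, one can count the non-zero coefficients for a few penalty strengths. The sketch below does this on synthetic data where only 5 of 30 features carry signal; the data and the three `alpha` values are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (NOT the Ames data): 5 informative features out of 30.
X, y = make_regression(
    n_samples=300, n_features=30, n_informative=5, noise=10.0, random_state=0
)

kept = {}
for alpha in [0.1, 1.0, 10.0]:
    model = make_pipeline(StandardScaler(), Lasso(alpha=alpha)).fit(X, y)
    # model[-1] is the fitted Lasso step; zero coefficients are dropped features.
    kept[alpha] = int(np.sum(model[-1].coef_ != 0))
    print(f"alpha={alpha:5.1f}  non-zero coefficients: {kept[alpha]} of {X.shape[1]}")
```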

### Question 8

Now we will repeat the same analysis with the categorical variables, for which we will use the `OneHotEncoder` and the `OrdinalEncoder`:
* https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
* https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html

Combine them with the preprocessing pipeline from Section 2, Questions 1 and 2.

In [23]:
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
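
A combined preprocessor could be sketched as follows. The frame is a synthetic stand-in, and the column names, the quality levels, and their order are assumptions chosen for illustration. Nominal columns get one-hot encoding, while columns with a natural order get an ordinal encoding with the order stated explicitly.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Synthetic stand-in (hypothetical columns and categories).
toy = pd.DataFrame({
    "area": [1710.0, 1262.0, None, 1717.0],                     # numerical, one gap
    "neighborhood": ["NAmes", "Gilbert", "NAmes", "Somerst"],   # nominal
    "kitchen_quality": ["TA", "Gd", "Ex", "TA"],                # ordered
})

preprocessor = make_column_transformer(
    (SimpleImputer(strategy="median"), ["area"]),
    (OneHotEncoder(handle_unknown="ignore"), ["neighborhood"]),
    # Explicit order from worst to best, so the integer codes are meaningful.
    (OrdinalEncoder(categories=[["Fa", "TA", "Gd", "Ex"]]), ["kitchen_quality"]),
    sparse_threshold=0,  # always return a dense array
)

encoded = preprocessor.fit_transform(toy)
print(encoded)  # columns: area, three one-hot columns, ordinal quality code
```

This preprocessor can then be placed in front of a regressor with `make_pipeline`, in the same way as the numerical-only version.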

### Question 9

Non-linearity! Now use the `RandomForestRegressor` https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html to fit and predict the data. The two hyper-parameters that you will use are:

* `n_estimators`: the number of trees in the forest (default 100), which deals with the uncertainty in the data/algorithm relationship.
* `max_depth`: the maximum depth of each tree (no limit by default), which deals with the granularity of the solution.

Use a grid search with cross-validation to set the two parameters. Plot the learning curve.

In [24]:
from sklearn.ensemble import RandomForestRegressor
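
The grid search over the two hyper-parameters can be sketched on a synthetic non-linear problem (`make_friedman1`, not the Ames data). The grid values are deliberately small assumptions, chosen to keep the run short. Since the target is continuous, the sketch uses `RandomForestRegressor`.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic non-linear regression problem (Friedman #1).
X, y = make_friedman1(n_samples=300, noise=1.0, random_state=0)

param_grid = {
    "n_estimators": [25, 100],
    "max_depth": [3, None],  # None means the trees grow without a depth limit
}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))  # mean cross-validated R2
```

The fitted `search.best_estimator_` can be passed to `learning_curve` to draw the learning curve.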

### Question 10

We will now use the data to obtain 
Use permutation feature importance to assess which features are the most important in predicting house prices: https://scikit-learn.org/stable/modules/permutation_importance.html

Compare these importances across models.

In [25]:
from sklearn.inspection import permutation_importance
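
The mechanics are sketched below on synthetic data where the true ranking is known: the first feature dominates, the second matters a little, and the third is pure noise. The data and the choice of a plain linear model are assumptions. Importances are computed on held-out data so that they reflect predictive value rather than fit.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
# Synthetic target: feature 0 strong, feature 1 weak, feature 2 irrelevant.
y = 5.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.5, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = LinearRegression().fit(X_train, y_train)

# Shuffle each column 10 times and record the drop in the R2 score.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)
ranking = np.argsort(result.importances_mean)[::-1]  # most important first
for i in ranking:
    print(f"feature {i}: {result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")
```

To compare across models, the same call is repeated with each fitted estimator on the same held-out sample.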

### Question 11

Pick one of the estimators. Use `cross_val_predict` to evaluate the quality of the prediction in different cases.

https://scikit-learn.org/stable/auto_examples/model_selection/plot_cv_predict.html#sphx-glr-auto-examples-model-selection-plot-cv-predict-py

`cross_val_predict` gives you a prediction for each element in the target. Produce a scatterplot of target against prediction, and use the trained model and the feature importance to find the most explanatory variables.

In [28]:
from sklearn.model_selection import cross_val_predict
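
A minimal sketch of the call follows, on synthetic data with `Ridge` as a placeholder estimator (both are assumptions). Each prediction comes from a model that did not see that sample during fitting, so the predictions can be compared directly with the targets.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_predict

# Synthetic regression data (NOT the Ames data).
X, y = make_regression(n_samples=200, n_features=10, noise=20.0, random_state=0)

# One out-of-fold prediction per sample.
predicted = cross_val_predict(Ridge(alpha=1.0), X, y, cv=5)
residuals = y - predicted

print("R2 of out-of-fold predictions:", round(r2_score(y, predicted), 3))
print("largest absolute residual:", round(np.abs(residuals).max(), 1))

# plt.scatter(y, predicted) with the diagonal y = x as a reference line
# shows where the model over- or under-predicts.
```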


# Section 3: Analyzing our phenomenon.

Now that we have picked one model: