# 21_HES-SO-ARC_646-2.3 SCIENCE DES DONNÉES 😁
## House Prices: Advanced Regression Techniques
##### Adrien Sigrist, Vincent Zurbrugg, Loic Mary, Antoine Frey

**Each phase of the process:**
1. [Business understanding](#Businessunderstanding)
    1. [Determine the Business Objectives](#BusinessObjectives)
    2. [Assess the Current Situation](#Assessthecurrentsituation)
        1. [Inventory of resources](#Inventory)
        2. [Requirements, assumptions and constraints](#Requirements)
        3. [Risks and contingencies](#Risks)
        4. [Terminology](#Terminology)
        5. [Costs and benefits](#CostBenefit)
    3. [What are the Desired Outputs](#Desiredoutputs)
    4. [What Questions Are We Trying to Answer?](#QA)
2. [Data Understanding](#Dataunderstanding)
    1. [Initial Data Report](#Datareport)
    2. [Describe Data](#Describedata)
    3. [Initial Data Exploration](#Exploredata) 
    4. [Verify Data Quality](#Verifydataquality)
        1. [Missing Data](#MissingData) 
        2. [Outliers](#Outliers) 
    5. [Data Quality Report](#Dataqualityreport)
3. [Data Preparation](#Datapreparation)
    1. [Select Your Data](#Selectyourdata)
    2. [Cleanse the Data](#Cleansethedata)
        1. [Label Encoding](#labelEncoding)
        2. [Drop Unnecessary Columns](#DropCols)
        3. [Altering Datatypes](#AlteringDatatypes)
        4. [Dealing With Zeros](#DealingZeros)
    3. [Construct Required Data](#Constructrequireddata)
    4. [Integrate Data](#Integratedata)
4. [Exploratory Data Analysis](#EDA)
5. [Modelling](#Modelling)
    1. [Modelling Technique](#ModellingTechnique)
    2. [Modelling Assumptions](#ModellingAssumptions)
    3. [Build Model](#BuildModel)
    4. [Assess Model](#AssessModel)
6. [Evaluation](#Evaluation)
    1. [Evaluate Results](#EvaluateResults)
    2. [Review Process](#ReviewProcess)
    3. [Determine Next Steps](#NextSteps)
7. [Deployment](#Deployment)
    1. [Plan Deployment](#PlanDeployment)
    2. [Plan Monitoring and Maintenance](#Monitoring)
    3. [Produce Final Report](#FinalReport)
    4. [Review Project](#ReviewProject)


# 1. Business Understanding  <a class="anchor" id="Businessunderstanding"></a>

## 1.1 Business Objectives <a class="anchor" id="BusinessObjectives"></a>

Predict the sale price of houses whose actual prices are known only to Kaggle, as accurately as possible.
To do so, we have a "train" dataset to train our algorithm and a "test" dataset to test it at the end.

Success is measured on relative errors rather than absolute errors. This matters because the error between the estimated price and the actual price is measured relative to the price of the house. In other words, expensive houses and cheap houses affect the result in the same way.

Example: take two houses, the first at 100,000 dollars and the second at 1,000,000 dollars. Measuring the error in absolute terms would mean that a difference of 10,000 dollars affects the result equally for both houses. That is not the case here: the difference is measured relative to the price of the house, so the more expensive house affects the result much less, because 10,000 dollars is a smaller share of its price.

Knowing this, we will use the log of the price, which lets us work with house prices in relative terms.
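
To make the point about relative errors concrete, here is a minimal sketch. It assumes the error is measured as the difference between the logarithms of the predicted and actual prices; the helper name `log_error` is ours and is used for illustration only.

```python
import numpy as np

def log_error(actual, predicted):
    """Absolute difference between the logs of two prices (hypothetical helper)."""
    return abs(np.log(predicted) - np.log(actual))

# Same absolute error of 10,000 dollars on a cheap and an expensive house
cheap = log_error(100_000, 110_000)
expensive = log_error(1_000_000, 1_010_000)
print(f"cheap house:     {cheap:.4f}")
print(f"expensive house: {expensive:.4f}")

# Same relative error of 10% on both houses
cheap_rel = log_error(100_000, 110_000)
expensive_rel = log_error(1_000_000, 1_100_000)
print(np.isclose(cheap_rel, expensive_rel))
```

On the log scale, the 10,000 dollar miss weighs more on the cheap house, while a 10% miss weighs the same on both.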


## 1.2 Assess the Current Situation<a class="anchor" id="Assessthecurrentsituation"></a>

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset shows that much more influences price negotiations than the number of bedrooms or a white picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

### 1.2.1 Inventory of Resources <a class="anchor" id="Inventory"></a>

- Personnel: 4 people
- Data: 3 datasets and a text description
- Computing resources: our own PCs
- Software: Jupyter

### 1.2.2 Risks and Contingencies <a class="anchor" id="Risks"></a>
- Not finishing the project on time
- Ending up with unusable data


## 1.3 What are the desired outputs of the project? <a class="anchor" id="Desiredoutputs"></a>

### Success criteria
- Achieve the smallest possible margin of error

### Data mining success criteria
- Obtain a price prediction that is as close as possible to reality

### Produce a project plan
- We will use the CRISP-DM methodology to analyze the dataset and produce a statistical prediction model.

<img src="image/gantt.jpg"
     alt="Gantt"
     style="height: 180px;" />



## 1.4 What Questions Are We Trying to Answer? <a class="anchor" id="QA"></a>

- How can we predict house prices as close to reality as possible?

# 2. Data Understanding <a class="anchor" id="Dataunderstanding"></a>

In this phase, we get to know the data in our dataset in order to identify which variables may be useful and which will not. This phase matters because a good understanding of the data allows a more precise and correct analysis. For each attribute, we record the rate of missing values, its type and, where relevant, its nature (quantitative, qualitative, etc.).

## 2.1 Initial Data Report <a class="anchor" id="Datareport"></a>
Initial data collection report - 
List the data sources acquired together with their locations, the methods used to acquire them and any problems encountered. Record problems you encountered and any resolutions achieved. This will help both with future replication of this project and with the execution of similar future projects.

In [1]:
# Import Libraries Required
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
import sklearn.linear_model as linear_model
import seaborn as sns
import statsmodels.api as sm
from IPython.display import HTML, display

# graphviz is optional: the import fails when the package is not installed
try:
    import graphviz
except ModuleNotFoundError:
    graphviz = None

from sklearn import ensemble, metrics, tree
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.manifold import TSNE
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, GridSearchCV, cross_val_score, cross_validate
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor


pd.options.display.max_rows = 1000
pd.options.display.max_columns = 20

In [None]:
# Data source: 
# ss =  pd.read_csv('data/sample_submission.csv', sep=',') 
train_raw = pd.read_csv('data/train.csv', index_col = 'Id')
validation_raw = pd.read_csv('data/test.csv', index_col = 'Id')

## 2.2 Describe Data <a class="anchor" id="Describedata"></a>


### Validation_raw
Description of the validation_raw dataset (loaded from test.csv), on which the model built with the train dataset will be validated.

#### Columns

In [None]:
validation_raw.columns

#### Shape

In [None]:
validation_raw.shape

#### Variable types

In [None]:
validation_raw.dtypes

#### Description of the variables

In [None]:
validation_raw.describe()

In [None]:
validation_raw.info()

In [None]:
validation_raw.head(5)

### Train
Description of the train dataset, which is used to train our model.

#### Columns

In [None]:
train_raw.columns

#### Shape

In [None]:
train_raw.shape

#### Variable types

In [None]:
train_raw.dtypes

#### Description of the variables

In [None]:
train_raw.describe()

In [None]:
train_raw.info()

In [None]:
train_raw.head(5)

## 2.3 Verify Data Quality <a class="anchor" id="Verifydataquality"></a>

Examine the quality of the data by answering questions such as:

- Is the data complete (does it cover all the required cases)?
- Is it correct or does it contain errors and, if there are errors, how frequent are they?
- Are there missing values in the data? If so, how are they represented, where do they occur and how frequent are they?

#### The dependent variable

This variable should be analyzed first. After all, it is your reason for being in the competition. Hello, SalePrice!

In [None]:
y = train_raw['SalePrice']
train_raw = train_raw.drop(['SalePrice'], axis = 1)

We store 'SalePrice' in the variable y and remove it from train_raw.
Since it is the dependent variable, we separate it so that it is easier to reuse.

Description of the variable y

In [None]:
print(y.describe())

In [None]:
# distribution plot with normal fit
plt.title('Normal')
sns.distplot(y, fit = stats.norm)
plt.show()

The blue line represents the distribution of the price and the black line is the fitted normal distribution.
This leads us to conclude that the price is not normally distributed.
That is expected: we have not yet taken the log of the price, as explained above.

In [None]:
# distribution plot with log normal fit
plt.title('Log Normal')
sns.distplot(y, fit=stats.lognorm)
plt.show()

With a log-normal fit, the fitted curve this time follows the distribution of the price closely, which means the log of the price is close to normal.

In [None]:
# distribution plot with normal fit
y = np.log1p(y)
plt.title('Normal')
sns.distplot(y, fit = stats.norm)
plt.show()

The same can be observed after transforming the price with log1p, which computes log(1 + x) and, according to the documentation, stays accurate even for very small x.
https://numpy.org/doc/stable/reference/generated/numpy.log1p.html
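
Because the target is now on the log scale, any prediction has to be transformed back before it can be read as a price. The sketch below shows the round trip with `np.expm1`, the inverse of `np.log1p`; the three prices are illustrative values, not taken from the dataset.

```python
import numpy as np

# Illustrative prices (assumed values, not read from train.csv)
prices = np.array([34_900.0, 163_000.0, 755_000.0])

log_prices = np.log1p(prices)     # log(1 + price)
recovered = np.expm1(log_prices)  # exp(x) - 1, the inverse of log1p

print(log_prices)
print(np.allclose(recovered, prices))
```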

#### Independent variables

We now turn to the independent variables of the dataset. Below, we group the variables by type. Note that some variables can be interpreted in several ways. For example, _OverallQual_ is a priori an ordinal qualitative variable. However, since it has more than 5 levels, it can be treated as a quantitative variable. Other choices are possible.

In [None]:
qualitative = ['MSSubClass', 'MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities',
              'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
              'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
              'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure',
              'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical',
              'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
              'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageFinish', 
              'GarageCars', 'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature',
              'SaleType', 'SaleCondition']

quantitative = ['LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'MasVnrArea', 'BsmtFinSF1',  
               'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 
               'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 
               'PoolArea', 'MiscVal']

time = ['YearBuilt', 'YearRemodAdd', 'MoSold', 'YrSold', 'GarageYrBlt']

We concatenate the two datasets so that the processing can be applied to all the data in one go.

In [None]:
data_raw = pd.concat([train_raw, validation_raw])

### 2.3.1 Handling Missing Data <a class="anchor" id="MissingData"></a>

#### Quantitative variables

In [None]:
def plot_missing(data):

    missing = data.isnull().sum()
    missing = missing[missing > 0]
    
    if missing.empty:
        print('No missing data')
    else :
        missing.sort_values(inplace=True)
        missing.plot.bar()
    
        plt.show()

In [None]:
plot_missing(data_raw[quantitative])

In [None]:
def fill_missing_with_zero(data, columns):
    
    data_clean = data.copy()
    
    for c in columns :
        
        if data_clean[c].isnull().any():
            data_clean[c] = data_clean[c].fillna(0)
    
    return data_clean

#### Qualitative variables

On closer reading, the missing values are not actually missing. According to the documentation they are __NA__ values, meaning that the variable in question does not apply to that house.

In [None]:
plot_missing(data_raw[qualitative])

In [None]:
def get_dummies(data, columns):
    return pd.get_dummies(data, columns = columns)

Although there are missing values, they carry a business meaning.
For example, a null (NA) in PoolQC (pool quality)
means that the house has no pool.

It is impossible to know what proportion of the null values truly carries this business meaning and what proportion was simply not recorded.
The garage-related columns (GarageQual, GarageYrBlt, GarageType, GarageFinish, GarageCond) all have the same number of null values, which indicates a real business value rather than a mere omission. The same can be said of the basement-related variables.

Following this observation, we will not drop any column or null value.
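
Keeping the nulls works well with `pd.get_dummies`: by default (`dummy_na=False`) it creates no indicator column for missing values, so a house with NA simply gets zeros in every indicator of that variable. A small sketch on a toy column (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy column: one house has no pool (NA), two have a pool of known quality
toy = pd.DataFrame({'PoolQC': ['Ex', np.nan, 'Gd']})

dummies = pd.get_dummies(toy, columns=['PoolQC'])
print(dummies)

# The house without a pool is the row where every PoolQC_* indicator is zero
no_pool = dummies.sum(axis=1) == 0
print(no_pool.tolist())
```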

#### Temporal variables

In [None]:
plot_missing(data_raw[time])

To remain statistically sound, we replace the nulls in GarageYrBlt with the year the house was built. These two years very often coincide, as a look at the dataset shows. This also makes sense from a business point of view: a garage is rarely built afterwards; it comes with the house.

In [None]:
def fill_missing_with_column(data, missing, column) :
    
    data_clean = data.copy()
    
    data_clean[missing] = np.where(data_clean[missing].isnull(), data_clean[column], data_clean[missing])
    
    return data_clean
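
The same fill can be illustrated on a toy frame with `Series.fillna`, which accepts another Series and aligns it on the index. This is a sketch with made-up values, offered as an alternative formulation rather than as the notebook's own function.

```python
import numpy as np
import pandas as pd

# Made-up years; the second house has no recorded garage year
toy = pd.DataFrame({
    'YearBuilt':   [1995, 1960, 2004],
    'GarageYrBlt': [1995, np.nan, 2006],
})

# Where GarageYrBlt is missing, fall back to the year the house was built
toy['GarageYrBlt'] = toy['GarageYrBlt'].fillna(toy['YearBuilt'])
print(toy)
```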

In addition, we process the temporal variables by computing the age relative to the year of sale. From a business point of view this is probably more relevant, since the sale years range from 2006 to 2010.

In [None]:
def compute_differences_to_year_sold(data) :
    
    data_clean = data.copy()
    
    data_clean['YearBuilt'] = data_clean['YrSold'] - data_clean['YearBuilt']
    data_clean['YearRemodAdd'] = data_clean['YrSold'] - data_clean['YearRemodAdd']
    data_clean['GarageYrBlt'] = data_clean['YrSold'] - data_clean['GarageYrBlt']
    
    return data_clean
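
As a quick check of what this transformation produces, here is a toy example with made-up years. The column names `HouseAge` and `RemodAge` are hypothetical; the function above overwrites the original columns instead.

```python
import pandas as pd

# Made-up sale and construction years
toy = pd.DataFrame({
    'YrSold':       [2008, 2010],
    'YearBuilt':    [2003, 1976],
    'YearRemodAdd': [2003, 2005],
})

# Age at the time of sale, in years
ages = pd.DataFrame({
    'HouseAge': toy['YrSold'] - toy['YearBuilt'],
    'RemodAge': toy['YrSold'] - toy['YearRemodAdd'],
})
print(ages)
```

Using separate names avoids a column called `YearBuilt` that actually holds an age, at the cost of diverging from the column names used in the rest of the notebook.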

#### Data normalization

In [None]:
from sklearn.preprocessing import StandardScaler

def normalize_all_columns(data) :
    
    data_clean = data.to_numpy(copy = True)
    
    data_clean = StandardScaler().fit_transform(data_clean)
    data_clean = pd.DataFrame(data_clean, index = data.index, columns = data.columns)
    
    return data_clean
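
After standardization every column has mean 0 and unit variance, so each value can be read as a z-score. The sketch below checks this on a toy column with made-up living areas.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up living areas, one of them unusually large
toy = pd.DataFrame({'GrLivArea': [900.0, 1500.0, 2100.0, 5600.0]})

scaled = pd.DataFrame(StandardScaler().fit_transform(toy),
                      index=toy.index, columns=toy.columns)
print(scaled.round(2))

# StandardScaler uses the population standard deviation (ddof=0)
print(abs(scaled['GrLivArea'].mean()) < 1e-9)
print(abs(scaled['GrLivArea'].std(ddof=0) - 1) < 1e-9)
```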

#### Preprocessing

In [None]:
def preprocess(data, qualitative, quantitative):
    
    data_clean = data.copy()
    
    
    data_clean = get_dummies(data_clean, qualitative)
    data_clean = fill_missing_with_zero(data_clean, quantitative)
    data_clean = fill_missing_with_column(data_clean, missing = ['GarageYrBlt'], column = ['YearBuilt'])
    
    data_clean = compute_differences_to_year_sold(data_clean)
    
    data_clean = normalize_all_columns(data_clean)
    
    return data_clean

In [None]:
data_clean = preprocess(data_raw, qualitative, quantitative)

data_clean.head()

## 2.4 Initial Data Exploration  <a class="anchor" id="Exploredata"></a>

We will analyze the link between each variable and _SalePrice_. To do so, we need to split the cleaned dataset back into $X$ and $X_{val}$. The correlation can only be computed between $y$ and $X$, since _SalePrice_ is not available for $X_{val}$.

In [None]:
X = data_clean[data_clean.index.isin(train_raw.index)]
X_val = data_clean[data_clean.index.isin(validation_raw.index)]

We compute the correlation between each independent variable and the dependent variable. For quantitative variables, a strong correlation suggests a linear relationship. For dichotomous variables, the correlation is interpreted as a _point-biserial correlation_ (https://en.wikipedia.org/wiki/Point-biserial_correlation_coefficient).

In short, this correlation is equivalent to a two-sample t-test. It expresses the effect size of the t-test, and its square is the share of variance explained by the dichotomous variable.
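
The link between the point-biserial correlation and the two-sample t-test can be checked numerically. The sketch below uses synthetic data (a 0/1 indicator and a shifted numeric target), not the competition data, and relies on `scipy.stats`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic data: a 0/1 indicator and a numeric target shifted for group 1
group = rng.integers(0, 2, size=200)
target = rng.normal(loc=12.0, scale=0.4, size=200) + 0.3 * group

r_pearson, _ = stats.pearsonr(group, target)
r_pb, p_pb = stats.pointbiserialr(group, target)

# Two-sample t-test with pooled variance on the same two groups
t_stat, p_t = stats.ttest_ind(target[group == 1], target[group == 0])

# Pearson on a 0/1 variable is the point-biserial coefficient
print(np.isclose(r_pearson, r_pb))
# r^2 = t^2 / (t^2 + n - 2): the squared correlation follows from the t statistic
print(np.isclose(r_pb**2, t_stat**2 / (t_stat**2 + len(group) - 2)))
```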

In [None]:
def correlation(y, X, method = 'pearson'):
    
    cor = pd.DataFrame()
    features = X.columns.tolist()
   
    cor['feature'] = X.columns.tolist()
    cor['correlation_coef'] = [X[f].corr(y, method = method) for f in features]
    cor['correlation_coef'] = cor['correlation_coef'].fillna(0)
    
    cor = cor.sort_values('correlation_coef', ascending = False)

    plt.figure(figsize=(10, 0.25*len(features)))
    sns.barplot(data = cor, y = 'feature', x = 'correlation_coef', orient = 'h')
    
    return cor

In [None]:
cor = correlation(y, X)

### 2.4.1 Distributions  <a class="anchor" id="Distributions"></a>

In [None]:
def count_values_table(df):
    # df is expected to be a single column (pandas Series)
    count_val = df.value_counts()
    count_val_percent = 100 * count_val / len(df)
    count_val_table = pd.concat([count_val, count_val_percent.round(1)], axis=1)
    # Name the columns explicitly; their default names depend on the pandas version
    count_val_table.columns = ['Count Values', '% of Total Values']
    return count_val_table

In [None]:
# Histogram
def hist_chart(df, col):
        plt.style.use('fivethirtyeight')
        plt.hist(df[col].dropna(), edgecolor = 'k');
        plt.xlabel(col); plt.ylabel('Number of Entries'); 
        plt.title('Distribution of '+col);

### 2.4.2 Correlations  <a class="anchor" id="Correlations"></a>
In general, to reduce confounding, only variables that are correlated with the sale price but not with each other should be added to regression models.
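
One simple way to spot predictors that are correlated with each other is to scan the upper triangle of their correlation matrix. The sketch below does this on synthetic features; the 0.8 threshold and the generated values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300

# Synthetic features: GarageArea is built to be strongly related to GarageCars
garage_cars = rng.integers(0, 4, size=n).astype(float)
toy = pd.DataFrame({
    'GarageCars': garage_cars,
    'GarageArea': 250 * garage_cars + rng.normal(0, 40, size=n),
    'LotArea': rng.normal(10_000, 3_000, size=n),
})

corr = toy.corr().abs()

# Keep only the upper triangle so that each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()

# Pairs above the (assumed) threshold are candidates for keeping only one member
print(pairs[pairs > 0.8])
```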

In [None]:
features = cor['feature'][abs(cor['correlation_coef']) > 0.5]

In [None]:
def scatter_plots(y, X, columns) :
    
    for f in columns :
        x = X[f]
        
        plt.title('Correlation ' + y.name + ' & ' + x.name)
        sns.regplot(x = x.name, y = y.name, data = pd.concat([x, y], axis = 1), x_jitter = .05)
        
        plt.show()

In [None]:
scatter_plots(y, X, features)

## 2.5 Baseline Model

The baseline model is the simplest model we managed to build. Our goal will be to
improve it and to find other models that may perform better.


In [None]:
base = ['OverallQual', 'GrLivArea', 'TotalBsmtSF', '1stFlrSF', 'YearBuilt']

We choose these variables because they are the most intuitive ones when choosing a property.
Indeed, even in Switzerland, when estimating the price of a property we look at the overall quality (OverallQual), the total floor area and the area of the different parts of the house (GrLivArea, TotalBsmtSF, 1stFlrSF) and the year of construction (YearBuilt).

### 2.5.1 Linear Regression

In [None]:
model = sm.OLS(y, sm.add_constant(X[base]))
res = model.fit()

print(res.summary())

Before we can interpret these results, we must check the model assumptions.

In [None]:
from statsmodels.graphics.gofplots import ProbPlot
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn') # pretty matplotlib plots
plt.rc('font', size=14)
plt.rc('figure', titlesize=18)
plt.rc('axes', labelsize=15)
plt.rc('axes', titlesize=18)

def graph(formula, x_range, label=None):
    """
    Helper function for plotting cook's distance lines
    """
    x = x_range
    y = formula(x)
    plt.plot(x, y, label=label, lw=1, ls='--', color='red')


def diagnostic_plots(X, y, model_fit=None):
  """
  Function to reproduce the 4 base plots of an OLS model in R.

  ---
  Inputs:

  X: A numpy array or pandas dataframe of the features to use in building the linear regression model

  y: A numpy array or pandas series/dataframe of the target variable of the linear regression model

  model_fit [optional]: a statsmodel.api.OLS model after regressing y on X. If not provided, will be
                        generated from X, y
  """

  if not model_fit:
      model_fit = sm.OLS(y, sm.add_constant(X)).fit()

  # create dataframe from X, y for easier plot handling
  dataframe = pd.concat([X, y], axis=1)

  # model values
  model_fitted_y = model_fit.fittedvalues
  # model residuals
  model_residuals = model_fit.resid
  # normalized residuals
  model_norm_residuals = model_fit.get_influence().resid_studentized_internal
  # square root of the absolute normalized residuals
  model_norm_residuals_abs_sqrt = np.sqrt(np.abs(model_norm_residuals))
  # absolute residuals
  model_abs_resid = np.abs(model_residuals)
  # leverage, from statsmodels internals
  model_leverage = model_fit.get_influence().hat_matrix_diag
  # cook's distance, from statsmodels internals
  model_cooks = model_fit.get_influence().cooks_distance[0]

  plot_lm_1 = plt.figure()
  plot_lm_1.axes[0] = sns.residplot(x=model_fitted_y, y=dataframe.columns[-1], data=dataframe,
                            lowess=True,
                            scatter_kws={'alpha': 0.5},
                            line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})

  plot_lm_1.axes[0].set_title('Residuals vs Fitted')
  plot_lm_1.axes[0].set_xlabel('Fitted values')
  plot_lm_1.axes[0].set_ylabel('Residuals');

  # annotations
  abs_resid = model_abs_resid.sort_values(ascending=False)
  abs_resid_top_3 = abs_resid[:3]
  for i in abs_resid_top_3.index:
      plot_lm_1.axes[0].annotate(i,
                                 xy=(model_fitted_y[i],
                                     model_residuals[i]));

  QQ = ProbPlot(model_norm_residuals)
  plot_lm_2 = QQ.qqplot(line='45', alpha=0.5, color='#4C72B0', lw=1)
  plot_lm_2.axes[0].set_title('Normal Q-Q')
  plot_lm_2.axes[0].set_xlabel('Theoretical Quantiles')
  plot_lm_2.axes[0].set_ylabel('Standardized Residuals');
  # annotations
  abs_norm_resid = np.flip(np.argsort(np.abs(model_norm_residuals)), 0)
  abs_norm_resid_top_3 = abs_norm_resid[:3]
  for r, i in enumerate(abs_norm_resid_top_3):
      plot_lm_2.axes[0].annotate(i,
                                 xy=(np.flip(QQ.theoretical_quantiles, 0)[r],
                                     model_norm_residuals[i]));

  plot_lm_3 = plt.figure()
  plt.scatter(model_fitted_y, model_norm_residuals_abs_sqrt, alpha=0.5);
  sns.regplot(x=model_fitted_y, y=model_norm_residuals_abs_sqrt,
              scatter=False,
              ci=False,
              lowess=True,
              line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8});
  plot_lm_3.axes[0].set_title('Scale-Location')
  plot_lm_3.axes[0].set_xlabel('Fitted values')
  plot_lm_3.axes[0].set_ylabel(r'$\sqrt{|Standardized Residuals|}$');

  # annotations
  abs_sq_norm_resid = np.flip(np.argsort(model_norm_residuals_abs_sqrt), 0)
  abs_sq_norm_resid_top_3 = abs_sq_norm_resid[:3]
  for i in abs_sq_norm_resid_top_3:
      # i is a position, not an index label, so index the fitted values as an array
      plot_lm_3.axes[0].annotate(i,
                                 xy=(np.asarray(model_fitted_y)[i],
                                     model_norm_residuals_abs_sqrt[i]));


  plot_lm_4 = plt.figure();
  plt.scatter(model_leverage, model_norm_residuals, alpha=0.5);
  sns.regplot(x=model_leverage, y=model_norm_residuals,
              scatter=False,
              ci=False,
              lowess=True,
              line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8});
  plot_lm_4.axes[0].set_xlim(0, max(model_leverage)+0.01)
  plot_lm_4.axes[0].set_ylim(-3, 5)
  plot_lm_4.axes[0].set_title('Residuals vs Leverage')
  plot_lm_4.axes[0].set_xlabel('Leverage')
  plot_lm_4.axes[0].set_ylabel('Standardized Residuals');

  # annotations
  leverage_top_3 = np.flip(np.argsort(model_cooks), 0)[:3]
  for i in leverage_top_3:
      plot_lm_4.axes[0].annotate(i,
                                 xy=(model_leverage[i],
                                     model_norm_residuals[i]));

  p = len(model_fit.params) # number of model parameters
  graph(lambda x: np.sqrt((0.5 * p * (1 - x)) / x),
        np.linspace(0.001, max(model_leverage), 50),
        'Cook\'s distance') # 0.5 line
  graph(lambda x: np.sqrt((1 * p * (1 - x)) / x),
        np.linspace(0.001, max(model_leverage), 50)) # 1 line
  plot_lm_4.legend(loc='upper right');

In [None]:
diagnostic_plots(X[base], y)

We use these plots to check the assumptions. They show that some observations are too extreme, so the results cannot be interpreted as they stand.

### 2.5.2 Outliers and Extreme Values

In [None]:
print(X[base].iloc[[523, 1298]])

We remove observations 523 and 1298 because their values are too extreme. (The exact terminology remains to be checked.)

In [None]:
def drop_outliers(data, y, features) :
    
    data_clean = data.copy()
    data_clean[y.name] = y
    
    for f in features :
        condition = (data_clean[f] < -3) | (data_clean[f] > 3)
        ix = data_clean[condition].index
        data_clean = data_clean.drop(index = ix)
    
    return data_clean[data.columns], data_clean[y.name]

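The `drop_outliers` function removes every observation for which at least one of the given features lies more than 3 standard deviations from the mean (the data has already been standardized), and returns the filtered features together with the matching target values.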
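
The rule applied by `drop_outliers` can be tried on a toy example. For a normal distribution about 99.7% of values lie within 3 standard deviations of the mean, which is why z-scores beyond ±3 are treated as extreme. The values below are made up and assumed to be already standardized.

```python
import pandas as pd

# Toy standardized feature (z-scores) and a target on the log scale
X_toy = pd.DataFrame({'GrLivArea': [-0.5, 0.2, 4.1, 1.0]}, index=[1, 2, 3, 4])
y_toy = pd.Series([12.0, 12.2, 12.1, 12.5], index=X_toy.index, name='SalePrice')

# Keep the rows whose z-score lies within [-3, 3]
mask = X_toy['GrLivArea'].abs() <= 3
X_kept, y_kept = X_toy[mask], y_toy[mask]

print(X_kept.index.tolist())
print(y_kept.index.tolist())
```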

In [None]:
X, y = drop_outliers(X, y, features = base)

X[base].info()

### 2.5.3 Corrected Linear Regression

We rerun the linear regression to see the effect of removing the outliers.

In [None]:
model = sm.OLS(y, sm.add_constant(X[base]))
res = model.fit()

print(res.summary())

Since the assumptions are satisfied (see the next section), we can interpret the results.

All p-values are below 0.05, so for each coefficient we can reject, at the 5% level, the null hypothesis that it is equal to zero.

To do: discuss the R-squared.

In [None]:
diagnostic_plots(sm.add_constant(X[base]), y)

The diagnostic plots look much better after removing the outliers.



## 2.6 Features Model

We run a linear regression again, but with different predictors: this time we take every variable whose correlation with the sale price is above 0.5 in absolute value.

### 2.6.1 Linear Regression

In [None]:
model = sm.OLS(y, sm.add_constant(X[features]))
res = model.fit()

print(res.summary())

In [None]:
diagnostic_plots(X[features], y)

### 2.6.2 Outliers

In [None]:
X, y = drop_outliers(X, y, features = features)

X[features].info()

### 2.6.3 Corrected Linear Regression

In [None]:

model = sm.OLS(y, sm.add_constant(X[features]))
res = model.fit()

print(res.summary())

The R-squared has improved compared with the baseline model. However, many variables have a p-value above 0.05, so we should not keep them.

In [None]:
features =['OverallQual', 'GrLivArea', 'TotalBsmtSF', '1stFlrSF', 'YearBuilt','GarageArea','Foundation_PConc','GarageCars_3.0','ExterQual_Gd','FullBath_2','Fireplaces_0','KitchenQual_TA','YearRemodAdd','GarageYrBlt','FullBath_1','ExterQual_TA']

features =['OverallQual', 'GrLivArea', 'TotalBsmtSF', 'YearBuilt','GarageArea','Fireplaces_0','YearRemodAdd','GarageYrBlt']




In [None]:
model = sm.OLS(y, sm.add_constant(X[features]))
res = model.fit()

print(res.summary())

In [None]:
diagnostic_plots(sm.add_constant(X[features]), y)

# 3. Data Preparation <a class="anchor" id="Datapreparation"></a>
This is the stage of the project where you decide on the data that you're going to use for analysis. The criteria you might use to make this decision include the relevance of the data to your data mining goals, the quality of the data, and also technical constraints such as limits on data volume or data types. Note that data selection covers selection of attributes (columns) as well as selection of records (rows) in a table.

## 3.1 Select Your Data <a class="anchor" id="Selectyourdata"></a>

<span style="color:red">We create the holdout sets to train our model</span>

### Creating the Holdout Sets

In [None]:
# Hold-out split of X and y
from sklearn.model_selection import train_test_split

#y=np.expm1(y)

X_train, X_test, y_train, y_test = train_test_split(X[features], y, test_size=0.2) 
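
Without a fixed seed, the split above changes on every run, and so do the scores computed later. A minimal sketch on toy arrays shows how passing `random_state` makes the partition reproducible; the value 42 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 observations with 2 features
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Same seed, same partition
print(all(np.array_equal(a, b) for a, b in zip(split_a, split_b)))
# 80% of the rows go to the training set, 20% to the test set
print(split_a[0].shape, split_a[1].shape)
```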

In [None]:
X_train.shape

In [None]:
y_train.shape

In [None]:
X_test.shape

In [None]:
y_test.shape

## 3.2 Cleanse the Data <a class="anchor" id="Cleansethedata"></a>

## 3.3 Construct Required Data   <a class="anchor" id="Constructrequireddata"></a>
This task includes constructive data preparation operations such as the production of derived attributes or entire new records, or transformed values for existing attributes.

**Derived attributes** - These are new attributes that are constructed from one or more existing attributes in the same record, for example you might use the variables of length and width to calculate a new variable of area.

**Generated records** - Here you describe the creation of any completely new records. For example you might need to create records for customers who made no purchase during the past year. There was no reason to have such records in the raw data, but for modelling purposes it might make sense to explicitly represent the fact that particular customers made zero purchases.


## 3.4 Integrate Data  <a class="anchor" id="Integratedata"></a>
These are methods whereby information is combined from multiple databases, tables or records to create new records or values.

**Merged data** - Merging tables refers to joining together two or more tables that have different information about the same objects. For example a retail chain might have one table with information about each store’s general characteristics (e.g., floor space, type of mall), another table with summarised sales data (e.g., profit, percent change in sales from previous year), and another with information about the demographics of the surrounding area. Each of these tables contains one record for each store. These tables can be merged together into a new table with one record for each store, combining fields from the source tables.

**Aggregations** - Aggregations refers to operations in which new values are computed by summarising information from multiple records and/or tables. For example, converting a table of customer purchases where there is one record for each purchase into a new table where there is one record for each customer, with fields such as number of purchases, average purchase amount, percent of orders charged to credit card, percent of items under promotion etc.


### Construct Our Primary Data Set
Join data 

# 4. Exploratory Data Analysis <a class="anchor" id="EDA"></a>

Now that the dataset has been prepared, you'll need to analyze it and summarize its main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.

# 5. Modelling <a class="anchor" id="Modelling"></a>
As the first step in modelling, you'll select the actual modelling technique that you'll be using. Although you may have already selected a tool during the business understanding phase, at this stage you'll be selecting the specific modelling technique e.g. decision-tree building with C5.0, or neural network generation with back propagation. If multiple techniques are applied, perform this task separately for each technique.

In [None]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
def encode(data):
    # Label-encode every column of `data` and return the encoded copy
    data_encode = data.copy()
    for col in data:
        encoder = LabelEncoder()
        # astype(str) turns missing values into the string 'nan', which gets its own label
        data_encode[col] = encoder.fit_transform(data_encode[col].astype(str))
    return data_encode

In [None]:
def preprocess_advanced(data, qualitative, quantitative):
    
    data_clean = data.copy()
    
    # TODO: add label encoding
    data_clean = get_dummies(data_clean, qualitative)
    data_clean = fill_missing_with_zero(data_clean, quantitative)
    data_clean = fill_missing_with_column(data_clean, missing = ['GarageYrBlt'], column = ['YearBuilt'])
    
    data_clean = compute_differences_to_year_sold(data_clean)
    
    data_clean = normalize_all_columns(data_clean)
    
    return data_clean

In [None]:
data_encode = encode(train_raw[qualitative])

data_encode.head()

In [None]:
#data_clean =preprocess_advanced(data_row)

#### Defining the helper function for our models

In [None]:
def Predictive_Model(estimator):
    """Fit `estimator` on the training split, print R² and RMSE on the test split,
    and overlay histograms of the actual and predicted prices."""
    estimator.fit(X_train, y_train)
    prediction = estimator.predict(X_test)
    print('R_squared:', metrics.r2_score(y_test, prediction))
    print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, prediction)))
    plt.figure(figsize=(10, 5))
    # sns.distplot is deprecated since seaborn 0.11; histplot is its replacement
    sns.histplot(y_test, kde=False, label='Actual Values of Price')
    sns.histplot(prediction, kde=False, color='orange', label='Predicted Values of Price')
    plt.legend()
    plt.xlim(left=0)

### Our prediction models

In [None]:
# Linear Regressor
lr = LinearRegression()

# K_Neighbor Regressor
knn = KNeighborsRegressor(n_neighbors=5)

# Decision Tree Regressor
dt = DecisionTreeRegressor(max_depth=15, random_state=0)

# Gradient Boost Regressor
gbr = GradientBoostingRegressor(n_estimators=6000,
                                learning_rate=0.001,
                                max_depth=7,
                                max_features='sqrt',
                                min_samples_leaf=15,
                                min_samples_split=10,
                                loss='huber',
                                random_state=42)  

# Random Forest Regressor
rf = RandomForestRegressor(n_estimators=1200,
                          max_depth=15,
                          min_samples_split=5,
                          min_samples_leaf=5,
                          max_features=None,
                          oob_score=True,
                          random_state=42)

#### Linear regression

In [None]:
Predictive_Model(lr)

#### K-Nearest Neighbors (KNN)

In [None]:
Predictive_Model(knn)

#### Decision tree

In [None]:
Predictive_Model(dt)

#### Random Forest

In [None]:
Predictive_Model(rf)

#### Gradient Boosting

In [None]:
Predictive_Model(gbr)

#### Performance summary

In [None]:
regressor = ['Linear Regression', 'KNN', 'Decision Tree', 'RandomForest', 'GradientBoosting']
models = [lr, knn, dt, rf, gbr ]
R_squared = []
RMSE = []
for m in models:
    m.fit(X_train, y_train)
    prediction_m = m.predict(X_test)
    r2 = metrics.r2_score(y_test, prediction_m)
    rmse = np.sqrt(metrics.mean_squared_error(y_test, prediction_m))
    R_squared.append(r2)
    RMSE.append(rmse)
basic_result = pd.DataFrame({'R squared':R_squared,'RMSE':RMSE}, index=regressor)
basic_result

### Cross-validation and grid search

In [None]:
X[features].shape

In [None]:
y.shape

In [None]:
# Metrics shared by CrossVal and GridSearch (scikit-learn negates MSE so that higher is better)
scoring = {'R_squared': 'r2', 'MSE': 'neg_mean_squared_error'}
kf = KFold(n_splits=12, random_state=42, shuffle=True)

# Define error metrics
def rmsle(y, y_pred):
    # Plain RMSE; it equals the RMSLE on prices only if y is already log-transformed
    return np.sqrt(metrics.mean_squared_error(y, y_pred))

def cv_rmse(model):
    # Per-fold RMSE over the 12 shuffled folds
    rmse = np.sqrt(-cross_val_score(model, X[features], y, scoring='neg_mean_squared_error', cv=kf))
    return rmse

def CrossVal(estimator):
    scores = cross_validate(estimator, X[features], y, cv=kf, scoring=scoring)
    r2 = scores['test_R_squared'].mean()
    mse = abs(scores['test_MSE'].mean())
    print('R_squared:', r2)
    print('RMSE:', np.sqrt(mse))

def GridSearch(estimator, Features, Target, param_grid):
    # Run one grid search per metric and report the best setting for each
    for key, value in scoring.items():
        grid = GridSearchCV(estimator, param_grid, cv=10, scoring=value)
        grid.fit(Features, Target)
        print(key)
        print('The Best Parameter:', grid.best_params_)
        if value.startswith('neg_'):
            # Negated-error scorer: flip the sign and take the root to report an RMSE
            print('The Score:', np.sqrt(-grid.best_score_))
        else:
            # R² can be negative, so the sign of the score must not decide the branch
            print('The Score:', grid.best_score_)
        print()
        

#### Linear regression

In [None]:
CrossVal(LinearRegression())

#### K-Nearest Neighbors (KNN)

In [None]:
param_grid = dict(n_neighbors=np.arange(5,26))
GridSearch(KNeighborsRegressor(), X[features], y, param_grid)

In [None]:
from sklearn.model_selection import validation_curve

def ValidationCurve(estimator, Features, Target, param_name, Name_of_HyperParameter, param_range):
    # param_name and param_range must be passed by keyword in recent scikit-learn releases
    train_score, test_score = validation_curve(estimator, Features, Target,
                                               param_name=param_name, param_range=param_range,
                                               cv=10, scoring='r2')
    # Average R² over the 10 folds for each hyperparameter value
    Rsquared_train = train_score.mean(axis=1)
    Rsquared_test = test_score.mean(axis=1)

    plt.figure(figsize=(10, 5))
    plt.plot(param_range, Rsquared_train, color='r', linestyle='-', marker='o', label='Training Set')
    plt.plot(param_range, Rsquared_test, color='b', linestyle='-', marker='x', label='Testing Set')
    plt.legend()
    plt.xlabel(Name_of_HyperParameter)
    plt.ylabel('R_squared')

ValidationCurve(KNeighborsRegressor(), X[features], y, 'n_neighbors', 'K-Neighbors', np.arange(5, 26))

#### Decision tree

In [None]:
param_grid=dict(max_depth=np.arange(2,15))
GridSearch(DecisionTreeRegressor(), X[features], y, param_grid)

In [None]:
ValidationCurve(DecisionTreeRegressor(), X[features], y, 'max_depth', 'Maximum Depth', np.arange(4,15))

#### Random Forest (two cross-validation methods)

In [None]:
CrossVal(rf)

In [None]:
score = cv_rmse(rf)
print("rf: {:.4f} ({:.4f})".format(score.mean(), score.std()))

#### Gradient Boosting (two cross-validation methods)

In [None]:
CrossVal(gbr)

In [None]:
score = cv_rmse(gbr)
print("gbr: {:.4f} ({:.4f})".format(score.mean(), score.std()))

#### Cross-validation summary

In [None]:
lr_scores = cross_validate(LinearRegression(), X[features], y, cv=10, scoring='r2')
knn_scores = cross_validate(KNeighborsRegressor(n_neighbors=16), X[features], y, cv=10, scoring='r2')
dt_scores = cross_validate(DecisionTreeRegressor(max_depth=9, random_state=0), X[features], y, cv=10, scoring='r2')
rf_scores = cross_validate(rf, X[features], y, cv=10, scoring='r2')
gbr_scores = cross_validate(gbr, X[features], y, cv=10, scoring='r2')
lr_test_score = lr_scores.get('test_score')
knn_test_score = knn_scores.get('test_score')
dt_test_score = dt_scores.get('test_score')
rf_test_score = rf_scores.get('test_score')
gbr_test_score = gbr_scores.get('test_score')
box= pd.DataFrame({'Linear Regression':lr_test_score, 'K-Nearest Neighbors':knn_test_score, 'Decision Tree':dt_test_score, 'Random Forest':rf_test_score, 'Gradient Boosting':gbr_test_score})
box.index = box.index + 1
box.loc['Mean'] = box.mean()
box

In [None]:
f,ax=plt.subplots(1,2, figsize=(20,10))
sns.boxplot(data=box.drop(box.tail(1).index), width=0.3, palette="Set2", ax=ax[0])
ax[0].set_ylabel('R squared')
sns.lineplot(data=box.drop(box.tail(1).index), palette="Set2", ax=ax[1])
ax[1].set_xticks(np.arange(1,11,1))
ax[1].set_xlabel('K-th Fold')

## 5.1 Modelling Technique <a class="anchor" id="ModellingTechnique"></a>
Document the actual modelling technique that is to be used.

Import Models below:

In [None]:
#model = LinearRegression()
#model.fit(X_train, y_train)
#y_preds = model.predict(X_test)

In [None]:
print('RandomForest')
rf_model_full_data = rf.fit(X[features], y)

## 5.2 Modelling assumptions <a class="anchor" id="ModellingAssumptions"></a>


## 5.3 Build Model <a class="anchor" id="BuildModel"></a>
Run the modelling tool on the prepared dataset to create one or more models.

**Parameter settings** - With any modelling tool there are often a large number of parameters that can be adjusted. List the parameters and their chosen values, along with the rationale for the choice of parameter settings.

**Models** - These are the actual models produced by the modelling tool, not a report on the models.

**Model descriptions** - Describe the resulting models, report on the interpretation of the models and document any difficulties encountered with their meanings.
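
One way to document parameter settings is to read them back from the estimators with `get_params()`, which every scikit-learn estimator provides. The sketch below rebuilds two of the notebook's models with the values defined earlier; the `chosen` list of parameters to report is an assumption made for illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Same settings as the dt and rf estimators defined earlier in the notebook
models = {
    "Decision Tree": DecisionTreeRegressor(max_depth=15, random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=1200, max_depth=15,
                                           random_state=42),
}

# Parameters worth reporting (illustrative selection)
chosen = ["n_estimators", "max_depth", "random_state"]

# One row per model; parameters a model does not have are left empty
settings = pd.DataFrame({
    name: {param: model.get_params().get(param) for param in chosen}
    for name, model in models.items()
}).T

print(settings)
```

The rationale for each value still has to be written by hand next to the table.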

## 5.4 Assess Model <a class="anchor" id="AssessModel"></a>
Interpret the models according to your domain knowledge, your data mining success criteria and your desired test design. Judge the success of the application of modelling and discovery techniques technically, then contact business analysts and domain experts later in order to discuss the data mining results in the business context. This task only considers models, whereas the evaluation phase also takes into account all other results that were produced in the course of the project.

At this stage you should rank the models and assess them according to the evaluation criteria. You should take the business objectives and business success criteria into account as far as you can here. In most data mining projects a single technique is applied more than once and data mining results are generated with several different techniques. 

**Model assessment** - Summarise the results of this task, list the qualities of your generated models (e.g. in terms of accuracy) and rank their quality in relation to each other.

**Revised parameter settings** - According to the model assessment, revise parameter settings and tune them for the next modelling run. Iterate model building and assessment until you strongly believe that you have found the best model(s). Document all such revisions and assessments.
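
Ranking can be made explicit by sorting the candidates on one cross-validated metric. The sketch below is self-contained and uses synthetic data from `make_regression` as a stand-in for the prepared housing features. The two candidates and the 5-fold split are illustrative choices, not the project's final setup.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data standing in for the prepared features
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0,
                                 random_state=0)

candidates = {
    "Linear Regression": LinearRegression(),
    "KNN": KNeighborsRegressor(n_neighbors=5),
}
cv = KFold(n_splits=5, shuffle=True, random_state=42)

rows = {}
for name, model in candidates.items():
    # scikit-learn negates error metrics so that larger is always better
    scores = -cross_val_score(model, X_demo, y_demo, cv=cv,
                              scoring="neg_root_mean_squared_error")
    rows[name] = {"rmse_mean": scores.mean(), "rmse_std": scores.std()}

# Lowest mean RMSE first
ranking = pd.DataFrame(rows).T.sort_values("rmse_mean")
print(ranking)
```

Reporting the standard deviation across folds next to the mean helps judge whether the gap between two models is meaningful.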

# 6. Evaluation <a class="anchor" id="Evaluation"></a>

Before proceeding to final deployment of the model built by the data analyst, it is important to more thoroughly evaluate the model and review the model’s construction to be certain it properly achieves the business objectives. Here it is critical to determine if some important business issue has not been sufficiently considered. At the end of this phase, the project leader then should decide exactly how to use the data mining results. The key steps here are the evaluation of results, the process review, and the determination of next steps.

## 6.1 Evaluate Results <a class="anchor" id="EvaluateResults"></a>

Previous evaluation steps dealt with factors such as the accuracy and generality of the model. This step assesses the degree to which the model meets the business objectives and determines if there is some business reason why this model is deficient. Another option here is to test the model(s) on real-world applications—if time and budget constraints permit. Moreover, evaluation also seeks to unveil additional challenges, information, or hints for future directions.

At this stage, the data analyst summarizes the assessment results in terms of business success criteria, including a final statement about whether the project already meets the initial business objectives.

### <span style="color:red">We use gradient boosting because it is the most accurate model</span>

### Fitting the model with gradient boosting

In [None]:
print('GradientBoosting')
gbr_model_full_data = gbr.fit(X[features], y)

### Predicting y

In [None]:
y_preds = gbr_model_full_data.predict(X_test)

### Evaluating the RMSE

In [None]:
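# Caution: gbr was refit on X[features] above; if X_test is a subset of X,
# this RMSE is measured on data the model has already seen and will be optimistic
print("Root mean squared error:", np.sqrt(metrics.mean_squared_error(y_test, y_preds)))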

### Generating the predicted values

In [None]:
predictions = gbr_model_full_data.predict(X_val[features])

output = pd.DataFrame({'Id': X_val.index, 'SalePrice': np.expm1(predictions)})
output.to_csv('Data/my_submission.csv', index=False)
print("Your submission was successfully saved!")
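
The submission cell applies `np.expm1` to the predictions. `np.expm1` is the inverse of `np.log1p`, which suggests the target was log-transformed before training; that transformation is an assumption here, as it is not shown in this section. The sketch below uses made-up prices to show the round trip and why an RMSE computed on the log scale corresponds to the RMSLE on the price scale.

```python
import numpy as np

# Made-up sale prices and predictions, for illustration only
prices = np.array([140000.0, 208500.0, 755000.0])
predicted = np.array([150000.0, 200000.0, 700000.0])

# Forward transform assumed to have been applied to the target before training
log_prices = np.log1p(prices)
log_predicted = np.log1p(predicted)

# RMSE on the log scale is the RMSLE on the price scale
rmsle_value = np.sqrt(np.mean((log_predicted - log_prices) ** 2))

# expm1 undoes log1p, recovering prices in the original units
recovered = np.expm1(log_prices)

print(rmsle_value)
print(recovered)
```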

## 6.2 Review Process <a class="anchor" id="ReviewProcess"></a>

It is now appropriate to do a more thorough review of the data mining engagement to determine if there is any important factor or task that has somehow been overlooked. This review also covers quality assurance issues (e.g., did we correctly build the model? Did we only use allowable attributes that are available for future deployment?).
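
The question about allowable attributes can be partly automated by comparing the columns used for training with those available at scoring time. The helper `check_feature_alignment` below is hypothetical and the column lists are examples only.

```python
def check_feature_alignment(train_columns, scoring_columns):
    """Compare the features used for training with those available at scoring time."""
    train_set, scoring_set = set(train_columns), set(scoring_columns)
    return {
        # Features the model expects but the scoring data does not provide
        "missing_at_scoring": sorted(train_set - scoring_set),
        # Columns present at scoring time that the model never saw
        "unused_at_scoring": sorted(scoring_set - train_set),
    }


# Example column lists (illustrative)
report = check_feature_alignment(
    train_columns=["YearBuilt", "GarageYrBlt", "SalePrice"],
    scoring_columns=["YearBuilt", "GarageYrBlt", "LotArea"],
)
print(report)
```

A non-empty `missing_at_scoring` list points to a feature that cannot be used in deployment, for example the target itself leaking into the feature set.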

## 6.3 Determine Next Steps <a class="anchor" id="NextSteps"></a>

At this stage, the project leader must decide whether to finish this project and move on to deployment or whether to initiate further iterations or set up new data mining projects.

# 7. Deployment  <a class="anchor" id="Deployment"></a>

Model creation is generally not the end of the project. The knowledge gained must be organized and presented in a way that the customer can use it, which often involves applying “live” models within an organization’s decision-making processes, such as the real-time personalization of Web pages or repeated scoring of marketing databases.

Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise. Even though it is often the customer, not the data analyst, who carries out the deployment steps, it is important for the customer to understand up front what actions must be taken in order to actually make use of the created models. The key steps here are plan deployment, plan monitoring and maintenance, the production of the final report, and review of the project.

## 7.1 Plan Deployment <a class="anchor" id="PlanDeployment"></a>

In order to deploy the data mining result(s) into the business, this task takes the evaluation results and develops a strategy for deployment.

## 7.2 Plan Monitoring and Maintenance <a class="anchor" id="Monitoring"></a>
Monitoring and maintenance are important issues if the data mining result is to become part of the day-to-day business and its environment. A carefully prepared maintenance strategy avoids incorrect usage of data mining results.
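
One simple monitoring check is to compare new data with the training data feature by feature. The sketch below flags a feature when its mean has moved by more than a chosen number of training standard deviations. The function `drift_report`, the threshold of 0.5 and the toy tables are all assumptions for illustration; a production setup would likely use a more robust statistical test.

```python
import pandas as pd


def drift_report(train, new, threshold=0.5):
    """Flag numeric features whose mean moved by more than `threshold` training std."""
    shift = (new.mean() - train.mean()).abs() / train.std()
    return pd.DataFrame({"shift_in_std": shift, "drifted": shift > threshold})


# Toy data: feature "a" is unchanged, feature "b" has shifted upwards
train_demo = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [10, 20, 30, 40, 50]})
new_demo = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [60, 70, 80, 90, 100]})

report = drift_report(train_demo, new_demo)
print(report)
```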

## 7.3 Produce Final Report <a class="anchor" id="FinalReport"></a>
At the end of the project, the project leader and his or her team write up a final report. Depending on the deployment plan, this report may be only a summary of the project and its experiences (if they have not already been documented as an ongoing activity) or it may be a final and comprehensive presentation of the data mining result(s). This report includes all of the previous deliverables and summarizes and organizes the results. Also, there often will be a meeting at the conclusion of the project, where the results are verbally presented to the customer.

## 7.4 Review Project <a class="anchor" id="ReviewProject"></a>
The data analyst should assess failures and successes as well as potential areas of improvement for use in future projects. This step should include a summary of important experiences during the project and can include interviews with the significant project participants. This document could include pitfalls, misleading approaches, or hints for selecting the best-suited data mining techniques in similar situations. In ideal projects, experience documentation also covers any reports written by individual project members during the project phases and tasks.