<a href="https://colab.research.google.com/github/arun-arunisto/Machine_Learning_Tutorial/blob/todo/MachineLearningTutorial3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##In this tutorial we are going to learn about:
1. Introduction
2. Missing Values
3. Categorical Variables
4. Pipelines
5. Cross-validation
6. XGBoost
7. Data Leakage

#Missing Values
Most machine learning libraries raise an error if you try to build a model using data with missing values.
To solve this issue we have 3 approaches:

###1. Drop columns with missing values
This is the simplest solution, but it is rarely the best one: if a column holds important information and is missing even a single entry, this approach deletes the entire column, which can hurt the model.

###2. Imputation
Imputation is usually a better option for handling missing values: it fills each missing entry with some number, for instance the mean value of the column.
The imputed value won't be exactly right in most cases, but it keeps the rest of the column available to the model.

###3. An Extension to Imputation
In this approach we impute the missing values as before and, additionally, add a new column for each column that had missing entries. The new column records which entries were imputed.
In some cases this improves the results; in other cases it doesn't help at all.
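
As a minimal sketch of approach 3, the snippet below imputes a single column by hand and builds the matching indicator column. It uses plain Python lists rather than pandas, and the column values are made up for illustration; it is not how scikit-learn implements imputation.

```python
# Pure-Python sketch of "imputation plus indicator" on one hypothetical column.
# None marks a missing entry; the numbers are made up for illustration.
building_area = [120.0, None, 80.0, None, 100.0]

# Mean of the observed entries only.
observed = [v for v in building_area if v is not None]
column_mean = sum(observed) / len(observed)

# Indicator column: True where the original entry was missing.
was_missing = [v is None for v in building_area]

# Imputed column: missing entries are replaced by the mean.
imputed = [column_mean if v is None else v for v in building_area]

print(imputed)
print(was_missing)
```

The indicator column lets a model treat "this value was filled in" as a signal of its own, which is the only difference from plain imputation.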

##Example
We will work with the Melbourne housing dataset. Our model will use information such as the number of rooms and the land size to predict the house price.

In [1]:
#importing libraries
import pandas as pd
from sklearn.model_selection import train_test_split

#loading data
data = pd.read_csv('/content/drive/MyDrive/Datascience&MachineLearning/datasets/melb_data.csv')

#we want to predict the price, so we set Price as the target
y = data.Price

#to keep things simple, we'll use only numerical predictors
melb_predictors = data.drop(['Price'], axis=1)
X = melb_predictors.select_dtypes(exclude=['object'])

#dividing data into training and validation subsets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)


Defining a function to measure the quality of each approach
- we are going to declare a function called "score_dataset()"
- it lets us compare the different approaches to dealing with missing values
- the function returns the mean absolute error (MAE) of a random forest model
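
To make the metric concrete, here is a small hand-written version of the mean absolute error. The function name and the price values are hypothetical; the notebook itself uses `mean_absolute_error` from scikit-learn.

```python
def mean_absolute_error_sketch(y_true, y_pred):
    """Average of |actual - predicted| over all rows."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up prices, for illustration only.
actual = [500_000, 750_000, 1_000_000]
predicted = [450_000, 800_000, 1_100_000]

print(mean_absolute_error_sketch(actual, predicted))
```

A lower MAE means the predictions are, on average, closer to the true prices, which is why the approaches below are ranked by it.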

In [2]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

#function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
  model = RandomForestRegressor(n_estimators=10, random_state=0)
  model.fit(X_train, y_train)
  preds = model.predict(X_valid)
  return mean_absolute_error(y_valid, preds)

##1st Approach (Drop columns with missing values)

In [3]:
#getting names of columns with missing values
cols_with_missing = [col for col in X_train.columns
                     if X_train[col].isnull().any()]
cols_with_missing

['Car', 'BuildingArea', 'YearBuilt']

In [4]:
#dropping columns in training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)

In [5]:
#MAE approach 1 -> dropping columns with missing values
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))

183550.22137772635


##Approach 2: Imputation
We use the "SimpleImputer" class to replace missing values with the mean value of each column.

In [6]:
#importing SimpleImputer
from sklearn.impute import SimpleImputer

#imputation
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_train

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,1.0,5.0,3182.0,1.0,1.0,1.0,0.0,153.764119,1940.000000,-37.85984,144.98670,13240.0
1,2.0,8.0,3016.0,2.0,2.0,1.0,193.0,153.764119,1964.839866,-37.85800,144.90050,6380.0
2,3.0,12.6,3020.0,3.0,1.0,1.0,555.0,153.764119,1964.839866,-37.79880,144.82200,3755.0
3,3.0,13.0,3046.0,3.0,1.0,1.0,265.0,153.764119,1995.000000,-37.70830,144.91580,8870.0
4,3.0,13.3,3020.0,3.0,1.0,2.0,673.0,673.000000,1970.000000,-37.76230,144.82720,4217.0
...,...,...,...,...,...,...,...,...,...,...,...,...
10859,3.0,5.2,3056.0,3.0,1.0,2.0,212.0,153.764119,1964.839866,-37.77695,144.95785,11918.0
10860,3.0,10.5,3081.0,3.0,1.0,1.0,748.0,101.000000,1950.000000,-37.74160,145.04810,2947.0
10861,4.0,6.7,3058.0,4.0,2.0,2.0,441.0,255.000000,2002.000000,-37.73572,144.97256,11204.0
10862,3.0,12.0,3073.0,3.0,1.0,1.0,606.0,153.764119,1964.839866,-37.72057,145.02615,21650.0


In [7]:
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))
imputed_X_valid

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,4.0,8.0,3016.0,4.0,2.0,2.0,450.0,190.000000,1910.000000,-37.86100,144.89850,6380.0
1,2.0,6.6,3011.0,2.0,1.0,0.0,172.0,81.000000,1900.000000,-37.81000,144.88960,2417.0
2,3.0,10.5,3020.0,3.0,1.0,1.0,581.0,153.764119,1964.839866,-37.76740,144.82421,4217.0
3,3.0,4.5,3181.0,2.0,2.0,1.0,128.0,134.000000,2000.000000,-37.85260,145.00710,7717.0
4,3.0,8.5,3044.0,3.0,2.0,2.0,480.0,153.764119,1964.839866,-37.72523,144.94567,7485.0
...,...,...,...,...,...,...,...,...,...,...,...,...
2711,2.0,6.4,3011.0,2.0,1.0,1.0,47.0,35.000000,2013.000000,-37.80140,144.89590,7570.0
2712,4.0,8.0,3016.0,4.0,2.0,4.0,551.0,153.764119,1964.839866,-37.85790,144.87860,6380.0
2713,3.0,10.8,3105.0,3.0,1.0,1.0,757.0,153.764119,1964.839866,-37.78094,145.10131,4480.0
2714,4.0,6.2,3039.0,4.0,1.0,3.0,478.0,152.000000,1925.000000,-37.76421,144.90571,6232.0


In [8]:
#imputation removed the column names, so put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns

In [9]:
#MAE approach 2 -> Imputation
print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))

178166.46269899711


Approach 2 has a lower MAE than Approach 1, which means it performed better on this dataset.

##Approach 3: (An Extension to imputation)

In [10]:
#making copy to avoid changing original data
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()

In [11]:
#making new columns indicating what will be imputed
for col in cols_with_missing:
  X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
  X_valid_plus[col + '_was_missing'] = X_valid_plus[col].isnull()

In [12]:
#we create the imputer again, for clarity
#imputation
my_imputer = SimpleImputer()
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_train_plus

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,1.0,5.0,3182.0,1.0,1.0,1.0,0.0,153.764119,1940.000000,-37.85984,144.98670,13240.0,0.0,1.0,0.0
1,2.0,8.0,3016.0,2.0,2.0,1.0,193.0,153.764119,1964.839866,-37.85800,144.90050,6380.0,0.0,1.0,1.0
2,3.0,12.6,3020.0,3.0,1.0,1.0,555.0,153.764119,1964.839866,-37.79880,144.82200,3755.0,0.0,1.0,1.0
3,3.0,13.0,3046.0,3.0,1.0,1.0,265.0,153.764119,1995.000000,-37.70830,144.91580,8870.0,0.0,1.0,0.0
4,3.0,13.3,3020.0,3.0,1.0,2.0,673.0,673.000000,1970.000000,-37.76230,144.82720,4217.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10859,3.0,5.2,3056.0,3.0,1.0,2.0,212.0,153.764119,1964.839866,-37.77695,144.95785,11918.0,0.0,1.0,1.0
10860,3.0,10.5,3081.0,3.0,1.0,1.0,748.0,101.000000,1950.000000,-37.74160,145.04810,2947.0,0.0,0.0,0.0
10861,4.0,6.7,3058.0,4.0,2.0,2.0,441.0,255.000000,2002.000000,-37.73572,144.97256,11204.0,0.0,0.0,0.0
10862,3.0,12.0,3073.0,3.0,1.0,1.0,606.0,153.764119,1964.839866,-37.72057,145.02615,21650.0,0.0,1.0,1.0


In [13]:
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))
imputed_X_valid_plus

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,4.0,8.0,3016.0,4.0,2.0,2.0,450.0,190.000000,1910.000000,-37.86100,144.89850,6380.0,0.0,0.0,0.0
1,2.0,6.6,3011.0,2.0,1.0,0.0,172.0,81.000000,1900.000000,-37.81000,144.88960,2417.0,0.0,0.0,0.0
2,3.0,10.5,3020.0,3.0,1.0,1.0,581.0,153.764119,1964.839866,-37.76740,144.82421,4217.0,0.0,1.0,1.0
3,3.0,4.5,3181.0,2.0,2.0,1.0,128.0,134.000000,2000.000000,-37.85260,145.00710,7717.0,0.0,0.0,0.0
4,3.0,8.5,3044.0,3.0,2.0,2.0,480.0,153.764119,1964.839866,-37.72523,144.94567,7485.0,0.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2711,2.0,6.4,3011.0,2.0,1.0,1.0,47.0,35.000000,2013.000000,-37.80140,144.89590,7570.0,0.0,0.0,0.0
2712,4.0,8.0,3016.0,4.0,2.0,4.0,551.0,153.764119,1964.839866,-37.85790,144.87860,6380.0,0.0,1.0,1.0
2713,3.0,10.8,3105.0,3.0,1.0,1.0,757.0,153.764119,1964.839866,-37.78094,145.10131,4480.0,0.0,1.0,1.0
2714,4.0,6.2,3039.0,4.0,1.0,3.0,478.0,152.000000,1925.000000,-37.76421,144.90571,6232.0,0.0,0.0,0.0


In [14]:
#imputation removed the column names, so put them back
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_train_plus

Unnamed: 0,Rooms,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount,Car_was_missing,BuildingArea_was_missing,YearBuilt_was_missing
0,1.0,5.0,3182.0,1.0,1.0,1.0,0.0,153.764119,1940.000000,-37.85984,144.98670,13240.0,0.0,1.0,0.0
1,2.0,8.0,3016.0,2.0,2.0,1.0,193.0,153.764119,1964.839866,-37.85800,144.90050,6380.0,0.0,1.0,1.0
2,3.0,12.6,3020.0,3.0,1.0,1.0,555.0,153.764119,1964.839866,-37.79880,144.82200,3755.0,0.0,1.0,1.0
3,3.0,13.0,3046.0,3.0,1.0,1.0,265.0,153.764119,1995.000000,-37.70830,144.91580,8870.0,0.0,1.0,0.0
4,3.0,13.3,3020.0,3.0,1.0,2.0,673.0,673.000000,1970.000000,-37.76230,144.82720,4217.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10859,3.0,5.2,3056.0,3.0,1.0,2.0,212.0,153.764119,1964.839866,-37.77695,144.95785,11918.0,0.0,1.0,1.0
10860,3.0,10.5,3081.0,3.0,1.0,1.0,748.0,101.000000,1950.000000,-37.74160,145.04810,2947.0,0.0,0.0,0.0
10861,4.0,6.7,3058.0,4.0,2.0,2.0,441.0,255.000000,2002.000000,-37.73572,144.97256,11204.0,0.0,0.0,0.0
10862,3.0,12.0,3073.0,3.0,1.0,1.0,606.0,153.764119,1964.839866,-37.72057,145.02615,21650.0,0.0,1.0,1.0


In [15]:
#the output above shows the restored column names; do the same for the validation data
imputed_X_valid_plus.columns = X_valid_plus.columns

In [16]:
#MAE third approach
print(score_dataset(imputed_X_train_plus, imputed_X_valid_plus, y_train, y_valid))

178927.503183954


##Why did imputation perform better than dropping the columns?
The training data has 10864 rows and 12 columns, and three columns contain missing data. In each of those columns fewer than half of the entries are missing, so dropping the columns removes a lot of useful information. It therefore makes sense that imputation performs better.

In [17]:
#number of rows and columns in the training data
print(X_train.shape)

(10864, 12)


In [18]:
#number of missing values in each column of training data
missing_val_count_by_column = (X_train.isnull().sum())
print(missing_val_count_by_column[missing_val_count_by_column > 0])


Car               49
BuildingArea    5156
YearBuilt       4307
dtype: int64


#Categorical Variables
- a categorical variable takes only a limited number of values
- Categorical data is qualitative and represents characteristics or attributes that belong to a specific group or category.
- In contrast to categorical data, numerical data represents quantities and can be further divided into discrete or continuous data types.
- Categorical data is valuable for understanding preferences, choices, and qualitative characteristics among different groups. It is often analyzed using frequency distributions, bar charts, pie charts, and other graphical representations.

We are going to look at three approaches to handling categorical variables
##1. Drop Categorical Variables
- the easiest approach is to simply remove the categorical columns from the dataset
- this works well only if the columns did not contain useful information
##2. Ordinal Encoding
- assigns each unique value to a different integer
- this approach assumes an ordering of the categories
For example, consider a breakfast frequency survey:
The responses "Never," "Rarely," "Most days," and "Every day" represent different categories, and each respondent's answer falls into one of these fixed categories. This data allows for a qualitative analysis of breakfast habits.

Ordinal encoding treats these categories as ordered: "Never" (0) < "Rarely" (1) < "Most days" (2) < "Every day" (3)
- this assumption makes sense in this example because there is an indisputable ranking of the categories.
- not all categorical variables have clear ordering in the values, but we refer to those that do as ordinal variables.
- for tree-based models like decision trees and random forest, you can expect ordinal encoding to work well with ordinal variables.
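
The breakfast example can be sketched by hand with an explicit category order. The response list is made up for illustration. Note that scikit-learn's OrdinalEncoder, with default settings, assigns integers by the sorted order of the category values rather than by a meaningful ranking; its `categories` parameter can be used to supply an explicit order like the one below.

```python
# Explicit ranking for the hypothetical breakfast survey.
order = ["Never", "Rarely", "Most days", "Every day"]
category_to_code = {category: code for code, category in enumerate(order)}

# Made-up survey responses.
responses = ["Rarely", "Every day", "Never", "Most days", "Rarely"]
encoded = [category_to_code[response] for response in responses]

print(encoded)  # [1, 3, 0, 2, 1]
```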

##3. One-Hot Encoding
- this method creates new columns indicating the presence or absence of each possible value in the original data
- example:
In the original dataset, "Color" is a categorical variable with three categories: "Red", "Yellow", and "Green". The corresponding one-hot encoding contains one column for each possible value, and one row for each row in the original dataset. Wherever the original value was "Red", we put a 1 in the "Red" column; if the original value was "Yellow", we put a 1 in the "Yellow" column, and so on.
<img src="https://storage.googleapis.com/kaggle-media/learn/images/TW5m0aJ.png">
- the one-hot encoding doesn't perform well if the categorical variable takes on a large number of values.
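
The "Color" example can be reproduced by hand as a rough sketch. The input list is made up, and the column order (alphabetical here) is an arbitrary choice for this illustration.

```python
# Made-up values of a hypothetical "Color" column.
colors = ["Red", "Red", "Yellow", "Green", "Yellow"]

# One output column per distinct category.
categories = sorted(set(colors))  # ['Green', 'Red', 'Yellow']

# Each row gets a 1 in the column of its category and 0 elsewhere.
one_hot = [[1 if value == category else 0 for category in categories]
           for value in colors]

for value, row in zip(colors, one_hot):
    print(value, row)
```

Because every distinct value becomes its own column, a variable with hundreds of categories would add hundreds of mostly-zero columns, which is why the approach scales poorly with cardinality.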

Now let's try these approaches in code.

In [19]:
#loading dataset
data = pd.read_csv("/content/drive/MyDrive/Datascience&MachineLearning/datasets/melb_data.csv")

In [20]:
#separating target from predictors
y = data.Price
X = data.drop(['Price'], axis=1)

In [21]:
#dividing data into training and validation subsets
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

In [22]:
#next we drop the columns with missing values (simplest approach)
cols_with_missing = [col for col in X_train_full.columns if X_train_full[col].isnull().any()]
X_train_full.drop(cols_with_missing, axis=1, inplace=True)
X_valid_full.drop(cols_with_missing, axis=1, inplace=True)

- next we are going to select the categorical columns with relatively low cardinality (convenient but arbitrary)
- cardinality means the number of unique values in a column
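
As a small illustration of cardinality, the sketch below counts distinct values in two toy columns and keeps the ones under a threshold. The column contents and the threshold of 4 are made up; the notebook's own cell uses pandas `nunique()` with a threshold of 10.

```python
# Hypothetical toy columns; the values are made up for illustration.
columns = {
    "Type": ["h", "u", "h", "t", "u"],
    "Suburb": ["Abbotsford", "Richmond", "Kew", "Carlton", "Fitzroy"],
}

# Cardinality = number of distinct values in a column.
cardinality = {name: len(set(values)) for name, values in columns.items()}

# Keep only the columns below a chosen (arbitrary) threshold.
threshold = 4
low_cardinality = [name for name, count in cardinality.items() if count < threshold]

print(cardinality)
print(low_cardinality)
```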

In [23]:
low_cardinality_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and X_train_full[cname].dtype == 'object']

In [24]:
#select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

In [25]:
#keep selected columns only
my_cols = low_cardinality_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()

In [26]:
X_train.head()

Unnamed: 0,Type,Method,Regionname,Rooms,Distance,Postcode,Bedroom2,Bathroom,Landsize,Lattitude,Longtitude,Propertycount
12167,u,S,Southern Metropolitan,1,5.0,3182.0,1.0,1.0,0.0,-37.85984,144.9867,13240.0
6524,h,SA,Western Metropolitan,2,8.0,3016.0,2.0,2.0,193.0,-37.858,144.9005,6380.0
8413,h,S,Western Metropolitan,3,12.6,3020.0,3.0,1.0,555.0,-37.7988,144.822,3755.0
2919,u,SP,Northern Metropolitan,3,13.0,3046.0,3.0,1.0,265.0,-37.7083,144.9158,8870.0
6043,h,S,Western Metropolitan,3,13.3,3020.0,3.0,1.0,673.0,-37.7623,144.8272,4217.0


- above we can see that three categorical columns are now included: "Type", "Method" and "Regionname"

In [27]:
#getting list of categorical variables
s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)

print("Categorical Variables :\n",object_cols)

Categorical Variables :
 ['Type', 'Method', 'Regionname']


##Approach 1: (dropping categorical variables)


In [28]:
drop_X_train = X_train.select_dtypes(exclude=["object"])
drop_X_valid = X_valid.select_dtypes(exclude=["object"])

print("MAE from Approach 1 (Drop categorical variables):")
print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))

MAE from Approach 1 (Drop categorical variables):
183550.22137772635


##Approach 2: (Ordinal Encoding)
- for this approach we are going to use the OrdinalEncoder class from scikit-learn

In [29]:
from sklearn.preprocessing import OrdinalEncoder

#always try to copy dataset to avoid changing original data
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()

#applying ordinal encoder to each column with categorical data
ordinal_encoder = OrdinalEncoder()
label_X_train[object_cols] = ordinal_encoder.fit_transform(X_train[object_cols])
label_X_valid[object_cols] = ordinal_encoder.transform(X_valid[object_cols])

print("MAE for Approach 2:")
print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))

MAE for Approach 2:
175062.2967599411


##Approach 3 (one-hot encoding)
- for this we are going to use the "OneHotEncoder" class
- it has some parameters that customize its behaviour:
1. handle_unknown='ignore' -> to avoid errors when the validation data contains classes that aren't represented in the training data
2. sparse=False -> to ensure that the encoded columns are returned as a dense numpy array instead of a sparse matrix (in scikit-learn 1.2 and later this parameter is named sparse_output)


In [30]:
from sklearn.preprocessing import OneHotEncoder

#applying one-hot encoder for each column with categorical data
#note: scikit-learn 1.2+ names this parameter sparse_output (sparse was removed in 1.4)
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))




In [31]:
#one-hot encoding removed the index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index

In [32]:
#removing categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)

In [33]:
#adding one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)

In [34]:
#ensure all column names are strings
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)

In [35]:
print("MAE from approach 3:")
print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))

MAE from approach 3:
176703.63810751104


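In the results above, Approach 2 (ordinal encoding) has the lowest MAE, followed closely by Approach 3 (one-hot encoding). Dropping the categorical columns (Approach 1) performed worst.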

##Pipelines
- pipelines are the simplest way to keep your data preprocessing and modeling code organized.
- a pipeline bundles preprocessing and modeling steps so you can use the whole bundle as if it were a single step
- benefits of pipelines:
1. Cleaner code: with a pipeline you won't need to manually keep track of your training and validation data at each step.
2. Fewer bugs: there are fewer opportunities to misapply a step or forget a preprocessing step.
3. Easier to productionize: it can be hard to transition a model from a prototype to something deployable at scale. We won't go into the many related concerns here, but pipelines can help.
4. More options for model validation, such as cross-validation.
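
To illustrate the idea of bundling steps, here is a toy pipeline written in plain Python. All three classes (`MeanImputer`, `RatioModel`, `TinyPipeline`) are hypothetical stand-ins invented for this sketch; they are not scikit-learn classes and handle only a single feature column.

```python
class MeanImputer:
    """Fills None with the mean learned during fit (single-column toy version)."""

    def fit(self, values):
        observed = [v for v in values if v is not None]
        self.mean_ = sum(observed) / len(observed)
        return self

    def transform(self, values):
        return [self.mean_ if v is None else v for v in values]


class RatioModel:
    """Toy model: predicts y = ratio * x, with ratio = sum(y) / sum(x)."""

    def fit(self, x, y):
        self.ratio_ = sum(y) / sum(x)
        return self

    def predict(self, x):
        return [self.ratio_ * v for v in x]


class TinyPipeline:
    """Runs the transformer and then the model as if they were one step."""

    def __init__(self, transformer, model):
        self.transformer = transformer
        self.model = model

    def fit(self, x, y):
        x_clean = self.transformer.fit(x).transform(x)
        self.model.fit(x_clean, y)
        return self

    def predict(self, x):
        # The transformer is only applied here, never re-fitted.
        return self.model.predict(self.transformer.transform(x))


# Made-up training data: number of rooms (one missing) and prices.
rooms_train = [2, None, 4]
price_train = [200.0, 300.0, 400.0]

pipe = TinyPipeline(MeanImputer(), RatioModel()).fit(rooms_train, price_train)
print(pipe.predict([None, 5]))
```

The caller passes raw data to `fit` and `predict`; the imputation learned from the training data is reused automatically at prediction time, which is the bookkeeping a pipeline takes off your hands.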

In [36]:
#earlier we dropped the columns with missing values; we need that data back,
#so we split the original data into training and validation subsets again
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

In [37]:
#selecting categorical columns
categorical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and X_train_full[cname].dtype == "object"]

In [38]:
#selecting numerical columns
numerical_cols = [cname for cname in X_train_full.columns
                  if X_train_full[cname].dtype in ['int64', 'float64']]

In [39]:
#keeping selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()

In [40]:
X_train.head()

Unnamed: 0,Type,Method,Regionname,Rooms,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount
12167,u,S,Southern Metropolitan,1,5.0,3182.0,1.0,1.0,1.0,0.0,,1940.0,-37.85984,144.9867,13240.0
6524,h,SA,Western Metropolitan,2,8.0,3016.0,2.0,2.0,1.0,193.0,,,-37.858,144.9005,6380.0
8413,h,S,Western Metropolitan,3,12.6,3020.0,3.0,1.0,1.0,555.0,,,-37.7988,144.822,3755.0
2919,u,SP,Northern Metropolitan,3,13.0,3046.0,3.0,1.0,1.0,265.0,,1995.0,-37.7083,144.9158,8870.0
6043,h,S,Western Metropolitan,3,13.3,3020.0,3.0,1.0,2.0,673.0,673.0,1970.0,-37.7623,144.8272,4217.0


- above we look at the training data using the head() method.
- you can see that the data contains both categorical columns and columns with missing values.
- with a pipeline, it's easy to deal with both!

we are going to construct the full pipeline in three steps

Step 1: Defining Preprocessing Steps
- we use the ColumnTransformer class to bundle together different preprocessing steps. The code below:
1. imputes missing values in numerical data
2. imputes missing values and applies a one-hot encoding to categorical data


In [41]:
#importing required methods and classes
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

In [42]:
#preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')

In [43]:
#preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

In [44]:
#bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ]
)


Step 2: Define the model
- we are using random forest model
- so, we are going to import RandomForestRegressor class

In [45]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=0)

Step 3: Create and Evaluate the Pipeline
- important things to notice:
1. with the pipeline, we preprocess the training data and fit the model in a single line of code. Without a pipeline, we would have to do imputation, one-hot encoding and model training in separate steps
2. with the pipeline, we supply the unprocessed features in X_valid to the predict() command, and the pipeline automatically preprocesses the features before generating predictions. Without a pipeline, we would have to remember to preprocess the validation data before making predictions.

In [46]:
from sklearn.metrics import mean_absolute_error

#bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', model)
])

In [47]:
#preprocessing the training data, and fitting model
my_pipeline.fit(X_train, y_train)

In [48]:
#preprocessing validation data, to get predictions
preds = my_pipeline.predict(X_valid)

In [49]:
#evaluating the model
score = mean_absolute_error(y_valid, preds)
print("MAE :",score)

MAE : 160679.18917034855


Pipelines are valuable for cleaning up machine learning code and avoiding errors, and are especially useful for workflows with sophisticated data preprocessing

##Cross-Validation

- in cross-validation we run our modeling process on different subsets of the data to get multiple measures of model quality
- for example, we could begin by dividing the data into 5 pieces, each 20% of the full dataset
<img src="https://storage.googleapis.com/kaggle-media/learn/images/9k60cVA.png">
- the data is broken into 5 folds
- in experiment 1 we use the first fold as the validation set and everything else as training data
- in experiment 2 we hold out the second fold as the validation set and train on everything else
- we repeat this process until every fold has been used once as the validation set
- when should we use cross-validation? For small datasets, where the extra computational burden isn't a big deal, you should run cross-validation
- because the model is trained once per fold, cross-validation takes longer to run on large datasets
- so for large datasets a single validation set is usually sufficient, and your code will run faster
- there's no simple threshold for what constitutes a large vs. small dataset
- if your model takes a couple of minutes or less to run, it's probably worth switching to cross-validation
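
The fold rotation described above can be sketched as a small index generator. The function name `k_fold_indices` is hypothetical, and the sketch does not shuffle the rows; it only shows which rows would serve as validation data in each experiment.

```python
def k_fold_indices(n_rows, n_folds):
    """Yield (train_indices, valid_indices) for each fold, without shuffling."""
    base, extra = divmod(n_rows, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds get one additional row when sizes don't divide evenly.
        size = base + (1 if fold < extra else 0)
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, valid
        start += size


# 10 rows split into 5 folds: each row is used for validation exactly once.
for train, valid in k_fold_indices(10, 5):
    print("valid:", valid, "train:", train)
```

Each experiment trains on 80% of the rows and validates on the remaining 20%, so the five validation scores together cover the whole dataset.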


In [50]:
#for cross-validation we load the same housing data
data = pd.read_csv("/content/drive/MyDrive/Datascience&MachineLearning/datasets/melb_data.csv")

#selecting subset of predictors
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]
X

Unnamed: 0,Rooms,Distance,Landsize,BuildingArea,YearBuilt
0,2,2.5,202.0,,
1,2,2.5,156.0,79.0,1900.0
2,3,2.5,134.0,150.0,1900.0
3,3,2.5,94.0,,
4,4,2.5,120.0,142.0,2014.0
...,...,...,...,...,...
13575,4,16.7,652.0,,1981.0
13576,3,6.8,333.0,133.0,1995.0
13577,3,6.8,436.0,,1997.0
13578,4,6.8,866.0,157.0,1920.0


In [51]:
#selecting the target
y = data.Price
y

0        1480000.0
1        1035000.0
2        1465000.0
3         850000.0
4        1600000.0
           ...    
13575    1245000.0
13576    1031000.0
13577    1170000.0
13578    2500000.0
13579    1285000.0
Name: Price, Length: 13580, dtype: float64

- next we are going to define a pipeline that uses an imputer to fill in missing values and a random forest model to make predictions
- it's possible to do cross-validation without pipelines, but it is quite difficult; using a pipeline makes the code remarkably straightforward


In [52]:
#importing necessary methods to create pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

my_pipeline = Pipeline(steps=[('preprocessor', SimpleImputer()),
                              ('model', RandomForestRegressor(n_estimators=50, random_state=0))])

Above we defined the pipeline using the RandomForestRegressor model
- next we need to obtain the cross-validation scores; for that we are going to use "cross_val_score()" from scikit-learn
- we set the number of folds with the "cv" parameter

In [53]:
#importing cross_val_score function
from sklearn.model_selection import cross_val_score

#multiplying by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(my_pipeline, X, y,
                              cv=5, scoring='neg_mean_absolute_error')

print("MAE scores :\n", scores)

MAE scores :
 [301628.7893587  303164.4782723  287298.331666   236061.84754543
 260383.45111427]


- now we have one MAE per fold; since we want a single measure of model quality, we take the average

In [54]:
print("Average MAE score (across experiments) :")
print(scores.mean())

Average MAE score (across experiments) :
277707.3795913405


##XGBoost
- so far we have made predictions with the random forest method
- a random forest typically achieves better performance than a single decision tree by averaging the predictions of many decision trees
- we refer to the random forest method as an "ensemble method"
- "ensemble methods" combine the predictions of several models (e.g. several trees, in the case of random forests)

###Gradient Boosting
- gradient boosting is another ensemble method
- gradient boosting goes through cycles to iteratively add models to an ensemble
- it begins by initializing the ensemble with a single model, whose predictions can be quite naive (even if its predictions are wildly inaccurate, subsequent additions to the ensemble will address those errors)
- then we start the cycle:
- first, we use the current ensemble to generate predictions for each observation in the dataset. To make a prediction, we add up the predictions from all models in the ensemble
- these predictions are used to calculate a loss function (like mean squared error, for instance)
- then we use the loss function to fit a new model that will be added to the ensemble. Specifically, we determine model parameters so that adding this new model to the ensemble will reduce the loss
- finally, we add the new model to the ensemble, and repeat!
<img src="https://storage.googleapis.com/kaggle-media/learn/images/MvCGENh.png">
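
The cycle can be sketched in plain Python for the special case of squared-error loss, where fitting the new model to the current residuals is what reduces the loss. Everything here is an assumption made for illustration: the one-split "stump" learner, the toy data, and the `learning_rate` scaling factor applied to each added stump. This is a simplified teaching sketch, not XGBoost's actual algorithm.

```python
def fit_stump(x, residuals):
    """Best single-split regressor on 1-D x, by squared error on the residuals."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # every distinct value except the largest
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def boost(x, y, n_rounds, learning_rate):
    """Return (predict function, training loss recorded at the start of each round)."""
    base = sum(y) / len(y)  # naive initial model: always predict the mean
    stumps = []

    def predict(xi):
        # Ensemble prediction = initial model + scaled sum of all added stumps.
        return base + learning_rate * sum(stump(xi) for stump in stumps)

    losses = []
    for _ in range(n_rounds):
        preds = [predict(xi) for xi in x]
        losses.append(sum((yi - p) ** 2 for yi, p in zip(y, preds)) / len(y))
        residuals = [yi - p for yi, p in zip(y, preds)]
        stumps.append(fit_stump(x, residuals))  # new model targets current errors
    return predict, losses


# Made-up 1-D data with a jump between x=3 and x=4.
x = [1, 2, 3, 4, 5, 6]
y = [10, 12, 11, 20, 22, 21]

predict, losses = boost(x, y, n_rounds=20, learning_rate=0.5)
print("first loss:", losses[0])
print("last loss:", losses[-1])
```

The recorded training loss shrinks from round to round because each new stump is fitted to whatever error the current ensemble still makes.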

In [55]:
#we are going to use melbourne data for the Gradient Boosting also
data = pd.read_csv("/content/drive/MyDrive/Datascience&MachineLearning/datasets/melb_data.csv")

#selecting the subset of predictors
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]

In [56]:
#select the target
y = data.Price

In [57]:
#separating data into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y)

- for this example, we are going to use the XGBoost library
- XGBoost stands for extreme gradient boosting
- from xgboost library we will use XGBRegressor

In [58]:
#importing XGBRegressor from xgboost
from xgboost import XGBRegressor

my_model = XGBRegressor()
my_model.fit(X_train, y_train)

In [59]:
#next we are going to make predictions to evaluate the model
from sklearn.metrics import mean_absolute_error

predictions = my_model.predict(X_valid)
print("Mean absolute error:",mean_absolute_error(predictions, y_valid))

Mean absolute error: 245979.2263807069


##Parameter Tuning
- XGBoost has a few parameters that can affect accuracy and training speed
- "n_estimators" is one of them
- "n_estimators" specifies how many times to go through the modeling cycle described above. It is equal to the number of models that we include in the ensemble
1. too low a value causes underfitting, which leads to inaccurate predictions on both training data and test data
2. too high a value causes overfitting, which leads to accurate predictions on training data but inaccurate predictions on test data
- typical values range from 100 to 1000, though this depends a lot on the "learning_rate" parameter


In [60]:
#sample code with n_estimators
my_model = XGBRegressor(n_estimators=500)
my_model.fit(X_train, y_train)

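- the next parameter is "early_stopping_rounds"
- it offers a way to automatically find the ideal value for n_estimators
- early stopping causes the model to stop iterating when the validation score stops improving, even if we haven't reached the limit set by n_estimators
- since random chance sometimes causes a single round in which the validation score doesn't improve, we specify how many consecutive rounds without improvement to allow before stopping
- early_stopping_rounds=5 is a reasonable choice: training stops after 5 straight rounds without improvement
- when using early stopping we also need to set aside some data for calculating the validation scores, which is done with the eval_set parameter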
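
The stopping rule itself can be sketched without any library. The function name, the `patience` argument (standing in for early_stopping_rounds) and the list of validation errors are all made up for illustration; XGBoost's internal bookkeeping may differ in detail.

```python
def best_round_with_early_stopping(validation_errors, patience):
    """Return (best_round, rounds_run); stop after `patience` rounds in a row
    without a new best validation error."""
    best_error = float("inf")
    best_round = -1
    rounds_without_improvement = 0
    rounds_run = 0
    for round_index, error in enumerate(validation_errors):
        rounds_run += 1
        if error < best_error:
            best_error = error
            best_round = round_index
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement >= patience:
                break
    return best_round, rounds_run


# Made-up validation errors, one per boosting round.
errors = [10.0, 8.0, 7.0, 7.5, 6.9, 7.1, 7.2, 7.3, 7.4, 7.5, 6.0]

print(best_round_with_early_stopping(errors, patience=5))  # (4, 10)
print(best_round_with_early_stopping(errors, patience=1))  # (2, 4)
```

With a patience of 1 the loop gives up at the first bad round and misses the better score at round 4, which is why a value larger than 1 is used.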

In [61]:
#early_stopping_rounds (recent XGBoost releases expect this argument in the XGBRegressor constructor instead of fit())
my_model = XGBRegressor(n_estimators=500)
my_model.fit(X_train, y_train,
             early_stopping_rounds=5,
             eval_set=[(X_valid, y_valid)],
             verbose=False)



- our next parameter is "learning_rate"
- instead of getting predictions by simply adding up the predictions from each component model, we can multiply the predictions from each model by a small number (known as the "learning rate") before adding them in
- this means each tree we add to the ensemble helps us less, so we can set a higher value for n_estimators without overfitting
- if we use early stopping, the appropriate number of trees will be determined automatically
- the default value depends on the XGBoost version: older releases of the scikit-learn wrapper used 0.1, while current releases use 0.3

In [62]:
#learning_rate
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05)
my_model.fit(X_train, y_train,
             early_stopping_rounds=5,
             eval_set=[(X_valid, y_valid)],
             verbose=False)

- "n_jobs": on larger datasets where runtime is a consideration, you can use parallelism to build your models faster
- it's common to set the parameter n_jobs equal to the number of cores on your machine. On smaller datasets this won't help


In [63]:
#using n_jobs
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=4)
my_model.fit(X_train, y_train,
             early_stopping_rounds=5,
             eval_set=[(X_valid, y_valid)],
             verbose=False)

##Data Leakage
- data leakage (or leakage) happens when your training data contains information about the target, but similar data will not be available when the model is used for prediction
- this leads to high performance on the training set (and possibly even on the validation data), but the model will perform poorly in production
- there are two main types of leakage
1. Target Leakage
- it occurs when your predictors include data that will not be available at the time you make predictions
2. Train-Test Contamination
- a different type of leak occurs when you aren't careful to distinguish training data from validation data, for example when a preprocessing step such as an imputer is fitted on the full dataset before it is split
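
Train-test contamination can be shown with a tiny imputation example. The numbers are made up; the point is only which rows are allowed to influence the imputation mean.

```python
def mean_of_observed(values):
    """Mean of the entries that are not None."""
    observed = [v for v in values if v is not None]
    return sum(observed) / len(observed)


# Made-up feature values after splitting; None marks a missing entry.
train = [100.0, None, 140.0]
valid = [None, 400.0]

# Contaminated: the validation rows influence the statistic used for imputation.
leaky_mean = mean_of_observed(train + valid)

# Clean: the statistic is learned from the training rows only ...
clean_mean = mean_of_observed(train)

# ... and then applied unchanged to the validation rows.
valid_imputed = [clean_mean if v is None else v for v in valid]

print("leaky mean:", leaky_mean)
print("clean mean:", clean_mean)
print("validation data after clean imputation:", valid_imputed)
```

Fitting preprocessing inside a pipeline and letting cross-validation call it per fold keeps the validation rows out of such statistics automatically.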


- to demonstrate target leakage we are going to use a new dataset


In [64]:
data = pd.read_csv("/content/drive/MyDrive/Datascience&MachineLearning/datasets/AER_credit_card_data.csv", true_values=['yes'], false_values=['no'])

In [65]:
data.head()

Unnamed: 0,card,reports,age,income,share,expenditure,owner,selfemp,dependents,months,majorcards,active
0,True,0,37.66667,4.52,0.03327,124.9833,True,False,3,54,1,12
1,True,0,33.25,2.42,0.005217,9.854167,False,False,3,34,1,13
2,True,0,33.66667,4.5,0.004156,15.0,True,False,4,58,1,5
3,True,0,30.5,2.54,0.065214,137.8692,False,False,0,25,1,7
4,True,0,32.16667,9.7867,0.067051,546.5033,True,False,2,64,1,5


In [66]:
#selecting card as target
y = data.card

In [67]:
#selecting predictors
X = data.drop(['card'], axis=1)

In [68]:
print("Number of rows in the dataset:",X.shape[0])

Number of rows in the dataset: 1319


In [69]:
#data in X
X.head()

Unnamed: 0,reports,age,income,share,expenditure,owner,selfemp,dependents,months,majorcards,active
0,0,37.66667,4.52,0.03327,124.9833,True,False,3,54,1,12
1,0,33.25,2.42,0.005217,9.854167,False,False,3,34,1,13
2,0,33.66667,4.5,0.004156,15.0,True,False,4,58,1,5
3,0,30.5,2.54,0.065214,137.8692,False,False,0,25,1,7
4,0,32.16667,9.7867,0.067051,546.5033,True,False,2,64,1,5


In [70]:
#we are going to use cross-validation on this small dataset
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

In [71]:
#there is no preprocessing, so we don't strictly need a pipeline,
#but we use one anyway as best practice
my_pipeline = make_pipeline(RandomForestClassifier(n_estimators=100))
cv_scores = cross_val_score(my_pipeline, X, y, cv=5, scoring='accuracy')
print("Cross-validation accuracy:",cv_scores.mean())

Cross-validation accuracy: 0.9802943887544648


- with experience you'll find that it's very rare to find models that are accurate 98% of the time. It happens, but it's uncommon enough that we should inspect the data more closely for target leakage
- look closely at the columns of the data

In [72]:
#columns
data.columns

Index(['card', 'reports', 'age', 'income', 'share', 'expenditure', 'owner',
       'selfemp', 'dependents', 'months', 'majorcards', 'active'],
      dtype='object')

If you look closely, several variables look suspicious.
At this point, basic data comparisons can be very helpful.

In [76]:
#first we compare expenditure for cardholders and non-cardholders
expenditure_cardholders = X.expenditure[y]
expenditure_noncardholders = X.expenditure[~y]

In [74]:
print("Who did not receive a card and had no expenditure:",((expenditure_noncardholders==0).mean()))

Who did not receive a card and had no expenditure: 1.0


In [77]:
print("Who receive a card and had no expenditure:",((expenditure_cardholders==0).mean()))

Who receive a card and had no expenditure: 0.020527859237536656


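- the comparison shows that everyone who did not receive a card had no expenditure, while only about 2% of cardholders had none. This suggests target leakage: expenditure probably means expenditure on the card that was applied for
- next we are going to drop the potentially leaky predictors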

In [78]:
potential_leaks = ['expenditure', 'share', 'active', 'majorcards']
#we are going to drop these potentially leaky predictors from the dataset
X2 = X.drop(potential_leaks, axis=1)

In [79]:
#evaluating model with leaky predictors removed
cv_scores = cross_val_score(my_pipeline, X2, y, cv=5, scoring='accuracy')
print('Cross-val accuracy:', cv_scores.mean())

Cross-val accuracy: 0.830922341283558


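After removing the leaky predictors the accuracy drops to about 83%. This lower score is the more trustworthy estimate, because the earlier 98% relied on information that would not be available for new applications.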