# **<ins style="color:green">Binning and Binarization | Discretization | Quantile Binning | KMeans Binning</ins>**

## **<ins style="color:red">1. Encoding Numerical Features</ins>**
- **Numerical data >-------->>> convert into categorical data.**
- Google_Play_Store_Downloads = [23, 123987, 35434, 947, 68463, 5846378373464 ]
- Converted into categories (bins) = [ 100k+ | 1000k+ | 10000k+ | 100000k+ | 1M+ | 2M+ ]  \

                                  [ 23 | 2345 | 3453345 | 34534532323 | 145634634456 | 287867676 ]

- ## **To convert numerical data to categorical data, we have two techniques.**
- ### **Numerical Data >------------>>> Categorical Data**
  - #### __Discretization(Binning) :__
  - Discretization (binning) is the process of transforming a continuous variable into a discrete variable by creating a set of contiguous intervals (bins) that span the range of the variable's values.
    - Why use Discretization?
      1. _To handle Outliers._
      2. _To improve the value spread._
    - __Types of Discretization(Binning) :__
      1. Unsupervised
         - Equal Width Binning(Uniform Binning)
         - Equal Frequency Binning(Quantile Binning)
         - KMeans Binning
      2. Supervised
         - Decision Tree Binning
      3. Custom Binning
  
  - #### __Binarization :__

## **<ins style="color:blue">Discretization(Binning) :</ins>**
- ### **Unsupervised :**
  - #### **Equal Width / Uniform Binning :**
    - You decide the number of bins.
    - Formula for the __width of each bin = (max - min) / number of bins__
    - This is also called equal width binning.
    - __Bin width :__ The size of one bin, e.g. (0-10) has width 10, (0-20) has width 20, (3-9) has width 6.
    - e.g. number of bins = 10
      - max=100, min=0, __bin width__ = (100-0)/10 = 10
    - Outliers are handled automatically: they fall into the first or last bin.
    - It does not change the spread of the data.
  - #### **Equal Frequency / Quantile Binning :**
    - __Number of intervals__ = 10 : Each interval contains 10% of the total observations.
    - e.g. 0-16 (0% to 10%), 16-20 (10% to 20%), 20-22 (20% to 30%), 22-25 (30% to 40%), ..., 90-100 (90% to 100%)
    - The interval widths are not the same for all bins, and the bin edges differ from column to column.
    - Gives better results with outliers and makes the spread of values uniform.
  - #### **KMeans Binning :**
    - Makes clusters of the data.
    - Use it when the data appears to form clusters.
    - Create __centroids__ (one per interval) at random positions and calculate the distance of each point to each centroid.
    - **Sklearn - KBinsDiscretizer(_no. of bins_, _strategy_ (uniform, quantile, kmeans), _encode_ (ordinal, onehot, onehot-dense))**
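
The three unsupervised strategies above can be compared on a small made-up column. This is a minimal sketch, assuming a toy array with one large outlier (the values are illustrative and not taken from the Titanic data). It fits `KBinsDiscretizer` once per strategy and prints the bin edges, then checks the equal-width edges by hand with the width formula.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Toy column with one large outlier (values are made up for illustration).
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float).reshape(-1, 1)

edges = {}
for strategy in ("uniform", "quantile", "kmeans"):
    kbd = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    kbd.fit(x)
    edges[strategy] = kbd.bin_edges_[0]
    print(strategy, np.round(edges[strategy], 2))

# Equal-width edges checked by hand: width = (max - min) / number of bins.
width = (x.max() - x.min()) / 3
manual_uniform = x.min() + width * np.arange(4)
print("manual uniform edges:", manual_uniform)
```

With uniform binning the outlier stretches every bin, while the quantile and k-means edges follow where the data actually lies.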

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.compose import ColumnTransformer

In [3]:
df = pd.read_csv("../data/train.csv", usecols=["Age", "Fare", "Survived"])
df.sample(7)

Unnamed: 0,Survived,Age,Fare
839,1,,29.7
472,1,33.0,27.75
838,1,32.0,56.4958
301,1,,23.25
793,0,,30.6958
488,0,30.0,8.05
853,1,16.0,39.4


In [4]:
df.isnull().sum()

Survived      0
Age         177
Fare          0
dtype: int64

In [5]:
# df.fillna(df['Age'].median(), inplace=True)
df.dropna(inplace=True)

In [6]:
df.isnull().sum()

Survived    0
Age         0
Fare        0
dtype: int64

In [7]:
df.shape

(714, 3)

## **<ins style="color:red">Without Binning</ins>**

In [8]:
X = df.iloc[:, 1:]
y = df.iloc[:, [0]]

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.head(5)

Unnamed: 0,Age,Fare
328,31.0,20.525
73,26.0,14.4542
253,30.0,16.1
719,33.0,7.775
666,25.0,13.0


In [10]:
dtc = DecisionTreeClassifier()

In [11]:
dtc.fit(X_train, y_train)
y_pred = dtc.predict(X_test)

In [12]:
accuracy_score(y_test, y_pred)*100

62.93706293706294

In [13]:
# check using cross_validation
# np.mean(cross_val_score(DecisionTreeClassifier(), X, y, cv=10, scoring='accuracy'))    # correct
np.mean(cross_val_score(dtc, X, y, cv=10, scoring='accuracy'))*100


63.17097026604068

## **<ins style="color:red">With Binning</ins>**

In [14]:
## Kbin
kbin_age = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='quantile')
kbin_fare = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='kmeans')
# encode{‘onehot’, ‘onehot-dense’, ‘ordinal’}, default=’onehot’
# strategy{‘uniform’, ‘quantile’, ‘kmeans’}, default=’quantile’

In [15]:
trf = ColumnTransformer([
    ("Age", kbin_age, [0]),
    ('Fare', kbin_fare,[1])
], remainder='passthrough')
trf

In [16]:
X_train_trf=trf.fit_transform(X_train)
X_test_trf=trf.transform(X_test)
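# fit_transform returns a NumPy array that does not carry X_train's index.
# If you wrap it in pd.DataFrame(), pass index=X_train.index; otherwise combining it with X_train misaligns the rows and produces NaN values.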
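
The NaN problem mentioned in the comment comes from index alignment. The following sketch uses small stand-in objects (`X_train` and `X_train_trf` here are hypothetical three-row copies, not the real split) to show the difference between a default index and a reused index.

```python
import numpy as np
import pandas as pd

# Stand-in for X_train after train_test_split: the index is shuffled, not 0..n-1.
X_train = pd.DataFrame({"Age": [31.0, 26.0, 30.0]}, index=[328, 73, 253])
# Stand-in for the ndarray returned by fit_transform.
X_train_trf = np.array([[5.0], [4.0], [5.0]])

# A default RangeIndex (0, 1, 2) matches none of 328, 73, 253, so the join yields NaN.
misaligned = X_train.join(pd.DataFrame(X_train_trf, columns=["age_trf"]))

# Reusing the original index keeps the rows aligned.
aligned = X_train.join(
    pd.DataFrame(X_train_trf, columns=["age_trf"], index=X_train.index)
)

print(misaligned)
print(aligned)
```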

In [17]:
X_train.head(7)

Unnamed: 0,Age,Fare
328,31.0,20.525
73,26.0,14.4542
253,30.0,16.1
719,33.0,7.775
666,25.0,13.0
30,40.0,27.7208
287,22.0,7.8958


In [18]:
X_train_trf

array([[5., 1.],
       [4., 1.],
       [5., 1.],
       ...,
       [7., 5.],
       [6., 1.],
       [6., 0.]])

In [19]:
trf.named_transformers_['Age'].n_bins_  # see the transformers

array([10])

In [20]:
age_label = trf.named_transformers_['Age'].bin_edges_[0].tolist()    # bin edges learned for Age
fare_label = trf.named_transformers_['Fare'].bin_edges_[0].tolist()    # bin edges learned for Fare
print("age_label : ", age_label)
print()
print("fare-label : ", fare_label)

age_label :  [0.42, 14.0, 19.0, 22.0, 25.0, 28.5, 32.0, 36.0, 42.0, 50.0, 80.0]

fare-label :  [0.0, 11.694186540052963, 22.367481981593325, 42.19216655959238, 67.87308285285285, 100.48995908496732, 137.4050068627451, 185.67419166666667, 237.86718333333334, 385.65157500000004, 512.3292]
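
The ordinal codes can be reproduced by hand from the printed age edges. This sketch assumes that each inner edge acts as the left boundary of the next bin, which is consistent with the outputs shown in this notebook. `np.digitize` is used as a stand-in for illustration, not as a statement about the library's internals. The ages are the first seven rows of `X_train` shown above.

```python
import numpy as np

# Age edges as printed above.
age_edges = [0.42, 14.0, 19.0, 22.0, 25.0, 28.5, 32.0, 36.0, 42.0, 50.0, 80.0]
# First seven ages from X_train.head(7).
ages = np.array([31.0, 26.0, 30.0, 33.0, 25.0, 40.0, 22.0])

# Only the inner edges are needed: values below the first inner edge get code 0,
# values at or above the last inner edge get the highest code.
inner_edges = age_edges[1:-1]
codes = np.digitize(ages, inner_edges)  # bins[i-1] <= x < bins[i]

print(codes)  # [5 4 5 6 4 7 3]
```

Note that age 25.0 sits exactly on an edge and receives code 4, i.e. it is placed in the bin that starts at 25.0.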


In [21]:
nt = trf.named_transformers_['Age']
print(nt.n_features_in_)
print(nt.feature_names_in_)
print(nt.n_bins_)

1
['Age']
[10]


In [22]:
X_train

Unnamed: 0,Age,Fare
328,31.0,20.5250
73,26.0,14.4542
253,30.0,16.1000
719,33.0,7.7750
666,25.0,13.0000
...,...,...
92,46.0,61.1750
134,25.0,13.0000
337,41.0,134.5000
548,33.0,20.5250


In [23]:
X_train_trf

array([[5., 1.],
       [4., 1.],
       [5., 1.],
       ...,
       [7., 5.],
       [6., 1.],
       [6., 0.]])

In [24]:
dfn = pd.DataFrame({
    'age' : X_train['Age'],
    'age_trf' : X_train_trf[:, 0],
    'fare' : X_train['Fare'],
    'fare_trf' : X_train_trf[:, 1]
})
dfn.head(7)

Unnamed: 0,age,age_trf,fare,fare_trf
328,31.0,5.0,20.525,1.0
73,26.0,4.0,14.4542,1.0
253,30.0,5.0,16.1,1.0
719,33.0,6.0,7.775,0.0
666,25.0,4.0,13.0,1.0
30,40.0,7.0,27.7208,2.0
287,22.0,3.0,7.8958,0.0


In [25]:
dfn.isnull().sum()

age         0
age_trf     0
fare        0
fare_trf    0
dtype: int64

In [26]:
dfn.describe()

Unnamed: 0,age,age_trf,fare,fare_trf
count,571.0,571.0,571.0,571.0
mean,30.016935,4.591944,35.07856,1.581436
std,14.728887,2.854336,49.575809,1.80786
min,0.42,0.0,0.0,0.0
25%,21.0,2.0,8.05,0.0
50%,28.5,5.0,15.75,1.0
75%,39.0,7.0,34.375,2.0
max,80.0,9.0,512.3292,9.0


- ### **pandas.cut**
    - `pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise', ordered=True)`
    - Bin values into discrete intervals.
    - Use _cut_ when you need to segment and sort data values into bins. This function is also useful for going from a continuous variable to a categorical variable.
    - For example, _cut_ could convert ages to groups of age ranges. Supports binning into an equal number of bins, or a pre-specified array of bins. Out-of-bounds values will be NA in the resulting Series or Categorical object.
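
The `right` and `include_lowest` parameters decide which edge of each interval is closed, and therefore which values end up as NA. A minimal sketch on made-up ages that sit exactly on the edges:

```python
import pandas as pd

ages = pd.Series([0.42, 14.0, 25.0, 80.0])
edges = [0.42, 14.0, 25.0, 80.0]

# Default: right-closed intervals (a, b]; the lowest edge itself is excluded.
default_cut = pd.cut(ages, bins=edges)
# include_lowest=True pulls the lowest edge into the first interval.
with_lowest = pd.cut(ages, bins=edges, include_lowest=True)
# right=False: left-closed intervals [a, b); the highest edge is excluded instead.
left_closed = pd.cut(ages, bins=edges, right=False)

print(pd.DataFrame({
    "age": ages,
    "default": default_cut,
    "include_lowest": with_lowest,
    "right=False": left_closed,
}))
```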

In [27]:
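# pd.cut uses right-closed intervals (a, b] by default and excludes the lowest edge,
# whereas KBinsDiscretizer places a value that sits on an edge into the bin to its right.
# The labels can therefore differ from age_trf / fare_trf for values on an edge (e.g. age 25.0).
dfn['age_labels'] = pd.cut(x=X_train['Age'], bins=age_label)
dfn['fare_labels'] = pd.cut(x=X_train['Fare'], bins=fare_label)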
dfn.head(10)

Unnamed: 0,age,age_trf,fare,fare_trf,age_labels,fare_labels
328,31.0,5.0,20.525,1.0,"(28.5, 32.0]","(11.694, 22.367]"
73,26.0,4.0,14.4542,1.0,"(25.0, 28.5]","(11.694, 22.367]"
253,30.0,5.0,16.1,1.0,"(28.5, 32.0]","(11.694, 22.367]"
719,33.0,6.0,7.775,0.0,"(32.0, 36.0]","(0.0, 11.694]"
666,25.0,4.0,13.0,1.0,"(22.0, 25.0]","(11.694, 22.367]"
30,40.0,7.0,27.7208,2.0,"(36.0, 42.0]","(22.367, 42.192]"
287,22.0,3.0,7.8958,0.0,"(19.0, 22.0]","(0.0, 11.694]"
217,42.0,8.0,27.0,2.0,"(36.0, 42.0]","(22.367, 42.192]"
797,31.0,5.0,8.6833,0.0,"(28.5, 32.0]","(0.0, 11.694]"
371,18.0,1.0,6.4958,0.0,"(14.0, 19.0]","(0.0, 11.694]"


In [28]:
dfn.age_trf.value_counts()

4.0    65
3.0    65
7.0    60
9.0    60
8.0    58
6.0    55
1.0    54
0.0    54
5.0    53
2.0    47
Name: age_trf, dtype: int64

In [29]:
dfn.age_trf.unique()

array([5., 4., 6., 7., 3., 8., 1., 0., 2., 9.])

In [30]:
dfn.age_labels.value_counts()

(14.0, 19.0]    68
(28.5, 32.0]    68
(0.42, 14.0]    58
(22.0, 25.0]    58
(32.0, 36.0]    56
(36.0, 42.0]    54
(42.0, 50.0]    54
(19.0, 22.0]    52
(50.0, 80.0]    52
(25.0, 28.5]    50
Name: age_labels, dtype: int64

In [31]:
dfn.fare_trf.value_counts()

0.0    212
2.0    122
1.0    114
4.0     45
3.0     37
5.0     17
6.0      9
8.0      8
7.0      6
9.0      1
Name: fare_trf, dtype: int64

In [32]:
dfn.fare_labels.value_counts()

(0.0, 11.694]         206
(22.367, 42.192]      122
(11.694, 22.367]      114
(67.873, 100.49]       45
(42.192, 67.873]       37
(100.49, 137.405]      17
(137.405, 185.674]      9
(237.867, 385.652]      8
(185.674, 237.867]      6
(385.652, 512.329]      1
Name: fare_labels, dtype: int64

In [33]:
dtc = DecisionTreeClassifier()
dtc.fit(X_train_trf, y_train)
y_pred_trf = dtc.predict(X_test_trf)
accuracy_score(y_test, y_pred_trf)*100

62.23776223776224

In [34]:
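X_trf = trf.fit_transform(X)
# Note: this call scores the untransformed X, so the result below does not reflect binning.
# Pass X_trf instead of X to evaluate the binned features.
np.mean(cross_val_score(dtc, X, y, cv=10, scoring='accuracy'))*100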

63.45461658841941
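
One way to cross-validate with binning is to put the transformer and the model in a single pipeline, so that the bin edges are re-learned on the training part of every fold. The sketch below is an assumption about how this could be set up, not the notebook's original method. It runs on synthetic stand-in data (`X_demo`, `y_demo` are hypothetical) so it runs without `train.csv`.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Age and Fare columns and a made-up binary target.
rng = np.random.default_rng(42)
X_demo = np.column_stack([rng.uniform(1, 80, 300), rng.exponential(30, 300)])
y_demo = (X_demo[:, 0] < 18).astype(int)

binner = ColumnTransformer([
    ("age", KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile"), [0]),
    ("fare", KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="kmeans"), [1]),
])

# The pipeline fits the binner inside each fold, then trains the tree on the binned data.
model = make_pipeline(binner, DecisionTreeClassifier(random_state=42))
scores = cross_val_score(model, X_demo, y_demo, cv=10, scoring="accuracy")
print(scores.mean() * 100)
```

On the real data, the same pipeline could be passed to `cross_val_score` together with `X` and `y.values.ravel()`.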

## **<ins style="color:blue">Binarization</ins>**
### **Custom / Domain Based Binning**
- [0-18] : Young
- [18-60] : Adult
- [60-100] : Old
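
Domain-based bins like these can be applied with `pandas.cut` by passing explicit edges and labels. In this sketch the ages are made up, and `right=False` is an assumption that resolves the overlap at 18 and 60 by assigning those boundary ages to the older group.

```python
import pandas as pd

ages = pd.Series([5, 18, 35, 60, 72])

# Left-closed intervals: [0, 18), [18, 60), [60, 100)
age_group = pd.cut(
    ages,
    bins=[0, 18, 60, 100],
    labels=["Young", "Adult", "Old"],
    right=False,
)

print(age_group.tolist())  # ['Young', 'Adult', 'Adult', 'Old', 'Old']
```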

### **<ins style="color:maroon">Binarization</ins>**
- Converts each value to 0 or 1: values greater than the threshold become 1, all other values become 0.
- Takes two parameters: 1. threshold, 2. copy=True/False
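
A minimal sketch of the threshold rule on a made-up family-size column: with `threshold=0`, any passenger with at least one family member is mapped to 1, and passengers travelling alone are mapped to 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Made-up family sizes (SibSp + Parch) for five passengers.
family = np.array([[0.0], [1.0], [0.0], [5.0], [2.0]])

# Values strictly greater than the threshold become 1, the rest become 0.
has_family = Binarizer(threshold=0).fit_transform(family)

print(has_family.ravel())
```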

In [35]:
df_old = pd.read_csv("../data/train.csv")
df_old

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [36]:
df = pd.read_csv("../data/train.csv")[['Age', 'Fare', 'SibSp', 'Parch', 'Survived']]
df.dropna(inplace=True)
df.head()

Unnamed: 0,Age,Fare,SibSp,Parch,Survived
0,22.0,7.25,1,0,0
1,38.0,71.2833,1,0,1
2,26.0,7.925,0,0,1
3,35.0,53.1,1,0,1
4,35.0,8.05,0,0,0


In [37]:
df['Family'] = df['SibSp'] + df['Parch']
df.head()

Unnamed: 0,Age,Fare,SibSp,Parch,Survived,Family
0,22.0,7.25,1,0,0,1
1,38.0,71.2833,1,0,1,1
2,26.0,7.925,0,0,1,0
3,35.0,53.1,1,0,1,1
4,35.0,8.05,0,0,0,0


In [38]:
df.drop(columns=['SibSp', 'Parch'], inplace=True)

In [39]:
X = df.drop(columns=['Survived'])
y = df[['Survived']]

In [40]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.sample(7)

Unnamed: 0,Age,Fare,Family
309,30.0,56.9292,0
506,33.0,26.0,2
373,22.0,135.6333,0
302,19.0,0.0,0
824,2.0,39.6875,5
84,17.0,10.5,0
565,24.0,24.15,2


### **<ins style="color:red">Without Binarization</ins>**

In [41]:
## Without Binarization
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
y_pred = dtc.predict(X_test)
accuracy_score(y_test, y_pred)*100

63.63636363636363

In [42]:
np.mean(cross_val_score(dtc, X, y, cv=10, scoring='accuracy'))*100

64.01017214397497

### **<ins style="color:red">With Binarization</ins>**

In [43]:
# Applying Binarization
from sklearn.preprocessing import Binarizer

In [44]:
trf = ColumnTransformer([
    ('bin', Binarizer(copy=False), ['Family'])],
    remainder='passthrough')
# copy=True  : binarize a copy and leave the input array untouched
# copy=False : binarize in place where possible, avoiding the extra copy

In [45]:
X_train_trf = trf.fit_transform(X_train)
X_test_trf = trf.transform(X_test)

In [46]:
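# ColumnTransformer puts the transformed column (Family) first, followed by the passthrough columns.
X_train_trf = pd.DataFrame(X_train_trf, columns=['Family', 'Age', 'Fare'])
X_train_trf.head(7)

Unnamed: 0,Family,Age,Fare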
0,1.0,31.0,20.525
1,1.0,26.0,14.4542
2,1.0,30.0,16.1
3,0.0,33.0,7.775
4,0.0,25.0,13.0
5,0.0,40.0,27.7208
6,0.0,22.0,7.8958
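
`ColumnTransformer` places the columns it transforms first and appends the `remainder='passthrough'` columns after them, so the output order is not the input order. The sketch below rebuilds the column labels accordingly on a two-row stand-in frame (`X_demo` and its values are illustrative).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Binarizer

# Two-row stand-in with the same column order as X_train: Age, Fare, Family.
X_demo = pd.DataFrame(
    {"Age": [31.0, 26.0], "Fare": [20.525, 14.4542], "Family": [1, 0]},
    index=[328, 73],
)

trf_demo = ColumnTransformer(
    [("bin", Binarizer(), ["Family"])],
    remainder="passthrough",
)
out = trf_demo.fit_transform(X_demo)

# Transformed columns come first, passthrough columns keep their original relative order.
transformed_cols = ["Family"]
remainder_cols = [c for c in X_demo.columns if c not in transformed_cols]
out_df = pd.DataFrame(out, columns=transformed_cols + remainder_cols, index=X_demo.index)

print(out_df)
```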


In [47]:
## With Binarization
dtc = DecisionTreeClassifier()
dtc.fit(X_train_trf, y_train)
y_pred_trf = dtc.predict(X_test_trf)
accuracy_score(y_test, y_pred_trf)*100



60.83916083916085

In [48]:
# Note: X here is the untransformed feature matrix, so this score does not reflect binarization.
np.mean(cross_val_score(dtc, X, y, cv=10, scoring='accuracy'))*100

65.1310641627543

In [49]:
np.mean(cross_val_score(dtc, X_train, y_train, cv=10, scoring='accuracy'))*100

66.02843315184514

In [50]:
np.mean(cross_val_score(dtc, X_train_trf, y_train, cv=10, scoring='accuracy'))*100

63.929219600725965