## 1. Discretization/Binning
Binning groups a range of more or less continuous values into a smaller number of "bins".
#### Types (most commonly used)
* equal width binning (uniform)
* equal frequency binning (quantile)
* Kmeans binning
* Decision tree binning
* custom/domain-based binning (we define the bin ranges ourselves from domain knowledge; `KBinsDiscretizer` has no strategy for this, so it has to be done by hand)
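
As a rough illustration of custom binning, the sketch below uses `pd.cut` with hand-picked edges. The readings, bin edges, and labels are all hypothetical and only show the mechanics; real edges should come from domain knowledge.

```python
import pandas as pd

# Hypothetical alcohol readings (% by volume); not taken from the wine dataset.
alcohol = pd.Series([8.5, 9.4, 10.2, 11.0, 12.4, 13.0, 14.2])

# Hand-picked bin edges and labels: illustrative assumptions only.
edges = [0, 10, 12, 100]
labels = ["low", "medium", "high"]

# right=False makes each bin closed on the left: [0, 10), [10, 12), [12, 100).
alcohol_binned = pd.cut(alcohol, bins=edges, labels=labels, right=False)

print(alcohol_binned.tolist())
```

Values outside the outermost edges would become `NaN`, so the first and last edge should cover the full range of the column.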

In [186]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import KBinsDiscretizer


In [187]:

df=pd.read_csv('winequalityN.csv',usecols=['type','pH','alcohol'])
df.sample(5)


| | type | pH | alcohol |
|---|---|---|---|
| 3454 | white | 3.26 | 12.4 |
| 5376 | red | 3.14 | 10.2 |
| 2922 | white | 3.13 | 13.0 |
| 3991 | white | 2.99 | 9.2 |
| 3040 | white | 3.2 | 10.8 |


In [188]:
df.dropna(inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6488 entries, 0 to 6496
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   type     6488 non-null   object 
 1   pH       6488 non-null   float64
 2   alcohol  6488 non-null   float64
dtypes: float64(2), object(1)
memory usage: 202.8+ KB


In [189]:
x=df.iloc[:,1:3]   # columns 1 and 2: pH and alcohol
y=df.iloc[:,0]

x.sample(5)

| | pH | alcohol |
|---|---|---|
| 4257 | 2.96 | 12.0 |
| 5307 | 3.05 | 9.6 |
| 2767 | 3.12 | 9.5 |
| 2885 | 3.15 | 12.8 |
| 2877 | 2.97 | 9.3 |


In [190]:
xtrain,xtest,ytrain,ytest=train_test_split(x,y,test_size=0.2,random_state=42)

dtc=DecisionTreeClassifier()
dtc.fit(xtrain,ytrain)

ypred=dtc.predict(xtest)

accuracy_score(ytest,ypred)


0.7380585516178737

In [191]:
np.mean(cross_val_score(dtc,x,y,cv=10,scoring='accuracy'))

0.6878906770149709

In [192]:
kb_ph=KBinsDiscretizer(n_bins=10,encode='ordinal',strategy='quantile')
kb_al=KBinsDiscretizer(n_bins=10,encode='ordinal',strategy='quantile')

# for uniform or k-means binning, only the strategy changes: strategy='uniform' or strategy='kmeans'
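
To see what the `strategy` argument changes, the following sketch fits `KBinsDiscretizer` on a small, deliberately skewed toy column (hypothetical values, not the wine data) and compares the learned bin edges for `'uniform'` and `'quantile'`. `'kmeans'` is passed through the same parameter.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Skewed toy column (hypothetical values), shaped (n_samples, 1).
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float).reshape(-1, 1)

edges = {}
for strategy in ("uniform", "quantile"):
    kb = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    kb.fit(values)
    # bin_edges_ holds one array of edges per input column.
    edges[strategy] = kb.bin_edges_[0]
    print(strategy, edges[strategy])
```

With `'uniform'` the edges are evenly spaced between the minimum and maximum, so the outlier at 100 leaves most points in the first bin; with `'quantile'` the edges follow the data so that each bin holds roughly the same number of points.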

In [193]:
trf=ColumnTransformer([
    ('first',kb_ph,[0]),
    ('second',kb_al,[1])
])

In [194]:
xtrain_trf=trf.fit_transform(xtrain)
xtest_trf=trf.transform(xtest)

dtc.fit(xtrain_trf,ytrain)

ypred=dtc.predict(xtest_trf)

accuracy_score(ytest,ypred)

# on this split, accuracy after binning (0.7666) is higher than before binning (0.7381)

0.7665639445300462

In [195]:
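# note: x is still the raw, unbinned data here, so this score does not measure the effect of binning;
# to evaluate the binned features, cross-validate a Pipeline that contains trf
np.mean(cross_val_score(dtc,x,y,cv=10,scoring='accuracy'))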

0.6886610930396241
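
One way to cross-validate with binning applied is to wrap the transformer and the classifier in a `Pipeline`, so the bins are re-fitted on the training part of every fold. This is a minimal sketch on synthetic data; `X_demo`, `y_demo`, `binner`, and `pipe` are hypothetical names, and the score it prints says nothing about the wine dataset.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for two numeric features and a binary label.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Same layout as in the notebook: one quantile binner per column.
binner = ColumnTransformer([
    ("first", KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile"), [0]),
    ("second", KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile"), [1]),
])

# Binning happens inside the pipeline, so each fold fits its own bin edges.
pipe = Pipeline([
    ("bin", binner),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

scores = cross_val_score(pipe, X_demo, y_demo, cv=10, scoring="accuracy")
print(scores.mean())
```

Keeping the binner inside the pipeline also avoids learning bin edges from rows that later serve as validation data.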

## 2. Binarization
In discretization we convert continuous values into discrete values; in binarization we convert continuous values into binary values (0, 1).
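
As a small illustration, scikit-learn's `Binarizer` maps every value above a threshold to 1 and everything else to 0. The `family_sizes` array below is made up for the example.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical family counts, shaped (n_samples, 1).
family_sizes = np.array([[0], [1], [3], [0], [5]])

# Values strictly greater than the threshold become 1, all others 0.
binarizer = Binarizer(threshold=0.0)
travels_with_family = binarizer.fit_transform(family_sizes)

print(travels_with_family.ravel())
```

The comparison is strict, so a value exactly equal to the threshold is mapped to 0.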

In [196]:
ti=pd.read_csv('titanic.txt')[['Survived','SibSp','Parch','Age']]

In [197]:
ti.dropna(inplace=True)

In [198]:
ti['family']=ti['SibSp']+ti['Parch']
ti

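| | Survived | SibSp | Parch | Age | family |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 22.0 | 1 |
| 1 | 1 | 1 | 0 | 38.0 | 1 |
| 2 | 1 | 0 | 0 | 26.0 | 0 |
| 3 | 1 | 1 | 0 | 35.0 | 1 |
| 4 | 0 | 0 | 0 | 35.0 | 0 |
| ... | ... | ... | ... | ... | ... |
| 885 | 0 | 0 | 5 | 39.0 | 5 |
| 886 | 0 | 0 | 0 | 27.0 | 0 |
| 887 | 1 | 0 | 0 | 19.0 | 0 |
| 889 | 1 | 0 | 0 | 26.0 | 0 |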


In [199]:
ti.drop(columns=['SibSp','Parch'],inplace=True)


In [200]:
a = ti.iloc[:,1:3]
b = ti.iloc[:,0]

In [201]:
xtrain,xtest,ytrain,ytest=train_test_split(a,b,test_size=0.2,random_state=42)

In [202]:
# before applying binarization

clf = DecisionTreeClassifier()

clf.fit(xtrain,ytrain)

ypred = clf.predict(xtest)

print(accuracy_score(ytest,ypred))
print(np.mean(cross_val_score(DecisionTreeClassifier(),a,b,cv=10,scoring='accuracy')))


0.5734265734265734
0.5924491392801252


In [208]:
# now will apply binarization
# family becomes 1 when the family count is greater than or equal to 1, and 0 when the family count is 0

from sklearn.preprocessing import Binarizer

tr = ColumnTransformer([
    ('bin',Binarizer(copy=False),['family'])
],remainder='passthrough')


In [211]:
X_traintrf = tr.fit_transform(xtrain)
X_testtrf = tr.transform(xtest)

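# the binarized 'family' column comes first; passthrough columns (Age) follow it
pd.DataFrame(X_traintrf,columns=['family','Age'])

| | family | Age |
|---|---|---|
| 0 | 1.0 | 31.0 |
| 1 | 1.0 | 26.0 |
| 2 | 1.0 | 30.0 |
| 3 | 0.0 | 33.0 |
| 4 | 0.0 | 25.0 |
| ... | ... | ... |
| 566 | 1.0 | 46.0 |
| 567 | 0.0 | 25.0 |
| 568 | 0.0 | 41.0 |
| 569 | 1.0 | 33.0 |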
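
With `remainder='passthrough'`, `ColumnTransformer` puts the transformed columns first and appends the untouched columns after them, which affects how the output should be labelled. The sketch below checks this on a tiny made-up frame; `demo`, `tr_demo`, and `out_df` are hypothetical names.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Binarizer

# Tiny hypothetical frame with the same column order as the notebook's features.
demo = pd.DataFrame({"Age": [31.0, 26.0, 33.0], "family": [2, 1, 0]})

tr_demo = ColumnTransformer(
    [("bin", Binarizer(), ["family"])],
    remainder="passthrough",
)
out = tr_demo.fit_transform(demo)

# Output order: binarized 'family' first, then the passthrough 'Age' column.
out_df = pd.DataFrame(out, columns=["family", "Age"])
print(out_df)
```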


In [214]:
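# note: this is cross-validated on the training split only, while the baseline above used the full data (a, b)
np.mean(cross_val_score(DecisionTreeClassifier(),X_traintrf,ytrain,cv=10,scoring='accuracy'))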

0.5955232909860859
