# Understanding Binning:
___

Let's understand binning with an example: the famous [Titanic dataset on Kaggle](https://www.kaggle.com/c/titanic). <br>
You don't have to download the dataset separately. Just clone the repo and play around with this notebook. <br>
The train.csv file is provided in the repo. <br>

The Titanic problem is a classic, so I won't spoil it for you. <br> I'll just explain one small aspect of data cleaning that you will soon be doing on your own.
#### Binning:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:
df = pd.read_csv('train.csv')
df.head(3)

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


___

Note that this tutorial is only meant to show a real application of binning.<br>
So, apart from dropping the missing ages, we won't deal with missing values or any other specific features of the dataset.

___
Consider only the Age column. It contains the ages of the passengers in the training set. Only 714 of the 891 entries are non-null, so we drop the 177 missing values first.

In [4]:
# Drop the rows with a missing age and convert the column to a NumPy array
ages = df["Age"].dropna().to_numpy()
print("Length of the ages array: ", len(ages))
print(ages)

Length of the ages array:  714
[22.   38.   26.   35.   35.   54.    2.   27.   14.    4.   58.   20.
 39.   14.   55.    2.   31.   35.   34.   15.   28.    8.   38.   19.
 40.   66.   28.   42.   21.   18.   14.   40.   27.    3.   19.   18.
  7.   21.   49.   29.   65.   21.   28.5   5.   11.   22.   38.   45.
  4.   29.   19.   17.   26.   32.   16.   21.   26.   32.   25.    0.83
 30.   22.   29.   28.   17.   33.   16.   23.   24.   29.   20.   46.
 26.   59.   71.   23.   34.   34.   28.   21.   33.   37.   28.   21.
 38.   47.   14.5  22.   20.   17.   21.   70.5  29.   24.    2.   21.
 32.5  32.5  54.   12.   24.   45.   33.   20.   47.   29.   25.   23.
 19.   37.   16.   24.   22.   24.   19.   18.   19.   27.    9.   36.5
 42.   51.   22.   55.5  40.5  51.   16.   30.   44.   40.   26.   17.
  1.    9.   45.   28.   61.    4.    1.   21.   56.   18.   50.   30.
 36.    9.    1.    4.   45.   40.   36.   32.   19.   19.    3.   44.
 58.   42.   24.   28.   34.   45.5  18.   

To put these age entries into different bins, we must make those bins first.

> Make sure you explore each new function that you encounter: in Jupyter, place the cursor inside the function call and press "**Shift**" + "**Tab**" to see its docstring.

In [5]:
bins = np.linspace(0,100,11)
print(bins)

[  0.  10.  20.  30.  40.  50.  60.  70.  80.  90. 100.]


These 11 values are the boundaries (edges) of the bins: 10 bins, each 10 years wide.
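
As a quick check on how edges relate to bins, the sketch below pairs each edge with the next one. The name `intervals` is introduced here only for illustration and is not used elsewhere in the notebook.

```python
import numpy as np

# Same edges as above: 11 evenly spaced values from 0 to 100
bins = np.linspace(0, 100, 11)

# Pair each edge with the next one: 11 edges give 10 intervals
intervals = list(zip(bins[:-1], bins[1:]))

print("Number of edges:", len(bins))
print("Number of bins: ", len(intervals))
print("First bin:", intervals[0])
print("Last bin: ", intervals[-1])
```

This is why `np.linspace` is called with 11 rather than 10: the count refers to edges, not bins.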

> **np.digitize()** returns, for each data point, the index of the bin it falls into. With `right = False`, index `i` means `bins[i-1] <= x < bins[i]`.

In [6]:
binned_ages = np.digitize(ages, bins, right = False)
print("\nData points:\n", ages[:5])
print("\nBin membership for data points:\n", binned_ages[:5])


Data points:
 [22. 38. 26. 35. 35.]

Bin membership for data points:
 [3 4 3 4 4]
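
To see what the `right` argument changes, here is a minimal sketch on a handful of values that sit on or near bin edges. The `sample` array is made up for illustration and is not taken from the dataset.

```python
import numpy as np

bins = np.linspace(0, 100, 11)

# Hypothetical sample values chosen to sit on and around bin edges
sample = np.array([0.83, 9.99, 10.0, 29.5, 30.0, 80.0])

# right=False (the default): index i means bins[i-1] <= x < bins[i]
left_closed = np.digitize(sample, bins, right=False)

# right=True: index i means bins[i-1] < x <= bins[i]
right_closed = np.digitize(sample, bins, right=True)

print(left_closed)   # [1 1 2 3 4 9]
print(right_closed)  # [1 1 1 3 3 8]
```

Only the values that land exactly on an edge (10, 30 and 80) change bins. With `right = False`, a 30-year-old is grouped with ages 30 to 39 rather than 20 to 29.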


In [7]:
print(binned_ages)

[3 4 3 4 4 6 1 3 2 1 6 3 4 2 6 1 4 4 4 2 3 1 4 2 5 7 3 5 3 2 2 5 3 1 2 2 1
 3 5 3 7 3 3 1 2 3 4 5 1 3 2 2 3 4 2 3 3 4 3 1 4 3 3 3 2 4 2 3 3 3 3 5 3 6
 8 3 4 4 3 3 4 4 3 3 4 5 2 3 3 2 3 8 3 3 1 3 4 4 6 2 3 5 4 3 5 3 3 3 2 4 2
 3 3 3 2 2 2 3 1 4 5 6 3 6 5 6 2 4 5 5 3 2 1 1 5 3 7 1 1 3 6 2 6 4 4 1 1 1
 5 5 4 4 2 2 1 5 6 5 3 3 4 5 2 1 4 3 2 5 3 4 3 4 4 3 5 4 4 2 3 6 4 3 2 3 2
 4 3 6 1 3 5 1 2 4 3 3 4 5 3 3 4 6 3 7 4 5 3 4 4 6 1 6 5 4 2 3 6 4 3 5 4 7
 5 1 4 7 3 2 2 4 4 3 5 3 3 2 4 3 3 3 1 6 2 1 2 4 4 3 2 3 3 5 3 3 6 4 5 3 3
 4 3 4 7 4 4 2 5 4 2 3 5 5 5 1 3 3 3 4 3 5 1 5 3 2 3 3 3 4 5 3 5 4 4 7 3 3
 2 2 3 1 3 3 3 2 5 1 4 4 2 1 4 2 4 3 3 3 3 3 4 5 3 3 4 3 3 3 3 4 6 1 3 4 5
 4 2 4 2 3 3 3 2 3 2 4 3 5 2 6 2 3 3 7 4 5 3 3 3 1 2 4 1 6 4 4 5 3 7 6 5 4
 5 5 4 6 1 4 4 3 3 4 3 3 1 1 6 7 3 4 6 4 1 3 6 8 3 6 3 3 2 3 4 2 2 4 3 3 3
 4 6 3 5 4 4 4 4 3 5 5 6 4 3 1 2 4 1 5 4 3 4 1 2 4 6 7 2 4 1 2 3 3 3 7 5 4
 4 5 3 3 2 3 4 7 6 4 2 2 4 4 4 3 4 6 4 2 5 7 3 4 6 5 4 4 5 5 3 5 4 4 4 3 3
 5 4 4 3 4 3 1 3 3 5 3 3 
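
A long list of bin indices is hard to read, so one option is to count how many values fall into each bin. The sketch below uses `np.bincount` on the first twelve ages from the array printed earlier. The names `ages_sample`, `binned` and `counts` are chosen here for illustration.

```python
import numpy as np

bins = np.linspace(0, 100, 11)

# First twelve ages from the array printed earlier
ages_sample = np.array([22, 38, 26, 35, 35, 54, 2, 27, 14, 4, 58, 20], dtype=float)

binned = np.digitize(ages_sample, bins, right=False)

# counts[i] is the number of ages assigned to bin index i;
# minlength keeps a slot for every index from 0 to len(bins) - 1
counts = np.bincount(binned, minlength=len(bins))

for i in range(1, len(bins)):
    low, high = bins[i - 1], bins[i]
    print(f"bin {i}: [{low:.0f}, {high:.0f}) -> {counts[i]} passengers")
```

Running the same counting on the full `binned_ages` array would give the age distribution over all 714 passengers with a known age.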

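Each age is now replaced by the index of its 10-year bin (1 for ages below 10, 2 for ages 10 to 19, and so on). This turns a continuous feature into a small set of categories, which is often **much more useful** for the classification problem.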
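
To use the bins as a feature, they have to line up with the DataFrame rows, and `binned_ages` is shorter than `df` because the missing ages were dropped. One possible approach, not used in this notebook, is `pd.cut`, which bins a column in place and leaves missing values as `NaN`. The sketch below assumes a tiny stand-in DataFrame (`df_demo`, with made-up rows) and a hypothetical new column name `AgeBin`.

```python
import numpy as np
import pandas as pd

# Small stand-in for the Titanic DataFrame (hypothetical rows; one age missing)
df_demo = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 2.0, 54.0]})

bins = np.linspace(0, 100, 11)

# right=False mirrors np.digitize(..., right=False): each bin is [low, high).
# labels=False returns integer bin codes starting at 0, so add 1 to match
# the 1-based indices that np.digitize produces.
df_demo["AgeBin"] = pd.cut(df_demo["Age"], bins=bins, right=False, labels=False) + 1

print(df_demo)
```

The row with the missing age keeps a `NaN` bin, so how to fill it remains a separate decision.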