# Binning

Binning is a data processing technique that groups data into buckets (bins). This reduces the amount of detail, and with it some of the noise, which simplifies the data and can help reduce overfitting. Both categorical and numerical features can be binned.

### Import Basic Packages & Data

In [2]:
# Basics
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max_columns',None)

## Categorical Binning

In this dataset we have the basic stats for a number of different players. Points per game is the target variable.

In [3]:
#import Categorical binning dataset
df_bball = pd.read_csv('basketball_stats.csv')
df_bball.head()

Unnamed: 0,first_name,last_name,points_per_game,reb_per_game,assist_per_game,3pt_per_game,steals_per_game,blocks_per_game,position
0,Stephen,Larry,25,4,3,7,1.8,0.2,PG
1,Lebron,Games,30,12,11,3,1.3,1.0,SF
2,Grayson,Ballen,10,3,2,2,0.9,0.2,SG
3,Luke,Dontik,35,9,10,4,0.8,0.6,PG
4,Lonjo,Tall,7,6,10,2,1.5,0.6,PG


Categorical variables can be grouped, or binned, into more general categories. In this example, each value in our position column represents the position that player typically plays. However, it's possible to further categorise the positions into Guard, Forward and Centre.
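As a minimal sketch of this idea, the cell below builds a small stand-in for the basketball data and maps each position to a broader bin. The mapping dictionary (including the PF and C entries, which do not appear in the sample rows) and the column name PositionBins are assumptions made for illustration.

```python
import pandas as pd

# Small illustrative sample; the notebook itself would use df_bball.
df = pd.DataFrame({
    "last_name": ["Larry", "Games", "Ballen", "Dontik", "Tall"],
    "position": ["PG", "SF", "SG", "PG", "PG"],
})

# Count how many players fall into each original category.
print(df["position"].value_counts())

# Assumed grouping of the five standard positions into three broader bins.
position_bins = {
    "PG": "Guard", "SG": "Guard",
    "SF": "Forward", "PF": "Forward",
    "C": "Centre",
}

# map() replaces each position with its bin; unmapped values become NaN.
df["PositionBins"] = df["position"].map(position_bins)
print(df["PositionBins"].value_counts())
```

Any position missing from the dictionary would come back as NaN, so it is worth checking for missing values after mapping.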

In [41]:
# Check the value counts for each of the categories in the Position column.


In [40]:
# Create a new column called PositionBins and use the map function to map the positions to their respective bins.


### Categorical Binning - High Cardinality Feature

Binning can also help simplify high cardinality categorical features. This is particularly the case with postcode or zip code data, where you may want to group postal codes into districts, cities or states.
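One possible way to do this, sketched below, is to keep only the leading digits of each zip code. Treating the first three digits as a proxy for a wider area is an assumption for illustration; a proper lookup table from zip code to city or district would usually be a better choice where one is available.

```python
import pandas as pd

# A few zip codes in the style of the house price data.
zipcodes = pd.Series([98178, 98125, 98028, 98136, 98074], name="zipcode")

# Assumption: the first three digits stand in for a broader area.
zip_prefix = zipcodes.astype(str).str[:3]

# Five distinct zip codes collapse into two coarser groups.
print(zip_prefix.value_counts())
```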

## Numerical Binning

For numerical binning, we return to the kc_house_data dataset from previous workbooks.

In [9]:
# Import Data
df_house = pd.read_csv('kc_house_data.csv')
df_house.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living_15neighbors,sqft_lot_15neighbors
0,7129300520,20141013T000000,221900.0,3,1.0,1180,5650,1.0,0,0,3,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,3,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.0,770,10000,1.0,0,0,3,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.0,1960,5000,1.0,0,0,5,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.0,1680,8080,1.0,0,0,3,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


### View & Bin the Original Data using HistPlot

By plotting a simple histogram we can view the original distribution of the feature, and apply a visual interpretation of bins.

In [39]:
# Plot a countplot of the AssistPerGame variable to see what the unaltered distribution looks like.


However, the above method only allows us to visually summarise the data into bins. The bins are not stored anywhere.

### Defining our Bin Names

### Numerical Binning - Equal Width

Equal width binning divides numerical data into buckets, each of which has the same width on the original x axis.
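A minimal sketch of equal width binning with pd.cut is shown below. The sample values and the label names small, medium and large are made up for illustration; in the notebook the input would be df_house['sqft_living'].

```python
import pandas as pd

# Illustrative living-area values; one large house stretches the range.
sqft = pd.Series([770, 1180, 1680, 1960, 2570, 5420], name="sqft_living")

# bins=3 asks pandas for three intervals of equal width across the range.
sqft_bin = pd.cut(sqft, bins=3)
print(sqft_bin.value_counts(sort=False))

# The same call with (hypothetical) names instead of interval labels.
sqft_bin_named = pd.cut(sqft, bins=3, labels=["small", "medium", "large"])
print(sqft_bin_named.value_counts(sort=False))
```

Because the widths are equal, a skewed feature tends to put most observations into the first bin and very few into the last.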

In [38]:
# Plot a histplot showing the sqft_bin variable grouped into 3 bins with equal width.


In [37]:
#Inspect the sqft_bin column, before and after adding the labels to see what is happening. Also try the value_counts().


### Numerical Binning - Equal Frequency

Equal frequency binning divides a numerical feature into buckets, each of which has the same or a similar number of observations.
- In this case, since we have some values that appear many times, they must all appear in the same bin.
- Since the bins each represent a different proportion of the original X scale, we must give each one a name.
- You may come across 'quartile binning' which is simply equal frequency binning using 4 bins.
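The points above can be sketched with pd.qcut as follows. The sample values and label names are assumptions for illustration; in the notebook the input would be df_house['sqft_living'].

```python
import pandas as pd

# Illustrative living-area values.
sqft = pd.Series([770, 1180, 1680, 1960, 2570, 5420], name="sqft_living")

# q=3 splits at the 1/3 and 2/3 quantiles, so each bin holds a similar
# number of observations. The label names are hypothetical.
sqft_bin_freq = pd.qcut(sqft, q=3, labels=["small", "medium", "large"])
print(sqft_bin_freq.value_counts(sort=False))

# Without labels, the intervals show that the bins have unequal widths.
print(pd.qcut(sqft, q=3).cat.categories)
```

When a value repeats so often that two quantile edges coincide, qcut raises an error by default; passing duplicates='drop' merges the affected bins instead.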

In [36]:
# Create a new column in the dataframe and use the qcut function to create 3 equal frequency bins.


In [14]:
#Explore the value counts of the sqft_bin_freq column. How does it differ from the equal width method?


As we can see above, once we converted the values into buckets, the distribution became simpler and smoother. Smoother distributions are easier to interpret and may reduce the noise passed to our models.

### Numerical Binning - Manual Method

Manual binning allows us to define the boundaries of each bin ourselves.

In the above example, our research tells us that the housing tax brackets change at 2000 and at 5000 sqft. Since we know that people consider tax payments when purchasing houses, these boundaries seem a sensible way to bin this variable.



Interval Notation
- (1, 2000) : do NOT include 1, do NOT include 2000
- [1, 2000) : INCLUDE 1, do NOT include 2000
- (1, 2000] : do NOT include 1, INCLUDE 2000
- [1, 2000] : INCLUDE 1, INCLUDE 2000
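The sketch below shows how this notation relates to pd.cut when the bin edges are supplied manually. The edges follow the 2000 and 5000 sqft thresholds used in this section; the sample values and label names are assumptions for illustration.

```python
import pandas as pd

# Values chosen to sit exactly on, and either side of, the boundaries.
sqft = pd.Series([770, 2000, 2570, 5000, 5420], name="sqft_living")

edges = [1, 2000, 5000, 20000]
labels = ["small", "medium", "large"]  # hypothetical bin names

# Default right=True: bins are (1, 2000], (2000, 5000], (5000, 20000].
right_closed = pd.cut(sqft, bins=edges, labels=labels)

# right=False: bins are [1, 2000), [2000, 5000), [5000, 20000).
left_closed = pd.cut(sqft, bins=edges, labels=labels, right=False)

print(pd.DataFrame({"sqft": sqft, "right_closed": right_closed,
                    "left_closed": left_closed}))
```

Only the values sitting exactly on a boundary (2000 and 5000) change bin between the two versions.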

### Applying these methods to testing data

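If we simply apply cut or qcut to our testing data, the bin boundaries are recalculated from the testing data, so we end up with different bins from those used in training (the exception being manual binning, where we supply the boundaries ourselves). We therefore need a way to apply the same binning thresholds to our testing data.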

The implementation with pandas is fairly simple: passing the **retbins=True** argument makes cut return the bin edges as well as the binned values, and we save those edges in the variable training_bins. We can then re-use our training bins as the bins input to the cut function on our testing data. You can read more about it at the below link:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html
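As a self-contained sketch of this workflow, the cell below learns equal frequency bins on a small training series and re-uses the returned edges on a testing series. The sample values, label names and variable names other than training_bins are assumptions for illustration.

```python
import pandas as pd

train = pd.Series([770, 1180, 1680, 1960, 2570, 5420], name="sqft_living")
test = pd.Series([900, 2000, 3000], name="sqft_living")
labels = ["small", "medium", "large"]  # hypothetical bin names

# retbins=True returns the bin edges alongside the binned training values.
train_binned, training_bins = pd.qcut(train, q=3, labels=labels, retbins=True)
print(training_bins)

# Re-use the training edges on the testing data instead of recalculating.
# include_lowest=True keeps a value equal to the lowest edge inside bin one.
test_binned = pd.cut(test, bins=training_bins, labels=labels,
                     include_lowest=True)
print(test_binned)
```

Testing values that fall outside the range seen in training are returned as NaN, so the outer edges may need widening before the bins are re-used.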

In [None]:
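#Note this code will not work by default. It is here to demonstrate syntax only.
# `labels` must already hold three bin names, one per interval.
df_house['sqft_bin_manual'], training_bins = pd.cut(df_house['sqft_living'],
                              bins=[1, 2000, 5000, 20000],
                              labels=labels,
                              retbins=True)
df_house.head()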

In [None]:
#Note this code will not work by default. It is here to demonstrate syntax only.
df_house_test = pd.read_csv('kc_house_data_test.csv')
df_house_test['sqft_bin_manual'] = pd.cut(df_house_test['sqft_living'],
                              bins = training_bins,
                              labels = labels)
df_house_test

### Summary of Binning Methods So Far

In practice, we try not to bin if possible, since binning destroys some of the information in our data. When we do bin, our first choice is to use boundaries justified by domain knowledge, such as the tax brackets in the above example.



### Binning Using SKLearn

In many cases, once we have performed **binning to generate buckets or groups**, we'll need to **encode those categories** using one-hot encoding or label encoding in order to reach a numeric dataset.

Instead of doing each separately, SKLearn's **KBinsDiscretizer** allows us to perform both actions together. This will be demonstrated in a separate workbook.

Documentation for KBinsDiscretizer can be found here:
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html
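Ahead of the separate workbook, here is a brief sketch of what this could look like. The sample values and the choice of three equal width bins with dense one-hot output are assumptions for illustration, not the settings used elsewhere in this course.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# KBinsDiscretizer expects 2D input: one row per observation.
X_train = np.array([[770.0], [1180.0], [1680.0], [1960.0], [2570.0], [5420.0]])
X_test = np.array([[900.0], [2000.0], [3000.0]])

# strategy="uniform" gives equal width bins; "quantile" would give
# equal frequency bins. encode="onehot-dense" one-hot encodes the result.
discretizer = KBinsDiscretizer(n_bins=3, encode="onehot-dense",
                               strategy="uniform")

# The bin edges are learned from the training data only...
train_encoded = discretizer.fit_transform(X_train)
print(discretizer.bin_edges_[0])

# ...and then re-used unchanged on the testing data.
test_encoded = discretizer.transform(X_test)
print(test_encoded)
```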

### Numerical Smoothing using Bin Averages

Smoothing is similar to binning, but instead of turning values into a categorical field, values are replaced by the mean, median or boundary value from their respective bins. Below is an example of bin smoothing using bin means. Note that to apply this to a testing dataset, a little more work would be required to maintain the same means for the testing bins.
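A minimal sketch of smoothing by bin means is shown below, assuming three equal frequency bins on a small made-up sample. The column names sqft_bin and sqft_smoothed are chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({"sqft_living": [770, 1180, 1680, 1960, 2570, 5420]})

# Step 1: assign each row to a bin (three equal frequency bins here).
df["sqft_bin"] = pd.qcut(df["sqft_living"], q=3)

# Step 2: find the average value per bin.
bin_means = df.groupby("sqft_bin", observed=True)["sqft_living"].mean()
print(bin_means)

# Step 3: replace each value with the mean of its own bin.
df["sqft_smoothed"] = (
    df.groupby("sqft_bin", observed=True)["sqft_living"].transform("mean")
)
print(df)
```

Swapping "mean" for "median" in both calls gives smoothing by bin medians instead.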

In [35]:
# Find the average value per bin


# Assigning a new value based on the bin average


### Exercise 1: Binning a Histogram

Below is data of basketball players displaying their jumping attribute.

In [None]:
#import the dunk data training dataset
df_dunk = pd.read_csv('dunk_data.csv')
df_dunk

In [43]:
# Display the unaltered distribution of the height column


In [44]:
# Display a binned version of the above histogram with 5 equal width bins


### Exercise 2 (Advanced): Apply bins to training and testing data using Pandas

Below is data on basketball players and their attributes. Bin the height column into tall and very tall, using a threshold of 2m.

In [None]:
#apply binning with the threshold of 2m to the height column


In [None]:
#import the dunk data testing dataset
df_dunk_test = pd.read_csv('dunk_data_test.csv')

In [None]:
#Apply the same bin boundaries to the testing data
