# Discretizing Numerical Data and Collapsing Categories
## An introduction to binning numerical data and combining categorical data
***

In this article, you will learn simple transformation techniques for numerical and categorical data to help make your data easier to analyze, including:

* What it means to bin numerical data and why it’s important
* How to bin numerical data in Python
* Why combining categorical data can be useful for analysis/visualization
* How to combine categorical data in Python

### Binning Data

As a data analyst or scientist, you will sometimes deal with numerical data that you would like to arrange into different categories. For example, imagine you are looking at the ages of a set of individuals. You might care more about the age category each person falls under (20-29, 30-39, 40-49, etc.) than the exact age. The process of transforming numerical variables into categorical counterparts is called “binning.”

Binning is a way to group a number of continuous values into a smaller number of “bins”. We see this in the real world quite often. For example, a student’s letter grade is determined by the percentage “bin” they fall under. Moreover, an individual’s tax rate is determined by the income “bin” they fall under.
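To make the letter grade example concrete, here is a minimal sketch using `pd.cut()`, which we cover in more detail later in this article. The scores and grade cutoffs below are made up for illustration; real grading schemes will differ.

```python
import pandas as pd

# Hypothetical percentage scores (invented for this illustration)
scores = pd.Series([95, 82, 67, 74, 58])

# Assumed grade cutoffs; right=False makes each bin include its lower edge,
# so a score of exactly 90 would land in [90, 101) and earn an 'A'
grade_bins = [0, 60, 70, 80, 90, 101]
grade_labels = ['F', 'D', 'C', 'B', 'A']

letter_grades = pd.cut(scores, bins=grade_bins, labels=grade_labels, right=False)
print(letter_grades.tolist())  # ['A', 'B', 'D', 'C', 'F']
```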

### Why bin data?

Binning data can be useful in many situations, whether it be to create more appealing visualizations or to improve a machine learning model.

In areas like machine learning, binning can sometimes improve the accuracy of predictive models by reducing noise in the data. For example, if you’re a machine learning engineer working at an investment firm, you might use binning as a technique to improve your stock prediction models by smoothing the data to reduce the impact of small, short-term price changes.

Now that you have a better understanding of why we bin data, let’s explore how to bin numerical data in Python!

### Binning Numerical Data in Python

In this example, we will look at the ages of students in a dance class (stored in a file called **`dance_class_data.csv`**).

First, we import pandas – a Python library to perform data manipulation and analysis. Then, we load in our data and take a peek at the first 10 rows of the data:

```python
import pandas as pd
 
# Load in data
dance_class = pd.read_csv('dance_class_data.csv')
 
# Print the first 10 rows 
print(dance_class.head(10))
```
The output looks like:

|   | Name           | Gender | Age  | Experience   |
|---|----------------|--------|------|--------------|
| 0 | Chris Shelton  | M      | 23   | beginner     |
| 1 | Douglas Watson | M      | 28   | intermediate |
| 2 | Martha Gomez   | F      | 45   | beginner     |
| 3 | Amos Moore     | M      | 63   | beginner     |
| 4 | Valentina Sen  | F      | 35   | beginner     |
| 5 | Billy Woods    | M      | 53   | advanced     |
| 6 | Oscar Barker   | M      | 43   | intermediate |
| 7 | Marie Sandoval | F      | 23   | beginner     |
| 8 | Nancy Mcbride  | F      | 35   | beginner     |
| 9 | Lindsay Bowen  | F      | 27   | beginner     |


The `dance_class` dataframe has four columns: `Name`, `Gender`, `Age`, and `Experience`. We can find out the data types of these columns by using the `dtypes` property:

```python
print(dance_class.dtypes)
```

The output looks like:

| Name       | object |
|------------|--------|
| Gender     | object |
| Age        | int64  |
| Experience | object |

Notice that the `Age` column is of type `int64`, meaning it holds whole numbers. This is the column we’ll be exploring.

We’re curious to see which age range each student in the class falls under. To do this, we can bin the values to specific age groups. Before we do that, let’s store the age of the students and find out the minimum and maximum ages:

```python
# Store the ages
student_ages = dance_class['Age'] 

# min() method returns the lowest value
print(student_ages.min()) # 23
 
# max() method returns the highest value
print(student_ages.max()) # 65
```

The youngest student in the class is 23 years old and the oldest is 65 years old. We can therefore split the ages roughly by decade, using 20, 30, 40, 50, 60, and 70 as our bin boundaries.

Let’s first create a `bins` variable to store these values:

```python
# Store the boundaries
bins = [20, 30, 40, 50, 60, 70]
```

Next, we can create the bins using **`pd.cut(df['column_name'], bins)`** where `bins` is either:

* An integer specifying the number of evenly spaced bins, or
* A list of bin boundaries
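As a quick sketch of the difference between the two options, the snippet below bins the same small, made-up set of ages both ways. The `ages` values are an assumption for illustration, not the full dance class data.

```python
import pandas as pd

# Small made-up sample of ages (not the full dance class data)
ages = pd.Series([23, 28, 45, 63, 35])

# An integer asks pandas to split the observed range into that many equal-width bins
by_count = pd.cut(ages, 4)

# A list spells out the bin boundaries explicitly
by_edges = pd.cut(ages, [20, 30, 40, 50, 60, 70])

# Compare the intervals pandas created in each case
print(by_count.cat.categories)
print(by_edges.cat.categories)
```

With an integer, the boundaries depend on the minimum and maximum of the data, so they are rarely round numbers. Passing a list gives us full control over where each bin starts and ends.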

Let’s bin the values of the `Age` column using these boundaries and store the result in a new column called `binned_age`. Then, let’s print out the first few rows of the data:


```python
# Create new binned_age column that bins the values of the 'Age' column
dance_class['binned_age'] = pd.cut(dance_class['Age'], bins)
 
# Print the first few rows of the data
print(dance_class[['binned_age', 'Age']].head())
```


The output looks like:

|   | binned_age | Age |
|---|----------|-----|
| 0 | (20, 30] | 23  |
| 1 | (20, 30] | 28  |
| 2 | (40, 50] | 45  |
| 3 | (60, 70] | 63  |
| 4 | (30, 40] | 35  |

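Notice how each age now falls under a specific age group. By default, `pd.cut()` creates intervals that exclude the left edge and include the right edge, so `(20, 30]` covers ages 21 through 30. We can also plot our data in a bar graph to get a better visualization: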

```python
import matplotlib.pyplot as plt

# Plot the bar graph of binned ages
dance_class['binned_age'].value_counts().plot(kind='bar')
 
# Label the bar graph 
plt.title('Dance Class Age Distribution')
plt.xlabel('Ages')
plt.ylabel('Count') 
 
# Show the bar graph 
plt.show()
```

<img src=https://static-assets.codecademy.com/Paths/data-analyst-career-path/discretizing-data/dance_class_dist.png width=400>

Notice how the data is categorized into different age groups, allowing us to see the big picture of our data without focusing too much on the precise ages of each individual. Based on this plot, we can see that the class skews toward younger ages.
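One detail worth knowing: `value_counts()` orders its result from most to least frequent, so the bars may not appear in age order. If we would rather keep the bins in their natural order, one option is to chain `sort_index()`. The sketch below uses a small, made-up set of ages to show the difference.

```python
import pandas as pd

# Made-up ages chosen so that the most common bin is not the youngest one
ages = pd.Series([23, 45, 47, 63, 35, 41])
binned = pd.cut(ages, [20, 30, 40, 50, 60, 70])

# Default: most frequent bin first
counts_by_frequency = binned.value_counts()

# sort_index() puts the bins back in age order (empty bins are kept with a count of 0)
counts_by_bin = binned.value_counts().sort_index()

print(counts_by_frequency)
print(counts_by_bin)
```

Calling `.plot(kind='bar')` on `counts_by_bin` would then draw the bars from the youngest group to the oldest.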

If we want, we can also specify labels for our bins using the `labels` argument:

```python
# Store the boundaries
bins = [20, 30, 40, 50, 60, 70]
 
# Store the labels for our bins
age_labels = ['Young Adult', 'Adult', 'Middle Aged', 'Middle-Older Age', 'Senior']
 
# Bin the values of the 'Age' column and specify the labels 
dance_class['binned_age'] = pd.cut(dance_class['Age'], bins, labels = age_labels)
```


Now when we print the first few rows of our data or plot a bar graph of the data, each numeric range that we specified earlier is replaced with the assigned label:

|   | binned_age    | Age |
|---|---------------|-----|
| 0 |  Young Adult  | 23  |
| 1 |  Young Adult  | 28  |
| 2 |  Middle Aged  | 45  |
| 3 |  Senior       | 63  |
| 4 |  Adult        | 35  |

<img src=https://static-assets.codecademy.com/Paths/data-analyst-career-path/discretizing-data/dance_class_dist_labeled.png width=400>
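If we would prefer groups like 20-29 and 30-39, where a 30-year-old belongs to the 30-39 group, one possible approach is to pass `right=False` so that each bin includes its lower boundary instead of its upper one. The ages and the `decade_labels` names below are made up for illustration.

```python
import pandas as pd

# Made-up ages, including one that sits exactly on a boundary (30)
ages = pd.Series([23, 30, 45, 63, 35])
bins = [20, 30, 40, 50, 60, 70]

# Hypothetical labels describing each decade
decade_labels = ['20-29', '30-39', '40-49', '50-59', '60-69']

# Default behavior: intervals like (20, 30], so 30 lands in the first bin
default_bins = pd.cut(ages, bins)

# right=False: intervals like [20, 30), so 30 lands in the second bin
decade_bins = pd.cut(ages, bins, right=False, labels=decade_labels)

print(default_bins.tolist())
print(decade_bins.tolist())  # ['20-29', '30-39', '40-49', '60-69', '30-39']
```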

***
### Combining Categorical Data

As a data analyst or scientist, you will sometimes work with categorical data. Just as a refresher, categorical variables contain label values rather than numeric values. For example,

* A “color” variable with the values: “blue”, “red” and “yellow”.
* An “education” variable with the values: “high school”, “college”, and “no education”.

When dealing with categorical data, you may sometimes want to combine or remove categories for easier analysis. For example, imagine you’re looking at a data set of the most popular sports in a region:

| Sport      | Count |
|------------|-------|
| Basketball | 500   |
| Football   | 400   |
| Baseball   | 8     |
| Tennis     | 7     |
| Cricket    | 4     |
| Sailing    | 3     |

Notice that there are several categories of sports here but a pretty uneven distribution of their occurrences. Basketball and football make up an overwhelming majority of the data, while the other categories only have a few occurrences. Therefore, it may make sense to combine all of the other sports into a category called “Other.”

| Sport      | Count |
|------------|-------|
| Basketball | 500   |
| Football   | 400   |
| Other      | 22    |


### Combining Categories in Python

Let’s explore how to combine categories of categorical data in Python. We’ll be working with a data set called `election_data.csv` to see the most popular candidates for a city council election. This data set has a column called `Vote` that we will use.

First, let’s import our data and see the counts for each of the candidates (categories) in our data:

```python
import pandas as pd
 
# read in data
election_data = pd.read_csv('election_data.csv')
 
# get the counts for each candidate
votes = election_data['Vote'].value_counts()
print(votes)
```


The output looks like:

| Liliana | 1067 |
|---------|------|
| John    | 998  |
| William | 494  |
| Emilie  | 196  |
| Pattie  | 6    |
| Neil    | 3    |
| Bob     | 2    |
| Demi    | 1    |
| David   | 1    |
| Hester  | 1    |


What do you notice about the data? Are there specific candidates we should focus more on and others who we should focus less on?

You may notice that `Liliana`, `John`, and `William` make up most of the data while the other candidates make up a smaller percentage of the overall vote count. If we’re trying to understand the likelihood that each of these candidates is elected in an upcoming vote, it makes sense to collapse the other candidates into a category called “Other.” If we decided not to do this, then our analysis might be more difficult to parse. For example, if we were to plot this data in a pie chart, it would look like this:

<img src=https://static-assets.codecademy.com/Paths/data-analyst-career-path/discretizing-data/pie_chart_election.png 
     width=400>

That isn’t a pleasant pie chart to look at. Liliana, John, and William make up 92% of the data, while the other seven categories make up just 8% of the data. Once we combine those seven categories in an “Other” category, the pie chart looks like:

<img src=https://static-assets.codecademy.com/Paths/data-analyst-career-path/discretizing-data/pie_chart_election_collapsed.png 
     width=400>
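The percentages quoted above can be checked directly from the vote counts. Here is a minimal sketch that rebuilds the counts by hand instead of loading the CSV file; the variable names are our own.

```python
import pandas as pd

# Vote counts copied from the value_counts() output shown earlier
votes = pd.Series({
    'Liliana': 1067, 'John': 998, 'William': 494, 'Emilie': 196, 'Pattie': 6,
    'Neil': 3, 'Bob': 2, 'Demi': 1, 'David': 1, 'Hester': 1,
})

# Convert raw counts to percentages of all votes
vote_share = votes / votes.sum() * 100

front_runners = ['Liliana', 'John', 'William']
top_share = vote_share[front_runners].sum()
other_share = vote_share.drop(front_runners).sum()

print(f'Top three: {top_share:.0f}%, everyone else: {other_share:.0f}%')
```

This prints roughly 92% for the top three candidates and 8% for everyone else.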

How can we do this in Python? We can create a mask for the candidates who appear fewer than a specific number of times in `votes`. In our case, we’ll check for candidates with fewer than 200 votes. We do this using the `isin()` method on the `Vote` column:

```python
# True for every row whose candidate received fewer than 200 votes
mask = election_data['Vote'].isin(votes[votes < 200].index)
```

Next, we can label these other categories as “Other” and print the updated vote count:
```python
# Relabel only the 'Vote' column for the masked rows
election_data.loc[mask, 'Vote'] = 'Other'
print(election_data['Vote'].value_counts())
```

| Liliana                  | 1067 |
|--------------------------|------|
| John                     | 998  |
| William                  | 494  |
| Other                    | 210  |
| Name: Vote, dtype: int64 |      |


Notice how much cleaner the data is now that we have combined the seven low-frequency categories into one “Other” category.
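If we expect to collapse categories often, the steps can be wrapped in a small helper. The sketch below is one possible approach, not the only one: `collapse_rare` is a hypothetical function name of our own, and the data is rebuilt from the vote counts shown earlier as a stand-in for `election_data.csv`.

```python
import pandas as pd

def collapse_rare(series, min_count, other_label='Other'):
    """Hypothetical helper: relabel values seen fewer than min_count times."""
    counts = series.value_counts()
    rare_values = counts[counts < min_count].index
    # where() keeps values for which the condition is True and replaces the rest
    return series.where(~series.isin(rare_values), other_label)

# Rebuild one row per vote from the counts shown earlier (stand-in for the CSV file)
vote_counts = {
    'Liliana': 1067, 'John': 998, 'William': 494, 'Emilie': 196, 'Pattie': 6,
    'Neil': 3, 'Bob': 2, 'Demi': 1, 'David': 1, 'Hester': 1,
}
election_data = pd.DataFrame(
    {'Vote': [name for name, n in vote_counts.items() for _ in range(n)]}
)

election_data['Vote'] = collapse_rare(election_data['Vote'], min_count=200)
print(election_data['Vote'].value_counts())
```

Unlike the in-place assignment, `where()` returns a new Series, so the original column is only changed when we assign the result back.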

Remember that the cleaner your data is, the simpler it will be to analyze and visualize, so taking the time to understand your data and apply simple transformations like this when appropriate will go a long way.