# Categorical Map and Training

### Initial Loading and Cleaning

Let's take another look at our SAT data from the last lab.

In [1]:
import pandas as pd
url = "https://raw.githubusercontent.com/analytics-engineering-jigsaw/introductory-pandas/master/2-coercing-data/nyc_hs_sat.csv"
sat_df = pd.read_csv(url, index_col = 0)

We'll start by dropping the rows that contain missing values.

In [2]:
dropped_sat_df = sat_df.dropna()
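
To see what `dropna` does on a small scale, here is a minimal sketch. The `toy_df` dataframe and its values are made up for illustration and are not part of the SAT data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the real SAT file
toy_df = pd.DataFrame({
    "name": ["School A", "School B", "School C"],
    "math_avg": [404.0, np.nan, 519.0],
    "boro": ["M", "K", None],
})

# Count the missing values in each column before dropping anything
missing_before = toy_df.isna().sum()

# dropna() returns a new dataframe and leaves toy_df unchanged;
# by default it removes any row with at least one missing value
toy_dropped = toy_df.dropna()

print(missing_before)
print(toy_dropped)
```

Only the first row survives here, because each of the other two rows is missing a value in one column.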

Let's confirm that we no longer have `na` values in our data.

In [26]:
dropped_sat_df.isna().sum()

dbn                    0
name                   0
num_test_takers        0
reading_avg            0
math_avg               0
writing_score          0
boro                   0
total_students         0
graduation_rate        0
attendance_rate        0
college_career_rate    0
dtype: int64

Looks pretty good.  As we know, we still cannot use the `boro` column because its values are text rather than numbers, but perhaps they could be converted.  Let's tackle that in the next section.

### Exploring and Mapping

We currently have three columns in our dataset that are non-numeric: `dbn`, `name`, and `boro`. There is no easy way to represent `dbn` and `name` as meaningful numbers.

In [27]:
dropped_sat_df[['dbn', 'name']][:2]

| | dbn | name |
|---|---|---|
| 0 | 01M292 | HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES |
| 1 | 01M448 | UNIVERSITY NEIGHBORHOOD HIGH SCHOOL |


As we can see, each of the values in those columns is different.  But a little data exploration reveals that the values in the `boro` column are constrained to five different values, one for each borough of New York City.  A good way to see this is with the `value_counts` method, which is available on a pandas series.

In [28]:
dropped_sat_df['boro'].value_counts()

K    96
X    80
M    77
Q    60
R    10
Name: boro, dtype: int64

```
M -> Manhattan
Q -> Queens
X -> Bronx
K -> Brooklyn
R -> Staten Island
```

We cannot use these strings to train our model, but we can have each of them correspond to a number. Our decision tree model can then split on whether the column's value is greater than some number or not.  This is a simple and often effective technique for using a decision tree with categorical data.
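
To make the idea of splitting on a number concrete, here is a small sketch that uses only pandas. The `code` numbering and the `threshold` value are hypothetical, chosen only to show which boroughs land on each side of a single split.

```python
import pandas as pd

# Hypothetical numbering of the five borough letters
toy = pd.DataFrame({
    "boro": ["M", "Q", "X", "K", "R"],
    "code": [0, 1, 2, 3, 4],
})

# A tree node asks a yes/no question such as "is the code <= 1.5?"
threshold = 1.5  # hypothetical split point
goes_left = toy["code"] <= threshold

# Letters that fall on each side of the split
left_letters = sorted(toy.loc[goes_left, "boro"])
right_letters = sorted(toy.loc[~goes_left, "boro"])

print(left_letters)   # ['M', 'Q']
print(right_letters)  # ['K', 'R', 'X']
```

Because the numbering is arbitrary, a single split can only separate boroughs that sit next to each other in that numbering. A tree can still isolate any one borough by splitting more than once.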

Now we just need to convert each of the five letters to a corresponding number.  We can do so with the `map` method.  Here's how:

In [29]:
mapping = {'M': 0, 'Q': 1, 'X': 2, 'K': 3, 'R': 4}
mapped_borough = dropped_sat_df['boro'].map(mapping)

Let's see what this did.

In [30]:
mapped_borough.value_counts()

3    96
2    80
0    77
1    60
4    10
Name: boro, dtype: int64

In [31]:
mapped_borough[:3]

0    0
1    0
2    0
Name: boro, dtype: int64

So we can see that we provided `map` with a dictionary, and `map` changed each value that matched a key in our dictionary to the corresponding dictionary value, here a number.
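
One thing worth checking is what happens to a value that is not a key in the dictionary. The sketch below uses a made-up series containing the letter `'Z'`, which does not appear in our data, to show that `map` returns `NaN` for it.

```python
import pandas as pd

mapping = {'M': 0, 'Q': 1, 'X': 2, 'K': 3, 'R': 4}

# Hypothetical series with a value ('Z') that is not a key in mapping
toy_boro = pd.Series(['M', 'Z', 'R'])
toy_mapped = toy_boro.map(mapping)

# 'Z' has no matching key, so it becomes NaN and the dtype is float64
print(toy_mapped)

# A quick check we could run before training a model
unmapped_count = toy_mapped.isna().sum()
print(unmapped_count)
```

If the count is greater than zero, the dictionary is probably missing a key or the column contains an unexpected value.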

Now we can just replace the original object column with our `mapped_borough` column, and we can use the data in our model.

In [32]:
copied_sat_df = dropped_sat_df.copy()
copied_sat_df['boro'] = mapped_borough

In [33]:
copied_sat_df.boro[:4]

0    0
1    0
2    0
3    0
Name: boro, dtype: int64

Now the `boro` series is represented by integers.

### Mapping another way

So we just saw one way that we can use `map`: we can pass a dictionary into the `map` method.

In [4]:
boro_original = dropped_sat_df['boro']
boro_original[:2]

0    M
1    M
Name: boro, dtype: object

In [6]:
mapping = {'M': 0, 'Q': 1, 'X': 2, 'K': 3, 'R': 4}
boro_original.map(mapping)[:2]

0    0
1    0
Name: boro, dtype: int64

Another way that we can use `map` is to pass in a function that we want applied to every value in the series.

In [12]:
rates = ['graduation_rate', 'attendance_rate', 'college_career_rate']
dropped_sat_df[rates][:2]

| | graduation_rate | attendance_rate | college_career_rate |
|---|---|---|---|
| 0 | 0.66 | 0.87 | 0.36 |
| 1 | 0.9 | 0.93 | 0.7 |


For example, let's write a function called `multiply_by_100` that we can apply to a column.

In [13]:
def multiply_by_100(val):
    return val * 100

In [15]:
attendance = dropped_sat_df['attendance_rate']
multiplied_attendance = attendance.map(multiply_by_100)
multiplied_attendance[:3]

0    87.0
1    93.0
2    94.0
Name: attendance_rate, dtype: float64

It worked! So essentially, each value in the column was passed through our function.
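
For simple arithmetic like this, pandas can also operate on the whole series at once. The following sketch, using made-up rate values, suggests that the two approaches give the same result. `map` with a function becomes more useful when the logic is not a single arithmetic operation.

```python
import pandas as pd

def multiply_by_100(val):
    return val * 100

# Made-up rates, not taken from the SAT data
toy_rates = pd.Series([0.5, 0.25, 0.75], name="attendance_rate")

# Element by element: each value is passed through the function
via_map = toy_rates.map(multiply_by_100)

# Vectorized: pandas multiplies the whole series in one step
via_arithmetic = toy_rates * 100

print(via_map.equals(via_arithmetic))
```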

### Mapping a dataframe

So we just used map to work on a series, but we can also use map on an entire dataframe.  For example, let's multiply each of our rates by 100.

In [16]:
rates = ['graduation_rate', 'attendance_rate', 'college_career_rate']
rates_df = dropped_sat_df[rates]
rates_df[:2]

| | graduation_rate | attendance_rate | college_career_rate |
|---|---|---|---|
| 0 | 0.66 | 0.87 | 0.36 |
| 1 | 0.9 | 0.93 | 0.7 |


In [19]:
multiplied_rates_df = rates_df.applymap(multiply_by_100)
multiplied_rates_df[:3]

| | graduation_rate | attendance_rate | college_career_rate |
|---|---|---|---|
| 0 | 66.0 | 87.0 | 36.0 |
| 1 | 90.0 | 93.0 | 70.0 |
| 2 | 92.0 | 94.0 | 77.0 |


* Above we do so using the `applymap` method. As of pandas 2.1, `applymap` is deprecated, and you can instead call `map` on an entire dataframe.
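
If the same notebook needs to run on both older and newer pandas versions, one option is to check which method is available. This is only a sketch with a made-up `toy_rates_df`, and the `hasattr` check is one of several ways to handle the difference.

```python
import pandas as pd

def multiply_by_100(val):
    return val * 100

# Made-up rates, not taken from the SAT data
toy_rates_df = pd.DataFrame({
    "graduation_rate": [0.5, 0.25],
    "attendance_rate": [0.75, 1.0],
})

# DataFrame.map exists in pandas 2.1 and later;
# on older versions we fall back to applymap
if hasattr(pd.DataFrame, "map"):
    toy_multiplied = toy_rates_df.map(multiply_by_100)
else:
    toy_multiplied = toy_rates_df.applymap(multiply_by_100)

print(toy_multiplied)
```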

### Wrapping up with numeric data

Ok, so remember that at the beginning we used the `map` method to convert our categorical borough data to numbers.

In [20]:
mapping = {'M': 0, 'Q': 1, 'X': 2, 'K': 3, 'R': 4}
mapped_borough = dropped_sat_df['boro'].map(mapping)

In [21]:
copied_sat_df = dropped_sat_df.copy()
copied_sat_df['boro'] = mapped_borough

One benefit (among others) is that we can now pass this column to our ml models, which generally accept only numeric data.

Let's quickly see this.  

We begin by viewing the columns that are numeric.

In [23]:
copied_sat_df.select_dtypes(exclude = ['object']).columns

Index(['num_test_takers', 'reading_avg', 'math_avg', 'writing_score', 'boro',
       'total_students', 'graduation_rate', 'attendance_rate',
       'college_career_rate'],
      dtype='object')

Then we set the `math_avg` as our y variable, as this is what we are trying to predict.

In [64]:
y = copied_sat_df.math_avg

And we set all of our remaining numeric columns as potentially predictive features of the math average.

In [24]:
X = copied_sat_df.select_dtypes(exclude = ['object']).drop(columns = ['math_avg'])
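
Here is the same feature and target selection on a tiny made-up dataframe. As an assumption of this sketch, it uses `include="number"` rather than `exclude=['object']`. That variant keeps only numeric columns, so it would also leave out other non-numeric types such as datetimes.

```python
import pandas as pd

# Hypothetical toy data, not the real SAT file
toy_df = pd.DataFrame({
    "name": ["School A", "School B", "School C"],
    "boro": [0, 3, 1],
    "math_avg": [404.0, 423.0, 519.0],
    "reading_avg": [355.0, 383.0, 377.0],
})

# Keep only the numeric columns
numeric_df = toy_df.select_dtypes(include="number")

# The target is the column we want to predict; the features are the rest
toy_y = numeric_df["math_avg"]
toy_X = numeric_df.drop(columns=["math_avg"])

print(list(toy_X.columns))  # ['boro', 'reading_avg']
```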

Then we can train an ml model.

In [28]:
# !pip3 install scikit-learn

In [30]:
from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor()
model.fit(X, y)

Our model trains successfully, and below we use it to make predictions.

In [None]:
predictions = model.predict(X)

> There is a bit more to making predictions than just this, though. For instance, here we predict on the same data that we trained on.
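
As one example of what that extra work can look like, a common step is to hold out some rows for testing so that predictions are made on data the model has not seen. The sketch below assumes scikit-learn is installed and uses made-up values for `toy_X` and `toy_y`. The `test_size` and `random_state` settings are arbitrary choices for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Made-up numeric data standing in for our features and target
toy_X = pd.DataFrame({
    "boro": [0, 1, 2, 3, 4, 0, 1, 2, 3, 4],
    "reading_avg": [355, 383, 377, 414, 390, 332, 405, 438, 394, 409],
})
toy_y = pd.Series([404, 423, 402, 401, 433, 557, 395, 418, 385, 411],
                  name="math_avg")

# Hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=0
)

# Train only on the training rows
toy_model = DecisionTreeRegressor(random_state=0)
toy_model.fit(X_train, y_train)

# R^2 on the rows the tree has already seen
train_score = toy_model.score(X_train, y_train)

# Predictions for rows the tree has not seen
test_predictions = toy_model.predict(X_test)

print(train_score)
print(test_predictions)
```

An unrestricted decision tree can memorize its training rows, so a high training score on its own says little about how well the model predicts new schools.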