# Categorical Map and Training

### Initial Loading and Cleaning

Let's take another look at our SAT data from the last lab.

In [24]:
import pandas as pd
sat_df = pd.read_csv('./nyc_hs_sat.csv', index_col = 0)

We'll start by dropping the rows that contain missing values.

In [25]:
dropped_sat_df = sat_df.dropna()

Let's confirm that we no longer have `na` values in our data.

In [26]:
dropped_sat_df.isna().sum()

dbn                    0
name                   0
num_test_takers        0
reading_avg            0
math_avg               0
writing_score          0
boro                   0
total_students         0
graduation_rate        0
attendance_rate        0
college_career_rate    0
dtype: int64
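
To see `dropna` and `isna().sum()` in isolation, here is a minimal sketch on a small made-up frame. The name `toy_df` and all of its values are invented for illustration and are not taken from the SAT file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the real SAT file): two of three rows have a gap
toy_df = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'math_avg': [400.0, np.nan, 450.0],
    'boro': ['M', 'K', None],
})

# Count the missing values in each column before cleaning
missing_before = toy_df.isna().sum()

# dropna() returns a new frame without any row that contains a missing value;
# the original toy_df is left unchanged
toy_dropped = toy_df.dropna()
missing_after = toy_dropped.isna().sum()

print(missing_before)
print(len(toy_dropped), "row(s) left after dropna")
```

Because `dropna` removes a row if any column is missing, it can discard more data than expected, so it is worth comparing the row counts before and after.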

No missing values remain. We still cannot use the `boro` column, though, because its values are text rather than numbers. Let's tackle that in the next section.

### Exploring and Mapping

We currently have three non-numeric columns in our dataset: `dbn`, `name`, and `boro`. There is no easy way to represent `dbn` and `name` as meaningful numbers.

In [27]:
dropped_sat_df[['dbn', 'name']][:2]

|   | dbn | name |
|---|---|---|
| 0 | 01M292 | HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES |
| 1 | 01M448 | UNIVERSITY NEIGHBORHOOD HIGH SCHOOL |


As we can see, the values in those columns differ from row to row. But a little data exploration reveals that the values in the `boro` column are limited to five different values, one for each borough of New York City. A good way to see this is with the `value_counts` method, which is available on a pandas series.

In [28]:
dropped_sat_df['boro'].value_counts()

K    96
X    80
M    77
Q    60
R    10
Name: boro, dtype: int64

```
M -> Manhattan
Q -> Queens
X -> Bronx
K -> Brooklyn
R -> Staten Island
```
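
A quick way to tell identifier-like columns from categorical ones is to compare the number of distinct values per column. The sketch below uses a small invented frame (`toy_df` and its rows are hypothetical), together with `nunique` and `value_counts`.

```python
import pandas as pd

# Hypothetical toy data: 'dbn' is unique per row, 'boro' repeats
toy_df = pd.DataFrame({
    'dbn': ['01M292', '01M448', '02K001', '03X123'],
    'boro': ['M', 'M', 'K', 'X'],
})

# Number of distinct values in each column
unique_counts = toy_df.nunique()

# How often each borough letter appears
boro_counts = toy_df['boro'].value_counts()

print(unique_counts)
print(boro_counts)
```

A column with as many distinct values as rows behaves like an identifier, while a column with only a handful of repeated values is a candidate for mapping to numbers.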

We cannot use these strings to train our model, but we can have each of them correspond to a number. Our decision tree can then split on whether the column's value is above or below some threshold. This is an effective technique for using a decision tree with categorical data.
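
The sketch below illustrates such a split on invented data: the feature is an integer code from 0 to 4, and the made-up target is higher for codes 3 and 4. The arrays `codes` and `target` are assumptions for illustration only. Keep in mind that integer codes impose an arbitrary order on the categories, so a single threshold can only separate groups that happen to sit next to each other in that order.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical integer codes for a categorical column (one row per code)
codes = np.array([[0], [1], [2], [3], [4]])
# Made-up target values: codes 3 and 4 have a higher value
target = np.array([10.0, 10.0, 10.0, 20.0, 20.0])

# A depth-1 tree makes exactly one split
stump = DecisionTreeRegressor(max_depth=1, random_state=0)
stump.fit(codes, target)

# The root node stores the threshold it chose for the split
root_threshold = stump.tree_.threshold[0]
predictions = stump.predict(np.array([[1], [4]]))

print(root_threshold)
print(predictions)
```

Rows with a code at or below the threshold go to one branch and the rest go to the other, which is how the tree separates groups of categories.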

Now we just need to convert each of the five letters to a corresponding number. We can do so with the `map` method of a pandas series. Here's how:

In [29]:
mapping = {'M': 0, 'Q': 1, 'X': 2, 'K': 3, 'R': 4}
mapped_borough = dropped_sat_df['boro'].map(mapping)

Let's see what this did.

In [30]:
mapped_borough.value_counts()

3    96
2    80
0    77
1    60
4    10
Name: boro, dtype: int64

In [31]:
mapped_borough[:3]

0    0
1    0
2    0
Name: boro, dtype: int64

We passed `map` a dictionary, and `map` replaced each value that matched a key in the dictionary with the corresponding value, here a number.
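
One thing to watch for: a value that does not match any key in the dictionary becomes `NaN`. The sketch below shows this on a small invented series, where `'Z'` is a hypothetical code that is missing from the mapping.

```python
import pandas as pd

mapping = {'M': 0, 'Q': 1, 'X': 2, 'K': 3, 'R': 4}

# Hypothetical series; 'Z' is not a key in the mapping
toy_boro = pd.Series(['M', 'K', 'Z'])

mapped = toy_boro.map(mapping)

# Values without a matching key come back as NaN
unmapped_count = mapped.isna().sum()

print(mapped)
print(unmapped_count, "value(s) had no matching key")
```

Checking `isna().sum()` after a `map` call is a cheap way to confirm that the dictionary covered every value in the column.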

Now we can just replace the original object column with our `mapped_borough` column, and we can use the data in our model.

In [32]:
copied_sat_df = dropped_sat_df.copy()
copied_sat_df['boro'] = mapped_borough

In [33]:
copied_sat_df.boro[:4]

0    0
1    0
2    0
3    0
Name: boro, dtype: int64

Now that the `boro` series is represented by integers, we can include it in training our model. Note that we select from `copied_sat_df`, the dataframe that holds the mapped column.

In [63]:
copied_sat_df.select_dtypes(exclude = ['object']).columns

Index(['num_test_takers', 'reading_avg', 'math_avg', 'writing_score', 'boro',
       'total_students', 'graduation_rate', 'attendance_rate',
       'college_career_rate'],
      dtype='object')

In [64]:
y = copied_sat_df.math_avg

In [65]:
X = copied_sat_df.select_dtypes(exclude = ['object']).drop(columns = ['math_avg'])

In [66]:
from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor()
model.fit(X, y)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')

And we see that our model trains successfully.
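
As a sanity check, the whole flow can be sketched end to end on a few invented rows to confirm that the mapped `boro` column ends up among the features. Everything in `toy_df` is hypothetical, and `select_dtypes(include='number')` is used here as an alternative way to keep only the numeric columns.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy rows standing in for the cleaned SAT data
toy_df = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'boro': ['M', 'Q', 'K', 'M'],
    'reading_avg': [400, 420, 380, 450],
    'math_avg': [410, 430, 390, 460],
})

# Replace the borough letters with integers
mapping = {'M': 0, 'Q': 1, 'X': 2, 'K': 3, 'R': 4}
toy_df['boro'] = toy_df['boro'].map(mapping)

# Keep the numeric columns as features and separate out the target
X = toy_df.select_dtypes(include='number').drop(columns=['math_avg'])
y = toy_df['math_avg']

model = DecisionTreeRegressor(random_state=0)
model.fit(X, y)
predictions = model.predict(X)

print(list(X.columns))
print(predictions)
```

Printing `X.columns` before fitting is a simple way to verify which features the model will actually see.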