# Coercing Data

### Introduction

In our last lab, we were able to gather data from a csv file and use the data to train a machine learning model. However, we were constrained to using only features that were preformatted as numbers. This stopped us from using our `genre` column as a feature, even though it would have been interesting to discover how predictive genre is of movie revenue.

In this lesson, we'll learn some of the techniques for coercing our data into numbers with Python.

### Exploring DataTypes

Let's take another look at our [movies data from 538](https://github.com/fivethirtyeight/data/blob/master/bechdel/movies.csv).

In [4]:
import pandas as pd 
url = 'https://raw.githubusercontent.com/fivethirtyeight/data/master/bechdel/movies.csv'
movies_df = pd.read_csv(url)
movies_df[:2]

Unnamed: 0,year,imdb,title,test,clean_test,binary,budget,domgross,intgross,code,budget_2013$,domgross_2013$,intgross_2013$,period code,decade code
0,2013,tt1711425,21 & Over,notalk,notalk,FAIL,13000000,25682380.0,42195766.0,2013FAIL,13000000,25682380.0,42195766.0,1.0,1.0
1,2012,tt1343727,Dredd 3D,ok-disagree,ok,PASS,45000000,13414714.0,40868994.0,2012PASS,45658735,13611086.0,41467257.0,1.0,1.0


Now in pandas, every value in a series shares a single datatype. So, for example, if we look at the datatype of the year series, we get the following.

In [5]:
movies_df['year'][:3]

0    2013
1    2012
2    2013
Name: year, dtype: int64

We can see at the bottom that the `dtype` is `int64`. In other words, it's a 64-bit integer. We can see the datatypes of the entire dataframe by accessing the `dtypes` attribute of our dataframe.

In [59]:
movies_df.dtypes

year                int64
imdb               object
title              object
test               object
clean_test         object
binary             object
budget              int64
domgross          float64
intgross          float64
code               object
budget_2013$        int64
domgross_2013$    float64
intgross_2013$    float64
period_code       float64
decade_code       float64
dtype: object

So here we can see the columns and their corresponding datatypes. A series of type `object` usually holds Python strings, although it can hold any Python object. In general, we want to convert as many `object` columns as possible into numeric types so that we can use them as features.

In [60]:
movies_df['code'][:2]

0    2013FAIL
1    2012PASS
Name: code, dtype: object
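
As a quick way to check which columns still need coercing, the sketch below uses `pandas.api.types.is_numeric_dtype`, which this lesson does not otherwise rely on. The `sample_df` frame is a hypothetical two-row stand-in for `movies_df`, built in memory so the example runs without a download.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical miniature stand-in for movies_df (not the real dataset)
sample_df = pd.DataFrame({
    'year': [2013, 2012],
    'title': ['21 & Over', 'Dredd 3D'],
    'budget': [13000000, 45000000],
})

# Record, for each column, whether pandas stores it as a numeric dtype
numeric_flags = {col: is_numeric_dtype(sample_df[col]) for col in sample_df.columns}
print(numeric_flags)
```

Columns flagged `False` here are the ones that would need to be coerced before they could be used as numeric features.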

We can find all of the columns that are of the object dtype with the following.

In [6]:
movie_objects_df = movies_df.select_dtypes('object')
movie_objects_df[:2]

Unnamed: 0,imdb,title,test,clean_test,binary,code
0,tt1711425,21 & Over,notalk,notalk,FAIL,2013FAIL
1,tt1343727,Dredd 3D,ok-disagree,ok,PASS,2012PASS


So we can see that `imdb`, `title`, `test`, `clean_test`, `binary`, and `code` are all stored as objects.

We can see how easily this data can be coerced by looking at `value_counts`, which tells us each distinct value in the series and how many times it appears. For example, let's look at the value counts for the `test` column.

In [11]:
movies_df['test'].value_counts()

ok                  696
notalk              379
notalk-disagree     135
men                 125
ok-disagree         107
nowomen              88
dubious              81
men-disagree         69
dubious-disagree     61
nowomen-disagree     53
Name: test, dtype: int64

> This dataset is about passing the Bechdel test, which is one measurement for the role of women characters in movies. To pass the Bechdel test, the movie must feature at least two women in speaking roles, who have names, and who talk to each other about something – anything – other than a man.

In the `binary` column we can see the number of movies in the dataset that pass this test.

In [12]:
movies_df['binary'].value_counts()

FAIL    991
PASS    803
Name: binary, dtype: int64

The other thing we can see is that, with only two values in `binary`, we can easily replace `FAIL` and `PASS` with 0 and 1.
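
One way to spot columns like this is to count the distinct values in every column at once. The sketch below assumes a small hypothetical frame (`sample_df`, with placeholder titles) and uses `DataFrame.nunique` alongside `value_counts`.

```python
import pandas as pd

# Hypothetical sample data; the titles are placeholders, not rows from the dataset
sample_df = pd.DataFrame({
    'title': ['Movie A', 'Movie B', 'Movie C', 'Movie D'],
    'binary': ['FAIL', 'PASS', 'FAIL', 'FAIL'],
})

# How often each distinct value appears in one column
counts = sample_df['binary'].value_counts()
print(counts)

# Number of distinct values per column: low numbers suggest easy candidates for mapping
distinct = sample_df.nunique()
print(distinct)
```

A column with only two distinct values, like `binary`, is a natural candidate for a boolean mapping, while a column such as `title` with one value per row is not.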

### Mapping booleans

Ok, now we have found a column that we can change into numbers. Note that in Python, `False` equals 0 and `True` equals 1. So let's map our strings to booleans accordingly. Here is an easy way to accomplish this.

In [13]:
boolean_mapping = {'FAIL': False, 'PASS': True}

In [14]:
binary = movie_objects_df['binary'].map(boolean_mapping)

In [16]:
binary[:5]

0    False
1     True
2    False
3    False
4    False
Name: binary, dtype: bool

In [69]:
binary.dtype

dtype('bool')

We can see that this column is now a boolean.
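
One thing to watch for with `map` and a dictionary: any value that is missing from the dictionary becomes `NaN` rather than raising an error. The sketch below illustrates this with a hypothetical series `labels` that contains a lowercase `'pass'`, which is not a key in the mapping.

```python
import pandas as pd

boolean_mapping = {'FAIL': False, 'PASS': True}

# Hypothetical labels; 'pass' (lowercase) is deliberately not a key in the mapping
labels = pd.Series(['FAIL', 'PASS', 'pass'])

mapped = labels.map(boolean_mapping)
print(mapped)

# Count the values that had no match in the dictionary
unmapped_count = mapped.isna().sum()
print(unmapped_count)
```

Checking `isna().sum()` after mapping is a cheap way to confirm that every value was covered before replacing the original column.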

In [71]:
1 == True and 0 == False

True
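
Because `True` and `False` behave like 1 and 0, a boolean series can be summed or cast to integers directly. A minimal sketch, assuming a hypothetical five-value series shaped like the `binary` output above:

```python
import pandas as pd

# Hypothetical boolean series mirroring the first five values shown earlier
binary = pd.Series([False, True, False, False, False], name='binary')

# Casting to int turns False/True into 0/1
binary_as_int = binary.astype(int)
print(binary_as_int.tolist())

# Summing a boolean series counts the True values
pass_count = binary.sum()
print(pass_count)
```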

Now that we have this new series, let's replace our original series of text with this boolean series.

It's a good idea to avoid changing our previous data.  So we can copy our `movies_df`.

In [23]:
movies_with_binary = movies_df.copy()
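
To see why the copy matters, the sketch below changes a column on a copy and confirms the original is untouched. The frames `original_df` and `working_df` are hypothetical names used only for this illustration.

```python
import pandas as pd

# Hypothetical two-row frame standing in for movies_df
original_df = pd.DataFrame({'binary': ['FAIL', 'PASS']})

# copy() returns a new dataframe with its own data
working_df = original_df.copy()
working_df['binary'] = working_df['binary'].map({'FAIL': False, 'PASS': True})

print(original_df['binary'].tolist())  # still the original strings
print(working_df['binary'].tolist())   # the mapped booleans
```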

In [24]:
movies_with_binary.select_dtypes('object').columns

Index(['imdb', 'title', 'test', 'clean_test', 'binary', 'code'], dtype='object')

Now we can replace the `binary` column, which holds the strings `FAIL` and `PASS`, with our new series of the equivalent boolean data.

In [21]:
movies_with_binary['binary'] = binary

After doing so, let's check the object columns again.

In [22]:
movies_with_binary.select_dtypes('object').columns

Index(['imdb', 'title', 'test', 'clean_test', 'code'], dtype='object')

So we can see that our `binary` column is now gone from the list of object columns.

### Doing more with Map

### Working with Categories

Let's look at some of the columns that remain as objects.

In [85]:
remaining_movies_df = movies_with_binary.select_dtypes('object')
remaining_movies_df[:2]

Unnamed: 0,imdb,title,test,clean_test,code
0,tt1711425,21 & Over,notalk,notalk,2013FAIL
1,tt1343727,Dredd 3D,ok-disagree,ok,2012PASS


It looks like `test` and `clean_test` are similar to one another.

In [88]:
test_df = remaining_movies_df[['test', 'clean_test']]
test_df[:5]

Unnamed: 0,test,clean_test
0,notalk,notalk
1,ok-disagree,ok
2,notalk-disagree,notalk
3,notalk,notalk
4,men,men


It looks like the data is largely the same, but `test` captures a little more detail (for example, `ok-disagree` where `clean_test` has `ok`).

Once again, we can look at the value_counts.

In [89]:
test_df['test'].value_counts()

ok                  696
notalk              379
notalk-disagree     135
men                 125
ok-disagree         107
nowomen              88
dubious              81
men-disagree         69
dubious-disagree     61
nowomen-disagree     53
Name: test, dtype: int64

In [90]:
test_series = test_df['test'].astype('category')

In [91]:
test_series[:2]

0         notalk
1    ok-disagree
Name: test, dtype: category
Categories (10, object): [dubious, dubious-disagree, men, men-disagree, ..., nowomen, nowomen-disagree, ok, ok-disagree]

In [94]:
test_series.cat.categories

Index(['dubious', 'dubious-disagree', 'men', 'men-disagree', 'notalk',
       'notalk-disagree', 'nowomen', 'nowomen-disagree', 'ok', 'ok-disagree'],
      dtype='object')

In [103]:
movies_with_binary['test'] = test_series.cat.codes

In [105]:
movies_with_binary['clean_test'] = movies_with_binary['clean_test'].astype('category').cat.codes
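
The cells above convert a column to the category dtype and then keep only its integer codes. Here is a minimal sketch of what that does, assuming a small hypothetical series `tests` rather than the full column:

```python
import pandas as pd

# Hypothetical handful of values taken from the kinds of labels in the test column
tests = pd.Series(['notalk', 'ok-disagree', 'notalk', 'men'])

# Converting to category stores each distinct label once, sorted alphabetically
as_category = tests.astype('category')
print(as_category.cat.categories.tolist())

# cat.codes gives each row the integer position of its label in the categories
codes = as_category.cat.codes
print(codes.tolist())

# A lookup from code back to label, useful when interpreting the numbers later
code_to_label = dict(enumerate(as_category.cat.categories))
print(code_to_label)
```

Note that the codes simply follow the sorted order of the labels, so the numbers themselves carry no meaning. A model that treats them as ordered quantities may read an ordering into them that is not really there.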

In [108]:
movies_with_binary.select_dtypes(exclude = 'object').columns

Index(['year', 'test', 'clean_test', 'binary', 'budget', 'domgross',
       'intgross', 'budget_2013$', 'domgross_2013$', 'intgross_2013$',
       'period_code', 'decade_code'],
      dtype='object')

In [109]:
movies_with_binary.select_dtypes(include = 'object').columns

Index(['imdb', 'title', 'code'], dtype='object')

### Summary

In this lesson, we saw how to coerce our data into formats that are not objects. We saw how to explore the datatypes with the `dtypes` attribute, and how to select columns by their type with `select_dtypes`. We then saw how to coerce our data with the `map` method to convert matching strings to other values. We also saw how to convert a series into a category column with `astype('category')` and then into integers with `cat.codes`. For strings that already look like numbers, pandas also provides `pd.to_numeric` to convert a series into a numeric column.
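
Since `pd.to_numeric` is not demonstrated in the cells above, here is a minimal sketch of how it might be used. The series `raw` is hypothetical, and `errors='coerce'` is one possible choice: it turns values that cannot be parsed into `NaN` instead of raising an error.

```python
import pandas as pd

# Hypothetical column of numbers stored as strings, with one unparseable entry
raw = pd.Series(['13000000', '45000000', '#N/A'])

# errors='coerce' replaces anything that cannot be parsed with NaN
numeric = pd.to_numeric(raw, errors='coerce')
print(numeric)
print(numeric.dtype)
```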