# Exploring and Coercing Data

### Introduction

In our last lab, we were able to gather data from a csv file and select our target and features (or explanatory variables).  However, one issue is that we were constrained to only using features that were preformatted as numbers.  This stopped us from using our `genre` column as a feature, even though it would have been interesting to discover how genre can be predictive of movie revenue.  

In this lesson, we'll explore datatypes in Pandas.

### DataTypes in Pandas

Let's take a look at the table below regarding datatypes in Pandas.

|  Pandas dtype |  Python Type | Use |
|---|---|---|
|object|string|text|
|int64/float64| int/float   | numbers|
|bool|bool   |True/False|
|datetime64| NA   |Dates and Times|
|category| NA |Finite list of text values|

By the end of our work on Pandas, we'll have looked at each of these datatypes.  Remember that datatypes in Pandas are important because all of our data needs to be numeric before we feed it into a machine learning model, and some Pandas datatypes are easier to convert to a number than others.  In general, a datatype of object is the most difficult to convert to a number.  Because of this, a lot of the work in coercing our data involves changing a series from type object to a different datatype.  We'll explore some of the easy ways to change data from type `object` to type `int64` or `float64`, `bool`, or `datetime64`.
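To make the table above concrete, here is a minimal sketch that builds one column per datatype.  The dataframe and all of its values are made up for illustration and are not part of the SAT dataset.

```python
import pandas as pd

# A tiny, made-up dataframe with one column per dtype from the table
example_df = pd.DataFrame({
    'name': ['School A', 'School B', 'School C'],         # text
    'total_students': [171, 465, 230],                     # int64
    'graduation_rate': [0.66, 0.90, 0.75],                 # float64
    'is_charter': [False, True, False],                    # bool
    'opened': pd.to_datetime(['2001-09-01', '1998-09-01', '2010-09-01']),  # datetime64
    'boro': pd.Series(['M', 'M', 'X'], dtype='category'),  # category
})

# One dtype per column
print(example_df.dtypes)
```

Depending on the pandas version, the text column may be reported as `object` or as a dedicated string dtype.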

For this lesson, let's load up some data about [NYC SAT scores](https://data.cityofnewyork.us/Education/2012-SAT-Results/f9bf-2cp4), drawn from the [NYC Open data](https://opendata.cityofnewyork.us/) website.  We have uploaded a version of this dataset, which the cell below loads.  We'll use this dataset to explore datatypes in pandas.

In [4]:
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/jigsawlabs-student/introductory-pandas/master/nyc_hs_sat.csv', index_col = 0)


# to make things more interesting, we also alter some of the data
columns = ['reading_avg', 'math_avg', 'writing_score']
df[columns] = df[columns].astype('object')
str_cols = df[columns].apply(lambda x: x.map(str))
df = df.drop(columns = columns)
sat_df = pd.concat([df, str_cols], axis = 1)

> Press shift + enter on the cell above to load the data, and we can get started.

For this lesson, we'll choose Math SAT scores to be the target we are trying to predict. 

### Exploring Our Data

Now that we have loaded up our data, we want to begin changing our data so that it is numeric.  So we want to identify:

1. The data that is either an integer or float and therefore ready
2. The data that we can change into a better datatype to eventually use more easily

What this generally means in practice is that we should identify those columns that are of type object, and should be changed to something else, and those that are not of type object, and thus are in pretty good shape.  

To determine this, we'll look at two tools: the `df.dtypes` attribute and the `df.select_dtypes` method.

* `dtypes`

We can access the `dtypes` attribute directly on our pandas dataframe.  Because it is an attribute rather than a method, we do not add parentheses.

In [52]:
sat_df.dtypes

dbn                     object
name                    object
num_test_takers        float64
boro                    object
total_students           int64
graduation_rate        float64
attendance_rate        float64
college_career_rate    float64
reading_avg             object
math_avg                object
writing_score           object
dtype: object

The `dtypes` attribute lists the column name and corresponding datatype for each column.  We can see that several of these columns are of type object, which we may like to change to a different datatype before feeding them into our machine learning model.  
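Because `dtypes` returns a pandas Series indexed by column name, we can also index and filter it like any other Series.  The sketch below uses a small made-up dataframe (`toy_df` is a hypothetical stand-in for `sat_df`) to pull out the names of the columns still stored as object.

```python
import pandas as pd

# Hypothetical stand-in for sat_df, with only a few of its columns
toy_df = pd.DataFrame({
    'boro': pd.Series(['M', 'X'], dtype='object'),
    'math_avg': pd.Series(['404.0', '423.0'], dtype='object'),
    'total_students': [171, 465],
    'graduation_rate': [0.66, 0.90],
})

# dtypes is a Series, so we can look up a single column by name
math_dtype = toy_df.dtypes['math_avg']

# ...or filter it to keep only the columns stored as object
object_columns = toy_df.dtypes[toy_df.dtypes == 'object'].index.tolist()
print(object_columns)  # ['boro', 'math_avg']
```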

* `df.select_dtypes`

Now, if we would like to only select those columns of type object, we can do so with the `select_dtypes` method.

In [5]:
sat_objects_df = sat_df.select_dtypes('object')
sat_objects_df[:2]

| |dbn|name|boro|reading_avg|math_avg|writing_score|
|---|---|---|---|---|---|---|
|0|01M292|HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES|M|355.0|404.0|363.0|
|1|01M448|UNIVERSITY NEIGHBORHOOD HIGH SCHOOL|M|383.0|423.0|366.0|


So we see that there are a number of columns that are currently not numeric, but could be good to include in our model.  For example, `reading_avg`, `math_avg`, `writing_score`, and `boro`.

* exclude

If we want to also see the columns that are currently **not** of type object, and thus may be ready for our model, we can find that by using `select_dtypes` to identify the columns that are not of type object.

In [6]:
except_objects_df = sat_df.select_dtypes(exclude = ['object'])
except_objects_df[:2]

| |num_test_takers|total_students|graduation_rate|attendance_rate|college_career_rate|
|---|---|---|---|---|---|
|0|29.0|171|0.66|0.87|0.36|
|1|91.0|465|0.9|0.93|0.7|


So these columns are not of type object, and look like they are good to go as features of our model.
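As a further sketch, `select_dtypes` also accepts the generic selector `'number'`, which matches every numeric dtype at once.  The dataframe below is made up for illustration, and the `is_charter` column is a hypothetical addition to show that boolean columns are not treated as numbers.

```python
import pandas as pd

# Made-up data; is_charter is a hypothetical boolean column
toy_df = pd.DataFrame({
    'name': pd.Series(['School A', 'School B'], dtype='object'),
    'total_students': [171, 465],
    'graduation_rate': [0.66, 0.90],
    'is_charter': [False, True],
})

# 'number' matches int64, float64 and other numeric dtypes, but not bool
numeric_df = toy_df.select_dtypes(include='number')

# exclude takes the same kinds of selectors as include
not_numeric_df = toy_df.select_dtypes(exclude=['number'])

print(numeric_df.columns.tolist())      # ['total_students', 'graduation_rate']
print(not_numeric_df.columns.tolist())  # ['name', 'is_charter']
```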

### Changing the DataType of Columns

Now that we have identified the columns that we may wish to change, using `dtypes` and `select_dtypes`, let's move on to coercing some of these columns.

Let's start by taking another look at the columns that are currently of type object.

In [55]:
sat_df.select_dtypes('object')[:2]

| |dbn|name|boro|reading_avg|math_avg|writing_score|
|---|---|---|---|---|---|---|
|0|01M292|HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES|M|355.0|404.0|363.0|
|1|01M448|UNIVERSITY NEIGHBORHOOD HIGH SCHOOL|M|383.0|423.0|366.0|


The column `reading_avg` looks like it could be predictive of our `math_avg` target, so let's try to make the column numeric.  Currently, the column is of type `object`, and if we look, we see that each of the entries is a string. 

In [7]:
sat_df.reading_avg.dtype

dtype('O')

In [58]:
sat_df.reading_avg[0]

'355.0'

Now if we change the data to a numeric type, we can eventually use this data as a feature in our model.

In [61]:
reading = sat_df.reading_avg.astype('float64')
reading.dtype

dtype('float64')

In [63]:
reading[0]

355.0

Now that we have a series of data in a float format, we can replace the original `sat_df` column with our new, coerced `reading` series.

In [64]:
sat_df['reading_avg'] = reading

> So we just used the `astype` method to specify the datatype that the column should be.  Then, we replaced the old column with the new coerced column.
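One caveat worth sketching: `astype` raises an error if even one entry cannot be parsed as a number.  In the made-up series below, the `'s'` entry is a hypothetical placeholder for a missing score.  `pd.to_numeric` with `errors='coerce'` is one way to convert such a series, turning unparseable entries into `NaN`.

```python
import pandas as pd

# Made-up scores; 's' is a hypothetical placeholder for a missing value
scores = pd.Series(['355.0', '383.0', 's'], dtype='object')

# astype fails because 's' cannot be converted to a float
try:
    scores.astype('float64')
    astype_failed = False
except ValueError as error:
    astype_failed = True
    print('astype failed:', error)

# errors='coerce' replaces anything unparseable with NaN instead of raising
coerced_scores = pd.to_numeric(scores, errors='coerce')
print(coerced_scores)
```

Whether to coerce to `NaN` or to clean the offending entries first depends on the dataset, so this is an option to consider rather than a default.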

Ok, let's see our progress by checking that there is one fewer column of type `object`.

In [66]:
sat_df.select_dtypes('object')[:2]

| |dbn|name|boro|math_avg|writing_score|
|---|---|---|---|---|---|
|0|01M292|HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES|M|404.0|363.0|
|1|01M448|UNIVERSITY NEIGHBORHOOD HIGH SCHOOL|M|423.0|366.0|


### Coercing DateTime Data

Now there's more work to do with our SAT dataset, but we'll leave that for you in the next lab.  For now, let's move on to working with another type of data, datetimes.  To do so, we'll use some revenue data from Max's Wine Bar in Texas.  We currently have the data stored in JSON.  Let's load it up.

In [8]:
max_revenue = pd.read_json('https://raw.githubusercontent.com/jigsawlabs-student/introductory-pandas/master/max-revenue.json')

In [9]:
max_revenue[:2]

| |end_date|total_receipts|
|---|---|---|
|0|2016-12-31T00:00:00.000|56182|
|1|2017-08-31T00:00:00.000|9400|


Now `total_receipts` represents the revenue earned from alcohol in a month, and `end_date` is the last day of the month in which that revenue was earned.  So the first row indicates that `56182` was earned in the month of December 2016.  

Let's say we want to predict the revenue earned per month, making `total_receipts` our target.  To predict the revenue earned, we can use information like the year or month of the related period.  Let's see how we can extract `year` and `month` information from `end_date`.

In [10]:
max_revenue.dtypes

end_date          object
total_receipts     int64
dtype: object

The first step is to change `end_date` from type object to type `datetime64`.  This way each entry is not just a string, which is hard to work with.

In [89]:
max_revenue.end_date[0]

'2016-12-31T00:00:00.000'

In [11]:
# recent pandas versions require an explicit unit such as [ns]
end_date = max_revenue.end_date.astype('datetime64[ns]')
end_date[:2]

0   2016-12-31
1   2017-08-31
Name: end_date, dtype: datetime64[ns]

Another way that we can do this is with the `pd.to_datetime` function.

In [12]:
end_date = pd.to_datetime(max_revenue.end_date)
end_date[:2]

0   2016-12-31
1   2017-08-31
Name: end_date, dtype: datetime64[ns]
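One practical difference is that `pd.to_datetime` accepts an `errors` argument.  The sketch below uses a made-up series, where `'not a date'` is a hypothetical bad entry, to show `errors='coerce'` turning an unparseable string into `NaT` (pandas' missing value for datetimes) instead of raising an error.

```python
import pandas as pd

# Made-up dates; the second entry is a hypothetical malformed value
raw_dates = pd.Series(['2016-12-31T00:00:00.000', 'not a date'], dtype='object')

# errors='coerce' converts what it can and marks the rest as NaT
parsed_dates = pd.to_datetime(raw_dates, errors='coerce')
print(parsed_dates)
```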

Ok, now that our data is of type `datetime64`, each entry is a pandas `Timestamp`, and we can use its attributes and methods to extract the month, weekday, or year.

In [13]:
end_date[0].month

12

In [14]:
end_date[1].year

2017
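The weekday can be read the same way.  As a minimal sketch, the code below rebuilds the first two `end_date` values by hand (so it runs without loading the JSON file) and reads several parts from the first entry.

```python
import pandas as pd

# The first two end_date values from above, typed in by hand
end_dates = pd.to_datetime(
    pd.Series(['2016-12-31T00:00:00.000', '2017-08-31T00:00:00.000'])
)

first = end_dates[0]

# year and month are attributes; weekday() is a method where Monday is 0
first_parts = (first.year, first.month, first.weekday())
first_day_name = first.day_name()

print(first_parts)     # (2016, 12, 5)
print(first_day_name)  # Saturday
```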

When we get to the next reading, on replacing data, we will learn how to use these attributes to create entire columns of the related month and year for a revenue period.

### Summary

In this lesson, we saw how to coerce our data into formats that are not objects.  We saw how to explore the datatypes with the `dtypes` attribute, and how to select columns by their type with `select_dtypes`.  We then saw how to coerce our data with the `astype` method, and how to convert strings to datetimes with `pd.to_datetime`.