# Handling missing values
This notebook describes how to read and decode data.

## Data sources (or extract of data sources) used in this notebook
* Extract (rows for 31/12/2017): NFL play by play data (from **Kaggle**: [Detailed NFL Play-by-Play Data 2009-2017](https://www.kaggle.com/maxhorowitz/nflplaybyplay2009to2016))

## Imports

In [38]:
import pandas as pd
import chardet

## Read data from files with correct encoding
____

Character encodings are specific sets of rules for mapping from raw binary byte strings (that look like this: 0110100001101001) to characters that make up human-readable text (like "hi"). There are many different encodings, and if you try to read text with a different encoding than the one it was originally written in, you end up with scrambled text called "mojibake" (said like mo-gee-bah-kay). Here's an example of mojibake:

æ–‡å—åŒ–ã??

You might also end up with "unknown" characters. These are what gets printed when there's no mapping between a particular byte and a character in the encoding you're using to read your byte string, and they look like this:

����������
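
Both failure modes can be reproduced in a few lines. This is a minimal sketch using a made-up string (`"café"`) rather than the notebook's data: the first mismatch decodes without any error and silently yields mojibake, the second hits a byte that is invalid in the target encoding and yields the unknown character.

```python
original = "café"

# Written as UTF-8 but read back as Latin-1: every byte maps to *some*
# Latin-1 character, so decoding succeeds and produces mojibake.
mojibake = original.encode("utf-8").decode("latin-1")
print(mojibake)  # cafÃ©

# Written as Latin-1 but read back as UTF-8: the byte 0xE9 is not valid
# UTF-8 here, so errors="replace" substitutes U+FFFD (the unknown character).
unknown = original.encode("latin-1").decode("utf-8", errors="replace")
print(unknown)  # caf�

# Mojibake of this kind is reversible by undoing the wrong decode.
repaired = mojibake.encode("latin-1").decode("utf-8")
print(repaired)  # café
```

Note that the unknown character, unlike this kind of mojibake, cannot be reversed: the original byte is lost once it has been replaced.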

Character encoding mismatches are less common today than they used to be, but they are definitely still a problem. There are lots of different character encodings, but the main one you need to know is UTF-8.

> **UTF-8 is the standard text encoding**. Python 3 source code is UTF-8 by default and, ideally, all your data should be as well. It's when things aren't in UTF-8 that you run into trouble.

Text can be converted to and from bytes using ```str.encode``` and ```bytes.decode```:
* ```str.encode``` returns data of type **bytes** encoded in the specified encoding.
* ```bytes.decode``` returns a **str** decoded using the specified encoding.

### Encodings

In [34]:
before = "This is the euro symbol: €"
print("before is of type {}".format(type(before)))

before is of type <class 'str'>


In [35]:
after = before.encode("utf-8", errors = "replace")
print("after is of type {0} and reads as [{1}]".format(type(after),after))

after is of type <class 'bytes'> and reads as [b'This is the euro symbol: \xe2\x82\xac']


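> Note the "**\xe2\x82\xac**" replacing the "**€**" character. That's because bytes are printed as if they were ascii characters, and the three bytes that encode "**€**" in UTF-8 have no ascii equivalent, so they are shown as escape sequences.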

#### Attempting to read/decode data with the wrong encoding will usually raise an error:
Here, we attempt to decode the **UTF-8** encoded ```bytes``` as **ascii**.

In [37]:
print(after.decode("ascii"))

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 25: ordinal not in range(128)

#### ```errors = "replace"``` replaces characters that cannot be encoded in the specified encoding with that encoding's placeholder for an unknown character (```?``` for ascii).

In [42]:
after = before.encode("ascii", errors = "replace")
print(after.decode("ascii"))

This is the euro symbol: ?


### Finding the encoding of a file and reading it
When reading a file we don't usually know which encoding is used. One way to figure it out is to try a bunch of different character encodings and see if any of them work. A better way, though, is to use the ```chardet``` module to try and automatically guess what the right encoding is. It's not 100% guaranteed to be right, but it's usually faster than just trying to guess.
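
For comparison, the trial-and-error approach could be sketched as follows. The helper name ```first_working_encoding``` and the list of candidate encodings are assumptions made for this illustration; they are not part of the notebook or of any library.

```python
def first_working_encoding(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate that decodes `raw` without error, else None.

    Hypothetical helper: a successful decode does not prove the guess is
    right. "latin-1", for instance, accepts any byte sequence.
    """
    for encoding in candidates:
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # this candidate cannot represent the bytes, try the next
        return encoding
    return None


print(first_working_encoding("€".encode("utf-8")))  # utf-8
print(first_working_encoding(b"caf\xe9"))           # cp1252
```

The second call shows the weakness of this approach: the bytes are invalid UTF-8, so the helper settles on the next candidate that happens not to fail, whether or not it is the encoding the data was written in.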

The process is to read a certain number of bytes from the file and let ```chardet``` work out which encoding is used. Looking at the first ten thousand bytes is generally enough for ```chardet``` to make a good guess, and it is much faster than scanning the whole file, which can be very slow for a large file. The cell below reads only the first 500 bytes.

In [44]:
with open("../data/NFL_Play_by_Play_2009-2017_(v4)_extract.csv", 'rb') as rawdata:
    result = chardet.detect(rawdata.read(500))
print(result)

{'encoding': 'UTF-8-SIG', 'confidence': 1.0, 'language': ''}
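
"UTF-8-SIG" is UTF-8 preceded by a byte order mark (BOM), a three-byte signature at the very start of the file. The sketch below uses made-up header bytes, not the actual file, to show how Python's two codecs treat that signature:

```python
import codecs

# Hypothetical first line of a CSV file saved with a UTF-8 byte order mark.
raw = codecs.BOM_UTF8 + "GameID,Drive\n".encode("utf-8")

# The plain "utf-8" codec keeps the BOM as an invisible leading character.
as_utf8 = raw.decode("utf-8")
print(repr(as_utf8))      # '\ufeffGameID,Drive\n'

# The "utf-8-sig" codec strips the BOM if it is present.
as_utf8_sig = raw.decode("utf-8-sig")
print(repr(as_utf8_sig))  # 'GameID,Drive\n'
```

The rest of the text decodes identically either way. Only the leading signature differs.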


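#### In this case, ```chardet``` is 100% confident the encoding is UTF-8 (reported as UTF-8-SIG, i.e. UTF-8 with a byte order mark).
We can then read the file with ```encoding = "utf-8"``` (which is actually the default for ```pandas.read_csv```). Passing ```encoding = "utf-8-sig"``` is also accepted and makes the handling of the byte order mark explicit.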

In [48]:
data = pd.read_csv("../data/NFL_Play_by_Play_2009-2017_(v4)_extract.csv", encoding="utf-8")
data.shape

(2801, 102)

```DataFrame.shape``` gives us the size of the data as (number of rows, number of columns).

In [15]:
data.shape

(2801, 102)

```DataFrame.describe()``` gives us additional (statistical) information on each numeric column. For instance, one can see at a glance whether some data is missing: the value of **count** will be less than ```data.shape[0]```.

In [16]:
data.describe()

| | GameID | Drive | qtr | down | TimeUnder | TimeSecs | PlayTimeDiff | yrdln | yrdline100 | ydstogo | ... | yacEPA | Home_WP_pre | Away_WP_pre | Home_WP_post | Away_WP_post | Win_Prob | WPA | airWPA | yacWPA | Season |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2801.0 | 2801.0 | 2801.0 | 2407.0 | 2801.0 | 2801.0 | 2801.0 | 2799.0 | 2799.0 | 2801.0 | ... | 1083.0 | 2630.0 | 2630.0 | 2621.0 | 2621.0 | 2630.0 | 2762.0 | 1085.0 | 1083.0 | 2801.0 |
| mean | 2017123000.0 | 12.844698 | 2.533738 | 2.047362 | 7.410211 | 1736.225991 | 20.561228 | 29.263308 | 51.03537 | 7.40664 | ... | -0.402652 | 0.539297 | 0.460703 | 0.540004 | 0.459996 | 0.5119508 | 0.001042 | 0.009552 | -0.008354 | 2017.0 |
| std | 4.641369 | 7.41421 | 1.110512 | 1.029612 | 4.660919 | 1052.031232 | 16.673179 | 12.290691 | 24.086361 | 4.789491 | ... | 1.961941 | 0.28721 | 0.28721 | 0.289224 | 0.289224 | 0.2893241 | 0.042262 | 0.057162 | 0.070068 | 0.0 |
| min | 2017123000.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | ... | -11.649439 | 0.0 | 0.0 | 0.0 | 0.0 | 2.22e-16 | -0.758072 | -0.973533 | -0.350323 | 2017.0 |
| 25% | 2017123000.0 | 7.0 | 2.0 | 1.0 | 3.0 | 844.0 | 5.0 | 21.0 | 34.0 | 4.0 | ... | -1.176876 | 0.306312 | 0.211095 | 0.303742 | 0.207901 | 0.2790179 | -0.015958 | -0.015675 | -0.022787 | 2017.0 |
| 50% | 2017123000.0 | 13.0 | 2.0 | 2.0 | 7.0 | 1808.0 | 15.0 | 30.0 | 54.0 | 9.0 | ... | 0.0 | 0.528281 | 0.471719 | 0.530743 | 0.469257 | 0.5150523 | -0.000596 | 0.001354 | 0.0 | 2017.0 |
| 75% | 2017123000.0 | 19.0 | 4.0 | 3.0 | 11.0 | 2612.0 | 37.0 | 39.0 | 72.0 | 10.0 | ... | 0.578132 | 0.788905 | 0.693688 | 0.792099 | 0.696258 | 0.7643252 | 0.012951 | 0.033685 | 0.013069 | 2017.0 |
| max | 2017123000.0 | 30.0 | 4.0 | 4.0 | 15.0 | 3600.0 | 72.0 | 50.0 | 99.0 | 31.0 | ... | 7.270664 | 1.0 | 1.0 | 1.0 | 1.0 | 0.9992919 | 0.763208 | 0.200225 | 0.973533 | 2017.0 |
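
Rather than comparing **count** against the number of rows by eye, the comparison can be done in code. This is a minimal sketch on a hypothetical toy frame (the values are invented and only the column names echo the NFL extract):

```python
import pandas as pd

# Hypothetical toy data, not taken from the NFL extract.
toy = pd.DataFrame({
    "qtr": [1, 2, 3, 4],
    "down": [1.0, None, 3.0, None],
})

# The "count" row of describe() holds the number of non-missing values.
counts = toy.describe().loc["count"]

# Missing values per column = total rows minus non-missing values.
missing = len(toy) - counts
print(missing)

# Columns where count falls short of the number of rows.
incomplete = counts[counts < len(toy)].index.tolist()
print(incomplete)  # ['down']
```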


Another way of looking for missing data (NaN) is to use ```DataFrame.isna().sum()```

This will count the number of NaN values in each column.

In [27]:
# Show the 10 columns with the most NaN values,
# sorted in descending order

data.isna().sum().sort_values(ascending = False)[:10]

BlockingPlayer       2801
DefTwoPoint          2799
TwoPointConv         2795
ChalReplayResult     2780
Interceptor          2773
RecFumbPlayer        2770
RecFumbTeam          2770
FieldGoalDistance    2747
FieldGoalResult      2747
ExPointResult        2744
dtype: int64
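
Raw counts are easier to interpret next to the share of rows they represent. The sketch below shows one possible way to build such a summary, again on a hypothetical toy frame (the values and the player name are invented). It also illustrates that ```isna()``` covers text columns, which ```describe()``` leaves out by default.

```python
import pandas as pd

# Hypothetical toy data, not taken from the NFL extract.
toy = pd.DataFrame({
    "down": [1.0, None, 3.0, None],
    "Interceptor": [None, None, None, "J.Doe"],
    "qtr": [1, 2, 3, 4],
})

nan_counts = toy.isna().sum()   # number of missing values per column
nan_share = toy.isna().mean()   # fraction of rows that are missing per column

# Combine both into one table, worst columns first.
summary = pd.DataFrame({"missing": nan_counts, "share": nan_share})
summary = summary.sort_values("missing", ascending=False)
print(summary)

# describe() only summarises numeric columns by default,
# so the text column does not appear in its output.
print("Interceptor" in toy.describe().columns)  # False
```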