# Reading data and encoding
This notebook describes how to read and decode data.

## Data sources (or extract of data sources) used in this notebook
* Extract (rows for 31/12/2017): NFL play by play data (from **Kaggle**: [Detailed NFL Play-by-Play Data 2009-2017](https://www.kaggle.com/maxhorowitz/nflplaybyplay2009to2016))

## Imports

In [38]:
import pandas as pd
import chardet

## Read data from files with correct encoding
____

Character encodings are sets of rules for mapping raw bytes (which look like this: 0110100001101001) to the characters that make up human-readable text (like "hi"). There are many different encodings, and if you try to read text with a different encoding than the one it was originally written in, you end up with scrambled text called "mojibake" (pronounced mo-gee-bah-kay). Here's an example of mojibake:

æ–‡å—åŒ–ã??
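
As a minimal sketch of how such text comes about, the cell below encodes a Japanese word as UTF-8 and then decodes the bytes with the wrong encoding. Latin-1 is an assumption chosen for illustration because it maps every byte to some character; the garbled output may not match the sample above exactly.

```python
# Sketch: produce mojibake by decoding UTF-8 bytes with the wrong encoding.
original = "文字化け"                  # the Japanese word "mojibake"
raw = original.encode("utf-8")        # the bytes as they would be stored on disk
garbled = raw.decode("latin-1")       # wrong encoding: each byte becomes one character
print(garbled)

# latin-1 never fails and never drops a byte, so this particular mistake can be undone.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)
```

Note that the wrong decode raises no error here, which is why mojibake often goes unnoticed until someone reads the text.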

You might also end up with "unknown" characters. These are what gets printed when there's no mapping between a particular byte and a character in the encoding you're using to read your byte string, and they look like this:

����������
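
One way to see these characters appear, sketched here with Python's built-in codecs, is to decode bytes that are invalid in the target encoding while asking for errors to be replaced instead of raised. The choice of ASCII and of the euro sign is only for illustration.

```python
# Sketch: bytes with no mapping in the target encoding become U+FFFD.
raw = "€".encode("utf-8")                      # three bytes, none of them valid ASCII
text = raw.decode("ascii", errors="replace")   # each undecodable byte is replaced
print(text)
print(len(raw), len(text))
```

Each undecodable byte is replaced by the Unicode replacement character U+FFFD, so the original bytes cannot be recovered from the decoded text.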

Character encoding mismatches are less common today than they used to be, but they are definitely still a problem. There are lots of different character encodings, but the main one you need to know is UTF-8.

> **UTF-8 is the standard text encoding**. Python 3 source code is UTF-8 by default and, ideally, all your data should be as well. It's when things aren't in UTF-8 that you run into trouble.

Text and bytes are converted into each other with ```str.encode``` and ```bytes.decode```:
* ```str.encode``` returns data of type **bytes** encoded in the specified encoding.
* ```bytes.decode``` returns a **str** decoded using the specified encoding.

### Encodings

In [34]:
before = "This is the euro symbol: €"
print("before is of type {}".format(type(before)))

before is of type <class 'str'>


In [35]:
after = before.encode("utf-8", errors = "replace")
print("after is of type {0} and reads as [{1}]".format(type(after),after))

after is of type <class 'bytes'> and reads as [b'This is the euro symbol: \xe2\x82\xac']


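> Note the "**\xe2\x82\xac**" replacing the "**€**" character. That's because bytes are printed as if they were ASCII characters, and the three bytes that encode "**€**" in UTF-8 have no ASCII equivalent, so they are shown as hex escapes.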

#### Attempting to read/decode data with the wrong encoding will usually raise an error:
Here, we attempt to decode the **UTF-8** encoded ```bytes``` as **ASCII**.

In [37]:
print(after.decode("ascii"))

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 25: ordinal not in range(128)

#### ```errors = "replace"``` replaces each character that cannot be encoded in the specified encoding with a placeholder (```?``` when encoding).

In [42]:
after = before.encode("ascii", errors = "replace")
print(after.decode("ascii"))

This is the euro symbol: ?
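
For comparison, here is a short sketch of a few other standard error handlers applied to the same string. The handler names are part of Python's codecs machinery; the dictionary name is just for illustration.

```python
# Sketch: compare several standard error handlers when encoding to ASCII.
before = "This is the euro symbol: €"
handlers = ("replace", "ignore", "backslashreplace", "xmlcharrefreplace")

# Map each handler name to the bytes it produces.
results = {name: before.encode("ascii", errors=name) for name in handlers}

for name, encoded in results.items():
    print("{:>18}: {}".format(name, encoded))
```

Only the last two handlers keep enough information to recover the original character; ```replace``` and ```ignore``` lose it for good.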


### Finding the encoding of a file and reading it
When reading a file we don't usually know which encoding is used. One way to figure it out is to try and test a bunch of different character encodings and see if any of them work. A better way, though, is to use the chardet module to try and automatically guess what the right encoding is. It's not 100% guaranteed to be right, but it's usually faster than just trying to guess.
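
The trial-and-error approach can be sketched with the standard library alone. The helper name and the candidate list below are assumptions made for illustration, not part of any library.

```python
# Sketch: try candidate encodings in order and keep the first that decodes cleanly.
def guess_encoding(raw, candidates=("ascii", "utf-8", "cp1252", "latin-1")):
    """Return the first candidate that decodes raw without error, or None."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate cannot be right, try the next one
        return name
    return None

sample = "This is the euro symbol: €".encode("utf-8")
print(guess_encoding(sample))
```

A successful decode only shows that the bytes are valid in that encoding, not that the text is correct. With this candidate list, latin-1 accepts any input, so the order of the candidates matters.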

The process is to read a certain number of bytes from the file and let ```chardet``` work out which encoding the file uses. The first ten thousand bytes are generally enough for ```chardet``` to give a good guess, and that is much faster than reading the whole file, which can be very slow for a large file. The cell below reads only the first 500 bytes.

In [44]:
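# Open in binary mode so that chardet sees the raw bytes rather than decoded text
with open("../data/NFL_Play_by_Play_2009-2017_(v4)_extract.csv", 'rb') as rawdata:
    # Sample only the first 500 bytes of the file
    result = chardet.detect(rawdata.read(500))
print(result)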

{'encoding': 'UTF-8-SIG', 'confidence': 1.0, 'language': ''}


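#### In this case, ```chardet``` is 100% confident the encoding is UTF-8-SIG, that is, UTF-8 with a leading byte order mark (BOM).
We can then read the file with ```encoding = "utf-8"``` (which is actually the default for ```pandas.read_csv```). Passing ```encoding = "utf-8-sig"``` also works and makes the BOM handling explicit.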

In [48]:
data = pd.read_csv("../data/NFL_Play_by_Play_2009-2017_(v4)_extract.csv", encoding="utf-8")
data.shape

(2801, 102)
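
The UTF-8-SIG label reported by chardet refers to UTF-8 data that starts with a byte order mark. The sketch below shows the difference between the two codecs in plain Python, using a small in-memory sample; the column names in it are hypothetical and not taken from the NFL file.

```python
import codecs

# Sketch: how Python's "utf-8" and "utf-8-sig" codecs treat a leading BOM.
# The CSV content is made up for illustration.
raw = codecs.BOM_UTF8 + "Date,GameID\n2017-12-31,1\n".encode("utf-8")

has_bom = raw.startswith(codecs.BOM_UTF8)   # True when the data starts with EF BB BF
plain = raw.decode("utf-8")                 # keeps the BOM as the character U+FEFF
clean = raw.decode("utf-8-sig")             # strips the BOM if one is present

print(has_bom)
print(repr(plain[:5]))
print(repr(clean[:5]))
```

When reading text files by hand with ```open```, choosing ```utf-8-sig``` avoids a stray U+FEFF at the start of the first field.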

## Other resources
* [**ftfy** (fixes text for you)](https://ftfy.readthedocs.io/en/latest/): **ftfy** fixes Unicode that’s broken in various ways. The goal of ftfy is to take in bad Unicode and output good Unicode, for use in your Unicode-aware code.