# Character encoding

Loading data from a CSV file is usually a trivial matter of running:

```python
import pandas as pd

pd.read_csv('path/to/file.csv')
```

However, when running `pd.read_csv` you may get a `UnicodeDecodeError` with a long error message whose last line looks something like this:

```python
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x99 in position 7955: invalid start byte
```

To understand - and fix - these errors, first let's look at [character encoding](https://en.wikipedia.org/wiki/Character_encoding).

> Character encoding is the process of assigning numbers to graphical characters, especially the written characters of human language, allowing them to be stored, transmitted, and transformed using digital computers.

There are many different encoding standards, and if you try to *read* a file with the wrong encoding, you end up with garbled text, sometimes referred to as [mojibake](https://en.wikipedia.org/wiki/Mojibake).
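As a small illustration of how mojibake arises (a sketch with made-up variable names, not part of the original walkthrough), the bytes below are valid UTF-8, but decoding them as Latin-1 turns each byte into its own character:

```python
# Encode a string as UTF-8, then decode the bytes with the wrong encoding.
original = "café"
utf8_bytes = original.encode("utf-8")

# Latin-1 maps every byte to exactly one character, so the two bytes of "é"
# come back as two unrelated characters instead of raising an error.
garbled = utf8_bytes.decode("latin-1")

print(original, "->", garbled)
```

Note that no exception is raised here: a wrong encoding can fail silently and simply produce the wrong text.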

The standard encoding that we should strive to use is [UTF-8](https://www.unicode.org/faq/utf_bom.html#UTF8), but many documents that we come across use a different standard, for example:
- US-ASCII
- ISO_8859-6-E
- Adobe-Symbol-Encoding
- windows-1258

There are many others; see the [IANA character sets registry](https://www.iana.org/assignments/character-sets/character-sets.xhtml) for examples.
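To make the difference between standards concrete, the sketch below encodes the same character with a handful of encodings. The names used are Python codec names picked for illustration; they are not the IANA names listed above.

```python
# The same character becomes different bytes depending on the encoding.
char = "é"
codec_names = ["utf-8", "latin-1", "cp1252", "mac_roman", "cp850"]

encoded_forms = {name: char.encode(name) for name in codec_names}

for name, raw in encoded_forms.items():
    print(f"{name:>10}: {raw!r}")
```

A reader that assumes one of these encodings while the file uses another will interpret the bytes as a different character, or fail to decode them at all.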

## Time to look at some code!

In [1]:
# Let's start with a string
string = "This is the 'e accute accent' symbol: é"

print(type(string))
string

<class 'str'>


"This is the 'e accute accent' symbol: é"

We can "encode" this string into bytes using an encoding of our choice, for example:

In [2]:
encoded = string.encode('UTF-8', errors='replace')
print(type(encoded))

<class 'bytes'>


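If we print the `encoded` object, the `é` appears as the escape sequence `\xc3\xa9`. This is not corrupted text: Python displays a `bytes` object using ASCII characters where it can and hex escapes for every other byte, and `é` takes two bytes in UTF-8: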

In [3]:
print(encoded)

b"This is the 'e accute accent' symbol: \xc3\xa9"
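The distinction between characters and bytes can be seen on a single character. This is a minimal sketch with illustrative variable names:

```python
# A str is a sequence of characters; bytes is a sequence of integers 0-255.
text = "é"
data = text.encode("utf-8")

print(len(text))   # number of characters
print(len(data))   # number of bytes after UTF-8 encoding
print(list(data))  # the raw byte values as integers
print(data)        # Python's display form, with hex escapes for non-ASCII bytes
```

The hex escapes `\xc3` and `\xa9` are just another way of writing the integers 195 and 169.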


If we want to print this correctly, we need to convert these bytes back to a Python string using the correct encoding, like so:

In [4]:
print(type(encoded.decode('UTF-8')))
encoded.decode('UTF-8')

<class 'str'>


"This is the 'e accute accent' symbol: é"

This works because we know the correct encoding. If we pass an incorrect one, Python may raise an error, and this looks suspiciously similar to the error we might get when reading a CSV!

In [5]:
encoded.decode('ASCII')

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 38: ordinal not in range(128)
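Raising an exception is only the default behaviour. As a sketch of the alternatives, the `errors` argument of `bytes.decode` accepts other handlers; the variable names below are illustrative:

```python
data = "é".encode("utf-8")  # two bytes, neither of which is valid ASCII

# errors="replace" substitutes U+FFFD (the replacement character)
# for each byte that cannot be decoded.
replaced = data.decode("ascii", errors="replace")

# errors="ignore" silently drops the offending bytes.
ignored = data.decode("ascii", errors="ignore")

print(repr(replaced), repr(ignored))
```

Both handlers lose information, so they are better suited to inspecting a file than to loading data you intend to keep.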

Let us make a small `DataFrame` and save it with a random encoding:

In [6]:
import numpy as np
import pandas as pd

d = {'col1': ['è', 'é', 'à'], 'col2': [0, 1, 2]}
df = pd.DataFrame(data=d)

In [7]:
encodings = 'EBCDIC-CP-BE IBM850 ibm1140 windows-1252 iso-2022-jp-2 iso-8859-15 macroman'.split()
chosen_encoding = np.random.choice(encodings) # I'm storing this to inspect later
df.to_csv('./test_encoding_df.csv', index=False, encoding=chosen_encoding)

We can now try to read this using the default UTF-8:

In [8]:
pd.read_csv('test_encoding_df.csv')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 10: invalid continuation byte
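The same error can be reproduced without pandas. The sketch below assumes, for illustration, that the file was written with a single-byte encoding such as ISO-8859-15, and builds the first bytes of the CSV by hand:

```python
# Bytes that a two-column CSV would start with if written in ISO-8859-15
# (an assumption for this example).
raw = "col1,col2\nè,0\n".encode("iso-8859-15")

try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # exc.start is the offset of the offending byte; exc.reason explains why.
    position = exc.start
    reason = exc.reason
    bad_byte = raw[position]
    print(hex(bad_byte), position, reason)
```

In UTF-8, the byte `0xe8` announces a three-byte sequence, so the decoder expects a continuation byte next. It finds a comma instead, hence "invalid continuation byte".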

We need a way to identify the correct encoding of the input file. This is where [chardet](https://pypi.org/project/chardet/) comes in:

In [9]:
import chardet

In [10]:
# Look at the first 1000 bytes (using 'rb' in the context manager)
# to try to guess the encoding
with open('./test_encoding_df.csv', 'rb') as rawdata:
    result = chardet.detect(rawdata.read(1000))

# check what the character encoding might be
print(result)
print(f'chosen_encoding: {chosen_encoding}')

{'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}
chosen_encoding: iso-8859-15


Now let's try to read the file, first with the encoding guessed by `chardet` and then with the actual `chosen_encoding`:

In [11]:
df_guessed = pd.read_csv('test_encoding_df.csv', encoding=result.get('encoding'))
df_actual  = pd.read_csv('test_encoding_df.csv', encoding=chosen_encoding)

In [12]:
df_guessed

|   | col1 | col2 |
|---|------|------|
| 0 | è    | 0    |
| 1 | é    | 1    |
| 2 | à    | 2    |


In [13]:
df_actual

|   | col1 | col2 |
|---|------|------|
| 0 | è    | 0    |
| 1 | é    | 1    |
| 2 | à    | 2    |
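Both reads return the same table even though chardet guessed ISO-8859-1 and the file in this run was written as ISO-8859-15. The two encodings are close relatives, and the following sketch (standard library only, illustrative names) lists the byte values on which they disagree:

```python
# Decode every possible byte value with both encodings and keep the ones
# where the resulting characters differ.
differences = {}
for byte in range(256):
    latin1_char = bytes([byte]).decode("iso-8859-1")
    latin9_char = bytes([byte]).decode("iso-8859-15")
    if latin1_char != latin9_char:
        differences[byte] = (latin1_char, latin9_char)

for byte, (latin1_char, latin9_char) in differences.items():
    print(f"0x{byte:02x}: {latin1_char!r} vs {latin9_char!r}")

# The accented letters from the DataFrame decode identically in both.
sample = "èéà".encode("iso-8859-15")
same_result = sample.decode("iso-8859-1") == sample.decode("iso-8859-15")
print(same_result)
```

Had the data contained one of the differing characters, such as the euro sign, the guessed encoding would have produced a wrong character without any error.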


Remember to check the `result['confidence']` value of chardet's guess, as a low value means you may need to explore further. In a bind, you might have to loop through a list of possible encodings, such as the one on [docs.python.org](https://docs.python.org/2.4/lib/standard-encodings.html).
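Looping through candidates could look something like the sketch below. The helper `decodable_with` and the candidate list are hypothetical, written for this example; it works on raw bytes so that it runs without pandas or chardet.

```python
def decodable_with(raw, candidates):
    """Return {encoding: text} for each candidate that decodes raw without error."""
    results = {}
    for encoding in candidates:
        try:
            results[encoding] = raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: the bytes are invalid in this encoding.
            # LookupError: Python does not know an encoding by this name.
            continue
    return results


# Hypothetical sample: the first bytes of a file written in ISO-8859-15.
raw = "col1,col2\nè,0\n".encode("iso-8859-15")
candidates = ["utf-8", "ascii", "iso-8859-15", "cp1252", "mac_roman"]

found = decodable_with(raw, candidates)
for encoding, text in found.items():
    print(f"{encoding:>12}: {text!r}")
```

Several candidates usually decode without error, and not all of them give the right text, so the decoded output still needs to be inspected by eye or checked against known values.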