# Overview of the process

![image.png](attachment:image.png)

# String Encoding

Text is made up of characters. 
But when text is saved to a file on our computers, these characters are written as numbers. The association between characters and numbers is, of course, arbitrary. This means that when we read the text from a file, we have to know what mapping between characters and numbers was used.

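Unfortunately, there is no single mapping that everyone uses. Many different systems, called *encodings*, exist.

Many of these encodings overlap heavily: they agree on the numbers for common characters such as the basic Latin letters and digits, and differ mainly in everything else.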

You can think of all numbers as being saved in the form of a sequence of *bytes*. A byte is 8 bits long. This means it can store a number between 0 and 255. That's 256 distinct values, enough for 256 characters.
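
As a quick check of the byte arithmetic, here is a minimal sketch using Python's built-in `bytes` type. The variable names are only illustrative.

```python
# A byte has 8 bits, so it can take 2**8 distinct values.
n_values = 2 ** 8
all_byte_values = bytes(range(n_values))  # one byte for every possible value

smallest = min(all_byte_values)
largest = max(all_byte_values)
print(n_values, smallest, largest)

# The next integer no longer fits into a single byte, so Python refuses it.
try:
    bytes([largest + 1])
    fits = True
except ValueError:
    fits = False
print(fits)
```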

## ascii

`ascii` uses only one byte to represent each character. Actually, it uses only 7 of the 8 bits, so it can represent only 128 characters.

In [None]:
s = "abc"
ascii_bytes = s.encode(encoding="ascii")

In [None]:
type(ascii_bytes)

In [None]:
list(ascii_bytes)
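
To see the 7-bit limit in practice, the sketch below encodes Python's `string.printable` characters (chosen here only as a convenient sample of ASCII text) and checks that the eighth bit is never set.

```python
import string

# Encode a sample of ASCII characters and look at the resulting byte values.
printable_bytes = string.printable.encode("ascii")
highest = max(printable_bytes)

# 7 bits give 2**7 = 128 possible values (0..127), so the top bit stays unused.
uses_top_bit = any(b & 0b10000000 for b in printable_bytes)
print(highest, uses_top_bit)
```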

## latin1

* common in Europe
* uses all 8 bits in a byte
* is a superset of ascii
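
The superset claim can be checked directly. This is a small sketch, not a proof of anything beyond Python's own codecs: it compares the two encodings on the 128 ASCII code points and then looks at what `latin1` does with the remaining values.

```python
# The 128 ASCII code points, as one string.
ascii_chars = "".join(chr(i) for i in range(128))

# For these characters both codecs produce exactly the same bytes.
same_bytes = ascii_chars.encode("ascii") == ascii_chars.encode("latin1")

# latin1 additionally maps code points 128..255 onto byte values 128..255.
upper_chars = "".join(chr(i) for i in range(128, 256))
upper_bytes = list(upper_chars.encode("latin1"))
print(same_bytes, upper_bytes[0], upper_bytes[-1])
```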

In [None]:
latin1_bytes = s.encode(encoding="latin1")
list(latin1_bytes)

In [None]:
s2 = "Äè"
latin1_bytes2 = s2.encode("latin1")
list(latin1_bytes2)

In [None]:
s2 = '\xc4\xe8'
print(s2)

In [None]:
# Raises UnicodeEncodeError: neither character exists in ascii
s2.encode("ascii")

### Decoding

In [None]:
latin1_bytes2.decode("latin1")

In [None]:
# Raises UnicodeDecodeError: both byte values are above 127
latin1_bytes2.decode("ascii")
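
Encoding or decoding with a codec that cannot represent the data raises an exception. The sketch below catches both kinds so that the cell runs to the end; `text` and `raw` are illustrative names and stand in for the variables used above.

```python
text = "Äè"
raw = text.encode("latin1")  # two bytes, both outside the ASCII range

# Decoding latin1 bytes as ascii fails.
try:
    raw.decode("ascii")
    decode_error = None
except UnicodeDecodeError as err:
    decode_error = err

# Encoding these characters as ascii fails too.
try:
    text.encode("ascii")
    encode_error = None
except UnicodeEncodeError as err:
    encode_error = err

print(type(decode_error).__name__, type(encode_error).__name__)

# Decoding with the codec that produced the bytes restores the original text.
roundtrip = raw.decode("latin1")
print(roundtrip == text)
```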

## utf-8

In [None]:
utf8_binary = s2.encode(encoding="utf-8")
list(utf8_binary)
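
Unlike `ascii` and `latin1`, `utf-8` uses a variable number of bytes per character. The following sketch, with sample characters picked only for illustration, counts the bytes for a few characters and compares the result with `latin1`.

```python
# Characters from different parts of Unicode need different numbers of bytes.
samples = ["a", "Ä", "€", "😀"]
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
print(byte_lengths)

# The same two-character string takes more room in utf-8 than in latin1.
s2 = "Äè"
utf8_len = len(s2.encode("utf-8"))
latin1_len = len(s2.encode("latin1"))
print(utf8_len, latin1_len)

# For plain ASCII text, utf-8 and ascii produce identical bytes.
ascii_same = "abc".encode("utf-8") == "abc".encode("ascii")
print(ascii_same)
```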

## cp500

In [None]:
list(s.encode("cp500"))

In [None]:
list(s2.encode("cp500"))
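
`cp500` belongs to the EBCDIC family, so it does not even agree with `ascii` on the basic letters. A minimal sketch comparing the two, and showing what happens when the bytes are read back with the wrong codec:

```python
s = "abc"
ascii_values = list(s.encode("ascii"))
cp500_values = list(s.encode("cp500"))
print(ascii_values, cp500_values)

# The byte values differ even for plain letters.
differs = ascii_values != cp500_values

# Decoding with the matching codec restores the text ...
roundtrip = s.encode("cp500").decode("cp500")

# ... while decoding with another codec silently yields different characters.
misread = s.encode("cp500").decode("latin1")
print(differs, roundtrip == s, misread == s)
```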

## the default

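If we don't specify an encoding, Python falls back to a default. For `str.encode` and `bytes.decode` that default is UTF-8, which is what `sys.getdefaultencoding()` reports in Python 3. For `open` in text mode, the default usually comes from the platform's locale settings instead.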

In [None]:
import sys
sys.getdefaultencoding()
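
The two defaults can be inspected side by side. This sketch assumes Python 3; the value reported by `locale.getpreferredencoding` depends on the machine it runs on, so no particular result is shown for it.

```python
import locale
import sys

# Default used by str.encode() and bytes.decode() when no codec is given.
str_default = sys.getdefaultencoding()

# Encoding suggested by the locale, typically used by open() in text mode.
file_default = locale.getpreferredencoding(False)
print(str_default, file_default)

# Calling encode() without arguments is the same as asking for utf-8.
implicit = "Äè".encode()
explicit = "Äè".encode("utf-8")
print(implicit == explicit)
```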

## Reading and writing from files

In [None]:
# The with-statement closes the file again once the block ends
with open("corpora/encoding_test_files/ascii.txt") as f3:
    txt3 = f3.read()
print(txt3)
print(txt3[0])

In [None]:
# Tell open() which encoding the file was written in
with open("corpora/encoding_test_files/cp865.txt", encoding="cp865") as f4:
    txt = f4.read()
print(txt)

In [None]:
# No encoding given: the platform default is used, so the result depends on the machine
with open("corpora/encoding_test_files/cp865.txt") as f4a:
    print(f4a.read())

In [None]:
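# No encoding given: this reads correctly only where the platform default is compatible with utf-8
with open("corpora/encoding_test_files/utf8.txt") as f5:
    txt = f5.read()
print(txt)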

In [None]:
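# Wrong codec on purpose: raises UnicodeDecodeError if the file contains any non-ASCII bytes
with open("corpora/encoding_test_files/utf8.txt", encoding="ascii") as f5:
    txt = f5.read()
print(txt)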

In [None]:
with open("corpora/encoding_test_files/koi8_r.txt", encoding="koi8_r") as f5:
    txt = f5.read()
print(txt)

In [None]:
# Binary mode ("rb"): no decoding happens, so read() returns a bytes object
with open("corpora/encoding_test_files/ascii.txt", "rb") as f3b:
    txt = f3b.read()
print(txt)
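
The cells above depend on files in `corpora/encoding_test_files`. The sketch below reproduces the same ideas without them: it writes a temporary file in `cp865`, reads it back with the right and the wrong codec, and finally reads the raw bytes. The sample text and the file name are made up for this illustration.

```python
import os
import tempfile

text = "blåbærsyltetøy"  # hypothetical sample text with non-ASCII letters

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample_cp865.txt")

    # Write the text using cp865.
    with open(path, "w", encoding="cp865") as f:
        f.write(text)

    # Reading with the same codec gives the original text back.
    with open(path, encoding="cp865") as f:
        read_back = f.read()

    # Reading with a different codec fails for these bytes.
    try:
        with open(path, encoding="utf-8") as f:
            f.read()
        wrong_codec_failed = False
    except UnicodeDecodeError:
        wrong_codec_failed = True

    # Binary mode skips decoding and returns the stored bytes unchanged.
    with open(path, "rb") as f:
        raw = f.read()

print(read_back == text, wrong_codec_failed, type(raw).__name__)
```

Passing `encoding=` explicitly on every `open` call keeps the result independent of the machine the notebook runs on.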