## Loading plain text files into Python
Plain text files can be loaded into Python using the `open()` function.

The first argument to the `open()` function must be a string, which contains a *path* to the file that is being opened.


---
### Learn & play

> Object-oriented filesystem paths: https://docs.python.org/3/library/pathlib.html

> Methods of file objects: https://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects







In [None]:
# Check your current working directory. You can also use '!pwd'
import os
cwd = os.getcwd()
cwd

'/content'

In [None]:
# We will use pathlib module
from pathlib import Path
print(Path.cwd())         # Prints current working directory

/content


In [None]:
# Create a Path object that points to the directory 'data' and assign
# the object to the variable 'data_folder'
data_folder = Path("/content/drive/MyDrive/CLT/S1/data/")
jokes = data_folder / "jokes.txt"

In [None]:
# Inspect the paths to data folder and to text file
print(data_folder)
print(jokes)
jokes

/content/drive/MyDrive/CLT/S1/data
/content/drive/MyDrive/CLT/S1/data/jokes.txt


PosixPath('/content/drive/MyDrive/CLT/S1/data/jokes.txt')

The output shows a `PosixPath` object because the notebook runs on a POSIX-compliant system such as Linux or macOS; on Windows, `pathlib` would create a `WindowsPath` instead. POSIX is a set of standards published by the IEEE and The Open Group that describes how a Unix-like operating system should behave.
Since every Unix does things a little differently (Solaris, macOS, IRIX, BSD, and Linux all have their quirks), POSIX is useful because it defines a standard environment to program against. For example, many of the functions in the C library are specified by POSIX; a programmer can, therefore, use one in an application and expect it to behave the same across most Unix systems.
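To illustrate how the path flavour changes the result, here is a minimal sketch using the "pure" path classes from `pathlib`, which can be created on any operating system. The folder names are made up for this example.

```python
from pathlib import PurePosixPath, PureWindowsPath

# The same '/' operator joins path components in both flavours
posix_path = PurePosixPath("/content/data") / "jokes.txt"
windows_path = PureWindowsPath("C:/content/data") / "jokes.txt"

print(posix_path)     # uses forward slashes
print(windows_path)   # uses backslashes and a drive letter
```

Pure paths only manipulate the path text; they never touch the file system, which is why they work regardless of the platform.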

In [None]:
# Open a file and assign it to the variable 'file'
file = open(jokes, mode='r', encoding='utf-8')

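If no `encoding` is given, `open()` falls back to a default that can depend on the platform and locale settings: it is UTF-8 on most Linux and macOS systems, but often not on Windows. It is therefore safer to make the encoding explicit using the `encoding` argument.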

The `encoding` argument takes a string as its input: we pass `utf-8` to the argument to declare that the plain text is encoded in UTF-8.

Moreover, we use the `mode` argument to define that we only want to open the file for *reading*, which is done by passing the string `r` to the argument.
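A file opened with `open()` should also be closed again. A common way to ensure this is a `with` block, sketched below. To keep the example self-contained, it writes a small temporary file first; the name `sample.txt` and its contents are made up and merely stand in for `jokes.txt`.

```python
import tempfile
from pathlib import Path

# Create a throwaway directory and a small sample file (hypothetical data)
tmp_dir = Path(tempfile.mkdtemp())
sample = tmp_dir / "sample.txt"
sample.write_text("first line\nsecond line\n", encoding="utf-8")

# Open the file for reading; the 'with' block closes it automatically
with open(sample, mode="r", encoding="utf-8") as f:
    contents = f.read()

print(contents)
print(f.closed)   # True: the file was closed when the block ended
```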

In [None]:
# Call the variable to examine the object
file

<_io.TextIOWrapper name='/content/drive/MyDrive/CLT/S1/data/jokes.txt' mode='r' encoding='utf-8'>



If we call the variable `file`, we see a Python object (a `TextIOWrapper`) with three attributes: `name`, which holds the path to the file, and the `mode` and `encoding` values that we specified above.

In [None]:
# Use the read() method to read the file contents and assign the
# result to the variable 'text'
text = file.read()

In [None]:
# Call the first 200 characters under the variable 'text'
text[:200]

'"Automatic" simply means that you cannot repair it yourself.\n\n90% of everything is crud.\n\nA Project Manager is like the madam in a brothel. His job is to see\nthat everything comes off right.\n\nA Smith '

Most of the text is legible, but there are numerous `\n` sequences throughout the text. Each `\n` sequence indicates a line break.

This becomes evident if we use Python's `print()` function to print the first 200 characters stored in the `text` variable.

In [None]:
print(text[:200])

"Automatic" simply means that you cannot repair it yourself.

90% of everything is crud.

A Project Manager is like the madam in a brothel. His job is to see
that everything comes off right.

A Smith 


As you can see, Python knows how to interpret `\n` character sequences and inserts a line break whenever it encounters this sequence while printing the string contents.
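The difference between the stored string and its printed form can be reproduced with any string. The sketch below uses a made-up sample string and also shows `str.splitlines()`, which splits a string at its line breaks.

```python
# A short string with two line breaks (made-up sample text)
sample_text = "first line\nsecond line\nthird line"

print(repr(sample_text))   # shows the \n escape sequences
print(sample_text)         # renders each \n as a line break

# Split the string into a list of lines at each line break
lines = sample_text.splitlines()
print(lines)
print(len(lines))
```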



## String Encoding
* human readable text is processed and stored by the computer as bits (0,1)
* we need a mapping scheme to
    * Encode: translate from human readable text to bits
    * Decode: interpret bit sequences and generate text
* multiple definitions of such mappings with different character sets exist, e.g. ASCII, latin-1, ...
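As a minimal sketch of encoding and decoding, the example below converts the same text to bytes with two different mappings and back again. The sample word is chosen only for illustration.

```python
word = "café"   # sample text chosen for this illustration

# Encode: text -> bytes, using two different mappings
as_latin1 = word.encode("latin-1")
as_utf8 = word.encode("utf-8")
print(as_latin1)   # b'caf\xe9'
print(as_utf8)     # b'caf\xc3\xa9'

# Decode: bytes -> text, using the mapping that produced them
print(as_latin1.decode("latin-1"))
print(as_utf8.decode("utf-8"))

# Decoding with the wrong mapping yields garbled text
garbled = as_utf8.decode("latin-1")
print(garbled)
```

The last line shows why the mapping matters: the bytes are unchanged, but interpreting them with a different scheme produces different characters.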

**ASCII** encoding scheme: 1 byte per character

| Bits        | Character          |
| ------------- |-------------:|
| 01000001     | A |
| 01000010     | B |
| 01000011     | C |
| 01000100     | D |
| 01000101     | E |

The original ASCII definition actually only uses 7 bits, and therefore encodes 128 characters. This is enough for English and some special characters. Unicode, in contrast, provides enough space for all alphabets, emojis and more. It assigns each character a code point, a number between 0 and 0x10FFFF; storing every code point at a fixed width takes 32 bits (4 bytes), as in UTF-32. Variable-length encodings are used for space optimization, e.g. **UTF-8** (uses 1-4 bytes), which has been the default source-file encoding and the default for `str.encode()` and `bytes.decode()` since Python 3.
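The variable length of UTF-8 can be checked directly. The following sketch encodes four sample characters (chosen for illustration) and counts the bytes each one needs.

```python
# Sample characters chosen for illustration
samples = ["A", "ö", "€", "😀"]

utf8_lengths = []
for ch in samples:
    encoded = ch.encode("utf-8")
    utf8_lengths.append(len(encoded))
    # Print the character, its code point in hex, and its UTF-8 byte count
    print(ch, hex(ord(ch)), len(encoded))

print(utf8_lengths)   # [1, 2, 3, 4]
```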

In **UTF-8** the first 128 characters are represented in the exact same way as in ASCII. 
```
0xxxxxxx    A single-byte US-ASCII code (from the first 128 characters)
```

The highest ("sign") bit, which is unused in ASCII, can now be used to indicate the start of a multi-byte sequence: the number of leading 1s in the first byte gives the total number of bytes in the sequence, followed by a 0, and the remaining bits (x) contribute to the value:

```
110xxxxx    One more byte follows  
1110xxxx    Two more bytes follow  
11110xxx    Three more bytes follow  
```

For the continuation bytes, the highest two bits are always 1 and 0 and the remaining 6 bits encode the value. This makes handling of corrupted data more robust by distinguishing continuation bytes from the start of the next character.  
```
10xxxxxx    A continuation of one of the multi-byte characters
``` 
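To make the bit layout concrete, here is a minimal sketch that decodes a two-byte sequence by hand. It uses the bytes `0xC3 0xB6` (the UTF-8 encoding of 'ö') and only handles this two-byte case; Python's built-in `bytes.decode()` covers all cases.

```python
first, second = 0xC3, 0xB6   # the two bytes of 'ö' in UTF-8

# Check the marker bits: 110xxxxx for the first byte, 10xxxxxx for the continuation
assert first >> 5 == 0b110
assert second >> 6 == 0b10

# Keep only the payload bits (x) and join them
payload_first = first & 0b00011111     # low 5 bits
payload_second = second & 0b00111111   # low 6 bits
code_point = (payload_first << 6) | payload_second

print(code_point)        # 246
print(chr(code_point))   # ö
```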


---

## Learn & play:
> https://onlineutf8tools.com/convert-decimal-to-utf8


In [None]:
# A unicode string, 6 characters long:
string = 'pythön'
print(string)
print(len(string))
string

pythön
6


'pythön'

In [None]:
# Use str.encode() to convert to a bytes object (the default encoding in Python 3 is utf-8):
string_bytes = string.encode('utf-8')
# and print its representation:
print(string_bytes)

b'pyth\xc3\xb6n'


In [None]:
# The ASCII characters in the bytes object can be represented as is, but the 'ö' is escaped to '\xc3\xb6'. The latter represents two bytes, 0xc3 and 0xb6 in hex:
' '.join(f'{i:08b}' for i in (0xc3, 0xb6))

'11000011 10110110'

This means that the character 'ö' requires two bytes for its binary representation under UTF-8.

In [None]:
# Calling `list()` on a bytes object gives the decimal value for each byte.
list(string_bytes)

[112, 121, 116, 104, 195, 182, 110]

In [None]:
# The corresponding 8-bit sequences
' '.join('{0:08b}'.format(x) for x in string_bytes)

'01110000 01111001 01110100 01101000 11000011 10110110 01101110'

In [None]:
# Print a plain string literal
print('a')

a


In [None]:
type('a')

str

In [None]:
# The 'u' prefix is accepted in Python 3 but has no effect: the result is still str
type(u'a')

str

In [None]:
type(b'a')

bytes

In [None]:
print(b'\xc3\xbc'.decode('utf-8'))

ü


In [None]:
# Check whether a byte string is valid UTF-8 or ASCII by trying to decode it
u_umlaut = b'\xc3\x9c'   # UTF-8 representation of the letter 'Ü'
u_umlaut.decode('utf-8')


'Ü'

In [None]:
u_umlaut.decode('ascii')

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
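The check above can be wrapped in a small helper that catches the error instead of stopping the program. This is a minimal sketch; the function name `can_decode` is made up for this example and is not part of the standard library.

```python
def can_decode(data: bytes, encoding: str) -> bool:
    """Return True if 'data' decodes without error using 'encoding'."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


u_umlaut = b'\xc3\x9c'   # UTF-8 representation of the letter 'Ü'
print(can_decode(u_umlaut, 'utf-8'))   # True
print(can_decode(u_umlaut, 'ascii'))   # False
print(can_decode(b'abc', 'ascii'))     # True
```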

In [None]:
ord('ü')

252

In [None]:
# chr() returns the one-character string for a Unicode code point
chr(252)

'ü'
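Note that the code point returned by `ord()` is not the same as the bytes stored under a given encoding. The short sketch below compares the code point of 'ü' with its byte representation in UTF-8 and in latin-1.

```python
code_point = ord('ü')
print(code_point, hex(code_point))   # 252 0xfc

# chr() and ord() are inverses of each other
print(chr(code_point) == 'ü')        # True

# The same character as bytes under two encodings
utf8_bytes = 'ü'.encode('utf-8')
latin1_bytes = 'ü'.encode('latin-1')
print(utf8_bytes)     # b'\xc3\xbc'
print(latin1_bytes)   # b'\xfc'
```

In latin-1 the single byte equals the code point, whereas UTF-8 spreads the same value over two bytes.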