Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [1]:
NAME = "Teo Kai Wen"
COLLABORATORS = ""

---

# Assignment 5

This assignment gives you practice with file paths, with buffers such as open files and web connections, as well as with string encoding and decoding. You may want to review the [week 06 lecture notes](http://compling.hss.ntu.edu.sg/courses/hg2051/week06.html) or [notebook](http://compling.hss.ntu.edu.sg/courses/hg2051/week06.ipynb).

**NOTE:** Each step of this assignment depends on the step before it. If you're stuck on one part for more than 10 minutes, please email me or ask a classmate for help, as you won't be able to move on otherwise.

For your convenience, use `dirname` as the base directory to store your files in. I will create a `TemporaryDirectory` object for it now. Run the final cell in the notebook to delete the temporary directory.

In [2]:
import os                   # for os.path
from urllib import request  # for request.urlopen()
import tempfile             # for tempfile.TemporaryDirectory

tempdir = tempfile.TemporaryDirectory()
dirname = tempdir.name

### Get the URL of a sample file (1 point)

Visit the following website and select a file. Copy the URL of the file into the `url` variable below.

http://andrew.triumf.ca/multilingual/samples/samples.html

Any of the files in the bulleted list is fine.

In [30]:
url = 'http://andrew.triumf.ca/multilingual/samples/french.html'  # Copy URL here as string

In [31]:
assert url.startswith('http://andrew.triumf.ca/multilingual/samples/'), 'URL is not from the specified domain'

### Download the file to a `bytes` object (2 points)

Use [urllib.request.urlopen()][] with `url` as shown in class to download the file to a variable named `raw`. Do not decode the binary data yet. This step may be done in a single line or several, but it shouldn't be complicated.

[urllib.request.urlopen()]: https://docs.python.org/3/library/urllib.request.html#urllib.request.urlopen

In [37]:
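# `request` was already imported from urllib in the setup cell above
with request.urlopen(url) as response:  # the connection is closed when the block ends
    raw = response.read()  # download the undecoded bytes from the URL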

In [38]:
assert isinstance(raw, bytes), "'raw' is not a bytes object"
assert len(raw) > 0, "'raw' is empty"

### Decoding the bytes to Unicode (1 point)

You will now convert the bytes into a Unicode string with the proper encoding using [bytes.decode()](https://docs.python.org/3/library/stdtypes.html#bytes.decode). Recall the encoding that was specified on the website for the file you downloaded. Now consult the [list of standard encodings](https://docs.python.org/3/library/codecs.html#standard-encodings) in Python and ensure your file's encoding is there (the encoding may be listed in the 'Aliases' column; if it is not listed anywhere, try a similarly-named encoding or choose a different file above). Store the decoded data in the `data` variable.
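
As a rough illustration of why the choice of encoding matters, the sketch below decodes a short, made-up byte string (`sample` is a hypothetical value, not the downloaded file) with two different codecs. It also uses `codecs.lookup()` to check that two spellings of an encoding name resolve to the same codec, which is one way to confirm an alias from the table.

```python
import codecs

# Hypothetical sample: the word 'Français' encoded as ISO 8859-1 (Latin-1).
sample = b'Fran\xe7ais'

# Decoding with the matching codec recovers the text.
decoded = sample.decode('latin_1')
print(decoded)  # Français

# Decoding the same bytes as UTF-8 fails, because 0xE7 on its own
# is not a valid UTF-8 sequence.
try:
    sample.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)  # False

# Aliases resolve to the same codec, so either spelling may be used.
same_codec = codecs.lookup('ISO-8859-1').name == codecs.lookup('latin_1').name
print(same_codec)  # True
```

Not every mismatch raises an error: a wrong single-byte codec can decode without complaint and simply produce the wrong characters, so it is worth inspecting the decoded text.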

In [125]:
data = raw.decode(encoding = "latin_1")  # decode raw using the appropriate encoding

In [65]:
assert isinstance(data, str), "'data' is not a string object"
assert len(data) > 0, "'data' is empty"

### Make a new path (1 point)

Before saving `data` to disk, you need a file path. Make up some path using `dirname` and [os.path.join()][], and assign it to the variable `path`.

Note that I may not be using the same operating system as you when grading, so if you just do string operations like this:

```python
path = dirname + '\\myfile.txt'
```

... it might not work for me. This is why [os.path.join()][] is important.

[os.path.join()]: https://docs.python.org/3/library/os.path.html#os.path.join
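
A minimal sketch of the difference, using a made-up base directory (`base` and `myfile.txt` are placeholders, not the graded values): `os.path.join()` inserts whichever separator the current operating system uses, so the result can be split back into its parts with the other `os.path` functions.

```python
import os

# Hypothetical base directory, built without hard-coding a separator.
base = os.path.join('some', 'base')

# os.path.join() picks the separator for the current platform.
joined = os.path.join(base, 'myfile.txt')
print(joined)

# The joined path splits cleanly back into parent and file name.
parent = os.path.dirname(joined)
name = os.path.basename(joined)
print(parent == base)  # True
print(name)            # myfile.txt

# By contrast, a hard-coded backslash is only a separator on Windows.
hardcoded = base + '\\myfile.txt'
print(os.path.basename(hardcoded) == 'myfile.txt')  # True on Windows only
```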

In [107]:
path = os.path.join(dirname, 'Documents')  # Make up a path for the downloaded file

In [108]:
assert os.path.exists(os.path.dirname(path)), 'Parent directory does not exist'

### Write `data` to a file (2 points)

Use the `with` syntax and the [open()](https://docs.python.org/3/library/functions.html#open) function to open `path` for **writing in text mode with the `utf-8` encoding**. Then write `data` to the open file object. Use `fileobj.write(...)` as shown in class instead of `print(..., file=fileobj)` (as shown in the NLTK reading), because the latter appends an extra newline (`\n` on macOS and Linux, `\r\n` on Windows).
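
To make the newline difference concrete, here is a small sketch that writes the same string both ways into a separate throwaway directory and compares the bytes on disk. The names `demo_dir`, `write.txt`, and `print.txt` are placeholders for illustration and are unrelated to `dirname` and `path`.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as demo_dir:
    write_path = os.path.join(demo_dir, 'write.txt')
    print_path = os.path.join(demo_dir, 'print.txt')

    # write() stores exactly the string it is given.
    with open(write_path, 'w', encoding='utf-8') as fileobj:
        fileobj.write('Salut')

    # print() appends a newline, which text mode translates
    # to the platform's line ending.
    with open(print_path, 'w', encoding='utf-8') as fileobj:
        print('Salut', file=fileobj)

    # Read both files back in binary mode to see the raw bytes.
    with open(write_path, 'rb') as fileobj:
        written = fileobj.read()
    with open(print_path, 'rb') as fileobj:
        printed = fileobj.read()

print(written)                  # b'Salut'
print(len(printed) - len(written))  # number of extra newline bytes
```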

In [109]:
with open(path, 'w', encoding='utf-8') as obj:  # 'w' opens for writing in text mode
    obj.write(data)

In [110]:
assert os.path.isfile(path), 'File does not exist at path'
assert os.stat(path).st_size > 0, 'File at path is empty'

### Read the data back from the file (1 point)

Use the `with` syntax and the [open()](https://docs.python.org/3/library/functions.html#open) function to open `path` for **reading in binary mode**. Read the data to a variable named `raw2`.

In [116]:
with open(path, 'rb') as obj2:
    raw2 = obj2.read()

In [120]:
assert isinstance(raw2, bytes), "'raw2' is not a bytes object"
assert len(raw2) > 0, "'raw2' is empty"

In [129]:
raw.decode('latin_1')

'\n<HEAD>\n<TITLE>French / Français (ISO Latin-1 / ISO 8859-1)</TITLE>\n</HEAD>\n<BODY>\n<H1>French / Français (ISO Latin-1 / ISO 8859-1)</H1>\nBonjour, Salut\n</BODY>\n'

In [130]:
raw2.decode('utf8')

'\n<HEAD>\n<TITLE>French / Français (ISO Latin-1 / ISO 8859-1)</TITLE>\n</HEAD>\n<BODY>\n<H1>French / Français (ISO Latin-1 / ISO 8859-1)</H1>\nBonjour, Salut\n</BODY>\n'

### Comparison (1 point)

Both `raw` and `raw2` are `bytes` objects representing the same data. Are they equal? Why or why not?

In [118]:
raw == raw2

False

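No. `raw` is encoded with latin_1, whereas `raw2` is encoded with utf8. The two encodings represent non-ASCII characters such as 'ç' with different bytes (one byte, `0xE7`, in latin_1; two bytes, `0xC3 0xA7`, in utf8), so the byte sequences differ even though they decode to the same string.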
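
The same effect can be reproduced on a single word. The sketch below is only an illustration with a made-up string (`text`), not the downloaded file: it encodes the string both ways and shows that the bytes differ while the decoded text is identical.

```python
# Hypothetical sample text containing one non-ASCII character.
text = 'Français'

latin1_bytes = text.encode('latin_1')
utf8_bytes = text.encode('utf-8')

print(latin1_bytes)  # b'Fran\xe7ais'
print(utf8_bytes)    # b'Fran\xc3\xa7ais'

# The byte strings differ in content and in length...
print(latin1_bytes == utf8_bytes)          # False
print(len(latin1_bytes), len(utf8_bytes))  # 8 9

# ...but each decodes to the same string with its own codec.
same_text = latin1_bytes.decode('latin_1') == utf8_bytes.decode('utf-8')
print(same_text)  # True
```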

### The capabilities of buffers (1 point)

Buffers are a way to work with streaming data (reading from the hard disk, transmitting across the internet, etc.). Unlike Unicode strings, byte strings, lists, etc., buffers can neither report their size (`len(mybuffer)`) nor allow random access (`mybuffer[100:150]`). Why is that?

(You may notice that some buffers, like when watching YouTube videos, allow you to see how long the video is and to jump to the middle or the end, but these are capabilities of the overall system and not the buffer itself. E.g., the video length (and file size) are sent separately as metadata. And when you jump in the video, the player has to buffer some data again before it will play.)

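A buffer holds only the part of a stream that is currently available, not the whole of the data. The total size may not be known until the stream ends, so a buffer cannot report a length. And because the data arrives and is consumed in order, a buffer can only be read sequentially (iterated over), not indexed at arbitrary positions.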
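
One way to observe this behaviour, sketched here with an operating-system pipe standing in for a network connection (the pipe and the sample bytes are assumptions chosen for illustration): the reader object has no `len()` and cannot be sliced, and each chunk can be read only once, in order.

```python
import os

# Create a pipe: bytes written to one end stream out of the other.
read_fd, write_fd = os.pipe()

# Write a small amount of data, then close the writing end.
with os.fdopen(write_fd, 'wb') as writer:
    writer.write(b'Bonjour, Salut')

with os.fdopen(read_fd, 'rb') as reader:
    # The reader cannot report a length.
    try:
        len(reader)
        has_len = True
    except TypeError:
        has_len = False

    # Nor can it be sliced like a bytes object.
    try:
        reader[0:7]
        has_slicing = True
    except TypeError:
        has_slicing = False

    # Data can only be consumed in order; once read, it is gone.
    first = reader.read(7)
    rest = reader.read()
    after_end = reader.read()

print(has_len, has_slicing)  # False False
print(first)                 # b'Bonjour'
print(rest)                  # b', Salut'
print(after_end)             # b''
```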

## Cleaning up

Run the following cell to delete the temporary directory and its contents.

In [None]:
tempdir.cleanup()