# Files

Python has several libraries for transferring files across the Internet: `urllib`, `urllib2` (Python 2 only), and `urllib3`. Having so many similarly named libraries can be [confusing](https://stackoverflow.com/questions/2018026/what-are-the-differences-between-the-urllib-urllib2-urllib3-and-requests-modul). Probably the easiest one to use is [`requests`](https://2.python-requests.org//en/latest/). All of these libraries use HTTP as the protocol for transferring files.

If you need to parse or build URLs, `urllib.parse` contains some useful functions.

In [1]:
import requests

In [2]:
from urllib.parse import urlparse, quote, quote_plus, unquote, unquote_plus, urljoin

In [3]:
quote_plus('My books: Python Crash Course, 2nd Ed.')

'My+books%3A+Python+Crash+Course%2C+2nd+Ed.'

In [4]:
unquote_plus(_)

'My books: Python Crash Course, 2nd Ed.'
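The other helpers imported above work in a similar spirit. The following is a minimal sketch of `urlparse`, `quote` and `urljoin`; the example strings are chosen only for illustration.

```python
from urllib.parse import urlparse, quote, urljoin

# Split a URL into its components.
url = 'https://introcs.cs.princeton.edu/java/data/pi-1million.txt'
parts = urlparse(url)
print(parts.scheme)   # https
print(parts.netloc)   # introcs.cs.princeton.edu
print(parts.path)     # /java/data/pi-1million.txt

# quote() encodes a space as %20, whereas quote_plus() uses '+'.
encoded = quote('My books: Python')
print(encoded)        # My%20books%3A%20Python

# urljoin() resolves the second argument relative to the base URL.
# Without a trailing slash on the base, its last path segment is replaced.
joined = urljoin('https://introcs.cs.princeton.edu/java/data', 'pi-1million.txt')
print(joined)         # https://introcs.cs.princeton.edu/java/pi-1million.txt
```

The last call shows why a base URL that names a directory should end with a slash.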

In [5]:
assets_url = 'https://introcs.cs.princeton.edu/java/data/'

pi_digits_url = urljoin(assets_url, 'pi-1million.txt')
print(pi_digits_url)
response = requests.get(pi_digits_url)

https://introcs.cs.princeton.edu/java/data/pi-1million.txt


The `response` object has several attributes, among them `status_code` and `content`. The content is a byte string. If the content is text, it is also available as a decoded string through the attribute `text`. When the content is a JSON document, the method `json()` parses it into the corresponding Python object.
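The relationship between bytes, text and parsed JSON can be sketched with the standard library alone. The payload below is hypothetical and merely stands in for the bytes of a response; the sketch also assumes UTF-8, whereas `requests` chooses the encoding based on the response itself.

```python
import json

# Hypothetical payload standing in for the raw bytes of a response.
raw = b'{"name": "pi", "digits": "3141592653"}'

# Decoding the bytes gives a str (roughly what the text attribute holds).
text = raw.decode('utf-8')

# Parsing the text gives a Python object (roughly what json() returns).
data = json.loads(text)

print(type(raw).__name__, type(text).__name__, type(data).__name__)
print(data['digits'])
```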

In [6]:
response.status_code

200

In [7]:
response.content[:100]

b'3141592653589793238462643383279502884197169399375105820974944592307816406286208998628034825342117067'

In [8]:
response.text == response.content.decode()

True

In [9]:
len(response.text)

1000000

In [10]:
len(response.text.split('\n'))

1

In [11]:
len(response.content)

1000000

We can save the response to a local file:

In [12]:
with open('assets/pi_million_digits.txt', mode='w') as f:
    for i in range(0, 1000000, 100):
        print(response.text[i: i + 100], file=f)

We can then check that the contents of the local file are what we expect.

In [13]:
with open('assets/pi_million_digits.txt', mode='r') as f:
    pi_txt = f.read().replace('\n', '')
    
pi_txt == response.text

True

In [14]:
# Count the frequencies of digits.
from collections import Counter

Counter(pi_txt).most_common()

[('5', 100359),
 ('3', 100230),
 ('4', 100230),
 ('9', 100106),
 ('2', 100026),
 ('8', 99985),
 ('0', 99959),
 ('7', 99800),
 ('1', 99757),
 ('6', 99548)]

Let's create a small multi-line file by taking the first 32 characters of the decimal expansion of π and writing them on four lines with eight characters on each line. To do this, we could take four slices of `pi_txt`, but for illustration purposes, we will create an in-memory file with `pi_txt` as contents using `io.StringIO`.
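One possible way to build such an in-memory file is sketched below. The names `pi_head` and `small_file` are introduced here for illustration only, and the digits are typed in literally so that the snippet runs on its own.

```python
import io

# First 32 digits, as they appear at the start of the downloaded text.
pi_head = '31415926535897932384626433832795'

# Write four lines of eight characters each to an in-memory text file.
small_file = io.StringIO()
for i in range(0, len(pi_head), 8):
    print(pi_head[i:i + 8], file=small_file)

# Rewind before reading, just as with a file that was written and reopened.
small_file.seek(0)
lines = small_file.read().splitlines()
print(lines)
```

An `io.StringIO` object supports the same reading methods as a text file opened with `open()`, so it can stand in for a file on disk when trying out the reading techniques.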

Let's look at different ways we can read this file into a string with no newlines.

In [15]:
!head -2 assets/pi_million_digits.txt

3141592653589793238462643383279502884197169399375105820974944592307816406286208998628034825342117067
9821480865132823066470938446095505822317253594081284811174502841027019385211055596446229489549303819


In [22]:
with open('assets/pi_digits.txt', mode='r') as f:
    pi_txt = ''
    line = f.readline()
    while line:
        pi_txt += line.strip()
        line = f.readline()
pi_txt[:4] + '...' + pi_txt[-3:]

'3.14...279'

In Python versions before 3.8, the following is a syntax error on line 3, because in Python (unlike Java) an assignment statement is *not* an expression. Python 3.8 introduced *assignment expressions*. An assignment expression uses the 'walrus operator' `:=` to assign a value to a variable, and the whole expression evaluates to the assigned value.

In [17]:
with open('assets/pi_million_digits.txt', mode='r') as f:
    pi_txt = ''
    while line := f.readline():  # Requires Python 3.8+; a syntax error in earlier versions.
        pi_txt += line.strip()
pi_txt[:10] + '...' + pi_txt[-10:]

'3141592653...0577945815'

`f.readlines()` reads all the lines into a list. 

In [18]:
with open('assets/pi_million_digits.txt', mode='r') as f:
    pi_txt = ''.join(line.strip() for line in f.readlines())
'...'.join((pi_txt[:12], pi_txt[-10:]))

'314159265358...0577945815'

The file object `f` is itself an iterator over its lines. Iterating over `f` has the advantage over `f.readlines()` that it does not store all the lines in a list, but reads the lines from the file as the iteration progresses.

In [19]:
with open('assets/pi_million_digits.txt', mode='r') as f:
    pi_txt = ''.join(line.strip() for line in f)
'...'.join((pi_txt[:12], pi_txt[-10:]))

'314159265358...0577945815'
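Because a file object is its own iterator, `next()` can take a single line from it, and a later loop continues from where the iteration stopped. A small sketch, using an `io.StringIO` object as a stand-in for an open file:

```python
import io

# In-memory stand-in for an open text file with three lines.
f = io.StringIO('31415926\n53589793\n23846264\n')

# Calling iter() on a file object returns the object itself.
same = iter(f) is f
print(same)

# next() consumes the first line, including its newline character.
first = next(f)

# The loop picks up at the second line.
rest = [line.strip() for line in f]
print(repr(first), rest)
```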

In [20]:
with open('assets/pi_million_digits.txt', mode='r') as f:
    head = next(f).strip()  # The first line.
    for line in f:          # Exhaust the iterator; `line` ends up as the last line.
        pass
    tail = line.strip()
'...'.join((head[:12], tail[-10:]))

'314159265358...0577945815'

In [21]:
from datetime import date

today_str = str(date.today()).replace('-', '')[2:]
print(today_str)
try:
    res = f'{today_str} found at position {pi_txt.index(today_str)}'
except ValueError:
    res = f'{today_str} not found among the first 1000000 digits of pi'
res

250123


'250123 not found among the first 1000000 digits of pi'