# CSC 593

## Week 5

### Reading and Writing Files

The built-in function [`open()`](https://docs.python.org/3/library/functions.html#open) returns a "file object" that can be used to read from and/or write to a file.

In general we do file input/output inside code blocks started with the `with` keyword. The `with` block closes the file automatically when the block finishes, even if an error occurs, so we don't have to manage and close the file object ourselves.

We'll usually pass `open()` two arguments: a path to the file on the filesystem, and the *mode* we want to open the file in. To read from the file, use `'r'` as the mode (this is also the default if no mode is given). Use the file object's `read()` method to read the entire file, or `readline()` to read one line at a time.

In [1]:
with open('../data/textfile.txt', 'r') as f:
    print(f.read())

This is just a short text file.

Here is another line of text.
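
As a rough sketch of what `with` does for us, the block below reads the same file twice: once closing the file by hand with `try`/`finally`, and once using `with`. The file `sample.txt` is a hypothetical stand-in, created in a temporary directory so the example runs anywhere.

```python
import tempfile
from pathlib import Path

# Hypothetical sample file, created in a temporary directory
# so this sketch does not depend on the course data folder.
tmpdir = tempfile.mkdtemp()
sample = Path(tmpdir) / 'sample.txt'
sample.write_text('first line\nsecond line\n')

# Without `with`, we have to close the file ourselves, even if reading fails.
f = open(sample, 'r')
try:
    manual_text = f.read()
finally:
    f.close()

# With `with`, the file is closed automatically when the block ends.
with open(sample, 'r') as g:
    with_text = g.read()

print(manual_text == with_text)  # both approaches read the same text
print(f.closed, g.closed)        # both file objects are closed afterwards
```

Both versions leave the file closed; the `with` form is simply shorter and harder to get wrong.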


Use mode 'w' to write to the file:

In [2]:
with open('../data/newfile.txt', 'w') as f:
    f.write("""This is some text.
    
This is some more text.""")

In [3]:
with open('../data/newfile.txt', 'r') as f:
    print(f.readline())

This is some text.



You can also iterate over the file object, line by line:

In [4]:
with open('../data/newfile.txt', 'r') as f:
    for ln in f:
        print(ln)

This is some text.

    

This is some more text.
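
The extra blank lines in that output appear because each line read from the file still ends with its own newline character, and `print()` adds another. Here is a minimal sketch of one way to avoid this. It uses `io.StringIO` as an assumed stand-in for an open file, so the example needs no file on disk.

```python
import io

# io.StringIO behaves like an open text file; it stands in for newfile.txt here.
f = io.StringIO("This is some text.\n\nThis is some more text.")

# Strip the trailing newline from each line before printing it.
lines = [ln.rstrip('\n') for ln in f]
for ln in lines:
    print(ln)
```

Passing `end=''` to `print()` is another option when we'd rather leave the lines untouched.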


Be careful: `open(filename, 'w')` will overwrite existing files.

In [5]:
open('../data/newfile.txt', 'w').close()  # our new file is now empty.

with open('../data/newfile.txt', 'r') as f:
    print(f.read())




Using mode 'x' will open a new file for writing, but raise a `FileExistsError` if the file already exists.

In [6]:
with open('../data/textfile.txt', 'x') as f:
    pass

FileExistsError: [Errno 17] File exists: '../data/textfile.txt'
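
If we'd rather handle that error than let it stop our program, we can catch it. The sketch below is one possible approach, not the only one: `write_once()` is a hypothetical helper, and `report.txt` is a made-up file in a temporary directory.

```python
import os
import tempfile

def write_once(path, text):
    """Write text to path only if the file does not exist yet.

    Returns True if the file was written, False if it was already there.
    (Hypothetical helper for illustration.)
    """
    try:
        with open(path, 'x') as f:
            f.write(text)
        return True
    except FileExistsError:
        return False

# Made-up file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'report.txt')

first = write_once(path, 'original')      # file is created
second = write_once(path, 'replacement')  # file exists, so nothing is written

with open(path, 'r') as f:
    saved = f.read()

print(first, second, saved)
```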

#### Practice

Try opening the class syllabus ('../README.md') and printing the first line.

##### `CSV`

Comma-separated values files are a common data exchange format. Python has built-in support for them:

In [None]:
import csv

To read a CSV file, open it like any other, then read the file object with a `csv.reader()`. Here we use the `next()` function to retrieve the first line of the `familyxx.csv` file, then print the header labels.

`familyxx.csv` is part of the data release from the 2018 [National Health Interview Survey](https://www.cdc.gov/nchs/nhis/index.htm).

In [None]:
with open('../data/nhis/familyxx.csv') as f:
    rdr = csv.reader(f)
    hdr = next(rdr)
    for name in hdr:
        print(name)

##### `zip()`

The [`zip`](https://docs.python.org/3/library/functions.html#zip) function pairs up the elements of two or more iterables (like lists or strings), producing one tuple for each position.

In [None]:
l1 = [1, 2, 3]
l2 = ['a', 'b', 'c']
l3 = ['x', 'y', 'z']
for x in zip(l1, l2, l3):
    print(x)

for x in zip('foo', 'bar'):
    print(x)
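
One detail worth knowing: `zip()` stops as soon as the shortest iterable runs out. The sketch below uses made-up lists to show this, along with `itertools.zip_longest()` from the standard library, which pads the shorter input with `None` instead.

```python
from itertools import zip_longest

# Made-up lists of different lengths.
short = [1, 2]
letters = ['a', 'b', 'c']

# zip() silently drops the unmatched 'c'.
pairs = list(zip(short, letters))

# zip_longest() keeps it and fills the gap with None.
padded = list(zip_longest(short, letters))

print(pairs)
print(padded)
```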

This gives us another way to answer the last question from homework assignment 2:

In [None]:
string1 = 'ABCDEFGHIJ'
string2 = 'ABCDEEGHIJ'

for x in zip(string1, string2):
    print(*x, sep='')

More importantly, it's a convenient way to create dictionaries from two lists:

In [None]:
dict(zip(l2, l1))

Here, we create a list of dictionaries, each containing one row of the `familyxx` data.

In [None]:
with open('../data/nhis/familyxx.csv') as f:
    rdr = csv.reader(f)
    hdr = next(rdr)
    nhis = [dict(zip(hdr, row)) for row in rdr]

In [None]:
print(len(nhis))
print(nhis[0])
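
The `csv` module can also build these dictionaries for us with `csv.DictReader`, which uses the first row as the keys. The sketch below compares the two approaches on a tiny in-memory CSV; the column names and values are invented for illustration and are not taken from the NHIS file.

```python
import csv
import io

# A small made-up CSV held in memory instead of a file on disk.
sample = io.StringIO("id,region,size\n1,North,4\n2,South,2\n")

# Approach used above: read the header, then zip it with each row.
rdr = csv.reader(sample)
hdr = next(rdr)
manual = [dict(zip(hdr, row)) for row in rdr]

# Same result with DictReader, after rewinding to the start of the data.
sample.seek(0)
auto = list(csv.DictReader(sample))

print(manual == auto)
print(auto[0])
```

Either way, every value comes back as a string, so numeric fields still need converting before we do arithmetic with them.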

#### Practice
Import your own dataset, or the NHIS persons file (`../data/nhis/personsx.csv`). Create a list of dictionaries, as I have above.

### Web Scraping



We'll use the [`requests`](https://3.python-requests.org/) module to retrieve data from the web, and [`Beautiful Soup`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to read these pages.

In [None]:
import requests
from bs4 import BeautifulSoup as bs

Get [Wikipedia's list of Rhode Island municipalities](https://en.wikipedia.org/wiki/List_of_municipalities_in_Rhode_Island). A response code of 200 means "OK".

In [None]:
page = requests.get('https://en.wikipedia.org/wiki/List_of_municipalities_in_Rhode_Island')
page

Parse the source of the page with BeautifulSoup and find the table. We know it has the class 'wikitable'. We have to do some tinkering here--the table is messier than the CDC's file.

In [None]:
soup = bs(page.text, 'html.parser')
table = soup.find('table', class_='wikitable')

# Find all the table headers (th elements).
# Remove the footnotes/references from the header cell labels.
headers = [th.text.strip().split('[')[0] for th in table.find_all('th')]

print(headers)

# There are two subheaders under Land Area. We need to make some adjustments to our headers.
lahead = headers[-4]
headers[-4] = lahead + ' sq mi'

# The list.insert() method adds an element to the list at a specified location.
headers.insert(-3, lahead + ' km2')

# Remove the last two elements from the headers list.
headers = headers[:-2]
print(headers)
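
The header adjustments are easier to follow on a short list. The sketch below repeats the same three steps on a shortened, hand-typed stand-in for the scraped headers (the real list has more columns), so we can see what each step does without fetching the page.

```python
# Shortened, hand-typed stand-in for the scraped header list.
headers = ['Name', 'Land area(2010)', 'Population density', 'sq mi', 'km2']

# Step 1: relabel the combined "Land area" header as the square-mile column.
lahead = headers[-4]
headers[-4] = lahead + ' sq mi'

# Step 2: insert() with a negative index places the new item *before*
# the element currently at that position (here, 'Population density').
headers.insert(-3, lahead + ' km2')

# Step 3: drop the two leftover subheaders from the end of the list.
headers = headers[:-2]

print(headers)
```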

In [None]:
ridata = []
for row in table.find_all('tr')[2:]:
    rowdata = [cell.text.strip() for cell in row.find_all('td')]
    ridata.append(dict(zip(headers, rowdata)))

ridata

#### Practice
Find another table on Wikipedia (try searching for "list of..."). Import that table, as I have with the RI towns list.

### Working with lists of data

#### Selecting "rows" or "columns"

Picking a single row by its index is easy--we've been doing this since the second class.

In [None]:
ridata[5]

In [None]:
print(ridata[-1])
del ridata[-1]

We can also choose one or more rows using a list comprehension.

In [None]:
[x for x in ridata if x['County']=="Washington"]

Another option is the [`filter()`](https://docs.python.org/3/library/functions.html#filter) function. For this we need a new language feature: [_lambda_ expressions](https://docs.python.org/3/tutorial/controlflow.html#lambda-expressions). These are small functions that can be used as function or method arguments without first declaring them.

Here's the equivalent of the last expression using `filter()`:

In [None]:
list(filter(lambda x: x['County'] == 'Washington', ridata))

`filter()` takes two arguments:
    
1. A function that returns `True` if we should keep the list (or other iterable) item or `False` otherwise; and
2. our list.

Our first argument above is a lambda function:

`lambda x: x['County'] == 'Washington'`

This is simply a shorthand method of creating a function and using it once. We can get the same effect this way:

In [None]:
def wash_county(x):
    return x['County'] == 'Washington'

list(filter(wash_county, ridata))
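
To check that the three approaches really are interchangeable, here is a small self-contained sketch. The `towns` list is a hand-typed sample with only two fields, standing in for the scraped `ridata`.

```python
# Hand-typed sample standing in for the scraped data.
towns = [
    {'Name': 'Narragansett', 'County': 'Washington'},
    {'Name': 'Providence', 'County': 'Providence'},
    {'Name': 'Westerly', 'County': 'Washington'},
]

def wash_county(x):
    return x['County'] == 'Washington'

# 1. List comprehension
by_comprehension = [x for x in towns if x['County'] == 'Washington']
# 2. filter() with a lambda
by_lambda = list(filter(lambda x: x['County'] == 'Washington', towns))
# 3. filter() with a named function
by_function = list(filter(wash_county, towns))

print(by_comprehension == by_lambda == by_function)
print([x['Name'] for x in by_function])
```

Note that `filter()` returns a lazy iterator, which is why we wrap it in `list()` to see the results.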

We can select a single "column" with a simple list comprehension:

In [None]:
[row['Name'] for row in ridata]

##### Practice
Experiment with selecting rows or columns from one of the datasets you've loaded (your data, `personsx.csv`, or your Wikipedia table).

In [None]:
#Choose a subset of rows

In [None]:
#Choose a column

#### [Sorting](https://docs.python.org/3/howto/sorting.html)

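Sorting a simple list is straightforward with the built-in `sorted()` function, which returns a new sorted list and leaves the original unchanged.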

In [None]:
from random import randrange
somelist = [randrange(100) for x in range(10)]
print(somelist)
print(sorted(somelist))

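Our lists of dictionaries are slightly more complex. Dictionaries can't be compared with each other directly, so we must provide a `key` argument: a function that returns the value to sort on. We can use a `lambda`.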

In [None]:
sorted(ridata, key=lambda muni: muni['Year established'])

We can also use `itemgetter` from the `operator` module:

In [None]:
from operator import itemgetter
sorted(ridata, key=itemgetter('Population(2010)'))
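
One caution: if the values are still strings (as scraped text is until we convert it), `sorted()` compares them character by character rather than by numeric value. The sketch below uses made-up rows to show the difference, and one possible fix: converting inside the `key` function.

```python
from operator import itemgetter

# Made-up rows whose populations are still strings with thousands separators.
rows = [
    {'Name': 'A', 'Population': '9,500'},
    {'Name': 'B', 'Population': '10,200'},
    {'Name': 'C', 'Population': '880'},
]

# Sorting the raw strings puts '10,200' first because '1' sorts before '8' and '9'.
as_text = [r['Name'] for r in sorted(rows, key=itemgetter('Population'))]

# Converting to int inside the key function gives the numeric order.
as_number = [
    r['Name']
    for r in sorted(rows, key=lambda r: int(r['Population'].replace(',', '')))
]

print(as_text)
print(as_number)
```

Adding `reverse=True` to `sorted()` would put the largest value first.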

##### Practice
Experiment with sorting your data.

In [None]:
#Sort one of your open datasets.

#### Cleaning

We can loop over the list to make changes to our data. Here we use the `.isnumeric()` [string method](https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str) to check whether a string contains only numeric characters before we convert it to an `int` or `float`. Periods and commas are not numeric characters, so we use the `.replace()` method to remove them from strings as needed before the check.

In [None]:
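for row in ridata:
    # Years are whole numbers, so digit-only strings can go straight to int.
    if row['Year established'].isnumeric():
        row['Year established'] = int(row['Year established'])
    # Ignore the decimal point for the check, then convert the original string.
    if row['Land area(2010) sq mi'].replace('.', '').isnumeric():
        row['Land area(2010) sq mi'] = float(row['Land area(2010) sq mi'])
    # Remove thousands separators before checking and converting.
    if row['Population(2010)'].replace(',', '').isnumeric():
        row['Population(2010)'] = int(row['Population(2010)'].replace(',', ''))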

In [None]:
ridata[0]
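
`.isnumeric()` is strict: it rejects decimal points, commas, and minus signs. An alternative pattern is to attempt the conversion and fall back if it fails. The helper below, `to_number()`, is a hypothetical sketch of that idea rather than part of the cleaning code above.

```python
def to_number(text):
    """Try to convert a string to int, then float; return it unchanged otherwise.

    Hypothetical helper for illustration. Commas are treated as
    thousands separators and removed first.
    """
    cleaned = text.replace(',', '').strip()
    try:
        return int(cleaned)
    except ValueError:
        pass
    try:
        return float(cleaned)
    except ValueError:
        return text

print('1636'.isnumeric(), '18.5'.isnumeric(), '-3'.isnumeric())
print(to_number('1,052,567'), to_number('18.5'), to_number('-3'), to_number('n/a'))
```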

We can use the `.split()` string method to extract specific parts of a string when we know the string has some regular formatting. For example:

In [None]:
pd = ridata[0]['Population density']
print(ridata[0]['Name']+"'s population density:", pd)
# Population density per square mile:
print("Per square mile:", pd.split('/')[0])

# Per square km:
print("Per square kilometer:", pd.split('(')[1].split('/')[0])
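
To see each `.split()` step on its own, here is a sketch that works on a made-up density string. The format is an assumption modeled on the code above (a value per square mile, followed by a value per square kilometer in parentheses); the numbers are invented.

```python
# Made-up string in the assumed format.
density = '1,234.5/sq mi (476.6/km2)'

# Everything before the first '/' is the per-square-mile figure.
per_sq_mi_text = density.split('/')[0]

# Split on '(' to isolate the parenthetical, then take the text before its '/'.
per_km2_text = density.split('(')[1].split('/')[0]

# Convert to floats, removing the thousands separator first.
per_sq_mi = float(per_sq_mi_text.replace(',', ''))
per_km2 = float(per_km2_text.replace(',', ''))

print(per_sq_mi_text, per_km2_text)
print(per_sq_mi, per_km2)
```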

##### Practice 
Find a field in your data that should be numeric and convert it to integers or floating-point numbers.

#### Derived fields

Sometimes, the numbers we want to analyze are not provided in the data we have, but can be calculated from that data. We'll want to add new fields to the data with our calculated figures.

Earlier, I showed how we could extract population density from the 'Population density' text field. But we can also calculate it from the population and area numbers we've already converted to numeric variables:

In [None]:
pop  = ridata[0]['Population(2010)']
area = ridata[0]['Land area(2010) sq mi']
print(ridata[0]['Name']+"'s population density:", pop/area, "/square mile")

We can add this figure to every row of our data:

In [None]:
for row in ridata:
    row['population_density'] = row['Population(2010)'] / row['Land area(2010) sq mi']

ridata
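
If any row still holds text where we expect a number, the division will fail. The sketch below shows one possible guard, using made-up rows with simplified field names; storing `None` for rows we can't calculate is an assumption, and other choices (such as skipping the row) are just as reasonable.

```python
# Made-up rows with simplified field names; Town C was never cleaned.
towns = [
    {'Name': 'Town A', 'Population': 20000, 'Land area sq mi': 10.0},
    {'Name': 'Town B', 'Population': 4500, 'Land area sq mi': 2.5},
    {'Name': 'Town C', 'Population': 'unknown', 'Land area sq mi': 8.0},
]

for row in towns:
    pop = row['Population']
    area = row['Land area sq mi']
    # Only divide when both values are numbers and the area is positive.
    if isinstance(pop, (int, float)) and isinstance(area, (int, float)) and area > 0:
        row['population_density'] = round(pop / area, 1)
    else:
        row['population_density'] = None

densities = [row['population_density'] for row in towns]
print(densities)
```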

#### Summary Statistics

We've already discussed reading "columns" of data; with the functions in the [`statistics`](https://docs.python.org/3/library/statistics.html) module and the `min()` and `max()` functions, we can summarize those columns.

In [None]:
import statistics

print(statistics.mean([x['Land area(2010) sq mi'] for x in ridata]))
print(min([x['Land area(2010) sq mi'] for x in ridata]), max([x['Land area(2010) sq mi'] for x in ridata]))
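
When we want several statistics for the same column, it can be tidier to build the column once and reuse it. The sketch below does this with made-up area values and also shows `statistics.median()`; the `rows` list and its `area` field are invented for illustration.

```python
import statistics

# Made-up rows standing in for the cleaned data.
rows = [{'area': 5.0}, {'area': 10.0}, {'area': 15.0}, {'area': 30.0}]

# Build the "column" once, then reuse it for every statistic.
areas = [r['area'] for r in rows]

summary = {
    'mean': statistics.mean(areas),
    'median': statistics.median(areas),
    'min': min(areas),
    'max': max(areas),
    'range': max(areas) - min(areas),
}

print(summary)
```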

##### Practice

Calculate the mean and range (maximum and minimum values) for a numeric field in one of the loaded datasets.