# Announcements

* PS7 due tonight, 11:59pm
* PS8 out today, due Tuesday 11/26, 11:59pm
* Quiz 9 this Wednesday
* You are required to stay and correct any outstanding quiz questions after the quiz on Wednesday

# File Input/Output, CSVs

<style>
section.present > section.present { 
    max-height: 90%; 
    overflow-y: scroll;
}
</style>

<small><a href="https://colab.research.google.com/github/brandeis-jdelfino/cosi-10a/blob/main/lectures/notebooks/12_file_io.ipynb">Link to interactive slides on Google Colab</a></small>

# Reading files

Python allows you to read the contents of files:

In [None]:
f = open('../../snippets/names.txt')
print(f.read())

You can also get the data one line at a time:

In [None]:
f = open('../../snippets/names.txt')
for line in f:
    print(line, end='')


Note the `end=''` for the `print`. 

When reading a file, the newlines from the file are returned as part of each line. 

You can strip them out with the `strip()` string method, depending on how you want to use the data.

In [None]:
f = open('../../snippets/names.txt')
for line in f:
    print(f"Name: {line.strip()}")
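To make the trailing newline visible, here is a small sketch that prints each line with `repr()` before and after stripping. It writes its own throwaway file (the name `sample_names.txt` and its contents are made up for illustration), so it does not depend on `names.txt`.

```python
import os
import tempfile

# Create a small throwaway file (hypothetical name and contents).
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, 'sample_names.txt')
f = open(sample_path, 'w')
f.write("Ada\nGrace\nAlan\n")
f.close()

raw_lines = []
clean_lines = []
f = open(sample_path)
for line in f:
    raw_lines.append(line)            # still ends with '\n'
    clean_lines.append(line.strip())  # surrounding whitespace, including '\n', removed
f.close()

# repr() shows the newline character explicitly instead of printing it.
print(repr(raw_lines[0]))
print(repr(clean_lines[0]))
```

Note that `strip()` removes all leading and trailing whitespace, not just the newline. If leading spaces matter, `rstrip('\n')` removes only the trailing newline.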

## File objects

`open()` returns a "file object". File objects have a number of methods, and are also **iterables** - that's why we were able to use a for loop on them.
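As a quick illustration of a couple of those methods, the sketch below uses `readline()` to read a single line and `readlines()` to read whatever is left into a list. The file name and contents are invented for the example.

```python
import os
import tempfile

# Create a small throwaway file (hypothetical name and contents).
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, 'sample_names.txt')
f = open(sample_path, 'w')
f.write("Ada\nGrace\nAlan\n")
f.close()

f = open(sample_path)
first_line = f.readline()   # reads just the first line, newline included
remaining = f.readlines()   # reads the rest of the file into a list of lines
f.close()

print(repr(first_line))
print(remaining)
```

The file object remembers its position, which is why `readlines()` only returns the lines that come after the one `readline()` already consumed.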

# Writing Files

You can write strings to files:

In [None]:
f = open('../../snippets/output.txt', 'w')
f.write("Hello, files!")
f.close()

g = open('../../snippets/output.txt', 'r')
print(f"File contents: {g.read()}")

Note the call to `close()` - files must be closed after writing to them.

## File modes

Notice that we called `open` slightly differently for writing vs. reading:

`open('../../snippets/output.txt', 'w')`  
vs.  
`open('../../snippets/output.txt', 'r')`

The second parameter is the "mode". There are several modes, but the most commonly useful are:

| character | mode |
|:---:|:---|
| r | open for reading (default) |
| w | open for writing, truncating the file first |
| a | open for writing, appending to the end of file if it exists |
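
A minimal sketch of the difference between `'w'` and `'a'`, using a temporary file (the name `modes_demo.txt` is made up for this example):

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'modes_demo.txt')  # hypothetical file name

# 'w' creates the file (or empties it if it already exists).
f = open(demo_path, 'w')
f.write("first\n")
f.close()

# 'a' keeps the existing contents and adds to the end.
f = open(demo_path, 'a')
f.write("second\n")
f.close()

f = open(demo_path, 'r')
after_append = f.read()
f.close()

# Opening with 'w' again throws away everything written so far.
f = open(demo_path, 'w')
f.write("third\n")
f.close()

f = open(demo_path, 'r')
after_truncate = f.read()
f.close()

print(repr(after_append))
print(repr(after_truncate))
```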

## Closing files

Files need to be closed when they are no longer needed. 

We did it above with the `.close()` method.

This is especially important when writing data to files: writes are buffered, so the data may not actually reach the file until `close()` is called!

In [None]:
f = open('../../snippets/output.txt', 'w')
f.write("Hello, files!")
#f.close()

g = open('../../snippets/output.txt', 'r')
print(f"File contents: {g.read()}")

File objects have a `closed` attribute that can be accessed to tell you whether a file has been closed or not.

In [None]:
output_filename = '../../snippets/output.txt'
output_file = open(output_filename, 'w')

print(f"[Before writing to file] Is output_file closed? {output_file.closed}")
output_file.write("Hello, files!")
print(f"[After writing to file] Is output_file closed? {output_file.closed}")
output_file.close()
print(f"[After closing] Is output_file closed? {output_file.closed}")

g = open('../../snippets/output.txt', 'r')
print(f"File contents: {g.read()}")
print(f"[After reading file contents] Is g closed? {g.closed}")

## `with`

There's a convenient way to ensure you don't forget to close a file: a `with` clause.

In [None]:
with open('../../snippets/names.txt', 'r') as f:
    print(f"[Before for loop] Is f closed? {f.closed}")
    
    for line in f:
        print(line, end='')
    
    print(f"[After for loop] Is f closed? {f.closed}")

print(f"[Outside 'with' clause] Is f closed? {f.closed}")

This code opens the file, assigns the file object to the variable `f`, executes the code inside the `with` block, then automatically closes the file when exiting the `with` block.
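
One more benefit worth sketching: the file is closed even if the code inside the `with` block fails partway through. The example below assumes a made-up file, `numbers.txt`, whose last line cannot be converted to an `int`, and uses `try`/`except` only so the error does not stop the cell.

```python
import os
import tempfile

# Create a throwaway file (hypothetical name and contents) with one bad line.
tmp_dir = tempfile.mkdtemp()
numbers_path = os.path.join(tmp_dir, 'numbers.txt')
with open(numbers_path, 'w') as f:
    f.write("40\n2\nnot a number\n")

total = 0
try:
    with open(numbers_path, 'r') as f:
        for line in f:
            total += int(line.strip())  # raises ValueError on the third line
except ValueError:
    print("Stopped at a line that is not a number")

print(f"Total so far: {total}")
print(f"Is f closed? {f.closed}")  # the with block closed it despite the error
```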

## Context managers

File objects are **context managers**, which means they can be used with the `with` statement to manage resources automatically when entering and exiting a `with` block.

There are other context managers in Python, and you can even [write your own](https://docs.python.org/3/reference/datamodel.html#context-managers). 

It's good practice to handle file objects with `with` rather than closing manually.
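
For the curious, here is a rough sketch of what "writing your own" can look like. The `Tracker` class is hypothetical and does nothing useful; it only records when the `with` block is entered and exited. It relies on the `__enter__` and `__exit__` methods described in the documentation linked above.

```python
class Tracker:
    """Hypothetical context manager that records entering and exiting a with block."""

    def __init__(self):
        self.events = []

    def __enter__(self):
        # Runs at the start of the with block; the return value is bound by 'as'.
        self.events.append('enter')
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs when the with block ends, whether or not an error occurred.
        self.events.append('exit')
        return False  # False means: do not hide any exception from the block


with Tracker() as tracker:
    tracker.events.append('inside')

print(tracker.events)
```

File objects follow the same pattern: roughly speaking, their `__exit__` is where the automatic `close()` happens.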

## Exercise

Read in two files, `../../snippets/shakespeare/hamlet.txt` and `../../snippets/shakespeare/macbeth.txt`, and print out all the lines that are found in both files.

First, read a single file into a list and print out some things about it:

In [None]:
macbeth = []
with open('../../snippets/shakespeare/macbeth.txt', 'r') as f:
    for line in f:
        macbeth.append(line.strip())

print(f"Found {len(macbeth)} total lines.")
print()
print("First 10 lines:")
print()
for line in macbeth[:10]:
    print(line)

print()
print("10 lines from the middle:")
print()
for line in macbeth[2000:2010]:
    print(line)


Now read the other file too, and check for overlap:

In [None]:
macbeth = []
hamlet = []

with open('../../snippets/shakespeare/macbeth.txt', 'r') as f:
    for line in f:
        macbeth.append(line.strip())

with open('../../snippets/shakespeare/hamlet.txt', 'r') as f:
    for line in f:
        hamlet.append(line.strip())

for line in hamlet:
    if line in macbeth:
        print(line)

Looks like we need to handle duplicates...

In [None]:
macbeth = set()
hamlet = set()

with open('../../snippets/shakespeare/macbeth.txt', 'r') as f:
    for line in f:
        macbeth.add(line.strip())

with open('../../snippets/shakespeare/hamlet.txt', 'r') as f:
    for line in f:
        hamlet.add(line.strip())

print(hamlet & macbeth)

We can make the output a little prettier:

In [None]:
macbeth = set()
hamlet = set()

with open('../../snippets/shakespeare/hamlet.txt', 'r') as f:
    for line in f:
        hamlet.add(line.strip())

with open('../../snippets/shakespeare/macbeth.txt', 'r') as f:
    for line in f:
        macbeth.add(line.strip())

overlapping = sorted(macbeth & hamlet)  # sorted() accepts any iterable and returns a list
print('\n'.join(overlapping))

## Exercise 2

Do the same thing for all of Shakespeare's plays at once.

In [None]:
macbeth = set()
hamlet = set()
othello = set()
henry_iv_part1 = set()
henry_iv_part2 = set()
# ...

There must be an easier way.

We can use `os.listdir()` to get everything in a directory.

In [None]:
import os
play_files = []
for f in os.listdir('../../snippets/shakespeare/'):
    play_files.append(f)
print(play_files)
print(len(play_files))

Next problem: how do we avoid writing 41 `for` loops?

Let's functionally decompose the problem into 2 steps:
1. Find all file names and load the lines from each file into a set
2. Find the intersection between the lines from each file

1. Find all file names and load the lines from each file into a set

In [None]:
def load_files(directory):
    all_data = []
    for filename in os.listdir(directory): 
        play_lines = set()
        with open(directory + filename, 'r') as f:
            for line in f:
                play_lines.add(line.strip())
        all_data.append(play_lines)
    return all_data

Let's make this more readable by splitting out the code that processes a single file:

In [None]:
def load_file(filename):
    play_lines = set()
    with open(filename, 'r') as f:
        for line in f:
            play_lines.add(line.strip())
    return play_lines

def load_files(directory):
    all_data = []
    for filename in os.listdir(directory):
        all_data.append(load_file(directory + filename))
    return all_data

In [None]:
all_lines = load_files('../../snippets/shakespeare/')

print(f"Loaded {len(all_lines)} files")
for play in all_lines:
    print(f"Unique lines: {len(play)}")

Hm, might be nice to see filenames. Let's do a dictionary of sets.

In [None]:
def load_file(filename):
    play_lines = set()
    with open(filename, 'r') as f:
        for line in f:
            play_lines.add(line.strip())
    return play_lines

def load_files(directory):
    all_data = {}
    for filename in os.listdir(directory):
        all_data[filename] = load_file(directory + filename)
    return all_data

In [None]:
play_lines = load_files('../../snippets/shakespeare/')

print(f"Loaded {len(play_lines)} files")
for play in play_lines:
    print(f"unique lines: {len(play_lines[play])} ({play})")
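
One caveat about `load_files`: building the path with `directory + filename` only works because the directory string ends with a `/`. A possible variation, sketched here on a temporary directory with two made-up files, uses `os.path.join()` so the trailing slash no longer matters.

```python
import os
import tempfile

def load_file(path):
    """Return the set of stripped lines in one file."""
    lines = set()
    with open(path, 'r') as f:
        for line in f:
            lines.add(line.strip())
    return lines

def load_files(directory):
    """Return a dictionary mapping each filename in directory to its set of lines."""
    all_data = {}
    for filename in os.listdir(directory):
        # os.path.join inserts the path separator only if it is needed.
        all_data[filename] = load_file(os.path.join(directory, filename))
    return all_data

# Throwaway directory and files (hypothetical names and contents).
demo_dir = tempfile.mkdtemp()  # note: no trailing slash
with open(os.path.join(demo_dir, 'a.txt'), 'w') as f:
    f.write("x\ny\n")
with open(os.path.join(demo_dir, 'b.txt'), 'w') as f:
    f.write("y\nz\n")

demo_data = load_files(demo_dir)
for name in sorted(demo_data):
    print(name, sorted(demo_data[name]))
```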

2. Find the intersection between the lines from each file

In [None]:
play_lines = load_files('../../snippets/shakespeare/')

In [None]:
common_lines = set()
for play in play_lines:
    common_lines = common_lines & play_lines[play]
    print(common_lines)
print(common_lines)

Seems suspicious... let's debug

In [None]:
common_lines = set()
for play in play_lines:
    common_lines = common_lines & play_lines[play]
    print(f"Lines after intersection with {play}: {len(common_lines)}")
print(common_lines)

Oh, we start with an empty set. An empty set intersected with anything is... an empty set. Start with the first play instead.

In [None]:
common_lines = None
for play in play_lines:
    if common_lines is None:
        common_lines = play_lines[play]
    else:
        common_lines = common_lines & play_lines[play]
    print(f"Lines after intersection with {play}: {len(common_lines)}")
print(common_lines)
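
An alternative to the `None` check, shown here as a sketch on a small made-up dictionary rather than the real plays: `set.intersection()` accepts any number of sets, so all of the values can be passed at once with `*`.

```python
# Hypothetical stand-in for play_lines: filename -> set of lines.
play_lines_demo = {
    'play_a.txt': {'Exit.', 'Enter.', 'line only in a'},
    'play_b.txt': {'Exit.', 'Enter.', 'line only in b'},
    'play_c.txt': {'Exit.', 'line only in c'},
}

# The * unpacks the sets so each one becomes a separate argument.
common_demo = set.intersection(*play_lines_demo.values())
print(common_demo)
```

This version raises a `TypeError` if the dictionary is empty, since there is nothing to intersect, so the loop above is still the safer choice when the directory might contain no files.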

# CSV files

CSV stands for "comma-separated values" (sometimes expanded as "character-separated values", since the separator doesn't have to be a comma).

In CSV files, rows of data are represented by lines in a file, and columns of data are separated by a specific character, called a **delimiter**. Commas (`,`) are commonly used as a delimiter, but any character can be a delimiter.

CSV files are another common way to store structured data, especially if the data is tabular (like a spreadsheet).

Here's an example of CSV data. Each line contains 4 fields: `id`, `name`, `house`, `hair color`:

In [None]:
with open('../../snippets/csv_example.csv', 'r') as f:
    for line in f:
        split_line = line.strip().split(',')
        print(f"{split_line[1]} has {split_line[3]} colored hair")


## Reading CSVs

We could just use `.split(',')` to split each line into a list. But Python provides some nice CSV utilities in the `csv` module.

`csv.reader()` creates an iterable object that produces each line as a list.

In [None]:
import csv
with open('../../snippets/csv_example.csv', 'r') as f:
    reader = csv.reader(f, delimiter=',')
    for line in reader:
        print(line)

# Announcements

* Quiz 9 tonight (files)
* PS8 due next Tuesday 11/26, 11:59pm

## Writing CSVs

`csv.writer()` creates an object with a `writerow` method, which takes a `list` and writes it out as a single CSV row.

In [None]:
import csv
data = [
    ['11', 'Harry', 'Gryffindor', 'Brown'],
    ['18', 'Draco', 'Slytherin', 'Blonde'],
    ['22', 'Cho', 'Ravenclaw', 'Black'],
    ['28', 'Ron', 'Gryffindor', 'Red'],
    ['47', 'Hermione', 'Gryffindor', 'Brown']]

with open('../../snippets/csv_example2.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter='|')
    for d in data:
        writer.writerow(d)
        
with open('../../snippets/csv_example2.csv', 'r') as f:
    for line in f:
        print(line, end='')
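
The writer object also has a `writerows` method, which takes a list of rows and writes them all in one call. A minimal sketch, writing to a temporary file (the name `houses.csv` is made up) and reading the rows back with `csv.reader()` to check the round trip:

```python
import csv
import os
import tempfile

rows = [
    ['11', 'Harry', 'Gryffindor', 'Brown'],
    ['18', 'Draco', 'Slytherin', 'Blonde'],
    ['22', 'Cho', 'Ravenclaw', 'Black']]

tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, 'houses.csv')  # hypothetical file name

with open(csv_path, 'w', newline='') as f:
    writer = csv.writer(f, delimiter='|')
    writer.writerows(rows)  # same as calling writerow() once per row

with open(csv_path, 'r', newline='') as f:
    read_back = list(csv.reader(f, delimiter='|'))

print(read_back)
```

The reader has to be given the same delimiter the writer used; otherwise each line comes back as a single field.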

## How useful is this, really?

Why use the `csv` module, when `split()` and `join()` calls are so easy?

Well, it handles a lot of edge cases - for example, escaped delimiters within fields.

Here's an example using `split()` and `join()`

In [None]:
import csv
data = [
    ['11', 'Harry', 'Gryffindor', 'Brown,mostly'],
    ['18', 'Draco', 'Slytherin', 'Blonde,mostly'],
    ['22', 'Cho', 'Ravenclaw', 'Black,really'],
    ['28', 'Ron', 'Gryffindor', 'Red,very'],
    ['47', 'Hermione', 'Gryffindor', 'Brown']]

with open('../../snippets/csv_example2.csv', 'w', newline='') as f:
    for d in data:
        f.write(','.join(d) + '\n')
        
with open('../../snippets/csv_example2.csv', 'r') as f:
    for line in f:
        print(line.strip().split(','))

Here's the same example using the `csv` module

In [None]:
import csv
data = [
    ['11', 'Harry', 'Gryffindor', 'Brown,mostly'],
    ['18', 'Draco', 'Slytherin', 'Blonde,mostly'],
    ['22', 'Cho', 'Ravenclaw', 'Black,really'],
    ['28', 'Ron', 'Gryffindor', 'Red,very'],
    ['47', 'Hermione', 'Gryffindor', 'Brown']]

with open('../../snippets/csv_example2.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter=',')
    for d in data:
        writer.writerow(d)

with open('../../snippets/csv_example2.csv', 'r') as f:
    reader = csv.reader(f, delimiter=',')
    for line in reader:  
        print(line)
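
To see what the `csv` module is doing differently, it can help to look at the raw text it writes. This sketch writes one row to an in-memory `io.StringIO` buffer (standing in for a real file) and compares a naive `split(',')` with `csv.reader()`.

```python
import csv
import io

row = ['11', 'Harry', 'Gryffindor', 'Brown,mostly']

buffer = io.StringIO()  # in-memory stand-in for a file opened for writing
writer = csv.writer(buffer, delimiter=',')
writer.writerow(row)
raw_text = buffer.getvalue()

# The field containing a comma is wrapped in double quotes.
print(repr(raw_text))

# A naive split cuts the quoted field in two...
naive = raw_text.strip().split(',')
print(naive)

# ...while csv.reader understands the quotes and restores the original field.
parsed = next(csv.reader([raw_text], delimiter=','))
print(parsed)
```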

## Header rows

CSVs often have "header rows":

```
Make, Model, Color, Year
Honda, Accord, Grey, 2013
Tesla, Model Y, Blue, 2022
Ford, F-150, Black, 1991
...
```

If we're reading a file with a header row, we need to skip the first row when loading the data. 

Otherwise, the header row will be mixed in with our data, like this:

In [None]:
with open('../../snippets/cars.csv', 'r') as f:
    reader = csv.reader(f, delimiter=',')
    for line in reader:  
        print(line)

One way to skip the first row is with the `next` function, which gets the next row from the reader. 

Here we get the first row, and throw it away:

In [None]:
with open('../../snippets/cars.csv', 'r') as f:
    bob = csv.reader(f, delimiter=',')
    next(bob)
    for line in bob:  
        print(line) 

Another way is to read all the rows into a list, then drop the first row afterward by slicing:

In [None]:
data = []
with open('../../snippets/cars.csv', 'r') as f:
    bob = csv.reader(f, delimiter=',')
    for line in bob:  
        data.append(line)

data = data[1:]

for line in data:
    print(line)
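
A third option, not used in the examples above, is `csv.DictReader`, which treats the first row as field names and produces each remaining row as a dictionary. The sketch below reads from an in-memory string instead of `cars.csv`; the sample text is typed in by hand and is only assumed to resemble the real file.

```python
import csv
import io

# Hand-typed sample data; io.StringIO stands in for an opened file.
sample_text = (
    "Make,Model,Color,Year\n"
    "Honda,Accord,Grey,2013\n"
    "Tesla,Model Y,Blue,2022\n"
)

reader = csv.DictReader(io.StringIO(sample_text))
cars = list(reader)  # the header row is consumed as keys, not returned as data

for car in cars:
    print(f"{car['Year']} {car['Make']} {car['Model']}")
```

As with `csv.reader()`, every value comes back as a string, so `car['Year']` is `'2013'` rather than the integer `2013`.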

# JSON

We know how to read and write strings, but what about other types - ints, floats, lists, dictionaries?

Enter **JSON**: "**J**ava**S**cript **O**bject **N**otation". 

JSON is a **data exchange format**: a method of representing data as sequences of characters (strings) which can be interpreted by many programming languages.

The JSON format can represent strings, ints, floats, booleans, lists, dictionaries, and `None`. 

A Python data structure:

In [None]:
data = {
    "name": "John",
    "age": 30,
    "candy_preferences": [
        "Reese's",
        "Snickers"
    ]
}

This data structure can be represented as a string in JSON format (or: a "JSON string"):

In [None]:
'{"name":"John", "age":30, "candy_preferences":["Reese\'s", "Snickers"]}'

## Why?

This looks very similar to the way that Python prints data structures... why is this useful?

JSON is not Python-specific. If you use a Python program to create a JSON string, it will be readable by many other programming languages.

It's a **data exchange format**, and it's very commonly used.

# JSON in Python

Python has the `json` module, which contains utilities for reading and writing JSON.

In [None]:
import json
mydata = {
    "numbers": [1,2,3,4],
    "another number": 2.75,
    "more dictionaries": [{'a': 1, 'b': 2, 'c': 3}] 
}
json.dumps(mydata)

## dump / dumps

The `dump` and `dumps` functions **serialize** data structures:
* `json.dumps(<object>)` **serializes** a data structure to a string.  
* `json.dump(<object>, <file object>)` **serializes** a data structure to a string and writes the string to a file.

"Serializing" a data structure means converting it to a string (or bytes) representation.

An easy way to remember the difference: The `s` on `dumps` actually stands for `string`.
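
A small side-by-side sketch of the two. An `io.StringIO` buffer stands in for a real file object here, so the example does not touch the disk.

```python
import io
import json

record = {"name": "John", "age": 30}

# dumps: returns the JSON text as a string.
as_text = json.dumps(record)
print(as_text)

# dump: writes the same JSON text to a file-like object and returns nothing.
buffer = io.StringIO()  # in-memory stand-in for a file opened with 'w'
json.dump(record, buffer)
file_text = buffer.getvalue()
print(file_text)
```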

## Pretty printing

`json.dumps()` also takes an optional `indent` parameter. If specified, it will "pretty print" the JSON:

In [None]:
import json
mydata = {
    "numbers": [1,2,3,4],
    "another number": 2.75,
    "more dictionaries": [{'a': 1, 'b': 2, 'c': 3}]
}
print(json.dumps(mydata, indent=4))

## load / loads

`load`/`loads` do the opposite of `dump`/`dumps`: they **parse** strings into data structures.
* `json.loads(<str>)` **parses** a string into a data structure.  
* `json.load(<file object>)` reads the contents of a file and **parses** it into a data structure.

In [None]:
import json
mydata = {
    "numbers": [1,2,3,4],
    "another number": 2.75,
    "more dictionaries": [{'a': 1, 'b': 2, 'c': 3}]
} 
with open('../../snippets/test.json', 'w') as f:
    json.dump(mydata, f)

with open('../../snippets/test.json', 'r') as f:
    data = json.load(f)

print(data)
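
The round trip is not always exact, because JSON has fewer types than Python. A short sketch with `dumps` and `loads` on made-up data shows what changes: `True` and `None` are written as `true` and `null`, tuples come back as lists, and dictionary keys come back as strings.

```python
import json

original = {
    "active": True,
    "nickname": None,
    "scores": (1, 2.5),   # a tuple
    1: "int key",         # a non-string key
}

text = json.dumps(original)
print(text)

restored = json.loads(text)
print(restored)
print(restored == original)
```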

## Exercise: Let's play around with some movie data

You can grab data scraped from Wikipedia on movies, in JSON format, here: https://raw.githubusercontent.com/prust/wikipedia-movie-data/master/movies.json

In [None]:
import json
with open('../../snippets/movie_data.json', 'r') as f:
    json_movies = json.load(f)
print(f"Loaded {len(json_movies)} movies")

In [None]:
print(json_movies[0])

In [None]:
print(json_movies[-100])

In [None]:
# Or, a pretty-printed version:
print(json.dumps(json_movies[-100], indent=2))

Cool! Let's print all the movies from a given year.

In [None]:
year = 2021
for movie in json_movies:
    if movie['year'] == year:
        print(movie['title'])

How about searching based on title? 

In [None]:
search_string = "Freedom"
for movie in json_movies:
    if search_string in movie['title']:
        print(f"{movie['title']} ({movie['year']})")

OK, how about a cast search? Find every movie that had a certain cast member.

In [None]:
person = "Christopher Walken"
for movie in json_movies:
    if person in movie['cast']:
        print(f"{movie['title']} ({movie['year']}) had {person} in it")

## Exercise: MTA Ridership Data (CSV)

[Open the class exercises Codespace](https://codespaces.new/brandeis-cosi-10a/class-exercises?quickstart=1)

Open the file `exercises/09/01_mta_ridership/README.md` and follow the instructions.
* If you don't see this folder: Open the file: `get_exercises.sh`, click the "Run" button at the top right of the editor.


## Exercise: Airport Data (JSON)

[Open the class exercises Codespace](https://codespaces.new/brandeis-cosi-10a/class-exercises?quickstart=1)

Open the file `exercises/09/02_airport_data/README.md` and follow the instructions.
* If you don't see this folder: Open the file: `get_exercises.sh`, click the "Run" button at the top right of the editor.
