# Project 2c: Goals and Deliverables

The goals of this assignment are:
* To review basic word-level NLP tasks using dictionaries, sets, lists, and file reading and writing.


To complete this project successfully, follow these steps:
1. From Moodle, accept the assignment.
2. Open and set up a code space (install a Python kernel and select it).
3. Complete the notebook and commit it to GitHub. Make sure to answer all questions, and to commit the notebook in a "run" state!
4. Edit the README.md file. Provide your name, your class year, links to/descriptions of any extensions and a list of resources. Make sure to paste the output from running spacy-on-files.py!
5. Commit your code often. We will take the last commit before the deadline as your submission of the project.

For extra credit:
* Read the spaCy docs (https://spacy.io/usage/models). Figure out how to make spaCy work for another language. Add a starter question asking the user to indicate the language. 
* We are doing all of this on the command line. If you got Mercury web apps running, make some of this work in Mercury.
* Take a look at the [spaCy universe pdf reader](https://spacy.io/universe/project/spacypdfreader/). Download the pdfs for at least two of the papers in our corpus and load them using this reader. Does this reader give cleaner text than Constellate, or not? Justify your answer.
* Your other ideas are welcome! If you'd like to discuss one with Dr Stent, feel free.

# Setup

## Let's Install Our Packages

On the command line (in the terminal), type:

% `pip install -r requirements.txt`

## Let's Upload Our Data

From Moodle, download `files.zip`. 

Then, upload `files.zip` to the code space.

Click on the blue circle.

Right-click on `files.zip` and click `add to .gitignore`.

## Let's Uncompress Our Data

In the terminal, type:

% `mkdir texts`

% `cp files.zip texts`

% `cd texts`

% `unzip files.zip`

% `rm files.zip`

Click on the blue circle.

Select all the files that start with `ark` and click `add to .gitignore`.
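
If you would rather do the uncompressing step from Python than from the terminal, the sketch below shows one way, using `zipfile.ZipFile.extractall`. The helper name `extract_zip` and the tiny demo archive are made up for illustration; in the project you would point it at `files.zip` and `texts`.

```python
import os
import tempfile
import zipfile

def extract_zip(zip_path, target_dir):
    """Unpack every member of zip_path into target_dir and list what landed there."""
    # like `mkdir texts`, but no error if the folder already exists
    os.makedirs(target_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zipf:
        # like `unzip files.zip`, run inside target_dir
        zipf.extractall(target_dir)
    return sorted(os.listdir(target_dir))

# build a small stand-in archive so the sketch runs on its own
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, 'demo.zip')
    with zipfile.ZipFile(demo_zip, 'w') as zipf:
        zipf.writestr('a.txt', 'first file')
        zipf.writestr('b.txt', 'second file')
    extracted = extract_zip(demo_zip, os.path.join(tmp, 'texts'))
    print(extracted)
```

Unlike the terminal steps, this leaves the original zip file where it was, so there is nothing to copy or remove afterwards.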

# Reading Files in Python

In Python we can use the packages [`glob`](https://docs.python.org/3/library/glob.html) (to find files whose names match a pattern) and [`zipfile`](https://docs.python.org/3/library/zipfile.html) (to read and write zip archives), among others, to work with files.

## Opening a Regular File

You can use the following syntax to open a regular file (note, the text in `colby_college.txt` comes from Wikipedia):


In [21]:
# let's talk about this encoding business!
with open('colby_college.txt', encoding='utf-8') as f:
    # read the lines all at once
    # how could you read the lines one by one?
    lines = f.readlines()
    # print them; are they a string, set, list, dictionary or something else?
    print(lines)

['Colby College is a private liberal arts college in Waterville, Maine. Founded in 1813 as the Maine Literary and Theological Institution, it was renamed Waterville College in 1821. The donations of Christian philanthropist Gardner Colby saw the institution renamed again to Colby University before settling on its current title, reflecting its liberal arts college curriculum, in 1899. Approximately 2,000 students from more than 60 countries are enrolled annually. The college offers 54 major fields of study and 30 minors.\n', '\n', 'Located in central Maine, the 714-acre Neo-Georgian campus sits atop Mayflower Hill and overlooks downtown Waterville and the Kennebec River Valley. Along with fellow Maine institutions Bates College and Bowdoin College, Colby competes in the New England Small College Athletic Conference (NESCAC) and the Colby-Bates-Bowdoin Consortium.\n', '\n', 'History']
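
One possible answer to the "one by one" question in the comments above: a file object can be looped over directly, and each pass of the loop hands back one line as a string. This is only a sketch; the helper name `count_lines_and_chars` and the stand-in file `sample.txt` are invented here so the example runs without `colby_college.txt`.

```python
import os
import tempfile

def count_lines_and_chars(path):
    """Read a text file one line at a time; return (number of lines, number of characters)."""
    n_lines = 0
    n_chars = 0
    with open(path, encoding='utf-8') as f:
        # looping over the file object yields one line (a str) per iteration
        for line in f:
            n_lines += 1
            n_chars += len(line)
    return n_lines, n_chars

# build a small stand-in file so the sketch runs on its own
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, 'sample.txt')
    with open(sample_path, 'w', encoding='utf-8') as f:
        f.write('first line\n\nthird line')
    result = count_lines_and_chars(sample_path)
    print(result)
```

Reading this way keeps only one line in memory at a time, which matters for large files; `readlines()` builds the whole list first.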


## Opening a Zip File

You can use the following syntax to open a zip file (note, our zip file comes from Constellate!):

In [20]:
# import zipfile package
import zipfile

# make a pattern
pattern = 'files.zip'

# make a zipfile object
zipf = zipfile.ZipFile(pattern)
# for each file_name in the zip file
for file_name in zipf.namelist():
    # open file_name as f
    with zipf.open(file_name) as f:
        # get all the text; why do we decode it to utf-8 and what is that? 
        # what if we read it line by line, how would that work?
        text = ''.join([x.decode('utf-8') for x in f.readlines()])
        # print the file name and the number of characters in the text
        print(file_name, len(text))


ark:__27927_phx1wcjq0tm 83155
ark:__27927_phz35174v0z 18637
ark:__27927_phz8qhfbxzm 35441
ark:__27927_phzbjns29gn 81655
ark:__27927_phzjj6kfdxp 85290
ark:__27927_phzkfzqzs41 18604
ark:__27927_phzmmfj893c 37058
ark:__27927_phznswfkrxz 24534
ark:__27927_phzpdcpvdnb 71315
ark:__27927_phzq26wnjzn 20182
ark:__27927_phzq8c34ggp 41636
ark:__27927_pjb16g9m9r7 97709
ark:__27927_pjb1wn175cv 37144
ark:__27927_pjb1z5xzrx7 102877
ark:__27927_pjb1z8505hp 88642
ark:__27927_pjb3ptfm8xd 108700
ark:__27927_pjb5s37cx32 64217
ark:__27927_pjb65xt4m6r 130322
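
The comments in the cell above ask how reading line by line would work for a zip member. Files opened from a zip archive give back bytes, so one option (a sketch, not the only way) is to wrap the byte stream in `io.TextIOWrapper`, which decodes it to text as you loop. The helper name `zip_member_lengths` and the demo archive are made up for illustration.

```python
import io
import os
import tempfile
import zipfile

def zip_member_lengths(zip_path):
    """Return {member name: number of characters} for every file in a zip archive."""
    lengths = {}
    with zipfile.ZipFile(zip_path) as zipf:
        for file_name in zipf.namelist():
            with zipf.open(file_name) as raw:
                # wrap the byte stream so each line arrives as a decoded str
                reader = io.TextIOWrapper(raw, encoding='utf-8')
                lengths[file_name] = sum(len(line) for line in reader)
    return lengths

# build a small stand-in archive so the sketch runs on its own
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, 'demo.zip')
    with zipfile.ZipFile(demo_zip, 'w') as zipf:
        zipf.writestr('a.txt', 'héllo\nworld\n')
        zipf.writestr('b.txt', 'one line')
    lengths = zip_member_lengths(demo_zip)
    print(lengths)
```

Note that the count is in characters, not bytes: `é` takes two bytes in UTF-8 but counts as one character once decoded.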


## Opening a Bunch of Regular Files

You can use the following syntax to open a bunch of files with the `glob` package (note, `*` matches any number of characters):

In [24]:
# import glob
import glob

# set a pattern
pattern = 'c*.txt'

# read the file names matching the pattern
for file_name in glob.glob(pattern):
    # print file_name
    print(file_name)
    # open file_name as f
    with open(file_name, encoding='utf-8') as f:
        # get all the text; how would we read it one line at a time?
        text = f.readlines()
        # print the list of lines
        print(text)
        # how would we turn that into one string?

        # print the file name and the number of characters in the text
        

colby_college.txt
['Colby College is a private liberal arts college in Waterville, Maine. Founded in 1813 as the Maine Literary and Theological Institution, it was renamed Waterville College in 1821. The donations of Christian philanthropist Gardner Colby saw the institution renamed again to Colby University before settling on its current title, reflecting its liberal arts college curriculum, in 1899. Approximately 2,000 students from more than 60 countries are enrolled annually. The college offers 54 major fields of study and 30 minors.\n', '\n', 'Located in central Maine, the 714-acre Neo-Georgian campus sits atop Mayflower Hill and overlooks downtown Waterville and the Kennebec River Valley. Along with fellow Maine institutions Bates College and Bowdoin College, Colby competes in the New England Small College Athletic Conference (NESCAC) and the Colby-Bates-Bowdoin Consortium.\n', '\n', 'History']
columbia_university.txt
["Columbia University, officially titled as Columbia Universit
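
The comments in the cell above ask how to turn the list from `readlines()` into one string. A minimal sketch, assuming the same `c*.txt` style of pattern: join the lines with an empty separator, then take the length. The helper name `text_lengths` and the three demo files are invented so the example runs on its own.

```python
import glob
import os
import tempfile

def text_lengths(pattern):
    """Return {file name: number of characters} for every file matching pattern."""
    lengths = {}
    for file_name in sorted(glob.glob(pattern)):
        with open(file_name, encoding='utf-8') as f:
            lines = f.readlines()
        # the lines keep their '\n', so joining on '' rebuilds the original text
        text = ''.join(lines)
        lengths[os.path.basename(file_name)] = len(text)
    return lengths

# build a few stand-in files so the sketch runs on its own
with tempfile.TemporaryDirectory() as tmp:
    for name, content in [('cat.txt', 'meow'), ('cow.txt', 'moo\nmoo'), ('dog.txt', 'woof')]:
        with open(os.path.join(tmp, name), 'w', encoding='utf-8') as f:
            f.write(content)
    # glob.escape protects any special characters in the folder name itself
    lengths = text_lengths(os.path.join(glob.escape(tmp), 'c*.txt'))
    print(lengths)
```

Only `cat.txt` and `cow.txt` match, because `c*.txt` requires the name to start with `c`. Calling `f.read()` would give the same single string in one step.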

Questions:
1. *What Python package helps us read and write zip files?*
2. *What Python package helps us read and write multiple files at once?*
3. *What is a `wildcard` character used by that package?*
4. *What happens if we try to load a file that doesn't exist?*
5. *A file opens and we read it line by line. What if there's a funny character in the file?*
6. *A file opens and we read it using `readlines()`. What do we get as output?*
7. *How do we read a file line by line?*
8. *How do we read all the lines of a file at once?*
9. *How do we turn (the text in a) file into a spaCy document?*
10. *How do we deal with file reading errors?*
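
As a starting point for the questions about missing files and funny characters, here is a hedged sketch of one way to catch the two most common problems: `FileNotFoundError` when the path does not exist, and `UnicodeDecodeError` when the bytes are not valid UTF-8. The helper name `read_text_safely` and the demo files are made up for illustration, and returning `None` on failure is just one design choice.

```python
import os
import tempfile

def read_text_safely(path):
    """Return the file's text, or None if the file is missing or is not valid UTF-8."""
    try:
        with open(path, encoding='utf-8') as f:
            return f.read()
    except FileNotFoundError:
        print(f'{path}: no such file')
    except UnicodeDecodeError as err:
        print(f'{path}: not valid UTF-8 ({err.reason})')
    return None

# build stand-in files so the sketch runs on its own
with tempfile.TemporaryDirectory() as tmp:
    good_path = os.path.join(tmp, 'good.txt')
    bad_path = os.path.join(tmp, 'bad.txt')
    with open(good_path, 'w', encoding='utf-8') as f:
        f.write('café')
    with open(bad_path, 'wb') as f:
        # 'café' encoded as Latin-1: the byte 0xE9 alone is not valid UTF-8
        f.write(b'caf\xe9')
    good_text = read_text_safely(good_path)
    bad_text = read_text_safely(bad_path)
    missing_text = read_text_safely(os.path.join(tmp, 'missing.txt'))
```

If you would rather keep going than skip a badly encoded file, `open()` also accepts `errors='replace'`, which swaps each undecodable byte for a replacement character instead of raising.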
