# Why do we care about Python?

* We can use Python to automate boring and repetitive tasks
* We can use Python to clean, transform, and manipulate data (not to mention perform analysis)

## Working with Data: The Federalist Papers


* Who wrote the Federalist Papers? Alexander Hamilton, James Madison, or John Jay?  
* For more than 150 years, historians argued over the authorship of 12 of the 85 essays in _The Federalist Papers_.
* It wasn't until 1963 that the mystery was solved by Frederick Mosteller of Harvard University and David Wallace of the University of Chicago. They used statistics and computation to analyze [the style of the text](https://en.wikipedia.org/wiki/Stylometry) and determine the author of each disputed paper.

Full text of _The Federalist Papers_ is available at http://www.gutenberg.org/ebooks/1404

### Looking at the data

<- Open the file called `federalist_papers.txt` in JupyterLab to see the whole file.

### Loading the Data into Python

* We can use Python to manipulate the file, but first we need to load it into memory.

In [1]:
# Path to our data file (source file)
source_file_name = "federalist_papers.txt"

# open the text file and read it into memory
with open(source_file_name, encoding="utf-8") as f:
    fed_papers_text = f.read()

# Display the first 1000 characters
print(fed_papers_text[0:1000])

The Project Gutenberg EBook of The Federalist Papers, by
Alexander Hamilton, John Jay, and James Madison

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org


Title: The Federalist Papers

Author: Alexander Hamilton, John Jay, and James Madison

Release Date: July, 1998  [Etext #1404]
Posting Date: November 6, 2009

Language: English


*** START OF THIS PROJECT GUTENBERG EBOOK THE FEDERALIST PAPERS ***




Produced by The Consitution Society and Anonymous Volunteers





THE FEDERALIST PAPERS

By Alexander Hamilton, John Jay, and James Madison




FEDERALIST No. 1

General Introduction

For the Independent Journal. Saturday, October 27, 1787


HAMILTON

To the People of the State of New York:

AFTER an unequivocal experience of the inefficacy of the subsisting
federal government, you are cal
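The output above shows that the file opens with a Project Gutenberg header before the essays begin. As an optional step, here is a minimal sketch of skipping that header. It works on a short made-up `sample_text` rather than the real file, the helper name `strip_header` is hypothetical, and the marker string is copied from the output above.

```python
# Marker line copied from the file output shown above
START_MARKER = "*** START OF THIS PROJECT GUTENBERG EBOOK THE FEDERALIST PAPERS ***"


def strip_header(text, marker=START_MARKER):
    """Return the text after the marker, or the text unchanged if the marker is absent."""
    position = text.find(marker)
    if position == -1:
        # Marker not found: leave the text as it is
        return text
    return text[position + len(marker):]


# Made-up stand-in for the real file contents (an assumption for illustration)
sample_text = (
    "Title: The Federalist Papers\n\n"
    + START_MARKER
    + "\n\nFEDERALIST No. 1\n\nGeneral Introduction\n"
)

body = strip_header(sample_text)
print(body.strip())
```

The cells that follow keep working on the full, untrimmed text, so their results include words from the header as well.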

* Now that we have the text file loaded we can begin to manipulate it.
* The main thing we want to do is calculate *word frequencies*.
    * Let's count the frequency of the words "while" and "whilst".
* So first we need to divide it up into separate words.

In [2]:
# Split the text into separate words based on spaces
word_list = fed_papers_text.split(" ")

# Display the first 100 words
print(word_list[0:100])

['The', 'Project', 'Gutenberg', 'EBook', 'of', 'The', 'Federalist', 'Papers,', 'by\nAlexander', 'Hamilton,', 'John', 'Jay,', 'and', 'James', 'Madison\n\nThis', 'eBook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'at', 'no', 'cost', 'and', 'with\nalmost', 'no', 'restrictions', 'whatsoever.', '', 'You', 'may', 'copy', 'it,', 'give', 'it', 'away', 'or\nre-use', 'it', 'under', 'the', 'terms', 'of', 'the', 'Project', 'Gutenberg', 'License', 'included\nwith', 'this', 'eBook', 'or', 'online', 'at', 'www.gutenberg.org\n\n\nTitle:', 'The', 'Federalist', 'Papers\n\nAuthor:', 'Alexander', 'Hamilton,', 'John', 'Jay,', 'and', 'James', 'Madison\n\nRelease', 'Date:', 'July,', '1998', '', '[Etext', '#1404]\nPosting', 'Date:', 'November', '6,', '2009\n\nLanguage:', 'English\n\n\n***', 'START', 'OF', 'THIS', 'PROJECT', 'GUTENBERG', 'EBOOK', 'THE', 'FEDERALIST', 'PAPERS', '***\n\n\n\n\nProduced', 'by', 'The', 'Consitution', 'Society', 'and', 'Anonymous', 'Volunteers\n\n\n\n\n\nTHE', 'FEDERALI

* Will this work?  Are words always separated by spaces?
* While there are specialized Python libraries for dealing with text parsing (for example, [spaCy](https://spacy.io/) and the [Natural Language Toolkit](https://www.nltk.org/)), we are going to keep it simple.
* For now we'll clean our text data in a quick and dirty way by removing the main punctuation marks.

In [3]:

# make a list of punctuation marks
punctuation_marks = ['!','.', ',', ':', ';', '?', '-', '\n','\ufeff']
# Remove them all from the text (each mark is replaced with an empty string)
for pm in punctuation_marks:
    fed_papers_text = fed_papers_text.replace(pm, '')
                     
# Display the first 1000 characters
print(fed_papers_text[0:1000])

The Project Gutenberg EBook of The Federalist Papers byAlexander Hamilton John Jay and James MadisonThis eBook is for the use of anyone anywhere at no cost and withalmost no restrictions whatsoever  You may copy it give it away orreuse it under the terms of the Project Gutenberg License includedwith this eBook or online at wwwgutenbergorgTitle The Federalist PapersAuthor Alexander Hamilton John Jay and James MadisonRelease Date July 1998  [Etext #1404]Posting Date November 6 2009Language English*** START OF THIS PROJECT GUTENBERG EBOOK THE FEDERALIST PAPERS ***Produced by The Consitution Society and Anonymous VolunteersTHE FEDERALIST PAPERSBy Alexander Hamilton John Jay and James MadisonFEDERALIST No 1General IntroductionFor the Independent Journal Saturday October 27 1787HAMILTONTo the People of the State of New YorkAFTER an unequivocal experience of the inefficacy of the subsistingfederal government you are called upon to deliberate on a newConstitution for the United States of Ameri

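* Notice we removed a bunch of the formatting too: because line breaks were deleted rather than replaced with spaces, words at the ends of lines now run into the next word (for example, `byAlexander`).
* We still have a little bit of text processing to do.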

In [4]:
# It would be a good idea to convert everything to lower case before we do anything else
fed_papers_text = fed_papers_text.lower()

# Now let's build a list of words
word_list = fed_papers_text.split(" ")

# Display the first 100 words
print(word_list[0:100])

['the', 'project', 'gutenberg', 'ebook', 'of', 'the', 'federalist', 'papers', 'byalexander', 'hamilton', 'john', 'jay', 'and', 'james', 'madisonthis', 'ebook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'at', 'no', 'cost', 'and', 'withalmost', 'no', 'restrictions', 'whatsoever', '', 'you', 'may', 'copy', 'it', 'give', 'it', 'away', 'orreuse', 'it', 'under', 'the', 'terms', 'of', 'the', 'project', 'gutenberg', 'license', 'includedwith', 'this', 'ebook', 'or', 'online', 'at', 'wwwgutenbergorgtitle', 'the', 'federalist', 'papersauthor', 'alexander', 'hamilton', 'john', 'jay', 'and', 'james', 'madisonrelease', 'date', 'july', '1998', '', '[etext', '#1404]posting', 'date', 'november', '6', '2009language', 'english***', 'start', 'of', 'this', 'project', 'gutenberg', 'ebook', 'the', 'federalist', 'papers', '***produced', 'by', 'the', 'consitution', 'society', 'and', 'anonymous', 'volunteersthe', 'federalist', 'papersby', 'alexander', 'hamilton', 'john', 'jay']
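The word list above still contains run-together tokens such as `'byalexander'` and empty strings (`''`), because line breaks were deleted and the text was split on single spaces only. One possible alternative, sketched here on a made-up `sample_text` rather than the real file, is to replace each punctuation mark with a space and then call `split()` with no arguments, which splits on any run of whitespace (including newlines) and never returns empty strings. The helper name `tokenize` is hypothetical.

```python
# Same marks as in the cell above, minus '\n' (split() already handles newlines)
PUNCTUATION_MARKS = ['!', '.', ',', ':', ';', '?', '-', '\ufeff']


def tokenize(text):
    """Return a list of lowercase words, splitting on any whitespace."""
    for pm in PUNCTUATION_MARKS:
        # Replace with a space so neighbouring words are not glued together
        text = text.replace(pm, ' ')
    # split() with no arguments splits on runs of spaces, tabs, and newlines
    return text.lower().split()


# Made-up sample with a line break and a double space (an assumption for illustration)
sample_text = "The Federalist Papers, by\nAlexander Hamilton.  You may copy it,\ngive it away."

tokens = tokenize(sample_text)
print(tokens)
```

Applied to the full text, this approach would probably give slightly different counts from the ones shown below, since words that are currently merged would be counted separately.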


* Now that we have the text a bit more *normalized* we can start counting!

In [5]:
# find the frequency for "while" and "whilst"

# create some variables to track the count
freq_while = 0
freq_whilst = 0

# loop over every word in the word list
for word in word_list:
    
    # is the word while?
    if word == "while":
        # add one to our while variable
        freq_while = freq_while + 1 
    # is the word whilst?
    elif word == "whilst":
        # add one to our whilst variable
        freq_whilst = freq_whilst + 1
    # the word is neither while nor whilst
    else:
        # just continue looping
        continue
        
print("The frequency of 'while' is:", freq_while)
print("The frequency of 'whilst' is:", freq_whilst)

The frequency of 'while' is: 30
The frequency of 'whilst' is: 18


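* Look at that! We have taken a first step toward figuring out who wrote _The Federalist Papers_: both spellings appear in the text. Hamilton favored "while" and Madison favored "whilst", so comparing counts like these essay by essay is the kind of evidence that can point to an author.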
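The totals above cover the whole file at once. Attributing individual essays calls for counts per essay, which can then be compared between authors. Here is a minimal sketch of that idea, using a tiny made-up `sample_text` (not real essay text) and assuming each essay begins with a heading of the form `FEDERALIST No. N`, as in the output shown earlier. The helper name `count_per_paper` is hypothetical, and this is only an illustration, not Mosteller and Wallace's actual method.

```python
import re
from collections import Counter


def count_per_paper(text, target_words=("while", "whilst")):
    """Return {paper_number: Counter of the target words} for each essay in text."""
    # Splitting on the heading with a capture group keeps the paper numbers:
    # [preamble, "1", body_1, "2", body_2, ...]
    parts = re.split(r"FEDERALIST No\. (\d+)", text)
    results = {}
    for number, body in zip(parts[1::2], parts[2::2]):
        # Lowercase the body and pull out runs of letters as words
        words = re.findall(r"[a-z]+", body.lower())
        counts = Counter(words)
        # Keep only the words we are interested in (missing words count as 0)
        results[int(number)] = Counter({w: counts[w] for w in target_words})
    return results


# Made-up sample text (an assumption for illustration, not real essay content)
sample_text = (
    "FEDERALIST No. 1\n\nWhile we wait, while we hope.\n\n"
    "FEDERALIST No. 2\n\nWhilst they argue, we wait.\n"
)

results = count_per_paper(sample_text)
for number, counts in results.items():
    print(number, dict(counts))
```

`collections.Counter` does the same job as the hand-written loop in the previous cell, but it counts every word in one pass.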