# Guided Exercise 1
This exercise uses another speech by Churchill, 'We Shall Fight on the Beaches'.  Listen to an excerpt here: https://www.youtube.com/watch?v=MkTw3_PmKtc

In [7]:
from nltk import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import string
import pandas as pd

## Parse Data

Identify the filepath of the speech.  Open the speech, read it, convert to lowercase, and store it in the variable ```speech```.


In [4]:
fp = 'speeches/Churchill-Beaches.txt'
# Use a context manager so the file is closed after reading
with open(fp) as f:
    speech = f.read().lower()

Create an empty list called ```rows``` and an integer set to 0 called ```sent_id```.  Then for each sentence in ```speech```, tokenize the sentence and for each token in the sentence, create a dictionary with keys and values for 'token' and 'sent_id'.  Then append each dictionary to ```rows```.

In [5]:
rows = []
sent_id = 0

for sent in sent_tokenize(speech):
    sent_id += 1
    for token in word_tokenize(sent):
        d = {'sent_id':sent_id, 'token':token}
        rows.append(d)


Use the ```rows``` list to create a pandas DataFrame called ```parsed_speech```.

In [6]:
parsed_speech = pd.DataFrame(rows)

Create two new columns in ```parsed_speech```, 'is_stop' and 'is_punct', that each identify whether or not a token is a stopword or punctuation, respectively.

In [9]:
# Build the stopword set once instead of rebuilding the list on every call
STOPS = set(stopwords.words('english'))

def is_stopword(token):
    return token in STOPS

def is_punctuation(token):
    return token in string.punctuation

parsed_speech['is_stop'] = parsed_speech.token.apply(is_stopword)
parsed_speech['is_punct'] = parsed_speech.token.apply(is_punctuation)
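
One caveat with `is_punctuation`: `token in string.punctuation` is a substring test against a single string, so an empty token counts as punctuation, while multi-character tokens that `word_tokenize` may emit, such as `...` or `''`, do not. A stricter variant is sketched below. The name `is_all_punctuation` is hypothetical, and this is an optional alternative rather than part of the exercise solution.

```python
import string

PUNCT_CHARS = set(string.punctuation)

def is_all_punctuation(token):
    """Return True only if token is non-empty and every character is ASCII punctuation."""
    return len(token) > 0 and all(ch in PUNCT_CHARS for ch in token)

# Compare the substring test with the per-character test
for tok in [",", "...", "''", "", "mr."]:
    print(repr(tok), tok in string.punctuation, is_all_punctuation(tok))
```

Note that typographic characters such as `’` are not in `string.punctuation`, so neither version flags them.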

## Analysis
Answer the following questions about the speech.

### What are the ten least common words in the speech?

In [10]:
parsed_speech.token.value_counts().tail(10)

nazi          1
fall          1
armed         1
try           1
home          1
any           1
forth         1
god’s         1
famous        1
subjugated    1
Name: token, dtype: int64

### What are the twenty most common words in the speech that are not stop words?

In [12]:
parsed_speech[~parsed_speech.is_stop].token.value_counts().head(20)

,             35
shall         12
.              7
fight          7
island         3
defend         3
even           2
seas           2
confidence     2
old            2
growing        2
large          2
may            2
strength       2
good           2
british        2
empire         2
made           2
necessary      2
gestapo        1
Name: token, dtype: int64
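
The counts above still include `,` and `.` because only stop words were filtered out. A minimal sketch of combining both flags with a boolean mask follows. It uses a small made-up frame (`toy`, with hand-assigned flags) in place of `parsed_speech`, so the numbers are illustrative only.

```python
import pandas as pd

# Toy stand-in for parsed_speech; the flag values are assigned by hand
toy = pd.DataFrame({
    'token':    ['we', 'shall', 'fight', ',', 'we', 'shall', '.'],
    'is_stop':  [True, False, False, False, True, False, False],
    'is_punct': [False, False, False, True, False, False, True],
})

# Keep rows that are neither stop words nor punctuation
content_words = toy[~toy.is_stop & ~toy.is_punct]
top_words = content_words.token.value_counts()
print(top_words)
```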

### What is the number of tokens in the sentence with the largest number of tokens in the speech?

In [14]:
parsed_speech.sent_id.value_counts().head(1)

7    163
Name: sent_id, dtype: int64
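
`value_counts().head(1)` returns a one-row Series whose index is the sentence id and whose value is the token count. If the id and the count are wanted as separate plain values, `idxmax()` and `max()` are one way to get them. The sketch below uses a made-up frame (`toy`) rather than the speech data.

```python
import pandas as pd

# Toy stand-in for parsed_speech: sentence 2 has the most tokens
toy = pd.DataFrame({'sent_id': [1, 1, 2, 2, 2, 3],
                    'token': ['we', 'shall', 'fight', 'on', '.', 'never']})

counts = toy.sent_id.value_counts()   # index: sent_id, values: tokens per sentence
longest_id = counts.idxmax()          # id of the longest sentence
longest_len = counts.max()            # its token count
print(longest_id, longest_len)
```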

## Function Creation
Create a function called ```parse_speech_to_tokens``` that takes a filepath as input and returns a DataFrame with four columns: sent_id, token, is_punct, and is_stop.

In [15]:
def parse_speech_to_tokens(fp):
    # Use a context manager so the file is closed after reading
    with open(fp) as f:
        speech = f.read().lower()
    rows = []
    sent_id = 0

    for sent in sent_tokenize(speech):
        sent_id += 1
        for token in word_tokenize(sent):
            d = {'sent_id':sent_id, 'token':token, 'is_stop': is_stopword(token), 'is_punct': is_punctuation(token)}
            rows.append(d)
            
    df = pd.DataFrame(rows)
    return df

Test your function on a new speech, FDR's Pearl Harbor speech.

In [16]:
fdr_fp = 'speeches/FDR-PearlHarbor.txt'
fdr_parsed_speech = parse_speech_to_tokens(fdr_fp)
fdr_parsed_speech.head()

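|   | is_punct | is_stop | sent_id | token |
|---|----------|---------|---------|-------|
| 0 | False | False | 1 | mr. |
| 1 | False | False | 1 | vice |
| 2 | False | False | 1 | president |
| 3 | True | False | 1 | , |
| 4 | False | False | 1 | mr. |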


So that we can save our work, save the DataFrame to a file as a csv.

In [17]:
fdr_parsed_speech.to_csv('data/FDR-PearlHarbor_parsed.csv')
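
By default `to_csv` also writes the row index, which comes back as an extra `Unnamed: 0` column when the file is read again. Passing `index=False` avoids that. The sketch below shows the difference with an in-memory buffer and a made-up two-row frame (`toy`) instead of a file on disk.

```python
import io
import pandas as pd

toy = pd.DataFrame({'sent_id': [1, 1], 'token': ['mr.', 'vice']})

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_default = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written
no_index = io.StringIO()
toy.to_csv(no_index, index=False)
no_index.seek(0)
cols_no_index = list(pd.read_csv(no_index).columns)

print(cols_default, cols_no_index)
```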

## Restructure for Sentences
This section is more heavily guided and provides fuller code skeletons.
Objective: Make a dataframe where each row is a sentence in the text.  There are two approaches.

### Approach 1
Starting with the filepath, read the file, ```sent_tokenize``` its contents, and as you iterate, create a dictionary entry for each sentence. Each dictionary should have the following key/value pairs:
* 'sentence': the sentence as a string
* 'tokens': the sentence as a list of tokens

Append each dictionary to the empty list ```sentences```.  Then convert the list to a DataFrame and store it in the variable ```sentence_df_1```.

In [31]:
#Complete this part
def parse_speech_to_sentences(fp):
    with open(fp) as f:
        speech = f.read().lower()
    sentences = []

    for sent in sent_tokenize(speech):
        d = {'sentence':sent, 'tokens': word_tokenize(sent)}
        sentences.append(d)

    df = pd.DataFrame(sentences)
    return df

# Keep the full DataFrame; call head() only to preview it
sentence_df_1 = parse_speech_to_sentences('speeches/Churchill-Beaches.txt')
sentence_df_1.head()

|   | sentence | tokens |
|---|----------|--------|
| 0 | i have, myself, full confidence that if all do... | [i, have, ,, myself, ,, full, confidence, that... |
| 1 | at any rate, that is what we are going to try ... | [at, any, rate, ,, that, is, what, we, are, go... |
| 2 | that is the resolve of his majesty’s governmen... | [that, is, the, resolve, of, his, majesty’s, g... |
| 3 | that is the will of parliament and the nation. | [that, is, the, will, of, parliament, and, the... |
| 4 | the british empire and the french republic, li... | [the, british, empire, and, the, french, repub... |


### Approach 2
First, write a function called ```get_tokens``` that, given a DataFrame and a column name, converts that column to a list of tokens and returns a Series holding the token list and the tokens joined into a single sentence string.

In [34]:
def get_tokens(df, column):
    # Collect one sentence's tokens and rebuild the sentence text from them
    tokens = df[column].tolist()
    sentence = ' '.join(tokens)
    return pd.Series({'tokens': tokens, 'sentence':sentence})

Then, take the first DataFrame we made, ```parsed_speech```, and group it by ```sent_id```.  Then use the grouped DataFrame's ```apply``` method to apply the ```get_tokens``` function.  Assign the resulting DataFrame to the variable ```sentence_df_2```.

In [39]:
# Extra keyword arguments given to apply() are forwarded to get_tokens
sentence_df_2 = parsed_speech.groupby('sent_id').apply(get_tokens, column='token').reset_index()
sentence_df_2.head()

|   | sent_id | sentence | tokens |
|---|---------|----------|--------|
| 0 | 1 | i have , myself , full confidence that if all ... | [i, have, ,, myself, ,, full, confidence, that... |
| 1 | 2 | at any rate , that is what we are going to try... | [at, any, rate, ,, that, is, what, we, are, go... |
| 2 | 3 | that is the resolve of his majesty’s governmen... | [that, is, the, resolve, of, his, majesty’s, g... |
| 3 | 4 | that is the will of parliament and the nation . | [that, is, the, will, of, parliament, and, the... |
| 4 | 5 | the british empire and the french republic , l... | [the, british, empire, and, the, french, repub... |
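
Because Approach 2 rebuilds each sentence by joining tokens with spaces, punctuation ends up separated from the preceding word (`nation .` rather than `nation.`). The same grouping can also be written without a helper function by aggregating the token column into lists. This is a sketch of one alternative, shown on a made-up frame (`toy`), with `alt` as an illustrative name.

```python
import pandas as pd

# Toy stand-in for parsed_speech
toy = pd.DataFrame({'sent_id': [1, 1, 1, 2, 2],
                    'token': ['we', 'shall', 'fight', 'never', '.']})

# Collect each sentence's tokens into a list, then join them into a string
alt = toy.groupby('sent_id')['token'].agg(list).reset_index(name='tokens')
alt['sentence'] = alt.tokens.apply(' '.join)
print(alt)
```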


### One More Step
Using either ```sentence_df_1``` or ```sentence_df_2``` (pick one and use it consistently from here on), create a new column called ```num_tokens``` that stores the number of tokens in each sentence (look at the ```apply``` method and how it was used to identify stop words and punctuation).

In [42]:
sentence_df_2['num_tokens'] = sentence_df_2.tokens.apply(len)
sentence_df_2.head()

|   | sent_id | sentence | tokens | num_tokens |
|---|---------|----------|--------|------------|
| 0 | 1 | i have , myself , full confidence that if all ... | [i, have, ,, myself, ,, full, confidence, that... | 71 |
| 1 | 2 | at any rate , that is what we are going to try... | [at, any, rate, ,, that, is, what, we, are, go... | 15 |
| 2 | 3 | that is the resolve of his majesty’s governmen... | [that, is, the, resolve, of, his, majesty’s, g... | 12 |
| 3 | 4 | that is the will of parliament and the nation . | [that, is, the, will, of, parliament, and, the... | 10 |
| 4 | 5 | the british empire and the french republic , l... | [the, british, empire, and, the, french, repub... | 40 |


### What is the number of tokens in the sentence with the largest number of tokens in the speech?

In [43]:
sentence_df_2.num_tokens.max()

163