# Introduction

In Part II, we organized the speeches that we got from Part I. In this Part, we will engineer features from the data and visualize them! 

We will be getting simple metrics such as:
1. speech length
2. number of sentences in speech

In this notebook, you will do the following:
1. Import pandas and data visualization libraries
2. Derive metrics around the speeches
3. Plot the metrics to compare Obama's and Trump's speeches

Useful readings on visualization: 
<a href = "https://towardsdatascience.com/introduction-to-data-visualization-in-python-89a54c97fbed">Introduction to Data Visualization in Python</a> (run it in Incognito Mode if you face the paywall)

It's quite comprehensive and a useful guide for this Part if you're new to visualization.

<font color = 'red'>Once again, don't be alarmed if your analysis is slightly different from ours, since the sources of data might vary. 
    
However, if you want to reproduce the exact results, you can download our data from Part I Step 1</font>

### Step 1: Import the following libraries
- pandas
- matplotlib.pyplot as plt
- seaborn as sns

In [35]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Step 2: Read your CSV from Part II
Let's read the CSV that we got from the previous part as a DataFrame.

In [36]:
df = pd.read_csv('speechDF.csv')
df.head(10)

Unnamed: 0,filename,name,year,speech
0,obama2010.txt,obama,2010,"Madam Speaker, Vice President Biden, members o..."
1,obama2011.txt,obama,2011,"Mr. Speaker, Mr. Vice President, members of Co..."
2,obama2012.txt,obama,2012,"Mr. Speaker, Mr. Vice President, members of C..."
3,obama2013.txt,obama,2013,"Mr. Speaker, Mr. Vice President, members of Co..."
4,obama2014.txt,obama,2014,"Mr. Speaker, Mr. Vice President, Members of Co..."
5,obama2015.txt,obama,2015,"Mr. Speaker, Mr. Vice President, Members of Co..."
6,obama2016.txt,obama,2016,"Mr. Speaker, Mr. Vice President, Members of Co..."
7,trump2018.txt,trump,2018,"Mr. Speaker, Mr. Vice President, Members of Co..."
8,trump2019.txt,trump,2019,"Madam Speaker, Mr. Vice President, Members of ..."
9,trump2020.txt,trump,2020,Thank you very much. Thank you. Thank you v...


## Engineer features from speech
We'll start deriving features from the speeches.

1. Number of characters
2. Number of sentences
3. Average number of characters per sentence
4. Readability scores

This is what you'll see at the end.

![FinalDataFrame.png](attachment:FinalDataFrame.png)

We'll work towards this step-by-step, so hope you're hyped! 

This image serves as a clue to figure out what the current step is asking so it's worth coming back to this image once in a while for reference. 

### Step 3: Count number of characters in speech
Create a new column named len_speech that holds the total number of characters in each string of the "speech" column.

Again, there are a few ways to do this so go for it! 

<strong>Hint: Google "find length of string in column python"</strong>

In [37]:
df['len_speech'] = df['speech'].str.len()
df.head(10)

Unnamed: 0,filename,name,year,speech,len_speech
0,obama2010.txt,obama,2010,"Madam Speaker, Vice President Biden, members o...",43698
1,obama2011.txt,obama,2011,"Mr. Speaker, Mr. Vice President, members of Co...",41023
2,obama2012.txt,obama,2012,"Mr. Speaker, Mr. Vice President, members of C...",42204
3,obama2013.txt,obama,2013,"Mr. Speaker, Mr. Vice President, members of Co...",41202
4,obama2014.txt,obama,2014,"Mr. Speaker, Mr. Vice President, Members of Co...",40009
5,obama2015.txt,obama,2015,"Mr. Speaker, Mr. Vice President, Members of Co...",40269
6,obama2016.txt,obama,2016,"Mr. Speaker, Mr. Vice President, Members of Co...",31167
7,trump2018.txt,trump,2018,"Mr. Speaker, Mr. Vice President, Members of Co...",30457
8,trump2019.txt,trump,2019,"Madam Speaker, Mr. Vice President, Members of ...",30945
9,trump2020.txt,trump,2020,Thank you very much. Thank you. Thank you v...,39649


### Step 4: Replace "!" and "?" with "." in speech
We'll be getting the number of sentences in the speeches, so let's replace our exclamation and question marks with a full stop so that we can split the sentences cleanly later on.

There are a few ways to do this, so go ahead and pick your favorite method. 

<strong>Hint: Google "how to replace text in a column of a Pandas dataframe?"</strong>

In [38]:
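# regex=False treats '!' and '?' as literal characters ('?' is a regex metacharacter)
df['speech'] = df['speech'].str.replace('!', '.', regex=False)
df['speech'] = df['speech'].str.replace('?', '.', regex=False)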
df.head()

Unnamed: 0,filename,name,year,speech,len_speech
0,obama2010.txt,obama,2010,"Madam Speaker, Vice President Biden, members o...",43698
1,obama2011.txt,obama,2011,"Mr. Speaker, Mr. Vice President, members of Co...",41023
2,obama2012.txt,obama,2012,"Mr. Speaker, Mr. Vice President, members of C...",42204
3,obama2013.txt,obama,2013,"Mr. Speaker, Mr. Vice President, members of Co...",41202
4,obama2014.txt,obama,2014,"Mr. Speaker, Mr. Vice President, Members of Co...",40009
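
As one possible alternative, both punctuation marks can be replaced in a single pass with a regular-expression character class. The sketch below uses a small made-up DataFrame (`toy`), not the speech data. Inside `[...]`, both `!` and `?` are matched literally.

```python
import pandas as pd

# Made-up sentences for illustration only (not taken from the speeches).
toy = pd.DataFrame({'speech': ['Is it true? Yes! It is.', 'No questions here.']})

# The character class [!?] matches either mark; regex=True makes the intent explicit.
toy['speech'] = toy['speech'].str.replace(r'[!?]', '.', regex=True)

print(toy['speech'].tolist())
```

Either approach should give the same result here; the two-call version is arguably easier to read if you are new to regular expressions.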


### Step 5: Split the speech and get the number of sentences
In this step, you will need to do two things:
1. Split the speech into a list of sentences by splitting based on "."
2. Get the resulting length of the list

![SpeechLength.png](attachment:SpeechLength.png)

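Don't sweat the potential inaccuracies - for this particular task we will trade some data cleanliness for speed and convenience. For example, splitting on "." also cuts abbreviations such as "Mr." into their own fragments, as the output below shows.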

Once again, don't panic if your values differ slightly from what you see in this image.

<strong>Hint: Google "pandas split string to list"</strong>

<strong>Hint: Google "count length of lists in column"</strong>

In [39]:
df['sentences'] = df['speech'].str.split(pat = ".")
df['num_sentences'] = df['sentences'].str.len()
df.head()

Unnamed: 0,filename,name,year,speech,len_speech,sentences,num_sentences
0,obama2010.txt,obama,2010,"Madam Speaker, Vice President Biden, members o...",43698,"[Madam Speaker, Vice President Biden, members ...",562
1,obama2011.txt,obama,2011,"Mr. Speaker, Mr. Vice President, members of Co...",41023,"[Mr, Speaker, Mr, Vice President, members of...",500
2,obama2012.txt,obama,2012,"Mr. Speaker, Mr. Vice President, members of C...",42204,"[ Mr, Speaker, Mr, Vice President, members o...",518
3,obama2013.txt,obama,2013,"Mr. Speaker, Mr. Vice President, members of Co...",41202,"[Mr, Speaker, Mr, Vice President, members of...",455
4,obama2014.txt,obama,2014,"Mr. Speaker, Mr. Vice President, Members of Co...",40009,"[Mr, Speaker, Mr, Vice President, Members of...",352
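
To see where the inaccuracies come from, here is a minimal sketch on a made-up sentence (the `toy` DataFrame and the cleanup step are illustrative assumptions, not part of the original solution). Splitting on "." breaks "Mr." apart and leaves an empty string after the final full stop.

```python
import pandas as pd

# Made-up text for illustration only.
toy = pd.DataFrame({'speech': ['Mr. Speaker, thank you. We begin.']})

# Naive split on '.', as in this step: "Mr." is cut off and a trailing empty string appears.
naive = toy['speech'].str.split('.')
print(naive[0])

# Optional cleanup: strip whitespace and drop empty fragments before counting.
cleaned = naive.apply(lambda parts: [p.strip() for p in parts if p.strip()])
print(len(naive[0]), len(cleaned[0]))
```

The cleanup removes the empty fragment but still counts "Mr" as a sentence, so the counts remain approximate. That is acceptable here, because both speakers' speeches are processed the same way.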


### Step 6: Get the average sentence length
It's not enough that we get the absolute length of the speeches, and the number of sentences. 

We should also create a column named 'average_sen_length' that holds the average number of characters per sentence.

<strong>Hint: Google "create a column from two columns pandas"</strong>

In [40]:
df['average_sen_length'] = df['len_speech'] / df['num_sentences']

In [41]:
df.head()

Unnamed: 0,filename,name,year,speech,len_speech,sentences,num_sentences,average_sen_length
0,obama2010.txt,obama,2010,"Madam Speaker, Vice President Biden, members o...",43698,"[Madam Speaker, Vice President Biden, members ...",562,77.754448
1,obama2011.txt,obama,2011,"Mr. Speaker, Mr. Vice President, members of Co...",41023,"[Mr, Speaker, Mr, Vice President, members of...",500,82.046
2,obama2012.txt,obama,2012,"Mr. Speaker, Mr. Vice President, members of C...",42204,"[ Mr, Speaker, Mr, Vice President, members o...",518,81.474903
3,obama2013.txt,obama,2013,"Mr. Speaker, Mr. Vice President, members of Co...",41202,"[Mr, Speaker, Mr, Vice President, members of...",455,90.553846
4,obama2014.txt,obama,2014,"Mr. Speaker, Mr. Vice President, Members of Co...",40009,"[Mr, Speaker, Mr, Vice President, Members of...",352,113.661932


### Step 7: Import readability
We will now get some readability scores for the speeches. 

Why so? Readability indicates how accessible the ideas in the speeches are, and accessibility helps with connecting with the audience.

That said, since these are speech transcripts, they're not exactly meant for reading. However, it's still a good idea to see if we can analyze the readability of their speeches.

There are a lot of readability scores, but we will focus on two scores:
1. SMOG
2. Flesch reading ease score

Readings:
1. https://en.wikipedia.org/wiki/SMOG
2. https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests
3. [Important] https://pypi.org/project/readability/
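
For orientation, the two scores are commonly defined as follows. These are the standard published formulas described on the Wikipedia pages above. How the readability package counts sentences and syllables is not verified here, so small differences from a hand calculation should be expected.

$$
\text{SMOG grade} = 1.0430 \sqrt{\text{polysyllables} \times \frac{30}{\text{sentences}}} + 3.1291
$$

$$
\text{Flesch reading ease} = 206.835 - 1.015 \left( \frac{\text{total words}}{\text{total sentences}} \right) - 84.6 \left( \frac{\text{total syllables}}{\text{total words}} \right)
$$

Here "polysyllables" means the number of words with three or more syllables. Longer sentences and longer words push the SMOG grade up and the Flesch reading ease score down.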

In [42]:
import readability

### Step 8: Add SMOGIndex and FleschReadingEase into your DataFrame
As mentioned, we will be using these two commonly used scores in readability tests.

Loop through the values in the 'speech' column and get the scores for your new columns, named 'SMOG_index' and 'Flesch_score'.
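
One pitfall worth knowing when looping with `iterrows()`: assigning to the row object does not write back to the DataFrame. The sketch below shows this on a made-up DataFrame (`toy` and the `n_chars` column are hypothetical names used only for illustration) and then one way around it, writing through `.at`.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({'name': ['a', 'b'], 'speech': ['Hello there.', 'Hi.']})

# Writing to the row object: the new key lives on a temporary Series, not on toy.
for i, row in toy.iterrows():
    row['n_chars'] = len(row['speech'])
print('n_chars' in toy.columns)

# Writing through toy.at: the values end up in the DataFrame itself.
toy['n_chars'] = 0
for i, row in toy.iterrows():
    toy.at[i, 'n_chars'] = len(row['speech'])
print(toy['n_chars'].tolist())
```

The same pattern should carry over to the readability scores: compute the score from the row's speech, then store it with `df.at[i, column_name]`.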

<strong>Hint: Check out the documentation - it'll be immensely helpful.</strong>

In [43]:
import numpy as np

In [47]:
# Create the columns first, then fill them in row by row.
# Write through df.at: assigning to the row yielded by iterrows() does not update df.
df['SMOG_index'] = np.nan
df['Flesch_score'] = np.nan
for i, row in df.iterrows():
    results = readability.getmeasures(row['speech'], lang='en')
    df.at[i, 'SMOG_index'] = results['readability grades']['SMOGIndex']
    df.at[i, 'Flesch_score'] = results['readability grades']['FleschReadingEase']
    print(df.at[i, 'SMOG_index'])
    print(df.at[i, 'Flesch_score'])

17.826429403250046
24.720575267901523
16.543302870111816
32.322806083320536
17.34406540661958
26.228014157842694
14.18755068469326
52.06242707744663
19.06450067129949
9.227648912106872
13.518313583218587
56.19437746425349
17.44140633274578
25.06250899911781
13.84361068175401
51.03448989662204
14.799099541914204
44.106259681903715
16.381695237968188
33.439535910467704


In [45]:
df.head()

Unnamed: 0,filename,name,year,speech,len_speech,sentences,num_sentences,average_sen_length,SMOG_index,Flesch_score
0,obama2010.txt,obama,2010,"Madam Speaker, Vice President Biden, members o...",43698,"[Madam Speaker, Vice President Biden, members ...",562,77.754448,17.826429,24.720575
1,obama2011.txt,obama,2011,"Mr. Speaker, Mr. Vice President, members of Co...",41023,"[Mr, Speaker, Mr, Vice President, members of...",500,82.046,16.543303,32.322806
2,obama2012.txt,obama,2012,"Mr. Speaker, Mr. Vice President, members of C...",42204,"[ Mr, Speaker, Mr, Vice President, members o...",518,81.474903,17.344065,26.228014
3,obama2013.txt,obama,2013,"Mr. Speaker, Mr. Vice President, members of Co...",41202,"[Mr, Speaker, Mr, Vice President, members of...",455,90.553846,14.187551,52.062427
4,obama2014.txt,obama,2014,"Mr. Speaker, Mr. Vice President, Members of Co...",40009,"[Mr, Speaker, Mr, Vice President, Members of...",352,113.661932,19.064501,9.227649


## Visualizing the speech metrics
Okay, we have successfully extracted all the information that we need for now.

Next up, we will visualize these metrics to compare Obama and Trump. Most of the analyses will use boxplots, and we recommend seaborn as the library of choice here.

### Step 9: Analyze len_speech with a boxplot
We'll first compare len_speech with a boxplot. 

Here's what you can expect to see when you plot the boxplot:

![FirstBoxplotExample.png](attachment:FirstBoxplotExample.png)

Subsequently, you'll be plotting similar plots for the other features that you engineered.
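
If you are unsure where to start, here is one possible sketch using seaborn's `boxplot`. It builds a small stand-in DataFrame (`toy`) from the len_speech values shown in the Step 3 output; in the notebook you would pass your own `df` instead. The title text is our own addition.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# len_speech values copied from the Step 3 output above.
toy = pd.DataFrame({
    'name': ['obama'] * 7 + ['trump'] * 3,
    'len_speech': [43698, 41023, 42204, 41202, 40009, 40269, 31167,
                   30457, 30945, 39649],
})

# One box per speaker on the x-axis, speech length on the y-axis.
ax = sns.boxplot(data=toy, x='name', y='len_speech')
ax.set_title('Speech length by speaker')
```

Swapping the `y` argument for another column name should give the corresponding plot for the other features.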

In [None]:
# Step 9: Plot a boxplot for len_speech

### Step 10: Analyze num_sentences with a boxplot
Next, we'll compare num_sentences with a boxplot.

In [None]:
# Step 10: Analyze num_sentences with a boxplot

### Step 11: Analyze average_sen_length with a boxplot
So far, it seems that Trump's speeches are shorter and contain fewer sentences.

However, the absolute speech length and number of sentences tell only half of the data story - you'll still need to look at the ratio of length to number of sentences.

Plot average_sen_length as well.

In [None]:
# Step 11: Plot average_sen_length with a boxplot

### Step 12: Get the mean of the 'average_sen_length' for Obama and Trump respectively
We anticipate that you'll find something surprising in the boxplots - it turns out that the average_sen_length values are visually similar.

How similar? 

Give it a look by getting the average_sen_length of Obama and Trump, and see how much their averages differ.

You can do this by creating two new DataFrames based on the 'name' column, and get the mean of 'average_sen_length'.
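
A minimal sketch of that approach, followed by an equivalent `groupby` shortcut, is shown below. The numbers in `toy` are made up for illustration and are not the real averages.

```python
import pandas as pd

# Made-up numbers for illustration only (not the real averages).
toy = pd.DataFrame({
    'name': ['obama', 'obama', 'trump', 'trump'],
    'average_sen_length': [80.0, 90.0, 84.0, 88.0],
})

# Approach described in this step: one DataFrame per speaker, then the mean.
obama_mean = toy[toy['name'] == 'obama']['average_sen_length'].mean()
trump_mean = toy[toy['name'] == 'trump']['average_sen_length'].mean()
print(obama_mean, trump_mean)

# Equivalent shortcut: group by speaker and take the mean of each group.
means = toy.groupby('name')['average_sen_length'].mean()
print(means)
```

Both routes should give the same numbers; `groupby` scales more comfortably if more speakers are added later.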

In [None]:
# Step 12: Print the average_sen_length

### Step 13: Analyze SMOG_index with a boxplot
Next up, let's compare the readability of the two individuals' speeches.

We'll start with a boxplot for SMOG_index.

P.S. The higher the SMOG score, the harder the text is to read.

In [None]:
# Step 13: Plot SMOG_index with a boxplot

### Step 14: Analyze Flesch_score with a boxplot
Once you clear the SMOG, next up is Flesch_score. 

Same thing, plot a boxplot for Flesch_score.

P.S. The higher the Flesch Reading Ease score, the easier it is to read.

In [None]:
# Step 14: Plot Flesch_score with a boxplot

Now that you've plotted the boxplots for SMOG and Flesch Reading Ease scores, what can you say about their speeches?

Hopefully, you'll come to the same conclusion as us - Trump's speeches are generally easier to read whereas Obama's speeches are harder to read. 

Interpret that how you will. 

### Step 15: Export the expanded DataFrame as a CSV
Now that we're done plotting, we will be exporting the DataFrame into a CSV for use in other Parts.
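
One detail worth considering when exporting: passing `index=False` to `to_csv` keeps the row index out of the file, which avoids the extra "Unnamed: 0" column when the CSV is read back. The sketch below uses a made-up DataFrame and an in-memory buffer in place of a real file; the file name in the comment is hypothetical.

```python
import io
import pandas as pd

# Made-up frame standing in for the expanded DataFrame.
toy = pd.DataFrame({'name': ['obama', 'trump'], 'len_speech': [100, 90]})

# index=False leaves the row index out of the file, so reading it back
# does not produce an extra "Unnamed: 0" column.
buffer = io.StringIO()  # stands in for a file path such as 'my_output.csv' (hypothetical name)
toy.to_csv(buffer, index=False)

buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored.columns.tolist())
```

Note that a column holding Python lists, such as sentences, is written out as text, so it comes back as strings rather than lists when the CSV is read again.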

In [None]:
# Step 15: Export the DataFrame to CSV

### End of Part III
Well done for making it this far! It was a lot of work, but you successfully extracted a lot of basic text information from the speeches.

Next up, we'll use the NLP library spaCy to perform technical text analysis on the speeches in Part IV.