# Introduction
In Part I, we collected 10 speeches and saved each one as a .txt file. Now we need to combine them into a single file for analysis.

In this notebook, you will do the following:

1. Import the pandas and glob libraries
2. Read one text file
3. Obtain a list of the txt filenames in your directory
4. Loop through the txt files and store them in a list
5. Create a DataFrame containing the speeches
6. Export the speech DataFrame as a CSV

### Step 1: Import your libraries
We'll be using two libraries: one to find the txt files so we can read them, and one to organise their contents into a DataFrame.
1. pandas as pd
2. glob

In [1]:
import pandas as pd
import glob

### Step 2: Read Obama's 2010 speech and store the string in a variable
We'll start with reading one .txt file first, and store the string in a variable (you can call it anything you want). 

In [2]:
# Step 2: Read Obama's 2010 speech and store in a variable
fn = 'obama2010.txt'
with open(fn, 'r', encoding='utf-8') as f:
    o2010 = f.read()
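
Before moving on, it can help to check what actually landed in the variable. The sketch below is a minimal, hedged example: it writes a tiny stand-in file to a temporary folder (an assumption made only so the snippet runs anywhere) and then reads it back the same way as Step 2. In your notebook you would point `fn` at your real `obama2010.txt` instead.

```python
import tempfile
from pathlib import Path

# Stand-in file so the sketch runs anywhere (hypothetical content, not the real speech).
tmp_dir = Path(tempfile.mkdtemp())
fn = tmp_dir / "obama2010.txt"
fn.write_text("Madam Speaker, Vice President Biden...\n", encoding="utf-8")

# Same pattern as Step 2, with the encoding stated explicitly.
with open(fn, "r", encoding="utf-8") as f:
    o2010 = f.read()

print(type(o2010))   # the whole file comes back as a single string
print(len(o2010))    # number of characters read
print(o2010[:40])    # preview of the first 40 characters
```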

### Step 3: Get a list of your filenames in your folder
Congrats! You've successfully read the first text file. 

Next up, we'll be reading all of the other text files. However, before we do that, let's get a list of the filenames you have in your folder.

![UsingglobLibrary.png](attachment:UsingglobLibrary.png)

<strong>Hint: remember the glob library you imported? It'll come in handy</strong>

<strong>Hint 2: Google "find all files in a directory with extension .txt in Python"</strong>

In [3]:
# Step 3: Get the list of filenames that end with .txt
# glob returns matches in arbitrary order, so sort them to line up with the name and year lists built later
speeches = sorted(glob.glob("*.txt"))

speeches

['obama2010.txt',
 'obama2011.txt',
 'obama2012.txt',
 'obama2013.txt',
 'obama2014.txt',
 'obama2015.txt',
 'obama2016.txt',
 'trump2018.txt',
 'trump2019.txt',
 'trump2020.txt']
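
If your text files live in a different folder from the notebook, `pathlib` from the standard library is one alternative to `glob`. This is a rough sketch, not the only way to do it: the temporary folder and the empty files in it are stand-ins created just for the example, and `speech_paths` is a name made up here.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in folder with a few empty files, created out of order.
folder = Path(tempfile.mkdtemp())
for name in ["trump2018.txt", "obama2011.txt", "obama2010.txt", "notes.md"]:
    (folder / name).touch()

# Path.glob yields matches in no guaranteed order, so sort them explicitly.
speech_paths = sorted(folder.glob("*.txt"))

# Keep only the filename part, as in the list printed above.
speeches = [p.name for p in speech_paths]
print(speeches)  # ['obama2010.txt', 'obama2011.txt', 'trump2018.txt']
```

Sorting matters here because the name and year lists in the later steps are typed by hand and must line up with the filenames.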

### Step 4: Store all of your txt files as strings in a list
Now that you've obtained a list of your .txt filenames, it's time to loop through them and repeat Step 2.

You'll have to store all of your speech text as strings in a list. This is so that we can build a DataFrame later on.

Here's what we suggest:
1. Create an empty list
2. Use a for loop to loop through the list of filenames
3. In each loop
    1. Open the corresponding text file
    2. Read the text
    3. Strip the text and append it to the empty list from step 1

<strong>Hint: Don't forget to <em>strip</em> your text so there's no leading or trailing whitespace</strong>

In [4]:
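text = []
for fn in speeches:
    # The with block closes each file once it has been read
    with open(fn, 'r', encoding='utf-8') as f:
        # strip() removes leading and trailing whitespace
        text.append(f.read().strip())
print(len(text))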

10
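
To see what stripping changes, here is a small sketch that wraps the loop in a function and runs it on two made-up files padded with whitespace. The function name `read_speeches`, the `folder` parameter, and the sample files are all assumptions for illustration.

```python
import tempfile
from pathlib import Path


def read_speeches(filenames, folder="."):
    """Read each file as UTF-8 text and return the stripped contents as a list."""
    texts = []
    for name in filenames:
        with open(Path(folder) / name, "r", encoding="utf-8") as f:
            texts.append(f.read().strip())
    return texts


# Hypothetical sample files with whitespace padding on both sides.
folder = tempfile.mkdtemp()
samples = {"a.txt": "  first speech \n", "b.txt": "\nsecond speech\n\n"}
for name, body in samples.items():
    (Path(folder) / name).write_text(body, encoding="utf-8")

text = read_speeches(["a.txt", "b.txt"], folder)
print(text)  # ['first speech', 'second speech']
```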


## Prepare a DataFrame
Now that we have all of the .txt strings in a list, we'll create a DataFrame containing the speeches and basic information about the data you have.

We have four columns:
1. filename (you already have it from Step 3)
2. name
3. year
4. speech (you got it from Step 4)

### Step 5: Get a list of names
You'll need a list containing the names in the same order as the speeches in the list you got in Step 4.

There are a few ways to do it, but since there are only 10 speeches you can consider just making a list manually.

In [5]:
names = ['obama', 'obama', 'obama', 'obama', 'obama', 'obama', 'obama', 'trump', 'trump', 'trump']
print(names)

['obama', 'obama', 'obama', 'obama', 'obama', 'obama', 'obama', 'trump', 'trump', 'trump']
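
If you'd rather not type the list by hand, one of the other ways is to derive the names from the filenames. The sketch below assumes every filename follows the pattern `<name><year>.txt`, as in the Step 3 output; the short `speeches` list is a stand-in for your full one.

```python
import os

# Stand-in for the filename list from Step 3.
speeches = ["obama2010.txt", "obama2016.txt", "trump2018.txt"]

names = []
for fn in speeches:
    stem = os.path.splitext(fn)[0]          # 'obama2010.txt' -> 'obama2010'
    names.append(stem.rstrip("0123456789"))  # drop the trailing digits -> 'obama'

print(names)  # ['obama', 'obama', 'trump']
```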


### Step 6: Get a list of years
Similarly, you'll need a list containing the years of the speeches. 

Note that the years run from 2010 to 2020, but no State of the Union speech was made in 2017.

In [22]:
years = ['2010', '2011', '2012', '2013', '2014', '2015', '2016', '2018', '2019', '2020']
print(years)

['2010', '2011', '2012', '2013', '2014', '2015', '2016', '2018', '2019', '2020']
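
The years can also be pulled out of the filenames, which avoids slips such as accidentally typing 2017. This sketch uses a regular expression and assumes each filename ends with a four-digit year followed by `.txt`; the short `speeches` list is a stand-in for your full one.

```python
import re

# Stand-in for the filename list from Step 3.
speeches = ["obama2010.txt", "obama2016.txt", "trump2018.txt"]

years = []
for fn in speeches:
    match = re.search(r"(\d{4})\.txt$", fn)
    if match is None:
        # Fail loudly rather than silently skipping an unexpected filename.
        raise ValueError(f"No four-digit year found in {fn!r}")
    years.append(match.group(1))  # kept as a string, like the manual list

print(years)  # ['2010', '2016', '2018']
```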


### Step 7: Create the DataFrame of the speech
Now that you have the four lists, it's time to create the DataFrame containing the information you need (scroll up to see the DataFrame we want).

There will be four columns:
1. filename
2. name
3. year
4. speech

<strong>Hint: Google "create a dataframe using lists"</strong>

In [33]:
# Pair the four lists row by row; zip() stops at the shortest list, so all four must have 10 items
df = pd.DataFrame(list(zip(speeches, names, years, text)), columns=['filename', 'name', 'year', 'speech'])
df

|  | filename | name | year | speech |
|---|---|---|---|---|
| 0 | obama2010.txt | obama | 2010 | Madam Speaker, Vice President Biden, members o... |
| 1 | obama2011.txt | obama | 2011 | Mr. Speaker, Mr. Vice President, members of Co... |
| 2 | obama2012.txt | obama | 2012 | Mr. Speaker, Mr. Vice President, members of C... |
| 3 | obama2013.txt | obama | 2013 | Mr. Speaker, Mr. Vice President, members of Co... |
| 4 | obama2014.txt | obama | 2014 | Mr. Speaker, Mr. Vice President, Members of Co... |
| 5 | obama2015.txt | obama | 2015 | Mr. Speaker, Mr. Vice President, Members of Co... |
| 6 | obama2016.txt | obama | 2016 | Mr. Speaker, Mr. Vice President, Members of Co... |
| 7 | trump2018.txt | trump | 2018 | Mr. Speaker, Mr. Vice President, Members of Co... |
| 8 | trump2019.txt | trump | 2019 | Madam Speaker, Mr. Vice President, Members of ... |
| 9 | trump2020.txt | trump | 2020 | Thank you very much. Thank you. Thank you v... |
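
Another common way to build the same table is to pass pandas a dictionary that maps each column name to its list. The sketch below uses two placeholder rows rather than the real speeches, so the values are assumptions for illustration only.

```python
import pandas as pd

# Placeholder lists standing in for the four lists built in Steps 3 to 6.
speeches = ["obama2010.txt", "trump2018.txt"]
names = ["obama", "trump"]
years = ["2010", "2018"]
text = ["placeholder speech one", "placeholder speech two"]

# Each dictionary key becomes a column; each list supplies that column's values.
df = pd.DataFrame({
    "filename": speeches,
    "name": names,
    "year": years,
    "speech": text,
})

print(df.shape)             # (2, 4)
print(df.columns.tolist())  # ['filename', 'name', 'year', 'speech']
```

One reason to consider this form: if the lists have different lengths, pandas raises a `ValueError`, whereas `zip` quietly drops the extra items.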


### Step 8: Export your DataFrame as CSV
Now that you're done with creating the DataFrame, export it as a CSV so that you can use it in other Parts.

In [34]:
df.to_csv(r'C:\Users\daani\US-Speech-Analysis\speechDF.csv', index = False, header = True)
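
The path above only exists on one particular machine, so you will want to swap in a location of your own. As a hedged sketch, the example below writes a small placeholder DataFrame to a temporary folder and reads it straight back to confirm that nothing was lost; `out_path` and `check` are names made up for this example.

```python
import os
import tempfile

import pandas as pd

# Placeholder DataFrame with the same four columns as the real one.
df = pd.DataFrame({
    "filename": ["obama2010.txt", "trump2018.txt"],
    "name": ["obama", "trump"],
    "year": ["2010", "2018"],
    "speech": ["placeholder, with a comma", "placeholder speech two"],
})

# Temporary folder used only so the sketch runs anywhere;
# in the notebook, a relative path such as 'speechDF.csv' would work too.
out_path = os.path.join(tempfile.mkdtemp(), "speechDF.csv")
df.to_csv(out_path, index=False)

# Read the file back; dtype keeps 'year' as text instead of letting pandas infer integers.
check = pd.read_csv(out_path, dtype={"year": str})
print(check.shape)  # (2, 4)
```

Setting `index=False` keeps the row numbers out of the file, so reading it back does not produce an extra unnamed column.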