## Democratic Candidate Speeches Project Guide

The ultimate goal of this project is to build a database of campaign stops and speeches by "all" of the 2020 Democratic candidates for US president, in order to produce an interactive map that follows their routes and allows for deeper reading of their platforms and how they change over time. For the purposes of these two weeks the goal is more modest: to begin building this database for each candidate so that we can map candidates by city, date, and speech.

The first question to ask yourself is what the best methodology is for gathering this data: how can you obtain a list of campaign stops and the speeches made at them? The most programmatic method, and the one offered here in the programming guidelines, is scraping speeches from C-SPAN. This is limited, of course, by C-SPAN's coverage and by the poor quality of closed captioning transcripts. Nonetheless, you could get pretty far using this method. It is the path of least resistance (which is often the worst path), but it will provide an adequate data set for investigating this concept.

You may choose other methodologies, such as requesting transcripts directly from the campaigns. This may or may not work, and it may take time, which you don't have. But all methods for gathering the data are open.

From the standpoint of data architecture this may be the most straightforward of the projects. You want to collect the name of the candidate, the date and location of the speech, the text of the speech, and anything else you think may be pertinent. Give it some thought. If you can obtain or infer what kind of event the speech was for (rally, fundraiser, and so forth), that would be extremely helpful but also more advanced. As you scrape or build your database, think about what your rows and columns should look like.
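
One way to think about rows and columns is one row per speech. The sketch below is only a suggestion: the field names (`candidate`, `date`, `city`, `state`, `url`, `event_type`, `text`) are assumptions made for illustration, not a required schema, and the example values are placeholders rather than real data.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields


@dataclass
class SpeechRecord:
    """One row per speech; every field name here is illustrative."""
    candidate: str
    date: str             # ISO format, e.g. "2019-06-01"
    city: str
    state: str
    url: str = ""         # link to the source page for the speech
    event_type: str = ""  # optional: "rally", "fundraiser", ...
    text: str = ""        # full speech text, filled in later


def records_to_csv(records):
    """Serialize a list of SpeechRecord objects to a CSV string."""
    buffer = io.StringIO()
    column_names = [f.name for f in fields(SpeechRecord)]
    writer = csv.DictWriter(buffer, fieldnames=column_names)
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


# Placeholder values, not real data
example = SpeechRecord("Candidate A", "2019-06-01", "Des Moines", "IA")
print(records_to_csv([example]))
```

Keeping the optional columns empty at first lets you add event types or speech text later without changing the shape of the table.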

For the purposes of the next two weeks, if you are using the C-SPAN method you may choose to do a handful of candidates (see instructions below). If you are gathering by less automated methods, I would suggest focusing on two or three.

For your output, it is perfectly reasonable to just have a point map with a different color point for each candidate and a click-through to read the speech. If you want to do more qualitative work, you might think about searching, aggregating, and reading through the speeches to see what words and phrases recur from candidate to candidate. But again, this is advanced. It may be worth a little time to investigate, but it is not required.
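
If you do try the optional qualitative step, a simple starting point is counting words per candidate and keeping the ones everyone repeats. This is a minimal sketch, assuming you have already combined each candidate's speeches into one string; the function names and the `min_count` threshold are hypothetical, and the sample text is made up.

```python
import re
from collections import Counter


def word_counts(text):
    """Count lowercase word occurrences in one block of speech text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def shared_words(speeches_by_candidate, min_count=2):
    """Return words that every candidate uses at least `min_count` times.

    `speeches_by_candidate` maps a candidate name to one string holding
    all of that candidate's speech text.
    """
    frequent_sets = []
    for text in speeches_by_candidate.values():
        counts = word_counts(text)
        frequent_sets.append({w for w, n in counts.items() if n >= min_count})
    if not frequent_sets:
        return set()
    return set.intersection(*frequent_sets)


# Made-up snippets for illustration only
sample = {
    "Candidate A": "Health care is a right. Health care for all.",
    "Candidate B": "We will fix health care. Health care costs too much.",
}
print(sorted(shared_words(sample)))
```

On real transcripts you would likely want to drop very common words ("the", "and") before comparing, since they will dominate any raw count.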






### Scraping C-SPAN:
Below are your initial coding instructions for obtaining the speech coverage and the transcripts of each speech for a candidate. Again, you do not have to use C-SPAN. It entails advanced use of Beautiful Soup and Selenium, but it will give you a solid data set and a chance to solidify your coding. If you choose other methods, that will be an uncharted path. I fully support any attempt to use other methods, but understand that there is more of a chance of getting lost!

Time to code!!!

### STEP 1: Get C-SPAN's coverage of the Democratic candidates

The following page, discovered by Sawyer (thanks, Sawyer!), is an incredibly helpful starting point for obtaining an initial database of the candidates' names, the start dates of their candidacies, and, most importantly for scraping C-SPAN's transcripts, each candidate's unique ID:

https://www.c-span.org/series/?roadToTheWhiteHouse#candidates-tab

Scrape this page to build a database with as much information as you can for all of the candidates. It should be one row per candidate.
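
A possible starting point is sketched below. It rests on an assumption you should check by inspecting the page: that each candidate has at least one link whose `href` embeds the ID as a `personid` parameter, the same way the search URL in step two does. The function names are hypothetical, and the page may load some content with JavaScript, in which case `requests` alone will not see it and you would need Selenium instead.

```python
import re

# Assumption: candidate links embed the ID as personid=..., personid[]=...
# or the URL-encoded form personid%5B%5D=...
PERSON_ID_PATTERN = re.compile(r"personid(?:%5B%5D|\[\])?=(\d+)")


def extract_person_id(href):
    """Return the numeric candidate ID found in a link, or None."""
    match = PERSON_ID_PATTERN.search(href)
    return match.group(1) if match else None


def parse_candidate_links(html):
    """Collect every link in the HTML that carries a candidate ID."""
    from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for link in soup.find_all("a", href=True):
        person_id = extract_person_id(link["href"])
        if person_id is not None:
            rows.append({
                "person_id": person_id,
                "link_text": link.get_text(strip=True),
                "href": link["href"],
            })
    return rows


def fetch_candidate_links(url):
    """Download the page and parse it (requires network access)."""
    import requests  # third-party: pip install requests

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return parse_candidate_links(response.text)


# Example call (not run here):
# rows = fetch_candidate_links(
#     "https://www.c-span.org/series/?roadToTheWhiteHouse#candidates-tab")
```

This only recovers IDs and link text. The candidate names and start dates live elsewhere in the page markup, so you will still need to locate those tags yourself to get one complete row per candidate.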


In [11]:
###Import your scraping libraries

In [12]:
###Write your scraping code here



### STEP 2: Scrape a candidate's speech coverage
Select only one candidate. If you get this working for one, then go back and get it working for another. (You could theoretically loop through multiple candidates using this method, but I don't recommend it. I would go one by one.)

This is the best way I have found to get proper search results for the speeches by candidate. You may find others that you think are better (like the URL attached to "View All Videos" on the page you scraped in step one). If so, go for it! My suggestion is to use this link:

https://www.c-span.org/search/?empty-date=1&sdate=&edate=&searchtype=Videos&sort=Most+Recent+Airing&text=0&&personid%5B%5D=1023023&formatid%5B%5D=55&show100=

It produces search results that target the candidate's ID:

'personid%5B%5D=1023023'

And the event tag for a speech: 'formatid%5B%5D=55' 

Also, 'show100=' will give you up to the 100 most recent results, which should definitely be enough. Later you will likely want to go in and remove speeches that predate the campaign start date (which you have in the table that you scraped in step one).

The only thing you need to change programmatically in this URL is the candidate ID. This example uses Elizabeth Warren's ID, '1023023'.
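
Swapping in the candidate ID can be done with the standard library. The sketch below rebuilds the URL above from its parameters; `build_search_url` is a hypothetical helper name. The one difference from the link as written is that the doubled `&&` collapses to a single `&`, which should not change the query because the extra `&` only encodes an empty parameter.

```python
from urllib.parse import urlencode

SEARCH_BASE = "https://www.c-span.org/search/"


def build_search_url(person_id):
    """Return the C-SPAN speech search URL for one candidate ID."""
    params = [
        ("empty-date", "1"),
        ("sdate", ""),
        ("edate", ""),
        ("searchtype", "Videos"),
        ("sort", "Most Recent Airing"),
        ("text", "0"),
        ("personid[]", person_id),  # encoded as personid%5B%5D
        ("formatid[]", "55"),       # the event tag for a speech
        ("show100", ""),
    ]
    return SEARCH_BASE + "?" + urlencode(params)


print(build_search_url("1023023"))
```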

Scrape the information available on the page to make a list of lists or list of dictionaries. There is a lot of important information there; the most important piece is the link to the speech itself.

You can use Beautiful Soup or Selenium to scrape this page, whichever you are most comfortable with. Do not try to do all the steps in one programmatic sweep. This data set is small enough that it makes more sense to go step by step. Also, some of these links will not necessarily be single speeches by the candidate: you will need to stop and evaluate your resulting database, and remove or tag rows that are not useful.
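
With Beautiful Soup, collecting the speech links into a list of dictionaries might look like the sketch below. It assumes that links to speeches contain `/video/?` in their `href`, which you should verify against the actual search results. The function names and the `title`/`url` keys are hypothetical, and dates and other details still need their own tags located by inspection.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.c-span.org/"


def is_video_link(href):
    """Assumption: speech pages live under a path containing '/video/?'."""
    return "/video/?" in href


def parse_search_results(html):
    """Return one {'title', 'url'} dictionary per distinct video link."""
    from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

    soup = BeautifulSoup(html, "html.parser")
    results, seen = [], set()
    for link in soup.find_all("a", href=True):
        if not is_video_link(link["href"]):
            continue
        url = urljoin(BASE_URL, link["href"])  # make relative links absolute
        title = link.get_text(" ", strip=True)
        # Skip repeats and links with no visible text (e.g. thumbnails)
        if url in seen or not title:
            continue
        seen.add(url)
        results.append({"title": title, "url": url})
    return results
```

Because a results page often links to the same video more than once, the `seen` set keeps the output to one row per URL, which makes the later manual review easier.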

Once you have a good dataframe with all the speech information for a candidate, export the dataframe as a CSV in order to back it up.
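
The cleanup and backup can be sketched in pandas as follows. The column names `date` and `url`, the helper name, and the start date passed in are all assumptions for illustration; use whatever your own scrape produced and the start date from your step one table.

```python
import pandas as pd


def clean_and_export(rows, campaign_start, csv_path):
    """Build a dataframe, drop pre-campaign speeches, and save a CSV.

    `rows` is a list of dictionaries with at least 'url' and 'date' keys.
    Rows whose date cannot be parsed become NaT and are dropped by the
    date filter, so check how many rows you lose.
    """
    df = pd.DataFrame(rows)
    df["date"] = pd.to_datetime(df["date"], errors="coerce")
    df = df[df["date"] >= pd.Timestamp(campaign_start)]
    df = df.drop_duplicates(subset="url").reset_index(drop=True)
    df.to_csv(csv_path, index=False)
    return df


# Example call with a placeholder start date (not a real campaign date):
# speeches = clean_and_export(rows, "2019-01-01", "candidate_speeches.csv")
```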

In [13]:
#Get the candidate's ID from your first dataframe
#in (step 1)
#Input it into the URL above
#Use beautiful soup or selenium to scrape

In [14]:
#Bring your list of lists or dictionaries
#into pandas
#Remove old speeches (unless you want to keep them)
#Export your clean dataframe as a CSV

### STEP 3: Scrape closed captioning transcripts for each speech

You can only do this part using Selenium. Export the URLs from the last dataframe and loop through them, using Selenium to scrape the closed captioning text of each speech. This is hard!

Follow the instructions below, and see how you do. You might want to do each speech one by one. But eventually you want a dataframe with two columns: the URL and the full speech text. You will then join that to your dataframe from step two.

(For this project I recommend having multiple tables/dataframes, one for each candidate.) You can join them to the first dataframe (the candidate table from step one) later.
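
The join between the speech table and the transcript table can be sketched with `DataFrame.merge`. This assumes both dataframes have a column named `url` holding identical strings; the column names and the helper name are hypothetical.

```python
import pandas as pd


def attach_transcripts(speeches, transcripts):
    """Left-join transcript text onto the speech table using the URL.

    A left join keeps every speech row, so speeches whose transcript
    failed to scrape show up with a missing `text` value instead of
    silently disappearing.
    """
    return speeches.merge(
        transcripts,
        on="url",
        how="left",
        validate="one_to_one",  # raises if a URL is duplicated in either table
    )


# Tiny placeholder tables for illustration
speeches = pd.DataFrame({"url": ["u1", "u2"], "title": ["a", "b"]})
transcripts = pd.DataFrame({"url": ["u1"], "text": ["hello"]})
print(attach_transcripts(speeches, transcripts))
```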

The closed captioning text is distributed across multiple HTML frames. You need to locate those frames in Selenium and first click "Show Full Text" for every box that has that option. Only after you have done all of this clicking should you extract the text from each of those fields.
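
The click-then-extract order described above might be sketched as follows. Two things here are assumptions rather than facts about the page: that the expand control is a link whose visible text contains "Show Full Text", and whatever CSS selector you pass as `transcript_selector`, which you have to find yourself by inspecting a speech page. The function names are hypothetical.

```python
def join_chunks(chunks):
    """Join transcript pieces into one string, skipping empty ones."""
    return " ".join(c.strip() for c in chunks if c and c.strip())


def scrape_transcript(driver, url, transcript_selector,
                      wait_seconds=15, pause=1.0):
    """Open one speech page, expand every box, and return the full text."""
    import time
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver.get(url)
    # Wait until at least one transcript box is present on the page
    WebDriverWait(driver, wait_seconds).until(
        EC.presence_of_all_elements_located(
            (By.CSS_SELECTOR, transcript_selector))
    )
    # Assumption: each expand control is a link containing this text
    for link in driver.find_elements(By.PARTIAL_LINK_TEXT, "Show Full Text"):
        driver.execute_script("arguments[0].click();", link)
        time.sleep(pause)  # give the page time to reveal the text
    # Only now, after all the clicking, read the text out of each box
    boxes = driver.find_elements(By.CSS_SELECTOR, transcript_selector)
    return join_chunks(box.text for box in boxes)


# Example call (not run here; the selector is a placeholder to replace):
# from selenium import webdriver
# driver = webdriver.Chrome()
# text = scrape_transcript(driver, speech_url, "YOUR-SELECTOR-HERE")
# driver.quit()
```

Clicking through JavaScript rather than `link.click()` is a design choice that avoids errors when a link is scrolled out of view. If the page rebuilds its elements after each click, you may need to re-find the links inside the loop.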

In [15]:
#Try scraping one page first


In [16]:
#Have selenium open the page
#Locate the tags that hold the transcript


In [17]:
#Instruct Selenium to click "Show Full Text"
#when it appears
#To make all of the text available for each frame
#If you are doing this in a loop
#Make sure to code for a delay/wait
#Before going onto the next step
#You need to give it time to click everything

In [18]:
#Once all the frames have revealed the text
#Extract all the text

In [19]:
#If you got this working for one
#Try looping this, but be careful about timing
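
For the loop itself, one cautious pattern is to pause between pages and record failures instead of letting one bad page stop the run. This is a minimal sketch: `scrape_all` and its parameters are hypothetical names, and `scrape_one` stands for whatever single-page function you got working above.

```python
import time


def scrape_all(urls, scrape_one, delay_seconds=5.0):
    """Scrape each URL in turn, pausing between pages.

    Returns one dictionary per URL with the text (empty on failure) and
    an error message (empty on success), ready to load into pandas.
    """
    urls = list(urls)
    rows = []
    for i, url in enumerate(urls):
        try:
            text, error = scrape_one(url), ""
        except Exception as exc:  # keep going; inspect failures afterwards
            text, error = "", repr(exc)
        rows.append({"url": url, "text": text, "error": error})
        if i < len(urls) - 1:
            time.sleep(delay_seconds)  # be polite and let pages settle
    return rows
```

Keeping the `error` column makes it easy to filter for the pages that failed and retry only those.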

### STEP 4: Use regular expressions and your mind to clean up the speeches
This is potentially a hornet's nest. There might be other speakers involved, or there might not. It is up to you how clean you want the speeches to be and how far you want to push this part.
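
As one example of the kind of pattern to look for, closed captions commonly mark a change of speaker with `>>` and put sound notes such as applause in square brackets. The sketch below assumes your transcripts follow those conventions, which you should confirm by viewing a speech first. The function names are hypothetical and the sample text is made up.

```python
import re


def clean_caption_text(raw):
    """Strip common caption markup and collapse whitespace."""
    text = re.sub(r"\[[^\]]*\]", " ", raw)  # bracketed notes like [APPLAUSE]
    text = re.sub(r">{2,}", " ", text)      # ">>" speaker-change markers
    text = re.sub(r"\s+", " ", text)        # newlines and repeated spaces
    return text.strip()


def split_by_speaker(raw):
    """Split a transcript into cleaned turns at each ">>" marker."""
    turns = (clean_caption_text(part) for part in re.split(r">{2,}", raw))
    return [turn for turn in turns if turn]


# Made-up caption text for illustration only
sample = ">> THANK YOU.\n[APPLAUSE]  >> IT IS GREAT   TO BE HERE."
print(clean_caption_text(sample))
print(split_by_speaker(sample))
```

Splitting into turns does not tell you who is speaking, only where the speaker changes, so deciding which turns belong to the candidate still takes reading.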

In [20]:
#View a speech and look for patterns

### STEP 5: Evaluate where you are and what you want to do next
If you have successfully created a dataframe of all the C-SPAN speeches for a candidate, that is totally amazing!!!! Export your dataframe as a CSV. Think about whether you want to try another candidate and repeat the steps above.