## Independent Project: Converting Election Data from NBC into a CSV File

While looking for project ideas on Kaggle, [this dataset](https://www.kaggle.com/imoore/2020-us-general-election-turnout-rates/activity) caught my eye. My first instinct was to investigate whether turnout in a state correlated with the share of votes that Joe Biden or Donald Trump received there.

However, a Google search for '2020 Election Results CSV' didn't yield any results. That's frustrating, given that several news outlets have published this data. Why hasn't someone turned it into a CSV? Wait - why don't I build the CSV myself from a news outlet's site?

### Part 1: Web Scraping using Beautiful Soup
I decided to use the vote count published [here](https://www.nbcnews.com/politics/2020-elections/president-results) on the NBC News website. I loaded the HTML into a string and selected the 'tr' elements, which represent the rows in the Results table. The list this returned was 63 elements long, so I examined the first 30 characters of each element to determine which ones actually represented a state or territory, and deleted the remaining rows (the header row and the last six) from my list.

In [1]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "https://www.nbcnews.com/politics/2020-elections/president-results"
page = urlopen(url)
html_bytes = page.read()
html = html_bytes.decode("utf-8")
soup = BeautifulSoup(html, 'html.parser')

In [2]:
state_list = []
state_data = soup.find_all('tr')
for item in state_data: 
    state_list.append(str(item))
print(len(state_list))

63


In [3]:
list_preview = [i[:30] for i in state_list]
print(list_preview)

['<tr><th scope="col">State Name', '<tr><th scope="row">Alabama</t', '<tr><th scope="row">Alaska</th', '<tr><th scope="row">Arizona</t', '<tr><th scope="row">Colorado</', '<tr><th scope="row">Florida</t', '<tr><th scope="row">Georgia</t', '<tr><th scope="row">Indiana</t', '<tr><th scope="row">Kansas</th', '<tr><th scope="row">Maine</th>', '<tr><th scope="row">Massachuse', '<tr><th scope="row">Minnesota<', '<tr><th scope="row">New Jersey', '<tr><th scope="row">North Caro', '<tr><th scope="row">North Dako', '<tr><th scope="row">Oklahoma</', '<tr><th scope="row">Pennsylvan', '<tr><th scope="row">South Dako', '<tr><th scope="row">Texas</th>', '<tr><th scope="row">Wyoming</t', '<tr><th scope="row">Connecticu', '<tr><th scope="row">Missouri</', '<tr><th scope="row">West Virgi', '<tr><th scope="row">Illinois</', '<tr><th scope="row">New Mexico', '<tr><th scope="row">Arkansas</', '<tr><th scope="row">California', '<tr><th scope="row">Delaware</', '<tr><th scope="row">District o', '<tr><th scop

In [4]:
del state_list[0]
del state_list[-6:]
print(len(state_list))

56
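
Deleting rows by position works for this snapshot of the page, but it would break if the table gained or lost a row. A minimal sketch of an alternative is to keep only the rows that carry a 'th scope="row"' header cell, which is what every state row in the preview above starts with. The sample table here is made up for illustration (the second column header and the footnote row are hypothetical), and it assumes the non-data rows on the real page lack that header cell.

```python
from bs4 import BeautifulSoup

# Made-up sample table: one header row, two data rows, one hypothetical non-data row.
sample_html = """
<table>
  <tr><th scope="col">State Name</th><th scope="col">Electoral Votes</th></tr>
  <tr><th scope="row">Alabama</th><td>9</td></tr>
  <tr><th scope="row">Alaska</th><td>3</td></tr>
  <tr><td colspan="2">Footnote row (hypothetical)</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Keep only rows whose header cell is marked scope="row".
data_rows = [
    tr for tr in sample_soup.find_all("tr")
    if tr.find("th", attrs={"scope": "row"}) is not None
]
row_names = [tr.find("th").get_text(strip=True) for tr in data_rows]
print(row_names)
```

Whether this selects exactly the same 56 rows on the live page is something I would still check against the preview before relying on it.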


### Part 2: Cleaning Data for Each State and Reading it into a DataFrame

I used a function to get rid of all of the HTML elements in each item on my list, and split the data for each state into its own list. The only piece of HTML that I chose to clean separately was the '">' that closes the link's opening tag. If I had stripped it along with the other fragments, the link text beginning with 'Go' would have been fused onto the end of the URL, making the URL in my end product invalid. Replacing it with a space instead keeps the URL and the link text separable later on.

In [5]:
print(state_list[0])

<tr><th scope="row">Alabama</th><td>9</td><td>100%</td><td>Biden 36.6% 849,648 votes</td><td>Trump 62% 1,441,168 votes</td><td><a class="" href="/politics/2020-elections/alabama-president-results?icid=election_usmap">Go to page Alabama results page</a></td></tr>


In [6]:
cleaned_list = []
bad_chars = ['<tr><th scope="row">',"</th>","</td>","</a></td></tr>",'<a class="" href="/','</a></tr>']
def strip_characters(string): 
    for char in bad_chars: 
        string = string.replace(char, "") 
    return string

for item1 in state_list:
    temp_list = []
    item_list = item1.split("<td>")
    for item in item_list: 
        item = strip_characters(item)
        item = item.replace('">'," ")
        temp_list.append(item)
    cleaned_list.append(temp_list)
print(cleaned_list[0])
    

['Alabama', '9', '100%', 'Biden 36.6% 849,648 votes', 'Trump 62% 1,441,168 votes', 'politics/2020-elections/alabama-president-results?icid=election_usmap Go to page Alabama results page']
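
String replacement does the job here, but the same row could also be taken apart with Beautiful Soup itself, which avoids having to list every HTML fragment by hand. The sketch below is an alternative, not the approach used in this notebook: `parse_row` is a hypothetical helper, and it is only tried against the Alabama row printed above.

```python
from bs4 import BeautifulSoup

# The Alabama row exactly as printed earlier in the notebook.
row_html = (
    '<tr><th scope="row">Alabama</th><td>9</td><td>100%</td>'
    '<td>Biden 36.6% 849,648 votes</td><td>Trump 62% 1,441,168 votes</td>'
    '<td><a class="" href="/politics/2020-elections/alabama-president-results'
    '?icid=election_usmap">Go to page Alabama results page</a></td></tr>'
)

def parse_row(row_html):
    """Return the text of each cell, with the last cell replaced by the link's href."""
    row = BeautifulSoup(row_html, "html.parser").find("tr")
    cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    link = row.find("a")
    if link is not None and link.get("href"):
        cells[-1] = link["href"]  # keep the URL, drop the 'Go to page ...' text
    return cells

alabama = parse_row(row_html)
print(alabama)
```

Note that the href keeps its leading slash here, unlike the cleaned list above, so any later step that prepends the domain would need to account for that.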


In [7]:
import pandas as pd
import numpy as np
results_2020 = pd.DataFrame(cleaned_list,columns=['State','Electoral Votes', '% Vote Counted','Biden Results','Trump Results','URL'])
results_2020.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56 entries, 0 to 55
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   State            56 non-null     object
 1   Electoral Votes  56 non-null     object
 2   % Vote Counted   56 non-null     object
 3   Biden Results    56 non-null     object
 4   Trump Results    51 non-null     object
 5   URL              51 non-null     object
dtypes: object(6)
memory usage: 2.8+ KB


### Part 3: Cleaning the DataFrame

Although the data had already come a long way from the HTML format, there was more to do before it would be usable as a CSV. 

The first step I took was removing the '% Vote Counted' column. At the time of scraping, every row with a value showed either 99% or 100% of votes counted (the five territory rows were blank), making this measure unimportant.

The next step was separating the results for each candidate into distinct columns for vote count and percentage. My first attempt at the split raised an error, however. After inspecting the output of the split on its own, I realized that the rows for the US territories held no results, so I removed them from my dataframe. This left only the fifty states and Washington D.C. as my row values.

In [8]:
results_2020['% Vote Counted'].value_counts()

100%    48
         5
99%      3
Name: % Vote Counted, dtype: int64

In [9]:
results_2020 = results_2020.drop(columns = '% Vote Counted')
results_2020.head()

Unnamed: 0,State,Electoral Votes,Biden Results,Trump Results,URL
0,Alabama,9,"Biden 36.6% 849,648 votes","Trump 62% 1,441,168 votes",politics/2020-elections/alabama-president-resu...
1,Alaska,3,"Biden 42.8% 153,778 votes","Trump 52.8% 189,951 votes",politics/2020-elections/alaska-president-resul...
2,Arizona,11,"Biden 49.4% 1,672,143 votes","Trump 49.1% 1,661,686 votes",politics/2020-elections/arizona-president-resu...
3,Colorado,9,"Biden 55.4% 1,804,352 votes","Trump 41.9% 1,364,607 votes",politics/2020-elections/colorado-president-res...
4,Florida,29,"Biden 47.9% 5,297,045 votes","Trump 51.2% 5,668,731 votes",politics/2020-elections/florida-president-resu...


In [10]:
new_list= results_2020['Biden Results'].str.split(" ",n=2,expand = True)
print(new_list)

        0         1                                                  2
0   Biden     36.6%                                      849,648 votes
1   Biden     42.8%                                      153,778 votes
2   Biden     49.4%                                    1,672,143 votes
3   Biden     55.4%                                    1,804,352 votes
4   Biden     47.9%                                    5,297,045 votes
5   Biden     49.5%                                    2,473,633 votes
6   Biden       41%                                    1,242,495 votes
7   Biden     41.5%                                      570,323 votes
8   Biden     53.1%                                      435,072 votes
9   Biden     65.6%                                    2,382,202 votes
10  Biden     52.4%                                    1,717,077 votes
11  Biden     57.1%                                    2,608,335 votes
12  Biden     48.6%                                    2,684,292 votes
13  Bi

In [11]:
print(results_2020.iloc[44:47, : ],results_2020.iloc[49:51, : ])

                                           State Electoral Votes  \
44                                American Samoa                   
45                                          Guam                   
46  Commonwealth of the Northern Mariana Islands                   

                                        Biden Results Trump Results   URL  
44  <a class="" href="undefined?icid=election_usma...          None  None  
45  <a class="" href="undefined?icid=election_usma...          None  None  
46  <a class="" href="undefined?icid=election_usma...          None  None                              State Electoral Votes  \
49                   Puerto Rico                   
50  United States Virgin Islands                   

                                        Biden Results Trump Results   URL  
49  <a class="" href="undefined?icid=election_usma...          None  None  
50  <a class="" href="undefined?icid=election_usma...          None  None  


In [12]:
results_2020 = results_2020.drop(index = [44,45,46,49,50])
print(len(results_2020))

51
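
Dropping rows by hard-coded index works for this scrape, but the positions would shift if the page order changed. A possible alternative is to filter on the missing values themselves, since the `info()` output shows that the five territory rows have no 'Trump Results'. The frame below is a small made-up stand-in for `results_2020`, and the placeholder string in the Guam row is hypothetical.

```python
import pandas as pd

# Made-up stand-in for results_2020: two states and one territory row.
sample = pd.DataFrame({
    "State": ["Alabama", "Guam", "Alaska"],
    "Biden Results": [
        "Biden 36.6% 849,648 votes",
        "(leftover link markup, placeholder)",
        "Biden 42.8% 153,778 votes",
    ],
    "Trump Results": ["Trump 62% 1,441,168 votes", None, "Trump 52.8% 189,951 votes"],
})

# Keep only rows that actually have results for the second candidate.
states_only = sample[sample["Trump Results"].notna()]
print(len(states_only))
```

This keeps the original index labels, just as `drop(index=...)` does.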


#### 3.1: Splitting Columns

After cleaning the territories from my data, I was able to successfully split the data for each candidate's votes into three distinct columns. I then removed the original columns from the dataframe, as well as the ones which just contained the values 'Biden' or 'Trump'. 

Then, I cleaned the URL column. Because of the way that the data was stored in the HTML file, the base URL https://www.nbcnews.com/ was missing, so I concatenated it with the values in the URL column. I also removed the text data that comes after the URL. 

In [13]:
results_2020[['Biden','Biden Vote Share','Biden Vote Count']]= results_2020['Biden Results'].str.split(" ",n=2,expand = True)
results_2020[['Trump','Trump Vote Share','Trump Vote Count']]= results_2020['Trump Results'].str.split(" ",n=2,expand = True)
results_2020 = results_2020.drop(columns = ['Trump','Biden','Trump Results','Biden Results'])
results_2020.head()

Unnamed: 0,State,Electoral Votes,URL,Biden Vote Share,Biden Vote Count,Trump Vote Share,Trump Vote Count
0,Alabama,9,politics/2020-elections/alabama-president-resu...,36.6%,"849,648 votes",62%,"1,441,168 votes"
1,Alaska,3,politics/2020-elections/alaska-president-resul...,42.8%,"153,778 votes",52.8%,"189,951 votes"
2,Arizona,11,politics/2020-elections/arizona-president-resu...,49.4%,"1,672,143 votes",49.1%,"1,661,686 votes"
3,Colorado,9,politics/2020-elections/colorado-president-res...,55.4%,"1,804,352 votes",41.9%,"1,364,607 votes"
4,Florida,29,politics/2020-elections/florida-president-resu...,47.9%,"5,297,045 votes",51.2%,"5,668,731 votes"
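
For a single value, the same three-way split can be sketched with a regular expression, which also validates the shape of the string as it goes. This is an illustration only: `RESULT_PATTERN` and `parse_result` are hypothetical names, and the pattern assumes every result looks like 'Name 12.3% 1,234 votes', as the rows shown above do.

```python
import re

# Candidate name, percentage, and comma-grouped vote count, e.g. 'Biden 36.6% 849,648 votes'.
RESULT_PATTERN = re.compile(
    r"^(?P<candidate>\S+) (?P<share>[\d.]+)% (?P<count>[\d,]+) votes$"
)

def parse_result(text):
    """Return a dict of parsed fields, or None if the text does not match."""
    match = RESULT_PATTERN.match(text)
    if match is None:
        return None
    return {
        "candidate": match.group("candidate"),
        "share": float(match.group("share")) / 100,
        "count": int(match.group("count").replace(",", "")),
    }

parsed = parse_result("Biden 36.6% 849,648 votes")
print(parsed["candidate"], parsed["count"])
```

Returning `None` for anything that does not match would have flagged the territory rows without raising an error.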


In [14]:
# Drop the link text that follows the URL (everything after the first space),
# then prepend the site domain to make the URL clickable.
results_2020['URL'] = results_2020['URL'].str.split(" ", n=1).str[0]
results_2020['URL'] = 'https://www.nbcnews.com/' + results_2020['URL']
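
String concatenation is fine when every path has the same shape. As a sketch of a slightly more forgiving option, `urljoin` from the standard library gives the same result whether or not the relative path starts with a slash. The names `BASE_URL` and `relative_path` are just for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.nbcnews.com/"
relative_path = "politics/2020-elections/alabama-president-results?icid=election_usmap"

# Both forms resolve to the same absolute URL.
full_url = urljoin(BASE_URL, relative_path)
full_url_with_slash = urljoin(BASE_URL, "/" + relative_path)
print(full_url)
```

To apply it to the whole column, something like `results_2020['URL'].apply(lambda path: urljoin(BASE_URL, path))` should work.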

#### 3.2: Cleaning Numeric Data

All of the columns at this point are 'object' dtype, but most of them hold numeric values. 

First, I cleaned the percentage data by eliminating the percent sign, converting each datapoint to a float data type, and dividing the number by 100 to get the raw ratio. 

I used a similar function on the vote count columns. Here, I eliminated the word 'votes' and got rid of the commas. I then converted the data to dtype integer. 

Finally, I converted the 'Electoral Votes' Column to dtype integer as well. 

In [15]:
percent_cols = ['Biden Vote Share','Trump Vote Share']
def clean_percent(item): 
    item = item.replace('%',"")
    item = float(item)
    item = item/100 
    return item
for col in percent_cols: 
    results_2020[col] = results_2020[col].apply(clean_percent)
results_2020.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 51 entries, 0 to 55
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   State             51 non-null     object 
 1   Electoral Votes   51 non-null     object 
 2   URL               51 non-null     object 
 3   Biden Vote Share  51 non-null     float64
 4   Biden Vote Count  51 non-null     object 
 5   Trump Vote Share  51 non-null     float64
 6   Trump Vote Count  51 non-null     object 
dtypes: float64(2), object(5)
memory usage: 3.2+ KB


In [16]:
count_cols = ['Biden Vote Count', 'Trump Vote Count']
def clean_count(item):
    item = item.replace(' votes',"")
    item = item.replace(',',"")
    item = int(item)
    return item
for col in count_cols: 
    results_2020[col] = results_2020[col].apply(clean_count)
results_2020.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 51 entries, 0 to 55
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   State             51 non-null     object 
 1   Electoral Votes   51 non-null     object 
 2   URL               51 non-null     object 
 3   Biden Vote Share  51 non-null     float64
 4   Biden Vote Count  51 non-null     int64  
 5   Trump Vote Share  51 non-null     float64
 6   Trump Vote Count  51 non-null     int64  
dtypes: float64(2), int64(2), object(3)
memory usage: 3.2+ KB


In [17]:
results_2020['Electoral Votes'] = results_2020['Electoral Votes'].astype(int)
results_2020.head()

Unnamed: 0,State,Electoral Votes,URL,Biden Vote Share,Biden Vote Count,Trump Vote Share,Trump Vote Count
0,Alabama,9,https://www.nbcnews.com/politics/2020-election...,0.366,849648,0.62,1441168
1,Alaska,3,https://www.nbcnews.com/politics/2020-election...,0.428,153778,0.528,189951
2,Arizona,11,https://www.nbcnews.com/politics/2020-election...,0.494,1672143,0.491,1661686
3,Colorado,9,https://www.nbcnews.com/politics/2020-election...,0.554,1804352,0.419,1364607
4,Florida,29,https://www.nbcnews.com/politics/2020-election...,0.479,5297045,0.512,5668731
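
The same cleaning can also be written with pandas' vectorized string methods instead of `apply`. This is an alternative sketch, shown on a two-row made-up frame that reuses the Alabama and Alaska values from above; it assumes every value follows the same pattern, as it does in this dataset.

```python
import pandas as pd

sample = pd.DataFrame({
    "Biden Vote Share": ["36.6%", "42.8%"],
    "Biden Vote Count": ["849,648 votes", "153,778 votes"],
})

# Strip the percent sign, convert to float, and scale to a ratio.
sample["Biden Vote Share"] = (
    sample["Biden Vote Share"].str.rstrip("%").astype(float) / 100
)

# Remove the trailing word and the thousands separators, then convert to int.
sample["Biden Vote Count"] = (
    sample["Biden Vote Count"]
    .str.replace(" votes", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(int)
)
print(sample.dtypes)
```

For 51 rows the speed difference is irrelevant, so this is mostly a matter of taste.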


#### 3.3: Sorting Data

The data is really close to being ready to export, but I realized that the states are still in the order in which they appeared on the page, which is not alphabetical. I sorted the data alphabetically by state, the format in which one would expect this data to be displayed.

In [18]:
results_2020 = results_2020.sort_values(by='State')
results_2020.reset_index(drop=True, inplace=True)
results_2020.head()

Unnamed: 0,State,Electoral Votes,URL,Biden Vote Share,Biden Vote Count,Trump Vote Share,Trump Vote Count
0,Alabama,9,https://www.nbcnews.com/politics/2020-election...,0.366,849648,0.62,1441168
1,Alaska,3,https://www.nbcnews.com/politics/2020-election...,0.428,153778,0.528,189951
2,Arizona,11,https://www.nbcnews.com/politics/2020-election...,0.494,1672143,0.491,1661686
3,Arkansas,6,https://www.nbcnews.com/politics/2020-election...,0.348,423932,0.624,760647
4,California,55,https://www.nbcnews.com/politics/2020-election...,0.635,11109764,0.343,6005961


### Part 4: Exporting the Data

Now that I am happy with how my data looks, I can export it to a CSV file.
I simply used the pandas DataFrame.to_csv() method for this.

In [19]:
results_2020.to_csv('/Users/Austen/Desktop/python/NBC 2020 US Presidential Results.csv', index=False)
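
A quick way to gain confidence in the export is to read the CSV back and compare it with the original dataframe. The sketch below does this with an in-memory buffer and a two-row made-up frame (values taken from the Alabama and Alaska rows above), so it does not touch the file on disk.

```python
import io

import pandas as pd

sample = pd.DataFrame({
    "State": ["Alabama", "Alaska"],
    "Electoral Votes": [9, 3],
    "Biden Vote Share": [0.366, 0.428],
})

# Write to an in-memory buffer instead of a file, then read it back.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Raises an AssertionError if the round trip changed values or dtypes.
pd.testing.assert_frame_equal(sample, reloaded)
print(reloaded.shape)
```

The same check could be pointed at the real file by passing its path to `pd.read_csv`.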