# Data engineering

This notebook describes the process to download and prepare United States presidential election data. You will address missing values, reformat data types, and restructure the format of a table.

***

## Load and prepare data

To download and prepare the election data, you will use ArcPy, the ArcGIS API for Python, and a Pandas dataframe. First, you will import these modules to use them. Then, you will create a variable for the United States county election data and use this variable to read the data into a Pandas dataframe.

##### Import needed modules

In [1]:
import arcgis
import pandas as pd
import os
#import arcpy

  pd.datetime,


##### Read data into Python

In [2]:
data_df = pd.read_csv("countypres2016.csv")

In [3]:
data_df.head()

Unnamed: 0,year,state,state_po,county,FIPS,office,candidate,party,candidatevotes,totalvotes,version
0,2016,Alabama,AL,Autauga,1001.0,President,Hillary Clinton,democrat,5936.0,24973,20190722
1,2016,Alabama,AL,Autauga,1001.0,President,Donald Trump,republican,18172.0,24973,20190722
2,2016,Alabama,AL,Autauga,1001.0,President,Other,,865.0,24973,20190722
3,2016,Alabama,AL,Baldwin,1003.0,President,Hillary Clinton,democrat,18458.0,95215,20190722
4,2016,Alabama,AL,Baldwin,1003.0,President,Donald Trump,republican,72883.0,95215,20190722


***

## Handle missing data 

The election data includes a records that are missing data in the FIPS field. This missing data is referred to as null values. You will identify how many rows have null values and create a new dataframe that does not include them.
![Null Values](img/null_values.gif "Null Values")

In [4]:
data_df.isnull().sum()

year                 0
state                0
state_po            12
county               0
FIPS                12
office               0
candidate            0
party             3158
candidatevotes       6
totalvotes           0
version              0
dtype: int64

In [5]:
# Perform a query on the dataframe using the loc function and the necessary field name.
data_df.loc[data_df['FIPS'].isnull()]  # We can use the isnull function built in to Pandas to find the records with null FIPS.

Unnamed: 0,year,state,state_po,county,FIPS,office,candidate,party,candidatevotes,totalvotes,version
9462,2016,Connecticut,,Statewide writein,,President,Hillary Clinton,democrat,,5056,20190722
9463,2016,Maine,,Maine UOCAVA,,President,Hillary Clinton,democrat,3017.0,5056,20190722
9464,2016,Alaska,,District 99,,President,Hillary Clinton,democrat,274.0,5056,20190722
9465,2016,Rhode Island,,Federal Precinct,,President,Hillary Clinton,democrat,637.0,5056,20190722
9466,2016,Connecticut,,Statewide writein,,President,Donald Trump,republican,,5056,20190722
9467,2016,Maine,,Maine UOCAVA,,President,Donald Trump,republican,648.0,5056,20190722
9468,2016,Alaska,,District 99,,President,Donald Trump,republican,40.0,5056,20190722
9469,2016,Rhode Island,,Federal Precinct,,President,Donald Trump,republican,53.0,5056,20190722
9470,2016,Connecticut,,Statewide writein,,President,Other,,,5056,20190722
9471,2016,Maine,,Maine UOCAVA,,President,Other,,321.0,5056,20190722


In [6]:
#query out missing instances
data_df.loc[data_df['state_po'].isnull()]

Unnamed: 0,year,state,state_po,county,FIPS,office,candidate,party,candidatevotes,totalvotes,version
9462,2016,Connecticut,,Statewide writein,,President,Hillary Clinton,democrat,,5056,20190722
9463,2016,Maine,,Maine UOCAVA,,President,Hillary Clinton,democrat,3017.0,5056,20190722
9464,2016,Alaska,,District 99,,President,Hillary Clinton,democrat,274.0,5056,20190722
9465,2016,Rhode Island,,Federal Precinct,,President,Hillary Clinton,democrat,637.0,5056,20190722
9466,2016,Connecticut,,Statewide writein,,President,Donald Trump,republican,,5056,20190722
9467,2016,Maine,,Maine UOCAVA,,President,Donald Trump,republican,648.0,5056,20190722
9468,2016,Alaska,,District 99,,President,Donald Trump,republican,40.0,5056,20190722
9469,2016,Rhode Island,,Federal Precinct,,President,Donald Trump,republican,53.0,5056,20190722
9470,2016,Connecticut,,Statewide writein,,President,Other,,,5056,20190722
9471,2016,Maine,,Maine UOCAVA,,President,Other,,321.0,5056,20190722


In [7]:
# Determine how many rows are in the table
rowcount = data_df.shape[0]
rowcount

9474

In [8]:
data_df.shape

(9474, 11)

In [9]:
# Determine how many rows have null FIPS 
null_fips_rowcount = data_df.loc[data_df['FIPS'].isnull()].shape[0]
null_fips_rowcount

12

In [10]:
# Calculate how much of the data this represents as a percentage
percentage_null_fips = round((null_fips_rowcount / rowcount) * 100, 2)
percentage_null_fips

0.13

In [11]:
# Use a print statement to report this information
print("There were "+str(null_fips_rowcount)+" records with null FIPS values in the data.\nThis amounts to " +str(percentage_null_fips)+"% of the available data.")

There were 12 records with null FIPS values in the data.
This amounts to 0.13% of the available data.


In [12]:
# Use the notnull function and the loc function to create a new dataframe without null FIPS records
data_df = data_df.loc[data_df['FIPS'].notnull()]

In [13]:
data_df.isnull().sum()

year                 0
state                0
state_po             0
county               0
FIPS                 0
office               0
candidate            0
party             3154
candidatevotes       3
totalvotes           0
version              0
dtype: int64

In [14]:
#we can fill the candidatevote with the mean value
data_df["candidatevotes"].fillna(data_df["candidatevotes"].mean(), inplace=True)

In [15]:
data_df.isnull().sum()

year                 0
state                0
state_po             0
county               0
FIPS                 0
office               0
candidate            0
party             3154
candidatevotes       0
totalvotes           0
version              0
dtype: int64

In [16]:
# TO avoid being biased, replace 'nan'in party feature with "not recorded"
data_df["party"].fillna("not recorded", inplace=True)

In [17]:
data_df.isnull().sum()

year              0
state             0
state_po          0
county            0
FIPS              0
office            0
candidate         0
party             0
candidatevotes    0
totalvotes        0
version           0
dtype: int64

In [18]:
data_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9462 entries, 0 to 9461
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   year            9462 non-null   int64  
 1   state           9462 non-null   object 
 2   state_po        9462 non-null   object 
 3   county          9462 non-null   object 
 4   FIPS            9462 non-null   float64
 5   office          9462 non-null   object 
 6   candidate       9462 non-null   object 
 7   party           9462 non-null   object 
 8   candidatevotes  9462 non-null   float64
 9   totalvotes      9462 non-null   int64  
 10  version         9462 non-null   int64  
dtypes: float64(2), int64(3), object(6)
memory usage: 887.1+ KB


***

## Explore and handle data types

In reviewing your data, you notice that the FIPS field is considered a numeric field instead of a string. As a result, leading zeroes in the FIPS values have been removed. The resulting FIPS values only have four characters instead of five. You will determine how many records are missing leading zeroes and add, or append, the missing zero.
![fix_truncated_zeroes](img/trunc_zeroes.gif "Fix Truncated Zeroes")

In [19]:
# Get the random five records of the table
data_df.sample(5)

Unnamed: 0,year,state,state_po,county,FIPS,office,candidate,party,candidatevotes,totalvotes,version
3871,2016,Michigan,MI,Macomb,26099.0,President,Donald Trump,republican,224665.0,419312,20190722
6829,2016,Pennsylvania,PA,Dauphin,42043.0,President,Donald Trump,republican,60863.0,130872,20190722
1807,2016,Idaho,ID,Twin Falls,16083.0,President,Donald Trump,republican,19828.0,29874,20190722
5787,2016,North Carolina,NC,Davidson,37057.0,President,Hillary Clinton,democrat,18109.0,74856,20190722
2794,2016,Kansas,KS,Grant,20067.0,President,Donald Trump,republican,1804.0,2389,20190722


In [20]:
import numpy as np

In [21]:
m = np.array([1.0,2.0,3.0])

m.astype(np.str)

In [22]:
b = data_df['FIPS'].astype(np.str)

In [23]:
fip = data_df.loc[data_df['FIPS']]

KeyError: 'Passing list-likes to .loc or [] with any missing labels is no longer supported, see https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike'

In [None]:
g = data_df['FIPS'].apply(lambda x: (len(str(x))-2) ==4)

In [24]:
data_df.loc[g]

NameError: name 'g' is not defined

In [25]:
# Check how many records have a FIPS value with four characters

trunc_df = data_df.loc[data_df['FIPS'].str.len() == 4]
trunc_data_per = (trunc_df.shape[0] / data_df.shape[0])*100

# Use another print statement (using the f format key) to report this information
print(f"{round(trunc_data_per, 2)}% of data ({trunc_df.shape[0]} rows) has truncated FIPS values.")

AttributeError: Can only use .str accessor with string values!

The following cell creates a function in python that adds a leading zero to the FIPS value if it only has four characters.  

In [26]:
# Define a helper function to fix truncated zeros, with one parameter: the value to be processed
def fix_trunc_zeros(val):
    # Use an if statement to check if there are four characters in the string representation of the value
    if len(str(val)) == 4:
        # If this is the case, return the value with an appended "0" in the front
        return "0"+str(val)
    # Otherwise...
    else:
        # Return the value itself
        return str(val)

In [27]:
# Test helper function with truncated value
fix_trunc_zeros(7042)  # You should see an appended zero: "07042"

'07042'

In [28]:
# Run helper function on the FIPS field using the apply and lambda method 
data_df['FIPS'] = data_df['FIPS'].apply(lambda x: fix_trunc_zeros(x))

# Print information on the operation performed, and show the first few records to confirm it worked
print(f"{round(trunc_data_per, 2)}% of data ({trunc_df.shape[0]} rows) had truncated FIPS IDs corrected.")
data_df.head()

NameError: name 'trunc_data_per' is not defined

***

## Reformat the table structure

Currently, each record in the table corresponds to a candidate and their votes in a county. You need to reformat the table so that each record corresponds to each county, with fields showing the votes for different candidates in that election year. 
It is possible to do this using the [Pivot Table geoprocessing tool](https://pro.arcgis.com/en/pro-app/tool-reference/data-management/pivot-table.htm) or Excel pivot tables, but Python may make it easier to automate and share.
The animation below illustrates the steps in restructuring the table:
1. Set a few fields aside, "locking" them from the table pivot. 
2. Pivot the table using the remaining fields.
3. Rename the pivoted fields to designate each party. 
4. Bring the locked fields back to the table. 
The following code cell performs these steps.
![reformat_table](img/reformat_table.gif "Reformat Table")


In [29]:
# Set an index using mulitple fields, which "locks" these fields before the table pivots
# Use the built-in groupby function for the FIPS and year fields, which you use to group the data by candidate
# Use unstack to perform the table pivot, which will rotate the table and turn rows into columns
df_out = data_df.set_index(['FIPS', 
                            'year', 
                            'county', 
                            'state', 
                            'state_po', 
                            'office', 
                            data_df.groupby(['FIPS', 'year']).cumcount()+1]).unstack()

# Use the indexes for the columns to set column names (Ex: candidate_1, candidate_2, votes_1, votes_2, etc.)
df_out.columns = df_out.columns.map('{0[0]}_{0[1]}'.format)

# Rename columns 
df_out = df_out.rename(columns={"candidate_1": "candidate_dem",
                                "candidatevotes_1": "votes_dem",
                                "candidate_2": "candidate_gop",
                                "candidatevotes_2": "votes_gop",
                                "totalvotes_1": "votes_total",
                                "state_po": "state_abbrev"
                                })

# Keep only the necessary columns
df_out = df_out[["candidate_dem", "votes_dem",
                 "candidate_gop", "votes_gop",
                 "votes_total"]]

# Remove the multiindex since we no longer need these fields to be "locked" for the pivot
df_out.reset_index(inplace=True)

# Print out the first few records to confirm everything worked
df_out.head()

Unnamed: 0,FIPS,year,county,state,state_po,office,candidate_dem,votes_dem,candidate_gop,votes_gop,votes_total
0,10001.0,2016,Kent,Delaware,DE,President,Hillary Clinton,33351.0,Donald Trump,36991.0,74598
1,10003.0,2016,New Castle,Delaware,DE,President,Hillary Clinton,162919.0,Donald Trump,85525.0,262391
2,10005.0,2016,Sussex,Delaware,DE,President,Hillary Clinton,39333.0,Donald Trump,62611.0,106008
3,1001.0,2016,Autauga,Alabama,AL,President,Hillary Clinton,5936.0,Donald Trump,18172.0,24973
4,1003.0,2016,Baldwin,Alabama,AL,President,Hillary Clinton,18458.0,Donald Trump,72883.0,95215


In [36]:
data_df.head()

Unnamed: 0,year,state,state_po,county,FIPS,office,candidate,party,candidatevotes,totalvotes,version
0,2016,Alabama,AL,Autauga,1001.0,President,Hillary Clinton,democrat,5936.0,24973,20190722
1,2016,Alabama,AL,Autauga,1001.0,President,Donald Trump,republican,18172.0,24973,20190722
2,2016,Alabama,AL,Autauga,1001.0,President,Other,not recorded,865.0,24973,20190722
3,2016,Alabama,AL,Baldwin,1003.0,President,Hillary Clinton,democrat,18458.0,95215,20190722
4,2016,Alabama,AL,Baldwin,1003.0,President,Donald Trump,republican,72883.0,95215,20190722


In [31]:
df_out['county'].nunique()

1850

In [32]:
df_out.shape

(3154, 11)

In [33]:
df_out['FIPS'].nunique()

3154

In [None]:
df_out.shape, data_df.shape

In [35]:
df_out.sample(10)

Unnamed: 0,FIPS,year,county,state,state_po,office,candidate_dem,votes_dem,candidate_gop,votes_gop,votes_total
748,21011.0,2016,Bath,Kentucky,KY,President,Hillary Clinton,1361.0,Donald Trump,3082.0,4587
960,24037.0,2016,St. Mary's,Maryland,MD,President,Hillary Clinton,17534.0,Donald Trump,28663.0,49842
936,23019.0,2016,Penobscot,Maine,ME,President,Hillary Clinton,32838.0,Donald Trump,41622.0,80540
2980,55101.0,2016,Racine,Wisconsin,WI,President,Hillary Clinton,42512.0,Donald Trump,46611.0,94133
782,21079.0,2016,Garrard,Kentucky,KY,President,Hillary Clinton,1453.0,Donald Trump,5904.0,7623
407,17125.0,2016,Mason,Illinois,IL,President,Hillary Clinton,2014.0,Donald Trump,4058.0,6486
1969,40087.0,2016,McClain,Oklahoma,OK,President,Hillary Clinton,2894.0,Donald Trump,13169.0,16858
1786,38011.0,2016,Bowman,North Dakota,ND,President,Hillary Clinton,227.0,Donald Trump,1446.0,1787
36,1067.0,2016,Henry,Alabama,AL,President,Hillary Clinton,2292.0,Donald Trump,5632.0,8072
2883,54019.0,2016,Fayette,West Virginia,WV,President,Hillary Clinton,4290.0,Donald Trump,10357.0,15337


Pandas has three powerful capabilities that helped you perform this operation: 
- The ability to set an index using multiple fields, which acts as our "locking" mechanism. 
- The ability to unstack (or pivot) a table.
- The ability to perform an operation using a "groupby" function.

## Calculate additional columns

You will use the values from the updated table to add additional columns of information, such as the number of votes for a non major party, the percentage of voters for each party, and so on. Each column is referred to as an attribute of the dataset.

##### Calculate an attribute for the total votes for non major party

In [None]:
# Calculate votes that did not choose the Democrat or Republican party
df_out['votes_other'] = df_out['votes_total'] - (df_out['votes_dem'] + df_out['votes_gop'])
df_out.head()

##### Calculate additional attributes

In [None]:
# Calculate voter share attributes
df_out['voter_share_major_party'] = (df_out['votes_dem'] + df_out['votes_gop']) / df_out['votes_total']
df_out['voter_share_dem'] = df_out['votes_dem'] / df_out['votes_total']
df_out['voter_share_gop'] = df_out['votes_gop'] / df_out['votes_total']
df_out['voter_share_other'] = df_out['votes_other'] / df_out['votes_total']

# Calculate raw difference attributes
df_out['rawdiff_dem_vs_gop'] = df_out['votes_dem'] - df_out['votes_gop']
df_out['rawdiff_gop_vs_dem'] = df_out['votes_gop'] - df_out['votes_dem']
df_out['rawdiff_dem_vs_other'] = df_out['votes_dem'] - df_out['votes_other']
df_out['rawdiff_gop_vs_other'] = df_out['votes_gop'] - df_out['votes_other']
df_out['rawdiff_other_vs_dem'] = df_out['votes_other'] - df_out['votes_dem']
df_out['rawdiff_other_vs_gop'] = df_out['votes_other'] - df_out['votes_gop']

# Calculate percent difference attributes
df_out['pctdiff_dem_vs_gop'] = (df_out['votes_dem'] - df_out['votes_gop']) / df_out['votes_total']
df_out['pctdiff_gop_vs_dem'] = (df_out['votes_gop'] - df_out['votes_dem']) / df_out['votes_total']
df_out['pctdiff_dem_vs_other'] = (df_out['votes_dem'] - df_out['votes_other']) / df_out['votes_total']
df_out['pctdiff_gop_vs_other'] = (df_out['votes_gop'] - df_out['votes_other']) / df_out['votes_total']
df_out['pctdiff_other_vs_dem'] = (df_out['votes_other'] - df_out['votes_dem']) / df_out['votes_total']
df_out['pctdiff_other_vs_gop'] = (df_out['votes_other'] - df_out['votes_gop']) / df_out['votes_total']

df_out.head()

***

## Geoenable the data

You will eventually use this data in a spatial analysis. This means that the data needs to include location information to determine where the data is located on a map. You will geoenable the data, or add location to the data, using existing geoenabled county data.

##### Define the ArcGIS Pro project, database, and existing geoenabled data

In [None]:
# Create variables that represent the ArcGIS Pro project and map
aprx = arcpy.mp.ArcGISProject("CURRENT")
mp = aprx.listMaps('Data Engineering')[0]

# Create a variable that represents the default file geodatabase
fgdb = r"Data Engineering and Visualization.gdb"
aprx.defaultGeodatabase = fgdb
arcpy.env.workspace = fgdb

There are various resources that you can use to find geoenabled data. [ArcGIS Living Atlas of the World](https://livingatlas.arcgis.com) is an authoritative source provided by Esri. Each record in your election data represents information for a county, so you will use a Living Atlas dataset that represents county geometry. This dataset has been downloaded and added to your project.

In [None]:
# Create a variable that represents the county geometry dataset
counties_fc_name = "Counties_2016_VotingAgePopulation"
counties_fc = os.path.join(fgdb, counties_fc_name)

**Note: Executing the following cell may take a few minutes.**

In [None]:
# Load the dataset into a spatially-enabled dataframe
counties_df = pd.DataFrame.spatial.from_featureclass(counties_fc)
counties_df.head()

The county geometry dataset includes various attributes. You will simplify the dataframe to only include the attributes that you need. The Total_cvap_est attribute represents the total population in each county that are of voting age for the year 2016.

In [None]:
# Modify the dataframe to only include the attributes that are needed
counties_df = counties_df[['OBJECTID', 'GEOID', 'GEONAME',
                           'Total_cvap_est',
                           'SHAPE', 'Shape__Area', 'Shape__Length']]
counties_df.head()

***

## Join the data

You have a dataframe with election data ('df_out') and a spatially-enabled dataframe of the county geometry data ('counties_df'). You will merge these datasets into one. 

In [None]:
# Join the election dataframe with the county geometry dataframe
geo_df = pd.merge(df_out, counties_df, left_on='FIPS', right_on="GEOID", how='left')

# Visualize the merged data
geo_df.head()

The resulting dataframe includes the attributes from your election data and the specified attributes from the county geometry data. The SHAPE field represents the county geometry and is used to locate each record, or feature, on the map.

***

## Query and calculate attributes

Because you have the voting age population for 2016, you can now calculate the average voter participation (voter turnout) for 2016. The dataframe includes records from 2010-2016 but only has voting age population for 2016. You will need to create a subset dataframe for 2016 before calculating the voter turnout.

In [None]:
# Create a copy of the data, and perform a query
data_2016_df = geo_df.copy()
data_2016_df.query("year == '2016'", inplace=True)
data_2016_df.head()

You will calculate a new field named voter turnout using field operators in Pandas. The operations will apply to all values across the columns. 

In [None]:
# Calculate voter turnout attributes
data_2016_df['voter_turnout'] = data_2016_df['votes_total'] / data_2016_df['Total_cvap_est']
data_2016_df['voter_turnout_majparty'] = (data_2016_df['votes_dem']+data_2016_df['votes_gop']) / data_2016_df['Total_cvap_est']
data_2016_df['voter_turnout_dem'] = data_2016_df['votes_dem'] / data_2016_df['Total_cvap_est']
data_2016_df['voter_turnout_gop'] = data_2016_df['votes_gop'] / data_2016_df['Total_cvap_est']
data_2016_df['voter_turnout_other'] = data_2016_df['votes_other'] / data_2016_df['Total_cvap_est']
data_2016_df.head()

***

## Validate the data

Before continuing with other data preparation, you should confirm that the output data has been successfully created. 

First, you will validate the values for voter turnout. You will remove null values, and because these values represent a fraction (total votes divided by voting age population), you will confirm that the values range between 0 and 1.

In [None]:
# Check for null values
data_2016_df.loc[data_2016_df['voter_turnout'].isnull()]

In [None]:
# Remove records with no voter turnout value
data_2016_df = data_2016_df.loc[data_2016_df['voter_turnout'].notnull()]

In [None]:
# Run a describe to get the distribution of voter turnout values
data_2016_df['voter_turnout'].describe()

The describe function indicates that there are voter turnout values over one, indicating a voter turnout above 100%. You will further investigate by querying for these records.

In [None]:
# Perform query for voter turnout above 100%
data_2016_df.loc[data_2016_df['voter_turnout'] > 1]

There are four counties with very low population that resulted in voter turnout values above 100%. You could remove these records from the data or do additional research to identify the source of this issue. 

***

## Update validated data

After reviewing the Census Bureau voting age population data for 2016, you determined that these counties have a low voting age population with a fairly high margin of error. This may be the reason why these counties have a voter turnout rate higher than 100%. You will recalculate the voter turnout field for these counties using the upper range of their margin of error: 
- San Juan County, Colorado: 574
- Harding County, New Mexico: 562
- Loving County, Texas: 86
- McMullen County, Texas: 566

**Note: This information was extracted from this [table](https://data.census.gov/cedsci/table?q=voting%20age%20population%202016&g=0500000US08111,35021,48301,48311&hidePreview=true&table=DP05&tid=ACSDP5Y2016.DP05&t=Age%20and%20Sex&y=2016&lastDisplayedRow=6&vintage=2016&mode=&moe=true).**

In [None]:
# Correct each county
data_2016_df.loc[data_2016_df['FIPS'] == "08111", "Total_cvap_est"] = 574
data_2016_df.loc[data_2016_df['FIPS'] == "35021", "Total_cvap_est"] = 562
data_2016_df.loc[data_2016_df['FIPS'] == "48301", "Total_cvap_est"] = 86
data_2016_df.loc[data_2016_df['FIPS'] == "48311", "Total_cvap_est"] = 566

In [None]:
# Recalculate voter turnout fields
data_2016_df['voter_turnout'] = data_2016_df['votes_total'] / data_2016_df['Total_cvap_est']
data_2016_df['voter_turnout_majparty'] = (data_2016_df['votes_dem']+data_2016_df['votes_gop']) / data_2016_df['Total_cvap_est']
data_2016_df['voter_turnout_dem'] = data_2016_df['votes_dem'] / data_2016_df['Total_cvap_est']
data_2016_df['voter_turnout_gop'] = data_2016_df['votes_gop'] / data_2016_df['Total_cvap_est']
data_2016_df['voter_turnout_other'] = data_2016_df['votes_other'] / data_2016_df['Total_cvap_est']

To confirm that this correction addressed the issue, you will again query for counties with a voter turnout value above 100%.

In [None]:
data_2016_df.loc[data_2016_df['voter_turnout'] > 1]

No records are returned, indicating that there are no counties with a turnout value above 100%. Well done! You have cleaned the data. Next, you will convert the dataframe to a permanent dataset called a feature class. Feature classes are stored in an ArcGIS Pro file geodatabase.

***

## Convert dataframes to feature classes

You will use the ArcGIS API for Python, imported at the beginning of this script, to export the spatially-enabled dataframe to a feature class.

**Note: Executing the following cell may take a few minutes**

In [None]:
# Create a feature class for the 2016 presidential election 
out_2016_fc_name = "county_elections_pres_2016"
out_2016_fc = data_2016_df.spatial.to_featureclass(os.path.join(fgdb, out_2016_fc_name))
out_2016_fc

1. At the top of the page, click the Data Engineering map tab.

2. Drag the Data Engineering map tab to display as its own window. 

3. Review the feature class that was added to the Data Engineering map.

![DataFrameToFeatureClass](img/DataFrameToFeatureClass.PNG "Map of counties, with missing county")

**Note: The color of the data will vary every time it is added to the map.** 


***

## Correct for missing data

The feature class is missing a county in South Dakota. You will correct this issue by further exploring the data.

1. In Catalog pane, expand Databases, and then Data Engineering and Visualization.gdb.
2. Right-click Counties_2016_VotingAgePopulation and choose Add To Current Map.
3. In the Contents pane, drag Counties_2016_VotingAgePopulation under county_elections_pres_2016.
4. Open the Data Engineering tab.
5. On the map, click the missing county.

![missing county](img/missing_county_view.PNG "Pop-up window for Oglala Lakota County")

The county geometry dataset identifies the missing county as Oglala Lakota County. By searching online for this county, you determine that Oglala Lakota County changed its county name and FIPS in 2015. It was originally Shannon County with a FIPS of 46113 and is now Oglala Lakota County with a FIPS of 46102. You will search the election data for the current FIPS to try to find the missing data.

In [None]:
# Perform query for county FIPS 46102
df_out.loc[df_out['FIPS'] == '46102']

There are no records returned, which indicates that the election data does not have the correct FIPS for this county. You will check for the old FIPS value, when it was named Shannon County.

In [None]:
df_out.loc[df_out['FIPS'] == '46113']

There is the issue! The data has the correct name (Oglala Lakota) but the wrong FIPS (46113). You will correct this data issue.

In [None]:
df_out.loc[df_out['FIPS'] == '46113', 'FIPS'] = "46102"
df_out.loc[df_out['FIPS'] == '46102']

With the corrected FIPS value for Oglala Lakota County, you can now rejoin the geometry, recalculate the voting turnout field, and recreate the feature class. 

**Note: Executing the following cell may take a few minutes.**

In [None]:
# Join the county geometry data to the updated election data table
geo_df = pd.merge(df_out, counties_df, left_on='FIPS', right_on="GEOID", how='left')

# Create a copy of the data that only includes records from 2016
data_2016_df = geo_df.copy()
data_2016_df.query("year == '2016'", inplace=True)
data_2016_df.head()

# Correct counties with low population
data_2016_df.loc[data_2016_df['FIPS'] == "08111", "Total_cvap_est"] = 574
data_2016_df.loc[data_2016_df['FIPS'] == "35021", "Total_cvap_est"] = 562
data_2016_df.loc[data_2016_df['FIPS'] == "48301", "Total_cvap_est"] = 86
data_2016_df.loc[data_2016_df['FIPS'] == "48311", "Total_cvap_est"] = 566

# Calculate voter turnout
data_2016_df['voter_turnout'] = data_2016_df['votes_total'] / data_2016_df['Total_cvap_est']
data_2016_df['voter_turnout_majparty'] = (data_2016_df['votes_dem']+data_2016_df['votes_gop']) / data_2016_df['Total_cvap_est']
data_2016_df['voter_turnout_dem'] = data_2016_df['votes_dem'] / data_2016_df['Total_cvap_est']
data_2016_df['voter_turnout_gop'] = data_2016_df['votes_gop'] / data_2016_df['Total_cvap_est']
data_2016_df['voter_turnout_other'] = data_2016_df['votes_other'] / data_2016_df['Total_cvap_est']

# Remove records with no voter turnout value
data_2016_df = data_2016_df.loc[data_2016_df['voter_turnout'].notnull()]

You will export the dataframe to a feature class that you can visualize and analyze in ArcGIS Pro. 

**Note: Executing the following cell may take a few minutes.**

In [None]:
# Create a feature class for the 2016 election and voter turnout data
out_2016_fc_name = "county_elections_pres_2016_final"
out_2016_fc = data_2016_df.spatial.to_featureclass(os.path.join(fgdb, out_2016_fc_name))

You have prepared this data for a predictive analysis that will model voter turnout using demographic variables, such as per capita income. In the next step, you will use ArcGIS Pro to geoenrich your feature class with these demographic variables. 

Open the Perform data engineering tasks exercise PDF and refer to the Open the Enrich tool step for the remaining instructions.

***