# Data engineering with Dask

This notebook describes the process to download and prepare United States presidential election data. You will address missing values, reformat data types, and restructure the format of a table.

***

## Load and prepare data

To download and prepare the election data, you will use ArcPy, the ArcGIS API for Python, pandas, and a Dask dataframe. First, you will import these modules. Then, you will read the United States county election data into a Dask dataframe.

##### Import needed modules

In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)


import arcgis
import pandas as pd
import dask.dataframe as dd
import os
import arcpy

##### Read data into Python

In [2]:
dask_df = dd.read_csv("countypres2016.csv", assume_missing=True)

Reading this file without extra arguments usually causes a dtype inference failure. Dask infers dtypes from a sample at the start of the file, so a numeric column that looks like integers (int64) in the sample fails later when missing values appear. This can be fixed by specifying the dtype manually when reading the data, or by passing `assume_missing=True` to interpret all unspecified integer columns as floats.
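
As a minimal sketch of the dtype issue described above, the snippet below uses pandas (whose `read_csv` Dask relies on) with a small in-memory CSV. The sample rows and the choice to read `FIPS` as text are illustrative assumptions, not part of the original workflow.

```python
import io
import pandas as pd

# Hypothetical two-row sample; the second row has missing numeric values
csv_text = (
    "county,FIPS,candidatevotes\n"
    "Autauga,01001,5936\n"
    "Statewide writein,,\n"
)

# Without a dtype hint, the missing values force both numeric columns to float64,
# and the leading zero in FIPS is lost (01001 is read as 1001.0)
inferred = pd.read_csv(io.StringIO(csv_text))

# With explicit dtypes, FIPS is kept as text and keeps its leading zero
explicit = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"FIPS": "object", "candidatevotes": "float64"},
)

print(inferred.dtypes)
print(explicit["FIPS"].tolist())
```

`dd.read_csv` accepts the same `dtype` keyword, so an explicit mapping like this can be passed there as an alternative to `assume_missing=True`.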

***

## Cleaning the data 

##### Exploratory Data Analysis

In [4]:
### Getting an overview of the data
dask_df.head()

Unnamed: 0,year,state,state_po,county,FIPS,office,candidate,party,candidatevotes,totalvotes,version
0,2016.0,Alabama,AL,Autauga,1001.0,President,Hillary Clinton,democrat,5936.0,24973.0,20190722.0
1,2016.0,Alabama,AL,Autauga,1001.0,President,Donald Trump,republican,18172.0,24973.0,20190722.0
2,2016.0,Alabama,AL,Autauga,1001.0,President,Other,,865.0,24973.0,20190722.0
3,2016.0,Alabama,AL,Baldwin,1003.0,President,Hillary Clinton,democrat,18458.0,95215.0,20190722.0
4,2016.0,Alabama,AL,Baldwin,1003.0,President,Donald Trump,republican,72883.0,95215.0,20190722.0


In [5]:
# Get an overview of the data type (dtype) of each feature and spot features with missing values via the 'Non-Null Count'
dask_df.compute().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9474 entries, 0 to 9473
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   year            9474 non-null   float64
 1   state           9474 non-null   object 
 2   state_po        9462 non-null   object 
 3   county          9474 non-null   object 
 4   FIPS            9462 non-null   float64
 5   office          9474 non-null   object 
 6   candidate       9474 non-null   object 
 7   party           6316 non-null   object 
 8   candidatevotes  9468 non-null   float64
 9   totalvotes      9474 non-null   float64
 10  version         9474 non-null   float64
dtypes: float64(5), object(6)
memory usage: 814.3+ KB


Because `assume_missing=True` was passed when reading the file, Dask reads all numeric columns as floats (float64). Non-numeric columns are stored as objects.

#### Dropping redundant features

From the preview of the dataset above, 'state_po' is simply the postal abbreviation of the 'state' feature. To make the data cleaner, we remove this redundant feature.

In [6]:
# dask operation
dask_df = dask_df.drop('state_po', axis=1)

In [7]:
dask_df.head()

Unnamed: 0,year,state,county,FIPS,office,candidate,party,candidatevotes,totalvotes,version
0,2016.0,Alabama,Autauga,1001.0,President,Hillary Clinton,democrat,5936.0,24973.0,20190722.0
1,2016.0,Alabama,Autauga,1001.0,President,Donald Trump,republican,18172.0,24973.0,20190722.0
2,2016.0,Alabama,Autauga,1001.0,President,Other,,865.0,24973.0,20190722.0
3,2016.0,Alabama,Baldwin,1003.0,President,Hillary Clinton,democrat,18458.0,95215.0,20190722.0
4,2016.0,Alabama,Baldwin,1003.0,President,Donald Trump,republican,72883.0,95215.0,20190722.0


#### Handle missing data 

In [8]:
dask_df.isnull().sum().compute()

year                 0
state                0
county               0
FIPS                12
office               0
candidate            0
party             3158
candidatevotes       6
totalvotes           0
version              0
dtype: int64

The election data includes records that are missing data in the **FIPS, party, and candidatevotes** fields. This missing data is referred to as null values. Once the features with missing values are identified, there are two ways to handle them:
- Fill them with a value
- Remove the affected records from the dataset
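
The two options can be sketched as follows. This is a small pandas illustration on made-up rows (the Bedford value is deliberately missing). Dask dataframes offer `fillna` and `dropna` methods with the same names.

```python
import pandas as pd

# Illustrative sample with one missing vote count
votes = pd.DataFrame({
    "county": ["Autauga", "Bedford", "Baldwin"],
    "candidatevotes": [5936.0, None, 18458.0],
})

# Option 1: fill the missing value, here with the mean of the column
filled = votes.fillna({"candidatevotes": votes["candidatevotes"].mean()})

# Option 2: remove every record whose vote count is missing
dropped = votes.dropna(subset=["candidatevotes"])

print(filled)
print(dropped)
```

Filling keeps all three records, while dropping leaves two. Which option is preferable depends on how many records are affected and how they will be used later.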

##### Let's investigate the features with missing values by running queries on them with the Dask `query` method

In [9]:
missing_query = dask_df.query('(FIPS == "NaN") | (candidatevotes == "NaN") ').compute()
missing_query

Unnamed: 0,year,state,county,FIPS,office,candidate,party,candidatevotes,totalvotes,version
8781,2016.0,Virginia,Bedford,51515.0,President,Hillary Clinton,democrat,,0.0,20190722.0
8782,2016.0,Virginia,Bedford,51515.0,President,Donald Trump,republican,,0.0,20190722.0
8783,2016.0,Virginia,Bedford,51515.0,President,Other,,,0.0,20190722.0
9462,2016.0,Connecticut,Statewide writein,,President,Hillary Clinton,democrat,,5056.0,20190722.0
9463,2016.0,Maine,Maine UOCAVA,,President,Hillary Clinton,democrat,3017.0,5056.0,20190722.0
9464,2016.0,Alaska,District 99,,President,Hillary Clinton,democrat,274.0,5056.0,20190722.0
9465,2016.0,Rhode Island,Federal Precinct,,President,Hillary Clinton,democrat,637.0,5056.0,20190722.0
9466,2016.0,Connecticut,Statewide writein,,President,Donald Trump,republican,,5056.0,20190722.0
9467,2016.0,Maine,Maine UOCAVA,,President,Donald Trump,republican,648.0,5056.0,20190722.0
9468,2016.0,Alaska,District 99,,President,Donald Trump,republican,40.0,5056.0,20190722.0
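
A boolean mask built with `isnull` is another way to select records like these. The sketch below uses pandas and a three-row sample modeled on the output above. The variable names are illustrative, and Dask series provide an `isnull` method as well.

```python
import pandas as pd

# Three rows modeled on the preview: one complete, two with a missing value
sample = pd.DataFrame({
    "county": ["Autauga", "Bedford", "Maine UOCAVA"],
    "FIPS": [1001.0, 51515.0, float("nan")],
    "candidatevotes": [5936.0, float("nan"), 3017.0],
})

# isnull() marks missing entries; "|" keeps rows where either column is missing
missing_mask = sample["FIPS"].isnull() | sample["candidatevotes"].isnull()
missing_rows = sample[missing_mask]

print(missing_rows)
```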


The strategy we will employ here is to replace the missing values with a valid, representative value.

This can be achieved on a Dask dataframe with the `fillna` method.

The 'FIPS' and 'candidatevotes' features are both numerical. For numerical data, either the mean or the median is a reasonable representative of a feature's central tendency. In this case, we will fill the missing values with the mean of each feature. Note that FIPS is an identifier code rather than a measurement, so a mean-filled FIPS value may not correspond to a real county.
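
To illustrate the difference between the two candidates, the sketch below computes the mean and the median of five vote counts taken from the tables in this notebook. It uses only the standard library and is meant as an illustration, not as part of the pipeline.

```python
from statistics import mean, median

# Five candidate vote counts that appear in the previews in this notebook
votes = [865.0, 3874.0, 5936.0, 18172.0, 72883.0]

votes_mean = mean(votes)      # pulled upward by the single large value
votes_median = median(votes)  # the middle value of the sorted list

print(votes_mean)
print(votes_median)
```

When a feature is strongly skewed, as vote counts across counties tend to be, the mean and the median can differ widely, so the choice of fill value is worth a moment's thought.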

In [10]:
# Filling the missing values with the mean
dask_df["FIPS"] = dask_df["FIPS"].fillna(dask_df["FIPS"].mean().compute())
dask_df["candidatevotes"] = dask_df["candidatevotes"].fillna(dask_df["candidatevotes"].mean().compute())

In [11]:
dask_df.isnull().sum().compute()

year                 0
state                0
county               0
FIPS                 0
office               0
candidate            0
party             3158
candidatevotes       0
totalvotes           0
version              0
dtype: int64

We are left with missing values in the 'party' feature. The number of missing values is large (3158 records), so the choice of fill value matters. Let's get an overview of the unique values in the feature.

In [12]:
dask_df['party'].unique().compute()

0      democrat
1    republican
2           NaN
Name: party, dtype: object

As seen above, this feature records the party of each candidate in the election. To avoid biasing the dataset toward either party, we will fill the missing values with 'Others'.

In [13]:
# Filling the missing values with 'Others'
dask_df["party"] = dask_df["party"].fillna('Others')

In [14]:
dask_df.isnull().sum().compute()

year              0
state             0
county            0
FIPS              0
office            0
candidate         0
party             0
candidatevotes    0
totalvotes        0
version           0
dtype: int64

***

## Explore and handle data types

In reviewing your data, you notice that the `FIPS` field is considered a numeric field instead of a string. As a result, leading zeroes in the FIPS values have been removed. The resulting FIPS values only have four characters instead of five. You will determine how many records are missing leading zeroes and add, or prepend, the missing zero.
![fix_truncated_zeroes](img/trunc_zeroes.gif "Fix Truncated Zeroes")

Also, fields like `year` should be integers rather than floats.

In [15]:
# First change the 'FIPS' field to integer to remove the decimals
dask_df['FIPS'] = dask_df['FIPS'].astype('int64')
# Then change it to string so that string methods such as .str.len() work on it
dask_df['FIPS'] = dask_df['FIPS'].astype(str)

# Change the 'year' field to integer
dask_df['year'] = dask_df['year'].astype('int64')
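
The two-step conversion can be checked on a small example. The sketch below uses a pandas series with two FIPS values from the data. It is an illustration of the idea, not the notebook's own cell.

```python
import pandas as pd

# Two FIPS values as they look after being read as floats
fips = pd.Series([1001.0, 51515.0], name="FIPS")

# float -> int drops the decimal part; int -> str makes string methods usable
fips_text = fips.astype("int64").astype(str)

# Count the characters of each value to spot codes that lost a leading zero
lengths = fips_text.str.len()

print(fips_text.tolist())
print(lengths.tolist())
```

A length of four signals a code that has lost its leading zero.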

In [16]:
# Check how many records have a FIPS value with four characters
trunc_df = dask_df.loc[dask_df['FIPS'].str.len() == 4]
# len() triggers the computation and returns a plain integer row count
trunc_data_per = (len(trunc_df) / len(dask_df))*100

The following cell creates a Python function that adds a leading zero to the FIPS value if it only has four characters.

In [17]:
# Define a helper function to fix truncated zeros, with one parameter: the value to be processed
def fix_trunc_zeros(val):
    # Use an if statement to check if there are four characters in the string representation of the value
    if len(str(val)) == 4:
        # If this is the case, return the value with a "0" prepended
        return "0"+str(val)
    # Otherwise...
    else:
        # Return the value itself
        return str(val)

In [18]:
# Test the function
fix_trunc_zeros(7042)  # You should see a prepended zero: "07042"

'07042'
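
As an alternative sketch, Python's built-in `str.zfill` pads a string with leading zeros up to a given width. The helper name `pad_fips` is hypothetical. Unlike `fix_trunc_zeros`, it pads any value shorter than five characters, not only four-character ones.

```python
def pad_fips(val, width=5):
    """Left-pad the string form of val with zeros up to width characters."""
    return str(val).zfill(width)


print(pad_fips(7042))   # 07042
print(pad_fips(51515))  # 51515
print(pad_fips(42))     # 00042
```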

In [19]:
# Run the function on the FIPS field using the apply method
dask_df['FIPS'] = dask_df['FIPS'].apply(fix_trunc_zeros, meta=('FIPS', 'object'))
# The meta argument tells Dask the output name and dtype so that it does not have to guess them

# Show the first few records to confirm it worked
dask_df.head()

Unnamed: 0,year,state,county,FIPS,office,candidate,party,candidatevotes,totalvotes,version
0,2016,Alabama,Autauga,1001,President,Hillary Clinton,democrat,5936.0,24973.0,20190722.0
1,2016,Alabama,Autauga,1001,President,Donald Trump,republican,18172.0,24973.0,20190722.0
2,2016,Alabama,Autauga,1001,President,Other,Others,865.0,24973.0,20190722.0
3,2016,Alabama,Baldwin,1003,President,Hillary Clinton,democrat,18458.0,95215.0,20190722.0
4,2016,Alabama,Baldwin,1003,President,Donald Trump,republican,72883.0,95215.0,20190722.0


***

## Reformat the table structure

Currently, each record in the table corresponds to a candidate and their votes in a county. You need to reformat the table so that each record corresponds to each county, with fields showing the votes for different candidates in that election year. 
It is possible to do this using the [Pivot Table geoprocessing tool](https://pro.arcgis.com/en/pro-app/tool-reference/data-management/pivot-table.htm) or Excel pivot tables, but Python may make it easier to automate and share.
The animation below illustrates the steps in restructuring the table:

![reformat_table](img/reformat_table.gif "Reformat Table")

The following code cells perform these steps.


In [20]:
c = dask_df["county"].unique().compute()
county = dict((i,dict()) for i in list(c))

A new dataframe could in principle be created with `dd.DataFrame()`, but Dask advises against using this class directly. Instead, use functions like
``dd.read_csv``, ``dd.read_parquet``, or ``dd.from_pandas``.
So, we will build the new dataframe in pandas and then convert it to a Dask dataframe.

In [21]:
# Compute the Dask dataframe once, before the loop, rather than on every iteration
df = dask_df.compute()

for row in range(len(df)):
    
    c = df.loc[row,"county"]
    s = df.loc[row,"state"]
    f = df.loc[row,"FIPS"]
    
    can_nm = df.loc[row, "candidate"]
    party =  df.loc[row, "party"]
    votes =  df.loc[row, "candidatevotes"]
    
    if f not in county[c].keys():
        county[c][f] = {}
        
    county[c][f]['county'] = c
    county[c][f]["fips"] = f
    county[c][f][f"candidate({party.strip()[0]})"] = can_nm
    county[c][f][f"votes ({party.strip()[0]})"] = votes

In [22]:
data = []
# Flatten the nested {county: {fips: record}} dictionary into a list of records
for fips_records in county.values():
    for record in fips_records.values():
        data.append(record)

In [23]:
dt = pd.DataFrame(data)
df = dd.from_pandas(dt,npartitions=1)

In [24]:
df.head()

Unnamed: 0,county,fips,candidate(d),votes (d),candidate(r),votes (r),candidate(O),votes (O)
0,Autauga,1001,Hillary Clinton,5936.0,Donald Trump,18172.0,Other,865.0
1,Baldwin,1003,Hillary Clinton,18458.0,Donald Trump,72883.0,Other,3874.0
2,Baldwin,13009,Hillary Clinton,7970.0,Donald Trump,7697.0,Other,449.0
3,Barbour,1005,Hillary Clinton,4871.0,Donald Trump,5454.0,Other,144.0
4,Barbour,54001,Hillary Clinton,1222.0,Donald Trump,4527.0,Other,305.0
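
The same long-to-wide reshaping can also be sketched with pandas `pivot_table`. This is an alternative illustration on six rows from the preview, not the method used in the cells above. The output column names differ from the notebook's `votes (d)` style, and the five-character FIPS strings are assumed.

```python
import pandas as pd

# Long format: one row per county and party (values taken from the preview)
long_df = pd.DataFrame({
    "county": ["Autauga"] * 3 + ["Baldwin"] * 3,
    "FIPS": ["01001"] * 3 + ["01003"] * 3,
    "party": ["democrat", "republican", "Others"] * 2,
    "candidatevotes": [5936.0, 18172.0, 865.0, 18458.0, 72883.0, 3874.0],
})

# Wide format: one row per county, one vote column per party
wide_df = long_df.pivot_table(
    index=["county", "FIPS"],
    columns="party",
    values="candidatevotes",
    aggfunc="sum",
).reset_index()
wide_df.columns.name = None  # drop the leftover "party" axis label

print(wide_df)
```

Grouping on both `county` and `FIPS` keeps counties that share a name in different states, such as the two Baldwin entries above, on separate rows.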


***

## Calculate additional columns: Feature Engineering

Here, we will use the values from the updated table to add columns of information, such as the number of votes for a non-major party, the percentage of voters for each party, and so on. Each column is referred to as an attribute of the dataset.

##### Calculate an attribute for the total votes

In [25]:
df['votes_total'] = df['votes (d)'] + df['votes (r)'] + df['votes (O)']

In [26]:
df.head()

Unnamed: 0,county,fips,candidate(d),votes (d),candidate(r),votes (r),candidate(O),votes (O),votes_total
0,Autauga,1001,Hillary Clinton,5936.0,Donald Trump,18172.0,Other,865.0,24973.0
1,Baldwin,1003,Hillary Clinton,18458.0,Donald Trump,72883.0,Other,3874.0,95215.0
2,Baldwin,13009,Hillary Clinton,7970.0,Donald Trump,7697.0,Other,449.0,16116.0
3,Barbour,1005,Hillary Clinton,4871.0,Donald Trump,5454.0,Other,144.0,10469.0
4,Barbour,54001,Hillary Clinton,1222.0,Donald Trump,4527.0,Other,305.0,6054.0


##### Calculate additional attributes

In [27]:
# Calculate voter share attributes
df['voter_share_major_party'] = (df['votes (d)'] + df['votes (r)']) / df['votes_total']
df['voter_share_dem'] = df['votes (d)'] / df['votes_total']
df['voter_share_rep'] = df['votes (r)'] / df['votes_total']
df['voter_share_other'] = df['votes (O)'] / df['votes_total']

# Calculate raw difference attributes
df['rawdiff_dem_vs_rep'] = df['votes (d)'] - df['votes (r)']
df['rawdiff_rep_vs_dem'] = df['votes (r)'] - df['votes (d)']
df['rawdiff_dem_vs_other'] = df['votes (d)'] - df['votes (O)']
df['rawdiff_rep_vs_other'] = df['votes (r)'] - df['votes (O)']
df['rawdiff_other_vs_dem'] = df['votes (O)'] - df['votes (d)']
df['rawdiff_other_vs_rep'] = df['votes (O)'] - df['votes (r)']

# Calculate percent difference attributes
df['pctdiff_dem_vs_rep'] = (df['votes (d)'] - df['votes (r)']) / df['votes_total']
df['pctdiff_rep_vs_dem'] = (df['votes (r)'] - df['votes (d)']) / df['votes_total']
df['pctdiff_dem_vs_other'] = (df['votes (d)'] - df['votes (O)']) / df['votes_total']
df['pctdiff_rep_vs_other'] = (df['votes (r)'] - df['votes (O)']) / df['votes_total']
df['pctdiff_other_vs_dem'] = (df['votes (O)'] - df['votes (d)']) / df['votes_total']
df['pctdiff_other_vs_rep'] = (df['votes (O)'] - df['votes (r)']) / df['votes_total']

df.head()

Unnamed: 0,county,fips,candidate(d),votes (d),candidate(r),votes (r),candidate(O),votes (O),votes_total,voter_share_major_party,voter_share_dem,voter_share_rep,voter_share_other,rawdiff_dem_vs_rep,rawdiff_rep_vs_dem,rawdiff_dem_vs_other,rawdiff_rep_vs_other,rawdiff_other_vs_dem,rawdiff_other_vs_rep,pctdiff_dem_vs_rep,pctdiff_rep_vs_dem,pctdiff_dem_vs_other,pctdiff_rep_vs_other,pctdiff_other_vs_dem,pctdiff_other_vs_rep
0,Autauga,1001,Hillary Clinton,5936.0,Donald Trump,18172.0,Other,865.0,24973.0,0.965363,0.237697,0.727666,0.034637,-12236.0,12236.0,5071.0,17307.0,-5071.0,-17307.0,-0.489969,0.489969,0.203059,0.693028,-0.203059,-0.693028
1,Baldwin,1003,Hillary Clinton,18458.0,Donald Trump,72883.0,Other,3874.0,95215.0,0.959313,0.193856,0.765457,0.040687,-54425.0,54425.0,14584.0,69009.0,-14584.0,-69009.0,-0.571601,0.571601,0.153169,0.72477,-0.153169,-0.72477
2,Baldwin,13009,Hillary Clinton,7970.0,Donald Trump,7697.0,Other,449.0,16116.0,0.972139,0.49454,0.4776,0.027861,273.0,-273.0,7521.0,7248.0,-7521.0,-7248.0,0.01694,-0.01694,0.466679,0.449739,-0.466679,-0.449739
3,Barbour,1005,Hillary Clinton,4871.0,Donald Trump,5454.0,Other,144.0,10469.0,0.986245,0.465278,0.520967,0.013755,-583.0,583.0,4727.0,5310.0,-4727.0,-5310.0,-0.055688,0.055688,0.451524,0.507212,-0.451524,-0.507212
4,Barbour,54001,Hillary Clinton,1222.0,Donald Trump,4527.0,Other,305.0,6054.0,0.94962,0.20185,0.74777,0.05038,-3305.0,3305.0,917.0,4222.0,-917.0,-4222.0,-0.54592,0.54592,0.15147,0.69739,-0.15147,-0.69739
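
The arithmetic behind these attributes can be checked by hand for a single county. The sketch below recomputes a few of them for Autauga in plain Python. The helper name `vote_shares` and the guard against a zero total are illustrative assumptions, not part of the cells above.

```python
def vote_shares(dem, rep, other):
    """Return vote shares and the dem-vs-rep percent difference for one county."""
    total = dem + rep + other
    if total == 0:
        # Assumed guard: avoid dividing by zero when no votes were recorded
        return None
    return {
        "votes_total": total,
        "voter_share_dem": dem / total,
        "voter_share_rep": rep / total,
        "voter_share_other": other / total,
        "pctdiff_dem_vs_rep": (dem - rep) / total,
    }


# Autauga County values from the table above
autauga = vote_shares(5936.0, 18172.0, 865.0)
for name, value in autauga.items():
    print(name, round(value, 6))
```

The three shares sum to 1, and `pctdiff_dem_vs_rep` is simply the difference between the Democratic and Republican shares.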


***

## Geoenable the data

You will eventually use this data in a spatial analysis. This means that the data needs to include location information to determine where the data is located on a map. You will geoenable the data, or add location to the data, using existing geoenabled county data.

##### Define the ArcGIS Pro project, database, and existing geoenabled data

In [28]:
# Create variables that represent the ArcGIS Pro project and map
aprx = arcpy.mp.ArcGISProject("CURRENT")
mp = aprx.listMaps('Data Engineering')[0]

# Create a variable that represents the default file geodatabase
fgdb = r"Data Engineering and Visualization.gdb"
aprx.defaultGeodatabase = fgdb
arcpy.env.workspace = fgdb

There are various resources that you can use to find geoenabled data. [ArcGIS Living Atlas of the World](https://livingatlas.arcgis.com) is an authoritative source provided by Esri. Each record in your election data represents information for a county, so you will use a Living Atlas dataset that represents county geometry. This dataset has been downloaded and added to your project.

In [29]:
# Create a variable that represents the county geometry dataset
counties_fc_name = "Counties_2016_VotingAgePopulation"
counties_fc = os.path.join(fgdb, counties_fc_name)

**Note: Executing the following cell may take a few minutes.**

In [30]:
# Load the dataset into a spatially-enabled dataframe
counties_df = pd.DataFrame.spatial.from_featureclass(counties_fc)
#counties_df = dd.from_pandas(counties_df,npartitions=1)

The county geometry dataset includes various attributes. You will simplify the dataframe to include only the attributes that you need. The Total_cvap_est attribute represents the total voting-age population of each county for the year 2016.

In [31]:
# Modify the dataframe to only include the attributes that are needed
counties_df = counties_df[['OBJECTID', 'GEOID', 'GEONAME',
                           'Total_cvap_est',
                           'SHAPE', 'Shape__Area', 'Shape__Length']]

counties_df.head()

Unnamed: 0,OBJECTID,GEOID,GEONAME,Total_cvap_est,SHAPE,Shape__Area,Shape__Length
0,1,1001,"Autauga County, Alabama",40690,"{'rings': [[[-9619465, 3856529.0001000017], [-...",2208654000.0,249886.4
1,2,1003,"Baldwin County, Alabama",151770,"{'rings': [[[-9746859, 3539643.0001000017], [-...",5671048000.0,1655940.0
2,3,1005,"Barbour County, Alabama",20375,"{'rings': [[[-9468394, 3771591.0001000017], [-...",3257902000.0,320896.4
3,4,1007,"Bibb County, Alabama",17590,"{'rings': [[[-9692114, 3928124.0001000017], [-...",2311999000.0,227918.4
4,5,1009,"Blount County, Alabama",42430,"{'rings': [[[-9623907, 4063676.0001000017], [-...",2456909000.0,292642.9


***

## Join the data

You have a dataframe with election data ('df') and a spatially-enabled dataframe of the county geometry data ('counties_df'). You will merge these datasets into one. 

In [32]:
counties_df.head()

Unnamed: 0,OBJECTID,GEOID,GEONAME,Total_cvap_est,SHAPE,Shape__Area,Shape__Length
0,1,1001,"Autauga County, Alabama",40690,"{'rings': [[[-9619465, 3856529.0001000017], [-...",2208654000.0,249886.4
1,2,1003,"Baldwin County, Alabama",151770,"{'rings': [[[-9746859, 3539643.0001000017], [-...",5671048000.0,1655940.0
2,3,1005,"Barbour County, Alabama",20375,"{'rings': [[[-9468394, 3771591.0001000017], [-...",3257902000.0,320896.4
3,4,1007,"Bibb County, Alabama",17590,"{'rings': [[[-9692114, 3928124.0001000017], [-...",2311999000.0,227918.4
4,5,1009,"Blount County, Alabama",42430,"{'rings': [[[-9623907, 4063676.0001000017], [-...",2456909000.0,292642.9


In [33]:
df.head()

Unnamed: 0,county,fips,candidate(d),votes (d),candidate(r),votes (r),candidate(O),votes (O),votes_total,voter_share_major_party,voter_share_dem,voter_share_rep,voter_share_other,rawdiff_dem_vs_rep,rawdiff_rep_vs_dem,rawdiff_dem_vs_other,rawdiff_rep_vs_other,rawdiff_other_vs_dem,rawdiff_other_vs_rep,pctdiff_dem_vs_rep,pctdiff_rep_vs_dem,pctdiff_dem_vs_other,pctdiff_rep_vs_other,pctdiff_other_vs_dem,pctdiff_other_vs_rep
0,Autauga,1001,Hillary Clinton,5936.0,Donald Trump,18172.0,Other,865.0,24973.0,0.965363,0.237697,0.727666,0.034637,-12236.0,12236.0,5071.0,17307.0,-5071.0,-17307.0,-0.489969,0.489969,0.203059,0.693028,-0.203059,-0.693028
1,Baldwin,1003,Hillary Clinton,18458.0,Donald Trump,72883.0,Other,3874.0,95215.0,0.959313,0.193856,0.765457,0.040687,-54425.0,54425.0,14584.0,69009.0,-14584.0,-69009.0,-0.571601,0.571601,0.153169,0.72477,-0.153169,-0.72477
2,Baldwin,13009,Hillary Clinton,7970.0,Donald Trump,7697.0,Other,449.0,16116.0,0.972139,0.49454,0.4776,0.027861,273.0,-273.0,7521.0,7248.0,-7521.0,-7248.0,0.01694,-0.01694,0.466679,0.449739,-0.466679,-0.449739
3,Barbour,1005,Hillary Clinton,4871.0,Donald Trump,5454.0,Other,144.0,10469.0,0.986245,0.465278,0.520967,0.013755,-583.0,583.0,4727.0,5310.0,-4727.0,-5310.0,-0.055688,0.055688,0.451524,0.507212,-0.451524,-0.507212
4,Barbour,54001,Hillary Clinton,1222.0,Donald Trump,4527.0,Other,305.0,6054.0,0.94962,0.20185,0.74777,0.05038,-3305.0,3305.0,917.0,4222.0,-917.0,-4222.0,-0.54592,0.54592,0.15147,0.69739,-0.15147,-0.69739


In [36]:
counties_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3220 entries, 0 to 3219
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   OBJECTID        3220 non-null   int64   
 1   fips            3220 non-null   object  
 2   GEONAME         3220 non-null   object  
 3   Total_cvap_est  3220 non-null   int64   
 4   SHAPE           3220 non-null   geometry
 5   Shape__Area     3220 non-null   float64 
 6   Shape__Length   3220 non-null   float64 
dtypes: float64(2), geometry(1), int64(2), object(2)
memory usage: 176.2+ KB


In [37]:
df.compute().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3158 entries, 0 to 3157
Data columns (total 25 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   county                   3158 non-null   object 
 1   fips                     3158 non-null   object 
 2   candidate(d)             3158 non-null   object 
 3   votes (d)                3158 non-null   float64
 4   candidate(r)             3158 non-null   object 
 5   votes (r)                3158 non-null   float64
 6   candidate(O)             3158 non-null   object 
 7   votes (O)                3158 non-null   float64
 8   votes_total              3158 non-null   float64
 9   voter_share_major_party  3158 non-null   float64
 10  voter_share_dem          3158 non-null   float64
 11  voter_share_rep          3158 non-null   float64
 12  voter_share_other        3158 non-null   float64
 13  rawdiff_dem_vs_rep       3158 non-null   float64
 14  rawdiff_rep_vs_dem      

In [42]:
type(df), type(counties_df)

(<class 'dask.dataframe.core.DataFrame'>, <class 'pandas.core.frame.DataFrame'>)

In [90]:
counties_df['GEOID'].unique()

array(['01001', '01003', '01005', ..., '72149', '72151', '72153'],
      dtype=object)

In [94]:
df.columns

Index(['county', 'fips', 'candidate(d)', 'votes (d)', 'candidate(r)',
       'votes (r)', 'candidate(O)', 'votes (O)', 'votes_total',
       'voter_share_major_party', 'voter_share_dem', 'voter_share_rep',
       'voter_share_other', 'rawdiff_dem_vs_rep', 'rawdiff_rep_vs_dem',
       'rawdiff_dem_vs_other', 'rawdiff_rep_vs_other', 'rawdiff_other_vs_dem',
       'rawdiff_other_vs_rep', 'pctdiff_dem_vs_rep', 'pctdiff_rep_vs_dem',
       'pctdiff_dem_vs_other', 'pctdiff_rep_vs_other', 'pctdiff_other_vs_dem',
       'pctdiff_other_vs_rep'],
      dtype='object')

In [34]:
counties_df.columns

Index(['OBJECTID', 'GEOID', 'GEONAME', 'Total_cvap_est', 'SHAPE',
       'Shape__Area', 'Shape__Length'],
      dtype='object')

In [35]:
# rename columns
counties_df = counties_df.rename(columns={'GEOID': 'fips'})
counties_df.head()

Unnamed: 0,OBJECTID,fips,GEONAME,Total_cvap_est,SHAPE,Shape__Area,Shape__Length
0,1,1001,"Autauga County, Alabama",40690,"{'rings': [[[-9619465, 3856529.0001000017], [-...",2208654000.0,249886.4
1,2,1003,"Baldwin County, Alabama",151770,"{'rings': [[[-9746859, 3539643.0001000017], [-...",5671048000.0,1655940.0
2,3,1005,"Barbour County, Alabama",20375,"{'rings': [[[-9468394, 3771591.0001000017], [-...",3257902000.0,320896.4
3,4,1007,"Bibb County, Alabama",17590,"{'rings': [[[-9692114, 3928124.0001000017], [-...",2311999000.0,227918.4
4,5,1009,"Blount County, Alabama",42430,"{'rings': [[[-9623907, 4063676.0001000017], [-...",2456909000.0,292642.9


In [37]:
geo_df = dd.merge(df, counties_df, how='left', on='fips')

# Visualize the merged data
geo_df.head()

TypeError: data type not understood
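
The traceback does not say what triggered the `TypeError`. One possible workaround, assuming (this is not confirmed by the source) that the election table fits in memory and can be converted with `df.compute()`, is to perform the join in pandas. The sketch below uses small stand-in tables without the geometry column. It also pads the `fips` key to five characters, since the `GEOID` values shown above are zero-padded strings such as '01001'.

```python
import pandas as pd

# Stand-in for the election table, with FIPS codes that lost their leading zero
election = pd.DataFrame({
    "fips": ["1001", "1003"],
    "votes_total": [24973.0, 95215.0],
})

# Stand-in for the county table, keyed by five-character codes
counties = pd.DataFrame({
    "fips": ["01001", "01003", "01005"],
    "GEONAME": [
        "Autauga County, Alabama",
        "Baldwin County, Alabama",
        "Barbour County, Alabama",
    ],
})

# Make the join keys comparable: both sides must be five-character strings
election["fips"] = election["fips"].astype(str).str.zfill(5)

# indicator=True adds a "_merge" column reporting whether each row found a match
merged = election.merge(counties, how="left", on="fips", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]

print(merged)
print(len(unmatched), "election rows without a matching county")
```

Checking the `_merge` column after a left join is a quick way to find election records whose FIPS code has no counterpart in the county data.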

Part 2 entails:
- Geoenable data
- Join the data
- Query and calculate attributes
- Validate the data
- Update validate data
- Convert dataframe to feature classes
- Correct for missing values