# Data engineering

This notebook describes how to download and prepare United States presidential election data. You will address missing values, convert data types, and restructure a table.

***

## Load and prepare data

To download and prepare the election data, you will use ArcPy, the ArcGIS API for Python, and a Dask dataframe. First, you will import the modules you need. Then, you will read the United States county election data into a Dask dataframe and store it in a variable.

##### Import needed modules

In [32]:
import arcgis
import dask.dataframe as dd
import os
import arcpy

##### Read data into Python

In [44]:
dask_df = dd.read_csv("countypres2016.csv", assume_missing=True)

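Reading this file without extra options usually fails with a dtype inference error. Dask infers each column's dtype from a sample at the start of the file, so a column that contains only whole numbers in the sample is typed as integer (int64); missing values that appear later in that column cannot be stored as integers. You can fix this by specifying the dtype manually when reading the data, or by passing 'assume_missing=True' to interpret all integer columns without a specified dtype as floats.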

In [46]:
from dask.distributed import Client
client = Client(n_workers=1, threads_per_worker=4, processes=False, memory_limit='2GB')
client

Client
  Scheduler: inproc://192.168.42.65/8868/15
  Dashboard: http://192.168.42.65/8868/15:4922/status
Cluster
  Workers: 1
  Cores: 4
  Memory: 2.00 GB


The Dask Client provides a dashboard that is useful for gaining insight into the computation. The dashboard link appears in the output above.

It is important to remember that, while a Dask dataframe is very similar to a pandas dataframe, some differences do exist. Most Dask operations are lazy, meaning that they are not evaluated until you explicitly ask for a result by calling the compute method.

***

##### Exploratory Data Analysis

In [49]:
### Getting an overview of the data
dask_df.head()

Unnamed: 0,year,state,state_po,county,FIPS,office,candidate,party,candidatevotes,totalvotes,version
0,2016.0,Alabama,AL,Autauga,1001.0,President,Hillary Clinton,democrat,5936.0,24973.0,20190722.0
1,2016.0,Alabama,AL,Autauga,1001.0,President,Donald Trump,republican,18172.0,24973.0,20190722.0
2,2016.0,Alabama,AL,Autauga,1001.0,President,Other,,865.0,24973.0,20190722.0
3,2016.0,Alabama,AL,Baldwin,1003.0,President,Hillary Clinton,democrat,18458.0,95215.0,20190722.0
4,2016.0,Alabama,AL,Baldwin,1003.0,President,Donald Trump,republican,72883.0,95215.0,20190722.0


## Handle missing data 

In [53]:
dask_df.isnull().sum().compute()

year                 0
state                0
state_po            12
county               0
FIPS                12
office               0
candidate            0
party             3158
candidatevotes       6
totalvotes           0
version              0
dtype: int64

The election data includes records with missing values in the state_po, FIPS, party, and candidatevotes fields. These missing values are referred to as null values. Once the nulls have been identified, there are two ways to handle them:
- Fill them with a value
- Remove the affected records from the dataset

#### Let's investigate the fields with missing values further by running queries on them.

In [None]:
# Use isnull, which Dask dataframes share with pandas, to find the records with a null FIPS value.
# compute() evaluates the lazy query so that the matching rows are displayed.
dask_df.loc[dask_df['FIPS'].isnull()].compute()