# Data engineering with Dask

This notebook describes the process to download and prepare United States presidential election data. You will address missing values, reformat data types, and restructure the format of a table.

***

## Load and prepare data

To download and prepare the election data, you will use ArcPy, the ArcGIS API for Python, matplotlib for visualization and a Dask dataframe. First, you will import these modules to use them. Then, you will create a variable for the United States county election data and use this variable to read the data into a Dask dataframe.

##### Import needed modules

In [1]:
import arcgis
import dask.dataframe as dd
import dask.array as da
import matplotlib.pyplot as plt
import os


#import arcpy



##### Read data into Python

In [3]:
dask_df = dd.read_csv("countypres2016.csv", assume_missing=True)

Reading this file without extra options usually causes a dtype inference failure. Dask infers column dtypes from a sample at the start of the file, so a column that looks like integers ('int64') in the sample fails once missing values appear later. This can be fixed by specifying the dtype manually when reading the data, or by passing 'assume_missing=True' to interpret all unspecified integer columns as floats.
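
The effect can be reproduced on a tiny in-memory CSV. This is a minimal sketch using pandas directly, with invented sample data and column names: a column of whole numbers parses as `int64` when complete, but as `float64` once a value is missing, because `int64` has no representation for NaN.

```python
import io
import pandas as pd

# Hypothetical sample: 'votes' is complete, 'fips' has one missing value.
csv_text = "county,fips,votes\nAutauga,1001,5936\nBaldwin,,18458\n"

# Let pandas infer the dtypes.
df = pd.read_csv(io.StringIO(csv_text))
print(df.dtypes)

# Declare the numeric dtypes up front instead of relying on inference.
df_explicit = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"fips": "float64", "votes": "float64"},
)
print(df_explicit.dtypes)
```

`dd.read_csv` accepts the same `dtype` mapping, which is the manual alternative to `assume_missing=True`.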

In [4]:
from dask.distributed import Client
client = Client(n_workers=1, threads_per_worker=4, processes=False, memory_limit='2GB')
client

Port 8787 is already in use. 
Perhaps you already have a cluster running?
Hosting the diagnostics dashboard on a random port instead.


Client: Scheduler: inproc://192.168.42.173/10006/1, Dashboard: http://localhost:37241/status
Cluster: Workers: 1, Cores: 4, Memory: 2.00 GB


The Dask client provides a dashboard that is useful for gaining insight into the computation. The dashboard link appears in the output above.

***

##### Exploratory Data Analysis

In [5]:
### Getting an overview of the data
dask_df.head()

Unnamed: 0,year,state,state_po,county,FIPS,office,candidate,party,candidatevotes,totalvotes,version
0,2016.0,Alabama,AL,Autauga,1001.0,President,Hillary Clinton,democrat,5936.0,24973.0,20190722.0
1,2016.0,Alabama,AL,Autauga,1001.0,President,Donald Trump,republican,18172.0,24973.0,20190722.0
2,2016.0,Alabama,AL,Autauga,1001.0,President,Other,,865.0,24973.0,20190722.0
3,2016.0,Alabama,AL,Baldwin,1003.0,President,Hillary Clinton,democrat,18458.0,95215.0,20190722.0
4,2016.0,Alabama,AL,Baldwin,1003.0,President,Donald Trump,republican,72883.0,95215.0,20190722.0


## Handle missing data 

In [6]:
dask_df.isnull().sum().compute()

year                 0
state                0
state_po            12
county               0
FIPS                12
office               0
candidate            0
party             3158
candidatevotes       6
totalvotes           0
version              0
dtype: int64

The election data includes records that are missing data in the **state_po**, **FIPS**, **party**, and **candidatevotes** fields. This missing data is referred to as null values. Once the missing values have been identified, there are two ways to handle them:
- Fill them with a value
- Remove the affected records from the dataset
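
The two options can be sketched as follows. This uses pandas on an invented three-row sample; the fill value of 0 is an arbitrary choice for illustration. Dask dataframes provide `fillna` and `dropna` methods as well.

```python
import pandas as pd

# Hypothetical sample with one missing vote count.
df = pd.DataFrame({
    "county": ["Autauga", "Baldwin", "Bedford"],
    "candidatevotes": [5936.0, 18458.0, None],
})

# Option 1: fill the missing value (0 is an arbitrary placeholder here).
filled = df.fillna({"candidatevotes": 0})

# Option 2: remove every record that is missing a value in this column.
dropped = df.dropna(subset=["candidatevotes"])

print(len(filled), len(dropped))
```

Filling keeps every record at the cost of introducing an invented value, while dropping keeps only observed values at the cost of losing records.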

#### Let's investigate the features with missing values further by running a query on them.

In [8]:
missing_query = dask_df.query('(FIPS == "NaN") | (state_po == "NaN") | (candidatevotes == "NaN")').compute()
missing_query

Unnamed: 0,year,state,state_po,county,FIPS,office,candidate,party,candidatevotes,totalvotes,version
8781,2016.0,Virginia,VA,Bedford,51515.0,President,Hillary Clinton,democrat,,0.0,20190722.0
8782,2016.0,Virginia,VA,Bedford,51515.0,President,Donald Trump,republican,,0.0,20190722.0
8783,2016.0,Virginia,VA,Bedford,51515.0,President,Other,,,0.0,20190722.0
9462,2016.0,Connecticut,,Statewide writein,,President,Hillary Clinton,democrat,,5056.0,20190722.0
9463,2016.0,Maine,,Maine UOCAVA,,President,Hillary Clinton,democrat,3017.0,5056.0,20190722.0
9464,2016.0,Alaska,,District 99,,President,Hillary Clinton,democrat,274.0,5056.0,20190722.0
9465,2016.0,Rhode Island,,Federal Precinct,,President,Hillary Clinton,democrat,637.0,5056.0,20190722.0
9466,2016.0,Connecticut,,Statewide writein,,President,Donald Trump,republican,,5056.0,20190722.0
9467,2016.0,Maine,,Maine UOCAVA,,President,Donald Trump,republican,648.0,5056.0,20190722.0
9468,2016.0,Alaska,,District 99,,President,Donald Trump,republican,40.0,5056.0,20190722.0
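
Missing values are stored as NaN, and NaN never compares equal to anything, including itself. As an alternative sketch (not the notebook's original query), a boolean mask built from `isnull()` is a more explicit way to select records with missing values. It is shown here with pandas on invented sample data; Dask dataframes expose `isnull()` as well.

```python
import math
import pandas as pd

# NaN is not equal to itself, so equality tests cannot detect it.
nan_equals_itself = math.nan == math.nan
print(nan_equals_itself)

# Hypothetical sample: only the first row is complete.
df = pd.DataFrame({
    "state_po": ["AL", "VA", None, None],
    "FIPS": [1001.0, 51515.0, None, None],
    "candidatevotes": [5936.0, None, 3017.0, 274.0],
})

# True for any row that is missing at least one of the three fields.
mask = df["FIPS"].isnull() | df["state_po"].isnull() | df["candidatevotes"].isnull()
missing_rows = df[mask]
print(len(missing_rows))
```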


The strategy we will use here is to replace the missing values with a valid, representative value.

This can be done on a Dask dataframe with the 'fillna' method.

The 'state_po' feature is categorical, so we will fill its missing values with the mode (the most frequently occurring value) of that feature.

In [9]:
# Getting the mode for the 'state_po' feature
value_counts = dask_df['state_po'].value_counts().compute()
value_counts[:5]

TX    762
GA    477
VA    402
KY    360
MO    348
Name: state_po, dtype: int64

It can be seen that the most frequently occurring value of the 'state_po' feature is 'TX'.
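
Rather than reading the mode off the printed counts, it can also be derived programmatically. The following is a minimal sketch with pandas on an invented sample; once `value_counts().compute()` has run on a Dask series, the result is an ordinary pandas series, so the same `idxmax()` call applies.

```python
import pandas as pd

# Hypothetical sample of state abbreviations with one missing entry.
state_po = pd.Series(["TX", "GA", "TX", None, "VA", "TX"], name="state_po")

counts = state_po.value_counts()      # missing values are excluded by default
mode_value = counts.idxmax()          # label with the highest count
filled = state_po.fillna(mode_value)  # replace missing entries with the mode

print(mode_value)
```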

In [10]:
# Filling the missing values with the mode
dask_df["state_po"] = dask_df["state_po"].fillna('TX')

The 'FIPS' and 'candidatevotes' features are both numerical. Here the mode is not the best option: the features are continuous, and the mode of the distribution would not be a good representative of their central tendency. Instead, we will fill the missing values with the mean of each feature.
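
As a small worked example of mean filling, assuming invented vote counts: the mean is computed over the observed values only, and filling the gaps with that mean leaves the mean of the feature unchanged.

```python
import pandas as pd

# Hypothetical vote counts with one missing value.
votes = pd.Series([100.0, 200.0, None, 300.0], name="candidatevotes")

# mean() skips missing values: (100 + 200 + 300) / 3
mean_value = votes.mean()

# Replace the missing entry with the mean of the observed values.
filled = votes.fillna(mean_value)

print(mean_value, filled.mean())
```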

In [11]:
# Filling the missing values with the mean
dask_df["FIPS"] = dask_df["FIPS"].fillna(dask_df["FIPS"].mean().compute())
dask_df["candidatevotes"] = dask_df["candidatevotes"].fillna(dask_df["candidatevotes"].mean().compute())

In [12]:
dask_df.isnull().sum().compute()

year                 0
state                0
state_po             0
county               0
FIPS                 0
office               0
candidate            0
party             3158
candidatevotes       0
totalvotes           0
version              0
dtype: int64

We are left with missing values in the 'party' feature. The number of missing values is quite large, so it is critical to make a good choice of fill value. Let's get an overview of the unique values in the feature.

In [13]:
dask_df['party'].unique().compute()

0      democrat
1    republican
2           NaN
Name: party, dtype: object

As seen above, this feature records the party of each candidate in the election. To avoid biasing the dataset, we will fill the missing values with 'not recorded'.

In [14]:
# Filling the missing values with 'not recorded'
dask_df["party"] = dask_df["party"].fillna('not recorded')

In [15]:
dask_df.isnull().sum().compute()

year              0
state             0
state_po          0
county            0
FIPS              0
office            0
candidate         0
party             0
candidatevotes    0
totalvotes        0
version           0
dtype: int64

***

## Explore and handle data types

In reviewing your data, you notice that the FIPS field is stored as a numeric field instead of a string. As a result, leading zeroes in the FIPS values have been removed, so the affected values have only four characters instead of five. You will determine how many records are missing leading zeroes and prepend the missing zero.
![fix_truncated_zeroes](img/trunc_zeroes.gif "Fix Truncated Zeroes")
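
The idea can be sketched on a few values. This is a minimal illustration with pandas on an invented sample, not necessarily the approach the notebook takes: convert the floats to integer strings, count the four-character values, and left-pad everything to five characters with `str.zfill`.

```python
import pandas as pd

# Hypothetical FIPS codes as they appear after being read as floats.
fips = pd.Series([1001.0, 1003.0, 51515.0], name="FIPS")

# Drop the decimal part, then convert to text.
fips_str = fips.astype("int64").astype(str)

# Count the values that lost their leading zero.
n_short = int((fips_str.str.len() == 4).sum())

# Left-pad with zeroes to a width of five characters.
fips_fixed = fips_str.str.zfill(5)

print(n_short, fips_fixed.tolist())
```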