# Data engineering with Dask

This notebook describes the process to download and prepare United States presidential election data. You will address missing values, reformat data types, and restructure the format of a table.

***

## Load and prepare data

To download and prepare the election data, you will use ArcPy, the ArcGIS API for Python, matplotlib for visualization and a Dask dataframe. First, you will import these modules to use them. Then, you will create a variable for the United States county election data and use this variable to read the data into a Dask dataframe.

##### Import needed modules

In [1]:
import arcgis
import dask.dataframe as dd
import dask.array as da
import matplotlib.pyplot as plt
import os


#import arcpy



##### Read data into Python

In [2]:
dask_df = dd.read_csv("countypres2016.csv", assume_missing=True)

Reading a CSV file with Dask can fail on dtype inference. To help manage memory, Dask infers column types from a sample of the file, so a numeric column may be typed as integer (int64) even though missing values appear later in the file. This can be fixed by specifying the dtype manually when reading the data, or by passing 'assume_missing=True' to interpret all unspecified integer columns as floats.

In [3]:
from dask.distributed import Client
client = Client(n_workers=1, threads_per_worker=4, processes=False, memory_limit='2GB')
client

Client   Scheduler: inproc://192.168.42.173/27087/1   Dashboard: http://localhost:8787/status
Cluster  Workers: 1   Cores: 4   Memory: 2.00 GB


The Dask Client provides a dashboard that is useful for gaining insight into the computation. The dashboard link is shown in the output above.

It is important to remember that, while a Dask dataframe is very similar to a pandas dataframe, some differences do exist. Most Dask user interfaces are lazy, meaning that they don’t evaluate until you explicitly ask for a result using the compute method.

***

##### Exploratory Data Analysis

In [4]:
### Getting an overview of the data
dask_df.head()

Unnamed: 0,year,state,state_po,county,FIPS,office,candidate,party,candidatevotes,totalvotes,version
0,2016.0,Alabama,AL,Autauga,1001.0,President,Hillary Clinton,democrat,5936.0,24973.0,20190722.0
1,2016.0,Alabama,AL,Autauga,1001.0,President,Donald Trump,republican,18172.0,24973.0,20190722.0
2,2016.0,Alabama,AL,Autauga,1001.0,President,Other,,865.0,24973.0,20190722.0
3,2016.0,Alabama,AL,Baldwin,1003.0,President,Hillary Clinton,democrat,18458.0,95215.0,20190722.0
4,2016.0,Alabama,AL,Baldwin,1003.0,President,Donald Trump,republican,72883.0,95215.0,20190722.0


## Handle missing data 

In [5]:
dask_df.isnull().sum().compute()

year                 0
state                0
state_po            12
county               0
FIPS                12
office               0
candidate            0
party             3158
candidatevotes       6
totalvotes           0
version              0
dtype: int64

The election data includes records that are missing data in the **state_po**, **FIPS**, **party**, and **candidatevotes** fields. This missing data is referred to as null values. Once the features with missing values have been identified, we have two ways to work with them:
- Fill them with a value
- Remove the affected records from the dataset
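
The two options can be sketched as follows, using a small pandas dataframe with made-up rows. Dask dataframes also provide `fillna` and `dropna`, but the result there stays lazy until `compute` is called.

```python
import pandas as pd

# Hypothetical sample with one missing 'party' value.
df = pd.DataFrame({
    "county": ["Autauga", "Baldwin", "Barbour"],
    "party": ["democrat", None, "republican"],
})

# Option 1: fill the missing values with a chosen value.
filled = df.fillna({"party": "not recorded"})

# Option 2: remove the records that have a missing value in that column.
dropped = df.dropna(subset=["party"])

print(len(filled), len(dropped))
```

Filling keeps every record at the cost of introducing a substituted value, while dropping keeps only observed values at the cost of losing records.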

#### Let's investigate the features with missing values further by running queries on those features.

In [6]:
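# Select records where FIPS or state_po is null; comparing against the string "NaN" does not reliably match null values
missing_fips_state_po_query = dask_df[dask_df["FIPS"].isnull() | dask_df["state_po"].isnull()].compute()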
missing_fips_state_po_query

Unnamed: 0,year,state,state_po,county,FIPS,office,candidate,party,candidatevotes,totalvotes,version
9462,2016.0,Connecticut,,Statewide writein,,President,Hillary Clinton,democrat,,5056.0,20190722.0
9463,2016.0,Maine,,Maine UOCAVA,,President,Hillary Clinton,democrat,3017.0,5056.0,20190722.0
9464,2016.0,Alaska,,District 99,,President,Hillary Clinton,democrat,274.0,5056.0,20190722.0
9465,2016.0,Rhode Island,,Federal Precinct,,President,Hillary Clinton,democrat,637.0,5056.0,20190722.0
9466,2016.0,Connecticut,,Statewide writein,,President,Donald Trump,republican,,5056.0,20190722.0
9467,2016.0,Maine,,Maine UOCAVA,,President,Donald Trump,republican,648.0,5056.0,20190722.0
9468,2016.0,Alaska,,District 99,,President,Donald Trump,republican,40.0,5056.0,20190722.0
9469,2016.0,Rhode Island,,Federal Precinct,,President,Donald Trump,republican,53.0,5056.0,20190722.0
9470,2016.0,Connecticut,,Statewide writein,,President,Other,,,5056.0,20190722.0
9471,2016.0,Maine,,Maine UOCAVA,,President,Other,,321.0,5056.0,20190722.0


Since the 'state_po' feature is categorical, let's replace the missing values with the most frequently occurring value (the mode).

In [7]:
value_counts = dask_df["state_po"].value_counts().compute()
value_counts = value_counts[:5]
value_counts

TX    762
GA    477
VA    402
KY    360
MO    348
Name: state_po, dtype: int64

The most frequently occurring value of the 'state_po' feature is 'TX'.

In [8]:
# Filling the missing values with the mode
dask_df["state_po"] = dask_df["state_po"].fillna('TX')
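
Rather than reading the mode off the printed counts, it could also be looked up programmatically. The sketch below uses a small pandas Series with made-up values as a stand-in for the 'state_po' column; with Dask, the same lookup would apply to the pandas Series returned by `value_counts().compute()`.

```python
import pandas as pd

# Hypothetical stand-in for the 'state_po' column, with one missing value.
state_po = pd.Series(["TX", "GA", "TX", None, "VA", "TX", "GA"], name="state_po")

# value_counts() ignores missing values by default.
value_counts = state_po.value_counts()

# idxmax() returns the label with the highest count, i.e. the mode.
most_common = value_counts.idxmax()

filled = state_po.fillna(most_common)
print(most_common)
```

This avoids hard-coding the fill value, which matters if the input data changes.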

The 'FIPS' feature is numeric, so we fill it with the mean value.

In [9]:
# Filling the missing values with the mean
dask_df["FIPS"] = dask_df["FIPS"].fillna(dask_df["FIPS"].mean().compute())

In [10]:
dask_df.isnull().sum().compute()

year                 0
state                0
state_po             0
county               0
FIPS                 0
office               0
candidate            0
party             3158
candidatevotes       6
totalvotes           0
version              0
dtype: int64

Apart from six missing values in 'candidatevotes', only the 'party' feature still has missing values. Let's explore its unique values to decide how to fill them.

In [11]:
dask_df['party'].unique().compute()

0      democrat
1    republican
2           NaN
Name: party, dtype: object

Using the mode to fill these missing values would not be ideal: with 3,158 missing values, it would heavily bias the dataset. Instead, we replace the missing values with 'not recorded'.

In [12]:
# Filling the missing values with a placeholder label
dask_df["party"] = dask_df["party"].fillna('not recorded')

In [13]:
dask_df.isnull().sum().compute()

year              0
state             0
state_po          0
county            0
FIPS              0
office            0
candidate         0
party             0
candidatevotes    6
totalvotes        0
version           0
dtype: int64

***

## Explore and handle data types

In reviewing your data, you notice that the FIPS field is considered a numeric field instead of a string. As a result, leading zeroes in the FIPS values have been removed, so the affected FIPS values have only four characters instead of five. You will determine how many records are missing leading zeroes and add, or prepend, the missing zero.
![fix_truncated_zeroes](img/trunc_zeroes.gif "Fix Truncated Zeroes")
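
One possible way to do this is sketched below on a small pandas Series with made-up values. It assumes the column has no missing values, and the variable names are illustrative. The `astype` and `str.zfill` calls used here are also available on a Dask series.

```python
import pandas as pd

# Hypothetical FIPS values read as floats, so leading zeroes are lost.
fips = pd.Series([1001.0, 1003.0, 48201.0], name="FIPS")

# Convert to whole numbers first, then to text, to avoid a trailing ".0".
fips_str = fips.astype("int64").astype(str)

# Count the values that are one character short of the expected five.
n_truncated = int((fips_str.str.len() == 4).sum())

# Left-pad every value with zeroes to a width of five characters.
fips_fixed = fips_str.str.zfill(5)

print(n_truncated)
print(fips_fixed.tolist())
```

Values that already have five characters are left unchanged by `zfill`, so the padding can be applied to the whole column.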