## EDA IOWA Dataset 03 - Fixing Missing Values in Counties

**Status:** PUBLIC Distribution <br>

**Author:** Jaume Manero IE<br>
**Date created:** 2021/02/01<br>
**Last modified:** 2024/01/18 <br>
**Description:** Fixing missing values

The original dataset has many missing values in the County column. This notebook shows one way to fix them.

In [45]:
import warnings

import numpy as np
import pandas as pd
%matplotlib inline
warnings.filterwarnings('ignore')  # silence library warnings in the notebook output

In [46]:
# file = 'Iowa_Liquor_Sales.csv'  # earlier file name, overridden by the line below
file = 'Iowa_Liquor_Sales_NOV23.csv'
df = pd.read_csv(file, header=0)

In [47]:
df.dtypes

Invoice/Item Number       object
Date                      object
Store Number               int64
Store Name                object
Address                   object
City                      object
Zip Code                  object
Store Location            object
County Number            float64
County                    object
Category                 float64
Category Name             object
Vendor Number            float64
Vendor Name               object
Item Number               object
Item Description          object
Pack                       int64
Bottle Volume (ml)         int64
State Bottle Cost        float64
State Bottle Retail      float64
Bottles Sold               int64
Sale (Dollars)           float64
Volume Sold (Liters)     float64
Volume Sold (Gallons)    float64
dtype: object

In [4]:
df.columns

Index(['Invoice/Item Number', 'Date', 'Store Number', 'Store Name', 'Address',
       'City', 'Zip Code', 'Store Location', 'County Number', 'County',
       'Category', 'Category Name', 'Vendor Number', 'Vendor Name',
       'Item Number', 'Item Description', 'Pack', 'Bottle Volume (ml)',
       'State Bottle Cost', 'State Bottle Retail', 'Bottles Sold',
       'Sale (Dollars)', 'Volume Sold (Liters)', 'Volume Sold (Gallons)'],
      dtype='object')

In [5]:
# How many missing values are there in County?
df['County'].isna().sum()

159892

In [6]:
# And in County Number?
df['County Number'].isna().sum()

3578185

In [14]:
# And in Zip Code?
df['Zip Code'].isna().sum()

83156

In [10]:
# How do missing County and missing County Number overlap?
df.isna().pivot_table(index='County', columns='County Number', aggfunc='size').stack()

County  County Number
False   False            24132843.0
        True              3418293.0
True    True               159892.0
dtype: float64
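
The same overlap check can also be written with `pd.crosstab`. The following is a minimal sketch on a made-up five-row frame (the `toy` data and the `present`/`missing` labels are assumptions for illustration, not values from the Iowa file):

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset (hypothetical values)
toy = pd.DataFrame({
    "County": ["POLK", "LINN", None, None, "SCOTT"],
    "County Number": [77.0, np.nan, np.nan, np.nan, 82.0],
    "Zip Code": ["50314", "52402", "50010", None, None],
})

# Missing values per column, all three columns at once
missing_counts = toy[["County", "County Number", "Zip Code"]].isna().sum()

# Readable labels instead of True/False
labels = {True: "missing", False: "present"}
county_state = toy["County"].isna().map(labels)
number_state = toy["County Number"].isna().map(labels)

# Rows: County state, columns: County Number state
overlap = pd.crosstab(county_state, number_state)
print(missing_counts)
print(overlap)
```

A zero in the cell (County missing, County Number present) is what supports the conclusion that an empty County always comes with an empty County Number.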

In [13]:
# Conclusion: every row with an empty County also has an empty County Number.
# Do the rows with an empty County have a Zip Code?
df.isna().pivot_table(index='County', columns='Zip Code', aggfunc='size').stack()

County  Zip Code
False   False       27551114
        True              22
True    False          76758
        True           83134
dtype: int64
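
As a quick arithmetic check on the two tables above, the counts can be combined to see how many of the rows with an empty County could in principle be recovered from the Zip Code. This sketch only reuses the numbers printed in the outputs; the variable names are illustrative:

```python
# Counts copied from the County / Zip Code table above
county_ok_zip_ok = 27551114
county_ok_zip_missing = 22
county_missing_zip_ok = 76758
county_missing_zip_missing = 83134

total_rows = (county_ok_zip_ok + county_ok_zip_missing
              + county_missing_zip_ok + county_missing_zip_missing)
county_missing = county_missing_zip_ok + county_missing_zip_missing

# Share of rows with an empty County that still carry a Zip Code
recoverable_share = county_missing_zip_ok / county_missing

print(total_rows, county_missing, round(recoverable_share, 3))
```

Roughly 48% of the rows with an empty County still have a Zip Code, so a ZIP-based lookup can recover at most that share.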

In [None]:
# 83134 rows have neither County nor Zip Code; they cannot be recovered, so we can delete them.
# The other 76758 rows with an empty County do have a Zip Code.
# Let's create County from Zip Code using the uszipcode package.
from uszipcode import SearchEngine

search = SearchEngine()
zipcode = search.by_zipcode("57001")
zipcode.county

# Small helper that returns the county for a ZIP code.
# Careful: the county is returned in uppercase, without the word "County".
def reverseZIP(zipcode):
    if pd.isna(zipcode):
        return np.nan
    try:
        # Zip Code is an object column, so a value may be a string or a float
        zipcode = int(float(zipcode))
    except (TypeError, ValueError):
        return np.nan  # not a numeric ZIP code
    z = search.by_zipcode(zipcode)
    if z is None or not z.county:
        return np.nan  # unknown ZIP code: leave the county missing
    county = z.county.replace(' County', '')
    return county.upper()

print(reverseZIP(57101))

In [None]:
# Now we recompute the county name for the WHOLE dataset from its ZIP code.
# Doing it for every row, not only the rows with a missing County, avoids
# inconsistencies caused by changes in county boundaries,
# so the whole dataset stays consistent.

In [None]:
df['New County'] = df['Zip Code'].apply(reverseZIP)
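
Calling the lookup once per row may be slow on a dataset of about 27.7 million rows, because the same ZIP codes repeat many times. One possible alternative, sketched below, is to look up each distinct ZIP code once and then use `Series.map`. Here `lookup_county` and `TOY_LOOKUP` are hypothetical stand-ins for `reverseZIP` and the uszipcode database, so that the sketch runs without the package:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the ZIP database (not real uszipcode output)
TOY_LOOKUP = {"50314": "POLK", "52402": "LINN"}
calls = []  # records every lookup, to show how many are actually made

def lookup_county(zip_code):
    """Stand-in for reverseZIP: returns the county or NaN if unknown."""
    calls.append(zip_code)
    return TOY_LOOKUP.get(zip_code, np.nan)

toy = pd.DataFrame(
    {"Zip Code": ["50314", "50314", "52402", None, "50314", "99999"]}
)

# One lookup per distinct, non-missing ZIP code
unique_zips = toy["Zip Code"].dropna().unique()
zip_to_county = {z: lookup_county(z) for z in unique_zips}

# Broadcast the results back to every row; unmapped values become NaN
toy["New County"] = toy["Zip Code"].map(zip_to_county)
print(toy)
print(len(calls), "lookups for", len(toy), "rows")
```

With the real data the dictionary would be built from `df['Zip Code']` and `reverseZIP`; the result should match the row-by-row `apply`, only with far fewer lookups.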

In [None]:
# The column New County now holds the county derived from the ZIP code, and very few values are missing.
# You can proceed to delete the rows where it is still missing.
df['New County'].isna().sum()
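
Deleting the remaining rows can be done with `DataFrame.dropna` restricted to the new column. A minimal sketch on made-up data (the `toy` frame is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame: two rows could not be resolved to a county
toy = pd.DataFrame({
    "Zip Code": ["50314", None, "52402", "99999"],
    "New County": ["POLK", np.nan, "LINN", np.nan],
})

# Drop only the rows where New County is missing; other columns are not checked
cleaned = toy.dropna(subset=["New County"]).reset_index(drop=True)
print(cleaned)
```

Using `subset` matters here: without it, `dropna` would also remove rows that have a valid county but a missing value in some unrelated column.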

In [None]:
import session_info
session_info.show(html=False)