## Auditing and cleaning data before EDA in Tableau

The goal of this process is to make the data as clean as possible before building visualizations in Tableau.

In [12]:
import os
import pandas as pd

os.listdir()

['.ipynb_checkpoints',
 '2018-2010_export.csv',
 '2018-2010_import.csv',
 'clean.ipynb',
 'create.sql']

In [14]:
exports = pd.read_csv('2018-2010_export.csv')
imports = pd.read_csv('2018-2010_import.csv')

print("EXPORTS:")
display(exports.head())
print("IMPORTS:")
display(imports.head())

EXPORTS:


| | HSCode | Commodity | value | country | year |
|---|---|---|---|---|---|
| 0 | 2 | MEAT AND EDIBLE MEAT OFFAL. | 0.18 | AFGHANISTAN TIS | 2018 |
| 1 | 3 | FISH AND CRUSTACEANS, MOLLUSCS AND OTHER AQUAT... | 0.0 | AFGHANISTAN TIS | 2018 |
| 2 | 4 | DAIRY PRODUCE; BIRDS' EGGS; NATURAL HONEY; EDI... | 12.48 | AFGHANISTAN TIS | 2018 |
| 3 | 6 | LIVE TREES AND OTHER PLANTS; BULBS; ROOTS AND ... | 0.0 | AFGHANISTAN TIS | 2018 |
| 4 | 7 | EDIBLE VEGETABLES AND CERTAIN ROOTS AND TUBERS. | 1.89 | AFGHANISTAN TIS | 2018 |


IMPORTS:


| | HSCode | Commodity | value | country | year |
|---|---|---|---|---|---|
| 0 | 5 | PRODUCTS OF ANIMAL ORIGIN, NOT ELSEWHERE SPECI... | 0.0 | AFGHANISTAN TIS | 2018 |
| 1 | 7 | EDIBLE VEGETABLES AND CERTAIN ROOTS AND TUBERS. | 12.38 | AFGHANISTAN TIS | 2018 |
| 2 | 8 | EDIBLE FRUIT AND NUTS; PEEL OR CITRUS FRUIT OR... | 268.6 | AFGHANISTAN TIS | 2018 |
| 3 | 9 | COFFEE, TEA, MATE AND SPICES. | 35.48 | AFGHANISTAN TIS | 2018 |
| 4 | 11 | PRODUCTS OF THE MILLING INDUSTRY; MALT; STARCH... | NaN | AFGHANISTAN TIS | 2018 |


## Handling missing data

To assess the null values, we address each of the following points:

1. Which features contain null values?
2. How many rows contain null values? (What percentage?)
3. Why are there null values? (Does it make sense?)
4. Final Decision (Drop, Imputation Strategy)
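
The first two questions can be answered in a single pass. The sketch below is a minimal illustration, not the notebook's own code: `null_report` is a hypothetical helper, and the small `toy` frame is invented data standing in for `exports` or `imports`.

```python
import pandas as pd


def null_report(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column null count and percentage (hypothetical helper)."""
    counts = df.isnull().sum()
    # The mean of a boolean mask is the fraction of True values.
    percent = (df.isnull().mean() * 100).round(2)
    report = pd.DataFrame({"null_count": counts, "null_percent": percent})
    # Keep only the columns that actually contain nulls.
    return report[report["null_count"] > 0]


# Toy data for illustration only; not taken from the real CSV files.
toy = pd.DataFrame({
    "HSCode": [2, 3, 4, 5],
    "value": [0.18, None, 12.48, None],
    "country": ["A", "A", "B", "B"],
})
report = null_report(toy)
print(report)
```

Calling `null_report(exports)` and `null_report(imports)` would give the affected columns and their share of missing rows in one table each.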

**1. Which features contain null values?**

In [57]:
# Columns with nulls in exports
print(exports.isnull().any())
# Columns with nulls in imports
print(imports.isnull().any())

HSCode       False
Commodity    False
value         True
country      False
year         False
dtype: bool
HSCode       False
Commodity    False
value         True
country      False
year         False
dtype: bool


In both datasets, the only column with null values is `value`.

**2. How many rows contain null values? (What percentage?)**

To help decide how to handle the null values, we examine the proportion of rows with missing values.

In [61]:
# Total rows, rows with a null `value`, and the null share as a percentage (exports)
n_exports = exports.shape[0]
null_exports = exports[exports.value.isnull()].shape[0]
null_percent_exports = round(null_exports / n_exports, 2) * 100
# Same three figures for imports
n_imports = imports.shape[0]
null_imports = imports[imports.value.isnull()].shape[0]
null_percent_imports = round(null_imports / n_imports, 2) * 100

print("PERCENT MISSING IN EXPORTS:", null_percent_exports,"%", str(null_exports)+"/"+str(n_exports))
print("PERCENT MISSING IN IMPORTS:", null_percent_imports,"%", str(null_imports)+"/"+str(n_imports))

PERCENT MISSING IN EXPORTS: 10.0 % 14037/137023
PERCENT MISSING IN IMPORTS: 15.0 % 14027/93095


Both datasets contain just over 14,000 missing values: roughly 10% of exports (14,037 of 137,023 rows) and 15% of imports (14,027 of 93,095 rows). That is a substantial share of the data and should be investigated further.
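
One way to start that investigation is to check whether the missing values cluster in particular countries or years. The following is a sketch under assumptions: `missing_share` is a hypothetical helper, the toy rows are invented for illustration, and only the column names (`value`, `country`, `year`) come from the datasets above.

```python
import pandas as pd


def missing_share(df: pd.DataFrame, by: str, column: str = "value") -> pd.Series:
    """Fraction of rows with a null `column` within each group of `by`."""
    return (
        df[column].isnull()      # boolean mask: True where the value is missing
        .groupby(df[by])         # group the mask by the chosen key column
        .mean()                  # mean of booleans = share of missing rows
        .sort_values(ascending=False)
    )


# Toy data for illustration only; not taken from the real CSV files.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "B", "C"],
    "year": [2017, 2018, 2017, 2018, 2018, 2018],
    "value": [1.0, None, None, None, 2.5, 3.0],
})
by_country = missing_share(toy, "country")
by_year = missing_share(toy, "year")
print(by_country)
print(by_year)
```

If the share is close to 1.0 for a few groups and close to 0 elsewhere, the nulls are probably tied to how those groups were recorded rather than scattered at random, which bears on the choice between dropping and imputing.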

In [67]:
missing_edf = exports[exports.value.isnull()]
missing_idf = imports[imports.value.isnull()]

missing_edf.country.value_counts()

MACAO                                       151
ST LUCIA                                    125
SWAZILAND                                   120
UNSPECIFIED                                 118
GUINEA BISSAU                               117
BELIZE                                      117
DOMINICA                                    117
MONTSERRAT                                  116
KYRGHYZSTAN                                 112
C AFRI REP                                  112
BAHAMAS                                     111
MARTINIQUE                                  111
ARUBA                                       110
UNION OF SERBIA & MONTENEGRO                110
MOLDOVA                                     109
GRENADA                                     109
BR VIRGN IS                                 108
COMOROS                                     107
TURKMENISTAN                                107
TAJIKISTAN                                  106
BOSNIA-HRZGOVIN                         

In [8]:
exports['Commodity'].iloc[11]

'VEGETABLE PLAITING MATERIALS; VEGETABLE PRODUCTS NOT ELSEWHERE SPECIFIED OR INCLUDED.'

In [10]:
import csv

with open('2018-2010_export.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row)
        if reader.line_num == 13:
            break

OrderedDict([('HSCode', '2'), ('Commodity', 'MEAT AND EDIBLE MEAT OFFAL.'), ('value', '0.18'), ('country', 'AFGHANISTAN TIS'), ('year', '2018')])
OrderedDict([('HSCode', '3'), ('Commodity', 'FISH AND CRUSTACEANS, MOLLUSCS AND OTHER AQUATIC INVERTABRATES.'), ('value', '0'), ('country', 'AFGHANISTAN TIS'), ('year', '2018')])
OrderedDict([('HSCode', '4'), ('Commodity', "DAIRY PRODUCE; BIRDS' EGGS; NATURAL HONEY; EDIBLE PROD. OF ANIMAL ORIGIN, NOT ELSEWHERE SPEC. OR INCLUDED."), ('value', '12.48'), ('country', 'AFGHANISTAN TIS'), ('year', '2018')])
OrderedDict([('HSCode', '6'), ('Commodity', 'LIVE TREES AND OTHER PLANTS; BULBS; ROOTS AND THE LIKE; CUT FLOWERS AND ORNAMENTAL FOLIAGE.'), ('value', '0'), ('country', 'AFGHANISTAN TIS'), ('year', '2018')])
OrderedDict([('HSCode', '7'), ('Commodity', 'EDIBLE VEGETABLES AND CERTAIN ROOTS AND TUBERS.'), ('value', '1.89'), ('country', 'AFGHANISTAN TIS'), ('year', '2018')])
OrderedDict([('HSCode', '8'), ('Commodity', 'EDIBLE FRUIT AND NUTS; PEEL OR 