# Explore the Dataset
## Iowa Liquor Sales from Kaggle
### https://www.kaggle.com/residentmario/iowa-liquor-sales

Before we can run an ETL process on this file, we need to learn more about it. The file is challenging to work with: it is about 3.5 GB and has more than 12 million rows to wrangle.

First, we need to determine the data type of each column. We can start with the first 20 rows to get an initial insight, then refine our observations.
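Since the file is too large to load casually, one way to confirm the row count is to stream it in chunks. The snippet below is a minimal sketch of that idea: `count_rows` is a hypothetical helper (not part of the original notebook), and the in-memory CSV with made-up rows stands in for `./dataset/dataset.csv`.

```python
import io

import pandas as pd


def count_rows(source, chunksize=100_000):
    """Count the data rows of a CSV by streaming it chunk by chunk.

    Hypothetical helper: `source` can be a file path or a file-like object.
    """
    total = 0
    # usecols=[0] keeps only the first column, so each chunk stays small in memory
    for chunk in pd.read_csv(source, usecols=[0], chunksize=chunksize):
        total += len(chunk)
    return total


# Tiny in-memory stand-in for the real file (made-up rows, for illustration only)
demo_csv = io.StringIO(
    "Invoice/Item Number,Date,Store Number\n"
    "A1,11/20/2015,2191\n"
    "A2,11/21/2015,2205\n"
    "A3,11/16/2015,3549\n"
)
print(count_rows(demo_csv, chunksize=2))
```

On the real file, the call would presumably be `count_rows('./dataset/dataset.csv')`; the chunk size is a tuning knob that trades memory for the number of read passes.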

In [1]:
# install and import some useful libraries
!conda install -y pandas
import pandas as pd
import time

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



In [2]:
# set pandas display options so that every column and row is shown
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [3]:
# import os
# notebook_path = os.path.abspath("Explore_dataset.ipynb")
df_sample = pd.read_csv('./dataset/dataset.csv', nrows=20)
print(df_sample.head(20))
print(df_sample.dtypes)

   Invoice/Item Number        Date  Store Number  \
0         S29198800001  11/20/2015          2191   
1         S29195400002  11/21/2015          2205   
2         S29050300001  11/16/2015          3549   
3         S28867700001  11/04/2015          2513   
4         S29050800001  11/17/2015          3942   
5         S28869200001  11/11/2015          3650   
6         S28865700001  11/09/2015          2538   
7         S28869500001  11/10/2015          3942   
8         S29339300091  11/30/2015          2662   
9         S29050900001  11/16/2015          4307   
10        S29049900001  11/17/2015          2661   
11        S28868200001  11/05/2015          2561   
12        S28869600001  11/09/2015          4114   
13        S28866900001  11/11/2015          3650   
14        S29050100001  11/19/2015          2806   
15        S29049600001  11/17/2015          2624   
16        S28868400001  11/04/2015          2572   
17        S29196300002  11/24/2015          2595   
18        S2

From this sample I can start to draw some conclusions.

1. I can easily identify the column names and start guessing the data type of each column. That will speed up reading the full dataset later.
2. The column 'Invoice/Item Number' seems to be a unique identifier that I can use for each row. Rename the column to `id` (step 2b).
3. 'Date' is obviously a date that I have to convert from `string/object` to `datetime`. I could also find the `min()` and `max()` values to know the date range we are dealing with (step 3b).
4. 'Zip Code' is automatically typed as `object` because, although most of the data looks numeric as expected, row 9 contains `712-2`. A quick search on my favourite search engine [Qwant](www.qwant.com) shows me that it's an Area Code. I will therefore treat any non-numeric value in this column as `NaN`.
5. Since there are some errors to handle in the address columns, I will not take the values in `Store Location` for granted. Instead, I will use the [Geocoding Google API](https://developers.google.com/maps/documentation/geocoding/intro) to create `Lat` and `Lon` columns with the right values in the ETL process.
6. The `Category` column is typed as float, but in the sample we only see integer-like numbers. I have to explore this to optimize memory usage. I also note that there will be `NaN` values to handle here.
7. For future calculations, it will be easier if the currency values are treated as numeric. I need to get rid of the dollar sign and convert the values to cents so that I handle integers instead of floats (banks and fintech companies use this optimization to handle billions of transactions with less memory).
8. We have the bottle volume in **ml** (milliliters) and the total `Volume Sold` in both **liters** and **gallons**. I prefer to use [SI](https://en.wikipedia.org/wiki/International_System_of_Units) units for convenience. So I will drop the extra **gallons** column and convert the **liters** values to **ml**, for the same reason as the currency columns.
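Points 7 and 8 can be sketched as two small reusable helpers. This is only an illustration under assumptions: the function names `dollars_to_cents` and `liters_to_ml` are hypothetical, the demo values are made up, and the nullable `Int64` dtype is one possible choice for keeping missing values alongside integers.

```python
import pandas as pd


def dollars_to_cents(series):
    """Turn strings such as '$19.99' into integer cents (nullable Int64)."""
    cleaned = (
        series.str.replace("$", "", regex=False)
        .str.replace(",", "", regex=False)  # assumption: thousands separators may appear
    )
    dollars = pd.to_numeric(cleaned, errors="coerce")
    # Round before casting: a float product such as 19.99 * 100 can land
    # slightly below the intended integer.
    return dollars.mul(100).round().astype("Int64")


def liters_to_ml(series):
    """Convert liters to integer milliliters (1 L = 1000 ml)."""
    return series.mul(1000).round().astype("Int64")


# Made-up demo values, not taken from the dataset
demo = pd.DataFrame(
    {
        "State Bottle Cost": ["$4.50", "$19.99", None],
        "Volume Sold (Liters)": [0.75, 10.5, 1.0],
    }
)
demo["State Bottle Cost"] = dollars_to_cents(demo["State Bottle Cost"])
demo["Volume Sold (Liters)"] = liters_to_ml(demo["Volume Sold (Liters)"])
print(demo.dtypes)
```

The same `dollars_to_cents` helper could then be applied to each currency column in turn, which avoids repeating the strip, parse, and multiply sequence for every column.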

Let's try these 8 steps (not necessarily in this order) on the sample data. That will be useful for the wrangling step of the final ETL.

In [4]:
from pandas import UInt32Dtype

# step 2
df_sample.set_index("Invoice/Item Number", inplace=True)
# df_sample.rename(columns = {'Invoice/Item Number':'id'}, inplace = True)
# step 3
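# the sample dates are month/day/year (e.g. 11/20/2015); an explicit format avoids the deprecated infer_datetime_format
df_sample['Date'] = pd.to_datetime(df_sample['Date'], format='%m/%d/%Y')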
# step 4
df_sample['Zip Code'] = pd.to_numeric(df_sample['Zip Code'], errors='coerce')
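# step 6 - done on the sample but still needs to be verified on the full dataset later - unsigned int 32 seems to be enough
# astype returns a new object, so the result has to be assigned back
df_sample['Category'] = df_sample['Category'].astype(UInt32Dtype())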
# step 7.1 - this repeated logic should be factored into a function in the ETL.
df_sample['State Bottle Cost'] = df_sample['State Bottle Cost'].str.slice(1)
df_sample['State Bottle Cost'] = pd.to_numeric(df_sample['State Bottle Cost'])
df_sample['State Bottle Cost'] = df_sample['State Bottle Cost'].multiply(100)
# step 7.2
df_sample['State Bottle Retail'] = df_sample['State Bottle Retail'].str.slice(1)
df_sample['State Bottle Retail'] = pd.to_numeric(df_sample['State Bottle Retail'])
df_sample['State Bottle Retail'] = df_sample['State Bottle Retail'].multiply(100)
# step 7.3
df_sample['Sale (Dollars)'] = df_sample['Sale (Dollars)'].str.slice(1)
df_sample['Sale (Dollars)'] = pd.to_numeric(df_sample['Sale (Dollars)'])
df_sample['Sale (Dollars)'] = df_sample['Sale (Dollars)'].multiply(100)
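# step 8 - convert liters to milliliters (1 L = 1000 ml) and drop the gallons column
df_sample['Volume Sold (Liters)'] = df_sample['Volume Sold (Liters)'].multiply(1000)
# drop returns a new DataFrame, so the result has to be assigned back
df_sample = df_sample.drop(columns=['Volume Sold (Gallons)'])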


print(df_sample.head(20))
print(df_sample.dtypes)

                          Date  Store Number  \
Invoice/Item Number                            
S29198800001        2015-11-20          2191   
S29195400002        2015-11-21          2205   
S29050300001        2015-11-16          3549   
S28867700001        2015-11-04          2513   
S29050800001        2015-11-17          3942   
S28869200001        2015-11-11          3650   
S28865700001        2015-11-09          2538   
S28869500001        2015-11-10          3942   
S29339300091        2015-11-30          2662   
S29050900001        2015-11-16          4307   
S29049900001        2015-11-17          2661   
S28868200001        2015-11-05          2561   
S28869600001        2015-11-09          4114   
S28866900001        2015-11-11          3650   
S29050100001        2015-11-19          2806   
S29049600001        2015-11-17          2624   
S28868400001        2015-11-04          2572   
S29196300002        2015-11-24          2595   
S29134300126        2015-11-18          