# Data exploration

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Data loading and exploration

In [2]:
orders = pd.read_csv('../00.Data/orders_cripted.csv')

In [3]:
orders.head()

Unnamed: 0,Name,Customer,Financial Status,Paid at,Fulfillment Status,Fulfilled at,Accepts Marketing,Currency,Subtotal,Shipping,...,Tax 2 Name,Tax 2 Value,Tax 3 Name,Tax 3 Value,Tax 4 Name,Tax 4 Value,Tax 5 Name,Tax 5 Value,Phone,Receipt Number
0,#1247,1.0,paid,2019-04-16 18:36:20 +0200,fulfilled,2019-04-16 23:30:49 +0200,yes,EUR,59.0,4.9,...,,,,,,,,,,
1,#1246,2.0,paid,2019-04-16 02:05:29 +0200,unfulfilled,,yes,EUR,94.0,0.0,...,,,,,,,,,,
2,#1246,2.0,,,,,,,,,...,,,,,,,,,,
3,#1245,3.0,paid,2019-04-15 23:35:10 +0200,fulfilled,2019-04-16 09:00:25 +0200,yes,EUR,32.5,4.9,...,,,,,,,,,,
4,#1244,4.0,paid,2019-04-15 09:51:49 +0200,fulfilled,2019-04-15 23:47:18 +0200,yes,EUR,45.0,4.9,...,,,,,,,,,,


The output contains many NaN values. In particular, the 'Tax N Name' and 'Tax N Value' columns appear to be empty, and it is not clear what they represent. Let's compare the number of NaN values per column with the total number of rows.

In [4]:
orders.shape

(2419, 50)

In [5]:
orders.isna().sum()

Name                              0
Customer                          2
Financial Status               2179
Paid at                        2260
Fulfillment Status             2179
Fulfilled at                   2209
Accepts Marketing              2179
Currency                       2179
Subtotal                       2179
Shipping                       2179
Taxes                          2179
Total                          2179
Discount Code                  2382
Discount Amount                2179
Shipping Method                2189
Created at                        0
Lineitem quantity                 0
Lineitem name                     0
Lineitem price                    0
Lineitem compare at price      2396
Lineitem sku                     83
Lineitem requires shipping        0
Lineitem taxable                  0
Lineitem fulfillment status       0
Shipping City                  2187
Shipping Zip                   2187
Notes                          2366
Note Attributes             

So there are plenty of NaNs. We'll drop every 'Tax' column except 'Tax 1 Name' and 'Tax 1 Value', since the others contain no data. 'Receipt Number' has no values either, so we drop it as well.

In [6]:
orders.drop(
    columns=[
        'Tax 2 Name', 'Tax 2 Value', 'Tax 3 Name', 'Tax 3 Value',
        'Tax 4 Name', 'Tax 4 Value', 'Tax 5 Name', 'Tax 5 Value',
        'Receipt Number',
    ],
    inplace=True,
)
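
Instead of listing the empty columns by hand, they could also be found programmatically. The sketch below uses a small made-up DataFrame (`toy`, with invented values; only the column names come from the real file) to show the idea: `isna().mean()` gives the share of missing values per column, and `isna().all()` flags the columns that contain no data at all.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the export: column names mirror the real file, values are made up.
toy = pd.DataFrame({
    'Name': ['#1', '#1', '#2'],
    'Subtotal': [59.0, np.nan, 94.0],
    'Tax 2 Name': [np.nan, np.nan, np.nan],
    'Receipt Number': [np.nan, np.nan, np.nan],
})

# Share of missing values per column (1.0 means the column is completely empty).
nan_share = toy.isna().mean()

# Names of the columns that hold no data at all.
empty_cols = toy.columns[toy.isna().all()].tolist()

# Drop them by name; toy.dropna(axis=1, how='all') would do the same in one call.
toy_clean = toy.drop(columns=empty_cols)
```

On the real data this would also reveal any other fully empty column that was not spotted in the truncated `head()` output.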

The data is exported directly from a Shopify store, so each order is laid out as follows:
* The first row of an order holds the order-level information plus the first item in the order
* The following rows hold the remaining items in the order, with the order-level fields left empty

This means that we should split the data into two tables:

1) Orders information

2) Items information

We could have a third table with customer information, but we don't because of privacy concerns.

In [7]:
orders.groupby('Created at').count().shape

(240, 40)

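There are 240 distinct timestamps, which suggests that we actually have 240 orders. This matches the NaN counts: 2419 - 2179 = 240 rows have order-level fields such as 'Financial Status' filled in, and the remaining rows only carry item information. 'Name' appears to be the order ID, so grouping by it should give the same count.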

In [8]:
orders.groupby('Name').count().shape

(240, 40)
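
The "one order-level row per order" pattern can also be checked directly. The following is a minimal sketch on made-up data (`toy` and the item names are invented for illustration); it relies on the fact that `count()` ignores NaN, so counting 'Financial Status' per 'Name' gives the number of rows in each order that carry order-level data.

```python
import numpy as np
import pandas as pd

# Toy stand-in: order '#1246' has two line items, only the first row has order data.
toy = pd.DataFrame({
    'Name': ['#1247', '#1246', '#1246', '#1245'],
    'Financial Status': ['paid', 'paid', np.nan, 'paid'],
    'Lineitem name': ['item A', 'item B', 'item C', 'item D'],
})

# count() skips NaN, so this is the number of order-level rows per order.
header_rows = toy.groupby('Name')['Financial Status'].count()

# Number of distinct orders, and whether each has exactly one order-level row.
n_orders = toy['Name'].nunique()
one_header_per_order = bool(header_rows.eq(1).all())
```

If `one_header_per_order` came out `False` on the real export, the split into two tables would need a closer look before going further.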

Each order has a single row, its first, that carries the order-level information. The order-level columns are:

    order_specific = ['Name', 'Customer', 'Financial Status', 'Paid at', 'Fulfillment Status', 'Fulfilled at', 'Accepts Marketing', 'Currency', 'Subtotal', 'Shipping', 'Taxes', 'Total', 'Discount Amount', 'Shipping Method', 'Shipping City', 'Shipping Zip', 'Note Attributes', 'Payment Method', 'Payment Reference', 'Refunded Amount', 'Id', 'Risk Level', 'Source', 'Discount Code','Notes', 'Cancelled at', 'Tags', 'Tax 1 Name', 'Tax 1 Value']

   
Other variables are clearly item-specific:

    item_specific = ['Lineitem quantity', 'Lineitem name', 'Lineitem price', 'Lineitem sku', 'Lineitem requires shipping', 'Lineitem taxable', 'Lineitem fulfillment status', 'Lineitem compare at price']
    

We will now split the DataFrame as follows:
* order_specific information
* item_specific information

To do that, both tables need a shared order identifier so that each item can be traced back to its order.
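
'Name' is present on every row (0 NaNs in the counts above), so it can serve as that key directly. If the numeric 'Id' is preferred instead, it may need to be propagated to the item rows first. The sketch below assumes, without having verified it on this file, that 'Id' is only filled in on the first row of each order; the ids and item names are made up.

```python
import numpy as np
import pandas as pd

# Toy stand-in: 'Id' is assumed to be populated only on the first row of an order.
toy = pd.DataFrame({
    'Name': ['#1246', '#1246', '#1245'],
    'Id': [1001.0, np.nan, 1002.0],  # made-up ids
    'Lineitem name': ['item B', 'item C', 'item D'],
})

# 'first' takes the first non-null value in each group; transform broadcasts it
# back to every row of that group, so all items of an order share the same Id.
toy['Id'] = toy.groupby('Name')['Id'].transform('first')
```

Grouping by 'Name' rather than forward-filling the whole column avoids carrying an id over from one order into the next if an order happened to have no id at all.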

## Splitting the dataframe

To make the data easier to understand, we split the DataFrame in two: one order-specific and one item-specific. This gives the database a clearer structure and lets us clean each DataFrame separately, with a better understanding of each variable.

In [9]:
orders.columns

Index(['Name', 'Customer', 'Financial Status', 'Paid at', 'Fulfillment Status',
       'Fulfilled at', 'Accepts Marketing', 'Currency', 'Subtotal', 'Shipping',
       'Taxes', 'Total', 'Discount Code', 'Discount Amount', 'Shipping Method',
       'Created at', 'Lineitem quantity', 'Lineitem name', 'Lineitem price',
       'Lineitem compare at price', 'Lineitem sku',
       'Lineitem requires shipping', 'Lineitem taxable',
       'Lineitem fulfillment status', 'Shipping City', 'Shipping Zip', 'Notes',
       'Note Attributes', 'Cancelled at', 'Payment Method',
       'Payment Reference', 'Refunded Amount', 'Vendor', 'Id', 'Tags',
       'Risk Level', 'Source', 'Lineitem discount', 'Tax 1 Name',
       'Tax 1 Value', 'Phone'],
      dtype='object')

In [10]:
cols_order = ['Name', 'Customer', 'Financial Status', 'Paid at', 'Fulfillment Status', 'Fulfilled at', 'Accepts Marketing', 'Currency', 'Subtotal', 'Shipping', 'Taxes', 'Total', 'Discount Amount', 'Shipping Method', 'Shipping City', 'Shipping Zip', 'Note Attributes', 'Payment Method', 'Payment Reference', 'Refunded Amount', 'Id', 'Risk Level', 'Source', 'Discount Code','Notes', 'Cancelled at', 'Tags', 'Tax 1 Name', 'Tax 1 Value']
order_specific = orders[cols_order].copy()  # explicit copy, so later cleaning does not trigger SettingWithCopyWarning

# Note: 'Name' and 'Id' are also included in the item-specific DataFrame
# so that each item can be traced back to its order.
cols_item = ['Name', 'Id', 'Lineitem quantity', 'Lineitem name', 'Lineitem price', 'Lineitem sku', 'Lineitem requires shipping', 'Lineitem taxable', 'Lineitem fulfillment status', 'Lineitem compare at price']
item_specific = orders[cols_item].copy()

In [11]:
order_specific.columns

Index(['Name', 'Customer', 'Financial Status', 'Paid at', 'Fulfillment Status',
       'Fulfilled at', 'Accepts Marketing', 'Currency', 'Subtotal', 'Shipping',
       'Taxes', 'Total', 'Discount Amount', 'Shipping Method', 'Shipping City',
       'Shipping Zip', 'Note Attributes', 'Payment Method',
       'Payment Reference', 'Refunded Amount', 'Id', 'Risk Level', 'Source',
       'Discount Code', 'Notes', 'Cancelled at', 'Tags', 'Tax 1 Name',
       'Tax 1 Value'],
      dtype='object')

In [12]:
item_specific.columns

Index(['Name', 'Id', 'Lineitem quantity', 'Lineitem name', 'Lineitem price',
       'Lineitem sku', 'Lineitem requires shipping', 'Lineitem taxable',
       'Lineitem fulfillment status', 'Lineitem compare at price'],
      dtype='object')
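
Note that selecting columns alone keeps all 2419 rows in both DataFrames, so the order table still contains the mostly empty item rows. A possible next step, sketched here on made-up data, is to keep only the rows that carry order-level data and then confirm that the two tables can be joined back on 'Name'. The names `orders_tbl`, `items_tbl` and `joined` are hypothetical, and using 'Total' as the marker for an order-level row is an assumption based on the NaN counts.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the export; values are invented.
toy = pd.DataFrame({
    'Name': ['#1246', '#1246', '#1245'],
    'Total': [94.0, np.nan, 37.4],
    'Lineitem name': ['item B', 'item C', 'item D'],
    'Lineitem price': [50.0, 44.0, 32.5],
})

# Order table: keep only rows where an order-level field is filled in.
orders_tbl = toy.loc[toy['Total'].notna(), ['Name', 'Total']].reset_index(drop=True)

# Item table: every row is a line item, so all rows are kept.
items_tbl = toy[['Name', 'Lineitem name', 'Lineitem price']].copy()

# Join back on 'Name'; validate raises if 'Name' is not unique in orders_tbl.
joined = items_tbl.merge(orders_tbl, on='Name', how='left', validate='many_to_one')
```

The `validate='many_to_one'` argument makes the merge fail loudly if the order table ever ends up with more than one row per order.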

## Load data to DB

We now upload the two tables to the SQL database so we can use them in the next steps.

In [13]:
from sqlalchemy import create_engine


driver = 'mysql+pymysql:'
user = 'adria'
password = '00000'
ip = '35.187.114.125'
database = 'vimet'

connection_string = f'{driver}//{user}:{password}@{ip}/{database}'
engine = create_engine(connection_string)
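
Keeping credentials in the notebook is convenient but easy to leak. One alternative, shown as a sketch only, is to read them from environment variables and escape them before building the URL. The variable names (`VIMET_DB_USER` and so on) and the default values are hypothetical and exist only so the example runs without any setup.

```python
import os
from urllib.parse import quote_plus

# Hypothetical environment variable names; defaults are placeholders, not real credentials.
user = os.environ.get('VIMET_DB_USER', 'user')
password = os.environ.get('VIMET_DB_PASSWORD', 'p@ss/word')
host = os.environ.get('VIMET_DB_HOST', 'localhost')
database = os.environ.get('VIMET_DB_NAME', 'vimet')

# quote_plus escapes characters such as '@' or '/' that would otherwise break the URL.
connection_string = (
    f'mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}@{host}/{database}'
)

# engine = create_engine(connection_string)  # same call as in the cell above
```

The resulting string has the same shape as the one built above, so the rest of the notebook would not need to change.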

In [14]:
# if_exists='replace' drops an existing table of the same name before writing.
order_specific.to_sql('orders', con=engine, if_exists='replace')
item_specific.to_sql('items', con=engine, if_exists='replace')
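
A quick way to sanity-check an upload is to read the table back and compare it with the DataFrame. The sketch below uses an in-memory SQLite database as a stand-in for the MySQL server so that it runs anywhere. The data is made up, and `index=False` is a choice made for this example (it keeps the positional index out of the table), not what the cell above does.

```python
import sqlite3

import pandas as pd

# Toy order table with invented values.
toy_orders = pd.DataFrame({'Name': ['#1246', '#1245'], 'Total': [94.0, 37.4]})

# In-memory SQLite stands in for the real database in this sketch.
con = sqlite3.connect(':memory:')

# Write the table; index=False avoids storing the DataFrame index as a column.
toy_orders.to_sql('orders', con=con, if_exists='replace', index=False)

# Read it back and compare the shape with the original DataFrame.
round_trip = pd.read_sql('SELECT * FROM orders', con)
con.close()

same_shape = round_trip.shape == toy_orders.shape
```

Against the real database the same check would use `engine` in place of `con`.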