# Ecommerce Logistics Analysis, Section 0

Objective: Answer business questions for the Olist business in SQL.

This is real, commercial data from the Olist Store. It contains information on roughly 100k orders (99,441 rows in the orders file) from 2016 through 2018. Note that it has been anonymized (company names have been replaced with Game of Thrones house names).

Olist is a Brazilian ecommerce store: https://olist.com/ (see the Kaggle dataset description and the Olist website for a fuller description of the business).

Dataset sources: <br>
Source 1: https://www.kaggle.com/olistbr/brazilian-ecommerce <br>
Source 2: https://www.kaggle.com/olistbr/marketing-funnel-olist <br>

This section contains 4 parts:
1. Creating the database as a .db file
2. Identifying structural issues in the database
3. Identifying sourcing issues 
4. Handling missing values

## Part 1: Create database 

Note that this dataset was originally a collection of 12 csv files (10 csv files from Source 1 and 2 csv files from Source 2.)

1. Created a database (ecommerce.db)
2. Created tables (specified table name, columns, column types, primary keys, foreign keys, column constraints, etc.)
3. Inserted data from the csv files into the tables. 

Routine:

1. Create the table using the template, specifying primary and foreign keys.
2. Make sure there are no duplicates on the primary key in the database.
3. Insert the data into the table using the template for the sellers table.
4. Reopen DB Browser and confirm that the data was inserted.
5. Run a sample SQL query joining the new table.
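
The routine above was carried out in DB Browser and the command prompt (see the screenshots below). As a rough sketch of the same steps in Python, assuming an in-memory database and a hypothetical two-row sample instead of the real sellers csv file, it could look like this. Note that SQLite only enforces foreign key constraints on a connection after `PRAGMA foreign_keys = ON` has been set.

```python
import sqlite3

import pandas as pd

# Hypothetical two-row sample standing in for the sellers csv file.
sellers_sample = pd.DataFrame({
    "seller_id": ["s1", "s2"],
    "seller_zip_code_prefix": [13023, 13844],
    "seller_city": ["campinas", "mogi guacu"],
    "seller_state": ["SP", "SP"],
})

# An in-memory database keeps the sketch from touching ecommerce.db.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default

# Step 1: create the table with its primary key.
conn.execute("""
CREATE TABLE sellers (
    seller_id TEXT PRIMARY KEY,
    seller_zip_code_prefix INTEGER,
    seller_city TEXT,
    seller_state TEXT
)
""")

# Step 2: no duplicates on the primary key before inserting.
assert not sellers_sample["seller_id"].duplicated().any()

# Step 3: append the rows to the table created above.
sellers_sample.to_sql("sellers", conn, if_exists="append", index=False)

# Steps 4-5: confirm the insert with a sample query.
row_count = conn.execute("SELECT COUNT(*) FROM sellers").fetchone()[0]
print(row_count)
```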

### Create empty database

In command prompt:
![title](images/create_empty_db.JPG)


### Create tables, insert data, data normalization

Create tables: Create each table, designating column data types and constraints so as to maintain data integrity and the relationships between tables. <br>
Check that each table was created with the correct columns and column types.<br>
Data normalization: Reduce data duplication and improve data integrity.<br>
Insert data into tables: Load the corresponding csv file into each table.<br>
Close DB Browser when finished.<br>

#### geolocation table

- Using 'geolocation_zip_code_prefix' as the primary key for this table. 
- The 'geolocation_zip_code_prefix' column is not unique in the original dataset. Each repeating value of the 'geolocation_zip_code_prefix' is accompanied by the exact same information in every other column. In other words, entire rows in this dataset are repeated frequently.
- Dropping duplicates of the 'geolocation_zip_code_prefix' so that it can be used as the primary key in the table and to reduce data redundancy/duplication as part of the data normalization process.
- No foreign keys are necessary for this table.
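
Before dropping duplicates on the zip code prefix alone, it may be worth confirming that repeated prefixes really do carry identical values in every other column. The following is a minimal sketch of such a check; the helper name `prefixes_with_conflicting_rows` and the sample values are hypothetical and not part of the original notebook.

```python
import pandas as pd

def prefixes_with_conflicting_rows(df, key):
    """Return key values whose repeated rows differ in at least one other column."""
    distinct_rows = df.drop_duplicates()          # collapse fully identical rows
    counts = distinct_rows[key].value_counts()    # distinct rows left per key value
    return sorted(counts[counts > 1].index.tolist())

# Hypothetical sample: prefix 1037 repeats identically, prefix 1046 does not.
geo_sample = pd.DataFrame({
    "geolocation_zip_code_prefix": [1037, 1037, 1046, 1046],
    "geolocation_lat": [-23.54, -23.54, -23.55, -23.56],
    "geolocation_lng": [-46.64, -46.64, -46.64, -46.64],
    "geolocation_city": ["sao paulo"] * 4,
    "geolocation_state": ["SP"] * 4,
})

conflicts = prefixes_with_conflicting_rows(geo_sample, "geolocation_zip_code_prefix")
print(conflicts)
```

An empty list on the full dataset would support keeping the first row per prefix; a non-empty list would mean that `keep="first"` discards differing values for those prefixes.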

Create table:
![title](images/create_geolocation_table_2.JPG)

In [1]:
#Data normalization (reduce data duplication) and make 'geolocation_zip_code_prefix' the primary key
import pandas as pd

geolocation_long = pd.read_csv('data/brazilian-ecommerce/olist_geolocation_dataset.csv')
print(geolocation_long.shape)
geolocation_short = geolocation_long.drop_duplicates(subset='geolocation_zip_code_prefix', keep="first")
print(geolocation_short.shape)
geolocation_short.to_csv('data/brazilian-ecommerce/geolocation_short.csv', index=False)

(1000163, 5)
(19015, 5)


Insert data into geolocation table:
![title](images/insert_geolocation_data.JPG)

#### customers table

- Using 'customer_id' as the primary key for this table. Reminder that a new customer_id is generated for each new order, regardless of whether it is a new customer or an existing one.
- The 'customer_id' column is unique and non-null so it can be used as primary key w/o any changes (see below).
- The 'customer_unique_id' column can be used to track the individual customers. This is unique to each customer. Values of this column repeat for customers who have made multiple purchases through Olist.
- Using 'customer_zip_code_prefix' as foreign key, corresponding to the 'geolocation_zip_code_prefix' primary key in the geolocation table
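
A foreign key is only meaningful if its values exist in the referenced table. As a hedged sketch, assuming small hypothetical samples in place of the real customers and geolocation data, the zip code prefixes with no match in the geolocation table could be listed like this (`orphaned_keys` is an illustrative helper, not part of the notebook):

```python
import pandas as pd

def orphaned_keys(child, child_col, parent, parent_col):
    """Return the distinct child values that have no match in the parent column."""
    missing = ~child[child_col].isin(parent[parent_col])
    return sorted(child.loc[missing, child_col].unique().tolist())

# Hypothetical samples; prefix 99999 has no row in the geolocation sample.
customers_sample = pd.DataFrame({
    "customer_id": ["c1", "c2", "c3"],
    "customer_zip_code_prefix": [1037, 1046, 99999],
})
geolocation_sample = pd.DataFrame({"geolocation_zip_code_prefix": [1037, 1046]})

orphans = orphaned_keys(customers_sample, "customer_zip_code_prefix",
                        geolocation_sample, "geolocation_zip_code_prefix")
print(orphans)
```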

Create table:
![title](images/create_customers_table_updated.JPG)

In [2]:
#Make primary key, reduce data duplication
customers_long = pd.read_csv('data/brazilian-ecommerce/olist_customers_dataset.csv')
print(customers_long.shape)
customers_short = customers_long.drop_duplicates(subset='customer_id', keep="first")
print(customers_short.shape)
print("The customer_id column is unique for each row. Can be used as primary key w/o any changes")

(99441, 5)
(99441, 5)
The customer_id column is unique for each row. Can be used as primary key w/o any changes


Insert data into customers table:
![title](images/insert_customers_data_updated.JPG)

#### sellers table

- Using 'seller_id' as the primary key for this table. 
- The 'seller_id' column is unique and non-null so it can be used as primary key w/o any changes (see below).
- Using 'seller_zip_code_prefix' as foreign key, corresponding to the 'geolocation_zip_code_prefix' primary key in the geolocation table

Create table:
![title](images/create_sellers_table.JPG)

In [3]:
sellers_long = pd.read_csv('data/brazilian-ecommerce/olist_sellers_dataset.csv')
print(sellers_long.shape)
sellers_short = sellers_long.drop_duplicates(subset='seller_id', keep="first")
print(sellers_short.shape)
sellers_short.to_csv('data/brazilian-ecommerce/sellers_short.csv', index=False)
print("The seller_id column is unique for each row. Can be used as primary key w/o any changes")

(3095, 4)
(3095, 4)
The seller_id column is unique for each row. Can be used as primary key w/o any changes


Insert data into sellers table:
![title](images/insert_sellers_data.JPG)

#### orders table

- Using 'order_id' as the primary key for this table.
- The 'order_id' column is unique and non-null so it can be used as primary key w/o any changes (see below).
- Using 'customer_id' as foreign key, corresponding to the 'customer_id' primary key in the customers table
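
The checks in this notebook compare row counts against unique-value counts, which tests uniqueness but not the non-null requirement. A small helper that tests both might look like the sketch below; `is_valid_primary_key` and the sample frames are hypothetical.

```python
import pandas as pd

def is_valid_primary_key(df, column):
    """True when the column has no missing values and no repeated values."""
    values = df[column]
    return bool(values.notna().all() and values.is_unique)

# Hypothetical samples: the second one has a repeated and a missing id.
orders_sample = pd.DataFrame({"order_id": ["o1", "o2", "o3"]})
reviews_sample = pd.DataFrame({"review_id": ["r1", "r1", None]})

print(is_valid_primary_key(orders_sample, "order_id"))
print(is_valid_primary_key(reviews_sample, "review_id"))
```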

Create table:
![title](images/create_orders_table.JPG)

In [4]:
# Can the 'order_id' column be used as the primary key for the 'orders' table? 
#(ie is 'order_id' a unique, non-null identifier for each row in this table?)
orders_dataset = pd.read_csv('data/brazilian-ecommerce/olist_orders_dataset.csv')
print(len(orders_dataset))
print(len(orders_dataset['order_id'].unique()))
print("The order_id column is unique for each row. Can be used as primary key w/o any changes")

99441
99441
The order_id column is unique for each row. Can be used as primary key w/o any changes


Insert data into orders table:
![title](images/insert_orders_data.JPG)

#### order_reviews table

- Using 'review_id' as the primary key for this table.
- The 'review_id' column is NOT unique. There are ~500 reviews that have duplicates. 
- The investigation of the 'review_id' duplicates (see below) indicates that the same review_id shows up for multiple orders of the same item, and that the rating and review text are identical across the rows that share a review_id. For now, I am going to drop the duplicates of each 'review_id' so that I can use it as a primary key, b/c mostly redundant information is lost. May revisit this if I want to do a more detailed analysis.
- Using 'order_id' as foreign key, corresponding to the 'order_id' primary key in the orders table

Create table:
![title](images/create_order_reviews_table.JPG)

In [5]:
# Can the 'review_id' column be used as the primary key for the 'order_reviews' table? 
#(ie is 'review_id' a unique, non-null identifier for each row in this table?)
order_reviews = pd.read_csv('data/brazilian-ecommerce/olist_order_reviews_dataset.csv')

print(len(order_reviews))
# The 'order_id' column is not unique for this dataset
print(len(order_reviews['order_id'].unique()))
# The 'review_id' column is not unique for this dataset
print(len(order_reviews['review_id'].unique()))

# Make compound primary key using 'order_id' and 'review_id'??? How will that work for connecting to other tables?
# Need to look at rows w/ duplicate 'review_id' and see what is actually happening here.
# Need to do investigation before creating this table

order_reviews_duplicated = order_reviews[order_reviews.duplicated('review_id', keep=False)]
order_reviews_duplicated.head()

100000
99441
99173


Unnamed: 0,review_id,order_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp
200,28642ce6250b94cc72bc85960aec6c62,e239d280236cdd3c40cb2c033f681d1c,5,,,2018-03-25 00:00:00,2018-03-25 21:03:02
346,a0a641414ff718ca079b3967ef5c2495,169d7e0fd71d624d306f132acd791cbe,5,,,2018-03-04 00:00:00,2018-03-06 20:12:53
348,f4d74b17cd63ee35efa82cd2567de911,f269e83a82f64baa3de97c2ebf3358f6,3,,"A embalagem deixou a desejar, por pouco o prod...",2018-01-12 00:00:00,2018-01-13 18:46:10
362,ecbaf1fce7d2c09bfab46f89065afeaf,2451b9756f310d4cff5c7987b393870d,5,,,2017-07-27 00:00:00,2017-07-28 16:57:18
395,6b1de94de0f4bd84dfc4136818242faa,92acf87839903a94aeca0e5040d99acb,5,,,2018-02-16 00:00:00,2018-02-19 19:04:21


In [6]:
duplicate_ex = order_reviews[order_reviews['review_id']=='289450935cf7a082af13e04160716ce5']
duplicate_ex

Unnamed: 0,review_id,order_id,review_score,review_comment_title,review_comment_message,review_creation_date,review_answer_timestamp
49710,289450935cf7a082af13e04160716ce5,2ef81da37176cfe633d519636052f2dd,5,,"produto chegou correto , recomendo",2018-03-25 00:00:00,2018-03-27 17:57:49
99428,289450935cf7a082af13e04160716ce5,4fe48790875f264fd93c9009892c3e39,5,,"produto chegou correto , recomendo",2018-03-25 00:00:00,2018-03-27 17:57:49


In [7]:
specific_dup_rev_1 = orders_dataset[orders_dataset['order_id']=='2ef81da37176cfe633d519636052f2dd']
specific_dup_rev_2 = orders_dataset[orders_dataset['order_id']=='4fe48790875f264fd93c9009892c3e39']

In [8]:
specific_dup_rev_1

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
13976,2ef81da37176cfe633d519636052f2dd,cb209afb7b4b522a2f22a76c0865aaa1,delivered,2018-03-18 15:50:21,2018-03-18 16:05:33,2018-03-21 16:33:19,2018-03-24 13:41:58,2018-04-03 00:00:00


In [9]:
specific_dup_rev_2

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
37793,4fe48790875f264fd93c9009892c3e39,52fcca01e852db365ea24614a227cbd0,delivered,2018-03-18 15:50:21,2018-03-18 16:05:30,2018-03-21 16:42:29,2018-03-24 13:38:27,2018-04-02 00:00:00


As we can see, the same review_id shows up for multiple orders of the same item. Duplicated review_ids can be safely removed w/o much loss of information (besides the fact that the customer placed multiple orders for the same item).
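
The example above inspects a single review_id by hand. A sketch of how the same consistency check could be run across all duplicated review_ids is shown below; the function name and the sample data are hypothetical, and only the column names come from the dataset.

```python
import pandas as pd

def inconsistent_review_ids(reviews, columns):
    """Return review_ids whose repeated rows disagree in any of the given columns."""
    # dropna=False counts a missing comment as a value, so NaN vs. text is a mismatch.
    distinct = reviews.groupby("review_id")[columns].nunique(dropna=False)
    return sorted(distinct[(distinct > 1).any(axis=1)].index.tolist())

# Hypothetical sample: r1 repeats consistently, r2 repeats with different scores.
reviews_sample = pd.DataFrame({
    "review_id": ["r1", "r1", "r2", "r2"],
    "order_id": ["o1", "o2", "o3", "o4"],
    "review_score": [5, 5, 4, 1],
    "review_comment_message": ["ok", "ok", None, None],
})

mismatched = inconsistent_review_ids(
    reviews_sample, ["review_score", "review_comment_message"]
)
print(mismatched)
```

If the list comes back empty on the real data, dropping duplicated review_ids loses only the link to the additional order_ids.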

In [10]:
order_reviews_short = order_reviews.drop_duplicates(subset='review_id', keep="first")
print(order_reviews_short.shape)
order_reviews_short.to_csv('data/brazilian-ecommerce/order_reviews_short.csv', index=False)

(99173, 7)


Insert data into order_reviews table:
![title](images/insert_order_reviews_data.JPG)

#### order_payments table

The repeats of the 'order_id' column are a result of the customer using multiple forms of payments on the order (see investigation below). <br>
Will create a new column, 'order_id_payment', that concatenates the 'order_id' and 'payment_sequential' columns.<br>
'order_id_payment' is a unique identifier for every row in this dataset. It will be used as the primary key.<br>
'order_id' will be used as a foreign key to connect to the primary key 'order_id' in the orders dataset.<br>
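
As a minimal sketch on hypothetical sample rows, the uniqueness of the ('order_id', 'payment_sequential') pair can be checked directly before building the concatenated key. The underscore separator used here is an assumption for readability; the notebook joins the two values directly, which is unambiguous as long as every order_id has the same length.

```python
import pandas as pd

# Hypothetical sample: order a1 was paid with two payment methods.
payments_sample = pd.DataFrame({
    "order_id": ["a1", "a1", "b2"],
    "payment_sequential": [1, 2, 1],
    "payment_type": ["credit_card", "voucher", "credit_card"],
})

# True if any (order_id, payment_sequential) pair occurs more than once.
pair_repeats = payments_sample.duplicated(
    subset=["order_id", "payment_sequential"]
).any()

# Concatenated key with an explicit separator (assumption, see the note above).
payments_sample["order_id_payment"] = (
    payments_sample["order_id"] + "_" + payments_sample["payment_sequential"].astype(str)
)

print(bool(pair_repeats), payments_sample["order_id_payment"].tolist())
```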

Create table:
![title](images/create_order_payments_table.JPG)

In [11]:
# Can the 'order_id' column be used as the primary key for the 'order_payments' table? 
#(ie is 'order_id' a unique, non-null identifier for each row in this table?)
order_payments = pd.read_csv('data/brazilian-ecommerce/olist_order_payments_dataset.csv')

print(len(order_payments))
# The 'order_id' column is not unique for this dataset
print(len(order_payments['order_id'].unique()))

# Answer: No, 'order_id' is not a unique identifier for this table.

103886
99440


In [12]:
#Investigation into duplicated order_id column:
order_payments_duplicated = order_payments[order_payments.duplicated('order_id', keep=False)]
order_payments_duplicated.head()

Unnamed: 0,order_id,payment_sequential,payment_type,payment_installments,payment_value
25,5cfd514482e22bc992e7693f0e3e8df7,2,voucher,1,45.17
35,b2bb080b6bc860118a246fd9b6fad6da,1,credit_card,1,173.84
75,3689194c14ad4e2e7361ebd1df0e77b0,2,voucher,1,57.53
84,723e462ce1ee50e024887c0b403130f3,1,credit_card,1,13.8
102,21b8b46679ea6482cbf911d960490048,2,voucher,1,43.12


In [13]:
duplicate_ex_1 = order_payments[order_payments['order_id']=='5cfd514482e22bc992e7693f0e3e8df7']
duplicate_ex_1

Unnamed: 0,order_id,payment_sequential,payment_type,payment_installments,payment_value
25,5cfd514482e22bc992e7693f0e3e8df7,2,voucher,1,45.17
57742,5cfd514482e22bc992e7693f0e3e8df7,1,credit_card,4,665.41


In [14]:
duplicate_ex_2 = order_payments[order_payments['order_id']=='b2bb080b6bc860118a246fd9b6fad6da']
duplicate_ex_2

Unnamed: 0,order_id,payment_sequential,payment_type,payment_installments,payment_value
35,b2bb080b6bc860118a246fd9b6fad6da,1,credit_card,1,173.84
13238,b2bb080b6bc860118a246fd9b6fad6da,2,voucher,1,24.08


In [15]:
# The repeats of the 'order_id' column are a result of the customer using multiple forms of payments on the order. 
# Will create a new column, 'order_id_payment', that concatenates 'order_id' and 'payment_sequential'.
# 'order_id_payment' is a unique identifier for every row in this dataset. It will be used as the primary key.
# 'order_id' will be used as a foreign key to connect to the primary key 'order_id' in the orders dataset.

order_payments['order_id_payment'] = order_payments['order_id'] + order_payments['payment_sequential'].astype(str)
order_payments.head()

Unnamed: 0,order_id,payment_sequential,payment_type,payment_installments,payment_value,order_id_payment
0,b81ef226f3fe1789b1e8b2acac839d17,1,credit_card,8,99.33,b81ef226f3fe1789b1e8b2acac839d171
1,a9810da82917af2d9aefd1278f1dcfa0,1,credit_card,1,24.39,a9810da82917af2d9aefd1278f1dcfa01
2,25e8ea4e93396b6fa0d3dd708e76c1bd,1,credit_card,1,65.71,25e8ea4e93396b6fa0d3dd708e76c1bd1
3,ba78997921bbcdc1373bb41e913ab953,1,credit_card,8,107.78,ba78997921bbcdc1373bb41e913ab9531
4,42fdf880ba16b47b59251dd489d4441a,1,credit_card,2,128.45,42fdf880ba16b47b59251dd489d4441a1


In [16]:
print(len(order_payments))
# The 'order_id_payment' column IS unique for this dataset
print(len(order_payments['order_id_payment'].unique()))

#Rearrange columns so that 'order_id_payment' displays first
new_column_order = ['order_id_payment', 'order_id', 'payment_sequential', 'payment_type', 'payment_installments', 'payment_value']
order_payments = order_payments[new_column_order]
order_payments.head(2)

103886
103886


Unnamed: 0,order_id_payment,order_id,payment_sequential,payment_type,payment_installments,payment_value
0,b81ef226f3fe1789b1e8b2acac839d171,b81ef226f3fe1789b1e8b2acac839d17,1,credit_card,8,99.33
1,a9810da82917af2d9aefd1278f1dcfa01,a9810da82917af2d9aefd1278f1dcfa0,1,credit_card,1,24.39


In [17]:
#Save this modified dataframe as csv file so that I can insert it into the order_payments table:
order_payments.to_csv('data/brazilian-ecommerce/order_payments_modified.csv', index=False)

Insert data into order_payments table:
![title](images/insert_order_payments_data.JPG)

#### order_items table

The repeats of the 'order_id' column are a result of the order having multiple order items (see investigation below). <br>
Will create a new column, 'order_id_item_id', that concatenates the 'order_id' and 'order_item_id' columns.<br>
'order_id_item_id' is a unique identifier for every row in this dataset. It will be used as the primary key.<br>
'order_id' will be used as a foreign key to connect to the primary key 'order_id' in the orders dataset.<br>
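
Rows in the order_items data are identified by the pair ('order_id', 'order_item_id'). As an alternative to a concatenated key column, SQLite also accepts a composite primary key declared over both columns. The sketch below illustrates that option on an in-memory database; the table name `order_items_sketch` and the sample rows are hypothetical, and this is not the approach used in this notebook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE order_items_sketch (
    order_id TEXT NOT NULL,
    order_item_id INTEGER NOT NULL,
    product_id TEXT,
    PRIMARY KEY (order_id, order_item_id)
)
""")

# Two items in the same order are allowed because the pair differs.
conn.executemany(
    "INSERT INTO order_items_sketch VALUES (?, ?, ?)",
    [("o1", 1, "p1"), ("o1", 2, "p1")],
)

# Re-inserting an existing (order_id, order_item_id) pair violates the key.
try:
    conn.execute("INSERT INTO order_items_sketch VALUES ('o1', 1, 'p9')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(duplicate_rejected)
```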

Create table:
![title](images/create_order_items_table_updated.JPG)

In [18]:
# Can the 'order_id' column be used as the primary key for the 'order_items' table? 
#(ie is 'order_id' a unique, non-null identifier for each row in this table?)
order_items = pd.read_csv('data/brazilian-ecommerce/olist_order_items_dataset.csv')

print(len(order_items))
# The 'order_id' column is not unique for this dataset
print(len(order_items['order_id'].unique()))

# Answer: No, 'order_id' is not a unique identifier for this table.

112650
98666


In [19]:
#Investigation into duplicated order_id column:
order_items_duplicated = order_items[order_items.duplicated('order_id', keep=False)]
order_items_duplicated.head()

Unnamed: 0,order_id,order_item_id,product_id,seller_id,shipping_limit_date,price,freight_value
13,0008288aa423d2a3f00fcb17cd7d8719,1,368c6c730842d78016ad823897a372db,1f50f920176fa81dab994f9023523100,2018-02-21 02:55:52,49.9,13.37
14,0008288aa423d2a3f00fcb17cd7d8719,2,368c6c730842d78016ad823897a372db,1f50f920176fa81dab994f9023523100,2018-02-21 02:55:52,49.9,13.37
32,00143d0f86d6fbd9f9b38ab440ac16f5,1,e95ee6822b66ac6058e2e4aff656071a,a17f621c590ea0fab3d5d883e1630ec6,2017-10-20 16:07:52,21.33,15.1
33,00143d0f86d6fbd9f9b38ab440ac16f5,2,e95ee6822b66ac6058e2e4aff656071a,a17f621c590ea0fab3d5d883e1630ec6,2017-10-20 16:07:52,21.33,15.1
34,00143d0f86d6fbd9f9b38ab440ac16f5,3,e95ee6822b66ac6058e2e4aff656071a,a17f621c590ea0fab3d5d883e1630ec6,2017-10-20 16:07:52,21.33,15.1


In [20]:
duplicate_ex_1 = order_items[order_items['order_id']=='0008288aa423d2a3f00fcb17cd7d8719']
duplicate_ex_1

Unnamed: 0,order_id,order_item_id,product_id,seller_id,shipping_limit_date,price,freight_value
13,0008288aa423d2a3f00fcb17cd7d8719,1,368c6c730842d78016ad823897a372db,1f50f920176fa81dab994f9023523100,2018-02-21 02:55:52,49.9,13.37
14,0008288aa423d2a3f00fcb17cd7d8719,2,368c6c730842d78016ad823897a372db,1f50f920176fa81dab994f9023523100,2018-02-21 02:55:52,49.9,13.37


In [21]:
# The repeats of the 'order_id' column are a result of the order having multiple order_items
# Will create a new column, 'order_id_item_id', that concatenates 'order_id' and 'order_item_id'.
# 'order_id_item_id' is a unique identifier for every row in this dataset. It will be used as the primary key.
# 'order_id' will be used as a foreign key to connect to the primary key 'order_id' in the orders dataset.

order_items['order_id_item_id'] = order_items['order_id'] + order_items['order_item_id'].astype(str)

print(len(order_items))
# The 'order_id_item_id' column IS unique for this dataset
print(len(order_items['order_id_item_id'].unique()))

#Rearrange columns so that 'order_id_item_id' displays first
new_column_order = ['order_id_item_id', 'order_id', 'order_item_id', 'product_id', 'seller_id', 'shipping_limit_date', 'price', 'freight_value']
order_items = order_items[new_column_order]

#Save this modified dataframe as csv file so that I can insert it into the order_items table:
order_items.to_csv('data/brazilian-ecommerce/order_items_modified.csv', index=False)

order_items.head(2)

112650
112650


Unnamed: 0,order_id_item_id,order_id,order_item_id,product_id,seller_id,shipping_limit_date,price,freight_value
0,00010242fe8c5a6d1ba2dd792cb162141,00010242fe8c5a6d1ba2dd792cb16214,1,4244733e06e7ecb4970a6e2683c13e61,48436dade18ac8b2bce089ec2a041202,2017-09-19 09:45:35,58.9,13.29
1,00018f77f2f0320c557190d7a144bdd31,00018f77f2f0320c557190d7a144bdd3,1,e5f2d52b802189ee658865ca93d83a8f,dd7ddc04e1b6c2c614352b383efe2d36,2017-05-03 11:05:13,239.9,19.93


Insert data into order_items table:
![title](images/insert_order_items_data.JPG)

Delete header row (header is placed as first row):
![title](images/delete_first_row_order_items.JPG)

#### products table

'product_id' is a unique identifier for every row in this dataset. It will be used as the primary key.<br>
'product_id' links to the foreign key 'product_id' field in the order_items table.<br>

Create table:
![title](images/create_products_table.JPG)

In [22]:
# Can the 'product_id' column be used as the primary key for the 'products' table? 
products = pd.read_csv('data/brazilian-ecommerce/olist_products_dataset.csv')

print(len(products))
# The 'product_id' column IS unique for this dataset
print(len(products['product_id'].unique()))

# Answer: Yes, 'product_id' IS a unique identifier for this table.

32951
32951


Insert data into products table:
![title](images/insert_products_data.JPG)

Delete header row (header is placed as first row):
![title](images/delete_header_products.JPG)

#### closed_deals table

'mql_id' is a unique identifier for every row in this dataset. It will be used as the primary key.<br>

Create table:
![title](images/create_products_table.JPG)

In [23]:
closed_deals = pd.read_csv('data/marketing-funnel-olist/olist_closed_deals_dataset.csv')

print(len(closed_deals))
print(len(closed_deals['mql_id'].unique()))

# Answer: Yes, 'mql_id' IS a unique identifier for this table.

842
842


Insert data into closed_deals table:
![title](images/insert_closed_deals_data.JPG)

#### marketing_leads table

'mql_id' is a unique identifier for every row in this dataset. It will be used as the primary key.<br>

I will be joining the marketing_leads and closed_deals tables on the 'mql_id' column (the primary key in each table). I can do this b/c it is a 1:1 relationship: each 'mql_id' has at most one counterpart in the other table (the row with the same 'mql_id' value). The primary key 'mql_id' in marketing_leads will also function as a foreign key.
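
In pandas, the 1:1 assumption can be enforced at join time with the `validate` argument of `merge`, which raises an error if 'mql_id' repeats on either side. The following is a small sketch on hypothetical sample rows rather than the real marketing funnel files:

```python
import pandas as pd

# Hypothetical samples; only two of the three leads became closed deals.
leads_sample = pd.DataFrame({
    "mql_id": ["m1", "m2", "m3"],
    "origin": ["social", "paid_search", "email"],
})
deals_sample = pd.DataFrame({
    "mql_id": ["m1", "m3"],
    "seller_id": ["s1", "s2"],
})

# validate="one_to_one" raises pandas.errors.MergeError if mql_id repeats on either side.
funnel = leads_sample.merge(deals_sample, on="mql_id", how="left", validate="one_to_one")

# Leads without a closed deal keep a missing seller_id after the left join.
closed = int(funnel["seller_id"].notna().sum())
print(len(funnel), closed)
```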

Create table:
![title](images/create_marketing_leads_table.JPG)

In [24]:
marketing_leads = pd.read_csv('data/marketing-funnel-olist/olist_marketing_qualified_leads_dataset.csv')
print(len(marketing_leads))
print(len(marketing_leads['mql_id'].unique()))

# Answer: Yes, 'mql_id' IS a unique identifier for this table.
#In this table, make 'mql_id' both primary key AND foreign key

8000
8000


Insert data into marketing_leads table:
![title](images/insert_marketing_leads_data.JPG)

Below is a schema I made using draw.io software to show the relationships between the tables in the database I created: 
![title](images/schema_rearranged.JPG)

## Part 2: Structural Issue in Dataset

The dataset structure (set by Olist) is overly normalized. There is important information that is missing as a result of the way that Olist structured their csv files: Orders can contain multiple items. These items can be from different sellers. 


<b>Issue 1</b>:
- Fields in the orders csv file ('order_delivered_carrier_date', 'order_delivered_customer_date', 'order_estimated_delivery_date') should be individual to each item, especially when the items in the order are from different sellers. 
- Based on the way Olist has structured the data, this information exists only once per order rather than for each order_item within the order. 
- Since this information cannot be recovered, I will proceed with the logistics analysis by only looking at orders whose order_items all come from one unique seller (orders can have multiple items as long as all of the items are from the same seller).

![title](images/issue_1.JPG)

<b>Issue 2</b>:
- There is no field that can be used as a connection/relationship between the 'order_items' csv and the 'order_reviews' csv. 
- For orders with multiple items, a customer may leave multiple reviews under that order (each review corresponding to a different item). 
- With the way the datasets are currently structured, it is impossible to tell which review corresponds to which item within an order. Since my analysis includes how the logistics operations impact reviews, this is a problem. In order to proceed with the analysis, I will only be looking at orders whose order_items are all one unique product (orders can have multiple items as long as all of the items are the same product and thus from the same seller).

![title](images/issue_2.JPG)


<b>Action</b>: 
- In order to deal with both of these issues and proceed with the logistics analysis, I will be filtering the order_items table I have created to only include orders that have only one unique product_id and only one unique seller_id. 95,430 (96%) of the orders in this dataset meet these requirements (see below). The same procedure will be used to filter the orders, order_reviews and order_payments tables. This way my analysis is consistently applied to the same 95,430 orders throughout the entire project.
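
The filter itself is implemented in SQL in the next cell. As a cross-check, the same rule can be sketched in pandas; the sample rows below are hypothetical and only the column names come from the order_items data.

```python
import pandas as pd

# Hypothetical sample: o1 = two units of one product, o2 = two products, o3 = one item.
items_sample = pd.DataFrame({
    "order_id":   ["o1", "o1", "o2", "o2", "o3"],
    "product_id": ["p1", "p1", "p1", "p2", "p3"],
    "seller_id":  ["s1", "s1", "s1", "s1", "s2"],
})

# Count distinct products and sellers per order.
per_order = items_sample.groupby("order_id")[["product_id", "seller_id"]].nunique()

# Keep orders with exactly one distinct product and one distinct seller.
kept_ids = per_order[(per_order["product_id"] == 1) & (per_order["seller_id"] == 1)].index

# Filter on order_id so every item row of a kept order survives.
items_filtered = items_sample[items_sample["order_id"].isin(kept_ids)]

print(sorted(kept_ids.tolist()), len(items_filtered))
```

Filtering by membership in `kept_ids`, rather than grouping the rows themselves, keeps one row per unit for orders that contain several units of the same product.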

In [25]:
import sqlite3
import pandas as pd

# Create modified table of order_items that filters the original order_items to only include orders that have only one unique product_id in the order. 

def run_query(q):
    with sqlite3.connect('data/ecommerce.db') as conn:
        return pd.read_sql_query(q,conn)

def run_command(c):
    with sqlite3.connect('data/ecommerce.db') as conn:
        conn.isolation_level=None
        conn.execute(c)    

# Create the modified tables and insert the relevant data

# order_items_modified
c1 = '''
CREATE TABLE 'order_items_modified' (
    'order_id_item_id' [TEXT] PRIMARY KEY,
    'order_id' [TEXT],
    'order_item_id' [INTEGER],
    'product_id' [TEXT],
    'seller_id' [TEXT],
    'shipping_limit_date' [TEXT],
    'price' [REAL],
    'freight_value' [REAL],
    FOREIGN KEY ('order_id')
        REFERENCES orders ('order_id'),
    FOREIGN KEY ('seller_id')
        REFERENCES sellers ('seller_id'),
    FOREIGN KEY ('product_id')
        REFERENCES products ('product_id')
    )
'''
#run_command(c1)

c2 = '''
INSERT INTO order_items_modified
    SELECT * 
    FROM order_items oi
    WHERE oi.order_id IN (
                      SELECT order_id
                      FROM order_items
                      GROUP BY order_id
                      HAVING COUNT(DISTINCT(product_id))=1 AND COUNT(DISTINCT(seller_id))=1
                      )
'''

#run_command(c2)

# orders_modified 
c3 = '''
CREATE TABLE 'orders_modified' (
    'order_id' [TEXT] PRIMARY KEY,
    'customer_id' [TEXT],
    'order_status' [TEXT],
    'order_purchase_timestamp' [TEXT],
    'order_approved_at' [TEXT],
    'order_delivered_carrier' [TEXT],
    'order_delivered_customer_date' [TEXT],
    'order_estimated_delivery_date' [TEXT],
    FOREIGN KEY ('customer_id')
        REFERENCES customers ('customer_id')
    )
'''
#run_command(c3)

c4 = '''
INSERT INTO orders_modified
    SELECT o.*
    FROM orders o
    INNER JOIN order_items_modified oim ON oim.order_id=o.order_id
    GROUP BY oim.order_id
'''

#run_command(c4)

# order_reviews_modified
c5 = '''
CREATE TABLE 'order_reviews_modified' (
    'review_id' [TEXT] PRIMARY KEY,
    'order_id' [TEXT],
    'review_score' [INTEGER],
    'review_comment_title' [TEXT],
    'review_comment_message' [TEXT],
    'review_creation_date' [TEXT],
    'review_answer_timestamp' [TEXT],
    FOREIGN KEY ('order_id')
        REFERENCES orders ('order_id')
    )
'''
#run_command(c5)

c6 = '''
INSERT INTO order_reviews_modified
    SELECT orev.*
    FROM orders_modified om
    LEFT JOIN order_reviews orev ON orev.order_id=om.order_id
    GROUP BY orev.order_id
    HAVING length(orev.order_id) > 0
'''

#run_command(c6)

# order_payments_modified
c7 = '''
CREATE TABLE 'order_payments_modified' (
    'order_id_payment' [TEXT] PRIMARY KEY,
    'order_id' [TEXT],
    'payment_sequential' [INTEGER],
    'payment_type' [TEXT],
    'payment_installments' [INTEGER],
    'payment_value' [REAL],
    FOREIGN KEY ('order_id')
        REFERENCES orders ('order_id')
    )
'''
#run_command(c7)

c8 = '''
INSERT INTO order_payments_modified
    SELECT op.*
    FROM orders_modified om
    LEFT JOIN order_payments op ON op.order_id=om.order_id
'''

#run_command(c8)

In [26]:
q0 = '''
SELECT
    COUNT(DISTINCT(order_id)) num_orders,
    (CAST(COUNT(DISTINCT(order_id)) as float)/(SELECT COUNT(DISTINCT(order_id)) FROM orders))*100 percent
FROM order_items_modified
'''

run_query(q0)

Unnamed: 0,num_orders,percent
0,95430,95.966452


In [27]:
## Summary of changes: 

q1a = '''
SELECT 
    COUNT(*) number_of_rows,
    COUNT(order_id) number_order_id,
    COUNT(DISTINCT order_id) number_unique_order_id
FROM orders
'''
print('Orders:')
print(run_query(q1a))

q1b = '''
SELECT 
    COUNT(*) number_of_rows,
    COUNT(order_id) number_order_id,
    COUNT(DISTINCT order_id) number_unique_order_id
FROM orders_modified
'''
print('Orders modified:')
print(run_query(q1b))

q2a = '''
SELECT 
    COUNT(*) number_of_rows,
    COUNT(order_id) number_order_id,
    COUNT(DISTINCT order_id) number_unique_order_id
FROM order_items
'''
print('Order items:')
print(run_query(q2a))

q2b = '''
SELECT 
    COUNT(*) number_of_rows,
    COUNT(order_id) number_order_id,
    COUNT(DISTINCT order_id) number_unique_order_id
FROM order_items_modified
'''
print('Order items modified:')
print(run_query(q2b))

q3a = '''
SELECT 
    COUNT(*) number_of_rows,
    COUNT(order_id) number_order_id,
    COUNT(DISTINCT order_id) number_unique_order_id
FROM order_reviews
'''
print('Order reviews:')
print(run_query(q3a))

q3b = '''
SELECT 
    COUNT(*) number_of_rows,
    COUNT(order_id) number_order_id,
    COUNT(DISTINCT order_id) number_unique_order_id
FROM order_reviews_modified
'''
print('Order reviews modified:')
print(run_query(q3b))

q4a = '''
SELECT 
    COUNT(*) number_of_rows,
    COUNT(order_id) number_order_id,
    COUNT(DISTINCT order_id) number_unique_order_id
FROM order_payments
'''
print('Order payments:')
print(run_query(q4a))

q4b = '''
SELECT 
    COUNT(*) number_of_rows,
    COUNT(order_id) number_order_id,
    COUNT(DISTINCT order_id) number_unique_order_id
FROM order_payments_modified
'''
print('Order payments modified:')
print(run_query(q4b))

Orders:
   number_of_rows  number_order_id  number_unique_order_id
0           99441            99441                   99441
Orders modified:
   number_of_rows  number_order_id  number_unique_order_id
0           95430            95430                   95430
Order items:
   number_of_rows  number_order_id  number_unique_order_id
0          112650           112650                   98666
Order items modified:
   number_of_rows  number_order_id  number_unique_order_id
0          104882           104882                   95430
Order reviews:
   number_of_rows  number_order_id  number_unique_order_id
0           99173            99173                   98926
Order reviews modified:
   number_of_rows  number_order_id  number_unique_order_id
0           94975            94975                   94975
Order payments:
   number_of_rows  number_order_id  number_unique_order_id
0          103886           103886                   99440
Order payments modified:
   number_of_rows  number_order_id

<b>Modified Tables Summary</b>:
    
The modified tables contain 95,430 orders instead of 99,441. The 95,430 orders are orders that are from one unique seller and are purchases of one unique product (can purchase multiple of that same product). 

Note that 455 of the orders do NOT contain an accompanying review. 1 of the orders doesn't have any payment information associated w/ it. 
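
The 455 figure follows from the summary above (95,430 orders minus 94,975 orders with a review). A direct way to count such orders is an anti-join. The sketch below uses an in-memory database with a few hypothetical rows and pared-down versions of the two tables, so it illustrates the query shape rather than reproducing the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders_modified (order_id TEXT PRIMARY KEY);
CREATE TABLE order_reviews_modified (review_id TEXT PRIMARY KEY, order_id TEXT);
INSERT INTO orders_modified VALUES ('o1'), ('o2'), ('o3');
INSERT INTO order_reviews_modified VALUES ('r1', 'o1'), ('r2', 'o3');
""")

# Anti-join: orders with no matching row in the reviews table.
missing_reviews = conn.execute("""
SELECT COUNT(*)
FROM orders_modified om
LEFT JOIN order_reviews_modified orm ON orm.order_id = om.order_id
WHERE orm.order_id IS NULL
""").fetchone()[0]

print(missing_reviews)
```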

Note for order_items_modified table: The filter is applied through a subquery on order_id (the outer query is not grouped), so every item row of a qualifying order is kept: 104,882 rows for 95,430 orders (see summary above). An order that is a purchase of multiple units of the same product from one seller therefore still has one row per unit, which preserves the information needed to tell single-unit orders from multi-unit orders and to compute total price and total freight per order.

## Part 3: Data Sourcing Issue

According to Olist, this dataset is a simple random sampling of orders that have reviews. In other words, orders that don't have a review were not considered for the random sampling. Note from above, however, that they must not have sampled exclusively from orders that had reviews b/c the original dataset had 515 orders that didn't have a review (0.5% of the total orders). The modified dataset had 455 orders that didn't have a review (0.5% of the total orders). 

Additionally, due to the data structuring issues (and resulting missing data) mentioned above, I am only analyzing orders that have items with one unique product_id from one unique seller_id, i.e. filtering to only include orders that have one item or multiple identical items from the same seller. Note, however, that this is by far the most common situation for the customers purchasing through Olist in this dataset: 95,430 (96%) of the orders in this dataset meet these requirements.

<b>Summary</b>:<br>
This dataset (and the resulting analysis) may not be representative of the actual entire Olist dataset because the analysis is only for orders that: 1.) Were sampled from orders with a review submitted (apart from the ~0.5% of orders without a review noted above) 2.) Have only one item or multiple identical items from the same seller.

## Part 4: Missing Values

<b> Need to go through systematically and deal w/ missing values! </b>

'table_audit_function' is a function I built to "audit" a table/dataframe created from joining multiple tables.
I will be using this to keep track of missing values in the tables I create.

For each stage in the analysis, need to do the following:
1. Identify whether there are missing values.
2. Determine whether the data is missing at random or whether there are reasons/factors for why it is missing.
3. Decide how to best handle the missing data: 
    1. Remove the missing data, or 
    2. Impute the missing data and, if so, decide in what way.
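
The 'table_audit_function' itself is not shown in this section. As a hypothetical stand-in for step 1, a per-column summary of missing values could be sketched as follows; the function name `audit_missing` and the sample frame are assumptions, not the author's implementation.

```python
import pandas as pd

def audit_missing(df):
    """Summarize missing values per column (hypothetical stand-in for 'table_audit_function')."""
    summary = pd.DataFrame({
        "missing_count": df.isna().sum(),
        "missing_percent": (df.isna().mean() * 100).round(2),
    })
    # Columns with the most missing values come first.
    return summary.sort_values("missing_count", ascending=False)

# Hypothetical joined table with gaps in the review and delivery columns.
joined_sample = pd.DataFrame({
    "order_id": ["o1", "o2", "o3", "o4"],
    "review_score": [5, None, 4, None],
    "order_delivered_customer_date": ["2018-03-24", None, "2018-03-25", "2018-03-26"],
})

audit = audit_missing(joined_sample)
print(audit)
```

Running a summary like this after each join makes it easier to see whether missing values were present in the source tables or introduced by the join itself.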