## Second Data Wrangling and Pre-processing
In the first notebook, we removed all of the data that could identify an individual purchaser. <br>
In this notebook, we'll eliminate some unnecessary columns and create some more important feature columns that we can then look at in more detail in the Exploratory Data Analysis.

## Goal: Eliminate unnecessary columns, create some obvious features, minimize NaN values, and separate into Items, Orders, and Customers DataFrames

In [None]:
import os
import pandas as pd
import numpy as np
import datetime
import pickle

In [None]:
# change to the path with the raw csv file data

# load the CSV file into a DataFrame
df = pd.read_csv("cust_pub4_pydata.csv")
# look at the first 10 rows of this file
df.head(10)

In [None]:
# let's drop all of the tax columns from this DF
df.drop(
    [
        "Tax 1 Name",
        "Tax 1 Value",
        "Tax 2 Name",
        "Tax 2 Value",
        "Tax 3 Name",
        "Tax 3 Value",
        "Tax 4 Name",
        "Tax 4 Value",
        "Tax 5 Name",
        "Tax 5 Value",
    ],
    axis=1,
    inplace=True,
)

In [None]:
df.info()

In [None]:
# we noticed from the first 10 rows that some of these values aren't filled.
# Let's forward fill, since the empty rows belong to the same order as the filled row above them
df["Paid at"] = df["Paid at"].ffill()

In [None]:
# we need to convert the "Paid at" column into datetime
# (infer_datetime_format is deprecated in pandas 2.x, where format inference is the default)
df["Paid at"] = pd.to_datetime(df["Paid at"])

In [None]:
df.head(10)

In [None]:
# let's drop some more useless columns
df.drop(["Taxes", "Notes", "Note Attributes", "Cancelled at"], axis=1, inplace=True)

In [None]:
df.info()

In [None]:
# Receipt Number is empty - drop that
# Fulfilled at is missing a lot of values - we are using 'Paid at' instead
# remove a few more columns that are too sparse to be useful in modeling
df.drop(
    ["Fulfilled at", "Receipt Number", "Location", "Device ID", "Id", "Risk Level"],
    axis=1,
    inplace=True,
)

In [None]:
# Let's see what currencies are used
df["Currency"].value_counts()

In [None]:
# it's just USD ($) or NaN. Not worth keeping that column
df.drop(["Currency"], axis=1, inplace=True)

In [None]:
# let's look at Paid at vs. Created at
df[["Paid at", "Created at"]].sample(10)

Those look identical except for a 1-2 second lag for the payment. I'm fine with dropping the "Paid at" column.

In [None]:
df.drop(["Paid at"], axis=1, inplace=True)

In [None]:
# since we are using 'Created at' as the time stamp, let's convert it to datetime (in UTC)
df["Created at"] = pd.to_datetime(df["Created at"], utc=True)
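
To see what `utc=True` does, here is a small sketch on made-up timestamps. The two strings below are hypothetical and only assume that the raw values carry a UTC offset; the real export format may differ.

```python
import pandas as pd

# Hypothetical timestamps with different UTC offsets (winter vs. summer time)
raw = pd.Series(["2019-03-01 08:15:00-08:00", "2019-07-01 08:15:00-07:00"])

# utc=True converts every value to UTC, so the column gets one tz-aware dtype
parsed = pd.to_datetime(raw, utc=True)
print(parsed)
```

Both values end up on the same UTC clock, which keeps later comparisons and sorting consistent.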

In [None]:
df.info()

### These look pretty good. Now, it's time to start filling in some of the NaN values

In [None]:
# For financial status
df["Financial Status"].value_counts()

In [None]:
# it looks like only the first line of an order has the Financial Status; we'll forward fill
df["Financial Status"] = df["Financial Status"].ffill(limit=25)

In [None]:
# same applies for Fulfillment Status
df["Fulfillment Status"] = df["Fulfillment Status"].ffill(limit=25)

In [None]:
# same is true for Accepts Marketing
df["Accepts Marketing"] = df["Accepts Marketing"].ffill(limit=25)
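
One thing to watch with a plain forward fill: if the first line of an order is itself missing a value, the fill carries over the value from the previous order. A possible alternative, sketched here on made-up data (the order names and statuses are hypothetical), is to forward fill within each order by grouping on "Name" first.

```python
import numpy as np
import pandas as pd

# Hypothetical data: order #1002 has no Financial Status on any of its lines
toy = pd.DataFrame(
    {
        "Name": ["#1001", "#1001", "#1002", "#1002"],
        "Financial Status": ["paid", np.nan, np.nan, np.nan],
    }
)

# Plain forward fill: "paid" leaks from #1001 into #1002
plain = toy["Financial Status"].ffill()

# Grouped forward fill: values only propagate inside the same order
grouped = toy.groupby("Name")["Financial Status"].ffill()

print(pd.DataFrame({"Name": toy["Name"], "plain": plain, "grouped": grouped}))
```

Whether this matters for the real data depends on how often an order's first line is missing the value, which we have not checked here.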

In [None]:
df["Tags"].value_counts()

In [None]:
# these look unnecessarily complicated, so we'll drop - or maybe not
# df.drop(['Tags'], axis=1, inplace=True)

In [None]:
df.info()

In [None]:
# let's look at Payment Reference
df["Payment Reference"].value_counts()

In [None]:
# let's drop it
df.drop(["Payment Reference"], axis=1, inplace=True)

In [None]:
# let's create one more feature that would be usable: total items in an order
df["ITEMS"] = df.groupby("Name")["Lineitem quantity"].transform("sum")
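
As a quick illustration of what `transform("sum")` does, here is a sketch on a made-up frame (the order names and quantities are hypothetical). Unlike `agg`, `transform` returns one value per original row, so the order total is repeated on every line item of that order.

```python
import pandas as pd

# Hypothetical line items: order #1001 has two lines, #1002 has one
toy = pd.DataFrame(
    {
        "Name": ["#1001", "#1001", "#1002"],
        "Lineitem quantity": [2, 1, 5],
    }
)

# Each row receives the total quantity of the order it belongs to
toy["ITEMS"] = toy.groupby("Name")["Lineitem quantity"].transform("sum")
print(toy)
```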

In [None]:
df["ITEMS"].unique()

In [None]:
# see how many unique "names" are in the DF
df["Name"].value_counts()

The number of unique names looks to be the same as the number of non-null "Subtotal" values and some other fields that are order specific.

In [None]:
# let's see if we can use the compare at price relative to the lineitem price as another feature
df["compared"] = (df["Lineitem compare at price"] - df["Lineitem price"]) / df["Lineitem price"]
# positive values mean the line item price is cheaper
# this relative price is more important than the absolute
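
A small worked example of this ratio, on made-up prices. It also shows one possible guard, which is an assumption on our part and not something applied to the real data above: a line item priced at 0 would otherwise divide by zero and produce `inf`, so the sketch treats a zero price as missing.

```python
import numpy as np
import pandas as pd

# Hypothetical prices: a discounted item, one with no compare-at price, and a free item
toy = pd.DataFrame(
    {
        "Lineitem price": [20.0, 30.0, 0.0],
        "Lineitem compare at price": [25.0, np.nan, 10.0],
    }
)

# Treat a zero price as missing so the ratio becomes NaN instead of inf
price = toy["Lineitem price"].replace(0, np.nan)
toy["compared"] = (toy["Lineitem compare at price"] - price) / price
print(toy)
```

For the first row, (25 - 20) / 20 = 0.25, meaning the compare-at price is 25% above the price actually charged.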

In [None]:
# let's convert this to a difference in price
df["Lineitem compare at price"] = df["Lineitem compare at price"] - df["Lineitem price"]

### Separate the Dataframe <br>
Right now, the items and the orders are each lines in the DataFrame; we are going to separate these out into 3 separate DataFrames:
### 1. Order - contains the order information
### 2. Items - line by line items contained in an order
### 3. Customers - contains the sum of the orders and items

In [None]:
# create Order DF by taking the first line of a name
Order = df.groupby("Name").first()

# or I could do it groupby Name and then take the value that has a subtotal that is not null
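
A detail worth knowing about `groupby(...).first()`: it returns the first non-null value in each column, not the first row of each group. The sketch below, on made-up data, contrasts it with `head(1)`, which keeps the literal first row.

```python
import numpy as np
import pandas as pd

# Hypothetical order: the first line has the subtotal but no sku
toy = pd.DataFrame(
    {
        "Name": ["#1001", "#1001"],
        "Subtotal": [50.0, np.nan],
        "Lineitem sku": [np.nan, "SKU-B"],
    }
)

# first(): first non-null value per column, possibly mixing rows
first_non_null = toy.groupby("Name").first()

# head(1): the literal first row of each group
first_row = toy.groupby("Name").head(1).set_index("Name")

print(first_non_null)
print(first_row)
```

With `first()`, the sku comes from the second line item while the subtotal comes from the first, so order-level fields are picked up even when they are not on the first row.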

In [None]:
Order.info()

In [None]:
# let's look at discount codes
Order["Discount Code"].value_counts()

The most popular discount codes are used often enough that they could provide some value, but even the most-used code appears on only 2% of all orders, and discount codes overall are used on 11% of orders. I think it's best to start with just the discount amount, which is already contained in another column, so we'll drop this column.

In [None]:
Order.drop(["Discount Code"], axis=1, inplace=True)

In [None]:
# for orders, it shouldn't matter whether a particular item is taxable, so we'll drop that along with the line item fulfillment status
Order.drop(["Lineitem taxable", "Lineitem fulfillment status"], axis=1, inplace=True)

In [None]:
Order.info()

In [None]:
# let's fill the missing payment method values with "Unknown"
Order["Payment Method"] = Order["Payment Method"].fillna("Unknown")

In [None]:
# let's look at Line item requires shipping
Order["Lineitem requires shipping"].value_counts()

That seems reasonable enough; let's keep that

In [None]:
Order["Lineitem sku"].isna().sum()

In [None]:
# let's drop some more unnecessary info; line item name should be covered by the sku
# where the sku is missing, fall back to the line item name before dropping it
Order["Lineitem sku"] = Order["Lineitem sku"].fillna(Order["Lineitem name"])
Order.drop(["Lineitem name"], axis=1, inplace=True)

In [None]:
Order.info()

In [None]:
# Customer ID should be an integer - but this gets weird, so we'll skip it.
# Order['Cust_ID'] = Order['Cust_ID'].astype('int')

In [None]:
# let's find out how this shipping method looks
Order["Shipping Method"].value_counts()

In [None]:
# let's fill the missing Shipping Method values with "Unknown"
Order["Shipping Method"] = Order["Shipping Method"].fillna("Unknown")

In [None]:
Order.head(10)

In [None]:
# based on some weird data, let's look at the source
Order.Source.value_counts()

shopify_draft_order may just be draft orders that were used to test the system and not actual orders

In [None]:
Order[Order["Source"] == "shopify_draft_order"]

These look weird and are probably just tests. I'm dropping them.

In [None]:
# take a copy so that later column assignments don't trigger SettingWithCopyWarning
Order = Order[Order["Source"] != "shopify_draft_order"].copy()

In [None]:
Order.info()

In [None]:
# accepts marketing is currently "yes" or "no"; it's much better if we consider them as 1 and 0 respectively
# then when we sum them up over a customer's orders, the total counts the orders where marketing was accepted
Order["Accepts Marketing"] = Order["Accepts Marketing"].replace({"yes": 1, "no": 0})
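
To make the payoff concrete, here is a sketch on made-up orders (the customer IDs are hypothetical). It uses `map` for the encoding, which is one possible variant; note that `map` turns any value other than "yes" or "no" into NaN.

```python
import pandas as pd

# Hypothetical orders: customer 1 accepted marketing on one of two orders
toy = pd.DataFrame(
    {
        "Cust_ID": [1, 1, 2],
        "Accepts Marketing": ["yes", "no", "no"],
    }
)

# Encode yes/no as 1/0 so the column can be summed
toy["Accepts Marketing"] = toy["Accepts Marketing"].map({"yes": 1, "no": 0})

# sum = number of orders with marketing accepted, first = status on the first order
per_customer = toy.groupby("Cust_ID")["Accepts Marketing"].agg(["sum", "first"])
print(per_customer)
```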

### I think that wraps it up for the Order DF

### On to the Items DF that contains all of the line items in the orders

In [None]:
# every row in the dataframe represents a line item, so we'll keep them in
Items = df.copy()

### That takes care of the Items DF

### Still have to work on the Customer DF

In [None]:
Order["Cust_ID"].value_counts()

Let's separate the customers based on these value counts

In [None]:
# Order[Order['Cust_ID'] == -2147483648]
# this ID showed up in 60k orders when we changed these from float to integer. I have no idea why
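
One possible explanation, offered as a guess that we have not verified on this data: -2147483648 is the smallest 32-bit signed integer, and missing (NaN) float values can turn into that number when they are cast to a fixed-width integer. If that is what happened, pandas' nullable `Int64` dtype is one way to get integer IDs while keeping the missing values missing. The IDs below are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical customer IDs stored as floats because one is missing
ids = pd.Series([101.0, np.nan, 205.0])

# The nullable integer dtype keeps the missing value as <NA>
nullable = ids.astype("Int64")
print(nullable)
```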

In [None]:
# Create customer DF by aggregating the orders DF over the Customer ID
# 'Accepts Marketing': 'mode', 'Shipping Method': 'mode', 'Payment Method': 'mode',
Cust = Order.groupby("Cust_ID", as_index=False).agg(
    {
        "Total": ["sum", "mean", "first"],
        "Fulfillment Status": "count",
        "Subtotal": "sum",
        "Shipping": "sum",
        "Refunded Amount": "sum",
        "Accepts Marketing": ["sum", "first"],
        "ITEMS": ["sum", "mean", "first"],
        "Created at": ["first", "last"],
        "Server": "first",
        "Discount Amount": "sum",
        "Vendor": "first",
        "Employee": "first",
        "Source": "first",
        "ship_bill": "first",
        "Area_Code": "first",
        "Shipping Zip": "first",
        "Lineitem sku": "first",
    }
)

In [None]:
Cust.info()

In [None]:
# this is exciting; let's look at the first 10 rows
Cust.head(10)

In [None]:
# multi-indexing can be a pain. I will reduce this to a single index
col = [
    "Cust_ID",
    "Life_Total",
    "Avg_Order",
    "first_total",
    "Orders",
    "Sub_Total",
    "Ship_Total",
    "Refund_Total",
    "Marketing_lf",
    "Marketing_first",
    "Total_Items",
    "Avg_Items",
    "first_items",
    "first_order",
    "last_order",
    "server",
    "Disc_Total",
    "Vendor",
    "Emp",
    "Source",
    "ship_bill",
    "Area_Code",
    "Ship_Zip",
    "lead_sku",
]
Cust.columns = col
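
Typing the names by hand works, but the list has to stay in exactly the same order as the aggregation dictionary. As an alternative sketch (not what we did above, and the generated names differ from the hand-written ones), the two column levels can be joined programmatically. The toy frame here is made up.

```python
import pandas as pd

# Hypothetical orders for two customers
toy = pd.DataFrame({"Cust_ID": [1, 1, 2], "Total": [10.0, 30.0, 5.0]})

# Same pattern as above: a list of aggregations gives two-level columns
agg = toy.groupby("Cust_ID", as_index=False).agg({"Total": ["sum", "mean"]})

# Join the non-empty parts of each (column, aggregation) pair
agg.columns = ["_".join(part for part in col if part) for col in agg.columns]
print(agg)
```

The trade-off is less control over the final names in exchange for not having to keep two lists in sync.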

In [None]:
# let's look at the first 10 rows again, now with the flattened column names
Cust.head(10)

In [None]:
Cust["Orders"].value_counts()

In [None]:
Order.groupby("Cust_ID")["Created at"].min()

In [None]:
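# flag each customer's first order: its timestamp equals that customer's earliest "Created at"
Order["1st"] = Order["Created at"] == Order.groupby("Cust_ID")["Created at"].transform("min")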
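
A first-order flag can be built by comparing each order's timestamp with the earliest timestamp for the same customer. Here is a minimal sketch on made-up orders (IDs and dates are hypothetical); note that comparing a column with itself would simply give True everywhere.

```python
import pandas as pd

# Hypothetical orders: customer 1 has two orders, listed out of date order
toy = pd.DataFrame(
    {
        "Cust_ID": [1, 1, 2],
        "Created at": pd.to_datetime(
            ["2020-01-05", "2020-01-01", "2020-02-01"], utc=True
        ),
    }
)

# Earliest order time per customer, repeated on every row of that customer
earliest = toy.groupby("Cust_ID")["Created at"].transform("min")

# True only on the row that is the customer's first order
toy["1st"] = toy["Created at"] == earliest
print(toy)
```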

In [None]:
# Order['first_6mon'] = Order['']

In [None]:
# sort the orders within each customer by creation time
cust2 = Order.groupby("Cust_ID", as_index=False).apply(lambda g: g.sort_values("Created at"))

I think that does it for data wrangling. Let's set the timezones and export the data so that we can do EDA in the next notebook.

## Set timezones

In [None]:
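def set_timezone(df, date_cols):
    # convert every listed column to tz-aware datetimes in US/Pacific
    for date_col in date_cols:
        df[date_col] = pd.to_datetime(df[date_col], utc=True)
        df[date_col] = df[date_col].dt.tz_convert("US/Pacific")
    # return only after all columns have been converted
    return df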


Cust = set_timezone(Cust, ["first_order", "last_order"])
Items = set_timezone(Items, ["Created at"])
Order = set_timezone(Order, ["Created at"])
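
A quick check on a toy frame that every listed column gets converted. This sketch redefines the helper so that it is self-contained, with the `return` placed after the loop; the timestamps are made up.

```python
import pandas as pd


def set_timezone(df, date_cols):
    # Parse each column as UTC, then convert to US/Pacific
    for date_col in date_cols:
        df[date_col] = pd.to_datetime(df[date_col], utc=True).dt.tz_convert("US/Pacific")
    # Returning after the loop ensures all columns are processed
    return df


# Hypothetical customer with a winter first order and a summer last order
toy = pd.DataFrame(
    {
        "first_order": ["2020-01-01 12:00:00+00:00"],
        "last_order": ["2020-07-01 12:00:00+00:00"],
    }
)
toy = set_timezone(toy, ["first_order", "last_order"])
print(toy)
```

Noon UTC becomes 04:00 in January (UTC-8) and 05:00 in July (UTC-7), so daylight saving time is handled by the conversion.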

In [None]:
Order.to_csv("../order.csv")

In [None]:
Items.to_csv("../items.csv")

In [None]:
Cust.to_csv("../cust.csv")

See you in the EDA