In [1]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
%matplotlib inline 
import seaborn as sns 

In [2]:
df = pd.read_excel('../data/gamezone-orders-data.xlsx')
df.head()

Unnamed: 0,USER_ID,ORDER_ID,PURCHASE_TS,SHIP_TS,PRODUCT_NAME,PRODUCT_ID,USD_PRICE,PURCHASE_PLATFORM,MARKETING_CHANNEL,ACCOUNT_CREATION_METHOD,COUNTRY_CODE
0,2c06175e,0001328c3c220830,2020-12-24 00:00:00,2020-12-13,Nintendo Switch,e682,168.0,website,affiliate,unknown,US
1,ee8e5bc2,0002af7a5c6100772,2020-10-01 00:00:00,2020-09-21,Nintendo Switch,e682,160.61,website,direct,desktop,DE
2,9eb4efe0,0002b8350e167074,2020-04-21 00:00:00,2020-02-16,Nintendo Switch,8d0d,151.2,website,direct,desktop,US
3,cac7cbaf,0006d06b98385729,2020-04-07 00:00:00,2020-04-04,Sony PlayStation 5 Bundle,54ed,1132.82,website,direct,desktop,AU
4,6b0230bc,00097279a2f46150,2020-11-24 00:00:00,2020-08-02,Nintendo Switch,8d0d,33.89,website,direct,desktop,TR


## Data Cleaning Plan

For this project, I'll be following the **CLEAN** approach to prepare the dataset for analysis:

- **Conceptualize the Data:** Understand what each feature represents and identify key columns.
- **Locate Solvable Problems:** Find and fix obvious issues like missing values, duplicates, and inconsistent formatting.
- **Evaluate Unsolvable Issues:** Address more complex problems such as outliers and unresolved missing data.
- **Augment the Data:** Create new features if they add value to the analysis.
- **Note and Document:** Record all cleaning steps and decisions for transparency.

This structured process will help ensure the data is reliable and ready for analysis.

## 1. Conceptualize

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21864 entries, 0 to 21863
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   USER_ID                  21864 non-null  object        
 1   ORDER_ID                 21864 non-null  object        
 2   PURCHASE_TS              21864 non-null  object        
 3   SHIP_TS                  21864 non-null  datetime64[ns]
 4   PRODUCT_NAME             21864 non-null  object        
 5   PRODUCT_ID               21864 non-null  object        
 6   USD_PRICE                21859 non-null  float64       
 7   PURCHASE_PLATFORM        21864 non-null  object        
 8   MARKETING_CHANNEL        21781 non-null  object        
 9   ACCOUNT_CREATION_METHOD  21781 non-null  object        
 10  COUNTRY_CODE             21826 non-null  object        
dtypes: datetime64[ns](1), float64(1), object(9)
memory usage: 1.8+ MB


- We have around 22k orders.

- There is only one numeric feature, `USD_PRICE`, which is also our key metric.
- Other useful features include `PURCHASE_TS`, `SHIP_TS`, `PRODUCT_NAME`, `MARKETING_CHANNEL`, and `COUNTRY_CODE`.

- Possible type conversions: `PURCHASE_TS` to datetime.


## 2. Locate issues

### `USER_ID`

In [4]:
len(df['USER_ID'].unique()) / len(df) * 100

90.79308452250274

In [5]:
df['USER_ID'].duplicated().sum()

np.int64(2013)

In [6]:
2013/21864

0.09206915477497256

The number of distinct user IDs is about 91% of the row count; the remaining 9% (2,013 rows) are repeat orders from users who already appear in the dataset.

### `ORDER_ID`

In [7]:
len(df['ORDER_ID'].unique())

21719

In [8]:
df['ORDER_ID'].duplicated().sum()

np.int64(145)

We have 145 duplicate orders. We need to investigate this because there shouldn't be any duplicate orders in the data.

In [9]:
df[df['ORDER_ID'].duplicated() == True]['PRODUCT_NAME'].value_counts()

PRODUCT_NAME
Nintendo Switch              98
27in 4K gaming monitor       37
Sony PlayStation 5 Bundle    10
Name: count, dtype: int64

In [10]:
df[df['ORDER_ID'].duplicated() == True].head()

Unnamed: 0,USER_ID,ORDER_ID,PURCHASE_TS,SHIP_TS,PRODUCT_NAME,PRODUCT_ID,USD_PRICE,PURCHASE_PLATFORM,MARKETING_CHANNEL,ACCOUNT_CREATION_METHOD,COUNTRY_CODE
9379,b66cdb8d,7a5f67e18fa77291,2020-01-27 00:00:00,2020-01-28,27in 4K gaming monitor,e7e6,480.0,website,direct,desktop,US
9564,6270d6f9,7d09de332e342684,2020-01-27 00:00:00,2020-01-30,27in 4K gaming monitor,891b,332.2,website,direct,desktop,BR
9922,e80b93ad,815caec5eb998020,2020-01-30 00:00:00,2020-01-31,27in 4K gaming monitor,891b,408.2,website,email,desktop,DE
10052,3838f9e6,833086c869925765,2020-01-24 00:00:00,2020-01-25,27in 4K gaming monitor,891b,317.14,website,direct,desktop,JP
10142,fd7dd923,844a97334cd107082,2020-01-22 00:00:00,2020-01-25,27in 4K gaming monitor,891b,480.0,website,direct,desktop,US


Let's check the first duplicate.

In [11]:
df[df['ORDER_ID'] == '7a5f67e18fa77291']

Unnamed: 0,USER_ID,ORDER_ID,PURCHASE_TS,SHIP_TS,PRODUCT_NAME,PRODUCT_ID,USD_PRICE,PURCHASE_PLATFORM,MARKETING_CHANNEL,ACCOUNT_CREATION_METHOD,COUNTRY_CODE
9378,b66cdb8d,7a5f67e18fa77291,2020-01-27 00:00:00,2020-01-28,27in 4K gaming monitor,e7e6,480.0,website,direct,desktop,US
9379,b66cdb8d,7a5f67e18fa77291,2020-01-27 00:00:00,2020-01-28,27in 4K gaming monitor,e7e6,480.0,website,direct,desktop,US


Yes, these are exact copies of each other.

I'm removing these duplicates now.
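Only the first duplicate was inspected above, so before dropping on `ORDER_ID` alone it may be worth confirming that every repeated order is a full-row copy. The sketch below shows one way to do that on toy data; the values are hypothetical and only the column names mirror the real dataset.

```python
import pandas as pd

# Toy data with hypothetical values; column names mirror the real dataset.
orders = pd.DataFrame({
    "ORDER_ID": ["a1", "a1", "b2", "c3", "c3"],
    "PRODUCT_NAME": ["Monitor", "Monitor", "Switch", "Mouse", "Mouse"],
    "USD_PRICE": [480.0, 480.0, 168.0, 49.98, 45.00],
})

# Rows whose ORDER_ID already appeared earlier in the frame.
dup_by_id = orders["ORDER_ID"].duplicated().sum()

# Rows that repeat an earlier row in every column.
dup_full_row = orders.duplicated().sum()

# After removing full-row copies, any ORDER_ID that still repeats
# carries conflicting values and needs a manual look.
conflicting_ids = (
    orders.drop_duplicates()
    .loc[lambda d: d["ORDER_ID"].duplicated(keep=False), "ORDER_ID"]
    .unique()
)

print(dup_by_id, dup_full_row, list(conflicting_ids))
```

If the two counts match on the real data, dropping by `ORDER_ID` and dropping full-row duplicates are equivalent.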

In [12]:
df.drop_duplicates(subset='ORDER_ID', keep='first', inplace=True)

In [13]:
df.shape

(21719, 11)

### `PURCHASE_TS`

In [14]:
df['PURCHASE_TS'].isna().sum()

np.int64(0)

In [15]:
# converting it to datetime
df['PURCHASE_TS'] = pd.to_datetime(df['PURCHASE_TS'], errors='coerce')

In [16]:
df['PURCHASE_TS'].isna().sum()

np.int64(5)

Five values could not be parsed and were coerced to `NaT`. Let's check them.

In [17]:
df[df.PURCHASE_TS.isna()]

Unnamed: 0,USER_ID,ORDER_ID,PURCHASE_TS,SHIP_TS,PRODUCT_NAME,PRODUCT_ID,USD_PRICE,PURCHASE_PLATFORM,MARKETING_CHANNEL,ACCOUNT_CREATION_METHOD,COUNTRY_CODE
1047,a5298a4d,0dda212aaea69940,NaT,2019-07-08,JBL Quantum 100 Gaming Headset,ab0f,21.96,website,direct,desktop,FR
5846,a81bb521,4cd9ab100d971208,NaT,2021-01-11,Nintendo Switch,8d0d,120.26,website,direct,desktop,IE
11853,2fa9f33d,99d824517da22388,NaT,2019-04-11,JBL Quantum 100 Gaming Headset,ab0f,21.19,website,direct,mobile,JP
16163,b313cea5,c9e0aea0d9a75871,NaT,2019-05-18,JBL Quantum 100 Gaming Headset,ab0f,19.2,website,direct,desktop,US
20725,67f8050b,f4de38506b644875,NaT,2019-01-17,JBL Quantum 100 Gaming Headset,ab0f,25.69,website,direct,desktop,GB
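Because the conversion overwrote the column in place, the original strings that failed to parse are no longer visible. A minimal sketch of how they could be kept for inspection, using hypothetical raw values (the real unparseable strings are not shown in this notebook):

```python
import pandas as pd

# Hypothetical raw values standing in for the PURCHASE_TS column.
raw = pd.Series(["2020-12-24", "2020-10-01", "not a date", "", "2020-04-21"])

# Parse into a separate Series so the raw values stay available.
parsed = pd.to_datetime(raw, errors="coerce")

# Values that were present in the raw column but could not be parsed.
failed = raw[parsed.isna() & raw.notna()]
print(failed.tolist())
```

Looking at the failed strings often shows whether they are typos that can be repaired or truly unusable.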


In [18]:
print("Start date: ", df['PURCHASE_TS'].min())
print("End date: ", df['PURCHASE_TS'].max())

Start date:  2019-01-01 00:00:00
End date:  2021-02-28 00:00:00


### `SHIP_TS`

In [19]:
df['SHIP_TS'].isna().sum()

np.int64(0)

In [20]:
print("Start: ", min(df['SHIP_TS']))
print("End: ", max(df['SHIP_TS']))

Start:  2018-10-18 00:00:00
End:  2021-11-16 00:00:00


### `PRODUCT_NAME`

In [27]:
df['PRODUCT_NAME'].value_counts()

PRODUCT_NAME
Nintendo Switch                   10288
27in 4K gaming monitor             4686
JBL Quantum 100 Gaming Headset     4296
Sony PlayStation 5 Bundle           967
Dell Gaming Mouse                   719
Lenovo IdeaPad Gaming 3             669
Acer Nitro V Gaming Laptop           87
Razer Pro Gaming Headset              7
Name: count, dtype: int64

There are 8 unique products in this dataset, and almost half of the orders are for the Nintendo Switch.

The raw data had a spelling inconsistency, `27in` vs `27inches`; the counts above were produced after the fix below had been applied.

In [26]:
df['PRODUCT_NAME'] = df['PRODUCT_NAME'].replace('27inches 4k gaming monitor', '27in 4K gaming monitor')

### `USD_PRICE`

In [30]:
df['USD_PRICE'].isna().sum()

np.int64(5)

In [31]:
df[df['USD_PRICE'].isna() == True][['PRODUCT_NAME', 'PURCHASE_TS', 'SHIP_TS', 'COUNTRY_CODE']]

Unnamed: 0,PRODUCT_NAME,PURCHASE_TS,SHIP_TS,COUNTRY_CODE
1190,Dell Gaming Mouse,2020-08-20,2020-05-26,GH
13282,JBL Quantum 100 Gaming Headset,2019-12-25,2019-12-28,KE
14189,Dell Gaming Mouse,2020-09-01,2020-09-04,KE
20044,Dell Gaming Mouse,2020-07-11,2020-07-12,VE
20227,Dell Gaming Mouse,2021-01-08,2021-01-10,BO


Four of the five rows with a missing price are Dell Gaming Mouse orders. One way to deal with this is to fill each with the average Dell Gaming Mouse price from around the time of purchase.

In [34]:
df[(df['PRODUCT_NAME'] == 'Dell Gaming Mouse') & (df['PURCHASE_TS'].dt.year == 2020)]['USD_PRICE'].mean()

np.float64(50.690429338103755)

However, only 5 of roughly 22k rows are missing a price, so I'm leaving them unfilled for now.

In [35]:
df['USD_PRICE'].describe()

count    21714.000000
mean       281.085203
std        366.177372
min          0.000000
25%        126.000000
50%        168.000000
75%        356.495000
max       3146.880000
Name: USD_PRICE, dtype: float64

In [None]:
# $0 or negative values
df[df['USD_PRICE'] <= 0]

Unnamed: 0,USER_ID,ORDER_ID,PURCHASE_TS,SHIP_TS,PRODUCT_NAME,PRODUCT_ID,USD_PRICE,PURCHASE_PLATFORM,MARKETING_CHANNEL,ACCOUNT_CREATION_METHOD,COUNTRY_CODE
2191,a701bdf9,1ceddd6a12170762,2020-06-20,2020-06-21,27in 4K gaming monitor,7f86,0.0,website,direct,desktop,US
2505,85ac39f0,2146feba6e756862,2019-04-30,2019-05-02,27in 4K gaming monitor,7f86,0.0,website,direct,desktop,US
2680,5fd53171,23940db170e41722,2019-05-04,2019-05-06,JBL Quantum 100 Gaming Headset,f5ca,0.0,website,affiliate,unknown,US
4564,555eca81,3be225ada5637556,2019-05-24,2019-05-25,27in 4K gaming monitor,7599,0.0,website,direct,desktop,CA
5276,efe56e44,45a51a63bcd101353,2019-09-26,2019-09-29,JBL Quantum 100 Gaming Headset,f5ca,0.0,website,direct,desktop,GB
5277,efe56e44,45a51a63bcd101354,2019-09-26,2019-09-29,JBL Quantum 100 Gaming Headset,f5ca,0.0,website,direct,desktop,GB
5889,6af2bb7f,4d7c9611b7146120,2020-03-09,2020-03-12,JBL Quantum 100 Gaming Headset,f5ca,0.0,website,direct,desktop,US
5903,640abe84,4dbbfcd41ca43316,2020-11-29,2020-12-02,Dell Gaming Mouse,640d,0.0,website,direct,desktop,US
6028,eec7e8f6,4fbc71f4344100865,2020-04-03,2020-04-06,JBL Quantum 100 Gaming Headset,f5ca,0.0,website,direct,desktop,US
6245,29bb865f,524641383d819880,2020-02-17,2020-02-20,JBL Quantum 100 Gaming Headset,f5ca,0.0,website,affiliate,unknown,NL


In [40]:
df[df['USD_PRICE'] <= 0]['PRODUCT_NAME'].value_counts()

PRODUCT_NAME
JBL Quantum 100 Gaming Headset    20
27in 4K gaming monitor             8
Dell Gaming Mouse                  1
Name: count, dtype: int64

There are 29 rows with a $0.00 price.<br>
I'm keeping these rows. However, a $0 price can be treated just like a missing value, so if we want to fill them, we can use the average price.

In [None]:
valid_prices = df[df['USD_PRICE'] > 0]

# cheapest product in the dataset
valid_prices[valid_prices['USD_PRICE'] == valid_prices['USD_PRICE'].min()][['PRODUCT_NAME', 'USD_PRICE', 'PURCHASE_TS']]

Unnamed: 0,PRODUCT_NAME,USD_PRICE,PURCHASE_TS
728,JBL Quantum 100 Gaming Headset,6.11,2020-09-26
9693,JBL Quantum 100 Gaming Headset,6.11,2020-12-09
19068,JBL Quantum 100 Gaming Headset,6.11,2020-12-02


In [43]:
# most expensive product
valid_prices[valid_prices['USD_PRICE'] == valid_prices['USD_PRICE'].max()][['PRODUCT_NAME', 'USD_PRICE', 'PURCHASE_TS']]

Unnamed: 0,PRODUCT_NAME,USD_PRICE,PURCHASE_TS
19196,Sony PlayStation 5 Bundle,3146.88,2019-05-19


### `PURCHASE_PLATFORM`

In [45]:
df['PURCHASE_PLATFORM'].isna().sum()

np.int64(0)

In [46]:
df['PURCHASE_PLATFORM'].value_counts()

PURCHASE_PLATFORM
website       19642
mobile app     2077
Name: count, dtype: int64

### `MARKETING_CHANNEL`

In [47]:
df['MARKETING_CHANNEL'].isna().sum()

np.int64(83)

In [48]:
df['MARKETING_CHANNEL'].value_counts()

MARKETING_CHANNEL
direct          17316
email            3240
affiliate         714
social media      320
unknown            46
Name: count, dtype: int64

Notice that there is already an `unknown` category; we could fill the missing values with it for analysis.

### `ACCOUNT_CREATION_METHOD`

In [49]:
df['ACCOUNT_CREATION_METHOD'].isna().sum()

np.int64(83)

In [50]:
df['ACCOUNT_CREATION_METHOD'].value_counts()

ACCOUNT_CREATION_METHOD
desktop    16331
mobile      4225
unknown      735
tablet       320
tv            25
Name: count, dtype: int64

In [51]:
df[df['ACCOUNT_CREATION_METHOD'].isna() == True]

Unnamed: 0,USER_ID,ORDER_ID,PURCHASE_TS,SHIP_TS,PRODUCT_NAME,PRODUCT_ID,USD_PRICE,PURCHASE_PLATFORM,MARKETING_CHANNEL,ACCOUNT_CREATION_METHOD,COUNTRY_CODE
243,da197ff9,033d64725cc92138,2020-08-17,2020-06-11,JBL Quantum 100 Gaming Headset,2997,24.00,mobile app,,,US
563,398d3631,079158acb8d26288,2020-07-29,2020-04-27,Nintendo Switch,e682,168.00,website,,,US
783,e278a2a8,0a7be61380495673,2020-12-11,2020-10-09,Dell Gaming Mouse,f81e,49.98,mobile app,,,US
1495,5a79012a,1377b7d717d39556,2020-08-13,2020-06-09,Nintendo Switch,8d0d,155.08,website,,,JP
1642,5fe4e093,1567779706d41744,2020-09-29,2020-05-31,Nintendo Switch,e682,177.11,website,,,GB
...,...,...,...,...,...,...,...,...,...,...,...
19854,ab2de8ae,ed40a9451a772581,2020-05-23,2020-05-25,27in 4K gaming monitor,891b,312.92,website,,,US
19929,8714e62e,ee0fe031fb557445,2020-06-21,2020-06-23,Nintendo Switch,8d0d,168.00,website,,,HK
20114,1172b6b9,ef855ac0b6c10146,2020-05-16,2020-05-19,Nintendo Switch,8d0d,146.59,website,,,IE
20277,cc830ec3,f0e9b1810f786376,2020-05-30,2020-06-01,Nintendo Switch,e682,168.00,website,,,US


The missing values in `ACCOUNT_CREATION_METHOD` and `MARKETING_CHANNEL` occur in the same orders.

Both columns already have an `unknown` category, so these missing values can be filled with it.
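The observation that both columns are missing in the same rows comes from reading the truncated table above. A quick programmatic check is sketched below on toy data with hypothetical values.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values; column names mirror the real dataset.
toy = pd.DataFrame({
    "MARKETING_CHANNEL": ["direct", np.nan, "email", np.nan],
    "ACCOUNT_CREATION_METHOD": ["desktop", np.nan, "mobile", np.nan],
})

# True only if the two columns are missing in exactly the same rows.
same_rows_missing = (
    toy["MARKETING_CHANNEL"].isna() == toy["ACCOUNT_CREATION_METHOD"].isna()
).all()
print(same_rows_missing)

# Fill both columns with the existing 'unknown' category.
cols = ["MARKETING_CHANNEL", "ACCOUNT_CREATION_METHOD"]
toy[cols] = toy[cols].fillna("unknown")
```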

### `COUNTRY_CODE`

In [57]:
df['COUNTRY_CODE'].isna().sum()

np.int64(38)

In [58]:
df[df['COUNTRY_CODE'].isna() == True]

Unnamed: 0,USER_ID,ORDER_ID,PURCHASE_TS,SHIP_TS,PRODUCT_NAME,PRODUCT_ID,USD_PRICE,PURCHASE_PLATFORM,MARKETING_CHANNEL,ACCOUNT_CREATION_METHOD,COUNTRY_CODE
526,6af1d816,06ee8b82fbc46119,2019-04-12,2018-12-12,JBL Quantum 100 Gaming Headset,8315,24.3,mobile app,affiliate,unknown,
671,2ad6743f,08feac8f0a020345,2020-04-16,2019-12-07,JBL Quantum 100 Gaming Headset,8315,22.98,mobile app,affiliate,unknown,
1043,7a4a13ce,0dc92d0562552247,2020-05-13,2020-02-23,Nintendo Switch,8e5d,157.42,mobile app,affiliate,unknown,
3585,9cef5a34,2fa5682923166358,2020-04-22,2020-04-25,Nintendo Switch,8e5d,161.02,mobile app,direct,desktop,
4083,e5d4f232,360891064a397089,2020-09-27,2020-09-29,Nintendo Switch,8e5d,165.3,mobile app,affiliate,unknown,
4084,e5d4f232,360891064a397090,2020-09-27,2020-09-29,Nintendo Switch,8e5d,165.3,mobile app,affiliate,unknown,
4875,4552ac90,4054d07e48c31128,2020-07-27,2020-07-30,Nintendo Switch,8e5d,163.04,mobile app,affiliate,unknown,
5144,b26a797c,43ce0b4a8fe75579,2019-03-30,2019-04-01,Nintendo Switch,b5f7,85.11,mobile app,affiliate,unknown,
6124,42edad8a,50e43de8ab930156,2020-12-19,2020-12-22,JBL Quantum 100 Gaming Headset,4c58,24.5,mobile app,affiliate,unknown,
6125,42edad8a,50e43de8ab930157,2020-12-19,2020-12-22,JBL Quantum 100 Gaming Headset,4c58,24.5,mobile app,affiliate,unknown,


Most rows with a missing country code also have an `unknown` account creation method.
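To put a number on that observation rather than reading it off the first ten rows, the share of each account creation method among rows with a missing country code can be computed directly. The sketch below uses toy data with hypothetical values.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values; column names mirror the real dataset.
toy = pd.DataFrame({
    "COUNTRY_CODE": ["US", np.nan, np.nan, "DE", np.nan],
    "ACCOUNT_CREATION_METHOD": ["desktop", "unknown", "unknown", "mobile", "desktop"],
})

# Restrict to rows without a country code, then compute category shares.
missing_country = toy[toy["COUNTRY_CODE"].isna()]
share = missing_country["ACCOUNT_CREATION_METHOD"].value_counts(normalize=True)
print(share)
```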

In [59]:
df['COUNTRY_CODE'].value_counts()

COUNTRY_CODE
US    10231
GB     1794
CA      946
AU      889
DE      845
      ...  
RE        1
MZ        1
MH        1
MD        1
LC        1
Name: count, Length: 150, dtype: int64

### Now that we have seen each column, let's try to fix some of the issues that we found.

| issue                                     | feature                 | magnitude | solvable |
| ----------------------------------------- | ----------------------- | --------- | -------- |
| Missing purchase dates                    | PURCHASE_TS             | 5         | N        | 
| missing prices                            | USD_PRICE               | 5         | Maybe    | 
| $0 price                                  | USD_PRICE               | 29        | Maybe    | 
| missing marketing channels                | MARKETING_CHANNEL       | 83        | Y        | 
| missing account creation methods          | ACCOUNT_CREATION_METHOD | 83        | Y        | 
| missing country code                      | COUNTRY_CODE            | 38        | N        |

<hr>
Out of these issues, the following can be solved right now:

1. Missing prices and $0 prices: fill each with the mean price of that product in the same year and month when available, falling back to the product's mean for that year and then its overall mean; otherwise leave the value as it is.
2. Missing marketing channels and account creation methods: these occur in the same orders (rows), so fill both with the `unknown` category.

The rest of the issues can't be solved with available data.

In [60]:
import pandas as pd
import numpy as np

def get_mean_price(row, df):
    # Only fill if price is missing or <= 0
    if pd.isna(row['USD_PRICE']) or row['USD_PRICE'] <= 0:
        product = row['PRODUCT_NAME']
        year = row['PURCHASE_TS'].year if not pd.isna(row['PURCHASE_TS']) else None
        month = row['PURCHASE_TS'].month if not pd.isna(row['PURCHASE_TS']) else None

        # Try mean for product, year, month
        mask = (
            (df['PRODUCT_NAME'] == product) &
            (df['PURCHASE_TS'].dt.year == year) &
            (df['PURCHASE_TS'].dt.month == month) &
            (df['USD_PRICE'] > 0)
        )
        mean_price = df.loc[mask, 'USD_PRICE'].mean()

        # If not found, try mean for product, year
        if np.isnan(mean_price):
            mask = (
                (df['PRODUCT_NAME'] == product) &
                (df['PURCHASE_TS'].dt.year == year) &
                (df['USD_PRICE'] > 0)
            )
            mean_price = df.loc[mask, 'USD_PRICE'].mean()

        # If not found, try global mean for product
        if np.isnan(mean_price):
            mask = (
                (df['PRODUCT_NAME'] == product) &
                (df['USD_PRICE'] > 0)
            )
            mean_price = df.loc[mask, 'USD_PRICE'].mean()

        # If still not found, keep as is (NaN or 0)
        return mean_price if not np.isnan(mean_price) else row['USD_PRICE']
    else:
        return row['USD_PRICE']
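The row-wise function above rebuilds a boolean mask over the whole frame for every row it fills. As an alternative, the same fallback chain can be sketched with `groupby(...).transform("mean")`. This is not the method used in this notebook: `fill_prices` is a hypothetical helper and the toy values are made up. One assumption to flag is that rows with a missing `PURCHASE_TS` depend on how the installed pandas version treats null group keys, so those rows should be checked separately.

```python
import numpy as np
import pandas as pd

def fill_prices(orders: pd.DataFrame) -> pd.Series:
    """Fill missing or non-positive prices with progressively coarser product means."""
    # Treat zero and negative prices as missing so they never enter a mean.
    price = orders["USD_PRICE"].where(orders["USD_PRICE"] > 0)
    product = orders["PRODUCT_NAME"]
    year = orders["PURCHASE_TS"].dt.year.rename("year")
    month = orders["PURCHASE_TS"].dt.month.rename("month")

    # Group means broadcast back to row level.
    by_month = price.groupby([product, year, month]).transform("mean")
    by_year = price.groupby([product, year]).transform("mean")
    by_product = price.groupby(product).transform("mean")

    filled = price.fillna(by_month).fillna(by_year).fillna(by_product)
    # Keep the original value when no fallback mean exists.
    return filled.fillna(orders["USD_PRICE"])

# Toy data with hypothetical values; column names mirror the real dataset.
toy = pd.DataFrame({
    "PRODUCT_NAME": ["Mouse", "Mouse", "Mouse", "Headset", "Headset"],
    "PURCHASE_TS": pd.to_datetime(
        ["2020-08-01", "2020-08-15", "2020-08-20", "2019-12-01", "2020-03-09"]
    ),
    "USD_PRICE": [50.0, 52.0, np.nan, 20.0, 0.0],
})
toy["USD_PRICE"] = fill_prices(toy)
print(toy)
```

In the toy frame, the missing mouse price is filled from the same month, while the $0 headset has no valid price in its month or year and falls back to the product-level mean.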

- Filling missing prices or prices less than or equal to $0:

In [61]:
# Apply the function to fill missing or zero prices
df['USD_PRICE'] = df.apply(lambda row: get_mean_price(row, df), axis=1)

In [62]:
df['USD_PRICE'].isna().sum()

np.int64(0)

In [63]:
df[df['USD_PRICE'] <= 0]

Unnamed: 0,USER_ID,ORDER_ID,PURCHASE_TS,SHIP_TS,PRODUCT_NAME,PRODUCT_ID,USD_PRICE,PURCHASE_PLATFORM,MARKETING_CHANNEL,ACCOUNT_CREATION_METHOD,COUNTRY_CODE


- Filling missing account creation method and marketing channel with 'unknown':

In [66]:
df['ACCOUNT_CREATION_METHOD'] = df['ACCOUNT_CREATION_METHOD'].fillna('unknown')
df['MARKETING_CHANNEL'] = df['MARKETING_CHANNEL'].fillna('unknown')

In [67]:
df[['ACCOUNT_CREATION_METHOD', 'MARKETING_CHANNEL']].isna().sum()

ACCOUNT_CREATION_METHOD    0
MARKETING_CHANNEL          0
dtype: int64

- Checking if there are any inconsistencies in purchase and ship dates (their difference):

In [73]:
len(df[df['SHIP_TS'] <= df['PURCHASE_TS']])

2002

There are 2,002 orders where the ship date is on or before the purchase date.

In [81]:
inconsistent_order_dates = df[df['SHIP_TS'] <= df['PURCHASE_TS']]
(inconsistent_order_dates['SHIP_TS'] - inconsistent_order_dates['PURCHASE_TS']).mean()

Timedelta('-75 days +07:42:16.109890110')

On average, these 2,002 orders were shipped about 75 days before they were ordered, which points to a data entry error.
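The `<=` filter groups same-day shipments together with orders shipped before purchase. Same-day shipments may be legitimate, so it can help to count the two cases separately. A minimal sketch on toy data with hypothetical dates:

```python
import pandas as pd

# Toy data with hypothetical dates; column names mirror the real dataset.
toy = pd.DataFrame({
    "PURCHASE_TS": pd.to_datetime(
        ["2020-12-24", "2020-04-07", "2020-01-27", "2020-06-20", None]
    ),
    "SHIP_TS": pd.to_datetime(
        ["2020-12-13", "2020-04-04", "2020-01-28", "2020-06-20", "2021-01-11"]
    ),
})

# Whole days between purchase and shipment; NaN where the purchase date is missing.
lag_days = (toy["SHIP_TS"] - toy["PURCHASE_TS"]).dt.days

summary = pd.Series({
    "shipped_before_purchase": (lag_days < 0).sum(),
    "shipped_same_day": (lag_days == 0).sum(),
    "shipped_after_purchase": (lag_days > 0).sum(),
    "purchase_date_missing": lag_days.isna().sum(),
})
print(summary)
```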

## 3. Evaluate unsolvable issues

The following issues are still remaining:

| issue                                    | feature                 | magnitude |
| ---------------------------------------- | ----------------------- | --------- |
| Missing purchase dates                   | PURCHASE_TS             | 5         |
| missing country code                     | COUNTRY_CODE            | 38        |
| inconsistent shipping and purchase dates | PURCHASE_TS AND SHIP_TS | 2002      |

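We have no way of fixing these issues because there is no source to fill them from, so we can either keep the affected rows or remove them. The missing purchase dates and country codes are negligible in magnitude and can be kept without worry. The date inconsistency affects about 9% of orders, so those rows stay in `df`, but any analysis that depends on shipping time should exclude them.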


In [68]:
df.isna().sum()

USER_ID                     0
ORDER_ID                    0
PURCHASE_TS                 5
SHIP_TS                     0
PRODUCT_NAME                0
PRODUCT_ID                  0
USD_PRICE                   0
PURCHASE_PLATFORM           0
MARKETING_CHANNEL           0
ACCOUNT_CREATION_METHOD     0
COUNTRY_CODE               38
dtype: int64

Creating a separate df with only valid dates (ship date strictly after purchase date); this also drops the 5 rows with a missing purchase date:

In [83]:
valid_df = df[df['SHIP_TS'] > df['PURCHASE_TS']]

In [84]:
valid_df.shape

(19712, 11)

## 4. Augment the data