**Things to note in cleaned Online Retail dataset:**

3. Some values in 'Country' column are 'Unspecified'
2. Some values in 'CustomerID' column are empty
3. Currency in GBP
4. In the 'StockCode' column, some codes do not follow the definition in the dataset description. Their meanings are:
{'D': discount; 'M': temporary listings that are added manually; 'POST': postage; 'DCGSXXXX': normal listings that are coded differently; 'gift_0001_XX': gift card worth XX (in GBP); 'C2': service to carry a parcel upstairs}
5. In the 'InvoiceNo' column, values starting with 'C' are cancelled orders. A cancellation gets its own invoice number, which differs from the one generated when the item was first purchased
6. Column datatypes:
- {['Invoice Date', 'Invoice Time']: date / time objects (object dtype in pandas); ['UnitPrice', 'CustomerID', 'TotalPrice']: float; ['Quantity']: int; ['InvoiceNo', 'StockCode', 'Description', 'Country']: str}
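The special stock codes and the 'C' prefix described in notes 4 and 5 can be turned into small lookup helpers. The sketch below is illustrative only: `classify_stock_code`, `is_cancellation`, and the category labels are hypothetical names, not part of the notebook.

```python
import re

# Special codes taken from the notes above; the category labels are illustrative.
SPECIAL_CODES = {
    "D": "discount",
    "M": "manual listing",
    "POST": "postage",
    "C2": "carry parcel upstairs",
}


def classify_stock_code(code) -> str:
    """Return a coarse category for a StockCode value (hypothetical helper)."""
    code = str(code).strip()
    if code in SPECIAL_CODES:
        return SPECIAL_CODES[code]
    if code.startswith("DCGS"):
        return "normal listing (coded differently)"
    if re.fullmatch(r"gift_0001_\d+", code):
        return "gift card"
    return "normal listing"


def is_cancellation(invoice_no) -> bool:
    """Invoice numbers starting with 'C' mark cancelled orders."""
    return str(invoice_no).startswith("C")
```

With pandas, such helpers could be applied through `df['StockCode'].map(classify_stock_code)`.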

# **1. Mount, load, inspect CSV**

In [None]:
import pandas as pd

Mounted at /content/drive


In [None]:
# Load online retail csv

online_retail = pd.read_csv('online_retail.csv')

In [None]:
# dataset inspection

online_retail.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


# **2. Separate date and time in 'InvoiceDate' column**

In [None]:
# Separate Date and Time in InvoiceDate column

# Convert into datetime format
online_retail['InvoiceDate'] = pd.to_datetime(online_retail['InvoiceDate'])

# Separate columns
online_retail['Invoice Date'] = online_retail['InvoiceDate'].dt.date
online_retail['Invoice Time'] = online_retail['InvoiceDate'].dt.time

# Drop original 'InvoiceDate' column
online_retail.drop('InvoiceDate', axis=1, inplace=True)

online_retail.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,UnitPrice,CustomerID,Country,Invoice Date,Invoice Time
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2.55,17850.0,United Kingdom,2010-12-01,08:26:00
1,536365,71053,WHITE METAL LANTERN,6,3.39,17850.0,United Kingdom,2010-12-01,08:26:00
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2.75,17850.0,United Kingdom,2010-12-01,08:26:00
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,3.39,17850.0,United Kingdom,2010-12-01,08:26:00
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,3.39,17850.0,United Kingdom,2010-12-01,08:26:00
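A small sketch of what the split produces, using a toy frame in place of the real data. `.dt.date` and `.dt.time` return Python `date` and `time` objects, so pandas reports both new columns as `object`. The `InvoiceDay` column is an assumption added for comparison: it is not used in this notebook, but `.dt.normalize()` keeps a `datetime64` column if that is preferred downstream.

```python
import pandas as pd

# Toy frame standing in for the real data
df = pd.DataFrame({"InvoiceDate": ["2010-12-01 08:26:00", "2010-12-01 08:28:00"]})
df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])

# Python date / time objects, stored with object dtype
df["Invoice Date"] = df["InvoiceDate"].dt.date
df["Invoice Time"] = df["InvoiceDate"].dt.time

# Alternative (assumption): timestamps truncated to midnight, still datetime64
df["InvoiceDay"] = df["InvoiceDate"].dt.normalize()

print(df.dtypes)
```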


# **3. Standardize 'Country' column**

- Drop CustomerID == 15108: this customer's country is 'European Community', which is too generic to use
- Rename Country == 'EIRE' to 'Ireland' and 'RSA' to 'South Africa'

*Note: There are rows with country 'Unspecified', but all their other fields look normal*

In [None]:
# Drop customer 15108
online_retail.drop(online_retail[online_retail['CustomerID'] == 15108].index, inplace=True)

# Change EIRE to Ireland, RSA to South Africa
online_retail['Country'] = online_retail['Country'].replace({'EIRE': 'Ireland', 'RSA': 'South Africa'})

In [None]:
# Sanity check
online_retail['Country'].unique()

array(['United Kingdom', 'France', 'Australia', 'Netherlands', 'Germany',
       'Norway', 'Ireland', 'Switzerland', 'Spain', 'Poland', 'Portugal',
       'Italy', 'Belgium', 'Lithuania', 'Japan', 'Iceland',
       'Channel Islands', 'Denmark', 'Cyprus', 'Sweden', 'Austria',
       'Israel', 'Finland', 'Bahrain', 'Greece', 'Hong Kong', 'Singapore',
       'Lebanon', 'United Arab Emirates', 'Saudi Arabia',
       'Czech Republic', 'Canada', 'Unspecified', 'Brazil', 'USA',
       'Malta', 'South Africa'], dtype=object)
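Beyond eyeballing `unique()`, the check can be made explicit. The following is a minimal sketch on toy rows (the data is made up for illustration): it confirms that the old labels are gone and counts the rows still marked 'Unspecified'.

```python
import pandas as pd

# Toy rows; the real notebook works on online_retail
df = pd.DataFrame({"Country": ["United Kingdom", "EIRE", "RSA", "Unspecified", "EIRE"]})
df["Country"] = df["Country"].replace({"EIRE": "Ireland", "RSA": "South Africa"})

# None of the old labels should remain after the replacement
leftover = df["Country"].isin(["EIRE", "RSA"])
assert not leftover.any()

# Rows that keep the 'Unspecified' placeholder
unspecified_rows = int((df["Country"] == "Unspecified").sum())
country_counts = df["Country"].value_counts()
print(unspecified_rows)
print(country_counts)
```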

# **4. Drop abnormal data points and duplicates**

- Delete rows whose invoice number does not start with 'C' but whose quantity is negative
- Delete rows where CustomerID is empty and unit price == 0
- Drop rows with unit price == -11062.06. These are bad debt not tagged to any CustomerID
- Drop duplicate rows

In [None]:
# Delete rows with invoice no. not starting with C but qty is negative
condition_1 = ((online_retail['InvoiceNo'].str[0] != 'C') & (online_retail['Quantity'] < 0))

# Filter the DataFrame based on the condition
online_retail_1 = online_retail[~condition_1]

In [None]:
# Delete rows with CustomerID empty and unit price == 0
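# UnitPrice is numeric, so compare against the number 0 (comparing with the string '0' matches nothing)
condition_2 = (online_retail_1['CustomerID'].isna()) & (online_retail_1['UnitPrice'] == 0)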

# Filter the DataFrame based on the condition
online_retail_2 = online_retail_1[~condition_2]

In [None]:
# Drop rows with unit price == -11062.06
indices_to_drop = online_retail_2[online_retail_2['UnitPrice'] == -11062.06].index
online_retail_filtered = online_retail_2.drop(indices_to_drop)

In [None]:
# Drop duplicate rows
online_retail_filtered = online_retail_filtered.drop_duplicates()
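The row filters above can also be expressed as boolean masks combined in one pass, which makes it easy to report how many rows each step removes. This is a sketch on made-up rows, not the notebook's actual data. Note that the price comparison uses the number `0`: comparing a float column with the string `'0'` matches no rows.

```python
import pandas as pd

# Toy rows (illustrative values only)
df = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "537032", "536414", "536365"],
    "Quantity": [6, -1, -30, 56, 6],
    "UnitPrice": [2.55, 27.50, 0.0, 0.0, 2.55],
    "CustomerID": [17850.0, 14527.0, None, None, 17850.0],
})

is_cancel = df["InvoiceNo"].astype(str).str.startswith("C")

# Negative quantity without a cancellation prefix
negative_non_cancel = ~is_cancel & (df["Quantity"] < 0)

# No customer and zero unit price (numeric comparison)
free_without_customer = df["CustomerID"].isna() & (df["UnitPrice"] == 0)

to_drop = negative_non_cancel | free_without_customer

# .copy() gives an independent frame, so later column assignments do not warn
cleaned = df[~to_drop].drop_duplicates().copy()

print(len(df), "->", len(cleaned))
```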

# **5. Add 'TotalPrice' column**

TotalPrice = Quantity * UnitPrice

In [None]:
# Work on an explicit copy so the assignment below does not raise SettingWithCopyWarning
online_retail_filtered = online_retail_filtered.copy()
online_retail_filtered['TotalPrice'] = online_retail_filtered['Quantity'] * online_retail_filtered['UnitPrice']
online_retail_filtered.head()


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,UnitPrice,CustomerID,Country,Invoice Date,Invoice Time,TotalPrice
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2.55,17850.0,United Kingdom,2010-12-01,08:26:00,15.3
1,536365,71053,WHITE METAL LANTERN,6,3.39,17850.0,United Kingdom,2010-12-01,08:26:00,20.34
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2.75,17850.0,United Kingdom,2010-12-01,08:26:00,22.0
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,3.39,17850.0,United Kingdom,2010-12-01,08:26:00,20.34
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,3.39,17850.0,United Kingdom,2010-12-01,08:26:00,20.34
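Because UnitPrice is a float, the product can carry small binary representation errors. The sketch below recomputes the first rows of the table above and rounds to two decimals. Rounding to pence is an assumption here, not something the notebook does.

```python
import pandas as pd

# Quantities and unit prices copied from the first rows shown above
df = pd.DataFrame({"Quantity": [6, 6, 8], "UnitPrice": [2.55, 3.39, 2.75]})

# TotalPrice = Quantity * UnitPrice, rounded to pence (assumption)
df["TotalPrice"] = (df["Quantity"] * df["UnitPrice"]).round(2)

print(df)
```

Cancelled orders have negative quantities, so their TotalPrice comes out negative under the same formula.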


# **6. Clean 'StockCode' column**

- Drop rows where StockCode == 'AMAZONFEE' (these are stray cash flows with no assigned CustomerID)
- Drop rows where StockCode == 'B' (these are bad debt, likewise not assigned to any customer)
- Drop rows where StockCode == 'BANK CHARGES' (these are cash flows ranging from -1000 to 20, mostly not assigned to any customer and covering only a small number of rows)
- Drop rows where StockCode == 'CRUK' (these are credit card commission charges on seller/Amazon in the UK)
- Drop rows where StockCode == 'S'. These are refunds for samples that cannot be traced back to the original orders because the invoice numbers differ. There are 63 entries, with prices mostly ranging from -100 to 0.
- Drop rows where StockCode == 'PADS'. There are only 4 entries, with unit prices ranging from 0 to 0.001, so keeping them adds no value.
- Drop rows where InvoiceNo == 540699 (the product description says postage, but unit price is 0, quantity is 1000, and no customer is linked)
- Drop rows where InvoiceNo == 564761 or 564762. These are faulty entries recording the sale of a gift card with no value and no customer attached.
- Change StockCode == 'DOT' to StockCode == 'POST' and Description to 'POSTAGE'. Both denote postage cost.

In [None]:
# Apply the first 6 bullet points

values_to_drop = ['AMAZONFEE', 'B', 'BANK CHARGES', 'CRUK', 'S', 'PADS']

# Drop rows
online_retail_filtered = online_retail_filtered[~online_retail_filtered['StockCode'].isin(values_to_drop)]

In [None]:
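# Apply bullet points 7 & 8

# InvoiceNo holds strings (cancellations start with 'C'), so compare against strings
values_to_drop = ['540699', '564761', '564762']

# Drop rows
online_retail_filtered = online_retail_filtered[~online_retail_filtered['InvoiceNo'].astype(str).isin(values_to_drop)]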

In [None]:
# Apply the last bullet point

online_retail_filtered['StockCode'] = online_retail_filtered['StockCode'].replace('DOT', 'POST')

online_retail_filtered['Description'] = online_retail_filtered['Description'].replace('DOTCOM POSTAGE', 'POSTAGE')
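As a variation on the cell above, the description can be rewritten only on rows whose StockCode is 'DOT', so the change does not depend on the exact description text. This is a sketch on toy rows with illustrative values. It also shows invoice numbers being compared as strings, on the assumption that the column holds text.

```python
import pandas as pd

# Toy rows (illustrative values only)
df = pd.DataFrame({
    "InvoiceNo": ["540699", "536544", "536370"],
    "StockCode": ["POST", "DOT", "POST"],
    "Description": ["POSTAGE", "DOTCOM POSTAGE", "POSTAGE"],
})

# Compare invoice numbers as strings; integer values would not match text entries
df = df[~df["InvoiceNo"].astype(str).isin(["540699"])].copy()

# Rewrite code and description together, only where the code is 'DOT'
is_dot = df["StockCode"] == "DOT"
df.loc[is_dot, "StockCode"] = "POST"
df.loc[is_dot, "Description"] = "POSTAGE"

print(df)
```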

# **7. Check and unify data types**

In [None]:
online_retail_filtered.dtypes
# 'Invoice Date' and 'Invoice Time' hold Python date and time objects, so pandas reports them as object

Unnamed: 0,0
InvoiceNo,object
StockCode,object
Description,object
Quantity,int64
UnitPrice,float64
CustomerID,float64
Country,object
Invoice Date,object
Invoice Time,object
TotalPrice,float64


In [None]:
# Ensure object columns are strings

# List of columns to change
columns_to_convert = ['InvoiceNo', 'StockCode', 'Description', 'Country']

# Convert specified columns to strings
online_retail_filtered[columns_to_convert] = online_retail_filtered[columns_to_convert].astype(str)
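The expected types can also be checked programmatically rather than by reading the `dtypes` listing. A minimal sketch on toy rows follows. The names `expected_dtypes`, `dtype_mismatches`, and `non_string_columns` are hypothetical, and the expected types come from the notes at the top of this notebook.

```python
import pandas as pd

# Toy rows (illustrative values only)
df = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379"],
    "StockCode": ["85123A", "D"],
    "Quantity": [6, -1],
    "UnitPrice": [2.55, 27.5],
    "CustomerID": [17850.0, 14527.0],
})

expected_dtypes = {"Quantity": "int64", "UnitPrice": "float64", "CustomerID": "float64"}
string_columns = ["InvoiceNo", "StockCode"]

# Numeric columns whose dtype differs from the expected one
dtype_mismatches = {
    col: str(df[col].dtype)
    for col, want in expected_dtypes.items()
    if str(df[col].dtype) != want
}

# Text columns containing at least one non-string value
non_string_columns = [
    col for col in string_columns
    if not df[col].map(lambda v: isinstance(v, str)).all()
]

print(dtype_mismatches, non_string_columns)
```

One caveat: `astype(str)` turns missing values into the literal string `'nan'`, so a check like this passes even when a text column had gaps before conversion.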

# **8. Export cleaned data to csv**

In [None]:
# Run this to export the CSV to Google Drive
online_retail_filtered.to_csv('/content/drive/My Drive/DSA3101/online_retail_clean.csv', index=False)
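A CSV file stores no type information, so the column types set above have to be restored when the file is read back. The sketch below uses an in-memory buffer in place of the Drive path and made-up rows. The `dtype` and `parse_dates` arguments are one possible way to reload the file, not something this notebook prescribes.

```python
import io

import pandas as pd

# Toy rows (illustrative values only)
df = pd.DataFrame({
    "InvoiceNo": ["536365", "536366"],
    "StockCode": ["71053", "22633"],
    "Invoice Date": ["2010-12-01", "2010-12-01"],
})

# Write to an in-memory buffer instead of a file on Drive
csv_text = df.to_csv(index=False)

# Default parsing infers integers for purely numeric codes
naive = pd.read_csv(io.StringIO(csv_text))

# Explicit types keep the codes as text and parse the date column
typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"InvoiceNo": str, "StockCode": str},
    parse_dates=["Invoice Date"],
)

print(naive.dtypes)
print(typed.dtypes)
```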