In [1]:
import pandas as pd

# 1. Import a CSV file into a DataFrame


In [2]:
url = "https://drive.google.com/file/d/1FYhN_2AzTBFuWcfHaRuKcuCE6CWXsWtG/view?usp=sharing" # orderlines.csv
path = "https://drive.google.com/uc?export=download&id="+url.split("/")[-2]
orderlines = pd.read_csv(path)
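
The cell above turns a Google Drive share link into a direct-download link by taking the file id from the second-to-last path segment. A minimal sketch of the same idea wrapped in a function is shown here; the name `drive_download_url` is hypothetical and not part of pandas or any Google library.

```python
def drive_download_url(share_url: str) -> str:
    """Build a direct-download URL from a Google Drive share link (hypothetical helper)."""
    # A share link looks like .../file/d/<FILE_ID>/view?usp=sharing,
    # so the file id is the second-to-last segment when splitting on "/".
    file_id = share_url.split("/")[-2]
    return "https://drive.google.com/uc?export=download&id=" + file_id


share_url = "https://drive.google.com/file/d/1FYhN_2AzTBFuWcfHaRuKcuCE6CWXsWtG/view?usp=sharing"
direct_url = drive_download_url(share_url)
print(direct_url)
```

This relies on the share link keeping the `/file/d/<id>/view` shape; a link in a different format would need different handling.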

In [None]:
pd.options.display.max_rows = 10

In [None]:
orderlines.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 293983 entries, 0 to 293982
Data columns (total 7 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   id                293983 non-null  int64 
 1   id_order          293983 non-null  int64 
 2   product_id        293983 non-null  int64 
 3   product_quantity  293983 non-null  int64 
 4   sku               293983 non-null  object
 5   unit_price        293983 non-null  object
 6   date              293983 non-null  object
dtypes: int64(4), object(3)
memory usage: 15.7+ MB


There don't appear to be any missing values: every column has 293983 non-null entries, which matches the number of rows.

# 2. Clean up missing values & duplicates

## 2.1 Check for Missing Values

In [None]:
orderlines.isna().sum()

id                  0
id_order            0
product_id          0
product_quantity    0
sku                 0
unit_price          0
date                0
dtype: int64

There are no missing values.

## 2.2 Check for Duplicates

In [None]:
# check for duplicates
orderlines.duplicated().sum()

0

In [None]:
orderlines.nunique()

id                  293983
id_order            204855
product_id               1
product_quantity        67
sku                   7951
unit_price           11329
date                251631
dtype: int64

In [None]:
orderlines.shape

(293983, 7)

[DataFrame.size](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.size.html) returns the total number of values that the DataFrame has (the number of rows multiplied by the number of columns):

In [None]:
orderlines.size

2057881

We can check that `.size` and `.shape` agree:

In [None]:
orderlines.shape[0] * orderlines.shape[1] == orderlines.size

True

# 3. Data types

* `date` should be a datetime datatype
* `unit_price` should be a float datatype

In [None]:
# take a look at a random sample of rows
orderlines.sample(10)

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
291363,1645675,525235,0,1,DLL0056,259.99,2018-03-10 22:31:05
50438,1222221,343293,0,1,APP1814,2.029.99,2017-04-04 20:05:28
238217,1561026,491040,0,1,MMW0016,19.99,2018-01-15 09:15:52
213527,1517747,472813,0,1,APP2485,890.02,2017-12-27 00:52:39
103343,1315987,388493,0,1,TRK0011,49.99,2017-08-10 21:48:53
100254,1310596,385839,0,1,APP1662,660.33,2017-08-03 13:37:31
59037,1237065,350472,0,1,WAC0221,266.93,2017-04-27 19:00:43
40699,1204575,334968,0,1,ACM0009,23.99,2017-03-14 11:50:37
245178,1572682,495529,0,1,JYB0009,98.99,2018-01-21 17:11:48
97463,1305395,382497,0,1,WOE0008,8.99,2017-07-28 16:42:30


We can see that some prices contain 2 decimal points (for example `2.029.99`).

## 3.1 Decimal point problem in unit_price


First, let's see how many values are affected by the 2 decimal points in the price.

In [None]:
orderlines.unit_price.str.contains(r"\d+\.\d+\.\d+").value_counts()

False    257814
True      36169
Name: unit_price, dtype: int64

In [3]:
orderlines.unit_price.str.contains(r"\d+\.\d{3,}").value_counts()

False    257814
True      36169
Name: unit_price, dtype: int64
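
To make the two patterns easier to read, here is a small sketch that applies them to a handful of price strings. The first four values are taken from or modelled on the notebook; `12.345` is an invented value added only to show where the patterns differ.

```python
import pandas as pd

# "12.345" is a made-up example; the other strings follow the prices shown in this notebook.
prices = pd.Series(["259.99", "2.029.99", "19.99", "1.137.99", "12.345"])

# digits, dot, digits, dot, digits: a price containing two dots
two_dots = prices.str.contains(r"\d+\.\d+\.\d+")

# a dot followed by three or more digits: too many digits after a decimal point
long_fraction = prices.str.contains(r"\d+\.\d{3,}")

print(pd.DataFrame({"price": prices, "two_dots": two_dots, "long_fraction": long_fraction}))
```

On `orderlines` both patterns flag the same number of rows (36169), which is consistent with every affected price having the `d.ddd.dd` shape.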

Over 36000 rows in orderlines are affected by this problem. This is a tricky decision, as 12.3% is a significant share of our data, and we might end up losing an even larger portion than that. For the moment we will delete the rows, as we only have 2 weeks for this project and I'd like some quick, accurate results to show. If we have time at the end, we can come back and investigate whether these prices can be repaired.
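
If there is time for that later investigation, one avenue to test is whether the first dot is a thousands separator. This is only an assumption and would need to be checked against the real product prices before use. Under that assumption, a repair could be sketched like this on toy strings:

```python
import pandas as pd

# Toy values shaped like the prices in this notebook.
raw = pd.Series(["259.99", "2.029.99", "1.137.99"])

# Split once at the LAST dot: left side is the integer part, right side the cents.
parts = raw.str.rsplit(".", n=1, expand=True)

# ASSUMPTION: any dot left in the integer part is a thousands separator, so drop it.
integer_part = parts[0].str.replace(".", "", regex=False)

repaired = pd.to_numeric(integer_part + "." + parts[1])
print(repaired.tolist())
```

If the assumption turns out to be wrong for some products, the repaired values would be silently off by a large factor, so deleting the rows remains the safer default.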

Each row of orderlines represents a product in an order. For example, if order number 175 contained 3 separate products, then order 175 would have 3 rows in orderlines, one row for each of the products. If 2 of those products have 'normal' prices (14.99, 15.85) and 1 has a price with 2 decimal points (1.137.99), we need to remove the whole order and not just the affected row. If we only remove the row with 2 decimal points, then any later analysis about products and prices could be misleading.

We therefore need to find the order numbers associated with the rows that have 2 decimal points, and then remove all the associated rows.
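
Removing whole orders can be sketched as follows on a toy version of the order 175 example. The names `toy`, `bad_orders` and `clean` are illustrative. Note that the filter applied to `orderlines` in the next cell works row by row: the row count drops by exactly 36169, from 293983 to 257814, so other rows belonging to the affected orders are still present afterwards.

```python
import pandas as pd

# Toy data: order 175 has one malformed price, order 176 has none.
toy = pd.DataFrame({
    "id_order": [175, 175, 175, 176],
    "unit_price": ["14.99", "15.85", "1.137.99", "9.99"],
})

# 1. Flag the rows whose price has two decimal points.
bad_rows = toy["unit_price"].str.contains(r"\d+\.\d+\.\d+")

# 2. Collect the order numbers those rows belong to.
bad_orders = toy.loc[bad_rows, "id_order"].unique()

# 3. Drop every row of those orders, not only the flagged rows.
clean = toy.loc[~toy["id_order"].isin(bad_orders)].copy()
print(clean)
```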

In [5]:
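# keep rows matching neither malformed-price pattern; .copy() avoids SettingWithCopyWarning on later column assignments
orderlines = orderlines.loc[(~orderlines.unit_price.astype(str).str.contains(r"\d+\.\d+\.\d+"))&(~orderlines.unit_price.astype(str).str.contains(r"\d+\.\d{3,}")), :].copy()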

In [None]:
orderlines = orderlines.loc[~orderlines.unit_price.astype(str).str.contains(r"\d+\.\d+\.\d+")]

In [6]:
# verify that the filter worked: no True values should remain
orderlines.unit_price.str.contains(r"\d+\.\d+\.\d+").value_counts()

False    257814
Name: unit_price, dtype: int64

In [7]:
orderlines.unit_price.str.contains(r"\d+\.\d{3,}").value_counts()

False    257814
Name: unit_price, dtype: int64

## 3.2 Change `unit_price` from object to float

In [8]:
orderlines['unit_price'] = pd.to_numeric(orderlines['unit_price'])

In [9]:
orderlines.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 257814 entries, 0 to 293982
Data columns (total 7 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   id                257814 non-null  int64  
 1   id_order          257814 non-null  int64  
 2   product_id        257814 non-null  int64  
 3   product_quantity  257814 non-null  int64  
 4   sku               257814 non-null  object 
 5   unit_price        257814 non-null  float64
 6   date              257814 non-null  object 
dtypes: float64(1), int64(4), object(2)
memory usage: 15.7+ MB
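
The conversion only succeeds because the malformed prices were removed first. As a small illustration on toy strings, `pd.to_numeric` with `errors="coerce"` turns a value it cannot parse into `NaN` instead of raising an error, which can be a handy way to spot leftovers:

```python
import pandas as pd

# Toy strings: the middle one has two decimal points and cannot be parsed as a number.
raw = pd.Series(["259.99", "2.029.99", "19.99"])

# errors="coerce" replaces unparseable values with NaN rather than raising ValueError.
converted = pd.to_numeric(raw, errors="coerce")

print(converted)
print("unparseable values:", converted.isna().sum())
```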


## 3.3 `date` should become datetime datatype

In [None]:
orderlines["date"] = pd.to_datetime(orderlines["date"])

In [None]:
orderlines.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 257814 entries, 0 to 293982
Data columns (total 7 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   id                257814 non-null  int64         
 1   id_order          257814 non-null  int64         
 2   product_id        257814 non-null  int64         
 3   product_quantity  257814 non-null  int64         
 4   sku               257814 non-null  object        
 5   unit_price        257814 non-null  float64       
 6   date              257814 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(4), object(1)
memory usage: 15.7+ MB
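
`pd.to_datetime` infers the format on its own here. As a sketch, the format can also be spelled out explicitly; the format string below is an assumption based on the timestamps visible in the sample rows (`2018-03-10 22:31:05`), applied to two of those values:

```python
import pandas as pd

# Two timestamps copied from the sample rows shown earlier.
dates = pd.Series(["2018-03-10 22:31:05", "2017-04-04 20:05:28"])

# An explicit format makes pandas raise an error if a value does not match it.
parsed = pd.to_datetime(dates, format="%Y-%m-%d %H:%M:%S")

print(parsed.dtype)
print(parsed.dt.year.tolist())
```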


In [None]:
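# .copy() creates an independent DataFrame; plain assignment would only add a second name for the same object
orderlines_cl = orderlines.copy()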
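
A plain assignment such as `a = b` does not duplicate a DataFrame; both names point to the same object. The toy sketch below (the names `df`, `alias` and `independent` are illustrative) shows how that differs from `.copy()`:

```python
import pandas as pd

df = pd.DataFrame({"unit_price": [1.0, 2.0]})

alias = df                # second name for the same DataFrame object
independent = df.copy()   # separate DataFrame with its own data

# Change the original after both were created.
df.loc[0, "unit_price"] = 99.0

print(alias is df)                           # the alias is the very same object
print(alias.loc[0, "unit_price"])            # so it sees the change
print(independent.loc[0, "unit_price"])      # the copy keeps the old value
```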

In [None]:
orderlines_cl.nlargest(5, "unit_price")

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
41728,1206437,335869,0,1,PAC1428,999.99,2017-03-16 16:56:58
43442,1209508,337333,0,1,DLL0044,999.99,2017-03-21 10:20:49
43450,1209521,337340,0,1,DLL0044,999.99,2017-03-21 10:33:30
43711,1209994,337551,0,1,DLL0044,999.99,2017-03-21 18:20:18
44523,1211462,338225,0,1,DLL0044,999.99,2017-03-23 13:34:00


In [None]:
orderlines_cl.nsmallest(5, "unit_price")

Unnamed: 0,id,id_order,product_id,product_quantity,sku,unit_price,date
77008,1268645,365886,0,1,APP1465,-119.0,2017-06-15 12:48:54
53515,1227566,345934,0,1,KIN0153-2,0.0,2017-04-13 13:47:21
53530,1227590,345957,0,1,WDT0347,0.0,2017-04-13 14:44:05
56529,1232832,348502,0,1,LIBRO,0.0,2017-04-21 18:14:54
56562,1232888,348531,0,1,LIBRO,0.0,2017-04-21 19:46:54
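
The smallest prices include one negative value and several zeros. Whether these are errors, refunds or free items is not something this notebook establishes. As a minimal sketch, such rows could be isolated for inspection like this, using three sku/price pairs taken from the outputs above:

```python
import pandas as pd

# Three sku/price pairs copied from the outputs in this notebook.
toy = pd.DataFrame({
    "sku": ["APP1465", "LIBRO", "DLL0056"],
    "unit_price": [-119.0, 0.0, 259.99],
})

# Select rows whose price is zero or negative so they can be reviewed separately.
non_positive = toy.loc[toy["unit_price"] <= 0]
print(non_positive)
```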


Don't forget to download/save your new DataFrames. Also, give them an obvious name, so that you know they are the cleaned version and not the original DataFrame.

In [None]:
from google.colab import files
orderlines_cl.to_csv("orderlines_cl.csv", index=False)
files.download("orderlines_cl.csv")
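
A CSV file stores no dtype information, so the `date` column comes back as plain text when the cleaned file is read again unless pandas is asked to parse it. The sketch below shows a save and reload round trip on a toy DataFrame, written to a temporary folder instead of being downloaded through Colab:

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the cleaned DataFrame.
df = pd.DataFrame({
    "id": [1, 2],
    "date": pd.to_datetime(["2018-03-10 22:31:05", "2017-04-04 20:05:28"]),
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "orderlines_cl.csv")
    df.to_csv(path, index=False)

    # parse_dates restores the datetime dtype that the CSV format cannot store.
    reloaded = pd.read_csv(path, parse_dates=["date"])

print(reloaded.dtypes)
```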
