# Online Retail Data Set from UCI ML repo
### Business Data Analysis by [Kyuhyung Choi](https://headstartup.tistory.com)

last update: 2019.04.05

#### Description
* Transactional data containing all transactions that occurred between 01/12/2010 and 09/12/2011 for a UK-based, registered non-store online retailer.

#### Data Content
* Data Set Characteristics: Multivariate, Sequential, Time-Series
* Number of Instances: 541909
* Area: Business
* Attribute Characteristics: Integer, Real
* Number of Attributes: 8
* Data Donated: 2015-11-06
* Associated Tasks: Classification, Clustering
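* Missing Values: N/A (as listed in the repository metadata; the `data.info()` output below shows nulls in `Description` and `CustomerID`)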

#### Source
* Dr Daqing Chen, Director: Public Analytics group. chend '@' lsbu.ac.uk, School of Engineering, London South Bank University, London SE1 0AA, UK.

#### Relevant Papers
* **The evolution of direct, data and digital marketing**, Richard Webber, Journal of Direct, Data and Digital Marketing Practice (2013) 14, 291-309. 
* **Clustering Experiments on Big Transaction Data for Market Segmentation**, Ashishkumar Singh, Grace Rumantir, Annie South, Blair Bethwaite, Proceedings of the 2014 International Conference on Big Data Science and Computing. 
* **A decision-making framework for precision marketing**, Zhen You, Yain-Whar Si, Defu Zhang, XiangXiang Zeng, Stephen C.H. Leung, Tao Li, Expert Systems with Applications, 42 (2015) 3357-3367.

## Features 
* `InvoiceNo` = Invoice number, Nominal, 6-digit integral number uniquely assigned to each transaction.
  * If this code starts with the letter 'C', it indicates a cancellation.
* `StockCode` = Product (item) code. Nominal, 5-digit integral number uniquely assigned to each distinct product.
* `Description` = Product (item) name. Nominal. 
* `Quantity` = The quantity of each product (item) per transaction. Numeric. 
* `InvoiceDate` = Invoice date and time. Numeric, the day and time when each transaction was generated.
* `UnitPrice` = Unit price. Numeric, Product price per unit in sterling.
* `CustomerID` = Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
* `Country` = Country name. Nominal, the name of the country where each customer resides.
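
The cancellation rule for `InvoiceNo` can be sketched as a boolean flag. This is a minimal illustration on made-up rows, not part of the original analysis: the `toy` frame and the `IsCancellation` column name are assumptions, and the cast to `str` assumes the column may mix integers and strings after loading.

```python
import pandas as pd

# Toy rows standing in for the real file; values are illustrative only.
toy = pd.DataFrame({
    "InvoiceNo": [536365, "C536379", 536381],
    "Quantity": [6, -1, 2],
})

# Cast to str first so integer invoice numbers can be matched as text.
# Upper-casing accepts both 'C' and 'c' as the cancellation prefix.
invoice_str = toy["InvoiceNo"].astype(str)
toy["IsCancellation"] = invoice_str.str.upper().str.startswith("C")

print(toy)
```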



# 1. Optimizing Dataset

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [102]:
data = pd.read_excel('Online Retail.xlsx')

In [103]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
InvoiceNo      541909 non-null object
StockCode      541909 non-null object
Description    540455 non-null object
Quantity       541909 non-null int64
InvoiceDate    541909 non-null datetime64[ns]
UnitPrice      541909 non-null float64
CustomerID     406829 non-null float64
Country        541909 non-null object
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB


## Optimizing Dataframe Memory Footprint
* Assuming the dataset could be much larger than this one, we reduce the DataFrame's memory footprint and speed up loading the data.
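
One way to apply compact dtypes at load time, rather than after the fact, is to pass a `dtype` mapping to the reader. The sketch below uses `pd.read_csv` on an in-memory string so that it runs without the Excel file; the three-row extract and the chosen dtypes are illustrative assumptions, not the notebook's actual loading step.

```python
import io
import pandas as pd

# Illustrative three-row CSV extract; the notebook itself reads an .xlsx file.
csv_text = """InvoiceNo,StockCode,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536365,85123A,6,2010-12-01 08:26:00,2.55,17850,United Kingdom
536366,22633,6,2010-12-01 08:28:00,1.85,17850,United Kingdom
536370,22728,24,2010-12-01 08:45:00,3.75,12583,France
"""

# Declare compact dtypes up front so pandas never builds the wide defaults.
typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={
        "InvoiceNo": "category",
        "StockCode": "category",
        "Country": "category",
        "Quantity": "int32",
        "UnitPrice": "float32",
    },
    parse_dates=["InvoiceDate"],
)

print(typed.dtypes)
```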

In [104]:
# inspect the internal BlockManager (private attribute; newer pandas releases expose it as `_mgr`)
data._data

BlockManager
Items: Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
       'UnitPrice', 'CustomerID', 'Country'],
      dtype='object')
Axis 1: RangeIndex(start=0, stop=541909, step=1)
FloatBlock: slice(5, 7, 1), 2 x 541909, dtype: float64
IntBlock: slice(3, 4, 1), 1 x 541909, dtype: int64
DatetimeBlock: slice(4, 5, 1), 1 x 541909, dtype: datetime64[ns]
ObjectBlock: [0, 1, 2, 7], 4 x 541909, dtype: object

In [105]:
# True memory footprint = 134.9MB
data.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
InvoiceNo      541909 non-null object
StockCode      541909 non-null object
Description    540455 non-null object
Quantity       541909 non-null int64
InvoiceDate    541909 non-null datetime64[ns]
UnitPrice      541909 non-null float64
CustomerID     406829 non-null float64
Country        541909 non-null object
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 134.9 MB
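
The gap between the default report (33.1+ MB) and the deep report (134.9 MB) comes from object columns: the shallow figure counts only the 8-byte pointers, while the deep figure also counts the Python strings they point to. A small sketch on a made-up frame (the `toy` data is an assumption for illustration):

```python
import pandas as pd

# Force the object dtype so the strings are stored as Python objects.
toy = pd.DataFrame({
    "Country": pd.Series(["United Kingdom", "France", "Germany"] * 1000, dtype="object"),
    "Quantity": [6, 24, 12] * 1000,
})

shallow = toy.memory_usage(deep=False)  # pointer array only for object columns
deep = toy.memory_usage(deep=True)      # pointers plus the string objects

print(pd.DataFrame({"shallow": shallow, "deep": deep}))
```

Numeric columns report the same size either way, which is why only the string columns inflate the deep total.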


In [106]:
# deep memory footprint of each column
data.memory_usage(deep=True)

Index                80
InvoiceNo      19768872
StockCode      20988204
Description    45252969
Quantity        4335272
InvoiceDate     4335272
UnitPrice       4335272
CustomerID      4335272
Country        38137498
dtype: int64

In [107]:
# total deep memory footprints
tot_mem_megabytes = data.memory_usage(deep=True).sum() / 2**20
print(tot_mem_megabytes)

134.934149742


In [108]:
# deep memory footprints for string columns
obj_cols = data.select_dtypes(include=['object'])
obj_cols_mem = obj_cols.memory_usage(deep=True)
print(obj_cols_mem)

obj_cols_mem_tot_megabytes = obj_cols_mem.sum() / 2**20
print('Total memory footprints for string columns:', obj_cols_mem_tot_megabytes)

Index                80
InvoiceNo      19768872
StockCode      20988204
Description    45252969
Country        38137498
dtype: int64
Total memory footprints for string columns: 118.396399498


In [109]:
# String columns take nearly 90% of total memory footprints
print(obj_cols_mem_tot_megabytes/tot_mem_megabytes*100)

87.7438363263


## If the number of unique values is less than 50% of a string column's length, we convert the column from the object dtype to the category dtype.

```
Converting each column to the category dtype reduced the memory footprint to just 6.4 mb. While converting all of the columns to this type is appealing, it's important to be aware of the trade-offs.

The biggest one is the inability to perform numerical computations.
We can't do arithmetic with category columns or use methods like Series.min() and Series.max() without converting to a true numeric dtype first.

We should stick to using the category type primarily for object columns where less than 50% of the values are unique.
If all of the values in a column are unique, the category type will end up using more memory.
```
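
The trade-offs described above can be checked on synthetic data. This is a rough sketch with made-up series (`low_card`, `high_card`, and `codes` are hypothetical names); exact byte counts depend on the pandas version, so none are quoted here.

```python
import pandas as pd

n = 9_999
# Low cardinality: three distinct labels repeated many times.
low_card = pd.Series(["United Kingdom", "France", "Germany"] * (n // 3), dtype="object")
# High cardinality: every value is distinct.
high_card = pd.Series([f"invoice-{i}" for i in range(n)], dtype="object")


def deep_bytes(s):
    """Deep memory usage of a Series in bytes, index included."""
    return s.memory_usage(deep=True)


# Compare object vs. category storage for both series.
for name, s in [("low_card", low_card), ("high_card", high_card)]:
    print(name, "object:", deep_bytes(s), "category:", deep_bytes(s.astype("category")))

# Reductions such as min() are not defined for an unordered categorical.
codes = pd.Series([3, 1, 2]).astype("category")
try:
    codes.min()
except TypeError as exc:
    print("min() failed:", exc)
```

For the low-cardinality series the category version stores each label once plus a small integer code per row; for the all-unique series it has to store every label and the codes on top.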

In [110]:
# convert an object column to category when fewer than 50% of its values are unique
for oc in obj_cols.columns:
    if data[oc].nunique() < len(data[oc]) * 0.5:
        data[oc] = data[oc].astype('category')
        print(oc, 'converted to category dtype.')

InvoiceNo converted to category dtype.
StockCode converted to category dtype.
Description converted to category dtype.
Country converted to category dtype.


In [111]:
# All the string columns converted to category dtype columns.
data[obj_cols.columns].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 4 columns):
InvoiceNo      541909 non-null category
StockCode      541909 non-null category
Description    540455 non-null category
Country        541909 non-null category
dtypes: category(4)
memory usage: 5.4 MB


In [112]:
data['Quantity'].dtype

dtype('int64')

In [113]:
# convert 'Quantity' dtype from int64 to float64 so it is handled together with the other float columns
data['Quantity'] = data['Quantity'].astype('float64')

In [114]:
# deep memory footprints for float columns
float_cols = data.select_dtypes(include=['float'])
float_cols_mem = float_cols.memory_usage(deep=True)
print(float_cols_mem)

float_cols_mem_tot_megabytes = float_cols_mem.sum() / 2**20
print('Total memory footprints for float columns:', float_cols_mem_tot_megabytes)

Index              80
Quantity      4335272
UnitPrice     4335272
CustomerID    4335272
dtype: int64
Total memory footprints for float columns: 12.4033889771


In [115]:
# check the min/max of each float column
print('Quantity', data['Quantity'].min(), data['Quantity'].max())
print('UnitPrice', data['UnitPrice'].min(), data['UnitPrice'].max())
print('CustomerID', data['CustomerID'].min(), data['CustomerID'].max())

Quantity -80995.0 80995.0
UnitPrice -11062.06 38970.0
CustomerID 12346.0 18287.0


In [116]:
# drop rows with zero or negative values in 'Quantity' or 'UnitPrice'

print('num of rows before dropping', len(data))

quantity_col_minus_index = data[data['Quantity']<=0].index
unitprice_col_minus_index = data[data['UnitPrice']<=0].index
# Index.union() is used because `|` on Index objects is no longer a set union in pandas 2.x
data = data.drop(quantity_col_minus_index.union(unitprice_col_minus_index))

print('num of rows after dropping', len(data))

num of rows before dropping 541909
num of rows after dropping 530104
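
The same filter can also be written as a single boolean mask, which avoids building index objects. A minimal sketch on made-up rows (the `toy` frame is an assumption); note that a `> 0` mask would also drop rows where either value is NaN, whereas dropping by `<= 0` keeps them.

```python
import pandas as pd

toy = pd.DataFrame({
    "Quantity": [6.0, -1.0, 0.0, 3.0],
    "UnitPrice": [2.55, 4.65, 1.25, 0.0],
})

# Keep only rows where both values are strictly positive.
keep = (toy["Quantity"] > 0) & (toy["UnitPrice"] > 0)
filtered = toy[keep]

# Index-based equivalent: collect the offending labels, then drop them.
bad = toy.index[toy["Quantity"] <= 0].union(toy.index[toy["UnitPrice"] <= 0])
dropped = toy.drop(bad)

print(filtered)
```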


In [118]:
for float_colname in float_cols.columns:
    data[float_colname] = pd.to_numeric(data[float_colname], downcast='float')
    print(float_colname, data[float_colname].dtype)

Quantity float32
UnitPrice float32
CustomerID float32
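
Whether float32 is safe depends on the values. Whole numbers up to 2**24 are exactly representable, so the observed `Quantity` and `CustomerID` ranges survive the cast, but fractional prices are rounded slightly. The sketch below checks both on sample values taken from the min/max output above; the integer downcast at the end is an alternative for columns without missing values, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Whole-number IDs (with a missing value) round-trip through float32 unchanged.
ids = pd.Series([12346.0, np.nan, 18287.0])
ids32 = ids.astype("float32")
ids_exact = ids32.astype("float64").equals(ids)

# A fractional price picks up a small rounding error in float32.
price = 11062.06
price_error = abs(float(np.float32(price)) - price)

# Alternative: downcast to the smallest integer type that holds the range.
qty = pd.Series([-80995, 1, 80995])
qty_small = pd.to_numeric(qty, downcast="integer")

print(ids_exact, price_error, qty_small.dtype)
```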


## Optimization Result
* Original deep memory footprint = **134.9MB**
* Optimized deep memory footprint = **20.8MB** (after also dropping 11,805 rows with non-positive `Quantity` or `UnitPrice`)

In [119]:
data.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 530104 entries, 0 to 541908
Data columns (total 8 columns):
InvoiceNo      530104 non-null category
StockCode      530104 non-null category
Description    530104 non-null category
Quantity       530104 non-null float32
InvoiceDate    530104 non-null datetime64[ns]
UnitPrice      530104 non-null float32
CustomerID     397884 non-null float32
Country        530104 non-null category
dtypes: category(4), datetime64[ns](1), float32(3)
memory usage: 20.8 MB
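
After rows are dropped, a category column still carries the labels of the removed rows, and the index keeps its gaps (0 to 541908 above). An optional tidy-up, sketched on made-up rows (the `toy` and `kept` names are assumptions), is to prune unused categories and reset the index:

```python
import pandas as pd

toy = pd.DataFrame({
    "InvoiceNo": pd.Series(["536365", "C536379", "536381"], dtype="category"),
    "Quantity": [6.0, -1.0, 2.0],
})

# Filtering rows does not shrink the list of known categories.
kept = toy[toy["Quantity"] > 0].copy()
print("categories before:", list(kept["InvoiceNo"].cat.categories))

# Prune labels that no longer occur and rebuild a contiguous index.
kept["InvoiceNo"] = kept["InvoiceNo"].cat.remove_unused_categories()
kept = kept.reset_index(drop=True)
print("categories after:", list(kept["InvoiceNo"].cat.categories))
```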


# Now we're ready to dig in - let's go to Part 2.