# Online Retailer Analysis
___
Data science and machine learning have countless applications in the business world, and analysis for online retailers is one that will be important for all kinds of small and medium businesses. I'm a big fan of illustrating emerging technologies with applied examples — for this project I will be trying to meaningfully segment the customers in the retailer's database as well as create some reporting dashboards to help draw out more insights visually.


For this project, I'm going to be analyzing [this](https://archive.ics.uci.edu/ml/datasets/Online+Retail#) data set from the expansive UCI Machine Learning Repository.

## Data Context
**Abstract:** This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail.

**Source:** Dr. Daqing Chen, Director: Public Analytics group. chend '@' lsbu.ac.uk, School of Engineering, London South Bank University, London SE1 0AA, UK.

**Citation:** I'd like to give as much thanks as possible to Daqing Chen, Sai Liang Sain, and Kun Guo, Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197–208, 2012 (Published online before print: 27 August 2012. doi: 10.1057/dbm.2012.17).

## Data Set Information
This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

> ### Attribute Details
- `InvoiceNo`: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter 'c', it indicates a cancellation. 
- `StockCode`: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product. 
- `Description`: Product (item) name. Nominal. 
- `Quantity`: The quantities of each product (item) per transaction. Numeric.	
- `InvoiceDate`: Invoice Date and time. Numeric, the day and time when each transaction was generated. 
- `UnitPrice`: Unit price. Numeric, Product price per unit in sterling. 
- `CustomerID`: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer. 
- `Country`: Country name. Nominal, the name of the country where each customer resides.
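
The cancellation rule described for `InvoiceNo` can be turned into a boolean flag. The snippet below is only a sketch on a small hypothetical frame (`toy` and the `IsCancellation` column are illustrative names, not part of the data set). It assumes the prefix may appear in either case, so it upper-cases the value before checking, and it casts to string first because the column may hold a mix of integers and strings.

```python
import pandas as pd

# Hypothetical stand-in for the retail data: two normal invoices, two cancellations.
toy = pd.DataFrame({
    "InvoiceNo": [536365, "C536379", "c536383", 536366],
    "Quantity": [6, -1, -1, 6],
})

# Cast to string so integer invoice numbers and prefixed strings are handled alike,
# then flag any invoice whose code starts with the letter C (case-insensitive).
invoice_str = toy["InvoiceNo"].astype(str)
toy["IsCancellation"] = invoice_str.str.upper().str.startswith("C")

print(toy)
```

Keeping the flag as a separate column, rather than filtering right away, leaves the choice of whether to drop cancellations to the later analysis.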

___
# Data Processing

In [1]:
import pandas as pd

In [2]:
data = pd.read_excel('data/Online-Retail.xlsx')
data.head(3)

|  | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom |


In [3]:
data.shape

(541909, 8)

In [4]:
data.isnull().sum()

InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64

In [16]:
data.loc[data.Description.isnull(), :].head(10)

|  | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 622 | 536414 | 22139 | NaN | 56 | 2010-12-01 11:52:00 | 0.0 | NaN | United Kingdom |
| 1970 | 536545 | 21134 | NaN | 1 | 2010-12-01 14:32:00 | 0.0 | NaN | United Kingdom |
| 1971 | 536546 | 22145 | NaN | 1 | 2010-12-01 14:33:00 | 0.0 | NaN | United Kingdom |
| 1972 | 536547 | 37509 | NaN | 1 | 2010-12-01 14:33:00 | 0.0 | NaN | United Kingdom |
| 1987 | 536549 | 85226A | NaN | 1 | 2010-12-01 14:34:00 | 0.0 | NaN | United Kingdom |
| 1988 | 536550 | 85044 | NaN | 1 | 2010-12-01 14:34:00 | 0.0 | NaN | United Kingdom |
| 2024 | 536552 | 20950 | NaN | 1 | 2010-12-01 14:34:00 | 0.0 | NaN | United Kingdom |
| 2025 | 536553 | 37461 | NaN | 3 | 2010-12-01 14:35:00 | 0.0 | NaN | United Kingdom |
| 2026 | 536554 | 84670 | NaN | 23 | 2010-12-01 14:35:00 | 0.0 | NaN | United Kingdom |
| 2406 | 536589 | 21777 | NaN | -10 | 2010-12-01 16:50:00 | 0.0 | NaN | United Kingdom |


In a scenario where we have access to the retailer, I'd dive into these issues further to assess the best course of action. In our dataset here, I'm going to eliminate the instances where we have a null value in the `Description`.

For the rows with a missing `Description`, the average `Quantity` is negative, which doesn't make much sense. They also account for only 1,454 of the 541,909 rows (about 0.27%), a negligible amount.
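
A minimal sketch of dropping the rows with a null `Description`, shown on a small hypothetical frame rather than the real data (`toy`, `share_missing`, `mean_qty_missing` and `cleaned` are illustrative names). It first measures how much would be lost, then removes those rows with `dropna`. On the full data set the same calls would be applied to `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the retail data: two rows have no Description.
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "536414", "536589", "536366"],
    "Description": ["WHITE METAL LANTERN", np.nan, np.nan, "HAND WARMER UNION JACK"],
    "Quantity": [6, 56, -10, 6],
    "UnitPrice": [3.39, 0.0, 0.0, 1.85],
})

# Boolean mask of rows whose Description is missing.
missing = toy["Description"].isnull()

# Share of rows affected, and the mean Quantity within those rows.
share_missing = missing.mean()
mean_qty_missing = toy.loc[missing, "Quantity"].mean()

# Drop only the rows where Description is null; other NaNs are left alone.
cleaned = toy.dropna(subset=["Description"]).reset_index(drop=True)

print(f"share missing: {share_missing:.2%}, rows kept: {len(cleaned)}")
```

Passing `subset=["Description"]` matters here: a bare `dropna()` would also discard every row with a missing `CustomerID`, which is a much larger share of the data.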