# Online Retail Analysis
[UCI Data source](https://archive.ics.uci.edu/ml/datasets/Online+Retail+II#)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
data = pd.read_csv('../data/retail_data.csv')

### Loading and Cleaning the Data

In [4]:
data.head()

Unnamed: 0,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
0,489434,85048,15CM CHRISTMAS GLASS BALL 20 LIGHTS,12,01/12/2009 07:45,6.95,13085.0,United Kingdom
1,489434,79323P,PINK CHERRY LIGHTS,12,01/12/2009 07:45,6.75,13085.0,United Kingdom
2,489434,79323W,WHITE CHERRY LIGHTS,12,01/12/2009 07:45,6.75,13085.0,United Kingdom
3,489434,22041,"RECORD FRAME 7"" SINGLE SIZE",48,01/12/2009 07:45,2.1,13085.0,United Kingdom
4,489434,21232,STRAWBERRY CERAMIC TRINKET BOX,24,01/12/2009 07:45,1.25,13085.0,United Kingdom


The column names mix capitalization styles, and one of them (Customer ID) contains a space, which makes them awkward to type and error-prone during analysis.

Next, I will rename the columns to follow a consistent lowercase naming convention.

In [5]:
new_col_names = {
    "Invoice" : "invoice",
    "StockCode": "stockcode",
    "Quantity" : "quantity",
    "InvoiceDate" : "date",
    "Price" : "unit_price",
    "Country" : "country",
    "Description" : "desc",
    "Customer ID" : "customer_id",
}

data.rename(columns=new_col_names, inplace=True)

data.head()

Unnamed: 0,invoice,stockcode,desc,quantity,date,unit_price,customer_id,country
0,489434,85048,15CM CHRISTMAS GLASS BALL 20 LIGHTS,12,01/12/2009 07:45,6.95,13085.0,United Kingdom
1,489434,79323P,PINK CHERRY LIGHTS,12,01/12/2009 07:45,6.75,13085.0,United Kingdom
2,489434,79323W,WHITE CHERRY LIGHTS,12,01/12/2009 07:45,6.75,13085.0,United Kingdom
3,489434,22041,"RECORD FRAME 7"" SINGLE SIZE",48,01/12/2009 07:45,2.1,13085.0,United Kingdom
4,489434,21232,STRAWBERRY CERAMIC TRINKET BOX,24,01/12/2009 07:45,1.25,13085.0,United Kingdom
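
As an alternative to writing the mapping by hand, the renaming could also be done programmatically. The sketch below is only an illustration: `to_snake_case` is a hypothetical helper, not part of this notebook or of pandas, and it yields slightly different names than the manual mapping (for example `stock_code` and `invoice_date` rather than `stockcode` and `date`).

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Convert a label such as 'InvoiceDate' or 'Customer ID' to snake_case.

    Hypothetical helper for illustration only.
    """
    name = name.strip()
    # Insert an underscore at lowercase-to-uppercase boundaries:
    # 'InvoiceDate' -> 'Invoice_Date'
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    # Replace runs of whitespace or hyphens with a single underscore
    name = re.sub(r"[\s\-]+", "_", name)
    return name.lower()


# Empty frame that only carries a few of the original column labels
sample = pd.DataFrame(columns=["Invoice", "StockCode", "InvoiceDate", "Customer ID"])

# DataFrame.rename accepts a function and applies it to every column label
sample = sample.rename(columns=to_snake_case)
print(list(sample.columns))
```

The explicit dictionary used above gives full control over each name (such as shortening InvoiceDate to date), while a function like this scales better when a dataset has many columns.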


In [6]:
data.describe()

Unnamed: 0,quantity,unit_price,customer_id
count,525461.0,525461.0,417534.0
mean,10.337667,4.688834,15360.645478
std,107.42411,146.126914,1680.811316
min,-9600.0,-53594.36,12346.0
25%,1.0,1.25,13983.0
50%,3.0,2.1,15311.0
75%,10.0,4.21,16799.0
max,19152.0,25111.09,18287.0


In [7]:
data.shape

(525461, 8)

In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 525461 entries, 0 to 525460
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   invoice      525461 non-null  object 
 1   stockcode    525461 non-null  object 
 2   desc         522533 non-null  object 
 3   quantity     525461 non-null  int64  
 4   date         525461 non-null  object 
 5   unit_price   525461 non-null  float64
 6   customer_id  417534 non-null  float64
 7   country      525461 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 32.1+ MB


- The shape of the dataframe shows 525461 entries in total, but the info output lists fewer non-null entries for the desc (522533) and customer_id (417534) columns.

- That means these two columns contain missing values, which we have to investigate further.



In [9]:
# Count the missing values in each column, largest first

data.isnull().sum().sort_values(ascending=False)

customer_id    107927
desc             2928
invoice             0
stockcode           0
quantity            0
date                0
unit_price          0
country             0
dtype: int64

So most of the missing values are in the customer_id column: 107927 of 525461 rows (about 20.5%), compared with 2928 rows (about 0.6%) in desc.
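
Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch of such a summary, assuming a hypothetical helper named `missing_summary`; it runs on a small made-up frame so it is self-contained, but the same call should work on `data`.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the missing-value count and percentage for each column.

    Hypothetical helper for illustration only.
    """
    counts = df.isnull().sum()
    summary = pd.DataFrame(
        {
            "missing": counts,
            # Share of rows that are missing, as a percentage
            "percent": (counts / len(df) * 100).round(2),
        }
    )
    return summary.sort_values("missing", ascending=False)


# Toy data (made up) that mimics the columns with gaps in the retail dataset
toy = pd.DataFrame(
    {
        "customer_id": [13085.0, np.nan, np.nan, 13078.0],
        "desc": ["PINK CHERRY LIGHTS", None, "WHITE CHERRY LIGHTS", "RECORD FRAME"],
        "quantity": [12, 12, 48, 24],
    }
)

print(missing_summary(toy))
```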