## EDA Part 1

## Introduction
In this notebook, we will perform Exploratory Data Analysis (EDA) on the customer data for a segmentation project. The goal is to understand the characteristics of the dataset and gain insights that will be useful for customer segmentation.


### Import Libraries
Let's start by importing the necessary libraries for data analysis and visualization.


In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.graph_objs as go
import plotly.offline as pyoff

### Load the Dataset
Next, we'll load the customer data from the CSV file `data/uk_retail.csv`.

In [4]:
# Load the dataset
df = pd.read_csv('data/uk_retail.csv', encoding='unicode_escape')

### Basic Data Overview
Let's take a look at the basic information about the dataset, including the first few rows, data types, and missing values.

In [5]:
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom


In [6]:
# Check the basic information about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB


In [7]:
# Descriptive statistics summary (numeric columns only)
df.describe()    

Unnamed: 0,Quantity,UnitPrice,CustomerID
count,541909.0,541909.0,406829.0
mean,9.55225,4.611114,15287.69057
std,218.081158,96.759853,1713.600303
min,-80995.0,-11062.06,12346.0
25%,1.0,1.25,13953.0
50%,3.0,2.08,15152.0
75%,10.0,4.13,16791.0
max,80995.0,38970.0,18287.0


### Note on Negative Values
The `df.describe()` summary shows that both `UnitPrice` and `Quantity` contain negative values. Prices and quantities of real sales cannot be negative, so are these values errors or something else? Let's investigate.

In [8]:

# Explore instances with negative 'Quantity'
negative_quantity = df[df['Quantity'] < 0]

print("\nInstances with negative 'Quantity':" + str(negative_quantity.shape[0]))
negative_quantity.head()


Instances with negative 'Quantity':10624


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
141,C536379,D,Discount,-1,12/1/2010 9:41,27.5,14527.0,United Kingdom
154,C536383,35004C,SET OF 3 COLOURED FLYING DUCKS,-1,12/1/2010 9:49,4.65,15311.0,United Kingdom
235,C536391,22556,PLASTERS IN TIN CIRCUS PARADE,-12,12/1/2010 10:24,1.65,17548.0,United Kingdom
236,C536391,21984,PACK OF 12 PINK PAISLEY TISSUES,-24,12/1/2010 10:24,0.29,17548.0,United Kingdom
237,C536391,21983,PACK OF 12 BLUE PAISLEY TISSUES,-24,12/1/2010 10:24,0.29,17548.0,United Kingdom


These invoices, whose numbers start with a "C", appear to be cancellations.
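One way to check this observation is to compare the "C" prefix against the sign of `Quantity`. The snippet below is a minimal sketch on a few made-up rows, not the real dataset; the column names `IsCancellation` and `NegativeQuantity` are introduced here purely for illustration.

```python
import pandas as pd

# Toy rows (made up for illustration) that mimic the invoice layout above
sample = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "C536383", "536400"],
    "Quantity": [6, -1, -1, -5],
})

# Flag invoices whose number starts with "C"
sample["IsCancellation"] = sample["InvoiceNo"].astype(str).str.startswith("C")

# Flag rows with a negative quantity
sample["NegativeQuantity"] = sample["Quantity"] < 0

# Cross-tabulate the two flags to see how well they line up
summary = pd.crosstab(sample["IsCancellation"], sample["NegativeQuantity"])
print(summary)

# Negative quantities that are NOT on a "C" invoice would need a separate look
unexplained = sample[sample["NegativeQuantity"] & ~sample["IsCancellation"]]
print(unexplained)
```

Running the same two flags on `df` would show whether every negative `Quantity` really belongs to a "C" invoice or whether some rows have another explanation.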

In [9]:
# Check for missing values
df.isnull().sum()

InvoiceNo           0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
UnitPrice           0
CustomerID     135080
Country             0
dtype: int64

In [10]:
customer_na = "{:.2f}%".format((df.CustomerID.isnull().sum() / df.shape[0]) * 100)
print(f"Percentage of Missing CustomerID: {customer_na}")

Percentage of Missing CustomerID: 24.93%


The number of missing `CustomerID` values is significant: they account for almost a quarter of the dataset. We will keep these rows for now when calculating revenue, but exclude them from the actual customer segmentation.
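The split described above could look roughly like the following. This is a sketch on made-up rows, and the names `orders` and `customers_only` are hypothetical; it only illustrates keeping every row for revenue while dropping rows without a `CustomerID` for segmentation.

```python
import pandas as pd

# Toy rows (made up for illustration); None becomes NaN in the float column
orders = pd.DataFrame({
    "CustomerID": [17850.0, None, 14527.0, None],
    "Quantity": [6, 2, 1, 3],
    "UnitPrice": [2.55, 1.25, 27.50, 0.85],
})

# Revenue uses every row, including those without a CustomerID
orders["Revenue"] = orders["Quantity"] * orders["UnitPrice"]
total_revenue = orders["Revenue"].sum()

# Segmentation only makes sense for rows tied to a known customer
customers_only = orders.dropna(subset=["CustomerID"]).copy()
customers_only["CustomerID"] = customers_only["CustomerID"].astype(int)

print(f"Total revenue (all rows): {total_revenue:.2f}")
print(f"Rows usable for segmentation: {len(customers_only)} of {len(orders)}")
```

Working on a copy keeps the original frame intact, so the revenue figures and the segmentation input can be derived from the same source.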

## Calculating Revenue

In [19]:
# Preview of df after the YearMonth and Revenue columns (created in the cells below) were added
df

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,YearMonth,Revenue
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom,2010-12,15.30
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,2010-12,20.34
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom,2010-12,22.00
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,2010-12,20.34
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,2010-12,20.34
...,...,...,...,...,...,...,...,...,...,...
541904,581587,22613,PACK OF 20 SPACEBOY NAPKINS,12,2011-12-09 12:50:00,0.85,12680.0,France,2011-12,10.20
541905,581587,22899,CHILDREN'S APRON DOLLY GIRL,6,2011-12-09 12:50:00,2.10,12680.0,France,2011-12,12.60
541906,581587,23254,CHILDRENS CUTLERY DOLLY GIRL,4,2011-12-09 12:50:00,4.15,12680.0,France,2011-12,16.60
541907,581587,23255,CHILDRENS CUTLERY CIRCUS PARADE,4,2011-12-09 12:50:00,4.15,12680.0,France,2011-12,16.60


In [11]:
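# Parse the invoice timestamps (month/day/year hour:minute)
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'], format='%m/%d/%Y %H:%M')
# Create a new column 'YearMonth' holding the monthly period of each invoice
df['YearMonth'] = df['InvoiceDate'].dt.to_period('M')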

In [12]:
# Revenue per line item; rows with negative Quantity reduce the total
df['Revenue'] = df['UnitPrice'] * df['Quantity']

# Total revenue per month
df_revenue = df.groupby(['YearMonth'])['Revenue'].sum().reset_index()
df_revenue

Unnamed: 0,YearMonth,Revenue
0,2010-12,748957.02
1,2011-01,560000.26
2,2011-02,498062.65
3,2011-03,683267.08
4,2011-04,493207.121
5,2011-05,723333.51
6,2011-06,691123.12
7,2011-07,681300.111
8,2011-08,682680.51
9,2011-09,1019687.622
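To make the aggregation above easier to follow, here is a minimal sketch of the same steps on three made-up rows. The frame name `toy` and its values are illustrative assumptions; only the column names and the date format are taken from the notebook.

```python
import pandas as pd

# Toy rows (made up for illustration), including one cancellation-style row
toy = pd.DataFrame({
    "InvoiceDate": ["12/1/2010 8:26", "12/1/2010 9:41", "1/4/2011 10:00"],
    "Quantity": [6, -1, 10],
    "UnitPrice": [2.55, 27.50, 1.25],
})

# Parse the timestamps and derive the monthly period
toy["InvoiceDate"] = pd.to_datetime(toy["InvoiceDate"], format="%m/%d/%Y %H:%M")
toy["YearMonth"] = toy["InvoiceDate"].dt.to_period("M")

# Revenue per row, then summed per month
toy["Revenue"] = toy["UnitPrice"] * toy["Quantity"]
monthly = toy.groupby("YearMonth")["Revenue"].sum().reset_index()
print(monthly)
```

In this toy example the negative-quantity row is netted against the sale in the same month, which is also how the monthly totals above treat rows with negative `Quantity`.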


Now let's visualize the monthly revenue with a simple line graph.

In [13]:
# Convert the monthly periods to timestamps so they can be used on the plot's x-axis
df_revenue['YearMonth'] = pd.to_datetime(df_revenue['YearMonth'].astype(str), format='%Y-%m')

In [14]:
# Line chart of monthly revenue
plot_data = [
    go.Scatter(
        x=df_revenue['YearMonth'],
        y=df_revenue['Revenue'],
        mode='lines',
        line=dict(color='blue', width=2),  # Optional: set line color and width
    )
]

plot_layout = go.Layout(
    xaxis=dict(title='Invoice Year-Month'),
    yaxis=dict(title='Revenue'),
    title='Monthly Revenue',
    showlegend=False,
)

fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)


The chart shows that revenue is growing, especially from August 2011 onwards. **Note that the data for December 2011 is incomplete.**

Next, let's look at the month-over-month revenue growth as a percentage.

In [15]:
# Use pct_change() to compute the month-over-month change, expressed as a percentage
df_revenue['MonthlyGrowth'] = df_revenue['Revenue'].pct_change() * 100


plot_data = [
    go.Scatter(
        x=df_revenue['YearMonth'].iloc[1:-1],  # drop the first month (NaN growth) and the last month (incomplete data)
        y=df_revenue['MonthlyGrowth'].iloc[1:-1],
        mode='lines',
        line=dict(color='blue', width=2),  # Optional: set line color and width
    )
]

plot_layout = go.Layout(
    xaxis=dict(title='Invoice Year-Month'),
    yaxis=dict(title='Monthly Growth (%)'),
    title='Monthly Growth Rate',
    showlegend=False,
)

fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
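To make the `pct_change()` step concrete, the sketch below applies it to the first three monthly totals from the revenue table above and compares the result with the explicit formula `(current / previous - 1) * 100`. The variable names are illustrative only.

```python
import pandas as pd

# First three monthly totals, copied from the revenue table above
revenue = pd.Series(
    [748957.02, 560000.26, 498062.65],
    index=["2010-12", "2011-01", "2011-02"],
)

# pct_change() returns the fractional change from the previous row
growth = revenue.pct_change() * 100

# The same calculation written out by hand using the previous month's value
manual = (revenue / revenue.shift(1) - 1) * 100

# The first month has no predecessor, so its growth is NaN
print(growth.round(2))
print(manual.round(2))
```

The leading NaN is why the plot skips the first month; the last month is skipped separately because its data is incomplete.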


In [18]:
# Export the monthly revenue and the enriched dataset for the next notebook
df_revenue.to_csv('df_revenue.csv', index=False)
df.to_csv('df.csv', index=False)

### See Next Part
We will continue this analysis in the next notebook.