## Project Name: Customer Segmentation for Marketing Strategies

## Information About Dataset
##### The dataset is an online retail dataset containing transactional data. Each row represents one line item of a transaction (an invoice can span several rows), capturing the product purchased, the transaction details, and customer information.
##### InvoiceNo: Unique transaction identifier.<br> StockCode: Unique product identifier.<br> Description: Product name or type.<br> Quantity: Number of units purchased.<br> InvoiceDate: Date and time of the transaction.<br> UnitPrice: Price per unit of the product.<br> CustomerID: Unique customer identifier.<br> Country: Customer's country.

##### Importing all required libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA

##### Reading the dataset

In [None]:
data = pd.read_csv('Customer_data/Online_Retail.csv')

##### Viewing the first 5 rows of the dataset

In [None]:
data.head()

##### Checking missing values in the dataset

In [None]:
data.isnull().sum()

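##### Dropping rows with a missing CustomerID

In [None]:
# Remove rows with missing CustomerID.
# .copy() gives an independent DataFrame, so the column assignments below
# do not trigger pandas' SettingWithCopyWarning.
clean_data = data.dropna(subset=['CustomerID']).copy()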

##### Formatting the InvoiceDate column

In [None]:
clean_data["InvoiceDate"] = pd.to_datetime(clean_data["InvoiceDate"], format="%m/%d/%Y %H:%M")

##### Creating the new columns year, Month, Day and TotalPrice

In [None]:
clean_data["year"] = clean_data["InvoiceDate"].dt.year
clean_data['Month'] = clean_data['InvoiceDate'].dt.month_name()
clean_data['Day'] = clean_data['InvoiceDate'].dt.day_name()
clean_data['TotalPrice'] = clean_data['Quantity'] * clean_data['UnitPrice']

##### Finding the top ten products

In [None]:
# Count how often each product description appears and keep the ten most frequent.
# rename_axis + reset_index(name=...) gives the same column names on pandas 1.x and 2.x.
top_ten_prod = (
    clean_data['Description']
    .value_counts()
    .nlargest(10)
    .rename_axis('Product_name')
    .reset_index(name='Count')
)
top_ten_prod.columns

##### Visualizing the top ten products

In [None]:
# Top 10 products by number of transaction rows
plt.figure(figsize=(12, 6))
sns.barplot(x='Count', y='Product_name', data=top_ten_prod)
plt.xticks(rotation=40)
plt.title('Top 10 Products')
plt.xlabel('Count')
plt.ylabel('Product')
plt.show()

##### Finding the top five countries by number of customers

In [None]:
# Count distinct customers per country.
# value_counts() on 'Country' would count transaction rows, not customers.
top_5_countries = (
    clean_data.groupby('Country')['CustomerID']
    .nunique()
    .nlargest(5)
    .reset_index(name='Customer_count')
)
top_5_countries.columns

##### Visualizing the top five countries by number of customers

In [None]:
plt.figure(figsize=(15, 5))
sns.barplot(x='Customer_count', y='Country', data=top_5_countries)
plt.title('Top 5 Countries by Number of Customers')
plt.xlabel('Number of customers')
plt.ylabel('Country')
plt.show()

##### Finding sales in the different months

In [None]:
# Number of transaction rows per month
sales_in_month = (
    clean_data['Month']
    .value_counts()
    .rename_axis('Month')
    .reset_index(name='Sales_count')
)
sales_in_month.columns

##### Visualizing the sales in the different months

In [None]:
# Sales count in different months
plt.figure(figsize=(20, 6))
sns.barplot(x='Sales_count', y='Month', data=sales_in_month)
plt.title('Sales Count in Different Months')
plt.xlabel('Sales count')
plt.ylabel('Month')
plt.show()

##### Aggregating data by CustomerID to count unique invoices and sum quantities and total prices

In [None]:
customer_data = clean_data.groupby('CustomerID').agg({
    'InvoiceNo': 'nunique',  
    'Quantity': 'sum',       
    'TotalPrice': 'sum'      
}).reset_index()

##### Renaming the columns for better understanding 

In [None]:
customer_data.columns = ['CustomerID', 'NumInvoices', 'TotalQuantity', 'TotalSpending']

##### Display the first few rows of the aggregated customer data

In [None]:
customer_data.head()
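
##### As a hedged illustration of the aggregation and renaming steps above, the sketch below applies the same logic to a hypothetical four-row table (the values are invented for illustration, not taken from the dataset). It uses pandas named aggregation, which computes and names the columns in one step; this is an alternative to the two-step approach above, not the notebook's own method.

```python
import pandas as pd

# Hypothetical toy transactions (invented values, not from the real dataset)
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2],
    "InvoiceNo": ["A1", "A1", "A2", "B1"],
    "Quantity": [2, 1, 5, 3],
    "UnitPrice": [10.0, 4.0, 1.0, 2.5],
})
toy["TotalPrice"] = toy["Quantity"] * toy["UnitPrice"]

# Named aggregation: output column = (input column, aggregation function)
toy_customers = (
    toy.groupby("CustomerID")
    .agg(
        NumInvoices=("InvoiceNo", "nunique"),   # distinct invoices per customer
        TotalQuantity=("Quantity", "sum"),      # units bought per customer
        TotalSpending=("TotalPrice", "sum"),    # money spent per customer
    )
    .reset_index()
)
print(toy_customers)
```

##### Customer 1 has three rows but only two distinct invoices, which is why 'nunique' rather than a plain row count is used for NumInvoices.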

##### Standardizing the numerical features 'NumInvoices', 'TotalQuantity', and 'TotalSpending' by scaling them to have a mean of 0 and a standard deviation of 1.

In [None]:
scaler = StandardScaler()
scaled_data = scaler.fit_transform(customer_data[['NumInvoices', 'TotalQuantity', 'TotalSpending']])
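
##### A quick sanity check for the scaling step can look like the sketch below. It runs on a small hypothetical feature matrix (invented values) rather than on customer_data, and simply confirms that each column ends up with mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix with three columns on very different scales
toy_features = np.array([
    [1.0, 10.0, 100.0],
    [2.0, 20.0, 400.0],
    [3.0, 60.0, 700.0],
])

toy_scaled = StandardScaler().fit_transform(toy_features)

col_means = toy_scaled.mean(axis=0)
# StandardScaler divides by the population standard deviation (ddof=0),
# which is also NumPy's default for .std()
col_stds = toy_scaled.std(axis=0)

print("means:", np.round(col_means, 6))
print("stds: ", np.round(col_stds, 6))
```

##### The same two lines (mean and std along axis 0) could be run on scaled_data to verify the real features.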

##### Applying K-Means clustering

In [None]:
kmeans = KMeans(n_clusters=4, random_state=42)
customer_data['Cluster'] = kmeans.fit_predict(scaled_data)
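
##### The value n_clusters=4 is fixed above without a selection step. One common way to compare candidate values is the silhouette score; the sketch below shows the idea on synthetic blobs (make_blobs with hand-picked centers is an assumption standing in for scaled_data, and the range of k values is arbitrary).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for scaled_data: three well-separated groups
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 0], [0, 10]],
    cluster_std=0.5,
    random_state=42,
)
X = StandardScaler().fit_transform(X)

# Fit K-Means for several k and record the mean silhouette score of each fit
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Higher silhouette means tighter, better-separated clusters
best_k = max(scores, key=scores.get)
print(scores)
print("best k by silhouette:", best_k)
```

##### On the real data, X would be replaced by scaled_data; this sketch does not show whether 4 would remain the preferred choice there.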

##### Applying PCA for dimensionality reduction

In [None]:
pca = PCA(n_components=2)
principal_components = pca.fit_transform(scaled_data)
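
##### Reducing three features to two discards some variance, and the explained_variance_ratio_ attribute of a fitted PCA shows how much is kept. The sketch below demonstrates the check on hypothetical correlated data (generated with a fixed seed, not the customer features).

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: three features that share one underlying signal
rng = np.random.default_rng(0)
signal = rng.normal(size=(200, 1))
toy = np.hstack([
    signal + 0.1 * rng.normal(size=(200, 1)),
    signal + 0.3 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),  # an unrelated third feature
])

toy_pca = PCA(n_components=2).fit(toy)

# Fraction of total variance captured by each retained component
ratio = toy_pca.explained_variance_ratio_
retained = ratio.sum()
print("per component:", np.round(ratio, 3))
print("retained in 2D:", round(float(retained), 3))
```

##### For the notebook's own model, the same information is available as pca.explained_variance_ratio_ after the fit above.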

##### Create a DataFrame with the principal components

In [None]:
pca_df = pd.DataFrame(data=principal_components, columns=['PC1', 'PC2'])
pca_df['Cluster'] = customer_data['Cluster']

##### Plotting the clusters

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(x='PC1', y='PC2', hue='Cluster', data=pca_df, palette='viridis', s=100, alpha=0.7)
plt.title('Customer Segmentation using K-Means Clustering')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title='Cluster')
plt.grid(True)
plt.show()

##### The plot above shows how customers are distributed across the four K-Means clusters, projected onto the first two principal components. Each color represents one cluster, helping to identify distinct customer segments for targeted marketing strategies.

##### Applying DBSCAN clustering <br> Using eps=0.5 and min_samples=5 as initial parameters <br> These parameters may need tuning based on the data distribution

In [None]:
dbscan = DBSCAN(eps=0.5, min_samples=5)
customer_data['DBSCAN_Cluster'] = dbscan.fit_predict(scaled_data)
pca_df['DBSCAN_Cluster'] = customer_data['DBSCAN_Cluster']
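
##### One widely used heuristic for tuning eps is the k-distance curve: sort every point's distance to its min_samples-th nearest neighbor and look for the "elbow" where the curve bends sharply. The sketch below computes that curve on synthetic blobs (an assumption standing in for scaled_data); it is one possible tuning aid, not the method used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for scaled_data
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

min_samples = 5

# When the training data itself is queried, each point's nearest neighbor is
# the point itself (distance 0). DBSCAN's min_samples also counts the point
# itself, so n_neighbors=min_samples matches that definition.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)

# Last column = distance to the min_samples-th neighbor; sort ascending
k_distances = np.sort(distances[:, -1])

print("median k-distance:", round(float(np.median(k_distances)), 3))
print("95th percentile:  ", round(float(np.percentile(k_distances, 95)), 3))
```

##### Plotting k_distances with plt.plot(k_distances) and reading eps off the bend of the curve is the usual next step.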

##### Visualizing the DBSCAN clustering results using PCA

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(x='PC1', y='PC2', hue='DBSCAN_Cluster', data=pca_df, palette='viridis', s=100, alpha=0.7)
plt.title('Customer Segmentation using DBSCAN Clustering')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend(title='DBSCAN Cluster')
plt.grid(True)
plt.show()

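##### In DBSCAN's output, the label -1 marks outliers or noise points that do not belong to any dense region. Because the plot uses the viridis palette, these points appear at the darkest end of the color scale rather than in a dedicated gray or black. Cluster 0 represents the main cluster of customers.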
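
##### To quantify the DBSCAN result instead of reading it off the plot, the labels can be counted directly. The sketch below does this on hypothetical data (one tight blob plus two far-away points, generated with a fixed seed), using the same eps and min_samples as above.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical data: a tight blob of 50 points and two isolated points
rng = np.random.default_rng(0)
blob = rng.normal(loc=0.0, scale=0.1, size=(50, 2))
isolated = np.array([[5.0, 5.0], [-5.0, 5.0]])
X = np.vstack([blob, isolated])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# -1 is DBSCAN's label for noise; every other label is a cluster id
n_noise = int((labels == -1).sum())
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

print("clusters:", n_clusters)
print("noise points:", n_noise)
```

##### On the notebook's data, customer_data['DBSCAN_Cluster'].value_counts() gives the same breakdown per label.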