# Casino Customer Segmentation Model Using Python and Snowflake

## Introduction
This notebook demonstrates the process of building a customer segmentation model for a casino using Python and Snowflake. The segmentation allows the casino to categorize customers based on their behavioral patterns and tailor marketing strategies accordingly.

### Objectives:
- Retrieve customer and transaction data from Snowflake.
- Perform data preprocessing and feature engineering.
- Explore data through visualization.
- Use machine learning to cluster customers into distinct groups.
- Analyze and interpret each customer segment for actionable insights.

## 1. Library Setup and Installation
To start, we need to install and import the necessary Python libraries, including Snowflake connectors for data access and Pandas, Scikit-learn, and Matplotlib for data manipulation, machine learning, and visualization.

### Purpose:
Ensure that all required libraries are installed and configured for data ingestion, processing, and modeling.

## 2. Data Import and Preprocessing
This section loads customer and transaction data from CSV files into Pandas DataFrames, writes them to Snowflake tables, and reads the transactions back from Snowflake for local analysis. After importing, the data undergoes initial cleaning and transformation.

### Key Steps:
- Import customer and transaction data.
- Convert data into Pandas and Snowflake DataFrames.
- Preprocess the data by handling missing values and transforming categorical variables.
- Scale numerical data to ensure consistency across features.

### Outcome:
A clean and preprocessed dataset ready for further analysis.

## 3. Feature Engineering
Feature engineering is used to create additional attributes that provide more insights into customer behavior. These new features help improve the clustering and segmentation process.

### Key Features:
- **Visit Frequency**: How often a customer visits the casino, calculated from the total visits and time span between their first and last visit.
- **Preferred Game**: The game where each customer spends the most time, providing insight into individual preferences.
- **Revenue Metrics**: Total chips won or lost, giving a financial perspective on customer activity.

### Purpose:
Generate new features to better capture customer behavior, enhancing the clustering algorithm's effectiveness.
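
The three features above can be sketched on a toy table. This is a minimal illustration, not the notebook's actual pipeline: the `toy_trx` data is invented, and the column names are assumed to mirror the transaction table used later.

```python
import pandas as pd

# Hypothetical toy transactions; values are made up for illustration only.
toy_trx = pd.DataFrame({
    "PLAYER_ID": [1, 1, 1, 2, 2],
    "DATE": pd.to_datetime(
        ["2024-01-01", "2024-01-05", "2024-01-10", "2024-02-01", "2024-02-01"]
    ),
    "GAME_NAME": ["Poker", "Roulette", "Poker", "Keno", "Keno"],
    "DURATION_SPENT": [2.0, 1.0, 3.0, 0.5, 1.5],
    "CHIPS_WON_OR_LOST": [100, -50, -20, 10, -40],
})

# One row per player: first/last visit, number of visits, net chips.
per_player = toy_trx.groupby("PLAYER_ID").agg(
    FIRST_VISIT=("DATE", "min"),
    LAST_VISIT=("DATE", "max"),
    TOTAL_VISITS=("DATE", "count"),
    TOTAL_CHIPS=("CHIPS_WON_OR_LOST", "sum"),
)

# Visit frequency = visits per day of the active span (+1 avoids dividing by zero).
span_days = (per_player["LAST_VISIT"] - per_player["FIRST_VISIT"]).dt.days + 1
per_player["VISIT_FREQUENCY"] = per_player["TOTAL_VISITS"] / span_days

# Preferred game = the game with the largest summed duration per player.
time_per_game = toy_trx.groupby(["PLAYER_ID", "GAME_NAME"])["DURATION_SPENT"].sum()
best_game_idx = time_per_game.groupby(level="PLAYER_ID").idxmax()
per_player["PREFERRED_GAME"] = best_game_idx.map(lambda idx: idx[1])

print(per_player[["VISIT_FREQUENCY", "TOTAL_CHIPS", "PREFERRED_GAME"]])
```

Player 1 has three visits over a ten-day span, so the frequency is 0.3 visits per day; player 2 has two transactions on a single day, which is why the `+ 1` in the denominator matters.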

## 4. Exploratory Data Analysis (EDA) and Visualization
In this step, we use data visualization techniques to explore the relationships between variables and identify patterns in customer behavior. This helps in understanding the data before applying clustering.

### Visualizations:
- **Correlation Matrix**: Shows relationships between numerical features.
- **Distribution Plots**: Visualize the spread of important variables like customer age, revenue, and visit frequency.
- **Histograms**: Display the distribution of key variables and identify outliers.

### Purpose:
EDA helps uncover hidden trends and relationships in the data, guiding further analysis and machine learning tasks.

## 5. Dimensionality Reduction and Clustering
To reduce the complexity of the data, we apply Principal Component Analysis (PCA) and then use clustering algorithms such as K-Means and Agglomerative Clustering to group customers based on their behavior.

### Steps:
- **PCA**: Reduces the number of dimensions, simplifying the dataset while retaining important information.
- **K-Means Clustering**: Used together with the elbow method to estimate a suitable number of clusters.
- **Agglomerative Clustering**: A hierarchical technique that assigns customers to the final clusters.

### Purpose:
Dimensionality reduction makes the dataset easier to visualize, and clustering reveals natural groupings of customers.
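
As a rough sketch of how these steps chain together, the example below scales synthetic data, projects it to three principal components, and clusters the projection. The two synthetic groups and the choice of two clusters are assumptions for illustration; they are not the notebook's data or settings.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Two well-separated synthetic groups in a 6-dimensional feature space.
rng = np.random.default_rng(42)
group_a = rng.normal(loc=0.0, scale=1.0, size=(50, 6))
group_b = rng.normal(loc=8.0, scale=1.0, size=(50, 6))
features = np.vstack([group_a, group_b])

# 1) Standardize so every feature has zero mean and unit variance.
scaled = StandardScaler().fit_transform(features)

# 2) Project onto the first three principal components.
reduced = PCA(n_components=3).fit_transform(scaled)

# 3) Cluster in the reduced space (Ward linkage is the scikit-learn default).
labels = AgglomerativeClustering(n_clusters=2).fit_predict(reduced)

print(reduced.shape, np.bincount(labels))
```

Scaling comes first because PCA is sensitive to feature magnitudes: an unscaled monetary column would otherwise dominate the components.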


## 6. Cluster Profiling
Once clusters are created, we profile each one to understand the characteristics of customers within each group. Profiling helps in labeling the clusters and identifying their unique behaviors.

### Key Metrics:
- **Total Revenue**: How much each cluster contributes to the casino's revenue.
- **Visit Frequency**: The frequency with which customers in each cluster visit the casino.
- **Preferred Games**: Which games are favored by different clusters.

### Visualizations:
- **3D Plot**: Visualizes clusters in a 3D space.
- **Box Plots**: Compare how key metrics like age, revenue, and number of visits vary across clusters.

### Outcome:
Each cluster is assigned meaningful labels based on customer behavior, making it easier to target them with specific marketing strategies.
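
One way to compute the profiling metrics above is a per-cluster aggregation. The sketch below uses an invented `toy_customers` table; the column names are assumed to match the customer table used later in the notebook.

```python
import pandas as pd

# Hypothetical customers with an already assigned cluster label.
toy_customers = pd.DataFrame({
    "CLUSTERS": [0, 0, 1, 1, 1],
    "TOTAL_REVENUE_TO_CASINO": [50000.0, 40000.0, 1000.0, 2000.0, 3000.0],
    "VISIT_FREQUENCY": [0.5, 0.3, 0.05, 0.02, 0.02],
    "PREFERRED_GAME_NAME": ["Poker", "Poker", "Keno", "Video Slots", "Keno"],
})

# One row per cluster: size, revenue, average visit frequency, most common game.
profile = toy_customers.groupby("CLUSTERS").agg(
    CUSTOMERS=("TOTAL_REVENUE_TO_CASINO", "size"),
    TOTAL_REVENUE=("TOTAL_REVENUE_TO_CASINO", "sum"),
    MEAN_VISIT_FREQUENCY=("VISIT_FREQUENCY", "mean"),
    TOP_GAME=("PREFERRED_GAME_NAME", lambda s: s.mode().iloc[0]),
)

# Share of total revenue contributed by each cluster.
profile["REVENUE_SHARE"] = profile["TOTAL_REVENUE"] / profile["TOTAL_REVENUE"].sum()

print(profile)
```

A table like this makes the later labeling step easier to justify, since each segment name can be tied to a concrete difference in revenue share or visit frequency.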

## 7. Customer Segmentation
Now that we have clustered the customers, we assign descriptive names to the segments to better understand and interpret the groups.

### Example Segments:
- **High Roller Professionals**: Customers who frequently visit and spend large amounts.
- **Conservative Low Spenders**: Customers who visit infrequently and spend minimal amounts.
- **Cross-Spending Players**: Customers who engage in various activities such as gaming, dining, and hotel stays.

### Purpose:
Assign meaningful labels to customer segments to help the casino personalize marketing and engagement strategies.

## 8. Feature Interaction and Exploration
This section explores how key features interact within the different clusters, providing deeper insights into customer behavior.

### Example Interactions:
- **Revenue vs. Visit Frequency**: A scatter plot showing the relationship between visit frequency and total revenue.
- **Age vs. Cluster**: How age influences cluster membership.

### Visualizations:
- **Scatter Plots**: Show how key variables differ across clusters.
- **Box Plots**: Compare the distribution of revenue and other features across clusters.

### Purpose:
Feature interactions provide insights into customer preferences and behaviors, enabling the casino to develop more tailored marketing strategies.

## Summary
In this notebook, we have successfully segmented casino customers into distinct groups based on their behavioral patterns. By analyzing these clusters, the casino can optimize its marketing efforts and offer personalized services to improve customer retention.

### Key Takeaways:
- We identified meaningful customer segments such as high rollers and low spenders.
- Data visualization provided critical insights into customer behavior.
- Clustering allowed us to uncover groups of customers with similar behavioral patterns.
- These results can be used to enhance marketing campaigns and improve customer satisfaction.

This approach provides a strong foundation for developing personalized marketing strategies that increase engagement and drive revenue.


# Setup and Installation of Required Libraries

### In this section, we install and configure all the necessary libraries for connecting to Snowflake, data manipulation, visualization, machine learning, and other utilities. We ensure all packages are up-to-date and compatible with the project requirements.


In [None]:
# Upgrade pip to the latest version to avoid compatibility issues
!pip install --upgrade pip

# Install Snowflake connectors along with Pandas integration for Snowflake, 
# as well as necessary libraries for data processing and machine learning (e.g., numpy, scikit-learn, xgboost, matplotlib, etc.)
!pip install "snowflake-connector-python[pandas]" "snowflake-snowpark-python[pandas]" snowflake-snowpark-python==1.9.0 numpy pandas matplotlib scikit-learn xgboost seaborn python-dateutil tqdm holidays faker

# Ensure Snowflake Snowpark Python is pinned to the desired version (1.9.0)
!pip install --upgrade --quiet snowflake-snowpark-python==1.9.0

# Uninstalling the current version of urllib3 to avoid version conflicts
!pip uninstall urllib3 -y

# Installing a specific version of urllib3 (1.26.15) known to work with Snowflake and other libraries
!pip install urllib3==1.26.15

# Install fosforml library for machine learning model management and integration
!pip install fosforml==1.1.6

# Install Yellowbrick for machine learning visualization
!pip install yellowbrick


# Importing Required Libraries for Data Processing, Visualization, and Modeling

### In this section, we import various libraries needed for data manipulation, visualization, machine learning, and system utilities. These libraries will help in tasks such as configuring the environment, plotting graphs, managing models, and performing advanced mathematical operations.


In [1]:
# Import all required modules from the fosforml library for model management
from fosforml import *

# Import model flavor constants from fosforml for handling various ML model types
from fosforml.constants import MLModelFlavours

# Importing visualization library (matplotlib) for plotting
from matplotlib import pyplot as plt

# Import essential libraries for data manipulation and numerical operations
import numpy as np
import pandas as pd

# Display a larger number of columns in the DataFrame for better visibility
pd.set_option('display.max_columns', 500)

# Seaborn for statistical data visualization
import seaborn as sns

# Scikit-learn's metric for calculating Mean Absolute Percentage Error (MAPE)
from sklearn.metrics import mean_absolute_percentage_error

# Suppress warnings for cleaner output in the notebook
import warnings; warnings.simplefilter('ignore')

# Joblib for saving and loading ML models efficiently
from joblib import dump, load

# Requests for making HTTP requests to access data from external sources
import requests

# tqdm for progress bars, especially for loops that take a long time to execute
from tqdm import tqdm

# Time utilities for time-based operations
import time
import calendar

# Sleep function to add delays in execution, useful for timing-based operations
from time import sleep

# ConfigParser for reading configuration files
import configparser

# Date manipulation library for working with date ranges, intervals, and easter date calculations
from dateutil.relativedelta import relativedelta
import datetime
from dateutil.easter import easter

# Import functions from scipy for mathematical optimizations and curve fitting
from scipy.optimize import minimize_scalar
from scipy.optimize import curve_fit

# Enable inline plotting for Matplotlib, ensuring plots appear within the notebook
%matplotlib inline


# Configuring Matplotlib and Suppressing Warnings

### This section configures Matplotlib for consistent visualization aesthetics and suppresses warnings to ensure a clean output in the notebook.


In [2]:
# Importing Matplotlib for plotting
import matplotlib.pyplot as plt

# Set Matplotlib's default font family to 'DejaVu Serif' to ensure a consistent font style across plots
plt.rcParams['font.family'] = 'DejaVu Serif'

# Import the warnings library to suppress unnecessary warnings
import warnings

# Suppress all warnings for cleaner notebook output
warnings.filterwarnings("ignore")

# Importing rcParams from Matplotlib for further font configuration
from matplotlib import rcParams

# Configure Matplotlib to use 'DejaVu Sans' font to avoid 'sans-serif' related warnings
rcParams['font.family'] = 'DejaVu Sans'  # or another system-available font


# Importing the Required Libraries for Data Processing, Visualization, and Clustering

### In this section, we import libraries for handling various tasks such as data preprocessing, visualization, and applying machine learning clustering algorithms. Warnings are also suppressed for cleaner outputs.


In [3]:
# Importing datetime to work with date and time objects
import datetime

# Importing Matplotlib and related components for data visualization
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import colors

# Seaborn for statistical data visualizations
import seaborn as sns

# Importing Scikit-learn's preprocessing utilities for encoding categorical variables and scaling numerical data
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

# PCA (Principal Component Analysis) for dimensionality reduction
from sklearn.decomposition import PCA

# Yellowbrick's KElbowVisualizer to help determine the optimal number of clusters for KMeans
from yellowbrick.cluster import KElbowVisualizer

# KMeans clustering algorithm from Scikit-learn
from sklearn.cluster import KMeans

# Importing 3D plotting support from Matplotlib for visualizing clusters in 3D space
from mpl_toolkits.mplot3d import Axes3D

# Agglomerative Clustering algorithm for hierarchical clustering
from sklearn.cluster import AgglomerativeClustering

# Importing Matplotlib colormap utilities to handle color mapping for plots
from matplotlib.colors import ListedColormap

# Scikit-learn's metrics module for evaluating clustering performance
from sklearn import metrics

# Importing warnings library to suppress any unnecessary warnings during execution
import warnings

# Import sys to check and apply warnings filter conditions
import sys
if not sys.warnoptions:
    # Suppressing all warnings to clean up output
    warnings.simplefilter("ignore")

# Set a random seed for reproducibility of results
np.random.seed(42)


# Establishing a Snowflake Session for Data Operations

### In this section, we establish a connection to the Snowflake database using the `fosforml` library. This session will be used to execute queries and perform data operations within Snowflake.


In [4]:
# Importing the get_session function from fosforml's Snowflake session manager
from fosforml.model_manager.snowflakesession import get_session

# Establishing a Snowflake session for executing queries and performing operations
my_session = get_session()


# Loading Customer Data into a Pandas DataFrame

### In this section, we load customer data from a CSV file into a Pandas DataFrame. This dataset will be used for further analysis and processing in the notebook.


In [5]:
# Loading the customer data from a CSV file into a Pandas DataFrame
cust_df = pd.read_csv('customer_table.csv')

# Display the DataFrame (Jupyter shows a truncated preview of the first and last rows)
cust_df


Unnamed: 0,PLAYER_ID,AGE,GENDER,HOME_COUNTRY,HOME_CITY,DATE_FIRST_VISIT,DATE_LAST_VISIT,TOTAL_NUMBER_OF_VISITS,TOTAL_DURATION_SPENT,AVERAGE_DURATION_PER_VISIT,TOTAL_CHIPS_WON_OR_LOST,AVERAGE_CHIPS_WON_OR_LOST_PER_VISIT,UNIQUE_GAMES_PLAYED,IS_PREMIUM_PLAYER,IS_LOYALTY_CARD_HOLDER,TOTAL_AMOUNT_SPENT_IN_HOTEL,TOTAL_DAYS_SPENT_HOTEL,TOTAL_AMOUNT_SPENT_IN_CASINO_RESTAURANT,NUMBER_OF_RESTAURANT_VISITS,TOTAL_AMOUNT_SPENT_IN_SPA,NUMBER_OF_SPA_VISITS,TOTAL_REVENUE_TO_CASINO,NUMBER_OF_CONCIERGE_VISITS,PREFERRED_GAME_CATEGORY,PREFERRED_GAME_NAME
0,1,28,Female,India,Hyderabad,02-10-2021,19-08-2024,24,111.17,4.632083,4048,168.666667,8,False,True,26043.68,181,5634.89,111,11490.81,51,39121.38,96,Slot games,Video Slots
1,2,69,Female,Singapore,Singapore,06-09-2021,17-08-2024,24,114.26,4.760833,-2159,-89.958333,10,True,False,28966.09,207,7598.32,105,14077.87,50,52801.28,102,Slot games,3D Slots
2,3,52,Female,Singapore,Singapore,27-11-2021,22-06-2024,17,81.23,4.778235,-419,-24.647059,9,False,False,23593.94,109,4642.91,96,10904.30,35,39560.15,66,Table games,Roulette
3,4,41,Female,India,Delhi,10-10-2021,18-05-2024,17,69.72,4.101176,-246,-14.470588,9,True,False,23461.30,86,5322.28,85,6804.14,36,35833.72,78,Table games,Roulette
4,5,66,Female,UK,Manchester,08-09-2021,01-08-2024,15,70.89,4.726000,-755,-50.333333,8,True,True,24073.65,141,4136.14,63,7313.53,31,36278.32,82,Table games,Poker
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,9996,29,Male,Singapore,Singapore,11-09-2021,30-08-2024,28,119.51,4.268214,-441,-15.750000,11,True,True,24560.35,129,7388.70,166,11911.56,62,44301.61,138,Speciality Number Games,Keno
9996,9997,47,Female,Singapore,Singapore,19-10-2021,28-07-2024,19,102.57,5.398421,1089,57.315789,9,False,False,21510.39,122,5016.20,98,8048.02,37,33485.61,84,Slot games,Video Slots
9997,9998,55,Female,US,Los Angeles,30-09-2021,09-07-2024,19,79.72,4.195789,-1700,-89.473684,11,False,True,31041.60,194,4244.95,101,10212.66,33,47199.21,82,Table games,Craps
9998,9999,51,Male,US,San Jose,11-09-2021,26-08-2024,17,64.82,3.812941,-1995,-117.352941,8,True,True,18119.32,166,4530.26,75,9866.26,29,34510.84,83,Table games,Roulette


# Loading Transaction Data from CSV Files

### Transaction data from multiple CSV files is loaded into separate Pandas DataFrames. These datasets will be merged for further analysis.


In [6]:
# Load transaction data from multiple CSV files into separate DataFrames
t1_df = pd.read_csv('trx_1.csv')  # First transaction dataset
t2_df = pd.read_csv('trx_2.csv')  # Second transaction dataset
t3_df = pd.read_csv('trx_3.csv')  # Third transaction dataset


# Writing Customer Data to Snowflake Table

### The customer data is converted from a Pandas DataFrame into a Snowflake DataFrame and written to a Snowflake table named `casino_customers`. The `overwrite` mode ensures that the table is replaced if it already exists.


In [7]:
# Convert the Pandas DataFrame (cust_df) into a Snowflake DataFrame
cust_sfdf = my_session.createDataFrame(cust_df)

# Write the Snowflake DataFrame to a Snowflake table named 'casino_customers'
# The 'overwrite' mode ensures that the table is replaced if it already exists
cust_sfdf.write.mode("overwrite").save_as_table("casino_customers")


# Merging Transaction Data into a Single DataFrame

### The transaction data from multiple DataFrames is merged into a single DataFrame. This combined dataset will be used for further analysis. The `ignore_index=True` ensures that the index is reset after merging.


In [8]:
# Concatenate the three transaction DataFrames (t1_df, t2_df, t3_df) into one.
# pd.concat is the public API; DataFrame._append is a private method and may change without notice.
trx_df = pd.concat([t1_df, t2_df, t3_df], ignore_index=True)

# Display the structure and details of the combined transaction DataFrame
trx_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 24 columns):
 #   Column                              Non-Null Count   Dtype  
---  ------                              --------------   -----  
 0   TRANSACTION_ID                      200000 non-null  int64  
 1   DATE                                200000 non-null  object 
 2   PLAYER_ID                           200000 non-null  int64  
 3   PLAYER_AGE                          200000 non-null  int64  
 4   PLAYER_GENDER                       200000 non-null  object 
 5   HOME_COUNTRY                        200000 non-null  object 
 6   HOME_CITY                           200000 non-null  object 
 7   GAME_CATEGORY                       200000 non-null  object 
 8   GAME_NAME                           200000 non-null  object 
 9   TABLE_MINIMUM_BET                   200000 non-null  float64
 10  IS_PREMIUM_PLAYER                   200000 non-null  bool   
 11  DURATION_SPENT            

# Writing Merged Transaction Data to Snowflake Table

### The merged transaction data is converted into a Snowflake DataFrame and written to a Snowflake table named `casino_transactions`. The `overwrite` mode ensures that the table is replaced if it already exists.


In [9]:
# Convert the merged Pandas DataFrame (trx_df) into a Snowflake DataFrame
trx_sfdf = my_session.createDataFrame(trx_df)

# Write the Snowflake DataFrame to a Snowflake table named 'casino_transactions'
# The 'overwrite' mode ensures that the table is replaced if it already exists
trx_sfdf.write.mode("overwrite").save_as_table("casino_transactions")


# Querying Data from Snowflake and Converting to Pandas DataFrame

### This section retrieves all records from the `CASINO_TRANSACTIONS` table in Snowflake and converts the resulting Snowflake DataFrame into a Pandas DataFrame for local analysis.


In [None]:
# Define the name of the Snowflake table to query
table_name = 'CASINO_TRANSACTIONS'

# Execute a SQL query to select all records from the specified table in Snowflake
transaction_df = my_session.sql("select * from {}".format(table_name))

# Convert the Snowflake DataFrame to a Pandas DataFrame for further processing
transaction_df = transaction_df.to_pandas()

# Check the type of the resulting DataFrame to confirm it is a Pandas DataFrame
type(transaction_df)


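# Local fallback: reuse the merged Pandas DataFrame instead of the query result.
# Note that this overrides the transaction_df obtained from Snowflake above.
transaction_df = trx_df.copy()
# Normalize column names to upper case to match Snowflake's naming
transaction_df.columns = [col.upper() for col in transaction_df.columns]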




In [None]:
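# Parse DATE before aggregating: 'min'/'max' on raw strings compare character by character,
# which gives wrong first/last visit dates unless the strings are ISO-formatted (YYYY-MM-DD).
transaction_df['DATE'] = pd.to_datetime(transaction_df['DATE'], format='mixed')

# Aggregate transactions to one row per player
customer_aggregation = transaction_df.groupby('PLAYER_ID').agg(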
    DATE_FIRST_VISIT=('DATE', 'min'),
    DATE_LAST_VISIT= ('DATE', 'max'),
    TOTAL_NUMBER_OF_VISITS=('TRANSACTION_ID', 'count'),
    TOTAL_DURATION_SPENT=('DURATION_SPENT', 'sum'),
    AVERAGE_DURATION_PER_VISIT=('DURATION_SPENT', 'mean'),
    TOTAL_CHIPS_WON_OR_LOST=('CHIPS_WON_OR_LOST', 'sum'),
    AVERAGE_CHIPS_WON_OR_LOST_PER_VISIT=('CHIPS_WON_OR_LOST', 'mean'),
    UNIQUE_GAMES_PLAYED=('GAME_NAME', 'nunique'),
    IS_PREMIUM_PLAYER=('IS_PREMIUM_PLAYER', 'max'),
    IS_LOYALTY_CARD_HOLDER=('IS_LOYALTY_CARD_HOLDER', 'max'),
    TOTAL_AMOUNT_SPENT_IN_HOTEL=('AMOUNT_SPENT_IN_HOTEL_STAY', 'sum'),
    TOTAL_DAYS_SPENT_HOTEL=('NUMBER_OF_DAYS_SPENT_IN_HOTEL', 'sum'),
    TOTAL_AMOUNT_SPENT_IN_CASINO_RESTAURANT=('AMOUNT_SPENT_IN_CASINO_RESTAURANT', 'sum'),
    NUMBER_OF_RESTAURANT_VISITS=('NUMBER_OF_RESTAURANT_VISITS', 'sum'),
    TOTAL_AMOUNT_SPENT_IN_SPA=('AMOUNT_SPENT_IN_SPA', 'sum'),
    NUMBER_OF_SPA_VISITS=('NUMBER_OF_SPA_VISITS', 'sum'),
    TOTAL_REVENUE_TO_CASINO=('REVENUE_MADE_BY_CASINO_FROM_PLAYER', 'sum'),
    NUMBER_OF_CONCIERGE_VISITS=('NUMBER_OF_CONCIERGE_VISITS', 'sum')
).reset_index()

In [None]:
customer_aggregation['DATE_FIRST_VISIT'] = pd.to_datetime(customer_aggregation['DATE_FIRST_VISIT'], format = 'mixed')
customer_aggregation['DATE_LAST_VISIT'] = pd.to_datetime(customer_aggregation['DATE_LAST_VISIT'], format = 'mixed')
transaction_df['DATE'] = pd.to_datetime(transaction_df['DATE'], format = 'mixed')
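
The order of parsing and aggregating matters for date columns. The sketch below shows why, using three date strings taken from the customer table preview above; it assumes the `DD-MM-YYYY` layout that the preview suggests (a value such as `19-08-2024` can only be day-first). The format of the transaction `DATE` column itself is not shown in this notebook, so treat the explicit format string as an assumption.

```python
import pandas as pd

# Day-first date strings, as they appear in the customer table preview.
raw_dates = pd.Series(["02-10-2021", "19-08-2024", "11-09-2021"])

# min() on strings compares character by character, not chronologically.
string_min = raw_dates.min()

# Parsing first gives the chronologically correct result.
parsed = pd.to_datetime(raw_dates, format="%d-%m-%Y")
true_min = parsed.min()
true_max = parsed.max()

print("string min:", string_min)        # 2 October 2021, chosen because '0' sorts first
print("parsed min:", true_min.date())   # 11 September 2021, the real earliest date
print("parsed max:", true_max.date())
```

If the real data is known to be day-first, passing an explicit `format` (or `dayfirst=True`) is safer than `format='mixed'`, which infers the layout per value and can read an ambiguous string such as `02-10-2021` as month-first.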

In [None]:
customer_aggregation.head()

In [None]:
# Sanity check: list players whose first visit precedes their last visit
customer_aggregation[customer_aggregation['DATE_FIRST_VISIT']<customer_aggregation['DATE_LAST_VISIT']]

In [None]:
# Calculating preferred game category and name
preferred_game = transaction_df.groupby(['PLAYER_ID', 'GAME_CATEGORY', 'GAME_NAME','PLAYER_AGE', 'PLAYER_GENDER', 'HOME_COUNTRY', 'HOME_CITY'])['DURATION_SPENT'].sum().reset_index()
preferred_game = preferred_game.loc[preferred_game.groupby('PLAYER_ID')['DURATION_SPENT'].idxmax()][['PLAYER_ID', 'GAME_CATEGORY', 'GAME_NAME', 'PLAYER_AGE', 'PLAYER_GENDER', 'HOME_COUNTRY', 'HOME_CITY']]
preferred_game


In [None]:
customer_aggregation = customer_aggregation.merge(preferred_game, on='PLAYER_ID', how='left')
customer_aggregation


In [None]:
customer_aggregation.rename(columns= {'PLAYER_AGE':'AGE' , 'PLAYER_GENDER':'GENDER', 'GAME_CATEGORY':'PREFERRED_GAME_CATEGORY', 
                                      'GAME_NAME':'PREFERRED_GAME_NAME'}, inplace= True) 
customer_aggregation.info()

In [None]:
customer_df = customer_aggregation.copy()

# Data Preparation

In [None]:
# Date transformation data type

customer_df['DATE_FIRST_VISIT'] = pd.to_datetime(customer_df['DATE_FIRST_VISIT'], format = 'mixed')
customer_df['DATE_LAST_VISIT'] = pd.to_datetime(customer_df['DATE_LAST_VISIT'], format = 'mixed')
transaction_df['DATE'] = pd.to_datetime(transaction_df['DATE'], format = 'mixed')

In [None]:
print(customer_df.info())
print(transaction_df.info())

In [None]:
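# Adding a new feature for clustering: visits per day over the span between first and last visit
# (the +1 avoids division by zero when both visits fall on the same day)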
customer_df['VISIT_FREQUENCY'] = customer_df['TOTAL_NUMBER_OF_VISITS'] / ((pd.to_datetime(customer_df['DATE_LAST_VISIT']) - 
                                                                           pd.to_datetime(customer_df['DATE_FIRST_VISIT'])).dt.days + 1)

In [None]:
#Get list of categorical variables
s = (customer_df.dtypes == 'object')
object_cols = list(s[s].index)

print("Categorical variables in the dataset:", object_cols)

In [None]:
#Label Encoding the object dtypes.
LE=LabelEncoder()
for i in object_cols:
    customer_df[i]=customer_df[[i]].apply(LE.fit_transform)
    
print("All features are now numerical")
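
A single `LabelEncoder` that is refit for every column only remembers the mapping of the last column it saw. When the encoded values need to be translated back later (for example, to read cluster profiles by country name), one option is to keep one encoder per column. The sketch below is an illustrative variant on invented data, not the notebook's code; the `encoders` dictionary is a hypothetical helper.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical columns for illustration.
toy = pd.DataFrame({
    "GENDER": ["Female", "Male", "Female"],
    "HOME_COUNTRY": ["India", "UK", "Singapore"],
})

# Keep one fitted encoder per column so every mapping stays recoverable.
encoders = {}
encoded = toy.copy()
for col in toy.columns:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(toy[col])

# Classes are sorted alphabetically, so India -> 0, Singapore -> 1, UK -> 2.
print(encoded)

# Translate the integer codes back to the original labels.
decoded_country = encoders["HOME_COUNTRY"].inverse_transform(encoded["HOME_COUNTRY"])
print(list(decoded_country))
```

Keep in mind that label encoding imposes an arbitrary order on nominal categories, which distance-based clustering will treat as meaningful.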

In [None]:
customer_df.info()

In [None]:
print(customer_df.max(axis=0)) # will return max value of each column
print(customer_df.min(axis=0)) # will return min value of each column

In [None]:
print(customer_df.mean(axis=0)) # will return mean value of each column

In [None]:
# sf_df = my_session.createDataFrame(customer_df)
# sf_df.write.mode("overwrite").save_as_table("CASINO_CUSTOMERS")
# my_session.table("CASINO_CUSTOMERS").show()

In [None]:
customer_df.head()

In [None]:
# Creating a copy of data
ds = customer_df.copy()

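# Creating a subset of the DataFrame by dropping columns that are not behavioral features:
# the two date columns and PLAYER_ID (an identifier that would otherwise influence scaling, PCA, and clustering)
cols_del = ['PLAYER_ID', 'DATE_FIRST_VISIT', 'DATE_LAST_VISIT']
ds = ds.drop(cols_del, axis=1)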

# Check for NaN, inf, or -inf in the dataset
print(f"NaN values in the dataset:\n{ds.isna().sum()}")
print(f"Any inf or -inf in the dataset:\n{(ds == np.inf).sum() + (ds == -np.inf).sum()}")

# Option 1: Replace inf, -inf, and NaN with a specified value (e.g., mean, median, or 0)
ds = ds.replace([np.inf, -np.inf], np.nan)  # Replace inf/-inf with NaN
ds.fillna(ds.mean(), inplace=True)  # Replace NaN with mean of each column

# Option 2: Alternatively, you can drop rows containing these values
# ds = ds.replace([np.inf, -np.inf], np.nan).dropna()

# Scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(ds)
scaled_ds = pd.DataFrame(scaler.transform(ds), columns=ds.columns)

print("All features are now scaled")


In [None]:
ds.info()

# Data Exploration

In [None]:
# Descriptive statistics for customer data
print(customer_df.describe(include='all'))
print(transaction_df.describe(include='all'))

In [None]:
# Plotting distribution of age
from matplotlib import rcParams

# Configure the font family to avoid 'sans-serif' warnings
rcParams['font.family'] = 'DejaVu Sans'  # or another available font on your system

plt.figure(figsize=(12, 6))
sns.histplot(customer_df['AGE'], bins=20, kde=True)
plt.title('Age Distribution')
plt.show()

In [None]:
# Checking for missing values
print(customer_df.isna().sum())

In [None]:
# Fill missing values or drop them if appropriate
customer_df = customer_df.dropna()  # Here we simply drop missing values

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Creating a copy of the dataset
ds = customer_df.copy()

# Check for and handle 'inf' or '-inf' values
ds.replace([np.inf, -np.inf], np.nan, inplace=True)

# Option 1: Drop rows containing NaN values
ds.dropna(inplace=True)

# Option 2: Alternatively, you can fill NaN values with the mean or median
# ds.fillna(ds.mean(), inplace=True)

# Plotting distribution for numeric features in the customer data
ds.hist(bins=15, figsize=(20, 15))
plt.suptitle('Distribution of Numeric Features in Customer Data')
plt.show()


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix for customer_df
# numeric_only=True skips the datetime columns, which cannot be cast to float
correlation_matrix = customer_df.corr(numeric_only=True)

# Set the figure size larger
plt.figure(figsize=(25, 20))

# Create the heatmap with larger annotations
sns.heatmap(
    correlation_matrix, 
    annot=True, 
    cmap='coolwarm', 
    annot_kws={"size": 12},  # Increase the annotation font size
    linewidths=0.5,          # Add spacing between cells for better readability
    linecolor='gray'         # Add lines to separate cells
)

# Add title and display
plt.title('Correlation Matrix for Customer Data', fontsize=18)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.show()


In [None]:
# Same correlation matrix, with the colormap centered at zero
corrmat = customer_df.corr(numeric_only=True)
plt.figure(figsize=(20, 20))
sns.heatmap(corrmat, annot=True, cmap='coolwarm', center=0)
plt.show()

In [None]:
scaled_ds.head()

In [None]:
customer_df.info()

In [None]:
# Count plots for categorical features (values are label-encoded at this point)
sns.countplot(x='GENDER', data=customer_df)
plt.title('Gender Distribution')
plt.show()

In [None]:
sns.countplot(x='HOME_COUNTRY', data=customer_df)
plt.title('Country Distribution')
plt.show()

# PCA: Principal component analysis

In [None]:
# Initialize PCA to reduce the feature space to 3 dimensions
pca = PCA(n_components=3)
pca.fit(scaled_ds)
PCA_ds = pd.DataFrame(pca.transform(scaled_ds), columns=(["col1","col2", "col3"]))
PCA_ds.describe().T
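
Before relying on a three-component projection, it can help to check how much variance those components retain. The sketch below uses `explained_variance_ratio_` on synthetic data that is deliberately close to two-dimensional; the data and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 8 observed features driven by 2 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 8))
data = latent @ mixing + 0.01 * rng.normal(size=(200, 8))

pca = PCA(n_components=3).fit(data)

# Fraction of total variance captured by each component, and the running total.
ratios = pca.explained_variance_ratio_
cumulative = np.cumsum(ratios)

for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"component {i}: {r:.4f} (cumulative {c:.4f})")
```

In the notebook, the equivalent check would be `pca.explained_variance_ratio_` on the fitted `pca` object; a low cumulative value would suggest that clusters found in the reduced space ignore a large part of the original variation.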

In [None]:
#A 3D Projection Of Data In The Reduced Dimension
x =PCA_ds["col1"]
y =PCA_ds["col2"]
z =PCA_ds["col3"]
#To plot
fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111, projection="3d")
ax.scatter(x,y,z, c="blue", marker="o" )
ax.set_title("A 3D Projection Of Data In The Reduced Dimension")
plt.show()

# Clustering

In [None]:
# Quick examination of elbow method to find numbers of clusters to make.
print('Elbow Method to determine the number of clusters to be formed:')
Elbow_M = KElbowVisualizer(KMeans(), k=10)
Elbow_M.fit(PCA_ds)
Elbow_M.show()

In [None]:
# Repeat the elbow method on the full scaled feature set (before PCA) for comparison
print('Elbow Method on the scaled features:')
Elbow_M = KElbowVisualizer(KMeans(), k=10)
Elbow_M.fit(scaled_ds)
Elbow_M.show()

In [None]:
#Initiating the Agglomerative Clustering model 
AC = AgglomerativeClustering(n_clusters=4)

# fit model and predict clusters
yhat_AC = AC.fit_predict(PCA_ds)
PCA_ds["CLUSTERS"] = yhat_AC

# Add the cluster labels to the original DataFrame
customer_df["CLUSTERS"]= yhat_AC
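
The elbow plot gives one indication of a suitable cluster count. A complementary check, sketched below, is the silhouette score from `sklearn.metrics`, which rewards clusters that are compact and well separated. The four synthetic blobs are an assumption for illustration; on the notebook's data the same loop could be run over `PCA_ds`.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Four tight synthetic blobs in 3-D, standing in for the PCA projection.
rng = np.random.default_rng(1)
centers = np.array(
    [[0, 0, 0], [10, 0, 0], [0, 10, 0], [0, 0, 10]], dtype=float
)
points = np.vstack([c + rng.normal(scale=0.5, size=(40, 3)) for c in centers])

# Score several candidate cluster counts; higher silhouette is better.
scores = {}
for k in range(2, 7):
    candidate_labels = AgglomerativeClustering(n_clusters=k).fit_predict(points)
    scores[k] = silhouette_score(points, candidate_labels)

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("best k:", best_k)
```

On real customer data the scores are usually much lower and closer together than on clean blobs, so the result is best read alongside the elbow plot rather than as a single deciding number.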

In [None]:
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt

# Plotting the clusters
fig = plt.figure(figsize=(10, 8))
ax = plt.subplot(111, projection='3d')

# Use a valid colormap like 'viridis'
ax.scatter(x, y, z, s=40, c=PCA_ds["CLUSTERS"], marker='o', cmap='viridis')

# Set the title
ax.set_title("The Plot Of The Clusters")

# Show the plot
plt.show()


# Model Evaluation

In [None]:
# Plotting a count plot of the clusters
pal = ["#FF5733", "#33FF57", "#3357FF", "#FF33A1"]  # one color per cluster
pl = sns.countplot(x=customer_df["CLUSTERS"], palette= pal)
pl.set_title("Distribution Of The Clusters")
plt.show()

In [None]:
# Scatter plot of revenue against visit frequency, colored by cluster
pl = sns.scatterplot(data=customer_df, x="TOTAL_REVENUE_TO_CASINO", y="VISIT_FREQUENCY", hue="CLUSTERS", palette=pal)
pl.set_title("Cluster Profile: Revenue to Casino vs. Visit Frequency")
plt.legend()
plt.show()

In [None]:
plt.figure()
pl=sns.swarmplot(x=customer_df["CLUSTERS"], y=customer_df["TOTAL_REVENUE_TO_CASINO"], color= "#CBEDDD", alpha=0.5 )
pl=sns.boxenplot(x=customer_df["CLUSTERS"], y=customer_df["TOTAL_REVENUE_TO_CASINO"], palette=pal)
plt.show()

In [None]:
plt.figure()
pl=sns.swarmplot(x=customer_df["CLUSTERS"], y=customer_df["TOTAL_NUMBER_OF_VISITS"], color= "#CBEDDD", alpha=0.5 )
pl=sns.boxenplot(x=customer_df["CLUSTERS"], y=customer_df["TOTAL_NUMBER_OF_VISITS"], palette=pal)
plt.show()

In [None]:
#Plotting
plt.figure()
pl=sns.boxenplot(y=customer_df["TOTAL_CHIPS_WON_OR_LOST"],x=customer_df["CLUSTERS"], palette= pal)
pl.set_title("TOTAL_CHIPS_WON_OR_LOST")
plt.show()

# PROFILING

In [None]:
column_list = [ "AGE", "GENDER","HOME_COUNTRY","TOTAL_DURATION_SPENT","TOTAL_CHIPS_WON_OR_LOST","AVERAGE_DURATION_PER_VISIT",
               "AVERAGE_CHIPS_WON_OR_LOST_PER_VISIT", "UNIQUE_GAMES_PLAYED", 
               "IS_PREMIUM_PLAYER", "IS_LOYALTY_CARD_HOLDER", "TOTAL_AMOUNT_SPENT_IN_HOTEL", "TOTAL_DAYS_SPENT_HOTEL",
               "TOTAL_AMOUNT_SPENT_IN_CASINO_RESTAURANT","NUMBER_OF_RESTAURANT_VISITS", "TOTAL_AMOUNT_SPENT_IN_SPA", "NUMBER_OF_SPA_VISITS",
               "TOTAL_REVENUE_TO_CASINO", "NUMBER_OF_CONCIERGE_VISITS", "VISIT_FREQUENCY"]

for i in column_list:
    # jointplot creates its own figure, so a separate plt.figure() call would only add an empty one
    sns.jointplot(x=customer_df[i], y=customer_df["TOTAL_REVENUE_TO_CASINO"], hue=customer_df["CLUSTERS"], kind="kde", palette=pal)
    plt.show()

In [None]:
# Map each cluster id to a descriptive segment name.
# A dictionary lookup avoids reading PLAYER_SEGMENT before the column exists (which raises a KeyError).
segment_names = {
    0: 'High roller Professionals',
    1: 'Conservative Low spenders',
    2: 'Mediocre cross spending players',
    3: 'Money losing players',
}
customer_df['PLAYER_SEGMENT'] = customer_df['CLUSTERS'].map(segment_names)

In [None]:
# Map each cluster id to a longer segment description.
# A dictionary lookup avoids reading SEGMENT_DESC before the column exists (which raises a KeyError).
segment_descriptions = {
    0: 'Risk taking professional, regular players with deep pockets spending across services',
    1: 'Low spends, low money making low risk players',
    2: 'Budding good Players with potential to win, exploring multiple casino services',
    3: 'Money losing players',
}
customer_df['SEGMENT_DESC'] = customer_df['CLUSTERS'].map(segment_descriptions)