<a href="https://colab.research.google.com/github/VargheseTito/E-Commerce-Customer-Satisfaction-Score-Prediction-DL-Model/blob/main/deeplearningproject.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:

# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES
# TO THE CORRECT LOCATION (/kaggle/input) IN YOUR NOTEBOOK,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.

import os
import sys
from tempfile import NamedTemporaryFile
from urllib.request import urlopen
from urllib.parse import unquote, urlparse
from urllib.error import HTTPError
from zipfile import ZipFile
import tarfile
import shutil

CHUNK_SIZE = 40960
DATA_SOURCE_MAPPING = 'ecomdata:https%3A%2F%2Fstorage.googleapis.com%2Fkaggle-data-sets%2F5058629%2F8481205%2Fbundle%2Farchive.zip%3FX-Goog-Algorithm%3DGOOG4-RSA-SHA256%26X-Goog-Credential%3Dgcp-kaggle-com%2540kaggle-161607.iam.gserviceaccount.com%252F20240603%252Fauto%252Fstorage%252Fgoog4_request%26X-Goog-Date%3D20240603T053520Z%26X-Goog-Expires%3D259200%26X-Goog-SignedHeaders%3Dhost%26X-Goog-Signature%3D0e77030e4c12c48f3d44329007c054ca7bfde34ddba8e1ce742d328e7a6487bd2c997d66738b70c389e7974a39e826dc41b1dc7558e1e3f898ae1ffb253ee417fdece758525b3774e588b51a10beba1e3445207291420acdfd0fb1372b6c9ddc23152c554d138340cc14df8232168d3c7c0eaf26d31b1e31a2abfbca7e964cf10d689d6148af3a2d0bbb1d8d527817229f71e579d2f0ccd61d8f3a2d301ddaf97db883938c5d9f455bacd4e91e84801c8d4fb2255caf384738cc7311f48eecea6a12d97ade0f9d5246b9310a83b98f800892ca2d700c9a235a58a1362cdd6b4c597dfd6b86774e85ee220da65910d09b4f872e7d62ac96088c245bd97e39916b'

KAGGLE_INPUT_PATH='/kaggle/input'
KAGGLE_WORKING_PATH='/kaggle/working'
KAGGLE_SYMLINK='kaggle'

!umount /kaggle/input/ 2> /dev/null
shutil.rmtree('/kaggle/input', ignore_errors=True)
os.makedirs(KAGGLE_INPUT_PATH, 0o777, exist_ok=True)
os.makedirs(KAGGLE_WORKING_PATH, 0o777, exist_ok=True)

try:
  os.symlink(KAGGLE_INPUT_PATH, os.path.join("..", 'input'), target_is_directory=True)
except FileExistsError:
  pass
try:
  os.symlink(KAGGLE_WORKING_PATH, os.path.join("..", 'working'), target_is_directory=True)
except FileExistsError:
  pass

for data_source_mapping in DATA_SOURCE_MAPPING.split(','):
    directory, download_url_encoded = data_source_mapping.split(':')
    download_url = unquote(download_url_encoded)
    filename = urlparse(download_url).path
    destination_path = os.path.join(KAGGLE_INPUT_PATH, directory)
    try:
        with urlopen(download_url) as fileres, NamedTemporaryFile() as tfile:
            total_length = fileres.headers['content-length']
            print(f'Downloading {directory}, {total_length} bytes compressed')
            dl = 0
            data = fileres.read(CHUNK_SIZE)
            while len(data) > 0:
                dl += len(data)
                tfile.write(data)
                done = int(50 * dl / int(total_length))
                sys.stdout.write(f"\r[{'=' * done}{' ' * (50-done)}] {dl} bytes downloaded")
                sys.stdout.flush()
                data = fileres.read(CHUNK_SIZE)
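            # Flush buffered writes so the archive can be re-read from disk
            tfile.flush()
            if filename.endswith('.zip'):
              with ZipFile(tfile) as zfile:
                zfile.extractall(destination_path)
            else:
              # Use a distinct name so the `tarfile` module is not shadowed
              with tarfile.open(tfile.name) as tar_archive:
                tar_archive.extractall(destination_path)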
            print(f'\nDownloaded and uncompressed: {directory}')
    except HTTPError as e:
        print(f'Failed to load (likely expired) {download_url} to path {destination_path}')
        continue
    except OSError as e:
        print(f'Failed to load {download_url} to path {destination_path}')
        continue

print('Data source import complete.')


# **Project Name**    -  **E-Commerce Customer Satisfaction Score Prediction**



##### **Project Type**    - Classification
##### **Contribution**    - Individual


# **Problem Statement**


**BUSINESS PROBLEM OVERVIEW**


Customer satisfaction in the e-commerce sector is a pivotal metric that influences loyalty, repeat business, and word-of-mouth marketing. Traditionally, companies have relied on direct surveys to gauge customer satisfaction, which can be time-consuming and may not always capture the full spectrum of customer experiences. With the advent of deep learning, it's now possible to predict customer satisfaction scores in real-time, offering a granular view of service performance and identifying areas for immediate improvement.

**Project Goal**

The primary goal of this project is to develop a deep learning model that can accurately predict the CSAT scores based on customer interactions and feedback. By doing so, we aim to provide e-commerce businesses with a powerful tool to monitor and enhance customer satisfaction in real-time, thereby improving service quality and fostering customer loyalty.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure each chart answers the following questions.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure each algorithm answers the following questions.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
# Data manipulation and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning and ANN building
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
#Loading E-Commerce Dataset in pandas dataframe
dataset=pd.read_csv("/kaggle/input/ecomdata/eCommerce_Customer_support_data.csv")


### Dataset First View

In [None]:
# Dataset First
dataset.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns
dataset.shape

### Dataset Information

In [None]:
# Dataset Info
dataset.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
len(dataset[dataset.duplicated()])

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(dataset.isnull().sum())

In [None]:
# Visualizing the missing values

# Step 1: Calculate the count of missing values in each column and sort in descending order
missing_values = dataset.isnull().sum().sort_values(ascending=False)


# Step 2: Create a horizontal bar plot
plt.figure(figsize=(10, 8))
sns.barplot(x=missing_values, y=missing_values.index, orient='h')
plt.xlabel('Count of Missing Values')
plt.ylabel('Columns')
plt.title('Count of Missing Values in Each Column')
plt.show()

### What did you know about your dataset?

The dataset comes from the e-commerce industry, and the task is to analyze customer satisfaction scores and the factors behind them.

Customer Satisfaction Score (CSAT) is a key performance indicator (KPI) used to gauge the level of satisfaction customers have with a company's products, services, or overall experience. In the context of e-commerce, CSAT typically measures how happy customers are with their online shopping experience, including aspects like product quality, website usability, delivery speed, and customer service.

CSAT is an essential metric for e-commerce businesses, as it directly reflects the customers' perceptions and experiences, driving both immediate and long-term business success.

The dataset has 85,907 rows and 20 columns. There are no duplicate rows, but several columns contain missing values: Customer_City, Product_category, item_price, order_id, order_date_time, customer remarks, and connected_handling_time.
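
Raw missing-value counts are easier to interpret next to the share of rows they represent. The following is a minimal sketch, assuming a hypothetical helper named `missing_summary` and a toy frame with made-up values standing in for `dataset`:

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(frame) * 100).round(2),
    })
    # Keep only the columns that actually have gaps
    summary = summary[summary["missing_count"] > 0]
    return summary.sort_values("missing_count", ascending=False)


# Toy frame with made-up values (not taken from the real dataset)
toy = pd.DataFrame({
    "Customer_City": ["A", None, None, "B"],
    "Item_price": [10.0, np.nan, 5.0, 7.5],
    "CSAT Score": [5, 4, 1, 5],
})
summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(dataset)` in the notebook would give the same table for the real data, which can help decide whether a column is worth imputing or should be dropped.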


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
dataset.columns

In [None]:
# Dataset Describe
dataset.describe(include='all')

### Variables Description

**Unique id:** Unique identifier for each record (integer).

**Channel name:** Name of the customer service channel (object/string); 3 unique channel names.

**Category:** Category of the interaction (object/string); 12 unique categories.

**Sub-category:** Sub-category of the interaction (object/string); 57 unique sub-categories.

**Customer Remarks:** Feedback provided by the customer (object/string).

**Order id:** Identifier for the order associated with the interaction (integer).

**Order date time:** Date and time of the order (datetime).

**Issue reported at:** Timestamp when the issue was reported (datetime).

**Issue responded:** Timestamp when the issue was responded to (datetime).

**Survey response date:** Date of the customer survey response (datetime).

**Customer city:** City of the customer (object/string); 1782 unique customer cities.

**Product category:** Category of the product (object/string); 9 unique product categories.

**Item price:** Price of the item (float).

**Connected handling time:** Time taken to handle the interaction (float).

**Agent name:** Name of the customer service agent (object/string); 1371 unique agent names.

**Supervisor:** Name of the supervisor (object/string); 40 unique supervisors.

**Manager:** Name of the manager (object/string); 6 unique managers.

**Tenure Bucket:** Bucket categorizing agent tenure (object/string).

**Agent Shift:** Shift timing of the agent (object/string).

**CSAT Score:** Customer Satisfaction (CSAT) score (integer) (Target-Variable).

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in dataset.columns.tolist():
  print("No. of unique values in ",i,"is",dataset[i].nunique(),".")

## ***3. Exploratory Data Analysis (Data Wrangling)***

In [None]:
# Prepare the dataset for analysis.
# Work on a copy so the raw dataset stays untouched
df = dataset.copy()
# Count the interactions that received the highest CSAT score (5)
print("No. of customer interactions with the highest customer satisfaction score:", len(df[df['CSAT Score'] == 5]))
# Subsets of interactions with the highest (5) and lowest (1) CSAT scores
df_best_score = df[df['CSAT Score'] == 5]
df_least_score = df[df['CSAT Score'] == 1]

### **Q1. Top 5 Product Category with highest Customer Satisfaction Score**

In [None]:
# Groupby Product_category Wise w.r.t Customer satisfaction score data
grouped_df = df_best_score.groupby('Product_category').agg(
    Count=('CSAT Score', 'size')
).sort_values(by='Count',ascending=False)

grouped_df[:5]

In [None]:
# Plotting the data
fig, ax = plt.subplots(figsize=(10, 6))
grouped_df[:5].plot(kind='bar', ax=ax)

# Adding labels and title
ax.set_xlabel('Product Category')
ax.set_ylabel('Count of CSAT Scores')
ax.set_title('Top 5 Product Category with highest Customer Satisfaction Score')

# Rotating the x-axis labels for better readability
plt.xticks(rotation=45)

# Adding grid for better readability
ax.grid(True, which='both', linestyle='--', linewidth=0.5)

# Display the plot
plt.show()

### **Q2. Top 5 Category with highest customer satisfaction score**

In [None]:
# Groupby category Wise w.r.t Customer satisfaction score data
grouped_df = df_best_score.groupby('category').agg(
    Count=('CSAT Score', 'size')
).sort_values(by='Count',ascending=False)

grouped_df[:5]

In [None]:


# Plotting the data
fig, ax = plt.subplots(figsize=(10, 6))
grouped_df[:5].plot(kind='bar', ax=ax)

# Adding labels and title
ax.set_xlabel('Category')
ax.set_ylabel('Count of CSAT Scores')
ax.set_title('Top 5 Categories with highest Customer Satisfaction Score')

# Rotating the x-axis labels for better readability
plt.xticks(rotation=45)

# Adding grid for better readability
ax.grid(True, which='both', linestyle='--', linewidth=0.5)

# Display the plot
plt.show()

### **Q3. Top 5 Sub Category with highest Customer Satisfaction Score**

In [None]:
# Groupby Sub-category Wise w.r.t Customer satisfaction score data
grouped_df = df_best_score.groupby('Sub-category').agg(
    Count=('CSAT Score', 'size')
).sort_values(by='Count',ascending=False)

grouped_df[:5]

In [None]:
# Plotting the data
fig, ax = plt.subplots(figsize=(10, 6))
grouped_df[:5].plot(kind='bar', ax=ax)

# Adding labels and title
ax.set_xlabel('Sub Category')
ax.set_ylabel('Count of CSAT Scores')
ax.set_title('Top 5 Sub Category with highest Customer Satisfaction Score')

# Rotating the x-axis labels for better readability
plt.xticks(rotation=45)

# Adding grid for better readability
ax.grid(True, which='both', linestyle='--', linewidth=0.5)

# Display the plot
plt.show()

### **Q4. Top 5 cities with highest customer satisfaction score**

In [None]:
# Groupby Customer_City Wise w.r.t Customer satisfaction score data
grouped_df = df_best_score.groupby('Customer_City').agg(
    Count=('CSAT Score', 'size')
).sort_values(by='Count',ascending=False)

grouped_df[:5]

In [None]:
# Plotting the data
fig, ax = plt.subplots(figsize=(10, 6))
grouped_df[:5].plot(kind='bar', ax=ax)

# Adding labels and title
ax.set_xlabel('Cities')
ax.set_ylabel('Count of CSAT Scores')
ax.set_title('Top 5 Cities with highest Customer Satisfaction Score')

# Rotating the x-axis labels for better readability
plt.xticks(rotation=45)

# Adding grid for better readability
ax.grid(True, which='both', linestyle='--', linewidth=0.5)

# Display the plot
plt.show()

### **Q5. Best performing Channels**

In [None]:
# Groupby Channel name Wise w.r.t Customer satisfaction score data
grouped_df = df_best_score.groupby('channel_name').agg(
    Count=('CSAT Score', 'size')
).sort_values(by='Count',ascending=False)

grouped_df

In [None]:
# Plotting the data
fig, ax = plt.subplots(figsize=(10, 6))
grouped_df[:5].plot(kind='bar', ax=ax)

# Adding labels and title
ax.set_xlabel('Channels')
ax.set_ylabel('Count of CSAT Scores')
ax.set_title('Channel with highest Customer Satisfaction Score')

# Rotating the x-axis labels for better readability
plt.xticks(rotation=45)

# Adding grid for better readability
ax.grid(True, which='both', linestyle='--', linewidth=0.5)

# Display the plot
plt.show()

### **Q6. Top 3 best performing Managers**

In [None]:
# Groupby Manager Wise w.r.t Customer satisfaction score data
grouped_df = df_best_score.groupby('Manager').agg(
    Count=('CSAT Score', 'size')
).sort_values(by='Count',ascending=False)

grouped_df[:3]

In [None]:
# Plotting the data
fig, ax = plt.subplots(figsize=(10, 6))
grouped_df[:3].plot(kind='bar', ax=ax)

# Adding labels and title
ax.set_xlabel('Manager Name')
ax.set_ylabel('Count of CSAT Scores')
ax.set_title('Managers with highest Customer Satisfaction Score')

# Rotating the x-axis labels for better readability
plt.xticks(rotation=45)

# Adding grid for better readability
ax.grid(True, which='both', linestyle='--', linewidth=0.5)

# Display the plot
plt.show()

### **Q7. Top 3 best performing Agents**

In [None]:
# Groupby Agent_name Wise w.r.t Customer satisfaction score data
grouped_df = df_best_score.groupby('Agent_name').agg(
    Count=('CSAT Score', 'size')
).sort_values(by='Count',ascending=False)

grouped_df[:3]

In [None]:
# Plotting the data
fig, ax = plt.subplots(figsize=(10, 6))
grouped_df[:3].plot(kind='bar', ax=ax)

# Adding labels and title
ax.set_xlabel('Agent Name')
ax.set_ylabel('Count of CSAT Scores')
ax.set_title('Agents with highest Customer Satisfaction Score')

# Rotating the x-axis labels for better readability
plt.xticks(rotation=45)

# Adding grid for better readability
ax.grid(True, which='both', linestyle='--', linewidth=0.5)

# Display the plot
plt.show()

### **Q8. Top 3 best performing Supervisors**

In [None]:
# Groupby Supervisor Wise w.r.t Customer satisfaction score data
grouped_df = df_best_score.groupby('Supervisor').agg(
    Count=('CSAT Score', 'size')
).sort_values(by='Count',ascending=False)

grouped_df[:3]

In [None]:
# Plotting the data
fig, ax = plt.subplots(figsize=(10, 6))
grouped_df[:3].plot(kind='bar', ax=ax)

# Adding labels and title
ax.set_xlabel('Supervisor Name')
ax.set_ylabel('Count of CSAT Scores')
ax.set_title('Supervisors with highest Customer Satisfaction Score')

# Rotating the x-axis labels for better readability
plt.xticks(rotation=45)

# Adding grid for better readability
ax.grid(True, which='both', linestyle='--', linewidth=0.5)

# Display the plot
plt.show()

### **Q9. Which tenure group of employees is performing the best?**

In [None]:
# Groupby Tenure Bucket Wise w.r.t Customer satisfaction score data
grouped_df = df_best_score.groupby('Tenure Bucket').agg(
    Count=('CSAT Score', 'size')
).sort_values(by='Count',ascending=False)

grouped_df[:3]

In [None]:
# Plotting the data
fig, ax = plt.subplots(figsize=(10, 6))
grouped_df[:3].plot(kind='bar', ax=ax)

# Adding labels and title
ax.set_xlabel('Tenure bucket')
ax.set_ylabel('Count of CSAT Scores')
ax.set_title('Tenure group with highest Customer Satisfaction Score')

# Rotating the x-axis labels for better readability
plt.xticks(rotation=45)

# Adding grid for better readability
ax.grid(True, which='both', linestyle='--', linewidth=0.5)

# Display the plot
plt.show()

### **Q10. Which shift timings of agents is performing the best?**

In [None]:
# Groupby Agent Shift Wise w.r.t Customer satisfaction score data
grouped_df = df_best_score.groupby('Agent Shift').agg(
    Count=('CSAT Score', 'size')
).sort_values(by='Count',ascending=False)

grouped_df[:3]

In [None]:
# Plotting the data
fig, ax = plt.subplots(figsize=(10, 6))
grouped_df[:3].plot(kind='bar', ax=ax)

# Adding labels and title
ax.set_xlabel('Agents Shift Timings')
ax.set_ylabel('Count of CSAT Scores')
ax.set_title('Shift Timings with highest Customer Satisfaction Score')

# Rotating the x-axis labels for better readability
plt.xticks(rotation=45)

# Adding grid for better readability
ax.grid(True, which='both', linestyle='--', linewidth=0.5)

# Display the plot
plt.show()
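
Q1 through Q10 repeat the same group, count, and sort steps with a different column each time. One way to avoid the repetition is a small helper; the sketch below assumes a hypothetical function named `top_counts` and uses a toy frame with made-up managers in place of `df_best_score`:

```python
import pandas as pd


def top_counts(frame: pd.DataFrame, column: str, n: int = 5) -> pd.DataFrame:
    """Count rows per value of `column` and return the n largest groups."""
    return (
        frame.groupby(column)
        .agg(Count=("CSAT Score", "size"))
        .sort_values(by="Count", ascending=False)
        .head(n)
    )


# Toy stand-in for df_best_score (made-up values, every row has a score of 5)
toy_best = pd.DataFrame({
    "Manager": ["M1", "M2", "M1", "M3", "M1", "M2"],
    "CSAT Score": [5, 5, 5, 5, 5, 5],
})
top_managers = top_counts(toy_best, "Manager", n=2)
print(top_managers)
```

In the notebook, `top_counts(df_best_score, 'Manager', 3).plot(kind='bar')` would reproduce the Q6 chart, and the same call with another column name would cover the other questions.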

### **Q11. How does response time impact the customer satisfaction score?**

In [None]:
# Ensure the 'Issue reported at' and 'Issue responded' columns are in datetime format
df['Issue_reported at'] = pd.to_datetime(df['Issue_reported at'], dayfirst=True)
df['issue_responded'] = pd.to_datetime(df['issue_responded'], dayfirst=True)

# Calculate the response time
df['Response_Time'] = df['issue_responded'] - df['Issue_reported at']

# Convert 'Response_Time' to a numerical format in seconds for aggregation
df['Response_Time_seconds'] = df['Response_Time'].dt.total_seconds()

# Groupby CSAT Score and calculate the mean response time
grouped_df = df.groupby('CSAT Score').agg(
    Mean_Response_Time=('Response_Time_seconds', 'mean')
).sort_values(by='Mean_Response_Time', ascending=False)

# Convert the mean response time back to timedelta for readability
grouped_df['Mean_Response_Time'] = pd.to_timedelta(grouped_df['Mean_Response_Time'], unit='s')

# Display the grouped DataFrame
print(grouped_df)

In [None]:
# Plotting the data
fig, ax = plt.subplots(figsize=(10, 6))
grouped_df.plot(kind='bar', ax=ax)

# Adding labels and title
ax.set_xlabel('CSAT Scores')
ax.set_ylabel('Mean Response Time')
ax.set_title('Mean Response Time in each Customer Satisfaction Score')

# Rotating the x-axis labels for better readability
plt.xticks(rotation=45)

# Adding grid for better readability
ax.grid(True, which='both', linestyle='--', linewidth=0.5)

# Display the plot
plt.show()
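
A mean response time can be distorted by a few extreme or inconsistent rows, for example a response timestamp that precedes the report timestamp. The sketch below shows one possible sanity check: it flags negative durations and reports the median alongside the mean. The timestamps are made up, and the `%d/%m/%Y %H:%M` format is an assumption based on the `dayfirst=True` argument used above.

```python
import pandas as pd

# Toy rows with made-up timestamps; the last row responds before it is reported
toy = pd.DataFrame({
    "Issue_reported at": ["01/08/2023 11:13", "01/08/2023 12:52",
                          "01/08/2023 20:16", "02/08/2023 09:00"],
    "issue_responded": ["01/08/2023 11:47", "01/08/2023 12:54",
                        "01/08/2023 20:38", "02/08/2023 08:30"],
    "CSAT Score": [5, 5, 1, 1],
})

fmt = "%d/%m/%Y %H:%M"  # assumed timestamp format
reported = pd.to_datetime(toy["Issue_reported at"], format=fmt)
responded = pd.to_datetime(toy["issue_responded"], format=fmt)
toy["Response_Time_minutes"] = (responded - reported).dt.total_seconds() / 60

# Flag rows where the response precedes the report
negative = toy["Response_Time_minutes"] < 0
print("Rows with negative response time:", int(negative.sum()))

# Summarize only the valid rows; the median is less sensitive to outliers
valid = toy[~negative]
summary = valid.groupby("CSAT Score")["Response_Time_minutes"].agg(
    ["mean", "median", "count"]
)
print(summary)
```

The `count` column shows how many interactions each average rests on, which helps when comparing scores with very different numbers of responses.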

### **Q12. How does customer handling time impact the customer satisfaction score?**

In [None]:
# Groupby Customer satisfaction score data w.r.t customer handling time
grouped_df = df.groupby('CSAT Score').agg(
    Mean_Handling_Time=('connected_handling_time', 'mean')
).sort_values(by='Mean_Handling_Time', ascending=False)

grouped_df

In [None]:
# Plotting the data
fig, ax = plt.subplots(figsize=(10, 6))
grouped_df.plot(kind='bar', ax=ax)

# Adding labels and title
ax.set_xlabel('CSAT Scores')
ax.set_ylabel('Mean Customer Handling Time')
ax.set_title('Mean Customer Handling Time in each Customer Satisfaction Score')

# Rotating the x-axis labels for better readability
plt.xticks(rotation=45)

# Adding grid for better readability
ax.grid(True, which='both', linestyle='--', linewidth=0.5)

# Display the plot
plt.show()

### What all manipulations have you done and insights you found?

I created a working copy of the dataset (`df`), split out the interactions with the highest (5) and lowest (1) CSAT scores, converted the issue timestamps to datetimes, and derived a response time in seconds. I then counted the top-scoring interactions by product category, category, sub-category, city, channel, manager, agent, supervisor, tenure bucket, and shift, and compared the mean response and handling times across CSAT scores.

The findings below suggest possible reasons for lower customer satisfaction scores:

**Insights from Analysis:**

**Response Time:** Identified that longer response times were correlated with lower customer satisfaction scores. This suggests a need for quicker response mechanisms.

**Product Category:** Found that certain product categories had consistently lower satisfaction scores, indicating potential issues with these products or their support processes.

**Channel Name:** Discovered that certain customer service channels were more effective at resolving issues satisfactorily, leading to higher CSAT scores.

**Agent Tenure:** Noted that agents with longer tenures tended to receive higher satisfaction scores, suggesting that experience plays a crucial role in customer service effectiveness.

**Shift Timings:** Found variations in satisfaction scores based on agent shifts, with some shifts having lower scores, possibly due to higher workloads or fewer resources during those times.

**Customer Feedback:** Analyzed customer remarks to identify common themes and keywords associated with low satisfaction scores, providing qualitative insights into customer pain points.
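
The Q1 to Q10 charts count interactions with a score of 5, so a group that simply handles more interactions will rank higher even if a smaller share of its customers is satisfied. A complementary view is the share of 5s within each group. The sketch below assumes a hypothetical helper named `satisfaction_rate` and uses made-up rows:

```python
import pandas as pd


def satisfaction_rate(frame: pd.DataFrame, column: str, min_rows: int = 1) -> pd.DataFrame:
    """Per group: number of interactions, share of 5-scores, and mean CSAT."""
    work = frame.assign(is_five=frame["CSAT Score"].eq(5))
    stats = work.groupby(column).agg(
        Interactions=("CSAT Score", "size"),
        Share_of_5=("is_five", "mean"),
        Mean_CSAT=("CSAT Score", "mean"),
    )
    # Drop groups with too few interactions to compare fairly
    stats = stats[stats["Interactions"] >= min_rows]
    return stats.sort_values("Share_of_5", ascending=False)


# Toy data (made up): Inbound has more 5s in total, Email has a higher share of 5s
toy = pd.DataFrame({
    "channel_name": ["Inbound"] * 6 + ["Email"] * 2,
    "CSAT Score": [5, 5, 5, 1, 1, 1, 5, 5],
})
rates = satisfaction_rate(toy, "channel_name")
print(rates)
```

In this toy example Inbound has three 5-scores against two for Email, yet only half of its interactions are rated 5. Running `satisfaction_rate(df, 'channel_name')` on the real data would show whether the count-based rankings above hold once volume is taken into account.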



## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 - Pie Chart on Dependent Variable i.e., CSAT Score (Univariate)

In [None]:
# Chart - 1 visualization code
# Display the value counts of the dependent variable, 'CSAT Score'
print(df['CSAT Score'].value_counts())
print(" ")

# Visualize the 'CSAT Score' value counts as a pie chart
df['CSAT Score'].value_counts().plot(
    kind='pie',
    figsize=(15, 6),
    autopct="%1.1f%%",
    startangle=90,
    shadow=True,
    labels=df['CSAT Score'].value_counts().index,
    colors=plt.cm.Paired(range(len(df['CSAT Score'].value_counts()))),
    explode=[0.1] * len(df['CSAT Score'].value_counts())  # Slightly explode all slices for better visibility
)

# Set the title and display the plot
plt.title('Customer Satisfaction Score Distribution')
plt.ylabel('')  # Hide the y-label as it's redundant in a pie chart
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart expresses a part-to-whole relationship in the data. It makes percentage comparisons easy to read because each category's share is shown as a colored slice of the circle. I used a pie chart to compare the percentage share of each value of the dependent variable.

##### 2. What is/are the insight(s) found from the chart?

Based on the chart, I observed that 59,617 customers rated the service with a CSAT Score of 5, which accounts for 69.4% of the total feedback in the dataset. Conversely, 1,283 customers were dissatisfied and gave a CSAT Score of 2, representing 1.5% of the total responses.

Additionally, 13.1% of customers gave the lowest CSAT score of 1, another 13.1% gave a score of 4, and 3% gave a score of 3. Scores of 1 and 2 together account for 14.6% of responses, so nearly 15% of customers experienced poor service. It is therefore crucial to examine the factors contributing to this dissatisfaction.
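
The percentages quoted above can be reproduced directly from the value counts. The sketch below assumes a hypothetical helper named `score_distribution`, demonstrates it on a made-up series, and then checks the two counts quoted in the text against the row count of 85,907 reported earlier:

```python
import pandas as pd


def score_distribution(scores: pd.Series) -> pd.DataFrame:
    """Return the count and percentage share of each score, ordered by score."""
    counts = scores.value_counts().sort_index()
    pct = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({"count": counts, "pct": pct})


# Made-up scores for illustration: seven 5s, one 4, two 1s
toy_scores = pd.Series([5] * 7 + [4] + [1] * 2, name="CSAT Score")
dist = score_distribution(toy_scores)
# Share of low scores (1 and 2) in the toy data
low_share = dist.loc[dist.index <= 2, "pct"].sum()
print(dist)
print("Low-score share (%):", low_share)

# Check the figures quoted in the text against the total row count
total_rows = 85907
pct_score_5 = round(59617 / total_rows * 100, 1)
pct_score_2 = round(1283 / total_rows * 100, 1)
print(pct_score_5, pct_score_2)
```

Applying `score_distribution(df['CSAT Score'])` in the notebook would give the full table behind the pie chart.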


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights will help create a positive business impact. Here's how:

**Focus on Strengths:** Knowing that 69.4% of customers rated the service with a CSAT score of 5 indicates a strong positive reception. By analyzing what is working well for these satisfied customers, the business can replicate these strategies across other areas to further enhance customer satisfaction.

**Targeted Improvements:** Identifying that nearly 15% of customers are experiencing poor service (CSAT scores of 1 and 2) allows the business to focus on specific areas of improvement. Understanding the reasons behind these low scores can help address the root causes, such as response time, service quality, or specific product issues.

**Resource Allocation:** Insights into customer satisfaction distribution can guide the allocation of resources. For instance, more training and support can be provided to agents or departments that receive lower scores to elevate their performance.

**Strategic Planning:** The data can be used to set targeted goals for improvement in customer satisfaction metrics, driving a continuous improvement culture within the organization.

**Are there any insights that lead to negative growth?**

While the insights primarily aim to create a positive impact, if not properly managed, they could potentially lead to negative growth:

**Neglecting High Performers:** If the focus shifts too heavily on addressing negative feedback without recognizing and maintaining what leads to high satisfaction (69.4% with a score of 5), there is a risk of neglecting the positive aspects. This could inadvertently lead to a decline in the areas that are currently performing well.

**Inadequate Response to Poor Scores:** If the business fails to adequately address the issues leading to the 15% of poor scores, customer dissatisfaction could worsen. Dissatisfied customers are more likely to churn, leave negative reviews, and dissuade potential customers, negatively impacting growth.

**Overemphasis on Quick Fixes:** Prioritizing quick fixes over sustainable, long-term solutions can lead to temporary improvements in CSAT scores without addressing underlying issues. This might result in a superficial improvement in customer satisfaction but could cause long-term dissatisfaction if deeper problems are ignored.

**Justification with Specific Reasons**

**Positive Business Impact:** The insights provide a clear indication of customer satisfaction levels and areas needing improvement. For instance, since a significant majority (69.4%) are highly satisfied, the business can study and reinforce the strategies that contribute to high satisfaction. Additionally, addressing the 15% of poor scores by understanding and resolving their causes will likely result in improved overall customer satisfaction and loyalty.

**Potential for Negative Growth:** Ignoring the insights related to low satisfaction scores or failing to act on them effectively could lead to increased dissatisfaction. For example, if the business does not address the issues faced by the 1.5% of customers who gave a score of 2, this dissatisfaction can spread, potentially leading to higher churn rates and negative word-of-mouth. Similarly, failing to balance efforts between maintaining high satisfaction levels and improving lower ones can also be detrimental.

In conclusion, the insights have the potential to drive positive business impact by highlighting areas of strength and weakness. However, careful and balanced management of these insights is crucial to avoid any negative consequences and ensure sustained growth and customer satisfaction.

#### Chart - 2 - Agent vs. Average Response Time (Bivariate with Categorical - Numerical)

In [None]:
# Chart - 2 visualization code
# Average response time per agent, converted from seconds to hours
avg_response_hours = (
    df.groupby('Agent_name')['Response_Time_seconds'].mean() / 3600
).sort_values(ascending=False)

# Show the 10 agents with the longest average response time
print(avg_response_hours.head(10).reset_index(name="Average Response Time (hours)"))
print(" ")

# Visualizing the 10 agents with the longest average response time
plt.rcParams['figure.figsize'] = (12, 7)
avg_response_hours.head(10).plot.bar(color=['violet', 'indigo', 'b', 'g', 'y', 'orange', 'r'])
plt.title("Average Response Time by Agent (10 slowest)", fontsize=20)
plt.xlabel('Agent', fontsize=15)
plt.ylabel('Average response time (hours)', fontsize=15)
plt.show()

##### 1. Why did you pick the specific chart?

Bar charts show the frequency counts of values for the different levels of a categorical or nominal variable. They can also show other statistics, such as averages or percentages.

I used a bar chart to compare the average response time across agents.

##### 2. What is/are the insight(s) found from the chart?

The chart shows the 10 agents with the longest average response times, because the values are sorted in descending order before the first 10 rows are taken.

The average response times of these agents range from 2.09 to 4.09 hours. Within this group, Elizabeth Rose and Donald Jordan have the shortest average response times, although the fastest agents overall do not appear in the chart.

At the other end, Christine Castro has the longest average response time for addressing client queries. Evaluating her performance and providing additional training could help improve the CSAT Score.
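
The chart code sorts in descending order and keeps the first ten rows, so it covers only the slowest agents. To compare both ends fairly, it can help to rank every agent and to ignore agents with very few interactions, whose averages are unreliable. The sketch below assumes a hypothetical helper named `agent_response_summary`, an illustrative `min_interactions` threshold, and made-up response times:

```python
import pandas as pd


def agent_response_summary(frame: pd.DataFrame, min_interactions: int = 2) -> pd.DataFrame:
    """Mean response time per agent in hours, fastest first."""
    stats = frame.groupby("Agent_name").agg(
        Interactions=("Response_Time_seconds", "count"),
        Mean_Seconds=("Response_Time_seconds", "mean"),
    )
    stats["Mean_Hours"] = stats["Mean_Seconds"] / 3600
    # Ignore agents whose average rests on too few interactions
    stats = stats[stats["Interactions"] >= min_interactions]
    return stats.sort_values("Mean_Hours")


# Toy data with made-up agents and response times in seconds
toy = pd.DataFrame({
    "Agent_name": ["A", "A", "B", "B", "C"],
    "Response_Time_seconds": [3600, 7200, 600, 1200, 36000],
})
ranked = agent_response_summary(toy)
print("Fastest:\n", ranked.head(1))
print("Slowest:\n", ranked.tail(1))
```

Agent C is excluded here because a single interaction is not enough to judge an average. On the real data, `agent_response_summary(df).head(10)` and `.tail(10)` would list the fastest and slowest agents respectively.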

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained can help create a positive business impact. By identifying which agents have the shortest and longest response times, businesses can take specific actions to improve overall customer satisfaction:

**Performance Recognition:** Recognizing and rewarding agents like Elizabeth Rose and Donald Jordan for their exemplary service can boost morale and set a benchmark for other agents.

**Targeted Training:** Providing additional training and support to agents like Christine Castro can help reduce response times, leading to better customer experiences and potentially higher CSAT scores.

**Resource Allocation:** Understanding the distribution of response times can help in reallocating resources and support where needed most, ensuring a more balanced and efficient customer service operation.

**Process Improvements:** Identifying bottlenecks and inefficiencies in the service process can lead to improvements that benefit all agents and customers, enhancing overall service quality.

**Are there any insights that lead to negative growth?**

There are no direct insights that would lead to negative growth; however, if not acted upon appropriately, some insights could potentially have a negative impact:

**Failure to Address Poor Performance:** If agents with high response times are not given the necessary training and support, customer dissatisfaction may continue or worsen, leading to negative reviews and loss of customers.

**Ignoring Top Performers:** Not recognizing and rewarding top-performing agents could lead to decreased motivation and performance over time, potentially affecting overall service quality.

**Overemphasis on Speed:** Focusing solely on reducing response times without maintaining quality of service might lead to rushed interactions and unresolved issues, which could harm customer satisfaction in the long run.

**Justification with Specific Reasons**

**Positive Business Impact:** By addressing the variations in response times, the business can ensure a more consistent and satisfactory customer experience. For instance, agents like Elizabeth Rose and Donald Jordan, who provide quick responses, set a standard for others. This can be leveraged through training programs to improve the performance of other agents.

**Potential for Negative Growth:** If insights are ignored, such as the need for retraining agents with higher response times like Christine Castro, customer dissatisfaction may persist. Dissatisfied customers are more likely to churn and spread negative word-of-mouth, which can harm the business’s reputation and growth prospects.

In conclusion, the insights gained will likely foster a positive business impact if acted upon effectively, leading to improved customer satisfaction and service quality. However, neglecting these insights or mismanaging the response to them could result in negative growth.

#### Chart - 3 - Box Plot on Connected handling time with CSAT Score (Bivariate)

In [None]:
# Box Plot for connected_handling_time attribute w.r.t to CSAT Score
df.boxplot(column='connected_handling_time',by='CSAT Score')

##### 1. Why did you pick the specific chart?



Box plots are used to show distributions of numeric data values, especially when you want to compare them between multiple groups. They are built to provide high-level information at a glance, offering general information about a group of data's symmetry, skew, variance, and outliers. So, I used a box plot to see the median, quartiles, minimum and maximum of the connected handling time for each CSAT Score, with the outliers clearly separated.

##### 2. What is/are the insight(s) found from the chart?

From the above boxplot, we can observe that there are a few outliers in the CSAT Scores of 4 and 5. Specifically, outliers appear when the connected handling time exceeds 750 for CSAT Score 4, and when it exceeds 1000 for CSAT Score 5. Analyzing these outliers is crucial for understanding the underlying factors contributing to these anomalies.





##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained can help create a positive business impact. Here's how:

**Targeted Improvements:** By identifying outliers in CSAT Scores when the connected handling time is high, the business can focus on improving processes that lead to long handling times. Reducing these handling times can enhance customer satisfaction.

**Quality Control:** Understanding why CSAT Scores drop when handling times increase will allow the business to implement quality control measures. This can involve additional training for agents, better resource allocation, or process optimizations.

**Customer Experience Enhancement:** By addressing the factors leading to long handling times and subsequent lower satisfaction scores, the business can improve the overall customer experience, which can lead to increased loyalty and positive word-of-mouth.

**Are there any insights that lead to negative growth?**

While the primary goal of the insights is to foster positive business impact, there could be potential risks if not managed properly:

**Overemphasis on Speed:** If the business focuses too much on reducing handling times without ensuring the quality of interactions, it might lead to rushed and ineffective customer service. This can result in unresolved issues and lower overall satisfaction.

**Neglecting Non-Outlier Data:** Focusing exclusively on outliers might lead to neglecting the broader dataset. Improvements should be holistic, ensuring that all areas of customer service are enhanced, not just those with extreme values.


**Justification with Specific Reasons**

**Positive Business Impact:** Addressing the outliers in CSAT Scores related to high handling times can directly improve customer satisfaction by ensuring quicker and more efficient service. For instance, by training agents to handle calls more effectively or by implementing better call routing systems, the business can reduce handling times and thus improve scores.

**Potential for Negative Growth:** If the business focuses solely on reducing handling times without maintaining the quality of interactions, it may lead to superficial improvements in satisfaction scores. For example, customers might experience quicker service but still be dissatisfied if their issues are not fully resolved. Additionally, neglecting other areas in need of improvement can result in an overall decline in service quality.

In conclusion, while the insights can lead to positive business impacts by targeting and resolving specific issues, a balanced and comprehensive approach is essential to avoid any potential negative growth and ensure sustainable improvements in customer satisfaction.

#### Chart - 4 - CSAT Score vs Item price (Bivariate)

In [None]:
# Chart - 4 visualization code
# Average Item_price for each CSAT Score
# (a mean multiplied by 100 is not a percentage, so the plain mean is plotted)
csat_avg_item_price = dataset.groupby('CSAT Score')['Item_price'].mean()
print(csat_avg_item_price)
print(" ")

# Set the figure size before plotting so that it applies to this figure
plt.figure(figsize=(10, 6))

# Visualizing the average item price for each CSAT Score
plt.bar(csat_avg_item_price.index, csat_avg_item_price, color=['r', 'b', 'g', 'c', 'm'])

plt.xlabel('CSAT Score', fontsize=15)
plt.ylabel('Average Item Price', fontsize=15)
plt.title('CSAT Score Wise Average Item Price', fontsize=18)
plt.xticks(csat_avg_item_price.index, fontsize=12)
plt.yticks(fontsize=12)
plt.show()

##### 1. Why did you pick the specific chart?

Bar charts show the frequency counts of values for the different levels of a categorical or nominal variable. Sometimes, bar charts show other statistics, such as percentages.

To compare the average item price across the CSAT Scores, I used a bar chart.

##### 2. What is/are the insight(s) found from the chart?

From the bar graph, it is evident that the mean item price is highest when the CSAT score is 1 and lowest when the CSAT score is 5. This indicates an inverse correlation between item price and CSAT score, suggesting that higher item prices are generally associated with lower customer satisfaction scores.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights can help create a positive business impact in the following ways:

**Pricing Strategy:** Understanding that higher item prices are associated with lower CSAT scores can guide pricing strategies. By adjusting prices or offering better value at higher price points, businesses can improve customer satisfaction.

**Customer Segmentation:** These insights can help in segmenting customers based on their spending behavior and satisfaction levels. Targeted promotions and personalized offers can be designed to enhance satisfaction among different customer segments.

**Are there any insights that lead to negative growth?**

There are potential risks if the insights are not managed properly:

**Price Reduction Risks:** Simply lowering prices to improve CSAT scores might not be sustainable and could negatively impact profitability. Businesses need to balance price adjustments with maintaining profit margins.

**Overemphasis on Price:** Focusing solely on price without addressing other factors that contribute to customer satisfaction (such as product quality, customer service, and overall experience) may not yield the desired improvement in CSAT scores.

**Justification with Specific Reasons**

**Positive Business Impact:** By aligning pricing strategies with customer expectations, businesses can enhance customer satisfaction. For instance, offering more features or better service for higher-priced items can justify the cost and improve CSAT scores. Additionally, personalized marketing strategies based on customer segments can lead to increased loyalty and repeat purchases.

**Potential for Negative Growth:** If businesses reduce prices without maintaining value, it can lead to a perception of reduced quality. For example, if a high-end product's price is reduced significantly without adding corresponding value, customers might perceive it as less premium, leading to lower sales. Additionally, overemphasizing price reductions can erode profit margins, affecting the overall financial health of the business.

In conclusion, while the insights provide valuable guidance for improving customer satisfaction and driving positive business impact, it is crucial to implement them thoughtfully. Balancing price adjustments with value enhancement and considering all factors affecting customer satisfaction will help avoid potential negative consequences and ensure sustainable growth.



#### Chart - 5- Column wise Histogram & Box Plot Univariate Analysis

In [None]:
# Chart - 5 visualization code
# Histogram of each numerical column to inspect the data distribution
for col in dataset.describe().columns:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    feature = dataset[col]
    # sns.distplot is deprecated; histplot with kde=True is its replacement
    sns.histplot(feature, kde=True, stat='density', ax=ax)
    # Dashed vertical lines: mean (magenta) and median (cyan)
    ax.axvline(feature.mean(), color='magenta', linestyle='dashed', linewidth=2)
    ax.axvline(feature.median(), color='cyan', linestyle='dashed', linewidth=2)
    ax.set_title(col)
plt.show()

# Box plot of each numerical column to inspect the spread and the outliers
for col in dataset.describe().columns:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    dataset.boxplot(col, ax=ax)
    ax.set_title('Box Plot of ' + col)
plt.show()


##### 1. Why did you pick the specific chart?

The histogram is a popular graphing tool. It is used to summarize discrete or continuous data that are measured on an interval scale. It is often used to illustrate the major features of the distribution of the data in a convenient form. It is also useful when dealing with large data sets (greater than 100 observations). It can help detect any unusual observations (outliers) or any gaps in the data.

Thus, I used histograms to analyse the distribution of each variable over the whole dataset and to check whether it is symmetric.

Box plots are used to show distributions of numeric data values, especially when you want to compare them between multiple groups. They are built to provide high-level information at a glance, offering general information about a group of data's symmetry, skew, variance, and outliers.

Thus, for each numerical variable in the given dataset, I used a box plot to analyse the outliers and the interquartile range, along with the median, maximum and minimum values.

##### 2. What is/are the insight(s) found from the chart?

The "Connected Handling Time" feature is approximately symmetrically distributed, with its mean almost equal to its median. The "Item Price" feature, however, does not follow a symmetric distribution and contains noise.
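The mean-versus-median comparison can be made independent of a column's units. The following is a minimal sketch, assuming made-up sample values (not taken from the dataset) and a hypothetical helper `pearson_skew`. It computes Pearson's second skewness coefficient, 3 × (mean − median) / standard deviation, where values near 0 suggest a symmetric distribution.

```python
import statistics


def pearson_skew(values):
    """Pearson's second skewness coefficient: 3 * (mean - median) / stdev."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    stdev = statistics.stdev(values)  # sample standard deviation
    return 3 * (mean - median) / stdev


# Illustrative values only, not taken from the dataset
symmetric_sample = [1, 2, 3, 4, 5, 6, 7]
right_skewed_sample = [1, 1, 2, 2, 3, 4, 50]

print(pearson_skew(symmetric_sample))     # zero: mean equals median
print(pearson_skew(right_skewed_sample))  # positive: long right tail
```

Because the difference is divided by the standard deviation, the same threshold can be applied to columns measured in seconds and to columns measured in currency.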









##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Just a histogram and box plot cannot define business impact. It's done just to see the distribution of the column data over the dataset.



#### Chart - 6 - Correlation Heatmap

In [None]:
# Calculate the correlation matrix
correlation_matrix = df[df.describe().columns.to_list()].corr()

# Select only the correlation of the target variable with other features
target_variable='CSAT Score'
correlation_with_target = correlation_matrix[[target_variable]].sort_values(by=target_variable, ascending=False)

# Plot the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_with_target, annot=True, cmap='coolwarm', cbar=True)
plt.title(f'Correlation of {target_variable} with Independent Numerical Features')
plt.show()


##### 1. Why did you pick the specific chart?

A correlation matrix is a table showing correlation coefficients between variables. Each cell in the table shows the correlation between two variables. A correlation matrix is used to summarize data, as an input into a more advanced analysis, and as a diagnostic for advanced analyses. The range of correlation is [-1,1].

Thus, to see the correlation between the variables along with the correlation coefficients, I used a correlation heatmap.
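To make explicit what each cell of the heatmap contains, here is a minimal sketch of the Pearson correlation coefficient computed by hand. The helper name `pearson_r` and the sample values are assumptions for illustration only; the notebook itself relies on `DataFrame.corr()`.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Illustrative values only, not taken from the dataset
handling_time = [100, 200, 300, 400, 500]
score_rising = [1, 2, 3, 4, 5]
score_falling = [5, 4, 3, 2, 1]

print(pearson_r(handling_time, score_rising))   # perfectly positive relationship
print(pearson_r(handling_time, score_falling))  # perfectly negative relationship
```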

##### 2. What is/are the insight(s) found from the chart?

Based on the above correlation heatmap, we can see that "Issue Reported," "Issue Responded," and "Connected Handling Time" are moderately positively correlated with the CSAT Score.

Additionally, "Connected Handling Time" has a positive correlation with the CSAT Score and a negative correlation with both "Response Time" and "Item Price."

All other correlations can be observed from the chart.


#### Chart - 7 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(df, hue="CSAT Score")

##### 1. Why did you pick the specific chart?

Pair plot is used to understand the best set of features to explain a relationship between two variables or to form the most separated clusters. It also helps to form some simple classification models by drawing some simple lines or make linear separation in our data-set.

Thus, I used a pair plot to analyse the patterns in the data and the relationships between the features. It complements the correlation heatmap by showing the same relationships graphically.

##### 2. What is/are the insight(s) found from the chart?

From the above chart, I found that there is little linear relationship between the variables and that the data points are not linearly separable. The customer feedback classes are clustered and overlap each other. connected_handling_time is fairly symmetric, whereas item_price and response time are clearly asymmetric. The pair plot also highlights the importance of response time, and the relationships between connected_handling_time and the other features are insightful. The remaining insights can be read from the graph above.

## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

1. **When the Mean Response Time is less than 2 hours, the Customer Satisfaction Score is 5.**
2. **When the price of an item is above 5660, the Customer Satisfaction Score falls below 3.**


### **Hypothetical Statement - 1**
**When the Mean Response Time is less than 2 hours, the Customer Satisfaction Score is 5.**

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): The mean Response Time is equal to 2 hours when the CSAT Score is 5.

Alternative Hypothesis (H1): The mean Response Time is less than 2 hours when the CSAT Score is 5.

Perform One-Sample t-test:

We will use a one-sample t-test to compare the sample mean of Response Time against the hypothesized mean of 2 hours (7200 seconds).
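The statistic behind the one-sample t-test is t = (sample mean − hypothesized mean) / (sample standard deviation / √n). The following is a minimal sketch with made-up response times (not taken from the dataset) and a hypothetical helper `one_sample_t`, shown only to make the arithmetic explicit.

```python
import math
import statistics


def one_sample_t(values, hypothesized_mean):
    """t-statistic of a one-sample t-test."""
    n = len(values)
    sample_mean = statistics.mean(values)
    # Standard error of the mean, using the sample standard deviation
    standard_error = statistics.stdev(values) / math.sqrt(n)
    return (sample_mean - hypothesized_mean) / standard_error


# Illustrative response times in seconds, not taken from the dataset
response_times = [5400, 6000, 5700, 6300, 5100]
t_stat = one_sample_t(response_times, 2 * 3600)  # benchmark: 2 hours
print(t_stat)  # negative, because the sample mean is below 7200 seconds
```

A negative t-statistic means the sample mean lies below the benchmark; its magnitude is the distance measured in standard errors.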


#### 2. Perform an appropriate statistical test.

In [None]:
import pandas as pd
from scipy.stats import ttest_1samp



# Step 1: Filter the data for CSAT Score of 5
df_csat_5 = df[df['CSAT Score'] == 5]

# Step 2: Calculate the mean Response Time
mean_response_time = df_csat_5['Response_Time_seconds'].mean()

# Step 3: Perform one-sample t-test
# Null Hypothesis: Mean Response Time = 2*3600 (2 hours converted to seconds)
hypothesized_mean = 2 * 3600  # 2 hours in seconds

# Perform the t-test
t_stat, p_value = ttest_1samp(df_csat_5['Response_Time_seconds'], hypothesized_mean)

# Print the results
print(f"Mean Response Time: {mean_response_time} seconds")
print(f"T-statistic: {t_stat}")
print(f"P-value (two-sided): {p_value}")

# Step 4: Conclusion
# H1 is one-sided (mean < 2 hours), so halve the two-sided p-value
# and require a negative t-statistic before rejecting H0
alpha = 0.05  # significance level
if t_stat < 0 and p_value / 2 < alpha:
    print("Reject the Null Hypothesis: The mean Response Time is significantly less than 2 hours when the CSAT Score is 5.")
else:
    print("Fail to Reject the Null Hypothesis: There is no significant evidence that the mean Response Time is less than 2 hours when the CSAT Score is 5.")


##### Which statistical test have you done to obtain P-Value?

I used a one-sample t-test to obtain the P-value, and the result is that the null hypothesis is rejected.

Based on the results of the one-sample t-test, the following findings can be made:

**Mean Response Time:**

The mean response time for customers who gave a CSAT Score of 5 is approximately 5706.44 seconds (about 1.58 hours).

**T-statistic and P-value:**

The t-statistic is -11.85, meaning the observed mean response time lies about 11.85 standard errors below the hypothesized mean of 7200 seconds (2 hours).
The p-value is extremely small (2.30e-32), which is far below the significance level of 0.05.

**Conclusion:**

Given the p-value is much less than the significance level of 0.05, we reject the null hypothesis.

This means there is strong statistical evidence to conclude that the mean response time for customers who rated the service with a CSAT Score of 5 is significantly less than 2 hours.

**Business Implication:**

The mean response time for customers who gave a CSAT Score of 5 is significantly below the 2-hour benchmark, which is consistent with prompt responses accompanying high customer satisfaction.

Focusing on reducing response times could therefore be a key strategy to enhance overall customer satisfaction.

This test compares a single group against a benchmark, so on its own it does not show that faster responses cause higher scores. It does support efforts to maintain or improve quick response rates in customer service operations.

##### Why did you choose the specific statistical test?

**Rationale for Choosing the One-Sample T-Test:**

**Nature of the Data:**

We have a single sample of response times for customers who gave a CSAT Score of 5.

We need to compare the mean of this sample to a known value (2 hours or 7200 seconds).

**Continuous Variable:**

Response time is a continuous variable measured in seconds.
The t-test is suitable for comparing means of continuous data.

**Comparing to a Hypothesized Value:**

The one-sample t-test is designed to determine whether the sample mean is significantly different from a known or hypothesized population mean.
In this case, we are comparing the mean response time to the hypothesized value of 7200 seconds (2 hours).

**Unknown Population Variance:**

The population variance is unknown and has to be estimated from the sample, so the t-distribution is used instead of the normal distribution.
For large samples the two distributions are almost identical, so the t-test remains appropriate whatever the sample size.

**Conclusion:**

The one-sample t-test was chosen because it effectively tests whether the mean response time for a sample (customers who rated the service with a CSAT Score of 5) is significantly different from a specified value (2 hours). The test provides a t-statistic and p-value that help determine if the observed difference is statistically significant, thereby allowing us to make an informed conclusion regarding the hypothesis.

This choice of test aligns with the objective of assessing the mean response time against a benchmark, making it a suitable and robust statistical method for this analysis.

In [None]:
import matplotlib.pyplot as plt

# Visualizing the distribution of Response Time
plt.figure(figsize=(12, 6))

# Histogram for Response Time
plt.subplot(1, 2, 1)
plt.hist(df['Response_Time_seconds'], bins=30, color='blue', edgecolor='black')
plt.title('Distribution of Response Time')
plt.xlabel('Response Time (seconds)')
plt.ylabel('Frequency')

# Histogram of Response Time for rows where the CSAT Score is 5
plt.subplot(1, 2, 2)
plt.hist(df[df['CSAT Score'] == 5]['Response_Time_seconds'], bins=30, color='green', edgecolor='black')
plt.title('Distribution of Response Time (CSAT Score = 5)')
plt.xlabel('Response Time (seconds)')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()


### **Hypothetical Statement - 2**
**When the price of an item is above 5660, the Customer Satisfaction Score falls below 3.**

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null Hypothesis (H0):** The mean CSAT score for items priced above 5660 is equal to 3.

**Alternative Hypothesis (H1):** The mean CSAT score for items priced above 5660 is less than 3.

**Test Type:** Use a one-sample t-test to compare the mean CSAT score of the filtered data to the value 3.



#### 2. Perform an appropriate statistical test.

In [None]:
import pandas as pd
from scipy.stats import ttest_1samp



# Step 1: Filter the Data
high_price_df = df[df['Item_price'] > 5660]

# Step 2: Perform a One-Sample t-test
# Null Hypothesis: Mean CSAT score is 3
# Alternative Hypothesis: Mean CSAT score is less than 3
t_stat, p_value = ttest_1samp(high_price_df['CSAT Score'], 3)

# Since it's a one-tailed test, we need to divide the p-value by 2
p_value /= 2

# Check if we reject the null hypothesis
significance_level = 0.05
reject_null = p_value < significance_level and t_stat < 0

# Print the results
print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")
if reject_null:
    print("Reject the Null Hypothesis: The mean CSAT score for items priced above 5660 is significantly less than 3.")
else:
    print("Fail to Reject the Null Hypothesis: There is no significant evidence that the mean CSAT score for items priced above 5660 is less than 3.")


**Findings Interpretation**

Fail to Reject the Null Hypothesis: Since the p-value is greater than 0.05, there is not enough evidence to conclude that the mean CSAT score for items priced above 5660 is below 3.

**Explanation of Output**

**T-statistic:** This value indicates how many standard errors the sample mean is away from the hypothesized mean. A negative t-statistic would support the alternative hypothesis that the sample mean is less than the hypothesized mean.

**P-value:** This value tells us the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true. Since we are performing a one-tailed test, the p-value is divided by 2.

**Decision Rule:** If the p-value is less than the significance level (0.05) and the t-statistic is negative, we reject the null hypothesis, indicating that the mean CSAT score for high-priced items is significantly less than 3.
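The decision rule can be sketched as a small helper. This is an illustration under stated assumptions: `one_sided_p_less` and `reject_null` are hypothetical names, and the inputs are made-up numbers rather than results from the dataset. Halving the two-sided p-value is only valid when the t-statistic points in the direction of the alternative hypothesis; otherwise the one-sided p-value is 1 − p/2.

```python
def one_sided_p_less(t_stat, two_sided_p):
    """p-value for H1: mean < hypothesized mean, from a two-sided t-test result."""
    if t_stat < 0:
        return two_sided_p / 2
    # The sample mean is above the hypothesized mean, so H1 is not supported
    return 1 - two_sided_p / 2


def reject_null(t_stat, two_sided_p, alpha=0.05):
    """True when H0 is rejected in favour of H1: mean < hypothesized mean."""
    return one_sided_p_less(t_stat, two_sided_p) < alpha


# Illustrative inputs only, not results from the dataset
print(reject_null(t_stat=-2.5, two_sided_p=0.04))  # True
print(reject_null(t_stat=2.5, two_sided_p=0.04))   # False
```

For any significance level below 0.5 this gives the same decision as checking `p_value / 2 < alpha and t_stat < 0`. In SciPy 1.6 or later, passing `alternative='less'` to `ttest_1samp` returns the one-sided p-value directly.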

##### Which statistical test have you done to obtain P-Value?

To determine whether an item price above 5660 results in customer satisfaction scores below 3, I performed a one-sample t-test.



##### Why did you choose the specific statistical test?



**Objective:** We want to compare the mean CSAT score of a subset of data (items priced above 5660) to a specific value (3).

**Type of Test:** A one-sample t-test is appropriate when you are comparing the mean of a single sample to a known or hypothesized population mean.

**Assumption:** The t-test assumes that the sample mean is approximately normally distributed. The CSAT score itself is a discrete value from 1 to 5, but for a large sample the distribution of its mean is close to normal, so the assumption is reasonable here.

In [None]:
# Visualization
plt.figure(figsize=(14, 7))

# Histogram of CSAT Scores for high priced items
plt.subplot(1, 2, 1)
sns.histplot(high_price_df['CSAT Score'], kde=True, bins=10, color='skyblue', label='CSAT Scores')
plt.axvline(x=3, color='red', linestyle='--', label='Hypothesized Mean (3)')
plt.title('Distribution of CSAT Scores for Items Priced Above 5660')
plt.xlabel('CSAT Score')
plt.ylabel('Frequency')
# Labels are attached to the artists above so the legend entries cannot be swapped
plt.legend()

# Boxplot of CSAT Scores for high priced items
plt.subplot(1, 2, 2)
sns.boxplot(y=high_price_df['CSAT Score'], color='skyblue')
plt.axhline(y=3, color='red', linestyle='--', label='Hypothesized Mean (3)')
plt.title('Boxplot of CSAT Scores for Items Priced Above 5660')
plt.ylabel('CSAT Score')
plt.legend()

plt.tight_layout()
plt.show()

Histogram: We create a histogram to visualize the distribution of CSAT scores for items priced above 5660. The hypothesized mean (3) is marked with a red dashed line.

Boxplot: We create a boxplot to visualize the spread and central tendency of the CSAT scores. The hypothesized mean (3) is again marked with a red dashed line.

These visualizations help to see how the CSAT scores are distributed around the hypothesized mean and can provide a visual confirmation of the results of the hypothesis test.

## ***6. Feature Engineering & Data Pre-processing***

In [None]:
# Creating a copy of the dataset for further feature engineering
df_new=dataset.copy()

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
# Missing Values/Null Values Count
print(df_new.isnull().sum())

# Visualizing the missing values

# Step 1: Calculate the count of missing values in each column and sort in descending order
missing_values = df_new.isnull().sum().sort_values(ascending=False)


# Step 2: Create a horizontal bar plot
plt.figure(figsize=(10, 8))
sns.barplot(x=missing_values, y=missing_values.index, orient='h')
plt.xlabel('Count of Missing Values')
plt.ylabel('Columns')
plt.title('Count of Missing Values in Each Column')
plt.show()

#### What all missing value imputation techniques have you used and why did you use those techniques?

We employed various missing value imputation techniques based on the nature of the features and the distribution of the data:

**Order_id:** As this feature is not significant for our analysis and the number of missing values is minimal, we opted to drop this column entirely.

**Customer Remarks:** With a substantial number of missing values (57165), we couldn't discard this feature as it holds crucial information. Instead, we replaced the NaN values with "Missing Reviews" to ensure we retain the textual data for analysis.

**Categorical Column Imputation (Customer city and Product Category):** Since these categorical features are vital for our analysis, we used mode imputation to fill in the missing values. Mode imputation replaces missing values with the most frequently occurring category, which keeps every row and introduces no new categories, although it increases the share of the most frequent category.

**Numerical Column Imputation (connected_handling_time and item_price):** For connected_handling_time, which follows a normal distribution with minimal outliers, we applied mean imputation to replace missing values. Conversely, for item_price, where outliers are more prominent, median imputation was utilized to ensure robustness against outliers.

**order_date_time:** Mode imputation was applied to handle missing values in this feature, as it represents datetime data. Subsequently, we converted it into datetime format to extract additional temporal features like day and month.

These techniques were selected to effectively manage missing data while preserving the integrity and utility of the dataset for subsequent analysis.
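The choice of the median for item_price can be illustrated with a small sketch. The prices below are made up for illustration and are not taken from the dataset; they show how a single extreme value shifts the mean far more than the median.

```python
import statistics

# Illustrative prices only, not taken from the dataset
prices = [500, 600, 700, 800, 900]
prices_with_outlier = prices + [100000]

# Without the outlier, mean and median agree
print(statistics.mean(prices), statistics.median(prices))

# One extreme value pulls the mean far away; the median barely moves
print(statistics.mean(prices_with_outlier), statistics.median(prices_with_outlier))
```

Filling missing values with the mean of the second list would insert a price that resembles none of the typical items, which is why the median is the safer fill value for a column with prominent outliers.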

In [None]:
# Step 1: Drop 'Order_id' column
df_new.drop(columns=['Order_id'], inplace=True)

# Step 2: Replace missing values in 'Customer Remarks' with 'Missing Reviews'
# (the result is assigned back, because fillna(inplace=True) on a selected column
# is unreliable in recent pandas versions)
df_new['Customer Remarks'] = df_new['Customer Remarks'].fillna('Missing Reviews')

# Step 3: Impute missing values in categorical columns ('Customer_City' and 'Product_category') with mode
df_new['Customer_City'] = df_new['Customer_City'].fillna(df_new['Customer_City'].mode()[0])
df_new['Product_category'] = df_new['Product_category'].fillna(df_new['Product_category'].mode()[0])

# Step 4: Impute missing values in numerical columns ('connected_handling_time' and 'Item_price')
# Impute 'connected_handling_time' with mean
df_new['connected_handling_time'] = df_new['connected_handling_time'].fillna(df_new['connected_handling_time'].mean())
# Impute 'Item_price' with median
df_new['Item_price'] = df_new['Item_price'].fillna(df_new['Item_price'].median())

# Step 5: Impute missing values in 'order_date_time' with mode
df_new['order_date_time'] = df_new['order_date_time'].fillna(df_new['order_date_time'].mode()[0])


# Display the first few rows of the DataFrame to verify changes
print(df_new.head())

In [None]:
# Re-check the missing value counts after imputation
print(df_new.isnull().sum())

# Count of remaining missing values in each column, sorted in descending order
missing_values = df_new.isnull().sum().sort_values(ascending=False)


### 2. Handling Outliers

In [None]:
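# Handling Outliers & Outlier treatments
# Separate the symmetrically distributed features from the skewed ones.
# CSAT Score is cast to string so that describe() leaves it out of the numerical columns.
# A feature counts as symmetric when |mean - median| < 0.2 (in the feature's own units).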
df_new["CSAT Score"]=df_new["CSAT Score"].astype('str')
symmetric_feature=[]
non_symmetric_feature=[]
for i in df_new.describe().columns:
  if abs(df_new[i].mean()-df_new[i].median())<0.2:
    symmetric_feature.append(i)
  else:
    non_symmetric_feature.append(i)

# Getting Symmetric Distributed Features
print("Symmetric Distributed Features : -",symmetric_feature)

# Getting Skew Symmetric Distributed Features
print("Skew Symmetric Distributed Features : -",non_symmetric_feature)




In [None]:
# For skewed features, define the upper and lower boundary as mean +/- 3 standard deviations
def outlier_treatment(df,feature):
  upper_boundary= df[feature].mean()+3*df[feature].std()
  lower_boundary= df[feature].mean()-3*df[feature].std()
  return upper_boundary,lower_boundary

In [None]:
# Restricting the data to the lower and upper boundary
for feature in non_symmetric_feature:
  # Compute both boundaries once, before any value is changed
  upper_boundary, lower_boundary = outlier_treatment(df=df_new, feature=feature)
  df_new[feature] = df_new[feature].clip(lower=lower_boundary, upper=upper_boundary)

In [None]:
# After Outlier Treatment showing the dataset distribution using strip plot
# Visualising  code for the numerical columns
for col in df_new.describe().columns:
  fig=plt.figure(figsize=(9,6))
  sns.stripplot(df_new[col])

##### What all outlier treatment techniques have you used and why did you use those techniques?

First, I converted the CSAT Score column to string, because it has only five distinct values and should be treated as a categorical column rather than a numerical one. Then I separated the symmetric and skewed features and defined an upper and a lower boundary for the skewed ones as the mean plus or minus three standard deviations. Since this is a classification problem, I restricted the values to both boundaries: values above the upper boundary are pulled down to it, and values below the lower boundary are raised to it.



For a symmetric, approximately Gaussian distribution that contains outliers, the boundaries can be set using the standard deviation, since about 99.7% of the values lie within three standard deviations of the mean.
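The standard-deviation rule can be sketched without pandas as follows. The helper names `three_sigma_bounds` and `clip_to_bounds` and the sample values are assumptions for illustration, not the notebook's own code. Both boundaries are computed once, before any value is changed.

```python
import statistics


def three_sigma_bounds(values):
    """Lower and upper boundary: mean -/+ 3 sample standard deviations."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return mean - 3 * stdev, mean + 3 * stdev


def clip_to_bounds(values, lower, upper):
    """Cap every value to the closed interval [lower, upper]."""
    return [min(max(value, lower), upper) for value in values]


# Illustrative values only: twenty ordinary readings and one extreme reading
values = [8, 9, 10, 11, 12] * 4 + [1000]
lower, upper = three_sigma_bounds(values)
clipped = clip_to_bounds(values, lower, upper)

print(lower, upper)
print(clipped[-1])  # the extreme reading is capped at the upper boundary
```

Note that the outlier itself inflates the standard deviation, so the boundaries are wide; in very small samples a single extreme value may never fall outside them.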

The box plot is a useful graphical display for describing the behavior of the data in the middle as well as at the ends of the distributions. The box plot uses the median and the lower and upper quartiles (defined as the 25th and 75th percentiles). If the lower quartile is Q1 and the upper quartile is Q3, then the difference (Q3 - Q1) is called the interquartile range or IQ. A box plot is constructed by drawing a box between the upper and lower quartiles with a solid line drawn across the box to locate the median. The following quantities (called fences) are needed for identifying extreme values in the tails of the distribution:
1. lower inner fence: Q1 - 1.5*IQ
2. upper inner fence: Q3 + 1.5*IQ
3. lower outer fence: Q1 - 3*IQ
4. upper outer fence: Q3 + 3*IQ
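
The four fences can be computed directly from the quartiles. The sketch below assumes a small hypothetical sample; the helper name `box_plot_fences` is made up for illustration:

```python
import numpy as np

def box_plot_fences(values):
    """Return the inner and outer fences computed from the quartiles."""
    q1, q3 = np.percentile(values, [25, 75])
    iq = q3 - q1  # interquartile range
    return {
        "lower_inner": q1 - 1.5 * iq,
        "upper_inner": q3 + 1.5 * iq,
        "lower_outer": q1 - 3.0 * iq,
        "upper_outer": q3 + 3.0 * iq,
    }

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 40])  # hypothetical sample
fences = box_plot_fences(values)

# Points outside the inner fences are candidate outliers
outside_inner = values[(values < fences["lower_inner"]) | (values > fences["upper_inner"])]
```

Points beyond the inner fences are usually called mild outliers, and points beyond the outer fences extreme outliers.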


### 3. Categorical Encoding

In [None]:
df_new.info()

In [None]:
df_new.drop(columns='Unique id', inplace=True)


In [None]:

# Encode your categorical columns
# Getting the categorical columns
df_new["CSAT Score"]=df_new["CSAT Score"].astype('int')
categorical_columns=list(set(df_new.columns.to_list()).difference(set(df_new.describe().columns.to_list())))
non_cat_columns=['issue_responded','order_date_time','Issue_reported at','Survey_response_Date','Customer Remarks']
categorical_columns = list(set(categorical_columns) - set(non_cat_columns))
print("Categorical Columns are :-", categorical_columns, " :- ", len(categorical_columns))

In [None]:
# Perform one-hot encoding
df_encoded = pd.get_dummies(df_new, columns=categorical_columns)

# Display the encoded DataFrame
df_encoded.head()

#### What all categorical encoding techniques have you used & why did you use those techniques?

I used one-hot encoding for all the categorical features because they are most likely nominal variables, meaning there is no inherent order or ranking among their categories. One-hot encoding represents each category as its own binary column and therefore does not impose an artificial order.
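
As a small illustration of what one-hot encoding produces, here is a sketch on a hypothetical nominal column (the column name and category values are made up):

```python
import pandas as pd

# Hypothetical nominal column; the category names are for illustration only
toy = pd.DataFrame({"channel": ["Email", "Inbound", "Outcall", "Email"]})

# Each category becomes its own indicator column; no ordering is implied
encoded = pd.get_dummies(toy, columns=["channel"], dtype=int)
```

Every row has exactly one indicator set to 1, so the encoding carries the category without suggesting that one category is larger than another.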

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

Created some new features: Response_Time_seconds, day_number_order_date, weekday_num_order_date, day_number_response_date and weekday_num_response_date.

In [None]:
# Manipulate Features to minimize feature correlation and create new features
# Ensure the 'Issue_reported at' and 'issue_responded' columns are in datetime format
df_encoded['Issue_reported at'] = pd.to_datetime(df_encoded['Issue_reported at'], format='%d/%m/%Y %H:%M')
df_encoded['issue_responded'] = pd.to_datetime(df_encoded['issue_responded'], format='%d/%m/%Y %H:%M')


# Create a new feature the response time
df_encoded['Response_Time'] = df_encoded['issue_responded'] - df_encoded['Issue_reported at']

# Convert 'Response_Time' to a numerical format in seconds for aggregation
df_encoded['Response_Time_seconds'] = df_encoded['Response_Time'].dt.total_seconds()

In [None]:
# Convert order_date_time to datetime
df_encoded['order_date_time'] = pd.to_datetime(df_encoded['order_date_time'], format='%d/%m/%Y %H:%M')

# Extract day number (day of the month)
df_encoded['day_number_order_date'] = df_encoded['order_date_time'].dt.day

# Extract weekday as a number (dt.weekday gives Monday=0 ... Sunday=6; adding 1 gives Monday=1 ... Sunday=7)
df_encoded['weekday_num_order_date'] = df_encoded['order_date_time'].dt.weekday + 1  # Monday=1, Sunday=7





# Convert 'Survey_response_Date' to datetime format
df_encoded['Survey_response_Date'] = pd.to_datetime(df_encoded['Survey_response_Date'], format='%d-%b-%y')

# Extract day number (day of the month)
df_encoded['day_number_response_date'] = df_encoded['Survey_response_Date'].dt.day

# Extract weekday as a number (dt.weekday gives Monday=0 ... Sunday=6; adding 1 gives Monday=1 ... Sunday=7)
df_encoded['weekday_num_response_date'] = df_encoded['Survey_response_Date'].dt.weekday + 1


In [None]:
# Drop Date columns after feature extraction
df_encoded.drop(columns=['order_date_time', 'Survey_response_Date','Issue_reported at','issue_responded','Response_Time'], inplace=True)

In [None]:
df_encoded.head()

#### 2. Feature Selection

In [None]:
# Checking the shape of dataset
df_encoded.shape

In [None]:
# Dropping Constant and Quasi Constant Feature
def dropping_constant(data):
    from sklearn.feature_selection import VarianceThreshold

    # Drop non-numeric columns
    numeric_data = data.select_dtypes(include=['number'])

    var_thres = VarianceThreshold(threshold=0.05)
    var_thres.fit(numeric_data)

    concol = [column for column in numeric_data.columns
              if column not in numeric_data.columns[var_thres.get_support()]]

    if "CSAT Score" in concol:
        concol.remove("CSAT Score")

    df_removed_var = data.drop(concol, axis=1)
    return df_removed_var

In [None]:
# Calling the function
df_removed_var=dropping_constant(df_encoded)

In [None]:
# Checking the shape after feature dropped
df_removed_var.shape

In [None]:
# Correlation Heatmap visualization code
# Drop non-numeric columns
numeric_data = df_removed_var.select_dtypes(include=['number'])
numeric_data.drop(columns=['CSAT Score'], inplace=True)
corr = numeric_data.corr()
cmap = sns.diverging_palette(5, 250, as_cmap=True)  # alternative palette (unused; the heatmap below uses 'coolwarm')



# Create the heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()

In [None]:
!pip install statsmodels
from statsmodels.stats.outliers_influence import variance_inflation_factor
import pandas as pd

def calculate_vif(df):
    # Compute the VIF of every column in the DataFrame passed in
    vif_data = pd.DataFrame()
    vif_data["Feature"] = df.columns
    vif_data["VIF"] = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]
    return vif_data

# numeric_data holds the numeric features (target column already dropped)
vif_results = calculate_vif(numeric_data)
print(vif_results)


In [None]:
# Drop highly correlated feature
df_removed_var.drop(columns=['weekday_num_order_date'], inplace=True)

In [None]:
# Check feature correlation to find multicollinearity
def correlation(df,threshold):
  col_corr=set()
  corr_matrix= df.corr()
  for i in range (len(corr_matrix.columns)):
    for j in range(i):
      if abs (corr_matrix.iloc[i,j])>threshold:
        colname=corr_matrix.columns[i]
        col_corr.add(colname)
  return list(col_corr)



In [None]:
# Getting multicollinear columns and dropping them
numeric_data = df_removed_var.select_dtypes(include=['number'])
numeric_data.drop(columns=['CSAT Score'], inplace=True)
highly_correlated_columns=correlation(numeric_data,0.5)

# Make sure the target column is never dropped
if "CSAT Score" in highly_correlated_columns:
  highly_correlated_columns.remove("CSAT Score")

df_removed=df_removed_var.drop(highly_correlated_columns,axis=1)
df_removed.shape

In [None]:
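def calculate_vif(df):
    # Compute the VIF of every column in the DataFrame passed in
    vif_data = pd.DataFrame()
    vif_data["Feature"] = df.columns
    vif_data["VIF"] = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]
    return vif_data

# Re-check the VIFs on the numeric features that remain after dropping correlated columns
remaining_numeric = df_removed.select_dtypes(include=['number']).drop(columns=['CSAT Score'])
vif_results = calculate_vif(remaining_numeric)
print(vif_results)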

In [None]:
# After Feature Selection checking the shape left with
df_removed.shape

In [None]:
df_removed.isnull().sum()

##### What all feature selection methods have you used  and why?

I dropped constant and quasi-constant features, dropped columns showing multicollinearity, and validated the result with VIF.

The variance threshold is a feature selector that removes all low-variance features. It looks only at the features (X), not the desired outputs (Y), and can therefore also be used for unsupervised learning.

A Pearson correlation is a number between -1 and 1 that indicates the extent to which two variables are linearly related. The Pearson correlation is also known as the “product moment correlation coefficient” (PMCC) or simply “correlation”

Pearson correlations are suitable only for metric variables. The correlation coefficient takes values between -1 and 1:

• A value closer to 0 implies weaker correlation (exact 0 implying no correlation)

• A value closer to 1 implies stronger positive correlation

• A value closer to -1 implies stronger negative correlation

Collinearity is the state where two variables are highly correlated and contain similar information about the variance within a given dataset. To detect collinearity among variables, create a correlation matrix and look for pairs of variables whose correlation has a large absolute value.
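
The correlation-matrix check can be sketched as follows, assuming synthetic data in which column `b` is nearly a linear copy of column `a`. The column names and the 0.5 threshold are illustrative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),  # nearly a linear copy of "a"
    "c": rng.normal(size=200),                      # independent noise
})

corr = toy.corr(method="pearson")
threshold = 0.5

# Collect each pair once (upper triangle) when |r| exceeds the threshold
pairs = [
    (corr.columns[i], corr.columns[j], corr.iloc[i, j])
    for i in range(len(corr.columns))
    for j in range(i + 1, len(corr.columns))
    if abs(corr.iloc[i, j]) > threshold
]
```

Only one column of each flagged pair needs to be dropped, since the other already carries most of the shared information.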

Steps for Implementing VIF

• Calculate the VIF factors.

• Inspect the factors for each predictor variable; if the VIF is between 5–10, multicollinearity is likely present and you should consider dropping the variable.

In VIF method, we pick each feature and regress it against all of the other features. For each regression, the factor is calculated as :

$$VIF = \frac{1}{1 - R^2}$$

where $R^2$ is the coefficient of determination obtained from that linear regression; its value lies between 0 and 1.
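
To make the formula concrete, here is a minimal NumPy sketch that regresses each feature on the others and applies 1 / (1 - R^2). The data are synthetic and the helper name `vif` is hypothetical; this is not the statsmodels implementation:

```python
import numpy as np

def vif(X, i):
    """VIF of column i: regress it on the other columns (with an intercept)."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    r_squared = 1.0 - residuals.var() / y.var()
    return 1.0 / (1.0 - r_squared)

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)   # strongly collinear with x1
x3 = rng.normal(size=500)                   # unrelated feature
X = np.column_stack([x1, x2, x3])

vifs = [vif(X, i) for i in range(X.shape[1])]
```

Note that this sketch adds an intercept to each regression. `variance_inflation_factor` in statsmodels regresses on exactly the columns it is given, so its values can differ unless a constant column is included in the input.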

First, I dropped the columns with constant or quasi-constant variance. Then, using the Pearson correlation, I removed the columns showing multicollinearity and validated the VIF of each remaining feature. Some features still had a VIF above the usual 5-10 range, so I took 8 as the cut-off, manipulated some features again, and dropped further multicollinear columns until every VIF was below 8. The number of features decreased from 77 to 10.

##### Which all features you found important and why?

### 5. Data Transformation

In [None]:
# Getting symmetric and skewed features from the columns
symmetric_feature=[]
non_symmetric_feature=[]
for i in df_removed.describe().columns:
  if abs(df_removed[i].mean()-df_removed[i].median())<0.25:
    symmetric_feature.append(i)
  else:
    non_symmetric_feature.append(i)

# Getting Symmetric Distributed Features
print("Symmetric Distributed Features : -",symmetric_feature)
# Removing 'CSAT Score' from the skewed-feature list: it is the target label
# and must not be transformed.
non_symmetric_feature.remove('CSAT Score')

# Getting Skew Symmetric Distributed Features
print("Skew Symmetric Distributed Features : -",non_symmetric_feature)

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

**First Transformation**

In [None]:
# Transform your data
# Applying power (root) transformations to the required columns
df_removed['Item_price']=np.sqrt(df_removed['Item_price'])
df_removed['Response_Time_seconds']=np.sqrt(df_removed['Response_Time_seconds'])
df_removed['day_number_order_date']=(df_removed['day_number_order_date'])**0.25
df_removed['day_number_response_date']=(df_removed['day_number_response_date'])**0.25


In [None]:
df_removed.isnull().sum()

In [None]:
#Fill NaN values with the median of Response_Time_seconds columns
df_removed['Response_Time_seconds'] = df_removed['Response_Time_seconds'].fillna(df_removed['Response_Time_seconds'].median())


In [None]:
# Getting symmetric and skewed features from the columns
symmetric_feature=[]
non_symmetric_feature=[]
for i in df_removed.describe().columns:
  if abs(df_removed[i].mean()-df_removed[i].median())<0.25:
    symmetric_feature.append(i)
  else:
    non_symmetric_feature.append(i)

# Getting Symmetric Distributed Features
print("Symmetric Distributed Features : -",symmetric_feature)
# Removing 'CSAT Score' from the skewed-feature list: it is the target label
# and must not be transformed.
non_symmetric_feature.remove('CSAT Score')

# Getting Skew Symmetric Distributed Features
print("Skew Symmetric Distributed Features : -",non_symmetric_feature)

**Second Transformation**

In [None]:
df_removed['Response_Time_seconds'] = np.sqrt(df_removed['Response_Time_seconds'])
df_removed['Item_price'] = (df_removed['Item_price'])**0.25

In [None]:
# Getting symmetric and skewed features from the columns
symmetric_feature=[]
non_symmetric_feature=[]
for i in df_removed.describe().columns:
  if abs(df_removed[i].mean()-df_removed[i].median())<0.25:
    symmetric_feature.append(i)
  else:
    non_symmetric_feature.append(i)

# Getting Symmetric Distributed Features
print("Symmetric Distributed Features : -",symmetric_feature)
# Removing 'CSAT Score' from the skewed-feature list: it is the target label
# and must not be transformed.
non_symmetric_feature.remove('CSAT Score')

# Getting Skew Symmetric Distributed Features
print("Skew Symmetric Distributed Features : -",non_symmetric_feature)

**Third Transformation**

In [None]:
# Perform sqrt transform on 'Response_Time_seconds' column
df_removed['Response_Time_seconds'] = np.sqrt(df_removed['Response_Time_seconds'])

In [None]:
# Getting symmetric and skewed features from the columns
symmetric_feature=[]
non_symmetric_feature=[]
for i in df_removed.describe().columns:
  if abs(df_removed[i].mean()-df_removed[i].median())<0.25:
    symmetric_feature.append(i)
  else:
    non_symmetric_feature.append(i)

# Getting Symmetric Distributed Features
print("Symmetric Distributed Features : -",symmetric_feature)
# Removing 'CSAT Score' from the skewed-feature list: it is the target label
# and must not be transformed.
non_symmetric_feature.remove('CSAT Score')

# Getting Skew Symmetric Distributed Features
print("Skew Symmetric Distributed Features : -",non_symmetric_feature)

In [None]:
# Visualizing a histogram for each column to inspect the data distribution
for col in df_removed.loc[:,symmetric_feature]:
  fig=plt.figure(figsize=(9,6))
  ax=fig.gca()
  feature= (df_removed[col])
  # histplot with kde=True replaces the deprecated sns.distplot
  sns.histplot(feature, kde=True, ax=ax)
  ax.axvline(feature.mean(),color='magenta', linestyle='dashed', linewidth=2)
  ax.axvline(feature.median(),color='cyan', linestyle='dashed', linewidth=2)
  ax.set_title(col)
plt.show()

From the features, I found that two are not symmetric and therefore do not follow a Gaussian distribution, while the rest have a symmetric curve. For those two columns I applied a power (root) transformation to bring them closer to a Gaussian distribution.

I also tried other transformations and found that the power transformation produces no infinite values and works well. So, I am continuing with the power transformation with an exponent of 0.25.
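
The effect of root transformations on skewness can be sketched on synthetic data. The right-skewed sample below is hypothetical (drawn from an exponential distribution) and only stands in for a feature such as a response time:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Hypothetical right-skewed, non-negative feature (e.g. a duration)
raw = pd.Series(rng.exponential(scale=100.0, size=5000))

skewness = {
    "raw": raw.skew(),
    "sqrt": np.sqrt(raw).skew(),
    "fourth_root": (raw ** 0.25).skew(),
}

# A log transform maps 0 to -inf, whereas a root transform maps 0 to 0
with np.errstate(divide="ignore"):
    log_of_zero = np.log(0.0)
root_of_zero = 0.0 ** 0.25
```

On this sample the skewness shrinks with each stronger root, which is the behaviour the repeated transformations in this section rely on. Roots of negative inputs are undefined and yield NaN, so such values need separate handling.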


### 6. Data Scaling

In [None]:
# Scaling your data
# Checking the data
df_removed.head()

##### Which method have you used to scale you data and why?

In [None]:
final_df=df_removed.copy()

In [None]:
y=df_removed['CSAT Score']

In [None]:
type(y)

In [None]:
df_removed.drop(columns=['CSAT Score'],inplace=True)

In [None]:
df_removed.head()

In [None]:
import joblib
from sklearn.preprocessing import StandardScaler

# Select only the numerical columns from df_removed
numerical_columns = df_removed.select_dtypes(include=['number']).columns

# Initialize the StandardScaler
scaler = StandardScaler()

# Apply the scaler to the numerical columns
df_removed[numerical_columns] = scaler.fit_transform(df_removed[numerical_columns])

# Save the fitted scaler
joblib.dump(scaler, "scaler.pkl")

# Display the scaled DataFrame
df_removed.head()



In [None]:
numerical_columns


In [None]:
df_removed.isnull().sum()

When you are using an algorithm that assumes your features have a similar range, you should use feature scaling.

If the ranges of your features differ a lot, you should use feature scaling. If the ranges do not vary much, for example one feature lies between 0 and 2 and another between -1 and 0.5, you can leave them as they are. However, you should use feature scaling if the ranges are, for example, between -2 and 2 and between -100 and 100.

Use Standardization when your data follows Gaussian distribution.
Use Normalization when your data does not follow Gaussian distribution.

In my data, the numerical columns differ widely in range and, after the transformations above, roughly follow a Gaussian distribution. That's why I used standardization with StandardScaler.
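
The difference between standardization and normalization can be shown on a few hypothetical values. This NumPy sketch mirrors the formulas only and is not the scikit-learn code itself:

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 100.0])  # hypothetical feature values

# Standardization (z-score): zero mean and unit variance
standardized = (x - x.mean()) / x.std()

# Normalization (min-max): rescale to the [0, 1] interval
normalized = (x - x.min()) / (x.max() - x.min())
```

`StandardScaler` uses the population standard deviation, which matches the default of `np.std` used here.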


### 7. Textual Data Preprocessing
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)



#### **Importing necessary libraries for text preprocessing**

In [None]:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download necessary resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

In [None]:
nltk.data.path

#### 1. Expand Contraction

In [None]:
# Expand Contraction
# Function to expand contractions
def expand_contractions(text):
    # Dictionary of English contractions
    contractions = {
        "ain't": "am not",
        "aren't": "are not",
        "can't": "cannot",
        "could've": "could have",
        "couldn't": "could not",
        "didn't": "did not",
        "doesn't": "does not",
        "don't": "do not",
        "hadn't": "had not",
        "hasn't": "has not",
        "haven't": "have not",
        "he'd": "he would",
        "he'll": "he will",
        "he's": "he is",
        "how'd": "how did",
        "how'll": "how will",
        "how's": "how is",
        "i'd": "i would",
        "i'll": "i will",
        "i'm": "i am",
        "i've": "i have",
        "isn't": "is not",
        "it'd": "it would",
        "it'll": "it will",
        "it's": "it is",
        "let's": "let us",
        "might've": "might have",
        "must've": "must have",
        "shan't": "shall not",
        "she'd": "she would",
        "she'll": "she will",
        "she's": "she is",
        "should've": "should have",
        "shouldn't": "should not",
        "that'd": "that would",
        "that's": "that is",
        "there's": "there is",
        "they'd": "they would",
        "they'll": "they will",
        "they're": "they are",
        "they've": "they have",
        "wasn't": "was not",
        "we'd": "we would",
        "we'll": "we will",
        "we're": "we are",
        "we've": "we have",
        "weren't": "were not",
        "what'll": "what will",
        "what're": "what are",
        "what's": "what is",
        "what've": "what have",
        "when's": "when is",
        "where'd": "where did",
        "where's": "where is",
        "who'll": "who will",
        "who's": "who is",
        "who've": "who have",
        "why's": "why is",
        "won't": "will not",
        "would've": "would have",
        "wouldn't": "would not",
        "y'all": "you all",
        "you'd": "you would",
        "you'll": "you will",
        "you're": "you are",
        "you've": "you have"
    }
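    # Regular expression pattern to match contractions (case-insensitive,
    # because this step runs before lower casing)
    pattern = re.compile(r'\b(' + '|'.join(re.escape(key) for key in contractions) + r')\b', flags=re.IGNORECASE)
    # Replace contractions with their expansions (dictionary keys are lower case)
    expanded_text = pattern.sub(lambda match: contractions[match.group(0).lower()], text)
    return expanded_text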
#Apply text preprocessing to the 'Customer Remarks' feature
df_removed['Customer Remarks'] = df_removed['Customer Remarks'].apply(expand_contractions)

#### 2. Lower Casing

In [None]:
# Lower Casing
df_removed['Customer Remarks'] = df_removed['Customer Remarks'].str.lower()

#### 3. Removing Punctuations

In [None]:
# Remove Punctuations
df_removed['Customer Remarks'] = df_removed['Customer Remarks'].apply(lambda x: re.sub(r'[^\w\s]', '', x))

#### 4. Removing URLs & Removing Words Containing Digits

In [None]:
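# Remove URLs
# Note: the pattern relies on '://' and '.', so it only matches URLs whose
# punctuation has not already been stripped by an earlier step.
# Function to remove URLs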
def remove_urls(text):
    # Regular expression pattern to match URLs
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    # Remove URLs from the text
    cleaned_text = url_pattern.sub('', text)
    return cleaned_text

df_removed['Customer Remarks'] = df_removed['Customer Remarks'].apply(remove_urls)

In [None]:
# Function to remove words containing digits
def remove_digits(text):
    # Regular expression pattern to match words containing digits
    digit_pattern = re.compile(r'\w*\d\w*')
    # Remove words containing digits from the text
    cleaned_text = digit_pattern.sub('', text)
    return cleaned_text

df_removed['Customer Remarks'] = df_removed['Customer Remarks'].apply(remove_digits)

#### 5. Removing Stopwords & Removing White spaces

In [None]:
#Removing Stopwords

# Get the English stopwords
stop_words = set(stopwords.words('english'))

# Function to remove stopwords without tokenization
def remove_stopwords(text):
    filtered_tokens = [word for word in text.split() if word.lower() not in stop_words]
    return ' '.join(filtered_tokens)

# Remove stopwords
df_removed['Customer Remarks'] = df_removed['Customer Remarks'].apply(remove_stopwords)

In [None]:
#Removing White spaces
df_removed['Customer Remarks'] = df_removed['Customer Remarks'].str.strip()

#### 6. Tokenization

In [None]:
# Tokenization

# Tokenize text
df_removed['Customer Remarks'] = df_removed['Customer Remarks'].apply(nltk.word_tokenize)

#### 7. Text Normalization

In [None]:
import nltk
import subprocess

# Download and unzip wordnet
try:
    nltk.data.find('wordnet.zip')
except LookupError:
    nltk.download('wordnet', download_dir='/kaggle/working/')
    command = "unzip /kaggle/working/corpora/wordnet.zip -d /kaggle/working/corpora"
    subprocess.run(command.split())
    nltk.data.path.append('/kaggle/working/')

# Now you can import the NLTK resources as usual
from nltk.corpus import wordnet

In [None]:
from nltk.stem import WordNetLemmatizer
# Initialize the WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [None]:
# Function for text normalization
def normalize_text(text):

    # Check if input is a list
    if isinstance(text, list):
        # Initialize an empty list to store normalized text
        normalized_texts = []
        # Iterate over each element in the list
        for item in text:
            # Lemmatize each word and convert to lowercase
            normalized_words = [lemmatizer.lemmatize(word.lower()) for word in item.split()]
            # Join the normalized words back into a string
            normalized_text = ' '.join(normalized_words)
            # Remove non-alphanumeric characters
            normalized_text = re.sub(r'[^a-zA-Z0-9\s]', '', normalized_text)
            # Append the normalized text to the list
            normalized_texts.append(normalized_text)
        return normalized_texts
    else:
        # Lemmatize each word and convert to lowercase
        normalized_words = [lemmatizer.lemmatize(word.lower()) for word in text.split()]
        # Join the normalized words back into a string
        normalized_text = ' '.join(normalized_words)
        # Remove non-alphanumeric characters
        normalized_text = re.sub(r'[^a-zA-Z0-9\s]', '', normalized_text)
        return normalized_text

# Apply text normalization to the 'Customer Remarks' column
df_removed['Customer Remarks'] = df_removed['Customer Remarks'].apply(normalize_text)


##### Which text normalization technique have you used and why?

 I used lemmatization and removal of non-alphanumeric characters for text normalization.

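**Lemmatization:** Lemmatization reduces words to their base or root form, which helps in standardizing variations of words. For example, words like "running", "ran", and "runs" can all be reduced to the base form "run". Note that `WordNetLemmatizer.lemmatize` treats every word as a noun by default, so verb forms such as "running" or "ran" are only reduced to "run" when the part of speech is passed explicitly (`pos='v'`). This technique ensures that different forms of the same word are treated as identical, which can improve the effectiveness of text analysis tasks such as sentiment analysis or topic modeling.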

**Removal of Non-Alphanumeric Characters:** This step removes any characters that are not letters or numbers from the text. Non-alphanumeric characters, such as punctuation marks and special symbols, do not typically carry meaningful information for many natural language processing tasks. Removing them helps to simplify the text and focus on the essential content, improving the efficiency of subsequent analyses.

By employing lemmatization and removal of non-alphanumeric characters, the text normalization process aims to standardize and clean the textual data, making it more suitable for further analysis or modeling.

#### 8. Text Vectorization

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Convert list of strings to a single string
df_removed['Customer Remarks'] = df_removed['Customer Remarks'].apply(lambda x: ' '.join(x) if isinstance(x, list) else x)

# Initialize the TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the 'Customer Remarks' column
customer_remarks_tfidf = tfidf_vectorizer.fit_transform(df_removed['Customer Remarks'])

# Convert the vectors to an array
customer_remarks_tfidf_array = customer_remarks_tfidf.toarray()

# Get the feature names (vocabulary)
feature_names_tfidf = tfidf_vectorizer.get_feature_names_out()

# Create a DataFrame for the TF-IDF vectors
customer_remarks_tfidf_df = pd.DataFrame(customer_remarks_tfidf_array, columns=feature_names_tfidf)


In [None]:
# Reset the index of 'df_removed' for proper concatenation
#df_removed.reset_index(drop=True, inplace=True)

# Concatenate the DataFrames along the columns axis
#df_combined = pd.concat([df_removed, customer_remarks_tfidf_df], axis=1)

# Now 'df_combined' contains all the columns from 'df_removed' and the TF-IDF vectors as additional features

In [None]:
#df_combined.shape

##### Which text vectorization technique have you used and why?

In the code provided, I used the TF-IDF (Term Frequency-Inverse Document Frequency) vectorization technique.

TF-IDF is a popular text vectorization technique that reflects the importance of a word in a document relative to a collection of documents (corpus). It considers both the frequency of the word in the document (TF) and the rarity of the word across the entire corpus (IDF). Words that appear frequently in a document but rarely in other documents are given higher weights.

I chose TF-IDF vectorization because it helps capture the importance of words in the 'Customer Remarks' column while also reducing the impact of commonly occurring words across all remarks. This technique is suitable for tasks such as text classification, clustering, and sentiment analysis, where it's important to identify unique features in the text data.
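
A small sketch of how TF-IDF weights a common word against a rare one, assuming three made-up remarks (the texts are hypothetical and not from the dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical remarks: "delivery" occurs in every document, "late" in only one
remarks = ["delivery late", "delivery good", "delivery damaged product"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(remarks)   # sparse matrix, one row per remark
vocab = vectorizer.vocabulary_              # maps each term to its column index

first_row = tfidf[0].toarray().ravel()
weight_common = first_row[vocab["delivery"]]
weight_rare = first_row[vocab["late"]]
```

In the first remark the rarer word "late" receives a higher weight than "delivery", which appears in every remark. Keeping the matrix sparse, rather than calling `.toarray()` on the whole corpus, saves memory when the vocabulary is large.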

In [None]:
df_removed.drop(columns=['Customer Remarks'], inplace=True)

### 9. Data Splitting

In [None]:
#One Hot Encoding of Target Variable

from sklearn.preprocessing import OneHotEncoder

# Extract the target variable
y = final_df['CSAT Score'].values.reshape(-1, 1)

# Initialize the OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)

# Fit and transform the target variable
y_one_hot = encoder.fit_transform(y)

# Convert to pandas DataFrame
y_one_hot_df = pd.DataFrame(y_one_hot, columns=[f'class_{int(i)}' for i in range(y_one_hot.shape[1])])


In [None]:
# Split your data into train and test sets. Choose the splitting ratio wisely.
# Split into a 70:30 ratio

X_train, X_test, y_train, y_test = train_test_split(df_removed,y_one_hot_df, test_size = 0.3, random_state = 0)

# Describe the shapes of the train and test sets
print("Shape of X_train dataset: ", X_train.shape)
print("Shape of y_train dataset: ", y_train.shape)
print("Shape of X_test dataset: ", X_test.shape)
print("Shape of y_test dataset: ", y_test.shape)

In [None]:
features_list=X_train.columns.to_list()
import joblib
# Save the list of feature names used for training
joblib.dump(features_list, "features.pkl")

##### What data splitting ratio have you used and why?

There are two competing concerns: with less training data, your parameter estimates have greater variance. With less testing data, your performance statistic will have greater variance. Broadly speaking you should be concerned with dividing data such that neither variance is too high, which is more to do with the absolute number of instances in each category rather than the percentage.

If you have a total of 100 instances, you're probably stuck with cross validation as no single split is going to give you satisfactory variance in your estimates. If you have 100,000 instances, it doesn't really matter whether you choose an 80:20 split or a 90:10 split (indeed you may choose to use less training data if your method is particularly computationally intensive).

You'd be surprised to find out that 80/20 is quite a commonly occurring ratio, often referred to as the Pareto principle. It's usually a safe bet if you use that ratio.

In this case the dataset is relatively small, which is why I chose a 70:30 ratio.
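
For reference, a 70:30 split can also be stratified so that both parts keep the same class proportions. The sketch below uses hypothetical labels, and the `stratify` argument is an option that the split in this notebook does not use:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 samples of class 5 and 20 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([5] * 80 + [1] * 20)

# stratify=y keeps the 80:20 class ratio in both the train and the test part
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
```

With stratification, the rare class is guaranteed to appear in the test set in proportion, which makes the evaluation less dependent on the random seed.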

### 10. Handling Imbalanced Dataset

In [None]:
# Chart - 1 visualization code
# Dependant Column Value Counts
print(y_train.value_counts())
print(" ")
# Dependant Variable Column Visualization
y_train.value_counts().plot(kind='pie',
                            figsize=(15,6),
                            labels=['5','1','4','3','2'])

##### Do you think the dataset is imbalanced? Explain Why.

Class imbalance is relevant primarily in the context of supervised machine learning involving two or more classes.

Imbalance means that the number of data points available for the different classes is different. If there are two classes, balanced data would mean 50% of the points for each class. For most machine learning techniques, a little imbalance is not a problem: if 60% of the points belong to one class and 40% to the other, it should not cause any significant performance degradation. Only when the class imbalance is high, e.g. 90% of the points for one class and 10% for the other, may standard optimization criteria or performance measures become less effective and need modification.

In our case, the ratio in the dependent column is 85:15. During model building, the model will therefore be biased and is very likely to predict the majority class most of the time. So the dataset should be balanced before moving on to model creation.

In [None]:
!pip install imbalanced-learn==0.8.0


In [None]:
# Handling Imbalance in the Target Variable using S.M.O.T.E

# Convert the one-hot encoded DataFrame back to a Series of original class labels to apply SMOTE
y_series = y_train.idxmax(axis=1).apply(lambda x: int(x.split('_')[1]))

from imblearn.over_sampling import SMOTE

# Create an instance of SMOTE
sm = SMOTE(random_state=42)

# Resample the training data using SMOTE
X_train_resampled, y_train_resampled = sm.fit_resample(X_train, y_series)

# Describe the shapes of the resampled train set and the test set
print("Shape of resampled X_train dataset: ", X_train_resampled.shape)
print("Shape of resampled y_train dataset: ", y_train_resampled.shape)
print("Shape of X_test dataset: ", X_test.shape)
print("Shape of y_test dataset: ", y_test.shape)

In [None]:
# Converting the resampled training target back to one-hot format so that its shape matches the test target

# Extract the target variable
y = y_train_resampled.values.reshape(-1, 1)

# Initialize the OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)

# Fit and transform the target variable
y_one_hot = encoder.fit_transform(y)

# Convert y_train_resampled_one_hot back to DataFrame for consistency
y_train_resampled_df = pd.DataFrame(y_one_hot, columns=[f'class_{int(i)}' for i in range(y_one_hot.shape[1])])

In [None]:
# Chart - 1 visualization code
# Dependant Column Value Counts
print(y_train_resampled_df.value_counts())
print(" ")
# Dependant Variable Column Visualization
y_train_resampled_df.value_counts().plot(kind='pie',
                                         figsize=(15,6),
                                         labels=['1','2','3','4','5'])

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

I used SMOTE (Synthetic Minority Over-sampling Technique) to balance the 85:15 dataset.

SMOTE is a technique in machine learning for dealing with the issues that arise when working with an imbalanced dataset. In practice, imbalanced datasets are common and most ML algorithms are sensitive to class imbalance, so techniques like SMOTE are used to improve their performance.

To address this disparity, balancing schemes that augment the data to make it more balanced before training the classifier were proposed. Oversampling the minority class by duplicating minority samples or undersampling the majority class is the simplest balancing method.

The idea of incorporating synthetic minority samples into tabular data was first proposed in SMOTE, where synthetic minority samples are generated by interpolating pairs of original minority points.

SMOTE is a data augmentation algorithm that creates synthetic data points from raw data. SMOTE can be thought of as a more sophisticated version of oversampling or a specific data augmentation algorithm.

SMOTE has the advantage of not creating duplicate data points, but rather synthetic data points that differ slightly from the original data points. SMOTE is a superior oversampling option.

That's why for lots of advantages, I have used SMOTE technique for balancinmg the dataset.
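
The interpolation idea described above can be sketched in a few lines of plain Python. This is only an illustration of how one synthetic point is formed between two minority samples, not the implementation used by the SMOTE library in this notebook. The function name `smote_like_sample` and the sample coordinates are hypothetical.

```python
import random

def smote_like_sample(point, neighbor, rng):
    """Return a synthetic point on the segment between two minority samples."""
    gap = rng.random()  # interpolation factor in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbor)]

rng = random.Random(0)            # fixed seed so the sketch is repeatable
minority_point = [1.0, 2.0]       # hypothetical minority-class sample
nearest_neighbor = [3.0, 6.0]     # hypothetical neighbour from the same class

synthetic = smote_like_sample(minority_point, nearest_neighbor, rng)
print(synthetic)
```

Because the same factor is applied to every feature, the new point always lies on the straight line between the two original samples, so it is similar to both without duplicating either.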


In [None]:

# Further splitting the training data into training and validation sets (70:15:15 ratio)
#X_train, X_val, y_train, y_val = train_test_split(X_train_resampled, y_train_resampled_df, test_size=0.2, random_state=0)


In [None]:
# Describe info about train and test set
print("Number of transactions in X_train dataset: ", X_train_resampled.shape)
print("Number of transactions in y_train dataset: ", y_train_resampled_df.shape)
print("Number of transactions in X_test dataset: ", X_test.shape)
print("Number of transactions in y_test dataset: ", y_test.shape)

## ***7. ML Model Implementation***

### DL Model - 1 - **Deep Learning ANN Classification Model**

### **Step 1: Install Required Libraries**

In [None]:
!pip install tensorflow==2.15.0

In [None]:
!pip install scikeras

In [None]:
import tensorflow as tf
print(tf.__version__)

### **Step 2: Import Libraries**

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, classification_report
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
from scikeras.wrappers import KerasClassifier
import matplotlib.pyplot as plt

In [None]:
# Additional imports not already covered by the previous cell
import tensorflow as tf
from sklearn.model_selection import KFold
from tensorflow.keras.regularizers import l2

### **Step 3: Ensuring the target labels are in the correct format**

In [None]:
# Ensure target labels are numerical and feature arrays are float
y_train_numerical = y_train_resampled_df.astype(int)
y_test_numerical = y_test.astype(int)

# Convert DataFrame to numpy array and ensure float32 type
X_train_array = X_train_resampled.values.astype(np.float32)
X_test_array = X_test.values.astype(np.float32)

# Ensure target labels are numpy arrays
y_train_array = np.array(y_train_numerical)
y_test_array = np.array(y_test_numerical)


In [None]:
# Get input dimensions from the arrays prepared above
input_dim = X_train_array.shape[1]
num_classes = y_train_array.shape[1]  # one column per one-hot encoded class
input_dim, num_classes

### **Step 4: Define the ANN Model**

**Adding a Learning Rate Scheduler**

First, define a learning rate scheduler function:

In [None]:
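from tensorflow.keras.callbacks import LearningRateScheduler

def scheduler(epoch, lr):
    # Keep the initial learning rate for the first 10 epochs,
    # then decay it by a factor of exp(-0.1) on every later epoch
    if epoch < 10:
        return lr
    # Return a plain Python float rather than a tensor
    return float(lr * np.exp(-0.1))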
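
To get a feel for what this schedule does over a full run, the sketch below replays it outside Keras. It uses `math.exp` instead of TensorFlow so that it runs on its own, and it assumes a starting rate of 0.001 (Adam's default in Keras) and 30 epochs, matching the `epochs=30` setting used later.

```python
import math

def scheduler(epoch, lr):
    """Hold the rate for 10 epochs, then decay it by exp(-0.1) per epoch."""
    if epoch < 10:
        return lr
    return lr * math.exp(-0.1)

lr = 0.001   # assumed starting learning rate
rates = []
for epoch in range(30):
    lr = scheduler(epoch, lr)   # Keras passes in the current rate each epoch
    rates.append(lr)

for epoch in (0, 9, 10, 19, 29):
    print(f"epoch {epoch:2d}: lr = {rates[epoch]:.6f}")
```

Epochs 10 through 29 each apply one decay step, so the final rate is the starting rate multiplied by exp(-2), roughly 13.5% of the original value.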

In [None]:
from tensorflow import keras

# Dropout rate
dropout_rate = 0.5

# Define the neural network model with BatchNormalization and Dropout layers
# (l2() with no arguments uses Keras' default regularization factor of 0.01)
neural_classifier = Sequential(
    [
        Dense(128, activation="relu", kernel_regularizer=l2(), input_dim=input_dim),
        BatchNormalization(),
        Dropout(dropout_rate),

        Dense(96, activation="relu", kernel_regularizer=l2()),
        BatchNormalization(),
        Dropout(dropout_rate),

        Dense(64, activation="relu", kernel_regularizer=l2()),
        BatchNormalization(),
        Dropout(dropout_rate),

        Dense(32, activation="relu", kernel_regularizer=l2()),
        BatchNormalization(),
        Dropout(dropout_rate),

        Dense(num_classes, activation="softmax")
    ]
)

# Print the model summary
neural_classifier.summary()
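
When reading the summary, it can help to know where the parameter counts come from. The sketch below recomputes them by hand for the same layer widths. The input width of 20 features is a hypothetical value chosen for illustration, since the real `input_dim` depends on the preprocessed dataset, and the helper names are not part of the notebook.

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

def batchnorm_params(units):
    """Gamma, beta, moving mean and moving variance for each unit."""
    return 4 * units

input_dim = 20                 # hypothetical number of input features
num_classes = 5                # CSAT scores 1 to 5
hidden_units = [128, 96, 64, 32]

total = 0
n_in = input_dim
for units in hidden_units:
    total += dense_params(n_in, units) + batchnorm_params(units)
    n_in = units
total += dense_params(n_in, num_classes)   # softmax output layer

print("Total parameters:", total)
```

Dropout layers add no parameters, and only half of each BatchNormalization layer's parameters (gamma and beta) are trainable, which is why the summary reports trainable and non-trainable counts separately.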

### **Step 5: Define and Initialize the Keras Classifier model**

In [None]:
# Initialize the scikeras wrapper around the Keras model
scikeras_classifier = KerasClassifier(
    model=neural_classifier,
    optimizer="adam",
    loss=keras.losses.categorical_crossentropy,
    batch_size=4000,
    epochs=30,
    metrics=['accuracy'],
    random_state=42,
    warm_start=True,
)

### **Step 6: Initialize StratifiedKFold Cross Validation (no. of folds=3)**

In [None]:
# Define number of folds
n_folds = 3

# Initialize StratifiedKFold
skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)


### **Step 7: Performing 3-fold cross validation and training the ANN deep learning model**

In [None]:
# Lists to store train and test accuracies
train_accuracies = []
test_accuracies = []

# List to store the training history of each fold for visualization
history_list = []

# Perform 3-fold cross-validation
# Note: with warm_start=True, the same network keeps training from one fold to the next
for train_index, test_index in skf.split(X_train_array, np.argmax(y_train_array, axis=1)):
    X_train_fold, X_test_fold = X_train_array[train_index], X_train_array[test_index]
    y_train_fold, y_test_fold = y_train_array[train_index], y_train_array[test_index]

    # Define EarlyStopping callback
    early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
    # Define LearningRateScheduler callback
    lr_scheduler = LearningRateScheduler(scheduler)

    # Fit the model with early stopping and learning rate scheduler
    scikeras_classifier.fit(X_train_fold, y_train_fold,
              validation_data=(X_test_fold, y_test_fold),
              callbacks=[early_stopping, lr_scheduler],
              verbose=1)

    # Append the history for visualization later
    history_list.append(scikeras_classifier.history_)

    # Evaluate the model on train data
    train_accuracy = scikeras_classifier.score(X_train_fold, y_train_fold)
    train_accuracies.append(train_accuracy)

    # Evaluate the model on test data
    test_accuracy = scikeras_classifier.score(X_test_fold, y_test_fold)
    test_accuracies.append(test_accuracy)

    # Train metrics
    y_pred_tr = scikeras_classifier.predict(X_train_fold)
    y_pred_classes_tr = np.argmax(y_pred_tr, axis=1)
    y_train_classes = np.argmax(y_train_fold, axis=1)

    # Test metrics
    y_pred = scikeras_classifier.predict(X_test_fold)
    y_pred_classes = np.argmax(y_pred, axis=1)
    y_test_classes = np.argmax(y_test_fold, axis=1)

    print("Train Accuracy:", accuracy_score(y_train_classes, y_pred_classes_tr))
    print("Test Accuracy:", accuracy_score(y_test_classes, y_pred_classes))
    print("Classification Report:\n", classification_report(y_test_classes, y_pred_classes))

In [None]:
# Calculate mean train and test accuracies
mean_train_accuracy = np.mean(train_accuracies)
mean_test_accuracy = np.mean(test_accuracies)

# Evaluation Metrics
print("Mean Train Accuracy:", mean_train_accuracy)
print("Mean Test Accuracy:", mean_test_accuracy)

### **Step 8: Grid Search over Model Hyperparameters**


In [None]:
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings("ignore")

params = {
    "optimizer__learning_rate": [0.01, 0.001],
    "epochs":[19,30],
    "batch_size": [32,64,4000],
}

grid = GridSearchCV(scikeras_classifier, params, scoring='accuracy')

grid.fit(X_train, y_train)
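
Before running a search like this, it is worth estimating how many models it will train. The sketch below enumerates the same parameter grid with `itertools.product`. It assumes 5 cross-validation folds, which is `GridSearchCV`'s default when `cv` is not specified, as in the call above.

```python
from itertools import product

params = {
    "optimizer__learning_rate": [0.01, 0.001],
    "epochs": [19, 30],
    "batch_size": [32, 64, 4000],
}
cv_folds = 5   # GridSearchCV's default when cv is not given

# Every combination of one value per parameter
combinations = [dict(zip(params, values)) for values in product(*params.values())]
total_fits = len(combinations) * cv_folds

print("Parameter combinations:", len(combinations))
print("Model fits during the search:", total_fits)
```

With the default `refit=True`, one more fit on the full training data follows the search, so small batch sizes such as 32 can make the whole run slow.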

In [None]:
print("Best Score  : {}".format(grid.best_score_))
print("Best Params : {}".format(grid.best_params_))

### **Step 9: Evaluating performance of ANN deep learning model**

In [None]:
# Evaluate the best model found by the grid search

print("Train Accuracy : {}".format(accuracy_score(y_train, grid.predict(X_train))))
print("Test  Accuracy : {}".format(accuracy_score(y_test, grid.predict(X_test))))


### **Step 10: Visualizing the performance of the ANN deep learning model**
Analyze the model's predictions to identify trends, patterns, and areas for service improvement.


# **ROC AUC Curve**

In [None]:
from sklearn.metrics import roc_curve, auc, confusion_matrix, classification_report, accuracy_score

def plot_roc_curve(y_test, y_pred, num_classes):
    plt.figure(figsize=(10, 6))
    for i in range(num_classes):
        fpr, tpr, _ = roc_curve(y_test[:, i], y_pred[:, i])
        roc_auc = auc(fpr, tpr)
        plt.plot(fpr, tpr, lw=2, label='Class %d (area = %0.2f)' % (i, roc_auc))

    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend(loc="lower right")
    plt.show()

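# ROC curves need class probabilities rather than hard predictions.
# `y_test_fold` holds the one-hot encoded true labels of the last CV fold.
y_pred_proba = scikeras_classifier.predict_proba(X_test_fold)
plot_roc_curve(y_test_fold, y_pred_proba, num_classes)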


# **Confusion Matrix**

In [None]:
import seaborn as sns

def plot_confusion_matrix(y_test, y_pred_classes, class_names):
    # y_test is one-hot encoded, so convert it to class indices first
    cm = confusion_matrix(np.argmax(y_test, axis=1), y_pred_classes)
    plt.figure(figsize=(10, 7))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=class_names, yticklabels=class_names)
    plt.xlabel('Predicted Label')
    plt.ylabel('True Label')
    plt.title('Confusion Matrix')
    plt.show()

# `y_test_fold` holds the one-hot true labels and `y_pred_classes` the predicted class indices of the last CV fold
class_names = ['Class 1', 'Class 2', 'Class 3', 'Class 4', 'Class 5']  # Replace with your actual class names
plot_confusion_matrix(y_test_fold, y_pred_classes, class_names)


# **Classification Report**

In [None]:
def print_classification_report(y_test, y_pred_classes, class_names):
    # y_test is one-hot encoded, so convert it to class indices first
    report = classification_report(np.argmax(y_test, axis=1), y_pred_classes, target_names=class_names)
    print("Classification Report:\n", report)

# Print classification report
print_classification_report(y_test_fold, y_pred_classes, class_names)

# **Training and Validation Accuracy Plot**

In [None]:
# history_list contains the training history for each fold
mean_train_accuracy = []
mean_val_accuracy = []

# Early stopping can end a fold before the epoch limit, so only average
# over the epochs that at least one fold actually reached
max_epochs = max(len(history['accuracy']) for history in history_list)

# Calculate mean accuracy for each epoch
for epoch in range(max_epochs):
    epoch_train_acc = np.mean([history['accuracy'][epoch] for history in history_list if epoch < len(history['accuracy'])])
    epoch_val_acc = np.mean([history['val_accuracy'][epoch] for history in history_list if epoch < len(history['val_accuracy'])])
    mean_train_accuracy.append(epoch_train_acc)
    mean_val_accuracy.append(epoch_val_acc)

# Plot mean train and validation accuracy
plt.figure(figsize=(10, 5))
plt.plot(mean_train_accuracy, label='Mean Train Accuracy')
plt.plot(mean_val_accuracy, label='Mean Validation Accuracy')
plt.title('Mean Training and Validation Accuracy Across Epochs')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.show()
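
Because early stopping can end each fold after a different number of epochs, the per-epoch averages above are taken over however many folds reached that epoch. The plain-Python sketch below shows the same averaging on made-up accuracy curves of unequal length. The `mean_per_epoch` helper and the numbers are hypothetical.

```python
def mean_per_epoch(curves):
    """Average curves of different lengths, epoch by epoch."""
    longest = max(len(curve) for curve in curves)
    means = []
    for epoch in range(longest):
        # Only folds that actually reached this epoch contribute
        values = [curve[epoch] for curve in curves if epoch < len(curve)]
        means.append(sum(values) / len(values))
    return means

# Hypothetical validation accuracies for three folds that stopped at different epochs
fold_curves = [
    [0.50, 0.60, 0.70],
    [0.40, 0.60],
    [0.60, 0.60, 0.80, 0.90],
]

means = mean_per_epoch(fold_curves)
print(means)
```

The later epochs are averaged over fewer folds, so the tail of the mean curve is noisier than its start and should be read with that in mind.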


# **Data Preprocessing Blog**


https://medium.com/almabetter/data-preprocessing-ea09fac6a7f7

# **Conclusion**

1. Here are some ways to reduce customer churn:

* Modify the International Plan, as its charge is the same as the normal one.
* Be proactive with communication.
* Ask for feedback often.
* Periodically run offers to retain customers.
* Look at the customers facing problems in the states with the most churn.
* Lean into the best customers.
* Perform regular server maintenance.
* Solve poor network connectivity issues.
* Define a roadmap for new customers.
* Analyze churn when it happens.
* Stay competitive.

2. The four charge fields are linear functions of the minute fields.

3. The area code field and/or the state field are anomalous, and can be omitted.

4. Customers with the International Plan tend to churn more frequently.

5. Customers with four or more customer service calls churn more than four times as often as do the other customers.

6. Customers with high day minutes and evening minutes tend to churn at a higher rate than do the other customers.

7. There is no obvious association of churn with the variables day calls, evening calls, night calls, international calls, night minutes, international minutes, account length, or voice mail messages.

8. We can deploy the model with the XGBoost algorithm, as it is the best-performing model I found. On the training dataset, it achieved a precision of 100%, a recall of 91%, and an F1-score of 95% for non-churning customers (Churn = False). For churning customers, which are also of interest, precision was 46%, recall 95%, and F1-score 62%. Accuracy was 92%, and the average precision, recall, and F1-score were 73%, 93%, and 79% respectively, with a ROC AUC score of 72%. On the testing dataset, it achieved a precision of 99%, a recall of 90%, and an F1-score of 94% for non-churning customers. For churning customers, precision was 35%, recall 81%, and F1-score 49%. Accuracy was 90%, and the average precision, recall, and F1-score were 67%, 86%, and 72% respectively, with a ROC AUC score of 66%.

9. No overfitting is seen.

10. Because the dataset is small, the scores are around 80%. Once we get more data, we can retrain the algorithm for better performance.

### ***Hurrah! You have successfully completed your Deep Learning Capstone Project !!!***

In [None]:
scikeras_classifier.model_.save("csat_model.h5")