# Data Analysis Of Black Friday Sales at Walmart

## Table of Contents

* [Problem Statement](#problem-statement)
* [Business Problem](#business-problem)
* [Data Description](#data-description)
* [Key Questions to Address](#key-questions-to-address)
* [Approach](#approach)
* [Exploratory Data Analysis](#exploratory-data-analysis)
* [Data Visualization](#data-visualization)
* [Data Analysis](#data-analysis)
* [Concluding Observations](#concluding-observations) 

# Problem Statement
Analyze the provided dataset to gain insights and develop strategies for optimizing Black Friday sales at Walmart. Key areas of focus may include understanding customer demographics, identifying popular products, analyzing purchasing patterns based on different features, and developing targeted marketing or promotion strategies.




# Business Problem

**Background:**
Walmart wants to optimize its sales during the Black Friday event, which is a critical period for retail businesses. To achieve this, the company aims to leverage data-driven insights from past Black Friday transactions. The provided dataset contains information about individual purchases made by customers during previous Black Friday sales.


<div style="text-align:center">
    <img src="https://image.cnbcfm.com/api/v1/image/107334223-1700059846433-gettyimages-1793882242-mt1_2906_fc6jiecp.jpeg?v=1700682968&w=740&h=416&ffmt=webp&vtcrop=y" alt="Image Alt Text" />
</div>


**Objectives:**
1. **Maximize Revenue:** Increase overall sales and revenue during the Black Friday event.
2. **Customer Segmentation:** Understand the diverse customer base and tailor marketing strategies for different segments.
3. **Product Strategy:** Identify popular products and product categories to optimize inventory and promotions.
4. **Demographic Insights:** Analyze customer demographics to customize marketing messages and offerings.
5. **Promotional Effectiveness:** Evaluate the effectiveness of past promotions and identify areas for improvement.

# Key Questions to Address
1. **Product Insights:**
   - Which products and categories are most popular during Black Friday?
   - Are there particular products that consistently drive high sales?

2. **Demographic Analysis:**
   - How do age, gender, occupation, marital status, and city category impact purchasing behavior?
   - Can customer segments be identified for targeted marketing?

3. **City and Region Analysis:**
   - Are sales concentrated in specific cities or regions (City_Category)?
   - Do customers who have stayed longer in a city exhibit different purchasing patterns?

4. **Promotional Strategies:**
   - What promotions or discounts have been most effective in driving sales?
   - Can personalized promotions be designed based on customer characteristics?

# Approach
1. **Exploratory Data Analysis (EDA):** Conduct a thorough exploration of the dataset to understand its characteristics and uncover initial insights.
2. **Customer Segmentation:** Use clustering or grouping techniques to identify distinct customer segments based on demographics and purchasing behavior.
3. **Product Analysis:** Identify top-selling products and categories to optimize inventory and promotional efforts.
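
As a rough illustration of the grouping-based segmentation mentioned in step 2, the sketch below aggregates spend by age band and gender. The rows are made-up toy values and `segment_summary` is a hypothetical name; only the column names come from the dataset.

```python
import pandas as pd

# Toy transactions (made-up values); column names follow the dataset
toy_df = pd.DataFrame({
    "User_ID": [1, 1, 2, 3, 3, 4],
    "Gender": ["F", "F", "M", "M", "M", "F"],
    "Age": ["26-35", "26-35", "26-35", "36-45", "36-45", "36-45"],
    "Purchase": [100, 200, 400, 150, 50, 300],
})

# Total spend per customer, keeping the demographic keys
per_user = toy_df.groupby(["User_ID", "Gender", "Age"], as_index=False)["Purchase"].sum()

# One row per (Age, Gender) segment: customer count and mean spend per customer
segment_summary = (
    per_user.groupby(["Age", "Gender"])["Purchase"]
    .agg(customers="count", mean_spend="mean")
    .reset_index()
)
print(segment_summary)
```

Grouping on demographic columns is the simplest form of segmentation; a clustering method could replace it if richer behavioral features were available.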


# Data Description

The dataset records individual purchases made during Black Friday sales at Walmart. Each row is one transaction, described by the following columns:

1. **User_ID:** A unique identifier for each user or customer.

2. **Product_ID:** A unique identifier for each product that was purchased.

3. **Gender:** The gender of the customer (M for male, F for female).

4. **Age:** The age group of the customer. In this dataset, it's categorized into different ranges (e.g., 0-17, 55+).

5. **Occupation:** The occupation code of the customer.

6. **City_Category:** The category of the city where the customer resides (A, B, or C).

7. **Stay_In_Current_City_Years:** The number of years the customer has stayed in the current city.

8. **Marital_Status:** Marital status of the customer (0 for single, 1 for married).

9. **Product_Category:** The category of the purchased product.

10. **Purchase:** The amount spent by the customer on the product during the Black Friday sale.

Now, let's interpret the first row as an example:

- **User_ID:** 1000001
- **Product_ID:** P00069042
- **Gender:** Female (F)
- **Age:** 0-17 years
- **Occupation:** Occupation code 10
- **City_Category:** City category A
- **Stay_In_Current_City_Years:** 2 years
- **Marital_Status:** Single (0)
- **Product_Category:** Category 3
- **Purchase:** $8370

This indicates that a female customer aged 0-17 years, with occupation code 10, residing in a city of category A, and having stayed in the current city for 2 years, made a purchase of a product in category 3 for a total amount of $8370 on Black Friday.
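
The example row can also be written out as a one-row DataFrame, which makes the expected column names and value types explicit. This is only a sketch: storing `Stay_In_Current_City_Years` as a string is an assumption based on the string cleaning applied to that column later, and the types pandas infers from the real CSV may differ.

```python
import pandas as pd

# The first row described above, typed in by hand
first_row = pd.DataFrame([{
    "User_ID": 1000001,
    "Product_ID": "P00069042",
    "Gender": "F",
    "Age": "0-17",
    "Occupation": 10,
    "City_Category": "A",
    "Stay_In_Current_City_Years": "2",  # assumed to be text in the raw file
    "Marital_Status": 0,
    "Product_Category": 3,
    "Purchase": 8370,
}])

# Transpose so each column name is printed next to its value
print(first_row.T)
```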

In [None]:
import seaborn as sns
import pandas as pd
sns.set(color_codes = True)
import warnings
warnings.filterwarnings("ignore")

In [None]:
# Reading data from a CSV file into a DataFrame and assigning it to 'custom_df'
custom_df = pd.read_csv("../data/walmart_data.csv")
custom_df.head()

# Exploratory Data Analysis

A Python function to display unique values for each column in a DataFrame with a custom name, along with an example usage.


In [None]:
# Function to print unique values for each column in a DataFrame with a custom name
def display_custom_unique_values(df_custom_name):
    for col_custom_name in df_custom_name.columns:
        unique_vals_custom_name = df_custom_name[col_custom_name].unique()
        print(f"\nUnique Values of {col_custom_name}: ", unique_vals_custom_name)

# Example usage
display_custom_unique_values(custom_df)


A Python function to print the number of unique values for each column in a DataFrame, with a custom name, and an example usage.


In [None]:
# Function to print the number of unique values for each column in a DataFrame
def display_custom_nunique_values(df_custom):
    for col_custom in df_custom.columns:
        unique_vals_custom = df_custom[col_custom].nunique()
        print(f"\nNumber of Unique Values in {col_custom}: ", unique_vals_custom)

# Example usage
display_custom_nunique_values(custom_df)


Displaying unique values of the 'Stay_In_Current_City_Years' column in the DataFrame.


In [None]:
# Displaying unique values of the 'Stay_In_Current_City_Years' column in the DataFrame
custom_df['Stay_In_Current_City_Years'].unique()

Removing the "+" symbol from the 'Stay_In_Current_City_Years' column and displaying the unique values of the modified column.


In [None]:
# Removing the "+" symbol from the 'Stay_In_Current_City_Years' column.
# regex=False makes the literal match explicit; the default for `regex` differs across pandas versions.
custom_df['Stay_In_Current_City_Years'] = custom_df['Stay_In_Current_City_Years'].str.replace("+", "", regex=False)
# Displaying unique values of the modified 'Stay_In_Current_City_Years' column in the DataFrame
custom_df['Stay_In_Current_City_Years'].unique()


Converting the 'Stay_In_Current_City_Years' column to numeric, assuming the "+" symbols have already been removed.


In [None]:

# Converting the 'Stay_In_Current_City_Years' column to numeric, assuming '+' symbols have already been removed
custom_df['Stay_In_Current_City_Years'] = pd.to_numeric(custom_df['Stay_In_Current_City_Years'])
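
The two cleaning steps above can be checked on a small stand-alone Series. The values below are a made-up sample of the raw strings, and `stay` and `stay_clean` are illustrative names.

```python
import pandas as pd

# Made-up sample of raw values; "4+" stands for four or more years
stay = pd.Series(["2", "4+", "1", "0", "3"])

# Strip the literal "+" and convert the remaining digits to integers
stay_clean = pd.to_numeric(stay.str.replace("+", "", regex=False))
print(stay_clean.tolist())
```

After conversion the value 4 should be read as "4 or more years", since the open-ended upper band is no longer visible in the data.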

Calculating skewness for columns of type 'int64' in the DataFrame.


In [None]:
# Calculating skewness for columns of type 'int64' in the DataFrame
custom_df.select_dtypes(include=['int64']).skew()
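
To help read the skewness values, the sketch below compares a symmetric sample with a right-tailed one. Both Series are made up for illustration.

```python
import pandas as pd

symmetric = pd.Series([1, 2, 3, 4, 5])
right_tailed = pd.Series([1, 1, 2, 2, 3, 20])

# Skewness near zero indicates symmetry around the mean
print(symmetric.skew())
# A positive value indicates a longer tail on the right-hand side
print(right_tailed.skew())
```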


#### Missing Value Detection

Checking for missing values in the DataFrame

In [None]:
custom_df.isna().sum()

#### Duplicate Value Detection

Checking for duplicate rows in the DataFrame. The dataset contains no duplicate rows.


In [None]:
# Checking for duplicate values in the DataFrame
custom_df.duplicated(subset=None, keep='first').sum()  # No duplicate values in the data set
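
The same two checks can be seen on a toy DataFrame where the expected counts are easy to verify by eye. The data and variable names here are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "User_ID": [1, 2, 2, 3],
    "Purchase": [100.0, 250.0, 250.0, np.nan],
})

# Count of missing values in each column
missing_per_column = toy.isna().sum()
# Count of rows that exactly repeat an earlier row
duplicate_rows = toy.duplicated().sum()

print(missing_per_column.to_dict(), duplicate_rows)
```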


# Data Visualization

List comprehension to extract column names with 'int64' data type in the DataFrame.

In [None]:
# List comprehension to extract column names with 'int64' data type
[int_col for int_col in custom_df.select_dtypes(include=['int64']).columns]


Creating a 2x2 grid of distribution plots using Seaborn and Matplotlib for selected columns in the DataFrame, including a fitted normal curve for the 'Purchase' variable.


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import norm

# Create a 2x2 grid of subplots
fig, custom_axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 10))
fig.subplots_adjust(top=0.9)  # Adjust the top spacing of the subplots

# Plot distribution plots for each specified column
sns.distplot(custom_df['Occupation'], kde=True, ax=custom_axes[0, 0], color="#4CAF50")
sns.distplot(custom_df['Stay_In_Current_City_Years'].astype(int), kde=True, ax=custom_axes[0, 1], color="#FFC107")
sns.distplot(custom_df['Marital_Status'], kde=True, ax=custom_axes[1, 0], color="#2196F3")

# Plotting a distribution plot for the 'Purchase' variable with normal curve fit
sns.distplot(custom_df['Purchase'], ax=custom_axes[1, 1], color="#E91E63", fit=norm)

# Fitting the target variable to the normal curve
mu, sigma = norm.fit(custom_df['Purchase'])
print("The mu (mean) is {} and sigma (standard deviation) for the curve is {}".format(mu, sigma))

# Adding a legend for the 'Purchase' distribution plot
custom_axes[1, 1].legend(['Normal Distribution (μ = {:.2f}, σ = {:.2f})'.format(mu, sigma)], loc='best')

# Show the plots
plt.show()


Generating a 4x2 subplot grid using Plotly to display histograms for different DataFrame columns with distinct colors, offering insights into the distribution of categorical and numerical data.


In [None]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Create subplots
fig = make_subplots(
    rows=4, cols=2,
    subplot_titles=("Gender", "Age", "Occupation", "City Category",
                    "Stay In Current City Years", "Marital Status", "Product Category", "Purchase")
)

# Define color sequence for each subplot
colors = ['#FF5733', '#3498DB', '#1ABC9C', '#E74C3C', '#8E44AD', '#F39C12', '#27AE60', '#9B59B6']

# Add histograms for each subplot with different colors
fig.add_trace(go.Histogram(x=custom_df['Gender'], marker=dict(color=colors[0])), row=1, col=1)
fig.add_trace(go.Histogram(x=custom_df['Age'], marker=dict(color=colors[1])), row=1, col=2)
fig.add_trace(go.Histogram(x=custom_df['Occupation'], marker=dict(color=colors[2])), row=2, col=1)
fig.add_trace(go.Histogram(x=custom_df['City_Category'], marker=dict(color=colors[3])), row=2, col=2)
fig.add_trace(go.Histogram(x=custom_df['Stay_In_Current_City_Years'], marker=dict(color=colors[4])), row=3, col=1)
fig.add_trace(go.Histogram(x=custom_df['Marital_Status'], marker=dict(color=colors[5])), row=3, col=2)
fig.add_trace(go.Histogram(x=custom_df['Product_Category'], marker=dict(color=colors[6])), row=4, col=1)
fig.add_trace(go.Histogram(x=custom_df['Purchase'], marker=dict(color=colors[7])), row=4, col=2)

# Update layout if needed
fig.update_layout(height=1200, width=1000, title_text="Count Plots")
fig.update_layout(showlegend=False)  # Hide the legend if not needed

# Show the figure
fig.show()


Creating a 2x2 grid of boxplots using Seaborn and Matplotlib to visualize the distribution and central tendency of 'Occupation', 'Stay_In_Current_City_Years', 'Purchase', and 'Product_Category' columns in the DataFrame.


In [None]:
# Create subplots for boxplots
fig, custom_axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 10))
fig.subplots_adjust(top=1.2)

# Boxplot for 'Occupation' column
sns.boxplot(data=custom_df, x="Occupation", ax=custom_axes[0, 0])

# Boxplot for 'Stay_In_Current_City_Years' column
sns.boxplot(data=custom_df, x="Stay_In_Current_City_Years", orient='h', ax=custom_axes[0, 1])

# Boxplot for 'Purchase' column
sns.boxplot(data=custom_df, x="Purchase", orient='h', ax=custom_axes[1, 0])

# Boxplot for 'Product_Category' column
sns.boxplot(data=custom_df, x="Product_Category", orient='h', ax=custom_axes[1, 1])

# Show the plots
plt.show()
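
The points drawn beyond the whiskers in a boxplot follow, by default, the 1.5 × IQR rule. The sketch below applies that rule directly to a made-up Series; `iqr_outlier_bounds` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def iqr_outlier_bounds(values: pd.Series, k: float = 1.5):
    """Return (low, high) fences at k * IQR beyond the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up purchase amounts with one unusually large value
purchases = pd.Series([10, 12, 13, 14, 15, 16, 18, 60])
low, high = iqr_outlier_bounds(purchases)

# Values outside the fences are the ones a boxplot would mark individually
outliers = purchases[(purchases < low) | (purchases > high)]
print(low, high, outliers.tolist())
```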


Analyzing the relationship between 'Purchase' and various attributes (Gender, Age, Occupation, City_Category, Stay_In_Current_City_Years, Marital_Status) through a series of boxplots, providing insights into the distribution of purchase amounts across different categories.



In [None]:
# Define attributes for boxplot analysis
custom_attrs = ['Gender', 'Age', 'Occupation', 'City_Category', 'Stay_In_Current_City_Years', 'Marital_Status']

# Set color codes
sns.set(color_codes=True)

# Create subplots for boxplots
fig, axs = plt.subplots(nrows=3, ncols=2, figsize=(20, 16))
fig.subplots_adjust(top=1.3)

count = 0
for row in range(3):
    for col in range(2):
        sns.boxplot(data=custom_df, y='Purchase', x=custom_attrs[count], ax=axs[row, col])
        axs[row, col].set_title(f"Purchase vs {custom_attrs[count]}", pad=12, fontsize=13)
        count += 1

# Show the plots
plt.show()

# Boxplot for 'Marital_Status' column
plt.figure(figsize=(10, 8))
sns.boxplot(data=custom_df, y='Purchase', x=custom_attrs[-1])
plt.show()


Exploring the relationship between 'Purchase' and 'Gender' across various dimensions (Age, City_Category, Marital_Status, Stay_In_Current_City_Years) using boxplots, revealing insights into purchase distribution patterns.


In [None]:
# Set color codes
sns.set(color_codes=True)

# Create subplots for boxplots
fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(20, 6))
fig.subplots_adjust(top=1.5)

# Boxplot for 'Purchase' vs 'Gender' with 'Age' as hue
sns.boxplot(data=custom_df, y='Purchase', x='Gender', hue='Age', ax=axs[0, 0])

# Boxplot for 'Purchase' vs 'Gender' with 'City_Category' as hue
sns.boxplot(data=custom_df, y='Purchase', x='Gender', hue='City_Category', ax=axs[0, 1])

# Boxplot for 'Purchase' vs 'Gender' with 'Marital_Status' as hue
sns.boxplot(data=custom_df, y='Purchase', x='Gender', hue='Marital_Status', ax=axs[1, 0])

# Boxplot for 'Purchase' vs 'Gender' with 'Stay_In_Current_City_Years' as hue
sns.boxplot(data=custom_df, y='Purchase', x='Gender', hue='Stay_In_Current_City_Years', ax=axs[1, 1])

# Adjust legend position
axs[1, 1].legend(loc='upper left')

# Show the plots
plt.show()
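
A numeric companion to the grouped boxplots is a table of medians, since the median is the center line of each box. The sketch below builds one with `groupby` and `unstack` on made-up data.

```python
import pandas as pd

toy = pd.DataFrame({
    "Gender": ["F", "F", "M", "M", "M", "F"],
    "City_Category": ["A", "B", "A", "A", "B", "B"],
    "Purchase": [100, 300, 200, 400, 500, 100],
})

# Rows are genders, columns are city categories, cells are median purchase amounts
median_table = toy.groupby(["Gender", "City_Category"])["Purchase"].median().unstack()
print(median_table)
```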


# Data Analysis

Calculating the total amount spent per customer, with gender kept as a grouping key. The grouped sums from `custom_df` are stored in `avg_amt_custom_df`; despite its name, each row holds one customer's total spend, and the averages are taken over these totals in later cells.


In [None]:
# Total amount spent per customer, with Gender kept as a grouping key
amt_custom_df = custom_df.groupby(['User_ID', 'Gender'])[['Purchase']].sum()
avg_amt_custom_df = amt_custom_df.reset_index()
avg_amt_custom_df


In [None]:
# Gender-wise value counts in avg_amt_custom_df
avg_amt_custom_df['Gender'].value_counts()

Creating a histogram in Matplotlib to visualize the distribution of the total amount spent per customer for male and female customers, displayed on the same plot with distinct colors and labels.

In [None]:
# Histogram of total amount spent per customer - Male and Female on the same plot
plt.hist(avg_amt_custom_df[avg_amt_custom_df['Gender'] == 'M']['Purchase'], bins=100, alpha=0.7, label='Male', color='#2196F3')
plt.hist(avg_amt_custom_df[avg_amt_custom_df['Gender'] == 'F']['Purchase'], bins=100, alpha=0.7, label='Female', color='#FFC107')
plt.legend(loc='upper right')
plt.show()


Calculating and printing the mean of the per-customer totals for male and female customers, formatted to two decimal places.

In [None]:
# Mean of per-customer total spend for Male and Female customers
male_avg_custom = avg_amt_custom_df[avg_amt_custom_df['Gender'] == 'M']['Purchase'].mean()
female_avg_custom = avg_amt_custom_df[avg_amt_custom_df['Gender'] == 'F']['Purchase'].mean()

print("Average amount spent by Male customers: {:.2f}".format(male_avg_custom))
print("Average amount spent by Female customers: {:.2f}".format(female_avg_custom))
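
The averages above are means of per-customer totals, which is different from the mean of individual transactions. The made-up example below shows how far the two can diverge when one customer makes several purchases.

```python
import pandas as pd

toy = pd.DataFrame({
    "User_ID": [1, 1, 1, 2, 3],
    "Gender": ["M", "M", "M", "M", "F"],
    "Purchase": [100, 200, 300, 400, 500],
})

# Mean over individual transactions
per_transaction = toy.groupby("Gender")["Purchase"].mean()

# Sum per customer first, then mean over customers (the approach used in this section)
per_customer_total = (
    toy.groupby(["User_ID", "Gender"])["Purchase"].sum()
    .groupby(level="Gender").mean()
)

print(per_transaction.to_dict(), per_customer_total.to_dict())
```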


Filtering data for male and female customers in the `avg_amt_custom_df` DataFrame and conducting a simulation to generate the sampling distribution of means for male and female customers, with specified sample sizes and repetitions.

In [None]:
# Filtering data for Male and Female customers in the custom DataFrame
male_df_custom = avg_amt_custom_df[avg_amt_custom_df['Gender'] == 'M']
female_df_custom = avg_amt_custom_df[avg_amt_custom_df['Gender'] == 'F']

# Simulation for sampling distribution of means for Male and Female customers
genders = ["M", "F"]
male_sample_size = 1000
female_sample_size = 500
num_repetitions = 800
male_means = []
female_means = []

for _ in range(num_repetitions):
    male_mean = male_df_custom.sample(male_sample_size, replace=True)['Purchase'].mean()
    female_mean = female_df_custom.sample(female_sample_size, replace=True)['Purchase'].mean()
    male_means.append(male_mean)
    female_means.append(female_mean)
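
The resampling loop above can also be written in vectorized form with NumPy. This is a sketch under stated assumptions: the population is synthetic (drawn from an exponential distribution as a stand-in for right-skewed spend totals), the seed is arbitrary, and `bootstrap_means` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility only

# Synthetic, right-skewed stand-in for per-customer totals
population = rng.exponential(scale=1000.0, size=5000)

def bootstrap_means(values, sample_size, num_repetitions, rng):
    """Draw num_repetitions samples with replacement and return their means."""
    idx = rng.integers(0, len(values), size=(num_repetitions, sample_size))
    return values[idx].mean(axis=1)

means = bootstrap_means(population, sample_size=500, num_repetitions=800, rng=rng)

# The sample means cluster around the population mean with a much smaller spread
print(means.shape, population.mean(), means.mean(), means.std())
```

Drawing all indices at once avoids the Python-level loop, at the cost of holding a `num_repetitions × sample_size` index array in memory.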


Creating subplots for histograms of the sampling distribution of means for male and female customers, along with printing population mean and statistics for male and female groups in terms of the amount spent.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Create subplots for histograms
fig, axis = plt.subplots(nrows=1, ncols=2, figsize=(20, 6))
axis[0].hist(male_means, bins=100, color='#2196F3', alpha=0.7)
axis[1].hist(female_means, bins=100, color='#FFC107', alpha=0.7)
axis[0].set_title(f"Male - Distribution of means, Sample size: {male_sample_size}")
axis[1].set_title(f"Female - Distribution of means, Sample size: {female_sample_size}")
plt.show()

# Print population mean and statistics for Male and Female
print("\nPopulation mean - Mean of sample means of amount spent for Male: {:.2f}".format(np.mean(male_means)))
print("Population mean - Mean of sample means of amount spent for Female: {:.2f}".format(np.mean(female_means)))

print("\nMale - Sample mean: {:.2f} Sample std: {:.2f}".format(male_df_custom['Purchase'].mean(), male_df_custom['Purchase'].std()))
print("Female - Sample mean: {:.2f} Sample std: {:.2f}".format(female_df_custom['Purchase'].mean(), female_df_custom['Purchase'].std()))


Calculating confidence intervals for the mean amount spent by male and female customers using the Central Limit Theorem (CLT) and providing the results with a confidence level of 90% (z-score of 1.64).

In [None]:
# Confidence interval calculation using Central Limit Theorem (CLT) for Male and Female
male_margin_of_error_clt = 1.64 * male_df_custom['Purchase'].std() / np.sqrt(len(male_df_custom))
male_sample_mean = male_df_custom['Purchase'].mean()
male_lower_lim = male_sample_mean - male_margin_of_error_clt
male_upper_lim = male_sample_mean + male_margin_of_error_clt

female_margin_of_error_clt = 1.64 * female_df_custom['Purchase'].std() / np.sqrt(len(female_df_custom))
female_sample_mean = female_df_custom['Purchase'].mean()
female_lower_lim = female_sample_mean - female_margin_of_error_clt
female_upper_lim = female_sample_mean + female_margin_of_error_clt

print("Male confidence interval of means: ({:.2f}, {:.2f})".format(male_lower_lim, male_upper_lim))
print("Female confidence interval of means: ({:.2f}, {:.2f})".format(female_lower_lim, female_upper_lim))


Calculating confidence intervals for the mean amount spent by male and female customers using a 95% confidence level (z-score of 1.96) and displaying the results.

In [None]:
# Confidence interval calculation using a 95% confidence level for Male and Female
male_margin_of_error_clt = 1.96 * male_df_custom['Purchase'].std() / np.sqrt(len(male_df_custom))
male_sample_mean = male_df_custom['Purchase'].mean()
male_lower_lim = male_sample_mean - male_margin_of_error_clt
male_upper_lim = male_sample_mean + male_margin_of_error_clt

female_margin_of_error_clt = 1.96 * female_df_custom['Purchase'].std() / np.sqrt(len(female_df_custom))
female_sample_mean = female_df_custom['Purchase'].mean()
female_lower_lim = female_sample_mean - female_margin_of_error_clt
female_upper_lim = female_sample_mean + female_margin_of_error_clt

print("Male confidence interval of means: ({:.2f}, {:.2f})".format(male_lower_lim, male_upper_lim))
print("Female confidence interval of means: ({:.2f}, {:.2f})".format(female_lower_lim, female_upper_lim))


Calculating confidence intervals for the mean amount spent by male and female customers using a 99% confidence level (z-score of 2.58) and presenting the results.

In [None]:
# Confidence interval calculation using a 99% confidence level for Male and Female
male_margin_of_error_clt = 2.58 * male_df_custom['Purchase'].std() / np.sqrt(len(male_df_custom))
male_sample_mean = male_df_custom['Purchase'].mean()
male_lower_lim = male_sample_mean - male_margin_of_error_clt
male_upper_lim = male_sample_mean + male_margin_of_error_clt

female_margin_of_error_clt = 2.58 * female_df_custom['Purchase'].std() / np.sqrt(len(female_df_custom))
female_sample_mean = female_df_custom['Purchase'].mean()
female_lower_lim = female_sample_mean - female_margin_of_error_clt
female_upper_lim = female_sample_mean + female_margin_of_error_clt

print("Male confidence interval of means: ({:.2f}, {:.2f})".format(male_lower_lim, male_upper_lim))
print("Female confidence interval of means: ({:.2f}, {:.2f})".format(female_lower_lim, female_upper_lim))
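
The three cells above differ only in the z-score, so the calculation can be wrapped in one function. The sketch below uses only the standard library; `mean_confidence_interval` is a hypothetical helper and the sample values are made up. It computes the interval as the sample mean plus or minus z times the sample standard deviation divided by the square root of n.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(values, confidence=0.95):
    """Normal-approximation CI for the mean: x_bar +/- z * s / sqrt(n)."""
    # Two-sided critical value of the standard normal distribution
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    x_bar = mean(values)
    margin = z * stdev(values) / sqrt(len(values))
    return x_bar - margin, x_bar + margin

# Made-up per-customer totals
sample = [820.0, 910.0, 760.0, 1005.0, 880.0, 940.0, 790.0, 870.0]

for level in (0.90, 0.95, 0.99):
    low, high = mean_confidence_interval(sample, level)
    print(f"{level:.0%}: ({low:.2f}, {high:.2f})")
```

The critical values computed this way are approximately 1.645, 1.960, and 2.576, so results can differ slightly from those obtained with the rounded constants 1.64, 1.96, and 2.58.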



Grouping the data by 'User_ID' and 'Marital_Status' in the Pandas DataFrame `custom_df`, then calculating the sum of 'Purchase' for each group and storing the result in the DataFrame `avg_amt_custom_df`.

In [None]:
# Grouping by 'User_ID' and 'Marital_Status' and calculating the sum of 'Purchase'
amt_custom_df = custom_df.groupby(['User_ID', 'Marital_Status'])[['Purchase']].sum()
avg_amt_custom_df = amt_custom_df.reset_index()
avg_amt_custom_df


In [None]:
avg_amt_custom_df['Marital_Status'].value_counts()

Conducting a simulation to generate the sampling distribution of means for married and unmarried customers, with specified sample sizes and repetitions. Creating subplots for histograms of the sampling distribution and printing population mean and statistics for both groups in terms of the amount spent.

In [None]:
# Simulation for sampling distribution of means for Married and Unmarried customers
married_samp_size = 2000
unmarried_samp_size = 1500
num_repetitions = 900
married_means = []
unmarried_means = []

for _ in range(num_repetitions):
    married_mean = avg_amt_custom_df[avg_amt_custom_df['Marital_Status'] == 1].sample(married_samp_size, replace=True)['Purchase'].mean()
    unmarried_mean = avg_amt_custom_df[avg_amt_custom_df['Marital_Status'] == 0].sample(unmarried_samp_size, replace=True)['Purchase'].mean()
    married_means.append(married_mean)
    unmarried_means.append(unmarried_mean)

# Create subplots for histograms
fig, axis = plt.subplots(nrows=1, ncols=2, figsize=(20, 6))
axis[0].hist(married_means, bins=100, color='#2196F3', alpha=0.7)
axis[1].hist(unmarried_means, bins=100, color='#FFC107', alpha=0.7)
axis[0].set_title(f"Married - Distribution of means, Sample size: {married_samp_size}")
axis[1].set_title(f"Unmarried - Distribution of means, Sample size: {unmarried_samp_size}")
plt.show()

# Print population mean and statistics for Married and Unmarried
print("\nPopulation mean - Mean of sample means of amount spent for Married: {:.2f}".format(np.mean(married_means)))
print("Population mean - Mean of sample means of amount spent for Unmarried: {:.2f}".format(np.mean(unmarried_means)))

print("\nMarried - Sample mean: {:.2f} Sample std: {:.2f}".format(avg_amt_custom_df[avg_amt_custom_df['Marital_Status'] == 1]['Purchase'].mean(), avg_amt_custom_df[avg_amt_custom_df['Marital_Status'] == 1]['Purchase'].std()))
print("Unmarried - Sample mean: {:.2f} Sample std: {:.2f}".format(avg_amt_custom_df[avg_amt_custom_df['Marital_Status'] == 0]['Purchase'].mean(), avg_amt_custom_df[avg_amt_custom_df['Marital_Status'] == 0]['Purchase'].std()))


Calculating confidence intervals for the mean amount spent by married and unmarried customers separately using the Central Limit Theorem (CLT) and displaying the results with a confidence level of 90% (z-score of 1.64).

In [None]:
# Confidence interval calculation using Central Limit Theorem (CLT) for Married and Unmarried
for val in ["Married", "Unmarried"]:
    new_val = 1 if val == "Married" else 0
    new_df_custom = avg_amt_custom_df[avg_amt_custom_df['Marital_Status'] == new_val]
    margin_of_error_clt = 1.64 * new_df_custom['Purchase'].std() / np.sqrt(len(new_df_custom))
    sample_mean = new_df_custom['Purchase'].mean()
    lower_lim = sample_mean - margin_of_error_clt
    upper_lim = sample_mean + margin_of_error_clt
    print("{} confidence interval of means: ({:.2f}, {:.2f})".format(val, lower_lim, upper_lim))


Calculating confidence intervals for the mean amount spent by married and unmarried customers separately using a 99% confidence level (z-score of 2.58) and presenting the results.


In [None]:
# Confidence interval calculation using a 99% confidence level for Married and Unmarried
for val in ["Married", "Unmarried"]:
    new_val = 1 if val == "Married" else 0
    new_df_custom = avg_amt_custom_df[avg_amt_custom_df['Marital_Status'] == new_val]
    margin_of_error_clt = 2.58 * new_df_custom['Purchase'].std() / np.sqrt(len(new_df_custom))
    sample_mean = new_df_custom['Purchase'].mean()
    lower_lim = sample_mean - margin_of_error_clt
    upper_lim = sample_mean + margin_of_error_clt
    print("{} confidence interval of means: ({:.2f}, {:.2f})".format(val, lower_lim, upper_lim))
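
A quick way to compare two groups is to check whether their confidence intervals overlap. The sketch below shows that check; `intervals_overlap` is a hypothetical helper and the intervals are made-up numbers, not results from the dataset.

```python
def intervals_overlap(a, b):
    """Return True when closed intervals a=(lo, hi) and b=(lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical intervals for illustration only
print(intervals_overlap((800.0, 900.0), (850.0, 950.0)))  # True
print(intervals_overlap((800.0, 900.0), (910.0, 990.0)))  # False
```

Overlap is only a rough guide: overlapping intervals do not prove that the group means are equal, and a formal two-sample test would be needed for a firm conclusion.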


Grouping the data by 'User_ID' and 'Age' in the Pandas DataFrame `custom_df`, then calculating the sum of 'Purchase' for each group and storing the result in the DataFrame `avg_amt_custom_df`.

In [None]:
# Grouping by 'User_ID' and 'Age' and calculating the sum of 'Purchase'
amt_custom_df = custom_df.groupby(['User_ID', 'Age'])[['Purchase']].sum()
avg_amt_custom_df = amt_custom_df.reset_index()
avg_amt_custom_df

In [None]:
# Age-wise value counts in avg_amt_custom_df
avg_amt_custom_df['Age'].value_counts()


Conducting a simulation to generate the sampling distribution of means for each age group in the Pandas DataFrame `avg_amt_custom_df`, with a specified sample size and number of repetitions. The results are stored in the dictionary `all_means` where each age group has a list of means.

In [None]:
# Simulation for sampling distribution of means for each Age group
sample_size = 200
num_repetitions = 1000
all_means = {}
age_intervals = ['26-35', '36-45', '18-25', '46-50', '51-55', '55+', '0-17']

for age_interval in age_intervals:
    all_means[age_interval] = []

for age_interval in age_intervals:
    for _ in range(num_repetitions):
        mean = avg_amt_custom_df[avg_amt_custom_df['Age'] == age_interval].sample(sample_size, replace=True)['Purchase'].mean()
        all_means[age_interval].append(mean)

print(all_means)

Calculating confidence intervals for the mean amount spent in each age group using the Central Limit Theorem (CLT) with a confidence level of 90% (z-score of 1.64) and presenting the results.

In [None]:
# Confidence interval calculation using Central Limit Theorem (CLT) for each Age group
for val in ['26-35', '36-45', '18-25', '46-50', '51-55', '55+', '0-17']:
    new_df_custom = avg_amt_custom_df[avg_amt_custom_df['Age'] == val]
    margin_of_error_clt = 1.64 * new_df_custom['Purchase'].std() / np.sqrt(len(new_df_custom))
    sample_mean = new_df_custom['Purchase'].mean()
    lower_lim = sample_mean - margin_of_error_clt
    upper_lim = sample_mean + margin_of_error_clt
    print("For age {}, confidence interval of means: ({:.2f}, {:.2f})".format(val, lower_lim, upper_lim))


Calculating confidence intervals for the mean amount spent in each age group using a 95% confidence level (z-score of 1.96) and presenting the results.

In [None]:
# Confidence interval calculation using a 95% confidence level for each Age group
for val in ['26-35', '36-45', '18-25', '46-50', '51-55', '55+', '0-17']:
    new_df_custom = avg_amt_custom_df[avg_amt_custom_df['Age'] == val]
    margin_of_error_clt = 1.96 * new_df_custom['Purchase'].std() / np.sqrt(len(new_df_custom))
    sample_mean = new_df_custom['Purchase'].mean()
    lower_lim = sample_mean - margin_of_error_clt
    upper_lim = sample_mean + margin_of_error_clt
    print("For age {}, confidence interval of means: ({:.2f}, {:.2f})".format(val, lower_lim, upper_lim))


Calculating confidence intervals for the mean amount spent in each age group using a 99% confidence level (z-score of 2.58) and presenting the results.

In [None]:
# Confidence interval calculation using a 99% confidence level for each Age group
for val in ['26-35', '36-45', '18-25', '46-50', '51-55', '55+', '0-17']:
    new_df_custom = avg_amt_custom_df[avg_amt_custom_df['Age'] == val]
    margin_of_error_clt = 2.58 * new_df_custom['Purchase'].std() / np.sqrt(len(new_df_custom))
    sample_mean = new_df_custom['Purchase'].mean()
    lower_lim = sample_mean - margin_of_error_clt
    upper_lim = sample_mean + margin_of_error_clt
    print("For age {}, confidence interval of means: ({:.2f}, {:.2f})".format(val, lower_lim, upper_lim))


# Concluding Observations
Upon thorough examination of the data, we have derived significant observations regarding customer expenditure trends, considering factors such as age, gender, marital status, city category, and product categories.

### Actionable Insights

Analyzing the Age feature, approximately 80% of customers fall in the 18-45 age range (40% aged 26-35, 18% aged 18-25, 20% aged 36-45), and these groups tend to exhibit higher spending.

Regarding the Gender feature, around 75% of purchases are made by male customers and the remaining 25% by female customers, so male customers account for most of the store's sales volume. Male customers also spend more on average: the mean total spend per customer is 925,408.28 for males and 712,217.18 for females.

Combining Purchase and Marital_Status for analysis (60% single, 40% married), it was found that single men tend to spend the most during Black Friday. This also suggests that men generally reduce their spending once they are married, possibly due to added responsibilities.

An interesting observation was made in the Stay_In_Current_City_Years column, revealing that people who have spent 1 year in the city tend to spend the most. This trend is understandable, as those who have spent more than 4 years in the city are usually well settled and less inclined to make new purchases compared to newcomers (35% staying for 1 year, 18% for 2 years, 17% for 3 years).

Examining the City_Category, although city B contributes significantly to overall sales income, the specific product mentioned is predominantly purchased in city C.

Among the 20 product_categories, Product_Category 1, 5, 8, and 11 exhibit the highest purchasing frequency.
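
Purchase frequency by category, like the percentage shares quoted in this section, can be obtained with `value_counts`. The sketch below uses a short made-up Series rather than the real column.

```python
import pandas as pd

# Made-up category codes standing in for the Product_Category column
categories = pd.Series([1, 5, 1, 8, 5, 1, 11, 2, 5, 8], name="Product_Category")

# Fraction of all purchases that fall in each category, most frequent first
share = categories.value_counts(normalize=True)
print(share.head(3))
```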

Lastly, the dataset contains 20 different occupation codes.