
##Customer Purchase Behavior Analysis using Descriptive Statistics
##Problem Statement

🔍 Problem Statement:

Welcome to the Probability and Statistics project! 📊🔍 In this exciting journey, you'll get the chance to apply the concepts you've learned in probability theory and statistics to analyze a real-world dataset. This project is your opportunity to dive deep into the world of data analysis and gain practical experience with the tools and techniques you've been learning. 🚀

🎯 Objective:

Your mission is to analyze the provided dataset containing customer information and purchasing behavior to make informed decisions. Your goal is to identify patterns, trends, and correlations that will help your company optimize its marketing efforts and increase offer acceptance rates. 🎉

##About the Dataset

Here's the link to the dataset

This data was gathered during last year's campaign. The data description is as follows:

Response (target) - 1 if customer accepted the offer in the last campaign, 0 otherwise
ID - Unique ID of each customer
Year_Birth - customer's year of birth (age can be derived from it)
Complain - 1 if the customer complained in the last 2 years
Dt_Customer - date of customer's enrollment with the company
Education - customer's level of education
Marital_Status - customer's marital status
Kidhome - number of small children in customer's household
Teenhome - number of teenagers in customer's household
Income - customer's yearly household income
MntFishProducts - the amount spent on fish products in the last 2 years
MntMeatProducts - the amount spent on meat products in the last 2 years
MntFruits - the amount spent on fruits products in the last 2 years
MntSweetProducts - amount spent on sweet products in the last 2 years
MntWines - the amount spent on wine products in the last 2 years
MntGoldProds - the amount spent on gold products in the last 2 years
NumDealsPurchases - number of purchases made with discount
NumCatalogPurchases - number of purchases made using catalog (buying goods to be shipped through the mail)
NumStorePurchases - number of purchases made directly in stores
NumWebPurchases - number of purchases made through the company's website
NumWebVisitsMonth - number of visits to company's website in the last month
Recency - number of days since the last purchase
##Task 1 - Basic CleanUp

Clean and preprocess the dataset (handling missing values, data types, etc.).

Analyze the distribution of customer demographics (age, education, marital status) using descriptive statistics and visualizations.

Deliverables:

Cleaned and Preprocessed Dataset:

Provide a detailed report on the steps taken to handle missing values, including imputation methods used if applicable. Document the process of ensuring consistent data types for each variable, addressing any inconsistencies.

Summary of Basic Statistics:

Present calculated statistics such as mean, median, variance, and standard deviation for each relevant numerical variable. Include a concise table or summary showcasing these measures for easy reference.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from datetime import datetime
import seaborn as sns
import scipy.stats as stats
import warnings
warnings.filterwarnings('ignore')

In [None]:
#reading the csv dataset
dfs = pd.read_csv("/content/Superstore Marketing Data - Sheet1 (1).csv")

In [None]:
dfs.head()

In [None]:
#basic info about our data
dfs.info()

In [None]:
dfs.shape

In [None]:
print(dfs.apply(lambda col:col.unique()))

In [None]:
# Counting the number of unwanted placeholder strings in the column
dt_cust =  dfs['Dt_Customer'] == '########'
print(dt_cust.sum())

In [None]:
# replacing string with NA value to calculate amount of null values in the dataset
dfs['Dt_Customer'] = dfs['Dt_Customer'].replace('########', pd.NA)

In [None]:
#get the num of nulls in each column
dfs.isnull().sum()

In [None]:
# Calculate the percentage of null values in each column
null_percentage = (dfs.isnull().sum() / len(dfs)) * 100

# Create a new DataFrame to display the null percentage for each column
null_percentage_df = pd.DataFrame({'NullPercentage': null_percentage})

# Display the null percentage for each column
null_percentage_df

In [None]:
dfs['Dt_Customer'] = pd.to_datetime(dfs['Dt_Customer'])

We can see that the 'Dt_Customer' column has about 40% missing data, and the dates have not been updated since 2014, so I will drop the column because a major chunk of its data is missing.

In [None]:
dfs.drop(columns='Dt_Customer', inplace=True)
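
The drop decision above can be made explicit with a threshold on the share of missing values. The following is a minimal sketch on a toy DataFrame; the 30% cutoff and the helper name `drop_sparse_columns` are assumptions for illustration and do not come from the original analysis.

```python
import pandas as pd

def drop_sparse_columns(frame: pd.DataFrame, max_null_share: float = 0.30) -> pd.DataFrame:
    """Return a copy without columns whose share of nulls exceeds max_null_share."""
    null_share = frame.isnull().mean()  # fraction of nulls per column
    keep = null_share[null_share <= max_null_share].index
    return frame[keep].copy()

# Toy data (hypothetical values, not the real dataset)
toy = pd.DataFrame({
    "Dt_Customer": ["2014-01-05", None, None, "2013-11-20", None],  # 60% missing
    "Income": [52000.0, 61000.0, None, 48000.0, 75000.0],           # 20% missing
    "Recency": [10, 42, 7, 88, 3],                                   # complete
})

cleaned = drop_sparse_columns(toy)
print(list(cleaned.columns))
```

Columns at or below the cutoff are kept, so a column like 'Income' with a small share of nulls survives and can be imputed instead of dropped.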

##Analyzing Income column

In [None]:
dfs.describe()

The 'Income' column has about 1% missing values, so I'll fill them with the median value.

In [None]:
# Filling missing value with median value
median_income = dfs['Income'].median()
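# Assign the result back: in-place fillna on a selected column may not update dfs in newer pandas
dfs['Income'] = dfs['Income'].fillna(median_income)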
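
One reason to prefer the median here is that it is far less sensitive to extreme incomes than the mean. A small sketch with made-up income values (not taken from the dataset) shows how much a single extreme value shifts the mean compared with the median:

```python
import numpy as np
import pandas as pd

# Hypothetical incomes with one extreme value and one missing entry
income = pd.Series([30000, 42000, 51000, 58000, 666666, np.nan], name="Income")

mean_fill = income.fillna(income.mean())
median_fill = income.fillna(income.median())

print(f"value imputed with the mean:   {mean_fill.iloc[-1]:.0f}")
print(f"value imputed with the median: {median_fill.iloc[-1]:.0f}")
```

The mean-based fill lands far above every typical value in the sample, while the median-based fill stays in the middle of the ordinary incomes.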

In [None]:
# Creating a boxplot for 'Income' column
plt.figure(figsize=(8, 6))
plt.boxplot(dfs['Income'], vert=False)
plt.title('Boxplot of Income')
plt.xlabel('Income')
plt.grid(axis='x', linestyle='--', alpha=0.7)
plt.show()

In [None]:
dfs.describe()

In [None]:
# Removing extreme outliers (e.g., values beyond 3 standard deviations)
std_dev = dfs['Income'].std()
mean_income = dfs['Income'].mean()
df = dfs[(dfs['Income'] >= mean_income - 3 * std_dev) & (dfs['Income'] <= mean_income + 3 * std_dev)]
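
The 3-standard-deviation rule is only one way to flag extreme values. The sketch below compares it with the 1.5 * IQR rule on synthetic incomes; the distribution parameters and the two injected extreme values are assumptions chosen purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# 500 synthetic "ordinary" incomes plus two injected extreme values
income = pd.Series(np.append(rng.normal(50000, 15000, size=500), [400000, 666666]))

# Rule 1: keep values within 3 standard deviations of the mean
mean, std = income.mean(), income.std()
sigma_mask = (income - mean).abs() <= 3 * std

# Rule 2: keep values within the 1.5 * IQR fences
q1, q3 = income.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_mask = income.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

print(f"rows dropped by 3-sigma rule: {(~sigma_mask).sum()}")
print(f"rows dropped by 1.5*IQR rule: {(~iqr_mask).sum()}")
```

Because extreme values inflate the standard deviation itself, the 3-sigma window tends to be wider than the IQR fences, so the IQR rule usually flags at least as many rows.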

In [None]:
# Creating a boxplot for 'Income' after removing the extreme outliers
plt.figure(figsize=(8, 6))
plt.boxplot(df['Income'], vert=False)
plt.title('Boxplot of Income (after removing outliers)')
plt.xlabel('Income')
plt.grid(axis='x', linestyle='--', alpha=0.7)
plt.show()

In [None]:
# Numerical variables - summary statistics via describe() (count, mean, std, min, quartiles, max)
numerical_cols = ['Year_Birth', 'Income', 'Kidhome', 'Teenhome', 'Recency', 'MntWines', 'MntFruits',
                  'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds',
                  'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases',
                  'NumWebVisitsMonth']
numerical_stats = dfs[numerical_cols].describe()

# Categorical variables - Calculating frequency counts or percentages
categorical_cols = ['Education', 'Marital_Status', 'Response', 'Complain']
categorical_stats = {}
for col in categorical_cols:
    categorical_stats[col] = dfs[col].value_counts(normalize=True) * 100  # Calculate percentages

# Displaying the calculated statistics
print("Basic Statistics for Numerical Variables:")
print(numerical_stats)

print("\nFrequency Counts or Percentages for Categorical Variables:")
for col, values in categorical_stats.items():
    print(f"\n{col}:")
    print(values)

Visualizing the distribution of customer demographics including age, education, and marital status using histograms and bar charts

In [None]:
# Visualization of Age distribution using a histogram
plt.figure(figsize=(8, 6))
sns.histplot(dfs['Year_Birth'], bins=30, kde=True, color='skyblue')
plt.title('Age Distribution')
plt.xlabel('Year of Birth')
plt.ylabel('Frequency')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

# Visualization of Education and Marital Status using bar charts
plt.figure(figsize=(12, 5))

# Education distribution
plt.subplot(1, 2, 1)
education_counts = dfs['Education'].value_counts()
sns.barplot(x=education_counts.index, y=education_counts.values, palette='viridis')
plt.title('Education Distribution')
plt.xlabel('Education Level')
plt.ylabel('Count')
plt.xticks(rotation=45)

# Marital Status distribution
plt.subplot(1, 2, 2)
marital_counts = dfs['Marital_Status'].value_counts()
sns.barplot(x=marital_counts.index, y=marital_counts.values, palette='muted')
plt.title('Marital Status Distribution')
plt.xlabel('Marital Status')
plt.ylabel('Count')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

##Analysis of Customer Demographics
**Age Distribution (Histogram)**

**Interpretation:**

The histogram represents the distribution of customer ages based on birth year.
Provides an overview of age spread and concentration within the dataset.

**Observations:**

The distribution appears to be relatively normal or slightly right-skewed.
Dominant customer segments within specific age ranges might be identifiable.
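
Since the dataset stores the year of birth rather than the age, an explicit age column can make age-based segments easier to read. This is a minimal sketch: the reference year of 2014, the bin edges, and the column names `Age` and `AgeGroup` are assumptions for illustration and should be adjusted to the actual campaign year.

```python
import pandas as pd

REFERENCE_YEAR = 2014  # assumption: year used to convert birth year into age

# Hypothetical birth years, not taken from the dataset
births = pd.DataFrame({"Year_Birth": [1957, 1965, 1976, 1984, 1990]})
births["Age"] = REFERENCE_YEAR - births["Year_Birth"]

# Right-inclusive bins: (0, 30], (30, 45], (45, 60], (60, 120]
births["AgeGroup"] = pd.cut(
    births["Age"],
    bins=[0, 30, 45, 60, 120],
    labels=["<=30", "31-45", "46-60", "61+"],
)
print(births)
```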

**Education Distribution (Bar Chart)**

**Interpretation:**

The bar chart displays the frequency of customers in different education levels.
Visualizes the relative proportions of customers across educational backgrounds.

**Observations:**

Graduates are the most prevalent education level among customers, as shown by the tallest bar.
Shows the diversity in educational backgrounds within the customer base.

**Marital Status Distribution (Bar Chart)**

**Interpretation:**

The bar chart illustrates the distribution of customers across various marital statuses.
Helps in understanding the prevalence of different marital statuses among customers.

**Observations:**

Married customers are the dominant marital status category in the customer base.
Provides insights into the relative distribution across different marital statuses.

**Overall Insights and Actionable Points**

**Understanding Customer Demographics:**

Helps in targeted marketing strategies based on age groups, educational backgrounds, and marital statuses.

**Tailored Marketing and Offers:**

Customizing campaigns or products to match the preferences of specific demographic segments identified.

##Task 2 - Descriptive Statistics 📊
Calculate measures of central tendency (mean, median, mode) and measures of dispersion (variance, standard deviation) for key variables. Identify and handle outliers if necessary.

**Deliverables:**

Descriptive statistics that reveal the central tendencies, variations, and potential outliers in the dataset.

In [None]:
# Extracting relevant numerical columns
numerical_cols = ['Year_Birth', 'Income', 'Kidhome', 'Teenhome', 'Recency', 'MntWines', 'MntFruits',
                  'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds',
                  'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases',
                  'NumWebVisitsMonth']

# Calculating statistics for numerical columns
numerical_stats = dfs[numerical_cols].agg(['mean', 'median', 'var', 'std']).transpose()

# Renaming index and formatting the table
numerical_stats.index.name = 'Numerical Variables'
numerical_stats.columns = ['Mean', 'Median', 'Variance', 'Standard Deviation']

# Displaying the summary table
numerical_stats
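
The task also lists the mode among the measures of central tendency. One possible way to add it to such a summary table is sketched below on toy data; the values are made up, and when several values tie for the mode this sketch keeps the smallest one, which is a simplifying assumption.

```python
import pandas as pd

# Toy data (hypothetical values, not the real dataset)
toy = pd.DataFrame({
    "Kidhome": [0, 1, 0, 2, 0, 1],
    "Teenhome": [1, 1, 0, 0, 1, 0],
    "NumWebPurchases": [4, 4, 2, 6, 2, 3],
})

# DataFrame.mode() returns one row per tied mode, sorted ascending;
# the first row therefore holds the smallest mode of each column
modes = toy.mode().iloc[0]

summary = toy.agg(["mean", "median", "var", "std"]).transpose()
summary["mode"] = modes  # aligned on the variable names in the index
print(summary)
```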

In [None]:
# List of columns
columns_to_visualize = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts',
                        'MntSweetProducts', 'MntGoldProds', 'NumDealsPurchases',
                        'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases',
                        'NumWebVisitsMonth']

# Create boxplots for outlier detection
plt.figure(figsize=(12, 8))

# Loop through columns to create boxplots
for i, column in enumerate(columns_to_visualize, start=1):
    plt.subplot(3, 4, i)
    plt.boxplot(dfs[column].dropna(), vert=False)  # same DataFrame that the outlier handling modifies
    plt.title(column)
    plt.grid(True)

plt.tight_layout()
plt.show()

##**Handling Outliers**

In [None]:
for column in columns_to_visualize:
    Q1 = dfs[column].quantile(0.25)
    Q3 = dfs[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Cap values at the IQR fences; clip avoids writing float bounds into integer columns via .loc
    dfs[column] = dfs[column].clip(lower=lower_bound, upper=upper_bound)
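
The capping rule applied above can be wrapped in a small reusable helper and checked on a toy series. This is a sketch only: the helper name `cap_iqr` and the sample values are hypothetical.

```python
import pandas as pd

def cap_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] at the nearest fence."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Hypothetical spend values with one extreme entry
spend = pd.Series([10, 12, 14, 16, 18, 20, 500], name="MntWines")
capped = cap_iqr(spend)
print(capped.tolist())
```

Capping keeps every row in the dataset, unlike filtering, but it also piles the capped observations up at the fence values, which is worth keeping in mind when reading the statistics afterwards.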

In [None]:
# List of columns
columns_to_visualize = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts',
                        'MntSweetProducts', 'MntGoldProds', 'NumDealsPurchases',
                        'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases',
                        'NumWebVisitsMonth']

# Create boxplots for outlier detection
plt.figure(figsize=(12, 8))

# Loop through columns to create boxplots
for i, column in enumerate(columns_to_visualize, start=1):
    plt.subplot(3, 4, i)
    plt.boxplot(dfs[column].dropna(), vert=False)
    plt.title(column)
    plt.grid(True)

plt.tight_layout()
plt.show()

##Descriptive Statistics after handling outliers

In [None]:
# Extracting relevant numerical columns
numerical_cols = ['Year_Birth', 'Income', 'Kidhome', 'Teenhome', 'Recency', 'MntWines', 'MntFruits',
                  'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds',
                  'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases',
                  'NumWebVisitsMonth']

# Calculating statistics for numerical columns
numerical_stats = dfs[numerical_cols].agg(['mean', 'median', 'var', 'std']).transpose()

# Renaming index and formatting the table
numerical_stats.index.name = 'Numerical Variables'
numerical_stats.columns = ['Mean', 'Median', 'Variance', 'Standard Deviation']

# Displaying the summary table
numerical_stats


##Task 3 - Probability Distributions 🎲
Identify variables that could follow specific probability distributions (e.g., Binomial, Normal). Calculate probabilities and expected values based on these distributions.

**Deliverables:**

Determination of suitable probability distributions for relevant variables and corresponding calculated probabilities and expected values.

In [None]:
# Assuming 'Response' is a binary variable (0: No response, 1: Response)
# Calculate the probability of a response (success) based on the dataset
prob_response = dfs['Response'].mean()  # Probability of success (response)

# Probability Mass Function (PMF) for Bernoulli distribution (probability of success = prob_response)
prob_success = stats.bernoulli.pmf(1, prob_response)
print(f"Probability of Response = 1 (success): {prob_success:.4f}")

# Expected value (mean) for a Bernoulli distribution
expected_value = stats.bernoulli.mean(prob_response)
print(f"Expected value (mean) for Response: {expected_value:.4f}")

# Visualize Bernoulli distribution
x = [0, 1]  # Possible outcomes for a Bernoulli distribution
pmf_values = stats.bernoulli.pmf(x, prob_response)

plt.bar(x, pmf_values, align='center', alpha=0.5)
plt.xticks(x)
plt.xlabel('Response')
plt.ylabel('Probability')
plt.title('Bernoulli Distribution for Response')
plt.show()

In [None]:
# Assuming 'Complain' is a binary variable (0: No complaint, 1: Complaint)
# Calculate the probability of a complaint (success) based on the dataset
prob_complain = dfs['Complain'].mean()  # Probability of success (complaint)

# Probability Mass Function (PMF) for Bernoulli distribution (probability of success = prob_complain)
prob_success = stats.bernoulli.pmf(1, prob_complain)
print(f"Probability of Complain = 1 (success): {prob_success:.4f}")

# Expected value (mean) for a Bernoulli distribution
expected_value = stats.bernoulli.mean(prob_complain)
print(f"Expected value (mean) for Complain: {expected_value:.4f}")

# Visualize Bernoulli distribution
x = [0, 1]  # Possible outcomes for a Bernoulli distribution
pmf_values = stats.bernoulli.pmf(x, prob_complain)

plt.bar(x, pmf_values, align='center', alpha=0.5)
plt.xticks(x)
plt.xlabel('Complain')
plt.ylabel('Probability')
plt.title('Bernoulli Distribution for Complain')
plt.show()
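
A Bernoulli variable describes a single customer; the number of acceptances among several customers contacted independently follows a Binomial distribution. The sketch below uses an illustrative acceptance probability and campaign size, both of which are assumptions rather than values estimated from the dataset.

```python
from scipy import stats

p = 0.15  # assumption: illustrative acceptance probability
n = 100   # assumption: number of customers contacted independently

accepted = stats.binom(n, p)

expected_acceptances = accepted.mean()  # equals n * p
prob_at_least_20 = accepted.sf(19)      # P(X >= 20) = 1 - P(X <= 19)

print(f"Expected number of acceptances: {expected_acceptances:.1f}")
print(f"P(at least 20 acceptances): {prob_at_least_20:.4f}")
```

In practice `p` could be replaced with the observed mean of 'Response', under the assumption that customers respond independently with the same probability.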

In [None]:
# List of continuous variables
continuous_columns = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']

for column in continuous_columns:
    # Extracting data for the column
    data = df[column].dropna()  # Remove missing values if any

    # Fit a Normal distribution using maximum likelihood estimation (MLE)
    mu, sigma = stats.norm.fit(data)

    # Calculate probabilities and expected value
    prob_100 = stats.norm.cdf(100, mu, sigma)  # Probability of value <= 100
    expected_value = stats.norm.mean(mu, sigma)  # Expected value (mean)

    # Display results for each column
    print(f"Variable: {column}")
    print(f"Estimated Mean (mu): {mu:.2f}")
    print(f"Estimated Standard Deviation (sigma): {sigma:.2f}")
    print(f"Probability of value <= 100: {prob_100:.4f}")
    print(f"Expected Value (mean): {expected_value:.2f}\n")

Distribution for products columns

In [None]:
fig, axes = plt.subplots(nrows=len(continuous_columns), ncols=2, figsize=(12, 8))

for i, column in enumerate(continuous_columns):
    # Extracting data for the column
    data = df[column].dropna()  # Remove missing values if any

    # Fit a Normal distribution using maximum likelihood estimation (MLE)
    mu, sigma = stats.norm.fit(data)

    # Create histogram
    axes[i, 0].hist(data, bins=30, density=True, alpha=0.6, color='blue')
    axes[i, 0].set_title(f'Histogram for {column}')
    axes[i, 0].set_xlabel('Value')
    axes[i, 0].set_ylabel('Density')  # density=True normalizes the histogram

    # Create Q-Q plot (quantile-quantile plot)
    stats.probplot(data, dist="norm", plot=axes[i, 1])
    axes[i, 1].get_lines()[1].set_color('red')  # Highlight the Normal distribution line
    axes[i, 1].set_title(f'Q-Q plot for {column}')
    axes[i, 1].set_xlabel('Theoretical quantiles')
    axes[i, 1].set_ylabel('Sample quantiles')

plt.tight_layout()
plt.show()
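
Q-Q plots are read by eye, so a numeric check can complement them. The following sketch computes the sample skewness and runs the D'Agostino-Pearson test (`scipy.stats.normaltest`) on two synthetic samples; the samples are assumptions meant to mimic a right-skewed spend column and a bell-shaped one, not data from the notebook.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
skewed_spend = rng.exponential(scale=300, size=1000)    # assumed right-skewed sample
bell_shaped = rng.normal(loc=300, scale=50, size=1000)  # assumed roughly Normal sample

for name, sample in [("skewed_spend", skewed_spend), ("bell_shaped", bell_shaped)]:
    skewness = stats.skew(sample)
    statistic, p_value = stats.normaltest(sample)
    print(f"{name}: skewness={skewness:.2f}, normaltest p-value={p_value:.4f}")
```

A large positive skewness together with a very small p-value suggests that the Normal fit is a poor description, in which case probabilities such as P(value <= 100) from the fitted Normal should be treated with caution.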

In [None]:
# List of count/frequency variables
count_columns = ['NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth', 'NumDealsPurchases']

for column in count_columns:
    # Extracting data for the column
    data = df[column].dropna()  # Remove missing values if any

    # Fit a Poisson distribution using maximum likelihood estimation (MLE)
    mu = data.mean()  # Using sample mean as parameter for Poisson
    fitted_poisson = stats.poisson(mu)

    # Calculate probabilities and expected value
    prob_5 = fitted_poisson.pmf(5)  # Probability of value = 5
    expected_value = fitted_poisson.mean()  # Expected value (mean)

    # Display results for each column
    print(f"Variable: {column}")
    print(f"Estimated Mean (mu) for Poisson: {mu:.2f}")
    print(f"Probability of value = 5: {prob_5:.4f}")
    print(f"Expected Value (mean): {expected_value:.2f}\n")

In [None]:
fig, axes = plt.subplots(nrows=len(count_columns), ncols=2, figsize=(12, 8))

for i, column in enumerate(count_columns):
    # Extracting data for the column
    data = df[column].dropna()  # Remove missing values if any

    # Fit a Poisson distribution using maximum likelihood estimation (MLE)
    mu = data.mean()  # Using sample mean as parameter for Poisson
    fitted_poisson = stats.poisson(mu)

    # Create histogram
    axes[i, 0].hist(data, bins=30, density=True, alpha=0.6, color='green')
    axes[i, 0].set_title(f'Histogram for {column}')
    axes[i, 0].set_xlabel('Value')
    axes[i, 0].set_ylabel('Density')  # density=True normalizes the histogram

    # Plot the probability mass function (PMF) of the fitted Poisson distribution
    x = range(0, int(max(data)) + 1)  # Ensure maximum value is converted to an integer
    pmf_values = fitted_poisson.pmf(x)
    axes[i, 1].bar(x, pmf_values, alpha=0.6, color='orange')
    axes[i, 1].set_title(f'Poisson PMF for {column}')
    axes[i, 1].set_xlabel('Value')
    axes[i, 1].set_ylabel('Probability')

plt.tight_layout()
plt.show()
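
A Poisson distribution has equal mean and variance, so the variance-to-mean ratio gives a quick plausibility check before trusting a Poisson fit. The sketch below applies it to two synthetic count samples; the helper name `dispersion_index` and both samples are assumptions for illustration.

```python
import numpy as np

def dispersion_index(counts) -> float:
    """Sample variance divided by sample mean; close to 1 for Poisson-like counts."""
    counts = np.asarray(counts)
    return counts.var(ddof=1) / counts.mean()

rng = np.random.default_rng(7)
poisson_like = rng.poisson(lam=4, size=2000)
# Negative binomial with mean 4 and variance 12, i.e. clearly overdispersed
overdispersed = rng.negative_binomial(n=2, p=1 / 3, size=2000)

print(f"Poisson-like sample:  {dispersion_index(poisson_like):.2f}")
print(f"Overdispersed sample: {dispersion_index(overdispersed):.2f}")
```

A ratio well above 1 for a purchase-count column would suggest that a Poisson model understates how spread out the counts are.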

##Task 4: Insights and Customer Segmentation 📈

Explore relationships between customer characteristics and spending habits. Segment customers based on their behaviors and characteristics.

**Deliverables:**

Key insights regarding relationships between variables and distinct customer segments based on behaviors.

In [None]:
# Select only the numerical columns before calculating correlation
numerical_dfs = dfs.select_dtypes(include='number')  # 'number' already covers integer and float columns

# Create the correlation matrix
corr_matrix = numerical_dfs.corr()

# Create a heatmap
plt.figure(figsize=(20, 15))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', square=True)
plt.title('Correlation Heatmap')
plt.show()


##Key insights regarding variables:

**1. Income and Spending Patterns:**

'Income' appears to be positively correlated with spending on 'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts' and 'MntGoldProds', and with the purchase counts 'NumWebPurchases', 'NumCatalogPurchases' and 'NumStorePurchases'. Higher-income customers tend to spend and purchase more in these categories.

**2. 'Kidhome' and 'NumWebVisitsMonth' (correlation = 0.47):**

A correlation of 0.47 indicates a moderate positive correlation between the number of children at home ('Kidhome') and the number of web visits per month ('NumWebVisitsMonth').
This suggests that customers with more children at home are moderately more likely to make a higher number of web visits each month. The positive correlation implies that as the number of children at home increases, the number of web visits tends to increase as well.

**3. 'Teenhome' and 'NumDealsPurchases':**

A correlation of 0.44 between 'Teenhome' and 'NumDealsPurchases' indicates a moderate positive association, suggesting that customers with more teenagers at home tend to have a higher number of deals purchases.

**4. 'NumDealsPurchases' and 'NumWebPurchases' (correlation = 0.30):**

A correlation of 0.30 indicates a positive, but relatively weak, correlation between the number of deals purchases ('NumDealsPurchases') and the number of web purchases ('NumWebPurchases').
This suggests that customers who engage in more deals purchases are somewhat more likely to make more web-based purchases. However, the correlation is not very strong, and other factors may also influence the relationship.

**5. 'NumDealsPurchases' and 'NumWebVisitsMonth' (correlation = 0.38):**

A correlation of 0.38 indicates a positive, moderate correlation between the number of deals purchases ('NumDealsPurchases') and the number of web visits per month ('NumWebVisitsMonth').

This suggests that customers who make a higher number of deals purchases are moderately more likely to have a higher number of web visits each month. The positive correlation implies that as the number of deals purchases increases, the number of web visits tends to increase as well.
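
Reading the strongest pairs off a large heatmap by eye is error-prone, so they can also be extracted programmatically. This is a sketch on toy data; the helper name `top_correlations` and the values are hypothetical.

```python
import numpy as np
import pandas as pd

def top_correlations(frame: pd.DataFrame, n: int = 3) -> pd.Series:
    """Return the n variable pairs with the largest absolute Pearson correlation."""
    corr = frame.corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Toy data (hypothetical values, not the real dataset)
toy = pd.DataFrame({
    "Income": [20, 35, 50, 65, 80],
    "MntWines": [5, 20, 40, 70, 90],
    "NumWebVisitsMonth": [9, 7, 6, 4, 2],
})
result = top_correlations(toy)
print(result)
```

Sorting by absolute value keeps strong negative relationships in view alongside the positive ones.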

##Task 5: Conclusion and Recommendations

Create clear visualizations to showcase your findings. Use insights to make recommendations for the company based on your analysis.

**Deliverables:**

Well-designed visualizations that visually represent your insights and actionable recommendations based on customer behavior analysis.

**1. Targeted Marketing Campaigns:**

Identify high-value customer segments and design targeted marketing campaigns to cater to their specific needs and preferences.

In [None]:
# Calculate average income for each marital status category
avg_income_by_marital_status = dfs.groupby('Marital_Status')['Income'].mean().reset_index()

plt.figure(figsize=(10, 6))
sns.barplot(x='Marital_Status', y='Income', data=avg_income_by_marital_status, palette='viridis')
plt.title('Average Income by Marital Status')
plt.xlabel('Marital Status')
plt.ylabel('Average Income')
plt.xticks(rotation=45)  # Rotating x-axis labels for better readability
plt.show()

In [None]:
selected_columns = ['Income', 'NumDealsPurchases']

# Select relevant columns
df_selected = dfs[selected_columns]

plt.figure(figsize=(10, 6))
sns.boxplot(x='NumDealsPurchases', y='Income', data=df_selected, palette='viridis')
plt.title('Distribution of Income for Customers with Deals Purchases')
plt.xlabel('Number of Deals Purchases')
plt.ylabel('Income')
plt.show()

This box plot provides a more detailed view of the income distribution within different segments of customers who made varying numbers of deals purchases.

These visualizations can help in understanding the behavior of customers who respond well to deals and promotions, so that targeted promotions and marketing strategies can be designed accordingly. The selected columns can be adjusted, and additional variables explored, depending on the specific business context and goals.

**3. Website Optimization:**

Focus on improving the online shopping experience, as customers making more web purchases ('NumWebPurchases') and having higher web visits ('NumWebVisitsMonth') represent an engaged online audience.
Optimize the website for better user experience and ensure a seamless checkout process.

In [None]:
selected_columns = ['NumWebPurchases', 'NumWebVisitsMonth']

# Select relevant columns
df_selected = dfs[selected_columns]

plt.figure(figsize=(10, 6))
sns.scatterplot(x='NumWebPurchases', y='NumWebVisitsMonth', data=df_selected, alpha=0.7, color='blue')
sns.regplot(x='NumWebPurchases', y='NumWebVisitsMonth', data=df_selected, scatter=False, color='red')
plt.title('Online Shopping Behavior: Web Purchases vs Web Visits')
plt.xlabel('Number of Web Purchases')
plt.ylabel('Number of Web Visits per Month')
plt.show()

**Key insights from a downward-sloping regression line:**

**Inverse Relationship:** Customers who engage in more web visits per month are making fewer web purchases. This suggests that a higher number of web visits might not necessarily translate into a higher number of transactions.

**Potential Issues:** There may be potential issues or barriers on the website that hinder customers from completing transactions despite frequent visits. It could be related to the user interface, checkout process, or other aspects of the online shopping experience.

In summary, a downward-sloping regression line suggests that there might be room for improvement in converting web visits into actual purchases. Analyzing the reasons behind this inverse relationship can guide website optimization efforts to create a more effective and conversion-friendly online shopping experience.
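
The direction and strength of such a regression line can be quantified rather than judged from the plot. The sketch below fits a simple linear regression with `scipy.stats.linregress` on made-up values; the numbers are assumptions chosen to produce a downward slope, not figures from the dataset, and a negative slope on its own does not establish why visits and purchases move in opposite directions.

```python
import numpy as np
from scipy import stats

# Hypothetical values, ordered by number of web purchases
web_purchases = np.array([1, 2, 3, 4, 5, 6, 7, 8])
web_visits = np.array([9, 8, 8, 7, 6, 5, 5, 3])

fit = stats.linregress(web_purchases, web_visits)
print(f"slope={fit.slope:.2f}, r={fit.rvalue:.2f}, p-value={fit.pvalue:.4f}")
```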

**4. Segment-Specific Communication:**

Craft communication strategies based on customer segments. For example, communicate differently with families having 'Kidhome' and 'Teenhome' compared to those without children.

In [None]:
selected_columns = ['Kidhome', 'Teenhome']

# Select relevant columns
df_selected = dfs[selected_columns]

# Count plot for family structure
plt.figure(figsize=(10, 6))
sns.countplot(x='Kidhome', data=df_selected, hue='Teenhome', palette='coolwarm')
plt.title('Family Structure: Customers with Kidhome and Teenhome')
plt.xlabel('Presence of Kidhome')
plt.ylabel('Count')
plt.legend(title='Presence of Teenhome', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()

**Insights from the visualization:**

**Family Structure Distribution:**

The count plot displays the distribution of customers based on the presence of 'Kidhome' and 'Teenhome'.
Different colors represent the presence or absence of 'Teenhome' within each category of 'Kidhome'.

**Communication Strategy Insights:**

If there are distinct segments with families having both 'Kidhome' and 'Teenhome', it indicates a unique customer segment with specific communication needs.
By understanding the composition of your customer base in terms of family structure, you can tailor communication strategies to address the preferences and needs of different segments.

**Targeted Messaging:**

Consider crafting targeted messages for families with both 'Kidhome' and 'Teenhome' that resonate with their specific interests and challenges.
Communication strategies for customers without children may focus on different aspects that align with their lifestyle and preferences.

**Personalized Offers:**

Use this information to personalize offers or promotions that cater to the needs of specific family segments, enhancing the effectiveness of your marketing campaigns.
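
The family segments discussed above can be turned into an explicit label per customer. This is a simple rule-based sketch; the function name `family_segment`, the segment names, and the sample rows are all hypothetical.

```python
import pandas as pd

def family_segment(kidhome: int, teenhome: int) -> str:
    """Assign a coarse family segment from the Kidhome and Teenhome counts."""
    if kidhome > 0 and teenhome > 0:
        return "kids_and_teens"
    if kidhome > 0:
        return "kids_only"
    if teenhome > 0:
        return "teens_only"
    return "no_children"

# Toy data (hypothetical values, not the real dataset)
customers = pd.DataFrame({"Kidhome": [0, 1, 0, 2, 1], "Teenhome": [0, 0, 1, 1, 0]})
customers["Segment"] = [
    family_segment(k, t) for k, t in zip(customers["Kidhome"], customers["Teenhome"])
]
segment_counts = customers["Segment"].value_counts()
print(segment_counts)
```

With a segment column in place, spending or response rates can be compared per segment using an ordinary `groupby`.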

**5. In-Store Experience Enhancement:**

Consider enhancing the in-store experience for customers who make a significant number of in-store purchases ('NumStorePurchases').
Implement loyalty programs or exclusive in-store promotions to encourage repeat visits.

In [None]:
selected_columns = ['NumStorePurchases', 'Response']

# Select relevant columns
df_selected = dfs[selected_columns]

# Count plot for in-store purchases
plt.figure(figsize=(12, 6))
sns.countplot(x='NumStorePurchases', data=df_selected, palette='viridis')
plt.title('Distribution of In-Store Purchases')
plt.xlabel('Number of In-Store Purchases')
plt.ylabel('Count')
plt.show()

# Stacked bar plot of campaign response ('Response') by number of in-store purchases
response_counts = df_selected.groupby('NumStorePurchases')['Response'].value_counts().unstack()
response_counts.plot(kind='bar', stacked=True, color=['lightblue', 'darkblue'], figsize=(12, 6))
plt.title('In-Store Experience Enhancement: In-Store Purchases vs Campaign Response')
plt.xlabel('Number of In-Store Purchases')
plt.ylabel('Count')
plt.legend(title='Response', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()


**Insights from the visualizations:**

**In-Store Purchases Distribution:**

The count plot displays the distribution of customers based on the number of in-store purchases ('NumStorePurchases').
Identify the segments of customers who make a significant number of in-store purchases.

**Campaign Response by In-Store Purchases:**

The bar plot compares acceptance of the last campaign's offer ('Response') across different levels of in-store purchases.
If customers with a higher number of in-store purchases show a higher acceptance rate, it indicates that frequent in-store shoppers are a receptive audience for such offers.

**Opportunities for Enhancement:**

Focus on enhancing the in-store experience for customers who make fewer in-store purchases, as indicated by the distribution.
Consider implementing exclusive in-store promotions, personalized offers, or loyalty programs to encourage repeat visits and purchases.

**Tailored In-Store Strategies:**

Tailor in-store strategies based on the insights gained from the distribution of in-store purchases. Consider factors such as product placement, customer service, and promotional activities.
These visualizations provide insights into the distribution of in-store purchases and how campaign response varies with them, guiding efforts to enhance the in-store experience for different customer segments.
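
Stacked counts can hide differences in rates when group sizes vary, so the acceptance rate per purchase level is a useful companion figure. The sketch below computes it on toy data; the values are made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical values, not the real dataset)
toy = pd.DataFrame({
    "NumStorePurchases": [2, 2, 2, 2, 5, 5, 5, 5, 9, 9],
    "Response":          [0, 0, 0, 1, 0, 1, 0, 1, 1, 1],
})

# The mean of a 0/1 column is the share of customers who accepted the offer
response_rate = toy.groupby("NumStorePurchases")["Response"].agg(rate="mean", customers="size")
print(response_rate)
```

Reporting the group size next to the rate helps avoid over-interpreting rates that are based on only a handful of customers.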

**Bonus Task - GeoGebra Experiment**

Here's the link to an intriguing GeoGebra experiment: GeoGebra Experiment Link

This experiment lets you simulate coin flips as per your preferences and specifications!

Your task involves recording a video where you'll explain the concept of the Law of Large Numbers using this experiment. Dive further into the experience by adjusting the number of coins and exploring varying coin biases. 🪙📹🔍
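
The same idea can be reproduced in code as a complement to the GeoGebra experiment. The sketch below simulates biased coin flips with NumPy and tracks the running share of heads; the function name `running_heads_share`, the bias of 0.3, and the seed are assumptions for illustration.

```python
import numpy as np

def running_heads_share(n_flips: int, p_heads: float = 0.5, seed: int = 0) -> np.ndarray:
    """Simulate n_flips coin flips and return the share of heads after each flip."""
    rng = np.random.default_rng(seed)
    flips = rng.random(n_flips) < p_heads  # True means heads
    return np.cumsum(flips) / np.arange(1, n_flips + 1)

shares = running_heads_share(100_000, p_heads=0.3)
for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"after {n:>7} flips: share of heads = {shares[n - 1]:.4f}")
```

As the number of flips grows, the running share should settle close to the chosen bias, which is what the Law of Large Numbers predicts.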