# Marketing Campaigns


## 1. Business Understanding

### Problem scenario: 
The marketing mix is a widely used concept in executing marketing strategies. It covers the facets of a comprehensive marketing plan and centers on the four Ps of marketing: product, price, place, and promotion.

### Problem objective:
As a data scientist, you must conduct exploratory data analysis and hypothesis testing to better understand the factors that influence customer acquisition.

### Data description:
Variables such as birth year, education, and income pertain to the first 'P', 'People', in the tabular data provided. Expenditures on items such as wine, fruits, and gold are associated with 'Product'. Information on sales channels, such as websites and stores, is connected to 'Place', and the fields covering promotions and the outcomes of various campaigns are linked to 'Promotion'.

### Steps to perform:

	•	After importing the data, examine variables such as Dt_Customer and Income to verify their accurate importation.

	•	There are missing income values for some customers. Conduct missing value imputation, considering that customers with similar education and marital status tend to have comparable yearly incomes, on average. It may be necessary to cleanse the data before proceeding. Specifically, scrutinize the categories of education and marital status for data cleaning.

	•	Create variables to represent the total number of children, age, and total spending.

	•	Derive the total purchases from the number of transactions across the three channels.

	•	Generate box plots and histograms to gain insights into the distributions and identify outliers. Implement outlier treatment as needed.

	•	Apply ordinal and one-hot encoding based on the various types of categorical variables.

	•	Generate a heatmap to illustrate the correlation between different pairs of variables.

	•	Test the following hypotheses:
		•	Older individuals may not possess the same level of technological proficiency and may, therefore, lean toward traditional in-store shopping preferences.
		•	Customers with children likely experience time constraints, making online shopping a more convenient option.
		•	Sales at physical stores may face the risk of cannibalization by alternative distribution channels.
		•	Does the United States significantly outperform the rest of the world in total purchase volumes?

	•	Use appropriate visualization to help analyze the following:
		•	Identify the top-performing products and those with the lowest revenue.
		•	Examine if there is a correlation between customers' age and the acceptance rate of the last campaign.
		•	Determine the country with the highest number of customers who accepted the last campaign.
		•	Investigate if there is a discernible pattern in the number of children at home and the total expenditure.
		•	Analyze the educational background of customers who lodged complaints in the last two years.
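
The derived-variable step in the list above could be sketched roughly as follows. This is a minimal illustration on a toy frame, not the notebook's actual implementation: the output column names (`Total_Children`, `Age`, `Total_Spending`) and the `REFERENCE_YEAR` constant are hypothetical, and the reference year in particular is an assumption that should be replaced by whatever year matches the data collection.

```python
import pandas as pd

# Hypothetical reference year for computing age; an assumption, not from the dataset.
REFERENCE_YEAR = 2015

# Toy stand-in for df; values are illustrative only.
toy = pd.DataFrame({
    "Year_Birth": [1970, 1985],
    "Kidhome": [1, 0],
    "Teenhome": [1, 0],
    "MntWines": [100, 20],
    "MntFruits": [10, 5],
    "MntMeatProducts": [50, 15],
    "MntFishProducts": [0, 2],
    "MntSweetProducts": [5, 3],
    "MntGoldProds": [20, 5],
})

# Total number of children: kids plus teenagers at home.
toy["Total_Children"] = toy["Kidhome"] + toy["Teenhome"]

# Age relative to the assumed reference year.
toy["Age"] = REFERENCE_YEAR - toy["Year_Birth"]

# Total spending: row-wise sum over every column whose name starts with "Mnt".
spend_cols = [c for c in toy.columns if c.startswith("Mnt")]
toy["Total_Spending"] = toy[spend_cols].sum(axis=1)

print(toy[["Total_Children", "Age", "Total_Spending"]])
```

Selecting the spending columns by the `Mnt` prefix avoids hard-coding the product list, at the cost of silently including any future column that shares the prefix.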



### Dataset Overview

In [16]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# For better visualizations
plt.style.use('ggplot')
sns.set(style="whitegrid")


df = pd.read_csv("data/1736837301_datasets/marketing_data.csv")

# Display basic information
print(f"Dataset Shape: {df.shape}")
df.info(memory_usage="deep")

ModuleNotFoundError: No module named 'matplotlib'

In [12]:
# Display summary statistics for categorical (object) columns
print("Summary Statistics for Categorical Variables:")
df.describe(include='O').T

Summary Statistics for Categorical Variables:


| Column | count | unique | top | freq |
|---|---|---|---|---|
| Education | 2240 | 5 | Graduation | 1127 |
| Marital_Status | 2240 | 8 | Married | 864 |
| Income | 2216 | 1974 | $7,500.00 | 12 |
| Dt_Customer | 2240 | 663 | 8/31/12 | 12 |
| Country | 2240 | 8 | SP | 1095 |
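
The summary above shows Income stored as text (for example `$7,500.00`) and Dt_Customer as strings such as `8/31/12`, so both need conversion before any numeric or date analysis. The following is a minimal sketch on a toy frame. It assumes US-style currency formatting and month/day/two-digit-year dates, both inferred from the sample values rather than confirmed for every row, and it assumes the Income header carries stray whitespace.

```python
import pandas as pd

# Toy stand-in for df; values are illustrative, not taken from the dataset.
raw = pd.DataFrame({
    " Income": ["$7,500.00", "$84,835.00", None],
    "Dt_Customer": ["8/31/12", "6/16/14", "1/5/13"],
})

# Strip stray whitespace from column names (e.g. " Income" -> "Income").
raw.columns = raw.columns.str.strip()

# Remove "$" and thousands separators, then convert to float.
# Missing or unparseable entries become NaN.
raw["Income"] = pd.to_numeric(
    raw["Income"].str.replace(r"[$,]", "", regex=True).str.strip(),
    errors="coerce",
)

# Parse dates with an explicit format (assumed month/day/two-digit year).
raw["Dt_Customer"] = pd.to_datetime(raw["Dt_Customer"], format="%m/%d/%y")

print(raw.dtypes)
```

Passing an explicit `format` makes a day/month mix-up fail loudly instead of being silently guessed.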


In [10]:
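# astype returns a new Series; without assignment, df itself is unchanged,
# so df.info() below still reports Education as object.
df['Education'].astype("string")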

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 28 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   ID                   2240 non-null   int64 
 1   Year_Birth           2240 non-null   int64 
 2   Education            2240 non-null   object
 3   Marital_Status       2240 non-null   object
 4    Income              2216 non-null   object
 5   Kidhome              2240 non-null   int64 
 6   Teenhome             2240 non-null   int64 
 7   Dt_Customer          2240 non-null   object
 8   Recency              2240 non-null   int64 
 9   MntWines             2240 non-null   int64 
 10  MntFruits            2240 non-null   int64 
 11  MntMeatProducts      2240 non-null   int64 
 12  MntFishProducts      2240 non-null   int64 
 13  MntSweetProducts     2240 non-null   int64 
 14  MntGoldProds         2240 non-null   int64 
 15  NumDealsPurchases    2240 non-null   int64 
 16  NumWeb

In [6]:
df.describe().T

| Column | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| ID | 2240.0 | 5592.159821 | 3246.662198 | 0.0 | 2828.25 | 5458.5 | 8427.75 | 11191.0 |
| Year_Birth | 2240.0 | 1968.805804 | 11.984069 | 1893.0 | 1959.0 | 1970.0 | 1977.0 | 1996.0 |
| Kidhome | 2240.0 | 0.444196 | 0.538398 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| Teenhome | 2240.0 | 0.50625 | 0.544538 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| Recency | 2240.0 | 49.109375 | 28.962453 | 0.0 | 24.0 | 49.0 | 74.0 | 99.0 |
| MntWines | 2240.0 | 303.935714 | 336.597393 | 0.0 | 23.75 | 173.5 | 504.25 | 1493.0 |
| MntFruits | 2240.0 | 26.302232 | 39.773434 | 0.0 | 1.0 | 8.0 | 33.0 | 199.0 |
| MntMeatProducts | 2240.0 | 166.95 | 225.715373 | 0.0 | 16.0 | 67.0 | 232.0 | 1725.0 |
| MntFishProducts | 2240.0 | 37.525446 | 54.628979 | 0.0 | 3.0 | 12.0 | 50.0 | 259.0 |
| MntSweetProducts | 2240.0 | 27.062946 | 41.280498 | 0.0 | 1.0 | 8.0 | 33.0 | 263.0 |
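
The statistics above already hint at outliers: for Year_Birth, the quartiles 1959 and 1977 give an interquartile range of 18, so the usual 1.5 × IQR fences sit at 1932 and 2004, and the minimum of 1893 falls outside them. A minimal sketch of that rule follows. The helper name `iqr_bounds` is hypothetical, the sample values are illustrative, and 1.5 × IQR is one common convention rather than the only valid treatment.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Illustrative values only; 1893 mirrors the minimum reported by describe().
year_birth = pd.Series([1893, 1959, 1965, 1970, 1975, 1977, 1996], name="Year_Birth")

lower, upper = iqr_bounds(year_birth)

# Keep only the observations that fall outside the fences.
outliers = year_birth[(year_birth < lower) | (year_birth > upper)]

print(f"fences: [{lower}, {upper}]")
print(outliers.tolist())
```

Whether flagged rows are dropped, capped, or kept depends on the variable: an 1893 birth year is almost certainly a data error, while a very high spending value may be a genuine customer.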


In [14]:
# `include` expects 'all' or dtype selectors; 'O' (the letter, not the digit 0) selects object columns
df.describe(include='O').T

### 2.2 Data Quality Assessment
Before drawing conclusions from the data, we need to assess its quality. Poor data quality can lead to inaccurate analysis and misleading insights.

We'll assess four key dimensions of data quality:

	•	Completeness: Are there missing values?
	•	Correctness: Are there outliers, duplicates, or invalid values?
	•	Relevance: Are the variables distributed in a way that's useful for analysis?
	•	Trustworthiness: Are there inconsistencies between related variables? Are data sources reliable?
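
As a rough illustration of the correctness checks named above, the sketch below counts duplicated rows and repeated IDs and lists the levels of the categorical columns so that unexpected labels can be spotted. It runs on a toy frame with illustrative values; the variable names are hypothetical and the checks are a starting point, not an exhaustive audit.

```python
import pandas as pd

# Toy stand-in for df (illustrative values, not taken from the dataset).
toy = pd.DataFrame({
    "ID": [1, 2, 2, 3],
    "Education": ["Graduation", "PhD", "PhD", "Master"],
    "Marital_Status": ["Married", "Single", "Single", "Married"],
})

# Fully duplicated rows (every column identical to an earlier row).
n_duplicate_rows = int(toy.duplicated().sum())

# IDs that appear more than once, even if other columns differ.
n_duplicate_ids = int(toy["ID"].duplicated().sum())

# Frequency of each category level, to spot labels that need cleaning.
category_counts = {
    col: toy[col].value_counts() for col in ["Education", "Marital_Status"]
}

print(f"Duplicate rows: {n_duplicate_rows}, duplicate IDs: {n_duplicate_ids}")
for col, counts in category_counts.items():
    print(f"\n{col} levels:\n{counts}")
```
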

#### 2.2.1 Missing Values Analysis
Let's first check for missing values in the dataset. Missing data can impact our analysis and modeling approaches.

In [17]:
# Check for missing values
missing_values = df.isnull().sum()
missing_percentage = (missing_values / len(df)) * 100

missing_df = pd.DataFrame({
    'Missing Values': missing_values,
    'Percentage': missing_percentage.round(2)
})

print("Missing Values Analysis:")
missing_values_filter = missing_df['Missing Values'] > 0
print(missing_df[missing_values_filter].sort_values(by='Percentage', ascending=False))

Missing Values Analysis:
          Missing Values  Percentage
 Income               24        1.07
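
With only 24 incomes missing, a group-based fill is a reasonable option. The sketch below fills each gap from customers who share the same education and marital status. It assumes Income has already been converted to a numeric type, uses a toy frame with illustrative values, and uses the group median rather than the mean as a robustness choice against extreme incomes; that choice is an assumption, not something the task statement prescribes.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df with an already-numeric Income column (illustrative values).
toy = pd.DataFrame({
    "Education": ["PhD", "PhD", "PhD", "Master", "Master", "Master"],
    "Marital_Status": ["Married", "Married", "Married", "Single", "Single", "Single"],
    "Income": [60000.0, 80000.0, np.nan, 40000.0, np.nan, 50000.0],
})

# Median income within each (Education, Marital_Status) group, aligned row by row.
group_median = toy.groupby(["Education", "Marital_Status"])["Income"].transform("median")

# Fill gaps with the group median; fall back to the overall median for any
# group that has no observed income at all.
toy["Income"] = toy["Income"].fillna(group_median).fillna(toy["Income"].median())

print(toy)
```

Cleaning the Education and Marital_Status labels first matters here, because sparse or misspelled categories form tiny groups whose medians are unreliable.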


In [15]:


# Visualize missing values
plt.figure(figsize=(12, 6))
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')
plt.title('Missing Value Heatmap')
plt.tight_layout()
plt.show()

# Summary of completeness
print("\nCompleteness Analysis:")
print(f"Total number of cells in the dataset: {df.size}")
print(f"Total number of missing values: {df.isnull().sum().sum()}")
print(f"Percentage of missing data: {(df.isnull().sum().sum() / df.size * 100).round(2)}%")


NameError: name 'plt' is not defined