<a href="https://colab.research.google.com/github/chetnaarora93/Data-Analysis-Project/blob/main/Walmart.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Walmart is an American multinational retail corporation that operates a chain of supercenters, discount department stores, and grocery stores. It has more than 100 million customers worldwide.

**IMPORTING LIBRARIES**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm
from scipy.stats import poisson
from scipy.stats import binom
import scipy.stats as stats
import math


**DOWNLOADING THE DATASET**

In [None]:
df=pd.read_csv("https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/001/293/original/walmart_data.csv?1641285094")
df.head(10)

**ANALYSING SHAPE**

In [None]:
df.shape

The Walmart dataframe has 550068 rows and 10 columns.

**ANALYSING DATATYPE AND BASIC METRICS**

In [None]:
df.info()

The User_ID, Occupation, Marital_Status, Product_Category and Purchase columns are of int dtype.

The rest of the columns are of object dtype.

In [None]:
df.describe()

**ANALYSING NULL VALUES**

In [None]:
df.isna().sum()   # to check null values

**NON GRAPHICAL ANALYSIS**

Gender Category Value Counts

In [None]:
df["Gender"].value_counts()

Male customers account for 414259 rows of the dataset.

Female customers account for 135809 rows.

**Marital Status Count**

In [None]:
df["Marital_Status"].value_counts()

**Occupation Value Counts**

In [None]:
df["Occupation"].value_counts()

Occupation category 4 is the most frequent.

**Product ID Counts**

In [None]:
df["Product_ID"].nunique()

There are 3631 unique products in total.

In [None]:
Gender_wise_marital_status=df.groupby("Gender")["Marital_Status"].value_counts()
Gender_wise_marital_status

In [None]:
df['Gender'].value_counts(normalize= True)*100

Of the 414259 rows for Male customers, 245910 belong to unmarried customers and the remaining 168349 to married customers.

Of the 135809 rows for Female customers, 78821 belong to unmarried customers and the remaining 56988 to married customers.

Also, the percentage of Male customers is 75.31% and that of Female customers is 24.69%.
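
The percentages follow directly from the gender counts. A minimal sketch in plain Python, using the counts reported by `value_counts()`; the names `gender_counts` and `gender_share` are only illustrative:

```python
# Row counts per gender, as reported by df["Gender"].value_counts()
gender_counts = {"M": 414259, "F": 135809}

total_rows = sum(gender_counts.values())

# Percentage share of each gender, rounded to two decimals
gender_share = {g: round(100 * n / total_rows, 2) for g, n in gender_counts.items()}

print(total_rows)    # 550068, the row count given by df.shape
print(gender_share)  # {'M': 75.31, 'F': 24.69}
```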

**Average Amount Spent by Males and Females**

In [None]:
avg_spent = df.groupby('Gender')['Purchase'].mean()
print('Average purchase of Males and Females : \n',avg_spent)

The average purchase made by Males is around 9437.53.

And the average purchase made by Females is around 8734.57.

**Top 10 average purchases by Male and Female customers**

In [None]:
# Separating male and female customers
female_customers = df[df['Gender'] == 'F']
male_customers = df[df['Gender'] == 'M']

# Calculating the mean purchase per user and keeping the top 10 values
female_amount_spent = (female_customers.groupby('User_ID')['Purchase'].mean()).sort_values(ascending=False).head(10)
male_amount_spent = (male_customers.groupby('User_ID')['Purchase'].mean()).sort_values(ascending=False).head(10)

# Printing the top 10 mean purchases
print("Female_amount_spent:\n", female_amount_spent)
print("\n")
print("Male_amount_spent:\n", male_amount_spent)

Among Females, the highest average purchase is by User ID 1005069, at 18490.17.

Among Males, the highest average purchase is by User ID 1003902, at 18577.89.

**GRAPHICAL ANALYSIS**

**PURCHASE HISTOGRAM**

In [None]:
plt.figure(figsize=(15,5))
sns.histplot(data=df, x='Purchase', color= 'blue', kde=True)
plt.show()

**Gender, City category, Occupation and Marital Status Distribution**

In [None]:
fig, axis = plt.subplots(nrows=2, ncols=2, figsize=(15,10))
sns.countplot(data=df, x='Gender', ax=axis[0,0])
sns.countplot(data=df, x='City_Category', ax=axis[0,1])
sns.countplot(data=df, x='Occupation', ax=axis[1,0])
sns.countplot(data=df, x='Marital_Status',ax=axis[1,1])
plt.show()

Male customers far outnumber female customers.

Most of the customers belong to city category B.

Occupation category 4 is the most common among customers.

Most of the customers are unmarried.

**Count of Product Category**

In [None]:
plt.figure(figsize=(10, 8))
sns.countplot(data=df, x='Product_Category', order=df['Product_Category'].value_counts().index)
plt.xlabel('Product Category')
plt.ylabel('Count')
plt.title('Product Category Count')

# Annotate each bar with its count
ax = plt.gca()
for p in ax.patches:
    ax.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', fontsize=8, color='black',
                xytext=(0, 10), textcoords='offset points')
plt.show()

The product categories with the most purchases are 5, 1 and 8.

**Age Groups and Stay in Current City Distribution**

In [None]:
fig, axis = plt.subplots(nrows=1, ncols=2, figsize=(12, 5))

data_age = df['Age']
axis[0].hist(data_age, color= 'gray', edgecolor='black')
axis[0].set_xlabel('Age Groups')
axis[0].set_ylabel('Count')
axis[0].set_title('Distribution of Age')

data_city_years = df['Stay_In_Current_City_Years']
axis[1].hist(data_city_years, color= 'gray', edgecolor='black')
axis[1].set_xlabel('Stay in City Groups')
axis[1].set_ylabel('Count')
axis[1].set_title('Distribution of Stay in Current City')

plt.show()

Most of the shopping is done by people in the age group 26-35.

Also, most customers have stayed in their current city for 1 year.


**OUTLIER DETECTION**

In [None]:
fig, axis = plt.subplots(nrows=1, ncols=2, figsize=(12,1))
fig.subplots_adjust(top=2)

sns.boxplot(data=df, x='Purchase', ax=axis[0])
sns.boxplot(data=df, x='Product_Category', ax=axis[1])
plt.show()

Purchase and Product category have outliers.
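
By default, a seaborn boxplot draws a point as an outlier when it lies more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule to a small made-up list rather than to the Walmart data; the function name `iqr_outliers` and the toy values are assumptions for illustration.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return (lower_bound, upper_bound, outliers) using the k * IQR rule."""
    # method="inclusive" interpolates linearly between the data points
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers

toy_purchases = [10, 12, 13, 15, 16, 18, 20, 95]  # made-up values
lower, upper, outliers = iqr_outliers(toy_purchases)
print(lower, upper, outliers)  # 4.125 27.125 [95]
```

With the notebook's dataframe, one could pass `df['Purchase'].tolist()` instead of the toy list to count the flagged rows.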

**BIVARIATE ANALYSIS**

ANALYSIS OF CATEGORICAL DATA USING BOXPLOTS

In [None]:
fig, axis = plt.subplots(nrows=2, ncols=2, figsize=(12, 10))

sns.boxplot(data=df, y='Gender',x ='Purchase',orient='h',ax=axis[0,0])
axis[0, 0].set_title("Gender vs Purchase", fontsize=12)
axis[0, 0].set_xlabel("Purchase", fontsize=10)
axis[0, 0].set_ylabel("Gender", fontsize=10)


sns.boxplot(data=df, y='Marital_Status',x ='Purchase',orient='h',ax=axis[0,1])
axis[0, 1].set_title("Marital_Status vs Purchase", fontsize=12)
axis[0, 1].set_xlabel("Purchase", fontsize=10)
axis[0, 1].set_ylabel("Marital_Status", fontsize=10)

sns.boxplot(data=df, y='Age',x ='Purchase',orient='h',ax=axis[1,0])
axis[1, 0].set_title("Age vs Purchase", fontsize=12)
axis[1, 0].set_xlabel("Purchase", fontsize=10)
axis[1, 0].set_ylabel("Age", fontsize=10)

sns.boxplot(data=df, y='City_Category',x ='Purchase',orient='h',ax=axis[1,1])
axis[1, 1].set_title("City_Category vs Purchase", fontsize=12)
axis[1, 1].set_xlabel("Purchase", fontsize=10)
axis[1, 1].set_ylabel("City_Category", fontsize=10)


plt.show()

Gender vs Purchase

The median purchase is almost the same for both genders. Males purchase more than Females. There are more outliers among Females than among Males.

Marital Status vs Purchase

The median purchase for single and married people is also almost equal. The number of outliers is the same in both cases.

Age vs Purchase

The median is almost the same for all the age groups, and outliers are present in all of them.

City Category vs Purchase

The medians of city A and B are almost the same, but that of C is a bit higher.

City A and B have more outliers than C.

**CONFIDENCE INTERVAL AND SAMPLE AVERAGE**

**Sample mean and std dev to plot the Distribution for Purchases made Genderwise**

In [None]:
genderwise_sum = df.groupby(['User_ID', 'Gender'])['Purchase'].sum()
genderwise_sum = genderwise_sum.reset_index()
genderwise_sum = genderwise_sum.sort_values(by='User_ID', ascending=False)
genderwise_sum

In [None]:
# filtering Genders
male_df = genderwise_sum[genderwise_sum['Gender']=='M']
female_df = genderwise_sum[genderwise_sum['Gender']=='F']

# Taking sample sizes
male_sample_size = 2000
female_sample_size = 1500

# calculating male and female population mean from filtered dataframe
male_pop_mean = round(male_df['Purchase'].mean(),2)
print('Male population mean:', male_pop_mean)

female_pop_mean = round(female_df['Purchase'].mean(),2)
print('Female population mean:', female_pop_mean)
print('\n')

# Taking random samples from male and female dataframe
sample_male = male_df.sample(n=male_sample_size)
sample_female = female_df.sample(n=female_sample_size)

# calculating male and female sample means, and the standard deviation of each filtered dataframe
male_sample_mean = sample_male['Purchase'].mean()
print('Male sample mean:', male_sample_mean)

female_sample_mean = sample_female['Purchase'].mean()
print('Female sample mean:', female_sample_mean)
print('\n')

male_std = round(male_df['Purchase'].std(),2)
print('Male standard deviation:', male_std)

female_std = round(female_df['Purchase'].std(),2)
print('Female standard deviation:', female_std)
print('\n')


**INTERVAL CALCULATION USING CLT**

Central Limit Theorem to get the Intervals for expenses made Genderwise at 99% confidence level
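
The Central Limit Theorem says that the means of repeated random samples are approximately normally distributed around the population mean, with a spread of about σ/√n, even when the underlying data is skewed. A rough simulation on synthetic, right-skewed data (not the Walmart purchases; every name and parameter below is an illustrative assumption):

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Synthetic right-skewed "population" of spending totals (made-up scale)
population = [random.expovariate(1 / 1000) for _ in range(20_000)]
pop_mean = statistics.fmean(population)
pop_std = statistics.pstdev(population)

sample_size = 200
n_repeats = 2_000

# Mean of each repeated random sample drawn without replacement
sample_means = [
    statistics.fmean(random.sample(population, sample_size))
    for _ in range(n_repeats)
]

mean_of_means = statistics.fmean(sample_means)
observed_se = statistics.stdev(sample_means)   # spread of the sample means
expected_se = pop_std / sample_size ** 0.5     # sigma / sqrt(n)

print(f"population mean : {pop_mean:.1f}")
print(f"mean of means   : {mean_of_means:.1f}")
print(f"observed SE     : {observed_se:.1f}")
print(f"sigma / sqrt(n) : {expected_se:.1f}")
```

The observed spread of the sample means should land close to σ/√n, which is the quantity the margin of error is built from.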

In [None]:
# setting Confidence level at 99%; the sample sizes were set when the samples were drawn
confidence_level = 0.99

# Calculate a separate margin of error for each gender using the z-distribution
z_critical = stats.norm.ppf((1 + confidence_level) / 2)
male_margin_of_error = z_critical * (male_std / np.sqrt(male_sample_size))
female_margin_of_error = z_critical * (female_std / np.sqrt(female_sample_size))

# Calculate the confidence interval for male and female
male_confidence_interval = (male_sample_mean - male_margin_of_error, male_sample_mean + male_margin_of_error)
print("Male at 99% Confidence Interval:", male_confidence_interval)

female_confidence_interval = (female_sample_mean - female_margin_of_error, female_sample_mean + female_margin_of_error)
print("Female at 99% Confidence Interval:", female_confidence_interval)

Central Limit Theorem to get the Intervals for expenses made Genderwise at 95% confidence level

In [None]:
# setting Confidence level at 95%; the sample sizes were set when the samples were drawn
confidence_level = 0.95

# Calculate a separate margin of error for each gender using the z-distribution
z_critical = stats.norm.ppf((1 + confidence_level) / 2)
male_margin_of_error = z_critical * (male_std / np.sqrt(male_sample_size))
female_margin_of_error = z_critical * (female_std / np.sqrt(female_sample_size))

# Calculate the confidence interval for male and female
male_confidence_interval = (male_sample_mean - male_margin_of_error, male_sample_mean + male_margin_of_error)
print("Male at 95% Confidence Interval:", male_confidence_interval)

female_confidence_interval = (female_sample_mean - female_margin_of_error, female_sample_mean + female_margin_of_error)
print("Female at 95% Confidence Interval:", female_confidence_interval)

1. In the recorded run, at the 99% confidence level with samples of 2000 male and 1500 female customers,

The average total amount spent per male customer lies between 860285.95 and 953290.68.

And the average total amount spent per female customer lies between 653524.47 and 746529.21.

2. At the 95% confidence level with the same samples,

The average total amount spent per male customer lies between 871404.38 and 942172.25.

And the average total amount spent per female customer lies between 664642.90 and 735410.77.

The intervals for males and females do not overlap at either confidence level.
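
Because each group has its own standard deviation and sample size, each group also needs its own margin of error. A minimal sketch of a z-based interval using only the standard library; the helper name `z_confidence_interval` and the input numbers are hypothetical, not values taken from the dataset:

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(sample_mean, std, n, confidence=0.95):
    """Two-sided z interval for a mean: sample_mean +/- z * std / sqrt(n)."""
    z_critical = NormalDist().inv_cdf((1 + confidence) / 2)
    margin = z_critical * std / sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Hypothetical summary statistics for two groups
group_a = z_confidence_interval(sample_mean=900_000, std=950_000, n=2000, confidence=0.99)
group_b = z_confidence_interval(sample_mean=700_000, std=800_000, n=1500, confidence=0.99)

print(group_a)
print(group_b)

# The widths differ because std and n differ between the two groups
width_a = group_a[1] - group_a[0]
width_b = group_b[1] - group_b[0]
print(width_a, width_b)
```

Two intervals of exactly equal width are usually a hint that a single margin of error was reused for both groups.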

**Sample mean and std dev to plot the Distribution for Purchases made by Married and Unmarried customers**

In [None]:
marital_statuswise_sum = df.groupby(['User_ID', 'Marital_Status'])['Purchase'].sum()
marital_statuswise_sum = marital_statuswise_sum.reset_index()
marital_statuswise_sum = marital_statuswise_sum.sort_values(by='User_ID', ascending=False)

#calculating married and unmarried means
married_mean = marital_statuswise_sum[marital_statuswise_sum['Marital_Status']==1]['Purchase'].mean()
print('Amount spent by married customers:',married_mean)

unmarried_mean = marital_statuswise_sum[marital_statuswise_sum['Marital_Status']==0]['Purchase'].mean()
print('Amount spent by unmarried customers:', unmarried_mean)

In [None]:
# Getting Marital Statuswise dataframe and setting sample sizes
unmarried_df = marital_statuswise_sum[marital_statuswise_sum['Marital_Status']==0]
married_df = marital_statuswise_sum[marital_statuswise_sum['Marital_Status']==1]

unmarried_sample_size = 2000
married_sample_size = 1500


# Taking random sample from unmarried and married dataframe
unmarried_sample = unmarried_df.sample(n=unmarried_sample_size)
married_sample = married_df.sample(n=married_sample_size)

# calculating unmarried and married population mean from filtered dataframe
unmarried_pop_mean = round(unmarried_df['Purchase'].mean(),2)
print('Unmarried population mean:', unmarried_pop_mean)

married_pop_mean = round(married_df['Purchase'].mean(),2)
print('Married population mean:', married_pop_mean)
print('\n')

# calculating unmarried and married sample means and sample standard deviation
unmarried_sample_mean = unmarried_sample['Purchase'].mean()
print('Unmarried sample mean:', unmarried_sample_mean)

married_sample_mean = married_sample['Purchase'].mean()
print('Married sample mean:', married_sample_mean)
print('\n')


unmarried_std = round(unmarried_sample['Purchase'].std(),2)
print('Unmarried standard deviation:', unmarried_std)

married_std = round(married_sample['Purchase'].std(),2)
print('Married standard deviation:', married_std)
print('\n')


**INTERVAL CALCULATION USING CLT**

Central Limit Theorem to get the Intervals for expenses by Married and Unmarried customers at 99% confidence level

In [None]:
# setting Confidence level at 99%; the sample sizes were set when the samples were drawn
confidence_level = 0.99

# Calculate a separate margin of error for each group using the z-distribution
z_critical = stats.norm.ppf((1 + confidence_level) / 2)
unmarried_margin_of_error = z_critical * (unmarried_std / np.sqrt(unmarried_sample_size))
married_margin_of_error = z_critical * (married_std / np.sqrt(married_sample_size))

# Calculate the confidence interval for unmarried and married customers
unmarried_confidence_interval = (unmarried_sample_mean - unmarried_margin_of_error, unmarried_sample_mean + unmarried_margin_of_error)
print("Unmarried Customers at 99% Confidence Interval:", unmarried_confidence_interval)

married_confidence_interval = (married_sample_mean - married_margin_of_error, married_sample_mean + married_margin_of_error)
print("Married Customers at 99% Confidence Interval:", married_confidence_interval)


Central Limit Theorem to get the Intervals for expenses by Married and Unmarried customers at 95% confidence level

In [None]:
# setting Confidence level at 95%; the sample sizes were set when the samples were drawn
confidence_level = 0.95

# Calculate a separate margin of error for each group using the z-distribution
z_critical = stats.norm.ppf((1 + confidence_level) / 2)
unmarried_margin_of_error = z_critical * (unmarried_std / np.sqrt(unmarried_sample_size))
married_margin_of_error = z_critical * (married_std / np.sqrt(married_sample_size))

# Calculate the confidence interval for unmarried and married customers
unmarried_confidence_interval = (unmarried_sample_mean - unmarried_margin_of_error, unmarried_sample_mean + unmarried_margin_of_error)
print("Unmarried Customers at 95% Confidence Interval:", unmarried_confidence_interval)

married_confidence_interval = (married_sample_mean - married_margin_of_error, married_sample_mean + married_margin_of_error)
print("Married Customers at 95% Confidence Interval:", married_confidence_interval)

1. In the recorded run, at the 99% confidence level with samples of 2000 unmarried and 1500 married customers,

The average total amount spent per unmarried customer lies between 840754.09 and 949928.31.

And the average total amount spent per married customer lies between 800413.14 and 909587.37.

2. At the 95% confidence level with the same samples,

The average total amount spent per unmarried customer lies between 853805.54 and 936876.86.

And the average total amount spent per married customer lies between 813464.59 and 896535.92.

The intervals for unmarried and married customers overlap at both confidence levels.
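
A common rough check is whether two intervals overlap: non-overlapping intervals suggest the group means differ, while overlapping ones leave the question open. A small sketch using the interval endpoints reported in this notebook, rounded to two decimals; the helper `intervals_overlap` is an illustrative name:

```python
def intervals_overlap(a, b):
    """True if the closed intervals a=(low, high) and b=(low, high) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Reported endpoints, rounded to two decimals
reported = {
    "gender, 99%":  ((860285.95, 953290.68), (653524.47, 746529.21)),
    "gender, 95%":  ((871404.38, 942172.25), (664642.90, 735410.77)),
    "marital, 99%": ((840754.09, 949928.31), (800413.14, 909587.37)),
    "marital, 95%": ((853805.54, 936876.86), (813464.59, 896535.92)),
}

overlaps = {label: intervals_overlap(first, second)
            for label, (first, second) in reported.items()}

for label, result in overlaps.items():
    print(label, "overlap:", result)
# gender intervals: False at both levels; marital intervals: True at both levels
```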

**INSIGHTS**

Male customers far outnumber female customers.

The percentage of Male customers is 75.31% and that of Female customers is 24.69%.

Most of the customers belong to city category B.

Occupation category 4 is the most common among customers.

Most of the customers are unmarried.

The product categories with the most purchases are 5, 1 and 8.

Most of the shopping is done by people in the age group 26-35.

Also, most customers have stayed in their current city for 1 year.

The average total amount spent per male customer is 944233.944.

The average total amount spent per female customer is 717142.09.

Male customers spend more on average than female customers.

1. In the recorded run, at the 99% confidence level,

The average total amount spent per male customer lies between 860285.95 and 953290.68.

And the average total amount spent per female customer lies between 653524.47 and 746529.21.

The average total amount spent per unmarried customer lies between 840754.09 and 949928.31.

And the average total amount spent per married customer lies between 800413.14 and 909587.37.

2. At the 95% confidence level,

The average total amount spent per male customer lies between 871404.38 and 942172.25.

And the average total amount spent per female customer lies between 664642.90 and 735410.77.

The average total amount spent per unmarried customer lies between 853805.54 and 936876.86.

And the average total amount spent per married customer lies between 813464.59 and 896535.92.

**RECOMMENDATIONS**

1. Gender

Since Male customers significantly outnumber Female customers, it would be beneficial to focus marketing efforts on attracting more male customers.

2. City Category

Since most customers belong to city category B, it would be wise to prioritize marketing efforts in this category. This could involve offering exclusive discounts to customers from city category B to encourage repeat purchases.

3. Occupation

Since occupation category 4 is the most common among customers, the marketing strategy should focus on the needs and preferences of this occupation group.

4. Marital Status

As most customers are Unmarried, it would be beneficial to develop marketing campaigns that target this group. This could include promotions or product offerings that cater to the lifestyle and preferences of unmarried individuals.

5. Product Categories

Since product categories 5, 1, and 8 have the most purchases, it would be advantageous to focus on these categories by offering a wider range of products, improving product quality, or providing discounts.

6. Age Groups

Since the age group 26-35 does the most shopping, it would be wise to create marketing campaigns that target this age group.

7. Average Purchase Amount

Since Male customers have a higher average purchase amount than Female customers, it would be advantageous to analyze the factors contributing to this difference. This could involve conducting customer surveys or market research to understand the preferences and motivations of male customers, and shaping marketing strategies accordingly.