> ## PROBLEM STATEMENT
> The management team of the company wants to analyze the customer purchase behavior (specifically, purchase amount) against the customer’s gender and the various other factors to help the business make better decisions.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

In [3]:
df = pd.read_csv('retail_data.csv')
df.head()

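| | User_ID | Product_ID | Gender | Age | Occupation | City_Category | Stay_In_Current_City_Years | Marital_Status | Product_Category | Purchase |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000001 | P00069042 | F | 0-17 | 10 | A | 2 | 0 | 3 | 8370 |
| 1 | 1000001 | P00248942 | F | 0-17 | 10 | A | 2 | 0 | 1 | 15200 |
| 2 | 1000001 | P00087842 | F | 0-17 | 10 | A | 2 | 0 | 12 | 1422 |
| 3 | 1000001 | P00085442 | F | 0-17 | 10 | A | 2 | 0 | 12 | 1057 |
| 4 | 1000002 | P00285442 | M | 55+ | 16 | C | 4+ | 0 | 8 | 7969 |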


> ## BASIC OBSERVATION

In [4]:
df.shape

(550068, 10)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 550068 entries, 0 to 550067
Data columns (total 10 columns):
 #   Column                      Non-Null Count   Dtype 
---  ------                      --------------   ----- 
 0   User_ID                     550068 non-null  int64 
 1   Product_ID                  550068 non-null  object
 2   Gender                      550068 non-null  object
 3   Age                         550068 non-null  object
 4   Occupation                  550068 non-null  int64 
 5   City_Category               550068 non-null  object
 6   Stay_In_Current_City_Years  550068 non-null  object
 7   Marital_Status              550068 non-null  int64 
 8   Product_Category            550068 non-null  int64 
 9   Purchase                    550068 non-null  int64 
dtypes: int64(5), object(5)
memory usage: 42.0+ MB


In [6]:
df.describe()

| | User_ID | Occupation | Marital_Status | Product_Category | Purchase |
|---|---|---|---|---|---|
| count | 550068.0 | 550068.0 | 550068.0 | 550068.0 | 550068.0 |
| mean | 1003029.0 | 8.076707 | 0.409653 | 5.40427 | 9263.968713 |
| std | 1727.592 | 6.52266 | 0.49177 | 3.936211 | 5023.065394 |
| min | 1000001.0 | 0.0 | 0.0 | 1.0 | 12.0 |
| 25% | 1001516.0 | 2.0 | 0.0 | 1.0 | 5823.0 |
| 50% | 1003077.0 | 7.0 | 0.0 | 5.0 | 8047.0 |
| 75% | 1004478.0 | 14.0 | 1.0 | 8.0 | 12054.0 |
| max | 1006040.0 | 20.0 | 1.0 | 20.0 | 23961.0 |


In [7]:
df.describe(include=object)

| | Product_ID | Gender | Age | City_Category | Stay_In_Current_City_Years |
|---|---|---|---|---|---|
| count | 550068 | 550068 | 550068 | 550068 | 550068 |
| unique | 3631 | 2 | 7 | 3 | 5 |
| top | P00265242 | M | 26-35 | B | 1 |
| freq | 1880 | 414259 | 219587 | 231173 | 193821 |


In [8]:
5023 / 9263  # std dev divided by mean (from describe()): above 50% of the mean

0.542264924970312

* `describe()` shows that the Purchase column has a very wide spread, from 12.0 up to 23961.0. The standard deviation is 5023 against a mean of 9263, i.e. more than 50% of the mean, so it is worth checking for outliers.
* `info()` reports 550068 rows in total and 550068 non-null values in every column, so there are no null values.
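
Both observations can also be checked in code rather than read off the printed output. The following is a minimal sketch that assumes a small made-up frame, `toy`, in place of `df` so that it runs on its own; the values are illustrative and not taken from `retail_data.csv`.

```python
import pandas as pd

# Hypothetical stand-in for df; values are illustrative only
toy = pd.DataFrame({
    'Gender': ['F', 'M', 'M', 'F', 'M'],
    'Purchase': [8370, 15200, 1422, 1057, 7969],
})

# Coefficient of variation: standard deviation relative to the mean
cv = toy['Purchase'].std() / toy['Purchase'].mean()

# Missing values per column; all zeros means there are no nulls
null_counts = toy.isnull().sum()

print(f"coefficient of variation: {cv:.3f}")
print(null_counts)
```

On the real data the same two expressions would be applied to `df` instead of `toy`.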

Now let's look at the outliers. We assume that values more than 1.5 × IQR above the 75th percentile, or more than 1.5 × IQR below the 25th percentile, are outliers.

In [9]:
IQR_Purchase = 12054.0 - 5823.0  # Q3 - Q1, both taken from df.describe()
outlier_mask = (df['Purchase'] >= (12054.0 + 1.5 * IQR_Purchase)) | (df['Purchase'] < (5823.0 - 1.5 * IQR_Purchase))
df.loc[outlier_mask, 'Purchase'].sort_values(ascending=True)

195524    21401
38050     21401
242742    21401
30222     21402
354885    21404
          ...  
292083    23960
321782    23960
93016     23961
370891    23961
87440     23961
Name: Purchase, Length: 2677, dtype: int64

In [10]:
outlier_fraction = 2677 / df.shape[0]  # share of rows flagged as outliers (a fraction, not a percentage)
outlier_fraction

0.004866671029763593

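So outliers make up just under 0.5% of the data (2677 of 550068 rows). All of them are at the high end, since the lower bound of 5823.0 - 1.5 × 6231.0 is negative. These high-value purchases stretch the upper tail of the distribution.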
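
The quartiles in the outlier check were typed in by hand from `describe()`. A minimal sketch of deriving the same kind of bounds from the data with `Series.quantile` is shown here; the `purchases` series is a hypothetical stand-in with illustrative values, and strict inequalities are used at both bounds as a simplifying assumption.

```python
import pandas as pd

# Hypothetical purchase amounts; illustrative only
purchases = pd.Series(
    [12, 5823, 6000, 8047, 9000, 12054, 12500, 23961], name='Purchase'
)

# Quartiles and interquartile range computed from the data
q1 = purchases.quantile(0.25)
q3 = purchases.quantile(0.75)
iqr = q3 - q1

# 1.5 * IQR rule: anything beyond these bounds is treated as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = purchases[(purchases < lower_fence) | (purchases > upper_fence)]
outlier_fraction = len(outliers) / len(purchases)

print(lower_fence, upper_fence)
print(outliers.tolist(), outlier_fraction)
```

Computing the quartiles this way keeps the bounds in sync with the data if rows are filtered or the file changes.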

> ## UNIQUE ATTRIBUTES

In [11]:
df.columns

Index(['User_ID', 'Product_ID', 'Gender', 'Age', 'Occupation', 'City_Category',
       'Stay_In_Current_City_Years', 'Marital_Status', 'Product_Category',
       'Purchase'],
      dtype='object')

In [12]:
df_columns = df.columns
unique_values_no = [df[column].nunique() for column in df.columns]
uniques_df = pd.DataFrame({'column':df_columns,
                    'unique_values_no':unique_values_no})
uniques_df.set_index('column')

| column | unique_values_no |
|---|---|
| User_ID | 5891 |
| Product_ID | 3631 |
| Gender | 2 |
| Age | 7 |
| Occupation | 21 |
| City_Category | 3 |
| Stay_In_Current_City_Years | 5 |
| Marital_Status | 2 |
| Product_Category | 20 |
| Purchase | 18105 |


Gender, City_Category and Marital_Status have only 2 or 3 unique values each, so converting them to categorical variables is more memory efficient.

In [13]:
cat_var = ['Gender', 'City_Category', 'Marital_Status']
for var in cat_var:
    df[var] = df[var].astype('category')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 550068 entries, 0 to 550067
Data columns (total 10 columns):
 #   Column                      Non-Null Count   Dtype   
---  ------                      --------------   -----   
 0   User_ID                     550068 non-null  int64   
 1   Product_ID                  550068 non-null  object  
 2   Gender                      550068 non-null  category
 3   Age                         550068 non-null  object  
 4   Occupation                  550068 non-null  int64   
 5   City_Category               550068 non-null  category
 6   Stay_In_Current_City_Years  550068 non-null  object  
 7   Marital_Status              550068 non-null  category
 8   Product_Category            550068 non-null  int64   
 9   Purchase                    550068 non-null  int64   
dtypes: category(3), int64(4), object(3)
memory usage: 31.0+ MB
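
The `+` in the `info()` memory figure means the size of object columns is only a lower bound, because the string contents are not counted. To see the effect of the conversion on a single column, `memory_usage(deep=True)` can be compared before and after. This is a minimal sketch on a hypothetical frame, `toy`, not on the retail data.

```python
import pandas as pd

# Hypothetical column with three distinct values repeated many times
toy = pd.DataFrame({'City_Category': ['A', 'B', 'C'] * 10000})

# deep=True also counts the memory held by the string values themselves
bytes_as_object = toy['City_Category'].memory_usage(deep=True)
bytes_as_category = toy['City_Category'].astype('category').memory_usage(deep=True)

print(bytes_as_object, bytes_as_category)
```

A categorical column stores each distinct label once plus a small integer code per row, which is why the saving grows with the number of repeated values.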


In [16]:
df['Gender'].value_counts() / len(df)

M    0.753105
F    0.246895
Name: Gender, dtype: float64

So 75.3% of the rows (purchases) come from male customers and 24.7% from female customers. This dataset is a sample; the population is stated to contain 50 million male and 50 million female customers. We will therefore extrapolate from this sample analysis to the population.
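
One common way to extrapolate from a sample mean to a population mean is a confidence interval based on the central limit theorem. The following is a minimal sketch under stated assumptions, not necessarily the method used later in this analysis: the frame `toy` holds synthetic purchase amounts, `mean_confidence_interval` is a hypothetical helper, and rows are treated as independent draws (a simplification, since the real data has many rows per user).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical sample: synthetic purchase amounts, not the retail data
toy = pd.DataFrame({
    'Gender': ['M'] * 3000 + ['F'] * 1000,
    'Purchase': rng.gamma(shape=3.0, scale=3000.0, size=4000),
})

def mean_confidence_interval(values, z=1.96):
    """Approximate 95% interval for the population mean (normal approximation)."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    std_error = values.std(ddof=1) / np.sqrt(len(values))
    return mean - z * std_error, mean + z * std_error

# One interval per gender, computed from that gender's rows only
intervals = {
    gender: mean_confidence_interval(group['Purchase'])
    for gender, group in toy.groupby('Gender')
}

for gender, (low, high) in intervals.items():
    print(f"{gender}: {low:.0f} to {high:.0f}")
```

The interval narrows as the sample size grows, so the group with fewer rows gets the wider interval.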

In [17]:
df_male = df.loc[df['Gender'] == 'M']
df_female = df.loc[df['Gender'] == 'F']

In [19]:
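sample_means = {}
for sample_size in range(10000,50001):
    # keep the mean of each random sample instead of overwriting it on every pass
    sample_means[sample_size] = df_male['Purchase'].sample(sample_size).mean()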


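| | User_ID | Product_ID | Gender | Age | Occupation | City_Category | Stay_In_Current_City_Years | Marital_Status | Product_Category | Purchase |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000001 | P00069042 | F | 0-17 | 10 | A | 2 | 0 | 3 | 8370 |
| 1 | 1000001 | P00248942 | F | 0-17 | 10 | A | 2 | 0 | 1 | 15200 |
| 2 | 1000001 | P00087842 | F | 0-17 | 10 | A | 2 | 0 | 12 | 1422 |
| 3 | 1000001 | P00085442 | F | 0-17 | 10 | A | 2 | 0 | 12 | 1057 |
| 14 | 1000006 | P00231342 | F | 51-55 | 9 | A | 1 | 0 | 5 | 5378 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 550061 | 1006029 | P00372445 | F | 26-35 | 1 | C | 1 | 1 | 20 | 599 |
| 550064 | 1006035 | P00375436 | F | 26-35 | 1 | C | 3 | 0 | 20 | 371 |
| 550065 | 1006036 | P00375436 | F | 26-35 | 15 | B | 4+ | 1 | 20 | 137 |
| 550066 | 1006038 | P00375436 | F | 55+ | 1 | C | 2 | 0 | 20 | 365 |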
