### INDUSTRY-STANDARD END-TO-END DATA ANALYTICS PORTFOLIO PROJECT

#### PROBLEM STATEMENT:

A leading retail company wants to better understand its customers' shopping behavior in order to improve sales, customer satisfaction, and long-term loyalty. The management team has noticed changes in purchasing patterns across demographics, product categories, and sales channels (online vs. offline). They are particularly interested in uncovering which factors, such as discounts, reviews, seasons, or payment preferences, drive customer decisions and repeat purchases.

You are tasked with analyzing the company's consumer behavior dataset to answer the following overarching business question:

"How can the company leverage consumer shopping data to identify trends, improve customer engagement, and optimize marketing and product strategies?"

In [4]:
import pandas as pd

df = pd.read_excel('customer_shopping_behavior.xlsx')
df.head()

# EXPLANATION:

# Here we use the pandas library to read an Excel file named 'customer_shopping_behavior.xlsx'.
# The read_excel function loads the data into a DataFrame, pandas' tabular data structure for manipulation and analysis.
# df.head() then displays the first five rows of the DataFrame, so we can quickly inspect the data that has been loaded.
# This is useful for understanding the structure and content of the dataset before performing further analysis or processing.

Unnamed: 0,Customer ID,Age,Gender,Item Purchased,Category,Purchase Amount (USD),Location,Size,Color,Season,Review Rating,Subscription Status,Shipping Type,Discount Applied,Promo Code Used,Previous Purchases,Payment Method,Frequency of Purchases
0,1,55,Male,Blouse,Clothing,53,Kentucky,L,Gray,Winter,3.1,Yes,Express,Yes,Yes,14,Venmo,Fortnightly
1,2,19,Male,Sweater,Clothing,64,Maine,L,Maroon,Winter,3.1,Yes,Express,Yes,Yes,2,Cash,Fortnightly
2,3,50,Male,Jeans,Clothing,73,Massachusetts,S,Maroon,Spring,3.1,Yes,Free Shipping,Yes,Yes,23,Credit Card,Weekly
3,4,21,Male,Sandals,Footwear,90,Rhode Island,M,Maroon,Spring,3.5,Yes,Next Day Air,Yes,Yes,49,PayPal,Weekly
4,5,45,Male,Blouse,Clothing,49,Oregon,M,Turquoise,Spring,2.7,Yes,Free Shipping,Yes,Yes,31,PayPal,Annually


In [5]:
# We need to know the overall structure of the data

df.info()

# Here we use the info() method of the DataFrame to get a concise summary of the DataFrame.
# This summary includes the number of non-null entries in each column, the data type of each column, and the memory usage of the DataFrame.
# This information is crucial for understanding the dataset's structure, identifying any missing values, and determining the appropriate data types for analysis.
# It helps in making informed decisions about data cleaning and preprocessing steps that may be necessary before further analysis.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3900 entries, 0 to 3899
Data columns (total 18 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Customer ID             3900 non-null   int64  
 1   Age                     3900 non-null   int64  
 2   Gender                  3900 non-null   object 
 3   Item Purchased          3900 non-null   object 
 4   Category                3900 non-null   object 
 5   Purchase Amount (USD)   3900 non-null   int64  
 6   Location                3900 non-null   object 
 7   Size                    3900 non-null   object 
 8   Color                   3900 non-null   object 
 9   Season                  3900 non-null   object 
 10  Review Rating           3863 non-null   float64
 11  Subscription Status     3900 non-null   object 
 12  Shipping Type           3900 non-null   object 
 13  Discount Applied        3900 non-null   object 
 14  Promo Code Used         3900 non-null   

In [7]:
df.describe(include='all')

# The describe() method generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset's distribution, excluding NaN values.
# For each numerical column it reports the count, mean, standard deviation, minimum and maximum values, and
# the quartiles (25th, 50th, and 75th percentiles).
# Because include='all' is passed, non-numeric columns are summarized as well, with count, unique (number of distinct values), top (most frequent value), and freq (how often the top value occurs).
# This is useful for quickly understanding the distribution and spread of the data, identifying potential outliers, and gaining insights into the overall characteristics of the dataset.

Unnamed: 0,Customer ID,Age,Gender,Item Purchased,Category,Purchase Amount (USD),Location,Size,Color,Season,Review Rating,Subscription Status,Shipping Type,Discount Applied,Promo Code Used,Previous Purchases,Payment Method,Frequency of Purchases
count,3900.0,3900.0,3900,3900,3900,3900.0,3900,3900,3900,3900,3863.0,3900,3900,3900,3900,3900.0,3900,3900
unique,,,2,25,4,,50,4,25,4,,2,6,2,2,,6,7
top,,,Male,Blouse,Clothing,,Montana,M,Olive,Spring,,No,Free Shipping,No,No,,PayPal,Every 3 Months
freq,,,2652,171,1737,,96,1755,177,999,,2847,675,2223,2223,,677,584
mean,1950.5,44.068462,,,,59.764359,,,,,3.750065,,,,,25.351538,,
std,1125.977353,15.207589,,,,23.685392,,,,,0.716983,,,,,14.447125,,
min,1.0,18.0,,,,20.0,,,,,2.5,,,,,1.0,,
25%,975.75,31.0,,,,39.0,,,,,3.1,,,,,13.0,,
50%,1950.5,44.0,,,,60.0,,,,,3.8,,,,,25.0,,
75%,2925.25,57.0,,,,81.0,,,,,4.4,,,,,38.0,,
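
The combined table above leaves many cells empty because numeric and categorical statistics do not apply to the same columns. A minimal sketch of one way to read the two kinds of summary separately is shown below. The `sample` DataFrame is a small hypothetical stand-in with made-up values, not the project's dataset; in the notebook the same two calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical toy data standing in for the real DataFrame (values are illustrative only).
sample = pd.DataFrame({
    "Age": [55, 19, 50, 21],
    "Category": ["Clothing", "Clothing", "Clothing", "Footwear"],
    "Review Rating": [3.1, float("nan"), 3.1, 3.5],
})

# Numeric columns only: count, mean, std, min, quartiles, max.
numeric_summary = sample.describe(include="number")

# Everything that is not numeric: count, unique, top, freq.
categorical_summary = sample.describe(exclude="number")

print(numeric_summary)
print(categorical_summary)
```

Splitting the summary this way keeps each table free of blank cells, and the `count` row of the numeric table makes a column with missing values stand out because its count is lower than the others.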


In [9]:
df.isnull().sum()

# EXPLANATION:

# Here we use the isnull() method to identify missing values in the DataFrame.
# The sum() function is then applied to count the total number of missing values in each column.
# This is important for data cleaning and preprocessing, as it helps us understand the extent of missing data in the dataset.

Customer ID                0
Age                        0
Gender                     0
Item Purchased             0
Category                   0
Purchase Amount (USD)      0
Location                   0
Size                       0
Color                      0
Season                     0
Review Rating             37
Subscription Status        0
Shipping Type              0
Discount Applied           0
Promo Code Used            0
Previous Purchases         0
Payment Method             0
Frequency of Purchases     0
dtype: int64

#### EXPLANATION

Here in the output, we can see that 17 of the 18 columns have no missing values.
The exception is Review Rating, which has 37 missing entries out of 3,900 rows (just under 1% of the data).
This agrees with the df.info() summary, where Review Rating reports 3863 non-null values while every other listed column reports 3900.
Because the gap is small and confined to a single numeric column, it can be handled with a simple strategy, such as dropping the affected rows or imputing a typical value like the median.
Whichever option is chosen, it should be applied before computing rating-based statistics, since handling missing data improperly can introduce bias or inaccuracies.
Once Review Rating has been addressed, the dataset is complete and ready for exploration and modeling.
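
The isnull().sum() output above lists 37 missing values in the Review Rating column. The sketch below shows one common way to handle such a gap: measure the share of missing ratings, then fill each gap with the median rating of the same product category. This is an illustrative option, not necessarily the approach this project takes. The `sample` DataFrame and its values are hypothetical, and grouping by `Category` is an assumption about which grouping is most meaningful.

```python
import pandas as pd

# Hypothetical toy data standing in for the real DataFrame (values are illustrative only).
sample = pd.DataFrame({
    "Category": ["Clothing", "Clothing", "Clothing", "Footwear", "Footwear"],
    "Review Rating": [3.0, float("nan"), 4.0, 3.5, float("nan")],
})

# Percentage of rows with a missing rating.
missing_share = sample["Review Rating"].isnull().mean() * 100

# Median rating of each row's category, aligned to the original index.
category_median = sample.groupby("Category")["Review Rating"].transform("median")

# Fill only the missing ratings; existing ratings are left unchanged.
sample["Review Rating"] = sample["Review Rating"].fillna(category_median)

print(f"Missing before imputation: {missing_share:.1f}%")
print(sample)
```

The median is used here rather than the mean because it is less sensitive to extreme ratings. Rerunning `isnull().sum()` after the fill is a quick check that no gaps remain.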
