# Introduction


This notebook demonstrates how to inspect and transform the "Customer Shopping Trends Dataset" to prepare it for analysis. The dataset includes customer attributes such as age, gender, purchase details, and subscription status. The steps are: importing the dataset, inspecting its structure, checking for missing or duplicate data, and engineering new features for later analysis. The columns are described below.



📌 Customer ID - Unique identifier for each customer.

📌 Age - Age of the customer.

📌 Gender - Gender of the customer (Male/Female).

📌 Item Purchased - The item purchased by the customer.

📌 Category - Category of the item purchased.

📌 Purchase Amount (USD) - The amount of the purchase in USD.

📌 Location - Location where the purchase was made.

📌 Size - Size of the purchased item.

📌 Color - Color of the purchased item.

📌 Season - Season during which the purchase was made.

📌 Review Rating - Rating given by the customer for the purchased item.

📌 Subscription Status - Indicates if the customer has a subscription (Yes/No).

📌 Shipping Type - Type of shipping chosen by the customer.

📌 Discount Applied - Indicates if a discount was applied to the purchase (Yes/No).

📌 Promo Code Used - Indicates if a promo code was used for the purchase (Yes/No).

📌 Previous Purchases - Number of previous purchases made by the customer.

📌 Payment Method - Payment method used for the purchase.

📌 Preferred Payment Method - Customer's most preferred payment method.

📌 Frequency of Purchases - Frequency at which the customer makes purchases (e.g., Weekly, Fortnightly, Monthly).


# Import Libraries and the Kaggle Dataset

## Step 1: Import Required Libraries
We begin by importing the necessary Python libraries for data analysis:
- `pandas` for handling dataframes
- `numpy` for numerical operations


In [13]:
import pandas as pd
import numpy as np


## Step 2: Download the Dataset from Kaggle
The `kagglehub` library downloads the dataset directly from Kaggle and returns the local directory that holds the files.

In [14]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("iamsouravbanerjee/customer-shopping-trends-dataset")

print("Path to dataset files:", path)

Path to dataset files: /kaggle/input/customer-shopping-trends-dataset
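
Because the download step returns a directory rather than a single file, it can help to list the CSV files in that directory before hard-coding a file name. The sketch below uses only the standard library; `list_csv_files` is a hypothetical helper (not part of `kagglehub`), and a temporary directory stands in for the real download path.

```python
import tempfile
from pathlib import Path


def list_csv_files(directory):
    """Return the sorted names of the CSV files directly inside `directory`."""
    return sorted(
        p.name for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == ".csv"
    )


# Stand-in for the download directory; in the notebook this would be `path`.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "shopping_trends.csv").write_text("Customer ID,Age\n1,55\n")
    (Path(tmp) / "notes.txt").write_text("not a dataset\n")
    found = list_csv_files(tmp)

print(found)
```

In the notebook itself, calling `list_csv_files(path)` would show which file names are available to pass to `pd.read_csv`.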


# Initial Data Overview

## Step 3: Load the Dataset
The dataset is loaded into a pandas DataFrame for further processing.

In [15]:
data = pd.read_csv(path + "/shopping_trends.csv")
data.sample(5)

|  | Customer ID | Age | Gender | Item Purchased | Category | Purchase Amount (USD) | Location | Size | Color | Season | Review Rating | Subscription Status | Payment Method | Shipping Type | Discount Applied | Promo Code Used | Previous Purchases | Preferred Payment Method | Frequency of Purchases |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3581 | 3582 | 26 | Female | Dress | Clothing | 95 | Georgia | M | Orange | Fall | 4.9 | No | Venmo | Next Day Air | No | No | 50 | Credit Card | Weekly |
| 565 | 566 | 59 | Male | Backpack | Accessories | 43 | Idaho | L | Lavender | Fall | 4.2 | Yes | Bank Transfer | Store Pickup | Yes | Yes | 10 | Venmo | Weekly |
| 2045 | 2046 | 31 | Male | Jewelry | Accessories | 20 | Georgia | M | Red | Fall | 4.4 | No | PayPal | Express | No | No | 26 | PayPal | Quarterly |
| 1902 | 1903 | 31 | Male | Sandals | Footwear | 22 | New Mexico | S | Olive | Spring | 5.0 | No | Bank Transfer | Store Pickup | No | No | 21 | Venmo | Monthly |
| 2917 | 2918 | 58 | Female | Shirt | Clothing | 45 | Colorado | XL | White | Winter | 2.9 | No | Cash | Store Pickup | No | No | 50 | Venmo | Annually |


## Step 4: Inspect the Dataset
We inspect the dataset to understand its structure and identify potential issues.

### Using `.shape` to see the number of rows and columns

In [16]:
data.shape

(3900, 19)

### Using `.columns` to view column names

In [17]:
data.columns

Index(['Customer ID', 'Age', 'Gender', 'Item Purchased', 'Category',
       'Purchase Amount (USD)', 'Location', 'Size', 'Color', 'Season',
       'Review Rating', 'Subscription Status', 'Payment Method',
       'Shipping Type', 'Discount Applied', 'Promo Code Used',
       'Previous Purchases', 'Preferred Payment Method',
       'Frequency of Purchases'],
      dtype='object')
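
The feature-engineering step later in this notebook reads six of these columns, so a quick guard can confirm they are present before any new columns are derived. This is a minimal sketch on a toy frame; `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names introduced here for illustration.

```python
import pandas as pd

# Columns read by the notebook's feature-engineering cell.
REQUIRED_COLUMNS = [
    "Customer ID", "Age", "Purchase Amount (USD)",
    "Discount Applied", "Subscription Status", "Previous Purchases",
]


def missing_columns(df, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `df`."""
    return [col for col in required if col not in df.columns]


# Toy frame that lacks two of the required columns.
toy = pd.DataFrame({
    "Customer ID": [1],
    "Age": [55],
    "Purchase Amount (USD)": [53],
    "Discount Applied": ["Yes"],
})

absent = missing_columns(toy)
print(absent)
```

On the real dataset, `missing_columns(data)` would be expected to return an empty list, given the column names shown above.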

### Using `.info()` to get dataset information

In [18]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3900 entries, 0 to 3899
Data columns (total 19 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Customer ID               3900 non-null   int64  
 1   Age                       3900 non-null   int64  
 2   Gender                    3900 non-null   object 
 3   Item Purchased            3900 non-null   object 
 4   Category                  3900 non-null   object 
 5   Purchase Amount (USD)     3900 non-null   int64  
 6   Location                  3900 non-null   object 
 7   Size                      3900 non-null   object 
 8   Color                     3900 non-null   object 
 9   Season                    3900 non-null   object 
 10  Review Rating             3900 non-null   float64
 11  Subscription Status       3900 non-null   object 
 12  Payment Method            3900 non-null   object 
 13  Shipping Type             3900 non-null   object 
 14  Discount

### Using `.describe()` to view summary statistics

In [19]:
data.describe()

|  | Customer ID | Age | Purchase Amount (USD) | Review Rating | Previous Purchases |
| --- | --- | --- | --- | --- | --- |
| count | 3900.0 | 3900.0 | 3900.0 | 3900.0 | 3900.0 |
| mean | 1950.5 | 44.068462 | 59.764359 | 3.749949 | 25.351538 |
| std | 1125.977353 | 15.207589 | 23.685392 | 0.716223 | 14.447125 |
| min | 1.0 | 18.0 | 20.0 | 2.5 | 1.0 |
| 25% | 975.75 | 31.0 | 39.0 | 3.1 | 13.0 |
| 50% | 1950.5 | 44.0 | 60.0 | 3.7 | 25.0 |
| 75% | 2925.25 | 57.0 | 81.0 | 4.4 | 38.0 |
| max | 3900.0 | 70.0 | 100.0 | 5.0 | 50.0 |
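
By default, `.describe()` on a mixed DataFrame summarises only the numeric columns, which is why text columns such as Gender and Season do not appear above. A minimal sketch of summarising the non-numeric columns instead, using a small toy frame rather than the real dataset:

```python
import pandas as pd

toy = pd.DataFrame({
    "Age": [26, 59, 31, 31],
    "Gender": ["Female", "Male", "Male", "Male"],
    "Season": ["Fall", "Fall", "Fall", "Spring"],
})

# Exclude numeric columns so only the text columns are summarised.
# The result has the rows count, unique, top (most frequent value), and freq.
text_summary = toy.describe(exclude=["number"])
print(text_summary)
```

The same call on the notebook's `data` would give the number of distinct values and the most common value for each text column.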


### Using `.isnull().sum()` to check for missing values

In [20]:
missing_values = data.isnull().sum()
missing_values

|  | 0 |
| --- | --- |
| Customer ID | 0 |
| Age | 0 |
| Gender | 0 |
| Item Purchased | 0 |
| Category | 0 |
| Purchase Amount (USD) | 0 |
| Location | 0 |
| Size | 0 |
| Color | 0 |
| Season | 0 |


### Using `.duplicated().sum()` to check for duplicate rows

In [21]:
duplicated_rows = data.duplicated().sum()
duplicated_rows

np.int64(0)
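
The checks above report no duplicate rows and no missing values in the columns shown, so this dataset needs no clean-up at this point. For a dataset that does contain such problems, one possible clean-up is sketched below on a toy frame; dropping rows is only one option, and it is an assumption here that losing those rows is acceptable.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value and one exact duplicate row.
toy = pd.DataFrame({
    "Customer ID": [1, 2, 2, 3],
    "Age": [55, 19, 19, np.nan],
    "Gender": ["Male", "Female", "Female", "Male"],
})

total_missing = int(toy.isnull().sum().sum())   # missing cells across all columns
total_duplicates = int(toy.duplicated().sum())  # rows identical to an earlier row
print(total_missing, total_duplicates)

# Drop exact duplicates first, then drop any row that still has a missing value.
cleaned = toy.drop_duplicates().dropna().reset_index(drop=True)
print(cleaned)
```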

# Step 5: Feature Engineering

New features are created to enhance the dataset's analytical value.




*  Age Group: Categorize customers into age groups: `Teen`, `Young Adult`, `Adult`, `Middle Age`, and `Senior`.
*  Price Category: Categorize each purchase by amount in USD: `Low` (up to 20), `Medium` (over 20 to 50), `High` (over 50 to 80), and `Premium` (over 80 to 100).
*  Discount Flag: Create a binary flag that is 1 if a discount was applied and 0 otherwise.
*  Profit Margin: Calculate profit as a flat 30% of the purchase amount.
*  Loyalty Score: Combine previous purchases and subscription status: the number of previous purchases, plus 10 if the customer has a subscription and plus 0 otherwise.
*  AVG Spending: Calculate each customer's average purchase amount.
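
`pd.cut` treats each bin as right-inclusive by default, which decides where boundary values land. The sketch below applies the same bin edges and labels as this notebook's feature-engineering cell to a few hand-picked boundary values; the sample values themselves are chosen only for illustration.

```python
import pandas as pd

ages = pd.Series([18, 19, 35, 36, 50, 65, 70])
age_group = pd.cut(
    ages,
    bins=[0, 18, 35, 50, 65, 100],
    labels=["Teen", "Young Adult", "Adult", "Middle Age", "Senior"],
)

amounts = pd.Series([20, 21, 50, 80, 100])
price_category = pd.cut(
    amounts,
    bins=[0, 20, 50, 80, 100],
    labels=["Low", "Medium", "High", "Premium"],
)

# Each upper edge belongs to the lower bin: 18 is 'Teen', 20 USD is 'Low'.
print(pd.DataFrame({"Age": ages, "Age Group": age_group}))
print(pd.DataFrame({"Amount": amounts, "Price Category": price_category}))
```

Since the youngest customer in the summary statistics is 18, only customers aged exactly 18 can fall into `Teen` with these edges.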

In [23]:
# 1. Age Group: bins are right-inclusive, so age 18 is 'Teen' and ages 19-35 are 'Young Adult'
data['Age Group'] = pd.cut(data['Age'], bins=[0, 18, 35, 50, 65, 100],
                         labels=['Teen', 'Young Adult', 'Adult', 'Middle Age', 'Senior'])

# 2. Price Category: bins are right-inclusive, so exactly 20 USD is 'Low'
data['Price Category'] = pd.cut(data['Purchase Amount (USD)'],
                                 bins=[0, 20, 50, 80, 100],
                                 labels=['Low', 'Medium', 'High', 'Premium'])

# 3. Discount Flag: 1 if a discount was applied, otherwise 0
data['Has Discount'] = data['Discount Applied'].apply(lambda x: 1 if x == 'Yes' else 0)

# 4. Profit Margin: 30% of Purchase Amount
data['Profit Margin (USD)'] = data['Purchase Amount (USD)'] * 0.3

# 5. Loyalty Score (from Previous Purchases and Subscription Status): +10 for subscribers
data['Loyalty Score'] = data['Previous Purchases'] + data['Subscription Status'].map({'Yes': 10, 'No': 0})

# 6. AVG Spending: mean purchase amount per Customer ID
#    (equals the row's own amount whenever a Customer ID appears only once)
data['AVG Spending'] = data.groupby('Customer ID')['Purchase Amount (USD)'].transform('mean')
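
One detail of the loyalty score is worth checking: `.map` with a dictionary returns `NaN` for any value that is not a key, so an unexpected spelling of the subscription status would make the score `NaN`. The sketch below shows a more defensive variant on a toy frame, together with a vectorised form of the discount flag. Treating unknown statuses as a bonus of 0 is an assumption made for this example, not something the dataset requires.

```python
import pandas as pd

toy = pd.DataFrame({
    "Discount Applied": ["Yes", "No", "Yes"],
    "Subscription Status": ["Yes", "No", "yes"],  # note the lowercase value
    "Previous Purchases": [14, 2, 23],
})

# Vectorised flag: the boolean comparison is converted to 1/0.
toy["Has Discount"] = (toy["Discount Applied"] == "Yes").astype(int)

# "yes" is not a key in the dictionary, so .map yields NaN for that row;
# fillna(0) keeps the score numeric and complete.
bonus = toy["Subscription Status"].map({"Yes": 10, "No": 0})
toy["Loyalty Score"] = toy["Previous Purchases"] + bonus.fillna(0).astype(int)

print(toy[["Has Discount", "Loyalty Score"]])
```

If inconsistent spellings are expected, normalising the text before mapping may be preferable to silently assigning 0.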

# Step 6: Export the Data to a CSV File and Download It

In [24]:
data.to_csv("shopping_trends_feature_engineered.csv", index=False)
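
CSV files store plain text, so the categorical dtype that `pd.cut` produces for columns such as Age Group is not preserved when the file is read back. The sketch below illustrates this with a toy frame and a temporary file (the file name `toy_features.csv` is only an example), and shows one way to restore the ordered categories after reloading.

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({"Age": [18, 40, 70]})
toy["Age Group"] = pd.cut(
    toy["Age"],
    bins=[0, 18, 35, 50, 65, 100],
    labels=["Teen", "Young Adult", "Adult", "Middle Age", "Senior"],
)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "toy_features.csv"
    toy.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

# The values survive the round trip, but the categorical dtype does not.
original_dtype = str(toy["Age Group"].dtype)
reloaded_dtype = str(reloaded["Age Group"].dtype)
print(original_dtype, reloaded_dtype)

# Restore the ordered categories if later steps rely on them.
reloaded["Age Group"] = pd.Categorical(
    reloaded["Age Group"],
    categories=toy["Age Group"].cat.categories,
    ordered=True,
)
```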

In [25]:
from google.colab import files
files.download('shopping_trends_feature_engineered.csv')


# Summary
In this notebook, we:
1. Imported the necessary libraries and downloaded the dataset from Kaggle.
2. Loaded the dataset into a pandas DataFrame for exploration.
3. Inspected the dataset for missing values, duplicates, and overall structure.
4. Engineered new features: age groups, price categories, discount flags, profit margins, loyalty scores, and average spending.
5. Saved the transformed dataset as a CSV file for further analysis.

After these steps, the dataset is ready for subsequent data analysis or machine learning tasks.