# **1. Data Collection**

**1.1- Importing File formats**

In [2]:
import pandas as pd

# Importing CSV file
df = pd.read_csv('/content/E-commerce.csv')

**1.21- Checking Data Types:**

In [3]:
data_types = df.dtypes
print(data_types)

Customer ID           int64
Age                   int64
Gender               object
Location             object
Annual Income         int64
Purchase History     object
Browsing History     object
Product Reviews      object
Time on Site        float64
dtype: object


**1.22- Checking Duplicates:**

In [4]:
duplicates = df.duplicated().sum()
print(f'Duplicates: {duplicates}')

Duplicates: 0
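
`df.duplicated()` compares entire rows, so a count of 0 only rules out identical rows. It does not show that each customer appears once. The sketch below checks for repeated keys, assuming `Customer ID` is meant to identify a customer. The small `toy` frame is hypothetical stand-in data, not the real file.

```python
import pandas as pd

# Hypothetical stand-in data: no two rows are identical,
# but one Customer ID appears more than once.
toy = pd.DataFrame({
    "Customer ID": [1001, 1001, 1002, 1003],
    "Age": [25, 28, 34, 41],
})

# Rows that are exact copies of an earlier row
full_row_duplicates = toy.duplicated().sum()

# Rows whose Customer ID already appeared in an earlier row
repeated_ids = toy.duplicated(subset=["Customer ID"]).sum()

# Number of rows recorded for each Customer ID
rows_per_customer = toy["Customer ID"].value_counts()

print(f"Full-row duplicates: {full_row_duplicates}")
print(f"Rows repeating an earlier Customer ID: {repeated_ids}")
print(rows_per_customer)
```

If the second count is non-zero on the real data, the rows may need to be aggregated per customer before customer-level statistics are computed.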


# **2. Data Exploration**

**2.1- Understanding the Structure**

**2.11- View the First Few Rows:**

In [5]:
df.head()

| | Customer ID | Age | Gender | Location | Annual Income | Purchase History | Browsing History | Product Reviews | Time on Site |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1001 | 25 | Female | City D | 45000 | [{"Date": "2022-03-05", "Category": "Clothing"... | [{"Timestamp": "2022-03-10T14:30:00Z"}, {"Time... | Great pair of jeans, very comfortable. Rating:... | 32.5 |
| 1 | 1001 | 28 | Female | City D | 52000 | [{"Product Category": "Clothing", "Purchase Da... | [{"Product Category": "Home & Garden", "Timest... | Great customer service! | 123.45 |
| 2 | 1001 | 28 | Female | City D | 65000 | [{"Product Category": "Electronics", "Purchase... | [{"Product Category": "Clothing", "Timestamp":... | Great electronics. The sound quality is excell... | 125.6 |
| 3 | 1001 | 45 | Female | City D | 70000 | {'Purchase Date': '2022-08-15', 'Product Categ... | {'Timestamp': '2022-09-03 14:30:00'} | {"Product 1": {"Rating": 4, "Review": "Great e... | 327.6 |
| 4 | 1002 | 34 | Male | City E | 45000 | {'Purchase Date': '2022-07-25', 'Product Categ... | {'Timestamp': '2022-08-10 17:15:00'} | {"Product 1": {"Rating": 3, "Review": "Good pr... | 214.9 |
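
The preview shows that `Purchase History`, `Browsing History`, and `Product Reviews` hold nested records stored as strings, in at least two notations: JSON (double quotes) and Python literals (single quotes). Some reviews are plain text. A minimal sketch of turning such a column into a numeric count follows. The helper names `parse_nested` and `count_records`, the `Purchase Count` column, and the sample strings are all hypothetical and not part of the original notebook.

```python
import ast
import json

import pandas as pd


def parse_nested(cell):
    """Parse a JSON or Python-literal string; return None if neither works."""
    try:
        return json.loads(cell)
    except (TypeError, ValueError):
        pass  # not valid JSON, fall through to the Python-literal parser
    try:
        return ast.literal_eval(cell)
    except (TypeError, ValueError, SyntaxError):
        return None


def count_records(cell):
    """Return the list length, 1 for a single dict, or 0 if unparseable."""
    parsed = parse_nested(cell)
    if isinstance(parsed, list):
        return len(parsed)
    if isinstance(parsed, dict):
        return 1
    return 0


# Hypothetical sample strings imitating the mixed formats seen in the preview
toy = pd.DataFrame({
    "Purchase History": [
        '[{"Date": "2020-01-01", "Category": "A"}, {"Date": "2020-02-01", "Category": "B"}]',
        "{'Purchase Date': '2020-03-01', 'Product Category': 'C'}",
        "free text that is not a record",
    ]
})

toy["Purchase Count"] = toy["Purchase History"].apply(count_records)
print(toy["Purchase Count"].tolist())
```

A numeric column like this can be placed on a plot axis or used in a correlation, which the raw strings cannot.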


**2.12- View DataFrame Information:**

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Customer ID       50 non-null     int64  
 1   Age               50 non-null     int64  
 2   Gender            50 non-null     object 
 3   Location          50 non-null     object 
 4   Annual Income     50 non-null     int64  
 5   Purchase History  50 non-null     object 
 6   Browsing History  50 non-null     object 
 7   Product Reviews   50 non-null     object 
 8   Time on Site      50 non-null     float64
dtypes: float64(1), int64(3), object(5)
memory usage: 3.6+ KB


**2.13- Summary Statistics:**

In [7]:
df.describe()

| | Customer ID | Age | Annual Income | Time on Site |
|---|---|---|---|---|
| count | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 1004.88 | 39.96 | 65780.0 | 232.597 |
| std | 3.623281 | 11.067437 | 17059.667198 | 109.669736 |
| min | 1001.0 | 24.0 | 40000.0 | 32.5 |
| 25% | 1002.0 | 30.25 | 50500.0 | 124.1 |
| 50% | 1004.0 | 37.0 | 65000.0 | 243.45 |
| 75% | 1007.75 | 48.0 | 80000.0 | 300.0 |
| max | 1013.0 | 65.0 | 100000.0 | 486.3 |


**2.14- View Column Names:**

In [8]:
df.columns

Index(['Customer ID', 'Age', 'Gender', 'Location', 'Annual Income',
       'Purchase History', 'Browsing History', 'Product Reviews',
       'Time on Site'],
      dtype='object')

**2.15- Shape of the Data:**

In [9]:
df.shape

(50, 9)

**2.2- Missing Values**

**2.21- Identifying Missing Values:**

In [10]:
missing_values = df.isnull().sum()
print(missing_values)

Customer ID         0
Age                 0
Gender              0
Location            0
Annual Income       0
Purchase History    0
Browsing History    0
Product Reviews     0
Time on Site        0
dtype: int64


**2.22- Dropping Rows or Columns with Missing Values:**

In [11]:
# Drop rows with any missing values
df_cleaned_rows = df.dropna()

# Drop columns with any missing values
df_cleaned_columns = df.dropna(axis=1)
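
Because the check in 2.21 found no missing values, both calls return the data unchanged here. To make the difference between the two visible, the sketch below runs them on a small hypothetical frame (`toy`) that does contain gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap in 'Age' and one in 'Time on Site'
toy = pd.DataFrame({
    "Age": [25, np.nan, 34],
    "Annual Income": [45000, 52000, 61000],
    "Time on Site": [32.5, 120.0, np.nan],
})

# axis=0 (default): drop every row that contains at least one missing value
rows_kept = toy.dropna()

# axis=1: drop every column that contains at least one missing value
cols_kept = toy.dropna(axis=1)

print(rows_kept.shape)        # rows x columns after dropping rows
print(list(cols_kept.columns))  # columns that survive
```

Dropping rows keeps all features but loses observations; dropping columns keeps all observations but loses features. With only 50 rows, either choice deserves a look at how much data it removes.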

# **3. Data Cleaning**

**3.1- Identifying Outliers using Z-Score:**

In [12]:
from scipy import stats

# Calculate Z-scores of the numeric feature columns.
# Customer ID is an identifier, not a measurement, so it is excluded.
numeric_cols = df.select_dtypes(include='number').columns.drop('Customer ID')
z_scores = stats.zscore(df[numeric_cols])

# Identify outliers: rows where any Z-score is > 3 or < -3
df_outliers = df[(abs(z_scores) > 3).any(axis=1)]

**3.2- Remove Outliers:**

In [13]:
# Keep rows whose Z-scores all lie within [-3, 3], the exact complement of the outlier rule
df_no_outliers = df[(abs(z_scores) <= 3).all(axis=1)]
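
To see the Z-score rule flag something, the sketch below applies it to hypothetical data with one extreme value. It computes the scores with pandas alone, using the population standard deviation (`ddof=0`), which is also the default of `scipy.stats.zscore`. The `toy` frame and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: 19 ordinary session lengths plus one extreme value
toy = pd.DataFrame({"Time on Site": [100.0 + i for i in range(19)] + [5000.0]})

numeric = toy.select_dtypes(include="number")

# Z-score per column: distance from the mean in population standard deviations
z = (numeric - numeric.mean()) / numeric.std(ddof=0)

# A row is an outlier if any of its Z-scores lies outside [-3, 3]
is_outlier = (z.abs() > 3).any(axis=1)

outliers = toy[is_outlier]
kept = toy[~is_outlier]

print(outliers)
print(len(kept))
```

With `ddof=0`, no Z-score in a sample of n rows can exceed sqrt(n - 1) in absolute value, so a threshold of 3 can only trigger when there are at least 11 rows. The 50-row dataset here is above that limit, but the rule is still fairly insensitive at this sample size.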

# **4. A More Detailed Look into Customer Behavior**

**What is the distribution of customer age?**

In [14]:
import plotly.express as px

fig = px.histogram(df_no_outliers, x='Age', nbins=20, title='Distribution of Customer Age')
fig.show()

**What is the distribution of customer gender?**

In [15]:
fig = px.pie(df_no_outliers, names='Gender', title='Customer Gender Distribution')
fig.show()

**What is the distribution of customer income levels?**

In [16]:
fig = px.histogram(df_no_outliers, x='Annual Income', nbins=20, title='Distribution of Annual Income')
fig.show()

**How does time spent on the site vary with age?**

In [17]:
fig = px.scatter(df_no_outliers, x='Age', y='Time on Site', title='Time on Site vs Age')
fig.show()

**What is the distribution of product reviews?**

In [18]:
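# 'Product Reviews' holds free text and JSON strings, so each distinct string is its own category on the x-axis
fig = px.histogram(df_no_outliers, x='Product Reviews', nbins=15, title='Distribution of Product Reviews', height=1300)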
fig.show()

**How do product reviews affect purchase history?**

In [19]:
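# Both columns are raw strings, so the heatmap counts how often each exact pair of strings occurs together
fig = px.density_heatmap(df_no_outliers, x='Product Reviews', y='Purchase History',
                         title='Relationship between Product Reviews and Purchase History', width=2500, height=1500)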
fig.show()

**What is the relationship between annual income and purchase history?**

In [20]:
fig = px.scatter(df_no_outliers, x='Annual Income', y='Purchase History', color='Gender',
                 title='Annual Income vs Purchase History', width=2500)
fig.show()

**How long do customers stay on the site relative to their purchase history?**

In [21]:
fig = px.scatter(df_no_outliers, x='Time on Site', y='Purchase History', color='Annual Income',
                 title='Time on Site vs Purchase History', width=2500)
fig.show()

**How does purchase behavior differ across customer segments (e.g., based on income or gender)?**

In [22]:
fig = px.histogram(df_no_outliers, x='Purchase History', color='Gender', marginal='rug', title='Purchase History by Gender', height=2500)
fig.show()

**How do customers in different locations vary in their time spent on the site?**

In [23]:
avg_time_on_site_by_location = df_no_outliers.groupby('Location')['Time on Site'].mean().reset_index()
fig = px.bar(avg_time_on_site_by_location, x='Location', y='Time on Site', title='Average Time on Site by Location')
fig.show()
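
With only 50 rows in total, a location's average may rest on very few observations. The sketch below reports the row count next to the mean so that thinly supported bars are easy to spot. The `toy` frame and the output column names `rows` and `mean_time` are hypothetical.

```python
import pandas as pd

# Hypothetical data: three rows for one location, a single row for another
toy = pd.DataFrame({
    "Location": ["City D", "City D", "City D", "City E"],
    "Time on Site": [30.0, 120.0, 150.0, 210.0],
})

# Named aggregation: one output column per keyword argument
summary = (
    toy.groupby("Location")["Time on Site"]
    .agg(rows="count", mean_time="mean")
    .reset_index()
)

print(summary)
```

The `rows` column could, for example, be passed to `px.bar` as the `text` argument so that each bar is labeled with its sample size.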

**What is the relationship between browsing history and annual income?**

In [24]:
fig = px.scatter(df_no_outliers, x='Annual Income', y='Browsing History', title='Browsing History vs Annual Income', width=1500)
fig.show()

**How does the number of reviews given affect purchase history?**

In [25]:
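# This plots the raw review strings, not a review count; a count would first have to be derived from 'Product Reviews'
fig = px.scatter(df_no_outliers, x='Product Reviews', y='Purchase History', title='Product Reviews vs Purchase History', width=2500, height=1500)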
fig.show()

**What is the relationship between time on site and browsing history?**

In [26]:
fig = px.scatter(df_no_outliers, x='Time on Site', y='Browsing History', title='Time on Site vs Browsing History', width=1500)
fig.show()

**What are the purchasing patterns across different age groups?**

In [27]:
fig = px.histogram(df_no_outliers, x='Age', y='Purchase History', color='Gender',
                   title='Purchase History Distribution by Age Group', width=2500)
fig.show()
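
Comparing age groups requires binning `Age` into groups first. A minimal sketch with `pd.cut` follows. The bin edges, the labels, and the numeric `Purchase Count` column are hypothetical: the original data stores purchases as strings, so such a count would have to be derived beforehand.

```python
import pandas as pd

# Hypothetical data; 'Purchase Count' stands in for a derived numeric column
toy = pd.DataFrame({
    "Age": [24, 29, 37, 48, 65],
    "Purchase Count": [2, 1, 3, 1, 4],
})

# Hypothetical bin edges chosen to span the 24-65 age range of the dataset.
# right=False makes each bin include its lower edge and exclude its upper edge.
bins = [20, 30, 40, 50, 70]
labels = ["20-29", "30-39", "40-49", "50-69"]
toy["Age Group"] = pd.cut(toy["Age"], bins=bins, labels=labels, right=False)

# Average purchase count per age group (only groups that actually occur)
purchases_by_group = toy.groupby("Age Group", observed=True)["Purchase Count"].mean()

print(purchases_by_group)
```

The resulting `Age Group` column is categorical and ordered by bin, so it can serve directly as the x-axis or color of a grouped chart.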

**What is the relationship between product reviews and time spent on the site?**

In [28]:
fig = px.scatter(df_no_outliers, x='Product Reviews', y='Time on Site', color='Gender',
                 title='Time on Site vs Product Reviews', height=1200)
fig.show()

**How does the distribution of time on site vary by location?**

In [29]:
fig = px.density_heatmap(df_no_outliers, x='Location', y='Time on Site', z='Time on Site',
                         title='Distribution of Time on Site by Location')
fig.show()

**What is the distribution of customers' annual income across different locations?**

In [30]:
fig = px.density_heatmap(df_no_outliers, x='Location', y='Annual Income', z='Annual Income',
                         title='Annual Income Distribution by Location')
fig.show()

**What is the interaction between product reviews, browsing history, and time on site?**

In [31]:
fig = px.scatter(df_no_outliers, x='Browsing History', y='Product Reviews', size='Time on Site', color='Age',
                 title='Interaction between Browsing History, Product Reviews, and Time on Site', height=1500, width=1500)
fig.show()

**Correlation between Age, Browsing History, and Annual Income**

In [32]:
fig = px.scatter(df_no_outliers, x='Age', y='Browsing History', size='Annual Income', color='Gender',
                 title='Age vs Browsing History with Annual Income as Bubble Size', width=1700)
fig.show()

**Analysis of Browsing History, Product Reviews, and Time on Site**

In [33]:
fig = px.scatter(df_no_outliers, x='Browsing History', y='Product Reviews', size='Time on Site', color='Gender',
                 title='Browsing History vs Product Reviews with Time on Site as Bubble Size, by Gender', height=1500, width=1700)
fig.show()

**Distribution of Time on Site by Age, Browsing History, and Gender**

In [34]:
fig = px.scatter(df_no_outliers, x='Age', y='Browsing History', size='Time on Site', color='Gender',
                 title='Age vs Browsing History with Time on Site as Bubble Size, by Gender', width=1700)
fig.show()

**Comparative Analysis of Purchase History and Annual Income by Location**

In [35]:
fig = px.scatter(df_no_outliers, x='Location', y='Purchase History', size='Annual Income', color='Gender',
                 title='Purchase History vs Location with Annual Income as Bubble Size', width=2700)
fig.show()