# **Table of Contents**
1. [Introduction](#1-introduction)  
2. [Data Overview](#2-data-overview)  
3. [Loading the Dataset](#3-loading-the-dataset)  
4. [Exploratory Data Analysis (EDA)](#4-exploratory-data-analysis-eda)  
   - 4.1 [Checking Missing Values](#41-checking-missing-values)  
   - 4.2 [Detecting Duplicates](#42-detecting-duplicates)  
   - 4.3 [Handling Data Types](#43-handling-data-types)  
   - 4.4 [Identifying Outliers](#44-identifying-outliers)  
   - 4.5 [Feature Distribution](#45-feature-distribution)  
5. [Data Cleaning](#5-data-cleaning)  
   - 5.1 [Handling Missing Values](#51-handling-missing-values)  
   - 5.2 [Fixing Data Inconsistencies](#52-fixing-data-inconsistencies)  
   - 5.3 [Removing Duplicates](#53-removing-duplicates)  
   - 5.4 [Correcting Data Types](#54-correcting-data-types)  
   - 5.5 [Handling Outliers](#55-handling-outliers)  
6. [Final Cleaned Data Overview](#6-final-cleaned-data-overview)  
7. [Data Visualization & Insights](#7-data-visualization--insights)  
   - 7.1 [Sales Trends Over Time](#71-sales-trends-over-time)  
   - 7.2 [Customer Segmentation](#72-customer-segmentation)  
   - 7.3 [Top-Selling Products](#73-top-selling-products)  
   - 7.4 [Regional Sales Analysis](#74-regional-sales-analysis)  
   - 7.5 [Discount vs Profit Relationship](#75-discount-vs-profit-relationship)
8. [Tools & Technologies](#8-tools--technologies)  
9. [Conclusion](#9-conclusion)  



# 1. Introduction

This project focuses on cleaning and analyzing an AI-generated e-commerce dataset using R. The dataset simulates real-world transactional data, including order details, customer segmentation, product categories, and financial metrics such as sales, discounts, profit, and shipping costs. The goal is to detect and fix data quality issues, transform the data for analysis, and extract meaningful business insights.

# 2. Data Overview

The dataset consists of 10,000 rows and 22 columns, with the following key attributes:

- **Order Information:** Order ID, Order Date, Ship Date, Payment Method
- **Customer Details:** Customer ID, Name, Segment, Region
- **Product Details:** Product Category, Subcategory, Quantity, Discount
- **Financial Metrics:** Sales, Profit, Shipping Cost

### **⚠️ Data Quality Issues & Cleaning Steps**

This dataset contains inconsistencies, missing values, and incorrect data types, making it an ideal case for data cleaning. The following steps were performed:

✔ **Handling Missing Values** – Imputed missing numerical data with the median and categorical data with the mode.  
✔ **Fixing Data Inconsistencies** – Standardized text formatting (e.g., country and product names).  
✔ **Removing Duplicates** – Identified and removed duplicate entries.  
✔ **Correcting Data Types** – Converted columns to appropriate formats (dates, numeric, categorical).  
✔ **Handling Outliers** – Detected extreme values using IQR and replaced them with the median.  

### **📊 Data Analysis & Insights**

After cleaning, the dataset was used for exploratory data analysis (EDA) to uncover business insights:

- 📈 **Sales Trends Over Time** – Identified seasonal patterns and peak sales periods.
- 👥 **Customer Segmentation** – Analyzed customer groups to optimize marketing strategies.
- 🏆 **Top-Selling Products** – Ranked products based on sales and profitability.
- 🌍 **Regional Sales Performance** – Compared sales across different regions.
- 💰 **Discount vs. Profit Relationship** – Evaluated how discounts impact profitability.

### **🛠 Tools & Technologies Used**

- **R** – Data cleaning, transformation, and analysis
- **Kaggle** – Cloud-based environment for running R scripts
- **GitHub** – Version control and project documentation

### **📎 Dataset Source**

This dataset was generated using AI (ChatGPT) to simulate real-world e-commerce transactions. It does not represent actual business data but is designed for learning and practice in data cleaning and analytics.



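This dataset contains **9,000+ rows** and **20+ columns** with information related to customer transactions, including order details, customer demographics, sales data, and shipping information. The table below describes the main features. In the code cells the column names are written without spaces (for example `OrderID`, `ShipDate`), two are named differently from the labels here (`ProductCategory`, `SubCategory`), and `PaymentMethod` and `ShippingCost` appear in the code but not in this table.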

| **Column Name**       | **Description** |
|----------------------|--------------------------------------------|
| `Order ID`          | Unique identifier for each order. |
| `Order Date`        | The date when the order was placed. |
| `Ship Date`         | The date when the order was shipped. |
| `Ship Mode`         | The shipping method used for delivery. |
| `Customer ID`       | Unique identifier for each customer. |
| `Customer Name`     | Full name of the customer. |
| `Segment`          | Customer segment (e.g., Consumer, Corporate, Home Office). |
| `Country`          | The country where the order was placed. |
| `City`             | The city of the customer. |
| `State`            | The state or province of the customer. |
| `Postal Code`      | The postal code of the customer’s location. |
| `Region`           | Geographic region (e.g., West, East, South, Central). |
| `Product ID`       | Unique identifier for each product. |
| `Category`         | The main product category (e.g., Furniture, Office Supplies, Technology). |
| `Sub-Category`     | The subcategory of the product. |
| `Product Name`     | The name of the product. |
| `Sales`           | The total sales amount for the order. |
| `Quantity`        | The number of items ordered. |
| `Discount`        | The discount applied to the order. |
| `Profit`          | The profit earned from the order. |

---

### **Key Characteristics of the Dataset**
✔ **Contains missing values, duplicates, and inconsistencies**, making it ideal for data cleaning.  
✔ **Includes both categorical and numerical data**, useful for exploratory data analysis.  
✔ **Ideal for data visualization and business insights** such as customer segmentation, sales trends, and profitability analysis.  

# 3. Loading the Dataset

In [None]:
# Load necessary libraries
library(tidyverse)  # Data manipulation and visualization
library(readr)       # Reading CSV files  

In [None]:
list.files("/kaggle/input/")

In [None]:
list.files("/kaggle/input/dirty-ecommerce-data-eda-r/")  

In [None]:
# Step 1: Load the Dataset
df <- read_csv("/kaggle/input/dirty-ecommerce-data-eda-r/dirty_ecommerce_data.csv")

# Step 2: Preview the First Few Rows
head(df)

# Step 3: Check the Structure of the Dataset
str(df)

# Step 4: Summary Statistics
summary(df)


In [None]:
# Check parsing problems
problems(df)


In [None]:
# Check the rows with invalid dates in 'OrderDate' and 'ShipDate'
invalid_dates <- df %>%
  filter(is.na(OrderDate) | is.na(ShipDate))

# Display the rows with invalid dates
print(invalid_dates)


In [None]:
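# Preview what the data would look like with the invalid-date rows removed.
# df_clean is only used for this check; the cells below keep working on df
# and impute the missing dates instead of dropping the rows.
df_clean <- df %>%
  filter(!is.na(OrderDate) & !is.na(ShipDate))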

# Verify that the rows with invalid dates are removed
head(df_clean)


In [None]:
# Convert the date columns to Date; values that do not match the format become NA
df$OrderDate <- as.Date(df$OrderDate, format = "%Y-%m-%d")
df$ShipDate <- as.Date(df$ShipDate, format = "%Y-%m-%d")

# Check again after conversion
head(df)


In [None]:
# Count the NA values in 'OrderDate' and 'ShipDate' (nothing has been imputed yet, so non-zero counts are expected)
sum(is.na(df$OrderDate))  # Number of missing or unparseable order dates
sum(is.na(df$ShipDate))   # Number of missing or unparseable ship dates


In [None]:
# Find rows with NA in OrderDate
missing_orderdate <- df %>%
  filter(is.na(OrderDate))

# Find rows with NA in ShipDate
missing_shipdate <- df %>%
  filter(is.na(ShipDate))

# Display the rows with missing dates
print(missing_orderdate)
print(missing_shipdate)


In [None]:
# Replace missing OrderDate and ShipDate with a placeholder date
df$OrderDate[is.na(df$OrderDate)] <- as.Date("2020-01-01")
df$ShipDate[is.na(df$ShipDate)] <- as.Date("2020-01-01")


In [None]:
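# Alternative to the placeholder cell: replace missing OrderDate with the median date.
# Run only one of the two cells. Once the placeholder cell has run, no NAs remain
# and the assignments below change nothing.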
median_orderdate <- median(df$OrderDate, na.rm = TRUE)
df$OrderDate[is.na(df$OrderDate)] <- median_orderdate

# Replace missing ShipDate with the median date
median_shipdate <- median(df$ShipDate, na.rm = TRUE)
df$ShipDate[is.na(df$ShipDate)] <- median_shipdate


In [None]:
sum(is.na(df$OrderDate))  # Should return 0
sum(is.na(df$ShipDate))   # Should return 0
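
Imputing `OrderDate` and `ShipDate` independently can leave rows where the shipment appears to precede the order. The following is a minimal sketch on a made-up data frame (the `toy` object and the `*Imputed` and `ShipBeforeOrder` columns are hypothetical names, not part of the dataset) that records which dates were filled in and flags rows that need a second look.

```r
# Toy data standing in for df; the values are invented for illustration.
toy <- data.frame(
  OrderDate = as.Date(c("2023-01-05", NA, "2023-03-10", "2023-04-01")),
  ShipDate  = as.Date(c("2023-01-08", "2023-02-02", NA, "2023-04-03"))
)

# Record which dates were missing before they are overwritten.
toy$OrderDateImputed <- is.na(toy$OrderDate)
toy$ShipDateImputed  <- is.na(toy$ShipDate)

# Median imputation, mirroring the cells above.
toy$OrderDate[toy$OrderDateImputed] <- median(toy$OrderDate, na.rm = TRUE)
toy$ShipDate[toy$ShipDateImputed]   <- median(toy$ShipDate, na.rm = TRUE)

# Rows where the shipment would precede the order are suspicious.
toy$ShipBeforeOrder <- toy$ShipDate < toy$OrderDate
print(toy)
```

Keeping the indicator columns makes it possible to exclude imputed dates later, for example when computing shipping delays.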


# 4. Exploratory Data Analysis (EDA)

This step examines the dataset to understand how the features relate to each other, detect patterns, and identify potential outliers. The subsections below cover missing values, duplicates, data types, outliers, and feature distributions.

## 4.1 Checking Missing Values

In [None]:
# Check for missing values in the entire dataset
sum(is.na(df))  # This will return the total count of missing values in the dataset

# Check for missing values in specific columns
sum(is.na(df$OrderDate))  # Missing values in OrderDate column
sum(is.na(df$ShipDate))   # Missing values in ShipDate column


In [None]:
# Check missing values per column
colSums(is.na(df))  # This will give you the number of missing values for each column


In [None]:
library(naniar)
gg_miss_upset(df)  # This creates a visualization of missing values across rows


This UpSet plot visualizes missing values in the dataset by showing how they intersect across multiple columns.

**Key Observations:**

- **Most Frequent Missing Values:**
  - The left bar chart shows the total missing values per column.
  - The columns Segment, CustomerName, ShippingCost, Region, and ProductName have missing values.
  - The highest number of missing values appears in the Segment column, followed by CustomerName and ShippingCost.
- **Intersection of Missing Values:**
  - The vertical bars at the top represent different combinations of missing values across columns.
  - The tallest bars indicate the most common missing-value patterns.
  - The first few bars (on the left) indicate that most missing values occur in a single column at a time (Segment, CustomerName, etc.).
  - Some smaller bars (on the right) show that multiple columns have missing values simultaneously for certain rows.
- **Set Size & Patterns:**
  - The "Set Size" on the left shows how many missing values exist per column.
  - The dots and connecting lines below the main bar chart indicate which columns share missing values in specific rows.

**Implications for Data Cleaning:**

- Since Segment and CustomerName have significant missing values, they may require imputation or removal depending on their importance.
- If multiple columns are missing values together (as shown in the intersections), it might suggest systematic data entry issues rather than random missingness.
- The ShippingCost column also has missing values, which could impact pricing or cost-related analysis.

**Next Steps:**

- Drop rows where too many important fields are missing.
- Impute missing values for categorical variables (e.g., "Unknown" for Segment, CustomerName).
- Use median or mean imputation for numerical values like ShippingCost.
- Investigate why missing values occur in specific intersections.

This visualization helps in making informed decisions about handling missing data before performing further analysis. 🚀
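
The same information the UpSet plot conveys can be tabulated with base R. The sketch below, on an invented data frame (`toy` and its values are assumptions for illustration), computes the share of missing values per column, the number of missing fields per row, and the frequency of each missingness pattern.

```r
# Toy data standing in for df; values are invented for illustration.
toy <- data.frame(
  Segment      = c("Consumer", NA, "Corporate", NA, "Consumer"),
  CustomerName = c("Ana", "Ben", NA, NA, "Eve"),
  ShippingCost = c(5.0, NA, 7.5, 3.2, 4.1)
)

na_flags <- is.na(toy)  # logical matrix, TRUE where a value is missing

# Share of missing values per column, in percent.
missing_pct <- round(100 * colMeans(na_flags), 1)
print(missing_pct)

# Number of missing fields in each row.
missing_per_row <- rowSums(na_flags)
print(table(missing_per_row))

# Missingness pattern per row: names of the missing columns joined by "+".
pattern <- apply(na_flags, 1, function(r) paste(colnames(na_flags)[r], collapse = "+"))
pattern[pattern == ""] <- "(complete)"
print(sort(table(pattern), decreasing = TRUE))
```

Rows with a high `missing_per_row` count are the candidates for dropping, while frequent multi-column patterns point to the systematic gaps mentioned above.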

In [None]:
df <- df[!is.na(df$OrderID), ]  # Remove rows where OrderID is missing
df <- df[!is.na(df$PostalCode), ]  # Remove rows where PostalCode is missing


In [None]:
df$CustomerName[is.na(df$CustomerName)] <- "Unknown"  # Impute with a placeholder
df$Segment[is.na(df$Segment)] <- "Unknown"  # Impute with a placeholder
df$Country[is.na(df$Country)] <- "Unknown"  # Impute with a placeholder
df$City[is.na(df$City)] <- "Unknown"  # Impute with a placeholder


In [None]:
df$Sales[is.na(df$Sales)] <- mean(df$Sales, na.rm = TRUE)  # Impute with mean
df$Quantity[is.na(df$Quantity)] <- median(df$Quantity, na.rm = TRUE)  # Impute with median
df$Discount[is.na(df$Discount)] <- 0  # Impute with 0


## 4.2 Detecting Duplicates

In [None]:
# 4.2 Detecting Duplicates ------------------------------------------------------

# Count duplicate rows
duplicate_count <- sum(duplicated(df))
print(paste("Number of duplicate rows:", duplicate_count))

# Remove duplicate rows
df <- df[!duplicated(df), ]
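
`duplicated(df)` only catches rows that match on every column. If `OrderID` is meant to be unique per row (an assumption here, since an order could legitimately span several line items), it can also be worth checking for repeated keys whose other values disagree. A minimal sketch on invented data:

```r
# Toy data standing in for df; values are invented for illustration.
toy <- data.frame(
  OrderID = c("A1", "A2", "A2", "A3", "A3"),
  Sales   = c(100, 250, 250, 80, 95),
  stringsAsFactors = FALSE
)

# Exact duplicates: every column matches an earlier row.
exact_dupes <- duplicated(toy)

# Key duplicates: the same OrderID appears more than once, whatever the other values.
key_dupes <- duplicated(toy$OrderID) | duplicated(toy$OrderID, fromLast = TRUE)

cat("Exact duplicate rows:", sum(exact_dupes), "\n")
cat("Rows sharing an OrderID:", sum(key_dupes), "\n")

# Conflicts: rows that share an OrderID but are not exact copies of another row.
is_exact_copy <- duplicated(toy) | duplicated(toy, fromLast = TRUE)
conflicts <- toy[key_dupes & !is_exact_copy, ]
print(conflicts)
```

Exact copies can be dropped safely; conflicting rows usually need a rule (keep the latest, aggregate, or inspect by hand) rather than automatic removal.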


## 4.3 Handling Data Types

In [None]:
# 4.3 Handling Data Types -------------------------------------------------------

# Convert dates to proper format
df$OrderDate <- as.Date(df$OrderDate, format = "%Y-%m-%d")
df$ShipDate <- as.Date(df$ShipDate, format = "%Y-%m-%d")

# Convert categorical variables to factors
df$Segment <- as.factor(df$Segment)
df$Country <- as.factor(df$Country)
df$Region <- as.factor(df$Region)
df$ProductCategory <- as.factor(df$ProductCategory)
df$SubCategory <- as.factor(df$SubCategory)
df$PaymentMethod <- as.factor(df$PaymentMethod)

# Check data structure after type conversion
str(df)


## 4.4 Identifying Outliers

In [None]:
# 4.4 Identifying Outliers ------------------------------------------------------

# Boxplots to detect outliers in numeric columns
numeric_cols <- c("Sales", "Quantity", "Discount", "Profit", "ShippingCost")

par(mfrow = c(2, 3))
for (col in numeric_cols) {
  boxplot(df[[col]], main = col, col = "lightblue")
}

# Detect outliers using IQR method
outlier_detection <- function(column) {
  Q1 <- quantile(column, 0.25, na.rm = TRUE)
  Q3 <- quantile(column, 0.75, na.rm = TRUE)
  IQR_value <- Q3 - Q1
  lower_bound <- Q1 - 1.5 * IQR_value
  upper_bound <- Q3 + 1.5 * IQR_value
  return(sum(column < lower_bound | column > upper_bound, na.rm = TRUE))
}

outliers_count <- sapply(df[, numeric_cols], outlier_detection)
print(outliers_count)


The boxplots show the distribution of each numerical variable and help detect outliers:

- **Sales:** The distribution appears fairly symmetric with no significant outliers.
- **Quantity:** Similar to Sales, the data is well distributed without extreme values.
- **Discount:** There are outliers, indicating a few transactions with unusually high discounts.
- **Profit:** The data is mostly concentrated around lower values, but some high-profit transactions are visible.
- **Shipping Cost:** Shows significant outliers, meaning some transactions had exceptionally high shipping costs.

The numbers printed above each column name are the counts of detected outliers, with Discount (1158) and Shipping Cost (1594) being the most affected.
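
To make the 1.5 × IQR rule concrete, the sketch below applies it to a short invented vector of discounts (the values and the helper name `iqr_bounds` are hypothetical). When most values sit at one end of the range, the fences are narrow, which may be part of why skewed columns report many outliers.

```r
# Hypothetical discounts: mostly zero, a few positive values.
x <- c(0, 0, 0, 0, 0, 0, 0.1, 0.2, 0.3, 0.5)

# Lower and upper fences of the k * IQR rule.
iqr_bounds <- function(v, k = 1.5) {
  q <- quantile(v, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
}

b <- iqr_bounds(x)
print(b)

# Values outside the fences are the ones the rule calls outliers.
flagged <- x[x < b["lower"] | x > b["upper"]]
print(flagged)
```

Here the first quartile is 0 and the third is 0.175, so the upper fence lands at 0.4375 and only the 0.5 discount is flagged.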

## 4.5 Feature Distribution

In [None]:
# 4.5 Feature Distribution ------------------------------------------------------

# Histograms for numerical features
par(mfrow = c(2, 3))
for (col in numeric_cols) {
  hist(df[[col]], main = paste("Distribution of", col), col = "skyblue", border = "white")
}

# Bar plots for categorical features
categorical_cols <- c("Segment", "Country", "Region", "ProductCategory", "SubCategory", "PaymentMethod")

par(mfrow = c(2, 3))
for (col in categorical_cols) {
  barplot(table(df[[col]]), main = paste("Distribution of", col), col = "lightgreen", las = 2)
}

# Save an intermediate copy (section 5.5 overwrites this file with the fully cleaned data)
write.csv(df, "cleaned_ecommerce_data.csv", row.names = FALSE)

The bar plots in green display the distribution of categorical variables in the dataset:

- **Segment:** The different customer segments are relatively evenly distributed.
- **Country:** Spain has the highest number of transactions, while other countries show a more balanced distribution.
- **Region:** The transaction count across regions is quite similar, indicating no strong regional imbalance.
- **Product Category:** All categories have nearly equal representation.
- **SubCategory:** The subcategories also appear evenly distributed.
- **Payment Method:** Different payment methods are used fairly equally, with no dominant preference.

Overall, the dataset appears well-distributed across categories, ensuring balanced representation for analysis.

# 5. Data Cleaning 

## 5.1 Handling Missing Values 

In [None]:
# 5.1 Handling Missing Values ---------------------------------------------------

# Check missing values
missing_values <- colSums(is.na(df))
print(missing_values)

# Handling missing values:
#  - Drop columns with too many missing values (threshold: 50%)
#  - Impute missing numeric values with median
#  - Impute missing categorical values with mode

threshold <- 0.5 * nrow(df)  # 50% threshold
df <- df[, colSums(is.na(df)) < threshold]  # Drop columns with too many NAs

for (col in names(df)) {
  if (sum(is.na(df[[col]])) > 0) {
    if (is.numeric(df[[col]])) {
      df[[col]][is.na(df[[col]])] <- median(df[[col]], na.rm = TRUE)  # Impute numeric with median
    } else {
      mode_value <- names(sort(table(df[[col]]), decreasing = TRUE))[1]  # Get mode
      df[[col]][is.na(df[[col]])] <- mode_value  # Impute categorical with mode
    }
  }
}
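
The mode lookup in the loop above can also be written as a small reusable helper. This is a sketch, not the notebook's own code: `stat_mode` and `impute_mode` are hypothetical names, and ties are resolved in favour of the value that appears first in the vector.

```r
# Most frequent non-missing value; ties go to the value seen first.
stat_mode <- function(v) {
  v <- v[!is.na(v)]
  if (length(v) == 0) return(NA)  # nothing to learn a mode from
  uniq <- unique(v)
  uniq[which.max(tabulate(match(v, uniq)))]
}

# Fill missing entries with the mode of the observed values.
impute_mode <- function(v) {
  v[is.na(v)] <- stat_mode(v)
  v
}

segment <- c("Consumer", NA, "Corporate", "Consumer", NA)
print(impute_mode(segment))
```

Because the helper works on the values themselves rather than on `names(table(...))`, it returns the mode in the column's original type instead of as a character string.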

## 5.2 Fixing Data Inconsistencies 

In [None]:
# 5.2 Fixing Data Inconsistencies ------------------------------------------------

# Convert text to consistent casing
df$Country <- tolower(df$Country)
df$State <- tolower(df$State)
df$City <- tolower(df$City)
df$Region <- tolower(df$Region)
df$ProductCategory <- tolower(df$ProductCategory)

# Trim spaces
df <- df %>%
  mutate(across(where(is.character), ~trimws(.)))

# Standardize categorical values
df$PaymentMethod <- gsub("Credit Card", "credit_card", df$PaymentMethod, ignore.case = TRUE)
df$PaymentMethod <- gsub("PayPal", "paypal", df$PaymentMethod, ignore.case = TRUE)
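
When a column has many spelling variants, one `gsub()` call per label becomes hard to maintain. A possible alternative, sketched here on invented values, is to normalise case, whitespace, and punctuation first and then apply a small lookup table for the variants that remain. The `canonical` mapping is hypothetical and would need to be filled in from the labels actually present in the data.

```r
# Invented payment labels with typical inconsistencies.
payment <- c(" Credit Card", "credit-card", "PAYPAL ", "Pay Pal", "Bank Transfer", NA)

# Normalise: trim, lower-case, and collapse any run of non-alphanumerics to "_".
key <- tolower(trimws(payment))
key <- gsub("[^a-z0-9]+", "_", key)

# Hypothetical lookup for variants that normalisation alone does not merge.
canonical <- c(pay_pal = "paypal")

# Use the lookup where a key is listed, otherwise keep the normalised key.
standardised <- unname(ifelse(key %in% names(canonical), canonical[key], key))
print(standardised)
print(table(standardised, useNA = "ifany"))
```

Tabulating the result with `useNA = "ifany"` is a quick way to confirm that the variants collapsed as intended and that missing values were left untouched.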


## 5.3 Removing Duplicates 

In [None]:
# 5.3 Removing Duplicates --------------------------------------------------------

# Check and remove duplicate rows
duplicate_count <- sum(duplicated(df))
print(paste("Number of duplicate rows:", duplicate_count))
df <- df[!duplicated(df), ]

## 5.4 Correcting Data Types 

In [None]:
# 5.4 Correcting Data Types ------------------------------------------------------

# Convert dates to proper format
df$OrderDate <- as.Date(df$OrderDate, format = "%Y-%m-%d")
df$ShipDate <- as.Date(df$ShipDate, format = "%Y-%m-%d")

# Convert categorical variables to factors
df$Segment <- as.factor(df$Segment)
df$Country <- as.factor(df$Country)
df$Region <- as.factor(df$Region)
df$ProductCategory <- as.factor(df$ProductCategory)
df$SubCategory <- as.factor(df$SubCategory)
df$PaymentMethod <- as.factor(df$PaymentMethod)

# Convert numerical columns
numeric_cols <- c("Sales", "Quantity", "Discount", "Profit", "ShippingCost")
df[numeric_cols] <- lapply(df[numeric_cols], as.numeric)

## 5.5 Handling Outliers

In [None]:
# 5.5 Handling Outliers ----------------------------------------------------------

# Detect outliers using IQR method and replace with median
for (col in numeric_cols) {
  Q1 <- quantile(df[[col]], 0.25, na.rm = TRUE)
  Q3 <- quantile(df[[col]], 0.75, na.rm = TRUE)
  IQR_value <- Q3 - Q1
  lower_bound <- Q1 - 1.5 * IQR_value
  upper_bound <- Q3 + 1.5 * IQR_value
  
  # Replace outliers with median
  df[[col]][df[[col]] < lower_bound | df[[col]] > upper_bound] <- median(df[[col]], na.rm = TRUE)
}

# Save the cleaned dataset
write.csv(df, "cleaned_ecommerce_data.csv", row.names = FALSE)

print("Data Cleaning Complete!")
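
Replacing an outlier with the median moves it to the centre of the distribution. A common alternative is to cap (winsorize) values at the IQR fences, which keeps them at the extreme end without letting them dominate. The sketch below compares the two on invented shipping costs; `iqr_fences` is a hypothetical helper and capping is offered as an option, not as what the notebook does.

```r
# Lower and upper fences of the k * IQR rule.
iqr_fences <- function(v, k = 1.5) {
  q <- quantile(v, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  c(q[1] - k * iqr, q[2] + k * iqr)
}

# Hypothetical shipping costs with one extreme value.
shipping <- c(4, 5, 5, 6, 6, 7, 7, 8, 60)

fences <- iqr_fences(shipping)
is_outlier <- shipping < fences[1] | shipping > fences[2]

# Approach used in section 5.5: replace outliers with the median.
median_replaced <- replace(shipping, is_outlier, median(shipping))

# Alternative: cap values at the fences.
capped <- pmin(pmax(shipping, fences[1]), fences[2])

print(rbind(original = shipping, median_replaced = median_replaced, capped = capped))
```

In this example the fences are 2 and 10, so the value 60 becomes 6 under median replacement and 10 under capping.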

# 6. Final Cleaned Data Overview

In [None]:
# Final Cleaned Data Overview ---------------------------------------------

# Check the structure of the cleaned dataset
str(df)

# Check the dimensions of the dataset (rows and columns)
cat("Dataset Dimensions (Rows x Columns): ", nrow(df), "x", ncol(df), "\n")

# Display summary statistics for numerical columns
summary(df)

# Display the number of unique values for categorical columns
cat("Number of unique values for categorical columns:\n")
cat("Segment: ", length(unique(df$Segment)), "\n")
cat("Country: ", length(unique(df$Country)), "\n")
cat("Region: ", length(unique(df$Region)), "\n")
cat("ProductCategory: ", length(unique(df$ProductCategory)), "\n")
cat("SubCategory: ", length(unique(df$SubCategory)), "\n")
cat("PaymentMethod: ", length(unique(df$PaymentMethod)), "\n")

# Display a few rows of the cleaned data to verify
head(df)

# Check for any remaining missing values (should be zero)
missing_values_final <- colSums(is.na(df))
cat("Remaining Missing Values:\n")
print(missing_values_final)

# Check for any duplicates in the cleaned dataset (should be zero)
duplicate_count_final <- sum(duplicated(df))
cat("Remaining Duplicates: ", duplicate_count_final, "\n")
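
The "should be zero" checks above can be turned into a function that returns a list of problems, which is easier to reuse after each cleaning pass. This is a sketch under assumptions: `validate_clean` is a hypothetical helper, and the date-order check is an extra rule not applied elsewhere in the notebook.

```r
# Return a character vector describing any remaining data-quality problems.
validate_clean <- function(d) {
  issues <- character()
  if (any(is.na(d))) issues <- c(issues, "missing values remain")
  if (anyDuplicated(d) > 0) issues <- c(issues, "duplicate rows remain")
  if (all(c("OrderDate", "ShipDate") %in% names(d)) &&
      any(d$ShipDate < d$OrderDate, na.rm = TRUE)) {
    issues <- c(issues, "ShipDate earlier than OrderDate")
  }
  issues
}

# Invented examples: one clean frame, one with all three problems.
good <- data.frame(OrderDate = as.Date("2023-01-01"),
                   ShipDate  = as.Date("2023-01-03"),
                   Sales     = 10)
bad <- rbind(good, good,
             data.frame(OrderDate = as.Date("2023-02-01"),
                        ShipDate  = as.Date("2023-01-20"),
                        Sales     = NA))

print(validate_clean(good))
print(validate_clean(bad))
```

An empty result means the frame passed every check; calling `stopifnot(length(validate_clean(df)) == 0)` would halt the notebook as soon as a check fails.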


# 7. Data Visualization & Insights

## 7.1 Sales Trends Over Time 

In [None]:
# Load necessary libraries for plotting
library(ggplot2)
library(dplyr)

# Ensure 'OrderDate' is in Date format
df$OrderDate <- as.Date(df$OrderDate, format="%Y-%m-%d")

# Create a summary of sales by Year-Month
df_sales_time <- df %>%
  mutate(YearMonth = format(OrderDate, "%Y-%m")) %>%  # Extract Year-Month
  group_by(YearMonth) %>%  # Group by Year-Month
  summarise(TotalSales = sum(Sales, na.rm = TRUE))  # Summarize sales

# Plot Sales Trends Over Time
ggplot(df_sales_time, aes(x = YearMonth, y = TotalSales)) +
  geom_line(group = 1, color = "blue") +  # Ensure lines are drawn by connecting points
  labs(title = "Sales Trends Over Time",
       x = "Year-Month",
       y = "Total Sales") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))  # Rotate x-axis labels (must come after theme_minimal(), which would otherwise reset them)



In [None]:
ggplot(df_sales_time, aes(x = YearMonth, y = log(TotalSales + 1))) + 
  geom_line(group = 1, color = "blue") +
  labs(title = "Sales Trends Over Time (Log Scale)",
       x = "Year-Month",
       y = "Log of Total Sales") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))  # Applied after theme_minimal() so the rotation is kept
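
Grouping by a `"%Y-%m"` string sorts correctly, but months without any orders simply disappear from the axis. One possible refinement, sketched here in base R on invented data (`orders`, `monthly`, and `all_months` are hypothetical names), keeps the month as a real `Date` and fills empty months with zero.

```r
# Toy orders standing in for df; values are invented, and February has no orders.
orders <- data.frame(
  OrderDate = as.Date(c("2023-01-15", "2023-01-20", "2023-03-02", "2023-03-28")),
  Sales     = c(100, 50, 200, 25)
)

# Truncate each date to the first day of its month, keeping the Date class.
orders$Month <- as.Date(format(orders$OrderDate, "%Y-%m-01"))

# Total sales per month.
monthly <- aggregate(Sales ~ Month, data = orders, FUN = sum)

# Add months with no orders so the gap shows up as zero instead of vanishing.
all_months <- data.frame(
  Month = seq(min(monthly$Month), max(monthly$Month), by = "month")
)
monthly <- merge(all_months, monthly, all.x = TRUE)
monthly$Sales[is.na(monthly$Sales)] <- 0
print(monthly)
```

With ggplot2 loaded, `ggplot(monthly, aes(x = Month, y = Sales)) + geom_line() + scale_x_date(date_labels = "%Y-%m")` would draw the result on a true date axis, so `group = 1` is no longer needed.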


## 7.2 Customer Segmentation 

In [None]:
# Summarize total sales by customer
df_customer_sales <- df %>%
  group_by(CustomerID) %>%
  summarise(TotalSales = sum(Sales, na.rm = TRUE))

# Plot Customer Segmentation (Top 10 customers by total sales)
top_customers <- df_customer_sales %>%
  top_n(10, TotalSales) %>%
  arrange(desc(TotalSales))

ggplot(top_customers, aes(x = reorder(CustomerID, -TotalSales), y = TotalSales)) +
  geom_bar(stat = "identity", fill = "orange") +
  labs(title = "Top 10 Customers by Total Sales",
       x = "Customer ID",
       y = "Total Sales") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Applied after theme_minimal() so the rotation is kept


## 7.3 Top-Selling Products 

In [None]:
# Summarize total sales and quantity by product
df_product_sales <- df %>%
  group_by(ProductName) %>%
  summarise(TotalSales = sum(Sales, na.rm = TRUE), 
            TotalQuantity = sum(Quantity, na.rm = TRUE))

# Plot Top 10 Selling Products by Total Sales
top_products_sales <- df_product_sales %>%
  top_n(10, TotalSales) %>%
  arrange(desc(TotalSales))

ggplot(top_products_sales, aes(x = reorder(ProductName, -TotalSales), y = TotalSales)) +
  geom_bar(stat = "identity", fill = "green") +
  labs(title = "Top 10 Selling Products by Total Sales",
       x = "Product Name",
       y = "Total Sales") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Applied after theme_minimal() so the rotation is kept


## 7.4 Regional Sales Analysis 

In [None]:
# Summarize total sales by region
df_region_sales <- df %>%
  group_by(Region) %>%
  summarise(TotalSales = sum(Sales, na.rm = TRUE))

# Plot Regional Sales Analysis
ggplot(df_region_sales, aes(x = reorder(Region, -TotalSales), y = TotalSales)) +
  geom_bar(stat = "identity", fill = "red") +
  labs(title = "Sales by Region",
       x = "Region",
       y = "Total Sales") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Applied after theme_minimal() so the rotation is kept


## 7.5 Discount vs Profit Relationship

In [None]:
# Plot Discount vs Profit Relationship
ggplot(df, aes(x = Discount, y = Profit)) +
  geom_point(color = "purple", alpha = 0.5) +
  labs(title = "Discount vs Profit Relationship",
       x = "Discount",
       y = "Profit") +
  theme_minimal()


In [None]:
# Plot Discount vs Profit Relationship with a smoothing line
ggplot(df, aes(x = Discount, y = Profit)) +
  geom_point(color = "purple", alpha = 0.5) +  # Scatter plot
  geom_smooth(method = "lm", color = "blue", se = FALSE) +  # Linear smoothing line (no confidence interval)
  labs(title = "Discount vs Profit Relationship",
       x = "Discount",
       y = "Profit") +
  theme_minimal()


In [None]:
# Boxplots to check for outliers
ggplot(df, aes(x = "", y = Discount)) +
  geom_boxplot() +
  labs(title = "Boxplot of Discount")

ggplot(df, aes(x = "", y = Profit)) +
  geom_boxplot() +
  labs(title = "Boxplot of Profit")


# 8. Tools & Technologies

This project was developed and executed using the following tools:  

- **Programming Language:** R  
- **Platform:** Kaggle Notebooks  
- **Libraries Used:**  
  - `tidyverse` – Data manipulation and visualization  
  - `dplyr` – Data wrangling  
  - `ggplot2` – Data visualization  
  - `lubridate` – Date handling  
  - `stringr` – String manipulation  
  - `readr` – Reading and writing CSV files  
  - `naniar` – Missing-value visualization (`gg_miss_upset`)  

The project was run entirely on Kaggle, leveraging its cloud-based environment for data analysis and visualization.  



## Key Takeaways from this Project:

Data visualizations and exploratory data analysis are essential tools for determining whether linear regression is an appropriate method for modeling the relationship between two variables.
A linear regression model provides insights into the relationship between two variables, allowing it to be expressed quantitatively.
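
As an illustration of expressing such a relationship quantitatively, the sketch below fits a linear model to synthetic data. The numbers are generated with an assumed negative slope purely for demonstration and say nothing about the actual dataset; with the real data the call would be `lm(Profit ~ Discount, data = df)`.

```r
# Synthetic data, generated only for this illustration.
set.seed(42)
discount <- runif(200, min = 0, max = 0.5)
profit   <- 50 - 80 * discount + rnorm(200, sd = 10)
toy <- data.frame(Discount = discount, Profit = profit)

# Fit Profit as a linear function of Discount.
fit <- lm(Profit ~ Discount, data = toy)

# Intercept and slope: the slope estimates the change in profit per unit of discount.
print(coef(fit))

# Strength of the linear association and share of variance explained.
print(cor(toy$Discount, toy$Profit))
print(summary(fit)$r.squared)
```

For a model with a single predictor, the R-squared equals the squared correlation, so the scatter plot, the correlation, and the fitted slope all describe the same relationship from different angles.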


# 9. Conclusion

Based on the conducted e-commerce data analysis, several key insights have been identified:

- **Sales Trends:** There is a clear seasonal pattern in sales over time. Identifying high-performing sales periods can help in better resource allocation and marketing strategies.
- **Category and Product Performance:** Certain product categories consistently outperform others, contributing significantly to overall revenue. Emphasizing these categories in marketing efforts could boost sales.
- **Profit Margins:** Analysis revealed variations in profit margins across categories and regions. Focusing on improving operational efficiency and discount strategies in low-profit areas can improve profitability.
- **Discount-Profit Relationship:** There is an observable negative impact of high discounts on profit, suggesting the need for careful optimization of promotional offers.
- **Customer Segmentation Insights:** Regional analysis shows differences in purchasing behavior and profitability. Targeted campaigns tailored for specific customer groups and regions may yield better results.

