# Project Title: Working with Data in Python
## Objective
This project involves cleaning customer and transaction datasets to ensure data consistency, followed by an analysis to gain insights into customer behavior and sales trends. We will cover data cleaning, transformation, and visualization of key metrics.


### Importing Necessary Libraries
We begin by importing the essential libraries: `pandas` for data manipulation, `matplotlib.pyplot` for visualization, and `numpy` for numerical operations.


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np


# Loading Customer and Transaction Data

In [None]:
# Prompting for user input to specify file paths
print("Welcome to the Capstone_2 data analysis project!")
customer_file_path = input("Please enter the path for the customer data file: ")
transaction_file_path = input("Please enter the path for the transaction data file: ")

# Load the data with user-specified paths
customer_list = pd.read_csv(customer_file_path, delimiter='|')
transaction_data = pd.read_csv(transaction_file_path)

# Display the first few rows to confirm loading
print("Customer Data Preview:")
print(customer_list.head())
print("\nTransaction Data Preview:")
print(transaction_data.head())


## Data Cleaning



- **Removing Extra Spaces in Column Names**: Ensures consistency by removing any trailing or leading spaces in column headers.

In [None]:
customer_list.columns = customer_list.columns.str.strip()


- **Handling Non-Standard Characters in Names**: Removes unwanted characters from the 'name' column, leaving only ASCII letters, hyphens, periods, and whitespace.

In [None]:
customer_list['name'] = customer_list['name'].str.replace(r'[^a-zA-Z\-\.\s]', '', regex=True)
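
To see what this pattern keeps and removes, here is a small sketch on made-up names (not taken from the project data). The character class covers ASCII letters only, so accented letters are stripped along with punctuation. Whether that is acceptable depends on the customer list. The whitespace clean-up at the end is an optional extra step and is not part of the original cell.

```python
import pandas as pd

# Illustrative names only; not taken from the project data
sample_names = pd.Series(["Mary-Ann O'Neil", "J. R. Smith!!", "Anna#  Lee", "Jos\u00e9"])

# Same pattern as the notebook: keep ASCII letters, hyphens, periods, and whitespace
cleaned_names = sample_names.str.replace(r'[^a-zA-Z\-\.\s]', '', regex=True)

# Optional follow-up (an assumption, not in the original): collapse repeated
# whitespace and trim the ends
tidy_names = cleaned_names.str.replace(r'\s+', ' ', regex=True).str.strip()

print(tidy_names.tolist())
```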


- **Formatting Phone Numbers**: Reformats phone numbers stored as ten consecutive digits into a consistent `NNN-NNN-NNNN` format.

In [None]:
customer_list['phone'] = customer_list['phone'].str.replace(r'(\d{3})(\d{3})(\d{4})', r'\1-\2-\3', regex=True)
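
The pattern above only changes values that already contain ten consecutive digits. If the column also holds variants such as `(555) 123-4567`, one possible approach is to strip non-digits first and then reformat. The sketch below uses made-up values, and `sample_phones`, `digits`, `formatted`, and `is_valid` are illustrative names rather than part of the notebook.

```python
import pandas as pd

# Illustrative values only; the real 'phone' column may look different
sample_phones = pd.Series(["5551234567", "(555) 123-4567", "555.123.4567", "12345", None])

# Remove every character that is not a digit; missing values stay missing
digits = sample_phones.str.replace(r"\D", "", regex=True)

# Reformat only values that are exactly ten digits; leave everything else unchanged
formatted = digits.str.replace(r"^(\d{3})(\d{3})(\d{4})$", r"\1-\2-\3", regex=True)

# Flag values that still do not match NNN-NNN-NNNN so they can be reviewed by hand
is_valid = formatted.str.fullmatch(r"\d{3}-\d{3}-\d{4}", na=False)

print(formatted.tolist())
print(is_valid.tolist())
```

Keeping a validity flag instead of silently dropping rows makes it easier to decide later how to treat incomplete numbers.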


- **Handling Missing SMS Opt-Out Values**: Fills any missing values in 'sms-opt-out' with 'N' (No) to ensure consistency in the data.

In [None]:
customer_list['sms-opt-out'] = customer_list['sms-opt-out'].fillna('N')


In [None]:
print("\nCleaned Data:")
print(customer_list.head())


### Saving the Cleaned Customer List

In [None]:
customer_list.to_csv('cleaned_customer_list.csv', index=False)


### Comparison of Original and Cleaned Data
Here, we compare specific columns from the original and cleaned datasets to verify the effectiveness of our data cleaning steps. This includes comparisons for 'name', 'phone', and 'sms-opt-out' in the customer list, as well as 'Date' and 'Product Name' in the transaction list.


In [None]:
# Load the original file (the path entered earlier) and the cleaned file
original_customer_list = pd.read_csv(customer_file_path, delimiter='|')
cleaned_customer_list = pd.read_csv('cleaned_customer_list.csv')


In [None]:
# Create a comparison DataFrame to view before and after cleaning
comparison = pd.DataFrame({
    "Original Name": original_customer_list['name'],
    "Cleaned Name": cleaned_customer_list['name'],
    "Original Phone": original_customer_list['phone'],
    "Cleaned Phone": cleaned_customer_list['phone'],
    "Original SMS Opt-Out": original_customer_list.get('sms-opt-out', 'N/A'),
    "Cleaned SMS Opt-Out": cleaned_customer_list['sms-opt-out']
})

comparison.head(10)


### Transaction Data Cleaning

- **Date Filtering**: Removes transactions whose date lies in the future.

In [None]:
import datetime as dt

# Get the current date
today = dt.datetime.today()

# Convert 'Date' to datetime; values that cannot be parsed become NaT
transaction_data['Date'] = pd.to_datetime(transaction_data['Date'], errors='coerce')
# Keep only past or present dates (rows with NaT fail the comparison and are dropped as well)
transaction_data = transaction_data[transaction_data['Date'] <= today]

# Display the data after filtering
print("Filtered Transaction Data Preview (No Future Dates):")
print(transaction_data.head())
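
Because `errors='coerce'` turns unparseable dates into `NaT`, and `NaT <= today` evaluates to `False`, the filter above removes those rows together with the future-dated ones. A minimal sketch on made-up rows, counting both kinds before filtering (the names `sample`, `cutoff`, `unparseable`, `in_future`, and `kept` are illustrative):

```python
import pandas as pd

# Illustrative rows only; column names follow the notebook
sample = pd.DataFrame({
    "Date": ["2023-07-15", "not a date", "2200-01-01"],
    "OrderTotal": [10.0, 20.0, 30.0],
})

sample["Date"] = pd.to_datetime(sample["Date"], errors="coerce")
cutoff = pd.Timestamp.today()

# Classify each row before dropping anything
unparseable = sample["Date"].isna()
in_future = sample["Date"] > cutoff
print(f"Unparseable dates: {unparseable.sum()}, future dates: {in_future.sum()}")

# Same filter as the notebook: NaT rows fail the comparison and are dropped too
kept = sample[sample["Date"] <= cutoff]
print(kept)
```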



- **Duplicate Removal**: Ensures no duplicate entries exist, preserving data integrity.

In [None]:
transaction_data = transaction_data.drop_duplicates()


- **Missing Values Check**: Identifies any columns with missing values that might require further cleaning.

In [None]:
missing_data = transaction_data.isnull().sum()
print("Missing values per column:\n", missing_data)



### Saving the cleaned data to a new CSV file

In [None]:
transaction_data.to_csv('time_filtered_transaction_data.csv', index=False)


## Comparison Between the Original and Cleaned Transaction Data

In [None]:
# Load the original file (the path entered earlier) and the cleaned transaction file
original_transaction_data = pd.read_csv(transaction_file_path)
cleaned_transaction_data = pd.read_csv('time_filtered_transaction_data.csv')

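# Build a comparison DataFrame for the Date and ProductName columns.
# Rows are matched by position, so once rows have been removed the two sides
# may no longer describe the same transaction.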
comparison = pd.DataFrame({
    "Original Date": original_transaction_data['Date'],
    "Cleaned Date": cleaned_transaction_data['Date'],
    "Original Product Name": original_transaction_data['ProductName'],
    "Cleaned Product Name": cleaned_transaction_data['ProductName']
})

# Show the first and last 5 rows of the comparison to inspect Date and Product Name
print("First 5 Rows - Date and Product Name Comparison:")
print(comparison.head())

print("\nLast 5 Rows - Date and Product Name Comparison:")
print(comparison.tail())
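
Since filtering and de-duplication remove rows, a side-by-side view matched by position can pair unrelated transactions. One alternative, sketched here on made-up data, is to compare row counts and list the rows that were removed by using the index labels that survive in memory. This assumes the comparison is done before the data is written to CSV with `index=False`, and `original`, `cleaned`, and `removed` are illustrative names.

```python
import pandas as pd

# Illustrative data only; column names follow the notebook
original = pd.DataFrame({
    "Date": pd.to_datetime(["2023-07-01", "2200-01-01", "2023-07-01", "2023-08-05"]),
    "ProductName": ["Product A", "Product B", "Product A", "Product C"],
})

cutoff = pd.Timestamp("2024-12-31")  # fixed cutoff so the example is reproducible

# Same two steps as the notebook: drop future dates, then drop exact duplicates
cleaned = original[original["Date"] <= cutoff].drop_duplicates()

# Index labels present in the original but missing after cleaning
removed = original.loc[original.index.difference(cleaned.index)]

print(f"Original rows: {len(original)}, cleaned rows: {len(cleaned)}")
print(removed)
```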

## Analysis and Visualization
We perform analysis to reveal key insights about monthly sales trends, top products by revenue, and top customers by spending. These analyses help in understanding customer behavior and identifying revenue-driving products and customers.


### Data Presentation and Initial Summary

In [None]:
# Display the first 10 rows
print("Top 10 Rows of the Cleaned Data:")
print(cleaned_transaction_data.head(10))




### Statistical summary

In [None]:
print("\nStatistical Summary:")
print(cleaned_transaction_data.describe())

## Monthly Sales Analysis

### Creating a 'Month' column

In [None]:
# Ensure the 'Date' column is in datetime format
cleaned_transaction_data['Date'] = pd.to_datetime(cleaned_transaction_data['Date'], errors='coerce')
cleaned_transaction_data['Month'] = cleaned_transaction_data['Date'].dt.to_period('M')
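
As a small illustration of what `dt.to_period('M')` produces and how it behaves in a later `groupby`, the sketch below uses made-up rows. Dates within the same calendar month map to the same period, and rows whose date is missing are left out of the grouped result because `groupby` drops missing keys by default.

```python
import pandas as pd

# Illustrative rows only; the last row has a missing date
sales = pd.DataFrame({
    "Date": pd.to_datetime(["2023-07-03", "2023-07-28", "2023-08-01", None]),
    "OrderTotal": [10.0, 5.0, 7.0, 100.0],
})

# Every date in the same calendar month maps to the same monthly period
sales["Month"] = sales["Date"].dt.to_period("M")

# Rows with a missing month are excluded from the groups by default
monthly = sales.groupby("Month")["OrderTotal"].sum()
print(monthly)
```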


### Aggregating monthly sales

In [None]:
monthly_sales = cleaned_transaction_data.groupby('Month')['OrderTotal'].sum()

print("Monthly Sales:")
print(monthly_sales)

### Plot monthly sales trend

In [None]:
monthly_sales.plot(kind='line', marker='o', markerfacecolor='red', title="Monthly Sales Trend",
                   xlabel="Month", ylabel="Total Sales", figsize=(10, 5))
# Add a grid and draw it behind the plotted line
plt.grid(visible=True, which='both', linestyle='-', linewidth=0.5, color='gray')
plt.gca().set_axisbelow(True)
plt.show()


The monthly sales trend chart illustrates the fluctuation in total sales over a period from July 2023 to October 2024. Key observations include:

- **Seasonal Variations**: There are visible peaks and dips in sales throughout the year, suggesting possible seasonal demand patterns.
- **High Sales Periods**: The chart shows notable spikes in sales during specific months, such as **April** and **October**, which might be tied to seasonal events, holidays, or promotional periods.
- **Low Sales Periods**: Sales tend to dip around **August** and **December**, which could indicate lower customer demand during these times or an opportunity to increase sales through targeted promotions.

Understanding these monthly trends can assist in optimizing inventory, planning seasonal promotions, and aligning marketing efforts to maximize revenue during high-demand periods. Additionally, the observed low-demand months provide opportunities to investigate potential factors and adjust strategies to drive sales.
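
To back up observations like these with numbers rather than reading them off the chart, one option is to compute the month-over-month change and locate the highest and lowest months directly. The figures below are made up for illustration and are not the project's actual totals.

```python
import pandas as pd

# Illustrative monthly totals only; not the project's actual figures
monthly_sales = pd.Series(
    [100.0, 80.0, 120.0, 90.0],
    index=pd.period_range("2023-07", periods=4, freq="M"),
    name="OrderTotal",
)

# Percentage change relative to the previous month (the first month has no predecessor)
mom_change = monthly_sales.pct_change() * 100

peak_month = monthly_sales.idxmax()
low_month = monthly_sales.idxmin()

print(f"Peak month: {peak_month}, lowest month: {low_month}")
print(mom_change.round(1))
```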


#### Top 5 Products by Revenue

In [None]:
top_products = cleaned_transaction_data.groupby('ProductName')['OrderTotal'].sum().sort_values(ascending=False).head(5)
print("\nTop 5 Products by Revenue:")
print(top_products)

#### Plot top 5 products by revenue

In [None]:

top_products.plot(kind='bar', title="Top 5 Products by Revenue", xlabel="Product", ylabel="Total Revenue", figsize=(10, 5))
plt.show()


The analysis of the top 5 revenue-generating products offers valuable insights into the company's sales dynamics and customer preferences. Products like **Thüringer Rostbratwurst** and **Côte de Blaye** clearly stand out as key revenue drivers, indicating strong market demand and potentially effective sales strategies. Meanwhile, items like **Gnocchi di nonna Alice** and **Manjimup Dried Apples** show moderate but stable sales, suggesting opportunities for targeted marketing to boost their performance further.

This data provides a foundation for strategic decisions, such as:
- **Inventory Management**: Prioritizing stock for high-demand products to avoid potential shortages.
- **Marketing Focus**: Allocating resources towards promoting top-performing products to maximize returns.
- **Product Development**: Exploring potential for enhancing or bundling moderate-performing items to increase their appeal.

By leveraging these insights, the company can optimize its approach to meet customer needs, drive revenue, and strengthen its position in the market.


### Top Customers by Spending

In [None]:
customer_spending = cleaned_transaction_data.groupby('CustID')['OrderTotal'].sum().sort_values(ascending=False).head(10)

print("\nTop 10 Customers by Spending:")
print(customer_spending)


### Plot top customers by spending

In [None]:
customer_spending.plot(kind='bar', title="Top 10 Customers by Spending", xlabel="Customer ID", ylabel="Total Spending", figsize=(10, 5))
plt.show()


The analysis of top customers by spending provides a clear view of the most valuable clients based on their total expenditures. Key observations include:

- **High-Value Customers**: The top spenders represent a significant portion of revenue, highlighting the importance of these customers to the business.
- **Customer Loyalty Opportunities**: By identifying these top spenders, the business can explore loyalty programs, targeted promotions, or exclusive offers to retain and reward these high-value clients.
- **Strategic Marketing**: Understanding the spending behavior of top customers allows for more personalized marketing strategies, which can further drive customer engagement and increase revenue.

Focusing on these top customers can help the business nurture valuable relationships, ensuring consistent revenue streams and fostering customer loyalty.
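
How large a portion of revenue the top spenders represent can be checked by dividing their combined spending by total revenue. A minimal sketch on made-up transactions, where `tx`, `spend`, `top_n`, and `top_share` are illustrative names:

```python
import pandas as pd

# Illustrative transactions only; column names follow the notebook
tx = pd.DataFrame({
    "CustID": [1, 1, 2, 3, 3, 4],
    "OrderTotal": [50.0, 50.0, 30.0, 10.0, 10.0, 50.0],
})

# Total spending per customer, highest first
spend = tx.groupby("CustID")["OrderTotal"].sum().sort_values(ascending=False)

top_n = 2  # the notebook uses 10; a smaller value suits this tiny sample
top_share = spend.head(top_n).sum() / spend.sum()

print(f"Top {top_n} customers account for {top_share:.1%} of revenue")
```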


### Top-Selling Products by Month

In [None]:
# Ensure 'Date' is in datetime format and create a 'Month' column
transaction_data['Date'] = pd.to_datetime(transaction_data['Date'], errors='coerce')
transaction_data['Month'] = transaction_data['Date'].dt.to_period('M')

# Group by 'Month' and 'ProductName', summing 'OrderTotal' to get monthly sales per product
monthly_product_sales = transaction_data.groupby(['Month', 'ProductName'])['OrderTotal'].sum().reset_index()

# Find the product with the highest sales for each month
peak_sales_summary = monthly_product_sales.loc[monthly_product_sales.groupby('Month')['OrderTotal'].idxmax()]

# Display the peak product for each month
print("Peak Sales Product for Each Month:")
print(peak_sales_summary)
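
The `groupby(...).idxmax()` step returns, for each month, the row label of the largest `OrderTotal`, and `.loc` then selects those rows. The sketch below shows this on made-up data, alongside an alternative (sort, then keep the first row per month) that gives the same result when there are no ties. The product names are placeholders.

```python
import pandas as pd

# Illustrative data only; product names are placeholders
monthly_product_sales = pd.DataFrame({
    "Month": pd.PeriodIndex(["2023-07", "2023-07", "2023-08", "2023-08"], freq="M"),
    "ProductName": ["Product A", "Product B", "Product A", "Product B"],
    "OrderTotal": [40.0, 25.0, 10.0, 35.0],
})

# Row label of the best-selling product within each month
peak_rows = monthly_product_sales.groupby("Month")["OrderTotal"].idxmax()
peak = monthly_product_sales.loc[peak_rows]

# Alternative: sort by sales, keep the first row per month, restore month order
alt = (
    monthly_product_sales
    .sort_values("OrderTotal", ascending=False)
    .drop_duplicates("Month")
    .sort_values("Month")
)

print(peak)
```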


### Plotting the Top-Selling Product by Month

In [None]:
plt.figure(figsize=(12, 6))
plt.bar(peak_sales_summary['Month'].astype(str), peak_sales_summary['OrderTotal'], color='skyblue')
plt.xticks(rotation=45)
plt.xlabel('Month')
plt.ylabel('Total Sales')
plt.title('Peak Sales Product for Each Month')
plt.tight_layout()

# Add each product name as a label on top of its bar
for _, row in peak_sales_summary.iterrows():
    plt.text(row['Month'].strftime('%Y-%m'), row['OrderTotal'], row['ProductName'], ha='center', va='bottom', rotation=90)

plt.show()

This bar chart illustrates the top-selling product for each month over a specified period. Key insights from this chart include:

- **Monthly Variation in Top Products**: The peak product varies by month, indicating changing customer preferences or seasonal demand. For instance, **Côte de Blaye** leads in October 2023, whereas **Thüringer Rostbratwurst** has peak sales in several other months, such as July 2023 and August 2024.
- **High-Performing Products**: Certain products, such as **Thüringer Rostbratwurst** and **Côte de Blaye**, appear multiple times as the highest-selling items in different months. This consistency highlights them as reliable revenue generators.
- **Opportunities for Promotion**: Products whose peak-month totals are comparatively low, like **Rössle Sauerkraut** in November 2024, might benefit from targeted promotions or marketing efforts during other months to increase visibility and sales.

This analysis provides valuable insights for inventory planning, marketing strategies, and understanding customer buying trends throughout the year.


### Product with the Lowest Sales for Each Month


In [None]:
lowest_sales_summary = monthly_product_sales.loc[monthly_product_sales.groupby('Month')['OrderTotal'].idxmin()]

print("Lowest Sales Product for Each Month:")
print(lowest_sales_summary)

In [None]:
plt.figure(figsize=(12, 6))
plt.bar(lowest_sales_summary['Month'].astype(str), lowest_sales_summary['OrderTotal'], color='salmon')
plt.xticks(rotation=45)
plt.xlabel('Month')
plt.ylabel('Total Sales')
plt.title('Lowest Sales Product for Each Month')
plt.tight_layout()

# Add each product name as a label on top of its bar
for _, row in lowest_sales_summary.iterrows():
    plt.text(row['Month'].strftime('%Y-%m'), row['OrderTotal'], row['ProductName'], ha='center', va='bottom', rotation=90)

plt.show()

This analysis reveals the products with the lowest sales for each month. By identifying the least popular products, we gain insights into potential areas for improvement or promotional opportunities. For example:

- **Inventory Management**: Products with consistently low sales might be taking up valuable storage space and could be reduced in stock or even discontinued.
- **Targeted Promotions**: These low-selling products could benefit from targeted marketing campaigns or discounts to boost their visibility and demand.
- **Seasonal Adjustments**: Some products may only underperform during specific months, suggesting a potential for seasonal marketing adjustments.

Understanding these low-performing products on a monthly basis enables the business to make data-driven decisions, helping optimize inventory, reduce costs, and maximize profit potential.
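
To tell products that are consistently at the bottom from those that underperform only in a particular month, one simple check is to count how many months each product appears as the lowest seller. The sketch below uses a made-up stand-in for `lowest_sales_summary` with placeholder product names.

```python
import pandas as pd

# Illustrative stand-in for lowest_sales_summary; product names are placeholders
lowest_sales_summary = pd.DataFrame({
    "Month": pd.PeriodIndex(["2023-07", "2023-08", "2023-09"], freq="M"),
    "ProductName": ["Product X", "Product X", "Product Y"],
    "OrderTotal": [2.5, 5.0, 6.0],
})

# Number of months in which each product had the lowest sales
months_at_bottom = lowest_sales_summary["ProductName"].value_counts()
print(months_at_bottom)
```

A product with a high count is a candidate for the inventory review described above, while a count of one points to a seasonal or one-off dip.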


# Project Conclusion

This project provided a comprehensive analysis of customer and transaction data, focusing on data cleaning, transformation, and insightful visualizations to understand sales trends, top-performing products, and customer spending behaviors.

Key findings include:
- **Data Cleaning**: Ensuring data consistency by handling missing values, duplicates, and formatting issues improved the reliability of our analysis.
- **Sales Trends**: The monthly sales trend analysis revealed seasonal fluctuations, with peaks in specific months such as April and October. This insight can inform inventory and marketing strategies to meet seasonal demand.
- **Top Products and Customer Spending**: Identifying the top 5 products by revenue and the top 10 customers by spending provided insights into key revenue drivers. This allows for targeted marketing to high-spending customers and strategic focus on popular products.
- **Peak Sales by Month**: Analyzing peak sales products for each month highlighted customer preferences over time, which can guide product stocking and promotional planning.

Overall, this project underscores the importance of data-driven decision-making in optimizing business strategies. By understanding what drives sales and customer engagement, the organization can make more informed decisions, enhance customer satisfaction, and maximize revenue potential.
