# Exploratory Data Analysis (EDA) for Sales Dataset

This notebook performs exploratory data analysis (EDA) on the Sales dataset, following the assignment requirements. The steps cover data loading, feature typing, missing-value checks, descriptive statistics, univariate and bivariate plots, class-imbalance analysis, potential modelling challenges, and basic cleaning.

## 1. Import Required Libraries
Import pandas, NumPy, Matplotlib, seaborn, and the other libraries needed for the EDA.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
%matplotlib inline

## 2. Load the Dataset
Read the Sales.csv file into a pandas DataFrame and display the first few rows.

In [None]:
# Load the Sales dataset
df = pd.read_csv('Sales.csv')
df.head()

## 3. Initial Data Overview
Show the shape of the dataset, column names, and use info() to display data types and non-null counts.

In [None]:
# Shape and columns
print('Shape:', df.shape)
print('Columns:', df.columns.tolist())
df.info()

## 4. Identify Feature Types
Categorize columns as numerical, categorical, or datetime. Print lists of each type.

In [None]:
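# Identify feature types
# Convert date columns first so they are not picked up as object (categorical) columns below
datetime_cols = ['order_date']
df[datetime_cols] = df[datetime_cols].apply(pd.to_datetime)
numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()
# difference() also excludes any datetime column that is still stored as strings
categorical_cols = df.select_dtypes(include=['object']).columns.difference(datetime_cols).tolist()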
print('Numerical columns:', numerical_cols)
print('Categorical columns:', categorical_cols)
print('Datetime columns:', datetime_cols)
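
The conversion above raises an error if any date string cannot be parsed. As a hedged alternative, the sketch below coerces unparseable entries to `NaT` and reports how many were affected. The helper name `parse_dates_safely` and the toy data are hypothetical and not part of the assignment dataset.

```python
import pandas as pd

def parse_dates_safely(frame, cols):
    """Convert columns to datetime; unparseable entries become NaT and are counted."""
    parsed = frame.copy()
    failures = {}
    for col in cols:
        converted = pd.to_datetime(parsed[col], errors='coerce')
        # Count values that were present before parsing but turned into NaT
        failures[col] = int((converted.isna() & parsed[col].notna()).sum())
        parsed[col] = converted
    return parsed, failures

# Hypothetical toy data: one impossible date and one genuinely missing value
toy_dates = pd.DataFrame({'order_date': ['2023-01-15', '2023-02-30', None, '2023-03-01']})
parsed_dates, date_failures = parse_dates_safely(toy_dates, ['order_date'])
print(date_failures)
print(parsed_dates.dtypes)
```

In the notebook this could be called as `parse_dates_safely(df, datetime_cols)`; a non-zero failure count would suggest inspecting the raw strings before continuing.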

## 5. Check for Missing Values
Count the missing values in each column and visualize where they occur with a heatmap.

In [None]:
# Check for missing values
missing = df.isnull().sum()
print(missing[missing > 0])

# Visualize missing values
plt.figure(figsize=(10, 4))
sns.heatmap(df.isnull(), cbar=False, yticklabels=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()
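
A heatmap shows where gaps occur but not how large they are. The sketch below tabulates the count and percentage of missing values per column. The helper name `missing_summary` and the toy frame are hypothetical; in the notebook the function would be applied to `df`.

```python
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        'missing_count': counts,
        'missing_pct': (counts / len(frame) * 100).round(2),
    })
    # Keep only the columns that actually contain gaps
    return summary[summary['missing_count'] > 0].sort_values('missing_count', ascending=False)

# Hypothetical toy frame standing in for the Sales data
toy_missing = pd.DataFrame({
    'revenue_usd': [100.0, None, 250.0, None],
    'payment_method': ['card', 'cash', None, 'card'],
    'sales_channel': ['online', 'store', 'online', 'store'],
})
summary = missing_summary(toy_missing)
print(summary)
```

If a bar plot is preferred, `summary['missing_pct'].plot.barh()` would draw one from the same table.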

## 6. Descriptive Statistics
Display summary statistics for numerical and categorical features using describe() and value_counts().

In [None]:
# Summary statistics for numerical features
print(df.describe())

# Summary statistics for categorical features
for col in categorical_cols:
    print(f'\nValue counts for {col}:')
    print(df[col].value_counts())

## 7. Univariate Analysis (Plots)
Plot histograms and boxplots for the numerical features and countplots for the categorical features.

In [None]:
# Histograms for numerical features
for col in numerical_cols:
    plt.figure(figsize=(6, 3))
    sns.histplot(df[col], kde=True)
    plt.title(f'Histogram of {col}')
    plt.show()

# Boxplots for numerical features
for col in numerical_cols:
    plt.figure(figsize=(6, 3))
    sns.boxplot(x=df[col])
    plt.title(f'Boxplot of {col}')
    plt.show()

# Countplots for categorical features
for col in categorical_cols:
    plt.figure(figsize=(8, 3))
    sns.countplot(y=df[col], order=df[col].value_counts().index)
    plt.title(f'Countplot of {col}')
    plt.show()

## 8. Bivariate Analysis (Plots)
Create scatterplots, barplots, or boxplots to explore relationships between features, especially with the target variable (e.g., revenue_usd).

In [None]:
# Scatterplot: base_price_usd vs revenue_usd
plt.figure(figsize=(6, 4))
sns.scatterplot(x='base_price_usd', y='revenue_usd', data=df)
plt.title('Base Price vs Revenue')
plt.show()

# Boxplot: customer_income_level vs revenue_usd
plt.figure(figsize=(6, 4))
sns.boxplot(x='customer_income_level', y='revenue_usd', data=df)
plt.title('Customer Income Level vs Revenue')
plt.show()

# Barplot: payment_method vs average revenue
plt.figure(figsize=(8, 4))
sns.barplot(x='payment_method', y='revenue_usd', data=df, estimator=np.mean)
plt.title('Average Revenue by Payment Method')
plt.show()
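
The bar plot shows only the group means. A grouped summary table gives the exact figures together with group sizes and medians, which helps judge whether a difference in means rests on enough observations. This is a minimal sketch: the helper name `revenue_by_group` and the toy values are hypothetical.

```python
import pandas as pd

def revenue_by_group(frame, group_col, value_col='revenue_usd'):
    """Count, mean, and median of value_col per category, highest mean first."""
    return (
        frame.groupby(group_col)[value_col]
        .agg(['count', 'mean', 'median'])
        .sort_values('mean', ascending=False)
    )

# Hypothetical toy data; in the notebook, pass df instead
toy_sales = pd.DataFrame({
    'payment_method': ['card', 'card', 'cash', 'cash', 'cash'],
    'revenue_usd': [100.0, 300.0, 50.0, 70.0, 90.0],
})
group_table = revenue_by_group(toy_sales, 'payment_method')
print(group_table)
```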

## 9. Check for Class Imbalance
Analyze the distribution of key categorical variables (e.g., payment_method, sales_channel, customer_income_level) and plot their frequencies.

In [None]:
# Class imbalance for key categorical variables
for col in ['payment_method', 'sales_channel', 'customer_income_level']:
    plt.figure(figsize=(6, 3))
    sns.countplot(x=df[col], order=df[col].value_counts().index)
    plt.title(f'Distribution of {col}')
    plt.show()
    print(df[col].value_counts(normalize=True))
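
To put a number on the imbalance seen in the plots, one option is the ratio of the most frequent to the least frequent class. The sketch below assumes the column has at least one non-null value. The helper name `imbalance_report` and the toy series are hypothetical.

```python
import pandas as pd

def imbalance_report(series: pd.Series) -> dict:
    """Summarise how unevenly the categories of a column are distributed."""
    counts = series.value_counts()  # sorted from most to least frequent
    total = counts.sum()
    return {
        'n_classes': int(counts.size),
        'majority_class': counts.index[0],
        'majority_share': float(counts.iloc[0] / total),
        'minority_share': float(counts.iloc[-1] / total),
        # 1.0 means perfectly balanced; larger values mean stronger imbalance
        'imbalance_ratio': float(counts.iloc[0] / counts.iloc[-1]),
    }

# Hypothetical toy column with a 6 : 3 : 1 split
toy_payments = pd.Series(['card'] * 6 + ['cash'] * 3 + ['transfer'], name='payment_method')
report = imbalance_report(toy_payments)
print(report)
```

In the notebook, `imbalance_report(df[col])` could be called inside the loop above for each of the three columns.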

## 10. Identify Potential Modelling Challenges
The checks above point to several issues that may affect modelling:

- **Skewed distributions:** Numerical features that look strongly skewed in the histograms may need a log or power transformation before modelling.
- **High cardinality:** Columns like `model_name` may have many unique values, which inflates the feature space under one-hot encoding.
- **Multicollinearity:** Features such as `base_price_usd`, `final_price_usd`, and `revenue_usd` are likely to be strongly correlated, which can destabilise coefficient estimates in linear models.
- **Outliers:** Boxplots may reveal outliers in the price and revenue columns that distort means and scale-sensitive models.
- **Missing values:** Any gaps found in Section 5 need to be imputed or dropped before modelling.
- **Class imbalance:** Some categorical variables may be dominated by one or two levels, which can bias a model towards the majority classes.
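
The points listed above can be checked numerically rather than by eye. The sketch below is one possible set of diagnostics, not a definitive procedure: the helper names, the thresholds (absolute skewness above 1, more than 20 unique values, |r| of at least 0.9, the 1.5 × IQR rule), and the toy frame are all assumptions chosen for illustration.

```python
from itertools import combinations

import numpy as np
import pandas as pd

def skewed_columns(frame, threshold=1.0):
    """Numeric columns whose absolute sample skewness exceeds the threshold."""
    skew = frame.select_dtypes(include=[np.number]).skew()
    return skew[skew.abs() > threshold].sort_values(ascending=False)

def high_cardinality(frame, cols, max_unique=20):
    """Columns among cols with more than max_unique distinct values."""
    n_unique = frame[cols].nunique()
    return n_unique[n_unique > max_unique]

def correlated_pairs(frame, threshold=0.9):
    """Pairs of numeric columns with |Pearson r| at or above the threshold."""
    corr = frame.select_dtypes(include=[np.number]).corr().abs()
    # Visit each unordered pair once
    pairs = {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}
    result = pd.Series(pairs, dtype=float)
    return result[result >= threshold].sort_values(ascending=False)

def iqr_outlier_counts(frame, k=1.5):
    """Number of values per numeric column outside [Q1 - k*IQR, Q3 + k*IQR]."""
    num = frame.select_dtypes(include=[np.number])
    q1, q3 = num.quantile(0.25), num.quantile(0.75)
    iqr = q3 - q1
    outside = (num < q1 - k * iqr) | (num > q3 + k * iqr)
    return outside.sum()

# Hypothetical toy frame: final price is exactly 90% of base price,
# and the last revenue value is an artificial extreme
toy_model = pd.DataFrame({
    'base_price_usd': [100, 200, 300, 400, 500, 600, 700, 800],
    'final_price_usd': [90, 180, 270, 360, 450, 540, 630, 720],
    'revenue_usd': [90, 180, 270, 360, 450, 540, 630, 7200],
    'model_name': [f'model_{i}' for i in range(8)],
})
print(skewed_columns(toy_model))
print(high_cardinality(toy_model, ['model_name'], max_unique=5))
print(correlated_pairs(toy_model))
print(iqr_outlier_counts(toy_model))
```

In the notebook these would be called with `df`, `numerical_cols`, and `categorical_cols`; the thresholds are starting points and should be adjusted to the dataset.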

## 11. Brief Data Cleaning
Demonstrate basic cleaning steps such as handling missing values, correcting data types, and removing duplicates if any.

In [None]:
# Remove duplicates
print('Duplicates before:', df.duplicated().sum())
df = df.drop_duplicates()
print('Duplicates after:', df.duplicated().sum())

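# Handle missing values (example: fill with median for numerical, mode for categorical)
# Assign the result back instead of calling fillna(inplace=True) on df[col]:
# the chained in-place form may act on a temporary copy and is deprecated in recent pandas
for col in numerical_cols:
    if df[col].isnull().any():
        df[col] = df[col].fillna(df[col].median())
for col in categorical_cols:
    if df[col].isnull().any():
        df[col] = df[col].fillna(df[col].mode()[0])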

# Ensure correct data types
df[datetime_cols] = df[datetime_cols].apply(pd.to_datetime)

# Final check
df.info()