---
## Task 1:
A- Load the provided CSV dataset (sample-superstore.csv) into Python and print the first ten records with the associated column names.

B- Provide a short paragraph to describe your understanding of the dataset. (around 100 words)

### Task 1 - Part A: Load and Explore the Dataset
In this task, we’ll use the Pandas library to load the sample-superstore.csv dataset and define two reusable methods:
- `head(limit)` – Returns the top N rows of the dataset.
- `tail(limit)` – Returns the bottom N rows of the dataset.

Both methods take a `limit` parameter that specifies how many rows to return.

---

In [15]:
import pandas as pd
from pandas import DataFrame

class ExploratoryDataAnalysis:
    ## Class-level variables
    store_data_frame: DataFrame = None
    
    def __init__(self, path: str):
        self.store_data_frame = pd.read_csv(path, sep=',')

    def head(self, limit: int):
        return self.store_data_frame.head(limit)

    def tail(self, limit: int):
        return self.store_data_frame.tail(limit)

if __name__ == '__main__':
    exp_data_analysis = ExploratoryDataAnalysis('sample-superstore.csv')
    ## Print the first 10 records with their column names
    ## (to_markdown() requires the optional `tabulate` package)
    print(exp_data_analysis.head(10).to_markdown())

|    |   Row ID | Order ID       | Order Date   | Ship Date   | Ship Mode      | Customer ID   | Customer Name    | Segment     | Country       | City         | State          | Postal Code   | Region   | Product ID      | Category        | Sub-Category   | Product Name                                                                |    Sales | Quantity   |   Discount | Profit       |
|---:|---------:|:---------------|:-------------|:------------|:---------------|:--------------|:-----------------|:------------|:--------------|:-------------|:---------------|:--------------|:---------|:----------------|:----------------|:---------------|:----------------------------------------------------------------------------|---------:|:-----------|-----------:|:-------------|
|  0 |     7773 | CA-2016-108196 | 25/11/2016   | 12/02/2016  | Standard Class | CS-12505      | Cindy Stewart    | Consumer    | United States | Lancaster    | Ohio           | 43130         | Est      | TEC-MA-10000418 | T

---

### Task 1 - Part B: Understanding the Sample Superstore Dataset

The Sample Superstore dataset captures detailed retail sales data from a fictional store. It includes information about customer orders, such as:
- `Order ID`
- `Order Date`
- `Ship Mode`
- `Customer Name`
- `Segment`
- `City`
- `State`
- `Region`

Each transaction is linked to a product through:
- `Product ID`
- `Category`
- `Sub-Category`
- `Product Name`

Each transaction also carries metrics such as:
- `Sales`
- `Quantity`
- `Discount`
- `Profit`

This dataset is ideal for analysing customer purchasing behaviour, shipping performance, product profitability, and regional sales trends. It can be used in data science for performing exploratory data analysis (EDA), creating dashboards, and building predictive business models.

### Loading Data
To work with this data in Python, we use the Pandas library, which provides tools for data manipulation and analysis. `pd.read_csv()` reads the CSV file and returns a `DataFrame`: a two-dimensional labelled data structure, similar to a database table or an Excel spreadsheet, that lets us inspect, filter, sort, and transform the data. In the loaded DataFrame, each row represents one line item of an order (a single product on that order), and each column holds one attribute of that line item.
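
As a minimal sketch of what `pd.read_csv()` returns, the snippet below reads a tiny in-memory CSV instead of the real file. The three rows and their values are made up for illustration; only the column names are borrowed from the dataset.

```python
import io
import pandas as pd

# Hypothetical three-row extract; the values are illustrative, not taken from the real file.
csv_text = """Order ID,Segment,Sales,Quantity
A-1,Consumer,10.5,2
A-1,Consumer,20.0,1
B-2,Corporate,5.25,3
"""

# read_csv accepts a path or any file-like object, so StringIO stands in for the file here.
sample_frame = pd.read_csv(io.StringIO(csv_text))

print(sample_frame.shape)             # (rows, columns)
print(sample_frame.columns.tolist())  # column labels taken from the header row
print(sample_frame.dtypes)            # dtype inferred for each column
```

Checking `shape`, `columns`, and `dtypes` right after loading is a quick way to confirm that the header row and the numeric columns were parsed as expected.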

---

## Task 2:
Process the dataset's variables and conduct exploratory data analysis. Explore the dataset thoroughly, and feel free to improvise as needed. However, you must use Python for at least four of the following techniques:

### Task 2 - Part A: Descriptive statistics
In this task, we will perform Exploratory Data Analysis (EDA) on the **Sample Superstore** dataset using Python. The goal is to better understand the structure, quality, and patterns in the data before moving on to visualisation or modelling.

We will cover the following key techniques as part of our EDA:

1. **Descriptive Statistics**  
   Generate summary statistics for numerical columns (e.g., Sales, Profit, Discount).

2. **Data Type Inspection**  
   Understand the data structure and identify numerical vs. categorical features.

3. **Missing Value Analysis**  
   Detect any null or missing values that might require cleaning.

4. **Unique Value Counts and Categorical Distributions**  
   Explore unique values in categorical columns (e.g., Segment, Category, Region).

5. **Aggregated Insights**  
   Analyse performance (e.g., sales and profit) across different categories and regions.

This initial analysis will help us identify trends, detect anomalies, and prepare the dataset for deeper visualisation or machine learning tasks.
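
Technique 5 (aggregated insights) can be sketched as follows, assuming a numeric column may have been read as text because of stray values. The toy rows and the helper name `summarise_by` are hypothetical and exist only for this illustration; `pd.to_numeric(..., errors="coerce")` turns unparseable entries into `NaN`, which `sum()` then skips.

```python
import pandas as pd

# Toy rows for illustration only; the values are not taken from the real dataset.
toy_orders = pd.DataFrame({
    "Region": ["West", "West", "East", "East"],
    "Category": ["Furniture", "Technology", "Furniture", "Technology"],
    "Sales": [100.0, 250.0, 80.0, 120.0],
    "Profit": ["20.5", "bad-value", "-5.0", "30.0"],  # stored as text, one entry unparseable
})

def summarise_by(frame: pd.DataFrame, key: str) -> pd.DataFrame:
    """Hypothetical helper: total Sales and Profit for each value of `key`."""
    cleaned = frame.copy()
    # Coerce text to numbers; anything unparseable becomes NaN and is skipped by sum().
    cleaned["Profit"] = pd.to_numeric(cleaned["Profit"], errors="coerce")
    return cleaned.groupby(key)[["Sales", "Profit"]].sum()

region_summary = summarise_by(toy_orders, "Region")
print(region_summary)
```

Working on a copy keeps the original frame untouched, so the raw values remain available for inspection if the coercion hides more rows than expected.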

---

In [20]:
import pandas as pd
from pandas import DataFrame

class ColumnNotFoundError(Exception):
    def __init__(self, missing_columns):
        message = f"The following column(s) are missing: {missing_columns}"
        super().__init__(message)
        self.missing_columns = missing_columns

class ExploratoryDataAnalysis:
    ## Class-level variables
    store_data_frame: DataFrame = None
    
    def __init__(self, path: str):
        self.store_data_frame = pd.read_csv(path, sep=',')

    def inspect(self):
        print("Shape of Dataset: ", self.store_data_frame.shape)
        print("\nColumn Names: ", self.store_data_frame.columns.tolist())

    def descriptive_stats(self):
        print("\nDescriptive Statistics: ", self.store_data_frame.describe())

    def basic_info(self):
        print("\nData types and non-null counts: ")
        ## info() prints its report directly and returns None, so it is not wrapped in print()
        self.store_data_frame.info()

    def missing_value_info(self):
        print("\nMissing values in each column: ")
        print(self.store_data_frame.isnull().sum())

    def get_categorical_candidates(self, threshold=10):
        """
        Return columns that have unique values less than or equal to `threshold`.
        These are good candidates for categorical analysis.
        """
        candidate_columns = []
        
        print(f"\nColumns with ≤ {threshold} unique values (possible categorical features):\n")
        for col in self.store_data_frame.columns:
            unique_values = self.store_data_frame[col].nunique()
            if unique_values <= threshold:
                candidate_columns.append((col, unique_values))

        return [col for col, _ in candidate_columns]

    def fill_missing_values(self, categorical_columns):
        """
        This method will fill all missing values for categorical columns with UNKNOWN.
        """
        missing = [col for col in categorical_columns if col not in self.store_data_frame.columns]
        if missing:
            raise ColumnNotFoundError(missing)
            
        print("\nFilling up missing values in categorical columns with value 'UNKNOWN'")
        for col in categorical_columns:
            self.store_data_frame[col] = self.store_data_frame[col].fillna("UNKNOWN")

    def print_uniques(self, columns=None):
        ## Validate the `columns` argument rather than the global `categorical_columns`
        missing = [col for col in (columns or []) if col not in self.store_data_frame.columns]
        if missing:
            raise ColumnNotFoundError(missing)

        print("\n")
        if columns:
            for col in columns:
                print(f"Column {col} has values {self.store_data_frame[col].unique()}.")
        else:
            ## No columns requested: fall back to a preview of the first five rows
            print(self.store_data_frame.head(5))

    def replace_values(self, column, currVal, newVal):
        if column not in self.store_data_frame.columns:
            ## Pass a list so the error message matches the other call sites
            raise ColumnNotFoundError([column])

        print("\n")
        ## Count how many occurrences of the current value will be replaced
        count_before = (self.store_data_frame[column] == currVal).sum()
        ## Perform the replacement
        self.store_data_frame[column] = self.store_data_frame[column].replace(currVal, newVal)
        ## Count how many occurrences of the new value exist after the replacement
        count_after = (self.store_data_frame[column] == newVal).sum()

        print(f"Replaced {count_before} occurrence(s) of '{currVal}' with '{newVal}' in column '{column}'.")
        print(f"Total now: {count_after} instance(s) of '{newVal}' in '{column}'.")
    

if __name__ == '__main__':
    try:
        exp_data_analysis = ExploratoryDataAnalysis('sample-superstore.csv')
        ## Step 1: Load and Inspect the Dataset
        exp_data_analysis.inspect()
        ## Step 2: Data Types and Basic Info
        exp_data_analysis.basic_info()
        ## Step 3: Null / Missing Value Analysis
        exp_data_analysis.missing_value_info()
        ## Step 4: Categorical Analysis - This will return ['Ship Mode', 'Segment', 'Country', 'Category']
        categorical_columns = exp_data_analysis.get_categorical_candidates(5)
        print(categorical_columns)
        ## Step 5: Fill Missing Values for Categorical Columns
        exp_data_analysis.fill_missing_values(categorical_columns)
        ## Step 6: Print Unique Values After Filling Missing Values
        exp_data_analysis.print_uniques(categorical_columns)
        ## Step 7: Clean Data
        ## - Segment contains '%' values; replace them with UNKNOWN
        exp_data_analysis.replace_values('Segment', '%', 'UNKNOWN')
        ## - Country contains '56' values; replace them with UNKNOWN
        exp_data_analysis.replace_values('Country', '56', 'UNKNOWN')
        ## Step 8: Print Unique Values After Replacing Bad Values
        exp_data_analysis.print_uniques(categorical_columns)
        ## Step 9: Descriptive Statistics
        exp_data_analysis.descriptive_stats()
    except ColumnNotFoundError as e:    
        print(f"\nColumnNotFoundError : {e}")
    

Shape of Dataset:  (9994, 21)

Column Names:  ['Row ID', 'Order ID', 'Order Date', 'Ship Date', 'Ship Mode', 'Customer ID', 'Customer Name', 'Segment', 'Country', 'City', 'State', 'Postal Code', 'Region', 'Product ID', 'Category', 'Sub-Category', 'Product Name', 'Sales', 'Quantity', 'Discount', 'Profit']

Descriptive Statistics:              Row ID         Sales     Discount
count  9994.000000   9993.000000  9991.000000
mean   4997.500000    229.863780     0.156180
std    2885.163629    623.276019     0.206399
min       1.000000      0.444000     0.000000
25%    2499.250000     17.280000     0.000000
50%    4997.500000     54.480000     0.200000
75%    7495.750000    209.940000     0.200000
max    9994.000000  22638.480000     0.800000

Data types and non-null counts: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9994 entries, 0 to 9993
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Row ID         9994 