# Assignment 2: Basic Statistics

## Descriptive Analytics and Data Preprocessing on Sales & Discounts Dataset

**Topics Covered:**
- Mean, Median, Mode, Standard Deviation
- Histograms and Boxplots
- Bar Charts for categorical data
- Standardization (Z-score normalization)
- One-Hot Encoding

---
## Step 1: Import Libraries and Load Data

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv('sales_data_with_discounts.csv')

# Display first few rows
print("Dataset loaded successfully!")
print("Shape of dataset:", df.shape)
print("\nFirst 5 rows:")
df.head()

In [None]:
# Check column names and data types
print("Column Names and Data Types:")
print(df.dtypes)

---
## Step 2: Descriptive Statistics

Calculate **Mean**, **Median**, **Mode**, and **Standard Deviation** for numerical columns.

- **Mean**: Average value
- **Median**: Middle value when sorted
- **Mode**: Most frequent value
- **Standard Deviation**: How spread out the values are
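
To make these definitions concrete, here is a minimal sketch that computes all four measures on a small made-up list with Python's standard `statistics` module. The values are hypothetical and are not taken from the sales dataset. `statistics.stdev` uses the sample formula (dividing by n - 1), which is also the default for pandas' `.std()`.

```python
import statistics

# Hypothetical toy values (not from the sales dataset); 120 is a deliberate extreme value
values = [10, 20, 20, 30, 120]

mean_value = statistics.mean(values)      # sum of values / number of values
median_value = statistics.median(values)  # middle value after sorting
mode_value = statistics.mode(values)      # most frequent value
std_value = statistics.stdev(values)      # sample standard deviation (divides by n - 1)

print("Mean:", mean_value)
print("Median:", median_value)
print("Mode:", mode_value)
print("Standard Deviation:", round(std_value, 2))
```

The single large value pulls the mean (40) well above the median (20). A gap like that is a quick numeric hint that the data is right-skewed.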

In [None]:
# List the numerical columns to analyze
numerical_columns = ['Volume', 'Avg Price', 'Total Sales Value', 'Discount Rate (%)', 'Discount Amount', 'Net Sales Value']

print("=== Descriptive Statistics ===")
print("=" * 50)

# Calculate statistics for each numerical column
for column in numerical_columns:
    print("\nColumn:", column)
    print("-" * 30)
    
    # Calculate mean
    mean_value = df[column].mean()
    print("Mean:", round(mean_value, 2))
    
    # Calculate median
    median_value = df[column].median()
    print("Median:", round(median_value, 2))
    
    # Calculate mode (mode() can return several tied values; [0] takes the smallest)
    mode_value = df[column].mode()[0]
    print("Mode:", round(mode_value, 2))
    
    # Calculate standard deviation
    std_value = df[column].std()
    print("Standard Deviation:", round(std_value, 2))

---
## Step 3: Histograms

Histograms show the distribution of numerical data. We can see:
- If data is skewed (left or right)
- If there are any outliers
- The most common range of values
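
A histogram is a count of values per equal-width bin. The sketch below reproduces that counting step by hand on hypothetical values. `histogram_counts` is an illustrative helper, not a matplotlib function, and it assumes the values are not all identical.

```python
def histogram_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width intervals.

    Assumes max(values) > min(values), so the bin width is non-zero.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value belongs to the last bin rather than a new one
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return counts

# Hypothetical toy values with one large value on the right
values = [1, 2, 2, 3, 10]
counts = histogram_counts(values, bins=3)
print(counts)  # [4, 0, 1]
```

Four of the five values land in the first bin and one sits far to the right, which is the shape described as right skew.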

In [None]:
# Create histograms for numerical columns
print("Creating Histograms...")

# Create a figure with subplots
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

# Flatten axes for easier iteration
axes = axes.flatten()

# Plot histogram for each column (missing values are dropped before binning)
for i, column in enumerate(numerical_columns):
    axes[i].hist(df[column].dropna(), bins=20, color='skyblue', edgecolor='black')
    axes[i].set_title('Distribution of ' + column)
    axes[i].set_xlabel(column)
    axes[i].set_ylabel('Frequency')

plt.tight_layout()
plt.savefig('histograms.png')
plt.show()

print("\nHistograms saved as 'histograms.png'")

**Interpretation:**
- Most distributions show right skewness (long tail on the right)
- Volume has most values concentrated in the lower range
- Prices and Sales values vary widely

---
## Step 4: Boxplots

Boxplots help identify:
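- **Outliers**: Points beyond the whiskers, which by default reach at most 1.5 × IQR past the box edges
- **Interquartile Range (IQR)**: The box spans the middle 50% of the data, from the first quartile (Q1) to the third quartile (Q3)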
- **Median**: The line inside the box
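
The outlier rule behind a boxplot can be sketched by hand. The example below assumes the common 1.5 × IQR rule (matplotlib's default `whis=1.5`) and uses hypothetical values. `statistics.quantiles` with `method='inclusive'` interpolates linearly between data points.

```python
import statistics

# Hypothetical toy values (not from the sales dataset)
values = [10, 12, 14, 15, 16, 18, 20, 95]

# Quartiles with linear interpolation between data points
q1, q2, q3 = statistics.quantiles(values, n=4, method='inclusive')
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box edges are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print("Q1:", q1, "Median:", q2, "Q3:", q3)
print("IQR:", iqr)
print("Fences:", lower_fence, upper_fence)
print("Outliers:", outliers)
```

Here the box runs from 13.5 to 18.5, the fences sit at 6.0 and 26.0, and only the value 95 falls outside them, so it would be drawn as a separate dot.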

In [None]:
# Create boxplots for numerical columns
print("Creating Boxplots...")

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

for i, column in enumerate(numerical_columns):
    axes[i].boxplot(df[column].dropna())
    axes[i].set_title('Boxplot of ' + column)
    axes[i].set_ylabel(column)

plt.tight_layout()
plt.savefig('boxplots.png')
plt.show()

print("\nBoxplots saved as 'boxplots.png'")

**Interpretation:**
- Several columns show outliers (dots above/below whiskers)
- High-value items like mobile phones create outliers in price columns

---
## Step 5: Bar Charts for Categorical Data

Bar charts show the count/frequency of each category.
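
The height of each bar is the number of rows in that category. As a minimal sketch of that counting step, the snippet below uses `collections.Counter` on hypothetical values.

```python
from collections import Counter

# Hypothetical toy values for a 'Day' column
days = ['Mon', 'Tue', 'Mon', 'Wed', 'Mon', 'Tue']

# Count occurrences of each category, most frequent first
counts = Counter(days).most_common()
for category, count in counts:
    print(category, count)
```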

In [None]:
# Identify categorical columns
categorical_columns = ['Day', 'City', 'BU', 'Brand']

print("Creating Bar Charts...")

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

for i, column in enumerate(categorical_columns):
    
    # Count frequency of each category
    value_counts = df[column].value_counts()
    
    # Create bar chart
    axes[i].bar(value_counts.index, value_counts.values, color='lightgreen', edgecolor='black')
    axes[i].set_title('Distribution of ' + column)
    axes[i].set_xlabel(column)
    axes[i].set_ylabel('Count')
    axes[i].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.savefig('bar_charts.png')
plt.show()

print("\nBar charts saved as 'bar_charts.png'")

**Interpretation:**
- Sales data is fairly evenly distributed across days of the week
- Different cities and brands have varying sales volumes

---
## Step 6: Standardization (Z-Score Normalization)

**What is Standardization?**

Standardization transforms data so that it has:
- Mean = 0
- Standard Deviation = 1

**Formula:** z = (x - mean) / standard_deviation

**Why Standardize?**
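- Puts columns measured in different units and ranges on a common scale, so they can be compared
- Expected by many machine learning algorithms that are sensitive to feature scale (for example, k-nearest neighbors and gradient-based models)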
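
As a worked example of the formula, the sketch below standardizes a small hypothetical list with the standard library and then checks the result. It uses the sample standard deviation (n - 1), which is an assumption chosen to match the pandas `.std()` default. Some libraries divide by n instead and give slightly different z-scores.

```python
import statistics

# Hypothetical toy values (not from the sales dataset)
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(values)
std = statistics.stdev(values)  # sample standard deviation (n - 1), like pandas .std()

# Apply z = (x - mean) / std to every value
z_scores = [(x - mean) / std for x in values]

print("z-scores:", [round(z, 2) for z in z_scores])
print("Mean after:", round(statistics.mean(z_scores), 4))
print("Std after:", round(statistics.stdev(z_scores), 4))
```

Values below the original mean get negative z-scores and values above it get positive ones. The magnitude tells you how many standard deviations away each value lies.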

In [None]:
# Standardization
print("=== Standardization ===")
print("\nBefore Standardization:")
print(df[numerical_columns].head())

# Create a copy of dataframe for standardized values
df_standardized = df.copy()

# Apply standardization to each numerical column
for column in numerical_columns:
    # Calculate mean
    mean = df[column].mean()
    
    # Calculate standard deviation (pandas .std() uses the sample formula, ddof=1)
    std = df[column].std()
    
    # Apply z-score formula: (x - mean) / std
    df_standardized[column] = (df[column] - mean) / std

print("\nAfter Standardization:")
print(df_standardized[numerical_columns].head())

# Verify: mean should be ~0, std should be ~1
print("\nVerification (mean should be close to 0, std close to 1):")
for column in numerical_columns:
    print(column + ":",
          "mean =", round(df_standardized[column].mean(), 4),
          "| std =", round(df_standardized[column].std(), 4))

---
## Step 7: One-Hot Encoding

**What is One-Hot Encoding?**

Converts categorical variables (text) into numerical format using 0s and 1s.

Example:
- Original: City = [A, B, C]
- Encoded: City_A = [1,0,0], City_B = [0,1,0], City_C = [0,0,1]

**Why One-Hot Encode?**
- Most machine learning algorithms need numerical input and cannot work with text labels directly
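
The City example above can be reproduced by hand. The sketch below is a minimal pure-Python illustration on hypothetical values. `one_hot` is an illustrative helper, not a pandas function.

```python
def one_hot(values, prefix):
    """Return a dict mapping 'prefix_category' to a list of 0/1 flags."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{category}": [1 if v == category else 0 for v in values]
        for category in categories
    }

# Hypothetical toy values for a 'City' column
cities = ['A', 'B', 'C', 'A']
encoded = one_hot(cities, prefix='City')
for name, flags in encoded.items():
    print(name, flags)
```

Each row has exactly one 1 across the new columns, marking the category that row belongs to.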

In [None]:
# One-Hot Encoding
print("=== One-Hot Encoding ===")

print("\nOriginal dataset shape:", df.shape)
print("Categorical columns:", categorical_columns)

# Apply one-hot encoding using pandas get_dummies
# dtype=int yields 0/1 columns; recent pandas versions otherwise return True/False
df_encoded = pd.get_dummies(df, columns=categorical_columns, dtype=int)

print("\nAfter One-Hot Encoding:")
print("New dataset shape:", df_encoded.shape)

print("\nNew columns created:")
new_columns = [col for col in df_encoded.columns if col not in df.columns]
for col in new_columns[:10]:  # Show first 10 new columns
    print(" -", col)
# Only report a remainder when more than 10 columns were created
if len(new_columns) > 10:
    print("... and", len(new_columns) - 10, "more columns")

print("\nSample of encoded data:")
df_encoded.head()

---
## Summary

In this assignment, we learned:

1. **Descriptive Statistics** - Calculating mean, median, mode, and standard deviation
2. **Histograms** - Visualizing data distributions
3. **Boxplots** - Identifying outliers and IQR
4. **Bar Charts** - Analyzing categorical variables
5. **Standardization** - Z-score normalization for numerical data
6. **One-Hot Encoding** - Converting categorical data to numerical format