## Load and Analyze Data for Descriptive Statistics

This section loads the dataset and computes basic descriptive statistics. The objectives are:

- Load the dataset into a pandas DataFrame.
- Identify the numerical columns in the DataFrame.
- Calculate and display basic statistics (mean, median, standard deviation, and mode) for the numerical columns.

In [None]:
# Load necessary packages
import pandas as pd

# Load data
data = pd.read_csv('datasets/sales_data_with_discounts.csv')

# Display first & last few rows of data
data

In [None]:
# Select the numerical columns
numerical_columns = data.select_dtypes(include=['float64', 'int64']).columns
# describe() reports count, mean, std, min, quartiles (the 50% row is the median), and max.
# It does not report the mode.
data[numerical_columns].describe()
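
The objectives list the mode, which `describe()` does not report. The sketch below shows one way to gather mean, median, standard deviation, and mode into a single table. The `toy` frame and its values are hypothetical stand-ins for the sales data, not taken from it.

```python
import pandas as pd

# Hypothetical toy data; the column names only mimic the sales dataset.
toy = pd.DataFrame({
    "Volume": [3, 5, 5, 7, 10],
    "Avg Price": [100.0, 200.0, 200.0, 400.0, 600.0],
    "Brand": ["A", "B", "A", "C", "B"],
})

# Keep numeric columns only; "Brand" is excluded.
num_cols = toy.select_dtypes(include="number").columns

# One row per column, one column per statistic.
summary = pd.DataFrame({
    "mean": toy[num_cols].mean(),
    "median": toy[num_cols].median(),
    "std": toy[num_cols].std(),            # sample standard deviation (ddof=1)
    "mode": toy[num_cols].mode().iloc[0],  # first mode if several values tie
})
print(summary)
```

`DataFrame.mode()` can return several rows when values tie, so `.iloc[0]` keeps only the first mode per column. Inspect the full result if ties matter.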


## Data Visualization

This section analyzes the dataset visually with three types of plots:

- Histograms: show the frequency distribution of numerical data.
- Boxplots: show variability and help detect outliers.
- Bar charts: show the count of each category in categorical columns.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Plot histograms for numerical columns
data[numerical_columns].hist(bins=20, figsize=(15, 10))
plt.suptitle('Histogram of Numerical Columns')
plt.show()
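
To make the `bins` argument concrete, the sketch below counts values per bin with `np.histogram`, which applies the same equal-width binning that a histogram plot draws. The values are hypothetical, and 3 bins are used instead of 20 so the counts are easy to check by hand.

```python
import numpy as np

# Hypothetical values with one large observation at the right end.
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 10])

# Split the range [1, 10] into 3 equal-width bins and count values in each.
counts, edges = np.histogram(values, bins=3)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"[{left:.0f}, {right:.0f}): {count}")
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both.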


In [None]:
# Plot boxplots for numerical columns to observe variability and outliers
plt.figure(figsize=(15, 10))
sns.boxplot(data=data[numerical_columns], orient='h')
plt.title('Boxplot of Numerical Columns')
plt.show()
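
A boxplot with default settings draws points farther than 1.5 times the interquartile range (IQR) from the quartiles as outliers. The sketch below applies the same rule numerically to a hypothetical series, which can help when the plot is crowded. The data and variable names are illustrative only.

```python
import pandas as pd

# Hypothetical values; 95 is far from the rest.
values = pd.Series([10, 12, 11, 13, 12, 11, 95], name="Volume")

# Quartiles and interquartile range.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers.
outliers = values[(values < lower) | (values > upper)]
print(f"fences: [{lower}, {upper}]")
print(outliers.tolist())
```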

In [None]:
# Bar chart for categorical columns
categorical_columns = data.select_dtypes(include=['object']).columns

for col in categorical_columns:
    data[col].value_counts().plot(kind='bar', figsize=(10, 5))
    plt.title(f'Bar Chart for {col}')
    plt.show()

## Data Standardization

Standardization rescales each feature to zero mean and unit standard deviation, so that features measured on different scales become comparable. Here, we perform:

- Z-score normalization using StandardScaler.
- Manual computation of Z-scores to compare results.


In [None]:
from sklearn.preprocessing import StandardScaler

# Using StandardScaler
scaler = StandardScaler()
standardized_data = scaler.fit_transform(data[numerical_columns])
standardized_df = pd.DataFrame(standardized_data, columns=numerical_columns)

standardized_df

In [None]:
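# Manual Z-score normalization
# pandas .std() defaults to the sample standard deviation (ddof=1), while
# StandardScaler divides by the population standard deviation (ddof=0).
# Pass ddof=0 so the manual result matches the StandardScaler output.
z_score_normalized = (data[numerical_columns] - data[numerical_columns].mean()) / data[numerical_columns].std(ddof=0)

z_score_normalized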
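
When comparing a manual Z-score with `StandardScaler`, the choice of standard deviation matters: `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas `.std()` defaults to the sample standard deviation (`ddof=1`). The sketch below shows the size of that difference on a hypothetical column, using only pandas and NumPy. The population-based result stands in for what `StandardScaler` would return.

```python
import numpy as np
import pandas as pd

# Hypothetical column with mean 5 and population standard deviation 2.
toy = pd.DataFrame({"Volume": [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]})

# Population-based Z-scores (the convention StandardScaler follows).
z_population = (toy - toy.mean()) / toy.std(ddof=0)

# Sample-based Z-scores (pandas default, ddof=1).
z_sample = (toy - toy.mean()) / toy.std()

# The two differ by a constant factor of sqrt(n / (n - 1)).
n = len(toy)
scale = np.sqrt(n / (n - 1))
print(z_population["Volume"].tolist())
print(np.allclose(z_population, z_sample * scale))
```

For a dataset with hundreds of rows the factor is close to 1, so the two results look similar but do not match exactly.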


## Encoding Categorical Variables

Many machine learning models cannot use categorical variables directly. One-hot encoding converts each category into its own binary indicator column.

Steps:
1. Perform one-hot encoding for all categorical columns using pandas.
2. Observe the expanded shape and newly created columns.

In [22]:
# Perform one-hot encoding
encoded_data = pd.get_dummies(data, columns=categorical_columns)

# Display encoded data shape and new columns
encoded_data.shape, encoded_data.columns


((450, 101),
 Index(['Volume', 'Avg Price', 'Total Sales Value', 'Discount Rate (%)',
        'Discount Amount', 'Net Sales Value', 'Date_01-04-2021',
        'Date_02-04-2021', 'Date_03-04-2021', 'Date_04-04-2021',
        ...
        'Model_Vedic Cream', 'Model_Vedic Oil', 'Model_Vedic Shampoo',
        'Model_W-Casuals', 'Model_W-Inners', 'Model_W-Lounge',
        'Model_W-Western', 'Model_YM-98 ', 'Model_YM-99', 'Model_YM-99 Plus'],
       dtype='object', length=101))
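
The column count grows because every distinct value in each encoded column becomes a column of its own. A minimal sketch on a hypothetical three-row frame makes the expansion easy to follow. The `toy` data is illustrative, and `dtype=int` is an optional choice that yields 0/1 instead of True/False.

```python
import pandas as pd

# Hypothetical frame: one numeric column and two categorical columns.
toy = pd.DataFrame({
    "Volume": [15, 10, 7],
    "Brand": ["A", "B", "A"],
    "Model": ["X1", "Y2", "Z3"],
})

# One indicator column per distinct value: 2 for Brand, 3 for Model.
encoded = pd.get_dummies(toy, columns=["Brand", "Model"], dtype=int)

print(toy.shape, "->", encoded.shape)
print(list(encoded.columns))
```

The numeric column is carried over unchanged, and the original categorical columns are replaced by their indicator columns.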

In [23]:
encoded_data.head()

,Volume,Avg Price,Total Sales Value,Discount Rate (%),Discount Amount,Net Sales Value,Date_01-04-2021,Date_02-04-2021,Date_03-04-2021,Date_04-04-2021,...,Model_Vedic Cream,Model_Vedic Oil,Model_Vedic Shampoo,Model_W-Casuals,Model_W-Inners,Model_W-Lounge,Model_W-Western,Model_YM-98,Model_YM-99,Model_YM-99 Plus
0,15,12100,181500,11.65482,21153.49882,160346.50118,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,10,10100,101000,11.560498,11676.102961,89323.897039,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,7,16100,112700,9.456886,10657.910157,102042.089843,True,False,False,False,...,False,False,False,False,False,False,False,False,True,False
3,6,20100,120600,6.935385,8364.074702,112235.925298,True,False,False,False,...,False,False,False,False,False,False,False,False,False,True
4,3,8100,24300,17.995663,4372.94623,19927.05377,True,False,False,False,...,False,False,False,False,False,False,False,True,False,False
