## M1 - Exploratory Data Analysis (EDA)

### Install required libraries

In [1]:
!pip install pandas ydata-profiling sweetviz dtale kagglehub



### Loading Fashion MNIST Dataset

In [2]:
# Requires kagglehub (installed in the first cell)
import kagglehub

# Download the latest version of the dataset; returns the local cache directory
path = kagglehub.dataset_download("zalando-research/fashionmnist")

print("Path to dataset files:", path)

Path to dataset files: C:\Users\Lenovo\.cache\kagglehub\datasets\zalando-research\fashionmnist\versions\4


In [3]:
import os

files = os.listdir(path)
print("Files in the downloaded directory:", files)

Files in the downloaded directory: ['fashion-mnist_test.csv', 'fashion-mnist_train.csv', 't10k-images-idx3-ubyte', 't10k-labels-idx1-ubyte', 'train-images-idx3-ubyte', 'train-labels-idx1-ubyte']


In [4]:
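import os

import pandas as pd

# Load the training and test sets from the downloaded CSV files.
# os.path.join builds the path with the correct separator on any OS.
train_df = pd.read_csv(os.path.join(path, 'fashion-mnist_train.csv'))
test_df = pd.read_csv(os.path.join(path, 'fashion-mnist_test.csv'))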

# Display the first few rows of the training dataset
print(train_df.head())

   label  pixel1  pixel2  pixel3  pixel4  pixel5  pixel6  pixel7  pixel8  \
0      2       0       0       0       0       0       0       0       0   
1      9       0       0       0       0       0       0       0       0   
2      6       0       0       0       0       0       0       0       5   
3      0       0       0       0       1       2       0       0       0   
4      3       0       0       0       0       0       0       0       0   

   pixel9  ...  pixel775  pixel776  pixel777  pixel778  pixel779  pixel780  \
0       0  ...         0         0         0         0         0         0   
1       0  ...         0         0         0         0         0         0   
2       0  ...         0         0         0        30        43         0   
3       0  ...         3         0         0         0         0         1   
4       0  ...         0         0         0         0         0         0   

   pixel781  pixel782  pixel783  pixel784  
0         0         0         
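Each row printed above holds a label followed by 784 pixel intensities. The following is a minimal sketch of how such a row could be turned back into a 28x28 image with a readable class name. The class names follow the standard Fashion MNIST label order; `CLASS_NAMES`, `row_to_image`, and the synthetic `demo_df` are illustrative names that are not part of the notebook, and the sketch assumes the pixel columns are stored in row-major order.

```python
import numpy as np
import pandas as pd

# Standard Fashion MNIST label order (list index = label value).
CLASS_NAMES = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]


def row_to_image(row):
    """Split one row (label + 784 pixels) into a class name and a 28x28 array."""
    label = int(row["label"])
    pixel_values = row.drop("label").to_numpy(dtype=np.uint8)
    return CLASS_NAMES[label], pixel_values.reshape(28, 28)


# Synthetic stand-in for train_df with the same column layout (assumption).
columns = ["label"] + [f"pixel{i}" for i in range(1, 785)]
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=784)
demo_df = pd.DataFrame([np.concatenate(([9], pixels))], columns=columns)

name, image = row_to_image(demo_df.iloc[0])
print(name, image.shape)
```

In the notebook itself, `row_to_image(train_df.iloc[0])` should work the same way, and the returned array could be passed to a plotting function to inspect individual samples.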

### Using ydata-profiling (formerly pandas-profiling) to Generate an EDA Report

In [5]:
from ydata_profiling import ProfileReport

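# The training set is large: 60,000 images of 28x28 pixels
# (784 pixel columns plus one label column, 785 columns in total).
# Profile a small random sample so the report can be generated on lower-end machines.
# Note: frac=0.0001 keeps only about 6 of the 60,000 rows.
sample_df = train_df.sample(frac=0.0001, random_state=42)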

# Generate the EDA report
profile = ProfileReport(sample_df, title="Fashion MNIST EDA Report", explorative=True)

# Save the report as an HTML file
profile.to_file("fashion_mnist_eda_report.html")

Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]


Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]

Export report to file:   0%|          | 0/1 [00:00<?, ?it/s]
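The random sample above with `frac=0.0001` keeps roughly 6 of the 60,000 rows, so some of the 10 classes may not appear in the report at all. One possible alternative is a stratified sample that draws the same number of rows from each class. The sketch below uses a hypothetical helper, `stratified_sample`, and a small synthetic `demo_df` in place of `train_df`; both names and the `per_class` value are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def stratified_sample(df, label_col="label", per_class=5, seed=42):
    """Draw the same number of rows from every class (hypothetical helper)."""
    return df.groupby(label_col).sample(n=per_class, random_state=seed)


# Synthetic stand-in for train_df: 10 classes, 100 rows each, 4 fake pixel columns.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(
    rng.integers(0, 256, size=(1000, 4)),
    columns=[f"pixel{i}" for i in range(1, 5)],
)
demo_df.insert(0, "label", np.repeat(np.arange(10), 100))

balanced_df = stratified_sample(demo_df, per_class=5)

# Every class should now contribute exactly per_class rows.
print(balanced_df["label"].value_counts().sort_index())
```

In the notebook, `stratified_sample(train_df, per_class=...)` could stand in for the `frac=0.0001` sample. If the report is still slow to build, passing `minimal=True` to `ProfileReport` skips the more expensive computations and may help on lower-end machines.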