# 01_EDA.ipynb
## Exploratory Data Analysis: Grape Disease Dataset

In this notebook we will:
1. Load the CSV into a Pandas DataFrame.
2. Preview the data and check dtypes and null counts.
3. Use `plot_histograms` and `plot_correlation_matrix` from `src/eda.py`.

In [19]:
# Cell 1: imports & paths
import pandas as pd
import os, sys

ROOT = os.path.abspath('..')       # adjust if needed
DATA_RAW = os.path.join(ROOT, 'data', 'raw', 'grape_disease', 'Grape_Disease_Dataset.csv')
FIG_DIR  = os.path.join(ROOT, 'reports', 'figures')

os.makedirs(FIG_DIR, exist_ok=True)
print(f"ROOT: {ROOT}")
print(f"DATA_RAW: {DATA_RAW}")
print(f"FIG_DIR: {FIG_DIR}")

sys.path.insert(0, ROOT)

from src.eda import plot_histograms, plot_correlation_matrix


ROOT: c:\PROG_ELE\igem\MLVineRiskPredictor
DATA_RAW: c:\PROG_ELE\igem\MLVineRiskPredictor\data\raw\grape_disease\Grape_Disease_Dataset.csv
FIG_DIR: c:\PROG_ELE\igem\MLVineRiskPredictor\reports\figures


In [None]:
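# Cell 2: load the CSV into 'df'
# parse_dates is an assumption, chosen to match the datetime64[ns] dtype reported in Cell 4.
# If the CSV stores a leading index column, pass index_col=0 as well.
df = pd.read_csv(DATA_RAW, parse_dates=['Date _Time'])
df.head()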



Unnamed: 0,Date _Time,Temperature,Humidity,LW
0,2023-03-20 13:00:46,26.9,22,5
1,2023-03-20 13:00:50,26.9,22,5
2,2023-03-20 13:00:53,26.9,22,6
3,2023-03-20 13:00:57,26.9,22,5
4,2023-03-20 13:01:00,26.9,22,5


In [23]:
# Cell 3: quick peek
print(df.shape)
df.head()


(10001, 4)


Unnamed: 0,Date _Time,Temperature,Humidity,LW
0,2023-03-20 13:00:46,26.9,22,5
1,2023-03-20 13:00:50,26.9,22,5
2,2023-03-20 13:00:53,26.9,22,6
3,2023-03-20 13:00:57,26.9,22,5
4,2023-03-20 13:01:00,26.9,22,5


In [24]:
# Cell 4: dtypes & null counts
print(df.dtypes)
print(df.isna().sum())


Date _Time     datetime64[ns]
Temperature           float64
Humidity                int64
LW                      int64
dtype: object
Date _Time     0
Temperature    0
Humidity       0
LW             0
dtype: int64


In [None]:
# Cell 5: histograms
plot_histograms(df, FIG_DIR)


In [26]:
# Cell 6: correlation heatmap
plot_correlation_matrix(df, os.path.join(FIG_DIR, 'correlation_matrix.png'))


### Next Steps
- Review the saved histograms in `reports/figures/` to spot:
  - Constant or near-constant columns
  - Highly skewed distributions
  - Unexpected outliers

- Inspect the correlation matrix:
  - Pairs of columns with very high absolute correlation (|ρ| > 0.9) are candidates for removal or dimensionality reduction; a strong negative correlation is as redundant as a strong positive one.
  - Columns uncorrelated with others might not add predictive power (but check against the target later).
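
The checks in the list above can also be run numerically before opening the figures. The following is a minimal sketch and is not part of `src/eda.py`: the function name `screen_features` and the cutoffs (a 95% most-frequent-value share for near-constant columns, |skew| > 1 for skewness, the 1.5 × IQR rule for outliers) are assumptions chosen for illustration. Only the 0.9 correlation threshold comes from the list. The demo frame is made up so the block runs on its own.

```python
import pandas as pd


def screen_features(df, corr_threshold=0.9):
    """Flag numeric columns worth a closer look (illustrative cutoffs only)."""
    num = df.select_dtypes(include="number")

    # Near-constant: one value accounts for more than 95% of rows (assumed cutoff).
    top_share = num.apply(lambda s: s.value_counts(normalize=True).iloc[0])
    near_constant = top_share[top_share > 0.95].index.tolist()

    # Highly skewed: |sample skewness| > 1 (a common rule of thumb).
    skew = num.skew()
    skewed = skew[skew.abs() > 1].index.tolist()

    # Outliers: values beyond 1.5 * IQR from the quartiles, counted per column.
    q1, q3 = num.quantile(0.25), num.quantile(0.75)
    iqr = q3 - q1
    outside = (num.lt(q1 - 1.5 * iqr, axis="columns")
               | num.gt(q3 + 1.5 * iqr, axis="columns"))
    outlier_counts = {col: int(n) for col, n in outside.sum().items()}

    # Highly correlated pairs: absolute Pearson correlation above the threshold.
    # Each unordered pair is visited once; NaN (e.g. constant columns) is skipped.
    corr = num.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            r = corr.loc[a, b]
            if pd.notna(r) and r > corr_threshold:
                pairs.append((a, b, round(float(r), 3)))

    return {
        "near_constant": near_constant,
        "skewed": skewed,
        "outlier_counts": outlier_counts,
        "high_corr_pairs": pairs,
    }


# Made-up demo data, not the grape disease dataset.
demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [-2.0, -4.0, -6.0, -8.0, -10.0, -12.0],  # mirrors "a" with opposite sign
    "c": [7, 7, 7, 7, 7, 7],                      # constant
    "d": [1, 1, 1, 1, 1, 50],                     # one extreme value
})
report = screen_features(demo)
print(report)
```

In the notebook the call would be `screen_features(df)`. Treat the result as a starting list to confirm against the saved figures, not as a decision on its own.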

Once you’ve identified which features to drop or transform, move those decisions into `src/clean.py` and generate a cleaned CSV in `data/processed/`. Then you can proceed to feature engineering.
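
As a rough illustration of that hand-off, the sketch below shows one possible shape for `src/clean.py`. The module's real contents are not shown in this notebook, so the function names `clean` and `save_processed`, their parameters, the output file name, and the example drop/transform choices are all hypothetical. The toy frame reuses the notebook's column names with made-up values.

```python
import os
import tempfile

import pandas as pd


def clean(df, drop_columns=(), transforms=None):
    """Return a cleaned copy: drop the listed columns, then apply per-column transforms."""
    out = df.drop(columns=list(drop_columns))  # drop() returns a new frame
    for col, fn in (transforms or {}).items():
        out[col] = fn(out[col])
    return out


def save_processed(df, path):
    """Write the cleaned frame to CSV, creating the target folder if needed."""
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    df.to_csv(path, index=False)


# Toy frame with the notebook's column names; the values are made up.
toy = pd.DataFrame({"Temperature": [26.9, 27.1], "Humidity": [22, 24], "LW": [5, 6]})

# Hypothetical decisions, used only to exercise the functions.
cleaned = clean(
    toy,
    drop_columns=["LW"],
    transforms={"Humidity": lambda s: s / 100.0},
)

# A temporary folder stands in for the project's data/processed/ directory.
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "data", "processed", "cleaned.csv")
    save_processed(cleaned, out_path)
    reloaded = pd.read_csv(out_path)
```

Keeping the drop list and the transforms as plain arguments means the decisions made during EDA live in one place and can be rerun against the raw CSV at any time.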
