# Benin Solar Data - Exploratory Data Analysis (EDA)

This notebook covers data profiling, cleaning, and exploratory analysis for the Benin solar dataset.

## Table of Contents
1. Load Data
2. Summary Statistics & Missing-Value Report
3. Outlier Detection & Basic Cleaning
4. Time Series Analysis
5. Cleaning Impact
6. Correlation & Relationship Analysis
7. Wind & Distribution Analysis
8. Temperature Analysis
9. Bubble Chart
10. Export Cleaned Data


In [None]:
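# %pip installs into the environment of the running kernel; `! pip` may target a different interpreter
%pip install pandas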




In [None]:
# 1. Load Data

%pip install seaborn

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Read first and parse dates afterwards: passing parse_dates=['Timestamp'] to read_csv
# raised "ValueError: Missing column provided to 'parse_dates': 'Timestamp'",
# so the header is checked explicitly before converting.
df = pd.read_csv(r'C:\Users\fisse\OneDrive\Documents\KAIM\solar-challenge-week1\Data\benin-malanville.csv')
df.columns = df.columns.str.strip()  # guard against stray whitespace in header names
if 'Timestamp' not in df.columns:
    raise KeyError(f"No 'Timestamp' column found; available columns: {list(df.columns)}")
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df.head()

## 2. Summary Statistics & Missing-Value Report

In [None]:
# Numeric columns summary
df.describe()

In [None]:
# Missing values
missing = df.isna().sum()
print(missing)
# Columns with >5% nulls
threshold = 0.05
n_rows = len(df)
high_nulls = missing[missing > threshold*n_rows]
print('Columns with >5% missing:', high_nulls)
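
The counts above can also be read as shares of the row total, which makes the 5% threshold easier to check by eye. The following is a minimal sketch on a made-up toy frame; `missing_report` is a hypothetical helper, not part of the notebook, and the toy values are invented for illustration.

```python
import numpy as np
import pandas as pd


def missing_report(frame: pd.DataFrame, threshold: float = 0.05) -> pd.DataFrame:
    """Per-column missing counts and shares, flagging columns above `threshold`."""
    report = pd.DataFrame({
        "n_missing": frame.isna().sum(),
        "share_missing": frame.isna().mean(),  # mean of booleans = fraction missing
    })
    report["above_threshold"] = report["share_missing"] > threshold
    return report.sort_values("share_missing", ascending=False)


# Toy frame standing in for df (values are invented).
toy = pd.DataFrame({
    "GHI": [1.0, np.nan, 3.0, 4.0],
    "DHI": [np.nan, np.nan, np.nan, np.nan],
    "Tamb": [25.0, 26.0, 27.0, 28.0],
})
report = missing_report(toy)
print(report)
```

In the notebook this would be called as `missing_report(df)`.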

## 3. Outlier Detection & Basic Cleaning

In [None]:
from scipy.stats import zscore

outlier_cols = ['GHI', 'DNI', 'DHI', 'ModA', 'ModB', 'WS', 'WSgust']
z_scores = np.abs(df[outlier_cols].apply(zscore, nan_policy='omit'))
outliers = (z_scores > 3)
print('Number of outliers per column:')
print(outliers.sum())
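
To see what the `|z| > 3` rule does, the sketch below applies the same `zscore(..., nan_policy='omit')` call to a synthetic column with one planted spike and one missing value. The data is randomly generated and only illustrative. Note that the NaN stays NaN in the z-scores, and `NaN > 3` evaluates to False, so missing rows are never flagged as outliers.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Synthetic stand-in for one irradiance column (not real measurements).
rng = np.random.default_rng(0)
values = rng.normal(loc=500.0, scale=50.0, size=200)
values[10] = 5000.0   # planted spike that should be flagged
values[20] = np.nan   # missing reading that should be ignored

toy = pd.DataFrame({"GHI": values})

# nan_policy='omit' computes mean and std from the non-missing values only.
z = np.abs(zscore(toy["GHI"].to_numpy(), nan_policy="omit"))
flags = pd.Series(z > 3, index=toy.index, name="GHI_outlier")

print("flagged rows:", flags[flags].index.tolist())
```

One caveat worth keeping in mind: extreme values inflate the standard deviation they are measured against, so a few very large spikes can hide smaller ones.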

In [None]:
# Step 1: drop every row flagged as an outlier (|z| > 3) in at least one column
df_clean = df[~(outliers.any(axis=1))].copy()
# Step 2: fill the remaining missing values with each column's median
for col in outlier_cols:
    df_clean[col] = df_clean[col].fillna(df_clean[col].median())

df_clean.isna().sum()
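
The two cleaning steps (drop flagged rows, then median-impute what is left) can be checked on a tiny example where the result is easy to verify by hand. This is a sketch with invented values; `drop_flagged_then_impute` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def drop_flagged_then_impute(frame: pd.DataFrame, flags: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Drop rows flagged in any column, then fill remaining NaNs with the column median."""
    kept = frame.loc[~flags.any(axis=1)].copy()
    for col in cols:
        # The median is taken after dropping, so flagged rows do not influence it.
        kept[col] = kept[col].fillna(kept[col].median())
    return kept


toy = pd.DataFrame({
    "GHI": [100.0, np.nan, 120.0, 9999.0],
    "WS": [2.0, 3.0, np.nan, 2.5],
})
# Pretend the last GHI value was flagged by the z-score step.
flags = pd.DataFrame({
    "GHI": [False, False, False, True],
    "WS": [False, False, False, False],
})

cleaned = drop_flagged_then_impute(toy, flags, ["GHI", "WS"])
print(f"rows before: {len(toy)}, rows after: {len(cleaned)}")
print(cleaned)
```

Printing the row counts before and after is a cheap way to notice when the outlier rule removes more data than expected.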

## 4. Time Series Analysis

In [None]:
fig, axs = plt.subplots(4, 1, figsize=(15,12), sharex=True)
for i, col in enumerate(['GHI', 'DNI', 'DHI', 'Tamb']):
    axs[i].plot(df_clean['Timestamp'], df_clean[col])
    axs[i].set_title(col)
plt.tight_layout()
plt.show()
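
Plotting every raw reading can produce a dense band that hides trends. One option is to aggregate first, for example to daily means or to an average profile by hour of day. The sketch below assumes 1-minute sampling and uses a synthetic half-sine "daylight" curve; both are assumptions for illustration, not properties confirmed for the Benin file.

```python
import numpy as np
import pandas as pd

# Three synthetic days at 1-minute resolution (assumed sampling rate).
timestamps = pd.date_range("2021-01-01", periods=3 * 24 * 60, freq="min")
minute_of_day = (timestamps.hour * 60 + timestamps.minute).to_numpy()

# Half-sine between 06:00 and 18:00, zero at night (toy irradiance shape).
ghi = np.clip(np.sin((minute_of_day - 360) / 720 * np.pi), 0, None) * 800.0
toy = pd.DataFrame({"Timestamp": timestamps, "GHI": ghi})

# Daily mean: one value per calendar day.
daily = toy.set_index("Timestamp")["GHI"].resample("D").mean()

# Average diurnal profile: mean GHI for each hour of the day.
hourly_profile = toy.groupby(toy["Timestamp"].dt.hour)["GHI"].mean()

print(daily)
print(hourly_profile.round(1))
```

With the notebook's data, `toy` would be replaced by `df_clean`, and either series can be passed straight to `.plot()`.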

## 5. Cleaning Impact (ModA & ModB before/after Cleaning)

In [None]:
df_clean.groupby('Cleaning')[['ModA', 'ModB']].mean().plot(kind='bar', figsize=(8,6))
plt.title('Average ModA & ModB by Cleaning')
plt.ylabel('Irradiance (W/m²)')
plt.show()
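
A bar chart of group means says nothing about how many readings sit in each group, and the two groups may be very different in size. The sketch below adds group counts and the relative difference between the groups. It assumes `Cleaning` is a 0/1 flag (1 = reading taken after a cleaning event), which is an assumption about the dataset, and the numbers are invented.

```python
import pandas as pd

# Toy readings; Cleaning is assumed to be a 0/1 flag.
toy = pd.DataFrame({
    "Cleaning": [0, 0, 0, 1, 1],
    "ModA": [200.0, 210.0, 190.0, 220.0, 240.0],
    "ModB": [195.0, 205.0, 185.0, 215.0, 225.0],
})

grouped = toy.groupby("Cleaning")[["ModA", "ModB"]]
means = grouped.mean()
counts = grouped.size()  # number of rows per Cleaning value

# Relative change of the mean, flag 1 versus flag 0, in percent.
pct_change = (means.loc[1] - means.loc[0]) / means.loc[0] * 100

print(means)
print(counts)
print(pct_change.round(2))
```

If one group has only a handful of rows, the difference in means should be read with caution.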

## 6. Correlation & Relationship Analysis

In [None]:
corr_cols = ['GHI', 'DNI', 'DHI', 'TModA', 'TModB']
plt.figure(figsize=(8,6))
sns.heatmap(df_clean[corr_cols].corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
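
Alongside the heatmap, it can help to list the column pairs ordered by absolute correlation. The following sketch keeps only the upper triangle of the correlation matrix so each pair appears once; `ranked_pairs` is a hypothetical helper and the toy columns are invented.

```python
import numpy as np
import pandas as pd


def ranked_pairs(frame: pd.DataFrame) -> pd.Series:
    """Pairwise Pearson correlations, one entry per pair, strongest first."""
    corr = frame.corr()
    # Boolean mask for the strict upper triangle (excludes the diagonal).
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)


x = np.arange(10.0)
toy = pd.DataFrame({
    "GHI": x,
    "DNI": 2 * x + 1,                                        # exactly linear in GHI
    "DHI": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0],  # loosely related
})
ranked = ranked_pairs(toy)
print(ranked)
```

In the notebook the call would be `ranked_pairs(df_clean[corr_cols])`.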

In [None]:
# Scatter plots: wind variables against GHI
# (RH vs Tamb and RH vs GHI are plotted in Section 8)
for x_col in ['WS', 'WSgust', 'WD']:
    sns.scatterplot(x=x_col, y='GHI', data=df_clean)
    plt.title(f'{x_col} vs GHI')
    plt.show()

## 7. Wind & Distribution Analysis

In [None]:
# Wind plot: polar scatter of wind speed (radius) against wind direction (angle)
# Drop NaNs jointly so that WD and WS stay aligned row by row
wind = df_clean[['WD', 'WS']].dropna()
plt.figure(figsize=(8,8))
ax = plt.subplot(111, polar=True)
theta = np.deg2rad(wind['WD'])
ax.scatter(theta, wind['WS'], alpha=0.5)
ax.set_title('Wind Rose: WS vs WD')
plt.show()

# Histograms
df_clean['GHI'].hist(bins=30, alpha=0.7)
plt.title('Histogram of GHI')
plt.xlabel('GHI (W/m²)')
plt.ylabel('Frequency')
plt.show()
df_clean['WS'].hist(bins=30, alpha=0.7)
plt.title('Histogram of WS')
plt.xlabel('WS (m/s)')
plt.ylabel('Frequency')
plt.show()
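
The polar plot in this section is a scatter of individual readings. A conventional wind rose instead bins direction into sectors and draws one bar per sector. The sketch below shows one way to build that table and plot it, assuming `WD` is in degrees measured clockwise from north, which is an assumption about the dataset. `wind_sector_table` and `plot_wind_rose` are hypothetical helpers and the toy values are invented.

```python
import numpy as np
import pandas as pd


def wind_sector_table(frame: pd.DataFrame, n_sectors: int = 8) -> pd.DataFrame:
    """Count readings and average wind speed per direction sector."""
    wind = frame[["WD", "WS"]].dropna()
    width = 360.0 / n_sectors
    # Shift by half a sector so that sector 0 is centred on 0 degrees.
    sector = (((wind["WD"] + width / 2) % 360) // width).astype(int)
    table = wind.groupby(sector)["WS"].agg(["count", "mean"])
    table = table.reindex(range(n_sectors))           # keep empty sectors
    table["count"] = table["count"].fillna(0).astype(int)
    table.index = table.index * width                 # label by sector centre
    table.index.name = "sector_center_deg"
    return table


def plot_wind_rose(table: pd.DataFrame):
    """Radial bar chart of reading counts per sector, north at the top."""
    import matplotlib.pyplot as plt

    ax = plt.subplot(111, polar=True)
    ax.set_theta_zero_location("N")
    ax.set_theta_direction(-1)  # clockwise, like a compass
    ax.bar(np.deg2rad(table.index.to_numpy()), table["count"],
           width=2 * np.pi / len(table), alpha=0.7)
    ax.set_title("Wind rose: readings per direction sector")
    return ax


toy = pd.DataFrame({
    "WD": [0.0, 350.0, 10.0, 90.0, 95.0, 180.0, np.nan],
    "WS": [2.0, 4.0, 3.0, 5.0, 7.0, 1.0, 9.0],
})
table = wind_sector_table(toy, n_sectors=4)
print(table)
```

In the notebook, `plot_wind_rose(wind_sector_table(df_clean))` followed by `plt.show()` would draw the chart.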

## 8. Temperature Analysis

In [None]:
sns.scatterplot(x='RH', y='Tamb', data=df_clean)
plt.title('Relative Humidity vs Temperature')
plt.show()

sns.scatterplot(x='RH', y='GHI', data=df_clean)
plt.title('Relative Humidity vs GHI')
plt.show()

## 9. Bubble Chart

In [None]:
plt.figure(figsize=(10,7))
sns.scatterplot(x='GHI', y='Tamb', size='RH', data=df_clean, legend=False, alpha=0.5)
plt.title('Bubble Chart: GHI vs Tamb (size=RH)')
plt.xlabel('GHI (W/m²)')
plt.ylabel('Tamb (°C)')
plt.show()

## 10. Export Cleaned Data

In [None]:
df_clean.to_csv('../Data/benin_clean.csv', index=False)
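
`to_csv` fails if the target directory does not exist, so it can be worth creating it first and reading the file back to confirm that the timestamps survive the round trip. This is a minimal sketch; `export_clean` is a hypothetical helper, and a temporary directory stands in for `../Data` so the example does not touch real files.

```python
import tempfile
from pathlib import Path

import pandas as pd


def export_clean(frame: pd.DataFrame, path) -> Path:
    """Write `frame` to CSV, creating the parent directory if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(path, index=False)
    return path


toy = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2021-01-01 00:01", "2021-01-01 00:02"]),
    "GHI": [0.0, 1.5],
})

# Temporary directory used only for this demonstration.
out = export_clean(toy, Path(tempfile.mkdtemp()) / "Data" / "benin_clean.csv")

# Read back and confirm shape and datetime parsing.
round_trip = pd.read_csv(out, parse_dates=["Timestamp"])
print(round_trip.dtypes)
```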

---
### References
- [Pandas documentation](https://pandas.pydata.org/docs/)
- [Matplotlib gallery](https://matplotlib.org/stable/gallery/index.html)
- [Seaborn gallery](https://seaborn.pydata.org/examples/index.html)
- [Scipy Z-score](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.zscore.html)