# Sierra Leone Solar Data - EDA

This notebook contains exploratory data analysis (EDA) for Sierra Leone solar farm data.  
The goal is to understand the dataset, detect missing values and outliers, and plan visualizations for further analysis.


In [None]:
# Libraries for data analysis and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import zscore  # for outlier detection

# Display plots inline in Jupyter
%matplotlib inline
   

# Load Sierra Leone solar data (CSV is local, do not commit CSV)
df = pd.read_csv('../data/sierraleone.csv')

# Preview first 5 rows
df.head()


## Summary Statistics & Missing Values

We review basic statistics for each column to understand the data distribution.  
We also check missing values to plan cleaning steps.


In [None]:
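# Summary statistics: count, mean, std, min, quartiles, max
# (display() is needed because only the last expression in a cell is shown)
display(df.describe())

# Count of missing values per column
df.isna().sum()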
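
Raw counts are easier to judge next to the share of rows they represent. The sketch below shows one way to build such a report. It runs on a small made-up frame (`toy`) rather than the real CSV, and the values are illustrative assumptions, not Sierra Leone measurements.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; column names mirror the notebook,
# values are invented for illustration only
toy = pd.DataFrame({
    "GHI": [0.0, 120.5, np.nan, 640.2],
    "RH": [93.1, np.nan, np.nan, 71.4],
    "Tamb": [24.0, 25.1, 26.3, 29.8],
})

# Count and percentage of missing values per column, worst column first
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": toy.isna().mean() * 100,
}).sort_values("pct_missing", ascending=False)

print(missing)
```

Swapping `toy` for `df` should give the same report for the full dataset.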


## Outliers & Cleaning Plan

To check data quality, we calculate the Z-score of the GHI column (distance from the column mean, in standard deviations) to detect extreme values.  
Rows with |Z| > 3 are treated as potential outliers and will be reviewed.  
If needed, missing values in numeric columns will be imputed with the column median.


In [None]:
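# Calculate Z-score for GHI column; nan_policy='omit' ignores missing values
# so a single NaN does not turn every Z-score into NaN
df['GHI_z'] = zscore(df['GHI'], nan_policy='omit')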

# Identify potential outliers where absolute Z-score > 3
outliers = df[df['GHI_z'].abs() > 3]
outliers  # Display potential outlier rows
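
The median imputation mentioned in the cleaning plan could look roughly like the sketch below. It uses a small made-up frame (`toy`) so it runs on its own; the values and the choice to impute every numeric column are assumptions for illustration, not a decision taken in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; values are invented for illustration only
toy = pd.DataFrame({
    "GHI": [100.0, np.nan, 300.0, 500.0],
    "RH": [80.0, 60.0, np.nan, 70.0],
})

# Column medians, computed on the non-missing values (pandas skips NaN by default)
numeric_cols = toy.select_dtypes(include="number").columns
medians = toy[numeric_cols].median()

# Fill each numeric column with its own median, leaving the original frame untouched
imputed = toy.copy()
imputed[numeric_cols] = imputed[numeric_cols].fillna(medians)

print(imputed)
```

The median is used rather than the mean because it is less sensitive to the extreme values that the Z-score check is meant to flag.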


## Planned Visualizations

The following visualizations are planned to explore relationships in the dataset:

- **Line plot:** GHI over time
- **Scatter plot:** RH vs GHI
- **Bubble chart:** GHI vs Tamb (size = RH)
- **Heatmap:** correlation between GHI, DNI, DHI, TModA, TModB
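
As a starting point for the heatmap, here is a minimal sketch on synthetic data. The column names match the list above, but the numbers are randomly generated assumptions, so the correlations shown say nothing about the real dataset.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs outside Jupyter; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data: DNI, DHI and module temperatures loosely follow GHI
rng = np.random.default_rng(0)
n = 200
ghi = rng.uniform(0, 1000, n)
toy = pd.DataFrame({
    "GHI": ghi,
    "DNI": 0.7 * ghi + rng.normal(0, 50, n),
    "DHI": 0.3 * ghi + rng.normal(0, 30, n),
    "TModA": 25 + 0.03 * ghi + rng.normal(0, 2, n),
    "TModB": 25 + 0.03 * ghi + rng.normal(0, 2, n),
})

# Pearson correlation matrix of the selected columns
corr = toy[["GHI", "DNI", "DHI", "TModA", "TModB"]].corr()

# Annotated heatmap with a symmetric colour scale from -1 to 1
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation heatmap (synthetic data)")
fig.tight_layout()
```

Replacing `toy` with `df` should produce the planned heatmap, provided those five columns exist in the CSV.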


In [None]:
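# Save a working copy locally (do not commit to GitHub)
# Note: no imputation or outlier removal has been applied yet; the helper
# Z-score column is dropped so it does not end up in the saved file
df.drop(columns=['GHI_z'], errors='ignore').to_csv('../data/sierraleone_clean.csv', index=False)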
