# Exploratory Data Analysis (EDA) for Sierra Leone Solar Dataset

This notebook provides a comprehensive workflow for profiling, cleaning, and exploring the Sierra Leone solar dataset, as part of the 10 Academy Solar Data Discovery Week 0 Challenge.

## Objectives

- Generate summary statistics and identify missing values to profile the dataset.
- Clean the data by handling outliers, missing values, and incorrect entries.
- Conduct exploratory analyses, including time series, correlation, wind, and temperature analysis.
- Export the cleaned dataset for use in cross-country comparisons.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import zscore
from windrose import WindroseAxes
%matplotlib inline

## 1. Data Loading and Profiling


In [None]:
# Load the dataset
df = pd.read_csv('data/sierra_leone.csv')

# Summary statistics
summary_stats = df.describe()
print("Summary Statistics:\n", summary_stats)

# Missing values
missing_values = df.isna().sum()
missing_percentage = (df.isna().sum() / len(df)) * 100
missing_report = pd.DataFrame({'Missing Values': missing_values, 'Percentage': missing_percentage})
print("\nMissing Values Report (columns with >5% nulls):\n", missing_report[missing_report['Percentage'] > 5])
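The missing-value report above can be wrapped in a small helper so the same profiling step is reusable on other files. The following is a minimal sketch, not part of the original notebook: the function name `build_missing_report`, the `threshold_pct` parameter, and the toy frame are all hypothetical and only illustrate the technique.

```python
import numpy as np
import pandas as pd


def build_missing_report(frame: pd.DataFrame, threshold_pct: float = 5.0) -> pd.DataFrame:
    """Return per-column null counts and percentages above a threshold."""
    counts = frame.isna().sum()
    pct = counts / len(frame) * 100
    report = pd.DataFrame({"Missing Values": counts, "Percentage": pct})
    # Keep only columns whose null share exceeds the threshold, worst first.
    flagged = report[report["Percentage"] > threshold_pct]
    return flagged.sort_values("Percentage", ascending=False)


# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "GHI": [100.0, np.nan, 300.0, 400.0],
    "Comments": [np.nan, np.nan, np.nan, np.nan],
    "WS": [1.0, 2.0, 3.0, 4.0],
})
toy_report = build_missing_report(toy)
print(toy_report)
```

Sorting by percentage puts fully empty columns, such as a free-text comments field, at the top where they are easy to spot.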

## 2. Outlier Detection and Cleaning

In [None]:
# Flag rows where any key column has |Z| > 3 (rows are reported here, not removed)
key_columns = ['GHI', 'DNI', 'DHI', 'ModA', 'ModB', 'WS', 'WSgust']
z_scores = df[key_columns].apply(zscore, nan_policy='omit')
outliers = (z_scores.abs() > 3).any(axis=1)
print(f"Number of rows with outliers: {outliers.sum()}")
print("Outliers:\n", df[outliers][key_columns])

# Impute missing values with median
for col in key_columns:
    df[col] = df[col].fillna(df[col].median())

# Drop Comments column if it exists
df = df.drop(columns=['Comments'], errors='ignore')

# Clip negative values to 0
df[key_columns] = df[key_columns].clip(lower=0)

# Export cleaned data
df.to_csv('data/sierra_leone_clean.csv', index=False)
print("Cleaned data exported to data/sierra_leone_clean.csv")
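The cell above reports rows with |Z| > 3 but leaves them in the exported file. If a later step needs those rows excluded, one possible approach is to keep the boolean mask and filter with it. The sketch below is an assumption about how that could look, not the notebook's method: `flag_outliers` is a hypothetical helper, the toy data is made up, and the Z-score is computed with pandas using `ddof=0`, which matches the default of `scipy.stats.zscore`.

```python
import numpy as np
import pandas as pd


def flag_outliers(frame: pd.DataFrame, columns: list[str], threshold: float = 3.0) -> pd.Series:
    """Return a boolean mask marking rows where any column has |z| > threshold."""
    values = frame[columns]
    # Population z-score (ddof=0). NaNs are skipped by mean/std, and a NaN
    # z-score compares as False, so missing readings are never flagged.
    z = (values - values.mean()) / values.std(ddof=0)
    return (z.abs() > threshold).any(axis=1)


# Toy frame (made-up values): one extreme GHI reading and one missing WS value.
toy = pd.DataFrame({
    "GHI": [100.0] * 19 + [1000.0],
    "WS": np.linspace(1.0, 3.0, 20),
})
toy.loc[5, "WS"] = np.nan

mask = flag_outliers(toy, ["GHI", "WS"])
cleaned = toy.loc[~mask].copy()
print(f"Flagged {mask.sum()} of {len(toy)} rows; {len(cleaned)} rows kept")
```

Whether flagged rows should be dropped, capped, or kept is a judgment call that depends on whether the extreme readings are sensor faults or genuine conditions.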

## 3. Time Series Analysis

In [None]:
# Convert Timestamp to datetime
df['Timestamp'] = pd.to_datetime(df['Timestamp'])

# Line plot for GHI, DNI, DHI, Tamb
plt.figure(figsize=(12, 6))
plt.plot(df['Timestamp'], df['GHI'], label='GHI')
plt.plot(df['Timestamp'], df['DNI'], label='DNI')
plt.plot(df['Timestamp'], df['DHI'], label='DHI')
plt.plot(df['Timestamp'], df['Tamb'], label='Tamb')
plt.xlabel('Timestamp')
plt.ylabel('Value')
plt.title('Time Series of Solar Irradiance and Temperature')
plt.legend()
plt.show()

# Monthly and hourly patterns
df['Month'] = df['Timestamp'].dt.month
df['Hour'] = df['Timestamp'].dt.hour

# Average GHI by month
plt.figure(figsize=(8, 5))
sns.barplot(x='Month', y='GHI', data=df)
plt.title('Average GHI by Month')
plt.show()

# Average GHI by hour
plt.figure(figsize=(8, 5))
sns.lineplot(x='Hour', y='GHI', data=df)
plt.title('Average GHI by Hour of Day')
plt.show()

### Time Series Observations

- GHI peaks around midday, as expected from the daily solar cycle.
- DNI displays some anomalous spikes that warrant further investigation.
- Ambient temperature (Tamb) shows clear seasonal variation over the year.
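The midday-peak observation can be checked numerically by averaging GHI per hour and reading off the hour with the highest mean. The sketch below uses a synthetic three-day series (a clipped sine curve, not real measurements) purely to show the `groupby` pattern; the column names follow the notebook, everything else is assumed.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: three days of hourly readings with a sine-shaped daytime curve.
timestamps = pd.date_range("2022-01-01", periods=72, freq=pd.Timedelta(hours=1))
hours = timestamps.hour.to_numpy()
# Zero before 06:00 and after 18:00, peaking at 800 W/m² at 12:00.
ghi = np.clip(np.sin((hours - 6) / 12 * np.pi), 0, None) * 800

toy = pd.DataFrame({"Timestamp": timestamps, "GHI": ghi})
toy["Hour"] = toy["Timestamp"].dt.hour

# Mean GHI for each hour of the day, then the hour where that mean is largest.
hourly_mean = toy.groupby("Hour")["GHI"].mean()
peak_hour = hourly_mean.idxmax()
print(f"Peak mean GHI at hour {peak_hour}: {hourly_mean.max():.0f} W/m²")
```

Aggregating first also gives a small table that can be plotted directly, which may be faster on a large dataset than letting seaborn aggregate every row and estimate confidence intervals.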

## 4. Cleaning Impact Analysis

In [None]:
# Group by Cleaning flag
cleaning_impact = df.groupby('Cleaning')[['ModA', 'ModB']].mean().reset_index()

# Bar plot
cleaning_impact.plot(kind='bar', x='Cleaning', y=['ModA', 'ModB'], title='Average ModA and ModB by Cleaning Status')
plt.xlabel('Cleaning (0 = No, 1 = Yes)')
plt.ylabel('Average Value')
plt.show()

### Cleaning Impact

Rows flagged as cleaning events show higher average ModA and ModB values, which is consistent with improved sensor performance after cleaning. Because this is a comparison of group means only, further analysis is recommended to quantify the effect of cleaning on GHI and overall data quality.
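One way to start quantifying the effect is to report the group sizes alongside the means and express the difference as a percentage. The sketch below uses a made-up five-row frame, so the numbers mean nothing; only the pattern is intended to carry over. It assumes `Cleaning` is coded 0/1 as in the plot label above.

```python
import pandas as pd

# Made-up readings: three rows without cleaning, two rows with cleaning.
toy = pd.DataFrame({
    "Cleaning": [0, 0, 0, 1, 1],
    "ModA": [200.0, 220.0, 210.0, 300.0, 320.0],
    "ModB": [190.0, 210.0, 200.0, 290.0, 310.0],
})

# Mean and row count per group; the count shows how much data backs each mean.
summary = toy.groupby("Cleaning")[["ModA", "ModB"]].agg(["mean", "count"])

# Relative change of the cleaned-group mean versus the uncleaned-group mean.
means = toy.groupby("Cleaning")[["ModA", "ModB"]].mean()
uplift_pct = (means.loc[1] - means.loc[0]) / means.loc[0] * 100

print(summary)
print(uplift_pct.round(1))
```

If the cleaned group turns out to contain only a handful of rows, or if cleaning happens mostly at a particular time of day, the difference in means may reflect that imbalance rather than the cleaning itself.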

## 5. Correlation and Relationship Analysis

In [None]:
# Correlation heatmap
corr = df[['GHI', 'DNI', 'DHI', 'TModA', 'TModB']].corr()
plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Heatmap')
plt.show()

# Scatter plot: WS vs. GHI
plt.figure(figsize=(8, 5))
sns.scatterplot(x='WS', y='GHI', data=df)
plt.title('Wind Speed vs. GHI')
plt.show()

### Correlation and Relationships

- GHI and DNI are strongly positively correlated: as direct normal irradiance increases, so does global horizontal irradiance.
- The WS vs. GHI scatter plot suggests a weak negative relationship, so higher wind speeds may be associated with slightly lower solar irradiance.
- The module temperatures TModA and TModB show moderate correlations with the irradiance variables, which could be explored further for potential impacts on solar panel performance.
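Statements like "strong" or "weak" are easier to defend when the coefficients are listed explicitly. A minimal sketch of one way to do that is to take the upper triangle of the correlation matrix and rank the pairs by absolute value. The data below is randomly generated with a built-in GHI/DNI dependence, so the resulting numbers illustrate the technique only and say nothing about the Sierra Leone dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: GHI is constructed to depend on DNI, WS is independent noise.
rng = np.random.default_rng(0)
n = 500
dni = rng.uniform(0, 900, n)
toy = pd.DataFrame({
    "DNI": dni,
    "GHI": 0.8 * dni + rng.normal(0, 50, n),
    "WS": rng.uniform(0, 10, n),
})

corr = toy.corr()  # Pearson correlation by default

# Keep each pair once (upper triangle, excluding the diagonal).
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Order the pairs from strongest to weakest absolute correlation.
ranked = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(ranked.round(2))
```

Adding `WS` to the column list passed to `corr()` in the notebook would put the wind-speed relationship on the same footing as the irradiance pairs.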

## 6. Wind and Distribution Analysis

In [None]:
# Wind rose plot
ax = WindroseAxes.from_ax()
ax.bar(df['WD'], df['WS'], normed=True, opening=0.8, edgecolor='white')
ax.set_legend()
plt.title('Wind Rose Plot')
plt.show()

# Histograms
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
sns.histplot(df['GHI'], bins=30)
plt.title('GHI Distribution')
plt.subplot(1, 2, 2)
sns.histplot(df['WS'], bins=30)
plt.title('Wind Speed Distribution')
plt.show()

### Wind and Distribution Analysis

- Wind direction is predominantly from the [direction], with most wind speeds falling between X and Y m/s.
- The GHI distribution is right-skewed: low irradiance values are common, while high peaks occur less frequently.
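The dominant wind direction and the typical speed range can be read from the data rather than estimated by eye from the wind rose. The sketch below bins directions into eight compass sectors and takes the 5th and 95th percentiles of wind speed. It assumes `WD` is in degrees with 0° = north, increasing clockwise, which the notebook does not state; the sector labels, percentile choices, and toy values are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up readings, including one missing direction.
toy = pd.DataFrame({
    "WD": [350.0, 5.0, 10.0, 95.0, 180.0, 2.0, np.nan, 270.0],
    "WS": [2.1, 3.4, 2.8, 1.0, 0.5, 4.0, 1.2, 2.2],
})

labels = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]
valid = toy.dropna(subset=["WD", "WS"])

# Shift by half a sector (22.5°) so "N" covers 337.5° to 22.5°, wrap at 360°,
# then integer-divide by the 45° sector width to get a sector index 0..7.
sector_idx = (((valid["WD"] + 22.5) % 360) // 45).astype(int)
sectors = sector_idx.map(lambda i: labels[i])

dominant = sectors.value_counts().idxmax()
ws_low, ws_high = valid["WS"].quantile([0.05, 0.95])
print(f"Dominant sector: {dominant}; 5th-95th percentile WS: {ws_low:.1f}-{ws_high:.1f} m/s")
```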

## 7. Temperature Analysis

In [None]:
# RH vs. Tamb scatter plot
plt.figure(figsize=(8, 5))
sns.scatterplot(x='RH', y='Tamb', data=df)
plt.title('Relative Humidity vs. Ambient Temperature')
plt.show()

# Bubble chart: GHI vs. Tamb with RH size
plt.figure(figsize=(8, 5))
plt.scatter(df['Tamb'], df['GHI'], s=df['RH']*10, alpha=0.5)
plt.xlabel('Ambient Temperature (°C)')
plt.ylabel('GHI (W/m²)')
plt.title('GHI vs. Tamb with RH Bubble Size')
plt.show()

### Temperature Analysis

- Higher relative humidity (RH) is associated with lower GHI, likely due to increased cloud cover reducing solar irradiance.
- There is a positive relationship between ambient temperature (Tamb) and GHI, with larger RH values observed at higher temperatures.
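Relationships read from scatter plots can be cross-checked by binning RH and comparing mean GHI and Tamb per bin. In the sketch below the synthetic data is deliberately constructed so that both GHI and Tamb fall as RH rises; that behaviour is an assumption of the toy example, not a finding, and the bin edges are arbitrary. Running the same grouping on the real `df` would show which direction the relationships actually take.

```python
import numpy as np
import pandas as pd

# Synthetic data with a built-in negative dependence of GHI and Tamb on RH.
rng = np.random.default_rng(1)
n = 400
rh = rng.uniform(20, 100, n)
toy = pd.DataFrame({
    "RH": rh,
    "GHI": np.clip(900 - 8 * rh + rng.normal(0, 40, n), 0, None),
    "Tamb": 35 - 0.1 * rh + rng.normal(0, 1, n),
})

# Group readings into humidity bands and average the other variables per band.
bins = [0, 40, 60, 80, 100]
toy["RH_bin"] = pd.cut(toy["RH"], bins=bins)
by_rh = toy.groupby("RH_bin", observed=True)[["GHI", "Tamb"]].mean()
print(by_rh.round(1))
```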

## Conclusion

The Sierra Leone dataset has been profiled, cleaned, and explored across solar irradiance, temperature, and wind patterns. The cleaned export (`data/sierra_leone_clean.csv`) provides the basis for the cross-country comparisons in Task 3.