# Wind and Solar Energy Production

The Wind & Solar Energy Production Dataset contains hourly wind and solar generation data from France spanning January 2020 to November 2025, with 51,864 complete records and 9 columns.

It includes temporal features (date, hours, day-of-year, day name, month, season) and source classification (Wind, Solar, Mixed), with total production ranging from 58 to 23,446 MWh per hour and wind dominating at 81.9% of records.

The dataset is suited to renewable energy forecasting with regression and time series models, analysis of diurnal, seasonal, and weekly patterns, machine learning applications such as classification and clustering, anomaly detection for production outliers, and statistical trend evaluation.

The dataset is available on [Kaggle](https://www.kaggle.com/datasets/ahmeduzaki/wind-and-solar-energy-production-dataset/data?select=Energy+Production+Dataset.csv).


## 1. Loading the dataset

In [1]:
import kagglehub
from kagglehub import KaggleDatasetAdapter
from IPython.display import clear_output

# Filepath to the dataset
file_path = "Energy Production Dataset.csv"

# Load the latest version
df = kagglehub.dataset_load(
  KaggleDatasetAdapter.PANDAS,
  "ahmeduzaki/wind-and-solar-energy-production-dataset",
  file_path,
)

clear_output()
print("Dataset loaded successfully!")


Dataset loaded successfully!


## 2. Data Exploration

In the data exploration phase, the goal is to understand the structure of the data. We check:
* Are there missing or extreme values?
* Is there any invalid data?
* Variable types
* Distribution and dispersion (statistical analysis and visualization)
* Frequency
* Discretization if applicable
* Impact of X on Y
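
Several of these checks can be gathered into one small helper. The sketch below is only an illustration on a toy frame: `quick_overview` and the toy values are hypothetical and are not part of the dataset or of the cells that follow.

```python
import pandas as pd

def quick_overview(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarise dtype, missing values and cardinality for each column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "n_unique": frame.nunique(),
    })

# Toy data (hypothetical values) reusing two column names from the dataset
toy = pd.DataFrame({
    "Source": ["Wind", "Solar", "Wind", None],
    "Production": [5281, 3824, 3824, 6120],
})

overview = quick_overview(toy)
n_duplicates = int(toy.duplicated().sum())
print(overview)
print("duplicated rows:", n_duplicates)
```

Running the same helper on `df` would give a one-table view of variable types, missing values, and the number of distinct values per column.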

In [2]:
df.columns

Index(['Date', 'Start_Hour', 'End_Hour', 'Source', 'Day_of_Year', 'Day_Name',
       'Month_Name', 'Season', 'Production'],
      dtype='str')

In [3]:
df.head()

Unnamed: 0,Date,Start_Hour,End_Hour,Source,Day_of_Year,Day_Name,Month_Name,Season,Production
0,11/30/2025,21,22,Wind,334,Sunday,November,Fall,5281
1,11/30/2025,18,19,Wind,334,Sunday,November,Fall,3824
2,11/30/2025,16,17,Wind,334,Sunday,November,Fall,3824
3,11/30/2025,23,0,Wind,334,Sunday,November,Fall,6120
4,11/30/2025,6,7,Wind,334,Sunday,November,Fall,4387


In [4]:
df.describe()

Unnamed: 0,Start_Hour,End_Hour,Day_of_Year,Production
count,51864.0,51864.0,51864.0,51864.0
mean,11.5,11.5,180.798415,6215.069933
std,6.922253,6.922253,104.291387,3978.364965
min,0.0,0.0,1.0,58.0
25%,5.75,5.75,91.0,3111.0
50%,11.5,11.5,181.0,5372.0
75%,17.25,17.25,271.0,8501.0
max,23.0,23.0,366.0,23446.0


In [176]:
if (df.duplicated()).any():
    print("Found duplicated values!")
else:
    print("Did not find any duplicated values!")

Did not find any duplicated values!
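
Note that `df.duplicated()` only flags rows that are identical in every column. A complementary check, sketched below on toy rows, is to look for repeated keys such as the same `Date` and `Start_Hour` with different `Production` values. The toy frame is an assumption made for illustration; its values are copied from rows shown later in this notebook.

```python
import pandas as pd

# Toy rows: two rows share Date and Start_Hour but differ in Production
toy = pd.DataFrame({
    "Date": ["10/25/2020", "10/25/2020", "10/25/2020"],
    "Start_Hour": [2, 2, 3],
    "Production": [10696, 11001, 8774],
})

# Rows identical in every column
full_row_dupes = int(toy.duplicated().sum())

# keep=False marks every member of a repeated key, not only the later ones
key_dupes = toy[toy.duplicated(subset=["Date", "Start_Hour"], keep=False)]

print("identical rows:", full_row_dupes)
print("rows sharing Date and Start_Hour:", len(key_dupes))
```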


In [5]:
df.info()

<class 'pandas.DataFrame'>
RangeIndex: 51864 entries, 0 to 51863
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Date         51864 non-null  str  
 1   Start_Hour   51864 non-null  int64
 2   End_Hour     51864 non-null  int64
 3   Source       51864 non-null  str  
 4   Day_of_Year  51864 non-null  int64
 5   Day_Name     51864 non-null  str  
 6   Month_Name   51864 non-null  str  
 7   Season       51864 non-null  str  
 8   Production   51864 non-null  int64
dtypes: int64(4), str(5)
memory usage: 3.6 MB


There are no null values in the dataset.

### Analysis of "Date"

In [6]:
df.Date # mm/dd/yyyy
print("The current format is mm/dd/yyyy")

The current format is mm/dd/yyyy


In [7]:
# Convert string to datetime, stating the mm/dd/yyyy format explicitly
import pandas as pd
df.Date = pd.to_datetime(df.Date, format="%m/%d/%Y")
df.info()

<class 'pandas.DataFrame'>
RangeIndex: 51864 entries, 0 to 51863
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   Date         51864 non-null  datetime64[us]
 1   Start_Hour   51864 non-null  int64         
 2   End_Hour     51864 non-null  int64         
 3   Source       51864 non-null  str           
 4   Day_of_Year  51864 non-null  int64         
 5   Day_Name     51864 non-null  str           
 6   Month_Name   51864 non-null  str           
 7   Season       51864 non-null  str           
 8   Production   51864 non-null  int64         
dtypes: datetime64[us](1), int64(4), str(4)
memory usage: 3.6 MB
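
The conversion can also serve as a validity check for the dates. As a minimal sketch, assuming the strings follow the month/day/year layout seen in `df.head()`, passing `errors="coerce"` turns any entry that is not a real calendar date into `NaT`, so invalid entries can be listed. The toy strings below are hypothetical.

```python
import pandas as pd

# Toy strings (hypothetical); the last one is not a valid calendar date
raw = pd.Series(["11/30/2025", "02/29/2024", "02/30/2024"])

# errors="coerce" turns unparseable entries into NaT instead of raising
parsed = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")

# Original strings that could not be parsed
invalid = raw[parsed.isna()]
print(invalid.tolist())
```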


In [8]:
year = df.Date.dt.year
print(year.min())
print(year.max())

df["year"] = year

2020
2025


The dataset contains production data for the years 2020 to 2025.

In [9]:
import numpy as np
month = df.Date.dt.month
print(np.sort(month.unique()))

df["month"] = month

[ 1  2  3  4  5  6  7  8  9 10 11 12]


In [18]:
df.groupby(["year"])["month"].unique()

year
2020    [12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
2021    [12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
2022    [12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
2023    [12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
2024    [12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
2025        [11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
Name: month, dtype: object

From 2020 to 2024 we have production measurements for every month of the year. For 2025 we have measurements from January to November only, that is, there are no measurements for December.
As a consequence, we cannot analyze December's production from 2020 to 2025, only from 2020 to 2024.

(i.e. December 2025 is a missing value)
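
The same gap can be found programmatically by comparing the months present in the data with a full calendar of months. This is a sketch on toy dates, not on the real `df`; the date range used here is an assumption chosen to mimic a final year that stops in November.

```python
import pandas as pd

# Toy dates (hypothetical): 2024 is complete, 2025 stops in November
dates = pd.Series(pd.date_range("2024-01-01", "2025-11-30", freq="D"))

# Months that actually appear in the data
observed = pd.PeriodIndex(dates.dt.to_period("M").unique())

# Every month from January of the first year to December of the last year
expected = pd.period_range(
    start=f"{dates.dt.year.min()}-01",
    end=f"{dates.dt.year.max()}-12",
    freq="M",
)

missing_months = expected.difference(observed)
print(missing_months)
```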


In [19]:
day = df.Date.dt.day
df["day"] = day

In [24]:
unique_days = df.groupby(["year", "month"])["day"].unique().reset_index()

In [43]:
unique_days.loc[2, "day"]

array([31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15,
       14, 13, 12, 11, 10,  9,  8,  7,  6,  5,  4,  3,  2,  1])

In [53]:
unique_days[unique_days["month"] == 2].iloc[0]

year                                                  2020
month                                                    2
day      [29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 1...
Name: 1, dtype: object

In [109]:
# Check for missing days in the dataset
import calendar

def check_missing_day(unique_days):
    """Print every (year, month) whose count of distinct days differs from the calendar."""
    for _, row in unique_days.iterrows():
        year, month = int(row["year"]), int(row["month"])
        # monthrange returns (weekday of the 1st, number of days) and handles leap years
        expected = calendar.monthrange(year, month)[1]
        if len(row["day"]) != expected:
            print(f"{year}-{month:02d}: {len(row['day'])} of {expected} days present")


In [110]:
check_missing_day(unique_days)


No missing days were found: every month present in the dataset contains all of its calendar days.
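
An alternative way to reach the same conclusion, shown here only as a sketch on toy dates, is to build the full calendar between the first and last date and take the difference with the observed dates. The toy dates are hypothetical, with one day removed on purpose.

```python
import pandas as pd

# Toy dates (hypothetical): one week with 2020-01-04 deliberately left out
dates = pd.to_datetime(
    ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-05", "2020-01-06", "2020-01-07"]
)

# Every calendar day between the first and last observed date
full_range = pd.date_range(dates.min(), dates.max(), freq="D")

# Days present in the calendar but absent from the data
missing_days = full_range.difference(dates)
print(missing_days)
```

On the real data, `dates` would be the unique values of `df.Date`, and an empty result would confirm that no day is missing.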

In [None]:
# Measurements per day
measurements = df["Date"].value_counts().sort_index()
measurements[measurements != 24]

Date
2020-03-29    23
2020-10-25    25
2021-03-28    23
2021-10-31    25
2022-03-27    23
2022-10-30    25
2023-03-26    23
2023-10-29    25
2024-03-31    23
2024-10-27    25
2025-03-30    23
2025-10-26    25
Name: count, dtype: int64

In [165]:
day = "2020-10-25"
one_day = df[df.Date == day]
one_day

Unnamed: 0,Date,Start_Hour,End_Hour,Source,Day_of_Year,Day_Name,Month_Name,Season,Production,year,month,day
44688,2020-10-25,19,20,Wind,299,Sunday,October,Fall,4809,2020,10,25
44689,2020-10-25,20,21,Wind,299,Sunday,October,Fall,5286,2020,10,25
44690,2020-10-25,17,18,Wind,299,Sunday,October,Fall,5228,2020,10,25
44691,2020-10-25,6,7,Wind,299,Sunday,October,Fall,8401,2020,10,25
44692,2020-10-25,13,14,Wind,299,Sunday,October,Fall,7456,2020,10,25
44693,2020-10-25,1,2,Wind,299,Sunday,October,Fall,11525,2020,10,25
44694,2020-10-25,0,1,Wind,299,Sunday,October,Fall,11467,2020,10,25
44695,2020-10-25,22,23,Wind,299,Sunday,October,Fall,6154,2020,10,25
44696,2020-10-25,15,16,Wind,299,Sunday,October,Fall,7515,2020,10,25
44697,2020-10-25,12,13,Wind,299,Sunday,October,Fall,8191,2020,10,25


In [167]:
one_day.shape
# 25 measurements on this day

(25, 12)

In [178]:
one_day.sort_values("Start_Hour")

Unnamed: 0,Date,Start_Hour,End_Hour,Source,Day_of_Year,Day_Name,Month_Name,Season,Production,year,month,day
44694,2020-10-25,0,1,Wind,299,Sunday,October,Fall,11467,2020,10,25
44693,2020-10-25,1,2,Wind,299,Sunday,October,Fall,11525,2020,10,25
44707,2020-10-25,2,3,Wind,299,Sunday,October,Fall,10696,2020,10,25
44698,2020-10-25,2,3,Wind,299,Sunday,October,Fall,11001,2020,10,25
44703,2020-10-25,3,4,Wind,299,Sunday,October,Fall,8774,2020,10,25
44705,2020-10-25,4,5,Wind,299,Sunday,October,Fall,8234,2020,10,25
44699,2020-10-25,5,6,Wind,299,Sunday,October,Fall,7800,2020,10,25
44691,2020-10-25,6,7,Wind,299,Sunday,October,Fall,8401,2020,10,25
44708,2020-10-25,7,8,Wind,299,Sunday,October,Fall,8284,2020,10,25
44702,2020-10-25,8,9,Wind,299,Sunday,October,Fall,8012,2020,10,25


Some days have 25 measurements and others have only 23.
The initial hypothesis was that the data was duplicated, but the duplicated method found no identical rows, and inspecting one of these dates shows that the Production values differ between the two rows that share the same day and the same time period (Start_Hour 2 on 2020-10-25).

All twelve dates fall on the last Sunday of March (23 rows) or the last Sunday of October (25 rows), which are the daylight saving time transitions in France. The most likely explanation is therefore that the hours are recorded in local time: under this reading, the hour from 2 to 3 does not exist on the March dates and occurs twice on the October dates. Since the dataset carries no information about the origin of the production, i.e. the power plant, other causes cannot be ruled out, but the pattern does not require a second power plant to explain it.

This directly affects how we will conduct the bivariate analysis: we will use the mean of the production per hour, and we will further investigate the discretization of the measurement hours to verify whether it is possible to identify periods of increased production.
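
One way to test whether the days with 23 or 25 rows are explained by clock changes is to count how many hours elapse in a local calendar day. The sketch below assumes the hours are recorded in French local time (`Europe/Paris`); that time zone and the helper `hours_in_local_day` are assumptions, since the dataset does not state its time zone.

```python
import pandas as pd

def hours_in_local_day(date: str, tz: str = "Europe/Paris") -> int:
    """Number of hours that elapse during one calendar day in the given time zone."""
    start = pd.Timestamp(date, tz=tz)
    # Midnight of the next calendar day, localised on its own so DST is respected
    end = (pd.Timestamp(date) + pd.Timedelta(days=1)).tz_localize(tz)
    return int((end - start) / pd.Timedelta(hours=1))

for d in ["2020-03-29", "2020-06-15", "2020-10-25"]:
    print(d, hours_in_local_day(d))
```

Under that assumption the function returns 23 for 2020-03-29, 24 for 2020-06-15, and 25 for 2020-10-25, which matches the row counts listed for those transition dates.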

### Conclusion of the Date column

- No days are missing from any month.
- The measurements for December 2025 do not exist (they are missing).
- No invalid data was found.
- Recommendation: convert the Date column from str to datetime to make it easier to work with.

Conclusion: The missing values for December 2025 limit the analysis of that month to 2020 to 2024.
Additionally, the discretization of the measurements during the day needs care,
because some days lack one hourly period and others contain two measurements for the same hour.
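
A simple way to handle two measurements for the same hour is to average them, so that each date and hour contributes exactly one value to later analysis. The sketch below uses a toy frame whose values are copied from the 2020-10-25 rows shown earlier; averaging is one possible choice, not the only one.

```python
import pandas as pd

# Toy rows taken from the 2020-10-25 output above: hour 2 appears twice
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2020-10-25"] * 4),
    "Start_Hour": [1, 2, 2, 3],
    "Production": [11525, 10696, 11001, 8774],
})

# One row per (Date, Start_Hour): repeated hours are averaged
hourly = toy.groupby(["Date", "Start_Hour"], as_index=False)["Production"].mean()
print(hourly)
```

Keeping both rows instead would give the repeated hour double weight in any daily total or hourly mean.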