## Demand Forecasting: Regression Analysis and Model Training

### 1. Problem Statement
- The energy industry is undergoing rapid modernization. Infrastructure upgrades, the integration of intermittent renewable energy sources, and evolving consumer demands are reshaping the sector. This progress brings challenges: supply, demand, and prices are increasingly volatile, which makes the future less predictable, and the industry's traditional business models are being fundamentally challenged. In this competitive and dynamic landscape, accurate decision-making is pivotal. The industry relies heavily on probabilistic forecasts to navigate this uncertainty, so precise forecasting methods are essential for helping stakeholders make strategic decisions.

### 2. Data Ingestion

#### 2.1 Import Data and Required Packages
- Importing Pandas, NumPy, Matplotlib, Seaborn, and Scikit-learn

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler

#### 2.2 Import the CSV Data as Pandas DataFrame
- Importing both Demand and Weather Data of Demand Forecasting and merging them

In [None]:
df_demand = pd.read_csv('../../dataset/Demand Forecasting/Demand Forecasting Demand Data upto Feb 21.csv', sep=',')
df_weather = pd.read_csv('../../dataset/Demand Forecasting/Demand Forecasting Weather Data upto Feb 28.csv', sep=',')
df_merged = pd.merge(left=df_demand, right=df_weather, on='datetime')
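Judging by the file names, the demand data ends on Feb 21 and the weather data on Feb 28, so the default inner merge keeps only the timestamps found in both files. The sketch below uses small made-up frames (the dates and values are illustrative, not taken from the dataset) to show one way to count unmatched rows with `indicator=True` and to detect duplicate timestamps with `validate`.

```python
import pandas as pd

# Toy stand-ins for df_demand and df_weather (values are made up for illustration)
demand = pd.DataFrame({
    "datetime": ["2021-02-20 00:00", "2021-02-21 00:00"],
    "Demand (MW)": [1200.0, 1150.0],
})
weather = pd.DataFrame({
    "datetime": ["2021-02-20 00:00", "2021-02-21 00:00", "2021-02-22 00:00"],
    "Temperature": [4.1, 3.8, 5.0],
})

# An outer merge with indicator=True labels each row as 'both', 'left_only' or 'right_only';
# validate='one_to_one' raises an error if either frame has duplicate timestamps
check = pd.merge(demand, weather, on="datetime", how="outer",
                 indicator=True, validate="one_to_one")
merge_counts = check["_merge"].value_counts()
print(merge_counts)

# The default inner merge keeps only timestamps present in both frames
inner = pd.merge(demand, weather, on="datetime")
print(len(inner))
```

Running the same check on the real frames before merging would show how many weather rows have no matching demand record.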

### 3. Data Preprocessing and Visualizations

#### 3.1 Show First and Last 5 Records
 - Showing the first 5 and last 5 records


In [None]:
df_merged.head()

In [None]:
df_merged.tail()

#### 3.2 Checking if Unnamed columns have any data
- Counting the non-null values in the unnamed columns and removing the columns that are empty

In [None]:
for i in range(21, 26):
    column_name = f'Unnamed: {i}'
    count_non_null = df_merged[column_name].notna().sum()
    print(f"Non-null values in {column_name}: {count_non_null}")

In [None]:
columns_to_drop = ['Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23', 'Unnamed: 24', 'Unnamed: 25']
df_merged.drop(columns_to_drop, inplace=True, axis=1)
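The cells above hard-code the column names `Unnamed: 21` to `Unnamed: 25`. As a hedged alternative, the sketch below selects columns by their `Unnamed` prefix and drops only those that are entirely empty. The toy frame and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: one real column, one empty "Unnamed" column, one "Unnamed" column holding data
toy = pd.DataFrame({
    "Temperature": [4.1, 3.8, 5.0],
    "Unnamed: 21": [np.nan, np.nan, np.nan],
    "Unnamed: 22": [np.nan, 7.0, np.nan],
})

# Find the unnamed columns, then keep only those with no data at all
unnamed = [c for c in toy.columns if c.startswith("Unnamed")]
empty_unnamed = [c for c in unnamed if toy[c].isna().all()]

cleaned = toy.drop(columns=empty_unnamed)
print(list(cleaned.columns))
```

This keeps an unnamed column that unexpectedly contains values, so it can be inspected instead of silently discarded.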

#### 3.3 Performing Data Checks
- Checking data types and non-null counts to find columns with null values

In [None]:
df_merged.info()
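`info()` reports non-null counts, which leaves the number of missing values to be worked out by hand. A minimal sketch of a more direct null summary follows, using a made-up frame in place of `df_merged`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_merged (values are made up for illustration)
toy = pd.DataFrame({
    "Temperature": [4.1, np.nan, 5.0, 6.2],
    "severerisk": [np.nan, np.nan, np.nan, 10.0],
    "humidity": [80.0, 75.0, 70.0, 65.0],
})

# Count of nulls and share of nulls per column
summary = pd.DataFrame({
    "nulls": toy.isna().sum(),
    "share": toy.isna().mean(),
})

# Show only the columns that have nulls, worst first
summary = summary[summary["nulls"] > 0].sort_values("nulls", ascending=False)
print(summary)
```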

#### 3.4 Filling the most appropriate value for severerisk
- Filling NaN values in severerisk with 0 for a more appropriate correlation analysis

In [None]:
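# Assign the result back: calling fillna(inplace=True) on a selected column is deprecated in recent pandas
df_merged['severerisk'] = df_merged['severerisk'].fillna(0)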

#### 3.5 Dropping redundant data
- Dropping preciptype and precipprob because precipitation carries more accurate, non-null data; similarly, dropping windgust and keeping windspeed

In [None]:
df_merged.drop(["precipprob", "preciptype" ], inplace=True, axis=1)
df_merged.drop(['windgust'], inplace=True, axis=1)

In [None]:
df_merged

#### 3.6 Interpolation of data
- Using the .interpolate() method to fill NaN values with linearly interpolated values

In [None]:
for column in df_merged.columns[3:17]:
    df_merged[column] = df_merged[column].interpolate(method='linear', limit_direction='forward', axis=0)
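To make the behaviour of `limit_direction='forward'` concrete, here is a small sketch on a made-up series. Gaps between two known values are filled linearly and a trailing gap takes the last known value, but a NaN at the very start stays NaN because there is no earlier value to interpolate from.

```python
import numpy as np
import pandas as pd

# Made-up series with a leading, an interior and a trailing NaN
s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

filled = s.interpolate(method="linear", limit_direction="forward")
print(filled.tolist())

# Leading NaNs survive forward interpolation, so it is worth checking for leftovers
print(filled.isna().sum())
```

If any NaN values remain after this step in the real data, they would need separate handling (for example a backward fill), which the notebook does not do.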

#### 3.7 Histogram & KDE
 - The histogram and KDE suggest that the distribution of the 'Demand (MW)' column closely resembles a log-normal distribution.

In [None]:
df_merged['datetime'] = pd.to_datetime(df_merged['datetime'])

# Create a figure and axis
fig, ax = plt.subplots(figsize=(10, 6))

# Plot histogram
sns.histplot(df_merged['Demand (MW)'], kde=False, bins=30, color='skyblue', ax=ax)
ax.set_title('Histogram of Demand (MW)')
ax.set_xlabel('Demand (MW)')
ax.set_ylabel('Frequency')

# New axis for the KDE plot
ax2 = ax.twinx()
sns.kdeplot(df_merged['Demand (MW)'], color='orange', ax=ax2)
ax2.set_ylabel('KDE', color='orange')

plt.show()
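One simple way to probe the log-normal reading of the plot is to compare the skewness of the raw values with the skewness of their logarithm: for log-normal data, the log values should be roughly symmetric. The sketch below uses synthetic data with arbitrary parameters as a stand-in for `df_merged['Demand (MW)']`; it illustrates the check and is not a result from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for df_merged['Demand (MW)']; mean and sigma are arbitrary
demand = pd.Series(rng.lognormal(mean=7.0, sigma=0.5, size=5000))

# Skewness of the raw values versus skewness after a log transform
raw_skew = demand.skew()
log_skew = np.log(demand).skew()

print(f"skew of raw values: {raw_skew:.2f}")
print(f"skew of log values: {log_skew:.2f}")
```

On the real column, a log-skew close to zero would support the log-normal description, while a large value would argue against it.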

#### 3.8 Analyzing Correlation 
- Analyzing the correlation between Demand (MW) and the other parameters

In [None]:
for column in df_merged.columns[3:18]:
    print(f"Correlation of Demand (MW) with {column}: {df_merged['Demand (MW)'].corr(df_merged[column])}")

- Plotting a heatmap of the correlations

In [None]:
selected_columns = df_merged.columns[[1] + list(range(3, 10))]
correlation_matrix = df_merged[selected_columns].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='Greens', fmt=".2f", vmin=-1, vmax=1)
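The loop and heatmap above address columns by position. As a hedged alternative, the sketch below ranks every numeric column by the absolute value of its correlation with the target using `corrwith`, so it does not depend on column order. The toy frame is invented for illustration.

```python
import pandas as pd

# Toy stand-in for df_merged (values are made up for illustration)
toy = pd.DataFrame({
    "Demand (MW)": [1000.0, 1100.0, 1250.0, 1400.0, 1500.0],
    "Temperature": [10.0, 12.0, 15.0, 18.0, 20.0],
    "humidity": [80.0, 60.0, 75.0, 55.0, 70.0],
    "conditions": ["Clear", "Rain", "Clear", "Overcast", "Rain"],
})

# Keep numeric columns only, then correlate each of them with the target
numeric = toy.select_dtypes(include="number")
corr_with_demand = (
    numeric.drop(columns="Demand (MW)")
    .corrwith(numeric["Demand (MW)"])
    .sort_values(key=abs, ascending=False)  # strongest relationships first
)
print(corr_with_demand)
```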

#### 3.9 Regression Plots of highly correlated data
- Demand (MW) has a high correlation with Temperature and dewpoint; their regression plots are as follows.

In [None]:
sns.regplot(x='Temperature', y='Demand (MW)', data=df_merged, scatter_kws={'s': 10}, line_kws={'color': 'green'})

In [None]:
sns.regplot(x='dewpoint', y='Demand (MW)', data=df_merged, scatter_kws={'s': 10}, line_kws={'color': 'green'})

- Averaging the data by month and plotting regression plots of Temperature and dewpoint against Demand (MW)

In [None]:
# Work on a copy so that changes to tempdf do not modify df_merged
tempdf = df_merged.copy()
tempdf['datetime'] = pd.to_datetime(tempdf['datetime'])

numeric_columns = tempdf.select_dtypes(include=['number']).columns

# Group rows by calendar month and take the mean of each numeric column
monthlydf = tempdf.groupby(tempdf['datetime'].dt.to_period("M"))[numeric_columns].mean().reset_index()

In [None]:

sns.regplot(x='Temperature', y='Demand (MW)', data=monthlydf, scatter_kws={'s': 10}, line_kws={'color': 'green'})

In [None]:
sns.regplot(x='dewpoint', y='Demand (MW)', data=monthlydf, scatter_kws={'s': 10}, line_kws={'color': 'green'})
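The monthly aggregation used for these plots can be checked on a tiny example. The sketch below, with made-up timestamps and values, shows that `dt.to_period("M")` collapses all rows of a calendar month into one group whose numeric columns are averaged.

```python
import pandas as pd

# Toy stand-in for df_merged (timestamps and values are made up)
toy = pd.DataFrame({
    "datetime": pd.to_datetime(["2021-01-01 00:00", "2021-01-15 12:00",
                                "2021-02-01 00:00", "2021-02-10 06:00"]),
    "Demand (MW)": [1000.0, 1200.0, 900.0, 1100.0],
    "Temperature": [2.0, 4.0, 6.0, 8.0],
})

# One row per calendar month, holding the mean of each selected column
monthly = (
    toy.groupby(toy["datetime"].dt.to_period("M"))[["Demand (MW)", "Temperature"]]
    .mean()
    .reset_index()
)
print(monthly)
```

Because each month becomes a single point, regression plots on the monthly frame rest on far fewer observations than the hourly ones.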

#### 3.10 Dropping less significant data
Dropping less significant data after the correlation analysis, i.e. columns with very low correlation to Demand (MW) as well as redundant columns (e.g. solarradiation and uvindex, whose mutual correlation is almost 1; of each redundant pair, the column with the higher correlation to Demand (MW) is kept)


In [None]:
print(df_merged["solarradiation"].corr(df_merged["uvindex"]))
print(df_merged["Temperature"].corr(df_merged["feelslike"]))

In [None]:
df_merged.drop(['feelslike','uvindex', 'precipitation', 'sealevelpressure', 'snow', 'snowdepth', 'windspeed', 'winddirection'], inplace=True, axis=1)
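Redundant pairs such as solarradiation/uvindex can also be found programmatically instead of being checked one pair at a time. The sketch below lists every pair of columns whose absolute correlation exceeds a threshold. The threshold of 0.95 and the toy values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the numeric weather columns (values are made up)
toy = pd.DataFrame({
    "solarradiation": [0.0, 100.0, 300.0, 500.0, 800.0],
    "uvindex": [0.0, 1.0, 3.0, 5.0, 8.0],
    "humidity": [80.0, 60.0, 75.0, 55.0, 70.0],
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair is listed once and the diagonal is skipped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Pairs above the (assumed) redundancy threshold
pairs = upper.stack()
redundant_pairs = pairs[pairs > 0.95]
print(redundant_pairs)
```

For each reported pair, the column with the weaker correlation to Demand (MW) would be the candidate to drop, following the rule stated above.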

#### 3.11 Visualization of categorical data
- Plotting conditions vs Demand(MW)

In [None]:
plt.figure(figsize=(10, 5))
sns.barplot(x='conditions', y='Demand (MW)',hue='conditions', data=df_merged)

plt.title('Conditions vs Demand (MW)')
plt.xlabel('Conditions')
plt.ylabel('Demand (MW)')
plt.xticks(rotation=90)

plt.show()

### 4. Feature Engineering

#### 4.1 Normalization
- Normalizing the continuous values to the [0, 1] range with MinMaxScaler, which helps avoid vanishing gradient problems, to finalize the data before model training.

In [None]:
df_merged.iloc[:,3:10]

In [None]:

scaler = MinMaxScaler()
X = scaler.fit_transform(df_merged.iloc[:,3:10])

In [None]:
df_merged.iloc[:,1]

In [None]:
# Use a separate scaler for the target so it can later convert predictions back to MW
y_scaler = MinMaxScaler()
y = y_scaler.fit_transform(df_merged.iloc[:,1].values.reshape(-1,1))
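Min-max scaling maps each value x to (x - min) / (max - min), so the smallest value becomes 0 and the largest becomes 1. The sketch below checks this on made-up demand values and shows how a fitted target scaler can map scaled values back to MW with `inverse_transform`. The name `y_scaler` and the numbers are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up demand values in MW, shaped as a single column
demand_mw = np.array([900.0, 1000.0, 1250.0, 1500.0]).reshape(-1, 1)

# Fit a scaler dedicated to the target
y_scaler = MinMaxScaler()
y_scaled = y_scaler.fit_transform(demand_mw)

# The same result computed by hand from the min-max formula
manual = (demand_mw - demand_mw.min()) / (demand_mw.max() - demand_mw.min())

# Map scaled values (for example model predictions) back to MW
restored = y_scaler.inverse_transform(y_scaled)

print(y_scaled.ravel())
print(restored.ravel())
```

Keeping the target scaler separate from the feature scaler is what makes the last step possible once a model produces scaled predictions.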

#### 4.2 Handling Categorical Data
Using pandas get_dummies to handle categorical variables such as conditions, creating a new column of 0s and 1s for each category

In [None]:
# dtype=int yields 0/1 columns; recent pandas versions return booleans by default
dummies = pd.get_dummies(df_merged['conditions'], prefix='overcast', dtype=int)
df_final = pd.concat([df_merged, dummies], axis=1)
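As a small illustration of what the encoding produces, the sketch below one-hot encodes a made-up conditions column. The category values and the `conditions` prefix are assumptions chosen for this example; the notebook itself uses the prefix `overcast`.

```python
import pandas as pd

# Toy stand-in for df_merged['conditions'] (categories are made up)
toy = pd.DataFrame({"conditions": ["Clear", "Rain", "Clear", "Overcast"]})

# One 0/1 column per category, named <prefix>_<category>
dummies = pd.get_dummies(toy["conditions"], prefix="conditions", dtype=int)
encoded = pd.concat([toy, dummies], axis=1)
print(encoded)
```

Each row has exactly one 1 across the new columns, and the original conditions column is still present after the concat, so it would need to be dropped before the frame is passed to a model that expects numeric input.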

### 5. Conclusion
The final dataset before model training

In [None]:
df_final

In [None]:
df_final.info()

In [None]:
df_final.describe()