## ⚡ India Power Generation Region Prediction

Given *daily power generation data from India*, let's try to predict which **region** a given report comes from.

We will train eight different classification models and compare their accuracy on a held-out test set.

Data source: https://www.kaggle.com/datasets/navinmundhra/daily-power-generation-in-india-20172020?select=file_02.csv

### Importing Libraries

In [1]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, GradientBoostingClassifier, RandomForestClassifier

In [2]:
data = pd.read_csv('file_02.csv')
data

|  | index | Date | Region | Thermal Generation Actual (in MU) | Thermal Generation Estimated (in MU) | Nuclear Generation Actual (in MU) | Nuclear Generation Estimated (in MU) | Hydro Generation Actual (in MU) | Hydro Generation Estimated (in MU) |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2017-09-01 | Northern | 624.23 | 484.21 | 30.36 | 35.57 | 273.27 | 320.81 |
| 1 | 1 | 2017-09-01 | Western | 1106.89 | 1024.33 | 25.17 | 3.81 | 72.00 | 21.53 |
| 2 | 2 | 2017-09-01 | Southern | 576.66 | 578.55 | 62.73 | 49.80 | 111.57 | 64.78 |
| 3 | 3 | 2017-09-01 | Eastern | 441.02 | 429.39 | NaN | NaN | 85.94 | 69.36 |
| 4 | 4 | 2017-09-01 | NorthEastern | 29.11 | 15.91 | NaN | NaN | 24.64 | 21.21 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4940 | 305 | 2020-08-01 | Northern | 669.47 | 602.96 | 26.88 | 23.41 | 348.72 | 351.98 |
| 4941 | 306 | 2020-08-01 | Western | 1116.00 | 1262.10 | 42.37 | 36.63 | 54.67 | 20.28 |
| 4942 | 307 | 2020-08-01 | Southern | 494.66 | 415.53 | 61.83 | 26.28 | 93.49 | 77.25 |
| 4943 | 308 | 2020-08-01 | Eastern | 482.86 | 547.03 | NaN | NaN | 87.22 | 93.78 |


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4945 entries, 0 to 4944
Data columns (total 9 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   index                                 4945 non-null   int64  
 1   Date                                  4945 non-null   object 
 2   Region                                4945 non-null   object 
 3   Thermal Generation Actual (in MU)     4945 non-null   float64
 4   Thermal Generation Estimated (in MU)  4945 non-null   float64
 5   Nuclear Generation Actual (in MU)     2967 non-null   float64
 6   Nuclear Generation Estimated (in MU)  2967 non-null   float64
 7   Hydro Generation Actual (in MU)       4945 non-null   float64
 8   Hydro Generation Estimated (in MU)    4945 non-null   float64
dtypes: float64(6), int64(1), object(2)
memory usage: 347.8+ KB


### Preprocessing

In [4]:
# Drop the redundant index column (missing values are checked in the next cell)
data = data.drop('index', axis=1)

In [7]:
data.isna().sum()

Date                                       0
Region                                     0
Thermal Generation Actual (in MU)          0
Thermal Generation Estimated (in MU)       0
Nuclear Generation Actual (in MU)       1978
Nuclear Generation Estimated (in MU)    1978
Hydro Generation Actual (in MU)            0
Hydro Generation Estimated (in MU)         0
dtype: int64

In [8]:
for column in ['Nuclear Generation Actual (in MU)', 'Nuclear Generation Estimated (in MU)']:
    data[column] = data[column].fillna(data[column].mean())

In [10]:
print("Total missing values: ", data.isna().sum().sum())

Total missing values:  0
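In the rows displayed above, the missing nuclear values sit on the Eastern and NorthEastern reports. Whether that holds for the whole file is an assumption worth checking, because mean imputation then gives those rows one constant value. The sketch below uses a small hypothetical frame `toy`, with values copied from the first five rows shown earlier, to count missing values per region and to apply the same mean fill.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: first five rows shown in the notebook
nuclear = "Nuclear Generation Actual (in MU)"
toy = pd.DataFrame({
    "Region": ["Northern", "Western", "Southern", "Eastern", "NorthEastern"],
    nuclear: [30.36, 25.17, 62.73, np.nan, np.nan],
})

# Count missing nuclear readings per region
missing_by_region = toy[nuclear].isna().groupby(toy["Region"]).sum()
print(missing_by_region)

# Same strategy as the notebook: replace missing values with the column mean
filled = toy[nuclear].fillna(toy[nuclear].mean())
print(filled)
```

On the real data, the same `groupby` check could be run on `data` before the `fillna` cell to see how the 1978 missing values are distributed.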


#### Creating Year and Month columns

In [11]:
data

|  | Date | Region | Thermal Generation Actual (in MU) | Thermal Generation Estimated (in MU) | Nuclear Generation Actual (in MU) | Nuclear Generation Estimated (in MU) | Hydro Generation Actual (in MU) | Hydro Generation Estimated (in MU) |
|---|---|---|---|---|---|---|---|---|
| 0 | 2017-09-01 | Northern | 624.23 | 484.21 | 30.360000 | 35.570000 | 273.27 | 320.81 |
| 1 | 2017-09-01 | Western | 1106.89 | 1024.33 | 25.170000 | 3.810000 | 72.00 | 21.53 |
| 2 | 2017-09-01 | Southern | 576.66 | 578.55 | 62.730000 | 49.800000 | 111.57 | 64.78 |
| 3 | 2017-09-01 | Eastern | 441.02 | 429.39 | 37.242208 | 36.987877 | 85.94 | 69.36 |
| 4 | 2017-09-01 | NorthEastern | 29.11 | 15.91 | 37.242208 | 36.987877 | 24.64 | 21.21 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4940 | 2020-08-01 | Northern | 669.47 | 602.96 | 26.880000 | 23.410000 | 348.72 | 351.98 |
| 4941 | 2020-08-01 | Western | 1116.00 | 1262.10 | 42.370000 | 36.630000 | 54.67 | 20.28 |
| 4942 | 2020-08-01 | Southern | 494.66 | 415.53 | 61.830000 | 26.280000 | 93.49 | 77.25 |
| 4943 | 2020-08-01 | Eastern | 482.86 | 547.03 | 37.242208 | 36.987877 | 87.22 | 93.78 |


In [13]:
data['Date'] = pd.to_datetime(data['Date'])
data['Year'] = data['Date'].dt.year
data['Month'] = data['Date'].dt.month
data = data.drop('Date', axis=1)
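As a small illustration of the date step, here is a sketch on a hypothetical two-row frame `toy` holding the first and last dates shown above. Passing an explicit `format` is an addition, not something the notebook does; it assumes every date string follows the `YYYY-MM-DD` pattern seen in the displayed rows.

```python
import pandas as pd

# Hypothetical two-row frame using the first and last dates shown in the notebook
toy = pd.DataFrame({"Date": ["2017-09-01", "2020-08-01"]})

# Parse the strings into datetimes (explicit format is an assumption about the file)
toy["Date"] = pd.to_datetime(toy["Date"], format="%Y-%m-%d")

# Extract numeric year and month features, then drop the original column
toy["Year"] = toy["Date"].dt.year
toy["Month"] = toy["Date"].dt.month
toy = toy.drop("Date", axis=1)

print(toy)
```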

In [14]:
data

|  | Region | Thermal Generation Actual (in MU) | Thermal Generation Estimated (in MU) | Nuclear Generation Actual (in MU) | Nuclear Generation Estimated (in MU) | Hydro Generation Actual (in MU) | Hydro Generation Estimated (in MU) | Year | Month |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Northern | 624.23 | 484.21 | 30.360000 | 35.570000 | 273.27 | 320.81 | 2017 | 9 |
| 1 | Western | 1106.89 | 1024.33 | 25.170000 | 3.810000 | 72.00 | 21.53 | 2017 | 9 |
| 2 | Southern | 576.66 | 578.55 | 62.730000 | 49.800000 | 111.57 | 64.78 | 2017 | 9 |
| 3 | Eastern | 441.02 | 429.39 | 37.242208 | 36.987877 | 85.94 | 69.36 | 2017 | 9 |
| 4 | NorthEastern | 29.11 | 15.91 | 37.242208 | 36.987877 | 24.64 | 21.21 | 2017 | 9 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4940 | Northern | 669.47 | 602.96 | 26.880000 | 23.410000 | 348.72 | 351.98 | 2020 | 8 |
| 4941 | Western | 1116.00 | 1262.10 | 42.370000 | 36.630000 | 54.67 | 20.28 | 2020 | 8 |
| 4942 | Southern | 494.66 | 415.53 | 61.830000 | 26.280000 | 93.49 | 77.25 | 2020 | 8 |
| 4943 | Eastern | 482.86 | 547.03 | 37.242208 | 36.987877 | 87.22 | 93.78 | 2020 | 8 |


In [16]:
data.dtypes

Region                                   object
Thermal Generation Actual (in MU)       float64
Thermal Generation Estimated (in MU)    float64
Nuclear Generation Actual (in MU)       float64
Nuclear Generation Estimated (in MU)    float64
Hydro Generation Actual (in MU)         float64
Hydro Generation Estimated (in MU)      float64
Year                                      int32
Month                                     int32
dtype: object

In [17]:
# Encoding Labels
label_encoder = LabelEncoder()

data['Region'] = label_encoder.fit_transform(data['Region'])
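`LabelEncoder` assigns integer codes in sorted order of the class names, so it can help to print the mapping before reading the encoded table. The sketch below fits a separate, hypothetical encoder on the five region names that appear in the displayed rows and shows how to map codes back to names.

```python
from sklearn.preprocessing import LabelEncoder

# The five region names that appear in the rows shown in the notebook
regions = ["Northern", "Western", "Southern", "Eastern", "NorthEastern"]

encoder = LabelEncoder()
codes = encoder.fit_transform(regions)

# classes_ holds the names in sorted order; the position is the integer code
mapping = {str(name): code for code, name in enumerate(encoder.classes_)}
print(mapping)

# Map integer predictions back to region names
decoded = encoder.inverse_transform([2, 4])
print(decoded)
```

In the notebook itself, the same information is available from `label_encoder.classes_` after the `fit_transform` call.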

In [18]:
data.dtypes

Region                                    int64
Thermal Generation Actual (in MU)       float64
Thermal Generation Estimated (in MU)    float64
Nuclear Generation Actual (in MU)       float64
Nuclear Generation Estimated (in MU)    float64
Hydro Generation Actual (in MU)         float64
Hydro Generation Estimated (in MU)      float64
Year                                      int32
Month                                     int32
dtype: object

In [19]:
data

|  | Region | Thermal Generation Actual (in MU) | Thermal Generation Estimated (in MU) | Nuclear Generation Actual (in MU) | Nuclear Generation Estimated (in MU) | Hydro Generation Actual (in MU) | Hydro Generation Estimated (in MU) | Year | Month |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 624.23 | 484.21 | 30.360000 | 35.570000 | 273.27 | 320.81 | 2017 | 9 |
| 1 | 4 | 1106.89 | 1024.33 | 25.170000 | 3.810000 | 72.00 | 21.53 | 2017 | 9 |
| 2 | 3 | 576.66 | 578.55 | 62.730000 | 49.800000 | 111.57 | 64.78 | 2017 | 9 |
| 3 | 0 | 441.02 | 429.39 | 37.242208 | 36.987877 | 85.94 | 69.36 | 2017 | 9 |
| 4 | 1 | 29.11 | 15.91 | 37.242208 | 36.987877 | 24.64 | 21.21 | 2017 | 9 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4940 | 2 | 669.47 | 602.96 | 26.880000 | 23.410000 | 348.72 | 351.98 | 2020 | 8 |
| 4941 | 4 | 1116.00 | 1262.10 | 42.370000 | 36.630000 | 54.67 | 20.28 | 2020 | 8 |
| 4942 | 3 | 494.66 | 415.53 | 61.830000 | 26.280000 | 93.49 | 77.25 | 2020 | 8 |
| 4943 | 0 | 482.86 | 547.03 | 37.242208 | 36.987877 | 87.22 | 93.78 | 2020 | 8 |


In [20]:
# Separate the target (y) from the features (X)
y = data['Region'].copy()
X = data.drop('Region', axis=1).copy()

In [21]:
# Standardize each feature to zero mean and unit variance (fit on all rows, before the split)
scaler = StandardScaler()

X = scaler.fit_transform(X)

X

array([[ 0.05280804, -0.23786479, -0.55945652, ...,  2.9739146 ,
        -1.73545914,  0.66194702],
       [ 1.31138889,  1.17108805, -0.98135303, ..., -0.67425516,
        -1.73545914,  0.66194702],
       [-0.07123516,  0.00822981,  2.07190954, ..., -0.14704538,
        -1.73545914,  0.66194702],
       ...,
       [-0.28505779, -0.4170229 ,  1.9987483 , ...,  0.00496169,
         1.67228888,  0.38073327],
       [-0.31582739, -0.07399302,  0.        , ...,  0.20645944,
         1.67228888,  0.38073327],
       [-1.4851764 , -1.41538332,  0.        , ..., -0.57149497,
         1.67228888,  0.38073327]])

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=1)

In [23]:
X_train.shape

(3461, 8)

In [24]:
X_test.shape

(1484, 8)
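The scaler in this notebook is fit on all rows before the split, so the test rows contribute to the means and standard deviations used for scaling. One common alternative is to split first and wrap the scaler and the model in a pipeline, so the scaler only sees training rows. The sketch below shows the pattern on synthetic data; `X_demo`, `y_demo`, and the `make_classification` settings are hypothetical stand-ins for the real features and labels, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: 8 features, 5 classes
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=6, n_redundant=0,
    n_classes=5, n_clusters_per_class=1, random_state=1,
)

# Split the raw (unscaled) features first
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=1
)

# fit() learns the scaling statistics from the training rows only;
# score() applies those same statistics to the test rows
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)
accuracy = pipe.score(X_te, y_te)
print(f"{accuracy:.5f}")
```

Any of the eight classifiers could take the place of `LogisticRegression` in the pipeline.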

### Modeling/Training

In [25]:
models = [
    LogisticRegression(),
    SVC(),
    MLPClassifier(),
    DecisionTreeClassifier(),
    AdaBoostClassifier(),
    BaggingClassifier(),
    GradientBoostingClassifier(),
    RandomForestClassifier()
]

In [26]:
model_names = [
    "         Logistic Regression",
    "      Support Vector Machine",
    "              Neural Network",
    "               Decision Tree",
    "         AdaBoost Classifier",
    "          Bagging Classifier",
    "Gradient Boosting Classifier",
    "    Random Forest Classifier"
]

In [27]:
results = []

for model in models:
    model.fit(X_train, y_train)
    results.append(model.score(X_test, y_test))

### Results

In [29]:
for name, score in zip(model_names, results):
    print(f"{name}: {score:.5f}")

         Logistic Regression: 0.99798
      Support Vector Machine: 0.99865
              Neural Network: 0.99865
               Decision Tree: 0.99933
         AdaBoost Classifier: 0.99933
          Bagging Classifier: 0.99865
Gradient Boosting Classifier: 0.99933
    Random Forest Classifier: 0.99865
