## Hamoye Data Science Internship Stage C
### Machine Learning: Classification

## Data Description
Stability of the Grid System

Electrical grids require a balance between electricity supply and demand in order to be stable. Conventional systems achieve this balance through demand-driven electricity production. For future grids with a high share of inflexible (i.e., renewable) energy source, the concept of demand response is a promising solution. This implies changes in electricity consumption in relation to electricity price changes. In this work, we’ll build a binary classification model to predict if a grid is stable or unstable using the UCI Electrical Grid Stability Simulated dataset.
Dataset: https://archive.ics.uci.edu/ml/datasets/Electrical+Grid+Stability+Simulated+Data+
It has 12 primary predictive features and two dependent variables.

Predictive features:
1.	'tau1' to 'tau4': the reaction time of each network participant, a real value within the range 0.5 to 10 ('tau1' corresponds to the supplier node, 'tau2' to 'tau4' to the consumer nodes);
2.	'p1' to 'p4': nominal power produced (positive) or consumed (negative) by each network participant, a real value within the range -2.0 to -0.5 for consumers ('p2' to 'p4'). As the total power consumed equals the total power generated, p1 (supplier node) = - (p2 + p3 + p4);
3.	'g1' to 'g4': price elasticity coefficient for each network participant, a real value within the range 0.05 to 1.00 ('g1' corresponds to the supplier node, 'g2' to 'g4' to the consumer nodes; 'g' stands for 'gamma');

Dependent variables:
1.	'stab': the maximum real part of the characteristic differential equation root (if positive, the system is linearly unstable; if negative, linearly stable);
2.	'stabf': a categorical (binary) label ('stable' or 'unstable').


## Work Description

Because of the direct relationship between 'stab' and 'stabf' ('stabf' = 'stable' if 'stab' <= 0, 'unstable' otherwise), 'stab' should be dropped and 'stabf' will remain as the sole dependent variable (binary classification).
Split the data into an 80-20 train-test split with a random state of “1”. 

Use the standard scaler to transform the train set (x_train, y_train) and the test set (x_test). 
Use scikit learn to train a random forest and extra trees classifier. 
Use xgboost and lightgbm to train an extreme boosting model and a light gradient boosting model.
Use random_state = 1 for training all models and evaluate on the test set. 

Also, to improve the Extra Trees Classifier, you will use the following parameters (number of estimators, minimum number of samples, minimum number of samples for leaf node and the number of features to consider when looking for the best split) for the hyperparameter grid needed to run a Randomized Cross Validation Search (RandomizedSearchCV). 

n_estimators = [50, 100, 300, 500, 1000]
min_samples_split = [2, 3, 5, 7, 9]
min_samples_leaf = [1, 2, 4, 6, 8]
max_features = ['auto', 'sqrt', 'log2', None] 
hyperparameter_grid = {'n_estimators': n_estimators,
                       'min_samples_leaf': min_samples_leaf,
                       'min_samples_split': min_samples_split,
                       'max_features': max_features}
`

### Importing relevant Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
data_df = pd.read_csv('Data_for_UCI_named.csv')

In [3]:
data_df.head()

Unnamed: 0,tau1,tau2,tau3,tau4,p1,p2,p3,p4,g1,g2,g3,g4,stab,stabf
0,2.95906,3.079885,8.381025,9.780754,3.763085,-0.782604,-1.257395,-1.723086,0.650456,0.859578,0.887445,0.958034,0.055347,unstable
1,9.304097,4.902524,3.047541,1.369357,5.067812,-1.940058,-1.872742,-1.255012,0.413441,0.862414,0.562139,0.78176,-0.005957,stable
2,8.971707,8.848428,3.046479,1.214518,3.405158,-1.207456,-1.27721,-0.920492,0.163041,0.766689,0.839444,0.109853,0.003471,unstable
3,0.716415,7.6696,4.486641,2.340563,3.963791,-1.027473,-1.938944,-0.997374,0.446209,0.976744,0.929381,0.362718,0.028871,unstable
4,3.134112,7.608772,4.943759,9.857573,3.525811,-1.125531,-1.845975,-0.554305,0.79711,0.45545,0.656947,0.820923,0.04986,unstable


In [4]:
data_df.describe()

Unnamed: 0,tau1,tau2,tau3,tau4,p1,p2,p3,p4,g1,g2,g3,g4,stab
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,5.25,5.250001,5.250004,5.249997,3.75,-1.25,-1.25,-1.25,0.525,0.525,0.525,0.525,0.015731
std,2.742548,2.742549,2.742549,2.742556,0.75216,0.433035,0.433035,0.433035,0.274256,0.274255,0.274255,0.274255,0.036919
min,0.500793,0.500141,0.500788,0.500473,1.58259,-1.999891,-1.999945,-1.999926,0.050009,0.050053,0.050054,0.050028,-0.08076
25%,2.874892,2.87514,2.875522,2.87495,3.2183,-1.624901,-1.625025,-1.62496,0.287521,0.287552,0.287514,0.287494,-0.015557
50%,5.250004,5.249981,5.249979,5.249734,3.751025,-1.249966,-1.249974,-1.250007,0.525009,0.525003,0.525015,0.525002,0.017142
75%,7.62469,7.624893,7.624948,7.624838,4.28242,-0.874977,-0.875043,-0.875065,0.762435,0.76249,0.76244,0.762433,0.044878
max,9.999469,9.999837,9.99945,9.999443,5.864418,-0.500108,-0.500072,-0.500025,0.999937,0.999944,0.999982,0.99993,0.109403


In [5]:
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   tau1    10000 non-null  float64
 1   tau2    10000 non-null  float64
 2   tau3    10000 non-null  float64
 3   tau4    10000 non-null  float64
 4   p1      10000 non-null  float64
 5   p2      10000 non-null  float64
 6   p3      10000 non-null  float64
 7   p4      10000 non-null  float64
 8   g1      10000 non-null  float64
 9   g2      10000 non-null  float64
 10  g3      10000 non-null  float64
 11  g4      10000 non-null  float64
 12  stab    10000 non-null  float64
 13  stabf   10000 non-null  object 
dtypes: float64(13), object(1)
memory usage: 1.1+ MB


In [6]:
# data with stab column is dropped
data_df1 = data_df.drop('stab', axis =1)
data_df1.head()

Unnamed: 0,tau1,tau2,tau3,tau4,p1,p2,p3,p4,g1,g2,g3,g4,stabf
0,2.95906,3.079885,8.381025,9.780754,3.763085,-0.782604,-1.257395,-1.723086,0.650456,0.859578,0.887445,0.958034,unstable
1,9.304097,4.902524,3.047541,1.369357,5.067812,-1.940058,-1.872742,-1.255012,0.413441,0.862414,0.562139,0.78176,stable
2,8.971707,8.848428,3.046479,1.214518,3.405158,-1.207456,-1.27721,-0.920492,0.163041,0.766689,0.839444,0.109853,unstable
3,0.716415,7.6696,4.486641,2.340563,3.963791,-1.027473,-1.938944,-0.997374,0.446209,0.976744,0.929381,0.362718,unstable
4,3.134112,7.608772,4.943759,9.857573,3.525811,-1.125531,-1.845975,-0.554305,0.79711,0.45545,0.656947,0.820923,unstable


In [7]:
X = data_df1.drop('stabf', axis = 1)
y = data_df1['stabf']

In [8]:
# splitting the dataset into train set and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)
y_train.value_counts()

unstable    5092
stable      2908
Name: stabf, dtype: int64

In [9]:
# Using StandardScaler to transform the train set and test set 
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_train_df = scaler.fit_transform(x_train)
scaled_train_df = pd.DataFrame(scaled_train_df, columns = x_train.columns)
scaled_test_df = scaler.fit_transform(x_test)
scaled_test_df = pd.DataFrame(scaled_test_df, columns = x_test.columns)

In [10]:
# Importing Random Forest Classifier and fitting it to train set
from sklearn.ensemble import RandomForestClassifier
RFClf = RandomForestClassifier(random_state = 1)
RFClf.fit(scaled_train_df, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=1, verbose=0,
                       warm_start=False)

In [11]:
RFClf_pred = RFClf.predict(scaled_test_df)
RFClf_pred

array(['unstable', 'unstable', 'stable', ..., 'stable', 'stable',
       'unstable'], dtype=object)

In [12]:
# Random Forest  Performance measure

# Accuracy
from sklearn.metrics import recall_score, accuracy_score, precision_score, f1_score, confusion_matrix
accuracy = accuracy_score(y_test, RFClf_pred)
round (accuracy, 4)

0.928

In [13]:
# Precision
precision = precision_score(y_test, RFClf_pred, pos_label = 'stable')
print('Precision: {}'. format(round(precision*100),2))

Precision: 92.0


In [14]:
# Recall
recall = recall_score(y_test, RFClf_pred, pos_label = 'stable')
print('Recall: {}'.format(round(recall*100),2))

Recall: 88.0


In [15]:
# F1-Score
f1 = f1_score(y_test, RFClf_pred, pos_label = 'stable')
print('F1: {}'.format(round(f1*100), 2))

F1: 90.0


In [16]:
from sklearn.metrics import accuracy_score, classification_report
print(classification_report(y_test, RFClf_pred, digits = 4))

              precision    recall  f1-score   support

      stable     0.9176    0.8764    0.8966       712
    unstable     0.9333    0.9565    0.9448      1288

    accuracy                         0.9280      2000
   macro avg     0.9255    0.9165    0.9207      2000
weighted avg     0.9277    0.9280    0.9276      2000



In [17]:
# Confusion Matrix
from sklearn.metrics import recall_score, accuracy_score, precision_score, f1_score, confusion_matrix
cnf_matrix = confusion_matrix(y_test, RFClf_pred, labels = ['unstable', 'stable'])
cnf_matrix

array([[1232,   56],
       [  88,  624]], dtype=int64)

In [18]:
# ExtraTrees Classifier
from sklearn.ensemble import ExtraTreesClassifier
ETclf = ExtraTreesClassifier(random_state = 1)
ETclf.fit(scaled_train_df, y_train)

ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                     criterion='gini', max_depth=None, max_features='auto',
                     max_leaf_nodes=None, max_samples=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=100,
                     n_jobs=None, oob_score=False, random_state=1, verbose=0,
                     warm_start=False)

In [19]:
ETclf_pred = ETclf.predict(scaled_test_df)
ETclf_pred

array(['unstable', 'unstable', 'stable', ..., 'stable', 'unstable',
       'unstable'], dtype=object)

In [20]:
# ExtraTrees Perfomance measure

accuracy = accuracy_score(y_test, ETclf_pred)
accuracy

0.926

In [21]:
# Confusion Matrix
cnf_matrix1 = confusion_matrix(y_test, ETclf_pred)
cnf_matrix1

array([[ 602,  110],
       [  38, 1250]], dtype=int64)

In [22]:
# Recall
recall = recall_score(y_test, ETclf_pred, pos_label = 'stable')
print('Recall: {}'.format(round(recall*100),2))

Recall: 85.0


In [23]:
# Precision
precision = precision_score(y_test, ETclf_pred, pos_label = 'stable')
print('Precision: {}'. format(round(precision*100),2))

Precision: 94.0


In [24]:
# F1-Score
f1 = f1_score(y_test, ETclf_pred, pos_label = 'stable')
print('F1: {}'.format(round(f1*100), 2))

F1: 89.0


In [25]:
# Applying Hyper-parameter tuning and Randomized Search Cross Validation
from sklearn.model_selection import RandomizedSearchCV

In [26]:
# Hyperparameter_grid
n_estimators = [50, 100, 300, 500, 1000]
min_samples_split = [2, 3, 5, 7, 9]
min_samples_leaf = [1, 2, 4, 6, 8]
max_features = ['auto', 'sqrt', 'log2', None]
hyperparameter_grid = {'n_estimators': n_estimators, 'min_samples_split': min_samples_split, 'min_samples_leaf': min_samples_leaf,'max_features':max_features}

In [27]:
ETClf_randomized_search = RandomizedSearchCV(ETclf, hyperparameter_grid, cv=5, n_iter=10, scoring='accuracy',n_jobs=-1, verbose = 1, random_state=1)
best_ETClf_random = ETClf_randomized_search.fit(scaled_train_df, y_train)
print(best_ETClf_random.best_estimator_)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:  1.2min finished


ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                     criterion='gini', max_depth=None, max_features=None,
                     max_leaf_nodes=None, max_samples=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=8, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=1000,
                     n_jobs=None, oob_score=False, random_state=1, verbose=0,
                     warm_start=False)


In [28]:
ETclf = ExtraTreesClassifier(n_estimators=1000, min_samples_leaf = 8, random_state = 1)

In [29]:
ETclf.fit(scaled_train_df, y_train)

ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                     criterion='gini', max_depth=None, max_features='auto',
                     max_leaf_nodes=None, max_samples=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=8, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=1000,
                     n_jobs=None, oob_score=False, random_state=1, verbose=0,
                     warm_start=False)

In [30]:
# Exploring the feature importance
feature_importance = pd.DataFrame(ETclf.feature_importances_, index=scaled_train_df.columns, 
                                  columns=['Importance']).sort_values('Importance', ascending = False)

In [31]:
feature_importance

Unnamed: 0,Importance
tau2,0.145422
tau1,0.14157
tau4,0.138173
tau3,0.136662
g3,0.106166
g2,0.102486
g4,0.100301
g1,0.092135
p3,0.01008
p2,0.009922


In [32]:
ETclf_pred1 = ETclf.predict(scaled_test_df)
ETclf_pred1

array(['unstable', 'unstable', 'stable', ..., 'stable', 'unstable',
       'unstable'], dtype=object)

In [33]:
# Performance measure

In [34]:
# Accuracy
accuracy = accuracy_score(y_test, ETclf_pred1)
accuracy

0.9105

In [35]:
# Confusion Matrix
cnf_matrix1 = confusion_matrix(y_test, ETclf_pred1)
cnf_matrix1

array([[ 546,  166],
       [  13, 1275]], dtype=int64)

In [36]:
# Recall
recall = recall_score(y_test, ETclf_pred1, pos_label = 'stable')
print('Recall: {}'.format(round(recall*100),2))

Recall: 77.0


In [37]:
# Precision
precision = precision_score(y_test, ETclf_pred1, pos_label = 'stable')
print('Precision: {}'. format(round(precision*100),2))

Precision: 98.0


In [38]:
# F1-Score
f1 = f1_score(y_test, ETclf_pred1, pos_label = 'stable')
print('F1: {}'.format(round(f1*100), 2))

F1: 86.0


In [39]:
# Classification report
print(classification_report(y_test, ETclf_pred1, digits = 4))

              precision    recall  f1-score   support

      stable     0.9767    0.7669    0.8592       712
    unstable     0.8848    0.9899    0.9344      1288

    accuracy                         0.9105      2000
   macro avg     0.9308    0.8784    0.8968      2000
weighted avg     0.9175    0.9105    0.9076      2000



In [40]:
# Extreme Boosting model - Xgboost
from xgboost import XGBClassifier
xgb = XGBClassifier(random_state = 1)
xgb.fit(scaled_train_df, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=1,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

In [41]:
xgb_pred = xgb.predict(scaled_test_df)
xgb_pred

array(['unstable', 'unstable', 'stable', ..., 'stable', 'unstable',
       'unstable'], dtype=object)

In [42]:
# XGBClassifier Performance Measure

# Accuracy
xgb_accuracy = accuracy_score(y_test, xgb_pred)
print('Accuracy: {}'.format(round(xgb_accuracy*100), 3))

Accuracy: 92.0


In [43]:
# Confusion Matrix
cnf_matrix = confusion_matrix(y_test, xgb_pred)
cnf_matrix

array([[ 599,  113],
       [  49, 1239]], dtype=int64)

In [44]:
# Recall
xgb_recall = recall_score(y_test, xgb_pred, pos_label = 'stable')
print('Recall: {}'.format(round(xgb_recall*100),2))

Recall: 84.0


In [45]:
# Precision
xgb_precision = precision_score(y_test, xgb_pred, pos_label = 'stable')
print('XGB_Precision: {}'. format(round(xgb_precision*100),2))

XGB_Precision: 92.0


In [46]:
# F1-Score
XGB_f1 = f1_score(y_test, xgb_pred, pos_label = 'stable')
print('XGB_F1: {}'.format(round(XGB_f1*100), 2))

XGB_F1: 88.0


In [47]:
# Confusion Matrix
cnf_matrix = confusion_matrix(y_test, xgb_pred)
cnf_matrix

array([[ 599,  113],
       [  49, 1239]], dtype=int64)

In [48]:
# Classification report
print(classification_report(y_test, xgb_pred, digits = 4))

              precision    recall  f1-score   support

      stable     0.9244    0.8413    0.8809       712
    unstable     0.9164    0.9620    0.9386      1288

    accuracy                         0.9190      2000
   macro avg     0.9204    0.9016    0.9098      2000
weighted avg     0.9193    0.9190    0.9181      2000



In [49]:
# Light Gadient Boosting - lightgbm
from lightgbm import LGBMClassifier
lgbm = LGBMClassifier(random_state = 1)
lgbm.fit(scaled_train_df, y_train)

LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.1, max_depth=-1,
               min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
               n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
               random_state=1, reg_alpha=0.0, reg_lambda=0.0, silent=True,
               subsample=1.0, subsample_for_bin=200000, subsample_freq=0)

In [50]:
lgbm_pred = lgbm.predict(scaled_test_df)
lgbm_pred

array(['unstable', 'unstable', 'stable', ..., 'stable', 'unstable',
       'unstable'], dtype=object)

In [51]:
# LGBMClassifier Performance Measure

# Accuracy
lgbm_accuracy = accuracy_score(y_test, lgbm_pred)
print('Accuracy: {}'.format(round(lgbm_accuracy*100), 3))

Accuracy: 94.0


In [52]:
# Confusion Matrix
lgbm_cnf_matrix = confusion_matrix(y_test, lgbm_pred)
lgbm_cnf_matrix

array([[ 635,   77],
       [  52, 1236]], dtype=int64)

In [53]:
# Recall
lgbm_recall = recall_score(y_test, lgbm_pred, pos_label = 'stable')
print('Recall: {}'.format(round(lgbm_recall*100),2))

Recall: 89.0


In [54]:
# Precision
lgbm_precision = precision_score(y_test, lgbm_pred, pos_label = 'stable')
print('lgbm_Precision: {}'. format(round(lgbm_precision*100),2))

lgbm_Precision: 92.0


In [55]:
# F1-Score
lgbm_f1 = f1_score(y_test, lgbm_pred, pos_label = 'stable')
print('lgbm_F1: {}'.format(round(lgbm_f1*100), 2))

lgbm_F1: 91.0
