DATASET DESCRIPTION

Stability of the Grid System

Electrical grids require a balance between electricity supply and demand in order to be stable. Conventional systems achieve this balance through demand-driven electricity production. For future grids with a high share of inflexible (i.e., renewable) energy sources, the concept of demand response is a promising solution: electricity consumption changes in response to changes in the electricity price. In this work, we’ll build a binary classification model to predict whether a grid is stable or unstable using the UCI Electrical Grid Stability Simulated dataset.

Dataset: https://archive.ics.uci.edu/ml/datasets/Electrical+Grid+Stability+Simulated+Data+

It has 12 primary predictive features and two dependent variables.

Predictive features:
1.	'tau1' to 'tau4': the reaction time of each network participant, a real value within the range 0.5 to 10 ('tau1' corresponds to the supplier node, 'tau2' to 'tau4' to the consumer nodes);
2.	'p1' to 'p4': nominal power produced (positive) or consumed (negative) by each network participant, a real value within the range -2.0 to -0.5 for consumers ('p2' to 'p4'). As the total power consumed equals the total power generated, p1 (supplier node) = - (p2 + p3 + p4);
3.	'g1' to 'g4': price elasticity coefficient for each network participant, a real value within the range 0.05 to 1.00 ('g1' corresponds to the supplier node, 'g2' to 'g4' to the consumer nodes; 'g' stands for 'gamma');
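
The power balance stated in item 2 can be checked against the data. The sketch below uses the first three rows of the `grid.head()` output shown later in this notebook; the helper name `supplier_power` and the tolerance are assumptions made for illustration.

```python
# First three rows of (p1, p2, p3, p4), copied from the grid.head() output.
rows = [
    (3.763085, -0.782604, -1.257395, -1.723086),
    (5.067812, -1.940058, -1.872742, -1.255012),
    (3.405158, -1.207456, -1.277210, -0.920492),
]

def supplier_power(p2, p3, p4):
    """Hypothetical helper: power the supplier must produce to balance the consumers."""
    return -(p2 + p3 + p4)

# The tolerance is an assumption that allows for the six-decimal rounding of the display.
balanced = [abs(p1 - supplier_power(p2, p3, p4)) < 1e-5 for p1, p2, p3, p4 in rows]
print(balanced)
```

Because 'p1' is an exact linear function of 'p2' to 'p4', it adds no independent information beyond the three consumer values.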

Dependent variables:
1.	'stab': the maximum real part of the characteristic differential equation root (if positive, the system is linearly unstable; if negative, linearly stable);
2.	'stabf': a categorical (binary) label ('stable' or 'unstable').


INSTRUCTIONS:

Because of the direct relationship between 'stab' and 'stabf' ('stabf' = 'stable' if 'stab' <= 0, 'unstable' otherwise), 'stab' should be dropped and 'stabf' will remain as the sole dependent variable (binary classification).
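
The rule above can be sketched as a small function and checked against the first five rows of the dataset. The name `stab_to_stabf` is hypothetical; the values are copied from the `grid.head()` output further down.

```python
def stab_to_stabf(stab):
    """Hypothetical helper: map the maximum real part of the root to the class label."""
    return 'stable' if stab <= 0 else 'unstable'

# (stab, stabf) pairs copied from the first five rows of the dataset.
samples = [
    (0.055347, 'unstable'),
    (-0.005957, 'stable'),
    (0.003471, 'unstable'),
    (0.028871, 'unstable'),
    (0.049860, 'unstable'),
]

derived = [stab_to_stabf(stab) for stab, _ in samples]
print(derived == [label for _, label in samples])
```

Since 'stabf' can be reproduced exactly from 'stab', keeping 'stab' as a feature would leak the label into the model, which is why it is dropped.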

Split the data into an 80-20 train-test split with a random state of 1. Use the standard scaler to transform the features of the train set (x_train) and the test set (x_test), fitting it on the train set only; the labels (y_train) are not scaled.

Use scikit-learn to train a random forest classifier and an extra trees classifier, and use xgboost and lightgbm to train an extreme gradient boosting model and a light gradient boosting model.

Use random_state = 1 for training all models and evaluate on the test set. 

Also, to improve the Extra Trees Classifier, use the following parameters (number of estimators, minimum number of samples required to split a node, minimum number of samples per leaf node, and number of features to consider when looking for the best split) as the hyperparameter grid for a randomized cross-validation search (RandomizedSearchCV).
n_estimators = [50, 100, 300, 500, 1000]
min_samples_split = [2, 3, 5, 7, 9]
min_samples_leaf = [1, 2, 4, 6, 8]
max_features = ['auto', 'sqrt', 'log2', None] 
hyperparameter_grid = {'n_estimators': n_estimators,
                       'min_samples_leaf': min_samples_leaf,
                       'min_samples_split': min_samples_split,
                       'max_features': max_features}


In [57]:
# Importing the libraries used in the Stage C lessons

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # pyplot is the submodule conventionally aliased as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, cross_val_score, \
                    LeaveOneOut, KFold, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.metrics import recall_score, accuracy_score, precision_score, f1_score, confusion_matrix, roc_curve, roc_auc_score
import sklearn.utils

from imblearn.over_sampling import SMOTE

import xgboost
import lightgbm

import warnings
warnings.filterwarnings('ignore')

In [58]:
# import the dataset from its UCI URL
grid = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00471/Data_for_UCI_named.csv")


In [59]:
grid.head()

,tau1,tau2,tau3,tau4,p1,p2,p3,p4,g1,g2,g3,g4,stab,stabf
0,2.95906,3.079885,8.381025,9.780754,3.763085,-0.782604,-1.257395,-1.723086,0.650456,0.859578,0.887445,0.958034,0.055347,unstable
1,9.304097,4.902524,3.047541,1.369357,5.067812,-1.940058,-1.872742,-1.255012,0.413441,0.862414,0.562139,0.78176,-0.005957,stable
2,8.971707,8.848428,3.046479,1.214518,3.405158,-1.207456,-1.27721,-0.920492,0.163041,0.766689,0.839444,0.109853,0.003471,unstable
3,0.716415,7.6696,4.486641,2.340563,3.963791,-1.027473,-1.938944,-0.997374,0.446209,0.976744,0.929381,0.362718,0.028871,unstable
4,3.134112,7.608772,4.943759,9.857573,3.525811,-1.125531,-1.845975,-0.554305,0.79711,0.45545,0.656947,0.820923,0.04986,unstable


In [60]:
#our target stabf distribution
grid['stabf'].value_counts()

unstable    6380
stable      3620
Name: stabf, dtype: int64

In [61]:
#checking for null values in the dataset
grid.isnull().sum()

tau1     0
tau2     0
tau3     0
tau4     0
p1       0
p2       0
p3       0
p4       0
g1       0
g2       0
g3       0
g4       0
stab     0
stabf    0
dtype: int64

In [62]:
#check the datatype of the dataset
grid.dtypes

tau1     float64
tau2     float64
tau3     float64
tau4     float64
p1       float64
p2       float64
p3       float64
p4       float64
g1       float64
g2       float64
g3       float64
g4       float64
stab     float64
stabf     object
dtype: object

In [63]:
#'stab' should be dropped
grid.drop(['stab'],axis=1,inplace=True)
grid.head()

,tau1,tau2,tau3,tau4,p1,p2,p3,p4,g1,g2,g3,g4,stabf
0,2.95906,3.079885,8.381025,9.780754,3.763085,-0.782604,-1.257395,-1.723086,0.650456,0.859578,0.887445,0.958034,unstable
1,9.304097,4.902524,3.047541,1.369357,5.067812,-1.940058,-1.872742,-1.255012,0.413441,0.862414,0.562139,0.78176,stable
2,8.971707,8.848428,3.046479,1.214518,3.405158,-1.207456,-1.27721,-0.920492,0.163041,0.766689,0.839444,0.109853,unstable
3,0.716415,7.6696,4.486641,2.340563,3.963791,-1.027473,-1.938944,-0.997374,0.446209,0.976744,0.929381,0.362718,unstable
4,3.134112,7.608772,4.943759,9.857573,3.525811,-1.125531,-1.845975,-0.554305,0.79711,0.45545,0.656947,0.820923,unstable


In [64]:
#preprocessing
X = grid.drop(columns = 'stabf')
y = grid['stabf']

In [65]:
# Splitting the data into 80:20 training and testing sets with a random_state of 1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1) 

In [66]:
scaler = StandardScaler() # Initializes a StandardScaler object
scaled_X_train = scaler.fit_transform(X_train) # Fits and transform the training set
scaled_X_test = scaler.transform(X_test) # Transforms the testing set
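
To make the fit/transform distinction concrete, here is a minimal pure-Python sketch of what standard scaling computes for a single feature. It assumes the population standard deviation, which is what `StandardScaler` uses; the toy values are 'tau1' entries from `grid.head()`, and the train/test assignment and the helper name `standardize` are illustrative only.

```python
from statistics import fmean, pstdev

# Toy single-feature data; which values count as "train" or "test" is an assumption.
train = [2.959060, 9.304097, 8.971707, 0.716415]
test = [3.134112]

# "fit": learn the mean and population standard deviation from the training data only.
mean = fmean(train)
std = pstdev(train)

def standardize(values):
    """Hypothetical helper: apply the training statistics to any set of values."""
    return [(v - mean) / std for v in values]

scaled_train = standardize(train)  # analogous to scaler.fit_transform(X_train)
scaled_test = standardize(test)    # analogous to scaler.transform(X_test)

print(scaled_train)
print(scaled_test)
```

The test values are scaled with the training mean and standard deviation, so no information from the test set influences the preprocessing.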

Create a function that returns the metric score of a test set. The metric can be any of accuracy_score, precision_score, recall_score, f1_score or confusion_matrix.


In [67]:
# Dictionary to be used
metrics = {'accuracy_score': accuracy_score, 'precision_score': precision_score, 'recall_score': recall_score, 
               'f1_score': f1_score, 'confusion_matrix': confusion_matrix}

In [68]:
# Defining the function
def get_metric_score(metric, ytrue, ypred, neg_pos_label):
    ''' This function returns the specified metric score. It only works with classifier metrics.
        
        Args:   metric (string): the evaluating metric, can be any of accuracy_score, precision_score, recall_score, f1_score, 
                                 or confusion matrix.
                ytrue (array): the true labels
                ypred (array): the predicted labels
                neg_pos_label (list): a list of the classes you want as the negative and positive label in order 
                                      of [negative_label, positive_label]
                
        Return: returns the metric score
    '''
    
    if metric == 'accuracy_score':
        return accuracy_score(ytrue, ypred)
    
    elif metric == 'confusion_matrix':
        return confusion_matrix(ytrue, ypred, labels=neg_pos_label) # labels must be passed by keyword in recent scikit-learn
    
    else:
        # precision_score, recall_score and f1_score take the same arguments
        return metrics[metric](ytrue, ypred, pos_label=neg_pos_label[1])

Create a function that fits a classifier on a training set and prints out the accuracy_score, precision_score, recall_score, f1_score and confusion_matrix of the testing set.

In [69]:
# Defining the function
def fit_and_score(classifier, xtrain, ytrain, xtest, ytest, neg_pos_label):
    ''' This function fits a classifier on a training set and prints out the accuracy_score, precision_score, recall_score, 
    f1_score and confusion matrix of the testing set.
    
    Args: classifier (classifier object): the classifier you want to use
          xtrain (ndarray): the training features
          ytrain (array): the training labels
          xtest (ndarray): the testing features
          ytest (array): the testing labels
          neg_pos_label (list): a list of the classes you want as the negative and positive label in order 
                                      of [negative_label, positive_label]
    '''
    classifier.fit(xtrain, ytrain) # fits the classifier
    ypred = classifier.predict(xtest) # predicts
    
    # for each metric in metrics (dictionary earlier defined), print out the metric score.
    for metric in metrics:
        
        # this 'if' block prints the confusion matrix on its own lines to improve readability
        if metric == 'confusion_matrix':
            print()
            print('confusion_matrix is:')
            print(get_metric_score(metric, ytest, ypred, neg_pos_label))
            
        else:
            print('{} is {}'.format(metric, get_metric_score(metric, ytest, ypred, neg_pos_label)))

In [70]:
label_list = ['unstable', 'stable']

## Evaluating different classifiers

#### Training and testing on RandomForestClassifier

In [71]:
random_forest = RandomForestClassifier(random_state=1)
fit_and_score(random_forest, scaled_X_train, y_train, scaled_X_test, y_test, label_list)

accuracy_score is 0.929
precision_score is 0.9191176470588235
recall_score is 0.8778089887640449
f1_score is 0.8979885057471264

confusion_matrix is:
[[1233   55]
 [  87  625]]
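
The scalar scores follow directly from the confusion matrix. As a sanity check, this sketch recomputes them by hand from the random forest output above, assuming the label order `['unstable', 'stable']` (rows are true classes, columns are predicted classes, and 'stable' is the positive class).

```python
# Confusion matrix from the random forest run, labels ordered ['unstable', 'stable'].
cm = [[1233, 55],
      [87, 625]]

tn, fp = cm[0]  # true 'unstable': predicted unstable, predicted stable
fn, tp = cm[1]  # true 'stable':   predicted unstable, predicted stable

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted 'stable' that is truly stable
recall = tp / (tp + fn)      # share of truly stable grids that were found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

The four values agree with the accuracy_score, precision_score, recall_score and f1_score printed for the random forest.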


#### Training and testing on ExtraTreesClassifier

In [72]:
extra_trees = ExtraTreesClassifier(random_state=1)
fit_and_score(extra_trees, scaled_X_train, y_train, scaled_X_test, y_test, label_list)

accuracy_score is 0.928
precision_score is 0.9409937888198758
recall_score is 0.851123595505618
f1_score is 0.8938053097345133

confusion_matrix is:
[[1250   38]
 [ 106  606]]


#### Training and testing on xgboost

In [73]:
xgb = xgboost.XGBClassifier(random_state=1)
fit_and_score(xgb, scaled_X_train, y_train, scaled_X_test, y_test, label_list)

accuracy_score is 0.9195
precision_score is 0.9206106870229007
recall_score is 0.8469101123595506
f1_score is 0.8822238478419898

confusion_matrix is:
[[1236   52]
 [ 109  603]]


#### Training and testing on lightgbm

In [74]:
lgbm = lightgbm.LGBMClassifier(random_state=1)
fit_and_score(lgbm, scaled_X_train, y_train, scaled_X_test, y_test, label_list)

accuracy_score is 0.9375
precision_score is 0.9297218155197657
recall_score is 0.8918539325842697
f1_score is 0.910394265232975

confusion_matrix is:
[[1240   48]
 [  77  635]]


#### Training and testing on tuned ExtraTreesClassifier

In [75]:
# initializing search space of hyperparameters
n_estimators = [50, 100, 300, 500, 1000]
min_samples_split = [2, 3, 5, 7, 9]
min_samples_leaf = [1, 2, 4, 6, 8]
max_features = ['auto', 'sqrt', 'log2', None] # 'auto' was deprecated in scikit-learn 1.1; newer releases may reject it

# making a dictionary of the grid
hyperparameter_grid = {'n_estimators': n_estimators, 'min_samples_leaf': min_samples_leaf,
                       'min_samples_split': min_samples_split, 'max_features': max_features}

In [76]:
extra_trees2 = ExtraTreesClassifier(random_state=1) # initializes an ExtraTreesClassifier

# initializing a RandomizedSearchCV
tuned_extra_trees = RandomizedSearchCV(extra_trees2, hyperparameter_grid, random_state=1, verbose=1, n_jobs=3) 
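
To see why a randomized search is used here instead of an exhaustive one, the sketch below counts the combinations in the grid and draws a small sample from them. It is a pure-Python illustration and does not reproduce scikit-learn's own sampler; `n_iter = 10` and `n_folds = 5` are taken from the "Fitting 5 folds for each of 10 candidates" message in the output.

```python
import itertools
import random

hyperparameter_grid = {
    'n_estimators': [50, 100, 300, 500, 1000],
    'min_samples_leaf': [1, 2, 4, 6, 8],
    'min_samples_split': [2, 3, 5, 7, 9],
    'max_features': ['auto', 'sqrt', 'log2', None],
}

# Every combination an exhaustive grid search would have to evaluate.
names = list(hyperparameter_grid)
all_candidates = [dict(zip(names, values))
                  for values in itertools.product(*hyperparameter_grid.values())]

# A randomized search only evaluates n_iter of them, each with n_folds fits.
n_iter, n_folds = 10, 5
rng = random.Random(1)  # seeded for repeatability; not scikit-learn's sampler
sampled = rng.sample(all_candidates, n_iter)

print(len(all_candidates), len(sampled) * n_folds)
```

The grid holds 5 x 5 x 5 x 4 = 500 combinations, so an exhaustive search with 5 folds would need 2,500 fits, while the randomized search needs only 50.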

In [77]:
fit_and_score(tuned_extra_trees, scaled_X_train, y_train, scaled_X_test, y_test, label_list)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done  44 tasks      | elapsed:  1.4min
[Parallel(n_jobs=3)]: Done  50 out of  50 | elapsed:  1.6min finished


accuracy_score is 0.927
precision_score is 0.9211309523809523
recall_score is 0.8693820224719101
f1_score is 0.8945086705202311

confusion_matrix is:
[[1235   53]
 [  93  619]]
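
For a side-by-side view, the reported test scores can be collected and ranked. The numbers below are copied from the outputs above; the dictionary keys are labels chosen for this sketch.

```python
# (accuracy, f1) pairs copied from the printed results of each model.
results = {
    'random_forest': (0.929, 0.8979885057471264),
    'extra_trees': (0.928, 0.8938053097345133),
    'xgboost': (0.9195, 0.8822238478419898),
    'lightgbm': (0.9375, 0.910394265232975),
    'tuned_extra_trees': (0.927, 0.8945086705202311),
}

# Rank the models from highest to lowest test accuracy.
ranking = sorted(results, key=lambda name: results[name][0], reverse=True)

for name in ranking:
    accuracy, f1 = results[name]
    print('{:<18} accuracy={:.4f} f1={:.4f}'.format(name, accuracy, f1))
```

On this split, lightgbm has the highest accuracy and F1. The tuned extra trees model has slightly lower accuracy than the untuned one (0.927 against 0.928), but higher recall and F1.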
