# Dataset Description
## Stability of the Grid System

Electrical grids are stable only when electricity supply and demand are balanced. Conventional systems achieve this balance through demand-driven electricity production. For future grids with a high share of inflexible (i.e., renewable) energy sources, demand response is a promising solution: consumers adjust their electricity consumption in response to changes in the electricity price. In this work, we’ll build a binary classification model to predict whether a grid is stable or unstable, using the UCI Electrical Grid Stability Simulated dataset.

Dataset: https://archive.ics.uci.edu/ml/datasets/Electrical+Grid+Stability+Simulated+Data+

It has 12 primary predictive features and two dependent variables.

**Predictive features:**

    'tau1' to 'tau4': the reaction time of each network participant, a real value within the range 0.5 to 10 ('tau1' corresponds to the supplier node, 'tau2' to 'tau4' to the consumer nodes);
    'p1' to 'p4': nominal power produced (positive) or consumed (negative) by each network participant, a real value within the range -2.0 to -0.5 for consumers ('p2' to 'p4'). As the total power consumed equals the total power generated, p1 (supplier node) = - (p2 + p3 + p4);
    'g1' to 'g4': price elasticity coefficient for each network participant, a real value within the range 0.05 to 1.00 ('g1' corresponds to the supplier node, 'g2' to 'g4' to the consumer nodes; 'g' stands for 'gamma');
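
The power-balance constraint on 'p1' can be checked by hand. The snippet below is a minimal sketch that uses the 'p1' to 'p4' values of the first two rows of the dataset (they appear in the `head()` output further down); the helper name `is_balanced` and its tolerance are illustrative assumptions, not part of the dataset.

```python
import math

def is_balanced(p1, p2, p3, p4, tol=1e-6):
    """Return True when the supplier's output offsets total consumer demand."""
    return math.isclose(p1, -(p2 + p3 + p4), abs_tol=tol)

# (p1, p2, p3, p4) for the first two rows of the dataset
rows = [
    (3.763085, -0.782604, -1.257395, -1.723086),
    (5.067812, -1.940058, -1.872742, -1.255012),
]

balanced = [is_balanced(*row) for row in rows]
print(balanced)
```

Because 'p1' is fully determined by the three consumer values, it adds no independent information beyond 'p2' to 'p4'.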

**Dependent variables:**

    'stab': the maximum real part of the characteristic differential equation root (if positive, the system is linearly unstable; if negative, linearly stable);
    'stabf': a categorical (binary) label ('stable' or 'unstable').

Because of the direct relationship between 'stab' and 'stabf' ('stabf' = 'stable' if 'stab' <= 0, 'unstable' otherwise), 'stab' should be dropped and 'stabf' will remain as the sole dependent variable (binary classification).
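
The rule that ties 'stabf' to 'stab' is simple enough to write down. This is a small sketch, with `stability_label` as a hypothetical helper name, applied to the 'stab' values of the first three rows of the dataset. It also shows why 'stab' has to be dropped: a model that sees 'stab' could recover the label with a single threshold.

```python
def stability_label(stab):
    """Map the maximum real part of the characteristic root to a class label."""
    return "stable" if stab <= 0 else "unstable"

# 'stab' values from the first three rows of the dataset
stab_values = [0.055347, -0.005957, 0.003471]
labels = [stability_label(s) for s in stab_values]
print(labels)
```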

Split the data into an 80-20 train-test split with a random state of 1. Fit the standard scaler on the training features (x_train) and use it to transform both the training features and the test features (x_test); the labels (y_train, y_test) are not scaled. Use scikit-learn to train a random forest and an extra trees classifier, and use xgboost and lightgbm to train an extreme gradient boosting model and a light gradient boosting model. Use random_state = 1 for training all models and evaluate on the test set.

Also, to improve the Extra Trees Classifier, use the following parameters (number of estimators, minimum number of samples required to split a node, minimum number of samples per leaf node, and the number of features to consider when looking for the best split) as the hyperparameter grid for a randomized cross-validation search (RandomizedSearchCV).

    n_estimators = [50, 100, 300, 500, 1000]
    min_samples_split = [2, 3, 5, 7, 9]
    min_samples_leaf = [1, 2, 4, 6, 8]
    max_features = ['auto', 'sqrt', 'log2', None]

    hyperparameter_grid = {'n_estimators': n_estimators,
                           'min_samples_leaf': min_samples_leaf,
                           'min_samples_split': min_samples_split,
                           'max_features': max_features}


In [1]:
# Import required Modules
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
%matplotlib inline
seed_value = 1

In [2]:
staGrid = pd.read_csv('Data_for_UCI_named.csv')
staGrid.head()

| | tau1 | tau2 | tau3 | tau4 | p1 | p2 | p3 | p4 | g1 | g2 | g3 | g4 | stab | stabf |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.95906 | 3.079885 | 8.381025 | 9.780754 | 3.763085 | -0.782604 | -1.257395 | -1.723086 | 0.650456 | 0.859578 | 0.887445 | 0.958034 | 0.055347 | unstable |
| 1 | 9.304097 | 4.902524 | 3.047541 | 1.369357 | 5.067812 | -1.940058 | -1.872742 | -1.255012 | 0.413441 | 0.862414 | 0.562139 | 0.78176 | -0.005957 | stable |
| 2 | 8.971707 | 8.848428 | 3.046479 | 1.214518 | 3.405158 | -1.207456 | -1.27721 | -0.920492 | 0.163041 | 0.766689 | 0.839444 | 0.109853 | 0.003471 | unstable |
| 3 | 0.716415 | 7.6696 | 4.486641 | 2.340563 | 3.963791 | -1.027473 | -1.938944 | -0.997374 | 0.446209 | 0.976744 | 0.929381 | 0.362718 | 0.028871 | unstable |
| 4 | 3.134112 | 7.608772 | 4.943759 | 9.857573 | 3.525811 | -1.125531 | -1.845975 | -0.554305 | 0.79711 | 0.45545 | 0.656947 | 0.820923 | 0.04986 | unstable |


In [3]:
# Look at the data type of each column and other info.
staGrid.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   tau1    10000 non-null  float64
 1   tau2    10000 non-null  float64
 2   tau3    10000 non-null  float64
 3   tau4    10000 non-null  float64
 4   p1      10000 non-null  float64
 5   p2      10000 non-null  float64
 6   p3      10000 non-null  float64
 7   p4      10000 non-null  float64
 8   g1      10000 non-null  float64
 9   g2      10000 non-null  float64
 10  g3      10000 non-null  float64
 11  g4      10000 non-null  float64
 12  stab    10000 non-null  float64
 13  stabf   10000 non-null  object 
dtypes: float64(13), object(1)
memory usage: 1.1+ MB


In [4]:
# Summary of statistics 
staGrid.describe()

| | tau1 | tau2 | tau3 | tau4 | p1 | p2 | p3 | p4 | g1 | g2 | g3 | g4 | stab |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 5.25 | 5.250001 | 5.250004 | 5.249997 | 3.75 | -1.25 | -1.25 | -1.25 | 0.525 | 0.525 | 0.525 | 0.525 | 0.015731 |
| std | 2.742548 | 2.742549 | 2.742549 | 2.742556 | 0.75216 | 0.433035 | 0.433035 | 0.433035 | 0.274256 | 0.274255 | 0.274255 | 0.274255 | 0.036919 |
| min | 0.500793 | 0.500141 | 0.500788 | 0.500473 | 1.58259 | -1.999891 | -1.999945 | -1.999926 | 0.050009 | 0.050053 | 0.050054 | 0.050028 | -0.08076 |
| 25% | 2.874892 | 2.87514 | 2.875522 | 2.87495 | 3.2183 | -1.624901 | -1.625025 | -1.62496 | 0.287521 | 0.287552 | 0.287514 | 0.287494 | -0.015557 |
| 50% | 5.250004 | 5.249981 | 5.249979 | 5.249734 | 3.751025 | -1.249966 | -1.249974 | -1.250007 | 0.525009 | 0.525003 | 0.525015 | 0.525002 | 0.017142 |
| 75% | 7.62469 | 7.624893 | 7.624948 | 7.624838 | 4.28242 | -0.874977 | -0.875043 | -0.875065 | 0.762435 | 0.76249 | 0.76244 | 0.762433 | 0.044878 |
| max | 9.999469 | 9.999837 | 9.99945 | 9.999443 | 5.864418 | -0.500108 | -0.500072 | -0.500025 | 0.999937 | 0.999944 | 0.999982 | 0.99993 | 0.109403 |


In [5]:
# Check for nulls in each column
staGrid.isnull().sum()

tau1     0
tau2     0
tau3     0
tau4     0
p1       0
p2       0
p3       0
p4       0
g1       0
g2       0
g3       0
g4       0
stab     0
stabf    0
dtype: int64

In [6]:
# Look for count of each label in our stability column
staGrid['stabf'].value_counts()

unstable    6380
stable      3620
Name: stabf, dtype: int64
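
The label counts above show a moderate class imbalance. As a quick worked calculation (a sketch using only the two counts printed above), the class shares also give the accuracy that a trivial classifier would reach by always predicting the majority class, which is a useful floor when reading the model accuracies later on.

```python
# Class counts taken from the value_counts() output above
counts = {"unstable": 6380, "stable": 3620}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of always predicting the most frequent class
majority_baseline = max(shares.values())

print(shares)
print(majority_baseline)
```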

In [7]:
# Drop unneeded column
staGrid.drop('stab', axis= 1, inplace= True)

In [8]:
# Check
staGrid.columns

Index(['tau1', 'tau2', 'tau3', 'tau4', 'p1', 'p2', 'p3', 'p4', 'g1', 'g2',
       'g3', 'g4', 'stabf'],
      dtype='object')

## Split our dataset

In [9]:
y = staGrid['stabf']
x = staGrid.drop('stabf', axis = 1)

In [10]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state = seed_value)

In [11]:
y_test

9953    unstable
3850    unstable
4962      stable
3886      stable
5437    unstable
          ...   
3919      stable
162       stable
7903      stable
2242    unstable
2745    unstable
Name: stabf, Length: 2000, dtype: object
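
The split above is a plain random split. Given the class imbalance, one possible variation (not what this notebook does) is to pass `stratify` so that both subsets keep the original class ratio. The sketch below illustrates this on synthetic stand-in data: the random feature matrix and the names `X_demo` and `y_demo` are assumptions, and only the class counts mirror the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 12 random features, labels with the dataset's class counts
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(10000, 12))
y_demo = np.array(["unstable"] * 6380 + ["stable"] * 3620)

# 80-20 split that preserves the class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

test_share_unstable = np.mean(y_te == "unstable")
print(X_tr.shape, X_te.shape, test_share_unstable)
```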

## Normalization operation for numerical stability & Encoding our target

In [12]:
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
y_train= enc.fit_transform(y_train)
y_test = enc.transform(y_test)
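
It helps to know which integer each label receives, since the predictions come back as integers. `LabelEncoder` sorts the classes, so with these two labels 'stable' becomes 0 and 'unstable' becomes 1. A minimal sketch on a toy label list (the name `demo_enc` is illustrative):

```python
from sklearn.preprocessing import LabelEncoder

demo_enc = LabelEncoder()
encoded = demo_enc.fit_transform(["unstable", "stable", "unstable"])

# classes_ holds the sorted labels; the position of each label is its integer code
mapping = {label: code for code, label in enumerate(demo_enc.classes_)}

# inverse_transform turns integer predictions back into readable labels
decoded = demo_enc.inverse_transform([0, 1])

print(mapping)
print(list(encoded), list(decoded))
```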

In [13]:
from sklearn.preprocessing import StandardScaler
std = StandardScaler()

X_train = std.fit_transform(X_train)
X_test = std.transform(X_test)
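
The scaler is fitted on the training set only and then reused on the test set, so no statistics from the test data leak into preprocessing. The sketch below shows the effect on synthetic data (the uniform random arrays are an assumption chosen to resemble the 'tau' range): the training columns end up with mean 0 and unit standard deviation, while the test columns are only approximately centred.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in features drawn from the same range as the 'tau' columns
rng = np.random.default_rng(1)
train = rng.uniform(0.5, 10.0, size=(800, 3))
test = rng.uniform(0.5, 10.0, size=(200, 3))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the training statistics

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
print(test_scaled.mean(axis=0).round(3))
```

Tree-based models split on thresholds and are generally insensitive to this kind of monotonic rescaling, so the step matters mostly for consistency with the task description.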

# Modeling 
## RandomForest Classifier

In [14]:
from sklearn.ensemble import RandomForestClassifier
clf_RF = RandomForestClassifier(random_state= seed_value)
clf_RF.fit(X_train, y_train)

RandomForestClassifier(random_state=1)

In [15]:
pred_RF = clf_RF.predict(X_test)

In [16]:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, pred_RF)
# Report accuracy as a whole-number percentage
print('Accuracy: {}'.format(round(accuracy*100)))

Accuracy: 93
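
Accuracy alone can hide how errors are distributed across the two classes, which matters when the classes are imbalanced. The following sketch uses small hand-made label arrays (not results from this notebook) to show how a confusion matrix and an F1 score can complement the accuracy figure; 0 stands for 'stable' and 1 for 'unstable', matching the encoder's ordering.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Hand-made example labels: 0 = stable, 1 = unstable
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]

acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes, in the order [0, 1]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# F1 score for the 'unstable' class
f1_unstable = f1_score(y_true, y_pred, pos_label=1)

print(round(acc * 100, 2))
print(cm)
print(round(f1_unstable, 2))
```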


### Improve RandomForest Classifier

In [17]:
# Define parameter possibilities as lists
n_estimators = [50, 100, 300, 500, 1000]

min_samples_split = [2, 3, 5, 7, 9]

min_samples_leaf = [1, 2, 4, 6, 8]

max_features = ['auto', 'sqrt', 'log2', None]

In [18]:
from sklearn.model_selection import RandomizedSearchCV

# Create the random grid
hyperparameter_grid = {'n_estimators': n_estimators,
                       'min_samples_split': min_samples_split,
                       'min_samples_leaf': min_samples_leaf,
                       'max_features': max_features}

In [19]:
# Use the hyperparameter_grid to search for best hyperparameters
RF_search = RandomizedSearchCV(estimator = clf_RF, param_distributions = hyperparameter_grid, n_iter = 10, cv = 3, random_state= seed_value)
# Fit the random search model
RF_search.fit(X_train, y_train)

RF_search.best_params_

{'n_estimators': 1000,
 'min_samples_split': 2,
 'min_samples_leaf': 4,
 'max_features': 'log2'}

In [20]:
RF_search.best_score_

0.9135004973590605

In [21]:
RF_tune = RandomForestClassifier(n_estimators= 1000, min_samples_split= 2, min_samples_leaf= 4, max_features= 'log2', random_state= seed_value)
RF_tune.fit(X_train, y_train)

RandomForestClassifier(max_features='log2', min_samples_leaf=4,
                       n_estimators=1000, random_state=1)

In [22]:
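# Caution: this predicts with the baseline clf_RF, so the accuracy below is the baseline's.
# To score the tuned model, call RF_tune.predict(X_test) instead.
pred_RF_tune = clf_RF.predict(X_test)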

In [23]:
accuracy = accuracy_score(y_test, pred_RF_tune)
print('Accuracy: {}'.format(round(accuracy*100)))

Accuracy: 93


## ExtraTrees Classifier

In [24]:
from sklearn.ensemble import ExtraTreesClassifier
clf_xTree = ExtraTreesClassifier(random_state=seed_value)
clf_xTree.fit(X_train, y_train)

ExtraTreesClassifier(random_state=1)

In [25]:
pred_xTree = clf_xTree.predict(X_test)

In [26]:
accuracy = accuracy_score(y_test, pred_xTree)
print('Accuracy: {}'.format(round(accuracy*100)))

Accuracy: 93


### Improve ExtraTrees Classifier

In [27]:
# The scores will go here
results = []

# Nested loops - we need to test all combinations.
# Note: each combination is scored on the test set, so the best accuracy found is an optimistic estimate.
for n_estimator in n_estimators:
    for min_sample_split in min_samples_split:
        for min_sample_leaf in min_samples_leaf:
            for max_feature in max_features:
                # Train the model
                model = ExtraTreesClassifier(
                    n_estimators= n_estimator,
                    min_samples_split=min_sample_split,
                    min_samples_leaf=min_sample_leaf,
                    max_features= max_feature,
                    random_state= seed_value
                )
                model.fit(X_train, y_train)
                preds = model.predict(X_test)
                # Append current results
                results.append({
                    'Accuracy': round(accuracy_score(y_test, preds), 5),
                    'n_estimators': n_estimator,
                    'min_samples_split': min_sample_split,
                    'min_samples_leaf': min_sample_leaf,
                    'max_features': max_feature
                })

# Convert to a pandas DataFrame and sort by accuracy in descending order
results = pd.DataFrame(results)
results = results.sort_values(by='Accuracy', ascending=False)
results

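| | Accuracy | n_estimators | min_samples_split | min_samples_leaf | max_features |
|---|---|---|---|---|---|
| 303 | 0.9385 | 500 | 2 | 1 | None |
| 203 | 0.9380 | 300 | 2 | 1 | None |
| 320 | 0.9375 | 500 | 3 | 1 | auto |
| 321 | 0.9375 | 500 | 3 | 1 | sqrt |
| 403 | 0.9375 | 1000 | 2 | 1 | None |
| ... | ... | ... | ... | ... | ... |
| 17 | 0.9065 | 50 | 2 | 8 | sqrt |
| 18 | 0.9065 | 50 | 2 | 8 | log2 |
| 97 | 0.9065 | 50 | 9 | 8 | sqrt |
| 98 | 0.9065 | 50 | 9 | 8 | log2 |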
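
The exhaustive loop above picks hyperparameters by test-set accuracy, whereas the task description asks for a randomized cross-validation search. The sketch below shows what that could look like for the Extra Trees classifier, with model selection done by cross-validation on the training data and the test set used only once at the end. It runs on synthetic data from `make_classification` so that it is self-contained; the data, the shortened `n_estimators` list (kept small for speed), and `n_iter=5` are assumptions. `'auto'` is left out of `max_features` because newer scikit-learn releases no longer accept it for these classifiers.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the scaled grid data (12 features, binary target)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=12, n_informative=6, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

param_distributions = {
    "n_estimators": [50, 100],  # shortened for speed
    "min_samples_split": [2, 3, 5, 7, 9],
    "min_samples_leaf": [1, 2, 4, 6, 8],
    "max_features": ["sqrt", "log2", None],
}

# Sample 5 parameter combinations and score each with 3-fold cross-validation
search = RandomizedSearchCV(
    ExtraTreesClassifier(random_state=1),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    scoring="accuracy",
    random_state=1,
)
search.fit(X_tr, y_tr)

# The best estimator is refit on the full training set; score it once on the test set
test_accuracy = search.score(X_te, y_te)
print(search.best_params_)
print(round(search.best_score_, 4), round(test_accuracy, 4))
```

With this setup, `best_score_` is a mean cross-validated accuracy on training folds, so it is not directly comparable to a test-set accuracy.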


In [28]:
xTree_tune = ExtraTreesClassifier(n_estimators= 500, min_samples_split= 2, 
                                 min_samples_leaf= 1, max_features= None,
                                 random_state=seed_value)


In [29]:
xTree_tune.fit(X_train, y_train)

ExtraTreesClassifier(max_features=None, n_estimators=500, random_state=1)

In [30]:
pred_xTree_tune = xTree_tune.predict(X_test)

In [31]:
accuracy = accuracy_score(y_test, pred_xTree_tune)
print('Accuracy: {}'.format(round(accuracy*100)))

Accuracy: 94


## Extreme boosting

In [32]:
from xgboost import XGBClassifier
clf_xgboost = XGBClassifier(random_state= seed_value, use_label_encoder=False)
clf_xgboost.fit(X_train, y_train)



XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=4, num_parallel_tree=1, random_state=1,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', use_label_encoder=False,
              validate_parameters=1, verbosity=None)

In [33]:
pred_xgb = clf_xgboost.predict(X_test)

In [34]:
accuracy = accuracy_score(y_test, pred_xgb)
print('Accuracy: {}'.format(round(accuracy*100)))

Accuracy: 95


## Light gradient boosting

In [35]:
from lightgbm import LGBMClassifier
clf_LGM =  LGBMClassifier(random_state= seed_value)
clf_LGM.fit(X_train, y_train)

LGBMClassifier(random_state=1)

In [36]:
pred_lgm = clf_LGM.predict(X_test)

In [37]:
accuracy = accuracy_score(y_test, pred_lgm)
print('Accuracy: {}'.format(round(accuracy*100)))

Accuracy: 94


# DONE -:-)