# Power Plant

In this project, I try several machine learning techniques to predict the output of a power plant; the target variable is the amount of power produced. I first explore the data set and then apply several regression models.

Before that, let's import the necessary modules:

In [1]:
## Importing modules

import pandas as pd
import numpy as np
from IPython.display import display, HTML

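# Note: Imputer was removed in scikit-learn 0.22; newer versions provide sklearn.impute.SimpleImputer instead
from sklearn.preprocessing import Imputer, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score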
from sklearn.metrics import mean_squared_error

from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import ElasticNet, LinearRegression

Now we can read the data and explore its properties.

In [2]:
## Reading data
df=pd.read_csv("Data/data_power.csv")

## Calculating summary statistics
def summary_stat(L):
    """Return the minimum, median, maximum, and number of missing values of a single column L."""
    sum_stat=[]
    sum_stat.append(np.nanmin(L.values))
    sum_stat.append(np.nanmedian(L.values))
    sum_stat.append(np.nanmax(L.values))
    sum_stat.append(L.isnull().sum())
    return sum_stat

tab=np.empty([4,len(df.columns)]) #empty table for summary statistics, will be filled in the for loop below
for i in range(len(df.columns)):
    tab[:,i]=summary_stat(df.iloc[:,i])
        
sum_tab=pd.DataFrame(tab,index=['minimum','median','maximum','number of missing'],columns=df.columns) #table to dataframe

sum_tab=round(sum_tab,1) #rounding numbers

## Printing the table
print("The shape of dataset is",df.shape)
print("\nHere are the summary statistics:\n\n",sum_tab)

The shape of dataset is (7176, 5)

Here are the summary statistics:

                    power  feat1  feat2      feat3  feat4
minimum             22.1  -23.2    6.3   985830.6   21.6
median              23.8   -4.7   13.1  1025885.4   70.8
maximum             26.1   12.1   20.1  1067708.9   96.2
number of missing    0.0    0.0    1.0        6.0    4.0
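
As a cross-check, a table of this kind can also be built with pandas built-ins instead of a manual loop. The sketch below is only an illustration: `demo` is a hypothetical toy frame with made-up values, not the real data set, and `demo_summary` is a name introduced here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df; the values are made up.
demo = pd.DataFrame({
    "power": [22.1, 23.8, 26.1],
    "feat1": [-23.2, -4.7, 12.1],
    "feat2": [6.3, np.nan, 20.1],
})

# min/median/max skip NaN by default, like np.nanmin/np.nanmedian/np.nanmax.
demo_summary = demo.agg(["min", "median", "max"])

# Add the missing-value count as an extra row, then round for display.
demo_summary.loc["number of missing"] = demo.isnull().sum()
demo_summary = demo_summary.round(1)
print(demo_summary)
```

The row labels differ slightly from the table above ("min" instead of "minimum"), but the content is the same kind of per-column summary.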


Before fitting regression models, there are two things to consider:
- feat3 is several orders of magnitude larger than the other three features, so the data needs to be scaled.
- All features except feat1 have missing values, so missing-data imputation is needed.

Because of these properties, I will use pipelines that chain imputation, scaling, and a model. For all models, the following applies:
- Missing values are imputed with the mean of the corresponding column.
- Features are standardized by removing the mean and scaling to unit variance (using StandardScaler()).
- 5-fold cross-validation is used to tune the model parameters.
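
The three points above can be sketched on synthetic data as follows. This is a minimal illustration under stated assumptions, not the notebook's actual code: it uses `SimpleImputer` from `sklearn.impute` (the replacement for `Imputer` in scikit-learn 0.22 and later), and `X_demo`, `y_demo`, `demo_pipeline`, and `demo_cv` are made-up names and data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): the third feature is on a much larger scale.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3)) * [1.0, 10.0, 1e6]
y_demo = X_demo[:, 0] + 0.1 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)
X_demo[::25, 2] = np.nan  # inject a few missing values

# Imputation -> scaling -> model, chained so every step is refit per CV fold.
demo_pipeline = Pipeline([
    ("imputation", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])

# 5-fold grid search; parameters are addressed as <step name>__<parameter>.
demo_cv = GridSearchCV(demo_pipeline,
                       param_grid={"model__fit_intercept": [True, False]},
                       scoring="neg_mean_squared_error",
                       cv=5)
demo_cv.fit(X_demo, y_demo)
print(demo_cv.best_params_, round(demo_cv.best_score_, 3))
```

Putting the imputer and the scaler inside the pipeline means their statistics are computed only on the training part of each fold, so the validation part does not leak into the preprocessing.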

Let's divide our dataset into train (80%) and test (20%) subsets.

In [3]:
## Creating train and test data sets
X=df.iloc[:,1:]
y=df.iloc[:,0]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state=23)

### Decision Tree

In [4]:
## Defining three steps for decision tree pipeline
steps = [('imputation', Imputer(missing_values='NaN', strategy='mean', axis=0)),
         ('scaler',StandardScaler()),
         ('tree', DecisionTreeRegressor())]

pipeline = Pipeline(steps) # creating a pipeline

## Defining sets of model parameters
parameters= {'tree__max_depth':[3,6,9,12],
             'tree__min_samples_split':[2,3], 
             'tree__min_samples_leaf':[1,2]
             }

## Cross-validation
cv=GridSearchCV(pipeline, 
                param_grid=parameters,scoring='neg_mean_squared_error',
                cv=5)


cv.fit(X_train,y_train) # model fitting
tree_score=round(cv.score(X_test,y_test),3) # getting score
result_tree=('Decision Tree',cv.best_params_,tree_score) # will be used later

## Printing result
print("Best score in Decision Tree model is obtained with these parameters:\n",cv.best_params_)
print("\nThe score (negative mean squared error) on the test data set is", tree_score)

Best score in Decision Tree model is obtained with these parameters:
 {'tree__max_depth': 9, 'tree__min_samples_leaf': 2, 'tree__min_samples_split': 3}

The score (negative mean squared error) on the test data set is -0.043
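
Negative MSE is convenient for grid search because greater is better, but it is not in the units of the target. As a small worked example, the reported score can be converted to a root mean squared error, which is in the same units as `power`. The variable names below are introduced only for this illustration.

```python
import math

tree_neg_mse = -0.043  # test score reported above for the decision tree

# Flip the sign to get the MSE, then take the square root to get the RMSE.
tree_rmse = math.sqrt(-tree_neg_mse)
print(round(tree_rmse, 3))  # prints 0.207
```

Given that `power` ranges from 22.1 to 26.1 in this data set, an RMSE of about 0.2 is a more interpretable way to read the same result.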


### K-Nearest Neighbor

In [5]:
## Defining three steps for KNN pipeline
steps = [('imputation', Imputer(missing_values='NaN', strategy='mean', axis=0)),
         ('scaler',StandardScaler()),
         ('knn', KNeighborsRegressor())]

pipeline = Pipeline(steps) # creating a pipeline

## Defining sets of model parameters
parameters= {'knn__n_neighbors':[3,5,7,10,20],
             'knn__weights':['uniform','distance']
            }
## Cross-validation
cv=GridSearchCV(pipeline,param_grid=parameters,
                scoring='neg_mean_squared_error',
                cv=5)

cv.fit(X_train,y_train) # model fitting
knn_score=round(cv.score(X_test,y_test),3) #getting score
result_knn=('KNN',cv.best_params_,knn_score) #will be used later

## Printing results
print("Best score in KNN model is obtained with these parameters:\n",cv.best_params_)
print("\nThe score (negative mean squared error) on the test data set is", knn_score)

Best score in KNN model is obtained with these parameters:
 {'knn__n_neighbors': 7, 'knn__weights': 'distance'}

The score (negative mean squared error) on the test data set is -0.038


### Elastic Net Regularization

In [6]:
## Defining three steps for ElasticNet pipeline
steps = [('imputation', Imputer(missing_values='NaN', strategy='mean', axis=0)),
         ('scaler', StandardScaler()),
         ('elasticnet', ElasticNet())]

pipeline = Pipeline(steps) # creating a pipeline

## Defining sets of model parameters
parameters={'elasticnet__l1_ratio':np.linspace(0.1, 1, 5),
           'elasticnet__alpha':np.linspace(0.1, 1, 5)
           }

## Cross-validation
cv=GridSearchCV(pipeline,param_grid=parameters,
                scoring='neg_mean_squared_error',
                cv=5)

cv.fit(X_train,y_train) # model fitting
elastic_score=round(cv.score(X_test,y_test),3) #getting score

## This part is for "nice printing" of the model parameters' values; otherwise, print returns numbers with many decimals.
## Since round() didn't work on the dictionary values (which are of type numpy.float64), I convert them to float.
for k, v in cv.best_params_.items():
    cv.best_params_[k] = float(v)
    
result_elastic=('Elastic Net',cv.best_params_,elastic_score) # will be used later

## Printing results    
print("Best score in Elastic Net model is obtained with these parameters:\n",cv.best_params_)
print("\nThe score (negative mean squared error) on the test data set is", elastic_score)

Best score in Elastic Net model is obtained with these parameters:
 {'elasticnet__alpha': 0.1, 'elasticnet__l1_ratio': 0.1}

The score (negative mean squared error) on the test data set is -0.066


It is interesting that the best 'l1_ratio' and 'alpha' values are the minimums of their grids. We may be able to achieve better scores by decreasing either or both of these parameters further.

The parameter l1_ratio corresponds to alpha in the glmnet R package while alpha corresponds to the lambda parameter in glmnet. Specifically, l1_ratio = 1 is the lasso penalty.
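
For reference, the scikit-learn documentation gives the ElasticNet objective in the form below, where $n$ is the number of samples and $w$ is the coefficient vector. Writing `alpha` as $\alpha$ and `l1_ratio` as $\rho$ is my own notation for this note.

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
$$

With $\rho = 1$ the last term vanishes and only the L1 (lasso) penalty remains; with $\rho = 0$ only the L2 penalty remains.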

The documentation of sklearn.linear_model.ElasticNet() states that "currently, l1_ratio <= 0.01 is not reliable", which limits how far l1_ratio can be lowered. Decreasing alpha to 0 (zero), on the other hand, makes the elastic net model equivalent to ordinary least squares. Therefore, it makes sense to try ordinary least squares and check whether it produces a better result than elastic net.
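
The relationship between a shrinking alpha and ordinary least squares can be checked on synthetic data. The sketch below is an illustration under assumptions: the data, the true coefficients, the alpha values, and the names `X_demo`, `y_demo`, and `gaps` are all made up and unrelated to the power plant data.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, LinearRegression

# Synthetic regression problem (assumption) with known coefficients.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=500)

ols = LinearRegression().fit(X_demo, y_demo)

# Fit elastic net with decreasing alpha and measure the distance to the OLS fit.
gaps = []
for alpha in [1.0, 0.1, 0.001]:
    enet = ElasticNet(alpha=alpha, l1_ratio=0.5, max_iter=10000).fit(X_demo, y_demo)
    gap = np.max(np.abs(enet.coef_ - ols.coef_))
    gaps.append(gap)
    print(f"alpha={alpha}: largest coefficient gap to OLS = {gap:.4f}")
```

On this toy problem the gap shrinks as alpha decreases, which is consistent with treating OLS as the limiting case worth trying directly.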

### Ordinary Least Squares

In [7]:
## Defining three steps for OLS pipeline
steps = [('imputation', Imputer(missing_values='NaN', strategy='mean', axis=0)),
         ('scaler', StandardScaler()),
         ('linearreg', LinearRegression())]

pipeline = Pipeline(steps) #creating a pipeline

pipeline.fit(X_train,y_train) #fitting model
y_pred = pipeline.predict(X_test) #prediction
neg_mse_linear = round(-(mean_squared_error(y_test, y_pred)),3) # getting score

result_ols=('OLS',"",neg_mse_linear) #will be used later

## Printing results
print("\nThe score (negative mean squared error) on the test data set is", neg_mse_linear)


The score (negative mean squared error) on the test data set is -0.056


It turns out that OLS produces a better result than ElasticNet on the test data set (-0.056 versus -0.066). My observation about the best elastic net parameters led me to try OLS, which indeed gave a better result.

### Conclusion

I created a table to compare the results of the four regressions above:

In [8]:
labels=['Model','Best parameters chosen by 5-fold CV', 'Negative MSE (greater is better)'] # column names
res_df=pd.DataFrame([result_tree,result_knn,result_elastic,result_ols],columns=labels) # creating a dataframe for results

## Printing final table for all results

print("Here is the table that includes all results:\n\n")
pd.set_option('display.max_colwidth', -1) # to display long strings (pandas 1.0 and later expect None instead of -1)
display(HTML(res_df.to_html(index=False))) # to display without index
best_model_index=list(res_df.iloc[:,2]).index(max(list(res_df.iloc[:,2]))) #getting index of the model with best result

print("\n\nTherefore, the best result is achieved by", res_df.iloc[best_model_index,0],
      "with model parameters as the following:\n\n",res_df.iloc[best_model_index,1])

Here is the table that includes all results:




| Model | Best parameters chosen by 5-fold CV | Negative MSE (greater is better) |
|---|---|---|
| Decision Tree | {'tree__max_depth': 9, 'tree__min_samples_leaf': 2, 'tree__min_samples_split': 3} | -0.043 |
| KNN | {'knn__n_neighbors': 7, 'knn__weights': 'distance'} | -0.038 |
| Elastic Net | {'elasticnet__alpha': 0.1, 'elasticnet__l1_ratio': 0.1} | -0.066 |
| OLS | | -0.056 |




Therefore, the best result is achieved by KNN with model parameters as the following:

 {'knn__n_neighbors': 7, 'knn__weights': 'distance'}

