<a href="https://colab.research.google.com/github/btalbr01/MLA_BTA/blob/main/MLA2_BTA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Machine Learning Assignment 2 <br>
Ben Albright<br>
CS430-ON<br>
Machine Learning in the Cloud

Imports for specific models

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

In [None]:
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.metrics import (mean_squared_error, r2_score, auc, precision_score,
                             recall_score, f1_score, roc_curve,
                             precision_recall_curve, confusion_matrix)

Linear Regression - Acquire Data

In [None]:
df = pd.read_csv('/content/gdrive/MyDrive/Steel_industry_data.csv')

In [None]:
df.rename({'Usage_kWh':'usage_kwh','Lagging_Current_Reactive.Power_kVarh':'lag_react_pwr_kvarh',
           'Leading_Current_Reactive_Power_kVarh':'lead_react_pwr_kvarh','Lagging_Current_Power_Factor':'lag_current_pwr',
           'Leading_Current_Power_Factor':'lead_current_pwr','NSM':'nsm','WeekStatus':'week_status', 'Day_of_week':'day_of_week',
           'Load_Type':'load_type'}, axis=1, inplace=True)

In [None]:
df.columns


Statistics and Visual Exploration

In [None]:
df.describe()

In [None]:
df.head()

In [None]:
load_types = df['load_type'].value_counts()
load_types_df = pd.DataFrame(load_types)
load_types_df.reset_index(inplace=True)
load_types_df.columns = ['load_type', 'count']


In [None]:
load_types_df

In [None]:
plt.figure(figsize=(10,5))
sns.barplot(x="load_type", y="count", data = load_types_df)
plt.xlabel('load type')
plt.ylabel('count')


In [None]:
sns.pairplot(df[['usage_kwh', 'lag_react_pwr_kvarh', 'lead_react_pwr_kvarh', 'lag_current_pwr', 'lead_current_pwr', 'nsm', 'week_status', 'day_of_week']])

It looks like usage_kwh, lag_react_pwr_kvarh, and lead_react_pwr_kvarh all follow similar patterns when plotted against nsm.
The plots of usage_kwh and lag_react_pwr_kvarh against lead_current_pwr also look almost identical.
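One way to back up these visual impressions is a correlation matrix. The sketch below is only an illustration: the frame `demo` is synthetic (the values are invented, only the column names mirror this notebook), and the helper `numeric_correlations` is a hypothetical name, not part of pandas or seaborn.

```python
import numpy as np
import pandas as pd

def numeric_correlations(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the Pearson correlation matrix of the numeric columns only."""
    return frame.select_dtypes(include="number").corr()

# Synthetic stand-in for the steel data (invented values, not real readings).
rng = np.random.default_rng(0)
nsm = np.arange(0, 86400, 900)  # seconds from midnight in 15-minute steps
usage = 30 + 20 * np.sin(2 * np.pi * nsm / 86400) + rng.normal(0, 1, nsm.size)
demo = pd.DataFrame({
    "usage_kwh": usage,
    "lag_react_pwr_kvarh": 0.5 * usage + rng.normal(0, 1, nsm.size),
    "nsm": nsm,
    "load_type": "Light_Load",  # non-numeric column, ignored by the helper
})

corr = numeric_correlations(demo)
print(corr.round(2))
```

On the real data, the same helper could be called on `df`; pairs with correlations near 1 or -1 would be the ones that look nearly identical in the pairplot.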

Splitting Data into Train/Test

In [None]:
X = df[['lag_react_pwr_kvarh', 'lead_react_pwr_kvarh',
       'CO2(tCO2)', 'lag_current_pwr', 'lead_current_pwr', 'nsm',
       'week_status', 'day_of_week', 'load_type']]

In [None]:
y = df[['usage_kwh']]


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2, random_state=42)

In [None]:
cat_attribs = ['week_status', 'day_of_week', 'load_type']
num_attribs = ['lag_react_pwr_kvarh', 'lead_react_pwr_kvarh', 'CO2(tCO2)', 'lag_current_pwr', 'lead_current_pwr', 'nsm']

Building the Pipeline

In [None]:
col_transform = ColumnTransformer(transformers=
   [('cat',OneHotEncoder(), cat_attribs),
   ('num',MinMaxScaler(), num_attribs)])

In [None]:
pipeline = Pipeline([
    ('trans', col_transform),
    ('lr', LinearRegression())
])

In [None]:
from sklearn import set_config
set_config(display='diagram')
pipeline

Executing the Model

In [None]:
pipeline.fit(X_train, y_train)


Evaluating the Model

In [None]:
y_pred = pipeline.predict(X_test)

In [None]:
print(mean_squared_error(y_train, pipeline.predict(X_train))**(0.5))
print(mean_squared_error(y_test, pipeline.predict(X_test))**(0.5))

Conclusion

The train and test RMSE values are almost the same. This suggests the model performs consistently on both datasets and is not overfitting the training set.
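To make "almost the same" concrete, the gap between the two errors can be expressed as a fraction of the training error. This is a minimal sketch with invented numbers; `rmse` and `relative_gap` are hypothetical helper names, and the arrays are not taken from the steel dataset.

```python
import numpy as np

def rmse(y_true, y_pred) -> float:
    """Root mean squared error between two equally sized arrays."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def relative_gap(train_rmse: float, test_rmse: float) -> float:
    """How much larger the test RMSE is, as a fraction of the train RMSE."""
    return (test_rmse - train_rmse) / train_rmse

# Invented numbers purely for illustration.
train_err = rmse([10.0, 20.0, 30.0], [11.0, 19.0, 31.0])
test_err = rmse([15.0, 25.0], [16.0, 23.9])
print(train_err, test_err, relative_gap(train_err, test_err))
```

A small positive gap is what one would expect from a model that generalizes; a large one would hint at overfitting.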

In [None]:
y_pred = pipeline.predict(X_test)
rsv = r2_score(y_test, y_pred)

print(rsv)

The R-squared value is about 98.43%, which means the model explains roughly that share of the variance in the dependent variable (usage_kwh) from the independent variables.
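For reference, R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on a tiny made-up example; `r_squared` is a hypothetical helper, and `r2_score` from scikit-learn should give the same number for a single target.

```python
import numpy as np

def r_squared(y_true, y_pred) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return float(1.0 - ss_res / ss_tot)

# Invented values, not from the steel dataset.
score = r_squared([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
print(round(score, 4))
```

A model that always predicts the mean of `y_true` scores 0, and a perfect model scores 1.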

In [None]:
pipeline.named_steps['lr'].intercept_

In [None]:
pipeline.named_steps['lr'].coef_

I think this is a suitable model. The RMSE values for both train and test are fairly close, and the R-squared value of the model is very high.

Logistic Regression - Acquire Data

In [None]:
df['week_status']=df['week_status'].replace({'Weekday':1, 'Weekend':0})

Splitting Data into Train/Test

In [None]:
X = df[['usage_kwh', 'lag_react_pwr_kvarh', 'lead_react_pwr_kvarh', 'CO2(tCO2)', 'lag_current_pwr', 'lead_current_pwr', 'nsm', 'load_type']]

In [None]:
y = df[['week_status']]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Building the Pipeline

In [None]:
cat_attribs = ['load_type']
num_attribs = ['lag_react_pwr_kvarh', 'lead_react_pwr_kvarh', 'CO2(tCO2)', 'lag_current_pwr', 'lead_current_pwr', 'nsm']

In [None]:
col_transform = ColumnTransformer(transformers=
   [('cat',OneHotEncoder(), cat_attribs),
   ('num',MinMaxScaler(), num_attribs)])

In [None]:
pipe = Pipeline([
    ('prep', col_transform),
    ('mlr', LogisticRegression(max_iter=1000))
])

Executing the Model

In [None]:
pipe.fit(X_train, np.ravel(y_train))

Evaluating the Model

In [None]:
y_pred = pipe.predict(X_test)

In [None]:
def plot_cm(y_test, y_pred):
    """Plot the confusion matrix as an annotated heatmap."""
    cm = confusion_matrix(y_test, y_pred)
    plt.figure(figsize=(10, 10))
    # Cell values are integer counts, so format them without decimals
    sns.heatmap(cm, annot=True, fmt='d', cmap='RdYlGn')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [None]:
plot_cm(y_test, y_pred)

In [None]:
precision_score(y_test, y_pred)



In [None]:
recall_score(y_test, y_pred)

In [None]:
f1_score(y_test, y_pred)

In [None]:
precision, recall, thresholds = precision_recall_curve(y_test, pipe.predict_proba(X_test)[:, 1])
pr_auc = auc(recall, precision)

plt.figure()

plt.plot(recall, precision, label=f'PR AUC = {pr_auc:.3f}')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.legend()

I used the precision-recall curve because the difference between the precision and recall is so significant. The high recall score implies that the dataset might be imbalanced, and the curve also provides a better view of false positives.
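Whether the classes are actually imbalanced can be checked directly, and the share of positive labels is also the precision a no-skill classifier would reach on a precision-recall plot. The sketch below uses an invented label vector (one week with five weekdays and two weekend days) rather than the notebook's `y_test`; `positive_rate` is a hypothetical helper name.

```python
import numpy as np

def positive_rate(y) -> float:
    """Fraction of labels equal to 1 (here 1 stands for 'Weekday')."""
    return float(np.mean(np.asarray(y).ravel() == 1))

# One invented week: 5 weekdays (1) and 2 weekend days (0).
labels = np.array([1, 1, 1, 1, 1, 0, 0])
baseline = positive_rate(labels)
print(f"positive share / no-skill precision: {baseline:.4f}")
```

Applied to the real labels, `positive_rate(y_test)` would give the baseline to compare the measured precision against.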

Conclusion

This model seems decent to good for predicting week_status. The recall score is over 96%. The precision isn't as good at only 77.54%, but that is still better than one might get just by guessing randomly. I'm sure there is room for refinement.
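One possible refinement, offered here only as a sketch, is to move the decision threshold away from the default of 0.5 and trade some recall for precision. The labels and scores below are invented, and `precision_recall_at` is a hypothetical helper; in the notebook, the scores would come from `pipe.predict_proba(X_test)[:, 1]`.

```python
import numpy as np

def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when predicting 1 for scores >= threshold."""
    y_true = np.asarray(y_true).ravel()
    pred = (np.asarray(scores).ravel() >= threshold).astype(int)
    tp = int(np.sum((pred == 1) & (y_true == 1)))  # true positives
    fp = int(np.sum((pred == 1) & (y_true == 0)))  # false positives
    fn = int(np.sum((pred == 0) & (y_true == 1)))  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Invented labels and predicted probabilities, for illustration only.
y_demo = [1, 1, 1, 0, 0, 1, 0, 1]
p_demo = [0.9, 0.8, 0.55, 0.6, 0.3, 0.7, 0.52, 0.4]
for t in (0.5, 0.65):
    print(t, precision_recall_at(y_demo, p_demo, t))
```

In this toy example, raising the threshold removes the false positives at the cost of missing more true positives. Whether that trade is worthwhile depends on which kind of error matters more.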