We build four models and compare them by F1 score to find the most suitable one:

1. Random Forest

2. Logistic Regression

3. AdaBoost

4. Gradient Boosting

Random Forest Classifier

Import libraries and read training data

In [None]:
import pandas as pd
import numpy as np

df_train=pd.read_csv('../input/predictive-equipment-failures/equip_failures_training_set.csv')

df_train.head()

Inspect summary statistics of the training data

In [None]:
df_train.describe()

Check for null values (isna() detects only true NaN values, not the 'na' strings that are replaced in the next step)

In [None]:
df_train.isna().sum()
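Because the missing entries are handled as the literal string 'na' in the next step, it may help to count those strings directly. The following is a minimal sketch on a hypothetical toy frame (the column names `sensor_a` and `sensor_b` are made up for illustration and are not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy frame standing in for df_train; column names are invented.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "target": [0, 1, 0],
    "sensor_a": ["12", "na", "7"],
    "sensor_b": ["na", "na", "3.5"],
})

# isna() only sees real missing values (NaN/None), so every count is zero here.
nan_counts = toy.isna().sum()

# Comparing against the literal string counts the placeholder entries instead.
na_string_counts = (toy == "na").sum()

print(nan_counts)
print(na_string_counts)
```

An alternative, assuming 'na' is the only placeholder in the file, would be to pass `na_values='na'` to `pd.read_csv` so that the placeholders are parsed as NaN on load.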

Replace 'na' values with 0

In [None]:
df_train=df_train.replace('na',0)
df_train.head()
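One thing to keep in mind is that replacing 'na' with 0 does not by itself make the columns numeric: the remaining entries are still strings. The sketch below, on a hypothetical toy frame with invented column names, shows one way to convert the columns explicitly with `pd.to_numeric`; it is an optional extra step, not part of the original pipeline.

```python
import pandas as pd

# Hypothetical toy frame; dtype=object mimics columns read in as strings.
toy = pd.DataFrame(
    {"sensor_a": ["12", "na", "7"], "sensor_b": ["na", "0.5", "3"]},
    dtype=object,
)

# Same replacement as in the notebook: the placeholder string becomes 0.
replaced = toy.replace("na", 0)

# Convert every column to a numeric dtype; strings such as "12" are parsed.
numeric = replaced.apply(pd.to_numeric)

print(numeric.dtypes)
print(numeric)
```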

Read and clean test data

In [None]:
df_test=pd.read_csv('../input/predictive-equipment-failures/equip_failures_test_set.csv')

df_test=df_test.replace('na',0)

df_test.head()

Train the model on 70% of the data, hold out the remaining 30% for testing, and predict

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

model = RandomForestClassifier(n_estimators=200,bootstrap=True,max_features='sqrt')
X = df_train.iloc[:, 2:].values
y = df_train.iloc[:, 1].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

y_pred

Get the F1 score of the predicted values

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred,digits=4))
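The F1 score reported above is the harmonic mean of precision and recall, F1 = 2PR / (P + R). As a small worked example with made-up labels (not the notebook's data), `f1_score` can also be called directly when only that one number is needed:

```python
from sklearn.metrics import f1_score

# Made-up labels for illustration: 1 marks a failure, 0 a normal reading.
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 0, 0, 1, 1, 1, 0, 0]

# Here TP = 2, FP = 1, FN = 1, so precision = recall = 2/3 and F1 = 2/3.
f1_demo = f1_score(y_true_demo, y_pred_demo, pos_label=1)
print(round(f1_demo, 4))
```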



Logistic Regression

Read the training .csv file into a DataFrame called train

In [None]:
train = pd.read_csv('../input/predictive-equipment-failures/equip_failures_training_set.csv')

type() returns the class of the object passed to it

In [None]:
type(train)

The pandas head() method returns the first n rows (5 by default) of a DataFrame or Series

In [None]:
train.head()

The functions below give more information about the dataset

In [None]:
print(train.count())
train.info()  # info() prints its summary itself and returns None
print(train.describe())

The summary statistics are inspected for missing or implausible values

In [None]:
descriptive_stats = train.describe()
descriptive_stats

All 'na' values are replaced with 0 and the result is stored in a new variable

In [None]:
new_train = train.replace({'na':0})

The training dataset is split for validation

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    new_train.drop('target', axis=1), new_train['target'],
    test_size=0.10, random_state=0)

Logistic Regression is fitted to the data and its accuracy on the held-out set is reported

In [None]:
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)

print('Accuracy of logistic regression classifier on test set: {:.4f}'.format(logreg.score(X_test, y_test)))


F1-score, precision, recall and support are calculated with a library function
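No cell accompanies this step in the notebook, so here is a minimal, self-contained sketch of how such a report could be produced. It uses synthetic data from `make_classification` rather than the equipment dataset, and the names `X_demo`, `y_demo`, and `demo_logreg` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (roughly 90% class 0, 10% class 1).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0)

# Same 90/10 split ratio as the notebook; stratify keeps both classes present.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.10, random_state=0, stratify=y_demo)

demo_logreg = LogisticRegression(max_iter=1000)
demo_logreg.fit(X_tr, y_tr)

# The report lists precision, recall, F1-score and support per class.
report = classification_report(y_te, demo_logreg.predict(X_te), digits=4)
print(report)
```

In the notebook itself, the equivalent call would presumably be `print(classification_report(y_test, y_pred, digits=4))`, using `y_test` and `y_pred` from the logistic regression cell.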

AdaBoost

In [None]:
# Note: X holds the training set and y holds the test set (not the training labels)
X = pd.read_csv('../input/predictive-equipment-failures/equip_failures_training_set.csv')
y = pd.read_csv('../input/predictive-equipment-failures/equip_failures_test_set.csv')

Replace 'na' strings with 0, separate the two classes, and downsample the majority class to 1000 rows

In [None]:
from sklearn.utils import resample
#separate training
new_X = X.replace('na',0)
new_y = y.replace('na',0)
df_major = new_X[new_X.target==0]
df_minor = new_X[new_X.target==1]

df_major_downsampled = resample(df_major, replace=False,n_samples=1000, random_state=123)

#combine minor class with downsampled majority class
df_downsampled = pd.concat([df_major_downsampled, df_minor])
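To make the effect of the downsampling step concrete, the following sketch repeats it on a hypothetical toy frame (20 majority rows, 4 minority rows, and a made-up `feature` column) and checks the resulting class counts:

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical toy data: 20 rows of class 0 and 4 rows of class 1.
toy = pd.DataFrame({"target": [0] * 20 + [1] * 4, "feature": range(24)})

major = toy[toy.target == 0]
minor = toy[toy.target == 1]

# Draw 8 majority rows without replacement; the minority class is kept whole.
major_down = resample(major, replace=False, n_samples=8, random_state=123)
balanced = pd.concat([major_down, minor])

counts = balanced.target.value_counts()
print(counts)
```

Sampling with `replace=False` requires `n_samples` to be no larger than the number of majority rows, which holds here (8 of 20).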





In [None]:
# Show the class balance after downsampling
print(df_downsampled.target.value_counts())

# Hold out 20% of the downsampled data for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    df_downsampled.drop('target', axis=1), df_downsampled['target'],
    test_size=0.20, random_state=0)

Implement AdaBoost with decision trees as base learners: 250 trees, each with a maximum depth of 1

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

classifier = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=250
)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

Classification report showing the F1 score

In [None]:
print(classification_report(y_test, y_pred))

Gradient Boosting

Implement gradient boosting with 300 trees and a maximum depth of 1

In [None]:

gb = GradientBoostingClassifier(n_estimators=300, max_depth = 1)
gb.fit(X_train, y_train)
predictions = gb.predict(X_test)

print("Classification Report")
# Evaluate the gradient boosting predictions (not the AdaBoost y_pred from the earlier cell)
print(classification_report(y_test, predictions, digits=4))

Finally, we choose Random Forest because it has the highest F1 score
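The F1 scores above come from different splits (a 30% holdout for Random Forest, 10% for Logistic Regression, and 20% of a downsampled set for the two boosting models). One possible way to put the comparison on a common footing is to score every model on the same held-out set. The sketch below does this on synthetic data with smaller ensembles for speed; the names `candidates`, `scores`, and `best_name` are hypothetical, and the numbers it prints say nothing about the equipment dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (assumption: not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=12, weights=[0.85, 0.15], random_state=0)

# One shared split so that every model is scored on identical rows.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)

candidates = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "AdaBoost": AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(
        n_estimators=50, max_depth=1, random_state=0),
}

# Fit each candidate and record its F1 score for the positive class.
scores = {}
for name, clf in candidates.items():
    clf.fit(X_tr, y_tr)
    scores[name] = f1_score(y_te, clf.predict(X_te))

best_name = max(scores, key=scores.get)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
print("Best on this split:", best_name)
```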

Write the final DataFrame to CSV

In [None]:
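# Predict with the Random Forest model; the first test column is the id, the rest are features
prediction = model.predict(df_test.iloc[:, 1:].values)

df_final = pd.DataFrame(df_test['id'])
df_final['target'] = prediction

# index=False keeps the row index out of the file, leaving only id and target
df_final.to_csv('final.csv', index=False)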