Data Source: https://archive.ics.uci.edu/ml/datasets/APS+Failure+at+Scania+Trucks

Purpose of this analysis:  
We read in the reduced train and test data sets containing APS failure data, produced by the previous data-cleaning stage. We then impute the missing values using three strategies, which yields three pairs of train and test data sets. For each pair, we identify the highly correlated variables and drop them, producing files that are ready for classification modelling.   

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [15]:
## Read in the reduced data set obtained after dropping columns that have missing values and low variance.
train_data = pd.read_csv("train_reduced.csv")

In [16]:
# Read in test data corresponding to the reduced train data 
test_data = pd.read_csv("test_reduced.csv")

In [17]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60000 entries, 0 to 59999
Data columns (total 62 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   class   60000 non-null  object 
 1   ag_003  59329 non-null  float64
 2   ag_004  59329 non-null  float64
 3   ag_005  59329 non-null  float64
 4   ag_006  59329 non-null  float64
 5   ag_007  59329 non-null  float64
 6   ah_000  59355 non-null  float64
 7   al_000  59358 non-null  float64
 8   am_0    59371 non-null  float64
 9   an_000  59358 non-null  float64
 10  ao_000  59411 non-null  float64
 11  ap_000  59358 non-null  float64
 12  aq_000  59411 non-null  float64
 13  ay_001  59329 non-null  float64
 14  ay_005  59329 non-null  float64
 15  ay_006  59329 non-null  float64
 16  ay_007  59329 non-null  float64
 17  ay_008  59329 non-null  float64
 18  az_003  59329 non-null  float64
 19  az_004  59329 non-null  float64
 20  az_005  59329 non-null  float64
 21  az_006  59329 non-null  float64
 22

In [18]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16000 entries, 0 to 15999
Data columns (total 62 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   class   16000 non-null  object 
 1   ag_003  15811 non-null  float64
 2   ag_004  15811 non-null  float64
 3   ag_005  15811 non-null  float64
 4   ag_006  15811 non-null  float64
 5   ag_007  15811 non-null  float64
 6   ah_000  15825 non-null  float64
 7   al_000  15831 non-null  float64
 8   am_0    15837 non-null  float64
 9   an_000  15831 non-null  float64
 10  ao_000  15838 non-null  float64
 11  ap_000  15831 non-null  float64
 12  aq_000  15838 non-null  float64
 13  ay_001  15808 non-null  float64
 14  ay_005  15808 non-null  float64
 15  ay_006  15808 non-null  float64
 16  ay_007  15808 non-null  float64
 17  ay_008  15808 non-null  float64
 18  az_003  15808 non-null  float64
 19  az_004  15808 non-null  float64
 20  az_005  15808 non-null  float64
 21  az_006  15808 non-null  float64
 22

In [19]:
train_data.isnull().sum().sort_values(ascending = False)

br_000    49264
bq_000    48722
bp_000    47740
bx_000     3257
cc_000     3255
          ...  
bj_000      589
ci_000      338
cj_000      338
ck_000      338
class         0
Length: 62, dtype: int64

In [20]:
train_data.isnull().sum().quantile([0.25,0.5,0.75])

0.25    669.0
0.50    671.0
0.75    687.0
dtype: float64

In [21]:
x = train_data.isnull().sum()
x[x > 687]

ba_000      688
ba_001      688
ba_002      688
ba_003      688
ba_004      688
ba_006      688
ba_007      688
bp_000    47740
bq_000    48722
br_000    49264
bu_000      691
bv_000      691
bx_000     3257
cc_000     3255
cq_000      691
dtype: int64

All but five columns have fewer than 700 missing values. Of those five, bx_000 and cc_000 each have about 3,250 missing values, while bp_000, bq_000 and br_000 have between 47,740 and 49,264, roughly 80% of the 60,000 rows. We drop these five columns.  
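
The column selection above can also be done programmatically instead of by reading the output. The following is a minimal sketch on a toy frame; the helper name `columns_with_many_missing` and the toy data are hypothetical and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def columns_with_many_missing(df, max_missing):
    """Return the names of columns whose missing-value count exceeds max_missing."""
    missing_counts = df.isnull().sum()
    return missing_counts[missing_counts > max_missing].index.tolist()

# Toy frame (hypothetical data, not the APS data set)
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],        # no missing values
    "b": [np.nan, np.nan, np.nan, 4.0],  # three missing values
    "c": [1.0, np.nan, 3.0, 4.0],     # one missing value
})

sparse_cols = columns_with_many_missing(toy, max_missing=2)
print(sparse_cols)  # ['b']
```

On the real data, a threshold somewhere between 700 and 3,000 would presumably select the same five columns that are listed by hand in the next cell.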

In [22]:
cols_to_drop = ['bp_000', 'bq_000', 'br_000','bx_000','cc_000']

In [23]:
train_data.drop(cols_to_drop, axis = 1 , inplace = True)

In [24]:
train_data.shape

(60000, 57)

In [25]:
## drop the corresponding columns in test 
test_data.drop(cols_to_drop, axis = 1 , inplace = True )

In [26]:
test_data.shape

(16000, 57)

Strategies for imputation:
1. Impute with zero, then drop the highly correlated columns.  
2. Impute with the column median, then drop the highly correlated columns.  
3. Impute with KNN, then drop the highly correlated columns.   
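
To make the difference between the three strategies concrete, here is a small sketch on toy data. The frame `toy` and the variable names are illustrative assumptions; only `fillna`, `SimpleImputer` and `KNNImputer` correspond to calls used later in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Toy frame (hypothetical data) with one missing value per column
toy = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0],
    "y": [10.0, np.nan, 30.0, 40.0],
})

# Strategy 1: replace every missing value with zero
zero_filled = toy.fillna(0)

# Strategy 2: replace missing values with the column median
median_filled = pd.DataFrame(
    SimpleImputer(strategy="median").fit_transform(toy), columns=toy.columns
)

# Strategy 3: replace missing values with the average of the 2 nearest rows
knn_filled = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(toy), columns=toy.columns
)

print(zero_filled)
print(median_filled)
print(knn_filled)
```

Zero imputation ignores the distribution of each column, the median uses a single per-column statistic, and KNN uses the values of similar rows, which is why the three resulting data sets can behave differently in modelling.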

In [27]:
#1 Impute with zero value 
train_data_1 = train_data.copy()
train_data_1.fillna(0, inplace = True)

In [29]:
cols = train_data_1.columns.tolist()

In [30]:
## Find the highly correlated variables
# Create the absolute correlation matrix over the numeric feature columns
# (the non-numeric 'class' label is excluded explicitly)
corr_matrix = train_data_1.drop(columns = 'class').corr().abs()

# Select the upper triangle of the correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Collect the feature columns whose correlation with an earlier column is greater than 0.9
to_drop = [column for column in upper.columns if any(upper[column] > 0.9)]

In [31]:
print(" No of columns that are correlated : {}".format(len(to_drop)))

 No of columns that are correlated : 24
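
Since the same correlation filter is applied to each imputed data set, it could be wrapped in a function. This is a sketch under assumptions: the function name `find_correlated_columns` and the toy data are hypothetical, and the input is assumed to contain numeric feature columns only.

```python
import numpy as np
import pandas as pd

def find_correlated_columns(features, threshold=0.9):
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr_matrix = features.corr().abs()
    # Keep only the entries strictly above the diagonal, so each pair is seen once
    mask = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)
    upper = corr_matrix.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# Toy frame (hypothetical data)
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],  # exact multiple of "a"
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],   # only weakly related to "a" and "b"
})

correlated = find_correlated_columns(toy)
print(correlated)  # ['b']
```

Because only the upper triangle is inspected, the later column of each correlated pair is dropped and the earlier one is kept.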


In [32]:
train_data_1.drop(to_drop, axis = 1 , inplace = True)

In [16]:
## save the file 
train_data_1.to_csv("train_data_1.csv" , index = False)

In [17]:
## impute the test data with zero and drop the columns as above.
test_data_1 = test_data.copy()
test_data_1.fillna(0, inplace = True)
test_data_1.drop(to_drop, axis = 1 , inplace = True)
print(test_data_1.shape)
## save the file 
test_data_1.to_csv("test_data_1.csv" , index = False)

(16000, 33)


##### Imputing with median value.

In [18]:
## impute the train and test data with median values.

train_data_2 = train_data.copy()
test_data_2 = test_data.copy()

cols = train_data_2.columns.tolist()
from sklearn.impute import SimpleImputer
# Fit the imputer on the train features only (cols[0] is the 'class' label)
si = SimpleImputer(strategy = 'median')
train_data_2[cols[1:]] = si.fit_transform(train_data_2[cols[1:]])
# Reuse the train medians for the test set so that no test statistics are used
test_data_2[cols[1:]] = si.transform(test_data_2[cols[1:]])

In [19]:
(train_data_2.isnull().sum() > 0).sum()

0

In [20]:
(test_data_2.isnull().sum() > 0).sum()

0
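
A common convention is to learn imputation statistics from the training set only and then reuse them on the test set, so that the test data does not influence preprocessing. The sketch below illustrates this on toy data; `train_toy` and `test_toy` are hypothetical frames, not the APS data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames (hypothetical data)
train_toy = pd.DataFrame({"x": [1.0, 3.0, np.nan, 5.0]})
test_toy = pd.DataFrame({"x": [np.nan, 100.0]})

imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train_toy)  # learns the median of the train column
test_filled = imputer.transform(test_toy)        # reuses the train median

# The learned statistic is the train median (3.0), not the test value
print(imputer.statistics_)
print(test_filled.ravel())
```

Here the missing test value is filled with 3.0, the median of the observed training values, even though the only observed test value is 100.0.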

In [21]:
## Find the highly correlated variables
# Create the absolute correlation matrix over the numeric feature columns
# (the non-numeric 'class' label is excluded explicitly)
corr_matrix = train_data_2.drop(columns = 'class').corr().abs()

# Select the upper triangle of the correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Collect the feature columns whose correlation with an earlier column is greater than 0.9
to_drop_2 = [column for column in upper.columns if any(upper[column] > 0.9)]
print("No of correlated columns : {}".format(len(to_drop_2)))


No of correlated columns : 24


In [22]:
## drop the correlated columns from train_data_2.
train_data_2.drop(to_drop_2, axis = 1 , inplace = True)
print(train_data_2.shape)


(60000, 33)


In [23]:
## drop the corresponding columns in test_data_2.
test_data_2.drop(to_drop_2, axis = 1 , inplace = True)

In [24]:
print(test_data_2.shape)

(16000, 33)


In [25]:
## save the file 
train_data_2.to_csv("train_data_2.csv", index = False)
test_data_2.to_csv("test_data_2.csv", index = False)


In [26]:
## imputing with knn
train_data_3 = train_data.copy()
test_data_3 = test_data.copy()

cols = train_data_3.columns.tolist()
from sklearn.impute import KNNImputer
# Fit the imputer on the train features only (cols[0] is the 'class' label)
knn_imp = KNNImputer(n_neighbors = 2)
train_data_3[cols[1:]] = knn_imp.fit_transform(train_data_3[cols[1:]])
# Impute the test set from its nearest neighbours in the train set
test_data_3[cols[1:]] = knn_imp.transform(test_data_3[cols[1:]])

In [27]:
## Find the highly correlated variables
# Create the absolute correlation matrix over the numeric feature columns
# (the non-numeric 'class' label is excluded explicitly)
corr_matrix = train_data_3.drop(columns = 'class').corr().abs()

# Select the upper triangle of the correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Collect the feature columns whose correlation with an earlier column is greater than 0.9
to_drop_3 = [column for column in upper.columns if any(upper[column] > 0.9)]
print("No of correlated columns : {}".format(len(to_drop_3)))


No of correlated columns : 24


In [28]:
(train_data_3.isnull().sum() > 0).sum() 

0

In [29]:
(test_data_3.isnull().sum() > 0).sum() 

0

In [30]:
## drop the correlated columns from train_data_3.
train_data_3.drop(to_drop_3, axis = 1 , inplace = True)
print(train_data_3.shape)


(60000, 33)


In [31]:
## drop the corresponding columns in test_data_3.
test_data_3.drop(to_drop_3, axis = 1 , inplace = True)
print(test_data_3.shape)


(16000, 33)


In [32]:
## save the file 
train_data_3.to_csv("train_data_3.csv", index = False)

In [33]:
## save the test file
test_data_3.to_csv("test_data_3.csv", index = False)