# Preprocess the APS dataset

#### In this notebook, we first download the data from UCI and preprocess it so we can build a machine learning model.
#### We then split the data and store it in separate training and test folders.

## Dataset Description:

The dataset we use here for predictive maintenance comes from the UCI Machine Learning Repository and consists of Air Pressure System (APS) failures recorded on Scania trucks. Read more about the dataset here: https://archive.ics.uci.edu/ml/datasets/APS+Failure+at+Scania+Trucks

The positive class consists of failures attributed to the APS, and the negative class consists of failures in some other system. The goal is to identify APS failures correctly so that, once the origin of a failure is known, a downstream predictive maintenance action can be taken on this system.

This is a typical use case in predictive maintenance (PDM): a first model identifies the root cause of the failure. Once the cause is identified, a second system estimates how much time remains until a failure might occur, which then informs the actions needed to avoid it. Predictive maintenance, like most machine learning problems, can be multifaceted.

### Import Libraries

In [None]:
import pandas as pd
import numpy as np
import os

In [None]:
!conda install curl -y

#### Download the data

In [None]:
!curl https://archive.ics.uci.edu/ml/machine-learning-databases/00421/aps_failure_training_set.csv --output aps_failure_training_set.csv
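
If `curl` is not available in your environment, the same file can be fetched with Python's standard library. The cell below is a minimal sketch under that assumption; `download_if_missing` and `DATA_URL` are hypothetical names introduced here for illustration, not part of the original notebook.

```python
import os
import urllib.request

# Same URL as the curl command above
DATA_URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
            "00421/aps_failure_training_set.csv")


def download_if_missing(url, path):
    """Download url to path unless path already exists.

    Returns True if a download happened, False if it was skipped.
    """
    if os.path.exists(path):
        return False
    urllib.request.urlretrieve(url, path)
    return True


# Usage (commented out so that this cell makes no network request on its own):
# download_if_missing(DATA_URL, 'aps_failure_training_set.csv')
```

Skipping the download when the file is already present keeps reruns of the notebook from fetching the data again.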

In [None]:
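# Split on spaces rather than commas: each comma-separated data row lands as
# one string in column 0, which preprocessdataset() splits later.
df = pd.read_csv('aps_failure_training_set.csv', sep=' ', encoding='utf-8', header=None)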

In [None]:
df.head(15)

Notice that the raw dataset needs some preprocessing before it is in a suitable format for machine learning. Run the function below to produce a preprocessed dataset.

In [None]:
def preprocessdataset(df):
    ''' Preprocess the input dataset for Machine learning training'''
    
    # exist_ok=True makes reruns safe without hiding other errors (e.g. permissions)
    os.makedirs('training_data', exist_ok=True)
    os.makedirs('test_data', exist_ok=True)
    
    print("Start Preprocessing ...")
    # Row 14 of the raw frame holds the comma-separated column names;
    # the 60,000 data rows follow it.
    header = df[0][14].split(',')
    newdf = [df[0][row].split(',') for row in range(15, 60015)]
    newdf = pd.DataFrame.from_records(newdf, columns=header)
    
    print("Dropping last 2 columns...")
    newdf = newdf.drop(columns = ['ef_000', 'eg_000'])
    
    print("Shape of the entire dataset ={}".format(newdf.shape))
    
    print("Converting the categorical class label to numeric values for prediction...")
    newdf = newdf.replace({'class': {'neg': 0, 'pos': 1}})
    # Missing readings are encoded as 'na' in the file; fill them with 0
    newdf = newdf.replace('na', 0)

    print("Changing data types to numeric...")
    newdf = newdf.apply(pd.to_numeric)
    
    print("Splitting the data into train and test...")
    
    from sklearn.model_selection import train_test_split
    # Each split keeps the label column together with the features
    X_train, X_test = train_test_split(newdf, test_size=0.2, random_state=1234)
    
    print("Saving the data locally in train/test folders...")
    X_train.to_csv('training_data/train.csv', index=False, header=False)
    X_test.to_csv('test_data/test.csv', index=False, header=False)
    newdf.to_csv('rawdataset.csv', index=False, header=False)
    print("Shape of Training data = {}".format(X_train.shape))
    print("Shape of Test data = {}".format(X_test.shape))
    print("Success!")
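
The parsing and type-conversion steps in `preprocessdataset` can be tried out on a toy frame before running them on all 60,000 rows. The sketch below is illustrative only: the three-column sample and the `header_row` position are made up, and `'na'` is replaced with the string `'0'` (rather than the integer `0` used above) so that every column stays text until the final numeric conversion.

```python
import pandas as pd

# Toy stand-in for the raw frame: column 0 holds whole lines of text.
raw = pd.DataFrame({0: [
    "descriptive text",        # stands in for the lines above the header
    "class,aa_000,ab_000",     # comma-separated column names
    "neg,76698,na",
    "pos,33058,0",
]})

header_row = 1  # hypothetical position; the function above uses row 14
columns = raw[0].iloc[header_row].split(',')
records = [line.split(',') for line in raw[0].iloc[header_row + 1:]]
toy = pd.DataFrame.from_records(records, columns=columns)

# Fill missing readings, encode the label, then convert everything to numbers
toy = toy.replace('na', '0')
toy['class'] = toy['class'].map({'neg': 0, 'pos': 1})
toy = toy.apply(pd.to_numeric)

print(toy)
```
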

In [None]:
%%time
preprocessdataset(df)
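
After the files are written, it can be worth checking how the two classes are distributed in each split, since the split above is random rather than stratified. The sketch below assumes the label is the first column of the saved, headerless CSV files (as it is when `class` is the first name in the header row); `label_share` is a hypothetical helper, and the toy frame stands in for `pd.read_csv('training_data/train.csv', header=None)`.

```python
import pandas as pd


def label_share(frame, label_col=0):
    """Return the fraction of rows for each label value in a headerless frame."""
    return frame[label_col].value_counts(normalize=True).sort_index()


# Toy stand-in for the saved training data: label first, then features
toy_train = pd.DataFrame([
    [0, 10, 5],
    [0, 7, 0],
    [0, 3, 2],
    [1, 99, 4],
])

shares = label_share(toy_train)
print(shares)
```

Running the same helper on both the training and the test file shows whether the positive class has a similar share in each.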

Now go to "predictive-maintenance-xgboost.ipynb" and run the code cells to train your custom XGBoost model for predictive maintenance using SageMaker built-in algorithms.