# Preprocess the APS Dataset

#### In this notebook, we download the data from the UCI repository and preprocess it so that it can be used to train a machine learning model.
#### We then split the data and store it in separate training and test folders.

## Dataset Description:

The dataset we use here for predictive maintenance comes from the UCI Machine Learning Repository and consists of Air Pressure System (APS) failures recorded on Scania trucks. Read more about the dataset here: https://archive.ics.uci.edu/ml/datasets/APS+Failure+at+Scania+Trucks

The positive class consists of failures attributed to the APS, and the negative class consists of failures in some other system. The goal is to identify APS failures correctly so that, once the origin of a failure has been identified, a downstream predictive maintenance action can be taken on that system.

This is a typical use case in predictive maintenance (PDM): a first model identifies the root cause of a failure. Once the cause is identified, a second system estimates how much time remains until a failure might occur, which then informs the actions needed to avoid it. Predictive maintenance, like most machine learning problems, can be multifaceted.

### Import Libraries

In [1]:
import pandas as pd
import numpy as np
import os

In [2]:
!conda install curl -y

Collecting package metadata (current_repodata.json): done
Solving environment: done


  current version: 4.7.12
  latest version: 4.8.2

Please update conda by running

    $ conda update -n base conda



## Package Plan ##

  environment location: /opt/conda

  added / updated specs:
    - curl


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    curl-7.68.0                |       hf8cf82a_0         137 KB  conda-forge
    krb5-1.16.4                |       h2fd8d38_0         1.4 MB  conda-forge
    libcurl-7.68.0             |       hda55be3_0         564 KB  conda-forge
    libedit-3.1.20170329       |    hf8c457e_1001         172 KB  conda-forge
    libssh2-1.8.2              |       h22169c7_2         257 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         2.5 MB

The following NEW packages will be I

#### Download the data

In [3]:
!curl https://archive.ics.uci.edu/ml/machine-learning-databases/00421/aps_failure_training_set.csv --output aps_failure_training_set.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 42.5M  100 42.5M    0     0  17.2M      0  0:00:02  0:00:02 --:--:-- 17.2M
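If `curl` is not available, the same download can be sketched in plain Python. This is a minimal sketch using only the standard library; the helper name `download_if_missing` is hypothetical and not part of the original notebook.

```python
import os
import urllib.request

# Same URL as in the curl cell above.
DATA_URL = (
    "https://archive.ics.uci.edu/ml/machine-learning-databases/"
    "00421/aps_failure_training_set.csv"
)


def download_if_missing(url, dest):
    """Download `url` to `dest` unless `dest` already exists; return `dest`.

    Hypothetical helper for illustration only.
    """
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)
    return dest


# Usage (performs a network request, so it is left commented out here):
# download_if_missing(DATA_URL, "aps_failure_training_set.csv")
```

Skipping the download when the file is already present keeps repeated notebook runs from fetching the 42.5 MB file again.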


In [4]:
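# Data lines contain no spaces, so with sep=' ' each one lands whole in column 0;
# only the license preamble is split into several columns.
df = pd.read_csv('aps_failure_training_set.csv', sep=' ', encoding='utf-8', header=None)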

In [5]:
df.head(15)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,This,file,is,part,of,APS,Failure,and,Operational,Data,for,Scania,Trucks.
1,Copyright,(c),<2016>,<Scania,CV,AB>,,,,,,,
2,This,program,(APS,Failure,and,Operational,Data,for,Scania,Trucks),is,,
3,free,software:,you,can,redistribute,it,and/or,modify,,,,,
4,it,under,the,terms,of,the,GNU,General,Public,License,as,published,by
5,the,Free,Software,"Foundation,",either,version,3,of,the,"License,",or,,
6,(at,your,option),any,later,version.,,,,,,,
7,This,program,is,distributed,in,the,hope,that,it,will,be,"useful,",
8,but,WITHOUT,ANY,WARRANTY;,without,even,the,implied,warranty,of,,,
9,MERCHANTABILITY,or,FITNESS,FOR,A,PARTICULAR,PURPOSE.,,See,the,,,


Notice that the raw file begins with a license preamble, so the dataset requires some preprocessing before it is in a suitable format for machine learning. Run the function below to get a preprocessed dataset.
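The preprocessing function in this notebook reads the column names from row 14 of column 0, a position that is hard-coded. As a hedged sketch, that row can also be located by searching for the line whose first comma-separated field is the label name. The helper `find_header_row` is hypothetical, and it assumes that `class` is the first field of the header line.

```python
import pandas as pd


def find_header_row(raw, first_field_name="class"):
    """Return the index of the first row in column 0 that looks like the CSV header.

    `raw` is a frame read with sep=' ' and header=None, so every
    comma-separated line sits whole in column 0.
    Assumption: the header line starts with `first_field_name`.
    """
    first_field = raw[0].astype(str).str.split(",").str[0]
    matches = raw.index[(first_field == first_field_name).to_numpy()]
    if len(matches) == 0:
        raise ValueError("no header row found in column 0")
    return matches[0]


# Usage with the frame loaded above:
# header_row = find_header_row(df)
# columns = df[0][header_row].split(',')
```

If the assumption holds for this file, `find_header_row(df)` should agree with the hard-coded index used in the function below.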

In [6]:
def preprocessdataset(df):
    '''Preprocess the input dataset for machine learning training.'''
    # scikit-learn is only needed for the train/test split
    from sklearn.model_selection import train_test_split

    # Create the output folders; exist_ok avoids an error if they already exist
    os.makedirs('training_data', exist_ok=True)
    os.makedirs('test_data', exist_ok=True)

    print("Start Preprocessing ...")
    # Each CSV line sits whole in column 0: row 14 holds the header,
    # rows 15 to 60014 hold the 60000 data records
    columns = df[0][14].split(',')
    records = [df[0][row].split(',') for row in range(15, 60015)]
    newdf = pd.DataFrame.from_records(records, columns=columns)

    print("Dropping last 2 columns...")
    newdf = newdf.drop(columns=['ef_000', 'eg_000'])

    print("Shape of the entire dataset ={}".format(newdf.shape))

    print("Convert the class categorical label to numerical values for prediction")
    newdf = newdf.replace({'class': {'neg': 0, 'pos': 1}})
    # Missing values are encoded as the string 'na'; fill them with 0
    newdf = newdf.replace('na', 0)

    print("Changing data types to numeric...")
    newdf = newdf.apply(pd.to_numeric)

    print("Splitting the data into train and test...")
    # 80/20 split with a fixed seed so the result is reproducible
    X_train, X_test = train_test_split(newdf, test_size=0.2, random_state=1234)

    print("Saving the data locally in train/test folders...")
    # Files are written without a header row or index column
    X_train.to_csv('training_data/train.csv', index=False, header=None)
    X_test.to_csv('test_data/test.csv', index=False, header=None)
    newdf.to_csv('rawdataset.csv', index=False, header=None)
    print("Shape of Training data = {}".format(X_train.shape))
    print("Shape of Test data = {}".format(X_test.shape))
    print("Success!")
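To make the cleaning steps concrete, here is a small sketch that applies the same label encoding, `'na'` replacement, and numeric conversion to a toy frame. The values are made up for illustration, and the feature names are assumed to follow the dataset's naming pattern.

```python
import pandas as pd

# Toy frame of raw string values (invented for illustration).
toy = pd.DataFrame(
    {
        "class": ["neg", "pos", "neg"],
        "aa_000": ["76698", "na", "41040"],
        "ab_000": ["na", "0", "na"],
    },
    dtype=object,
)

# Share of missing entries per column, measured before they are filled.
missing_share = (toy == "na").mean()

# Same steps as in the preprocessing function: encode the label,
# replace 'na' with 0, then convert every column to a numeric dtype.
toy["class"] = toy["class"].map({"neg": 0, "pos": 1})
toy = toy.replace("na", 0).apply(pd.to_numeric)

print(missing_share)
print(toy.dtypes)
```

Filling with 0 makes a missing reading indistinguishable from a true zero, so checking `missing_share` first shows how much of each column is affected by that choice.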

In [7]:
%%time
preprocessdataset(df)

Start Preprocessing ...
Dropping last 2 columns...
Shape of the entire dataset =(60000, 169)
Convert the class categorical label to numerical values for prediction
Changing data types to numeric...
Splitting the data into train and test...
Saving the data locally in train/test folders...
Shape of Training data = (48000, 169)
Shape of Test data = (12000, 169)
Success!
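Because the split is random, it can be worth checking that the positive class is represented similarly in both files. The sketch below is one way to do that; `class_shares` is a hypothetical helper, and it assumes the label is the first column of the saved, header-less CSV files.

```python
import pandas as pd


def class_shares(csv_path, label_column=0):
    """Return the share of each label value in a header-less CSV file.

    Assumption: the class label is stored in column `label_column`.
    """
    data = pd.read_csv(csv_path, header=None)
    return data[label_column].value_counts(normalize=True).sort_index()


# Usage (after the preprocessing cell has written the files):
# print(class_shares('training_data/train.csv'))
# print(class_shares('test_data/test.csv'))
```

If the positive share differs noticeably between the two files, passing `stratify=newdf['class']` to `train_test_split` is one option for keeping the proportions aligned.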


Now open "predictive-maintenance-xgboost.ipynb" and run its code cells to train a custom XGBoost model for predictive maintenance using the SageMaker built-in algorithms.