<h1 id="tocheading">Table of Contents</h1>
<div id="toc"></div>

In [1]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')

// needed to generate the Table of contents 
// taken from github.com/kmahelona/ipython_notebook_goodies

<IPython.core.display.Javascript object>

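(The `autoreload` extension reloads imported modules before each cell runs, so code edited in Atom is picked up in Jupyter without restarting the kernel.)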

In [2]:
%load_ext autoreload
%autoreload 2

# Data Wrangling

## Data Collection

### Locating the data

To train this model we will use the Turbofan Engine Degradation Simulation Data Set from NASA ([Link to dataset](https://ti.arc.nasa.gov/c/6/)).

>A. Saxena and K. Goebel (2008). "Turbofan Engine Degradation Simulation Data Set", NASA Ames Prognostics Data Repository (http://ti.arc.nasa.gov/project/prognostic-data-repository), NASA Ames Research Center, Moffett Field, CA

In [3]:
#load python packages
import os
import glob
import pandas as pd
#import datetime
#import seaborn as sns
#import matplotlib.pyplot as plt
#import numpy as np
#%matplotlib inline

In [4]:
# move up one level to the project root folder
os.chdir('..')
print(os.getcwd())

/home/andrea/Dropbox/PyProjects/Predictive_Maintenance_Fanjet


In [5]:
# Download data in data/raw folder
! mkdir data/raw
! cd data/raw && wget -O turbofan.zip https://ti.arc.nasa.gov/c/6/ && unzip -o *.zip
! cd data/raw && rm *.zip

# NOTE: a cd executed through the `!` shell escape does not change the working directory of the Python kernel.
# We are still working from the project root folder.

mkdir: cannot create directory ‘data/raw’: File exists
--2020-05-29 16:11:12--  https://ti.arc.nasa.gov/c/6/
Resolving ti.arc.nasa.gov (ti.arc.nasa.gov)... 128.102.105.66, 2001:4d0:6311:2227:14b6:372b:2078:2a94
Connecting to ti.arc.nasa.gov (ti.arc.nasa.gov)|128.102.105.66|:443... connected.
HTTP request sent, awaiting response... 302 FOUND
Location: https://ti.arc.nasa.gov/m/project/prognostic-repository/CMAPSSData.zip [following]
--2020-05-29 16:11:12--  https://ti.arc.nasa.gov/m/project/prognostic-repository/CMAPSSData.zip
Reusing existing connection to ti.arc.nasa.gov:443.
HTTP request sent, awaiting response... 200 OK
Length: 12425978 (12M) [application/zip]
Saving to: ‘turbofan.zip’


2020-05-29 16:11:14 (8.03 MB/s) - ‘turbofan.zip’ saved [12425978/12425978]

Archive:  turbofan.zip
  inflating: Damage Propagation Modeling.pdf  
  inflating: readme.txt              
  inflating: RUL_FD001.txt           
  inflating: RUL_FD002.txt           
  inflating: RUL_FD003.txt           
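
The shell commands above rely on `wget` and `unzip` being installed. As a rough, portable alternative, the same step can be sketched with the Python standard library only. The helper names `download_archive` and `extract_archive` are hypothetical, and the URL is the one used in the cell above, which may have moved since this notebook was run.

```python
import urllib.request
import zipfile
from pathlib import Path

# URL copied from the cell above; it may no longer resolve.
DATA_URL = "https://ti.arc.nasa.gov/c/6/"


def download_archive(url, zip_path):
    """Download the archive at `url` to `zip_path` (hypothetical helper)."""
    zip_path = Path(zip_path)
    zip_path.parent.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
    urllib.request.urlretrieve(url, str(zip_path))
    return zip_path


def extract_archive(zip_path, dest_dir, remove_zip=True):
    """Unzip `zip_path` into `dest_dir` and optionally delete the archive."""
    zip_path, dest_dir = Path(zip_path), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        names = archive.namelist()
    if remove_zip:
        zip_path.unlink()
    return names


# Usage (commented out because it needs network access):
# zip_file = download_archive(DATA_URL, "data/raw/turbofan.zip")
# extract_archive(zip_file, "data/raw")
```

Creating the folder with `exist_ok=True` also avoids the "File exists" error when the cell is run a second time.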
  

### Data Loading

After downloading and unzipping the files, the data folder has the following structure. 

```
data/raw
├── Damage Propagation Modeling.pdf
├── readme.txt
├── RUL_FD001.txt
├── RUL_FD002.txt
├── RUL_FD003.txt
├── RUL_FD004.txt
├── test_FD001.txt
├── test_FD002.txt
├── test_FD003.txt
├── test_FD004.txt
├── train_FD001.txt
├── train_FD002.txt
├── train_FD003.txt
└── train_FD004.txt
```

From the readme file we can extract the file structure. There are 4 sets of files:
1. FD001
2. FD002
3. FD003
4. FD004


Each set contains 3 types of files:
1. Training data
2. Test data
3. Remaining Useful Life (RUL) data

Test and training data files have the same 26-column structure:
1. unit number
2. cycle time
3. operating setting 1
4. operating setting 2
5. operating setting 3
6. sensor reading 1
7. sensor reading 2
...
26. sensor reading 21
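
To make the column layout concrete, here is a minimal sketch that builds the column names and parses two made-up rows from an in-memory string. The column names and the sample values are assumptions for illustration only; they are not taken from the data set.

```python
import io

import pandas as pd

# Column names following the structure listed above (the names are an assumption).
columns = (
    ["unit_number", "cycle_time"]
    + [f"op_setting_{i}" for i in range(1, 4)]
    + [f"s{i}" for i in range(1, 22)]
)

# Two made-up rows with 26 space-separated values each.
sample = "\n".join(
    " ".join(["1", str(cycle)] + ["0.5"] * 24) for cycle in (1, 2)
)

# Split on runs of whitespace and attach the column names.
df = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None, names=columns)
print(df.shape)  # (2, 26)
```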


The RUL data file has a single column corresponding to the RUL value.
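
Since the RUL file carries no unit column, a row can only be tied to an engine by its position. The sketch below assumes that row *i* belongs to test unit *i* (1-based); this should be checked against the readme before relying on it. The RUL values are made up.

```python
import io

import pandas as pd

# Made-up RUL values standing in for the contents of one RUL file.
sample_rul = "112\n98\n69\n"

df_rul = pd.read_csv(io.StringIO(sample_rul), header=None, names=["RUL"])

# Assumption: row order matches the unit numbers of the matching test file,
# so the 1-based row position is used as the unit number.
df_rul.insert(0, "unit_number", list(range(1, len(df_rul) + 1)))
print(df_rul)
```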

We will import the txt files and consolidate the data into 3 CSV files, each labelled with a `dataset` column that records the source set. The files will reside in the data folder:
```
data
├── RUL.csv
├── test.csv
└── train.csv
```
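
The label for each row comes from the name of the raw file it was read from. As a small sketch, the identifier can be pulled out with a regular expression, which does not depend on the exact length of the file name. The function name `dataset_id` is hypothetical.

```python
import re
from pathlib import Path


def dataset_id(file_path):
    """Return the data set label (e.g. 'FD001') found in a raw file name."""
    match = re.search(r"FD\d{3}", Path(file_path).name)
    if match is None:
        raise ValueError(f"no data set id in {file_path!r}")
    return match.group(0)


print(dataset_id("data/raw/train_FD001.txt"))  # FD001
print(dataset_id("data/raw/RUL_FD004.txt"))    # FD004
```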

In [6]:
# define column names
sensors_list = ["s{}".format(s) for s in range(1,22)]
train_cols = ['unit_number','cycle_time','op_setting_1', 'op_setting_2', 'op_setting_3'] + sensors_list
test_cols = train_cols
RUL_cols = ['RUL']


In [7]:
# load Training data
# sort the paths so the concatenation order is reproducible
file_paths = sorted(glob.glob("data/raw/train_*.txt"))

df_list = []
for file_path in file_paths:
    # read txt file, keeping only the 26 data columns
    individual_df = pd.read_csv(file_path, sep=' ', header=None, usecols=range(26))
    individual_df.columns = train_cols
    # extract dataset id (e.g. 'FD001') from the filename
    data_set = file_path[-9:-4]
    individual_df['dataset'] = data_set
    # append temporary dataframe to list
    df_list.append(individual_df)
# merge into a single dataframe with a fresh index
df_train = pd.concat(df_list, ignore_index=True)
df_train.head()

# write to csv
df_train.to_csv('data/train.csv', index=False)

In [8]:
# load Test data
# sort the paths so the concatenation order is reproducible
file_paths = sorted(glob.glob("data/raw/test_*.txt"))

df_list = []
for file_path in file_paths:
    # read txt file, keeping only the 26 data columns
    individual_df = pd.read_csv(file_path, sep=' ', header=None, usecols=range(26))
    individual_df.columns = test_cols
    # extract dataset id (e.g. 'FD001') from the filename
    data_set = file_path[-9:-4]
    individual_df['dataset'] = data_set
    # append temporary dataframe to list
    df_list.append(individual_df)
# merge into a single dataframe with a fresh index
df_test = pd.concat(df_list, ignore_index=True)
df_test.head()

# write to csv
df_test.to_csv('data/test.csv', index=False)

In [9]:
# load RUL data
# sort the paths so the concatenation order is reproducible
file_paths = sorted(glob.glob("data/raw/RUL_*.txt"))

df_list = []
for file_path in file_paths:
    # read txt file, keeping only the first column (the RUL value)
    individual_df = pd.read_csv(file_path, sep=' ', header=None, usecols=[0])
    individual_df.columns = RUL_cols
    # extract dataset id (e.g. 'FD001') from the filename
    data_set = file_path[-9:-4]
    individual_df['dataset'] = data_set
    # append temporary dataframe to list
    df_list.append(individual_df)
# merge into a single dataframe with a fresh index
df_RUL = pd.concat(df_list, ignore_index=True)
df_RUL.head()

# write to csv
df_RUL.to_csv('data/RUL.csv', index=False)
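
The three loading cells above repeat the same steps with a different file pattern and column list. One possible way to factor them out is sketched below. The helper name `load_group` is hypothetical, and the sketch keeps the same assumptions as the cells above: space-separated files and a file name ending in `FDxxx.txt`.

```python
import glob

import pandas as pd


def load_group(pattern, columns):
    """Read every raw file matching `pattern` into one labelled DataFrame."""
    frames = []
    # Sort the paths so the concatenation order is reproducible.
    for file_path in sorted(glob.glob(pattern)):
        frame = pd.read_csv(
            file_path,
            sep=" ",
            header=None,
            usecols=range(len(columns)),  # keep only the leading data columns
        )
        frame.columns = columns
        frame["dataset"] = file_path[-9:-4]  # e.g. 'FD001'
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


# Usage with the names defined earlier in this notebook:
# load_group("data/raw/train_*.txt", train_cols).to_csv("data/train.csv", index=False)
# load_group("data/raw/test_*.txt", test_cols).to_csv("data/test.csv", index=False)
# load_group("data/raw/RUL_*.txt", RUL_cols).to_csv("data/RUL.csv", index=False)
```

Note that `pd.concat` raises a `ValueError` when no file matches the pattern, which makes a wrong path fail early instead of silently writing an empty CSV.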