<div class="alert alert-info">
<H2> Capstone 1 Predictive Maintenance </H2>
<H3> Data Preparation </H3>
    
Steps involved:
    <ol>
        <li>___Import libraries and data___</li>
        <li>___Rationale for inclusion/exclusion___: Here I will list the data to be included or excluded and the reasons for these decisions.</li>
        <li>___Data cleaning report___: Here I will describe what decisions and actions were taken to address data quality problems. I will also consider any transformations I made on the data for cleaning purposes and their potential impact on the analysis results.</li>
        <li>___Derived attributes___: These are newly created attributes from one or more existing attributes in the same record.</li>
        <li>___Generated records___: Here I will describe any completely new files that were created.</li>
     </ol>

<div class="alert alert-success">
<H3> Step 1. Import libraries and data </H3>

In [17]:
# imports
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from IPython.display import HTML  # IPython.core.display is deprecated as an import path

pd.set_option('display.max_rows', 2000)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

# statistical processing
from xgboost import XGBClassifier, cv, plot_importance
from sklearn import preprocessing
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, LogisticRegression
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, classification_report, confusion_matrix, roc_curve, roc_auc_score
from statsmodels.stats.outliers_influence import variance_inflation_factor

# data viz imports
plt.style.use('ggplot')
%matplotlib inline

# load data
df1_sensors = pd.read_csv('data/sensor.csv')

<div class="alert alert-success">
<H3> Step 2. Rationale for inclusion/exclusion </H3>
    
Here I will list the data to be included or excluded and the reasons for these decisions.

In [18]:
# drop duplicated index: 'Unnamed: 0'
df2_sensors = df1_sensors.drop('Unnamed: 0' , axis='columns')

<div class="alert alert-warning">
<H3>Findings </H3>
Dropped column: 'Unnamed: 0'.
This field was dropped because it duplicated the row index; it was added as an extra column when the source system wrote out the file.
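
A minimal sketch of how such a column typically arises and two ways to handle it. The toy frame and the in-memory CSV are assumptions for illustration only, not the project data.

```python
import io
import pandas as pd

# Toy frame standing in for the sensor data (hypothetical values).
toy = pd.DataFrame({"sensor_00": [2.46, 2.44],
                    "machine_status": ["NORMAL", "NORMAL"]})

# Writing with the default index=True stores the row index as an unnamed first column.
csv_text = toy.to_csv()

# Reading the file back without further options yields an extra 'Unnamed: 0' column.
naive = pd.read_csv(io.StringIO(csv_text))

# Option A: drop the column after loading, as in the cell above.
dropped = naive.drop(columns="Unnamed: 0")

# Option B: tell read_csv that the first column is the row index.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(naive.columns))
print(list(dropped.columns))
print(list(indexed.columns))
```

Both options leave the same data columns; Option B simply avoids creating the extra column in the first place.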

<div class="alert alert-success">
<H3> Step 3. Data cleaning report </H3>
    
Here I will describe what decisions and actions were taken to address data quality problems. I will also consider any transformations I made on the data for cleaning purposes and their potential impact on the analysis results.

In [19]:
# replace blank values with zeroes
#df3_sensors = df2_sensors.fillna(0)


# recode machine status as a binary target: NORMAL = '0', RECOVERING = '1', BROKEN = '1'
# (assign the result back instead of calling replace(..., inplace=True) on a column attribute)
df2_sensors['machine_status'] = df2_sensors['machine_status'].replace({'NORMAL': '0', 'RECOVERING': '1', 'BROKEN': '1'})

# convert timestamp to datetime and set it as the index
df2_sensors = df2_sensors.set_index(pd.to_datetime(df2_sensors['timestamp']))

# drop the timestamp column, which now duplicates the index
df3_sensors = df2_sensors.drop(columns='timestamp')

<div class="alert alert-warning">
<H3>Findings </H3>
These changes were cleared with management during the data understanding phase. To build the regression model, the x-variables need to be consistent in shape with the y-variable. It was also critical that management agreed on a single definition of the y-variable ('NORMAL': '0', 'RECOVERING': '1', 'BROKEN': '1') so that the target being predicted is consistent. Blank values were not replaced at this stage; the fillna step is commented out.
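
The cleaning steps above can be sketched on a small sample, followed by two basic quality checks. The three-row frame, the missing reading, and the names `status_map`, `unmapped`, and `missing_per_column` are assumptions for illustration; only the column names and the status mapping come from this notebook.

```python
import pandas as pd

# Hypothetical three-row sample using the same column names as the sensor data.
toy = pd.DataFrame({
    "timestamp": ["2018-04-01 00:00:00", "2018-04-01 00:01:00", "2018-04-01 00:02:00"],
    "sensor_00": [2.465394, None, 2.444734],
    "machine_status": ["NORMAL", "BROKEN", "RECOVERING"],
})

# Collapse the three states into a binary target (strings, matching the mapping above).
status_map = {"NORMAL": "0", "RECOVERING": "1", "BROKEN": "1"}
toy["machine_status"] = toy["machine_status"].map(status_map)

# Use the parsed timestamp as the index and drop the now-redundant column.
toy = toy.set_index(pd.to_datetime(toy["timestamp"])).drop(columns="timestamp")

# Check 1: map() leaves NaN for any label missing from status_map, so count those.
unmapped = int(toy["machine_status"].isna().sum())

# Check 2: count missing readings per column before deciding how to treat them.
missing_per_column = toy.isna().sum()

print(unmapped)
print(missing_per_column.to_dict())
```

Using `map` rather than `replace` in the sketch is a deliberate choice: an unexpected status label becomes NaN and shows up in the first check instead of passing through unchanged.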

<div class="alert alert-success">
<H3> Step 4. Derived attributes </H3>
    
These are newly created attributes from one or more existing attributes in the same record.

<div class="alert alert-warning">
<H3>Findings </H3>
N/A

<div class="alert alert-success">
<H3> Step 5. Generated records </H3>
    
Here I will describe any completely new files that were created.

In [21]:
df3_sensors.head()

timestamp,sensor_00,sensor_01,sensor_02,sensor_03,sensor_04,sensor_05,sensor_06,sensor_07,sensor_08,sensor_09,sensor_10,sensor_11,sensor_12,sensor_13,sensor_14,sensor_15,sensor_16,sensor_17,sensor_18,sensor_19,sensor_20,sensor_21,sensor_22,sensor_23,sensor_24,sensor_25,sensor_26,sensor_27,sensor_28,sensor_29,sensor_30,sensor_31,sensor_32,sensor_33,sensor_34,sensor_35,sensor_36,sensor_37,sensor_38,sensor_39,sensor_40,sensor_41,sensor_42,sensor_43,sensor_44,sensor_45,sensor_46,sensor_47,sensor_48,sensor_49,sensor_50,sensor_51,machine_status
2018-04-01 00:00:00,2.465394,47.09201,53.2118,46.31076,634.375,76.45975,13.41146,16.13136,15.56713,15.05353,37.2274,47.52422,31.11716,1.681353,419.5747,,461.8781,466.3284,2.565284,665.3993,398.9862,880.0001,498.8926,975.9409,627.674,741.7151,848.0708,429.0377,785.1935,684.9443,594.4445,682.8125,680.4416,433.7037,171.9375,341.9039,195.0655,90.32386,40.36458,31.51042,70.57291,30.98958,31.770832,41.92708,39.6412,65.68287,50.92593,38.19444,157.9861,67.70834,243.0556,201.3889,0
2018-04-01 00:01:00,2.465394,47.09201,53.2118,46.31076,634.375,76.45975,13.41146,16.13136,15.56713,15.05353,37.2274,47.52422,31.11716,1.681353,419.5747,,461.8781,466.3284,2.565284,665.3993,398.9862,880.0001,498.8926,975.9409,627.674,741.7151,848.0708,429.0377,785.1935,684.9443,594.4445,682.8125,680.4416,433.7037,171.9375,341.9039,195.0655,90.32386,40.36458,31.51042,70.57291,30.98958,31.770832,41.92708,39.6412,65.68287,50.92593,38.19444,157.9861,67.70834,243.0556,201.3889,0
2018-04-01 00:02:00,2.444734,47.35243,53.2118,46.39757,638.8889,73.54598,13.32465,16.03733,15.61777,15.01013,37.86777,48.17723,32.08894,1.708474,420.848,,462.7798,459.6364,2.500062,666.2234,399.9418,880.4237,501.3617,982.7342,631.1326,740.8031,849.8997,454.239,778.5734,715.6266,661.574,721.875,694.7721,441.2635,169.982,343.1955,200.9694,93.90508,41.40625,31.25,69.53125,30.46875,31.77083,41.66666,39.351852,65.39352,51.21528,38.194443,155.9606,67.12963,241.3194,203.7037,0
2018-04-01 00:03:00,2.460474,47.09201,53.1684,46.397568,628.125,76.98898,13.31742,16.24711,15.69734,15.08247,38.57977,48.65607,31.67221,1.579427,420.7494,,462.898,460.8858,2.509521,666.0114,399.1046,878.8917,499.043,977.752,625.4076,739.2722,847.7579,474.8731,779.5091,690.4011,686.1111,754.6875,683.3831,446.2493,166.4987,343.9586,193.1689,101.0406,41.92708,31.51042,72.13541,30.46875,31.51042,40.88541,39.0625,64.81481,51.21528,38.19444,155.9606,66.84028,240.4514,203.125,0
2018-04-01 00:04:00,2.445718,47.13541,53.2118,46.397568,636.4583,76.58897,13.35359,16.21094,15.69734,15.08247,39.48939,49.06298,31.95202,1.683831,419.8926,,461.4906,468.2206,2.604785,663.2111,400.5426,882.5874,498.5383,979.5755,627.183,737.6033,846.9182,408.8159,785.2307,704.6937,631.4814,766.1458,702.4431,433.9081,164.7498,339.963,193.877,101.7038,42.70833,31.51042,76.82291,30.98958,31.51042,41.40625,38.77315,65.10416,51.79398,38.77315,158.2755,66.55093,242.1875,201.3889,0


In [23]:
# write the cleaned data; index=False means the timestamp index is not written to the file
df3_sensors.to_csv('data/sensor_cleaned.csv', index=False, header=True)

<div class="alert alert-warning">
<H3>Findings </H3>
A new file, data/sensor_cleaned.csv, was generated. It will be used to build the regression model during the modeling phase.
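
A minimal sketch of how the choice of `index` affects what ends up in the generated file, using an in-memory CSV. The two-row frame is a hypothetical stand-in for `df3_sensors`, and the variable names `no_index` and `with_index` are illustrative.

```python
import io
import pandas as pd

# Hypothetical two-row frame with a timestamp index, mirroring the shape of df3_sensors.
toy = pd.DataFrame(
    {"sensor_00": [2.465394, 2.444734], "machine_status": ["0", "0"]},
    index=pd.to_datetime(["2018-04-01 00:00:00", "2018-04-01 00:02:00"]).rename("timestamp"),
)

# Written without the index: the timestamps are not part of the file.
no_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

# Written with the index: the timestamps can be restored when loading.
with_index = pd.read_csv(
    io.StringIO(toy.to_csv(index=True)),
    index_col="timestamp",
    parse_dates=["timestamp"],
)

print(list(no_index.columns))
print(list(with_index.index) == list(toy.index))
```

If the modeling phase needs the time ordering (for example, for a time-based train/test split), writing the index would preserve it; whether that is needed here is a decision for the modeling notebook.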