# Outline of functions used in the training data pipeline

## This notebook demos the pipeline for converting the feature dataframe into training and test sets that can be used for testing models. Our models were ultimately built using cross-validation rather than the techniques outlined here. However, many of the functions here may still be useful, especially for quick testing.

In [1]:
import os
import pandas as pd
import random
import sys
from math import sqrt
from sklearn import svm
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

###### set the path

In [2]:
sys.path.insert(0,os.path.dirname(os.getcwd()))
import src.models.modeling_pipeline as mp

##### import the data

In [3]:
data_path = os.path.join(os.path.dirname(os.getcwd()),'data', 'interim', 'full_feature_cg_data.csv')
df = pd.read_csv(data_path)

In [4]:
df.head()

Unnamed: 0,Tube_Alias,Flaw_ID,Angle,Amp_1,Amp_2,Amp_3,Amp_4,Amp_5,Amp_6,Amp_7,...,Phase_15,Phase_16,Phase_17,Phase_18,Phase_19,Phase_20,Flaw_Depth,Pct_Depth,Flaw_Volume,Flaw_Area
0,AP01,A,0,10.320653,14.854606,16.017633,12.423161,23.971155,18.113805,36.668613,...,0.330427,-0.038826,-0.681307,-0.840659,0.556654,0.333518,0.076,10.3,0.864,11.3288
1,AP01,A,10,9.256762,13.566036,12.946058,12.594721,19.365281,17.401339,28.518174,...,0.290231,-0.069128,-0.695486,-0.73928,0.548302,0.294714,0.076,10.3,0.864,11.3288
2,AP01,A,20,6.375396,8.061063,9.746397,10.879743,14.309066,18.105114,22.923653,...,0.355075,-0.013229,-0.505109,-0.781582,0.626245,0.293746,0.076,10.3,0.864,11.3288
3,AP01,A,30,9.70041,11.746437,14.777542,16.300598,18.984767,21.744765,30.582848,...,0.338012,-0.003931,-0.437232,-0.721585,0.560389,0.27756,0.076,10.3,0.864,11.3288
4,AP01,A,40,7.913722,10.29701,10.453795,10.156999,15.417897,17.797726,26.734713,...,0.372874,-0.057453,-0.44083,-0.2,0.514553,0.194392,0.076,10.3,0.864,11.3288


#### The data pipeline is intended to transform the data above into scaled X_train and X_test matrices, and y_train and y_test vectors

In [5]:
X_train, X_test, y_train, y_test, train, test = mp.get_scaled_training_test_data(df)
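The scaling step is not shown in this notebook, so the following is only a minimal sketch of how scaled X matrices and y vectors might be produced from a training and a test dataframe. It assumes the scaler is fitted on the training features only and then applied to the test features; the toy values and the `feature_cols` name are hypothetical, and the actual driver function may work differently.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for the unscaled training and test dataframes (hypothetical values).
train_toy = pd.DataFrame({
    "Amp_1": [10.0, 20.0, 30.0],
    "Phase_1": [0.1, 0.2, 0.3],
    "Flaw_Depth": [0.076, 0.152, 0.229],
})
test_toy = pd.DataFrame({
    "Amp_1": [40.0],
    "Phase_1": [0.4],
    "Flaw_Depth": [0.292],
})

feature_cols = ["Amp_1", "Phase_1"]  # assumption: amp/phase columns are the features

# Fit the scaler on the training features only, then reuse it for the test features,
# so that no information from the test set leaks into the scaling parameters.
scaler = StandardScaler()
X_train_toy = scaler.fit_transform(train_toy[feature_cols])
X_test_toy = scaler.transform(test_toy[feature_cols])

# The target (flaw depth) is left unscaled in this sketch.
y_train_toy = train_toy["Flaw_Depth"].to_numpy()
y_test_toy = test_toy["Flaw_Depth"].to_numpy()

print(X_train_toy.shape, X_test_toy.shape)
```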

##### This module includes a number of functions that are integrated into the driver function above. They are outlined here to clarify what the driver function is actually doing.

##### The first function is subset_data_with_features, which takes a dataframe and, optionally, a feature list as arguments. If no list is passed to this function, it returns the dataframe with all amp/phase features, the ID info, and the flaw depth.

In [6]:
new_df1 = mp.subset_data_with_features(df)
new_df1.head()

Unnamed: 0,Tube_Alias,Flaw_ID,Angle,Amp_1,Amp_2,Amp_3,Amp_4,Amp_5,Amp_6,Amp_7,...,Phase_12,Phase_13,Phase_14,Phase_15,Phase_16,Phase_17,Phase_18,Phase_19,Phase_20,Flaw_Depth
0,AP01,A,0,10.320653,14.854606,16.017633,12.423161,23.971155,18.113805,36.668613,...,-0.874738,1.146718,0.792448,0.330427,-0.038826,-0.681307,-0.840659,0.556654,0.333518,0.076
1,AP01,A,10,9.256762,13.566036,12.946058,12.594721,19.365281,17.401339,28.518174,...,-0.931822,1.134723,0.870461,0.290231,-0.069128,-0.695486,-0.73928,0.548302,0.294714,0.076
2,AP01,A,20,6.375396,8.061063,9.746397,10.879743,14.309066,18.105114,22.923653,...,-0.833733,1.277617,0.792631,0.355075,-0.013229,-0.505109,-0.781582,0.626245,0.293746,0.076
3,AP01,A,30,9.70041,11.746437,14.777542,16.300598,18.984767,21.744765,30.582848,...,-0.967913,1.275743,0.81841,0.338012,-0.003931,-0.437232,-0.721585,0.560389,0.27756,0.076
4,AP01,A,40,7.913722,10.29701,10.453795,10.156999,15.417897,17.797726,26.734713,...,-0.849391,1.286724,0.671175,0.372874,-0.057453,-0.44083,-0.2,0.514553,0.194392,0.076


In [7]:
new_df2 = mp.subset_data_with_features(df, feature_list=['Amp_10', 'Phase_15'])
new_df2.head()

Unnamed: 0,Tube_Alias,Flaw_ID,Angle,Amp_10,Phase_15,Flaw_Depth
0,AP01,A,0,35.139706,0.330427,0.076
1,AP01,A,10,33.76892,0.290231,0.076
2,AP01,A,20,28.693021,0.355075,0.076
3,AP01,A,30,29.467458,0.338012,0.076
4,AP01,A,40,22.470982,0.372874,0.076
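For readers without access to the module, a function with this behaviour could be sketched roughly as follows. This is an illustration based only on the outputs shown above, not the module's actual code; the name `subset_sketch` and the constants `ID_COLS` and `TARGET_COL` are hypothetical.

```python
import pandas as pd

ID_COLS = ["Tube_Alias", "Flaw_ID", "Angle"]   # ID info kept in every subset
TARGET_COL = "Flaw_Depth"                      # target column kept in every subset

def subset_sketch(df, feature_list=None):
    """Keep ID columns, the chosen features, and the flaw depth."""
    if feature_list is None:
        # Default: every amplitude and phase feature, in original column order.
        feature_list = [c for c in df.columns if c.startswith(("Amp_", "Phase_"))]
    return df[ID_COLS + list(feature_list) + [TARGET_COL]]

# Toy dataframe with a couple of features and extra columns that should be dropped.
toy = pd.DataFrame({
    "Tube_Alias": ["AP01", "AP01"],
    "Flaw_ID": ["A", "A"],
    "Angle": [0, 10],
    "Amp_1": [10.3, 9.3],
    "Phase_1": [0.33, 0.29],
    "Flaw_Depth": [0.076, 0.076],
    "Pct_Depth": [10.3, 10.3],
})

print(list(subset_sketch(toy).columns))
print(list(subset_sketch(toy, feature_list=["Amp_1"]).columns))
```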


##### The functions also provide the option to drop a random tube alias from the dataframe

In [8]:
df['Tube_Alias'].unique()

array(['AP01', 'AP02', 'AP03', 'AP04', 'AP05', 'CP01', 'CP02', 'CP03',
       'CP04', 'CP05', 'RP02', 'RP03', 'RP04', 'RP05', 'RP06', 'WT02',
       'WT03', 'WT04', 'WT05'], dtype=object)

In [9]:
new_df3 = mp.exclude_random_tube(df)

In [10]:
new_df3['Tube_Alias'].unique()

array(['AP02', 'AP03', 'AP04', 'AP05', 'CP01', 'CP02', 'CP03', 'CP04',
       'CP05', 'RP02', 'RP03', 'RP04', 'RP05', 'RP06', 'WT02', 'WT03',
       'WT04', 'WT05'], dtype=object)

##### You can subset the data based on selecting random angles for each tube/flaw pair. The default value is 1 angle per tube/flaw, but you can choose whatever you like. Below is an example with 2 random angles per tube/flaw.

In [11]:
new_df4 = mp.pick_random_angle_rows(df, num=2)
new_df4.head()

Unnamed: 0,Tube_Alias,Flaw_ID,Angle,Amp_1,Amp_2,Amp_3,Amp_4,Amp_5,Amp_6,Amp_7,...,Phase_15,Phase_16,Phase_17,Phase_18,Phase_19,Phase_20,Flaw_Depth,Pct_Depth,Flaw_Volume,Flaw_Area
0,AP01,A,100,15.787986,16.587679,21.951638,17.885413,34.074086,18.358172,51.181806,...,0.389544,-0.056397,-0.408446,-0.888428,0.73168,0.344095,0.076,10.3,0.864,11.3288
1,AP01,A,170,22.193651,16.478787,34.109705,21.119385,50.387691,16.67003,79.137553,...,0.41732,-0.001445,-0.598973,-0.780981,0.767024,0.369973,0.076,10.3,0.864,11.3288
2,AP01,B,130,53.610751,39.452027,80.06221,46.077773,120.144701,5.429487,194.063681,...,0.358438,0.029517,-0.626986,-0.907842,0.810729,-0.121476,0.152,20.6,1.728,11.3288
3,AP01,B,150,55.924986,34.477951,82.233307,49.487226,124.884456,12.935593,198.003694,...,0.390145,0.023867,-0.588234,-0.883141,0.777585,0.410427,0.152,20.6,1.728,11.3288
4,AP01,C,60,51.184711,49.003643,77.631193,72.29787,116.75074,6.37905,183.565317,...,0.341692,-0.044112,-0.630663,-0.955366,0.723759,0.382473,0.229,31.1,2.592,11.3288
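One way to get this kind of per-pair random selection in pandas is to shuffle the rows and then keep the first `num` rows of each tube/flaw group. The sketch below illustrates that idea only; the function name and the `seed` parameter are hypothetical, and the module may implement the selection differently.

```python
import pandas as pd

def pick_random_angle_rows_sketch(df, num=1, seed=None):
    """Keep up to `num` randomly chosen rows for each tube/flaw pair."""
    # Shuffle all rows, then keep the first `num` rows encountered per pair.
    shuffled = df.sample(frac=1, random_state=seed)
    picked = shuffled.groupby(["Tube_Alias", "Flaw_ID"]).head(num)
    return picked.sort_values(["Tube_Alias", "Flaw_ID", "Angle"]).reset_index(drop=True)

toy = pd.DataFrame({
    "Tube_Alias": ["AP01"] * 6,
    "Flaw_ID": ["A", "A", "A", "B", "B", "B"],
    "Angle": [0, 10, 20, 0, 10, 20],
})
picked = pick_random_angle_rows_sketch(toy, num=2, seed=0)
print(picked)
```

A pair with fewer than `num` rows simply contributes all of its rows in this sketch.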


##### Finally, you can choose how the training and test data are split. The first option uses scikit-learn's built-in train_test_split function. The default proportion of rows assigned to the training set is 0.8, but you can set it to whatever you like.

In [12]:
training, test = mp.use_train_test_split(df, train_pct=0.75)

In [13]:
training.head()

Unnamed: 0,Tube_Alias,Flaw_ID,Angle,Amp_1,Amp_2,Amp_3,Amp_4,Amp_5,Amp_6,Amp_7,...,Phase_15,Phase_16,Phase_17,Phase_18,Phase_19,Phase_20,Flaw_Depth,Pct_Depth,Flaw_Volume,Flaw_Area
0,CP01,D,230,124.850943,75.775265,187.789007,112.553831,284.140218,29.914768,453.175763,...,0.407731,-0.030815,-0.478661,-0.875626,0.807909,0.470828,0.292,39.6,3.313,11.3288
1,CP03,B,60,8.649198,13.90287,12.814413,10.313041,17.883506,15.077404,29.097524,...,0.567622,0.246343,-0.333,-0.356707,1.014557,0.80604,0.152,20.6,0.384,2.5122
2,CP05,C,150,41.815957,34.824093,63.113117,41.6099,93.287548,16.123,149.388996,...,0.570716,0.057602,-0.328801,-0.887212,0.995189,0.433921,0.216,29.3,1.088,5.0165
3,CP02,A,230,30.866836,19.604668,45.393244,11.39502,69.412435,11.545175,110.664778,...,0.485072,0.548024,-0.415244,-0.581466,0.601208,-0.499467,0.076,10.3,0.576,7.5684
4,WT03,B,190,13.43042,11.069166,19.08074,13.949902,27.868707,7.918982,44.596521,...,0.605633,0.19304,-0.297659,-0.752373,1.007591,0.67619,0.142,20.0,0.637,4.483999


In [14]:
test.head()

Unnamed: 0,Tube_Alias,Flaw_ID,Angle,Amp_1,Amp_2,Amp_3,Amp_4,Amp_5,Amp_6,Amp_7,...,Phase_15,Phase_16,Phase_17,Phase_18,Phase_19,Phase_20,Flaw_Depth,Pct_Depth,Flaw_Volume,Flaw_Area
5107,WT03,F,40,51.211427,36.168807,75.747728,45.325434,117.777638,11.392083,189.86206,...,0.337132,-0.109532,-0.559054,-0.982107,0.850765,-0.142226,0.427,60.0,1.911,4.483999
2393,CP03,G,310,18.296134,16.400545,29.4274,27.168372,47.843782,12.754223,76.341804,...,0.200487,-0.262095,-0.622943,-1.108679,0.622381,0.296996,0.521,70.7,1.312,2.5122
3498,RP03,D,150,42.618348,35.248519,64.561995,41.378838,96.117111,15.903924,151.369418,...,0.507515,0.015894,-0.419556,-0.874365,0.912059,0.447304,0.29,39.4,1.296,4.483999
5788,WT05,G,130,211.564204,152.319922,322.094544,220.668464,494.019403,293.572238,808.751309,...,0.265352,-0.692511,-0.569452,-0.9621,0.854487,0.446323,0.498,70.0,8.916,17.935994
5249,WT04,A,60,2.126748,8.97853,3.698965,7.739184,5.794735,6.235919,8.041948,...,0.322392,0.003046,-0.108485,-0.913397,0.594888,0.352919,0.071,10.0,0.563,7.938234
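A wrapper of this kind is likely a thin layer over `sklearn.model_selection.train_test_split`. The sketch below is an assumption about its shape rather than the module's code; the function name and `seed` parameter are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def use_train_test_split_sketch(df, train_pct=0.8, seed=None):
    """Randomly split rows into training and test dataframes."""
    training, test = train_test_split(df, train_size=train_pct, random_state=seed)
    return training, test

toy = pd.DataFrame({
    "Tube_Alias": ["AP01"] * 8,
    "Flaw_ID": ["A"] * 4 + ["B"] * 4,
    "Angle": [0, 10, 20, 30] * 2,
})
toy_training, toy_test = use_train_test_split_sketch(toy, train_pct=0.75, seed=0)
print(len(toy_training), len(toy_test))
```

Because rows are split individually, the same tube/flaw pair can appear in both sets, as it does for WT03 in the output above.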


##### Alternatively, you can choose a custom splitter that ensures tube/flaw pairs are not shared between the training and test sets. For example, all instances of AP01 with flaw A would show up in either the training set or the test set, but not both.

In [15]:
training2, test2 = mp.split_tube_flaw_between_train_test(df, train_pct=0.75)

In [16]:
training2.sort_values(['Tube_Alias', 'Flaw_ID', 'Angle']).head()

Unnamed: 0,Tube_Alias,Flaw_ID,Angle,Amp_1,Amp_2,Amp_3,Amp_4,Amp_5,Amp_6,Amp_7,...,Phase_15,Phase_16,Phase_17,Phase_18,Phase_19,Phase_20,Flaw_Depth,Pct_Depth,Flaw_Volume,Flaw_Area
2106,AP01,A,0,10.320653,14.854606,16.017633,12.423161,23.971155,18.113805,36.668613,...,0.330427,-0.038826,-0.681307,-0.840659,0.556654,0.333518,0.076,10.3,0.864,11.3288
63,AP01,A,10,9.256762,13.566036,12.946058,12.594721,19.365281,17.401339,28.518174,...,0.290231,-0.069128,-0.695486,-0.73928,0.548302,0.294714,0.076,10.3,0.864,11.3288
317,AP01,A,20,6.375396,8.061063,9.746397,10.879743,14.309066,18.105114,22.923653,...,0.355075,-0.013229,-0.505109,-0.781582,0.626245,0.293746,0.076,10.3,0.864,11.3288
1336,AP01,A,30,9.70041,11.746437,14.777542,16.300598,18.984767,21.744765,30.582848,...,0.338012,-0.003931,-0.437232,-0.721585,0.560389,0.27756,0.076,10.3,0.864,11.3288
2283,AP01,A,40,7.913722,10.29701,10.453795,10.156999,15.417897,17.797726,26.734713,...,0.372874,-0.057453,-0.44083,-0.2,0.514553,0.194392,0.076,10.3,0.864,11.3288


In [17]:
test2.sort_values(['Tube_Alias', 'Flaw_ID', 'Angle']).head()

Unnamed: 0,Tube_Alias,Flaw_ID,Angle,Amp_1,Amp_2,Amp_3,Amp_4,Amp_5,Amp_6,Amp_7,...,Phase_15,Phase_16,Phase_17,Phase_18,Phase_19,Phase_20,Flaw_Depth,Pct_Depth,Flaw_Volume,Flaw_Area
499,AP01,B,0,36.221091,27.615068,55.197284,31.687575,83.976849,16.687873,131.858595,...,0.35742,-0.079227,-0.629653,-0.941627,0.635804,0.355119,0.152,20.6,1.728,11.3288
500,AP01,B,10,37.919316,25.727886,55.830544,33.572782,85.948893,8.328127,139.496115,...,0.358851,-0.044346,-0.631171,-0.973232,0.708834,0.344032,0.152,20.6,1.728,11.3288
501,AP01,B,20,41.675609,20.394822,61.486515,34.094998,94.081847,6.742141,152.091663,...,0.363403,0.009234,-0.633922,-0.939547,0.66702,-0.106303,0.152,20.6,1.728,11.3288
502,AP01,B,30,36.310727,21.29075,56.400562,41.954908,85.279194,15.80879,137.027767,...,0.365524,-0.001769,-0.622837,-0.965853,0.723438,-0.177416,0.152,20.6,1.728,11.3288
503,AP01,B,40,39.926,31.871166,59.242943,36.997879,92.233388,7.304527,146.747261,...,0.362425,-0.013571,-0.625705,-0.923809,0.707503,-0.053165,0.152,20.6,1.728,11.3288
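A group-aware split like this can be sketched by shuffling the unique tube/flaw pairs and assigning whole pairs to one side. In the sketch below, `train_pct` is applied to the number of pairs rather than the number of rows; that choice, the function name, and the `seed` parameter are assumptions, not details taken from the module.

```python
import random
import pandas as pd

def split_by_tube_flaw_sketch(df, train_pct=0.8, seed=None):
    """Assign each tube/flaw pair wholly to the training set or the test set."""
    row_pairs = list(zip(df["Tube_Alias"], df["Flaw_ID"]))
    pairs = sorted(set(row_pairs))
    random.Random(seed).shuffle(pairs)
    n_train = int(round(train_pct * len(pairs)))
    train_pairs = set(pairs[:n_train])
    # Boolean mask marking the rows whose pair was assigned to the training set.
    in_train = pd.Series([p in train_pairs for p in row_pairs], index=df.index)
    return df[in_train], df[~in_train]

toy = pd.DataFrame({
    "Tube_Alias": ["AP01"] * 4 + ["AP02"] * 4,
    "Flaw_ID": ["A", "A", "B", "B"] * 2,
    "Angle": [0, 10] * 4,
})
toy_train, toy_test = split_by_tube_flaw_sketch(toy, train_pct=0.75, seed=0)

train_set = set(zip(toy_train["Tube_Alias"], toy_train["Flaw_ID"]))
test_set = set(zip(toy_test["Tube_Alias"], toy_test["Flaw_ID"]))
print(train_set & test_set)  # pairs shared by both sets
```

scikit-learn's `GroupShuffleSplit` offers a similar guarantee if a combined tube/flaw key is passed as the `groups` argument.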


# Important

##### This brings us to how these functions are put together. Although they can be used individually, they are also packaged into one driver function, get_scaled_training_test_data, which was first shown in cell 5. The driver function has one required parameter, a dataframe, and four optional parameters: feature_list (default None), method (default 'one'), train_pct (default 0.8), and num (default 1). feature_list subsets the dataframe to only the features of interest; train_pct is the proportion of the data assigned to the training set; and num is the number of random angles per tube/flaw pair, for the methods that select random angles. method takes one of the strings 'one', 'two', 'three', 'four', 'five', or 'six', each of which selects a pre-coded combination of the above options:



##### 'one' has no random angle selection, no dropped random tube alias, and uses sklearn's train/test split method.

##### 'two' is the same as 'one', but includes the random angle functionality.

##### 'three' is the same as 'two', but also drops a random tube alias.

##### 'four' has no random angle selection, no dropped random tube alias, and uses the custom train/test split method.

##### 'five' is the same as 'four', but includes the random angle functionality.

##### 'six' is the same as 'five', but also drops a random tube alias.
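The six method strings can be summarised as a lookup table of option flags. The table below is only a restatement of the descriptions above; the names `METHOD_OPTIONS` and `describe_method` and the flag names are hypothetical and do not come from the module.

```python
# Each method string maps to a combination of the three pipeline options.
METHOD_OPTIONS = {
    "one":   {"random_angles": False, "drop_random_tube": False, "splitter": "sklearn"},
    "two":   {"random_angles": True,  "drop_random_tube": False, "splitter": "sklearn"},
    "three": {"random_angles": True,  "drop_random_tube": True,  "splitter": "sklearn"},
    "four":  {"random_angles": False, "drop_random_tube": False, "splitter": "custom"},
    "five":  {"random_angles": True,  "drop_random_tube": False, "splitter": "custom"},
    "six":   {"random_angles": True,  "drop_random_tube": True,  "splitter": "custom"},
}

def describe_method(method="one"):
    """Return the option flags for a method string, or raise on an unknown one."""
    if method not in METHOD_OPTIONS:
        raise ValueError(f"method must be one of {sorted(METHOD_OPTIONS)}, got {method!r}")
    return METHOD_OPTIONS[method]

print(describe_method("five"))
```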

##### The call below uses method 'five' with 2 random angles per tube/flaw.

In [18]:
X_train, X_test, y_train, y_test, train_df, test_df = mp.get_scaled_training_test_data(df, method='five', num=2)

##### train_df and test_df are also output as dataframes so you can check what's actually going on. Note that these are not scaled or appropriate for modeling.

In [19]:
train_df.sort_values(['Tube_Alias', 'Flaw_ID', 'Angle']).head()

Unnamed: 0,Tube_Alias,Flaw_ID,Angle,Amp_1,Amp_2,Amp_3,Amp_4,Amp_5,Amp_6,Amp_7,...,Phase_12,Phase_13,Phase_14,Phase_15,Phase_16,Phase_17,Phase_18,Phase_19,Phase_20,Flaw_Depth
188,AP01,A,40,7.913722,10.29701,10.453795,10.156999,15.417897,17.797726,26.734713,...,-0.849391,1.286724,0.671175,0.372874,-0.057453,-0.44083,-0.2,0.514553,0.194392,0.076
161,AP01,A,190,23.663649,18.49238,33.856683,19.007084,51.872652,17.542679,82.153486,...,-0.929686,1.274152,0.899454,0.440365,0.025547,-0.57336,-0.797294,0.795442,-0.167126,0.076
168,AP01,C,180,82.17803,73.852387,124.424825,91.52822,186.500312,127.050519,295.666356,...,-1.124231,1.256743,0.81779,0.375358,-0.013108,-0.586082,-0.888007,0.718511,0.43148,0.229
35,AP01,C,210,68.941938,57.190145,103.44097,84.423248,155.080616,21.485233,242.502381,...,-1.107806,1.250887,0.798592,0.367751,-0.020435,-0.594297,-0.920926,0.766144,0.423165,0.229
127,AP01,E,210,122.83218,88.209231,184.15058,128.420844,281.784246,17.981392,451.95458,...,-1.20604,1.123332,0.674553,0.255989,-0.189128,-0.648784,-0.984664,0.683224,0.383337,0.368


##### You might see that some of the tube/flaw pairs have only one example, when we would expect them to have two. This is the result of the evenly_distribute function, which finds the tube/flaw pair with the lowest number of examples and then randomly selects that same number of examples from the other tube/flaw pairs, so that model training will not be biased towards a particular flaw size.

In [20]:
df['Flaw_ID'].value_counts()

H    684
F    684
E    684
D    684
C    684
G    684
B    679
I    648
A    452
Name: Flaw_ID, dtype: int64

##### Above we can see that flaw A has the fewest examples.

In [21]:
evenly_distributed_data = mp.evenly_distribute(df)

In [22]:
evenly_distributed_data['Flaw_ID'].value_counts()

A    452
E    452
B    452
D    452
I    452
H    452
C    452
F    452
G    452
Name: Flaw_ID, dtype: int64

##### After running the evenly_distribute function, we now have an equal number of examples for each flaw type.
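This kind of balancing is a form of random downsampling: every group is cut down to the size of the smallest group. The sketch below groups by Flaw_ID, because that is the level at which the counts above are balanced. The function name, the `group_cols` parameter, and the `seed` parameter are hypothetical, and the module's own grouping may differ.

```python
import pandas as pd

def evenly_distribute_sketch(df, group_cols=("Flaw_ID",), seed=None):
    """Randomly downsample every group to the size of the smallest group."""
    group_cols = list(group_cols)
    min_count = int(df.groupby(group_cols).size().min())
    # Shuffle the rows, then keep the first `min_count` rows seen in each group.
    shuffled = df.sample(frac=1, random_state=seed)
    return shuffled.groupby(group_cols).head(min_count).sort_index()

toy = pd.DataFrame({
    "Flaw_ID": ["A", "A", "B", "B", "B", "C", "C", "C", "C"],
    "Angle": [0, 10, 0, 10, 20, 0, 10, 20, 30],
})
balanced = evenly_distribute_sketch(toy, seed=0)
print(balanced["Flaw_ID"].value_counts())
```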