### Step 4: Data Wrangling

The raw output from the coregistration step cannot be passed into a machine learning model directly: it has no labels and it has too many features.

The label is what we will try to predict with the machine learning models. In our case, the label is a True or False assertion: each input data row either is a mineral deposit or is not. Given a data row, the machine learning model will tell us whether the location is a mineral deposit. This is the ultimate goal of this machine learning workflow.

A feature is a column of the input data. The coregistration output contains many features, and it is not wise to use too many of them in the machine learning analysis because of the [curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality). More features mean more dimensions: with 20 features/columns, the machine learning model has to do its analysis in a 20-dimensional space. Imagine being thrown into a 20-dimensional space yourself; finding a way out would not be a happy experience. Yes, some machine learning models are smart enough to reduce the number of dimensions on their own. But at the current point of human history, humans are still a little bit smarter than computers. So, let's help the computers by reducing the number of features.
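
To put a rough number on this, here is a back-of-the-envelope sketch. It assumes each feature is split into 10 bins, which is an arbitrary choice made only for illustration, and asks what fraction of the resulting grid cells 272 data rows (the size of the positive data set in this test case) could occupy at best.

```python
# Back-of-the-envelope illustration of the curse of dimensionality.
# Assumption (not part of the workflow): each feature is split into 10 bins.

def grid_cells(n_features: int, bins_per_feature: int = 10) -> int:
    """Number of cells in a regular grid spanning n_features dimensions."""
    return bins_per_feature ** n_features


def max_coverage(n_rows: int, n_features: int, bins_per_feature: int = 10) -> float:
    """Upper bound on the fraction of grid cells that n_rows can occupy."""
    return min(1.0, n_rows / grid_cells(n_features, bins_per_feature))


n_rows = 272  # rows in the positive coregistration data of this test case
for n_features in (2, 5, 20):
    cells = grid_cells(n_features)
    coverage = max_coverage(n_rows, n_features)
    print(f"{n_features:>2} features: {cells:.0e} cells, coverage <= {coverage:.2e}")
```

With 2 features the rows could in principle touch every cell. With 20 features they can reach only a vanishing fraction of the space, so the model has very little data to learn from in most regions.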

We will create a CSV file in which the last column is the label (0/False or 1/True) and the other columns are features.
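
As a minimal sketch of that layout, the snippet below builds a tiny table and writes it as CSV. The values are toy numbers, the two feature names are borrowed from the coregistration output columns, and the file is written to an in-memory buffer instead of disk so the example has no side effects.

```python
import io

import pandas as pd

# Toy rows for illustration only; real values come from the coregistration output.
table = pd.DataFrame({
    "conv_rate": [5.4, 3.4, 4.9],
    "dist_nearest_edge": [7.25, 1.68, 7.27],
    "label": [1, 1, 0],  # the label goes last: 1 = deposit, 0 = not a deposit
})

buffer = io.StringIO()
table.to_csv(buffer, index=False)  # index=False keeps the row index out of the file
csv_text = buffer.getvalue()
print(csv_text)
```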

Feature selection depends heavily on the specific research question. We need to identify the features that are most important to the formation of mineral deposits. For example, some researchers might think the distance along the trench is important. Others might think the sea floor age is of great significance. Come up with your own hypothesis, wrangle the data accordingly, and then send the data into a machine learning model to be evaluated. Repeat this process until you find the most important features. This process is similar to a psychic searching for the perfect crystal ball to start a fortune-telling business. The only difference is that the psychic is doing magic, but we are doing science.
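
The hypothesis loop described above can be sketched roughly as follows. Everything here is an assumption made for illustration: `toy_score` is a hypothetical stand-in for the real model evaluation in Step 5, the hypothesis names are made up, and the data are toy values. Only the shape of the loop is the point.

```python
import pandas as pd


def toy_score(table: pd.DataFrame) -> float:
    """Placeholder score (NOT the Step 5 model): the largest gap between the
    class means of any feature, measured in units of that feature's spread."""
    features = table.drop(columns="label")
    positives = features[table["label"]]
    negatives = features[~table["label"]]
    spread = features.std(ddof=0).replace(0, 1.0)  # avoid dividing by zero
    gap = (positives.mean() - negatives.mean()).abs() / spread
    return float(gap.max())


def rank_hypotheses(table, hypotheses, score=toy_score):
    """Score each candidate feature set and return them best first."""
    results = {name: score(table[cols + ["label"]]) for name, cols in hypotheses.items()}
    return sorted(results.items(), key=lambda item: item[1], reverse=True)


# Toy data; the column names are taken from the coregistration output.
toy = pd.DataFrame({
    "dist_nearest_edge": [1.0, 2.0, 3.0, 4.0],
    "conv_rate": [5.0, 5.1, 1.0, 1.2],
    "label": [True, True, False, False],
})
hypotheses = {
    "position along the trench matters": ["dist_nearest_edge"],
    "convergence rate matters": ["conv_rate"],
}
ranking = rank_hypotheses(toy, hypotheses)
for name, value in ranking:
    print(f"{value:.2f}  {name}")
```

In the real workflow the scoring function would be replaced by training and evaluating the machine learning model on each candidate feature set.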

The following code cells first inspect the coregistration output, then select features from it and create the CSV files for the machine learning analysis in Step 5.

In [1]:
import numpy as np
import pandas as pd
from parameters_n1 import parameters 
import Utils_c1 as Utils

#load data 
coreg_input_data = pd.read_csv('test-case-clennett/coreg_input/02_NA_Clennett_Positives_PlateID.csv')
coreg_output_data = pd.read_csv('test-case-clennett/coreg_output/02_NA_Clennett_Positives_PlateID.csv')
print('The shape of coregistration input data is: ', coreg_input_data.shape)
print('The shape of coregistration output data is: ', coreg_output_data.shape)

if coreg_input_data.shape[0] == coreg_output_data.shape[0]:
    print('Good! The input and output data has the same length ', coreg_output_data.shape[0])

print()
print('the coregistration input data')
display(coreg_input_data)
print('the coregistration output data')
display(coreg_output_data)
print('the columns in coregistration output data are: ')
Utils.print_columns()

The shape of coregistration input data is:  (272, 5)
The shape of coregistration output data is:  (272, 28)
Good! The input and output data has the same length  272

the coregistration input data


Unnamed: 0,index,lon,lat,age,plate_id
0,0,-157.24,57.05,0,16112
1,1,-158.40,56.52,4,16112
2,2,-157.24,57.05,5,16112
3,3,-127.85,50.68,5,16110
4,4,-127.58,50.33,5,16110
...,...,...,...,...,...
267,267,-127.39,50.59,167,16110
268,268,-127.46,50.59,167,16110
269,269,-127.86,50.67,168,16110
270,270,-120.00,49.17,168,16102


the coregistration output data


Unnamed: 0,lon,lat,age,plate_id,recon_lon,recon_lat,distance,sub_idx,trench_lon,trench_lat,...,dist_nearest_edge,dist_from_start,conv_ortho,conv_paral,trench_abs_ortho,trench_abs_paral,subducting_abs_rate,subducting_abs_angle,subducting_abs_ortho,subducting_abs_paral
0,-157.24,57.05,0.0,16112.0,-157.240,57.050,0.034,255.0,-154.67,55.75,...,7.25,7.25,5.40,1.10,-0.72,-2.00,-4.77,-10.90,4.69,-0.90
1,-158.40,56.52,4.0,16112.0,-157.842,57.224,0.033,166.0,-155.99,55.66,...,8.60,8.60,5.56,0.20,-1.05,-1.85,-4.81,-20.12,4.51,-1.65
2,-157.24,57.05,5.0,16112.0,-156.497,57.921,0.034,235.0,-153.90,56.61,...,7.25,7.25,5.27,1.21,-0.69,-2.05,-4.67,-10.38,4.59,-0.84
3,-127.85,50.68,5.0,16110.0,-126.765,51.250,0.018,291.0,-127.88,50.52,...,1.68,11.38,3.38,1.83,-1.15,-0.82,-2.45,23.93,2.24,0.99
4,-127.58,50.33,5.0,16110.0,-126.504,50.896,0.016,293.0,-127.57,50.32,...,1.96,11.09,3.60,1.43,-1.26,-0.69,-2.45,17.34,2.34,0.73
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
267,-127.39,50.59,167.0,16110.0,-54.951,37.991,0.128,1470.0,-63.36,41.44,...,7.27,7.27,4.95,-4.05,-1.52,2.33,-3.83,-26.68,3.42,-1.72
268,-127.46,50.59,167.0,16110.0,-54.994,38.019,0.127,1470.0,-63.36,41.44,...,7.27,7.27,4.95,-4.05,-1.52,2.33,-3.83,-26.68,3.42,-1.72
269,-127.86,50.67,168.0,16110.0,-54.800,38.273,0.129,1468.0,-63.63,41.29,...,7.54,7.54,4.96,-4.02,-1.52,2.33,-3.83,-26.32,3.43,-1.70
270,-120.00,49.17,168.0,16102.0,-51.241,31.511,,,,,...,,,,,,,,,,


the columns in coregistration output data are: 
* 0 reconstructed mineral deposits longitude
* 1 reconstructed mineral deposits latitude
* 2 distance to the nearest trench point
* 3 the index of trench point
* 4 trench point longitude
* 5 trench point latitude
* 6 subducting convergence (relative to trench) velocity magnitude (in cm/yr)
* 7 subducting convergence velocity obliquity angle (angle between trench normal vector and convergence velocity vector)
* 8 trench absolute (relative to anchor plate) velocity magnitude (in cm/yr)
* 9 trench absolute velocity obliquity angle (angle between trench normal vector and trench absolute velocity vector)
* 10 length of arc segment (in degrees) that current point is on
* 11 trench normal azimuth angle (clockwise starting at North, ie, 0 to 360 degrees) at current point
* 12 subducting plate ID
* 13 trench plate ID
* 14 distance (in degrees) along the trench line to the nearest trench edge
* 15 the distance (in degrees) along the trench line fro
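
The last displayed row of the output table (row 270) has no values in its trench columns, so some rows are incomplete. A quick way to count such rows before selecting features is sketched below on toy data. With the real table, `coreg_output_data` from the cell above would take the place of `toy`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the coregistration output; the last row mimics row 270
# above, whose trench-related columns are empty.
toy = pd.DataFrame({
    "recon_lon": [-157.240, -126.765, -51.241],
    "distance": [0.034, 0.018, np.nan],
    "conv_ortho": [5.40, 3.38, np.nan],
})

incomplete = toy.isna().any(axis=1)  # True where a row has at least one NaN
n_incomplete = int(incomplete.sum())
print(f"{n_incomplete} of {len(toy)} rows have missing values")
print(toy[incomplete])
```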

##### Having looked at the data, let's start selecting features and adding labels.

In [2]:
import numpy as np
import pandas as pd
from parameters_n1 import parameters 
import Utils_c1 as Utils

import os

coreg_out_dir = Utils.get_coreg_output_dir()
positive_data = pd.read_csv(coreg_out_dir + '/02_NA_Clennett_Positives_PlateID.csv')
negative_data = pd.read_csv(coreg_out_dir + '/02_NA_Clennett_Negatives_1_PlateID.csv')
candidates_data = pd.read_csv(coreg_out_dir + '/deposit_candidates.csv')

print(positive_data.columns)

feature_names = parameters['feature_names']

positive_features = positive_data[feature_names].dropna()
negative_features = negative_data[feature_names].dropna()
candidates_features = candidates_data[feature_names].dropna()

positive_features['label']=True
negative_features['label']=False

#save the data
positive_features.to_csv(Utils.get_ml_input_dir() + 'positive.csv', index=False)
negative_features.to_csv(Utils.get_ml_input_dir() + 'negative_c1.csv', index=False)
candidates_features.to_csv(Utils.get_ml_input_dir() + 'candidates.csv', index=False)

# dropna() keeps the original row labels, so select the matching full rows by label with .loc
positive_data.loc[positive_features.index].to_csv(Utils.get_ml_input_dir() + 'positive_all_columns.csv', index=False)
negative_data.loc[negative_features.index].to_csv(Utils.get_ml_input_dir() + 'negative_c1_all_columns.csv', index=False)
candidates_data.loc[candidates_features.index].to_csv(Utils.get_ml_input_dir() + 'candidates_all_columns.csv', index=False)

import glob
files = glob.glob(Utils.get_ml_input_dir() + '*')
print('\ngenerated files:')
for f in files:
    print(f)


Index(['lon', 'lat', 'age', 'plate_id', 'recon_lon', 'recon_lat', 'distance',
       'sub_idx', 'trench_lon', 'trench_lat', 'conv_rate', 'conv_angle',
       'trench_abs_rate', 'trench_abs_angle', 'arc_len', 'trench_norm',
       'subducting_pid', 'trench_pid', 'dist_nearest_edge', 'dist_from_start',
       'conv_ortho', 'conv_paral', 'trench_abs_ortho', 'trench_abs_paral',
       'subducting_abs_rate', 'subducting_abs_angle', 'subducting_abs_ortho',
       'subducting_abs_paral'],
      dtype='object')

generated files:
test-case-clennett/ml_input/negative_c1_all_columns.csv
test-case-clennett/ml_input/positive.csv
test-case-clennett/ml_input/candidates_all_columns.csv
test-case-clennett/ml_input/negative_c1.csv
test-case-clennett/ml_input/candidates.csv
test-case-clennett/ml_input/negative_c2.csv
test-case-clennett/ml_input/positive_all_columns.csv
test-case-clennett/ml_input/negative_c2_all_columns.csv
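
The cell above relies on the fact that `dropna()` removes incomplete rows but keeps the original row labels of the rows that survive. The sketch below shows this on toy data; the feature selection is hypothetical. With the default 0..n-1 index that `read_csv` produces, label-based `.loc` and position-based `.iloc` pick the same rows, but `.loc` stays correct if the index is ever anything else.

```python
import numpy as np
import pandas as pd

# Toy table: row 1 is missing one of the selected features.
data = pd.DataFrame({
    "lon": [-157.24, -127.85, -120.00],
    "conv_rate": [5.4, np.nan, 3.6],
    "arc_len": [1.0, 2.0, 3.0],
})
feature_names = ["conv_rate", "arc_len"]  # hypothetical selection

# Keep only complete rows; .copy() makes the result safe to modify.
features = data[feature_names].dropna().copy()
features["label"] = True  # the label becomes the last column

# The surviving row labels identify the same rows in the full table.
all_columns = data.loc[features.index]

print(list(features.index))
print(all_columns)
```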


In [3]:
import numpy as np
import pandas as pd
from parameters_n2 import parameters 
import Utils_c2 as Utils

import os

coreg_out_dir = Utils.get_coreg_output_dir()
positive_data = pd.read_csv(coreg_out_dir + '/02_NA_Clennett_Positives_PlateID.csv')
negative_data = pd.read_csv(coreg_out_dir + '/02_NA_Clennett_Negatives_2_PlateID.csv')
candidates_data = pd.read_csv(coreg_out_dir + '/deposit_candidates.csv')

print(positive_data.columns)

feature_names = parameters['feature_names']

positive_features = positive_data[feature_names].dropna()
negative_features = negative_data[feature_names].dropna()
candidates_features = candidates_data[feature_names].dropna()

positive_features['label']=True
negative_features['label']=False

#save the data
positive_features.to_csv(Utils.get_ml_input_dir() + 'positive.csv', index=False)
negative_features.to_csv(Utils.get_ml_input_dir() + 'negative_c2.csv', index=False)
candidates_features.to_csv(Utils.get_ml_input_dir() + 'candidates.csv', index=False)

# dropna() keeps the original row labels, so select the matching full rows by label with .loc
positive_data.loc[positive_features.index].to_csv(Utils.get_ml_input_dir() + 'positive_all_columns.csv', index=False)
negative_data.loc[negative_features.index].to_csv(Utils.get_ml_input_dir() + 'negative_c2_all_columns.csv', index=False)
candidates_data.loc[candidates_features.index].to_csv(Utils.get_ml_input_dir() + 'candidates_all_columns.csv', index=False)

import glob
files = glob.glob(Utils.get_ml_input_dir() + '*')
print('\ngenerated files:')
for f in files:
    print(f)

Index(['lon', 'lat', 'age', 'plate_id', 'recon_lon', 'recon_lat', 'distance',
       'sub_idx', 'trench_lon', 'trench_lat', 'conv_rate', 'conv_angle',
       'trench_abs_rate', 'trench_abs_angle', 'arc_len', 'trench_norm',
       'subducting_pid', 'trench_pid', 'dist_nearest_edge', 'dist_from_start',
       'conv_ortho', 'conv_paral', 'trench_abs_ortho', 'trench_abs_paral',
       'subducting_abs_rate', 'subducting_abs_angle', 'subducting_abs_ortho',
       'subducting_abs_paral'],
      dtype='object')

generated files:
test-case-clennett/ml_input/negative_c1_all_columns.csv
test-case-clennett/ml_input/positive.csv
test-case-clennett/ml_input/candidates_all_columns.csv
test-case-clennett/ml_input/negative_c1.csv
test-case-clennett/ml_input/candidates.csv
test-case-clennett/ml_input/negative_c2.csv
test-case-clennett/ml_input/positive_all_columns.csv
test-case-clennett/ml_input/negative_c2_all_columns.csv


In [4]:
import numpy as np
import pandas as pd
from parameters_n3 import parameters 
import Utils_c3 as Utils

import os

coreg_out_dir = Utils.get_coreg_output_dir()
positive_data = pd.read_csv(coreg_out_dir + '/02_NA_Clennett_Positives_PlateID.csv')
negative_data = pd.read_csv(coreg_out_dir + '/02_NA_Clennett_Negatives_3_PlateID.csv')
candidates_data = pd.read_csv(coreg_out_dir + '/deposit_candidates.csv')

print(positive_data.columns)

feature_names = parameters['feature_names']

positive_features = positive_data[feature_names].dropna()
negative_features = negative_data[feature_names].dropna()
candidates_features = candidates_data[feature_names].dropna()

positive_features['label']=True
negative_features['label']=False

#save the data
positive_features.to_csv(Utils.get_ml_input_dir() + 'positive.csv', index=False)
negative_features.to_csv(Utils.get_ml_input_dir() + 'negative_c3.csv', index=False)
candidates_features.to_csv(Utils.get_ml_input_dir() + 'candidates.csv', index=False)

# dropna() keeps the original row labels, so select the matching full rows by label with .loc
positive_data.loc[positive_features.index].to_csv(Utils.get_ml_input_dir() + 'positive_all_columns.csv', index=False)
negative_data.loc[negative_features.index].to_csv(Utils.get_ml_input_dir() + 'negative_c3_all_columns.csv', index=False)
candidates_data.loc[candidates_features.index].to_csv(Utils.get_ml_input_dir() + 'candidates_all_columns.csv', index=False)

import glob
files = glob.glob(Utils.get_ml_input_dir() + '*')
print('\ngenerated files:')
for f in files:
    print(f)

Index(['lon', 'lat', 'age', 'plate_id', 'recon_lon', 'recon_lat', 'distance',
       'sub_idx', 'trench_lon', 'trench_lat', 'conv_rate', 'conv_angle',
       'trench_abs_rate', 'trench_abs_angle', 'arc_len', 'trench_norm',
       'subducting_pid', 'trench_pid', 'dist_nearest_edge', 'dist_from_start',
       'conv_ortho', 'conv_paral', 'trench_abs_ortho', 'trench_abs_paral',
       'subducting_abs_rate', 'subducting_abs_angle', 'subducting_abs_ortho',
       'subducting_abs_paral'],
      dtype='object')

generated files:
test-case-clennett/ml_input/negative_c1_all_columns.csv
test-case-clennett/ml_input/positive.csv
test-case-clennett/ml_input/candidates_all_columns.csv
test-case-clennett/ml_input/negative_c1.csv
test-case-clennett/ml_input/candidates.csv
test-case-clennett/ml_input/negative_c2.csv
test-case-clennett/ml_input/positive_all_columns.csv
test-case-clennett/ml_input/negative_c3.csv
test-case-clennett/ml_input/negative_c2_all_columns.csv
test-case-clennett/ml_input/negative_

In [5]:
import numpy as np
import pandas as pd
from parameters_n4 import parameters 
import Utils_c4 as Utils

import os

coreg_out_dir = Utils.get_coreg_output_dir()
positive_data = pd.read_csv(coreg_out_dir + '/02_NA_Clennett_Positives_PlateID.csv')
negative_data = pd.read_csv(coreg_out_dir + '/02_NA_Clennett_Negatives_4_PlateID.csv')
candidates_data = pd.read_csv(coreg_out_dir + '/deposit_candidates.csv')

print(positive_data.columns)

feature_names = parameters['feature_names']

positive_features = positive_data[feature_names].dropna()
negative_features = negative_data[feature_names].dropna()
candidates_features = candidates_data[feature_names].dropna()

positive_features['label']=True
negative_features['label']=False

#save the data
positive_features.to_csv(Utils.get_ml_input_dir() + 'positive.csv', index=False)
negative_features.to_csv(Utils.get_ml_input_dir() + 'negative_c4.csv', index=False)
candidates_features.to_csv(Utils.get_ml_input_dir() + 'candidates.csv', index=False)

# dropna() keeps the original row labels, so select the matching full rows by label with .loc
positive_data.loc[positive_features.index].to_csv(Utils.get_ml_input_dir() + 'positive_all_columns.csv', index=False)
negative_data.loc[negative_features.index].to_csv(Utils.get_ml_input_dir() + 'negative_c4_all_columns.csv', index=False)
candidates_data.loc[candidates_features.index].to_csv(Utils.get_ml_input_dir() + 'candidates_all_columns.csv', index=False)

import glob
files = glob.glob(Utils.get_ml_input_dir() + '*')
print('\ngenerated files:')
for f in files:
    print(f)

Index(['lon', 'lat', 'age', 'plate_id', 'recon_lon', 'recon_lat', 'distance',
       'sub_idx', 'trench_lon', 'trench_lat', 'conv_rate', 'conv_angle',
       'trench_abs_rate', 'trench_abs_angle', 'arc_len', 'trench_norm',
       'subducting_pid', 'trench_pid', 'dist_nearest_edge', 'dist_from_start',
       'conv_ortho', 'conv_paral', 'trench_abs_ortho', 'trench_abs_paral',
       'subducting_abs_rate', 'subducting_abs_angle', 'subducting_abs_ortho',
       'subducting_abs_paral'],
      dtype='object')

generated files:
test-case-clennett/ml_input/negative_c1_all_columns.csv
test-case-clennett/ml_input/positive.csv
test-case-clennett/ml_input/candidates_all_columns.csv
test-case-clennett/ml_input/negative_c1.csv
test-case-clennett/ml_input/candidates.csv
test-case-clennett/ml_input/negative_c2.csv
test-case-clennett/ml_input/positive_all_columns.csv
test-case-clennett/ml_input/negative_c3.csv
test-case-clennett/ml_input/negative_c2_all_columns.csv
test-case-clennett/ml_input/negative_

In [7]:
import numpy as np
import pandas as pd
from parameters_n5 import parameters 
import Utils_c5 as Utils

import os

coreg_out_dir = Utils.get_coreg_output_dir()
positive_data = pd.read_csv(coreg_out_dir + '/02_NA_Clennett_Positives_PlateID.csv')
negative_data = pd.read_csv(coreg_out_dir + '/02_NA_Clennett_Negatives_5_PlateID.csv')
candidates_data = pd.read_csv(coreg_out_dir + '/deposit_candidates.csv')

print(positive_data.columns)

feature_names = parameters['feature_names']

positive_features = positive_data[feature_names].dropna()
negative_features = negative_data[feature_names].dropna()
candidates_features = candidates_data[feature_names].dropna()

positive_features['label']=True
negative_features['label']=False

#save the data
positive_features.to_csv(Utils.get_ml_input_dir() + 'positive.csv', index=False)
negative_features.to_csv(Utils.get_ml_input_dir() + 'negative_c5.csv', index=False)
candidates_features.to_csv(Utils.get_ml_input_dir() + 'candidates.csv', index=False)

# dropna() keeps the original row labels, so select the matching full rows by label with .loc
positive_data.loc[positive_features.index].to_csv(Utils.get_ml_input_dir() + 'positive_all_columns.csv', index=False)
negative_data.loc[negative_features.index].to_csv(Utils.get_ml_input_dir() + 'negative_c5_all_columns.csv', index=False)
candidates_data.loc[candidates_features.index].to_csv(Utils.get_ml_input_dir() + 'candidates_all_columns.csv', index=False)

import glob
files = glob.glob(Utils.get_ml_input_dir() + '*')
print('\ngenerated files:')
for f in files:
    print(f)

Index(['lon', 'lat', 'age', 'plate_id', 'recon_lon', 'recon_lat', 'distance',
       'sub_idx', 'trench_lon', 'trench_lat', 'conv_rate', 'conv_angle',
       'trench_abs_rate', 'trench_abs_angle', 'arc_len', 'trench_norm',
       'subducting_pid', 'trench_pid', 'dist_nearest_edge', 'dist_from_start',
       'conv_ortho', 'conv_paral', 'trench_abs_ortho', 'trench_abs_paral',
       'subducting_abs_rate', 'subducting_abs_angle', 'subducting_abs_ortho',
       'subducting_abs_paral'],
      dtype='object')

generated files:
test-case-clennett/ml_input/negative_c1_all_columns.csv
test-case-clennett/ml_input/positive.csv
test-case-clennett/ml_input/candidates_all_columns.csv
test-case-clennett/ml_input/negative_c1.csv
test-case-clennett/ml_input/negative_c_all_columns.csv
test-case-clennett/ml_input/candidates.csv
test-case-clennett/ml_input/negative_c2.csv
test-case-clennett/ml_input/positive_all_columns.csv
test-case-clennett/ml_input/negative_c5_all_columns.csv
test-case-clennett/ml_inpu
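
The five cells above differ only in the case number (`parameters_n1` to `parameters_n5`, `Utils_c1` to `Utils_c5`, and the `Negatives_1` to `Negatives_5` files). One possible way to avoid the repetition is to move the shared logic into a function and loop over the cases. The sketch below is an assumption about how that could look, not part of the original workflow: `select_and_label` is a hypothetical helper, and small toy tables stand in for the CSV files.

```python
import numpy as np
import pandas as pd


def select_and_label(data, feature_names, label=None):
    """Return (features, full_rows) for the rows whose selected features are complete.

    features  : the selected columns, plus a trailing 'label' column if label is given
    full_rows : the same rows with all original columns
    """
    features = data[feature_names].dropna().copy()
    if label is not None:
        features["label"] = label
    return features, data.loc[features.index]


# Toy stand-ins for the positive file and for two of the negative files.
positive_data = pd.DataFrame(
    {"conv_rate": [5.4, np.nan], "arc_len": [1.0, 2.0], "age": [0, 4]}
)
negative_sets = {
    1: pd.DataFrame({"conv_rate": [2.1, 2.2], "arc_len": [0.5, np.nan], "age": [7, 9]}),
    2: pd.DataFrame({"conv_rate": [1.9, 2.5], "arc_len": [0.4, 0.6], "age": [3, 8]}),
}
feature_names = ["conv_rate", "arc_len"]  # hypothetical selection

positive_features, positive_full = select_and_label(positive_data, feature_names, label=True)

outputs = {}
for case, negative_data in negative_sets.items():
    negative_features, negative_full = select_and_label(negative_data, feature_names, label=False)
    # In the notebook, each table would be saved with to_csv(..., index=False).
    outputs[f"negative_c{case}.csv"] = negative_features
    outputs[f"negative_c{case}_all_columns.csv"] = negative_full

for name, table in outputs.items():
    print(name, table.shape)
```

Keeping the selection logic in one place also means a fix only has to be made once rather than in every cell.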