# Data Processing

The client provided the data as an Excel workbook, `data/main.xlsx`. It contains five sheets (seawater, copper, cadmium, lead, and mix), but we did not work with mix because we wanted to focus on building a simple model first.

Each row of each sheet is one water sample, together with the readings recorded when voltage was applied to it.

My first goal was to get everything into formats I am more familiar with: CSV files and pandas dataframes.

## The Process

1. Extract the sheets from the Excel file into dataframes
2. Drop unnecessary columns
3. Create unique, descriptive column names for each sample (including metal, concentration and sample number)
4. Transform the dataframes from wide-form to long-form
5. Create a voltage column
6. Reset index to be unique
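
The real implementation is not reproduced in this notebook, so the following is only a minimal sketch of how steps 2 to 6 could look. The function name `wide_to_long_sketch`, the `concentration` and `notes` columns, and the `voltages` argument are all assumptions made for illustration; they are not taken from the project's code or from the client's spreadsheet.

```python
import numpy as np
import pandas as pd


def wide_to_long_sketch(wide_df, metal, voltages, drop_cols=()):
    """Hypothetical stand-in for transform_to_longform_df (not the project's code)."""
    # Step 2: drop columns that are not needed
    df = wide_df.drop(columns=list(drop_cols))
    # Step 3: unique, descriptive name per sample, e.g. Cu_500_ppb_0
    # (assumes a 'concentration' column exists; this is an illustrative assumption)
    names = [f"{metal}_{conc}_ppb_{i}" for i, conc in enumerate(df["concentration"])]
    readings = df.drop(columns=["concentration"])
    # Step 4: transpose so each sample becomes a column and each row a voltage step
    long_df = readings.T
    long_df.columns = names
    # Step 5: add the voltage column as the first column
    long_df.insert(0, "voltage", np.asarray(voltages))
    # Step 6: replace the old column labels in the index with a unique integer index
    return long_df.reset_index(drop=True)


# Toy input: two samples (rows), three readings each
toy = pd.DataFrame({
    "notes": ["a", "b"],
    "concentration": [500, 1000],
    "r0": [1.0, 2.0],
    "r1": [1.5, 2.5],
    "r2": [2.0, 3.0],
})
long_toy = wide_to_long_sketch(toy, "Cu", voltages=[0.0, 0.5, 1.0], drop_cols=["notes"])
```

On the toy input, the result has one row per voltage step and one column per sample, which is the layout the saved CSV files use.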

The code for these steps is imported below from `scripts.data_processing`.

In [1]:
import numpy as np
import pandas as pd
from pathlib import Path

from scripts.data_processing import transform_to_longform_df

DATA_DIR = Path('data')

In [2]:
# Read in the sheets we want from the Excel file
sheet_names = ['Seawater - No Heavy Metals', 'Copper', 'Cadmium', 'Lead']
xcel = pd.read_excel(DATA_DIR / 'main.xlsx', sheet_name=sheet_names)

# Create dataframes for each class
seawater = xcel['Seawater - No Heavy Metals']
copper = xcel['Copper']
cadmium = xcel['Cadmium']
lead = xcel['Lead']

In [3]:
copper_longform = transform_to_longform_df(copper)
cadmium_longform = transform_to_longform_df(cadmium)
lead_longform = transform_to_longform_df(lead)

In [11]:
# Create label to integer mapping - Torch needs labels as ints
from sklearn.preprocessing import LabelEncoder
labels_str = ['Cu', 'Cd', 'Pb', 'Sw']
label_enc = LabelEncoder()
labels_int = label_enc.fit_transform(labels_str)
label_to_int_mapping = dict(zip(labels_str, labels_int))
label_to_int_mapping

{'Cu': 1, 'Cd': 0, 'Pb': 2, 'Sw': 3}
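
The mapping is not in the order the labels were listed because `LabelEncoder` assigns integers in the sorted order of the class labels. A dependency-free sketch that reproduces the same mapping, and builds the reverse lookup for decoding predictions later, might look like this (the names `label_to_int` and `int_to_label` are illustrative only):

```python
labels_str = ['Cu', 'Cd', 'Pb', 'Sw']

# Integers follow the sorted label order: Cd, Cu, Pb, Sw
classes_sorted = sorted(set(labels_str))
label_to_int = {label: classes_sorted.index(label) for label in labels_str}

# Reverse lookup, useful for turning integer predictions back into metal names
int_to_label = {i: label for label, i in label_to_int.items()}
```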

In [13]:
# All dataframes hold sequences of the same length, so the final row
# index of one dataframe can be reused for all of them
final_row_index = copper_longform.index[-1]
# Append a row containing the int label
# (the scalar is written to every column, including voltage)
copper_longform.loc[final_row_index+1] = label_to_int_mapping['Cu']
cadmium_longform.loc[final_row_index+1] = label_to_int_mapping['Cd']
lead_longform.loc[final_row_index+1] = label_to_int_mapping['Pb']
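
Because the label sits in the last row, anything that consumes these dataframes has to separate it from the readings again. A minimal sketch of that split on a toy frame in the same layout (the values and the variable names `labels` and `readings` are made up for illustration):

```python
import pandas as pd

# Toy frame: a voltage column plus one column per sample
toy = pd.DataFrame({
    "voltage": [0.0, 0.5, 1.0],
    "Cu_500_ppb_0": [3.1, 3.4, 3.7],
    "Cu_500_ppb_1": [4.0, 4.2, 4.6],
})
# Append the label row the same way as above; every column receives the value
toy.loc[toy.index[-1] + 1] = 1

# The last row holds the labels; ignore the voltage column's copy of it
sample_cols = toy.columns.drop("voltage")
labels = toy.iloc[-1][sample_cols].astype(int)
# Everything except the last row is the actual readings
readings = toy.iloc[:-1]
```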

In [17]:
copper_longform.tail()

Unnamed: 0,voltage,Cu_500_ppb_0,Cu_500_ppb_1,Cu_500_ppb_2,Cu_500_ppb_3,Cu_500_ppb_4,Cu_500_ppb_5,Cu_500_ppb_6,Cu_500_ppb_7,Cu_1000_ppb_8,...,Cu_1000_ppb_13,Cu_1000_ppb_14,Cu_2000_ppb_15,Cu_2000_ppb_16,Cu_2000_ppb_17,Cu_2000_ppb_18,Cu_2000_ppb_19,Cu_2000_ppb_20,Cu_3000_ppb_21,Cu_3000_ppb_22
998,0.988,9.367663,6.4519,3.846325,4.280588,3.163913,3.412063,3.039838,3.412063,5.831525,...,5.335225,5.645413,7.816725,9.243588,6.638013,7.816725,6.265788,6.575975,4.4667,10.112113
999,0.992,9.491738,6.70005,4.094475,4.404663,3.22595,3.4741,3.101875,3.4741,6.017638,...,5.397263,5.769488,8.126913,9.553775,6.886163,8.064875,6.575975,6.762088,4.590775,10.4223
1000,0.996,9.615813,6.70005,4.156513,4.404663,3.287988,3.598175,3.22595,3.598175,6.079675,...,5.583375,5.9556,8.375063,9.988038,7.134313,8.4371,6.762088,6.9482,4.776888,10.794525
1001,1.0,9.739888,6.886163,4.280588,4.590775,3.412063,3.660213,3.350025,3.660213,6.327825,...,5.70745,6.141713,8.68525,10.298225,7.320425,8.68525,7.072275,7.134313,4.838925,11.16675
1002,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [18]:
# Save metals as csv files
copper_longform.to_csv(DATA_DIR / 'copper.csv')
cadmium_longform.to_csv(DATA_DIR / 'cadmium.csv')
lead_longform.to_csv(DATA_DIR / 'lead.csv')
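
Since `to_csv` writes the index as an unnamed first column by default, reading these files back without `index_col=0` produces an extra `Unnamed: 0` column. A small round-trip sketch on a toy frame, using a temporary directory rather than the project's `data` folder:

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({"voltage": [0.0, 0.5], "Cu_500_ppb_0": [3.1, 3.4]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy.csv"
    # The index is written as a first column with an empty header
    toy.to_csv(path)
    # Default read: the saved index comes back as a data column
    naive = pd.read_csv(path)
    # Telling pandas that column 0 is the index restores the original layout
    restored = pd.read_csv(path, index_col=0)
```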

In [19]:
# seawater_long col names are seawater_SW0_n (for some n)
# Let's keep everything after the first underscore
seawater_long = transform_to_longform_df(seawater)
example_col = seawater_long.columns[1]
underscore_position = example_col.find('_')
# Don't modify the voltage column
short_col_names = [col[underscore_position+1:] if col.startswith('s') else col
                   for col in seawater_long.columns]
# Rename cols
seawater_long.columns = short_col_names
# Add final row containing int label
seawater_long.loc[final_row_index+1] = label_to_int_mapping['Sw']
seawater_long.to_csv(DATA_DIR / 'seawater.csv')

Now that the data is in a form that is easier to work with, let's start exploring!