# Data Processing

The client provided the data in an Excel workbook, `data/main.xlsx`. It has five sheets (seawater, copper, cadmium, lead, and mix), but we didn't work with mix because we wanted to focus on building a simple model first.

Each row of each sheet is one water sample: a few descriptive columns followed by the readings recorded at each voltage applied to it.

My first goal was to get everything into formats I am more familiar with: CSV files and pandas DataFrames.

## The Process

1. Extract the sheets from the Excel file into dataframes
2. Drop unnecessary columns
3. Create unique, descriptive column names for each sample (including metal, concentration and sample number)
4. Transform the dataframes from wide-form to long-form
5. Create a voltage column
6. Reset index to be unique

All of these steps live in the `data.py` file (TODO: rename it to `data_processing.py` or something similar).
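To make the steps above concrete, here is a minimal sketch of how steps 2 to 6 might look on a toy frame shaped like one sheet. It is not the code in `data.py`: `to_longform_sketch`, `header_to_voltage` and `META_COLS` are hypothetical names, and the toy values are made up. It assumes that repeated voltage headers arrive with a `.1` suffix (as in the copper sheet shown later), which it strips to recover the voltage.

```python
import pandas as pd

# Hypothetical list of non-reading columns, based on the copper sheet in this notebook.
META_COLS = ["Name", "Analyte", "Concentration", "metal_concentration"]


def header_to_voltage(header) -> float:
    """Turn a column header such as '0.996' or '0.996.1' into a voltage."""
    text = str(header)
    # Assumption: a second dot means a '.1' suffix added to a repeated header.
    # A header like '1.1' is ambiguous (1.1 V or a repeat of 1) and is NOT handled here.
    if text.count(".") == 2:
        text = text.rsplit(".", 1)[0]
    return float(text)


def to_longform_sketch(df: pd.DataFrame) -> pd.DataFrame:
    # Step 3: one unique label per sample (metal, concentration, sample number).
    labels = [
        f"{analyte}_{conc}_{i}"
        for i, (analyte, conc) in enumerate(zip(df["Analyte"], df["Concentration"]))
    ]
    # Step 2: keep only the reading columns.
    readings = df.drop(columns=META_COLS, errors="ignore")
    # Step 4: flip the frame so each sample becomes a column.
    out = readings.T
    out.columns = labels
    # Step 5: the old column headers become the voltage column.
    out.insert(0, "voltage", [header_to_voltage(h) for h in readings.columns])
    # Step 6: replace the header-based index with a unique integer index.
    return out.reset_index(drop=True)


# Toy stand-in for one sheet (step 1, reading the Excel file, is skipped here).
toy = pd.DataFrame({
    "Name": ["Cu 500ppb", "Cu 1000ppb"],
    "Analyte": ["Cu", "Cu"],
    "Concentration": ["500_ppb", "1000_ppb"],
    "0.996": [-3.35, -2.92],
    "0.992": [-3.35, -2.98],
    "0.992.1": [9.49, 6.02],
    "0.996.1": [9.62, 6.08],
})
long_df = to_longform_sketch(toy)
print(long_df)
```

The result has one row per voltage step and one column per sample, which matches the shape of the column index shown further down.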

In [1]:
import numpy as np
import pandas as pd
from pathlib import Path

DATA_DIR = Path('data')

In [2]:
# Read in the sheets we want from the Excel file
sheet_names = ['Seawater - No Heavy Metals', 'Copper', 'Cadmium', 'Lead']
xcel = pd.read_excel(DATA_DIR / 'main.xlsx', sheet_name=sheet_names)

# Create dataframes for each class
seawater = xcel['Seawater - No Heavy Metals']
copper = xcel['Copper']
cadmium = xcel['Cadmium']
lead = xcel['Lead']
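Since the goal was also to have CSV copies, here is one way the dict of sheets could be written out, one file per sheet. This is only a sketch: `save_sheets_as_csv` is a hypothetical helper, the file-naming rule (text before " - ", lowercased) is my assumption, and the toy frames and temporary directory stand in for the real `xcel` dict and `DATA_DIR`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_sheets_as_csv(sheets: dict, out_dir: Path) -> list:
    """Write each DataFrame in `sheets` to `<out_dir>/<short name>.csv`."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for sheet_name, df in sheets.items():
        # Assumed naming rule: "Seawater - No Heavy Metals" -> "seawater"
        stem = sheet_name.split(" - ")[0].strip().lower()
        path = out_dir / f"{stem}.csv"
        df.to_csv(path, index=False)
        written.append(path)
    return written


# Toy stand-in for the dict returned by pd.read_excel(..., sheet_name=[...]).
toy_sheets = {
    "Seawater - No Heavy Metals": pd.DataFrame({"Name": ["SW 1"], "0.996": [-0.43]}),
    "Copper": pd.DataFrame({"Name": ["Cu 500ppb"], "0.996": [-3.35]}),
}

with tempfile.TemporaryDirectory() as tmp:
    paths = save_sheets_as_csv(toy_sheets, Path(tmp))
    file_names = sorted(p.name for p in paths)
    # Read one file back to confirm the round trip keeps the values.
    round_trip = pd.read_csv(Path(tmp) / "copper.csv")

print(file_names)
```

With the real data, the call would presumably be something like `save_sheets_as_csv(xcel, DATA_DIR)`.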

In [4]:
from data_processing import get_longform_df

get_longform_df(copper).columns

Index(['voltage', 'Cu_500_ppb_0', 'Cu_500_ppb_1', 'Cu_500_ppb_2',
       'Cu_500_ppb_3', 'Cu_500_ppb_4', 'Cu_500_ppb_5', 'Cu_500_ppb_6',
       'Cu_500_ppb_7', 'Cu_1000_ppb_8', 'Cu_1000_ppb_9', 'Cu_1000_ppb_10',
       'Cu_1000_ppb_11', 'Cu_1000_ppb_12', 'Cu_1000_ppb_13', 'Cu_1000_ppb_14',
       'Cu_2000_ppb_15', 'Cu_2000_ppb_16', 'Cu_2000_ppb_17', 'Cu_2000_ppb_18',
       'Cu_2000_ppb_19', 'Cu_2000_ppb_20', 'Cu_3000_ppb_21', 'Cu_3000_ppb_22'],
      dtype='object')
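Because each sample column name packs in the metal, concentration and sample number, it can be split back into those parts when needed. The helper below is a hypothetical sketch, not part of `data_processing`; it assumes every label has exactly four underscore-separated fields, as the copper labels above do, and other sheets may need different handling.

```python
from typing import NamedTuple


class SampleLabel(NamedTuple):
    metal: str
    concentration: int
    unit: str
    sample_number: int


def parse_sample_label(label: str) -> SampleLabel:
    """Split a label such as 'Cu_500_ppb_0' into its four fields."""
    # Raises ValueError if the label does not have exactly four fields.
    metal, concentration, unit, sample_number = label.split("_")
    return SampleLabel(metal, int(concentration), unit, int(sample_number))


labels = ["Cu_500_ppb_0", "Cu_1000_ppb_8", "Cu_3000_ppb_22"]
parsed = [parse_sample_label(label) for label in labels]
for item in parsed:
    print(item)
```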

In [23]:
copper

| | Name | Analyte | Concentration | 1 | 0.996 | 0.992 | 0.988 | 0.984 | 0.98 | 0.976 | ... | 0.968.1 | 0.972.1 | 0.976.1 | 0.98.1 | 0.984.1 | 0.988.1 | 0.992.1 | 0.996.1 | 1.1 | metal_concentration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Cu 500ppb | Cu | 500_ppb | -3.22595 | -3.350025 | -3.350025 | -3.350025 | -3.350025 | -3.287988 | -3.350025 | ... | 8.809325 | 8.9334 | 8.995438 | 9.119513 | 9.243588 | 9.367663 | 9.491738 | 9.615813 | 9.739888 | Cu_500 ppb |
| 1 | Cu 500ppb | Cu | 500_ppb | -4.032438 | -4.094475 | -4.094475 | -4.094475 | -4.094475 | -4.094475 | -4.094475 | ... | 5.9556 | 6.079675 | 6.20375 | 6.265788 | 6.389863 | 6.4519 | 6.70005 | 6.70005 | 6.886163 | Cu_500 ppb |
| 2 | NC Cu 500 ppb 2nd day | Cu | 500_ppb | -0.4963 | -0.558338 | -0.558338 | -0.4963 | -0.558338 | -0.4963 | -0.558338 | ... | 3.536138 | 3.536138 | 3.660213 | 3.72225 | 3.846325 | 3.846325 | 4.094475 | 4.156513 | 4.280588 | Cu_500 ppb |
| 3 | NC Cu 500 ppb 2nd day | Cu | 500_ppb | -0.806488 | -0.74445 | -0.74445 | -0.74445 | -0.74445 | -0.682413 | -0.682413 | ... | 3.784288 | 3.846325 | 3.9704 | 4.032438 | 4.156513 | 4.280588 | 4.404663 | 4.404663 | 4.590775 | Cu_500 ppb |
| 4 | Ocean 1 500 ppb | Cu | 500_ppb | -0.434263 | -0.4963 | -0.434263 | -0.434263 | -0.434263 | -0.372225 | -0.434263 | ... | 2.791688 | 2.853725 | 2.853725 | 2.9778 | 3.101875 | 3.163913 | 3.22595 | 3.287988 | 3.412063 | Cu_500 ppb |
| 5 | Ocean 1 500 ppb | Cu | 500_ppb | -0.558338 | -0.558338 | -0.558338 | -0.558338 | -0.4963 | -0.4963 | -0.4963 | ... | 3.039838 | 3.039838 | 3.163913 | 3.22595 | 3.350025 | 3.412063 | 3.4741 | 3.598175 | 3.660213 | Cu_500 ppb |
| 6 | Ocean 17 500 ppb Cu | Cu | 500_ppb | -0.434263 | -0.434263 | -0.434263 | -0.434263 | -0.434263 | -0.372225 | -0.372225 | ... | 2.667613 | 2.72965 | 2.791688 | 2.915763 | 2.9778 | 3.039838 | 3.101875 | 3.22595 | 3.350025 | Cu_500 ppb |
| 7 | Ocean 17 500 ppb Cu | Cu | 500_ppb | -0.558338 | -0.4963 | -0.4963 | -0.4963 | -0.434263 | -0.4963 | -0.4963 | ... | 2.9778 | 3.039838 | 3.101875 | 3.163913 | 3.350025 | 3.412063 | 3.4741 | 3.598175 | 3.660213 | Cu_500 ppb |
| 8 | Cu 1000ppb | Cu | 1000_ppb | -2.9778 | -2.915763 | -2.9778 | -2.915763 | -2.853725 | -2.915763 | -2.915763 | ... | 5.149113 | 5.273188 | 5.4593 | 5.521338 | 5.70745 | 5.831525 | 6.017638 | 6.079675 | 6.327825 | Cu_1000 ppb |
| 9 | Cu 1000ppb | Cu | 1000_ppb | -3.22595 | -3.101875 | -3.163913 | -3.039838 | -2.9778 | -2.915763 | -2.915763 | ... | 4.900963 | 4.963 | 5.149113 | 5.21115 | 5.397263 | 5.521338 | 5.70745 | 5.893563 | 6.017638 | Cu_1000 ppb |
