# Data Processing

The client provided the data in an Excel workbook, `data/main.xlsx`. It has five sheets (seawater, copper, cadmium, lead, and mix), but we didn't work with mix because we wanted to focus on building a simple model first.

Each row of each sheet is a water sample, together with the readings recorded when a voltage was applied to it.

My first goal was to get everything into a format I am more familiar with: CSV files and pandas DataFrames.

## The Process

1. Extract the sheets from the Excel file into dataframes
2. Drop unnecessary columns
3. Create unique, descriptive column names for each sample (including metal, concentration and sample number)
4. Transform the dataframes from wide-form to long-form
5. Create a voltage column
6. Reset index to be unique

It's all in the `scripts/data_processing.py` module (originally `data.py`), which exposes `transform_to_longform_df`.
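
For readers without access to the module, here is a rough sketch of steps 2 to 6 on a toy dataframe. It is not the actual `transform_to_longform_df`: the function name `sketch_longform`, the `Concentration` and `Notes` columns, and the voltage range are all hypothetical, and the real sheet layout may differ. It assumes that "long-form" here means one column per sample plus a voltage column, which is what the column names used later in this notebook suggest.

```python
import numpy as np
import pandas as pd


def sketch_longform(wide, metal, v_start=-1.0, v_stop=1.0):
    """Illustrative only: one row per sample in, one column per sample out."""
    # Step 2: drop columns that carry no readings ('Notes' is a made-up example)
    wide = wide.drop(columns=['Notes'])

    # Step 3: unique names of the form metal_concentration_sampleNumber
    sample_number = wide.groupby('Concentration').cumcount()
    names = (metal + '_' + wide['Concentration'].astype(str)
             + '_' + sample_number.astype(str))

    # Step 4: transpose so that each sample becomes a column
    long_df = wide.drop(columns=['Concentration']).T
    long_df.columns = names.to_list()

    # Step 5: add a voltage column (assumed evenly spaced; the range is a guess)
    long_df.insert(0, 'voltage', np.linspace(v_start, v_stop, len(long_df)))

    # Step 6: replace the old reading labels with a unique integer index
    return long_df.reset_index(drop=True)


# Toy input: three samples, three readings each (values are made up)
toy_wide = pd.DataFrame({
    'Concentration': ['low', 'low', 'high'],
    'Notes': ['', '', ''],
    'r0': [0.1, 0.2, 0.3],
    'r1': [0.4, 0.5, 0.6],
    'r2': [0.7, 0.8, 0.9],
})
toy_long = sketch_longform(toy_wide, 'copper')
print(toy_long)
```

Putting the metal and concentration in the column name keeps every sample identifiable after the frames are saved to separate CSV files.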

In [1]:
import numpy as np
import pandas as pd
from pathlib import Path

from scripts.data_processing import transform_to_longform_df

DATA_DIR = Path('data')

In [2]:
# Read in the sheets we want from the Excel file
sheet_names = ['Seawater - No Heavy Metals', 'Copper', 'Cadmium', 'Lead']
xcel = pd.read_excel(DATA_DIR / 'main.xlsx', sheet_name=sheet_names)

# Create dataframes for each class
seawater = xcel['Seawater - No Heavy Metals']
copper = xcel['Copper']
cadmium = xcel['Cadmium']
lead = xcel['Lead']

In [22]:
# Save metals as csv files
transform_to_longform_df(copper).to_csv(DATA_DIR / 'copper.csv')
transform_to_longform_df(cadmium).to_csv(DATA_DIR / 'cadmium.csv')
transform_to_longform_df(lead).to_csv(DATA_DIR / 'lead.csv')

In [23]:
# seawater_long col names are seawater_SW0_n (for some n)
# Let's keep everything after the first underscore
seawater_long = transform_to_longform_df(seawater)
short_col_names = [
    # Strip the 'seawater_' prefix; leave the voltage column unchanged
    col.split('_', 1)[1] if col.startswith('seawater_') else col
    for col in seawater_long.columns
]
# Rename cols
seawater_long.columns = short_col_names
seawater_long.to_csv(DATA_DIR / 'seawater.csv')

Now that the data is in a form that is easier to work with, let's start exploring!
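
One detail worth keeping in mind when loading the saved files: `to_csv` writes the index as the first column by default, so `index_col=0` is needed to get the original frame back. The sketch below shows the round trip on a made-up toy frame in a temporary directory rather than on the real `data/*.csv` files; the column names and values are placeholders.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for one of the processed frames (values are made up)
toy = pd.DataFrame({'voltage': [-1.0, 0.0, 1.0], 'SW0_0': [0.1, 0.2, 0.3]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / 'seawater.csv'
    toy.to_csv(csv_path)                           # index is written as the first column
    reloaded = pd.read_csv(csv_path, index_col=0)  # read that column back as the index

print(reloaded.columns.tolist())
```

Without `index_col=0`, the saved index would come back as an extra `Unnamed: 0` column.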