# Ohio Blood Alcohol Concentration Data Pre-Processing

## 1. Introduction
In this notebook, we prepare the Ohio State University blood alcohol concentration (BAC) dataset for model training.

In [1]:
import pandas as pd

## 2. Load Data
Our dataset comes from two university studies:
- Study 1: Student volunteers from Ohio State University were given 12 oz beers.
    - Features: Gender, Weight (lb), Beers, Sobriety Test Score Before and After
    - Target: Blood alcohol concentration (measured via breathalyzer 30 minutes after drinking)
    - Observations: 16
- Study 2: Volunteers from an Australian university were offered 10% alcohol/volume wine in 120 mL glasses.
    - Features: Gender, Weight (kg), Height (cm), Wine, Age
    - Target: Blood alcohol concentration (measured via breathalyzer 1 hour after drinking started and 15 minutes after participants were told to stop drinking)
    - Observations: 22

In [2]:
bac_dataset = pd.read_csv("../data/raw/bac_dataset.csv")
bac_dataset

Unnamed: 0,ID_OSU,Gender_OSU,Weight_OSU,Beers,BAC,1st-Sobr,2nd-Sobr,ID_AUST,Gend_AUS,Wght_AUS,Height,Age,1hr-BAC,Wine
0,1,female,132,5,0.1,10,6,1,female,70,167,20,0.025,4
1,2,female,128,2,0.03,9.5,9.25,2,female,66,161,21,0.04,4
2,3,female,110,9,0.19,9.75,4.75,3,male,67,169,27,0.07,6
3,4,male,192,8,0.12,10,7.5,4,male,91,187,20,0.065,6
4,5,male,172,3,0.04,10,9.75,5,female,58,158,25,0.015,3
5,6,female,250,7,0.095,9.5,6.5,6,male,80,177,29,0.02,3
6,7,female,125,3,0.07,9.5,7,7,female,63,162,26,0.0,1
7,8,male,175,5,0.06,9.75,8.75,8,male,75,170,48,0.015,3
8,9,female,175,3,0.02,9.5,6,9,male,124,184,22,0.0,3
9,10,male,275,5,0.05,9.75,8.5,10,male,90,171,50,0.02,3


In [3]:
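# As seen, the two datasets have been concatenated column-wise. To work with the Ohio dataset, we take the subset of columns up to and including '2nd-Sobr'.
# .copy() makes the subset an independent DataFrame rather than a slice of the original.

ohio_bac_dataset = bac_dataset.loc[:, :'2nd-Sobr'].copy()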
ohio_bac_dataset

Unnamed: 0,ID_OSU,Gender_OSU,Weight_OSU,Beers,BAC,1st-Sobr,2nd-Sobr
0,1,female,132,5,0.1,10,6
1,2,female,128,2,0.03,9.5,9.25
2,3,female,110,9,0.19,9.75,4.75
3,4,male,192,8,0.12,10,7.5
4,5,male,172,3,0.04,10,9.75
5,6,female,250,7,0.095,9.5,6.5
6,7,female,125,3,0.07,9.5,7
7,8,male,175,5,0.06,9.75,8.75
8,9,female,175,3,0.02,9.5,6
9,10,male,275,5,0.05,9.75,8.5
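
Label-based slicing with `.loc` includes the end label, unlike positional slicing, which is why `'2nd-Sobr'` appears in the subset. A minimal sketch on a toy frame follows; the frame `toy` and its values are made up for illustration, and only the column names mirror the raw file.

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative, column names mirror the raw file.
toy = pd.DataFrame({
    "ID_OSU": [1, 2],
    "BAC": [0.10, 0.03],
    "2nd-Sobr": [6.0, 9.25],
    "ID_AUST": [1, 2],
})

# .loc label slices are inclusive of the end label, so '2nd-Sobr' is kept
# and everything to its right ('ID_AUST') is excluded.
ohio_part = toy.loc[:, :"2nd-Sobr"].copy()
print(list(ohio_part.columns))
```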


## 3. Data Cleaning
- Handling missing values
- Correcting data types

In [4]:
# The Ohio study has fewer observations (16) than the Australian study (22), so the concatenation padded the Ohio columns with null rows. These rows are removed.

ohio_bac_dataset = ohio_bac_dataset.dropna()
ohio_bac_dataset

Unnamed: 0,ID_OSU,Gender_OSU,Weight_OSU,Beers,BAC,1st-Sobr,2nd-Sobr
0,1,female,132,5,0.1,10.0,6.0
1,2,female,128,2,0.03,9.5,9.25
2,3,female,110,9,0.19,9.75,4.75
3,4,male,192,8,0.12,10.0,7.5
4,5,male,172,3,0.04,10.0,9.75
5,6,female,250,7,0.095,9.5,6.5
6,7,female,125,3,0.07,9.5,7.0
7,8,male,175,5,0.06,9.75,8.75
8,9,female,175,3,0.02,9.5,6.0
9,10,male,275,5,0.05,9.75,8.5
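
Before dropping rows, it can help to check how many values are missing per column. The sketch below uses a made-up frame `padded` (not the real data) and shows `dropna(how="all")` as one possible alternative that only removes rows that are entirely empty; the notebook itself uses the default `dropna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the last row stands in for padding left by the concatenation.
padded = pd.DataFrame({
    "Beers": [5, 2, np.nan],
    "BAC": [0.10, 0.03, np.nan],
})

# Count missing values per column before removing anything.
missing_per_column = padded.isna().sum()
print(missing_per_column)

# how="all" drops only rows in which every column is missing, so a
# partially recorded observation would not be discarded silently.
trimmed = padded.dropna(how="all")
print(len(trimmed))
```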


In [5]:
ohio_bac_dataset.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16 entries, 0 to 15
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   ID_OSU      16 non-null     object
 1   Gender_OSU  16 non-null     object
 2   Weight_OSU  16 non-null     object
 3   Beers       16 non-null     object
 4   BAC         16 non-null     object
 5   1st-Sobr    16 non-null     object
 6   2nd-Sobr    16 non-null     object
dtypes: object(7)
memory usage: 1.0+ KB


In [6]:
# As seen, every column has dtype 'object', including those containing numeric values. We convert the columns used for modelling to the appropriate types.
# An explicit copy is taken first so that the assignments below do not trigger pandas' SettingWithCopyWarning.

ohio_bac_dataset = ohio_bac_dataset.copy()
ohio_bac_dataset['Weight_OSU'] = ohio_bac_dataset['Weight_OSU'].astype(int)
ohio_bac_dataset['Beers'] = ohio_bac_dataset['Beers'].astype(int)
ohio_bac_dataset['BAC'] = ohio_bac_dataset['BAC'].astype(float)
ohio_bac_dataset.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16 entries, 0 to 15
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   ID_OSU      16 non-null     object 
 1   Gender_OSU  16 non-null     object 
 2   Weight_OSU  16 non-null     int64  
 3   Beers       16 non-null     int64  
 4   BAC         16 non-null     float64
 5   1st-Sobr    16 non-null     object 
 6   2nd-Sobr    16 non-null     object 
dtypes: float64(1), int64(2), object(4)
memory usage: 1.0+ KB
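
As an alternative to converting one column at a time, `DataFrame.astype` accepts a column-to-dtype mapping and returns a new frame, so no assignment into a slice is needed. A minimal sketch, assuming a made-up frame `raw` whose numbers were read in as strings:

```python
import pandas as pd

# Hypothetical toy frame where numeric values arrived as strings.
raw = pd.DataFrame({
    "Weight_OSU": ["132", "128"],
    "Beers": ["5", "2"],
    "BAC": ["0.1", "0.03"],
})

# astype with a dict converts several columns in one call and returns a new
# DataFrame; a malformed entry raises an error instead of passing silently.
converted = raw.astype({"Weight_OSU": int, "Beers": int, "BAC": float})
print(converted.dtypes)
```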

## 5. Feature Selection
Drop unnecessary columns and select relevant features.

In [7]:
# The column 'ID_OSU' is dropped because an identifier carries no predictive information.
# The columns '1st-Sobr' and '2nd-Sobr' are dropped because our goal is to predict BAC solely from information available to the individual.

ohio_bac_dataset = ohio_bac_dataset.drop(columns=['ID_OSU', '1st-Sobr', '2nd-Sobr'])
ohio_bac_dataset

Unnamed: 0,Gender_OSU,Weight_OSU,Beers,BAC
0,female,132,5,0.1
1,female,128,2,0.03
2,female,110,9,0.19
3,male,192,8,0.12
4,male,172,3,0.04
5,female,250,7,0.095
6,female,125,3,0.07
7,male,175,5,0.06
8,female,175,3,0.02
9,male,275,5,0.05


## 6. Feature Engineering

In [8]:
# To use the categorical variable gender in the regression model, we encode it as a single 0/1 indicator column ('Gender_male'). With drop_first=True, 'female' becomes the baseline category.

gender_one_hot = pd.get_dummies(ohio_bac_dataset['Gender_OSU'], prefix='Gender', drop_first=True)
ohio_bac_dataset = pd.concat([ohio_bac_dataset, gender_one_hot.astype(int)], axis=1)
ohio_bac_dataset = ohio_bac_dataset.drop('Gender_OSU', axis=1)
ohio_bac_dataset

Unnamed: 0,Weight_OSU,Beers,BAC,Gender_male
0,132,5,0.1,0
1,128,2,0.03,0
2,110,9,0.19,0
3,192,8,0.12,1
4,172,3,0.04,1
5,250,7,0.095,0
6,125,3,0.07,0
7,175,5,0.06,1
8,175,3,0.02,0
9,275,5,0.05,1
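
The effect of `drop_first=True` can be seen by comparing it with a full one-hot encoding. The sketch below uses a made-up series `genders`; the real column has the same two categories.

```python
import pandas as pd

# Hypothetical toy column with the same two categories as 'Gender_OSU'.
genders = pd.Series(["female", "male", "female"], name="Gender_OSU")

# Full one-hot encoding yields one column per category.
full = pd.get_dummies(genders, prefix="Gender")

# drop_first=True drops the first category in sorted order ('female'),
# leaving one indicator column and avoiding perfectly collinear predictors.
indicator = pd.get_dummies(genders, prefix="Gender", drop_first=True).astype(int)

print(list(full.columns))
print(indicator["Gender_male"].tolist())
```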


In [9]:
# To standardize the bodyweight units, we convert them from pounds to grams. The conversion factor is 1 pound ≈ 453.6 grams.

ohio_bac_dataset['Weight_OSU'] = ohio_bac_dataset['Weight_OSU'] * 453.6
ohio_bac_dataset = ohio_bac_dataset.rename(columns={'Weight_OSU': 'Bodyweight_grams'})
ohio_bac_dataset['Bodyweight_grams']

0      59875.2
1      58060.8
2      49896.0
3      87091.2
4      78019.2
5     113400.0
6      56700.0
7      79380.0
8      79380.0
9     124740.0
10     58968.0
11     76204.8
12     58060.8
13    111585.6
14     74390.4
15     79380.0
Name: Bodyweight_grams, dtype: float64
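
The same conversion can be wrapped in a small helper that also handles kilograms, the unit used in the Australian study. This is a sketch only: the function `to_grams` and the dictionary `GRAMS_PER_UNIT` are hypothetical names, not part of the notebook.

```python
# Conversion factors to grams; 453.6 matches the rounded factor used in this notebook.
GRAMS_PER_UNIT = {"lb": 453.6, "kg": 1000.0}


def to_grams(weight, unit):
    """Convert a body weight given in pounds ('lb') or kilograms ('kg') to grams."""
    if unit not in GRAMS_PER_UNIT:
        raise ValueError(f"unsupported unit: {unit!r}")
    return weight * GRAMS_PER_UNIT[unit]


# First Ohio participant (132 lb) and first Australian participant (70 kg).
print(to_grams(132, "lb"))
print(to_grams(70, "kg"))
```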

In [10]:
# To allow the model to be easily used for different drinks, we convert the number of beers consumed to grams of ethanol. We use a conversion factor of approximately 14 grams of ethanol per 12 oz beer.

ohio_bac_dataset['Beers'] = ohio_bac_dataset['Beers'] * 14
ohio_bac_dataset = ohio_bac_dataset.rename(columns={'Beers': 'Ethanol_grams'})
ohio_bac_dataset['Ethanol_grams']

0      70
1      28
2     126
3     112
4      42
5      98
6      42
7      70
8      42
9      70
10     56
11     84
12     70
13     98
14     14
15     56
Name: Ethanol_grams, dtype: int64
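
For other drinks, the ethanol mass can be estimated from volume and alcohol by volume (ABV) using the density of ethanol, about 0.789 g/mL. The sketch below is illustrative: the function name `ethanol_grams` is hypothetical, and the 5% ABV for beer is an assumption, since the source only states the 12 oz serving size.

```python
ETHANOL_DENSITY_G_PER_ML = 0.789  # approximate density of ethanol at room temperature
ML_PER_US_FL_OZ = 29.5735


def ethanol_grams(volume_ml, abv):
    """Estimate grams of ethanol in a drink from its volume (mL) and ABV (0 to 1)."""
    return volume_ml * abv * ETHANOL_DENSITY_G_PER_ML


# A 12 oz beer at an assumed 5% ABV comes out close to the 14 g factor used above.
beer_grams = ethanol_grams(12 * ML_PER_US_FL_OZ, 0.05)

# A 120 mL glass of 10% ABV wine, as served in the Australian study.
wine_grams = ethanol_grams(120, 0.10)

print(round(beer_grams, 1), round(wine_grams, 1))
```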

In [11]:
# Reorder columns of dataframe to have the target variable 'BAC' at the end.

ohio_bac_dataset = ohio_bac_dataset[['Gender_male', 'Bodyweight_grams', 'Ethanol_grams', 'BAC']]
ohio_bac_dataset

Unnamed: 0,Gender_male,Bodyweight_grams,Ethanol_grams,BAC
0,0,59875.2,70,0.1
1,0,58060.8,28,0.03
2,0,49896.0,126,0.19
3,1,87091.2,112,0.12
4,1,78019.2,42,0.04
5,0,113400.0,98,0.095
6,0,56700.0,42,0.07
7,1,79380.0,70,0.06
8,0,79380.0,42,0.02
9,1,124740.0,70,0.05


## 7. Export Cleaned Data
- Saving the cleaned dataset for model training
- File saved as: ohio_bac_dataset_processed.csv

In [12]:
ohio_bac_dataset.to_csv('../data/processed/ohio_bac_dataset_processed.csv', index=False)
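
A quick round-trip check can confirm that the exported file reloads with the same columns and values. The sketch below writes to a temporary directory rather than the project's `data/processed` folder, and the rows in `processed` are a hypothetical two-row excerpt in the processed layout.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical two-row excerpt in the processed column layout.
processed = pd.DataFrame({
    "Gender_male": [0, 1],
    "Bodyweight_grams": [59875.2, 87091.2],
    "Ethanol_grams": [70, 112],
    "BAC": [0.10, 0.12],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "processed.csv"
    # index=False keeps the row index out of the file, as in the export cell.
    processed.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

# check_exact=False tolerates tiny floating-point differences from text round-tripping.
pd.testing.assert_frame_equal(processed, reloaded, check_exact=False)
```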