# Basic Setup and Data Loading

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import missingno as msno
import matplotlib.pyplot as plt

sns.set_style("whitegrid")

In [2]:
caffeine_history = pd.read_csv("new-data/releases_2023_v4release_1027_clinical_Caffeine_history.csv")
demographics_new = pd.read_csv("demographics_new.csv")
caffeine_history.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2155 entries, 0 to 2154
Data columns (total 6 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   participant_id                   2155 non-null   object 
 1   GUID                             1453 non-null   object 
 2   visit_name                       2155 non-null   object 
 3   visit_month                      1418 non-null   float64
 4   caff_drinks_ever_used_regularly  2155 non-null   object 
 5   caff_drinks_current_use          2063 non-null   object 
dtypes: float64(1), object(5)
memory usage: 101.1+ KB


In [3]:
caffeine_history.head()

|   | participant_id | GUID | visit_name | visit_month | caff_drinks_ever_used_regularly | caff_drinks_current_use |
|---|---|---|---|---|---|---|
| 0 | HB-PD_INVAA223GY7 | PD_INVAA223GY7 | M0 | 0.0 | No | No |
| 1 | HB-PD_INVAB465GYE | PD_INVAB465GYE | M0 | 0.0 | Yes | Yes |
| 2 | HB-PD_INVAD033HX2 | PD_INVAD033HX2 | M0 | 0.0 | No | No |
| 3 | HB-PD_INVAD802MY3 | PD_INVAD802MY3 | M0 | 0.0 | No | No |
| 4 | HB-PD_INVAD946MJ7 | PD_INVAD946MJ7 | M0 | 0.0 | Yes | Yes |


# Data Preprocessing and Cleaning

## Checking for Duplicates and NaN Values

We begin by removing participants with missing (NaN) values in the `GUID` column, as well as those with conflicting identifiers, that is, cases where multiple `participant_id`s share the same `GUID`. To ensure consistency, we retain only the participants whose `participant_id` appears in the cleaned reference file `demographics_new.csv`. The checks that follow are consistent with this: the number of unique `participant_id` values equals the number of unique `GUID` values, and no `GUID` is missing.
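The cleaning rule described above can be sketched on toy data. This is only an illustration under stated assumptions: the frame `toy` and its values are hypothetical, and the notebook itself does not repeat this step because it relies on the already cleaned `demographics_new.csv`.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the study files).
toy = pd.DataFrame({
    "participant_id": ["A-1", "A-2", "B-1", "C-1", "D-1"],
    "GUID": ["G1", "G1", "G2", None, "G3"],
})

# Count distinct participant_ids per GUID; groupby drops missing keys by default.
ids_per_guid = toy.groupby("GUID")["participant_id"].nunique()
conflicting_guids = ids_per_guid[ids_per_guid > 1].index

# Keep rows whose GUID is present and maps to exactly one participant_id.
clean = toy[toy["GUID"].notna() & ~toy["GUID"].isin(conflicting_guids)]

# These ids could then be used with isin() to filter another table.
clean_ids = clean["participant_id"]
```

Here `A-1` and `A-2` are dropped because they share `G1`, and `C-1` is dropped because its `GUID` is missing.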

In [6]:
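# .copy() makes the filtered frame independent of the original, so later in-place edits are safe.
caffeine_history = caffeine_history[caffeine_history['participant_id'].isin(demographics_new['participant_id'])].copy()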


In [7]:
caffeine_history.nunique()

participant_id                     1453
GUID                               1453
visit_name                            2
visit_month                           1
caff_drinks_ever_used_regularly       2
caff_drinks_current_use               2
dtype: int64

In [8]:
caffeine_history['GUID'].isna().sum()

np.int64(0)

We drop the `GUID` column, since the later steps of the analysis do not need it. We then check for duplicate entries on the combination of `participant_id` and `visit_month`, so that each participant's visit is represented exactly once.

In [9]:
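# Reassign instead of using inplace=True, which can raise SettingWithCopyWarning on a filtered frame.
caffeine_history = caffeine_history.drop(columns='GUID')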

In [10]:
caffeine_history.duplicated(subset=['visit_month', 'participant_id']).sum()

np.int64(0)
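The duplicate count here is zero, so nothing needs removing. If it were nonzero, one possible way to inspect and resolve the duplicates is sketched below on toy data. The frame `toy`, its values, and the choice of `keep="first"` are assumptions for illustration, not part of the original analysis. Note that pandas treats missing `visit_month` values as equal to each other when detecting duplicates.

```python
import pandas as pd

# Toy data (hypothetical values) with duplicated participant/visit pairs.
toy = pd.DataFrame({
    "participant_id": ["P1", "P1", "P2", "P3", "P3"],
    "visit_month": [0.0, 0.0, 0.0, None, None],
    "caff_drinks_current_use": ["Yes", "No", "No", "Yes", "Yes"],
})

key = ["participant_id", "visit_month"]

# Number of rows that repeat an earlier (participant_id, visit_month) pair.
n_dupes = toy.duplicated(subset=key).sum()

# keep=False flags every member of a duplicated group, which helps manual review.
all_dupes = toy[toy.duplicated(subset=key, keep=False)]

# One possible resolution: keep the first row of each group.
deduped = toy.drop_duplicates(subset=key, keep="first")
```

Reviewing `all_dupes` before dropping anything is worthwhile, because duplicated rows may disagree (as the two `P1` rows do here) and keeping the first one is an arbitrary choice.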