In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import missingno as msno
import matplotlib.pyplot as plt

sns.set_style("whitegrid")

In [2]:
demographics = pd.read_csv("new-data/releases_2023_v4release_1027_clinical_Demographics.csv")
clinical_enrolment = pd.read_csv("new-data/releases_2023_v4release_1027_clinical_Enrollment.csv")
demographics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10908 entries, 0 to 10907
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   participant_id         10908 non-null  object
 1   GUID                   4413 non-null   object
 2   visit_name             10908 non-null  object
 3   visit_month            10908 non-null  int64 
 4   age_at_baseline        10908 non-null  int64 
 5   sex                    10908 non-null  object
 6   ethnicity              6303 non-null   object
 7   race                   10889 non-null  object
 8   education_level_years  6322 non-null   object
dtypes: int64(2), object(7)
memory usage: 767.1+ KB


In [3]:
clinical_enrolment.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10058 entries, 0 to 10057
Data columns (total 8 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   participant_id                          10058 non-null  object 
 1   GUID                                    4394 non-null   object 
 2   visit_name                              10058 non-null  object 
 3   visit_month                             10058 non-null  int64  
 4   enrollment_months_after_baseline        3976 non-null   float64
 5   informed_consent_months_after_baseline  4334 non-null   float64
 6   prodromal_category                      5565 non-null   object 
 7   study_arm                               10055 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 628.8+ KB


In [4]:
demographics.head()

Unnamed: 0,participant_id,GUID,visit_name,visit_month,age_at_baseline,sex,ethnicity,race,education_level_years
0,BF-1001,PDNW781VHY,M0,0,55,Male,Not Hispanic or Latino,White,12-16 years
1,BF-1002,PDCB969UGG,M0,0,66,Female,Not Hispanic or Latino,White,12-16 years
2,BF-1003,PDLW805AHT,M0,0,61,Male,Not Hispanic or Latino,White,12-16 years
3,BF-1004,PDKW284DYW,M0,0,62,Male,Not Hispanic or Latino,White,12-16 years
4,BF-1005,PDTM274KX6,M0,0,61,Female,Not Hispanic or Latino,White,12-16 years


In [5]:
demographics.columns

Index(['participant_id', 'GUID', 'visit_name', 'visit_month',
       'age_at_baseline', 'sex', 'ethnicity', 'race', 'education_level_years'],
      dtype='object')

## Duplicate And NaN Values 

We begin by counting the unique values of `participant_id` and `GUID` to check for any discrepancy between the two identifiers.

In [6]:
demographics['participant_id'].nunique()

10908

In [7]:
demographics['GUID'].nunique()

4408

The difference in the number of unique values between `participant_id` and `GUID` is likely due to the large number of missing (NaN) entries in the `GUID` column: the `.info()` output above shows only 4413 non-null `GUID` values out of 10908 rows. To check that missing data explains the discrepancy, we drop all rows with a NaN `GUID` and recount the unique values in each column.

In [8]:
demographics_new = demographics.dropna(subset=['GUID'])
demographics_new.nunique()

participant_id           4413
GUID                     4408
visit_name                  2
visit_month                 2
age_at_baseline            68
sex                         2
ethnicity                   3
race                        8
education_level_years       4
dtype: int64
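As a side note on the counts above, the following minimal sketch uses a small made-up frame (the values are hypothetical, not taken from the dataset) to show how missing `GUID` entries interact with `nunique()`, which excludes NaN by default.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values and the same two identifier columns.
toy = pd.DataFrame({
    "participant_id": ["A-1", "A-2", "A-3", "A-4"],
    "GUID": ["G1", np.nan, "G2", np.nan],
})

n_missing = toy["GUID"].isna().sum()                    # rows with no GUID
n_unique = toy["GUID"].nunique()                        # NaN is excluded by default
n_unique_with_nan = toy["GUID"].nunique(dropna=False)   # NaN counted as one extra value

print(n_missing, n_unique, n_unique_with_nan)
```

Because NaN is ignored by default, the `GUID` count is the same before and after `dropna`, while the `participant_id` count shrinks to the rows that actually carry a `GUID`.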

After removing rows with missing `GUID` values, we determined the number of unique identifiers in each column:

- **participant_id:** 4413 unique values
- **GUID:** 4408 unique values

There are five more unique `participant_id`s than `GUID`s, so at least some `GUID`s must be linked to more than one `participant_id`. We therefore look at the rows whose `GUID` is shared by several `participant_id`s.


In [9]:
demographics_multiple_GID_counts = demographics_new.groupby('GUID')['participant_id'].nunique()
mult_guid = demographics_multiple_GID_counts[demographics_multiple_GID_counts > 1].index
rows_with_multi_pid = demographics_new[demographics_new['GUID'].isin(mult_guid)]
rows_with_multi_pid.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, 78 to 10625
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   participant_id         10 non-null     object
 1   GUID                   10 non-null     object
 2   visit_name             10 non-null     object
 3   visit_month            10 non-null     int64 
 4   age_at_baseline        10 non-null     int64 
 5   sex                    10 non-null     object
 6   ethnicity              10 non-null     object
 7   race                   10 non-null     object
 8   education_level_years  10 non-null     object
dtypes: int64(2), object(7)
memory usage: 800.0+ bytes


In [10]:
rows_with_multi_pid.head(10)

Unnamed: 0,participant_id,GUID,visit_name,visit_month,age_at_baseline,sex,ethnicity,race,education_level_years
78,BF-1088,PDJV686AAB,M0,0,66,Female,Not Hispanic or Latino,White,Greater than 16 years
80,BF-1091,PDEB612LPE,M0,0,68,Male,Not Hispanic or Latino,White,12-16 years
87,BF-1098,PDUA781LH0,M0,0,64,Male,Not Hispanic or Latino,White,12-16 years
88,BF-1100,PDBW494GHE,M0,0,71,Male,Not Hispanic or Latino,White,Less than 12 years
6774,PD-PDBW494GHE,PDBW494GHE,M0,0,71,Male,Not Hispanic or Latino,White,Less than 12 years
6920,PD-PDEB612LPE,PDEB612LPE,M0,0,67,Male,Not Hispanic or Latino,White,12-16 years
7142,PD-PDHB484JY8,PDHB484JY8,M0,0,58,Female,Not Hispanic or Latino,White,12-16 years
7273,PD-PDJV686AAB,PDJV686AAB,M0,0,65,Female,Not Hispanic or Latino,White,12-16 years
7783,PD-PDUA781LH0,PDUA781LH0,M0,0,63,Male,Not Hispanic or Latino,White,12-16 years
10625,SY-PDHB484JY8,PDHB484JY8,M0,0,59,Female,Not Hispanic or Latino,White,Unknown


In [11]:
rows_with_multi_pid[['GUID', 'visit_name', 'visit_month',
       'age_at_baseline', 'sex', 'ethnicity', 'race', 'education_level_years']].duplicated().sum()

np.int64(1)

A closer look shows that four of the five `GUID`s with multiple `participant_id`s differ by exactly one year in `age_at_baseline` (the remaining pair, `PDBW494GHE`, is identical in every column except `participant_id`), suggesting that these participants likely enrolled in a second cohort about a year later. We now repeat the analysis on the clinical enrollment data to check this hypothesis.
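The age comparison described above can also be made explicit per `GUID`. This is a minimal sketch on made-up rows (hypothetical identifiers and ages); the column name `age_spread_years` is introduced here for illustration only.

```python
import pandas as pd

# Toy rows (hypothetical) mimicking GUIDs that map to two participant_ids.
toy = pd.DataFrame({
    "participant_id": ["BF-1", "PD-G1", "BF-2", "PD-G2"],
    "GUID": ["G1", "G1", "G2", "G2"],
    "age_at_baseline": [66, 65, 71, 71],
})

# Difference between the largest and smallest baseline age recorded per GUID.
age_spread = (
    toy.groupby("GUID")["age_at_baseline"]
       .agg(lambda s: s.max() - s.min())
       .rename("age_spread_years")
)
print(age_spread)
```

On the real data the same expression could be applied to `rows_with_multi_pid` in place of `toy`.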

In [12]:
clinical_enrolment.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10058 entries, 0 to 10057
Data columns (total 8 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   participant_id                          10058 non-null  object 
 1   GUID                                    4394 non-null   object 
 2   visit_name                              10058 non-null  object 
 3   visit_month                             10058 non-null  int64  
 4   enrollment_months_after_baseline        3976 non-null   float64
 5   informed_consent_months_after_baseline  4334 non-null   float64
 6   prodromal_category                      5565 non-null   object 
 7   study_arm                               10055 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 628.8+ KB


In [13]:
clinical_enrolment.head()

Unnamed: 0,participant_id,GUID,visit_name,visit_month,enrollment_months_after_baseline,informed_consent_months_after_baseline,prodromal_category,study_arm
0,BF-1001,PDNW781VHY,M0,0,,0.0,Unknown/Not collected as enrollment criterion,Healthy Control
1,BF-1002,PDCB969UGG,M0,0,,0.0,Unknown/Not collected as enrollment criterion,PD
2,BF-1003,PDLW805AHT,M0,0,,0.0,Unknown/Not collected as enrollment criterion,PD
3,BF-1004,PDKW284DYW,M0,0,,0.0,Unknown/Not collected as enrollment criterion,PD
4,BF-1005,PDTM274KX6,M0,0,,0.0,Unknown/Not collected as enrollment criterion,Healthy Control


In [14]:
clinical_enrolment_new = clinical_enrolment.dropna(subset=['GUID'])
clinical_enrolment_new.nunique()

participant_id                            4394
GUID                                      4389
visit_name                                   2
visit_month                                  2
enrollment_months_after_baseline            20
informed_consent_months_after_baseline      22
prodromal_category                           3
study_arm                                   10
dtype: int64

In [15]:
enrolment_multiple_GID_counts =   clinical_enrolment_new.groupby('GUID')['participant_id'].nunique()
mult_guid_enrolment =  enrolment_multiple_GID_counts[enrolment_multiple_GID_counts > 1].index
enrolment_rows_with_multi_pid = clinical_enrolment_new[clinical_enrolment_new['GUID'].isin(mult_guid_enrolment)]
enrolment_rows_with_multi_pid.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, 78 to 9876
Data columns (total 8 columns):
 #   Column                                  Non-Null Count  Dtype  
---  ------                                  --------------  -----  
 0   participant_id                          10 non-null     object 
 1   GUID                                    10 non-null     object 
 2   visit_name                              10 non-null     object 
 3   visit_month                             10 non-null     int64  
 4   enrollment_months_after_baseline        5 non-null      float64
 5   informed_consent_months_after_baseline  9 non-null      float64
 6   prodromal_category                      10 non-null     object 
 7   study_arm                               10 non-null     object 
dtypes: float64(2), int64(1), object(5)
memory usage: 720.0+ bytes


In [16]:
enrolment_rows_with_multi_pid.head(10)

Unnamed: 0,participant_id,GUID,visit_name,visit_month,enrollment_months_after_baseline,informed_consent_months_after_baseline,prodromal_category,study_arm
78,BF-1088,PDJV686AAB,M0,0,,0.0,Unknown/Not collected as enrollment criterion,PD
80,BF-1091,PDEB612LPE,M0,0,,0.0,Unknown/Not collected as enrollment criterion,PD
87,BF-1098,PDUA781LH0,M0,0,,0.0,Unknown/Not collected as enrollment criterion,PD
88,BF-1100,PDBW494GHE,M0,0,,0.0,Unknown/Not collected as enrollment criterion,PD
6040,PD-PDBW494GHE,PDBW494GHE,M0,0,-0.5,-0.5,Unknown/Not collected as enrollment criterion,PD
6185,PD-PDEB612LPE,PDEB612LPE,M0,0,0.0,0.0,Unknown/Not collected as enrollment criterion,PD
6405,PD-PDHB484JY8,PDHB484JY8,M0,0,,,Unknown/Not collected as enrollment criterion,PD
6536,PD-PDJV686AAB,PDJV686AAB,M0,0,-0.5,-0.5,Unknown/Not collected as enrollment criterion,PD
7041,PD-PDUA781LH0,PDUA781LH0,M0,0,-0.5,-0.5,Unknown/Not collected as enrollment criterion,PD
9876,SY-PDHB484JY8,PDHB484JY8,M0,0,0.0,0.0,Unknown/Not collected as enrollment criterion,PD
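To compare the two tables directly rather than by eye, one option is to collect the affected `GUID`s from each table as a set and test the sets for equality. The sketch below uses made-up frames, and the helper `multi_pid_guids` is a hypothetical name introduced for illustration, not part of the notebook.

```python
import pandas as pd

def multi_pid_guids(df: pd.DataFrame) -> set:
    """Return the GUIDs linked to more than one participant_id (rows without a GUID are ignored)."""
    counts = df.dropna(subset=["GUID"]).groupby("GUID")["participant_id"].nunique()
    return set(counts[counts > 1].index)

# Toy stand-ins (hypothetical values) for the demographics and enrolment tables.
demo_toy = pd.DataFrame({
    "participant_id": ["BF-1", "PD-G1", "BF-2"],
    "GUID": ["G1", "G1", "G2"],
})
enrol_toy = pd.DataFrame({
    "participant_id": ["BF-1", "PD-G1", "BF-3"],
    "GUID": ["G1", "G1", None],
})

same = multi_pid_guids(demo_toy) == multi_pid_guids(enrol_toy)
print(same)
```

With the real data, the two arguments would be `demographics` and `clinical_enrolment`.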


The participants with the same `GUID` but different `participant_id` in the demographics data are the same as those in the clinical enrolment data. Notably, in three of the five cases the two records differ by 0.5 months in `informed_consent_months_after_baseline`, which is consistent with separate enrolments and may account for the age discrepancies observed earlier. To maintain consistency, we remove all participants whose `GUID` maps to multiple `participant_id`s. Going forward, we filter every dataset to retain only those records whose `participant_id` appears in the `demographics_new.csv` reference file we create now.

In [17]:
demographics_new = demographics_new[~demographics_new['GUID'].isin(mult_guid)]
demographics_new.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4403 entries, 0 to 10907
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   participant_id         4403 non-null   object
 1   GUID                   4403 non-null   object
 2   visit_name             4403 non-null   object
 3   visit_month            4403 non-null   int64 
 4   age_at_baseline        4403 non-null   int64 
 5   sex                    4403 non-null   object
 6   ethnicity              4403 non-null   object
 7   race                   4403 non-null   object
 8   education_level_years  4403 non-null   object
dtypes: int64(2), object(7)
memory usage: 344.0+ KB


In [18]:
demographics_new.to_csv('demographics_new.csv', index=False)
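The filtering step planned for the other datasets could look roughly like the sketch below. The helper name `filter_to_reference` and the toy frames are assumptions made for illustration; in practice the reference table would be loaded with `pd.read_csv('demographics_new.csv')`.

```python
import pandas as pd

def filter_to_reference(df: pd.DataFrame, reference: pd.DataFrame,
                        key: str = "participant_id") -> pd.DataFrame:
    """Keep only the rows of df whose key value appears in the reference table."""
    return df[df[key].isin(reference[key])].copy()

# Toy stand-ins (hypothetical values) for the reference file and another dataset.
reference = pd.DataFrame({"participant_id": ["A-1", "A-3"]})
other = pd.DataFrame({
    "participant_id": ["A-1", "A-2", "A-3", "A-3"],
    "visit_month": [0, 0, 0, 12],
})

filtered = filter_to_reference(other, reference)
print(filtered)
```

Using `isin` rather than a merge keeps every visit row of a retained participant and adds no columns from the reference table.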


