# Basic Setup and Data Loading

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import missingno as msno
import matplotlib.pyplot as plt

sns.set_style("whitegrid")

In [2]:
DatScan = pd.read_csv("new-data/releases_2023_v4release_1027_clinical_DaTSCAN_SBR.csv")
demographics_new = pd.read_csv("demographics_new.csv")
DatScan.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2280 entries, 0 to 2279
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   participant_id  2280 non-null   object 
 1   GUID            881 non-null    object 
 2   visit_name      2280 non-null   object 
 3   visit_month     2276 non-null   float64
 4   sbr_caudate_r   2280 non-null   float64
 5   sbr_caudate_l   2280 non-null   float64
 6   sbr_putamen_r   2280 non-null   float64
 7   sbr_putamen_l   2280 non-null   float64
dtypes: float64(5), object(3)
memory usage: 142.6+ KB


In [3]:
DatScan.head()

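|   | participant_id | GUID | visit_name | visit_month | sbr_caudate_r | sbr_caudate_l | sbr_putamen_r | sbr_putamen_l |
|---|----------------|------|------------|-------------|---------------|---------------|---------------|---------------|
| 0 | LC-1010006     | NaN  | M0         | 0.0         | 4.06          | 4.27          | 3.61          | 3.32          |
| 1 | LC-1030006     | NaN  | M0         | 0.0         | 3.67          | 4.04          | 3.10          | 2.91          |
| 2 | LC-120006      | NaN  | M0         | 0.0         | 3.05          | 2.98          | 2.84          | 2.93          |
| 3 | LC-1240006     | NaN  | M0         | 0.0         | 4.01          | 3.90          | 2.77          | 2.93          |
| 4 | LC-1250006     | NaN  | M0         | 0.0         | 3.60          | 3.84          | 2.37          | 2.36          |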


# Data Preprocessing and Cleaning

## Checking for Duplicates and NaN Values

We begin by removing patients who have missing (NaN) values in the `GUID` column, as well as those with conflicting identifiers, that is, cases where multiple `participant_id`s share the same `GUID`. In practice we do this by keeping only the rows whose `participant_id` appears in the cleaned reference file `demographics_new.csv`. The checks that follow confirm that no `GUID` is missing afterwards and that the counts of unique `participant_id`s and unique `GUID`s match (342 each).

In [7]:
DatScan = DatScan[DatScan['participant_id'].isin(demographics_new['participant_id'])]
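
To see what a membership filter of this kind removes, it can help to keep the boolean mask around and inspect it. The following is a minimal sketch on made-up data; the frames `scans` and `reference` and their values are hypothetical stand-ins, not the actual DaTSCAN or demographics files.

```python
import pandas as pd

# Hypothetical stand-ins for the scan table and the cleaned reference table.
scans = pd.DataFrame({
    "participant_id": ["A-1", "A-1", "A-2", "A-3"],
    "sbr_caudate_r": [3.1, 2.9, 4.0, 3.5],
})
reference = pd.DataFrame({"participant_id": ["A-1", "A-3"]})

# True for rows whose participant_id appears in the reference table.
mask = scans["participant_id"].isin(reference["participant_id"])

# .copy() gives an independent frame, so later in-place changes
# do not trigger pandas' SettingWithCopyWarning.
kept = scans[mask].copy()

# Participants that the filter removes.
dropped_ids = scans.loc[~mask, "participant_id"].unique()

print(f"kept {mask.sum()} of {len(scans)} rows")
print("dropped participants:", list(dropped_ids))
```

Keeping the mask separate from the filtered frame makes it easy to report how many rows and which participants were excluded.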


In [8]:
DatScan.nunique()

participant_id    342
GUID              342
visit_name          8
visit_month         7
sbr_caudate_r     278
sbr_caudate_l     274
sbr_putamen_r     211
sbr_putamen_l     196
dtype: int64

In [9]:
DatScan['GUID'].isna().sum()
   

np.int64(0)
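
The two identifier conditions described above (no missing `GUID`, and no `GUID` shared by several `participant_id`s) can also be checked directly. The sketch below does this on a small made-up frame; `toy`, `shared_guids`, and the values are hypothetical, and this is not necessarily how `demographics_new.csv` was produced.

```python
import pandas as pd

# Hypothetical toy data, not taken from the DaTSCAN release.
toy = pd.DataFrame({
    "participant_id": ["A-1", "A-2", "A-3", "A-4", "A-4"],
    "GUID": ["G1", "G2", "G2", None, None],
})

# Rows with a missing GUID.
missing_guid = toy["GUID"].isna()

# Number of distinct participant_ids per GUID.
# groupby drops missing GUIDs by default, so they do not form a group.
ids_per_guid = toy.groupby("GUID")["participant_id"].nunique()

# GUIDs that are attached to more than one participant_id.
shared_guids = ids_per_guid[ids_per_guid > 1].index

# Keep rows that have a GUID and whose GUID maps to a single participant_id.
clean = toy[~missing_guid & ~toy["GUID"].isin(shared_guids)]
print(clean)
```

In this toy example only `A-1` survives: `A-2` and `A-3` share `G2`, and `A-4` has no `GUID`.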

We remove the `GUID` column from the dataset, as it is no longer needed in the subsequent steps. We then check for duplicate entries by examining combinations of the `participant_id` and `visit_month` columns, to ensure that each participant's visit appears only once.

In [10]:
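# Reassign instead of using inplace=True: DatScan is a filtered slice,
# and in-place modification of a slice can raise SettingWithCopyWarning.
DatScan = DatScan.drop(columns='GUID')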

In [13]:
DatScan.duplicated(subset=['visit_month', 'participant_id']).sum()

np.int64(0)
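
Here the count is zero, so nothing needs to be removed. Had it been non-zero, one way to proceed would be to list every conflicting row before deciding which to keep. The sketch below illustrates this on made-up data; `toy_visits` and its values are hypothetical, and keeping the first occurrence is only one possible policy.

```python
import pandas as pd

# Hypothetical visits table with one repeated (participant_id, visit_month) pair.
toy_visits = pd.DataFrame({
    "participant_id": ["A-1", "A-1", "A-2", "A-2"],
    "visit_month": [0.0, 12.0, 0.0, 0.0],
    "sbr_caudate_r": [3.1, 2.9, 4.0, 4.2],
})

key = ["participant_id", "visit_month"]

# Count of rows that repeat an earlier (participant_id, visit_month) pair.
n_dupes = toy_visits.duplicated(subset=key).sum()

# keep=False marks every member of a duplicated group, so all
# conflicting rows can be inspected side by side.
conflicts = toy_visits[toy_visits.duplicated(subset=key, keep=False)]

# One possible resolution: keep the first row of each pair.
deduped = toy_visits.drop_duplicates(subset=key, keep="first")

print(n_dupes)
print(conflicts)
```

Inspecting `conflicts` first matters because duplicated visits may carry different measurements, as in this toy example, so silently dropping one of them is a modelling decision rather than a formality.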