In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Loading the Datasets

### Columns Description
| Column | Description                         |
|--------|-------------------------------------|
| ID     | Identification                      |
| M/F    | Gender                              |
| Hand   | Dominant Hand                       |
| Age    | Age                                 |
| Educ   | Education Level                     |
| SES    | Socioeconomic Status                |
| MMSE   | Mini Mental State Examination       |
| eTIV   | Estimated Total Intracranial Volume |
| CDR    | Clinical Dementia Rating            |
| nWBV   | Normalized Whole Brain Volume       |
| ASF    | Atlas Scaling Factor                |
| Delay  |                                     |

In [2]:
path_to_oasis_cs = "../../dat/alzheimer/oasis_cross-sectional.csv"
path_to_oasis_long = "../../dat/alzheimer/oasis_longitudinal.csv"
oasis_cs_df = pd.read_csv(path_to_oasis_cs)
oasis_long_df = pd.read_csv(path_to_oasis_long)

In [3]:
oasis_cs_df.head()

Unnamed: 0,ID,M/F,Hand,Age,Educ,SES,MMSE,CDR,eTIV,nWBV,ASF,Delay
0,OAS1_0001_MR1,F,R,74,2.0,3.0,29.0,0.0,1344,0.743,1.306,
1,OAS1_0002_MR1,F,R,55,4.0,1.0,29.0,0.0,1147,0.81,1.531,
2,OAS1_0003_MR1,F,R,73,4.0,3.0,27.0,0.5,1454,0.708,1.207,
3,OAS1_0004_MR1,M,R,28,,,,,1588,0.803,1.105,
4,OAS1_0005_MR1,M,R,18,,,,,1737,0.848,1.01,


In [4]:
oasis_cs_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 436 entries, 0 to 435
Data columns (total 12 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ID      436 non-null    object 
 1   M/F     436 non-null    object 
 2   Hand    436 non-null    object 
 3   Age     436 non-null    int64  
 4   Educ    235 non-null    float64
 5   SES     216 non-null    float64
 6   MMSE    235 non-null    float64
 7   CDR     235 non-null    float64
 8   eTIV    436 non-null    int64  
 9   nWBV    436 non-null    float64
 10  ASF     436 non-null    float64
 11  Delay   20 non-null     float64
dtypes: float64(7), int64(2), object(3)
memory usage: 41.0+ KB


Column **Delay** has only 20 non-null values out of 436, so it is unlikely to be useful here.
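A minimal sketch of how such sparse columns could be flagged and dropped. The frame `toy_df` and its values are made up for illustration (a stand-in for `oasis_cs_df`), and the 0.5 threshold is an assumption, not something the data dictates.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for oasis_cs_df; values are invented.
toy_df = pd.DataFrame({
    "Age": [74, 55, 73, 28, 18],
    "CDR": [0.0, 0.0, 0.5, np.nan, np.nan],
    "Delay": [np.nan, np.nan, np.nan, np.nan, 12.0],
})

# Share of missing values per column (isna() gives booleans, mean() the fraction).
missing_share = toy_df.isna().mean()

# Assumed cut-off: drop columns where more than half of the values are missing.
sparse_columns = missing_share[missing_share > 0.5].index.tolist()
reduced_df = toy_df.drop(columns=sparse_columns)
```

On the real data the same two lines could be run on `oasis_cs_df`; whether 0.5 is a sensible cut-off depends on how the column would be used.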

Columns **Educ**, **SES (Socioeconomic Status)** and **MMSE (Mini Mental State Examination)** are each missing roughly half of their values (235, 216 and 235 non-null out of 436). In general, we might look at these in more detail in connection with our target variable of interest, **CDR**, before deciding how to handle them. However, **CDR** itself has only 235 non-null values. If the missing values coincide across these columns, it might make sense to look only at the rows where **CDR** is present.
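One way to check whether the missing values coincide is to compare the null masks column by column. The sketch below uses a made-up frame `toy_cs_df` in place of `oasis_cs_df`, and the helper `missing_overlap` is a hypothetical name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for oasis_cs_df; values are invented.
toy_cs_df = pd.DataFrame({
    "Educ": [2.0, 4.0, np.nan, np.nan, 3.0],
    "SES":  [3.0, 1.0, np.nan, np.nan, np.nan],
    "MMSE": [29.0, 29.0, np.nan, np.nan, 27.0],
    "CDR":  [0.0, 0.0, np.nan, np.nan, 0.5],
})

def missing_overlap(df, target, others):
    """Count how the missing rows of each column line up with the target's."""
    target_na = df[target].isna()
    rows = []
    for col in others:
        col_na = df[col].isna()
        rows.append({
            "column": col,
            "both_missing": int((target_na & col_na).sum()),
            "only_column_missing": int((~target_na & col_na).sum()),
            "only_target_missing": int((target_na & ~col_na).sum()),
        })
    return pd.DataFrame(rows).set_index("column")

overlap = missing_overlap(toy_cs_df, "CDR", ["Educ", "SES", "MMSE"])

# Keep only the rows where the target is known.
labelled_df = toy_cs_df.dropna(subset=["CDR"])
```

If `only_column_missing` and `only_target_missing` are both zero for a column, its gaps line up exactly with those of the target, and restricting to `labelled_df` removes them in one step.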

In [5]:
oasis_long_df.head()

Unnamed: 0,Subject ID,MRI ID,Group,Visit,MR Delay,M/F,Hand,Age,EDUC,SES,MMSE,CDR,eTIV,nWBV,ASF
0,OAS2_0001,OAS2_0001_MR1,Nondemented,1,0,M,R,87,14,2.0,27.0,0.0,1987,0.696,0.883
1,OAS2_0001,OAS2_0001_MR2,Nondemented,2,457,M,R,88,14,2.0,30.0,0.0,2004,0.681,0.876
2,OAS2_0002,OAS2_0002_MR1,Demented,1,0,M,R,75,12,,23.0,0.5,1678,0.736,1.046
3,OAS2_0002,OAS2_0002_MR2,Demented,2,560,M,R,76,12,,28.0,0.5,1738,0.713,1.01
4,OAS2_0002,OAS2_0002_MR3,Demented,3,1895,M,R,80,12,,22.0,0.5,1698,0.701,1.034


In [6]:
oasis_long_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 373 entries, 0 to 372
Data columns (total 15 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Subject ID  373 non-null    object 
 1   MRI ID      373 non-null    object 
 2   Group       373 non-null    object 
 3   Visit       373 non-null    int64  
 4   MR Delay    373 non-null    int64  
 5   M/F         373 non-null    object 
 6   Hand        373 non-null    object 
 7   Age         373 non-null    int64  
 8   EDUC        373 non-null    int64  
 9   SES         354 non-null    float64
 10  MMSE        371 non-null    float64
 11  CDR         373 non-null    float64
 12  eTIV        373 non-null    int64  
 13  nWBV        373 non-null    float64
 14  ASF         373 non-null    float64
dtypes: float64(5), int64(5), object(5)
memory usage: 43.8+ KB


The column **SES** is missing 19 of its values and **MMSE** is missing two. We could fill these values with the column mean, or even train a quick model to predict them (**without using the target variable**).
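A minimal sketch of the mean-filling option, assuming a made-up frame `toy_long_df` in place of `oasis_long_df`; the helper `fill_with_mean` is a hypothetical name. Since **SES** is an ordinal scale, the median might be a reasonable alternative to the mean, but that is a judgment call rather than something the data settles.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for oasis_long_df; values are invented.
toy_long_df = pd.DataFrame({
    "SES":  [2.0, 2.0, np.nan, 4.0],
    "MMSE": [27.0, 30.0, 23.0, np.nan],
    "CDR":  [0.0, 0.0, 0.5, 0.5],
})

def fill_with_mean(df, columns):
    """Return a copy of df with NaNs in `columns` replaced by each column's mean."""
    filled = df.copy()
    for col in columns:
        # Series.mean() skips NaN by default, so the mean uses observed values only.
        filled[col] = filled[col].fillna(filled[col].mean())
    return filled

# The target column CDR is deliberately left out of the imputation.
filled_df = fill_with_mean(toy_long_df, ["SES", "MMSE"])
```

The function returns a copy, so the original frame keeps its gaps and a different strategy can still be tried later.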