# Early-Stage Alzheimer's Disease Prediction Using Machine Learning Models

[RESEARCH PAPER](https://www.frontiersin.org/articles/10.3389/fpubh.2022.853294/full#B21)  

[DATASET](https://www.kaggle.com/datasets/jboysen/mri-and-alzheimers/data?select=oasis_longitudinal.csv)

#### LOADING THE DATA

In [2]:
import pandas as pd
import numpy as np

In [3]:
X = pd.read_csv("dataset/oasis_longitudinal.csv")
X.shape

(373, 15)

The dataset contains 373 records with 15 features.

#### INITIAL LOOK AT THE DATA

In [4]:
X.head(7)

Unnamed: 0,Subject ID,MRI ID,Group,Visit,MR Delay,M/F,Hand,Age,EDUC,SES,MMSE,CDR,eTIV,nWBV,ASF
0,OAS2_0001,OAS2_0001_MR1,Nondemented,1,0,M,R,87,14,2.0,27.0,0.0,1987,0.696,0.883
1,OAS2_0001,OAS2_0001_MR2,Nondemented,2,457,M,R,88,14,2.0,30.0,0.0,2004,0.681,0.876
2,OAS2_0002,OAS2_0002_MR1,Demented,1,0,M,R,75,12,,23.0,0.5,1678,0.736,1.046
3,OAS2_0002,OAS2_0002_MR2,Demented,2,560,M,R,76,12,,28.0,0.5,1738,0.713,1.01
4,OAS2_0002,OAS2_0002_MR3,Demented,3,1895,M,R,80,12,,22.0,0.5,1698,0.701,1.034
5,OAS2_0004,OAS2_0004_MR1,Nondemented,1,0,F,R,88,18,3.0,28.0,0.0,1215,0.71,1.444
6,OAS2_0004,OAS2_0004_MR2,Nondemented,2,538,F,R,90,18,3.0,27.0,0.0,1200,0.718,1.462


In [5]:
X.dtypes

Subject ID     object
MRI ID         object
Group          object
Visit           int64
MR Delay        int64
M/F            object
Hand           object
Age             int64
EDUC            int64
SES           float64
MMSE          float64
CDR           float64
eTIV            int64
nWBV          float64
ASF           float64
dtype: object

Feature descriptions:
 - Subject ID - unique identifier of a person; 150 people took part in the data collection, so there are 150 distinct Subject ID values
 - MRI ID - unique identifier of an MRI scan
 - Group - label
    - "Demented"
    - "Nondemented"
    - "Converted"
 - Visit - ordinal number of the person's scan
    - integer value
 - MR Delay - MR delay time (contrast); in the rows shown above it is 0 at the first visit and increases with each later visit
    - integer value
 - M/F - sex of the person
    - 'M' - male
    - 'F' - female
 - Hand - dominant hand of the person
    - R - right-handed
    - L - left-handed
 - Age - age of the person
    - integer value
 - EDUC - number of years of education
    - integer value
 - SES - socioeconomic status
    - float value
 - MMSE - Mini-Mental State Examination score, a measure of the subject's cognitive ability
    - MMSE <= 9 - severe cognitive impairment
    - 10 <= MMSE <= 18 - moderate cognitive impairment
    - 19 <= MMSE <= 23 - mild cognitive impairment
    - 24 <= MMSE <= 30 - normal cognitive ability
 - CDR - Clinical Dementia Rating (the summary statistics show values between 0 and 2 in this dataset)
    - 0 = no dementia
    - 0.5 = questionable or very mild dementia
    - 1 = mild dementia
    - 2 = moderate dementia
    - 3 = severe dementia
    - 4 = profound dementia
    - 5 = terminal dementia
 - eTIV - estimated total intracranial volume of the subject
    - integer value
 - nWBV - normalized whole brain volume
    - float value
 - ASF - atlas scaling factor, the normalization factor for intracranial volume
    - float value
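The MMSE ranges listed above can be turned into a categorical column. The following is a minimal sketch, assuming the four ranges are applied exactly as written; the helper name `mmse_category` and the short labels are hypothetical and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical helper (not from the original notebook): bins MMSE scores
# into the four ranges listed in the feature descriptions.
def mmse_category(scores: pd.Series) -> pd.Series:
    # Right-inclusive bin edges: <=9, 10-18, 19-23, 24-30
    bins = [-1, 9, 18, 23, 30]
    labels = ["severe", "moderate", "mild", "normal"]
    return pd.cut(scores, bins=bins, labels=labels)

# MMSE values from the first rows of the dataset, the reported minimum (4.0),
# and one missing value to show that NaN stays NaN.
example = pd.Series([27.0, 30.0, 23.0, 28.0, 22.0, 4.0, None])
categories = mmse_category(example)
print(categories.tolist())
```

Missing scores are left as missing, so the binning does not hide the two records without an MMSE value.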

In [6]:
X.describe()

Unnamed: 0,Visit,MR Delay,Age,EDUC,SES,MMSE,CDR,eTIV,nWBV,ASF
count,373.0,373.0,373.0,373.0,354.0,371.0,373.0,373.0,373.0,373.0
mean,1.882038,595.104558,77.013405,14.597855,2.460452,27.342318,0.290885,1488.128686,0.729568,1.195461
std,0.922843,635.485118,7.640957,2.876339,1.134005,3.683244,0.374557,176.139286,0.037135,0.138092
min,1.0,0.0,60.0,6.0,1.0,4.0,0.0,1106.0,0.644,0.876
25%,1.0,0.0,71.0,12.0,2.0,27.0,0.0,1357.0,0.7,1.099
50%,2.0,552.0,77.0,15.0,2.0,29.0,0.0,1470.0,0.729,1.194
75%,2.0,873.0,82.0,16.0,3.0,30.0,0.5,1597.0,0.756,1.293
max,5.0,2639.0,98.0,23.0,5.0,30.0,2.0,2004.0,0.837,1.587


#### MONOTONIC ATTRIBUTES

Let us check whether the dataset contains any monotonic attributes.

In [7]:
X.nunique()

Subject ID    150
MRI ID        373
Group           3
Visit           5
MR Delay      201
M/F             2
Hand            1
Age            39
EDUC           12
SES             5
MMSE           18
CDR             4
eTIV          286
nWBV          136
ASF           265
dtype: int64
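The `nunique()` output can also be scanned programmatically. In the full dataset it reports a single value for Hand and 373 distinct values for MRI ID, which equals the number of rows. The sketch below shows one possible way to flag both kinds of column; it uses a small toy frame built from rows displayed earlier, and the variable names are assumptions made for illustration.

```python
import pandas as pd

# Toy frame built from rows shown earlier; not the full dataset.
df = pd.DataFrame({
    "Subject ID": ["OAS2_0001", "OAS2_0001", "OAS2_0002", "OAS2_0002"],
    "MRI ID": ["OAS2_0001_MR1", "OAS2_0001_MR2", "OAS2_0002_MR1", "OAS2_0002_MR2"],
    "Visit": [1, 2, 1, 2],
    "Hand": ["R", "R", "R", "R"],
})

counts = df.nunique()
# Columns with one value everywhere carry no information for a model.
constant_cols = counts[counts == 1].index.tolist()
# Columns with a distinct value in every row behave like identifiers.
id_like_cols = counts[counts == len(df)].index.tolist()
print(constant_cols, id_like_cols)
```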

In [20]:
X.loc[:, "MRI ID"]

0      OAS2_0001_MR1
1      OAS2_0001_MR2
2      OAS2_0002_MR1
3      OAS2_0002_MR2
4      OAS2_0002_MR3
           ...      
368    OAS2_0185_MR2
369    OAS2_0185_MR3
370    OAS2_0186_MR1
371    OAS2_0186_MR2
372    OAS2_0186_MR3
Name: MRI ID, Length: 373, dtype: object

The MRI ID feature is monotonic and identifies individual MRI scans. Since the Subject ID and Visit attributes together carry the same information as this attribute, MRI ID can be removed from the dataset without losing any information.
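The redundancy claim can be checked before dropping the column. This is a minimal sketch on a toy frame built from rows displayed earlier. It assumes that MRI ID follows the pattern `<Subject ID>_MR<Visit>`, which holds for the rows shown but is not verified here for the full dataset.

```python
import pandas as pd

# Toy frame built from rows shown earlier; not the full dataset.
df = pd.DataFrame({
    "Subject ID": ["OAS2_0001", "OAS2_0001", "OAS2_0002", "OAS2_0002", "OAS2_0002"],
    "MRI ID": ["OAS2_0001_MR1", "OAS2_0001_MR2", "OAS2_0002_MR1",
               "OAS2_0002_MR2", "OAS2_0002_MR3"],
    "Visit": [1, 2, 1, 2, 3],
})

# (Subject ID, Visit) identifies a scan if no pair occurs twice.
pair_is_key = not df.duplicated(subset=["Subject ID", "Visit"]).any()

# Assumed naming pattern: rebuild MRI ID from the other two columns and compare.
rebuilt = df["Subject ID"] + "_MR" + df["Visit"].astype(str)
matches = bool((rebuilt == df["MRI ID"]).all())
print(pair_is_key, matches)
```

If both checks hold on the real data, dropping MRI ID loses nothing that cannot be reconstructed.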

In [23]:
# drop() returns a new DataFrame; X itself is unchanged unless the result is assigned
X.drop('MRI ID', axis=1)

Unnamed: 0,Subject ID,Group,Visit,MR Delay,M/F,Hand,Age,EDUC,SES,MMSE,CDR,eTIV,nWBV,ASF
0,OAS2_0001,Nondemented,1,0,M,R,87,14,2.0,27.0,0.0,1987,0.696,0.883
1,OAS2_0001,Nondemented,2,457,M,R,88,14,2.0,30.0,0.0,2004,0.681,0.876
2,OAS2_0002,Demented,1,0,M,R,75,12,,23.0,0.5,1678,0.736,1.046
3,OAS2_0002,Demented,2,560,M,R,76,12,,28.0,0.5,1738,0.713,1.010
4,OAS2_0002,Demented,3,1895,M,R,80,12,,22.0,0.5,1698,0.701,1.034
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
368,OAS2_0185,Demented,2,842,M,R,82,16,1.0,28.0,0.5,1693,0.694,1.037
369,OAS2_0185,Demented,3,2297,M,R,86,16,1.0,26.0,0.5,1688,0.675,1.040
370,OAS2_0186,Nondemented,1,0,F,R,61,13,2.0,30.0,0.0,1319,0.801,1.331
371,OAS2_0186,Nondemented,2,763,F,R,63,13,2.0,30.0,0.0,1327,0.796,1.323


#### MISSING VALUES

In [38]:
X.info()
X.isna().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 373 entries, 0 to 372
Data columns (total 15 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Subject ID  373 non-null    object 
 1   MRI ID      373 non-null    object 
 2   Group       373 non-null    object 
 3   Visit       373 non-null    int64  
 4   MR Delay    373 non-null    int64  
 5   M/F         373 non-null    object 
 6   Hand        373 non-null    object 
 7   Age         373 non-null    int64  
 8   EDUC        373 non-null    int64  
 9   SES         354 non-null    float64
 10  MMSE        371 non-null    float64
 11  CDR         373 non-null    float64
 12  eTIV        373 non-null    int64  
 13  nWBV        373 non-null    float64
 14  ASF         373 non-null    float64
dtypes: float64(5), int64(5), object(5)
memory usage: 43.8+ KB


Subject ID     0
MRI ID         0
Group          0
Visit          0
MR Delay       0
M/F            0
Hand           0
Age            0
EDUC           0
SES           19
MMSE           2
CDR            0
eTIV           0
nWBV           0
ASF            0
dtype: int64

The output above shows that most values are present. However, the SES and MMSE features have 19 and 2 missing records, respectively. In percentage terms, about 5% of SES values (19 of 373) and about 0.5% of MMSE values (2 of 373) are missing.

Let us look at these records in more detail.
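The percentages follow directly from the counts. The sketch below reproduces the arithmetic for the full dataset and shows how the same figures could be read off a DataFrame; the toy frame holds only the SES and MMSE values of the first seven rows, so its percentages differ from those of the full data.

```python
import numpy as np
import pandas as pd

# Arithmetic for the full dataset, using the counts from the output above.
ses_pct_full = round(19 / 373 * 100, 1)   # about 5.1
mmse_pct_full = round(2 / 373 * 100, 1)   # about 0.5

# Toy frame with the SES and MMSE values of the first seven rows.
df = pd.DataFrame({
    "SES": [2.0, 2.0, np.nan, np.nan, np.nan, 3.0, 3.0],
    "MMSE": [27.0, 30.0, 23.0, 28.0, 22.0, 28.0, 27.0],
})

missing_count = df.isna().sum()
# The mean of a boolean mask is the fraction of missing values per column.
missing_pct = (df.isna().mean() * 100).round(1)
print(missing_count.to_dict(), missing_pct.to_dict())
```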

In [39]:
X.loc[X.SES.isna(), :]

Unnamed: 0,Subject ID,MRI ID,Group,Visit,MR Delay,M/F,Hand,Age,EDUC,SES,MMSE,CDR,eTIV,nWBV,ASF
2,OAS2_0002,OAS2_0002_MR1,Demented,1,0,M,R,75,12,,23.0,0.5,1678,0.736,1.046
3,OAS2_0002,OAS2_0002_MR2,Demented,2,560,M,R,76,12,,28.0,0.5,1738,0.713,1.01
4,OAS2_0002,OAS2_0002_MR3,Demented,3,1895,M,R,80,12,,22.0,0.5,1698,0.701,1.034
10,OAS2_0007,OAS2_0007_MR1,Demented,1,0,M,R,71,16,,28.0,0.5,1357,0.748,1.293
11,OAS2_0007,OAS2_0007_MR3,Demented,3,518,M,R,73,16,,27.0,1.0,1365,0.727,1.286
12,OAS2_0007,OAS2_0007_MR4,Demented,4,1281,M,R,75,16,,27.0,1.0,1372,0.71,1.279
134,OAS2_0063,OAS2_0063_MR1,Demented,1,0,F,R,80,12,,30.0,0.5,1430,0.737,1.228
135,OAS2_0063,OAS2_0063_MR2,Demented,2,490,F,R,81,12,,27.0,0.5,1453,0.721,1.208
207,OAS2_0099,OAS2_0099_MR1,Demented,1,0,F,R,80,12,,27.0,0.5,1475,0.762,1.19
208,OAS2_0099,OAS2_0099_MR2,Demented,2,807,F,R,83,12,,23.0,0.5,1484,0.75,1.183


Generally speaking, the problem of missing data can be handled by:
- interpolation - in this particular case it does not make much sense for either feature
- dropping the entire feature - this does not make sense here either, because relatively few examples have a missing value
- filling the missing values with the mean - also not an option
- dropping the records - this is the best option for our application, because only a relatively small number of records is affected
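The four options above map onto standard pandas calls. The following sketch applies each of them to a toy frame (EDUC and SES values of the first rows) purely to show what each one produces; it is an illustration, not the notebook's own code.

```python
import numpy as np
import pandas as pd

# Toy frame with EDUC and SES values from the first rows of the dataset.
df = pd.DataFrame({
    "EDUC": [14, 14, 12, 12, 18],
    "SES": [2.0, 2.0, np.nan, np.nan, 3.0],
})

# 1. Linear interpolation along the row order.
interpolated = df["SES"].interpolate()
# 2. Dropping the whole feature.
without_feature = df.drop(columns=["SES"])
# 3. Filling with the column mean.
mean_filled = df["SES"].fillna(df["SES"].mean())
# 4. Dropping the affected records.
without_rows = df.dropna(subset=["SES"])

print(interpolated.tolist())
print(mean_filled.tolist())
print(len(without_rows))
```

Interpolation and mean filling both produce fractional values here, while the observed SES values are whole numbers from 1 to 5. That may be one reason the text rejects those two options.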


**Note**: The 2 records with a missing MMSE value are also missing SES, so it is enough to drop only the 19 records in which SES is missing.

In [44]:
X_noNaN = X.copy()

X_noNaN = X_noNaN.loc[X_noNaN.SES.notnull(), :]
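Two quick checks can confirm that filtering on SES alone is enough: that every record missing MMSE is also missing SES, and that no missing values remain afterwards. The sketch below runs them on a toy frame constructed to have the same overlap; the values and variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: one record misses only SES, one misses both SES and MMSE.
df = pd.DataFrame({
    "SES": [2.0, np.nan, np.nan, 3.0],
    "MMSE": [27.0, 23.0, np.nan, 28.0],
})

# True when every record with a missing MMSE also has a missing SES.
mmse_subset_of_ses = bool(df.loc[df["MMSE"].isna(), "SES"].isna().all())

# Same filter as in the cell above, applied to the toy frame.
cleaned = df.loc[df["SES"].notnull(), :]
remaining_missing = int(cleaned.isna().sum().sum())
print(mmse_subset_of_ses, len(cleaned), remaining_missing)
```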


#### VISUALIZATION AND OUTLIERS