# Prepare data

#### Purpose

In this notebook we prepare the data for analysis and generate an ARFF file that our Java project can read. More specifically, we are going to:
* Generate the attributes that the clustering algorithms will consider: MDS-NMS items are combined into non-motor symptoms and MDS-UPDRS items into motor symptoms.
* Select the socio-demographic and clinical variables and the H&Y stage for the subsequent comparative cluster analysis.

----
#### Load data

In [3]:
import pandas as pd
import numpy as np

# Load original data
original_data = pd.read_csv("../data/original_data.csv", sep=";")
# Replace ? with NaN
original_data = original_data.replace("?", np.nan)
original_data.shape

(402, 412)
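
As an alternative to replacing `"?"` after loading, the placeholder can be declared as missing while the file is parsed, which lets pandas infer numeric dtypes directly. The following is a minimal sketch on an inline sample; the sample columns and values are made up for illustration and are not taken from `original_data.csv`.

```python
import io

import pandas as pd

# Made-up sample that mimics the ";"-separated layout with "?" for missing values
sample_csv = "patnum;age;hy\n1;63;2\n2;?;3\n3;71;?\n"

# na_values="?" marks the placeholder as missing during parsing
sample = pd.read_csv(io.StringIO(sample_csv), sep=";", na_values="?")

print(sample.dtypes)
print(sample.isna().sum())
```

With this option, columns that contain `"?"` are read as floats instead of strings, so a later `pd.to_numeric` pass may become unnecessary.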

## 1 - Select original attributes

### 1.1 - Socio-demographic and clinical

In [4]:
original_socio_demographic_cols = ["patnum", "age", "sex", "pdonset", "durat_pd"]
original_socio_demographic = original_data[original_socio_demographic_cols]

### 1.2 - Hoehn-Yahr

In [5]:
original_hoehn_yahr_cols = ["hy"]
original_hoehn_yahr = original_data[original_hoehn_yahr_cols]

### 1.3 - MDS - NMS

Select the combination of severity and frequency attributes.
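
The selection rule can be sketched on a toy frame. The column names below are hypothetical and assume that frequency and severity columns carry an `f` or `s` suffix while the combined score has no suffix, which is what the filter in the next cell suggests.

```python
import pandas as pd

# Hypothetical columns: "f" = frequency, "s" = severity, no suffix = combined score
toy = pd.DataFrame({
    "patnum": [1, 2],
    "mdsnms_A1f": ["2", "?"],
    "mdsnms_A1s": ["3", "?"],
    "mdsnms_A1": ["6", "?"],
})

# str.endswith accepts a tuple, so both suffixes are tested in one call
combined_cols = [
    col for col in toy.columns
    if col.startswith("mdsnms_") and not col.endswith(("f", "s"))
]

# errors="coerce" turns unparseable strings such as "?" into NaN;
# apply() returns a new data frame, so no slice of `toy` is modified
combined = toy[combined_cols].apply(pd.to_numeric, errors="coerce")
print(combined)
```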

In [6]:
original_mds_nms_cols = [col for col in original_data.columns if col.startswith("mdsnms_") and not col.endswith(("f", "s"))]
# Convert to float64 on a new data frame; assigning into a slice of original_data triggers SettingWithCopyWarning
original_mds_nms = original_data[original_mds_nms_cols].apply(pd.to_numeric).astype("float64")

#original_mds_nms.dtypes
#original_mds_nms.shape


### 1.4 - UPDRS - Motor

Manually select the variables that represent motor symptoms, corresponding to the "Part II: Motor Aspects of Experiences of Daily Living (M-EDL)", "Part III: Motor Examination", and "Part IV: Motor Complications" domains.

In [7]:
original_updrs_motor_cols = [
    # M-EDL
    "mdsupdrs2_1", "mdsupdrs2_2", "mdsupdrs2_3", "mdsupdrs2_4", "mdsupdrs2_5", "mdsupdrs2_6", "mdsupdrs2_7", "mdsupdrs2_8", "mdsupdrs2_9",
    "mdsupdrs2_10", "mdsupdrs2_11", "mdsupdrs2_12", "mdsupdrs2_13",
    # Motor examination
    "mdsupdrs3_1","mdsupdrs3_2","mdsupdrs3_3a","mdsupdrs3_3b","mdsupdrs3_3c","mdsupdrs3_3d","mdsupdrs3_3e","mdsupdrs3_4a","mdsupdrs3_4b",
    "mdsupdrs3_5a","mdsupdrs3_5b","mdsupdrs3_6a","mdsupdrs3_6b","mdsupdrs3_7a","mdsupdrs3_7b","mdsupdrs3_8a","mdsupdrs3_8b","mdsupdrs3_9",
    "mdsupdrs3_10","mdsupdrs3_11","mdsupdrs3_12","mdsupdrs3_13","mdsupdrs3_14","mdsupdrs3_15a","mdsupdrs3_15b","mdsupdrs3_16a","mdsupdrs3_16b",
    "mdsupdrs3_17a","mdsupdrs3_17b","mdsupdrs3_17c","mdsupdrs3_17d","mdsupdrs3_17e","mdsupdrs3_18",
    # Motor complications
    "mdsupdrs4_1","mdsupdrs4_2","mdsupdrs4_3","mdsupdrs4_4","mdsupdrs4_5","mdsupdrs4_6"
]
# Convert to float64 on a new data frame; assigning into a slice of original_data triggers SettingWithCopyWarning
original_updrs_motor = original_data[original_updrs_motor_cols].apply(pd.to_numeric).astype("float64")

#original_updrs_motor.dtypes
#original_updrs_motor.shape

### 1.5 - Combine them all

Once selected, combine all the data frames into a single one, which we will use to locate missing values.

In [8]:
data_combined = pd.concat([original_socio_demographic, original_hoehn_yahr, original_mds_nms, original_updrs_motor], axis = 1)
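
Note that `pd.concat` with `axis=1` matches rows by index label, not by position. Here every frame is a column slice of `original_data`, so the indexes agree. The toy sketch below (made-up values) shows what happens when they do not.

```python
import pandas as pd

left = pd.DataFrame({"age": [60, 70, 80]}, index=[0, 1, 2])
# A frame whose row 1 was dropped without resetting the index
right = pd.DataFrame({"hy": [2, 3]}, index=[0, 2])

# Rows are matched by index label; label 1 has no partner, so "hy" is NaN there
aligned = pd.concat([left, right], axis=1)
print(aligned)

# Rebuilding both indexes pairs the rows by position instead
positional = pd.concat(
    [left.loc[[0, 2]].reset_index(drop=True), right.reset_index(drop=True)],
    axis=1,
)
print(positional)
```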

## 2 - Missing data

Count the missing values in each column, then drop the rows that contain missing values.

In [9]:
missing_data = data_combined.isnull()
true_counts = [(column, missing_data[column].values.sum()) for column in missing_data.columns]
false_counts = [(column, (~missing_data[column].values).sum()) for column in missing_data.columns]
true_counts.sort(key=lambda x:x[1], reverse = True)

We observe a concentration of missing values in certain variables of the MDS-NMS and MDS-UPDRS scales.
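
The same counts can be obtained more compactly with `isna().sum()`. This is a minimal sketch on made-up data, not the notebook's actual output.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "age": [60, np.nan, 80],
    "hy": [2, 3, 1],
    "mdsnms_A1": [np.nan, np.nan, 4],
})

# Missing values per column, largest first
missing_per_column = toy.isna().sum().sort_values(ascending=False)
print(missing_per_column)

# Number of rows with at least one missing value (the rows dropna() would remove)
rows_with_missing = int(toy.isna().any(axis=1).sum())
print(rows_with_missing)
```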

In [10]:
data_no_missing = data_combined.dropna()
data_no_missing.shape

(350, 110)

## 4 - Generate final attributes
We have two datasets:
* One with missing values, <code>data_combined</code> (402, 110)
* One without them, <code>data_no_missing</code> (350, 110)

By default, we select the dataset with missing values because our multidimensional clustering method can deal with them.

In [11]:
data = data_combined
#data = data_no_missing
data.shape

(402, 110)

### 4.1 - MDS-NMS

For this, we are going to combine all the attributes of the same subscale (A, B, C, etc) into a single attribute, following the work of [1].

* **A:** Depression (5 attributes)
* **B:** Anxiety (4 attributes)
* **C:** Apathy (3 attributes)
* **D:** Psychosis (4 attributes)
* **E:** Impulse control and related disorders (4 attributes)
* **F:** Cognition (6 attributes)
* **G:** Orthostatic hypotension (2 attributes)
* **H:** Urinary (3 attributes)
* **I:** Sexual (2 attributes)
* **J:** Gastrointestinal (4 attributes)
* **K:** Sleep and wakefulness (6 attributes)
* **L:** Pain (4 attributes)
* **M:** Other (5 attributes)
    * **M1:** Weight loss
    * **M2:** Sense of smell
    * **M3:** Physical tiredness
    * **M4:** Mental fatigue
    * **M5:** Excessive sweating

For subscale M, instead of grouping all its attributes, we consider each of them separately, because each one represents a different aspect.

In [12]:
mds_nms_names=["depression", "anxiety", "apathy", "psychosis", "impulse_control", "cognition", "hypotension",
               "urinary", "sexual", "gastrointestinal", "sleep", "pain", "weight_loss", "smell", 
               "physical_tiredness", "mental_fatigue", "sweating"]

mds_nms = pd.DataFrame(columns=mds_nms_names)

mds_nms["depression"] = data["mdsnms_A1"].values + data["mdsnms_A2"].values + data["mdsnms_A3"].values + data["mdsnms_A4"].values + data["mdsnms_A5"].values
mds_nms["anxiety"] = data["mdsnms_B1"].values + data["mdsnms_B2"].values + data["mdsnms_B3"].values + data["mdsnms_B4"].values
mds_nms["apathy"] = data["mdsnms_C1"].values + data["mdsnms_C2"].values + data["mdsnms_C3"].values
mds_nms["psychosis"] = data["mdsnms_D1"].values + data["mdsnms_D2"].values + data["mdsnms_D3"].values + data["mdsnms_D4"].values
mds_nms["impulse_control"] = data["mdsnms_E1"].values + data["mdsnms_E2"].values + data["mdsnms_E3"].values + data["mdsnms_E4"].values
mds_nms["cognition"] = data["mdsnms_F1"].values + data["mdsnms_F2"].values + data["mdsnms_F3"].values + data["mdsnms_F4"].values + data["mdsnms_F5"].values + data["mdsnms_F6"].values
mds_nms["hypotension"] = data["mdsnms_G1"].values + data["mdsnms_G2"].values
mds_nms["urinary"] = data["mdsnms_H1"].values + data["mdsnms_H2"].values + data["mdsnms_H3"].values
mds_nms["sexual"] = data["mdsnms_I1"].values + data["mdsnms_I2"].values
mds_nms["gastrointestinal"] = data["mdsnms_J1"].values + data["mdsnms_J2"].values + data["mdsnms_J3"].values + data["mdsnms_J4"].values
mds_nms["sleep"] = data["mdsnms_K1"].values + data["mdsnms_K2"].values + data["mdsnms_K3"].values + data["mdsnms_K4"].values + data["mdsnms_K5"].values + data["mdsnms_K6"].values
mds_nms["pain"] = data["mdsnms_L1"].values + data["mdsnms_L2"].values + data["mdsnms_L3"].values + data["mdsnms_L4"].values
mds_nms["weight_loss"] = data["mdsnms_M1"].values
mds_nms["smell"] = data["mdsnms_M2"].values
mds_nms["physical_tiredness"] = data["mdsnms_M3"].values
mds_nms["mental_fatigue"] = data["mdsnms_M4"].values
mds_nms["sweating"] = data["mdsnms_M5"].values
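
The sums above can also be written with a mapping from domain name to item columns. The sketch below uses made-up values and only two subscales. `skipna=False` is chosen to reproduce the behaviour of `+`, where a single missing item makes the whole domain score missing.

```python
import numpy as np
import pandas as pd

# Toy data with two subscales; the real frame has subscales A to M
toy = pd.DataFrame({
    "mdsnms_G1": [1.0, 4.0, np.nan],
    "mdsnms_G2": [2.0, 0.0, 3.0],
    "mdsnms_I1": [0.0, 16.0, 1.0],
    "mdsnms_I2": [5.0, 16.0, 2.0],
})

groups = {
    "hypotension": ["mdsnms_G1", "mdsnms_G2"],
    "sexual": ["mdsnms_I1", "mdsnms_I2"],
}

# skipna=False: one missing item makes the domain sum NaN, as with "+"
domain_scores = pd.DataFrame({
    name: toy[cols].sum(axis=1, skipna=False) for name, cols in groups.items()
})
print(domain_scores)
```

Keeping the item lists in one place also makes it easy to derive the number of items per domain, which the normalization step needs.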

Once generated, we normalize them to the range [0.0, 1.0] for easier analysis and comparison. The minimum and maximum possible values of each column are known, so the process is straightforward:
* The minimum value is 0.
* The maximum value is obtained by multiplying the number of items in each domain by 16, which is the maximum value of an attribute that has severity of 4 and frequency of 4.

Therefore we simply have to divide each column by its maximum possible value and it will be normalized in the [0.0, 1.0] range.
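
The arithmetic can be checked with a short sketch. The item counts come from the subscale list above; the raw score of 72 is a made-up value used only as a worked example.

```python
# Items per MDS-NMS domain; the five "Other" items are kept separate
items_per_domain = {
    "depression": 5, "anxiety": 4, "apathy": 3, "psychosis": 4,
    "impulse_control": 4, "cognition": 6, "hypotension": 2, "urinary": 3,
    "sexual": 2, "gastrointestinal": 4, "sleep": 6, "pain": 4,
    "weight_loss": 1, "smell": 1, "physical_tiredness": 1,
    "mental_fatigue": 1, "sweating": 1,
}

MAX_ITEM_SCORE = 4 * 4  # severity 4 x frequency 4

max_score = {name: n * MAX_ITEM_SCORE for name, n in items_per_domain.items()}

# Worked example: a raw depression sum of 72 out of a possible 5 * 16 = 80
print(72 / max_score["depression"])  # 0.9
```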

In [13]:
mds_nms["depression"] = mds_nms["depression"] / (5 * 16)
mds_nms["anxiety"] = mds_nms["anxiety"] / (4 * 16)
mds_nms["apathy"] = mds_nms["apathy"] / (3 * 16)
mds_nms["psychosis"] = mds_nms["psychosis"] / (4 * 16)
mds_nms["impulse_control"] = mds_nms["impulse_control"] / (4 * 16)
mds_nms["cognition"] = mds_nms["cognition"] / (6 * 16)
mds_nms["hypotension"] = mds_nms["hypotension"] / (2 * 16)
mds_nms["urinary"] = mds_nms["urinary"] / (3 * 16)
mds_nms["sexual"] = mds_nms["sexual"] / (2 * 16)
mds_nms["gastrointestinal"] = mds_nms["gastrointestinal"] / (4 * 16)
mds_nms["sleep"] = mds_nms["sleep"] / (6 * 16)
mds_nms["pain"] = mds_nms["pain"] / (4 * 16)
mds_nms["weight_loss"] = data["mdsnms_M1"].values / (1 * 16)
mds_nms["smell"] = data["mdsnms_M2"].values / (1 * 16)
mds_nms["physical_tiredness"] = data["mdsnms_M3"].values / (1 * 16)
mds_nms["mental_fatigue"] = data["mdsnms_M4"].values / (1 * 16)
mds_nms["sweating"] = data["mdsnms_M5"].values / (1 * 16)

mds_nms.shape

(402, 17)

In [14]:
mds_nms.max()

depression            0.900000
anxiety               0.843750
apathy                0.750000
psychosis             0.562500
impulse_control       0.390625
cognition             0.687500
hypotension           0.750000
urinary               1.000000
sexual                1.000000
gastrointestinal      0.734375
sleep                 0.791667
pain                  0.828125
weight_loss           1.000000
smell                 1.000000
physical_tiredness    1.000000
mental_fatigue        1.000000
sweating              1.000000
dtype: float64

### 4.2 - MDS-UPDRS

There are multiple motor variables. In this case we group them according to the cardinal signs and motor subtypes, as explained in "documentation/MDS-UPDRS item groups.docx". Although the MDS-UPDRS variables can be interpreted as categorical or numerical, we treat them as numerical because the groups are obtained by summing their values.
* Tremor: 11 variables
* Rigidity: 5 variables
* Dyskinesias: 2 variables
* Fluctuations: 4 variables
* Bradykinesia: 11 variables
* Axial no PIGD: 7 variables
* PIGD: 5 variables
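
The grouping can be written as a mapping so that the group sizes are checked against the list above. The item assignments are copied from the sums in the next cell; the dictionary name `updrs_groups` is introduced here for illustration only.

```python
# Item-to-group assignment, copied from the sums computed in the next cell
updrs_groups = {
    "tremor": [
        "mdsupdrs2_10", "mdsupdrs3_15a", "mdsupdrs3_15b", "mdsupdrs3_16a",
        "mdsupdrs3_16b", "mdsupdrs3_17a", "mdsupdrs3_17b", "mdsupdrs3_17c",
        "mdsupdrs3_17d", "mdsupdrs3_17e", "mdsupdrs3_18",
    ],
    "rigidity": [
        "mdsupdrs3_3a", "mdsupdrs3_3b", "mdsupdrs3_3c", "mdsupdrs3_3d",
        "mdsupdrs3_3e",
    ],
    "dyskinesias": ["mdsupdrs4_1", "mdsupdrs4_2"],
    "fluctuations": ["mdsupdrs4_3", "mdsupdrs4_4", "mdsupdrs4_5", "mdsupdrs4_6"],
    "bradykinesia": [
        "mdsupdrs3_4a", "mdsupdrs3_4b", "mdsupdrs3_5a", "mdsupdrs3_5b",
        "mdsupdrs3_6a", "mdsupdrs3_6b", "mdsupdrs3_7a", "mdsupdrs3_7b",
        "mdsupdrs3_8a", "mdsupdrs3_8b", "mdsupdrs3_14",
    ],
    "axial_no_pigd": [
        "mdsupdrs2_1", "mdsupdrs2_2", "mdsupdrs2_3", "mdsupdrs3_1",
        "mdsupdrs3_2", "mdsupdrs3_9", "mdsupdrs3_13",
    ],
    "pigd": [
        "mdsupdrs2_12", "mdsupdrs2_13", "mdsupdrs3_10", "mdsupdrs3_11",
        "mdsupdrs3_12",
    ],
}

group_sizes = {name: len(items) for name, items in updrs_groups.items()}
print(group_sizes)

# Each item should belong to exactly one group
all_items = [item for items in updrs_groups.values() for item in items]
assert len(all_items) == len(set(all_items))
```

Note that 45 of the 52 selected MDS-UPDRS columns are assigned to a group; the remaining Part II items are not used in the sums.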

In [15]:
mds_updrs_names=["tremor", "rigidity", "dyskinesias", "fluctuations", "bradykinesia", "axial_no_pigd", "pigd"]

mds_updrs= pd.DataFrame(columns=mds_updrs_names)

mds_updrs["tremor"] = data["mdsupdrs2_10"].values + data["mdsupdrs3_15a"].values + data["mdsupdrs3_15b"].values + data["mdsupdrs3_16a"].values + data["mdsupdrs3_16b"].values+ data["mdsupdrs3_17a"].values + data["mdsupdrs3_17b"].values + data["mdsupdrs3_17c"].values + data["mdsupdrs3_17d"].values + data["mdsupdrs3_17e"].values + data["mdsupdrs3_18"].values
mds_updrs["rigidity"] = data["mdsupdrs3_3a"].values + data["mdsupdrs3_3b"].values + data["mdsupdrs3_3c"].values + data["mdsupdrs3_3d"].values + data["mdsupdrs3_3e"].values
mds_updrs["dyskinesias"] = data["mdsupdrs4_1"].values + data["mdsupdrs4_2"].values
mds_updrs["fluctuations"] = data["mdsupdrs4_3"].values + data["mdsupdrs4_4"].values + data["mdsupdrs4_5"].values + data["mdsupdrs4_6"].values
mds_updrs["bradykinesia"] = data["mdsupdrs3_4a"].values + data["mdsupdrs3_4b"].values + data["mdsupdrs3_5a"].values + data["mdsupdrs3_5b"].values + data["mdsupdrs3_6a"].values + data["mdsupdrs3_6b"].values + data["mdsupdrs3_7a"].values + data["mdsupdrs3_7b"].values + data["mdsupdrs3_8a"].values + data["mdsupdrs3_8b"].values + data["mdsupdrs3_14"].values
mds_updrs["axial_no_pigd"] = data["mdsupdrs2_1"].values + data["mdsupdrs2_2"].values + data["mdsupdrs2_3"].values + data["mdsupdrs3_1"].values + data["mdsupdrs3_2"].values + data["mdsupdrs3_9"].values + data["mdsupdrs3_13"].values
mds_updrs["pigd"] = data["mdsupdrs2_12"].values + data["mdsupdrs2_13"].values + data["mdsupdrs3_10"].values + data["mdsupdrs3_11"].values + data["mdsupdrs3_12"].values

mds_updrs.head()

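| | tremor | rigidity | dyskinesias | fluctuations | bradykinesia | axial_no_pigd | pigd |
|---|---|---|---|---|---|---|---|
| 0 | 2.0 | 8.0 | 2.0 | 3.0 | 22.0 | 18.0 | 4.0 |
| 1 | 3.0 | 10.0 | 1.0 | 1.0 | 6.0 | 8.0 | 3.0 |
| 2 | 3.0 | 6.0 | 1.0 | 0.0 | 5.0 | 7.0 | 3.0 |
| 3 | 5.0 | 7.0 | 0.0 | 6.0 | 10.0 | 13.0 | 3.0 |
| 4 | 11.0 | 2.0 | 1.0 | 4.0 | 8.0 | 8.0 | 6.0 |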


As in the MDS-NMS case, we can normalize the MDS-UPDRS variables to the [0, 1] range because we know their maximum values: each item is scored from 0 to 4, so the maximum of a group is 4 times its number of items. This allows us to compare variables and clusters more effectively.

In [16]:
mds_updrs["tremor"] = mds_updrs["tremor"] / (4 * 11)
mds_updrs["rigidity"] = mds_updrs["rigidity"] / (4 * 5)
mds_updrs["dyskinesias"] = mds_updrs["dyskinesias"] / (4 * 2)
mds_updrs["fluctuations"] = mds_updrs["fluctuations"] / (4 * 4)
mds_updrs["bradykinesia"] = mds_updrs["bradykinesia"] / (4 * 11)
mds_updrs["axial_no_pigd"] = mds_updrs["axial_no_pigd"] / (4 * 7)
mds_updrs["pigd"] = mds_updrs["pigd"] / (4 * 5)
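
The per-column divisions can also be expressed as one vectorized operation. The sketch below uses made-up values for three groups and writes the result to a new frame, so that re-running the cell does not divide the same columns twice.

```python
import pandas as pd

# Made-up raw group sums for two patients
toy = pd.DataFrame({
    "tremor": [2.0, 11.0],
    "rigidity": [8.0, 20.0],
    "dyskinesias": [2.0, 0.0],
})

# Number of items per group; each item is scored from 0 to 4
n_items = pd.Series({"tremor": 11, "rigidity": 5, "dyskinesias": 2})
max_scores = 4 * n_items

# div() aligns the Series with the column labels, so every column
# is divided by its own maximum
normalized = toy.div(max_scores, axis="columns")
print(normalized)
```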

In [17]:
mds_updrs.max()

tremor           0.568182
rigidity         1.000000
dyskinesias      1.000000
fluctuations     0.812500
bradykinesia     0.886364
axial_no_pigd    0.857143
pigd             0.850000
dtype: float64

### 4.3 - Socio-demographic

In [18]:
socio_demographic = pd.DataFrame(columns=original_socio_demographic_cols)
socio_demographic["patnum"] = data["patnum"].values
socio_demographic["age"] = data["age"].values
socio_demographic["sex"] = data["sex"].values
socio_demographic["pdonset"] = data["pdonset"].values
socio_demographic["durat_pd"] = data["durat_pd"].values

# Change sex codes from (0,1) to (male, female)
socio_demographic["sex"] = socio_demographic["sex"].astype("category")
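# rename_categories replaces direct assignment to .cat.categories, which newer pandas versions no longer allow
socio_demographic["sex"] = socio_demographic["sex"].cat.rename_categories(["male", "female"])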

socio_demographic.shape

(402, 5)
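
Renaming categories by position depends on the order in which pandas sorts the codes. An explicit mapping makes the coding visible. This sketch assumes integer codes with 0 = male and 1 = female, as the comment in the cell above states; if the column were read as strings, the keys would have to be `"0"` and `"1"`.

```python
import pandas as pd

# Made-up codes for four patients
sex_codes = pd.Series([0, 1, 1, 0], name="sex")

# Explicit mapping: assumes 0 = male and 1 = female
sex_labels = sex_codes.map({0: "male", 1: "female"}).astype("category")
print(sex_labels.value_counts())
```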

### 4.4 - Hoehn-Yahr

In [19]:
# Build from .values so the index matches the other final data frames, even when data_no_missing is used
hoehn_yahr = pd.DataFrame({"hy": data["hy"].values})
hoehn_yahr.shape

(402, 1)

## 5 - Combine and generate the final data frame

In [20]:
final_df = pd.concat([socio_demographic, hoehn_yahr, mds_nms, mds_updrs], axis = 1)
final_df.to_csv("../data/data_numerical.csv", index=False, na_rep="?")
final_df.shape

(402, 30)
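
The notebook itself writes a CSV file. If the ARFF file is to be produced from the same data frame, a conversion along the following lines could be used. This is a minimal sketch under stated assumptions, not the project's actual converter: the function `to_arff` and the relation name are hypothetical, only numeric and nominal attributes are handled, and attribute names are assumed to need no quoting.

```python
import pandas as pd


def to_arff(df: pd.DataFrame, relation: str) -> str:
    """Render a data frame as ARFF text (numeric and nominal attributes only)."""
    lines = [f"@relation {relation}", ""]
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            lines.append(f"@attribute {col} numeric")
        else:
            # Nominal attribute: list the observed values, ignoring missing ones
            values = sorted(df[col].dropna().astype(str).unique())
            nominal = ",".join(values)
            lines.append(f"@attribute {col} {{{nominal}}}")
    lines += ["", "@data"]
    for row in df.itertuples(index=False):
        # ARFF marks missing values with "?"
        lines.append(",".join("?" if pd.isna(v) else str(v) for v in row))
    return "\n".join(lines) + "\n"


# Made-up example with one nominal and one numeric attribute
toy = pd.DataFrame({
    "sex": pd.Categorical(["male", "female", None]),
    "tremor": [0.25, None, 0.5],
})
print(to_arff(toy, "pd_data"))
```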