# Partition analysis

Our greedy latent structure learner (GLSL) method has produced a model with 9 latent variables. Each latent variable groups a series of motor/non-motor symptoms. In addition, each latent variable follows a categorical distribution that can be analyzed from a clustering perspective, where a state represents a cluster.

#### Purpose

In this notebook we analyze the age (`age`), sex (`sex`), age at PD onset (`pdonset`) and PD duration (`durat_pd`) of each cluster in each partition to check whether the clusters differ significantly.

#### Notes

To make the resulting plots easier to read, we rename the states of each latent variable (we do the same in the XDSL models) and order them by the intensity of their symptoms. After renaming, we can observe that a patient may belong to a "low-intensity" cluster in some partitions and to a "high-intensity" cluster in others.

#### Hypothesis tests
* In the case of two clusters, we apply a Mann-Whitney U test.
* In the case of three or more clusters, we first apply a Kruskal-Wallis test and, if it is significant, follow it with a Tukey HSD post-hoc analysis.
* For sex, which is categorical, we apply a chi-squared test of independence.

We use a significance level of 0.05: the difference between clusters is considered significant when the p-value is below 0.05.
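The selection rule for the numeric variables can be sketched as a small helper. This is only an illustration: `compare_clusters` and the toy frame are hypothetical names and synthetic data, not part of the notebook, and `alternative="two-sided"` is an assumption about the intended hypothesis.

```python
import numpy as np
import pandas as pd
from scipy.stats import kruskal, mannwhitneyu

ALPHA = 0.05  # significance level used throughout the notebook


def compare_clusters(df, partition, column):
    """Return (test_name, p_value) for `column` across the clusters of `partition`."""
    # One array of values per cluster; missing values are dropped.
    groups = [g[column].dropna().to_numpy() for _, g in df.groupby(partition, observed=True)]
    if len(groups) == 2:
        # Assumption: a two-sided test is intended, so it is requested explicitly.
        return "mann-whitney", mannwhitneyu(*groups, alternative="two-sided").pvalue
    return "kruskal-wallis", kruskal(*groups).pvalue


# Synthetic stand-in data (not the study data): two clusters with shifted ages.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": np.concatenate([rng.normal(60, 5, 50), rng.normal(70, 5, 50)]),
    "A": ["A1"] * 50 + ["A2"] * 50,
})
test_name, p_value = compare_clusters(toy, "A", "age")
print(test_name, p_value < ALPHA)
```

The helper only picks the omnibus test; a post-hoc step for three or more clusters would still be run separately.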

-----

#### Load data

In [2]:
from scipy.io import arff
import pandas as pd
import numpy as np

# Load original data with socio-demographic data, the patient number and the Hoehn Yahr scale
# 6 attributes
original_data = arff.loadarff("../../data/mds_parkinson/mds_parkinson_info.arff")
original_data = pd.DataFrame(original_data[0])
print(original_data.shape)

# Load partition data (data with completed partitions)
# This data has 9 extra attributes (one for each latent variable), but doesn't have socio-demographic columns because they weren't
# used during the learning process
# 22 + 9 attributes
partitions_data = arff.loadarff("../../clustering_results/mds_parkinson/mds_parkinson_GLSL_CIL_10.arff")
partitions_data = pd.DataFrame(partitions_data[0])
print(partitions_data.shape)

(402, 6)
(402, 31)


#### Decode object columns to UTF-8

Object columns are loaded as bytes (`b'...'`), so we need to decode them to UTF-8 strings.

In [3]:
# Object data types are in binary form, we need to pass them to utf-8
object_columns = original_data.select_dtypes("object").columns
original_data[object_columns] = original_data[object_columns].stack().str.decode('utf-8').unstack()
#original_data.head()

In [4]:
# Object data types are in binary form, we need to pass them to utf-8
object_columns = partitions_data.select_dtypes("object").columns
partitions_data[object_columns] = partitions_data[object_columns].stack().str.decode('utf-8').unstack()
#partitions_data.head()
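The two decoding cells above could be folded into one reusable function. The sketch below is an alternative to the `stack()`/`unstack()` idiom, not the notebook's own code; `decode_bytes_columns` is a hypothetical name and the toy frame only mimics what `arff.loadarff` returns for nominal attributes.

```python
import pandas as pd


def decode_bytes_columns(df, encoding="utf-8"):
    """Return a copy of `df` with every bytes value in object columns decoded to str."""
    out = df.copy()
    for col in out.select_dtypes("object").columns:
        # Only bytes values are decoded; anything else (str, NaN) is left untouched.
        out[col] = out[col].map(lambda v: v.decode(encoding) if isinstance(v, bytes) else v)
    return out


# Toy frame with one bytes column and one numeric column.
raw = pd.DataFrame({"sex": [b"male", b"female"], "age": [71.0, 68.0]})
clean = decode_bytes_columns(raw)
print(clean["sex"].tolist())
```

Working column by column leaves the input frame unchanged and keeps missing values in place.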

#### Subset data for analysis

In [5]:
analysis_columns = ["sex", "age", "pdonset", "durat_pd", "hy"]
partition_columns = ["A", "B", "C", "D", "E", "F", "G", "H", "I"]

# Socio-demographic columns plus one column per latent variable
# (rows are assumed to be in the same order in both files)
data = original_data[analysis_columns].copy()
for partition in partition_columns:
    data[partition] = partitions_data[partition]

# Sex as categorical data
data["sex"] = data["sex"].astype("category")

print(data.shape)

(402, 14)
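Copying columns from one frame into another relies on both frames sharing the same row index, because pandas aligns on the index during assignment. A minimal sketch of a sanity check one might run before combining them, using toy frames with hypothetical values:

```python
import pandas as pd

# Toy stand-ins for the socio-demographic frame and the partition frame.
original = pd.DataFrame({"age": [71.0, 68.0, 81.0]})
partitions = pd.DataFrame({"A": ["0", "1", "1"]})

# Column assignment aligns on the index, so both frames must share it.
assert len(original) == len(partitions)
assert original.index.equals(partitions.index)

combined = pd.concat([original, partitions[["A"]]], axis=1)
print(combined.shape)
```

The check does not prove that row *i* refers to the same patient in both files; that still depends on how the files were produced.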


In [6]:
data["age"].head()

0    71.0
1    68.0
2    81.0
3    61.0
4    63.0
Name: age, dtype: float64

----

## 1 - Clustering A

* **Attributes:** sweating

#### 1.1 - Prepare data for analysis

In [7]:
# First, we make this variable categorical so Pandas can order its states in the plot. Then we rename its categories accordingly (and reorder them if necessary)
data["A"] = data["A"].astype("category")
data["A"] = data["A"].cat.rename_categories({"0":"A2", "1": "A1"})
data["A"] = data["A"].cat.reorder_categories(['A1', 'A2'])
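The rename-then-reorder step can be seen in isolation on a toy series. The mapping below is only an example; which raw state corresponds to the lower-intensity cluster has to be read from the learned model.

```python
import pandas as pd

# Toy state labels as they appear in the model output: "0" and "1".
states = pd.Series(["0", "1", "1", "0"]).astype("category")

# Map each raw state to a readable label, then fix the display order explicitly.
states = states.cat.rename_categories({"0": "A2", "1": "A1"})
states = states.cat.reorder_categories(["A1", "A2"])

print(list(states.cat.categories))
print(states.tolist())
```

Renaming changes only the labels, while reordering changes only the order in which groupby tables and plots list the clusters.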

#### 1.2 - Age

##### Table

In [8]:
partition = "A"
columns_1 = ["age", partition]

print(data[columns_1].groupby([partition]).mean().round(2))
print(data[columns_1].groupby([partition]).std().round(2))

      age
A        
A1  67.94
A2  64.91
      age
A        
A1   9.79
A2  10.44
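The mean and standard deviation tables can also be produced as a single table with `agg`. A minimal sketch on made-up values (not the study data):

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [70.0, 66.0, 68.0, 60.0, 64.0, 62.0],
    "A": pd.Categorical(["A1", "A1", "A1", "A2", "A2", "A2"], categories=["A1", "A2"]),
})

# One row per cluster, one column per statistic.
# `std` is the sample standard deviation (ddof=1), the pandas default.
summary = toy.groupby("A", observed=True)["age"].agg(["mean", "std"]).round(2)
print(summary)
```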


##### Hypothesis test

In [9]:
from scipy.stats import mannwhitneyu as mw

cluster_1_data = data.loc[data["A"] == "A1", "age"]
cluster_2_data = data.loc[data["A"] == "A2", "age"]

mw(cluster_1_data, cluster_2_data).pvalue

0.00603903813524118

#### 1.3 - PD Onset

##### Table

In [10]:
partition = "A"
columns_1 = ["pdonset", partition]

print(data[columns_1].groupby([partition]).mean().round(2))
print(data[columns_1].groupby([partition]).std().round(2))

    pdonset
A          
A1    59.96
A2    55.68
    pdonset
A          
A1    10.44
A2    11.14


##### Hypothesis test

In [11]:
from scipy.stats import mannwhitneyu as mw

cluster_1_data = data.loc[data["A"] == "A1", "pdonset"]
cluster_2_data = data.loc[data["A"] == "A2", "pdonset"]

mw(cluster_1_data, cluster_2_data).pvalue

0.0007425953013215071

#### 1.4 - PD duration

##### Table

In [12]:
partition = "A"
columns_1 = ["durat_pd", partition]

print(data[columns_1].groupby([partition]).mean().round(2))
print(data[columns_1].groupby([partition]).std().round(2))

    durat_pd
A           
A1      7.98
A2      9.23
    durat_pd
A           
A1      5.90
A2      6.02


##### Hypothesis test

In [13]:
from scipy.stats import mannwhitneyu as mw

cluster_1_data = data.loc[data["A"] == "A1", "durat_pd"]
cluster_2_data = data.loc[data["A"] == "A2", "durat_pd"]

mw(cluster_1_data, cluster_2_data).pvalue

0.039907573198545966

#### 1.5 - Sex

##### Table

In [14]:
import numpy as np

partition = "A"
columns_1 = ["sex", partition]

cluster_1_total = data[columns_1].groupby([partition]).count().iloc[0,0]
cluster_2_total = data[columns_1].groupby([partition]).count().iloc[1,0]
total = np.array([cluster_1_total, cluster_1_total, cluster_2_total, cluster_2_total])

# Percentage
(data[columns_1].groupby([partition]).sex.value_counts()/total * 100).round(2) 
# Counts
#data[columns_1].groupby([partition]).sex.value_counts()

A   sex   
A1  male      64.26
    female    35.74
A2  male      52.17
    female    47.83
Name: sex, dtype: float64
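The same percentages can be obtained without building the `total` array by hand, using a row-normalised cross-tabulation. A sketch on toy data:

```python
import pandas as pd

toy = pd.DataFrame({
    "A": ["A1", "A1", "A1", "A1", "A2", "A2"],
    "sex": ["male", "male", "male", "female", "male", "female"],
})

# normalize="index" divides each row by its own total, so every cluster sums to 100 %.
percent = (pd.crosstab(toy["A"], toy["sex"], normalize="index") * 100).round(2)
print(percent)
```

This avoids relying on the order in which `value_counts()` happens to list the sexes within each cluster.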

##### Hypothesis test

Note: see https://stats.stackexchange.com/questions/110718/chi-squared-test-with-scipy-whats-the-difference-between-chi2-contingency-and

In [15]:
from scipy.stats import chi2_contingency 

cluster_1_data = data.loc[data["A"] == "A1", "sex"].value_counts()
cluster_2_data = data.loc[data["A"] == "A2", "sex"].value_counts()

# Index by label so both rows list the counts in the same (male, female) order
cluster_1_data_array = [cluster_1_data["male"], cluster_1_data["female"]]
cluster_2_data_array = [cluster_2_data["male"], cluster_2_data["female"]]
contingency_table = [cluster_1_data_array, cluster_2_data_array]

stat, p, dof, expected = chi2_contingency(contingency_table)
p

0.08036122361865285
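The contingency table can also be built directly with `pd.crosstab`, which keeps the columns aligned by label. The sketch below uses toy counts; it also shows that for a 2x2 table `chi2_contingency` applies Yates' continuity correction by default, which can be switched off with `correction=False`.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data: 60 patients in A1 (40 male, 20 female), 40 in A2 (20 male, 20 female).
toy = pd.DataFrame({
    "A": ["A1"] * 60 + ["A2"] * 40,
    "sex": ["male"] * 40 + ["female"] * 20 + ["male"] * 20 + ["female"] * 20,
})

table = pd.crosstab(toy["A"], toy["sex"])  # rows: clusters, columns: sexes

# Default call: Yates' correction is applied because the table has 1 degree of freedom.
stat, p, dof, expected = chi2_contingency(table)
# Same test without the continuity correction.
stat_raw, p_raw, _, _ = chi2_contingency(table, correction=False)

print(table)
print(dof, round(p, 4), round(p_raw, 4))
```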

----

## 2 - Clustering B

* **Attributes:** speech, fatigue, freezing, motor fluctuations, gait, postural instability

#### 2.1 - Prepare data for analysis

In [16]:
# First, we make this variable categorical so Pandas can order its states in the plot. Then we rename its categories accordingly (and reorder them if necessary)
data["B"] = data["B"].astype("category")
data["B"] = data["B"].cat.rename_categories({"0":"B3", "1": "B1", "2": "B2"})
data["B"] = data["B"].cat.reorder_categories(['B1', 'B2', 'B3'])

#### 2.3 - Age

##### Table

In [17]:
partition = "B"
columns_1 = ["age", partition]

print(data[columns_1].groupby([partition]).mean().round(2))
print(data[columns_1].groupby([partition]).std().round(2))

      age
B        
B1  67.05
B2  68.28
B3  64.09
      age
B        
B1  10.30
B2   9.73
B3   8.93


##### Hypothesis test

In [18]:
from scipy.stats import kruskal

cluster_1_data = data.loc[data["B"] == "B1", "age"]
cluster_2_data = data.loc[data["B"] == "B2", "age"]
cluster_3_data = data.loc[data["B"] == "B3", "age"]

kruskal(cluster_1_data, cluster_2_data, cluster_3_data).pvalue

0.07796802519079396

#### 2.4 - PD onset

##### Table

In [19]:
partition = "B"
columns_1 = ["pdonset", partition]

print(data[columns_1].groupby([partition]).mean().round(2))
print(data[columns_1].groupby([partition]).std().round(2))

    pdonset
B          
B1    60.37
B2    58.91
B3    55.00
    pdonset
B          
B1    11.01
B2    10.61
B3     7.96


##### Hypothesis test

In [20]:
from scipy.stats import kruskal

cluster_1_data = data.loc[data["B"] == "B1", "pdonset"]
cluster_2_data = data.loc[data["B"] == "B2", "pdonset"]
cluster_3_data = data.loc[data["B"] == "B3", "pdonset"]

kruskal(cluster_1_data, cluster_2_data, cluster_3_data).pvalue

0.01812264697353745

In [21]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

print(pairwise_tukeyhsd(endog=data["pdonset"], groups=data["B"]))

 Multiple Comparison of Means - Tukey HSD, FWER=0.05 
group1 group2 meandiff p-adj   lower    upper  reject
-----------------------------------------------------
    B1     B2   -1.458 0.3868  -4.0585  1.1426  False
    B1     B3  -5.3721  0.024 -10.1753 -0.5689   True
    B2     B3  -3.9141 0.1296  -8.6676  0.8393  False
-----------------------------------------------------


#### 2.5 - PD duration

##### Table

In [22]:
partition = "B"
columns_1 = ["durat_pd", partition]

print(data[columns_1].groupby([partition]).mean().round(2))
print(data[columns_1].groupby([partition]).std().round(2))

    durat_pd
B           
B1      6.68
B2      9.37
B3      9.09
    durat_pd
B           
B1      5.07
B2      6.33
B3      6.02


##### Hypothesis test

In [23]:
from scipy.stats import kruskal

cluster_1_data = data.loc[data["B"] == "B1", "durat_pd"]
cluster_2_data = data.loc[data["B"] == "B2", "durat_pd"]
cluster_3_data = data.loc[data["B"] == "B3", "durat_pd"]

kruskal(cluster_1_data, cluster_2_data, cluster_3_data).pvalue

8.03994061176031e-05

Since the Kruskal-Wallis test returns a p-value < 0.05, we can now apply the Tukey HSD post-hoc test.

In [24]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

print(pairwise_tukeyhsd(endog=data["durat_pd"], groups=data["B"]))

Multiple Comparison of Means - Tukey HSD, FWER=0.05
group1 group2 meandiff p-adj  lower  upper  reject
--------------------------------------------------
    B1     B2   2.6885 0.001  1.2673 4.1096   True
    B1     B3   2.4135 0.079 -0.2115 5.0385  False
    B2     B3  -0.2749   0.9 -2.8727 2.3229  False
--------------------------------------------------
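The "omnibus test first, post-hoc only if significant" sequence can be sketched as follows. This illustration swaps in `scipy.stats.tukey_hsd` (available from SciPy 1.8) for the statsmodels function used in the notebook, and runs on synthetic data. Note that Tukey HSD compares group means, whereas Kruskal-Wallis is rank-based.

```python
import numpy as np
from scipy.stats import kruskal, tukey_hsd

ALPHA = 0.05
rng = np.random.default_rng(2)
# Synthetic disease durations for three hypothetical clusters.
groups = {
    "B1": rng.normal(6.5, 2.0, 80),
    "B2": rng.normal(9.5, 2.0, 80),
    "B3": rng.normal(9.0, 2.0, 80),
}

omnibus_p = kruskal(*groups.values()).pvalue
pairwise_p = None
if omnibus_p < ALPHA:
    # Pairwise comparisons are run only when the omnibus test rejects.
    pairwise_p = tukey_hsd(*groups.values()).pvalue  # (3, 3) matrix of adjusted p-values
    names = list(groups)
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            print(names[i], names[j], round(float(pairwise_p[i, j]), 4))
```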


#### 2.6 - Sex

##### Table

In [25]:
import numpy as np

partition = "B"
columns_1 = ["sex", partition]

cluster_1_total = data[columns_1].groupby([partition]).count().iloc[0,0]
cluster_2_total = data[columns_1].groupby([partition]).count().iloc[1,0]
cluster_3_total = data[columns_1].groupby([partition]).count().iloc[2,0]
total = np.array([cluster_1_total, cluster_1_total, cluster_2_total, cluster_2_total, cluster_3_total, cluster_3_total])

# Percentage
(data[columns_1].groupby([partition]).sex.value_counts()/total * 100).round(2) 
# Counts
#data[columns_1].groupby([partition]).sex.value_counts()

B   sex   
B1  male      64.53
    female    35.47
B2  male      62.12
    female    37.88
B3  female    50.00
    male      50.00
Name: sex, dtype: float64

##### Hypothesis test

Note: see https://stats.stackexchange.com/questions/110718/chi-squared-test-with-scipy-whats-the-difference-between-chi2-contingency-and

In [26]:
from scipy.stats import chi2_contingency 

cluster_1_data = data.loc[data["B"] == "B1", "sex"].value_counts()
cluster_2_data = data.loc[data["B"] == "B2", "sex"].value_counts()
cluster_3_data = data.loc[data["B"] == "B3", "sex"].value_counts()

# Index by label so all rows list the counts in the same (male, female) order
cluster_1_data_array = [cluster_1_data["male"], cluster_1_data["female"]]
cluster_2_data_array = [cluster_2_data["male"], cluster_2_data["female"]]
cluster_3_data_array = [cluster_3_data["male"], cluster_3_data["female"]]
contingency_table = [cluster_1_data_array, cluster_2_data_array, cluster_3_data_array]

stat, p, dof, expected = chi2_contingency(contingency_table)
p

0.297481820258858

----

## 3 - Clustering C

* **Attributes:** smell, rigidity

#### 3.1 - Prepare data for analysis

In [27]:
# First, we make this variable categorical so Pandas can order its states in the plot. Then we rename its categories accordingly (and reorder them if necessary)
data["C"] = data["C"].astype("category")
data["C"] = data["C"].cat.rename_categories({"0":"C1", "1": "C2"})
data["C"] = data["C"].cat.reorder_categories(['C1', 'C2'])

#### 3.3 - Age

##### Table

In [28]:
partition = "C"
columns_1 = ["age", partition]

print(data[columns_1].groupby([partition]).mean().round(2))
print(data[columns_1].groupby([partition]).std().round(2))

      age
C        
C1  67.97
C2  66.97
      age
C        
C1  10.22
C2   9.74


##### Hypothesis test
Since partition C has two clusters, we apply a Mann-Whitney U test.

In [29]:
from scipy.stats import mannwhitneyu as mw

cluster_1_data = data.loc[data["C"] == "C1", "age"]
cluster_2_data = data.loc[data["C"] == "C2", "age"]

mw(cluster_1_data, cluster_2_data).pvalue

0.058665838685665

#### 3.4 - PD onset

##### Table

In [30]:
partition = "C"
columns_1 = ["pdonset", partition]

print(data[columns_1].groupby([partition]).mean().round(2))
print(data[columns_1].groupby([partition]).std().round(2))

    pdonset
C          
C1    60.01
C2    58.58
    pdonset
C          
C1     10.5
C2     10.8


##### Hypothesis test

In [31]:
from scipy.stats import mannwhitneyu as mw

cluster_1_data = data.loc[data["C"] == "C1", "pdonset"]
cluster_2_data = data.loc[data["C"] == "C2", "pdonset"]

mw(cluster_1_data, cluster_2_data).pvalue

0.04747256770006673

#### 3.5 - PD duration

##### Table

In [32]:
partition = "C"
columns_1 = ["durat_pd", partition]

print(data[columns_1].groupby([partition]).mean().round(2))
print(data[columns_1].groupby([partition]).std().round(2))

    durat_pd
C           
C1      7.96
C2      8.39
    durat_pd
C           
C1      6.16
C2      5.74


##### Hypothesis test

In [33]:
from scipy.stats import mannwhitneyu as mw

cluster_1_data = data.loc[data["C"] == "C1", "durat_pd"]
cluster_2_data = data.loc[data["C"] == "C2", "durat_pd"]

mw(cluster_1_data, cluster_2_data).pvalue

0.13979979106834056

#### 3.6 - Sex

##### Table

In [34]:
import numpy as np

partition = "C"
columns_1 = ["sex", partition]

cluster_1_total = data[columns_1].groupby([partition]).count().iloc[0,0]
cluster_2_total = data[columns_1].groupby([partition]).count().iloc[1,0]
total = np.array([cluster_1_total, cluster_1_total, cluster_2_total, cluster_2_total])

# Percentage
(data[columns_1].groupby([partition]).sex.value_counts()/total * 100).round(2) 
# Counts
#data[columns_1].groupby([partition]).sex.value_counts()

C   sex   
C1  male      63.74
    female    36.26
C2  male      60.91
    female    39.09
Name: sex, dtype: float64

##### Hypothesis test
Note: see https://stats.stackexchange.com/questions/110718/chi-squared-test-with-scipy-whats-the-difference-between-chi2-contingency-and

In [35]:
from scipy.stats import chi2_contingency 

cluster_1_data = data.loc[data["C"] == "C1", "sex"].value_counts()
cluster_2_data = data.loc[data["C"] == "C2", "sex"].value_counts()

# Index by label so both rows list the counts in the same (male, female) order
cluster_1_data_array = [cluster_1_data["male"], cluster_1_data["female"]]
cluster_2_data_array = [cluster_2_data["male"], cluster_2_data["female"]]
contingency_table = [cluster_1_data_array, cluster_2_data_array]

stat, p, dof, expected = chi2_contingency(contingency_table)
p

0.6322614975902524

----
## 4 - Clustering D

* **Attributes:** bradykinesia, weight change

#### 4.1 - Prepare data for analysis

In [36]:
# First, we make this variable categorical so Pandas can order its states in the plot. Then we rename its categories accordingly (and reorder them if necessary)
data["D"] = data["D"].astype("category")
data["D"] = data["D"].cat.rename_categories({"0":"D1", "1": "D2"})
data["D"] = data["D"].cat.reorder_categories(['D1', 'D2'])

#### 4.3 - Age

##### Table

In [37]:
partition = "D"
columns_1 = ["age", partition]

print(data[columns_1].groupby([partition]).mean().round(2))
print(data[columns_1].groupby([partition]).std().round(2))

      age
D        
D1  67.21
D2  68.74
      age
D        
D1  10.09
D2   9.07


##### Hypothesis test

In [38]:
from scipy.stats import mannwhitneyu as mw

cluster_1_data = data.loc[data["D"] == "D1", "age"]
cluster_2_data = data.loc[data["D"] == "D2", "age"]

mw(cluster_1_data, cluster_2_data).pvalue

0.14163393709267103

#### 4.4 - PD onset

##### Table

In [39]:
partition = "D"
columns_1 = ["pdonset", partition]

print(data[columns_1].groupby([partition]).mean().round(2))
print(data[columns_1].groupby([partition]).std().round(2))

    pdonset
D          
D1    59.18
D2    59.51
    pdonset
D          
D1    10.60
D2    11.22


##### Hypothesis test

In [40]:
from scipy.stats import mannwhitneyu as mw

cluster_1_data = data.loc[data["D"] == "D1", "pdonset"]
cluster_2_data = data.loc[data["D"] == "D2", "pdonset"]

mw(cluster_1_data, cluster_2_data).pvalue

0.4575971094553487

#### 4.5 - PD duration

##### Table

In [41]:
partition = "D"
columns_1 = ["durat_pd", partition]

print(data[columns_1].groupby([partition]).mean().round(2))
print(data[columns_1].groupby([partition]).std().round(2))

    durat_pd
D           
D1      8.03
D2      9.23
    durat_pd
D           
D1      5.86
D2      6.28


##### Hypothesis test

In [42]:
from scipy.stats import mannwhitneyu as mw

cluster_1_data = data.loc[data["D"] == "D1", "durat_pd"]
cluster_2_data = data.loc[data["D"] == "D2", "durat_pd"]

mw(cluster_1_data, cluster_2_data).pvalue

0.07895421525658242

#### 4.6 - Sex

##### Table

In [43]:
import numpy as np

partition = "D"
columns_1 = ["sex", partition]

cluster_1_total = data[columns_1].groupby([partition]).count().iloc[0,0]
cluster_2_total = data[columns_1].groupby([partition]).count().iloc[1,0]
total = np.array([cluster_1_total, cluster_1_total, cluster_2_total, cluster_2_total])

# Percentage
(data[columns_1].groupby([partition]).sex.value_counts()/total * 100).round(2) 
# Counts
#data[columns_1].groupby([partition]).sex.value_counts()

D   sex   
D1  male      63.48
    female    36.52
D2  male      54.39
    female    45.61
Name: sex, dtype: float64

##### Hypothesis test
Note: see https://stats.stackexchange.com/questions/110718/chi-squared-test-with-scipy-whats-the-difference-between-chi2-contingency-and

In [44]:
from scipy.stats import chi2_contingency 

cluster_1_data = data.loc[data["D"] == "D1", "sex"].value_counts()
cluster_2_data = data.loc[data["D"] == "D2", "sex"].value_counts()

# Index by label so both rows list the counts in the same (male, female) order
cluster_1_data_array = [cluster_1_data["male"], cluster_1_data["female"]]
cluster_2_data_array = [cluster_2_data["male"], cluster_2_data["female"]]
contingency_table = [cluster_1_data_array, cluster_2_data_array]

stat, p, dof, expected = chi2_contingency(contingency_table)
p

0.24442596763326768

----

## 5 - Clustering E

* **Attributes:** gastrointestinal

#### 5.1 - Prepare data for analysis

In [45]:
# First, we make this variable categorical so Pandas can order its states in the plot. Then we rename its categories accordingly (and reorder them if necessary)
data["E"] = data["E"].astype("category")
data["E"] = data["E"].cat.rename_categories({"0":"E2", "1": "E1"})
data["E"] = data["E"].cat.reorder_categories(['E1', 'E2'])

#### 5.3 - Age

##### Table

In [46]:
partition = "E"
columns_1 = ["age", partition]

print(data[columns_1].groupby([partition]).mean().round(2))
print(data[columns_1].groupby([partition]).std().round(2))

      age
E        
E1  66.05
E2  69.22
      age
E        
E1  10.44
E2   9.01


##### Hypothesis test

In [47]:
from scipy.stats import mannwhitneyu as mw

cluster_1_data = data.loc[data["E"] == "E1", "age"]
cluster_2_data = data.loc[data["E"] == "E2", "age"]

mw(cluster_1_data, cluster_2_data).pvalue

0.0013151106676227002

#### 5.4 - PD onset

##### Table

In [48]:
partition = "E"
columns_1 = ["pdonset", partition]

print(data[columns_1].groupby([partition]).mean().round(2))
print(data[columns_1].groupby([partition]).std().round(2))

    pdonset
E          
E1    58.92
E2    59.63
    pdonset
E          
E1    11.02
E2    10.22


##### Hypothesis test

In [49]:
from scipy.stats import mannwhitneyu as mw

cluster_1_data = data.loc[data["E"] == "E1", "pdonset"]
cluster_2_data = data.loc[data["E"] == "E2", "pdonset"]

mw(cluster_1_data, cluster_2_data).pvalue

0.2619010875034382

#### 5.5 - PD duration

##### Table

In [50]:
partition = "E"
columns_1 = ["durat_pd", partition]

print(data[columns_1].groupby([partition]).mean().round(2))
print(data[columns_1].groupby([partition]).std().round(2))

    durat_pd
E           
E1      7.13
E2      9.60
    durat_pd
E           
E1      5.65
E2      6.01


##### Hypothesis test

In [51]:
from scipy.stats import mannwhitneyu as mw

cluster_1_data = data.loc[data["E"] == "E1", "durat_pd"]
cluster_2_data = data.loc[data["E"] == "E2", "durat_pd"]

mw(cluster_1_data, cluster_2_data).pvalue

3.4991394008328035e-06

#### 5.6 - Sex

##### Table

In [52]:
import numpy as np

partition = "E"
columns_1 = ["sex", partition]

cluster_1_total = data[columns_1].groupby([partition]).count().iloc[0,0]
cluster_2_total = data[columns_1].groupby([partition]).count().iloc[1,0]
total = np.array([cluster_1_total, cluster_1_total, cluster_2_total, cluster_2_total])

# Percentage
(data[columns_1].groupby([partition]).sex.value_counts()/total * 100).round(2) 
# Counts
#data[columns_1].groupby([partition]).sex.value_counts()

E   sex   
E1  male      59.65
    female    40.35
E2  male      65.52
    female    34.48
Name: sex, dtype: float64

##### Hypothesis test

Note: see https://stats.stackexchange.com/questions/110718/chi-squared-test-with-scipy-whats-the-difference-between-chi2-contingency-and

In [53]:
from scipy.stats import chi2_contingency 

cluster_1_data = data.loc[data["E"] == "E1", "sex"].value_counts()
cluster_2_data = data.loc[data["E"] == "E2", "sex"].value_counts()

# Index by label so both rows list the counts in the same (male, female) order
cluster_1_data_array = [cluster_1_data["male"], cluster_1_data["female"]]
cluster_2_data_array = [cluster_2_data["male"], cluster_2_data["female"]]
contingency_table = [cluster_1_data_array, cluster_2_data_array]

stat, p, dof, expected = chi2_contingency(contingency_table)
p

0.27204515565701215

----

## 6 - Clustering F

* **Attributes:** hallucinations

#### 6.1 - Prepare data for analysis

In [54]:
# First, we make this variable categorical so Pandas can order its states in the plot. Then we rename its categories accordingly (and reorder them if necessary)
data["F"] = data["F"].astype("category")
data["F"] = data["F"].cat.rename_categories({"0":"F2", "1": "F1"})
data["F"] = data["F"].cat.reorder_categories(['F1', 'F2'])

#### 6.3 - Age

##### Table

In [55]:
partition = "F"
columns_1 = ["age", partition]

print(data[columns_1].groupby([partition]).mean().round(2))
print(data[columns_1].groupby([partition]).std().round(2))

      age
F        
F1  67.10
F2  68.52
      age
F        
F1  10.27
F2   8.78


##### Hypothesis test

In [56]:
from scipy.stats import mannwhitneyu as mw

cluster_1_data = data.loc[data["F"] == "F1", "age"]
cluster_2_data = data.loc[data["F"] == "F2", "age"]

mw(cluster_1_data, cluster_2_data).pvalue

0.12015941548766618

#### 6.4 - PD onset

##### Table

In [57]:
partition = "F"
columns_1 = ["pdonset", partition]

print(data[columns_1].groupby([partition]).mean().round(2))
print(data[columns_1].groupby([partition]).std().round(2))

    pdonset
F          
F1    59.59
F2    58.00
    pdonset
F          
F1    10.90
F2     9.84


##### Hypothesis test

In [58]:
from scipy.stats import mannwhitneyu as mw

cluster_1_data = data.loc[data["F"] == "F1", "pdonset"]
cluster_2_data = data.loc[data["F"] == "F2", "pdonset"]

mw(cluster_1_data, cluster_2_data).pvalue

0.14595270795352638

#### 6.5 - PD duration

##### Table

In [59]:
partition = "F"
columns_1 = ["durat_pd", partition]

print(data[columns_1].groupby([partition]).mean().round(2))
print(data[columns_1].groupby([partition]).std().round(2))

    durat_pd
F           
F1      7.51
F2     10.52
    durat_pd
F           
F1      5.67
F2      6.22


##### Hypothesis test

In [60]:
from scipy.stats import mannwhitneyu as mw

cluster_1_data = data.loc[data["F"] == "F1", "durat_pd"]
cluster_2_data = data.loc[data["F"] == "F2", "durat_pd"]

mw(cluster_1_data, cluster_2_data).pvalue

6.83754306787582e-06

#### 6.6 - Sex

##### Table

In [61]:
import numpy as np

partition = "F"
columns_1 = ["sex", partition]

cluster_1_total = data[columns_1].groupby([partition]).count().iloc[0,0]
cluster_2_total = data[columns_1].groupby([partition]).count().iloc[1,0]
total = np.array([cluster_1_total, cluster_1_total, cluster_2_total, cluster_2_total])

# Percentage
(data[columns_1].groupby([partition]).sex.value_counts()/total * 100).round(2) 
# Counts
#data[columns_1].groupby([partition]).sex.value_counts()

F   sex   
F1  male      61.29
    female    38.71
F2  male      65.22
    female    34.78
Name: sex, dtype: float64

##### Hypothesis test

Note: see https://stats.stackexchange.com/questions/110718/chi-squared-test-with-scipy-whats-the-difference-between-chi2-contingency-and

In [62]:
from scipy.stats import chi2_contingency 

cluster_1_data = data.loc[data["F"] == "F1", "sex"].value_counts()
cluster_2_data = data.loc[data["F"] == "F2", "sex"].value_counts()

# Index by label so both rows list the counts in the same (male, female) order
cluster_1_data_array = [cluster_1_data["male"], cluster_1_data["female"]]
cluster_2_data_array = [cluster_2_data["male"], cluster_2_data["female"]]
contingency_table = [cluster_1_data_array, cluster_2_data_array]

stat, p, dof, expected = chi2_contingency(contingency_table)
p

0.5756782162103371

----

## 7 - Clustering G

* **Attributes:** dyskinesias, cardiovascular

#### 7.1 - Prepare data for analysis

In [63]:
# First, we make this variable categorical so Pandas can order its states in the plot. Then we rename its categories accordingly (and reorder them if necessary)
data["G"] = data["G"].astype("category")
data["G"] = data["G"].cat.rename_categories({"0":"G2", "1": "G1"})
data["G"] = data["G"].cat.reorder_categories(['G1', 'G2'])

#### 7.3 - Age

##### Table

In [64]:
partition = "G"
columns_1 = ["age", partition]

print(data[columns_1].groupby([partition]).mean().round(2))
print(data[columns_1].groupby([partition]).std().round(2))

      age
G        
G1  67.00
G2  68.13
      age
G        
G1  10.07
G2   9.76


##### Hypothesis test

In [65]:
from scipy.stats import mannwhitneyu as mw

cluster_1_data = data.loc[data["G"] == "G1", "age"]
cluster_2_data = data.loc[data["G"] == "G2", "age"]

mw(cluster_1_data, cluster_2_data).pvalue

0.13835531181409694

#### 7.4 - PD onset

##### Table

In [66]:
partition = "G"
columns_1 = ["pdonset", partition]

print(data[columns_1].groupby([partition]).mean().round(2))
print(data[columns_1].groupby([partition]).std().round(2))

    pdonset
G          
G1     59.3
G2     59.1
    pdonset
G          
G1    11.01
G2    10.12


##### Hypothesis test

In [67]:
from scipy.stats import mannwhitneyu as mw

cluster_1_data = data.loc[data["G"] == "G1", "pdonset"]
cluster_2_data = data.loc[data["G"] == "G2", "pdonset"]

mw(cluster_1_data, cluster_2_data).pvalue

0.4620272295021909

#### 7.5 - PD duration

##### Table

In [68]:
partition = "G"
columns_1 = ["durat_pd", partition]

print(data[columns_1].groupby([partition]).mean().round(2))
print(data[columns_1].groupby([partition]).std().round(2))

    durat_pd
G           
G1      7.70
G2      9.03
    durat_pd
G           
G1      5.59
G2      6.39


##### Hypothesis test

In [69]:
from scipy.stats import mannwhitneyu as mw

cluster_1_data = data.loc[data["G"] == "G1", "durat_pd"]
cluster_2_data = data.loc[data["G"] == "G2", "durat_pd"]

mw(cluster_1_data, cluster_2_data).pvalue

0.020801137904933816

#### 7.6 - Sex

##### Table

In [70]:
import numpy as np

partition = "G"
columns_1 = ["sex", partition]

cluster_1_total = data[columns_1].groupby([partition]).count().iloc[0,0]
cluster_2_total = data[columns_1].groupby([partition]).count().iloc[1,0]
total = np.array([cluster_1_total, cluster_1_total, cluster_2_total, cluster_2_total])

# Percentage
(data[columns_1].groupby([partition]).sex.value_counts()/total * 100).round(2) 
# Counts
#data[columns_1].groupby([partition]).sex.value_counts()

G   sex   
G1  male      61.35
    female    38.65
G2  male      63.58
    female    36.42
Name: sex, dtype: float64

##### Hypothesis test

Note: see https://stats.stackexchange.com/questions/110718/chi-squared-test-with-scipy-whats-the-difference-between-chi2-contingency-and

In [71]:
from scipy.stats import chi2_contingency 

cluster_1_data = data.loc[data["G"] == "G1", "sex"].value_counts()
cluster_2_data = data.loc[data["G"] == "G2", "sex"].value_counts()

# Index by label so both rows list the counts in the same (male, female) order
cluster_1_data_array = [cluster_1_data["male"], cluster_1_data["female"]]
cluster_2_data_array = [cluster_2_data["male"], cluster_2_data["female"]]
contingency_table = [cluster_1_data_array, cluster_2_data_array]

stat, p, dof, expected = chi2_contingency(contingency_table)
p

0.7348717408135872

-----

## 8 - Clustering H

* **Attributes:** attention, urinary

#### 8.1 - Prepare data for analysis

In [72]:
# First, we make this variable categorical so Pandas can order its states in the plot. Then we rename its categories accordingly (and reorder them if necessary)
data["H"] = data["H"].astype("category")
data["H"] = data["H"].cat.rename_categories({"0":"H1", "1": "H2"})
data["H"] = data["H"].cat.reorder_categories(['H1', 'H2'])

#### 8.3 - Age

##### Table

In [73]:
partition = "H"
columns_1 = ["age", partition]

print(data[columns_1].groupby([partition]).mean().round(2))
print(data[columns_1].groupby([partition]).std().round(2))

      age
H        
H1  66.52
H2  68.29
      age
H        
H1   9.85
H2  10.01


##### Hypothesis test

In [74]:
from scipy.stats import mannwhitneyu as mw

cluster_1_data = data.loc[data["H"] == "H1", "age"]
cluster_2_data = data.loc[data["H"] == "H2", "age"]

mw(cluster_1_data, cluster_2_data).pvalue

0.07542872751909142

#### 8.4 - PD onset

##### Table

In [75]:
partition = "H"
columns_1 = ["pdonset", partition]

print(data[columns_1].groupby([partition]).mean().round(2))
print(data[columns_1].groupby([partition]).std().round(2))

    pdonset
H          
H1    59.06
H2    59.39
    pdonset
H          
H1    10.78
H2    10.60


##### Hypothesis test

In [76]:
from scipy.stats import mannwhitneyu as mw

cluster_1_data = data.loc[data["H"] == "H1", "pdonset"]
cluster_2_data = data.loc[data["H"] == "H2", "pdonset"]

mw(cluster_1_data, cluster_2_data).pvalue

0.498800490008505

#### 8.5 - PD duration

##### Table

In [77]:
partition = "H"
columns_1 = ["durat_pd", partition]

print(data[columns_1].groupby([partition]).mean().round(2))
print(data[columns_1].groupby([partition]).std().round(2))

    durat_pd
H           
H1      7.46
H2      8.90
    durat_pd
H           
H1      5.86
H2      5.93


##### Hypothesis test

In [78]:
from scipy.stats import mannwhitneyu as mw

cluster_1_data = data.loc[data["H"] == "H1", "durat_pd"]
cluster_2_data = data.loc[data["H"] == "H2", "durat_pd"]

mw(cluster_1_data, cluster_2_data).pvalue

0.0028741633236330595

#### 8.6 - Sex

##### Table

In [79]:
import numpy as np

partition = "H"
columns_1 = ["sex", partition]

cluster_1_total = data[columns_1].groupby([partition]).count().iloc[0,0]
cluster_2_total = data[columns_1].groupby([partition]).count().iloc[1,0]
total = np.array([cluster_1_total, cluster_1_total, cluster_2_total, cluster_2_total])

# Percentage
(data[columns_1].groupby([partition]).sex.value_counts()/total * 100).round(2) 
# Counts
#data[columns_1].groupby([partition]).sex.value_counts()

H   sex   
H1  male      58.88
    female    41.12
H2  male      65.37
    female    34.63
Name: sex, dtype: float64

##### Hypothesis test

Note: see https://stats.stackexchange.com/questions/110718/chi-squared-test-with-scipy-whats-the-difference-between-chi2-contingency-and

In [80]:
from scipy.stats import chi2_contingency 

cluster_1_data = data.loc[data["H"] == "H1", "sex"].value_counts()
cluster_2_data = data.loc[data["H"] == "H2", "sex"].value_counts()

# Index by label so both rows list the counts in the same (male, female) order
cluster_1_data_array = [cluster_1_data["male"], cluster_1_data["female"]]
cluster_2_data_array = [cluster_2_data["male"], cluster_2_data["female"]]
contingency_table = [cluster_1_data_array, cluster_2_data_array]

stat, p, dof, expected = chi2_contingency(contingency_table)
p

0.2160679058384105

----

## 9 - Clustering I

* **Attributes:** sexual, mood

#### 9.1 - Prepare data for analysis

In [81]:
# First, we make this variable categorical so Pandas can order its states in the plot. Then we rename its categories accordingly (and reorder them if necessary)
data["I"] = data["I"].astype("category")
data["I"] = data["I"].cat.rename_categories({"0":"I2", "1": "I1"})
data["I"] = data["I"].cat.reorder_categories(['I1', 'I2'])

#### 9.3 - Age

##### Table

In [82]:
partition = "I"
columns_1 = ["age", partition]

print(data[columns_1].groupby([partition]).mean().round(2))
print(data[columns_1].groupby([partition]).std().round(2))

      age
I        
I1  67.05
I2  67.81
      age
I        
I1  10.94
I2   8.83


##### Hypothesis test

In [83]:
from scipy.stats import mannwhitneyu as mw

cluster_1_data = data.loc[data["I"] == "I1", "age"]
cluster_2_data = data.loc[data["I"] == "I2", "age"]

mw(cluster_1_data, cluster_2_data).pvalue

0.39135499838365956

#### 9.4 - PD onset

##### Table

In [84]:
partition = "I"
columns_1 = ["pdonset", partition]

print(data[columns_1].groupby([partition]).mean().round(2))
print(data[columns_1].groupby([partition]).std().round(2))

    pdonset
I          
I1    59.67
I2    58.77
    pdonset
I          
I1    11.45
I2     9.82


##### Hypothesis test

In [85]:
from scipy.stats import mannwhitneyu as mw

cluster_1_data = data.loc[data["I"] == "I1", "pdonset"]
cluster_2_data = data.loc[data["I"] == "I2", "pdonset"]

mw(cluster_1_data, cluster_2_data).pvalue

0.1466139168534677

#### 9.5 - PD duration

##### Table

In [86]:
partition = "I"
columns_1 = ["durat_pd", partition]

print(data[columns_1].groupby([partition]).mean().round(2))
print(data[columns_1].groupby([partition]).std().round(2))

    durat_pd
I           
I1      7.38
I2      9.05
    durat_pd
I           
I1      5.51
I2      6.24


##### Hypothesis test

In [87]:
from scipy.stats import mannwhitneyu as mw

cluster_1_data = data.loc[data["I"] == "I1", "durat_pd"]
cluster_2_data = data.loc[data["I"] == "I2", "durat_pd"]

mw(cluster_1_data, cluster_2_data).pvalue

0.00298979283150035

#### 9.6 - Sex

##### Table

In [88]:
import numpy as np

partition = "I"
columns_1 = ["sex", partition]

cluster_1_total = data[columns_1].groupby([partition]).count().iloc[0,0]
cluster_2_total = data[columns_1].groupby([partition]).count().iloc[1,0]
total = np.array([cluster_1_total, cluster_1_total, cluster_2_total, cluster_2_total])

# Percentage
(data[columns_1].groupby([partition]).sex.value_counts()/total * 100).round(2) 
# Counts
#data[columns_1].groupby([partition]).sex.value_counts()

I   sex   
I1  male      57.07
    female    42.93
I2  male      67.51
    female    32.49
Name: sex, dtype: float64

##### Hypothesis test

Note: see https://stats.stackexchange.com/questions/110718/chi-squared-test-with-scipy-whats-the-difference-between-chi2-contingency-and

In [89]:
from scipy.stats import chi2_contingency 

cluster_1_data = data.loc[data["I"] == "I1", "sex"].value_counts()
cluster_2_data = data.loc[data["I"] == "I2", "sex"].value_counts()

# Index by label so both rows list the counts in the same (male, female) order
cluster_1_data_array = [cluster_1_data["male"], cluster_1_data["female"]]
cluster_2_data_array = [cluster_2_data["male"], cluster_2_data["female"]]
contingency_table = [cluster_1_data_array, cluster_2_data_array]

stat, p, dof, expected = chi2_contingency(contingency_table)
p

0.039885902656203745