Collaborative coding using GitHub
===========

Alexandre Perera Luna, Mónica Rojas Martínez

December 15th 2023


# Goal

The objective of this assignment is to construct a project through collaborative coding, showcasing an Exploratory Data Analysis (EDA) and a classification. To facilitate your understanding of GitHub, we will utilize code snippets from previous exercises, allowing you to focus on the process without concerns about the final outcome. The current notebook will serve as the main function in the project, and each participant is required to develop additional components and integrate their contributions into the main branch.


## Requirements

To use functions defined in other Jupyter notebooks, install the `nbimporter` package from a shell with the following command:

`pip install nbimporter`

`nbimporter` allows you to import Jupyter notebooks as modules. Once installed and imported, you can use a command like the following to import a function called *fibbonaci* that is stored in a notebook *fibbo_func* in the same directory as the present notebook:

`from fibbo_func import fibbonaci as fibbo`



In [1]:
## Modify this cell by importing all the modules you need to solve the assignment. Observe that we are importing
## the library nbimporter. You will need it for calling functions created in other notebooks.
import nbimporter
import pandas as pd



In [2]:
# Here is an example of invoking the Fibonacci function, which should be located in the same directory as the main notebook:
from fibbo_func import fibbonaci as fibbo
fibbo(24)

46368

## Exercises
As an illustration of the Git workflow, you will analyze the *Parkinson's* dataset, which has been examined in past assignments. Each team member has specific responsibilities that may be crucial for the progress of others, so make sure you all organize your tasks accordingly. We have structured the analysis into modules to help you track your tasks, but feel free to deviate from this structure if you prefer.  
Please use Markdown cells to describe your workflow and explain the findings of your work.
Remember that you need both to modify this notebook and to create additional functions outside it. Your work will only be available to others once you commit your changes and merge them.


In [3]:
# We will start by loading the parkinson dataset. The rest is up to you!
df = pd.read_csv('parkinsons.data', 
                 dtype = { # indicate categorical variables
                     'status': 'category'})
df.head(5)

Unnamed: 0,name,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,...,Shimmer:DDA,NHR,HNR,status,RPDE,DFA,spread1,spread2,D2,PPE
0,phon_R01_S01_1,119.992,157.302,74.997,0.00784,7e-05,0.0037,0.00554,0.01109,0.04374,...,0.06545,0.02211,21.033,1,0.414783,0.815285,-4.813031,0.266482,2.301442,0.284654
1,phon_R01_S01_2,122.4,148.65,113.819,0.00968,8e-05,0.00465,0.00696,0.01394,0.06134,...,0.09403,0.01929,19.085,1,0.458359,0.819521,-4.075192,0.33559,2.486855,0.368674
2,phon_R01_S01_3,116.682,131.111,111.555,0.0105,9e-05,0.00544,0.00781,0.01633,0.05233,...,0.0827,0.01309,20.651,1,0.429895,0.825288,-4.443179,0.311173,2.342259,0.332634
3,phon_R01_S01_4,116.676,137.871,111.366,0.00997,9e-05,0.00502,0.00698,0.01505,0.05492,...,0.08771,0.01353,20.644,1,0.434969,0.819235,-4.117501,0.334147,2.405554,0.368975
4,phon_R01_S01_5,116.014,141.781,110.655,0.01284,0.00011,0.00655,0.00908,0.01966,0.06425,...,0.1047,0.01767,19.649,1,0.417356,0.823484,-3.747787,0.234513,2.33218,0.410335


In [4]:
df.shape

(195, 24)

In [5]:
list(df.columns.values)

['name',
 'MDVP:Fo(Hz)',
 'MDVP:Fhi(Hz)',
 'MDVP:Flo(Hz)',
 'MDVP:Jitter(%)',
 'MDVP:Jitter(Abs)',
 'MDVP:RAP',
 'MDVP:PPQ',
 'Jitter:DDP',
 'MDVP:Shimmer',
 'MDVP:Shimmer(dB)',
 'Shimmer:APQ3',
 'Shimmer:APQ5',
 'MDVP:APQ',
 'Shimmer:DDA',
 'NHR',
 'HNR',
 'status',
 'RPDE',
 'DFA',
 'spread1',
 'spread2',
 'D2',
 'PPE']

The column names above mix special characters such as colons, parentheses and percent signs, which makes them hard to read and to type. This is why we need to clean and tidy the dataset and create clear column names.

### 1. Cleaning and tidying the dataset

This dataset contains information about the subjects of the Parkinson's study (n=31). Each column in the table is a particular voice measure, and each row corresponds to one of 195 voice recordings from the 31 individuals that were tested.

In [14]:
import pandas as pd

# Define the renamevars function
def renamevars(df, dict_names):
    return df.rename(columns=dict_names)

# Mapping from the original column names to the new, shorter names
dict_names = {
    'MDVP:Fo(Hz)': 'avFF',
    'MDVP:Fhi(Hz)': 'maxFF', 
    'MDVP:Flo(Hz)': 'minFF',
    'MDVP:Jitter(%)': 'percJitter',
    'MDVP:Jitter(Abs)': 'absJitter',
    'MDVP:RAP': 'rap',
    'MDVP:PPQ': 'ppq',
    'Jitter:DDP': 'ddp',
    'MDVP:Shimmer': 'lShimer',
    'MDVP:Shimmer(dB)': 'dbShimer',
    'Shimmer:APQ3': 'apq3',
    'Shimmer:APQ5': 'apq5',
    'MDVP:APQ': 'apq',
    'Shimmer:DDA': 'dda'
}

# loading dataset
file_path = 'parkinsons.data'
parkinsons_df = pd.read_csv(file_path)

# Apply the renamevars function to the parkinsons dataset with the column name changes specified above
renamed_parkinsons_df = renamevars(parkinsons_df, dict_names)

# Display the first few rows of the dataframe with renamed columns
print(renamed_parkinsons_df.head(3))

             name     avFF    maxFF    minFF  percJitter  absJitter      rap  \
0  phon_R01_S01_1  119.992  157.302   74.997     0.00784    0.00007  0.00370   
1  phon_R01_S01_2  122.400  148.650  113.819     0.00968    0.00008  0.00465   
2  phon_R01_S01_3  116.682  131.111  111.555     0.01050    0.00009  0.00544   

       ppq      ddp  lShimer  ...      dda      NHR     HNR  status      RPDE  \
0  0.00554  0.01109  0.04374  ...  0.06545  0.02211  21.033       1  0.414783   
1  0.00696  0.01394  0.06134  ...  0.09403  0.01929  19.085       1  0.458359   
2  0.00781  0.01633  0.05233  ...  0.08270  0.01309  20.651       1  0.429895   

        DFA   spread1   spread2        D2       PPE  
0  0.815285 -4.813031  0.266482  2.301442  0.284654  
1  0.819521 -4.075192  0.335590  2.486855  0.368674  
2  0.825288 -4.443179  0.311173  2.342259  0.332634  

[3 rows x 24 columns]


In [15]:
# Split the 'name' column into its separate components
aux = renamed_parkinsons_df.name.str.split('_', expand=True)


# Dropping the first 2 columns as they are not needed
aux.drop(aux.columns[[0, 1]], axis=1, inplace=True)

# Renaming the remaining columns for better interpretation
aux.columns = ['subject_id', 'trial']

# Adding these new columns back to the original DataFrame
renamed_parkinsons_df['subject_id'] = aux['subject_id']
renamed_parkinsons_df['trial'] = aux['trial']

# Grouping by subject_id to count the number of trials per subject
summary = renamed_parkinsons_df.groupby('subject_id')['trial'].count().reset_index()
summary.columns = ['subject_id', 'trial_count']

# Display the summary
summary.head(10)

Unnamed: 0,subject_id,trial_count
0,S01,6
1,S02,6
2,S04,6
3,S05,6
4,S06,6
5,S07,6
6,S08,6
7,S10,6
8,S13,6
9,S16,6
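As a cross-check on the split above, the same two fields can also be pulled out with a regular expression, and `nunique()` then gives the number of distinct subjects. The sketch below is only an illustration: it runs on a few made-up recording names that follow the `phon_R01_Sxx_n` pattern, and `toy`, `parts`, `n_subjects` and `trials_per_subject` are hypothetical names that do not appear in the notebook.

```python
import pandas as pd

# Toy recording names following the same pattern as the 'name' column (illustrative only)
toy = pd.DataFrame({'name': ['phon_R01_S01_1', 'phon_R01_S01_2', 'phon_R01_S02_1']})

# Capture the subject code (e.g. 'S01') and the trial number in one pass
parts = toy['name'].str.extract(r'_(?P<subject_id>S\d+)_(?P<trial>\d+)$')
toy = toy.join(parts)

# Number of distinct subjects, and number of recordings per subject
n_subjects = toy['subject_id'].nunique()
trials_per_subject = toy.groupby('subject_id').size()
print(n_subjects)
print(trials_per_subject)
```

Applied to the real DataFrame, `nunique()` on `subject_id` should agree with the number of individuals stated in the dataset description.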


In [16]:
renamed_parkinsons_df.head(2)

Unnamed: 0,name,avFF,maxFF,minFF,percJitter,absJitter,rap,ppq,ddp,lShimer,...,HNR,status,RPDE,DFA,spread1,spread2,D2,PPE,subject_id,trial
0,phon_R01_S01_1,119.992,157.302,74.997,0.00784,7e-05,0.0037,0.00554,0.01109,0.04374,...,21.033,1,0.414783,0.815285,-4.813031,0.266482,2.301442,0.284654,S01,1
1,phon_R01_S01_2,122.4,148.65,113.819,0.00968,8e-05,0.00465,0.00696,0.01394,0.06134,...,19.085,1,0.458359,0.819521,-4.075192,0.33559,2.486855,0.368674,S01,2


In [17]:
# Drop the 'name' column, which is now redundant with subject_id and trial
renamed_parkinsons_df = renamed_parkinsons_df.drop(columns=['name'])

In [18]:
renamed_parkinsons_df

Unnamed: 0,avFF,maxFF,minFF,percJitter,absJitter,rap,ppq,ddp,lShimer,dbShimer,...,HNR,status,RPDE,DFA,spread1,spread2,D2,PPE,subject_id,trial
0,119.992,157.302,74.997,0.00784,0.00007,0.00370,0.00554,0.01109,0.04374,0.426,...,21.033,1,0.414783,0.815285,-4.813031,0.266482,2.301442,0.284654,S01,1
1,122.400,148.650,113.819,0.00968,0.00008,0.00465,0.00696,0.01394,0.06134,0.626,...,19.085,1,0.458359,0.819521,-4.075192,0.335590,2.486855,0.368674,S01,2
2,116.682,131.111,111.555,0.01050,0.00009,0.00544,0.00781,0.01633,0.05233,0.482,...,20.651,1,0.429895,0.825288,-4.443179,0.311173,2.342259,0.332634,S01,3
3,116.676,137.871,111.366,0.00997,0.00009,0.00502,0.00698,0.01505,0.05492,0.517,...,20.644,1,0.434969,0.819235,-4.117501,0.334147,2.405554,0.368975,S01,4
4,116.014,141.781,110.655,0.01284,0.00011,0.00655,0.00908,0.01966,0.06425,0.584,...,19.649,1,0.417356,0.823484,-3.747787,0.234513,2.332180,0.410335,S01,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
190,174.188,230.978,94.261,0.00459,0.00003,0.00263,0.00259,0.00790,0.04087,0.405,...,19.517,0,0.448439,0.657899,-6.538586,0.121952,2.657476,0.133050,S50,2
191,209.516,253.017,89.488,0.00564,0.00003,0.00331,0.00292,0.00994,0.02751,0.263,...,19.147,0,0.431674,0.683244,-6.195325,0.129303,2.784312,0.168895,S50,3
192,174.688,240.005,74.287,0.01360,0.00008,0.00624,0.00564,0.01873,0.02308,0.256,...,17.883,0,0.407567,0.655683,-6.787197,0.158453,2.679772,0.131728,S50,4
193,198.764,396.961,74.904,0.00740,0.00004,0.00370,0.00390,0.01109,0.02296,0.241,...,19.020,0,0.451221,0.643956,-6.744577,0.207454,2.138608,0.123306,S50,5


Question 1: Are there any correlations present in our data?

In [19]:
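# Calculating the correlation matrix
# numeric_only=True skips the string columns (subject_id, trial); pandas >= 2.0 raises an error without it
correlation_matrix = renamed_parkinsons_df.corr(numeric_only=True)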

# Setting a threshold for identifying high correlations
correlation_threshold = 0.8

# Identifying pairs of highly correlated variables
highly_correlated_pairs = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i):
        if abs(correlation_matrix.iloc[i, j]) > correlation_threshold:
            col_name = correlation_matrix.columns[i]
            row_name = correlation_matrix.columns[j]
            corr_value = correlation_matrix.iloc[i, j]
            highly_correlated_pairs.append((col_name, row_name, corr_value))

# Print the highly correlated pairs
for pair in highly_correlated_pairs:
    print(f"Variables {pair[0]} and {pair[1]} have a correlation coefficient of {pair[2]:.2f}")


Variables absJitter and percJitter have a correlation coefficient of 0.94
Variables rap and percJitter have a correlation coefficient of 0.99
Variables rap and absJitter have a correlation coefficient of 0.92
Variables ppq and percJitter have a correlation coefficient of 0.97
Variables ppq and absJitter have a correlation coefficient of 0.90
Variables ppq and rap have a correlation coefficient of 0.96
Variables ddp and percJitter have a correlation coefficient of 0.99
Variables ddp and absJitter have a correlation coefficient of 0.92
Variables ddp and rap have a correlation coefficient of 1.00
Variables ddp and ppq have a correlation coefficient of 0.96
Variables dbShimer and percJitter have a correlation coefficient of 0.80
Variables dbShimer and ppq have a correlation coefficient of 0.84
Variables dbShimer and lShimer have a correlation coefficient of 0.99
Variables apq3 and lShimer have a correlation coefficient of 0.99
Variables apq3 and dbShimer have a correlation coefficient of 0

We have set the threshold to 0.8, so we consider two variables to be highly correlated if the absolute value of their correlation coefficient is greater than 0.8. With this threshold we aim to strike a balance: removing variables that are likely to cause multicollinearity issues while retaining as much useful information in the dataset as possible.
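The pair search can also be written without explicit loops. The following is a minimal sketch, assuming a small made-up DataFrame (`toy`) rather than the Parkinson's data; it masks the correlation matrix down to its strict lower triangle so that each pair is listed once, then applies the same strict `>` comparison as the loop above.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative only): 'b' is almost a copy of 'a', 'c' is unrelated
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [1.1, 2.0, 3.2, 3.9, 5.1],
    'c': [2.0, -1.0, 0.5, -0.5, 1.0],
})

threshold = 0.8
corr = toy.corr()

# Keep only the strict lower triangle so each pair appears once and the diagonal is skipped
mask = np.tril(np.ones(corr.shape, dtype=bool), k=-1)
pairs = corr.where(mask).stack()

# Strict inequality: a coefficient of exactly 0.8 would not be flagged
high_pairs = pairs[pairs.abs() > threshold]
print(high_pairs)
```

Only the pair (`b`, `a`) exceeds the threshold in this toy example.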

In [23]:
# Here we are defining the function to remove highly correlated variables
def remove_highly_correlated_variables(df, threshold=0.8):
    corr_matrix = df.corr(numeric_only=True)  # ignore non-numeric columns such as subject_id
    vars_to_remove = set()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold:
                vars_to_remove.add(corr_matrix.columns[i])
    reduced_df = df.drop(columns=list(vars_to_remove))
    return reduced_df

# Using the function to create a DataFrame with reduced multicollinearity
reduced_parkinsons_df = remove_highly_correlated_variables(renamed_parkinsons_df, 0.8)

# Displaying the first few rows of the reduced DataFrame
print(reduced_parkinsons_df.head(16))


       avFF    maxFF    minFF  percJitter  lShimer  status      RPDE  \
0   119.992  157.302   74.997     0.00784  0.04374       1  0.414783   
1   122.400  148.650  113.819     0.00968  0.06134       1  0.458359   
2   116.682  131.111  111.555     0.01050  0.05233       1  0.429895   
3   116.676  137.871  111.366     0.00997  0.05492       1  0.434969   
4   116.014  141.781  110.655     0.01284  0.06425       1  0.417356   
5   120.552  131.162  113.787     0.00968  0.04701       1  0.415564   
6   120.267  137.244  114.820     0.00333  0.01608       1  0.596040   
7   107.332  113.840  104.315     0.00290  0.01567       1  0.637420   
8    95.730  132.068   91.754     0.00551  0.02093       1  0.615551   
9    95.056  120.103   91.226     0.00532  0.02838       1  0.547037   
10   88.333  112.240   84.072     0.00505  0.02143       1  0.611137   
11   91.904  115.871   86.292     0.00540  0.02752       1  0.583390   
12  136.926  159.866  131.276     0.00293  0.01259       1  0.46

These are the variables that remain after removing those that are highly correlated with others (absolute correlation coefficient > 0.8). Removing them reduces multicollinearity, which helps make our model's estimates more stable and interpretable.
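One property of this removal rule is worth making explicit: for each highly correlated pair it drops the column that appears later in the DataFrame and keeps the earlier one, so the result depends on column order. The sketch below illustrates this on made-up data; `drop_correlated` and `toy` are hypothetical names, and the `numeric_only` argument assumes pandas 1.5 or newer.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.8):
    """Drop every column that correlates above `threshold` with an earlier column."""
    corr = df.corr(numeric_only=True).abs()
    # Strict upper triangle: entry (row, col) compares `col` with an earlier column `row`
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [1.1, 2.0, 3.2, 3.9, 5.1],      # nearly identical to 'a'
    'c': [2.0, -1.0, 0.5, -0.5, 1.0],    # unrelated to both
    'label': ['x', 'y', 'x', 'y', 'x'],  # non-numeric, ignored by corr
})
reduced = drop_correlated(toy)
print(list(reduced.columns))  # ['a', 'c', 'label']
```

Here `b` is dropped because it comes after `a`; reordering the columns would keep `b` and drop `a` instead.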

Question 1: How many observations do you have? 

In [21]:
reduced_parkinsons_df.shape

(195, 13)

We have a total of 195 observations: `shape` returns (rows, columns), so the current DataFrame has 195 rows and 13 columns.

Question 2: Are there apparent differences between controls and patients? 

In [27]:
# To assess differences in each variable between patients and controls, we can use groupby and describe
# This provides a detailed statistical summary for each group
# We group by the variable status, which takes the value 0 or 1 and most likely indicates control (0) or patient (1)

detailed_stats_by_status = reduced_parkinsons_df.groupby('status').describe()

# Displaying the detailed statistics for each variable grouped by status
detailed_stats_by_status





Unnamed: 0_level_0,avFF,avFF,avFF,avFF,avFF,avFF,avFF,avFF,maxFF,maxFF,...,spread2,spread2,D2,D2,D2,D2,D2,D2,D2,D2
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
status,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
0,48.0,181.937771,52.731067,110.739,120.9475,198.996,229.077,260.105,48.0,223.63675,...,0.193766,0.291954,48.0,2.154491,0.310269,1.423287,1.974217,2.12951,2.339487,2.88245
1,147.0,145.180762,32.34805,88.333,117.572,145.174,170.071,223.361,147.0,188.441463,...,0.30366,0.450493,147.0,2.456058,0.375742,1.765957,2.180933,2.439597,2.668479,3.671155


In summary, based on the statistics above, there are apparent differences between controls (status 0) and patients (status 1) in their vocal features. Patients generally exhibit lower values for avFF and maxFF but higher values for spread2 and D2 compared to controls. These differences may indicate distinctions in vocal characteristics between the two groups.
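A compact way to read such differences is to compare the group means directly. The sketch below is an assumption-laden illustration: the column names `status`, `avFF` and `D2` come from the notebook, but the numbers are invented and `toy`, `means` and `diff` are hypothetical names.

```python
import pandas as pd

# Made-up values, only to illustrate the comparison (not taken from the dataset)
toy = pd.DataFrame({
    'status': [0, 0, 0, 1, 1, 1],
    'avFF':   [180.0, 200.0, 190.0, 140.0, 150.0, 145.0],
    'D2':     [2.1, 2.2, 2.0, 2.5, 2.4, 2.6],
})

# Mean of every variable per group, then patients (1) minus controls (0)
means = toy.groupby('status').mean()
diff = means.loc[1] - means.loc[0]
print(means)
print(diff)
```

A negative entry in `diff` means the patient group has the lower mean for that variable. A difference in means alone does not show that the groups are separable, so it is best read alongside the standard deviations from `describe()`.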

Question 3: Is the variability comparable? If you check the minimum and maximum values are there outliers? 
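One common starting point for this question is the interquartile-range (IQR) rule, which flags values lying more than 1.5 IQRs outside the quartiles. This is a sketch of one possible approach, not the expected answer; `iqr_outliers` is a hypothetical helper, and the values in `toy` are invented.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up values with one obviously extreme entry
toy = pd.Series([10.0, 11.0, 12.0, 11.5, 10.5, 30.0], name='maxFF')
mask = iqr_outliers(toy)
print(toy[mask])
```

The same function could be applied per group, for example through `groupby('status')`, to check whether one group shows more extreme values than the other.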

### 2. Basic EDA based on plots and descriptive statistics

In [None]:
# your code here

### 3. Aggregating and transforming variables in the dataset

In [None]:
# your code here

### 4. Differentiating between controls (healthy subjects) and patients

In [None]:
# your code here