<hr>
<h1 style="text-align: center">An ensemble technique to predict Parkinson’s disease using machine
learning algorithms<h1>

<header style="text-align: left; font-size:20px; ">
    <span id='author'>
        El Maftouhi Imad
        Zitan Houssam
        Amgrout Zakaria 
    </span> <br>
    <span id='filiere'>Master IASDS1 2024/2025</span><br>
    <span id='module'>Module “Apprentissage automatique” MST IASD/S1 2024-2025 (M. AIT KBIR)</span><br>
    <span id='filiere'>Departement Genie informatique</span><br>
    <span id="date" style='text-align:left; '>December, 2024 the 1st</span><br>
<header>

<hr/>

## ABSTRACT
La maladie de Parkinson (MP) est une affection neurodégénérative progressive qui provoque des symptômes moteurs et non moteurs.
Ses symptômes se développent lentement, ce qui rend difficile une identification précoce. L'apprentissage automatique a un potentiel important pour prédire la maladie de Parkinson à partir des caractéristiques cachées dans les données vocales.


prédire la maladie de Parkinson à partir de caractéristiques cachées dans les données vocales. Ce travail a pour but d'identifier les caractéristiques
caractéristiques les plus pertinentes à partir d'un ensemble de données à haute dimension, ce qui permet de classer avec précision la maladie de Parkinson avec un temps de calcul réduit. Un ensemble de données individuels comportant diverses caractéristiques médicales basées sur la voix ont été analysés dans ce travail. Une technique d'algorithme de sélection de caractéristiques d'ensemble (EFSA) basée sur des algorithmes de filtrage, d'enveloppement et d'intégration qui sélectionnent les caractéristiques les plus pertinentes pour la classification de la maladie de Parkinson. Ces techniques peuvent réduire le temps de formation pour améliorer la précision du modèle et minimiser l'ajustement excessif.

In [3]:
import pandas as pd, numpy as np, matplotlib.pyplot as plt, seaborn as sns, sklearn as sk
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
import os


In [4]:
df = pd.read_csv('pd_speech_features.csv', header=1).drop(columns=['id'])
df

Unnamed: 0,gender,PPE,DFA,RPDE,numPulses,numPeriodsPulses,meanPeriodPulses,stdDevPeriodPulses,locPctJitter,locAbsJitter,...,tqwt_kurtosisValue_dec_28,tqwt_kurtosisValue_dec_29,tqwt_kurtosisValue_dec_30,tqwt_kurtosisValue_dec_31,tqwt_kurtosisValue_dec_32,tqwt_kurtosisValue_dec_33,tqwt_kurtosisValue_dec_34,tqwt_kurtosisValue_dec_35,tqwt_kurtosisValue_dec_36,class
0,1,0.85247,0.71826,0.57227,240,239,0.008064,0.000087,0.00218,0.000018,...,1.5620,2.6445,3.8686,4.2105,5.1221,4.4625,2.6202,3.0004,18.9405,1
1,1,0.76686,0.69481,0.53966,234,233,0.008258,0.000073,0.00195,0.000016,...,1.5589,3.6107,23.5155,14.1962,11.0261,9.5082,6.5245,6.3431,45.1780,1
2,1,0.85083,0.67604,0.58982,232,231,0.008340,0.000060,0.00176,0.000015,...,1.5643,2.3308,9.4959,10.7458,11.0177,4.8066,2.9199,3.1495,4.7666,1
3,0,0.41121,0.79672,0.59257,178,177,0.010858,0.000183,0.00419,0.000046,...,3.7805,3.5664,5.2558,14.0403,4.2235,4.6857,4.8460,6.2650,4.0603,1
4,0,0.32790,0.79782,0.53028,236,235,0.008162,0.002669,0.00535,0.000044,...,6.1727,5.8416,6.0805,5.7621,7.7817,11.6891,8.2103,5.0559,6.1164,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
751,0,0.80903,0.56355,0.28385,417,416,0.004627,0.000052,0.00064,0.000003,...,3.0706,3.0190,3.1212,2.4921,3.5844,3.5400,3.3805,3.2003,6.8671,0
752,0,0.16084,0.56499,0.59194,415,413,0.004550,0.000220,0.00143,0.000006,...,1.9704,1.7451,1.8277,2.4976,5.2981,4.2616,6.3042,10.9058,28.4170,0
753,0,0.88389,0.72335,0.46815,381,380,0.005069,0.000103,0.00076,0.000004,...,51.5607,44.4641,26.1586,6.3076,2.8601,2.5361,3.5377,3.3545,5.0424,0
754,0,0.83782,0.74890,0.49823,340,339,0.005679,0.000055,0.00092,0.000005,...,19.1607,12.8312,8.9434,2.2044,1.9496,1.9664,2.6801,2.8332,3.7131,0


In [5]:
df.isna().sum().sum()

0

In [6]:
df.duplicated().sum()


1

In [7]:

def missingValues(dataframe, option: str = 'remove'):
    """
    Handle missing values in a DataFrame based on specified strategy.
    Parameters:
    - dataframe: Input pandas DataFrame
    - option: Strategy for handling missing values
    Returns:
    - Processed DataFrame
    """
    # Input validation
    if dataframe is None:
        raise ValueError('Dataframe argument cannot be None')
    
    # Check for missing values
    missing_count = dataframe.isna().sum().sum()
    if missing_count == 0:
        print('No Missing Values Detected, returning original DataFrame.')
        return dataframe
    
    # Validate option
    valid_options = ['remove', 'max', 'min', 'mean', 'median', 'mode']
    if option.lower() not in valid_options:
        raise ValueError(f'Option must be one of: {" | ".join(valid_options)}. Received: {option}')
    
    # Create a copy to avoid modifying the original DataFrame
    df = dataframe.copy()
    
    # Handle missing values based on strategy
    match option.lower():
        case 'remove':
            # Remove rows with any missing values
            return df.dropna().reset_index(drop=True)
        
        case 'mean':
            numeric_columns = df.select_dtypes(include=[np.number]).columns
            df[numeric_columns] = df[numeric_columns].fillna(df[numeric_columns].mean())
            return df
        
        case 'median':
            numeric_columns = df.select_dtypes(include=[np.number]).columns
            df[numeric_columns] = df[numeric_columns].fillna(df[numeric_columns].median())
            return df
        
        case 'mode':
            for column in df.columns:
                df[column] = df[column].fillna(df[column].mode()[0])
            return df
        
        case 'max':
            numeric_columns = df.select_dtypes(include=[np.number]).columns
            df[numeric_columns] = df[numeric_columns].fillna(df[numeric_columns].max())
            return df
        
        case 'min':
            numeric_columns = df.select_dtypes(include=[np.number]).columns
            df[numeric_columns] = df[numeric_columns].fillna(df[numeric_columns].min())
            return df
        
        case _:
            raise ValueError(f'Unhandled option: {option}')


def preprocessingPipeline(dataframe, target_column, missingValuesOption:str='remove',split:bool=True, rebalance:bool=True, standarize:bool=True, test_size:float=0.2, random_state=42,verbose=False):
    """
    Comprehensive data preprocessing pipeline with multiple advanced techniques
    
    Parameters:
    - dataframe: Input pandas DataFrame
    - target_column: Name of the target variable column
    - test_size: Proportion of data to use for testing
    - random_state: Seed for reproducibility
    
    Returns:
    - X_train, X_test, y_train, y_test: Preprocessed and split datasets
    """
    # Remove duplicate rows
    if dataframe.duplicated().sum() > 0:
        if verbose: print('dropping duplicates ...', end='')
        dataframe_processed = dataframe.drop_duplicates(inplace=False).reset_index(drop=True)
        if verbose: print('done')

    # Handle missing values accordingly
    dataframe_processed = missingValues(dataframe , option = missingValuesOption)
    
    # Separate features and target
    X_processed = dataframe_processed.drop(columns=[target_column])
    y_processed = dataframe_processed[target_column]

    # SMOTE to rebalance the classes (only for classification problems)
    if rebalance:
        if verbose: print('Applying SMOTE ...', end='')
        if len(np.unique(dataframe[target_column])) > 1:
            smote = SMOTE(random_state=random_state)
            X_processed, y_processed = smote.fit_resample(
                dataframe.drop(columns=[target_column]),
                dataframe[target_column]
            )
            dataframe_processed = pd.concat(
                [pd.DataFrame(X_processed, columns=dataframe.drop(columns=[target_column]).columns),
                pd.Series(y_processed, name=target_column)],
                axis=1
            )
            if verbose: print('done')
        else:
            if verbose: print('Skipping, Only one class detected.')
    
    # Apply standerization
    if standarize:
        if verbose: print('Applying standerization ...', end='')
        X_processed = StandardScaler().fit_transform(X_processed)
        X_processed = pd.DataFrame(X_processed, columns=dataframe_processed.columns[:-1])
        if verbose: print('done')
    
    # Split data into train and test sets
    if split:
        X_train, X_test, y_train, y_test = train_test_split(X_processed, y_processed, test_size=test_size, random_state=random_state)

    return dataframe_processed, X_train, X_test, y_train, y_test

# def EFSAPipeline(dataframe, target_column, verbose=False):
#     subsets = {}
#     if verbose: print('Performing EFSA:')

#     # First subset with Filter method (Correlation Filter)
#     if verbose: print('\t--->Correlation Filter: ', end='')
#     selected_features = FS.corr_filter(dataframe, target_column)
#     subsets['Correlation Filter'] = selected_features
#     if verbose: print(selected_features)
    
#     # Second subset with Wrapper method (Backward elminination based wrapper)
#     if verbose: print('\t--->Backward elminination based wrapper: ', end='')
#     selected_features = FS.backward_selection_rf(data, target='class')
#     subsets['Backward elminination based wrapper'] = selected_features
#     if verbose: print(selected_features)
    
#     # Third subset with Embedded method (Lasso)
#     if verbose: print('\t--->Recursive Feature Elimination (RandomForest): ', end='')
#     selected_features = FS.recursive_feature_elimination_rf(data, target_column, min_feature=1)
#     subsets['RFE-RF'] = selected_features
#     if verbose: print(selected_features)
#     print('\tEFSA done')

#     return subsets


In [8]:
data_processed, X_train, X_test, y_train, y_test = preprocessingPipeline(
    df, 'class', 'remove', split=True, rebalance=True, standarize=False, random_state=42, verbose=True
)

# subsets = EFSAPipeline(data, 'class', verbose=True)

dropping duplicates ...done
No Missing Values Detected, returning original DataFrame.
Applying SMOTE ...done


In [9]:
print(data_processed, X_train, y_train)

      gender       PPE       DFA      RPDE  numPulses  numPeriodsPulses  \
0          1  0.852470  0.718260  0.572270        240               239   
1          1  0.766860  0.694810  0.539660        234               233   
2          1  0.850830  0.676040  0.589820        232               231   
3          0  0.411210  0.796720  0.592570        178               177   
4          0  0.327900  0.797820  0.530280        236               235   
...      ...       ...       ...       ...        ...               ...   
1123       0  0.846790  0.672417  0.462422        504               503   
1124       0  0.830738  0.696754  0.353928        440               439   
1125       0  0.822565  0.563505  0.413189        493               492   
1126       0  0.787599  0.609335  0.207256        594               593   
1127       0  0.478263  0.635587  0.352974        387               385   

      meanPeriodPulses  stdDevPeriodPulses  locPctJitter  locAbsJitter  ...  \
0             0.0080

In [26]:
Subsets = {}

In [27]:
from EFSA.Filter import PearsonFeatureSelector
selector = PearsonFeatureSelector(correlation_threshold=0.1, method='pearson', absolute=False)
selector.fit(X_train, y_train)
Subsets['Filter'] = selector.selected_features_
 
# Print selected features
print(len(selector.selected_features_))
print("Selected features based on correlation with target variable:")
print(selector.selected_features_)

33
Selected features based on correlation with target variable:
['std_7th_delta_delta', 'tqwt_energy_dec_15', 'tqwt_stdValue_dec_34', 'tqwt_kurtosisValue_dec_25', 'tqwt_energy_dec_7', 'mean_2nd_delta', 'stdDevPeriodPulses', 'tqwt_meanValue_dec_11', 'tqwt_meanValue_dec_16', 'tqwt_kurtosisValue_dec_8', 'tqwt_medianValue_dec_12', 'tqwt_entropy_shannon_dec_27', 'tqwt_meanValue_dec_22', 'b4', 'tqwt_meanValue_dec_25', 'tqwt_medianValue_dec_33', 'tqwt_medianValue_dec_30', 'det_LT_entropy_shannon_4_coef', 'tqwt_skewnessValue_dec_12', 'tqwt_meanValue_dec_23', 'tqwt_medianValue_dec_29', 'tqwt_skewnessValue_dec_7', 'tqwt_kurtosisValue_dec_1', 'tqwt_skewnessValue_dec_33', 'tqwt_entropy_shannon_dec_25', 'tqwt_skewnessValue_dec_18', 'mean_4th_delta_delta', 'mean_12th_delta', 'tqwt_skewnessValue_dec_3', 'mean_11th_delta', 'tqwt_meanValue_dec_1', 'tqwt_meanValue_dec_15', 'b2']


In [None]:
from EFSA.Wrapper import backward_elimination_wrapper
from models.LR import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

selected_features, performance_scores =  backward_elimination_wrapper(DecisionTreeClassifier, X_train, y_train, threshold=0.05, verbose=True,)
print('Selected features:', selected_features)
Subsets['Wrapper'] = selected_features


In [None]:
Subsets['Wrapper'] = X_train.columns[Subsets['Wrapper']]
Subsets['Wrapper']

In [28]:
from EFSA.Embedded import ridge_feature_selection
selection_results = ridge_feature_selection(X_train, y_train, alphas=[0.01, 0.1, 1, 10],threshold=0.05)
Subsets['Embedded'] = selection_results['selected_features']

print("Best Regularization Strength (Alpha):", selection_results['best_alpha'])
print("Best Features:", selection_results['selected_features'])



Best Regularization Strength (Alpha): 10.0
Best Features: ['DFA', 'RPDE', 'meanPeriodPulses', 'ppq5Jitter', 'minIntensity', 'GNE_mean', 'VFER_std', 'VFER_NSR_TKEO', 'mean_MFCC_1st_coef', 'mean_MFCC_3rd_coef', 'mean_MFCC_4th_coef', 'mean_MFCC_11th_coef', 'std_Log_energy', 'std_MFCC_3rd_coef', 'std_MFCC_9th_coef', 'std_0th_delta', 'std_2nd_delta', 'std_10th_delta', 'std_delta_delta_log_energy', 'std_7th_delta_delta', 'std_9th_delta_delta', 'std_10th_delta_delta', 'det_entropy_log_4_coef', 'det_TKEO_mean_5_coef', 'det_TKEO_std_6_coef', 'tqwt_energy_dec_15', 'tqwt_energy_dec_16', 'tqwt_energy_dec_19', 'tqwt_energy_dec_21', 'tqwt_energy_dec_22', 'tqwt_energy_dec_24', 'tqwt_energy_dec_25', 'tqwt_entropy_shannon_dec_11', 'tqwt_entropy_shannon_dec_22', 'tqwt_entropy_shannon_dec_25', 'tqwt_entropy_log_dec_1', 'tqwt_entropy_log_dec_2', 'tqwt_entropy_log_dec_3', 'tqwt_entropy_log_dec_7', 'tqwt_entropy_log_dec_8', 'tqwt_entropy_log_dec_9', 'tqwt_entropy_log_dec_11', 'tqwt_entropy_log_dec_20', 'tqw

In [29]:
print(len(Subsets['Filter']), len(Subsets['Embedded']))

33 90


In [30]:
selected_features = []

for subset in Subsets:
    selected_features.extend(Subsets[subset])

selected_features = list(set(selected_features))
print(len(selected_features))
print(selected_features)

115
['std_MFCC_3rd_coef', 'tqwt_stdValue_dec_34', 'std_10th_delta', 'std_delta_delta_log_energy', 'tqwt_medianValue_dec_29', 'mean_4th_delta_delta', 'mean_MFCC_1st_coef', 'tqwt_maxValue_dec_31', 'meanPeriodPulses', 'tqwt_entropy_log_dec_3', 'VFER_std', 'mean_12th_delta', 'tqwt_kurtosisValue_dec_29', 'tqwt_minValue_dec_17', 'tqwt_entropy_log_dec_7', 'tqwt_TKEO_std_dec_12', 'tqwt_kurtosisValue_dec_25', 'tqwt_meanValue_dec_11', 'tqwt_TKEO_std_dec_31', 'RPDE', 'tqwt_entropy_log_dec_8', 'mean_11th_delta', 'tqwt_entropy_log_dec_33', 'tqwt_meanValue_dec_22', 'tqwt_kurtosisValue_dec_18', 'det_TKEO_mean_5_coef', 'std_2nd_delta', 'tqwt_entropy_shannon_dec_22', 'tqwt_maxValue_dec_2', 'tqwt_TKEO_std_dec_26', 'tqwt_kurtosisValue_dec_22', 'tqwt_energy_dec_24', 'tqwt_skewnessValue_dec_18', 'tqwt_minValue_dec_15', 'tqwt_TKEO_mean_dec_27', 'tqwt_meanValue_dec_15', 'tqwt_entropy_log_dec_24', 'tqwt_skewnessValue_dec_9', 'stdDevPeriodPulses', 'b2', 'tqwt_TKEO_mean_dec_1', 'tqwt_skewnessValue_dec_33', 'DFA

In [31]:
data_processed[selected_features].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1128 entries, 0 to 1127
Columns: 115 entries, std_MFCC_3rd_coef to tqwt_medianValue_dec_33
dtypes: float64(115)
memory usage: 1013.6 KB


In [None]:
from models.RF import RandomForest

model = RandomForest()

model.fit(X_train[selected_features], y_train)

print(model.score(X_test[selected_features], y_test))


0.8539823008849557


In [34]:
from models.LR import LogisticRegression

model = LogisticRegression()

model.fit(X_train[selected_features], y_train)

print(model.score(X_test[selected_features], y_test))


0.5929203539823009


In [36]:
from models.LightGBM import LightGBMClassifier

model = LightGBMClassifier()

model.fit(X_train[selected_features], y_train)

print(model.score(X_test[selected_features], y_test))



IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).