### notes

Definition:

- Correlation analysis is widely used in machine learning during exploratory data analysis and data mining.
- It can surface problems in a feature set, such as redundant or strongly related features, that would otherwise hurt the model during fitting.
- A feature set without strongly correlated features has several benefits:
    - Algorithms train faster, since there are fewer features to process
    - Models are easier to interpret
    - Estimates tend to be more stable, since redundant features add variance rather than information

Interpret the Values:

- Positive Correlation (> 0): As one variable increases, the other variable tends to increase.
- Negative Correlation (< 0): As one variable increases, the other variable tends to decrease.
- No Correlation (≈ 0): No linear relationship between the variables.
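
The three cases above can be checked by hand with a small sketch. It assumes the Pearson coefficient (the default of `DataFrame.corr()`); `pearson_r` and the toy lists are hypothetical names and data made up for illustration, not part of the dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

x = [-2, -1, 0, 1, 2]
r_pos = pearson_r(x, [2 * v + 1 for v in x])  # y rises with x
r_neg = pearson_r(x, [-3 * v for v in x])     # y falls as x rises
r_zero = pearson_r(x, [v ** 2 for v in x])    # symmetric curve, no linear trend

print(r_pos, r_neg, r_zero)
```

The last case is worth noting: `y = x ** 2` depends entirely on `x`, yet its coefficient is zero, because the coefficient only measures the linear part of a relationship.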

Identify Strong Correlations:

- Typically, correlations above 0.7 or below -0.7 are considered strong.
- Look for pairs of variables with high absolute correlation values.
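
One possible way to list such pairs is sketched below. The `toy` frame is hypothetical illustration data, and 0.7 is just the rule-of-thumb cutoff mentioned above.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): "b" tracks "a" closely, "c" mirrors "a", "d" is unrelated.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 13],
    "c": [6, 5, 4, 3, 2, 1],
    "d": [3, 1, 4, 1, 5, 2],
})

corr = toy.corr()
# Keep only the upper triangle so each pair appears once and the diagonal is skipped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()  # Series indexed by (feature_1, feature_2)
# Masked entries (NaN) never pass the comparison, so they are filtered out here.
strong = pairs[pairs.abs() > 0.7]
print(strong)
```

Filtering on the absolute value catches strong negative pairs such as `("a", "c")` as well as strong positive ones.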

Derive Insights:

- Detect Multicollinearity: High correlations between independent variables can indicate multicollinearity, which can affect the performance of regression models.
- Feature Selection: Identify and remove redundant features. For example, if two features are highly correlated, you might choose to drop one.
- Understand Relationships: Gain insights into how variables interact with each other, which can inform feature engineering and model building.
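
Pairwise correlations are one signal of multicollinearity. Another commonly used one is the variance inflation factor (VIF). VIF is not used anywhere else in this notebook; the sketch below only illustrates the idea on synthetic data, using the identity that the VIF of feature j equals the j-th diagonal entry of the inverse correlation matrix. The names `x1`, `x2`, `x3` are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)  # nearly a copy of x1
x3 = rng.normal(size=n)             # generated independently of x1 and x2
features = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the others.
# This equals the j-th diagonal entry of the inverse of the correlation matrix.
corr = features.corr().to_numpy()
vif = pd.Series(np.diag(np.linalg.inv(corr)), index=features.columns)
print(vif.round(1))
```

A VIF close to 1 means a feature is barely explained by the others. Large values (cutoffs of 5 or 10 are often quoted as rules of thumb) suggest the feature is largely redundant.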

### objective

- load, eda
- preprocess
- train models
- feature selection and train models
- visualize and compare

### load and eda

download dataset: 
- option1: link - https://www.kaggle.com/datasets/debasisdotcom/parkinson-disease-detection
- option2: run this command 

    ```kaggle datasets download -d debasisdotcom/parkinson-disease-detection```

In [None]:
# # install dependencies
# %pip install --quiet numpy pandas scikit-learn xgboost lightgbm
# %pip install --quiet ydata-profiling ipywidgets
# %pip install --quiet plotly "nbformat>=4.2.0" statsmodels

In [None]:
import numpy as np
import pandas as pd
import plotly.express as px
from ydata_profiling import ProfileReport

In [None]:
# read data
df = pd.read_csv("dataset/parkinson-disease.csv")

print(f"shape: {df.shape}")
print(f"count of rows with missing values: {df.isna().any(axis=1).sum()}")
print(f"columns: {list(df.columns)}")

In [None]:
# check count and dtypes
df.info()

In [None]:
# check null values
df.isna().sum()

In [None]:
# check statistics of numerical cols 
df.describe()

In [None]:
# stats of categorical cols
df.describe(include="object")

In [None]:
# # quick eda
# keyword = "train"
# profile = ProfileReport(df, title=f"{keyword} dataset")
# profile.to_notebook_iframe()

# # visualize
# for col in df.columns:
#     fig = px.histogram(df, x=col)
#     fig.show()

### preprocess

In [None]:
# drop the target column; drop() returns a new frame, so df stays intact
df_new = df.drop(columns=["status"])

# keep numeric columns only (corr() needs numeric input)
df_new = df_new.select_dtypes(exclude=["object"])

print(df_new.shape)
df_new.head()

In [None]:
# plot the correlation matrix
fig = px.imshow(
    df_new.corr(), 
    text_auto=True, aspect="auto", 
    color_continuous_scale="RdBu"
)
fig.show()

In [None]:
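def preprocessing(df):
    # drop name due to high cardinality (modifies df in place)
    # errors="ignore" keeps this cell safe to re-run once the column is gone
    df.drop(columns="name", inplace=True, errors="ignore")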

In [None]:
# apply preprocessing
preprocessing(df)

In [None]:
# check if non-null values == count(rows)
df.info()

In [None]:
# label-encode any remaining object columns
for colname in df.select_dtypes("object"):
    df[colname], _ = df[colname].factorize()

# create feature columns
X = df.drop(["status"], axis=1)

# alternative: one-hot encode instead of label-encoding
# X = pd.get_dummies(X)

# create target column
y = df["status"]
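
For reference, a small sketch of what `factorize()` does to an object column, next to the one-hot alternative that is commented out above. The `colours` series is hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical column, for illustration only.
colours = pd.Series(["red", "green", "red", "blue"])

# Label encoding: one integer code per row, numbered in order of first appearance.
codes, uniques = colours.factorize()
print(list(codes))    # integer code for each row
print(list(uniques))  # the label each code stands for

# One-hot encoding: one indicator column per distinct label.
one_hot = pd.get_dummies(colours)
print(one_hot.shape)
```

Label encoding keeps a single column but imposes an arbitrary order on the labels. One-hot encoding avoids that at the cost of extra columns.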

In [None]:
# plot the correlation matrix
fig = px.imshow(
    X.corr(), 
    text_auto=True, aspect="auto", 
    color_continuous_scale="RdBu"
)
fig.show()

### train

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    HistGradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.metrics import accuracy_score, precision_score, recall_score

In [None]:
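def train_models(data, labels, pipelines):
    """Fit every pipeline on one 80/20 split and return accuracy, precision and recall per classifier."""
    results = []
    # split dataset (fixed seed, so the same rows are held out on every call)
    X_train, X_valid, y_train, y_valid = train_test_split(data, labels, test_size=0.2, random_state=66)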
    
    for clf, pipeline in pipelines.items():
        model = pipeline.fit(X_train, y_train)
        
        y_hat = model.predict(X_valid)
        accuracy = accuracy_score(y_valid, y_hat)
        precision = precision_score(y_valid, y_hat)
        recall = recall_score(y_valid, y_hat)
        
        results.append({
            "classifier": clf,
            "accuracy": accuracy,
            "precision": precision,
            "recall": recall
        })
    
    return results

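def fe_apply_correlation(data, threshold):
    """Return a copy of `data` without columns that are highly correlated with an earlier column."""
    corr_matrix = data.corr()

    corr_columns = set()
    for i in range(len(corr_matrix.columns)):
        # compare column i only with the columns before it (lower triangle)
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold:
                # keep the earlier column j, mark the later column i for removal
                corr_columns.add(corr_matrix.columns[i])
    return data.drop(columns=list(corr_columns))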

def bench_corr_coeff(thresholds, data, labels, pipelines):
    results = []
    for threshold in thresholds:
        reduced_data = fe_apply_correlation(data, threshold)
        results_temp = train_models(reduced_data, labels, pipelines)
        results_temp = [{**item, 'coeff_threshold': threshold} for item in results_temp]
        results += results_temp
    return results
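
To see which columns the correlation filter removes, here is a small sketch that re-declares the same logic as `fe_apply_correlation` so it runs on its own. The `toy` frame is hypothetical illustration data.

```python
import pandas as pd

def fe_apply_correlation(data, threshold):
    # Same logic as the notebook's helper: for each pair whose absolute
    # correlation exceeds the threshold, drop the later of the two columns.
    corr_matrix = data.corr()
    to_drop = set()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold:
                to_drop.add(corr_matrix.columns[i])
    return data.drop(columns=list(to_drop))

# Toy data (hypothetical): "b" is almost linear in "a", "c" is an exact mirror of "a".
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 13],
    "c": [6, 5, 4, 3, 2, 1],
    "d": [3, 1, 4, 1, 5, 2],
})

kept_loose = fe_apply_correlation(toy, threshold=0.8)
kept_strict = fe_apply_correlation(toy, threshold=0.999)
print(list(kept_loose.columns))
print(list(kept_strict.columns))
```

Two properties follow from the loop order: the column that appears first in the frame is the one that survives, and a column is dropped even when the column it correlates with has itself been marked for removal. Reordering the columns can therefore change which features are kept.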

In [None]:
pipelines = {
    "Logistic": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "KNeighbors": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "DecisionTree": make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=66)),
    "XGBoost": make_pipeline(StandardScaler(), XGBClassifier(objective="binary:logistic", random_state=66)),
    "LightGBM": make_pipeline(StandardScaler(), LGBMClassifier(random_state=66)),
    "RandomForest": make_pipeline(StandardScaler(), RandomForestClassifier(random_state=66, n_estimators=200)),
    "AdaBoost": make_pipeline(StandardScaler(), AdaBoostClassifier(random_state=66)),
    "GradientBoosting": make_pipeline(StandardScaler(), GradientBoostingClassifier(random_state=66)),
    "HistGradientBoosting": make_pipeline(StandardScaler(), HistGradientBoostingClassifier(random_state=66)),
}

In [None]:
%%time
# train model with data as it is
results = train_models(data=X, labels=y, pipelines=pipelines)

### feature selection + train models

In [None]:
%%time
# apply coefficient thresholds and train models
coeff_thresholds = [0.6, 0.7, 0.8, 0.9, 0.95, 0.99]
coeff_results = bench_corr_coeff(thresholds=coeff_thresholds, data=X, labels=y, pipelines=pipelines)

### visualize and compare

In [None]:
results_df = pd.DataFrame(results)
results_df.style.highlight_max(subset=results_df.columns[-3:])

In [None]:
coeff_results_df = pd.DataFrame(coeff_results)
coeff_results_df.set_index(["coeff_threshold", "classifier"])[["accuracy", "precision"]]

In [None]:
coeff_results_df.style.highlight_max(subset=coeff_results_df.columns[1:-1])
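
Another way to compare thresholds is to pivot the long results table so that each classifier is a row and each threshold a column. The sketch below uses placeholder rows in the same shape as `coeff_results`. The classifier names `A` and `B` and all numbers are made up, not results from this notebook.

```python
import pandas as pd

# Placeholder rows shaped like `coeff_results`; the values are invented for illustration.
rows = [
    {"classifier": "A", "accuracy": 0.80, "coeff_threshold": 0.7},
    {"classifier": "B", "accuracy": 0.85, "coeff_threshold": 0.7},
    {"classifier": "A", "accuracy": 0.90, "coeff_threshold": 0.9},
    {"classifier": "B", "accuracy": 0.82, "coeff_threshold": 0.9},
]
demo = pd.DataFrame(rows)

# One row per classifier, one column per threshold.
table = demo.pivot(index="classifier", columns="coeff_threshold", values="accuracy")
# Column label (threshold) holding the highest accuracy in each row.
best_threshold = table.idxmax(axis=1)
print(table)
print(best_threshold)
```

Applied to the real frame, the same two lines would run on `coeff_results_df` instead of `demo`.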

In [None]:
# original features
X.columns


In [None]:
# best threshold features
reduced_X = fe_apply_correlation(data=X, threshold=0.8)
reduced_X.columns

Observations:
- Best accuracy with all features: KNeighbors - 0.94
- Best accuracy after feature selection based on correlation coefficient threshold: KNeighbors - 0.97
- Improvement of ~3 percentage points (0.94 to 0.97) 🚀