<h2>Classification Case-Studies: Logistic Regression</h2>

<ul>
  <li>Data preprocessing</li>
  <li>Exploratory Data Analysis</li>
  <li>Feature Selection and Analysis</li>
  <li>Classification using Logistic Regression</li>
  <li>Feature Analysis using Logistic Regression</li>
</ul>

<h2>Case Study 1: Early Detection of Parkinson's Disease: Predictive Modelling and Feature Analysis</h2>

The ElderlyHealth department is a division within MedResearchX Labs dedicated to researching methods for addressing mental health issues affecting the elderly population. One significant focus area within this field is Parkinson's disease, a neurodegenerative disorder characterized by tremors, stiffness, and difficulty with movement. If detected early, Parkinson's disease can be effectively managed. As a data analyst, you've been assigned to develop a data-driven model using historical data ('parkinson_disease.csv'). This model aims to accurately <span style="color:blue">assess whether an individual has Parkinson's disease and
identify which features best explain the disease</span>.

<b style="color: blue;">Step 1: Load your dataset into pandas</b>

import pandas as pd

df = pd.read_csv('../datasets/parkinson_disease.csv')
df.head()

Unnamed: 0,name,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,...,Shimmer:DDA,NHR,HNR,status,RPDE,DFA,spread1,spread2,D2,PPE
0,phon_R01_S01_1,119.992,157.302,74.997,0.00784,7e-05,0.0037,0.00554,0.01109,0.04374,...,0.06545,0.02211,21.033,1,0.414783,0.815285,-4.813031,0.266482,2.301442,0.284654
1,phon_R01_S01_2,122.4,148.65,113.819,0.00968,8e-05,0.00465,0.00696,0.01394,0.06134,...,0.09403,0.01929,19.085,1,0.458359,0.819521,-4.075192,0.33559,2.486855,0.368674
2,phon_R01_S01_3,116.682,131.111,111.555,0.0105,9e-05,0.00544,0.00781,0.01633,0.05233,...,0.0827,0.01309,20.651,1,0.429895,0.825288,-4.443179,0.311173,2.342259,0.332634
3,phon_R01_S01_4,116.676,137.871,111.366,0.00997,9e-05,0.00502,0.00698,0.01505,0.05492,...,0.08771,0.01353,20.644,1,0.434969,0.819235,-4.117501,0.334147,2.405554,0.368975
4,phon_R01_S01_5,116.014,141.781,110.655,0.01284,0.00011,0.00655,0.00908,0.01966,0.06425,...,0.1047,0.01767,19.649,1,0.417356,0.823484,-3.747787,0.234513,2.33218,0.410335


<b style="color: blue;">Step 2: Perform EDA using the following requirements</b>
<ol>
  <li>Count the number of rows with missing values and handle them appropriately (for example, by dropping or imputing them)</li>
  <li>Provide a boxplot and a density plot for each attribute (except status) in the dataset (optional)</li>
  <li>Provide a barplot that shows the number of data points per class label (see the status variable)</li>
</ol>
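
The first and third requirements can be sketched with pandas alone. The frame below is a toy stand-in with made-up column names and values, not the real dataset, and dropping incomplete rows is only one of several reasonable ways to handle missing values.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset; column names and values are made up.
toy = pd.DataFrame({
    "feature_a": [0.1, 0.4, np.nan, 0.8, 0.5, 0.9],
    "feature_b": [1.2, np.nan, 0.7, 1.1, 0.9, 1.0],
    "status":    [1, 1, 0, 1, 0, 1],
})

# Number of rows that contain at least one missing value.
n_rows_missing = int(toy.isna().any(axis=1).sum())

# One simple option: drop incomplete rows.
# (Imputation, e.g. with the column median, is an alternative.)
clean = toy.dropna()

# Data points per class label, ordered by label.
class_counts = clean["status"].value_counts().sort_index()

print(n_rows_missing)
print(class_counts)
```

With matplotlib installed, `class_counts.plot(kind="bar")` would turn the counts into the requested barplot.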

<b style="color: blue;">Step 3: Identify features and the target variable in the problem</b>
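
As a minimal sketch, assuming `status` is the target and `name` is only a record identifier, the split into features and target could look like this. The rows are made up for illustration; only the column names come from the dataset preview.

```python
import pandas as pd

# Made-up rows; only the column names mirror the dataset preview.
toy = pd.DataFrame({
    "name": ["rec_1", "rec_2", "rec_3"],
    "MDVP:Fo(Hz)": [120.0, 150.0, 110.0],
    "HNR": [21.0, 19.0, 25.0],
    "status": [1, 1, 0],
})

# Target: the class label.
y = toy["status"]

# Features: everything except the identifier and the target.
X = toy.drop(columns=["name", "status"])
```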

<b style="color: blue;">Step 4: Feature Engineering, Selection and Analysis</b>

<ol>
    <li>Using variance thresholding, remove all features whose variance falls below a threshold $\epsilon$</li>
    <li>Using Mutual Information, rank the importance of each feature against the target variable</li>
    <li>Select the best K features to use in your problem</li>
    <li>At a later stage, you will compare the performance of your model with all features against the model with the K selected features</li>
</ol>
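
The filter steps above can be sketched with scikit-learn's `VarianceThreshold`, `mutual_info_classif` and `SelectKBest`. This is an illustration on synthetic data: the feature names, the value of `epsilon` and the value of `K` are assumptions, not values prescribed by the exercise.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, VarianceThreshold, mutual_info_classif

# Synthetic stand-in for the feature matrix and the status column.
X_arr, y = make_classification(n_samples=200, n_features=8, n_informative=4, random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])
X["constant"] = 1.0  # zero-variance column, so the threshold has something to remove

# 1. Variance thresholding: keep only features whose variance exceeds epsilon.
epsilon = 0.01  # hypothetical value
vt = VarianceThreshold(threshold=epsilon)
vt.fit(X)
X_var = X.loc[:, vt.get_support()]

# 2. Mutual information between each remaining feature and the target.
def mi_score(features, target):
    return mutual_info_classif(features, target, random_state=0)

mi_ranking = pd.Series(mi_score(X_var, y), index=X_var.columns).sort_values(ascending=False)

# 3. Keep the K features with the highest mutual information.
K = 4  # hypothetical value
selector = SelectKBest(score_func=mi_score, k=K)
selector.fit(X_var, y)
selected = list(X_var.columns[selector.get_support()])

print(mi_ranking)
print(selected)
```

Variance thresholding is applied to the unscaled features here, because after standardization every non-constant feature has unit variance and the threshold no longer discriminates.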

<span style="color:blue;">Note: Filter-based feature selection is often used to reduce dimensionality before applying more resource-intensive wrapper-based methods. <b>Rule of thumb: 20-30 features is considered low dimensional</b>, so filtering may be unnecessary here.</span>

<span style="color:blue">Why do we filter? Essentially for computational reasons, or because domain knowledge requires it. Otherwise, use wrapper-based feature selection.</span>

<b style="color: blue;">Step 5: Scale all features using a Standard Scaler and Split the dataset into Training/Test set (80:20)</b>
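
A minimal sketch of this step on synthetic data follows. It splits first and fits the scaler on the training set only, a common way to keep test-set statistics out of the scaling; the stratified split and the `random_state` values are assumptions added for reproducibility.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features and the status column.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# 80:20 split; stratify keeps the class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fit the scaler on the training set only, then apply it to both sets.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

print(X_train_s.shape, X_test_s.shape)
```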

<b style="color: blue;">Step 6: Build Logistic Regression Models</b>

<ul>
    <li>Logistic Regression with all features: $Lg_{all}$</li>
    <li>Logistic Regression with remaining features after variance thresholding: $Lg_{var}$</li>
    <li>Logistic Regression with k-selected features: $Lg_{k}$</li>
</ul>
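
One possible way to build the three models is to wrap each variant in a scikit-learn pipeline, so that selection and scaling are fitted on the training data only. The sketch below uses synthetic data; the variance threshold, `k=4` and `max_iter=1000` are assumed values.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, VarianceThreshold, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features and the status column.
X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

def mi_score(features, target):
    # Mutual information with a fixed seed so the selection is reproducible.
    return mutual_info_classif(features, target, random_state=0)

models = {
    # All features.
    "Lg_all": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    # Features that survive variance thresholding (applied before scaling).
    "Lg_var": make_pipeline(
        VarianceThreshold(threshold=0.01), StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    # The k features with the highest mutual information.
    "Lg_k": make_pipeline(
        VarianceThreshold(threshold=0.01),
        SelectKBest(mi_score, k=4),
        StandardScaler(),
        LogisticRegression(max_iter=1000),
    ),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model[-1].coef_.shape)
```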

<b style="color: blue;">Step 7: Model Evaluation</b>
<ol>
    <li>Generate Confusion matrices on the test set for each model</li>
    <li>Generate Classification reports for each model </li>
    <li>Comment on the performance of each model</li>
</ol>
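
The evaluation can be sketched with `confusion_matrix` and `classification_report` from `sklearn.metrics`. The example below evaluates a single model on synthetic data; the same two calls would be repeated for each of the three models.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features and the status column.
X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

scaler = StandardScaler().fit(X_train)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)
y_pred = model.predict(scaler.transform(X_test))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)

# Per-class precision, recall and F1, plus overall accuracy.
report = classification_report(y_test, y_pred)

print(cm)
print(report)
```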

<b style="color: blue;">Step 8: Feature Analysis using Logistic Regression</b>
    <ol>
    <li>Using the coefficients of model $Lg_{all}$, rank the contribution of each feature to the identification of Parkinson's disease</li>
        <li>You may use the absolute value of each coefficient as your ranking score: $|\theta_i|$</li>
    <li>Generate a dataframe of feature importance and provide a barplot</li>
    </ol>
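
A minimal sketch of the coefficient ranking, assuming the features were standardized before fitting so that the magnitudes $|\theta_i|$ are comparable across features. The data and the feature names are synthetic placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in; the feature names are placeholders.
X_arr, y = make_classification(n_samples=300, n_features=6, n_informative=3, random_state=0)
feature_names = [f"f{i}" for i in range(6)]

X_s = StandardScaler().fit_transform(X_arr)
model = LogisticRegression(max_iter=1000).fit(X_s, y)

# For a binary problem, coef_ has shape (1, n_features).
importance = pd.DataFrame({"feature": feature_names, "coefficient": model.coef_[0]})
importance["abs_coefficient"] = importance["coefficient"].abs()
importance = importance.sort_values("abs_coefficient", ascending=False).reset_index(drop=True)

print(importance)
```

With matplotlib installed, `importance.plot.barh(x="feature", y="abs_coefficient")` would draw the barplot. The sign of each coefficient is kept in a separate column because it indicates the direction of the association.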

<b style="color: blue;">Step 9: Model Training and Wrapper-based Feature Selection</b>

Starting from all features, implement Sequential Feature Selection to identify
the best-performing feature subset. You may use the $F_1$ score as your evaluation metric.
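
One way to sketch this is with scikit-learn's `SequentialFeatureSelector`, scored by F1 under cross-validation. The subset size of 3, forward direction and 5 folds are assumptions chosen for the illustration, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features and the status column.
X, y = make_classification(n_samples=200, n_features=8, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scaling lives inside the estimator so each CV fold is scaled on its own training part.
estimator = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

sfs = SequentialFeatureSelector(
    estimator,
    n_features_to_select=3,  # hypothetical subset size
    direction="forward",
    scoring="f1",
    cv=5,
)
sfs.fit(X_train, y_train)
mask = sfs.get_support()  # boolean mask over the original columns

# Refit on the selected columns and report on the held-out test set.
final_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
final_model.fit(X_train[:, mask], y_train)
report = classification_report(y_test, final_model.predict(X_test[:, mask]))

print(mask)
print(report)
```

Forward selection adds one feature at a time and keeps the addition that improves the cross-validated score most; `direction="backward"` would instead start from all features and remove them one by one.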
    

<b style="color: blue;">Step 10: Classification Report - Wrapper-based FS</b>

Provide a classification report of your new model on the test set.