<a href="https://colab.research.google.com/github/txusser/Master_IA_Sanidad/blob/main/Modulo_2/2_3_4_Regresion_Log%C3%ADstica.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Logistic regression

Data: ILPD (Indian Liver Patient Dataset) Data Set
* [Download it by clicking this link](https://github.com/txusser/Master_IA_Sanidad/blob/main/Modulo_2/datos/Indian_Liver_Patient_Dataset_(ILPD).csv)

The Indian Liver Patient Dataset (ILPD) is a dataset used for the analysis and prediction of liver diseases in patients. This dataset comes from a database of Indian patients and aims to assist in identifying individuals who may be suffering from liver diseases based on various clinical and biochemical characteristics.

### Dataset Characteristics:
- **Number of instances**: 583 patients
- **Number of variables/features**: 10 features plus 1 target column (10+1)

### Variables in the ILPD dataset:

1. **Age**: Patient's age
   - **Description**: Indicates the age of the patient in years.

2. **Gender**: Patient's gender
   - **Description**: Indicates the gender of the patient, recorded as 'Male' or 'Female'.

3. **Total_Bilirubin (TB)**: Total bilirubin
   - **Description**: The total amount of bilirubin in the blood. Bilirubin is a yellow substance produced during the normal breakdown of red blood cells. High levels can indicate liver problems.

4. **Direct_Bilirubin (DB)**: Direct bilirubin
   - **Description**: The amount of conjugated bilirubin in the blood. Direct bilirubin is bound to other molecules that make it water-soluble and can be a more specific indicator of liver diseases.

5. **Alkaline_Phosphotase (Alkphos)**: Alkaline phosphatase
   - **Description**: An enzyme related to the bile duct. Elevated levels may indicate bile duct blockage or liver disease.

6. **Alamine_Aminotransferase (Sgpt)**: Alanine aminotransferase
   - **Description**: An enzyme primarily found in the liver. Elevated levels can be a sign of liver damage.

7. **Aspartate_Aminotransferase (Sgot)**: Aspartate aminotransferase
   - **Description**: An enzyme found in the liver and other tissues of the body. Elevated levels may indicate liver damage.

8. **Total_Proteins (TP)**: Total proteins
   - **Description**: The total amount of proteins in the blood. Proteins are essential for the structure and function of all cells in the body.

9. **Albumin (ALB)**: Albumin
   - **Description**: A protein produced by the liver that helps maintain blood volume and pressure. Low levels may indicate liver problems.

10. **Albumin_and_Globulin_Ratio (A/G Ratio)**: Albumin-to-globulin ratio
    - **Description**: The proportion of albumin to globulin in the blood. This ratio can help identify different types of liver diseases.

11. **Dataset (Selector)**: Data selector
    - **Description**: A field used to split the data into two sets (labeled by experts). It is generally used to indicate whether the patient has liver disease (1) or not (2).

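This explanation aims to provide a better understanding of the content and purpose of each column in the ILPD dataset. Note that the code in this notebook refers to the gender column as `SEXO` and to the target column as `CLASS`, so it assumes a CSV file with those headers.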


In [None]:
import pandas as pd
import numpy as np

# Scikit-learn, statsmodels, SciPy, and Matplotlib utilities
from sklearn.impute import SimpleImputer  # For handling missing values
from sklearn.preprocessing import LabelEncoder  # For encoding categorical variables
from sklearn.preprocessing import StandardScaler  # For applying value transformations
from statsmodels.tools import add_constant  # To add a column of ones (constant) to the dataset
import statsmodels.api as sm  # To build a logistic regression model for feature selection
from scipy import stats  # To perform statistical calculations
from sklearn.linear_model import LogisticRegression  # Logistic regression model to fit the data
from sklearn.metrics import accuracy_score, classification_report, roc_curve, roc_auc_score  # Model evaluation metrics
import matplotlib.pyplot as plt  # Library for creating plots


## 1. Review and processing of missing values

In [None]:
df = pd.read_csv("/content/Indian_Liver_Patient_Dataset_(ILPD).csv")
print("Columns:", df.columns)

# Check the dataset status to determine if any imputation operations are necessary

# Identify missing values
missing_value_count = df.isnull().sum()
missing_value_percentage = (df.isnull().sum() / len(df)) * 100

# Create a DataFrame to display the count and percentage of missing values
missing_data = pd.DataFrame({
    'Count': missing_value_count,
    'Percentage': missing_value_percentage
})

print(f" - Missing values: \n {missing_data}")

In [None]:
# Imputation of missing values using the SimpleImputer algorithm from scikit-learn
# We fill missing values based on the median of the column to be imputed. Note: this
# procedure (median imputation) is applied to numerical data columns.
imputer = SimpleImputer(strategy='median')
numeric_columns = df.select_dtypes(include=['float64', 'int64']).columns.tolist()

df[numeric_columns] = imputer.fit_transform(df[numeric_columns])

# Missing values after the operation
print(f" => Number of missing values: \n {df.isnull().sum()}")

## 2. Data transformation

In [None]:
# Data Transformation Operations

# Encode the 'SEXO' (gender) column as a numerical variable
print("Values in the 'SEXO' column:", np.unique(df['SEXO']))

# Use a dictionary to perform the encoding, mapping Male to 1 and Female to 0
df['SEXO'] = df['SEXO'].map({'Male': 1, 'Female': 0})

# Remap the target variable to take binary values 1 or 0
df['CLASS'] = df['CLASS'].apply(lambda x: 1 if x == 1 else 0)

# Verify the data types and the transformations performed
print("\n - Data types:", df.dtypes)
print(f"\n - First rows of the dataset after categorical variable encoding:\n {df.head(10)}")


## 3. Feature Selection
We will explore how to use a logistic regression model to identify the most relevant features by statistically evaluating their predictive capacity. Feature selection is an important step in building machine learning models because it can improve model interpretability, reduce overfitting, and enhance overall model performance.

### Feature Selection Process Context:

1. **Add a Constant**:
   - **Purpose**: Include an intercept in the logistic regression model. This allows the model to correctly adjust the baseline prediction.

2. **Fit the Logit Model**:
   - **Purpose**: Train the logistic regression model using the dataset features. This involves finding the coefficients that best relate the independent features to the dependent variable.

3. **Model Evaluation**:
   - **Purpose**: Assess the model's fit through the fitted model's summary, which includes coefficients, standard errors, p-values, etc.

4. **Feature Selection Based on p-values**:
   - **Purpose**: Identify which features are statistically significant in predicting the target variable. A p-value is the probability of observing a coefficient at least as far from zero as the estimated one if the true coefficient were zero.
   - **Process**: Relevant features are those with p-values below a threshold (typically 0.05), meaning an estimate this extreme would arise less than 5% of the time if the feature had no association with the target.
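
As a rough illustration of where these p-values come from: for each coefficient, the statsmodels `Logit` summary reports a z-statistic (the estimate divided by its standard error) and a two-sided p-value taken from the standard normal distribution. The sketch below reproduces that calculation with the standard library only. The helper name `wald_p_value` and the example numbers are hypothetical; they are not taken from the ILPD fit.

```python
import math

def wald_p_value(coef, std_err):
    """Two-sided p-value for H0: coefficient = 0, using a standard normal reference."""
    z = coef / std_err
    # For a standard normal Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical coefficients and standard errors (made up for illustration)
examples = {"feature_a": (0.80, 0.25), "feature_b": (0.10, 0.30)}
alpha = 0.05
for name, (coef, se) in examples.items():
    p = wald_p_value(coef, se)
    print(f"{name}: z = {coef / se:.2f}, p = {p:.4f}, keep = {p < alpha}")
```

A coefficient that is large relative to its standard error yields a small p-value and passes the threshold; one that is small relative to its standard error does not.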

### Importance of the Process:

- **Dimensionality Reduction**: By identifying and selecting only the most relevant features, the number of variables in the model can be reduced, simplifying the model and potentially improving its performance.
- **Model Performance Improvement**: Removing irrelevant or redundant features can enhance model accuracy and reduce overfitting risk.
- **Interpretability**: A model with fewer features is easier to interpret and understand, which is crucial in applications such as medicine or finance.
- **Computational Efficiency**: Simpler models require fewer computational resources to train and evaluate.

In summary, the feature selection process using logistic regression models and p-values is a powerful technique for building effective and efficient predictive models, allowing analysts to focus on the variables that truly matter.


In [None]:
# Add a constant to the set of features
X = df.drop('CLASS', axis=1)

df_constant = sm.add_constant(X)

# Fit the Logit model
model = sm.Logit(df['CLASS'], df_constant)
result = model.fit()

# Display the model summary
print(f" - Results: {result.summary()}")

# Identify the most relevant features

p_values = result.pvalues
relevant_features = p_values[p_values < 0.05].index.tolist()

# Remove the constant from relevant features if present
if 'const' in relevant_features:
    relevant_features.remove('const')

print("Relevant features based on p-values:", relevant_features)
print("- p-values:\n", p_values)


### Comments:

The previous results show a p-value higher than the recommended threshold (5%) for some features. This suggests that they have a weak statistical relationship with the likelihood of liver disease.

Next, we will use the backward elimination technique to remove variables that provide less information. In [this link](https://medium.com/@abhinav.mahapatra10/ml-basics-feature-selection-part-2-3b9b3e71c14a), you will find more information about this feature selection technique.

The backward elimination technique consists of removing the least significant variables one by one, followed by repeatedly running the regression until all attributes have p-values below 0.05.

Additional reference:
* [Multiple Linear Regression (Backward Elimination Technique)](https://barcelonageeks.com/ml-regresion-lineal-multiple-tecnica-de-eliminacion-hacia-atras/)


In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

def back_feature_elem(df, dep_var, cols, alpha=0.05):
    """
    Takes the DataFrame, the dependent variable, and a list of column names.
    Repeatedly fits a logistic regression, removing the feature with the highest p-value
    in each iteration until all p-values are below alpha. Returns the final fitted result,
    or None if every column is eliminated.
    """
    cols = list(cols)  # Work on a copy so the caller's list is not modified
    while len(cols) > 0:  # Continue the process until no columns are left to evaluate
        model = sm.Logit(dep_var, df[cols])  # Create a logistic model with the current columns
        result = model.fit(disp=0)  # Fit the model without displaying the process
        largest_pvalue = round(result.pvalues, 3).nlargest(1)  # Find the highest p-value and round to three decimals
        if largest_pvalue.iloc[0] < alpha:  # All remaining p-values are below alpha: stop here
            return result
        # Otherwise remove the column with the highest p-value and refit
        cols.remove(largest_pvalue.index[0])
    return None  # No column reached significance

# Start from the design matrix built earlier (features plus the 'const' intercept column);
# the target variable is taken from df['CLASS']
if 'const' not in df_constant.columns:
    df_constant['const'] = 1
cols = df_constant.columns.tolist()  # Candidate columns for VIF filtering and backward elimination

def calculate_vif(df, cols):
    vif_data = pd.DataFrame()
    vif_data["feature"] = cols
    vif_data["VIF"] = [variance_inflation_factor(df[cols].values, i) for i in range(len(cols))]
    return vif_data

# Calculate VIF
vif_df = calculate_vif(df_constant, cols)
print(vif_df)

# Remove variables with very high VIF (e.g., VIF > 10), always keeping the intercept column
high_vif = set(vif_df.loc[~(vif_df['VIF'] < 10), 'feature']) - {'const'}
cols = [col for col in cols if col not in high_vif]

result = back_feature_elem(df_constant, df['CLASS'], cols)
print(result.summary())  # Display the summary of the final model
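
The variance inflation factor used in the cell above measures how well one feature can be predicted from the others: VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing feature j on the remaining features. The sketch below computes it with NumPy on synthetic data, as an assumption-laden illustration rather than a replacement for the statsmodels function. The name `vif_from_scratch` is hypothetical, and the helper adds its own intercept, so it expects a matrix without a constant column.

```python
import numpy as np

def vif_from_scratch(X):
    """Return the VIF of each column of X (2-D array, no constant column)."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        # Regress column j on all other columns plus an intercept
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Synthetic data (an assumption for illustration): x2 is almost a copy of x1
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.01 * rng.normal(size=200)
x3 = rng.normal(size=200)
vifs = vif_from_scratch(np.column_stack([x1, x2, x3]))
print(dict(zip(["x1", "x2", "x3"], vifs.round(2))))
```

The two nearly identical columns receive very large VIF values, while the independent column stays close to 1, which is why a cutoff such as 10 is a common rule of thumb for flagging collinearity.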


## 4. Model training

In [None]:
# Select the features flagged as significant (p < 0.05) in the initial Logit fit

X = df[relevant_features]  # Feature matrix
y = df['CLASS']  # Class vector (target variable)

import sklearn
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.20, random_state=5)

# Print the shapes and first few values of the training data
print("X train shape:", X_train.shape)
print("y train shape:", y_train.shape)
print("First ten values of y_train:", y_train[:10])


### We train and evaluate the resulting model

In [None]:
# Train the logistic regression model
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

# Predict on the test set
y_pred = logreg.predict(X_test)

# Evaluate the model's accuracy
acc = accuracy_score(y_test, y_pred)
print(50 * "*")
print("\n => Model Accuracy: => {:.2f}".format(acc))

# Display a more detailed classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))


In [None]:
# Plot the ROC curve
y_pred_prob = logreg.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
roc_auc = roc_auc_score(y_test, y_pred_prob)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()

## Results and Model Evaluation

Based on the model's evaluation results, several conclusions can be drawn regarding its performance. Below is a breakdown of the results and their implications:

### Results:

1. **Model Accuracy**: 0.70
   - **Description**: The model achieves 70% accuracy, meaning it correctly classifies 70% of the test instances.

2. **Classification Report**:
   - **Classes**:
     - `0`: Does not have liver disease.
     - `1`: Has liver disease.

   - **Metrics per Class**:
     - **Precision**:
       - Class `0`: 0.58
       - Class `1`: 0.71
     - **Recall**:
       - Class `0`: 0.19
       - Class `1`: 0.94
     - **F1-Score**:
       - Class `0`: 0.29
       - Class `1`: 0.81
     - **Support**:
       - Class `0`: 37 instances
       - Class `1`: 80 instances

   - **Averages**:
     - **Macro Average**:
       - Precision: 0.65
       - Recall: 0.56
       - F1-Score: 0.55
     - **Weighted Average**:
       - Precision: 0.67
       - Recall: 0.70
       - F1-Score: 0.64
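
To see how these figures relate to one another, the sketch below recomputes them from a confusion matrix. The counts are not printed in the notebook: they are an assumption, inferred because they reproduce every rounded value in the report above (7 and 30 cases for true class `0`, 5 and 75 for true class `1`). The helper `class_metrics` is hypothetical.

```python
# Assumed confusion counts (inferred from the rounded report, not taken from the notebook output).
# Keys are (true class, predicted class).
counts = {(0, 0): 7, (0, 1): 30, (1, 0): 5, (1, 1): 75}

def class_metrics(counts, cls):
    """Precision, recall, F1 and support for one class."""
    tp = counts[(cls, cls)]
    predicted = sum(n for (t, p), n in counts.items() if p == cls)
    support = sum(n for (t, p), n in counts.items() if t == cls)
    precision = tp / predicted
    recall = tp / support
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, support

total = sum(counts.values())
accuracy = (counts[(0, 0)] + counts[(1, 1)]) / total
per_class = {cls: class_metrics(counts, cls) for cls in (0, 1)}

# Macro average: plain mean over classes; weighted average: mean weighted by support
macro = [sum(m[i] for m in per_class.values()) / 2 for i in range(3)]
weighted = [sum(m[i] * m[3] for m in per_class.values()) / total for i in range(3)]

print(f"accuracy = {accuracy:.2f}")
for cls, (p, r, f1, n) in per_class.items():
    print(f"class {cls}: precision = {p:.2f}, recall = {r:.2f}, f1 = {f1:.2f}, support = {n}")
print("macro avg   :", [round(v, 2) for v in macro])
print("weighted avg:", [round(v, 2) for v in weighted])
```

Under these assumed counts, only 7 of the 37 patients without liver disease are recognized as such, which is the source of the 0.19 recall for class `0`.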

### Conclusions:

1. **Overall Performance**:
   - The model achieves 70% accuracy, indicating that it correctly predicts 70% of the cases overall.

2. **Performance by Class**:
   - **Class `0` (Does not have liver disease)**:
     - Precision is low (0.58), meaning that when the model predicts no liver disease, it is correct 58% of the time.
     - Recall is very low (0.19), indicating the model identifies only 19% of patients who truly do not have liver disease.
     - The F1-Score is poor (0.29), dragged down mainly by the very low recall for this class.
   - **Class `1` (Has liver disease)**:
     - Precision is high (0.71), meaning the model is correct 71% of the time when predicting liver disease.
     - Recall is very high (0.94), indicating the model identifies 94% of patients who truly have liver disease.
     - The F1-Score is strong (0.81), reflecting a good balance between precision and recall for this class.

3. **Class Imbalance**:
   - There is a class imbalance in the test set (37 instances of `0` versus 80 of `1`), which likely affects the performance metrics.
   - The model appears better suited to identifying patients with liver disease (`1`) than those without it (`0`).

### Implications:

- **Need for Adjustment**: The poor performance on Class `0` suggests that the model might benefit from adjustments such as class-balancing techniques (e.g., undersampling, oversampling) or threshold tuning.
- **Risk Evaluation**: In medical contexts, correctly identifying patients with a disease is critical. However, the model must also improve its ability to identify patients without the disease to minimize false positives.
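
Threshold tuning can be sketched without any library. For a binary problem, `LogisticRegression.predict` labels a case as `1` when its predicted probability of class `1` exceeds 0.5; raising that cutoff trades recall on class `1` for recall on class `0`. The labels and probabilities below are made up for illustration, and `recall_per_class` is a hypothetical helper.

```python
def recall_per_class(y_true, probs, threshold):
    """Recall for class 0 and class 1 when predicting 1 if prob >= threshold."""
    y_pred = [1 if p >= threshold else 0 for p in probs]
    recalls = {}
    for cls in (0, 1):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        recalls[cls] = sum(y_pred[i] == cls for i in idx) / len(idx)
    return recalls

# Made-up labels and predicted probabilities of class 1
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
probs = [0.35, 0.55, 0.62, 0.70, 0.58, 0.66, 0.75, 0.81, 0.90, 0.95]

for threshold in (0.5, 0.6, 0.7):
    r = recall_per_class(y_true, probs, threshold)
    print(f"threshold = {threshold:.1f}: recall(0) = {r[0]:.2f}, recall(1) = {r[1]:.2f}")
```

With real data, the same idea applies to the output of `logreg.predict_proba(X_test)[:, 1]`; the cutoff would ideally be chosen on a validation split rather than on the test set.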

### Next Steps:

- **Model Adjustment**: Apply techniques to handle class imbalance and improve precision and recall for Class `0`.
- **Cross-Validation**: Use cross-validation to obtain a more robust assessment of the model's overall performance.
- **Explore Alternative Models**: Evaluate other classification algorithms that might better handle class imbalance and improve overall performance.
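
A minimal sketch of the first two steps, assuming scikit-learn is available: stratified 5-fold cross-validation of a logistic regression with and without `class_weight="balanced"`. The data here is synthetic (generated with `make_classification` at a roughly 30% / 70% class split), so the scores say nothing about ILPD itself; `X_demo` and `y_demo` are placeholder names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the ILPD features (assumption: roughly 30% / 70% class split)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, weights=[0.3, 0.7], random_state=0
)

# Stratified folds keep the class proportions similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for class_weight in (None, "balanced"):
    clf = LogisticRegression(max_iter=1000, class_weight=class_weight)
    # balanced_accuracy averages recall over both classes,
    # so it is not inflated by the majority class
    scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring="balanced_accuracy")
    print(f"class_weight={class_weight}: mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```

To apply this to the notebook's data, `X_demo` and `y_demo` would be replaced by the `X` and `y` defined in the model training section.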

### Summary:

While the model performs well for identifying patients with liver disease, it requires significant improvement to correctly classify those without the disease. Further adjustments and evaluations are essential to enhance its reliability in both categories.
