# Data Analytics Fall 2025 &mdash; Exercises 6

### XXXXX XXXXX (last modified: Tue 18 Nov)

- Five problems + round 5 peer review
- Theme: logistic regression
- Keep your originals up to date by running the code cell below:

In [None]:
import os
os.system('/usr/bin/bash /home/varpha/dan/config.sh');

## Round 5 peer review

As before.

## Use of AI in this exercise
I used AI tools (e.g. ChatGPT) to:
- Learn new concepts related to the tasks
- Brainstorm solutions
- Generate and adapt sample code to solve the problems in different ways


### Problem 1. Wines


[Here](https://student.labranet.jamk.fi/~varpha/data_analytics/exrc06p01_wine.csv) is some data on Portuguese wines. 

Drop rows with missing values.

Use logistic regression to predict the type (white/red) from the other fields.

Split train/test set 70/30 %. Print the score and the confusion matrix.
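
A stratified 70/30 split can be sketched on toy data as below. The arrays are made up for illustration (80 samples of class 0, 20 of class 1) and are not the wine data. The sketch only shows that `train_size=0.7` together with `stratify=y` keeps the class ratio the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 samples, 80 of class 0 and 20 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=42
)

print(len(X_train), len(X_test))      # sizes of the two parts
print(y_train.mean(), y_test.mean())  # share of class 1 in each part
```

Without `stratify`, the share of the smaller class in the test set would vary from one random split to another.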


### Solution 1

In [6]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

csv_location = "exrc06p01_wine.csv"

# Load input CSV which contains data related to Portuguese wines into pandas.DataFrame
df = pd.read_csv(csv_location)

# # Get basic information about data
# print(df.info())  # prints concise summary about DataFrame's structure
# print(df.head())  # prints first five rows - default

# Drop rows with missing values ie nan entries
df_clean = df.dropna().reset_index(drop=True)   # removes null/nan entries and reset index
# print(df_clean.info()) # check if there are no null values

# Extract Features (X) and Target (y) using ´iloc´ indexer
X = df_clean.iloc[:, 1:]    # Features (all rows and columns except 1st column)
y = df_clean.iloc[:, 0]     # Target (all rows with 1st column Only)

# # Lets print sample data and datatypes for Features and Target
# # Note: Regression models works only with numeric data
# print(X.head())
# print(X.info())   # All Features are numeric
# print(y.head())
# print(y.info())   # Target value is non-numeric needs to convert to numeric

# Lets print what diff values on Target column ie ´type´.
# print(y.value_counts())   # values are either "white" or "red"

# Map ´type´ column ie Target to binary: white -> 1, red -> 0
# Note: Regression models works only with numeric data
y = y.map({"white": 1, "red": 0})

# print(y.info())           # confirm the datatype again
# print(y.value_counts())   # confirm the values / distribution for Target ie ´type´ column
# # Target values ie classes are imbalanced 1 (white) -> 4870 and 0 (red) -> 1593

# Split train/test set 70/30 %
X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y,
    train_size=0.7, # to split data as 70% for training and rest 30% for testing
    stratify=y,     # to keep the same class ratio (for Target Column) in training and test sets
    random_state=42 # to ensure same rows go to train and test sets in every run for consistency purpose
)

# Create Logistic Regression Model
model = LogisticRegression(
    class_weight="balanced",    # automatically handle imbalanced classes (Target) by adjusting weights
    max_iter=2000,              # allow more steps so the model can fully converge
    solver="liblinear"          # a good choice for binary classification on small datasets
)

# Train the Logistic Regression model
model.fit(X_train, y_train)

# Prediction with test data
y_pred = model.predict(X_test)

# Evaluate model's Accuracy Score and Confusion Matrix
acc_score = accuracy_score(y_test, y_pred)      
conf_matrix = confusion_matrix(y_test, y_pred)

print(f"Accuracy Score with Logistic Regression Model: {acc_score:.4f}")
print(f"\nConfusion Matrix with Logistic Regression Model:\n{conf_matrix}")

Accuracy Score with Logistic Regression Model: 0.9737

Confusion Matrix with Logistic Regression Model:
[[ 466   12]
 [  39 1422]]
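
As a cross-check, the accuracy can be recomputed from the confusion matrix printed above. The sketch assumes scikit-learn's convention that rows are true classes and columns are predicted classes, in sorted label order (0 = red, 1 = white after the mapping used in the solution). The per-class recall is an extra that the exercise does not ask for.

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows are true classes, columns are predicted classes; 0 = red, 1 = white.
conf_matrix = np.array([[466, 12],
                        [39, 1422]])

# Correct predictions sit on the diagonal
accuracy = np.trace(conf_matrix) / conf_matrix.sum()

# Share of each true class that was predicted correctly
recall_per_class = np.diag(conf_matrix) / conf_matrix.sum(axis=1)

print(f"Accuracy: {accuracy:.4f}")
print(f"Recall (red, white): {recall_per_class.round(4)}")
```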


### Problem 2. Voices
[Here](https://student.labranet.jamk.fi/~varpha/data_analytics/exrc06p02_voice.csv) is some data on human voices ([column info](https://student.labranet.jamk.fi/~varpha/data_analytics/exrc06p02_voice.txt)).
 
Predict the label from the other fields using a support vector machine.

Split train/test set 70/30 %.

Print the score and the confusion matrix.


### Solution 2

With **Option 1 - without scaling**, the features have very different numeric ranges. An RBF-kernel SVM works with distances between samples, so features with large values dominate those distances and features on a smaller scale barely influence the model. The decision boundary then reflects mostly the large-range features, and test accuracy is poor (0.6614).

With **Option 2 - with scaling**, `StandardScaler` brings every feature to zero mean and unit variance, so all features contribute to the distance on a comparable scale. The SVM can then use the information in all features, and test accuracy rises to 0.9811.
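
The effect can be reproduced on synthetic data. The sketch below does not use the voice data: it assumes four informative features from `make_classification` plus one made-up noise feature with values about 1000 times larger. Wrapping the scaler and the SVM in `make_pipeline` is a choice made here for brevity, not the approach used in the solutions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data (not the voice data): 4 informative features plus
# one pure-noise feature whose values are about 1000 times larger
X, y = make_classification(
    n_samples=600, n_features=4, n_informative=4, n_redundant=0, random_state=0
)
rng = np.random.default_rng(0)
X = np.column_stack([X, rng.normal(0, 1000, size=len(X))])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=42
)

# Same SVM twice: once on raw features, once behind a StandardScaler
raw_svm = SVC(kernel="rbf").fit(X_train, y_train)
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X_train, y_train)

acc_raw = raw_svm.score(X_test, y_test)
acc_scaled = scaled_svm.score(X_test, y_test)
print(f"Without scaling: {acc_raw:.4f}")
print(f"With scaling:    {acc_scaled:.4f}")
```

The pipeline fits the scaler on the training part only and reuses it for the test part, which matches the `fit_transform` / `transform` pattern.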

### Solution 2 (Option 1 - without scaling)

In [8]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, accuracy_score

csv_location = "exrc06p02_voice.csv"

# Load input CSV which contains data related to human voices into pandas.DataFrame
df = pd.read_csv(csv_location)

# # Get basic information about data
# print(df.info())  # prints concise summary about DataFrame's structure
# print(df.head())  # prints first five rows - default

# Extract Features (X) and Target (y) using ´iloc´ indexer
X = df.iloc[:, :-1]   # Features (all rows and columns except last column)
y = df.iloc[:, -1]    # Target (all rows with last column Only)

# # Lets print info related to datatypes for Features and Target
# # Note: Regression models works only with numeric data
# print(X.info())   # All Features are numeric
# print(y.info())   # Target value is non-numeric needs to convert to numeric

# Lets print what diff values on Target ie ´label´ column 
# print(y.value_counts())   # values are either male or female

# Map ´label´ column ie Target to binary: male -> 0, female -> 1
y = y.map({"male": 0, "female": 1})

# print(y.info())         # confirm the datatype again
# print(y.value_counts()) # confirm the values and distribution for Target ie ´label´ column 

# Target values ie classes are equally distributed 50% male ie 0 and 50% female ie 1

# Split train/test set 70/30 %
X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    train_size=0.7, # to split data as 70% for training and rest 30% for testing
    stratify=y,     # to keep the same class ratio in training and test sets
    random_state=42 # to ensure same rows go to train and test sets in every run for consistency purpose
)

# Create SVM model (Support Vector Classifier)
model = SVC(
    kernel="rbf",             # use RBF kernel to learn non-linear decision boundaries
    random_state=42           # ensure reproducible and consistent results
)

# Train the SVM model
model.fit(X_train, y_train)

# Prediction with test data
y_pred = model.predict(X_test)

# Evaluate model's Accuracy Score and Confusion Matrix
acc_score = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print(f"Accuracy Score with SVM Model: {acc_score:.4f}")
print(f"\nConfusion Matrix with SVM Model:\n{conf_matrix}")

Accuracy Score with SVM Model: 0.6614

Confusion Matrix with SVM Model:
[[377  99]
 [223 252]]


### Solution 2 (Option 2 - with scaling)
 

In [9]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.preprocessing import StandardScaler

csv_location = "exrc06p02_voice.csv"

# Load input CSV which contains data related to human voices into pandas.DataFrame
df = pd.read_csv(csv_location)

# # Get basic information about data
# print(df.info())  # prints concise summary about DataFrame's structure
# print(df.head())  # prints first five rows - default

# Extract Features (X) and Target (y) using ´iloc´ indexer
X = df.iloc[:, :-1]   # Features (all rows and columns except last column)
y = df.iloc[:, -1]    # Target (all rows with last column Only)

# # Lets print info related to datatypes for Features and Target
# # Note: Regression models works only with numeric data
# print(X.info())   # All Features are numeric
# print(y.info())   # Target value is non-numeric needs to convert to numeric

# Lets print what diff values on Target ie ´label´ column 
# print(y.value_counts())   # values are either male or female

# Map ´label´ column ie Target to binary: male -> 0, female -> 1
y = y.map({"male": 0, "female": 1})

# print(y.info())         # confirm the datatype again

# print(y.value_counts()) # confirm the values and distribution for Target ie ´label´ column 
# Target values ie classes are equally distributed 50% male ie 0 and 50% female ie 1

# Split train/test set 70/30 %
X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    train_size=0.7, # to split data as 70% for training and rest 30% for testing
    stratify=y,     # to keep the same class ratio in training and test sets
    random_state=42 # to ensure same rows go to train and test sets in every run for consistency purpose
)

# Standardize features for SVM model
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


# Create SVM model (Support Vector Classifier)
model = SVC(
    kernel="rbf",             # use RBF kernel to learn non-linear decision boundaries
    random_state=42           # ensure reproducible and consistent results
)

# Train the SVM model
model.fit(X_train_scaled, y_train)

# Prediction with test data
y_pred = model.predict(X_test_scaled)

# Evaluate model's Accuracy Score and Confusion Matrix
acc_score = accuracy_score(y_test, y_pred)      
conf_matrix = confusion_matrix(y_test, y_pred)

print(f"Accuracy Score with SVM Model: {acc_score:.4f}")
print(f"\nConfusion Matrix with SVM Model:\n{conf_matrix}")

Accuracy Score with SVM Model: 0.9811

Confusion Matrix with SVM Model:
[[465  11]
 [  7 468]]


### Problem 3. NBA
[Here](https://student.labranet.jamk.fi/~varpha/data_analytics/exrc06p03_nba.csv) is some data on NBA basketball players in their first season ([column info](https://student.labranet.jamk.fi/~varpha/data_analytics/exrc06p03_nba.csv)).

The last column tells if a player's career has exceeded 5 years or not.

Fill any missing values with the field median.

Try to predict if the career has exceeded 5 years or not by using both logistic regression and a support vector machine. Print scores and confusion matrices. Split train/test data as you wish. Compare the results.
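
One possible way to do the median fill is scikit-learn's `SimpleImputer`. The sketch below uses a toy frame with made-up values, and the column names are only illustrative, not taken from the NBA file.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with made-up values; the column names are only illustrative
toy = pd.DataFrame({
    "games": [36.0, 35.0, np.nan, 58.0],
    "three_pct": [25.0, np.nan, 24.4, 22.6],
})

# Learn each column's median, then replace NaN with it
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(toy.median())   # per-column medians used for the fill
print(filled)
```

If the imputer is fitted on the training part only and then applied to the test part with `transform`, the medians do not depend on any test rows.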


### Solution 3
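I solved the problem in two ways, as in Solution 2: **Option 1 - without scaling** and **Option 2 - with scaling**. This time scaling made little difference: SVM accuracy went from 0.6998 to 0.7047 and logistic regression from 0.7097 to 0.7022, so the two models perform about equally well at roughly 70 % accuracy.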

### Solution 3 (Option 1 - without scaling)

In [12]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

csv_location = "exrc06p03_nba.csv"

# Load input CSV which contains data on NBA players' first seasons into pandas.DataFrame
df = pd.read_csv(csv_location)

# # Get basic information about data
# print(df.info())  # prints concise summary about DataFrame's structure
# print(df.head())  # prints first five rows - default

# Extract Feature (X) and Target (y)
# Drop non-feature column ie ´Name´
X = df.drop(columns=["TARGET_5Yrs", "Name"])
y = df["TARGET_5Yrs"]

# Lets print sample data and datatypes for Features and Target
# Note: Regression models works only with numeric data
# print(X.head())
# print(X.info())   # All Features are numeric but contains null values
# print(y.head())
# print(y.info())   # Target value is numeric

# Replace missing values with the median of each column
# print("\nMissing values on Features:\n", X.isna().sum())
X_cleaned = X.fillna(X.median(numeric_only=True))  # medians taken from the feature columns only
# print("\nCheck missing values after imputation:\n", X_cleaned.isna().sum())

# Lets print what diff values on Target ie ´TARGET_5Yrs´ column 
# print(y.value_counts())   # confirm the values / distribution for Target ie ´TARGET_5Yrs´ column

# Split train/test set 70/30 %
X_train, X_test, y_train, y_test = train_test_split(
    X_cleaned,
    y,
    train_size=0.7, # to split data as 70% for training and rest 30% for testing
    stratify=y,     # to keep the same class ratio in training and test sets
    random_state=42 # to ensure same rows go to train and test sets in every run for consistency purpose
)

# Create SVM model (Support Vector Classifier)
svm_model = SVC(
    #class_weight="balanced",  # adjust importance of classes to handle imbalanced data
    kernel="rbf",             # use RBF kernel to learn non-linear decision boundaries
    random_state=42           # ensure reproducible and consistent results
)

# Train the SVM model
svm_model.fit(X_train, y_train)

# Prediction with test data
y_pred_svm = svm_model.predict(X_test)

# Evaluate model's Accuracy Score and Confusion Matrix
acc_score_svm = accuracy_score(y_test, y_pred_svm)          # Calculate Accuracy score
confusion_matrix_svm = confusion_matrix(y_test, y_pred_svm) # Calculate Confusion Matrix

print(f"Accuracy Score with SVM Model: {acc_score_svm:.4f}")
print(f"Confusion Matrix with SVM Model:\n{confusion_matrix_svm}")

# Create Logistic Regression Model
model_lr = LogisticRegression(
    #class_weight="balanced",    # automatically handle imbalanced classes by adjusting weights
    max_iter=2000,              # allow more steps so the model can fully converge
    solver="liblinear"          # a good choice for binary classification on small datasets
)

# Train the Logistic Regression model
model_lr.fit(X_train, y_train)

# Prediction with test data
y_pred_lr = model_lr.predict(X_test)

# Evaluate model's Accuracy Score and Confusion Matrix
acc_score_lr = accuracy_score(y_test, y_pred_lr)
confusion_matrix_lr = confusion_matrix(y_test, y_pred_lr)

print(f"\nAccuracy Score with Logistic Regression Model: {acc_score_lr:.4f}")
print(f"Confusion Matrix with Logistic Regression Model:\n{confusion_matrix_lr}")

Accuracy Score with SVM Model: 0.6998
Confusion Matrix with SVM Model:
[[ 73  80]
 [ 41 209]]

Accuracy Score with Logistic Regression Model: 0.7097
Confusion Matrix with Logistic Regression Model:
[[ 83  70]
 [ 47 203]]


### Solution 3 (Option 2 - with scaling)

In [13]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.preprocessing import StandardScaler

csv_location = "exrc06p03_nba.csv"

# Load input CSV which contains data on NBA players' first seasons into pandas.DataFrame
df = pd.read_csv(csv_location)

# # Get basic information about data
# print(df.info())  # prints concise summary about DataFrame's structure
# print(df.head())  # prints first five rows - default

# Extract Features (X) and Target (y)
# Drop non-feature column ie ´Name´ 
X = df.drop(columns=["TARGET_5Yrs", "Name"])
y = df["TARGET_5Yrs"]

# Lets print sample data and datatypes for Features and Target
# Note: Regression models works only with numeric data
# print(X.head())
# print(X.info())   # All Features are numeric but contains null values
# print(y.head())
# print(y.info())   # Target value is numeric

# Replace missing values with the median of each column
# print("\nMissing values on Features:\n", X.isna().sum())
X_cleaned = X.fillna(X.median(numeric_only=True))  # medians taken from the feature columns only
# print("\nCheck missing values after imputation:\n", X_cleaned.isna().sum())

# Lets print what diff values on Target ie ´TARGET_5Yrs´ column 
# print(y.value_counts())   # confirm the values / distribution for Target ie ´TARGET_5Yrs´ column

# Split train/test set 70/30 %
X_train, X_test, y_train, y_test = train_test_split(
    X_cleaned,
    y,
    train_size=0.7, # to split data as 70% for training and rest 30% for testing
    stratify=y,     # to keep the same class ratio in training and test sets
    random_state=42 # to ensure same rows go to train and test sets in every run for consistency purpose
)

# Standardize features for SVM and Logistic Regression models
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create SVM model (Support Vector Classifier)
svm_model = SVC(
    # class_weight="balanced",  # adjust importance of classes to handle imbalanced data
    kernel="rbf",             # use RBF kernel to learn non-linear decision boundaries
    random_state=42           # ensure reproducible and consistent results
)

# Train the SVM model
svm_model.fit(X_train_scaled, y_train)

# Prediction with test data
y_pred_svm = svm_model.predict(X_test_scaled)

# Evaluate model's Accuracy Score and Confusion Matrix
acc_score_svm = accuracy_score(y_test, y_pred_svm)          # Calculate Accuracy score
confusion_matrix_svm = confusion_matrix(y_test, y_pred_svm) # Calculate Confusion Matrix

print(f"Accuracy Score with SVM Model: {acc_score_svm:.4f}")
print(f"Confusion Matrix with SVM Model:\n{confusion_matrix_svm}")

# Create Logistic Regression Model
# Use liblinear solver, good for smaller datasets and binary classification
model_lr = LogisticRegression(
    # class_weight="balanced",    # automatically handle imbalanced classes by adjusting weights
    max_iter=2000,              # allow more steps so the model can fully converge
    solver="liblinear"          # a good choice for binary classification on small datasets
)

# Train the Logistic Regression model
model_lr.fit(X_train_scaled, y_train)

# Prediction with test data
y_pred_lr = model_lr.predict(X_test_scaled)

# Evaluate model's Accuracy Score and Confusion Matrix
acc_score_lr = accuracy_score(y_test, y_pred_lr)
confusion_matrix_lr = confusion_matrix(y_test, y_pred_lr)

print(f"\nAccuracy Score with Logistic Regression Model: {acc_score_lr:.4f}")
print(f"Confusion Matrix with Logistic Regression Model:\n{confusion_matrix_lr}")

Accuracy Score with SVM Model: 0.7047
Confusion Matrix with SVM Model:
[[ 82  71]
 [ 48 202]]

Accuracy Score with Logistic Regression Model: 0.7022
Confusion Matrix with Logistic Regression Model:
[[ 79  74]
 [ 46 204]]


### Problem 4.  Mushrooms
[Here](https://student.labranet.jamk.fi/~varpha/data_analytics/exrc06p04_mushrooms.csv) is some data on mushrooms ([column info](https://student.labranet.jamk.fi/~varpha/data_analytics/exrc06p04_mushrooms.txt)).

Try to predict the class (edible or poisonous) from the other fields. Use whatever you want!

Fields are categorical, so one-hot encoding (or dummy encoding) is needed.
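
One-hot encoding turns each category value into its own 0/1 column. A minimal sketch with `pd.get_dummies` on a toy frame follows. The column names and single-letter codes are made up for illustration and are not taken from the mushroom file.

```python
import pandas as pd

# Toy frame with made-up single-letter codes; column names are illustrative
toy = pd.DataFrame({
    "cap": ["x", "b", "x"],
    "odor": ["n", "a", "n"],
})

# One new column per (original column, value) pair, filled with 0/1
encoded = pd.get_dummies(toy, dtype=int)
print(encoded)
```

Each new column is named `<column>_<value>`, so the two toy columns with two values each become four 0/1 columns.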


### Solution 4

In [10]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

csv_location = "exrc06p04_mushrooms.csv"

# Load input CSV which contains some data on mushrooms into pandas.DataFrame
df = pd.read_csv(csv_location)

# # Get basic information about data
# print(df.info())  # prints concise summary about DataFrame's structure
# print(df.head())  # prints first five rows - default

# Extract Features (X) and Target (y) using ´iloc´ indexer
X = df.iloc[:, 1:]    # Features (all rows and columns except 1st column)
y = df.iloc[:, 0]     # Target (all rows with 1st column Only)

# All columns are categorical, so we can use one-hot encode with ´get_dummies()´ method
X_encoded = pd.get_dummies(
    X, 
    drop_first=False # to keep all dummy columns
)

# Split train/test set 70/30 %
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, 
    y, 
    train_size=0.7, # to split data as 70% for training and rest 30% for testing
    stratify=y,     # to keep the same class ratio in training and test sets
    random_state=42 # to ensure same rows go to train and test sets in every run for consistency purpose
)

# Create Logistic Regression Model
model = LogisticRegression(
    class_weight="balanced",    # automatically handle imbalanced classes by adjusting weights
    max_iter=2000,              # allow more steps so the model can fully converge
    solver="liblinear"          # a good choice for binary classification on small datasets
)

# Train the Logistic Regression model
model.fit(X_train, y_train)

# Prediction with test data
y_pred = model.predict(X_test)

# Evaluate model's Accuracy Score and Confusion Matrix
acc_score = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print(f"Accuracy Score with Logistic Regression Model: {acc_score:.4f}")
print(f"\nConfusion Matrix with Logistic Regression Model:\n{conf_matrix}")

Accuracy Score with Logistic Regression Model: 0.9996

Confusion Matrix with Logistic Regression Model:
[[1263    0]
 [   1 1174]]


### Problem 5. Loan status
[Here](https://student.labranet.jamk.fi/~varpha/data_analytics/exrc06p05_loan.txt) is some data on loanees. The last column (Loan_Status Y/N) should be predicted from the other fields. Use whatever you want.  


Do modifications:
* categorical fields to numeric (two-value fields to 0/1, multivalue as dummies/onehot)
* replace missing values with median
* remove rows with outliers: ApplicantIncome, CoapplicantIncome or LoanAmount over 3 standard deviations away from field average
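
The outlier rule in the last item can be sketched on a single toy column. The numbers are made up (19 ordinary values and one extreme value), and only the column name is borrowed from the task text.

```python
import pandas as pd

# Toy column: 19 ordinary values and one extreme value (made-up numbers)
toy = pd.DataFrame({"ApplicantIncome": list(range(90, 109)) + [10000]})

col = toy["ApplicantIncome"]
z = (col - col.mean()) / col.std()   # distance from the mean in standard deviations
kept = toy[z.abs() <= 3]             # keep rows within 3 standard deviations

print(len(toy), "->", len(kept))
```

For several columns the same mask can be computed per column and combined so that a row is kept only if it passes in every column.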


Check what probability the model gives for Loan_Status = Yes with these values:

```
Gender                   Male
Married                    No
Dependents                  0
Education            Graduate
Self_Employed              No
ApplicantIncome          2400
CoapplicantIncome        2000
LoanAmount                 36
Loan_Amount_Term          360
Credit_History              1
Property_Area           Urban
```
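
For a logistic regression model, a class probability like this is the sigmoid of a linear score. The sketch below assumes a made-up one-feature data set (not the loan data) and shows that `predict_proba` for class 1 agrees with the sigmoid of `decision_function`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up one-feature training data: larger x tends to mean class 1
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])

model = LogisticRegression(solver="liblinear").fit(X, y)

new_case = np.array([[3.2]])
proba = model.predict_proba(new_case)[0]       # [P(class 0), P(class 1)]
score = model.decision_function(new_case)[0]   # linear score w * x + b
prob_yes = 1 / (1 + np.exp(-score))            # sigmoid of the linear score

print(model.classes_)   # column order of predict_proba
print(proba[1], prob_yes)
```

The columns of `predict_proba` follow `model.classes_`, so with labels 0/1 the probability of "Yes" is in column 1.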

### Solution 5

In [14]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

csv_location = "exrc06p05_loan.csv"

# Load input CSV which contains some data on loanees into pandas.DataFrame
df = pd.read_csv(csv_location)

# # Get basic information about data
# print(df.info())  # prints concise summary about DataFrame's structure
# print(df.head())  # prints first five rows - default

# Drop the ´Loan_ID´ column: it is an identifier, not a feature
df = df.drop(columns=["Loan_ID"])
# print(df.info())  # prints concise summary about DataFrame's structure


# Identify numeric columns
numeric_cols = df.select_dtypes(include=["number"]).columns

# print("\nMissing values on numeric columns:\n", df[numeric_cols].isna().sum())

# Replace missing values in numeric columns with median
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# print("\nCheck missing values on numeric columns after imputation:\n", df[numeric_cols].isna().sum())

# Columns to check for outliers
outlier_cols = ["ApplicantIncome", "CoapplicantIncome", "LoanAmount"]

# Compute mean and standard deviation for selected outlier columns
means = df[outlier_cols].mean()
stds  = df[outlier_cols].std()

# print("Shape before removing outliers", df.shape)

# Keep only rows where ALL selected columns are within ±3 standard deviations of the column mean
df = df[
    ((df[outlier_cols] - means).abs() <= 3 * stds).all(axis=1)
].reset_index(drop=True)

# print("Shape after removing outliers", df.shape)

# Identify non-numeric columns
non_numeric_cols = df.select_dtypes(exclude=["number"]).columns

# print("\nMissing values on non-numeric columns:\n", df[non_numeric_cols].isna().sum())
# A few columns (Gender-21, Married-3, Self_Employed-52) have missing values
# Use the mode() method to fill the missing data with the most frequent value

for col in non_numeric_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

# print("\nCheck missing values on non-numeric columns post imputation:\n", df[non_numeric_cols].isna().sum())


# Find the diff values on non_numeric or categorical columns
# for col in non_numeric_cols:
#     print(f"\n--- {col} ---")
#     print(df[col].value_counts())

# Some columns have only two possible values
# Map those columns to 0 and 1
mappings = {
    "Loan_Status": {"Y": 1, "N": 0},
    "Married": {"Yes": 1, "No": 0},
    "Gender": {"Male": 1, "Female": 0},
    "Self_Employed": {"Yes": 1, "No": 0},
    "Education": {"Graduate": 1, "Not Graduate": 0}
}

for col, mp in mappings.items():
    df[col] = df[col].map(mp)


# The ´Property_Area´ column has more than two values
# Add a 1/0 dummy variable for each value of ´Property_Area´
df = pd.get_dummies(df, columns=["Property_Area"], drop_first=False, dtype=int)

# # Let see sample and information post preprocessing and feature engineering
# print(df.head())
# print(df.info())

# # Extract Features (X) and Target (y)
X = df.drop(columns=["Loan_Status"]) 
y = df["Loan_Status"]

# print(y.value_counts())   # confirm the values / distribution for Target ie ´Loan_Status´ column

# Split train/test set 70/30 %
X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    train_size=0.7, # to split data as 70% for training and rest 30% for testing
    stratify=y,     # to keep the same class ratio in training and test sets
    random_state=42 # to ensure same rows go to train and test sets in every run for consistency purpose
)

# Create Logistic Regression Model
model = LogisticRegression(
    class_weight="balanced",    # automatically handle imbalanced classes by adjusting weights
    max_iter=2000,              # allow more steps so the model can fully converge
    solver="liblinear"          # a good choice for binary classification on small datasets
)

# Train the Logistic Regression model
model.fit(X_train, y_train)

# Prediction with test data
y_pred = model.predict(X_test)

# Evaluate model with Accuracy Score and Confusion Matrix
acc_score = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print(f"Accuracy Score with Logistic Regression Model: {acc_score:.4f}")
print(f"Confusion Matrix with Logistic Regression Model:\n{conf_matrix}")

# Check Model's probability to Loan_status = Yes with below data
new_applicant = pd.DataFrame({
    "Gender": [1],
    "Married": [0],
    "Dependents": [0],
    "Education": [1],
    "Self_Employed": [0],
    "ApplicantIncome": [2400],
    "CoapplicantIncome": [2000],
    "LoanAmount": [36],
    "Loan_Amount_Term": [360],
    "Credit_History": [1],
    "Property_Area_Rural": [0],
    "Property_Area_Semiurban": [0],
    "Property_Area_Urban": [1]
})

# Ensure correct column order for the above data
new_applicant = new_applicant[X.columns]

# Compute probability and Prediction
prob_yes = model.predict_proba(new_applicant)[0, 1]
prediction = model.predict(new_applicant)[0]

print(f"\nProbability of Loan Approval (Yes) for the given case: {prob_yes:.4f}")
print("Predicted Loan_Status for the given case:", "Yes" if prediction == 1 else "No")

Accuracy Score with Logistic Regression Model: 0.8233
Confusion Matrix with Logistic Regression Model:
[[ 48  29]
 [ 21 185]]

Probability of Loan Approval (Yes) for the given case: 0.7402
Predicted Loan_Status for the given case: Yes
