Blood Donor Prediction

Step 1: Download the Dataset with KaggleHub

In [None]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("ninalabiba/blood-transfusion-dataset")

print("Path to dataset files:", path)

Step 2: Read Blood Donation Data with Pandas

In [None]:
import pandas as pd
transfusion = pd.read_csv(r"C:\Users\hp\.cache\kagglehub\datasets\ninalabiba\blood-transfusion-dataset\versions\transfusion.csv")
transfusion
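
The path in the cell above is tied to one machine's cache layout. A minimal sketch of deriving it from the directory returned in Step 1 instead, assuming the CSV sits somewhere under that directory; `find_csv` is a hypothetical helper, not part of KaggleHub, and the demo uses a throwaway folder in place of the real cache:

```python
from pathlib import Path
import tempfile


def find_csv(dataset_dir, filename="transfusion.csv"):
    """Return the first file called `filename` anywhere under `dataset_dir`."""
    matches = sorted(Path(dataset_dir).rglob(filename))
    if not matches:
        raise FileNotFoundError(f"{filename} not found under {dataset_dir}")
    return matches[0]


# Demo on a temporary directory that stands in for the KaggleHub cache.
with tempfile.TemporaryDirectory() as tmp:
    fake_version_dir = Path(tmp) / "versions" / "1"
    fake_version_dir.mkdir(parents=True)
    (fake_version_dir / "transfusion.csv").write_text("a,b\n1,2\n")
    found = find_csv(tmp)
    found_name = found.name
    print(found_name)
```

In the notebook this would be used as `pd.read_csv(find_csv(path))`, where `path` is the value printed in Step 1.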

### ✅ Task 1 : Inspecting the Dataset

Using the shell command `!head -n5 datasets/transfusion.data`, I printed the first 5 lines of the raw file to check its structure and contents. This initial inspection confirmed the format and column order before loading the data with pandas.


In [None]:
!head -n5 datasets/transfusion.data

### ✅ Task 2 : Loading the Dataset

I imported the pandas library and loaded the `transfusion.csv` file into a DataFrame named `transfusion`. Calling `head()` shows the first five rows and confirms the 5 expected columns; the row count is checked with `info()` in the next task.


In [None]:
transfusion.head()

### ✅ Task 3 : Inspecting DataFrame Structure

Using the `info()` method, I examined the structure of the `transfusion` DataFrame. It confirmed that none of the 5 columns contains missing values, the data types are appropriate for analysis, and the dataset contains 749 entries. This step ensures the data is ready for preprocessing.


In [None]:
transfusion.info()
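
The same checks can also be made programmatically rather than read off the `info()` printout. A small sketch on a made-up frame; the column names and values here are placeholders, not the real dataset:

```python
import pandas as pd

# Tiny made-up frame; column names are placeholders, not the real ones.
demo = pd.DataFrame({
    "recency": [2, 0, 1, 4],
    "frequency": [50, 13, 16, 20],
    "donated": [1, 1, 0, 0],
})

n_rows, n_cols = demo.shape                  # row and column counts
missing_per_column = demo.isna().sum()       # missing values per column
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in demo.dtypes)

print(n_rows, n_cols)
print(missing_per_column.to_dict())
print(all_numeric)
```

Applied to `transfusion`, `shape` gives the row count directly, which `head()` alone cannot.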

### ✅ Task 4 : Renaming Target Column

Renamed the column `'whether he/she donated blood in March 2007'` to `'target'` for brevity and clarity using `rename()` with `inplace=True`. Verified the change by printing the first two rows with `head(2)`, confirming the updated column name is now reflected in the DataFrame.


In [None]:
transfusion.rename(columns={"whether he/she donated blood in March 2007":"target"},inplace=True)

In [None]:
transfusion.head(2)
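
As an alternative to `inplace=True`, `rename()` can return a new DataFrame that is then reassigned, which leaves the original untouched. A sketch on a two-row made-up frame that reuses only the column names mentioned in this notebook:

```python
import pandas as pd

# Made-up values; only the column names come from the notebook.
demo = pd.DataFrame({
    "Monetary (c.c. blood)": [12500, 3250],
    "whether he/she donated blood in March 2007": [1, 0],
})

# rename() returns a new DataFrame; `demo` itself is not modified.
renamed = demo.rename(
    columns={"whether he/she donated blood in March 2007": "target"}
)

print(list(renamed.columns))
print(list(demo.columns))
```

Either style works here; reassignment just makes it easier to re-run a cell without side effects on earlier variables.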

### ✅ Task 5 : Target Incidence Analysis

Used `value_counts(normalize=True)` on the `transfusion.target` column to calculate the proportion of donors vs. non-donors. Rounded the output to 3 decimal places for clarity. This step helps understand the class distribution and highlights any imbalance in the target variable.


In [None]:
transfusion.target.value_counts(normalize=True).round(3)
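
To see why the class distribution matters, note that a model that always predicts the majority class reaches an accuracy equal to that class's share. A sketch with made-up labels (not the real target column):

```python
import pandas as pd

# Made-up labels (not the real data): 6 non-donors and 2 donors.
labels = pd.Series([0, 0, 0, 1, 0, 0, 1, 0], name="target")

proportions = labels.value_counts(normalize=True).round(3)
majority_class = proportions.idxmax()      # most frequent label
baseline_accuracy = proportions.max()      # accuracy of "always predict majority"

print(proportions.to_dict())
print(majority_class, baseline_accuracy)
```

This is one reason a ranking metric such as AUC, used later in this notebook, can be more informative than plain accuracy on imbalanced data.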

### ✅ Task 6 : Splitting Data for Model Training

Used `train_test_split()` from `sklearn.model_selection` to divide the `transfusion` DataFrame into training and testing sets:
- Features (`X`) and target (`y`) were separated.
- Stratified sampling on the target kept the donor/non-donor proportions the same in both sets.
- 75% of the data was used for training, 25% for testing.
- Verified the split by printing the first two rows of `x_train`.

This prepares the data for model building while maintaining label proportions.


In [None]:
from sklearn.model_selection import train_test_split

In [None]:
x_train,X_test,y_train,y_test = train_test_split(
    transfusion.drop(columns='target'),
    transfusion.target,
    test_size=0.25,
    random_state=42,
    stratify=transfusion.target
)



In [None]:
x_train.head(2)
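
Whether stratification preserved the label proportions can be checked by comparing the positive rate in each split. A sketch on synthetic data (400 rows, 25% positives), not the real dataset:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 400 rows, 25% positives (not the real dataset).
demo = pd.DataFrame({
    "feature": range(400),
    "target": [1] * 100 + [0] * 300,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    demo.drop(columns="target"),
    demo["target"],
    test_size=0.25,
    random_state=42,
    stratify=demo["target"],
)

# With stratify set, both rates should be close to the overall 25%.
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(len(X_tr), len(X_te))
print(round(train_rate, 3), round(test_rate, 3))
```

The same comparison on the notebook's data would use `y_train.mean()` and `y_test.mean()`.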

### ✅ Task 7 : TPOT Pipeline Optimization

Used `TPOTClassifier` to automatically discover the best machine learning pipeline:
- Optimized for `roc_auc` score to evaluate model performance.
- Set `random_state=42` for reproducibility.
- Trained using `.fit()` and evaluated with `roc_auc_score`.
- Displayed pipeline steps using `tpot.fitted_pipeline_.steps`.

This approach helped identify the most effective combination of preprocessing and modeling techniques for predicting blood donation.


In [None]:
# Import necessary libraries
from tpot import TPOTClassifier
from sklearn.metrics import roc_auc_score

# Create TPOTClassifier instance with minimal arguments
tpot = TPOTClassifier(
    generations=5,
    population_size=20,
    verbose=2,
    random_state=42
)

# Fit TPOT on training data
tpot.fit(x_train, y_train)

# Predict probabilities and calculate AUC score
y_pred_proba = tpot.predict_proba(X_test)[:, 1]
tpot_auc_score = roc_auc_score(y_test, y_pred_proba)
print("TPOT AUC Score:", round(tpot_auc_score, 4))

# Display pipeline steps
print("\nBest pipeline steps:")
for name, transform in tpot.fitted_pipeline_.steps:
    print(f"{name}: {transform}")
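
For intuition about the metric reported above: AUC can be read as the share of (donor, non-donor) pairs in which the donor receives the higher predicted score. A pure-Python sketch of that definition on made-up values; `pairwise_auc` is an illustrative helper, not how scikit-learn computes the score internally:

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half.
    """
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted probabilities, for illustration only.
y_demo = [0, 0, 1, 1]
p_demo = [0.1, 0.4, 0.35, 0.8]
auc_demo = pairwise_auc(y_demo, p_demo)
print(auc_demo)  # 3 of the 4 pairs are ranked correctly -> 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.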


### ✅ Task 8 : Feature Variance Check

Used `pandas.DataFrame.var()` to calculate column-wise variance in `x_train`:
- Rounded results to 3 decimal places for readability.
- This shows which features vary the most; a feature with much larger variance than the rest can dominate scale-sensitive models.

A useful step for understanding feature distribution before scaling or selection.


In [None]:
x_train.var().round(3)
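
Variance depends on a feature's units: multiplying a column by a constant multiplies its variance by the square of that constant. A sketch with made-up donation counts, assuming an illustrative 250 c.c. per donation (a conversion factor chosen for the example, not read from the dataset):

```python
from statistics import variance

# Made-up donation counts, for illustration only.
donations = [1, 2, 4, 7, 10]

# Same information expressed as volume, assuming 250 c.c. per donation
# (an illustrative factor, not taken from the dataset).
volume_cc = [250 * d for d in donations]

# statistics.variance is the sample variance (ddof=1), like DataFrame.var().
var_donations = variance(donations)
var_volume = variance(volume_cc)
ratio = var_volume / var_donations
print(var_donations, var_volume, ratio)
```

The ratio is 250 squared, so a large variance can reflect the unit of measurement as much as the information in the feature.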

### ✅ Task 9 : Correcting High Variance with Log Normalization

Identified the feature with the highest variance and applied log normalization to reduce its impact:
- Copied `x_train` and `X_test` into `X_train_normed` and `X_test_normed`.
- Used a for-loop to apply the same transformation to both datasets.
- Replaced the high-variance column with its log-normalized version.
- Verified the change by printing the updated variance, rounded to 3 decimal places.

This step helps stabilize feature scales and may improve model performance.


In [None]:
# Import numpy
import numpy as np

# Copy x_train and X_test into X_train_normed and X_test_normed
X_train_normed, X_test_normed = x_train.copy(), X_test.copy()

# Specify which column to normalize
col_to_normalize = 'Monetary (c.c. blood)'

# Log normalization
for df_ in [X_train_normed, X_test_normed]:
    # Add log normalized column
    df_['monetary_log'] = np.log(df_[col_to_normalize])
    # Drop the original column
    df_.drop(columns=col_to_normalize, inplace=True)

# Check the variance for X_train_normed
X_train_normed.var().round(3)
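
The same transformation can also be wrapped in a small function that returns a new frame instead of modifying one in place. A sketch on made-up values; `log_normalize` is a hypothetical helper, and it assumes every value in the column is strictly positive, since `np.log` is undefined at zero:

```python
import numpy as np
import pandas as pd


def log_normalize(df, column, new_name):
    """Return a copy of `df` with `column` replaced by its natural log."""
    out = df.copy()
    out[new_name] = np.log(out[column])
    return out.drop(columns=column)


# Made-up values, for illustration only; all must be > 0 for np.log.
train_demo = pd.DataFrame({"Monetary (c.c. blood)": [250, 500, 1000, 12500]})
test_demo = pd.DataFrame({"Monetary (c.c. blood)": [750, 2500]})

train_normed = log_normalize(train_demo, "Monetary (c.c. blood)", "monetary_log")
test_normed = log_normalize(test_demo, "Monetary (c.c. blood)", "monetary_log")

var_before = train_demo["Monetary (c.c. blood)"].var()
var_after = train_normed["monetary_log"].var()
print(round(var_before, 3), round(var_after, 3))
```

Because the log compresses large values far more than small ones, the variance after the transform is much smaller than before.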

### ✅ Task 10 : Training Logistic Regression Model

Trained a logistic regression model using scikit-learn:
- Imported `linear_model` from `sklearn` and instantiated `linear_model.LogisticRegression`.
- Trained it using `.fit()` on the log-normalized training data.
- Evaluated performance using `roc_auc_score` and printed the result.

This step provides a baseline model for comparison with TPOT’s optimized pipeline.

In [19]:
from sklearn.metrics import roc_auc_score


In [20]:
# Importing modules
from sklearn import linear_model

# Instantiate LogisticRegression
logreg = linear_model.LogisticRegression(
    solver='liblinear',
    random_state=42
)

# Train the model
logreg.fit(X_train_normed, y_train)

# AUC score for the logistic regression model
logreg_auc_score = roc_auc_score(y_test, logreg.predict_proba(X_test_normed)[:, 1])
print(f'\nAUC score: {logreg_auc_score:.4f}')


AUC score: 0.7891
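
The `[:, 1]` slice in the cell above relies on the columns of `predict_proba` following the order of `classes_`, so column 1 holds the probability of class `1`. A sketch on made-up one-feature data that looks the column up explicitly rather than assuming its position:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Made-up, one-feature data for illustration only.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_demo = np.array([0, 0, 1, 0, 1, 1])

model = LogisticRegression(solver="liblinear", random_state=42)
model.fit(X_demo, y_demo)

proba = model.predict_proba(X_demo)            # shape (n_samples, n_classes)
positive_col = list(model.classes_).index(1)   # column holding P(class == 1)
p_positive = proba[:, positive_col]

auc_demo = roc_auc_score(y_demo, p_positive)
print(model.classes_, positive_col)
print(round(auc_demo, 4))
```

With 0/1 labels the positive class is always column 1, so the notebook's slice and this lookup agree.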


### ✅ Task 11 : Sorting Models by AUC Score

Sorted models based on their AUC scores:
- Imported `itemgetter` from the `operator` module.
- Created a list of `(model_name, model_score)` pairs.
- Sorted the list in descending order using `sorted(..., reverse=True)`.

This step helps identify the best-performing model and supports informed model selection.
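
How `itemgetter(1)` drives the sort can be sketched on made-up pairs; the model names and scores below are hypothetical and are not results from this notebook:

```python
from operator import itemgetter

# Hypothetical scores for illustration; not the results from this notebook.
scores = [("model_a", 0.71), ("model_b", 0.78), ("model_c", 0.74)]

# itemgetter(1) picks the second element of each pair (the score) as sort key.
ranked = sorted(scores, key=itemgetter(1), reverse=True)
best_name, best_score = ranked[0]

print(ranked)
print(best_name, best_score)
```

The first element of the sorted list is the best-scoring model.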



In [None]:
# Importing itemgetter
from operator import itemgetter

# Sort models based on their AUC score from highest to lowest
sorted(
    [('tpot', tpot_auc_score), ('logreg', logreg_auc_score)],
    key=itemgetter(1),
    reverse=True