<a href="https://www.kaggle.com/code/ibrahimgenuine/blood-donation?scriptVersionId=143539391" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# INTRODUCTION

Blood transfusion is a critical medical procedure that saves lives, from replacing lost blood during major surgeries or severe injuries to treating various diseases and blood disorders. Ensuring an adequate supply of blood when needed is a significant challenge for healthcare professionals. According to WebMD, "approximately 5 million Americans need blood transfusions each year."

Our dataset is obtained from a mobile blood donation vehicle in Taiwan.

The data is stored in the file transfusion.data and structured according to the RFMTC marketing model (a variation of RFM).

RFM is typically used for customer segmentation and allows for analysis based on characteristics such as Recency (R), Frequency (F), and Monetary (M). These features are often adapted for modeling customer lifetime value, churn prediction, and customer segmentation.

Here, however, these features are applied to a social welfare problem: blood donation.

In this dataset:

---RFMTC Components---

1. Recency (R) - "Recency (months)"

This feature represents how long it has been since a donor's last donation. Generally, donors who have donated more recently are more likely to donate again in the future.

2. Frequency (F) - "Frequency (times)"

This is the total number of donations a donor has made. Donors who have donated more often tend to have a higher likelihood of donating in the future.

3. Monetary (M) - "Monetary (c.c. blood)"

This feature represents the total amount of blood donated by a donor. Donors who donate a higher volume of blood are typically considered more valuable.

4. Time (T) - "Time (months)"

This shows how long it has been since a donor's first donation. This feature can be used to understand how "loyal" a donor has been throughout their donation history.

5. Churn (C) - "whether he/she donated blood in March 2007"

This is a binary indicator of whether a donor donated blood during a specific time period, March 2007 (1 = donated, 0 = did not donate). In RFMTC terms, churn corresponds to not donating during that period, so a value of 0 marks a churned donor.

---Uses of RFMTC---

1. Segmentation: Donors can be segmented into different categories using these features. For example, donors with high "F" and low "R" values can be labeled as "Loyal Donors."

2. Prediction: The likelihood of future donations can be predicted using the current RFMTC values.

3. Targeting: Special campaigns or incentives can be used to target specific donor segments.

4. Risk Analysis: Donors with low frequency and a high churn rate can be labeled as "High-Risk," and tailored strategies can be developed for them.

This modeling technique is highly useful for understanding the future behavior of donors and managing them more effectively. It can be used to model the likelihood of donors donating blood in the future.
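As a rough illustration of the segmentation and risk-analysis ideas listed above, the sketch below labels donors in a small made-up table. The cut-offs (`Recency <= 4`, `Frequency >= 5`, `Recency >= 12`, `Frequency <= 2`), the "Other" label, and the toy values are assumptions chosen for illustration; they are not derived from the dataset.

```python
import numpy as np
import pandas as pd

# Toy donors (hypothetical values, not taken from transfusion.data)
toy = pd.DataFrame({
    "Recency":   [2, 4, 16, 23, 1],
    "Frequency": [12, 6, 1, 2, 3],
})

# Assumed cut-offs, chosen only for illustration
loyal = (toy["Recency"] <= 4) & (toy["Frequency"] >= 5)
high_risk = (toy["Recency"] >= 12) & (toy["Frequency"] <= 2)

# np.select takes the first matching condition; all other rows get "Other"
toy["Segment"] = np.select(
    [loyal, high_risk], ["Loyal Donor", "High-Risk"], default="Other"
)
print(toy)
```

In practice the cut-offs would be chosen from the observed distributions, for example from quantiles of `Recency` and `Frequency`.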

# Importing Libraries

In [None]:
import numpy as np
import pandas as pd

# Exploratory Data Analysis and Visualization

In [None]:


# Read the .data file (comma-separated, with a header row)
df = pd.read_csv("/kaggle/input/transfusion/transfusion.data")

df


## Change the column names if necessary

In [None]:
new_column_names = {
    'Recency (months)': 'Recency',
    'Frequency (times)': 'Frequency',
    'Monetary (c.c. blood)': 'Monetary',
    'Time (months)': 'Time',
    'whether he/she donated blood in March 2007': 'Target'
                   }
            
df.rename(columns=new_column_names, inplace=True)

## Get the first 5 lines

In [None]:
df.head(5)

## Look at the general information

In [None]:
df.info()

## Look at the shape

In [None]:
df.shape

## Check for missing values

In [None]:
df.isna().sum()

## Check for duplicated values

In [None]:
df.duplicated().sum()
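`df.duplicated()` marks only the second and later copies of a row, so the sum above counts the extra copies rather than every row that has a twin. A minimal sketch on made-up rows (the `toy` frame is hypothetical) shows the difference when `keep=False` is passed:

```python
import pandas as pd

# Three identical rows and one distinct row (hypothetical values)
toy = pd.DataFrame({
    "Recency":   [2, 2, 2, 4],
    "Frequency": [3, 3, 3, 1],
})

extra_copies = toy.duplicated().sum()            # copies beyond the first occurrence
all_involved = toy.duplicated(keep=False).sum()  # every row that has at least one twin
print(extra_copies, all_involved)
```

Because the dataset has no donor identifier column, identical rows may belong to different donors, so removing them is a modelling choice rather than a pure cleaning step.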

## Check the dtype

In [None]:
df.dtypes

## Inspect and drop duplicated rows

In [None]:
# Show the rows that repeat an earlier row
duplicate_rows = df[df.duplicated()]
duplicate_rows

In [None]:
# Delete duplicates but keep the first occurrence of each
df.drop_duplicates(keep='first', inplace=True)
df.shape

In [None]:
df.head(20)

## Calculate the basic statistical values

In [None]:
df.describe().T

## Check unique values

In [None]:
df.Recency.unique()

In [None]:
df.Target.unique()

In [None]:
df.Target.nunique()

In [None]:
df.columns

## Calculate the average of 'Recency'

In [None]:
df.Recency.mean()

## Find the highest value in 'Frequency'

In [None]:
df.Frequency.max()

## Calculate the median of 'Time'

In [None]:
df.Time.median()

## Calculate the standard deviation of 'Monetary'

In [None]:
df.Monetary.std()

## Count the number of unique values in 'Time'

In [None]:
df.Time.nunique()

## Calculate the ratio of donors in March 2007 (Target=1) to total donors

In [None]:
(df.Target==1).sum()/ len(df.Target)

In [None]:
df.Target.value_counts(normalize=True)

## Filter donors with 'Recency' less than 10 months

In [None]:
len(df[df.Recency<10])

In [None]:
df[df.Recency<10]

In [None]:
df.query("Recency< 10") 

## Select donors who donated at least 5 times

In [None]:
len(df.query("Frequency >= 5") )

## Create a new column giving the time between the first donation and the last donation

In [None]:
df["Donation_Period"] = df.Time-df.Recency
df.Donation_Period

## Outlier Analysis for 'Monetary'

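In a box plot, the multiplier for the Interquartile Range (IQR) used to determine outliers is typically 1.5. In some cases a stricter multiplier of 3 is used, so that only extreme values are flagged. The cells below use a different rule, the z-score, and flag values that lie more than 3 standard deviations from the mean.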

In [None]:
from scipy import stats

In [None]:
z_scores = np.abs(stats.zscore(df["Monetary"]))
outliers = np.where(z_scores>3)[0] # z scores bigger than 3

In [None]:
z_scores 

In [None]:
outliers  

In [None]:
df.iloc[outliers]
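The cells above apply the z-score rule. For comparison, the IQR rule described at the start of this section can be sketched as follows. The helper name `iqr_outlier_mask` and the toy `monetary` values are assumptions made for illustration.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy volumes in c.c. (hypothetical, not from the dataset)
monetary = pd.Series([250, 250, 500, 500, 750, 750, 1000, 2250, 12500])

mask_15 = iqr_outlier_mask(monetary, k=1.5)  # usual box-plot rule
mask_3 = iqr_outlier_mask(monetary, k=3.0)   # stricter rule, flags fewer points
print(monetary[mask_15].tolist(), monetary[mask_3].tolist())
```

On skewed columns such as `Monetary`, the IQR rule is often preferred because the quartiles are less affected by the extreme values than the mean and standard deviation are.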

## Create a simple scoring model based on 'Recency' and 'Frequency'

In [None]:
df["Donation_Score"] = (1 / df.Recency) + df.Frequency
# Rows with Recency == 0 produce inf here because of the division by zero.

In [None]:
df.head(2)

In [None]:
# Donation_Score is inf where Recency == 0, so fall back to Frequency alone for those rows
df["Donation_Score1"] = np.where(df.Recency == 0, df.Frequency, (1 / df.Recency) + df.Frequency)
# The corrected score is stored in a new column, Donation_Score1

In [None]:
df.head(3)
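An alternative that avoids the division by zero altogether is to shift the denominator by one, so that a donor with `Recency == 0` gets the largest recency term instead of `inf`. This is a sketch of a different formula, not the scoring used above; the column name `Donation_Score_Shifted` and the toy values are assumptions.

```python
import numpy as np
import pandas as pd

# Toy donors with equal Frequency and different Recency (hypothetical values)
toy = pd.DataFrame({"Recency": [0, 1, 4], "Frequency": [3, 3, 3]})

# 1 / (1 + Recency) is always finite and lies in (0, 1]
toy["Donation_Score_Shifted"] = 1 / (1 + toy["Recency"]) + toy["Frequency"]
print(toy)
```

With this form the recency term can never outweigh a single extra donation, which may or may not be the desired weighting.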

## Convert Time to Years and Months (Time Series Transformation)

In [None]:
df["Years"]=df.Time//12
df["Years"]

In [None]:
df["Months"]=df.Time % 12
df["Months"]

In [None]:
df.head(3)

## Calculate the correlation of 'Target' with other features (Correlation Analysis)

In [None]:
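# numeric_only=True (assumes pandas >= 1.5) keeps corr() working if the notebook is re-run after categorical columns exist
df.corr(numeric_only=True)["Target"].sort_values(ascending=False)

In [None]:
df.corr(numeric_only=True)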
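`Donation_Score` contains `inf` wherever `Recency` is 0, and a Pearson correlation is not meaningful for a column with infinite values. One hedged way to handle this is to turn `inf` into `NaN` before calling `corr`, so that those rows are skipped for that column. The toy frame below is hypothetical:

```python
import numpy as np
import pandas as pd

# Toy scores with one infinite value (hypothetical, not from the dataset)
toy = pd.DataFrame({
    "Donation_Score": [np.inf, 5.5, 2.25, 1.1, 8.2],
    "Target":         [1, 1, 0, 0, 1],
})

# Replace +/-inf with NaN; corr() then ignores those rows pairwise
clean = toy.replace([np.inf, -np.inf], np.nan)
corr_with_target = clean.select_dtypes(include="number").corr()["Target"]
print(corr_with_target)
```

Selecting numeric columns explicitly also keeps the call working once categorical columns have been added to the frame.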

## Create donor groups based on 'Frequency' (Grouping and Aggregation)

In [None]:
bins= [0, 5, 10, 50]
group_names = ["Low", "Medium", "High"]
df["Frequency_Group"] = pd.cut(df["Frequency"], bins, labels=group_names)
df.sample(10)

In [None]:
df.groupby("Frequency_Group")["Monetary"].mean()

## Create a new categorical variable based on 'Recency'

In [None]:
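bins = [0, 12, 24, 36, 74]
group_names = ["0-12 Month", "13-24 Month", "25-36 Month", "37-74 Month"]
# include_lowest=True puts Recency == 0 into the first bin instead of leaving it as NaN
df["Recency_Categorical"] = pd.cut(df["Recency"], bins, labels=group_names, include_lowest=True)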
df.sample(10)
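By default `pd.cut` builds right-closed intervals such as `(0, 12]`, so a `Recency` of exactly 0 falls outside every bin and becomes `NaN` unless `include_lowest=True` is passed. A minimal sketch on made-up values (the `recency` series is hypothetical):

```python
import pandas as pd

recency = pd.Series([0, 1, 12, 13, 74])
bins = [0, 12, 24, 36, 74]
labels = ["0-12 Month", "13-24 Month", "25-36 Month", "37-74 Month"]

# Default behaviour: the value 0 is not inside (0, 12] and is left unlabelled
default_cut = pd.cut(recency, bins, labels=labels)

# include_lowest=True closes the first interval on the left as well
inclusive_cut = pd.cut(recency, bins, labels=labels, include_lowest=True)

print(default_cut.isna().sum(), inclusive_cut.isna().sum())
```

Checking `isna().sum()` on the new column is a quick way to confirm that every donor landed in a bin.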

## Check the distribution of 'Recency_Categorical' and the 'Target' variable

In [None]:
df["Recency_Categorical"].value_counts()

In [None]:
df["Recency_Categorical"].value_counts(normalize=True).round(3)

In [None]:
df["Target"].value_counts(normalize=True).round(3)