<a href="https://colab.research.google.com/github/avs20/SioLabsPython0/blob/main/Assignment_LogisticRegression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 1. Inspecting transfusion.data file
<p><img src="https://assets.datacamp.com/production/project_646/img/blood_donation.png" style="float: right;" alt="A pictogram of a blood bag with blood donation written in it" width="200"></p>
<p>Blood transfusion saves lives, from replacing lost blood during major surgery or a serious injury to treating various illnesses and blood disorders. Ensuring that there's enough blood in supply whenever needed is a serious challenge for health professionals. According to <a href="https://www.webmd.com/a-to-z-guides/blood-transfusion-what-to-know#1">WebMD</a>, "about 5 million Americans need a blood transfusion every year".</p>
<p>Our dataset is from a mobile blood donation vehicle in Taiwan. The Blood Transfusion Service Center drives to different universities and collects blood as part of a blood drive. We want to predict whether or not a donor will give blood the next time the vehicle comes to campus.</p>
<p>The data is stored in <code>datasets/transfusion.data</code> and is structured according to the RFMTC marketing model (a variation of RFM). We'll explore what that means later in this notebook. First, let's inspect the data.</p>

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [8]:
import os
os.chdir('/content/drive/MyDrive/SioLabs/Python For Machine Learning/Assignment/LogisticRegression')

In [9]:
!pwd

/content/drive/MyDrive/SioLabs/Python For Machine Learning/Assignment/LogisticRegression


In [12]:
# Print out the first 5 lines from the transfusion.data file
!head -5 datasets/transfusion.data

Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),"whether he/she donated blood in March 2007"
2 ,50,12500,98 ,1
0 ,13,3250,28 ,1
1 ,16,4000,35 ,1
2 ,20,5000,45 ,1


## 2. Loading the blood donations data
<p>We now know that we are working with a typical CSV file (i.e., the delimiter is <code>,</code>, etc.). We proceed to loading the data into memory.</p>
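As a minimal sketch of what loading such a file looks like, the snippet below parses the five lines printed by `head -5` above from an in-memory string, so it runs without the real file. The names `sample_csv` and `sample` are illustrative and not part of the assignment.

```python
import io

import pandas as pd

# The five lines shown by `head -5` above, inlined so the sketch does not
# depend on datasets/transfusion.data being present.
sample_csv = io.StringIO(
    "Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),"
    '"whether he/she donated blood in March 2007"\n'
    "2 ,50,12500,98 ,1\n"
    "0 ,13,3250,28 ,1\n"
    "1 ,16,4000,35 ,1\n"
    "2 ,20,5000,45 ,1\n"
)

# read_csv uses "," as the default delimiter and takes the first line as the header.
sample = pd.read_csv(sample_csv)

# Show the parsed rows.
print(sample.head())
```

Passing a file path instead of the `StringIO` object works the same way.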

In [None]:
# Import pandas
import ... as pd

# Read in dataset
transfusion = ...

# Print out the first rows of our dataset
# ... YOUR CODE FOR TASK 2 ...

## 3. Inspecting transfusion DataFrame
<p>Let's briefly return to our discussion of the RFM model. RFM stands for Recency, Frequency, and Monetary Value, and it is commonly used in marketing to identify your best customers. In our case, our customers are blood donors.</p>
<p>RFMTC is a variation of the RFM model. Below is a description of what each column means in our dataset:</p>
<ul>
<li>R (Recency - months since the last donation)</li>
<li>F (Frequency - total number of donations)</li>
<li>M (Monetary - total blood donated in c.c.)</li>
<li>T (Time - months since the first donation)</li>
<li>a binary variable representing whether he/she donated blood in March 2007 (1 stands for donating blood; 0 stands for not donating blood)</li>
</ul>
<p>It looks like every column in our DataFrame has a numeric type, which is exactly what we want when building a machine learning model. Let's verify our hypothesis.</p>
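One way to check column types is sketched below on a small DataFrame built from the four data rows printed earlier. The name `sample` is illustrative; the real DataFrame has many more rows.

```python
import pandas as pd

# Four rows copied from the `head -5` output above (not the full dataset).
sample = pd.DataFrame(
    {
        "Recency (months)": [2, 0, 1, 2],
        "Frequency (times)": [50, 13, 16, 20],
        "Monetary (c.c. blood)": [12500, 3250, 4000, 5000],
        "Time (months)": [98, 28, 35, 45],
        "whether he/she donated blood in March 2007": [1, 1, 1, 1],
    }
)

# info() prints the column names, non-null counts, and dtypes.
sample.info()

# A programmatic check that every column is numeric.
all_numeric = all(pd.api.types.is_numeric_dtype(dtype) for dtype in sample.dtypes)
print(all_numeric)
```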

In [None]:
# Print a concise summary of transfusion DataFrame
# ... YOUR CODE FOR TASK 3 ...

## 4. Creating target column
<p>We are aiming to predict the value in the <code>whether he/she donated blood in March 2007</code> column. Let's rename it to <code>target</code> so that it's more convenient to work with.</p>

In [None]:
# Rename target column as 'target' for brevity 
transfusion.rename(
    columns={'...': ...},
    inplace=True
)

# Print out the first 2 rows
# ... YOUR CODE FOR TASK 4 ...

## 5. Checking target incidence
<p>We want to predict whether or not the same donor will give blood the next time the vehicle comes to campus. The model for this is a binary classifier, meaning that there are only 2 possible outcomes:</p>
<ul>
<li><code>0</code> - the donor will not give blood</li>
<li><code>1</code> - the donor will give blood</li>
</ul>
<p>Target incidence is defined as the number of cases of each individual target value in a dataset. That is, how many 0s are there in the target column compared to how many 1s? Target incidence gives us an idea of how balanced (or imbalanced) our dataset is.</p>
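A minimal sketch of computing incidence as proportions is shown below. The `toy_target` values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import pandas as pd

# Hypothetical target column: three 0s and one 1.
toy_target = pd.Series([0, 0, 0, 1], name="target")

# normalize=True returns proportions instead of raw counts.
incidence = toy_target.value_counts(normalize=True).round(3)
print(incidence)
```

Here three of the four values are `0`, so the proportions are 0.75 for `0` and 0.25 for `1`.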

In [None]:
# Print target incidence proportions, rounding output to 3 decimal places
# ... YOUR CODE FOR TASK 5 ...

## 6. Splitting transfusion into train and test datasets
<p>We'll now use the <code>train_test_split()</code> function to split the <code>transfusion</code> DataFrame.</p>


Think about the distribution of classes in the train and test datasets. Do we want them to be the same or different?
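To make the question concrete, here is a sketch on a synthetic dataset (the `toy` DataFrame and its column names are made up) that uses the `stratify` argument of `train_test_split` so that both parts keep the same class proportions. Whether that is what you want for this assignment is for you to decide.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 25% of them in the positive class.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {"feature": rng.normal(size=100), "target": [0] * 75 + [1] * 25}
)

# stratify=... asks for the same class proportions in the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    toy.drop(columns="target"),
    toy["target"],
    test_size=0.2,
    random_state=42,
    stratify=toy["target"],
)

# Compare the share of positives in each part.
print(y_train.mean(), y_test.mean())
```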

In [None]:
# Import train_test_split function
from sklearn.model_selection import ...

# Split transfusion DataFrame into
# X_train, X_test, y_train and y_test datasets,

# Print out the first 2 rows of X_train
# ... YOUR CODE FOR TASK 6 ...

## 7. Building Models
After splitting the data, we want to create some models.

Use logistic regression to create three basic models: one with L1 regularization, one with L2 regularization, and one with no regularization. <b>Please report your findings on the weight vectors and the accuracy you get on all 3 models.</b>
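The sketch below shows one possible way to set up such a comparison on synthetic data from `make_classification`, not on the transfusion data. It makes two assumptions: L1 is requested through the `liblinear` solver, and "no regularization" is only approximated with a very large `C`. The way to select or disable the penalty has changed across scikit-learn releases, so check the documentation for your installed version.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and target.
X, y = make_classification(
    n_samples=400, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

models = {
    # L1 penalty; liblinear is one of the solvers that supports it.
    "l1": LogisticRegression(penalty="l1", solver="liblinear", C=1.0),
    # L2 penalty is the default.
    "l2": LogisticRegression(C=1.0, max_iter=1000),
    # Assumption: a huge C makes the penalty negligible (approximately unregularized).
    "approx_none": LogisticRegression(C=1e9, max_iter=1000),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Store the weight vector and the test accuracy for each model.
    results[name] = (model.coef_.ravel(), model.score(X_test, y_test))
    print(name, results[name][0].round(3), round(results[name][1], 3))
```

Comparing the printed weight vectors side by side is one way to see how each penalty shrinks the weights.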



In [13]:
# Model 1 

In [None]:
# Model 2

In [None]:
# Model 3

## 8. Checking the Variance 

One of the assumptions for linear models is that the data and the features we are giving it are related in a linear fashion, or can be measured with a linear distance metric. If a feature in our dataset has a high variance that's orders of magnitude greater than the other features, this could impact the model's ability to learn from other features in the dataset.

Correcting for large differences in feature variance is called normalization. It is one of the transformations you can apply before training a model. Let's check the variance to see whether such a transformation is needed.

Normalization is one type of scaling: we subtract the mean and divide by the standard deviation (this is also known as standardization). It is the scaling used in class.

Use the code cells below to check the data properties, then comment on whether normalization is needed and, if so, on which column or columns.
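As a sketch of one such check, the snippet below computes per-column variance on the four data rows printed at the top of the notebook. Four rows are not representative of the full dataset, so treat the numbers only as an illustration of the call; the name `sample` is illustrative.

```python
import pandas as pd

# Feature columns from the four rows shown by `head -5` (not the full dataset).
sample = pd.DataFrame(
    {
        "Recency (months)": [2, 0, 1, 2],
        "Frequency (times)": [50, 13, 16, 20],
        "Monetary (c.c. blood)": [12500, 3250, 4000, 5000],
        "Time (months)": [98, 28, 35, 45],
    }
)

# Sample variance of each column (pandas uses ddof=1 by default).
variances = sample.var().round(3)
print(variances)

# The column whose variance is largest.
print(variances.idxmax())
```

`sample.describe()` gives a broader summary (mean, standard deviation, min, max) if variance alone is not enough.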



In [None]:
# Code to check data properties

In [None]:
# add your comment for whether normalization is required or not 

## 9. Normalizing data 

Based on your analysis and comment above, if you decide to normalize, perform the steps in the code cells below. If you decide not to, proceed to step 11.
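If normalization turns out to be warranted, one common approach is scikit-learn's `StandardScaler`, which subtracts the mean and divides by the standard deviation. The sketch below uses synthetic arrays and assumes the scaler is fitted on the training part only; the approach used in class may differ.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (stand-ins for real columns).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[1.0, 1000.0], scale=[1.0, 500.0], size=(80, 2))
X_test = rng.normal(loc=[1.0, 1000.0], scale=[1.0, 500.0], size=(20, 2))

scaler = StandardScaler()
# Learn the mean and standard deviation from the training data only...
X_train_scaled = scaler.fit_transform(X_train)
# ...and reuse them to transform the test data.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(3), X_train_scaled.std(axis=0).round(3))
```

Fitting on the training part only keeps information about the test set out of the preprocessing step.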

In [None]:
# Normalize the columns if evidence is present in your analysis. Else leave it blank and proceed to step 11

## 10. New models after normalization
If you have normalized the data, create new models on the updated dataset.

Please report the accuracy for all 3 models:

1. No Regularization 
2. With L1 - Regularization 
3. With L2 - Regularization 

In [None]:
# Model 1 

In [None]:
# Model 2

In [None]:
# Model 3 

## 11. Hyperparameter tuning

Since we are using regularization, we have added a hyperparameter to our loss function.

What is that hyperparameter, and what is it called in sklearn? Reply in a text cell below.

We need to tune that hyperparameter.
Please use the techniques covered in class for hyperparameter tuning to find the best model.
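The tuning technique used in class is not stated here, so the sketch below assumes a cross-validated grid search with `GridSearchCV` over the regularization strength of `LogisticRegression`. The data is synthetic and the grid values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data.
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)

# Candidate values for the regularization hyperparameter (arbitrary grid).
param_grid = {"C": [0.01, 0.1, 1.0, 10.0, 100.0]}

# 5-fold cross-validation, scoring each candidate by accuracy.
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="accuracy"
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` is the model refitted on all the supplied data with the best parameters.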

In [None]:
# Do hyper parameter tuning 

## 12. Final model 

After tuning, you have completed most of the steps and just need to train a model using all the insights you have gained.

Build one final model using logistic regression and all the analysis done so far.

What is your accuracy?

In [None]:
# Final Model 

In [None]:
# Final model accuracy

## 13. Conclusion
<p>The demand for blood fluctuates throughout the year. As one <a href="https://www.kjrh.com/news/local-news/red-cross-in-blood-donation-crisis">prominent</a> example, blood donations slow down during busy holiday seasons. An accurate forecast of the future blood supply allows appropriate action to be taken ahead of time and therefore saves more lives.</p>
<p>In this notebook, we explored model selection using regularization and hyperparameter tuning. Our aim was to do better than simply predicting <code>0</code> all the time (the target incidence suggests that such a model would have a 76% success rate). We then normalized our training data and __________ the accuracy. In the field of machine learning, even small improvements in accuracy can be important, depending on the purpose.</p>
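The always-predict-`0` baseline mentioned above can be computed directly from the target incidence. The sketch below uses a hypothetical target column constructed to mirror the quoted 76% share of zeros; it is not the real data.

```python
import pandas as pd

# Hypothetical target column: 76 zeros and 24 ones, mirroring the quoted split.
target = pd.Series([0] * 76 + [1] * 24, name="target")

# The most frequent class is what a constant baseline would always predict.
majority_class = target.mode()[0]

# Accuracy of that baseline is the share of rows in the majority class.
baseline_accuracy = (target == majority_class).mean()
print(majority_class, round(baseline_accuracy, 3))
```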
<p>Another benefit of using a logistic regression model is that it is interpretable. We can analyze how much of the variance in the response variable (<code>target</code>) can be explained by other variables in our dataset.</p>

## 14. Communicate 

Make a table using pandas to show all the different models you have built and their accuracy.

Sort them by accuracy in descending order. 
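A minimal sketch of such a table follows. The model names and accuracy values are placeholders for illustration only; replace them with the models and numbers from your own runs.

```python
import pandas as pd

# Placeholder names and accuracies, not real results.
results = pd.DataFrame(
    {
        "model": ["Model A", "Model B", "Model C"],
        "accuracy": [0.70, 0.80, 0.75],
    }
)

# Sort from most to least accurate and renumber the rows.
summary = results.sort_values("accuracy", ascending=False).reset_index(drop=True)
print(summary)
```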

Write a blog post about the things you learned from the assignment and share with your friends.

# References

1. Datacamp for blood transfusion definition
2. [Blood transfusion dataset from UCI](https://archive.ics.uci.edu/ml/datasets/Blood+Transfusion+Service+Center)