Welcome to your DataCamp project audition! This notebook must be filled out and vetted before a contract can be signed and you can start creating your project.

The first step is forking the repository in which this notebook lives. After that, there are two parts to be completed in this notebook:

- **Project information**:  The title of the project, a project description, etc.

- **Project introduction**: The first three text and code cells that will form the introduction of your project.

When complete, please email the link to your forked repo to projects@datacamp.com with the email subject line _DataCamp project audition_. If you have any questions, please reach out to projects@datacamp.com.

# Project information

**Project title**: Predict Blood Donations

**Name:** Dimitri Denisjonok

**Email address associated with your DataCamp account:** dimitri.781@gmail.com

**GitHub username:** d-me-tree

**Project description**:<br>
"Blood is the most precious gift that anyone can give to another person — the gift of life." ~ [World Health Organization](http://www.who.int/features/qa/61/en/)

Blood transfusion is needed for:

- women with complications of pregnancy
- children with severe anaemia (deficiency of red cells or of haemoglobin in the blood) often resulting from malaria or malnutrition
- people with severe trauma following man-made and natural disasters
- many complex medical and surgical procedures and cancer patients.

In this Project, you will work with data from the donor database of the Blood Transfusion Service Center in Hsin-Chu City, Taiwan. About every three months, the center sends its blood transfusion service bus to a university in Hsin-Chu City to collect donated blood. The dataset is a random sample of 748 donors, obtained from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Blood+Transfusion+Service+Center). Your task is to predict whether a blood donor will donate within a given time window.

You will look at the full model-building process: from inspecting the dataset to using the [`tpot`](https://www.datacamp.com/community/tutorials/tpot-machine-learning-python) library to automate your Machine Learning pipeline.

This Project requires that you know your way around Python and pandas. The following courses are recommended as prerequisites: [Preprocessing for Machine Learning in Python](https://www.datacamp.com/courses/preprocessing-for-machine-learning-in-python/) and [Foundations of Predictive Analytics in Python (Part 1)](https://www.datacamp.com/courses/foundations-of-predictive-analytics-in-python-part-1).

# Project introduction

***Note: nothing needs to be filled out in this cell. It is simply setting up the template cells below.***

The final output of a DataCamp project looks like a blog post: pairs of text and code cells that tell a story about data. The text is written from the perspective of the data analyst and *not* from the perspective of an instructor on DataCamp. So, for this blog post intro, all you need to do is pretend like you're writing a blog post -- forget the part about instructors and students.

Below you'll see the structure of a DataCamp project: a series of "tasks" where each task consists of a title, a **single** text cell, and a **single** code cell. There are 8-12 tasks in a project and each task can have up to 10 lines of code. What you need to do:
1. Read through the template structure.
2. As best you can, divide your project as it is currently visualized in your mind into tasks.
3. Fill out the template structure for the first three tasks of your project.

As you are completing each task, you may wish to consult the project notebook format in our [documentation](https://instructor-support.datacamp.com/projects/datacamp-projects-jupyter-notebook). Only the `@context` and `@solution` cells are relevant to this audition.

## 1. Inspecting data

The goal of this project is to take you through the steps you'd take when working on **your** Machine Learning projects. So where do we start?

First, we want to learn as much as possible about our dataset. Below is an extract of the [Data Set Description](https://archive.ics.uci.edu/ml/machine-learning-databases/blood-transfusion/transfusion.names) from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Blood+Transfusion+Service+Center):
> **Title**: Blood Transfusion Service Center Data Set
>
> **Abstract**: Data taken from the Blood Transfusion Service Center in Hsin-Chu City in Taiwan -- this is a classification problem.
>
> ...
>
> **Data Set Information**:
> 
> To demonstrate the RFMTC marketing model (a modified version of RFM), this study 
> adopted the donor database of Blood Transfusion Service Center in Hsin-Chu City 
> in Taiwan. The center passes their blood transfusion service bus to one 
> university in Hsin-Chu City to gather blood donated about every three months. To 
> build a FRMTC model, we selected 748 donors at random from the donor database. 
> These 748 donor data, each one included R (Recency - months since last 
> donation), F (Frequency - total number of donation), M (Monetary - total blood 
> donated in c.c.), T (Time - months since first donation), and a binary variable 
> representing whether he/she donated blood in March 2007 (1 stand for donating 
> blood; 0 stands for not donating blood).

From this extract we learn a lot:
- this is a **classification** problem (as opposed to a regression problem, for example)
- after googling "RFM", we learn that it stands for Recency, Frequency, and Monetary Value, and that it is a classic analytics and segmentation tool for identifying your best customers. In our case, the customers are the blood donors.
- what columns we can expect in our dataset
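
To make the four R, F, M, and T columns concrete, here is a minimal sketch of how they could be derived for a single donor from a list of donation dates. The donation dates, the `donor_features` and `months_between` helpers, and the whole-calendar-month arithmetic are hypothetical and only for illustration. The 250 c.c. per donation figure is an assumption that matches rows of the dataset (for example, 50 donations totalling 12500 c.c.); the data set description does not state it.

```python
from datetime import date

# Assumption: inferred from dataset rows such as 50 donations -> 12500 c.c.
CC_PER_DONATION = 250


def months_between(earlier, later):
    """Whole calendar months from `earlier` to `later` (hypothetical convention)."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)


def donor_features(donation_dates, reference_date):
    """Derive the R, F, M and T values for one donor from their donation dates."""
    first, last = min(donation_dates), max(donation_dates)
    return {
        "Recency (months)": months_between(last, reference_date),
        "Frequency (times)": len(donation_dates),
        "Monetary (c.c. blood)": len(donation_dates) * CC_PER_DONATION,
        "Time (months)": months_between(first, reference_date),
    }


# Made-up donation history for a single donor
example_dates = [date(2006, 1, 15), date(2006, 7, 10), date(2007, 1, 20)]
features = donor_features(example_dates, reference_date=date(2007, 3, 1))
print(features)
```

Note that under this assumption Monetary is simply a multiple of Frequency, so the two columns carry the same information.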

We have some context. Let's look at the data, which is stored in `datasets/transfusion.data`. In this case, it's not clear from the file extension, `.data`, whether the file contains tab- or comma-separated values, or something completely different.

The question we want to answer here is "Which `pandas` read function do we want to use to load the data?"

In [3]:
# Display the first 5 lines of datasets/transfusion.data
!head -n5 datasets/transfusion.data

Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),"whether he/she donated blood in March 2007"
2 ,50,12500,98 ,1
0 ,13,3250,28 ,1
1 ,16,4000,35 ,1
2 ,20,5000,45 ,1
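
If a shell is not available, or we want to confirm the delimiter programmatically, the standard library's `csv.Sniffer` can make the same call. This is a minimal sketch that runs on a sample copied from the output above rather than on the file itself; the `sample` variable and the list of candidate delimiters are assumptions made for illustration.

```python
import csv

# First lines of the file, copied from the `head` output above
sample = (
    'Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),'
    '"whether he/she donated blood in March 2007"\n'
    "2 ,50,12500,98 ,1\n"
    "0 ,13,3250,28 ,1\n"
    "1 ,16,4000,35 ,1\n"
    "2 ,20,5000,45 ,1\n"
)

# Ask the sniffer to choose among a few plausible delimiters
dialect = csv.Sniffer().sniff(sample, delimiters=",\t;")
print(repr(dialect.delimiter))

# Parse the sample with the detected dialect to see the header fields
header = next(csv.reader(sample.splitlines(), dialect))
print(header)
```

Sniffing is a heuristic, so it is still worth eyeballing the first few lines as we did above.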


## 2. Loading data

We now know it's a CSV file, so we can use the `pandas.read_csv()` function.

In [4]:
# Importing modules
import pandas as pd

# Read datasets/transfusion.data into transfusion
transfusion = pd.read_csv('datasets/transfusion.data')

# Print out the first rows of our dataset
transfusion.head()

| | Recency (months) | Frequency (times) | Monetary (c.c. blood) | Time (months) | whether he/she donated blood in March 2007 |
|---|---|---|---|---|---|
| 0 | 2 | 50 | 12500 | 98 | 1 |
| 1 | 0 | 13 | 3250 | 28 | 1 |
| 2 | 1 | 16 | 4000 | 35 | 1 |
| 3 | 2 | 20 | 5000 | 45 | 1 |
| 4 | 1 | 24 | 6000 | 77 | 0 |


## 3. Inspecting `transfusion` DataFrame

It looks like every column in our DataFrame is numeric, which is exactly what we want when we intend to train a model. Let's check the structure of `transfusion` to verify this hypothesis.

In [5]:
# Print a concise summary of the transfusion DataFrame
transfusion.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 748 entries, 0 to 747
Data columns (total 5 columns):
Recency (months)                              748 non-null int64
Frequency (times)                             748 non-null int64
Monetary (c.c. blood)                         748 non-null int64
Time (months)                                 748 non-null int64
whether he/she donated blood in March 2007    748 non-null int64
dtypes: int64(5)
memory usage: 29.3 KB
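
The same check can be made programmatically instead of by reading the summary. The sketch below is an illustration under assumptions: it loads a four-row sample copied from the file preview into a DataFrame named `sample_df` (a hypothetical name) so that it runs without the data file, then tests every column's dtype with `pandas.api.types.is_numeric_dtype` and counts missing values.

```python
import io

import pandas as pd
from pandas.api.types import is_numeric_dtype

# Four-row sample copied from the file preview; it stands in for
# datasets/transfusion.data so the sketch runs on its own
sample_csv = """Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),"whether he/she donated blood in March 2007"
2 ,50,12500,98 ,1
0 ,13,3250,28 ,1
1 ,16,4000,35 ,1
2 ,20,5000,45 ,1
"""

sample_df = pd.read_csv(io.StringIO(sample_csv))

# True only if every column has a numeric dtype
all_numeric = all(is_numeric_dtype(dtype) for dtype in sample_df.dtypes)

# Total number of missing values across all columns
missing_total = int(sample_df.isna().sum().sum())

print(all_numeric, missing_total)
```

On the full dataset, the equivalent check would use `transfusion.dtypes` and `transfusion.isna()` in place of the `sample_df` versions.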


*Stop here! Only the first three tasks. :)*