[IEEE ICMLA 2019](https://www.icmla-conference.org/icmla19/)

[The Data Science landscape: foundations, tools, and practical applications](https://www.icmla-conference.org/icmla19/links/tutorialAM.htm)

# Machine learning and data science

In this section we will use [Kaggle's version of the Pima Indians diabetes database](https://www.kaggle.com/uciml/pima-indians-diabetes-database):

> This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

The dataset has several predictor variables (features, attributes) and one target variable (label):

> The datasets consists of several medical predictor variables and one target variable, Outcome. Predictor variables includes the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

Our goal is to use the features (attributes) in the dataset to predict the label (whether or not a patient has diabetes).

We will use machine learning for that, following these steps:

1. Load and inspect the dataset
1. Split into features and labels
1. Split into train and test datasets
1. Create and evaluate models

## 1. Loading and inspecting the dataset

In [2]:
import pandas as pd

# Note that we don't need to unzip to read the file
df = pd.read_csv('data/pima-indians-diabetes-database.zip')
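
The comment above relies on `read_csv` inferring the compression from the file extension (its default is `compression='infer'`), which works when the zip archive holds a single CSV file. A minimal sketch of that behavior, using a tiny hypothetical frame and a temporary file rather than the tutorial's dataset:

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row frame, only for illustration
tiny = pd.DataFrame({'Glucose': [148, 85], 'Outcome': [1, 0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'tiny.zip')
    # Write a zip archive that contains a single CSV file
    tiny.to_csv(path, index=False,
                compression={'method': 'zip', 'archive_name': 'tiny.csv'})
    # No explicit compression argument: it is inferred from the '.zip' extension
    restored = pd.read_csv(path)

# True if the round trip preserved the data
print(restored.equals(tiny))
```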

In [3]:
df.shape

(768, 9)

In [4]:
df.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

In [5]:
df.head()

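| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |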


In [6]:
df.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
Pregnancies                 768 non-null int64
Glucose                     768 non-null int64
BloodPressure               768 non-null int64
SkinThickness               768 non-null int64
Insulin                     768 non-null int64
BMI                         768 non-null float64
DiabetesPedigreeFunction    768 non-null float64
Age                         768 non-null int64
Outcome                     768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


By inspecting the dataset, we found out that:

- It has 768 rows and 9 columns (`shape`).
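- The values of the columns are mostly in the range of what they are supposed to represent (`describe()`), although a minimum of 0 for `Glucose`, `BloodPressure`, `SkinThickness`, `Insulin` and `BMI` is not physiologically plausible and may indicate missing measurements recorded as zero.
- There are no null values (`info()`).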
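
`info()` only counts null entries, so it would not flag a missing measurement that was recorded as 0. The `describe()` output above shows a minimum of 0 for several columns. Here is a minimal sketch of how one might count such zeros, assuming (the tutorial does not state this) that a zero in these columns is a placeholder rather than a real measurement. To stay self-contained it uses the five rows printed by `df.head()`, and `sample` and `suspect_columns` are illustrative names:

```python
import pandas as pd

# First five rows as shown by df.head() above (subset of columns)
sample = pd.DataFrame({
    'Glucose': [148, 85, 183, 89, 137],
    'BloodPressure': [72, 66, 64, 66, 40],
    'SkinThickness': [35, 29, 0, 23, 35],
    'Insulin': [0, 0, 0, 94, 168],
    'BMI': [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Assumption: a zero in these columns is not a real measurement
suspect_columns = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

# Count zeros per column; on the full dataset, replace `sample` with `df`
zero_counts = (sample[suspect_columns] == 0).sum()
print(zero_counts)
```

Running the same count on the full `df` would show how many rows are affected before deciding whether to keep, drop, or impute them.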

Now that we have a basic understanding of the data, we need to check whether it is balanced, i.e. whether we have about the same number of samples for each label.

In [24]:
print(df.Outcome.value_counts())

0    500
1    268
Name: Outcome, dtype: int64


The dataset is unbalanced, with almost twice as many negative (non-diabetic) as positive (diabetic) samples.

It is not extremely unbalanced, but we will need to be careful when measuring the performance of the models later.
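
To make the imbalance concrete, the sketch below turns the counts into proportions and computes the accuracy of a trivial model that always predicts the most frequent class. It rebuilds the label column from the counts printed above instead of reading the file; `outcome` and `baseline_accuracy` are illustrative names, not part of the tutorial's code:

```python
import pandas as pd

# Rebuild the label column from the counts printed above (500 negative, 268 positive)
outcome = pd.Series([0] * 500 + [1] * 268, name='Outcome')

# Fraction of samples per class
proportions = outcome.value_counts(normalize=True)
print(proportions.round(3))

# Accuracy of a trivial model that always predicts the most frequent class
baseline_accuracy = proportions.max()
print(f'Majority-class baseline accuracy: {baseline_accuracy:.3f}')
```

Any model whose accuracy is close to this baseline has learned little, which is one reason accuracy alone can be misleading on unbalanced data.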

## 2. Separating features from labels

We will split the dataset into features (attributes) and the label (outcome), using the traditional `X` and `y` names for those pieces of the dataset.

In [9]:
# We can see from df.columns that `Outcome`, the label, is the last column
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
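
Selecting by position depends on the label being the last column. An alternative sketch selects the label by name, which keeps working if the column order changes. It uses a hypothetical three-column miniature of the dataset (`mini`, `X_named` and `y_named` are illustrative names) so that it runs on its own:

```python
import pandas as pd

# Hypothetical miniature of the dataset (three of the nine columns)
mini = pd.DataFrame({
    'Glucose': [148, 85, 183],
    'BMI': [33.6, 26.6, 23.3],
    'Outcome': [1, 0, 1],
})

# Select the label by name and drop it from the features
X_named = mini.drop(columns='Outcome')
y_named = mini['Outcome']

# Same result as the positional split, as long as the label is the last column
print(X_named.equals(mini.iloc[:, :-1]))
print(y_named.equals(mini.iloc[:, -1]))
```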

## 3. Splitting into a train and a test dataset