# Chapter 1

# The Travelling Diabetes Clinic: A First Take at the Problem

---

## 1.1 The Travelling Diabetes Clinic Problem

### 1.1.1 Reading the data with pandas

We use the `pandas.read_csv` function to load the data from the file so it can be used within our Python code. Consult the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) for further information on the `read_csv` parameters and how they can be used.

In [1]:
import pandas as pd

data = pd.read_csv("../datasets/PimaIndiansDiabetes.csv")
data

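| Unnamed: 0 | Pregnancy Count | Blood Glucose | Diastolic BP | Triceps Skin Fold Thickness | Serum Insulin | BMI | Diabetes Pedigree Function | Age | Class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.340 | 27 | 0 |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |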
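
The `Unnamed: 0` column in the output above suggests that the first column of the CSV is a saved row index with an empty header. That is an inference from the output, not something checked against the file itself. Under that assumption, a minimal sketch of how the `index_col` parameter of `read_csv` changes the result, using a hypothetical two-row stand-in for the file:

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file; the empty first header cell is
# what a saved DataFrame index typically looks like on disk.
csv_text = ",Blood Glucose,BMI,Class\n0,148,33.6,1\n1,85,26.6,0\n"

# Default: the unnamed first column is read as data and named "Unnamed: 0".
default = pd.read_csv(io.StringIO(csv_text))
print(default.columns.tolist())

# index_col=0 tells pandas to use that column as the row index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(indexed.columns.tolist())
```

The rest of this notebook selects columns by name, so the extra column does no harm either way.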


---

We use `.loc` to select the part of the data we're interested in from among the other columns. The slice operator `:` as the row label selects all the rows in the `DataFrame`, while the list `["Blood Glucose", "BMI", "Class"]` chooses only these three columns and leaves the others out.

In [2]:
data_of_interest = data.loc[:, ["Blood Glucose", "BMI", "Class"]]
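
To make the `.loc` selection concrete, here is a small sketch on a hypothetical toy frame (the values and the name `toy` are illustrative only). It also shows that passing a single label instead of a list returns a `Series`, which is the form used for the labels later in this notebook.

```python
import pandas as pd

# Hypothetical toy frame reusing a few column names from the dataset.
toy = pd.DataFrame({
    "Blood Glucose": [148, 85, 183],
    "Age": [50, 31, 32],
    "BMI": [33.6, 26.6, 23.3],
    "Class": [1, 0, 1],
})

# ":" keeps every row; the list keeps the named columns, in the listed order.
subset = toy.loc[:, ["Blood Glucose", "BMI", "Class"]]
print(subset.shape)             # (3, 3)
print(subset.columns.tolist())  # ['Blood Glucose', 'BMI', 'Class']

# A single label instead of a list returns a Series rather than a DataFrame.
labels = toy.loc[:, "Class"]
print(type(labels).__name__)    # Series
```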

## 1.2 A Simple ML Attempt with scikit-learn

### 1.2.2 Implementing the Model with scikit-learn

We train a perceptron model after splitting the data into training and testing parts. You can learn more about the `Perceptron` model and the `train_test_split` function in their documentation [here](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html) and [here](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). scikit-learn's documentation is among the best in the field.

In [3]:
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split

X = data_of_interest.loc[:, ["Blood Glucose", "BMI"]]
y = data_of_interest.loc[:, "Class"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

classifier = Perceptron()
classifier.fit(X_train, y_train)



Perceptron(alpha=0.0001, class_weight=None, eta0=1.0, fit_intercept=True,
      max_iter=None, n_iter=None, n_jobs=1, penalty=None, random_state=0,
      shuffle=True, tol=None, verbose=0, warm_start=False)
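
The call to `train_test_split` above passes neither `test_size` nor `train_size`, so it falls back to the default of holding out 25% of the samples. A minimal sketch on hypothetical toy arrays (`X_toy` and `y_toy` are not part of the dataset) shows the resulting sizes and the effect of fixing `random_state`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 2 features, binary labels.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100) % 2

# With no test_size/train_size given, 25% of the samples are held out.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=42)
print(X_tr.shape, X_te.shape)  # (75, 2) (25, 2)

# The same random_state reproduces the same shuffle, hence the same split.
X_tr2, X_te2, _, _ = train_test_split(X_toy, y_toy, random_state=42)
print(np.array_equal(X_te, X_te2))  # True
```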

---
We test the trained model's accuracy on the test data using the `score` method.

In [4]:
accuracy = classifier.score(X_test, y_test)
print("Prediction Accuracy: {:.2f}%".format(accuracy * 100))

Prediction Accuracy: 50.52%
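
For classifiers, `score` returns the mean accuracy: the fraction of samples whose predicted label matches the true one. The sketch below checks this on hypothetical synthetic data (the arrays and the name `model` are illustrative, not the diabetes data) by comparing `score` with a by-hand computation and with `accuracy_score`:

```python
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score

# Hypothetical toy data: two features, label is 1 when their sum is positive.
rng = np.random.RandomState(0)
X_toy = rng.randn(200, 2)
y_toy = (X_toy.sum(axis=1) > 0).astype(int)

model = Perceptron(random_state=0).fit(X_toy, y_toy)

# score() predicts on X_toy and compares the predictions with y_toy ...
via_score = model.score(X_toy, y_toy)

# ... which matches the fraction of correct predictions computed by hand.
predictions = model.predict(X_toy)
by_hand = np.mean(predictions == y_toy)

print(np.isclose(via_score, by_hand))
print(np.isclose(via_score, accuracy_score(y_toy, predictions)))
```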


---
The model performs badly even on the training data, which it is supposed to ace!

In [5]:
train_accuracy = classifier.score(X_train, y_train)
print("Training Prediction Accuracy: {:.2f}%".format(train_accuracy * 100))

Training Prediction Accuracy: 46.35%


### 1.2.3 Establishing a Baseline
Here we use the `DummyClassifier` with the `most_frequent` strategy to establish a baseline to improve on in the coming chapters. You can learn more about the supported strategies in the [docs](http://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html#sklearn.dummy.DummyClassifier).

In [6]:
from sklearn.dummy import DummyClassifier

dummy_baseline = DummyClassifier(strategy="most_frequent")
dummy_baseline.fit(X_train, y_train)

baseline_accuracy = dummy_baseline.score(X_test, y_test)
print("Dummy Prediction Accuracy: {:.2f}%".format(baseline_accuracy * 100))

Dummy Prediction Accuracy: 64.06%
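
To see what the `most_frequent` strategy does, consider this minimal sketch with hypothetical labels (all names ending in `_toy` are illustrative). The classifier ignores the features, always predicts the majority class of the training labels, and so its test accuracy is simply the share of that class among the test labels.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: class 0 is the majority in the training set.
y_train_toy = np.array([0, 0, 0, 1, 1])
y_test_toy = np.array([0, 1, 0, 0])

# DummyClassifier ignores the features; zeros serve as placeholders.
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((len(y_test_toy), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_toy, y_train_toy)

# Every prediction is the training majority class (0 here).
print(baseline.predict(X_test_toy))            # [0 0 0 0]
# Accuracy equals the share of class 0 in the test labels: 3 of 4.
print(baseline.score(X_test_toy, y_test_toy))  # 0.75
```

Any model worth keeping should beat this number, since it can be reached without looking at the features at all.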
