Source:
* Andrew Ng's ML [course](https://www.coursera.org/learn/machine-learning/)
* Dataset suggestion from [here](https://machinelearningmastery.com/standard-machine-learning-datasets/)

The purpose of this series is to serve as a handbook for those who are starting their journey in Machine Learning.
There is currently a lot of literature, in academia as well as on the web, that is frankly overwhelming for a newcomer (like me).
New blog posts and research papers also tend to have complex, jargon-heavy titles, which daunt newcomers and make it harder for them to learn about the field.
In this series we will keep things simple and try to understand the concepts well.
The focus is on presenting guidelines for making decisions while building a Machine Learning model.
We will learn these tricks of the trade by actually implementing what we talk about.
Let's get started.


### System Setup
Install Anaconda and launch Jupyter Notebook.
Familiarize yourself with the basics of Jupyter Notebook and get a hello world working.
There is no need to get into the nitty-gritty details of Jupyter Notebook yet.
We can do that later.

### Getting and Loading data
The first thing to do is to get some data.
As suggested in [this article](https://machinelearningmastery.com/standard-machine-learning-datasets/), I have chosen the PIMA Indians Diabetes dataset.
#### Why have I chosen this dataset? Explain Here without too much jargon

Once we have downloaded the dataset, the first thing to do is load it.
The data comes as a CSV file, a format commonly used to exchange tabular data.
We use the pandas library to load this dataset.

In [None]:
import pandas as pd
data = pd.read_csv('data/pima-indians-diabetes.data.csv')
data.head(1)

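It seems the CSV file does not have any column names: pandas has used the first row of data as the header.
Let's reload the data with `header = None` so that the first row is kept as data.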

In [None]:
data = pd.read_csv('data/pima-indians-diabetes.data.csv', header = None)
data.head(1)

Now that we have successfully loaded the data, let's add meaningful column names to our dataset.
This information is present on the page we downloaded the dataset from.
For datasets that already include column names, this step and the previous one are unnecessary.

In [None]:
data.columns = ['pregnant_count'
                , 'glucose'
                , 'blood_pressure'
                , 'thickness'
                , 'insulin'
                , 'bmi'
                , 'pedigree'
                , 'age'
                , 'class']
data.head(1)
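
As a possible shortcut, `read_csv` also accepts the column names directly through its `names` parameter, so loading and naming can happen in one call. The sketch below uses two made-up rows held in memory (via `io.StringIO`) instead of the real file, so the values are illustrative only.

```python
import io
import pandas as pd

# Two made-up rows in the same headerless, 9-column layout as the dataset file
raw = "2,100,70,30,85,25.0,0.5,40,0\n3,150,80,20,90,30.5,0.25,55,1\n"

names = ['pregnant_count', 'glucose', 'blood_pressure', 'thickness',
         'insulin', 'bmi', 'pedigree', 'age', 'class']

# Default behaviour: the first row is consumed as the header
with_default = pd.read_csv(io.StringIO(raw))

# header=None keeps every row as data; names= labels the columns in the same call
with_names = pd.read_csv(io.StringIO(raw), header=None, names=names)

print(with_default.shape)  # one row was lost to the header
print(with_names.shape)    # both rows are kept
```

With the default settings one row disappears into the header, which is the same symptom we saw on the first load.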

### Getting to know the dataset
#### Find the size of the dataset

In [None]:
print(data.shape)

This tells us that our data has 768 rows and 9 columns.
Our goal is to predict the `class` of a given row from the other values in that row.
These other values are called `features`.
In this dataset we have 8 features, from `pregnant_count` through `age`.

#### Find number of distinct values in `class` column

In [None]:
data['class'].unique()

This tells us there are only 2 classes in the dataset, namely `1` and `0`.
Hence this is a binary classification problem.
Such problems are a good place to start your journey into Machine Learning.

In [None]:
#### Getting to know your data a little more

In [None]:
data.describe()

In [None]:
The `describe` method gives us summary statistics for each column of our dataset.
To begin with, pay attention to the `count` for each column.
It tells us how many non-missing values are present in that column.
In this case there are 768 values in each column, exactly equal to the number of rows in our dataset.
In many datasets some values are missing, in which case the count for one or more columns will be lower than the number of rows.
We will learn how to deal with such missing values later.
For now we don't need to worry about it.
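
To make the idea of `count` concrete, here is a small sketch on a made-up table (not our dataset) in which one `glucose` value is missing. It assumes pandas is available, as in the rest of this notebook.

```python
import pandas as pd

# Made-up toy table; None marks a missing glucose reading
toy = pd.DataFrame({'glucose': [100.0, None, 140.0],
                    'age': [30, 45, 52]})

print(len(toy))          # number of rows
print(toy.count())       # non-missing values per column
print(toy.isna().sum())  # missing values per column
```

Here `glucose` has a count of 2 even though the table has 3 rows, which is the mismatch to look for in the output of `describe`.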

#### How to handle missing values from dataset? (Explain Later)

Another thing to notice is that all the values in our dataset are numerical.
Many datasets also contain other kinds of values, such as strings and categorical values.
We will learn how to deal with those later.

#### How to handle strings and categorical data in the dataset? (Explain Later)

In [None]:
### How to evaluate your predictions?
The goal of building a prediction model (like the one we are building) is to do well on unseen data.
The more accurate its predictions on unseen data, the better the model.
The data that the model gets to see is known as training data, since it is the data with which the model is trained to make predictions.
The unseen data on which the model has to predict well is known as testing data, since it is the data on which the model is tested.
Many datasets come with separate training and testing data.
Ours does not.
Hence we will reserve some data from our dataset as testing data.

#### Assumption: distribution of training data and testing data should be the same.
#### How big should the testing data be?

In [None]:
#### Splitting data into training and testing

In [None]:
from sklearn.model_selection import train_test_split
X = data[['pregnant_count'
        , 'glucose'
        , 'blood_pressure'
        , 'thickness'
        , 'insulin'
        , 'bmi'
        , 'pedigree'
        , 'age']]
# Select the labels as a 1-D Series; scikit-learn warns when y is a single-column DataFrame
Y = data['class']

trainX, testX, trainY, testY = train_test_split(X
                                                , Y
                                                , test_size = 0.2
                                                , random_state = 73)
print('trainX shape', trainX.shape)
print('trainY shape', trainY.shape)
print('testX shape', testX.shape)
print('testY shape', testY.shape)

In [None]:
We use the `train_test_split` function from the `scikit-learn` library to split our data into training data and testing data.
It expects the features and labels (classes) as arguments, so we first extract them as `X` and `Y`.
With `test_size = 0.2`, it randomly selects 20% of the rows as test data and keeps the remaining 80% as training data.
To make the results repeatable we pass a fixed `random_state`.
Finally we check the sizes of the training and test data.
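
To see what a seeded random split does, here is a minimal sketch in plain Python. It is not how `scikit-learn` implements `train_test_split`; `simple_split` is a hypothetical helper written only for illustration.

```python
import random

def simple_split(rows, test_fraction, seed):
    """Illustrative random split; not scikit-learn's actual implementation."""
    indices = list(range(len(rows)))
    # A seeded generator shuffles the indices the same way on every run
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test = [rows[i] for i in indices[:n_test]]
    train = [rows[i] for i in indices[n_test:]]
    return train, test

rows = list(range(10))  # stand-in for 10 data rows
train, test = simple_split(rows, test_fraction=0.2, seed=73)
print(len(train), len(test))
```

Calling `simple_split` again with the same seed returns the same split, which is the role `random_state` plays above.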

In [None]:
### Training and Testing the model
Now that we have prepared our data, it is time to train our model on the training data and test it on the testing data.
As stated before, this is a binary classification problem.
A good starting algorithm for this kind of problem is `Logistic Regression`.

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(random_state = 73)
model.fit(trainX, trainY)
print('train accuracy', model.score(trainX, trainY))
print('test accuracy', model.score(testX, testY))

In [None]:
We first created an instance of the `LogisticRegression` model.
We provided `random_state` to get reproducible results.
The `fit` method trains the model on the training data.
Finally we checked the performance of the model using the `score` method on the training and testing data.
Given a set of examples, `score` returns the fraction that the trained model classifies correctly, i.e. its accuracy.
On the training data it achieved 77% accuracy.
This means that after training, the model correctly classified 77% of the training examples.
On the testing data it also achieved 77% accuracy.
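
Accuracy is simple enough to compute by hand. The sketch below uses made-up labels and a hypothetical `accuracy` helper to show the calculation that `score` reports for a classifier.

```python
def accuracy(actual, predicted):
    """Fraction of examples whose predicted class matches the actual class."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

# Made-up labels for four examples; three of the four predictions match
actual = [0, 1, 1, 0]
predicted = [0, 1, 0, 0]
print(accuracy(actual, predicted))
```
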
Congratulations! You have successfully built a machine learning model.
Let's see how we can use this model for prediction.

In [None]:
example = testX.head(1)
cls = testY.head(1)
print(example)
print('#' * 80)
print('actual class')
print(cls)
print('#' * 80)
print('prediction:', model.predict(example))

In [None]:
We provided the first row of the test data as the input.
Its actual class was `0`.
We then gave the same example to our model for prediction.
The model correctly predicted the class of the example as `0`.
The prediction looks like an array with a single value because we gave the model an array containing a single example.


In [None]:
### What to do next?
Now that we have trained our model with `Logistic Regression` and achieved 77% accuracy on the test data, what can we do to improve it?
To answer that, we need to know how `Logistic Regression` works.

In [None]:
model.coef_
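
The array printed by `model.coef_` holds one weight per feature. Below is a minimal sketch of how a logistic regression model turns such weights into a prediction. The weights, intercept and inputs are hypothetical, chosen only for illustration; the fitted values live in `model.coef_` and `model.intercept_`.

```python
import math

def predict_one(weights, bias, features):
    """Return (probability of class 1, predicted class) for one example."""
    # Weighted sum of the features plus the intercept
    z = bias + sum(w * x for w, x in zip(weights, features))
    # The sigmoid squashes z into a probability between 0 and 1
    probability = 1.0 / (1.0 + math.exp(-z))
    return probability, int(probability > 0.5)

# Hypothetical weights and inputs, chosen only for illustration
weights = [0.5, -0.25]
bias = 0.0
print(predict_one(weights, bias, [4.0, 0.0]))  # z = 2, leans towards class 1
print(predict_one(weights, bias, [0.0, 4.0]))  # z = -1, leans towards class 0
```

A feature with a large positive weight pushes the prediction towards class `1`, and one with a large negative weight pushes it towards class `0`.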

In [None]:
### feature scaling explain here


In [None]:
data.describe()

In [None]:
from sklearn.preprocessing import MinMaxScaler
# Rescale every column to the [0, 1] range, keeping the column names.
# `class` only holds 0 and 1, so its values stay the same.
data = pd.DataFrame(MinMaxScaler().fit_transform(data)
                    , columns = data.columns)

data.describe()
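
With its default settings, `MinMaxScaler` rescales each column so that its smallest value becomes 0 and its largest becomes 1. A minimal sketch of that calculation for a single column, using made-up ages and a hypothetical `min_max_scale` helper:

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1]; assumes they are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Made-up ages, for illustration only
ages = [20, 35, 50, 80]
print(min_max_scale(ages))
```

After scaling, all features share the same range, so no feature dominates merely because it is measured in larger numbers.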

In [None]:
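# Re-split the scaled data; otherwise trainX and testX still hold the unscaled values
X = data.drop(columns = 'class')
Y = data['class']
trainX, testX, trainY, testY = train_test_split(X, Y, test_size = 0.2, random_state = 73)
model = LogisticRegression(random_state = 73)
model.fit(trainX, trainY)
print('train accuracy', model.score(trainX, trainY))
print('test accuracy', model.score(testX, testY))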

In [None]:
### regularization
### explain regularization here
* by default `LogisticRegression` uses an L2 penalty (L2 regularization)
### results without regularization

In [None]:
# C is the inverse of regularization strength, so a huge C effectively turns regularization off
noReg = LogisticRegression(C = 1e40
                          , random_state = 73)

noReg.fit(trainX, trainY)
print('train accuracy', noReg.score(trainX, trainY))
print('test accuracy', noReg.score(testX, testY))
noReg.coef_
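
`scikit-learn` documents `C` as the inverse of regularization strength. The rough sketch below writes the L2-penalized objective in that form, assuming hypothetical weights and a hypothetical value for the unpenalized loss (`data_loss`), to show why a very large `C` makes the penalty negligible. `l2_objective` is an illustrative helper, not a library function.

```python
def l2_objective(weights, data_loss, C):
    """Return (penalty, total) where total = 0.5 * sum(w^2) + C * data_loss."""
    penalty = 0.5 * sum(w * w for w in weights)
    return penalty, penalty + C * data_loss

# Hypothetical weights and data loss, for illustration only
weights = [3.0, 4.0]
data_loss = 2.0

for C in (1.0, 1e40):
    penalty, total = l2_objective(weights, data_loss, C)
    # Share of the objective that comes from the penalty term
    print(C, penalty / total)
```

With `C = 1.0` the penalty makes up most of the objective in this toy case, while with `C = 1e40` its share is vanishingly small, so the weights are free to grow.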