# Neural Networks

---
**NOTE 1**

Answer all questions within the Jupyter Notebook file of the lab.

Submit your answers on Blackboard (Assessments -> Lab Submission -> Lab 3 - Neural Networks)

---

---
**NOTE 2**

Before you begin this lab, create a personal folder to store the files for this lab and all subsequent labs, assignments, and projects. The folder must be located in the Documents folder and named "admissionNumber_Name".

You can create this folder by using the GUI to navigate to the Documents folder and create the folder the same way you do in Windows or Mac.

Alternatively, you can use the CLI to create the folder by executing the following command:
`mkdir ~/Documents/12345_Asad`

You should replace "12345_Asad" with your admission number and name.

---

## Regression

Regression models are used to predict a real value such as weight, price, year, etc.

In this section you will learn how to:

* Import libraries in Python
* Use Pandas to load a dataset
* Use scikit-learn
* Create a simple Regression Model for prediction

### Linear Regression

Linear Regression is a predictive analysis model used to find a relationship between independent and dependent variables. Its principle is to fit a linear equation to the observed data.

You will use Linear Regression in this section to predict life expectancy from BMI (Body Mass Index). You will start using the basic tools in Python to complete this lab.
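
To make "fitting a linear equation" concrete, here is a minimal sketch on made-up data. The toy arrays and variable names are illustrative assumptions, not part of the lab. It uses NumPy's `polyfit` to find the slope and intercept of the least-squares line through the points:

```python
import numpy as np

# Hypothetical toy data that lies exactly on the line y = 2x + 1
x_toy = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_toy = 2.0 * x_toy + 1.0

# Least-squares fit of a degree-1 polynomial (a straight line);
# polyfit returns the coefficients from highest degree to lowest
slope, intercept = np.polyfit(x_toy, y_toy, deg=1)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

Real data rarely lies exactly on a line, in which case the fit returns the line that minimizes the squared vertical distances to the points.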

Scikit-learn is a library in Python that provides many unsupervised and supervised learning algorithms.

The functionality that scikit-learn provides includes:

* Regression, including Linear and Logistic Regression
* Classification, including K-Nearest Neighbors
* Clustering, including K-Means and K-Means++
* Model selection
* Preprocessing, including Min-Max Normalization
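
As a small illustration of the preprocessing item above, the following sketch applies Min-Max Normalization with scikit-learn's `MinMaxScaler`. The three input values are made up for the example:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data (one column, three rows)
values = np.array([[10.0], [20.0], [30.0]])

# Min-Max Normalization rescales each column to the range [0, 1]:
# scaled = (value - column_min) / (column_max - column_min)
scaler = MinMaxScaler()
scaled = scaler.fit_transform(values)
print(scaled.ravel())
```

The smallest value maps to 0, the largest to 1, and everything else falls proportionally in between.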

Look for the following models: K-Nearest Neighbor, K-Means, and Random Forest. Provide a concise description of these models. While you do not have to provide many details, try to include model description, potential applications, and pros vs cons analysis.


To build the life expectancy prediction Linear Regression model, you will use the scikit-learn library (https://scikit-learn.org/stable/). You will import the `LinearRegression` class of scikit-learn. This class provides the method `fit()`, which fits the model to the BMI data.

You will use the BMI dataset from https://www.gapminder.org/data/. You can download the dataset from Blackboard (Learning Resources > Datasets > BMI Dataset).

Create a new Jupyter Notebook and write the code for this section's Linear Regression model.

#### Importing Libraries

In Python, you use the `import` keyword to indicate the library that you would like to use. You will need the `Pandas` library (http://pandas.pydata.org/pandas-docs/stable/), which is useful for loading data from a variety of file types. You will also need the `LinearRegression` class from the scikit-learn library.

The following code imports the necessary libraries.

```python
# Note that you will use pd when calling Pandas functions 
import pandas as pd
from sklearn.linear_model import LinearRegression
```

In [1]:
import pandas as pd
from sklearn.linear_model import LinearRegression

#### Loading Data

You will use `Pandas` functions to load the dataset that you downloaded earlier.

```python
# the bmi_life_data will be the variable storing the dataset values
# in python you do not need to pre-define variables or provide their datatype!
bmi_life_data = pd.read_csv("bmi_and_life_expectancy.csv")
```

In [2]:
bmi_life_data = pd.read_csv("bmi_and_life_expectancy.csv")
bmi_life_data.head(5)

| | Country | Life expectancy | BMI |
|---|---|---|---|
| 0 | Afghanistan | 52.8 | 20.62058 |
| 1 | Albania | 76.8 | 26.44657 |
| 2 | Algeria | 75.5 | 24.5962 |
| 3 | Andorra | 84.6 | 27.63048 |
| 4 | Angola | 56.7 | 22.25083 |
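
If you want to try `pd.read_csv` without the file at hand, the sketch below reads the same kind of data from an in-memory string. The variable names are illustrative, and only the first three rows shown above are included:

```python
import io
import pandas as pd

# A tiny CSV with the same column layout as the lab dataset
csv_text = """Country,Life expectancy,BMI
Afghanistan,52.8,20.62058
Albania,76.8,26.44657
Algeria,75.5,24.59620
"""

# read_csv accepts a file path or any file-like object such as StringIO
sample_df = pd.read_csv(io.StringIO(csv_text))
print(sample_df.shape)   # (number of rows, number of columns)
print(sample_df.dtypes)  # data type that Pandas inferred for each column
```

Checking `shape` and `dtypes` right after loading is a quick way to confirm that the file was parsed as expected.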


#### Creating Linear Regression Model

You will use the scikit-learn `LinearRegression` class here to create the model.

```python
# The bmi_life_model is the object of the LinearRegression class
bmi_life_model = LinearRegression()
# The bmi_life_data[['BMI']] refers to the values whose column name is "BMI"
x = bmi_life_data[['BMI']]
# The bmi_life_data[['Life expectancy']] refers to the values whose column name is "Life expectancy"
y = bmi_life_data[['Life expectancy']]
bmi_life_model.fit(x, y)
```

In [3]:
bmi_life_model = LinearRegression()
x = bmi_life_data[['BMI']]
y = bmi_life_data[["Life expectancy"]]
bmi_life_model.fit(x,y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

**Note:** In Machine Learning, it is a common practice to use `x` for input data and `y` for labels. Labels are answers to each input data point.
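
After fitting, the learned line can be inspected through the `coef_` and `intercept_` attributes of the model. The sketch below is an illustration only: it fits on just the five rows displayed by `head(5)` above, so its numbers differ from those of the full-dataset model, and the variable names are assumptions of this example.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Only the five rows shown earlier; NOT the full dataset
subset = pd.DataFrame({
    "BMI": [20.62058, 26.44657, 24.59620, 27.63048, 22.25083],
    "Life expectancy": [52.8, 76.8, 75.5, 84.6, 56.7],
})

subset_model = LinearRegression()
# x must be 2-D (rows x features); y is passed here as a 1-D Series
subset_model.fit(subset[["BMI"]], subset["Life expectancy"])

slope = subset_model.coef_[0]        # change in life expectancy per BMI unit
intercept = subset_model.intercept_  # predicted value when BMI is 0
print(f"life expectancy = {slope:.3f} * BMI + {intercept:.3f}")
```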

#### Prediction

You will use the scikit-learn `LinearRegression` class here to predict values based on the trained model from the previous step.

```python
# Here we are predicting the life expectancy for the BMI of 21.07931
life_expectancy = bmi_life_model.predict([[21.07931]])
print("The life expectancy when BMI is 21.07931 would be", life_expectancy)
```

In [4]:
life_expectancy = bmi_life_model.predict([[21.07931]])
print("The life expectancy when BMI is 21.07931 would be", life_expectancy)

The life expectancy when BMI is 21.07931 would be [[60.31564716]]


What is the output? What does it mean?
<br>
<br>
The output is the model's prediction: a person with a BMI of 21.07931 is expected to live to about 60 years of age.
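
The prediction is printed with double brackets because `predict` returns an array rather than a single number. The sketch below, on made-up data, shows how the shape of `y` at fit time affects the shape of the prediction and how to extract a plain float:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data following y = 2x
x_toy = np.array([[1.0], [2.0], [3.0]])
y_2d = np.array([[2.0], [4.0], [6.0]])  # 2-D target, like a one-column DataFrame
y_1d = y_2d.ravel()                     # 1-D target, like a Series

pred_2d = LinearRegression().fit(x_toy, y_2d).predict([[4.0]])
pred_1d = LinearRegression().fit(x_toy, y_1d).predict([[4.0]])

print(pred_2d.shape)  # 2-D result: one row, one column
print(pred_1d.shape)  # 1-D result: one element

# Index into the array to get a single number
value = float(pred_2d[0, 0])
print(round(value, 2))
```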

Congratulations! You have created your first Neural Network by creating a Linear Regression Model.

## Data Exploration

Data exploration and visualization are very important in Machine Learning. The first step is to understand the data and explore it, so that you have a clear picture of the type of data you will be using to create a Neural Network for prediction and/or classification. This is also considered an important step in Data Analysis.

In this section you will learn how to:
* Use Pandas for data exploration
* Analyze a dataset

We will use the BMI dataset that you trained on in the previous section to show some examples of `Pandas` functions for data exploration. `Pandas` is a popular Python package for data analysis and exploration in machine learning. It offers powerful, expressive, and flexible data structures that make data manipulation and analysis easy. You can read more about `Pandas` here: https://pandas.pydata.org/

```python
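# "info" is a DataFrame method. Written without parentheses, as here, the notebook displays
# the bound method together with the dataset contents; bmi_life_data.info() prints a column summary instead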
bmi_life_data.info
```

In [5]:
bmi_life_data.info

<bound method DataFrame.info of                     Country  Life expectancy       BMI
0               Afghanistan             52.8  20.62058
1                   Albania             76.8  26.44657
2                   Algeria             75.5  24.59620
3                   Andorra             84.6  27.63048
4                    Angola             56.7  22.25083
5                   Armenia             72.3  25.35542
6                 Australia             81.6  27.56373
7                   Austria             80.4  26.46741
8                Azerbaijan             69.2  25.65117
9                   Bahamas             72.2  27.24594
10               Bangladesh             68.3  20.39742
11                 Barbados             75.3  26.38439
12                  Belarus             70.0  26.16443
13                  Belgium             79.6  26.75915
14                   Belize             70.7  27.02255
15                    Benin             59.7  22.41835
16                   Bhutan      

What do you see in the output? How do you think it would help to better understand the dataset?
<br>
<br>
The output shows the contents of the dataset, which has 163 rows and 3 columns. The rows are sorted alphabetically by country, which makes it easy to find the values for a specific country.
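
For comparison, calling `info()` with parentheses prints a compact summary (column names, non-null counts, and data types) instead of the table contents. A minimal sketch on a three-row stand-in for the dataset, with illustrative variable names:

```python
import io
import pandas as pd

small_df = pd.DataFrame({
    "Country": ["Afghanistan", "Albania", "Algeria"],
    "Life expectancy": [52.8, 76.8, 75.5],
    "BMI": [20.62058, 26.44657, 24.59620],
})

# Calling the method prints a per-column summary to the screen
small_df.info()

# info() returns None; pass a buffer if you need the summary as text
buffer = io.StringIO()
small_df.info(buf=buffer)
summary_text = buffer.getvalue()
```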

```Python
# The "describe" function provides details of the entire dataset
bmi_life_data.describe()
```

In [6]:
bmi_life_data.describe()

| | Life expectancy | BMI |
|---|---|---|
| count | 163.0 | 163.0 |
| mean | 69.666933 | 24.792378 |
| std | 8.981933 | 2.4279 |
| min | 44.5 | 19.86692 |
| 25% | 63.45 | 22.52794 |
| 50% | 71.8 | 25.32054 |
| 75% | 76.5 | 26.60396 |
| max | 84.6 | 30.99563 |


What do you see in the output? In what way(s) would this information be useful in the analysis of the dataset?
<br>
<br>
The output shows some common statistical metrics calculated from the dataset. They show the spread and distribution of the data and hint at the possibility of outliers.
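
One way to turn these statistics into an outlier check is the 1.5 x IQR rule. This rule is a common convention and an assumption of this sketch, not something the lab prescribes. The numbers below are the "Life expectancy" values from the `describe()` output:

```python
# Quartiles, minimum and maximum of "Life expectancy" from describe()
q1, q3 = 63.45, 76.5
minimum, maximum = 44.5, 84.6

# Interquartile range: the spread of the middle 50% of the data
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as potential outliers
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
has_outliers = minimum < lower_bound or maximum > upper_bound

print(f"IQR = {iqr:.2f}, bounds = [{lower_bound:.3f}, {upper_bound:.3f}]")
print("Potential outliers by this rule:", has_outliers)
```

Since both the minimum and the maximum fall inside the bounds, this rule flags no life expectancy value as an outlier.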

```Python
# The "columns" attribute provides the list of column titles in the dataset
bmi_life_data.columns
```

In [7]:
bmi_life_data.columns

Index(['Country', 'Life expectancy', 'BMI'], dtype='object')

What do you see in the output? Would knowing this information be helpful? If yes, in what way? If no, why?
<br>
<br>
The output shows the names of the columns. It will be useful for data cleaning.

```Python
# The "head" function accepts a number, n, as input and prints the first n rows of the dataset
bmi_life_data.head(5)
```

In [8]:
bmi_life_data.head(5)

| | Country | Life expectancy | BMI |
|---|---|---|---|
| 0 | Afghanistan | 52.8 | 20.62058 |
| 1 | Albania | 76.8 | 26.44657 |
| 2 | Algeria | 75.5 | 24.5962 |
| 3 | Andorra | 84.6 | 27.63048 |
| 4 | Angola | 56.7 | 22.25083 |


How would this information be used in Data Analysis?
<br>
<br>
This information is useful for quickly checking that the object contains the right type of data.

```Python
# The "sample" function accepts a number, n, as input and prints n random rows of the dataset
bmi_life_data.sample(3)
```

In [9]:
bmi_life_data.sample(3)

| | Country | Life expectancy | BMI |
|---|---|---|---|
| 161 | Zambia | 51.1 | 20.68321 |
| 139 | Switzerland | 82.0 | 26.20195 |
| 9 | Bahamas | 72.2 | 27.24594 |


In what way can the `sample` function be compared to the `head` function?
<br>
<br>
The output of the `sample` function is a random selection of `n` rows, whereas the `head` function always displays the first `n` rows in their original order.
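
Because `sample` is random, its output changes from run to run. If you need a reproducible selection, you can pass the `random_state` parameter. A minimal sketch on a five-row stand-in for the dataset (the seed value 42 is arbitrary):

```python
import pandas as pd

countries = pd.DataFrame({
    "Country": ["Afghanistan", "Albania", "Algeria", "Andorra", "Angola"],
    "BMI": [20.62058, 26.44657, 24.59620, 27.63048, 22.25083],
})

# head always returns the first n rows in their original order
first_rows = countries.head(2)

# sample returns n random rows; a fixed random_state makes the draw repeatable
draw_a = countries.sample(2, random_state=42)
draw_b = countries.sample(2, random_state=42)

print(first_rows)
print(draw_a)
```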

```Python
# The "isnull" function determines whether a datapoint is null or not.
pd.isnull(bmi_life_data)
```

In [10]:
pd.isnull(bmi_life_data)

| | Country | Life expectancy | BMI |
|---|---|---|---|
| 0 | False | False | False |
| 1 | False | False | False |
| 2 | False | False | False |
| 3 | False | False | False |
| 4 | False | False | False |
| 5 | False | False | False |
| 6 | False | False | False |
| 7 | False | False | False |
| 8 | False | False | False |
| 9 | False | False | False |


What do you see in the output? Why do you think a null datapoint is damaging for the Neural Network?
<br>
<br>
The output shows `False` for all cells in the dataset, which means that every cell contains a value. A null datapoint can be significant enough to cause misinterpretation, thus affecting the evaluation process later on.

Now try to run the following code:

```Python
pd.isnull(bmi_life_data).sum()
```

In [11]:
pd.isnull(bmi_life_data).sum()

Country            0
Life expectancy    0
BMI                0
dtype: int64

What difference do you see compared to the previous line of code? What is the benefit of using the `sum()` function?
<br>
<br>
This line of code gives a summary of the number of null values in each column of the dataset. It is beneficial because one does not need to count the null values by hand, which would be slow and inefficient.
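
The BMI dataset has no missing values, so the sketch below uses a made-up DataFrame with one deliberately missing entry to show what a non-zero count looks like. Dropping the affected rows with `dropna` is only one possible way to handle them:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing BMI value
df_missing = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "BMI": [20.6, np.nan, 24.6],
})

# Count the missing values in each column
null_counts = df_missing.isnull().sum()
print(null_counts)

# Remove every row that contains at least one missing value
cleaned = df_missing.dropna()
print(len(cleaned), "rows remain after dropna()")
```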

## Classification

Classification models are used to predict categories such as "good or bad", "high or low", "red or blue".

Similar to the Regression model, you can implement a Logistic Regression model to classify an iris type. In this example we will use the iris dataset that is bundled with `scikit-learn`.

```Python
from sklearn import datasets
from sklearn import metrics
from sklearn.linear_model import LogisticRegression

# this line loads the scikit-learn iris dataset
dataset = datasets.load_iris()

x = dataset.data
y = dataset.target

model = LogisticRegression()
model.fit(x, y)

print(model)

predicted = model.predict(x)

print(predicted)
```

In [12]:
from sklearn import datasets
from sklearn import metrics
from sklearn.linear_model import LogisticRegression


# this line loads the scikit-learn iris dataset
dataset = datasets.load_iris()

x = dataset.data
y = dataset.target

model = LogisticRegression()
model.fit(x, y)

print(model)

predicted = model.predict(x)

print(predicted)


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1
 1 1 1 1 1 1 1 1 1 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]




What is the output? Elaborate on what `print(predicted)` has printed.

<br>
<br>
The first part of the output shows the parameters of the `LogisticRegression` model. `print(predicted)` prints the class label (0, 1, or 2) that the model predicts for each row of `x`.
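
To see how the predicted labels compare with the true labels, you can use the `metrics` module that the cell above imports. The sketch below is self-contained and uses its own variable names. Setting `max_iter=1000` is an assumption of this example, made to avoid convergence warnings. Note that this measures performance on the training data itself:

```python
from sklearn import datasets, metrics
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
features, labels = iris.data, iris.target

classifier = LogisticRegression(max_iter=1000)
classifier.fit(features, labels)
predictions = classifier.predict(features)

# Translate the numeric labels of the first five predictions into species names
print(iris.target_names[predictions[:5]])

# Fraction of rows whose predicted label equals the true label
training_accuracy = metrics.accuracy_score(labels, predictions)
print(f"training accuracy: {training_accuracy:.3f}")

# Rows are true classes, columns are predicted classes
matrix = metrics.confusion_matrix(labels, predictions)
print(matrix)
```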

## Model Evaluation

In this section you will learn how to:
* Split a dataset into test/train/validation datasets
* Evaluate a trained model

Model Evaluation is an essential part of Machine Learning. It provides an overview of how the trained model has performed. The dataset used to train the neural network should not be used for evaluation. Evaluation must be performed against a dataset that the model has not seen! Therefore, we have to split our dataset into a set used for training and another set used for testing (evaluation). As such, we will split the original dataset into a **train** dataset and a **test** dataset.

There is one more dataset to be generated, which we call the **validation** dataset. The validation dataset is used to evaluate the performance of the model during the training process. It is mainly used in Deep Learning and is an essential part of the training process that the model uses to improve itself and increase its accuracy.

In summary, most large datasets are randomly divided into the following datasets prior to initializing the training process:
* **Train Dataset**: It is a subset of the original dataset that is used to train the neural network model.
* **Test Dataset**:  It is a subset of the original dataset that is hidden to the model. It is used to evaluate the trained neural network.
* **Validation Dataset**: It is a subset of the original dataset that is used to assess the performance of the model during the training process. It acts as a test platform for fine-tuning the model's performance. While the _train_ and _test_ datasets are common to all Machine Learning models, the _validation_ dataset is mainly used for Deep Learning models and models with a relatively large amount of data.
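
A minimal sketch of such a three-way split, using scikit-learn's `train_test_split` twice on the iris dataset. The 60/20/20 proportions, the seed, and the use of `stratify` are assumptions of this example, not requirements of the lab:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x_all, y_all = iris.data, iris.target

# Step 1: hold out 20% of the data as the test set
x_rest, x_test, y_rest, y_test = train_test_split(
    x_all, y_all, test_size=0.2, random_state=0, stratify=y_all
)

# Step 2: split the remaining 80% into train (60% of the total)
# and validation (20% of the total); 0.25 * 0.8 = 0.2
x_train, x_val, y_train, y_val = train_test_split(
    x_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest
)

print(len(x_train), len(x_val), len(x_test))
```

The `stratify` argument keeps the class proportions the same in every subset, which matters for small datasets such as iris.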


There are different metrics that can be used to evaluate neural network models. Common metrics include:
* Accuracy
* Area Under Curve (AUC)
* Logarithmic Loss
* Confusion Matrix
* F1 Score
* Mean Absolute Error (MAE)
* Mean Squared Error (MSE)

Read more about these metrics here: https://machinelearningmastery.com/metrics-evaluate-machine-learning-algorithms-python/
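
As a worked example of the two regression metrics in the list, the sketch below computes MAE and MSE on three made-up predictions, once with `sklearn.metrics` and once by hand:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical true values and predictions; the errors are -1, 0 and 2
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

mae = mean_absolute_error(y_true, y_pred)  # (1 + 0 + 2) / 3
mse = mean_squared_error(y_true, y_pred)   # (1 + 0 + 4) / 3

# The same quantities computed directly from their definitions
mae_by_hand = np.mean(np.abs(y_true - y_pred))
mse_by_hand = np.mean((y_true - y_pred) ** 2)

print(f"MAE = {mae:.3f}, MSE = {mse:.3f}")
```

MSE penalizes the single large error more heavily than MAE does, because the errors are squared before averaging.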

**K-Fold** (`KFold`) is a class provided by `scikit-learn` which splits the dataset into groups of `train` and `test` sets.
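
To see what these groups look like, the sketch below asks `KFold` for the row indices it would use on a made-up dataset of six samples. Every sample appears in exactly one test set:

```python
import numpy as np
from sklearn.model_selection import KFold

# Six hypothetical samples, identified by their index 0..5
samples = np.arange(6)

# split() yields one (train indices, test indices) pair per fold
folds = [
    (train_idx.tolist(), test_idx.tolist())
    for train_idx, test_idx in KFold(n_splits=3).split(samples)
]

for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Without shuffling, the test sets are simply consecutive blocks of rows, so the order of the rows in the dataset affects what each fold contains.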

Here is a simple example for evaluating the classification model that you trained in the previous section:

```Python
from sklearn.model_selection import KFold
from sklearn import model_selection

# n_splits determines how many groups of test/train sets you want KFold to create
kfold = KFold(n_splits=2)
# get_n_splits returns the number of test/train groups that KFold will create for x
kfold.get_n_splits(x)
print(kfold)

# here we choose our evaluation metrics
scoring = 'accuracy'
results = model_selection.cross_val_score(model, x, y, cv=kfold, scoring=scoring)
print(results.mean())
```

In [32]:
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix, classification_report
from sklearn import model_selection

# n_splits determines how many groups of test/train sets you want KFold to create
kfold = KFold(n_splits=2)
# get_n_splits returns the number of test/train groups that KFold will create for x
kfold.get_n_splits(x)
print(kfold)

# here we choose our evaluation metrics
# scoring = 'accuracy'
scoring = 'roc_auc'
# scoring = 'neg_log_loss'
# scoring = 'neg_mean_absolute_error'
# scoring = 'neg_mean_squared_error'


# matrix = confusion_matrix(y_data, predicted)
# report = classification_report(y_data, predicted)

results = model_selection.cross_val_score(model, x, y, cv=kfold, scoring=scoring)

print(results.mean())

KFold(n_splits=2, random_state=None, shuffle=False)
0.9952000000000001




What is the accuracy of the model? What does it say about the model?
<br>
<br>
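With `scoring = 'accuracy'`, the model is not very accurate under this split. The iris samples are ordered by class and `KFold` is used with `shuffle=False`, so each training half lacks one of the classes that appear in its test half. Note that the value printed above (0.995) is the AUC, because the cell was last run with `scoring = 'roc_auc'`.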
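
The influence of the row order on the accuracy score can be checked directly. The following sketch compares `KFold` with and without shuffling on the iris dataset. The variable names, the seed, and `max_iter=1000` are assumptions of this example:

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

iris = datasets.load_iris()
classifier = LogisticRegression(max_iter=1000)

# Unshuffled: each fold is a consecutive block of the class-ordered rows
ordered_scores = cross_val_score(
    classifier, iris.data, iris.target,
    cv=KFold(n_splits=2), scoring="accuracy",
)

# Shuffled: rows are mixed before splitting, so every fold sees all classes
shuffled_scores = cross_val_score(
    classifier, iris.data, iris.target,
    cv=KFold(n_splits=2, shuffle=True, random_state=0), scoring="accuracy",
)

print("unshuffled mean accuracy:", ordered_scores.mean())
print("shuffled mean accuracy:  ", shuffled_scores.mean())
```

In the unshuffled case each model is tested partly on a class it never saw during training, which caps its accuracy at one third.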

Change the `n_splits` value in the code. What happens when you increase it? What happens when you set it to `1`?

As the value of `n_splits` increases, the number of train/test groups that `KFold` creates increases as well. When `n_splits` is set to `1`, an error occurs because `KFold` requires at least 2 splits.
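
The error for `n_splits=1` can be reproduced in isolation. A minimal sketch that catches the exception and prints its message:

```python
from sklearn.model_selection import KFold

raised = False
try:
    # A single split would leave no data for training, so KFold rejects it
    KFold(n_splits=1)
except ValueError as err:
    raised = True
    print("ValueError:", err)
```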

**Exercise:** Try using other evaluation metrics and record your observation.

* Log loss (`neg_log_loss`): -0.778
* MAE and MSE (`neg_mean_absolute_error`, `neg_mean_squared_error`): -0.693
* AUC (`roc_auc`): 0.995



***

<font size="2"><center><i>Version 1.1 | May 2019 | Asadollah Norouzi | School of Electrical & Electronic Engineering</i></center></font>