# 🤖 Artificial Intelligence, Machine Learning and Python 🤖

## Why does it all matter? 

Artificial Intelligence is becoming more and more prevalent in healthcare. Look at these recent headlines! 


![image](.//headline_1.png)

![image](.//headline_2.png)

Artificial Intelligence is going to become more and more prominent within the NHS.

![image](.//headline_3.png)

Also, Artificial Intelligence is just really interesting! (And sometimes a bit scary!) It's here to stay, so understanding it is going to be important for those doing testing and assurance in the NHS. 
Here is a final selection of headlines to show just how **versatile**, **interesting**, and **important** AI is becoming.

![image](.//headline_4.png)

![image](.//headline_5.png)

## What is today's plan? 

##### **We will start by creating an Artificial Intelligence model that can predict whether an individual has breast cancer.**
##### **We will then learn a few ways of testing and assuring Artificial Intelligence models.**


### Before we get into coding - let's go over some definitions.

Firstly - what actually is Artificial Intelligence? 

Online chatbots, Amazon Alexa and a self-driving Tesla are all examples of **machine learning** and **artificial intelligence**. 

- Artificial Intelligence (often referred to as AI) is a general term used to describe computers completing tasks that we would consider clever or intelligent. 

- Machine Learning is a particular application of AI. It is where computers use data to learn patterns and make predictions without explicit instructions from developers. 

Today, our focus is going to be more closely aligned to machine learning. We are going to use a dataset to **train** a **machine learning model** (often just referred to as a model), so that it can make **predictions** that we can **test** and **assure**.

## Python Recap

Firstly, let's give a quick recap of Python and of functions. This will help us a lot later! 

In [1]:
# This is a block of code written in python. It is written inside a box, known as a cell.
# Every time you get to a piece of code in this notebook, you will need to run it.
# There are two possible ways to do this.

# 1 - Click the triangular play button in the top left corner of the cell. 
# 2 - If you are clicked inside the cell, you can hold shift and press enter.

# Once the code has successfully run, you will see a little blue tick.

# Sometimes, the code will also give an output, which will appear when you have run the code. 
# Like this!
print("Hello!")

Hello!


In [2]:
# This is a very simple function that works out percentages.
# You pass it in two numbers, called parameters.
# These represent the numerator and denominator of a fraction.
# Some simple maths is done: the fraction is worked out, multiplied by 100, and returned as a percentage.

def calculate_percentage(numerator, denominator):
    fraction = numerator / denominator
    return fraction * 100

# When you run this code snippet nothing will happen. You need to call the function for it to return something.
# You should still see a little green tick!

In [3]:
# Call the function here:
# You should see an output when you run this piece of code.
# If you get an error, make sure you have entered two numbers without speech marks between the brackets separated by a comma. 

calculate_percentage("""your value 1""", """your value 2""")

TypeError: unsupported operand type(s) for /: 'str' and 'str'
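
The `TypeError` above appears because the placeholders are strings rather than numbers. As a minimal sketch of a working call (the values 1 and 4 are just illustrative choices, not part of the original exercise), replacing the placeholders with plain numbers might look like this:

```python
def calculate_percentage(numerator, denominator):
    # Work out the fraction, then scale it to a percentage
    fraction = numerator / denominator
    return fraction * 100

# 1 out of 4, passed as numbers (no speech marks)
result = calculate_percentage(1, 4)
print(result)  # 25.0
```

Any two numbers will work, as long as the second one is not zero.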

## Let's look at some data! 

Over the next few steps, we are going to follow the path of a data scientist creating a machine learning model.

**Don't worry!** This is not something you will be expected to do after this course. It is just an opportunity to learn how a machine learning model is made, and this will help you understand why we test it in certain ways later in this course.

In [17]:
# These next few lines of code install certain packages. These contain lots and lots of functions which we can use in our code later.

!pip install pandas
!pip install plotly
!pip install -U shap
!pip install -U scikit-learn


In [2]:
! pip install -r requirements.txt


In [3]:
#Now that we have installed them, we can import certain libraries and functions. 

# This first function is used to load a pre-existing breast cancer dataset
from sklearn.datasets import load_breast_cancer

# Pandas is a library containing lots of functions used to modify and navigate dataframes.
import pandas as pd

# Plotly is a library which helps us generate plots.
import plotly.express as px

In [4]:
# We are about to use the load_breast_cancer function from sklearn 
# This calls the function to return:

#breast_cancer_inputs  
# - This is the raw data containing information about different breast cancer screenings

#breast_cancer_outputs 
# - This contains the classification of whether each screening is identified as being breast cancer

breast_cancer_inputs, breast_cancer_outputs = load_breast_cancer(return_X_y=True, as_frame=True) 

In [5]:
# What does the data look like?
display(breast_cancer_inputs)

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,25.380,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,24.990,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,23.570,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,14.910,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,22.540,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,25.450,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,23.690,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,18.980,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,25.740,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400


Let's be honest, this data is really confusing and we cannot really tell anything from it!

We can create a plotting function to help us spot some trends. 

In [None]:
def produce_scatter_plot(input_dataframe: pd.DataFrame,
                         output_series: pd.Series,
                         columns:list):
    """Produces a 2D or 3D scatter plot based on columns from dataframe.

    Args:
        input_dataframe (pd.DataFrame): The dataframe containing breast cancer information
        output_series (pd.Series): The series containing diagnosis information
        columns (list): The columns to plot.
    """
    
    # Generate a series of strings to use as a colour key
    colours = pd.Series(str(value) for value in output_series)
    
    # Check the number of columns.
    # If there are 3, make a 3D plot.
    # If there are 2, make a 2D plot.
    if len(columns) == 3:
        fig = px.scatter_3d(input_dataframe, x=columns[0], y=columns[1], z=columns[2], color=colours, labels={"color": "Diagnosis", "Symbol": "Diagnosis"})
    elif len(columns) == 2:
        fig = px.scatter(input_dataframe, x=columns[0], y=columns[1], color=colours, labels={"color": "Diagnosis", "Symbol": "Diagnosis"})
    else:
        # Any other number of columns cannot be plotted
        print("Please only use 2 or 3 columns")
        return None

    # Update the graphics
    fig.update_traces(marker=dict(size = 4 if len(columns) == 3 else 10,
                              line=dict(width=2, color='DarkSlateGrey')),
                              selector=dict(mode='markers'))
    
    fig.show()
    return None

![image](.//task.png)

This plotting function can be called by passing it breast_cancer_inputs, breast_cancer_outputs, and a list of column names.

Try using the column names to generate some 2D and 3D plots, and see if you can spot any trends. 

In [None]:
# Remind ourselves of the column names
print(breast_cancer_inputs.columns)

In [None]:
produce_scatter_plot(breast_cancer_inputs,
                     breast_cancer_outputs,
                     ['mean radius','mean texture']) # <== Change these two column names to any two from the dataset

In [12]:
produce_scatter_plot(breast_cancer_inputs,
                     breast_cancer_outputs,
                     []) # Try putting in three column names here


There are 30 different columns within this dataset, meaning there are 435 different pairs you could plot in a 2D graph. Spotting patterns, or more importantly the most important patterns, is an incredibly complex task. 

Further to this, can you imagine having a 30 dimensional scatter plot to try and spot patterns? For us, it is completely impossible! But for a machine, this is something it can do easily. 
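
The number of possible plots can be checked with a couple of lines of Python. This is a small sketch using the standard library's `math.comb`, which counts how many ways you can choose a group of columns when the order does not matter:

```python
from math import comb

n_columns = 30  # number of input columns in the breast cancer dataset

n_pairs = comb(n_columns, 2)    # different pairs of columns for 2D plots
n_triples = comb(n_columns, 3)  # different groups of three columns for 3D plots

print(n_pairs, n_triples)  # 435 4060
```

Even with only 30 columns there are thousands of possible plots, which is why looking at them one by one does not scale.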

#### Can we use this data in a machine learning model? Yes!

In [None]:
from sklearn.model_selection import train_test_split

# When creating a machine learning model, it is important to have some training data and some testing data.

# Training data - this is data that is used to train the model.
# Testing data - this is data used to test the model, we will use this a lot later!

# The following function splits our data randomly to test and training data.
breast_cancer_inputs_training, breast_cancer_inputs_testing, breast_cancer_outputs_training, breast_cancer_outputs_testing = train_test_split(breast_cancer_inputs, breast_cancer_outputs, train_size=0.8)

# train_size represents the proportion of data that should be used for training
# we have gone for 80% by setting train_size to 0.8

#### There are multiple types of machine learning algorithms: 

**Classification algorithms** are machine learning techniques used to predict categorical labels or classes based on input data.

**Regression algorithms** are machine learning techniques used to predict continuous numerical values based on input data.

**Clustering algorithms** are machine learning techniques used to group similar data points together based on their inherent similarities, without any predefined labels.

![image](.//task.png)

#### Can you tell which type of algorithm we want to use?

Uncomment (remove the #) the correct type of model we want to use for our training in the code below.
Don't scroll down too far or you will see the answer!

In [None]:
# Import relevant models
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.cluster import KMeans

classification_model = RandomForestClassifier()
# regression_model = RandomForestRegressor()
# cluster_model = KMeans()

Once we have a model, our next step is to train it.

sklearn has an incredibly useful function - `fit` 

The fit method trains the algorithm on the data.

How it does this depends on the model. If you would like to know more about the random forest algorithm you can read about it here: https://www.turing.com/kb/random-forest-algorithm. However, you won't need a lot of this specific knowledge today.

**Importantly**: All algorithms with labelled training data use a loss function to train a predictive function. They are all dependent on the quality of the labelled training data.




In [None]:
# The model we are using is a classification model, as we want to make predictions to put individuals into 1 of 2 categories. 
# This is one example of a classification model, called a random forest classifier.
# Others are available, which you can look up if you would like to. 

classification_model.fit(breast_cancer_inputs_training,breast_cancer_outputs_training)

And that is it, in just a few lines of code we have a machine learning model that can be used for predicting if someone has breast cancer!

We can even try it out, using the actual use case of the model. In the future, a doctor might use a model like this by entering the measurements for one individual patient and seeing whether it predicts a benign or cancerous tumour. 

In [None]:
# Create a Dataframe containing the information of one randomly sampled individual.
individual_1_data = breast_cancer_inputs_testing.sample(n=1)
display(individual_1_data)

prediction = classification_model.predict(individual_1_data)
print(prediction)
# In this dataset, a 0 is a malignant (cancerous) prediction, and a 1 is a benign prediction

But is our prediction right? We can check the label of this particular test record by peeking at the test outputs. 

In [None]:


# Find out the individual's diagnosis
individual_1_diagnosis = breast_cancer_outputs_testing[(individual_1_data.index)[0]]

print("Individual 1 has a diagnosis of:",individual_1_diagnosis)
# In this dataset, a 0 is malignant (cancerous) and a 1 is benign.

So, we now have the information for one individual. We also know what the actual outcome of the prediction should be.

Above, we used `predict`. This uses the model we have just trained to predict the result for whatever data we pass in. It really is that simple!

Did the model get it right? It might have! But how do we know whether it will be correct every time? 



##### It's time to switch on our testing brains and start evaluating just how good this model is!

Do you remember the testing data we separated from the training data earlier? This is about 20% of the total data we have access to, and we can use it to evaluate just how good our model is. We can use the same `predict` function as above, but this time we can pass it far more data. 

Typically, you want **about** 80% of your data to be used for training, although a bit of variation from this is fine! This gives you lots of data to train your model on, but still leaves enough for accurate testing. 
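
To make the 80/20 idea concrete, here is a rough sketch of what a random split does, using only Python's standard library. It is not how `train_test_split` is implemented internally. The row count of 569 is the size of scikit-learn's copy of the breast cancer dataset, and the seed of 0 is an arbitrary choice so that the shuffle is repeatable.

```python
import random

n_records = 569    # number of rows in the breast cancer dataset
train_size = 0.8   # proportion of rows to use for training

# Shuffle the row numbers so the split is random
indices = list(range(n_records))
random.Random(0).shuffle(indices)  # fixed seed so the shuffle is repeatable

# The first 80% of the shuffled rows are for training, the rest are for testing
n_train = int(n_records * train_size)
training_indices = indices[:n_train]
testing_indices = indices[n_train:]

print(len(training_indices), len(testing_indices))  # 455 114
```

No row appears in both groups, which is what lets the testing data act as a fair check on the model.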

**Importantly**, this data is **labelled**. The basic premise of testing with labelled data is as follows: make your predictions (as we have just done), and then compare them with the labels.

In [None]:
predictions = classification_model.predict(breast_cancer_inputs_testing)

What do the predictions actually look like? 

In [None]:
print(predictions)

Let's start with a really simple piece of analysis. How many of our predictions were correct? A super simple way to look at this is using a pie chart. 

In [None]:
# matplotlib is another plotting library similar to plotly.

import matplotlib.pyplot as plt

def plot_predictions(predictions, actual):
    """Creates a pie chart

    Args:
        predictions (Series): Predicted values by the model
        actual (Series): Actual values
    """

    correct_predictions = 0
    incorrect_predictions = 0

    for prediction, value in zip(predictions, actual):
        if prediction == value:
            correct_predictions += 1
        else:
            incorrect_predictions += 1

    total_predictions = len(predictions)

    # Remember our calculate_percentage function we made right at the start of the notebook!?
    correct_percentage = round( calculate_percentage(correct_predictions,total_predictions) ,1)
    incorrect_percentage = round( calculate_percentage(incorrect_predictions,total_predictions), 1)

    # Plot the chart
    plt.pie([correct_predictions,incorrect_predictions],
            labels=[f"Correct Predictions\n{correct_percentage}%",f"Incorrect Predictions\n{incorrect_percentage}%"])
             
    # Add a title
    plt.title("Percentage of Correct and Incorrect Predictions")
    
    return None
    

In [None]:
plot_predictions(predictions, breast_cancer_outputs_testing)

There are other ways to analyse our results, for example using a confusion matrix. 

In [None]:
from sklearn import metrics

def generate_confusion_matrix(predicted, actual, axis_labels = ["Malignant", "Benign"]):
    """Creates a confusion matrix chart

    Args:
        predicted (Series): Predicted values by the model
        actual (Series): Actual values
        axis_labels (list[string]): axis labels, in class order (in this dataset 0 = malignant, 1 = benign)
    """
    confusion_matrix = metrics.confusion_matrix(actual, predicted)

    cm_display = metrics.ConfusionMatrixDisplay(confusion_matrix = confusion_matrix, display_labels = axis_labels)

    cm_display.plot()

    plt.title("Confusion Matrix")

In [None]:
generate_confusion_matrix(predictions, breast_cancer_outputs_testing)

### What does this actually show? 

The matrix has four squares, each representing a different combination of actual and predicted classifications. The squares are:

- True Positive (TP): The model correctly predicted the positive diagnosis.
- False Positive (FP): The model incorrectly predicted the positive diagnosis.
- True Negative (TN): The model correctly predicted the negative diagnosis.
- False Negative (FN): The model incorrectly predicted the negative diagnosis.

The top left and bottom right boxes indicate correct predictions, but a confusion matrix allows us to see what kind of errors we're making. 
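
The four squares are just counts, so they can be worked out by hand. The sketch below uses a small set of made-up labels (not the breast cancer data) and assumes that 1 is the "positive" class and 0 is the "negative" class. Which class you treat as positive is a choice you need to make and state clearly when reading a real confusion matrix.

```python
# Made-up labels for illustration only: 1 = positive class, 0 = negative class
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

pairs = list(zip(actual, predicted))

true_positives  = sum(1 for a, p in pairs if a == 1 and p == 1)
false_positives = sum(1 for a, p in pairs if a == 0 and p == 1)
true_negatives  = sum(1 for a, p in pairs if a == 0 and p == 0)
false_negatives = sum(1 for a, p in pairs if a == 1 and p == 0)

print("TP:", true_positives, "FP:", false_positives)
print("TN:", true_negatives, "FN:", false_negatives)
```

Here the model made two mistakes, and the counts show that they were different kinds of mistake: one false positive and one false negative.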

In some scenarios, we may want an algorithm to be intentionally biased towards one kind of error.

- **Fire Alarms**: You would like a fire alarm to catch all fires. You would much rather a fire alarm go off when there is not a fire than not go off when there is one. For this, we want **high sensitivity** and will accept **lower specificity**, i.e. you want no false negatives but you'll accept some false positives.

- **Spam Filters**: You do not want a spam filter to filter out important real emails! You care a lot about **high specificity** and will accept **lower sensitivity**. Therefore, you'll accept some false negatives but don't want any false positives.  
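
Sensitivity and specificity can both be calculated from the four confusion matrix counts. The following is a minimal sketch with invented numbers for a hypothetical fire alarm, where a "positive" means the alarm going off:

```python
def sensitivity(true_positives, false_negatives):
    # Of all the real positives, what proportion did we catch?
    return true_positives / (true_positives + false_negatives)

def specificity(true_negatives, false_positives):
    # Of all the real negatives, what proportion did we correctly leave alone?
    return true_negatives / (true_negatives + false_positives)

# Hypothetical fire alarm: it caught all 50 fires (no false negatives),
# but also went off on 100 of the 1000 occasions when there was no fire.
alarm_sensitivity = sensitivity(true_positives=50, false_negatives=0)
alarm_specificity = specificity(true_negatives=900, false_positives=100)

print(alarm_sensitivity, alarm_specificity)
```

This alarm has perfect sensitivity and a lower specificity, which is the trade-off we said we would accept for a fire alarm.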

![image](.//task.png)

### Have a think - what bias would we prefer for this dataset and model?

Let's go even further with our analysis and cover something called an AUC-ROC Curve. 

Firstly, what do AUC and ROC mean? 

- ROC: Receiver Operating Characteristic
- AUC: Area Under Curve

ROC essentially is a graphical representation of the effectiveness of a binary classification model.

It plots the true positive rate (TPR) vs the false positive rate (FPR) at different classification thresholds.

The AUC is the area under the ROC curve. It measures the overall performance of the binary classification model as a single number.

The True Positive Rate (often called Recall or Sensitivity) is the proportion of positive examples that are correctly identified.

`TPR = TP / (TP + FN)`

The False Positive Rate (FPR) is the proportion of negative examples that are incorrectly classified as positive.

`FPR = FP / (FP + TN)`
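
To see where the points on a ROC curve come from, here is a small sketch using made-up scores (not output from our model). It assumes the model gives each record a score between 0 and 1 for the positive class, and that we predict "positive" whenever the score is at or above a chosen threshold. Each threshold gives one (FPR, TPR) point on the curve.

```python
# Made-up scores for the positive class, with the true labels (1 = positive)
scores = [0.95, 0.80, 0.60, 0.40, 0.20, 0.05]
labels = [1, 1, 0, 1, 0, 0]

def rates_at(threshold):
    """Return (TPR, FPR) when predicting positive for scores >= threshold."""
    predicted = [1 if score >= threshold else 0 for score in scores]
    tp = sum(1 for a, p in zip(labels, predicted) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(labels, predicted) if a == 1 and p == 0)
    fp = sum(1 for a, p in zip(labels, predicted) if a == 0 and p == 1)
    tn = sum(1 for a, p in zip(labels, predicted) if a == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)

for threshold in (0.9, 0.5, 0.1):
    tpr, fpr = rates_at(threshold)
    print(f"threshold={threshold}: TPR={tpr:.2f}, FPR={fpr:.2f}")
```

Lowering the threshold catches more of the real positives (higher TPR) but also lets in more false alarms (higher FPR). The ROC curve shows this trade-off across all thresholds at once.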

Let's make a plot to hopefully make more sense of all this!

In [None]:
x = metrics.RocCurveDisplay.from_estimator(classification_model, breast_cancer_inputs_testing, breast_cancer_outputs_testing)
plt.plot([0,1],[0,1],"--",c="red")
plt.show()  

The further to the top left corner the ROC curve is, the better the prediction.

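When the curve is perfectly in the top left corner, the area underneath it is 1. This means an AUC score of 1 is a perfect classifier. A score of 0.5 (the dashed red line) is no better than random guessing, and a score of 0 means every prediction is the wrong way round.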

### **It doesn't always need to be so complicated! Sometimes we just want a single number to help our evaluation!**

Another method of analysis is an **F1 Score**. This is a more general measurement for how good a classifier model is.

This is given as:

**F1 = 2 * (precision * recall) / (precision + recall)**

Where:
- **Precision** is a measure of how many predicted positive diagnoses are actually positive. This is found by taking the number of true positives and dividing by the number of true and false positives.
- **Recall** is the proportion of actual positive cases that are correctly predicted by the model. This is found by taking the number of true positives and dividing by the number of true positives and false negatives.

A perfect prediction would give an F1 score of 1, with the worst possible score being 0. 
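
As a worked example of the formula above, this sketch calculates precision, recall and the F1 score from a set of invented confusion matrix counts:

```python
# Invented counts for illustration only
true_positives = 6
false_positives = 2
false_negatives = 4

# Of everything we predicted as positive, how much really was positive?
precision = true_positives / (true_positives + false_positives)

# Of everything that really was positive, how much did we find?
recall = true_positives / (true_positives + false_negatives)

f1 = 2 * (precision * recall) / (precision + recall)

print(f"precision={precision:.2f}, recall={recall:.2f}, F1={f1:.2f}")
```

Notice that the F1 score sits between the precision and the recall, and that true negatives are not used anywhere in the calculation.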

The sklearn classification report is a very useful way of examining the precision, recall and F1 score of **each class** (i.e. benign or malignant).

In [None]:
classification_report = metrics.classification_report(breast_cancer_outputs_testing, predictions)

print(classification_report)

![image](.//task.png)

**OPTIONAL**

So you have made predictions using the random forest algorithm, but could you use something else instead?

Head to https://scikit-learn.org/stable/supervised_learning.html and see what other models are available.

Can you find any better classification models for our dataset?

Feel free to go back and look at the previous code in this notebook.

In [None]:
## OPTIONAL: Your code here!
# import your model 
# fit to the training data
# get predictions from the test data
# test your model



# Interrogating Training Data

As a tester, it is very important to look at training data as well as the predictions of a model.

##### Why should someone assuring a model care about checking the training data it came from?

- Being able to check the training data of a model is a key way to spot any flaws in the data
- Flaws in the training data will be reflected in the predictive model.
- Models may make incorrect predictions based on bad quality training data. 

One particular method we can use to check the training data is to look at **correlation** between variables.

#### Why do we care about correlation?

 - Highly correlated variables essentially contain the same information.
 - When this happens, it becomes difficult for the model to distinguish the individual effects of each variable during training. This can lead to unstable and unreliable predictions.
 - Highly correlated variables can dominate the machine learning model.
 - Data Scientists often remove highly correlated variables. This ensures that each variable contributes unique and independent information to the model, allowing for more accurate and reliable predictions.
 - However, we can't assume data scientists always do this! So it is important to check for ourselves.
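
To get a feel for what "highly correlated" means, here is a minimal sketch that calculates the Pearson correlation by hand for some made-up columns. The column names and values are invented for illustration. The same height recorded in two different units carries exactly the same information, so the correlation is 1.

```python
from math import sqrt

def pearson_correlation(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)

# Made-up columns: the same heights in centimetres and in inches
height_cm = [150, 160, 170, 180, 190]
height_inches = [h / 2.54 for h in height_cm]

# A made-up column that goes down as height goes up
countdown = [5, 4, 3, 2, 1]

same_information = pearson_correlation(height_cm, height_inches)
opposite_trend = pearson_correlation(height_cm, countdown)

print(round(same_information, 3), round(opposite_trend, 3))
```

A model given both height columns would gain nothing from the second one, which is why one of them would usually be removed.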

We have just been given the training data for a regression model that predicts the diabetic progression of an individual. The data scientist claims they checked the training data before making the model, but we want to be sure! 

In [None]:
from sklearn.datasets import load_diabetes

# Here, we load the diabetes inputs and outputs using the load diabetes function. 

# Earlier we used clear variables named breast_cancer_inputs and breast_cancer_outputs.

# Sadly, data scientists often like to be more concise (and confusing), and like to use X and y.
# You will likely see this a lot online, so it is good to practice this notation. 

diabetes_X, diabetes_y = load_diabetes(return_X_y=True, as_frame=True)
# Here diabetes_X represents inputs, and diabetes_y represents outputs.

Let's have a look at the training data.

In [None]:
display(diabetes_X)

### Correlation Matrix

In [None]:
corr = diabetes_X.corr()
corr.style.background_gradient(cmap='coolwarm')

![image](.//task.png)

This correlation plot shows the correlation between each variable/column.
- Values close to 1 show high positive correlation.
- Values close to -1 show high negative correlation.


**Looking at this plot, which 2 columns show the highest correlation?**

As a tester, you may want to report this finding back to the developers.

A data scientist would often remove columns that are highly correlated.

In the code below, remove one of the columns to reduce highly correlated features. 

In [None]:
#remove a column
# diabetes_X = diabetes_X.drop("put_the_name_of_a_column_here", axis=1)
#check the column is removed
display(diabetes_X)

Is there anything else looking a bit odd with this data? Maybe you have noticed that the `sex` values are either 0.050680 or -0.044642?

That could be a bit odd, let's do some exploring.

In [None]:
sex_value_counts = diabetes_X["sex"].value_counts()
print(sex_value_counts)

In this case, it all looks pretty *normal*. Firstly, there are only 2 values, which is a good sign! Secondly, it is normal for data scientists to apply transformations to training data to make it all a similar size, i.e. between 1 and -1. In this case, there doesn't seem to be a problem with the data, but it is always a good idea to keep your eyes peeled on training data!

### Regression Models

A Data Scientist has used the above dataset to train a regression model. They have sent you the following code that they used to create a model to predict the progression of diabetes. 

In [None]:
from sklearn.linear_model import LinearRegression

X_training, X_testing, y_training, y_testing = train_test_split(diabetes_X, diabetes_y, train_size=0.03)

regression_model = LinearRegression()
regression_model.fit(X_training,y_training)

y_predictions = regression_model.predict(X_testing)

from sklearn.metrics import mean_absolute_percentage_error

mape = mean_absolute_percentage_error(y_testing, y_predictions)

print(f"Mean Absolute Percentage Error for model 1: {round(mape * 100, 2)}%")
print("See, my model is amazing!")

First up, we've introduced a new testing metric, what is the mean absolute percentage error (MAPE)? It represents the average of the absolute percentage errors of each entry in a dataset. 

This means that, for each data point in the testing data, the absolute percentage error is calculated. These are then all added up together and averaged to come up with the MAPE number. The closer to 0 the MAPE, the better the model.
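
The calculation described above can be sketched in a few lines of plain Python. The numbers are made up, and this simple version assumes none of the actual values are zero (dividing by zero would fail). Like the sklearn function, it returns a fraction, which is why the data scientist's code multiplies by 100 to show a percentage.

```python
def mean_absolute_percentage_error_by_hand(actual, predicted):
    # Absolute percentage error for each data point, as a fraction
    errors = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    # Add them all up and average
    return sum(errors) / len(errors)

# Made-up values for illustration only
actual_values = [100, 200, 400]
predicted_values = [110, 180, 400]

# Individual errors are 10%, 10% and 0%
mape_by_hand = mean_absolute_percentage_error_by_hand(actual_values, predicted_values)

print(f"MAPE: {round(mape_by_hand * 100, 2)}%")
```

A prediction that is 10 too high on a true value of 100 counts the same as one that is 20 too low on a true value of 200, because both are 10% away from the truth.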

![image](.//task.png)

Look again at the above code from the data scientist. There is **1** mistake / piece of bad practice that will reduce the accuracy of their code. Can you spot it? 👀

Fix the mistake and rerun the code, does the MAPE decrease?

#### Feature Importance

Feature Importance Analysis reveals how much each feature (columns/fields/dimensions) contributes to the model prediction.

This can be used as a diagnostic tool:

- It reveals which features are barely used or not used at all to make a decision.
- It shows if the model is relying mostly or entirely on one or two features. This could suggest an error, or perhaps that a simpler model might be just as effective.

More complex feature importance analysis can reveal the qualities of information that each feature gives.

All of this can be used to diagnose issues within the training data, and help understand and explain the model.

In [None]:
# shap is the module we will use to run feature importance
import shap 
# from sklearn.preprocessing import Imputer

# we pass shap our regression model, alongside the training data
explainer = shap.LinearExplainer(regression_model,masker=shap.maskers.Impute(data=X_training))
# we then pass shap our testing inputs.
shap_values = explainer.shap_values(X_testing)
# shap then creates a plot for us showing which features are the most important
shap.summary_plot(shap_values, X_testing, plot_type="bar")

# In the plot_type parameter, you may wish to try out "violin", "dot" or "bar"

![image](.//task.png)

Before we move onto something new, see if you can answer these questions. From this plot, can you tell:

- Which feature has the greatest impact on the model?

- Which feature has the smallest impact on the model?

- Is this what we expect to see? 

## Deep Learning and Neural Networks!

**Deep learning:**

So far we've discussed machine learning using structured data. This means data that is tabulated and has labelled feature names.

**Deep learning** is a more complex form of machine learning which generally involves **unstructured data** like images and natural language.

**Deep learning** involves **Neural Networks** - computational models inspired by the human brain's neural connections. They are made up of interconnected nodes formed into layers.

But the important thing to recognise is that a **Neural Network**, like any model, is just a complicated function, with an input, a hidden process, and an output.

And the way we test them is much the same as before.

 

You can think of the nodes and connections within the neural network as small, simple mathematical functions.

As the neural network is trained, the weights associated with these small component parts are adjusted using a **loss function.**

If the training data is good enough, the neural network will gradually be able to tend towards the desired predictions.

![image](.//neural_network.png)

### Let's make a dataset to try and classify with a neural network!

In [None]:
from sklearn.datasets import make_blobs, make_swiss_roll
import numpy as np

# Use make swiss rolls to get the coordinates for two swiss roll shaped clusters
X1, y1 = make_swiss_roll(n_samples=2000,noise=0.2, random_state=1)
X2, y2 = make_swiss_roll(n_samples=2000,noise=0.2, random_state=2)
#Use make blobs to get a "blob" shaped cluster
X_blob,y_blob = make_blobs(n_samples = 2000,n_features=3, centers = 1, center_box=(10,0,0), random_state=4, cluster_std= 0.4)

new_X = []
new_y = []
#Iterate through one cluster
for i in range(len(X1)):
    # Add the points from each cluster to a new array
    new_X.append( list(X1[i]))
    new_y.append(1)
    new_X.append( [X2[i][0] * -1,X2[i][1] ,X2[i][2]* -1] ) # Apply some transformations to the coordinates
    new_y.append(2)
    new_X.append( [X_blob[i][0],X_blob[i][1] * 4 * np.random.rand() ,X_blob[i][2]] ) # Apply some transformations to the coordinates
    new_y.append(y_blob[i] + 3)


X = (pd.DataFrame(new_X))
y = new_y


In [None]:
# Use our old produce_scatter_plot function to visualise the data.
produce_scatter_plot(X,y,[0,1,2])

# Interesting, as a human, we can clearly see the different groups, and would be able to classify them easily without colours. 


In [None]:
# Split the data into testing and training
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

## How do we train a neural network?

The full process for training a neural network is very complicated, but it is useful to know a few key points.

When training a neural network, you often pass the data to the model multiple times. Each time you pass the entire dataset to the model, this is called an **epoch**.

A neural network measures its training progress using a **loss function**, which measures the difference between predicted outputs and actual targets. As you progress through multiple epochs, the loss should decrease.
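As a rough illustration of what a loss function does, here is a minimal sketch of one common loss for classification, called cross-entropy (or log loss). The `cross_entropy` function and the probability tables are made up for this example. The scikit-learn documentation describes `MLPClassifier` as optimising a log loss of this kind, but the code below is a simplified version for teaching rather than its actual implementation.

```python
import numpy as np

def cross_entropy(true_labels, predicted_probs):
    # For each row, look up the probability the model gave to the correct class,
    # then average the negative logs. Lower is better.
    rows = np.arange(len(true_labels))
    return -np.mean(np.log(predicted_probs[rows, true_labels]))

# Three made-up data points whose correct classes are 0, 1 and 2
true_labels = np.array([0, 1, 2])

# A model that is confident and correct
confident_and_right = np.array([[0.90, 0.05, 0.05],
                                [0.05, 0.90, 0.05],
                                [0.05, 0.05, 0.90]])

# A model that is barely better than guessing
unsure = np.array([[0.4, 0.3, 0.3],
                   [0.3, 0.4, 0.3],
                   [0.3, 0.3, 0.4]])

good_loss = cross_entropy(true_labels, confident_and_right)
poor_loss = cross_entropy(true_labels, unsure)
print("Loss for the confident model:", good_loss)
print("Loss for the unsure model:", poor_loss)
```

The confident, correct model gets the smaller loss, which is exactly the behaviour we want training to push towards.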

During training, an **optimizer** algorithm updates the model's parameters. It does this using methods called **gradient descent** and **back propagation**, which are too complicated to go into today. Essentially, the model uses calculus to work out how to adjust each parameter so that the loss gets smaller.
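Although the full details are out of scope, the core idea of gradient descent can be sketched with a single parameter. The `loss` and `gradient` functions below are made up for illustration (a simple curve with its lowest point at `w = 3`). A real neural network does the same kind of thing for many thousands of parameters at once, using back propagation to work out the gradients.

```python
def loss(w):
    # A made-up loss with its minimum at w = 3
    return (w - 3) ** 2

def gradient(w):
    # The slope of the loss at w (this is where the calculus comes in)
    return 2 * (w - 3)

w = 0.0               # starting guess for the parameter
learning_rate = 0.1   # how big a step to take each time

for step in range(50):
    # Move the parameter a small step "downhill", against the slope
    w = w - learning_rate * gradient(w)

print("Final value of w:", w)
print("Final loss:", loss(w))
```

Each step moves `w` a little closer to the value that gives the smallest loss.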

Often when training a model, we divide the dataset into smaller **batches** to process data efficiently.

Let's see how we do this in code.

In [None]:
from sklearn.neural_network import MLPClassifier

# This line of code initialises a neural network called MLPClassifier
# The max_iter parameter stands for the maximum number of iterations to go through.
# Here, this is the same as the number of epochs. As we have set it to 1, we will go through all the data once.
# scikit-learn will show a ConvergenceWarning, because one epoch is not enough to finish training.
classifier = MLPClassifier(max_iter=1)

# The fit function is the very clever bit of code that uses activation functions, gradient descent, an optimiser and back propagation to train the model.
classifier.fit(X_train, y_train)

# Very similar to before, we have a predict function. This takes the testing data and uses the model to make predictions
predictions = classifier.predict(X_test)

In [None]:
# Let's see what our predictions look like after one epoch
produce_scatter_plot(X_test,predictions,[0,1,2])

Hmm, it doesn't look very good, does it?

Let's train the model over multiple epochs and see what happens.

In [None]:
import numpy as np


# This line of code initialises the same neural network called MLPClassifier
# It is here that we can change certain aspects of the model. For example, I have set the activation function to something called "relu"
# (this is also scikit-learn's default, but writing it out shows where you would change it).
# The batch size is also set to 20, meaning 20 data points will be processed at a time.
# Note: partial_fit (used below) runs one epoch per call, so the loop below controls the number of epochs rather than max_iter.
classifier = MLPClassifier(random_state=1, max_iter=20, batch_size=20, activation='relu')

# I don't want my model to keep training forever if the loss is no longer decreasing.
# When the loss is no longer decreasing, more training is unlikely to improve the model much.
# I will set up some code so that if the loss is decreasing by less than the tolerance, the training will stop.
tolerance = 0.0001

# Loop through up to 500 epochs
for epoch in range(500):

    # For each epoch, train the model on all of the training data
    # partial_fit works like the fit function, but only trains for one epoch each time it is called
    classifier.partial_fit(X_train, y_train, classes = np.unique(y_train))

    # Get the list of loss values recorded so far (one value per epoch)
    loss = classifier.loss_curve_

    if epoch > 2: 
        # Calculate the change in loss over the last two epochs
        change_in_loss = (loss[-2] - loss[-1])
    
        if change_in_loss < tolerance:
            # Check how small the change in loss is. If it's too small, stop training.
            print("Change in loss is",'{:0.6f}'.format(change_in_loss), "which is less than the tolerance of", tolerance)
            print(f"Stopping at epoch {epoch}")
            break # this line breaks out of the loop if triggered
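
The loop above is written out by hand so that each step is visible. As a point of comparison, `MLPClassifier` also has its own stopping rule, controlled by the `tol` and `n_iter_no_change` parameters. The sketch below uses a small made-up dataset (`X_demo`, `y_demo`) so that it runs on its own; the exact number of epochs it reports will depend on the data and settings.

```python
from sklearn.datasets import make_blobs
from sklearn.neural_network import MLPClassifier

# A small made-up dataset, just so this example is self-contained
X_demo, y_demo = make_blobs(n_samples=300, centers=3, random_state=0)

# tol is the smallest improvement in loss that counts as progress.
# n_iter_no_change is how many epochs in a row may miss that target before training stops.
clf = MLPClassifier(random_state=1, max_iter=500, batch_size=20, tol=0.0001, n_iter_no_change=10)
clf.fit(X_demo, y_demo)

print("Epochs run:", clf.n_iter_)
print("Final loss:", clf.loss_curve_[-1])
```

The built-in rule is a little more patient than the hand-written loop: it waits for several epochs without enough improvement, rather than stopping the first time the change in loss is small.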

    



In [None]:
# Let's see how good our prediction is after multiple epochs.
predictions = classifier.predict(X_test)

In [None]:
produce_scatter_plot(X_test,predictions,[0,1,2])

Woah! It might not be perfect but it is a lot better than before. 

Let's have a look at our loss against epoch to see how it changed during training.

In [None]:
plt.plot(loss)
plt.xlabel("Epoch number")
plt.ylabel("Loss")

Remember, the loss is a calculation that measures the difference between predicted outputs and actual targets.

As the epochs continue, the loss decreases, meaning that the model is getting better!

![image](.//task.png)

Remember, although this is a neural network, it is still a classifier model! Can you use the plot_predictions and generate_confusion_matrix functions to analyse this model? Can you also generate a classification report?

In [None]:
plot_predictions(predictions,y_test)

In [None]:
generate_confusion_matrix(predictions,y_test,[1,2,3])

In [None]:
# Your code here
# Store the report in a variable called report, so we don't reuse the name of the classification_report function
report = metrics.classification_report(y_test, predictions)
print(report)

### Unsupervised vs Supervised Learning

**Supervised learning** is a machine learning approach where the algorithm learns from labeled training data. The training data consists of input features (also called independent variables) and their corresponding known output labels (also called dependent variables or target variables). The goal of supervised learning is to learn a mapping function that can predict the correct output label for new, unseen input data. In other words, the algorithm learns from examples where the desired outcome is already known.

**Unsupervised learning**, on the other hand, is a machine learning approach where the algorithm learns from unlabeled data. Unlike supervised learning, there are no predefined output labels or target variables in unsupervised learning. The algorithm's objective is to find patterns, structures, or relationships within the data without any prior knowledge of the outcomes. Unsupervised learning algorithms attempt to uncover hidden patterns, group similar data points together, or reduce the dimensionality of the data without being guided by explicit labels.

So far we have only used supervised learning, which is far more common. For the example above, many clustering algorithms exist which can group the data points into clusters without using labelled data.
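
As a small illustration of unsupervised learning, here is a sketch using `KMeans`, one of the clustering algorithms available in scikit-learn. The dataset is made up for this example (three well separated blobs with centres chosen by hand), and the algorithm is never shown the labels. We do still have to tell it how many clusters to look for.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Make three well separated blobs. We throw the labels away, as an unsupervised algorithm does not use them.
points, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [-5, 5]], cluster_std=0.5, random_state=0)

# Ask KMeans to find 3 groups using only the coordinates
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(points)

# Each point is given a cluster number (0, 1 or 2)
print(cluster_ids[:10])
```

Note that the cluster numbers are arbitrary: the algorithm can tell that the points fall into three groups, but it has no idea what those groups mean.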

## Large Language Models

Now these really are the talk of the town at the moment. The development of Large Language Models (LLMs) has exploded recently, leading to discussions of dystopian futures and robot overlords. Whether this happens is quite debatable, but one thing is for sure: LLMs are going to change the way we work and our way of life in the future.

ChatGPT is an LLM, and as I am sure you are all aware, it's very, very impressive!

Large Language Models are a type of Neural Network that are very good at processing language. 

One key aspect of LLMs is that they have a setting called **temperature**. This controls how much randomness is applied when the model chooses each word of its answer. Because of this, **LLMs can give multiple different answers to the same question.**
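
To give a feel for how temperature works, here is a minimal sketch. The candidate words and their scores are made up, and `next_word_probabilities` is a hypothetical helper written for this example, not part of any real LLM library. The idea it shows is a common one: the model's scores are divided by the temperature before being turned into probabilities, and the next word is then picked at random using those probabilities.

```python
import numpy as np

def next_word_probabilities(scores, temperature):
    # Divide the scores by the temperature, then use softmax to turn them into probabilities
    scaled = np.array(scores) / temperature
    scaled = scaled - scaled.max()   # subtracting the max keeps the numbers stable
    exp_scores = np.exp(scaled)
    return exp_scores / exp_scores.sum()

words = ["cat", "dog", "fish"]   # made-up candidate next words
scores = [2.0, 1.0, 0.5]         # made-up model scores for each word

low_temp = next_word_probabilities(scores, temperature=0.2)
high_temp = next_word_probabilities(scores, temperature=2.0)

print("Probabilities at low temperature: ", low_temp)
print("Probabilities at high temperature:", high_temp)

# Pick five words at each temperature
rng = np.random.default_rng(seed=0)
print("Low temperature picks: ", rng.choice(words, size=5, p=low_temp))
print("High temperature picks:", rng.choice(words, size=5, p=high_temp))
```

At a low temperature almost all of the probability goes to the top scoring word, so the answers are very consistent. At a high temperature the probabilities are more even, so the answers vary more.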