# Breast Cancer Recurrence Prediction using Machine Learning

## Dataset

Variables in this dataset are:

***Class***: whether or not there has been a recurrence of cancer<br>
***Age***: patient's age at the time of diagnosis<br>
***Menopause***: menopausal status of the patient at the time of diagnosis: pre-menopausal (*premeno*) or post-menopausal (*ge40*, *lt40*)<br>
***Tumour Size***: the size of the tumour (mm) at the time of diagnosis<br>
***Invasive Nodes***: the number of lymph nodes found to contain breast cancer at the time of the histological examination<br>
***Node Caps***: whether the tumour penetrated the lymph node capsule<br>
***Degree of Malignancy***: graded 1, 2 or 3, depending on the malignancy of the tumour<br>
***Breast***: the position of the tumour (left or right breast)<br>
***Breast Quadrant***: the quadrant of the breast where the tumour is present<br>
***Irradiation***: whether radiation therapy has been used as a treatment to destroy cancer cells<br>

The data is provided as two separate files:<br>
- ```breast-cancer.data```, containing the dataset 
- ```breast-cancer.names```, containing relevant information about the dataset

In [None]:
# read both files into strings; the with-blocks close the file handles afterwards
with open("./dataset/breast-cancer.data") as data_file:
    data = data_file.read()

with open("./dataset/breast-cancer.names") as names_file:
    feat = names_file.read()

In [None]:
# preview the first two rows of the dataset
print(data.split('\n',1)[0])
print(data.split('\n',2)[1])

In [None]:
# view information about the dataset
print(feat)

The informative file contains important information about the dataset under *7. Attribute Information*. It specifies that missing values are denoted by **'?'**.

In [None]:
# replace the missing-value marker '?' with an empty field, which pandas later reads as NaN
data = data.replace('?','')

<ins>*for Giovanni Notes*:</ins> I preferred to replace the missing-value marker ***?*** straight away, while the data is still a ***str***, so that it is read as ***NaN*** later on
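
As an alternative to editing the raw string, pandas can treat `?` as missing while parsing. The following is a minimal sketch, assuming two made-up rows in the same comma-separated layout as the dataset (they are illustrative, not real records):

```python
from io import StringIO

import pandas as pd

# Illustrative rows in the same layout as breast-cancer.data (not real records)
sample = (
    "no-recurrence-events,30-39,premeno,30-34,0-2,no,3,left,left_low,no\n"
    "recurrence-events,50-59,ge40,15-19,0-2,?,2,left,?,no\n"
)

# na_values="?" marks every '?' field as NaN during parsing;
# header=None keeps the first record as data instead of using it as column names
df = pd.read_csv(StringIO(sample), header=None, na_values="?")

# Count the missing values found in each column
missing_per_column = df.isna().sum()
print(missing_per_column)
```

Both approaches should end with the same `NaN` cells; this one simply avoids touching the text before it is parsed.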

## Exploratory Data Analysis

### Transforming to DataFrame

The data is stored as ```str```. It is necessary to convert it to ```DataFrame``` format

In [None]:
# import libraries
import pandas as pd
pd.set_option('display.max_colwidth', None) #setting max colwidth to view the entire dataset when using the print() command
from io import StringIO
import matplotlib.pyplot as plt
import numpy as np

In [None]:
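# convert data from str to dataframe
# note: read_csv treats the first line as the header by default; if the file has no
# header line, that first record is consumed as column names (header=None keeps it)
data = StringIO(data)
data = pd.read_csv(data, sep=",")
print(type(data)) #check the data variable is a pandas.core.frame.DataFrame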

Now that the data is correctly converted into a DataFrame table, I will rename the columns according to the attributes in the ```.names``` file

In [None]:
data.columns = ['class', 'age', 'menopause', 'tumour_size', 'inv_nodes', 'node_caps', 'deg_malig', 'breast', 'breast_quad', 'irrad']
data.columns

In [None]:
# preview the DataFrame table created
data.head()
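
If the `.data` file has no header line (an assumption worth checking against the first row printed earlier), the column names can be assigned while reading so that no record is used up as a header. A small sketch on two made-up records:

```python
from io import StringIO

import pandas as pd

columns = ['class', 'age', 'menopause', 'tumour_size', 'inv_nodes',
           'node_caps', 'deg_malig', 'breast', 'breast_quad', 'irrad']

# Two illustrative records without a header line (not real data)
raw = (
    "no-recurrence-events,30-39,premeno,30-34,0-2,no,3,left,left_low,no\n"
    "recurrence-events,50-59,ge40,15-19,0-2,no,2,left,left_up,no\n"
)

# Default behaviour: the first record is consumed as the header row
with_default_header = pd.read_csv(StringIO(raw))

# header=None keeps every record; names= assigns the column labels directly
with_names = pd.read_csv(StringIO(raw), header=None, names=columns)

print(len(with_default_header), len(with_names))
```

Comparing the two lengths shows whether a row is lost with the default settings.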

### Explore the Dataset

In [None]:
data.info()

All the variables in the dataset are of type ```object```, except for ```'deg_malig'```.

<ins>*for Giovanni Notes*:</ins> the data confirms that there are missing values: only *277 of 285* entries are non-null in the ```'node_caps'``` attribute and *284 of 285* in ```'breast_quad'```

In [None]:
data.describe()

<ins>*for Giovanni Notes*:</ins> is it useless to print ```.describe()``` in this case?

### Explore the Attributes

I want to have a more thorough look at the data inside each attribute, starting from ```'class'``` which contains information about recurrence of Breast Cancer.

In [None]:
class_ = data['class'].value_counts()
class_.plot.barh()

About 70% of the dataset includes patients that didn't experience a recurrence of the disease.

In [None]:
age_ = data['age'].value_counts()
age_.plot.barh()

Most patients in the dataset fall into the age group *40-59* which will probably result in a somewhat even value count of Pre-menopause (*premeno*) and Menopause (*lt40* and *ge40* are both values representing menopause).

In [None]:
menopause_ = data['menopause'].value_counts()
menopause_.plot.barh()

In [None]:
tumour_size_ = data['tumour_size'].value_counts()
tumour_size_.plot.barh()

```'tumour_size'``` is expressed in mm.<br>
In the dataset, most tumour sizes fall into the *20mm-34mm* group.

In [None]:
inv_nodes_ = data['inv_nodes'].value_counts()
inv_nodes_.plot.barh()

Most patients fall into the group with *0-2* invaded lymph nodes.

In [None]:
node_caps_ = data['node_caps'].value_counts()
node_caps_.plot.barh()

For most patients, the removed lymph nodes did not have a perforated capsule.

In [None]:
deg_malig_ = data['deg_malig'].value_counts()
deg_malig_.plot.barh()

The most common degree of malignancy for the patients in the dataset is *2*.

In [None]:
breast_ = data['breast'].value_counts()
breast_.plot.barh()

In [None]:
breast_quad_= data['breast_quad'].value_counts()
breast_quad_.plot.barh()

The tumours are split evenly between the *right* and *left* breast, while among the quadrants the *left_up* and *left_low* groups have the highest counts.

In [None]:
irrad_ = data['irrad'].value_counts()
irrad_.plot.barh()

Most of the patients in this dataset didn't undergo Radiation Therapy.<br>
<font size="3">_*radiation therapy: a cancer treatment that uses high doses of radiation to kill cancer cells and shrink tumours.*_</font>

**Missing Values**<br>
There were missing values on the dataset.

In [None]:
# identify where the missing values are in the dataset
data.isna().any()

Both ```'node_caps'``` and ```'breast_quad'``` columns have missing values.

In [None]:
# find % of missing values per single attribute in the dataset
nan = data.isna().sum()
tot = len(data)  # total number of rows, including those with missing values
perc = (nan*100)/tot

In [None]:
perc
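
The same percentage can be obtained in one step, because the mean of a boolean column is the fraction of `True` values. A minimal sketch on a toy frame (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy frame with gaps (illustrative values only)
toy = pd.DataFrame({
    'node_caps': ['no', 'yes', np.nan, 'no'],
    'breast_quad': ['left_low', np.nan, np.nan, 'central'],
    'irrad': ['no', 'no', 'yes', 'no'],
})

# isna() yields booleans; their column mean is the fraction of missing rows
missing_pct = toy.isna().mean() * 100
print(missing_pct.round(2))
```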

I will convert the ```object``` attributes to ```int64``` to have valid data to train the machine learning model and to analyse the correlation between attributes.<br>
Missing data is therefore replaced with a placeholder code that does not clash with any of the real categories.

In [None]:
# replace missing data 
data['node_caps'] = data['node_caps'].fillna(5)
data['breast_quad'] = data['breast_quad'].fillna(8)

I will check outliers on the only numerical column in the dataset ```'deg_malig'```.

In [None]:
# import the library
import seaborn as sns

In [None]:
sns.boxplot(x=data['deg_malig'])

Now I replace all DataFrame values with numerical codes to convert the column types from ```'object'``` to ```'int64'```.

In [None]:
data['class'] = data['class'].replace(['no-recurrence-events','recurrence-events'], [0,1])
data['age'] = data['age'].replace(['20-29', '30-39','40-49','50-59','60-69','70-79'],[0,1,2,3,4,5])
data['menopause'] = data['menopause'].replace(['premeno','ge40','lt40'],[0,1,2])
data['tumour_size'] = data['tumour_size'].replace(['0-4','5-9','10-14','15-19','20-24','25-29','30-34','35-39','40-44','45-49','50-54'],[0,1,2,3,4,5,6,7,8,9,10])
data['inv_nodes'] = data['inv_nodes'].replace(['0-2','3-5','6-8','9-11','12-14','15-17','24-26'],[0,1,2,3,4,5,6])
data['node_caps'] = data['node_caps'].replace(['no','yes'],[0,1])
data['breast'] = data['breast'].replace(['left','right'],[0,1])
data['breast_quad'] = data['breast_quad'].replace(['left_low','left_up','right_up','right_low','central'],[0,1,2,3,4])
data['irrad'] = data['irrad'].replace(['no','yes'],[0,1])
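
A possible variation on this encoding step is an explicit dictionary with `Series.map`, which makes the code for each label visible and exposes any label that was not anticipated. A sketch on a toy column, assuming the same age bands as above:

```python
import pandas as pd

# Toy column using the same age bands as the dataset (values are illustrative)
toy = pd.DataFrame({'age': ['30-39', '50-59', '40-49', '30-39']})

# Explicit, ordered mapping: the integer code follows the order of the bands
age_codes = {'20-29': 0, '30-39': 1, '40-49': 2, '50-59': 3, '60-69': 4, '70-79': 5}

encoded = toy['age'].map(age_codes)

# map() returns NaN for labels missing from the dictionary, so typos surface here
unmapped = toy.loc[encoded.isna(), 'age'].unique()
if len(unmapped) > 0:
    raise ValueError(f"unmapped age labels: {list(unmapped)}")

# Cast explicitly so the column type is int64 rather than object
toy['age'] = encoded.astype('int64')
print(toy['age'].tolist())
```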

A quick overview on the newly modified dataset using a histogram, respectively for ```'no-recurrence-events'``` and ```'recurrence-events'```.

In [None]:
data.groupby('class').hist(figsize=(9,9))

### Correlation Between Attributes and Identification of Target Attributes

**Reference Information about Breast Cancer**

According to the research paper *The incidence of Breast Cancer Recurrence 10-32 Years After Primary Diagnosis*, "[...] **Women with high lymph node burden, large tumor size, and estrogen receptor–positive tumors had increased risk of late recurrence**."<br>
<font size='2'>*(J Natl Cancer Inst. 2022 Mar; 114(3): 391–399. Published online 2021 Nov 8. doi: 10.1093/jnci/djab202 PMCID: PMC8902439 PMID: 34747484)*</font>

<br>

According to the medical paper *Understanding ER-positive breast cancer*, "[...] **Females with a longer lifetime exposure to estrogen and progesterone may have a higher risk of developing hormone receptor-positive breast cancer. This includes women who start menstruating early or reach menopause late**."<br>
<font size='2'>*(Medically reviewed by Faith Selchick, DNP, AOCNP, Nursing, Oncology — By Jenna Fletcher on May 22, 2022)*</font>

<br>

According to the medical paper *Hormone therapy for breast cancer*, "[...] **Hormone therapy following surgery, radiation or chemotherapy has been shown to reduce the risk of breast cancer recurrence in people with early-stage hormone-sensitive breast cancers. It can also effectively reduce the risk of metastatic breast cancer growth and progression in people with hormone-sensitive tumors**."<br>
<font size='2'>*(https://www.mayoclinic.org/tests-procedures/hormone-therapy-for-breast-cancer/about/pac-20384943)*</font>

<br>

According to the medical paper *Radiotherapy for breast cancer*, "[...] **People with a very low risk of the cancer coming back may only have part of the breast treated with radiotherapy. Or they may not have radiotherapy at all**."<br>
<font size='2'>*(https://www.cancerresearchuk.org/about-cancer/breast-cancer/treatment/radiotherapy/radiotherapy-treatment)*</font>

<br>


According to the medical paper *What Types of Breast Cancer Have the Highest Recurrence Rates?*, "[...] **Aggressive breast cancers are harder to treat, more likely to spread, and more likely to recur. The two types of breast cancer most likely to recur are inflammatory breast cancer (IBC) and triple-negative breast cancer (TNBC)**."<br>
<font size='2'>*(Medically reviewed by Faith Selchick, DNP, AOCNP, Nursing, Oncology — By S. Behring on December 19, 2022)*</font>

<br>

**Target Attribute Identification**

The target attribute is ```'class'``` as the ML model should predict if a patient is likely to experience a recurrence of Breast Cancer.

<br>

**Correlation Between Attributes**

In [None]:
# compute the correlation between attributes
data.corr()

There is an evident correlation between ```'age'``` and ```'menopause'```, as expected since menopausal status depends on age.<br>
```'node_caps'```, ```'inv_nodes'```, ```'deg_malig'``` and ```'irrad'``` are also correlated.<br>

I want to visualize the correlations using a Heatmap.

In [None]:
fig, ax = plt.subplots(figsize=(10,10)) 
sns.heatmap(data.corr(), cmap='BrBG', annot=True, linewidth=.5)

Before analyzing the correlations between the attributes, I want to visualize the correlation between the target attribute ```'class'``` and the other attributes of the dataset.

In [None]:
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(data.corr()[['class']].sort_values(
    by='class', ascending=False),annot=True,cmap='BrBG')

The attribute most correlated with the target attribute ```'class'``` (*recurrence or no recurrence event*) is the degree of malignancy ```'deg_malig'``` of the tumour.<br>

As stated earlier, *aggressive breast cancers are harder to treat, more likely to spread and more likely to recur*. I also notice that the number of invaded lymph nodes ```'inv_nodes'``` is highly correlated with ```'class'```, as is ```'node_caps'``` (*whether the tumour penetrated the lymph node capsule*); both precede ```'irrad'``` in the correlation heatmap. 

I want to dig deeper into the correlation between the target attribute and the *degree of malignancy* of the tumour.<br>
One hypothesis is that the more aggressive the cancer, the more likely the patient is to experience a recurrence.

In [None]:
from IPython.display import display

<ins>*for Giovanni Notes*:</ins> *'from IPython.display import display'* is only used in the *corr_analysis* function to visualize the table in the standard Jupyter Notebook format.<br>

<ins>*for Giovanni Notes*:</ins> I created a series of functions to make correlation analysis between attributes easier as I am performing the same tasks for each pair of attributes.

In [None]:
# create functions to analyse the relationship between two attributes: crosstab(), heatmap(), plot()
def corr_analysis(x, y):
    table = crosstab(x, y)
    display(table)
    heatmap(x, y)
    plot(table, x, y)


# create a crosstab normalised over all patients (each cell is a fraction of the whole dataset)
def crosstab(x, y):
    return pd.crosstab(
        data[x],
        data[y],
        margins=True,
        normalize=True,
    )


# visualize the raw counts of the crosstab using a heatmap
def heatmap(x, y):
    return sns.heatmap(pd.crosstab(data[x], data[y]), cmap="YlGnBu", annot=True)


# plot the crosstab as grouped bars
def plot(table, x, y):
    table.plot.bar(rot=0, width=0.4)
    plt.xlabel(str(x))
    # the crosstab holds fractions of all patients, not correlation coefficients
    plt.ylabel("Proportion of patients")
    plt.title('Correlation Plot Between ' + str(x) + ' and ' + str(y))

In [None]:
# run the function with 'class' and 'deg_malig'
corr_analysis('class', 'deg_malig')

At a quick look at the bar plot, recurrence appears to become more frequent as the aggressiveness of the tumour increases.<br>
I want to dig deeper into that and see whether **the share of patients who had a recurrence is higher among those with the highest degree of malignancy than among those with lower degrees**.

<ins>*for Giovanni Notes*:</ins> I created a function to make proportion calculations between attribute values easier, as I am performing the same task for each pair of attributes.

In [None]:
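# function to calculate the percentage of rows with data[y] == y_n among the rows with data[x] == x_n
def proportion(x,x_n,y,y_n):
    tot = data[data[x]==x_n]
    part = tot[tot[y]==y_n]
    part = part[x].value_counts()
    tot = tot[x].value_counts()
    result = round(((part / tot) *100), 2)
    # take the single value explicitly (NaN when no row matches y_n);
    # calling float() on a one-element Series is deprecated in recent pandas
    result = float(result.iloc[0])
    return result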

<ins>*for Giovanni Notes*:</ins> I am saving the proportion() results into bp1, bp2, ... to plot the results later
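
The quantity that `proportion()` computes is a conditional percentage, and a row-normalised crosstab returns all of them at once. A minimal sketch on made-up encoded data (not the real dataset):

```python
import pandas as pd

# Toy data: 1 = recurrence, 0 = no recurrence (illustrative values only)
toy = pd.DataFrame({
    'deg_malig': [1, 1, 1, 1, 3, 3, 3, 3],
    'class':     [0, 0, 0, 1, 1, 1, 1, 0],
})

# normalize='index' divides each row by its own total, giving the share of each
# class within every degree of malignancy
shares = pd.crosstab(toy['deg_malig'], toy['class'], normalize='index') * 100
print(shares.round(2))

# Share of degree-3 patients with a recurrence in this toy frame
high_grade_recurrence = shares.loc[3, 1]
```

Each row of `shares` sums to 100, which is a quick sanity check that the percentages are taken within a group rather than over the whole table.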

In [None]:
bp1 = proportion('deg_malig',3,'class',1)
print(bp1,"% of patients who got the tumour with the highest aggressiveness experienced a recurrence")

In [None]:
bp2 = proportion('deg_malig',1,'class',0)
print(bp2,"% of patients who got the tumour with the lowest aggressiveness didn't experience a recurrence")

In other words, **46.43%** of patients with the most aggressive tumours didn't experience a recurrence, while **16.9%** of patients with the least aggressive tumours did experience one.<br>
In the bar plot I notice a high proportion of degree 2 among the patients who didn't experience a recurrence.

In [None]:
bp3 = proportion('deg_malig',2,'class',0)
print(bp3,"% of patients who got the tumour with 2 degrees of malignancy didn't experience a recurrence")

In [None]:
plt.barh(['high aggressiveness, yes recurrence','lowest aggressiveness, no recurrence', 'medium aggressiveness, no recurrence'], [bp1,bp2,bp3], color='maroon')
plt.xlim(0, 100)
plt.show()

For the medium degree of malignancy (type 2), a total of **21.54%** of patients experienced a recurrence.<br>

From this quick analysis, the link between tumour aggressiveness and recurrence is clearest at low degrees of malignancy, where most patients didn't experience a recurrence. At the highest degree the patients split almost evenly, so aggressiveness isn't the only factor behind a recurrence.

Generally, the more aggressive (*malignant*) the tumour is, the more it will spread and attack lymph nodes. For the same reason, there will be a higher percentage of patients whose tumour penetrated the lymph node capsule. <br>

Usually, for a biopsy, about 10 to 40 nodes that contain cancer cells are removed for analysis. If a lymph node has a very low or zero cancer cell count, it is usually not removed and the patient has to undergo further therapies, such as Radiation Therapy.

<font size="3">_*biopsy: an examination of tissue removed from a living body to discover the presence, cause, or extent of a disease.*_</font>

I want to dig deeper into the correlation between ```'class'``` and both ```'inv_nodes'``` and ```'node_caps'```.

In [None]:
corr_analysis('class', 'inv_nodes')

In [None]:
corr_analysis('class', 'node_caps')

At a quick look at the two bar plots, the recurrence class seems to be related both to the number of invaded lymph nodes and to whether their capsule was pierced.<br>

I want to dig deeper into that and see whether **patients with a higher number of invaded lymph nodes more often had a pierced lymph node capsule**.

In [None]:
corr_analysis('inv_nodes', 'node_caps')

In [None]:
# run through a while loop to iterate proportion() for 7 times
count = 0  
lis = [] # list to append results and plot later

while count < 7:
    res = proportion('inv_nodes',count,'node_caps',1)
    print(res,"% of patients who got group",count,'lymph nodes and had lymph nodes with pierced capsule')
    lis.append(res)
    count +=1

<ins>*for Giovanni Notes*:</ins> inv_nodes values are divided into 7 different groups, so I iterated the function over each group
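
The loop over groups can also be written as a single `groupby`, which avoids hard-coding the number of groups. A sketch on toy encoded data (three groups instead of seven, values made up for illustration):

```python
import pandas as pd

# Toy encoded data (illustrative): inv_nodes group 0..2, node_caps 1 = capsule pierced
toy = pd.DataFrame({
    'inv_nodes': [0, 0, 0, 0, 1, 1, 2, 2],
    'node_caps': [0, 0, 0, 1, 0, 1, 1, 1],
})

# For each inv_nodes group, the mean of the boolean mask is the share of
# patients with a pierced capsule; multiplying by 100 gives a percentage
pierced_pct = (
    toy['node_caps'].eq(1)
    .groupby(toy['inv_nodes'])
    .mean()
    .mul(100)
    .round(2)
)
print(pierced_pct)
```

The result is a Series indexed by group, so it can be passed to `.plot()` directly.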

In [None]:
yax = np.array(lis)
plt.plot(yax, color = 'r')
plt.show()

The higher the number of invaded lymph nodes, the more likely their capsule is perforated.<br>
How do the number of invaded lymph nodes and the pierced capsules relate to a recurrence event?

I want to dig into that by computing, **for each group of invaded nodes, the proportion of patients with degree of malignancy 3**.

In [None]:
corr_analysis('inv_nodes', 'deg_malig')

In [None]:
# run through a while loop to iterate proportion() for 7 times
count = 0  
lis = [] # list to append results and plot later

while count < 7:
    res = proportion('inv_nodes',count,'deg_malig',3)
    print(res,"% of patients who got group",count,'lymph nodes and had degree of malignancy type 3')
    lis.append(res)
    count +=1

In [None]:
yax = np.array(lis)
plt.plot(yax, color = 'r')
plt.show()

The more invaded lymph nodes a patient has, the bigger the portion of patients with the highest degree of malignancy.

In [None]:
corr_analysis('deg_malig', 'node_caps')

Radiation as a therapy is not mandatory for every patient. It is used to reduce the patient's risk of breast cancer recurring after surgery. It is also commonly used to ease the symptoms caused by cancer that has spread to other parts of the body (*metastatic breast cancer*).<br>

I want to see the correlation between ```'class'``` and ```'irrad'```.

In [None]:
corr_analysis('class', 'irrad')

In [None]:
bp1 = proportion('irrad',1,'class',0)
print(bp1,"% of patients that got radiation therapy and didn't experience recurrence later on")

In [None]:
bp2 = proportion('irrad',0,'class',1)
print(bp2,"% of patients that didn't get radiation therapy and experienced recurrence later on")

In [None]:
plt.barh(['yes radiation, no recurrence','no radiation, yes recurrence'], [bp1,bp2], color='maroon')
plt.xlim(0, 100)
plt.show()

About 45.59% of the patients who got radiation therapy still experienced a recurrence later on, while about 75.12% of the patients who didn't get radiation therapy didn't experience a recurrence.<br>

Recurrence is therefore more frequent among irradiated patients ```'irrad' == 1``` than among non-irradiated ones (45.59% against 24.88%). This does not mean that radiation causes recurrence: as the references above note, patients at very low risk may not have radiotherapy at all, so the irradiated group probably contains the higher-risk cases.<br>
**According to this dataset, undergoing radiation therapy isn't enough to prevent recurrence.**
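
One way to probe whether such a difference reflects the treatment or the underlying risk of the patients is to compare recurrence rates within each degree of malignancy. The sketch below uses made-up encoded data, and the choice of ```'deg_malig'``` as the stratifying attribute is an assumption for illustration:

```python
import pandas as pd

# Toy encoded data (made up for illustration, not from the dataset)
toy = pd.DataFrame({
    'deg_malig': [1, 1, 1, 1, 3, 3, 3, 3],
    'irrad':     [0, 0, 0, 1, 0, 1, 1, 1],
    'class':     [0, 0, 0, 0, 1, 1, 1, 0],
})

# Overall recurrence rate (%) per treatment group
overall = toy.groupby('irrad')['class'].mean().mul(100)

# Recurrence rate (%) per treatment group within each degree of malignancy
stratified = toy.groupby(['deg_malig', 'irrad'])['class'].mean().mul(100).unstack()

print(overall.round(2))
print(stratified.round(2))
```

In this toy frame the irradiated group looks worse overall, yet within each degree of malignancy it does not, because irradiated patients are concentrated in the high-degree group.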

I want to visualize the correlation between ```'class'``` and ```'tumour_size'```.

In [None]:
corr_analysis('tumour_size', 'class')

In [None]:
# run through a while loop to iterate proportion() for 11 times
count = 0  
lis = [] # list to append results and plot later

while count < 11:
    res = proportion('tumour_size',count,'class',1)
    print(res,"% of patients who had group",count,'tumour size and experienced a recurrence event')
    lis.append(res)
    count +=1

In [None]:
# replace nan values with 0.0 and store the result in the new list f_lis
import math
f_lis = [0.0 if math.isnan(n) else n for n in lis]

In [None]:
yax = np.array(f_lis)
plt.plot(yax, color = 'r')
plt.show()

The larger the tumour, the higher the proportion of patients that experienced a recurrence.<br>
I want to compute the same proportion looking at the aggressiveness of the tumour and its size.

In [None]:
# run through a while loop to iterate proportion() for 11 times
count = 0  
lis = [] # list to append results and plot later

while count < 11:
    res = proportion('tumour_size',count,'deg_malig',1)
    print(res,"% of patients who had group",count,'tumour size and lowest degree of malignancy')
    lis.append(res)
    count +=1

In [None]:
yax = np.array(lis)
plt.plot(yax, color = 'r')
plt.show()

<ins>*for Giovanni Notes*:</ins> There is an unexpected peak at tumour size group 9. I don't feel it is relevant.

In [None]:
# run through a while loop to iterate proportion() for 11 times
count = 0  
lis = [] # list to append results and plot later

while count < 11:
    res = proportion('tumour_size',count,'deg_malig',3)
    print(res,"% of patients who had group",count,'tumour size and highest degree of malignancy')
    lis.append(res)
    count +=1

In [None]:
# replace nan values with 0.0 and store the result in the new list f_lis
import math
f_lis = [0.0 if math.isnan(n) else n for n in lis]

In [None]:
yax = np.array(f_lis)
plt.plot(yax, color = 'r')
plt.show()

Large tumours are more common among patients with high aggressiveness than among patients with low aggressiveness.<br>

From this exploratory analysis I can conclude that:
- high aggressiveness is linked with a higher probability of recurrence of the tumour.
- a high number of invaded lymph nodes comes with a high probability that their capsules are perforated by the tumour.
- the higher the aggressiveness, the more lymph nodes are invaded by the tumour.
- radiation therapy alone isn't enough to prevent recurrence: a large share of irradiated patients still experienced one.
- tumour size, aggressiveness and recurrence events are linked. The more aggressive the tumour, the larger it tends to be and the higher the probability of a recurrence.

<ins>*for Giovanni Notes*:</ins> I didn't investigate:
- menopause with aggressiveness and recurrence as I don't feel I have enough relevant data to evaluate if *longer exposure to estrogen results in higher rate of breast cancer appearing and recurrence*.
- left or right breast nor the quadrant.


> For further insights about this data and real-time updates check out **my Dash dashboard** deployed on **Amazon Web Services** using AWS Elastic Beanstalk:<br>
http://dashboardbreastcanceranalysis-env.eba-cpv243mm.eu-west-2.elasticbeanstalk.com/


<ins>*for Giovanni Notes*:</ins> my code and requirement ```.txt``` file are inside the repository respectively named ```application.py``` and ```requirements.txt```

## Feature Selection

The target variable is ```'class'```, as the output of the model should be whether or not the patient is subject to a recurrence of the disease.<br>

I want to proceed with feature selection to understand the most important features for the model. 

In [None]:
# Independent and dependent variables
y = data['class']
X = data.drop(['class'], axis = 1)

In [None]:
from sklearn.model_selection import train_test_split
# Split into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.33)    
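
Since roughly 70% of the patients belong to the no-recurrence class, a stratified split is one option for keeping that ratio in both subsets. A minimal sketch on a toy target (the 70/30 array is illustrative, not the real data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 70 negatives and 30 positives (illustrative)
y_toy = np.array([0] * 70 + [1] * 30)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the class ratio approximately equal in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42, stratify=y_toy
)

# Share of positives in the training and test subsets
print(round(y_tr.mean(), 2), round(y_te.mean(), 2))
```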

In [None]:
# visualize the correlation between train features with a heatmap
cor = X_train.corr()
plt.figure(figsize=(12,12))
sns.heatmap(cor, cmap="YlGnBu", annot=True)
plt.show()  

#### Feature Selection Using Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()
clf.fit(X_train,y_train)

plt.figure(figsize=(12,12))
plt.bar(X_train.columns, clf.feature_importances_)
plt.xticks(rotation=45)

The above bar chart shows the importance of each feature.<br>
In this case, ```'tumour_size'```, ```'age'```, ```'breast_quad'``` and ```'deg_malig'``` have the highest importance.
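
The `feature_importances_` attribute is impurity-based and computed on the training data, so a second opinion can be useful. One option is permutation importance from scikit-learn, sketched here on synthetic data where only the first feature matters (the data and model settings are assumptions for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic data (illustrative): only the first feature determines the label
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 5, size=(200, 3))
y_toy = (X_toy[:, 0] >= 3).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_toy, y_toy)

# Shuffle one column at a time and measure the resulting drop in accuracy
result = permutation_importance(model, X_toy, y_toy, n_repeats=5, random_state=0)
print(result.importances_mean.round(3))
```

In practice the importance would preferably be measured on the held-out test split rather than on the data used for fitting.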

#### Training different models to evaluate performance

In [None]:
# import the libraries
from sklearn.metrics import accuracy_score
from sklearn import metrics

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

<ins>*for Giovanni Notes*:</ins> I created a quick function to view the metrics of all the models I chose to evaluate.

In [None]:
# create function
def model_metrics(model):
    model.fit(X_train,y_train)
    y_pred = model.predict(X_test)
    # accuracy on the training data (optimistic) and on the held-out test data
    print(model,'train accuracy:',round(model.score(X_train,y_train),4))
    print(model,'test accuracy:',round(accuracy_score(y_test, y_pred),4))
    mae = metrics.mean_absolute_error(y_test, y_pred)
    mse = metrics.mean_squared_error(y_test, y_pred)
    print(model,'mean squared error:',round(mse,4))
    print(model,'mean absolute error:',round(mae,4),'\n')

In [None]:
# store the models to evaluate into a list (a classifier is used for k-nearest neighbours, as the target is a class)
models_list = [RandomForestClassifier(),LogisticRegression(),KNeighborsClassifier(n_neighbors=1),GaussianNB(),DecisionTreeClassifier()]

In [None]:
# iterate through the list
for model in models_list:
    model_metrics(model)
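
For a 0/1 target, the mean absolute error and the mean squared error both reduce to the error rate, so classification metrics such as the confusion matrix and recall add information they cannot give. A small sketch on made-up labels and predictions:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             mean_absolute_error, mean_squared_error, recall_score)

# Toy labels and predictions (illustrative): 1 = recurrence
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

acc = accuracy_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)

# For 0/1 labels every error has magnitude 1, so MAE and MSE both equal 1 - accuracy
print(acc, mae, mse)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Share of actual recurrences that the predictions detect
rec = recall_score(y_true, y_pred)
print(rec)
```

Here the accuracy looks acceptable while only one of the three recurrences is detected, which is the kind of gap that matters with an imbalanced target.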

##### Cross-validation 

In [None]:
from sklearn.model_selection import KFold, cross_val_score
import statistics

k_f = KFold(n_splits=10, shuffle=True, random_state=42)  # fixed seed so the folds are reproducible

In [None]:
def crossvalidation_score(model):
    model_score = cross_val_score(model, X, y, cv =k_f, scoring='accuracy')
    print('----',model,':','----')
    print(model_score,'\n')
    print('Mean:',model_score.mean())
    print('Standard Deviation:',statistics.stdev(model_score), '\n')

In [None]:
# iterate through the list
for model in models_list:
    crossvalidation_score(model)
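
With an imbalanced target, plain `KFold` can produce folds with different class ratios. `StratifiedKFold` is one alternative; the sketch below uses synthetic data with a 70/30 split, chosen only to resemble the ```'class'``` distribution:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data (illustrative, roughly 70/30 like the 'class' column)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([0] * 70 + [1] * 30)

# Each fold keeps the 70/30 class ratio; random_state makes the folds repeatable
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

scores = cross_val_score(LogisticRegression(), X_toy, y_toy, cv=skf, scoring='accuracy')
print(scores.mean(), scores.std(ddof=1))

# Number of positives in the test part of every fold
fold_positives = [int(y_toy[test_idx].sum()) for _, test_idx in skf.split(X_toy, y_toy)]
print(fold_positives)
```
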

There is more data I would integrate into the prediction model to improve its real-world applicability and prediction precision:
- Biometric Data
- Genetic Data