![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<h1 align='center'>The Importance of Ethics in AI</h1>

### What is a Jupyter notebook?

A Jupyter notebook is an online document that can include both text and code in different “cells” or parts of the document. It allows code to be shared alongside text that explains and contextualizes the data, which makes it a popular resource for both data science and instruction.

These documents run on Callysto Hub as well as Google Colab, IBM Watson Studio, and other places.

This presentation is a Jupyter notebook!

### Callysto notebooks ready for you to use!

On our website, [callysto.ca](https://www.callysto.ca/), you will find lesson plans, courses and learning modules that support you in incorporating coding into your lessons. 

Some examples focusing on statistics:



| |
|-|
|<img src="./images/samplenotebooks.png" width="600">|

## What is Data?

Data is a collection of information, usually obtained (or collected) to address a specific issue.

Examples of data:

- Sports statistics
- Population-level health data
- National census data

<center><img src="https://img2.pngio.com/download-free-png-19-data-graph-icon-packs-vector-icon-packs-data-graph-png-600_564.png" width="400"></center>

## Example: Comparing data on people's preferred season

#### Introduction to Python and simple datasets

Here we'll create a simple table and show how it can be visualized. You don't need to modify any of the code cells, but if you'd like to play around with the code please do!

To run a cell, do one of the following:

- Click the `Run` button up above
- Click within the cell and press `Shift+Enter`

In [None]:
# Create the data set: survey counts for each preferred season
total_participants = 30
prefer_spring = 5
prefer_summer = 10
prefer_fall = 10
prefer_winter = 5
# Anyone not counted above gave no answer
no_answer = total_participants - (prefer_spring + prefer_summer + prefer_fall + prefer_winter)

In [None]:
import pandas as pd
answer = {"Season": ["Spring", "Summer", "Fall", "Winter", "No answer"],
           "Count": [prefer_spring, prefer_summer, prefer_fall, prefer_winter, no_answer]}

answer_table = pd.DataFrame(answer)
answer_table

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.barplot(data=answer_table, x="Season", y="Count")
plt.title("Frequency of season preference")
plt.ylabel("Count")
plt.xlabel("Season")
plt.show()

## What is Data Science?

Data science involves **obtaining** and **communicating** information from (usually large) sets of observations. It brings together several fields to provide solutions in business, research, science, and more.


| |
|-|
| <img src="https://miro.medium.com/max/993/1*mgXvzNcwfpnBawI6XTkVRg.png" alt="Towards Data Science" width="500"/> |
<center> <a href="https://towardsdatascience.com/introduction-to-statistics-e9d72d818745">Towards Data Science</a></center>



## What is Machine Learning?


Machine Learning (ML) is **teaching computers to learn from examples, loosely the way humans do.**

ML is based on the idea that machines should be able to learn and adapt through experience (data), rather than being explicitly programmed.
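To make the contrast concrete, here is a minimal sketch with made-up numbers. The function names `rule_based` and `learned` and the example data are hypothetical and only illustrate the idea: one function follows a cutoff a person typed in, the other copies the label of the most similar past example.

```python
# Hypothetical labelled examples: (hours of practice per week, outcome)
examples = [(1, "not selected"), (2, "not selected"),
            (6, "selected"), (8, "selected")]

def rule_based(hours):
    # Explicit programming: a person chooses the cutoff
    return "selected" if hours >= 5 else "not selected"

def learned(hours, data=examples):
    # Learning from data: reuse the label of the closest past example
    nearest = min(data, key=lambda pair: abs(pair[0] - hours))
    return nearest[1]

print(rule_based(7), "/", learned(7))
print(rule_based(3), "/", learned(3))
```

If the examples change, `learned` changes its answers without anyone editing the code, which is the sense in which the machine "adapts through experience".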

## What is Artificial Intelligence?

Artificial Intelligence (AI) is a blanket term describing all efforts to make computers “think”.


AI refers to the broader idea of machines that can execute tasks "smartly."


## Motivating the role of ethics in AI

    
AI can be used to make decisions that affect people's lives or their access to resources, for example to:
1. Determine who gets into university.
1. Determine the sentence following a criminal conviction.
1. Use online metadata to determine and predict behaviour.

## What are the potential impacts on education and society, and how do we talk to students about all of this?

- What are the consequences of wrong assignments made by an ML-based system?
- How do we mitigate and minimize errors?
- How do we measure errors and limitations? 

Predictions are made based on the training data that is provided.

**Bias in training data increases the probability of bias in the predicted outcomes.**
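A rough sketch of how this can happen, using made-up scores. Everything here is hypothetical (the groups, the numbers, and the helpers `learn_threshold` and `accuracy`): a simple cutoff is learned from one group only and then applied to a second group whose scores sit lower on the same scale.

```python
def learn_threshold(data):
    """Learn a cutoff halfway between the averages of the two classes."""
    pos = [x for x, label in data if label]
    neg = [x for x, label in data if not label]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def accuracy(data, threshold):
    """Fraction of examples where 'score above cutoff' matches the true label."""
    correct = sum((x > threshold) == label for x, label in data)
    return correct / len(data)

# Made-up data: (strength score, excels at the sport?)
group_a = [(4, False), (5, False), (6, False), (8, True), (9, True), (10, True)]
group_b = [(2, False), (3, False), (4, False), (5, True), (6, True), (7, True)]

threshold = learn_threshold(group_a)  # trained on group A only
acc_a = accuracy(group_a, threshold)
acc_b = accuracy(group_b, threshold)
print(threshold, acc_a, acc_b)  # 7.0 1.0 0.5
```

The model looks perfect on the group it was trained on and misses every student who excels in the group it never saw.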


# Exercise: Using machine learning to predict sports enrolment

Data has been collected from 150 male students (all 18 years old) in three groups: 
1. 50 students excelled in team sports (football, basketball, hockey)

1. 50 students excelled in individual sports (swimming, cycling, snowboarding)

1. 50 students did not excel in any sports.

The students were scored by the same coach in the same school. The following parameters were collected:

1. Teamwork skills (coach score)
1. Speed (coach score)
1. Strength (coach score)
1. Height (coach score)

### Using this (hypothetical) training data, can we build a model that predicts which sports students will excel in?

#### Process

1. Get familiar with the data (table, summary statistics, plots)

1. Train the model

1. Evaluate model accuracy

1. Report findings

### Manage & Clean Data

In [None]:
# Load and visualize the data
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import datasets

# Machine learning
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Evaluate model
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

# plot_confusion_matrix was removed in scikit-learn 1.2, so import it only where it still exists
try:
    from sklearn.metrics import plot_confusion_matrix
except ImportError:
    pass

In [None]:
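# Load dataset: the iris measurements stand in for the hypothetical student scores
iris = datasets.load_iris()
X = iris.data
y = iris.target
df = pd.DataFrame(X, columns = iris.feature_names)
df['y'] = y

# Relabel the three classes as sport categories (assign the result instead of modifying in place)
df['y'] = df['y'].map({0: "Individual sport",
                       1: "No sport",
                       2: "Team sport"})

# Rename the measurement columns to the coach scores
df.rename(columns={"sepal length (cm)":"Strength",
                   "sepal width (cm)": "Speed",
                   "petal length (cm)": "Teamwork",
                   "petal width (cm)": "Height",
                   "y":"SelectedSport"}, 
                   inplace= True)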

In [None]:
df.sample(10)

In [None]:
df.info()

In [None]:
df['SelectedSport'].unique()

<h2 align='center'>Exploratory analysis</h2>

<h3 align='center'>Getting summary stats for all students</h3>

In [None]:
df.describe()

<h3 align='center'> Getting summary stats for specific activities</h3>

In [None]:
individual = df[df['SelectedSport']=='Individual sport']
no_sport = df[df['SelectedSport']=='No sport']
team_sport = df[df['SelectedSport']=='Team sport']


team_sport.describe()

<h3 align='center'>Generating visualization from summary stats</h3>

In [None]:
import ipywidgets as widgets
from IPython.display import display
dropdowna = widgets.Dropdown(
    options=['Strength', 'Speed', 'Teamwork','Height'],
    value='Strength',
    description='Item:',
    disabled=False,
)

In [None]:
# box and whisker plots
display(dropdowna)
print("Scores for:",dropdowna.value)
sns.catplot(x="SelectedSport", y=dropdowna.value, data=df,hue='SelectedSport',kind='box')

<h3 align='center'> Generating distribution visualization </h3>

In [None]:
dropdownb = widgets.Dropdown(
    options=['Strength', 'Speed', 'Teamwork','Height'],
    value='Strength',
    description='Item:',
    disabled=False,
)

In [None]:
display(dropdownb)
print("Histogram for various measurements (per class): ",dropdownb.value)
sns.displot(df, x=dropdownb.value, col="SelectedSport", multiple="dodge");

<h2 align='center'>Insights</h2>

| Activity | Teamwork | Speed | Strength | Height |
| - | - | - | - | - |
| Individual sport | Lowest | Highest | Lowest | Lowest |
| No sport | Medium | Lowest | Medium | Medium |
| Team sport | Highest | Medium | Highest | Highest |


<h2 align='center'>Machine learning technique: variables</h2>

Independent variables (or variables we use to predict):

1. Teamwork
2. Speed
3. Strength
4. Height

Dependent variable (or the variable we want to predict):

Type of activity (the `SelectedSport` column)

<h2 align='center'> Machine learning technique </h2>

We split the data set into training and testing data.

A random 80% of the data points forms the training set, which the algorithm uses to "learn". 

The remaining 20% forms the testing set, which we give to the algorithm after it trains to see how well the model does. 
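The idea behind the split can be sketched in plain Python. This is only an illustration, not how scikit-learn implements `train_test_split`; the helper name `split_train_test` and the stand-in rows are assumptions.

```python
import random

def split_train_test(rows, test_fraction=0.2, seed=1):
    """Shuffle a copy of the rows, then hold out a fraction for testing."""
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(150))            # stand-ins for the 150 student records
train, test = split_train_test(rows)
print(len(train), len(test))       # 120 30
```

Every record lands in exactly one of the two sets, so the model is never tested on a student it has already seen.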


In [None]:
array = df.values
# All measurements
X = array[:,0:4]
# All classes
y = array[:,4]
# Split-out validation dataset
X_train, X_validation, Y_train, Y_validation = train_test_split(X, y, 
                                                                test_size=0.20, 
                                                                random_state=1, 
                                                                shuffle=True)

We will use a Support Vector Machine (SVM), a classification algorithm that, with its default kernel, can model non-linear relationships.

In [None]:
# Train the model, then make predictions on the validation dataset
model = SVC(gamma='auto')
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)

In [None]:
print(accuracy_score(Y_validation, predictions))

We can see that the accuracy is 0.967, or about 97%, on the held-out dataset.
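The accuracy score is simply the share of correct predictions. As a quick check, assuming the 30 held-out students (20% of 150) and the single misclassification described in the Reporting section:

```python
n_validation = 30   # 20% of the 150 students
n_wrong = 1         # assumed from the single misclassified student reported in this notebook
accuracy = (n_validation - n_wrong) / n_validation
print(round(accuracy, 3))  # 0.967
```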

In [None]:
print(classification_report(Y_validation, predictions))

These metrics are calculated for each class from the counts of true and false positives and true and false negatives.



Precision is the ability of a classifier not to label an instance positive that is actually negative. For each class, it is the ratio of true positives to the sum of true positives and false positives.

Recall is the ability of a classifier to find all positive instances. For each class, it is the ratio of true positives to the sum of true positives and false negatives.

F1-score is the harmonic mean of precision and recall, where each of the two measurements is given equal weight. 1.0 is the best score, 0.0 is the worst score. 

The support is the number of samples of the true response that lie in that class.
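The definitions above can be worked through by hand. The sketch below uses a hypothetical helper, `class_metrics`, and assumes the counts implied elsewhere in this notebook for the "Team sport" class: 7 students predicted as team sport, of whom 6 were correct and 1 was actually "No sport".

```python
def class_metrics(tp, fp, fn):
    """Precision, recall, F1 and support for one class from its counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    support = tp + fn  # number of true members of the class
    return precision, recall, f1, support

# Assumed counts for "Team sport": 6 correct, 1 false positive, 0 missed
p, r, f1, support = class_metrics(tp=6, fp=1, fn=0)
print(round(p, 3), r, round(f1, 3), support)  # 0.857 1.0 0.923 6
```

Recall is perfect because no team-sport student was missed, while precision drops because one student was wrongly added to the class.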

In [None]:
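# The model is already trained, so reuse it and label the axes with its own class names.
# ConfusionMatrixDisplay.from_estimator needs scikit-learn 1.0 or newer.
from sklearn.metrics import ConfusionMatrixDisplay

print("Confusion matrix")
fig, ax = plt.subplots(figsize=(8, 8))
ConfusionMatrixDisplay.from_estimator(model, X_validation, Y_validation,
                                      display_labels=model.classes_,
                                      cmap=plt.cm.Blues, normalize=None, ax=ax);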

The diagonal elements represent the number of points for which the predicted label is equal to the true label, while off-diagonal elements are those that are mislabeled by the classifier. 
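To see where those numbers come from, a confusion matrix can be tallied by hand. This sketch uses a tiny made-up list of labels and a hypothetical helper, `confusion_counts`; it is not the scikit-learn implementation.

```python
from collections import Counter

def confusion_counts(true_labels, predicted_labels):
    """Count each (true label, predicted label) pair."""
    return Counter(zip(true_labels, predicted_labels))

# Tiny made-up example with one mistake
true_labels = ["No sport", "No sport", "Team sport", "Individual sport"]
predicted = ["No sport", "Team sport", "Team sport", "Individual sport"]

counts = confusion_counts(true_labels, predicted)
classes = sorted(set(true_labels) | set(predicted))

# Rows are true classes, columns are predicted classes
for t in classes:
    print(t, [counts[(t, p)] for p in classes])

# Diagonal entries (true == predicted) are the correct predictions
correct = sum(counts[(c, c)] for c in classes)
print("Correct:", correct, "of", len(true_labels))
```

The single off-diagonal count sits in the "No sport" row and "Team sport" column, meaning a student with no sport was labeled as team sport.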

<h2 align='center'>Final Analysis</h2>

<h3 align='center'>Let's take a look at the predicted values</h3>

In [None]:
pred_df = pd.DataFrame(X_validation,columns=df.columns[0:4])

pred_df['Predicted Class'] = predictions

In [None]:
dropdownc = widgets.Dropdown(
    options=['Individual sport', 'No sport', 'Team sport'],
    value='Individual sport',
    description='Class:',
    disabled=False,
)

In [None]:
# Predicted counts: 11 individual sport, 12 no sport, 7 team sport
display(dropdownc)
pred_df[pred_df['Predicted Class']==dropdownc.value]

### Which one did it get wrong? 


In [None]:
import numpy as np
# Find the positions in the validation set where the prediction differs from the true label
y_test = np.asarray(Y_validation)
misclassified = np.where(y_test != model.predict(X_validation))

print(misclassified)

<h2 align='center'>Reporting</h2>

The algorithm classified one sample as team sport when it was actually no sport.

This entry is the row with index 22 in the validation set.

The algorithm did not train on data for female students, students with disabilities, or students of different ages. Using the algorithm on new data about these students is likely to result in high levels of misclassification. Consequences include unfair exclusion from, or categorization in, activities. 
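One simple safeguard, sketched here under stated assumptions, is to flag new records whose scores fall outside anything the model saw during training. The helper names and the numbers are hypothetical, and a range check like this catches only one kind of mismatch: it cannot tell whether a group was represented fairly in the training data.

```python
def feature_ranges(rows):
    """Minimum and maximum of each feature column in the training rows."""
    columns = list(zip(*rows))
    return [(min(c), max(c)) for c in columns]

def outside_training_range(row, ranges):
    """Indices of features whose value lies outside the training range."""
    return [i for i, (value, (lo, hi)) in enumerate(zip(row, ranges))
            if not lo <= value <= hi]

# Made-up (strength, speed) scores used for training
training_rows = [(5.1, 3.5), (6.0, 2.9), (6.9, 3.1)]
ranges = feature_ranges(training_rows)

print(outside_training_range((6.2, 3.0), ranges))  # []
print(outside_training_range((8.4, 3.0), ranges))  # [0]
```

A flagged record is a prompt for a human to look more closely, not a verdict on the student.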

## Discussion: Identifying bias

1. What issues can you identify in this problem statement?

1. What biases in the data can you identify?

1. What are the consequences of those biases when the algorithm is in action?

<h2 align='center'>Real examples</h2>

Amazon ditches AI recruiting tool that didn’t like women (Reuters) [link](https://www.reuters.com/article/us-amazon-com-jobs-automation-insight-idUSKCN1MK08G)


Can Racist Algorithms Be Fixed? (The Marshall Project) [link](https://www.themarshallproject.org/2019/07/01/can-racist-algorithms-be-fixed)


Black and Asian faces misidentified more often by facial recognition software (CBC) [link](https://www.cbc.ca/news/technology/facial-recognition-race-1.5403899)


UK ditches exam results generated by biased algorithm after student protests (The Verge) [link](https://www.theverge.com/2020/8/17/21372045/uk-a-level-results-algorithm-biased-coronavirus-covid-19-pandemic-university-applications)

<h2 align='center'> What can we do? </h2>

- Work towards addressing our own biases in the classroom and daily life

- Identify how our biases play a role in our decision making

- Identify how our biases affect the machines we program 

- Collaborate with people offering diverse points of view

![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

<h2 align='center'>Getting Started with Callysto</h2>

- Feedback form https://tinyurl.com/y2a3uhdt
- Online self-paced courses (courses.callysto.ca)  
- Preview our learning modules https://callysto.github.io/curriculum-jbook/intro.html
- Contact us for “in-class” workshops, teacher PD, virtual hackathons, and more

Email: contact@callysto.ca

On Twitter: @callysto_canada

Site: https://www.callysto.ca

YouTube https://www.youtube.com/channel/UCPdq1SYKA42EZBvUlNQUAng 

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)