<a href="https://colab.research.google.com/github/dodobir/blooket_hack_improved_Glixzzy/blob/master/Copy_of_Student_ExaminingProblemAndDataset_Section1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<font color="#de3023"><h1><b>REMINDER: MAKE A COPY OF THIS NOTEBOOK. DO NOT EDIT THE ORIGINAL.</b></h1></font>

# Machine Learning for Good
Welcome to your project! Over the next 3 notebooks, we will use the tools you learned in past sessions to apply machine learning to one of several important problems! Select the project you are interested in and read on.

In [None]:
# @title # **Run this code cell to set up the notebook!**
# @markdown The data may take some time to load in, so feel free to move on to the next part in the meantime.

project = "histology" # @param ["Choose your dataset!", "bees", "histology", "beans", "malaria"]

import requests
from IPython.display import Markdown, display

import tensorflow_datasets as tfds
from tensorflow.image import resize_with_pad, ResizeMethod

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from PIL import Image
from skimage import data, color
from skimage.transform import rescale, resize, downscale_local_mean

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def ProjectDescription(project):
  """Show links to the project's background and dataset docs, then a preview image."""
  display_str =  f"**[{project.capitalize()} Project Background Document]({article_url_dict[project]})** <br />"
  display_str += f"**[{project.capitalize()} Dataset Documentation]({dataset_documentation_url_dict[project]})** <br />"
  display(Markdown(display_str))
  response = requests.get(image_url_dict[project], stream=True)
  img = Image.open(response.raw)
  plt.imshow(img)
  plt.axis('off')
  plt.show()

# URL dictionaries for the projects
article_url_dict = {
    "beans"     : "https://docs.google.com/document/d/19AcNUO-9F4E9Jtc4bvFslGhyuM5pLxjCqKYV3rUaiCc/edit?usp=sharing",
    "malaria"   : "https://docs.google.com/document/d/1u_iX2oDrEZ1clhFefpP3V8uwAjf7EUV4G6kq_3JDcVY/edit?usp=sharing",
    "histology" : "https://docs.google.com/document/d/162WhUE9KqCgq_I7-VvENZD2n1IVsbeXVRSwfJEkxAqQ/edit?usp=sharing",
    "bees"      : "https://docs.google.com/document/d/1PUB_JuYHi6zyHsWAhkIb7D7ExeB1EfI09arc6Ad1bUY/edit?usp=sharing"
}

image_url_dict = {
    "beans"     : "https://storage.googleapis.com/tfds-data/visualization/fig/beans-0.1.0.png",
    "malaria"   : "https://storage.googleapis.com/tfds-data/visualization/fig/malaria-1.0.0.png",
    "histology" : "https://storage.googleapis.com/tfds-data/visualization/fig/colorectal_histology-2.0.0.png",
    "bees"      : "https://storage.googleapis.com/tfds-data/visualization/fig/bee_dataset-bee_dataset_300-1.0.0.png"
}

download_url_prefix_dict = {
    "histology" : "https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/AI%20Scholars/Sessions%206%20-%2010%20(Projects)/Project%20-%20Towards%20Precision%20Medicine/",
    "bees"      : "https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/AI%20Scholars/Sessions%206%20-%2010%20(Projects)/Project%20-%20Safeguarding%20Bee%20Health/"
}

dataset_documentation_url_dict = {
    "beans"     : "https://www.tensorflow.org/datasets/catalog/beans",
    "malaria"   : "https://www.tensorflow.org/datasets/catalog/malaria",
    "bees"      : "https://www.tensorflow.org/datasets/catalog/bee_dataset",
    "histology" : "https://www.tensorflow.org/datasets/catalog/colorectal_histology",
}

# Load dataset
if project == "Choose your dataset!":
  print("Please choose your dataset from the dropdown menu!")

elif project == "beans":
  data,  info = tfds.load('beans', split='train[:1024]', as_supervised=True, with_info=True)
  feature_dict = info.features['label'].names
  images = np.array([resize_with_pad(image, 128, 128, antialias=True) for image,_ in data]).astype(int)
  labels = [feature_dict[int(label)] for image,label in data]

elif project == "malaria":
  data,  info = tfds.load('malaria', split='train[:1024]', as_supervised=True, with_info=True)
  images = np.array([resize_with_pad(image, 256, 256, antialias=True) for image,_ in data]).astype(np.uint8)
  labels = ['malaria' if label==1 else 'healthy' for image,label in data]

else:
  wget_command = f'wget -q --show-progress "{download_url_prefix_dict[project]}'
  !{wget_command + 'images.npy" '}
  !{wget_command + 'labels.npy" '}

  images = np.load('images.npy')
  labels = np.load('labels.npy')

  !rm images.npy labels.npy


# Original preprocessing code for datasets

# if project == "histology":
#   data,  info = tfds.load('colorectal_histology', split='train[:1024]', as_supervised=True, with_info=True)
#   feature_dict = info.features['label'].names
#   images = np.array([image for image,label in data]).astype(int)
#   labels = [feature_dict[int(label)] for image,label in data]

# if project == "bees":
#   data,  info = tfds.load('bee_dataset', split='train[:3200]', as_supervised=True, with_info=True)
#   data = [(image, label) for image,label in data if label['wasps_output']==0]
#   data1 = [(image, label) for image,label in data if label['varroa_output']==0][:500]
#   data2 = [(image, label) for image,label in data if label['varroa_output']==1][:500]
#   data = data1 + data2
#   images = np.array([image for image, _ in data]).astype(int)
#   labels = ['diseased' if label['varroa_output'] else 'healthy' for image,label in data]

# PART I: Project Background

## 🛗 Elevator Pitch

<center><img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTgOQKSQqiJYeXLoqgpqxtQj48XnYKongBZww&usqp=CAU" width=200></center>

An [***elevator pitch***](https://zapier.com/blog/elevator-pitch-example/) is a brief way of introducing yourself, getting across a key point or two, and making a connection with someone. Think 30 seconds, the amount of time you would spend talking to someone on an elevator! Elevator pitches have become popular in the business & technology world as a helpful exercise in distilling the big picture and as an efficient way to quickly make connections.



### 🗣 Exercise 1A: Crafting Your Elevator Pitch




Run the cell below to start taking a look at background documents for this project and a quick sneak peek of the dataset, which should be helpful in developing your pitch!

If you want more information beyond the dataset documentation on how your dataset was created, take a look at the end of the hidden code cell at the top of the notebook!

In [None]:
ProjectDescription(project)

Take 5-15 minutes to look through these resources and come up with a 3-5 sentence elevator pitch for your project!

Think about:
- The why: Why does your project matter? What unmet need is it going to solve?
- The what: What are some potential solutions you could develop for the proposed problem in your project?
- The how (optional): How would you start to think about implementing these solutions?

Write down your elevator pitch and deliver it to a partner or the group! This will be a great way for you to practice for the presentation later in the program.



#### **Elevator Pitch:**

DRAFT YOUR ELEVATOR PITCH HERE (Double click into this text cell to edit!)

## 👜 Stakeholders

<center><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/5/56/Stakeholder_%28en%29.svg/1200px-Stakeholder_%28en%29.svg.png" width=400></center>

A **stakeholder** is a person or organization who might have a personal, financial, legal, or ethical interest in a business or product. Stakeholders might be
- Employees or owners of a company or product
- People or companies involved in the upstream supplying or building of a product
- Customers or other companies using the product
- Government or other parties regulating the product








### 🗣 Exercise 1B: Stakeholders
Name three stakeholders for your project.  


In [None]:
# @title
stakeholder_1 = "" # @param {type:"string"}
stakeholder_2 = "" # @param {type:"string"}
stakeholder_3 = "" # @param {type:"string"}



## ⚖️ Ethics

It's important to start thinking early about the ethical concerns and implications involving machine learning in your project. Before we dive deep into your data, let's think a little about the ethics surrounding your project.

You might want to think about ethical issues involving:
- Bias
- Privacy
- Autonomy
- Security
- Accountability
- Transparency
- Interpretability




### 🗣 Exercise 1C: Ethical Concerns

What might be some ethical concerns stakeholders might have about your project?

In [None]:
# @title
_1_ = "" # @param {type:"string"}
_2_ = "" # @param {type:"string"}
_3_ = "" # @param {type:"string"}


### 🗣 Exercise 1D: Solutions to These Ethical Concerns

How might you go about mitigating each of these concerns?



In [None]:
# @title
_1_ = "" # @param {type:"string"}
_2_ = "" # @param {type:"string"}
_3_ = "" # @param {type:"string"}

### 🗣 Exercise 1E: Describing the Problem

Remember that ***supervised*** learning is a type of machine learning where a model is trained on labeled data, with the aim of predicting or classifying new data based on the patterns it learns between the input features and labels. In this project we are going to build a supervised machine learning model that is relevant to your chosen problem.

Identify what the features and labels would be in your supervised machine learning model. What would be some ways of obtaining this dataset?
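To make the idea of features and labels concrete, here is a minimal sketch on made-up data. The names `toy_images` and `toy_labels`, the image size, and the label names are all hypothetical stand-ins, not the project's real dataset. It shows one common way to treat an image dataset: every pixel value becomes a feature, and each image's category is its label.

```python
import numpy as np

# Hypothetical stand-in data: 6 tiny RGB "images" (4x4 pixels) with made-up labels.
rng = np.random.default_rng(0)
toy_images = rng.integers(0, 256, size=(6, 4, 4, 3), dtype=np.uint8)
toy_labels = np.array(["healthy", "diseased", "healthy", "diseased", "healthy", "diseased"])

# Features (X): one row per image, one column per pixel value.
X = toy_images.reshape(len(toy_images), -1)

# Labels (y): the category we want the model to predict for each row of X.
y = toy_labels

print("X shape:", X.shape)
print("y shape:", y.shape)
```

Flattening is only one possible choice of features; summary values such as average color could also serve as features.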

In [None]:
# @title
features_X = "" # @param {type:"string"}
labels_y = "" # @param {type:"string"}
how_to_get = "" # @param {type:"string"}


# PART II: The Dataset
<center><img src="https://miro.medium.com/v2/resize:fit:1200/1*l85zQU68tFsKsan25bJtMg.jpeg" height=300></center>

## 💠 Exploring the Data



It's important to understand how many samples, features, and labels are in our dataset so that we can choose what types of models are appropriate, and how we might want to preprocess our data.

We have uploaded a set of images for you, inside the variable ***`images`*** and the corresponding set of labels inside the variable ***`labels`***. Run the code cell below to get a sense of what the data looks like.


In [None]:
print(images[0]) # Print out the first image
print(labels) # Print out the list of labels

### 💻 Exercise 2A: Examining the Dataset


The data you printed above might be a little hard to make sense of! Let's see how else we can investigate the dataset and learn from it:

First, let's take a look at the overall shape of the dataset using the [`.shape`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.shape.html) attribute. Take a look at the examples on the documentation link if you're unfamiliar with this!

Once you're able to print this out, discuss what each dimension might signify. Do you think this will be enough data for your models to learn the solution to the problem? If your model is unable to effectively learn the solution, what could you do with respect to the data to improve the results?
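As a small illustration of how `.shape` reads, here is a sketch on a made-up array. The array `toy_images` and its sizes are assumptions chosen for the example, and the dimension order (count, height, width, color channels) is the assumed layout of this toy array only.

```python
import numpy as np

# Hypothetical array: 10 RGB images, each 64 pixels tall and 32 pixels wide.
toy_images = np.zeros((10, 64, 32, 3), dtype=np.uint8)

print(toy_images.shape)     # shape of the whole array
print(toy_images[0].shape)  # shape of a single image
print(len(toy_images))      # size of the first dimension (number of images)
```

Note that `.shape` is an attribute rather than a function, so it is written without parentheses.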

In [None]:
dimensions = None ### YOUR CODE HERE

print('Dimensions of dataset:', dimensions)

Let's now take a look at our labels! How many of each label are there in our dataset? Why might this be important to check?

The Seaborn library allows us to easily plot histograms from a list of data, using [`sns.histplot()`](https://seaborn.pydata.org/generated/seaborn.histplot.html). Try taking a look at the documentation to see how you might use this for your `labels` data. If you have time during or after class, feel free to explore the examples there and play around with the way the plot looks!
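A plot is not the only way to check label balance. As a complementary sketch, the counts can also be computed directly with NumPy. The list `toy_labels` below is made up for illustration and does not reflect the real dataset.

```python
import numpy as np

# Hypothetical labels, not the real dataset.
toy_labels = ["healthy", "diseased", "healthy", "healthy", "diseased", "healthy"]

# np.unique returns the sorted distinct values and how often each one occurs.
names, counts = np.unique(toy_labels, return_counts=True)

for name, count in zip(names, counts):
    print(f"{name}: {count} ({count / len(toy_labels):.0%})")
```

If one label is much more common than the others, a model can look accurate simply by always guessing that label, which is one reason this check matters.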

In [None]:
### YOUR CODE HERE

### END CODE HERE

plt.xlabel('Labels')
plt.ylabel('Frequency')
plt.title('Histogram of Labels')
plt.show()

## 🔎 Looking at Some Examples

Run the following code to visualize an image and its label from the training dataset.

In [None]:
i = 100 # We are going to look at the i'th image in the dataset.
plt.imshow(images[i]) # Use matplotlib imshow() function to visualize an image.
plt.title(labels[i]) # Set the label of the image as the title
plt.show() # Show your plot.

### 💻 Exercise 2B: Visualizing our Dataset


Iterate through your dataset and use the visualization library Matplotlib (using the `plt` alias) to visualize the first 5 images along with their labels. You can use the code above as a hint!

**BONUS**:  Visualize exactly one image from each label/category! What unique characteristics do you notice in each category? Do you think it’ll be difficult for your model to tell these categories apart?
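For the bonus, one possible starting point is to find the position of the first image in each category. The sketch below does this on a made-up label array (`toy_labels` is hypothetical); it only selects indices and leaves the plotting to you.

```python
import numpy as np

# Hypothetical labels, not the real dataset.
toy_labels = np.array(["healthy", "diseased", "healthy", "diseased", "other"])

# For each category, np.argmax on the boolean mask gives the first index where it is True.
first_index = {
    str(name): int(np.argmax(toy_labels == name))
    for name in np.unique(toy_labels)
}

print(first_index)
```

Each index found this way can then be used to look up an image and its label, in the same way `i` is used in the example cell above.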


In [None]:
### YOUR CODE HERE

# PART III: Brainstorming Solutions to the Problem

## 🧪 Testing a Few Simple Hypotheses


### 🗣 Exercise 3A: Brainstorming Classification Rules




Look through a few of your images and brainstorm some simple rules for how you might classify an image. (You can do some domain research on the internet as well!)
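A rule can also be written down as a small function so it can be checked against data. The sketch below is purely illustrative: the toy images, the `"dark"`/`"bright"` labels, the function `brightness_rule`, and the threshold of 127 are all assumptions, not rules suggested for any of the real datasets.

```python
import numpy as np

# Hypothetical stand-in data: five flat-gray 4x4 RGB images with made-up labels.
gray_levels = [10, 50, 200, 240, 140]
toy_images = np.stack([np.full((4, 4, 3), g, dtype=np.uint8) for g in gray_levels])
toy_labels = np.array(["dark", "dark", "bright", "bright", "dark"])

def brightness_rule(image, threshold=127):
    """Hypothetical rule: call an image 'bright' if its mean pixel value exceeds the threshold."""
    return "bright" if image.mean() > threshold else "dark"

# Apply the rule to every image and compare against the true labels.
predictions = np.array([brightness_rule(img) for img in toy_images])
accuracy = float(np.mean(predictions == toy_labels))

print(predictions)
print("Fraction of toy images the rule gets right:", accuracy)
```

Writing a rule this way makes it easy to see how often it agrees with the labels, and where it does not.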



In [None]:
# @title
rule_1 = "" # @param {type:"string"}
rule_2 = "" # @param {type:"string"}
rule_3 = "" # @param {type:"string"}

### 🗣 Exercise 3B: Breaking Our Rules

Try to find an image that breaks each of your rules. (Spend ~5 minutes on this - some rules might be harder than others to break!)
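Searching for rule-breaking images by eye works, but it can also be sketched in code. The example below uses made-up data; the helper `find_rule_breakers`, the rule `red_rule`, and its cutoff of 100 are hypothetical names and values introduced only for illustration.

```python
import numpy as np

def find_rule_breakers(images, labels, rule):
    """Return the indices of images whose rule-based prediction disagrees with the true label."""
    predictions = np.array([rule(img) for img in images])
    return np.flatnonzero(predictions != np.asarray(labels))

def red_rule(image):
    """Hypothetical rule: call an image 'red' if its average red-channel value is above 100."""
    return "red" if image[..., 0].mean() > 100 else "not red"

# Hypothetical stand-in data: four solid-color 2x2 RGB images with made-up labels.
colors = [(200, 30, 30), (30, 200, 30), (220, 40, 10), (90, 80, 85)]
toy_images = np.stack([np.full((2, 2, 3), c, dtype=np.uint8) for c in colors])
toy_labels = np.array(["red", "not red", "red", "red"])

breakers = find_rule_breakers(toy_images, toy_labels, red_rule)
print("Indices of images that break the rule:", breakers.tolist())
```

Looking closely at the images a rule gets wrong is often the quickest way to see what the rule is missing.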

## 💻 Machine Learning Models

You've learned a lot about machine learning models in the program! Some supervised machine learning models you have touched on are: K-nearest neighbors (KNN) classifiers, logistic regression, linear regression, and neural networks!

<center><img src="https://upload.wikimedia.org/wikipedia/commons/0/09/Supervised_machine_learning_in_a_nutshell.svg" height=200></center>
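To give a feel for what training a supervised model on images can look like, here is a minimal sketch using a k-nearest neighbors classifier from scikit-learn, chosen only for illustration. The toy images, the `"dark"`/`"bright"` labels, and settings such as `n_neighbors=3` and `test_size=0.25` are assumptions; this is not the model you will necessarily build next time.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical stand-in data: 40 dark and 40 bright random 8x8 RGB images.
rng = np.random.default_rng(0)
dark = rng.integers(0, 100, size=(40, 8, 8, 3))
bright = rng.integers(150, 256, size=(40, 8, 8, 3))
toy_images = np.concatenate([dark, bright])
toy_labels = np.array(["dark"] * 40 + ["bright"] * 40)

# Flatten each image into one row of pixel features.
X = toy_images.reshape(len(toy_images), -1)

# Hold out a quarter of the data to check the model on images it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, toy_labels, test_size=0.25, random_state=0, stratify=toy_labels
)

# Fit the classifier on the training split and score it on the held-out split.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))

print("Test accuracy on toy data:", accuracy)
```

The toy classes here are deliberately easy to separate; real images are usually much harder, which is part of what makes model choice interesting.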

### 🗣 Exercise 3C: Models for our Project


Next time we are going to build your first model for your project. Using what you've learned in the program, what are two or three machine learning techniques or models you think might work well for your dataset? Put them in the order of how well you think they'd perform and explain why!


In [None]:
# @title
_1_ = "" # @param {type:"string"}
_2_ = "" # @param {type:"string"}
_3_ = "" # @param {type:"string"}

## 🎁 Wrapping Up

Great work! You've now thought deeply about the motivation behind your project, its stakeholders, and ethical concerns, as well as examined your data for potential hypotheses. Next up, we will see how to build and evaluate your machine learning model!