<font color="#de3023"><h1><b>REMINDER MAKE A COPY OF THIS NOTEBOOK, DO NOT EDIT</b></h1></font>

# Goals
In this colab you will:
* Brainstorm how to use AI for sustainability and conservation, and to combat climate change.
* Learn the ways in which naturally occurring bacteria can help increase plant growth and reforestation.
* Examine a dataset, and build a machine learning model to predict crop yield from soil composition.
* OPTIONALLY: Learn how scientists collect data about bacterial composition of an environment & measure the number of different bacteria in soils using DNA sequences.

In [None]:
#@title ### Setup notebook.

# Sample metadata
!wget -q --show-progress "https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/AI%20Scholars/Sessions%206%20-%2010%20(Projects)/Project%20-%20Sustainable%20Farming/sample_metadata.tsv"

# 16S_counts
!wget -q --show-progress "https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/AI%20Scholars/Sessions%206%20-%2010%20(Projects)/Project%20-%20Sustainable%20Farming/16S_counts.tsv"

# bacteria_counts
!wget -q --show-progress "https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/AI%20Scholars/Sessions%206%20-%2010%20(Projects)/Project%20-%20Sustainable%20Farming/bacteria_counts.tsv"

# sequence_to_species_dict
!wget -q --show-progress "https://storage.googleapis.com/inspirit-ai-data-bucket-1/Data/AI%20Scholars/Sessions%206%20-%2010%20(Projects)/Project%20-%20Sustainable%20Farming/sequence_to_species_dict.npy"

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import plot_tree

metadata = pd.read_table('sample_metadata.tsv')
metadata.index = ['farm_%i' %i for i in range(len(metadata))]

sequences_counts = pd.read_table('16S_counts.tsv')
sequences_counts.index = ['farm_%i' % i for i in range(len(sequences_counts))]

bacteria_counts = pd.read_table('bacteria_counts.tsv')
bacteria_counts.index = ['farm_%i' % i for i in range(len(bacteria_counts))]
cols = list(bacteria_counts.columns)
np.random.seed(42)
np.random.shuffle(cols)
bacteria_counts = bacteria_counts[cols]
sequence_to_species_dict = np.load('sequence_to_species_dict.npy', allow_pickle=True).item()

bacteria_counts = bacteria_counts.drop(['Unnamed: 0'], axis=1)
print("Setup successful.")

# People Health and Planet Health

<img src="https://www.gstatic.com/earth/social/00_generic_facebook-001.jpg" alt="drawing" width="1000" height="300"/>





In almost all of the problems we've worked on in this course so far, the goal is to improve the lives and health of people around the world (and dogs that have tendencies to run into the street!).  But we're forgetting the health of someone else pretty important.... Mother Earth!

 **Discuss: What does planet health mean to you?**

Some people feel that people health and planet health are sometimes at odds with each other. That is, people should make sacrifices (like driving less, opting to buy more expensive but sustainable goods, etc.) in order to save the planet. Others think that people health and planet health go hand-in-hand. A healthier global ecosystem means a healthier human!

####**Discuss: Which do you think?**


#### **Exercise: Brainstorm one problem that pits planet health *against* people health, and one problem where planet health and people health go hand-in-hand.**


In [None]:
people_vs_planet_ = '' #@param {type:"string"}
hand_in_hand_ = '' #@param {type:"string"}
print('Example answers: \n Plane travel is extremely carbon-expensive, \n but many people enjoy travelling or visiting people far away.  \n On the other hand, water pollution affects the health of local \n animals, plants, and humans.')

# Sustainable Farming and AI

One way we can improve both people and planet health is through better farming practices. More efficient and effective farming practices can have greater crop yields while using less land, energy, and harmful chemicals.  The world is projected to have 9 billion people by 2050... How can we make sure we are able to nourish 9 billion people while also nourishing our planet?

If you haven't already done so, please read [this](https://www.nationalgeographic.com/foodfeatures/feeding-9-billion/) article on 5 proposed steps to sustainably feeding the world.


"*Just as electricity transformed almost everything 100 years ago, today I actually have a hard time thinking of an industry that I don't think AI will transform in the next several years.*" -- Andrew Ng


So, why not planet earth and sustainable farming too?

#### **Exercise: Brainstorm a way AI could aid in each of the 5 proposed steps to sustainable farming.**


In [None]:
_1_ = '' #@param {type:"string"}
_2_ = '' #@param {type:"string"}
_3_ = '' #@param {type:"string"}
_4_ = '' #@param {type:"string"}
_5_ = '' #@param {type:"string"}


# Let's look at some farms!

<img src="https://cdn.pixabay.com/photo/2017/08/02/10/31/farming-2570803_1280.jpg" width=500>

Here is a dataset with crop yields and other information from different farms (stored in the variable ```metadata```).  Go ahead and explore the dataset. We've written a little bit of code to get you started.


In [None]:
metadata.head()

####**Exercise: Using the metadata & pandas, answer the following:**
1. Where did these samples come from?
2. For what type of plant was the yield measured?
3. Were all of these samples collected on the same day?
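
If you are not sure how to ask pandas these kinds of questions, here is a small sketch of a few handy tools. It uses a made-up table called `toy_metadata`, and every column name in it except `crop_yield` is hypothetical, so check the real column names in `metadata` before adapting it.

```python
import pandas as pd

# Toy stand-in for `metadata`. The values and all column names except
# `crop_yield` are invented for illustration.
toy_metadata = pd.DataFrame(
    {
        "country": ["Canada", "Canada", "Canada"],
        "crop": ["Wheat", "Wheat", "Wheat"],
        "collection_date": ["2015-06-01", "2015-06-01", "2015-06-03"],
        "crop_yield": [2.1, 3.4, 2.8],
    },
    index=["farm_0", "farm_1", "farm_2"],
)

# List the column names so you know what you can ask about.
print(list(toy_metadata.columns))

# Distinct values that appear in a column.
countries = toy_metadata["country"].unique()
print(countries)

# Number of distinct values; more than 1 means the samples differ here.
n_dates = toy_metadata["collection_date"].nunique()
print(n_dates)

# How many times each value appears.
print(toy_metadata["crop"].value_counts())
```

The same three calls (`unique`, `nunique`, `value_counts`) should work on any column of the real `metadata` dataframe.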

In [None]:
_1_ = '' #@param {type:"string"}
_2_ = '' #@param {type:"string"}
_3_ = '' #@param {type:"string"}


## Instructor Solution
<details><summary>click to reveal!</summary>

Answers: Australia, Barley, No

</details>

### **Exercise: Plot a histogram of the crop yields.  What do you observe from the histogram?**

Take a look at [the `histplot` documentation](https://seaborn.pydata.org/generated/seaborn.histplot.html) and play around with some of the optional parameters. How does your graph change?

In [None]:
ax = sns.histplot(data=metadata['crop_yield'])
ax.set(xlabel=None, ylabel=None) ### FILL IN ###
plt.show()

In [None]:
#@title ###Example Instructor Solution
# plt.hist(metadata['crop_yield'])
# plt.xlabel('Crop Yield')
# plt.ylabel('Number of Farms')
# plt.show()

ax = sns.histplot(data=metadata['crop_yield'])
ax.set(xlabel='Crop Yield (Kilograms/hectare)', ylabel='Number of Farms')
plt.show()


In this class we are going to use some interesting measurements (I won't spoil what these are just yet!) in order to predict how well plants or crops will grow (called *yield*) in a certain locale.

**Discuss: What "step" in the previous article would this fall under?**

####**Exercise: Brainstorm 3 attributes of soil (i.e., *features*) we could measure to predict crop yield.** It's ok if you aren't a chemistry or plant biology whiz -- get creative! Anything goes!


In [None]:
_1_ = '' #@param {type:"string"}
_2_ = '' #@param {type:"string"}
_3_ = '' #@param {type:"string"}

# Predicting Crop Yield From....



<img src="https://t4.ftcdn.net/jpg/01/30/73/43/360_F_130734330_mSBGYTwuGvaHhEoyHlDov6ELFHO4cexF.jpg" alt="drawing" width="1000" height="500"/>



#.... Bacteria!





Scientists have discovered that the soil microbiome, the collection of bacteria that live in a region of soil, plays an important role in the health of plants! (Interestingly, the bacteria that live in the human intestines also play an important role in the health of humans!) Therefore, maybe we can predict how well plants will grow in a region based on the bacterial composition of the soil.

Here is a short [tutorial](https://www.khanacademy.org/science/biology/bacteria-archaea/prokaryote-metabolism-ecology/a/prokaryote-interactions-ecology) on the role of prokaryotes (bacteria) in ecosystems that you can read if you want to learn more.


So, maybe we actually need to think *smaller* when it comes to sustainability!  Like, so small you need a microscope ;)

<img src="https://st2.depositphotos.com/1967477/6350/v/950/depositphotos_63507731-stock-illustration-cartoon-bacteria-collection-set.jpg" alt="drawing" width="300" >


# Examining Our Dataset

Let's take a look at what these bacteria composition datasets look like! Go ahead and run the following line of code to examine your dataset.

In [None]:
bacteria_counts.head()

#### **Exercise: What do the rows and columns represent?**

In [None]:
rows = '' #@param {type:"string"}
columns = '' #@param {type:"string"}

#### **Exercise: Do a quick google search on 3 bacteria from your dataset. Write a brief description of the function & environment of each bacteria.**

Type your findings in the cell below, but no need to run the cell!

In [None]:
_1_ = '' #@param {type:"string"}
_2_ = '' #@param {type:"string"}
_3_ = '' #@param {type:"string"}

# Cleaning Up The Data


<img src="https://live.staticflickr.com/7040/6878034144_02a2e37731_b.jpg" width=300>


Before building our machine learning model, we need to do a bit of preprocessing and data cleaning.

We will do 2 things in order to clean up our data:

1) We will get rid of any bacteria that are very low prevalence.

2) We will log-normalize our data.

## Removing Low Prevalence Bacteria

If a bacterium is so rare that it appears in fewer than 10 samples, it is not going to be helpful for building machine learning models that generalize to new data. We call these **low-prevalence** bacteria. To reduce the amount of data the model has to crunch through, we will remove any features (bacteria) that are low-prevalence.
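
To see what "prevalence" means in practice, here is a minimal sketch on a made-up table. The names `toy_counts`, `min_samples`, and the species labels are all hypothetical, and the threshold is shrunk to 2 so it makes sense with only four farms. It uses a vectorized pandas approach, which is one possible alternative to a loop.

```python
import pandas as pd

# Toy counts: rows are farms, columns are (invented) bacteria.
toy_counts = pd.DataFrame(
    {
        "species_A": [5, 0, 2, 7],
        "species_B": [0, 0, 3, 0],
        "species_C": [1, 4, 0, 9],
    },
    index=["farm_0", "farm_1", "farm_2", "farm_3"],
)

min_samples = 2  # toy threshold; this notebook uses 10 on the real data

# (toy_counts > 0) is True wherever a bacterium was observed at all.
# Summing down each column counts the farms in which it was observed.
prevalence = (toy_counts > 0).sum(axis=0)

# Keep the names of bacteria observed in fewer than `min_samples` farms.
low_prev = prevalence[prevalence < min_samples].index.tolist()

# Drop those columns from the table.
filtered = toy_counts.drop(columns=low_prev)
print(low_prev)
print(filtered.columns.tolist())
```

Note that prevalence counts *how many samples* contain a bacterium, not how large its counts are: `species_B` has a count of 3 in one farm but is still dropped because it shows up in only one sample.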


###**Exercise: Fill out the code below to find the names of the bacteria that appear in fewer than 10 samples, and then remove them from your data.**

In [None]:
low_prev_bacteria = []
bacteria = bacteria_counts.columns
for b in bacteria:
  if sum(bacteria_counts[b]>0)<10:  # If bacteria 'b' appears in fewer than 10 samples,

    # Fill in the code below to add bacteria 'b' to the list of low-prevalence bacteria.
    low_prev_bacteria.append(____) ### FILL IN ###


# Fill in the code below to count the total number of low-prevalence bacteria.
n_low_prevalence = 0000 #### FILL IN ###
print('%i bacteria are low prevalence.' % n_low_prevalence)

# The 'drop' function drops the columns specified from the bacteria_counts dataframe.
bacteria_counts_no_low_prev = bacteria_counts.drop(low_prev_bacteria, axis=1)

In [None]:
#@title #### Example Solution.
low_prev_bacteria = []
bacteria = bacteria_counts.columns
for b in bacteria:
  if sum(bacteria_counts[b]>0)<10:  # If bacteria 'b' appears in fewer than 10 samples,

    # Add bacteria 'b' to the list of low-prevalence bacteria.
    low_prev_bacteria.append(b)


# Count the total number of low-prevalence bacteria.
n_low_prevalence = len(low_prev_bacteria)
print('%i bacteria are low prevalence.' % n_low_prevalence)

# The 'drop' function drops the columns specified from the bacteria_counts dataframe.
bacteria_counts_no_low_prev = bacteria_counts.drop(low_prev_bacteria, axis=1)

## Log-normalizing

**Discuss the following question: An apple tree grows 100 apples one year, and then 110 apples the next year.  Meanwhile, a pumpkin vine grows 10 pumpkins one year, and then 20 pumpkins the next year. Which plant do you feel like experienced a greater change between years?**

Although the plants had the same change in *magnitude* of fruits, you probably feel like the pumpkin plant experienced a greater change because it doubled in fruit production, while the apple tree only increased fruit production by 10%. The pumpkin vine has a larger *relative* change year-to-year. In biology, we are often interested in relative changes as well. A trick for looking at relative changes in data is to use log normalization.

To log-normalize a number N, we do:

$N_{norm} = \log(N+1)$

To take the log of a number in our code, we can use the `np.log` function. Take a look at [the documentation for the function](https://numpy.org/doc/stable/reference/generated/numpy.log.html) if you'd like to see some examples of how it's used!
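
As a quick sanity check of the idea, here is a small sketch that applies the formula to the apple and pumpkin numbers from the discussion question. The variable names are made up for this example; `np.log1p` is a NumPy function that computes $\log(1+x)$ directly and is shown only as an equivalent spelling.

```python
import numpy as np

# Fruit counts for two consecutive years (from the discussion question).
apples = np.array([100, 110])
pumpkins = np.array([10, 20])

# Log-normalize: N_norm = log(N + 1).
apples_log = np.log(apples + 1)
pumpkins_log = np.log(pumpkins + 1)

# Year-to-year change after log-normalization.
apple_change = apples_log[1] - apples_log[0]
pumpkin_change = pumpkins_log[1] - pumpkins_log[0]
print(apple_change, pumpkin_change)

# np.log1p(x) computes log(1 + x), so it gives the same values.
same_result = np.allclose(np.log1p(apples), apples_log)
print(same_result)
```

Both plants gained 10 fruits, but after log-normalization the pumpkin's change (roughly 0.65) is much larger than the apple tree's (roughly 0.09), which matches the intuition about relative change.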



**Optional: Why do you think we also add 1? (Hint: What would happen if N equaled 0?)**

###**Exercise: Perform log-normalization on your dataset by filling out the following code.**


In [None]:
# Fill out the following one-liner to log-normalize your data. Hint: numpy is
# your friend! Remember, you should be performing log-normalization on the
# bacteria_counts_no_low_prev dataframe.
bacteria_counts_lognorm = None #### FILL THIS OUT.

In [None]:
#@title #### Example Solution
# Fill out the following one-liner to log-normalize your data. Hint: numpy is
# your friend! Remember, you should be performing log-normalization on the
# bacteria_counts_no_low_prev dataframe.
bacteria_counts_lognorm = np.log(bacteria_counts_no_low_prev + 1)

In [None]:
bacteria_counts_lognorm.to_csv('bacteria_counts_lognorm.csv')

# Wrapping Up

Great job! Data cleaning and exploration are important parts of machine learning, especially when it comes to working with biological or environmental data!

In the next notebook, we will be building a model in order to predict crop yield from bacterial composition.

####**Exercise: To review what you learned in the class, answer the following questions:**
1. Is this a regression or classification problem?
2. What will be the features (X data / input variable) in our model?
3. What will be the label (Y data / output variable) in our model?




In [None]:
_1_ = '' #@param {type:"string"}
_2_ = '' #@param {type:"string"}
_3_ = '' #@param {type:"string"}

#### **And finally, discuss: how can this model help sustainable farming in the face of climate change?**



# (Optional) Data Exploration

We are going to use a method of unsupervised learning called **hierarchical clustering** in order to explore and visualize our data. Unsupervised learning is usually used by machine learning scientists to understand what our features look like (if any features are correlated, or if there are certain outliers in the dataset) before diving into machine learning models.


Hierarchical clustering groups together features that occur in similar samples, and also clusters together samples whose features look similar.

Unsupervised learning methods do not use labels.  We only give the computer a set of features, and it will group together samples and features that look similar.
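
If you are curious what this grouping looks like in code, here is a minimal sketch using SciPy's `linkage` and `fcluster` functions on four invented samples. The data and variable names are hypothetical, and the choices of `method="average"` and `metric="euclidean"` are assumptions made for this toy example, not settings prescribed by this notebook.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: rows are samples, columns are features (all values invented).
toy = np.array([
    [1.0, 1.1],
    [0.9, 1.0],
    [5.0, 5.2],
    [5.1, 4.9],
])

# Build the hierarchy: repeatedly merge the two closest groups.
# Each row of `merges` records one merge, so n samples give n - 1 rows.
merges = linkage(toy, method="average", metric="euclidean")

# Cut the hierarchy so that at most 2 clusters remain.
labels = fcluster(merges, t=2, criterion="maxclust")
print(labels)
```

Notice that no labels were supplied: the first two samples end up together, and the last two end up together, purely because their feature values are close.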


Here is an example of a hierarchical clustered dataset. ```is_vegan```, ```is_vegetarian```, etc. are features. Bacon, eggs, toast, etc. are the samples. Red means a value of 1, blue means a value of 0.



 <img src="https://i.ibb.co/Fx0wMkm/hierical-clustering.png" alt="drawing" width="1000"/>

**Discuss: Why did toast and frosted mini-wheats get placed near each other in the clustered dataset? Why did ```is_vegan``` and ```is_vegetarian```?**





####**Exercise: Run the code below to perform hierarchical clustering on your dataset.**

In [None]:
#@title INSTRUCTOR RESOURCES

'''
Extra resources for how to explain hierarchical clustering
simple explanation: https://www.displayr.com/what-is-hierarchical-clustering/
more in depth: https://vitalflux.com/hierarchical-clustering-explained-with-python-example/
'''

In [None]:
f = sns.clustermap(bacteria_counts_lognorm)
f.ax_heatmap.set_ylabel('Farm idx')
f.ax_heatmap.set_xlabel('Bacteria Species')
plt.show()

Wow, there's a lot going on here! If you'd like to look at a subset of the data, try using only a small chunk of the data. For example, if you wanted to look at the first 20 farms in the dataset, you would use the line below.


```
f = sns.clustermap(bacteria_counts_lognorm[:20])
```
Try a different number in the cell above!


####**Exercise: Answer the following questions:**
1. What does it mean if two bacteria got clustered nearby each other?
2. Are there any outlier features or samples?
3. Can you think of any other fields that might use unsupervised learning? (Hint: Spotify is very famous for its unsupervised learning methods)


In [None]:
_1_ = '' #@param {type:"string"}
_2_ = '' #@param {type:"string"}
_3_ = '' #@param {type:"string"}

# OPTIONAL: Measuring Bacterial Composition

How did we get the measurements of our different bacteria in the first place?  This section explores the modern experimental and computational methods that scientists use to measure bacterial abundance.

## DNA Sequencing

Nowadays, measuring the amounts of different species of bacteria in a sample is done through a type of *DNA sequencing* called **16S sequencing**.


**Quick Question: What is DNA? If you haven't taken biology recently, ask one of your teammates!  Team work makes the dream work.**

DNA sequencing involves using a special machine that can take a piece of DNA and figure out the order of the nucleotides/nitrogenous bases (the building blocks of DNA) that the DNA is composed of.

<img src="https://www.albert.io/blog/wp-content/uploads/2016/04/Nucleotide.png" alt="drawing" width="300" >

**Quick Question: What are the 4 nucleotides that make up DNA? Again, refer to whoever is the resident biology expert if you can't remember!**


If you want to learn more about sequencing on your own time, check out [this article](https://www.khanacademy.org/science/high-school-biology/hs-molecular-genetics/hs-biotechnology/a/dna-sequencing).




## Instructor Solution
<details><summary>click to reveal!</summary>

* Adenine (A)
* Thymine (T)
* Cytosine (C)
* Guanine (G)

</details>

## The 16S Barcode

Bacteria have a very special region in their DNA called the **16S region**. Every species of bacteria has a different DNA sequence in its 16S region, and scientists have created a huge catalog of the 16S sequences of thousands of species of bacteria.


The 16S region acts like a "barcode."  Let's give a little metaphor. During 16S sequencing:
1.  To measure the amount of bacteria in a sample (analogous to "the shopping cart" in this extended metaphor), all of the bacteria are put into a test tube (onto the register at the grocery store) with some special chemicals.
2.  The sequencing machine (the cashier's barcode scanner) reads the 16S sequence (the barcode!) of every bacterial cell in the sample and logs it in the computer.
3.  Then the computer looks up what species of bacteria (grocery item) each 16S sequence/barcode belongs to, and tallies up the total number of each bacteria.
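
The "look up and tally" part of the metaphor can be sketched in a few lines of plain Python. Everything here is invented for illustration: the short sequences, the species names, and the variables `barcode_catalog` and `scanned_reads` are hypothetical (real 16S sequences are much longer).

```python
from collections import Counter

# Hypothetical catalog mapping a barcode (16S sequence) to a species.
# Two different barcodes can point to the same species.
barcode_catalog = {
    "ACGTAC": "Species A",
    "TTGACA": "Species B",
    "GGCATC": "Species A",
}

# Hypothetical barcodes read by the sequencer, one per bacterial cell.
scanned_reads = ["ACGTAC", "TTGACA", "ACGTAC", "GGCATC", "TTGACA", "ACGTAC"]

# Look up each barcode's species and tally how often each species appears.
species_tally = Counter(barcode_catalog[read] for read in scanned_reads)
print(species_tally)
```

Because both `ACGTAC` and `GGCATC` map to "Species A" in this toy catalog, their reads are counted together, just like two differently packaged boxes of the same cereal.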


P.S. DNA sequencing is getting drastically more efficient, so sequencing your sample might actually be cheaper and faster than your grocery run these days :)

<img src="https://static.scientificamerican.com/sciam/cache/file/354C5183-DB53-432A-92B3FC4BA3E99297_source.jpg" width=500>



<img src="https://upload.wikimedia.org/wikipedia/commons/b/bd/DNA_Barcoding.png" width=500>




---


## From Barcodes to Bacterial Counts



Let's look at what our ```sequences_counts``` dataframe looks like right now. **Discuss: what are the rows and what are the columns?**

In [None]:
sequences_counts.head()

In [None]:
#@title ##### We want to get our counts in terms of different bacteria (remember that each sequence maps to a type of bacteria). Something that looks like this (run this cell):
bacteria_counts[bacteria_counts.columns[20::20]].head()

We are going to perform an algorithm in 3 steps.
1. Label each sequence (barcode) with the bacteria that corresponds to that particular barcode. You can look this up using the dictionary ```sequence_to_species_dict```.
2. Group barcodes/sequences that correspond to the same bacteria species together using the ```pandas``` ```groupby``` function.
3. Use the ```pandas``` ```.sum()``` function to sum together the counts of all sequences that correspond to the same bacteria species.
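
Before trying this on the real data, it may help to watch the three steps run on a tiny made-up table. This is only a sketch: `toy_seq_counts`, `toy_seq_to_species`, the short sequences, and the species names are all hypothetical stand-ins for `sequences_counts` and `sequence_to_species_dict`.

```python
import pandas as pd

# Toy version of `sequences_counts`: rows are farms, columns are barcodes.
toy_seq_counts = pd.DataFrame(
    {"ACGTAC": [3, 0], "GGCATC": [1, 2], "TTGACA": [0, 5]},
    index=["farm_0", "farm_1"],
)

# Toy version of `sequence_to_species_dict`.
toy_seq_to_species = {
    "ACGTAC": "Species A",
    "GGCATC": "Species A",
    "TTGACA": "Species B",
}

# Transpose so that each barcode is a row (groupby groups rows).
toy_t = toy_seq_counts.transpose()

# Step 1: label each barcode with its species.
toy_t["species"] = [toy_seq_to_species[seq] for seq in toy_t.index]

# Steps 2 and 3: group barcodes by species and sum their counts.
toy_summed = toy_t.groupby("species").sum()

# Transpose back so rows are farms and columns are species.
toy_bacteria = toy_summed.transpose()
print(toy_bacteria)
```

Three barcode columns collapse into two species columns, because two of the barcodes belong to the same species.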

Fill in the code below to perform this algorithm.

In [None]:
# The pandas groupby function groups rows, so we transpose our data
# to make each sequence (barcode) a row.
sequences_counts_t = sequences_counts.transpose()

# Step 1: Create a new column called "species" that corresponds to the species of a given barcode,
# as looked up in the dictionary.
sequences_counts_t['### FILL IN ###'] = [sequence_to_species_dict[i] for i in sequences_counts_t.index]


# Step 2: Use the pandas "groupby" function to group sequences together by species.
# and Step 3: Use the .sum() function to sum together the counts of the grouped sequences.
summed_data = sequences_counts_t.groupby('### FILL IN ###').sum()

# Finally, we will re-transpose the data so that our columns are bacteria and our rows are each farm.
bacteria_counts = summed_data.transpose()
bacteria_counts.head()

In [None]:
#@title #### Example Solution
# The pandas groupby function groups rows, so we transpose our data
# to make each sequence (barcode) a row.
sequences_counts_t = sequences_counts.transpose()

# Step 1: Create a new column called "species" that corresponds to the species of a given barcode,
# as looked up in the dictionary.
sequences_counts_t['species'] = [sequence_to_species_dict[i] for i in sequences_counts_t.index]


# Step 2: Use the pandas "groupby" function to group sequences together by species.
# and Step 3: Use the .sum() function to sum together the counts of the grouped sequences.

summed_data = sequences_counts_t.groupby('species').sum()

# Finally, we will re-transpose the data so that our columns are bacteria and our rows are each farm.
bacteria_counts = summed_data.transpose()
bacteria_counts.head()

Run the next cell and **discuss: what do you notice about the shape of the ```sequences_counts``` dataframe and the ```bacteria_counts``` dataframe? Why are the shapes different?**

In [None]:
print('Shape of sequences_counts:', np.shape(sequences_counts))
print('Shape of bacteria_counts:', np.shape(bacteria_counts))

In [None]:
bacteria_counts

Now you're able to see how scientists measure the amounts of different bacterial species!

In the next notebook, you're actually going to write an algorithm that turns a list of the 16S barcodes output by the sequencer/barcode scanner into counts of different bacteria types. You will use this data to build a machine learning model to predict crop yield from bacterial composition of soil!