<a href="https://colab.research.google.com/github/daisysong76/bioinformatics-research/blob/main/Lab_0_c146_v_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# [DS4Bio] Lab 0: Comparative Genomics

**Notebook Developed by:**
*Reet Mishra and Sarp Dora Kurtoglu* (adapted in part from work by *Shishi Luo* and *Jonathan Fischer*) <br>**Notebook Updated by:**
*Skye Pickett, Zcjanin Ollesca, Xiaomei Song, Diego Sotomayor, Evie Currington*

### Learning Outcomes

In this notebook, you will learn about:
* Importing and analyzing datasets
* Sorting and manipulating scientific data
* Understanding correlation coefficients
* Graphing genome sizes (bar, scatter, histogram)
* Grouping and Pivot histograms
* Summary Statistics
* C-Value Paradox between different organisms


## Table of Contents
1. [Genome Size Exploration](#1.-Genome-Size-Exploration)
1. [Pathogen Genome Sizes](#2.-Pathogen-Genome-Sizes)
1. [Histograms](#3.-Histograms)
1. [Scatterplots](#4.-Scatterplots)
1. [Summary Statistics](#5.-Summary-Statistics)
1. [Putting it all together: Analytics on Animals](#6.-Putting-it-all-together:-Analytics-on-Animals)
1. [C-Value Paradox](#7.-C-Value-Paradox)
1. [Conclusion](#8.-Conclusion)
***

### Helpful Data Science Resources
Here are some resources you can check out while doing this notebook!

- [Reference Sheet for the datascience Module](http://data8.org/sp22/python-reference.html)<br>(This is extremely helpful whenever you need a cheatsheet!)
- [Documentation for the datascience Module](http://data8.org/datascience/index.html)

### Peer Consulting

If you find yourself having trouble with any content in this notebook, Data Peer Consultants are an excellent resource! Click [here](https://dlab.berkeley.edu/training/frontdesk-info) to locate live help.

Peer Consultants are there to answer all data-related questions, whether it be about the content of this notebook, applications of data science in the world, or other data science courses offered at Berkeley.

---

## 1. Genome Size Exploration

To prepare our notebook environment, run the following cell which imports the necessary packages. It will print `All necessary packages have been imported.` below the cell when it's completed importing.

In [None]:
# Import the necessary packages
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as sp
plt.style.use('fivethirtyeight')
print("All necessary packages have been imported.")

All necessary packages have been imported.


---
### 1.1 Importing data


The **genome size** is the amount of DNA contained in one copy of the complete genome, in terms of the number of basepairs.

The table that we are importing contains information on 15 different species. The table provides both the scientific and common name of these species, as well as the number of genes, the number of proteins, and the genome size for the species.


<font color = #d14d0f>**QUESTION 1**:</font> ***Replace the `...` in the cell below with your code.* Import the model species data as a table. The *csv* file that we will turn into a table is called `lab0_model_species.csv`. Keep in mind that `Table.read_table(...)` takes in the file *path*, not just the file name. In the cell below, we assign the table to the variable `species`.**
>*Hint:<br>- The format of the code below will be `Table.read_table('<foldername>/<filename>')`.<br>- "lab0_model_species.csv" is located inside the "data" folder.*


In [None]:
species = Table.read_table(...)


We can see the full table with the `<tablename>.show()` command. <br>Run the code below, `species.show()`, to see the full *species* table!


In [None]:
species.show()

---
### 1.2 Organisms by Genome Size



<font color = #d14d0f>**QUESTION 2**:</font> Sort the `species` table by size in ascending order (smallest to biggest).<br>
>*Hint:*<br> - Use the sort function: `<table_name>.sort(<"column_name">)`


In [None]:
# YOUR CODE HERE


<font color = #d14d0f>**QUESTION 3**:</font> Now, sort the *species* table by size in descending order (biggest to smallest).<br>
>*Hint:*<br> - The sort function has an optional argument, *descending* that can be set equal to True or False:<br>`<table_name>.sort(<"column_name">, descending = ___)`


In [None]:
# YOUR CODE HERE


<font color = #d14d0f>**QUESTION 4**:</font> Extract the organisms with more than 60,000 genes from the `species` table. This is roughly the number of genes that humans have.<br>
>*Hint*:<br> Use the where function with this syntax `<table_name>.where('<column_name>', predicate)`<br>[All `table.where` predicates at this link](https://www.data8.org/fa23/reference/#tablewhere-predicates)
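
For intuition, sorting and filtering a table amount to ordering rows by one field and keeping only the rows that satisfy a condition. The sketch below is a plain-Python analogue, not the `datascience` API; the records, field names, values, and threshold are all made up for illustration.

```python
# Hypothetical records standing in for table rows (all values are made up)
rows = [
    {"name": "organism_a", "size": 12.1, "genes": 6000},
    {"name": "organism_b", "size": 3100.0, "genes": 60000},
    {"name": "organism_c", "size": 4.6, "genes": 4300},
    {"name": "organism_d", "size": 135.0, "genes": 27000},
]

# Ascending sort by size (analogous to sorting a table by one column)
ascending = sorted(rows, key=lambda r: r["size"])

# Descending sort: same key, reversed order
descending = sorted(rows, key=lambda r: r["size"], reverse=True)

# Filter: keep only rows whose gene count exceeds an arbitrary threshold
threshold = 25000
many_genes = [r for r in rows if r["genes"] > threshold]

print([r["name"] for r in ascending])
print([r["name"] for r in descending])
print([r["name"] for r in many_genes])
```

Note that filtering preserves the original row order, while sorting returns a new ordering and leaves `rows` untouched.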


In [None]:
# YOUR CODE HERE
greater_than_60k = ...
greater_than_60k

In [None]:
# DON'T MODIFY THIS CELL -- for graders to check answers
greater_than_60k[1]

---
## 2. Pathogen Genome Sizes

---
### 2.1 Importing Data


<font color = #d14d0f>**QUESTION 5**:</font> ***Replace the `...` in the cell below with your code.* Import the pathogen data as a table and assign it to a variable `pathogens`. The *csv* file that we will turn into a table is called `lab0_pathogens.csv`. Keep in mind that `Table.read_table(...)` takes in the file *path*, not just the file name.**
>*Hint:<br>- See Question 1 for a guide.<br>- The format of the code below will be `Table.read_table('<foldername>/<filename>')`.<br>- "lab0_pathogens.csv" is located inside the "data" folder.<br>- The last line of the cell below will be `<table_name>.show()`. This will output the table if the line above is done correctly.*


In [None]:
# YOUR CODE HERE -- replace the ...

... = Table.read_table(...)
....show()


<font color = #d14d0f>**QUESTION 6**:</font> **Calculate the minimum, maximum, mean, and standard deviation of the pathogen sizes to get a good sense of the data that you are analyzing. Assign each to the corresponding variable names below (replace `...` with your code.)**

>*Hint*:
<br>- Use `np.min(<table_name>[<"column_name">])`, `np.max(<table_name>[<"column_name">])`, `np.mean(<table_name>[<"column_name">])`, and `np.std(<table_name>[<"column_name">])` for these calculations.
<br> - The `pathogens` table has a column "Size".


In [None]:
size_min = ...
size_max = ...
size_mean = ...
size_std = ...


In [None]:
# DON'T MODIFY THIS CELL -- for graders to check answers
print(size_min, size_max, size_mean, size_std)


<font color = #d14d0f>**QUESTION 7**:</font> **How many subgroups do we have in the `pathogens` table?**
**What is the most common subgroup?**
<br>*Hint:* <br>- Use [`<table_name>.group("<column_name>", <optional_aggregation_function>)`](http://www.data8.org/datascience/_autosummary/datascience.tables.Table.group.html#datascience.tables.Table.group).
<br>- The default aggregation function is 'count', which counts how many times each value occurs in the table (i.e., its 'commonness').
<br>- Using [`.sort("<column_name>")`](http://www.data8.org/datascience/_autosummary/datascience.tables.Table.sort.html#datascience.tables.Table.sort) to sort by the count values could be helpful to identify the most common subgroup.
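
As a rough illustration of what grouping with the default `count` aggregation produces, the sketch below tallies labels using only Python's standard library. The labels are invented for illustration and are not taken from `lab0_pathogens.csv`.

```python
from collections import Counter

# Hypothetical category labels, one per table row (made up)
labels = ["alpha", "beta", "alpha", "gamma", "alpha", "beta"]

counts = Counter(labels)      # maps each label to its number of occurrences
num_groups = len(counts)      # number of distinct labels

# most_common(1) returns a list holding the single (label, count) pair
# with the highest count
most_common_label, most_common_count = counts.most_common(1)[0]

print(num_groups)
print(most_common_label, most_common_count)
```

A grouped table carries the same information: one row per distinct label, with a count column that can be sorted to find the most frequent label.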

In [None]:
## Your code here

**ANSWER:**
Double click on this cell, then replace the ... with your answers.

**Number of subgroups** = ...
<br> **Most common subgroup** = ...

---
### 2.2 Understanding Pathogens

<font color = #d14d0f>**QUESTION 8**:</font> **How many organisms does `pathogens` contain?**

>*Hint:*<br>- The `<table_name>.num_rows` attribute from the `datascience` library gives the total number of rows in your table.

In [None]:
# YOUR CODE HERE
num_organisms = ...

In [None]:
# DON'T MODIFY THIS CELL -- for graders to check answers
print(num_organisms)


<font color = #d14d0f>**QUESTION 9**:</font> **Create a scatter plot of Genome Size vs. Number of Genes.**
>*Hint:* To create a scatter plot of genome size vs. the number of genes, you can use the plt.scatter() function from matplotlib.pyplot. Provide the appropriate columns from the pathogens table as the x and y data. See the scatter plot above for help!


In [None]:
# Extract the relevant columns from the pathogens table and put them in the ... below
plt.scatter(..., ..., alpha = 0.5)
# Keep the below code
plt.xlabel('Genome Size (Megabases)')
plt.ylabel('Number of Genes')
plt.title('Scatter Plot of Genome Size vs. Number of Genes')
plt.show()

<font color = #d14d0f>**QUESTION 10**:</font> **Create a bar chart of subgroup frequencies.**
>*Hint:* To create a bar chart of subgroup frequencies, you can use the plt.bar() function from matplotlib.pyplot. You'll need to count the frequencies of each subgroup in the pathogens table and then use the subgroup names as x-values and the frequencies as y-values for your plot.

<br>**Step 1:** Group the `pathogens` table to find the total amount of each subgroup in the table. *Assign this new table to the variable `subgroup_counts`.* See your code from Question 7!
<br> **Step 2:** Plot the counts in a bar chart. Keep in mind that the order is `plt.bar(<x_axis_values>, <y_axis_values>)`.

In [None]:
## Create a grouped table

# YOUR CODE HERE -- Replace the ... with your grouped table

subgroup_counts = ...

## Below we extract the subgroup names and frequencies
subgroup_names = subgroup_counts.column("Subgroup")
subgroup_freq = subgroup_counts.column("count")


# YOUR CODE HERE -- Create a bar chart! Replace the ... below
plt.figure(figsize=(10, 6))
plt.bar(..., ...)
plt.xlabel('Subgroup')
plt.ylabel('Frequency')
plt.title('Bar Chart of Subgroup Frequencies')
plt.xticks(rotation=45)
plt.show()

---
## 3. Histograms
With so many organisms, it can be hard to interpret the table. Histograms are a great way to visualize the distribution of a quantity of interest.

<font color = #d14d0f>**QUESTION 11**:</font> **In the code cell below, create a histogram of genome sizes (in megabases).**
>*Hint:<br>* Use the following format `<table_name>.hist('column_name', bins = b, normed = n)` where<br>- b gives the number of bins in the histogram<br>- n is either True or False for whether bin heights should be normalized by number of observations<br>- Choose 20 bins and normed = False
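
To make the roles of the bin count and of normalization concrete, here is a minimal plain-Python sketch of equal-width binning. The helper `bin_counts` and the sample values are hypothetical. Plotting libraries may treat bin edges and height scaling differently (for example, by drawing a density rather than a simple proportion).

```python
def bin_counts(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # The maximum value would land one past the last bin, so clamp it
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return counts

# Hypothetical measurements (made up)
sizes = [1, 2, 2, 3, 5, 6, 8, 9]

counts = bin_counts(sizes, 4)                  # absolute frequencies
relative = [c / len(sizes) for c in counts]    # proportions that sum to 1

print(counts)
print(relative)
```

Changing `num_bins` changes how coarse the picture is, while normalizing only rescales the heights so that groups of different sizes can be compared.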


In [None]:
# YOUR CODE HERE
genome_size_hist = ....hist(..., bins = ..., normed = ...)       # Replace the ellipsis with appropriate arguments
genome_size_hist

---
### 3.1 Understanding Histograms

First, **read about histograms [here](https://chartio.com/learn/charts/histogram-complete-guide/)**, then answer the questions below.


<font color = #d14d0f>**QUESTION 12**:</font> **What are 3 benefits the author describes of using a histogram?**

**ANSWER:**<br>

Double click on this cell, then replace this sentence with your answer.




<font color = #d14d0f>**QUESTION 13**:</font> **What are the 3 best practices for using a histogram?**

**ANSWER:**<br>

Double click on this cell, then replace this sentence with your answer.




<font color = #d14d0f>**QUESTION 14**:</font> **Why is it imperative to choose the appropriate number of bins? How do you decide the right amount?**

**ANSWER:**<br>

Double click on this cell, then replace this sentence with your answer.




<font color = #d14d0f>**QUESTION 15**:</font> **In order to compare genome sizes, should we use absolute frequency or relative frequency?**

**ANSWER:**<br>

Double click on this cell, then replace this sentence with your answer.




<font color = #d14d0f>**QUESTION 16**:</font> **What's wrong with the Histogram above (in Question 11)?**

**ANSWER:**<br>

Double click on this cell, then replace this sentence with your answer.



---
### 3.2 Pivot Histograms for Genome Sizes




<font color = #d14d0f>**QUESTION 17**:</font> **Since we are dealing with different types of subgroups (bacteria, fungi, parasites, and viruses), it would be helpful to create a histogram with the normalized frequency of genome sizes for each subgroup and compare them with one another as opposed to viewing a histogram displaying the frequencies of all of the different sizes for each subgroup.**
>*Hint:* <br> Use the following format `object.hist('column_name', group = variable_to_group_by, bins = b, normed = n)` where<br>- group is set to the column containing our subgroups<br>- b gives the number of bins in the histogram<br>- n is either True or False for whether bin heights should be normalized by number of observations<br>- Choose 20 bins and normed = True, since the question asks for normalized frequencies

In [None]:
# YOUR CODE HERE
subgroup_size_hist = ....hist(..., group = ... , bins = ... , normed = ...)
subgroup_size_hist


With the previous histogram, we had no information about how differing subgroups may have different relative frequencies of genome sizes. In the plot above, we are now able to view separate histograms displaying the normalized frequency of genome sizes for each subgroup which gives us more detail about how the distributions of sizes differ across bacteria, fungi, parasites, and viruses.

---
## 4. Scatterplots


Scatterplots are a great way to visualize the connection between two variables. A scatterplot represents each row in our table with a single dot, and we plot each dot relative to its value for the two numerical columns on the axes.

*Below you will make a scatter plot consisting of one point for each row of the table. Note that x_column and y_column must be strings specifying column names.*
    


<font color = #d14d0f>**QUESTION 18**:</font> **Make a scatterplot of the genome sizes vs the number of genes in `pathogens` (genome size on X axis, number of genes on Y axis).**<br>
>*Hint:<br>- `<table_name>.scatter(<'X_column'>, <'Y_column'>)`
<br>- 1. string: name of the column on the x-axis
<br>- 2. string: name of the column on the y-axis
<br>- 3. (Optional) fit_line=True (Include optional argument fit_line=True if you want to draw a line of best fit for each set of points)*

In [None]:
## Run this cell to recall the names
## of the columns in the pathogens table
pathogens.labels


In [None]:
# YOUR CODE HERE

# Don't delete the code below
plt.title("Genome size vs Number of Genes");




In the next few questions, we will be working with correlation coefficients, specifically Pearson's and Spearman correlations. Read these links to familiarize yourself with them both:
>Pearson's Correlation [definition](https://www.mathsisfun.com/data/correlation.html) and [Data 8 explanation](https://inferentialthinking.com/chapters/15/1/Correlation.html#the-correlation-coefficient)

>Spearman Correlation [definition](https://statisticsbyjim.com/basics/spearmans-correlation/)
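
To see how the two coefficients can differ, the sketch below computes both by hand on made-up data where `y` increases with `x` but not along a straight line. It uses only the standard library. The helper functions are illustrative, assume no tied values, and are not the SciPy implementations (which also report a p-value).

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical data: monotonic but nonlinear relationship
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]

pearson_r = pearson(x, y)
spearman_rho = spearman(x, y)

print(round(pearson_r, 3))
print(round(spearman_rho, 3))
```

Because Spearman looks only at the ordering of values, it is 1 for any strictly increasing relationship, while Pearson drops below 1 as soon as the points leave a straight line.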


<font color = #d14d0f>**QUESTION 19**:</font> **Calculate the Pearson and Spearman correlations between the genome sizes and the number of genes in the `pathogens` table.**

>*Hints:*<br>- Pearson correlation function:<br>`sp.pearsonr(<table_name>[<"column_name_1">], <table_name>[<"column_name_2">])` <br>- Spearman correlation function:<br>`sp.spearmanr(<table_name>[<'column_name_1'>], <table_name>[<'column_name_2'>])`<br>- The column names are the same ones that you used for your x column and y column in the scatter plot above.

In [None]:
pearson_correlation = ...
spearman_correlation = ...

In [None]:
# DON'T MODIFY THIS CELL - for graders to check answers
print("Genome SIZES vs Number of GENES")
print("Pearson correlation: ", pearson_correlation)
print("Spearman correlation: ", spearman_correlation)


<font color = #d14d0f>**QUESTION 20**:</font> **Make a scatterplot of the number of proteins vs number of genes in `pathogens` (number of proteins on X axis, number of genes on Y axis).** *Use the code for your previous scatter plot as a guide.*
<br>
>*Hint:<br>- `<table_name>.scatter(<'X_column'>, <'Y_column'>)`
<br>- 1. string: name of the column on the x-axis
<br>- 2. string: name of the column on the y-axis
<br>- 3. (Optional) fit_line=True (Include optional argument fit_line=True if you want to draw a line of best fit for each set of points)*

In [None]:
# YOUR CODE HERE


# Don't delete the code below
plt.title("Proteins count vs. Number of Genes");




<font color = #d14d0f>**QUESTION 21**:</font> **Which correlation coefficient would be most appropriate for the plot above?**


**Answer:**

Double tap on this cell to edit the text. Replace this with your answer.


<font color = #d14d0f>**QUESTION 22**:</font> **Calculate the Pearson and Spearman correlations between the number of proteins and the number of genes in the `pathogens` table.**
>*Hints:<br>- Pearson correlation function:<br>`sp.pearsonr(<table_name>[<"column_name_1">], <table_name>[<"column_name_2">])` <br>- Spearman correlation function:<br>`sp.spearmanr(<table_name>[<'column_name_1'>], <table_name>[<'column_name_2'>])`<br>- The column names are the same ones that you used for your x column and y column in the scatter plot above.*

In [None]:
pearson_correlation = ...
spearman_correlation = ...

In [None]:
# DON'T MODIFY THIS CELL - for graders to check answers
print("Number of PROTEINS vs Number of GENES")
print("Pearson correlation: ", pearson_correlation)
print("Spearman correlation: ", spearman_correlation)


<font color = #d14d0f>**QUESTION 23**:</font> **Calculate the Pearson and Spearman correlations between the genome sizes and the number of genes in the `pathogens` table.**
>*Hints:*<br>- Pearson correlation function:<br>`sp.pearsonr(<table_name>[<"column_name_1">], <table_name>[<"column_name_2">])` <br>- Spearman correlation function:<br>`sp.spearmanr(<table_name>[<'column_name_1'>], <table_name>[<'column_name_2'>])`<br>- The column names are the same ones that you used for your x column and y column in the scatter plot above.

In [None]:
pearson_correlation = ...
spearman_correlation = ...

In [None]:
# DON'T MODIFY THIS CELL - for graders to check answers
print("Genome SIZES vs Number of GENES")
print("Pearson correlation: ", pearson_correlation)
print("Spearman correlation: ", spearman_correlation)


<font color = #d14d0f>**QUESTION 24**:</font> **Make a scatterplot of the number of proteins vs number of genes in pathogens (number of proteins on X axis, number of genes on Y axis).** *Use the code for your previous scatter plot as a guide.*
>*Hint:<br>- `<table_name>.scatter(<'X_column'>, <'Y_column'>)`
<br>- 1. string: name of the column on the x-axis
<br>- 2. string: name of the column on the y-axis
<br>- 3. (Optional) fit_line=True (Include optional argument fit_line=True if you want to draw a line of best fit for each set of points)*

In [None]:
# YOUR CODE HERE


# Don't delete the code below
plt.title("Proteins count vs. Number of Genes");

---
## 5. Summary Statistics
                                 
Summary statistics are numerical measures used to describe and summarize essential characteristics or properties of a dataset. They provide a concise overview of the dataset's central tendency, spread, and key features.

>#### **Mean:**
The mean, often called the average, is calculated by summing all values in a dataset and dividing by the number of data points. It represents the arithmetic center of the data.<br>
Use Case: The mean is useful for continuous data when you want to find the typical or central value. However, it can be sensitive to outliers, making it less suitable when extreme values are present.
>#### **Median:**
The median is the middle value in a dataset when it is ordered. If there is an even number of data points, it's the average of the two middle values. The median is less sensitive to outliers compared to the mean.<br>
Use Case: The median is preferred when dealing with skewed or non-normally distributed data, as it provides a more robust measure of central tendency.
>#### **Standard Deviation:**
The standard deviation quantifies the amount of variation or dispersion in a dataset. It measures how individual data points deviate from the mean.<br>
Use Case: The standard deviation is useful for understanding the spread or variability in a dataset. A higher standard deviation indicates greater variability, making it valuable for assessing data consistency.
>#### **Interquartile Range (IQR):**
The IQR is a measure of statistical dispersion, specifically the range between the first quartile (Q1) and the third quartile (Q3). It identifies the middle 50% of the data.<br>
Use Case: The IQR is robust against outliers and is often used to identify and handle extreme values. It's valuable for describing the spread of data while minimizing the influence of outliers.
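
The four definitions above can be checked on a small made-up dataset using only Python's `statistics` module. This is a sketch for intuition rather than the notebook's own workflow. It assumes the population standard deviation (which is what `np.std` computes by default) and quartiles found by linear interpolation between data points; other quartile conventions can give slightly different IQR values.

```python
import statistics

# Hypothetical measurements (made up)
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)        # sum of values / number of values
median = statistics.median(data)    # average of the two middle values here
pop_std = statistics.pstdev(data)   # population standard deviation

# Quartiles by linear interpolation between data points
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                       # spread of the middle 50% of the data

# Adding one extreme value moves the mean far more than the median
with_outlier = data + [100]
mean_out = statistics.mean(with_outlier)
median_out = statistics.median(with_outlier)

print(mean, median, pop_std, iqr)
print(mean_out, median_out)
```

The last two lines illustrate the sensitivity described above: a single outlier roughly triples the mean but shifts the median only from 4.5 to 5.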

Run this cell to print the mean of the genome sizes of the `pathogens` table.

In [None]:
genome_size_mean = np.mean(pathogens['Size'])
genome_size_mean


<font color = #d14d0f>**QUESTION 25**:</font> **Above we computed the mean of the genome sizes in the `pathogens` table. In the cell below, compute the mean (average) of the genes count in `pathogens` and of the proteins count in `pathogens`.**


In [None]:
genes_mean = ...
proteins_mean = ...

In [None]:
# Don't modify this cell -- just run it
print("Average Genome size: ", genome_size_mean)
print("Average number of genes: ", genes_mean)
print("Average number of proteins: ", proteins_mean)
genome_size = [genome_size_mean]
genes = [genes_mean]
proteins = [proteins_mean]


<font color = #d14d0f>**QUESTION 26**:</font> **In one to two sentences, summarize what the cell above tells us about the average values. For example: Which values are highest or lowest? Why is or isn't this what you'd expect?**


**Answer:**

Double tap on this cell to edit the text. Replace this with your answer.


We're going to store the summary statistics we found so far (and 3 more that we will calculate shortly) in a table. Let's call this table `pathogens_summary`. Read the cell below to understand how this table is being made. We're creating a column that shows each summary statistic and we'll add more columns on later.
>`<table_name> = Table().with_columns(<'column_name_1'>, [...values...], <'column_name_2'>, [...values...], ...)`

In [None]:
pathogens_summary = Table().with_columns("Summary statistic", ["Mean", "Median", "Standard Deviation", "Interquartile Range"])
pathogens_summary


<font color = #d14d0f>**QUESTION 27**:</font> **Find the median of the genome sizes, number of genes, and number of proteins in `pathogens`.**
>*Hint*: Use `np.median(<table_name>[<"column_name">])`


In [None]:
genome_size_median = ...
genes_median = ...
proteins_median = ...

In [None]:
# Don't modify this cell -- just run it
print("Median Genome size: ", genome_size_median)
print("Median number of genes: ", genes_median)
print("Median number of proteins: ", proteins_median)
genome_size.append(genome_size_median)
genes.append(genes_median)
proteins.append(proteins_median)


<font color = #d14d0f>**QUESTION 28**:</font> **Find the standard deviation of the genome sizes, number of genes, and number of proteins in `pathogens`.**
>*Hint*: Use `np.std(<table_name>[<"column_name">])`


In [None]:
genome_size_std = ...
genes_std = ...
proteins_std = ...

In [None]:
# Don't modify this cell -- just run it
print("Standard Deviation of Genome size: ", genome_size_std)
print("Standard Deviation of number of genes: ", genes_std)
print("Standard Deviation of number of proteins: ", proteins_std)
genome_size.append(genome_size_std)
genes.append(genes_std)
proteins.append(proteins_std)


<font color = #d14d0f>**QUESTION 29**:</font> **Find the [interquartile range](https://statisticsbyjim.com/basics/interquartile-range/#:~:text=The%20interquartile%20range%20(IQR)%20measures,your%20data%20spread%20out%20further.) of the genome sizes, number of genes, and number of proteins in `pathogens`.**
>*Hint*: Use this syntax: `sp.iqr(<table_name>[<"column_name">])` (`sp` is the alias under which `scipy.stats` was imported at the top of the notebook)


In [None]:
genome_size_iqr = ...
genes_iqr = ...
proteins_iqr = ...

In [None]:
# Don't modify this cell -- just run it
print("IQR of Genome size: ", genome_size_iqr)
print("IQR of number of genes: ", genes_iqr)
print("IQR of number of proteins: ", proteins_iqr)
genome_size.append(genome_size_iqr)
genes.append(genes_iqr)
proteins.append(proteins_iqr)



Let's **add all our summary statistics** we found to the `pathogens_summary` we made before. Read the cell below to understand how this table is being updated. We're creating a column for each of the attributes that we're summarizing. Run the code cell below.
>To update a table/add columns, use this:`<table_name> = <table_name>.with_columns(<'column_name_1'>, [...values...], <'column_name_2'>, [...values...], ...)`

In [None]:
pathogens_summary = pathogens_summary.with_columns("Genome size", genome_size, "Genes", genes, "Proteins", proteins)
pathogens_summary ## Run this cell


<font color = #d14d0f>**QUESTION 30**:</font> **In two to three sentences, summarize what similarities or differences you notice between the summary statistics of the genome sizes, gene counts, and proteins counts from the `pathogens_summary` table above.**


**Answer:**

Double tap on this cell to edit the text. Replace this with your answer.

---
## 6. Putting it all together: Analytics on Animals
Refer to all prior sections for guidance!

<font color = #d14d0f>**QUESTION 31**:</font> ***Replace the `...` in the cell below with your code.* Import the animals data as a table and assign it to a variable `animals`. The *csv* file that we will turn into a table is called `lab0_animals.csv`. Keep in mind that `Table.read_table(...)` takes in the file *path*, not just the file name.**
>*Hint:<br>- See Questions 1 and 5 for reference.<br>- The format of the code below will be `Table.read_table('<foldername>/<filename>')`.<br>- "lab0_animals.csv" is located inside the "data" folder.<br>- The last line of the cell below will be `<table_name>.show()`. This will output the table if the line above is done correctly.*


In [None]:
... = Table.read_table(...)
....show()


This table has 51 different animals listed. We know this by the calculation done in the cell below.




In [None]:
animals.num_rows





<font color = #d14d0f>**QUESTION 32**:</font> **Like we did in Section 3, create a histogram of genome sizes (in megabases), but this time use the matplotlib.pyplot package to create it. Remember to adhere to the best practices of creating histograms. The code cell below lays out the general format for plotting histograms when using pyplot. Fill in the ellipses with the appropriate arguments. Take note of any differences you observe between the histogram generated in Section 3 and the histogram generated here using pyplot.**

>*Hint*: You can refer to the [documentation](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html) and [examples/tutorials](https://realpython.com/python-histograms/) for guidance.


In [None]:
import matplotlib.pyplot as plt # the package is already imported for you

# YOUR CODE HERE

genome_sizes_column = ....column(...)           # Replace the first ellipsis with the appropriate table name, replace the second with the appropriate column

plt.hist(genome_sizes_column,
        bins = ...,                             # Specify your bins
        )
plt.xlabel(...)                                 # Add a label for the x-axis
plt.ylabel(...)                                 # Add a label for the y-axis
plt.title(...)                                  # Add a plot title
plt.show()

<font color = #d14d0f>**QUESTION 33**:</font> **Like we did in Section 3.2, create a pivot histogram of genome sizes (in megabases) based on the `Subgroup` column, but this time use the matplotlib.pyplot package to create it. Remember to adhere to the best practices of creating histograms. The code cell below lays out the general format for plotting histograms when using pyplot. Fill in the ellipses with the appropriate arguments. Again, take note of any differences you observe between the histogram generated in Section 3.2 and the histogram generated here using pyplot.**


In [None]:
# YOUR CODE HERE

# Do not modify this code, this creates a list of each unique subgroup in pathogens
subgroups = np.unique(pathogens.column('Subgroup'))

# This loop will create a histogram for each subgroup, don't worry too much about the syntax, simply specify the bins for the histogram
for group in subgroups:
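        # Keep only this subgroup's rows, then extract the genome sizes as an array for plt.hist
        group_column = pathogens.where('Subgroup', are.equal_to(group)).column('Size')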
        plt.hist(group_column,
                bins = ...,                     # Specify your bins
                label = group
                )
plt.xlabel(...)                                 # Add a label for the x-axis as a string
plt.ylabel(...)                                 # Add a label for the y-axis as a string
plt.title(...)                                  # Add a plot title as a string
plt.legend()
plt.show()

The code below creates summary statistics (like we did for `pathogens`) and puts them into a table called `animals_summary`. Read through the cell below and notice how this code replicates what you did earlier on. Run the cell to see the summary statistics of the `animals` table.

In [None]:
# scipy.stats was imported as `sp` at the top of the notebook
genome_sizes = [np.mean(animals['Size']), np.median(animals['Size']), np.std(animals['Size']), sp.iqr(animals['Size'])]
genes = [np.mean(animals['Genes']), np.median(animals['Genes']), np.std(animals['Genes']), sp.iqr(animals['Genes'])]
proteins = [np.mean(animals['Proteins']), np.median(animals['Proteins']), np.std(animals['Proteins']), sp.iqr(animals['Proteins'])]

animals_summary = Table().with_columns(
"Summary Statistic", ["Mean", "Median", "Standard Deviation", "IQR"],
"Genome size", genome_sizes,
"Genes", genes,
"Proteins", proteins)
animals_summary.show()


<font color = #d14d0f>**QUESTION 34**:</font> **In two to three sentences, summarize what similarities or differences you notice between the summary statistics of the genome sizes, gene counts, and proteins counts from the `animals_summary` table above.**


**Answer:**

Double tap on this cell to edit the text. Replace this with your answer.


**Run the cell below** to see the `pathogens_summary` table again.

In [None]:
pathogens_summary.show()


<font color = #d14d0f>**QUESTION 35**:</font> **In two to three sentences, write what you notice between the pathogens and animals summary tables. Why are there these differences?**


**Answer:**

Double tap on this cell to edit the text. Replace this with your answer.

---
## 7. C-Value Paradox


<font color = #d14d0f>**QUESTION 36**:</font> **Read the following article titled ["The size of the genome and the complexity of living beings"](https://metode.org/issues/monographs/the-size-of-the-genome-and-the-complexity-of-living-beings.html). What is the C-value paradox?**


**Answer:**

Double tap on this cell to edit the text. Replace this with your answer.


<font color = #d14d0f>**QUESTION 37**:</font> **Choose two organisms from `species`, `pathogens`, and/or `animals` that exhibit the C-value paradox.** Feel free to use the code cell below to look at the tables with `<table_name>.show()`. **In the text box below, write the scientific names of the two organisms that you're comparing.**


In [None]:
### Optional: Use this cell for code to view data -- write your answer in the cell below this

**Answer:**

Double tap on this cell to edit the text. Replace this with your answer.


<font color = #d14d0f>**QUESTION 38**:</font> **Create a scatter plot to visualize the comparison. Include `plt.title("___ vs ___")` as the last line in your code cell with a descriptive title so we can understand your visualization.**
>*Hint:*<br>- Use the scatter plots we made above for reference!<br>- If you feel that a summary statistics table would help explain your argument, feel free to use either of the two already created: `pathogens_summary` or `animals_summary`.



In [None]:
## YOUR CODE HERE
plt.title(...)


<font color = #d14d0f>**QUESTION 39**:</font> **Write one paragraph interpreting your visualization(s) and explaining how it demonstrates the C-value paradox.**

**Answer:**

Double tap on this cell to edit the text. Replace this with your answer.


<font color = #d14d0f>**QUESTION 40**:</font> **In one to two sentences, hypothesize why a less complex organism may have a bigger genome size.**


**Answer:**

Double tap on this cell to edit the text. Replace this with your answer.

---
## 8. Conclusion

Over the course of this notebook, you:
* Explored genome size data in various species.
* Used summary statistics to understand data distributions.
* Created visualizations such as histograms and scatter plots.
* Investigated the C-value paradox and compared genome sizes across different organisms.

### Congratulations! You have finished the notebook! ##

***
## *Optional Exercise*

### Comparing pathogens and animals

In [None]:
# We need to merge the tables before comparing.

# First, add a new column named 'Type' to each table which says either 'Pathogens' or 'Animals'
# Hint: table.append_column(column_name, word)



# Now define merged by setting it equal to pathogens. Then use the table.append(table2) syntax to merge the tables


In [None]:
# Make the normalized pivot histogram of genome sizes with 50 bins


The scaling is kind of distorted because of the wide range of values. Perhaps a log transformation would make things look better?
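
One way to see why a log scale helps: genome sizes that span several orders of magnitude collapse onto a short, evenly spaced range once logged. Since 1 Mb is 10^6 base pairs, the log10 of a size in base pairs equals the log10 of the size in Mb plus 6. The sketch below checks that identity on made-up sizes using only the standard library.

```python
import math

# Hypothetical genome sizes in megabases (made up), spanning five orders of magnitude
sizes_mb = [0.01, 1.0, 100.0, 1000.0]

# log10 of the size in base pairs, computed from megabases
log_bp = [math.log10(s) + 6 for s in sizes_mb]

# The same quantity computed directly from base pairs, for comparison
direct = [math.log10(s * 1_000_000) for s in sizes_mb]

for mb, value in zip(sizes_mb, log_bp):
    print(mb, round(value, 3))
```

After the transformation the values run from about 4 to 9, so a histogram no longer has to stretch across a range where the largest value is 100,000 times the smallest.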

In [None]:
# Append a new column named 'Log size'
# Use np.log10 to take the log10 of the 'Size' column. Remember that these are in Mb, so you should add 6 after taking the log!



In [None]:
# Make the normalized pivot histogram of the 'Log size' column with 50 bins
