## Learning outcomes for this notebook

* Be able to plot a histogram of a variable from a pandas DataFrame
* Be able to create new variables from existing ones
* Be able to set the width and range of bins in matplotlib histograms
* Understand that calculations on computers can produce numbers with meaningless digits after the decimal point, and that such numbers need to be rounded for presentation.


## Presenting and visualising different types of data

You now know how to read in datasets from CSV files using the pandas library and how to plot and annotate histograms using matplotlib. We now go systematically through the types of graphs and tables you would normally use to present data for different combinations of variable types. The types of datasets we will look at are:
* A single numerical variable
* A single categorical variable
* Two numerical variables
* Two categorical variables
* A categorical variable and a numerical variable
* Two categorical variables and a numerical variable
* A categorical variable and two numerical variables

In addition, we'll also introduce some of the other functionality of pandas essential for analysing data.

## Displaying a single numerical variable

Histograms are the main method of displaying numerical data of one variable. 

### Darwin's cross-fertilised and self-fertilised plants

We start with a classic experiment performed by Charles Darwin. Darwin wanted to find out if cross-fertilised corn plants grew with greater vigour than self-fertilised corn plants. Pairs of seedlings of the same age, one produced by cross-fertilisation and the other by self-fertilisation, were grown together so that members of each pair were reared under nearly identical conditions. The final heights of all the plants were measured to the nearest millimetre.

![](images/cross_self_fertilisation.jpg)

<div class="alert alert-danger">
Before examining the data, in one or two sentences write in the next cell whether you expect there to be, or not to be, a difference in heights between cross and self-fertilised corn plants. Give a reason for your answer. 
</div>

>Write your answer here.

Now let's examine Darwin's data presented in Table 1.

Pair | Cross-fertilisation | Self-fertilisation
--- | :---: | :---: 
1 | 59.7 | 44.2
2 | 30.5 | 51.8
3 | 53.3 | 50.8
4 | 55.9 | 50.8
5 | 48.5 | 46.7
6 | 54.6 | 47.2
7 | 56.1 | 47.2
8 | 51.8 | 38.9
9 | 46.5 | 41.9
10 | 54.9 | 45.7
11 | 59.2 | 41.4
12 | 53.3 | 45.7
13 | 56.1 | 32.5
14 | 58.4 | 39.4
15 | 30.5 | 45.7

<center><b>Table 1.</b> Plant heights (cm) produced by cross and self-fertilisation, Darwin (1876).</center><br>

Each row in the table represents a pair of plants. The first column contains a numerical label for each pair, the second column contains the height of cross-fertilised plants and the third column contains the height of self-fertilised plants.

Notice that we include the units of centimetres (cm) in the table caption. This is good practice whenever you create a table: first, other people then know what the units of measurement are, and second, the table isn't cluttered by "cm" after each and every number.


We are not interested in the heights themselves, but rather the differences in heights for each pair. **Do you know why?**

The file containing the data on Darwin's plant heights is called [`darwin.csv`](darwin.csv). 

<div class="alert alert-danger">

* In the code cell below, read in the dataset from the file `'darwin.csv'` and name it `heights`. 
* Print the dataset with the command `print(heights)` to see what it looks like.

Note: don't forget to import pandas and matplotlib in this notebook
</div>
<br>

The first row of `heights` contains the variable names. In this case they are `'Cross'` and `'Self'`, taken directly from the header in the CSV file.

There are three columns. The first column is an index, or row number, for each pair of plants. Recall that Python indices start from 0 and not from 1. The second column contains the heights of the cross-fertilised plants and the third column contains the heights of the self-fertilised plants.
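
As a minimal sketch of what reading and printing a dataset looks like, the snippet below reads a tiny in-memory CSV instead of the real file. The layout (a header line `Cross,Self` followed by one line per pair) is an assumption about `darwin.csv`, the three rows are pairs 1 to 3 from Table 1, and the name `example` is hypothetical.

```python
import io

import pandas as pd

# Assumed layout of the file: a header line, then one line per pair of plants.
csv_text = """Cross,Self
59.7,44.2
30.5,51.8
53.3,50.8
"""

# io.StringIO lets pd.read_csv treat the string as if it were a file on disk.
example = pd.read_csv(io.StringIO(csv_text))

print(example)                # pandas adds the index column 0, 1, 2 on the left
print(list(example.columns))  # the variable names taken from the header line
```

The index column is created by pandas when the data are read in; it is not stored in the CSV text itself.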

### Creating a new variable from existing variables

You may have noticed that we have two, rather than one, numerical variable in this dataset. These are `'Cross'` and `'Self'`. However, as mentioned earlier, we are not interested in the plant heights themselves, but rather the differences in heights of each pair. So the numerical variable of interest is the "difference in heights". 

Fortunately, pandas makes it easy to create new variables from existing ones. The following command creates a new numerical variable called `'Difference'` from the existing variables `'Cross'` and `'Self'`:

```python
heights['Difference'] = heights['Cross'] - heights['Self'] 
```
or, equivalently, using the pandas shortcut method
```python
heights['Difference'] = heights.Cross - heights.Self 
```

(Note that you can't do `heights.Difference = heights.Cross - heights.Self` because the shortcut only works for variables that already exist. `'Cross'` and `'Self'` already exist but `'Difference'` doesn't, so pandas would store the result as an ordinary attribute of `heights` rather than as a new column. You must therefore use brackets and quotes when you create `'Difference'`, but thereafter you can drop them and use the shortcut method.)

This command creates a new column in your dataset with the header `'Difference'` and fills that column with the difference in height between the cross-fertilised and self-fertilised plant of each pair.

(You can do many other things like adding, multiplying, dividing, taking the square root, and so on, just as you would for actual numbers.)
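
As a rough illustration of these other operations, the sketch below builds a small made-up DataFrame and derives three new variables from it. The name `plots`, its column names and its values are all hypothetical and are not part of any course dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements, not taken from any of the course datasets.
plots = pd.DataFrame({"Width": [2.0, 3.0, 4.0], "Length": [8.0, 12.0, 9.0]})

# Each calculation is applied row by row and stored as a new column.
plots["Area"] = plots["Width"] * plots["Length"]    # multiplying
plots["Ratio"] = plots["Length"] / plots["Width"]   # dividing
plots["RootArea"] = np.sqrt(plots["Area"])          # taking the square root

print(plots)
```

In each case the brackets-and-quotes form is used on the left-hand side because the column is being created for the first time.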

<div class="alert alert-danger">

In the code cell below:
1. Create a new variable called `'Difference'` from the existing `'Cross'` and `'Self'` variables.
2. Print out the revised DataFrame.
3. Plot a histogram of the difference in plant heights, label the axes and add a title.
4. Looking at the histogram, do you consider that cross-fertilised plants grow with greater vigour than self-fertilised plants? Explain your reasoning.
</div>
<br>

### Rounding numbers

<div class="alert alert-danger">

Darwin's original measurements were in inches. 
* In the code cell below, create a new variable of height differences with units of inches from the existing variable which has units of cm. Make sure you name the variable something informative.
* Print the revised DataFrame.
</div>
<br>

Notice that the new variable is printed with several more decimal places than the other variables, which are printed to only one decimal place. Given that plant heights were measured to a precision of 1 mm, there is no point reporting height differences to a thousandth of an inch or finer. Those extra digits have no meaning, and anything that has no meaning should not be reported. Python does not know that these extra digits are meaningless; it simply prints each number to a default number of decimal places. So we have to tell Python to format the numbers in a way that is meaningful. In this case the new variable in inches should be reported to one decimal place.

To round numbers to *n* decimal places use the `round(n)` method. To round **all** floats in a DataFrame use

```python
print(DataFrame.round(n))
```

or to round floats in just one variable use

```python
print(DataFrame['variable'].round(n))
```
or 
```python
print(DataFrame.variable.round(n))
```


<div class="alert alert-info">

**Note**: In these examples we've used the generic terms `DataFrame` and `'variable'`. You should replace these with the names of your DataFrame and variables (e.g., `print(heights.Difference.round(1))`). From now on in the notebooks we will use these generic terms to demonstrate pandas functions and you should replace them with the names of the DataFrame and the variables in each example.
</div>
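
To make the effect of rounding concrete, here is a minimal sketch on made-up data. The DataFrame `lengths`, its column names and its values are hypothetical; the only fact relied on is that 1 inch is 2.54 cm.

```python
import pandas as pd

# Hypothetical lengths in cm, converted to inches (1 inch = 2.54 cm).
lengths = pd.DataFrame({"Length_cm": [15.5, 1.5, 10.0]})
lengths["Length_in"] = lengths["Length_cm"] / 2.54

print(lengths)                        # unrounded: extra digits in the inches column
print(lengths.round(1))               # every float column rounded to 1 decimal place
print(lengths["Length_in"].round(1))  # just one variable rounded

# round() returns a rounded copy; the values stored in `lengths` are unchanged.
```

Because `round()` returns a copy, the full-precision values remain available for any further calculations, and only the printed output is rounded.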

<div class="alert alert-danger">
In the code cell above, print the revised DataFrame to 1 decimal place.
</div>
<br>

### Running speeds of spiders

Male spiders in the genus *Tidarren* are tiny, weighing only about 1% as much as females. They also have disproportionately large pedipalps, copulatory organs that make up about 10% of a male’s mass. (See image; the pedipalps are indicated by arrows.) Males load the pedipalps with sperm and then search for females to inseminate. Astonishingly, male *Tidarren* spiders voluntarily amputate one of their two organs, right or left, just before sexual maturity.

![](./images/03_ex_02.jpg)

Why do they do this? Ramos *et al.* (2004) suggested that perhaps speed is important to males searching for females, and amputation increases running performance. They used video to measure running speed of males on strands of spider silk before and after voluntary amputation. The running speeds (in cm/s) are in the file [`tidarren.csv`](tidarren.csv).

<div class="alert alert-danger">

Read in this dataset and plot a clearly labelled histogram of the **difference** between before and after amputation running speeds.
<br>

**NOTE**: When you read in a dataset from a file DO NOT name it `DataFrame`. That is, do not do this

```python
DataFrame = pd.read_csv('tidarren.csv')
```

Call the dataset something useful so that you, and other people, know what the dataset contains. 
</div>
<br>

<div class="alert alert-danger">
Do these data support the hypothesis of Ramos *et al.*? Explain your reasoning in the cell below.
</div>

> Write your answer here.

## The lengths of human genes

The international Human Genome Project was the largest coordinated research effort in the history of biology. It yielded the DNA sequence of all 23 human chromosomes, each consisting of millions of nucleotides chained end to end. These encode the genes whose products - RNA and proteins - shape the growth and development of each individual. The file [`human_genes.csv`](human_genes.csv) contains the lengths of all 20,290 known and predicted genes of the published genome sequence (Hubbard *et al.* 2005). The length of a gene refers to the total number of nucleotides comprising the coding regions.

![](images/dna.jpg)

<div class="alert alert-danger">
In the code cell below, read in the human gene dataset, call it something appropriate and print it to see what the dataset looks like. 
</div>
<br>

You should have noticed that the variable name is `geneLength`.
<div class="alert alert-danger">
Plot a clearly labelled histogram of human gene length.
</div>

### Changing the number and width of histogram bins

The histogram extends out to 100,000 nucleotides (nt) but doesn't appear to show any genes between 20,000 and 100,000 nt. This is because the few very long genes are so rare that their bars are too short to be seen in this histogram. The longest human gene, with nearly 100,000 nt, encodes the gigantic protein titin, which is expressed in heart and skeletal muscle. The protein was named for the titans of Greek mythology, giants who ruled the earth until overthrown by the Olympians. Some mutations in the *titin* gene cause heart muscle disease and muscular dystrophy.

These few very long genes make the bulk of the histogram bunch up to the left, hiding details of the shorter genes. Matplotlib automatically calculates the number of bins and their width for you: by default it divides the range of the data into 10 bins of equal width. In this case the bin width is about 10,000 nt.
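
One way to check which bins matplotlib chose is to look at the values that `plt.hist()` returns. The sketch below does this for made-up data with a single extreme value; the array `values` is hypothetical and is not the gene length dataset.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical skewed data: many small values plus one very large one.
rng = np.random.default_rng(0)
values = np.append(rng.uniform(0, 1000, size=500), 100000)

# plt.hist returns the bin counts, the bin edges and the drawn bars.
counts, edges, bars = plt.hist(values)

print(len(counts))     # number of bins matplotlib used
print(np.diff(edges))  # width of each bin
print(counts)          # almost everything falls in the first bin
```

A single extreme value stretches the bins so much that all the other values land in the first bin, which is the same bunching-up effect described above.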

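Rather than let matplotlib set the bins for you, you can set the bins manually. This is done with the `bins=range(start, end, width)` argument in `plt.hist()`, where `start`, `end` and `width` are integers. The numbers produced by `range` are used as the bin edges. Note that `range` stops before it reaches `end`, so `end` must be slightly larger than the last bin edge you want. Values that lie beyond the last bin edge are left out of the histogram.

For example, to plot a histogram with bins of width 100 ranging from 0 to 20,000, you would use the command
```python
plt.hist(DataFrame.geneLength, bins=range(0, 20001, 100))
```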
<div class="alert alert-danger">

Try this in the code cell below.
</div>
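
Because Python's `range` stops before its end value, it is worth checking which bin edges you actually get. The sketch below compares two choices of `end` on made-up data; the array `lengths` and the names `edges_short` and `edges_full` are hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical lengths spread evenly between 0 and 30,000.
rng = np.random.default_rng(1)
lengths = rng.uniform(0, 30000, size=2000)

edges_short = list(range(0, 20000, 100))  # stops before 20000
edges_full = list(range(0, 20001, 100))   # end raised by 1 so 20000 is kept

print(edges_short[-1], edges_full[-1])    # the last bin edge in each case

counts, edges, bars = plt.hist(lengths, bins=edges_full)
print(len(counts))        # there is one fewer bin than there are edges
print(int(counts.sum()))  # values beyond the last edge are not counted
```

The total count is smaller than the number of values because everything above the last bin edge is dropped from the plot, so a histogram with manual bins shows only part of the data.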

<div class="alert alert-danger">

* Play around with some other values of `start`, `end` and `width` to examine this dataset in more detail. 
* You should notice some intriguing peaks at lower gene lengths. Try googling to see if there are any explanations for these peaks.
</div>
<br>

### References

Darwin, C. (1876). *Effects of Cross and Self-fertilization in the Vegetable Kingdom.* London: John Murray.

Ramos, M., *et al.* (2004). Overcoming an evolutionary conflict: removal of a reproductive organ greatly increases locomotor performance. *Proc. Natl. Acad. Sci. USA* **101**:4883-4887.

Hubbard, T., *et al.* (2005). Ensembl 2005. *Nucleic Acids Res.* **33**:D447-D453.