# Bio 302 Homework 8

**Due before class Thursday, April 21st**

This homework consists of the two problems in this notebook. You may find the reading in Chapter XX useful background.

**Submission**

Submit your solutions as a link to a Gist via a Slack DM.

### Problem 1

Recall that diploid organisms can have two kinds of genotypes on their
autosomal chromosomes: *homozygotes* and *heterozygotes*. You're
homozygous at a locus if you got the same allele from both mom and
dad, *e.g.* `A/A`; you're heterozygous at a locus if you have
different alleles, *e.g.* `A/G`.

Runs of homozygosity (ROHs) are longer-than-expected runs of
consecutive homozygotes along a chromosome. They can occur for a
number of reasons, and are most common in culturally or geographically
isolated populations. ROHs may lead to poor health, since they have
the potential to expose recessive disease alleles. For more on the connections
between ROHs, history and health, you can check out the following:
@nalls_measures_2009, @kirin_genomic_2010, @szpiech_long_2013.

Here you'll work with human genotyping-microarray files from the
[*Personal Genome Project*](http://www.personalgenomes.org/)
(PGP). PGP is an effort by some notable scientists and thinkers to
collect and publicize genetic and other health-related data. You can
read more about the impetus for the project on the site.

The genotype files have a simple structure, with one tab-delimited SNP
genotype per row. Each row consists of an `rsid`, `chromosome`,
`position`, and `genotype`. The rsid is a unique identifier for the
SNP marker. The other three fields are self-explanatory.
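A row in this format can be split into its four fields with a few lines of Python. The sketch below is one possible approach, not the required one. The function name `parse_row` is hypothetical, and skipping lines that start with `#` assumes the file opens with comment lines, so check your own file.

```python
def parse_row(line):
    """Split one genotype row into (rsid, chromosome, position, genotype).

    Returns None for blank lines and for comment lines starting with '#'
    (an assumption about the file header, not something the format promises).
    """
    line = line.strip()
    if not line or line.startswith('#'):
        return None
    rsid, chromosome, position, genotype = line.split('\t')
    # Positions are converted to int so they sort numerically later on.
    return rsid, chromosome, int(position), genotype


print(parse_row('rs4970383\t1\t838555\tAC'))
# → ('rs4970383', '1', 838555, 'AC')
```

Note that the chromosome stays a string, since the file also contains non-numeric chromosomes such as `X` and `Y`.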

Write a function called `find_roh` that identifies the
longest autosomal ROHs observed in a given human's genotype data. This
function will take two arguments. The first argument is the name of the
genome data file. The second argument is
the number of ROHs to report. What the function should return is
illustrated below.

You should filter out ungenotyped SNPs; those will show genotype
`--`. You should probably proceed one chromosome at a time, and you
should not assume that the markers are sorted. That is, sort the
markers by position. You'll need to look at chromosomes 1 through 22
inclusive.
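The filtering and sorting steps just described might look something like the sketch below. It assumes the rows have already been parsed into `(rsid, chromosome, position, genotype)` tuples; the name `group_by_autosome` and the toy rows are made up for illustration.

```python
from collections import defaultdict

AUTOSOMES = {str(n) for n in range(1, 23)}  # chromosomes '1' through '22'


def group_by_autosome(rows):
    """Group (rsid, chromosome, position, genotype) rows by autosome.

    Drops no-calls ('--') and non-autosomal chromosomes, then sorts each
    chromosome's markers by position.
    """
    by_chrom = defaultdict(list)
    for rsid, chromosome, position, genotype in rows:
        if genotype == '--' or chromosome not in AUTOSOMES:
            continue
        by_chrom[chromosome].append((position, genotype))
    for markers in by_chrom.values():
        markers.sort()  # tuples compare on position first
    return dict(by_chrom)


# Toy rows, deliberately out of order.
rows = [
    ('rs2', '1', 846808, 'CC'),
    ('rs1', '1', 838555, 'AC'),
    ('rs3', '1', 854250, '--'),  # no-call, dropped
    ('rs4', 'X', 100, 'AA'),     # not an autosome, dropped
]
print(group_by_autosome(rows))
# → {'1': [(838555, 'AC'), (846808, 'CC')]}
```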

Once that's done, you'll want to look for the actual runs
themselves. Here's a stretch of consecutive markers from chromosome 1:

```
rs4970383       1       838555  AC
rs4475691       1       846808  CC
rs7537756       1       854250  AA
rs13302982      1       861808  GG
rs55678698      1       864490  CT {Heterozygote that stopped the last ROH.}
i6019299        1       871267  CC (Start of ROH.)
rs1110052       1       873558  TT
rs147226614     1       878697  GG
i6052728        1       878697  GG
i6019302        1       881843  GG
rs2272756       1       882033  GG ... 11 consecutive homozygotes
rs67274836      1       884767  GG
i6019303        1       888554  CC
rs13302945      1       889159  CC
i6019304        1       889182  GG
i6019305        1       891343  GG (End of ROH.)
rs13303106      1       891945  AG {Heterozygote that stopped the ROH.}
i6019306        1       894379  GG
rs13303010      1       894573  AG
i6019308        1       897792  CC
```

So, if this were among the longest ROHs in the whole genome, we would
create a record like this:

```
{'chrom': '1', 'start': 871267, 'end': 891343, 'length': 11}
```

Meaning, on chromosome 1, starting at position 871267 and ending at
position 891343, there was a run of 11 consecutive homozygotes.
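To make the bookkeeping concrete, here is a minimal sketch of a run scanner applied to the chromosome 1 stretch shown above. It is one way to do it, not the only one. The helper names `is_homozygous` and `find_runs` are hypothetical, the record keys follow the `find_roh` example output, and the markers are assumed to be sorted by position with no-calls already removed.

```python
def is_homozygous(genotype):
    """True for two-character genotypes with identical alleles, e.g. 'GG'."""
    return len(genotype) == 2 and genotype[0] == genotype[1]


def find_runs(chrom, markers):
    """Return one record per run of consecutive homozygotes.

    `markers` is a list of (position, genotype) pairs sorted by position.
    """
    runs = []
    start = end = None
    length = 0
    for position, genotype in markers:
        if is_homozygous(genotype):
            if length == 0:
                start = position  # first marker of a new run
            end = position
            length += 1
        elif length:
            # A heterozygote closes the run that was in progress.
            runs.append({'chrom': chrom, 'start': start, 'end': end, 'length': length})
            length = 0
    if length:  # a run may extend to the last marker
        runs.append({'chrom': chrom, 'start': start, 'end': end, 'length': length})
    return runs


markers = [
    (838555, 'AC'), (846808, 'CC'), (854250, 'AA'), (861808, 'GG'),
    (864490, 'CT'), (871267, 'CC'), (873558, 'TT'), (878697, 'GG'),
    (878697, 'GG'), (881843, 'GG'), (882033, 'GG'), (884767, 'GG'),
    (888554, 'CC'), (889159, 'CC'), (889182, 'GG'), (891343, 'GG'),
    (891945, 'AG'), (894379, 'GG'), (894573, 'AG'), (897792, 'CC'),
]
runs = find_runs('1', markers)
print(max(runs, key=lambda run: run['length']))
# → {'chrom': '1', 'start': 871267, 'end': 891343, 'length': 11}
```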

Your function should store each ROH in a dictionary as above, recording
the chromosome, the start and end positions, and the number of markers.
The function should return a list of the longest ROHs you found, ranked
by the number of markers within the ROH, up to the number
requested.
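Picking the longest few records out of a list is a small step of its own. The sketch below uses `heapq.nlargest` from the standard library; sorting the whole list and slicing would work just as well. The function name `longest_runs` and the toy records are hypothetical.

```python
import heapq


def longest_runs(runs, n):
    """Return up to `n` run records, most markers first."""
    return heapq.nlargest(n, runs, key=lambda run: run['length'])


# Toy records with made-up positions, just to show the ordering.
toy_runs = [
    {'chrom': '1', 'start': 100, 'end': 300, 'length': 3},
    {'chrom': '1', 'start': 500, 'end': 900, 'length': 11},
    {'chrom': '2', 'start': 50, 'end': 60, 'length': 1},
]
print([run['length'] for run in longest_runs(toy_runs, 2)])
# → [11, 3]
```

Asking for more records than exist simply returns all of them, which matches the "up to the number requested" wording.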

#### Example

This example uses data from the anonymous PGP human individual
[`huAB47EB`](https://my.pgp-hms.org/profile/huAB47EB). I downloaded
and unzipped her 23andMe genotype data (*her?* -- yep, her Y genotypes
are all no-calls...) from PGP,
[here](https://my.pgp-hms.org/user_file/download/1061), which yielded
a simple albeit long text file,
`genome_Lindgren_Rider_20140216145128.txt`.

```
> find_roh('genome_Lindgren_Rider_20140216145128.txt', 5)
[{'chrom': '16', 'end': 2223985, 'length': 585, 'start': 2085997},
 {'chrom': '17', 'end': 41636310, 'length': 481, 'start': 41085683},
 {'chrom': '7', 'end': 117305628, 'length': 470, 'start': 117235056},
 {'chrom': '17', 'end': 48274031, 'length': 402, 'start': 48111137},
 {'chrom': '7', 'end': 99827148, 'length': 330, 'start': 98788589}]
```

That is, I asked for the **five** longest ROHs in huAB47EB's
genome. The longest one was in a region on chromosome 16, from about
2.1 Mb to 2.2 Mb, spanning 585 genotyped SNPs. In other words, huAB47EB
appears to have a stretch of more than 100 kb in her genome without a
heterozygote!

Feel free to download some other individuals from PGP, perhaps getting
some folks of different ethnicities, to see if that length is par for
the course. If you happen to have been genotyped yourself, you can
download your own data and study yourself!

In [None]:
# Solutions to Problem 1 here.




### Problem 2

During most of the time period from 2000 to 2010 many, maybe even
most, of the world's human genetics researchers were engaged in a kind
of genetics gold rush -- a GWAS-rush. A GWAS, or genomewide
association study, searches for associations between some phenotype of
interest -- breast cancer, schizophrenia, body mass index -- and
genotyped markers, typically many hundreds of thousands of them. You
need some number of *cases*, people/organisms having the condition,
and some number of *controls*, those that don't. The signature that
you're looking for is an *association* -- a correlation between having
the condition and having some allele.

In this problem, you'll take a cloud-level view of the fruits of all
this GWAS research. The National Human Genome Research Institute, or
NHGRI, is the part of the NIH that funds human genomic research. NHGRI works
together with a European partner, EMBL's European Bioinformatics Institute (EMBL-EBI), to
maintain a central repository of all the genetic associations
discovered through GWAS. (There's a similar resource, FYI, for
Mendelian disorders: the Online Mendelian Inheritance in Man, or
OMIM.) You'll want to follow the links from EMBL's GWAS
[catalog page](https://www.ebi.ac.uk/gwas/), to download the full
catalog as a tab-delimited file. Follow the 'Download' link,
and choose the file that goes with "The most recent edition of all the associations in the GWAS Catalog".
It's pretty big, at 24,862 lines by my count, and
summarizes many years and many millions of dollars of work.
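Once downloaded, a tab-delimited file with a header row can be read with the standard `csv` module. The sketch below runs on two made-up rows rather than the real catalog, and the column names in it are assumptions, so compare them with the header of the file you actually download.

```python
import csv
import io

# Two made-up rows standing in for the real download. The column names
# are assumptions; check them against your file's header line.
sample = (
    'DATE\tDISEASE/TRAIT\tINITIAL SAMPLE SIZE\n'
    '2009-05-01\tBreast cancer\t1,000 cases, 1,000 controls\n'
    '2010-11-12\tBody mass index\t5,000 individuals\n'
)


def read_catalog(handle):
    """Read a tab-delimited file into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter='\t'))


associations = read_catalog(io.StringIO(sample))
print(len(associations))
# → 2
print(associations[0]['DISEASE/TRAIT'])
# → Breast cancer
```

For the real file you would pass an open file handle instead of the `io.StringIO` object, for example one obtained from `open(path, newline='', encoding='utf-8')`.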

Each of these 25K lines, apart from the header, corresponds to a
phenotype-genotype correlation that was discovered in a peer-reviewed
GWAS paper. For each question below, you need to (1)
make a matplotlib plot in your notebook, (2) write a single sentence
summarizing any pattern that you see, even if it seems obvious, and (3)
write one further sentence commenting on the meaning of this plot, the
*why* behind the pattern you're seeing.

1. How has the number of associations reported changed year-to-year?
   Here you need to look at how many associations there are for 2008,
   for 2009, and so on for each of the years present in the file, and
   make a reasonable plot with them.

1. What is the distribution of risk allele frequencies for these
   associations? Here you'll probably want to use a histogram. Allele
   frequencies, being frequencies, should be numbers between 0
   and 1. Not all the fields in here are numbers. Sometimes it's "NR",
   "not reported" (*Really? Then how'd they get published?*) -- ignore
   those. Sometimes there's both a number and a word -- in that case
   ignore the word and just use the number.

1. What is the distribution of "Initial Sample Size"s across these
   studies? Here you'll probably want to plot a histogram. You'll note
   that many of the fields in sample size are repeated, just because a
   single study can report multiple markers. Therefore, please only
   use one instance of any repeated entries from this column. You'll
   also notice that many of these entries are a mix of text and
   numbers, like "927 North Indian ancestry cases, 1,148 North Indian
   ancestry controls". You want to extract the numbers from such lines
   and add them up. Do mind the commas. In this example, you'll pop
   out that 927 and that 1148, and have Python add them together to
   get a total sample size of 2075.

1. With a bar chart, show the top ten phenotypes ("Disease/Trait"), in
   terms of the number of associations reported for them.

1. With a bar chart, show the top ten cancer phenotypes, in terms of
   the number of associations reported for them. So, subset to the
   phenotypes that contain the string "cancer".
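The text-cleaning steps in the allele-frequency, sample-size, and top-ten questions above can be sketched with regular expressions and `collections.Counter`. This is a rough starting point under stated assumptions, not a finished solution: the function names are hypothetical, the regular expressions only cover the simple cases described above, and matching "cancer" case-insensitively is a choice you may or may not want.

```python
import re
from collections import Counter


def total_sample_size(text):
    """Sum every integer in a sample-size description, ignoring commas."""
    numbers = re.findall(r'\d+(?:,\d{3})*', text)
    return sum(int(number.replace(',', '')) for number in numbers)


def parse_frequency(text):
    """Return the first number in a frequency field as a float, or None.

    Fields with no number at all, such as 'NR', give None.
    """
    match = re.search(r'\d*\.?\d+', text)
    return float(match.group()) if match else None


def top_counts(values, n=10):
    """Return the `n` most common values as (value, count) pairs."""
    return Counter(values).most_common(n)


print(total_sample_size(
    '927 North Indian ancestry cases, 1,148 North Indian ancestry controls'))
# → 2075
print(parse_frequency('0.25 (EA)'), parse_frequency('NR'))
# → 0.25 None

# Made-up trait names, just to show the subsetting and counting.
traits = ['Breast cancer', 'Height', 'Breast cancer', 'Prostate cancer']
cancers = [trait for trait in traits if 'cancer' in trait.lower()]
print(top_counts(cancers, 2))
# → [('Breast cancer', 2), ('Prostate cancer', 1)]
```

The resulting lists of numbers or `(value, count)` pairs can then be handed to matplotlib for the histograms and bar charts.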



In [None]:
# Solutions to Problem 2.


