# Notebook 12.1: Hi-C scaffolding

### Learning objectives:
By the end of this notebook you should:
+ Understand the concepts behind Hi-C scaffolding.


### Assigned reading:

+ Lightfoot, D. J., D. E. Jarvis, T. Ramaraj, R. Lee, E. N. Jellen, and P. J. Maughan. 2017. “Single-Molecule Sequencing and Hi-C-Based Proximity-Guided Assembly of Amaranth (Amaranthus Hypochondriacus) Chromosomes Provide Insights into Genome Evolution.” BMC Biology 15 (August): 74. https://doi.org/10.1186/s12915-017-0412-4.


+ Burton, Joshua N., Andrew Adey, Rupali P. Patwardhan, Ruolan Qiu, Jacob O. Kitzman, and Jay Shendure. 2013. “Chromosome-Scale Scaffolding of de Novo Genome Assemblies Based on Chromatin Interactions.” Nature Biotechnology 31 (12): 1119–25. https://doi.org/10.1038/nbt.2727.


+ [See video](https://www.jove.com/video/1869/hi-c-a-method-to-study-the-three-dimensional-architecture-of-genomes) (Please watch closely minutes 16-22; but feel free to watch it all.)
    

In [4]:
import random
import toyplot
import numpy as np

### Chromosome conformation capture (3C)

Hi-C was first developed as a method for studying the conformation (structure and organization) of the genome within cells. Does the genome always fold in predictable ways? Do certain chromosomes always lie close together or far apart in the nucleus? These questions are hard to answer even with microscopy. The idea behind chromosome conformation capture was to develop a sequence-based method that could tell us about the spatial organization of the genome. It decodes that organization from *interactions* among parts of the genome, which occur at frequencies that correlate with the spatial distance between them. The 3C method was developed as a way to target individual loci and measure the distance between them within cells. The **Hi-C** method is an advance on the original 3C method, extending it to a rapid, genome-wide process.
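The link between contact frequency and distance can be illustrated with a toy simulation. The sketch below is not part of the Hi-C protocol: it assumes each chromosome behaves like a simple 3D random walk, and the function name `simulate_contacts`, the number of loci, the number of cells, and the distance threshold are all hypothetical choices made for illustration. It counts how often pairs of loci fall within the threshold across many simulated cells, then averages those counts by separation along the chain.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_contacts(n_loci=50, n_cells=200, threshold=2.0):
    """Count how often each pair of loci lies within `threshold` in 3D."""
    contacts = np.zeros((n_loci, n_loci), dtype=int)
    for _ in range(n_cells):
        # toy chromosome (an assumption): a 3D random walk with Gaussian steps
        steps = rng.normal(size=(n_loci, 3))
        coords = np.cumsum(steps, axis=0)
        # pairwise Euclidean distances between all loci
        diffs = coords[:, None, :] - coords[None, :, :]
        dists = np.sqrt((diffs ** 2).sum(axis=2))
        # a pair is "in contact" in this cell if it is closer than threshold
        contacts += (dists < threshold).astype(int)
    return contacts

contacts = simulate_contacts()

# mean contact count for each separation along the chain (1, 2, 3, ... loci)
n_loci = contacts.shape[0]
mean_by_sep = [np.diagonal(contacts, offset=k).mean() for k in range(1, n_loci)]

# compare neighbors, loci 10 apart, and loci 40 apart
print(mean_by_sep[0], mean_by_sep[9], mean_by_sep[39])
```

In this toy model, loci that sit close together on the chain are counted in contact far more often than distant ones, which is the kind of signal Hi-C data are used to read.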


### Step 1: 

Hi-C couples chromosome conformation capture (3C) with deep sequencing to reveal regions of chromatin that are in close spatial proximity in the nucleus.

<blockquote>
    We developed Hi-C, an extension of 3C that is capable of identifying long range interactions in an unbiased, genome-wide fashion. In Hi-C, cells are fixed with formaldehyde, causing interacting loci to be **bound to one another** by means of covalent DNA-protein cross-links. When the DNA is subsequently **fragmented with a restriction enzyme**, these loci remain linked.
</blockquote>

Pipeline tools for processing Hi-C data include the [Juicer software](https://github.com/aidenlab/juicer) and [HiCExplorer](https://hicexplorer.readthedocs.io/en/latest/content/list-of-tools.html).

### Restriction enzymes 
Restriction enzymes are proteins that cut double-stranded DNA at or near a specific short recognition sequence.

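The HindIII enzyme recognizes the 6-bp site AAGCTT and cuts after the first A (A^AGCTT), leaving a 4-base AGCT overhang. See the [NEB product information](https://www.neb.com/products/r0104-hindiii#Product%20Information).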
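As a rough illustration of what cutting at AAGCTT means in practice, the sketch below locates every HindIII site in a sequence and reports the cut position one base into the site. The helper name `find_cut_positions` and the constants `SITE` and `CUT_OFFSET` are hypothetical names introduced here. It also compares the number of sites in a random sequence with the simple expectation of one site per 4^6 = 4096 bases, which assumes equal base frequencies.

```python
import random

random.seed(12345)

SITE = "AAGCTT"   # HindIII recognition site
CUT_OFFSET = 1    # the cut falls after the first A: A^AGCTT

def find_cut_positions(sequence, site=SITE, cut_offset=CUT_OFFSET):
    """Return the index of every cut position along the sequence."""
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    return positions

# small hand-made example with two sites, starting at indices 2 and 12
toy = "GGAAGCTTCCCCAAGCTTGG"
print(find_cut_positions(toy))   # [3, 13]

# random 1 Mb sequence: observed site count versus length / 4**6
genome = "".join(random.choice("ACGT") for _ in range(1_000_000))
observed = len(find_cut_positions(genome))
expected = len(genome) / 4 ** len(SITE)
print(observed, round(expected))
```

Real genomes are not random, so the observed spacing of sites in real data can differ substantially from this expectation.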

In [2]:
def random_sequence(length):
    return "".join([random.choice("ATGC") for i in range(length)])

In [3]:
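def restriction_digest(sequence, recognition, overhang):
    # split the sequence at every occurrence of the recognition site
    pieces = sequence.split(recognition)

    # the cut falls inside the recognition site: the bases left of the cut
    # (overhang) stay on the upstream fragment, the rest start the next one
    remainder = recognition[len(overhang):]
    fragments = []
    for idx, piece in enumerate(pieces):
        left = remainder if idx > 0 else ""
        right = overhang if idx < len(pieces) - 1 else ""
        fragments.append(left + piece + right)
    return fragments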

In [4]:
# generate a 10Mb genome
seq = random_sequence(10000000)

# digest the genome at every HindIII site
fragments = restriction_digest(seq, "AAGCTT", "A")

# plot distribution of fragment lengths


In [5]:
# get fragment sizes and bin them into a histogram
# np.histogram returns a (counts, bin_edges) tuple, so unpack it
counts, edges = np.histogram([len(i) for i in fragments], bins=50)

# plot the counts against the left edge of each bin
toyplot.fill(edges[:-1], counts)
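A quick sanity check on a digest like this is to compare the fragment lengths with what a random sequence should give. The sketch below is a separate, self-contained version that works on an integer-encoded genome for speed. The encoding, the 2 Mb size, and the variable names are assumptions made for illustration. Under the assumption of equal base frequencies, sites occur about once every 4^6 = 4096 bases, so fragment lengths should be roughly geometrically distributed with a mean near 4096 and a median near 4096 × ln 2.

```python
import numpy as np

rng = np.random.default_rng(42)

# encode A, C, G, T as 0, 1, 2, 3 and draw a random 2 Mb genome
genome = rng.integers(0, 4, size=2_000_000)
site = np.array([0, 0, 2, 1, 3, 3])   # AAGCTT in the integer encoding

# a site starts at index i when all six bases match
n_windows = genome.size - site.size + 1
match = np.ones(n_windows, dtype=bool)
for offset, base in enumerate(site):
    match &= genome[offset:offset + n_windows] == base

# cut one base into each site (A^AGCTT) and measure fragment lengths
cuts = np.flatnonzero(match) + 1
boundaries = np.concatenate(([0], cuts, [genome.size]))
lengths = np.diff(boundaries)

# the fragments should account for every base in the genome
print(lengths.sum() == genome.size)

# compare the observed mean and median with the random-sequence expectation
print(lengths.mean(), 4 ** site.size)
print(np.median(lengths), 4 ** site.size * np.log(2))
```

Most fragments are much shorter than the longest ones, so a histogram of these lengths is strongly right-skewed.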