# Investigating evolutionary relationships among histone proteins through sequence and cluster analysis

Scientists often deal with large datasets in molecular genetics. For example, we might go online and run a BLAST search to identify a protein sequence. The default BLAST parameters limit the amount of data we receive to 100 sequences. In many cases this is more than sufficient, and we can stay out of the "Big Data" weeds, so to speak.

However, protein (and nucleotide) sequences are rich with information, and there is much knowledge to be gained about evolutionary relationships between proteins using Big Data. Processing large datasets often requires specialized software packages. 

In this project, we will:

    1. Handle, filter, and clean large sequence datasets
    2. Learn about and use the Basic Local Alignment Search Tool (BLAST)
    3. Create interactive sequence similarity networks from BLAST results to investigate sequence relationships
    4. Generate new knowledge using sequence similarity networks


## A (very) brief introduction to Jupyter Notebooks, Jupyter Lab, and Python


Jupyter notebooks provide an interactive, web-based programming environment. This format provides an easy way to access a variety of Big Data programming tools with instructions (like this section) and coding cells (in which you enter code). 

This notebook is based on the Python 3 programming language. Let's type some Python 3 code in the coding cell below:

<font color=blue><b>STEP 1:</b></font> Type in the code provided by Dr. Schurko. While the code is running you will see an asterisk appear in the brackets, e.g.: [\*]. When it is completed, a number appears, e.g. [1].

<b>Congratulations!</b>

<font color=blue><b>STEP 2:</b></font> Let's build on that by adding your name to the end of the statement. Edit the code in the coding cell above and re-run it. When you do, you will see the new output below that box, and the number in brackets on that line increases by 1.

If this is your first time writing and running some Python code, congratulations! 

<font color=blue><b>STEP 3:</b></font> Let's edit that code further to make it print out that statement three times. This code will create a loop and run it over the range given.
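
As a minimal sketch of what such a loop can look like: the statement text below is a hypothetical placeholder and may differ from the one used in class. `range(3)` produces the values 0, 1, and 2, so the indented line runs three times.

```python
# Hypothetical statement; replace it with the one you typed in STEP 1 and STEP 2.
statement = "Hello, histones!"

# range(3) yields 0, 1, 2, so the loop body runs three times.
for i in range(3):
    print(statement)
```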


<font color=blue><b>STEP 4:</b></font> Before we move on, change the range in the code above to 10 and run it again. Even more awesomeness!
***

<font color=blue><b>STEP 5:</b></font> Now, let's do some calculations. In the coding cell below, type in a calculation as discussed by Dr. Schurko:

<font color=blue><b>STEP 6:</b></font> To put this to better use, we can have Python do calculations for us. We will assign some variables to calculate how to dilute a stock concentration to a working concentration using the C1V1 = C2V2 formula. In the cell below, assign variables for C1, V1 and C2. We will then solve for V2:

0.01


<font color=blue><b>STEP 8:</b></font> Great! Now, in the space below let's put our solution for V2 in the form of a sentence:

The volume you need to add is 0.01
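
The dilution steps above can be sketched roughly as follows. The numbers here are illustrative assumptions, not the values used in class, so the result will differ from the output shown above. The only requirement is that C1 and C2 share one unit and V1 and V2 share another.

```python
# Illustrative values only (assumed, not the ones from class).
C1 = 2.0    # stock concentration, e.g. mg/mL
V1 = 0.25   # volume of stock, e.g. mL
C2 = 0.5    # working concentration, same unit as C1

# Rearranging C1 * V1 = C2 * V2 to solve for V2
V2 = (C1 * V1) / C2

# Put the result into a sentence, as in STEP 8
print("The calculated V2 is", V2)
```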


From these simple exercises, we now have a basic idea for how to run working code with minimal guidance.



<b>Now onto today's project</b>

In this exercise we will use the histone protein sequence files that we downloaded last week from UniProt. You should have one text file that contains all of the histone sequences in FASTA format (there should be >8400 FASTA sequences in your file). Today, we will use NCBI BLAST to generate E-values for visualization, CD-HIT to reduce the size of our dataset in a rational way, and Cytoscape to view sequence similarity networks.
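
One quick way to check that your file holds the expected number of sequences is to count the header lines, since every FASTA record begins with a line starting with `>`. This is only a sketch: the file name `histones.fasta` and the helper `count_fasta_records` are placeholders introduced here, so substitute the name of your own file.

```python
from pathlib import Path


def count_fasta_records(path):
    """Count FASTA records by counting header lines, which start with '>'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


# "histones.fasta" is a placeholder name; use your own file name here.
fasta_path = Path("histones.fasta")
if fasta_path.exists():
    print(count_fasta_records(fasta_path), "sequences found")
else:
    print(f"{fasta_path} not found; upload your file first")
```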

While it is possible to install most of this software on any computer, it can be very difficult to get everything working. This is why we are running this Jupyter notebook in Binder, a <i>remote</i> environment that I have set up. In this environment all the needed software is already installed, and we can easily upload files to work with!
    
## Some important ideas for using Jupyter Notebooks in Binder

    1. Please ensure the code in a cell has finished running before moving on to the next one!
    2. If you accidentally delete or mangle some code, use the Edit -> Undo function!
    3. If you need to stop some code from running in a cell use the Stop button (next to the Run button).
    4. If you don't interact with Binder for too long, you may lose the kernel and need to restart it.
    5. If you are away for a long time, you may need to relaunch your Binder, which means you might need to regenerate any data that you did not download!
***


In [None]:
!clustalo -i <<your_file_name>> --outfmt='msf' -o files/8_histone.msf

<font color=blue><b>STEP 5:</b></font> Find the 8_histone.msf file in the files tab and double click on it to open it. The symbol '~' indicates that a particular sequence does not have additional amino acids at the N or C terminus. A '.' means there is no amino acid in that sequence at that position, also called a gap.
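
To make the two symbols concrete, here is a rough sketch that tallies '~' and '.' characters per sequence. It assumes the usual MSF layout (header lines beginning with `Name:`, a `//` separator, then rows of a sequence name followed by blocks of aligned residues). The helpers `parse_msf` and `count_gaps` and the two short example sequences are made up for illustration and are not part of Clustal Omega.

```python
def parse_msf(text):
    """Return {sequence name: aligned sequence} from MSF-formatted text."""
    sequences = {}
    in_alignment = False
    for line in text.splitlines():
        stripped = line.strip()
        if not in_alignment:
            if stripped.startswith("Name:"):
                # Header lines look like: "Name: seqA  Len: 12  Check: 0  Weight: 1.0"
                sequences[stripped.split()[1]] = ""
            elif stripped == "//":
                in_alignment = True  # "//" separates the header from the alignment
            continue
        parts = stripped.split()
        # Skip blank lines and anything that does not start with a known name
        if parts and parts[0] in sequences:
            sequences[parts[0]] += "".join(parts[1:])
    return sequences


def count_gaps(sequences):
    """Count terminal ('~') and internal ('.') gap symbols per sequence."""
    return {
        name: {"~": seq.count("~"), ".": seq.count(".")}
        for name, seq in sequences.items()
    }


# Made-up two-sequence alignment used only to exercise the functions.
example = """\
PileUp

   MSF: 12  Type: P  Check: 0  ..

 Name: seqA  Len: 12  Check: 0  Weight: 1.0
 Name: seqB  Len: 12  Check: 0  Weight: 1.0

//

seqA    ~~MSGRGKGG KA
seqB    MAMSG.GKGG KA
"""

for name, counts in count_gaps(parse_msf(example)).items():
    print(name, counts)
```

To try this on your own alignment, read the contents of files/8_histone.msf into a string and pass it to `parse_msf` in place of `example`.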

<font color=blue><b>STEP 6:</b></font> Use the msf to answer the following questions:

    This should be similar to the result we observed last week. What is different about the histone H2A sequence from the rotifer Adineta vaga?


<font color=blue><b>STEP 7:</b></font> When you are done, open Notebook "2 - Sequences_and_BLAST" by double clicking on the file in the left hand panel. The code cells in 2 - Sequences_and_BLAST already contain the needed code. To keep you engaged and on your toes, you will have to edit that code in specific (and hopefully clearly annotated) ways.

