### 500697 Genetic Analysis
# Introduction to Bioinformatics

### Dr Dave Lunt
d.h.lunt@hull.ac.uk


### In the cell below please double click then type your name and student ID

Student:

Although we encourage discussion and some forms of working together **you will each need to complete a notebook** such as this. You will need to use the `File` menu above to `Download As` PDF at the *end* of this practical class. You will then upload it to Canvas as your assessment. Please make sure that all your work is entered in this workbook.

## 0. General Introduction
In this practical class you will be introduced to Bioinformatics, the computational study of large scale DNA or amino acid data. 

Bioinformatics is a truly central discipline in all of biology. Scientists with bioinformatics skills are highly sought after, and the employment opportunities are superb.

**DON’T PANIC! YOU DO NOT NEED COMPUTATIONAL SKILLS TO CARRY OUT THIS ASSESSMENT. EVERYTHING YOU NEED TO KNOW YOU CAN LEARN IN THIS CLASS**

It is essential however that you engage with the exercises, reading them carefully, trying for yourself, and asking questions of your lab partners and the demonstrators.

We will introduce you to two key areas in bioinformatics:
1. using the unix command line; these are essential skills required for bioinformatic investigations
2. `blast` searching a DNA database; `blast` search is the most used of all biological software

The goals are that you will have the skills to begin a bioinformatics investigation, and have used the most important software tool in biology - `blast`

## Learning Outcomes
At the end of this class you should:
- Understand the utility of the command line
- Be able to navigate files and directories from the command line
- Perform `grep` searches on very large files
- Be able to perform a `blast` search from the command line and gain biological insight

# 1. Introduction to the command line
Below we will give only a very simple overview. The best way to learn the command line is to Google search for instructions and then try it yourself, experimenting as much as you can. We have also given links to other teaching resources at the end.

Here we have embedded the unix command line into a Jupyter notebook so that you have the manual and the exercises together. A normal terminal would be a separate window (you used one for the git clone command), but otherwise it is the same. In the *instructions* below we denote commands typed at the terminal with a `$` sign, whereas the output, the info returned by the terminal, has no dollar sign.
```
$ a typed command
output output output
```

The *actual* terminals here are the grey cells (boxes) prefixed by `In [ ]:`  
Here is one:

LICENSE             README.md           blast_barcode.ipynb


You can run commands in a terminal cell by clicking in it and then pressing Shift-Enter. Alternatively, click the >|Run button in the menu bar. Try it below.

In [None]:
echo 'Hello World!'

You should see that the output `Hello World!` has been printed directly below the cell, without the quote marks (we'll introduce the `echo` command later).

## 1.1 Navigating the file system

The first thing to do is to learn the basics of moving between places on your computer, checking where you are, checking what files are present and having a quick look at them. Some of the terminology is perhaps slightly new (‘directories’ rather than ‘folders’) but using the right words will mean you are speaking the same language as everyone else and make your life easier when Googling for solutions.

**As the navigation commands are introduced below try them on your own system**

### 1.1.1 Checking where you are (`pwd`), changing directories (`cd`) and listing contents (`ls`)
At the command line you need to know “where you are” i.e. which directory (folder) you have open and are working in. Question: If you issue the command to ‘list all files’ which files will be listed? Answer: Those in the directory where you are currently working, called the working directory. The command to find out where you are is `pwd` short for ‘print working directory’. NB: ‘print’ when working at the command line means ‘display on the screen’ rather than ‘write this to a piece of paper’. 

Type the command to print working directory `pwd` below, and hit Shift-Enter to run the command.


You should see something like this:

```
$ pwd
/home/bs/username/intro_unix_and_blast
```

**Tip: if you want to try something and there isn't an empty terminal cell, use the menu bar to Insert Cell**

Try inserting a cell below and check your working directory again

It would be useful now to know what folders and files are present in this location

### Listing the contents of a directory with `ls`

A very common thing you will want to do is to **display the contents of a directory**, i.e. list all the files. 

You can list the files (and directories) in your working directory using the `ls` command. 

**Task: explore the contents of the folders provided for you. Navigate to each and look at what is there using the `ls` command.**

But maybe that isn’t where you want to be, in which case you need to ‘change directory’, and the command for that is `cd`.

```
$ cd data
$ cd my_sequences
$ pwd

/home/bs/username/intro_unix_and_blast/data/my_sequences
```

**Explanation:** Spaces in file and directory names cause difficulties, as a space is treated as the end of a name. When asked for ‘my sequences’ the shell complains that it can’t find ‘my’. 
```
$ cd my sequences
-bash: cd: my: No such file or directory
```

Although you can get around this by putting the name in quotes (e.g. `'my file'`), it is better to name files replacing spaces with underscores (e.g. `my_file`), hyphens (e.g. `my-file`), or concatenating the words (e.g. `myfile`).
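If you would like to see the difference for yourself without touching the course folders, here is a minimal sketch. It assumes a throwaway directory made with `mktemp -d`; the names `demo_dir`, `my sequences` and `my_sequences` are only for illustration. The `$` prompts are left out so the lines can be pasted into a cell as they are.

```shell
# Work in a throwaway directory so nothing in the course folders is touched
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Quotes make the shell treat the two words as a single name
mkdir "my sequences"
mkdir my_sequences

# Without the quotes this would look for a directory called 'my' and fail
cd "my sequences"
pwd

# The underscore version needs no quotes at all
cd ../my_sequences
pwd
```

Both approaches work, but the underscore name is less typing and much harder to get wrong.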

**Try navigating for yourself down to the `my_sequences` folder as outlined above**

You can see that the `/` symbol denotes levels of directories, so that the ‘my_sequences’ directory is contained within the ‘data’ directory, which is within ‘intro_unix_and_blast’, which is within the user's home directory, and ultimately within the system’s ‘home’ directory. 

These **file paths** can sometimes be long but they are always explicit, which is a very good thing for reproducibility. There is no excuse for failing to remember where the data was stored for your analysis, here it is written out, and in the real world you will want to record this as part of your experiment. 

You can go up one level (to 'data') by using double dots (ensure there is always a space between the `cd` command and the directory you wish to go to).
```
$ cd ..
$ pwd

/home/bs/username/intro_unix_and_blast/data
```

**Try it for yourself** (remember to always try new commands for yourself, it is an important part of your learning, don't wait for my typed reminders)

What happens if you use the `cd` command without telling the system where you would like to change directories to?  
Try it. How can you find out which directory you are now in?
```
$ cd 
$ pwd 
/home/bs/username
```

Using the `cd` command on its own returns you to your **user’s home directory**, in this case `/home/bs/username`, from wherever you are. NB it is **your** home directory, in my case called davelunt, and it is nested within the directory called home, where the computer keeps all users' directories. **Your home directory** is an important concept; this other home folder is the only difficult part, and can be ignored for the rest of today. In the example below *username* means your username, whatever it is.

The tilde symbol (`~`) is shorthand for your home directory, so `/home/bs/username/intro_unix_and_blast` and `~/intro_unix_and_blast` refer to the same directory, which saves a little typing. 

Another very useful shortcut is `cd -` (dash) which takes you to the previous directory that you were in. This is really useful when you need to swap between directories that are separated by several levels or that have long names.



```
$ cd ~/intro_unix_and_blast/data/my_sequences
$ pwd
/home/bs/username/intro_unix_and_blast/data/my_sequences
$ cd
$ pwd
/home/bs/username
$ cd -
$ pwd
/home/bs/username/intro_unix_and_blast/data/my_sequences
$ cd -
$ pwd
/home/bs/username
```

Make sure that you are actually typing this out for yourself rather than just reading along. This **active learning** will really help it to stick in memory, and come back when you need it, a bit like developing muscle memory.


### tab completion saves time and errors

You are probably slightly annoyed now that you have to type in all these directory names, which are long and complex. If you make one typo it will result in an error. Real bioinformaticians make use of the tab key a lot. Pressing tab will suggest completions of your file path.

If I am in the home directory and I type `$ cd intr<tab>` it will autocomplete to `$ cd intro_unix_and_blast`
This saves a LOT of typing and prevents errors. Insert a cell below and try it now. Use the Insert menu above.

### Chaining commands together with a semicolon
Above we were issuing commands one at a time; first `cd` then `pwd`. To chain commands together on the same line, separate them with a semicolon, e.g. `cd; pwd`

Try the exercise from the line above again, but now using semicolons. You should only need 4 commands not 8.
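As a small illustration of the idea (not the exercise itself), the sketch below chains four commands on one line. It assumes a scratch directory from `mktemp -d` with a made-up `data` subdirectory; `demo_dir` is just a name chosen for this example.

```shell
# Scratch directory with a hypothetical 'data' subdirectory inside it
demo_dir=$(mktemp -d)
mkdir "$demo_dir/data"

# Four commands on one line, run from left to right
cd "$demo_dir"; pwd; cd data; pwd
```

Each command runs in turn, whether or not the one before it succeeded.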

Are you tab completing the addresses?

### Quiz: Ways to return home
There are 5 ways that you should now know to return to your home directory. Most people only remember 2 or 3, what can you do? Please work with someone nearby, tell a demonstrator, and we'll see who are the champions.


### 1.1.3 Summary
```
pwd	print working directory, show where you are
ls	list, show all the files in your current directory
cd	change directory, move to another location
;	semicolons can chain together commands on one line
```

Test yourself, can you write out a command using each of these?

# 1.3 Editing, inspecting, and searching within text files

## 1.3.1 Inspecting files
Much of bioinformatics is the inspection and retrieval of information from text files. This data is the source of all subsequent analyses and we need to make sure it is the right data in the right format.

If we  want to inspect a file we can display its contents using several unix programs:
```
cat  	to print the whole file to the screen
less 	to print the file to the screen a page at a time
head 	to print the first few lines of the file on screen
tail 	to print the last few lines of the file on screen
```
One of the problems with DNA sequence files is that they can be large - several hundred megabytes to a few gigabytes is not uncommon. Viewing these files can be difficult, as the files need to be loaded into memory, and can therefore take a great deal of time for the text editor to read from the disk. The `less`, `head` and `tail` commands are very efficient for viewing large files such as these. 
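To get a feel for this before opening real sequence data, here is a minimal sketch on a practice file. It assumes the `seq` program is available (it is on most Linux systems) and uses the made-up names `demo_dir` and `numbers.txt`. The `>` symbol, which sends output to a file, is explained in section 2.3.1.

```shell
# Work in a throwaway directory
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Write the numbers 1 to 100000, one per line, to a practice file
seq 1 100000 > numbers.txt

# By default head shows the first 10 lines and tail shows the last 10
head numbers.txt
tail numbers.txt
```

Neither command needs to display the other 99,990 lines, which is why they stay quick on big files.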

Remembering tab completion, navigate to the directory `/data/my_sequences`. Tip: when navigating use `ls` to see what is available.

Some of these files, e.g. `C.09.F.fasta`, are quite large (~30 MB), while the others are smaller. Only one of the commands above is inappropriate for large files; all the others will be useful. If you have a problem don't panic, ask a demonstrator.

If unsure about the size of a file `head` is a good way to have a look at it. If you want to know how big a file is you can display this with a version of the `ls` command, can you find out how?

All the commands above will be very useful for basic bioinformatics work, they are key skills. 

How can you step through the file a screen at a time using `less`? Try Googling for the answer and demonstrate that it works.

----
## Assessment 1 - viewing files

Navigate to the `/headers` directory.
(Which commands should you use for big or small files and why?)

In the 4 cells below use each of the following commands to view the file contents
1. cat
2. head
3. tail
4. less 


-----

### 1.3.2 Searching within files
Searching within very large files can also be problematic, especially using the standard 'find' functions in a text editor, which aren’t optimised for searching across very large files. For this reason various tools have been created that let users search within large files from the command line, and they are highly optimised for this job. One of the most useful utilities for searching within a file is `grep` (short for 'global regular expression print').

`grep` is very simple to use. At the command line you will need to type the word `grep`, followed by the text you are searching for, followed by where (the filename) to look for it. For example, to search for the word Fungi in our 18S_seqs.fas file we do the following:
```
$ cd ~/intro_unix_and_blast/data/my_sequences/
$ grep 'Fungi' 18S_seqs.fas
>AY642706_10 1401 Eukarya/Fungi
```

This returns the text `>AY642706_10 1401 Eukarya/Fungi` which is the single line that contains the word Fungi.

Make sure that you are trying this for yourself, by inserting cells and giving the command

We can confirm that there is only one instance by using the **count flag** (`-c`) with `grep` as follows:
```
$ grep -c 'Fungi' 18S_seqs.fas
1
```

(Flags are useful things, they allow the default behaviour of a command to be refined. See the section on HELP below to learn how to list all the modifier flags for a command.)
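One detail worth knowing is that `-c` counts matching *lines*, not individual occurrences of the text. The sketch below shows this on a made-up three-line file; `demo_dir` and `practice.txt` are names invented for the example, and the `>` and `>>` symbols used to build the file are explained in section 2.3.1.

```shell
# Work in a throwaway directory
demo_dir=$(mktemp -d)
cd "$demo_dir"

# A made-up three-line file; the first line contains 'Fungi' twice
echo 'Fungi and more Fungi' > practice.txt
echo 'Metazoa' >> practice.txt
echo 'Fungi again' >> practice.txt

# Print every line that contains the search text
grep 'Fungi' practice.txt

# Count the matching lines: this reports 2, not 3,
# because -c counts lines rather than individual occurrences
grep -c 'Fungi' practice.txt
```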

### 1.3.3 How many sequences do I have?
A very common question to ask is ‘how many DNA sequence records are in this enormous fasta file?’ 

**Tip: Do you know what a sequence record looks like? Quickly learn about fasta sequence files from Appendix 1. Now try to see if the sequence file '18S_seqs.fas' matches your expectations. You already have the skills to do this without loading the whole file into memory, can you remember how? Check above if not.**

In order to find out the number of sequences you could of course search for all the greater than `>` symbols, which is almost certainly the number of records. However, you should really search for all the lines *starting with* > rather than the number of times it occurs, as it is possible for a fasta header to contain an internal > . ‘Line starts with’ is represented by the `^` symbol. 

----
## Assessment 2 `grep` #
Write a `grep` search to count the number of fasta header lines. Do not copy/paste from any web resource. Discuss your solution.

Demonstrate your search on the two files `18S_seqs.fas` and secondly `C.09.F.fasta` to determine the number of sequence records. Ensure that the searches and output are below.

Have you remembered the quotation marks around the search phrase? Unfortunately your solution will probably delete the data file if you forget the quote marks! Why? Look at section 2.3.1 below and discuss. A spare file can be found in `/data/backup`. Can you find a way from the command line to **copy** this file to your current directory? Google for a solution on how to copy a file at the unix command line. Try it, and list the files to see if you have been successful. Check with a demonstrator if you need.

You may try your `grep` search again in a new cell below and still get the marks for completing this assessment.

A correct `grep` for counting sequence records, and using it to search these two files will get you the marks

----

### 1.4 Summary
```
head	show the first few lines of a file
tail	show the last few lines of a file
cat 	print the whole file to the screen
less	print the file a screen at a time
grep	search, the -c flag counts the matches
```
Test yourself. Look away from this summary and explain to your partner/self 4 ways of showing file content, and why they are different. Now write a grep search with the correct syntax.

## 2.3 Search, replace, and write output to a new file

`grep` is an excellent tool for undertaking simple yet fast searches within text files. But to search and replace within a text file, or to redirect changes to a new file, we will need to use either `sed` or a simple script (OK there are actually numerous ways of doing this utilizing other tools and python scripts but this manual will only deal with simple examples with `sed`). First though we will write text using `echo` and learn about routing data to the right location with `>` `>>` and `<`.

### 2.3.1 echo and routing
A useful way to write to a text file is with `echo`. This will print to the screen or a file. A short introduction to `echo` can be found here if you wish to explore later, but the next sections are fairly self explanatory without it so no need to read now:
https://www.computerhope.com/unix/uecho.htm

Try these commands
```
$ echo Hello world!
$ echo 'Hello world!' > greeting.txt
```

Lastly echo can write file information like file names
```
$ echo *.fas > fasta-file-names.txt
```
This would write the name of every file in the current directory with a .fas extension to a file called fasta-file-names.txt, which is often very useful when you need to record lots of output file information. Although I am talking only about bioinformatics today, big data is everywhere in biology. You can manipulate text files of any type of data using your new skills.

If the file 'greeting.txt' does not exist it will be created. We have routed the information to a location (the file) with the greater than (`>`) symbol. If the file does exist it will be overwritten. Check the file now exists (how?), then you can use one of the commands above (cat maybe? Do you remember the others?) to inspect the file you have just created. 

If you wish to append text to a file rather than replace it you can use the `>>` symbol:
```
$ echo 'Hello again world!' >> greeting.txt
```
Try this and check your success. 
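The difference between replacing and appending is easy to see side by side. This is a minimal sketch in a throwaway directory; `demo_dir` and `routing-demo.txt` are made-up names used only here.

```shell
# Work in a throwaway directory
demo_dir=$(mktemp -d)
cd "$demo_dir"

echo 'first line' > routing-demo.txt     # > creates the file
echo 'second line' >> routing-demo.txt   # >> adds to the end of it
cat routing-demo.txt                     # shows both lines

echo 'replacement' > routing-demo.txt    # > on an existing file overwrites it
cat routing-demo.txt                     # shows only 'replacement'
```

Notice that the overwrite happens without any warning, so it pays to double check which symbol you have typed.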

**Routing syntax** (`>`, `>>`) is general to UNIX and can be used with other programs too; you will see an example in the section on `sed` (2.3.2) below. Imagine that you need to add an extra fasta sequence to the end of a big sequence file: the append symbol `>>` will be helpful. In reality you would want to take your new text from a file rather than typing it in. In the `sed` example in section 2.3.2 the source file is specified using a less than (`<`) symbol. 

A common bioinformatics task is to concatenate a lot of individual sequence files into one single file. This is very time consuming to do in a GUI if you have more than a couple of files to open, copy, close, open, paste. The task at the command line however scales easily from 1 to 1 million files. You already have all the skills to do this if you learned about wildcards before. Don't worry if not, just know that the command line *scales*.

As an aside `echo` also allows us to format files correctly. If we need newlines or tabs inserted we can do this by using the -e flag. The tab symbol is \t and newline \n. 
Can you use these to better format the file greeting.txt that you created before?
Try to imagine what the following command will write & discuss with others:
```
$ echo -e "column1\tcolumn2\nRNA\tDNA" > rna-dna-columns.txt
```
Check your success. These commands are useful when writing a lot of data to a file programmatically, and when format is important, which is a very common situation for bioinformatics work.

----
**Task: go to the fasta-to-combine directory. Concatenate all 10 sequence files into a new file with an informative name. Can you do the same excluding the contents of the README.md file? Demonstrate your success.**


### 2.3.2 `sed` the stream editor
`sed` works best when we need to deal with files as single lines, or rows of text data. Since `sed` doesn’t try to take the whole file into memory, instead dealing with a line at a time, it has real advantages when files are enormous, as they often are for sequence data.
To search for and replace Fungi with Fungus in our 18S_seqs.fas file we could do the following:

```
$ sed 's/Fungi/Fungus/' < 18S_seqs.fas > fungi-to-fungus.fas
```

This will replace the single word Fungi we identified using grep with the word Fungus, but output these changes to the file fungi-to-fungus.fas, leaving the original file unchanged. The `s` within the single quotes signifies this is a substitution command and the `/` characters are delimiters that separate the text to search for, and the text to replace it with. 

In UNIX based systems the `<` signifies an input, so we are taking input from our 18S_seqs.fas file and outputting (`>`) to fungi-to-fungus.fas. 
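To convince yourself that the original file really is left alone, you could try the same substitution on a tiny practice file first. This sketch assumes a made-up two-record fasta file; `demo_dir`, `practice.fas` and `practice-fungus.fas` are names invented for the example.

```shell
# Work in a throwaway directory
demo_dir=$(mktemp -d)
cd "$demo_dir"

# A made-up two-record fasta file
echo '>seq1 Eukarya/Fungi' > practice.fas
echo 'ACGTACGT' >> practice.fas
echo '>seq2 Eukarya/Metazoa' >> practice.fas
echo 'TTGGCCAA' >> practice.fas

# Read from practice.fas, substitute, and write the result to a new file
sed 's/Fungi/Fungus/' < practice.fas > practice-fungus.fas

# Search both files; when given more than one file,
# grep labels each match with the file it came from
grep 'Fung' practice.fas practice-fungus.fas
```

The original still says Fungi and only the new file says Fungus.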

NB the examples below in the rest of this section are tricky, but they are to get you thinking. They are teaching not assessment. Try to understand, discuss, ask. Do not give up, you have made great progress up until here!

Tip: Always give meaningful names to files and directories, even if that makes them seem long. The person you are doing this for is ‘future you’, who will remember less than you think and will need clear filenames as one of the ways to make sense of the data and how it has been transformed, and to help record a reproducible experiment. It is very useful to have a filename like: 

`whitby-FDS12763-nematode18S-lenfiltered200bp-uniquespecies.fas`

Can you guess what this file contains, where it is from, what type of sample, how it has been treated bioinformatically? 'Future you' will be able to read the info. Another reason the information-in-filename approach is very useful is that it contains a lot of information you can use for analysis. If you had 1000 files from separate sampling points, you could choose which files to pull data from based on names like “whitby” or maybe the sample code “FDS”. If you wanted to grab data just from enoplid nematodes from only the Whitby samples you could find and list (concatenate) those with a search, and pipes `|` to string several jobs together. 

Below is an example. **These files don’t exist here**, it's just an example, but you are going to try it yourself on files that do.

`cat ~/allsamples/*whitby*.fas | grep enoplida | sort | uniq -c`

Task: Google, discuss, and ask until you **know what this command does; you do not need to run it**. To help your searches, the asterisks are called ‘unix wildcards’. Why are they used? Wildcards are simple and incredibly powerful; check with a demonstrator, as you will need this asterisk wildcard below. Unix pipes (`|`) are one of the most powerful aspects of bioinformatics at the command line. They are a bit like the semicolon used earlier, but they pass data from one command to the next. Maybe Google unix pipes and discuss with a demonstrator?
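If it helps your discussion, here is a much smaller pipeline that you *can* run. It is only a sketch on a made-up four-line file (`demo_dir` and `orders.txt` are invented names), but it shows the output of each command flowing into the next.

```shell
# Work in a throwaway directory
demo_dir=$(mktemp -d)
cd "$demo_dir"

# A made-up list with one name repeated three times
echo 'enoplida' > orders.txt
echo 'rhabditida' >> orders.txt
echo 'enoplida' >> orders.txt
echo 'enoplida' >> orders.txt

# uniq only merges identical lines that are next to each other,
# so sort first to bring the duplicates together, then count them
cat orders.txt | sort | uniq -c
```

The result is one line per distinct name with a count in front of it: 3 for enoplida and 1 for rhabditida.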

### 2.3.3 Why use the command line?
I think this section is the most important of the entire practical class. Not because of the specific command line skills (though they are useful) but because I want you to understand *why* we use the command line at all, what are its special powers. These exercises are hard, if you can't easily do them, think through a strategy to do them instead, even though you can't think of the specific commands, and discuss with a demonstrator.

Think for a moment how much work this single line is actually doing and how long it would take manually? Now consider that a single sequence run (4 days) might generate 100 of these files, and a month of lab work thousands. The work to process all these thousands of files containing hundreds of thousands of fasta sequences is however exactly the same, just the command above run on a bigger directory of data. Command line approaches may seem no quicker on a single file, but they scale well. No bioinformatician would work in graphical interface software for real data, not only is it not reproducible, but every repeat of the task requires the same amount of work again.

The tasks below are hard. If you can understand the lesson that they demonstrate I will be happy. It would be even better if you actually complete them successfully.

TASK: In the folder hep_cd you have 182 fasta files containing either hepatitis C or hepatitis D viral sequences from patients in London or Hull. Make a detailed protocol (don't actually perform the steps, just detail them) for using Word to copy/paste the header lines to a new document for each of the 4 treatments C/D Hull/London. How would you check for duplicates? 

TASK: Now make a detailed protocol for using your command line skills to do this. How would you check for any that are duplicates and delete them? How would you count the remaining unique headers? Can you use the example above to complete this command? You may give it a try if you like, but designing the experiment and getting it checked is enough.

Now I have a second (imaginary) file with 1800 fasta sequences. I would like you to think how much effort that takes to process with (a) Word (b) re-running your terminal command.

OK, now for the real data, there are 15 million fasta sequences in a text file...

I hope you understand that **the command line scales**: it's the same work to process 182 sequences as it is 15 million. That is not true for GUI menu-driven programs. 15 million fasta sequences is a small and routine dataset for bioinformatics.

TASK: Go to the headers directory. Extract all the header lines from `headers-test.fas`. How many are there? Are there any duplicates or are they all unique (`uniq`)? After reading the section below you may be able to work out how to write it to a new file with an informative name. Consider for a moment how much extra work **you** would have to do to run the same analysis on a file with millions rather than hundreds of sequences. 

Well done, please ask a demonstrator for a high five.


## Summary
```
echo	write to a file
sed	search and replace
<	take input from named file
>	direct output to named file, replacing
>>	direct output to named file, appending
|	pipe, used to chain together commands
*   wildcard, stands for 'any characters'
```

Test yourselves. In pairs ask each other to explain what items from the list above do. Each of you ask the other to come up with a (not too difficult) command line using one or two of them.

# *-- Pause Here --*#
By this stage you have done a lot of work, and hopefully learned a lot. 
Please wait before proceeding with any more practical exercises. It is OK however to start the 'research BLAST' task just below.

---- 
# 2 BLAST similarity searches
In this second section we will investigate and use **blast** the most important software tool in biology.

## 2.1 What is BLAST and how does it work?
**Task: In groups of 2-4 spend 5 minutes researching BLAST searches. How will you do this? What questions will you ask?  
You will prepare yourself to contribute to a group discussion about blast searching.**

## 2.2 The CFTR gene

## Assessment 3; building a blast database and searching it
To demonstrate your competency, follow the instructions to search a chromosome_7 blast database from the command line with the provided CFTR query sequence. The instructions are in the two sections below

## 2.3 Building a BLAST database
You have been provided with the sequence of human chromosome 7 in `GRCh38_chr7.fas`. Have a quick look at it to make sure it is what you expect.

Blast searches a very large amount of ('reference') sequence very quickly for matches to the query sequence. In order to do this efficiently the reference sequences must be organised into a database structure. The blast suite of programs will create the database for you using the `makeblastdb` command.

As with all the programs you have used you need to call the program with its name, tell it what the input file is, give it any options it needs (e.g. nucleotide, protein), and tell it where to save the output and under what name.

The command to make a blast database would look like this:
```
$ makeblastdb -in data/CFTR/GRCh38_chr7.fas -dbtype nucl -title my_chr7_db -out CFTRch7
```
Discuss this command until you understand what parts do what. Remember that flags (-flag) are there to flag-up what the next part does. The `-in` flag precedes the input file. 

**Time to run it for yourself, but in a slightly changed format**

First let's prepare some data. Copy, paste, and run the line below (**NB you don't need the dollar sign**; the command starts with `gunzip`). This might take a few minutes. While it is running the cell label shows `In [*]`; when the asterisk is replaced by a number, it's done. If you only see a number as normal, it's done.
```
$ gunzip -c data/CFTR/GRCh38_chr7.fas.gz | makeblastdb -in - -dbtype nucl -title my_chr7_db -out CFTRch7
```

- What does the pipe | do? 
- What is gunzip? 
- Where will the output files be saved?

## 2.4 BLASTing a sequence file against the database
Now is the time to actually blast a sequence against the database to find out where it matches. You are provided with a sequence file containing a segment of the human CFTR gene called `CFTRshort.fas`. Use the following command to search the database. 
```
$ blastn -db CFTRch7 -num_descriptions 50 -query data/CFTR/CFTRshort.fas -out CFTR_chr7_blastn.out
```

If you have run the command without errors then you have completed the bioinformatics part of your analysis. There are just a few more things to do. For a start you will want to look at and interpret your results. You already know how to examine large text files, so why not have a look at the `CFTR_chr7_blastn.out` file? It is actually quite small, but even if you had returned thousands of results you now have the skills to view and search them. Discuss the content with a demonstrator or me.

You have detected a specific mutant genotype called "cftr delta-f508".
A useful place to look at human mutants is OMIM (Online Mendelian Inheritance in Man). Try a search for "cftr delta-f508".

https://www.omim.org/about

If you are interested you can see the specific sequence you have just investigated in a genome browser

https://www.ncbi.nlm.nih.gov/variation/view/?chr=7&from=117559592&to=117559594&mk=117559592%3A117559594%7CNC_000007.14&assm=GCF_000001405.26


## Congratulations!
You have navigated the unix command line, searching and formatting data files, using a range of powerful unix commands, on large amounts of DNA data. You have created and searched a blast database of the human genome to detect a very important single mutation. **You are now a bioinformatician!** 

## Learning Outcomes
At the end of this class you should:
- Understand the utility of the command line
- Navigate files and directories from the command line
- Perform `grep` searches on very large files
- Be able to perform a BLAST search from the command line and gain biological insight

Bioinformatics is a big discipline, and of course a lot of different skills are needed. You have however covered a wide range of skills for one day, and these basics will already be very useful indeed if you were beginning a DNA analysis project. 

----

# THIS IS REALLY IMPORTANT DON'T SKIP THIS SECTION
## At the end of the practical save and upload your practical notebook
Check your name and student ID are in a cell at the top of this page. 

Use the notebook file menu (not the browser menu) `File/Download As/.pdf`

Save this PDF somewhere safe. You should rename it informatively. Now upload it to the Canvas assignment for "Practical 2- bioinformatics"

YOU ARE NOW COMPLETELY FINISHED, WELL DONE

----

# Appendix 1 the fasta file format
Fasta (pronounced “fast- ay” to rhyme with May) is a concise standard text file format used for sequence data. A fasta file may contain one or many separate sequence records, each in fasta format. You will need to use this format several times during today’s session. Both DNA and amino acid sequences can be formatted this way. Essentially it is a greater than symbol “>” followed by any title you want to give it, then, starting on the next line, the sequence.

### Here is the description of FASTA from GenBank.
“A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column”

Example sequences in FASTA format are: 
```
>AAH22472.1 Malic enzyme 3, NADP(+)-dependent, mitochondrial [Homo sapiens]
MGAALGTGTRLAPWPGRACGALPRWTPTAPAQGCHSKPGPARPVPLKKRGYDVTRNPHLNKGMAFTLEER
LQLGIHGLIPPCFLGQDVQLLRIMGYYERQQSDLDKYIILMTLQDRNEKLFYRVLTSDVEKFMPIVYTPT
VGLACQHYGLTFRRPRGLFITIHDKGHLATMLNSWPEDNIKAVVVTDGERILGLGDLGCYGMGIPVGKLA
LYTACGGVNPQQCLPVLLDVGTNNEELLRDPLYIGLKHQRVHGKAYDDLLDEFMQAVTDKFGINCLIQFE

>My sequence of elephant skin-wrinkle gene
CCGGGCTCTGCCTCGATGCAAACGTTATGCATATATGTATTATCACCATTATTTTATATCAAACATATCC
TATATATTAATACATCTCATTTAACAGAAATATAGGTAGATATACCACATATTTGTCAACAACATTTTAA
CTAAGGGGTACATAAACCATAACTAAGTACTCTCCAATAAATATTTATTAATTACTGAACGATAGTTTAA
GACCGATCACAACTCTCACTGGTTAAGATATACCAAGTACCCACCATCCTATTTACCTCCCTTATTTAAT
GTAGTAAGAGCCCACCATCAGTTGATTTCTTAATGTTAACGGTTCTTGAAGGTCAAGGACAAATATTCGT
GGGGGTTTCACTTAGTGAACTATTCCTGGCATCTGGTTCCTATTTCAGGTCCAATAATTGTTATAATTCC
CCATACTTTCATCGACGCTTGCATAAGTTAATGGTGGTAATACATACTCCTCGTTACCCACCATGCCGGG
CGTTCTTTCCAGCGTGTGGGGGGTTCTCTTTTTTTTTNNCCTTTCA
```
### Fasta format records and fasta files
Above are two fasta records. A fasta record contains only a single header with a single sequence. It is one thing. The record is in fasta format because it is a greater than symbol, then a header, then the sequence.

A fasta file is a text file. It usually has the extension .fas or .fsa or .fna or .fasta. Each text file may contain one or more than one fasta record. It is not uncommon to have tens of thousands of fasta records in a single text file. For it to count as a fasta file all the records must be in valid fasta format.

## Appendix 2 some extra things to try

### `head` and `tail`
These commands show the first few or the last few lines of the file. How can you control how many lines are shown?
If you wanted to take the last 5 or the first 5 lines of a big text file and save them in a new file, how might you go about it?

### `help` and `man`
unix has a built-in help system. You can type the name of the program causing confusion followed by `--help` to get more information on functionality and correct usage. Try

`ls --help`

There are also manuals to help you. These can be brought up with the `man` command. Try

`man ls`

### word counting
unix has a word count program called `wc`

What can you find out about it and how to use it?

How many characters are in the CFTRshort.fas DNA sequence file?

One problem is that the file contains both a fasta header and the sequence. If you only want the length of the DNA sequence a simple `wc` may not give you the right information. Can you think of any way around this?

## Appendix 3
### Some reading if you wish to extend your knowledge
- UNIX Tutorial for Beginners http://www.ee.surrey.ac.uk/Teaching/Unix/
- Command line history tricks http://www.thegeekstuff.com/2008/08/15-examples-to-master-linux-command-line-history/
- Software Carpentry Introduction to the unix shell on YouTube (great short videos) 
- Unix and Perl Primer for Biologists http://korflab.ucdavis.edu/Unix_and_Perl/unix_and_perl_v3.0.pdf
- Bradnam and Korf. (2012) UNIX and Perl to the Rescue!: A Field Guide for the Life Sciences (and Other Data-rich Pursuits). ISBN-10: 0521169828  ISBN-13: 978-0521169820 http://www.amazon.co.uk/gp/product/0521169828
- GREP http://www.gnu.org/software/grep/manual/grep.html
- SED http://www.gnu.org/software/sed/manual/sed.html
- Software Carpentry Introduction to programming in Python (great short YouTube videos) 
- Python for Biologists http://pythonforbiologists.com
- Python for non-programmers https://wiki.python.org/moin/BeginnersGuide/NonProgrammers