# Genomics

The goal of this lecture is to highlight a variety of genomics tools and methods that are commonly used in Python.

Things we will go over in this module are:

1) Using Biopython for searching in NCBI databases

2) Using Scanpy to conduct scRNA-seq analysis


## Installs/Imports

In [2]:
!pip -q install biopython

In [18]:
import numpy as np
import pandas as pd

### GenBank comprises several subdivisions:

**Nucleotide**: a collection of nucleic acid sequences from several sources.

**Genome Survey Sequence** (GSS): uncharacterized short genomic sequences.

**Expressed Sequence Tags** (EST): uncharacterized short cDNA sequences.

Searching the Nucleotide database with general text queries will produce the most relevant results. You can also use a simple query based on protein name, gene name or gene symbol.

To limit your search to only certain kinds of records, you can search using GenBank's Limits page or alternatively use the Filter your results field to select categories of records after a search.

If you cannot find what you are searching for, check how the database interpreted your query by investigating the Search details field on the right side of the page. This field automatically translates your search into standard keywords.

Here is a link to all the available search [fields used in GenBank](https://www.ncbi.nlm.nih.gov/books/NBK49540/).
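Field tags from that list can be combined into a single query programmatically. The following is a minimal sketch, assuming a hypothetical helper `combine_terms` (it is not part of Biopython) that joins clauses with `AND`. `[Orgn]` and `[Gene]` are the tags used in this lecture; the `[SLEN]` sequence-length range is an assumed extra filter that should be checked against the linked field list before relying on it.

```python
def combine_terms(*clauses):
    """Join individual search clauses into one query with AND."""
    # Skip empty clauses so optional filters can be passed as ""
    return " AND ".join(clause for clause in clauses if clause)

organism = '"Zea mays"[Orgn]'
gene = "rbcL[Gene]"
# Assumed example of a range filter on sequence length
length = "100:2000[SLEN]"

query = combine_terms(organism, gene, length)
print(query)
```

Narrowing the query this way plays the same role as the filters on the web page, but keeps the restriction visible in the code.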

In [43]:
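# The Bio package is provided by the biopython module installed above
from Bio import Entrez

# NCBI asks for a contact email with every request
# This placeholder works, but you should use your own address
Entrez.email = "your_name@your_mail_server.com"

handle = Entrez.esearch(db="nucleotide", term='"Zea mays"[Orgn] AND rbcL[Gene] ')
record = Entrez.read(handle)
# Close the connection once the result has been parsed
handle.close()
record["Count"]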

'0'
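When a count looks surprising, it helps to check how NCBI interpreted the query, just like the Search details field on the web page. Below is a minimal sketch, assuming the parsed result behaves like a dictionary with `Count`, `IdList` and `QueryTranslation` keys (the usual shape of an `esearch` result). `summarize_search` is a hypothetical helper, and the dictionary passed to it is a hand-written stand-in with made-up values, not real NCBI output.

```python
def summarize_search(record):
    """Return a short, readable summary of a parsed esearch result."""
    count = int(record["Count"])  # Count is returned as a string
    ids = list(record.get("IdList", []))
    translation = record.get("QueryTranslation", "(not provided)")
    return (
        f"{count} record(s) found; "
        f"first IDs: {ids[:3]}; "
        f"query interpreted as: {translation}"
    )

# Hand-written stand-in for Entrez.read(handle); values are illustrative only
fake_record = {
    "Count": "2",
    "IdList": ["111", "222"],
    "QueryTranslation": '"Zea mays"[Organism] AND rbcL[Gene]',
}
summary = summarize_search(fake_record)
print(summary)
```

With a live search, pass the `record` from the cell above instead: `summarize_search(record)`.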

You can see this exact search via this [URL](https://www.ncbi.nlm.nih.gov/nuccore/?term=%22Zea+mays%22%5BOrgn%5D+AND+rbcL%5BGene%5D).
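The link above is simply the query placed in the `term` parameter of the Nucleotide search page. Here is a small sketch that builds such a link from any query string with the standard library; `search_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

def search_url(term):
    """Build a browser URL for a Nucleotide (nuccore) search."""
    base = "https://www.ncbi.nlm.nih.gov/nuccore/"
    # urlencode escapes quotes and brackets and turns spaces into '+'
    query_part = urlencode({"term": term})
    return f"{base}?{query_part}"

url = search_url('"Zea mays"[Orgn] AND rbcL[Gene]')
print(url)
```

Opening the printed link in a browser is a quick way to compare the web results with what `Entrez.esearch` returns.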

## F-strings

You may have noticed that we write our query as a string. It would be convenient to change parts of that string so that we can run many queries.

To do this we can use f-strings.

Here is an example:

In [14]:
species_list=['Zea mays','Arabidopsis thaliana']
for species in species_list:
    # Notice the f before the opening quote
    # This tells Python to treat the string as an f-string
    # Any variable or expression inside {} is evaluated and placed in the string
    query_string = f' "{species}"[Orgn] AND rbcL[Gene]'
    print(query_string)

 "Zea mays"[Orgn] AND rbcL[Gene]
 "Arabidopsis thaliana"[Orgn] AND rbcL[Gene]

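The same loop can feed each query to `Entrez.esearch` and collect the counts. This is a minimal sketch, assuming network access and the Biopython install from above. `make_query` and `count_hits` are hypothetical helper names, and the import sits inside `count_hits` only so that the query-building part can be run on its own.

```python
def make_query(species, gene):
    """Build an Entrez query for one gene in one organism."""
    return f'"{species}"[Orgn] AND {gene}[Gene]'

def count_hits(query, db="nucleotide"):
    """Return the number of records matching the query (needs network access)."""
    from Bio import Entrez
    Entrez.email = "your_name@your_mail_server.com"  # replace with your own address
    handle = Entrez.esearch(db=db, term=query)
    record = Entrez.read(handle)
    handle.close()
    return int(record["Count"])  # Count comes back as a string

species_list = ['Zea mays', 'Arabidopsis thaliana']
queries = {species: make_query(species, "rbcL") for species in species_list}
for species, query in queries.items():
    print(f"{species}: {query}")

# Uncomment to run the searches against NCBI:
# counts = {species: count_hits(query) for species, query in queries.items()}
```

Keeping the query construction separate from the network call makes it easy to print and inspect the queries before sending any of them.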

In [16]:
for i in range(5):
    # You do not have to cast the variable to a string either
    print(f'This is loop iteration {i}')

This is loop iteration 0
This is loop iteration 1
This is loop iteration 2
This is loop iteration 3
This is loop iteration 4


In [30]:
rand_number = np.random.rand()
for i in range(5):
    # You can include a colon and specify the format of the output
    # In this case I am specifying the number of digits to round to
    print(f'This random number rounded {rand_number: .{i}f}')

This random number rounded  1
This random number rounded  0.9
This random number rounded  0.86
This random number rounded  0.857
This random number rounded  0.8573
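The space right after the colon in `{rand_number: .{i}f}` is part of the format specification: it reserves a leading space for positive numbers so that they line up with negative ones, which is why the output above shows two spaces before each number. Here is a short sketch of this and a few related options, using an arbitrary example value.

```python
value = 0.8573

no_flag = f'{value:.2f}'        # no sign flag
space_flag = f'{value: .2f}'    # space reserved for the sign
negative = f'{-value: .2f}'     # the minus sign fills that space
plus_flag = f'{value:+.2f}'     # always show the sign
padded = f'{value:8.2f}'        # right-aligned in a field 8 characters wide

for label, text in [('no flag', no_flag), ('space', space_flag),
                    ('negative', negative), ('plus', plus_flag),
                    ('width 8', padded)]:
    # !r shows the quotes so leading spaces are visible
    print(f'{label:>8}: {text!r}')
```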


In [33]:
# If you place a comma after the colon,
# it inserts a comma every 3 digits
f'${10000000:,}'


'$10,000,000'

In [42]:
# It will even convert fractions to percentages
# Here I specify a percentage with 2 decimal places

f'{2/3 :.2%}'

'66.67%'

## Exercise

1) Read in the file `species_classification_ranks_processed.txt` as `species_info`

2) For each species, count the number of entries for the genes per1 and per2 and store the results as new columns in the dataframe `species_info`

3) Now make multiple plots that compare these genes and color them by different levels of species classification

## Scanpy tutorial