# Environment - Tools

## Ubuntu installation in Windows
- Check Windows Subsystem for Linux (<a href="https://ubuntu.com/desktop/wsl">WSL</a>)

Consider creating a symbolic link to a data folder in ~/Documents/:

```bash
ln -s /mnt/c/myrepo ~/Documents/my_dir
```

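## Ways to run Python code
- Write the code in a Python script and execute it in the **shell** (*python script_name.py*)
- Execute it in the **(i)python** command shell (*ipython: %run script_name.py*)
- **jupyter** notebook/lab
- VS Code integrated terminal, etc.

**Note:** Check relative paths. They are resolved against the current working directory, which may differ between these ways of running code.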
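The following minimal sketch shows how a relative path depends on the working directory. The file name `data/KT964724.1.fasta` is a hypothetical example and is not part of the course material.

```python
from pathlib import Path

# Relative paths are resolved against the current working directory (cwd),
# i.e. the directory the shell, IPython or Jupyter session was started from.
cwd = Path.cwd()

# Hypothetical file name, used only for illustration.
relative_path = Path("data") / "KT964724.1.fasta"
absolute_path = relative_path.resolve()

print(f"Working directory: {cwd}")
print(f"Resolved path:     {absolute_path}")
print(f"File exists:       {absolute_path.exists()}")
```

Running the same lines from a different directory gives a different resolved path, which is a common reason for "file not found" errors.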

Do note that there are multiple ways to create a virtualenv, such as:

- The venv module.
- The virtualenv package.
- Via package managers like conda.

We will use conda, installed through Miniconda (a minimal installer for the Anaconda Distribution).

## Installing Miniconda (https://docs.anaconda.com/miniconda/install/)
**Linux instructions** (run in a terminal):
- Create a new directory named “miniconda3” in your home directory.
- Download the Linux Miniconda installation script for your chip architecture and save it as miniconda.sh in the miniconda3 directory.
- Run the miniconda.sh installation script in silent mode using bash.
- Remove the miniconda.sh installation script after the installation is complete.

```bash
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda3/miniconda.sh
bash ~/miniconda3/miniconda.sh -b -u -p ~/miniconda3
rm ~/miniconda3/miniconda.sh
```
- After installing, close and reopen your terminal application, or refresh it by running the following command:

```bash
source ~/miniconda3/bin/activate
```
- Then, initialize conda for all available shells:

```bash
conda init --all
```


## Create conda environment - install jupyter

- We create the environment "python_intro", activate it, and install the packages and the Jupyter kernel:

```bash
# conda create -n python_course python=3.11 notebook
conda create --name python_intro --channel=conda-forge python=3.11 jupyterlab nb_conda_kernels
conda activate python_intro
conda install -c bioconda hmmer #mafft fasttree datamash mmseqs2 astral-tree biopython
# pip install ipykernel
python -m ipykernel install --user --name python_intro --display-name "Python Intro"
```


* Download the notebook (project.ipynb), rename or copy it (e.g., project_myname.ipynb), and run:

```bash
jupyter-lab project_myname.ipynb
```

- **Use one cell per code line/block, and look for function or method help (help(str.split) or str.split?).**

# Project - <a href="https://en.wikipedia.org/wiki/Codon_usage_bias">Codon usage bias</a>

## Translation (mRNA to protein sequence)
- We will use NCBI Genomes <a href="https://www.ncbi.nlm.nih.gov/datasets/genome/">data</a>.
- You may also access it programmatically via <a href="https://www.ncbi.nlm.nih.gov/datasets/docs/v2/command-line-tools/download-and-install/">command-line tools</a> or <a href="https://biopython.org/docs/1.76/api/Bio.Entrez.html">Biopython</a>.

- Files may also be found at https://github.com/cgenomicslab/intro-to-Python-course/tree/main/2025/data

* **Download** and **parse** (store in a dictionary) <a href="https://www.ncbi.nlm.nih.gov/nuccore/KT964724.1?report=fasta" target="_blank">KT964724.1</a> sequence entry from NCBI
    * Take into account that the sequence is split across multiple lines (70-character chunks in the excerpt below)
    * Keep only the sequence ID (KT964724.1) from the header (using the python string split() method)

```text
>KT964724.1 Euplokamis dunlapae putative nonfluorescent protein mRNA, complete cds
ATTACTATATTTTAATTGAGTGCCTAGTGAGCAGCAATGGACTCCCGTATGGAAAGGGCAGAGTCTGTCT
TTGCAGGCAGTATAAAGAGTAAACTCATTGCTGACTTAGTTTATGAGGACCAAACCTACAAGTTGTCAGG
GGAAGGGTTTGGGAACCCCCAAGAGGGTCAGCATACGTTAGAGATGAGGTGTTTTGGTACGGAGGCATGC
CCTCTTTCATGGTTTGTACTGGGTCCAGTGATACAGTACAGTTACAGAATGTTCACCCAGTATTCAGGTA
ACGGAATGTACGACTTCTTCAAGACCTCCTTCCCTGGTGGTCTTAGTACAGAGTCAGTGTGTACCTTCAA
TGATGGGGCTACTATCACTGGCAGTCATAATATCAGCTTTGTCAAGGACATCGTGGTCTGCAGATCTAAG
CTGGAGTGCGCAGGGTTCAATGACGAGTCTCTGGCCCTGTCCCAGGAGCTGTCCCAAGTCAAGCCTTGTT
ACGAGATAATAGATGGATCCGGAGTAGACGCTGTTTCCAGCTCTGTCAAACTCGAGTGGGACTTGTCAGA
CGGGGATAAGTACAGTGCCCAGGTAGAGTCAGTGATCAGGAGTAAGACCAACTTTGCACCACAGAGACAC
TTTATAGCTCATCACAGCAAGGTGATTGAAAAGTCGCAGAACAATCTGCACTTTTCCCAGCGTGATAAGT
CCAGAGCAAACGTCATCAACTTCTACCTGCATAAAGAACAACACAAACGATAGGTCACGTTTAATGAGCA
AGTTCTCGCAGTCTGTTTAGCACTCCGGGACTCTGTCTACCGTGGGAACTGAGAGGTGTCACGGGGAAGA
TGTCTGTTTGTATTCGGTAGTTAATTTGAACTTGAAGGGAGTTTGTATGGAGTTGCTTGTTGAATGGGAT
AAAGCTTTTGAAATGCTTGAGAAGCTTAGGAAGCGGAAGAGTTAAGAGGGGATTTTGAAAAAGGAATGCT
TAAATTATTTTGTTTGACG
```
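One possible way to parse such an entry into a dictionary is sketched below. The function name `parse_fasta` and the shortened two-line example are assumptions made for illustration; the sketch reads from an in-memory string rather than the downloaded file so that it runs on its own.

```python
import io

def parse_fasta(handle):
    """Return {sequence_id: sequence} from an iterable of FASTA lines."""
    records = {}
    seq_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Keep only the first whitespace-separated field, without ">"
            seq_id = line[1:].split()[0]
            records[seq_id] = []
        elif seq_id is not None:
            records[seq_id].append(line)
    # Join the per-line chunks into one string per record
    return {sid: "".join(chunks) for sid, chunks in records.items()}

# Shortened excerpt of the entry, standing in for the downloaded file
example = io.StringIO(
    ">KT964724.1 Euplokamis dunlapae putative nonfluorescent protein mRNA, complete cds\n"
    "ATTACTATATTTTAATTGAG\n"
    "TGCCTAGTGAGCAGCAATGG\n"
)
sequences = parse_fasta(example)
print(sequences)
```

With a real file, the same function can be given an open file handle, for example inside a `with open(...) as handle:` block.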

- Find the overall length of the sequence.

- Get the reverse of the sequence using **indexing with negative step** (sequence[Start : End : Step], see also [here](https://www.geeksforgeeks.org/slicing-with-negative-numbers-in-python/) for slicing)
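A minimal sketch of both steps, using the first 20 nucleotides of the entry as a stand-in for the full sequence:

```python
sequence = "ATTACTATATTTTAATTGAG"  # first 20 nt of the entry, for illustration

length = len(sequence)

# Start and end are omitted, so the whole string is traversed with step -1
reversed_sequence = sequence[::-1]

print(length)
print(reversed_sequence)
```

Note that this gives the reverse, not the reverse complement, of the sequence.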

- How many start codons (ATG) can be found in the three 5′→3′ reading frames?
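One way to approach this is to step through the sequence three bases at a time, once per frame. The sketch below uses a made-up toy sequence and a hypothetical helper name, `count_start_codons`.

```python
sequence = "ATGAAATGCCATGTAG"  # toy sequence, not the real entry

def count_start_codons(seq, frame):
    """Count ATG codons read in the given forward frame (0, 1 or 2)."""
    count = 0
    # Visit codon start positions frame, frame + 3, frame + 6, ...
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] == "ATG":
            count += 1
    return count

for frame in range(3):
    print(f"Frame {frame + 1}: {count_start_codons(sequence, frame)} ATG codon(s)")
```

In contrast, `sequence.count("ATG")` counts non-overlapping occurrences regardless of frame.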

- If the sequence starts with a start codon, keep it as is; otherwise find the first ATG and remove the leading sequence upstream of it (if first three letters == "ATG" else find it). (**str.find?** for help)
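A possible sketch of this step on a made-up toy sequence. Returning an empty string when no ATG is present is a choice made here, not something the exercise prescribes.

```python
sequence = "CCGTATGGAAAGGTAA"  # toy sequence, not the real entry

if sequence[:3] == "ATG":
    trimmed = sequence
else:
    start = sequence.find("ATG")  # index of the first ATG, or -1 if absent
    trimmed = sequence[start:] if start != -1 else ""

print(trimmed)
```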

- Find the **first** in-frame stop codon (TAA, TAG, TGA) (**increment counter by 3 in a while loop**)

- Get the **coding sequence** from the start codon to the last codon before the stop codon (i.e., excluding the stop codon)
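The two steps above (finding the first in-frame stop codon and extracting the coding sequence) can be sketched as follows. The toy sequence is made up, and the coding sequence is taken here to exclude the stop codon, which is one reading of the task.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
sequence = "ATGGAAAGGTAACCTGA"  # toy sequence that already starts at ATG

position = 0
stop_position = -1  # stays -1 if no in-frame stop codon is found
while position + 3 <= len(sequence):
    if sequence[position:position + 3] in STOP_CODONS:
        stop_position = position
        break
    position += 3  # move to the next codon in the same frame

# Coding sequence from the start codon up to, but excluding, the stop codon
cds = sequence[:stop_position] if stop_position != -1 else sequence
print(stop_position, cds)
```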

- Write a **function** that takes mRNA (cds, nucleotides) as input and returns the amino acid (aa) sequence using the standard genetic code (below).

```python
codon2aa = {
        'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
        'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
        'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
        'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',                 
        'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
        'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
        'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
        'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
        'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
        'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
        'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
        'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
        'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
        'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
        'TAC':'Y', 'TAT':'Y', 'TAA':'*', 'TAG':'*',
        'TGC':'C', 'TGT':'C', 'TGA':'*', 'TGG':'W'
    }
```    
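A minimal sketch of such a function is given below. To keep the example self-contained, the table is rebuilt compactly from the standard genetic code (NCBI translation table 1), which yields the same 64 entries as the dictionary above. Ignoring an incomplete trailing codon and reporting unknown codons as `X` are assumptions of this sketch.

```python
from itertools import product

# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
codon2aa = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(cds):
    """Translate a nucleotide coding sequence into an amino acid string."""
    cds = cds.upper()
    protein = []
    # Ignore any incomplete codon at the 3' end
    for i in range(0, len(cds) - len(cds) % 3, 3):
        # Codons with ambiguous bases (e.g. N) are reported as 'X'
        protein.append(codon2aa.get(cds[i:i + 3], "X"))
    return "".join(protein)

print(translate("ATGGAAAGGTAA"))
```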

## Codon frequencies

- What is the GC content of the gene? Write a **function** that returns it and print "Gene's GC content: X%" using string formatting.
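One possible sketch, using a made-up toy sequence; the one-decimal formatting is an arbitrary choice.

```python
def gc_content(seq):
    """Return the percentage of G and C bases in a sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0  # avoid division by zero for an empty sequence
    return 100 * (seq.count("G") + seq.count("C")) / len(seq)

gene = "ATGGAAAGGTAA"  # toy sequence, not the real entry
print(f"Gene's GC content: {gc_content(gene):.1f}%")
```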

- Count the total number of codons and the number of unique codons in the sequence (list vs set). See the **range** function (**range?** for help).
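A short sketch of the list-versus-set idea on a made-up coding sequence:

```python
cds = "ATGGAAGAAAGGTAA"  # toy coding sequence, length divisible by 3

# range(start, stop, step) yields the start index of each codon
codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
unique_codons = set(codons)  # a set keeps each codon only once

print(len(codons), len(unique_codons))
```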

- Create a codon counter: a dictionary with codons as keys and their counts as values (see also the dict.setdefault method).

- Do the same using Python’s Counter (import from collections).
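Both approaches can be sketched side by side on a made-up codon list:

```python
from collections import Counter

codons = ["ATG", "GAA", "GAA", "AGG", "TAA"]  # toy codon list

# Plain dictionary: setdefault() inserts the key with 0 if it is missing
codon_counts = {}
for codon in codons:
    codon_counts.setdefault(codon, 0)
    codon_counts[codon] += 1

# Counter does the same bookkeeping in one call
counter = Counter(codons)

print(codon_counts)
print(counter.most_common(1))
```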

- Estimate the **frequencies** of **codons** and of **aa categories** below (A, B, C, D).
    - Use list comprehension and Python’s Counter (import from collections).
    - ```python
      # cat2aa = {'A':['R','H','K', 'D', 'E']} etc.
      ```

![title](https://cdn.technologynetworks.com/tn/images/body/aminoacids-pic3revised1574260662291.png)
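A possible sketch of the frequency estimates is shown below. Only category A is spelled out in the text, so the other categories are left out here and should be filled in from the figure; amino acids without a category are labelled `unassigned`, which is a placeholder name chosen for this example. The codon list and protein string are made up.

```python
from collections import Counter

codons = ["ATG", "GAA", "GAA", "AGG", "TAA"]  # toy codon list
codon_counts = Counter(codons)
total = sum(codon_counts.values())
codon_freqs = {codon: n / total for codon, n in codon_counts.items()}

# Only category A is given in the text; B, C and D are omitted here.
cat2aa = {"A": ["R", "H", "K", "D", "E"]}
# Invert the mapping: amino acid -> category
aa2cat = {aa: cat for cat, aas in cat2aa.items() for aa in aas}

protein = "MERKD"  # toy protein sequence
categories = [aa2cat.get(aa, "unassigned") for aa in protein]
cat_counts = Counter(categories)
cat_freqs = {cat: n / len(protein) for cat, n in cat_counts.items()}

print(codon_freqs)
print(cat_freqs)
```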

- **Download** and **parse** the Genomic coding sequences (cds) for the Reference African savanna elephant (*Loxodonta africana*) genome from [NCBI](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_030014295.1/)
    - We want the CDS (cds_from_genomic.fna.gz) from the [ftp site](https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/030/014/295/GCF_030014295.1_mLoxAfr1.hap2/).
    - **Optional**: There are multiple transcripts/isoforms per gene. To avoid redundancy in the data we may keep only the longest transcript/isoform per gene.

```bash
wget -q "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/030/014/295/GCF_030014295.1_mLoxAfr1.hap2/GCF_030014295.1_mLoxAfr1.hap2_cds_from_genomic.fna.gz"
```
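The compressed file can be read without unpacking it first. The sketch below is one way to do this and to keep the longest sequence per gene; it assumes that each header carries a `[gene=...]` tag, and it runs on a tiny made-up file instead of the real download. The function names and the demo records are hypothetical.

```python
import gzip
import re
import tempfile
from pathlib import Path

def read_fasta_gz(path):
    """Yield (header, sequence) pairs from a gzip-compressed FASTA file."""
    header, chunks = None, []
    with gzip.open(path, "rt") as handle:  # "rt" decodes bytes to text
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)  # last record in the file

def longest_per_gene(records):
    """Keep the longest sequence per gene, using the [gene=...] header tag."""
    best = {}
    for header, seq in records:
        match = re.search(r"\[gene=([^\]]+)\]", header)
        # Fall back to the sequence ID if no gene tag is present
        gene = match.group(1) if match else header.split()[0]
        if gene not in best or len(seq) > len(best[gene]):
            best[gene] = seq
    return best

# Tiny made-up file standing in for the downloaded cds_from_genomic.fna.gz
demo = Path(tempfile.mkdtemp()) / "demo_cds.fna.gz"
with gzip.open(demo, "wt") as out:
    out.write(">lcl|x_cds_1 [gene=GENE1]\nATGAAA\nTAA\n")
    out.write(">lcl|x_cds_2 [gene=GENE1]\nATGAAACCCTAA\n")
    out.write(">lcl|x_cds_3 [gene=GENE2]\nATGTAG\n")

cds_by_gene = longest_per_gene(read_fasta_gz(demo))
print(cds_by_gene)
```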

- **Check** if all CDS start with a start codon (ATG).
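A minimal sketch of this check, assuming the sequences are already stored in a dictionary; the gene names and sequences below are made up.

```python
# Made-up records standing in for the parsed elephant CDS
cds_by_gene = {"GENE1": "ATGAAACCCTAA", "GENE2": "CTGAAATAG", "GENE3": "ATGTAG"}

# Collect the records that do not begin with ATG
no_start = [name for name, seq in cds_by_gene.items()
            if not seq.upper().startswith("ATG")]
all_start_with_atg = not no_start

print(all_start_with_atg, no_start)
```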

- Estimate the overall codon frequencies of the elephant genome
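Genome-wide frequencies can be estimated by pooling the codon counts of all coding sequences, as in the sketch below. The two sequences are made up, and skipping an incomplete trailing codon is an assumption of this example.

```python
from collections import Counter

cds_list = ["ATGAAACCCTAA", "ATGAAATAG"]  # made-up sequences

codon_counts = Counter()
for cds in cds_list:
    usable = len(cds) - len(cds) % 3  # drop an incomplete trailing codon
    codon_counts.update(cds[i:i + 3] for i in range(0, usable, 3))

total = sum(codon_counts.values())
# most_common() orders codons from most to least frequent
codon_freqs = {codon: n / total for codon, n in codon_counts.most_common()}

print(codon_freqs)
```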

## Plotting

- Plot 