<small><small><i>
Introduction to Python for Bioinformatics - available at https://github.com/GunzIvan28/MScMak2021-IntroductionToPython.
</i></small></small>

## Files, Scripting and Modules

So far, we have been writing all our Python code in Jupyter notebooks. However, if you want to use that code as part of a pipeline, you need to write scripts. Also, most of the time the data you need to analyse is in a file, which you need to read into Python and process.


### Reading Files

So far, all our data has lived in memory. In bioinformatics, you will usually need to read input from a file or write output to one. Both start with the `open` function.

In [63]:
myfile = open("../Data/test.txt", "w")        # "w" creates the file or overwrites an existing one; r=reading, w=writing, a=appending
myfile.write("My first file written from Python \n")
myfile.write("---------------------------------\n")
myfile.write("Hello, world!\n")
# myfile.close()  # Closes the file (always required; don't forget)

14

In [7]:
type(myfile)    # A text wrapper object; it provides the methods for reading, writing and moving around in the file

_io.TextIOWrapper

In [None]:
myfile.seek(2)

In [17]:
read_file = open("../Data/test.txt", 'r')

In [None]:
read_file.readline() # Reads one line per call

In [None]:
read_file.readlines() # Reads all remaining lines from the current position into a list

In [None]:
read_file.seek(0) #Brings cursor back to the start

In [None]:
read_file.readlines()

The **mode** in which you open the file determines whether you write (`w`), read (`r`) or append (`a`) to the file.

Opening a file creates what we call a **file handle**, which has methods for manipulating the file. In our case, `myfile` has the methods to write to and close the file. Closing the file flushes any buffered writes to disk and releases the file.

Alternatively, you can open the file in a `with` statement, which closes the file automatically when the block ends.

In [22]:
with open("../Data/test.txt", "w") as myfile:
    myfile.write("My first file written from Python \n")
    myfile.write("---------------------------------\n")
    myfile.write("Hello, world!\n")
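
As a minimal sketch of how the three modes differ, the cell below writes to, appends to, and then reads a scratch file. The temporary directory and the file name `modes_demo.txt` are assumptions made for illustration and are not part of the course data.

```python
import os
import tempfile

# Work in a throwaway directory so no course data is touched (hypothetical file name).
demo_path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

with open(demo_path, "w") as handle:   # "w" creates the file or truncates an existing one
    handle.write("first line\n")

with open(demo_path, "a") as handle:   # "a" keeps the existing content and adds to the end
    handle.write("second line\n")

with open(demo_path, "r") as handle:   # "r" opens the file for reading only
    demo_lines = handle.readlines()

print(demo_lines)
print(handle.closed)   # the with block has already closed the file
```

Opening the same path with `"w"` a second time would have discarded the first line, which is why appending uses `"a"`.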

Let's check what else we can do with `open`.

In [None]:
?open

#### Fetching file from the web
Download this [file](https://www.uniprot.org/docs/humchrx.txt), which we will use to explore file reading in Python.

In [None]:
import urllib.request                                 # Standard-library counterpart to `wget` in bash

url = "https://www.uniprot.org/docs/humchrx.txt"      # URL to download
destination_filename = "../Data/humchrx.txt"          # Where to save the file
urllib.request.urlretrieve(url, destination_filename) # Downloads url and saves it as destination_filename
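
Re-running the download cell fetches the file again each time. One possible way to avoid that is sketched below. The helper name `fetch_if_missing` is hypothetical, and the demonstration uses a local `file://` URL in a temporary directory so that it runs without network access.

```python
import os
import tempfile
import urllib.request
from pathlib import Path

def fetch_if_missing(url, destination):
    """Download url to destination unless the file already exists (hypothetical helper)."""
    if os.path.exists(destination):
        return False                      # nothing was downloaded
    urllib.request.urlretrieve(url, destination)
    return True

# Demonstrate with a local file:// URL; a real call would pass an https:// URL instead.
workdir = tempfile.mkdtemp()
source = Path(workdir) / "source.txt"
source.write_text("Gene names\n")
target = os.path.join(workdir, "copy.txt")

first_call = fetch_if_missing(source.as_uri(), target)    # downloads the file
second_call = fetch_if_missing(source.as_uri(), target)   # skips, the file is already there
print(first_call, second_call)
```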

#### Reading a file line-at-a-time

We can read the file line by line using `readline`. Each call reads one line, until the end of the file is reached. This is suitable for a large file that may not fit in memory.

In [None]:
humchrx = open('../Data/humchrx.txt', 'r')
line = humchrx.readline()   # Reads only the next line, here the first one
print(line)

In [27]:
humchrx.close()

In [28]:
with open('../Data/test.txt', 'r') as myfile:
    while True:
        line = myfile.readline()
        if len(line) == 0: # If there are no more lines
            break
        print(line) 

My first file written from Python 

---------------------------------

Hello, world!



In [None]:
with open('../Data/humchrx.txt', 'r') as myfile:
    while True:
        line = myfile.readline()
        if len(line) == 0: # If there are no more lines
            break
        print(line) 
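
The loops above print an empty line after every line because each line read from the file still ends in `\n` and `print` adds another newline. A more compact sketch iterates over the file handle directly and strips the trailing newline. The scratch file name `lines_demo.txt` is an assumption used only to keep the example self-contained.

```python
import os
import tempfile

# Build a small scratch file (hypothetical name) so the sketch runs on its own.
demo_path = os.path.join(tempfile.mkdtemp(), "lines_demo.txt")
with open(demo_path, "w") as handle:
    handle.write("Hello, world!\n\nlast line\n")

stripped_lines = []
with open(demo_path, "r") as handle:
    for line in handle:                            # the handle yields one line at a time
        stripped_lines.append(line.rstrip("\n"))   # drop the trailing newline

print(stripped_lines)
```

Like the `readline` loop, this keeps only one line in memory at a time, so it also suits large files.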

### Read the whole file

If the file is small or your computer has enough memory, you can read the whole file into memory as a list of lines using `readlines`.

In [None]:
with open('../Data/test.txt', 'r') as myfile:
    lines = myfile.readlines()                 # Reads all lines into a list
    for line in lines:
        print(line)

or as a single string using `read`:

In [None]:
with open('../Data/test.txt', 'r') as myfile:
    whole_file = myfile.read()                  # Reads the whole file as a single string
    print(whole_file)

In [None]:
with open('../Data/humchrx.txt', 'r') as myfile:
    whole_file = myfile.read()                  # Reads the whole file as a single string
    print(whole_file)

In [None]:
# Note: a blank line in the file is read as the string '\n'

### Exercise 1

Write a function that reads the file (`humchrx.txt`) and writes a clean list of gene names to another file (`gene_names.txt`).

In [35]:
humchr=open('../Data/humchrx.txt', 'r')

In [36]:
line = humchr.readlines()

In [None]:
line

In [38]:
line = humchr.readline() #Investigate

In [39]:
for line in humchr:
    print(line)           # Prints the contents of the file

Genes have been written successfully!!


In [54]:
#WHOLE SCRIPT MERGED TOGETHER#


In [55]:
writeGeneList(clean_gene_list)

Genes have been written successfully!!


### Scripts and Modules

A script is a file containing Python definitions and statements for performing some analysis. Scripts are known as modules when they are intended for use in other Python programs. Many modules come with Python as part of the standard library.

You can get a list of available modules using `help('modules')` and explore them.

In [None]:
ls

In [None]:
cd ../Scripts/

In [61]:
"""write_genes.py takes an annotation file and
writes gene names to file
Usage:
    python write_genes.py <gene_file> <out_file>"""
import sys


def getGenList(gene_file):
    """Return the gene names found in the annotation file."""
    with open(gene_file, 'r') as humchr:
        tag = False  # Becomes True once a line starting with 'Gene' is reached
        gene_list = []
        for line in humchr:
            if line.startswith('Gene'):
                tag = True
            if tag:
                line_split = line.split()
                # Skip blank lines and entries containing '-' (such as separator lines)
                if len(line_split) != 0 and '-' not in line_split[0]:
                    gene_list.append(line_split[0])
    # Trim the first three and last two entries, which the slicing treats as header and footer
    return gene_list[3:][:-2]


def writeGeneList(clean_gene_list, out_file):
    """Write one gene name per line to out_file."""
    with open(out_file, 'w') as gene_names:
        for gene in clean_gene_list:
            gene_names.write(gene + '\n')
    print('Genes have been written successfully!!')


# Only parse the command line when run as a script, not when imported as a module
if __name__ == '__main__':
    if len(sys.argv) < 3:
        print(__doc__)
    else:
        gene_file = sys.argv[1]
        out_file = sys.argv[2]
        clean_gene_list = getGenList(gene_file)
        writeGeneList(clean_gene_list, out_file)

FileNotFoundError: [Errno 2] No such file or directory: '--ip=127.0.0.1'

In [None]:
ls -l

In [5]:
import write_genes

In [6]:
from write_genes import *

In [None]:
getGenList('../Data/humchrx.txt')

In [None]:
%%bash
python write_genes.py ../Data/humchrx.txt ../Data/gene_names2.txt

### OR

In [None]:
!python write_genes.py ../Data/humchrx.txt ../Data/gene_names2.txt


### Writing your own modules

All we need to do to create our own modules is to save our script as a file with a `.py` extension. Suppose, for example, this script is saved as a file named `seqtools.py`.

```python
def remove_at(pos, seq):
    return seq[:pos] + seq[pos+1:]
```
    
We can import the module as:

In [None]:
import seqtools

In [None]:
s = "A string!"
seqtools.remove_at(4,s)

In [None]:
'23,000,'.replace(',','')

Modules are useful when you want to analyse large data sets on the HPC or build your own library of handy functions.
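
Python can only import a module that sits in a directory it searches, such as the current directory or an entry in `sys.path`. The sketch below illustrates this under stated assumptions: it writes a small module with the hypothetical name `seqtools_demo` into a temporary directory, adds that directory to `sys.path`, and imports it. The `if __name__ == '__main__':` block inside the module runs only when the file is executed as a script, not when it is imported.

```python
import importlib
import os
import sys
import tempfile

# Write a tiny module to a scratch directory (the module name is hypothetical).
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "seqtools_demo.py"), "w") as handle:
    handle.write(
        "def remove_at(pos, seq):\n"
        "    return seq[:pos] + seq[pos + 1:]\n"
        "\n"
        "if __name__ == '__main__':\n"
        "    print(remove_at(4, 'A string!'))\n"
    )

sys.path.insert(0, module_dir)     # make the scratch directory importable
importlib.invalidate_caches()      # make sure the new file is noticed
seqtools_demo = importlib.import_module("seqtools_demo")

result = seqtools_demo.remove_at(4, "A string!")
print(result)
```

Importing the module prints nothing by itself, because the guarded block is skipped on import.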

#### Running scripts

When you have put your commands into a `.py` file, you can execute it from the command line by invoking the Python interpreter using `python script.py`.

### Exercise 2

1. Convert the function you wrote in exercise 1 into a python module. Then, import the module and use the function to read `humchrx.txt` file and create a gene list file.
2. Create a stand-alone script that does all the above.


### Script that takes command line arguments
So far, we can create a script that does one thing. In this case, you have to edit the script if you have a new gene file to analyse or you want to use a different name for the output file.

#### sys.argv
`sys.argv` is a list that contains the command-line arguments passed to the script; its first element is the name of the script itself. Let's add this to a script `sysargv.py` and run it on the command line.

```python
import sys
print("This is the name of the script: ", sys.argv[0])
print("Number of arguments: ", len(sys.argv))
print("The arguments are: " , str(sys.argv))
```

In [None]:
!python sysargv.py test
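
Indexing `sys.argv` by hand works for one or two arguments. For anything larger, the standard library's `argparse` module is a common alternative that checks the arguments and generates a usage message. The following is a minimal sketch, not part of the original course material; the function name `parse_arguments` and the argument names `gene_file` and `out_file` are assumptions chosen to match the gene list script.

```python
import argparse

def parse_arguments(argv=None):
    """Parse an input path and an output path (argument names are hypothetical)."""
    parser = argparse.ArgumentParser(description="Write gene names to a file.")
    parser.add_argument("gene_file", help="annotation file to read")
    parser.add_argument("out_file", help="file to write the gene names to")
    return parser.parse_args(argv)   # argv=None means: use sys.argv[1:]

# Pass an explicit list here so the sketch also runs inside a notebook.
args = parse_arguments(["../Data/humchrx.txt", "../Data/gene_names.txt"])
print(args.gene_file, args.out_file)
```

In a script you would call `parse_arguments()` without a list, and running the script with `-h` would print the generated help text.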

### Exercise 3

- Using the same concept, convert your script in exercise 1 to take command line arguments (input and output files)
- Using a DNA sequence read from file, answer the following questions:
    1. Show that the DNA string contains only four letters.
    2. In the DNA string there are regions that have a repeating letter. What is the letter and length of the longest repeating region?
    3. How many ’ATG’s are in the DNA string?

### File handling with the os, shutil and os.path modules

Python can also interface directly with the operating system using the **os**, **shutil** and **os.path** modules.

First, let's import the `os` module.

In [9]:
import os

In [10]:
os.getcwd() #Same as pwd in bash

'/home/gunz/Ivan-Python/Scripts'

In [11]:
os.chdir('..') # Moves up one directory

In [12]:
os.getcwd()

'/home/gunz/Ivan-Python'

In [None]:
os.chdir('INotebooks/')

In [None]:
?os

In [None]:
os.listdir()

In [None]:
os.path.isdir('../Scripts/bank.py')

In [None]:
os.path.isfile('../Scripts/bank.py')

### path manipulation
The `path` module inside the `os` module contains functions for path manipulation. For example, you can use `path.join()` to join paths.
- `path.exists(path)`: Checks if a given path exists.
- `path.split(path)`: Returns a tuple `(head, tail)`, where `tail` is the file or directory name at the end and `head` is the rest of the path.
- `path.splitext(path)`: Splits off the extension of a file. It returns a tuple `(root, ext)`, where `ext` is the dotted extension and `root` is everything before it.
- `path.join(directory1, directory2, ...)`: Joins two or more path components, inserting the operating system path separator as needed.
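
The functions listed above can be tried out as follows. This is a small sketch that reuses the `../Data/humchrx.txt` path from earlier; the last line prints `True` only if that file is present relative to your current working directory.

```python
import os

path = os.path.join("..", "Data", "humchrx.txt")   # separator is chosen by the operating system
directory, filename = os.path.split(path)          # (head, tail)
root, extension = os.path.splitext(filename)       # (root, ext)

print(path)
print(directory, filename)
print(root, extension)
print(os.path.exists(path))
```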

In [None]:
?os.path.join

Explore more in your own time.

### Shutil
The `shutil` module provides utility functions for copying and archiving files and directory trees.

In [None]:
import shutil

In [None]:
?shutil
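
As a minimal sketch of what `shutil` offers, the cell below copies a file, archives a directory as a zip file, and then removes the whole scratch directory. All file and directory names are hypothetical and live in a temporary directory, so no course data is affected.

```python
import os
import shutil
import tempfile

workdir = tempfile.mkdtemp()
results_dir = os.path.join(workdir, "results")
os.mkdir(results_dir)

original = os.path.join(results_dir, "genes.txt")
with open(original, "w") as handle:
    handle.write("GENE1\nGENE2\n")

# Copy a single file; shutil.copy returns the path of the new file.
backup = shutil.copy(original, os.path.join(results_dir, "genes_backup.txt"))

# Archive the contents of results_dir; the ".zip" extension is added automatically.
archive = shutil.make_archive(os.path.join(workdir, "results"), "zip", root_dir=results_dir)

listing = sorted(os.listdir(results_dir))
print(listing)
print(os.path.basename(archive))

shutil.rmtree(workdir)   # remove the scratch directory and everything in it
```

Note that `shutil.rmtree` deletes without asking for confirmation, so double-check the path before using it on real data.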

## Exercise

a. Write a function called `make_album()` that builds a dictionary describing a music album. 
The function should take in an artist name and an album title, and it should return a dictionary containing these two pieces of information. 
Use the function to make three dictionaries representing different albums. Print each return value to show that the dictionaries are storing the album information correctly.

b. Add an optional parameter to `make_album()` that allows you to store the number of tracks on an album. If the calling line includes a value for the number of tracks, add that value to the album's dictionary. Make at least one new function call that includes the number of tracks on an album.


In [8]:

def make_album(name, title):
    """"""
    
    artist_dict = {"artist":name, "album":title}
    return artist_dict
       
make_album("celine", "a new day has come" )

{'artist': 'celine', 'album': 'a new day has come'}

In [7]:
def make_album(name, title):
    """"""
    artist_dict = {}
    artist_dict[name] = title
    return artist_dict
       
make_album("Ali", "Has no album" )

{'Ali': 'Has no album'}

In [9]:
def make_album(name, title, num_tracks=None):
    """"""
    
    artist_dict = {"artist":name, "album":title}
    if num_tracks is None:
        artist_dict["num_tracks"] = 0
    else:
        artist_dict["num_tracks"] = num_tracks
    return artist_dict
       
make_album("celine", "a new day has come" )

{'artist': 'celine', 'album': 'a new day has come', 'num_tracks': 0}

In [10]:
make_album("celine", "a new day has come", 20 )

{'artist': 'celine', 'album': 'a new day has come', 'num_tracks': 20}

In [None]:
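def make_album(artist, title, tracks_no=None):
    '''A function that takes an artist name and an album title
    and creates a dictionary holding the information.
    Usage:
    
    python number2b-codes.py 'Artist_name' 'title' 'int(tracks_no)'?
    '''
    # Named album rather than dict, so the built-in dict type is not shadowed
    album = {
        'artist': artist.title(),
        'title': title.title(),
        }
    if tracks_no:
        album['Tracks'] = tracks_no
    print(album)
    return album

# Example values; in a script these would come from sys.argv, as in the usage string
artist, title, tracks_no = 'celine', 'a new day has come', 20
make_album(artist, title, tracks_no)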

Write a Python function that, using a DNA sequence read from file, answers the following questions:
1. Shows that the DNA string contains only four letters.
2. In the DNA string there are regions that have a repeating letter. What is the letter and length of the longest repeating region?
3. How many ’ATG’s are in the DNA string?

NB: Use the file `coding_seq.fa`. Bonus points if the script works for a multisequence fasta file.
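
If you are unsure where to start, the sketch below shows one possible approach on a short made-up sequence rather than on `coding_seq.fa`. The function name `summarise_dna` is hypothetical. It uses a set for the distinct letters, `itertools.groupby` for runs of a repeated letter, and `str.count` for the motif. Reading the sequence from a FASTA file is left to you.

```python
from itertools import groupby

def summarise_dna(seq):
    """Return the letters used, the longest single-letter run, and the ATG count."""
    seq = seq.upper()
    letters = set(seq)
    # groupby yields consecutive runs of identical letters
    longest_letter, longest_length = max(
        ((letter, len(list(run))) for letter, run in groupby(seq)),
        key=lambda pair: pair[1],
    )
    # str.count is non-overlapping, which is fine here because ATG cannot overlap itself
    atg_count = seq.count("ATG")
    return letters, (longest_letter, longest_length), atg_count

toy_seq = "ATGCCCCATGGTTA"   # made-up sequence, not taken from coding_seq.fa
letters, longest, atg_count = summarise_dna(toy_seq)
print(sorted(letters), longest, atg_count)
```

If several runs share the maximum length, `max` reports the first one it encounters, and an empty sequence would raise an error, so a fuller solution should handle both cases.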

In [None]:
dna='AGGGTTTCTCTGTGTAGCCCTGGCTGTCCTGGAACTCACTCTGTAGACCAGGCTGGCCTTGAACTCAGAAATCTGCCGGCCTCTGCCTCCCAAGTGCTGGGATTAAAGGTGTGTGCCACCACAGCTCAGGGTTCTTTTTTATCATTAAAATAATTTATTACTTTTTAGTTCATGTACATTGGTGTTTCATCTGTGTGTGTCTGTATGAAGGCTTTGGATCCCCTGGAGTTACAGACAGTTATTAGCTGCCATGTGGGTGCTGGGAATTGAACCCAGATCCTCTGGAAGAGCAGCCAGTGCTCTTAACTGCTGAGCTATTTCTCTCGCCCTGGCAGCTACTTTTCTATAGATTATTCTAATTATTTTATACAGATGAACTACAGGCTGGGGATGGGGAGATGGCTCACCAGGTGAGAGCCTTTGCCATGCAATGCCCAGAACCCAGGCTGGAAGGGAAGACCTGACCTCTACAGTCAGGCTGCAGCACCCCTGCCCCATCATGCACATACACACATAAATAAAATAAAACCCAAATGGACTCATACAGTATTTGCCTTTGTGACTAGCTTATTTTATTGAGCAATTTCACCCATAGCATTTCAAATAGAACAGCTTCAAGTGTACAGCAAAATTAAATAGATGGTACAAGGGTTTCCTAAATGTCTCCTGCCCTTGATATATTGCTTACCCCTCTCTTAAATGTTTCACTTCCTAAATAATACCTATGTGAGGTAATGCATATTTAATTGGCTAGATTTTATCATTTATGATGTGTATATAATTTTCAAATAGCATGCTGTATATGATAAATAGTTTTATCTCTCTATTTGAAATACAAATTAAACTTTACAAAGACTTCACAGCGTCTCCTGTTTATTGCAGGGGATATGTTCACTGGACTTCAGCAGACACCTAAGACTGGATAGTAGTAACCTAAGCCACAGTCTAGTCGCTCACTGTGGCCATAACATTTTAGCTACTTCCCTCCACCTTCATGTAGCTCCTGTGCATGTTTTCGTTTATACCTTAATATTTCACTTTTAGGAGGCATTGATAGAAGTGAAACTACATCTGATTCCAAATGCTACTTGTTCATTGTTGATACATAAGAAAGCATTTATTTATTTATGTATCTACCACATCCTACTTGTTGTTCAATCCAGGAGTCTTTGGTTGATCACCTTTATATGTAGACAGTCATGCCATGCAAAAACAGTTGTGTTTTCCTTCTCAGAGGCCCCTCTCCTGCTTTATCTTCCTCTTTGCTCCGCCCTCTCTCTCTTGCCCTCCCTTACCACTGTTGCCTCCTTTCCTTTCCTTTTTTCCTTTTCCTTTTTCTTGTGGTTTTCCGAGACAGGGTTTCTCCGTATAACCCTGACTGTCCTGGAACTCTCTGCCTCCCGAGTGCTGGGATTAAAGGCGTGCACCACCACCGCCCGGGTGTCTCCTTTTCTTTTATTGTTCTTTTCTTTGTTCTTTTACTACATAAACTGAGTTCCAGTATAATGTTGACAATAGAAGACATCCTTTTCTTGCTCCTGATTTTAATGGGAAAGGTCGAATGGTATGTGGTTCATGTAGACCACATTTTGTTTCCCTCTCACCCATTGATGGACACTTGGGTAGCTTCCATTTTTGGCTGTTGTGAATAATGCTGCTATGAACATGGGTGTGCACAGAGCTCTCTGAGACGCTGCTTTCAGTCCTTCTGGCAGTAGATCTTCATGGAGGAGCACGGAGTGACCCAAACTGAACACATGGCTACCATAGAAGCCCATGCAGTGGCCCAGCAAGTCCAGCAGGTCCATGTAGCCACGTACACTGAGCACAGTATGCTAAGTGCTGATGAAGACTCCCCTTCCTCCCCCGAGGACACTTCTTATGATGACTCGGACATCCTCAACTCCACGGCAGCTGATGAGGTAACTGCCCATCTGGCTGCTGCAGGTCCTGTGGGAATGGCCGCTGCTGCTGCTGTGGCAACAGGGAAGAAAC
GGAAACGGCCTCATGTGTTTGAGTCTAATCCATCTATCCGAAAGAGACAGCAGACACGTTTGCTTCGGAAACTCAGAGCCACGTTGGATGAGTACACGACGCGAGTGGGACAGCAAGCGATTGTACTCTGCATCTCACCCTCCAAACCCAACCCTGTCTTCAAGGTGTTTGGCGCAGCACCTTTGGAGAATGTGGTGCGAAAGTACAAGAGCATGATCCTGGAAGACCTCGAGTCTGCTCTGGCAGAACACGCCCCTGCGCCACAGGAGGTTAATTCAGAGCTGCCGCCTCTCACCATCGATGGGATTCCAGTCTCTGTGGACAAAATGACCCAGGCTCAGCTTCGGGCATTTATCCCAGAGATGCTCAAGTATTCCACAGGTCGGGGGAAACCAGGCTGGGGGAAAGAAAGCTGCAAGCCTATCTGGTGGCCAGAAGATATCCCATGGGCCAATGTCCGCAGTGATGTCCGCACAGAAGAGCAAAAACAAAGGGTTTCATGGACCCAGGCATTACGGACCATAGTTAAAAATTGCTATAAGCAACATGGGCGGGAGGATCTTTTATATGCTTTTGAAGATCAGCAAACACAAACTCAGGCCACCACCACACACAGTATAGCTCATCTCGTACCATCACAGACCGTAGTACAGACCTTCAGCAACCCTGATGGCACCGTGTCGCTCATCCAGGTTGGTACAGGGGCAACAGTAGCCACATTGGCTGATGCTTCAGAACTGCCAACCACAGTCACTGTTGCCCAAGTGAATTACTCTGCTGTGGCTGATGGAGAGGTGGAACAAAACTGGGCCACGTTACAGGGCGGTGAAATGACCATCCAGACGACGCAAGCATCAGAGGCCACCCAGGCGGTAGCATCACTGGCAGAAGCCGCAGTGGCAGCTTCTCAGGAGATGCAGCAGGGAGCCACTGTCACCATGGCCCTCAACAGTGAAGCTGCCGCCCATGCTGTCGCCACTCTGGCGGAAGCCACCTTACAAGGTGGGGGACAGATAGTCCTGTCTGGGGAAACCGCAGCAGCCGTCGGAGCACTTACTGGAGTCCAAGATGCTAATGGCCTGGTCCAGATCCCTGTGAGCATGTACCAGACTGTGGTAACCAGCCTCGCCCAGGGCAACGGGCCGGTGCAGGTGGCCATGGCCCCAGTGACCACCAGGATATCGGACAGCGCAGTCACCATGGATGGCCAGGCTGTGGAGGTGGTGACCTTGGAACAGTAGCATGGAGCTCTATCATGGCAGCGTTTTCTAGTCTACTGCAGAATTTTTTACATGTTTGCAGAGGTGCAATCAAATGGAATTAAGTCTCTCGACTTGGAAAGAAAGTTTTGGTAACCTTTTTTTAAGAAGGAAGAAAGGCAGCAGATTTTGGAATCACACTTTTTTAAAGCACCACTCTGGGATCTGGTGGAATGAACGCCACCGATTTCACTGTCCCAAAAAGCCAAATTGTGGCCAGACTTCTTTGTGCAGAAATGTGTGTATACTTACGTGTGTGTACGTGTGAGTGTGAATATATGTATATGTGTACATATGGACATACACATTTACATATATGTATAAAGTATATATGTACATACATACATATGTATGAAACCTGCATGGAATTACCTGTATGAAATCAAGGTGAACTGTGGGAACAAGAACCCACCCAGATTCGTGGGTGGTAGGGTACATGACCAAACACAGTCACCTGGTTTTCGTTCATACCAGGGTCATGCATTGAGCTACTGACAGACTCAGGCGGAGGTGACCACGTCCTTCACCAAAGCTGCCTCCCAGTGGCCGCCTAGACCTCTGCTAGATTCACCGAAGGAAGGAAGATCCAGGACACAGCGTGGTCCAGAGAGTGCTTGTGAAGTCCAGGGACAGAGAGTGCGTGCGCACATGTGCGCTTTGCCAGCAGAGACACACGGCAGCTGGCCCAGGTGCTGACCTTGCCACAGGCAGGTAAACGCCCTGCAGGCTCCTGGCAGGGGC
AAGAAATCGTTCCTCAGCCTCCATCTTCTCCCTTCCCAGGAACCCTCAGTCTCACGACTATTCAAGAGTTGCTTGGTTGTAAGGTCAGTCCTGTTACAAACTGAAGGTGACAGAAGTGTTAAGGGTCTGAGGAGTGTTCATGGAGCAGGCGGGTGTAAGTGCAGGGTGTGTGTGTGTGTGTGTGTGTGTGTGTATGAGTAATGGAGAAAATGGGAAGATTATAGGAGAGCAAAATAGGAAGGAGGGAGAAAACTCTTCATAAATCAGGGTGCGCCGTGGGAACCGTGTTCTCCAGCTGTCTGCAGCTGTATTTCAGCAGAGGAGACTGCCTCACACAGGACCTCTGCGCAAAGGCTGGCCGTCACAGATGTGTCAGAAGACTCTGTGAGGACTTTTCCCAGGCACATCCTGGCGGCACAGGCCTGGGACAGCTTTCCTGCTCACAGTGTGGCTTGCACTGAGCAGTCATTGTCACTGTGAGCTTCTGTGCTTTCCAGCCACAAGCCCTGAGTCTCCCGTGGCTCATTCATCTGATGTCTTGACAAGCCAAATCTCCACTCCTGGCGTGCAGGGACTCTTCCTCCTTCCTGCCAGCCCTCTCCCGTGCGTGATAGTGTATTTAATGTGGTGTTTTTGGTTTTTTGTTTTTTAATGAGACATTAAAAGATTCTTCATGTCTTGCTCAGCCTTTGAGAAAAGTTTCCAATTCTTATATTTGCTTGTTTTATATAAAACTATTCAATGTTCTTTGTATGTTCTTTTCTGTATGTGATAAGGGAGGGGTGGGAAATTTGCATATCAATGTCCTGGTTCTACAATTGGTTACTTTTTTTTTTTTTTTAAACTGTGAAGCTGTCCAGGGGCTTTAAGGCCCGTGTTCCTTTGTGGTGAAATAAGCCTCCCGATAGTTTGAGAAATTGCCAAGAAGATAAAAGCAAGATCCCAGCAGCAGAGCATGGAATCTGTGTTGTTCTCCATTCTGTCTAAACTGCCTCATTCAATAAATAGTTTAATGTGGCGAC'

Write a function that takes two arguments – a protein sequence and an amino acid residue code – and returns the percentage of the protein that the amino acid makes up. Use the following assertions to test your function.

```python
assert my_function("MSRSLLLRFLLFLLLLPPLP", "M") == 5
assert my_function("MSRSLLLRFLLFLLLLPPLP", "r") == 10
assert my_function("MSRSLLLRFLLFLLLLPPLP", "L") == 50
assert my_function("MSRSLLLRFLLFLLLLPPLP", "Y") == 0
```