# Practical Python Programming for Biologists
Author: Dr. Daniel Pass | www.CompassBioinformatics.com

---

In this session we will look at reading from and writing to files. Beginner coding courses often skip over this, but biological work typically depends on reading in lots of large files, so let's find out how.

Note: You will need to add the Classdata files to the Colab runtime to access them or, if running on your own computer, know the full path to where you have downloaded the files.
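
If you are not sure whether Python can see a file, a quick check before opening it can save some confusion. The sketch below uses the standard-library `pathlib` module. It assumes the file sits in `/content` (the usual Colab working folder) and uses the file name from later in this session; adjust both to match your own setup.

```python
from pathlib import Path

# Assumed location: Colab usually places uploaded files in /content.
# On your own computer, replace this with the full path to your download folder.
data_file = Path('/content') / 'CanisLupisCO1.fasta'

if data_file.exists():
  print("Found:", data_file)
else:
  print("Cannot find", data_file, "- check the path or upload the file")
```

If the second message appears, fix the path first; `open()` would otherwise stop with a `FileNotFoundError`.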

# Python I/O Handling

In Python, input/output (I/O) handling is the process of reading data from external sources and writing data to external destinations. This is a fundamental aspect of programming, especially in bioinformatics, given the number of data files we work with.

## Reading a Text File

One of the most common I/O operations in Python is reading data from a file. In Python, we can read a text file using the `open()` function and the `read()` method. This will load the whole file into one variable.


In [1]:
# Open the file in read mode
with open('/content/CanisLupisCO1.fasta') as inputFile:
  # Read the contents of the file
  data = inputFile.read()

print(data)

>U96639.2:5349-6893 Canis familiaris mitochondrion, cytochrome c oxidase subunit I
ATGTTCATTAACCGATGACTGTTCTCCACTAATCACAAGGATATTGGTACTTTATACTTACTATTTGGAG
CATGAGCCGGTATAGTAGGCACTGCTTTGAGCCTCCTCATCCGAGCCGAACTAGGTCAGCCCGGTACTTT
ACTAGGTGACGATCAAATTTATAATGTCATCGTAACCGCCCATGCTTTCGTAATAATCTTCTTCATAGTC
ATGCCCATCATAATTGGGGGCTTTGGAAACTGACTAGTGCCGTTAATAATTGGTGCTCCGGACATGGCAT
TCCCCCGAATAAATAACATGAGCTTCTGACTCCTTCCTCCATCCTTTCTTCTACTATTAGCATCTTCTAT
GGTAGAAGCAGGTGCAGGAACGGGATGAACCGTATACCCCCCACTGGCTGGCAATCTGGCCCATGCAGGA
GCATCCGTTGACCTTACAATTTTCTCCTTACACTTAGCCGGAGTCTCTTCTATTTTAGGGGCAATTAATT
TCATCACTACTATTATCAACATAAAACCCCCTGCAATATCCCAGTATCAAACTCCCCTGTTTGTATGATC
AGTACTAATTACAGCAGTTCTACTCTTACTATCCCTGCCTGTACTGGCTGCTGGAATTACAATACTTTTA
ACAGACCGGAATCTTAATACAACATTTTTTGATCCCGCTGGAGGAGGAGACCCTATCCTATATCAACACC
TATTCTGATTCTTCGGACATCCTGAAGTTTACATTCTTATCCTGCCCGGATTCGGAATAATTTCTCACAT


An alternative is to read one line at a time into a temporary variable by looping over the file itself. This means the whole file isn't stored in memory (good for big files), but you don't keep the data unless you add it to a variable. Note that `.readlines()` would load every line into a list first, so here we loop over the file directly.

In [3]:
count = 1

with open('/content/CanisLupisCO1.fasta') as dogDNA:
  # Looping over the file reads one line at a time
  for line in dogDNA:
    print("This is line number", count, ":\t", line.strip())
    count += 1

This is line number 1 :	 >U96639.2:5349-6893 Canis familiaris mitochondrion, cytochrome c oxidase subunit I
This is line number 2 :	 ATGTTCATTAACCGATGACTGTTCTCCACTAATCACAAGGATATTGGTACTTTATACTTACTATTTGGAG
This is line number 3 :	 CATGAGCCGGTATAGTAGGCACTGCTTTGAGCCTCCTCATCCGAGCCGAACTAGGTCAGCCCGGTACTTT
This is line number 4 :	 ACTAGGTGACGATCAAATTTATAATGTCATCGTAACCGCCCATGCTTTCGTAATAATCTTCTTCATAGTC
This is line number 5 :	 ATGCCCATCATAATTGGGGGCTTTGGAAACTGACTAGTGCCGTTAATAATTGGTGCTCCGGACATGGCAT
This is line number 6 :	 TCCCCCGAATAAATAACATGAGCTTCTGACTCCTTCCTCCATCCTTTCTTCTACTATTAGCATCTTCTAT
This is line number 7 :	 GGTAGAAGCAGGTGCAGGAACGGGATGAACCGTATACCCCCCACTGGCTGGCAATCTGGCCCATGCAGGA
This is line number 8 :	 GCATCCGTTGACCTTACAATTTTCTCCTTACACTTAGCCGGAGTCTCTTCTATTTTAGGGGCAATTAATT
This is line number 9 :	 TCATCACTACTATTATCAACATAAAACCCCCTGCAATATCCCAGTATCAAACTCCCCTGTTTGTATGATC
This is line number 10 :	 AGTACTAATTACAGCAGTTCTACTCTTACTATCCCTGCCTGTACTGGCTGCTGGAATTACAATACTTTTA
This is line number 11 :	 A
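
The manual counter above works fine, but Python's built-in `enumerate()` can do the counting for us. This is a minimal sketch, not the only way to do it. So that it runs on its own, it first writes a tiny made-up file called `demo.fasta` (a hypothetical name and sequence, not one of the course files).

```python
# Create a tiny example file so this cell runs on its own (made-up sequence)
with open('demo.fasta', 'w') as demo:
  demo.write('>demo_seq example header\n')
  demo.write('ATGTTCATTAACC\n')
  demo.write('GATGACTGTTCTC\n')

with open('demo.fasta') as handle:
  # enumerate() gives (number, line) pairs; start=1 makes the count begin at 1
  for count, line in enumerate(handle, start=1):
    print("This is line number", count, ":\t", line.strip())
```

After the loop, `count` still holds the number of the last line read, which is a quick way to see how many lines the file had.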

Something a bit more complex now. Another useful Python method is `.startswith()`, which checks whether a string starts with a given character (or several characters). Here we combine it with ```not``` so that the first line is skipped because it begins with a ```>``` character.

Also, let's put the data from the file into a list so that we can refer to it later and do more testing.

In [2]:
my_lines = []

with open('/content/CanisLupisCO1.fasta') as dogDNA:
  for line in dogDNA.readlines():
    if not line.startswith('>'):
      my_lines.append(line.strip())

for line in my_lines:
  if line.count('T') > 25:
    print(line)

ATGTTCATTAACCGATGACTGTTCTCCACTAATCACAAGGATATTGGTACTTTATACTTACTATTTGGAG
TCCCCCGAATAAATAACATGAGCTTCTGACTCCTTCCTCCATCCTTTCTTCTACTATTAGCATCTTCTAT
GCATCCGTTGACCTTACAATTTTCTCCTTACACTTAGCCGGAGTCTCTTCTATTTTAGGGGCAATTAATT
TATTCTGATTCTTCGGACATCCTGAAGTTTACATTCTTATCCTGCCCGGATTCGGAATAATTTCTCACAT
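
Once the sequence lines are in a list, a common next step is to glue them back together into one long string so the sequence can be treated as a whole. The sketch below shows one way to do that with `''.join()`. It uses a small made-up file, `demo_multi.fasta` (a hypothetical name), so that it runs without the course data.

```python
# Create a tiny example file so this cell runs on its own (made-up sequence)
with open('demo_multi.fasta', 'w') as demo:
  demo.write('>demo_seq example header\n')
  demo.write('ATGTTCATTA\n')
  demo.write('ACCGATGACT\n')

seq_lines = []
with open('demo_multi.fasta') as handle:
  for line in handle:
    if not line.startswith('>'):
      seq_lines.append(line.strip())

# ''.join() glues the list items together with nothing in between
sequence = ''.join(seq_lines)

print(sequence)
print("Length:", len(sequence))
print("Number of T:", sequence.count('T'))
```

Joining matters for anything that could span a line break, because counting or searching line by line would miss a pattern split across two lines.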


There is also the similarly named ```.readline()``` (notice it's singular, not plural). That will read just one line from the file each time it is called. This is handy for extracting header or title lines without looping over the whole file. The line keeps its newline character at the end, which is why a blank line follows it when printed.

In [4]:
with open('/content/CanisLupisCO1.fasta') as dogDNA:
  header = dogDNA.readline()
  print(header)

>U96639.2:5349-6893 Canis familiaris mitochondrion, cytochrome c oxidase subunit I
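
A useful detail is that the file remembers where it got to. After `.readline()` has taken the header, a loop over the same file carries on from the second line. The sketch below illustrates this with a small made-up file, `demo_header.fasta` (a hypothetical name), so that it runs on its own.

```python
# Create a tiny example file so this cell runs on its own (made-up sequence)
with open('demo_header.fasta', 'w') as demo:
  demo.write('>demo_seq example header\n')
  demo.write('ATGTTCATTA\n')
  demo.write('ACCGATGACT\n')

with open('demo_header.fasta') as handle:
  # readline() takes the first line only
  header = handle.readline().strip()
  remaining = []
  # The loop carries on from the second line
  for line in handle:
    remaining.append(line.strip())

print("Header:", header)
print("Sequence lines:", remaining)
```

This keeps the header and the sequence apart without needing `.startswith()`, as long as the file has exactly one header line at the top.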





## Writing to a Text File

We can of course also write data to a file in Python. Usually this will be more useful than just putting information onto the screen when working with big bioinformatic files.

To write to a file, we need to open the file in write mode using the `open()` function and use the `write()` method to choose what to output.

Note: The only change in the ```open()``` function is the second parameter of 'w'. We didn't need a second parameter when reading as 'r' is the default. Be aware that 'w' mode overwrites any existing file with the same name.


In [7]:
# Open the file in write mode
with open('declaration.txt', 'w') as output_file:
  output_file.write('Hello, world! Python <3 Bioinformatics!')

Check the file that you just created!
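
One thing that catches people out is that `write()` does not add a line ending for you. The sketch below writes a short made-up list of gene names, one per line, and then reads the file back to check it. The list and the file name `gene_list.txt` are just for illustration.

```python
genes = ['COI', 'COII', 'cytb']   # made-up list for illustration

with open('gene_list.txt', 'w') as output_file:
  for gene in genes:
    # write() does not add a line ending, so we add '\n' ourselves
    output_file.write(gene + '\n')

# Read the file back in to check what was written
with open('gene_list.txt') as check:
  written = check.read()

print(written)
```

Without the `'\n'`, all three names would end up joined together on a single line.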


## Appending to a Text File

Instead of destroying the file and creating a new one each time, we can add to the end of an existing file by using the `open()` function in ***append*** mode (`'a'`) with the `write()` method. Here is an example:

In [None]:
# Open the file in append mode
with open('declaration.txt', 'a') as outputFile:
  # Append to the file
  outputFile.write('\nand again.')
  outputFile.write('\nand again..')
  outputFile.write('\nand again...')
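
To see the difference between the two modes side by side, the sketch below writes to the same file three times and then reads it back. The file name `mode_demo.txt` is made up for this example.

```python
with open('mode_demo.txt', 'w') as out:
  out.write('first run\n')

# Opening with 'w' again wipes the file before writing
with open('mode_demo.txt', 'w') as out:
  out.write('second run\n')

# Opening with 'a' keeps what is there and adds to the end
with open('mode_demo.txt', 'a') as out:
  out.write('third run\n')

with open('mode_demo.txt') as check:
  contents = check.read()

print(contents)
```

Only the second and third lines survive, because the second `'w'` replaced the first one.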


## In closing...
In Python, it is important to close a file after reading from or writing to it. In the olden days of a few years ago we would call the `close()` method ourselves, but now it is recommended to use the ```with``` statement (a context manager) for simpler and more readable code. I'm mostly including this here in case you see or use older code that uses the previous format.

Basically, because you're interacting with files outside of Python, it needs to be told when you're finished, for a few reasons:

1. Memory management: When a file is opened in Python, the operating system provides a file handle and Python sets aside a buffer in memory for the data being read or written. If the file is not closed, these stay allocated. A program that opens many files without closing them can slow down, or hit the operating system's limit on open files and stop with an error.

2. Data corruption: If a file is not closed properly, any data that has not been written to the file may be lost (think of a USB unplugged too soon). This can result in corrupted or incomplete data, which can cause issues when the data is later read or used.

3. Resource management: On some systems (notably Windows), a file that is open in one program may be locked so that other programs cannot modify, rename, or delete it. If the file is not closed properly, that lock can remain until your program ends, which causes problems if another program or process needs the file.

To summarise, always close the file! But if you're using ```with```, then it's automatic.
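
In case you meet the older style in someone else's code, here is a short sketch of both approaches next to each other. Every file object has a `.closed` attribute, which we can print to confirm what happened. The file name `close_demo.txt` is made up for this example.

```python
# Older style: open, use, then close by hand
old_style = open('close_demo.txt', 'w')
old_style.write('written the old way\n')
old_style.close()
print("Closed after close()?", old_style.closed)

# Current style: the file is closed when the indented block ends
with open('close_demo.txt', 'a') as new_style:
  new_style.write('written using with\n')
  print("Closed inside the with block?", new_style.closed)

print("Closed after the with block?", new_style.closed)
```

The ```with``` version also closes the file if an error happens partway through the block, which the older style only does if you add extra error handling yourself.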

## Exercises

1. Read in the ```am181037.embl``` file and put all lines into a list
2. Modify your code to only keep lines that do not begin with whitespace or the ```FT``` tag, and put them into the list
3. Write a new file with just the gene descriptions in (The lines that begin with ```KW```).

Extension: Include the number of genes in the filename - Can this be done automatically?
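
A hint for the extension, if you get stuck: the sketch below shows one possible approach, not the only answer. It assumes that a "gene" is any keyword ending in the word `gene`, which is a guess at what the question intends. To keep it runnable on its own, it uses two `KW` lines copied from the file rather than reading the whole file, and the output file name pattern is made up.

```python
# Two KW lines copied from am181037.embl, used here instead of reading the file
kw_lines = [
  'KW   12S ribosomal RNA; 12S rRNA gene; 16S ribosomal RNA; 16S rRNA gene;',
  'KW   ATPase 6 gene; ATPase 8 gene; ATPase subunit 6; ATPase subunit 8; COI gene;',
]

gene_count = 0
for line in kw_lines:
  # Drop the 'KW' tag, then split the rest of the line at each ';'
  for keyword in line[2:].split(';'):
    # Assumption: a keyword ending in 'gene' counts as one gene
    if keyword.strip().endswith('gene'):
      gene_count += 1

# An f-string drops the number straight into the file name
out_name = f'keywords_{gene_count}_genes.txt'

with open(out_name, 'w') as out:
  for line in kw_lines:
    out.write(line + '\n')

print("Wrote", out_name)
```

Because the number is counted by the script, the file name updates automatically if you run it on a different record.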

In [23]:
# Write your script here

# Exercise 1: read every line into a list
data = []
with open('am181037.embl') as inp:
  for line in inp.readlines():
    data.append(line.strip())

print(type(data))
print(data)


print('~~~~~~~~~~~~~~`')
# Exercise 2: skip lines that start with whitespace or the FT tag
data = []
with open('am181037.embl') as inp:
  for line in inp.readlines():
    if not (line.startswith(' ') or line.startswith('FT')):
      data.append(line.strip())

print(type(data))
print(data)

# Exercise 3: keep only the KW (keyword) lines
new_data = []
for line in data:
  if line.startswith('KW'):
    new_data.append(line)

for i in new_data:
  print(i)

# write() needs a string, not a list, so write one line at a time
with open('out_test.txt', 'w') as out:
  for lin in new_data:
    out.write(lin + '\n')

<class 'list'>
['ID   AM181037; SV 1; circular; genomic DNA; STD; MAM; 16813 BP.', 'XX', 'AC   AM181037;', 'XX', 'DT   26-SEP-2006 (Rel. 89, Created)', 'DT   21-AUG-2007 (Rel. 92, Last updated, Version 3)', 'XX', 'DE   Vulpes vulpes complete mitochondrial genome', 'XX', 'KW   12S ribosomal RNA; 12S rRNA gene; 16S ribosomal RNA; 16S rRNA gene;', 'KW   ATPase 6 gene; ATPase 8 gene; ATPase subunit 6; ATPase subunit 8; COI gene;', 'KW   COII gene; COIII gene; complete genome; control region; cytb gene;', 'KW   cytochrome b; cytochrome oxidase subunit I; cytochrome oxidase subunit II;', 'KW   cytochrome oxidase subunit III; NADH dehydrogenase subunit 1;', 'KW   NADH dehydrogenase subunit 2; NADH dehydrogenase subunit 3;', 'KW   NADH dehydrogenase subunit 4; NADH dehydrogenase subunit 4L;', 'KW   NADH dehydrogenase subunit 5; NADH dehydrogenase subunit 6; NADH1 gene;', 'KW   NADH2 gene; NADH3 gene; NADH4 gene; NADH4L gene; NADH5 gene; NADH6 gene;', 'KW   transfer RNA-Ala; transfer RNA-Arg; t