### Find the Lengths and Names of Sequences in a Multi-FASTA File
#### Author:
Walter Odur         
<walterrickyodur2021@gmail.com>  
<https://github.com/Walter-Odur>   
<https://www.linkedin.com/in/walter-rickman-odur/>  
<https://x.com/Walterrickman22>  
#### Affiliation
African Centers Of Excellence In Bioinformatics And Data Intensive Sciences (ACE-Uganda).  
Infectious Diseases Institute (IDI),  
Makerere University,  
Kampala, Uganda
#### What you are expected to do and learn in this first tutorial
1. You will use the SeqIO submodule of the Bio package (Biopython) to do the analysis.
2. You will then use pandas to collect the results for multiple sequences in a DataFrame.
3. You will also learn how to use the tabulate module to print your DataFrame as a formatted table.
4. Finally, you will learn how to export your DataFrame as a CSV file.

In [44]:
# INSTALL THE FOLLOWING IF YOU DO NOT HAVE THEM INSTALLED

#!pip install tabulate
#!pip install biopython
#!pip install pandas

#Import the following
from tabulate import tabulate
import pandas as pd
from Bio import SeqIO

In [66]:
# Provide the path to the file (files.fasta)
file = 'C:/Users/Walterrickman/Desktop/PYTHON/Kobina/files.fasta'

In [5]:
# Read in (parse) the file; each record in it becomes one item in the list
seq_obj = list(SeqIO.parse(file, 'fasta'))  # 'fasta' specifies the file format

In [7]:
# Check how many sequence records are present in files.fasta
len(seq_obj)

5
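
If files.fasta is not at hand, the parsing step can still be tried on a small in-memory multi-FASTA. This is only a sketch: the three records (`demo1`, `demo2`, `demo3`) and their sequences are made up for illustration, and `io.StringIO` stands in for the file on disk.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical toy data: three short FASTA records (not taken from files.fasta)
fasta_text = """>demo1 first toy record
ACGTACGT
>demo2 second toy record
GGCC
AATT
>demo3 third toy record
TTTTT
"""

# SeqIO.parse yields one record per '>' header line; list() collects them all
records = list(SeqIO.parse(StringIO(fasta_text), 'fasta'))

print(len(records))  # number of records found
for rec in records:
    # A sequence split over several lines is joined into one Seq object
    print(rec.name, len(rec.seq))
```

The count reported by `len()` is the number of `>` header lines, that is, the number of sequences in the file rather than the number of files.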

In [67]:
# Work with a single sequence record first, before we proceed to all of them
seq1=seq_obj[0]

In [10]:
# Check the name of the record
seq1.name

'AR465'

In [11]:
# Display the nucleotide sequence of the record
seq1.seq

Seq('GTCTATTCATTCCTTTTTCTCTCCTTTCAGCATTTTATTGAGCCTCTCATCAAC...TTA')

In [13]:
# Check the length of the sequence (how many nucleotides does the sequence contain)
len(seq1.seq)  # .seq is the record's sequence (a Seq object); len() counts its nucleotides

2911287

In [14]:
# You can also store the sequence object in a variable before starting the analysis
sequence = seq1.seq 

In [15]:
len(sequence)

2911287

In [18]:
# Now let's do this with multiple sequences
# Determine the sequence names
#    We will use a for loop to do this
names=[] # create an empty list (names) to store the results
for item in seq_obj:
    name=item.name
    names.append(name)

print(names) # print out the variable (names) to see results

['AR465', 'M48', 'P10', 'R50', 'V521']


In [23]:
# Let's proceed to determining the sequence lengths
# Approach 1
lengths=[] 
for item in seq_obj:  #iterate through each item in seq_obj
    sequence=item.seq
    length=len(sequence) # determine the length of each sequence
    lengths.append(length) # store results in the pre-created variable (lengths)

print(lengths)
print(lengths[0::2]) # You can also slice the new list, e.g. take every second item


[2911287, 3050015, 2970728, 2866643, 3085555]
[2911287, 2970728, 3085555]


In [26]:
# Approach 2
lengths2=[len(item.seq) for item in seq_obj] # Do everything in just a single line
print(lengths2)

[2911287, 3050015, 2970728, 2866643, 3085555]


In [36]:
# You can also report each length next to its sequence name
for i in seq_obj:
    print('the length of '+ str(i.name) +' is '+ str(len(i.seq)))

the length of AR465 is 2911287
the length of M48 is 3050015
the length of P10 is 2970728
the length of R50 is 2866643
the length of V521 is 3085555
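
The same pairing can be kept for later lookups rather than only printed. The sketch below copies the names and lengths from the outputs above so that it runs without files.fasta; `length_by_name` and `longest` are illustrative names, not part of the original notebook.

```python
# Values copied from the outputs above, so this runs without files.fasta
names = ['AR465', 'M48', 'P10', 'R50', 'V521']
lengths = [2911287, 3050015, 2970728, 2866643, 3085555]

# zip() walks both lists in step; dict() turns the pairs into name -> length
length_by_name = dict(zip(names, lengths))

for name, length in length_by_name.items():
    # An f-string avoids the str() calls and '+' concatenation
    print(f'the length of {name} is {length}')

# Name of the longest sequence: compare the keys by their stored length
longest = max(length_by_name, key=length_by_name.get)
print(longest)
```

With the parsed records available, the dictionary could presumably be built directly as `{rec.name: len(rec.seq) for rec in seq_obj}`.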


In [41]:
# That output was a bit plain, but you can place the results in a DataFrame for a cleaner display

# We shall use the variables (names and lengths) that we created previously
print(names) # print to confirm that they exist
print(lengths)
# create a DataFrame using pandas; remember we imported 'pandas' as 'pd'
df=pd.DataFrame()
df['sequenceNames']=names
df['sequenceLength']=lengths
print(df)

['AR465', 'M48', 'P10', 'R50', 'V521']
[2911287, 3050015, 2970728, 2866643, 3085555]
  sequenceNames  sequenceLength
0         AR465         2911287
1           M48         3050015
2           P10         2970728
3           R50         2866643
4          V521         3085555
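
As an alternative to adding the columns one at a time, a DataFrame can be built in a single step from a dictionary that maps column names to lists. This is a minimal sketch using the values shown above; it is a different construction route from the one in the cell, but it should give an equivalent table.

```python
import pandas as pd

# Values copied from the outputs above
names = ['AR465', 'M48', 'P10', 'R50', 'V521']
lengths = [2911287, 3050015, 2970728, 2866643, 3085555]

# Each dictionary key becomes a column name, each list becomes that column
df = pd.DataFrame({'sequenceNames': names, 'sequenceLength': lengths})

print(df.shape)   # (rows, columns)
print(df.dtypes)  # column types: text for names, integers for lengths
```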


In [45]:
# Make the DataFrame look nicer with grid lines, using the tabulate module

print(tabulate(df,headers='keys',tablefmt='grid'))

+----+-----------------+------------------+
|    | sequenceNames   |   sequenceLength |
+====+=================+==================+
|  0 | AR465           |          2911287 |
+----+-----------------+------------------+
|  1 | M48             |          3050015 |
+----+-----------------+------------------+
|  2 | P10             |          2970728 |
+----+-----------------+------------------+
|  3 | R50             |          2866643 |
+----+-----------------+------------------+
|  4 | V521            |          3085555 |
+----+-----------------+------------------+


In [68]:
# You can also make it pretty

print(tabulate(df,headers='keys',tablefmt='pretty'))

+---+---------------+----------------+
|   | sequenceNames | sequenceLength |
+---+---------------+----------------+
| 0 |     AR465     |    2911287     |
| 1 |      M48      |    3050015     |
| 2 |      P10      |    2970728     |
| 3 |      R50      |    2866643     |
| 4 |     V521      |    3085555     |
+---+---------------+----------------+


In [70]:
# You can make it even fancier

print(tabulate(df,headers='keys',tablefmt='fancy_grid'))

╒════╤═════════════════╤══════════════════╕
│    │ sequenceNames   │   sequenceLength │
╞════╪═════════════════╪══════════════════╡
│  0 │ AR465           │          2911287 │
├────┼─────────────────┼──────────────────┤
│  1 │ M48             │          3050015 │
├────┼─────────────────┼──────────────────┤
│  2 │ P10             │          2970728 │
├────┼─────────────────┼──────────────────┤
│  3 │ R50             │          2866643 │
├────┼─────────────────┼──────────────────┤
│  4 │ V521            │          3085555 │
╘════╧═════════════════╧══════════════════╛


In [49]:
# Select a row of the DataFrame by its index label
df.loc[0]

sequenceNames       AR465
sequenceLength    2911287
Name: 0, dtype: object

In [50]:
# Extract a specific item from the DataFrame (row label, column name)

df.loc[0,'sequenceNames']

'AR465'

In [51]:
df.loc[3,'sequenceLength']

2866643
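
Looking values up by row number requires knowing where each sequence sits in the table. One option, sketched below, is to make the sequence names the index so that `.loc` accepts a name directly; `by_name` and `r50_length` are illustrative names.

```python
import pandas as pd

# DataFrame rebuilt from the values shown earlier
df = pd.DataFrame({
    'sequenceNames': ['AR465', 'M48', 'P10', 'R50', 'V521'],
    'sequenceLength': [2911287, 3050015, 2970728, 2866643, 3085555],
})

# set_index returns a new DataFrame whose row labels are the sequence names
by_name = df.set_index('sequenceNames')

# .loc now takes the sequence name instead of the row number
r50_length = by_name.loc['R50', 'sequenceLength']
print(r50_length)
```

The original `df` is left unchanged, so the row-number lookups above keep working.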

In [54]:
# You can also filter the rows of a DataFrame with a condition

df[df['sequenceLength']>=2970728]

  sequenceNames  sequenceLength
1           M48         3050015
2           P10         2970728
4          V521         3085555


In [53]:
#Tabulate the selected data

print(tabulate((df[df['sequenceLength']>=2970728]),headers='keys',tablefmt='fancy_grid'))

╒════╤═════════════════╤══════════════════╕
│    │ sequenceNames   │   sequenceLength │
╞════╪═════════════════╪══════════════════╡
│  1 │ M48             │          3050015 │
├────┼─────────────────┼──────────────────┤
│  2 │ P10             │          2970728 │
├────┼─────────────────┼──────────────────┤
│  4 │ V521            │          3085555 │
╘════╧═════════════════╧══════════════════╛
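
The cutoff used above, 2970728, happens to equal the median of the five lengths. As a sketch of an alternative, the threshold can be computed from the data instead of typed in by hand. Using the median is an assumption made here for illustration; the notebook does not state how the cutoff was chosen.

```python
import pandas as pd

# DataFrame rebuilt from the values shown earlier
df = pd.DataFrame({
    'sequenceNames': ['AR465', 'M48', 'P10', 'R50', 'V521'],
    'sequenceLength': [2911287, 3050015, 2970728, 2866643, 3085555],
})

# Assumed cutoff: the median sequence length
threshold = df['sequenceLength'].median()

# The comparison gives one True/False per row; df[...] keeps the True rows
selected = df[df['sequenceLength'] >= threshold]

print(threshold)
print(selected)
```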


In [64]:
# Save df to a CSV file, including the index column

df.to_csv('C:/Users/Walterrickman/Desktop/PYTHON/Kobina/lengthsIDX.csv')

In [65]:
# Save df without the index column

df.to_csv('C:/Users/Walterrickman/Desktop/PYTHON/Kobina/lengths.csv',index=False)
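
To confirm what each export wrote, the files can be read back with `pd.read_csv`. The sketch below writes to a temporary folder rather than the Desktop path above, which is an assumption made so the example runs on any machine; `restored` and `with_index` are illustrative names.

```python
import tempfile
from pathlib import Path

import pandas as pd

# DataFrame rebuilt from the values shown earlier
df = pd.DataFrame({
    'sequenceNames': ['AR465', 'M48', 'P10', 'R50', 'V521'],
    'sequenceLength': [2911287, 3050015, 2970728, 2866643, 3085555],
})

# Hypothetical output location: a temporary folder, not the author's path
out_dir = Path(tempfile.mkdtemp())

# Export without the index, then read the file back
df.to_csv(out_dir / 'lengths.csv', index=False)
restored = pd.read_csv(out_dir / 'lengths.csv')

# Export with the index; index_col=0 restores it as the index on reading
df.to_csv(out_dir / 'lengthsIDX.csv')
with_index = pd.read_csv(out_dir / 'lengthsIDX.csv', index_col=0)

print(restored)
print(with_index)
```

If a file saved with the index is read back without `index_col=0`, pandas labels the extra first column `Unnamed: 0`.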

In [None]:
# You can now move on to the second Biopython tutorial. Hopefully you learnt something here.