In this assignment, you will be working with FASTA files. It is a text-based format for 
representing either nucleic acid or protein sequences and in which nucleotides or amino 
acids are represented using single-letter codes.
20 different standard amino acids exist in proteins (A, C, D, E, F, G, H, I, K, L, M, N, P, 
Q, R, S, T, V, W, Y), and these amino acids are divided into 4 categories according to 
their properties:



- Positively charged: R, H, K
- Negatively charged: D, E
- Polar: N, C, Q, S, T, Y
- Non-polar : A, I, L, M, F, P, W, V, G



The input data needed for this assignment is provided in the attached archive uploaded 
on Blackboard. You can unzip this archive manually on your computer into a folder 
named "FASTA_FILES" or it can be done within your script using some specific library.
The goal of this assignment is to ONLY use Python programming language to write a 
script that starts by displaying the following menu of the screen:




1) ALL
2) Per category
3) Within category
4) Specific AA




1) If the user selects this option, the script will calculate the combined frequency
and percentage of each of the 20 amino acid in the different FASTA files and 
print the results in a table sorted in reverse numerical order according to the 
percentage values. The table must contain 3 columns that correspond to the 
amino acid letter, its frequency and its percentage.



2) If the user selects this option, the script will do the same as in (1), but the 
displayed results will be per amino acids category. The table will contain only 4 
rows, each corresponding to an amino acid category.



3) If the user selects this option, he/she will be prompted to choose one category
from the four, and the script will display the frequency and percentage of each 
amino acid belonging to this category. In this case, the percentage of the amino 
acid is computed by dividing its frequency over the total number of amino acids 
belonging to the same category.


4) If the user select this option, he/she will be prompted to enter the letter of the 
amino acid of interest, and the script will just compute and print the percentage 
of that specific amino acid.


For the first three options (1, 2, and 3), the script will also return a plot to illustrate 
graphically the results of the frequency and percentage (2 y-axis scales can be used for 
each of the 2 metrics). You can use any data visualization library that is supported in 
Python Jupyter Notebooks (like Matplotlib, Seaborn, Plotly …). An interactive plot will 
be a plus. 
The first line of each FASTA file (starting with a > sign) is a description of the sequence
and must not be considered in the calculations

In [3]:
# Charbel younes
# Python_Assignment_1
# Import necassary python modules

from Bio import SeqIO
import os
import pandas as pd
import matplotlib.pyplot as plt
import ipympl

In [4]:
# The jupyter notebook has opened in the FASTA_FILES. Now, we check the list of name in the FASTA_FILES
Dir = os.listdir()
Dir

['.ipynb_checkpoints',
 'P00451.fasta',
 'P00533.fasta',
 'P04637.fasta',
 'P10275.fasta',
 'P11362.fasta',
 'P14373.fasta',
 'P21802.fasta',
 'P38398.fasta',
 'P68871.fasta',
 'Python_Assignment_1.ipynb',
 'Q14524.fasta']

In [5]:
# Remove the name from the list that are not FASTA file
for i in Dir:
    if i[6:]!= '.fasta':
        index = Dir.index(i)
        Dir.pop(index)

Dir         # Now we have a list object of the list of the name of FASTA files



['P00451.fasta',
 'P00533.fasta',
 'P04637.fasta',
 'P10275.fasta',
 'P11362.fasta',
 'P14373.fasta',
 'P21802.fasta',
 'P38398.fasta',
 'P68871.fasta',
 'Q14524.fasta']

In [6]:
# Create a list object in which, all the Amino acid are placed from all

# The sequences founded in all FASTA files are combined into a single list object


# Empty list object
combined_seq = []



# Appending the letter (Amino acid) in the combined_seq list from all the FASTA files
for i in Dir:      # Looping through the FASTA files
    for seq_record in SeqIO.parse(i,"fasta"):       # Loop through the each sequence record object of each FASTA file
        
        # Appending string format of amino acid symbol collected from the FASTA file as the new element of the list object
        combined_seq.append(str(seq_record.seq))   # By the seq_record.seq method, only the sequence of protein is appending       


        
        
        
# Coonverting the list object into sequence object
# The sequences founded in all FASTA files are combined into a single string object.
combined_seq = "".join(i for i in combined_seq)     
len(combined_seq)        # Length of the list object

11055

In [7]:
# create a directory to save the created data


if os.path.exists("Tables") == False:   # Checking the if the directory is existed, if not existed, than creating a directory
    dir1 = os.mkdir("Tables")      #  a directory named "Tables" is created containing all the graphs representations

if os.path.exists("Figures") == False:    
    dir2 = os.mkdir("Figures")       #  a directory named "Figures" is created containg all the graph representations

In [9]:
# An inline to create interactive plot with matplotlib and ipympl in jupyter-notebook
%matplotlib widget

In [11]:
# A common code-template as the function is defining. so that, this template can be used with argument passing 
# in other functions
# This code-template can perform plotting with appropriate argument

def template_plotting(data,obj,filename):
    
    """The function has three argument:
       data = tabulated data in CSV file format,
       obj = Amino acid or the name of Amino acid group
       filename = the name of figure"""
    
    
    # Defining the base figure in matplotlib
    f, ax1 = plt.subplots()
    
    
    
    # Defining the data
    x = data['Name']
    y = data['Frequency']
    y_= data['Percentage']
    
    
    
    # All about the font of title or text is putting into dictionary object
    font1 = {'family':'serif','color':'blue','size':20}
    font2 = {'family':'serif','color':'red','size':15}
    font3 = {'family':'serif','color':'blue','size':15}
    font4 = {'family':'serif','color':'k','size':15}

    
    # color object
    color = 'tab:red'
    
    
    # Plotting the data
    ax1.plot(x,y,'o', ms = 15,mfc = 'r',mec = 'r')
    
    
    # Setting label for horizontal axis
    ax1.set_xlabel("Amino-acid", fontdict = font4)
    
    # Setting label for vertical axis in the left side
    ax1.set_ylabel("Frequency", fontdict = font2)
    
    # Coloring the scale-label of vertical axis in the left side
    ax1.tick_params(axis='y', labelcolor=color)



    color = 'tab:blue'
    
    # This property is coppying the axis of ax1, it is using to make two y-axis scale
    ax2 = ax1.twinx()
    
    
    ax2.plot(x,y_, ls = '--', c = 'b')
    
    
    ax2.set_xlabel("Amino-acid", fontdict = font4)
    
    # Setting label for vertical axis in the right side
    ax2.set_ylabel("Percentage", fontdict = font3)
    
    
    # Coloring the scale-label of vertical axis in the right side
    ax2.tick_params(axis='y', labelcolor=color)


    # Template formatting to set the title, the parameter will be passed as the argument, when the function will be called
    temp1 = "Frequency and Percentage of \n {obj}".format(obj = obj)
    
    # Set the title with the template formatting object
    plt.title(temp1, fontdict = font1)
    
    
    # Tight layout resist the overlapping between two plots
    f.tight_layout()
    
    
    
    # Template formatting to set the filename, the parameter will be passed as the argument, when the function will be called   
    temp2 =  "Figures/{filename}.png".format(filename = filename)
    
    # Saving fugure in the Figure directory
    plt.savefig(temp2)
    
    # Output the plot
    plt.show()



In [13]:
# Function for option: All
def All():
    
        
    frequency = []        # Creating an empty list  for listing the frequency
    percentage = []       # creating an empty list  for listing the percentage
    
    
    # List of 20 Amino acid 
    li = ("A", "C", "D", "E", "F", "G", "H", "I", "K", "L", "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y")
        
    # Appending the frequency and percentage for all Amino acid
    for i in li:          # Looping through the amino acid symbols list
        frequency_ = combined_seq.count(i)    # Counting the frequency of each Amino acid from the combined sequence
        
        percentage_ = (combined_seq.count(i)/len(combined_seq))*100   # Counting the percentage of each Amino acid from the combined sequence
        
        frequency.append(frequency_)      # Updating the list  
        
        percentage.append(percentage_)
            
    
    # Creating a dictionary with that three list: Name of Amino acid, frequency and percentage
    dt = {"Name": li,
          "Frequency": frequency,
          "Percentage": percentage}

    
    # Creating a pandas data-frame object by the above dictionary
    dtf = pd.DataFrame(dt)
    
    
    # Sort values in reverse numerical order
    dtf = dtf.sort_values(by='Percentage', ascending = False)
    
    
    
    
        
    
    # Convert the dataframe into CSV file visible in the directory tables created above
    table1 = dtf.to_csv("Tables/Table_1.CSV", index = False)
    
    
    # Read the CSV file
    table1 = pd.read_csv("Tables/Table_1.CSV")
    
    
    # Print the CSV file
    print("Table: All",'\n',table1, '\n','\n')
    
    
    # Calling the function that can plot the generated data from this code block
    template_plotting(table1,"Amino acid","All")
    
    
    

In [14]:
# Function for option: All
def Per_category():
    frequency_p_c = []         # Creating an empty list  for listing the frequency
    percentage_p_c = []        # creating an empty list  for listing the percentage
    
    
    # List of Positively charged Amino acid group 
    p_c = ["R", "H", "K"]

    # Appending the frequency and percentage of group of Amino acid
    for i in p_c:       # Looping through the Amino acid symbols from the list of Amino acid group 
        frequency_ = combined_seq.count(i)      # Counting the  frequency of each Amino acid
        percentage_ = (combined_seq.count(i)/len(combined_seq))*100    # Counting the percenatge of each Amino acid
        
        frequency_p_c.append(frequency_)      # Upadating the list 
        percentage_p_c.append(percentage_)
        
        
    percentage_p_c = sum(percentage_p_c)      # Sum of percentage

    frequency_p_c = sum(frequency_p_c)        # Sum of frequency
    
    
    
    
    frequency_n_c = []
    percentage_n_c = []

    # List of Negatively charged Amino acid group
    n_c = ["D","E"]


    for i in n_c:
        frequency_ = combined_seq.count(i)
        percentage_ = (combined_seq.count(i)/len(combined_seq))*100
    
        frequency_n_c.append(frequency_)
        percentage_n_c.append(percentage_)
        
        
    percentage_n_c = sum(percentage_n_c)

    frequency_n_c = sum(frequency_n_c) 
    
    
    
    
    frequency_po = []
    percentage_po = []

    # List of Polar Amino acid group
    po = ["N", "C", "Q", "S", "T", "Y"]


    for i in po:
        frequency_ = combined_seq.count(i)
        percentage_ = (combined_seq.count(i)/len(combined_seq))*100
    
        frequency_po.append(frequency_)
        percentage_po.append(percentage_)
        
        
    percentage_po = sum(percentage_po)

    frequency_po = sum(frequency_po) 
    
    
    
    frequency_n_po = []
    percentage_n_po = []

    # List of Non-polar Amino acid group
    n_po = ["A", "I", "L", "M", "F", "P", "W", "V", "G"]


    for i in n_po:
        frequency_ = combined_seq.count(i)
        percentage_ = (combined_seq.count(i)/len(combined_seq))*100
    
        frequency_n_po.append(frequency_)
        percentage_n_po.append(percentage_)
        
        
    percentage_n_po = sum(percentage_n_po)

    frequency_n_po = sum(frequency_n_po)  
    
    
    # List of the name of the Amino acid group
    Name = ['Positively Charged','Negatively Charged','Polar','Non-polar']
    
    # List of the frequency of the Amino acid group
    frequency = [frequency_p_c,frequency_n_c,frequency_po,frequency_n_po]
    
    # List of the percentage of the Amino acid group
    percentage = [percentage_p_c,percentage_n_c,percentage_po,percentage_n_po]

    
    # Creating a dictionary with that three list: Name of Amino acid, frequency and percentage
    dic = {"Name": Name,
          "Frequency": frequency,
          "Percentage": percentage}

    # Creating a pandas data-frame object by the above dictionary
    dtf = pd.DataFrame(dic)
    
    # Sort values in reverse numerical order
    dtf = dtf.sort_values(by='Percentage', ascending = False)
    
    
   

    # Creating the CSV file visible in the tables directory
    table2 = dtf.to_csv("Tables/Table_2.CSV", index = False)
    
    # Reading the CSV file
    table2 = pd.read_csv("Tables/Table_2.CSV")
    
    # printing the CSV file
    print("Table: Per category",'\n',table2,'\n','\n')
    
    
    
    # Calling the function that can plot the generated data by this script with proper argument 
    template_plotting(table2,"categories of Amino acid","Per_category")

In [15]:
"""This code-template can be performed to calculate and to display the frequency and percentage of each
amino acid belonging to this category with appropriate argument"""

def template_within_category(li,filename,header):
    
    """The function has three argument:
       li = list of Amino acid group,
       filename = the name of CSV file and figure,
       header = The title for the output"""
    
    
    frequency = []       # Creating an empty list  for listing the frequency     
    percentage = []        # creating an empty list  for listing the percentage
    
   
    # Appending the frequency and percentage of group of Amino acid
    for i in li:      # Looping through the Amino acid symbols from the list of Amino acid group 
        frequency_ = combined_seq.count(i)       # Counting the  frequency of each Amino acid
    
        
        
        frequency.append(frequency_)     # Updating the list object 
        
    
    s_fre = sum(frequency)     # Sum of the frequency
    
    for i in frequency:        # Looping through the list frequency
        p = (i/s_fre)*100        # Calculating percenatge of each Amino acid belonging to a category
        percentage.append(p)       # Updating the list object
        
    
    # Creating a dictionary with that three list: Name of Amino acid, frequency and percentage
    dic = {"Name":li,
           "Frequency":frequency,
           "Percentage":percentage}

    # Creating a pandas data-frame object by the above dictionary
    dtf = pd.DataFrame(dic)
    
    # Sort values in reverse numerical order
    dtf = dtf.sort_values(by='Percentage', ascending = False)
    
   
    # Template formatting to set the filename, the parameter will be passed as the argument, when the function will call
    temp1 = "Tables/{filename}.CSV".format(filename = filename)
    
    
    # Creating the CSV file
    table = dtf.to_csv(temp1, index = False)
    
    # Read the CSV file
    table = pd.read_csv(temp1)
    
    
    # Template formatting to set the title, the parameter will be passed as the argument, when the function will call
    temp2 = "Table: {header}".format(header = header)
    
    # Printing the CSV file
    print(temp2,'\n',table,'\n','\n')
    
    # Template formatting to set the argument of another function
    temp3 = "{stc} Amino acid".format(stc = header)
    
    # Calling the function that can plot the generated data by this script with proper argument 
    template_plotting(table,temp3,filename)

In [16]:
# Function for sub-option: Positively charged
def Positively_charged():
    
    # List of Positively charged Amino acid group 
    li = ["R", "H", "K"]
    
    # Filename of tabulated data that will be generate after function calling
    filename = "Positively_charged"
    
    # Title of tabulated data that for printing
    header = "Positively charged"
    
    # Calling the function that will print and plot the tabulated data of the Amino-acid group
    template_within_category(li,filename,header)




# Function for sub-option: Negatively charged    
def Negatively_charged():    
    
    # List of Negatively charged Amino acid group 
    li = ["D","E"]
    
    
    filename = "Negatively_charged"
    
    header = "Negatively charged"
    
    template_within_category(li,filename,header)
    
    
   
    
# Function for sub-option: Polar
def Polar():
    
     # List of Polar Amino acid group 
    li = ["N", "C", "Q", "S", "T", "Y"]
    
    filename = "Polar"
    
    header = "Polar"
    
    template_within_category(li,filename,header)
    
    
    
    
# Function for sub-option: Non-polar    
def Non_polar(): 
    
    # List of Non-polar Amino acid group 
    n_po = ["A", "I", "L", "M", "F", "P", "W", "V", "G"]
    
    filename = "Non_polar"
    
    header = "Non-polar"
    
    template_within_category(li,filename,header)
    

In [18]:
# Function for option: Specific_AA

def Specific_AA(letter):
    
    
    percentage = []    # Creating an empty list object for listing the percentage 
   
    # List of Amino acid
    li = ("A", "C", "D", "E", "F", "G", "H", "I", "K", "L", "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y")
    
    # Dictionary object
    # Name of Amino acid is the keyword
    # Calculated percentage of Amino acid is the value
    f_p = {i:(combined_seq.count(i)/len(combined_seq))*100 for i in li}
    
    
    # Separating the percenatge of the Amino acid of specific later
    # That letter will be apssed as the argument, when function will be called
    item = f_p[letter]
    
    
    
    # Template formatting to set the title, the parameter will be passed as the argument, when the function will call
    template = "Amino acid {letter}: Percentage{p}%".format(letter = letter, p = item)
    
    
    # Printing the percentage of the Amino acid of specific letter
    print("Specific AA:", '\n', template,'\n','\n')




In [19]:
# Import python tkinter package to let the user to interact with the "options", like a small menu
import tkinter as tk
from tkinter import *
from tkinter import ttk


# Defining the root tkinter object
win = tk.Tk()

# Background color of the GUI
win.configure(background = 'skyblue')

# Size of the GUI
win.geometry('436x60')

# Set the title of the GUI
win.title("Assignment1: Python:  Seelect your option option")

# Defining the button for the option: All    
button1 = ttk.Button(win, text = 'All', command = All)

# Grid position of the button
button1.grid(column = 1, row = 0)


# Defining the button for the option: Per category
button2 = ttk.Button(win, text = 'Per category', command = Per_category)
button2.grid(column = 2, row = 0)



# Defining the menubutton for the option: Within category
menubutton1 = Menubutton(win, text="Within category")
menubutton1.grid(column = 3, row = 0 )

# Create pull down menu
menubutton1.menu = Menu(menubutton1, tearoff = 0)
menubutton1["menu"] = menubutton1.menu

# Create the submenu button for the sub-option: Positively charged
menubutton1.menu.add_command(label='Positively charged',command = Positively_charged )


# Adding a separator between two submenu button
menubutton1.menu.add_separator()

# Create the submenu button for the sub-option: Negatively charged
menubutton1.menu.add_command(label='Negatively charged', command = Negatively_charged)


menubutton1.menu.add_separator()

# Create the submenu button for the sub-option: Polar
menubutton1.menu.add_command(label='Polar', command = Polar)


menubutton1.menu.add_separator()

#dropdown bar
# Create the submenu button for the sub-option: Non-polar
menubutton1.menu.add_command(label='Non-polar', command = Non_polar)






# Function for the command of the button, to select the Amino acid from the Combobox
def select():
    letter = value.get()
    if letter:
        Specific_AA(letter)
    else:
        print("Please, select a letter from Combobox and click again", '\n','\n')



# The list of the name of the Amino acid
li = ("A", "C", "D", "E", "F", "G", "H", "I", "K", "L", "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y")

# Defining value for the combobox
value = tk.StringVar()

# Defining the combobox 
combo = ttk.Combobox(win, width = 12, textvariable = value)

# Inserting the list of the name of Amino acid as the values of combobox
combo['values'] = li

combo.grid(column = 4, row = 1)



    


# Defining the button for the option: Specific AA
button3 = ttk.Button(win, text = 'Specific AA', command = select)
button3.grid(column = 4, row = 0)





# Calling this function open the generated GUI 
win.mainloop()

Table: All 
    Name  Frequency  Percentage
0     L       1009    9.127092
1     S        947    8.566260
2     E        784    7.091814
3     P        653    5.906829
4     G        651    5.888738
5     A        618    5.590231
6     V        618    5.590231
7     T        614    5.554048
8     K        600    5.427408
9     D        522    4.721845
10    R        507    4.586160
11    I        490    4.432384
12    N        487    4.405246
13    Q        475    4.296698
14    F        406    3.672546
15    Y        308    2.786070
16    M        281    2.541836
17    C        258    2.333786
18    H        258    2.333786
19    W        140    1.266395 
 



Canvas(toolbar=Toolbar(toolitems=[('Home', 'Reset original view', 'home', 'home'), ('Back', 'Back to previous …

Table: Per category 
                  Name  Frequency  Percentage
0           Non-polar       4866   44.016282
1               Polar       3089   27.942108
2  Positively Charged       1365   12.347354
3  Negatively Charged       1306   11.813659 
 



Canvas(toolbar=Toolbar(toolitems=[('Home', 'Reset original view', 'home', 'home'), ('Back', 'Back to previous …

Table: Positively charged 
   Name  Frequency  Percentage
0    K        600   43.956044
1    R        507   37.142857
2    H        258   18.901099 
 



Canvas(toolbar=Toolbar(toolitems=[('Home', 'Reset original view', 'home', 'home'), ('Back', 'Back to previous …

Table: Negatively charged 
   Name  Frequency  Percentage
0    E        784   60.030628
1    D        522   39.969372 
 



Canvas(toolbar=Toolbar(toolitems=[('Home', 'Reset original view', 'home', 'home'), ('Back', 'Back to previous …

Table: Polar 
   Name  Frequency  Percentage
0    S        947   30.657171
1    T        614   19.876983
2    N        487   15.765620
3    Q        475   15.377145
4    Y        308    9.970864
5    C        258    8.352218 
 



Canvas(toolbar=Toolbar(toolitems=[('Home', 'Reset original view', 'home', 'home'), ('Back', 'Back to previous …

Table: Non-polar 
    Name  Frequency  Percentage
0     L       1009    9.495577
1     S        947    8.912102
2     E        784    7.378129
3     P        653    6.145304
4     G        651    6.126482
5     A        618    5.815923
6     V        618    5.815923
7     T        614    5.778280
8     K        600    5.646527
9     D        522    4.912479
10    R        507    4.771316
11    I        490    4.611331
12    N        487    4.583098
13    Q        475    4.470168
14    F        406    3.820817
15    Y        308    2.898551
16    M        281    2.644457
17    C        258    2.428007
18    H        258    2.428007
19    W        140    1.317523 
 



Canvas(toolbar=Toolbar(toolitems=[('Home', 'Reset original view', 'home', 'home'), ('Back', 'Back to previous …

Please, select a letter from Combobox and click again 
 

Specific AA: 
 Amino acid A: Percentage5.59023066485753% 
 

Specific AA: 
 Amino acid C: Percentage2.333785617367707% 
 

Specific AA: 
 Amino acid Y: Percentage2.7860696517412937% 
 

Specific AA: 
 Amino acid W: Percentage1.2663952962460425% 
 

Specific AA: 
 Amino acid T: Percentage5.554047942107643% 
 

Specific AA: 
 Amino acid S: Percentage8.56625961103573% 
 

Specific AA: 
 Amino acid R: Percentage4.586160108548168% 
 

Specific AA: 
 Amino acid Q: Percentage4.296698326549072% 
 

