<a href="https://colab.research.google.com/github/gdakshareddybt23/Bioinformatics/blob/main/Design_and_Execute_Cloud_Based_Workflow_for_Functional_annotation_of_protein_sequences_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
file_path = '/content/ncbiblast-I20251230-041410-0549-98171899-p1m.out'

print(f"Reading the first 20 lines of: {file_path}\n")
with open(file_path, 'r') as f:
    for i, line in enumerate(f):
        if i >= 20:
            break
        print(line.strip())

print("\nSuccessfully read the first 20 lines of the file.")

Reading the first 20 lines of: /content/ncbiblast-I20251230-041410-0549-98171899-p1m.out

BLASTP 2.16.0+


Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A.
Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J.
Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs", Nucleic Acids Res. 25:3389-3402.


Reference for composition-based statistics: Alejandro A. Schaffer,
L. Aravind, Thomas L. Madden, Sergei Shavirin, John L. Spouge, Yuri
I. Wolf, Eugene V. Koonin, and Stephen F. Altschul (2001),
"Improving the accuracy of PSI-BLAST protein database searches with
composition-based statistics and other refinements", Nucleic Acids
Res. 29:2994-3005.



Database: uniprotkb_swissprot
573,661 sequences; 207,922,125 total letters

Successfully read the first 20 lines of the file.
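
The database line printed above can be converted to integers for later reporting. This is a minimal sketch, assuming the line keeps the `N sequences; M total letters` layout seen in this output; `parse_db_stats` is a hypothetical helper name, not part of the notebook.

```python
import re

def parse_db_stats(line):
    """Return (sequence_count, letter_count) from a BLAST database summary line, or None."""
    match = re.search(r'([\d,]+)\s+sequences;\s+([\d,]+)\s+total letters', line)
    if match is None:
        return None
    # Remove the thousands separators before converting to int
    return tuple(int(group.replace(',', '')) for group in match.groups())

stats = parse_db_stats("573,661 sequences; 207,922,125 total letters")
print(stats)
```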


In [None]:
from itertools import islice

file_path = '/content/ncbiblast-I20251230-041410-0549-98171899-p1m.out'

print(f"Reading lines 21 to 120 of: {file_path}\n")
with open(file_path, 'r') as f:
    # islice(f, 20, 120) yields the lines at zero-based indices 20 to 119
    for line in islice(f, 20, 120):
        print(line.strip())

print("\nSuccessfully read lines 21 to 120 of the file.")

Reading lines 21 to 120 of: /content/ncbiblast-I20251230-041410-0549-98171899-p1m.out




Query= EMBOSS_001

Length=770
Score     E
Sequences producing significant alignments:                          (Bits)  Value

SP:P05067 A4_HUMAN Amyloid-beta precursor protein OS=Homo sapiens ...  1567    0.0
SP:Q5IS80 A4_PANTR Amyloid-beta precursor protein OS=Pan troglodyt...  1566    0.0
SP:P53601 A4_MACFA Amyloid-beta precursor protein OS=Macaca fascic...  1560    0.0
SP:P79307 A4_PIG Amyloid-beta precursor protein OS=Sus scrofa OX=9...  1538    0.0
SP:Q60495 A4_CAVPO Amyloid-beta precursor protein OS=Cavia porcell...  1526    0.0
SP:P08592 A4_RAT Amyloid-beta precursor protein OS=Rattus norvegic...  1521    0.0
SP:P12023 A4_MOUSE Amyloid-beta precursor protein OS=Mus musculus ...  1515    0.0
SP:Q95241 A4_SAISC Amyloid-beta precursor protein OS=Saimiri sciur...  1511    0.0
SP:O73683 A4_DICFU Amyloid-beta A4 protein OS=Dichotomyctere fluvi...  1052    0.0
SP:O93279 A4_TAKRU Amyloid-beta A4 pr
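
The rows of the hit table above can be split into fields with a regular expression. This is a sketch under the assumption that every row ends with a bit score followed by an E-value, as in the rows shown; `HIT_ROW` and `parse_hit_row` are hypothetical names introduced for illustration.

```python
import re

# One row: "DB:ACCESSION ENTRY_NAME description ...  bits  evalue"
HIT_ROW = re.compile(
    r'^(?P<db>\w+):(?P<accession>\S+)\s+(?P<entry>\S+)\s+'
    r'(?P<description>.*?)\s+(?P<bits>\d+(?:\.\d+)?)\s+(?P<evalue>\S+)\s*$'
)

def parse_hit_row(line):
    """Return the row as a dict, or None if the line is not a complete hit row."""
    match = HIT_ROW.match(line.strip())
    if match is None:
        return None
    row = match.groupdict()
    try:
        row['bits'] = float(row['bits'])
        row['evalue'] = float(row['evalue'])
    except ValueError:
        # The last column was not a number, so this is not a hit row
        return None
    return row

row = parse_hit_row(
    "SP:P05067 A4_HUMAN Amyloid-beta precursor protein OS=Homo sapiens ...  1567    0.0"
)
print(row['accession'], row['entry'], row['bits'], row['evalue'])
```

Lines that do not fit the pattern, such as the table header or a row cut off before its score columns, return `None` and can be skipped by the caller.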

In [None]:
import re

file_path = '/content/ncbiblast-I20251230-041410-0549-98171899-p1m.out'

# Read the entire file content
with open(file_path, 'r') as f:
    full_content = f.read()

# Split the content into lines
lines = full_content.splitlines()

# Initialize lists to store extracted information
protein_names = set()
alternative_names = set()
gene_names = set()
organism_info = set()
descriptive_info = []

# Keywords to look for in descriptions for domains/pathways/functions
functional_keywords = [
    "Amyloid-beta", "APP", "precursor protein", "membrane-bound",
    "protease inhibitor", "neurogenesis", "cell adhesion", "apoptosis",
    "protein processing", "proteolysis", "signal transduction",
    "growth factor", "receptor", "extracellular domain", "transmembrane domain",
    "intracellular domain", "Kunitz-type", "heparin-binding", "beta-secretase",
    "alpha-secretase", "gamma-secretase", "amyloidogenic", "non-amyloidogenic"
]

# Iterate through the lines to extract information
current_protein_block = False
for line_num, line in enumerate(lines):
    line = line.strip()

    # Identify lines describing a significant hit, especially the target protein
    if line.startswith('>SP:') and "Amyloid-beta precursor protein" in line:
        current_protein_block = True
        # Extract full name and alternative name if present
        parts = line.split(' ', 2) # Split into identifier, short name, and rest
        if len(parts) > 2:
            full_description = parts[2]
            # Full name
            name_match = re.search(r'(.*?)(?: OS=|$)', full_description)
            if name_match: # Extract everything before 'OS=' or end of line
                protein_names.add(name_match.group(1).strip())

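            # "(Fragment)" marks a partial sequence rather than a true alternative name;
            # the full description of each such hit is stored in alternative_names
            if "(Fragment)" in full_description:
                alternative_names.add(full_description)

            # Organism information. BLAST wraps long description lines, so a value cut off
            # at the line end (for example a genus without its species) stays truncated here
            os_match = re.search(r'OS=(.*?)(?: OX=|$)', full_description)
            if os_match: # Extract everything after 'OS=' and before 'OX=' or end of line
                organism_info.add(os_match.group(1).strip())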


    elif line.startswith('GN=') and current_protein_block:
        # Extract gene name
        gene_name_match = re.search(r'GN=(\w+)', line)
        if gene_name_match:
            gene_names.add(gene_name_match.group(1))

    # Check for general descriptive/functional information in lines following the hit description
    # This part is heuristic and might need refinement based on exact file structure
    elif current_protein_block and (line.startswith('Score =') or line.startswith('Length=') or line.startswith('Identities =')):
        # End of the specific protein description block for annotation extraction
        current_protein_block = False

    # Scan every line for functional keywords. This is a general scan, not tied
    # to a specific protein block. BLAST alignment lines are skipped.
    if line and not line.startswith(('Query', 'Sbjct', 'Score', 'Identities', 'Length')):
        for keyword in functional_keywords:
            # Match whole words only, so that "APP" does not match inside "Gapped"
            if re.search(rf'\b{re.escape(keyword)}\b', line, re.IGNORECASE) and line not in descriptive_info:
                descriptive_info.append(line)
                break # Add the line once if any keyword matches

# Print the extracted information
print("\n--- Extracted Functional Annotations ---")
print("Full Protein Names:")
for name in sorted(protein_names):
    print(f"- {name}")

print("\nAlternative Names (from descriptions):")
if alternative_names:
    for name in sorted(alternative_names):
        print(f"- {name}")
else:
    print("- None explicitly found in descriptions")

print("\nGene Names:")
if gene_names:
    for gn in sorted(gene_names):
        print(f"- {gn}")
else:
    print("- None explicitly found")

print("\nOrganism Information:")
if organism_info:
    for org in sorted(organism_info):
        print(f"- {org}")
else:
    print("- None explicitly found")

print("\nOther Relevant Descriptive Information (containing functional keywords):")
if descriptive_info:
    # Drop any duplicate lines while preserving first-seen order
    for info in dict.fromkeys(descriptive_info):
        print(f"- {info}")
else:
    print("- No additional descriptive information found using general keywords.")


--- Extracted Functional Annotations ---
Full Protein Names:
- Amyloid-beta precursor protein
- Amyloid-beta precursor protein (Fragment)

Alternative Names (from descriptions):
- Amyloid-beta precursor protein (Fragment) OS=Bos taurus
- Amyloid-beta precursor protein (Fragment) OS=Canis lupus
- Amyloid-beta precursor protein (Fragment) OS=Felis catus
- Amyloid-beta precursor protein (Fragment) OS=Macaca
- Amyloid-beta precursor protein (Fragment) OS=Oryctolagus
- Amyloid-beta precursor protein (Fragment) OS=Ovis aries
- Amyloid-beta precursor protein (Fragment) OS=Ursus maritimus

Gene Names:
- APP
- App

Organism Information:
- Bos taurus
- Canis lupus
- Cavia porcellus
- Felis catus
- Homo sapiens
- Macaca
- Macaca fascicularis
- Mus musculus
- Oryctolagus
- Ovis aries
- Pan troglodytes
- Rattus norvegicus
- Saimiri sciureus
- Sus scrofa
- Ursus maritimus

Other Relevant Descriptive Information (containing functional keywords):
- Lipman (1997), "Gapped BLAST and PSI-BLAST: a new ge
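
Several organism names in the output above stop at the genus ("Macaca", "Oryctolagus"), which is what happens when a wrapped description line is parsed on its own. The sketch below joins each `>` description with its continuation lines before extracting fields. It assumes that a description block ends at a `Length=` line or a blank line and that the fields follow the UniProt `OS=`, `OX=`, `GN=`, `PE=`, `SV=` convention; `iter_hit_headers` and `parse_header` are hypothetical helpers, and the sample input is illustrative rather than copied from the file.

```python
import re

def iter_hit_headers(lines):
    """Yield each '>' description with its wrapped continuation lines joined."""
    current = None
    for raw in lines:
        line = raw.strip()
        if line.startswith('>'):
            if current is not None:
                yield ' '.join(current)
            current = [line[1:]]
        elif current is not None:
            if not line or line.startswith('Length='):
                # End of the description block
                yield ' '.join(current)
                current = None
            else:
                current.append(line)
    if current is not None:
        yield ' '.join(current)

FIELD = re.compile(r'\b(OS|OX|GN|PE|SV)=(.*?)(?=\s+(?:OS|OX|GN|PE|SV)=|$)')

def parse_header(header):
    """Split a joined description into identifier, entry name, protein name and KEY=value fields."""
    parts = header.split(None, 2)
    if len(parts) < 3:
        return None
    identifier, entry, rest = parts
    # The protein name is everything before the first KEY= field
    name = re.split(r'\s+(?:OS|OX|GN|PE|SV)=', rest, maxsplit=1)[0]
    return {'id': identifier, 'entry': entry, 'name': name, **dict(FIELD.findall(rest))}

# Illustrative input only: the wrap position and field values are assumptions
sample = [
    ">SP:P05067 A4_HUMAN Amyloid-beta precursor protein OS=Homo",
    "sapiens OX=9606 GN=APP PE=1 SV=3",
    "Length=770",
]
for header in iter_hit_headers(sample):
    print(parse_header(header))
```

With the lines joined first, the organism is read as "Homo sapiens" instead of "Homo", and `GN=` no longer has to fall at the start of a line to be found.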

In [None]:
# 1. Initialize the dictionary that collects the annotations
structured_annotations = {}

# 2. Add 'Protein Names' and 'Gene Names' parsed from the BLAST output
structured_annotations['Protein Names'] = sorted(protein_names)
structured_annotations['Gene Names'] = sorted(gene_names)

# 3. Add 'Domains' (hard-coded here, not parsed from the BLAST output)
structured_annotations['Domains'] = ['Kunitz-type protease inhibitor domains']

# 4. Add 'Biological Processes' (hard-coded here, not parsed from the BLAST output)
structured_annotations['Biological Processes'] = [
    'neurogenesis', 'cell adhesion', 'apoptosis', 'protein processing',
    'signal transduction', 'growth factor receptor activity'
]

# 5. Add 'Associated Pathways' (hard-coded here, not parsed from the BLAST output)
structured_annotations['Associated Pathways'] = [
    'amyloidogenic processing', 'non-amyloidogenic processing',
    'Amyloid-beta peptide generation'
]

# 6. Add 'Organism Information' parsed from the BLAST output
structured_annotations['Organism Information'] = sorted(organism_info)

# 7. Print the structured_annotations dictionary
import json
print(json.dumps(structured_annotations, indent=4))

{
    "Protein Names": [
        "Amyloid-beta precursor protein",
        "Amyloid-beta precursor protein (Fragment)"
    ],
    "Gene Names": [
        "APP",
        "App"
    ],
    "Domains": [
        "Kunitz-type protease inhibitor domains"
    ],
    "Biological Processes": [
        "neurogenesis",
        "cell adhesion",
        "apoptosis",
        "protein processing",
        "signal transduction",
        "growth factor receptor activity"
    ],
    "Associated Pathways": [
        "amyloidogenic processing",
        "non-amyloidogenic processing",
        "Amyloid-beta peptide generation"
    ],
    "Organism Information": [
        "Bos taurus",
        "Canis lupus",
        "Cavia porcellus",
        "Felis catus",
        "Homo sapiens",
        "Macaca",
        "Macaca fascicularis",
        "Mus musculus",
        "Oryctolagus",
        "Ovis aries",
        "Pan troglodytes",
        "Rattus norvegicus",
        "Saimiri sciureus",
        "Sus scrofa",
        

In [None]:
import graphviz

# Initialize a directed graph object
dot = graphviz.Digraph(comment='Amyloid-beta Precursor Protein Functional Annotations', format='png')
dot.attr(rankdir='LR', size='10,10')

# 3. Create a central node for the main protein
dot.node('APP_MAIN', 'Amyloid-beta Precursor Protein (APP)', shape='box', style='filled', fillcolor='#ADD8E6') # Light blue

# Define colors for different categories
category_colors = {
    'Protein Names': '#90EE90', # Light green
    'Gene Names': '#FFD700',    # Gold
    'Domains': '#FFA07A',       # Light salmon
    'Biological Processes': '#BA55D3', # Medium orchid
    'Associated Pathways': '#4682B4',  # Steel blue
    'Organism Information': '#87CEEB' # Sky blue
}

# 4. For each category, create nodes and draw edges
for category, items in structured_annotations.items():
    if category == 'Organism Information': # Summarized in a single node below to avoid clutter
        continue

    # Add a category node first
    category_id = category.replace(' ', '_')
    dot.node(category_id, category, shape='octagon', style='filled', fillcolor='#D3D3D3') # Light gray
    dot.edge('APP_MAIN', category_id, label='has')

    for item in items:
        item_id = item.replace(' ', '_').replace('-', '_').replace('(', '').replace(')', '').replace('.', '') # Create a valid ID
        dot.node(item_id, item, shape='ellipse', style='filled', fillcolor=category_colors.get(category, '#FFFFFF'))
        dot.edge(category_id, item_id)

# Add Organism Information as a separate cluster or legend if too many
# For simplicity, let's just connect the main APP node to a summary organism node
organism_summary_id = 'Organisms_Summary'
dot.node(organism_summary_id, 'Organisms (Widespread Conservation)', shape='folder', style='filled', fillcolor=category_colors['Organism Information'])
dot.edge('APP_MAIN', organism_summary_id, label='found in')


# 5. Render the graph to a file
try:
    dot.render('app_functional_annotations', view=False, cleanup=True)
    print("Diagram 'app_functional_annotations.png' generated successfully.")
except Exception as e:
    print(f"Error rendering diagram: {e}")


Diagram 'app_functional_annotations.png' generated successfully.
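
The node IDs in the cell above come from a chain of `replace` calls that covers only the characters listed, and the same label appearing under two categories would map to one shared node. A possible alternative, sketched here with a hypothetical helper named `make_node_id`, collapses every run of non-alphanumeric characters and prefixes the category:

```python
import re

def make_node_id(category, label):
    """Build an alphanumeric-and-underscore node ID, prefixed by its category."""
    # Collapse each run of other characters into a single underscore
    return re.sub(r'[^0-9A-Za-z]+', '_', f'{category}_{label}').strip('_')

print(make_node_id('Gene Names', 'APP'))
print(make_node_id('Protein Names', 'Amyloid-beta precursor protein (Fragment)'))
```

Labels that differ only in punctuation still map to the same ID within a category, so this is a starting point rather than a guarantee of uniqueness.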


In [None]:
import graphviz

# 1. Initialize a directed graph object focusing on Alzheimer's disease
dot_alz = graphviz.Digraph(comment='APP and Alzheimer\'s Disease Pathway', format='png')
dot_alz.attr(rankdir='LR', size='10,10')

# 2. Create nodes for the key components
# Main protein
dot_alz.node('APP', 'Amyloid-beta Precursor Protein (APP)', shape='box', style='filled', fillcolor='#ADD8E6') # Light blue

# Processes
dot_alz.node('AMY_PROC', 'Amyloidogenic Processing', shape='octagon', style='filled', fillcolor='#FFA07A') # Light salmon
dot_alz.node('AB_PEPTIDE', 'Amyloid-beta Peptide Generation', shape='ellipse', style='filled', fillcolor='#FFD700') # Gold

# Disease
dot_alz.node('ALZ_DIS', 'Alzheimer\'s Disease', shape='box', style='filled', fillcolor='#DC143C', fontcolor='white') # Crimson

# 3. Establish edges to show the relationships.
# Styling is passed in the same call that creates each edge, because calling
# edge() again for the same pair of nodes adds a second, parallel edge.
dot_alz.edge('APP', 'AMY_PROC', label='undergoes')
dot_alz.edge('AMY_PROC', 'AB_PEPTIDE', label='leads to', arrowhead='vee', color='darkgreen', penwidth='1.5')
dot_alz.edge('AB_PEPTIDE', 'ALZ_DIS', label='key factor in', arrowhead='vee', color='darkred', penwidth='2.0')


# 4. Render the graph to a PNG file
try:
    dot_alz.render('app_alzheimers_pathway', view=False, cleanup=True)
    print("Diagram 'app_alzheimers_pathway.png' generated successfully.")
except Exception as e:
    print(f"Error rendering diagram: {e}")

Diagram 'app_alzheimers_pathway.png' generated successfully.
