# Building and Exploring the STRING Knowledge Graph

This notebook walks **end to end** through downloading STRING data, unpacking it, constructing a **multi-layered knowledge graph** with NetworkX, and running some initial exploration on the result.

## 1. Environment Setup

- We install three Python packages: `gdown` (downloads files from Google Drive), `networkx` (the graph data structure), and `biopython` (FASTA parsing, among other things).  
- `%pip install` is a Jupyter magic that installs into the environment of the running kernel, while `!pip install` runs whichever `pip` the shell finds first, which may belong to a different environment.  
- **Note**: Some of these packages may already be installed in your environment.

In [1]:
!pip install gdown networkx biopython

Collecting gdown
  Downloading gdown-5.2.0-py3-none-any.whl (18 kB)
Collecting biopython
  Downloading biopython-1.84-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.2 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.2/3.2 MB[0m [31m22.8 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
Collecting filelock
  Downloading filelock-3.16.1-py3-none-any.whl (16 kB)
Collecting PySocks!=1.5.7,>=1.5.6
  Downloading PySocks-1.7.1-py3-none-any.whl (16 kB)
Installing collected packages: PySocks, filelock, biopython, gdown
Successfully installed PySocks-1.7.1 biopython-1.84 filelock-3.16.1 gdown-5.2.0
Note: you may need to restart the kernel to use updated packages.


## 2. Downloading and Unzipping Data

- We use `gdown` to download a ZIP archive of raw STRING data (about 1.25 GB) from a shared Google Drive link.  

In [3]:
!gdown 1guS4_vJ06ZRbqAYSaC3MoEQxwTS4-Ax6

Downloading...
From (original): https://drive.google.com/uc?id=1guS4_vJ06ZRbqAYSaC3MoEQxwTS4-Ax6
From (redirected): https://drive.google.com/uc?id=1guS4_vJ06ZRbqAYSaC3MoEQxwTS4-Ax6&confirm=t&uuid=9dd13ab0-6931-48fd-b71c-5d55b391f494
To: /home/ec2-user/SageMaker/stringdb_raw_data.zip
100%|██████████████████████████████████████| 1.25G/1.25G [00:20<00:00, 60.7MB/s]


The next cell:

- Creates a folder named `stringdb_raw_data`.  
- Extracts the downloaded ZIP into that folder.  
- Lists the extracted files.  
- Reports an error if the ZIP file is missing or is not a valid archive.

In [7]:
import zipfile
import os

# Path to the zip file
zip_file_path = "stringdb_raw_data.zip"

# Destination folder for extraction
destination_folder = "stringdb_raw_data"

# Ensure the destination folder exists
os.makedirs(destination_folder, exist_ok=True)

# Unzip the file
try:
    with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
        zip_ref.extractall(destination_folder)
    print(f"Files extracted to '{destination_folder}'")
    
    # List the extracted files
    extracted_files = os.listdir(destination_folder)
    print("Extracted files:")
    for file in extracted_files:
        print(f"- {file}")
except FileNotFoundError:
    print(f"Error: The file '{zip_file_path}' was not found.")
except zipfile.BadZipFile:
    print(f"Error: The file '{zip_file_path}' is not a valid zip file.")

Files extracted to 'stringdb_raw_data'
Extracted files:
- stringdb_raw_data
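
The listing above shows a single `stringdb_raw_data` entry, which suggests the archive wraps its contents in a top-level folder of the same name. Below is a minimal sketch for checking an archive's top-level entries before extracting it, using only the standard library. `list_zip_top_level` is a hypothetical helper, not part of the original notebook.

```python
import zipfile

def list_zip_top_level(zip_path):
    """Return the sorted top-level entries (files or folders) of a ZIP archive."""
    with zipfile.ZipFile(zip_path) as zf:
        # Member names always use "/" as the separator inside a ZIP.
        return sorted({name.split("/", 1)[0] for name in zf.namelist()})
```

Calling `list_zip_top_level("stringdb_raw_data.zip")` before extraction would show whether the files sit at the archive root or inside a wrapper folder.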


## 3. Libraries and Setup

- **gzip**: to handle `.gz` compressed files.  
- **pandas**: for tabular data manipulation (`read_csv`).  
- **networkx**: our graph library.  
- **Bio.SeqIO**: for parsing FASTA sequences.  
- **io.StringIO**: provides an in-memory, file-like text buffer.

In [5]:
import gzip
import pandas as pd
import networkx as nx
from Bio import SeqIO
from io import StringIO
import os

## 4. Data File Paths

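- A dictionary mapping descriptive keys to their respective file paths.  
- Every path starts with `stringdb_raw_data/stringdb_raw_data/` because the archive contains its own top-level folder (see the extraction output above).  
- Files prefixed with `83332` belong to NCBI taxon 83332 (*Mycobacterium tuberculosis* H37Rv); the `COG.*` and `species.*` files are not species-specific.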

In [6]:
data_files = {
    "clusters_info": "stringdb_raw_data/stringdb_raw_data/83332.clusters.info.v12.0.txt.gz",
    "clusters_proteins": "stringdb_raw_data/stringdb_raw_data/83332.clusters.proteins.v12.0.txt.gz",
    "clusters_tree": "stringdb_raw_data/stringdb_raw_data/83332.clusters.tree.v12.0.txt.gz",
    "protein_aliases": "stringdb_raw_data/stringdb_raw_data/83332.protein.aliases.v12.0.txt.gz",
    "protein_enrichment_terms": "stringdb_raw_data/stringdb_raw_data/83332.protein.enrichment.terms.v12.0.txt.gz",
    "protein_homology": "stringdb_raw_data/stringdb_raw_data/83332.protein.homology.v12.0.txt.gz",
    "protein_info": "stringdb_raw_data/stringdb_raw_data/83332.protein.info.v12.0.txt.gz",
    "protein_links_detailed": "stringdb_raw_data/stringdb_raw_data/83332.protein.links.detailed.v12.0.txt.gz",
    "protein_links_full": "stringdb_raw_data/stringdb_raw_data/83332.protein.links.full.v12.0.txt.gz",
    "protein_links": "stringdb_raw_data/stringdb_raw_data/83332.protein.links.v12.0.txt.gz",
    "protein_orthology": "stringdb_raw_data/stringdb_raw_data/83332.protein.orthology.v12.0.txt.gz",
    "protein_physical_links_detailed": "stringdb_raw_data/stringdb_raw_data/83332.protein.physical.links.detailed.v12.0.txt.gz",
    "protein_physical_links_full": "stringdb_raw_data/stringdb_raw_data/83332.protein.physical.links.full.v12.0.txt.gz",
    "protein_physical_links": "stringdb_raw_data/stringdb_raw_data/83332.protein.physical.links.v12.0.txt.gz",
    "protein_sequences": "stringdb_raw_data/stringdb_raw_data/83332.protein.sequences.v12.0.fa.gz",
    "COG_links_detailed": "stringdb_raw_data/stringdb_raw_data/COG.links.detailed.v12.0.txt.gz",
    "COG_links": "stringdb_raw_data/stringdb_raw_data/COG.links.v12.0.txt.gz",
    "COG_mappings": "stringdb_raw_data/stringdb_raw_data/COG.mappings.v12.0.txt.gz",
    "species_tree": "stringdb_raw_data/stringdb_raw_data/species.tree.v12.0.txt",
    "species": "stringdb_raw_data/stringdb_raw_data/species.v12.0.txt"
}
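
A wrong path only surfaces later as a `FileNotFoundError` during loading, so it can help to check the whole dictionary up front. Below is a small sketch, assuming a dictionary shaped like `data_files`. `find_missing_files` is an illustrative name, not something the notebook defines.

```python
from pathlib import Path

def find_missing_files(paths_by_key):
    """Return the keys whose path does not point to an existing file."""
    return [key for key, path in paths_by_key.items()
            if not Path(path).is_file()]
```

If extraction worked as expected, `find_missing_files(data_files)` should return an empty list.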

## 5. Graph Initialization

- We use a **MultiDiGraph** because the same pair of nodes can be connected by several edge types (e.g., detailed vs. full, physical vs. functional), and because some relations are directed (the species and cluster hierarchies).

In [7]:
# Create an empty MultiDiGraph
G = nx.MultiDiGraph()
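
To make the choice of graph class concrete, here is a toy illustration of how a `MultiDiGraph` stores parallel edges. The node names and scores are placeholders, not STRING data.

```python
import networkx as nx

demo = nx.MultiDiGraph()
# Two edges between the same ordered pair are stored side by side,
# each under its own integer key returned by add_edge.
k1 = demo.add_edge("A", "B", label="STRING_INTERACTION", combined_score=900)
k2 = demo.add_edge("A", "B", label="PHYSICAL_INTERACTION", combined_score=700)

print(k1, k2)                          # keys of the two parallel edges
print(demo.number_of_edges("A", "B"))  # counts both parallel edges
print(demo.has_edge("B", "A"))         # direction matters: no B -> A edge
```

A plain `DiGraph` would instead keep one edge per ordered pair, so the second `add_edge` call would overwrite the attributes of the first.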

## 6. Reading the Data

- Loads each tab- or space-delimited file into a pandas DataFrame; pandas decompresses the `.gz` files automatically based on the file extension.  
- Note how we specify `sep="\t"` for tab-separated and `sep=" "` for space-delimited files.

In [8]:
### Load Data ###

# Clusters info
clusters_info = pd.read_csv(data_files["clusters_info"], sep="\t")
# Clusters proteins
clusters_proteins = pd.read_csv(data_files["clusters_proteins"], sep="\t")
# Clusters tree
clusters_tree = pd.read_csv(data_files["clusters_tree"], sep="\t")

# Protein aliases
protein_aliases = pd.read_csv(data_files["protein_aliases"], sep="\t")
# Protein enrichment
protein_enrichment_terms = pd.read_csv(data_files["protein_enrichment_terms"], sep="\t")
# Protein homology
protein_homology = pd.read_csv(data_files["protein_homology"], sep="\t")
# Protein info
protein_info = pd.read_csv(data_files["protein_info"], sep="\t")
# Protein links (interaction)
protein_links = pd.read_csv(data_files["protein_links"], sep=" ")
protein_links_detailed = pd.read_csv(data_files["protein_links_detailed"], sep=" ")
protein_links_full = pd.read_csv(data_files["protein_links_full"], sep=" ")
# Protein orthology
protein_orthology = pd.read_csv(data_files["protein_orthology"], sep="\t")

# Protein physical links
protein_physical_links = pd.read_csv(data_files["protein_physical_links"], sep=" ")
protein_physical_links_detailed = pd.read_csv(data_files["protein_physical_links_detailed"], sep=" ")
protein_physical_links_full = pd.read_csv(data_files["protein_physical_links_full"], sep=" ")

# COG mappings
COG_mappings = pd.read_csv(data_files["COG_mappings"], sep="\t")

# Species and tree
species = pd.read_csv(data_files["species"], sep="\t")
species_tree = pd.read_csv(data_files["species_tree"], sep="\t")

- The COG link files are read in a separate cell; they are also space-delimited.

In [9]:
# COG links
COG_links = pd.read_csv(data_files["COG_links"], sep=" ")
COG_links_detailed = pd.read_csv(data_files["COG_links_detailed"], sep=" ")
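
The cells above pick the separator for each file by hand. As a rough sketch, the choice could also be derived from the header line. `detect_separator` is a hypothetical helper, and it assumes each file is either tab- or single-space-delimited.

```python
import gzip

def detect_separator(path):
    """Guess tab vs. single-space separation from the header line."""
    # Open gzipped files in text mode; fall back to a plain open otherwise.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        header = handle.readline()
    return "\t" if "\t" in header else " "
```

The result could then be passed along as `pd.read_csv(path, sep=detect_separator(path))`.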

## 7. Constructing the Knowledge Graph

### 7.1 Species Nodes and Hierarchy

- Creates a node for each species, labeled `"Species"`.  
- Node ID is `"Species:<taxon_id>"`. 
- Adds edges of type `"CHILD_OF"` to represent the parent-child relationships in the species tree.  

In [10]:
### Add Species Nodes ###
for _, row in species.iterrows():
    sid = row["#taxon_id"]
    G.add_node(f"Species:{sid}", 
                label="Species",
                taxon_id=sid,
                string_type=row["STRING_type"],
                string_name_compact=row["STRING_name_compact"],
                official_name_NCBI=row["official_name_NCBI"],
                domain=row["domain"])

In [11]:
### Add Species Hierarchy Edges ###
for _, row in species_tree.iterrows():
    child = row["#taxon_id"]
    parent = row["parent_taxon_id"]
    if child != parent:  # Avoid root loops
        G.add_edge(f"Species:{child}", f"Species:{parent}", 
                    label="CHILD_OF", 
                    taxon_name=row.get("taxon_name", None),
                    is_STRING_species=row.get("is_STRING_species", None))
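
One behavior worth keeping in mind for the loop above: `add_edge` silently creates any endpoint that is not yet in the graph, and such nodes carry no attributes. The toy example below illustrates this; `Species:PARENT` is a made-up placeholder ID.

```python
import networkx as nx

demo = nx.MultiDiGraph()
demo.add_node("Species:83332", label="Species")
# "Species:PARENT" was never added with add_node, so add_edge creates it bare.
demo.add_edge("Species:83332", "Species:PARENT", label="CHILD_OF")

unlabeled = [n for n, data in demo.nodes(data=True) if "label" not in data]
print(unlabeled)  # ['Species:PARENT']
```

Any taxon that appears in `species_tree` but not in `species` therefore ends up as a node without a `label` attribute.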

### 7.2 Protein Nodes, Sequences, and Species Link

- Creates a node for each protein with relevant metadata.  
- Adds an edge from the protein node to the species node, labeled `"FROM_SPECIES"`. 

In [12]:
### Add Protein Nodes ###
# We can link each protein to its species from the prefix of #string_protein_id
for _, p in protein_info.iterrows():
    pid = p["#string_protein_id"]
    # Extract species from id - format is something like 83332.Rv0001
    species_id = pid.split(".")[0]

    G.add_node(pid, 
                label="Protein",
                preferred_name=p["preferred_name"],
                protein_size=p["protein_size"],
                annotation=p["annotation"])
    # Link protein to species
    G.add_edge(pid, f"Species:{species_id}", label="FROM_SPECIES")

- Parses the FASTA file (gzipped).  
- If the protein is already in our graph, we store the full sequence as a node attribute.

In [13]:
### Add protein sequences ###
# Parse FASTA
with gzip.open(data_files["protein_sequences"], "rt") as handle:
    for record in SeqIO.parse(handle, "fasta"):
        protein_id = record.id
        if protein_id in G.nodes:
            G.nodes[protein_id]["sequence"] = str(record.seq)

### 7.3 Clusters: Info, Hierarchy, and Membership

- Clusters are nodes with label `"Cluster"`.  
- Also link them to the species with an edge labeled `"CLUSTER_IN_SPECIES"`.

In [14]:
### Add Clusters as Nodes ###
for _, c in clusters_info.iterrows():
    cid = c["cluster_id"]
    G.add_node(cid, 
                label="Cluster",
                cluster_size=c["cluster_size"],
                best_described_by=c["best_described_by"])
    # Link the cluster to its species; the clusters file carries the
    # taxon ID in its #string_taxon_id column.
    species_id = c["#string_taxon_id"]
    G.add_edge(cid, f"Species:{species_id}", label="CLUSTER_IN_SPECIES")

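- Connects each child cluster to its parent cluster. Note that the edge runs child -> parent even though it is labeled `"PARENT_OF"`, so it should be read as "has parent".  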

In [15]:
### Add Cluster Hierarchy Edges ###
for _, row in clusters_tree.iterrows():
    child = row["child_cluster_id"]
    parent = row["parent_cluster_id"]
    G.add_edge(child, parent, label="PARENT_OF")
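
Because these edges point from child to parent, the chain of ancestors of a cluster can be recovered by following outgoing edges. Below is a minimal sketch, assuming each cluster has at most one parent. `cluster_lineage` and the cluster IDs in the toy graph are illustrative only.

```python
import networkx as nx

def cluster_lineage(graph, cluster_id, edge_label="PARENT_OF"):
    """Follow child -> parent edges with the given label up to the root."""
    lineage = [cluster_id]
    current = cluster_id
    while True:
        parents = [v for _, v, lbl in graph.out_edges(current, data="label")
                   if lbl == edge_label and v not in lineage]
        if not parents:
            return lineage
        current = parents[0]
        lineage.append(current)

# Toy hierarchy with made-up cluster IDs: CL:3 -> CL:2 -> CL:1
demo = nx.MultiDiGraph()
demo.add_edge("CL:3", "CL:2", label="PARENT_OF")
demo.add_edge("CL:2", "CL:1", label="PARENT_OF")
demo.add_edge("CL:3", "Species:X", label="CLUSTER_IN_SPECIES")
print(cluster_lineage(demo, "CL:3"))  # ['CL:3', 'CL:2', 'CL:1']
```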

- Links each protein to the cluster(s) it belongs to with an edge labeled `"IN_CLUSTER"`.

In [16]:
### Add Cluster Membership Edges ###
for _, row in clusters_proteins.iterrows():
    cid = row["cluster_id"]
    pid = row["protein_id"]
    if G.has_node(cid) and G.has_node(pid):
        G.add_edge(pid, cid, label="IN_CLUSTER")

### 7.4 Aliases, Enrichment Terms, and Homology

- Each alias is a separate node labeled `"Alias"`.  
- Creates edges `"HAS_ALIAS"` from the protein to that alias node.

In [17]:
### Add Protein Aliases ###
# We can store aliases as attributes or as separate nodes. Let's store as separate "Alias" nodes.
for _, row in protein_aliases.iterrows():
    pid = row["#string_protein_id"]
    alias = row["alias"]
    alias_node = f"Alias:{pid}:{alias}"
    G.add_node(alias_node, label="Alias", source=row["source"], name=alias)
    if G.has_node(pid):
        G.add_edge(pid, alias_node, label="HAS_ALIAS")

- Each term is a separate node labeled `"Term"`.  
- The protein -> term edge is labeled `"HAS_TERM"`. 

In [18]:
### Add Protein Enrichment Terms ###
# Similar approach: terms as separate nodes
for _, row in protein_enrichment_terms.iterrows():
    pid = row["#string_protein_id"]
    term = row["term"]
    cat = row["category"]
    term_node = f"Term:{cat}:{term}"
    # Add term node if not present
    if not G.has_node(term_node):
        G.add_node(term_node, label="Term", category=cat, description=row["description"])
    if G.has_node(pid):
        G.add_edge(pid, term_node, label="HAS_TERM")

- Connect pairs of proteins with `"HOMOLOG_OF"`, storing bitscore and alignment ranges as edge attributes.

In [19]:
### Add Protein Homology Edges ###
# Homology edges between proteins
for _, row in protein_homology.iterrows():
    p1 = row["#string_protein_1"]
    p2 = row["string_protein_id_2"]
    if G.has_node(p1) and G.has_node(p2):
        G.add_edge(p1, p2, label="HOMOLOG_OF",
                    bitscore=row["bitscore"],
                    start_1=row["start_1"],
                    end_1=row["end_1"],
                    start_2=row["start_2"],
                    end_2=row["end_2"])

### 7.5 Orthology Groups

- For each protein, create or reuse an orthologous group node (label: `"OrthologousGroup"`) and link via `"BELONGS_TO_OG"`.

In [20]:
### Add Protein Orthology ###
for _, row in protein_orthology.iterrows():
    prot = row["#protein"]
    og = row["orthologous_group_or_ortholog"]
    og_node = f"OG:{og}"
    if not G.has_node(og_node):
        G.add_node(og_node, label="OrthologousGroup", taxonomy_level=row["taxonomy_level"])
    if G.has_node(prot):
        G.add_edge(prot, og_node, label="BELONGS_TO_OG")

### 7.6 Protein-Protein Interactions

- Basic interaction data with a `"STRING_INTERACTION"` edge.

In [21]:
### Add Protein-Protein Interaction Edges ###
# From protein.links
for _, row in protein_links.iterrows():
    p1 = row["protein1"]
    p2 = row["protein2"]
    if G.has_node(p1) and G.has_node(p2):
        G.add_edge(p1, p2, label="STRING_INTERACTION", combined_score=row["combined_score"])
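
STRING reports `combined_score` on a 0 to 1000 scale, and analyses often keep only edges above some cutoff. The sketch below filters edges of one label by score after the graph is built. `edges_above_score` is a hypothetical helper, and the cutoff of 700 (commonly described as high confidence) is an illustrative choice, not something this notebook prescribes.

```python
import networkx as nx

def edges_above_score(graph, label, min_score):
    """Return (u, v, score) for edges of one label with combined_score >= min_score."""
    return [(u, v, data["combined_score"])
            for u, v, data in graph.edges(data=True)
            if data.get("label") == label
            and data.get("combined_score", 0) >= min_score]

# Toy graph with placeholder protein IDs and scores.
demo = nx.MultiDiGraph()
demo.add_edge("P1", "P2", label="STRING_INTERACTION", combined_score=950)
demo.add_edge("P1", "P3", label="STRING_INTERACTION", combined_score=420)
demo.add_edge("P1", "Term:x", label="HAS_TERM")  # no score on this edge
print(edges_above_score(demo, "STRING_INTERACTION", 700))  # [('P1', 'P2', 950)]
```

On the full graph this scans every edge, so filtering the DataFrame before the loop would likely be cheaper. The version here only illustrates the idea.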

- Adds the **detailed** per-channel scores (e.g. coexpression, text mining) as a separate `"STRING_INTERACTION_DETAILED"` edge between the same proteins.

In [22]:
# Add more detail from protein.links.detailed
for _, row in protein_links_detailed.iterrows():
    p1 = row["protein1"]
    p2 = row["protein2"]
    if G.has_node(p1) and G.has_node(p2):
        # In a MultiDiGraph, add_edge always creates a new parallel edge;
        # it does not update the existing STRING_INTERACTION edge.
        G.add_edge(p1, p2, label="STRING_INTERACTION_DETAILED",
                    neighborhood=row["neighborhood"],
                    fusion=row["fusion"],
                    cooccurence=row["cooccurence"],
                    coexpression=row["coexpression"],
                    experimental=row["experimental"],
                    database=row["database"],
                    textmining=row["textmining"],
                    combined_score=row["combined_score"])

- Adds the **transferred** evidence channels from `protein.links.full` as a third parallel edge, `"STRING_INTERACTION_FULL"`, between the same pair of proteins.

In [23]:
# protein.links.full
for _, row in protein_links_full.iterrows():
    p1 = row["protein1"]
    p2 = row["protein2"]
    if G.has_node(p1) and G.has_node(p2):
        G.add_edge(p1, p2, label="STRING_INTERACTION_FULL",
                    neighborhood_transferred=row["neighborhood_transferred"],
                    coexpression_transferred=row["coexpression_transferred"],
                    experiments_transferred=row["experiments_transferred"],
                    database_transferred=row["database_transferred"],
                    textmining_transferred=row["textmining_transferred"],
                    combined_score=row["combined_score"])
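
Since each interacting pair now carries several parallel edges, it can be handy to view all layers for one pair at once. Below is a small sketch using `get_edge_data`, which on a multigraph returns a dictionary of attribute dictionaries keyed by edge key. `edge_layers` and the toy values are illustrative assumptions.

```python
import networkx as nx

def edge_layers(graph, u, v):
    """Map each parallel u -> v edge's label to its remaining attributes."""
    parallel = graph.get_edge_data(u, v, default={})
    return {data.get("label", key): {k: val for k, val in data.items() if k != "label"}
            for key, data in parallel.items()}

# Toy pair with placeholder IDs and scores.
demo = nx.MultiDiGraph()
demo.add_edge("P1", "P2", label="STRING_INTERACTION", combined_score=900)
demo.add_edge("P1", "P2", label="STRING_INTERACTION_DETAILED",
              coexpression=120, textmining=850, combined_score=900)
print(edge_layers(demo, "P1", "P2"))
```

The helper assumes at most one edge per label between a given ordered pair, which matches how the cells above build the graph.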

### 7.7 Protein Physical Interactions

- Similar approach, but specifically for physical (rather than functional) interactions.

In [24]:
### Protein Physical Interactions ###
for _, row in protein_physical_links.iterrows():
    p1 = row["protein1"]
    p2 = row["protein2"]
    if G.has_node(p1) and G.has_node(p2):
        G.add_edge(p1, p2, label="PHYSICAL_INTERACTION",
                    combined_score=row["combined_score"])

- Detailed physical interactions with sub-scores.

In [25]:
for _, row in protein_physical_links_detailed.iterrows():
    p1 = row["protein1"]
    p2 = row["protein2"]
    if G.has_node(p1) and G.has_node(p2):
        G.add_edge(p1, p2, label="PHYSICAL_INTERACTION_DETAILED",
                    experimental=row["experimental"],
                    database=row["database"],
                    textmining=row["textmining"],
                    combined_score=row["combined_score"])

- Full data for physical interactions (transferred evidence channels).

In [26]:
for _, row in protein_physical_links_full.iterrows():
    p1 = row["protein1"]
    p2 = row["protein2"]
    if G.has_node(p1) and G.has_node(p2):
        G.add_edge(p1, p2, label="PHYSICAL_INTERACTION_FULL",
                    experiments_transferred=row["experiments_transferred"],
                    database_transferred=row["database_transferred"],
                    textmining_transferred=row["textmining_transferred"],
                    combined_score=row["combined_score"])

### 7.8 COG Links and Mappings

- Ensures that **every** orthologous group ID in `COG_mappings` has a node, reusing the `OG:` prefix from Section 7.5.

In [27]:
### COG Links and Mappings ###
# Add Orthologous groups as nodes (some may be COG specifically)
for og in set(COG_mappings["orthologous_group"]):
    node_id = f"OG:{og}"
    if not G.has_node(node_id):
        G.add_node(node_id, label="OrthologousGroup")

- Connect each protein to its COG with `"BELONGS_TO_OG"`, storing additional metadata.

In [28]:
# Add protein -> COG edges
for _, row in COG_mappings.iterrows():
    protein_id = row["#protein"]
    og = row["orthologous_group"]
    og_node = f"OG:{og}"
    if G.has_node(protein_id) and G.has_node(og_node):
        G.add_edge(protein_id, og_node, label="BELONGS_TO_OG",
                    start_position=row["start_position"],
                    end_position=row["end_position"],
                    protein_annotation=row["protein_annotation"])

In [29]:
# Add COG-COG relations (from COG.links and COG.links.detailed)
for _, row in COG_links.iterrows():
    g1 = f"OG:{row['group1']}"
    g2 = f"OG:{row['group2']}"
    if G.has_node(g1) and G.has_node(g2):
        G.add_edge(g1, g2, label="COG_RELATION", association_score=row["association_score"])

- The cell above adds the basic COG-to-COG association score as a `"COG_RELATION"` edge; the next cell adds the per-channel scores as a parallel `"COG_RELATION_DETAILED"` edge.

In [31]:
for _, row in COG_links_detailed.iterrows():
    g1 = f"OG:{row['group1']}"
    g2 = f"OG:{row['group2']}"
    if G.has_node(g1) and G.has_node(g2):
        G.add_edge(g1, g2, label="COG_RELATION_DETAILED",
                    neighborhood=row["neighborhood"],
                    fusion=row["fusion"],
                    cooccurence=row["cooccurence"],
                    coexpression=row["coexpression"],
                    experimental=row["experimental"],
                    database=row["database"],
                    textmining=row["textmining"],
                    combined_score=row["combined_score"])

## 8. Basic Graph Statistics

- Displays the total count of nodes and edges in the graph.

In [32]:
print("Number of nodes:", G.number_of_nodes())
print("Number of edges:", G.number_of_edges())

Number of nodes: 2934568
Number of edges: 86046962


### 8.1 Node Label Distribution

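- Provides a breakdown of node labels (e.g., `Protein`, `Cluster`, `Species`, etc.) and edge labels.  
- A good way to verify data ingestion.  
- In the output below, the 2,199,643 `Unknown` nodes are most likely taxa that appear in `species_tree` but not in `species`: `add_edge` creates missing endpoints without attributes, and 12,535 + 2,199,643 = 2,212,178 nodes is exactly one more than the 2,212,177 `CHILD_OF` edges, as expected for a tree.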

In [33]:
from collections import Counter

In [34]:
# Let's examine node labels
node_labels = [G.nodes[n].get('label', 'Unknown') for n in G.nodes()]
node_label_counts = Counter(node_labels)
print("Node label distribution:", node_label_counts)

# Examine edge labels
edge_labels = [G[u][v][k].get('label', 'Unknown') for u,v,k in G.edges(keys=True)]
edge_label_counts = Counter(edge_labels)
print("Edge label distribution:", edge_label_counts)

Node label distribution: Counter({'Unknown': 2199643, 'OrthologousGroup': 614554, 'Alias': 91922, 'Species': 12535, 'Term': 10887, 'Protein': 4026, 'Cluster': 1001})
Edge label distribution: Counter({'COG_RELATION': 39471870, 'COG_RELATION_DETAILED': 39471870, 'CHILD_OF': 2212177, 'STRING_INTERACTION': 1386026, 'STRING_INTERACTION_DETAILED': 1386026, 'STRING_INTERACTION_FULL': 1386026, 'HAS_TERM': 167062, 'IN_CLUSTER': 149955, 'HAS_ALIAS': 119264, 'PHYSICAL_INTERACTION': 74934, 'PHYSICAL_INTERACTION_DETAILED': 74934, 'PHYSICAL_INTERACTION_FULL': 74934, 'BELONGS_TO_OG': 39973, 'HOMOLOG_OF': 25884, 'FROM_SPECIES': 4026, 'CLUSTER_IN_SPECIES': 1001, 'PARENT_OF': 1000})
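
The same counts can be obtained without indexing into `G[u][v][k]`, because NetworkX node and edge views accept an attribute name and a default. A sketch of that variant on a toy graph follows; `label_counts` is a hypothetical helper.

```python
from collections import Counter
import networkx as nx

def label_counts(graph):
    """Count node and edge labels, using 'Unknown' where the attribute is missing."""
    node_counts = Counter(lbl for _, lbl in graph.nodes(data="label", default="Unknown"))
    edge_counts = Counter(lbl for _, _, lbl in graph.edges(data="label", default="Unknown"))
    return node_counts, edge_counts

# Toy graph: one labeled node, one node created implicitly by add_edge.
demo = nx.MultiDiGraph()
demo.add_node("P1", label="Protein")
demo.add_edge("P1", "X", label="HAS_TERM")
print(label_counts(demo))
```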


## 9. Example Protein Queries

Below are some examples of how you can **query and explore** the resulting graph.

### 9.1 Checking for Specific Proteins

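- Checks whether a particular protein is in the graph and prints its attributes.
- Lists the first 20 neighbors of that protein and prints their labels (e.g., `Species`, `Cluster`). On a directed graph, `G.neighbors` returns successors only (targets of outgoing edges), in insertion order, so the species and cluster edges that were added first fill the listing before any interaction partner appears.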

In [36]:
# DnaA
protein_id = "83332.Rv0001"
if protein_id in G.nodes:
    print("Protein found!")
    print("Protein attributes:", G.nodes[protein_id])
else:
    print("Protein not found.")

Protein found!
Protein attributes: {'label': 'Protein', 'preferred_name': 'dnaA', 'protein_size': 507, 'annotation': "Chromosomal replication initiator protein DnaA; Plays an important role in the initiation and regulation of chromosomal replication. Binds to the origin of replication; it binds specifically double-stranded DNA at a 9 bp consensus (dnaA box): 5'- TTATC[CA]A[CA]A-3'. DnaA binds to ATP and to acidic phospholipids (By similarity). Binds its own promoter.", 'sequence': 'MTDDPGSGFTTVWNAVVSELNGDPKVDDGPSSDANLSAPLTPQQRAWLNLVQPLTIVEGFALLSVPSSFVQNEIERHLRAPITDALSRRLGHQIQLGVRIAPPATDEADDTTVPPSENPATTSPDTTTDNDEIDDSAAARGDNQHSWPSYFTERPHNTDSATAGVTSLNRRYTFDTFVIGASNRFAHAAALAIAEAPARAYNPLFIWGESGLGKTHLLHAAGNYAQRLFPGMRVKYVSTEEFTNDFINSLRDDRKVAFKRSYRDVDVLLVDDIQFIEGKEGIQEEFFHTFNTLHNANKQIVISSDRPPKQLATLEDRLRTRFEWGLITDVQPPELETRIAILRKKAQMERLAVPDDVLELIASSIERNIRELEGALIRVTAFASLNKTPIDKALAEIVLRDLIADANTMQISAATIMAATAEYFDTTVEELRGPGKTRALAQSRQIAMYLCRELTDLSLPKIGQAFGRDHTTVMYAQRKILSEMAERREVFDHVKELTTRIRQRSKR'}


In [37]:
# Neighbors of a protein
neighbors = list(G.neighbors(protein_id))
print(f"Neighbors of {protein_id}:")
for nbr in neighbors[:20]:
    print(" - ", nbr, G.nodes[nbr].get('label'))

Neighbors of 83332.Rv0001:
 -  Species:83332 Species
 -  CL:26 Cluster
 -  CL:23 Cluster
 -  CL:21 Cluster
 -  CL:16 Cluster
 -  CL:12 Cluster
 -  CL:10 Cluster
 -  CL:5 Cluster
 -  CL:0 Cluster
 -  CL:28 Cluster
 -  CL:30 Cluster
 -  CL:31 Cluster
 -  CL:32 Cluster
 -  CL:33 Cluster
 -  CL:34 Cluster
 -  CL:35 Cluster
 -  CL:36 Cluster
 -  CL:37 Cluster
 -  CL:38 Cluster
 -  CL:39 Cluster
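
The first 20 successors are all species and cluster nodes because those edges were inserted first. To look at a particular relation instead, the outgoing edges can be filtered by label. A minimal sketch follows; `successors_by_edge_label` and the toy IDs are illustrative assumptions.

```python
import networkx as nx

def successors_by_edge_label(graph, node, wanted_labels):
    """Return distinct successors of `node` reached via edges with a wanted label."""
    wanted = set(wanted_labels)
    found, seen = [], set()
    for _, nbr, lbl in graph.out_edges(node, data="label"):
        if lbl in wanted and nbr not in seen:
            seen.add(nbr)
            found.append(nbr)
    return found

# Toy neighborhood with placeholder IDs.
demo = nx.MultiDiGraph()
demo.add_edge("P1", "Species:X", label="FROM_SPECIES")
demo.add_edge("P1", "CL:0", label="IN_CLUSTER")
demo.add_edge("P1", "P2", label="STRING_INTERACTION")
demo.add_edge("P1", "P2", label="PHYSICAL_INTERACTION")
print(successors_by_edge_label(demo, "P1", ["STRING_INTERACTION", "PHYSICAL_INTERACTION"]))
```

On the real graph, a call such as `successors_by_edge_label(G, protein_id, ["STRING_INTERACTION"])` would list interaction partners rather than clusters.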


In [43]:
# ESAT-6 (Early Secreted Antigenic Target 6 kDa): Rv3875
protein_id = "83332.Rv3875"
if protein_id in G.nodes:
    print("Protein found!")
    print("Protein attributes:", G.nodes[protein_id])
else:
    print("Protein not found.")

Protein found!
Protein attributes: {'label': 'Protein', 'preferred_name': 'esxA', 'protein_size': 95, 'annotation': "6 kDa early secretory antigenic target EsxA (ESAT-6); A secreted protein that plays a number of roles in modulating the host's immune response to infection as well as being responsible for bacterial escape into the host cytoplasm. Acts as a strong host (human) T-cell antigen. Inhibits IL- 12 p40 (IL12B) and TNF-alpha expression by infected host (mouse) macrophages, reduces the nitric oxide response by about 75%. In mice previously exposed to the bacterium, elicits high level of IFN-gamma production by T-cells upon subsequent challenge by M.tuberculosis, in the first phase of a protecti [...] ", 'sequence': 'MTEQQWNFAGIEAAASAIQGNVTSIHSLLDEGKQSLTKLAAAWGGSGSEAYQGVQQKWDATATELNNALQNLARTISEAGQAMASTEGNVTGMFA'}


In [44]:
# Neighbors of a protein
neighbors = list(G.neighbors(protein_id))
print(f"Neighbors of {protein_id}:")
for nbr in neighbors[:20]:
    print(" - ", nbr, G.nodes[nbr].get('label'))

Neighbors of 83332.Rv3875:
 -  Species:83332 Species
 -  CL:26 Cluster
 -  CL:23 Cluster
 -  CL:21 Cluster
 -  CL:16 Cluster
 -  CL:12 Cluster
 -  CL:10 Cluster
 -  CL:5 Cluster
 -  CL:0 Cluster
 -  CL:28 Cluster
 -  CL:30 Cluster
 -  CL:31 Cluster
 -  CL:32 Cluster
 -  CL:33 Cluster
 -  CL:34 Cluster
 -  CL:35 Cluster
 -  CL:36 Cluster
 -  CL:37 Cluster
 -  CL:38 Cluster
 -  CL:39 Cluster


In [45]:
# CFP-10 (10 kDa Culture Filtrate Protein): Rv3874
protein_id = "83332.Rv3874"
if protein_id in G.nodes:
    print("Protein found!")
    print("Protein attributes:", G.nodes[protein_id])
else:
    print("Protein not found.")

Protein found!
Protein attributes: {'label': 'Protein', 'preferred_name': 'esxB', 'protein_size': 100, 'annotation': '10 kDa culture filtrate antigen EsxB (LHP) (CFP10); A secreted protein. Acts as a strong host (human) T-cell antigen. Involved in translocation of bacteria from the host (human) phagolysosome to the host cytoplasm. Might serve as a chaperone to prevent uncontrolled membrane lysis by its partner EsxA; native protein binds poorly to artificial liposomes in the absence or presence of EsxA. EsxA and EsxA-EsxB are cytotoxic to pneumocytes. EsxB (and EsxA-EsxB but not EsxA alone) activates human neutrophils; EsxB transiently induces host (human) intracellular Ca(2+) mobility in a dose-depend [...] ', 'sequence': 'MAEMKTDAATLAQEAGNFERISGDLKTQIDQVESTAGSLQGQWRGAAGTAAQAAVVRFQEAANKQKQELDEISTNIRQAGVQYSRADEEQQQALSSQMGF'}


In [46]:
# Neighbors of a protein
neighbors = list(G.neighbors(protein_id))
print(f"Neighbors of {protein_id}:")
for nbr in neighbors[:20]:
    print(" - ", nbr, G.nodes[nbr].get('label'))

Neighbors of 83332.Rv3874:
 -  Species:83332 Species
 -  CL:26 Cluster
 -  CL:23 Cluster
 -  CL:21 Cluster
 -  CL:16 Cluster
 -  CL:12 Cluster
 -  CL:10 Cluster
 -  CL:5 Cluster
 -  CL:0 Cluster
 -  CL:28 Cluster
 -  CL:30 Cluster
 -  CL:31 Cluster
 -  CL:32 Cluster
 -  CL:33 Cluster
 -  CL:34 Cluster
 -  CL:35 Cluster
 -  CL:36 Cluster
 -  CL:37 Cluster
 -  CL:38 Cluster
 -  CL:39 Cluster


In [47]:
# FabH (β-Ketoacyl-ACP Synthase III): Rv0533c
protein_id = "83332.Rv0533c"
if protein_id in G.nodes:
    print("Protein found!")
    print("Protein attributes:", G.nodes[protein_id])
else:
    print("Protein not found.")

Protein found!
Protein attributes: {'label': 'Protein', 'preferred_name': 'fabH', 'protein_size': 335, 'annotation': '3-oxoacyl-[acyl-carrier-protein] synthase III FabH (beta-ketoacyl-ACP synthase III) (KAS III); Catalyzes the condensation reaction of fatty acid synthesis by the addition to an acyl acceptor of two carbons from malonyl-ACP. Catalyzes the first condensation reaction which initiates fatty acid synthesis and may therefore play a role in governing the total rate of fatty acid production. Possesses both acetoacetyl-ACP synthase and acetyl transacylase activities. Has some substrate specificity for long chain acyl-CoA such as myristoyl-CoA. Does not use acyl-CoA as primer. Its substrate spec [...] ', 'sequence': 'MTEIATTSGARSVGLLSVGAYRPERVVTNDEICQHIDSSDEWIYTRTGIKTRRFAADDESAASMATEACRRALSNAGLSAADIDGVIVTTNTHFLQTPPAAPMVAASLGAKGILGFDLSAGCAGFGYALGAAADMIRGGGAATMLVVGTEKLSPTIDMYDRGNCFIFADGAAAVVVGETPFQGIGPTVAGSDGEQADAIRQDIDWITFAQNPSGPRPFVRLEGPAVFRWAAFKMGDVGRRAMDAAGVRPDQIDVFVPHQANSRINEL

In [48]:
# Neighbors of a protein
neighbors = list(G.neighbors(protein_id))
print(f"Neighbors of {protein_id}:")
for nbr in neighbors[:20]:
    print(" - ", nbr, G.nodes[nbr].get('label'))

Neighbors of 83332.Rv0533c:
 -  Species:83332 Species
 -  CL:26 Cluster
 -  CL:23 Cluster
 -  CL:21 Cluster
 -  CL:16 Cluster
 -  CL:12 Cluster
 -  CL:10 Cluster
 -  CL:5 Cluster
 -  CL:0 Cluster
 -  CL:28 Cluster
 -  CL:30 Cluster
 -  CL:31 Cluster
 -  CL:32 Cluster
 -  CL:33 Cluster
 -  CL:34 Cluster
 -  CL:35 Cluster
 -  CL:36 Cluster
 -  CL:37 Cluster
 -  CL:38 Cluster
 -  CL:39 Cluster


In [49]:
# SigA (Sigma Factor A): Rv2703
protein_id = "83332.Rv2703"
if protein_id in G.nodes:
    print("Protein found!")
    print("Protein attributes:", G.nodes[protein_id])
else:
    print("Protein not found.")

Protein found!
Protein attributes: {'label': 'Protein', 'preferred_name': 'sigA', 'protein_size': 528, 'annotation': 'RNA polymerase sigma factor SigA (sigma-A); Sigma factors are initiation factors that promote the attachment of RNA polymerase to specific initiation sites and are then released. This sigma factor is the primary sigma factor during exponential growth (Probable); Belongs to the sigma-70 factor family. RpoD/SigA subfamily.', 'sequence': 'MAATKASTATDEPVKRTATKSPAASASGAKTGAKRTAAKSASGSPPAKRATKPAARSVKPASAPQDTTTSTIPKRKTRAAAKSAAAKAPSARGHATKPRAPKDAQHEAATDPEDALDSVEELDAEPDLDVEPGEDLDLDAADLNLDDLEDDVAPDADDDLDSGDDEDHEDLEAEAAVAPGQTADDDEEIAEPTEKDKASGDFVWDEDESEALRQARKDAELTASADSVRAYLKQIGKVALLNAEEEVELAKRIEAGLYATQLMTELSERGEKLPAAQRRDMMWICRDGDRAKNHLLEANLRLVVSLAKRYTGRGMAFLDLIQEGNLGLIRAVEKFDYTKGYKFSTYATWWIRQAITRAMADQARTIRIPVHMVEVINKLGRIQRELLQDLGREPTPEELAKEMDITPEKVLEIQQYAREPISLDQTIGDEGDSQLGDFIEDSEAVVAVDAVSFTLLQDQLQSVLDTLSEREAGVVRLRFGLTDGQPRTLDEIGQVYGVTRERIRQIESKTMSKLRHPSRSQVLRDYLD'}


In [50]:
# Neighbors of a protein
neighbors = list(G.neighbors(protein_id))
print(f"Neighbors of {protein_id}:")
for nbr in neighbors[:20]:
    print(" - ", nbr, G.nodes[nbr].get('label'))

Neighbors of 83332.Rv2703:
 -  Species:83332 Species
 -  CL:26 Cluster
 -  CL:23 Cluster
 -  CL:21 Cluster
 -  CL:16 Cluster
 -  CL:12 Cluster
 -  CL:10 Cluster
 -  CL:5 Cluster
 -  CL:0 Cluster
 -  CL:28 Cluster
 -  CL:30 Cluster
 -  CL:31 Cluster
 -  CL:32 Cluster
 -  CL:33 Cluster
 -  CL:34 Cluster
 -  CL:35 Cluster
 -  CL:36 Cluster
 -  CL:37 Cluster
 -  CL:38 Cluster
 -  CL:39 Cluster


In [51]:
# PknB (Protein Kinase B): Rv0014c
protein_id = "83332.Rv0014c"
if protein_id in G.nodes:
    print("Protein found!")
    print("Protein attributes:", G.nodes[protein_id])
else:
    print("Protein not found.")

Protein found!
Protein attributes: {'label': 'Protein', 'preferred_name': 'pknB', 'protein_size': 626, 'annotation': 'Transmembrane serine/threonine-protein kinase B PknB (protein kinase B) (STPK B); Protein kinase that regulates many aspects of mycobacterial physiology, and is critical for growth in vitro and survival of the pathogen in the host. Is a key component of a signal transduction pathway that regulates cell growth, cell shape and cell division via phosphorylation of target proteins such as GarA, GlmU, PapA5, PbpA, FhaB (Rv0019c), FhaA (Rv0020c), MviN, PstP, EmbR, Rv1422, Rv1747 and RseA. Also catalyzes the phosphorylation of the core proteasome alpha-subunit (PrcA), and thereby regulates th [...] ', 'sequence': 'MTTPSHLSDRYELGEILGFGGMSEVHLARDLRLHRDVAVKVLRADLARDPSFYLRFRREAQNAAALNHPAIVAVYDTGEAETPAGPLPYIVMEYVDGVTLRDIVHTEGPMTPKRAIEVIADACQALNFSHQNGIIHRDVKPANIMISATNAVKVMDFGIARAIADSGNSVTQTAAVIGTAQYLSPEQARGDSVDARSDVYSLGCVLYEVLTGEPPFTGDSPVSVAYQHVREDPIPPSARHEGLSADLDAVVLKALAKNPENRYQTAA

In [52]:
# Neighbors of a protein
neighbors = list(G.neighbors(protein_id))
print(f"Neighbors of {protein_id}:")
for nbr in neighbors[:20]:
    print(" - ", nbr, G.nodes[nbr].get('label'))

Neighbors of 83332.Rv0014c:
 -  Species:83332 Species
 -  CL:26 Cluster
 -  CL:23 Cluster
 -  CL:21 Cluster
 -  CL:16 Cluster
 -  CL:12 Cluster
 -  CL:10 Cluster
 -  CL:5 Cluster
 -  CL:0 Cluster
 -  CL:28 Cluster
 -  CL:30 Cluster
 -  CL:31 Cluster
 -  CL:32 Cluster
 -  CL:33 Cluster
 -  CL:34 Cluster
 -  CL:35 Cluster
 -  CL:36 Cluster
 -  CL:37 Cluster
 -  CL:38 Cluster
 -  CL:39 Cluster
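
The neighbor lists above mix species and cluster nodes, and they only cover outgoing edges. A minimal sketch, on a toy graph, of grouping both outgoing and incoming neighbors by edge label follows. The helper name `neighbors_by_edge_label` and the labels `IN_SPECIES` and `IN_CLUSTER` are assumptions made for illustration, not names taken from this notebook's graph.

```python
import networkx as nx
from collections import defaultdict

def neighbors_by_edge_label(graph, node):
    """Group the outgoing and incoming neighbors of `node` by edge label."""
    grouped = {"out": defaultdict(set), "in": defaultdict(set)}
    for _, v, data in graph.out_edges(node, data=True):
        grouped["out"][data.get("label", "Unknown")].add(v)
    for u, _, data in graph.in_edges(node, data=True):
        grouped["in"][data.get("label", "Unknown")].add(u)
    return grouped

# Toy graph; IN_SPECIES and IN_CLUSTER are hypothetical labels
toy = nx.MultiDiGraph()
toy.add_edge("P1", "Species:1", label="IN_SPECIES")
toy.add_edge("P1", "CL:0", label="IN_CLUSTER")
toy.add_edge("P2", "P1", label="STRING_INTERACTION")

grouped = neighbors_by_edge_label(toy, "P1")
print(sorted(toy.neighbors("P1")))  # successors only: ['CL:0', 'Species:1']
print(dict(grouped["in"]))          # predecessors: {'STRING_INTERACTION': {'P2'}}
```

On the full graph, the same helper could be called as `neighbors_by_edge_label(G, protein_id)` to separate interaction partners from cluster and species memberships.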


### 9.2 Extracting a PPI Subgraph

- Creates a subgraph containing **only** protein-protein interaction edges (both physical and functional), selected by their edge `label`.  
- This subgraph is a convenient starting point for standard network analysis (centralities, community detection, etc.).

In [38]:
# Edge labels that count as protein-protein interactions
ppi_labels = {
    'STRING_INTERACTION', 'STRING_INTERACTION_DETAILED', 'STRING_INTERACTION_FULL',
    'PHYSICAL_INTERACTION', 'PHYSICAL_INTERACTION_DETAILED', 'PHYSICAL_INTERACTION_FULL',
}
ppi_edges = [(u, v, k) for u, v, k in G.edges(keys=True)
             if G[u][v][k].get('label', '') in ppi_labels]

PPI = G.edge_subgraph(ppi_edges).copy()
print("PPI subgraph nodes:", PPI.number_of_nodes())
print("PPI subgraph edges:", PPI.number_of_edges())

PPI subgraph nodes: 4023
PPI subgraph edges: 4382880
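
The label-based filtering above can be wrapped in a small reusable helper. The sketch below is one possible way to do it, shown on a toy graph. The function name `extract_edges_by_label` and the `IN_CLUSTER` label are assumptions for illustration. It reads the label directly from the edge view instead of indexing `G[u][v][k]`.

```python
import networkx as nx

# A subset of the interaction labels used in the cell above
PPI_LABELS = {"STRING_INTERACTION", "PHYSICAL_INTERACTION"}

def extract_edges_by_label(graph, labels):
    """Return an independent subgraph holding only edges whose 'label' is in `labels`."""
    keep = [
        (u, v, k)
        for u, v, k, label in graph.edges(keys=True, data="label", default="")
        if label in labels
    ]
    # edge_subgraph returns a read-only view; copy() makes it a standalone graph
    return graph.edge_subgraph(keep).copy()

toy = nx.MultiDiGraph()
toy.add_edge("P1", "P2", label="STRING_INTERACTION")
toy.add_edge("P1", "P2", label="PHYSICAL_INTERACTION")
toy.add_edge("P1", "CL:0", label="IN_CLUSTER")  # hypothetical non-PPI label
toy.add_node("P3")                              # protein without any interaction

ppi = extract_edges_by_label(toy, PPI_LABELS)
print(sorted(ppi.nodes()), ppi.number_of_edges())  # ['P1', 'P2'] 2
```

Note that `edge_subgraph` keeps only nodes incident to a retained edge, so proteins without any interaction edge (like `P3` here) do not appear in the result.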


In [39]:
degree_sequence = [deg for n,deg in PPI.degree()]
print("Min degree:", min(degree_sequence))
print("Max degree:", max(degree_sequence))

Min degree: 6
Max degree: 13890


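- The cell above reports the minimum and maximum degrees in the PPI subgraph; the cell below lists the top 10 hubs.  
- Because `PPI` is a MultiDiGraph, the degree counts every parallel edge in both directions, which is why the maximum degree (13890) exceeds the number of nodes (4023).  
- Hubs of this kind are often polyketide synthases, fatty acid synthases, or key regulatory proteins in Mycobacterium tuberculosis.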

In [40]:
sorted_by_degree = sorted(PPI.degree(), key=lambda x: x[1], reverse=True)
print("Top 10 hub proteins:")
for n,d in sorted_by_degree[:10]:
    print(n, d, G.nodes[n].get('preferred_name',''), G.nodes[n].get('annotation',''))

Top 10 hub proteins:
83332.Rv2048c 13890 pks12 Polyketide synthase Pks12; Rv2048c, (MTV018.35c), len: 4151 aa. Pks12,polyketide synthase similar to many. Contains 2x PS00012 Phosphopantetheine attachment site, 2x PS00606 Beta-ketoacyl synthases active site, and PS00343 Gram-positive cocci surface proteins 'anchoring' hexapeptide. Nucleotide position 2297976 in the genome sequence has been corrected, G:A resulting in S3004L.
83332.Rv3825c 13482 pks2 Polyketide synthase Pks2; Catalyzes the synthesis of the hepta- and octamethyl phthioceranic acids and/or hydroxyphthioceranic acids that are the major acyl constituents of sulfolipids.
83332.Rv2940c 13452 mas Rv2940c, (MTCY24G1.09, MTCY19H9.08c), len: 2111 aa. Probable mas, mycocerosic acid synthase membrane associated, multifunctional enzyme (see citations below),almost identical to Q02251|MCAS_MYCBO|mas mycocerosic acid synthase from Mycobacterium bovis (2110 aa), FASTA scores: opt: 13226, E(): 0, (95.8% identity in 2115 aa overlap) (see 
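
The degrees above count parallel and reciprocal edges separately. If the number of distinct interaction partners is of interest instead, one option is to collapse the multigraph into a simple undirected graph first. This is a minimal sketch on a toy graph, and the helper name `distinct_partner_degree` is hypothetical.

```python
import networkx as nx

def distinct_partner_degree(multigraph):
    """Collapse direction and parallel edges, then count distinct neighbors per node."""
    simple = nx.Graph()
    simple.add_nodes_from(multigraph.nodes())
    # Repeated (u, v) pairs and reversed pairs collapse into a single undirected edge
    simple.add_edges_from((u, v) for u, v in multigraph.edges() if u != v)
    return dict(simple.degree())

toy = nx.MultiDiGraph()
toy.add_edge("A", "B", label="STRING_INTERACTION")
toy.add_edge("B", "A", label="STRING_INTERACTION")
toy.add_edge("A", "B", label="PHYSICAL_INTERACTION")
toy.add_edge("A", "C", label="STRING_INTERACTION")

print(dict(toy.degree()))            # {'A': 4, 'B': 3, 'C': 1}
print(distinct_partner_degree(toy))  # {'A': 2, 'B': 1, 'C': 1}
```

Applied to the notebook's graph, `distinct_partner_degree(PPI)` would give hub rankings by number of partners rather than by number of edge records.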

### 9.3 Orthologous Groups and Terms

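- Defines a function that retrieves all orthologous groups (OGs) for a given protein by following its outgoing `BELONGS_TO_OG` edges.  
- Then, for each OG, collects the other proteins that also belong to that OG.  
- This helps reveal evolutionary and functional relationships.  
- The output of the first call below was recorded with `protein_id = "83332.Rv0001"`. An OG is listed once per parallel edge, so the same ID can appear several times.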

In [41]:
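def get_orthologous_groups_for_protein(protein):
    """Return the OG nodes reached from `protein` via BELONGS_TO_OG edges.

    An OG appears once per parallel edge, so the list may contain duplicates.
    """
    og_nodes = []
    for u, v, k in G.out_edges(protein, keys=True):
        if G[u][v][k].get('label') == 'BELONGS_TO_OG':
            og_nodes.append(v)
    return og_nodes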

In [42]:
ogs = get_orthologous_groups_for_protein(protein_id)
print(f"Orthologous groups for {protein_id}:", ogs)

Orthologous groups for 83332.Rv0001: ['OG:COG0593', 'OG:COG0593', 'OG:COG0593', 'OG:NOG117348', 'OG:NOG002203', 'OG:NOG001178', 'OG:NOG001281', 'OG:NOG001814', 'OG:NOG047772', 'OG:NOG103400']


In [53]:
other_proteins = set()
for og in ogs:
    for u,v,k in G.in_edges(og, keys=True):
        if G[u][v][k].get('label') == 'BELONGS_TO_OG' and G.nodes[u].get('label') == 'Protein':
            other_proteins.add(u)

print("Other proteins in these orthologous groups:")
for p in list(other_proteins)[:20]:
    print(p, G.nodes[p].get('preferred_name','No name'), G.nodes[p].get('annotation','No annotation'))

Other proteins in these orthologous groups:
83332.Rv1013 pks16 Rv1013, (MTCI237.30-MTCY10G2.36c), len: 544 aa. Putative pks16, polyketide synthase, similar to many e.g. N-terminus of Q50857|U24657 saframycin MX1 synthetase B (1770 aa), FASTA scores: opt: 526, E(): 1.4e-25, (29.3% identity in 542 aa overlap); etc. Contains PS00455 Putative AMP-binding domain signature. Belongs to the ATP-dependent AMP-binding enzyme family.
83332.Rv0969 ctpV Probable metal cation transporter P-type ATPase CtpV; Necessary for copper homeostasis and likely functions as a copper exporter. Also required for full virulence.
83332.Rv1550 fadD11 Rv1550, (MTCY48.15c), len: 571 aa. Probable fadD11,fatty-acid-CoA synthetase, similar, except in N-terminus,to many e.g. SC6A5.39|T35430 probable long-chain-fatty-acid--CoA ligase from Streptomyces coelicolor (612 aa); NP_301672.1|NC_002677 putative long-chain-fatty-acid-CoA ligase from Mycobacterium leprae (600 aa); P44446|LCFH_HAEIN putative long-chain-fatty-acid-CoA
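
The two steps above (find the OGs of a protein, then find the other members of each OG) can be combined into one helper that also removes the duplicates caused by parallel edges. The sketch below runs on a toy graph. The function name `og_comembers` is hypothetical, and unlike the notebook cell it does not check that members carry the `Protein` node label.

```python
import networkx as nx

def og_comembers(graph, protein, og_label="BELONGS_TO_OG"):
    """Map each orthologous group of `protein` to the other members of that group."""
    # dict.fromkeys drops duplicates from parallel edges while keeping first-seen order
    ogs = dict.fromkeys(
        v for _, v, data in graph.out_edges(protein, data=True)
        if data.get("label") == og_label
    )
    result = {}
    for og in ogs:
        members = {
            u for u, _, data in graph.in_edges(og, data=True)
            if data.get("label") == og_label and u != protein
        }
        result[og] = sorted(members)
    return result

toy = nx.MultiDiGraph()
toy.add_edge("P1", "OG:X", label="BELONGS_TO_OG")
toy.add_edge("P1", "OG:X", label="BELONGS_TO_OG")  # parallel edge, same group
toy.add_edge("P2", "OG:X", label="BELONGS_TO_OG")
toy.add_edge("P1", "OG:Y", label="BELONGS_TO_OG")

print(og_comembers(toy, "P1"))  # {'OG:X': ['P2'], 'OG:Y': []}
```

Keeping the result keyed by OG makes it visible which group each co-member comes from, which the flat `other_proteins` set above does not preserve.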

- Retrieves any enrichment terms (e.g., GO, KEGG, annotated keywords) attached to the protein through `HAS_TERM` edges.  
- This gives a quick overview of the protein's functional annotations.

In [54]:
def get_terms_for_protein(protein):
    terms = []
    for u,v,k in G.out_edges(protein, keys=True):
        if G[u][v][k].get('label') == 'HAS_TERM':
            terms.append(v)
    return terms

In [55]:
p_terms = get_terms_for_protein(protein_id)
print(f"Terms for {protein_id}:")
for t in p_terms:
    print(t, G.nodes[t].get('description',''))

Terms for 83332.Rv0014c:
Term:Annotated Keywords (UniProt):KW-0007 Acetylation
Term:Annotated Keywords (UniProt):KW-0067 ATP-binding
Term:Annotated Keywords (UniProt):KW-0418 Kinase
Term:Annotated Keywords (UniProt):KW-0460 Magnesium
Term:Annotated Keywords (UniProt):KW-0472 Membrane
Term:Annotated Keywords (UniProt):KW-0479 Metal-binding
Term:Annotated Keywords (UniProt):KW-0547 Nucleotide-binding
Term:Annotated Keywords (UniProt):KW-0597 Phosphoprotein
Term:Annotated Keywords (UniProt):KW-0677 Repeat
Term:Annotated Keywords (UniProt):KW-0723 Serine/threonine-protein kinase
Term:Annotated Keywords (UniProt):KW-0808 Transferase
Term:Annotated Keywords (UniProt):KW-0812 Transmembrane
Term:Annotated Keywords (UniProt):KW-0843 Virulence
Term:Annotated Keywords (UniProt):KW-1003 Cell membrane
Term:Annotated Keywords (UniProt):KW-1133 Transmembrane helix
Term:Biological Process (Gene Ontology):GO:0000270 Peptidoglycan metabolic process
Term:Biological Process (Gene Ontology):GO:0006022 Amin
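
The term list above is long and mixes several categories. Judging from the output, term node IDs appear to follow the layout `Term:<category>:<term id>`; under that assumption, the terms can be grouped by category. This is a sketch on a toy graph, and the helper name `terms_by_category` is hypothetical.

```python
import networkx as nx
from collections import defaultdict

def terms_by_category(graph, protein):
    """Group the descriptions of a protein's HAS_TERM targets by term category."""
    grouped = defaultdict(list)
    for _, term, data in graph.out_edges(protein, data=True):
        if data.get("label") != "HAS_TERM":
            continue
        # Assumed ID layout: "Term:<category>:<term id>"; the term id may contain ":"
        parts = term.split(":", 2)
        category = parts[1] if len(parts) == 3 else "Unknown"
        grouped[category].append(graph.nodes[term].get("description", ""))
    return dict(grouped)

kw = "Term:Annotated Keywords (UniProt):KW-0418"
go = "Term:Biological Process (Gene Ontology):GO:0000270"

toy = nx.MultiDiGraph()
toy.add_node(kw, description="Kinase")
toy.add_node(go, description="Peptidoglycan metabolic process")
toy.add_edge("P1", kw, label="HAS_TERM")
toy.add_edge("P1", go, label="HAS_TERM")

for category, descriptions in terms_by_category(toy, "P1").items():
    print(category, "->", descriptions)
```

Splitting with a maximum of two splits keeps identifiers such as `GO:0000270` intact even though they contain a colon themselves.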

### 9.4 Cluster Hierarchy

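- Follows the `PARENT_OF` edges from a cluster up to the root to collect all of its ancestor clusters.  
- Demonstrates how to navigate the cluster tree.  
- A similar helper follows incoming `CHILD_OF` edges to list the child species of a species node.  
- The last cell counts nodes per `label`; nodes without a `label` attribute are reported as `Unknown`.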

In [56]:
def get_cluster_lineage(cluster_id):
    """Return [cluster_id, parent, grandparent, ...] up to the root cluster."""
    lineage = [cluster_id]
    visited = {cluster_id}
    current = cluster_id
    # Edges labeled 'PARENT_OF' are traversed here as child -> parent:
    # for an edge (u, v), v is treated as the parent of u.
    while True:
        parents = [v for u, v, k in G.out_edges(current, keys=True)
                   if G[u][v][k].get('label') == 'PARENT_OF']
        # Stop at the root, or if a cycle would revisit a cluster
        if not parents or parents[0] in visited:
            break
        # Take the first parent (assuming a single parent per cluster)
        current = parents[0]
        visited.add(current)
        lineage.append(current)
    return lineage

In [57]:
some_cluster = "CL:7984"
lineage = get_cluster_lineage(some_cluster)
print(f"Lineage for {some_cluster}:", lineage)

Lineage for CL:7984: ['CL:7984', 'CL:7947', 'CL:26', 'CL:23', 'CL:21', 'CL:16', 'CL:12', 'CL:10', 'CL:5', 'CL:0']
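
The lineage walk goes from a cluster up to the root. The opposite direction, collecting every cluster below a given one, can be sketched as a breadth-first search over incoming edges. This assumes, as the lineage output suggests, that `PARENT_OF` edges run from child to parent. The function name `get_cluster_descendants` and the `IN_CLUSTER` label are hypothetical.

```python
import networkx as nx
from collections import deque

def get_cluster_descendants(graph, cluster_id, label="PARENT_OF"):
    """Collect every cluster below `cluster_id`, assuming edges run child -> parent."""
    seen = {cluster_id}
    queue = deque([cluster_id])
    descendants = []
    while queue:
        current = queue.popleft()
        for child, _, data in graph.in_edges(current, data=True):
            # The `seen` set guards against cycles and repeated visits
            if data.get("label") == label and child not in seen:
                seen.add(child)
                descendants.append(child)
                queue.append(child)
    return descendants

toy = nx.MultiDiGraph()
toy.add_edge("CL:2", "CL:1", label="PARENT_OF")
toy.add_edge("CL:3", "CL:1", label="PARENT_OF")
toy.add_edge("CL:1", "CL:0", label="PARENT_OF")
toy.add_edge("P1", "CL:1", label="IN_CLUSTER")  # hypothetical label, ignored by the search

print(get_cluster_descendants(toy, "CL:0"))  # ['CL:1', 'CL:2', 'CL:3']
```

Filtering on the edge label matters here because cluster nodes in the full graph also receive edges from proteins, which should not be reported as sub-clusters.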


In [58]:
def get_child_species(species_node):
    children = []
    for u,v,k in G.in_edges(species_node, keys=True):
        if G[u][v][k].get('label') == 'CHILD_OF':
            # u is child species
            children.append(u)
    return children

In [59]:
some_species = "Species:83332"
child_species = get_child_species(some_species)
print("Child species of 83332:", child_species)

Child species of 83332: ['Species:747078', 'Species:757417', 'Species:757418', 'Species:757419', 'Species:757420', 'Species:757421', 'Species:757422', 'Species:1404322', 'Species:1437856']


In [60]:
node_data = []
for n in G.nodes():
    node_data.append((n, G.nodes[n].get('label','Unknown')))

nodes_df = pd.DataFrame(node_data, columns=['node','label'])
print(nodes_df['label'].value_counts())

Unknown             2199643
OrthologousGroup     614554
Alias                 91922
Species               12535
Term                  10887
Protein                4026
Cluster                1001
Name: label, dtype: int64
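
The large `Unknown` count means that most nodes carry no `label` attribute. One plausible cause, stated here as an assumption about how the graph was built, is that these nodes were created implicitly by `add_edge`, which adds missing endpoints without attributes. The toy sketch below shows that behavior and a way to list such nodes; the `HAS_ALIAS` label is hypothetical.

```python
import networkx as nx
from collections import Counter

toy = nx.MultiDiGraph()
toy.add_node("P1", label="Protein")
# "Alias:x" does not exist yet, so add_edge creates it with no attributes
toy.add_edge("P1", "Alias:x", label="HAS_ALIAS")  # hypothetical edge label

# Same count as the pandas version above, without building a DataFrame
counts = Counter(label for _, label in toy.nodes(data="label", default="Unknown"))
unlabeled = [n for n, label in toy.nodes(data="label") if label is None]

print(counts)     # one 'Protein' node and one 'Unknown' node
print(unlabeled)  # ['Alias:x']
```

Inspecting a few entries of `unlabeled` on the full graph, for example by their ID prefix, would show which node types are missing a `label`.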


## 10. Graph Save

In [32]:
nx.write_graphml(G, "string_kg.graphml")

print("Knowledge Graph construction complete!")

Knowledge Graph construction complete!
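
GraphML stores only simple attribute types, and to the best of our knowledge `nx.write_graphml` raises an error for values such as `None` or lists. The sketch below cleans attributes before writing and reloads the file to check the round trip. The helper name `sanitize_for_graphml`, the `score` and `aliases` attributes, and the file name `toy_kg.graphml` are all hypothetical.

```python
import os
import tempfile
import networkx as nx

def sanitize_for_graphml(graph):
    """Return a copy whose attribute values are all str, int, float, or bool."""
    clean = graph.copy()  # attribute dicts are copied, so `graph` stays untouched

    def fix(attrs):
        for key, value in list(attrs.items()):
            if value is None:
                del attrs[key]            # drop missing values
            elif not isinstance(value, (str, int, float, bool)):
                attrs[key] = str(value)   # e.g. lists become their string form

    for _, attrs in clean.nodes(data=True):
        fix(attrs)
    for _, _, attrs in clean.edges(data=True):
        fix(attrs)
    return clean

toy = nx.MultiDiGraph()
toy.add_node("P1", label="Protein", protein_size=528, annotation=None)
toy.add_node("P2", label="Protein", aliases=["a", "b"])
toy.add_edge("P1", "P2", label="STRING_INTERACTION", score=900)
toy.add_edge("P1", "P2", label="PHYSICAL_INTERACTION", score=700)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_kg.graphml")
    nx.write_graphml(sanitize_for_graphml(toy), path)
    reloaded = nx.read_graphml(path)

print(reloaded.number_of_nodes(), reloaded.number_of_edges())  # 2 2
print(reloaded.nodes["P1"])
```

Reloading the file once after saving is a cheap check that node and edge counts survive the export.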


## Summary

1. We installed required libraries and downloaded raw STRING data.  
2. We extracted the data and loaded them into **pandas** DataFrames.  
3. We created a **NetworkX MultiDiGraph** and inserted nodes/edges for **proteins, species, clusters, orthologous groups, physical and functional interactions, etc.**  
4. We computed **basic graph statistics** (counts, label distribution, top hub proteins) and explored protein neighbors, orthologous groups, terms, and the cluster hierarchy.  
5. We **exported** the final integrated knowledge graph as `string_kg.graphml`.  