## Purpose of the Analysis

1. Identify proteins that meet the following criteria:

 - restrict dengue virus replication, as shown by siRNA inhibition*
 AND
 - are at least one of:
        - modulated in protein abundance
        - modulated in RNA abundance
        - interacting with the dengue virus
        
2. Identify systems of interacting proteins enriched in those that affect dengue replication

* \*The higher the Z-score, the more strongly these factors negatively affect dengue virus replication.

3. Characterize the systems:
 - Name them by top hits from gProfiler enrichment
 - Name them using GPT-4
 - Analyze the systems with GPT-4
     - General analysis
     - Specific analyses in which data associated with the proteins is provided to the model. This will take experimentation and is probably most tractable for small systems. We can try asking specifically for an interpretation of each system's role in the response to dengue infection.

4. Other points:
 - How should we interpret the 24 vs 48 hour data?
 - What visualization strategies will work best?
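
The selection in item 1 can be sketched as a boolean filter. This is only a minimal sketch: the column names are taken from the merged table shown later in this notebook, the toy rows are invented for illustration, and `min_zscore` is a hypothetical cutoff (it may be unnecessary if `has_siRNA` already encodes the threshold).

```python
import pandas as pd

# Flag columns as they appear in the merged table; 1 means "significant in that dataset".
EVIDENCE_COLUMNS = [
    "has_protein_24hr", "has_protein_48hr",
    "has_rnaSeq_24hr", "has_rnaSeq_48hr",
    "viral_interaction",
]

def select_goal_proteins(df, min_zscore=0.0):
    """Return rows that are siRNA hits AND have at least one other line of evidence.

    min_zscore is a hypothetical cutoff used only for illustration.
    """
    is_sirna_hit = (df["has_siRNA"] == 1) & (df["siRNA_Screen_Average_Zscore"] > min_zscore)
    has_other_evidence = df[EVIDENCE_COLUMNS].eq(1).any(axis=1)
    return df[is_sirna_hit & has_other_evidence]

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "GeneID": [1, 2, 3, 4],
    "has_siRNA": [1, 1, 0, 1],
    "siRNA_Screen_Average_Zscore": [2.5, 3.0, float("nan"), -1.0],
    "has_protein_24hr": [1, 0, 1, 0],
    "has_protein_48hr": [0, 0, 1, 0],
    "has_rnaSeq_24hr": [0, 0, 0, 1],
    "has_rnaSeq_48hr": [0, 0, 0, 0],
    "viral_interaction": [0, 0, 0, 0],
})
goal = select_goal_proteins(toy)
print(goal["GeneID"].tolist())
```

Only the first toy row passes: it is an siRNA hit with a positive Z-score and also changes in protein abundance at 24h.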


## Data Loading
 - Used ChatGPT data analysis to merge the sheets in the original Excel file provided by Laura
     - Added a boolean column "binds_viral_protein" based on the viral interaction data
     - Added boolean columns indicating whether the protein had significant values for each of the measurements
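
For reference, the merge could also be reproduced directly in pandas. The sketch below is not the procedure ChatGPT actually ran; it assumes that presence in an already-thresholded sheet is what "significant" means, and the helper name, sheet names, and toy values are hypothetical.

```python
import pandas as pd

def merge_on_gene_id(sheets):
    """Outer-join the sheets on GeneID and add one has_<name> flag per sheet."""
    merged = None
    for name, sheet in sheets.items():
        # Simplification: keep one row per GeneID so the join does not multiply rows.
        flagged = sheet.drop_duplicates("GeneID").assign(**{f"has_{name}": 1})
        merged = flagged if merged is None else merged.merge(flagged, on="GeneID", how="outer")
    flag_cols = [f"has_{name}" for name in sheets]
    # A missing flag after the outer join means the gene was absent from that sheet.
    merged[flag_cols] = merged[flag_cols].fillna(0).astype(int)
    return merged.sort_values("GeneID").reset_index(drop=True)

# Toy sheets, invented for illustration only.
toy_sheets = {
    "siRNA": pd.DataFrame({"GeneID": [1, 2], "zscore": [2.1, 3.4]}),
    "protein_24hr": pd.DataFrame({"GeneID": [2, 3], "log2fc_protein": [1.5, -0.8]}),
}
merged = merge_on_gene_id(toy_sheets)
print(merged)
```

Dropping duplicate GeneIDs is a deliberate simplification here; sheets with several rows per gene would need an explicit aggregation rule first.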

## Hierarchical Model Building
 - Made a network of interactions between the proteins: STRING HC
     - Next step - add dengue AP-MS
 - Made a hierarchical model from the subnetwork interconnecting the proteins of interest
 - Named the model systems by enrichment
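
As an illustration of the subnetwork step, the sketch below keeps only edges whose endpoints are both proteins of interest. It assumes that "subnetwork interconnecting the proteins" means the induced subgraph (no connector nodes are added); the function name and the placeholder node names are hypothetical, and no real STRING edges are shown.

```python
def induced_subnetwork(edges, proteins_of_interest):
    """Keep only edges whose two endpoints are both proteins of interest."""
    keep = set(proteins_of_interest)
    return [(a, b) for a, b in edges if a in keep and b in keep]

# Placeholder edge list, invented for illustration only.
toy_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "D")]
sub_edges = induced_subnetwork(toy_edges, ["A", "B", "C"])
print(sub_edges)
```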
 
## TODO
 - Name some or all of the model systems by LLM
 - Identify goal proteins
 - Identify systems enriched with goal proteins
 - Style the model and the interactome to highlight the data
 - Generate reports
 - Generate figures
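
For the "systems enriched with goal proteins" item, one common choice is a one-sided hypergeometric test. The sketch below is an assumption about how this could be done, not a method already chosen for this analysis; the function name and the example counts are hypothetical.

```python
from math import comb

def hypergeom_enrichment_pvalue(n_background, n_goal, system_size, n_overlap):
    """P(X >= n_overlap) when system_size proteins are drawn without replacement
    from n_background proteins, of which n_goal are goal proteins."""
    total = comb(n_background, system_size)
    upper = min(n_goal, system_size)
    tail = sum(
        comb(n_goal, k) * comb(n_background - n_goal, system_size - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Invented counts: 10 background proteins, 3 goal proteins,
# a system of 4 proteins containing 2 goal proteins.
p_value = hypergeom_enrichment_pvalue(10, 3, 4, 2)
print(p_value)
```

With many systems tested at once, the resulting p-values would still need a multiple-testing correction.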
 





## Experimental Protocol
We took primary human dendritic cells, infected them with dengue virus (serotype 3), and subjected these cells to one of the following:
 - siRNA screening to identify human host factors that act to restrict viral replication.
 - Proteomics (Protein Abundance) to look at human proteins that change in abundance following infection. 
   - This was done at 24h and 48h post infection. (Jeff Johnson, Krogan lab)
 - RNAseq to look at cellular mRNAs that are differentially expressed following infection.
   - This was done at 24h and 48h post infection. (Stephen Wolinski, NWU)
   
We also include host-dengue PPI data from Shah et al., Cell 2018. 
Priya Shah was at the Krogan lab when these data were collected.

The datasets are already thresholded based on maximum overlap and significance, and all use GeneID as a common identifier. 

20220202_Thresholded_Datasets_DHIPC.xlsx

Shah et al., Dengue Interactions




## Get the UniProt and HGNC IDs based on the NCBI Gene IDs

Load the unified dengue spreadsheet (Created using ChatGPT data analysis mode)

Add two columns, one for UniProt and one for HGNC.

MyGene.info did not find two of the gene IDs, 441956 and 45338. Not sure yet what the story is for these two.

In [29]:
import requests
import pandas as pd

# Fetch the UniProt IDs and HGNC symbols for a list of NCBI GeneIDs using the MyGene.info API
def fetch_uniprot_ids_and_hgnc_via_mygene(gene_ids):
    url = "https://mygene.info/v3/query"
    headers = {"content-type": "application/x-www-form-urlencoded"}
    data = {
        "q": ",".join(map(str, gene_ids)),
        "species": "human",
        "fields": "uniprot,symbol"
    }
    
    response = requests.post(url, headers=headers, data=data)
    response.raise_for_status()  # fail loudly on HTTP errors rather than parsing an error payload
    results = response.json()

    # Extract UniProt IDs and HGNC symbols from the results
    uniprot_ids = {}
    hgnc_symbols = {}
    for hit in results:
        if "_id" in hit:
            #print(hit)
            gene_id = hit["_id"]
            uniprot_id = hit.get("uniprot", {}).get("Swiss-Prot")
            hgnc_symbol = hit.get("symbol")
            uniprot_ids[gene_id] = uniprot_id
            hgnc_symbols[gene_id] = hgnc_symbol
        else:
            print(hit)
            print(f'gene id {hit.get("query")} was not found')

    return uniprot_ids, hgnc_symbols

# Read the input data (replace with your file path)
df = pd.read_excel("dengue_09162023.xlsx", sheet_name="Sheet1")

# Fetch UniProt IDs and HGNC symbols for all GeneIDs using MyGene.info (one POST request; no batching is done here)
gene_ids_df = df["GeneID"].astype(str)
gene_ids = gene_ids_df.tolist()
print(gene_ids[0])
uniprot_ids, hgnc_symbols = fetch_uniprot_ids_and_hgnc_via_mygene(gene_ids)
print(hgnc_symbols.get("3437"))
print(gene_ids_df.map(hgnc_symbols))
df["UniProtID"] = gene_ids_df.map(uniprot_ids)
df["HGNC"] = gene_ids_df.map(hgnc_symbols)

# Save the updated data (replace with your desired output file path)
df.to_excel("dengue_with_uniprot.xlsx", index=False)
df.head()  # Display the first few rows to verify the results


3437
{'query': '441956', 'notfound': True}
gene id 441956 was not found
{'query': '45338', 'notfound': True}
gene id 45338 was not found
... (the two lines above repeat for gene id 45338; output truncated)

Unnamed: 0,GeneID,UniprotID,GeneSymbol,siRNA_GeneSymbol,DV3_24h-Mock_24h,DV3_48h-Mock_48h,siRNA_Screen_Average_Zscore,log2FC,Condition,GeneSymbol_48hpi,...,dengue_MiST_list,PPI_GeneSymbol,viral_interaction,has_siRNA,has_protein_24hr,has_protein_48hr,has_rnaSeq_24hr,has_rnaSeq_48hr,UniProtID,HGNC
0,3437,O14879,IFIT3,,3.494644,5.698257,,8.197744,24h,IFIT3,...,,,0,0,1,1,1,1,O14879,IFIT3
1,9246,O14933,,,,2.50438,,,,,...,,,0,0,0,1,0,0,O14933,UBE2L6
2,8672,O43432,,,,-1.897606,,,,,...,,,0,0,0,1,0,0,O43432,EIF4G3
3,63931,O60783,,,,0.707202,,,,,...,,,0,0,0,1,0,0,O60783,MRPS14
4,26043,O94888,,,,-2.721373,,,,,...,,,0,0,0,1,0,0,O94888,UBXN7
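
Two details of the lookup above may be worth guarding against. The repeated "not found" lines most likely arise because the same GeneID is sent once per spreadsheet row, and, to my understanding, MyGene.info caps the number of terms per POST query (1000 at the time of writing) and can return `uniprot.Swiss-Prot` as either a string or a list. The helpers below are a hedged sketch with hypothetical names, not part of the original cell.

```python
def unique_in_order(ids):
    """Drop repeated IDs while keeping first-seen order."""
    return list(dict.fromkeys(ids))

def chunked(items, size=1000):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def first_swissprot(hit):
    """Return a single Swiss-Prot accession from a MyGene.info hit, or None."""
    value = hit.get("uniprot", {}).get("Swiss-Prot")
    if isinstance(value, list):
        return value[0] if value else None
    return value

# GeneIDs taken from the output above; 45338 is repeated on purpose.
ids = unique_in_order(["3437", "45338", "45338", "9246"])
batches = list(chunked(ids, size=2))
print(batches)
```

Each batch could then be passed to `fetch_uniprot_ids_and_hgnc_via_mygene` and the returned dictionaries combined with `dict.update`.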


In [10]:
def debug_conflicts(data_dict):
    # Define a list of dataframes to process
    df_names = ['siRNA data', 'Pt Abundance', 'RNAseq 24hpi', 'RNAseq 48hpi']

    # Define gene name columns for each dataframe
    gene_name_cols = ['GeneSymbol', None, 'GeneName', 'GeneName']

    # Define keys for each dataframe
    keys = ['GeneID', 'GeneID', 'GeneID', 'GeneID']
    
    # Create a dictionary to hold conflict information
    conflict_dict = {}

    for df_name, key, gene_name in zip(df_names, keys, gene_name_cols):
        df = data_dict[df_name].copy()

        # check if gene name column exists in the dataframe
        if gene_name and gene_name in df.columns:
            df.rename(columns={gene_name: 'gene_symbol'}, inplace=True)
            
            # flag rows whose (GeneID, gene_symbol) pair occurs more than once;
            # note this finds repeated rows, not GeneIDs that map to different symbols
            conflicting_rows = df[df.duplicated(subset=[key, 'gene_symbol'], keep=False)]
            if not conflicting_rows.empty:
                conflict_dict[df_name] = conflicting_rows

    return conflict_dict

conflict_dict = debug_conflicts(data)

conflict_dict

{'RNAseq 24hpi':     gene_symbol    log2FC  GeneID Condition
 182         NaN -5.887024   45338       24h
 183         NaN -2.734256   45338       24h
 184         NaN  4.086980   45338       24h
 185         NaN  2.314898   45338       24h,
 'RNAseq 48hpi':     gene_symbol    log2FC  GeneID Condition
 515         NaN  5.202878   45338       48h
 516         NaN  5.169160   45338       48h
 517         NaN  3.074556   45338       48h
 518         NaN  2.817277   45338       48h
 519         NaN  2.188067   45338       48h}
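
The `duplicated(subset=[key, 'gene_symbol'])` call reports rows whose (GeneID, symbol) pair repeats, which is what surfaced the GeneID 45338 rows above. To find GeneIDs that carry more than one distinct symbol, a groupby is one option. This is a minimal sketch on invented toy data, and the helper name is hypothetical.

```python
import pandas as pd

def find_symbol_conflicts(df, key="GeneID", symbol_col="gene_symbol"):
    """Return rows whose key maps to more than one distinct non-null symbol."""
    n_symbols = df.groupby(key)[symbol_col].transform("nunique")
    return df[n_symbols > 1]

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "GeneID": [10, 10, 20, 20, 30],
    "gene_symbol": ["AAA", "AAB", "BBB", "BBB", None],
})
conflicts = find_symbol_conflicts(toy)
print(conflicts)
```

Rows with a missing symbol, like the 45338 rows, are not counted as conflicts here because `nunique` ignores nulls.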



March 23
Laura: There are not (yet) any approved drugs specific for dengue virus. Ribavirin is used as a treatment, but this is a generic antiviral and is not dengue specific. The promising small molecule JNJ-A07 is currently under investigation but it targets the virus directly and not the host cell. For host-directed targets that result in inhibition of dengue virus replication, some of them are listed in the article (please see Table 2): https://www.sciencedirect.com/science/article/pii/S1879625720300523  Hope this helps!
 
Here is some keyword advice from ChatGPT:
 
Here are some suggested keywords that you can search for on Google and PubMed if you are researching mechanisms of dengue infection:
 - Dengue fever
 - Dengue virus
 - Viral replication
 - Pathogenesis
 - Host-pathogen interaction
 - Immune response
 - Dengue infection and cytokines
 - Mosquito-borne diseases
 - Viral entry
 - Endocytosis
 - Viral receptors
 - Innate immune response
 - Adaptive immune response
 - Virus evasion strategies
 - Dengue vaccine development

You can also use specific gene or protein names that are known to be involved in dengue infection, such as NS1, NS2A, NS2B, NS3, NS4A, NS4B, and NS5. Additionally, you can try combining these keywords with terms such as "molecular mechanisms", "pathophysiology", "epidemiology", "clinical features", and "diagnosis", depending on your research focus.