## Goal Proteins

 - AND
     - negatively affect dengue virus replication when inhibited by siRNA*
     - OR
        - are modulated in protein abundance
        - are modulated in RNA abundance
        - interact with the dengue virus

* \*The higher the Z-score, the more strongly these factors negatively affect dengue virus replication.
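
The AND/OR rule above can be sketched as plain set logic. This is a minimal illustration, assuming each dataset has already been reduced to a set of GeneIDs that passed its threshold; the function name `goal_proteins` and the toy GeneIDs are hypothetical and not part of the analysis.

```python
def goal_proteins(sirna_hits, protein_abundance_hits, rna_abundance_hits, viral_interactors):
    """Return GeneIDs that are siRNA hits AND have at least one other line of evidence."""
    # OR branch: modulated protein abundance, modulated RNA abundance, or viral interaction
    other_evidence = (
        set(protein_abundance_hits) | set(rna_abundance_hits) | set(viral_interactors)
    )
    # AND branch: must also be an siRNA hit
    return set(sirna_hits) & other_evidence


# Toy GeneIDs, invented for illustration only
sirna = {101, 102, 103}
protein = {102}
rna = {999}
interactors = {103, 104}

print(sorted(goal_proteins(sirna, protein, rna, interactors)))  # prints [102, 103]
```

GeneID 101 is excluded because it has no second line of evidence, and 104 is excluded because it is not an siRNA hit.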

## Plan
 - Define rules implementing goal criteria
 - Separate the interaction data from the protein properties data
 - Combine the data sources to get a single protein table
 - Make a network of interactions between the proteins: STRING HC + dengue AP-MS
 - Annotate the network with the protein data, with properties including:
     - Whether the host protein binds a viral protein
     - A list of viral proteins that bind the protein
     - Whether it meets any of the interesting criteria: a protein of interest
     - Whether it is a goal protein
 - Make a hierarchical model from the subnetwork interconnecting the proteins of interest
 - Name the model systems by enrichment
 - Name some or all of the model systems by LLM
 - Identify goal proteins
 - Identify systems enriched with goal proteins
 - Style the model and the interactome to highlight the data
 - Generate reports
 - Generate figures
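
The annotation step in the plan can be sketched without committing to a particular graph library. The example below is only an illustration under stated assumptions: `annotate_proteins` is a hypothetical helper, the host GeneIDs are invented, and viral-host interactions are assumed to arrive as `(viral_protein, host_gene_id)` pairs.

```python
from collections import defaultdict


def annotate_proteins(host_ids, viral_edges, sirna_hits, abundance_hits, rna_hits):
    """Build per-protein annotations matching the properties listed in the plan."""
    # Collect the viral proteins that bind each host protein
    viral_partners = defaultdict(list)
    for viral_protein, host_id in viral_edges:
        viral_partners[host_id].append(viral_protein)

    annotations = {}
    for host_id in host_ids:
        partners = sorted(viral_partners.get(host_id, []))
        binds_virus = bool(partners)
        other_evidence = binds_virus or host_id in abundance_hits or host_id in rna_hits
        annotations[host_id] = {
            "binds_viral_protein": binds_virus,
            "viral_partners": partners,
            # any single criterion makes a protein "of interest"
            "protein_of_interest": host_id in sirna_hits or other_evidence,
            # goal proteins need the siRNA hit plus one other criterion
            "goal_protein": host_id in sirna_hits and other_evidence,
        }
    return annotations


# Toy input, invented for illustration only
annotations = annotate_proteins(
    host_ids=[10, 20, 30],
    viral_edges=[("NS5", 10), ("NS1", 10)],
    sirna_hits={10, 20},
    abundance_hits=set(),
    rna_hits={30},
)
print(annotations[10])
```

The resulting dictionaries could then be attached as node attributes in whichever network representation is used downstream.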
 





## Experimental Protocol
We took primary human dendritic cells, infected them with dengue virus (serotype 3), and subjected these cells to:
 - siRNA screening to identify human host factors that act to restrict viral replication.
 - Proteomics (Protein Abundance) to look at human proteins that change in abundance following infection.
   - This was done at 24h and 48h post infection. (Jeff Johnson, Krogan lab)
 - RNAseq to look at cellular mRNAs that are differentially expressed following infection.
   - This was done at 24h and 48h post infection. (Stephen Wolinski, NWU)

We also include host-dengue PPI data from Shah et al. (Cell, 2018).
Priya Shah was at the Krogan lab when this data was collected.

The datasets are already thresholded based on maximum overlap and significance, and all have GeneID as a common identifier.

20220202_Thresholded_Datasets_DHIPC.xlsx

Shah et al., Dengue Interactions




## Load Data


prompt: write a function load_dengue_data to load an excel spreadsheet as a dict of dataframes corresponding to the tabs in the spreadsheet

In [14]:
import pandas as pd

def load_dengue_data(file_path):
    # Create an ExcelFile object
    xls = pd.ExcelFile(file_path)
    
    # Get the sheet names
    sheet_names = xls.sheet_names
    
    # Initialize a dictionary to store the dataframes
    dengue_data = {}

    # Loop over the sheet names and read each into a dataframe
    for sheet in sheet_names:
        dengue_data[sheet] = pd.read_excel(xls, sheet_name=sheet)
    
    return dengue_data

data = load_dengue_data('20220202_Thresholded_Datasets_DHIPC.xlsx')

data

{'siRNA data':      GeneName  GeneID  Average_Zscore
 0       A2LD1   87769        1.588549
 1       ABCC2    1244        1.209934
 2       ABCG2    9429        1.293657
 3       ACACB      32        1.126954
 4      ACAS2L   84532        1.994937
 ...       ...     ...             ...
 1058   ZNF704  619279        1.356330
 1059   ZNF708    7562        1.527764
 1060    ZNRF3   84133        1.363028
 1061   ZSIG11   51368        1.436758
 1062   ZWILCH   55055        1.406935
 
 [1063 rows x 3 columns],
 'Pt Abundance':     Uniprot ID         Timepoint    log2FC  GeneID
 0       Q86TX2  DV3_24h-Mock_24h  1.608293  641371
 1       Q86YS6  DV3_24h-Mock_24h  1.346686  339122
 2       Q8NFC6  DV3_48h-Mock_48h -1.871359  259282
 3       Q96FN4  DV3_24h-Mock_24h  1.548070  221184
 4       Q8IVG5  DV3_24h-Mock_24h  1.172951  219285
 ..         ...               ...       ...     ...
 114     P16671  DV3_48h-Mock_48h -1.209983     948
 115     P50454  DV3_48h-Mock_48h  1.071615     871
 116  

## Merge the RNA Data

Prompt: write a function merge_dengue_data to operate on the data dict to merge these data frames : 
'siRNA data' - columns:  GeneName  GeneID  Average_Zscore
'RNAseq 24hpi' - columns:  GeneName    log2FC  GeneID Condition
'RNAseq 48hpi' - columns:  GeneName    log2FC  GeneID Condition

Merge the dataframes using GeneID as the key
Make the column names unique by prepending abbreviations for the dataframe names

In [12]:
import pandas as pd

def merge_dengue_data(data_dict):
    # Define a list of dataframes to merge
    df_names = ['siRNA data', 'Pt Abundance', 'RNAseq 24hpi', 'RNAseq 48hpi']
    
    # Define abbreviations for the dataframe names
    abbreviations = ['siRNA', 'PtAb', 'RNA24', 'RNA48']
    
    # Define gene name columns for each dataframe
    gene_name_cols = ['GeneSymbol', None, 'GeneName', 'GeneName']
    
    # Define keys for each dataframe
    keys = ['GeneID', 'GeneID', 'GeneID', 'GeneID']
    
    # Start with the first dataframe
    merged_data = data_dict[df_names[0]].copy()
    
    # Check if gene name column exists in the dataframe
    if gene_name_cols[0] in merged_data.columns:
        merged_data.rename(columns={gene_name_cols[0]: 'gene_symbol'}, inplace=True)

    merged_data.columns = [f"{abbreviations[0]}_{col}" if col != keys[0] else col for col in merged_data.columns]
    
    # Merge the remaining dataframes
    for df_name, abbr, key, gene_name in zip(df_names[1:], abbreviations[1:], keys[1:], gene_name_cols[1:]):
        df = data_dict[df_name].copy()

        # check if gene name column exists in the dataframe
        if gene_name and gene_name in df.columns:
            df.rename(columns={gene_name: 'gene_symbol'}, inplace=True)

            # drop the rows where gene_symbol is NaN before merging
            df = df[df['gene_symbol'].notna()]
            
            df.columns = [f"{abbr}_{col}" if col != key else col for col in df.columns]
            merged_data = pd.merge(merged_data, df, how='outer', on=[key, 'gene_symbol'])
        else:
            df.columns = [f"{abbr}_{col}" if col != key else col for col in df.columns]
            merged_data = pd.merge(merged_data, df, how='outer', on=key)

    return merged_data



merged_data = merge_dengue_data(data)

merged_data

KeyError: 'gene_symbol'
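
The `KeyError` above is consistent with two details of `merge_dengue_data`. First, the siRNA sheet printed earlier names its symbol column `GeneName`, not `GeneSymbol`, so it is never renamed to `gene_symbol`. Second, the prefixing step renames every column except the key, so `gene_symbol` becomes, for example, `RNA24_gene_symbol` before the merge asks for a column literally named `gene_symbol`. The sketch below shows one possible fix that leaves both shared columns unprefixed. The names `prefix_columns` and `merge_on_gene_id` are hypothetical, and the toy frames are invented to mirror the column layout only.

```python
import pandas as pd

# Columns shared across tables; these must keep their names to serve as merge keys
SHARED = ("GeneID", "gene_symbol")


def prefix_columns(df, abbr, symbol_col=None):
    """Rename the symbol column to gene_symbol, then prefix every non-shared column."""
    df = df.copy()
    if symbol_col and symbol_col in df.columns:
        df = df.rename(columns={symbol_col: "gene_symbol"})
        # drop rows without a symbol, as the original function does
        df = df[df["gene_symbol"].notna()]
    return df.rename(columns={c: f"{abbr}_{c}" for c in df.columns if c not in SHARED})


def merge_on_gene_id(frames):
    """frames: list of (dataframe, abbreviation, symbol column name or None)."""
    merged = None
    for df, abbr, symbol_col in frames:
        df = prefix_columns(df, abbr, symbol_col)
        if merged is None:
            merged = df
            continue
        # merge on whichever shared columns both sides actually have
        keys = [k for k in SHARED if k in merged.columns and k in df.columns]
        merged = pd.merge(merged, df, how="outer", on=keys)
    return merged


# Toy frames, invented for illustration only
sirna = pd.DataFrame({"GeneName": ["A", "B"], "GeneID": [1, 2], "Average_Zscore": [1.5, 1.2]})
rna24 = pd.DataFrame(
    {"GeneName": ["A", None], "log2FC": [2.0, -1.0], "GeneID": [1, 3], "Condition": ["24h", "24h"]}
)
merged = merge_on_gene_id([(sirna, "siRNA", "GeneName"), (rna24, "RNA24", "GeneName")])
print(merged)
```

Merging on both `GeneID` and `gene_symbol` keeps the original design, but rows will not align when the same GeneID carries different or missing symbols across tables. Merging on `GeneID` alone and reconciling symbols afterwards may be the safer choice.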

In [10]:
def debug_conflicts(data_dict):
    # Define a list of dataframes to process
    df_names = ['siRNA data', 'Pt Abundance', 'RNAseq 24hpi', 'RNAseq 48hpi']

    # Define gene name columns for each dataframe
    gene_name_cols = ['GeneSymbol', None, 'GeneName', 'GeneName']

    # Define keys for each dataframe
    keys = ['GeneID', 'GeneID', 'GeneID', 'GeneID']
    
    # Create a dictionary to hold conflict information
    conflict_dict = {}

    for df_name, key, gene_name in zip(df_names, keys, gene_name_cols):
        df = data_dict[df_name].copy()

        # check if gene name column exists in the dataframe
        if gene_name and gene_name in df.columns:
            df.rename(columns={gene_name: 'gene_symbol'}, inplace=True)
            
            # flag rows whose (GeneID, gene_symbol) pair appears more than once; NaN symbols count as equal
            conflicting_rows = df[df.duplicated(subset=[key, 'gene_symbol'], keep=False)]
            if not conflicting_rows.empty:
                conflict_dict[df_name] = conflicting_rows

    return conflict_dict

conflict_dict = debug_conflicts(data)

conflict_dict

{'RNAseq 24hpi':     gene_symbol    log2FC  GeneID Condition
 182         NaN -5.887024   45338       24h
 183         NaN -2.734256   45338       24h
 184         NaN  4.086980   45338       24h
 185         NaN  2.314898   45338       24h,
 'RNAseq 48hpi':     gene_symbol    log2FC  GeneID Condition
 515         NaN  5.202878   45338       48h
 516         NaN  5.169160   45338       48h
 517         NaN  3.074556   45338       48h
 518         NaN  2.817277   45338       48h
 519         NaN  2.188067   45338       48h}
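
All of the flagged rows have a missing symbol and share GeneID 45338, so they look like unnamed entries rather than naming conflicts. `duplicated` reports repeated (GeneID, symbol) pairs and treats `NaN` values as equal, which is why these rows surface. A minimal sketch that separates the two questions is shown below, assuming the same column layout as the RNAseq sheets; `summarize_symbol_issues` is a hypothetical helper and the toy frame is invented.

```python
import pandas as pd


def summarize_symbol_issues(df, key="GeneID", symbol_col="GeneName"):
    """Report rows with no symbol and GeneIDs that map to more than one symbol."""
    # Rows where the gene symbol is missing
    missing = df[df[symbol_col].isna()]
    # Count distinct non-missing symbols per GeneID
    n_symbols = df.dropna(subset=[symbol_col]).groupby(key)[symbol_col].nunique()
    conflicting_ids = n_symbols[n_symbols > 1].index.tolist()
    return {"missing_symbol_rows": missing, "conflicting_ids": conflicting_ids}


# Toy frame, invented for illustration only
toy = pd.DataFrame(
    {
        "GeneName": [None, None, "A", "A2", "B"],
        "log2FC": [-5.9, 4.1, 1.0, 1.1, 2.0],
        "GeneID": [45338, 45338, 1, 1, 2],
    }
)
report = summarize_symbol_issues(toy)
print(len(report["missing_symbol_rows"]), report["conflicting_ids"])
```

In the toy frame, GeneID 1 is reported as a true conflict because it maps to two different symbols, while the two unnamed rows are listed separately.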



March 23
Laura: There are not (yet) any approved drugs specific for dengue virus. Ribavirin is used as a treatment, but this is a generic antiviral and is not dengue specific. The promising small molecule JNJ-A07 is currently under investigation but it targets the virus directly and not the host cell. For host-directed targets that result in inhibition of dengue virus replication, some of them are listed in the article (please see Table 2): https://www.sciencedirect.com/science/article/pii/S1879625720300523  Hope this helps!
 
Here is some keyword advice from ChatGPT:
 
Here are some suggested keywords that you can search for on Google and PubMed if you are researching mechanisms of dengue infection:
 - Dengue fever
 - Dengue virus
 - Viral replication
 - Pathogenesis
 - Host-pathogen interaction
 - Immune response
 - Dengue infection and cytokines
 - Mosquito-borne diseases
 - Viral entry
 - Endocytosis
 - Viral receptors
 - Innate immune response
 - Adaptive immune response
 - Virus evasion strategies
 - Dengue vaccine development

You can also use specific gene or protein names that are known to be involved in dengue infection, such as NS1, NS2A, NS2B, NS3, NS4A, NS4B, and NS5. Additionally, you can try combining these keywords with terms such as "molecular mechanisms", "pathophysiology", "epidemiology", "clinical features", and "diagnosis", depending on your research focus.