# BAG3 Co-IP Watson Analysis
##### by Emir Turkes

This analysis utilizes Watson for Drug Discovery to validate Co-IP MS (co-immunoprecipitation mass spectrometry) data with BAG3 as the primary target. There are 4 goals thus far:
1. Correlation of the data with IBM's knowledge base to get a sense of where it stands against the existing literature.
2. Clustering of the Co-IP'ed proteins into various ontological groups (disease relevance, biochemical pathways, and chemical classification come to mind).
3. Breakdown of upstream, downstream, and intermediate biochemical pathways between Co-IP'ed proteins and their clusters.
4. Confirmation of enrichment of proteins involved in endocytosis/membrane fusion events and dendritic localization/function.

In [8]:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# Copyright 2019 Emir Turkes
# 
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# 
#     http://www.apache.org/licenses/LICENSE-2.0
# 
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""The main analysis routine."""


import os
import types

import numpy as np
import pandas as pd

import ibm_botocore.client as ic
import ibm_boto3 as ib
import pixiedust

In [9]:
# The code was removed by Watson Studio for sharing.

### Cleaning of Co-IP MS data

We do some basic data manipulations to get an ordered list of 20 genes by their coupling with BAG3, excluding BAG3 itself.  

**Note:**  
The data originally contained `#DIV/0!` errors because many cells in the shBAG3 column held null values. On the assumption that these nulls reflect values below 1, they were reassigned an arbitrary value of 0.5 (the midpoint of 0 and 1).
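
The null replacement described in the note was done outside this notebook, so the following is only a minimal sketch of what it might look like in pandas. The column name `control`, the PSM counts, and the assumption that the fold change is the control count divided by the shBAG3 count are all illustrative and not taken from the original spreadsheet.

```python
import numpy as np
import pandas as pd

# Made-up PSM counts; "control" is an assumed column name.
psm = pd.DataFrame(
    {"control": [44.0, 21.0, 14.0], "shBAG3": [1.0, np.nan, 2.0]},
    index=["GENE_A", "GENE_B", "GENE_C"],
)

# Replace nulls in the knockdown column with 0.5 so the ratio is defined.
psm["shBAG3"] = psm["shBAG3"].fillna(0.5)

# Assumed definition of the fold change: control over knockdown.
psm["fold of changes"] = psm["control"] / psm["shBAG3"]
print(psm)
```

Because the 0.5 placeholder is arbitrary, fold changes for proteins with null knockdown counts are best read as lower-confidence estimates.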

In [10]:
# Read in data.
# The correction of null values was done outside of this notebook.
body = client_75340283764447f99797650de21e6211.get_object(Bucket='bag3coip-donotdelete-pr-wlavvdgoryxgjb',Key='co_ip_no_keratin.csv')['Body']
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )
co_ip_df = pd.read_csv(body)

# Drop the first row (index 0) and cast "fold of changes" to integers.
cols = ["fold of changes"]
co_ip_df.drop(0, inplace=True)
co_ip_df[cols] = co_ip_df[cols].astype('int')

# Reduce to two columns and the top 21 genes (21 so that 20 remain once BAG3 is removed).
cols = ["Unnamed: 1", "fold of changes"]
co_ip_df = co_ip_df[cols]
co_ip_df.rename(columns={"Unnamed: 1": "Genes", "fold of changes": "Adjusted PSMs"}, inplace=True)
co_ip_df = co_ip_df.head(21)
co_ip_df.insert(1, "Co-IP rank", co_ip_df.index)

# Cut string to only gene name, capitalize, and remove BAG3.
cols = "Genes"
co_ip_df[cols] = co_ip_df[cols].str.extract("((?<=GN=).*(?= PE=))", expand=True)
co_ip_df["Genes"] = co_ip_df["Genes"].str.upper()
co_ip_df = co_ip_df[~co_ip_df.Genes.str.contains("BAG3")]
co_ip_df = co_ip_df.reset_index(drop=True)
co_ip_df.index = co_ip_df.index + 1

# Replace gene names with Watson entities.
body = client_75340283764447f99797650de21e6211.get_object(Bucket='bag3coip-donotdelete-pr-wlavvdgoryxgjb',Key='EntitySet_BAG3-co-ip-20-no-BAG3_2019-02-27_09-47-55.csv')['Body']
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )
entity_set_df = pd.read_csv(body)
entity_set_df.index = entity_set_df.index + 1
co_ip_df["Genes"] = entity_set_df["Entity name"]

display(co_ip_df)

Genes,Co-IP rank,Adjusted PSMs
PLEC,1,44
ANK2,2,44
BAG5,3,34
SPTBN2,4,30
HSPA8,5,21
NES,6,20
MYH10,7,20
CAMK2A,8,20
NEFM,9,20
SRCIN1,10,19


## Watson Explore Networks

First, we visualize the relationship between the top 20 genes based on literature reporting of their modulation of BAG3. Though Watson Explore Networks is set to the lowest thresholds (described below), relationships with only two other genes were reported:  
**HSPA8**  
**CAPZB**  

Note the direction of the arrows, which indicate the direction of each modulation.

![network_2019-02-27_09-54-20.png](attachment:network_2019-02-27_09-54-20.png)

### Report confidence scores and number of documents

Watson Explore Networks assigns each relationship a confidence score on a scale of 0-100 that reflects how likely the relationship is to hold beyond chance. It also reports the number of documents retrieved, along with their associated metadata.  

The following minimum thresholds were set:  

$
\begin{align}
Confidence\ score \geq 1 \\
Number\ of\ documents \geq 1
\end{align}
$
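
Watson applies these thresholds in its own interface. As a rough sketch, the same minimum thresholds could be re-applied to an exported relationship table in pandas. The rows below are made up, and `MIN_CONFIDENCE` and `MIN_DOCUMENTS` are hypothetical names.

```python
import pandas as pd

# Made-up relationship rows using the column names of the Watson export.
relationships = pd.DataFrame({
    "Source name": ["GENE_A", "GENE_B", "GENE_C"],
    "Target name": ["BAG3", "BAG3", "BAG3"],
    "Confidence": [60, 0, 12],
    "Documents": [2, 3, 0],
})

MIN_CONFIDENCE = 1  # hypothetical constant names
MIN_DOCUMENTS = 1

# Keep only rows that meet both minimum thresholds.
kept = relationships[
    (relationships["Confidence"] >= MIN_CONFIDENCE)
    & (relationships["Documents"] >= MIN_DOCUMENTS)
]
print(kept)
```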

$~$  
Below is a summary of the most relevant information.

In [11]:
# Read in data.
body = client_75340283764447f99797650de21e6211.get_object(Bucket='bag3coip-donotdelete-pr-wlavvdgoryxgjb',Key='relationshipgraph14809_2019-02-27_09-54-25.csv')['Body']
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )
relationship_graph_more_nodes_df = pd.read_csv(body)

# Sort by "Confidence".
cols = ["Confidence"]
relationship_graph_more_nodes_df[cols] = relationship_graph_more_nodes_df[cols].astype('int')
relationship_graph_more_nodes_df.sort_values(by=cols, inplace=True, ascending=False)

# Reduce to four columns.
cols = ["Source name", "Target name", "Confidence", "Documents"]
relationship_graph_more_nodes_df = relationship_graph_more_nodes_df[cols]
relationship_graph_more_nodes_df = relationship_graph_more_nodes_df.reset_index(drop=True)
relationship_graph_more_nodes_df.index = relationship_graph_more_nodes_df.index + 1

display(relationship_graph_more_nodes_df)

Source name,Target name,Confidence,Documents
HSPA8,DNAJB6,3,4
TAZ,BAG3,61,1
MIR217,BAG3,59,1
HSPA8,HSP90AA1,87,32
HSPB6,BAG3,60,2


### Merge confidence scores with the Co-IP gene list

It appears that BAG3 -> HSPA8 is the most strongly supported relationship in the literature. Let's merge the above table with the Co-IP list to get a clearer picture.
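
The merge in the next cell is hard-coded by row label. One possible way to derive the same columns programmatically is sketched below on made-up data. It assumes that "downstream" means BAG3 is the source of the relationship and "upstream" means BAG3 is the target. The helper `directed` is hypothetical.

```python
import pandas as pd

# Made-up stand-ins for co_ip_df and the Watson relationship export.
co_ip = pd.DataFrame({"Genes": ["HSPA8", "NES", "CAPZB"], "Co-IP rank": [5, 6, 19]})
rel = pd.DataFrame({
    "Source name": ["BAG3", "HSPA8", "BAG3"],
    "Target name": ["HSPA8", "BAG3", "CAPZB"],
    "Confidence": [80, 70, 50],
    "Documents": [3, 1, 2],
})

def directed(rel, gene_col, bait_col, label, bait="BAG3"):
    """Rows where the bait sits in bait_col, keyed by the partner gene."""
    sub = rel.loc[rel[bait_col] == bait, [gene_col, "Confidence", "Documents"]]
    return sub.rename(columns={
        gene_col: "Genes",
        "Confidence": f"Confidence ({label})",
        "Documents": f"Documents ({label})",
    })

down = directed(rel, "Target name", "Source name", "downstream")  # BAG3 -> gene
up = directed(rel, "Source name", "Target name", "upstream")      # gene -> BAG3

merged = co_ip.merge(down, on="Genes", how="left").merge(up, on="Genes", how="left")

# Drop genes with no reported relationship; treat a missing direction as 0.
conf_cols = ["Confidence (downstream)", "Confidence (upstream)"]
merged = merged.dropna(subset=conf_cols, how="all").fillna(0).reset_index(drop=True)
num_cols = [c for c in merged.columns if c.startswith(("Confidence", "Documents"))]
merged[num_cols] = merged[num_cols].astype(int)
print(merged)
```

Joining on the gene name rather than on row position means the result does not depend on the sort order of either table.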

In [None]:
merge_df = co_ip_df.copy()

# Hard coded at the moment.
merge_df.at[5, "Confidence (downstream)"] = relationship_graph_more_nodes_df.at[1, "Confidence"]
merge_df.at[5, "Confidence (upstream)"] = relationship_graph_more_nodes_df.at[2, "Confidence"]
merge_df.at[18, "Confidence (downstream)"] = relationship_graph_more_nodes_df.at[3, "Confidence"]
merge_df.at[18, "Confidence (upstream)"] = 0

# Remove unrelated rows and convert to integer values.
cols = ["Confidence (downstream)", "Confidence (upstream)"]
merge_df.dropna(axis=0, subset=cols, inplace=True)
merge_df[cols] = merge_df[cols].astype('int')

# Reset index
merge_df = merge_df.reset_index(drop=True)
merge_df.index = merge_df.index + 1

display(merge_df)

Genes,Co-IP rank,Adjusted PSMs,Confidence (downstream),Confidence (upstream)
HSPA8,5,21,81,72
CAPZB,19,14,57,0


### Reduce variables

To make the above table easier to interpret, we reduce it to two variables: scaled PSMs and a combined confidence score.  

In order to scale PSMs to a range of 0-100, like confidence, we can simply find their percentage of the max value like so:  

$
\begin{equation*}
PSM_{scaled} = (\frac{PSM_{adjusted}}{PSM_{max}}) \times 100 
\end{equation*}
$
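
As a quick worked example of the scaling, the sketch below applies the formula to the adjusted PSMs reported above for HSPA8 (21) and CAPZB (14), with the maximum of 44 from the Co-IP table. The function name `scale_psm` is hypothetical. The result is truncated to an integer, mirroring the `astype('int')` cast used in the notebook.

```python
def scale_psm(psm_adjusted, psm_max):
    """Express an adjusted PSM count as a percentage of the maximum."""
    return psm_adjusted / psm_max * 100

psm_max = 44  # highest adjusted PSM count in the Co-IP table

scaled = {
    "HSPA8": int(scale_psm(21, psm_max)),  # 47.7... truncated to 47
    "CAPZB": int(scale_psm(14, psm_max)),  # 31.8... truncated to 31
}
print(scaled)
```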

$~$  
It makes sense to combine the confidence scores because Co-IP cannot distinguish upstream from downstream relationships. They are combined as an average weighted by the number of documents in each direction, like so:  

$
\begin{equation*}
Confidence_{combined} = \frac{(Confidence_{down} \times Documents_{down}) + (Confidence_{up} \times Documents_{up})}{Documents_{down} + Documents_{up}}
\end{equation*}
$
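
A minimal sketch of this document-weighted average follows. The confidence values are those reported for HSPA8 and CAPZB, but the document counts are hypothetical, since they are not shown in the tables above. The zero-document guard is an added assumption and not part of the notebook.

```python
def combined_confidence(conf_down, docs_down, conf_up, docs_up):
    """Average the two confidence scores, weighted by document counts."""
    total_docs = docs_down + docs_up
    if total_docs == 0:
        # Assumption: no supporting documents means no confidence.
        return 0.0
    return (conf_down * docs_down + conf_up * docs_up) / total_docs

# Document counts here are made up for illustration.
hspa8 = combined_confidence(81, 3, 72, 1)  # (243 + 72) / 4
capzb = combined_confidence(57, 2, 0, 0)   # upstream contributes nothing
print(hspa8, capzb)
```

When one direction has no documents, the combined score reduces to the confidence of the other direction, which matches the CAPZB row.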

In [None]:
reduced_df = merge_df.copy()

# Reduce PSMs.
cols = ["Adjusted PSMs"]
PSM_max = co_ip_df[cols].max()

for index, row in reduced_df.iterrows():
    reduced_df.loc[index, cols] = ((row[cols]) / (PSM_max)) * 100

reduced_df[cols] = reduced_df[cols].astype('int')

# Reduce Confidence.
extended_df = merge_df.copy()

# Hard coded at the moment.
extended_df.at[1, "Documents (downstream)"] = relationship_graph_more_nodes_df.at[1, "Documents"]
extended_df.at[1, "Documents (upstream)"] = relationship_graph_more_nodes_df.at[2, "Documents"]
extended_df.at[2, "Documents (downstream)"] = relationship_graph_more_nodes_df.at[3, "Documents"]
extended_df.at[2, "Documents (upstream)"] = 0

cols = ["Confidence (downstream)", "Documents (downstream)", 
        "Confidence (upstream)", "Documents (upstream)"]

for index, row in reduced_df.iterrows():
    reduced_df.loc[index, cols[0]] = (
        (
            (row[(cols[0])] * extended_df.loc[index, (cols[1])]) +
            (row[(cols[2])] * extended_df.loc[index, (cols[3])])
        )
        /
        (
            extended_df.loc[index, (cols[1])] + extended_df.loc[index, (cols[3])]
        )
    )

reduced_df[(cols[0])] = reduced_df[(cols[0])].astype('int')

# Reduce to three columns.
cols = ["Genes", "Adjusted PSMs", "Confidence (downstream)"]
reduced_df = reduced_df[cols]
reduced_df.rename(
    columns={"Adjusted PSMs": "Scaled PSMs", "Confidence (downstream)": "Combined confidence"}, 
    inplace=True
)

display(reduced_df)

Genes,Scaled PSMs,Combined confidence
HSPA8,47,79
CAPZB,31,57


![scaled_PSMs_vs_combined_confidence.png](attachment:scaled_PSMs_vs_combined_confidence.png)

### Increase nodes to link to

The analysis so far only considers genes that directly modulate each other. The results can be expanded considerably by including intermediate genes that eventually link back to BAG3. The same minimum thresholds as before were used.
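
To make the idea of an intermediate gene concrete, the sketch below finds genes that share a neighbour with BAG3 in a small edge list. The edges, the placeholder gene names, and the helper `one_step_intermediates` are hypothetical. Direction is ignored for simplicity, which is a simplification of what Watson reports.

```python
from collections import defaultdict

# Hypothetical (source, target) edges; not taken from the Watson export.
edges = [
    ("BAG3", "X1"), ("X1", "GENE_A"),
    ("GENE_B", "X2"), ("X2", "BAG3"),
    ("GENE_C", "X3"),
]

def one_step_intermediates(edges, bait, candidates):
    """Map each candidate to the genes that neighbour both it and the bait."""
    neighbours = defaultdict(set)
    for source, target in edges:
        neighbours[source].add(target)
        neighbours[target].add(source)
    links = {}
    for gene in candidates:
        shared = (neighbours[gene] & neighbours[bait]) - {gene, bait}
        if shared:
            links[gene] = sorted(shared)
    return links

links = one_step_intermediates(edges, "BAG3", ["GENE_A", "GENE_B", "GENE_C"])
print(links)
```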

![network_2019-02-27_14-33-37.png](attachment:network_2019-02-27_14-33-37.png)

In [16]:
# Read in data.
body = client_75340283764447f99797650de21e6211.get_object(Bucket='bag3coip-donotdelete-pr-wlavvdgoryxgjb',Key='relationshipgraph14809_2019-02-27_14-40-25.csv')['Body']
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )
relationship_graph_more_nodes_df = pd.read_csv(body)

# Sort by "Confidence".
cols = ["Confidence"]
relationship_graph_more_nodes_df[cols] = relationship_graph_more_nodes_df[cols].astype('int')
relationship_graph_more_nodes_df.sort_values(by=cols, inplace=True, ascending=False)

# Reduce to four columns.
cols = ["Source name", "Target name", "Confidence", "Documents"]
relationship_graph_more_nodes_df = relationship_graph_more_nodes_df[cols]
relationship_graph_more_nodes_df = relationship_graph_more_nodes_df.reset_index(drop=True)
relationship_graph_more_nodes_df.index = relationship_graph_more_nodes_df.index + 1

display(relationship_graph_more_nodes_df.head(20))

Source name,Target name,Confidence,Documents
BAG3,AKT1,91,6
HSPA8,HSPA4,89,55
BAG3,BCL2,88,9
HSPA8,NF-KB,88,8
HSPA4,HSPA8,87,55
HSPA8,UBIQUITIN,87,4
HSPA8,TP53,87,8
HSPA8,HSP90AA1,87,32
BAG3,CASP3,86,3
HSP90AA1,HSPA8,86,12


## Work in progress - more coming soon