# BAG3 Co-IP Pathway and Watson Analysis
##### by Emir Turkes

This analysis utilizes pathway analysis and Watson for Drug Discovery to validate Co-IP MS (co-immunoprecipitation mass spectrometry) data with `BAG3` as the primary target. There are 4 goals thus far:
1. Correlation of our results with IBM's knowledge base to get a sense of where they stand against existing literature.
1. Clustering of the Co-IP'ed proteins into various ontological groups (disease relevance, biochemical pathways, and chemical classification).
1. Breakdown of upstream, downstream, and intermediate biochemical pathways between Co-IP'ed proteins and their clusters.
1. Confirmation of enrichment of proteins involved in endocytosis/membrane fusion events and dendritic localization/function.

In [7]:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# Copyright 2019 Emir Turkes
# 
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# 
#     http://www.apache.org/licenses/LICENSE-2.0
# 
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""The main analysis routine."""


import os
import types

import numpy as np
import pandas as pd

import ibm_botocore.client as ic
import ibm_boto3 as ib
import pixiedust

In [8]:
# The code was removed by Watson Studio for sharing.

### Cleaning of Co-IP MS data

The experimental design uses `BAG3` as bait, with an `shBAG3` treatment group to knock down `BAG3` and an `scrRNA` control group. Without the treatment group, we cannot know whether a protein was pulled down because of a true interaction with `BAG3` or simply as an artifact of running the assay itself.

**Our primary measurement of interest is fold change of protein abundance counts, which represents a simple ratio:**

$$
FC = \frac{\mathrm{scrRNA}}{\mathrm{shBAG3}}
$$

*The FC values in our data were obtained with additional standard QC steps in Proteome Discoverer, which are described in detail in its [manual](https://assets.thermofisher.com/TFS-Assets/CMD/manuals/Man-XCALI-97808-Proteome-Discoverer-User-ManXCALI97808-EN.pdf).*
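To make the ratio concrete, here is a minimal sketch that computes FC and its log2 from two abundance columns. The column names mirror those used later in this notebook, but the gene labels and numbers are invented for illustration, and the sketch does not reproduce the normalization or QC that Proteome Discoverer applies.

```python
import numpy as np
import pandas as pd

# Hypothetical abundance counts; these are not values from the experiment.
toy_df = pd.DataFrame({
    "Genes": ["GENE_A", "GENE_B", "GENE_C"],
    "scrRNA Abundance": [8.0, 3.0, 2.0],
    "shBAG3 Abundance": [2.0, 3.0, np.nan],
})

# FC = scrRNA / shBAG3; a missing denominator propagates as NaN.
toy_df["FC"] = toy_df["scrRNA Abundance"] / toy_df["shBAG3 Abundance"]
toy_df["FC (log2)"] = np.log2(toy_df["FC"])

print(toy_df)
```

A log2 value of 0 means equal abundance in both groups, and positive values mean higher abundance in the `scrRNA` control.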

**From these fold changes we intend to generate a ranked list of `BAG3`-associated proteins for Watson and pathway analysis.**

An important consideration is that the experiment was done without replicates, which are required for standard hypothesis testing of true protein-protein interactions. The lack of replicates also complicates the treatment of missing values and low abundance counts. While there is ample literature on this topic in general (see [Moorthy et al., 2014](https://doi.org/10.1186/1752-0509-7-S6-S12) for one review), little of it pertains to Co-IP MS specifically. Therefore, the following workarounds should be interpreted cautiously.
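One way to gauge how much missing values matter is to label each protein by the group in which it was quantified. The sketch below is an illustration only: the helper name `flag_missing`, the `Detection` column, and the example numbers are all hypothetical and are not part of this analysis.

```python
import numpy as np
import pandas as pd


def flag_missing(df, control="scrRNA Abundance", treatment="shBAG3 Abundance"):
    """Label each row by which group, if any, lacks an abundance value."""
    out = df.copy()
    no_ctrl = out[control].isna()
    no_trt = out[treatment].isna()
    # Conditions are checked in order, so "neither" takes priority.
    out["Detection"] = np.select(
        [no_ctrl & no_trt, no_ctrl, no_trt],
        ["neither", "treatment only", "control only"],
        default="both",
    )
    return out


# Hypothetical abundance counts; these are not values from the experiment.
example_df = pd.DataFrame({
    "scrRNA Abundance": [2.0, 14.0, np.nan],
    "shBAG3 Abundance": [np.nan, 3.0, 1.0],
})
flagged_df = flag_missing(example_df)
print(flagged_df["Detection"].value_counts())
```

Proteins labeled "control only" have no defined ratio, so any value assigned to them reflects an imputation or capping choice rather than a measurement.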

To begin, we perform some basic pre-calculation cleaning of the Proteome Discoverer output and explore it below.

*94 rows, truncated to 10 for readability*

In [10]:
# Read in data to a DataFrame.
body = client_75340283764447f99797650de21e6211.get_object(
    Bucket='bag3coip-donotdelete-pr-wlavvdgoryxgjb', Key='co_ip_no_keratin.csv')['Body']
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )
co_ip_df = pd.read_csv(body)

# Rename columns to something clearer.
co_ip_df.rename(columns={
    "Description": "Genes", 
    "Summed Abundance Ratio (log2): (Scr) / (SH)": "Abundance Ratio (log2)",
    "Abundance Counts": "scrRNA Abundance",
    "Unnamed: 12": "shBAG3 Abundance"}, inplace=True)

# Remove unneeded columns.
cols = ["Genes", "Abundance Ratio (log2)", "scrRNA Abundance", "shBAG3 Abundance"]
co_ip_df = co_ip_df[cols]

# Sort by "Abundance Ratio".
cols = "Abundance Ratio (log2)"
co_ip_df.sort_values(by=cols, inplace=True, ascending=False)

# Extract gene names (the text between "GN=" and " PE=") from the larger metadata.
cols = "Genes"
co_ip_df[cols] = co_ip_df[cols].str.extract(r"((?<=GN=).*(?= PE=))", expand=False)
co_ip_df[cols] = co_ip_df[cols].str.upper()
co_ip_df.dropna(axis=0, subset=[cols], inplace=True)
# Drop rows whose gene name contains "BAG3" (the bait); copy to avoid chained-assignment warnings.
co_ip_df = co_ip_df[~co_ip_df[cols].str.contains("BAG3")].copy()

# Convert integer values to float.
cols = ["scrRNA Abundance", "shBAG3 Abundance"]
co_ip_df[cols] = co_ip_df[cols].astype("float")

# Add a simple ranking of genes by "Abundance Ratio".
co_ip_df = co_ip_df.reset_index(drop=True)
co_ip_df.index = co_ip_df.index + 1
co_ip_df.insert(0, "Rank", co_ip_df.index)

# Replace gene names with Watson entities.
body = client_75340283764447f99797650de21e6211.get_object(
    Bucket='bag3coip-donotdelete-pr-wlavvdgoryxgjb', Key='EntitySet_BAG3-co-ip-all_2019-03-20_01-26-33.csv')['Body']
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )
entity_set_df = pd.read_csv(body)
entity_set_df.index = entity_set_df.index + 1
cols = "Genes"
co_ip_df[cols] = entity_set_df["Entity name"]

display(co_ip_df.head(10))

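| Rank | Genes | Abundance Ratio (log2) | scrRNA Abundance | shBAG3 Abundance |
|---|---|---|---|---|
| 1 | KIF2A | 6.64 | 2.0 | |
| 2 | TUBB3 | 6.64 | 2.0 | |
| 3 | TUBGCP3 | 6.64 | 2.0 | |
| 4 | IQSEC3 | 6.64 | 2.0 | |
| 5 | PTN | 6.64 | 2.0 | 1.0 |
| 6 | CLASP2 | 6.64 | 2.0 | |
| 7 | TUBB2B | 6.64 | 2.0 | |
| 8 | BAG5 | 6.64 | 14.0 | 3.0 |
| 9 | HIST1H4B | 6.64 | 2.0 | 1.0 |
| 10 | YWHAG | 6.64 | 2.0 | 1.0 |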
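
The gene-name extraction step in the cell above relies on a regular expression with a lookbehind for `GN=` and a lookahead for ` PE=`. The sketch below applies the same pattern to two made-up description strings written in a UniProt-like header style. The strings are assumptions for illustration and are not taken from the data file.

```python
import pandas as pd

# Hypothetical description strings; the second one has no "GN=" field.
descriptions = pd.Series([
    "Example protein one OS=Homo sapiens OX=9606 GN=abc1 PE=1 SV=2",
    "Example protein without a gene field OS=Homo sapiens OX=9606 PE=4 SV=1",
])

# Capture the text between "GN=" and " PE=", then normalize to upper case.
genes = descriptions.str.extract(r"((?<=GN=).*(?= PE=))", expand=False).str.upper()
print(genes.tolist())
```

Rows without a `GN=` field yield a missing value, which is why the cleaning cell drops missing gene names afterwards.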


## Work in progress - more coming soon