# BAG3 Co-IP Watson Analysis
##### by Emir Turkes

This analysis utilizes Watson for Drug Discovery to validate Co-IP MS (co-immunoprecipitation mass spectrometry) data with `BAG3` as the primary target. There are 4 goals thus far:
1. Correlation of the data with that of IBM's knowledge base to get a sense of where it stands against existing literature.
1. Clustering of the Co-IP'ed proteins into various ontological groups (disease relevance, biochemical pathways, and chemical classification).
1. Breakdown of upstream, downstream, and intermediate biochemical pathways between Co-IP'ed proteins and their clusters.
1. Confirmation of enrichment of proteins involved in endocytosis/membrane fusion events and dendritic localization/function.

In [4]:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# Copyright 2019 Emir Turkes
# 
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
# 
#     http://www.apache.org/licenses/LICENSE-2.0
# 
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""The main analysis routine."""


import os
import types

import numpy as np
import pandas as pd

import ibm_botocore.client as ic
import ibm_boto3 as ib
import pixiedust

In [5]:
# The code was removed by Watson Studio for sharing.

### Cleaning of Co-IP MS data

The experiment was conducted without replicates, which precludes the use of standard hypothesis tests to assign significance to missing values. Because some of the analyses we would like to perform cannot accept missing values, we must perform *missing value imputation*. Since no single algorithm is best in every case, we try a couple and select the one that produces the most biologically plausible outcome.

A review of missing value imputation algorithms is provided by [Moorthy et al., 2014](https://doi.org/10.1186/1752-0509-7-S6-S12).
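To make the idea of comparing imputation strategies concrete, here is a minimal sketch on toy data. The gene names and values are made up, and the two strategies shown (filling with the smallest observed value and filling with the median) are simple stand-ins chosen for illustration; they are not necessarily the algorithms used in this analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; real values would come from the Co-IP MS export.
toy_df = pd.DataFrame({
    "Genes": ["GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"],
    "Summed Abundance Ratio": [6.64, 2.10, np.nan, -0.35, np.nan],
})


def impute_min(values: pd.Series) -> pd.Series:
    """Fill missing entries with the smallest observed value."""
    return values.fillna(values.min())


def impute_median(values: pd.Series) -> pd.Series:
    """Fill missing entries with the median of the observed values."""
    return values.fillna(values.median())


col = "Summed Abundance Ratio"
# Put the candidate imputations side by side for manual inspection.
candidates = pd.DataFrame({
    "Genes": toy_df["Genes"],
    "observed": toy_df[col],
    "min_imputed": impute_min(toy_df[col]),
    "median_imputed": impute_median(toy_df[col]),
})
print(candidates)
```

Laying the candidates out side by side makes it easier to judge which strategy gives values that are biologically plausible for the proteins in question.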

To begin, let's look at the most relevant sections of the processed output from Proteome Discoverer:

In [6]:
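# Read in data.
body = client_75340283764447f99797650de21e6211.get_object(
    Bucket='bag3coip-donotdelete-pr-wlavvdgoryxgjb',
    Key='co_ip_no_keratin.csv',
)['Body']
# Add an __iter__ method if it is missing so that pandas accepts body as file-like.
# NOTE: the __iter__ helper is not defined in the visible cells; it is presumably
# defined in the hidden cell above.
if not hasattr(body, "__iter__"):
    body.__iter__ = types.MethodType(__iter__, body)
co_ip_df = pd.read_csv(body)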

# Make data more presentable.
co_ip_df.rename(
    columns={
        "Description": "Genes",
        "Summed Abundance Ratio (log2): (Scr) / (SH)": "Summed Abundance Ratio",
    },
    inplace=True,
)
cols = ["Genes", "Summed Abundance Ratio"]
co_ip_df = co_ip_df[cols].copy()  # Copy so later in-place edits do not touch a view.
cols = "Summed Abundance Ratio"
co_ip_df.sort_values(by=cols, inplace=True, ascending=False)
cols = "Genes"
# Extract the gene symbol found between "GN=" and " PE=" in each description.
co_ip_df[cols] = co_ip_df[cols].str.extract(r"((?<=GN=).*(?= PE=))", expand=False)
co_ip_df[cols] = co_ip_df[cols].str.upper()
co_ip_df.dropna(axis=0, subset=[cols], inplace=True)
# Drop the bait protein itself.
co_ip_df = co_ip_df[~co_ip_df[cols].str.contains("BAG3")]
# Renumber rows from 1 so that the index doubles as a rank.
co_ip_df = co_ip_df.reset_index(drop=True)
co_ip_df.index = co_ip_df.index + 1
co_ip_df.insert(0, "Rank", co_ip_df.index)

# Replace gene names with Watson entities.
body = client_75340283764447f99797650de21e6211.get_object(
    Bucket='bag3coip-donotdelete-pr-wlavvdgoryxgjb',
    Key='EntitySet_BAG3-co-ip-all_2019-03-20_01-26-33.csv',
)['Body']
if not hasattr(body, "__iter__"):
    body.__iter__ = types.MethodType(__iter__, body)
entity_set_df = pd.read_csv(body)
# Shift to a 1-based index so that rows align with co_ip_df on assignment.
# This relies on the entity set listing genes in the same order as co_ip_df.
entity_set_df.index = entity_set_df.index + 1
co_ip_df[cols] = entity_set_df["Entity name"]

display(co_ip_df)

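| Rank | Genes | Summed Abundance Ratio |
|------|-------|------------------------|
| 1 | KIF2A | 6.64 |
| 2 | TUBB3 | 6.64 |
| 3 | TUBGCP3 | 6.64 |
| 4 | IQSEC3 | 6.64 |
| 5 | PTN | 6.64 |
| 6 | CLASP2 | 6.64 |
| 7 | TUBB2B | 6.64 |
| 8 | BAG5 | 6.64 |
| 9 | HIST1H4B | 6.64 |
| 10 | YWHAG | 6.64 |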
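
The gene-symbol clean-up performed in the cell above can be tried in isolation. The sketch below applies the same regular expression to a few made-up description strings laid out in the UniProt header style (`... GN=<gene> PE=<n> SV=<n>`). The strings are hypothetical and do not correspond to real records.

```python
import pandas as pd

# Hypothetical description strings in UniProt header style.
descriptions = pd.Series([
    "Hypothetical bait protein OS=Example organism OX=0 GN=Bag3 PE=1 SV=1",
    "Hypothetical prey protein OS=Example organism OX=0 GN=Abc1 PE=1 SV=1",
    "Hypothetical unnamed protein OS=Example organism OX=0 PE=4 SV=1",
])

# Same pattern as in the analysis: capture the text between "GN=" and " PE=".
genes = descriptions.str.extract(r"((?<=GN=).*(?= PE=))", expand=False)
genes = genes.str.upper()

# Entries without a GN= field come back as NaN and are dropped.
genes = genes.dropna()

# Remove the bait protein, mirroring the filter used in the analysis.
kept = genes[~genes.str.contains("BAG3")].reset_index(drop=True)
print(kept)
```

Note that `str.contains("BAG3")` matches substrings, so any symbol that merely contains `BAG3` would also be removed; an exact comparison (`genes != "BAG3"`) is a stricter alternative if that ever matters.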


## Work in progress - more coming soon