# DAY 2: PRIOR KNOWLEDGE
Prior knowledge (PK) is information gathered in past experiments that we can use to inform the analysis of our own data. In this way, we build on the large body of results that already exists instead of starting from scratch.

## Learning outcomes
* Understand how biologists can leverage PK
* Get exposed to data science basics: some Python and the PK database OmniPath

# Setting up the environment

* Note: in Jupyter notebooks, pressing A creates a new cell above the current one, and pressing B creates a new cell below it (press Esc first if your cursor is inside a cell). Press Shift+Enter to run a cell. Alternatively, click the play button of a cell to run it.



In [None]:
# install a specific version (pinned commit) of omnipath
!pip install git+https://github.com/saezlab/omnipath.git@138088d

In [None]:
# import libraries
import omnipath as op
import matplotlib.pyplot as plt

# Task 1: OmniPath visual exploration

* Open an empty slide deck to record your insights during this OmniPath exploration.

## Some brief theory

An Application Programming Interface (API) is a set of commands that a programmer can use to access services from a software library, without having to understand how this software package works internally. OmniPath provides us with an API to easily retrieve the PK we are interested in.

Object-Oriented Programming (OOP) is a style of programming that promotes structure and reliability in code. It does so through the concepts of classes and objects. A class is a blueprint for creating objects with certain properties (attributes). An object is an instance of a class. We handle an object with the methods that its class defines. Because our data is an object of a specific class, we know which operations are available and what each method does to the data when we apply it.
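
To make the class/object/method vocabulary concrete, here is a minimal sketch in plain Python. The class `GeneRecord`, its methods, and the resource names and labels are all hypothetical and invented for illustration; none of them are part of OmniPath.

```python
class GeneRecord:
    """Blueprint (class) for a gene together with its annotations."""

    def __init__(self, symbol, uniprot_id):
        # Attributes: the data stored on each object
        self.symbol = symbol
        self.uniprot_id = uniprot_id
        self.annotations = []

    def add_annotation(self, source, value):
        """Method: record one annotation as a (source, value) pair."""
        self.annotations.append((source, value))

    def sources(self):
        """Method: return the distinct annotation sources, sorted."""
        return sorted({source for source, _ in self.annotations})


# An object is one instance of the class
tp53 = GeneRecord("TP53", "P04637")

# Placeholder annotations (made up, not real database content)
tp53.add_annotation("resource_A", "label_1")
tp53.add_annotation("resource_B", "label_2")
tp53.add_annotation("resource_A", "label_3")

print(tp53.symbol, tp53.sources())
```

The object `tp53` only supports the operations its class defines (`add_annotation`, `sources`), which is what makes it predictable to work with.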

* Have a look at the [OmniPath documentation](https://omnipath.readthedocs.io/en/latest/api.html#) and answer these questions:
  * How many classes can we instantiate when requesting OmniPath data?
  * What methods can we apply to an object of the Annotations class?
  * What about the methods that can be applied to objects of the other classes?
  * Click on these methods and read the description of their parameters.

In [None]:
# for now we focus on the annotations database
annotations = op.requests.Annotations()

## Some commands
pandas is a Python package that is widely used in data science to work with dataframes (tables with labelled rows and columns).

* Visit [this](https://pandastutor.com/vis.html) site and practice some commonly used pandas commands. This will be helpful for our tasks later on.
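
As a warm-up, the sketch below runs a few pandas commands on a small made-up table. The column names and values are hypothetical and unrelated to OmniPath; the point is only to show how `unique`, `value_counts`, `isin`, and boolean masks behave.

```python
import pandas as pd

# Hypothetical toy table; column names and values are made up for illustration
df = pd.DataFrame(
    {
        "sample": ["s1", "s2", "s3", "s4", "s5"],
        "tissue": ["liver", "lung", "liver", "brain", "liver"],
        "score": [0.9, 0.4, 0.7, 0.2, 0.5],
    }
)

tissues = df["tissue"].unique()       # distinct values in one column
counts = df["tissue"].value_counts()  # number of rows per distinct value

# Boolean masks select rows; isin tests membership in a list of values
in_selection = df["tissue"].isin(["liver", "brain"])
high_score = df["score"] > 0.6

both = df[in_selection & high_score]    # rows meeting both conditions
either = df[in_selection | high_score]  # rows meeting at least one condition

print(counts)
print(len(both), len(either))
```

Combining masks with `&` is stricter than combining them with `|`, so the first selection can never contain more rows than the second.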

## Now onto biology!

* Think about a disease or biological process that you are interested in. What genes are known/thought to play a role in it?
* Pick a gene from [this](https://www.genecards.org/) database.
* Write down the **main symbol** for this gene; you can find it in the top left of the website.
* Also write down the **UniProt/Swiss-Prot** ID for this gene.
* Note down why you selected this gene.

In [None]:
# replace TP53 and P04637 with the approved symbol and UniProt ID of your selected gene
my_gene_symbol = "TP53"
my_gene_uniprot = "P04637"

## Inspect PK available for your gene of interest

* Run the cells below and note your answers to the questions in your slides.

In [None]:
# retrieve annotations PK for your gene of interest
gene_annotations = annotations.get(my_gene_symbol)
gene_annotations

* What data does each column contain?
* How many rows do you see? Do you think this is a lot of prior knowledge, or do you think some annotations are missing?
* How do you think people usually use this annotations database? Specifically, which column do you think people are usually interested in?

In [None]:
# see which databases OmniPath retrieves its annotations from
gene_annotations["source"].unique()

* Manually check some of these original resources. Do they seem well maintained? Or does the information seem outdated?

In [None]:
# see how many annotations have been retrieved from each original database
gene_annotations["source"].value_counts().plot(kind="bar")

* Ctrl+click the figure to copy it and add it to your slides.
* Is there a large discrepancy, or do most sources provide a similar amount of prior knowledge?
* What could be the implications of this difference in coverage?
* Does the source with the most prior knowledge look outdated, or is it well maintained and up to date? Are there any other properties of this prior knowledge resource that you find interesting?

In [None]:
# investigate actual annotations available
gene_annotations["value"].value_counts()

* What are the most common annotations?

## Now retrieve data for your gene from one other database
* Choose one of the other databases in the [Requests section](https://omnipath.readthedocs.io/en/latest/api.html#) of the OmniPath documentation (complexes, enzsub, intercell, or signedPTMs). NOTE: check the documentation to find out whether the query needs the gene symbol or the UniProt ID.
* Either 1) repeat the previous analysis for this database, or 2) explore it your own way. That is, investigate what prior knowledge is available and what could be interesting to you.

* Note down results in your slides.

# Q&A session 1
* Each pair of programmers presents the results of their OmniPath exploration.
* Why do you think the prior knowledge available differs between genes?
* Is there a bias in prior knowledge?
* How do you think we could overcome bias in prior knowledge?

# Task 2: Expanding our insights from data with prior knowledge
Now we move on to thinking about how we can use this data in our analyses. Specifically, we are going to look at genes that, according to the literature, are involved in ovarian cancer. The goal is to investigate which protein-protein interactions (PPIs) have been recorded for these genes, understand how these PPIs were discovered, and think about hypotheses we could generate to study ovarian cancer.

The genes we focus on were found to be characteristic of cancer-associated fibroblast (CAF) populations in ovarian cancer in [this](https://www.nature.com/articles/s41591-020-0926-0) single-cell study. Seeing what prior knowledge is available for these genes could help us better understand the tumour microenvironment.

In [None]:
# list of UniProt IDs of the CAF-associated genes in ovarian cancer
ovca = ["P02745", "P02746", "P02747", "P00751", "P09341", "P19875", "P02778", "P48061", "P05231", "P22301"]

* What functions do these proteins have? Look them up [here](https://www.uniprot.org/).

In [None]:
# retrieve the protein-protein interactions database
interactions = op.interactions.OmniPath().get()
interactions

## Filtering PK in different ways

In [None]:
# filtering strategy 1:
# Filter for PPIs where both the source and the target are in the ovca list.
# hint: look at the documentation of the pandas.Series.isin method


* How much information (how many rows) do we retrieve?
* Why do you think this is the case?

In [None]:
# filtering strategy 2:
# Filter for PPIs where either the source or the target is in our list of
# ovarian cancer associated genes.
# Note: store the filtered dataframe in a new variable called ovca_ppi

* Compare the number of PPIs retrieved with the two filtering strategies: what are the implications of being flexible with the prior knowledge we include?
* Check out the "sources" and "n_references" columns to see what is backing up these interactions.
* Is there a large discrepancy in the number of references between rows? What could be the implications of this discrepancy?
* What experiments do you think were done in the wet lab to discover these PPIs?
* What other filtering strategies would you use to leverage the prior knowledge in a different way?
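
One possible way to think about the evidence behind interactions is sketched below on a made-up table. The identifiers and reference counts are invented, the cutoff of two references is arbitrary, and the sketch assumes that `n_references` holds plain numbers, which you should verify on the real dataframe before reusing the idea.

```python
import pandas as pd

# Hypothetical interactions; identifiers and counts are invented
toy_ppi = pd.DataFrame(
    {
        "source": ["A", "A", "B", "C", "D"],
        "target": ["B", "C", "D", "D", "A"],
        "n_references": [0, 3, 1, 12, 2],
    }
)

# Summarise how unevenly the evidence is spread across interactions
print(toy_ppi["n_references"].describe())

# Keep only interactions backed by at least min_refs references (arbitrary cutoff)
min_refs = 2
well_supported = toy_ppi[toy_ppi["n_references"] >= min_refs]

# Count how often each protein appears as either source or target
degree = pd.concat([toy_ppi["source"], toy_ppi["target"]]).value_counts()

print(well_supported)
print(degree)
```

A stricter cutoff keeps fewer but better-documented interactions, while the appearance counts point to the proteins that dominate the table. Both choices shape which hypotheses the prior knowledge can support.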

## Formulating a hypothesis based on the PK

* Investigate these interactions and formulate a hypothesis about them in the context of cancer-associated fibroblasts in ovarian cancer. Any idea is welcome: the goal is to see how PK can help us generate hypotheses.
  * For example, focus on a single randomly chosen interaction and look at the literature backing it up
  * Or focus only on interactions involving the genes with the most interactions in the PK

# Q&A session 2
* Each pair of programmers presents the results of their PPI investigation.
* In which scenarios would it be useful to expand our ground truth data with prior knowledge?
* Based on everyone's hypotheses, which ones do you think are valuable to pursue?
* What other sources of PK do you think are missing from OmniPath?

# Feedback
How did you like the workshop? Let us know [here!](https://forms.gle/MBDJ6Z4eCWH7iqkd7)
