# **Bioinformatics with Jupyter Notebooks for WormBase:**
## **Utilities 2 - WormCat**
Welcome to the fourteenth Jupyter notebook in the WormBase tutorial series. Over this series of tutorials, we will write Python code that lets us retrieve data available on the WormBase sites and perform simple analyses with it.

This tutorial covers WormCat, a tool for annotating and visualizing gene set enrichment data from _C. elegans_ microarray, RNA-seq or RNAi screen data. Let's get started!

The required packages for this tutorial can be installed using the next three cells. A more detailed explanation is available in the setup (Tutorial-00) notebook.

In [None]:
!pip install rpy2
!pip install wormcat_batch

In [None]:
%load_ext rpy2.ipython

In [None]:
%%R
install.packages("devtools")

library("devtools")

install_github("trinker/plotflow")

install_github("dphiggs01/wormcat")

install.packages("argparse")

library(wormcat)

Let's start by importing the required libraries for the tutorial.

In [None]:
import pandas as pd
from IPython.display import SVG  # public import path for the SVG display class

We will start by assigning values to some variables.

First, we assign a value to the Annotation_File variable. The annotation databases are defined and created by WormCat; choose the one that suits your use case.

The available annotation databases are:

[1]  kn_jan-18-2021.csv

[2]  orf_jan-18-2021.csv

[3]  whole_genome_jul-03-2019.csv.bk

[4]  ahringer_jan-02-2019.csv

[5]  whole_genome_jul-03-2019.csv

[6]  orfeome_jan-31-2019.csv

[7]  two_jan-18-2021.csv

Based on this list, assign the number of the chosen database to the variable.
Next, we assign a name to the Output_Directory variable and the name of the input Excel workbook ('.xlsx' file) to the Input_File variable.



The input '.xlsx' file needs to follow certain rules:
- Each sheet in the .xlsx file is a separate gene set. Each sheet requires a column header, which MUST be 'Sequence ID' or 'Wormbase ID' (the header is case sensitive), followed by the gene list.
- The spreadsheet name should ONLY be composed of letters, numbers, and underscores (_), and must have one of the extensions .xlsx, .xlt, or .xls.
- The individual sheet names (i.e., tab names) within the spreadsheet should ONLY be composed of letters, numbers, and underscores (_).
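
The rules above can be checked before running WormCat. The following is a minimal sketch, not part of WormCat itself: the helper name `check_workbook_names` and its `sheet_headers` argument (a mapping from sheet name to first-column header) are assumptions made for illustration.

```python
import re
from pathlib import Path

# Rules copied from the list above; all names in this cell are hypothetical helpers.
NAME_PATTERN = re.compile(r"[A-Za-z0-9_]+")        # letters, numbers, underscores
ALLOWED_EXTENSIONS = {".xlsx", ".xlt", ".xls"}
ALLOWED_HEADERS = {"Sequence ID", "Wormbase ID"}   # case sensitive


def check_workbook_names(file_path, sheet_headers):
    """Return a list of rule violations; an empty list means none were found.

    sheet_headers maps each sheet name to the header of its first column.
    """
    problems = []
    path = Path(file_path)
    if path.suffix not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported extension: {path.suffix!r}")
    if not NAME_PATTERN.fullmatch(path.stem):
        problems.append(f"invalid spreadsheet name: {path.stem!r}")
    for sheet, header in sheet_headers.items():
        if not NAME_PATTERN.fullmatch(sheet):
            problems.append(f"invalid sheet name: {sheet!r}")
        if header not in ALLOWED_HEADERS:
            problems.append(f"sheet {sheet!r}: unexpected header {header!r}")
    return problems


ok = check_workbook_names("data/Murphy_TS.xlsx", {"hypodermis": "Sequence ID"})
bad = check_workbook_names("data/my genes.csv", {"set-1": "sequence id"})
print(ok)
print(bad)
```

For a real workbook, the sheet names and headers could be collected with `pd.read_excel(Input_File, sheet_name=None)`, which returns a dictionary of DataFrames keyed by sheet name.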

In [None]:
Annotation_File = '2'
Output_Directory = 'WormCat_Output'
Input_File = 'data/Murphy_TS.xlsx'

command_input = Annotation_File + '\n' + Output_Directory + '\ny\n' + Input_File
command_input

Now we have the command ready for running the wormcat program. Let's run it and extract the results!
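
As an optional check before running, the input string can also be assembled by a small helper that rejects menu numbers outside the list of annotation databases. This is a sketch only: `build_wormcat_input` and `ANNOTATION_FILES` are hypothetical names, and the order of the answers simply mirrors the `command_input` cell in this notebook.

```python
# Menu numbers and file names copied from the list of annotation databases above.
ANNOTATION_FILES = {
    1: "kn_jan-18-2021.csv",
    2: "orf_jan-18-2021.csv",
    3: "whole_genome_jul-03-2019.csv.bk",
    4: "ahringer_jan-02-2019.csv",
    5: "whole_genome_jul-03-2019.csv",
    6: "orfeome_jan-31-2019.csv",
    7: "two_jan-18-2021.csv",
}


def build_wormcat_input(annotation_choice, output_directory, input_file):
    """Join the answers in the same order as command_input:
    annotation number, output directory, a literal 'y', input file."""
    choice = int(annotation_choice)
    if choice not in ANNOTATION_FILES:
        raise ValueError(f"annotation choice must be 1-7, got {annotation_choice!r}")
    return "\n".join([str(choice), output_directory, "y", input_file])


example_input = build_wormcat_input("2", "WormCat_Output", "data/Murphy_TS.xlsx")
print(repr(example_input))
print("Selected annotation file:", ANNOTATION_FILES[2])
```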

In [None]:
!printf "$command_input" | wormcat_cli

Let's read in the output file, which contains enrichment data from the nested annotation list: broad categories are in Category 1 (Cat1), and more specific categories are in Cat2 and Cat3.

For details about the three categories, download this file from the WormCat website - http://wormcat.com/static/download/Category_Definitions.csv
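
The nesting can be pictured as a path of up to three levels. The sketch below assumes that nested category names are written with colon separators; the name 'Metabolism: lipid: sterol' is used purely as an illustration and is not taken from the output of this notebook. Check the Category_Definitions.csv file to confirm the exact format before relying on this.

```python
def split_category(name):
    """Split a nested category name into its levels, trimming whitespace.

    Assumes levels are separated by ':' (an assumption, see the text above).
    """
    return [part.strip() for part in name.split(":")]


def category_prefix(name, level):
    """Return the category name truncated to the given level (1, 2 or 3)."""
    return ": ".join(split_category(name)[:level])


example = "Metabolism: lipid: sterol"  # illustrative name only
for level in (1, 2, 3):
    print(f"Cat{level}:", category_prefix(example, level))
```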

In [None]:
output = 'WormCat_Output/Out_Murphy_TS.xlsx'
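# Each category level is stored in its own sheet of the output workbook.
Output_Cat1 = pd.read_excel(output, sheet_name='Cat1')
Output_Cat2 = pd.read_excel(output, sheet_name='Cat2')
Output_Cat3 = pd.read_excel(output, sheet_name='Cat3')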

In [None]:
Output_Cat1

Next, we choose the gene set (i.e., sheet name) that we want to look at more closely, along with the output category. With these two values set, we can explore the remaining outputs.

WormCat output provides scaled bubble charts showing the enrichment scores that meet a Bonferroni-corrected p-value cutoff of 0.01.
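
To make the 0.01 cutoff concrete, here is a minimal sketch of a standard Bonferroni adjustment, in which each raw p-value is multiplied by the number of tests and capped at 1. The p-values are made up for illustration, the function name is hypothetical, and this is not WormCat's own code; whether WormCat uses a strict or inclusive comparison at the cutoff is not stated here.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests and cap the result at 1.0."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


raw_p_values = [0.0004, 0.003, 0.02, 0.5]   # hypothetical values
adjusted = bonferroni_adjust(raw_p_values)

CUTOFF = 0.01
significant = [p for p, adj in zip(raw_p_values, adjusted) if adj <= CUTOFF]

for p, adj in zip(raw_p_values, adjusted):
    print(f"raw={p:<7} adjusted={adj:.4f} passes={adj <= CUTOFF}")
```

With four tests, only a raw p-value of at most 0.0025 survives the 0.01 cutoff, which is why the adjustment is considered conservative.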

It also includes CSV files containing the data used for the graphs.
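
The per-gene-set files used in the following cells share one naming pattern (`rgs_fisher_cat<category>_apv` with an `.svg` or `.csv` extension, inside a folder named after the gene set). A small sketch with `pathlib` can build both paths in one place and report whether the files exist; the helper name `gene_set_outputs` is hypothetical.

```python
from pathlib import Path


def gene_set_outputs(output_directory, gene_set, category):
    """Build the chart and table paths for one gene set and category level."""
    base = Path(output_directory) / gene_set
    stem = f"rgs_fisher_cat{category}_apv"
    return {"svg": base / f"{stem}.svg", "csv": base / f"{stem}.csv"}


paths = gene_set_outputs("WormCat_Output", "hypodermis", "1")
for kind, path in paths.items():
    status = "found" if path.exists() else "missing"
    print(f"{kind}: {path} ({status})")
```

Checking for the files first gives a clearer message than the error raised by `SVG` or `pd.read_csv` when a gene set or category name is mistyped.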

In [None]:
gene_set = 'hypodermis'
category = '1'

In [None]:
SVG(filename = 'WormCat_Output/' + gene_set + '/rgs_fisher_cat' + category + '_apv.svg')

In [None]:
graph_csv = pd.read_csv('WormCat_Output/' + gene_set + '/rgs_fisher_cat' + category + '_apv.csv')

In [None]:
graph_csv

This is the end of the tutorial on using WormCat to deal with WormBase data!

In the next tutorial, we will generate Chromosome Maps with WormBase data.

Acknowledgements:
- WormCat (http://wormcat.com/)
- WormCat publication: Amy D Holdorf, Daniel P Higgins, Anne C Hart, Peter R Boag, Gregory Pazour, Albertha J M Walhout, Amy Karol Walker. 'WormCat: an online tool for annotation and visualization of _Caenorhabditis elegans_ genome-scale data.' GENETICS, February 1, 2020, vol. 214, no. 2, 279-294.