Makes metagene plots for bedgraphs over given regions in bed files for any organism. Regions can be continuous or spliced. Useful in analysis of ChIRP-seq, ChIP-seq, GRO-seq, ATAC-seq, iCLIP, ribosome profiling, RNA-seq, and other NGS datasets.
- Go to 'releases' above and download the latest tar.gz file. Unzip with `tar xvzf metagene-maker-0.x.tar.gz`
- Alternatively, you can clone this git repository using `git clone`.
- Go into the folder: `cd <metagene-maker-0.x>`
- Make sure you have the needed dependencies (below). Install: `sudo python setup.py install`. If you do not have sudo privileges, run `python setup.py install --user` or `python setup.py install --prefix=<desired directory>`. Be sure that the python you use to run `setup.py` is version 2.7; scripts WILL NOT WORK with lower versions (2.4, 2.5).
- Make config file (see below)
- Ensure that you have a bedgraph for every sample you want to analyze.
- Ensure that you have properly formatted BED6/12 files for every region for which you want to build average profiles. You can make these with the included `extractTranscriptRegions` module (see below).
- Run: `metagene_maker <config file> <name> <outputDir>` where `<config file>` is the configuration file you make using `example.conf` (provided) as the template. The example file is in the `test` folder. Instructions for making the configuration file are below. Run this in either `screen` or `nohup`.
- Output: tab delimited files for each region in a new `averages` folder in the user-provided output directory, as well as raw files named `allchr_sorted.txt` in each subfolder that contain binned profiles for each region and can be used for custom analysis.
usage: metagene_maker [-h] [-l binLength] [-p processors] [--sample] config_file prefix output_directory
example: metagene_maker -p 10 --sample config/test.txt M3_ChIP chip/
| positional arguments: | explanation |
|---|---|
| config_file | required configuration file |
| prefix | Prefix of output files |
| output_directory | Directory where output folders will be written |
| optional arguments: | explanation |
|---|---|
| -h, --help | show this help message and exit |
| -l binLength | Bases per window when processing bedgraph. Default is 2,000,000. |
| -p processors | Number of cores to use. Default is 4. |
| --sample | Run subsampling to make metagenes more robust. |
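For scripted runs, the options in the tables above can be assembled programmatically. The following is a minimal sketch, not part of metagene-maker: `build_command` is a hypothetical helper that only builds the argument list from the documented flags (`-l`, `-p`, `--sample`) and the three positional arguments.

```python
def build_command(config_file, prefix, output_dir,
                  bin_length=None, processors=None, sample=False):
    """Assemble a metagene_maker argument list from the documented options.

    Hypothetical helper for illustration; options left as None are omitted,
    so the tool's own defaults apply.
    """
    cmd = ["metagene_maker"]
    if bin_length is not None:
        cmd += ["-l", str(bin_length)]      # bases per window
    if processors is not None:
        cmd += ["-p", str(processors)]      # number of cores
    if sample:
        cmd.append("--sample")              # enable subsampling
    cmd += [config_file, prefix, output_dir]
    return cmd


# Reproduces the example command line shown above.
cmd = build_command("config/test.txt", "M3_ChIP", "chip/",
                    processors=10, sample=True)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.call(cmd)` once metagene-maker is installed and on your `PATH`.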
- Python (>=2.7)
- Numpy (a python module) (>=1.7)
- Pandas (a python module) (>=0.14)
At least 4 GB RAM if your largest bedgraph is 1 GB and you use 4 cores (empirical rule: n cores * m GB bedgraph --> mn GB RAM needed)
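The empirical rule above can be turned into a quick check before choosing a value for `-p`. This is a rough sketch of the arithmetic only; the function names are made up for illustration, and actual memory use may differ from the rule of thumb.

```python
def estimated_ram_gb(n_cores, largest_bedgraph_gb):
    """Rule of thumb from above: n cores * m GB bedgraph -> n*m GB RAM."""
    return n_cores * largest_bedgraph_gb


def max_cores_for_ram(available_ram_gb, largest_bedgraph_gb):
    """Largest core count whose estimated RAM fits in the available RAM.

    Always returns at least 1, even if one core is estimated not to fit.
    """
    return max(1, int(available_ram_gb // largest_bedgraph_gb))


print(estimated_ram_gb(4, 1))        # 4 cores, 1 GB bedgraph
print(max_cores_for_ram(16, 2.5))    # 16 GB RAM, 2.5 GB bedgraph
```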
You can supply your own BED6/12 files or use genome-wide BED files made using an included script, extractTranscriptRegions. You can start from either GTF files or files downloaded from UCSC as follows:
- Download the GTF file into the desired directory.
- From UCSC Genome Browser, go to Table Browser and choose your favorite organism/assembly. Choose "Genes and Gene Predictions" in 'group' and one of the gene tracks (we recommend UCSC Genes, Ensembl, or RefSeq).
- Choose 'selected fields from primary and related tables' for 'output format'.
- Columns MUST be in this format:
- name
- chrom
- strand (+/-)
- txStart
- txEnd
- cdsStart
- cdsEnd
- exonCount
- exonStarts
- exonEnds
- score
- name2
- Download the file.
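Because the column order matters, it can help to check the downloaded file before running the extraction. The sketch below is an illustration only, under two assumptions that are not stated above: that the first line of the download is a tab-separated header, and that it may carry a leading `#` and table-qualified field names (for example `hg19.knownGene.name`). `check_ucsc_header` is a hypothetical helper, not part of metagene-maker.

```python
REQUIRED_COLUMNS = [
    "name", "chrom", "strand", "txStart", "txEnd", "cdsStart",
    "cdsEnd", "exonCount", "exonStarts", "exonEnds", "score", "name2",
]


def check_ucsc_header(header_line):
    """Return a list of problems found in the header; empty if it matches.

    Assumes a tab-separated header; a leading '#' and any 'db.table.'
    prefix on each field are stripped before comparison.
    """
    raw = header_line.strip().lstrip("#").split("\t")
    fields = [f.split(".")[-1] for f in raw]
    problems = []
    if len(fields) != len(REQUIRED_COLUMNS):
        problems.append("expected {} columns, found {}".format(
            len(REQUIRED_COLUMNS), len(fields)))
    for i, (want, got) in enumerate(zip(REQUIRED_COLUMNS, fields), start=1):
        if want != got:
            problems.append("column {}: expected '{}', found '{}'".format(
                i, want, got))
    return problems


good = "#" + "\t".join("hg19.knownGene." + c for c in REQUIRED_COLUMNS)
print(check_ucsc_header(good))
```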
Run `extractTranscriptRegions -i <gene_file.txt> -o <output_prefix> [--ucsc|--gtf]`. Output will be a set of BED files for UTRs, CDSs, exons, introns, splice sites, TSSs, and TESs that can be used with metagene-maker.
folder: the name of the sample (should also be the name of the folder where sample-specific intermediate files will be made)
bedGraphLoc: absolute path to bedgraph
stranded: + if plus only, - if minus only, 0 if no strand information. IMPORTANT if your regions are also strand specific.
pairName: If a bedgraph is stranded, it must be part of a pair of bedgraphs (one + and one -) that share the same pairName.
regionType: name of region
fileLoc: absolute path to file specifying the regions of interest
limitSize: y if only regions >200bp and <200kb should be considered; n if no limitation
numBins: number of bins for the central region. Use 1 to get the average coverage across the entire region. To make plots, anywhere between 100 and 500 is sufficient.
sideExtension: number of nt's to extend on each side of the provided regions. Default is 0.
sideNumBins: number of bins for each of the side extensions
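To make the roles of `numBins`, `sideExtension`, and `sideNumBins` concrete, here is a simplified sketch of how a per-base coverage vector could be reduced to a binned profile. It illustrates the idea only and is not the code metagene-maker runs; the function names are hypothetical, and it assumes each segment is at least as long as its number of bins.

```python
def bin_means(values, num_bins):
    """Average a list of per-base values into num_bins near-equal bins."""
    n = len(values)
    if n < num_bins:
        raise ValueError("segment is shorter than the number of bins")
    means = []
    for i in range(num_bins):
        start = i * n // num_bins
        end = (i + 1) * n // num_bins
        chunk = values[start:end]
        means.append(sum(chunk) / float(len(chunk)))
    return means


def binned_profile(coverage, side_extension, num_bins, side_num_bins):
    """Bin the central region and, if present, both side extensions.

    `coverage` must already include `side_extension` bases on each side.
    """
    if side_extension == 0:
        return bin_means(coverage, num_bins)
    left = coverage[:side_extension]
    center = coverage[side_extension:-side_extension]
    right = coverage[-side_extension:]
    return (bin_means(left, side_num_bins)
            + bin_means(center, num_bins)
            + bin_means(right, side_num_bins))


# 4 nt extension on each side of an 8 nt region.
coverage = [0, 0, 0, 0, 1, 1, 3, 3, 5, 5, 7, 7, 2, 2, 2, 2]
print(binned_profile(coverage, 4, 4, 2))
# With one central bin, the middle value is the region's average coverage.
print(binned_profile(coverage, 4, 1, 2))
```

Setting `num_bins` to 1 collapses the central region to a single average, which matches the description of `numBins` above.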
the directory specified in the configuration file; contains all files generated by this pipeline
averages: contains one file for each region type; each file is a tab-delimited table (it can be opened in Excel) with graphable metagenes for each sample
<sample>: a folder for each sample, named as described in the config file; contains intermediate files described below
bedGraphByChr: bedgraphs split by chromosome
bins: in this folder, there are subfolders for each region type, containing profiles for each instance of the region, intermediate RData files, and metagene plots
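For custom analysis, the files in the `averages` folder can be loaded with a few lines of Python. This is a minimal sketch that assumes only what is described above: an `averages` folder inside the output directory holding tab-delimited files. The row and column layout inside each file is not specified here, so the hypothetical `load_averages` helper returns the raw rows without interpreting them.

```python
import csv
import os


def load_averages(output_dir):
    """Map each file name in <output_dir>/averages to its rows.

    Each row is a list of strings split on tabs; no header handling or
    numeric conversion is attempted.
    """
    averages_dir = os.path.join(output_dir, "averages")
    tables = {}
    for name in sorted(os.listdir(averages_dir)):
        path = os.path.join(averages_dir, name)
        if not os.path.isfile(path):
            continue  # skip any subfolders
        with open(path) as handle:
            tables[name] = list(csv.reader(handle, delimiter="\t"))
    return tables
```

For the example run shown earlier, `load_averages("chip/")` would return one entry per region type.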