# CRX Information Content
This repository contains all code, reasonably sized data, and directory structures necessary to reproduce figures in "Information Content Differentiates Enhancers From Silencers in Mouse Photoreceptors" by Friedman et al., 2021.
All necessary software packages are described in `bclab.yml`, along with the versions used to process data and produce figures. The easiest way to install these packages is to download Miniconda and then run `conda env create -f bclab.yml`. You can then activate the virtual environment with `conda activate bclab` and, from this directory, launch Jupyter (e.g. with `jupyter notebook`).
In addition, you will need to install gkmSVM. Download the C++ source code from the Beer lab website and compile it following the README. Be sure to change `utils/gkmsvm.py` to point to wherever the binaries were installed. To generate motifs from k-mer scores, you may need to convert the script `svmw_emalign.py` from Python 2 to Python 3, for example with Python's `2to3` tool.
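In practice, pointing `utils/gkmsvm.py` at the binaries usually means editing a module-level path. The sketch below is illustrative only (the constant and wrapper names are hypothetical, not the repository's actual API); it shows the general shape of such a wrapper around the compiled `gkmsvm_classify` binary:

```python
import subprocess

# Hypothetical constant -- point this at the directory that holds the
# compiled gkmSVM binaries (gkmsvm_kernel, gkmsvm_train, gkmsvm_classify).
GKMSVM_BIN_DIR = "/path/to/gkmsvm"

def classify_sequences(test_fasta, svseq_file, alpha_file, out_file):
    """Score sequences with a trained gkmSVM model via gkmsvm_classify."""
    subprocess.run(
        [f"{GKMSVM_BIN_DIR}/gkmsvm_classify",
         test_fasta, svseq_file, alpha_file, out_file],
        check=True,
    )
```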
All motif analysis was performed using version 5.0.4 of the MEME suite on the WashU High Throughput Computing Facility (HTCF). Wrapper scripts to run DREME and TOMTOM on HTCF using SLURM are located in `utils`.
## Organization of the repository
- All Jupyter notebooks used to process data and generate figures are in the base directory. They should be run in numbered order.
- `dataset_names.txt` contains a mapping from file names in the `Data` directory to the corresponding dataset number in the manuscript.
- `Data` contains FASTA files of the libraries as well as a BED file with the mm10 coordinates of the sequences. Some of the notebooks will generate additional files.
- `ActivityBins` is where FASTA files will be generated for different subsets of the data. This directory also contains the results from running DREME and TOMTOM.
- `Downloaded` contains all datasets that were downloaded for analysis:
  - `ChIP` contains ChIP-seq peaks for MEF2D and NRL in mm10 coordinates,
  - `CrxMpraLibraries` contains data from White et al., PNAS, 2013, and
  - `Pwm` contains the PWMs of the 8 TFs used for the motif analysis. The full mouse HOCOMOCO database (v11) should also be downloaded to this directory.
- `Rhodopsin` contains the raw barcode counts from the corresponding experiments. Note that these are the same files submitted to GEO, but with slightly different file names, because GEO does not have the directory structure used here. Notebooks 1 through 3 generate additional files in these directories.
- `Figures` will contain all figures and supplemental figures for the manuscript after the notebooks have been run. The only figure not generated with the Jupyter notebooks is Supplementary Figure 4, which shows the results of the motif analysis; the underlying TOMTOM results are saved in `ActivityBins`.
- `Models` will be generated in notebook 6. It contains different SVM and logistic regression models.
- `Sequencing` contains supplementary files and scripts to demultiplex the FASTQ files and count barcodes (see below).
- `utils` contains files with Python functions used to process, analyze, and visualize the data; wrapper scripts for running DREME and TOMTOM on SLURM; and scripts to demultiplex FASTQ files and count barcodes.
## Demultiplexing FASTQ files and counting barcodes
This task must be performed on a cluster such as HTCF. The `run.sh` scripts in the `Rho` directories submit SLURM jobs. To complete this task, clone the repository into your home directory on HTCF, then copy all files from the `Rho` directory into your `scratch` space. Download the FASTQ files from SRA to the same directory in your `scratch` space. Finally, run `sh run.sh` to submit the job to SLURM. If you want an email when the script finishes, add `--mail-user=<your email here>` immediately after `sbatch` in the script.
Once the job has completed, there will be `.counts` files in your `scratch` space, along with logs you can check and histograms showing the distribution of barcode counts in each sample. You can now copy the `.counts` files into your `Data` directory to use on your local machine. These are the same files already in the repository and submitted to GEO.
Addendum: due to requirements of SRA, the FASTQ files have already been demultiplexed. You can therefore comment out the line of `demultiplexAndBarcodeCount.sh` that runs `demultiplexFastq.sh` and only count barcodes. For internal Cohen lab users, the original FASTQ files from before demultiplexing are located in our lab `LTS` space on HTCF.
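As a rough illustration of what the barcode-counting step does, here is a stdlib-only sketch; the barcode position and length are made up, and the real pipeline (in `demultiplexAndBarcodeCount.sh` and the `utils` scripts) additionally matches barcodes against the expected set:

```python
from collections import Counter

def count_barcodes(reads, barcode_length=9):
    """Tally barcodes, assuming each read starts with a fixed-length barcode.

    Reads shorter than the barcode are skipped. Both the barcode position
    and the default length here are illustrative, not the library's design.
    """
    return Counter(
        read[:barcode_length] for read in reads if len(read) >= barcode_length
    )
```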
## Brief description of Jupyter notebooks
1. Processes library 1 with the Rho promoter from raw barcode counts to activity scores, and performs statistics comparing each sequence to the basal Rho promoter alone.
2. Same as notebook 1, but processes library 2 with the Rho promoter.
3. Processes libraries 1 and 2 with the Polylinker from raw barcode counts to activity scores.
4. Joins the data generated by the first three notebooks into one large table and annotates sequences with their activity groups and overlap with TF ChIP-seq datasets. Generates Supplementary Figures 1 and 2.
5. Computes the predicted occupancy of each TF on each sequence, then computes information content and related metrics.
6. Fits different models to the data with cross-validation and performs additional validation on the TF occupancy logistic regression model.
7. Generates all other figures for the manuscript, performs statistics, and displays additional data that may be useful.
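To make the occupancy and information-content steps concrete, here is a minimal stdlib-only sketch. It assumes a standard Boltzmann occupancy model and defines information content as the Shannon entropy of the normalized per-TF occupancy profile; the notebooks' actual implementations in `utils` may differ in details (energy units, chemical potential, normalization), so treat these names and parameters as illustrative:

```python
import math

def site_occupancy(energy, mu=0.0):
    """Boltzmann occupancy of one candidate site: 1 / (1 + exp(E - mu)).

    `energy` is the site's relative binding energy (e.g. a PWM mismatch
    score) and `mu` a chemical potential reflecting TF concentration.
    """
    return 1.0 / (1.0 + math.exp(energy - mu))

def predicted_occupancy(site_energies, mu=0.0):
    """Total predicted occupancy of one TF on a sequence: the sum over
    all candidate sites (one energy per position and strand)."""
    return sum(site_occupancy(e, mu) for e in site_energies)

def information_content(occupancies):
    """Shannon entropy (in bits) of the normalized occupancy profile
    across TFs; higher values mean occupancy is spread over more TFs."""
    total = sum(occupancies)
    if total == 0:
        return 0.0
    probs = [occ / total for occ in occupancies if occ > 0]
    return -sum(p * math.log2(p) for p in probs)
```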
Please cite the following article if you use this code in your research:
- Friedman RZ, Granas DM, Myers CA, Corbo JC, Cohen BA, and White MA, 2021. Information Content Differentiates Enhancers From Silencers in Mouse Photoreceptors. eLife 10:e67403. doi:10.7554/eLife.67403.