# Chicken Eye: Setup
Date: June 20 2025

Author: Ben Zazycki

Adapted from: Jared Tangeman

Professor: Dr. Chun Liang


## Workspace Setup

In [None]:
from google.colab import drive
drive.mount('/content/drive')
!rm -rf /content/sample_data
!sudo apt-get install -y libgsl-dev
!sudo apt-get install -y libhdf5-dev
%load_ext rpy2.ipython
%R .libPaths(c('/content/drive/MyDrive/Bioinformatics/Colab_Lib/R', .libPaths()))
# ^ NOTE: change this based on your individual drive setup

## Installing Packages

Note 1: Only run the cells to install the packages you haven't installed yet.

Note 2: Check the output of each install line. If a package is not being installed into the libPath you specified above, make sure you ran that line. If it still is not working, you can pass `lib = '/path/to/lib'` after the package name. The first install cell below is an example of this.

Note 3: There are two methods for installing R packages on Colab. First, you can use R cell magic with `%R` or `%%R`. Second, you can use a command-line call: `!R -e "insert code here"`, which is more reliable for some packages. Each `!R -e` call starts a new R session, so nothing from the `%R` session carries over, including the `.libPaths()` setting above and any loaded packages such as BiocManager. For that reason the BiocManager installs below use `%R`. If one method fails to install a specific package, try the other.
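As a rough aid for Notes 1 to 3, the Python sketch below lists which packages are not yet present in the Drive library and prints an `R -e` command with an explicit `lib` for each. It relies on the fact that an installed R package is a subdirectory of the library containing a `DESCRIPTION` file. The helper names `missing_packages` and `install_command` are made up for this illustration, and the library path is assumed to be the one used above.

```python
from pathlib import Path

# Assumption: the same Drive library passed to .libPaths() above.
R_LIB = "/content/drive/MyDrive/Bioinformatics/Colab_Lib/R"


def missing_packages(lib_dir, packages):
    """Return the packages that have no <lib_dir>/<package>/DESCRIPTION file."""
    lib = Path(lib_dir)
    return [p for p in packages if not (lib / p / "DESCRIPTION").is_file()]


def install_command(package, lib_dir=R_LIB):
    """Build a shell command that installs a CRAN package into lib_dir."""
    r_code = f"install.packages('{package}', lib = '{lib_dir}')"
    return f'R -e "{r_code}"'


# Print the commands for whichever of these packages are not installed yet.
for pkg in missing_packages(R_LIB, ["remotes", "devtools"]):
    print(install_command(pkg))
```

Passing `lib` explicitly matters here because a fresh `R -e` session does not see the `.libPaths()` call made through `%R`.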

In [None]:
!R -e "install.packages('remotes', lib = '/content/drive/MyDrive/Bioinformatics/Colab_Lib/R')"

In [None]:
!R -e "install.packages('devtools')"

In [None]:
%R remotes::install_github('satijalab/seurat-data')

In [None]:
%R install.packages("hdf5r")

In [None]:
%R remotes::install_github('10XGenomics/loupeR')

In [None]:
%R devtools::install_github("immunogenomics/presto")

In [None]:
!R -e "install.packages('BiocManager')"

In [None]:
!R -e "install.packages('reticulate')"

In [None]:
!R -e "install.packages('Seurat')"

In [None]:
!R -e "install.packages('Signac')"

In [None]:
!R -e "install.packages('ggpubr')"

In [None]:
!R -e "install.packages('ggplot2')"

In [None]:
!R -e "install.packages('future')"

In [None]:
!R -e "install.packages('DT')"

In [None]:
!R -e "install.packages('gprofiler2')"

In [None]:
!R -e "install.packages('Matrix')"

In [None]:
!R -e "install.packages('plotly')"

In [None]:
!R -e "install.packages('ggforce')"

In [None]:
!R -e "install.packages('patchwork')"

In [None]:
!R -e "install.packages('presto')"

In [None]:
!R -e "install.packages('scCustomize')"

In [None]:
%R BiocManager::install('glmGamPoi')

In [None]:
%R BiocManager::install('ensembldb')

In [None]:
%R BiocManager::install('JASPAR2024')

In [None]:
%R BiocManager::install('TFBSTools')

In [None]:
%R BiocManager::install('motifmatchr')

In [None]:
%R BiocManager::install('chromVAR')

In [None]:
%R BiocManager::install('GenomicRanges')

In [None]:
%R BiocManager::install('BSgenomeForge')

In [None]:
%R BiocManager::install('biovizBase')

In [None]:
%R BiocManager::install('DirichletMultinomial')

In [None]:
%R BiocManager::install('GenomeInfoDb')

In [None]:
%R BiocManager::install('BSgenome')

In [None]:
%R BiocManager::install('Biostrings')

In [None]:
%R BiocManager::install('rtracklayer')

In [None]:
%R install.packages('R.utils')

## Loading from Libraries

Load the packages necessary for this setup document:

In [None]:
%%R
library(R.utils)
library(Biostrings)
library(rtracklayer)
library(GenomeInfoDb)
library(BSgenome)
library(BSgenomeForge)

## Forging Custom Package

In this section, I will demonstrate creating a custom BSgenome data package for the relevant genome using BSgenomeForge. This section involves a lot of work with file paths. Be sure to change every path to match your own Drive setup.
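Because the same Drive directory recurs in nearly every cell, one option is to define it once in Python and derive the other paths from it. This is only a sketch: the names `BASE`, `R_LIB`, `GENOME_DIR`, and `PATHS` are illustrative, and the layout is assumed to match the one used in this notebook.

```python
from pathlib import PurePosixPath

# Assumption: the Drive layout used throughout this notebook; edit BASE to match yours.
BASE = PurePosixPath("/content/drive/MyDrive/Bioinformatics/Colab_Lib")
R_LIB = BASE / "R"
GENOME_DIR = BASE / "Saved_Files" / "GRCg7b.110"

# File names taken from the cells in this section.
PATHS = {
    "fasta_gz": GENOME_DIR / "Gallus_gallus.GRCg7b.fa.gz",
    "fasta": GENOME_DIR / "Gallus_gallus.GRCg7b.fa",
    "twobit": GENOME_DIR / "Gallus_gallus.GRCg7b.2bit",
    "seed": GENOME_DIR / "seed.txt",
}

for name, path in PATHS.items():
    print(f"{name}: {path}")
```

With this approach only `BASE` needs editing when the Drive layout differs; the R cells would still need the resulting strings passed in or pasted.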

First, make a directory to work in:

In [None]:
!mkdir -p /content/drive/MyDrive/Bioinformatics/Colab_Lib/Saved_Files/GRCg7b.110

Download and unzip genome sequence file from ENSEMBL website

In [None]:
%%R
# Large download: raise R's 60-second default timeout so the transfer is not cut off
options(timeout = max(3600, getOption("timeout")))
ensembl_url <- "ftp://ftp.ensembl.org/pub/release-110/fasta/gallus_gallus/dna/Gallus_gallus.bGalGal1.mat.broiler.GRCg7b.dna.toplevel.fa.gz"
download.file(ensembl_url, destfile = "/content/drive/MyDrive/Bioinformatics/Colab_Lib/Saved_Files/GRCg7b.110/Gallus_gallus.GRCg7b.fa.gz")
# Decompress next to the archive; remove = FALSE keeps the .gz file
R.utils::gunzip("/content/drive/MyDrive/Bioinformatics/Colab_Lib/Saved_Files/GRCg7b.110/Gallus_gallus.GRCg7b.fa.gz", remove = FALSE)
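For readers who prefer to do the decompression step from a Python cell, the sketch below mirrors the unzip step with `remove = FALSE`: it streams the archive to a new file and leaves the `.gz` in place. The function name `gunzip_keep` is made up for this example and is not part of R.utils.

```python
import gzip
import shutil


def gunzip_keep(src, dest=None):
    """Decompress src (.gz) to dest, leaving src in place; returns dest."""
    if dest is None:
        if not src.endswith(".gz"):
            raise ValueError("src must end with .gz when dest is omitted")
        dest = src[: -len(".gz")]
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        # Copy in chunks so the whole genome never has to fit in memory.
        shutil.copyfileobj(fin, fout)
    return dest
```

Called on the downloaded `Gallus_gallus.GRCg7b.fa.gz`, this would write `Gallus_gallus.GRCg7b.fa` alongside it.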

Read FASTA as DNAStringSet

In [None]:
%%R
fasta_path <- "/content/drive/MyDrive/Bioinformatics/Colab_Lib/Saved_Files/GRCg7b.110/Gallus_gallus.GRCg7b.fa"
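genome_seq <- readDNAStringSet(fasta_path)
# Ensembl FASTA headers carry a description after the name; keep only the name (e.g. "1", "W", "MT")
names(genome_seq) <- sub(" .*", "", names(genome_seq))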

Export to .2bit format

In [None]:
%%R
twobit_file <- "/content/drive/MyDrive/Bioinformatics/Colab_Lib/Saved_Files/GRCg7b.110/Gallus_gallus.GRCg7b.2bit"
export(genome_seq, twobit_file, format = "2bit")

Get sequence lengths from the 2bit file

In [None]:
%%R
two_bit <- TwoBitFile(twobit_file)
seqinfo_data <- seqinfo(two_bit)

Create Seqinfo object with explicit genome name, save for BSgenome

In [None]:
%%R
seqinfo_obj <- Seqinfo(
  seqnames = seqnames(seqinfo_data),
  seqlengths = seqlengths(seqinfo_data),
  genome = "GRCg7b")
saveRDS(seqinfo_obj, "/content/drive/MyDrive/Bioinformatics/Colab_Lib/Saved_Files/GRCg7b.110/seqinfo.rds")

Prepare seed file

In [None]:
%%R
seqnames_data <- seqnames(seqinfo_data)
seed_content <- sprintf(
"Package: BSgenome.Ggallus.ensembl.GRCg7b
Title: Gallus gallus (GRCg7b) from ENSEMBL
Description: Full genome from ENSEMBL release 110
Version: 1.0.0
organism: Gallus gallus
common_name: Chicken
provider: ENSEMBL
provider_version: GRCg7b
release_date: 2023
source_url: ftp://ftp.ensembl.org/pub/release-110/fasta/gallus_gallus/dna/
seqs_srcdir: /content/drive/MyDrive/Bioinformatics/Colab_Lib/Saved_Files/GRCg7b.110
seqfile_name: Gallus_gallus.GRCg7b.2bit
seqnames: c(%s)
circ_seqs: 'MT'
genome: GRCg7b
BSgenomeObjname: Ggallus
organism_biocview: Gallus_gallus",
# q = FALSE forces plain ASCII quotes; the default may emit curly quotes
paste(dQuote(seqnames_data, q = FALSE), collapse = ", "))

writeLines(seed_content, "/content/drive/MyDrive/Bioinformatics/Colab_Lib/Saved_Files/GRCg7b.110/seed.txt")

Check that output populated correctly

In [None]:
!cat /content/drive/MyDrive/Bioinformatics/Colab_Lib/Saved_Files/GRCg7b.110/seed.txt
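Beyond reading the `cat` output by eye, a small Python check can flag common problems in the seed file. The sketch below assumes the file is plain `Field: value` lines, as written above. The field list is simply the set of fields this notebook writes, not an authoritative list of what BSgenomeForge requires, and `parse_seed` and `check_seed` are illustrative names.

```python
# Fields written by the seed cell above (assumption: all should be non-empty).
EXPECTED_FIELDS = [
    "Package", "Title", "Description", "Version", "organism", "common_name",
    "provider", "provider_version", "release_date", "source_url",
    "seqs_srcdir", "seqfile_name", "seqnames", "circ_seqs", "genome",
    "BSgenomeObjname", "organism_biocview",
]


def parse_seed(text):
    """Parse 'Field: value' lines into a dict; indented lines continue the previous field."""
    fields = {}
    last = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0].isspace() and last is not None:
            fields[last] += " " + line.strip()
            continue
        key, sep, value = line.partition(":")  # split at the first colon only
        if not sep:
            raise ValueError(f"not a 'Field: value' line: {line!r}")
        last = key.strip()
        fields[last] = value.strip()
    return fields


def check_seed(text):
    """Return a list of human-readable problems found in the seed text."""
    fields = parse_seed(text)
    problems = [f"missing field: {k}" for k in EXPECTED_FIELDS if not fields.get(k)]
    seqnames = fields.get("seqnames", "")
    if "%s" in seqnames:
        problems.append("seqnames placeholder was not filled in")
    if any(q in seqnames for q in "\u201c\u201d"):
        problems.append("seqnames contains curly quotes")
    return problems
```

In the notebook this could be run as `check_seed(open(seed_path).read())`, where `seed_path` is a hypothetical variable holding the path to `seed.txt`; an empty list means none of these checks failed.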

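Forge the package. The first cell deletes any package directory left over from a previous run so that the forge step starts from a clean target: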

In [None]:
!rm -rf /content/drive/MyDrive/Bioinformatics/Colab_Lib/Saved_Files/GRCg7b.110/BSgenome.Ggallus.ensembl.GRCg7b

In [None]:
%R forgeBSgenomeDataPkg("/content/drive/MyDrive/Bioinformatics/Colab_Lib/Saved_Files/GRCg7b.110/seed.txt", destdir = "/content/drive/MyDrive/Bioinformatics/Colab_Lib/Saved_Files/GRCg7b.110")

Build and install the newly forged package:

In [None]:
%%bash
cd /content/drive/MyDrive/Bioinformatics/Colab_Lib/Saved_Files/GRCg7b.110
R CMD build BSgenome.Ggallus.ensembl.GRCg7b

In [None]:
%%R
path <- "/content/drive/MyDrive/Bioinformatics/Colab_Lib/Saved_Files/GRCg7b.110/BSgenome.Ggallus.ensembl.GRCg7b_1.0.0.tar.gz"
install.packages(path, repos = NULL, type = "source", lib = "/content/drive/MyDrive/Bioinformatics/Colab_Lib/R")

## Generating Input Files

Next, we need to generate four files that will be used as input for the main notebook:


1.   SeqInfo.csv: stores key information about the genomic sequences (names, lengths, circularity).
2.   Gene Transfer Format (.gtf): describes gene/transcript features of the reference genome.
3.   W.txt: a list of W-chromosome-linked genes.
4.   MT.txt: a list of mitochondria-linked genes.



FIRST: SEQINFO. Get sequence info from earlier 2bit file:

In [None]:
%%R
twobit_file <- "/content/drive/MyDrive/Bioinformatics/Colab_Lib/Saved_Files/GRCg7b.110/Gallus_gallus.GRCg7b.2bit"
two_bit <- TwoBitFile(twobit_file)
seqinfo_data <- seqinfo(two_bit)
seqinfo_data <- Seqinfo(seqnames = seqnames(seqinfo_data),
  seqlengths = seqlengths(seqinfo_data), genome = "GRCg7b")

Create data frame and output to file:

In [None]:
%%R
seqinfo_df <- data.frame(seqnames = seqnames(seqinfo_data),
  length = seqlengths(seqinfo_data),
  isCircular = seqnames(seqinfo_data) %in% c("MT"))
csv_file <- "/content/drive/MyDrive/Bioinformatics/Colab_Lib/Saved_Files/GRCg7b.110/GRCg7b.110.SeqInfo.csv"
write.csv(seqinfo_df, file = csv_file, row.names = FALSE, quote = FALSE)
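A quick sanity check on the resulting CSV is to confirm that `MT` is the only sequence flagged as circular. The Python sketch below assumes the column names written above (`seqnames`, `length`, `isCircular`) and that R writes logical values as `TRUE`/`FALSE`. The sample rows use made-up lengths, and `circular_seqs` is an illustrative helper name.

```python
import csv
import io


def circular_seqs(csv_text):
    """Return the seqnames whose isCircular column is TRUE."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [row["seqnames"] for row in rows if row["isCircular"] == "TRUE"]


# Toy rows with made-up lengths, in the same layout as the file written above.
sample = "seqnames,length,isCircular\n1,1000,FALSE\nW,500,FALSE\nMT,16,TRUE\n"
print(circular_seqs(sample))
```

If the real file returns an empty list, the sequence names probably do not contain a plain `MT` entry and are worth inspecting.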

NEXT: GTF (this is actually just a download and unzip)

In [None]:
%%R
gtf_url <- "https://ftp.ensembl.org/pub/release-110/gtf/gallus_gallus/Gallus_gallus.bGalGal1.mat.broiler.GRCg7b.110.gtf.gz"
gtf_zip_file <- "/content/drive/MyDrive/Bioinformatics/Colab_Lib/Saved_Files/GRCg7b.110/GRCg7b.110.gtf.gz"
gtf_file <- "/content/drive/MyDrive/Bioinformatics/Colab_Lib/Saved_Files/GRCg7b.110/GRCg7b.110.gtf"
download.file(gtf_url, destfile = gtf_zip_file)
R.utils::gunzip(gtf_zip_file, destname = gtf_file, remove=FALSE)

NEXT: W and MT files

Import GTF file, extract gene features

In [None]:
%%R
gtf <- rtracklayer::import(gtf_file, format = "gtf")
genes <- gtf[gtf$type == "gene"]

Set preferred identifier (name or ID), remove any NAs:

In [None]:
%%R
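# Prefer the gene name; fall back to the gene ID when the name is NA
get_preferred_id <- function(granges_obj) {
  ids <- ifelse(!is.na(granges_obj$gene_name),
                granges_obj$gene_name,
                granges_obj$gene_id)
  ids[!is.na(ids)]  # drop entries where both are NA
}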
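The fallback rule in this function can be illustrated on toy data. The Python sketch below is only an analogue of the R logic, with `None` standing in for `NA`; the gene labels are invented for the example and `preferred_ids` is a made-up name.

```python
def preferred_ids(gene_names, gene_ids):
    """Pick the name when present, otherwise the ID; drop entries with neither."""
    chosen = [name if name is not None else gene_id
              for name, gene_id in zip(gene_names, gene_ids)]
    return [c for c in chosen if c is not None]


# Toy input: one named gene, one with only an ID, one with neither.
names = ["geneA", None, None]
ids = ["ID_1", "ID_2", None]
print(preferred_ids(names, ids))
```

The second gene falls back to its ID and the third is dropped, which is the behavior the R function applies to the W and MT gene sets.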

Extract genes:

In [None]:
%%R
# For W chromosome genes:
genes_W <- genes[seqnames(genes) == "W"]
gene_ids_W <- get_preferred_id(genes_W)
# For MT chromosome genes:
genes_MT <- genes[seqnames(genes) == "MT"]
gene_ids_MT <- get_preferred_id(genes_MT)

Write output files:

In [None]:
%%R
output_W <- "/content/drive/MyDrive/Bioinformatics/Colab_Lib/Saved_Files/GRCg7b.110/GRCg7b.110.W.txt"
output_MT <- "/content/drive/MyDrive/Bioinformatics/Colab_Lib/Saved_Files/GRCg7b.110/GRCg7b.110.MT.txt"
write.table(gene_ids_W, file = output_W,
            row.names = FALSE, col.names = FALSE, quote = FALSE)
write.table(gene_ids_MT, file = output_MT,
            row.names = FALSE, col.names = FALSE, quote = FALSE)

List contents of working directory:

In [None]:
!ls /content/drive/MyDrive/Bioinformatics/Colab_Lib/Saved_Files/GRCg7b.110
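As an alternative to scanning the `ls` output, the four input files can be checked by name. The Python sketch below assumes the file names used in the cells above; `missing_inputs` is an illustrative helper, not part of any package.

```python
from pathlib import Path

# File names written by the cells in this section.
EXPECTED_INPUTS = [
    "GRCg7b.110.SeqInfo.csv",
    "GRCg7b.110.gtf",
    "GRCg7b.110.W.txt",
    "GRCg7b.110.MT.txt",
]


def missing_inputs(directory, expected=EXPECTED_INPUTS):
    """Return the expected input files that are not present in directory."""
    d = Path(directory)
    return [name for name in expected if not (d / name).is_file()]
```

Calling `missing_inputs` on the working directory should return an empty list once every step above has run.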