# Project Title

### Basic Information

**Ada McFarlane :** u1087069, ada.mcfarlane@gmail.com

**Rachel Muller :**  u0846845, mullerachel@gmail.com

**David Sant :** u0454956, david.sant@utah.edu

### Background and Motivation
This project uses data on 5-hydroxymethylcytosine (5hmC) and transcription. It is well known that changes in transcription rate (the amount of RNA made from a given gene in a given time) can lead to differences that determine cell type. Changes in transcription are also how a cell reacts to external stimuli, and this response is often dysregulated in disease. How external stimuli affect transcription is not completely understood, but one mechanism is covalent modification of the DNA within the cell.

Covalent modifications to DNA include methylation of cytosine at position 5 (5mC) and hydroxymethylation of cytosine at position 5 (5hmC). Because these modifications are covalent, they are stable and generally cannot be changed without the help of enzymes. These enzymes respond to changes in the environment, which in turn means that the DNA modifications respond to changes in the environment as well. Changes in DNA methylation have been shown to affect the rate of transcription, and imbalances in methylation are known to contribute to disease. Although it is known that DNA methylation affects transcription, the relationship between changes in DNA methylation (especially 5hmC) and transcription is not well understood.

For this project we are working with data from Schwann cells to investigate links between transcription and DNA methylation. Schwann cells are the cells that produce myelin in the peripheral nervous system (nerves outside of the brain and spinal cord, like the sciatic nerve). When Schwann cells don't produce enough myelin, it causes diseases like multiple sclerosis (MS), Charcot-Marie-Tooth disease (CMT), and diabetic neuropathy. It even contributes to the inability to heal after nerve injuries like spinal cord injuries. This is interesting because a global loss of 5hmC is found in many demyelinating diseases like diabetic neuropathy and CMT, and vitamin C has been shown to both restore 5hmC levels and improve myelination in these diseases, which leads to an improvement in phenotype. However, the cellular reason for this improvement in phenotype is not clear (meaning we don't know which 5hmC regions lead to changes in which genes that benefit myelination).

For our project, we are going to use data from whole transcriptome sequencing (RNA-seq) and from hydroxymethylome analysis (hMeDIP-seq) to investigate the relationship between changes in 5hmC and changes in transcription in Schwann cells. 

### Project Objectives

From this data we hope to determine whether a model can predict which genes will change in transcription level in response to the addition of vitamin C to the culture media (to enhance global 5hmC generation). We will build multiple models to predict the outcome for each gene. We will first train each model using half of the genes and use the model to predict the response of the other half of the genes. After this, we will train the models using all of the genes from the vitamin C dataset and see if we can predict how the genes will respond in Schwann cells after treatment with cyclic AMP (cAMP). These cells are from a different rat of the same strain and will likely have similar responses to changes in 5hmC. If a model is successful at predicting changes in transcription, we may be able to predict the changes in future datasets. We may also learn something about the relationship between 5hmC and transcription changes, which could give insight into the cause of the global loss of 5hmC in demyelinating diseases.
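The first evaluation step (train on half of the genes, predict the other half) could be set up roughly as follows. This is only a sketch: the gene IDs and labels are made up, and `split_genes_in_half` and `majority_baseline_accuracy` are hypothetical helper names rather than part of any existing pipeline. The majority-class baseline is included as an assumption about what a reasonable point of comparison might be, since most genes are expected to be nondifferential.

```python
import random
from collections import Counter

def split_genes_in_half(gene_ids, seed=0):
    """Shuffle gene IDs reproducibly and return (train, test) halves."""
    shuffled = sorted(gene_ids)  # sort first so the split depends only on the seed
    random.Random(seed).shuffle(shuffled)
    midpoint = len(shuffled) // 2
    return shuffled[:midpoint], shuffled[midpoint:]

def majority_baseline_accuracy(labels, train_ids, test_ids):
    """Accuracy of always predicting the most common label in the training half."""
    most_common = Counter(labels[g] for g in train_ids).most_common(1)[0][0]
    hits = sum(labels[g] == most_common for g in test_ids)
    return hits / len(test_ids)

# Hypothetical labels; real ones would come from the differential expression results.
labels = {
    "gene_a": "nondifferential", "gene_b": "upregulated",
    "gene_c": "nondifferential", "gene_d": "downregulated",
    "gene_e": "nondifferential", "gene_f": "nondifferential",
}
train_ids, test_ids = split_genes_in_half(labels, seed=42)
baseline = majority_baseline_accuracy(labels, train_ids, test_ids)
```

Any model trained on `train_ids` would need to beat `baseline` on `test_ids` to show that the 5hmC features carry predictive information.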

After we have found the most successful model, we will test whether it can predict the responses of genes in ARPE-19 cells, which are derived from the human retina. If this is successful, our results may be applicable to many other cell types and beneficial for research on many different diseases.

### Data

This project will involve whole transcriptome sequencing data (RNA-seq) and hydroxymethylome data (hMeDIP-seq) from three cell lines, each with two different treatments. The first cell line is primary Schwann cells from rat treated with or without 50 micromolar vitamin C. The second cell line is primary Schwann cells from a different rat (both Fischer rats) treated with or without 100 micromolar cAMP. The third cell line is ARPE-19 cells, an immortalized human retinal pigment epithelial cell line, treated with or without 50 micromolar vitamin C. All data was generated in the Gaofeng Wang lab. Unprocessed data for the Schwann cells treated with cAMP and the ARPE-19 cells can be obtained from Gene Expression Omnibus.

(https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE101153; 
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE121137) 

Data for the Schwann cells treated with vitamin C has been uploaded to Gene Expression Omnibus but has a public release date set for January 1, 2020.

### Ethical Considerations

The data from the Schwann cells was collected using primary cells from rats in the Gaofeng Wang lab at the University of Miami. Approval from the Institutional Animal Care and Use Committee (IACUC) was obtained and all animals were treated in accordance with IACUC policy. Primary cells from rats were passaged five times to expand the cells and decrease the number of animals used for experiments.

The data from human retinal pigment epithelial (RPE) cells was collected using ARPE-19 cells. These cells were collected with consent from a 19-year-old male after his death in 1995 (Dunn et al. 1996 Experimental Eye Research). After multiple trypsinizations, a line was obtained that could be passaged indefinitely. The cell line was donated to the American Type Culture Collection (ATCC) so that it could be distributed to researchers studying eye disease. No institutional review board (IRB) approval is required for use of cell lines that are obtained from sources such as the ATCC.


While this project is not a business model, there are a number of groups that would benefit from positive findings. These include:

* Neurologists or clinicians who deal with peripheral nerve damage and prevention.

* Patients who suffer from MS, CMT, diabetes, adrenoleukodystrophy, or nerve injury. 

No patents or trademarks can be legally placed on naturally occurring substances such as Vitamin C or cAMP. This means that there is no monetary concern or conflict associated with the findings of this study.

### Data Processing

We have already worked with the raw sequencing files (fastq files) to generate information about the transcripts and determine which transcripts are upregulated, downregulated, or nondifferential. We have also processed the fastq files from the hMeDIP-seq and determined which regions are enriched and which regions change in enrichment after treatment. However, the data will still need to be cleaned.

First, the read count data will need to be normalized to account for differences in sequencing coverage and in the size of genes or 5hmC enrichment regions (peaks). In any given cell type, the majority of annotated transcripts are not expressed. Additionally, even when transcripts are "expressed", our sequencing coverage is not sufficient to reliably determine differential expression below a certain threshold. We will search the literature for a good method of setting a cutoff that defines the genes we can reliably test for differential expression.

Second, there are over 200,000 regions of 5hmC enrichment throughout the genome, but only ~15,000 genes will be tested, and most genes contain many peaks. We will have to parse the data to determine the number of peaks per gene (total peaks, nondifferential peaks, upregulated peaks, and downregulated peaks), as well as the region of the gene that contains each peak (transcription start site, upstream promoter, downstream promoter, or gene body). In short, even though we have most of the data, it will still require significant cleaning and normalization.
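The two cleaning steps described above might look something like the sketch below. It makes several assumptions that are not settled in this proposal: RPKM (reads per kilobase per million mapped reads) is used only as one common way to correct for coverage and feature length, the region names are placeholder strings, and the gene IDs, helper names, and peak records are invented for illustration.

```python
from collections import Counter, defaultdict

def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads (assumed normalization)."""
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)

def count_peaks_per_gene(peaks):
    """Tally peaks for each gene, both overall and by (region, response) pair."""
    counts = defaultdict(Counter)
    for gene, region, response in peaks:
        counts[gene][(region, response)] += 1
        counts[gene]["total"] += 1
    return counts

# Hypothetical peak annotations: (gene, region of the gene, peak response).
peaks = [
    ("gene_a", "gene_body", "upregulated"),
    ("gene_a", "gene_body", "upregulated"),
    ("gene_a", "upstream_promoter", "nondifferential"),
    ("gene_b", "tss", "downregulated"),
]
peak_counts = count_peaks_per_gene(peaks)

# Hypothetical gene: 500 reads, 2,000 bp long, in a library of 10 million mapped reads.
example_rpkm = rpkm(500, 2000, 10_000_000)
```

Each gene's counter can then be flattened into one row of a feature table, with one column per region and response combination, for use in the models.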

### Exploratory Analysis

This data includes information about transcripts (genes) and peaks of 5hmC enrichment. Transcripts can be upregulated (increase in expression), downregulated (decrease in expression), or nondifferential (no change in expression). Similarly, enrichment peaks can be upregulated (increase in enrichment strength), downregulated (decrease in enrichment strength), or nondifferential (equal enrichment between conditions). In the picture below, the top two rows represent the VEGFA gene. You can see that the coverage (coloring) is greatly decreased in the second row indicating that the gene is downregulated. The bottom two rows represent 5hmC data. The two peaks with red stars increase greatly in the bottom row, indicating that these peaks are upregulated. It is important to understand that terms like "upregulated" and "downregulated" can be applied to both the peaks and to the transcripts, and we are looking for the influence of the peaks on the transcripts. 


![image.png](attachment:image.png)

First, we will use scatter plots to graph the number of upregulated peaks against the number of downregulated peaks in the gene body of each gene, and color code the genes based on their transcriptional response. We will repeat this for different pairs of peaks (upregulated gene body vs downregulated gene body). We will also make histograms dividing the data by transcript response type (upregulated, downregulated, nondifferential) and then divide the data by H3K4me3-dependent genes (data included in our dataset).

We would also like to explore the location of the peaks in relation to the transcript start site of the gene they reside in. This will be done by generating a scatter plot with the position of the peak, expressed as a proportion of the length of the gene, on the X-axis and the log of the fold change of the peak on the Y-axis. The color of each point will represent the transcript response type.
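The two plotted quantities could be computed along these lines. This is a sketch under assumptions: the proposal does not specify the log base or how zero-coverage peaks are handled, so log2 and a pseudocount of 1 are used here as placeholders, and the strand handling assumes gene coordinates are given as start < end on the reference genome. The function names are hypothetical.

```python
import math

def relative_position(peak_midpoint, gene_start, gene_end, strand="+"):
    """Distance from the transcript start site to the peak, as a fraction of gene length."""
    length = gene_end - gene_start
    if length <= 0:
        raise ValueError("gene_end must be greater than gene_start")
    if strand == "+":
        offset = peak_midpoint - gene_start  # transcription starts at gene_start
    else:
        offset = gene_end - peak_midpoint    # minus strand: transcription starts at gene_end
    return offset / length

def log2_fold_change(treated, control, pseudocount=1.0):
    """log2 ratio of treated to control signal; the pseudocount avoids division by zero."""
    return math.log2((treated + pseudocount) / (control + pseudocount))

# Hypothetical peak centered at 1,500 bp in a gene spanning 1,000 to 3,000 bp.
x_plus = relative_position(1500, 1000, 3000, strand="+")
x_minus = relative_position(1500, 1000, 3000, strand="-")
y = log2_fold_change(treated=7, control=1)
```

Values of the relative position below 0 or above 1 would indicate peaks that fall in promoter or downstream regions rather than in the gene body.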

### Analysis Methodology

Because transcripts will be classified as one of three types, not all methods can be used. We plan to make a decision tree for one of our analysis methods. This will be repeated multiple times with the data represented in different ways. We will also use K-nearest neighbors and Support Vector Machines to classify transcripts into their respective groups. We are aware that this is natural biological data and will not give super clean models, but we will use these three methods and determine which does the best job classifying transcripts from the Schwann cells treated with vitamin C. We will take the top performing model and test it again on the Schwann cells treated with cAMP and then on the human ARPE-19 cells treated with vitamin C to see how robust the model is when using data from a different cell line.
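To make the three-class setup concrete, here is a minimal K-nearest neighbors sketch written with only the Python standard library. It is an illustration, not the planned implementation: in practice a library such as scikit-learn, which provides decision tree, K-nearest neighbors, and support vector classifiers, would likely be used for all three methods. The two features (upregulated and downregulated gene body peak counts), the data values, and the choice of k=3 are all made up.

```python
import math
from collections import Counter

def knn_predict(train_features, train_labels, query, k=3):
    """Predict a label by majority vote among the k nearest training points."""
    distances = sorted(
        (math.dist(features, query), label)
        for features, label in zip(train_features, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

def accuracy(train_features, train_labels, test_features, test_labels, k=3):
    """Fraction of test points whose predicted label matches the true label."""
    predictions = [knn_predict(train_features, train_labels, q, k) for q in test_features]
    correct = sum(p == t for p, t in zip(predictions, test_labels))
    return correct / len(test_labels)

# Made-up features: (upregulated gene body peaks, downregulated gene body peaks).
train_features = [(5, 0), (6, 1), (4, 0), (0, 5), (1, 6), (0, 4), (1, 1), (0, 0), (1, 0)]
train_labels = ["upregulated"] * 3 + ["downregulated"] * 3 + ["nondifferential"] * 3
test_features = [(5, 1), (0, 6), (0, 1)]
test_labels = ["upregulated", "downregulated", "nondifferential"]

score = accuracy(train_features, train_labels, test_features, test_labels)
```

The same `accuracy` comparison, applied to each of the three model types on the held-out vitamin C genes, is one way to pick the top performing model before testing it on the cAMP and ARPE-19 data.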

### Project Schedule
We have made a beautiful Gantt chart depicting our schedule for this project.


![Project%20Schedule.png](attachment:Project%20Schedule.png)





