# GCB535 Intro to Bioinformatics: Remediation Final

#### Provide your name below.

## Instructions

Please make sure you illustrate your answer clearly and that the code can be executed and generates the output that you have saved.

All the data are in the folder named: **Remed_Data**.

This remediation is composed of **two** units and covers:  

(I) Programming Basics (45 points total)  
(II) Data Wrangling (30 points total)

The corresponding points for each question are provided in the headers for each section.

Given that you will need to write **both R and python code**, remember to change your **kernel** accordingly.

## Unit I. Programming Basics

### Unit I-A. UNIX commands and usage (5 questions; 2 points for q1.1-1.3, 3 points for q1.4-1.5, 12 total points)

* Where indicated, please write the UNIX __code__ used to answer the question in the cell(s) labelled "Command".
* Where indicated, please copy the UNIX __output__ from the terminal of the executed code for the question into the cell(s) labelled "Returned Value".

1. You've been provided a text file: **UNIT1/UNIT-I_q1_data.txt**. It contains some summary data for a large number of single nucleotide polymorphisms (SNPs). This file contains four columns: (i) the dbSNP unique rs-identifier for the variant, (ii) the chromosome where that variant is located, (iii) the position on that chromosome where that variant is located, and (iv) a gene near that variant.

   In the following, you will use this file to perform a collection of UNIX commands to examine the file and answer questions about the data that are contained therein.

1.1. How many SNPs are provided in the file?

Command:

Returned Value:

1.2. Use UNIX command(s) to:

* Output to screen the last 10 lines of this file.

Command:

Returned Value:

1.3. We have provided a second file: **UNIT1/UNIT-I_q1_genelist.txt**. Use UNIX command(s) to:

* create a new file that lists the variants in "UNIT-I_q1_data.txt" that are near the genes listed in the gene list file above
* name this file "core_circadian_snp_lookup.txt"
* create a directory called "clock_lookup"
* make a copy of the output file you just created into this new directory

Command:

1.4. Use UNIX command(s) to:

* Count the total number of SNPs found for each chromosome in the file: **UNIT1/UNIT-I_q1_data.txt**

Command(s):

Returned Value:

1.5. Use UNIX command(s) to:

* sort the list of variants in the file: **UNIT1/UNIT-I_q1_data.txt** in ascending order: 

(i) first, by chromosome number on which the variant is found, and 

(ii) then, by the position on the chromosome. e.g., 
       
       ...
       rs369554471 9  141032511 RP11-424E7.3
       rs60717533  9  141055738 TUBBP5
       rs11253280  10 119812    TUBB8
       rs7906287   10 135853    IL9RP2
       ...

* output the sorted list to a new file called "UNIT-I_q1_data_sorted.txt"
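The ordering requested here is numeric on both keys, which is why chromosome 9 appears before chromosome 10 in the example. As an illustration only (in Python, not the UNIX commands the question asks for), the difference between a numeric and a text sort can be sketched with the four example rows above. The sketch assumes every chromosome label is an integer; labels such as X or Y would need separate handling.

```python
# Rows copied from the example above, deliberately shuffled:
# (rsID, chromosome, position, gene).
rows = [
    ("rs7906287", "10", "135853", "IL9RP2"),
    ("rs60717533", "9", "141055738", "TUBBP5"),
    ("rs11253280", "10", "119812", "TUBB8"),
    ("rs369554471", "9", "141032511", "RP11-424E7.3"),
]


def by_chrom_then_pos(row):
    # Assumption: both fields are integers, so convert before comparing.
    return (int(row[1]), int(row[2]))


# Numeric sort: chromosome 9 comes before chromosome 10.
sorted_rows = sorted(rows, key=by_chrom_then_pos)

# Text sort on the same keys: "10" sorts before "9" because "1" < "9".
text_sorted = sorted(rows, key=lambda row: (row[1], row[2]))

for row in sorted_rows:
    print("\t".join(row))
```

The same distinction between numeric and lexicographic comparison applies to whichever sorting tool is used.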

Commands:

### Unit I-B. R basics and usage  (Seven questions; two points for Q2.1, 2.2, 2.4, 2.5, 2.6; three points for Q2.3; 5 points for Q2.7, 18 total points)

* Where indicated, please write the R __code__ used to answer the question in the cell(s) labelled **"Code"**.
* After each portion of code, make sure you __execute__ the code to report the result and answer. 
    - To remind you, we have indicated this with the label: **"Code + Result"**
* Where indicated, please write your __interpretation__ to answer the question in the cell labelled "Explanation".

**TIP:** In this section, you may find it helpful to set the kernel to **"R (R-Project)"**. In this way, you can execute the code that you describe to answer the question!

2.1. Use R to perform the following:

* Create a variable called **"mychisq"** that stores a list of 1000 values that are drawn from a chi-squared distribution with 1 degree of freedom; 
* Create a variable called **"mynorm"** that stores a list of 1000 values that are drawn from a normal distribution with mean 0 and variance 1.
* Create a variable called **"is.this.chisq"** that stores the square of the absolute values for the list **mynorm**. i.e., 

( |mynorm| )^2
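For orientation, squaring the absolute value of a number gives the same result as squaring the number itself, and a chi-squared distribution with 1 degree of freedom has mean 1 and variance 2. The following is a minimal Python sketch of that relationship, not the R code the question asks for. The seed and the names `draws` and `squared` are arbitrary choices made for this illustration.

```python
import random

random.seed(535)  # arbitrary seed, used only to make the sketch repeatable

# 1000 draws from a normal distribution with mean 0 and standard deviation 1.
draws = [random.gauss(0.0, 1.0) for _ in range(1000)]

# (|x|)^2 is the same number as x^2 for every draw.
squared = [abs(x) ** 2 for x in draws]

mean_sq = sum(squared) / len(squared)
var_sq = sum((s - mean_sq) ** 2 for s in squared) / (len(squared) - 1)

# For comparison: chi-squared with 1 degree of freedom has mean 1, variance 2.
print("mean = %.3f, variance = %.3f" % (mean_sq, var_sq))
```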

Code + Result:

2.2. Use R to perform the following:

* Plot histograms each for **mychisq** and **is.this.chisq** 

Code + Result:

2.3. Use R to perform the following:

* Print out summaries for **mychisq** and **is.this.chisq**
* Print the mean and variance of **mychisq** and **is.this.chisq**
* Perform a test to determine if the distributions in the two lists are statistically different from each other. Please describe the test you select, why you picked it, and write out the conclusion below.

Code + Result:

Explanation:

2.4. We have provided you a file: **UNIT1/pmbb_pc_data.txt**. This file contains the results from a principal components analysis of genetic data of 914 individuals in preparation for a genome-wide association study. Each row of this file is a unique, unrelated individual, and the columns refer to the individual identifier and the first two principal components that have been estimated from the genetic data.

You want to visualize these data. What type of plot would you use? Give your rationale. 

Explanation:

2.5. Use R to perform the following:

* Read in **UNIT1/pmbb_pc_data.txt** data file into R and store into a variable.
* Create the plot that you indicated above to visualize the data.


Code + Result:

2.6. Based on this plot, describe the patterns of data that you see, what they could mean, and your interpretation.

Explanation:

2.7. You decide to generate a list of individuals that seem very dissimilar to the bulk of the individuals in the data. You suspect that the quality of the DNA samples from which these data were generated may be compromised, so these individuals are not useful to analyze further.

* Provide R code that outputs a list of individuals that appear to be 'outliers' from the bulk of individuals in the data.
* Write the list of the individuals that you propose to remove to a file called "outlier_list.txt"
* After removing these individuals, recreate the plot you proposed above.

Code + Result:

### Unit I-C. Python basics and usage (Three Questions, 5 points per question, 15 points total)

* Where indicated, please write the Python __code__ used to answer the question in the cell(s) labelled "Code".
* Where indicated, please __execute the function__ to report the result and answer. 
    - To remind you, we have indicated this with the label: **"Code + Result"**
    
**NOTE:** In this section, make sure you set the kernel to **"Python 2 (Ubuntu Linux)"**. This will allow you to execute the code that you describe to answer the question!

In the following three questions (3.1, 3.2, and 3.3), you will implement a function in Python that achieves the stated description.

3.1. In .sam files, you might remember that the quality scores for a read are given by a string of characters that is the same length as the read. 

Create a function that returns the positions of **all** occurrences of the base quality score "A" for a given string of quality score values.

The input to the function (i.e., qstr) is of type **str**.

The type of variable that should be returned by the function (i.e., rtype) should be of type **List[int]**.

Code:

def numMaxQualBaseinRead(qstr):
    """
    :type qstr: str
    :rtype: List[int]
    """
    # write your code below

Code + Result:

numMaxQualBaseinRead("2374AIIIHIIHHGHSDHADHSAHGHDHSG")

3.2. Imagine you have a read where at least 90% of the bases have a quality score of 40 or greater, i.e., at least 90% of the characters in the quality string provided are one of:

       I J K L 

Return the number of reads in the input list that meet this criterion.

The input to the function (i.e., variable "qlist") is of type **List[str]**.

The type of variable that should be returned by the function (i.e., rtype) should be of type **int**.
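The characters I, J, K, and L correspond to scores of 40 and above because each quality character encodes a Phred score through its ASCII code. A minimal sketch of that mapping follows, assuming the common Phred+33 encoding (an assumption, although it is consistent with "I" meaning 40 as stated above). The helper name `phred_score` is hypothetical, and the sketch is not a solution to the question.

```python
PHRED_OFFSET = 33  # assumption: Phred+33 encoding


def phred_score(ch):
    """Return the Phred quality score encoded by one quality character."""
    return ord(ch) - PHRED_OFFSET


for ch in "AIJKL":
    print("%s -> %d" % (ch, phred_score(ch)))
```

Under this assumption, "A" encodes a score of 32, and "I" through "L" encode 40 through 43.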

Example:

Input: ["JJJJJJJJJ","ABABHJHJHBBBAB","IJKLIJKLIJKL"]

Output: 2

Code:

def commonQualScores(qlist):
    """
    :type qlist: List[str]
    :rtype: int
    """
    # Write your code below
    

Code + Result:

mylist = ["JJJJJJJJJ","ABABHJHJHBBBAB","IJKLIJKLIJKL"]
commonQualScores(mylist)

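3.3. Write a Python function that takes two dictionaries as input and, for every key the two have in common, subtracts the value in the second from the value in the first. As the sample output below shows, keys found only in the first dictionary keep their original value, and keys found only in the second dictionary are omitted.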

Example:

Input:

X = {'a': 100, 'b': 200, 'c':300}

Y = {'a': 300, 'b': 200, 'd':400}

Sample output: {'a': -200, 'b': 0, 'c': 300}


The inputs to the function (i.e., X and Y) are two variables of type **Dict{str:int}**.

The type of variable that should be returned by the function (i.e., rtype) should be of type **Dict{str:int}**.

Code:

def subtractDict(X,Y):
    """
    :type X: Dict{str:int}
    :type Y: Dict{str:int}
    :rtype: Dict{str:int}
    """
    # Write your code below
    

Code + Result:

X = {'a': 100, 'b': 200, 'c':300}
Y = {'a': 300, 'b': 200, 'd':400}
subtractDict(X,Y)

## Unit II. Data Wrangling

### 4. Data processing via tidyverse (tidyr, dplyr)  (5 questions; 3 points per question, 15 points total)

* Where indicated, please write the R __code__ used to answer the question in the cell(s) labelled "Code".
* Where indicated, please execute the function to report the result and answer.
    - As a reminder to you, we have indicated this with the label: **"Code + Result"**

**NOTE:** In this section, make sure you set the kernel to **"R (R-Project)"**. This will allow you to execute the code that you describe to answer the question!

For questions 4.1 - 4.5, you will work with data that contain information on samples stored in a DNA biobank. These summary data contain: a plate-identifying label (where the sample is stored), self-identified ethnicity for the sample, diabetes status, and age.

You can find data in the file: **UNIT2/PMBB_labIDs_T2D_finalset.txt**

Please load the set of **all** packages required to answer questions **4.1 - 4.5** in the cell below.

Code + Result:

4.1. Use R and/or tidyverse code to:

* Read in the file: **UNIT2/PMBB_labIDs_T2D_finalset.txt**; store this file in a variable
* Create summaries and plot the distributions of the following variables:
    - **Current_AGE** (for all subjects)
    - number of individuals for each label given by **SELFID_ANCS_CODE**

Code + Result:

4.2. The data file provided contains a column labelled **Plate**, which identifies the plate on which each sample is stored, and a column labelled **SELFID_ANCS_CODE**, which reports the self-identified ethnicity for each sample.

Use the provided data and R and/or tidyverse code to: 

* Report the count for the total number of unique plates
* Report the count for the total number of samples found on each plate
* Report the count for the total number of samples for each category defined by SELFID_ANCS_CODE

Code + Result:

4.3. Use R and/or tidyverse code to:

* Create a variable called "aa_t2d_ca". This variable stores a table that reports, for each plate, the total samples that meet the following two criteria:
          
          SELFID_ANCS_CODE is BLACK
          DM_FLG is 1

* Create a variable called "aa_t2d_ct". This variable stores a table that reports, for each plate, the total samples that meet the following two criteria:

          SELFID_ANCS_CODE is BLACK
          DM_FLG is 0

* Merge these two tables aa_t2d_ca and aa_t2d_ct into a single table called "aa_t2d_byplate"


Code + Result:

4.4. Use R and/or tidyverse code to:

* For each category in **SELFID_ANCS_CODE**, calculate the average of the values for the column labelled **Current_AGE** for individuals where **DM_FLG** is 1
* For each category in **SELFID_ANCS_CODE**, calculate the average of the values for the column labelled **Current_AGE** for individuals where **DM_FLG** is 0


Code + Result:

4.5. Can you conclude that the average age of white subjects with diabetes is statistically different from subjects with unknown ethnicity with diabetes? Use R code to justify your answer.

Code + Result:

Explanation:

### 5. .fasta file issues, quality control pipeline, and analysis (15 total points)

Your undergraduate research assistant has somehow 'accidentally' managed to do some very strange things to a .fasta data file of core circadian clock gene sequences that you have generated for a novel model organism (Madeupicus remediare).

**UNIT2/core_clock_sequences.fasta**

Unfortunately, these data haven't been uploaded to NCBI, and there's no backup of the data! So, you are going to need to take a look at the file and figure out what has gone wrong, and fix it.

Luckily, you know that:
* The sequence of genes you collected were the same ones listed in **UNIT1/UNIT-I_q1_genelist.txt**
* The list of gene sequences were only DNA sequences

In the following question:

(Part 1) **[6 points]** Report (in human terms) each issue you uncover and have to fix for the .fasta file. Each issue can be succinctly described (i.e., in 10 words or fewer)

(Part 2) **[7 points]** Use code to create a quality control (QC) pipeline. Using any programming language you feel comfortable with (UNIX, R, and/or Python), this pipeline should:
* Process the 'raw' .fasta in memory (i.e., UNIX, R, Python)
* Correct the issue(s), outputting intermediate files as required 
* Ultimately produce a final, corrected output: an appropriately formatted .fasta file named **core_clock_sequences_fixed.fasta**.

It is perfectly acceptable to use different commands to address different issues you find (e.g., UNIX commands to solve one issue, loading an output file into R to solve another set of issues). 

But here we want you to make sure all lines of code and the steps can be "reproduced" by copy-pasting and re-executing code that you have created, starting with the original version of the file.

(Part 3) **[2 points]** Run this code to clean up the .fasta file. Based on the 'clean' file, calculate the total nucleotide length of each sequence, and report those totals for each sequence included. 
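As a reminder of what an appropriately formatted .fasta file looks like, each record has a header line beginning with ">" followed by one or more sequence lines. The sketch below parses a small in-memory string and reports each record's length and whether it contains only DNA characters. The toy records, the helper names `parse_fasta` and `is_dna`, and the accepted alphabet (A, C, G, T, N) are all assumptions made for illustration. Nothing here is taken from **core_clock_sequences.fasta**, and the sketch does not perform any of the fixes asked for above.

```python
DNA_ALPHABET = set("ACGTN")  # assumption: upper-case bases, N for unknown


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            # Sequence lines seen before any header are silently dropped here.
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def is_dna(sequence):
    """True when every character is in the assumed DNA alphabet."""
    return set(sequence.upper()) <= DNA_ALPHABET


# Hypothetical toy records; geneB contains a U, which is not a DNA base.
toy = ">geneA\nACGTAC\nGTTA\n>geneB\nACGUAC\n"
for name, seq in parse_fasta(toy):
    print("%s length=%d dna_only=%s" % (name, len(seq), is_dna(seq)))
```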

**(Part 1)** Explanation (issues in the file; each issue should be described in 10 words or fewer):

**(Part 2)** Quality Control Pipeline; Code + Result:

**(Part 3)** Analysis; Code (for Analysis):

**(Part 3)** Returned Result: