# Track the flow of work we have done in processing the datasets

**Step 1: Download dataset**


First we will download the data from the [ENCODE](https://www.encodeproject.org/) website. Many different experiments and datasets are available there, and you can download both raw and fully processed data. For our learning purposes, we will download the raw data (fastq). Let's take a look at the knockdown of LIN28B in K562 cells (data [here](https://www.encodeproject.org/experiments/ENCSR598YQX/)). When you have found the link to a fastq file, right click on it and select "Copy Link Address".

Then, on TSCC, move into the directory where you would like your data to end up and paste the link you copied after the "wget" command (remember, this is what we did when we downloaded Anaconda). Keep in mind that this data is paired-end, so there are two read files per dataset (R1 and R2) and you will need to download both.

    cd ~/raw_data/
    
Let's make a directory in raw_data specifically for the raw data for this project. 

    mkdir ~/raw_data/lin28b_shrna/
    
Then move into that directory before running wget. REMEMBER TO USE TAB COMPLETION TO MOVE EASILY BETWEEN DIRECTORIES.

    cd ~/raw_data/lin28b_shrna/

    wget https://www.encodeproject.org/files/ENCFF653FTD/@@download/ENCFF653FTD.fastq.gz
    
    wget https://www.encodeproject.org/files/ENCFF621LMO/@@download/ENCFF621LMO.fastq.gz
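
Before moving on, it can be worth confirming that both downloads completed. The following is a minimal sketch and not part of the original workflow: `check_fastq_gz` is a hypothetical helper name, and it assumes `gzip` and `wc` are available on TSCC. It tests the gzip integrity of a file and reports how many reads it contains (a fastq record is four lines).

```shell
# Hypothetical helper (not part of the course material): verify a gzipped
# fastq file and report its read count.
check_fastq_gz() {
    local file="$1"
    # gzip -t exits non-zero if the archive is truncated or corrupt
    if ! gzip -t "$file" 2>/dev/null; then
        echo "$file: FAILED gzip integrity check" >&2
        return 1
    fi
    # Each fastq record spans four lines, so reads = lines / 4
    local lines
    lines=$(gzip -dc "$file" | wc -l)
    echo "$file: OK, $((lines / 4)) reads"
}

# Example usage, run from ~/raw_data/lin28b_shrna/:
#   check_fastq_gz ENCFF653FTD.fastq.gz
#   check_fastq_gz ENCFF621LMO.fastq.gz
```

Because the data is paired-end, the R1 and R2 files would be expected to report the same number of reads.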
    


**Step 2: Run fastqc to check the sequencing quality of the reads that you downloaded. Remember that we installed fastqc with:**

    conda install -c bioconda fastqc
    
You can see that it has installed properly with:

    which fastqc
    
The output should be something like:

    ~/anaconda2/bin/fastqc
    
*Q. Why is it finding the program in this location?*
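
As a hint for the question above, the short sketch below prints the directories your shell searches for programs, in order. It assumes a bash-like shell; `command -v` is used here as a stand-in for `which` and reports the first match it finds.

```shell
# List the directories the shell searches for programs, one per line.
# Earlier directories take priority over later ones.
echo "$PATH" | tr ':' '\n'

# Report where fastqc is found, or say so if it is not on the PATH.
command -v fastqc || echo "fastqc not found on PATH"
```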

Let's make a directory in projects for our new lin28b_shrna project, and make another directory within that folder for the results of our fastqc run.

    mkdir ~/projects/lin28b_shrna/
    mkdir ~/projects/lin28b_shrna/fastqc/

Run fastqc to check the quality of your sequencing results. Remember to specify the full path of where your datasets are stored and where you want the processed data to end up. Here we will do this one file at a time. REMEMBER TO USE TAB COMPLETION TO AVOID TYPOS! The -o argument specifies the location of the output files.

    fastqc ~/raw_data/lin28b_shrna/ENCFF621LMO.fastq.gz -o ~/projects/lin28b_shrna/fastqc/
    fastqc ~/raw_data/lin28b_shrna/ENCFF653FTD.fastq.gz -o ~/projects/lin28b_shrna/fastqc/
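
If you end up with more than a couple of files, a loop saves typing. This is only a sketch: `run_fastqc_all` is a hypothetical helper name, and it assumes every file you want to check ends in `.fastq.gz` and that fastqc is on your PATH.

```shell
# Hypothetical helper: run fastqc on every .fastq.gz file in a directory.
run_fastqc_all() {
    local raw_dir="$1" out_dir="$2"
    mkdir -p "$out_dir"           # create the output directory if it is missing
    local fq
    for fq in "$raw_dir"/*.fastq.gz; do
        [ -e "$fq" ] || continue  # skip if the pattern matched nothing
        fastqc "$fq" -o "$out_dir"
    done
}

# Example usage:
#   run_fastqc_all ~/raw_data/lin28b_shrna ~/projects/lin28b_shrna/fastqc
```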


**Step 3: Move file outputs to your Desktop**

**For Windows Users:**

Download WinSCP. Go to this [website](https://winscp.net/eng/download.php) and click "Installation package." The program should start downloading (or you can download it directly from this [link](https://winscp.net/download/WinSCP-5.9.1-Setup.exe)).

This file will appear in your downloads. Double click on it and follow the installation instructions. A WinSCP icon should appear on your desktop. There is also a folder called WinSCP under "All Programs" in your Start menu.

Double click on the WinSCP icon on your desktop and enter the following settings:

File protocol: SCP

Host name: tscc-login.sdsc.edu

User name: ucsd-train##

Advanced - SSH - Authentication - Private key file: click on the 3 dots after the field and find your private key file that you have saved on your Desktop.

Click OK

Login - "Yes"

You can now see the files in your home folder on TSCC. Use the refresh button to see the most current contents. Drag and drop files between your home computer and TSCC.

**For Mac users:**

Use scp (secure copy). scp is a command-line program that you run from the terminal, and the syntax is always:

    scp sourcefile destinationfile

This is the same syntax that you learned for cp (copy), but there is one added step for scp (secure copy).

Since in this instance the source file is on TSCC, you need to include your login information followed by a colon before the full path of the file you would like to copy.

Notice the * at the end of the source path in the commands below. This is a wildcard character. It will copy all files that share the same prefix.

The destination is simply ./ meaning the directory that we are currently sitting in on our home computer. This works because we first moved into the directory where we want these files to be copied.

On your local machine (NOT TSCC), make a folder where you want this data to land and move into it. In this example, I am copying data into a folder on my desktop called module_1 (note this folder is located inside a BIOM200 folder... take a look at the full path). Keep in mind this folder must exist before you try to copy something into it. Make sure you are running these commands from your LOCAL MACHINE (NOT TSCC).
    
    mkdir -p ~/Desktop/BIOM200/module_1
    cd ~/Desktop/BIOM200/module_1
    scp ucsd-train##@tscc-login.sdsc.edu:~/projects/lin28b_shrna/fastqc/ENCFF621LMO_fastqc* ./
    
or:

    mkdir -p ~/Desktop/BIOM200/module_1
    scp ucsd-train##@tscc-login.sdsc.edu:~/projects/lin28b_shrna/fastqc/ENCFF621LMO_fastqc* ~/Desktop/BIOM200/module_1/
    
The most common error message associated with this command is that the file or destination does not exist. When possible, use tab completion to avoid typos. Copying and pasting paths directly also works well. If you are getting this error, log on to TSCC and check the path you wrote there. You can check whether you typed it right with:

    ls ~/projects/lin28b_shrna/fastqc/ENCFF621LMO_fastqc*
    
If it tells you this doesn't exist, there is a typo somewhere or you haven't defined the full path properly.
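
To avoid retyping the long source path, the copy step can be wrapped in a small function. This is a sketch under stated assumptions: `fetch_fastqc_reports` is a hypothetical helper name, the remote directory is the one used in this tutorial, and the user name you pass in is your own `ucsd-train##` login. The source is quoted so that your local shell leaves `~` and `*` alone; some shells (zsh, the default on recent macOS, is one I would expect to behave this way) stop with a "no matches found" error when an unquoted `*` matches nothing locally.

```shell
# Hypothetical helper: copy the fastqc reports for one sample from TSCC
# into a local folder. Run this on your LOCAL machine, not on TSCC.
fetch_fastqc_reports() {
    local user="$1" sample="$2" dest="$3"
    local host="tscc-login.sdsc.edu"
    local remote_dir="~/projects/lin28b_shrna/fastqc"
    mkdir -p "$dest"   # the destination must exist before copying
    # Quoting keeps the local shell from expanding ~ and *, so they are
    # resolved against the files on TSCC instead.
    scp "${user}@${host}:${remote_dir}/${sample}_fastqc*" "$dest"/
}

# Example usage (replace ## with your own account number):
#   fetch_fastqc_reports ucsd-train## ENCFF621LMO ~/Desktop/BIOM200/module_1
```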

*Q: How could you use scp to copy ALL files at once within the folder?*

*HINT - Use man to learn more about the scp command*

Open a Finder window at the location where you put the files on your computer and double click on the html file to view the fastqc results.
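
If you prefer a quick look from the terminal, the zip file that fastqc writes alongside the html report contains a `summary.txt` with one PASS, WARN, or FAIL line per module. The sketch below lists only the modules that did not pass. `fastqc_flags` is a hypothetical helper name, and the sketch assumes `summary.txt` is tab-separated with the status in the first column and the module name in the second.

```shell
# Hypothetical helper: list fastqc modules that did not PASS.
# Expects the path to a summary.txt extracted from a *_fastqc.zip archive.
fastqc_flags() {
    # Columns are tab-separated: status, module name, input file name
    awk -F'\t' '$1 != "PASS" { print $1 ": " $2 }' "$1"
}

# Example usage on your local machine:
#   unzip ENCFF621LMO_fastqc.zip
#   fastqc_flags ENCFF621LMO_fastqc/summary.txt
```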