From cb5f15bec0abb9db16cab6fd27cb90fa8398f442 Mon Sep 17 00:00:00 2001
From: JasonJWilliamsNY
Date: Mon, 12 Mar 2018 16:21:21 -0400
Subject: [PATCH 01/19] initial commit with index page and some of lesson 01

---
 AUTHORS | 1 +
 CITATION | 1 +
 _extras/discuss.md | 5 +++
 _extras/guide.md | 5 +++
 episodes/01-introduction.md | 75 +++++++++++++++++++++++++++++++++++++
 index.md | 57 ++++++++++++++++++++++++++++
 reference.md | 8 ++++
 setup.md | 6 +++
 8 files changed, 158 insertions(+)
 create mode 100644 AUTHORS
 create mode 100644 CITATION
 create mode 100644 _extras/discuss.md
 create mode 100644 _extras/guide.md
 create mode 100644 episodes/01-introduction.md
 create mode 100644 index.md
 create mode 100644 reference.md
 create mode 100644 setup.md

diff --git a/AUTHORS b/AUTHORS
new file mode 100644
index 00000000..659aeece
--- /dev/null
+++ b/AUTHORS
@@ -0,0 +1 @@
+FIXME: list authors' names and email addresses.
diff --git a/CITATION b/CITATION
new file mode 100644
index 00000000..d5bfc2b3
--- /dev/null
+++ b/CITATION
@@ -0,0 +1 @@
+FIXME: describe how to cite this lesson.
diff --git a/_extras/discuss.md b/_extras/discuss.md
new file mode 100644
index 00000000..727205da
--- /dev/null
+++ b/_extras/discuss.md
@@ -0,0 +1,5 @@
+---
+layout: page
+title: Discussion
+---
+FIXME
diff --git a/_extras/guide.md b/_extras/guide.md
new file mode 100644
index 00000000..50d9d0b3
--- /dev/null
+++ b/_extras/guide.md
@@ -0,0 +1,5 @@
+---
+layout: page
+title: "Instructor Notes"
+---
+FIXME
diff --git a/episodes/01-introduction.md b/episodes/01-introduction.md
new file mode 100644
index 00000000..24e48cf4
--- /dev/null
+++ b/episodes/01-introduction.md
@@ -0,0 +1,75 @@
+---
+title: "Introducing R and RStudio IDE"
+teaching: 30
+exercises: 15
+questions:
+- "What is RStudio and why should I use it?"
+- "What is the difference between R and RStudio?"
+- "How do I get help using R?"
+objectives:
+- "Discuss advantages of analyzing data in R"
+- "Discuss advantages of using RStudio"
+- "Create an RStudio project, and discuss the benefits of working within a
+ project"
+- "Customize RStudio layout"
+- "Able to locate and change the current working directory with `getwd()` and
+ `setwd()`"
+- "Compose an R script file with comments and saved commands"
+- "Be able to define what an R function is"
+- "Locate help for an R function using `?`, `??`, and `args()`"
+- "Check the version of R"
+- "Able to enter a command in the R console (at the terminal)"
+- "List several websites for obtaining R software/packages"
+- "Ask effective questions when searching for help on forums or using web
+ searches"
+- "Research an issue you are experiencing with a package installation on
+ Stackoverflow"
+
+keypoints:
+- "First key point."
+---
+
+## Getting ready to use R for the first time
+In this lesson we will take you through the very first things you need to get
+R working, and conclude by showing you the most effective ways to get get help
+when you are working with R on your own.
+
+>## Tip: This lesson works best on the cloud
+> Remember, these lessons assume we are using the pre-configured virtual machine
+> instances provided to you at a genomics workshop. Much of this work could be
+> done on your laptop, but we use instances to simplify workshop setup
+> requirements, and to get you familiar with using the cloud (a common
+> requirement for working with big data).
+> Visit the [Genomics Workshop setup page](http://www.datacarpentry.org/genomics-workshop/setup/)
+> for details on getting this instance running on your own, or for the info you
+> need to do this on your own computer.
+ {: .callout}
+
+
+## A Breif History of R
+[R](https://en.wikipedia.org/wiki/R_(programming_language)) has been around
+since 1995, and was created by Ross Ihaka and Robert Gentleman at the University
+of Auckland, New Zealand. It was based off the S programming language developed
+at Bell Labs, and was developed to teach intro statistics. See this [slide deck](https://www.stat.auckland.ac.nz/~ihaka/downloads/Massey.pdf)
+by Ross Ihaka for more info on the subject.
+
+## Advantages of using R
+At more than 20 years old, R is fairly mature and [growing in popularity](https://www.tiobe.com/tiobe-index/r/). However, programming isin't a poularity contest. Here are key advantages of
+analzying data in R:
+
+ - **R is [open source](https://en.wikipedia.org/wiki/Open-source_software)**. Of
+ course this means R is free - which is an advantage if you end up at a
+ institution where you would have to pay for your own MATLAB or SAS license.
+ Open source, is important to your colleagues in parts of the world where
+ expensive software in inaccessible. It also means that R is actively
+ developed by a community (See [r-project.org](https://www.r-project.org/)),
+ and there are regular updates.
+ - **R is widely used**. Ok, maybe programming is a popularity contest. Because,
+ R is used in many areas (not just bioinformatics), you are more likely to
+ find help online when you need it. Chances are, almost any error message you
+ run into, someone else has already experienced.
+- **R is powerful**. R runs on multiple platforms (Windows/MacOS/Linux). It can
+ work with much larger datasets than popular spreadsheet programs like
+ Microsoft Excel, and because of its scripting capabilities is far more
+ reproducible. Also there are thousads of available software packages for
+ science, including genomics and other areas of life science.
diff --git a/index.md b/index.md
new file mode 100644
index 00000000..fb296c1a
--- /dev/null
+++ b/index.md
@@ -0,0 +1,57 @@
+---
+layout: lesson
+root: .
+permalink: index.html # Is the only page that don't follow the partner /:path/index.html
+---
+**Welcome to R!** Working with a programming language (especially if it’s your
+first time) often feels intimidating, but the rewards outweigh any frustrations.
+An important secret of coding is that even experienced programmers find it
+difficult and frustrating at times – so if even the best feel that way, why let
+intimidation stop you? Given time and practice* you will soon find it easier
+and easier to accomplish what you want.
+
+Why learn to code? Bioinformatics – like Biology – is messy. Different
+organisms, different systems, different conditions, all behave differently.
+Experiments at the bench require a variety of approaches – from tested protocols
+to trial-and-error. Bioinformatics is also an experimental science, otherwise we
+could use the same software and same parameters for every genome assembly.
+Learning to code opens up the full possibilities of computing, especially given
+that most bioinformatics tools exist only at the command line. Think of it this
+way: if you could only do molecular biology using a kit, you could probably
+accomplish a fair amount. However, if you don’t understand the biochemistry of
+the kit, how would you troubleshoot? How would you do experiments for which
+there are no kits?
+
+R is one of the most widely-used and powerful programming languages in
+bioinformatics. R especially shines where a variety of statistical tools are
+required (e.g. RNA-Seq, population genomics, etc.) and in the generation of
+publication-quality graphs and figures. Rather than get into an R vs. Python
+debate (both are useful), keep in mind that many of the concepts you will learn
+apply to Python and other programming languages.
+
+Finally, we won’t lie; R is not the easiest-to-learn programming language ever
+created. So, don’t get discouraged! The truth is that even with the modest
+amount of R we will cover today, you can start using some sophisticated R
+software packages, and have a general sense of how to interpret an R script.
+Get through these lessons, and you are on your way to being an accomplished R
+user!
+
+\* We very intentionally used the word practice. One of the other “secrets” of
+programming is that you can only learn so much by reading about it. Do the
+exercises in class, re-do them on your own, and then work on your own problems.
+
+
+> ## Prerequisites
+>
+> - **Experimenter's Mindset**: We define the "Experimenter's mindset" as an
+> approach to bioinformatics that treats it like any other experiment. There
+> are probably a variety of metaphors we could employ (data are our
+> reagents, scripts are our protocols, etc.), but the most important idea of
+> the mindset is to remind you that as a researcher, you need to employee all
+> of your training in the bench or field to working with analyses. Evaluate
+> results critically, and don't expect that things will always work the first
+> time, or that they will always work in the same way.
+> - **Genomics Data Carpentry Instance**: This lesson assumes you are using a
+> Genomics Data Carpentry instance as described on the
+> [Genomics Workshop setup page](http://www.datacarpentry.org/genomics-workshop/setup/)
+{: .prereq}
diff --git a/reference.md b/reference.md
new file mode 100644
index 00000000..9fd7c8a1
--- /dev/null
+++ b/reference.md
@@ -0,0 +1,8 @@
+---
+layout: reference
+root: .
+---
+
+## Glossary
+
+FIXME
diff --git a/setup.md b/setup.md
new file mode 100644
index 00000000..12b3c0a1
--- /dev/null
+++ b/setup.md
@@ -0,0 +1,6 @@
+---
+layout: page
+title: Setup
+root: .
+---
+FIXME

From aae6145b23549b722917ba95c12af3c50a47929b Mon Sep 17 00:00:00 2001
From: JasonJWilliamsNY
Date: Wed, 14 Mar 2018 17:02:48 -0400
Subject: [PATCH 02/19] add images and 01 episode content

---
 episodes/01-introduction.md | 537 +++++++++++++++++++++++++++++++++++-
 1 file changed, 530 insertions(+), 7 deletions(-)

diff --git a/episodes/01-introduction.md b/episodes/01-introduction.md
index 24e48cf4..1f369cc7 100644
--- a/episodes/01-introduction.md
+++ b/episodes/01-introduction.md
@@ -19,7 +19,6 @@ objectives:
 - "Locate help for an R function using `?`, `??`, and `args()`"
 - "Check the version of R"
 - "Able to enter a command in the R console (at the terminal)"
-- "List several websites for obtaining R software/packages"
 - "Ask effective questions when searching for help on forums or using web
  searches"
 - "Research an issue you are experiencing with a package installation on
@@ -46,16 +45,16 @@ when you are working with R on your own.
  {: .callout}


-## A Breif History of R
+## A Brief History of R
 [R](https://en.wikipedia.org/wiki/R_(programming_language)) has been around
 since 1995, and was created by Ross Ihaka and Robert Gentleman at the University
-of Auckland, New Zealand. It was based off the S programming language developed
-at Bell Labs, and was developed to teach intro statistics. See this [slide deck](https://www.stat.auckland.ac.nz/~ihaka/downloads/Massey.pdf)
+of Auckland, New Zealand. R is based off the S programming language developed
+at Bell Labs and was developed to teach intro statistics. See this [slide deck](https://www.stat.auckland.ac.nz/~ihaka/downloads/Massey.pdf)
 by Ross Ihaka for more info on the subject.

 ## Advantages of using R
-At more than 20 years old, R is fairly mature and [growing in popularity](https://www.tiobe.com/tiobe-index/r/). However, programming isin't a poularity contest. Here are key advantages of
-analzying data in R:
+At more than 20 years old, R is fairly mature and [growing in popularity](https://www.tiobe.com/tiobe-index/r/). However, programming isn’t a popularity contest. Here are key advantages of
+analyzing data in R:

  - **R is [open source](https://en.wikipedia.org/wiki/Open-source_software)**. Of
  course this means R is free - which is an advantage if you end up at a
@@ -71,5 +70,529 @@ analzying data in R:
 - **R is powerful**. R runs on multiple platforms (Windows/MacOS/Linux). It can
  work with much larger datasets than popular spreadsheet programs like
  Microsoft Excel, and because of its scripting capabilities is far more
- reproducible. Also there are thousads of available software packages for
+ reproducible. Also, there are thousands of available software packages for
  science, including genomics and other areas of life science.
+
+>## Discussion: Your experience
+> What has motivated you to learn R? Have you had a research question for which
+> spreadsheet programs such as Excel have proven difficult to use, or where the
+> size of the data set created issues?
+{: .discussion}
+
+
+----
+
+## Introducing RStudio Server
+In these lessons, we will be making use of a software called [RStudio](https://www.rstudio.com/products/RStudio/),
+an [Integrated Development Environment (IDE)](https://en.wikipedia.org/wiki/Integrated_development_environment).
+RStudio, like any most IDEs provides a graphical interface to R, making it more
+user-friendly, and providing dozens of useful features. We will introduce
+additional benefits of using RStudio as you cover the lessons. In this case,
+we are specifically using [RStudio Server](https://www.rstudio.com/products/RStudio/#Server)
+, a version of RStudio that can be accessed in your web browser. RStudio Server
+has the same features of the Desktop version of RStudio you could download as
+standalone software.
+
+## Log on to RStudio Server
+
+Open a web browser and enter the IP address of your instance, followed by
+`:8787`. For example, if your IP address was 123.456.789 your URL would be
+ > ~~~
+ > http://123.456.789:8787
+ >
+ > # Tip: Make sure there are no spaces before or after your URL or your web browser may interpret it as a search query
+ > ~~~
+ > {: .source}
+
+Enter your user credentials and click Sign In. The credentials for
+the genomics Data Carpentry instances are:
+
+ > **username**: dcuser
+ >
+ > **password**: data4Carp
+
+You should now see the RStudio interface:
+
+rstudio default session
+
+---
+
+## Create an RStudio project
+
+One of the first benefits we will take advantage of in RStudio is something
+called an **RStudio Project**. An RStudio Project allows you easily save data,
+files, variables, packages, etc. related to a specific analysis project you are
+conducting in R. Saving your work into a project makes it easy to restart work
+where you left off, and also makes it easier to collaborate, especially if you
+are using version control such as [git](http://swcarpentry.github.io/git-novice/).
+
+
+To create a project, go to the File menu, and click New Project....
+
+rstudio default session
+
+In the window that opens select **New Directory**, then **Empty Project**. For
+"Directory name:" enter **dc_genomics_r**. For "Create project as subdirectory of",
+you may leave the default, which is your home directory "~". Finally click
+Create Project. In your "Files" tab of your output pane (more about
+the RStudio layout in a moment), you should see an RStudio project file,
+**dc_genomics_r.Rroj**. All RStudio projects end with the ".Rproj" file
+extension.
+
+>## Tip: Make your project more reproducible with Packrat
+> One of the most wonderful and also frustrating aspects of working with R is
+> managing packages. We will talk more about them, but packages (e.g. ggplot2)
+> are add-ons that extend what you can do with R. Unfoturnately it is very
+> common that you may run into versions of R and/or R packages that are not
+> compatible. This may make it difficut to for somone to run your R script using
+> their version of R or a given R package, and/or make it more difficult to run
+> their scripts on your machine. [Packrat](https://rstudio.github.io/packrat/)
+> Is an RStudio add-on that will associate your packages and project so that
+> your work is more portable and reproducible. To turn on Packrat click on
+> the Tools menu and select Project Options. Under
+> **Packrat** check off "**Use packrat with this project**" and follow any
+> installation instructions.
+{: .callout}
+
+---
+
+## Creating your first R script
+
+Now that we are ready to start exploring R, we will want to keep a record of the
+commands we are using. To do this we can create an R script:
+
+Click the File menu and select New File and then
+R Script. Before we go any further, save your script by clicking the
+save/disk icon that is in the bar above the first line in the script editor, or
+click the File menu and select save. In the "Save File"
+window that opens, name your file **"genomics_r_basics"**. The new script
+**genomics_r_basics.R** should appear under "files" in the output pane. By
+convention, R scripts end with the file extention **.R**.
+
+---
+
+## Overview and customization of the RStudio layout
+
+Now that we have covered the basics, lets address some ways to configure the
+layout of RStudio. First, here are the major windows or panes of the RStuio
+environment:
+
+rstudio default session
+
+- **Source**: This pane is where you will write/view R scripts. Some outputs
+ (such as if you view a dataset using `View()`) will appear as a tab here.
+- **Console**: This is actually where you see the execution of commands, and
+ what R looks like if you were to run it at the command line without RStudio.
+ You can work interactively (i.e. enter R commands here), but for the most
+ part, we will run a script, or lines in a script and watch their execution
+ and output here.
+- **Enviornment**: Here, RStudio will show you what datasets and variables you
+ have created, and which are actively defined/in memory. You can also see some
+ characteristics of variables/datasets such as their type and dimensions.
+ A history tab also contains a history of executed R commands.
+- **Files/plots/help**: This multipurpose pane will show you the contents of
+ directories on your computer. You can also use the "Files" tab to navigate and
+ set the working directory. The "Plots" tab will show the ouput of any plots
+ generated. In "Packages" you will see what packages are actively loaded, or
+ you can attach installed packages. "Help" will display help files for R
+ functions/packages.
+
+>## Tip: Downloads from the cloud
+> In the "Files" tab you can select a file and download it from your cloud
+> instance to your local computer. Uploads are also possible.
+{: .callout}
+
+All of the panes in RStudio have configuration options. For example you can
+minimize/maximize a pane, or by moving your mouse in the space between between
+panes you can resize as needed. The most important customization options for
+pane layout are in the View menu. Other option such as font sizes,
+colors/themes, and more are in the Tools menu under
+Global Options.
+
+---
+
+## Getting to work with R: navigating directories
+Now that we have covered the more aesthetic aspects of R, we can get to work by
+learning some commands. We will write, execute, and save the commands we learn
+in our **genomics_r_basics.R** script that is loaded in the Source pane. First,
+lets see what directory we are in. To do so, type the following command into
+the script:
+
+> ~~~
+> getwd()
+> ~~~
+{: .language-r}
+
+To execute this command, make sure your cusor is on the same line the command
+is written. Then click the Run button that is just above the first
+line of your script in the header of the Source pane.
+
+
+In the console, we expect to see the following output:
+
+> ~~~
+> getwd()
+> [1] "/home/dcuser/dc_genomics_r"
+> ~~~
+{: .output}
+
+Since we will be learning several commands, we may already want to keep some
+short notes in our script to explain the purpose of the command. Entering a `#`
+before any line in an R script. Edit your script to include a comment on the
+purpose of commands you are learning, e.g.:
+
+> ~~~
+> # this command shows the current working directory
+> getwd()
+> ~~~
+{: .language-r}
+
+---
+
+> ## Exercise: Work interactively in R
+> What happens when you try to enter the `getwd()` command in the Console pane?
+>
+>> ## solution
+>> You will get the same output you did as when you ran `getwd()` from the
+>> source. You can run any command in the Console, however, executing it from
+>> the source script will make it easier for us to record what we have done,
+>> and ultimately run an entire script, instead of entering commands one-by-one.
+> {: .solution}
+{: .challenge}
+---
+
+For the purposes of this exercise we want you to be in the directory `"/home/dcuser/dc_genomics_r"`.
+What if you weren't? You can set your home directory using the `setwd()`
+command. Enter this command in your script, but *don't run* this yet.
+
+> ~~~
+> # This sets the working directory
+> setwd()
+> ~~~
+{: .language-r}
+
+You may have guessed, you need to tell the `setwd()` command
+what directory you want to set as your working directory. To do so, inside of
+the parentheses, open a set of quotes. Inside the quotes enter a `/` which is
+the root directory of our linux. Next, use the Tab key, to take
+advantage of RStudio's Tab-autocompletion method, to select `home`, `dcuser`,
+and `dc_genomics_r` directory. The path in your script should look like this:
+
+> ~~~
+> # This sets the working directory
+> setwd("/home/dcuser/dc_genomics_r")
+> ~~~
+{: .language-r}
+
+
+When you run this command, the console repeates the command, but gives you no
+output. Instead, you see the blank R prompt: `>`. Congradulations! Although it
+seems small, knowing what your working directory is, and being able to set your
+working directory is the first step to analzying your data.
+
+>## Tip: Never use `setwd()`
+> Wait, what was the last 2 minutes about? Well, setting your working directory
+> is something you need to do, you need to be very careful about using this as
+> a step in your script. For example, the top-level path in a Unix file system
+> is root `/`, but on Windows it is likely `C:\`. This is one of several ways
+> you might cause a script to break because a filepath is configured differently
+> than your script anticipates. R packages like [`here`](https://cran.r-project.org/web/packages/here/index.html)
+> and [`file.path`](https://www.rdocumentation.org/packages/base/versions/3.4.3/topics/file.path)
+> allow you to specifiy file paths is a way that is more operating system
+> independent. See Jenny Bryan's [blog post](https://www.tidyverse.org/articles/2017/12/workflow-vs-script/) for this
+> and other R tips.
+{: .callout}
+
+---
+
+## Using functions in R, without needing to master them
+Functions may seem like an advanced topic (and they are), but you have already
+been using functions in R. In fact, even if you never learn how anything else
+works in R, the next sections will help you understand what is happening in
+any R script. A function in R (or any computing language) is basically a short
+program that takes an input and returns and output.
+
+> ## Exercise: What do these functions do
+> Try the following functions by writing them in your script. See if you can
+> guess what they do, and make sure to add comments to your script about your
+> assumed purpose.
+> - `dir()`
+> - `sessionInfo()`
+> - `date()`
+> - `Sys.time()`
+>
+>> ## solution
+>> - `dir()` # lists files in the working directory
+>> - `sessionInfo()` # Gives the version of R and additional info including
+>> on attached packages
+>> - `date()` # Gives the current date
+>> - `Sys.time()` # Gives the current time
+> {: .solution}
+{: .challenge}
+
+You have hopefully noticed a pattern, some more abstract exceptions aside, in R
+a function has three key properties:
+- functions have a name (e.g. `dir`, `getwd`)
+- following the name, functions have a pair of `()`
+- Inside the parentheses, a function may take 0 or more arguments ...
+
+An argument may be a specific input for your function and/or may modify the
+function's behavior. For example the function `round()` will round a number
+with a decimal:
+
+> ~~~
+> # This will round up a number
+> round(3.14)
+> ~~~
+{: .language-r}
+
+Which returns
+
+> ~~~
+> [1] 3
+> ~~~
+{: .output}
+
+## Getting help with function arguments
+
+Of course, you may have wanted to round to one significant digit. `round()` can
+do this, but you may fist need to read the help to find out how. To see the help
+(In R sometimes also called a "vignette") enter a `?` in front of the function
+name:
+
+> ~~~
+> ?round()
+> ~~~
+{: .language-r}
+
+The "Help" tab will show you information (and often, too much information). You
+Will slowly learn how to read through all of that. Checking the "Usage" or
+"Examples" headings is often a good place to look first. If you look under
+"Arguments" we also see what arguments we can "pass" to this function to modify
+its behavior. You can also see a function's argument using the `args()` function:
+
+> ~~~
+> args(round)
+> ~~~
+{: .language-r}
+
+Which returns
+
+> ~~~
+> function (x, digits = 0)
+> NULL
+> ~~~
+{: .output}
+
+We see that `round()` has a `digits` argument. The `=` sign indicates that a
+default (in this case 0) is already set. We can explicity set the digits
+parameter when we use the function:
+
+> ~~~
+> round(3.14159, digits = 2)
+> ~~~
+{: .language-r}
+
+> ~~~
+> [1] 3.14
+> ~~~
+{: .output}
+
+Or, R accepts what we call "positional arguments", if you pass a function
+arguments separated by commas, R assumes that they are in the order you saw
+when we used `args()`. In the case below that means that `x` is 3.14159 and
+digits is 2.
+
+> ~~~
+> round(3.14159, 2)
+> ~~~
+{: .language-r}
+
+> ~~~
+> [1] 3.14
+> ~~~
+{: .output}
+
+Finally, what if you are using `?` to get help for a function in a package not
+installed on your version of R:
+
+> ~~~
+> ?geom_point()
+> ~~~
+{: .language-r}
+
+will return an error:
+
+> ~~~
+> Error in .helpForCall(topicExpr, parent.frame()) :
+> no methods for ‘geom_point’ and no documentation for it as a function
+> ~~~
+{: .error}
+
+
+Use two question marks (i.e. `?? geom_point()`) and R will return online search
+results in the "Help" tab.
+
+---
+
+## Getting help with R
+
+rstudio default session
+
+Finally, no matter how much experience you have with R, you will find yourself
+needing help. There is no shame in researching how to do something in R, and
+most people will find themselves looking up how to do the same things that
+they "should know how to do" over and over again. Here are some tips to make
+this process as helpful and efficent as possible.
+
+> "Never memorize something that you can look up"
+> - A. Einstein
+
+## Finding help on Stackoverflow and Biostars
+
+Two of popular websites will be of great help with many R problems. For **general**
+**R questions**, [Stack Overflow](https://stackoverflow.com/), probably the most
+popular online community for developers. If you start your question "How to do X
+in R" results from Stack Overflow are usually near the top of the list. For
+**bioinformatics specific questions**, [Biostars](https://www.biostars.org/) is
+a popular online forum.
+
+>## Tip: Asking for help using online forums:
+>
+> - When searching for R help, look for answers with the [r](https://stackoverflow.com/questions/tagged/r) tag.
+> - Get an account, not required to view answers, but to required to post
+> - Put in effort to check throughly before you post a question; folks get
+> annoyed if you ask a very common question that has been answered multiple
+> times.
+> - Be careful. While forums are very helpful, you can't know for sure if the
+> advice you are getting is correct.
+> - See the [How to ask for R help](http://blog.revolutionanalytics.com/2014/01/how-to-ask-for-r-help.html)
+> blog post for more useful tips.
+>
+{: .callout}
+
+## Help people help you
+
+Often, in order to duplicate the issue you are having, somone may need to see
+the data you are working with or verify the versions of R or R packages you
+are using. The following R functions will help with this:
+
+You can **check the version of R** you are working with using the `sessionInfo()`
+function. Actually, it is good to save this information as part of your notes
+on any analysis you are doing. When you run the same script that has worked fine
+a dozzen times before, looking back at these notes will remind you that you
+upgraded R and forget to check this script.
+
+
+> ~~~
+> sessionInfo()
+> ~~~
+{: .language-r}
+
+> ~~~
+> R version 3.2.3 (2015-12-10)
+> Platform: x86_64-pc-linux-gnu (64-bit)
+> Running under: Ubuntu 14.04.3 LTS
+>
+> locale:
+> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
+> [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
+> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
+> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
+>
+> attached base packages:
+> [1] stats graphics grDevices utils datasets methods base
+>
+> loaded via a namespace (and not attached):
+> [1] tools_3.2.3 packrat_0.4.9-1
+> ~~~
+{: .output}
+
+Many times, there may be some issues with your data and the way it is formatted.
+In that case, you may want to share that data with somone else. However, you
+may not need to share the whole datasets; looking at a subset of your 50,000 row,
+10,000 column dataframe may be TMI (too much information)! You can take an
+object you have in memory such as dataframe (if you don't know what this means
+yet, we will get to it!) and save it to a file. In our example we will use the
+`dput()` function on the `iris` dataframe which is an example dataset that is
+installed in R:
+
+
+> ~~~
+> dput(head(iris)) # iris is an example data.frame that comes with R
+> # the `head()` function just takes the first 6 lines of the iris dataset
+> ~~~
+{: .language-r}
+
+This generates some output (below) which you will be better able to interpret
+after covering the other R lessons. This info would be helpful in understanding
+how the data is formatted and possibly revealing problematic issues.
+
+> ~~~
+> structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4),
+> Sepal.Width = c(3.5, 3, 3.2, 3.1, 3.6, 3.9), Petal.Length = c(1.4,
+> 1.4, 1.3, 1.5, 1.4, 1.7), Petal.Width = c(0.2, 0.2, 0.2,
+> 0.2, 0.2, 0.4), Species = structure(c(1L, 1L, 1L, 1L, 1L,
+> 1L), .Label = c("setosa", "versicolor", "virginica"), class = "factor")), .Names = c("Sepal.Length",
+> "Sepal.Width", "Petal.Length", "Petal.Width", "Species"), row.names = c(NA,
+> 6L), class = "data.frame")
+> ~~~
+{: .output}
+
+Alternatively, you can also save objects in R memory to a file by specificying
+the name of the object, in this case the `iris` data frame, and passing a
+filename to the `file=` argument.
+
+> ~~~
+> saveRDS(iris, file="iris.rds") # By convention, we use the .rds file extension
+> ~~~
+{: .language-r}
+
+---
+
+## Final FAQs on R
+
+Finally, here are a few pieces of introductory R knowledge that are too good to
+pass up. While we won't return to them in this course, we put them here becasue
+they come up commonly:
+
+**Do I need to click Run every time I want to run a script?**
+
+- No. In fact, the most common shortcut key allows you to run a command (or
+ any lines of the script that are highlighted):
+ - Windows execution shortcut: Ctrl+Enter
+ - Mac execution shortcut: Cmd(⌘)+Enter
+
+ To see a complete list of shortcuts click on the Tools menu and
+ select Keyboard Shortcuts Help
+
+**What's with the brackets in R console output?**
+- R returns an index with your result. When your result contains multiple values,
+ the number tells you what ordinal number begins the line, for example:
+
+> ~~~
+> 1:101 # generates the sequence of numbers from 1 to 101
+> ~~~
+{: .language-r}
+
+> ~~~
+> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
+> [21] 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
+> [41] 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
+> [61] 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
+> [81] 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
+> [101] 101
+> ~~~
+{: .output}
+
+
+**Can I run my R script without RStudio?**
+
+- Yes, remember - RStudio is running R. You get to use lots of the enhancements
+ RStudio provides, but R works independent of RStudio. See [these tips](https://support.rstudio.com/hc/en-us/articles/218012917-How-to-run-R-scripts-from-the-command-line)
+ for running your commands at the command line.
+
+
+**Where else can I learn about RStudio?**
+Check out the Help menu, especially "Cheatsheets" section.
+---

From c3d876961a38ad2f2e4b61ef1975f1bd758d497d Mon Sep 17 00:00:00 2001
From: JasonJWilliamsNY
Date: Thu, 15 Mar 2018 11:08:46 -0400
Subject: [PATCH 03/19] fix typos and add context help images

---
 episodes/01-introduction.md | 221 +++++++++++++++++++++++-------------
 1 file changed, 145 insertions(+), 76 deletions(-)

diff --git a/episodes/01-introduction.md b/episodes/01-introduction.md
index 1f369cc7..da9d74dc 100644
--- a/episodes/01-introduction.md
+++ b/episodes/01-introduction.md
@@ -3,34 +3,40 @@ title: "Introducing R and RStudio IDE"
 teaching: 30
 exercises: 15
 questions:
-- "What is RStudio and why should I use it?"
-- "What is the difference between R and RStudio?"
-- "How do I get help using R?"
+- "Why use R?"
+- "Why use RStudio and how does it differ from R?"
+- "How do I get help using R and RStudio?"
objectives: -- "Discuss advantages of analyzing data in R" -- "Discuss advantages of using RStudio" -- "Create an RStudio project, and discuss the benefits of working within a +- "Know advantages of analyzing data in R" +- "Know advantages of using RStudio" +- "Create an RStudio project, and know the benefits of working within a project" - "Customize RStudio layout" -- "Able to locate and change the current working directory with `getwd()` and +- "Be able to locate and change the current working directory with `getwd()` and `setwd()`" - "Compose an R script file with comments and saved commands" - "Be able to define what an R function is" - "Locate help for an R function using `?`, `??`, and `args()`" - "Check the version of R" -- "Able to enter a command in the R console (at the terminal)" -- "Ask effective questions when searching for help on forums or using web +- "Be able to ask effective questions when searching for help on forums or using web searches" -- "Research an issue you are experiencing with a package installation on - Stackoverflow" keypoints: -- "First key point." +- "R is a powerful, popular open-source scripting language" +- "RStudio allows you to run R in an easy-to-use interface and makes + it easy to find help" +- "You can customize the layout of RStudio, and use the project feature to manage + the files and packages used in your analysis" +- "R provides thousands of functions for analyzing data, and provides several + ways to get help" +- "Using R will mean searching for online help, and there are tips and + resources on how to search effectively" + --- ## Getting ready to use R for the first time In this lesson we will take you through the very first things you need to get -R working, and conclude by showing you the most effective ways to get get help +R working, and conclude by showing you the most effective ways to get help when you are working with R on your own.
>## Tip: This lesson works best on the cloud @@ -85,11 +91,11 @@ analyzing data in R: ## Introducing RStudio Server In these lessons, we will be making use of a software called [RStudio](https://www.rstudio.com/products/RStudio/), an [Integrated Development Environment (IDE)](https://en.wikipedia.org/wiki/Integrated_development_environment). -RStudio, like any most IDEs provides a graphical interface to R, making it more +RStudio, like most IDEs, provides a graphical interface to R, making it more user-friendly, and providing dozens of useful features. We will introduce additional benefits of using RStudio as you cover the lessons. In this case, -we are specifically using [RStudio Server](https://www.rstudio.com/products/RStudio/#Server) -, a version of RStudio that can be accessed in your web browser. RStudio Server +we are specifically using [RStudio Server](https://www.rstudio.com/products/RStudio/#Server), +a version of RStudio that can be accessed in your web browser. RStudio Server has the same features of the Desktop version of RStudio you could download as standalone software. @@ -97,12 +103,13 @@ standalone software. Open a web browser and enter the IP address of your instance, followed by `:8787`. For example, if your IP address was 123.456.789 your URL would be - > ~~~ - > http://123.456.789:8787 - > - > # Tip: Make sure there are no spaces before or after your URL or your web browser may interpret it as a search query - > ~~~ - > {: .source} +> ~~~ +> http://123.456.789:8787 +> +> # Tip: Make sure there are no spaces before or after your URL or your web browser may interpret it as a search query +> ~~~ +> +{: .source} Enter your user credentials and click Sign In. The credentials for the genomics Data Carpentry instances are: @@ -134,20 +141,20 @@ To create a project, go to the File menu, and click New Project. In the window that opens select **New Directory**, then **Empty Project**. For "Directory name:" enter **dc_genomics_r**. 
For "Create project as subdirectory of", you may leave the default, which is your home directory "~". Finally click -Create Project. In your "Files" tab of your output pane (more about +Create Project. In the "Files" tab of your output pane (more about the RStudio layout in a moment), you should see an RStudio project file, -**dc_genomics_r.Rroj**. All RStudio projects end with the ".Rproj" file +**dc_genomics_r.Rroj**. All RStudio projects end with the "**.Rproj**" file extension. >## Tip: Make your project more reproducible with Packrat > One of the most wonderful and also frustrating aspects of working with R is > managing packages. We will talk more about them, but packages (e.g. ggplot2) -> are add-ons that extend what you can do with R. Unfoturnately it is very +> are add-ons that extend what you can do with R. Unfortunately it is very > common that you may run into versions of R and/or R packages that are not -> compatible. This may make it difficut to for somone to run your R script using +> compatible. This may make it difficult for someone to run your R script using > their version of R or a given R package, and/or make it more difficult to run > their scripts on your machine. [Packrat](https://rstudio.github.io/packrat/) -> Is an RStudio add-on that will associate your packages and project so that +> is an RStudio add-on that will associate your packages and project so that > your work is more portable and reproducible. To turn on Packrat click on > the Tools menu and select Project Options. Under > **Packrat** check off "**Use packrat with this project**" and follow any @@ -167,14 +174,14 @@ save/disk icon that is in the bar above the first line in the script editor, or click the File menu and select save. In the "Save File" window that opens, name your file **"genomics_r_basics"**. The new script **genomics_r_basics.R** should appear under "files" in the output pane. By -convention, R scripts end with the file extention **.R**. 
+convention, R scripts end with the file extension **.R**. --- ## Overview and customization of the RStudio layout Now that we have covered the basics, lets address some ways to configure the -layout of RStudio. First, here are the major windows or panes of the RStuio +layout of RStudio. First, here are the major windows or panes of the RStudio environment: rstudio default session @@ -186,13 +193,13 @@ environment: You can work interactively (i.e. enter R commands here), but for the most part, we will run a script, or lines in a script and watch their execution and output here. -- **Enviornment**: Here, RStudio will show you what datasets and variables you +- **Environment**: Here, RStudio will show you what datasets and variables you have created, and which are actively defined/in memory. You can also see some characteristics of variables/datasets such as their type and dimensions. A history tab also contains a history of executed R commands. - **Files/plots/help**: This multipurpose pane will show you the contents of directories on your computer. You can also use the "Files" tab to navigate and - set the working directory. The "Plots" tab will show the ouput of any plots + set the working directory. The "Plots" tab will show the output of any plots generated. In "Packages" you will see what packages are actively loaded, or you can attach installed packages. "Help" will display help files for R functions/packages. @@ -202,40 +209,51 @@ environment: > instance to your local computer. Uploads are also possible. {: .callout} -All of the panes in RStudio have configuration options. For example you can -minimize/maximize a pane, or by moving your mouse in the space between between +All of the panes in RStudio have configuration options. For example, you can +minimize/maximize a pane, or by moving your mouse in the space between panes you can resize as needed. The most important customization options for -pane layout are in the View menu. 
Other option such as font sizes, +pane layout are in the View menu. Other options such as font sizes, colors/themes, and more are in the Tools menu under Global Options. +>## Don't be fooled - you are working with R +> Although we won't be working with R at the terminal, there are lots of reasons +> to. For example, once you have written an RScript, you can run it at any Linux +> or Windows terminal without the need to start up RStudio. We just don't want +> you to get confused - RStudio runs R, but R is not RStudio. For more on +> running an R Script at the terminal see this [Carpentry lesson](https://swcarpentry.github.io/r-novice-inflammation/05-cmdline/). +{: .callout} + + --- ## Getting to work with R: navigating directories -Now that we have covered the more aesthetic aspects of R, we can get to work by -learning some commands. We will write, execute, and save the commands we learn -in our **genomics_r_basics.R** script that is loaded in the Source pane. First, -lets see what directory we are in. To do so, type the following command into -the script: +Now that we have covered the more aesthetic aspects of RStudio, we can get to +work learning some commands. We will write, execute, and save the commands we +learn in our **genomics_r_basics.R** script that is loaded in the Source pane. +First, lets see what directory we are in. To do so, type the following command +into the script: > ~~~ > getwd() > ~~~ {: .language-r} -To execute this command, make sure your cusor is on the same line the command +To execute this command, make sure your cursor is on the same line the command is written. Then click the Run button that is just above the first line of your script in the header of the Source pane. 
-In the console, we expect to see the following output: +In the console, we expect to see the following output*: > ~~~ -> getwd() > [1] "/home/dcuser/dc_genomics_r" > ~~~ {: .output} +\* Notice, at the Console, you will also see the instruction you executed +above the output in blue. + Since we will be learning several commands, we may already want to keep some short notes in our script to explain the purpose of the command. Entering a `#` before any line in an R script. Edit your script to include a comment on the @@ -274,7 +292,7 @@ command. Enter this command in your script, but *don't run* this yet. You may have guessed, you need to tell the `setwd()` command what directory you want to set as your working directory. To do so, inside of the parentheses, open a set of quotes. Inside the quotes enter a `/` which is -the root directory of our linux. Next, use the Tab key, to take +the root directory for Linux. Next, use the Tab key, to take advantage of RStudio's Tab-autocompletion method, to select `home`, `dcuser`, and `dc_genomics_r` directory. The path in your script should look like this: @@ -285,20 +303,20 @@ and `dc_genomics_r` directory. The path in your script should look like this: {: .language-r} -When you run this command, the console repeates the command, but gives you no -output. Instead, you see the blank R prompt: `>`. Congradulations! Although it +When you run this command, the console repeats the command, but gives you no +output. Instead, you see the blank R prompt: `>`. Congratulations! Although it seems small, knowing what your working directory is, and being able to set your -working directory is the first step to analzying your data. +working directory is the first step to analyzing your data. >## Tip: Never use `setwd()` > Wait, what was the last 2 minutes about? Well, setting your working directory > is something you need to do, you need to be very careful about using this as > a step in your script. 
For example, the top-level path in a Unix file system > is root `/`, but on Windows it is likely `C:\`. This is one of several ways -> you might cause a script to break because a filepath is configured differently +> you might cause a script to break because a file path is configured differently > than your script anticipates. R packages like [`here`](https://cran.r-project.org/web/packages/here/index.html) > and [`file.path`](https://www.rdocumentation.org/packages/base/versions/3.4.3/topics/file.path) -> allow you to specifiy file paths is a way that is more operating system +> allow you to specify file paths in a way that is more operating system > independent. See Jenny Bryan's [blog post](https://www.tidyverse.org/articles/2017/12/workflow-vs-script/) for this > and other R tips. {: .callout} @@ -312,7 +330,7 @@ works in R, the next sections will help you understand what is happening in any R script. A function in R (or any computing language) is basically a short program that takes an input and returns and output. -> ## Exercise: What do these functions do +> ## Exercise: What do these functions do? > Try the following functions by writing them in your script. See if you can > guess what they do, and make sure to add comments to your script about your > assumed purpose. @@ -332,9 +350,9 @@ program that takes an input and returns and output. You have hopefully noticed a pattern, some more abstract exceptions aside, in R a function has three key properties: -- functions have a name (e.g. `dir`, `getwd`) +- functions have a name (e.g. `dir`, `getwd`); note that these are case sensitive! - following the name, functions have a pair of `()` -- Inside the parentheses, a function may take 0 or more arguments ... +- Inside the parentheses, a function may take 0 or more arguments An argument may be a specific input for your function and/or may modify the function's behavior.
For example the function `round()` will round a number @@ -366,7 +384,7 @@ name: {: .language-r} The "Help" tab will show you information (and often, too much information). You -Will slowly learn how to read through all of that. Checking the "Usage" or +will slowly learn how to read through all of that. Checking the "Usage" or "Examples" headings is often a good place to look first. If you look under "Arguments" we also see what arguments we can "pass" to this function to modify its behavior. You can also see a function's argument using the `args()` function: @@ -384,9 +402,12 @@ Which returns > ~~~ {: .output} -We see that `round()` has a `digits` argument. The `=` sign indicates that a -default (in this case 0) is already set. We can explicity set the digits -parameter when we use the function: +We see that `round()` takes two arguments, `x` which is your number, and a +`digits` argument. The `=` sign indicates that a default (in this case 0) is +already set. Since `x` is not set, `round()` requires we provide it, in contrast +to `digits` where R will use the default value 0 unless you explicitly provide +a different value. We can explicitly set the digits parameter when we call the +function: > ~~~ > round(3.14159, digits = 2) @@ -414,7 +435,7 @@ digits is 2. {: .output} Finally, what if you are using `?` to get help for a function in a package not -installed on your version of R: +installed on your system? > ~~~ > ?geom_point() @@ -431,7 +452,51 @@ will return an error: Use two question marks (i.e. `?? geom_point()`) and R will return online search -results in the "Help" tab. +results in the "Help" tab. Finally, if you think there should be a function, +for example a statistical test, but you aren't sure what R calls it, or what +functions may be available, use the `help.search()` function. + +> ## Exercise: Searching for R functions +> Use `help.search()` to find R functions for the following statistical +> functions. 
Remember to put what you are using for your search query in +> quotes inside the function parentheses. +> +> - Chi-Squared test +> - Student-t test +> - mixed linear model +> +>> ## solution +>> While your search results may return several tests, we list a few you might +>> find: +>> - Chi-Squared test: `stats::Chisquare` +>> - Student-t test: `stats::TDist` +>> - mixed linear model: `stats::lm.glm` +> {: .solution} +{: .challenge} + + +We will discuss more on where to look for the libraries and packages that +contain functions you want to use. For now, be aware that two important ones +are [CRAN](https://cran.r-project.org/) - the main repository for R, and +[Bioconductor](http://bioconductor.org/) - a popular repository for +bioinformatics R. + +--- + +## RStudio contextual help + +Here is one last bonus we will mention about RStudio. It's difficult to +remember all of the arguments and definitions associated with a given function. +When you start typing the name of a function and hit the Tab key, +RStudio will display functions and associated help: + +rstudio default session + +Once you type a function, hitting the Tab inside the parentheses +will remind you of arguments and provide additional help. + +rstudio default session + --- @@ -443,15 +508,15 @@ Finally, no matter how much experience you have with R, you will find yourself needing help. There is no shame in researching how to do something in R, and most people will find themselves looking up how to do the same things that they "should know how to do" over and over again. Here are some tips to make -this process as helpful and efficent as possible. +this process as helpful and efficient as possible. > "Never memorize something that you can look up" > - A. Einstein ## Finding help on Stackoverflow and Biostars -Two of popular websites will be of great help with many R problems. 
For **general** -**R questions**, [Stack Overflow](https://stackoverflow.com/), probably the most +Two popular websites will be of great help with many R problems. For **general** +**R questions**, [Stack Overflow](https://stackoverflow.com/) is probably the most popular online community for developers. If you start your question "How to do X in R" results from Stack Overflow are usually near the top of the list. For **bioinformatics specific questions**, [Biostars](https://www.biostars.org/) is @@ -460,28 +525,28 @@ a popular online forum. >## Tip: Asking for help using online forums: > > - When searching for R help, look for answers with the [r](https://stackoverflow.com/questions/tagged/r) tag. -> - Get an account, not required to view answers, but to required to post -> - Put in effort to check throughly before you post a question; folks get -> annoyed if you ask a very common question that has been answered multiple -> times. +> - Get an account; it is not required to view answers, but it is required to post +> - Put in effort to check thoroughly before you post a question; folks get +> annoyed if you ask a very common question that has been answered multiple +> times > - Be careful. While forums are very helpful, you can't know for sure if the -> advice you are getting is correct. +> advice you are getting is correct > - See the [How to ask for R help](http://blog.revolutionanalytics.com/2014/01/how-to-ask-for-r-help.html) -> blog post for more useful tips. +> blog post for more useful tips > {: .callout} ## Help people help you -Often, in order to duplicate the issue you are having, somone may need to see +Often, in order to duplicate the issue you are having, someone may need to see the data you are working with or verify the versions of R or R packages you are using. The following R functions will help with this: You can **check the version of R** you are working with using the `sessionInfo()` function.
Actually, it is good to save this information as part of your notes on any analysis you are doing. When you run the same script that has worked fine -a dozzen times before, looking back at these notes will remind you that you -upgraded R and forget to check this script. +a dozen times before, looking back at these notes will remind you that you +upgraded R and forgot to check your script. > ~~~ @@ -509,8 +574,8 @@ upgraded R and forget to check this script. {: .output} Many times, there may be some issues with your data and the way it is formatted. -In that case, you may want to share that data with somone else. However, you -may not need to share the whole datasets; looking at a subset of your 50,000 row, +In that case, you may want to share that data with someone else. However, you +may not need to share the whole dataset; looking at a subset of your 50,000 row, 10,000 column dataframe may be TMI (too much information)! You can take an object you have in memory such as dataframe (if you don't know what this means yet, we will get to it!) and save it to a file. In our example we will use the @@ -539,7 +604,7 @@ how the data is formatted and possibly revealing problematic issues. > ~~~ {: .output} -Alternatively, you can also save objects in R memory to a file by specificying +Alternatively, you can also save objects in R memory to a file by specifying the name of the object, in this case the `iris` data frame, and passing a filename to the `file=` argument. @@ -553,7 +618,7 @@ filename to the `file=` argument. ## Final FAQs on R Finally, here are a few pieces of introductory R knowledge that are too good to -pass up.
While we won't return to them in this course, we put them here because they come up commonly: **Do I need to click Run every time I want to run a script?** @@ -563,7 +628,7 @@ they come up commonly: - Windows execution shortcut: Ctrl+Enter - Mac execution shortcut: Cmd(⌘)+Enter - To see a complete list of shortcuts click on the Tools menu and + To see a complete list of shortcuts, click on the Tools menu and select Keyboard Shortcuts Help **What's with the brackets in R console output?** @@ -575,6 +640,9 @@ they come up commonly: > ~~~ {: .language-r} +In the output below, `[81]` indicates that the first value on that line is the +81st item in your result + > ~~~ > [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 > [21] 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 @@ -590,9 +658,10 @@ they come up commonly: - Yes, remember - RStudio is running R. You get to use lots of the enhancements RStudio provides, but R works independent of RStudio. See [these tips](https://support.rstudio.com/hc/en-us/articles/218012917-How-to-run-R-scripts-from-the-command-line) - for running your commands at the command line. + for running your commands at the command line **Where else can I learn about RStudio?** -Check out the Help menu, especially "Cheatsheets" section. +- Check out the Help menu, especially "Cheatsheets" section + --- From 7b7b77d6ef38560fa0189c8fb1ff2e56b6f6b481 Mon Sep 17 00:00:00 2001 From: JasonJWilliamsNY Date: Tue, 17 Apr 2018 11:20:02 -0400 Subject: [PATCH 04/19] update screenshots and instructions for Rstudio Server 1.1 --- episodes/01-introduction.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/episodes/01-introduction.md b/episodes/01-introduction.md index da9d74dc..d76c516a 100644 --- a/episodes/01-introduction.md +++ b/episodes/01-introduction.md @@ -138,7 +138,7 @@ To create a project, go to the File menu, and click New Project. 
rstudio default session -In the window that opens select **New Directory**, then **Empty Project**. For +In the window that opens select **New Directory**, then **New Project**. For "Directory name:" enter **dc_genomics_r**. For "Create project as subdirectory of", you may leave the default, which is your home directory "~". Finally click Create Project. In the "Files" tab of your output pane (more about From 773a0cc4df90de0319dafbc135ba8f45ef7046f1 Mon Sep 17 00:00:00 2001 From: JasonJWilliamsNY Date: Tue, 17 Apr 2018 11:23:49 -0400 Subject: [PATCH 05/19] update screenshots and instructions for Rstudio Server 1.1 --- episodes/01-introduction.md | 32 +++++++++++++++++--------------- 1 file changed, 17 insertions(+), 15 deletions(-) diff --git a/episodes/01-introduction.md b/episodes/01-introduction.md index d76c516a..518c6b5a 100644 --- a/episodes/01-introduction.md +++ b/episodes/01-introduction.md @@ -188,21 +188,23 @@ environment: - **Source**: This pane is where you will write/view R scripts. Some outputs (such as if you view a dataset using `View()`) will appear as a tab here. -- **Console**: This is actually where you see the execution of commands, and - what R looks like if you were to run it at the command line without RStudio. - You can work interactively (i.e. enter R commands here), but for the most - part, we will run a script, or lines in a script and watch their execution - and output here. -- **Environment**: Here, RStudio will show you what datasets and variables you - have created, and which are actively defined/in memory. You can also see some - characteristics of variables/datasets such as their type and dimensions. - A history tab also contains a history of executed R commands. -- **Files/plots/help**: This multipurpose pane will show you the contents of - directories on your computer. You can also use the "Files" tab to navigate and - set the working directory. The "Plots" tab will show the output of any plots - generated. 
In "Packages" you will see what packages are actively loaded, or - you can attach installed packages. "Help" will display help files for R - functions/packages. +- **Console/Terminal**: This is actually where you see the execution of commands, + and what R looks like if you were to run it at the command line without + RStudio. You can work interactively (i.e. enter R commands here), but for the + most part, we will run a script, or lines in a script and watch their + execution and output here. The "Terminal" tab gives you access to the Bash + terminal. +- **Environment/History**: Here, RStudio will show you what datasets and + variables you have created, and which are actively defined/in memory. You can + also see some characteristics of variables/datasets such as their type and + dimensions. The "History" tab contains a list of previously executed R + commands. +- **Files/Plots/Packages/Help**: This multipurpose pane will show you the + contents of directories on your computer. You can also use the "Files" tab to + navigate and set the working directory. The "Plots" tab will show the output + of any plots generated. In "Packages" you will see what packages are actively + loaded, or you can attach installed packages. "Help" will display help files + for R functions/packages.
>## Tip: Downloads from the cloud > In the "Files" tab you can select a file and download it from your cloud From 144e83192780cb36c68247291bcb2190e6c7abf0 Mon Sep 17 00:00:00 2001 From: JasonJWilliamsNY Date: Wed, 25 Apr 2018 17:09:22 -0400 Subject: [PATCH 06/19] intermediate commit --- episodes/02-r-basics.md | 669 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 669 insertions(+) create mode 100644 episodes/02-r-basics.md diff --git a/episodes/02-r-basics.md b/episodes/02-r-basics.md new file mode 100644 index 00000000..ce489f8e --- /dev/null +++ b/episodes/02-r-basics.md @@ -0,0 +1,669 @@ +--- +title: "R Basics" +teaching: 60 +exercises: 30 +questions: +- "What are the basic features of the R language?" +- "What are the most common objects in R?" +- "How do I get started with tabular data (e.g. spreadsheets) in R?" +objectives: +- "Know how far you can get with basic R skills" +- "Be able to explain what data types are, and know the common R datatypes (modes)" +- "Be able to create the most common R objects including vectors, factors, lists, and dataframes" +- "Be able to retrieve (index), name, or replace values from an object" +- "Be able to do simple arithmetic procedures on R objects" +- "Be able to load a tabular dataset using base R functions" +- "Explain the basic principle of tidy datasets" +- "Be able to determine the structure of a dataframe including its dimensions and the datatypes of variables" +- "Be able to retrieve (index) a dataframe" +- "Be able to apply an arithmetic function to a dataframe" +- "Be able to coerce the class of an object (including variables in a dataframe)" +- "Be able to save a dataframe as a delimited file" + + +keypoints: +- "R is a powerful, popular open-source scripting language" +- "RStudio allows you to run R in an easy-to-use interface and makes + it easy to find help" +- "You can customize the layout of RStudio, and use the project feature to manage + the files and packages used in your analysis" +- "R
provides thousands of functions for analyzing data, and provides several + ways to get help" +- "Using R will mean searching for online help, and there are tips and + resources on how to search effectively" + +--- + +## Getting ready to use R for the first time +In this lesson we will take you through the very first things you need to get +R working, and conclude by showing you the most effective ways to get help +when you are working with R on your own. + +>## Tip: This lesson works best on the cloud +> Remember, these lessons assume we are using the pre-configured virtual machine +> instances provided to you at a genomics workshop. Much of this work could be +> done on your laptop, but we use instances to simplify workshop setup +> requirements, and to get you familiar with using the cloud (a common +> requirement for working with big data). +> Visit the [Genomics Workshop setup page](http://www.datacarpentry.org/genomics-workshop/setup/) +> for details on getting this instance running on your own, or for the info you +> need to do this on your own computer. + {: .callout} + + +## A Brief History of R +[R](https://en.wikipedia.org/wiki/R_(programming_language)) has been around +since 1995, and was created by Ross Ihaka and Robert Gentleman at the University +of Auckland, New Zealand. R is based on the S programming language developed +at Bell Labs and was developed to teach intro statistics. See this [slide deck](https://www.stat.auckland.ac.nz/~ihaka/downloads/Massey.pdf) +by Ross Ihaka for more info on the subject. + +## Advantages of using R +At more than 20 years old, R is fairly mature and [growing in popularity](https://www.tiobe.com/tiobe-index/r/). However, programming isn’t a popularity contest. Here are key advantages of +analyzing data in R: + + - **R is [open source](https://en.wikipedia.org/wiki/Open-source_software)**.
Of + course this means R is free - which is an advantage if you end up at an + institution where you would have to pay for your own MATLAB or SAS license. + Open source is also important to your colleagues in parts of the world where + expensive software is inaccessible. It also means that R is actively + developed by a community (see [r-project.org](https://www.r-project.org/)), + and there are regular updates. + - **R is widely used**. Ok, maybe programming is a popularity contest. Because + R is used in many areas (not just bioinformatics), you are more likely to + find help online when you need it. Chances are, almost any error message you + run into, someone else has already experienced. +- **R is powerful**. R runs on multiple platforms (Windows/MacOS/Linux). It can + work with much larger datasets than popular spreadsheet programs like + Microsoft Excel, and because of its scripting capabilities is far more + reproducible. Also, there are thousands of available software packages for + science, including genomics and other areas of life science. + +>## Discussion: Your experience +> What has motivated you to learn R? Have you had a research question for which +> spreadsheet programs such as Excel have proven difficult to use, or where the +> size of the data set created issues? +{: .discussion} + + +---- + +## Introducing RStudio Server +In these lessons, we will be making use of software called [RStudio](https://www.rstudio.com/products/RStudio/), +an [Integrated Development Environment (IDE)](https://en.wikipedia.org/wiki/Integrated_development_environment). +RStudio, like most IDEs, provides a graphical interface to R, making it more +user-friendly, and providing dozens of useful features. We will introduce +additional benefits of using RStudio as you cover the lessons. In this case, +we are specifically using [RStudio Server](https://www.rstudio.com/products/RStudio/#Server), +a version of RStudio that can be accessed in your web browser.
RStudio Server +has the same features as the Desktop version of RStudio you could download as +standalone software. + +## Log on to RStudio Server + +Open a web browser and enter the IP address of your instance, followed by +`:8787`. For example, if your IP address was 123.456.789 your URL would be +> ~~~ +> http://123.456.789:8787 +> +> # Tip: Make sure there are no spaces before or after your URL or your web browser may interpret it as a search query +> ~~~ +> +{: .source} + +Enter your user credentials and click Sign In. The credentials for +the genomics Data Carpentry instances are: + + > **username**: dcuser + > + > **password**: data4Carp + +You should now see the RStudio interface: + +rstudio default session + +--- + +## Create an RStudio project + +One of the first benefits we will take advantage of in RStudio is something +called an **RStudio Project**. An RStudio Project allows you to easily save data, +files, variables, packages, etc. related to a specific analysis project you are +conducting in R. Saving your work into a project makes it easy to restart work +where you left off, and also makes it easier to collaborate, especially if you +are using version control such as [git](http://swcarpentry.github.io/git-novice/). + + +To create a project, go to the File menu, and click New Project.... + +rstudio default session + +In the window that opens select **New Directory**, then **New Project**. For +"Directory name:" enter **dc_genomics_r**. For "Create project as subdirectory of", +you may leave the default, which is your home directory "~". Finally click +Create Project. In the "Files" tab of your output pane (more about +the RStudio layout in a moment), you should see an RStudio project file, +**dc_genomics_r.Rproj**. All RStudio projects end with the "**.Rproj**" file +extension. + +>## Tip: Make your project more reproducible with Packrat +> One of the most wonderful and also frustrating aspects of working with R is +> managing packages.
We will talk more about them, but packages (e.g. ggplot2) +> are add-ons that extend what you can do with R. Unfortunately it is very +> common that you may run into versions of R and/or R packages that are not +> compatible. This may make it difficult for someone to run your R script using +> their version of R or a given R package, and/or make it more difficult to run +> their scripts on your machine. [Packrat](https://rstudio.github.io/packrat/) +> is an RStudio add-on that will associate your packages and project so that +> your work is more portable and reproducible. To turn on Packrat click on +> the Tools menu and select Project Options. Under +> **Packrat** check off "**Use packrat with this project**" and follow any +> installation instructions. +{: .callout} + +--- + +## Creating your first R script + +Now that we are ready to start exploring R, we will want to keep a record of the +commands we are using. To do this we can create an R script: + +Click the File menu and select New File and then +R Script. Before we go any further, save your script by clicking the +save/disk icon that is in the bar above the first line in the script editor, or +click the File menu and select Save. In the "Save File" +window that opens, name your file **"genomics_r_basics"**. The new script +**genomics_r_basics.R** should appear under "Files" in the output pane. By +convention, R scripts end with the file extension **.R**. + +--- + +## Overview and customization of the RStudio layout + +Now that we have covered the basics, let's address some ways to configure the +layout of RStudio. First, here are the major windows or panes of the RStudio +environment: + +rstudio default session + +- **Source**: This pane is where you will write/view R scripts. Some outputs + (such as if you view a dataset using `View()`) will appear as a tab here.
+- **Console/Terminal**: This is actually where you see the execution of commands, + and what R looks like if you were to run it at the command line without + RStudio. You can work interactively (i.e. enter R commands here), but for the + most part, we will run a script, or lines in a script and watch their + execution and output here. The "Terminal" tab gives you access to the BASH + terminal. +- **Environment/History**: Here, RStudio will show you what datasets and + variables you have created, and which are actively defined/in memory. You can + also see some characteristics of variables/datasets such as their type and + dimensions. The "History" tab contains a list of previously executed R + commands. +- **Files/Plots/Packages/Help**: This multipurpose pane will show you the + contents of directories on your computer. You can also use the "Files" tab to + navigate and set the working directory. The "Plots" tab will show the output + of any plots generated. In "Packages" you will see what packages are actively + loaded, or you can attach installed packages. "Help" will display help files + for R functions/packages. + +>## Tip: Downloads from the cloud +> In the "Files" tab you can select a file and download it from your cloud +> instance to your local computer. Uploads are also possible. +{: .callout} + +All of the panes in RStudio have configuration options. For example, you can +minimize/maximize a pane, or by moving your mouse in the space between +panes you can resize as needed. The most important customization options for +pane layout are in the View menu. Other options such as font sizes, +colors/themes, and more are in the Tools menu under +Global Options. + +>## Don't be fooled - you are working with R +> Although we won't be working with R at the terminal, there are lots of reasons +> to.
For example, once you have written an R script, you can run it at any Linux +> or Windows terminal without the need to start up RStudio. We just don't want +> you to get confused - RStudio runs R, but R is not RStudio. For more on +> running an R script at the terminal see this [Carpentry lesson](https://swcarpentry.github.io/r-novice-inflammation/05-cmdline/). +{: .callout} + + +--- + +## Getting to work with R: navigating directories +Now that we have covered the more aesthetic aspects of RStudio, we can get to +work learning some commands. We will write, execute, and save the commands we +learn in our **genomics_r_basics.R** script that is loaded in the Source pane. +First, let's see what directory we are in. To do so, type the following command +into the script: + +> ~~~ +> getwd() +> ~~~ +{: .language-r} + +To execute this command, make sure your cursor is on the same line as the +command. Then click the Run button that is just above the first +line of your script in the header of the Source pane. + + +In the console, we expect to see the following output*: + +> ~~~ +> [1] "/home/dcuser/dc_genomics_r" +> ~~~ +{: .output} + +\* Notice, at the Console, you will also see the instruction you executed +above the output in blue. + +Since we will be learning several commands, we may already want to keep some +short notes in our script to explain the purpose of the command. Entering a `#` +before any line in an R script turns that line into a comment, which R will +not try to run as code. Edit your script to include a comment on the +purpose of commands you are learning, e.g.: + +> ~~~ +> # this command shows the current working directory +> getwd() +> ~~~ +{: .language-r} + +--- + +> ## Exercise: Work interactively in R +> What happens when you try to enter the `getwd()` command in the Console pane? +> +>> ## solution +>> You will get the same output as when you ran `getwd()` from the +>> source.
You can run any command in the Console; however, executing it from +>> the source script will make it easier for us to record what we have done, +>> and ultimately run an entire script, instead of entering commands one-by-one. +> {: .solution} +{: .challenge} +--- + +For the purposes of this exercise we want you to be in the directory `"/home/dcuser/dc_genomics_r"`. +What if you weren't? You can set your working directory using the `setwd()` +command. Enter this command in your script, but *don't run* this yet. + +> ~~~ +> # This sets the working directory +> setwd() +> ~~~ +{: .language-r} + +As you may have guessed, you need to tell the `setwd()` command +what directory you want to set as your working directory. To do so, inside of +the parentheses, open a set of quotes. Inside the quotes enter a `/` which is +the root directory for Linux. Next, use the Tab key, to take +advantage of RStudio's Tab-autocompletion method, to select the `home`, `dcuser`, +and `dc_genomics_r` directories. The path in your script should look like this: + +> ~~~ +> # This sets the working directory +> setwd("/home/dcuser/dc_genomics_r") +> ~~~ +{: .language-r} + + +When you run this command, the console repeats the command, but gives you no +output. Instead, you see the blank R prompt: `>`. Congratulations! Although it +seems small, knowing what your working directory is, and being able to set your +working directory is the first step to analyzing your data. + +>## Tip: Never use `setwd()` +> Wait, what were the last 2 minutes about? Well, setting your working directory +> is something you need to do, but you need to be very careful about using this as +> a step in your script. For example, the top-level path in a Unix file system +> is root `/`, but on Windows it is likely `C:\`. This is one of several ways +> you might cause a script to break because a file path is configured differently +> than your script anticipates.
The [`here`](https://cran.r-project.org/web/packages/here/index.html) package +> and the base R function [`file.path`](https://www.rdocumentation.org/packages/base/versions/3.4.3/topics/file.path) +> allow you to specify file paths in a way that is more operating system +> independent. See Jenny Bryan's [blog post](https://www.tidyverse.org/articles/2017/12/workflow-vs-script/) for this +> and other R tips. +{: .callout} + +--- + +## Using functions in R, without needing to master them +Functions may seem like an advanced topic (and they are), but you have already +been using functions in R. In fact, even if you never learn how anything else +works in R, the next sections will help you understand what is happening in +any R script. A function in R (or any computing language) is basically a short +program that takes an input and returns an output. + +> ## Exercise: What do these functions do? +> Try the following functions by writing them in your script. See if you can +> guess what they do, and make sure to add comments to your script about your +> assumed purpose. +> - `dir()` +> - `sessionInfo()` +> - `date()` +> - `Sys.time()` +> +>> ## solution +>> - `dir()` # Lists files in the working directory +>> - `sessionInfo()` # Gives the version of R and additional info including +>> on attached packages +>> - `date()` # Gives the current date and time as text +>> - `Sys.time()` # Gives the current date and time as a date-time object +> {: .solution} +{: .challenge} + +You have hopefully noticed a pattern. Some more abstract exceptions aside, in R +a function has three key properties: +- Functions have a name (e.g. `dir`, `getwd`); note that these are case sensitive! +- Following the name, functions have a pair of `()` +- Inside the parentheses, a function may take 0 or more arguments + +An argument may be a specific input for your function and/or may modify the +function's behavior.
For example, the function `round()` will round a number +with a decimal: + +> ~~~ +> # This will round a number to the nearest whole number +> round(3.14) +> ~~~ +{: .language-r} + +Which returns + +> ~~~ +> [1] 3 +> ~~~ +{: .output} + +## Getting help with function arguments + +Of course, you may have wanted to round to a given number of decimal places. `round()` can +do this, but you may first need to read the help to find out how. To see the help +page (not to be confused with a "vignette", which is a longer guide that some +packages provide) enter a `?` in front of the function name: + +> ~~~ +> ?round() +> ~~~ +{: .language-r} + +The "Help" tab will show you information (and often, too much information). You +will slowly learn how to read through all of that. Checking the "Usage" or +"Examples" headings is often a good place to look first. If you look under +"Arguments" we also see what arguments we can "pass" to this function to modify +its behavior. You can also see a function's arguments using the `args()` function: + +> ~~~ +> args(round) +> ~~~ +{: .language-r} + +Which returns + +> ~~~ +> function (x, digits = 0) +> NULL +> ~~~ +{: .output} + +We see that `round()` takes two arguments, `x` which is your number, and a +`digits` argument. The `=` sign indicates that a default (in this case 0) is +already set. Since `x` is not set, `round()` requires we provide it, in contrast +to `digits` where R will use the default value 0 unless you explicitly provide +a different value. We can explicitly set the digits parameter when we call the +function: + +> ~~~ +> round(3.14159, digits = 2) +> ~~~ +{: .language-r} + +> ~~~ +> [1] 3.14 +> ~~~ +{: .output} + +R also accepts what we call "positional arguments": if you pass a function +arguments separated by commas, R assumes that they are in the order you saw +when we used `args()`. In the case below that means that `x` is 3.14159 and +`digits` is 2.
+ +> ~~~ +> round(3.14159, 2) +> ~~~ +{: .language-r} + +> ~~~ +> [1] 3.14 +> ~~~ +{: .output} + +Finally, what if you are using `?` to get help for a function in a package that +is not loaded in your R session? + +> ~~~ +> ?geom_point() +> ~~~ +{: .language-r} + +will return an error: + +> ~~~ +> Error in .helpForCall(topicExpr, parent.frame()) : +> no methods for ‘geom_point’ and no documentation for it as a function +> ~~~ +{: .error} + + +Use two question marks (i.e. `??geom_point`) and R will search the documentation +of all of your installed packages and show the results in the "Help" tab. Also, +if you think there should be a function, +for example a statistical test, but you aren't sure what R calls it, or what +functions may be available, use the `help.search()` function. + +> ## Exercise: Searching for R functions +> Use `help.search()` to find R functions for the following statistical +> functions. Remember to put what you are using for your search query in +> quotes inside the function parentheses. +> +> - Chi-Squared test +> - Student-t test +> - mixed linear model +> +>> ## solution +>> While your search results may return several tests, we list a few you might +>> find: +>> - Chi-Squared test: `stats::Chisquare` +>> - Student-t test: `stats::TDist` +>> - mixed linear model: `stats::lm.glm` +> {: .solution} +{: .challenge} + + +We will discuss more on where to look for the libraries and packages that +contain functions you want to use. For now, be aware that two important ones +are [CRAN](https://cran.r-project.org/) - the main repository for R, and +[Bioconductor](http://bioconductor.org/) - a popular repository for +bioinformatics R packages. + +--- + +## RStudio contextual help + +Here is one last bonus we will mention about RStudio. It's difficult to +remember all of the arguments and definitions associated with a given function.
+When you start typing the name of a function and hit the Tab key, +RStudio will display functions and associated help: + +rstudio default session + +Once you type a function, hitting Tab inside the parentheses +will remind you of arguments and provide additional help. + +rstudio default session + + +--- + +## Getting help with R + +rstudio default session + +Finally, no matter how much experience you have with R, you will find yourself +needing help. There is no shame in researching how to do something in R, and +most people will find themselves looking up how to do the same things that +they "should know how to do" over and over again. Here are some tips to make +this process as helpful and efficient as possible. + +> "Never memorize something that you can look up" +> - A. Einstein + +## Finding help on Stack Overflow and Biostars + +Two popular websites will be of great help with many R problems. For **general** +**R questions**, [Stack Overflow](https://stackoverflow.com/) is probably the most +popular online community for developers. If you start your web search with "How to do X +in R", results from Stack Overflow are usually near the top of the list. For +**bioinformatics specific questions**, [Biostars](https://www.biostars.org/) is +a popular online forum. + +>## Tip: Asking for help using online forums: +> +> - When searching for R help, look for answers with the [r](https://stackoverflow.com/questions/tagged/r) tag. +> - Get an account; it is not required to view answers but is required to post +> - Put in effort to check thoroughly before you post a question; folks get +> annoyed if you ask a very common question that has been answered multiple +> times +> - Be careful.
While forums are very helpful, you can't know for sure if the +> advice you are getting is correct +> - See the [How to ask for R help](http://blog.revolutionanalytics.com/2014/01/how-to-ask-for-r-help.html) +> blog post for more useful tips +> +{: .callout} + +## Help people help you + +Often, in order to duplicate the issue you are having, someone may need to see +the data you are working with or verify the versions of R or R packages you +are using. The following R functions will help with this: + +You can **check the version of R** you are working with using the `sessionInfo()` +function. Actually, it is good to save this information as part of your notes +on any analysis you are doing. When a script that has worked fine +a dozen times before suddenly fails, looking back at these notes will remind you that you +upgraded R and forgot to check your script. + + +> ~~~ +> sessionInfo() +> ~~~ +{: .language-r} + +> ~~~ +> R version 3.2.3 (2015-12-10) +> Platform: x86_64-pc-linux-gnu (64-bit) +> Running under: Ubuntu 14.04.3 LTS +> +> locale: +> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 +> [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 +> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C +> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C +> +> attached base packages: +> [1] stats graphics grDevices utils datasets methods base +> +> loaded via a namespace (and not attached): +> [1] tools_3.2.3 packrat_0.4.9-1 +> ~~~ +{: .output} + +Many times, there may be some issues with your data and the way it is formatted. +In that case, you may want to share that data with someone else. However, you +may not need to share the whole dataset; sharing all of your 50,000 row, +10,000 column dataframe may be TMI (too much information)! You can take an +object you have in memory such as a dataframe (if you don't know what this means +yet, we will get to it!) and save it to a file.
In our example we will use the +`dput()` function on the `iris` dataframe which is an example dataset that is +installed in R: + + +> ~~~ +> dput(head(iris)) # iris is an example data.frame that comes with R +> # the `head()` function just takes the first 6 lines of the iris dataset +> ~~~ +{: .language-r} + +This generates some output (below) which you will be better able to interpret +after covering the other R lessons. This info would be helpful in understanding +how the data is formatted and possibly revealing problematic issues. + +> ~~~ +> structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4), +> Sepal.Width = c(3.5, 3, 3.2, 3.1, 3.6, 3.9), Petal.Length = c(1.4, +> 1.4, 1.3, 1.5, 1.4, 1.7), Petal.Width = c(0.2, 0.2, 0.2, +> 0.2, 0.2, 0.4), Species = structure(c(1L, 1L, 1L, 1L, 1L, +> 1L), .Label = c("setosa", "versicolor", "virginica"), class = "factor")), .Names = c("Sepal.Length", +> "Sepal.Width", "Petal.Length", "Petal.Width", "Species"), row.names = c(NA, +> 6L), class = "data.frame") +> ~~~ +{: .output} + +Alternatively, you can also save objects in R memory to a file by specifying +the name of the object, in this case the `iris` data frame, and passing a +filename to the `file=` argument. + +> ~~~ +> saveRDS(iris, file="iris.rds") # By convention, we use the .rds file extension +> ~~~ +{: .language-r} + +--- + +## Final FAQs on R + +Finally, here are a few pieces of introductory R knowledge that are too good to +pass up. While we won't return to them in this course, we put them here because +they come up commonly: + +**Do I need to click Run every time I want to run a script?** + +- No. 
In fact, the most common shortcut key allows you to run a command (or + any lines of the script that are highlighted): + - Windows execution shortcut: Ctrl+Enter + - Mac execution shortcut: Cmd(⌘)+Enter + + To see a complete list of shortcuts, click on the Tools menu and + select Keyboard Shortcuts Help + +**What's with the brackets in R console output?** +- R returns an index with your result. When your result contains multiple values, + the number tells you what ordinal number begins the line, for example: + +> ~~~ +> 1:101 # generates the sequence of numbers from 1 to 101 +> ~~~ +{: .language-r} + +In the output below, `[81]` indicates that the first value on that line is the +81st item in your result + +> ~~~ +> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 +> [21] 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 +> [41] 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 +> [61] 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 +> [81] 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 +> [101] 101 +> ~~~ +{: .output} + + +**Can I run my R script without RStudio?** + +- Yes, remember - RStudio is running R. You get to use lots of the enhancements + RStudio provides, but R works independent of RStudio. 
See [these tips](https://support.rstudio.com/hc/en-us/articles/218012917-How-to-run-R-scripts-from-the-command-line) + for running your commands at the command line + + +**Where else can I learn about RStudio?** +- Check out the Help menu, especially "Cheatsheets" section + +--- From 51e16cbad33dc880c9b5f953ade73a3cd613fdfb Mon Sep 17 00:00:00 2001 From: Jason Williams Date: Thu, 3 May 2018 13:24:09 -0400 Subject: [PATCH 07/19] push changes to episode 2 --- episodes/02-r-basics.md | 75 ++++++++++++++++++++++++++--------------- 1 file changed, 48 insertions(+), 27 deletions(-) diff --git a/episodes/02-r-basics.md b/episodes/02-r-basics.md index ce489f8e..09218fe3 100644 --- a/episodes/02-r-basics.md +++ b/episodes/02-r-basics.md @@ -8,46 +8,67 @@ questions: - "How do I get started with tabular data (e.g. spreadsheets) in R?" objectives: - "Know how far you can get with basic R skills" -- "Be able to explain what a data types are, and know the common R datatypes (modes)" -- "Be able to create the most common R objects including vectors, factors, lists, and dataframes" +- "Be able to explain what a data types are, and know the common R datatypes + (modes)" +- "Be able to create the most common R objects including vectors, factors, + lists, and dataframes" - "Be able to retrieve (index), name, or replace, values from an object" - "Be able to do simple arithmetic procedures on R objects" - "Be able to load a tabular dataset using base R functions" - "Explain the basic principle of tidy datasets" -- "Be able to determine the structure of a datafram including its dimensions and the datatypes of variables" +- "Be able to determine the structure of a dataframe including its dimensions + and the datatypes of variables" - "Be able to retrieve (index) a dataframe" - "Be able to apply an arithmetic function to a dataframe" - "Be able to coerce the class of an object (including variables in a dataframe)" - "Be able to save a dataframe as a delimited file" - - keypoints: -- "R is 
a powerful, popular open-source scripting language" -- "RStudio allows you to run R in an easy-to-use interface and makes - it easy to find help" -- "You can customize the layout of RStudio, and use the project feature to manage - the files and packages used in your analysis" -- "R provides thousands of functions for analyzing data, and provides several - way to get help" -- "Using R will mean searching for online help, and there are tips and - resources on how to search effectively" - +- "Effectively using R is a journey of months or years. Still you don't have to + be an expert to use R and you can start using and analzying your data with + with about a day's worth of training" +- "It is important to understand how data are organized by R in a given object + type (e.g. vector, factor, dataframe, etc.) how the mode of that type + (e.g. numeric, character, logical, etc.) will determine how R will operate + on that data, and what can happen when datatypes are coerced, misinterpreted, + or misapplied" +- "Data wrangling - loading data, cleaning this data (e.g checking for outliers, + handling missing values, sorting, filtering, etc.) is an important first step + for working with data" --- -## Getting ready to use R for the first time -In this lesson we will take you through the very first things you need to get -R working, and conclude by showing you the most effective ways to get help -when you are working with R on your own. +## "The fantastic world of R awaits you" OR "Nobody wants to learn how to use R" +Before we begin this lesson, we want you to be clear on the goal of the workshop +and these lessons. We believe that every learner can be **achieve competency +with R**. You have reached competency when you find that you are able to +**use R to handle common analysis challenges in a reasonable amount of time** +(which includes time needed to spend looking at learning materials, searching +for help online, and asking colleagues for help). 
As you spend more time using R +(there is no substitute for regular use and practice) you will find yourself +gaining competency and even expertise. The more familiar you get, the more +complex analyses you will be able to carry out, with less frustration, and in +less time - the "fantastic world of R" awaits you! + +## What these lessons will not teach you +Nobody wants to learn how to use R, people want to learn how to use R to analyze +their own research questions! Ok, maybe some folks learn R for R's sake, but +these lessons assume that you want to start analyzing genomic data as soon as +possible. Given this, there are many valuable pieces of information about R +that we simply wont have time to cover. Hopefully we will clear the hurdle of +giving you just enough knowledge to be dangerous (which can be a high hurdle +in R), but we also suggest you look into additional the learning materials in +the tip box below. >## Tip: This lesson works best on the cloud -> Remember, these lessons assume we are using the pre-configured virtual machine -> instances provided to you at a genomics workshop. Much of this work could be -> done on your laptop, but we use instances to simplify workshop setup -> requirements, and to get you familiar with using the cloud (a common -> requirement for working with big data). -> Visit the [Genomics Workshop setup page](http://www.datacarpentry.org/genomics-workshop/setup/) -> for details on getting this instance running on your own, or for the info you -> need to do this on your own computer. +> The following are good resources for learning more about R. Some of them +> can be quite technically, but if you are a regular R user you may ultimately +> need some of this technical knowledge. 
+> - [The R Manuals](https://cran.r-project.org/manuals.html): Maintained by the R project +> - [R contributed documentation](https://cran.r-project.org/other-docs.html): Also linked to the R project; importantly there are materials available in several languages +> - [R for Data Science](http://r4ds.had.co.nz/): A wonderful collection by noted R +educators and developers Garrett Grolemund and Hadley Wickham +> - [Practical Data Science for Stats](https://peerj.com/collections/50-practicaldatascistats/): +Not exclusively about R usage, but a nice collection of pre-prints on data science +and applications for R {: .callout} From 0896b49aec8c6cc52f804b1334d6773a86880b06 Mon Sep 17 00:00:00 2001 From: JasonJWilliamsNY Date: Fri, 4 May 2018 16:42:31 -0400 Subject: [PATCH 08/19] complete up to math with objects --- episodes/02-r-basics.md | 780 ++++++++++------------------------------ 1 file changed, 194 insertions(+), 586 deletions(-) diff --git a/episodes/02-r-basics.md b/episodes/02-r-basics.md index 09218fe3..005587e0 100644 --- a/episodes/02-r-basics.md +++ b/episodes/02-r-basics.md @@ -3,22 +3,25 @@ title: "R Basics" teaching: 60 exercises: 30 questions: +- "What will these lessons not cover?" - "What are the basic features of the R language?" - "What are the most common objects in R?" - "How do I get started with tabular data (e.g. spreadsheets) in R?" 
objectives: -- "Know how far you can get with basic R skills" -- "Be able to explain what a data types are, and know the common R datatypes +- "Identify R skills not covered in these lessons and where to learn more" +- "Be able to create and appropriately name objects in R" +- "Be able to explain what a data types are, and know the common R data types (modes)" +- "Be able to do simple arithmetic of functional procedures on R objects" +- "Be able to reassign object values and delete objects" - "Be able to create the most common R objects including vectors, factors, - lists, and dataframes" + lists, and data frames" - "Be able to retrieve (index), name, or replace, values from an object" -- "Be able to do simple arithmetic procedures on R objects" - "Be able to load a tabular dataset using base R functions" - "Explain the basic principle of tidy datasets" -- "Be able to determine the structure of a dataframe including its dimensions +- "Be able to determine the structure of a data frame including its dimensions and the datatypes of variables" -- "Be able to retrieve (index) a dataframe" +- "Be able to retrieve (index) a data frame" - "Be able to apply an arithmetic function to a dataframe" - "Be able to coerce the class of an object (including variables in a dataframe)" - "Be able to save a dataframe as a delimited file" @@ -27,7 +30,7 @@ keypoints: be an expert to use R and you can start using and analzying your data with with about a day's worth of training" - "It is important to understand how data are organized by R in a given object - type (e.g. vector, factor, dataframe, etc.) how the mode of that type + type (e.g. vector, factor, data frame, etc.) how the mode of that type (e.g. numeric, character, logical, etc.) will determine how R will operate on that data, and what can happen when datatypes are coerced, misinterpreted, or misapplied" @@ -41,650 +44,255 @@ Before we begin this lesson, we want you to be clear on the goal of the workshop and these lessons. 
We believe that every learner can be **achieve competency with R**. You have reached competency when you find that you are able to **use R to handle common analysis challenges in a reasonable amount of time** -(which includes time needed to spend looking at learning materials, searching -for help online, and asking colleagues for help). As you spend more time using R -(there is no substitute for regular use and practice) you will find yourself -gaining competency and even expertise. The more familiar you get, the more -complex analyses you will be able to carry out, with less frustration, and in -less time - the "fantastic world of R" awaits you! +(which includes time needed to look at learning materials, search for answers +online, and ask colleagues for help). As you spend more time using R (there is +no substitute for regular use and practice) you will find yourself gaining +competency and even expertise. The more familiar you get, the more +complex the analyses you will be able to carry out, with less frustration, and +in less time - the "fantastic world of R" awaits you! ## What these lessons will not teach you -Nobody wants to learn how to use R, people want to learn how to use R to analyze +Nobody wants to learn how to use R. People want to learn how to use R to analyze their own research questions! Ok, maybe some folks learn R for R's sake, but these lessons assume that you want to start analyzing genomic data as soon as possible. Given this, there are many valuable pieces of information about R that we simply wont have time to cover. Hopefully we will clear the hurdle of -giving you just enough knowledge to be dangerous (which can be a high hurdle -in R), but we also suggest you look into additional the learning materials in -the tip box below. +giving you just enough knowledge to be dangerous, which can be a high hurdle +in R! We suggest you look into the additional learning materials in the tip box +below.
->## Tip: This lesson works best on the cloud +**Here are some R skills we will *not* cover in these lessons** + +- How to create and work with R matrices and R lists +- How to create and work with loops and conditional statements +- How to do basic string manipulations (e.g. finding patterns in text using grep) +- How to plot using the default R graphic tools (we *will* cover ggplot2) +- How to use the advanced R statistical functions + +>## Tip: Where to learn more > The following are good resources for learning more about R. Some of them > can be quite technically, but if you are a regular R user you may ultimately > need some of this technical knowledge. -> - [The R Manuals](https://cran.r-project.org/manuals.html): Maintained by the R project -> - [R contributed documentation](https://cran.r-project.org/other-docs.html): Also linked to the R project; importantly there are materials available in several languages -> - [R for Data Science](http://r4ds.had.co.nz/): A wonderful collection by noted R -educators and developers Garrett Grolemund and Hadley Wickham +> - [R for Beginners](https://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf): + By Emmanuel Paradis, great starting point +> - [The R Manuals](https://cran.r-project.org/manuals.html): Maintained by the + R project +> - [R contributed documentation](https://cran.r-project.org/other-docs.html): + Also linked to the R project; importantly there are materials available in + several languages +> - [R for Data Science](http://r4ds.had.co.nz/): A wonderful collection by + noted R educators and developers Garrett Grolemund and Hadley Wickham > - [Practical Data Science for Stats](https://peerj.com/collections/50-practicaldatascistats/): -Not exclusively about R usage, but a nice collection of pre-prints on data science -and applications for R + Not exclusively about R usage, but a nice collection of pre-prints on data science + and applications for R +> - [Programming in R Software Carpentry 
lesson](https://software-carpentry.org/lessons/): + There are several Software Carpentry lessons in R to choose from {: .callout} +## Creating objects in R -## A Brief History of R -[R](https://en.wikipedia.org/wiki/R_(programming_language)) has been around -since 1995, and was created by Ross Ihaka and Robert Gentleman at the University -of Auckland, New Zealand. R is based off the S programming language developed -at Bell Labs and was developed to teach intro statistics. See this [slide deck](https://www.stat.auckland.ac.nz/~ihaka/downloads/Massey.pdf) -by Ross Ihaka for more info on the subject. - -## Advantages of using R -At more than 20 years old, R is fairly mature and [growing in popularity](https://www.tiobe.com/tiobe-index/r/). However, programming isn’t a popularity contest. Here are key advantages of -analyzing data in R: - - - **R is [open source](https://en.wikipedia.org/wiki/Open-source_software)**. Of - course this means R is free - which is an advantage if you end up at a - institution where you would have to pay for your own MATLAB or SAS license. - Open source, is important to your colleagues in parts of the world where - expensive software in inaccessible. It also means that R is actively - developed by a community (See [r-project.org](https://www.r-project.org/)), - and there are regular updates. - - **R is widely used**. Ok, maybe programming is a popularity contest. Because, - R is used in many areas (not just bioinformatics), you are more likely to - find help online when you need it. Chances are, almost any error message you - run into, someone else has already experienced. -- **R is powerful**. R runs on multiple platforms (Windows/MacOS/Linux). It can - work with much larger datasets than popular spreadsheet programs like - Microsoft Excel, and because of its scripting capabilities is far more - reproducible. Also, there are thousands of available software packages for - science, including genomics and other areas of life science. 
- ->## Discussion: Your experience -> What has motivated you to learn R? Have you had a research question for which -> spreadsheet programs such as Excel have proven difficult to use, or where the -> size of the data set created issues? -{: .discussion} - - ----- - -## Introducing RStudio Server -In these lessons, we will be making use of a software called [RStudio](https://www.rstudio.com/products/RStudio/), -an [Integrated Development Environment (IDE)](https://en.wikipedia.org/wiki/Integrated_development_environment). -RStudio, like most IDEs, provides a graphical interface to R, making it more -user-friendly, and providing dozens of useful features. We will introduce -additional benefits of using RStudio as you cover the lessons. In this case, -we are specifically using [RStudio Server](https://www.rstudio.com/products/RStudio/#Server), -a version of RStudio that can be accessed in your web browser. RStudio Server -has the same features of the Desktop version of RStudio you could download as -standalone software. - -## Log on to RStudio Server - -Open a web browser and enter the IP address of your instance, followed by -`:8787`. For example, if your IP address was 123.456.789 your URL would be -> ~~~ -> http://123.456.789:8787 +> ## Reminder +> At this point you should be following along in the "**genomics_r_basics.R**" +> script we created in the last episode. Writing your commands in the script +> will make it easier to record what you did and why. > -> # Tip: Make sure there are no spaces before or after your URL or your web browser may interpret it as a search query -> ~~~ -> -{: .source} - -Enter your user credentials and click Sign In.
The credentials for -the genomics Data Carpentry instances are: - - > **username**: dcuser - > - > **password**: data4Carp - -You should now see the RStudio interface: - -rstudio default session - ---- +{: .prereq} -## Create an RStudio project - -One of the first benefits we will take advantage of in RStudio is something -called an **RStudio Project**. An RStudio Project allows you easily save data, -files, variables, packages, etc. related to a specific analysis project you are -conducting in R. Saving your work into a project makes it easy to restart work -where you left off, and also makes it easier to collaborate, especially if you -are using version control such as [git](http://swcarpentry.github.io/git-novice/). - - -To create a project, go to the File menu, and click New Project.... - -rstudio default session - -In the window that opens select **New Directory**, then **New Project**. For -"Directory name:" enter **dc_genomics_r**. For "Create project as subdirectory of", -you may leave the default, which is your home directory "~". Finally click -Create Project. In the "Files" tab of your output pane (more about -the RStudio layout in a moment), you should see an RStudio project file, -**dc_genomics_r.Rroj**. All RStudio projects end with the "**.Rproj**" file -extension. - ->## Tip: Make your project more reproducible with Packrat -> One of the most wonderful and also frustrating aspects of working with R is -> managing packages. We will talk more about them, but packages (e.g. ggplot2) -> are add-ons that extend what you can do with R. Unfortunately it is very -> common that you may run into versions of R and/or R packages that are not -> compatible. This may make it difficult for someone to run your R script using -> their version of R or a given R package, and/or make it more difficult to run -> their scripts on your machine. 
[Packrat](https://rstudio.github.io/packrat/) -> is an RStudio add-on that will associate your packages and project so that -> your work is more portable and reproducible. To turn on Packrat click on -> the Tools menu and select Project Options. Under -> **Packrat** check off "**Use packrat with this project**" and follow any -> installation instructions. -{: .callout} - ---- - -## Creating your first R script - -Now that we are ready to start exploring R, we will want to keep a record of the -commands we are using. To do this we can create an R script: - -Click the File menu and select New File and then -R Script. Before we go any further, save your script by clicking the -save/disk icon that is in the bar above the first line in the script editor, or -click the File menu and select save. In the "Save File" -window that opens, name your file **"genomics_r_basics"**. The new script -**genomics_r_basics.R** should appear under "files" in the output pane. By -convention, R scripts end with the file extension **.R**. - ---- - -## Overview and customization of the RStudio layout - -Now that we have covered the basics, lets address some ways to configure the -layout of RStudio. First, here are the major windows or panes of the RStudio -environment: - -rstudio default session - -- **Source**: This pane is where you will write/view R scripts. Some outputs - (such as if you view a dataset using `View()`) will appear as a tab here. -- **Console/Terminal**: This is actually where you see the execution of commands - , and what R looks like if you were to run it at the command line without - RStudio. You can work interactively (i.e. enter R commands here), but for the - most part, we will run a script, or lines in a script and watch their - execution and output here. The "Terminal" tab give you access to the BASH - terminal. -- **Environment/History**: Here, RStudio will show you what datasets and - variables you have created, and which are actively defined/in memory. 
You can - also see some characteristics of variables/datasets such as their type and - dimensions. A "History" tab also contains a history of executed R commands. In - the history tab you can see a list of previously executed commands. -- **Files/plots/Packages/help**: This multipurpose pane will show you the - contents of directories on your computer. You can also use the "Files" tab to - navigate and set the working directory. The "Plots" tab will show the output - of any plots generated. In "Packages" you will see what packages are actively - loaded, or you can attach installed packages. "Help" will display help files - for R functions/packages. - ->## Tip: Downloads from the cloud -> In the "Files" tab you can select a file and download it from your cloud -> instance to your local computer. Uploads are also possible. -{: .callout} - -All of the panes in RStudio have configuration options. For example, you can -minimize/maximize a pane, or by moving your mouse in the space between -panes you can resize as needed. The most important customization options for -pane layout are in the View menu. Other options such as font sizes, -colors/themes, and more are in the Tools menu under -Global Options. - ->## Don't be fooled - you are working with R -> Although we won't be working with R at the terminal, there are lots of reasons -> to. For example, once you have written an RScript, you can run it at any Linux -> or Windows terminal without the need to start up RStudio. We just don't want -> you to get confused - RStudio runs R, but R is not RStudio. For more on -> running an R Script at the terminal see this [Carpentry lesson](https://swcarpentry.github.io/r-novice-inflammation/05-cmdline/). -{: .callout} - - ---- - -## Getting to work with R: navigating directories -Now that we have covered the more aesthetic aspects of RStudio, we can get to -work learning some commands. 
We will write, execute, and save the commands we -learn in our **genomics_r_basics.R** script that is loaded in the Source pane. -First, lets see what directory we are in. To do so, type the following command -into the script: +What might be called a variable in many languages is properly called an **object** +in R. To create your object you need a name (e.g. 'a'), and a value (e.g. '1'), +which you assign using the R assignment operator '<-'. In your script, "**genomics_r_basics.R**" +write a comment (using the '#' sign), and assign '1' to the object 'a' as shown +below: > ~~~ -> getwd() +> # this line creates the object 'a' and assigns it the value '1' +> a <- 1 > ~~~ {: .language-r} -To execute this command, make sure your cursor is on the same line the command -is written. Then click the Run button that is just above the first -line of your script in the header of the Source pane. - +Be sure to execute this line of code in your script. You can run a line of code +by hitting the Run button that is just above the first line of your +script in the header of the Source pane or you can use the appropriate shortcut: + - Windows execution shortcut: Ctrl+Enter + - Mac execution shortcut: Cmd(⌘)+Enter +To run multiple lines of code, you can highlight all the lines you wish to run +and then hit Run or use the shortcut key combo. -In the console, we expect to see the following output*: +You should notice the following outputs; in the RStudio 'Console' you should see: > ~~~ -> [1] "/home/dcuser/dc_genomics_r" +> # this line creates the object 'a' and assigns it the value '1' +> a <- 1 > ~~~ {: .output} -\* Notice, at the Console, you will also see the instruction you executed -above the output in blue. +The 'Console' will display lines of code run from a script and any outputs or +status/warning/error messages (usually in red). -Since we will be learning several commands, we may already want to keep some -short notes in our script to explain the purpose of the command.
Entering a `#` -before any line in an R script. Edit your script to include a comment on the -purpose of commands you are learning, e.g.: +You should also notice that in the 'Environment' window you get a table: -> ~~~ -> # this command shows the current working directory -> getwd() -> ~~~ -{: .language-r} +|Values|| +|------|-| +|a|1| ---- +The 'Environment' window allows you to easily keep track of the objects you have +created in R. -> ## Exercise: Work interactively in R -> What happens when you try to enter the `getwd()` command in the Console pane? +> ## Exercise: Create some objects in R +> Create the following objects in R, give each object an appropriate name. > ->> ## solution ->> You will get the same output you did as when you ran `getwd()` from the ->> source. You can run any command in the Console, however, executing it from ->> the source script will make it easier for us to record what we have done, ->> and ultimately run an entire script, instead of entering commands one-by-one. -> {: .solution} -{: .challenge} ---- - -For the purposes of this exercise we want you to be in the directory `"/home/dcuser/dc_genomics_r"`. -What if you weren't? You can set your home directory using the `setwd()` -command. Enter this command in your script, but *don't run* this yet. - -> ~~~ -> # This sets the working directory -> setwd() -> ~~~ -{: .language-r} - -You may have guessed, you need to tell the `setwd()` command -what directory you want to set as your working directory. To do so, inside of -the parentheses, open a set of quotes. Inside the quotes enter a `/` which is -the root directory for Linux. Next, use the Tab key, to take -advantage of RStudio's Tab-autocompletion method, to select `home`, `dcuser`, -and `dc_genomics_r` directory. 
The path in your script should look like this: - -> ~~~ -> # This sets the working directory -> setwd("/home/dcuser/dc_genomics_r") -> ~~~ -{: .language-r} - - -When you run this command, the console repeats the command, but gives you no -output. Instead, you see the blank R prompt: `>`. Congratulations! Although it -seems small, knowing what your working directory is, and being able to set your -working directory is the first step to analyzing your data. - ->## Tip: Never use `setwd()` -> Wait, what was the last 2 minutes about? Well, setting your working directory -> is something you need to do, you need to be very careful about using this as -> a step in your script. For example, the top-level path in a Unix file system -> is root `/`, but on Windows it is likely `C:\`. This is one of several ways -> you might cause a script to break because a file path is configured differently -> than your script anticipates. R packages like [`here`](https://cran.r-project.org/web/packages/here/index.html) -> and [`file.path`](https://www.rdocumentation.org/packages/base/versions/3.4.3/topics/file.path) -> allow you to specify file paths is a way that is more operating system -> independent. See Jenny Bryan's [blog post](https://www.tidyverse.org/articles/2017/12/workflow-vs-script/) for this -> and other R tips. -{: .callout} - ---- - -## Using functions in R, without needing to master them -Functions may seem like an advanced topic (and they are), but you have already -been using functions in R. In fact, even if you never learn how anything else -works in R, the next sections will help you understand what is happening in -any R script. A function in R (or any computing language) is basically a short -program that takes an input and returns and output. - -> ## Exercise: What do these functions do? -> Try the following functions by writing them in your script. See if you can -> guess what they do, and make sure to add comments to your script about your -> assumed purpose. 
-> - `dir()` -> - `sessionInfo()` -> - `date()` -> - `Sys.time()` +> 1. Create an object that has the value of the number of pairs of human chromosomes +> 2. Create an object that has a value of your favorite gene name +> 3. Create an object that has the value of this URL: "ftp://ftp.ensemblgenomes.org/pub/bacteria/release-39/fasta/bacteria_5_collection/escherichia_coli_b_str_rel606/" +> 4. Create an object that has the value of the number of chromosomes in a diploid cell > >> ## solution ->> - `dir()` # lists files in the working directory ->> - `sessionInfo()` # Gives the version of R and additional info including ->> on attached packages ->> - `date()` # Gives the current date ->> - `Sys.time()` # Gives the current time +>> Here are some possible answers to the challenge: +>> 1. human_chr_number <- 23 +>> 2. gene_name <- 'pten' +>> 3. ensemble_url <- 'ftp://ftp.ensemblgenomes.org/pub/bacteria/release-39/fasta/bacteria_5_collection/escherichia_coli_b_str_rel606/' +>> 4. human_diploid_chr_num <- 2 * human_chr_number +>> > {: .solution} {: .challenge} -You have hopefully noticed a pattern, some more abstract exceptions aside, in R -a function has three key properties: -- functions have a name (e.g. `dir`, `getwd`); note that these are case sensitive! -- following the name, functions have a pair of `()` -- Inside the parentheses, a function may take 0 or more arguments - -An argument may be a specific input for your function and/or may modify the -function's behavior. For example the function `round()` will round a number -with a decimal: - -> ~~~ -> # This will round up a number -> round(3.14) -> ~~~ -{: .language-r} - -Which returns - -> ~~~ -> [1] 3 -> ~~~ -{: .output} - -## Getting help with function arguments - -Of course, you may have wanted to round to one significant digit. `round()` can -do this, but you may fist need to read the help to find out how.
To see the help -(In R sometimes also called a "vignette") enter a `?` in front of the function -name: - -> ~~~ -> ?round() -> ~~~ -{: .language-r} - -The "Help" tab will show you information (and often, too much information). You -will slowly learn how to read through all of that. Checking the "Usage" or -"Examples" headings is often a good place to look first. If you look under -"Arguments" we also see what arguments we can "pass" to this function to modify -its behavior. You can also see a function's argument using the `args()` function: - -> ~~~ -> args(round) -> ~~~ -{: .language-r} - -Which returns - -> ~~~ -> function (x, digits = 0) -> NULL -> ~~~ -{: .output} - -We see that `round()` takes two arguments, `x` which is your number, and a -`digits` argument. The `=` sign indicates that a default (in this case 0) is -already set. Since `x` is not set, `round()` requires we provide it, in contrast -to `digits` where R will use the default value 0 unless you explicitly provide -a different value. We can explicitly set the digits parameter when we call the -function: - -> ~~~ -> round(3.14159, digits = 2) -> ~~~ -{: .language-r} +## Naming objects in R + +Here are some important details about naming objects in R. + +- **Avoid spaces and special characters**: Object names cannot contain spaces. Typically + you can use '_' to provide separation ('-' does not work, because R reads it as the minus operator). You should avoid using special + characters in your object name (e.g. ! @ # . , etc.). Also, names cannot begin with + a number. +- **Use short, easy-to-understand names**: You should avoid naming your objects + using single letters (e.g. 'n', 'p', etc.). This is mostly to encourage you + to use names that would make sense to anyone reading your code (a colleague, + or even yourself a year from now). Also, avoiding really long names will make + your code more readable. +- **Avoid commonly used names**: There are several names that may already have a + definition in the R language (e.g. 'mean', 'min', 'max').
One clue that a name + already has meaning: if you start typing a name in RStudio and either + pause your typing or hit the Tab key, and RStudio gives you a + suggested autocompletion or help message, you have chosen a name that has a + prior meaning. +- **Use the recommended assignment operator**: In R, we use '<-' as the + preferred assignment operator. '=' works too, but is most commonly used in + passing arguments to functions (more on functions later). There is a shortcut + for the R assignment operator: + - Windows shortcut: Alt+- + - Mac shortcut: Option+- + + +There are a few more suggestions about naming and style you may want to learn +more about as you write more R code. There are several "style guides" that +have advice, and one to start with is the [tidyverse R style guide](http://style.tidyverse.org/index.html). + +>## Tip: Pay attention to warnings in the script console +> +> If you enter a line of code in your R script that contains some error, RStudio +> may give you a hint with an error indication and an underline of this mistake. +> Sometimes these messages are easy to understand, but often the message may +> need some figuring out. In any case paying attention to these warnings helps +> you avoid mistakes. In this case, our object name has a space, which is not +> allowed in R. Notice the error message does not say this directly, but +> essentially R is "not sure" about how to assign the name to "human_ chr_number" +> when the object name we want is "human_chr_number". +> +> rstudio script warning +> + {: .callout} -> ~~~ -> [1] 3.14 -> ~~~ -{: .output} +## Reassigning object names or deleting objects -Or, R accepts what we call "positional arguments", if you pass a function -arguments separated by commas, R assumes that they are in the order you saw -when we used `args()`. In the case below that means that `x` is 3.14159 and -digits is 2. +Once an object has a value, you can change that value by overwriting it.
R will +not complain about overwriting objects, which may or may not be a good thing +depending on how you look at it. > ~~~ -> round(3.14159, 2) +> # gene_name has the value 'pten' or whatever value you used in the challenge. We will now assign the new value 'tp53' +> gene_name <- 'tp53' > ~~~ {: .language-r} -> ~~~ -> [1] 3.14 -> ~~~ -{: .output} - -Finally, what if you are using `?` to get help for a function in a package not -installed on your system? +You can also remove an object from R's memory entirely. The `rm()` function +will delete the object. > ~~~ -> ?geom_point() +> # delete the object 'gene_name' +> rm(gene_name) > ~~~ {: .language-r} -will return an error: +If you run a line of code that just has an object name, R will normally display +the contents of that object. In this case, we are told the object is no +longer defined. > ~~~ -> Error in .helpForCall(topicExpr, parent.frame()) : -> no methods for ‘geom_point’ and no documentation for it as a function +> Error: object 'gene_name' not found > ~~~ {: .error} - -Use two question marks (i.e. `?? geom_point()`) and R will return online search -results in the "Help" tab. Finally, if you think there should be a function, -for example a statistical test, but you aren't sure what R calls it, or what -functions may be available, use the `help.search()` function. - -> ## Exercise: Searching for R functions -> Use `help.search()` to find R functions for the following statistical -> functions. Remember to put what you are using for your search query in -> quotes inside the function parentheses. +## Understanding object data types (modes) + +One very important thing to know about an object is that every object has two +properties, "length" and "mode". We will get to the "length" property later in +the lesson. The **"mode" property corresponds to the type of data an object** +**represents**.
The most common modes you will encounter in R are: + +|Mode (abbreviation)|Type of data| +|----|------------| +|Numeric (num)| Numbers such as integers (e.g. 1, 892, 1.3e+10) and floating point/decimals (0.5, 3.14)| +|Character (chr)|A sequence of letters/numbers in single '' or double " " quotes| +|Logical| Boolean values - TRUE or FALSE| + +There are a few other modes ("complex", "raw", etc.) but for now, these +three are the most important. Data types are familiar in many programming +languages, but also in natural language where we refer to them as the +parts of speech, e.g. nouns, verbs, adverbs, etc. Once you know if a word - +perhaps an unfamiliar one - is a noun, you can probably guess you can count it +and make it plural if there is more than one (e.g. 1 Tuatara, or 2 Tuataras). +If something is an adjective, you can usually change it into an adverb by +adding "-ly" (e.g. jejune vs. jejunely). Depending on the context, you may need +to decide if a word is in one category or another (e.g. "cut" may be a noun when +it's on your finger, or a verb when you are preparing vegetables). These examples +have important analogies when working with R objects. + +> ## Exercise: Create objects and check their modes +> Create the following objects in R, then use the `mode()` function to verify +> their modes. Try to guess what the mode will be before you look at the solution. > -> - Chi-Squared test -> - Student-t test -> - mixed linear model +> 1. chromosome_name <- 'chr02' +> 2. od_600_value <- 0.47 +> 3. chr_position <- '1001701' +> 4. spock <- TRUE +> 5. pilot <- Earhart > >> ## solution ->> While your search results may return several tests, we list a few you might ->> find: ->> - Chi-Squared test: `stats::Chisquare` ->> - Student-t test: `stats::TDist` ->> - mixed linear model: `stats::lm.glm` +>> +>> 1. mode(chromosome_name) # "character" +>> 2. mode(od_600_value) # "numeric" +>> 3. mode(chr_position) # "character" +>> 4. mode(spock) # "logical" +>> 5.
pilot # Error - > {: .solution} {: .challenge} +Notice from the solution that even if a series of numbers is given as a value, +R will consider them to be in the "character" mode if they are enclosed in +single or double quotes. Also notice that you cannot take a string of alphanumeric +characters (e.g. Earhart) without quotes and assign it as a value for an object. In this case, +R looks for the object `Earhart` but since there is no object, no assignment can +be made. If `Earhart` did exist, then the mode of `pilot` would be whatever +the mode of `Earhart` was originally. -We will discuss more on where to look for the libraries and packages that -contain functions you want to use. For now, be aware that two important ones -are [CRAN](https://cran.r-project.org/) - the main repository for R, and -[Bioconductor](http://bioconductor.org/) - a popular repository for -bioinformatics R. - ---- - -## RStudio contextual help - -Here is one last bonus we will mention about RStudio. It's difficult to -remember all of the arguments and definitions associated with a given function. -When you start typing the name of a function and hit the Tab key, -RStudio will display functions and associated help: - -rstudio default session - -Once you type a function, hitting the Tab inside the parentheses -will remind you of arguments and provide additional help. - -rstudio default session - - ---- - -## Getting help with R - -rstudio default session - -Finally, no matter how much experience you have with R, you will find yourself -needing help. There is no shame in researching how to do something in R, and -most people will find themselves looking up how to do the same things that -they "should know how to do" over and over again. Here are some tips to make -this process as helpful and efficient as possible. - -> "Never memorize something that you can look up" -> - A. Einstein - -## Finding help on Stackoverflow and Biostars - -Two popular websites will be of great help with many R problems.
For **general** -**R questions**, [Stack Overflow](https://stackoverflow.com/) is probably the most -popular online community for developers. If you start your question "How to do X -in R" results from Stack Overflow are usually near the top of the list. For -**bioinformatics specific questions**, [Biostars](https://www.biostars.org/) is -a popular online forum. - ->## Tip: Asking for help using online forums: -> -> - When searching for R help, look for answers with the [r](https://stackoverflow.com/questions/tagged/r) tag. -> - Get an account; not required to view answers but to required to post -> - Put in effort to check thoroughly before you post a question; folks get -> annoyed if you ask a very common question that has been answered multiple -> times -> - Be careful. While forums are very helpful, you can't know for sure if the -> advice you are getting is correct -> - See the [How to ask for R help](http://blog.revolutionanalytics.com/2014/01/how-to-ask-for-r-help.html) -> blog post for more useful tips -> -{: .callout} - -## Help people help you - -Often, in order to duplicate the issue you are having, someone may need to see -the data you are working with or verify the versions of R or R packages you -are using. The following R functions will help with this: - -You can **check the version of R** you are working with using the `sessionInfo()` -function. Actually, it is good to save this information as part of your notes -on any analysis you are doing. When you run the same script that has worked fine -a dozen times before, looking back at these notes will remind you that you -upgraded R and forget to check your script. 
- - -> ~~~ -> sessionInfo() -> ~~~ -{: .language-r} - -> ~~~ -> R version 3.2.3 (2015-12-10) -> Platform: x86_64-pc-linux-gnu (64-bit) -> Running under: Ubuntu 14.04.3 LTS -> -> locale: -> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 -> [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 -> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C -> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C -> -> attached base packages: -> [1] stats graphics grDevices utils datasets methods base -> -> loaded via a namespace (and not attached): -> [1] tools_3.2.3 packrat_0.4.9-1 -> ~~~ -{: .output} - -Many times, there may be some issues with your data and the way it is formatted. -In that case, you may want to share that data with someone else. However, you -may not need to share the whole dataset; looking at a subset of your 50,000 row, -10,000 column dataframe may be TMI (too much information)! You can take an -object you have in memory such as dataframe (if you don't know what this means -yet, we will get to it!) and save it to a file. In our example we will use the -`dput()` function on the `iris` dataframe which is an example dataset that is -installed in R: - - -> ~~~ -> dput(head(iris)) # iris is an example data.frame that comes with R -> # the `head()` function just takes the first 6 lines of the iris dataset -> ~~~ -{: .language-r} - -This generates some output (below) which you will be better able to interpret -after covering the other R lessons. This info would be helpful in understanding -how the data is formatted and possibly revealing problematic issues. 
- -> ~~~ -> structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4), -> Sepal.Width = c(3.5, 3, 3.2, 3.1, 3.6, 3.9), Petal.Length = c(1.4, -> 1.4, 1.3, 1.5, 1.4, 1.7), Petal.Width = c(0.2, 0.2, 0.2, -> 0.2, 0.2, 0.4), Species = structure(c(1L, 1L, 1L, 1L, 1L, -> 1L), .Label = c("setosa", "versicolor", "virginica"), class = "factor")), .Names = c("Sepal.Length", -> "Sepal.Width", "Petal.Length", "Petal.Width", "Species"), row.names = c(NA, -> 6L), class = "data.frame") -> ~~~ -{: .output} - -Alternatively, you can also save objects in R memory to a file by specifying -the name of the object, in this case the `iris` data frame, and passing a -filename to the `file=` argument. - -> ~~~ -> saveRDS(iris, file="iris.rds") # By convention, we use the .rds file extension -> ~~~ -{: .language-r} - ---- - -## Final FAQs on R - -Finally, here are a few pieces of introductory R knowledge that are too good to -pass up. While we won't return to them in this course, we put them here because -they come up commonly: - -**Do I need to click Run every time I want to run a script?** - -- No. In fact, the most common shortcut key allows you to run a command (or - any lines of the script that are highlighted): - - Windows execution shortcut: Ctrl+Enter - - Mac execution shortcut: Cmd(⌘)+Enter - - To see a complete list of shortcuts, click on the Tools menu and - select Keyboard Shortcuts Help - -**What's with the brackets in R console output?** -- R returns an index with your result. 
When your result contains multiple values, - the number tells you what ordinal number begins the line, for example: - -> ~~~ -> 1:101 # generates the sequence of numbers from 1 to 101 -> ~~~ -{: .language-r} - -In the output below, `[81]` indicates that the first value on that line is the -81st item in your result - -> ~~~ -> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 -> [21] 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 -> [41] 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 -> [61] 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 -> [81] 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 -> [101] 101 -> ~~~ -{: .output} +## Mathematical and functional operations on objects +Once an object exsits (which by definition also means it has a mode), R can +appropriately manipulate that object. For example, objects in the numeric modes +are numbers which can be added, multiplied, divided, etc: -**Can I run my R script without RStudio?** -- Yes, remember - RStudio is running R. You get to use lots of the enhancements - RStudio provides, but R works independent of RStudio. 
See [these tips](https://support.rstudio.com/hc/en-us/articles/218012917-How-to-run-R-scripts-from-the-command-line) - for running your commands at the command line -**Where else can I learn about RStudio?** -- Check out the Help menu, especially "Cheatsheets" section --- From 12ecaa08c66fa4d4fda45ba61a2945eb6daab4cf Mon Sep 17 00:00:00 2001 From: JasonJWilliamsNY Date: Mon, 7 May 2018 16:35:55 -0400 Subject: [PATCH 09/19] intermeidate up to logical indexing --- episodes/02-r-basics.md | 487 +++++++++++++++++++++++++++++++++++++++- 1 file changed, 482 insertions(+), 5 deletions(-) diff --git a/episodes/02-r-basics.md b/episodes/02-r-basics.md index 005587e0..e7dc588d 100644 --- a/episodes/02-r-basics.md +++ b/episodes/02-r-basics.md @@ -1,7 +1,7 @@ --- title: "R Basics" teaching: 60 -exercises: 30 +exercises: 20 questions: - "What will these lessons not cover?" - "What are the basic features of the R language?" @@ -12,8 +12,8 @@ objectives: - "Be able to create and appropriately name objects in R" - "Be able to explain what a data types are, and know the common R data types (modes)" -- "Be able to do simple arithmetic of functional procedures on R objects" - "Be able to reassign object values and delete objects" +- "Be able to do simple arithmetic of functional procedures on R objects" - "Be able to create the most common R objects including vectors, factors, lists, and data frames" - "Be able to retrieve (index), name, or replace, values from an object" @@ -106,6 +106,7 @@ below: > ~~~ > # this line creates the object 'a' and assigns it the value '1' +> > a <- 1 > ~~~ {: .language-r} @@ -122,6 +123,7 @@ You should notice the following outputs; in the RStudio 'Console' you should see > ~~~ > # this line creates the object 'a' and assigns it the value '1' +> > a <- 1 > ~~~ {: .output} @@ -210,6 +212,7 @@ depending on how you look at it. > ~~~ > # gene_name has the value 'pten' or whatever value you used in the challenge. 
We will now assign the new value 'tp53' +> > gene_name <- 'tp53' > ~~~ {: .language-r} @@ -219,6 +222,7 @@ will delete the object. > ~~~ > # delete the object 'gene_name' +> > rm(gene_name) > ~~~ {: .language-r} @@ -273,7 +277,7 @@ have important analogies when working with R objects. >> 2. mode(od_600_value) # "numeric" >> 3. mode(chr_position) # "character" >> 4. mode(spock) # "logical" ->> 5. pilot # Error - +>> 5. pilot # Error: object 'Earhart' not found > {: .solution} {: .challenge} @@ -288,11 +292,484 @@ the mode of `Earthrt` was originally. ## Mathematical and functional operations on objects Once an object exsits (which by definition also means it has a mode), R can -appropriately manipulate that object. For example, objects in the numeric modes -are numbers which can be added, multiplied, divided, etc: +appropriately manipulate that object. For example, objects of the numeric modes +can be added, multiplied, divided, etc. R provides several mathematical +(arithmetic) operators incuding: + +|Operator|Description| +|--------|-----------| +|+|addition| +|-|subtraction| +|*|multiplication| +|/|division| +|^ or **|exponentiation| +|a%%b|modulus| + +These can be used with literal numbers: + +> ~~~ +> (1 + (5 ** 0.5))/2 +> ~~~ +{: .language-r} + +> ~~~ +> [1] 1.618034 +> ~~~ +{: .output} + +and importantly, can be used on any object that evaluates to (i.e. iterprited +by R) a numeric object: + + +> ~~~ +> # multiply the object 'human_chr_number' by 2 +> +> human_chr_number * 2 +> ~~~ +{: .language-r} + +returns the result: + +> ~~~ +> [1] 46 +> ~~~ +{: .output} + +Finally, it is useful to know that several other types of mathematical +operations have their own associated functions. While there are too many to +list, you can always search the online documentation in R for a function ( +even if you don't know what it may be called in R). For example: + +> ~~~ +> # search for functions associated with chi squared +> +> ?? 
chisquared +> ~~~ +{: .language-r} + +Will open search results in your help tab. Of course, using Google will help +here too. + +> ## Exercise: Compute the golden ratio +> One approximation of the golden ratio (φ) can be found by taking the sum of 1 +> and the square root of 5, and dividing by 2 as in the example above. Compute +> the golden ratio to 3 digits of precision using the `sqrt()` and `round()` +> functions. Hint: remember the `round()` function can take 2 arguments. +> +>> ## solution +>> +>> round((1 + sqrt(5))/2, digits=3) +>> +>> [1] 1.618 +>> +>> * Notice that you can place one function inside of another. +> {: .solution} +{: .challenge} + + +## Vectors + +With a solid understanding of the most basic objects, we come to probably the +most used objects in R, vectors. A vector can be thought of as a collection of +values (numbers, characters, etc.). Vectors also have a mode (data type), so +all of the contents of a vector must be of the same mode. One of the most common +ways to create a vector is to use the `c()` function - the "concatenate" or +"combine" function. Inside the function you may enter one or more values; for +multiple values, separate each value with a comma: + +> ~~~ +> # Create the SNP gene name vector +> +> snp_genes <- c("OXTR", "ACTN3", "AR", "OPRM1") +> ~~~ +{: .language-r} + +Two important properties of vectors are their **mode** and their **length**. +You can check these with the `mode()` and `length()` functions, respectively. +Another useful function that gives both of these pieces of information is the +`str()` (structure) function. Importantly, **items within a vector must all +be of the same mode/data type**. This is because a vector can have only one +mode. More on this later.
+ +> ~~~ +> # Check the mode, length, and structure of 'snp_genes' +> +> mode(snp_genes) +> length(snp_genes) +> str(snp_genes) +> ~~~ +{: .language-r} + +returns: + +> ~~~ +> [1] "character" +> [1] 4 +> chr [1:4] "OXTR" "ACTN3" "AR" "OPRM1" +> ~~~ +{: .output} + +Vectors are quite important in R, for us mostly because data frames are +essentially collections of vectors (more on this later). What we learn about +manipulating vectors now will pay off even more when we get to data frames. + +## More on creating and indexing vectors + +Let's create a few more vectors to play around with: + +> ~~~ +> # some interesting human SNPs +> # while accuracy is important, typos in the data won't hurt you here +> +> snps <- c('rs53576', 'rs1815739', 'rs6152', 'rs1799971') +> snp_chromosomes <- c('3', '11', 'X', '6') +> snp_positions <- c(8762685, 66560624, 67545785, 154039662) +> ~~~ +{: .language-r} + +Once we have vectors, one thing we may want to do is specifically retrieve one +or more values from our vector. To do so we use **bracket notation**. We type +the name of the vector followed by square brackets. Inside those square brackets +we place the index (e.g. a number) as follows: + +> ~~~ +> # get the 3rd value in the snp_genes vector +> +> snp_genes[3] +> ~~~ +{: .language-r} +> ~~~ +> [1] "AR" +> ~~~ +{: .output} + +In R, every item in your vector is indexed, starting from the first item (1) +through to the final number of items in your vector. You can also retrieve a +range of values: + +> ~~~ +> # get the 1st through 3rd value in the snp_genes vector +> +> snp_genes[1:3] +> ~~~ +{: .language-r} +> ~~~ +> [1] "OXTR" "ACTN3" "AR" +> ~~~ +{: .output} + +If you want to retrieve several (but not necessarily sequential) items from +a vector, you pass a **vector of indices**; a vector that has the numbered +positions you wish to retrieve.
+ +> ~~~ +> # get the 1st, 3rd, and 4th value in the snp_genes vector +> +> snp_genes[c(1, 3, 4)] +> ~~~ +{: .language-r} +> ~~~ +> [1] "OXTR" "AR" "OPRM1" +> ~~~ +{: .output} + +There are additional (and perhaps less commonly used) ways of indexing a vector +(see [these examples](https://thomasleeper.com/Rcourse/Tutorials/vectorindexing.html)). +Also, several of these indexing expressions can be combined: + +> ~~~ +> # get the 1st through the 3rd value, and 4th value in the snp_genes vector +> # yes, this is a little silly in a vector of only 4 values. +> +> snp_genes[c(1:3,4)] +> ~~~ +{: .language-r} +> ~~~ +> [1] "OXTR" "ACTN3" "AR" "OPRM1" +> ~~~ + +## Adding to, removing, or replacing values in existing vectors + +Once you have an existing vector, you may want to add a new item to it. To do +so, you can use the `c()` function again to add your new value: + +> ~~~ +> # add the genes 'CYP1A1' and 'APOA5' to our list of snp genes +> # this overwrites our existing vector +> +> snp_genes <- c(snp_genes, "CYP1A1", "APOA5") +> ~~~ +{: .language-r} +We can of course verify that "snp_genes" contains the new gene entries +> ~~~ +> snp_genes +> ~~~ +{: .language-r} +> ~~~ +> [1] "OXTR" "ACTN3" "AR" "OPRM1" "CYP1A1" "APOA5" +> ~~~ +{: .output} + +Using a negative index will return a version of a vector with that index's +value removed: + +> ~~~ +> snp_genes[-6] +> ~~~ +{: .language-r} +> ~~~ +> [1] "OXTR" "ACTN3" "AR" "OPRM1" "CYP1A1" +> ~~~ +{: .output} + + +We can remove that value from our vector by overwriting it with this expression: +> ~~~ +> snp_genes <- snp_genes[-6] +> snp_genes +> ~~~ +{: .language-r} +> ~~~ +> [1] "OXTR" "ACTN3" "AR" "OPRM1" "CYP1A1" +> ~~~ +{: .output} + +We can also explicitly rename or add a value to our index using double bracket +notation: + +> ~~~ +> snp_genes[[7]]<- "APOA5" +> snp_genes +> ~~~ +{: .language-r} +> ~~~ +> [1] "OXTR" "ACTN3" "AR" "OPRM1" "CYP1A1" NA "APOA5" +> ~~~ +{: .output} + +Notice in the operation above that R
inserts an `NA` value to extend our vector +so that the gene "APOA5" is at index 7. This may be a good or not so good thing +depending on how you use this. + +> ## Exercise: Examining and indexing vectors +> Answer the following questions to test your knowledge of vectors +> +> Which of the following are true of vectors in R? +> +> A) All vectors have a mode or a length +> +> B) All vectors have a mode and a length +> +> C) Vectors may have different lengths +> +> D) Items within a vector may be of different modes +> +> E) You can use the `c()` function to add one or more items to an existing vector +> +> F) You can use the `c()` function to add a vector to an existing vector +>> +>> ## solution +>> A) False - Vectors have both of these properties +>> +>> B) True +>> +>> C) True +>> +>> D) False - Vectors have only one mode (e.g. numeric, character); all items in +>> a vector must be of this mode. +>> +>> E) True +>> +>> F) True +>> +> {: .solution} +{: .challenge} + + +## Logical Indexing + +There is one last set of cool indexing capabilities we want to introduce. It is +possible within R to retrieve items in a vector based on a logical evaluation +or numerical comparison. For example, let's say we wanted to get all of the SNPs +in our vector of SNP positions that were greater than 100,000,000. We could +index using the '>' (greater than) logical operator: + +> ~~~ +> snp_positions[snp_positions > 100000000] +> ~~~ +{: .language-r} +> ~~~ +> [1] 154039662 +> ~~~ +{: .output} + +As demonstrated above, in the square brackets you place the name of the vector +followed by the comparison operator and (in this numeric case) a numeric value.
+Some of the most common logical operators you will use in R are: + +|Operator|Description| +|--------|-----------| +|<|less than| +|<=|less than or equal to| +|>|greater than| +|>=|greater than or equal to| +|==|exactly equal to| +|!=|not equal to| +|!x|not x| +|a \| b| a or b| +|a & b| a and b| + +> ## The magic of programming +> +>The reason why the expression `snp_positions[snp_positions > 100000000]` works +>can be better understood if you examine what the expression "snp_positions > 100000000" +>evaluates to: +> +>> ~~~ +>> snp_positions > 100000000 +>> ~~~ +>{: .language-r} +>> ~~~ +>> [1] FALSE FALSE FALSE TRUE +>> ~~~ +>{: .output} +> +>The output above is a logical vector, the 4th element of which is TRUE. When +>you pass a logical vector as an index, R will return the true values: +> +>> ~~~ +>> snp_positions[c(FALSE, FALSE, FALSE, TRUE)] +>> ~~~ +>{: .language-r} +>> ~~~ +>> [1] 154039662 +>> ~~~ +>{: .output} +> +> +>If you have never coded before, this type of situation starts to expose the +>"magic" of programming. We mentioned before that in the bracket indexing +>notation you take your named vector followed by brakets which contain an index: +>**named_vector[index]**. The "magic" is that the index needs to *evaluate to* a +>number. So, even if it does not appear to be an integer (e.g. 1, 2, 3), as long +>as R can evaluate it, we will get a result. That our expression +>`snp_positions[snp_positions > 100000000]` evaluates to a number can be seen +>in the following situtaion. If you wanted to know which **index** (1, 2, 3, or +>4) in our vector of SNP positions was the one that was greater than 100,000,000? 
+>We can use the `which()` function to return the indices of any item that +>evaluates as TRUE in our comparison: +>> ~~~ +>> which(snp_positions > 100000000) +>> ~~~ +>{: .language-r} +>> ~~~ +>> [1] 4 +>> ~~~ +>{: .output} +> **Why is this important?** Often in programming we will not know what inputs +> and values will be used when our code is executed. Rather than put in a +> pre-determined value (e.g. 100000000) we can use an object that can take on +> whatever value we need. So for example: +> +>> ~~~ +>> snp_marker_cutoff <- 100000000 +>> snp_positions[snp_positions > snp_marker_cutoff] +>> ~~~ +>{: .language-r} +>> ~~~ +>> [1] 154039662 +>> ~~~ +>{: .output} +> Ultimately, it's putting together flexible, reusable code like this that gets +> at the "magic" of programming! +{: .callout} + +## A few final vector tricks + +Finally, there are a few other common retrieve or replace operations you may +want to know about. First, you can check to see if any of the values of your +vector are NA values. Missing data will get a more detailed treatment later, +but the `is.na()` function will return a logical vector, with TRUE for any NA +value: + +> ~~~ +> # current value of 'snp_genes': chr [1:7] "OXTR" "ACTN3" "AR" "OPRM1" "CYP1A1" NA "APOA5" +> +> is.na(snp_genes) +> ~~~ +{: .language-r} +> ~~~ +> [1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE +> ~~~ +{: .output} + +Sometimes, you may wish to find out if a specific value (or several values) is +in a vector.
You can do this using the comparison operator `%in%`, which will +return TRUE for any value in your collection of one or more values that matches a +value in the vector you are searching: + +> ~~~ +> # current value of 'snp_genes': chr [1:7] "OXTR" "ACTN3" "AR" "OPRM1" "CYP1A1" NA "APOA5" +> # test to see if "ACTN3" or "APOA5" is in the snp_genes vector +> # if you are looking for more than one value, you must pass this as a vector +> +> c("ACTN3","APOA5") %in% snp_genes +> ~~~ +{: .language-r} +> ~~~ +> [1] TRUE TRUE +> ~~~ +> ## Review: Creating and indexing vectors +> Use your knowledge of vectors to accomplish the following tasks: +> +> **1) Add the following values to the following vectors** +> +> a. To the `snps` vector add: 'rs662799' +> +> b. To the `snp_chromosomes` vector add: 11 +> +> c. To the `snp_positions` vector add: 116792991 +> +> **2) Make the following change to the `snp_genes` vector** +> Hint: Your vector should look like this in the 'Global Enviornment': +> `chr [1:7] "OXTR" "ACTN3" "AR" "OPRM1" "CYP1A1" NA "APOA5"`. If not +> recreate the vector by running this expression: +> `snp_genes <- c("OXTR", "ACTN3", "AR", "OPRM1", "CYP1A1", NA, "APOA5")` +> +> a. Create a new version of `snp_genes` that does not contain CYP1A1 +> +> b. Add 2 NA values to the end of `snp_genes` (hint: final vector should +> have a length of 8) +> +> **3) Create a new vector that contains** +> +> a. The the 1st value in `snp_genes` +> +> b. The 1st value in `snps` +> +> c. The 1st value in `snp_chromosomes` +> +> d.
The 1st value in `snp_positions` +>> +>> ## solution +>> +>> +>> +>> +>> +>> +>> +>> +>> +>> +>> +>> +>> +> {: .solution} +{: .challenge} --- From 924aa1559bce73342e5eae8c5a41e6b2fd07fe83 Mon Sep 17 00:00:00 2001 From: JasonJWilliamsNY Date: Wed, 9 May 2018 16:10:52 -0400 Subject: [PATCH 10/19] finish episode 2 add 3 --- episodes/02-r-basics.md | 154 +++- episodes/03-basics-factors-dataframes.md | 873 +++++++++++++++++++++++ 2 files changed, 993 insertions(+), 34 deletions(-) create mode 100644 episodes/03-basics-factors-dataframes.md diff --git a/episodes/02-r-basics.md b/episodes/02-r-basics.md index e7dc588d..ca318177 100644 --- a/episodes/02-r-basics.md +++ b/episodes/02-r-basics.md @@ -6,37 +6,22 @@ questions: - "What will these lessons not cover?" - "What are the basic features of the R language?" - "What are the most common objects in R?" -- "How do I get started with tabular data (e.g. spreadsheets) in R?" objectives: -- "Identify R skills not covered in these lessons and where to learn more" -- "Be able to create and appropriately name objects in R" -- "Be able to explain what a data types are, and know the common R data types - (modes)" -- "Be able to reassign object values and delete objects" -- "Be able to do simple arithmetic of functional procedures on R objects" -- "Be able to create the most common R objects including vectors, factors, - lists, and data frames" -- "Be able to retrieve (index), name, or replace, values from an object" -- "Be able to load a tabular dataset using base R functions" -- "Explain the basic principle of tidy datasets" -- "Be able to determine the structure of a data frame including its dimensions - and the datatypes of variables" -- "Be able to retrieve (index) a data frame" -- "Be able to apply an arithmetic function to a dataframe" -- "Be able to coerce the class of an object (including variables in a dataframe)" -- "Be able to save a dataframe as a delimited file" +- "Be able to create the most common R objects including 
vectors" +- "Understand that vectors have modes, which correspond to the type of data they contain" +- "Be able to use arithmetic operators on R objects" +- "Be able to retrieve (index), name, or replace, values from a vector" +- "Be able to use logical operators in an indexing operation" +- "Understand that lists can hold data of more than one mode and can be indexed" keypoints: - "Effectively using R is a journey of months or years. Still you don't have to be an expert to use R and you can start using and analzying your data with with about a day's worth of training" - "It is important to understand how data are organized by R in a given object - type (e.g. vector, factor, data frame, etc.) how the mode of that type - (e.g. numeric, character, logical, etc.) will determine how R will operate - on that data, and what can happen when datatypes are coerced, misinterpreted, - or misapplied" -- "Data wrangling - loading data, cleaning this data (e.g checking for outliers, - handling missing values, sorting, filtering, etc.) is an important first step - for working with data" + type, and how the mode of that type (e.g. numeric, character, logical, etc.) will + determine how R will operate on that data." +- "Working with vectors effectively prepares you for understanding how data are + organized in R." --- ## "The fantastic world of R awaits you" OR "Nobody wants to learn how to use R" @@ -721,12 +706,20 @@ value in the vector you are searching: > ~~~ > [1] TRUE TRUE > ~~~ - +{: .output} > ## Review: Creating and indexing vectors > Use your knowledge of vectors to accomplish the following tasks: > -> **1) Add the following values to the following vectors** +> **1) What mode are the following vectors? Use `typeof()` to check** +> +> a. `snps` +> +> b. `snp_chromosomes` +> +> c. `snp_positions` +> +> **2) Add the following values to the following vectors** > > a. To the `snps` vector add: 'rs662799' > @@ -734,7 +727,7 @@ value in the vector you are searching: > > c.
To the `snp_positions` vector add: 116792991 > -> **2) Make the following change to the `snp_genes` vector** +> **3) Make the following change to the `snp_genes` vector** > Hint: Your vector should look like this in the 'Global Enviornment': > `chr [1:7] "OXTR" "ACTN3" "AR" "OPRM1" "CYP1A1" NA "APOA5"`. If not > recreate the vector by running this expression: @@ -745,31 +738,124 @@ value in the vector you are searching: > b. Add 2 NA values to the end of `snp_genes` (hint: final vector should > have a length of 8) > -> **3) Create a new vector that contains** +> **4) Create a new vector `combined` that contains:** > -> a. The the 1st value in `snp_genes` +> - The the 1st value in `snp_genes` > -> b. The 1st value in `snps` +> - The 1st value in `snps` > -> c. The 1st value in `snp_chromosomes` +> - The 1st value in `snp_chromosomes` +> +> - The 1st value in `snp_positions` +> +> **Check the mode of `combined` using `typeof()` > -> d. The 1st value in `snp_positions` ->> >> ## solution >> +>> **1) What mode are the following vectors? Use `typeof()` to check** +>> +>> a. `typeof(snps)` # "character" +>> +>> b. `typeof(snp_chromosomes)` # "character" +>> +>> c. `typeof(snp_positions)` # "double" - which is also a numeric type +>> >> +>> **2) Add the following values to the following vectors** >> +>> a. `snps <- c(snps, 'rs662799')` >> +>> b. `snp_chromosomes <- c(snp_chromosomes, "11")` # did you use quotes? >> +>> c. `snp_positions <- c(snp_positions, 116792991)` >> +>> **3) Make the following change to the `snp_genes` vector** >> +>> a. `snp_genes <- snp_genes[-5]` or `snp_genes <- snp_genes[c(1,2,3,4,6,7)]`, etc. >> +>> b. `snp_genes <- c(snp_genes, NA, NA)` or `snp_genes[[8]] <- NA`, etc. 
>> >> +>> **4) Create a new vector `combined` that contains:** >> +>> - The the 1st value in `snp_genes` >> +>> - The 1st value in `snps` +>> +>> - The 1st value in `snp_chromosomes` +>> +>> - The 1st value in `snp_positions` +>> +>> +>> `combined <- c(snp_genes[1], snps[1], snp_chromosomes[1], snp_positions[1])` +>> +>> `typeof(combined)` # "character" - Do you know why this is? >> > {: .solution} {: .challenge} +## Bonus material: Lists + +Lists are quite useful in R, but we won't be using them in the genomics lessons. +That said, you may come across lists in the way that some bioinformatics +programs may store and/or return data to you. One of the key attributes of a list +is that unlike a vector, a list may contain data of more than one mode. Learn +more about creating and using lists using this [nice tutorial](http://r4ds.had.co.nz/lists.html). +In this one example, we will create a named list and show you how to retreive +items from the list. + + +> ~~~ +> # Create a named list using the 'list' function and our SNP examples +> # Note, for easy reading we have place each item in the list on a separate line +> # Nothing special about this, you can do this for any multiline commands +> # To run this command, make sure the entire command (all 4 lines) are highlited +> # before running +> +>snp_data <- list(genes = snp_genes, +> refference_snp = snps, +> chromosome = snp_chromosomes, +> position = snp_positions) +> +> # Examine the structure of the list +>str(snp_data) +> ~~~ +{: .language-r} +> ~~~ +>List of 4 +> $ genes : chr [1:8] "OXTR" "ACTN3" "AR" "OPRM1" ... +> $ refference_snp: chr [1:5] "rs53576" "rs1815739" "rs6152" "rs1799971" ... 
+> $ chromosome : chr [1:4] "3" "11" "X" "6" +> $ position : num [1:4] 8.76e+06 6.66e+07 6.75e+07 1.54e+08 +> ~~~ +{: .output} + +To get all of the values for the `position` object in the list we use the `$` notation: + +> ~~~ +> # return all the values of position object +> +> snp_data$position +> ~~~ +{: .language-r} +> ~~~ +> [1] 8762685 66560624 67545785 154039662 +> ~~~ +{: .output} + +To get the first value in the `position` object, use `[]` notation to index: + +> ~~~ +> # return first value of the position object +> +> snp_data$position[1] +> ~~~ +{: .language-r} +> ~~~ +> [1] 8762685 +> ~~~ +{: .output} + + + --- diff --git a/episodes/03-basics-factors-dataframes.md b/episodes/03-basics-factors-dataframes.md new file mode 100644 index 00000000..d0b174e2 --- /dev/null +++ b/episodes/03-basics-factors-dataframes.md @@ -0,0 +1,873 @@ +--- +title: "R Basics continued - factors and data frames" +teaching: 60 +exercises: 20 +questions: +- "How do I get started with tabular data (e.g. spreadsheets) in R?" 
+objectives: +- "Identify R skills not covered in these lessons and where to learn more" +- "Be able to create and appropriately name objects in R" +- "Be able to explain what a data types are, and know the common R data types + (modes)" +- "Be able to reassign object values and delete objects" +- "Be able to do simple arithmetic of functional procedures on R objects" +- "Be able to create the most common R objects including vectors, factors, + lists, and data frames" +- "Be able to retrieve (index), name, or replace, values from an object" +- "Be able to load a tabular dataset using base R functions" +- "Explain the basic principle of tidy datasets" +- "Be able to determine the structure of a data frame including its dimensions + and the datatypes of variables" +- "Be able to retrieve (index) a data frame" +- "Be able to apply an arithmetic function to a dataframe" +- "Be able to coerce the class of an object (including variables in a dataframe)" +- "Be able to save a dataframe as a delimited file" +keypoints: +- "Effectively using R is a journey of months or years. Still you don't have to + be an expert to use R and you can start using and analzying your data with + with about a day's worth of training" +- "It is important to understand how data are organized by R in a given object + type (e.g. vector, factor, data frame, etc.) how the mode of that type + (e.g. numeric, character, logical, etc.) will determine how R will operate + on that data, and what can happen when datatypes are coerced, misinterpreted, + or misapplied" +- "Data wrangling - loading data, cleaning this data (e.g checking for outliers, + handling missing values, sorting, filtering, etc.) is an important first step + for working with data" +--- + +## "The fantastic world of R awaits you" OR "Nobody wants to learn how to use R" +Before we begin this lesson, we want you to be clear on the goal of the workshop +and these lessons. We believe that every learner can be **achieve competency +with R**. 
You have reached competency when you find that you are able to +**use R to handle common analysis challenges in a reasonable amount of time** +(which includes time needed to look at learning materials, search for answers +online, and ask colleagues for help). As you spend more time using R (there is +no substitute for regular use and practice) you will find yourself gaining +competency and even expertise. The more familiar you get, the more +complex the analyses you will be able to carry out, with less frustration, and +in less time - the "fantastic world of R" awaits you! + +## What these lessons will not teach you +Nobody wants to learn how to use R. People want to learn how to use R to analyze +their own research questions! Ok, maybe some folks learn R for R's sake, but +these lessons assume that you want to start analyzing genomic data as soon as +possible. Given this, there are many valuable pieces of information about R +that we simply wont have time to cover. Hopefully we will clear the hurdle of +giving you just enough knowledge to be dangerous, which can be a high hurdle +in R! We uggest you look into additional the learning materials in the tip box +below. + +**Here are some R skills we will *not* cover in these lessons** + +- How to create and work with R matrices and R lists +- How to create and work with loops and conditional statements +- How to do basic string manipulations (e.g. finding patterns in text using grep) +- How to plot using the default R graphic tools (we *will* cover ggplot2) +- How to use the advanced R statistical functions + +>## Tip: Where to learn more +> The following are good resources for learning more about R. Some of them +> can be quite technically, but if you are a regular R user you may ultimately +> need some of this technical knowledge. 
+> - [R for Beginners](https://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf): + By Emmanuel Paradis, great starting point +> - [The R Manuals](https://cran.r-project.org/manuals.html): Maintained by the + R project +> - [R contributed documentation](https://cran.r-project.org/other-docs.html): + Also linked to the R project; importantly there are materials available in + several languages +> - [R for Data Science](http://r4ds.had.co.nz/): A wonderful collection by + noted R educators and developers Garrett Grolemund and Hadley Wickham +> - [Practical Data Science for Stats](https://peerj.com/collections/50-practicaldatascistats/): + Not exclusively about R usage, but a nice collection of pre-prints on data science + and applications for R +> - [Programming in R Software Carpentry lesson](https://software-carpentry.org/lessons/): + There are several Software Carpentry lessons in R to choose from + {: .callout} + +## Creating objects in R + +> ## Reminder +> At this point you should writing following along in the "**genomics_r_basics.R**" +> script we created in the last episode. Writing you commands in the script +> will make it easier to record what you did and why. +> +{: .prereq} + +What might be called a variable in many language is properly called an **object** +in R. To create your object you need a name (e.g. 'a'), and a value (e.g. '1'). +Using the R assignment operator '<-''. In your script, "**genomics_r_basics.R**" +write a comment (using the '#') sign, and assign '1' to the object 'a' as shown +below: + +> ~~~ +> # this line creates the object 'a' and assigns it the value '1' +> +> a <- 1 +> ~~~ +{: .language-r} + +Be sure to execute this line of code in your script. 
You can run a line of code +by hitting the Run button that is just above the first line of your +script in the header of the Source pane or you can use the appropriate shortcut: + - Windows execution shortcut: Ctrl+Enter + - Mac execution shortcut: Cmd(⌘)+Enter +To run multiple lines of code, you can highlight all the lines you wish to run +and then hit Run or use the shortcut key combo. + +You should notice the following outputs; in the RStudio 'Console' you should see: + +> ~~~ +> # this line creates the object 'a' and assigns it the value '1' +> +> a <- 1 +> ~~~ +{: .output} + +The 'Console' will display lines of code run from a script and any outputs or +status/warning/error messages (usually in red). + +You should also notice that in the 'Environment' window you get a table: + +|Values|| +|------|-| +|a|1| + +The 'Environment' window allows you to easily keep track of the objects you have +created in R. + +> ## Exercise: Create some objects in R +> Create the following objects in R, and give each object an appropriate name. +> +> 1. Create an object that has the value of the number of pairs of human chromosomes +> 2. Create an object that has a value of your favorite gene name +> 3. Create an object that has the value of this URL: "ftp://ftp.ensemblgenomes.org/pub/bacteria/release-39/fasta/bacteria_5_collection/escherichia_coli_b_str_rel606/" +> 4. Create an object that has the value of the number of chromosomes in a diploid cell +> +>> ## solution +>> Here are some possible answers to the challenge: +>> 1. human_chr_number <- 23 +>> 2. gene_name <- 'pten' +>> 3. ensemble_url <- 'ftp://ftp.ensemblgenomes.org/pub/bacteria/release-39/fasta/bacteria_5_collection/escherichia_coli_b_str_rel606/' +>> 4. human_diploid_chr_num <- 2 * human_chr_number +>> +> {: .solution} +{: .challenge} + +## Naming objects in R + +Here are some important details about naming objects in R. + +- **Avoid spaces and special characters**: Object names cannot contain spaces. 
Typically + you can use '_' to provide separation (a '-' would be read as the subtraction + operator). You should avoid using special + characters in your object name (e.g. ! @ # . , etc.). Also, names cannot begin with + a number. +- **Use short, easy-to-understand names**: You should avoid naming your objects + using single letters (e.g. 'n', 'p', etc.). This is mostly to encourage you + to use names that would make sense to anyone reading your code (a colleague, + or even yourself a year from now). Also, avoiding really long names will make + your code more readable. +- **Avoid commonly used names**: There are several names that may already have a + definition in the R language (e.g. 'mean', 'min', 'max'). One clue that a name + already has meaning: if you start typing a name in RStudio and either + pause your typing or hit the Tab key, and RStudio gives you a + suggested autocompletion or help message, you have chosen a name that has a + prior meaning. +- **Use the recommended assignment operator**: In R, we use '<-' as the + preferred assignment operator. '=' works too, but is most commonly used in + passing arguments to functions (more on functions later). There is a shortcut + for the R assignment operator: + - Windows execution shortcut: Alt+- + - Mac execution shortcut: Option+- + + +There are a few more suggestions about naming and style you may want to learn +more about as you write more R code. There are several "style guides" that +have advice, and one to start with is the [tidyverse R style guide](http://style.tidyverse.org/index.html). + +>## Tip: Pay attention to warnings in the script console +> +> If you enter a line of code in your R script that contains some error, RStudio +> may give you a hint with an error indication and an underline of this mistake. +> Sometimes these messages are easy to understand, but often the message may +> need some figuring out. In any case, paying attention to these warnings helps +> you avoid mistakes. 
In this case, our object name has a space, which is not +> allowed in R. Notice the error message does not say this directly, but +> essentially R is "not sure" about how to assign the name to "human_ chr_number" +> when the object name we want is "human_chr_number". +> +> rstudio script warning +> + {: .callout} + +## Reassigning object names or deleting objects + +Once an object has a value, you can change that value by overwriting it. R will +not complain about overwriting objects, which may or may not be a good thing +depending on how you look at it. + +> ~~~ +> # gene_name has the value 'pten' or whatever value you used in the challenge. We will now assign the new value 'tp53' +> +> gene_name <- 'tp53' +> ~~~ +{: .language-r} + +You can also remove an object from R's memory entirely. The `rm()` function +will delete the object. + +> ~~~ +> # delete the object 'gene_name' +> +> rm(gene_name) +> ~~~ +{: .language-r} + +If you run a line of code that just has an object name, R will normally display +the contents of that object. In this case, we are told the object is no +longer defined. + +> ~~~ +> Error: object 'gene_name' not found +> ~~~ +{: .error} + +## Understanding object data types (modes) + +One very important thing to know about an object is that every object has two +properties, "length" and "mode". We will get to the "length" property later in +the lesson. The **"mode" property corresponds to the type of data an object** +**represents**. The most common modes you will encounter in R are: + +|Mode (abbreviation)|Type of data| +|----|------------| +|Numeric (num)| Numbers such as integers (e.g. 1, 892, 1.3e+10) and floating point/decimals (0.5, 3.14)| +|Character (chr)|A sequence of letters/numbers in single '' or double " " quotes| +|Logical| Boolean values - TRUE or FALSE| + +There are a few other modes ("double", "complex", "raw", etc.) but for now, these +three are the most important. 
Data types are familiar in many programming +languages, but also in natural language where we refer to them as the +parts of speech, e.g. nouns, verbs, adverbs, etc. Once you know if a word - +perhaps an unfamiliar one - is a noun, you can probably guess you can count it +and make it plural if there is more than one (e.g. 1 Tuatara, or 2 Tuataras). +If something is an adjective, you can usually change it into an adverb by +adding "-ly" (e.g. jejune vs. jejunely). Depending on the context, you may need +to decide if a word is in one category or another (e.g. "cut" may be a noun when +it's on your finger, or a verb when you are preparing vegetables). These examples +have important analogies when working with R objects. + +> ## Exercise: Create objects and check their modes +> Create the following objects in R, then use the `mode()` function to verify +> their modes. Try to guess what the mode will be before you look at the solution. +> +> 1. chromosome_name <- 'chr02' +> 2. od_600_value <- 0.47 +> 3. chr_position <- '1001701' +> 4. spock <- TRUE +> 5. pilot <- Earhart +> +>> ## solution +>> +>> 1. mode(chromosome_name) # "character" +>> 2. mode(od_600_value) # "numeric" +>> 3. mode(chr_position) # "character" +>> 4. mode(spock) # "logical" +>> 5. pilot <- Earhart # Error: object 'Earhart' not found +> {: .solution} +{: .challenge} + +Notice from the solution that even if a series of numbers is given as a value, +R will consider them to be in the "character" mode if they are enclosed in +single or double quotes. Also notice that you cannot take an unquoted string of +alphanumeric characters (e.g. Earhart) and assign it as a value for an object. In this case, +R looks for the object `Earhart` but since there is no such object, no assignment can +be made. If `Earhart` did exist, then the mode of `pilot` would be whatever +the mode of `Earhart` was originally. 
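The point about quoted numbers can be sketched in a few lines. This is only an illustration: `as.numeric()` is a base R function that the lesson has not introduced at this point, and the object name `chr_position_num` is a hypothetical one chosen for this example.

```r
# A number wrapped in quotes is stored as text, so its mode is "character"
chr_position <- '1001701'
mode(chr_position)

# as.numeric() (base R, not yet covered in the lesson) converts the text
# into a number; 'chr_position_num' is a made-up name for this sketch
chr_position_num <- as.numeric(chr_position)
mode(chr_position_num)

# arithmetic works on the numeric version
chr_position_num + 1
```

Running `chr_position + 1` on the quoted version instead would stop with an error, because arithmetic operators need numeric (or logical) values.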
+ +## Mathematical and functional operations on objects + +Once an object exists (which by definition also means it has a mode), R can +appropriately manipulate that object. For example, objects of the numeric mode +can be added, multiplied, divided, etc. R provides several mathematical +(arithmetic) operators including: + +|Operator|Description| +|--------|-----------| +|+|addition| +|-|subtraction| +|*|multiplication| +|/|division| +|^ or **|exponentiation| +|a%%b|modulus| + +These can be used with literal numbers: + +> ~~~ +> (1 + (5 ** 0.5))/2 +> ~~~ +{: .language-r} + +> ~~~ +> [1] 1.618034 +> ~~~ +{: .output} + +and importantly, can be used on any object that evaluates to (i.e. is interpreted +by R as) a numeric object: + + +> ~~~ +> # multiply the object 'human_chr_number' by 2 +> +> human_chr_number * 2 +> ~~~ +{: .language-r} + +returns the result: + +> ~~~ +> [1] 46 +> ~~~ +{: .output} + +Finally, it is useful to know that several other types of mathematical +operations have their own associated functions. While there are too many to +list, you can always search the online documentation in R for a function +(even if you don't know what it may be called in R). For example: + +> ~~~ +> # search for functions associated with chi squared +> +> ?? chisquared +> ~~~ +{: .language-r} + +This will open search results in your help tab. Of course, using Google will help +here too. + +> ## Exercise: Compute the golden ratio +> One approximation of the golden ratio (φ) can be found by taking the sum of 1 +> and the square root of 5, and dividing by 2 as in the example above. Compute +> the golden ratio to 3 digits of precision using the `sqrt()` and `round()` +> functions. Hint: remember the `round()` function can take 2 arguments. +> +>> ## solution +>> +>> round((1 + sqrt(5))/2, digits=3) +>> +>> [1] 1.618 +>> +>> * Notice that you can place one function inside of another. 
+> {: .solution} +{: .challenge} + + +## Vectors + +With a solid understanding of the most basic objects, we come to probably the +most used objects in R, vectors. A vector can be thought of as a collection of +values (numbers, characters, etc.). Vectors also have a mode (data type), so +all of the contents of a vector must be of the same mode. One of the most common +ways to create a vector is to use the `c()` function - the "concatenate" or +"combine" function. Inside the function you may enter one or more values; for +multiple values, separate each value with a comma: + +> ~~~ +> # Create the SNP gene name vector +> +> snp_genes <- c("OXTR", "ACTN3", "AR", "OPRM1") +> ~~~ +{: .language-r} + +Two important properties of vectors are their **mode** and their **length**. +You can check these with the `mode()` and `length()` functions respectively. +Another useful function that gives both of these pieces of information is the +`str()` (structure) function. Importantly, **items within a vector must all +be of the same mode/data type**. This is because a vector can have only one +mode. More on this later. + +> ~~~ +> # Check the mode, length, and structure of 'snp_genes' +> +> mode(snp_genes) +> length(snp_genes) +> str(snp_genes) +> ~~~ +{: .language-r} + +returns: + +> ~~~ +> [1] "character" +> [1] 4 +> chr [1:4] "OXTR" "ACTN3" "AR" "OPRM1" +> ~~~ +{: .output} + +Vectors are quite important in R, mostly for us because data frames are +essentially collections of vectors (more on this later). What we learn about +manipulating vectors now will pay off even more when we get to data frames. 
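To see what "a vector can have only one mode" means in practice, here is a minimal sketch. The object name `mixed_values` is hypothetical and is not used elsewhere in the lesson; the behavior shown is that of the base R `c()` function.

```r
# Try to combine a character, a number, and a logical in one vector;
# 'mixed_values' is a made-up name for this illustration
mixed_values <- c("OXTR", 3, TRUE)

# R does not raise an error: it converts every item to a single mode
mode(mixed_values)     # "character"
length(mixed_values)   # 3

# the number and the logical value are now stored as text
mixed_values           # "OXTR" "3" "TRUE"
```

Because R converts the values silently, checking a vector with `mode()` or `str()` after creating it is a cheap way to catch surprises.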
+ +## More on creating and indexing vectors + +Let's create a few more vectors to play around with: + +> ~~~ +> # some interesting human SNPs +> # while accuracy is important, typos in the data won't hurt you here +> +> snps <- c('rs53576', 'rs1815739', 'rs6152', 'rs1799971') +> snp_chromosomes <- c('3', '11', 'X', '6') +> snp_positions <- c(8762685, 66560624, 67545785, 154039662) +> ~~~ +{: .language-r} + +Once we have vectors, one thing we may want to do is specifically retrieve one +or more values from our vector. To do so we use **bracket notation**. We type +the name of the vector followed by square brackets. In those square brackets +we place the index (e.g. a number) as follows: + +> ~~~ +> # get the 3rd value in the snp_genes vector +> +> snp_genes[3] +> ~~~ +{: .language-r} +> ~~~ +> [1] "AR" +> ~~~ +{: .output} + +In R, every item in your vector is indexed, starting from the first item (1) +through to the final number of items in your vector. You can also retrieve a +range of numbers: + +> ~~~ +> # get the 1st through 3rd value in the snp_genes vector +> +> snp_genes[1:3] +> ~~~ +{: .language-r} +> ~~~ +> [1] "OXTR" "ACTN3" "AR" +> ~~~ +{: .output} + +If you want to retrieve several (but not necessarily sequential) items from +a vector, you pass a **vector of indices**; a vector that has the numbered +positions you wish to retrieve. + +> ~~~ +> # get the 1st, 3rd, and 4th value in the snp_genes vector +> +> snp_genes[c(1, 3, 4)] +> ~~~ +{: .language-r} +> ~~~ +> [1] "OXTR" "AR" "OPRM1" +> ~~~ +{: .output} + +There are additional (and perhaps less commonly used) ways of indexing a vector +(see [these examples](https://thomasleeper.com/Rcourse/Tutorials/vectorindexing.html)). +Also, several of these indexing expressions can be combined: + +> ~~~ +> # get the 1st through the 3rd value, and 4th value in the snp_genes vector +> # yes, this is a little silly in a vector of only 4 values. 
+> +> snp_genes[c(1:3,4)] +> ~~~ +{: .language-r} +> ~~~ +> [1] "OXTR" "ACTN3" "AR" "OPRM1" +> ~~~ +{: .output} + +## Adding to, removing, or replacing values in existing vectors + +Once you have an existing vector, you may want to add a new item to it. To do +so, you can use the `c()` function again to add your new value: + +> ~~~ +> # add the gene 'CYP1A1' and 'APOA5' to our list of snp genes +> # this overwrites our existing vector +> +> snp_genes <- c(snp_genes, "CYP1A1", "APOA5") +> ~~~ +{: .language-r} + +We can of course verify that "snp_genes" contains the new gene entries: + +> ~~~ +> snp_genes +> ~~~ +{: .language-r} +> ~~~ +> [1] "OXTR" "ACTN3" "AR" "OPRM1" "CYP1A1" "APOA5" +> ~~~ +{: .output} + +Using a negative index will return a version of a vector with that index's +value removed: + +> ~~~ +> snp_genes[-6] +> ~~~ +{: .language-r} +> ~~~ +> [1] "OXTR" "ACTN3" "AR" "OPRM1" "CYP1A1" +> ~~~ +{: .output} + + +We can remove that value from our vector by overwriting it with this expression: +> ~~~ +> snp_genes <- snp_genes[-6] +> snp_genes +> ~~~ +{: .language-r} +> ~~~ +> [1] "OXTR" "ACTN3" "AR" "OPRM1" "CYP1A1" +> ~~~ +{: .output} + +We can also explicitly replace a value or add a value at a given index using double bracket +notation: + +> ~~~ +> snp_genes[[7]] <- "APOA5" +> snp_genes +> ~~~ +{: .language-r} +> ~~~ +> [1] "OXTR" "ACTN3" "AR" "OPRM1" "CYP1A1" NA "APOA5" +> ~~~ +{: .output} + +Notice in the operation above that R inserts an `NA` value to extend our vector +so that the gene "APOA5" is at index 7. This may be a good or not so good thing +depending on how you use this. 
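The replace and extend behaviors described above can be tried on a small throwaway vector so that `snp_genes` stays untouched. This is a sketch only; `example_genes` is a hypothetical object name.

```r
# a small throwaway vector; 'example_genes' is a made-up name
example_genes <- c("OXTR", "ACTN3", "AR")

# assigning to an existing index replaces the value stored there
example_genes[2] <- "APOA5"

# assigning past the end extends the vector and fills the gap with NA
example_genes[5] <- "CYP1A1"

example_genes           # "OXTR" "APOA5" "AR" NA "CYP1A1"
length(example_genes)   # 5

# a negative index returns a copy without that position;
# the original vector only changes if you overwrite it
example_genes[-4]       # "OXTR" "APOA5" "AR" "CYP1A1"
```

Checking `length()` after an assignment like this is a quick way to notice when a vector has been extended by accident.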
+ +> ## Exercise: Examining and indexing vectors +> Answer the following questions to test your knowledge of vectors +> +> Which of the following are true of vectors in R? +> +> A) All vectors have a mode or a length +> +> B) All vectors have a mode and a length +> +> C) Vectors may have different lengths +> +> D) Items within a vector may be of different modes +> +> E) You can use `c()` to add one or more items to an existing vector +> +> F) You can use `c()` to add a vector to an existing vector +>> +>> ## solution +>> A) False - Vectors have both of these properties +>> +>> B) True +>> +>> C) True +>> +>> D) False - Vectors have only one mode (e.g. numeric, character); all items in +>> a vector must be of this mode. +>> +>> E) True +>> +>> F) True +>> +> {: .solution} +{: .challenge} + + +## Logical Indexing + +There is one last set of cool indexing capabilities we want to introduce. It is +possible within R to retrieve items in a vector based on a logical evaluation +or numerical comparison. For example, let's say we wanted to get all of the SNPs +in our vector of SNP positions that were greater than 100,000,000. We could +index using the '>' (greater than) logical operator: + +> ~~~ +> snp_positions[snp_positions > 100000000] +> ~~~ +{: .language-r} +> ~~~ +> [1] 154039662 +> ~~~ +{: .output} + +As demonstrated above, in the square brackets you place the name of the vector +followed by the comparison operator and (in this numeric case) a numeric value. 
+Some of the most common logical operators you will use in R are: + +|Operator|Description| +|--------|-----------| +|<|less than| +|<=|less than or equal to| +|>|greater than| +|>=|greater than or equal to| +|==|exactly equal to| +|!=|not equal to| +|!x|not x| +|a \| b| a or b| +|a & b| a and b| + +> ## The magic of programming +> +>The reason why the expression `snp_positions[snp_positions > 100000000]` works +>can be better understood if you examine what the expression "snp_positions > 100000000" +>evaluates to: +> +>> ~~~ +>> snp_positions > 100000000 +>> ~~~ +>{: .language-r} +>> ~~~ +>> [1] FALSE FALSE FALSE TRUE +>> ~~~ +>{: .output} +> +>The output above is a logical vector, the 4th element of which is TRUE. When +>you pass a logical vector as an index, R will return the values at the positions marked TRUE: +> +>> ~~~ +>> snp_positions[c(FALSE, FALSE, FALSE, TRUE)] +>> ~~~ +>{: .language-r} +>> ~~~ +>> [1] 154039662 +>> ~~~ +>{: .output} +> +> +>If you have never coded before, this type of situation starts to expose the +>"magic" of programming. We mentioned before that in the bracket indexing +>notation you take your named vector followed by brackets which contain an index: +>**named_vector[index]**. The "magic" is that the index needs to *evaluate to* a +>number. So, even if it does not appear to be an integer (e.g. 1, 2, 3), as long +>as R can evaluate it, we will get a result. That our expression +>`snp_positions[snp_positions > 100000000]` evaluates to a number can be seen +>in the following situation. What if you wanted to know which **index** (1, 2, 3, or +>4) in our vector of SNP positions was the one that was greater than 100,000,000? 
+>We can use the `which()` function to return the indices of any item that +>evaluates as TRUE in our comparison: +>> ~~~ +>> which(snp_positions > 100000000) +>> ~~~ +>{: .language-r} +>> ~~~ +>> [1] 4 +>> ~~~ +>{: .output} +> **Why is this important?** Often in programming we will not know what inputs +> and values will be used when our code is executed. Rather than put in a +> pre-determined value (e.g. 100000000) we can use an object that can take on +> whatever value we need. So for example: +> +>> ~~~ +>> snp_marker_cutoff <- 100000000 +>> snp_positions[snp_positions > snp_marker_cutoff] +>> ~~~ +>{: .language-r} +>> ~~~ +>> [1] 154039662 +>> ~~~ +>{: .output} +> Ultimately, it's putting together flexible, reusable code like this that gets +> at the "magic" of programming! +{: .callout} + +## A few final vector tricks + +Finally, there are a few other common retrieve or replace operations you may +want to know about. First, you can check to see if any of the values of your +vector is an NA value. Missing data will get a more detailed treatment later, +but the `is.na()` function will return a logical vector, with TRUE for any NA +value: + +> ~~~ +> # current value of 'snp_genes': chr [1:7] "OXTR" "ACTN3" "AR" "OPRM1" "CYP1A1" NA "APOA5" +> +> is.na(snp_genes) +> ~~~ +{: .language-r} +> ~~~ +> [1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE +> ~~~ +{: .output} + +Sometimes, you may wish to find out if a specific value (or several values) is +in a vector. 
You can do this using the comparison operator `%in%`, which will +return TRUE for any value in your collection of one or more values that matches a +value in the vector you are searching: + +> ~~~ +> # current value of 'snp_genes': chr [1:7] "OXTR" "ACTN3" "AR" "OPRM1" "CYP1A1" NA "APOA5" +> # test to see if "ACTN3" or "APOA5" is in the snp_genes vector +> # if you are looking for more than one value, you must pass this as a vector +> +> c("ACTN3","APOA5") %in% snp_genes +> ~~~ +{: .language-r} +> ~~~ +> [1] TRUE TRUE +> ~~~ +{: .output} + +> ## Review: Creating and indexing vectors +> Use your knowledge of vectors to accomplish the following tasks: +> +> **1) What mode are the following vectors? Use `typeof()` to check** +> +> a. `snps` +> +> b. `snp_chromosomes` +> +> c. `snp_positions` +> +> **2) Add the following values to the following vectors** +> +> a. To the `snps` vector add: 'rs662799' +> +> b. To the `snp_chromosomes` vector add: 11 +> +> c. To the `snp_positions` vector add: 116792991 +> +> **3) Make the following change to the `snp_genes` vector** +> Hint: Your vector should look like this in the 'Global Environment': +> `chr [1:7] "OXTR" "ACTN3" "AR" "OPRM1" "CYP1A1" NA "APOA5"`. If not, +> recreate the vector by running this expression: +> `snp_genes <- c("OXTR", "ACTN3", "AR", "OPRM1", "CYP1A1", NA, "APOA5")` +> +> a. Create a new version of `snp_genes` that does not contain CYP1A1 +> +> b. Add 2 NA values to the end of `snp_genes` (hint: final vector should +> have a length of 8) +> +> **4) Create a new vector `combined` that contains:** +> +> - The 1st value in `snp_genes` +> +> - The 1st value in `snps` +> +> - The 1st value in `snp_chromosomes` +> +> - The 1st value in `snp_positions` +> +> **Check the mode of `combined` using `typeof()` +> +>> ## solution +>> +>> **1) What mode are the following vectors? Use `typeof()` to check** +>> +>> a. `typeof(snps)` # "character" +>> +>> b. `typeof(snp_chromosomes)` # "character" +>> +>> c. 
`typeof(snp_positions)` # "double" - which is also a numeric type +>> +>> +>> **2) Add the following values to the following vectors** +>> +>> a. `snps <- c(snps, 'rs662799')` +>> +>> b. `snp_chromosomes <- c(snp_chromosomes, "11")` # did you use quotes? +>> +>> c. `snp_positions <- c(snp_positions, 116792991)` +>> +>> **3) Make the following change to the `snp_genes` vector** +>> +>> a. `snp_genes <- snp_genes[-5]` or `snp_genes <- snp_genes[c(1,2,3,4,6,7)]`, etc. +>> +>> b. `snp_genes <- c(snp_genes, NA, NA)` or `snp_genes[[8]] <- NA`, etc. +>> +>> +>> **4) Create a new vector `combined` that contains:** +>> +>> - The 1st value in `snp_genes` +>> +>> - The 1st value in `snps` +>> +>> - The 1st value in `snp_chromosomes` +>> +>> - The 1st value in `snp_positions` +>> +>> +>> `combined <- c(snp_genes[1], snps[1], snp_chromosomes[1], snp_positions[1])` +>> +>> `typeof(combined)` # "character" - Do you know why this is? +>> +> {: .solution} +{: .challenge} + +## Bonus material: Lists + +Lists are quite useful in R, but we won't be using them in the genomics lessons. +That said, you may come across lists in the way that some bioinformatics +programs may store and/or return data to you. One of the key attributes of a list +is that unlike a vector, a list may contain data of more than one mode. Learn +more about creating and using lists using this [nice tutorial](http://r4ds.had.co.nz/lists.html). +In this one example, we will create a named list and show you how to retrieve +items from the list. 
+ + +> ~~~ +> # Create a named list using the 'list' function and our SNP examples +> # Note, for easy reading we have place each item in the list on a separate line +> # Nothing special about this, you can do this for any multiline commands +> # To run this command, make sure the entire command (all 4 lines) are highlited +> # before running +> +>snp_data <- list(genes = snp_genes, +> refference_snp = snps, +> chromosome = snp_chromosomes, +> position = snp_positions) +> +> # Examine the structure of the list +>str(snp_data) +> ~~~ +{: .language-r} +> ~~~ +>List of 4 +> $ genes : chr [1:8] "OXTR" "ACTN3" "AR" "OPRM1" ... +> $ refference_snp: chr [1:5] "rs53576" "rs1815739" "rs6152" "rs1799971" ... +> $ chromosome : chr [1:4] "3" "11" "X" "6" +> $ position : num [1:4] 8.76e+06 6.66e+07 6.75e+07 1.54e+08 +> ~~~ +{: .output} + +To get all of the values for the `position` object in the list we use the `$` notation: + +> ~~~ +> # return all the values of position object +> +> snp_data$position +> ~~~ +{: .language-r} +> ~~~ +> [1] 8762685 66560624 67545785 154039662 +> ~~~ +{: .output} + +To get the first value in the `position` object, use `[]` notation to index: + +> ~~~ +> # return first value of the position object +> +> snp_data$position[1] +> ~~~ +{: .language-r} +> ~~~ +> [1] 8762685 +> ~~~ +{: .output} + + + +--- From a141b5480e19e5ac61228f8cdd3cc0155488315d Mon Sep 17 00:00:00 2001 From: JasonJWilliamsNY Date: Wed, 9 May 2018 16:54:07 -0400 Subject: [PATCH 11/19] begin lesson 3 --- episodes/02-r-basics.md | 6 +- episodes/03-basics-factors-dataframes.md | 115 +++++++++-------------- 2 files changed, 49 insertions(+), 72 deletions(-) diff --git a/episodes/02-r-basics.md b/episodes/02-r-basics.md index ca318177..58b360e7 100644 --- a/episodes/02-r-basics.md +++ b/episodes/02-r-basics.md @@ -19,7 +19,7 @@ keypoints: with about a day's worth of training" - "It is important to understand how data are organized by R in a given object type how the mode of that type 
(e.g. numeric, character, logical, etc.) will - determine how R will operate on that data. + determine how R will operate on that data." - "Working with vectors effectively prepares you for understanding how data are organized in R." --- @@ -288,7 +288,7 @@ can be added, multiplied, divided, etc. R provides several mathematical |*|multiplication| |/|division| |^ or **|exponentiation| -|a%%b|modulus| +|a%%b|modulus (returns the remainder after division)| These can be used with literal numbers: @@ -748,7 +748,7 @@ value in the vector you are searching: > > - The 1st value in `snp_positions` > -> **Check the mode of `combined` using `typeof()` +> **Check the mode of `combined` using `typeof()`** > >> ## solution >> diff --git a/episodes/03-basics-factors-dataframes.md b/episodes/03-basics-factors-dataframes.md index d0b174e2..13e6de9d 100644 --- a/episodes/03-basics-factors-dataframes.md +++ b/episodes/03-basics-factors-dataframes.md @@ -4,87 +4,64 @@ teaching: 60 exercises: 20 questions: - "How do I get started with tabular data (e.g. spreadsheets) in R?" +- "What are some best practices for reading data into R?" +- "How do I save tabular data generated in R?" 
objectives: -- "Identify R skills not covered in these lessons and where to learn more" -- "Be able to create and appropriately name objects in R" -- "Be able to explain what a data types are, and know the common R data types - (modes)" -- "Be able to reassign object values and delete objects" -- "Be able to do simple arithmetic of functional procedures on R objects" -- "Be able to create the most common R objects including vectors, factors, - lists, and data frames" -- "Be able to retrieve (index), name, or replace, values from an object" - "Be able to load a tabular dataset using base R functions" - "Explain the basic principle of tidy datasets" - "Be able to determine the structure of a data frame including its dimensions and the datatypes of variables" - "Be able to retrieve (index) a data frame" +- "Understand how how R may converse data into different modes" +- "Be able to convert the mode of an object" +- "Understand that R uses factors to store and manipulate catagorical data" +- "Be able to manipulate a factor, including indexing and reordering" - "Be able to apply an arithmetic function to a dataframe" - "Be able to coerce the class of an object (including variables in a dataframe)" - "Be able to save a dataframe as a delimited file" keypoints: -- "Effectively using R is a journey of months or years. Still you don't have to - be an expert to use R and you can start using and analzying your data with - with about a day's worth of training" -- "It is important to understand how data are organized by R in a given object - type (e.g. vector, factor, data frame, etc.) how the mode of that type - (e.g. numeric, character, logical, etc.) will determine how R will operate - on that data, and what can happen when datatypes are coerced, misinterpreted, - or misapplied" -- "Data wrangling - loading data, cleaning this data (e.g checking for outliers, - handling missing values, sorting, filtering, etc.) 
is an important first step - for working with data" +- "It is easy to import data into R from tabular formats including Excel. + However, you still need to check that R has imported and interprited your + data correctly" +- "There are best practices for organizing your data (keeping it tidy) and R + is great for this" +- "Base R has many useful functions for manipulating your data, but all of R's + capabilities are greatly enhanced by software packages developed by the + community" --- -## "The fantastic world of R awaits you" OR "Nobody wants to learn how to use R" -Before we begin this lesson, we want you to be clear on the goal of the workshop -and these lessons. We believe that every learner can be **achieve competency -with R**. You have reached competency when you find that you are able to -**use R to handle common analysis challenges in a reasonable amount of time** -(which includes time needed to look at learning materials, search for answers -online, and ask colleagues for help). As you spend more time using R (there is -no substitute for regular use and practice) you will find yourself gaining -competency and even expertise. The more familiar you get, the more -complex the analyses you will be able to carry out, with less frustration, and -in less time - the "fantastic world of R" awaits you! - -## What these lessons will not teach you -Nobody wants to learn how to use R. People want to learn how to use R to analyze -their own research questions! Ok, maybe some folks learn R for R's sake, but -these lessons assume that you want to start analyzing genomic data as soon as -possible. Given this, there are many valuable pieces of information about R -that we simply wont have time to cover. Hopefully we will clear the hurdle of -giving you just enough knowledge to be dangerous, which can be a high hurdle -in R! We uggest you look into additional the learning materials in the tip box -below. 
- -**Here are some R skills we will *not* cover in these lessons** - -- How to create and work with R matrices and R lists -- How to create and work with loops and conditional statements -- How to do basic string manipulations (e.g. finding patterns in text using grep) -- How to plot using the default R graphic tools (we *will* cover ggplot2) -- How to use the advanced R statistical functions - ->## Tip: Where to learn more -> The following are good resources for learning more about R. Some of them -> can be quite technically, but if you are a regular R user you may ultimately -> need some of this technical knowledge. -> - [R for Beginners](https://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf): - By Emmanuel Paradis, great starting point -> - [The R Manuals](https://cran.r-project.org/manuals.html): Maintained by the - R project -> - [R contributed documentation](https://cran.r-project.org/other-docs.html): - Also linked to the R project; importantly there are materials available in - several languages -> - [R for Data Science](http://r4ds.had.co.nz/): A wonderful collection by - noted R educators and developers Garrett Grolemund and Hadley Wickham -> - [Practical Data Science for Stats](https://peerj.com/collections/50-practicaldatascistats/): - Not exclusively about R usage, but a nice collection of pre-prints on data science - and applications for R -> - [Programming in R Software Carpentry lesson](https://software-carpentry.org/lessons/): - There are several Software Carpentry lessons in R to choose from - {: .callout} +## Working with spreadsheets (tabular data) +A substantial amount of the data we work with in genomics will be tabular data, +this is data arranged in rows and columns - also known as spreadsheets. We could +write a whole lesson on how to work with spreadsheets effectively ([actually we did](http://www.datacarpentry.org/spreadsheet-ecology-lesson/)). 
For our +purposes, we want to remind you of a few principles before we work with our +first set of example data: + +**1) Keep raw data separate from analyzed data** +This is principle number one because if you can't tell what data is the +original form, you risk making some serious mistakes. + +**2) Keep speadsheet data Tidy** +The simplest principle of **Tidy data** is that we we have one row in our +spreadsheet for each observation or sample, and one colum for every variable +that we measure or report on. As simple as this sounds, its very easily +violated, and most data scintists agree that most of their time is spent +tidying their data for analysis. Read more about data organization in +[our lesson](http://www.datacarpentry.org/spreadsheet-ecology-lesson/) and in [this paper](https://www.jstatsoft.org/article/view/v059i10). + +**3) Trust but verify** +Finally, while you don't need to be paranoid about data, you should have a plan +for how you will prepare it for analysis. **This is the focus of this lesson.** +You probably already have a lot of intuition, expectations, assumptions about +your data - the range of values you expect, how many values should have +been recorded, etc. Of course, as the data get larger, our human ability to +keep track will start to fail (and yes, it can fail for small data sets too). +R will help you to examine your data so that you can have greater confidence +in your analysis, and its reproducibility. 
+ + + + ## Creating objects in R From bc9d75e506076c1effe529939160f55145e29d20 Mon Sep 17 00:00:00 2001 From: JasonJWilliamsNY Date: Fri, 11 May 2018 17:31:02 -0400 Subject: [PATCH 12/19] complete episode 3 up to ordering factors --- episodes/03-basics-factors-dataframes.md | 911 ++++++----------------- 1 file changed, 246 insertions(+), 665 deletions(-) diff --git a/episodes/03-basics-factors-dataframes.md b/episodes/03-basics-factors-dataframes.md index 13e6de9d..54f8494e 100644 --- a/episodes/03-basics-factors-dataframes.md +++ b/episodes/03-basics-factors-dataframes.md @@ -7,12 +7,12 @@ questions: - "What are some best practices for reading data into R?" - "How do I save tabular data generated in R?" objectives: -- "Be able to load a tabular dataset using base R functions" - "Explain the basic principle of tidy datasets" +- "Be able to load a tabular dataset using base R functions" - "Be able to determine the structure of a data frame including its dimensions and the datatypes of variables" -- "Be able to retrieve (index) a data frame" -- "Understand how how R may converse data into different modes" +- "Be able to retrieve values (index) from a data frame" +- "Understand how R may converse data into different modes" - "Be able to convert the mode of an object" - "Understand that R uses factors to store and manipulate catagorical data" - "Be able to manipulate a factor, including indexing and reordering" @@ -38,812 +38,393 @@ purposes, we want to remind you of a few principles before we work with our first set of example data: **1) Keep raw data separate from analyzed data** -This is principle number one because if you can't tell what data is the -original form, you risk making some serious mistakes. + +This is principle number one because if you can't tell which files are the +original raw data, you risk making some serious mistakes (e.g. drawing conculsion +from data which have been manipulated in some unknown way). 
**2) Keep speadsheet data Tidy** -The simplest principle of **Tidy data** is that we we have one row in our + +The simplest principle of **Tidy data** is that we have one row in our spreadsheet for each observation or sample, and one colum for every variable -that we measure or report on. As simple as this sounds, its very easily -violated, and most data scintists agree that most of their time is spent -tidying their data for analysis. Read more about data organization in -[our lesson](http://www.datacarpentry.org/spreadsheet-ecology-lesson/) and in [this paper](https://www.jstatsoft.org/article/view/v059i10). +that we measure or report on. As simple as this sounds, it's very easily +violated. Most data scintists agree that significant amounts of their time is +spent tidying data for analysis. Read more about data organization in +[our lesson](http://www.datacarpentry.org/spreadsheet-ecology-lesson/) and +in [this paper](https://www.jstatsoft.org/article/view/v059i10). **3) Trust but verify** + Finally, while you don't need to be paranoid about data, you should have a plan -for how you will prepare it for analysis. **This is the focus of this lesson.** +for how you will prepare it for analysis. **This a the focus of this lesson.** You probably already have a lot of intuition, expectations, assumptions about your data - the range of values you expect, how many values should have -been recorded, etc. Of course, as the data get larger, our human ability to +been recorded, etc. Of course, as the data get larger our human ability to keep track will start to fail (and yes, it can fail for small data sets too). R will help you to examine your data so that you can have greater confidence in your analysis, and its reproducibility. +## Importing tabular data into R +There are several ways to import data into R. 
For our purpose here, we will +focus on using the tools every R installation comes with (so-called "base" R) to +import a comma-delimited file, a sequencing sample submission sheet. +First, we need to load the sheet using a function called `read.csv()`. - -## Creating objects in R - -> ## Reminder -> At this point you should writing following along in the "**genomics_r_basics.R**" -> script we created in the last episode. Writing you commands in the script -> will make it easier to record what you did and why. +> ## Exercise: Review the arguments of the `read.csv()` function +> **Before using the `read.csv()` function, use R's help feature to answer the +> following questions**. > -{: .prereq} - -What might be called a variable in many language is properly called an **object** -in R. To create your object you need a name (e.g. 'a'), and a value (e.g. '1'). -Using the R assignment operator '<-''. In your script, "**genomics_r_basics.R**" -write a comment (using the '#') sign, and assign '1' to the object 'a' as shown -below: - -> ~~~ -> # this line creates the object 'a' and assigns it the value '1' +> *Hint*: Entering '?' before the function name and then running that line will +> bring up the help documentation. Also, when reading this particular help +> page, pay careful attention to the 'read.csv' expression under the 'Usage' +> heading. Other answers will be in the 'Arguments' heading. > -> a <- 1 -> ~~~ -{: .language-r} - -Be sure to execute this line of code in your script. You can run a line of code -by hitting the Run button that is just above the first line of your -script in the header of the Source pane or you can use the appropriate shortcut: - - Windows execution shortcut: Ctrl+Enter - - Mac execution shortcut: Cmd(⌘)+Enter -to run multiple lines of code, you can highlight all the line you wish to run -and then hit Run or use the shortcut key combo.
- -You should notice the following outputs; in the RStudio 'Console' you should see: - -> ~~~ -> # this line creates the object 'a' and assigns it the value '1' +> A) What is the default parameter for 'header' in the `read.csv()` function? > -> a <- 1 -> ~~~ -{: .output} - -The 'Console' will display lines of code run from a script and any outputs or -status/warning/error messages (usually in red). - -You should also notice that in the 'Environment' window you get a table: - -|Values|| -|------|-| -|a|1| - -The 'Environment' window allows you to easily keep track of the objects you have -created in R. - -> ## Exercise: Create some objects in R -> Create the following objects in R, give each object an appropriate name. +> B) What argument would you have to change to read a file that was delimeted +> by semicolons (;) rather than commas? > -> 1. Create an object that has the value of number of pairs of human chromosomes -> 2. Create an object that has a value of your favorite gene name -> 3. Create an object that value of this URL: "ftp://ftp.ensemblgenomes.org/pub/bacteria/release-39/fasta/bacteria_5_collection/escherichia_coli_b_str_rel606/" -> 4. Create and object that has the value of the number of chromosomes in a diplod cell +> C) What argument would you have to change to read file in which numbers +> used commas for decimal separation (i.e. 1,00)? +> +> D) What argument would you have to change to read in only the first 10,000 rows +> of a very large file? > >> ## solution ->> Here as some possible answers to the challenge: ->> 1. human_chr_number <- 23 ->> 2. gene_name <- 'pten' ->> 3. ensemble_url <- 'ftp://ftp.ensemblgenomes.org/pub/bacteria/release-39/fasta/bacteria_5_collection/escherichia_coli_b_str_rel606/' ->> 4. human_diploid_chr_num <- 2 * human_chr_number >> +>> A) The `read.csv()` function has the argument 'header' set to TRUE by deault, +>> this means the function always assumes the first row is header information, +>> (i.e. 
column names) +>> +>> B) The `read.csv()` function has the argument 'sep' set to ",". This means +>> the function assumes commas are used as delimiters, as you would expect. +>> Changing this parameter (e.g. `sep=";"`) would make the function interpret +>> semicolons as delimiters. +>> +>> C) The argument to change is `dec`, which `read.csv()` sets to "." by default. +>> If you set `dec=","` you could change the decimal operator. We'd probably +>> assume the delimiter is some other character. +>> +>> D) Although it is not listed in the `read.csv()` usage, `read.csv()` is +>> a "version" of the function `read.table()` and accepts all its arguments. +>> You can set `nrows` to a numeric value (e.g. `nrows=10000`) to choose how +>> many rows of a file you read in. This may be useful for very large files +>> where not all the data is needed to test some data cleaning steps you are +>> applying. +>> +>> Hopefully, this exercise gets you thinking about using the provided help +>> documentation in R. There are many arguments that exist, but which we won't +>> have time to cover. Look here to get familiar with functions you use +>> frequently; you may be surprised at what you find they can do. > {: .solution} {: .challenge} -## Naming objects in R - -Here are some important details about naming objects in R. - -- **Avoid spaces and special characters**: Object cannot contain spaces. Typically - you can use '-' or '_' to provide separation. You should avoid using special - characters in your object name (e.g. ! @ # . , etc.). Also, names cannot begin with - a number. -- **Use short, easy-to-understand names**: You should avoid naming your objects - using single letters (e.g. 'n', 'p', etc.). This is mostly to encourage you - to use names that would make sense to anyone reading your code (a colleague, - or even yourself a year from now). Also, avoiding really long names will make - your code more readable. -- **Avoid commonly used names**: There are several names that may alread have a - definition in the R language (e.g. 'mean', 'min', 'max').
One clue that a name - already has meaning is that if you start typing a name in RStudio and either - pause your typing or hit the Tab key and RStudio gives you a - suggested autocompletion or help message you have choosen a name that has a - prior meaning. -- **Use the recommended assignment operator**: In R, we use '<- '' as the - prefered assignment operator. '=' works too, but is most comonly used in - passing arguments to functions (more on functions later). There is a shortcut - for the R assignment operator: - - Windows execution shortcut: Alt+- - - Mac execution shortcut: Option+- - - -There are a few more suggestions about naming and style you may want to learn -more about as you write more R code. There are several "style guides" that -have advice, and one to start with is the [tidyverse R style guide](http://style.tidyverse.org/index.html). - ->## Tip: Pay attention to warnings in the script console -> -> If you enter a line of code in your R that contains some error, RStudio -> may give you hint with an error indication and an underline of this mistake. -> Sometimes these messages are easy to understand, but often the message may -> need some figuring out. In any case paying attention to these warnings help -> you avoid mistakes. In this case, our object name has a space, which is not -> allowed in R. Notice the error message does not say this directly, but -> essentially R is "not sure" about to to assign the name to "human_ chr_number" -> when the object name we want is "human_chr_number". -> -> rstudio script warning -> - {: .callout} - -## Reassigning object names or deleting objects -Once an object has a value, you can change that value by overwriting it. R will -not complain about overwriting objects, which may or may not be a good thing -depending on how you look at it. +Now, let's read in the file `sample_submission.csv` which will be located in +`/home/dcuser/dc_sample_data/R`. Save the file as `submission_metadata`. 
The +first argument to pass to our `read.csv()` function is the file path for our +data. The file path must be in quotes, and now is a good time to remember to +use tab autocompletion. **If you use tab autocompletion you avoid typos and +errors in file paths.** Use it! > ~~~ -> # gene_name has the value 'pten' or whatever value you used in the challenge. We will now assign the new value 'tp53' +>## read in a CSV file and save it as 'submission_metadata' > -> gene_name <- 'tp53' +> submission_metadata <- read.csv("/home/dcuser/dc_sample_data/R/sample_submission.csv") > ~~~ {: .language-r} -You can also remove an object from R's memory entirely. The `rm()` function -will delete the object. +One of the first things you should notice is that in the Environment window, +you have the `submission_metadata` object, listed as 96 obs. (observations/rows) +of 10 variables (columns). Double-clicking on the name of the object will open +a view of the data in a new tab. -> ~~~ -> # delete the object 'gene_name' -> -> rm(gene_name) -> ~~~ -{: .language-r} +rstudio data frame view -If you run a line of code that just has an object name, R will normally display -the contents of that object. In this case, we are told the object is no -longer defined. +## Summarizing and determining the structure of a data frame. +A **data frame is the standard way in R to store tabular data**. A data frame +could also be thought of as a collection of vectors, all of which have the same +length. Using only two functions, we can learn a lot about our data frame +including some summary statistics as well as the "structure" of the data +frame. Let's examine what each of these functions can tell us: > ~~~ -> Error: object 'gene_name' not found -> ~~~ -{: .error} - -## Understaning object data types (modes) - -One very important thing to know about an object is that every object has two -properties, "length" and "mode". We will get to the "length" property later in -the lesson.
The **"mode" property corresponds to the type of data an object** -**represents**. The most common modes you will encounter in R are: - -|Mode (abbreviation)|Type of data| -|----|------------| -|Numeric (num)| Numbers such integers (e.g. 1, 892, 1.3e+10) and floating pont/decimals (0.5, 3.14)| -|Character (chr)|A sequence of letters/numbers in single '' or double " " quotes| -|Logical| Boolean values - TRUE or FALSE| - -There are a few other modes (double", "complex", "raw" etc.) but for now, these -three are the most important. Data types are familiar in many programming -languages, but also in natural language where we refer to them as the -parts of speech, e.g. nouns, verbs, adverbs, etc. One you know if a word - -perhaps an unfamilar one - is a noun, you can probbaly guess you can count it -and make it plural if there is more than one (e.g. 1 Tuatara, or 2 Tuataras). -If something is a adjective, you can usually change it into an adverb by -adding "-ly" (e.g. jejune vs. jejunely). Depending on the context, you may need -to decide if a word is in one category or another (e.g "cut" may be a noun when -its on your finger, or a verb when you are preparing vegetables). These examples -have important analogies when working with R objects. - -> ## Exercise: Create objects and check their modes -> Create the following objects in R, then use the `mode()` function to verify -> their modes. Try to guess what the mode will be before you look at the solution -> -> 1. chromosome_name <- 'chr02' -> 2. od_600_value <- 0.47 -> 3. chr_position <- '1001701' -> 4. spock <- TRUE -> 5. pilot <- Earhart +>## get summary statistics on a data frame > ->> ## solution ->> ->> 1. mode(chromosome_name) # "character" ->> 2. mode(od_600_value) # "numeric" ->> 3. mode(chr_position) # "character" ->> 4. mode(spock) # "logical" ->> 5. 
pilot # Error: object 'Earhart' not found -> {: .solution} -{: .challenge} - -Notice from the solution that even if a series of numbers are given as a value -R will consider them to be in the "character" mode if they are enclosed as -single or double quotes. Also notice that you cannot take a string of alphanumeric -character (e.g. Earhart) and assign as a value for an object. In this case, -R looks for the object `Earhart` but since there is no object, no assignment can -be made. If `Earhart` did exist, then the mode of `pilot` would be whatever -the mode of `Earthrt` was originally. - -## Mathematical and functional operations on objects - -Once an object exsits (which by definition also means it has a mode), R can -appropriately manipulate that object. For example, objects of the numeric modes -can be added, multiplied, divided, etc. R provides several mathematical -(arithmetic) operators incuding: - -|Operator|Description| -|--------|-----------| -|+|addition| -|-|subtraction| -|*|multiplication| -|/|division| -|^ or **|exponentiation| -|a%%b|modulus| - -These can be used with literal numbers: - -> ~~~ -> (1 + (5 ** 0.5))/2 +> summary(submission_metadata) > ~~~ {: .language-r} - > ~~~ -> [1] 1.618034 +> well_position tube_barcode plate_barcode client_sample_id replicate Volume..µL. +> A1 : 1 Min. :151017990 LP-10624:96 k255M_1h-2 : 3 a: 1 Min. : 0.50 +> A10 : 1 1st Qu.:152123658 k255N_1h-1 : 3 A:31 1st Qu.: 57.35 +> A11 : 1 Median :153386891 k255N_1h-10: 3 b: 1 Median : 59.60 +> A12 : 1 Mean :153306679 k255N_1h-11: 3 B:31 Mean : 65.15 +> A2 : 1 3rd Qu.:154445370 k255N_1h-12: 3 c: 1 3rd Qu.: 62.50 +> A3 : 1 Max. :155537812 k255N_1h-13: 3 C:31 Max. :630.10 +> (Other):90 (Other) :78 +> concentration..ng.µL. RIN prep_date ship_date +> Min. : 15.82 Min. :5.600 6-Jul-15:45 20-Jul:96 +> 1st Qu.:183.70 1st Qu.:8.200 7/8/15 :48 +> Median :197.27 Median :8.500 7-Jun-15: 3 +> Mean :193.06 Mean :8.474 +> 3rd Qu.:209.97 3rd Qu.:8.900 +> Max. :237.12 Max. 
:9.600 > ~~~ {: .output} -and importantly, can be used on any object that evaluates to (i.e. iterprited -by R) a numeric object: +Our data frame had 10 variables, so we get 10 fields that summarize the data. +The `tube_barcode`, `Volume..µL.`, `concentration..ng.µL.`, and `RIN` variables are +numerical data and so you get summary statistics on the min and max values for +these columns, as well as mean, median, and interquartile ranges. The other data +(e.g. `replicate`, etc.) are treated as categorical data (which have special +treatment in R - more on this in a bit). The top 6 different categories and the +number of times they appear (e.g. the replicate called 'A' appeared 31 times) +are displayed. There was only one value for `ship_date`, "20-Jul", which appeared +96 times. +Before we operate on the data, we also need to know a little more about the +data frame structure. To do that we use the `str()` function: > ~~~ -> # multiply the object 'human_chr_number' by 2 +>## get the structure of a data frame > -> human_chr_number * 2 +> str(submission_metadata) > ~~~ {: .language-r} - -returns the result: - > ~~~ -> [1] 46 +>'data.frame': 96 obs. of 10 variables: + >$ well_position : Factor w/ 96 levels "A1","A10","A11",..: 1 13 25 37 49 61 73 85 5 17 ... +>$ tube_barcode : int 151017990 151101577 151142725 151232891 151236606 151323716 151346588 151423653 151462684 151508377 ... +>$ plate_barcode : Factor w/ 1 level "LP-10624": 1 1 1 1 1 1 1 1 1 1 ... +>$ client_sample_id : Factor w/ 34 levels "k255M_1h-2","k255N_1h-1",..: 18 18 18 26 26 26 27 27 27 28 ... +>$ replicate : Factor w/ 6 levels "a","A","b","B",..: 1 3 5 2 4 6 2 4 6 2 ... +>$ Volume..µL. : num 64.2 63.7 60.2 55.8 60.8 57.5 64.9 62.5 53.9 62.4 ... +>$ concentration..ng.µL.: num 211 220 208 181 191 ... +>$ RIN : num 8.1 9.4 8.9 9 8.1 8.6 8.6 8.8 9.5 8.1 ... +>$ prep_date : Factor w/ 3 levels "6-Jul-15","7/8/15",..: 1 1 1 1 1 1 1 1 1 1 ... +>$ ship_date : Factor w/ 1 level "20-Jul": 1 1 1 1 1 1 1 1 1 1 ...
> ~~~ {: .output} -Finally, it is useful to know that several other types of mathematical -operations have their own associated functions. While there are too many to -list, you can always search the online documentation in R for a function ( -even if you don't know what it may be called in R). For example: - -> ~~~ -> # search for functions associated with chi squared -> -> ?? chisquared -> ~~~ -{: .language-r} - -Will open search results in your help tab. Of course, using Google will help -here too. +OK, that's a lot to unpack! Some things to notice: -> ## Exercise: Compute the golden ratio -> One appoximation of the golen ratio (φ) can be found by taking the sum of 1 -> and the square root of 5, and dividing by 2 as in the example above. Compute -> the golden ratio to 3 digits of precision using the `sqrt()` and `round()` -> functions. Hint: remember the `round()` function can take 2 arguments. -> ->> ## solution ->> ->> round((1 + sqrt(5))/2, digits=3) ->> ->> [1] 1.618 ->> ->> * Notice that you can place one function inside of another. -> {: .solution} -{: .challenge} +- The object type `data.frame` is displayed in the first row along with its + dimensions, in this case 96 observations (rows) and 10 variables (columns) +- Each variable (column) has a name (e.g. `well_position`). This is followed + by the object mode (e.g. factor, int, num, etc.). Notice that before each + variable name there is a `$` - this will be important later. +## Introducing Factors -## Vectors +Factors are the final major data structure we will introduce in our R genomics +lessons. Factors can be thought of as vectors which are specialized for +categorical data. Given R's specialization for statistics, this makes sense since +categorical and continuous variables usually have different treatments. Sometimes +you may want to have data treated as a factor, but in other cases, this may be +undesirable.
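As a small sketch of that trade-off, the example below takes a made-up vector of replicate labels (hypothetical values, not read from the lesson's sample sheet), treats it as a factor, and then converts it back to plain text for cases where categorical treatment is not wanted.

```r
# A character vector of made-up replicate labels (hypothetical data)
replicate_labels <- c("A", "B", "C", "A", "B")

# Treated as a factor: R keeps track of the distinct categories (levels)
replicate_factor <- factor(replicate_labels)
levels(replicate_factor)
nlevels(replicate_factor)

# Converted back to plain text when categorical treatment is not wanted
replicate_text <- as.character(replicate_factor)
is.factor(replicate_text)
```

Which form is preferable depends on what you plan to do next: counting or plotting categories suits a factor, while editing the text of the labels is usually simpler on a character vector.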
-With a solid understanding of the most basic objects, we come to probably the -most used objects in R, vectors. A vector can be though of as a collection of -values (numbers, characters, etc.). Vectors also have a mode (data type), so -all of the contents of a vctor must be of the same mode. One of the most common -way to create a vector is to use the `c()` function - the "concatanate" or -"combine" function. Inside the function you may enter one or more values; for -multiple values, seperate each value with a comma: +Since some of the data in our data frame are factors, lets see how factors work +using the `factor()` function to create a factor: > ~~~ -> # Create the SNP gene name vector +>## create a factor 'days of the week' by passing a vector of characters > -> snp_genes <- c("OXTR", "ACTN3", "AR", "OPRM1") +>days_of_the_week <- factor(c('monday', +> 'tuesday', +> 'wednesday', +> 'thursday', +> 'friday')) > ~~~ {: .language-r} -Two important properties of vectors are their **mode** and their **length**. -You can check these with the `mode()` and `length()` function respectively. -Another useful function that gives both of these pieces of information is the -`str()` (structure) function. Importantly, **items within a vector must all -be of the same mode/ data type**. This is because a vector can have only one -mode. More on this later. 
+Notice what happens when we run a line with just the name of our factor: > ~~~ -> # Check the mode, length, and structure of 'gene_names' +># display the contents of the 'days_of_the_week' factor > -> mode(gene_names) -> length(gene_names) -> str(gene_names) +>days_of_the_week > ~~~ {: .language-r} - -returns: - > ~~~ -> [1] "character" -> [1] 4 -> chr [1:4] "OXTR" "ACTN3" "AR" "OPRM1" +> [1] monday tuesday wednesday thursday friday +> Levels: friday monday thursday tuesday wednesday > ~~~ {: .output} -Vectors are quite important in R, mostly for us because data frames are -essentially collections of vectors (more on this later). What we learn about -manipulating vectors now will pay of even more when we get to data frames. - -## More on creating and indexing vectors - -Let's create a few more vectors to play around with: +What we get back are the items in our factor, and also something called "Levels". +**Levels are the different categories contained in a factor**. By default, R +will organize the levels in a factor in alphabetical order. +Let's look at the contents of a factor in a slightly different way using `str()`: > ~~~ -> # some interesting human SNPs -> # while accuracy is important, typos in the data won't hurt you here > -> snps <- c('rs53576', 'rs1815739', 'rs6152', 'rs1799971') -> snp_chromosomes <- c('3', '11', 'X', '6') -> snp_positions <- c(8762685, 66560624, 67545785, 154039662) +> str(days_of_the_week) > ~~~ {: .language-r} - -Once we have vectors, one thing we may want to do is specifically retrieve one -or more values from our vector. To do so we use **bracket notation**. We type -the name of the vector followed by square brackets. In those square brackets -we place the index (e.g.
a number) in that bracket as follows: - > ~~~ -> # get the 3rd value in the snp_genes vector -> -> snp_genes[3] -> ~~~ -{: .language-r} -> ~~~ -> [1] "AR" +> Factor w/ 5 levels "friday","monday",..: 2 4 5 3 1 > ~~~ {: .output} -In R, every item your vector is indexed, starting from the first item (1) -through to the final number of items in your vector. You can also retrieve a -range of numbers: +For the sake of efficiency, R stores the content of a factor as a vector of +integers, in which an integer is assigned to each of the possible levels. Recall +that levels are assigned in alphabetical order, so: -> ~~~ -> # get the 1st through 3rd value in the snp_genes vector -> -> snp_genes[1:3] -> ~~~ -{: .language-r} -> ~~~ -> [1] "OXTR" "ACTN3" "AR" -> ~~~ -{: .output} +|Level|integer| +|-----|-------| +|friday|1| +|monday|2| +|thursday|3| +|tuesday|4| +|wednesday|5| -If you want to to retreive several (but not necessarily sequential) items from -a vector, you pass a **vector of indicies**; a vector that has the numbered -positions you wish to retrieve. +Notice what happens to the levels if we add some repeated values to our factor: > ~~~ -> # get the 1st, 3rd, and 4th value in the snp_genes vector +> # create a factor with repeated values > -> snp_genes[c(1, 3, 4)] +> more_days_of_the_week <- factor(c('monday', +> 'tuesday', +> 'wednesday', +> 'thursday', +> 'friday', +> 'friday', +> 'friday')) +> str(more_days_of_the_week) > ~~~ {: .language-r} > ~~~ -> [1] "OXTR" "AR" "OPRM1" +> Factor w/ 5 levels "friday","monday",..: 2 4 5 3 1 1 1 > ~~~ {: .output} -There are additional (and perhaps less commonly used) ways of indexing a vector -(see [these examples](https://thomasleeper.com/Rcourse/Tutorials/vectorindexing.html)). -Also, several of these indexing expressions can be combined: +Going back to our chart above, "friday" is assigned "1" in the factor, and that +integer is listed three times in our factor.
This is slightly obscure, but it +provides some clarification as to why we get this output. -> ~~~ -> # get the 1st through the 3rd value, and 4th value in the snp_genes vector -> # yes, this is a little silly in a vector of only 4 values. -> -> snp_genes[c(1:3,4)] -> ~~~ -{: .language-r} -> ~~~ -> [1] "OXTR" "ACTN3" "AR" "OPRM1" -> ~~~ - -## Adding to, removing, or replacing values in existing vectors +## Plotting and ordering factors -Once you have an existing vector, you may want to add a new item to it. To do -so, you can use the `c()` function again to add your new value: +One of the most common uses for factors will be when you plot categorical +values. For example, suppose we want to know how many samples from our sample +submission were prepped on each date. We could generate a plot: > ~~~ -> # add the gene 'CYP1A1' and 'APOA5' to our list of snp genes -> # this overwrites our existing vector +> # plot the number of samples prepped on each date > -> snp_genes <- c(snp_genes, "CYP1A1", "APOA5") +> plot(table(submission_metadata$prep_date)) > ~~~ {: .language-r} -We can of course verify that "snp_genes" contains the new gene entry +rstudio data frame view -> ~~~ -> snp_genes -> ~~~ -{: .language-r} -> ~~~ -> [1] "OXTR" "ACTN3" "AR" "OPRM1" "CYP1A1" "APOA5" -> ~~~ -{: .output} +Let's quickly break down this line of code: -Using a negative index will return a version a vector with that index's -value removed: +First we are pulling a single column of data from the `submission_metadata` data +frame using `$` notation: > ~~~ -> snp_genes[-6] +> # obtain the values of the 'prep_date' variable from the data frame +> +> submission_metadata$prep_date > ~~~ {: .language-r} > ~~~ -> [1] "OXTR" "ACTN3" "AR" "OPRM1" "CYP1A1" "APOA5" +>[1] 6-Jul-15 6-Jul-15 6-Jul-15 6-Jul-15 6-Jul-15 6-Jul-15 6-Jul-15 6-Jul-15 6-Jul-15 6-Jul-15 +>[11] 6-Jul-15 6-Jul-15 6-Jul-15 6-Jul-15 6-Jul-15 6-Jul-15 6-Jul-15 6-Jul-15 7-Jun-15 7-Jun-15 +>[21] 7-Jun-15 6-Jul-15 6-Jul-15 6-Jul-15 6-Jul-15 6-Jul-15
6-Jul-15 6-Jul-15 6-Jul-15 6-Jul-15 +>[31] 6-Jul-15 6-Jul-15 6-Jul-15 6-Jul-15 6-Jul-15 6-Jul-15 6-Jul-15 6-Jul-15 6-Jul-15 6-Jul-15 +>[41] 6-Jul-15 6-Jul-15 6-Jul-15 6-Jul-15 6-Jul-15 6-Jul-15 6-Jul-15 6-Jul-15 7/8/15 7/8/15 +>[51] 7/8/15 7/8/15 7/8/15 7/8/15 7/8/15 7/8/15 7/8/15 7/8/15 7/8/15 7/8/15 +>[61] 7/8/15 7/8/15 7/8/15 7/8/15 7/8/15 7/8/15 7/8/15 7/8/15 7/8/15 7/8/15 +>[71] 7/8/15 7/8/15 7/8/15 7/8/15 7/8/15 7/8/15 7/8/15 7/8/15 7/8/15 7/8/15 +>[81] 7/8/15 7/8/15 7/8/15 7/8/15 7/8/15 7/8/15 7/8/15 7/8/15 7/8/15 7/8/15 +>[91] 7/8/15 7/8/15 7/8/15 7/8/15 7/8/15 7/8/15 +>Levels: 6-Jul-15 7/8/15 7-Jun-15 > ~~~ {: .output} +Then we use the `table()` function to turn this into a table of counts: -We can remove that value from our vector by overwriting it with this expression: > ~~~ -> snp_genes <- snp_genes[-6] -> snp_genes +> # generate a table from values of the 'prep_date' variable from the data frame +> +> table(submission_metadata$prep_date) > ~~~ {: .language-r} > ~~~ -> [1] "OXTR" "ACTN3" "AR" "OPRM1" "CYP1A1" +>6-Jul-15 7/8/15 7-Jun-15 +> 45 48 3 > ~~~ {: .output} -We can also explicitly rename or add a value to our index using double bracket -notation: - +Finally, we use R's `plot()` function which attemtps to generate a plot from the +data: > ~~~ -> snp_genes[[7]]<- "APOA5" -> snp_genes +> # generate a plot from values of the 'prep_date' variable from the data frame +> +> plot(table(submission_metadata$prep_date)) > ~~~ {: .language-r} -> ~~~ -> [1] "OXTR" "ACTN3" "AR" "OPRM1" "CYP1A1" NA "APOA5" -> ~~~ -{: .output} +rstudio unordered prep plot -Notice in the operation above that R inserts an `NA` value to extend our vector -so that the gene "APOA5" is an index 7. This may be a good or not so good thing -depending on how you use this. 
- -> ## Exercise: Examining and indexing vectors -> Answer the following questions to test your knowledge vectors -> -> Which of the following is true of vectors in R -> -> A) All vectors have a mode or a length -> -> B) All vector have a mode and a length -> -> C) Vectors may have different lengths -> -> D) Items within a vector may be of different modes -> -> E) You can use the `c()` to one or more items to an existing vector -> -> F) You can use the `c()` to add a vector to an exiting vector ->> ->> ## solution ->> A) False - Vectors have both of these properties ->> ->> B) True ->> ->> C) True ->> ->> D) False - Vectors have only one mode (e.g. numeric, character); all items in ->> a vector must be of this mode. ->> ->> E) True ->> ->> F) True ->> -> {: .solution} -{: .challenge} - - -## Logical Indexing - -There is one last set of cool indexing capabilities we want to introduce. It is -possible within R to retrieve items in a vector based on a logical evaluation -or numerical comparison. For example, let's say we wanted get all of the SNPs -in our vector of SNP positons that were greater than 100,000,000. We could -index using the '>' (greater than) logical operator: +While this is a toy example, and there are problems with our prep dates that +need fixing, let's see how you order a factor so that we can fix our plot. +We can take our existing `more_days_of_the_week` factor, and use the `factor()` +function again. This time we will pass it two new arguments: `levels` will be +assigned to a vector that has the days of the week in the order we want them, +and we will set the `ordered` argument to TRUE. > ~~~ -> snp_positions[snp_positions > 100000000] -> ~~~ -{: .language-r} -> ~~~ -> [1] 154039662 -> ~~~ -{: .output} - -As demonstrated above, in the square brackets you place the name of the vector -followed by the comparison operator and (in this numeric case) a numeric value. 
-Some of the most common logical operators you will use in R are: - -|Operator|Description| -|--------|-----------| -|<|less than| -|<=|less than or equal to| -|>|greater than| -|>=|greater than or equal to| -|==|exactly equal to| -|!=|not equal to| -|!x|not x| -|a \| b| a or b| -|a & b| a and b| - -> ## The magic of programming -> ->The reason why the expression `snp_positions[snp_positions > 100000000]` works ->can be better understood if you examine what the expression "snp_positions > 100000000" ->evaluates to: -> ->> ~~~ ->> snp_positions > 100000000 ->> ~~~ ->{: .language-r} ->> ~~~ ->> [1] FALSE FALSE FALSE TRUE ->> ~~~ ->{: .output} -> ->The output above is a logical vector, the 4th element of which is TRUE. When ->you pass a logical vector as an index, R will return the true values: -> ->> ~~~ ->> snp_positions[c(FALSE, FALSE, FALSE, TRUE)] ->> ~~~ ->{: .language-r} ->> ~~~ ->> [1] 154039662 ->> ~~~ ->{: .output} -> -> ->If you have never coded before, this type of situation starts to expose the ->"magic" of programming. We mentioned before that in the bracket indexing ->notation you take your named vector followed by brakets which contain an index: ->**named_vector[index]**. The "magic" is that the index needs to *evaluate to* a ->number. So, even if it does not appear to be an integer (e.g. 1, 2, 3), as long ->as R can evaluate it, we will get a result. That our expression ->`snp_positions[snp_positions > 100000000]` evaluates to a number can be seen ->in the following situtaion. If you wanted to know which **index** (1, 2, 3, or ->4) in our vector of SNP positions was the one that was greater than 100,000,000? 
->We can use the `which()` function to return the indicies of any item that ->evaluates as TRUE in our comparison: ->> ~~~ ->> which(snp_positions > 100000000) ->> ~~~ ->{: .language-r} ->> ~~~ ->> [1] 4 ->> ~~~ ->{: .output} -> **Why is this important?** Often in programming we will not know what inputs -> and values will be used when our code is executed. Rather than put in a -> pre-determined value (e.g 100000000) we can use an object that can take on -> whatever value we need. So for example: -> ->> ~~~ ->> snp_marker_cutoff <- 100000000 ->> snp_positions[snp_positions > snp_marker_cutoff] ->> ~~~ ->{: .language-r} ->> ~~~ ->> [1] 154039662 ->> ~~~ ->{: .output} -> Ultimately, it's putting together flexible, reusable code like this that gets -> at the "magic" of programming! -{: .callout} - -## A few final vector tricks - -Finally, there are a few other common retrieve or replace operations you may -want to know about. First, you can check to see if any of the values of your -vector is an NA value. Missing data will get a more detailed treatment later, -but the `is.NA()` function will return a logical vector, with TRUE for any NA -value: - -> ~~~ -> # current value of 'snp_genes': chr [1:7] "OXTR" "ACTN3" "AR" "OPRM1" "CYP1A1" NA "APOA5" -> -> is.na(snp_genes) +> # order the 'more_days_of_the_week' factor to our desired set of levels +>more_days_of_the_week <- factor(more_days_of_the_week, levels = c("monday", +> "tuesday", +> "wednesday", +> "thursday", +> "friday"), +> ordered = TRUE) > ~~~ {: .language-r} -> ~~~ -> [1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE -> ~~~ -{: .output} - -Sometimes, you may wish to find out if a specific value (or several values) is -in a vector. 
You can do this using the comparison operator `%in%`, which will -return TRUE for any value in your collection of one or more values matches a -value in the vector you are searching: +We can now see the new ordering: > ~~~ -> # current value of 'snp_genes': chr [1:7] "OXTR" "ACTN3" "AR" "OPRM1" "CYP1A1" NA "APOA5" -> # test to see if "ACTN3" or "APO5A" is in the snp_genes vector -> # if you are looking for more than one value, you must pass this as a vector -> -> c("ACTN3","APOA5") %in% snp_genes +> str(more_days_of_the_week) > ~~~ {: .language-r} > ~~~ -> [1] TRUE TRUE +> Ord.factor w/ 5 levels "monday"<"tuesday"<..: 1 2 3 4 5 5 5 > ~~~ {: .output} -> ## Review: Creating and indexing vectors -> Use your knowledge of vectors to accomplish the following tasks: -> -> **1) What mode are the following vectors? Use `typeof()` to check** -> -> a. `snps` -> -> b. `snp_chromosomes` -> -> c. `snp_positions` -> -> **2) Add the following values to the following vectors** -> -> a. To the `snps` vector add: 'rs662799' -> -> b. To the `snp_chromosomes` vector add: 11 -> -> c. To the `snp_positions` vector add: 116792991 -> -> **3) Make the following change to the `snp_genes` vector** -> Hint: Your vector should look like this in the 'Global Enviornment': -> `chr [1:7] "OXTR" "ACTN3" "AR" "OPRM1" "CYP1A1" NA "APOA5"`. If not -> recreate the vector by running this expression: -> `snp_genes <- c("OXTR", "ACTN3", "AR", "OPRM1", "CYP1A1", NA, "APOA5")` -> -> a. Create a new version of `snp_genes` that does not contain CYP1A1 -> -> b. Add 2 NA values to the end of `snp_genes` (hint: final vector should -> have a length of 8) -> -> **4) Create a new vector `combined` that contains:** -> -> - The the 1st value in `snp_genes` -> -> - The 1st value in `snps` +Although not all levels are shown, notice there are `<` signs indicating an +order. 
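The ordering step above can be sketched in a small, self-contained example. The `days` vector below is a hypothetical stand-in for the lesson's `more_days_of_the_week` factor (its exact contents are an assumption), so treat this as an illustration rather than the lesson's own data:

```r
# hypothetical stand-in for the lesson's 'more_days_of_the_week' factor
days <- factor(c("friday", "monday", "wednesday", "tuesday", "thursday"))

# without an explicit order, R sorts the levels alphabetically
levels(days)

# re-create the factor, supplying the level order we want
days_ordered <- factor(days,
                       levels = c("monday", "tuesday", "wednesday",
                                  "thursday", "friday"),
                       ordered = TRUE)

# the levels now follow the order we supplied
levels(days_ordered)

# an ordered factor supports comparisons between its values:
# is the first value ("friday") later than the second ("monday")?
days_ordered[1] > days_ordered[2]

# table() (and therefore plot()) reports counts in level order
table(days_ordered)
```

Because `table()` follows the level order, passing an ordered factor to `plot(table(...))` should draw the bars in the order given to `levels`.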
+ +> ## Exercise: Order and plot `sample_submission` `prep_date` > -> - The 1st value in `snp_chromosomes` +> **Generate a plot of the `prep_date` variable, properly ordered from the `sample_submission` +> data frame** > -> - The 1st value in `snp_positions` +> To choose the ordering, assume that the unambiguous dates for this data are: +> - 7-Jun-15: June 7, 2015 +> - 6-Jul-15: July 6, 2015 +> - 7/8/15: July 8, 2015 > -> **Check the mode of `combined` using `typeof()` +> *hint* you can use the `factor()` function inside of your `table()` and `plot()` +> function calls. > +> *hint* build this single line of code from the inside out! >> ## solution +>>plot(table(factor(submission_metadata$prep_date, levels = c("7-Jun-15", +>> "6-Jul-15", +>> "7/8/15"), +>> ordered = TRUE))) >> ->> **1) What mode are the following vectors? Use `typeof()` to check** ->> ->> a. `typeof(snps)` # "character" ->> ->> b. `typeof(snp_chromosomes)` # "character" ->> ->> c. `typeof(snp_positions)` # "double" - which is also a numeric type ->> ->> ->> **2) Add the following values to the following vectors** ->> ->> a. `snps <- c(snps, 'rs662799')` ->> ->> b. `snp_chromosomes <- c(snp_chromosomes, "11")` # did you use quotes? ->> ->> c. `snp_positions <- c(snp_positions, 116792991)` ->> ->> **3) Make the following change to the `snp_genes` vector** ->> ->> a. `snp_genes <- snp_genes[-5]` or `snp_genes <- snp_genes[c(1,2,3,4,6,7)]`, etc. ->> ->> b. `snp_genes <- c(snp_genes, NA, NA)` or `snp_genes[[8]] <- NA`, etc. ->> ->> ->> **4) Create a new vector `combined` that contains:** ->> ->> - The the 1st value in `snp_genes` ->> ->> - The 1st value in `snps` ->> ->> - The 1st value in `snp_chromosomes` ->> ->> - The 1st value in `snp_positions` ->> ->> ->> `combined <- c(snp_genes[1], snps[1], snp_chromosomes[1], snp_positions[1])` ->> ->> `typeof(combined)` # "character" - Do you know why this is?
>> +>> rstudio ordered prep plot > {: .solution} {: .challenge} -## Bonus material: Lists - -Lists are quite useful in R, but we won't be using them in the genomics lessons. -That said, you may come across lists in the way that some bioinformatics -programs may store and/or return data to you. One of the key attributes of a list -is that unlike a vector, a list may contain data of more than one mode. Learn -more about creating and using lists using this [nice tutorial](http://r4ds.had.co.nz/lists.html). -In this one example, we will create a named list and show you how to retreive -items from the list. - - -> ~~~ -> # Create a named list using the 'list' function and our SNP examples -> # Note, for easy reading we have place each item in the list on a separate line -> # Nothing special about this, you can do this for any multiline commands -> # To run this command, make sure the entire command (all 4 lines) are highlited -> # before running -> ->snp_data <- list(genes = snp_genes, -> refference_snp = snps, -> chromosome = snp_chromosomes, -> position = snp_positions) -> -> # Examine the structure of the list ->str(snp_data) -> ~~~ -{: .language-r} -> ~~~ ->List of 4 -> $ genes : chr [1:8] "OXTR" "ACTN3" "AR" "OPRM1" ... -> $ refference_snp: chr [1:5] "rs53576" "rs1815739" "rs6152" "rs1799971" ... 
-> $ chromosome : chr [1:4] "3" "11" "X" "6" -> $ position : num [1:4] 8.76e+06 6.66e+07 6.75e+07 1.54e+08 -> ~~~ -{: .output} - -To get all of the values for the `position` object in the list we use the `$` notation: - -> ~~~ -> # return all the values of position object -> -> snp_data$position -> ~~~ -{: .language-r} -> ~~~ -> [1] 8762685 66560624 67545785 154039662 -> ~~~ -{: .output} - -To get the first value in the `position` object, use `[]` notation to index: - -> ~~~ -> # return first value of the position object -> -> snp_data$position[1] -> ~~~ -{: .language-r} -> ~~~ -> [1] 8762685 -> ~~~ -{: .output} From e0cbde71aefb69b70a574537d0f77bf973d7b506 Mon Sep 17 00:00:00 2001 From: Jason Williams Date: Fri, 11 May 2018 23:16:57 -0400 Subject: [PATCH 13/19] get up to coercing factors --- episodes/03-basics-factors-dataframes.md | 90 ++++++++++++++++++++++++ 1 file changed, 90 insertions(+) diff --git a/episodes/03-basics-factors-dataframes.md b/episodes/03-basics-factors-dataframes.md index 54f8494e..a6bc4554 100644 --- a/episodes/03-basics-factors-dataframes.md +++ b/episodes/03-basics-factors-dataframes.md @@ -426,6 +426,96 @@ order. {: .challenge} +## Indexing and data frames + +Next, we are going to talk about how you can get specific values from data frames, and where necessary, change the mode of a column of values. + +The first thing to remember is that a data frame is two-dimensional (rows and +columns). Therefore, to select a specific value we will once again use +`[]` notation, but we will specify more than one value (except in some cases +where we are taking a range). + +> ## Exercise: Indexing a data frame +> +> **Try the following indices and functions and try to figure out what they return** +> +> a. `submission_metadata[1,1]` +> +> b. `submission_metadata[2,4]` +> +> c. `submission_metadata[96,10]` +> +> d. `submission_metadata[2, ]` +> +> e. `submission_metadata[-1, ]` +> +> f. `submission_metadata[1:4,1]` +> +> g.
`submission_metadata[1:10,c("client_sample_id","RIN")]` +> +> h. `submission_metadata[,c("RIN")]` +> +> i. `head(submission_metadata)` +> +> j. `tail(submission_metadata)` +> +> k. `submission_metadata$prep_date` +> +> l. `submission_metadata[submission_metadata$RIN >= 9.0,]` +> +>> ## solution +>> +>> a. `submission_metadata[1,1]` # 1st row, 1st column +>> +>> b. `submission_metadata[2,4]` # 2nd row, 4th column +>> +>> c. `submission_metadata[96,10]` # 96th row, 10th column +>> +>> d. `submission_metadata[2, ]` # The entire 2nd row +>> +>> e. `submission_metadata[-1, ]` # The entire data frame except the 1st row +>> +>> f. `submission_metadata[1:4,1]` # rows 1-4, column 1 +>> +>> g. `submission_metadata[1:10,c("client_sample_id","RIN")]` # rows 1-10, columns 'client_sample_id' and 'RIN' +>> +>> h. `submission_metadata[,c("RIN")]` # all rows, column 'RIN' +>> +>> i. `head(submission_metadata)` # first 6 rows of the data frame +>> +>> j. `tail(submission_metadata)` # last 6 rows of the data frame +>> +>> k. `submission_metadata$prep_date` # "prep_date" column, all rows +>> +>> l. `submission_metadata[submission_metadata$RIN >= 9.0,]` # all rows where the value of "RIN" column is greater-than-or-equal-to 9.0 +> {: .solution} +{: .challenge} + +Essentially, the indexing notation is very similar to what we learned for +vectors.
The key differences include: + +- Typically provide two values separated by commas: dataframe[row, column] +- In cases where you are taking a continuous range of numbers use a colon + between the numbers (start:stop, inclusive) +- For a non continuous set of numbers, pass a vector using `c()` +- Index using the name of a column(s) by passing them as vectors using `c()` + +## Coercing values in data frames + +> ## Tip: coercion isn't limited to data frames +> +> While we are going to address coercion in the context of data frames, +> most of these methods apply to other data structures, such as vectors. +{: .callout} + +Sometimes, it is possible that R will misinterpret the type of data represented +in a data frame, or store that data in a mode which prevents you from +operating on the data the way you wish. + + + + + --- From 586a5f8657d478da7004d3f33b360727100d0899 Mon Sep 17 00:00:00 2001 From: Jason Williams Date: Sun, 13 May 2018 20:56:37 -0400 Subject: [PATCH 14/19] complete coercion section --- episodes/02-r-basics.md | 28 +-- episodes/03-basics-factors-dataframes.md | 209 +++++++++++++++++++++-- 2 files changed, 212 insertions(+), 25 deletions(-) diff --git a/episodes/02-r-basics.md b/episodes/02-r-basics.md index 58b360e7..6240de4e 100644 --- a/episodes/02-r-basics.md +++ b/episodes/02-r-basics.md @@ -10,8 +10,8 @@ objectives: - "Be able to create the most common R objects including vectors" - "Understand that vectors have modes, which correspond to the type of data they contain" - "Be able to use arithmetic operators on R objects" -- "Be able to retrieve (index), name, or replace, values from a vector" -- "Be able to use logical operators in an indexing operation" +- "Be able to retrieve (subset), name, or replace, values from a vector" +- "Be able to use logical operators in a subsetting operation" - "Understand that lists can hold data of more than one mode and can be indexed" keypoints: - "Effectively using R is a journey of months or years.
Still you don't have to @@ -398,7 +398,7 @@ Vectors are quite important in R, mostly for us because data frames are essentially collections of vectors (more on this later). What we learn about manipulating vectors now will pay of even more when we get to data frames. -## More on creating and indexing vectors +## More on creating and subsetting vectors Let's create a few more vectors to play around with: @@ -458,9 +458,9 @@ positions you wish to retrieve. > ~~~ {: .output} -There are additional (and perhaps less commonly used) ways of indexing a vector +There are additional (and perhaps less commonly used) ways of subsetting a vector (see [these examples](https://thomasleeper.com/Rcourse/Tutorials/vectorindexing.html)). -Also, several of these indexing expressions can be combined: +Also, several of these subsetting expressions can be combined: > ~~~ > # get the 1st through the 3rd value, and 4th value in the snp_genes vector @@ -538,7 +538,7 @@ Notice in the operation above that R inserts an `NA` value to extend our vector so that the gene "APOA5" is an index 7. This may be a good or not so good thing depending on how you use this. -> ## Exercise: Examining and indexing vectors +> ## Exercise: Examining and subsetting vectors > Answer the following questions to test your knowledge vectors > > Which of the following is true of vectors in R @@ -573,12 +573,12 @@ depending on how you use this. {: .challenge} -## Logical Indexing +## Logical Subsetting -There is one last set of cool indexing capabilities we want to introduce. It is +There is one last set of cool subsetting capabilities we want to introduce. It is possible within R to retrieve items in a vector based on a logical evaluation or numerical comparison. For example, let's say we wanted get all of the SNPs -in our vector of SNP positons that were greater than 100,000,000. We could +in our vector of SNP positions that were greater than 100,000,000. 
We could index using the '>' (greater than) logical operator: > ~~~ @@ -635,8 +635,8 @@ Some of the most common logical operators you will use in R are: > > >If you have never coded before, this type of situation starts to expose the ->"magic" of programming. We mentioned before that in the bracket indexing ->notation you take your named vector followed by brakets which contain an index: +>"magic" of programming. We mentioned before that in the bracket +>notation you take your named vector followed by brackets which contain an index: >**named_vector[index]**. The "magic" is that the index needs to *evaluate to* a >number. So, even if it does not appear to be an integer (e.g. 1, 2, 3), as long >as R can evaluate it, we will get a result. That our expression @@ -708,7 +708,7 @@ value in the vector you are searching: > ~~~ {: .output} -> ## Review: Creating and indexing vectors +> ## Review: Creating and subsetting vectors > Use your knowledge of vectors to accomplish the following tasks: > > **1) What mode are the following vectors? Use `typeof()` to check** @@ -801,7 +801,7 @@ That said, you may come across lists in the way that some bioinformatics programs may store and/or return data to you. One of the key attributes of a list is that unlike a vector, a list may contain data of more than one mode. Learn more about creating and using lists using this [nice tutorial](http://r4ds.had.co.nz/lists.html). -In this one example, we will create a named list and show you how to retreive +In this one example, we will create a named list and show you how to retrieve items from the list. @@ -809,7 +809,7 @@ items from the list. 
> # Create a named list using the 'list' function and our SNP examples > # Note, for easy reading we have place each item in the list on a separate line > # Nothing special about this, you can do this for any multiline commands -> # To run this command, make sure the entire command (all 4 lines) are highlited +> # To run this command, make sure the entire command (all 4 lines) are highlighted > # before running > >snp_data <- list(genes = snp_genes, diff --git a/episodes/03-basics-factors-dataframes.md b/episodes/03-basics-factors-dataframes.md index a6bc4554..f757deda 100644 --- a/episodes/03-basics-factors-dataframes.md +++ b/episodes/03-basics-factors-dataframes.md @@ -14,14 +14,14 @@ objectives: - "Be able to retrieve values (index) from a data frame" - "Understand how R may converse data into different modes" - "Be able to convert the mode of an object" -- "Understand that R uses factors to store and manipulate catagorical data" -- "Be able to manipulate a factor, including indexing and reordering" -- "Be able to apply an arithmetic function to a dataframe" -- "Be able to coerce the class of an object (including variables in a dataframe)" -- "Be able to save a dataframe as a delimited file" +- "Understand that R uses factors to store and manipulate categorical data" +- "Be able to manipulate a factor, including subsetting and reordering" +- "Be able to apply an arithmetic function to a data frame" +- "Be able to coerce the class of an object (including variables in a data frame)" +- "Be able to save a data frame as a delimited file" keypoints: - "It is easy to import data into R from tabular formats including Excel. - However, you still need to check that R has imported and interprited your + However, you still need to check that R has imported and interpreted your data correctly" - "There are best practices for organizing your data (keeping it tidy) and R is great for this" @@ -426,7 +426,7 @@ order. 
{: .challenge} -## Indexing and data frames +## Subsetting data frames Next, we are going to talk about how you can get specific values from data frames, and where necessary, change the mode of a column of values. @@ -435,7 +435,7 @@ columns). Therefore, to select a specific value we will will once again use `[]` notation, but we will specify more than one value (except in some cases where we are taking a range). -> ## Exercise: Indexing a data frame +> ## Exercise: Subsetting a data frame > > **Try the following indices and functions and try to figure out what they return** > @@ -491,15 +491,55 @@ where we are taking a range). > {: .solution} {: .challenge} -Essentially, the indexing notation is very similar to what we learned for +Essentially, the subsetting notation is very similar to what we learned for vectors. The key differences include: -- Typically provide two values separated by commas: dataframe[row, column] +- Typically provide two values separated by commas: data.frame[row, column] - In cases where you are taking a continuous range of numbers use a colon between the numbers (start:stop, inclusive) - For a non continuous set of numbers, pass a vector using `c()` - Index using the name of a column(s) by passing them as vectors using `c()` +Finally, in all of the subsetting exercises above, we simply printed values to +the screen. Remember that you can create a new data frame object by assigning +them to a new object name: + +> ~~~ +> #subset submission_metadata to a new data frame with RIN >= 8 +> +>high_quality_rna <- submission_metadata[submission_metadata$RIN >= 8,] +> +> #check the dimension of the data frame +> +>dim(high_quality_rna) +> +> #get a summary of the data frame +> +> summary(high_quality_rna) +> ~~~ +{: .language-r} +> ~~~ +> [1] 86 10 +> +>well_position tube_barcode plate_barcode client_sample_id replicate Volume..µL. +>A1 : 1 Min. :151017990 LP-10624:86 k255M_1h-2 : 3 a: 1 Min. 
: 0.50 +>A10 : 1 1st Qu.:152080214 k255N_1h-1 : 3 A:26 1st Qu.: 57.50 +>A11 : 1 Median :153366715 k255N_1h-11: 3 b: 1 Median : 59.70 +>A12 : 1 Mean :153266703 k255N_1h-12: 3 B:28 Mean : 65.94 +>A2 : 1 3rd Qu.:154489518 k255N_1h-13: 3 c: 1 3rd Qu.: 62.50 +>A3 : 1 Max. :155537812 k255N_1h-14: 3 C:29 Max. :630.10 +>(Other):80 (Other) :68 +>concentration..ng.µL. RIN prep_date ship_date +>Min. :157.7 Min. :8.00 6-Jul-15:42 20-Jul:86 +>1st Qu.:186.2 1st Qu.:8.30 7/8/15 :42 +>Median :197.5 Median :8.60 7-Jun-15: 2 +>Mean :197.9 Mean :8.61 +>3rd Qu.:211.0 3rd Qu.:8.90 +>Max. :237.1 Max. :9.60 +> ~~~ +{: .output} + + ## Coercing values in data frames > ## Tip: coercion isn't limited to data frames @@ -510,11 +550,158 @@ vectors. The key differences include: Sometimes, it is possible that R will misinterpret the type of data represented in a data frame, or store that data in a mode which prevents you from -operating on the data the way you wish. +operating on the data the way you wish. For example, a long list of gene names +isn't usually thought of as a categorical variable, the way that your +experimental condition (e.g. control, treatment) might be. More importantly, +some R packages you use to analyze your data may expect characters as input, +not factors. At other times (such as plotting or some statistical analyses) a +factor may be more appropriate. Ultimately, you should know how to change the +mode of an object. + +First, it's very important to recognize that coercion happens in R all the time. +This can be a good thing when R gets it right, or a bad thing when the result +is not what you expect. Consider: + +> ~~~ +> snp_chromosomes <- c('3', '11', 'X', '6') +> typeof(snp_chromosomes) +> ~~~ +{: .language-r} +> ~~~ +> [1] "character" +> ~~~ +{: .output} + +Although there are several numbers in our vector, they are all in quotes, so +we have explicitly told R to consider them characters.
Even if we removed the +quotes from the numbers, R would coerce everything into a character: + +> ~~~ +> snp_chromosomes_2 <- c(3, 11, 'X', 6) +> typeof(snp_chromosomes_2) +> snp_chromosomes_2[1] +> ~~~ +{: .language-r} +> ~~~ +> [1] "character" +> [1] "3" +> ~~~ +{: .output} + +We can use some of the `as.` functions to explicitly coerce values from one +form into another. Consider the following vector of characters, which all happen to be valid numbers: + +> ~~~ +> snp_positions_2 <- c("8762685", "66560624", "67545785", "154039662") +> typeof(snp_positions_2) +> snp_positions_2[1] +> ~~~ +{: .language-r} +> ~~~ +> [1] "character" +> [1] "8762685" +> ~~~ +{: .output} + +Now we can coerce `snp_positions_2` into a numeric type using `as.numeric()`: +> ~~~ +> snp_positions_2 <- as.numeric(snp_positions_2) +> typeof(snp_positions_2) +> snp_positions_2[1] +> ~~~ +{: .language-r} +> ~~~ +> [1] "double" +> [1] 8762685 +> ~~~ +{: .output} + +Sometimes coercion is straightforward, but what would happen if we tried +using `as.numeric()` on `snp_chromosomes_2`? + +> ~~~ +> snp_chromosomes_2 <- as.numeric(snp_chromosomes_2) +> ~~~ +{: .language-r} +> ~~~ +> Warning message: +> NAs introduced by coercion +> ~~~ +{: .error} +If we check, we will see that an `NA` value (R's default value for missing +data) has been introduced. + +> ~~~ +> snp_chromosomes_2 +> ~~~ +{: .language-r} +> ~~~ +> [1] 3 11 NA 6 +> ~~~ +{: .output} +Trouble can really start when we try to coerce a factor. For example, when we +try to coerce the `replicate` column in our data frame into a numeric mode, +look at the result: + +> ~~~ +> as.numeric(submission_metadata$replicate) +> ~~~ +{: .language-r} +> ~~~ +> [1] 1 3 5 2 4 6 2 4 6 2 4 6 2 4 6 2 4 6 2 4 6 2 4 6 2 4 6 2 4 6 2 4 6 2 4 6 +> [37] 2 4 6 2 4 6 2 4 6 2 4 6 2 4 6 2 4 6 2 4 6 2 4 6 2 4 6 2 4 6 2 4 6 2 4 6 +> [73] 2 4 6 2 4 6 2 4 6 2 4 6 2 4 6 2 4 6 2 4 6 2 4 6 2 4 6 2 4 6 2 4 6 2 4 6 +> ~~~ +{: .output} + +Strangely, it works! Almost.
Instead of giving an error message, R returns +numeric values, which in this case are the integers assigned to the levels in +this factor. This kind of behavior can lead to hard-to-find bugs, for example +when we do have numbers in a factor, and we get numbers from a coercion. If +we don't look carefully, we may not notice a problem. + +If you need to coerce an entire column, you can overwrite it using an expression +like this one: + +> ~~~ +> # make the 'well_position' column a character type column +> +>submission_metadata$well_position <- as.character(submission_metadata$well_position) +> +> # check the type of the column +> +>typeof(submission_metadata$well_position) +> ~~~ +{: .language-r} +> ~~~ +> [1] "character" +> ~~~ +{: .output} +## stringsAsFactors = FALSE + +Let's summarize this section on coercion with a few take-home messages. + +- When you explicitly coerce one data type into another (this is known as + **explicit coercion**), be careful to check the result. Ideally, you should try to see if it's possible to avoid steps in your analysis that force you to + coerce. +- R will sometimes coerce without you asking for it. This is called + (appropriately) **implicit coercion**. For example, when we tried to create + a vector with multiple data types, R chose one type through implicit + coercion. +- Check the structure (`str()`) of your data frames before working with them! + +Regarding the first bullet point, one way to avoid needless coercion when +importing a data frame using any one of the `read.table()` functions such as +`read.csv()` is to set the argument `stringsAsFactors` to FALSE. By default +(in versions of R before 4.0.0), this argument is TRUE. Setting it to FALSE will treat any non-numeric column as +a character type. In the `read.csv()` documentation, you will also see you can +explicitly type your columns using the `colClasses` argument.
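As a minimal sketch of these import options (the file contents and the object names `csv_file`, `snp_table`, and `snp_table_typed` are made up for illustration and are not part of the lesson data), the following writes a tiny CSV file and reads it back twice: once with `stringsAsFactors = FALSE`, and once with explicit `colClasses`:

```r
# write a small, made-up CSV file to a temporary location
csv_file <- tempfile(fileext = ".csv")
writeLines(c("gene,chromosome,position",
             "OXTR,3,8762685",
             "ACTN3,11,66560624",
             "AR,X,67545785"),
           csv_file)

# read the file without converting text columns to factors;
# 'chromosome' stays text because of the "X" value
snp_table <- read.csv(csv_file, stringsAsFactors = FALSE)
str(snp_table)

# alternatively, state the type of every column explicitly
snp_table_typed <- read.csv(csv_file,
                            colClasses = c("character", "character", "numeric"))
str(snp_table_typed)
```

Running `str()` on each result is a quick way to confirm the column types before doing any analysis.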
Other R packages +(such as the Tidyverse "readr") don't have this particular conversion issue, +but many packages will still try to guess a data type. From 9892c14314a74b0bbb7ff6c5e3ffc15984a43a5b Mon Sep 17 00:00:00 2001 From: Jason Williams Date: Sun, 13 May 2018 21:26:17 -0400 Subject: [PATCH 15/19] add some bonus material --- episodes/03-basics-factors-dataframes.md | 74 ++++++++++++++++++++++++ 1 file changed, 74 insertions(+) diff --git a/episodes/03-basics-factors-dataframes.md b/episodes/03-basics-factors-dataframes.md index f757deda..2404fe74 100644 --- a/episodes/03-basics-factors-dataframes.md +++ b/episodes/03-basics-factors-dataframes.md @@ -703,6 +703,80 @@ explicitly type your columns using the `colClasses` argument. Other R packages (such as the Tidyverse "readr") don't have this particular conversion issue, but many packages will still try to guess a data type. +## Data frame bonus material: math, sorting, renaming +Here are a few operations that don't need much explanation, but which are good +to know. + +There are lots of arithmetic functions you may want to apply to your data +frame, and covering those would be a course in itself (there is some starting +material [here](https://swcarpentry.github.io/r-novice-inflammation/15-supp-loops-in-depth/)). Our lessons will cover some additional summary statistical functions in +a subsequent lesson, but overall we will focus on data cleaning and +visualization. + +As you might expect, you can use functions like `mean()`, `min()`, `max()` on an +individual column: + +> ~~~ +> mean(submission_metadata$RIN) +> ~~~ +{: .language-r} +> ~~~ +> [1] 8.473958 +> ~~~ +{: .output} + +You can do math and save the result in a new column: + +> ~~~ +> submission_metadata$vol_in_L <- submission_metadata$Volume..µL.
/1000000 +> head(submission_metadata$vol_in_L) +> ~~~ +{: .language-r} +> ~~~ +> [1] 6.42e-05 6.37e-05 6.02e-05 5.58e-05 6.08e-05 5.75e-05 +> ~~~ +{: .output} + +You can sort a data frame using the `order()` function: + +> ~~~ +>sorted_by_replicate <- submission_metadata[order(submission_metadata$replicate), ] +>head(sorted_by_replicate$replicate) +> ~~~ +{: .language-r} +> ~~~ +>[1] a A A A A A +>Levels: a A b B c C +> ~~~ +{: .output} + +You can selectively replace values in a data frame based on their value: + +> ~~~ +> sorted_by_replicate$replicate[sorted_by_replicate$replicate == "a"] <- "A" +>head(sorted_by_replicate$replicate) +> ~~~ +{: .language-r} +> ~~~ +>[1] A A A A A A +>Levels: a A b B c C +> ~~~ +{: .output} + +You can rename columns: + +> ~~~ +> colnames(submission_metadata)[colnames(submission_metadata) == "Volume..µL."] <- "vol_in_µL" +> +>#check the column name (hint: names are returned as a vector) +> +> colnames(submission_metadata)[6] +> ~~~ +{: .language-r} +> ~~~ +>[1] "vol_in_µL" +> ~~~ +{: .output} --- From 7e040e331c5e8303adf18ce2e2cfbedd458ba47d Mon Sep 17 00:00:00 2001 From: Jason Williams Date: Tue, 15 May 2018 11:12:21 -0400 Subject: [PATCH 16/19] finish episode 3 and reorganize --- episodes/01-introduction.md | 175 +--------------------- episodes/03-basics-factors-dataframes.md | 150 +++++++++++++++++-- episodes/99-r-help.md | 183 +++++++++++++++++++++++ 3 files changed, 324 insertions(+), 184 deletions(-) create mode 100644 episodes/99-r-help.md diff --git a/episodes/01-introduction.md b/episodes/01-introduction.md index 518c6b5a..65ac3bb4 100644 --- a/episodes/01-introduction.md +++ b/episodes/01-introduction.md @@ -5,7 +5,6 @@ exercises: 15 questions: - "Why use R?" - "Why use RStudio and how does it differ from R?" -- "How do I get help using R and RStudio?"
objectives: - "Know advantages of analyzing data in R" - "Know advantages of using RStudio" @@ -17,9 +16,6 @@ objectives: - "Compose an R script file with comments and saved commands" - "Be able to define what an R function is" - "Locate help for an R function using `?`, `??`, and `args()`" -- "Check the version of R" -- "Be able to ask effective questions when searching for help on forums or using web - searches" keypoints: - "R is a powerful, popular open-source scripting language" @@ -27,10 +23,6 @@ keypoints: it easy to find help" - "You can customize the layout of RStudio, and use the project feature to manage the files and packages used in your analysis" -- "R provides thousands of functions for analyzing data, and provides several - way to get help" -- "Using R will mean searching for online help, and there are tips and - resources on how to search effectively" --- @@ -192,7 +184,7 @@ environment: , and what R looks like if you were to run it at the command line without RStudio. You can work interactively (i.e. enter R commands here), but for the most part, we will run a script, or lines in a script and watch their - execution and output here. The "Terminal" tab give you access to the BASH + execution and output here. The "Terminal" tab give you access to the BASH terminal. - **Environment/History**: Here, RStudio will show you what datasets and variables you have created, and which are actively defined/in memory. You can @@ -500,170 +492,5 @@ will remind you of arguments and provide additional help. rstudio default session ---- - -## Getting help with R - -rstudio default session - -Finally, no matter how much experience you have with R, you will find yourself -needing help. There is no shame in researching how to do something in R, and -most people will find themselves looking up how to do the same things that -they "should know how to do" over and over again. Here are some tips to make -this process as helpful and efficient as possible. 
- -> "Never memorize something that you can look up" -> - A. Einstein - -## Finding help on Stackoverflow and Biostars - -Two popular websites will be of great help with many R problems. For **general** -**R questions**, [Stack Overflow](https://stackoverflow.com/) is probably the most -popular online community for developers. If you start your question "How to do X -in R" results from Stack Overflow are usually near the top of the list. For -**bioinformatics specific questions**, [Biostars](https://www.biostars.org/) is -a popular online forum. - ->## Tip: Asking for help using online forums: -> -> - When searching for R help, look for answers with the [r](https://stackoverflow.com/questions/tagged/r) tag. -> - Get an account; not required to view answers but to required to post -> - Put in effort to check thoroughly before you post a question; folks get -> annoyed if you ask a very common question that has been answered multiple -> times -> - Be careful. While forums are very helpful, you can't know for sure if the -> advice you are getting is correct -> - See the [How to ask for R help](http://blog.revolutionanalytics.com/2014/01/how-to-ask-for-r-help.html) -> blog post for more useful tips -> -{: .callout} - -## Help people help you - -Often, in order to duplicate the issue you are having, someone may need to see -the data you are working with or verify the versions of R or R packages you -are using. The following R functions will help with this: - -You can **check the version of R** you are working with using the `sessionInfo()` -function. Actually, it is good to save this information as part of your notes -on any analysis you are doing. When you run the same script that has worked fine -a dozen times before, looking back at these notes will remind you that you -upgraded R and forget to check your script. 
- - -> ~~~ -> sessionInfo() -> ~~~ -{: .language-r} - -> ~~~ -> R version 3.2.3 (2015-12-10) -> Platform: x86_64-pc-linux-gnu (64-bit) -> Running under: Ubuntu 14.04.3 LTS -> -> locale: -> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 -> [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 -> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C -> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C -> -> attached base packages: -> [1] stats graphics grDevices utils datasets methods base -> -> loaded via a namespace (and not attached): -> [1] tools_3.2.3 packrat_0.4.9-1 -> ~~~ -{: .output} - -Many times, there may be some issues with your data and the way it is formatted. -In that case, you may want to share that data with someone else. However, you -may not need to share the whole dataset; looking at a subset of your 50,000 row, -10,000 column dataframe may be TMI (too much information)! You can take an -object you have in memory such as dataframe (if you don't know what this means -yet, we will get to it!) and save it to a file. In our example we will use the -`dput()` function on the `iris` dataframe which is an example dataset that is -installed in R: - - -> ~~~ -> dput(head(iris)) # iris is an example data.frame that comes with R -> # the `head()` function just takes the first 6 lines of the iris dataset -> ~~~ -{: .language-r} - -This generates some output (below) which you will be better able to interpret -after covering the other R lessons. This info would be helpful in understanding -how the data is formatted and possibly revealing problematic issues. 
- -> ~~~ -> structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4), -> Sepal.Width = c(3.5, 3, 3.2, 3.1, 3.6, 3.9), Petal.Length = c(1.4, -> 1.4, 1.3, 1.5, 1.4, 1.7), Petal.Width = c(0.2, 0.2, 0.2, -> 0.2, 0.2, 0.4), Species = structure(c(1L, 1L, 1L, 1L, 1L, -> 1L), .Label = c("setosa", "versicolor", "virginica"), class = "factor")), .Names = c("Sepal.Length", -> "Sepal.Width", "Petal.Length", "Petal.Width", "Species"), row.names = c(NA, -> 6L), class = "data.frame") -> ~~~ -{: .output} - -Alternatively, you can also save objects in R memory to a file by specifying -the name of the object, in this case the `iris` data frame, and passing a -filename to the `file=` argument. - -> ~~~ -> saveRDS(iris, file="iris.rds") # By convention, we use the .rds file extension -> ~~~ -{: .language-r} - ---- - -## Final FAQs on R - -Finally, here are a few pieces of introductory R knowledge that are too good to -pass up. While we won't return to them in this course, we put them here because -they come up commonly: - -**Do I need to click Run every time I want to run a script?** - -- No. In fact, the most common shortcut key allows you to run a command (or - any lines of the script that are highlighted): - - Windows execution shortcut: Ctrl+Enter - - Mac execution shortcut: Cmd(⌘)+Enter - - To see a complete list of shortcuts, click on the Tools menu and - select Keyboard Shortcuts Help - -**What's with the brackets in R console output?** -- R returns an index with your result. 
When your result contains multiple values, - the number tells you what ordinal number begins the line, for example: - -> ~~~ -> 1:101 # generates the sequence of numbers from 1 to 101 -> ~~~ -{: .language-r} - -In the output below, `[81]` indicates that the first value on that line is the -81st item in your result - -> ~~~ -> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 -> [21] 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 -> [41] 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 -> [61] 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 -> [81] 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 -> [101] 101 -> ~~~ -{: .output} - - -**Can I run my R script without RStudio?** - -- Yes, remember - RStudio is running R. You get to use lots of the enhancements - RStudio provides, but R works independent of RStudio. See [these tips](https://support.rstudio.com/hc/en-us/articles/218012917-How-to-run-R-scripts-from-the-command-line) - for running your commands at the command line - - -**Where else can I learn about RStudio?** -- Check out the Help menu, especially "Cheatsheets" section --- diff --git a/episodes/03-basics-factors-dataframes.md b/episodes/03-basics-factors-dataframes.md index 2404fe74..c31c1a8b 100644 --- a/episodes/03-basics-factors-dataframes.md +++ b/episodes/03-basics-factors-dataframes.md @@ -1,7 +1,7 @@ --- title: "R Basics continued - factors and data frames" teaching: 60 -exercises: 20 +exercises: 30 questions: - "How do I get started with tabular data (e.g. spreadsheets) in R?" - "What are some best practices for reading data into R?" 
@@ -11,13 +11,14 @@ objectives: - "Be able to load a tabular dataset using base R functions" - "Be able to determine the structure of a data frame including its dimensions and the datatypes of variables" -- "Be able to retrieve values (index) from a data frame" +- "Be able to subset/retrieve values from a data frame" - "Understand how R may converse data into different modes" - "Be able to convert the mode of an object" - "Understand that R uses factors to store and manipulate categorical data" - "Be able to manipulate a factor, including subsetting and reordering" - "Be able to apply an arithmetic function to a data frame" - "Be able to coerce the class of an object (including variables in a data frame)" +- "Be able to import data from Excel" - "Be able to save a data frame as a delimited file" keypoints: - "It is easy to import data into R from tabular formats including Excel. @@ -40,13 +41,13 @@ first set of example data: **1) Keep raw data separate from analyzed data** This is principle number one because if you can't tell which files are the -original raw data, you risk making some serious mistakes (e.g. drawing conculsion +original raw data, you risk making some serious mistakes (e.g. drawing conclusion from data which have been manipulated in some unknown way). -**2) Keep speadsheet data Tidy** +**2) Keep spreadsheet data Tidy** The simplest principle of **Tidy data** is that we have one row in our -spreadsheet for each observation or sample, and one colum for every variable +spreadsheet for each observation or sample, and one column for every variable that we measure or report on. As simple as this sounds, it's very easily violated. Most data scintists agree that significant amounts of their time is spent tidying data for analysis. Read more about data organization in @@ -67,7 +68,7 @@ in your analysis, and its reproducibility. ## Importing tabular data into R There are several ways to import data into R. 
For our purpose here, we will -focus on using the tools every R installtion comes with (so called "base" R) to +focus on using the tools every R installation comes with (so called "base" R) to import a comma-delimited file, a sequencing sample submission sheet. We will First, we need to load the sheet using a function called `read.csv()`. @@ -83,7 +84,7 @@ First, we need to load the sheet using a function called `read.csv()`. > > A) What is the default parameter for 'header' in the `read.csv()` function? > -> B) What argument would you have to change to read a file that was delimeted +> B) What argument would you have to change to read a file that was delimited > by semicolons (;) rather than commas? > > C) What argument would you have to change to read file in which numbers @@ -94,13 +95,13 @@ First, we need to load the sheet using a function called `read.csv()`. > >> ## solution >> ->> A) The `read.csv()` function has the argument 'header' set to TRUE by deault, +>> A) The `read.csv()` function has the argument 'header' set to TRUE by default, >> this means the function always assumes the first row is header information, >> (i.e. column names) >> >> B) The `read.csv()` function has the argument 'sep' set to ",". This means >> the function assumes commas are used as delimiters, as you would expect. ->> Changing this parameter (e.g. `sep=";"`) would now interprit semicolons as +>> Changing this parameter (e.g. `sep=";"`) would now interpret semicolons as >> delimiters. >> >> C) Although it is not listed in the `read.csv()` usage, `read.csv()` is @@ -706,7 +707,7 @@ but many packages will still try to guess a data type. ## Data frame bonus material: math, sorting, renaming Here are a few operations that don't need much explanation, but which are good -to know. +to know. 
There are lots of arithmetic functions you may want to apply to your data frame, an covering those would be a course in itself (there is some starting @@ -779,4 +780,133 @@ You can rename columns: > ~~~ {: .output} +## Importing data from Excel + +Excel is one of the most common formats, so we need to discuss how to make +these files play nicely with R. The simplest way to import data from Excel is +to **save your Excel file in .csv format***. You can then import into R right +away. Sometimes you may not be able to do this (imagine you have data in 300 +Excel files, are you going to open and export all of them?). + +One common R package (a set of code with features you can download and add to +your R installation) is the [readxl package](https://CRAN.R-project.org/package=readxl) which can open and import Excel +files. Rather than addressing package installation this second, we can take +advantage of RStudio's import feature which integrates this package. + + + +First, in the RStudio menu go to **File**, select **Import Dataset**, and +choose **From Excel...** (notice there are several other options you can +explore). + +rstudio import menu + +Next, under **File/Url:** click the Browse button and navigate to the **Ecoli_metadata.xlsx** file located at `/home/dcuser/dc_sample_data/R`. +You should now see a preview of the data to be imported: + +rstudio import screen + +Notice that you have the option to change the data type of each variable by +clicking arrow (drop-down menu) next to each column title. Under **Import +Options** you may also rename the data, choose a different sheet to import, and +choose how you will handle headers and skipped rows. Under **Code Preview** you +can see the code that will be used to import this file. We could have written +this code and imported the Excel file without the RStudio import function, but +now you can choose your preference. 
+ +In this exercise, we will leave the title of the data frame as +**Ecoli_metadata**, and there are no other options we need to adjust. Click the +Import button to import the data. + +Finally, let's check the fist few lines of the `Ecoli_metadata` metadata data +frame: + +> ~~~ +> head(Ecoli_metadata) +> ~~~ +{: .language-r} +> ~~~ +># A tibble: 6 x 7 +> sample generation clade strain cit run genome_size +> +>1 REL606 0. NA REL606 unknown NA 4.62 +>2 REL1166A 2000. unknown REL606 unknown SRR098028 4.63 +>3 ZDB409 5000. unknown REL606 unknown SRR098281 4.60 +>4 ZDB429 10000. UC REL606 unknown SRR098282 4.59 +>5 ZDB446 15000. UC REL606 unknown SRR098283 4.66 +>6 ZDB458 20000. (C1,C2) REL606 unknown SRR098284 4.63 +> ~~~ +{: .output} + +Works as we expect! Notice the type of this object is 'tibble', a type of data +frame we will talk more about in the 'dplyr' section. Of course, if you needed +a true R data frame you could coerce with `as.data.frame()`. + +## Saving your data frame to a file + +Finally, we can conclude this episode with saving our data frame, in this case +to a .csv file using the `write.csv()` function: + +> ~~~ +> write.csv(submission_metadata, file = "submission_metatata_cleaned.csv") +> ~~~ +{: .language-r} +> ~~~ +># use the dir() function to see files in our working directory +> +>[1] "dc_genomics_r.Rproj" "genomics_r_basics.R" "sample_submission.csv" +>[4] "submission_metadata_summary.txt" "submission_metatata_cleaned.csv" +> ~~~ +{: .output} + +The `write.csv()` function has some additional argument listed in the help, but +at a minimum you need to tell it what data frame to write to file, and give a +path to a file name in quotes (if you only provide a file name, the file will +be written in the current working directory). 
+ +> ## Exercise: Putting it all together - data frames +> +> **Using the `Ecoli_metadata` data frame created above, answer the following questions** +> +> *Hint*: If you did not create the `Ecoli_metadata` data frame, use the +> instructions above (Importing data from Excel section) to create this object. +> +> A) What are the dimensions (# rows, #columns) of the data frame? +> +> B) What are categories are there in the `cit` column? *hint*: treat column as factor +> +> C) How many of each of the `cit` categories are there? +> +> D) What is the genome size for the 7th observation in this data set? +> +> E) What is the median value of the variable `genome_size` +> +> F) Rename the column `sample` to `sample_id` +> +> G) Create a new column (name genome_size_bp) and set it equal to the genome_size multiplied by 1,000,000 +> +> H) Save the edited Ecoli_metadata data frame as "exercise_solution.csv" in your current working directory. +> +>> ## solution +>> +>> A) `dim(Ecoli_metadata)` # (30 rows, 7 columns) +>> +>> B) `levels(as.factor(Ecoli_metadata$cit))` # "minus" "plus" "unknown" +>> +>> C) `table(as.factor(Ecoli_metadata$cit))` # 9 minus, 9 plus, 12 unknown +>> +>> D) `Ecoli_metadata[7,7]` # 4.62 +>> +>> E) `median(Ecoli_metadata$genome_size)` # 4.625 +>> +>> F) `colnames(Ecoli_metadata)[colnames(Ecoli_metadata) == "sample"]<- "sample_id"` +>> +>> G) `Ecoli_metadata$genome_size_bp <- Ecoli_metadata$genome_size * 1000000` +>> +>> H) `write.csv(Ecoli_metadata, file= "exercise_solution.csv")` +> {: .solution} +{: .challenge} + + + --- diff --git a/episodes/99-r-help.md b/episodes/99-r-help.md new file mode 100644 index 00000000..a232b46b --- /dev/null +++ b/episodes/99-r-help.md @@ -0,0 +1,183 @@ +--- +title: "Getting help with R" +teaching: 10 +exercises: 5 +questions: +- "How do I get help using R and RStudio?" 
+objectives:
+- "Locate help for an R function using `?`, `??`, and `args()`"
+- "Check the version of R"
+- "Be able to ask effective questions when searching for help on forums or using web
+ searches"
+
+keypoints:
+- "R provides thousands of functions for analyzing data, and provides several
+ way to get help"
+- "Using R will mean searching for online help, and there are tips and
+ resources on how to search effectively"
+
+---
+
+## Getting help with R
+
+rstudio default session
+
+No matter how much experience you have with R, you will find yourself
+needing help. There is no shame in researching how to do something in R, and
+most people will find themselves looking up how to do the same things that
+they "should know how to do" over and over again. Here are some tips to make
+this process as helpful and efficient as possible.
+
+> "Never memorize something that you can look up"
+> - A. Einstein
+
+## Finding help on Stackoverflow and Biostars
+
+Two popular websites will be of great help with many R problems. For **general**
+**R questions**, [Stack Overflow](https://stackoverflow.com/) is probably the most
+popular online community for developers. If you start your question "How to do X
+in R" results from Stack Overflow are usually near the top of the list. For
+**bioinformatics specific questions**, [Biostars](https://www.biostars.org/) is
+a popular online forum.
+
+>## Tip: Asking for help using online forums:
+>
+> - When searching for R help, look for answers with the [r](https://stackoverflow.com/questions/tagged/r) tag.
+> - Get an account; not required to view answers but to required to post
+> - Put in effort to check thoroughly before you post a question; folks get
+> annoyed if you ask a very common question that has been answered multiple
+> times
+> - Be careful. While forums are very helpful, you can't know for sure if the
+> advice you are getting is correct
+> - See the [How to ask for R help](http://blog.revolutionanalytics.com/2014/01/how-to-ask-for-r-help.html)
+> blog post for more useful tips
+>
+{: .callout}
+
+## Help people help you
+
+Often, in order to duplicate the issue you are having, someone may need to see
+the data you are working with or verify the versions of R or R packages you
+are using. The following R functions will help with this:
+
+You can **check the version of R** you are working with using the `sessionInfo()`
+function. Actually, it is good to save this information as part of your notes
+on any analysis you are doing. When you run the same script that has worked fine
+a dozen times before, looking back at these notes will remind you that you
+upgraded R and forget to check your script.
+
+
+> ~~~
+> sessionInfo()
+> ~~~
+{: .language-r}
+
+> ~~~
+> R version 3.2.3 (2015-12-10)
+> Platform: x86_64-pc-linux-gnu (64-bit)
+> Running under: Ubuntu 14.04.3 LTS
+>
+> locale:
+> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
+> [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
+> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C
+> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
+>
+> attached base packages:
+> [1] stats graphics grDevices utils datasets methods base
+>
+> loaded via a namespace (and not attached):
+> [1] tools_3.2.3 packrat_0.4.9-1
+> ~~~
+{: .output}
+
+Many times, there may be some issues with your data and the way it is formatted.
+In that case, you may want to share that data with someone else. However, you
+may not need to share the whole dataset; looking at a subset of your 50,000 row,
+10,000 column dataframe may be TMI (too much information)! You can take an
+object you have in memory such as dataframe (if you don't know what this means
+yet, we will get to it!) and save it to a file. In our example we will use the
+`dput()` function on the `iris` dataframe which is an example dataset that is
+installed in R:
+
+
+> ~~~
+> dput(head(iris)) # iris is an example data.frame that comes with R
+> # the `head()` function just takes the first 6 lines of the iris dataset
+> ~~~
+{: .language-r}
+
+This generates some output (below) which you will be better able to interpret
+after covering the other R lessons. This info would be helpful in understanding
+how the data is formatted and possibly revealing problematic issues.
+
+> ~~~
+> structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4),
+> Sepal.Width = c(3.5, 3, 3.2, 3.1, 3.6, 3.9), Petal.Length = c(1.4,
+> 1.4, 1.3, 1.5, 1.4, 1.7), Petal.Width = c(0.2, 0.2, 0.2,
+> 0.2, 0.2, 0.4), Species = structure(c(1L, 1L, 1L, 1L, 1L,
+> 1L), .Label = c("setosa", "versicolor", "virginica"), class = "factor")), .Names = c("Sepal.Length",
+> "Sepal.Width", "Petal.Length", "Petal.Width", "Species"), row.names = c(NA,
+> 6L), class = "data.frame")
+> ~~~
+{: .output}
+
+Alternatively, you can also save objects in R memory to a file by specifying
+the name of the object, in this case the `iris` data frame, and passing a
+filename to the `file=` argument.
+
+> ~~~
+> saveRDS(iris, file="iris.rds") # By convention, we use the .rds file extension
+> ~~~
+{: .language-r}
+
+---
+
+## Final FAQs on R
+
+Finally, here are a few pieces of introductory R knowledge that are too good to
+pass up. While we won't return to them in this course, we put them here because
+they come up commonly:
+
+**Do I need to click Run every time I want to run a script?**
+
+- No. In fact, the most common shortcut key allows you to run a command (or
+ any lines of the script that are highlighted):
+ - Windows execution shortcut: Ctrl+Enter
+ - Mac execution shortcut: Cmd(⌘)+Enter
+
+ To see a complete list of shortcuts, click on the Tools menu and
+ select Keyboard Shortcuts Help
+
+**What's with the brackets in R console output?**
+- R returns an index with your result. When your result contains multiple values,
+ the number tells you what ordinal number begins the line, for example:
+
+> ~~~
+> 1:101 # generates the sequence of numbers from 1 to 101
+> ~~~
+{: .language-r}
+
+In the output below, `[81]` indicates that the first value on that line is the
+81st item in your result
+
+> ~~~
+> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
+> [21] 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
+> [41] 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
+> [61] 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
+> [81] 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
+> [101] 101
+> ~~~
+{: .output}
+
+
+**Can I run my R script without RStudio?**
+
+- Yes, remember - RStudio is running R. You get to use lots of the enhancements
+ RStudio provides, but R works independent of RStudio. See [these tips](https://support.rstudio.com/hc/en-us/articles/218012917-How-to-run-R-scripts-from-the-command-line)
+ for running your commands at the command line
+
+
+**Where else can I learn about RStudio?**
+- Check out the Help menu, especially "Cheatsheets" section
From fecc8d34e607b1749c889a8cd0d4ef80e7f2b8df Mon Sep 17 00:00:00 2001
From: Jason Williams
Date: Mon, 28 May 2018 16:33:04 +0100
Subject: [PATCH 17/19] update formatting

---
 index.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/index.md b/index.md
index fb296c1a..570a6067 100644
--- a/index.md
+++ b/index.md
@@ -10,7 +10,7 @@ difficult and frustrating at times – so if even the best feel that way, why le
 intimidation stop you?
Given time and practice* you will soon find it easier and
 easier to accomplish what you want.

-Why learn to code? Bioinformatics – like Biology – is messy. Different
+Why learn to code? Bioinformatics – like biology – is messy. Different
 organisms, different systems, different conditions, all behave differently.
 Experiments at the bench require a variety of approaches – from tested protocols
 to trial-and-error. Bioinformatics is also an experimental science, otherwise we
From 68891edd294de233e1adb4ca3dc2ec4c7a352f97 Mon Sep 17 00:00:00 2001
From: Jason Williams
Date: Tue, 5 Jun 2018 11:43:15 -0700
Subject: [PATCH 18/19] fix typos and edit text

---
 episodes/01-introduction.md | 125 +++++++-------
 episodes/02-r-basics.md | 201 +++++++++++++----------
 episodes/03-basics-factors-dataframes.md | 74 +++++----
 3 files changed, 226 insertions(+), 174 deletions(-)

diff --git a/episodes/01-introduction.md b/episodes/01-introduction.md
index 65ac3bb4..fd2ee87d 100644
--- a/episodes/01-introduction.md
+++ b/episodes/01-introduction.md
@@ -10,26 +10,26 @@ objectives:
 - "Know advantages of using RStudio"
 - "Create an RStudio project, and know the benefits of working within a
 project"
-- "Customize RStudio layout"
+- "Be able to customize the RStudio layout"
 - "Be able to locate and change the current working directory with `getwd()` and
 `setwd()`"
-- "Compose an R script file with comments and saved commands"
-- "Be able to define what an R function is"
+- "Compose an R script file containing comments and commands"
+- "Understand what an R function is"
 - "Locate help for an R function using `?`, `??`, and `args()`"

 keypoints:
 - "R is a powerful, popular open-source scripting language"
-- "RStudio allows you to run R in an easy-to-use interface and makes
- it easy to find help"
 - "You can customize the layout of RStudio, and use the project feature to
 manage the files and packages used in your analysis"
+- "RStudio allows you to run R in an easy-to-use interface and makes
+ it
easy to find help" + --- ## Getting ready to use R for the first time In this lesson we will take you through the very first things you need to get -R working, and conclude by showing you the most effective ways to get help -when you are working with R on your own. +R working. >## Tip: This lesson works best on the cloud > Remember, these lessons assume we are using the pre-configured virtual machine @@ -54,22 +54,22 @@ by Ross Ihaka for more info on the subject. At more than 20 years old, R is fairly mature and [growing in popularity](https://www.tiobe.com/tiobe-index/r/). However, programming isn’t a popularity contest. Here are key advantages of analyzing data in R: - - **R is [open source](https://en.wikipedia.org/wiki/Open-source_software)**. Of - course this means R is free - which is an advantage if you end up at a - institution where you would have to pay for your own MATLAB or SAS license. - Open source, is important to your colleagues in parts of the world where - expensive software in inaccessible. It also means that R is actively - developed by a community (See [r-project.org](https://www.r-project.org/)), + - **R is [open source](https://en.wikipedia.org/wiki/Open-source_software)**. + This means R is free - an advantage if you are at an institution where you + have to pay for your own MATLAB or SAS license. Open source, is important to + your colleagues in parts of the world where expensive software in + inaccessible. It also means that R is actively developed by a community (see + [r-project.org](https://www.r-project.org/)), and there are regular updates. - **R is widely used**. Ok, maybe programming is a popularity contest. Because, R is used in many areas (not just bioinformatics), you are more likely to find help online when you need it. Chances are, almost any error message you run into, someone else has already experienced. - **R is powerful**. R runs on multiple platforms (Windows/MacOS/Linux). 
It can - work with much larger datasets than popular spreadsheet programs like - Microsoft Excel, and because of its scripting capabilities is far more - reproducible. Also, there are thousands of available software packages for - science, including genomics and other areas of life science. + work with much larger datasets than popular spreadsheet programs like + Microsoft Excel, and because of its scripting capabilities is far more + reproducible. Also, there are thousands of available software packages for + science, including genomics and other areas of life science. >## Discussion: Your experience > What has motivated you to learn R? Have you had a research question for which @@ -103,6 +103,11 @@ Open a web browser and enter the IP address of your instance, followed by > {: .source} +You should now be looking at a page that will allow you to login to the RStudio +server: + +rstudio default session + Enter your user credentials and click Sign In. The credentials for the genomics Data Carpentry instances are: @@ -112,18 +117,19 @@ the genomics Data Carpentry instances are: You should now see the RStudio interface: -rstudio default session +rstudio default session --- ## Create an RStudio project One of the first benefits we will take advantage of in RStudio is something -called an **RStudio Project**. An RStudio Project allows you easily save data, -files, variables, packages, etc. related to a specific analysis project you are -conducting in R. Saving your work into a project makes it easy to restart work -where you left off, and also makes it easier to collaborate, especially if you -are using version control such as [git](http://swcarpentry.github.io/git-novice/). +called an **RStudio Project**. An RStudio project allows you to more easily: + +- Save data, files, variables, packages, etc. 
related to a specific + analysis project +- Restart work where you left off +- Collaborate, especially if you are using version control such as [git](http://swcarpentry.github.io/git-novice/). To create a project, go to the File menu, and click New Project.... @@ -173,34 +179,36 @@ convention, R scripts end with the file extension **.R**. ## Overview and customization of the RStudio layout Now that we have covered the basics, lets address some ways to configure the -layout of RStudio. First, here are the major windows or panes of the RStudio +layout of RStudio. First, here are the major windows (or panes) of the RStudio environment: rstudio default session - **Source**: This pane is where you will write/view R scripts. Some outputs - (such as if you view a dataset using `View()`) will appear as a tab here. -- **Console/Terminal**: This is actually where you see the execution of commands - , and what R looks like if you were to run it at the command line without - RStudio. You can work interactively (i.e. enter R commands here), but for the - most part, we will run a script, or lines in a script and watch their - execution and output here. The "Terminal" tab give you access to the BASH - terminal. + (such as if you view a dataset using `View()`) will appear as a tab here +- **Console/Terminal**: This is actually where you see the execution of + commands. This is the same display you would see if you were using R at the + command line without RStudio. You can work interactively (i.e. enter R + commands here), but for the most part we will run a script (or lines in a + script) in the source pane and watch their execution and output here. The + "Terminal" tab give you access to the BASH terminal (the Linux operating + system, unrelated to R) - **Environment/History**: Here, RStudio will show you what datasets and - variables you have created, and which are actively defined/in memory. 
You can - also see some characteristics of variables/datasets such as their type and - dimensions. A "History" tab also contains a history of executed R commands. In - the history tab you can see a list of previously executed commands. + objects (variables) you have created and which are defined in memory. + You can also see some properties of objects/datasets such as their type + and dimensions. A "History" tab also contains a history of executed R commands. + In the history tab you can see a list of previously executed commands - **Files/plots/Packages/help**: This multipurpose pane will show you the contents of directories on your computer. You can also use the "Files" tab to navigate and set the working directory. The "Plots" tab will show the output of any plots generated. In "Packages" you will see what packages are actively loaded, or you can attach installed packages. "Help" will display help files - for R functions/packages. + for R functions/packages ->## Tip: Downloads from the cloud +>## Tip: Uploads and downloads in the cloud > In the "Files" tab you can select a file and download it from your cloud -> instance to your local computer. Uploads are also possible. +> instance (click the "more" button) to your local computer. +> Uploads are also possible. {: .callout} All of the panes in RStudio have configuration options. For example, you can @@ -223,7 +231,7 @@ colors/themes, and more are in the Tools menu under ## Getting to work with R: navigating directories Now that we have covered the more aesthetic aspects of RStudio, we can get to -work learning some commands. We will write, execute, and save the commands we +work using some commands. We will write, execute, and save the commands we learn in our **genomics_r_basics.R** script that is loaded in the Source pane. First, lets see what directory we are in. To do so, type the following command into the script: @@ -299,17 +307,18 @@ and `dc_genomics_r` directory. 
The path in your script should look like this: When you run this command, the console repeats the command, but gives you no output. Instead, you see the blank R prompt: `>`. Congratulations! Although it -seems small, knowing what your working directory is, and being able to set your +seems small, knowing what your working directory is and being able to set your working directory is the first step to analyzing your data. >## Tip: Never use `setwd()` > Wait, what was the last 2 minutes about? Well, setting your working directory > is something you need to do, you need to be very careful about using this as -> a step in your script. For example, the top-level path in a Unix file system -> is root `/`, but on Windows it is likely `C:\`. This is one of several ways -> you might cause a script to break because a file path is configured differently -> than your script anticipates. R packages like [`here`](https://cran.r-project.org/web/packages/here/index.html) -> and [`file.path`](https://www.rdocumentation.org/packages/base/versions/3.4.3/topics/file.path) +> a step in your script. For example, what if your script is being run on a computer +> that has a different directory structure? The top-level path in a Unix file +> system is root `/`, but on Windows it is likely `C:\`. This is one of several +> ways you might cause a script to break because a file path is configured +> differently than your script anticipates. R tools like [here](https://cran.r-project.org/web/packages/here/index.html) +> and [file.path](https://www.rdocumentation.org/packages/base/versions/3.4.3/topics/file.path) > allow you to specify file paths is a way that is more operating system > independent. See Jenny Bryan's [blog post](https://www.tidyverse.org/articles/2017/12/workflow-vs-script/) for this > and other R tips. @@ -339,12 +348,15 @@ program that takes an input and returns and output.
>> on attached packages >> - `date()` # Gives the current date >> - `Sys.time()` # Gives the current time +>> +>> *Notice*: Commands are case sensitive! > {: .solution} {: .challenge} You have hopefully noticed a pattern, some more abstract exceptions aside, in R a function has three key properties: -- functions have a name (e.g. `dir`, `getwd`); note that these are case sensitive! +- functions have a name (e.g. `dir`, `getwd`); note that functions are case + sensitive! - following the name, functions have a pair of `()` - Inside the parentheses, a function may take 0 or more arguments @@ -367,8 +379,8 @@ Which returns ## Getting help with function arguments -Of course, you may have wanted to round to one significant digit. `round()` can -do this, but you may fist need to read the help to find out how. To see the help +What if you wanted to round to one significant digit? `round()` can +do this, but you may first need to read the help to find out how. To see the help (In R sometimes also called a "vignette") enter a `?` in front of the function name: @@ -377,11 +389,11 @@ name: > ~~~ {: .language-r} -The "Help" tab will show you information (and often, too much information). You -will slowly learn how to read through all of that. Checking the "Usage" or -"Examples" headings is often a good place to look first. If you look under -"Arguments" we also see what arguments we can "pass" to this function to modify -its behavior. You can also see a function's argument using the `args()` function: +The "Help" tab will show you information (often, too much information). You +will slowly learn how to read through that. Checking the "Usage" or "Examples" +headings is often a good place to look first. If you look under "Arguments," we +also see what arguments we can "pass" to this function to modify its behavior. +You can also see a function's argument using the `args()` function: > ~~~ > args(round) @@ -429,7 +441,8 @@ digits is 2. 
{: .output} Finally, what if you are using `?` to get help for a function in a package not -installed on your system? +installed on your system, such as when you are running a script which has +dependencies? > ~~~ > ?geom_point() @@ -452,8 +465,8 @@ functions may be available, use the `help.search()` function. > ## Exercise: Searching for R functions > Use `help.search()` to find R functions for the following statistical -> functions. Remember to put what you are using for your search query in -> quotes inside the function parentheses. +> functions. Remember to put your search query in quotes inside the function +> parentheses. > > - Chi-Squared test > - Student-t test diff --git a/episodes/02-r-basics.md b/episodes/02-r-basics.md index 6240de4e..0656e449 100644 --- a/episodes/02-r-basics.md +++ b/episodes/02-r-basics.md @@ -8,14 +8,15 @@ questions: - "What are the most common objects in R?" objectives: - "Be able to create the most common R objects including vectors" -- "Understand that vectors have modes, which correspond to the type of data they contain" +- "Understand that vectors have modes, which correspond to the type of data they + contain" - "Be able to use arithmetic operators on R objects" - "Be able to retrieve (subset), name, or replace, values from a vector" - "Be able to use logical operators in an subsetting operation" - "Understand that lists can hold data of more than one mode and can be indexed" keypoints: - "Effectively using R is a journey of months or years. Still you don't have to - be an expert to use R and you can start using and analzying your data with + be an expert to use R and you can start using and analyzing your data with with about a day's worth of training" - "It is important to understand how data are organized by R in a given object type how the mode of that type (e.g. numeric, character, logical, etc.) will @@ -42,24 +43,25 @@ their own research questions!
Ok, maybe some folks learn R for R's sake, but these lessons assume that you want to start analyzing genomic data as soon as possible. Given this, there are many valuable pieces of information about R that we simply wont have time to cover. Hopefully we will clear the hurdle of -giving you just enough knowledge to be dangerous, which can be a high hurdle -in R! We uggest you look into additional the learning materials in the tip box +giving you just enough knowledge to be dangerous, which can be a high bar +in R! We suggest you look into the additional learning materials in the tip box below. **Here are some R skills we will *not* cover in these lessons** - How to create and work with R matrices and R lists -- How to create and work with loops and conditional statements -- How to do basic string manipulations (e.g. finding patterns in text using grep) +- How to create and work with loops and conditional statements, and the "apply" + family of functions (which are super useful, read more [here](https://www.r-bloggers.com/r-tutorial-on-the-apply-family-of-functions/)) +- How to do basic string manipulations (e.g. finding patterns in text using grep, replacing text) - How to plot using the default R graphic tools (we *will* cover ggplot2) -- How to use the advanced R statistical functions +- How to use advanced R statistical functions >## Tip: Where to learn more > The following are good resources for learning more about R. Some of them -> can be quite technically, but if you are a regular R user you may ultimately -> need some of this technical knowledge. +> can be quite technical, but if you are a regular R user you may ultimately +> need this technical knowledge.
> - [R for Beginners](https://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf): - By Emmanuel Paradis, great starting point + By Emmanuel Paradis and a great starting point > - [The R Manuals](https://cran.r-project.org/manuals.html): Maintained by the R project > - [R contributed documentation](https://cran.r-project.org/other-docs.html): @@ -72,22 +74,31 @@ below. and applications for R > - [Programming in R Software Carpentry lesson](https://software-carpentry.org/lessons/): There are several Software Carpentry lessons in R to choose from +> - [Data Camp Introduction to R](https://www.datacamp.com/courses/free-introduction-to-r): + This is a fun online learning platform for Data Science, including R. {: .callout} ## Creating objects in R > ## Reminder -> At this point you should writing following along in the "**genomics_r_basics.R**" -> script we created in the last episode. Writing you commands in the script -> will make it easier to record what you did and why. +> At this point you should be coding along in the "**genomics_r_basics.R**" +> script we created in the last episode. Writing your commands in the script +> (and commenting it) will make it easier to record what you did and why. > {: .prereq} -What might be called a variable in many language is properly called an **object** -in R. To create your object you need a name (e.g. 'a'), and a value (e.g. '1'). -Using the R assignment operator '<-''. In your script, "**genomics_r_basics.R**" -write a comment (using the '#') sign, and assign '1' to the object 'a' as shown -below: +What might be called a variable in many languages is properly called an **object** +in R. + +**To create your object you need:** + +- a name (e.g. 'a') +- a value (e.g. '1') +- the assignment operator ('<-') + +In your script, "**genomics_r_basics.R**", using the R assignment operator '<-', +assign '1' to the object 'a' as shown. 
Remember to leave a comment in the line +above (using the '#') to explain what you are doing: > ~~~ > # this line creates the object 'a' and assigns it the value '1' @@ -96,20 +107,21 @@ below: > ~~~ {: .language-r} -Be sure to execute this line of code in your script. You can run a line of code + +Next, run this line of code in your script. You can run a line of code by hitting the Run button that is just above the first line of your script in the header of the Source pane or you can use the appropriate shortcut: - Windows execution shortcut: Ctrl+Enter - Mac execution shortcut: Cmd(⌘)+Enter -to run multiple lines of code, you can highlight all the line you wish to run + +To run multiple lines of code, you can highlight all the line you wish to run and then hit Run or use the shortcut key combo. -You should notice the following outputs; in the RStudio 'Console' you should see: +In the RStudio 'Console' you should see: > ~~~ -> # this line creates the object 'a' and assigns it the value '1' -> > a <- 1 +> > > ~~~ {: .output} @@ -126,12 +138,14 @@ The 'Environment' window allows you to easily keep track of the objects you have created in R. > ## Exercise: Create some objects in R -> Create the following objects in R, give each object an appropriate name. +> Create the following objects; give each object an appropriate name +> (your best guess at what name to use is fine): > > 1. Create an object that has the value of number of pairs of human chromosomes > 2. Create an object that has a value of your favorite gene name -> 3. Create an object that value of this URL: "ftp://ftp.ensemblgenomes.org/pub/bacteria/release-39/fasta/bacteria_5_collection/escherichia_coli_b_str_rel606/" -> 4. Create and object that has the value of the number of chromosomes in a diplod cell +> 3. Create an object that has this URL as its value: "ftp://ftp.ensemblgenomes.org/pub/bacteria/release-39/fasta/bacteria_5_collection/escherichia_coli_b_str_rel606/" +> 4. 
Create an object that has the value of the number of chromosomes in a +> diploid human cell > >> ## solution >> Here as some possible answers to the challenge: @@ -147,23 +161,22 @@ created in R. Here are some important details about naming objects in R. -- **Avoid spaces and special characters**: Object cannot contain spaces. Typically - you can use '-' or '_' to provide separation. You should avoid using special - characters in your object name (e.g. ! @ # . , etc.). Also, names cannot begin with - a number. +- **Avoid spaces and special characters**: Object names cannot contain spaces. + Typically, you can use '-' or '_ ' to make names more readable. You should avoid + using special characters in your object name (e.g. ! @ # . , etc.). Also, + names cannot begin with a number - **Use short, easy-to-understand names**: You should avoid naming your objects using single letters (e.g. 'n', 'p', etc.). This is mostly to encourage you to use names that would make sense to anyone reading your code (a colleague, - or even yourself a year from now). Also, avoiding really long names will make - your code more readable. -- **Avoid commonly used names**: There are several names that may alread have a + or even yourself a year from now). Also, avoiding excessively long names will + make your code more readable +- **Avoid commonly used names**: There are several names that may already have a definition in the R language (e.g. 'mean', 'min', 'max'). One clue that a name - already has meaning is that if you start typing a name in RStudio and either - pause your typing or hit the Tab key and RStudio gives you a - suggested autocompletion or help message you have choosen a name that has a - prior meaning. -- **Use the recommended assignment operator**: In R, we use '<- '' as the - prefered assignment operator. 
'=' works too, but is most comonly used in + already has meaning is that if you start typing a name in RStudio and it gets + a colored highlight, or RStudio gives you a suggested autocompletion you have + chosen a name that has a reserved meaning +- **Use the recommended assignment operator**: In R, we use '<- ' as the + preferred assignment operator. '=' works too, but is most commonly used in passing arguments to functions (more on functions later). There is a shortcut for the R assignment operator: - Windows execution shortcut: Alt+- @@ -176,14 +189,14 @@ have advice, and one to start with is the [tidyverse R style guide](http://style >## Tip: Pay attention to warnings in the script console > -> If you enter a line of code in your R that contains some error, RStudio -> may give you hint with an error indication and an underline of this mistake. -> Sometimes these messages are easy to understand, but often the message may -> need some figuring out. In any case paying attention to these warnings help -> you avoid mistakes. In this case, our object name has a space, which is not -> allowed in R. Notice the error message does not say this directly, but -> essentially R is "not sure" about to to assign the name to "human_ chr_number" -> when the object name we want is "human_chr_number". +> If you enter a line of code in your script that contains some error, RStudio +> may give you an error message and underline this mistake. Sometimes these +> messages are easy to understand, but often the message may need some figuring +> out. In any case paying attention to these warnings help you avoid mistakes. +> In this case, our object name has a space, which is not allowed in R. Notice +> the error message does not say this directly, but essentially R is "not sure" +> about to to assign the name to "human_ chr_number" when the object name we +> want is "human_chr_number". 
> > rstudio script warning > @@ -196,7 +209,8 @@ not complain about overwriting objects, which may or may not be a good thing depending on how you look at it. > ~~~ -> # gene_name has the value 'pten' or whatever value you used in the challenge. We will now assign the new value 'tp53' +> # gene_name has the value 'pten' or whatever value you used in the challenge. +> # We will now assign the new value 'tp53' > > gene_name <- 'tp53' > ~~~ @@ -221,30 +235,37 @@ longer defined. > ~~~ {: .error} -## Understaning object data types (modes) +## Understanding object data types (modes) + +In R, **every object has two properties**: + +- **Length**: How many distinct values are held in that object +- **Mode**: What is the classification (type) of that object. -One very important thing to know about an object is that every object has two -properties, "length" and "mode". We will get to the "length" property later in -the lesson. The **"mode" property corresponds to the type of data an object** -**represents**. The most common modes you will encounter in R are: +We will get to the "length" property later in the lesson. The **"mode" property** +**corresponds to the type of data an object**represents**. The most common modes +you will encounter in R are: |Mode (abbreviation)|Type of data| |----|------------| -|Numeric (num)| Numbers such integers (e.g. 1, 892, 1.3e+10) and floating pont/decimals (0.5, 3.14)| +|Numeric (num)| Numbers such floating point/decimals (1.0, 0.5, 3.14), there are also more specific numeric types (dbl - Double, int - Integer). These differences are not relevant for most beginners and pertain to how these values are stored in memory | |Character (chr)|A sequence of letters/numbers in single '' or double " " quotes| |Logical| Boolean values - TRUE or FALSE| -There are a few other modes (double", "complex", "raw" etc.) but for now, these -three are the most important. 
Data types are familiar in many programming -languages, but also in natural language where we refer to them as the -parts of speech, e.g. nouns, verbs, adverbs, etc. One you know if a word - -perhaps an unfamilar one - is a noun, you can probbaly guess you can count it -and make it plural if there is more than one (e.g. 1 Tuatara, or 2 Tuataras). -If something is a adjective, you can usually change it into an adverb by -adding "-ly" (e.g. jejune vs. jejunely). Depending on the context, you may need -to decide if a word is in one category or another (e.g "cut" may be a noun when -its on your finger, or a verb when you are preparing vegetables). These examples -have important analogies when working with R objects. +There are a few other modes (e.g. "complex", "raw", etc.) but for now, these three +are the most important. + +Data types are familiar in many programming languages, but also in natural +language where we refer to them as the parts of speech, e.g. nouns, verbs, +adverbs, etc. Once you know if a word - perhaps an unfamiliar one - is a noun, +you can probably guess you can count it and make it plural if there is more than +one (e.g. 1 [Tuatara](https://en.wikipedia.org/wiki/Tuatara), or 2 Tuataras). +If something is an adjective, you can usually change it into an adverb by adding +"-ly" (e.g. [jejune](https://www.merriam-webster.com/dictionary/jejune) vs. +jejunely). Depending on the context, you may need to decide if a word is in one +category or another (e.g. "cut" may be a noun when it's on your finger, or a verb +when you are preparing vegetables). These concepts have important analogies when +working with R objects. > ## Exercise: Create objects and check their modes > Create the following objects in R, then use the `mode()` function to verify @@ -269,17 +290,17 @@ have important analogies when working with R objects.
Notice from the solution that even if a series of numbers are given as a value R will consider them to be in the "character" mode if they are enclosed as single or double quotes. Also notice that you cannot take a string of alphanumeric -character (e.g. Earhart) and assign as a value for an object. In this case, +characters (e.g. Earhart) and assign as a value for an object. In this case, R looks for the object `Earhart` but since there is no object, no assignment can be made. If `Earhart` did exist, then the mode of `pilot` would be whatever the mode of `Earthrt` was originally. ## Mathematical and functional operations on objects -Once an object exsits (which by definition also means it has a mode), R can +Once an object exists (which by definition also means it has a mode), R can appropriately manipulate that object. For example, objects of the numeric modes can be added, multiplied, divided, etc. R provides several mathematical -(arithmetic) operators incuding: +(arithmetic) operators including: |Operator|Description| |--------|-----------| @@ -302,7 +323,7 @@ These can be used with literal numbers: > ~~~ {: .output} -and importantly, can be used on any object that evaluates to (i.e. iterprited +and importantly, can be used on any object that evaluates to (i.e. interpreted by R) a numeric object: @@ -355,12 +376,12 @@ here too. ## Vectors With a solid understanding of the most basic objects, we come to probably the -most used objects in R, vectors. A vector can be though of as a collection of -values (numbers, characters, etc.). Vectors also have a mode (data type), so -all of the contents of a vctor must be of the same mode. One of the most common -way to create a vector is to use the `c()` function - the "concatanate" or +most used objects in R, vectors. **A vector is a collection of values (numbers,** +**characters, etc.)**. Vectors also have a mode (data type), so all of the +contents of a vector must be of the same mode. 
One of the most common +ways to create a vector is to use the `c()` function - the "concatenate" or "combine" function. Inside the function you may enter one or more values; for -multiple values, seperate each value with a comma: +multiple values, separate each value with a comma: > ~~~ > # Create the SNP gene name vector @@ -396,7 +417,7 @@ returns: Vectors are quite important in R, mostly for us because data frames are essentially collections of vectors (more on this later). What we learn about -manipulating vectors now will pay of even more when we get to data frames. +manipulating vectors now will pay off even more when we get to data frames. ## More on creating and subsetting vectors @@ -539,9 +560,9 @@ so that the gene "APOA5" is an index 7. This may be a good or not so good thing depending on how you use this. > ## Exercise: Examining and subsetting vectors -> Answer the following questions to test your knowledge vectors +> Answer the following questions to test your knowledge of vectors > -> Which of the following is true of vectors in R +> Which of the following are true of vectors in R? > > A) All vectors have a mode or a length > @@ -641,9 +662,9 @@ Some of the most common logical operators you will use in R are: >number. So, even if it does not appear to be an integer (e.g. 1, 2, 3), as long >as R can evaluate it, we will get a result. That our expression >`snp_positions[snp_positions > 100000000]` evaluates to a number can be seen ->in the following situtaion. If you wanted to know which **index** (1, 2, 3, or +>in the following situation. If you wanted to know which **index** (1, 2, 3, or >4) in our vector of SNP positions was the one that was greater than 100,000,000? 
->We can use the `which()` function to return the indicies of any item that +>We can use the `which()` function to return the indices of any item that >evaluates as TRUE in our comparison: >> ~~~ >> which(snp_positions > 100000000) @@ -653,7 +674,10 @@ Some of the most common logical operators you will use in R are: >> [1] 4 >> ~~~ >{: .output} -> **Why is this important?** Often in programming we will not know what inputs +> +> **Why this is important** +> +>Often in programming we will not know what inputs > and values will be used when our code is executed. Rather than put in a > pre-determined value (e.g 100000000) we can use an object that can take on > whatever value we need. So for example: @@ -691,9 +715,9 @@ value: {: .output} Sometimes, you may wish to find out if a specific value (or several values) is -in a vector. You can do this using the comparison operator `%in%`, which will -return TRUE for any value in your collection of one or more values matches a -value in the vector you are searching: +present in a vector. You can do this using the comparison operator `%in%`, which +will return TRUE for any value in your collection of one or more values that matches +a value in the vector you are searching: > ~~~ > # current value of 'snp_genes': chr [1:7] "OXTR" "ACTN3" "AR" "OPRM1" "CYP1A1" NA "APOA5" @@ -728,7 +752,7 @@ value in the vector you are searching: > c. To the `snp_positions` vector add: 116792991 > > **3) Make the following change to the `snp_genes` vector** -> Hint: Your vector should look like this in the 'Global Enviornment': +> Hint: Your vector should look like this in the 'Global Environment': > `chr [1:7] "OXTR" "ACTN3" "AR" "OPRM1" "CYP1A1" NA "APOA5"`. If not > recreate the vector by running this expression: > `snp_genes <- c("OXTR", "ACTN3", "AR", "OPRM1", "CYP1A1", NA, "APOA5")` @@ -738,7 +762,7 @@ value in the vector you are searching: > b.
Add 2 NA values to the end of `snp_genes` (hint: final vector should > have a length of 8) > -> **4) Create a new vector `combined` that contains:** +> **4) Using indexing, create a new vector `combined` that contains:** > > - The the 1st value in `snp_genes` > @@ -776,7 +800,7 @@ value in the vector you are searching: >> b. `snp_genes <- c(snp_genes, NA, NA)` or `snp_genes[[8]] <- NA`, etc. >> >> ->> **4) Create a new vector `combined` that contains:** +>> **4) Using indexing, create a new vector `combined` that contains:** >> >> - The the 1st value in `snp_genes` >> @@ -807,10 +831,12 @@ items from the list. > ~~~ > # Create a named list using the 'list' function and our SNP examples -> # Note, for easy reading we have place each item in the list on a separate line +> # Note, for easy reading we have placed each item in the list on a separate line > # Nothing special about this, you can do this for any multiline commands > # To run this command, make sure the entire command (all 4 lines) are highlighted > # before running +> # Note also, as we are doing all this inside the list() function use of the +> # '=' sign is good style > >snp_data <- list(genes = snp_genes, > refference_snp = snps, @@ -818,6 +844,7 @@ items from the list. 
> position = snp_positions) > > # Examine the structure of the list +> >str(snp_data) > ~~~ {: .language-r} diff --git a/episodes/03-basics-factors-dataframes.md b/episodes/03-basics-factors-dataframes.md index c31c1a8b..d40ec617 100644 --- a/episodes/03-basics-factors-dataframes.md +++ b/episodes/03-basics-factors-dataframes.md @@ -12,8 +12,8 @@ objectives: - "Be able to determine the structure of a data frame including its dimensions and the datatypes of variables" - "Be able to subset/retrieve values from a data frame" -- "Understand how R may converse data into different modes" -- "Be able to convert the mode of an object" +- "Understand how R may coerce data into different modes" +- "Be able to change the mode of an object" - "Understand that R uses factors to store and manipulate categorical data" - "Be able to manipulate a factor, including subsetting and reordering" - "Be able to apply an arithmetic function to a data frame" @@ -49,7 +49,7 @@ from data which have been manipulated in some unknown way). The simplest principle of **Tidy data** is that we have one row in our spreadsheet for each observation or sample, and one column for every variable that we measure or report on. As simple as this sounds, it's very easily -violated. Most data scintists agree that significant amounts of their time is +violated. Most data scientists agree that significant amounts of their time is spent tidying data for analysis. Read more about data organization in [our lesson](http://www.datacarpentry.org/spreadsheet-ecology-lesson/) and in [this paper](https://www.jstatsoft.org/article/view/v059i10). @@ -57,7 +57,7 @@ in [this paper](https://www.jstatsoft.org/article/view/v059i10). **3) Trust but verify** Finally, while you don't need to be paranoid about data, you should have a plan -for how you will prepare it for analysis. **This a the focus of this lesson.** +for how you will prepare it for analysis. 
**This is a focus of this lesson.** You probably already have a lot of intuition, expectations, assumptions about your data - the range of values you expect, how many values should have been recorded, etc. Of course, as the data get larger our human ability to @@ -65,13 +65,22 @@ keep track will start to fail (and yes, it can fail for small data sets too). R will help you to examine your data so that you can have greater confidence in your analysis, and its reproducibility. +>## Tip: Keeping your raw data separate +> When you work with data in R, you are not changing the original file you +> loaded that data from. This is different than (for example) working with +> a spreadsheet program where changing the value of the cell leaves you one +> "save"-click away from overwriting the original file. You have to purposely +> use a writing function (e.g. `write.csv()`) to save data loaded into R. In +> that case, be sure to save the manipulated data into a new file. More on this +> later in the lesson. + {: .callout} + + ## Importing tabular data into R There are several ways to import data into R. For our purpose here, we will focus on using the tools every R installation comes with (so called "base" R) to -import a comma-delimited file, a sequencing sample submission sheet. We will - -First, we need to load the sheet using a function called `read.csv()`. +import a comma-delimited file, a sequencing sample submission sheet. We will need to load the sheet using a function called `read.csv()`. > ## Exercise: Review the arguments of the `read.csv()` function > **Before using the `read.csv()` function, use R's help feature to answer the @@ -117,7 +126,7 @@ First, we need to load the sheet using a function called `read.csv()`. >> Hopefully, this exercise gets you thinking about using the provided help >> documentation in R. There are many arguments that exist, but which we wont >> have time to cover.
Look here to get familiar with functions you use ->> frequently, you may be surpized at what you find they can do. +>> frequently, you may be surprised at what you find they can do. > {: .solution} {: .challenge} @@ -136,7 +145,7 @@ errors in file paths.** Use it! > ~~~ {: .language-r} -One of the first things you should notice is that in the Enviornment window, +One of the first things you should notice is that in the Environment window, you have the `submission_metadata` object, listed as 96 obs. (observations/rows) of 10 variables (columns). Double-clicking on the name of the object will open a view of the data in a new tab. @@ -175,12 +184,12 @@ frame. Let's examine what each of these functions can tell us: > ~~~ {: .output} -Our data frame had 10 variables, so we get 10 feilds that summarize the data. +Our data frame had 10 variables, so we get 10 fields that summarize the data. The `tube_barcode`, `Volume..ul.`, `concentration..ng.ul`, `RIN`, variables are numerical data and so you get summary statistics on the min and max values for these columns, as well as mean, median, and interquartile ranges. The other data -(e.g. `replicate`, etc.) are treated as catagorical data (which have special -treatment in R - more on this in a bit). The top 6 different catagories and the +(e.g. `replicate`, etc.) are treated as categorical data (which have special +treatment in R - more on this in a bit). The top 6 different categories and the number of times they appear (e.g. the replicate called 'A' appeared 31 times) are displayed. There was only one value for `ship_date`, "20-Jul" which appeared 96 times. @@ -222,9 +231,9 @@ Ok, thats a lot up unpack! Some things to notice. Factors are the final major data structure we will introduce in our R genomics lessons. Factors can be thought of as vectors which are specialized for categorical data. Given R's specialization for statistics, this make sense since -categorial and contiuous variables usually have different treatments. 
Sometimes -you may want to have data treated as a fator, but in other cases, this may be -undersirable. +categorical and continuous variables usually have different treatments. Sometimes +you may want to have data treated as a factor, but in other cases, this may be +undesirable. Since some of the data in our data frame are factors, lets see how factors work using the `factor()` function to create a factor: @@ -258,7 +267,7 @@ What we get back are the items in our factor, and also something called "Levels" **Levels are the different categories contained in a factor**. By default, R will organize the levels in a factor in alphabetical order. -Lets look at the contents of a factor in a slightly diffrent way using `str()`: +Let's look at the contents of a factor in a slightly different way using `str()`: > ~~~ > > str(days_of_the_week) @@ -269,7 +278,7 @@ Lets look at the contents of a factor in a slightly diffrent way using `str()`: > ~~~ {: .output} -For the sake of efficency, R stores the content of a factor as a vector of +For the sake of efficiency, R stores the content of a factor as a vector of integers, which an integer is assigned to each of the possible levels. Recall levels are assigned in alphabetical order, so: @@ -309,7 +318,7 @@ provides some clarification to why we get this output. One of the most common uses for factors will be when you plot categorical values. For example, suppose we want to know how many samples from our sample -submision were preped on each date?
We could generate a plot: > ~~~ > # create a factor with repeated values @@ -360,7 +369,7 @@ Then we use the `table()` function to turn this into a table of counts: > ~~~ {: .output} -Finally, we use R's `plot()` function which attemtps to generate a plot from the +Finally, we use R's `plot()` function which attempts to generate a plot from the data: > ~~~ > # generate a plot from values of the 'prep_date' variable from the data frame @@ -411,10 +420,10 @@ order. > - 6-Jul-15: July 6, 2015 > - 7/8/15: July 8, 2015 > -> *hint* you can use the `factor()` function inside of your `table()`and `plot()` +> *hint*: you can use the `factor()` function inside of your `table()`and `plot()` > function calls. > -> *hint* build this single line of code from the inside out! +> *hint*: build this single line of code from the inside out! >> ## solution >>plot(table(factor(submission_metadata$prep_date, levels = c("7-Jun-15", >> "6-Jul-15", @@ -433,7 +442,7 @@ Next, we are going to talk about how you can get specific values from data frame The first thing to remember is that a data frame is two-dimensional (rows and columns). Therefore, to select a specific value we will will once again use -`[]` notation, but we will specify more than one value (except in some cases +`[]` (bracket) notation, but we will specify more than one value (except in some cases where we are taking a range). > ## Exercise: Subsetting a data frame @@ -574,8 +583,8 @@ is not what you expect. Consider: {: .output} Although there are several numbers in our vector, they are all in quotes, so -we have explicitly told R to consider them characters. Even if we removed the -quotes from the numbers, R would coerce everything into a character: +we have explicitly told R to consider them characters. 
However, even if we removed +the quotes from the numbers, R would coerce everything into a character: > ~~~ > snp_chromosomes_2 <- c(3, 11, 'X', 6) @@ -589,8 +598,9 @@ quotes from the numbers, R would coerce everything into a character: > ~~~ {: .output} -We can use some of the `as.` functions to explicitly coerce values from one -form into another. Consider the following vector of characters, which all happen to be valid numbers: +We can use the `as.` functions to explicitly coerce values from one form into +another. Consider the following vector of characters, which all happen to be +valid numbers: > ~~~ > snp_positions_2 <- c("8762685", "66560624", "67545785", "154039662") @@ -695,7 +705,7 @@ Lets summarize this section on coercion with a few take home messages. coercion. - Check the structure (`str()`) of your data frames before working with them! -One regarding the first bullet point, one way to avoid needless coercion when +Regarding the first bullet point, one way to avoid needless coercion when importing a data frame using any one of the `read.table()` functions such as `read.csv()` is to set the argument `StringsAsFactors` to FALSE. By default, this argument is TRUE. Setting it to FALSE will treat any non-numeric column to @@ -710,7 +720,7 @@ Here are a few operations that don't need much explanation, but which are good to know. There are lots of arithmetic functions you may want to apply to your data -frame, an covering those would be a course in itself (there is some starting +frame, covering those would be a course in itself (there is some starting material [here](https://swcarpentry.github.io/r-novice-inflammation/15-supp-loops-in-depth/)). Our lessons will cover some additional summary statistical functions in a subsequent lesson, but overall we will focus on data cleaning and visualization. @@ -791,7 +801,9 @@ Excel files, are you going to open and export all of them?). 
One common R package (a set of code with features you can download and add to your R installation) is the [readxl package](https://CRAN.R-project.org/package=readxl) which can open and import Excel files. Rather than addressing package installation this second, we can take -advantage of RStudio's import feature which integrates this package. +advantage of RStudio's import feature which integrates this package. (Note: +this feature is available only in the latest versions of RStudio such as is +installed on our cloud instance). @@ -818,7 +830,7 @@ In this exercise, we will leave the title of the data frame as **Ecoli_metadata**, and there are no other options we need to adjust. Click the Import button to import the data. -Finally, let's check the fist few lines of the `Ecoli_metadata` metadata data +Finally, let's check the first few lines of the `Ecoli_metadata` metadata data frame: > ~~~ @@ -859,7 +871,7 @@ to a .csv file using the `write.csv()` function: > ~~~ {: .output} -The `write.csv()` function has some additional argument listed in the help, but +The `write.csv()` function has some additional arguments listed in the help, but at a minimum you need to tell it what data frame to write to file, and give a path to a file name in quotes (if you only provide a file name, the file will be written in the current working directory). 
From 9bab3b8166e40c90d8618934636d043de447e24f Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Fran=C3=A7ois=20Michonneau?= Date: Wed, 3 Oct 2018 09:56:02 -0400 Subject: [PATCH 19/19] switch to using remote theme --- _extras/discuss.md | 5 ----- _extras/guide.md | 5 ----- 2 files changed, 10 deletions(-) delete mode 100644 _extras/discuss.md delete mode 100644 _extras/guide.md diff --git a/_extras/discuss.md b/_extras/discuss.md deleted file mode 100644 index 727205da..00000000 --- a/_extras/discuss.md +++ /dev/null @@ -1,5 +0,0 @@ ---- -layout: page -title: Discussion ---- -FIXME diff --git a/_extras/guide.md b/_extras/guide.md deleted file mode 100644 index 50d9d0b3..00000000 --- a/_extras/guide.md +++ /dev/null @@ -1,5 +0,0 @@ ---- -layout: page -title: "Instructor Notes" ---- -FIXME