# Week 1: Getting Started With R in JupyterHub
**Written by Professor Emily Klancher Merchant for STS 112, UC Davis, Winter 2019**
## Introduction
Welcome to JupyterHub! This is the platform we will be using for STS 112. JupyterHub facilitates what is called "literate programming." Literate programming combines chunks of natural-language text (in this case, English) with chunks of executable code (in this case, R). In this class, we will be working with JupyterHub notebooks, which have the file extension `.ipynb`. You will access your notebooks by logging into the JupyterHub server for the class from any web-enabled device. All of the data you will need is on the server, and you will save your work on the server as well. You can, however, export your notebooks, either in `.ipynb` format or in a variety of other file formats (`File -> Download as`). As you work through the notebooks for the class, you will notice new terms in **bold** the first time they appear.

**Note:** If you find that you like working with Jupyter and want to install it on your own machine, you are in luck: Jupyter is free and open source, available as part of [Anaconda](https://anaconda.org/). However, the easiest way to work with R on your own machine is to download and install [R](https://cran.r-project.org/) and [RStudio](https://www.rstudio.com/), both of which are also free and open source. R also has its own version of notebooks, called [RMarkdown](https://rmarkdown.rstudio.com/).

## Editing
Each chunk of either text or code is called a **block**. The first thing you need to know how to do is edit a block. You can edit this one by double-clicking anywhere in it. Once you have done that, type some text below, and then hit `shift + return` (that is, hold down `shift` and hit `return`).



What happened when you did that? First, you saw behind the scenes of the text block above. We will discuss that in a moment. Then, you edited the block by adding your own text. Finally, you **ran** the block by pressing `shift + return`. You can also run a block of text or code by clicking the `>|Run` button above. When you were editing that block, it was outlined in green. After you ran it, this block was selected, which you can tell because it is now outlined in blue. If you want to select a block without editing it, click on it once. You can add your own text block after this one. Make sure this block is selected, then click the `+` button above. Once you have done that, choose `Markdown` from the drop-down list. You can now type any text, and then run the block.

## What is Markdown?
[Markdown](https://www.markdownguide.org/) is a simple markup language. When you edit any text block, you are editing in **markdown**. If you double-click in this block, you will see that I made the heading by typing two hashtags (one hashtag would give me a larger heading and three would give me a smaller one). I made a link to an excellent markdown resource using square brackets and parentheses. I made *this text italic* with a single asterisk at the beginning and the end, and I made **this text bold** with two asterisks at the beginning and the end. I made `this text look like code` by typing a backtick (`` ` ``, right under the escape key in the upper left of the keyboard) at the beginning and the end. Make a new text block below and try out your own markdown. Run your block to see the final result.

## Getting Started With R
As you may have noticed, when you add a new block, it defaults to code. Our notebooks have what is called an **R kernel**, so when we write code, we will be using R, which is a high-level programming language typically used for statistics and data science. R is an **interpreted language**, which means that it is **executed** one line at a time. A code block that you write in a Jupyter notebook may be a single line or a series of lines.

At its simplest, R is like a large calculator. You can see this by creating a new code block below, entering a mathematical expression, and running the block. When you do that, the R interpreter will execute your code, print the results (in this case the result of evaluating the mathematical expression) to the **console** (here the space under the code block), and move on to the next block.

In [5]:
2 + 2

What happens if you enter an incomplete mathematical expression?

In [6]:
2 +

ERROR: Error in parse(text = x, srcfile = src): <text>:2:0: unexpected end of input
1: 2 +
   ^


You may write text in a code block by prefacing it with `#`. The remainder of the line will not be interpreted. This is called a **comment**. What happens when you run the code block below?

In [4]:
#This is a comment.
"This is not a comment."

As you saw, nothing happens to the text that is **commented out**. The other line gets printed verbatim to the console. 
## Hello World
Traditionally, the first thing one does when learning a new programming language is to write a program that prints the phrase "hello world" to the console. In R, that is exceptionally easy. Create a new code block below and write a line of code that prints "hello world" to the console.

In [8]:
"hello world"

## Working With Variables
The real power in programming comes from creating and manipulating **variables**. A variable is simply an **object** you have created that takes the value you assigned to it. Creating a variable in R is very simple. You type the name of the variable, the **assignment operator** (`<-`), and the value you want the variable to take. Run the code below.

In [9]:
a <- 5

Notice that nothing was printed to the console. Nonetheless, R created a variable called `a` that is equal to 5. How do we know this? We can use the `ls()` **function** to list all of the objects in our **working environment**. Run the code below.

In [27]:
ls()

This tells us that we currently have one object in our working environment, and that it is called `a`. We can see the value of `a` by simply entering its name, as below.

In [26]:
a

The value of a variable does not need to be a number; it can also be a mathematical expression. Run the code below to make a new variable.

In [19]:
b <- 5 + 3

What is the value of `b`?

In [22]:
b

You can ask R to print the value of a variable as you create it by wrapping the whole expression in parentheses.

In [23]:
(c <- 7)

You can give a variable any name you want as long as it doesn't start with a number or include spaces or special characters (periods and underscores are allowed). Let's clear out our working environment and create some new variables. You can clear the working environment with `rm(list = ls())`. Run the following code block.

In [2]:
rm(list = ls())
a <- 3
b4 <- 5
cat <- a + b4 - 1

Use the `ls()` function to see what variables are currently in your environment, then check their values.

In [50]:
ls()

Now that we have created these variables, we can use them in mathematical expressions or, as you have seen, to create new variables. You can change the value of a variable any time you want, simply by reassigning it. Run the code below to see an example.

In [51]:
(a <- 9)

Create a new variable called `dog` that is equal to 12.

In [3]:
dog <- 12

As well as addition and subtraction, we can use variables in multiplication, division, and exponentiation:

In [53]:
cat*dog
cat/dog
cat**2
dog/cat*cat
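
As a small aside, R also accepts `^` for exponentiation, and it gives the same result as `**`. The sketch below uses made-up variable names (`cat_example` and `dog_example`, with the same values as `cat` and `dog`) so that running it does not change the variables above.

```r
# Hypothetical stand-ins for the variables used above
cat_example <- 7
dog_example <- 12

cat_example^2                             # exponentiation with ^
cat_example**2                            # same result with **
dog_example / cat_example * cat_example   # evaluated left to right
```

Division and multiplication have the same priority, so the last line divides first and then multiplies.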

You can also create variables whose value is a text string. The string must be wrapped in quotation marks, even if it is only a single word. Run the code below.

In [5]:
g <- "coding is"
hi <- "fun"

We can combine text strings, but not with mathematical operators. Note that the following code does not work.

In [55]:
g + hi

ERROR: Error in g + hi: non-numeric argument to binary operator


To combine the two strings, we need to use the `paste()` function. As you may have noticed by now, functions are denoted with a set of parentheses. The text before the parentheses is the name of the function; the **arguments** to the function go inside the parentheses, separated by commas. The `ls()` function did not take any arguments. The arguments to the `paste()` function are the strings that you want to combine. Run the code below to see how it works.

In [58]:
paste(g, hi)

The `paste()` function can take any number of arguments, and also has an optional final argument, which tells the function how to separate the strings you are combining. The default is a space, but you can make it anything with the `sep=` argument. Try running the examples below:

In [60]:
paste(g, hi, sep="")
paste(g, hi, sep=",")
paste(g, hi, sep=" not ")

Note that R does not care whether the resulting expression is grammatically correct or true. Note also that you may need to use spaces around whatever you are using to separate your strings. There are many other [string functions](https://www.r-bloggers.com/string-functions-in-r/) you can use in R. Two that we will try right now are the `toupper()` and `tolower()` functions, which change the case of your string. Since our strings are already in lower case, let's try `toupper()`.

In [63]:
toupper(hi)

We can also combine string functions.

In [64]:
toupper(paste(g, hi))
tolower(toupper(hi))

Create a new code block below to make some of your own string variables and manipulate them with string functions.

## Data Types
So far, we have made two different **types** of variables: numeric and string. If we ever want to know the type of a variable, we can use the `typeof()` function. Its argument is the variable in question, and it returns the type of the variable. Run the following code.

In [65]:
typeof(a)
typeof(g)

These types indicate how R is storing the data contained in the variables: **double** refers to a number that can have decimals; **character** refers to a text string. Run the following code block.

In [71]:
(twelve <- "12")
typeof(twelve)

Explain what is happening above.

Since the variable `twelve` looks like a number, can we convert it to one? Yes!

In [74]:
as.numeric(twelve)
typeof(as.numeric(twelve))

Note that we did not change the type of the variable `twelve`; we only expressed it as a number. We could have changed it by reassigning the variable like this: `twelve <- as.numeric(twelve)`. Numeric variables can be converted to character variables (using either the `as.character()` function or the `toString()` function), but character variables can't be converted to numeric variables unless their text can be read as a number (for example, `"12"` or `"3.5"`).

In [83]:
(ten <- 10)
typeof(ten)
(ten <- as.character(ten))
typeof(ten)
(ten <- as.numeric(ten))
typeof(ten)

Explain what happened here. How could we have known the type of variable `ten` without using the `typeof()` function?
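
To see what happens when a conversion is not possible, here is a minimal sketch. The variable names `decimal_text` and `not_a_number` are made up for this example. When R cannot read a string as a number, `as.numeric()` returns the missing value `NA` and prints a warning rather than stopping with an error.

```r
# Hypothetical example strings
decimal_text <- "3.5"
not_a_number <- "twelve"

as.numeric(decimal_text)       # a string R can read as a number
as.numeric(not_a_number)       # returns NA, with a warning
typeof(as.character(3.5))      # a number converted to a character string
toString(3.5)                  # toString() also returns a character string
```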

There is one more data type that we will cover at this point, called **logical**. A logical variable can take one of two values: `TRUE` or `FALSE`. Run the code below to see this in action.

In [79]:
newvariable <- 2 == 5 - 3
anothernewvariable <- 2 != 5 - 3
newvariable
anothernewvariable
typeof(newvariable)
typeof(anothernewvariable)

The logical data type is created with **Boolean** operators. These are equal `==`, not equal `!=`, less than `<`, less than or equal to `<=`, greater than `>`, and greater than or equal to `>=`.
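
The code block above only used `==` and `!=`. As a quick sketch of the remaining operators, each comparison below produces a logical value. The variable names `x` and `y` are arbitrary choices for this illustration.

```r
# Hypothetical values to compare
x <- 3
y <- 5

x < y           # TRUE
x <= 3          # TRUE
x > y           # FALSE
y >= 6          # FALSE
typeof(x < y)   # "logical"
```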

## Vectors and Lists
You can combine variables into vectors with the `c()` function, and those vectors can also be saved as variables. All elements in a vector must be of the same type. So we can have a vector of doubles, a vector of characters, or a vector of logicals. If you run the `typeof()` function on a vector, R will return the data type of the vector's elements. As you see below, the variable `vec_a` is a vector containing the numeric variables `a`, `b4`, `cat`, and `dog`. The variable `vec_b` is a vector containing the character variables `g` and `hi`.

In [8]:
(vec_a <- c(a, b4, cat, dog))
(vec_b <- c(g, hi))
typeof(vec_a)
typeof(vec_b)

You can do mathematical operations on numeric vectors. This will be important later.

In [72]:
vec_a * 2
vec_a - 1
sum(vec_a)
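
Arithmetic also works element by element between two vectors of the same length, which is how the data frame column calculations later in this notebook work. A minimal sketch, using two made-up vectors (`morning` and `afternoon` are hypothetical names, not part of the notebook's data):

```r
# Hypothetical counts for three days
morning <- c(1, 2, 3)
afternoon <- c(10, 20, 30)

morning + afternoon    # adds first to first, second to second, and so on
morning * afternoon    # multiplies element by element
```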

You can reference any element of a vector using the name of the vector variable followed by the element number in square brackets.

In [9]:
vec_a[2]

You can also use the `paste()` function to combine elements of a vector, but with the `collapse` argument rather than the `sep` argument. If you use this function on a numeric vector, R will turn the elements into characters before combining them.

In [16]:
paste(vec_b, collapse=" ")
paste(vec_a, collapse=" ")

If you try to create a vector where some elements are numbers and some are characters, R will turn all of them into characters.

In [24]:
(vec_c <- c(1, 2, "fifteen", "twenty"))
typeof(vec_c)

Note that we created `vec_c` from numbers and words, not from variables we had already made. If we want to combine different types of data and preserve their original types, we can make a list with the `list()` function. A list is another data type.

In [25]:
(list_a <- list(a, b4, g))
typeof(list_a)

We can reference any element of a list with the name of the list variable and the number of the element in double square brackets.

In [31]:
list_a[[3]]
typeof(list_a[[3]])

Lists can also contain vectors and other lists.

In [36]:
(list_b <- list(1, 3, list_a, vec_b, 2, vec_a))
typeof(list_b[[3]])
typeof(list_b[[6]])
list_b[[3]][[2]]

## Data Frames
The data type we will be working with most in this class is the data frame. A data frame is an array of vectors where each vector has the same number of elements. The vectors that comprise the data frame can be numeric, character, or some of each. 

To understand data frames, imagine you are running a day care center for pets. You take dogs, cats, and rabbits. The following code makes a series of vectors: `days` lists the days of the week; `dogs` lists the number of dogs you cared for on each day of the week (`dogs[1]` is the number of dogs you had on Monday, `dogs[2]` is the number of dogs you had on Tuesday, etc.); `cats` lists the number of cats you cared for each day; `rabbits` lists the number of rabbits you cared for each day.

In [55]:
days <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
dogs <- c(4, 3, 4, 2, 4)
cats <- c(1, 2, 1, 1, 2)
rabbits <- c(0, 1, 1, 0, 1)

We can now use the `data.frame()` function to convert these vectors into a data frame called `pets`.

In [74]:
pets <- data.frame(days, dogs, cats, rabbits)

Since this is a small data frame, we can view it directly, the same way we would view any other variable.

In [59]:
pets

days,dogs,cats,rabbits
Monday,4,1,0
Tuesday,3,2,1
Wednesday,4,1,1
Thursday,2,1,0
Friday,4,2,1


Note that the names of the vectors that comprise the data frame are now the names of the columns in the data frame. There are numerous ways of accessing the data in a data frame. You can reference a single value: `pets[2,3]` references the second row, third column, which is the number of cats on Tuesday (2). You can reference a whole column in two different ways: `pets$cats` and `pets[,3]` will both give you the `cats` vector. You can also reference whole rows: `pets[2,]` will give you the row for Tuesday. You can reference a subset of rows or columns: `pets[,2:4]` will give you columns 2-4 (`dogs`, `cats`, `rabbits`); `pets[2:4,]` will give you rows 2-4 (Tuesday, Wednesday, Thursday). You can see the column names with the `names()` function.

In [77]:
pets[2,3]
pets$cats
pets[,3]
pets[2,]
pets[,2:4]
pets[2:4,]
names(pets)

days,dogs,cats,rabbits,total
Tuesday,3,2,1,6


dogs,cats,rabbits
4,1,0
3,2,1
4,1,1
2,1,0
4,2,1


days,dogs,cats,rabbits,total
Tuesday,3,2,1,6
Wednesday,4,1,1,6
Thursday,2,1,0,3


We can also create new columns in the data frame. For example, say we wanted a new column indicating the total number of animals on each day. We do it as if we were making a new variable, but using the `dataframe$column` notation.

In [76]:
pets$total <- pets$dogs + pets$cats + pets$rabbits
pets

days,dogs,cats,rabbits,total
Monday,4,1,0,5
Tuesday,3,2,1,6
Wednesday,4,1,1,6
Thursday,2,1,0,3
Friday,4,2,1,7


Now let's make another data frame. This one will have one row per customer per day the customer brought in animals. Columns will list the customer's name, day of the week, number of dogs, number of cats, and number of rabbits.

In [100]:
name <- c("Al", "Bob", "Al", "Carmen", "Dana", "Al", "Bob", "Dana", "Al", "Al", "Bob", "Dana", "Evelyn")
day <- c("Monday", "Monday", "Tuesday", "Tuesday", "Tuesday", "Wednesday", "Wednesday", "Wednesday", "Thursday", "Friday", "Friday", "Friday", "Friday")
dogs <- c(2, 2, 2, 1, 0, 2, 2, 0, 2, 2, 2, 0, 0)
cats <- c(1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0)
rabbits <- c(0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1)
customers <- data.frame(name, day, dogs, cats, rabbits)
customers

name,day,dogs,cats,rabbits
Al,Monday,2,1,0
Bob,Monday,2,0,0
Al,Tuesday,2,1,0
Carmen,Tuesday,1,0,0
Dana,Tuesday,0,1,1
Al,Wednesday,2,1,0
Bob,Wednesday,2,0,0
Dana,Wednesday,0,0,1
Al,Thursday,2,1,0
Al,Friday,2,1,0


Now let's calculate how much we are going to bill each customer on each day. We charge `$20` per day for dogs, `$15` for cats, and `$5` for rabbits. For now, we will not apply any discounts.

In [102]:
customers$bill <- customers$dogs * 20 + customers$cats * 15 + customers$rabbits * 5
customers

name,day,dogs,cats,rabbits,bill
Al,Monday,2,1,0,55
Bob,Monday,2,0,0,40
Al,Tuesday,2,1,0,55
Carmen,Tuesday,1,0,0,20
Dana,Tuesday,0,1,1,20
Al,Wednesday,2,1,0,55
Bob,Wednesday,2,0,0,40
Dana,Wednesday,0,0,1,5
Al,Thursday,2,1,0,55
Al,Friday,2,1,0,55


At the end of the week, Al and Bob complain that they are spending too much on day care for their pets. You want to keep their business and you don't want their animals to be home alone all day, so you consider implementing a discount: 10% off each day a person brings in two pets and 15% off each day a person brings in three or more pets.

In [106]:
#Three or more animals: 15% off. Two animals: 10% off. One animal: no discount.
customers$newbill <- ifelse(customers$cats + customers$dogs + customers$rabbits > 2, customers$bill * 0.85,
            ifelse(customers$cats + customers$dogs + customers$rabbits > 1, customers$bill * 0.9, customers$bill))

In [107]:
customers

name,day,dogs,cats,rabbits,bill,newbill
Al,Monday,2,1,0,55,46.75
Bob,Monday,2,0,0,40,36.0
Al,Tuesday,2,1,0,55,46.75
Carmen,Tuesday,1,0,0,20,20.0
Dana,Tuesday,0,1,1,20,18.0
Al,Wednesday,2,1,0,55,46.75
Bob,Wednesday,2,0,0,40,36.0
Dana,Wednesday,0,0,1,5,5.0
Al,Thursday,2,1,0,55,46.75
Al,Friday,2,1,0,55,46.75


## The Tidyverse
That's a lot of typing! Fortunately, there is a set of packages called the tidyverse that makes manipulating data tables much easier. In order to use external packages, we must install and load them. Installation has already happened in JupyterHub, so we just have to load the tidyverse with the command `library(tidyverse)`.

In [108]:
library(tidyverse)

The tidyverse contains several functions that make it very easy to manipulate data frames. The first one we will use is `mutate()` which creates a new column. There are two ways to use tidyverse functions. The first is with the name of the data frame as the first argument to the function. The second and more straightforward is with pipes (`%>%`). Let's go back to our original customers data frame. We will use the first method to calculate the bill and the second to calculate the total number of animals per customer per day.

In [128]:
customers <- data.frame(name, day, dogs, cats, rabbits)

#First way:
customers <- mutate(customers, bill = dogs * 20 + cats * 15 + rabbits * 5)

#Second way:
customers <- customers %>% mutate(animals = dogs + cats + rabbits)

customers

name,day,dogs,cats,rabbits,bill,animals
Al,Monday,2,1,0,55,3
Bob,Monday,2,0,0,40,2
Al,Tuesday,2,1,0,55,3
Carmen,Tuesday,1,0,0,20,1
Dana,Tuesday,0,1,1,20,2
Al,Wednesday,2,1,0,55,3
Bob,Wednesday,2,0,0,40,2
Dana,Wednesday,0,0,1,5,1
Al,Thursday,2,1,0,55,3
Al,Friday,2,1,0,55,3


Now we will calculate the discounted bill using the `ifelse()` function together with the `mutate()` function. The `ifelse()` function takes three arguments: `ifelse(condition, do if true, do if false)`. The first argument is the test; the second is what to do if the test passes; the third is what to do if the test fails. You can nest `ifelse()` statements, so if one test fails, you can move on to a second test (and so on).
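
Before using it on the data frame, here is a minimal sketch of a nested `ifelse()` on a plain vector. The vector `animal_counts` and the text labels are made up for this illustration. `ifelse()` checks each element in turn, so the result has one value per element.

```r
# Hypothetical numbers of animals brought in by three customers
animal_counts <- c(1, 2, 3)

# First test: more than 2? If not, second test: more than 1?
ifelse(animal_counts > 2, "three or more",
       ifelse(animal_counts > 1, "two", "one"))
```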

In [118]:
#Three or more animals: 15% off. Two animals: 10% off. One animal: no discount.
customers <- customers %>% mutate(newbill = ifelse(animals > 2, bill * 0.85, ifelse(animals > 1, bill * 0.9, bill)))
customers

name,day,dogs,cats,rabbits,bill,animals,newbill
Al,Monday,2,1,0,55,3,46.75
Bob,Monday,2,0,0,40,2,36.0
Al,Tuesday,2,1,0,55,3,46.75
Carmen,Tuesday,1,0,0,20,1,20.0
Dana,Tuesday,0,1,1,20,2,18.0
Al,Wednesday,2,1,0,55,3,46.75
Bob,Wednesday,2,0,0,40,2,36.0
Dana,Wednesday,0,0,1,5,1,5.0
Al,Thursday,2,1,0,55,3,46.75
Al,Friday,2,1,0,55,3,46.75


Explain what we just did.

There are three other tidyverse functions you will need to do this week's lab: `arrange()`, `select()`, and `filter()`. You can use `arrange()` to organize your data frame in ascending or descending order on the basis of one or more columns. For example, we can arrange our data alphabetically by customer:

In [131]:
customers %>% arrange(name)

name,day,dogs,cats,rabbits,bill,animals
Al,Monday,2,1,0,55,3
Al,Tuesday,2,1,0,55,3
Al,Wednesday,2,1,0,55,3
Al,Thursday,2,1,0,55,3
Al,Friday,2,1,0,55,3
Bob,Monday,2,0,0,40,2
Bob,Wednesday,2,0,0,40,2
Bob,Friday,2,0,0,40,2
Carmen,Tuesday,1,0,0,20,1
Dana,Tuesday,0,1,1,20,2


Within the record for each customer, we can also organize by number of animals brought in each day:

In [133]:
customers %>% arrange(name, animals)

name,day,dogs,cats,rabbits,bill,animals
Al,Monday,2,1,0,55,3
Al,Tuesday,2,1,0,55,3
Al,Wednesday,2,1,0,55,3
Al,Thursday,2,1,0,55,3
Al,Friday,2,1,0,55,3
Bob,Monday,2,0,0,40,2
Bob,Wednesday,2,0,0,40,2
Bob,Friday,2,0,0,40,2
Carmen,Tuesday,1,0,0,20,1
Dana,Wednesday,0,0,1,5,1


We can do the same thing in descending order of number of animals (this really only makes a difference for Dana, who brings her rabbit on Wednesday and her cat on Friday and both on Tuesday):

In [135]:
customers %>% arrange(name, -animals)

name,day,dogs,cats,rabbits,bill,animals
Al,Monday,2,1,0,55,3
Al,Tuesday,2,1,0,55,3
Al,Wednesday,2,1,0,55,3
Al,Thursday,2,1,0,55,3
Al,Friday,2,1,0,55,3
Bob,Monday,2,0,0,40,2
Bob,Wednesday,2,0,0,40,2
Bob,Friday,2,0,0,40,2
Carmen,Tuesday,1,0,0,20,1
Dana,Tuesday,0,1,1,20,2


The `filter()` function allows us to see only the rows where a certain condition is true. For example, say we only wanted to see the records for Bob:

In [136]:
customers %>% filter(name == "Bob")

name,day,dogs,cats,rabbits,bill,animals
Bob,Monday,2,0,0,40,2
Bob,Wednesday,2,0,0,40,2
Bob,Friday,2,0,0,40,2


What if we wanted to see everyone except Bob?

The `select()` function allows us to see only a subset of the columns. For example, if we just wanted to see name, day of the week, and total number of animals for each customer, there are two ways to do it:

In [138]:
#First way:
customers %>% select(name, day, animals)

#Second way:
customers %>% select(-dogs, -cats, -rabbits, -bill, -newbill)

name,day,animals
Al,Monday,3
Bob,Monday,2
Al,Tuesday,3
Carmen,Tuesday,1
Dana,Tuesday,2
Al,Wednesday,3
Bob,Wednesday,2
Dana,Wednesday,1
Al,Thursday,3
Al,Friday,3


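ERROR: Error: `NULL` must evaluate to column positions or names, not a double vector

The second way produced an error in this run. A likely cause is that `customers` did not contain a `newbill` column when the block was run (note that the tables printed here have no `newbill` column), and `select()` cannot drop a column that does not exist. Removing `-newbill` from the call, or first re-running the block that creates `newbill`, should resolve it.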


Note that when we were working with `arrange()`, `filter()`, and `select()`, we did not change the `customers` data frame: we only viewed the results of our actions. As you can see, the `customers` data set remains as it previously was:

In [139]:
customers

name,day,dogs,cats,rabbits,bill,animals
Al,Monday,2,1,0,55,3
Bob,Monday,2,0,0,40,2
Al,Tuesday,2,1,0,55,3
Carmen,Tuesday,1,0,0,20,1
Dana,Tuesday,0,1,1,20,2
Al,Wednesday,2,1,0,55,3
Bob,Wednesday,2,0,0,40,2
Dana,Wednesday,0,0,1,5,1
Al,Thursday,2,1,0,55,3
Al,Friday,2,1,0,55,3


If we had wanted to overwrite the `customers` dataset with the results of any of these operations, we simply would have had to add `customers <- ` at the beginning, just as we did when we were working with the `mutate()` function. Alternatively, we could have saved the results as a new data frame, by giving it a new name (`newname <- customers %>% arrange(...)`).
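
As a sketch of that last idea, the block below filters, selects, and sorts in a single pipeline and saves the result under a new name. It assumes the `dplyr` package (part of the tidyverse) is installed; the small data frame `visits` and the name `busy_days` are made up so that the example runs on its own. It also uses `desc()`, a `dplyr` helper for sorting in descending order, as an alternative to the minus sign used above.

```r
library(dplyr)

# Hypothetical miniature version of the customers data
visits <- data.frame(
  name = c("Al", "Bob", "Dana", "Dana"),
  day = c("Monday", "Monday", "Tuesday", "Wednesday"),
  animals = c(3, 2, 2, 1)
)

# Keep visits with at least two animals, keep three columns,
# sort by number of animals (largest first), and save under a new name
busy_days <- visits %>%
  filter(animals >= 2) %>%
  select(name, day, animals) %>%
  arrange(desc(animals), name)

busy_days
visits   # the original data frame is unchanged
```

Because the result was assigned to `busy_days`, the `visits` data frame still has all four of its rows.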