# R Haas Departmental Workshop

The goal of this Departmental Workshop is to provide a light introduction to R. We aim to build up to data frames, which are one of the most commonly used data structures in R.

## Section 1: Jupyter Notebooks

This workshop uses a Jupyter Notebook to interact with R. The bit of extra setup is well worth it because the Notebook provides code completion and other helpful features.

Notebook files have the extension ".ipynb" to distinguish them from plain R script (e.g., ".R") files.


### Open Jupyter Notebook on Haas DataHub

We *strongly* recommend using the Haas DataHub to run the materials for these lessons. You can access the DataHub by clicking this button:

[![DataHub](https://img.shields.io/badge/launch-datahub-blue)](https://mba200a-fall-2022.haastech.org/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fdlab-berkeley%2FR-Haas-Workshop&urlpath=tree%2FR-Haas-Workshop%2Flessons%2F01_introduction.ipynb&branch=main)

Some users may have to click the link twice if the materials do not load initially.

The DataHub downloads this repository, along with any necessary packages, and allows you to run the materials in an instance on UC Berkeley's servers. No installation is needed from your end - you only need an internet browser and a CalNet ID to log in. By using the DataHub, you can save your work and come back to it at any time.

### Navigating in Jupyter Notebook

Jupyter Notebooks have more useful features for interactive use than a standard interpreter, but they work in the same basic way: you type things and then execute them.

Unlike many graphical systems, you don't need to press a button to run your code! Instead, **you typically run code using Shift-Enter**. This also moves you to the next box (or "cell") for code below the one you just ran.

Using Jupyter notebooks has several advantages:

- You can easily type, edit, and copy and paste blocks of code.
- Tab completion allows you to easily access the names of things you are using and learn more about them.
- It allows you to annotate your code with links, different sized text, bullets, etc., to make it more accessible to you and your collaborators.
- It allows you to display figures next to the code that produces them to tell a complete story of the analysis.
- The notebook is stored as JSON but can be exported as an R script (a .r file) if you would like to run it from the Bash shell or an R interpreter.

## Section 2: Variable Assignment

We'll start our foray into R with variables, which are one of the most fundamental concepts in programming. A variable is simply a placeholder for another value.

Try assigning the value "5" to the variable `number` and then run `number` in its own cell.

In [None]:
# The variable "number" is a placeholder for "5"
number <- 5

In [None]:
number

In [None]:
# You can also use the '=' operator to do variable assignment. 
number = 5

In [None]:
number

There are subtle differences between '<-' and '=', which won't matter in most cases. However, using '<-' is considered good code style. Adhering to good stylistic practices makes your code easier for others to read and use.
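
One of those subtle differences shows up inside a function call. The sketch below is only an illustration: the names `avg_named`, `avg_assigned`, and `scores` are made up for this example, and it uses the built-in `mean()` function, whose first argument is called `x`.

```r
# With '=' inside a call, the left-hand side names the argument 'x' of mean();
# it does not create a variable called 'x' in your workspace.
avg_named <- mean(x = c(1, 2, 3))
avg_named

# With '<-' inside a call, R first creates the variable 'scores'
# and then passes its value to mean().
avg_assigned <- mean(scores <- c(1, 2, 3))
avg_assigned
scores
```

Both calls compute the same average, but only the second one leaves a new variable behind. This is one reason style guides tend to reserve '<-' for assignment and '=' for naming arguments.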

In [None]:
# You can perform basic arithmetic in R
number + 1
number - 2
number * 3
number / 4

Use a hash symbol (`#`) to comment your code (e.g., write notes to your future self and your collaborators) to help keep your script organized.

## Section 3: Functions and Arguments

A recurring theme in programming is **abstraction**. We use a placeholder to _represent_ some other value or operation, so that we don't have to keep typing out those values or operations repeatedly. Variables are an abstraction of values: for example, rather than always having to type out the value for pi, we can store it in a variable called `pi` (R already provides this one as a built-in constant), and simply reference that. In the same vein, **functions** are abstractions of sometimes complex actions.

**Functions** perform actions on inputs. They are followed by trailing round parentheses.

**Arguments** are the inputs - values, expressions, text, entire datasets, etc. You tell a function what arguments it needs inside the parentheses. Sometimes, these arguments are "named". This is helpful when you need to enter multiple arguments: the names tell R which argument each value you pass into the function corresponds to.
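
As a small illustration of named arguments, the sketch below uses the built-in `seq()` function, which builds a sequence of numbers from its `from`, `to`, and `by` arguments. The variable names `by_position` and `by_name` are made up for this example.

```r
# Positional matching: values are matched to arguments in order (from, to, by).
by_position <- seq(2, 10, 2)
by_position

# Named matching: each value is labeled, so the order no longer matters.
by_name <- seq(by = 2, to = 10, from = 2)
by_name
```

Both calls return the same sequence. The named version is longer to type, but a reader does not have to remember the argument order to understand it.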

Let's explore some functions that are **built-in** to R. You get them for free just by having R installed.

In [None]:
# Use the ls() function to see all of the variables you have defined.
# Notice that ls() does not take any arguments!
ls()

In [None]:
# You can use the "TAB" key to autocomplete a variable.
# Place your cursor after the 'b' in 'numb' below and press TAB.
# This works for variables and functions alike.
numb

In [None]:
# The class() function tells the data class/type of the variable and requires one argument
class(number)

In [None]:
# Removing Variables. rm() will remove a variable:
rm(number)

In [None]:
ls()

In [None]:
number # Error

In [None]:
# Remove all variables with rm(list = ls()).
# Notice that this is the first function we're using with a named argument!
number1 <- 2
number2 <- 3
ls()

In [None]:
rm(list = ls()) 
ls()

### Challenge 1: Variable Assignment
Define three variables and then write a mathematical expression using only those variables.



In [None]:
# YOUR CODE HERE

## Section 4: Data Types

There are five main types of data we will work with in R:
1. numeric: decimals (the default for ALL numbers in R).
2. integer: whole numbers (positive and negative, including zero).
3. character: text strings (always wrapped in quotations).
4. logical: TRUE or FALSE (1 or 0).
5. factor: nominal or ordinal categorical type.
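
As a quick preview, here is one value of each type passed to `class()`. This is only a sketch: the `L` suffix and the `factor()` function are not covered elsewhere in this workshop and appear here just to produce an integer and a factor.

```r
class(5)              # numeric: the default for numbers
class(5L)             # integer: the L suffix asks R for a whole number
class("five")         # character
class(TRUE)           # logical
class(factor("low"))  # factor
```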

### Section 4.1: Numerics

Let's assign 5 to 'number' and check its class. 

In [None]:
number <- 5
number
class(number)

### Section 4.2: Integers

We can coerce variables, or force them to change type. For example, let's convert 'number' to integer type with the `as.integer()` function:

In [None]:
number_int <- as.integer(number)
number_int
class(number_int)
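
Coercion can change the value as well as the type. In the sketch below, `price`, `price_int`, and `rounded` are hypothetical names chosen for illustration. It shows that `as.integer()` drops the decimal part, whereas the built-in `round()` function goes to the nearest whole number but keeps the numeric type.

```r
price <- 5.9                     # hypothetical value for illustration
price_int <- as.integer(price)   # drops the decimal part; it does not round
price_int
class(price_int)

rounded <- round(price)          # goes to the nearest whole number instead
rounded
class(rounded)                   # still numeric: rounding does not change the type
```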

### Section 4.3: Characters

The character data type consists of text. The key to this data type is the usage of quotes (single or double) to indicate that we're dealing with text. Let's try it out:

In [None]:
welcome <- "Welcome to the D-Lab"
class(welcome)

In [None]:
# Single and double quotes work similarly:
contraction <- 'I am hungry.'
contraction

contraction <- "I am hungry."
contraction

In [None]:
# You can nest single quotes inside of double quotes:
contraction <- "I'm hungry"
contraction

However, you cannot nest single quotes inside of single quotes unless you escape them with a backslash.
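
If you do need a single quote inside single-quoted text, one option is to escape it with a backslash. The sketch below compares that with the double-quote approach; the variable names `escaped` and `double_quoted` are made up for this example.

```r
# Escape the inner single quote with a backslash:
escaped <- 'I\'m hungry'
escaped

# Or wrap the text in double quotes instead:
double_quoted <- "I'm hungry"
double_quoted

# Both produce exactly the same text:
escaped == double_quoted
```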

### Section 4.4: Logicals

Programming often relies on making decisions _conditional_ on something being true or false. Logical data types provide the infrastructure to allow for these kinds of control structures. 

Specifically, logical data consists of only two values: TRUE (or 1) and FALSE (or 0).

In [None]:
class(TRUE)
class(FALSE)

In [None]:
# Since TRUE and FALSE are stored as 1 and 0, they take on mathematical properties:
TRUE + 2
FALSE - 4

In [None]:
# Comparison operators evaluate whether a statement is TRUE or FALSE. Check the following:
FALSE < TRUE # less than
TRUE >= TRUE # greater than or equal to
FALSE == FALSE # equivalent to (equal to)
"Mac" == "mac" # R is case sensitive
FALSE != FALSE # not equivalent to (not equal to)
"PC" != "Windows"

In [None]:
# Boolean 'and' (all conditions must be satisfied):
TRUE & TRUE 
TRUE & FALSE

In [None]:
# Boolean "or" (just one condition must be satisfied):
TRUE | TRUE 
TRUE | FALSE
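
In practice, `&` and `|` usually combine comparisons rather than literal TRUE and FALSE values. The sketch below uses a hypothetical variable `age` and made-up cutoffs purely for illustration.

```r
age <- 25   # hypothetical value for illustration

# 'and': TRUE only if both comparisons are TRUE
in_range <- age >= 18 & age < 65
in_range

# 'or': TRUE if at least one comparison is TRUE
out_of_range <- age < 18 | age >= 65
out_of_range
```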

### Section 4.5: Factors

A **factor** variable is a set of categorical or ordinal values. We won't cover factors in this workshop, but check out our [Fundamentals](https://github.com/dlab-berkeley/R-Fundamentals) workshop to learn more!

### Challenge 2: Data type coercion

Like `as.integer`, other "as dot" functions exist as well, such as `as.numeric`, `as.character`, `as.logical`, and `as.factor`.

1. Define three variables: one numeric, one character, and one logical

In [None]:
# YOUR CODE HERE

2. Can you convert numeric to integer type?
3. Convert numeric to logical?
4. Convert numeric to character?
5. Convert logical to character?
6. Convert character to numeric?


In [None]:
# YOUR CODE HERE

## Section 5: Data Structures

Data structures are useful ways of representing and organizing data in R. There are several data structures we can construct, but we'll focus on two:

1. `c()`: ordered groupings of the SAME type of data (called "vectors").
2. `data.frame()`: an ordered group of equal-length vectors; think of an Excel spreadsheet, with rows and columns.

### Section 5.1: Vectors
A vector is an ordered group of the *same* type of data. We can create vectors by "concatenating" data together with the `c()` function:


In [None]:
vec <- c(2, 5, 8, 11, 14)
vec

It does not matter what type of data the vector contains, as long as it is all the same type:

In [None]:
numeric_vector <- c(234, 31343, 78, 0.23, 0.0000002)
numeric_vector

In [None]:
class(numeric_vector)
length(numeric_vector) # There are five elements in this vector.
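
Because a vector can hold only one type, R coerces mixed inputs to a common type instead of raising an error. The sketch below illustrates this; `mixed_vector` and `number_and_logical` are hypothetical names.

```r
# Mixing a number, text, and a logical: everything becomes character.
mixed_vector <- c(1, "two", TRUE)
mixed_vector
class(mixed_vector)

# Mixing numbers and logicals: TRUE becomes 1 and FALSE becomes 0.
number_and_logical <- c(1, TRUE, FALSE)
number_and_logical
class(number_and_logical)
```

This silent conversion is worth keeping in mind: a single stray text value in a column of numbers turns the whole vector into character data.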

#### Indexing a vector
To index a vector means to extract an element based on its position. For example, if we want to return just the third element from "numeric_vector", we would use square brackets to _index_ the third position:

In [None]:
numeric_vector[3]

The colon operator creates a sequence of consecutive integers. We can use such a sequence inside square brackets to index multiple values in a row, or on its own to build a vector:

In [None]:
colon_vector <- c(28:36)
colon_vector 
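
Here is a short sketch of indexing several positions at once. It re-creates `numeric_vector` from above so the cell runs on its own; using `c()` inside the brackets to pick non-adjacent positions is an addition for illustration.

```r
numeric_vector <- c(234, 31343, 78, 0.23, 0.0000002)

# A colon inside the brackets extracts the second through fourth elements:
numeric_vector[2:4]

# c() inside the brackets picks positions that are not next to each other:
numeric_vector[c(1, 5)]
```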

Vectors can contain other types, too. Consider the following example:

In [None]:
character_vector <- c("Canada", "United States", "Mexico")
character_vector
class(character_vector)

## Section 6: Data frames

Why do we need a data frame? Think about datasets that you have seen before. For example, suppose we collected data on the characteristics of D-Lab Workshop learners. We might want to know the age, degree program, previous familiarity with programming, research interests, and likely many other attributes (variables). 

This kind of dataset is multidimensional. We have one row for each participant and a number of columns for each attribute we collect data on. If we had forty participants and collected 10 attributes for each participant, then we would have a 40 by 10 dataset.

The data structure in R that is most suited for this kind of problem is the data frame. 
A data frame is an ordered group of equal-length vectors. Data frames are the most common type of data structure used for data analyses. Most of the time when we load real data into R, we are loading that data into a data frame. 

Since each column is a vector, a column can only contain a single data type, but columns of different types can be lined up next to each other.

Meanwhile, rows can contain heterogeneous data.

Let's create a data frame capturing some information about countries:




In [None]:
countries <- c("Canada", "Mexico", "United States")
populations <- c(10, 20, 30)
areas <- c(30, 10, 20)

In [None]:
# We can create the data frame with the data.frame() function.
# The equal-length vectors are the arguments.
# Notice that the name of each variable becomes the name of the column.
df <- data.frame(countries, populations, areas)
df

In [None]:
# To change the column names, we can pass the vectors as named arguments:
df <- data.frame(country = countries, population = populations, area = areas)
df

In [None]:
# Check the compact structure of the data frame:
str(df)

In [None]:
# View the dimensions (nrow x ncol) of the data frame:
dim(df) 

In [None]:
# View column names:
colnames(df)

In [None]:
# View row names (we did not change these and they default to character type):
rownames(df)
class(rownames(df))

In [None]:
# You can extract a single column with the $ operator:
df$country

In [None]:
# The $ operator can also be used to create new columns:
df$density <- df$population / df$area
df
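
The square brackets used to index vectors also work on data frames, where they take a row position and a column position separated by a comma. The sketch below re-creates the data frame with the same placeholder values so the cell runs on its own; the `[row, column]` form and `sapply()` are not covered elsewhere in this workshop and appear here only as an illustration.

```r
df <- data.frame(
  country = c("Canada", "Mexico", "United States"),
  population = c(10, 20, 30),
  area = c(30, 10, 20)
)

# Each column is a vector with a single type:
sapply(df, class)

# A row cuts across columns, so it can mix types.
# Leave the column position empty to get the whole second row:
df[2, ]

# Leave the row position empty to get a whole column by name:
df[, "population"]
```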

### Challenge 3: Make your own data frame.
1. Create a data frame that contains four different food items and three attributes for each: name, price, and quantity.
2. Add a fourth column that calculates the total cost for each food item.
3. What function could you use to calculate the total cost of all the food items combined?

In [None]:
# YOUR CODE HERE

This concludes our brief introduction to R!

This introduction mostly consisted of creating different kinds of variables. But what if we want to _do_ things with those variables? R provides many tools for further analysis of, for example, your data frame. We encourage you to check out [R-Fundamentals](https://github.com/dlab-berkeley/R-Fundamentals) to learn more about these additional functionalities!