# Introduction to Coding in R
## Political Science 3 Discussion Week 2 - Clara Hu

Today, we'll focus on the basics of coding in R and on getting familiar with the Jupyter notebook environment.

Note: much of this material is adapted from the [D-Lab](https://dlab.berkeley.edu/)'s R Bootcamp.  
Credit: Adapted from notebook by Ian Castro

## Jupyter notebooks
This webpage is called a Jupyter notebook. A notebook is a place to write programs and view their results.

### Text cells
In a notebook, each rectangle containing text or code is called a *cell*.

Text cells (like this one) can be edited by double-clicking on them. They're written in a simple format called [Markdown](http://daringfireball.net/projects/markdown/syntax) to add formatting and section headings.  You don't need to learn Markdown, but you might want to.

After you edit a text cell, select the "run cell" button at the top that looks like ▶| to confirm any changes. You can also press Shift + Enter to run the cell.

Q1: Double click me, a text cell, and complete the sentence:

My name is Clara and my favorite color is green

Don't forget to run the cell to save your answer!

### Code cells

The majority of our analysis will be completed in *code cells*. They look like the one below. A cell can contain multiple lines of code. When you run a cell, its lines are executed in the order in which they appear.

In [3]:
3 + 5 # This is a comment. Run this cell!

Notice that when you run a cell, a number appears in the brackets to its left (for example, `[3]:`). This number shows the order in which cells have been run: `[3]:` means this was the third code cell run in the current session.

Sometimes, when you make a mistake, you'll see an error. We'll get more practice with these later (and you'll see them a lot as you learn how to code), but these are usually messages that tell you that something about your syntax (the way you wrote the code) is incorrect. When you get an error, you need to fix it and then re-run the cell to get a correct output.

In [4]:
3 +

ERROR: Error in parse(text = x, srcfile = src): <text>:2:0: unexpected end of input
1: 3 +
   ^


### The Kernel

The kernel is a program that executes the code inside your notebook and outputs the results. In the top right of your window, you can see a circle that indicates the status of your kernel. If the circle is empty (⚪), the kernel is idle and ready to execute code. If the circle is filled in (⚫), the kernel is busy running some code. 

You may run into problems where your kernel is stuck for an excessive amount of time, your notebook is very slow and unresponsive, or your kernel loses its connection. If this happens, try the following steps:
1. At the top of your screen, select **Kernel**, then **Interrupt**.
2. If that doesn't help, select **Kernel**, then **Restart**. If you do this, you will have to run your code cells from the start of your notebook up until where you paused your work.
3. If that doesn't help, restart your server. First, save your work by selecting **File** at the top left of your screen, then **Save and Checkpoint**. Next, select **Control Panel** at the top right. Choose **Stop My Server** to shut it down, then **My Server** to start it back up. Then, navigate back to the notebook you were working on.

## Coding in R

Now that we've gotten that out of the way, let's go learn how to code in R!


## Quick Check

Let's practice some R code. The temperature in Berkeley on 8/25 at 4 PM was 68 degrees Fahrenheit, and it is saved in the variable **`berkeley_f`**. Write some code to convert this temperature into Celsius. **Make sure you save the result as the variable `berkeley_c`.**

The formula for converting Fahrenheit to Celsius is as follows:

"Take the temperature in Fahrenheit, subtract 32, and then multiply by five ninths to get the temperature in Celsius."

In [3]:
berkeley_f <- 68
berkeley_c <- (berkeley_f - 32) * 5 / 9

In [4]:
berkeley_c

## Arithmetic in R

As a calculator, R works great. Notice that we can do most of the mathematical operations we expect, like addition, subtraction, multiplication, and division.

In [None]:
2 + 2 # add
6 - 3 # subtract
2 * 3.14159 # multiply
10 / 5 # divide

# More advanced:
3^4 # power
23 %/% 2 # floor division
23 %% 2 # remainder

In [11]:
ceiling(23 / 2) # regular division, then rounded up to the nearest whole number
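To see how floor division, the remainder, and rounding relate to one another, here is a small sketch using only the operators introduced above plus base R's `floor` function. Floor division and the remainder together rebuild the original number.

```r
23 %/% 2                    # 11: how many whole times 2 fits into 23
23 %% 2                     # 1: what is left over
2 * (23 %/% 2) + (23 %% 2)  # 23: quotient and remainder rebuild the original

floor(23 / 2)               # 11: regular division rounded down, same as 23 %/% 2
ceiling(23 / 2)             # 12: regular division rounded up
```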

In [13]:
# PEMDAS - Order of operations matters; we can use parentheses to change it
12 - 4 * 3
(12 - 4) * 3

Sometimes, we want to do more than just these calculations. We can use something called a **function**, which introduces extra functionality to our code. In this case, we can calculate square roots and logarithms.

The way this works is as follows - assuming we have some random function called `func`:

`func(arguments)`

If you're baking a cake, you can think of `arguments` as the ingredients; the function as the process of baking; and the output as the final result, the cake. Notice that, to use or "call" a function, we need to follow the name of the function with a set of parentheses `(...)`. 

In [14]:
sqrt(16) # functions
log(10) # natural log
log(100, base = 10) # specify if we want a different base; "base" is an optional/additional argument

In [16]:
sqrt(x = 16)
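Arguments can be matched to a function either by their position or by their name. The sketch below uses only base R functions (`log`, `log10`, and `round`) to show that the different ways of writing a call give the same result.

```r
log(100, base = 10)         # 2: first argument by position, "base" by name
log(x = 100, base = 10)     # 2: both arguments by name
log(base = 10, x = 100)     # 2: named arguments can appear in any order
log10(100)                  # 2: a shortcut function for base-10 logs

round(3.14159, digits = 2)  # 3.14: "digits" is another optional argument
```

Naming arguments takes a little more typing, but it makes it clearer what each value is for.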

Now, what if we want to do a complicated calculation, like:

"What is the square root of pi times four, plus 64, divided by 12?"

We can do that in one line, but you'll notice it's a bit hard to read.

In [17]:
(sqrt(3.14159 * 4) + 64) / 12

What we can do instead is something called **variable assignment**. In other words, we can give a calculation a name, and then reference the value of that calculation by using that name. This makes things easier for us because we can split our code into separate lines, which is easier to read and makes us less likely to make mistakes when doing calculations.

We assign variables as follows:

`variable <- <insert some calculation here>`

This tells R: "I want to assign the value of the calculation on the right side of the arrow to the name `variable`."

Let's see an example:

In [4]:
# Note: R already has a built-in constant called pi; this line overwrites it with a rounded value
pi <- 3.14159
four_pi <- 4 * pi
root_four_pi <- sqrt(four_pi)
(root_four_pi + 64) / 12
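R also comes with a built-in constant named `pi`, so typing out the digits is optional. Here is a sketch of the same calculation using the built-in value. The variable names ending in `_builtin` are ours, chosen so they don't overwrite anything above, and `base::pi` asks for R's own copy of pi even if an earlier cell has reassigned the name `pi`.

```r
four_pi_builtin <- 4 * base::pi             # base::pi is R's built-in value of pi
root_four_pi_builtin <- sqrt(four_pi_builtin)
answer_builtin <- (root_four_pi_builtin + 64) / 12
answer_builtin                              # very close to the result above
```

Because the built-in constant carries more digits than 3.14159, the two answers differ slightly after the first few decimal places.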

## Tables and Datasets in R

As political researchers, we will be working with data instead of using R as a glorified calculator. Let's quickly introduce how we can work with datasets in R.

In [5]:
# This gives us the data that we need to work with. Run this cell.
install.packages("gapminder") 
library(gapminder)

Installing package into ‘/opt/r’
(as ‘lib’ is unspecified)



In [11]:
# Run this cell too; the cell above imported the data we want to work with, in the name "gapminder"
# This is primarily demographic data from around the world. 
# head(gapminder, 3) # The "head" function shows only the first few rows (here, the first 3)
tail(gapminder, 3) # The "tail" function shows only the last few rows (here, the last 3)

country,continent,year,lifeExp,pop,gdpPercap
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>
Zimbabwe,Africa,1997,46.809,11404948,792.45
Zimbabwe,Africa,2002,39.989,11926563,672.0386
Zimbabwe,Africa,2007,43.487,12311143,469.7093


In general, a column represents a variable, which is some information about every single individual in the dataset. A row represents an individual entry, which contains that individual's data for every single variable.
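To make the row and column idea concrete, here is a minimal sketch with a tiny hand-built table, so it runs even without the gapminder package. The name `toy` is hypothetical, and the values are copied from the three Zimbabwe rows printed above.

```r
# A small table with 3 rows (entries) and 3 columns (variables)
toy <- data.frame(
  country = c("Zimbabwe", "Zimbabwe", "Zimbabwe"),
  year    = c(1997, 2002, 2007),
  lifeExp = c(46.809, 39.989, 43.487)
)

nrow(toy)   # 3: the number of rows (entries)
ncol(toy)   # 3: the number of columns (variables)
names(toy)  # the variable names: "country" "year" "lifeExp"
toy[2, ]    # the second row: one entry, with its value for every variable
```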

We can look at a specific subset of the data that satisfies a certain condition with the `subset` function, using the format `subset(dataset, condition)`. Commonly used relational operators for writing conditions include `==` (equal to), `!=` (not equal to), `>`, `<`, `>=`, and `<=`.

In [12]:
# Let's find the first few lines for the subset of the
# gapminder dataset where the lifeExp is less than 50.
head(subset(gapminder, lifeExp < 50))

country,continent,year,lifeExp,pop,gdpPercap
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>
Afghanistan,Asia,1952,28.801,8425333,779.4453
Afghanistan,Asia,1957,30.332,9240934,820.853
Afghanistan,Asia,1962,31.997,10267083,853.1007
Afghanistan,Asia,1967,34.02,11537966,836.1971
Afghanistan,Asia,1972,36.088,13079460,739.9811
Afghanistan,Asia,1977,38.438,14880372,786.1134


In [17]:
# We can also subset using conditions to match character values.
# Let's find the first few lines for the subset of the dataset
# where the country is Canada.
head(subset(gapminder, country == 'Canada'))

country,continent,year,lifeExp,pop,gdpPercap
<fct>,<fct>,<int>,<dbl>,<int>,<dbl>
Canada,Americas,1952,68.75,14785584,11367.16
Canada,Americas,1957,69.96,17010154,12489.95
Canada,Americas,1962,71.3,18985849,13462.49
Canada,Americas,1967,72.13,20819767,16076.59
Canada,Americas,1972,72.88,22284500,18970.57
Canada,Americas,1977,74.21,23796400,22090.88
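Conditions can also be combined with `&` (and) or `|` (or). Here is a minimal sketch on a small hand-built table, so it runs without the gapminder package. The name `toy` is hypothetical, and the rows are copied from the Afghanistan and Canada outputs above.

```r
toy <- data.frame(
  country = c("Afghanistan", "Afghanistan", "Canada", "Canada"),
  year    = c(1952, 1977, 1952, 1977),
  lifeExp = c(28.801, 38.438, 68.75, 74.21)
)

subset(toy, lifeExp < 50)                       # the two Afghanistan rows
subset(toy, country == "Canada" & year > 1960)  # only the Canada 1977 row
subset(toy, country != "Canada")                # every row that is not Canada
subset(toy, lifeExp < 30 | lifeExp > 70)        # Afghanistan 1952 and Canada 1977
```

The same pattern should work on the full dataset by replacing `toy` with `gapminder`.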


If we want to pull a column out of the dataset, we can use dollar sign $ notation: `dataset$variable`

In [18]:
# Let's get all of the values from the gdpPercap column in the dataset.
# Use the head function, because there are A LOT of values. 
head(gapminder$gdpPercap)

There are some functions that work on collections of data, like the column we got above. One example is `mean`, which finds the average value of the set of numbers you put in.

Just like we learned earlier, all you need to do is take the function and call it (use parentheses) on the values you want to calculate a statistic for.

In this case, let's calculate the average gdpPercap for ALL of the values in the gapminder dataset.

In [19]:
mean(gapminder$gdpPercap)

Other examples of useful functions include `sd` (standard deviation), `median` (the middle number), `max`, `min`, and `range`. In R, the `range` function returns a vector containing the lowest and highest values, in that order. To find the range in the mathematical sense (a single number), subtract the output of `min` from the output of `max`.
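As a quick illustration of the difference between R's `range` and the mathematical range, here is a sketch on a short vector. The name `life_exp` is hypothetical, and its three values are the Zimbabwe life expectancies printed earlier.

```r
life_exp <- c(46.809, 39.989, 43.487)

range(life_exp)                # 39.989 46.809: the lowest value, then the highest
max(life_exp) - min(life_exp)  # 6.82: the mathematical range, a single number
diff(range(life_exp))          # 6.82: the same thing, computed from range's output

median(life_exp)               # 43.487: the middle value once sorted
mean(life_exp)                 # the average of the three values
sd(life_exp)                   # the standard deviation
```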

Practice using these functions with variables from the dataset and interpret the outputs!

In [None]:
...

And that's it for now! We'll get a lot more practice on all of this in the weeks ahead.