## POLSCI 3 Summer 2022

## Week 5: Analyzing Data using R in Jupyter Notebooks


Welcome! In this notebook, you will learn how to use Jupyter Notebooks (like this one!) and the R programming language to analyze quantitative data.

# Jupyter Notebooks

A Jupyter Notebook is an online, interactive computing environment, composed of different types of __cells__. Cells are chunks of code or text that are used to break up a larger notebook into smaller, more manageable parts and to let the viewer modify and interact with the elements of the notebook.

There's an advantage of Jupyter notebooks for a class like PS 3: You don't need to install software on your computer (and potentially have to troubleshoot it if it doesn't work).


Notice that the notebook consists of 2 different kinds of cells: **text** and **code**. A text cell (like this one) contains text, while a code cell contains expressions in R, the programming language that you will be using. 

"Running" a cell is similar to pressing 'Enter' on a calculator once you've typed in an expression; it computes all of the expressions contained within the cell.

To run a code cell, you can do one of the following:
- press __Shift + Enter__
- click __-> Run__ in the toolbar at the top of the screen.

You can navigate the cells by either clicking on them or by using your up and down arrow keys. Try running the cell below to see what happens. 

In [1]:
5 + 5

The input of the cell consists of the text/code that is contained within the cell's enclosing box. Here, the input is an R expression that adds two numbers; R evaluates it and displays the result.

The output of running a cell is shown in the line immediately after it. Notice that markdown (text) cells have no output.

Each line of a cell runs an operation.

In [2]:
# Addition
20 + 20

In [3]:
# Multiplication
10 * 8.5

In [4]:
# Division
625 / 25

In [5]:
# A series of arithmetic operations
(2 - 4 * 5 + 7) + 18 * 2
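R follows the usual order of operations: multiplication and division come before addition and subtraction, and parentheses are evaluated first. As a minimal sketch, the expression above can be broken into steps (the names `inside` and `outside` are made up for this illustration):

```r
# Evaluate the expression from the cell above one step at a time.
inside <- 2 - 4 * 5 + 7    # multiplication first: 2 - 20 + 7 = -11
outside <- 18 * 2          # 36
inside + outside           # 25
(2 - 4 * 5 + 7) + 18 * 2   # the original expression gives the same result: 25
```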

Note that code after a # (hashtag) is not run, so we use lines starting with hashtags to add comments or notes on our code. Here's an example.

In [None]:
# By using a comment at the beginning of the cell, we can describe what will occur when you run the cell.
# Add 10 and 8
10 + 8 # Note how we can add a comment after the expression

In [6]:
# If you create two lines, each line will run on its own.
2 + 2
5 + 5

### R Variables

Aside from numbers, R has **variables**, names that act as placeholders for certain values. For example, let the variables `x` and `y` equal 10 and 9, respectively. This action is called "assigning" a variable.

In [7]:
# Assign your variable by using <-
x <- 10
y <- 9

Notice that assigning a number to a variable name such as `x` produces no output. To view the value of `x`, place it at the end of a code cell, like below.

In [8]:
x

Now, we can use the variables `x` and `y` in expressions.

In [10]:
10 + 9

In [11]:
x + y

Now what happens when the value of `x` changes?

In [12]:
x <- 12 * 2

Then, the values of expressions that rely on `x` also change.

In [13]:
x

In [14]:
x + y

**This is why the order in which you run code cells is important.** The expression `x + y` can yield different results depending on which cells you ran before.

In [15]:
x <- 5
y <- 20
x + y

What happens if you try to use a variable without assigning a value to it first?

In [16]:
x + y + z

ERROR: Error in eval(expr, envir, enclos): object 'z' not found


You'll see that R outputs an `Error in eval` message. R tried to find the value of `z`, but `z` hadn't been defined yet!

**Important:** If you see this error again in this notebook or in future notebooks, it is an indication that you might not have run all the previous cells or that you might be using variables without assigning values to them first.
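If you are unsure whether a variable has been assigned yet, one way to check is base R's `exists()` function, which takes the variable's name in quotes. This is only a sketch; `my.new.variable` is a hypothetical name used for illustration and is not part of this notebook.

```r
# my.new.variable is a hypothetical name used only for this illustration.
before <- exists("my.new.variable")  # FALSE in a fresh session: not assigned yet
my.new.variable <- 3                 # assign a value
after <- exists("my.new.variable")   # TRUE: the variable now exists
before
after
```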

Run the next cell to define `z`.

In [17]:
# Defining z here
z <- 2019

In [18]:
# Good to go! Now that we've assigned a value to z, the code runs.
x + y + z

## Reading in and Using Datasets

R is meant to be used as a tool for statistical computing and graphics. Naturally, R allows us to read in data sets.

During most of the semester, we're going to be reading in real datasets from real studies about the real world.

But first we'll start with something simple. In the next cell, we will read a comma-separated values **(CSV) file** that includes some data from US presidential elections. This is the same electoral data analyzed in our textbook. We *assign* this dataset to a variable named `election.data`.

In [19]:
# This stores, or assigns, the dataset as election.data
election.data <- read.csv('FairFPSR3.csv')

Running the name of the dataset by itself prints out the entire dataset. To preview only the first few rows, pass the dataset to `head()`.

In [20]:
# This prints out the first six observations (rows) of election.data.
head(election.data)

| | inc_vote | year | inflation | goodnews | growth |
|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 48.516 | 1876 | NA | NA | 5.11 |
| 2 | 50.22 | 1880 | 1.974 | 9.0 | 3.879 |
| 3 | 49.846 | 1884 | 1.055 | 2.0 | 1.589 |
| 4 | 50.414 | 1888 | 0.604 | 3.0 | -5.553 |
| 5 | 48.268 | 1892 | 2.274 | 7.0 | 2.763 |
| 6 | 47.76 | 1896 | 3.41 | 6.0 | -10.024 |
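Besides `head()`, a few other base R functions give a quick overview of a dataset. The sketch below rebuilds the six rows shown above by hand so that it runs without the CSV file. The name `first.six` is made up for this illustration, and the missing first-row entries for `inflation` and `goodnews` are entered as `NA`, R's marker for missing values.

```r
# Rebuild the six rows shown above by hand (a stand-in for election.data).
first.six <- data.frame(
  inc_vote  = c(48.516, 50.220, 49.846, 50.414, 48.268, 47.760),
  year      = c(1876, 1880, 1884, 1888, 1892, 1896),
  inflation = c(NA, 1.974, 1.055, 0.604, 2.274, 3.410),
  goodnews  = c(NA, 9, 2, 3, 7, 6),
  growth    = c(5.110, 3.879, 1.589, -5.553, 2.763, -10.024)
)
nrow(first.six)      # number of rows (observations): 6
ncol(first.six)      # number of columns (variables): 5
names(first.six)     # the column names
head(first.six, 3)   # only the first three rows
```

The same functions work on `election.data` itself once you have run the `read.csv()` cell.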


In this class, you'll also see a **codebook** that tells you what each variable means. Here's what the variables mean in this dataset:

`inc_vote`: % of major party presidential vote won by incumbent party

`year`: Year of the presidential election

`inflation`: Inflation rate

`goodnews`: Number of quarters, among the first 15 quarters of the administration, in which economic growth exceeded 3.2%

`growth`: % change in real GDP per capita

As we can see, R successfully interpreted our file!

While the dataset is insightful enough by itself, it would be nice to do some operations with it. R allows us to do some interesting operations with the dataset columns. `dataSetName$varName` lets us read a specified column in our dataset and returns it as a **vector**.

In [22]:
# dataSetName$varName gives us the specified column as a vector.
election.data$inc_vote
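A vector is an ordered collection of values, so you can ask how long it is or pick out individual elements with square brackets. Here is a minimal sketch, using a hand-built stand-in (`toy.data`, a made-up name) that contains only the first six elections shown earlier:

```r
# toy.data is a hand-built stand-in for election.data (first six rows only).
toy.data <- data.frame(
  year     = c(1876, 1880, 1884, 1888, 1892, 1896),
  inc_vote = c(48.516, 50.220, 49.846, 50.414, 48.268, 47.760)
)
votes <- toy.data$inc_vote   # pull out one column as a vector
length(votes)                # number of elements: 6
votes[1]                     # first element: 48.516
votes[2:3]                   # second and third elements
max(votes)                   # largest value: 50.414
```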

Furthermore, it is possible to apply different operations to this vector. In this case, let's compute the mean of the incumbent vote across presidential elections since 1876. Let's use R's `mean()` function, which computes the mean of a given vector.

In [23]:
mean(election.data$inc_vote)
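One thing to watch for: if a vector contains missing values (`NA`), `mean()` returns `NA` unless you add the argument `na.rm = TRUE`. The sketch below uses the six `inflation` values shown earlier, entered by hand, with the missing 1876 entry written as `NA`.

```r
# Inflation for the first six elections; the 1876 value is missing.
inflation <- c(NA, 1.974, 1.055, 0.604, 2.274, 3.410)
mean(inflation)                 # NA, because one value is missing
mean(inflation, na.rm = TRUE)   # mean of the five non-missing values
```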

You can even save the results into a variable.

In [24]:
inc_vote.mean <- mean(election.data$inc_vote) # This assigns the value to inc_vote.mean
inc_vote.mean # This line just prints what is in inc_vote.mean

Once we've assigned a variable, whether with a line like `x <- 5` or with the command above, it behaves just like a number. So you can do arithmetic with it, too, just as we showed above.
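For instance, a stored mean can be subtracted from each value to see how far each election falls from the average. This is a sketch using only the first six `inc_vote` values shown earlier, so `six.mean` (a made-up name) is not the mean of the full dataset.

```r
# First six incumbent vote shares, entered by hand.
inc_vote <- c(48.516, 50.220, 49.846, 50.414, 48.268, 47.760)
six.mean <- mean(inc_vote)   # store the mean of these six values
six.mean
inc_vote - six.mean          # how far each election is from the mean
six.mean + 1                 # ordinary arithmetic works on the stored value
```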

### How `subset()` works

Here's how `subset()` works:

`name.of.new.subset.dataset <- subset(original.dataset, variable.in.dataset == accepted.value)`

This line takes `original.dataset`, subsets it to rows (observations) when `variable.in.dataset` equals `accepted.value`, and saves that subset in `name.of.new.subset.dataset`.

or 

`name.of.new.subset.dataset <- subset(original.dataset, variable.in.dataset > accepted.value)`

or 

`name.of.new.subset.dataset <- subset(original.dataset, variable.in.dataset < accepted.value)`

If the variable is a **string** variable (contains letters), you need to wrap the accepted value in quotation marks, like this (single quotes `'` and double quotes `"` both work):

`name.of.new.subset.dataset <- subset(original.dataset, variable.in.dataset == 'accepted.value')`
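Here is a minimal sketch of subsetting on a string variable. The data frame `toy.data` and its `id`, `party`, and `growth` values are invented for this illustration; they are not part of the election dataset. The last line also shows that two conditions can be combined with `&` ("and").

```r
# Hypothetical toy data, made up for this example only.
toy.data <- data.frame(
  id     = 1:4,
  party  = c("A", "B", "A", "B"),
  growth = c(1.5, -0.5, 2.0, 0.3)
)
# Keep only the rows where party equals the string "A".
party.a <- subset(toy.data, party == "A")
party.a
# Combine two conditions with & (both must be true).
a.high.growth <- subset(toy.data, party == "A" & growth > 1.8)
a.high.growth
```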



Example use of subset()

Let's say we want to subset the election data to look only for elections in which the economic growth rate is greater than zero.

In [25]:
growthgt0.election.data <- subset(election.data, growth > 0)
head(growthgt0.election.data)

| | inc_vote | year | inflation | goodnews | growth |
|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 48.516 | 1876 | NA | NA | 5.11 |
| 2 | 50.22 | 1880 | 1.974 | 9.0 | 3.879 |
| 3 | 49.846 | 1884 | 1.055 | 2.0 | 1.589 |
| 5 | 48.268 | 1892 | 2.274 | 7.0 | 2.763 |
| 10 | 54.708 | 1912 | 2.172 | 8.0 | 4.164 |
| 11 | 51.682 | 1916 | 4.252 | 3.0 | 2.229 |
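Notice that the row labels in the output skip some numbers: `subset()` keeps each row's original label and simply drops the rows that fail the condition. The sketch below reproduces this with the first six elections entered by hand (`first.six` is a made-up name), and shows that selecting rows with square brackets picks out the same elections.

```r
# First six elections, entered by hand (a stand-in for election.data).
first.six <- data.frame(
  year   = c(1876, 1880, 1884, 1888, 1892, 1896),
  growth = c(5.110, 3.879, 1.589, -5.553, 2.763, -10.024)
)
positive.growth <- subset(first.six, growth > 0)
nrow(positive.growth)       # 4 of the 6 elections had positive growth
rownames(positive.growth)   # original row labels are kept: "1" "2" "3" "5"
# Selecting rows with square brackets picks out the same elections.
same.rows <- first.six[first.six$growth > 0, ]
same.rows$year              # 1876 1880 1884 1892
```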


## Review

Over the course of this notebook, you were introduced to the basic types of objects in R, how to store them, and how to use them.

#### Reminder about Peer Consulting Office Hours

If you had trouble with any content in this notebook, Data Peer Consultants are here to help! You can view their locations and availabilities at this link: https://data.berkeley.edu/degrees/peer-advising. Peer Consultants are there to answer all data-related questions, whether about the content of this notebook, applications of data science in the world, or other data science courses offered at Berkeley. Make sure to take advantage of this wonderful resource!