# Introduction to R and Jupyter Notebooks

## Econ 130 Fall 2022

## Overview

Graduate Student Instructors: Ale Marchetti-Bowick and Alice Schmitz

### Goals for today
This notebook introduces basic analytic techniques in R within Jupyter notebooks. R is an open-source programming language and software environment for statistical computing and data analysis. A Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text that describes the output of your code.

Here, we introduce some fundamental coding techniques that will help you in this course, so that section will not be the first time you see them: **this and additional material will be reviewed and covered in more detail in the four coding discussion sections.**


## Jupyter and R Basics
* When you first log into Datahub, to create a new notebook, click the "New" button and select R
* All Jupyter Notebooks are made up of a collection of boxes called *cells*. We will be working with two types in this course: *Markdown* cells for text and *Code* cells for code.
* Select a cell by clicking on it. 
    * If you click to the left of the cell contents, you will see a blue bar on the left. That means you are in *command mode*. You'll be able to see the Cell type in the dropdown list at the top of the page and you can use command mode keyboard shortcuts. You will not be able to edit the contents of the cell. Pressing `Esc` will take you into command mode.
    * If you click in the cell, you'll instead enter *edit mode*. The bar at the left will be green and you will be able to edit the contents of the cell. Pressing `Enter` will take you into edit mode.
* Write R code by selecting the option "Code" from the dropdown list, or write text by selecting "Markdown"
* Select "Insert" to add a new cell for text or code
* Run code by selecting the cell and clicking "Run"
    * You can also use `control+enter` to run a cell, or `shift+enter` to run a cell and automatically select the next cell
    * When code is running, you will see an asterisk (`*`) to the left of the cell. When it is finished, you will see a number (e.g. `In [4]` is finished; `In [*]` is still running).
* To clear your coding output, select Cell=>All Output=>Clear from the toolbar at the top of the page
* Jupyter notebooks save automatically at regular intervals, but you can also force a save with `Control+S`
    * You can view the save status at the top of the page next to the notebook name
* Close a notebook by selecting File=>Close and Halt.
* Some useful guides are here:
    * [Markdown Cheat Sheet](https://www.markdownguide.org/cheat-sheet/) for pretty text like in this cell
    * [Jupyter Notebook Keyboard Shortcuts](https://towardsdatascience.com/jypyter-notebook-shortcuts-bf0101a98330)
    * Your GSIs refer to these all the time!

Note: This introduction is based on material originally created by Kayleigh Barnes.

In [None]:
# Clear the workspace. This removes all data and values you have stored in the current R session.
# The hashtag (or octothorpe, if you're old-fashioned) is how you tell R that what follows is
# a "comment" and will not be interpreted as code. They are just references for you and us.
rm(list = ls())

# The help function: typing ? before a command name, or passing the name to help(), brings up information on what the command does
?setwd
help(setwd)

In [None]:
# The working directory is the folder where R looks for data files by default.
# This is the same as telling your computer to look in a documents folder when uploading something.
getwd()

User-written open-source packages provide specific functionality in R (e.g. nice graphics). A package must be installed manually (once) and then loaded at the beginning of every script that uses it. Many packages have been pre-installed on Datahub. If you are wondering why a command you've used before is no longer working, it may be because you haven't loaded the package.

In [None]:
# Install packages
# Installing Packages is generally not necessary if you are working on Jupyter because many packages have been
# pre-installed by the Berkeley Data Science team, but is required the first time you want to use a
# package if you choose to run R code "locally" on your own computer. "Uncomment" (delete the # at the 
# beginning of) the next line if you are working on your own computer outside of Datahub and Jupyter.

#install.packages('ggplot2')
  
# Load required packages
# This is always necessary. If some of your code isn't working, double-check that you've loaded the required
# packages!
library(ggplot2)
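
If you are unsure whether a package is available, you can check before loading it. The following is a minimal sketch using base R's `requireNamespace()`; the helper name `has_package` is made up for illustration and is not part of R or of the course materials.

```r
# Hypothetical helper (our own name): returns TRUE if the package can be loaded
has_package <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# The "stats" package ships with R, so this check should succeed
has_package("stats")

# Load ggplot2 only if it is installed; otherwise print a reminder
if (has_package("ggplot2")) {
  library(ggplot2)
} else {
  message("ggplot2 is not installed; run install.packages('ggplot2') first")
}
```

On Datahub the check should normally pass, so this is mostly useful when running R on your own computer.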

## Loading in data and summary statistics

Now let's load in the data set. Make sure you have uploaded the data to Jupyter before running the next line of code. We are going to use data on a set of households in Mexico in the 1990s. The data includes a village ID, a household ID, and demographic variables like income, household size, age and gender of the head of household, and a poverty indicator.

In [None]:
# Reading data into R from a CSV file
#  ?read.table # delete the # at the beginning of this line to view the help entry for the "read" command
  MyFirstData <- read.csv('MyFirstData.csv', header = TRUE)

Notice that there is no output from the code that reads in the data. Unlike Excel, R stores the data in the background, and we need to use specific commands to interact with it. Once it's read in, we can use several commands to describe the data.

In [None]:
# Structure of the Data
  str(MyFirstData)

In [None]:
  # Summary of the Data
  summary(MyFirstData)

In [None]:
  # Variable Names
  colnames(MyFirstData)

In [None]:
  #Number of Observations
  nrow(MyFirstData)

In [None]:
  #Display first 6 rows of the data 
  head(MyFirstData)

In [None]:
  #Tabulate a specific variable (to refer to a variable, use Dataset$VariableName)
  table(MyFirstData$sexhead)
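
If you want to practice these commands without the course file, the sketch below builds a tiny data set, writes it to a temporary CSV, and reads it back. The column names follow the course data, but the values (including the category 'Female') are made up for illustration.

```r
# Toy data: column names follow the course data, the values are invented
toy <- data.frame(
  villid  = c(1, 1, 2, 2),
  famsize = c(4, 6, 3, 5),
  sexhead = c("Male", "Female", "Male", "Male")
)

# Write the toy data to a temporary CSV file
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)

# Confirm the file exists before trying to read it
file.exists(csv_path)

# Read it back in, just like the course data
toy_read <- read.csv(csv_path, header = TRUE)

str(toy_read)            # column names and types
nrow(toy_read)           # number of observations
table(toy_read$sexhead)  # counts for each category
```

The `file.exists()` check is a quick way to find out whether a "file not found" error comes from a missing upload or from a typo in the file name.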

## Basic Data Cleaning and Formatting

### Category Variable

Right now, we have two categorical variables: sexhead, which indicates the sex of the head of household, and pov_HH, which indicates whether a household is below the poverty line. The data entries for these variables are text rather than numbers (we call these string variables in the data science world). Often when doing data analysis, it is easier to map categorical text variables to numbers, particularly 0 and 1. Variables that contain only 0's and 1's are called dummy variables.

Now, suppose we want to create a poor_male variable, which will be defined as 1 if the household is categorized as poor (pov_HH is 'pobre') and the head of the household is male (sexhead is 'Male'), and 0 otherwise.

In [None]:
#Create one dummy variable based on T/F condition
MyFirstData$poor_male <- ifelse(MyFirstData$pov_HH == 'pobre' & MyFirstData$sexhead == 'Male', 1, 0)

#tabulate the observations
table(MyFirstData$poor_male)
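
To see how the two conditions combine, here is a small sketch on made-up data. The values 'no pobre' and 'Female' are assumptions for illustration; check `table()` on the real data to see the actual categories.

```r
# Toy data with every combination of the two categories (values are invented)
toy <- data.frame(
  pov_HH  = c("pobre", "pobre", "no pobre", "no pobre"),
  sexhead = c("Male", "Female", "Male", "Female")
)

# One dummy per condition
toy$poor <- ifelse(toy$pov_HH == "pobre", 1, 0)
toy$male <- ifelse(toy$sexhead == "Male", 1, 0)

# Multiplying two dummies gives 1 only when both are 1,
# which matches combining the conditions with &
toy$poor_male <- toy$poor * toy$male

toy
```

Only the first row (poor and male) gets poor_male equal to 1.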

### Numerical Variable
We can use regular mathematical operations to create numerical variables from other variables.

In [None]:
#Squaring an existing variable
MyFirstData$agehead2 <-  MyFirstData$agehead^2
summary(MyFirstData$agehead2)

#Creating a constant
MyFirstData$constant <- 1
summary(MyFirstData$constant)
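
New variables can also combine several existing columns. The sketch below uses made-up numbers and assumes IncomeLab is a household income measure; the new column names `income_pc` and `log_income` are our own.

```r
# Toy data (values are invented for illustration)
toy <- data.frame(
  IncomeLab = c(1200, 800, 1500),
  famsize   = c(4, 2, 5)
)

# Income per household member: divide one column by another
toy$income_pc <- toy$IncomeLab / toy$famsize

# Natural log of income
toy$log_income <- log(toy$IncomeLab)

toy
```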

### New Datasets
We may also want to create a new dataset that summarizes the old one, or is a subset of the original dataset.

In [None]:
#Subset of only observations with male head of hh
data_males<-MyFirstData[ which(MyFirstData$sexhead=='Male'),]
summary(data_males)

#First select variables to aggregate
myvars <- c("villid", "IncomeLab", "famsize", "agehead")
meandata <- MyFirstData[myvars]

#Collapse data to get average values by village.  Could also use "sum" as the function to get totals
meandata<-aggregate(meandata, by = list(meandata$villid), FUN = mean)
nrow(meandata)
summary(meandata)
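
The same two steps can be written more compactly with `subset()` and the formula interface of `aggregate()`. This is an alternative sketch on made-up data, not a replacement for the code above.

```r
# Toy data: column names follow the course data, the values are invented
toy <- data.frame(
  villid  = c(1, 1, 2, 2, 2),
  sexhead = c("Male", "Female", "Male", "Male", "Female"),
  famsize = c(4, 6, 3, 5, 4),
  agehead = c(30, 50, 40, 45, 35)
)

# Keep only the rows with a male head of household
toy_males <- subset(toy, sexhead == "Male")
nrow(toy_males)

# Average famsize and agehead within each village.
# Read the formula as "these columns, grouped by villid".
village_means <- aggregate(cbind(famsize, agehead) ~ villid, data = toy, FUN = mean)
village_means
```

The result has one row per village, and the grouping column keeps its original name (villid) instead of the generic Group.1.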



## Making comparisons - T-Tests

A main goal of working with data is to make inferences about the population we are interested in. Much of Econ 130 will be focused on methods to make these inferences: What is the relationship between two variables? Did a policy have a significant treatment effect?

If you have taken Stats 20, you are likely already familiar with the t-test. A two-sample t-test compares the mean of a variable between two groups. The test statistic, together with its p-value, tells us whether the difference is statistically *significant*, that is, whether a difference this large would be unlikely if the two groups had the same mean in the population.

In [None]:
#let's run a t-test comparing the average family size for households above and below the poverty line
t.test(MyFirstData$famsize ~ MyFirstData$pov_HH, var.equal = TRUE)
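
To see what `t.test()` computes, the sketch below runs the same kind of test on made-up data and then rebuilds the test statistic by hand. The numbers and the category 'no pobre' are assumptions for illustration. With `var.equal = TRUE`, the test uses a pooled estimate of the variance.

```r
# Toy data: four poor and four non-poor households (values are invented)
toy <- data.frame(
  famsize = c(5, 6, 7, 6, 3, 4, 5, 4),
  pov_HH  = rep(c("pobre", "no pobre"), each = 4)
)

# Two-sample t-test with equal variances assumed
result <- t.test(famsize ~ pov_HH, data = toy, var.equal = TRUE)
result$estimate   # mean of famsize in each group
result$statistic  # the t statistic
result$p.value    # the p-value

# Rebuild the t statistic by hand.
# Groups are ordered alphabetically, so "no pobre" comes first.
x <- toy$famsize[toy$pov_HH == "no pobre"]
y <- toy$famsize[toy$pov_HH == "pobre"]
n_x <- length(x)
n_y <- length(y)

# Pooled variance: weighted average of the two group variances
pooled_var <- ((n_x - 1) * var(x) + (n_y - 1) * var(y)) / (n_x + n_y - 2)

# Difference in means divided by its standard error
t_manual <- (mean(x) - mean(y)) / sqrt(pooled_var * (1 / n_x + 1 / n_y))
t_manual
```

The hand-computed value matches `result$statistic`. A small p-value (conventionally below 0.05) is read as evidence that the group means differ.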

## Visualizing Data
Make sure that the ggplot2 package is loaded at the top of the script. Below, we show an example of a scatterplot using ggplot. The "geom" functions set the type of graph: for example, geom_point draws a scatterplot and geom_line draws a line graph.

In [None]:
head(MyFirstData)

In [None]:
  ggplot(MyFirstData, aes(x = agehead, y=famsize)) + geom_point()
  ?geom_line  

We can use the base R function hist() or ggplot to create a histogram. Notice that changing the options in the function allows you to customize the graph. Use the help function to learn more about the options for each command.

In [None]:
# Base Graphics
  hist(MyFirstData$agehead)
  hist(MyFirstData$agehead, col = "blue", main = "Histogram of age")
# Using ggplot2: more customization options are available: Google for more!
  ggplot(MyFirstData, aes(x = agehead)) + geom_histogram(fill = "blue") + ggtitle("Histogram of age")
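
As a sketch of how the pieces of a ggplot fit together, the example below stores plots in variables, swaps the geom to change the graph type, and adds axis labels with `labs()`. The data values and label text are made up for illustration, and the code assumes ggplot2 is installed.

```r
library(ggplot2)

# Toy data (values are invented for illustration)
toy <- data.frame(
  agehead = c(25, 32, 41, 47, 53, 60),
  famsize = c(3, 4, 6, 5, 4, 2)
)

# Scatterplot with a title and axis labels
p_points <- ggplot(toy, aes(x = agehead, y = famsize)) +
  geom_point() +
  labs(title = "Family size by age of household head",
       x = "Age of household head", y = "Family size")

# Same data and aesthetics, different geom: a line graph
p_line <- ggplot(toy, aes(x = agehead, y = famsize)) +
  geom_line()

# Histogram with an explicit bin width (here, 10 years per bar)
p_hist <- ggplot(toy, aes(x = agehead)) +
  geom_histogram(binwidth = 10, fill = "blue")

# Typing the name of a stored plot displays it
p_points
```

Setting `binwidth` explicitly avoids the message ggplot prints when it has to pick a default number of bins.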