# Coding Section 1
## Econ 130
GSIs: Linda Mutesi and Alice Schmitz

### Goals for today
* Introduce you to the Jupyter Notebook environment and coding in R
* Walk through the basic steps to read a dataset into R
* Do some basic data exploration
* I will be assuming very little experience with coding and R


This notebook introduces the practical implementation of basic analytic techniques in R using Jupyter notebooks. R is an open-source language and environment for statistical computing, used to analyze data. A Jupyter Notebook is an open-source web application that lets you create and share documents containing live code, equations, visualizations, and narrative text describing the output of the code.

First, we will cover some basics on how to interact with the Jupyter Notebook environment. Then, we will introduce some R code, read a dataset into R, and do some basic data exploration.



## Jupyter Notebook Basics
* To create a new notebook, click the "New" button and select R
* All Jupyter Notebooks are comprised of a collection of boxes called *cells*. We will be working with two types in this course: *Markdown* cells for text and *Code* cells for code.
* Select a cell by clicking on it. 
    * If you single-click the cell, you will see a blue bar on the left. That means you are in *command mode*. You'll be able to see the Cell type in the dropdown list at the top of the page and you can use command mode keyboard shortcuts. You will not be able to edit the contents of the cell. Pressing `Esc` while in edit mode will take you into command mode.
    * If you double-click the cell, you'll instead enter *edit mode*. The bar at the left will be green and you will be able to edit the contents of the cell. Pressing `Enter` while in command mode will take you into edit mode.
* Create a new cell by clicking the `+` icon at the top of the screen (under `File`) or by using the `Insert` menu.
    * In command mode, you can also press `a` to create a new cell above the current cell or `b` to create a new cell below.
* Write R code by selecting the option "Code" from the dropdown list, or write text by selecting "Markdown"
* Run code by selecting a cell (edit and command mode both work) and pressing "Run"
    * You can also use `control+enter` (it may be `command+enter` on mac) to run a cell, or `shift+enter` to run a cell and automatically select the next cell
    * When code is running, you will see an asterisk `*` to the left of the cell. When it is finished, you will see a number (e.g., `In [4]` is finished; `In [*]` is still running).
* To clear your coding output, select Cell=>All Output=>Clear from the toolbar at the top of the page
* Jupyter notebooks automatically save periodically, but you can also force it to save with `control+s` (or `command+s`), or by clicking the disk icon under `File`.
    * You can view the save status at the top of the page next to the notebook name
* **IMPORTANT: Close a notebook by selecting `File`=>`Close and Halt`. Don't just close the tab! You might lose your progress.**
* Some useful guides are here:
    * [Markdown Cheat Sheet](https://www.markdownguide.org/cheat-sheet/) for pretty text like in this cell
    * [Jupyter Notebook Keyboard Shortcuts](https://towardsdatascience.com/jypyter-notebook-shortcuts-bf0101a98330)
        * You can also access keyboard shortcuts by pressing `h` while in command mode
    * Your GSIs refer to these all the time!

Note: This introduction is based on material originally created by Kayleigh Barnes.

## R Basics

In [2]:
# Note that this is a Code cell.
# The hashtag (or octothorpe, if you're old-fashioned) is how you tell R that what follows is
# a "comment" and will not be interpreted as code. They are just references for you and us.

# Clear the workspace. This removes all data and values you have stored in R.
rm(list = ls())

# The help function: typing ? before a command name, or wrapping the name in help(), brings up information on what the command does
?setwd
help(setwd)

# Run the commands in this cell by clicking `Run` at the top of the screen or by pressing
# `control+enter`

In [4]:
# The working directory is the folder where R looks for data files
getwd()

# If you are working locally on your own computer (rather than in a Jupyter Notebook), you may
# need to set the working directory.
# Online, you do not need to do anything.
# This is the same as telling your computer to look in a documents folder when uploading something.
# Uncomment the setwd line below and replace your_folder_directory within the single quotes with 
# the directory you have saved your files in.
# For example, you might use setwd('/Users/julien/Documents/Econ130/Coding')

# setwd('your_folder_directory')
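
If you are not sure whether R can see a file from the working directory, you can check before trying to read it. The sketch below is only an illustration: it works in a temporary scratch folder, and the file names `example.csv` and `not_a_real_file.csv` are made up for this example.

```r
# Illustrative only: work in a temporary folder so nothing on your computer changes.
old_dir <- getwd()      # remember where we started
setwd(tempdir())        # tempdir() is a scratch folder R creates for each session

# Write a tiny made-up file into the working directory
writeLines(c("x,y", "1,2"), "example.csv")

# file.exists() is TRUE when R can see the file from the working directory
found     <- file.exists("example.csv")
not_found <- file.exists("not_a_real_file.csv")
print(found)
print(not_found)

# list.files() shows what is in the working directory
print("example.csv" %in% list.files())

# Clean up and return to the original working directory
file.remove("example.csv")
setwd(old_dir)
```

If `file.exists()` returns `FALSE` for a file you expect to be there, the working directory is probably not the folder you think it is.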

In [6]:
# Now let's read in a .csv file containing data.
# We will name it minwage_data, which is how we will tell R to use it
# R uses an arrow "<-" to assign values to variables, which is a little different from other programming
# languages, which usually use the "=" sign.

minwage_data <- read.csv('nj_min_wage.csv')

Notice that there is no output from the code that reads in the data. Unlike Excel, R stores the data in the background, and we need to use specific commands to interact with it.
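
To see this "silent" behavior on something small, here is a minimal sketch using a made-up data frame called `toy` (the name and the numbers are hypothetical, not part of the course data). Assigning stores the object without printing anything; typing its name or calling `print()` displays it.

```r
# A made-up data frame stored under the name toy (hypothetical example data)
toy <- data.frame(store_id = c(1, 2, 3), wage = c(4.25, 4.50, 5.05))
# Nothing is printed by the line above: assignment with <- stores the object silently.

# Typing the name (or calling print) displays the contents
print(toy)

# "=" also works for assignment here, but "<-" is the usual R convention
n_stores = nrow(toy)
print(n_stores)
```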

Once it's read in, we can use several commands to describe the data.

In [7]:
# The str ("structure") command shows you the structure of the data along with some examples
# Note that we have variables defined as "numerical" or "integer"
str(minwage_data)

'data.frame':	820 obs. of  32 variables:
 $ store_id : int  1 1 2 2 3 3 4 4 5 5 ...
 $ interview: int  1 2 1 2 1 2 1 2 1 2 ...
 $ chain    : int  1 1 1 1 2 2 3 3 3 3 ...
 $ co_owned : int  0 0 0 0 0 0 0 0 0 0 ...
 $ state    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ southj   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ centralj : int  1 1 1 1 1 1 1 1 1 1 ...
 $ northj   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pa1      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pa2      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ shore    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ pct_aff  : num  80 80 80 80 50 50 50 50 NA NA ...
 $ type     : int  NA 1 NA 1 NA 1 NA 1 NA 1 ...
 $ status   : int  NA 1 NA 1 NA 1 NA 1 NA 1 ...
 $ date     : int  NA 110592 NA 110592 NA 111792 NA 110592 NA 111892 ...
 $ ncalls   : int  0 NA 0 NA 3 2 0 NA 0 2 ...
 $ full_time: num  16 20 10 7.5 6 4 10 5 5 10 ...
 $ part_time: num  30 40 6 10 13 7 12 30 30 30 ...
 $ managers : num  4 4 3 3 3 3 3 4 4 4 ...
 $ wage_st  : num  4.5 5.05 4.75 5.25 4.25 ...
 $ inc_time : num  NA NA 26

In [8]:
# The head command will show you a snapshot of the data in an Excel-like format
head(minwage_data)

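| | store_id | interview | chain | co_owned | state | southj | centralj | northj | pa1 | pa2 | ⋯ | special | meals | open_hr | hrs_open | p_soda | p_fry | p_entree | n_regs | n_regs11 | bonus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | ⋯ | `<int>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<int>` | `<int>` | `<int>` |
| 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | ⋯ | NA | 3 | 7 | 16 | 0.93 | 0.83 | 0.85 | 4 | 3 | 0 |
| 2 | 1 | 2 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | ⋯ | 0 | 2 | 7 | 16 | 1.05 | 0.79 | 0.9 | 4 | 3 | NA |
| 3 | 2 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | ⋯ | NA | 1 | 7 | 14 | 1.06 | 0.91 | 0.96 | 2 | 2 | 1 |
| 4 | 2 | 2 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | ⋯ | 1 | 1 | 7 | 15 | 1.05 | 1.01 | 0.94 | 2 | 2 | NA |
| 5 | 3 | 1 | 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | ⋯ | NA | 2 | 11 | 10 | 1.06 | 0.95 | 3.09 | 5 | 3 | 1 |
| 6 | 3 | 2 | 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | ⋯ | 0 | 1 | 11 | 11 | 1.05 | 0.94 | 2.75 | 5 | 3 | NA |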


In [9]:
# The number of observations is equal to the number of rows.

nrow(minwage_data)
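
A few related commands report the size and layout of a dataset. Here is a small sketch on a made-up data frame (`toy` and its values are hypothetical) so that you can check the answers by eye.

```r
# Made-up data: 4 observations (rows) of 3 variables (columns)
toy <- data.frame(
  store_id  = c(1, 1, 2, 2),
  interview = c(1, 2, 1, 2),
  wage      = c(4.25, 5.05, 4.50, 5.05)
)

n_obs  <- nrow(toy)   # number of rows (observations)
n_vars <- ncol(toy)   # number of columns (variables)
print(n_obs)
print(n_vars)

print(dim(toy))       # rows and columns together
print(names(toy))     # the variable names
```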

## Compare the summarizing commands `summary` and `table`

In [10]:
# You can get some summary statistics for the variables using the summary command
summary(minwage_data)

    store_id       interview       chain          co_owned     
 Min.   :  1.0   Min.   :1.0   Min.   :1.000   Min.   :0.0000  
 1st Qu.:119.0   1st Qu.:1.0   1st Qu.:1.000   1st Qu.:0.0000  
 Median :237.5   Median :1.5   Median :2.000   Median :0.0000  
 Mean   :246.8   Mean   :1.5   Mean   :2.117   Mean   :0.3439  
 3rd Qu.:372.0   3rd Qu.:2.0   3rd Qu.:3.000   3rd Qu.:1.0000  
 Max.   :523.0   Max.   :2.0   Max.   :4.000   Max.   :1.0000  
                                                               
     state            southj          centralj          northj      
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :1.0000   Median :0.0000   Median :0.0000   Median :0.0000  
 Mean   :0.8073   Mean   :0.2268   Mean   :0.1537   Mean   :0.4268  
 3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:1.0000  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
     

In [11]:
# You can also do it for an individual variable. Here, we look at the starting wage, wage_st,
# and chain identifier, chain.

# You have to tell R which dataset to look in first. The $ sign tells it you are moving on
# from the dataset name to the variable name.

summary(minwage_data$wage_st)
summary(minwage_data$chain)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  4.250   4.500   5.000   4.806   5.050   6.250      41 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   1.000   2.000   2.117   3.000   4.000 

In [12]:
# The table command provides a different kind of summary. Can you tell what it is doing?

table(minwage_data$wage_st)
table(minwage_data$chain)


     4.25 4.3000002 4.3200002 4.3499999 4.3699999 4.3800001 4.3899999 4.4000001 
      147         1         1         9         5         1         1         2 
4.4499998       4.5 4.5500002 4.5999999 4.6199999 4.6500001 4.6700001      4.75 
        3        74         2         2        15         2         2        62 
4.8000002 4.8499999 4.8699999 4.9000001 4.9099998 4.9499998         5 5.0500002 
        4         2         8         2         1         2        79       288 
5.0599999 5.0999999 5.1199999 5.1399999 5.1500001 5.1999998      5.25 5.2800002 
        1         2         3         1         2         3        23         1 
5.3000002 5.3699999 5.4000001 5.4200001       5.5 5.5599999 5.6199999 5.6700001 
        2         1         1         1        16         1         1         2 
     5.75      6.25 
        2         1 


  1   2   3   4 
342 160 198 120 
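
If you want to check your answer, it can help to run both commands on a vector small enough to count by hand. The values below are made up for illustration.

```r
# Made-up chain identifiers for six stores
chain <- c(1, 1, 2, 3, 3, 3)

# summary() treats the values as numbers and reports min, quartiles, mean, and max
print(summary(chain))

# table() counts how many times each distinct value appears
chain_counts <- table(chain)
print(chain_counts)

# prop.table() turns those counts into shares of the total
chain_shares <- prop.table(chain_counts)
print(chain_shares)
```

For a variable like `chain`, where the numbers are labels rather than quantities, the counts are usually more informative than the mean.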

In [43]:
# What if you want to look at summary statistics for a subset of observations?
# Let's compare summary statistics for wage_st when interview == 1 and interview == 2
# (which means before and after the minimum wage increase)

summary(minwage_data[which(minwage_data$interview == 1),]$wage_st)
summary(minwage_data[which(minwage_data$interview == 2),]$wage_st)
summary(minwage_data[which(minwage_data$interview == 2 & minwage_data$shore == 0),]$wage_st)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  4.250   4.250   4.500   4.616   4.950   5.750      20 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  4.250   5.050   5.050   4.996   5.050   6.250      21 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  4.250   5.050   5.050   4.988   5.050   6.250      20 
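
Why wrap the condition in `which()`? One reason is that it protects you when the variable in the condition has missing values. The sketch below uses a made-up data frame (`toy` is hypothetical) in which `interview` is missing for one row; in the real data `interview` has no missing values, so here `which()` is a safe habit rather than a necessity.

```r
# Made-up data with a missing value in the condition variable
toy <- data.frame(
  interview = c(1, 2, NA, 2),
  wage      = c(4.25, 5.05, 4.75, NA)
)

# Plain logical indexing: the NA comparison produces an extra row filled with NAs
rows_logical <- toy[toy$interview == 2, ]
print(nrow(rows_logical))

# which() keeps only the positions where the condition is TRUE
rows_which <- toy[which(toy$interview == 2), ]
print(nrow(rows_which))

# summary() reports NA's separately; mean() needs na.rm = TRUE to skip them
mean_wage <- mean(rows_which$wage, na.rm = TRUE)
print(mean_wage)
```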

## Clean the data
In the real world, data is messy, and there may be mistakes that you (the researcher) will have to discover and fix.

In [16]:
table(minwage_data$ncalls) # ncalls has missing values, but table() hides them by default
table(minwage_data$ncalls, useNA = "ifany") # has missing data and is shown
table(minwage_data$chain, useNA = "ifany") # doesn't have missing data


  0   1   2   3   4   5   6   7   8   9 
208 108 142  73  21   4   6   1   4   4 


   0    1    2    3    4    5    6    7    8    9 <NA> 
 208  108  142   73   21    4    6    1    4    4  249 


  1   2   3   4 
342 160 198 120 

In [17]:
# Let's assume that any missing values are equal to 0 and make that change.
minwage_data$ncalls[is.na(minwage_data$ncalls)] <- 0

# Verify we've done it correctly
table(minwage_data$ncalls, useNA = "ifany")


  0   1   2   3   4   5   6   7   8   9 
457 108 142  73  21   4   6   1   4   4 
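
In the output above, the count of zeros rises from 208 to 457, which is exactly the 249 missing values that were recoded. Keep in mind that this kind of assumption changes your statistics. Here is a small sketch on a made-up vector (the numbers are hypothetical) comparing the mean before and after the replacement.

```r
# Made-up call counts with two missing values
ncalls <- c(0, NA, 3, NA, 2)

# How many values are missing?
n_missing <- sum(is.na(ncalls))
print(n_missing)

# Mean ignoring the missing values: (0 + 3 + 2) / 3
mean_before <- mean(ncalls, na.rm = TRUE)
print(mean_before)

# Replace missing values with 0, working on a copy so the original is kept
ncalls_filled <- ncalls
ncalls_filled[is.na(ncalls_filled)] <- 0

# Mean after the replacement: (0 + 0 + 3 + 0 + 2) / 5
mean_after <- mean(ncalls_filled)
print(mean_after)
```

Treating missing values as zeros pulls the mean down, so it is worth asking whether "missing" really means "zero" before making this change.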

## Create some variables
We will create variables indicating PA and NJ stores. I'll show you two ways to do this.

In [18]:
# Method 1: create a variable, set all values equal to zero, and replace with one as appropriate.
minwage_data$pa <- 0
minwage_data$pa[minwage_data$pa1 == 1] <- 1
minwage_data$pa[minwage_data$pa2 == 1] <- 1

# Method 2: take advantage of the fact that all regions are {0, 1} binary variables
minwage_data$nj <- minwage_data$southj + minwage_data$northj + minwage_data$centralj + minwage_data$shore

In [19]:
# Let's check our work. What should be true about the means? What about the minimum and maximum?
summary(minwage_data$nj)
summary(minwage_data$pa)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  1.0000  1.0000  0.8927  1.0000  2.0000 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.0000  0.0000  0.1927  0.0000  1.0000 

In [20]:
# Uh-oh. We messed up. "Shore" is in addition to NJ regions.
minwage_data$nj <- minwage_data$southj + minwage_data$northj + minwage_data$centralj

# And here's perhaps an easier way to check our work
table(minwage_data$nj)
table(minwage_data$pa)


  0   1 
158 662 


  0   1 
662 158 
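
The two tables mirror each other (662 + 158 = 820 stores), which is what we expect if every store is in exactly one state. As a possible third way to build indicators, you can convert a logical condition directly to 0/1. This is a sketch on a made-up data frame (`toy` is hypothetical), not the method used above.

```r
# Made-up region indicators for four stores
toy <- data.frame(
  pa1      = c(1, 0, 0, 0),
  pa2      = c(0, 1, 0, 0),
  southj   = c(0, 0, 1, 0),
  centralj = c(0, 0, 0, 0),
  northj   = c(0, 0, 0, 1)
)

# A logical condition is TRUE/FALSE; as.integer() turns it into 1/0
toy$pa <- as.integer(toy$pa1 == 1 | toy$pa2 == 1)
toy$nj <- as.integer(toy$southj == 1 | toy$centralj == 1 | toy$northj == 1)

# Check: every store should be in exactly one state
each_store_once <- all(toy$pa + toy$nj == 1)
print(each_store_once)

# A two-way table shows the same check at a glance
print(table(pa = toy$pa, nj = toy$nj))
```

Because `|` means "or", an indicator built this way can never exceed 1, which would have avoided the problem we ran into with the sum that included `shore`.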

In [21]:
# We are going to exclude stores on the Jersey Shore in our analysis because we think that
# they may be different from other stores: the pre-period is in March and the post-period is in
# November, and we might worry that seasonal patterns from beach tourism might affect the shore
# stores differently, making them poor comparisons to "typical" stores.
minwage_no_shore <- minwage_data[which(minwage_data$shore != 1),]

table(minwage_no_shore$nj)
table(minwage_no_shore$pa)

# And let's save this dataset.
write.csv(minwage_no_shore,"minwage_no_shore.csv",row.names = FALSE)


  0   1 
158 592 


  0   1 
592 158 
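
The `row.names = FALSE` argument in the `write.csv` call above keeps R from writing an extra column of row labels. The sketch below shows the difference on a made-up data frame, saved to temporary files so that nothing in your folder is overwritten (`toy` and the file paths are hypothetical).

```r
# Made-up data to save and read back
toy <- data.frame(store_id = c(1, 2), wage_st = c(4.25, 5.05))

# Save without row labels, then read the file back in
file_clean <- tempfile(fileext = ".csv")
write.csv(toy, file_clean, row.names = FALSE)
toy_clean <- read.csv(file_clean)
print(names(toy_clean))

# Save with the default (row labels included), then read it back in
file_labels <- tempfile(fileext = ".csv")
write.csv(toy, file_labels)
toy_labels <- read.csv(file_labels)
print(names(toy_labels))   # the row labels come back as an extra first column
```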

## Generating summary statistics for analysis
We will now generate summary statistics we can use to improve our understanding of the relationship between minimum wages and employment.

As a warmup, let's get a better sense of the differences in wages between Pennsylvania and New Jersey. Please calculate:
1. The average wage in Pennsylvania (across all stores and time periods)
1. The average wage in New Jersey (across all stores and time periods)

Next, let's calculate summary statistics that will allow us to estimate the treatment effect of the minimum wage increase. Let FT be the number of full-time workers, let t=1 be the first interview (the period before the minimum wage increase), and let t=2 be the second interview (the period after the increase). You may want to define a "pre" and "post" variable.

![2x2_DD_table.png](attachment:2x2_DD_table.png)

What method is best-suited for this setting? Use the tools we've gone over so far to fill out the table above.
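
If you get stuck, here is one possible way to organize the calculation, shown on made-up numbers rather than the real data. It assumes a difference-in-differences comparison: the change over time in NJ minus the change over time in PA. The data frame `toy` and its values are hypothetical.

```r
# Made-up data: two stores per state, each interviewed twice
toy <- data.frame(
  nj        = c(1, 1, 1, 1, 0, 0, 0, 0),
  interview = c(1, 1, 2, 2, 1, 1, 2, 2),
  full_time = c(8, 10, 9, 11, 9, 11, 8, 10)
)

# tapply() computes the mean of full_time within each state-by-period cell
cell_means <- tapply(toy$full_time,
                     list(nj = toy$nj, interview = toy$interview),
                     mean, na.rm = TRUE)
print(cell_means)

# Change over time within each state (post minus pre)
change_nj <- cell_means["1", "2"] - cell_means["1", "1"]
change_pa <- cell_means["0", "2"] - cell_means["0", "1"]

# Difference-in-differences: NJ change minus PA change
dd <- change_nj - change_pa
print(dd)
```

In these made-up numbers, NJ goes from 9 to 10 and PA goes from 10 to 9, so the difference-in-differences is 2. The same steps could be applied to `minwage_no_shore`, but try filling in the table yourself first.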

## If we have time: Plotting the data
Now, let's make some histograms showing the wage distribution before and after.

In [None]:
# It is easier to define some additional variables so we don't have to keep typing "which" commands.
# This is also good coding practice because it makes your code easier to read.
pre_all  <- minwage_no_shore[which(minwage_no_shore$interview == 1),]$wage_st
post_all <- minwage_no_shore[which(minwage_no_shore$interview == 2),]$wage_st

hist(pre_all, main = "Starting Wages, Pre-period")
hist(post_all, main = "Starting Wages, Post-period")

In [None]:
# What about overlaying two histograms? How might we do that?
?hist
# The help command is very detailed and isn't super helpful for us.
# Let's try Google.

# Google "R histogram two variables" (without the quotes) and let's see what comes up.
# Google is your friend! I Google things all the time, and the community is really helpful.

# In September 2023, the first result for me was really helpful!
# https://www.statology.org/histogram-two-variables-in-r/

In [48]:
# Below is what we can do, with guidance from the first result
# First, we will make four vectors of starting wages: pre and post, for NJ and for PA:
pre_nj <- minwage_no_shore[which(minwage_no_shore$interview == 1 & minwage_no_shore$nj == 1),]$wage_st
pre_pa <- minwage_no_shore[which(minwage_no_shore$interview == 1 & minwage_no_shore$nj == 0),]$wage_st

post_nj <- minwage_no_shore[which(minwage_no_shore$interview == 2 & minwage_no_shore$nj == 1),]$wage_st
post_pa <- minwage_no_shore[which(minwage_no_shore$interview == 2 & minwage_no_shore$nj == 0),]$wage_st

hist(pre_nj, col=rgb(0,0,1,0.2), freq = FALSE, xlim=c(4, 6.5),
     xlab='Wage', ylab='Density', main='Pre-Period Starting Wages')
hist(pre_pa, col=rgb(1,0,0,0.2), freq = FALSE, add=TRUE)
legend('topright', c('NJ', 'PA'),
       fill=c(rgb(0,0,1,0.2), rgb(1,0,0,0.2)))

hist(post_nj, col=rgb(0,0,1,0.2), freq = FALSE, xlim=c(4, 6.5), breaks=seq(4,6.5,0.2),
     xlab='Wage', ylab='Density', main='Post-Period Starting Wages')
hist(post_pa, col=rgb(1,0,0,0.2), freq = FALSE, breaks=seq(4,6.5,0.2), add=TRUE)
legend('topright', c('NJ', 'PA'),
       fill=c(rgb(0,0,1,0.2), rgb(1,0,0,0.2)))


Make sure you quit using `File`=>`Close and Halt` or you may lose your work!