# EEP/IAS 118 - Coding Bootcamp, Part 2

## Introduction to RStudio (and more R)


## RStudio

Last week we worked through how to access _Datahub_, use a Jupyter notebook, and code in R. 
Today we will introduce RStudio. RStudio is an Integrated Development Environment (IDE) for R. It lets us edit scripts, produce output, navigate file structures, view plots, and see content currently in memory, all through one interface. This is the interface most econometric researchers use to conduct their analysis.

First we'll get you started on installing R and RStudio on your personal computers (for those who want local installations). Then we'll access RStudio on Datahub and work through how to use it.

## Installing R and RStudio

Now we'll work through installing __R__ and **RStudio** on your own computers. Once installed, you can use **RStudio** locally exactly as you would on Datahub (albeit with different file paths).

### Installing R

To install **R**, go to the [CRAN download page](https://ftp.osuosl.org/pub/cran/), a mirror of the Comprehensive R Archive Network, which is where the R Project distributes R.

#### Mac

If you're on a Mac, use the **R for (Mac) OS X** link and download the **R-#.pkg** file (where **#** is the current version). Once downloaded, open the package and follow the installation instructions.

#### Windows

If you're on Windows, use the **R for Windows** link, click on **Base**, and **Download R # for Windows** (where **#** is the current version). Once downloaded, open the installer and follow the installation instructions.



### Installing RStudio

To install **RStudio**, go to the [RStudio Download Page](https://rstudio.com/products/rstudio/download/#download). Scroll down to the OS section and download the file that corresponds to your operating system: `RStudio-#.dmg` for Mac or `RStudio-#.exe` for Windows (where **#** is the current version). Once downloaded, run the installer and follow the instructions.

### Troubleshooting Local Installations

Next we will introduce you to **RStudio** on __Datahub__. If you're having issues getting things installed/running on your local computer, please come to office hours and one of us will help troubleshoot.

## RStudio on Datahub

You can use the [Following Link](https://r.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fds-modules%2FENVECON-118-SP23&urlpath=rstudio%2F) to access RStudio on Datahub. Note that RStudio always stays in the same place (unlike our Jupyter notebooks it isn't specific to a given folder location), so you can use the above link or the link posted on bCourses (**Syllabus > Datahub heading > "Link to RStudio on Datahub"**) throughout the semester.

When you click one of the links, you'll open a tab that looks something like this:

<p style="text-align: center;"> RStudio on Datahub </p>

![rstudio.png](attachment:31d52231-dd20-4f01-9d7b-487a9d5f8227.png)

This is RStudio. RStudio gives us a really helpful interface for producing and running __R__ code all in one window. It has several advantages when it comes to producing research.

There are four main elements of RStudio:

### 1. Script Window 

<p style="text-align: center;"> Script Window </p>

![script_window.png](attachment:21942276-12c3-4c91-a012-6cee9ae7de45.png)

The script window is the “upper left” window in __RStudio__. This is where you will be writing all of your code. This window allows you to save your R scripts and gives any researcher with the file the ability to replicate your analysis.

Every time you start a project, it is important to start a script. A script lets you save and load your code as you go. To start a script, either click on **File > New File > R Script**, or click the sheet of paper with a plus on it in the toolbar just below **File**. An untitled script tab will open, which you can save by going to **File > Save**. Note that scripts do not autosave, so make sure to save your scripts as you go. It's a good habit to save your code every few minutes, and if you make big changes to your approach go ahead and save a separate copy of the code (we no longer have the built-in version control as we did with the notebooks).

When you save your script, you'll need to select the folder you want to save the script in. It's useful to save your scripts in a project-specific destination to help keep track of your files. The script can be reopened through the file window (which we'll talk about later) or by clicking **File > Open** in the menu toolbar.

Though you'll write all your task code in the same script, sometimes you'll want to run just a line at a time. To do so, select the line(s) you're interested in and either click **Run** in the upper-right portion of the script window, go to **Code > Run Selected Line(s)**, or press **Command + Enter** (Mac) or **Control + Enter** (Mac and PC).

#### Organizing Scripts

Since you'll be adding all your code for the project in the same script, it's a good idea to organize your script according to a workflow. This will save you time and make it much easier to get back up to speed if you revisit code after some time away or share code with colleagues. A best practice is to begin your script with a *preamble* where you load packages, declare options, and set paths before you dive into analysis. 
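A preamble along these lines is one possible layout, not a required template. The sketch below uses only base R so that it runs anywhere: `stats` and `utils` stand in for whatever packages a project needs, and `project_dir` and `data_dir` are hypothetical names pointing at a temporary folder.

```r
# ---- Preamble ----

# 1. Load packages (base packages used here as stand-ins for project packages)
library(stats)
library(utils)

# 2. Declare options (here: print numbers with 4 significant digits)
options(digits = 4)

# 3. Set paths (hypothetical names; tempdir() keeps the sketch runnable anywhere)
project_dir <- tempdir()
data_dir <- file.path(project_dir, "Data")

# ---- Analysis starts below ----
print(data_dir)
```

Keeping these three pieces at the top means anyone opening the script can see what it depends on before reading the analysis.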

Another good practice is to comment code - use the pound sign `#` to comment out everything after it on a given line. For example, try running the following cell: 

In [1]:
# This is a comment
########## This is also a comment
# sleepdf <- read_dta("this is still a comment")

Comments are helpful for keeping notes about what you're doing as you go, which is especially helpful when working with others. You can also add blocks to your script by adding `#----` to a line. Once you do, a small arrow will appear to the left of the line. Clicking on this area will hide/unhide all the lines of the section. This can be useful to temporarily hide some code when your scripts start getting long.
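As a rough illustration, a short script split into collapsible sections might look like the following. The section names and the numbers in `hours` are made up for the example.

```r
# Setup ----
# Made-up values, used only to give the sections something to do
hours <- c(7.5, 6, 8, 7)

# Summary statistics ----
avg_hours <- mean(hours)  # average of the four values

# Output ----
print(avg_hours)
```

In RStudio, each comment line that ends in four or more dashes gets its own fold arrow, so the three sections above can be collapsed independently.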

### 2. Console

<p style="text-align: center;"> Console Window </p>

![console.png](attachment:97085b9c-fccb-4996-b2ad-23042ee97fa6.png)

This is the “bottom left” window in __RStudio__. The console is essentially a direct line to __R__. You can type code here directly, but more importantly this is where all your output will be printed. Think of this like the output spaces of the notebooks; rather than right below a code cell, all our output will get printed right to the console window.

Note that because all our output gets printed in the same window, we can scroll back through it to look at old results. At the same time, it means our results aren't saved in a clean, easy-to-read format. If we want to save our work from __RStudio__, we'll need to employ some other techniques, which we'll discuss later.

__You should never be coding in this window__. Code typed directly into the console is not saved with your script, so the work can't be replicated. Always write your code in a script and then run it so that the results appear in the console.


### 3. Environment 

<p style="text-align: center;"> Environment Window </p>

![environment.png](attachment:6a8f3142-8be0-4b6f-967a-9255f2af2e13.png)

This is the “upper right” window in __RStudio__. This window will list all your stored variables, dataframes, and functions that you’ve written to memory. Whereas all the memory items are hidden in the background when we use a Jupyter Notebook, now we can see them all (and even view their contents) directly. The environment window is useful in keeping you informed about what R has stored from the code you’ve written.

To see how this works, bring in a dataset. Let's load in the `sleep75.dta` file that's in the **1_CodingBootcamp** folder. To do this using your script,

 * Load the package `haven` with `library(haven)`
 * Store the sleep data as `sleepdf` using the command `sleepdf <- read_dta("/home/jovyan/ENVECON-118-SP23/1_CodingBootcamp/sleep75.dta")`
 
Note that we had to give the full path to the `sleep75` data in order to load it in RStudio. If you want to change your working directory so that you're in the **1_CodingBootcamp** folder and can reference the data file by name, use the `setwd()` function, passing the path you want to move to (in quotation marks) as the function's argument. To move to the **1_CodingBootcamp** folder, run `setwd("/home/jovyan/ENVECON-118-SP23/1_CodingBootcamp/")`. Since the data we want is also in this folder, we can then load it by name with `read_dta("sleep75.dta")`. Note that if we don't assign the result a name, **R** will print the dataset to the console rather than storing it.
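The effect of the working directory on file paths can be tried out without the course data. This is a minimal sketch under a few assumptions: `demo_dir` and `toy.csv` are hypothetical names in a temporary folder, and base R's `read.csv()` stands in for `read_dta()` so that no package is needed.

```r
# Remember where we started so we can return at the end
old_wd <- getwd()

# Hypothetical project folder inside a temporary directory
demo_dir <- file.path(tempdir(), "wd_demo")
dir.create(demo_dir, showWarnings = FALSE)

# Write a tiny made-up dataset there using its full path
write.csv(data.frame(id = 1:3), file.path(demo_dir, "toy.csv"), row.names = FALSE)

# Move into the folder; the file can now be referenced by name alone
setwd(demo_dir)
toy <- read.csv("toy.csv")
print(toy)

# Return to the original working directory
setwd(old_wd)
```

`getwd()` reports the current working directory at any point, which is a quick check when a file that should exist can't be found.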
 
Now that we have `sleepdf` in memory, it appears in the environment window:

<p style="text-align: center;"> Loaded Data </p>

![data_environment.png](attachment:1411d84d-06fc-4d1b-b1be-59b57adb778d.png)

The environment tells us how many observations and variables are in our data. If we click the arrow to the left of `sleepdf`, we get information on each of our variables:

<p style="text-align: center;"> Variable Info </p>

![data_vars.png](attachment:b02e1854-6bfd-409f-98cb-155493649e07.png)

You can adjust the size of the environment window to see more of the variables at once or scroll through if it's a long list. 
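The same information can also be requested from code, which is handy inside a script. In this sketch the built-in `mtcars` data stands in for `sleepdf` (an assumption made so the example runs without the course file).

```r
# mtcars ships with R; it stands in for sleepdf here
df <- mtcars

dim(df)     # number of observations (rows) and variables (columns)
names(df)   # variable names
str(df)     # type and first few values of each variable
```

In RStudio, running `View(df)` opens the same data tab as double-clicking the object in the environment window.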

If you double click `sleepdf`, you can view the whole dataset. The data opens in a new tab in the script window, letting you see the variable names, variable labels (which Stata files sometimes include), and all the observations themselves!


<p style="text-align: center;"> Viewing Data </p>

![viewing_data.png](attachment:f1614a1f-b533-48db-a945-d73d49e55ab3.png)

You can switch back and forth between the data and your script by clicking on the respective tab. When done viewing the data, click the little x on the right of the tab.

#### Removing objects from memory

While there is a lot of space on the server, it's a good habit to remove objects from memory that you no longer need (this is especially important when working with large datasets on your personal computer). To remove objects from memory, use the `rm()` function. 

Let's try this by creating a summary statistics object: `sleepstats <- summary(sleepdf)`. You'll see that it has been added to the *Values* portion of the environment, which tells us it's stored as a matrix rather than as a dataframe. To remove it, run `rm(sleepstats)`.
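To watch objects appear and disappear without loading any data, something like the following can be run line by line. The object names and values are made up for the illustration.

```r
# Create two throwaway objects (made-up values)
scratch_a <- 1:10
scratch_b <- summary(scratch_a)

exists("scratch_b")   # TRUE: the object is in memory

# Remove one object by name
rm(scratch_b)
exists("scratch_b")   # FALSE: it is gone

# Several objects can be removed in a single call
scratch_c <- "temporary"
rm(scratch_a, scratch_c)
```

`rm(list = ls())` clears every object in the environment at once, so use it with care.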


### 4. Files/Plots/Help Window

<p style="text-align: center;"> Files/Plots/Help Window </p>

![files_plots_packages.png](attachment:220b6ea4-6f29-45fc-92af-258258487fbc.png)

This is the "bottom right" window in __RStudio__. This combined area is used for a couple of different tasks, depending on which tab we're in. Here are a few that you might find helpful:

#### Files

The file viewer lets us view and navigate the files/folders on our system. In the case of __Datahub__, this lets us view our entire folder structure and access the various files we have saved here. 

* Clicking on a folder name in the hierarchy will take us to that folder
* Clicking the `..` at the top of the tree (next to the green arrow) will take us back up one folder in the hierarchy
* Clicking on a dataset will give us the option to import the dataset. Note that we'll ultimately want to load everything using our script; otherwise we can't conveniently replicate our results.

#### Plots

Whereas in Jupyter notebooks the plots appeared below our cells, now they will appear in the plots viewer tab when we produce/call them. For example, if we want a histogram of earnings in our data, we can use the following code:

In [None]:
hist(sleepdf$earns74,
     main = "Distribution of Earnings",
     xlab = "Total Earnings (1974)")


Note that running this code in RStudio will open a (potentially squished) figure in the plots window. To make it larger, click the **Zoom** button. You can use the **Export** button to save an image or pdf copy to the server, or copy the plot to the clipboard.


#### Help

Whenever you have a question about a given function, you can type `?` in front of the function name and run the code, and the documentation will appear in the help tab, which will come to the front. This will help us get information on the syntax and output of a given function. If we wanted a refresher on how to use `read_dta()`, we can just run `?read_dta` and the function's documentation will pop up in the help window.
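When only the argument list is needed, a couple of base R functions give a quicker answer than the full help page. The sketch below uses `read.csv()` as the example function, since it is available without loading any package; `arg_names` is a hypothetical variable name.

```r
# args() prints a function's arguments and their default values
args(read.csv)

# formals() returns the same information as a named list
arg_names <- names(formals(read.csv))
print(arg_names)
```

The same calls work for `read_dta` once `haven` is loaded.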


## Example Project

By now you should all have R and RStudio installed on your computer. Let's get an example project started so that you can get a sense of how the workflow goes when you start a homework assignment or research project using RStudio rather than R in Jupyter Notebooks. 

### Step 1: 

#### Set up your project folder.

You want everything related to your project to be easy to find on your computer. 

* Start by creating a folder on your desktop (or elsewhere, such as in Dropbox) called __"Ex_Proj"__.

Hint: some coding languages don't like spaces in file paths, so using underscores as a habit can help avoid errors. 

* Next, inside the **Ex_Proj** folder create three subfolders with the following names: __"Data"__, __"Code"__, and __"Output"__.
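The folders can also be created from R rather than by hand. This is a minimal sketch in which a temporary folder stands in for your Desktop; `base_dir` and `proj_dir` are hypothetical names, and you would replace `tempdir()` with your own location.

```r
# Hypothetical location: a temporary folder stands in for your Desktop
base_dir <- tempdir()
proj_dir <- file.path(base_dir, "Ex_Proj")

# Create the project folder and its three subfolders
for (sub in c("Data", "Code", "Output")) {
  dir.create(file.path(proj_dir, sub), recursive = TRUE, showWarnings = FALSE)
}

# List what was created
list.dirs(proj_dir, full.names = FALSE, recursive = FALSE)
```

`file.path()` builds the path with the right separator for the operating system, which avoids typing slashes by hand.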

### Step 2:

#### Download the `sleep75.dta` datafile from bCourses (`Files>Sections>Coding Bootcamp`) and save it in the __Ex_Proj>Data__ folder you just created.

### Step 3:

#### Open RStudio and start a new script:

Click on __File > New File > R Script__ to open a new script.

Add a header to your script to help keep track of what the script does. For example:



In [None]:
# Example Project for Coding Bootcamp
# 01/18/2023

#### Save your script.

Click on **File > Save As**, assign the script a name, and save it in the __Ex_Proj>Code__ folder you just created.

You can also use `Command + S` on Mac or `Control + S` on Windows to save your script.

### Step 4: Install and load the packages you will use in your code.

When you work in Jupyter Notebooks on Datahub, the packages you need are already installed on the server. When you work on your own machine, you need to install the packages you plan to use yourself. Installing and loading packages is usually done at the beginning of a code file, in what is frequently referred to as the *preamble*. 

Today we will use the following packages: `haven` and `tidyverse`

In order to use them you need to

##### a) Install the package using `install.packages("")`

Hint: once you have run `install.packages("")` you can comment it out using `#` so you don't reinstall each time you run your code.  

To install the packages we need today use the following code:

`install.packages("haven")`

`install.packages("tidyverse")`

##### b) Load the packages using `library()`

To load our packages, use the following code:

`library(haven)`

`library(tidyverse)`
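An alternative to commenting `install.packages()` in and out is to install a package only when it is missing. The helper below, `install_if_missing()`, is a hypothetical function written for this illustration, not part of R or of any package. It is demonstrated on the base package `stats` so that the sketch runs without downloading anything.

```r
# Hypothetical helper: install a package only when it is not already available
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package is available now
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# Demonstrated with a base package, which is always installed
install_if_missing("stats")
library(stats)

# For this project you would call it with the course packages instead:
# install_if_missing("haven")
# install_if_missing("tidyverse")
```

`requireNamespace()` checks whether a package can be loaded without attaching it, so the check itself has no side effects on your session.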


### Step 5: Set your working directory.

You want to tell R where your project folder is. Doing this at the beginning of your code means you only have to do this once rather than each time you are saving something or loading a dataset. This also means that if you move your project folder, you only need to change this one line of code. 

To set your working directory you will use `setwd("")` and insert the file path to your project folder. 

You can find the file path of your folder by:

On __Windows__: press `shift+right click` on the folder, then select `Copy as path`

On __Mac__: `Right-click+OPTION` to reveal the “Copy (item name) as Pathname” option

Once you have the file path, you need to switch any backslashes to forward slashes (Windows paths use backslashes; Mac paths already use forward slashes).

So you will have a line of code that looks like:

`setwd("C:/Users/sjohns/Desktop/Ex_Proj")`
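The slashes matter because R treats a backslash inside quotation marks as the start of an escape sequence. If you would rather not edit a copied Windows path by hand, the conversion can be done in code. In this sketch, `win_path` and `r_path` are hypothetical variable names.

```r
# A Windows-style path as copied from File Explorer.
# Inside an R string each backslash must be typed twice.
win_path <- "C:\\Users\\sjohns\\Desktop\\Ex_Proj"

# Replace every backslash with a forward slash
r_path <- gsub("\\", "/", win_path, fixed = TRUE)
print(r_path)
```

The result is `"C:/Users/sjohns/Desktop/Ex_Proj"`, which can be passed straight to `setwd()`.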

### Step 6: Load your data. 

You need to tell R where your data is and give your dataset a name. Because we have set the working directory, we only need to tell R where the file is relative to the working directory. We use the function `read_dta()` (from the `haven` package we installed) to read the dataset. The arrow `<-` assigns the data to a name we create. Let's call our data `sleepdf`. 

`sleepdf <- read_dta("Data/sleep75.dta")`

You should now see `sleepdf` appear in the environment window. 

To check that your data has been loaded correctly, and to see what it looks like, you can use the `head()` function, which will show you the first six rows of data.

`head(sleepdf)`
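`head()` also accepts the number of rows to show, and `tail()` does the same from the bottom of the data. The data frame `toy` below is a made-up stand-in for `sleepdf`, so the sketch runs without the data file.

```r
# Made-up stand-in for sleepdf so the sketch runs without the data file
toy <- data.frame(id = 1:10, sleep = seq(2400, 3300, by = 100))

head(toy)       # first six rows (the default)
head(toy, 3)    # first three rows
tail(toy, 2)    # last two rows
nrow(toy)       # total number of rows
```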

### Step 7: Work with your data.

This step will obviously depend on what your goals are. Let's stick to a simple example here. Let's do two things: create a new variable in our dataset and plot that new variable.

##### Create a new variable
Currently in the data, sleep is measured in minutes per week. Let's convert this to hours per night by dividing by 420 (60 minutes per hour × 7 nights per week). 

`sleepdf$sleep_hr <- sleepdf$sleep/420`

##### Create a new variable, tidyverse style

Alternatively, you can use the `mutate()` function available through `tidyverse` to create the same variable:

`sleepdf <- mutate(sleepdf, sleep_hr = sleep/420)`

Note that `mutate()` can be helpful when you want to create multiple variables at once, as you can specify them as additional arguments (create additional variables by separating them with a comma).
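To illustrate creating several variables in one call, the sketch below builds two at once from a made-up stand-in for `sleepdf`. The second variable, `sleep_hr_wk` (hours per week), is a hypothetical addition for the example. Base R's `transform()` is shown alongside `mutate()`, and the `mutate()` part runs only if `dplyr` (part of `tidyverse`) is installed.

```r
# Made-up stand-in for sleepdf: sleep in minutes per week
toy <- data.frame(sleep = c(3360, 2940, 2520))

# Base R: transform() adds several variables in one call
toy_base <- transform(toy,
                      sleep_hr = sleep / 420,      # hours per night
                      sleep_hr_wk = sleep / 60)    # hours per week

# tidyverse: the same with mutate(), run only if dplyr is installed
if (requireNamespace("dplyr", quietly = TRUE)) {
  toy_tidy <- dplyr::mutate(toy,
                            sleep_hr = sleep / 420,
                            sleep_hr_wk = sleep / 60)
  print(toy_tidy)
}

print(toy_base)
```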

##### Make a Histogram of the new variable 

`mysleephist <- hist(sleepdf$sleep_hr)`

Hint: remember to save your code as you work!

### Step 8: Save your output and modified data. 

You may want to save things that you create in R. You can create many different types of output and save them in many ways. The two most common types of output to save out are plots and modified versions of our data. 

First, let's save the histogram we created as a pdf (and then take a look at it):

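`pdf("Output/sleep_hist.pdf")`

`plot(mysleephist)`

`dev.off()`

Here `pdf()` opens a PDF file in the **Output** folder, `plot()` draws the histogram into it, and `dev.off()` closes the file so that it is actually written to disk and can be opened.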
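Saving a plot to a file follows a three-step pattern: open a graphics device, draw, then close the device with `dev.off()`. The sketch below walks through that pattern with made-up numbers in place of `sleepdf$sleep_hr` and a temporary file (`out_file`, a hypothetical name) in place of the project's output folder.

```r
# Made-up values standing in for sleepdf$sleep_hr
sleep_hr <- c(6.5, 7, 7.2, 7.8, 8, 8.1, 8.4, 9)

# Build the histogram object without drawing it yet
toy_hist <- hist(sleep_hr, plot = FALSE)

# Hypothetical output location; a temporary file keeps the sketch portable
out_file <- file.path(tempdir(), "sleep_hist.pdf")

pdf(out_file)      # open a PDF graphics device
plot(toy_hist)     # draw the histogram into the PDF
dev.off()          # close the device so the file is written to disk

file.exists(out_file)
```

If a saved PDF turns out to be empty or won't open, a missing `dev.off()` is a common cause.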

Next, let's also save the new modified dataset we created. 

##### Note: ALWAYS SAVE MODIFIED DATA AS A NEW FILE. NEVER OVERWRITE YOUR ORIGINAL DATASET. 

`write.csv(sleepdf, file = "Data/sleepdf.csv")`
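A quick way to confirm that a file saved correctly is to read it back in. This sketch uses a made-up stand-in for the modified data and a hypothetical file name in a temporary folder. It also sets `row.names = FALSE`, an optional argument that leaves out the extra column of row numbers `write.csv()` adds by default.

```r
# Made-up stand-in for the modified sleepdf
toy <- data.frame(sleep = c(3360, 2940), sleep_hr = c(8, 7))

# Hypothetical file name in a temporary folder
out_file <- file.path(tempdir(), "sleepdf_modified.csv")

# row.names = FALSE leaves out the extra column of row numbers
write.csv(toy, file = out_file, row.names = FALSE)

# Read the file back in and confirm it matches what we saved
check <- read.csv(out_file)
print(check)
```

Giving the saved file a different name from the original, as here, keeps the raw data untouched.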