<a href="https://colab.research.google.com/github/deveyn-hainey/data-wrangling/blob/main/Solutions_01_Intro_to_R_and_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://githubtocolab.com/CU-Denver-MathStats-OER/Data-Wrangling-and-Visualization/blob/main/01-Intro-to-R-and-Colab.ipynb)



# <a name="01-title"><font size="6">Introduction to R and Google Colaboratory</font></a>

---

# <a name="01-intro">Programming with R</a>

---

[R](https://www.r-project.org/about.html) is a programming language used largely for statistical computing, data wrangling and visualization.

- It is modeled after the _S_ programming language.
- It was introduced by Robert Gentleman and Ross Ihaka in 1993.


The first stable version of R was released in 2000. Since then, a large community of R users has created many useful scripts, packages, and data sets that are openly shared and updated.

- <font color="dodgerblue">**R is free and open source software**</font>.
- R runs on Windows, Mac, Linux, and other types of computers.
- R runs on cloud-based software such as [Google Colab](https://colab.research.google.com/) and [Posit Cloud](https://posit.cloud/).
- R is an <font color="dodgerblue">**interactive programming language**</font>.
  - You type and execute a command in the Console for immediate feedback.
  - In contrast, a compiled programming language compiles a program that is then executed.
- R is highly extendable.
  - Many user-created packages are available to extend the functionality beyond what is installed by default.
  - Users can write their own functions and easily add software libraries to R.




# <a name="01-colab">What is Google Colaboratory (Colab)?</a>

---


We will use the free, cloud-based Google Colaboratory (or Colab) to interact with, edit, and save Jupyter notebooks, where we'll play around with R to help us explore methods for wrangling and visualizing data. <font color="dodgerblue">**You do not need to purchase or install any software for this course!**</font>


- Jupyter, [https://jupyter.org/](https://jupyter.org/), is free software used for coding and interactive computing.
  - Jupyter notebooks are a dynamic environment to synthesize narrative text, executable R (and Python) code, visualizations, videos, and more.
- Google Colaboratory (or Colab, [https://colab.research.google.com](https://colab.research.google.com)), is a free, cloud-based application used at universities, research labs, and companies around the world.
  - You can open a Colab notebook in any web browser.
  - A device with an internet connection and a web browser is the only requirement.
- Below are several links to Google resources for getting started with Colab.
  - Watch Google's [Introduction to Colab video](https://www.youtube.com/watch?v=inN8seMm7UI) to learn more.
  - Open [Google's Welcome to Colab notebook](https://colab.research.google.com/?utm_source=scs-index).
  - Open a [tutorial video on how to edit and run code cells in Colab](https://youtu.be/kbIy_8rVZYo).

## <a name="01-navigate">Navigating a Colab Notebook</a>

---


This is an interactive document that contains two types of cells.

- <font color="dodgerblue">**Text cells**</font> are used for typing Markdown text (similar to Word).
  - For example, this is a text cell!
  - Double-click on an existing text cell to edit the text.
- <font color="dodgerblue">**Code cells**</font> are used to insert, edit, run, and view/store output of R code.
  - Click on a code cell to edit the code.
  - Click the play button in the upper left corner of a code cell to run it.
  - Or use the keyboard shortcut `Shift + Return/Enter` to run a code cell.
- Run the code cell below to compute `731 * 123`.




In [None]:
731 * 123  # run to compute the product

In [None]:
1 + 3

Above, we added 1 and 3.

<font color="dodgerblue">To add a new text or code cell to a Colab notebook</font>, hover the pointer over the upper or lower border of an existing cell and:

- Click the `+ Code` button to add a new code cell.
- Click the `+ Text` button to add a new text cell.
- We can also delete, reorder, and add comments to cells using the buttons in the upper right corner of an active cell.

A <font color="dodgerblue">Table of Contents</font> can be opened (and closed) along the left side of this window for quickly linking to other parts of this document.

- Expand and collapse sections to improve navigating around longer documents.



## <a name="01-save">Saving Your Work to a Colab Notebook</a>

---

This notebook is a shared Colab notebook available for anyone to view. However, since everyone is sharing this notebook, you do not have permission to save changes to this shared Colab notebook. In order to save your work:

1. You will need to set up a free Google Drive account. If you already have a Google Drive account, you are ready to go!
2. Click the `Copy to Drive` button to the right of the `+ Code` and `+ Text` buttons on top of the notebook.
3. Select `File/Save a Copy in Drive` from the menu.
  - By default, the notebook will be saved in a folder named **Colab Notebooks** in your Drive.
  - Feel free to rename and store the notebook wherever you like.


# <a name="workflow-basics">Workflow Basics of Programming in R</a>

---

Reading: See [Section 2](https://r4ds.hadley.nz/workflow-basics) of R for Data Science for more details.

## <a name="01-run">Running R Code Cells</a>

---


We can use R to do basic math calculations and display the result to the screen. To run the code, either:

- Click the play button in the upper left corner of the code cell.
- Or use the keyboard shortcut `Shift + Return/Enter`.

If the code runs without an error, a green check appears. If the code fails, a red exclamation point appears along with an error message.


In [None]:
sqrt(3 + 4)  # using R as a calculator

In [None]:
sqrt(3 - z)

ERROR: Error: object 'z' not found


# <a name="assign">Storing Output: Assignment of Objects</a>

---

You can create new <font color="dodgerblue">**objects**</font> with the assignment operator `<-`.

- <font color="tomato">Caution: In R we use `<-` for the assignment operator, not the `=` character, which is used to set options inside functions.</font>
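
As a small illustration of the caution above, the sketch below uses `<-` to create an object and uses `=` only to set arguments inside the function call. The object name `odd_values` is an illustrative choice, while `from`, `to`, and `by` are the argument names of `seq()`.

```r
# `<-` assigns the result to an object;
# `=` sets arguments inside the call to seq()
odd_values <- seq(from = 1, to = 9, by = 2)
odd_values
```

Naming the arguments is optional here, but it makes clear which value plays which role.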

In [None]:
# create vector of integers from 1 to 9 with increments of 2
x <- seq(1, 9, 2)

## <a name="printing">Printing Output to Screen</a>

---

Although we do not see any output after running the previous code cell, the green check mark to the left of the play button indicates the code has successfully run.

- We have stored the sequence of integers to vector `x`.
- If we would like to see the value that is being stored in `x`, we need to instruct R to **print the output to the screen**.
- We can simply type the variable name `x` in a code cell to see the contents stored in `x`.



Note that the value of `x` is not printed when it is assigned; it is only stored. To view the value, type `x` in a code cell and run the cell.

In [None]:
x  # printing the output stored in x

In [None]:
# performing multiple operations in one code cell
x <- seq(1, 9, 2)
x
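
Typing an object's name in a code cell is one way to display it. Calling `print()` explicitly is another. The minimal sketch below recreates the same vector `x` so that it runs on its own.

```r
x <- seq(1, 9, 2)  # store the odd integers from 1 to 9
print(x)           # explicitly print the stored values
```

An explicit `print()` can be handy in places where simply typing the name does not display anything, such as inside an `if` block.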

### <a name="asign-print">Assigning and Printing with `( )`</a>

---

If we enclose a command in a pair of round parentheses `( )`, then the command inside the parentheses will be executed and its output printed to the screen.

In [None]:
# enclosing command in parentheses assigns and prints to screen
(x <- seq(1, 9, 2))

There are multiple ways we can create the vector `x` and print the contents of vector `x` to the screen. When coding, there are often many different ways we can write code to perform the same tasks.

- Sometimes we will opt to write the most efficient code.
- Other times we will choose to write code that is easier to understand.
- At times we will show multiple ways of performing the same task.

In [None]:
(mean_x <- mean(x))  # applying the mean() function to vector x

# <a name="math-ops">Mathematical Operations in R</a>

---

- Use `+`, `-`, `*`, and `/` to add, subtract, multiply, and divide, respectively.
- Use a caret `^` or double asterisk `**` to raise a number to a power.
- Parentheses are useful when applying multiple operations.
- Spaces are cosmetic, but can help make the code easier to read. For example, `3 + 5`, `3+5`, and `3+  5` are identical commands.
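
The operators listed above can be tried out in a single code cell. The numbers below are arbitrary values chosen only for illustration.

```r
7 + 2        # addition
7 - 2        # subtraction
7 * 2        # multiplication
7 / 2        # division
2^3          # raising 2 to the power 3
2**3         # the same power operation written with **
(3 + 5) * 2  # parentheses force the addition to happen first
3 + 5 * 2    # without parentheses, the multiplication happens first
```

Comparing the last two lines shows why parentheses matter when several operations are combined.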


## <a name="01q1">Question 1</a>

---

What will be displayed on screen after running the code below?

```
y <- (3 * 2)^2
```

Without running the code, type your answer in the space below.



### <a name="01sol1">Solution to Question 1</a>

---

Nothing! The result, 36, is stored in `y`, but it is not printed to the screen.

<br>  
<br>  


## <a name="01q2">Question 2</a>

---

Edit the code cell below so the output stored in `y` is displayed on screen.





### <a name="01sol2">Solution to Question 2</a>

---

Enclosing the assignment in parentheses both stores the result in `y` and prints it to the screen.

<br>  


In [None]:
# Edit the code cell to display the result
(y <- (3 * 2)^2)

## <a name="comments">Comments in R</a>

---

R will ignore any text after `#` for that line. This allows you to write <font color="dodgerblue">**comments**</font>, text that is ignored by R but read by other humans. We'll sometimes include comments in examples explaining what's happening with the code. Comments typically are typed either:

- At the start of a new line to explain what the following line(s) of code do.
- On the same line as a command.
  - After the command, type **two spaces**, then `#`, another space, and then type the comment.


Comments make your code more readable to yourself and others, and can save people (including yourself) time when interpreting your code. Comments can explain how the code works or what it is doing, but they are perhaps most useful for explaining *why*. Use comments to explain your overall plan of attack and record important insights.
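
The sketch below shows both comment placements described above. The object names `wind_knots` and `wind_mph` and the value 64 are hypothetical, chosen only to illustrate a comment that explains *why* a step is taken.

```r
# Convert wind speed from knots to miles per hour because
# most readers are more familiar with mph than with knots.
wind_knots <- 64
wind_mph <- wind_knots * 1.15078  # 1 knot is about 1.15078 mph
wind_mph
```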

## <a name="naming">Naming Objects</a>

---

Object names:

- Must start with a letter, and
- Can only contain letters, numbers, `_`, and/or `.`.

Object names should be descriptive but brief since each time we refer to the object we do not want to type a lot of characters. When we want to use multiple words to describe an object, we recommend using the `snake_case` method:

- Use lowercase letters.
- Separate words with the underscore character, `_`.
- If needed, abbreviate longer words.
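
A minimal sketch of these naming rules follows. All object names and values are hypothetical, and the invalid names are left as comments because running them would produce errors.

```r
# Valid, descriptive snake_case names
max_wind_kts <- 150
avg_pressure_mb <- 992.5

# Valid, but long and tedious to type repeatedly
MaximumSustainedWindSpeedInKnots <- 150

# Invalid names (these would cause errors if uncommented)
# 2nd_storm <- "Bonnie"   # starts with a number
# storm-name <- "Bonnie"  # contains a hyphen
```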


## <a name="packages">What Are Packages in R?</a>
---

R packages are collections of functions, sample data, and/or other code scripts. R installs a set of default packages during installation. In this case, we are working with R in the cloud using [Google Colaboratory](https://colab.research.google.com/).

-   The files, code, and data associated with installed packages are saved in the cloud and not locally on your computer.
-   Many R packages have already been installed in Google Colaboratory.

**Run the code cell below to get a list of all default R packages
available in Google Colaboratory.**

In [None]:
# See a list of installed default packages
allpack <- installed.packages()
rownames(allpack)
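
Because the full list is long, it can help to check for one package by name. The sketch below is one possible way to do that. The object name `installed_names` is an illustrative choice, and whether `dplyr` is found depends on the environment in which the code is run.

```r
# Names of all installed packages in the current R environment
installed_names <- rownames(installed.packages())

length(installed_names)       # how many packages are installed
"dplyr" %in% installed_names  # TRUE if dplyr is installed, FALSE otherwise
```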

## <a name="data">What Data is Available in R?</a>
---

R has many available data sets that we can easily import and investigate using the statistical methods and analyses that we will discover this semester.

-   Run the code cell below to get a list of all available data sets in all available packages in R.
-   A tab should open on the right displaying a long list of data sets.
-   We can close the tab in order to keep a larger working window.

In [None]:
data(package = .packages(all.available = TRUE))
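
To narrow the list to a single package, we can pass that package's name to `data()`. The sketch below uses the built-in `datasets` package as an example. The object name `pkg_data` is an illustrative choice.

```r
# List the data sets in one package (here, the built-in datasets package)
pkg_data <- data(package = "datasets")

# The results are stored as a matrix; show the first few names and titles
head(pkg_data$results[, c("Item", "Title")])
```

Storing the result in an object and printing only part of it keeps the output inside the notebook instead of opening a separate tab.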

## <a name="load-pack">Loading Packages with the `library()` Command</a>
---

Each time we start or restart a new R session and want to access the library of functions and data in the package, we need to load the library of files in the package with the `library()` command.

-   The `dplyr` package is already installed in Google Colaboratory.
-   We still need to use a `library()` command to load the package if we want to access data and functions in the package.
-   If we do not run the code cell below, we will not be able to run the rest of the code cells in this document without receiving error messages.
-   **Run the code cell below to load the `dplyr` package.**

In [None]:
library(dplyr)


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




### <a name="reload">Caution: Reloading Packages When Restarting a Session</a>
---

If we take a break in our work, it is possible our R session will time out and close. <font color="tomato">**Each time we restart an R session, we will need to rerun `library()` commands in order to reload any packages we plan to use**</font>.

The same caution applies to any objects, vectors, or data frames we create or edit in an R session. If a session times out, and we want to use an object `x` that we previously created, we will need to run the code cell(s) where object `x` is created again before we can refer back to `x` in the current session.
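
After a break, one way to check the state of a session is sketched below. It is only an illustration of the caution above. The object name `dplyr_attached` is an illustrative choice, and the vector `x` is recreated with the same `seq()` command used earlier in this notebook.

```r
# Check whether dplyr is currently loaded in this session
dplyr_attached <- "package:dplyr" %in% search()
dplyr_attached  # FALSE means library(dplyr) needs to be run again

# Recreate x only if it no longer exists in this session
if (!exists("x")) {
  x <- seq(1, 9, 2)
}
x
```

In practice, rerunning the earlier code cells from top to bottom achieves the same result.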



## <a name="help">Finding Help Documentation</a>
---

As with learning any new skill, it is always useful to know where to find help. R has been in use since 2000, and there is a large, active community of users that share lots of helpful advice online. Certainly [Google](https://www.google.com/) or other search engines are a useful way to search and find help with R. Below are two additional websites useful for searching for help with R.

-   The developers of R have a [useful page on where to find help](https://www.r-project.org/help.html).
-   [Rseek](https://rseek.org/) is provided by Sasha Goodman at Stanford University. This search engine lets you search several R-related sites.

We can also find help without opening a separate browser window or tab. The `?` help operator and `help()` function provide access to the help manuals for R functions, data sets, and other objects. Running a `?` or `help()` command in a code cell opens a side bar with a tab displaying the help documentation.

-   For example, the package `dplyr` contains a data set called `storms`.
-   Where is the data from, and what variables are in the data set?
-   **Run the code cell below to access the help documentation for the `storms` data set.**
    -   Resizing the tab in the side bar may help the documentation be more readable.
    -   We can close the tab if we want to increase the size of our working window.

In [None]:
?storms

In [None]:
help(storms)

## <a name="q3">Question 3</a>
---

After reading the `storms` help documentation, answer the following
questions:

a.  What is the source of the data?

b.  What variables are included in the data?

c.  Over what period of time and how frequently are observations recorded?



### <a name="01sol3">Solutions to Question 3</a>
---


a. The `storms` data set comes from the NOAA Atlantic hurricane database best track data, <https://www.nhc.noaa.gov/data/#hurdat>.


<br>  

b. There are thirteen variables in the data set listed below.

- `name` is the Storm Name.
- `year`, `month`, `day`, and `hour` tell us when the storm observation was recorded.
- `lat` and `long` give the latitude and longitude of the location of the storm center.
- `status` gives the storm classification.
- `category` is the Saffir-Simpson hurricane category calculated from wind speed.
- `wind` is the storm's maximum sustained wind speed (in knots).
- `pressure` is the air pressure at the storm's center (in millibars).
- `tropicalstorm_force_diameter` is the diameter (in nautical miles) of the area experiencing tropical storm strength winds (34 knots or above).
- `hurricane_force_diameter` is the diameter (in nautical miles) of the area experiencing hurricane strength winds (64 knots or above).


<br>  

c. The data includes the positions and attributes of storms from 1975-2022. Storms from 1979 onward are measured every six hours during the lifetime of the storm.
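
The answers above can also be checked directly against the data. The sketch below assumes the `dplyr` package is installed, as it is in Colab. The object names `dplyr_installed` and `storms_data` are illustrative choices. The exact results may differ between versions of `dplyr`, since the `storms` data set is updated from time to time.

```r
# Assumes the dplyr package is installed, as it is in Colab
dplyr_installed <- requireNamespace("dplyr", quietly = TRUE)

if (dplyr_installed) {
  storms_data <- dplyr::storms
  print(names(storms_data))              # variable names
  print(ncol(storms_data))               # number of variables
  print(range(storms_data$year))         # first and last year recorded
  print(sort(unique(storms_data$hour)))  # hours of the day with observations
}
```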


<br>  
<br>  

## <a name="01q4">Question 4</a>
---

Insert a code cell and run the command `?hist` to see the help
documentation for the histogram function.

a.  What option can we use to add a main title to the histogram?

b.  What option can we use to set the fill color for the bars of a histogram?



In [None]:
# storms$wind is an illustrative choice of data; hist() needs a numeric vector to plot
hist(storms$wind, col = "pink", main = "Title of Histogram")

### <a name="01sol4">Solution to Question 4</a>

---

a. The `main` option.

<br>  

b. The `col` option.


<br>  
<br>  

  



## <a name="CC License">Creative Commons License Information</a>
---

![Creative Commons
License](https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png)

Materials were created by the [Department of Mathematical and Statistical Sciences at the University of Colorado Denver](https://github.com/CU-Denver-MathStats-OER/)
and are licensed under a [Creative Commons
Attribution-NonCommercial-ShareAlike 4.0 International
License](http://creativecommons.org/licenses/by-nc-sa/4.0/).