# DSCI 100 - Introduction to Data Science


## Lecture 2 - Getting data into R
<img src="img/intentional_arrival.png" width=500>


Reminders while you wait...
- If you ask a question in the chat: a TA will respond
- Feel free to "raise your hand"
- Please note a previous lecture was recorded and will be posted on Canvas

## Housekeeping

- If you registered late to the class:
    - any assignment from a class that happened before you registered is due **5 days after you registered**
    - any assignment from a class that happened after you registered is due **at the regular time**

- Quizzes: All sections at 7:30pm 
   - Quiz 1: June 2 (In Person)
   - Quiz 2: June 16 (In Person)
   - Quiz 3: Will be scheduled by classroom services 

- Zoom recordings available on Canvas

- Remember: when in doubt, 
    - **Kernel -> Restart Kernel** 
    - **Run -> Run All Cells**

- When you are working on assignments there is a cell that says 
```
DOUBLE CLICK TO EDIT THIS CELL AND REPLACE THIS TEXT WITH YOUR ANSWER
``` 
  **make sure you double click and edit that cell directly.** Your answer won't be graded otherwise.

- Shut down server when you are done working

## So far...

- Introduction to 
  - R programming and Jupyter notebooks
  - a sprinkle of data analysis
  
## Now...

Taking our first step in data analysis: loading data into R!

<center><img src="https://thumbs.gfycat.com/VeneratedTeemingHypsilophodon-size_restricted.gif" alt="Where are we going?" style="width: 450px;"/></center>

## Why are we using programming languages to do data analysis? 
### And why can't I just use Excel???

<p float="left">   
    <img src="https://www.r-project.org/Rlogo.png" width="150" /> vs.
    <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/3/34/Microsoft_Office_Excel_%282019%E2%80%93present%29.svg/516px-Microsoft_Office_Excel_%282019%E2%80%93present%29.svg.png" width="150" />
</p>
  
- There are many advantages to using R (or another language, like Python or Julia):
    - statistical analysis functions that go beyond Excel
    - free and open-source 
    - transparent & reproducible code 
    - can handle large amounts of data and complex analyses

- Using a programming language is like baking with a recipe: 
    - Ingredients = data
<img src ="https://www.thespruceeats.com/thmb/FYR4bNLrj304CEaE2aSGPYzygzY=/4680x2632/smart/filters:no_upscale()/greek-butter-cookies-1705307-step-01-5bfef717c9e77c00510e3bf9.jpg" style ="width:500px;"/>
    - Recipe = code
<img src ="https://www.emblibrary.com/EL/Product_images/M16436.jpg" style="width: 500px;"/>

- Someone else can use your recipe (code) to bake the same cake (produce the same data analyses)
<center> <img src ="https://i.insider.com/5aa6a52b4177f92f008b4616?width=1300&format=jpeg&auto=webp" style ="width:1000px;"/> </center>

- Spreadsheets in Excel make it *very* difficult to understand where results came from

**In the data science workflow** (source: Grolemund & Wickham, [R for Data Science](https://r4ds.had.co.nz/))

<center><img src="https://d33wubrfki0l68.cloudfront.net/571b056757d68e6df81a3e3853f54d3c76ad6efc/32d37/diagrams/data-science.png" style="width: 700px"/></center>

## Loading/importing data

- 4 most common ways to do this in Data Science
    1. **read in a text file with data in a spreadsheet format**
    2. **read from a database (e.g., SQLite, PostgreSQL)**
    3. scrape data from the web (optional bonus material)
    4. use a web API (Application Programming Interface) to read data from a website (not covered in DSCI 100)

## Different ways to locate a file / dataset



**Local (on your computer)**
- An *absolute* path locates a file with respect to the "root" folder on a computer 
  - starts with `/`  
  
  e.g. `/home/instructor/documents/timesheet.xlsx`
  
  
- A *relative* path locates a file relative to your *working directory*
  - doesn't start with `/`
  
  e.g. `documents/timesheet.xlsx` <br>(where working directory is `/home/instructor/`)

**Remote (on the web)**

via "URL" that starts with `http://` or `https://`
<br><br>
`http://traffic.libsyn.com/mbmbam/MyBrotherMyBrotherandMe367.mp3`


**Absolute vs relative paths: Which should you use?**
- Generally, to ensure your code can be run on a different computer, you should use relative paths 

e.g., suppose Alice and Bob each keep a copy of the same project in their home folders:

`/home/alice/project/data/happiness_report.csv`


`/home/bob/project/data/happiness_report.csv`


- Even though Alice and Bob stored their files in the same place on their computers (in their home folders), the absolute paths are different because their usernames are different. 

- If Bob has code that loads the `happiness_report.csv` data using an absolute path, the code won't work on Alice's computer. But the relative path from inside the project folder (`data/happiness_report.csv`) is the same on both computers, so any code that uses relative paths will work on both.
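
A minimal sketch of how a relative path is resolved against the working directory. It uses only base R. The project folder and the two-line CSV are made up for illustration and are created in a temporary location; they are not part of the course data.

```r
# Hypothetical project folder, created in a temporary location for illustration
project_dir <- file.path(tempdir(), "project")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)
writeLines(c("country,score", "A,7.1"),
           file.path(project_dir, "data", "happiness_report.csv"))

# Make the project folder the working directory (remember the old one)
old_wd <- setwd(project_dir)

# The relative path is the same on every computer that has the project
relative_path <- "data/happiness_report.csv"
file.exists(relative_path)

# The absolute path depends on where the project lives on this computer
absolute_path <- normalizePath(relative_path)
absolute_path

# Restore the original working directory
setwd(old_wd)
```

Printing `absolute_path` on two different computers would generally give two different strings, while `relative_path` stays the same.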

## Demo: Loading data from your computer

**Workflow:**
1. make the dataset accessible to the computer
  - might need to load a package, download a file, or connect to a database
2. inspect the data using Jupyter to see what it looks like
3. load the data into R
  - using `read_csv`, `read_delim`, etc
4. inspect the result to make sure it worked
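
The four steps above can be sketched on a tiny, made-up CSV file before trying a real dataset. The file contents and the names `csv_path` and `people` are assumptions for illustration only.

```r
library(readr)

# Step 1: make the dataset accessible
# (here: write a tiny made-up CSV to a temporary file)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,height_cm",
             "Ada,170",
             "Grace,165"), csv_path)

# Step 2: inspect the raw text to see the delimiter and the header row
readLines(csv_path, n = 3)

# Step 3: load the data into R
people <- read_csv(csv_path)

# Step 4: inspect the result to make sure it worked
people
```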

Let's load the Old Faithful geyser dataset from [Larry Wasserman's book All of Statistics](https://www.stat.cmu.edu/~larry/all-of-statistics/)


In [None]:
#Step 0: load the tidyverse package -- allows us to use the read_* functions
options(repr.matrix.max.rows = 6)
library(tidyverse)

In [None]:
#Step 1: download the file to get the data from the URL onto our computer
# my_url <- "http://stat.cmu.edu/~larry/all-of-statistics/=data/faithful.dat"
my_url <- "https://web.archive.org/web/20190328100248if_/http://www.stat.cmu.edu:80/~larry/all-of-statistics/=data/faithful.dat"

```
my_url <- "http://stat.cmu.edu/~larry/all-of-statistics/=data/faithful.dat"
download.file(my_url, 'data/faithful.dat')
```
- download file to the data folder (notice this is a relative path)
- can also read into R directly using URL without this step 

In [None]:
#Step 2: take a look at the data using Jupyter


- note that JupyterHub runs on a remote server, so it cannot access files stored on your own computer
- right click file to open in plain text editor

In [None]:
#Step 3: load the data into R

First, load the data without skipping any lines, setting the delimiter to a single space.
Notice that `read_delim` takes the first row as the column headers.
```
faithful_data <- read_delim('data/faithful.dat', delim = " ")
faithful_data
```

- Notice that we needed to:
    - skip 26 lines of metadata
    - manually add the column names `index`, `eruption_time`, `wait_time` 
    - set the entry delimiter to spaces

**For instructors only:**

**If you're on the v0.17.0 docker image: `read_delim` still works fine. You just need to set `skip`, `col_names`, and `col_types`.**

**If you're on an upgraded version (current is v0.28.0): `read_delim` will not work. You need to use `read_table` instead (as in the example below).**

```
faithful_data <- read_delim('data/faithful.dat', 
                               delim = " ", 
                               skip = 26,  
                               col_names = c("index", "eruption_time", "wait_time"))

```
- Show that `read_delim` can't handle runs of multiple whitespace characters, which this file contains

```
faithful_data <- read_table('data/faithful.dat',
                             skip = 26, 
                             col_names = c("index", "eruption_time", "wait_time"))
```
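
A small sketch of the difference on a made-up stand-in for `faithful.dat`: two lines of metadata followed by columns separated by several spaces. The rows are invented for illustration, and the exact `read_delim` output may vary between readr versions.

```r
library(readr)

# Made-up stand-in for faithful.dat: 2 metadata lines, then space-aligned columns
dat_path <- tempfile(fileext = ".dat")
writeLines(c("Old Faithful example (made-up rows)",
             "columns: index, eruption time, waiting time",
             "1   3.600   79",
             "2   1.800   54",
             "3   3.333   74"), dat_path)

# read_delim treats every single space as a delimiter,
# so runs of spaces show up as extra, empty columns
delim_result <- read_delim(dat_path, delim = " ", skip = 2, col_names = FALSE)
ncol(delim_result)

# read_table splits the columns on whitespace instead
table_result <- read_table(dat_path, skip = 2,
                           col_names = c("index", "eruption_time", "wait_time"))
table_result
```

Comparing `ncol(delim_result)` with `ncol(table_result)` is a quick way to see which function parsed the file as intended.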
                              
- We can also change the data types when reading in the data

```
faithful_data <- read_table('data/faithful.dat',
                            skip = 26,           # skip 26 lines of metadata
                            col_names = c("index", "eruption_time", "wait_time"), # specify column names
                            col_types = c(
                                index = "c",         # index is character
                                eruption_time = "n", # make eruption_time and wait_time numeric
                                wait_time = "n"))
```

c() -- combines values into a vector or list
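
As a hedged alternative sketch, the column types can also be written out with readr's `cols()` specification, which some readers find easier to read than one-letter codes. The two-row file and the names `column_names` and `typed_data` are made up for illustration.

```r
library(readr)

# Made-up two-row file with whitespace-separated columns
dat_path <- tempfile(fileext = ".dat")
writeLines(c("1   3.600   79",
             "2   1.800   54"), dat_path)

# c() combines the three names into one character vector
column_names <- c("index", "eruption_time", "wait_time")

typed_data <- read_table(dat_path,
                         col_names = column_names,
                         col_types = cols(index = col_character(),
                                          eruption_time = col_double(),
                                          wait_time = col_double()))

# Check the class of each column
sapply(typed_data, class)
```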

In [None]:
#Step 4: check the result


`faithful_data`

## Note about loading data

- It's important to do it carefully + check results after!
  - will help reduce bugs and speed up your analyses down the road
- Think of it as tying your shoes before you run; not exciting, but if done wrong it will trip you up later!

<center><img src="https://media.giphy.com/media/Se449o7yNjziw/giphy.gif"/></center>


## Questions?

## Go for it! 

Work together on your worksheets and tutorials. **Collaborate, but do not copy! Learn from each other!** 
![](https://media.giphy.com/media/zaezT79s3Ng7C/giphy.gif)

## Class activity:

- In your group, try to read in this dataset from the web: <br> https://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt

## What did we learn?
- 
- 
- 

## Note on web scraping

- More and more websites don't want you scraping them
- Instead, they provide "easier" ways to access their data, which let them regulate access and know who you are
- So, TL;DR: read the Terms of Service for ANY webpage you plan to scrape
    - they're long, so search for "scraping", "auto", "bot", etc. to find the relevant section