# Getting Started with R (Warmup)
**Author:** Marcus Birkenkrahe (pledged) with ChatGPT  
**Subtitle:** DSC 205 (Introduction to Advanced Data Science), Lyon College Spring 2026

## Objectives

Let's make a simple plot with R and learn a few useful workflow habits along the way.

1. Open R in Google Colaboratory.
2. Load a dataset from a URL into a data frame.
3. Use basic error handling with `tryCatch`.
4. Explore the dataset.
5. Create a simple plot.
6. Summarize and check your understanding.

## (Optional) Packages and your search path

- In R, attached packages and environments are visible via `search()`:

In [None]:
search()

- You can list objects in your current workspace with `ls()`:

In [None]:
ls()

- You may want to start from a clean environment. `detach()` removes a package from the search path, and `rm(list = ls())` removes every object that `ls()` lists:

In [None]:
# detach("package:lubridate") # Uncomment if needed
rm(list=ls())
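
To see what these housekeeping calls do without wiping your real workspace, the sketch below works in a throwaway environment. The environment name `scratch` and the object names in it are made up for illustration.

```r
# A scratch environment stands in for the global workspace (hypothetical names).
scratch <- new.env()
assign("x", 1:3, envir = scratch)
assign("msg", "hello", envir = scratch)

before <- ls(scratch)                    # names currently defined in scratch
rm(list = ls(scratch), envir = scratch)  # same idea as rm(list = ls())
after <- ls(scratch)                     # nothing is left

print(before)
print(after)

# Attached packages appear on the search path with a "package:" prefix.
stats_attached <- "package:stats" %in% search()
print(stats_attached)
```

Calling `rm(list = ls())` without `envir` does the same thing to the global workspace, so run it only when you really want to start over.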

## Download CSV data from the web into a data frame

- We'll download the Iris dataset as a CSV file from a URL. (The built-in `iris` dataset exists, but using a URL is good practice.)

In [None]:
data_url <- "https://tinyurl.com/iris-data-csv"

- You could simply call `read.csv(data_url)`, but anything that touches the web can fail: URL typos, network outages, and so on.
- Load the dataset into a data frame with minimal error handling, returning `NULL` if the download fails:

In [None]:
df <- tryCatch(read.csv(data_url), error = function(e) NULL)

- Load the dataset again, this time with error handling that reports what happened:

In [None]:
df <- tryCatch(
  {
    tmp <- read.csv(data_url)
    cat("Dataset downloaded and loaded successfully!\n")
    tmp
  },
  error = function(e) {
    cat("Error loading dataset:\n")
    cat(conditionMessage(e), "\n")
    cat("Please check your internet connection or the URL.\n")
    NULL
  }
)
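
The same pattern can be tried without a network connection. The sketch below wraps it in a helper, `safe_read_csv()`, which is a hypothetical name and not part of base R, and exercises it once with a missing file and once with a small temporary CSV. Unlike the cells above, it also installs a `warning` handler, which is an added assumption about how strict you want to be.

```r
# Hypothetical helper: returns a data frame, or NULL if reading fails.
safe_read_csv <- function(path) {
  tryCatch(
    read.csv(path),
    warning = function(w) {
      cat("Warning while reading:", conditionMessage(w), "\n")
      NULL
    },
    error = function(e) {
      cat("Error while reading:", conditionMessage(e), "\n")
      NULL
    }
  )
}

# Failure case: this file does not exist, so the helper returns NULL.
bad <- safe_read_csv(file.path(tempdir(), "no_such_file.csv"))
print(is.null(bad))

# Success case: write a tiny CSV to a temporary file and read it back.
tmp_file <- tempfile(fileext = ".csv")
write.csv(data.frame(a = 1:2, b = c("x", "y")), tmp_file, row.names = FALSE)
good <- safe_read_csv(tmp_file)
print(is.data.frame(good))
```

Treating every warning as a failure is a deliberate choice here. If a file is usable despite a warning, drop the `warning` handler and keep only the `error` handler.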

## Exploring the data

- Print the whole data frame (be careful: printing large data frames is noisy):

In [None]:
df

- Use `head()` to view just the first rows:

In [None]:
head(df)

- A simple safety check: did `df` load? The `tryCatch` call above returns `NULL` on failure, so the check tests for that:

In [None]:
if (exists("df") && !is.null(df)) {
  head(df,3)
} else {
  cat("Data frame 'df' not found. Did the download fail?\n")
}
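
One caveat with checking the name `df`: R's `stats` package already provides a function called `df` (the density of the F distribution), so `exists("df")` can be TRUE even when no data frame was ever assigned. A sturdier test asks whether the object is actually a data frame. The helper name `is_loaded_df()` below is hypothetical, and the built-in `iris` data frame stands in for downloaded data.

```r
# Hypothetical helper: TRUE only for a data frame with at least one row.
is_loaded_df <- function(x) is.data.frame(x) && nrow(x) > 0

checks <- c(
  fn_exists  = is.function(stats::df),   # the name `df` is already taken by a function
  fn_as_data = is_loaded_df(stats::df),  # a function is not a data frame
  null_value = is_loaded_df(NULL),       # a failed download (NULL) is rejected
  real_data  = is_loaded_df(iris)        # a real data frame passes
)
print(checks)
```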

- `str()` is a great quick summary (similar role to `DataFrame.info()` in pandas):

In [None]:
str(df)

- Basic summary statistics:

In [None]:
summary(df)
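
A few more one-liners are often useful at this stage. The sketch below uses R's built-in `iris` data frame so that it runs offline; note that its column names (for example `Species`) are capitalized differently from the CSV columns used elsewhere in this notebook.

```r
d <- iris   # built-in copy; column names differ from the downloaded CSV

n_dims <- dim(d)                    # number of rows and columns
col_types <- sapply(d, class)       # class of each column
na_counts <- colSums(is.na(d))      # missing values per column
species_counts <- table(d$Species)  # rows per category

print(n_dims)
print(names(d))
print(col_types)
print(na_counts)
print(species_counts)
```

Counting missing values early is worthwhile because many real datasets contain `NA` entries that affect later summaries and plots.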

## Plotting with base R

- A minimal scatter plot: sepal length vs sepal width.

In [None]:
# Note: In Colab, plots usually display inline automatically.
# This cell draws to the notebook output; it does not write a file.
plot(df$sepal_length, df$sepal_width,
     xlab = "Sepal Length (cm)",
     ylab = "Sepal Width (cm)",
     main = "Iris: Sepal Length vs Sepal Width")

- Save a plot to a file (PNG): `png()` opens a file device, `plot()` draws into it, and `dev.off()` closes the device and finishes writing the file.

In [None]:
# The file is written to the current working directory (see getwd()).
# To save into a subdirectory instead, create it first and adjust the path:
# dir.create("../img", showWarnings = FALSE)

png("iris2.png", width = 800, height = 600)
plot(df$sepal_length, df$sepal_width,
     xlab = "Sepal Length (cm)",
     ylab = "Sepal Width (cm)",
     main = "Iris: Sepal Length vs Sepal Width")
dev.off()

- Check that the file was created:

In [None]:
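file.exists("iris2.png")                   # portable: TRUE if the file was written
system("ls -lt iris2.png", intern = TRUE)  # Unix shell only; intern = TRUE returns the output to R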
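
If `plot()` fails between `png()` and `dev.off()`, the file device stays open and later plots silently go to the file. One way to guard against that is to wrap the steps in a function and register `dev.off()` with `on.exit()`. This is a minimal sketch: the helper name `save_scatter()` and the output file name are hypothetical, the built-in `iris` data stands in for `df`, and it assumes your R build supports PNG output (check with `capabilities("png")`).

```r
# Hypothetical helper: save a scatter plot as PNG and always close the device.
save_scatter <- function(x, y, file, ...) {
  png(file, width = 800, height = 600)
  on.exit(dev.off(), add = TRUE)   # runs when the function exits, even after an error
  plot(x, y, ...)
  invisible(file)
}

out_file <- file.path(tempdir(), "iris_demo.png")
save_scatter(iris$Sepal.Length, iris$Sepal.Width, out_file,
             xlab = "Sepal Length (cm)",
             ylab = "Sepal Width (cm)")

print(file.exists(out_file))
print(file.size(out_file) > 0)
```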

- Grouping by species (color points by category) with a legend:

In [None]:
species <- df$species
cols <- as.integer(factor(species))  # 1..k
plot(df$sepal_length, df$sepal_width,
     col = cols, pch = 19,
     xlab = "Sepal Length (cm)",
     ylab = "Sepal Width (cm)",
     main = "Iris: Sepal Length vs Sepal Width (by species)")
legend("topright",
       legend = levels(factor(species)),
       col = seq_along(levels(factor(species))),
       pch = 19, bty = "n")
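
To see why `as.integer(factor(species))` works as a color index, it helps to look at the mapping on a tiny vector. The sketch below also shows an alternative: a named palette that makes each species-to-color pairing explicit. The four-element vector and the color choices are made up for illustration.

```r
species_demo <- c("setosa", "virginica", "setosa", "versicolor")
f <- factor(species_demo)

lev <- levels(f)       # levels are sorted alphabetically
idx <- as.integer(f)   # position of each value's level
print(lev)             # "setosa" "versicolor" "virginica"
print(idx)             # 1 3 1 2

# Named palette: look colors up by species name instead of by integer code.
pal <- c(setosa = "steelblue", versicolor = "darkorange", virginica = "forestgreen")
point_cols <- unname(pal[as.character(f)])
print(point_cols)
```

With a palette like this, `col = pal[as.character(df$species)]` in `plot()` and `legend = names(pal), col = pal` in `legend()` keep points and legend entries consistent by construction.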

## A common subsetting mistake (and the correct rule)

- Suppose you want only rows for one species. This is correct:

In [None]:
s <- "virginica"
sub <- df[df$species == s, ]
head(sub)

- This is **wrong**: `"species" == s` compares the literal string `"species"` with `s`, which yields a single `FALSE` instead of one TRUE/FALSE per row, so no rows are selected:

```r
sub <- df["species" == s, ]
```

- Subsetting rule to remember:

> 1) `df$col` selects a column.
> 2) `df$col == value` creates a logical mask (TRUE/FALSE for each row).
> 3) `df[mask, ]` filters rows where mask is TRUE.
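
The difference between the two forms is easy to check on a small example. The three-row data frame `toy` below is made up for illustration.

```r
toy <- data.frame(
  species = c("setosa", "virginica", "virginica"),
  sepal_length = c(5.1, 6.3, 5.8)
)
s <- "virginica"

# Wrong: compares the literal string "species" with s, giving one FALSE.
single <- "species" == s
wrong <- toy[single, ]
print(single)        # FALSE
print(nrow(wrong))   # 0

# Right: compares the column's values with s, giving one logical per row.
mask <- toy$species == s
right <- toy[mask, ]
print(mask)          # FALSE TRUE TRUE
print(nrow(right))   # 2
```

The wrong version raises no error, which is why it is easy to miss: it simply returns a data frame with zero rows.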

## Summary

| # | Command / Concept | Definition |
|---|---|---|
| 1 | `search()` | Show attached packages/environments |
| 2 | `ls()` | List objects in current workspace |
| 3 | `read.csv(url)` | Read CSV data from URL into a data frame |
| 4 | `tryCatch(..., error=...)` | Handle runtime errors safely |
| 5 | `conditionMessage(e)` | Extract readable error message |
| 6 | `exists("df")` | Check if an object name exists |
| 7 | `is.null(df)` | Check whether object is NULL |
| 8 | `head(df)` | Show first rows of a data frame |
| 9 | `str(df)` | Compact structure summary |
| 10 | `summary(df)` | Basic summary statistics |
| 11 | `plot(x,y)` | Create scatter plot (base R) |
| 12 | `png(file,...)` | Open a PNG graphics device |
| 13 | `dev.off()` | Close current graphics device |
| 14 | `factor(x)` | Encode categorical values as levels |
| 15 | `as.integer(factor(x))` | Map categories to integers |
| 16 | `legend(...)` | Add legend to a plot |
| 17 | `df$col` | Select a column (vector) |
| 18 | `df[df$col==value, ]` | Filter rows using a logical mask |

## Programming Assignment (Home)

Test your understanding of "Getting started with R" by creating an R script (or Org-mode notebook) for the Palmer Penguins dataset available online from: [tinyurl.com/palmer-data-csv](https://tinyurl.com/palmer-data-csv)

**Tasks:**
1. Create a new R script or Org-mode notebook.
2. Load the dataset into a data frame.
3. Check if the dataset is loaded in the current workspace.
4. Display the first few lines of the dataset.
5. Make a simple plot of `bill_length_mm` vs `bill_depth_mm`.
6. Make a fully customized plot grouping by `species` (different colors + legend).
7. Submit your script/notebook file to Canvas.