# Review: Preparation, Exploration, Visualization

*Created:* 2025-09-12

**Instructions**
- Open [DataCamp Datalab](https://datacamp.com/datalab) and log into the "Lyon College Fall 2025" workspace.
- This notebook mirrors the in-class review: preparation, exploration, and visualization in R.
- Run the cells from top to bottom. In Google Colab, first select *Runtime → Change runtime type → R*.

## 1) What needs to be done in this dataset?

Example raw data (for illustration):

```
     Name Age Size Country
1    Sara  27 1.77 Belgium
2     Lis  30 5.58     USA
3 Hadrien  NA 1.80      FR
4     Lis  30 5.58     USA
```

In [None]:
# Get the data (from the web)
df <- read.csv("https://tinyurl.com/cleaning-csv")
str(df)
df

**Clean up country code and size.**

In [None]:
# Change first country to two-letter code (Belgium -> BE)
df$Country[1] <- "BE"

# Change height values given in feet (5.58) to meters (1.70)
df$Size[df$Size == 5.58] <- 1.70

str(df)
df
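
The fixes above hard-code one row and one value. A minimal sketch of a more general clean-up follows. It works on a local copy of the illustrated data (the name `raw` is hypothetical), and it rests on two assumptions that are not in the original: any height above 3 was recorded in feet, and a small hand-written lookup table (`codes`) supplies the two-letter country codes.

```r
# Local copy of the example data shown above (no download needed)
raw <- data.frame(
  Name    = c("Sara", "Lis", "Hadrien", "Lis"),
  Age     = c(27L, 30L, NA, 30L),
  Size    = c(1.77, 5.58, 1.80, 5.58),
  Country = c("Belgium", "USA", "FR", "USA"),
  stringsAsFactors = FALSE
)

# Assumption: any height above 3 was recorded in feet, not meters
in_feet <- raw$Size > 3
raw$Size[in_feet] <- round(raw$Size[in_feet] * 0.3048, 2)  # 1 ft = 0.3048 m

# Hypothetical lookup table: long country names -> two-letter codes
codes <- c(Belgium = "BE", USA = "US")
long  <- raw$Country %in% names(codes)
raw$Country[long] <- unname(codes[raw$Country[long]])

raw
```

Unlike the cell above, this sketch also recodes `USA` to `US` so that every country uses a two-letter code.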

## 2) What is the purpose of removing duplicates in a dataset?

> To ensure that each observation (row) is unique.

In [None]:
# Show current data (with a duplicate row for 'Lis')
df

# Remove the duplicate row (row 4 in this small example)
df <- df[-4, ]
df
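
Removing row 4 by index only works because the duplicate is known in advance. As a sketch of a more general approach, base R's `duplicated()` flags every row that repeats an earlier one. The data frame `people` below is a hypothetical stand-in for the example data.

```r
# Hypothetical stand-in for the example data (two identical 'Lis' rows)
people <- data.frame(
  Name = c("Sara", "Lis", "Hadrien", "Lis"),
  Age  = c(27L, 30L, NA, 30L),
  stringsAsFactors = FALSE
)

# TRUE for each row that repeats an earlier row
dup <- duplicated(people)
dup

# Keep only the first occurrence of each row
people_unique <- people[!dup, ]
people_unique
```

`unique(people)` would give the same result in one step; `duplicated()` is shown because it lets you inspect which rows will be dropped first.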

## 3) What are methods to handle missing values?

> *Impute* (replace intelligently, e.g., by an average), drop, or keep.

**Replace `NA` by column average (`mean`).**

In [None]:
df
# Extract the third element of Age (the missing value, NA)
df$Age[3]

# Impute the mean of the non-missing Age values
# (as.integer drops any decimal part of the mean)
df$Age[3] <- as.integer(mean(df$Age, na.rm = TRUE))
df
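
The three options (impute, drop, keep) can be sketched side by side. The vector `age` below is a hypothetical copy of the `Age` column after the duplicate row was removed; it is an assumption that the downloaded data matches the illustration.

```r
# Hypothetical copy of the Age column (one missing value)
age <- c(27, 30, NA)

# Impute: replace every NA by the mean of the observed values
age_imputed <- age
age_imputed[is.na(age_imputed)] <- mean(age, na.rm = TRUE)
age_imputed

# Drop: remove the missing values altogether
age_dropped <- age[!is.na(age)]
age_dropped

# Keep: leave the NA in place and tell each function to skip it
median(age, na.rm = TRUE)
```

Using `is.na()` instead of a fixed index such as `[3]` fills every missing value at once, however many there are.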

## 4) What is the main goal of EDA?

> The main goal of Exploratory Data Analysis (EDA) is to explore the data, formulate hypotheses, and assess characteristics (correlation, trends, patterns). It happens **after** data preparation.

**Create a statistical summary for the data frame.**

In [None]:
summary(df)

## 5) What does *Anscombe's quartet* illustrate in the context of EDA?

> Anscombe's quartet shows why data should always be visualized: its four data sets have nearly identical summary statistics (mean, variance, correlation) but look very different when plotted.

**Summarize parts of the built-in `anscombe` data set.**

In [None]:
summary(anscombe[c("x1","x2","y1","y2")])
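
`summary()` reports quartiles, which do differ between the sets. The statistics that nearly coincide are the mean, the variance, and the x-y correlation. A short sketch that computes them for all four sets (the name `stats` is chosen here for illustration):

```r
# Mean, variance, and x-y correlation for each of the four sets
stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y),
    var_x  = var(x),  var_y  = var(y),
    cor_xy = cor(x, y))
})
colnames(stats) <- paste("Set", 1:4)

# One column per set; the rows are almost identical across columns
round(stats, 2)
```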

**Visualize two of the `anscombe` data sets.**

In [None]:
par(mfrow=c(1,2), pty='s')
plot(anscombe$x1, anscombe$y1, col="red",  pch=19, main="Set 1: linear-ish")
plot(anscombe$x2, anscombe$y2, col="blue", pch=9,  main="Set 2: non-linear")

## 6) What does 'Knowing your data' mean? Which R functions deliver this?

1. Preview data values (`head`)

2. View structure (`str`)

3. Descriptive stats (`summary`)

4. Visualize (`plot`)

5. Look for correlations (`cor`)

6. Look for outliers (`boxplot`)
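
The six steps above can be sketched on the built-in `mtcars` data set. The choice of columns (`wt`, `mpg`, `hp`) is only an example.

```r
head(mtcars, 3)                   # 1. preview the first rows
str(mtcars)                       # 2. structure: dimensions and column types
summary(mtcars$mpg)               # 3. descriptive statistics for one column
plot(mtcars$wt, mtcars$mpg)       # 4. scatter plot of weight vs. fuel economy

# 5. correlation between weight and fuel economy (strongly negative)
r <- cor(mtcars$wt, mtcars$mpg)
r

# 6. boxplot of horsepower; the returned list stores the outliers in $out
bp <- boxplot(mtcars$hp)
bp$out
```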

## 7) Which picture or photo do you know that's *worth a thousand words*?

*(Local file references won't display in Colab unless uploaded. External links below are provided instead.)*

- [Battle of Iwo Jima (1945)](https://www.witf.io/wp-content/uploads/2020/02/iwo-jima-rosenthal-520748-1-1920x1080.jpg)
- [Saigon Execution (1968)](https://www.gannett-cdn.com/-mm-/f40f3606fa7f520417c0c9e02d7aa7a371d004ba/r=x513&c=680x510/local/-/media/USATODAY/USATODAY/2013/04/28/war-icons-003-4_3.jpg)
- [Trump assassination attempt (2024)](https://www.njspotlightnews.org/wp-content/uploads/sites/123/2024/07/Donald-Trump-assassination-attempt-July-13-2024.jpg)

## 8) What are dashboards in data science, and what are they good for?

- Dashboards group relevant information in one place

- Real-time information helps viewers to keep track

- Dashboards can be customized to different data needs

- Dashboards can easily be overwhelming (design issues)

- Interactive dashboards can help extract features

## 9) What are dashboards definitely not good for?

- Data preparation, cleaning and transformation

- Exploratory Data Analysis (because dashboard views are fixed in advance, while EDA requires open-ended exploration)

## 10) What is *labeling* in data visualizations, and why is it important?

> Labeling means adding a title, axis labels, and a legend so that viewers understand what the plot and each axis represent. Units and data sources are also important.

**Code example (labeled vs. unlabeled):**

In [None]:
par(mfrow=c(1,2), pty='s')
plot(mtcars$wt, mtcars$mpg) # unlabeled
plot(mtcars$wt, mtcars$mpg,
     main="32 cars from `mtcars`",
     xlab="Weight [1000 lbs]",
     ylab="Miles per (US) gallon") # labeled
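
The labeled plot above has a title and axis labels but no legend or data source. A sketch that adds both, assuming we color the points by number of cylinders (the color choices and the names `cyl_levels` and `cols` are arbitrary):

```r
# Color points by number of cylinders and explain the colors in a legend
cyl_levels <- sort(unique(mtcars$cyl))   # distinct cylinder counts
cols <- c("darkgreen", "orange", "red")  # one color per cylinder count

plot(mtcars$wt, mtcars$mpg,
     col  = cols[match(mtcars$cyl, cyl_levels)], pch = 19,
     main = "Fuel economy vs. weight (mtcars)",
     sub  = "Source: Motor Trend, 1974",
     xlab = "Weight [1000 lbs]",
     ylab = "Miles per (US) gallon")

legend("topright",
       legend = paste(cyl_levels, "cylinders"),
       col = cols, pch = 19, title = "Engine")
```

The units and the source come from the `mtcars` help page (`?mtcars`).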