In [1]:
suppressPackageStartupMessages({
    library(tidyverse)
    library(ggplot2)
    library(palmerpenguins)
    library(lubridate)
})

In [2]:
penguins_df <- penguins
glimpse(penguins_df)

Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel~
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse~
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, ~
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, ~
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186~
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, ~
$ sex               <fct> male, female, female, NA, female, male, female, male~
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007~


## Data Cleaning

In [3]:
# Check the number of missing values
sum(is.na(penguins_df))

# Create subset dataframe containing rows with NA
penguins_df[apply(is.na(penguins_df), 1, any), ]

species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
<fct>,<fct>,<dbl>,<dbl>,<int>,<int>,<fct>,<int>
Adelie,Torgersen,,,,,,2007
Adelie,Torgersen,34.1,18.1,193.0,3475.0,,2007
Adelie,Torgersen,42.0,20.2,190.0,4250.0,,2007
Adelie,Torgersen,37.8,17.1,186.0,3300.0,,2007
Adelie,Torgersen,37.8,17.3,180.0,3700.0,,2007
Adelie,Dream,37.5,18.9,179.0,2975.0,,2007
Gentoo,Biscoe,44.5,14.3,216.0,4100.0,,2007
Gentoo,Biscoe,46.2,14.4,214.0,4650.0,,2008
Gentoo,Biscoe,47.3,13.8,216.0,4725.0,,2009
Gentoo,Biscoe,44.5,15.7,217.0,4875.0,,2009


- **Rows to remove automatically**: The first and last rows should be removed because they have missing values in almost every variable (all except species, island and year).

- **Rows to investigate further**: All other rows have `NA` only in the variable `sex`. The sex could be assigned manually if males and females differ clearly in the other measurements.
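The split between these two kinds of rows can also be checked by counting missing values per column and per row. This is a minimal sketch; the names `na_per_column`, `na_per_row`, `mostly_missing` and `sex_only` are hypothetical and do not appear elsewhere in this notebook.

```r
library(palmerpenguins)

# Number of missing values in each column
na_per_column <- colSums(is.na(penguins))
na_per_column

# Number of missing values in each row
na_per_row <- unname(rowSums(is.na(penguins)))

# Rows missing more than one value: candidates for removal
mostly_missing <- which(na_per_row > 1)
mostly_missing

# Rows where sex is the only missing value: candidates for manual assignment
sex_only <- which(na_per_row == 1 & is.na(penguins$sex))
sex_only
```

Working on row positions keeps the result directly comparable with the row indices used when assigning values later on.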

Summarize the averages by species, island and sex

In [4]:
penguins_df %>%
    drop_na() %>%
    group_by(species, island, sex) %>%
    summarize(
        avg_bill_length = mean(bill_length_mm, na.rm = TRUE),
        avg_bill_depth = mean(bill_depth_mm, na.rm = TRUE),
        avg_flipper_length = mean(flipper_length_mm, na.rm = TRUE),
        avg_body_mass = mean(body_mass_g, na.rm = TRUE),
        .groups = "keep"
    )

species,island,sex,avg_bill_length,avg_bill_depth,avg_flipper_length,avg_body_mass
<fct>,<fct>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>
Adelie,Biscoe,female,37.35909,17.70455,187.1818,3369.318
Adelie,Biscoe,male,40.59091,19.03636,190.4091,4050.0
Adelie,Dream,female,36.91111,17.61852,187.8519,3344.444
Adelie,Dream,male,40.07143,18.83929,191.9286,4045.536
Adelie,Torgersen,female,37.55417,17.55,188.2917,3395.833
Adelie,Torgersen,male,40.58696,19.3913,194.913,4034.783
Chinstrap,Dream,female,46.57353,17.58824,191.7353,3527.206
Chinstrap,Dream,male,51.09412,19.25294,199.9118,3938.971
Gentoo,Biscoe,female,45.56379,14.23793,212.7069,4679.741
Gentoo,Biscoe,male,49.47377,15.71803,221.541,5484.836


There is a large difference in body mass between males and females within each species. The island a species inhabits does not seem to affect the average body mass noticeably. We can therefore use body mass to infer the sex of the observations where it is missing.
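The comparison against the group averages can be made explicit with a nearest-mean heuristic, sketched below. This is an illustration only, not the procedure used in this notebook, where the values are assigned by hand. The names `mass_means`, `candidates` and `sex_guesses` are hypothetical, and the sketch assumes dplyr 1.1.0 or later for the `relationship` argument.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

# Mean body mass per species and sex, computed from complete rows only
mass_means <- penguins %>%
    drop_na() %>%
    group_by(species, sex) %>%
    summarize(avg_body_mass = mean(body_mass_g), .groups = "drop")

# Rows with a known body mass but a missing sex
candidates <- penguins %>%
    mutate(row_id = row_number()) %>%
    filter(is.na(sex), !is.na(body_mass_g)) %>%
    select(row_id, species, body_mass_g)

# For each candidate, keep the sex whose species-level mean is closest
sex_guesses <- candidates %>%
    inner_join(mass_means, by = "species", relationship = "many-to-many") %>%
    mutate(distance = abs(body_mass_g - avg_body_mass)) %>%
    group_by(row_id) %>%
    slice_min(distance, n = 1, with_ties = FALSE) %>%
    ungroup() %>%
    select(row_id, species, body_mass_g, guessed_sex = sex, distance)

sex_guesses
```

A large `distance` to the closest mean suggests that body mass alone does not settle the question for that row, so such rows are better reviewed by hand or dropped.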

In [5]:
# Fetch the indices of rows containing at least one NA
na_indices <- which(apply(is.na(penguins_df), 1, any))
na_indices

In [6]:
# Assign sex based on body mass
penguins_df[9, ]$sex <-  "female"
penguins_df[10, ]$sex <- "male"
penguins_df[11, ]$sex <- "female"
penguins_df[48, ]$sex <- "female"
penguins_df[179, ]$sex <- "female"
penguins_df[219, ]$sex <- "female"
penguins_df[257, ]$sex <- "female"
penguins_df[269, ]$sex <- "female"

# Remove the rows with excessive NAs (4 and 272)
# and row 12, whose body mass is an outlier
penguins_clean <- penguins_df[-c(4, 12, 272), ]

# Confirm that all NAs have been handled (should return 0)
sum(is.na(penguins_clean))

## Data Preparation: Tidy Data and Joining Data

### Conditions for tidy data:
1. Each variable must have its own column
2. Each observation must have its own row
3. Each value must have its own cell

### Example of a tidy dataset

In [7]:
table1

country,year,cases,population
<chr>,<dbl>,<dbl>,<dbl>
Afghanistan,1999,745,19987071
Afghanistan,2000,2666,20595360
Brazil,1999,37737,172006362
Brazil,2000,80488,174504898
China,1999,212258,1272915272
China,2000,213766,1280428583


### Example: tidy a dataset using pivoting

In [8]:
# Untidy because:
# 1. The column names 1999 and 2000 are values, not variables
# 2. The values in those two columns are case counts
table4a

# Use pivot_longer() to tidy
table4a %>%
    pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")

country,1999,2000
<chr>,<dbl>,<dbl>
Afghanistan,745,2666
Brazil,37737,80488
China,212258,213766


country,year,cases
<chr>,<chr>,<dbl>
Afghanistan,1999,745
Afghanistan,2000,2666
Brazil,1999,37737
Brazil,2000,80488
China,1999,212258
China,2000,213766


### Example: tidy two datasets and join them

In [9]:
# Untidy datasets
table4a
table4b

# tidy the two datasets using pivot_longer()
tidy4a <- table4a %>%
    pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")

tidy4b <- table4b %>%
    pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")

# complete the data by joining them
left_join(tidy4a, tidy4b, by = join_by(country, year))

country,1999,2000
<chr>,<dbl>,<dbl>
Afghanistan,745,2666
Brazil,37737,80488
China,212258,213766


country,1999,2000
<chr>,<dbl>,<dbl>
Afghanistan,19987071,20595360
Brazil,172006362,174504898
China,1272915272,1280428583


country,year,cases,population
<chr>,<chr>,<dbl>,<dbl>
Afghanistan,1999,745,19987071
Afghanistan,2000,2666,20595360
Brazil,1999,37737,172006362
Brazil,2000,80488,174504898
China,1999,212258,1272915272
China,2000,213766,1280428583
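The joined result above stores `year` as character, because `pivot_longer()` builds it from column names, whereas `table1` stores it as a number. A minimal sketch of one way to convert it during pivoting uses the `names_transform` argument of `pivot_longer()`; the name `joined` is hypothetical.

```r
library(dplyr)
library(tidyr)

# Pivot both tables, converting the new year column to integer on the way
tidy4a <- table4a %>%
    pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases",
                 names_transform = list(year = as.integer))

tidy4b <- table4b %>%
    pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population",
                 names_transform = list(year = as.integer))

# Join on country and year, which now have matching types in both tables
joined <- left_join(tidy4a, tidy4b, by = join_by(country, year))
joined
```

Converting the key in both tables before joining matters, because `left_join()` raises an error when the key columns have incompatible types.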


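## Pivot Longer versus Pivot Wider
Use `pivot_longer()` when column names are actually values of a variable, as in the examples above. Use `pivot_wider()` when an observation is scattered across multiple rows.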

### Example: when to pivot wider

In [10]:
table2

country,year,type,count
<chr>,<dbl>,<chr>,<dbl>
Afghanistan,1999,cases,745
Afghanistan,1999,population,19987071
Afghanistan,2000,cases,2666
Afghanistan,2000,population,20595360
Brazil,1999,cases,37737
Brazil,1999,population,172006362
Brazil,2000,cases,80488
Brazil,2000,population,174504898
China,1999,cases,212258
China,1999,population,1272915272


- The column to take variable names from is `type`
- The column to take values from is `count`

In [11]:
table2 %>%
    pivot_wider(names_from = type, values_from = count)

country,year,cases,population
<chr>,<dbl>,<dbl>,<dbl>
Afghanistan,1999,745,19987071
Afghanistan,2000,2666,20595360
Brazil,1999,37737,172006362
Brazil,2000,80488,174504898
China,1999,212258,1272915272
China,2000,213766,1280428583
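The two functions are inverses of each other, which can be illustrated with a round trip on `table2`. This is a minimal sketch; the names `wide` and `long_again` are hypothetical.

```r
library(dplyr)
library(tidyr)

# Spread each observation's cases and population into their own columns
wide <- table2 %>%
    pivot_wider(names_from = type, values_from = count)

# Gather them back into type/count pairs, one row per measurement
long_again <- wide %>%
    pivot_longer(c(cases, population), names_to = "type", values_to = "count")

long_again
```

The round trip returns to the layout of `table2`, with one row per country, year and measurement type.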
