<a href="https://colab.research.google.com/github/arooshasolomon/ds1002-npj4qa/blob/main/labs/lab2/lab2-race-results.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DS1002 Lab 2: Determine Race Results with R

In this lab you will work with a dataset, writing R to generate the deliverables specified in the cells below.

The dataset for this lab is made up of fictitious results from a road race. Runner information and results are provided in the data.

Answer the questions below with the appropriate R code. Point assignments are indicated for each section. There are 10 total points possible for this lab.

Useful reference material (check all R modules within the Canvas site for more help):
- [R Reference Material](https://canvas.its.virginia.edu/courses/78571/modules#module_219810)
- [Plots Samples](https://colab.research.google.com/github/nmagee/ds1002/blob/main/notebooks/25-plots-in-r.ipynb)

## Group Submissions

If you are working in a group to complete this lab, you may have no more than 3 members to a group. Group members should be indicated in the cell below -- list both names and UVA computing IDs.

Each student should then submit **the same URL** for the lab in Canvas. (If a group has Member1, Member2, and Member3, only one member needs to save the completed work back to GitHub and all members should submit that URL for grading.)

In [86]:
# List group members (if applicable). Identify names and computing IDs
#
# Name                    Computing ID

## 1. Load Libraries & Data (1 pt)

https://raw.githubusercontent.com/nmagee/ds1002/main/data/road-race.csv

Import any necessary libraries and load the remote CSV file below into a data frame.

In [101]:
# Load libraries (tidyverse already attaches dplyr and ggplot2)
library(tidyverse)

# Load the remote CSV file into a data frame
df <- read.csv("https://raw.githubusercontent.com/nmagee/ds1002/main/data/road-race.csv")
df

runner_bib,runner_name,runner_age,runner_gender,finish_time
<int>,<chr>,<int>,<chr>,<chr>
1,Loydie Lopes,17,Male,16:01
2,Lorens Crispe,33,Male,15:40
3,Shirline Hasser,22,Female,14:20
4,Alleyn Hartshorn,39,Male,17:06
5,Wang MacColl,50,Male,16:49
6,Tonnie Tidder,44,Male,15:43
7,Hermy Everal,51,Male,
8,Basil Moxsom,44,Male,15:16
9,Lark Bragge,30,Female,15:59
10,Kent Wakely,60,Male,14:29


## 2. Get Summary Data (1 pt)

In code, display how many rows and columns are in the raw dataset.

In [102]:
#

info <- dim(df)
info

## 3. Clean and Organize the Data (2 pts)

Check for data quality.

- Resolve any duplicate rows.
- If a runner does not have a finish time, they are DNF and should not be counted in the dataset.
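
Before cleaning, it can help to count how many rows are affected by each problem. The sketch below assumes a small stand-in data frame named `toy` with made-up values (it is not the lab dataset) and uses base R's `duplicated()` to count repeated rows and a comparison against `""` to count missing finish times.

```r
# Toy data standing in for the race results (hypothetical values)
toy <- data.frame(
  runner_bib  = c(1, 2, 2, 3),
  runner_name = c("A", "B", "B", "C"),
  finish_time = c("16:01", "15:40", "15:40", ""),
  stringsAsFactors = FALSE
)

# Number of rows that exactly repeat an earlier row
n_duplicates <- sum(duplicated(toy))

# Number of runners with no recorded finish time (DNF)
n_dnf <- sum(is.na(toy$finish_time) | toy$finish_time == "")

n_duplicates
n_dnf
```

Running the same two counts again after cleaning should give zero for both, which is a simple way to confirm the cleanup worked.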



In [103]:
# Treat empty finish times as missing (these runners are DNF)
df$finish_time[df$finish_time == ""] <- NA

# Remove duplicate rows, then drop runners without a finish time
df <- df %>%
  distinct() %>%
  filter(!is.na(finish_time))

df

runner_bib,runner_name,runner_age,runner_gender,finish_time
<int>,<chr>,<int>,<chr>,<chr>
1,Loydie Lopes,17,Male,16:01
2,Lorens Crispe,33,Male,15:40
3,Shirline Hasser,22,Female,14:20
4,Alleyn Hartshorn,39,Male,17:06
5,Wang MacColl,50,Male,16:49
6,Tonnie Tidder,44,Male,15:43
8,Basil Moxsom,44,Male,15:16
9,Lark Bragge,30,Female,15:59
10,Kent Wakely,60,Male,14:29
11,Judye Hattrick,17,Female,15:27


Now display the first 10 rows of the cleaned dataset.

In [104]:
#
head(df, 10)


row,runner_bib,runner_name,runner_age,runner_gender,finish_time
,<int>,<chr>,<int>,<chr>,<chr>
1,1,Loydie Lopes,17,Male,16:01
2,2,Lorens Crispe,33,Male,15:40
3,3,Shirline Hasser,22,Female,14:20
4,4,Alleyn Hartshorn,39,Male,17:06
5,5,Wang MacColl,50,Male,16:49
6,6,Tonnie Tidder,44,Male,15:43
7,8,Basil Moxsom,44,Male,15:16
8,9,Lark Bragge,30,Female,15:59
9,10,Kent Wakely,60,Male,14:29
10,11,Judye Hattrick,17,Female,15:27


## 4. Calculate Elapsed Time (3 pts)

Using R, add a new column named `finish_minutes` to the data frame that holds the number of minutes each runner took to complete the race. Ideally this is a column consisting of plain integers.

The starting gun was fired at precisely 12:00pm that day.

Note: This is calculated using a built-in function of R, `difftime()` which takes 3 parameters:

- End time
- Start time
- Units

The result is the difference between the two times, expressed in the requested units: `3 days`, `5 hours`, `112 mins`, etc.

The syntax for that function is below. Take care to use the proper order of parameters. The `as.POSIXct` casting makes it possible to read a long datetime in the `YYYY-MM-DDTHH:MM` format, a common `datetime` value. The `format` parameter specifies the pattern you are trying to read.

```
df$new_column <- difftime(as.POSIXct(df$end_column, format="%Y-%m-%dT%H:%M"),
                          as.POSIXct(df$start_column, format="%Y-%m-%dT%H:%M"),
                          units="mins")
```
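
As a worked instance of the syntax above, the sketch below computes the elapsed time between two single timestamps. The timestamps and the `tz = "UTC"` setting are assumptions chosen for illustration; `as.integer()` is one way to strip the units and get a plain integer.

```r
# Hypothetical start and finish timestamps in YYYY-MM-DDTHH:MM form
start  <- as.POSIXct("2023-12-06T12:00", format = "%Y-%m-%dT%H:%M", tz = "UTC")
finish <- as.POSIXct("2023-12-06T16:01", format = "%Y-%m-%dT%H:%M", tz = "UTC")

# End time first, start time second
elapsed <- difftime(finish, start, units = "mins")
elapsed                       # Time difference of 241 mins

# Drop the units to get a plain integer
elapsed_minutes <- as.integer(elapsed)
elapsed_minutes               # 241
```

Swapping the first two arguments would give a negative duration, which is why the parameter order matters.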

In [110]:

# Create a column with the start time (12:00pm, same day as the finish times)
df$start_time <- as.POSIXct("12:00", format = "%H:%M")

# Minutes elapsed between the starting gun and each runner's finish
df$finish_minutes <- difftime(as.POSIXct(df$finish_time, format = "%H:%M"),
                              df$start_time, units = "mins")

df


runner_bib,runner_name,runner_age,runner_gender,finish_time,start_time,finish_minutes
<int>,<chr>,<int>,<chr>,<chr>,<dttm>,<drtn>
1,Loydie Lopes,17,Male,16:01,2023-12-06 12:00:00,241 mins
2,Lorens Crispe,33,Male,15:40,2023-12-06 12:00:00,220 mins
3,Shirline Hasser,22,Female,14:20,2023-12-06 12:00:00,140 mins
4,Alleyn Hartshorn,39,Male,17:06,2023-12-06 12:00:00,306 mins
5,Wang MacColl,50,Male,16:49,2023-12-06 12:00:00,289 mins
6,Tonnie Tidder,44,Male,15:43,2023-12-06 12:00:00,223 mins
8,Basil Moxsom,44,Male,15:16,2023-12-06 12:00:00,196 mins
9,Lark Bragge,30,Female,15:59,2023-12-06 12:00:00,239 mins
10,Kent Wakely,60,Male,14:29,2023-12-06 12:00:00,149 mins
11,Judye Hattrick,17,Female,15:27,2023-12-06 12:00:00,207 mins


## 5. Identify Winners by Gender (2 pts)

Based on the minutes it took each runner to complete the race, identify the top three places for each gender.

There are several ways to do this, some of which require less code than others. You will only be graded for producing the correct output, not on how elegant/advanced your programming is.
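
One possible approach, sketched here on a made-up data frame named `toy` (hypothetical runners and times, not the lab data), is dplyr's `slice_min()`, which keeps the rows with the smallest values within each group. Setting `with_ties = FALSE` is an assumption about how ties should be handled: it returns exactly three rows per group even when times are tied.

```r
library(dplyr)

# Toy results (hypothetical runners and times)
toy <- data.frame(
  runner_name    = c("A", "B", "C", "D", "E", "F", "G", "H"),
  runner_gender  = c("Female", "Female", "Female", "Female",
                     "Male", "Male", "Male", "Male"),
  finish_minutes = c(140, 239, 207, 135, 241, 220, 149, 196)
)

# Keep the three fastest finishers within each gender
top3 <- toy %>%
  group_by(runner_gender) %>%
  slice_min(finish_minutes, n = 3, with_ties = FALSE) %>%
  ungroup()

top3
```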

In [126]:
# Split the data frame into a list of data frames, one per value of runner_gender
gender_split <- split(df, df$runner_gender)

gender_split

row,runner_bib,runner_name,runner_age,runner_gender,finish_time,start_time,finish_minutes
,<int>,<chr>,<int>,<chr>,<chr>,<dttm>,<drtn>
11,12,Amalea Laurand,17,Agender,16:46,2023-12-06 12:00:00,286 mins
17,18,Ike Ranson,20,Agender,16:45,2023-12-06 12:00:00,285 mins
56,57,Pierson Chaney,52,Agender,16:58,2023-12-06 12:00:00,298 mins
295,318,Coletta Longea,35,Agender,16:10,2023-12-06 12:00:00,250 mins
298,321,Jacobo Telling,33,Agender,14:47,2023-12-06 12:00:00,167 mins
309,332,Ina Bonifant,30,Agender,15:17,2023-12-06 12:00:00,197 mins
445,480,Tomasina Greensall,22,Agender,15:12,2023-12-06 12:00:00,192 mins
542,586,Adair McKeefry,45,Agender,16:38,2023-12-06 12:00:00,278 mins

row,runner_bib,runner_name,runner_age,runner_gender,finish_time,start_time,finish_minutes
,<int>,<chr>,<int>,<chr>,<chr>,<dttm>,<drtn>
48,49,Lily Brigman,48,Bigender,15:28,2023-12-06 12:00:00,208 mins
134,139,Bob Brane,30,Bigender,17:01,2023-12-06 12:00:00,301 mins
156,162,Ariel Holyard,42,Bigender,17:01,2023-12-06 12:00:00,301 mins
180,191,Geri Conroy,22,Bigender,17:01,2023-12-06 12:00:00,301 mins
191,206,Kinny Cyphus,48,Bigender,16:03,2023-12-06 12:00:00,243 mins
231,249,Sephira Kirgan,33,Bigender,14:24,2023-12-06 12:00:00,144 mins
243,263,Ernestus Draper,45,Bigender,14:58,2023-12-06 12:00:00,178 mins
285,307,Damian Gladdis,36,Bigender,16:40,2023-12-06 12:00:00,280 mins
291,314,Alejoa Normansell,21,Bigender,16:04,2023-12-06 12:00:00,244 mins
294,317,Ody Gian,48,Bigender,16:12,2023-12-06 12:00:00,252 mins

row,runner_bib,runner_name,runner_age,runner_gender,finish_time,start_time,finish_minutes
,<int>,<chr>,<int>,<chr>,<chr>,<dttm>,<drtn>
3,3,Shirline Hasser,22,Female,14:20,2023-12-06 12:00:00,140 mins
8,9,Lark Bragge,30,Female,15:59,2023-12-06 12:00:00,239 mins
10,11,Judye Hattrick,17,Female,15:27,2023-12-06 12:00:00,207 mins
12,13,Christiana Winspeare,52,Female,16:18,2023-12-06 12:00:00,258 mins
15,16,Casey Arthur,54,Female,16:41,2023-12-06 12:00:00,281 mins
19,20,Stormy Gideon,43,Female,16:51,2023-12-06 12:00:00,291 mins
20,21,Carmencita Petrishchev,20,Female,14:15,2023-12-06 12:00:00,135 mins
21,22,Cristina Brabbs,60,Female,14:20,2023-12-06 12:00:00,140 mins
22,23,Cordy Jeannaud,27,Female,14:45,2023-12-06 12:00:00,165 mins
23,24,Rubi Greeson,39,Female,14:46,2023-12-06 12:00:00,166 mins

row,runner_bib,runner_name,runner_age,runner_gender,finish_time,start_time,finish_minutes
,<int>,<chr>,<int>,<chr>,<chr>,<dttm>,<drtn>
58,59,Vitoria Baile,34,Genderfluid,15:06,2023-12-06 12:00:00,186 mins
130,135,Tucky Slator,18,Genderfluid,15:01,2023-12-06 12:00:00,181 mins
147,152,Patrice Draye,18,Genderfluid,14:15,2023-12-06 12:00:00,135 mins
288,310,Cyrus Chardin,35,Genderfluid,16:57,2023-12-06 12:00:00,297 mins
325,351,Viviene Gavigan,30,Genderfluid,15:26,2023-12-06 12:00:00,206 mins
390,421,Neysa Willatt,60,Genderfluid,14:33,2023-12-06 12:00:00,153 mins
396,429,Evangelin Bagge,30,Genderfluid,15:17,2023-12-06 12:00:00,197 mins
555,600,Brana Seman,39,Genderfluid,15:04,2023-12-06 12:00:00,184 mins
561,606,Nonna Restill,57,Genderfluid,14:24,2023-12-06 12:00:00,144 mins

row,runner_bib,runner_name,runner_age,runner_gender,finish_time,start_time,finish_minutes
,<int>,<chr>,<int>,<chr>,<chr>,<dttm>,<drtn>
122,127,Nichols Brosch,17,Genderqueer,16:52,2023-12-06 12:00:00,292 mins
148,153,Candide Jerok,45,Genderqueer,17:02,2023-12-06 12:00:00,302 mins
153,158,Kameko Rouchy,60,Genderqueer,15:54,2023-12-06 12:00:00,234 mins
159,165,Ula Vokins,47,Genderqueer,17:00,2023-12-06 12:00:00,300 mins
393,425,Susette Gerish,26,Genderqueer,17:03,2023-12-06 12:00:00,303 mins
487,527,Bobby Sykes,52,Genderqueer,15:28,2023-12-06 12:00:00,208 mins
506,547,Patin Dawidowitsch,17,Genderqueer,15:27,2023-12-06 12:00:00,207 mins
546,590,Abigale Beedle,41,Genderqueer,15:34,2023-12-06 12:00:00,214 mins

row,runner_bib,runner_name,runner_age,runner_gender,finish_time,start_time,finish_minutes
,<int>,<chr>,<int>,<chr>,<chr>,<dttm>,<drtn>
1,1,Loydie Lopes,17,Male,16:01,2023-12-06 12:00:00,241 mins
2,2,Lorens Crispe,33,Male,15:40,2023-12-06 12:00:00,220 mins
4,4,Alleyn Hartshorn,39,Male,17:06,2023-12-06 12:00:00,306 mins
5,5,Wang MacColl,50,Male,16:49,2023-12-06 12:00:00,289 mins
6,6,Tonnie Tidder,44,Male,15:43,2023-12-06 12:00:00,223 mins
7,8,Basil Moxsom,44,Male,15:16,2023-12-06 12:00:00,196 mins
9,10,Kent Wakely,60,Male,14:29,2023-12-06 12:00:00,149 mins
13,14,Trevar Pegrum,31,Male,16:05,2023-12-06 12:00:00,245 mins
14,15,Kearney Doumerc,48,Male,15:11,2023-12-06 12:00:00,191 mins
16,17,Antony Burnes,19,Male,15:58,2023-12-06 12:00:00,238 mins

row,runner_bib,runner_name,runner_age,runner_gender,finish_time,start_time,finish_minutes
,<int>,<chr>,<int>,<chr>,<chr>,<dttm>,<drtn>
37,38,Mellie Merryfield,54,Non-binary,15:03,2023-12-06 12:00:00,183 mins
69,72,Chanda Daldan,42,Non-binary,16:22,2023-12-06 12:00:00,262 mins
307,330,Genni Handling,16,Non-binary,15:25,2023-12-06 12:00:00,205 mins
316,341,Gaylor Buckett,17,Non-binary,14:29,2023-12-06 12:00:00,149 mins
323,349,Ariela Boosey,50,Non-binary,16:43,2023-12-06 12:00:00,283 mins
394,426,Base Dukes,62,Non-binary,15:19,2023-12-06 12:00:00,199 mins

row,runner_bib,runner_name,runner_age,runner_gender,finish_time,start_time,finish_minutes
,<int>,<chr>,<int>,<chr>,<chr>,<dttm>,<drtn>
66,68,Fonsie Haithwaite,50,Polygender,16:00,2023-12-06 12:00:00,240 mins
118,123,Trumann Winny,33,Polygender,14:57,2023-12-06 12:00:00,177 mins
145,150,Parry Lionel,38,Polygender,14:56,2023-12-06 12:00:00,176 mins
195,210,Thaddus Offield,59,Polygender,15:11,2023-12-06 12:00:00,191 mins
212,228,Durante Wloch,37,Polygender,15:37,2023-12-06 12:00:00,217 mins
263,283,Karyl Pyne,38,Polygender,14:21,2023-12-06 12:00:00,141 mins
327,353,Gayla Guterson,38,Polygender,14:50,2023-12-06 12:00:00,170 mins
330,356,Alair Blyde,42,Polygender,14:27,2023-12-06 12:00:00,147 mins
352,379,Natty Veasey,48,Polygender,14:55,2023-12-06 12:00:00,175 mins
365,394,Latisha Rendall,60,Polygender,17:06,2023-12-06 12:00:00,306 mins


In [127]:
top3_lowest_times_df <- df %>%
  group_by(runner_gender) %>%
  arrange(finish_minutes) %>%
  slice_head(n = 3)

# Show the resulting dataframe
print(top3_lowest_times_df)

# A tibble: 24 × 7
# Groups:   runner_gender [8]
   runner_bib runner_name            runner_age runner_gender finish_time
        <int> <chr>                       <int> <chr>         <chr>
 1        321 Jacobo Telling                 33 Agender       14:47
 2        480 Tomasina Greensall             22 Agender       15:12
 3        332 Ina Bonifant                   30 Agender       15:17
 4        249 Sephira Kirgan                 33 Bigender      14:24
 5        546 Chauncey Langthorne            16 Bigender      14:29
 6        263 Ernestus Draper                45 Bigender      14:58
 7         21 Carmencita Petrishchev         20 Female        14:15
 8        134 Marylin Standering             18 Female        14:15
 9

## 6. Plot the Data (3 pts)

Finally, using `ggplot2` create two plots of the data -- density plots of race finishers.

- In the first plot use `finish_minutes` as the x axis.
- In the second plot use `runner_age` as the x axis.
- Use `runner_gender` as the fill.
- We suggest using `geom_density(alpha=0.2)` or thereabouts so the layers can be seen through one another.
- Use the `gridExtra` library's `grid.arrange()` function to plot them both.

You will note that since this is artificial data you will be able to see the gender layers clearly enough but they will not be statistically meaningful.
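
The requirements above can be sketched as follows. This is a minimal illustration that assumes a synthetic data frame named `toy` with randomly generated values (not the lab dataset) and that the `gridExtra` package is installed; the `nrow = 2` layout is one choice among several.

```r
library(ggplot2)
library(gridExtra)

# Synthetic stand-in for the cleaned race data (hypothetical values)
set.seed(1)
toy <- data.frame(
  runner_gender  = rep(c("Female", "Male"), each = 50),
  runner_age     = sample(16:62, 100, replace = TRUE),
  finish_minutes = sample(135:306, 100, replace = TRUE)
)

# Density of finish times, one translucent layer per gender
p1 <- ggplot(toy, aes(x = finish_minutes, fill = runner_gender)) +
  geom_density(alpha = 0.2)

# Density of runner ages, same fill
p2 <- ggplot(toy, aes(x = runner_age, fill = runner_gender)) +
  geom_density(alpha = 0.2)

# Draw both plots in one figure, stacked vertically
grid.arrange(p1, p2, nrow = 2)
```

If `finish_minutes` is stored as a `difftime` column, converting it with `as.numeric()` before plotting avoids a message from ggplot2 about picking a scale for that type.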

In [None]:
#
