# Project Proposal

## Relationship Between BMI and the Proportion of Olympic Gymnastics Medalists

## Introduction

For this project we will use the “120 years of Olympic history: athletes and results” data set from Kaggle: https://www.kaggle.com/datasets/heesoo37/120-years-of-olympic-history-athletes-and-results
The original data were scraped from https://www.sports-reference.com/. Forman, S. (2000). Sports Reference: Sports stats, fast, easy, and up-to-date. Retrieved November 4, 2022, from https://www.sports-reference.com/

The data set contains 271,116 rows and 15 columns (ID, Name, Sex, Age, Height, Weight, Team, NOC, Games, Year, Season, City, Sport, Event, Medal). Each row corresponds to one athlete competing in one Olympic event, so an athlete who entered several events appears in several rows.

Our research question is how BMI affects the proportion of athletes aged 20 to 30 who won an Olympic medal in gymnastics from the year 2000 onwards. BMI, computed from weight and height, is used to categorize individuals as underweight (BMI < 18.5), normal weight (BMI 18.5–24.9), overweight (BMI 25–29.9) or obese (BMI ≥ 30). Previous research shows that lower BMI was associated with better performance, but performance suffered once BMI became very low (Sherman et al., 1996). Further work shows that smaller gymnasts with high strength-to-mass ratios have greater potential for performing skills involving whole-body rotations (Ackland et al., 2003).
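The BMI categories above can be expressed as a small lookup. The following is a minimal sketch, not part of the original analysis: the helper name `bmi_category` is hypothetical, and it assumes each category is a half-open interval (for example, normal weight is 18.5 up to but not including 25) so that values such as 24.95 are not left unclassified.

```r
# Hypothetical helper for illustration: label BMI values using the
# cut-offs quoted in the text (18.5, 25, 30).
bmi_category <- function(bmi) {
  cut(bmi,
      breaks = c(-Inf, 18.5, 25, 30, Inf),
      labels = c("underweight", "normal", "overweight", "obese"),
      right  = FALSE)  # intervals are [lower, upper)
}

# Example with made-up BMI values
example_labels <- bmi_category(c(17.9, 22.9, 27.0, 31.2))
print(example_labels)
```

A factor is returned, so the labels keep their natural order if they are later used for grouping or plotting.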

## Preliminary Results

We first downloaded the data set from Kaggle (https://www.kaggle.com/datasets/heesoo37/120-years-of-olympic-history-athletes-and-results).

Here, we load all the libraries that we might need to clean, wrangle and visualise the data.

In [16]:
library(cowplot)
library(datateachr)
library(digest)
library(infer)
library(repr)
library(taxyvr)
library(tidyverse)
options(repr.matrix.max.rows = 6)

We used `read_csv` to load the data, which was downloaded from Kaggle, uploaded into the Jupyter notebook folder, and then pushed to the GitHub repository.

In [17]:
olympic_data <- read_csv("athlete_events.csv")
head(olympic_data)

Rows: 271116 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): Name, Sex, Team, NOC, Games, Season, City, Sport, Event, Medal
dbl  (5): ID, Age, Height, Weight, Year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
<dbl>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>
1,A Dijiang,M,24,180.0,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,
2,A Lamusi,M,23,170.0,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,
3,Gunnar Nielsen Aaby,M,24,,,Denmark,DEN,1920 Summer,1920,Summer,Antwerpen,Football,Football Men's Football,
4,Edgar Lindenau Aabye,M,34,,,Denmark/Sweden,DEN,1900 Summer,1900,Summer,Paris,Tug-Of-War,Tug-Of-War Men's Tug-Of-War,Gold
5,Christine Jacoba Aaftink,F,21,185.0,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,Speed Skating Women's 500 metres,
5,Christine Jacoba Aaftink,F,21,185.0,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,"Speed Skating Women's 1,000 metres",


For our inferential study we will look at data from the year 2000 onwards and restrict the sample to athletes aged 20 to 30. We do this because age and year might both influence the proportion of people who won medals at the Olympics.

Since we only want to assess the proportion of gymnasts who won an Olympic medal in the Gymnastics Men's/Women's Individual All-Around events, we filter the data to keep only those events. Furthermore, we select only the columns we require for our study.

In [32]:
# Keep athletes aged 20-30 who competed in 2000 or later,
# in the individual all-around gymnastics events only.
gymnast_data <- olympic_data |>
    filter(Age >= 20, Age <= 30, Year >= 2000) |>
    filter(Sport == "Gymnastics",
           Event %in% c("Gymnastics Men's Individual All-Around",
                        "Gymnastics Women's Individual All-Around")) |>
    select(Sex, Height, Weight, Medal)
head(gymnast_data)

Sex,Height,Weight,Medal
<chr>,<dbl>,<dbl>,<chr>
M,167,64,
M,167,63,
F,165,55,
M,162,54,
M,171,63,
M,171,63,


Since our study is about BMI, we add a separate BMI column and, in the same step, keep only male gymnasts whose BMI lies between 18.5 and 25. BMI is computed with the formula:

$$\text{BMI} = \frac{\text{Weight (kg)}}{\left(\text{Height (cm)}\right)^2} \times 10000$$
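As a quick sanity check on the formula, the sketch below computes BMI for one gymnast weighing 64 kg with a height of 167 cm, once by converting height to metres and once with the factor of 10000. The function name `bmi_from_cm` is a hypothetical helper introduced only for this illustration.

```r
# Hypothetical helper for illustration: BMI from weight in kg and height in cm.
bmi_from_cm <- function(weight_kg, height_cm) {
  height_m <- height_cm / 100      # convert cm to m
  weight_kg / height_m^2           # kg per square metre
}

bmi_metres <- bmi_from_cm(64, 167)     # height converted to metres
bmi_factor <- 64 / 167^2 * 10000       # same calculation, as written in the formula

print(c(bmi_metres, bmi_factor))       # both are approximately 22.95
```

The two expressions agree because dividing the height by 100 before squaring is the same as multiplying the final ratio by 10000.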

In [41]:
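# Compute BMI (kg/m^2) from weight in kg and height in cm, then keep
# male gymnasts whose BMI lies between 18.5 and 25.
gymnast_data_bmi <- gymnast_data |>
    mutate(bmi = Weight / Height^2 * 10000) |>
    filter(bmi >= 18.5, bmi <= 25, Sex == "M")

nrow(gymnast_data_bmi)
head(gymnast_data_bmi)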

Sex,Height,Weight,Medal,bmi
<chr>,<dbl>,<dbl>,<chr>,<dbl>
M,167,64,,22.94812
M,167,63,,22.58955
M,162,54,,20.57613
M,171,63,,21.54509
M,171,63,,21.54509
M,166,63,,22.86253
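
The research question concerns the proportion of gymnasts who won a medal, so a natural next step is to turn the `Medal` column into a yes/no indicator and average it. The sketch below illustrates this on a small made-up data frame with the same `Medal` and `bmi` columns; the names `toy`, `won_medal` and `medal_summary` are hypothetical, and the sketch assumes a missing `Medal` value means that no medal was won.

```r
library(dplyr)

# Made-up rows for illustration only; these are not taken from the data set.
toy <- data.frame(
  Medal = c(NA, "Gold", NA, "Bronze", NA),
  bmi   = c(22.9, 21.5, 20.6, 23.4, 22.6)
)

medal_summary <- toy |>
  mutate(won_medal = !is.na(Medal)) |>      # TRUE if any medal was won
  summarise(n          = n(),               # number of gymnasts
            n_medal    = sum(won_medal),    # number of medalists
            prop_medal = mean(won_medal))   # sample proportion of medalists

print(medal_summary)
```

Applied to `gymnast_data_bmi`, the same pipeline would give the sample proportion for the filtered group, which could then be compared across BMI ranges.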


## Methods: Plan

## References

Sherman, R. T., Thompson, R. A., & Rose, J. S. (1996). Body mass index and athletic performance in elite female gymnasts. Journal of Sport Behavior, 19(4), 338. https://www.proquest.com/openview/af45415483fdf49d1e797ded686c20a0/1?pq-origsite=gscholar&cbl=30153

Ackland, T., Elliott, B., & Richards, J. (2003). Gymnastics. Sports Biomechanics, 2(2), 163–176. https://doi.org/10.1080/14763140308522815