**Task**

* Update the `wnba` dataframe to create a new column `pts_game` that describes the number of points a player scored per game played during the season.

* Stratify the `wnba` data set by player position.
* Randomly sample 10 observations for each player position.
* Estimate the average number of points per season as `mean_pts_season`.
* Estimate the average number of points per game as `mean_pts_game`.
* Use `arrange()` to sort by player position so that the results are displayed in alphabetical order.

**Answer**

`set.seed(1)`

`wnba <- wnba %>%
  mutate(pts_game = PTS/Games_Played)`

`total_points_estimates <- wnba %>%
  group_by(Pos) %>%
  sample_n(10) %>% 
  summarise(mean_pts_season = mean(PTS),
            mean_pts_game = mean(pts_game)) %>% 
  arrange(Pos)`

![image.png](attachment:image.png)

The range between the minimum and maximum is 30 games played. How would the distribution of points look if we stratified the players into three bins, each spanning approximately 10 games played? Calling the [base R function `cut()`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/cut.html) inside a `mutate()` function call is one method of establishing these strata. The `breaks` argument specifies how many intervals we want to split our data into.

`wnba %>% 
  mutate(games_stratum = cut(Games_Played, breaks = 3)) %>%
  group_by(games_stratum) %>% 
  summarize(n = n()) %>% 
  mutate(percentage = n / sum(n) * 100) %>% 
  arrange(desc(percentage))`
  
![image.png](attachment:image.png)
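
To make the behaviour of `cut()` concrete, here is a minimal sketch on a small made-up vector of games played (the values are illustrative and not taken from `wnba`). With `breaks = 3`, `cut()` divides the observed range into three equal-width intervals, each closed on the right by default, so in a range running from 2 to 32 a value of exactly 12 falls into the first interval.

```r
# Hypothetical games-played values; not taken from the wnba data set
games <- c(2, 10, 12, 13, 22, 25, 32)

# breaks = 3 asks cut() for three equal-width intervals over the range of `games`
strata <- cut(games, breaks = 3)

# Count how many values fall into each interval
stratum_counts <- table(strata)
print(stratum_counts)
```

The factor returned by `cut()` can be passed straight to `group_by()`, which is how the pipeline above uses it.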

**Task**

* Establish three strata for `Games_Played` and sample randomly from each stratum, using a sample size proportional to that stratum's share of the players. Combine the random samples from all three strata and calculate the mean points scored per season for the combined sample.

**Answer**

`set.seed(1)`

`under_12 <- wnba %>% 
  filter(Games_Played <= 12) %>% 
  sample_n(1)`
  
`btw_13_22 <- wnba %>% 
  filter(Games_Played > 12 & Games_Played <= 22) %>% 
  sample_n(2)`
  
`over_22 <- wnba %>% 
  filter(Games_Played > 22) %>% 
  sample_n(7)`

`combined <- bind_rows(under_12, btw_13_22, over_22)`

`mean(combined$PTS)`
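
The sample sizes of 1, 2 and 7 in the answer above add up to a combined sample of 10. The sketch below shows one way such sizes could be derived by proportional allocation. The stratum shares of 10%, 20% and 70% are hypothetical round numbers assumed for illustration; they are not read from the `wnba` output.

```r
# Hypothetical stratum shares (proportions of the population); assumed for illustration
stratum_share <- c(under_12 = 0.10, btw_13_22 = 0.20, over_22 = 0.70)

# Total number of players we want in the combined sample
total_sample_size <- 10

# Proportional allocation: each stratum gets its share of the total,
# rounded to a whole number of rows
stratum_n <- round(stratum_share * total_sample_size)
print(stratum_n)
```

With real shares the rounded sizes may not add up exactly to the intended total, so it is worth checking `sum(stratum_n)` before sampling.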

**Task**

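* Wrap the sampling procedure above in a function called `sample_mean()` that draws one stratified sample and returns its mean points per season.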

**Answer**

`sample_mean <- function(x){` # `x` is never used inside the function; it only lets `map_dbl()` call `sample_mean()` once per sample number

  `under_12 <- wnba %>% 
    filter(Games_Played <= 12) %>% 
    sample_n(1)`

  `btw_13_22 <- wnba %>% 
    filter(Games_Played > 12 & Games_Played <= 22) %>% 
    sample_n(2)`

  `over_22 <- wnba %>%
    filter(Games_Played > 22) %>% 
    sample_n(7)`

  `combined <- bind_rows(under_12, btw_13_22, over_22)`

  `mean(combined$PTS)`

`}`

**Task**

Build a scatterplot of the mean points per season from 100 proportional stratified random samples.

**Answer**

`library(purrr)
library(tibble)
library(ggplot2)`

`set.seed(1)`

`sample_number <- 1:100`

`mean_points_season <- map_dbl(sample_number, sample_mean)` # `map_dbl()` calls `sample_mean()` once for each element of `sample_number` and returns the results as a numeric vector.

`df <- tibble(sample_number, mean_points_season)` # Build a dataframe called `df` that contains the `sample_number` and `mean_points_season` vectors.
                                                             
`ggplot(data = df) + 
    aes(x = sample_number, y = mean_points_season) +
    geom_point() +
    geom_hline(yintercept = mean(wnba$PTS), color = "blue") +
    ylim(80, 320)`
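
The iteration pattern used here can be sketched in isolation. The example below assumes a made-up `population` vector and a hypothetical helper `one_sample_mean()`; neither comes from the `wnba` exercise. It shows that `map_dbl()` calls the function once per element of its input and collects the results in a numeric vector, even when the function ignores its argument.

```r
library(purrr)

set.seed(1)

# Hypothetical population of season point totals; not the wnba data
population <- round(runif(200, min = 0, max = 600))

# The argument `i` is ignored: it exists only so map_dbl() can call the
# function once per element of the input vector
one_sample_mean <- function(i) {
  mean(sample(population, size = 10))
}

# One mean per sample number, returned as a plain double vector
sample_means <- map_dbl(1:5, one_sample_mean)
print(sample_means)
```

Because each call draws a fresh random sample, the five means differ from one another, which is what produces the scatter in the plot.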

We have been working with the `sample_n()` function from `dplyr` because it returns randomly sampled rows from a dataframe. This can be useful when we are interested in analyzing more than one variable from a random sample. The `dplyr` package also offers the `sample_frac()` function that returns a dataframe of randomly sampled rows. Instead of sampling a specified number of rows, `sample_frac()` randomly samples a specified fraction or proportion of rows. For example, if we wanted to randomly sample a quarter, or 25%, of all WNBA players who played more than 22 games in a season, we would input:

`over_22 <- wnba %>% 
  filter(Games_Played > 22) %>% 
  sample_frac(0.25)`

`sample_frac()` is useful for proportional **stratified random sampling** because we do not need to calculate the number of rows to draw from each stratum. We only need to specify the fraction of the population that we would like to sample. When it is applied to a dataframe grouped by stratum, `sample_frac()` returns a number of rows from each stratum that is proportional to that stratum's share of the population.
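
As a rough check of this behaviour, the sketch below builds a made-up population with three strata of 100, 200 and 700 rows (the names `toy`, `stratum` and `value` are hypothetical) and draws 10% from each group. The number of rows returned per stratum should mirror the population shares.

```r
library(dplyr)

set.seed(1)

# Hypothetical population: three strata holding 100, 200 and 700 rows
toy <- tibble(
  stratum = rep(c("low", "mid", "high"), times = c(100, 200, 700)),
  value   = rnorm(1000)
)

# Sampling 10% within each group keeps every stratum's share of the sample
# equal to its share of the population
stratified_sample <- toy %>%
  group_by(stratum) %>%
  sample_frac(0.1)

# Rows drawn per stratum
sample_counts <- stratified_sample %>%
  ungroup() %>%
  count(stratum)
print(sample_counts)
```

In recent `dplyr` releases `sample_n()` and `sample_frac()` are superseded by `slice_sample()`, although they still work as shown here.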

**Task**

Rebuild the scatterplot of 100 proportional stratified random samples, this time drawing each sample with `sample_frac()`.

**Answer**

`set.seed(1)`

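`sample_mean <- function(x) {`
  `sample <- wnba %>% 
  mutate(games_stratum = cut(Games_Played, breaks = 3)) %>% 
  group_by(games_stratum) %>% 
  sample_frac(.07)` # the `games_stratum` column is recreated here because the earlier `cut()` pipeline did not save it back to `wnba`

  `mean(sample$PTS)`
`}`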

`sample_number <- 1:100`

`mean_points_season <- map_dbl(sample_number, sample_mean)`

`df <- tibble(sample_number, mean_points_season)`

`ggplot(data = df) + 
    aes(x = sample_number, y = mean_points_season) +
    geom_point() +
    geom_hline(yintercept = mean(wnba$PTS), color = "blue") +
    ylim(80, 320)`

The results using `sample_frac()` are identical to those from our approach with `sample_n()`.

**Task**

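Simulate cluster sampling using the `Team` column: randomly pick four teams as clusters, keep every player on those teams, and measure the sampling error for height, age, games played, and points.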

**Answer**

`set.seed(10)`

`clusters <-  unique(wnba$Team) %>% sample(size = 4)`

`sample <- wnba %>% filter(Team %in% clusters)` # filter the wnba dataframe for all teams in the cluster with `%in%`

`sampling_error_height <- mean(wnba$Height) - mean(sample$Height)`
`sampling_error_age <- mean(wnba$Age) - mean(sample$Age)`
`sampling_error_games <- mean(wnba$Games_Played) - mean(sample$Games_Played)`
`sampling_error_points <- mean(wnba$PTS) - mean(sample$PTS)`
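
The same two-step procedure can be tried on a small made-up league. In the sketch below, `toy_league`, its six teams and the player heights are all hypothetical; it picks two whole teams at random, keeps all of their players, and computes the sampling error as the population mean minus the sample mean.

```r
library(dplyr)

set.seed(10)

# Hypothetical league: 6 teams with 5 players each; heights are made up
toy_league <- tibble(
  Team   = rep(c("A", "B", "C", "D", "E", "F"), each = 5),
  Height = round(rnorm(30, mean = 185, sd = 8))
)

# Step 1: randomly pick 2 whole teams (the clusters)
clusters <- unique(toy_league$Team) %>% sample(size = 2)

# Step 2: keep every player on the chosen teams
cluster_sample <- toy_league %>% filter(Team %in% clusters)

# Sampling error = population mean minus sample mean
sampling_error_height <- mean(toy_league$Height) - mean(cluster_sample$Height)
print(sampling_error_height)
```

Since whole teams are drawn rather than individual players, the size of the sample depends on how many players the selected teams happen to have.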