`library(readr)
library(dplyr)  # provides %>%, group_by(), summarize(), mutate(), arrange()
library(tidyr)  # provides drop_na()
wnba <- read_csv("wnba.csv")`

`freq_dist_pos <- wnba %>%
  group_by(Pos) %>%
  summarize(Freq = n())`
            
`freq_dist_height <- wnba %>%
  group_by(Height) %>%
  summarize(Freq = n())`

`age_ascending <- wnba %>%
  group_by(Age) %>%
  summarize(Freq = n())`

`age_descending <- wnba %>%
  group_by(Age) %>%
  summarize(Freq = n()) %>% 
  arrange(desc(Age))`

`wnba <- wnba %>%
  mutate(Height_labels = case_when(
    Height <= 170 ~ "short",
    Height > 170 & Height <= 180 ~ "medium",
    Height > 180 ~ "tall"
  ))`
    
`wnba %>% select(Height, Height_labels) %>% head(10)`

We want to sort the labels in ascending or descending order of height, but `arrange(Height_labels)` doesn't do that because the function can't infer quantities from words like **"medium"**. It can only order the labels alphabetically, in ascending or descending order:

`wnba %>% 
  group_by(Height_labels) %>% 
  summarize(Freq = n()) %>% 
  arrange(desc(Height_labels))`
  
![image.png](attachment:image.png)
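To see the alphabetical ordering in isolation, here is a minimal base R sketch. The `height_labels` vector is a toy example made up for illustration, not data from `wnba.csv`:

```r
# Toy labels (hypothetical, not taken from the wnba data)
height_labels <- c("tall", "short", "medium", "short")

# Character vectors are sorted alphabetically, not by meaning
ascending <- sort(height_labels)
descending <- sort(height_labels, decreasing = TRUE)

print(ascending)   # "medium" comes before "short" and "tall"
print(descending)  # "tall" comes first, "medium" last
```

In both directions "medium" ends up at one end of the result instead of between "short" and "tall".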

One solution is to convert the `Height_labels` variable to a [factor type](https://stat.ethz.ch/R-manual/R-devel/library/base/html/factor.html). In R, the factor variable type is used for categorical variables. Think of categories as discrete buckets, or bins. For our purposes we will need to provide two arguments to the `factor()` function: 

1. `x`, a vector of values, and 
2. `levels`, a character vector that lists the categories in the order in which they should be sorted.

`height_levels <- c("short", "medium", "tall")`

`wnba_factor <- wnba %>%
    mutate(Height_labels = 
             factor(Height_labels, 
                    levels = height_levels))`

**A word of caution**: when creating factors, be sure to specify all desired categorical values, because any values not listed in `levels` will be silently converted to `NA`!
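The following sketch shows this pitfall on a small made-up vector (the `labels` values are hypothetical). Leaving "tall" out of `levels` turns every "tall" entry into `NA` without any warning:

```r
# Toy labels (hypothetical)
labels <- c("short", "tall", "medium", "tall")

# "tall" is missing from the levels, so it silently becomes NA
incomplete <- factor(labels, levels = c("short", "medium"))
print(incomplete)
n_missing <- sum(is.na(incomplete))

# Listing every category keeps all the values
complete <- factor(labels, levels = c("short", "medium", "tall"))
print(complete)
```

Checking `sum(is.na(...))` before and after the conversion is a quick way to catch a forgotten level.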

Now that the `Height_labels` variable is a factor with established levels, arranging our frequency table in ascending order will list the values from shortest to tallest:

`wnba_factor %>% 
  group_by(Height_labels) %>% 
  summarize(Freq = n()) %>% 
  arrange(Height_labels)`
  
![image.png](attachment:image.png)

It is worth mentioning that we do not actually need to convert the `Height_labels` column to a factor in the first place! This is especially useful when we want to keep the column as a character variable. We can call the `factor()` function on the `Height_labels` variable and set the factor levels within the `arrange()` call itself, like this:

`height_levels <- c("short", "medium", "tall")`

`wnba %>% 
  group_by(Height_labels) %>% 
  summarize(Freq = n()) %>% 
  arrange(factor(Height_labels, 
                 levels = height_levels))`
                 
![image.png](attachment:image.png)
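The same idea, using a factor only as a temporary sort key, can be sketched in base R with `order()`. The `freq_table` data frame and its `Freq` counts below are made up for illustration:

```r
# Hypothetical frequency table; the counts are invented
freq_table <- data.frame(
  Height_labels = c("medium", "short", "tall"),
  Freq = c(5, 2, 8),
  stringsAsFactors = FALSE
)

height_levels <- c("short", "medium", "tall")

# The factor exists only as a sort key; the column itself is not modified
sort_key <- factor(freq_table$Height_labels, levels = height_levels)
sorted_table <- freq_table[order(sort_key), ]

print(sorted_table)
print(class(sorted_table$Height_labels))  # still "character"
```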

When we analyze distributions, we're often interested in answering questions about **proportions** and **percentages**.

Because proportions and percentages are relative to the total number of instances in some set of data, they are called **relative frequencies**. In contrast, the frequencies we've been working with so far are called **absolute frequencies**.
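As a quick illustration of the difference, here is a base R sketch using `table()` and `prop.table()` on a made-up vector of positions (the values are hypothetical, not the real `Pos` column):

```r
# Toy positions (hypothetical)
positions <- c("G", "G", "F", "C", "G", "F")

# Absolute frequencies: how many times each value occurs
absolute <- table(positions)

# Relative frequencies: each count divided by the total number of values
proportions <- prop.table(absolute)
percentages <- proportions * 100

print(absolute)
print(proportions)
print(percentages)
```

The proportions always add up to 1 and the percentages to 100, whatever the size of the data set.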

`wnba %>%
  group_by(Pos) %>%
  summarize(Freq = n()) %>% 
  mutate(Prop = Freq / nrow(wnba)) %>%
  mutate(Percentage = Freq / nrow(wnba) * 100) %>% 
  arrange(desc(Freq))`

![image.png](attachment:image.png)

`wnba %>%
  filter(Pos == "G") %>% 
  summarize(Freq = n()) %>% 
  mutate(Prop = Freq / nrow(wnba)) %>%
  mutate(Percentage = Freq / nrow(wnba) * 100)`
  
![image.png](attachment:image.png)

Notice that we did not use `group_by()` when filtering for a single value because `filter()` creates a subset of data that only includes players in the guard position. There is no need for grouping in this instance.

**A word of caution regarding `NA` values:** In the examples above we determined proportions and percentages by dividing the frequency by the total number of rows (players) in the `wnba` dataframe. We were able to do this because there are no `NA` values in the `Pos` variable. If the variable for which we are estimating proportions or percentages contains `NA` values, we need to divide the frequency by the `length()` of the column (vector) after the `NA` values have been removed.
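A small sketch of the difference this makes, using a made-up vector with one missing value (hypothetical data, not the `wnba` dataframe):

```r
# Toy positions with one missing value (hypothetical)
positions <- c("G", NA, "F", "G", "C")

n_guards <- sum(positions == "G", na.rm = TRUE)

# Dividing by every row counts the NA as if it were a known non-guard
prop_all_rows <- n_guards / length(positions)

# Dividing by the number of non-missing values only
prop_non_missing <- n_guards / length(positions[!is.na(positions)])

print(prop_all_rows)
print(prop_non_missing)
```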

`age_23_or_under <- wnba %>%
  filter(Age <= 23) %>% 
  summarize(Freq = n()) %>% 
  mutate(Prop = Freq / nrow(wnba)) %>%
  mutate(Percentage = Freq / nrow(wnba) * 100)`

![image.png](attachment:image.png)

We found that the percentage of players age 23 years or younger is 19% (rounded to the nearest integer). This percentage is also called a **percentile rank**.

The percentile rank of a value `x` in a frequency distribution is the percentage of values that are less than or equal to `x`.

Saying that `23` has a percentile rank of 19% therefore means that 19% of the values are less than or equal to `23`.

We can arrive at the same answer faster by using the base R function `mean()` like this:

`# Proportion`

`mean(wnba$Age <= 23)`

`# Percentage`

`mean(wnba$Age <= 23) * 100`

![image.png](attachment:image.png)

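The expression `wnba$Age <= 23` compares every age with 23 and returns a logical vector of `TRUE` and `FALSE` values. Because `mean()` treats `TRUE` as 1 and `FALSE` as 0, the mean of that vector is the proportion of values that are less than or equal to 23. We use `<=` rather than `<` because a percentile rank also counts the values equal to `x`.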
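Here is the same computation broken into steps on a made-up vector of ages (hypothetical values), so that the intermediate logical vector is visible:

```r
# Toy ages (hypothetical)
ages <- c(21, 23, 25, 30, 35)

# Step 1: compare every value with 23
at_most_23 <- ages <= 23
print(at_most_23)

# Step 2: the mean of a logical vector is the proportion of TRUE values
proportion <- mean(at_most_23)
percentile_rank <- proportion * 100

print(proportion)
print(percentile_rank)
```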

**Task**

There are 34 games in the WNBA’s regular season:
* What percentage of players played half the number of games or fewer in the 2016-2017 season?
* What percentage of players played more than half the number of games in the 2016-2017 season? 

**Answer**

`percentile_50_or_less <- mean(wnba$Games_Played <= 17) * 100`

`percentile_above_50 <- mean(wnba$Games_Played > 17) * 100`

To find percentiles for the `Age` variable, we can use the built-in [`summary()` function](https://stat.ethz.ch/R-manual/R-devel/library/base/html/summary.html), which for a numeric vector returns the 25th, 50th, and 75th percentiles along with the minimum, the mean, and the maximum:

![image.png](attachment:image.png)
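The call itself is not shown here; it is presumably something like `summary(wnba$Age)`. As a sketch, this is what `summary()` returns for a small made-up vector of ages (hypothetical values):

```r
# Toy ages (hypothetical)
ages <- c(21, 23, 25, 27, 30, 34)

age_summary <- summary(ages)
print(age_summary)

# The result is a named vector, so individual entries can be extracted
print(names(age_summary))
print(age_summary[["Median"]])
```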

We can also use another [base R function `quantile()`](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/quantile.html) to display similar information, but presented in a slightly different way. Note that the `quantile()` function does not return the mean:

![image.png](attachment:image.png)

The three percentiles that divide the distribution into four equal parts are also known as **quartiles** (from the Latin *quartus*, which means "fourth"). Note that the term quartile is different from [quantile](https://en.wikipedia.org/wiki/Quantile). **Quantiles** provide us with the value of a random variable for a specified probability. For example, the median is the value of the random variable at the quantile probability value of 0.5.

We may be interested to find the percentiles for percentages other than 25%, 50%, or 75%. For that, we can use the `probs` argument of the `quantile()` function. 

![image.png](attachment:image.png)
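As a sketch of both uses of `quantile()`, the default call and a call with `probs`, here is an example on a made-up vector of ages (hypothetical values; with the real data we would pass `wnba$Age` instead):

```r
# Toy ages (hypothetical)
ages <- c(21, 23, 25, 27, 30, 34)

# Default: minimum, the three quartiles, and maximum
quartiles <- quantile(ages)
print(quartiles)

# Custom percentiles through the probs argument (probabilities between 0 and 1)
custom <- quantile(ages, probs = c(0.1, 0.5, 0.9))
print(custom)
```

The names of the result (`"10%"`, `"50%"`, `"90%"`) are generated from the probabilities we pass in.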

What if we want to calculate age percentiles for each player in the `wnba` dataframe? The `dplyr` package offers a convenient function, [`cume_dist()`](https://dplyr.tidyverse.org/reference/ranking.html), for calculating the percentile rank of each value in a column, expressed as a proportion between 0 and 1. The `cume_dist()` function takes one argument, a vector.

`wnba %>% 
  mutate(cume_dist_age = cume_dist(Age)) %>% 
  select(Name, Age, cume_dist_age) %>% 
  head(n = 15)`
  
![image.png](attachment:image.png)
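To make the calculation concrete, here is a base R sketch that should give the same proportions as `cume_dist()` for a vector without missing values. It is a stand-in written for illustration, not dplyr's own implementation, and the `ages` vector is made up:

```r
# Toy ages (hypothetical), with one tie
ages <- c(21, 23, 23, 30)

# For each value: the number of values <= it, divided by the total count
cume_dist_base <- rank(ages, ties.method = "max") / length(ages)
print(cume_dist_base)

# The empirical cumulative distribution function gives the same proportions
same_via_ecdf <- ecdf(ages)(ages)
print(same_via_ecdf)
```

Tied values share the same proportion, and the largest value always gets 1.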

Percentiles don't have [a single standard definition](https://en.wikipedia.org/wiki/Percentile#Definitions), so don't be surprised if we get very similar (but not identical) values when we use different functions (especially if the functions come from different packages).
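Even a single function can implement several definitions. As a sketch, the `type` argument of base R's `quantile()` selects among nine algorithms; on the made-up vector below (hypothetical values), the default `type = 7` interpolates between observations while `type = 1` returns an observed value:

```r
# Toy ages (hypothetical)
ages <- c(21, 23, 25, 27, 30, 34)

# Default definition (type = 7): linear interpolation between observations
q_default <- quantile(ages, probs = 0.25)

# Alternative definition (type = 1): inverse of the empirical CDF
q_type1 <- quantile(ages, probs = 0.25, type = 1)

print(q_default)
print(q_type1)
```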

**Task**

Use the `Age` variable to calculate **Quartiles** and **Percentiles**.

**Answer**

`age_upper_quartile <- quantile(wnba$Age, probs = 0.75)
age_middle_quartile <- quantile(wnba$Age, probs = 0.50)
age_95th_percentile <- quantile(wnba$Age, probs = 0.95)`

`wnba_age_percentiles <- wnba %>% 
  mutate(cume_dist_age = cume_dist(Age)) %>% 
  select(Name, Age, cume_dist_age) %>% 
  arrange(Age)`

With frequency tables, we're trying to transform relatively large and incomprehensible amounts of data to a table format we can understand. However, not all frequency tables are straightforward:

`wnba %>%
  group_by(Weight) %>%
  summarize(Freq = n())`
  
The table for the `Weight` variable is still a relatively straightforward case; the frequency tables for variables like `PTS`, `BMI`, or `MIN` are even more daunting.

If the variable is measured on an interval or ratio scale, a common solution to this problem is to group the values in equal intervals.

Fortunately, R can handle this process gracefully. We only need to make use of the `breaks` argument of the [`cut()` function](https://stat.ethz.ch/R-manual/R-devel/library/base/html/cut.html). We want ten equal intervals, so we need to specify `breaks = 10`. We use `mutate()` to create a new column called **weight_categories** that contains the ten intervals:

`wnba <- wnba %>% 
  mutate(weight_categories = 
           cut(Weight, breaks = 10, dig.lab = 4))`
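To see what `cut()` produces on its own, here is a base R sketch on a handful of made-up weights (hypothetical values, not the real `Weight` column):

```r
# Toy weights (hypothetical)
weights <- c(55, 60, 72, 80, 95, 113)

# breaks = 10 asks for ten intervals of equal width covering the range
weight_categories <- cut(weights, breaks = 10, dig.lab = 4)
n_intervals <- nlevels(weight_categories)

# table() counts the values per interval, including empty intervals
freq <- table(weight_categories)

print(n_intervals)
print(freq)
```

The result of `cut()` is a factor, which is why the intervals sort from lowest to highest rather than alphabetically.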

The `dig.lab` argument determines the number of digits used in formatting the break numbers. We will use `dplyr` to create our frequency distribution table, and we will drop the single `NA` value with the `drop_na()` function from the `tidyr` package:

`wnba %>% 
  group_by(weight_categories) %>% 
  summarize(Freq = n()) %>% 
  drop_na()`

The `(` character indicates that the starting point is not included, while the `]` indicates that the endpoint is included. `(54.94, 60.8]` means that 54.94 isn't included in the interval, while 60.8 is. The interval `(54.94, 60.8]` contains all real numbers greater than 54.94 and less than or equal to 60.8.
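The boundary behaviour is easiest to check with a few made-up values sitting exactly on the break points (hypothetical data). The sketch also uses `right = FALSE`, an argument of `cut()` that closes the intervals on the left instead:

```r
# Toy values placed on and between the break points (hypothetical)
values <- c(0, 1, 5, 6, 10)

# Default: intervals are open on the left and closed on the right
intervals <- cut(values, breaks = c(0, 5, 10))
print(intervals)  # 0 falls outside (0,5] and becomes NA; 5 belongs to (0,5]

# right = FALSE: intervals are closed on the left and open on the right
left_closed <- cut(values, breaks = c(0, 5, 10), right = FALSE)
print(left_closed)  # now 10 falls outside [5,10) and becomes NA
```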

Because we group values in a table to get a better sense of frequencies in the distribution, the table we generated above is also known as a **grouped frequency distribution table**. Each group (interval) in a grouped frequency distribution table is also known as a **class interval**. `(107.2, 113.1]`, for instance, is a class interval.

**Task**

* Add a new column to the `wnba` dataframe called **points_categories** that breaks the `PTS` column into 10 class intervals. Use the value of 4 for the `dig.lab` argument.

* Generate a frequency table for the `PTS` (total points) variable called `pts_freq_table` and try to find some patterns in the distribution of values.

* Build a grouped frequency distribution table for the `points_categories`.

**Answer**

`wnba <- wnba %>% 
  mutate(points_categories = cut(PTS, breaks = 10, dig.lab = 4))`

`pts_freq_table <- wnba %>% 
  group_by(PTS) %>% 
  summarize(Freq = n())`

`pts_grouped_freq_table <- wnba %>% 
  group_by(points_categories) %>% 
  summarize(Freq = n()) %>% 
  mutate(Percentage = Freq / nrow(wnba) * 100) %>% 
  arrange(desc(points_categories))`

When we generate grouped frequency distribution tables, there's an inevitable information loss.

To get back this granular information, we can increase the number of class intervals. However, if we do that, we end up again with a table that's lengthy and very difficult to analyze.

On the other hand, if we decrease the number of class intervals, we lose even more information.

We can conclude there is a trade-off between the information in a table, and how comprehensible the table is.

As a rule of thumb, 10 is a good number of class intervals to choose because it offers a good balance between information and comprehensibility.
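The trade-off can be sketched by grouping the same made-up points totals (hypothetical values, not the real `PTS` column) into three and then ten class intervals:

```r
# Toy points totals (hypothetical)
pts <- c(2, 50, 120, 200, 310, 450, 584)

# Few intervals: short table, but each row hides a wide range of values
coarse <- table(cut(pts, breaks = 3, dig.lab = 4))

# More intervals: more detail, but a longer table with sparse rows
fine <- table(cut(pts, breaks = 10, dig.lab = 4))

print(coarse)
print(fine)
```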

R helps a lot when we need to quickly explore grouped frequency tables. However, the intervals R outputs are confusing at first sight.

Imagine we'd have to publish the table above in a blog post or a scientific paper. The readers will have a hard time understanding the intervals we chose. They'll also be puzzled by the decimal numbers because points in basketball can only be integers.

To fix this, we can define the intervals ourselves by providing a numeric vector to the `breaks` argument, instead of simply providing the number of breaks we want.

`wnba <- wnba %>% 
  mutate(points_categories = cut(PTS, 
               breaks = c(0, 100, 200, 300, 400, 500, 600), 
               dig.lab = 4))`

`wnba %>% 
  group_by(points_categories) %>% 
  summarize(Freq = n()) %>% 
  mutate(Percentage = Freq / nrow(wnba) * 100)`

Note that we're not restricted by the minimum and maximum values of a variable when we define intervals. The minimum number of points is 2, and the maximum is 584, but our intervals range from 1 to 600 (remember, zero is not included).
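One thing to keep in mind with hand-picked breaks is that any value outside them becomes `NA`. The sketch below uses the minimum and maximum mentioned above (2 and 584) plus two made-up values, 0 and 601, that are assumptions added only to show the edge cases:

```r
# 2 and 584 are the minimum and maximum quoted above;
# 0 and 601 are hypothetical values outside the intervals
pts <- c(0, 2, 584, 601)

points_categories <- cut(pts, breaks = seq(0, 600, by = 100), dig.lab = 4)
print(points_categories)

# Values outside (0, 600] are not assigned to any interval
outside <- is.na(points_categories)
print(outside)
```

If a value of exactly 0 should be kept, `cut()` accepts `include.lowest = TRUE`, which closes the first interval on the left as well.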

**Task**

Modify the `min_categories` variable we added to the `wnba` dataframe by breaking the `MIN` column into 7 class intervals, each 150 minutes wide.

**Answer**

`wnba <- wnba %>% 
  mutate(min_categories = 
           cut(MIN, 
               breaks = c(0, 150, 300, 450, 600, 750, 900, 1050), 
               dig.lab = 4))`

`min_grouped_freq_table <- wnba %>% 
  group_by(min_categories) %>% 
  summarize(Freq = n()) %>% 
  mutate(Percentage = Freq / nrow(wnba) * 100)`