The goal of this lab is to introduce you to R and RStudio, which you'll be using throughout the course both to learn the statistical concepts discussed in the course and to analyze real data and come to informed conclusions. To straighten out which is which: R is the name of the programming language itself and RStudio is a convenient interface.
As the labs progress, you are encouraged to explore beyond what the labs dictate; a willingness to experiment will make you a much better programmer. Before we get to that stage, however, you need to build some basic fluency in R. Today we begin with the fundamental building blocks of R and RStudio: the interface, reading in data, and basic commands.
Your RStudio window has four panels.
Your R Markdown file (this document) is in the upper left panel.
The panel on the lower left is where the action happens. It's called the console. Every time you launch RStudio, it will have the same text at the top of the console telling you the version of R that you're running. Below that information is the prompt. As its name suggests, this prompt is really a request: a request for a command. Initially, interacting with R is all about typing commands and interpreting the output. These commands and their syntax have evolved over decades (literally) and now provide what many users feel is a fairly natural way to access data and organize, describe, and invoke statistical computations.
The panel in the upper right contains your workspace as well as a history of the commands that you've previously entered.
Any plots that you generate will show up in the panel in the lower right corner. This is also where you can browse your files, access help, manage packages, etc.
R is an open-source programming language, meaning that users can contribute packages that make our lives easier, and we can use them for free. For this lab, and many others in the future, we will use the following R packages:
- `statsr`: for data files and functions used in this course
- `dplyr`: for data wrangling
- `ggplot2`: for data visualization
You should have already installed these packages using commands like `install.packages` and `install_github`.
Next, you need to load the packages in your working environment. We do this with the `library` function. Note that you only need to install packages once, but you need to load them each time you relaunch RStudio.
library(dplyr)
library(ggplot2)
library(statsr)
To do so, you can
- click on the green arrow at the top of the code chunk in the R Markdown (Rmd) file, or
- highlight these lines, and hit the Run button on the upper right corner of the pane, or
- type the code in the console.
Going forward you will be asked to load any relevant packages at the beginning of each lab.
To get you started, run the following command to load the data.
data(arbuthnot)
To do so, once again, you can
- click on the green arrow at the top of the code chunk in the R Markdown (Rmd) file, or
- put your cursor on this line, and hit the Run button on the upper right corner of the pane, or
- type the code in the console.
This command instructs R to load some data: the Arbuthnot baptism counts for boys and girls. You should see that the workspace area in the upper right-hand corner of the RStudio window now lists a data set called `arbuthnot` that has 82 observations on 3 variables. As you interact with R, you will create a series of objects. Sometimes you load them, as we have done here, and sometimes you create them yourself as the byproduct of a computation or some analysis you have performed.
The Arbuthnot data set refers to Dr. John Arbuthnot, an 18th century physician, writer, and mathematician. He was interested in the ratio of newborn boys to newborn girls, so he gathered the baptism records for children born in London for every year from 1629 to 1710. We can take a look at the data by typing its name into the console.
arbuthnot
However, printing the whole dataset to the console is not that useful. One advantage of RStudio is that it comes with a built-in data viewer. Click on the name `arbuthnot` in the Environment pane (upper right window) that lists the objects in your workspace. This will bring up an alternative display of the data set in the Data Viewer (upper left window). You can close the data viewer by clicking on the x in the upper left-hand corner.
What you should see are four columns of numbers, each row representing a different year: the first entry in each row is simply the row number (an index we can use to access the data from individual years if we want), the second is the year, and the third and fourth are the numbers of boys and girls baptized that year, respectively. Use the scrollbar on the right side of the console window to examine the complete data set.
Note that the row numbers in the first column are not part of Arbuthnot's data. R adds them as part of its printout to help you make visual comparisons. You can think of them as the index that you see on the left side of a spreadsheet. In fact, the comparison to a spreadsheet will generally be helpful. R has stored Arbuthnot's data in a kind of spreadsheet or table called a data frame.
You can see the dimensions of this data frame by typing:
dim(arbuthnot)
This command should output `[1] 82 3`, indicating that there are 82 rows and 3 columns (we'll get to what the `[1]` means in a bit), just as it says next to the object in your workspace. You can see the names of these columns (or variables) by typing:
names(arbuthnot)
- How many variables are included in this data set?
- 2
- 3
- 4
- 82
- 1710
You should see that the data frame contains the columns `year`, `boys`, and `girls`. At this point, you might notice that many of the commands in R look a lot like functions from math class; that is, invoking R commands means supplying a function with some number of arguments. The `dim` and `names` commands, for example, each took a single argument, the name of a data frame.
So far we have asked you to type your commands in the console. The console is a great place for playing around with code; however, it is not a good place for documenting your work. Working exclusively in the console makes it difficult to document your work as you go and to reproduce it later.
R Markdown is a great solution for this problem, and you have already worked with an R Markdown document -- this lab! Going forward, type the code for the questions in the code chunks provided in the R Markdown (Rmd) document for the lab, and Knit the document to see the results.
Let's start to examine the data a little more closely. We can access the data in a single column of a data frame separately using a command like
arbuthnot$boys
This command will only show the number of boys baptized each year. The dollar sign basically says "go to the data frame that comes before me, and find the variable that comes after me".
- What command would you use to extract just the counts of girls born?
- `arbuthnot$boys`
- `arbuthnot$girls`
- `girls`
- `arbuthnot[girls]`
- `$girls`
# type your code for the Question 2 here, and Knit
Notice that the way R has printed these data is different. When we looked at the complete data frame, we saw 82 rows, one on each line of the display. These data are no longer structured in a table with other variables, so they are displayed one right after another. Objects that print out in this way are called vectors; they represent a set of numbers. R has added numbers in [brackets] along the left side of the printout to indicate locations within the vector. For example, in the arbuthnot$boys vector, 5218 follows [1], indicating that 5218 is the first entry in the vector. And if [43] starts a line, then that would mean the first number on that line would represent the 43rd entry in the vector.
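Those bracketed indices also hint at how you can pull out individual entries yourself. As a quick sketch you can try in the console (this uses base R `[ ]` subsetting, which is not a command the lab requires):

arbuthnot$boys[1]      # first entry of the vector: 5218
arbuthnot$boys[1:5]    # first five entries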
R has some powerful functions for making graphics. We can create a simple plot of the number of girls baptized per year with the command
ggplot(data = arbuthnot, aes(x = year, y = girls)) +
geom_point()
Before we review the code for this plot, let's summarize the trends we see in the data.
- Which of the following best describes the number of girls baptised over the years included in this dataset?
- There appears to be no trend in the number of girls baptised from 1629 to 1710.
- There is initially an increase in the number of girls baptised, which peaks around 1640. After 1640 there is a decrease in the number of girls baptised, but the number begins to increase again in 1660. Overall the trend is an increase in the number of girls baptised.
- There is initially an increase in the number of girls baptised. This number peaks around 1640 and then after 1640 the number of girls baptised decreases.
- The number of girls baptised has decreased over time.
- There is an initial increase in the number of girls baptised but this number appears to level around 1680 and not change after that time point.
Back to the code... We use the `ggplot()` function to build plots. If you run the plotting code in your console, you should see the plot appear under the Plots tab of the lower right panel of RStudio. Notice that the command above again looks like a function, this time with arguments separated by commas.

- The first argument is always the dataset.
- Next, we provide the variables from the dataset to be assigned to `aes`thetic elements of the plot, e.g. the x and the y axes.
- Finally, we use another layer, separated by a `+`, to specify the `geom`etric object for the plot. Since we want a scatterplot, we use `geom_point()`.
You might wonder how you are supposed to know the syntax for the `ggplot` function. Thankfully, R documents all of its functions extensively. To read what a function does and learn the arguments that are available to you, just type in a question mark followed by the name of the function that you're interested in. Try the following in your console:
?ggplot
Notice that the help file replaces the plot in the lower right panel. You can toggle between plots and help files using the tabs at the top of that panel.
Now, suppose we want to plot the total number of baptisms. To compute this, we could use the fact that R is really just a big calculator. We can type in mathematical expressions like
5218 + 4683
to see the total number of baptisms in 1629. We could repeat this once for each year, but there is a faster way. If we add the vector for baptisms for boys to that of girls, R will compute all sums simultaneously.
arbuthnot$boys + arbuthnot$girls
What you will see are 82 numbers (in that packed display, because we aren’t looking at a data frame here), each one representing the sum we’re after. Take a look at a few of them and verify that they are right.
We'll be using this new vector to generate some plots, so we'll want to save it as a permanent column in our data frame.
arbuthnot <- arbuthnot %>%
mutate(total = boys + girls)
What in the world is going on here? The `%>%` operator is called the piping operator. Basically, it takes the output of the current line and pipes it into the following line of code.

You can read the chunk above as: "Take the `arbuthnot` dataset and pipe it into the `mutate` function. Use `mutate` to create a new variable called `total` that is the sum of the variables called `boys` and `girls`. Then assign this new resulting dataset to the object called `arbuthnot`, i.e. overwrite the old `arbuthnot` dataset with the new one containing the new variable."
This is essentially equivalent to going through each row and adding up the boys and girls counts for that year and recording that value in a new column called total.
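If it helps to see this without the pipe, here is a sketch of an equivalent base R assignment that produces the same column:

arbuthnot$total <- arbuthnot$boys + arbuthnot$girls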
You'll see that there is now a new column called `total` that has been tacked on to the data frame. The special symbol `<-` performs an assignment, taking the output of one line of code and saving it into an object in your workspace. In this case, you already have an object called `arbuthnot`, so this command updates that data set with the new mutated column.
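As a tiny standalone illustration of `<-`, you could try the following in the console (the name `sum_1629` is just an arbitrary example):

sum_1629 <- 5218 + 4683
sum_1629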
We can make a plot of the total number of baptisms per year with the following command.
ggplot(data = arbuthnot, aes(x = year, y = total)) +
geom_line()
Note that using `geom_line()` instead of `geom_point()` results in a line plot instead of a scatter plot. You want both? Just layer them on:
ggplot(data = arbuthnot, aes(x = year, y = total)) +
geom_line() +
geom_point()
# type your code for the Exercise here, and Knit
Finally, in addition to simple mathematical operators like subtraction and division, you can ask R to make comparisons like greater than, `>`, less than, `<`, and equality, `==`. For example, we can ask if boys outnumber girls in each year with the expression
arbuthnot <- arbuthnot %>%
mutate(more_boys = boys > girls)
This command adds a new variable to the `arbuthnot` data frame containing the values of either `TRUE` if that year had more boys than girls, or `FALSE` if that year did not (the answer may surprise you). This variable contains a different kind of data than we have considered so far. All other columns in the `arbuthnot` data frame have values that are numerical (the year, the number of boys and girls). Here, we've asked R to create logical data, data where the values are either `TRUE` or `FALSE`. In general, data analysis will involve many different kinds of data types, and one reason for using R is that it is able to represent and compute with many of them.
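Because `TRUE` counts as 1 and `FALSE` as 0 in R, logical data are easy to tally. A quick sketch (assuming you have created `more_boys` as above):

table(arbuthnot$more_boys)   # counts of TRUE and FALSE years
sum(arbuthnot$more_boys)     # number of years with more boys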
In the previous few pages, you recreated some of the displays and preliminary analysis of Arbuthnot's baptism data. Next you will do a similar analysis, but for present day birth records in the United States. Load up the present day data with the following command.
data(present)
The data are stored in a data frame called `present`, which should now be loaded in your workspace.
- How many variables are included in this data set?
- 2
- 3
- 4
- 74
- 2013
# type your code for Question 4 here, and Knit
# type your code for Exercise here, and Knit
- Calculate the total number of births for each year and store these values in a new variable called `total` in the `present` dataset. Then, calculate the proportion of boys born each year and store these values in a new variable called `prop_boys` in the same dataset. Plot these values over time and based on the plot determine if the following statement is true or false: The proportion of boys born in the US has decreased over time.
- True
- False
# type your code for Question 5 here, and Knit
- Create a new variable called `more_boys` which contains the value of either `TRUE` if that year had more boys than girls, or `FALSE` if that year did not. Based on this variable which of the following statements is true?
- Every year there are more girls born than boys.
- Every year there are more boys born than girls.
- Half of the years there are more boys born, and the other half more girls born.
# type your code for Question 6 here, and Knit
- Calculate the boy-to-girl ratio each year, and store these values in a new variable called `prop_boy_girl` in the `present` dataset. Plot these values over time. Which of the following best describes the trend?
- There appears to be no trend in the boy-to-girl ratio from 1940 to 2013.
- There is initially an increase in boy-to-girl ratio, which peaks around 1960. After 1960 there is a decrease in the boy-to-girl ratio, but the number begins to increase in the mid 1970s.
- There is initially a decrease in the boy-to-girl ratio, and then an increase between 1960 and 1970, followed by a decrease.
- The boy-to-girl ratio has increased over time.
- There is an initial decrease in the boy-to-girl ratio, but this number appears to level off around 1960 and remain constant since then.
# type your code for Question 7 here, and Knit
- In what year did we see the highest total number of births in the U.S.? Hint: Sort your dataset in descending order based on the `total` column. You can do this interactively in the data viewer by clicking on the arrows next to the variable names, or you can arrange the data in descending order with the `desc` function (for descending order), as in the sample code below.
- 1940
- 1957
- 1961
- 1991
- 2007
# type your code for Question 8 here
# sample code is provided below, edit as necessary, uncomment, and then Knit
#present %>%
# mutate(total = ?) %>%
# arrange(desc(total))
That was a short introduction to R and RStudio, but we will provide you with more functions and a more complete sense of the language as the course progresses. You might find the following tips and resources helpful.
- In this course we will be using the `dplyr` (for data wrangling) and `ggplot2` (for data visualization) packages extensively. If you are googling for R code, make sure to also include these package names in your search query. For example, instead of googling "scatterplot in R", google "scatterplot in R with ggplot2".
- The following cheatsheets may come in handy throughout the course. Note that some of the code on these cheatsheets may be too advanced for this course, however the majority of it will become useful as you progress through the course material.
- While you will get plenty of practice working with these packages in the labs of this course, if you would like further opportunities to practice we recommend checking out the relevant courses at DataCamp.
Some define statistics as the field that focuses on turning information into knowledge. The first step in that process is to summarize and describe the raw information - the data. In this lab we explore flights, specifically a random sample of domestic flights that departed from the three major New York City airports in 2013. We will generate simple graphical and numerical summaries of data on these flights and explore delay times. As this is a large data set, along the way you'll also learn the indispensable skills of data processing and subsetting.
In this lab we will explore the data using the `dplyr` package and visualize it using the `ggplot2` package for data visualization. The data can be found in the companion package for this course, `statsr`.
Let's load the packages.
library(statsr)
library(dplyr)
library(ggplot2)
The Bureau of Transportation Statistics (BTS) is a statistical agency that is a part of the Research and Innovative Technology Administration (RITA). As its name implies, BTS collects and makes available transportation data, such as the flights data we will be working with in this lab.
We begin by loading the `nycflights` data frame. Type the following in your console to load the data:
data(nycflights)
The data frame containing `r nrow(nycflights)` flights that shows up in your workspace is a data matrix, with each row representing an observation and each column representing a variable. R calls this data format a data frame, which is a term that will be used throughout the labs.
To view the names of the variables, type the command
names(nycflights)
This returns the names of the variables in this data frame. The codebook (description of the variables) is included below. This information can also be found in the help file for the data frame, which can be accessed by typing `?nycflights` in the console.
- `year`, `month`, `day`: Date of departure
- `dep_time`, `arr_time`: Departure and arrival times, local time zone
- `dep_delay`, `arr_delay`: Departure and arrival delays, in minutes. Negative times represent early departures/arrivals.
- `carrier`: Two letter carrier abbreviation
    - `9E`: Endeavor Air Inc.
    - `AA`: American Airlines Inc.
    - `AS`: Alaska Airlines Inc.
    - `B6`: JetBlue Airways
    - `DL`: Delta Air Lines Inc.
    - `EV`: ExpressJet Airlines Inc.
    - `F9`: Frontier Airlines Inc.
    - `FL`: AirTran Airways Corporation
    - `HA`: Hawaiian Airlines Inc.
    - `MQ`: Envoy Air
    - `OO`: SkyWest Airlines Inc.
    - `UA`: United Air Lines Inc.
    - `US`: US Airways Inc.
    - `VX`: Virgin America
    - `WN`: Southwest Airlines Co.
    - `YV`: Mesa Airlines Inc.
- `tailnum`: Plane tail number
- `flight`: Flight number
- `origin`, `dest`: Airport codes for origin and destination. (Google can help you with what code stands for which airport.)
- `air_time`: Amount of time spent in the air, in minutes
- `distance`: Distance flown, in miles
- `hour`, `minute`: Time of departure broken into hour and minutes
A very useful function for taking a quick peek at your data frame, and viewing its dimensions and data types, is `str`, which stands for structure.
str(nycflights)
The `nycflights` data frame is a massive trove of information. Let's think about some questions we might want to answer with these data:
- We might want to find out how delayed flights headed to a particular destination tend to be.
- We might want to evaluate how departure delays vary over months.
- Or we might want to determine which of the three major NYC airports has a better on time percentage for departing flights.
The `dplyr` package offers seven verbs (functions) for basic data manipulation:

- `filter()`
- `arrange()`
- `select()`
- `distinct()`
- `mutate()`
- `summarise()`
- `sample_n()`
We will use some of these functions in this lab, and learn about others in a future lab.
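For instance, `sample_n()` draws a few random rows, which is a convenient way to peek at the data (a quick sketch, not required for the lab):

nycflights %>%
  sample_n(5)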
We can examine the distribution of departure delays of all flights with a histogram.
ggplot(data = nycflights, aes(x = dep_delay)) +
geom_histogram()
This function says to plot the `dep_delay` variable from the `nycflights` data frame on the x-axis. It also defines a `geom` (short for geometric object), which describes the type of plot you will produce. Histograms are generally a very good way to see the shape of a single distribution, but that shape can change depending on how the data are split between the different bins. You can easily define the binwidth you want to use:
ggplot(data = nycflights, aes(x = dep_delay)) +
geom_histogram(binwidth = 15)
ggplot(data = nycflights, aes(x = dep_delay)) +
geom_histogram(binwidth = 150)
If we want to focus on departure delays of flights headed to RDU only, we need to first `filter` the data for flights headed to RDU (`dest == "RDU"`) and then make a histogram of the departure delays of only those flights.
rdu_flights <- nycflights %>%
filter(dest == "RDU")
ggplot(data = rdu_flights, aes(x = dep_delay)) +
geom_histogram()
Let's decipher these three lines of code:
- Line 1: Take the `nycflights` data frame, `filter` for flights headed to RDU, and save the result as a new data frame called `rdu_flights`. `==` means "if it's equal to". `RDU` is in quotation marks since it is a character string.
- Line 2: Basically the same `ggplot` call from earlier for making a histogram, except that it uses the data frame for flights headed to RDU instead of all flights.
The following logical operators are useful for filtering:

- `==` means "equal to"
- `!=` means "not equal to"
- `>` or `<` means "greater than" or "less than"
- `>=` or `<=` means "greater than or equal to" or "less than or equal to"
We can also obtain numerical summaries for these flights:
rdu_flights %>%
summarise(mean_dd = mean(dep_delay), sd_dd = sd(dep_delay), n = n())
Note that in the `summarise` function we created a list of three elements. The names of these elements are user defined, like `mean_dd`, `sd_dd`, and `n`, and you can customize these names as you like (just don't use spaces in your names). Calculating these summary statistics also requires that you know the function calls. Note that `n()` reports the sample size.
Some useful function calls for summary statistics for a single numerical variable are as follows:

- `mean`
- `median`
- `sd`
- `var`
- `IQR`
- `range`
- `min`
- `max`
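For example, a sketch using two of these with the RDU flights from above:

rdu_flights %>%
  summarise(median_dd = median(dep_delay), iqr_dd = IQR(dep_delay))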
We can also filter based on multiple criteria. Suppose we are interested in flights headed to San Francisco (SFO) in February:
sfo_feb_flights <- nycflights %>%
filter(dest == "SFO", month == 2)
Note that we can separate the conditions using commas if we want flights that are both headed to SFO and in February. If we are interested in either flights headed to SFO or in February, we can use the `|` operator instead of the comma.
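For example, a minimal sketch of the "or" version (the object name `sfo_or_feb` is just illustrative):

sfo_or_feb <- nycflights %>%
  filter(dest == "SFO" | month == 2)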
- Create a new data frame that includes flights headed to SFO in February, and save this data frame as `sfo_feb_flights`. How many flights meet these criteria?
- 68
- 1345
- 2286
- 3563
- 32735
# type your code for Question 1 here, and Knit
- Make a histogram and calculate appropriate summary statistics for arrival delays of `sfo_feb_flights`. Which of the following is false?
- The distribution is unimodal.
- The distribution is right skewed.
- No flight is delayed more than 2 hours.
- The distribution has several extreme values on the right side.
- More than 50% of flights arrive on time or earlier than scheduled.
# type your code for Question 2 here, and Knit
Another useful functionality is being able to quickly calculate summary statistics for various groups in your data frame. For example, we can modify the above command using the `group_by` function to get the same summary stats for each origin airport:
rdu_flights %>%
group_by(origin) %>%
summarise(mean_dd = mean(dep_delay), sd_dd = sd(dep_delay), n = n())
Here, we first grouped the data by `origin`, and then calculated the summary statistics.
- Calculate the median and interquartile range for `arr_delay`s of flights in the `sfo_feb_flights` data frame, grouped by carrier. Which carrier has the highest IQR of arrival delays?
- American Airlines
- JetBlue Airways
- Virgin America
- Delta and United Airlines
- Frontier Airlines
# type your code for Question 3 here, and Knit
Which month would you expect to have the highest average delay departing from an NYC airport?
Let's think about how we would answer this question:
- First, calculate monthly averages for departure delays. With the new language we are learning, we need to `group_by` months, then `summarise` mean departure delays.
- Then, we need to `arrange` these average delays in `desc`ending order.
nycflights %>%
group_by(month) %>%
summarise(mean_dd = mean(dep_delay)) %>%
arrange(desc(mean_dd))
- Which month has the highest average departure delay from an NYC airport?
- January
- March
- July
- October
- December
# type your code for Question 4 here, and Knit
- Which month has the highest median departure delay from an NYC airport?
- January
- March
- July
- October
- December
# type your code for Question 5 here, and Knit
- Is the mean or the median a more reliable measure for deciding which month(s) to avoid flying if you really dislike delayed flights, and why?
- Mean would be more reliable as it gives us the true average.
- Mean would be more reliable as the distribution of delays is symmetric.
- Median would be more reliable as the distribution of delays is skewed.
- Median would be more reliable as the distribution of delays is symmetric.
- Both give us useful information.
We can also visualize the distributions of departure delays across months using side-by-side box plots:
ggplot(nycflights, aes(x = factor(month), y = dep_delay)) +
geom_boxplot()
There is some new syntax here: We want departure delays on the y-axis and the months on the x-axis to produce side-by-side box plots. Side-by-side box plots require a categorical variable on the x-axis; however, in the data frame `month` is stored as a numerical variable (numbers 1 - 12). Therefore we can force R to treat this variable as categorical, what R calls a factor, with `factor(month)`.
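To see what `factor` does on its own, here is a tiny sketch you can run in the console:

factor(c(1, 2, 2, 3))
# [1] 1 2 2 3
# Levels: 1 2 3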
Suppose you will be flying out of NYC and want to know which of the three major NYC airports has the best on time departure rate of departing flights. Suppose also that for you a flight that is delayed for less than 5 minutes is basically "on time". You consider any flight delayed for 5 minutes or more to be "delayed".
In order to determine which airport has the best on time departure rate, we need to
- first classify each flight as "on time" or "delayed",
- then group flights by origin airport,
- then calculate on time departure rates for each origin airport,
- and finally arrange the airports in descending order for on time departure percentage.
Let's start with classifying each flight as "on time" or "delayed" by creating a new variable with the `mutate` function.
nycflights <- nycflights %>%
mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))
The first argument in the `mutate` function is the name of the new variable we want to create, in this case `dep_type`. Then if `dep_delay < 5`, we classify the flight as `"on time"`, and `"delayed"` if not, i.e. if the flight is delayed for 5 or more minutes.
Note that we are also overwriting the `nycflights` data frame with the new version of this data frame that includes the new `dep_type` variable.
We can handle all the remaining steps in one code chunk:
nycflights %>%
group_by(origin) %>%
summarise(ot_dep_rate = sum(dep_type == "on time") / n()) %>%
arrange(desc(ot_dep_rate))
The `summarise` step is telling R to count up how many records of the currently found group are on time, `sum(dep_type == "on time")`, and divide that result by the total number of elements in the currently found group, `n()`, to get a proportion, and then to store the answer in a new variable called `ot_dep_rate`.
- If you were selecting an airport simply based on on time departure percentage, which NYC airport would you choose to fly out of?
- EWR
- JFK
- LGA
# type your code for Question 7 here, and Knit
We can also visualize the distribution of on time departure rate across the three airports using a segmented bar plot.
ggplot(data = nycflights, aes(x = origin, fill = dep_type)) +
geom_bar()
- Mutate the data frame so that it includes a new variable that contains the average speed, `avg_speed`, traveled by the plane for each flight (in mph). What is the tail number of the plane with the fastest `avg_speed`? Hint: Average speed can be calculated as distance divided by number of hours of travel, and note that `air_time` is given in minutes. If you just want to show the `avg_speed` and `tailnum` and none of the other variables, use the `select` function at the end of your pipe to select just these two variables with `select(avg_speed, tailnum)`. You can Google this tail number to find out more about the aircraft.
- N666DN
- N755US
- N779JB
- N947UW
- N959UW
# type your code for Question 8 here, and Knit
- Make a scatterplot of `avg_speed` vs. `distance`. Which of the following is true about the relationship between average speed and distance?
- As distance increases the average speed of flights decreases.
- The relationship is linear.
- There is an overall positive association between distance and average speed.
- There are no outliers.
- The distribution of distances is uniform over 0 to 5000 miles.
# type your code for Question 9 here, and Knit
- Suppose you define a flight to be "on time" if it gets to the destination on time or earlier than expected, regardless of any departure delays. Mutate the data frame to create a new variable called `arr_type` with levels `"on time"` and `"delayed"` based on this definition. Then, determine the on time arrival percentage based on whether or not the flight departed on time. What proportion of flights that were `"delayed"` departing arrive `"on time"`? [NUMERIC INPUT]
# type your code for Question 10 here, and Knit
Basketball players who make several baskets in succession are described as having a hot hand. Fans and players have long believed in the hot hand phenomenon, which refutes the assumption that each shot is independent of the next. However, a 1985 paper by Gilovich, Vallone, and Tversky collected evidence that contradicted this belief and showed that successive shots are independent events. This paper started a great controversy that continues to this day, as you can see by Googling hot hand basketball.
We do not expect to resolve this controversy today. However, in this lab we'll apply one approach to answering questions like this. The goals for this lab are to (1) think about the effects of independent and dependent events, (2) learn how to simulate shooting streaks in R, and (3) to compare a simulation to actual data in order to determine if the hot hand phenomenon appears to be real.
In this lab we will explore the data using the `dplyr` package and visualize it using the `ggplot2` package for data visualization. The data can be found in the companion package for this course, `statsr`.
Let's load the packages.
library(statsr)
library(dplyr)
library(ggplot2)
Our investigation will focus on the performance of one player: Kobe Bryant of the Los Angeles Lakers. His performance against the Orlando Magic in the 2009 NBA finals earned him the title Most Valuable Player and many spectators commented on how he appeared to show a hot hand. Let's load some necessary files that we will need for this lab.
data(kobe_basket)
This data frame contains 133 observations and 6 variables, where every row records a shot taken by Kobe Bryant. The `shot` variable in this dataset indicates whether the shot was a hit (`H`) or a miss (`M`).
Just looking at the string of hits and misses, it can be difficult to gauge whether or not it seems like Kobe was shooting with a hot hand. One way we can approach this is by considering the belief that hot hand shooters tend to go on shooting streaks. For this lab, we define the length of a shooting streak to be the number of consecutive baskets made until a miss occurs.
For example, in Game 1 Kobe had the following sequence of hits and misses from his nine shot attempts in the first quarter:
\[ \textrm{H M | M | H H M | M | M | M} \]
You can verify this by viewing the first 9 rows of the data in the data viewer.
Within the nine shot attempts, there are six streaks, which are separated by a "|" above. Their lengths are one, zero, two, zero, zero, zero (in order of occurrence).
- Fill in the blank: A streak length of 1 means one ___ followed by one miss.
- hit
- miss
- Fill in the blank: A streak length of 0 means one ___ which must occur after a miss that ended the preceding streak.
- hit
- miss
Counting streak lengths manually for all 133 shots would get tedious, so we'll use the custom function `calc_streak` to calculate them, and store the results in a data frame called `kobe_streak` as the `length` variable.
kobe_streak <- calc_streak(kobe_basket$shot)
We can then take a look at the distribution of these streak lengths.
ggplot(data = kobe_streak, aes(x = length)) +
geom_histogram(binwidth = 1)
- Which of the following is false about the distribution of Kobe's streak lengths from the 2009 NBA finals?
- The distribution of Kobe's streaks is unimodal and right skewed.
- The typical length of a streak is 0 since the median of the distribution is at 0.
- The IQR of the distribution is 1.
- The longest streak of baskets is of length 4.
- The shortest streak is of length 1.
We've shown that Kobe had some long shooting streaks, but are they long enough to support the belief that he had hot hands? What can we compare them to?
To answer these questions, let's return to the idea of independence. Two processes are independent if the outcome of one process doesn't affect the outcome of the second. If each shot that a player takes is an independent process, having made or missed your first shot will not affect the probability that you will make or miss your second shot.
A shooter with a hot hand will have shots that are not independent of one another. Specifically, if the shooter makes his first shot, the hot hand model says he will have a higher probability of making his second shot.
Let's suppose for a moment that the hot hand model is valid for Kobe. During his career, the percentage of time Kobe makes a basket (i.e. his shooting percentage) is about 45%, or in probability notation,
\[ P(\textrm{shot 1 = H}) = 0.45 \]
If he makes the first shot and has a hot hand (not independent shots), then the probability that he makes his second shot would go up to, let's say, 60%,
\[ P(\textrm{shot 2 = H} \,|\, \textrm{shot 1 = H}) = 0.60 \]
As a result of these increased probabilities, you'd expect Kobe to have longer streaks. Compare this to the skeptical perspective where Kobe does not have a hot hand, where each shot is independent of the next. If he hit his first shot, the probability that he makes the second is still 0.45.
\[ P(\textrm{shot 2 = H} \,|\, \textrm{shot 1 = H}) = 0.45 \]
In other words, making the first shot did nothing to affect the probability that he'd make his second shot. If Kobe's shots are independent, then he'd have the same probability of hitting every shot regardless of his past shots: 45%.
Now that we've phrased the situation in terms of independent shots, let's return to the question: how do we tell if Kobe's shooting streaks are long enough to indicate that he has hot hands? We can compare his streak lengths to someone without hot hands: an independent shooter.
While we don't have any data from a shooter we know to have independent shots, that sort of data is very easy to simulate in R. In a simulation, you set the ground rules of a random process and then the computer uses random numbers to generate an outcome that adheres to those rules. As a simple example, you can simulate flipping a fair coin with the following.
coin_outcomes <- c("heads", "tails")
sample(coin_outcomes, size = 1, replace = TRUE)
The vector `coin_outcomes` can be thought of as a hat with two slips of paper in it: one slip says `heads` and the other says `tails`. The function `sample` draws one slip from the hat and tells us if it was a head or a tail.
Run the second command listed above several times. Just like when flipping a coin, sometimes you'll get a heads, sometimes you'll get a tails, but in the long run, you'd expect to get roughly equal numbers of each.
If you wanted to simulate flipping a fair coin 100 times, you could either run the function 100 times or, more simply, adjust the `size` argument, which governs how many samples to draw (the `replace = TRUE` argument indicates we put the slip of paper back in the hat before drawing again). Save the resulting vector of heads and tails in a new object called `sim_fair_coin`.
sim_fair_coin <- sample(coin_outcomes, size = 100, replace = TRUE)
To view the results of this simulation, type the name of the object and then use `table` to count up the number of heads and tails.
sim_fair_coin
table(sim_fair_coin)
Since there are only two elements in `coin_outcomes`, the probability that we "flip" a coin and it lands heads is 0.5. Say we're trying to simulate an unfair coin that we know only lands heads 20% of the time. We can adjust for this by adding an argument called `prob`, which provides a vector of two probability weights.
sim_unfair_coin <- sample(coin_outcomes, size = 100, replace = TRUE,
prob = c(0.2, 0.8))
`prob = c(0.2, 0.8)` indicates that for the two elements in the `coin_outcomes` vector, we want to select the first one, `heads`, with probability 0.2, and the second one, `tails`, with probability 0.8. Another way of thinking about this is to think of the outcome space as a bag of 10 chips, where 2 chips are labeled "head" and 8 chips "tail". Therefore at each draw, the probability of drawing a chip that says "head" is 20%, and "tail" is 80%.
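You can check that the simulation behaves as expected by tabulating the outcomes; with 100 draws you should see roughly 20 heads, give or take:

table(sim_unfair_coin)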
In a sense, we've shrunken the size of the slip of paper that says "heads", making it less likely to be drawn, and we've increased the size of the slip of paper saying "tails", making it more likely to be drawn. When we simulated the fair coin, both slips of paper were the same size. This happens by default if you don't provide a `prob` argument; all elements in the `coin_outcomes` vector have an equal probability of being drawn.
If you want to learn more about `sample` or any other function, recall that you can always check out its help file with `?sample`.
Simulating a basketball player who has independent shots uses the same mechanism that we use to simulate a coin flip. To simulate a single shot from an independent shooter with a shooting percentage of 50% we type,
shot_outcomes <- c("H", "M")
sim_basket <- sample(shot_outcomes, size = 1, replace = TRUE)
To make a valid comparison between Kobe and our simulated independent shooter, we need to align both their shooting percentage and the number of attempted shots.
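Here is a sketch of that adjustment, reusing the `size` and `prob` arguments from the coin examples (a 45% hit probability across 133 attempts):

sim_basket <- sample(shot_outcomes, size = 133, replace = TRUE,
                     prob = c(0.45, 0.55))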
Note that we've named the new vector `sim_basket`, the same name that we gave to
the previous vector reflecting a shooting percentage of 50%. In this situation,
R overwrites the old object with the new one, so always make sure that you don't
need the information in an old vector before reassigning its name.
With the results of the simulation saved as `sim_basket`, we have the data
necessary to compare Kobe to our independent shooter.
Both data sets represent the results of 133 shot attempts, each with the same
shooting percentage of 45%. We know that our simulated data is from a shooter
that has independent shots. That is, we know the simulated shooter does not have
a hot hand.
### Comparing Kobe Bryant to the Independent Shooter
<div id="exercise">
**Exercise**: Using `calc_streak`, compute the streak lengths of `sim_basket`, and save the results in a data frame called `sim_streak`. Note that since the `sim_streak` object is just a vector and not a variable in a data frame, we don't need to first select it from a data frame like we did earlier when we calculated the streak lengths for Kobe's shots.
</div>
# type your code for the Exercise here, and Knit
- If you were to run the simulation of the independent shooter a second time, how would you expect its streak distribution to compare to the distribution from the exercise above?
- Exactly the same
- Somewhat similar
- Totally different
- How does Kobe Bryant's distribution of streak lengths compare to the distribution of streak lengths for the simulated shooter? Using this comparison, do you have evidence that the hot hand model fits Kobe's shooting patterns?
- The distributions look very similar. Therefore, there doesn't appear to be evidence for Kobe Bryant's hot hand.
- The distributions look very similar. Therefore, there appears to be evidence for Kobe Bryant's hot hand.
- The distributions look very different. Therefore, there doesn't appear to be evidence for Kobe Bryant's hot hand.
- The distributions look very different. Therefore, there appears to be evidence for Kobe Bryant's hot hand.
<div id="exercise">
**Exercise**: What concepts from the course videos are covered in this lab? What
concepts, if any, are not covered in the videos? Have you seen these concepts
elsewhere, e.g. textbook, previous labs, or practice problems?
</div>
<div id="license">
This is a derivative of an [OpenIntro](https://www.openintro.org/stat/labs.php) lab, and is released under a [Attribution-NonCommercial-ShareAlike 3.0 United States](https://creativecommons.org/licenses/by-nc-sa/3.0/us/) license.
</div>