# The Bechdel Test

A movie passes the Bechdel test if it satisfies three rules:

1. it has at least two women;
2. the women talk to each other; and
3. they talk to each other about something or someone other than a man.

The Bechdel test originated in this comic by Alison Bechdel (image source http://dykestowatchoutfor.com/the-rule): 

![](https://mhc-stat140-2017.github.io/labs/20170913_bechdel_movies/The-Rule-cleaned-up.jpg)

## Data Source

The data we're going to work with today have been gathered from a variety of sources by several people.  The Bechdel test ratings themselves are from www.bechdeltest.com, where the general public can rate movies according to whether they pass or fail the Bechdel test.  Some additional information about the movies comes from www.the-numbers.com.  These data were the basis of an article on the topic at http://fivethirtyeight.com/features/the-dollar-and-cents-case-against-hollywoods-exclusion-of-women/, and have since been added to the `fivethirtyeight` package for R.  I took those data and scraped some additional information about the movies from imdb.com, including the MPAA rating, run time, and ratings from IMDb users.  (IMDb stands for "Internet Movie Database"; it's a website that compiles lots of information about movies and lets users rate them.)

Note that this is not a random sample of movies: which movies made it into the data set was basically determined by which movies were rated by users of www.bechdeltest.com.  That means any findings from your analysis of these data are tentative, since there's no guarantee that this sample of movies is representative of the population of all movies.

## Load Necessary Packages

You will need three packages for this lab: `readr`, `dplyr`, and `mosaic`.  Add three calls to the `library` function in the following cell to load these packages and make the functions they provide available to use.
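As a minimal sketch, that cell might look like the following. It assumes all three packages are already installed; if one is missing, `install.packages` would need to be run for it first.

```r
# Load the three packages used in this lab.
library(readr)   # provides read_csv
library(dplyr)   # provides distinct, mutate, filter, arrange
library(mosaic)  # provides the formula-based tally
```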

## Loading the Data Into R

The data for this lab are on our course website at http://www.evanlray.com/stat140_s2018/lecture/20180130_wrangling/bechdel.csv.  If you want, you can click on that link to download the file to your computer and open it up in Excel -- but that's not necessary.

In the next cell, use the `read_csv` function to read that data file in and store it in a new data frame called `movies`.
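Here is a sketch of how `read_csv` is called. So that it runs without a network connection, it writes a tiny made-up CSV to a temporary file and reads that back; the names `toy_path` and `toy_movies` and the two rows of data are hypothetical. Since `read_csv` also accepts a URL as its first argument, the commented-out line shows roughly what the call for this lab could look like.

```r
library(readr)

# Hypothetical stand-in for the real file: a two-row CSV in a temporary file.
toy_path <- tempfile(fileext = ".csv")
writeLines(c("title,year", "Movie A,1990", "Movie B,2005"), toy_path)

# read_csv takes a file path (or URL) and returns a data frame.
toy_movies <- read_csv(toy_path)
toy_movies

# For the lab, the same kind of call with the course URL:
# movies <- read_csv("http://www.evanlray.com/stat140_s2018/lecture/20180130_wrangling/bechdel.csv")
```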

## A First Look at the Data

Whenever I am thinking about a new data set, the first thing I do is try to get a sense of what the observational units and variables in the data set are, and whether each variable is quantitative or categorical (and if categorical, whether it is nominal or ordinal).

You can start to do this by using some combination of:

 * `head` to look at the first few rows of the data
 * `str` to get some more detailed information
 * `dim`, `nrow`, and `ncol` to see how many observational units and variables are in the data set.

In the next cell, add some appropriate calls to those functions to start to look at what's in this data set.  Think through the questions listed above (how many observational units and variables, and what are the types of each variable?).  Can you tell what each variable is measuring by its name and the first few values?  (I'll answer this question in our next lab, which will also use this data set.)
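To illustrate what each of these functions reports, here is a sketch on a small made-up data frame. `toy_movies` and its values are hypothetical stand-ins; in the lab you would call the same functions on `movies`.

```r
# Hypothetical toy data frame standing in for `movies`.
toy_movies <- data.frame(
  title = c("Movie A", "Movie B", "Movie C"),
  year = c(1990, 2005, 2012),
  bechdel_test_binary = c("FAIL", "PASS", "PASS")
)

head(toy_movies)  # first few rows (up to 6 by default)
str(toy_movies)   # type and first few values of each variable
dim(toy_movies)   # number of rows, then number of columns
nrow(toy_movies)  # number of observational units
ncol(toy_movies)  # number of variables
```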

## Convert Categorical Variables to `factor`s

You will have noticed three categorical variables in your initial exploration above: `bechdel_test`, `bechdel_test_binary`, and `mpaa_rating`.  We next need to convert these categorical variables to `factor`s so that R deals with them appropriately.

In the cell below, we use the `distinct` function to identify the distinct values of the `bechdel_test` and `mpaa_rating` variables.  Since these are both ordinal variables, you need to know these distinct values in order to specify the order of the levels when converting the variables to factors.  The `bechdel_test_binary` variable is also ordinal.  Add a third call to the `distinct` function to the cell below to find out what all of the distinct values of that variable are.

```r
movies %>% distinct(bechdel_test)
movies %>% distinct(mpaa_rating)
```

You should see that the distinct values of `bechdel_test` are "nowomen", "notalk", "men", "dubious", "ok":

 * "nowomen" means there are not at least two women in the movie;
 * "notalk" means there are at least two women in the movie, but they don't talk to each other;
 * "men" means there are at least two women in the movie, but they only talk to each other about men;
 * "dubious" means there was some disagreement among users of bechdeltest.com about whether or not the movie passed the test;
 * "ok" means that the movie passes the test.

The levels of the `bechdel_test_binary` variable are "FAIL", "PASS":

 * "PASS" means that the movie passed the test (i.e., its value for `bechdel_test` is "ok");
 * "FAIL" means it did not pass the test (i.e., its value for `bechdel_test` is something other than "ok").

The `mpaa_rating` variable has nine levels: "UNRATED", "NOT RATED", "G", "PG", "TV-PG", "PG-13", "TV-14", "R", "NC-17".

With this information in hand, we are ready to tell R that these are ordinal categorical variables by converting them to ordered `factor`s.  Use the `mutate` and `factor` functions to do that in the next cell:
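As a sketch of the pattern, here is the conversion for `bechdel_test` alone, applied to a made-up data frame (`toy_movies` and its rows are hypothetical). The order of the levels follows the list of `bechdel_test` values above, and `ordered = TRUE` is what makes the factor ordinal. The other two variables would follow the same pattern with their own levels.

```r
library(dplyr)

# Hypothetical toy data frame standing in for `movies`.
toy_movies <- tibble(
  title = c("Movie A", "Movie B", "Movie C"),
  bechdel_test = c("ok", "nowomen", "men")
)

# Replace the character column with an ordered factor.
toy_movies <- toy_movies %>%
  mutate(
    bechdel_test = factor(
      bechdel_test,
      levels = c("nowomen", "notalk", "men", "dubious", "ok"),
      ordered = TRUE
    )
  )

levels(toy_movies$bechdel_test)

# With an ordered factor, comparisons respect the level order:
# this asks whether "nowomen" comes before "ok".
toy_movies$bechdel_test[2] < toy_movies$bechdel_test[1]
```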

## Using `tally` to get counts of one variable

One thing we might like to know is how many of the movies in this data set have each possible level of the `mpaa_rating` variable.  Use the `tally` function to calculate this in the cell below:
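A sketch of the formula interface to `tally` from the `mosaic` package, run on a made-up data frame (`toy_movies` and its rows are hypothetical). A one-sided formula `~ variable` asks for counts at each level of that variable.

```r
library(mosaic)

# Hypothetical toy data frame standing in for `movies`.
toy_movies <- data.frame(
  title = c("Movie A", "Movie B", "Movie C", "Movie D"),
  mpaa_rating = c("R", "PG", "R", "G")
)

# Count how many rows fall in each level of mpaa_rating.
rating_counts <- tally(~ mpaa_rating, data = toy_movies)
rating_counts
```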

## Using `filter` to select a subset of the movies

It looks like the vast majority of the movies (and apparently a few TV shows) in this data set have ratings of "G", "PG", "PG-13", and "R".  Since we have so few movies in the other ratings categories, it might make sense to restrict our analysis to just those four categories.  In the next cell, use the `filter` function to create a new data frame with only movies that are in those four ratings categories.
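One way to express "is one of these four values" is the `%in%` operator inside `filter`. The sketch below uses a made-up data frame; `toy_movies`, `toy_movies_main`, and the rows are hypothetical.

```r
library(dplyr)

# Hypothetical toy data frame standing in for `movies`.
toy_movies <- tibble(
  title = c("Movie A", "Movie B", "Movie C", "Movie D"),
  mpaa_rating = c("R", "TV-14", "PG", "NOT RATED")
)

# Keep only rows whose rating is one of the four main categories.
toy_movies_main <- toy_movies %>%
  filter(mpaa_rating %in% c("G", "PG", "PG-13", "R"))

toy_movies_main
```

If `mpaa_rating` has already been converted to a factor, the levels that no longer occur stay attached to the variable after filtering; base R's `droplevels` can remove them if they get in the way.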

## Using `tally` to get counts for a combination of two variables

Is there a relationship between a movie's MPAA rating and how it does on the Bechdel test?  In the next cell, use the `tally` function to get counts of how many movies in the filtered data set are in each combination of levels of the `mpaa_rating` variable and the `bechdel_test_binary` variable.
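With the `mosaic` formula interface, a two-way table of counts can be requested by putting both variables on the right-hand side of the formula. This is a sketch on made-up data; `toy_movies` and its rows are hypothetical.

```r
library(mosaic)

# Hypothetical toy data frame standing in for the filtered movies.
toy_movies <- data.frame(
  mpaa_rating = c("G", "G", "R", "R", "R"),
  bechdel_test_binary = c("PASS", "FAIL", "FAIL", "FAIL", "PASS")
)

# Rows are levels of mpaa_rating, columns are levels of bechdel_test_binary.
rating_by_test <- tally(~ mpaa_rating + bechdel_test_binary, data = toy_movies)
rating_by_test
```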

If you have time at the end of class, you might like to come back to this section and calculate the joint distribution of `mpaa_rating` and `bechdel_test_binary`, the marginal distribution of `bechdel_test_binary`, and the conditional distribution of `bechdel_test_binary` given that a movie has a G rating, and given that it has an R rating.  Is a movie's result on the Bechdel test independent of its MPAA rating?
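One possible approach uses the `format = "proportion"` argument of `mosaic`'s `tally`, with `|` in the formula for conditioning. The sketch below runs on made-up data (`toy_movies` and its rows are hypothetical), so the proportions it prints say nothing about the real movies.

```r
library(mosaic)

# Hypothetical toy data frame standing in for the filtered movies.
toy_movies <- data.frame(
  mpaa_rating = c("G", "G", "R", "R", "R"),
  bechdel_test_binary = c("PASS", "FAIL", "FAIL", "FAIL", "PASS")
)

# Joint distribution: proportions over all combinations (sum to 1 overall).
joint <- tally(~ mpaa_rating + bechdel_test_binary,
               data = toy_movies, format = "proportion")

# Marginal distribution of bechdel_test_binary (sums to 1).
marginal <- tally(~ bechdel_test_binary,
                  data = toy_movies, format = "proportion")

# Conditional distribution of bechdel_test_binary given mpaa_rating:
# the proportions within each rating sum to 1.
conditional <- tally(~ bechdel_test_binary | mpaa_rating,
                     data = toy_movies, format = "proportion")

joint
marginal
conditional
```

If the two variables were independent, the conditional distribution of `bechdel_test_binary` would be the same for every rating and would match the marginal distribution.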

## Using `arrange`

What movie in this data set came out earliest?  In the cell below, use the `arrange` function to sort the movies in ascending order according to the `year` variable.  Then use the `head` function to look at the first few rows of the sorted data frame.  Movies that came out earlier will be at the top of the data frame.
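A sketch of sorting in ascending order, on a made-up data frame (`toy_movies`, `toy_by_year`, and the rows are hypothetical). By default `arrange` sorts from smallest to largest.

```r
library(dplyr)

# Hypothetical toy data frame standing in for `movies`.
toy_movies <- tibble(
  title = c("Movie A", "Movie B", "Movie C"),
  year = c(2005, 1990, 2012)
)

# Sort by year, earliest first, then look at the top rows.
toy_by_year <- toy_movies %>% arrange(year)
head(toy_by_year)
```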

What movie in this data set had the highest international gross earnings, in inflation-adjusted 2013 dollars?  In the cell below, use the `arrange` function to sort the movies in **descending** order according to the `intgross_2013` variable.  Then use the `head` function to look at the first few rows of the sorted data frame.
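For descending order, the sorting variable can be wrapped in `desc` inside `arrange`. Again this is a sketch on made-up data; `toy_movies`, `toy_by_gross`, and the dollar amounts are hypothetical.

```r
library(dplyr)

# Hypothetical toy data frame standing in for `movies`.
toy_movies <- tibble(
  title = c("Movie A", "Movie B", "Movie C"),
  intgross_2013 = c(150, 900, 420)
)

# Sort by intgross_2013, largest first, then look at the top rows.
toy_by_gross <- toy_movies %>% arrange(desc(intgross_2013))
head(toy_by_gross)
```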