In the previous file, we analyzed movie rating data to look for evidence of bias. We did this by looking at differences among ratings released by four movie rating sites using histograms and box plots.

Visualizing the differences among the sites' ratings suggested that some sites, especially Fandango, tend to give mostly good ratings and very few poor ones.

Discovering that Fandango tends to give higher ratings leads to another question: How do Fandango's movie ratings compare to those of the other sites? We now know that Fandango's ratings are higher overall, but do they generally agree with other sites in terms of relative film quality?

In this file, we'll work on answering that question as we visualize **relationships** between ratings released by Fandango and the other three sites.

Once again, we'll be working with the movie review [data](https://github.com/fivethirtyeight/data/tree/master/fandango) compiled by FiveThirtyEight.

In the previous file, we made a subset of the data available in a format conducive to creating histograms and box plots.

In this file, we'll be working with the raw movie data set and making some modifications to it ourselves.

Let's import and modify the data.

**Task**

* Import the "movie_reviews_2.csv" file into R and save it as a data frame.


* Create a new data frame, reviews_2, containing only the following columns:
    * FILM
    * RT_user_norm
    * Metacritic_user_nom (Note the misspelling "nom")
    * IMDB_norm
    * Fandango_Ratingvalue
    
    
* Give the columns new names to make them easier to work with:
    * FILM
    * Rotten_Tomatoes
    * Metacritic
    * IMDB
    * Fandango
    
    
**Answer**

`library(readr)
 library(dplyr)`

`reviews_2 <- read_csv("movie_reviews_2.csv")`

`reviews_2 <- reviews_2 %>%
  select(FILM, RT_user_norm, Metacritic_user_nom, IMDB_norm, Fandango_Ratingvalue) %>%
  rename(Rotten_Tomatoes = RT_user_norm, Metacritic = Metacritic_user_nom, IMDB = IMDB_norm, Fandango = Fandango_Ratingvalue)`
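
As a side note, `select()` also accepts `new_name = old_name` pairs, so the subsetting and renaming steps can be combined. The sketch below illustrates this on a small made-up data frame (`raw_reviews`, `reviews_small`, and all of the values are hypothetical stand-ins, not the real data):

```r
library(dplyr)

# Hypothetical two-row stand-in for the raw data; the values are made up.
raw_reviews <- data.frame(
  FILM = c("Film A (2015)", "Film B (2015)"),
  RT_user_norm = c(4.3, 2.1),
  Metacritic_user_nom = c(3.6, 2.4),
  IMDB_norm = c(3.9, 2.8),
  Fandango_Ratingvalue = c(4.5, 3.2),
  Extra_column = c(1, 2)
)

# select() keeps only the listed columns and renames them in the same step.
reviews_small <- raw_reviews %>%
  select(
    FILM,
    Rotten_Tomatoes = RT_user_norm,
    Metacritic = Metacritic_user_nom,
    IMDB = IMDB_norm,
    Fandango = Fandango_Ratingvalue
  )

names(reviews_small)
```

Either approach should give the same five columns; the two-step version in the answer makes the subsetting and the renaming easier to read separately.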

The data are structured with a column for each site's rating for each film:

![image.png](attachment:image.png)

We'll be comparing each film's Fandango rating to those of the other sites to see how well they agree. To accomplish this, we'll learn to work with a new type of data visualization called [scatter plots](https://en.wikipedia.org/wiki/Scatter_plot).

Scatter plots represent each observation as a point. The value of one variable determines the point's position on the x-axis, and the value of the other variable determines its position on the y-axis.

When we work with scatter plots, the variables on the x- and y-axes do not need to be assigned roles of "independent" and "dependent." In other words, we do not necessarily have to suspect the values of one of the variables depend on the values of the other — we are just looking for any sort of relationship between them.

The relationship appears **strong** if the points cluster closely around a trend and **weak** if they are spread out.

Scatter plots can also indicate the direction of a relationship. In a **negative** relationship, one variable decreases as the other increases. In a **positive** relationship, both variables increase together.

In addition to allowing for visualization of relationships between variables, scatter plots can also demonstrate a lack of relationship between variables. In this scatter plot representing values of `Variable_1` and `Variable_2`, the points are arranged in a shapeless cloud, indicating there is not likely a relationship between the two variables:

![image.png](attachment:image.png)

Whether a relationship looks strong or weak is subjective; it depends on the person interpreting the plot.
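
To get a feel for these terms, it can help to plot data where the relationship is known in advance. The sketch below uses randomly generated values (the `toy` data frame, its columns, and the noise levels are all assumptions chosen for illustration) and draws the three patterns side by side with `facet_wrap()`:

```r
library(ggplot2)

set.seed(42)  # make the random draws reproducible
n <- 100
x <- runif(n, min = 1, max = 5)

# Made-up variables: a linear trend with different amounts of random noise.
toy <- data.frame(
  x = rep(x, times = 3),
  y = c(
    x + rnorm(n, sd = 0.2),      # little noise: strong positive
    x + rnorm(n, sd = 1.5),      # a lot of noise: weak positive
    6 - x + rnorm(n, sd = 0.2)   # decreasing trend: strong negative
  ),
  pattern = rep(c("strong positive", "weak positive", "strong negative"), each = n)
)

# One panel per pattern, so the three point clouds can be compared directly.
toy_plot <- ggplot(data = toy, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~ pattern)

toy_plot
```

Changing the `sd` values shows how quickly a clear trend turns into a loose cloud as the noise grows.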

To create a scatter plot using `ggplot2`, we'll:

* Specify the two variables whose relationship we want to investigate in the `aes()` mapping.
* Add a `geom_point()` layer to specify use of points to represent each pair of variable values.

**Task**

* Create a scatter plot to investigate the relationship between movie rating scores for Fandango (on the x-axis) and Rotten Tomatoes (on the y-axis).

**Answer**

`library(ggplot2)`

`ggplot(data = reviews_2,
  aes(x = Fandango, y = Rotten_Tomatoes)) +
  geom_point()`

![image.png](attachment:image.png)

First, let's focus on the axes. They reflect the spread of values for each site that we visualized using histograms and box plots. Fandango, on the x-axis, has values ranging from 3 to 4.5 while Rotten Tomatoes, on the y-axis, has values ranging from 1 to 4.

Next, look at the pattern of the points. The points are spread out, but generally, they trend upwards. Recall that this indicates a weak, positive relationship between movie ratings for Fandango and Rotten Tomatoes.

From this plot, we can see that while Fandango tends to give higher scores, the two sites often agree on relative movie quality.

There are a few layers we can add to our plot to make it easier to interpret.

### Adjusting Axis Ranges

To make the difference in the range of rating values for Fandango and Rotten Tomatoes easier to visualize, we can specify the same range of values for the x- and y-axes.

`plot + xlim(1,5) + ylim(1,5)`

The code above specifies ranges, or limits, of 1 through 5 for the x (`xlim()`) and y (`ylim()`) axes.
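
One detail worth knowing: `xlim()` and `ylim()` set scale limits, so any points outside the range are removed (with a warning) when the plot is drawn. `coord_cartesian()` zooms the view without removing data. The sketch below contrasts the two on a made-up data frame (`toy`, `a`, and `b` are hypothetical names); for ratings that already fall between 1 and 5, both approaches should look the same.

```r
library(ggplot2)

# Made-up values; the first point lies below the lower x limit of 1.
toy <- data.frame(a = c(0.5, 2, 3, 4.5), b = c(1.5, 2.5, 3, 4))

base_plot <- ggplot(data = toy, aes(x = a, y = b)) +
  geom_point()

# Scale limits: the point at a = 0.5 is dropped with a warning when drawn.
limited <- base_plot + xlim(1, 5) + ylim(1, 5)

# Coordinate limits: the view is zoomed, but no data are removed.
zoomed <- base_plot + coord_cartesian(xlim = c(1, 5), ylim = c(1, 5))
```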

### Making Points Transparent

We can make the points transparent by adding the argument `alpha =` to the `geom_point()` layer:

`geom_point(alpha = 0.3)`

An alpha value of 1 specifies complete [opacity](https://en.wikipedia.org/wiki/Opacity), or points that can't be seen through. An alpha value of 0 specifies complete [transparency](https://en.wikipedia.org/wiki/Transparency), or points that are completely see-through. Intermediate values specify varying degrees of transparency. Usually, alpha values of 0.2 to 0.5 will help make overlapping points easier to see.

**Task**

* Add new layers and modify existing layers of your scatter plot so it meets the following specifications:
    * The x- and y-axes range from 1 to 5.
    * The points have a transparency of `alpha = 0.3`.
    * The graph background is white.

**Answer**

`ggplot(data = reviews_2,
  aes(x = Fandango, y = Rotten_Tomatoes)) +
  geom_point(alpha = 0.3) +
  xlim(1,5) +
  ylim(1,5) +
  theme(panel.background = element_rect(fill = "white"))`

![image.png](attachment:image.png)

Let's return to our initial question: How do Fandango's ratings for each film compare to those of the other sites?

To answer it, let's create scatter plots to depict the relationship of Fandango's ratings with those of IMDB and Metacritic.

**Task**

* Create two new scatter plots:
    * A comparison of ratings from Fandango and IMDB
    * A comparison of ratings from Fandango and Metacritic
 
**Answer**

`ggplot(data = reviews_2,
  aes(x = Fandango, y = IMDB)) +
  geom_point(alpha = 0.3) +
  xlim(1,5) +
  ylim(1,5) +
  theme(panel.background = element_rect(fill = "white"))`

`ggplot(data = reviews_2,
  aes(x = Fandango, y = Metacritic)) +
  geom_point(alpha = 0.3) +
  xlim(1,5) +
  ylim(1,5) +
  theme(panel.background = element_rect(fill = "white"))`

The three plots above use nearly identical code, so let's write a function that creates a scatter plot.

When writing a function to create plots, there is one change we should make. Instead of using `aes()` to map variables to axes, we'll use `aes_string()`. Using `aes_string()` lets us pass variable names as character strings, so the function can receive them as arguments.

Once we've written the function, we can use a functional from the `purrr` package to apply the `create_scatter()` function to each pair of site ratings we want to compare.

Since `create_scatter()` is a **two-variable** function, we need to use the functional `map2()`. Remember that `map2()` takes two vectors and a function as arguments and returns a list. In this case, the input function is `create_scatter()`. What variables do we specify, though?

Let's remind ourselves of the output we want to obtain using the `create_scatter` function. We want a list of three scatter plots of relationships between site ratings:

* Fandango and Rotten Tomatoes
* Fandango and IMDB
* Fandango and Metacritic

Either the x or y variable will always be Fandango. Let's assign the Fandango variable to the x-axis variable:

`x_var <- "Fandango"`

The other variable, on the y-axis, will consist of the three sites we want to compare Fandango against:

`y_var <- c("Rotten_Tomatoes", "IMDB", "Metacritic")`
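
Before plotting, it may help to see how `map2()` pairs these inputs. In the sketch below, `describe_pair()` is a hypothetical stand-in for the plotting function that simply labels the pair it receives. Because the first input has length 1, `map2()` recycles it to match the three values in the second input.

```r
library(purrr)

# Stand-in two-argument function: returns a label instead of a plot.
describe_pair <- function(x, y) {
  paste(x, "vs", y)
}

x_var <- "Fandango"
y_var <- c("Rotten_Tomatoes", "IMDB", "Metacritic")

# The single x value is reused for each of the three y values.
pairs <- map2(x_var, y_var, describe_pair)

str(pairs)
```

The result is a list with one element per y variable, which is the same shape we want for the list of plots.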

In instances where we don't want to type all the variable names into vectors, we can use the `names()` function to get the data frame's column names and index the result to select specific columns:

`x_var <- names(reviews_2)[5]
y_var <- names(reviews_2)[2:4]`
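
Positional indexing depends on the column order staying the same. As a hedged alternative, the sketch below selects the y variables by name with `setdiff()`; the `toy` data frame is a made-up one-row stand-in that assumes the same column order as `reviews_2`.

```r
# Hypothetical one-row data frame with the same columns as reviews_2.
toy <- data.frame(
  FILM = "Film A",
  Rotten_Tomatoes = 4.3,
  Metacritic = 3.6,
  IMDB = 3.9,
  Fandango = 4.5
)

# Positional indexing, as in the text: relies on the column order.
x_by_position <- names(toy)[5]
y_by_position <- names(toy)[2:4]

# Name-based alternative: every column except the title and the x variable.
x_by_name <- "Fandango"
y_by_name <- setdiff(names(toy), c("FILM", x_by_name))
```

Both versions select the same columns here, but the name-based one keeps working if columns are later reordered.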


**Task**

* Write a function that creates a scatter plot from the input variables x and y.

**Answer**

`library(purrr)`

`create_scatter <- function(x, y) {
  ggplot(data = reviews_2) + 
    aes_string(x = x, y = y) +
    geom_point(alpha = 0.3) +
    xlim(1,5) +
    ylim(1,5) +
    theme(panel.background = element_rect(fill = "white"))
}`

`x_var <- names(reviews_2)[5]
y_var <- names(reviews_2)[2:4]`

`map2(x_var, y_var, create_scatter)`
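
Recent `ggplot2` releases deprecate `aes_string()`, so the code above may produce a deprecation warning. One commonly used replacement is the `.data` pronoun inside `aes()`. The sketch below is a variant under that assumption, not the lesson's official answer: `toy_reviews` and `create_scatter_data()` are hypothetical names, and the ratings are made up.

```r
library(ggplot2)
library(purrr)

# Hypothetical stand-in for reviews_2; the values are made up.
toy_reviews <- data.frame(
  Rotten_Tomatoes = c(4.3, 2.1, 3.4),
  Metacritic = c(3.6, 2.4, 3.1),
  IMDB = c(3.9, 2.8, 3.5),
  Fandango = c(4.5, 3.2, 4.0)
)

create_scatter_data <- function(x, y) {
  ggplot(data = toy_reviews) +
    # .data[[name]] looks up a column from its name stored as a string.
    aes(x = .data[[x]], y = .data[[y]]) +
    geom_point(alpha = 0.3) +
    xlim(1, 5) +
    ylim(1, 5) +
    # Set the axis titles explicitly so they show the column names.
    labs(x = x, y = y) +
    theme(panel.background = element_rect(fill = "white"))
}

y_var <- c("Rotten_Tomatoes", "IMDB", "Metacritic")

# Name each list element after the site that Fandango is compared against.
plots <- set_names(map2("Fandango", y_var, create_scatter_data), y_var)

names(plots)
```

Naming the list elements makes it possible to pull out a single plot later, for example with `plots$IMDB`.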

This approach made our code more efficient for just three plots. Imagine how much time it would save if we needed to create dozens or hundreds of scatter plots!

As we work on personal data science projects, we can also apply these techniques to the creation of other types of data visualizations using `ggplot2`.

Here are the three plots we've created to explore the relationships between Fandango's movie ratings and the ratings of the three other sites:

![image.png](attachment:image.png)

Let's return to our question: How do Fandango's film ratings compare to those of the other sites?

We can make some observations:

* The plots show that ratings for all three sites are weakly positively related to Fandango's ratings. This means that, in general, Fandango and the other sites often agreed on which movies were good and which were not, despite Fandango assigning higher ratings overall.
* The Rotten Tomatoes and IMDB ratings seem to be more strongly related to Fandango's than the Metacritic ratings.