In the previous file, we learned about some of the concepts that are fundamental to predictive models of all kinds, not just linear models. We imagined a situation where we recorded trip distance and trip cost data for 50 Uber rides. We learned that this data can be used to fit a model that predicts the total cost of a future Uber trip based on trip distance.

![image.png](attachment:image.png)

We did not learn the details of how to fit a linear model, but we built our intuition about what a linear fit line can say about our data. Here, we see that the fit line represents the trend that the cost of an Uber trip generally increases with distance. We are able to observe this relationship between `cost` and `distance` using the fit line, but we can also observe this relationship with the scatterplot alone.

![image.png](attachment:image.png)

In this file, we will learn what scatterplots can reveal about a bivariate relationship, that is, the relationship between two variables. We will continue to use examples from the `uber_trips` data as we learn new concepts and coding approaches. We will also work with publicly available [property sale data from New York City](https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page) to explore bivariate relationships between variables with numeric data.

We will explore the relationship between pairs of variables to determine whether linear regression is suitable for describing that relationship, and whether any variables are suitable to include in a linear regression model that predicts the sale price of a property. In addition to visualizing relationships between variables, we will learn how to quantify these relationships using correlation.

Bivariate linear regression can be performed between pairs of quantitative data measured on an `interval` or `ratio` scale. Problems with a quantitative response where there exists a dependency relationship are known as **regression** problems. 

As we learned, the `cost` and `distance` variables in the `uber_trips` data are examples of quantitative data suitable for linear regression.

On the other hand, problems involving a qualitative or categorical response are known as **classification** problems. 

Using our `uber_trips` data as an example, qualitative data refers to the `neighborhood` variable. The `neighborhood` variable is a categorical string variable that describes the name of a neighborhood and does not include quantitative information. Another example of qualitative data we might record is the `company`, if we use both Lyft and Uber.

Logistic regression is a type of modeling that is commonly used with classification problems. It is worth knowing that a form of regression with qualitative information is possible.

We are focused entirely on **bivariate linear regression**: a type of model that uses a single quantitative input to explain a single quantitative response. However, it is possible, and often preferred in practice, to use more than one input variable to predict a single quantitative response. This is known as **multiple linear regression**. It is sometimes loosely called multivariate linear regression, although strictly speaking that term describes models with more than one response variable.

Multiple linear regression is a popular machine-learning technique because combining several informative predictors often yields more accurate predictions than relying on a single predictor.

We will focus on building a strong understanding of **bivariate linear regression** first, because many of the concepts covered and summary metrics used can be applied with multiple linear regression.
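The difference between the two model types shows up directly in R's formula syntax. The sketch below is only an illustration: the `trips` data frame, its `duration` column, and every coefficient are simulated assumptions (this is not the `uber_trips` data), and it uses base R's `lm()` function, whose details we have not covered yet.

```r
# Simulated stand-in data: NOT the real uber_trips dataset
set.seed(1)
trips <- data.frame(
  distance = runif(50, min = 1, max = 20),  # hypothetical trip distance
  duration = runif(50, min = 5, max = 60)   # hypothetical trip duration
)
trips$cost <- 3 + 1.5 * trips$distance + 0.2 * trips$duration +
  rnorm(50, sd = 2)

# Bivariate linear regression: one quantitative input, one quantitative response
bivariate_fit <- lm(cost ~ distance, data = trips)

# Multiple linear regression: more than one input, still a single response
multiple_fit <- lm(cost ~ distance + duration, data = trips)

# The bivariate model estimates one slope; the multiple model estimates two
coef(bivariate_fit)
coef(multiple_fit)
```

Both models share the same response (`cost`). Only the right-hand side of the formula changes.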

Let's deepen our understanding of bivariate relationships by exploring the New York City property sales data for the borough of Brooklyn. This diagram shows the location of Brooklyn and the other boroughs of New York City:

![image.png](attachment:image.png)

The spreadsheet for Brooklyn contains over 22,000 property sales records. For learning purposes, we will start by using a subset of this data. The dataset we'll use first contains 50 randomly selected sales records for [condominiums](https://en.wikipedia.org/wiki/Condominium) in the [Williamsburg](https://en.wikipedia.org/wiki/Williamsburg,_Brooklyn)-North neighborhood.

**Task**

1. Load the csv file `williamsburg_north.csv`.
    * Once we have loaded the data, take a look at the type of information available.

2. Download the [Glossary of Terms for Property Sales Files](https://www1.nyc.gov/assets/finance/downloads/pdf/07pdf/glossary_rsf071607.pdf) and read through the description for each variable to gain an understanding of the data available to us.

3. Once we have the data loaded, examine it. Do we see any quantitative variables in the dataset? Do we think any of this information can be used to predict property sale price?

Note: We deleted a few variables that do not contain useful information for filtering or analysis. We reformatted the variable names and some of the observations to make the data easier to read. We also deleted duplicate entries from the dataset.

**Answer**

`library(readr)
williamsburg_north <- read_csv("williamsburg_north.csv")`
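One way to approach the third step of the task is to ask R which columns are stored as numbers. Because the CSV file is not reproduced here, the sketch below uses a tiny made-up data frame called `sales_example`. Its column names mirror variables discussed in this file, but the values are invented for illustration.

```r
# Made-up stand-in for the real data; all values are invented
sales_example <- data.frame(
  neighborhood      = c("Williamsburg-North", "Williamsburg-North", "Williamsburg-North"),
  gross_square_feet = c(850, 1200, 640),
  year_built        = c(2007, 1910, 2012),
  sale_price        = c(950000, 1400000, 720000),
  stringsAsFactors  = FALSE
)

# Compact overview of column names, types, and the first few values
str(sales_example)

# Names of the columns stored as numbers (candidate quantitative variables)
numeric_columns <- names(sales_example)[sapply(sales_example, is.numeric)]
numeric_columns
```

A numeric storage type is only a hint. An identifier such as a zip code is stored as a number but is not a quantitative measurement, so each candidate still needs to be checked against the glossary.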

Our overall goal will be to use bivariate linear regression to predict property sale price. A prediction for sale price is considered a quantitative response to a quantitative input, or predictor variable. 

In the previous exercise, did we identify any quantitative variables that might be useful to predict `sale_price` for a property using a linear regression model? There are a few quantitative variables worth considering:

* `gross_square_feet`: The total area of all the floors of a building as measured from the exterior surfaces of the outside walls of the building, including the land area and space within any building or structure on the property.
* `year_built`: Year the structure on the property was built.

If we were analyzing single-family home sales, then we would also want to consider the `land_square_feet` variable, which describes the total size of the land area for each property. But `land_square_feet` data is not relevant to condominium sales.

Even though the dataset does not contain many quantitative variables useful to predict `sale_price`, the information was useful for filtering the data to create this subset of similar properties. We used the following filter parameters to generate the `williamsburg_north` dataset:

* `year_built`: Missing values (presumably recorded as 0) were discarded.
* `sale_price`: Only values greater than \$10,000 were considered. A \$0 sale indicates that there was a transfer of ownership without a cash consideration.
* `building_class_category`: Selected only "13 Condos - Elevator Apartments".
* `building_class_at_time_of_sale`: Selected only "R4", which is a residential unit in a building that has an elevator.
* `neighborhood`: Selected "Williamsburg-North" because it had the highest number of "R4" unit sales records of all neighborhoods in Brooklyn (160 total).
* The `sample_n()` function was used to randomly select 50 unique sales records.


Before selecting variables from a dataset to build a linear model, it's useful to generate scatterplots to visualize the relationships between pairs of variables. When we visualize bivariate relationships, the independent variable is plotted on the x-axis, and the dependent variable is plotted on the y-axis.

With our `uber_trips` data, we plotted `distance` on the x-axis and `cost` on the y-axis, because `cost` depends on `distance`. Trip cost may also depend on other variables. But when exploring the relationship between `cost` and `distance`, we can assume that `cost` is explained by `distance`, not the other way around.

When we generate a scatterplot to explore bivariate relationships, we can evaluate the relationships between the variables by examining the following characteristics: 

* direction, 
* linearity,
* and strength. 

Let's examine each of these in more detail. We will begin with direction.

**Direction**: With a positive relationship, an increase in value along the x-axis is associated with an increase in value along the y-axis. With a negative relationship, an increase in one variable is associated with a decrease in the other. We observed with our `uber_trips` data that, in general, as `distance` increases, trip `cost` also increases. This is an example of a positive relationship.

![image.png](attachment:image.png)
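Direction can also be checked numerically. The sketch below uses simulated values (not `uber_trips` or `williamsburg_north`) and base R's `cor()` function, which returns the Pearson correlation coefficient. For this illustration, only the sign of the result matters: positive for a positive relationship, negative for a negative one.

```r
# Simulated data: invented values used only to illustrate direction
set.seed(1)
x <- runif(50, min = 0, max = 10)
y_positive <- 5 + 2 * x + rnorm(50, sd = 1)   # y tends to rise as x rises
y_negative <- 30 - 2 * x + rnorm(50, sd = 1)  # y tends to fall as x rises

# The sign of the correlation coefficient matches the direction
direction_positive <- sign(cor(x, y_positive))
direction_negative <- sign(cor(x, y_negative))
direction_positive
direction_negative
```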

In this exercise, let's return our attention to the `williamsburg_north` dataframe.

* Is there any relationship between when a condominium was built and how much it sells for?
* What about the size of the property? Do larger properties generally sell for a higher price than smaller properties?

Let's generate a couple of scatterplots to find out.

**Task**

Generate two plots with `ggplot2` to observe directional relationship. In each plot include `sale_price` on the y-axis because we are treating this as the dependent variable. In each case, a simple exploratory plot is fine.

1. Load the package required to generate these two plots.
2. Generate a scatter plot with `sale_price` on the y-axis and `year_built` on the x-axis.
    * Add the call `scale_y_continuous(labels = scales::comma)` to the plot so that y-axis labels are not displayed as scientific notation.
3. Generate a scatter plot with `sale_price` on the y-axis and `gross_square_feet` on the x-axis.
    * Add `scale_y_continuous(labels = scales::comma)` to this plot as well.


Take a look at each plot. Do we notice a positive or negative directional relationship between `year_built` and `sale_price`? How about for `gross_square_feet` and `sale_price`?

**Answer**


`library(ggplot2)`

`ggplot(data = williamsburg_north, 
       aes(x = year_built, y = sale_price)) +
       scale_y_continuous(labels = scales::comma) +
       geom_point()`

`ggplot(data = williamsburg_north, 
       aes(x = gross_square_feet, y = sale_price)) +
       scale_y_continuous(labels = scales::comma) +
       geom_point()`

Above we generated two scatterplots to see if a positive or negative relationship can be observed between `year_built` and `sale_price`, or `gross_square_feet` and `sale_price`. 

Let's start by discussing what we observed about the relationship between the year that a property was built (`year_built`) and `sale_price`.

![image.png](attachment:image.png)

There does not appear to be any obvious positive or negative relationship between `year_built` and `sale_price`. The observations are clustered into two main groups. One group corresponds to property sales for condominiums built in the early 1900s, and the other group corresponds to properties built in the early 2000s.

The values for `sale_price` are scattered across a similar range of values in the two groups. In other words, there is no indication that a condominium in an older building would sell for less than a condominium in a newer building, or the other way around.

Why do these distinct clusters exist in the first place? Answering this question would take some investigation.

What about the relationship between the size of a property and its sale price? Do we observe a positive or negative relationship between `gross_square_feet` and `sale_price`?

![image.png](attachment:image.png)

Looking at the scatterplot for `gross_square_feet` and `sale_price` we are able to observe that there is a positive relationship between these two variables. Generally speaking, the larger the property, the higher the sale price. Now that we've discussed directional relationships, let's consider linearity.

**Linearity**: A bivariate relationship is linear when the scatter, or spread of data points generally follows a linear pattern. 

With our `uber_trips` data we observed that `cost` generally increased with `distance`. When we visualized the linear regression fit line, we observed that some points fall below the line, while others are above the line. 

But in general, the data points did not curve away from the fit line at any point. We do not need to visualize a regression line to determine, roughly, if a bivariate relationship is linear. Often a scatterplot is all we need. But since we previously visualized the linear fit line for the `uber_trips` data, it is useful to consider here.

![image.png](attachment:image.png)
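To make "curving away from the fit line" concrete, the sketch below fits a straight line to two simulated datasets, one linear and one curved, and averages the residuals (observed value minus fitted value) in the low, middle, and high thirds of the x range. All values are invented, and splitting into thirds is just one simple way to summarize the pattern, not a formal test.

```r
# Simulated data: invented values used only to illustrate linearity
set.seed(1)
x <- seq(0, 10, length.out = 60)
y_linear <- 2 + 3 * x + rnorm(60, sd = 1)      # straight-line pattern
y_curved <- 2 + 0.5 * x^2 + rnorm(60, sd = 1)  # curved pattern

# Fit a straight line to each dataset and keep the residuals
resid_linear <- resid(lm(y_linear ~ x))
resid_curved <- resid(lm(y_curved ~ x))

# Split the x range into thirds and average the residuals within each third
thirds <- cut(x, breaks = 3, labels = c("low", "middle", "high"))
linear_means <- tapply(resid_linear, thirds, mean)
curved_means <- tapply(resid_curved, thirds, mean)

linear_means  # close to zero everywhere: points scatter evenly around the line
curved_means  # ends sit above the line, middle sits below it
```

When the relationship is linear, the points fall above and below the line throughout the x range. When it is curved, whole stretches of points sit on one side of the line.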

Let's do a quick comprehension check by considering whether or not we observed linear relationships in the scatterplots we generated.

Assign the value `TRUE` or `FALSE` to each of the following questions.

1. For the `year_built` versus `sale_price` scatterplot: There does not appear to be a linear relationship between the two clusters, but within each cluster we observe a linear pattern that the newer condominiums tend to sell for more money than older ones.
    * Assign the value `TRUE` or `FALSE` to the variable `question_1`.


2. For the `gross_square_feet` versus `sale_price` scatterplot: The pattern looks to be generally linear. There is some variation observed in `sale_price` for any given value of `gross_square_feet`, but the overall pattern is positive and does not appear to curve strongly in any direction.
    * Assign the value `TRUE` or `FALSE` to the variable `question_2`.
    
**Answer**

`question_1 <- FALSE
question_2 <- TRUE`

Now that we've examined **direction** and **linearity**, let's also take a look at **Strength**. A bivariate relationship is strong when the [spread, or dispersion](https://en.wikipedia.org/wiki/Statistical_dispersion), of the data is narrow. 

With a strong bivariate relationship, there will not be a lot of scatter, or noise, present, so the residuals will be smaller. Scatterplots are useful for visualizing this.

In the previous file we visualized the residuals for our `uber_trips` data and observed that there is a moderately strong relationship between `distance` and `cost`. We don't need the blue lines representing the residuals to assess the strength (a scatterplot is useful by itself), but they are worth looking at here.

![image.png](attachment:image.png)
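The link between strength and residual size can be sketched with simulated data. Below, two invented datasets share the same underlying trend but differ in the amount of scatter. The helper `mean_abs_resid()` is a hypothetical name introduced here for illustration; it returns the average vertical distance between the points and a straight fit line.

```r
# Simulated data: invented values used only to illustrate strength
set.seed(1)
x <- runif(50, min = 0, max = 20)
y_strong <- 4 + 1.5 * x + rnorm(50, sd = 1)  # little scatter around the trend
y_weak   <- 4 + 1.5 * x + rnorm(50, sd = 8)  # same trend, much more scatter

# Mean absolute residual: the typical vertical distance from the fit line
mean_abs_resid <- function(x, y) {
  mean(abs(resid(lm(y ~ x))))
}

strong_spread <- mean_abs_resid(x, y_strong)
weak_spread   <- mean_abs_resid(x, y_weak)
c(strong = strong_spread, weak = weak_spread)
```

The stronger relationship produces the smaller value, which matches what we would see on a scatterplot: the points hug the trend more closely.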

How strong do we think the bivariate relationship is between `gross_square_feet` versus `sale_price` from the `williamsburg_north` dataframe?

![image.png](attachment:image.png)

The 50 points in the scatterplot do not form a perfectly straight line when plotted, so it's safe to say that the relationship is not perfect. And for any given value of `gross_square_feet`, we observe variations in price on the order of hundreds of thousands of dollars.

But there is definitely some level of strength in the data because the points are not randomly spread across the plot. In the previous file we learned that we can essentially quantify levels of strength by estimating **mean absolute error**, for example. But for now we're focused on scatterplots and building our intuition about what scatterplots can tell us about bivariate relationships.

Let's build on our intuition by comparing the spread of data from our scatterplot above to a similar scatterplot that is built using data from the four Williamsburg neighborhoods included in the dataset from Brooklyn:

* Williamsburg-North
* Williamsburg-East
* Williamsburg-Central
* Williamsburg-South

Is it possible that one of these neighborhoods is more expensive than the others? Or is one of these neighborhoods generally more affordable than others? If there is variation among condominium values between neighborhoods, then we may be able to visualize this on a scatterplot. Will our points appear more spread out, and thus have a lower strength, if our data contains records from multiple neighborhoods? Let's check it out.

**Task**

Generate a dataset of 50 condominium sale records from the four Williamsburg neighborhoods. Use this dataset to build a scatterplot to visualize the strength of the relationship between `gross_square_feet` and `sale_price`. 
    
1. Set the random seed to 1 with `set.seed(1)`.
2. Generate a dataframe called `williamsburg_all` by applying the following filtering conditions to the `brooklyn_sales` dataframe:
    * `year_built` should be greater than 0
    * `sale_price` should be greater than 10,000
    * `building_class_category` must equal "13 Condos - Elevator Apartments"
    * `building_class_at_time_of_sale` must equal "R4"
    * include only values for `neighborhood` that include "Williamsburg"
    * retain only distinct rows (remove duplicate entries) with the [distinct() function](https://dplyr.tidyverse.org/reference/distinct.html) from `dplyr`
    * take 50 random samples with the [`sample_n()` function](https://dplyr.tidyverse.org/reference/sample.html) from `dplyr`
    

3. Generate a scatter plot with `sale_price` on the y-axis and `gross_square_feet` on the x-axis.
    * Add the call `scale_y_continuous(labels = scales::comma)` to the plot so that y-axis labels are not displayed as scientific notation.
    * Color points by neighborhood by defining the required argument within the `aes()` call.
    * Add the call `theme(legend.position="bottom")` so that this plot displays at a similar aspect ratio to our other scatterplot.


4. Observe the scatterplot we generated. Is the bivariate relationship stronger for `williamsburg_north` only? Or is the bivariate relationship stronger for this scatterplot generated from `williamsburg_all`?
    * Assign either the string `"williamsburg_north"` or `"williamsburg_all"` to the variable `stronger_relationship`.
    
**Answer**

`brooklyn_sales <- suppressMessages(read_csv("brooklyn_sales.csv"))`

`library(dplyr)
set.seed(1)`

`williamsburg_all <- brooklyn_sales %>% `

  `# Remove year-built zero years (assumed to be missing data)`
  
  `filter(year_built > 0) %>%` 
  
  `# Remove transactions assumed to be between family members`
  
  `filter(sale_price > 10000) %>% `
  
  `# Select condominium category`
  
  `filter(building_class_category == "13 Condos - Elevator Apartments") %>%` 
  
  `# Select building class "CONDO; RESIDENTIAL UNIT IN ELEVATOR BLDG."`
  
  `filter(building_class_at_time_of_sale == "R4") %>% `
  
  `# Choose all Williamsburg neighborhoods`
  
  `filter(stringr::str_detect(neighborhood, "Williamsburg")) %>%` 
  
  `# Include only unique entries`
  
  `distinct() %>% `
  
  `# Select random sample of 50`
  
  `sample_n(50)`

`ggplot(data = williamsburg_all, 
       aes(x = gross_square_feet, y = sale_price, color = neighborhood)) +
  scale_y_continuous(labels = scales::comma) +
  geom_point() +
  theme(legend.position="bottom")`

`stronger_relationship <- "williamsburg_north"`

When we compare our scatterplot from Williamsburg-North to the plot we generated above using data from all Williamsburg neighborhoods, we observe a weaker bivariate relationship between `gross_square_feet` and `sale_price` for the plot with all four neighborhoods combined.

Qualitatively, we can probably say that there is a moderately strong relationship between `gross_square_feet` and `sale_price` for the Williamsburg-North scatterplot. But the strength of the relationship for all of Williamsburg falls somewhere between moderately strong and weak.

![image.png](attachment:image.png)
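One plausible reason a pooled relationship looks weaker is that groups sit at different price levels. This is an assumption for illustration; the real Brooklyn data may behave differently. The sketch below simulates two made-up neighborhoods with the same price per square foot, one of which carries a fixed premium, and compares the Pearson correlation (`cor()`) within each group to the correlation for the pooled data.

```r
# Simulated data: two made-up neighborhoods, not the real Brooklyn records
set.seed(1)
n <- 50
sqft_a <- runif(n, min = 500, max = 1500)
sqft_b <- runif(n, min = 500, max = 1500)

# Same price per square foot, but neighborhood "B" carries a fixed premium
price_a <- 1000 * sqft_a + rnorm(n, sd = 50000)
price_b <- 1000 * sqft_b + 400000 + rnorm(n, sd = 50000)

# Correlation within each neighborhood versus both neighborhoods pooled
cor_a      <- cor(sqft_a, price_a)
cor_b      <- cor(sqft_b, price_b)
cor_pooled <- cor(c(sqft_a, sqft_b), c(price_a, price_b))
round(c(a = cor_a, b = cor_b, pooled = cor_pooled), 2)
```

In this simulation, the pooled points form two parallel bands rather than one tight band, so the overall scatter is wider even though each group on its own follows the trend closely.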

The `uber_trips` scatterplot we generated shows a moderately strong, positive, linear association between `distance` and `cost`. There is a moderate amount of spread and perhaps an outlier or two. When we consider the strength of bivariate relationships, we also consider outliers because outliers can impact model fit. **There is no single definition of an outlier**, and we will explore why in this screen.

We have identified a single outlier in our `uber_trips` dataset (highlighted in orange) with respect to the interquartile range. This point is flagged as an outlier for `cost` because it exceeds the upper quartile by more than 1.5 times the interquartile range (the difference between the upper quartile and the lower quartile).

![image.png](attachment:image.png)

However, in the context of linear regression, an **outlier** is an observation for which the response value $y_i$ is far from the value predicted by our model. In our case, an outlier in the `uber_trips` dataframe is a point that is far from the predicted value for `cost`. 

There are [methods](https://en.wikipedia.org/wiki/Studentized_residual) for determining how large a residual has to be before it can be considered an outlier in the regression context. Put simply, an outlier in regression is a data point that does not fit the pattern.

What if a data point is considered an outlier because its `sale_price` exceeds the upper quartile by more than 1.5 times the interquartile range, but it falls near the fit line of our linear regression model? In this case, we would probably not consider this data point an outlier in the context of linear regression! Let's illustrate why using our `williamsburg_north` data.
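Before turning to the real data, the two definitions can be contrasted on simulated values. In the sketch below, every number is invented. Point A is a very large unit priced in line with the trend, and point B is a mid-sized unit priced far above the trend. The regression check uses base R's `rstudent()` to compute studentized residuals. The cutoff of 3 is a rule-of-thumb assumption, not a fixed standard.

```r
# Simulated condo sales: invented values, not the williamsburg_north data
set.seed(1)
sqft  <- seq(500, 1400, length.out = 28)
price <- 1000 * sqft + rnorm(28, sd = 50000)

# Point A (row 29): very large unit, priced in line with the overall trend
# Point B (row 30): mid-sized unit, priced far above the trend
sqft  <- c(sqft, 3000, 950)
price <- c(price, 1000 * 3000, 1000 * 950 + 600000)
label <- c(rep("typical", 28), "A", "B")

# Definition 1: IQR rule applied to price alone
q   <- quantile(price)
iqr <- q[["75%"]] - q[["25%"]]
iqr_outlier <- price > q[["75%"]] + 1.5 * iqr | price < q[["25%"]] - 1.5 * iqr

# Definition 2: large studentized residual from the regression of price on sqft
# (the cutoff of 3 is an assumed rule of thumb)
fit <- lm(price ~ sqft)
regression_outlier <- abs(rstudent(fit)) > 3

# Compare the two flags for points A and B
data.frame(label, iqr_outlier, regression_outlier)[29:30, ]
```

In this simulation, the two rules disagree on both points: A is unusual in price alone but fits the pattern, while B has an unremarkable price but does not fit the pattern.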

The image below highlights data points considered outliers for `sale_price` because they exceed the upper quartile by more than 1.5 times the interquartile range. But would these points be considered outliers in the context of linear regression? To build our intuition around outliers and regression, let's add a fit line to this plot to see where the line falls relative to these points.

![image.png](attachment:image.png)

**Task**

`# sale_price quartiles`

`quartiles <- quantile(williamsburg_north$sale_price)`

`# 75% minus 25% = interquartile range (iqr)`

`iqr <- quartiles[[4]] - quartiles[[2]]`

`# Outlier boundaries`

`lower_bound <- quartiles[[2]] - (1.5 * iqr)
upper_bound <- quartiles[[4]] + (1.5 * iqr)`

`# Isolate outlier(s)`

`outliers <- williamsburg_north %>% 
  filter(sale_price > upper_bound | sale_price < lower_bound)`
  
  
We've provided code that determines the upper and lower outlier boundaries for `sale_price`. We saved the results to a new dataframe called `outliers` that contains all data points that fall above the upper outlier boundary, or below the lower outlier boundary. 

Highlight the outliers by integrating the `outliers` dataframe into the `williamsburg_north` scatterplot we previously generated. A useful feature of `ggplot2` is that we can use many datasets within a single plot. Also add a linear model fit line to the plot.

1. Generate a scatterplot using `williamsburg_north` that includes `gross_square_feet` on the x-axis and `sale_price` on the y-axis.
2. Highlight the outliers by **adding a second** `geom_point()` call that contains parameters for how we want the outliers to be highlighted.
    * Include the following arguments in this `geom_point()` **call**: `data = outliers, aes(gross_square_feet, sale_price), color = "orangered3", size = 4`
    * This new `geom_point()` call must come before the other `geom_point()` call.
    * This allows the points to display on top of the orange highlighting.

3. Add a `scale_y_continuous()` call to the plot with the necessary arguments to display commas in the numbers.
4. Use the [`geom_smooth()` function](https://ggplot2.tidyverse.org/reference/geom_smooth.html) to add a linear model fit line to the scatterplot code we've provided.
    * Include the argument required to prevent the confidence intervals from being displayed.
    
    
**Answer**

`ggplot(data = williamsburg_north, 
       aes(x = gross_square_feet, y = sale_price)) +
  geom_point(data = outliers, aes(gross_square_feet, sale_price), 
             color = "orangered3", size = 4) + 
  geom_point() +
  scale_y_continuous(labels = scales::comma) +
  geom_smooth(method = "lm", se = FALSE)`