`library(readr)`

`houses  <-  read_tsv('AmesHousing_1.txt', col_types = cols(`Pool QC` = col_character()))`

**Task**

Create a new dataframe, `sale_price_range`, containing the range of the SalePrice variable for each year of sales.

**Answer**

`library(dplyr)`

`sale_price_range <- houses %>%
    group_by(`Yr Sold`) %>%
    summarize(range_by_year = max(SalePrice) - min(SalePrice))`


**Task**

* Write a function that takes a numerical vector and returns the average distance of its values from the mean.
* Compute the average distance for distribution `C`.
    - `C  <-  c(1,1,1,1,1,1,1,1,1,21)`

**Answer**

`average_distance <- function(vector) {
    distances  <-  vector - mean(vector)
    sum(distances) / length(distances)
}`

`avg_distance  <-  average_distance(C)`
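The result for `C` is zero, which is why this measure says nothing about spread. A minimal sketch of the arithmetic (the name `distances` is used here only for illustration):

```r
C <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 21)

# mean(C) is 3, so nine values sit 2 below the mean and one sits 18 above it.
distances <- C - mean(C)

sum(distances[distances < 0])   # -18
sum(distances[distances > 0])   # 18

# Negative and positive distances cancel, so the average is 0.
sum(distances) / length(distances)
```

Because the distances always cancel around the mean, the later tasks take absolute values or squares before averaging.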

**Task**

* Create a new function to compute the mean absolute deviation.
* Compute the mean absolute deviation of distribution `C`.

**Answer**

`mean_absolute_deviation <- function(vector) {
    distances  <-  abs(vector - mean(vector)) 
    sum(distances) / length(distances)
}`

`mad  <-  mean_absolute_deviation(C)`
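Note that base R also has a `mad()` function, but it computes the (scaled) *median* absolute deviation, a different statistic. A small sketch contrasting the two on `C`; the variable names below are illustrative, not part of the exercise:

```r
C <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 21)

# Mean absolute deviation: average of |x - mean(x)|.
mean_abs_dev <- sum(abs(C - mean(C))) / length(C)   # (9 * 2 + 18) / 10 = 3.6

# stats::mad() uses the median instead: constant * median(|x - median(x)|).
# Here median(C) is 1 and most deviations are 0, so the result is 0.
median_abs_dev <- stats::mad(C)
```

Storing the result under a name other than `mad` may help avoid confusing the two.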

**Task**

* Create a new function to compute the variance.
* Compute the variance of distribution `C`.

**Answer**

`variance <- function(vector) {
    distances  <-  (vector - mean(vector))**2 
    sum(distances) / length(distances)
}`

`variance_C  <-  variance(C)`

**Task**

* Create a new function to compute the standard deviation.
* Compute the standard deviation of distribution `C`.

**Answer**

`standard_deviation <- function(vector) {
    distances  <-  (vector - mean(vector))**2 
    sqrt(sum(distances) / length(distances)) 
}`

`standard_deviation_C  <-  standard_deviation(C)`
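As a quick check, the values for `C` can be worked out by hand and set against base R's `var()` and `sd()`, which divide by `n - 1` rather than `n`. A minimal sketch (variable names are illustrative):

```r
C <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 21)

# Squared distances: nine times (-2)^2 = 4 and once 18^2 = 324, total 360.
squared_distances <- (C - mean(C))^2

pop_variance <- sum(squared_distances) / length(C)   # 360 / 10 = 36
pop_sd <- sqrt(pop_variance)                         # 6

# Base R divides by n - 1, so its results are larger for the same data.
var_base <- var(C)   # 360 / 9 = 40
sd_base <- sd(C)     # sqrt(40), about 6.32
```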


**Task**

* Create a new dataframe, `houses_years_std`, containing the standard deviation of the `SalePrice` variable for each year of sales sorted by the standard deviation column.
* Find the year with the greatest variability of prices.
* Find the year with the lowest variability of prices.

**Answer**

`# Measure first the variability for each year`

`houses_years_std <- houses %>%
    group_by(`Yr Sold`) %>%
    summarize(st_dev = standard_deviation(SalePrice)) %>%
    arrange(st_dev)`

`# Get years of max and min variability`

`greatest_variability  <-  houses_years_std %>%
  filter(st_dev == max(st_dev)) %>% 
  pull(`Yr Sold`)`

`lowest_variability  <-  houses_years_std %>%
  filter(st_dev == min(st_dev)) %>% 
  pull(`Yr Sold`)`

**Task**

We are taking two samples of 50 sample points each from the distribution of the `Year Built` variable. Examine the graph below. Estimate visually which sample has a bigger spread.

`set.seed(10)`

`sample1  <-  sample(x = houses$`Year Built`, size = 50)
sample2  <-  sample(x = houses$`Year Built`, size = 50)`

![image.png](attachment:image.png)


**Answer**

`bigger_spread  <-  'sample 2'
st_dev1  <-  standard_deviation(sample1)
st_dev2  <-  standard_deviation(sample2)`

**Task**

1. Let's treat the data we have for `SalePrice` as a population and sample it 5,000 times. Use the `replicate()` function to run the 5,000 iterations; in each one:
    - Sample 10 data points from the `SalePrice` variable using the `sample()` function.
    - Compute the standard deviation of the sample using the `standard_deviation()` function.
    
2. Convert the `std_points` variable to a tibble.
3. Generate a histogram.

**Answer**

`library(ggplot2)`

`set.seed(1)`


`std_points  <-  replicate(n = 5000, expr = standard_deviation(sample(x = houses$SalePrice, size = 10)))`

`std_points_tibble <- tibble::tibble(std_points)`

`ggplot(data = std_points_tibble, aes(x = std_points)) +
    geom_histogram(bins = 10, position = "identity") +
    geom_vline(aes(xintercept = standard_deviation(houses$SalePrice))) +
    xlab("Sample standard deviation") + 
    ylab("Frequency")`

**Task**

Modify the code we wrote by implementing **Bessel's correction**, and generate the histogram again.

**Answer**

`standard_deviation_bessel_correction <- function(vector) {
    distances  <-  (vector - mean(vector))**2 
    sqrt(sum(distances) / (length(distances) - 1) )
}`
    
`library(ggplot2)`

`set.seed(1)`

`std_points  <-  replicate(n = 5000, expr = standard_deviation_bessel_correction(sample(x = houses$SalePrice, size = 10)))`

`std_points_tibble <- tibble::tibble(std_points)`

`ggplot(data = std_points_tibble, aes(x = std_points)) +
    geom_histogram(bins = 10, position = "identity") +
    geom_vline(aes(xintercept = standard_deviation(houses$SalePrice))) +
    xlab("Sample standard deviation") + 
    ylab("Frequency")`
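The effect of the correction can also be checked numerically rather than by eye. The sketch below is not part of the exercise: it assumes a hypothetical normally distributed population in place of `houses$SalePrice`, and the helper names `sd_uncorrected` and `sd_bessel` are made up for illustration.

```r
set.seed(1)

# Hypothetical population standing in for houses$SalePrice (an assumption).
population <- rnorm(n = 1000, mean = 180000, sd = 80000)

# Standard deviation dividing by n (no correction).
sd_uncorrected <- function(vector) {
    sqrt(sum((vector - mean(vector))^2) / length(vector))
}

# Standard deviation dividing by n - 1 (Bessel's correction).
sd_bessel <- function(vector) {
    sqrt(sum((vector - mean(vector))^2) / (length(vector) - 1))
}

# Draw 5,000 samples of 10 points and keep both estimates for each sample.
estimates <- replicate(n = 5000, expr = {
    s <- sample(x = population, size = 10)
    c(uncorrected = sd_uncorrected(s), bessel = sd_bessel(s))
})

population_sd <- sd_uncorrected(population)

# Average estimate relative to the population value, one entry per method.
rowMeans(estimates) / population_sd
```

Under these assumptions the uncorrected estimates should fall below the population value on average, and the corrected ones should land closer to it.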

**Task**

1. Compare the result of our standard deviation function and the R base function.
2. Compare the result of our variance function and the R base function.

**Answer**

`computed_stdev  <-  standard_deviation_bessel_correction(sample_sales)
stdev_r  <-  sd(sample_sales)
equal_stdevs  <-  computed_stdev == stdev_r`

`computed_var  <-  variance_bessel_correction(sample_sales)
var_r  <-  var(sample_sales)
equal_vars  <-  computed_var == var_r`
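The answer above relies on `sample_sales` and `variance_bessel_correction()`, neither of which is defined in this section. The sketch below is one way they might look: the vector is a hypothetical stand-in (not the real sample), and the function is assumed to mirror `standard_deviation_bessel_correction()` without the square root. It also shows `all.equal()`, which tolerates floating-point rounding where `==` does not.

```r
# Hypothetical stand-in; the original `sample_sales` is not shown in this section.
sample_sales <- c(100, 120, 150, 90, 200)

# Assumed definition: sample variance with Bessel's correction (divide by n - 1).
variance_bessel_correction <- function(vector) {
    distances <- (vector - mean(vector))^2
    sum(distances) / (length(distances) - 1)
}

computed_var <- variance_bessel_correction(sample_sales)
var_r <- var(sample_sales)

# `==` demands bit-for-bit equality; all.equal() allows for rounding noise.
equal_vars <- isTRUE(all.equal(computed_var, var_r))

# The same check for the standard deviation.
equal_stdevs <- isTRUE(all.equal(sqrt(computed_var), sd(sample_sales)))
```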

**Task**

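1. Compute the variance and standard deviation of the population (without Bessel's correction).
2. Compute the sample variance for each sample.
3. Compute the sample standard deviation for each sample.
4. Check whether the mean of the sample variances equals the population variance, and whether the mean of the sample standard deviations equals the population standard deviation.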

`population  <-  c(0, 3, 6)`

`samples  <-  list(c(0,3), c(0,6),  
               c(3,0), c(3,6),
               c(6,0), c(6,3)) # possible samples of size n=2`

**Answer**

`variance <- function(vector) {
    distances  <-  (vector - mean(vector))**2 
    sum(distances) / length(distances)
}`

`standard_deviation <- function(vector) {
    distances  <-  (vector - mean(vector))**2 
    sqrt(sum(distances) / length(distances) )
}`

`population_var  <-  variance(population)
population_std  <-  standard_deviation(population)`

`st_devs  <-  purrr::map_dbl(samples,sd)
variances  <-  purrr::map_dbl(samples,var)`

`mean_std  <-  mean(st_devs)
mean_var  <-  mean(variances)`

`equal_stdev  <-  population_std == mean_std
equal_var  <-  population_var == mean_var`
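Working through the numbers by hand helps interpret the two flags. The sketch below uses base R's `sapply()` instead of `purrr::map_dbl()` only to stay dependency-free. The extra list `samples_with_replacement` is an assumption added for illustration: it appends the three samples that can only occur when sampling with replacement.

```r
population <- c(0, 3, 6)

# Population variance, dividing by N: (9 + 0 + 9) / 3 = 6.
population_var <- sum((population - mean(population))^2) / length(population)

# The six ordered samples of size 2 drawn without replacement (as in the task).
samples <- list(c(0, 3), c(0, 6), c(3, 0), c(3, 6), c(6, 0), c(6, 3))

# Sample variances are 4.5, 18, 4.5, 4.5, 18, 4.5, so their mean is 9.
mean_var <- mean(sapply(samples, var))

# Assumption for illustration: add the samples possible only with replacement.
samples_with_replacement <- c(samples, list(c(0, 0), c(3, 3), c(6, 6)))

# Three extra variances of 0 bring the mean down to 54 / 9 = 6.
mean_var_replacement <- mean(sapply(samples_with_replacement, var))

isTRUE(all.equal(population_var, mean_var))               # FALSE
isTRUE(all.equal(population_var, mean_var_replacement))   # TRUE
```

With the six samples given in the task, the mean sample variance is 9 rather than 6, so the comparison comes out `FALSE`; the match with the population variance appears once the with-replacement samples are included.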