**Task**

`library(readr)`

`# Heads-up: the following instruction can generate column specification warnings, which you can ignore.`

`# These warnings occur because readr guesses the wrong type for the values in the "Pool QC" column.`

`# To avoid them, we can specify the type of this column manually with the col_types parameter.`

``houses  <-  read_tsv('AmesHousing_1.txt', col_types = cols(`Pool QC` = col_character()))``

`library(dplyr)`

``houses_per_year  <-  houses %>%
    mutate(Year = `Yr Sold`) %>%
    group_by(Year) %>%
    summarize(MeanPrice = mean(SalePrice), 
              HousesSold = n() ) %>%
    arrange(HousesSold)``
    
* Compute the mean of the `MeanPrice` column in the `houses_per_year` dataset.
* Compute the mean of the `SalePrice` column in the `houses` dataset.
* Measure the difference between the two means.

**Answer**

`mean_new  <-  mean(houses_per_year$MeanPrice)`

`mean_original  <-  mean(houses$SalePrice)`

`difference  <-  mean_original - mean_new`

**Task**

* Using only the data we have in the `houses_per_year` dataset, compute the sum of prices for each year using the `MeanPrice` and `HousesSold` columns.

**Answer**

`houses_per_year  <- houses_per_year %>%
    mutate(sum_per_year = MeanPrice * HousesSold)`

`all_sums_together  <-  sum(houses_per_year$sum_per_year)`

`total_n_houses  <-  sum(houses_per_year$HousesSold)`

`weighted_mean  <-  all_sums_together / total_n_houses`

`mean_original  <-  mean(houses$SalePrice)`

`difference  <-  round(mean_original, digits = 10) - round(weighted_mean, digits = 10)`

When we take into account the different weights and compute the mean like we did above, we call it **the weighted mean**.

Here are some cases that can guide the choice between the mean and the weighted mean.

* If we have a distribution and all the values are present (i.e., repeated values are included and nothing is aggregated), we can compute the **mean**, which is a weighted mean where all the weights are equal to 1.

* If we have a distribution with only distinct values, and we are sure that repetitions are not possible, the **mean** is also enough.

* If we have an aggregated distribution, i.e., only the distinct values are present, and each of them has an extra value associated with it (such as the number of times it occurs or its probability), we have to use the **weighted mean** because this additional information must be taken into account to get the right result.
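To make the difference concrete, here is a minimal sketch on made-up data (the `prices` and `years` vectors are hypothetical and not taken from the Ames dataset). It aggregates three prices by year and compares the plain mean of the yearly means with the weighted mean and with the mean of the raw values.

```r
# Toy data, made up for illustration: three sale prices over two years
prices <- c(100, 200, 400)
years  <- c(2006, 2006, 2007)

# Aggregate with base R: mean price and number of sales per year
mean_price  <- tapply(prices, years, mean)    # 2006: 150, 2007: 400
houses_sold <- tapply(prices, years, length)  # 2006: 2,   2007: 1

# The plain mean of the yearly means ignores how many houses each year has
mean_of_means <- mean(mean_price)

# The weighted mean uses the counts as weights
weighted <- sum(mean_price * houses_sold) / sum(houses_sold)

# The mean of the raw, non-aggregated values
overall <- mean(prices)

print(c(mean_of_means = mean_of_means, weighted = weighted, overall = overall))
```

Only the weighted mean matches the mean of the raw values, because the year with two sales counts twice.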

**Task**

Write a function that computes the **weighted mean** for any numerical vector.

**Answer**

`compute_weighted_mean <- function(distribution, weights) {
    weighted_distribution  <-  distribution * weights
    sum(weighted_distribution) / sum(weights)
}`

`computed_weighted_mean  <-  compute_weighted_mean(houses_per_year$MeanPrice, houses_per_year$HousesSold)`
`weighted_mean_r  <-  weighted.mean(houses_per_year$MeanPrice, houses_per_year$HousesSold)`

`equal  <-  round(computed_weighted_mean, 10) == round(weighted_mean_r, 10)`

Before going further, let's look at namespaces, which will be useful to us later. As the name suggests, namespaces provide "spaces" for "names." In R, they let us specify which package a function belongs to. If we've already used the `::` operator, then we've already used namespaces without knowing it.

Namespaces are useful in many cases:

* Knowing precisely to which package a function belongs.
* Avoiding attaching an entire package when we only want to use one of its functions once.
* Disambiguating the functions of different packages with the same names.

We use a namespace when we write the name of a package, followed by the `::` operator, followed by the name of a function.

`name_of_package::name_of_function(function parameters)`

For example, assuming we want to concatenate the words "Hello" and "World" using the `str_c()` function from the `stringr` package, we can write:

`stringr::str_c("Hello", "World")`

instead of:

`library(stringr)`

`str_c("Hello", "World")`

Both pieces of code produce the same result, but they are not equivalent. The first loads the `stringr` namespace without attaching it, so only the `str_c()` call we wrote explicitly becomes available, whereas the second attaches the package and makes all of its exported functions available by name.
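What `::` does to the session can be checked directly. The sketch below uses the `tools` package instead of `stringr`, only because `tools` ships with R and so needs no installation. It assumes a regular session in which `tools` is not attached.

```r
# Is `tools` on the search path before we use it?
attached_before <- "package:tools" %in% search()

# Calling a function through `::` loads the namespace on demand
ext <- tools::file_ext("AmesHousing_1.txt")

# The namespace is now loaded, but the search path is unchanged
loaded_after   <- isNamespaceLoaded("tools")
attached_after <- "package:tools" %in% search()

print(ext)
print(c(attached_before = attached_before,
        loaded_after    = loaded_after,
        attached_after  = attached_after))
```

Calling `library(tools)` instead would add `package:tools` to the search path, which is what makes every exported function callable without the `tools::` prefix.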

We use the `filter()` function from the `dplyr` package several times. A function with the same name also exists in `stats`, a built-in R package. So, to make explicit which `filter()` we are using and avoid conflicts, we use a namespace: `dplyr::filter()` or `stats::filter()`.
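To see why the distinction matters, here is a small sketch of what `stats::filter()` actually does. It applies a linear filter to a numeric series (here a centered three-point moving average), which has nothing to do with selecting rows of a data frame. The vector `x` is made up for illustration.

```r
x <- c(1, 2, 3, 4, 5)

# stats::filter() convolves the series with the given coefficients;
# three weights of 1/3 give a centered moving average
moving_avg <- stats::filter(x, rep(1 / 3, 3))

# The first and last positions have no complete window, so they are NA
print(as.numeric(moving_avg))
```

Writing the `stats::` or `dplyr::` prefix explicitly removes any doubt about which of the two functions is called.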

# Median

**Task**

`distribution1  <-  c(23, 24, 22, '20 years or lower,', 23, 42, 35)`

`distribution2  <-  c(55, 38, 123, 40, 71)`

`distribution3  <-  c(45, 22, 7, '5 books or lower', 32, 65, '100 books or more')`

* Find the median of each of the three distributions above (each one has an odd number of values).

**Answer**

`median1  <-  23`

`median2  <-  55`

`median3  <-  32`

**Task**

* Find the median value of the `TotRms AbvGrd` column from the `houses` dataframe.

**Answer**

`rooms  <-  houses[["TotRms AbvGrd"]]`

`rooms  <-  as.numeric(stringr::str_replace(rooms, '10 or more', '10'))`

`rooms_sorted  <-  sort(rooms)`

`# Find the median`

`middle_indices  <-  c(length(rooms_sorted) / 2,
                      (length(rooms_sorted) / 2) + 1
                 )` `# 2930 is even so we need two indices.`
                 
`middle_values  <-  rooms_sorted[middle_indices]`

`median  <-  mean(middle_values)`
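The same steps can be wrapped in a small helper that handles both odd and even lengths. This is a minimal sketch. The name `compute_median` is hypothetical, and in practice R's built-in `median()` does the same job.

```r
compute_median <- function(x) {
  x_sorted <- sort(x)  # note: sort() drops NA values by default
  n <- length(x_sorted)
  if (n %% 2 == 1) {
    # Odd number of values: take the single middle value
    x_sorted[(n + 1) / 2]
  } else {
    # Even number of values: average the two middle values
    mean(x_sorted[c(n / 2, n / 2 + 1)])
  }
}

odd_example  <- compute_median(c(55, 38, 123, 40, 71))  # distribution2 from the earlier task
even_example <- compute_median(c(7, 3, 9, 5))           # made-up values

print(c(odd_example, even_example))
```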

When we compute the mean, we account equally for each value in the distribution: we sum up all the values in the distribution and then divide the total by the number of values we added. When we compute the median, however, we don't consider each value in the distribution equally. In fact, we only consider the middle value (or the middle two values).

Because the median is so resistant to changes in the data, it's classified as a **resistant** or **robust** statistic.

This property makes the median ideal for finding reasonable averages for distributions containing outliers. 

Consider this distribution of annual salaries for five people in a company:

![image.png](attachment:image.png)

The mean is heavily influenced by the person earning `$800,000`, and it amounts to a value of `$187,800`, which is not representative of anyone: the first four people earn much less than `$187,800`, and the last person earns much more. It makes more sense to compute a median value for this distribution, and report that the average salary in the company is `$40,000`, accompanied by an outlier of `$800,000`.
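A quick check of these numbers, using the five salaries from this example. The second vector, `salaries_extreme`, is a made-up variation that makes the outlier ten times larger, to illustrate how resistant the median is.

```r
salaries <- c(20000, 34000, 40000, 45000, 800000)

mean_salary   <- mean(salaries)    # 187800
median_salary <- median(salaries)  # 40000

# Hypothetical variation: the outlier becomes ten times larger
salaries_extreme <- c(20000, 34000, 40000, 45000, 8000000)

mean_extreme   <- mean(salaries_extreme)    # 1627800
median_extreme <- median(salaries_extreme)  # 40000

print(c(mean_salary, median_salary, mean_extreme, median_extreme))
```

The mean follows the outlier, while the median stays at the same value in both cases.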

These observations can be confirmed by viewing the data using a barplot.

`df <- tibble::tibble( employee = purrr::map_chr(1:5, function(x) stringr::str_c("Employee ", x)), salary = c(20000,34000,40000,45000, 800000) )`

`library(ggplot2)`

`ggplot(data = df,
    aes(x = employee, y = salary)) +
    geom_bar(stat = "identity", 
             fill = "blue") +
    geom_hline(aes(yintercept = mean(df$salary),
                   color = "black"), 
               size = 1) +
    geom_hline(aes(yintercept = median(df$salary), 
                    color = "red"), 
               size = 1) +
    scale_y_continuous(labels = scales::comma) +
    scale_colour_manual(values = c("black", "red"), 
                        name = "", 
                        labels = c("Mean", "Median")) +
    theme_bw() + theme(axis.text.x = element_text(angle = 90)) + 
    xlab("") + 
    ylab("Salaries")`


![image.png](attachment:image.png)

* The option `stat = "identity"` in the [`geom_bar()` function](https://ggplot2.tidyverse.org/reference/geom_bar.html) tells ggplot2 to use the `y` values we supply as bar heights instead of counting rows, which is what we need when the values are already available.
* Remember that `scale_y_continuous(labels = scales::comma)` formats the axis labels with commas, which avoids scientific notation.
* The function `scale_colour_manual()` sets the colors of the legend and lets us rename its entries.

We can experiment with this plot by removing those elements and observing the differences.

**Task**

1. The `Lot Area` and `SalePrice` variables have outliers. Confirm this information by visualizing the distributions using a box plot for each variable. 
2. Compute the median and the mean for each of the two variables.
3. For each variable, compute the difference between the mean and the median.

**Answer**

`library(ggplot2)`

``ggplot(data = houses,
    aes(x = "", y = `Lot Area`)) +
    geom_boxplot() +
    xlab("Lot Area") + 
    ylab("")``

`ggplot(data = houses,
    aes(x = "", y = SalePrice)) +
    geom_boxplot() +
    xlab("Sale Price") + 
    ylab("")`

``lotarea_difference  <-  mean(houses$`Lot Area`) - median(houses$`Lot Area`)``

`saleprice_difference  <-  mean(houses$SalePrice) - median(houses$SalePrice)`

**Task**

1. Find the mean and the median of the `Overall Cond` variable.
2. Plot a histogram to visualize the distribution of the `Overall Cond` variable.
3. Between the mean and the median, which one do we think better describes the shape of the histogram?
 * If we think it's the mean, assign the string 'mean' to a variable named `more_representative`, otherwise assign 'median'.
 
**Answer**

``mean  <-  mean(houses$`Overall Cond`)``

``median  <-  median(houses$`Overall Cond`)``

``ggplot(data = houses, aes(x = `Overall Cond`)) +
    geom_histogram(bins = 9, 
                   position = "identity", 
                   alpha = 0.6, 
                   fill='blue') +``
    `#geom_vline(xintercept = mean, color = 'red', size=1) +`
    `#geom_vline(xintercept = median, color = 'green', size=1) +`
    `xlab("Overall Cond") + 
    ylab("Frequency")`

`more_representative  <-  'mean' `

###### Extra comment:

Although it can be argued that it's theoretically unsound to compute the mean for ordinal variables, above we found the mean more informative and representative than the median. This is because it captures the fact that there are more houses rated above 5 than rated under 5, which shifts the mean slightly above 5.

The truth is that, in practice, many people get past the theoretical hurdles and use the mean because, in many cases, it's richer in information than the median.
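A small sketch illustrates the point on made-up data. The `ratings` vector below is hypothetical and only mimics the shape described above, with most ratings equal to 5 and more ratings above 5 than below.

```r
# Made-up ratings on a 1 to 9 scale
ratings <- c(3, 5, 5, 5, 5, 5, 5, 6, 6, 7, 7, 8)

ratings_median <- median(ratings)  # stays at the most common rating, 5
ratings_mean   <- mean(ratings)    # pulled above 5 by the higher ratings

# How many ratings fall on each side of 5
n_above <- sum(ratings > 5)
n_below <- sum(ratings < 5)

print(c(median = ratings_median, mean = ratings_mean,
        above = n_above, below = n_below))
```

The median reports only the middle rating, while the mean also reflects the imbalance between the two sides.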