# Assignment 02


## Introduction to programming with R for data manipulation

You can run R code in a Jupyter notebook. If you can't figure out how to do that, you can write your code in RStudio and attach it. Clearly identify the question number and the answer.

# Rubric

- Rubric: 5 points
- Exercise_1: 30 points
- Exercise_2: 10 points
- Exercise_3: 20 points
- Exercise_4: 10 points
- Exercise_5: 20 points
- Exercise_6: 5 points


## Code Quality

The code that you write for this assignment will be given one overall grade for code quality; see our code quality rubric as a guide to what we are looking for. Also, for this course (and other MDS courses that use R), we are trying to follow the tidyverse code style. There is a guide you can refer to: http://style.tidyverse.org/

Each code question will also be assessed for code accuracy (i.e., does it do what it is supposed to do?)

## Table of contents

1. [Exercise 1: Reading data into R](#Exercise-1:-Reading-data-into-R)

2. [Exercise 2: Data wrangling with {dplyr}](#Exercise-2:-Data-wrangling-with-{dplyr})

3. [Exercise 3: More data wrangling with {dplyr}](#Exercise-3:-More-data-wrangling-with-{dplyr})

4. [Exercise 4: Tidying data with {tidyr}](#Exercise-4:-Tidying-data-with-{tidyr})

5. [Exercise 5: Tidying more data with {tidyr}](#Exercise-5:-Tidying-more-data-with-{tidyr})

6. [Exercise 6: Subsetting with base R](#Exercise-6:-Subsetting-with-base-R)


Use the libraries below for this assignment.

```r
library(canlang)
library(readxl)
library(repurrrsive)
library(testthat)
library(tidyverse)
options(repr.matrix.max.rows = 10)
```

## Exercise 1: Reading data into R

Read the data files listed in the table below and bind the names listed in the table to them.

**Note - if the column names are missing from any data sets you need to add them yourself programmatically via R** (see the `col_names` argument of the `read_*` functions).


| File  | Object name to bind to the data frame | File location |
|---|---|----|
| `abbotsford_lang.xlsx`  | `abbotsford` | `data` directory of this repo |
| `calgary_lang.csv`  | `calgary`  | `data` directory of this repo |
| `edmonton_lang.xlsx`  | `edmonton`  | https://github.com/ttimbers/canlang/blob/master/inst/extdata/edmonton_lang.xlsx?raw=true |
|  `kelowna_lang.csv` | `kelowna`  | `data` directory of this repo |
| `vancouver_lang.csv`  | `vancouver`  | `data` directory of this repo |
| `victoria_lang.tsv`  | `victoria`  | https://github.com/ttimbers/canlang/raw/master/inst/extdata/victoria_lang.tsv |

> #### The data
> The data you will be working with in this first exercise is language data from the 2016 Canadian Census for cities in Western Canada. If you are unfamiliar with Western Canadian geography, here's a map to help you start to get more familiar:
>
> <img src="https://www.canadatours.com/images/maps/Canada_W.gif" width=500>
>
> *Image source: https://www.canadatours.com/canada_maps.cfm?#W*
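
As a minimal sketch of how the `col_names` argument behaves, the example below reads a small header-less CSV supplied inline. The text, the region name, and the column names `region`, `language`, and `count` are made up for illustration and are not taken from the assignment files.

```r
library(readr)

# Hypothetical header-less CSV content, supplied inline for illustration only
raw_text <- "Region A,English,100\nRegion A,French,20\n"

# Passing a character vector to `col_names` names the columns and tells
# readr to treat the first row as data rather than as a header
toy <- read_csv(
  I(raw_text),
  col_names = c("region", "language", "count"),
  show_col_types = FALSE
)

toy
```

The same argument is available in the other `read_*` functions from {readr}; {readxl}'s `read_excel()` also accepts `col_names`.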

## Exercise 2: Data wrangling with {dplyr}

You have loaded individual data sets for 6 Western Canadian cities above, but we have more data than that from the 2016 Canadian Census (the most recent Canadian Census; the next one is scheduled for 2021). The {[canlang](https://ttimbers.github.io/canlang/)} R data package has language and regional data for all census metropolitan areas in Canada!

<img src="https://github.com/ttimbers/canlang/blob/master/man/figures/hex-canlang.png?raw=true" width=100>

For this exercise, we want you to use the {dplyr} functions you have already met in this course and the {canlang} R package `region_lang` data set to uncover the name of the Canadian census metropolitan area that has the second-largest number of people who report that the language they speak most often at home is Mandarin. Return the region name as a **character vector of length 1** (i.e., make sure it is not a data frame with one value in it). Bind the name `mandarin2` to the final object you return.

> Here's a map of Canada to help you see where the different Canadian census metropolitan areas are:
> 
> <img src="https://www.mapsopensource.com/images/canada-cities-map.gif" width=600>
>
> *Image source: https://www.mapsopensource.com/canada-cities-map.html*
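
To illustrate the difference between a one-value data frame and a character vector of length 1, here is a small sketch on made-up data. The tibble `toy` and its columns `region` and `speakers` are hypothetical and do not mirror the structure of `region_lang`.

```r
library(dplyr)

# Hypothetical toy data; not taken from `region_lang`
toy <- tibble(
  region = c("A", "B", "C"),
  speakers = c(50, 300, 120)
)

second_region <- toy %>%
  arrange(desc(speakers)) %>%  # largest count first
  slice(2) %>%                 # keep only the second row
  pull(region)                 # extract the column as a plain vector

second_region
class(second_region)
```

Without the final `pull()`, the result would still be a tibble with one row, which is not what the exercise asks for.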

## Exercise 3: More data wrangling with {dplyr}

For this exercise, we want you to choose a Canadian census metropolitan area from the {canlang} R package `region_lang` data set, and use the {dplyr} functions that you have already met in this course to find the top 5 languages spoken most often at home from that area. Your final result should be a data frame with two columns:
1. `language`
2. `perc_pop` 

The column `perc_pop` should be the percentage of the area's population who reported that they speak that language most often at home. You can find the population size for each Canadian census metropolitan area in the {canlang} R package `region_data` data set.
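
One possible shape for the percentage calculation is sketched below on made-up data. The tibbles `toy_lang` and `toy_pop` and all of their column names and values are assumptions for illustration; the real `region_lang` and `region_data` columns may be named differently.

```r
library(dplyr)

# Hypothetical counts of speakers per language in one made-up area
toy_lang <- tibble(
  region = "A",
  language = c("English", "French", "Mandarin"),
  speakers = c(600, 300, 100)
)

# Hypothetical population table, one row per area
toy_pop <- tibble(region = "A", population = 2000)

toy_perc <- toy_lang %>%
  left_join(toy_pop, by = "region") %>%            # attach the population
  mutate(perc_pop = speakers / population * 100) %>%
  slice_max(perc_pop, n = 2) %>%                   # keep the top 2 rows
  select(language, perc_pop)

toy_perc
```

If you have already filtered to a single area, an alternative is to extract that area's population as a single number and divide by it directly, skipping the join.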

## Exercise 4: Tidying data with {tidyr}


We're going to shift to a different data set to practice making data frames wider and longer, as the {canlang} data sets are already pretty tidy. Let's load a data set that is not tidy, because it is too wide for the statistical question being asked, and then use one of the {tidyr} functions to tidy it.

This next data set that we will be looking at contains environmental data from 1914 to 2018. The data was collected by the DFO (Canada's Department of Fisheries and Oceans) at the Pacific Biological Station (Departure Bay). Daily sea surface temperatures were recorded. Original data source: http://www.pac.dfo-mpo.gc.ca/science/oceans/data-donnees/lightstations-phares/index-eng.html

A statistical question we might be interested in answering with this data set is, has sea surface temperature been changing over time, and is there an association between time of year (i.e., month) and this change over time? Read the `departure_bay_temperature.csv` data set in from the `data` directory and decide what tidying you will have to do, and then get to work and tidy it! Bind the name `temps_tidy` to the tidy data frame you create.

- Let's take a look and see whether sea surface temperature has been changing over time at Departure Bay, BC. Given that time of year is a factor that influences sea surface temperature, plot this for each month separately.
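
As a rough sketch of what making a data frame longer looks like, the example below reshapes a made-up wide table with one column per month. The tibble `toy_wide`, its column names, and its values are hypothetical and are not the contents of `departure_bay_temperature.csv`.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data: one row per year, one column per month
toy_wide <- tibble(
  year = c(2000, 2001),
  Jan = c(7.1, 7.4),
  Feb = c(7.3, 7.2)
)

# Gather every column except `year` into month/temperature pairs
toy_long <- toy_wide %>%
  pivot_longer(
    cols = -year,
    names_to = "month",
    values_to = "temperature"
  )

toy_long
```

With one row per year and month, a plot can map year to the x-axis and temperature to the y-axis, and `facet_wrap()` from {ggplot2} can then draw one panel per month.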

## Exercise 5: Tidying more data with {tidyr}

Use one of the {tidyr} functions to tidy the data that you will load in from the `language_diversity.csv` file located in the `data` directory. This data was collected to answer research questions, such as what factors are associated with language diversity (as measured by the number of languages spoken in a country).  Read the `language_diversity.csv` data set into R and decide what tidying you will have to do, and then get to work and tidy it! Bind the name `tidy_lang` to the tidy data frame you create.

Data source: https://www.jvcasillas.com/untidydata/

Now that we have this data in a tidy format, let's explore it and plot the number of languages spoken in each country in the data set against the country's population.
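
For contrast with lengthening, the sketch below shows the opposite reshape on made-up data in which several variables share a single value column. The tibble `toy_stacked` and its columns are hypothetical; deciding which direction `language_diversity.csv` needs is part of the exercise.

```r
library(dplyr)
library(tidyr)

# Hypothetical data where two different variables are stacked in one column
toy_stacked <- tibble(
  country = c("X", "X", "Y", "Y"),
  measure = c("langs", "pop", "langs", "pop"),
  value = c(10, 500, 3, 80)
)

# Spread the stacked variables into their own columns, one row per country
toy_tidy <- toy_stacked %>%
  pivot_wider(names_from = measure, values_from = value)

toy_tidy
```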

## Exercise 6: Subsetting with base R

The {tidyverse} functions were written for manipulating data frames and vectors, but you now know that there are other types of objects in R. To work with these kinds of objects we need to know how to do subsetting using base R. Let's get some more practice doing this, first with matrices!

### Subsetting matrices

Consider the matrix created below. Subset a smaller matrix made up of columns 10 to 15 and rows 1 to 5, and bind the name `small_matrix` to it:

```r
set.seed(2020) # this makes the random number process below reproducible
random_matrix <- matrix(rexp(200, rate = .1), ncol = 20)
random_matrix
```

Output (a matrix with 10 rows and 20 columns; one row per line, values separated by commas):
2.938057,6.5233139,11.13552533,1.431066,1.1601004,12.6860753,16.4134368,37.61158,2.8760736,11.9359657,10.1429664,3.7368253,14.1535378,2.5883681,1.70020847,7.641389,3.507495,27.733644,0.7635843,1.492056
12.700502,13.8406351,4.961900343,11.788127,2.286639,12.1997007,3.7501219,25.521498,0.3410495,3.6322129,1.4065676,0.3350206,0.3627339,18.7108811,2.86708585,7.827604,8.39894,11.501245,1.4928199,5.63758
2.370036,13.3029784,10.784506729,17.122631,2.6865893,17.6286187,8.7758532,13.049184,6.4395061,18.4997913,7.9907373,1.4970579,8.0956285,6.8583796,8.32895351,4.377994,40.051038,8.11095,3.8880998,39.920596
7.398545,0.7938523,22.229445818,1.575422,0.6059261,14.6876833,4.5239944,8.346802,11.6813587,21.2545254,0.2040772,2.8268398,17.7621596,0.2964559,0.08151982,9.872077,16.67761,41.16282,14.2382275,47.132323
14.195153,3.7895417,0.008232868,8.462461,8.7432093,18.0210972,1.1309803,10.524262,13.6747683,14.1242168,0.5155448,18.2741893,5.4473414,2.3499926,14.54604977,25.113806,3.571127,11.366806,3.4441976,2.57298
12.656189,7.5740144,11.622642902,3.614166,8.4369874,11.1485233,3.6596141,1.719779,17.3328534,1.4198751,26.6164124,11.9498316,0.2640172,5.2254129,1.49033928,15.709789,1.919996,13.510032,14.9959522,5.366673
58.675191,19.6065916,45.720541661,56.452869,32.3882064,5.9444357,5.7199025,33.945311,1.3362299,0.3681266,3.9765598,5.0458668,4.1642927,20.2440382,4.38639193,30.591778,12.794629,2.770155,4.5837962,19.566883
2.404119,23.2537627,1.538694979,8.486318,6.3638649,0.1769528,0.9452913,3.676081,38.7291467,4.132905,4.4178181,17.7528238,4.1940696,6.7261254,1.18011127,22.963183,12.437981,22.577034,29.8337177,1.638553
5.28828,6.3678666,1.561867743,8.374958,26.4357742,18.5346461,6.8387739,6.605413,0.298029,4.7021308,20.8172305,8.8334537,13.5828541,21.4938658,4.78286147,12.663138,14.105986,22.403308,5.4939446,36.78106
4.876715,1.1497813,2.496821731,11.185452,9.2018113,4.9077449,0.9648821,2.156965,2.6015399,13.9129809,0.1818018,35.1253667,2.354673,16.9889198,0.82372545,26.525397,3.740227,2.702319,12.9271982,11.0445
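
As a reminder of the `[rows, columns]` syntax, here is a minimal sketch on a small made-up matrix. The names `toy_matrix` and `toy_sub` are hypothetical, and the indices differ from the ones the exercise asks for.

```r
# Hypothetical 4 x 5 matrix, filled column by column with 1 to 20
toy_matrix <- matrix(1:20, nrow = 4)

# Row indices come first and column indices second; R counts from 1
toy_sub <- toy_matrix[1:2, 3:5]

toy_sub
dim(toy_sub)
```

When a subset has only one row or one column, R drops it to a vector by default; adding `drop = FALSE` inside the brackets keeps the result a matrix.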


#### Reference
This assignment is attributed to the R programming course written by Tiffany Timbers.