In [1]:
options(jupyter.rich_display = FALSE)

## Week 6 Tutorial: Data Wrangling in R

### POP77001 Computer Programming for Social Scientists

##### Module website: [tinyurl.com/POP77001](https://tinyurl.com/POP77001)

## Loading the dataset

- Replace the file path (`PATH` below) with the location of the file on your computer

In [2]:
library("readr")
library("dplyr")


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




In [3]:
PATH <- "../data/kaggle_survey_2021_responses.csv"

# The header of this dataset is composite (it spans 2 rows), so we
# start by reading in only the first 2 rows and then use the column names
# of that 'header' dataset for the full dataset
questions <- readr::read_csv(PATH, n_max = 2)

[1m[1mRows: [1m[22m[34m[34m2[34m[39m [1m[1mColumns: [1m[22m[34m[34m369[34m[39m

[36m──[39m [1m[1mColumn specification[1m[22m [36m───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m (369): Time from Start to Finish (seconds), Q1, Q2, Q3, Q4, Q5, Q6, Q7_P...


[36mℹ[39m Use [30m[47m[30m[47m`spec()`[47m[30m[49m[39m to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set [30m[47m[30m[47m`show_col_types = FALSE`[47m[30m[49m[39m to quiet this message.



In [4]:
kaggle2021 <- readr::read_csv(PATH, col_names = names(questions), skip = 2)

[1m[1mRows: [1m[22m[34m[34m25973[34m[39m [1m[1mColumns: [1m[22m[34m[34m369[34m[39m

[36m──[39m [1m[1mColumn specification[1m[22m [36m───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m (360): Q1, Q2, Q3, Q4, Q5, Q6, Q7_Part_1, Q7_Part_2, Q7_Part_3, Q7_Part_...
[32mdbl[39m   (1): Time from Start to Finish (seconds)
[33mlgl[39m   (8): Q30_B_Part_1, Q30_B_Part_2, Q30_B_Part_3, Q30_B_Part_4, Q30_B_Par...


[36mℹ[39m Use [30m[47m[30m[47m`spec()`[47m[30m[49m[39m to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set [30m[47m[30m[47m`show_col_types = FALSE`[47m[30m[49m[39m to quiet this message.



In [5]:
head(kaggle2021, 1)

  Time from Start to Finish (seconds) Q1    Q2  Q3    Q4                Q5   
1 910                                 50-54 Man India Bachelor’s degree Other
  Q6         Q7_Part_1 Q7_Part_2 Q7_Part_3 ⋯ Q38_B_Part_3 Q38_B_Part_4
1 5-10 years Python    R         NA        ⋯ NA           NA          
  Q38_B_Part_5 Q38_B_Part_6 Q38_B_Part_7 Q38_B_Part_8 Q38_B_Part_9
1 NA           NA           NA           NA           NA          
  Q38_B_Part_10 Q38_B_Part_11 Q38_B_OTHER
1 NA            NA            NA         

In [6]:
questions[,1:10]

  Time from Start to Finish (seconds) Q1                         
1 Duration (in seconds)               What is your age (# years)?
2 910                                 50-54                      
  Q2                                    
1 What is your gender? - Selected Choice
2 Man                                   
  Q3                                       
1 In which country do you currently reside?
2 India                                    
  Q4                                                                                                             
1 What is the highest level of formal education that you have attained or plan to attain within the next 2 years?
2 Bachelor’s degree                                                                                              
  Q5                                                                                                     
1 Select the title most similar to your current role (or most recent title if retired): - Selected 

## Exercise 1: Summarise categorical variable

- Load the dataset (as a local file)
- Consider the country of residence reported by respondents (question Q3).
- Make sure you can select the column using both its name and its index
- Calculate the percentage of respondents from each of the top 3 countries of residence in the sample
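
One possible approach is sketched below on a small made-up data frame rather than on the survey itself. The object `toy` and its values are assumptions used only for illustration; for the real data you would swap in `kaggle2021` and check the position of `Q3` yourself.

```r
# Hypothetical toy data standing in for the survey (values are made up)
toy <- data.frame(
  Q2 = c("Man", "Woman", "Man", "Man", "Woman", "Man"),
  Q3 = c("India", "India", "Japan", "Brazil", "India", "Japan")
)

# Select the same column by name and by position
by_name  <- toy[["Q3"]]
by_index <- toy[[2]]

# Share of each country in percent, sorted from largest to smallest
shares <- sort(prop.table(table(by_name)), decreasing = TRUE) * 100

# Keep the three most frequent countries
top3 <- head(shares, 3)
print(round(top3, 1))
```

`prop.table()` turns the counts returned by `table()` into proportions, so multiplying by 100 gives percentages.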

## Dummy variables

- When analysing categorical data (particularly when using it as independent variables in regression) it is common to construct [design matrices](https://en.wikipedia.org/wiki/Design_matrix), where categorical variables are represented by $1$'s and $0$'s depending on whether a given category applies to a given observation.
- For example, the gender of survey respondents can be represented by the matrix below, where $1$ indicates that a given respondent is female and $0$ that they are male:

$$
\stackrel{female}{
\begin{bmatrix}
1 \\
0 \\
1 \\
\vdots \\
1
\end{bmatrix}
}
$$

- This process of replacing actual labels (e.g. 'female' and 'male' in the example above) with binary values is called creating [dummy variables](https://en.wikipedia.org/wiki/Dummy_variable_(statistics)) in statistics and [one-hot encoding](https://en.wikipedia.org/wiki/One-hot) in computer science.
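
As a minimal sketch of this idea, a two-level variable can be turned into a dummy with a logical comparison. The vector `gender` is a made-up example, not taken from the survey.

```r
# Hypothetical labels for four respondents
gender <- c("female", "male", "female", "female")

# TRUE/FALSE from the comparison is converted to 1/0
female <- as.integer(gender == "female")

# Arrange as a one-column matrix, mirroring the notation above
female_matrix <- matrix(female, ncol = 1, dimnames = list(NULL, "female"))
print(female_matrix)
```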


## Dummy variables continued

- A more complex example would be when instead of having just two levels of a categorical (i.e. factor in R) variable, we have multiple different values that a variable might take.
- For instance, a variable like age group might be represented as follows:

$$
\stackrel{{\scriptstyle25-34\,35-44\,45-64\,65+}}{
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 \\
\vdots & \vdots & \vdots & \vdots\\
0 & 0 & 0 & 0
\end{bmatrix}
}
$$

Here the first row corresponds to a respondent who is between 25 and 34 years old, the second to someone between 35 and 44, and the third to a participant who is 65 or older. Note that the number of columns in this matrix is one fewer than the number of levels of our imaginary categorical variable age, because we omit the baseline (reference) category. Membership of that category can still be established from the matrix: if the values in all columns are $0$ (as in the last row above), the observation comes from a respondent in the age group 18-24.
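
The omitted baseline can be seen in a small sketch using base R's `model.matrix()`. The factor `age` below is made up to match the imaginary age groups in the matrix above; it is not the survey variable.

```r
# Hypothetical age-group factor with "18-24" as the first (baseline) level
age_levels <- c("18-24", "25-34", "35-44", "45-64", "65+")
age <- factor(c("25-34", "35-44", "65+", "45-64", "18-24"), levels = age_levels)

# With the default treatment contrasts, model.matrix() returns an intercept
# column plus one dummy column per non-baseline level
X <- model.matrix(~ age)

# Drop the intercept to keep only the dummy columns
dummies <- X[, -1, drop = FALSE]
print(dummies)

# Rows where every dummy is 0 belong to the baseline category
is_baseline <- rowSums(dummies) == 0
```

Here `dummies` has four columns for five levels, and the only all-zero row is the respondent in the 18-24 group.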

## Exercise 2: Pivoting tables

- Now let's construct such a design matrix with dummy variables for respondents' age group in the Kaggle survey.
- First, check which levels the age group variable takes (question Q1).
- Since we are making use of only a small portion of the data in this exercise, make the survey dataset more manageable by subsetting the columns Q1 to Q5.
- Check the function `model.matrix()` from base R and apply it to the dataset to get a design matrix (you need to specify a formula as the first argument).
- This might not be the most typical example of pivoting a data frame (the number of columns increases while the number of rows stays the same), but it gives you a sense of what pivoting can entail.
- To simplify working with the dataset, let's also create a unique id for each respondent (you can use the `seq_along()` function in combination with any other variable to do so).
- Next, use the `pivot_wider()` function from the `tidyr` package to create a separate column for each age group.
- If the original pivoting produced columns that are populated by values of the categorical variable and `NA`'s, use the `mutate()` function to replace them with $0$'s and $1$'s.
- Finally, use the `pivot_longer()` function to convert this representation of the dataset back into its original form.
- You might also need to use the `dplyr::filter()` function to remove redundant rows.
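
The pivoting steps above could look roughly like the following sketch, shown on a made-up four-row data frame rather than the survey. The object names (`toy`, `label`, `flag`), the `age_` prefix and the age labels are all assumptions for illustration, and this is only one of several ways to approach the exercise.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the Q1 column of the survey
toy <- data.frame(Q1 = c("18-21", "25-29", "18-21", "22-24"))

# Unique respondent id, plus a copy of Q1 to spread into the new columns
toy$id <- seq_along(toy$Q1)
toy$label <- toy$Q1

# One column per age group; cells hold the label or NA
wide <- toy %>%
  pivot_wider(names_from = Q1, values_from = label, names_prefix = "age_") %>%
  # Replace labels with 1 and NA with 0
  mutate(across(starts_with("age_"), ~ as.integer(!is.na(.x))))

# Back to one row per respondent: lengthen, then drop the rows with 0
long <- wide %>%
  pivot_longer(cols = starts_with("age_"), names_to = "Q1",
               names_prefix = "age_", values_to = "flag") %>%
  dplyr::filter(flag == 1L) %>%
  select(id, Q1)
```

After `pivot_longer()` each respondent has one row per age group, which is why the `dplyr::filter()` step is needed to recover a single row per respondent.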

## Week 6: Assignment 2

- Functions and data wrangling in R
- Due by 12:00 on Monday, 24th October