# Introduction to R

This introduction to the R language aims to show how to represent and manipulate data objects as commonly found in *data science*.

## Installing R and RStudio

The R statistical package can be installed from [CRAN](https://cran.r-project.org). Be sure to also download [RStudio](https://www.rstudio.com), as it provides a full-featured user interface to interact with R.

## Useful additional packages

This tutorial mainly relies on core facilities that come with the so-called R [base packages](https://stackoverflow.com/a/9705725). However, it is possible to install additional packages as shown below:

    install.packages("ggplot2")

## Getting started


### Variables

There are fundamentally two kinds of data structures in statistics-oriented programming languages: numbers and strings. Numbers can be integers or real numbers, and they are used to represent values observed for a continuous or discrete statistical variable. Strings are everything else that cannot be represented as numbers or lists of numbers, e.g. the address of a building, the answer to an open-ended question in a survey, etc.
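
As a minimal sketch of this distinction, the snippet below stores three made-up survey variables (the names and values are hypothetical, not taken from any dataset) and asks R how each one is represented. It uses `c()` and `<-`, which are explained in the next paragraphs.

```r
# Hypothetical survey answers, made up for illustration
n_children <- c(0L, 2L, 1L)            # discrete variable stored as integers
height_cm  <- c(172.5, 160.2, 181.0)   # continuous variable stored as real numbers
address    <- c("12 rue de la Paix", "3 High Street", "Unknown")  # strings

typeof(n_children)  # "integer"
typeof(height_cm)   # "double"
typeof(address)     # "character"
```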

Here is how we can create a simple variable, say `x`, to store a series of 5 numerical values (a *vector* in R terminology):

In [None]:
x <- c(1, 3, 2, 5, 4)

Note that the symbol `<-` is the recommended assignment operator, although it is also possible to use `=` to assign a value to a variable, which appears on the left-hand side of the above expression. Also, the series of values is enclosed in round brackets, and each value is separated by a comma. From now on, we will speak interchangeably of values or observations, as if each one were a measurement collected on a statistical unit.
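
The following sketch illustrates the two assignment forms side by side. The names `y`, `z` and `m` are hypothetical and only serve the example. It also shows why `<-` is usually preferred: inside a function call, `=` names an argument rather than creating a variable.

```r
y <- c(10, 20, 30)   # recommended assignment operator
z = c(10, 20, 30)    # also valid at the top level

identical(y, z)      # TRUE: both variables hold the same vector

# Inside a call, `=` binds the argument `x` of mean(); it does not assign
m <- mean(x = c(1, 2, 3))
m                    # 2
```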

Some properties of this newly created variable can be queried interactively, e.g. how many elements `x` has or how those elements are represented in R:

In [None]:
length(x)
typeof(x)

It should be noted that `x` contains values stored as real numbers (`double`), although they could just as well be stored as integers. It is, however, possible to ask R to use true integer values by adding the `L` suffix to each number:

In [None]:
x <- c(1L, 3L, 2L, 5L, 4L)
typeof(x)
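
As a small complement, the sketch below converts an existing vector with `as.integer()` instead of retyping it with the `L` suffix. The names `x_dbl` and `x_int` are hypothetical. It also shows that the values still compare equal even though the storage type differs.

```r
x_dbl <- c(1, 3, 2, 5, 4)      # stored as double by default
x_int <- as.integer(x_dbl)     # explicit conversion to integer storage

typeof(x_dbl)                  # "double"
typeof(x_int)                  # "integer"
all(x_dbl == x_int)            # TRUE: the values compare equal
identical(x_dbl, x_int)        # FALSE: the storage types differ
```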

# Practical use case: The ESS survey

The `data` directory includes three [RDS](https://www.rdocumentation.org/packages/base/versions/3.5.3/topics/readRDS) files related to the [European Social Survey](https://www.europeansocialsurvey.org) (ESS). This survey first ran in 2002 (round 1), and it is currently renewed every two years. The codebook can be downloaded, along with [other data sheets](http://www.europeansocialsurvey.org/data/download.html), from the main website.

There are two files related to data collected in France (round 1 or rounds 1-5, `ess-*-fr.rds`) and one file for all participating countries (`ess-one-round.rds`).

## French data

Assuming the `data` directory is available in the current working directory, here is how we can load French data for round 1:

In [None]:
d <- readRDS("data/ess-one-round-fr.rds")
head(d[1:10])
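
For readers unfamiliar with the RDS format, here is a minimal round-trip sketch on a toy data frame (made up for illustration, not ESS data). It writes to a temporary file so that nothing in `data` is touched. Note that `toy2[1:2]` selects the first two *columns*, just as `d[1:10]` selects the first ten columns above.

```r
# Toy data frame, made up for illustration (not ESS data)
toy <- data.frame(id = 1:3, agea = c(34, 51, 27))

path <- tempfile(fileext = ".rds")  # temporary file, so nothing is overwritten
saveRDS(toy, path)                  # serialize a single R object to disk
toy2 <- readRDS(path)               # read it back under any name we like

identical(toy, toy2)                # TRUE
head(toy2[1:2])                     # first rows of the first two columns
```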

In [None]:
table(d$yrbrn)

In [None]:
summary(d$agea)
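
To make the output of these two calls easier to read, here is a sketch on made-up values (the vectors `yrbrn_toy` and `agea_toy` are hypothetical). `table()` counts how often each distinct value occurs, while `summary()` on a numeric variable returns a five-number summary plus the mean, and reports missing values separately.

```r
# Made-up values for illustration
yrbrn_toy <- c(1950, 1962, 1950, 1975, 1962, 1950)
agea_toy  <- c(52, 40, 52, 27, NA)

counts <- table(yrbrn_toy)   # one count per distinct year
counts[["1950"]]             # 3

s <- summary(agea_toy)       # Min, quartiles, Mean, Max and number of NA's
s[["Median"]]                # 46
s[["NA's"]]                  # 1
```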

Let us focus on the following list of variables:

In [None]:
vars <- c("tvtot", "rdtot", "nwsptot", "polintr", "trstlgl", "trstplc", "trstplt", "vote",
          "happy", "sclmeet", "inmdisc", "sclact", "health", "ctzcntr", "brncntr", "facntr",
          "mocntr", "hhmmb", "gndr", "yrbrn", "agea", "edulvla", "eduyrs", "pdjobyr",
          "wrkctr", "wkhct", "marital", "martlfr", "lvghw")
d <- d[vars]
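
The subsetting idiom used above can be sketched on a toy data frame (hypothetical names and values). Indexing with a character vector keeps the named columns, in the order given, and returns a data frame. Checking the names beforehand with `setdiff()` is an optional precaution, since asking for a column that does not exist raises an error.

```r
# Toy data frame, made up for illustration
toy <- data.frame(a = 1:2, b = c("x", "y"), c = c(TRUE, FALSE))
keep <- c("c", "a")

setdiff(keep, names(toy))   # character(0): every requested column exists
sub <- toy[keep]            # still a data frame, columns in the order of `keep`
names(sub)                  # "c" "a"
class(toy[["a"]])           # "integer": [[ ]] extracts one column as a vector
```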

If you look carefully at the data structure, you will probably notice that there are a lot of extra attributes for some variables. For example, the variable `gndr` (sex of respondent) has the following structure:

In [None]:
str(d$gndr)

In this case, `gndr` is a categorical variable where numerical codes 1 and 2 stand for 'Male' and 'Female', respectively. (The extra attributes come from the fact that this dataset was preprocessed using the [haven](https://haven.tidyverse.org) package, based on an original Stata file.)

Although it is primarily stored as a numerical variable (`num`), it would be more useful to convert this variable to a factor using, e.g., `factor(d$gndr, levels = c(1, 2), labels = c("Male", "Female"))`. Below are a few instructions that take care of converting all relevant variables to factors, and then discard the information that is no longer needed. You don't need to understand every piece of code at this stage, especially since it would be easy to use `haven`'s built-in functionalities to perform the same operations:

In [None]:
## retrieve labels when available
for (v in vars) {
  atr <- attr(d[[v]], "labels")
  if (length(atr) > 0)
    d[[v]] <- factor(d[[v]], levels = as.numeric(atr), labels = names(atr))
}

## discard haven/tibble attributes
num.vars <- vars[sapply(d, is.numeric)]
d[num.vars] <- lapply(d[num.vars], as.numeric)
d <- as.data.frame(d)
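
To see what the loop does for a single variable, here is a sketch on a toy vector that mimics a labelled variable. The vector `g` and its `labels` attribute are built by hand with `structure()` for illustration, so this is an assumption about the shape of the real data rather than the actual `haven` object.

```r
# Toy vector mimicking a labelled variable; values and labels are made up
g <- structure(c(1, 2, 2, 1, 2), labels = c(Male = 1, Female = 2))

atr <- attr(g, "labels")       # named numeric vector: the names are the labels
f <- factor(g, levels = as.numeric(atr), labels = names(atr))

levels(f)                      # "Male" "Female"
table(f)                       # Male: 2, Female: 3
attributes(as.numeric(g))      # NULL: as.numeric() drops the extra attributes
```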

In [None]:
summary(d$gndr)

In [None]:
library(ggplot2)
theme_set(theme_minimal())

In [None]:
p <- ggplot(data = d, aes(x = agea)) +
  geom_line(stat = "density", bw = 2) +
  labs(x = "Age of respondent")
p

In [None]:
p <- ggplot(data = d, aes(x = agea)) +
  geom_histogram(binwidth = 5) +
  facet_grid(~ gndr) +
  labs(x = "Age of respondent")
p

## Data from other countries

Data for all other participating countries can be loaded in the same manner:

In [None]:
db <- readRDS("data/ess-one-round.rds")
cat("No. observations =", nrow(db))
table(db$cntry)

Since French data are (deliberately) missing from this dataset, we can append them to the above data frame as follows: 

In [None]:
db <- rbind.data.frame(db, d)
cat("No. observations =", nrow(db))
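
Row-binding only works when both data frames have the same set of column names. The sketch below, on two made-up data frames, assumes that this condition holds. It also shows that columns are matched by name, not by position.

```r
# Two toy data frames with the same columns in a different order (made up)
a <- data.frame(cntry = c("DE", "ES"), agea = c(40, 35),
                stringsAsFactors = FALSE)
b <- data.frame(agea = c(52, 27), cntry = c("FR", "FR"),
                stringsAsFactors = FALSE)

setequal(names(a), names(b))   # TRUE; rbind() stops with an error otherwise
ab <- rbind(a, b)              # columns are matched by name
nrow(ab)                       # 4
ab$cntry                       # "DE" "ES" "FR" "FR"
```

If the column sets differ, one option is to restrict both data frames to their common columns, e.g. with `intersect(names(a), names(b))`, before calling `rbind()`.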

In [None]:
db$cntry <- factor(db$cntry)
table(db$cntry)

Remember that it is also possible to use `summary()` with a factor variable to display a table of counts.
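
A short sketch on a made-up factor (the name `cntry_toy` is hypothetical) shows that both functions report the same counts, only wrapped in different objects:

```r
# Made-up country codes for illustration
cntry_toy <- factor(c("FR", "DE", "FR", "ES", "FR"))

summary(cntry_toy)   # named integer vector of counts: DE 1, ES 1, FR 3
table(cntry_toy)     # same counts, returned as a table object
all(summary(cntry_toy) == table(cntry_toy))   # TRUE
```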