# Introduction to R

This introduction to the R language aims to show how to represent and manipulate data objects commonly found in *data science*.

## Installing R and RStudio

The R statistical package can be installed from [CRAN](https://cran.r-project.org). Be sure to also download [RStudio](https://www.rstudio.com), as it provides a full-featured user interface to interact with R.

## Useful additional packages

This tutorial mainly relies on core facilities that come with the so-called R [base packages](https://stackoverflow.com/a/9705725). However, it is possible to install additional packages as shown below:

    install.packages("ggplot2")
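Before installing anything, it can be handy to check which packages are already available. The following is only a sketch: the vector `pkgs` and the names `available` and `missing_pkgs` are chosen for illustration, and `requireNamespace()` is used because it tests for a package without attaching it.

```r
# Packages to check; this selection is only an illustration.
pkgs <- c("stats", "ggplot2")

# requireNamespace() returns TRUE when a package can be loaded and
# FALSE otherwise; it does not attach the package to the search path.
available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
print(available)

# Names of the packages that would still need install.packages()
missing_pkgs <- names(available)[!available]
print(missing_pkgs)
```

Any name listed in `missing_pkgs` could then be passed to `install.packages()`.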

## Setup

The following settings will be used in this practical:

In [None]:
library(ggplot2)
theme_set(theme_minimal())

## Getting started

### Variables

There are fundamentally two kinds of data structures in statistics-oriented programming languages: numbers and strings. Numbers can be integers or real numbers, and they are used to represent values observed for a continuous or discrete statistical variable, while strings are everything else that cannot be represented as numbers or lists of numbers, e.g. the address of a building, the answer to an open-ended question in a survey, etc.
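As a small illustration of this distinction, the sketch below stores one value of each kind and asks R how it sees them. The variable names and values are made up for the example.

```r
height   <- 1.82              # a real number (continuous variable)
children <- 2L                # an integer (discrete variable)
address  <- "12 Main Street"  # a string (hypothetical address)

# Both kinds of numbers are numeric; the address is a character string.
is.numeric(height)
is.numeric(children)
is.character(address)
is.numeric(address)
```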

Here is how we can create a simple variable, say `x`, to store a list of 5 numerical values:

In [None]:
x <- c(1, 3, 2, 5, 4)

Note that the symbol `<-` is the recommended assignment operator, although it is also possible to use `=` to assign a value to a variable, which appears on the left-hand side of the above expression. Also, the series of values is enclosed in round brackets, and each value is separated by a comma. From now on, we will speak interchangeably of values or observations, as if we were talking about a measure collected on a statistical unit.
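To make the point about the two assignment operators concrete, here is a minimal sketch; the names `a` and `b` are arbitrary.

```r
a <- c(10, 20, 30)  # recommended assignment operator
b = c(10, 20, 30)   # also accepted for top-level assignment

# Both variables now hold exactly the same vector.
identical(a, b)
```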

Some properties of this newly created variable can be queried interactively, e.g. how many elements `x` has or how those elements are represented in R:

In [None]:
length(x)
typeof(x)

It should be noted that `x` contains values stored as real numbers (`double`), although they could just as well be stored as integers. It is, however, possible to ask R to use true integer values by adding the suffix `L` to each number:

In [None]:
x <- c(1L, 3L, 2L, 5L, 4L)
typeof(x)

The distinction between 32-bit integers and reals will not matter much in common analysis tasks, but it is sometimes useful to check whether data are represented as expected, especially in the case of categorical variables, also called 'factors' in R parlance (more on this later).
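One possible way to check how numbers are stored, and to convert between the two representations, is sketched below. The names `x_int`, `x_dbl` and `x_conv` are illustrative only.

```r
x_int <- c(1L, 3L, 2L, 5L, 4L)  # stored as integers
x_dbl <- c(1, 3, 2, 5, 4)       # stored as doubles

is.integer(x_int)
is.integer(x_dbl)

# The values compare equal even though the storage type differs.
all(x_int == x_dbl)

# as.integer() converts doubles to integers.
x_conv <- as.integer(x_dbl)
typeof(x_conv)

# Largest value a 32-bit integer can hold in R
.Machine$integer.max
```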

The list of numbers we stored in `x` is called a *vector*, and it is one of the building blocks of common R data structures. Oftentimes, we will need richer data structures, especially two-dimensional objects, like a *matrix* or a *data frame*, or higher-dimensional objects such as an *array* or a *list*.

![](/assets/lang-r-base-001.png)
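The structures mentioned above can each be created with a single base R call. The following sketch uses made-up values and names, and is meant only to show how their shapes differ.

```r
v  <- c(1, 2, 3, 4, 5, 6)                 # vector
m  <- matrix(v, nrow = 2, ncol = 3)       # 2 x 3 matrix, filled column by column
a  <- array(1:24, dim = c(2, 3, 4))       # three-dimensional array
df <- data.frame(id = 1:3,                # data frame: columns may differ in type
                 grp = c("a", "b", "c"))
l  <- list(v, m, "some text")             # list: elements of any type and size

dim(m)
dim(a)
dim(df)
length(l)
```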

### Vectors

The command `c` ('concatenate') we used to create our list of integers will be very useful when it comes to passing multiple options to a command. It can be nested into another call to `c`, as in the following example:

In [None]:
x <- c(c(1, 2, 3), c(4, 5, 6), 7, 8)

In passing, note that since we reuse the same name, `x`, for our newly created variable, the old content referenced by `x` (1, 3, 2, 5, 4) is permanently lost. Once you have a vector of values, you can access each item by providing the (one-based) index of the item(s), e.g.:

In [None]:
x[1]
x[3]
x[c(1,3)]
x[1:3]

A convenient shorthand notation for regular sequences of integers is `start:end`, where `start` is the starting value and `end` is the last value (both included). Hence, `c(1,2,3,4)` holds the same values as `1:4`. This is useful when one wants to preview the first 3 or 5 values in a vector, for example. A more general function to work with regular sequences of numbers is `seq`. Here is an example of use:

In [None]:
seq(1, 10)
seq(1, 10, by = 2)
seq(0, 10, length.out = 5)
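Sequences are most often used as indices. The sketch below, built on a made-up vector `y`, compares `start:end` with the equivalent call to `c` and then uses `seq` to pick every second item.

```r
y <- c(10, 20, 30, 40, 50, 60)

s1 <- 1:4
s2 <- c(1, 2, 3, 4)
all(s1 == s2)  # the values are the same...
typeof(s1)     # ...but 1:4 is stored as integer
typeof(s2)     # while c(1, 2, 3, 4) is stored as double

# Preview the first three values
y[1:3]

# Every second item, starting from the second one
idx <- seq(2, length(y), by = 2)
y[idx]
```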

Updating the content of a vector can be done directly by assigning a new value to one of its items:

In [None]:
x[3] <- NA

In the above statement, the third item has been assigned a missing value, which is coded as `NA` ('not available') in R. Again, there is no way to go back to the previous state of the variable, so be careful when updating the content of a variable.
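Because updates cannot be undone, one simple precaution is to keep a copy before modifying a variable. This is only a sketch; `v` and `v_backup` are hypothetical names, used here to avoid touching `x`.

```r
v <- c(1, 2, 3, 4, 5, 6, 7, 8)

v_backup <- v  # keep a copy of the original values
v[3] <- NA     # overwrite the third item
v

# The copy is unaffected, so the original value can be restored.
v[3] <- v_backup[3]
identical(v, v_backup)
```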

The presence of missing data is important to check before undertaking any serious statistical analysis. The `is.na` function can be used to check for the presence of any missing value in a variable, while `which` will return the index that matches a `TRUE` result, if any:

In [None]:
is.na(x)
which(is.na(x))

Notice that many functions like `is.na`, or `which`, act in a vectorized way, meaning that you don't have to iterate manually over each item in the vector. Moreover, function calls can be nested one into the other. In the latter R expression, `which` is actually processing the values returned by the call to `is.na`.
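The same nesting idea extends to other vectorized functions. As a sketch on a made-up vector `z`, here is how one might count missing values, or ignore them when computing a mean.

```r
z <- c(1, 2, NA, 4, NA, 6)

# is.na() returns one logical per item; sum() counts the TRUE values.
n_missing <- sum(is.na(z))
n_missing

# anyNA() answers the yes/no question directly.
anyNA(z)

# Many summary functions accept na.rm = TRUE to skip missing values.
mean(z, na.rm = TRUE)

# Keep only the observed values
z_obs <- z[!is.na(z)]
z_obs
```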

### Data frames

Data frames are one of the core data structures to store and represent statistical data. Many routine functions that are used to load data stored in flat files or databases or to preprocess data stored in memory rely on data frames.

Observations are arranged in rows and variables are arranged in columns. Each variable can be viewed as a single vector, but all those variables are recorded in a common data structure, each with a unique name. Moreover, each column, or variable, can be of a different type--numeric, factor, character or boolean.
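To illustrate, here is a minimal sketch of a data frame built by hand from four vectors of different types. All names and values are invented for the example.

```r
df <- data.frame(
  id     = c("s01", "s02", "s03"),     # character
  age    = c(34, 51, 27),              # numeric
  group  = factor(c("a", "b", "a")),   # factor
  smoker = c(TRUE, FALSE, FALSE),      # boolean (logical)
  stringsAsFactors = FALSE             # keep `id` as character in older R versions
)

# One class per column
sapply(df, class)

# Each column can be extracted as a plain vector
df$age
```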

Here is an example of a built-in data frame, readily available by using the command `data`:

In [None]:
data(ToothGrowth)
head(ToothGrowth)

In [None]:
str(ToothGrowth)

While `head` lets us preview the first 6 rows of a data frame, `str` provides a concise overview of what is available in the data frame, namely the name of each variable (column), its mode of representation, and the first few observations (values).

The dimensions (number of rows and columns) of a data frame can be checked using `dim` (a shortcut for the combination of `nrow` and `ncol`):

In [None]:
dim(ToothGrowth)
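The relationship between `dim`, `nrow` and `ncol` can be verified directly, as in this short sketch:

```r
data(ToothGrowth)

nrow(ToothGrowth)  # number of rows (observations)
ncol(ToothGrowth)  # number of columns (variables)

# dim() returns both values at once, rows first.
identical(dim(ToothGrowth), c(nrow(ToothGrowth), ncol(ToothGrowth)))
```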

To access any given cell in this data frame, we will use the indexing trick we used in the case of vectors, but this time we have to indicate the row number as well as the column number, or name. Hence, `ToothGrowth[i,j]` means the value located at row `i` and column `j`, while `ToothGrowth[c(a,b),j]` would mean the values at rows `a` and `b` for the same column `j`.

![](assets/lang-r-base-002.png)

Here is how we can retrieve the second observation in the first column:

In [None]:
ToothGrowth[2,1]

Since the columns of a data frame have names, it is equivalent to use `ToothGrowth[2,1]` and `ToothGrowth[2,"len"]`. In the latter case, variable names must be quoted. Column names can be displayed using `colnames` or `names` (in the special case of data frames), while row names are available *via* `rownames`. Row names can be used as unique identifiers for statistical units, but best practice is usually to store unique IDs as characters or factor levels in a dedicated column in the data frame.
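A short sketch of these name-based accessors, using the same built-in data frame:

```r
data(ToothGrowth)

colnames(ToothGrowth)
names(ToothGrowth)           # same result for a data frame
head(rownames(ToothGrowth))  # row names are stored as character strings

# Indexing by position and by name gives the same value.
identical(ToothGrowth[2, 1], ToothGrowth[2, "len"])

# Several columns can be requested by name at once.
ToothGrowth[2, c("len", "dose")]
```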

Since we know that we can use `c` to create a list of numbers, we can use it to create a list of row numbers to look up. Imagine you want to access the content of a given column (`len`, which is the first column, numbered 1) for rows 2 and 4 (`c(2, 4)`):

![](assets/lang-r-base-003.png)

Here is how we would do it in R:

In [None]:
ToothGrowth[c(2,4),1]

This amounts to 'indexed selection', meaning that we need to provide the row (or column) numbers, while most of the time we are interested in criterion-based indexing--which observations fulfill a given criterion. This can be done easily given the fact that most operations are vectorized in R. For instance, to display observations on `supp` that satisfy the condition `len > 6`, we would use:

In [None]:
ToothGrowth$supp[ToothGrowth$len > 6]

![](assets/lang-r-base-004.png)
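Criterion-based selection can also return whole rows and combine several conditions. The sketch below is one way to do it; the names `keep`, `sel` and `sel2` are illustrative, and `subset` is shown as a base R alternative to bracket indexing.

```r
data(ToothGrowth)

# A logical vector with one TRUE/FALSE per row
keep <- ToothGrowth$len > 6
sum(keep)  # number of rows meeting the criterion

# Combine two conditions with & and keep all columns
sel <- ToothGrowth[keep & ToothGrowth$supp == "VC", ]
head(sel)

# subset() expresses the same selection without repeating the data frame name
sel2 <- subset(ToothGrowth, len > 6 & supp == "VC")
nrow(sel) == nrow(sel2)
```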

# Practical use case: The ESS survey

The `data` directory includes three [RDS](https://www.rdocumentation.org/packages/base/versions/3.5.3/topics/readRDS) files related to the [European Social Survey](https://www.europeansocialsurvey.org) (ESS). This survey first ran in 2002 (round 1), and it has been repeated every two years since. The codebook can be downloaded, along with [other data sheets](http://www.europeansocialsurvey.org/data/download.html), from the main website.

There are two files related to data collected in France (round 1 or rounds 1-5, `ess-*-fr.rds`) and one file for all participating countries (`ess-one-round.rds`).

## French data

Assuming the `data` directory is available in the current working directory, here is how we can load French data for round 1:

In [None]:
d <- readRDS("data/ess-one-round-fr.rds")
head(d[1:10])
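For readers who do not have the `data` directory at hand, the RDS mechanism can be tried on a toy data frame. This is only a sketch: the data frame `toy`, its values and the temporary file are invented, and nothing here comes from the ESS files.

```r
# A made-up data frame standing in for survey data
toy <- data.frame(cntry = c("FR", "FR", "DE"),
                  agea  = c(34, 51, 27),
                  stringsAsFactors = FALSE)

# Write it to a temporary RDS file, then read it back
path <- tempfile(fileext = ".rds")
saveRDS(toy, path)
toy2 <- readRDS(path)

# The object read back matches the one that was saved.
all.equal(toy, toy2)

# With a single index, [ ] selects columns: here the first two
head(toy2[1:2])
```

In the same way, `d[1:10]` in the cell above selects the first 10 columns of `d`.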

In [None]:
table(d$yrbrn)

In [None]:
summary(d$agea)

Let us focus on the following list of variables, readily available in the file `ess-one-round-29vars-fr.rds`:

- `tvtot`: TV watching, total time on average weekday
- `rdtot`: Radio listening, total time on average weekday
- `nwsptot`: Newspaper reading, total time on average weekday
- `polintr`: How interested in politics
- `trstlgl`: Trust in the legal system
- `trstplc`: Trust in the police
- `trstplt`: Trust in politicians
- `vote`: Voted last national election
- `happy`: How happy are you
- `sclmeet`: How often socially meet with friends, relatives or colleagues
- `inmdisc`: Anyone to discuss intimate and personal matters with
- `sclact`: Take part in social activities compared to others of same age
- `health`: Subjective general health
- `ctzcntr`: Citizen of country
- `brncntr`: Born in country
- `facntr`: Father born in country
- `mocntr`: Mother born in country
- `hhmmb`: Number of people living regularly as member of household
- `gndr`: Gender
- `yrbrn`: Year of birth
- `agea`: Age of respondent, calculated
- `edulvla`: Highest level of education
- `eduyrs`: Years of full-time education completed
- `pdjobyr`: Year last in paid job
- `wrkctr`: Employment contract unlimited or limited duration
- `wkhct`: Total contracted hours per week in main job overtime excluded
- `marital`: Legal marital status
- `martlfr`: Legal marital status, France
- `lvghw`: Currently living with husband/wife

Note that variables in the file `ess-one-round-29vars-fr.rds` have been recoded and categorical variables now have proper labels. Refer to the script file `scripts/ess-one-round-29vars-fr.r` to see what has been done to the base file.
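The actual recoding lives in the script mentioned above and is not reproduced here. As a rough idea of what attaching labels to a categorical variable can look like, here is a sketch that assumes numeric codes 1 and 2, with 9 as a hypothetical code for a missing answer; neither the codes nor the values are taken from the ESS files.

```r
# Hypothetical raw codes, NOT taken from the ESS data
raw <- c(1, 2, 2, 1, 9, 2)

# factor() maps the listed codes to labels; codes that are not
# listed in `levels` (here 9) become NA.
gndr <- factor(raw, levels = c(1, 2), labels = c("Male", "Female"))
gndr

# Tabulate the result, showing missing values if there are any
table(gndr, useNA = "ifany")
```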

In [None]:
d <- readRDS("data/ess-one-round-29vars-fr.rds")

First, let us look at the distribution of the `gndr` variable:

In [None]:
summary(d$gndr)

In [None]:
p <- ggplot(data = d, aes(x = gndr)) +
  geom_bar() +
  labs(x = "Sex of respondent", y = "Counts")
p

Now, let's look at the distribution of age:

In [None]:
summary(d$agea)

In [None]:
p <- ggplot(data = d, aes(x = agea)) +
  geom_line(stat = "density", bw = 2) +
  labs(x = "Age of respondent")
p

In [None]:
p <- ggplot(data = d, aes(x = agea)) +
  geom_histogram(binwidth = 5) +
  facet_grid(~ gndr) +
  labs(x = "Age of respondent")
p

In [None]:
p <- ggplot(data = d, aes(x = gndr, y = agea)) +
  geom_boxplot() +
  coord_flip() +
  labs(x = NULL, y = "Age of respondents")
p

## Data from other countries

Data from all other participating countries can be loaded in the same manner:

In [None]:
db <- readRDS("data/ess-one-round.rds")
cat("No. observations =", nrow(db))
table(db$cntry)

Since French data are (deliberately) missing from this dataset, we can append them to the above data frame as follows: 

In [None]:
db <- rbind.data.frame(db, d)
cat("No. observations =", nrow(db))
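Appending rows this way requires both data frames to have the same column names. The sketch below illustrates this on invented toy data frames (`a`, `b`, `b2`), which are unrelated to the ESS files.

```r
a <- data.frame(cntry = c("DE", "ES"), agea = c(40, 35),
                stringsAsFactors = FALSE)

# Same columns as `a`, but in a different order
b <- data.frame(agea = c(52, 29), cntry = c("FR", "FR"),
                stringsAsFactors = FALSE)

# rbind() matches data frame columns by name, not by position.
ab <- rbind(a, b)
ab
nrow(ab)

# A data frame with a differently named column cannot be appended.
b2 <- data.frame(cntry = "FR", age = 29, stringsAsFactors = FALSE)
res <- try(rbind(a, b2), silent = TRUE)
inherits(res, "try-error")
```

Comparing `names()` of both data frames before calling `rbind` is a cheap way to catch such mismatches early.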

In [None]:
db$cntry <- factor(db$cntry)
table(db$cntry)

Remember that it is also possible to use `summary()` with a factor variable to display a table of counts.
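As a quick check on a made-up factor (the country codes below are illustrative and not drawn from the ESS data), both functions report the same counts:

```r
cntry <- factor(c("FR", "DE", "FR", "ES", "DE", "FR"))

levels(cntry)   # levels are sorted alphabetically by default
summary(cntry)  # counts per level, as a named vector
table(cntry)    # the same counts, as a table object

all(summary(cntry) == table(cntry))
```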