# Introduction to R

This introduction to the R language shows how to represent and manipulate the data objects commonly found in *data science*.

## Installing R and RStudio

The R statistical package can be installed from [CRAN](https://cran.r-project.org). Be sure to also download [RStudio](https://www.rstudio.com), as it provides a full-featured user interface for interacting with R.

## Useful additional packages

This tutorial mainly relies on the core facilities that ship with R, the so-called [base packages](https://stackoverflow.com/a/9705725). However, it is possible to install additional packages as shown below:

    install.packages("ggplot2")
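
Running `install.packages()` every time a script is executed is wasteful. A minimal sketch of a guard follows; the helper name `install_if_missing` is hypothetical and not part of R or of this tutorial, and it relies only on the base function `requireNamespace()`:

```r
# Hypothetical helper: install only the packages that are not yet available
install_if_missing <- function(pkgs) {
  available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  absent <- pkgs[!available]
  if (length(absent) > 0) {
    install.packages(absent)
  }
  invisible(absent)  # names of the packages that had to be installed
}

# "stats" ships with R, so nothing is installed here
install_if_missing("stats")
```

In this tutorial the call would be `install_if_missing("ggplot2")`.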

## Setup

The following settings will be used in this practical:

In [None]:
library(ggplot2)
theme_set(theme_minimal())

## Getting started

### Variables

There are fundamentally two kinds of data structures in statistics-oriented programming languages: numbers and strings. Numbers can be integers or real numbers, and they are used to represent the values observed for a continuous or discrete statistical variable. Strings are everything else that cannot be represented as a number or a list of numbers, e.g. the address of a building or the answer to an open-ended question in a survey.
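
As a small illustration of this distinction, the sketch below stores one value of each kind and asks R how it is represented. The variable names and values are made up for the example:

```r
# Numeric values: real numbers are stored as "double", whole numbers can be "integer"
height <- 172.5                  # hypothetical measurement in cm
n_children <- 2L                 # hypothetical count; the L suffix requests an integer

# Strings are stored as "character"
address <- "12 rue de la Paix"   # hypothetical free-text value

typeof(height)       # "double"
typeof(n_children)   # "integer"
typeof(address)      # "character"

# Both numeric types answer TRUE to is.numeric(); a string does not
is.numeric(height)
is.numeric(n_children)
is.numeric(address)
```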

Here is how we can create a simple variable, say `x`, to store a list (in R terms, a vector) of 5 numerical values:

In [None]:
x <- c(1, 3, 2, 5, 4)

Note that the symbol `<-` is the recommended assignment operator, although `=` can also be used to assign a value to the variable that appears on the left-hand side of the expression. Also, the series of values is enclosed in round brackets after `c` (the function that combines values into a vector), and the values are separated by commas. From now on, we will speak interchangeably of values or of observations, as if each were a measurement collected on a statistical unit.
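
The two assignment forms can be compared directly. The sketch below also shows one reason `<-` is usually preferred: inside a function call, `=` names an argument rather than creating a variable. The variable names `y` and `m` are introduced only for this example:

```r
# Two equivalent ways of assigning the same vector
x <- c(1, 3, 2, 5, 4)   # recommended form
y = c(1, 3, 2, 5, 4)    # also valid at the top level

identical(x, y)         # TRUE: both variables hold the same values

# Inside a call, `x =` names the argument of mean(); it does not touch our x
m <- mean(x = c(10, 20))
m                       # 15
x                       # unchanged: 1 3 2 5 4
```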

Some properties of this newly created variable can be queried interactively, e.g. how many elements `x` has or how those elements are represented in R:

In [None]:
length(x)
typeof(x)

Note that `x` contains values stored as real numbers (`double`), although they could just as well be stored as integers. It is possible to ask R to use true integer values by adding the `L` suffix to each number:

In [None]:
x <- c(1L, 3L, 2L, 5L, 4L)
typeof(x)
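
When a vector already exists as `double`, it can also be converted after the fact. The following sketch uses `as.integer()` from base R; the names `x_dbl` and `x_int` are chosen for the example only:

```r
x_dbl <- c(1, 3, 2, 5, 4)
x_int <- as.integer(x_dbl)   # convert an existing double vector

typeof(x_dbl)                # "double"
typeof(x_int)                # "integer"

# The values compare equal, but the two objects are not identical
all(x_dbl == x_int)          # TRUE
identical(x_dbl, x_int)      # FALSE, because the storage types differ

# Caution: as.integer() truncates toward zero rather than rounding
as.integer(c(2.9, -2.9))     # 2 -2
```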

# Practical use case: The ESS survey

The `data` directory includes three [RDS](https://www.rdocumentation.org/packages/base/versions/3.5.3/topics/readRDS) files related to the [European Social Survey](https://www.europeansocialsurvey.org) (ESS). This survey first ran in 2002 (round 1), and it is currently repeated every two years. The codebook can be downloaded, along with [other data sheets](http://www.europeansocialsurvey.org/data/download.html), from the main website.

There are two files related to data collected in France (round 1 or rounds 1-5, `ess-*-fr.rds`) and one file for all participating countries (`ess-one-round.rds`).

## French data

Assuming the `data` directory is available in the current working directory, here is how we can load French data for round 1:

In [None]:
d <- readRDS("data/ess-one-round-fr.rds")
head(d[1:10])
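
For readers who do not have the `data` directory at hand, the RDS mechanism can be tried on a toy object. This is only a sketch: the data frame `toy` and its values are invented, and a temporary file stands in for the ESS files:

```r
# Hypothetical toy data frame standing in for a survey file
toy <- data.frame(
  id   = 1:3,
  agea = c(34, 51, 27),
  gndr = factor(c("Male", "Female", "Female"))
)

# Write the object to a temporary file, then read it back
path <- tempfile(fileext = ".rds")
saveRDS(toy, path)
toy2 <- readRDS(path)

# Column types and factor levels survive the round trip
identical(toy, toy2)

# Indexing a data frame with 1:2 keeps its first two columns,
# in the same way that d[1:10] keeps the first ten
head(toy2[1:2])
```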

In [None]:
table(d$yrbrn)

In [None]:
summary(d$agea)

Let us focus on the following list of variables, readily available in the file `ess-one-round-29vars-fr.rds`:

- `tvtot`: TV watching, total time on average weekday
- `rdtot`: Radio listening, total time on average weekday
- `nwsptot`: Newspaper reading, total time on average weekday
- `polintr`: How interested in politics
- `trstlgl`: Trust in the legal system
- `trstplc`: Trust in the police
- `trstplt`: Trust in politicians
- `vote`: Voted last national election
- `happy`: How happy are you
- `sclmeet`: How often socially meet with friends, relatives or colleagues
- `inmdisc`: Anyone to discuss intimate and personal matters with
- `sclact`: Take part in social activities compared to others of same age
- `health`: Subjective general health
- `ctzcntr`: Citizen of country
- `brncntr`: Born in country
- `facntr`: Father born in country
- `mocntr`: Mother born in country
- `hhmmb`: Number of people living regularly as member of household
- `gndr`: Gender
- `yrbrn`: Year of birth
- `agea`: Age of respondent, calculated
- `edulvla`: Highest level of education
- `eduyrs`: Years of full-time education completed
- `pdjobyr`: Year last in paid job
- `wrkctr`: Employment contract unlimited or limited duration
- `wkhct`: Total contracted hours per week in main job overtime excluded
- `marital`: Legal marital status
- `martlfr`: Legal marital status, France
- `lvghw`: Currently living with husband/wife

Note that variables in the file `ess-one-round-29vars-fr.rds` have been recoded, and categorical variables now have proper labels. The script file `scripts/ess-one-round-29vars-fr.r` shows what has been done to the base file.
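
The sketch below illustrates what attaching labels to a categorical variable typically looks like in R. It is not taken from the script mentioned above: the raw values and the 1/2 coding are assumptions made for the example.

```r
# Hypothetical raw codes; the 1/2 coding is an assumption for illustration
gndr_raw <- c(1, 2, 2, 1, 2)

# factor() maps each code listed in `levels` to the matching entry of `labels`
gndr <- factor(gndr_raw, levels = c(1, 2), labels = c("Male", "Female"))

levels(gndr)   # "Male" "Female"
table(gndr)    # Male: 2, Female: 3
```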

In [None]:
d <- readRDS("data/ess-one-round-29vars-fr.rds")

First, let us look at the distribution of the `gndr` variable:

In [None]:
summary(d$gndr)

In [None]:
p <- ggplot(data = d, aes(x = gndr)) +
  geom_bar() +
  labs(x = "Sex of respondent", y = "Counts")
p

Now, let's look at the distribution of age:

In [None]:
summary(d$agea)

In [None]:
p <- ggplot(data = d, aes(x = agea)) +
  geom_line(stat = "density", bw = 2) +
  labs(x = "Age of respondent")
p

In [None]:
p <- ggplot(data = d, aes(x = agea)) +
  geom_histogram(binwidth = 5) +
  facet_grid(~ gndr) +
  labs(x = "Age of respondent")
p

In [None]:
p <- ggplot(data = d, aes(x = gndr, y = agea)) +
  geom_boxplot() +
  coord_flip() +
  labs(x = NULL, y = "Age of respondents")
p
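
The quantities drawn by a boxplot can also be obtained numerically. The following sketch uses base R only and simulated data: the column names mirror the ESS variables, but the values are made up and do not reflect the survey.

```r
# Simulated data; values are invented for the example
set.seed(1)
toy <- data.frame(
  gndr = factor(rep(c("Male", "Female"), each = 50)),
  agea = round(runif(100, min = 18, max = 90))
)

# Quartiles of age within each level of gndr (the edges and middle line of each box)
by_sex <- tapply(toy$agea, toy$gndr, quantile, probs = c(0.25, 0.5, 0.75))
by_sex

# Median age per group
med <- tapply(toy$agea, toy$gndr, median)
med
```

`tapply()` splits the first argument by the grouping factor given as second argument and applies the function to each piece, which is the numeric counterpart of mapping `gndr` to the x axis.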

## Data from other countries

Data from all other participating countries can be loaded in the same manner:

In [None]:
db <- readRDS("data/ess-one-round.rds")
cat("No. observations =", nrow(db))
table(db$cntry)

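Since French data are (deliberately) missing from this dataset, we can append them to the above data frame as follows. Note that `rbind` requires both data frames to have the same columns, so `d` should hold the French data with the same variables as `db`: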

In [None]:
db <- rbind.data.frame(db, d)
cat("No. observations =", nrow(db))

In [None]:
db$cntry <- factor(db$cntry)
table(db$cntry)
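
The same steps can be tried on two small data frames. This is a sketch with invented values; the names `others`, `french`, `all_rows` and `mismatch` are hypothetical and only serve the example:

```r
# Two hypothetical data frames with identical column names and types
others <- data.frame(cntry = c("DE", "DE", "ES"), agea = c(40, 63, 29),
                     stringsAsFactors = FALSE)
french <- data.frame(cntry = c("FR", "FR"), agea = c(55, 31),
                     stringsAsFactors = FALSE)

# Rows of `french` are appended below those of `others`
all_rows <- rbind(others, french)
nrow(all_rows)                # 5

# Converting to a factor builds the set of levels from the combined values
all_rows$cntry <- factor(all_rows$cntry)
levels(all_rows$cntry)        # "DE" "ES" "FR"

# rbind() stops with an error when the column names do not match
mismatch <- data.frame(country = "IT", agea = 47)
res <- try(rbind(others, mismatch), silent = TRUE)
inherits(res, "try-error")    # TRUE
```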

Remember that it is also possible to use `summary()` with a factor variable to display a table of counts.
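
A short sketch on invented values shows that both functions report the same counts for a factor:

```r
cntry <- factor(c("FR", "DE", "FR", "ES", "FR"))   # hypothetical values

summary(cntry)   # named integer vector of counts per level
table(cntry)     # same counts, returned as a table object

# The counts agree once the table is flattened to a plain vector
all(summary(cntry) == as.vector(table(cntry)))     # TRUE
```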