<!--NAVIGATION-->
<span style='background: rgb(128, 128, 128, .15); width: 100%; display: block; padding: 10px 0 10px 10px'>< [Quiz](03.03-Quiz.ipynb) | [Contents](00.00-Index.ipynb) | [JDemetra](04.02-JDemetra.ipynb) ></span>

<a href="https://colab.research.google.com/github/eurostat/e-learning/blob/main/r-official-statistics/04.01-Data-Table.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>


<a id='top'></a>

# Data Processing
## Content  
- [Data Processing with data.frame](#dframe)
  - [Handling missing data](#miss)
  - [Identify and Remove Duplicate Data](#dup)
  - [Reshaping Your Data with tidyr](#tidyr)
  - [Transforming Your Data with dplyr](#dplyr)
- [data.table Package](#dtable)
- [data.table vs. data.frame](#versus)


<a id='dframe'></a>

## Data Processing with data.frame
data.frame is the most widely used data structure for data analysis in R, but it is not the most efficient. Later in this chapter we present data.table, an improved structure similar to data.frame, and highlight the most important differences in usage.  
But first, let us walk through some of the most common tasks when working with two-dimensional data in data.frame structures.  

<a id='miss'></a>

### Handling missing data
A common task in data analysis is dealing with missing values. In R, missing values are usually represented by NA, but a dataset may also use a sentinel value to mark them (e.g. 99).

#### Test for missing values
To identify missing values, use the is.na() function, which returns a logical object with TRUE at every position that holds an NA. is.na() works on vectors, lists, matrices, and data frames.

In [1]:
# vector with missing data
x <- c(1:4, NA, 6:7, NA)
x
# identify NAs in vector
is.na(x)

# data frame with missing data
df1 <- data.frame(col1 = c(1:3, NA),
                 col2 = c("this", NA, "is", "text"), 
                 col3 = c(TRUE, FALSE, TRUE, TRUE), 
                 col4 = c(2.5, 4.2, 3.2, NA),
                 stringsAsFactors = FALSE)
df1
# identify NAs in full data frame
is.na(df1)
# identify NAs in specific data frame column
is.na(df1$col4)

col1,col2,col3,col4
<int>,<chr>,<lgl>,<dbl>
1.0,this,True,2.5
2.0,,False,4.2
3.0,is,True,3.2
,text,True,


col1,col2,col3,col4
False,False,False,False
False,True,False,False
False,False,False,False
True,False,False,True


To identify the location or the number of NAs we can leverage the which() and sum() functions:

In [2]:
# identify location of NAs in vector
which(is.na(x))

# identify count of NAs in data frame
sum(is.na(df1))

For data frames, a convenient shortcut to compute the total missing values in each column is to use colSums():

In [3]:
colSums(is.na(df1))
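
A related view is the share of missing values per column rather than the count. The sketch below uses base R's `colMeans()` and `anyNA()`; the data frame `df_na` is a made-up example that mirrors `df1`, not part of the original notebook.

```r
# hypothetical example data, mirroring df1 above
df_na <- data.frame(col1 = c(1:3, NA),
                    col2 = c("this", NA, "is", "text"),
                    col4 = c(2.5, 4.2, 3.2, NA))

# proportion of missing values in each column
na_share <- colMeans(is.na(df_na))
na_share

# quick check: does the data frame contain any NA at all?
anyNA(df_na)
```

Because `is.na()` returns TRUE/FALSE values, their column means are simply the fraction of TRUE entries.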

#### Recode missing values
To recode missing values, or a value that represents missing values, we can use normal subsetting and assignment operations. For example, we can replace the missing values in vector x with the mean of x by first subsetting the vector to identify the NAs and then assigning a value to these elements. Similarly, if missing values are represented by another value (e.g. 99), we can subset the data for the elements that contain that value and assign the desired value to them.

In [4]:
# recode missing values with the mean
# vector with missing data
x
# recode NAs in the vector, then round for display
x[is.na(x)] <- mean(x, na.rm = TRUE)
round(x, 2)
# recode in a data.frame column
df1$col4[is.na(df1$col4)] <- mean(df1$col4, na.rm = TRUE)
df1
# data frame that codes missing values as 99
df2 <- data.frame(col1 = c(1:3, 99), col2 = c(2.5, 4.2, 99, 3.2))
df2
# change 99s to NAs
df2[df2 == 99] <- NA
df2

col1,col2,col3,col4
<int>,<chr>,<lgl>,<dbl>
1.0,this,True,2.5
2.0,,False,4.2
3.0,is,True,3.2
,text,True,3.3


col1,col2
<dbl>,<dbl>
1,2.5
2,4.2
3,99.0
99,3.2


col1,col2
<dbl>,<dbl>
1.0,2.5
2.0,4.2
3.0,
,3.2
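
The same subsetting idea can be applied to several columns at once. The following is a minimal sketch, assuming we want to fill every numeric column with its own mean; the data frame `df_imp` and its column names are hypothetical.

```r
# hypothetical data: two numeric columns and one character column
df_imp <- data.frame(a = c(1, NA, 3),
                     b = c(NA, 2, 4),
                     label = c("x", NA, "z"))

# find the numeric columns
num_cols <- vapply(df_imp, is.numeric, logical(1))

# replace NAs in each numeric column with that column's mean
df_imp[num_cols] <- lapply(df_imp[num_cols], function(col) {
  col[is.na(col)] <- mean(col, na.rm = TRUE)
  col
})
df_imp
```

Non-numeric columns such as `label` are left untouched, so their NAs remain.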


#### Exclude missing values
We can exclude missing values in a couple of different ways. First, to exclude missing values from mathematical operations, use the na.rm = TRUE argument. If you do not exclude these values, most functions will return NA.

In [5]:
# A vector with missing values
x <- c(1:4, NA, 6:7, NA)
x
# including NA values will produce an NA output
mean(x)
# excluding NA values will calculate the mathematical operation for all non-missing values
mean(x, na.rm = TRUE)

We may also desire to subset our data to obtain complete observations, those observations (rows) in our data that contain no missing data. We can do this in two different ways.

In [6]:
# data frame with missing values
df <- data.frame(col1 = c(1:3, NA),
                 col2 = c("this", NA,"is", "text"), 
                 col3 = c(TRUE, FALSE, TRUE, TRUE), 
                 col4 = c(2.5, 4.2, 3.2, NA),
                 stringsAsFactors = FALSE)
df

col1,col2,col3,col4
<int>,<chr>,<lgl>,<dbl>
1.0,this,True,2.5
2.0,,False,4.2
3.0,is,True,3.2
,text,True,


First, to find complete cases we can leverage the complete.cases() function which returns a logical vector identifying rows which are complete cases.

In [7]:
df[complete.cases(df), ]

Unnamed: 0_level_0,col1,col2,col3,col4
Unnamed: 0_level_1,<int>,<chr>,<lgl>,<dbl>
1,1,this,True,2.5
3,3,is,True,3.2


A second alternative is to use na.omit(), which omits all rows containing missing values.

In [8]:
na.omit(df)

Unnamed: 0_level_0,col1,col2,col3,col4
Unnamed: 0_level_1,<int>,<chr>,<lgl>,<dbl>
1,1,this,True,2.5
3,3,is,True,3.2
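
Sometimes only certain columns must be complete. As a sketch under that assumption, `complete.cases()` can be applied to a subset of columns; the data frame `df_cc` below is a hypothetical example.

```r
# hypothetical data with NAs in different columns
df_cc <- data.frame(col1 = c(1:3, NA),
                    col2 = c("this", NA, "is", "text"),
                    col4 = c(2.5, 4.2, 3.2, NA))

# keep rows where col1 and col4 are present; NAs in col2 are tolerated
df_key <- df_cc[complete.cases(df_cc[, c("col1", "col4")]), ]
df_key
```

Here the second row is kept even though `col2` is missing, while the last row is dropped.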


<a id='dup'></a>

### Identify and Remove Duplicate Data
A dataset can contain duplicate rows. To keep it accurate and free of redundancy, these rows need to be identified and removed.
#### Identifying Duplicate Data
To identify duplicates we use the duplicated() function, which returns a logical vector that is TRUE for every row that repeats an earlier row. Wrapping it in sum() gives the number of duplicate rows.

In [9]:
# Creating a sample data frame of students 
# and their marks in respective subjects.
student_result=data.frame(name=c("Ramee","Don","John","Paul",
                                 "Cassie","Don","Paul"),
                          maths=c(7,8,8,9,10,8,9),
                          science=c(5,7,6,8,9,7,8),
                          history=c(7,7,7,7,7,7,7))
  
# Printing data
student_result
duplicated(student_result)
sum(duplicated(student_result))

name,maths,science,history
<chr>,<dbl>,<dbl>,<dbl>
Ramee,7,5,7
Don,8,7,7
John,8,6,7
Paul,9,8,7
Cassie,10,9,7
Don,8,7,7
Paul,9,8,7
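
Note that duplicated() does not flag the first occurrence of a repeated row. If we want to see every copy, one option is to combine it with the `fromLast = TRUE` argument. The sketch below uses a small hypothetical data frame `marks`.

```r
# hypothetical data: "Don" appears twice
marks <- data.frame(name = c("Ramee", "Don", "John", "Don"),
                    maths = c(7, 8, 8, 8))

# duplicated() alone marks only the second "Don" row
duplicated(marks)

# scanning from both ends marks every row that has a twin
all_copies <- duplicated(marks) | duplicated(marks, fromLast = TRUE)
marks[all_copies, ]
```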


#### Removing Duplicate Data
There are two functions we can use to accomplish the task: `unique()` from base R and `distinct()` from the `dplyr` package (part of the `tidyverse`). The `distinct()` function is more flexible, but you need to have the corresponding package installed.

In [10]:
unique(student_result)

Unnamed: 0_level_0,name,maths,science,history
Unnamed: 0_level_1,<chr>,<dbl>,<dbl>,<dbl>
1,Ramee,7,5,7
2,Don,8,7,7
3,John,8,6,7
4,Paul,9,8,7
5,Cassie,10,9,7
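
For comparison, here is a short sketch of `distinct()`, assuming `dplyr` is installed. The data frame `marks` is a hypothetical, shortened version of the student data.

```r
library(dplyr)

# hypothetical data: the "Don" row appears twice
marks <- data.frame(name = c("Ramee", "Don", "John", "Don"),
                    maths = c(7, 8, 8, 8),
                    science = c(5, 7, 6, 7))

# same idea as unique(): drop rows that are identical across all columns
all_distinct <- distinct(marks)
all_distinct

# judge uniqueness on maths only; .keep_all = TRUE keeps the other
# columns from the first row found for each maths value
by_maths <- distinct(marks, maths, .keep_all = TRUE)
by_maths
```

The second call illustrates the extra flexibility: uniqueness can be defined on a subset of columns.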


<a id='tidyr'></a>

### Reshaping Your Data with tidyr
Base R offers many fundamental data processing functions, but they lack a consistent interface and do not chain together easily. Some packages are more consistent and easier to use.  
`tidyr` is one such package, built for the sole purpose of simplifying the process of creating _tidy data_. Next we give a basic introduction to four fundamental functions from tidyr:
- gather() - makes “wide” data longer
- spread() - makes “long” data wider
- separate() - splits a single column into multiple columns
- unite() - combines multiple columns into a single column

In [11]:
if (!require("tidyr")) {
    install.packages("tidyr")
    library(tidyr)
}

Loading required package: tidyr



#### gather( ) & spread() functions:
There are times when our data is considered unstacked and a common attribute of concern is spread out across columns. To reformat the data such that these common attributes are gathered together as a single variable, the `gather()` function will take multiple columns and collapse them into key-value pairs, duplicating all other columns as needed.  
This function is a complement to `spread()`.

In [12]:
# wide
df_m <- read.csv('data/tidier.csv', row.names = 1)
df_m
# long: tidier (with pipe)
df_l <- df_m %>% gather(Quarter, Revenue, Qtr.1:Qtr.4)
head(df_l)

# These all produce the same results:
df_m %>% gather(Quarter, Revenue, -Group, -Year) %>% str()
df_m %>% gather(Quarter, Revenue, 3:6) %>% str()
df_m %>% gather(Quarter, Revenue, Qtr.1, Qtr.2, Qtr.3, Qtr.4) %>% str()

# now going back with spread()
df_l %>% spread(Quarter, Revenue)

Unnamed: 0_level_0,Group,Year,Qtr.1,Qtr.2,Qtr.3,Qtr.4
Unnamed: 0_level_1,<int>,<int>,<int>,<int>,<int>,<int>
1,1,2006,15,16,19,17
2,1,2007,12,13,27,23
3,1,2008,22,22,24,20
4,1,2009,10,14,20,16
5,2,2006,12,13,25,18
6,2,2007,16,14,21,19
7,2,2008,13,11,29,15
8,2,2009,23,20,26,20
9,3,2006,11,12,22,16
10,3,2007,13,11,27,21


Unnamed: 0_level_0,Group,Year,Quarter,Revenue
Unnamed: 0_level_1,<int>,<int>,<chr>,<int>
1,1,2006,Qtr.1,15
2,1,2007,Qtr.1,12
3,1,2008,Qtr.1,22
4,1,2009,Qtr.1,10
5,2,2006,Qtr.1,12
6,2,2007,Qtr.1,16


'data.frame':	48 obs. of  4 variables:
 $ Group  : int  1 1 1 1 2 2 2 2 3 3 ...
 $ Year   : int  2006 2007 2008 2009 2006 2007 2008 2009 2006 2007 ...
 $ Quarter: chr  "Qtr.1" "Qtr.1" "Qtr.1" "Qtr.1" ...
 $ Revenue: int  15 12 22 10 12 16 13 23 11 13 ...
'data.frame':	48 obs. of  4 variables:
 $ Group  : int  1 1 1 1 2 2 2 2 3 3 ...
 $ Year   : int  2006 2007 2008 2009 2006 2007 2008 2009 2006 2007 ...
 $ Quarter: chr  "Qtr.1" "Qtr.1" "Qtr.1" "Qtr.1" ...
 $ Revenue: int  15 12 22 10 12 16 13 23 11 13 ...
'data.frame':	48 obs. of  4 variables:
 $ Group  : int  1 1 1 1 2 2 2 2 3 3 ...
 $ Year   : int  2006 2007 2008 2009 2006 2007 2008 2009 2006 2007 ...
 $ Quarter: chr  "Qtr.1" "Qtr.1" "Qtr.1" "Qtr.1" ...
 $ Revenue: int  15 12 22 10 12 16 13 23 11 13 ...


Group,Year,Qtr.1,Qtr.2,Qtr.3,Qtr.4
<int>,<int>,<int>,<int>,<int>,<int>
1,2006,15,16,19,17
1,2007,12,13,27,23
1,2008,22,22,24,20
1,2009,10,14,20,16
2,2006,12,13,25,18
2,2007,16,14,21,19
2,2008,13,11,29,15
2,2009,23,20,26,20
3,2006,11,12,22,16
3,2007,13,11,27,21
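
In tidyr 1.0.0 and later, `gather()` and `spread()` are superseded by `pivot_longer()` and `pivot_wider()`, although the older functions still work. The sketch below repeats the round trip with the newer functions on a small hypothetical data frame `rev_wide`, so it does not depend on the CSV file used above.

```r
library(tidyr)

# hypothetical wide data: one column per quarter
rev_wide <- data.frame(Group = c(1, 1),
                       Year = c(2006, 2007),
                       Qtr.1 = c(15, 12),
                       Qtr.2 = c(16, 13))

# wide -> long (the counterpart of gather())
rev_long <- pivot_longer(rev_wide, cols = Qtr.1:Qtr.2,
                         names_to = "Quarter", values_to = "Revenue")
rev_long

# long -> wide (the counterpart of spread())
rev_back <- pivot_wider(rev_long, names_from = Quarter, values_from = Revenue)
rev_back
```

One visible difference is row order: `pivot_longer()` keeps the values of each original row together, whereas `gather()` stacks one column after another.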


#### separate( ) & unite() functions:
Often a single column captures multiple variables, or includes parts of a value that you do not need. The `separate()` function splits such a column into several columns.  

This function is a complement to `unite()`.  
We can go back to the df_l dataframe we created above, in which we may want to clean up or separate the Quarter variable.

In [13]:
samp <- sample(nrow(df_l), 5)
df_l[samp, ]
df_sep <- df_l %>% separate(Quarter, c("Time_Interval", "Interval_ID"))
df_sep[samp, ]

# These produce the same result:
df_sep %>% str()
df_l %>% separate(Quarter, c("Time_Interval", "Interval_ID"), sep = "\\.") %>% str()

# now going back with unite()
df_sep %>% unite(Quarter, Time_Interval, Interval_ID, sep = ".") %>% head()

Unnamed: 0_level_0,Group,Year,Quarter,Revenue
Unnamed: 0_level_1,<int>,<int>,<chr>,<int>
12,3,2009,Qtr.1,14
33,3,2006,Qtr.3,22
28,1,2009,Qtr.3,20
2,1,2007,Qtr.1,12
15,1,2008,Qtr.2,22


Unnamed: 0_level_0,Group,Year,Time_Interval,Interval_ID,Revenue
Unnamed: 0_level_1,<int>,<int>,<chr>,<chr>,<int>
12,3,2009,Qtr,1,14
33,3,2006,Qtr,3,22
28,1,2009,Qtr,3,20
2,1,2007,Qtr,1,12
15,1,2008,Qtr,2,22


'data.frame':	48 obs. of  5 variables:
 $ Group        : int  1 1 1 1 2 2 2 2 3 3 ...
 $ Year         : int  2006 2007 2008 2009 2006 2007 2008 2009 2006 2007 ...
 $ Time_Interval: chr  "Qtr" "Qtr" "Qtr" "Qtr" ...
 $ Interval_ID  : chr  "1" "1" "1" "1" ...
 $ Revenue      : int  15 12 22 10 12 16 13 23 11 13 ...
'data.frame':	48 obs. of  5 variables:
 $ Group        : int  1 1 1 1 2 2 2 2 3 3 ...
 $ Year         : int  2006 2007 2008 2009 2006 2007 2008 2009 2006 2007 ...
 $ Time_Interval: chr  "Qtr" "Qtr" "Qtr" "Qtr" ...
 $ Interval_ID  : chr  "1" "1" "1" "1" ...
 $ Revenue      : int  15 12 22 10 12 16 13 23 11 13 ...


Unnamed: 0_level_0,Group,Year,Quarter,Revenue
Unnamed: 0_level_1,<int>,<int>,<chr>,<int>
1,1,2006,Qtr.1,15
2,1,2007,Qtr.1,12
3,1,2008,Qtr.1,22
4,1,2009,Qtr.1,10
5,2,2006,Qtr.1,12
6,2,2007,Qtr.1,16
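
In the output above, `Interval_ID` is stored as character. If a numeric column is preferred, `separate()` accepts `convert = TRUE`. A minimal sketch on hypothetical data (`q_long`):

```r
library(tidyr)

# hypothetical long data with a combined Quarter column
q_long <- data.frame(Quarter = c("Qtr.1", "Qtr.2"),
                     Revenue = c(15, 16))

# split on the dot and let tidyr convert the new columns to suitable types
q_sep <- separate(q_long, Quarter,
                  into = c("Time_Interval", "Interval_ID"),
                  sep = "\\.", convert = TRUE)
str(q_sep)
```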


<a id='dplyr'></a>

### Transforming Your Data with dplyr
For much the same reason that we introduced `tidyr`, we now introduce another useful package, `dplyr`.  
This package was built to simplify the process of manipulating, sorting, summarizing, and joining data frames. The fundamental data transformation functions that dplyr offers include:
- select() - selects variables
- filter() - provides basic filtering capabilities
- group_by() - groups data by categorical levels
- summarise() - summarizes data by functions of choice
- arrange() - orders data
- left_join(), inner_join(), etc. - join separate dataframes
- mutate() - creates new variables

In [14]:
if (!require("dplyr")) {
    install.packages("dplyr")
    library(dplyr)
}

Loading required package: dplyr


Attaching package: 'dplyr'


The following objects are masked from 'package:stats':

    filter, lag


The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union




#### select( ) function:
Reduce dataframe size to only desired variables for current task.  
When working with a sizable dataframe, often we desire to only assess specific variables. The select() function allows you to select and/or rename variables.  
  
_Note: In the following example we will use some public data from datasets package._

In [15]:
# library(help = "datasets")
# Monthly Airline Passenger Numbers 1949-1960
str(AirPassengers)
# this one is a time series, let's convert it into a data.frame
df_air <- data.frame(t(matrix(AirPassengers, 12, dimnames=list(month.abb))), row.names=1949:1960)
head(df_air)
# selecting just the summer months
df_air %>% select(Jun, Jul, Aug) %>% head()
# or these
df_air %>% select(Jun:Aug) %>% head()
df_air %>% select(6:8) %>% head()
# or for selection you can use one of the special functions: 
# starts_with, ends_with, contains, matches
df_air %>% select(starts_with('J')) %>% head()


 Time-Series [1:144] from 1949 to 1961: 112 118 132 129 121 135 148 148 136 119 ...


Unnamed: 0_level_0,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1949,112,118,132,129,121,135,148,148,136,119,104,118
1950,115,126,141,135,125,149,170,170,158,133,114,140
1951,145,150,178,163,172,178,199,199,184,162,146,166
1952,171,180,193,181,183,218,230,242,209,191,172,194
1953,196,196,236,235,229,243,264,272,237,211,180,201
1954,204,188,235,227,234,264,302,293,259,229,203,229


Unnamed: 0_level_0,Jun,Jul,Aug
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>
1949,135,148,148
1950,149,170,170
1951,178,199,199
1952,218,230,242
1953,243,264,272
1954,264,302,293


Unnamed: 0_level_0,Jun,Jul,Aug
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>
1949,135,148,148
1950,149,170,170
1951,178,199,199
1952,218,230,242
1953,243,264,272
1954,264,302,293


Unnamed: 0_level_0,Jun,Jul,Aug
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>
1949,135,148,148
1950,149,170,170
1951,178,199,199
1952,218,230,242
1953,243,264,272
1954,264,302,293


Unnamed: 0_level_0,Jan,Jun,Jul
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>
1949,112,135,148
1950,115,149,170
1951,145,178,199
1952,171,218,230
1953,196,243,264
1954,204,264,302
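
The examples above only select columns. As a sketch of the other two uses mentioned, renaming while selecting and dropping columns, consider the following. The data frame `air` is a hypothetical two-year excerpt with only four months.

```r
library(dplyr)

# hypothetical excerpt of the passenger data
air <- data.frame(Jan = c(112, 115), Jun = c(135, 149),
                  Jul = c(148, 170), Dec = c(118, 140),
                  row.names = 1949:1950)

# select and rename in one step: new_name = old_name
summer <- air %>% select(June = Jun, July = Jul)
summer

# drop columns by prefixing them with a minus sign
no_winter <- air %>% select(-Jan, -Dec)
no_winter
```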


#### filter( ) function:
Reduce rows/observations with matching conditions.  
Filtering data is a common task to identify/select observations in which a particular variable matches a specific value/condition. The filter() function provides this capability.

In [16]:
# Motor Trend Car Road Tests
head(mtcars, 4)
# filter the cars with 4 cylinders
head(mtcars %>% filter(cyl == 4), 4)

Unnamed: 0_level_0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
Mazda RX4,21.0,6,160,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21.0,6,160,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
Hornet 4 Drive,21.4,6,258,110,3.08,3.215,19.44,1,0,3,1


Unnamed: 0_level_0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
Merc 240D,24.4,4,146.7,62,3.69,3.19,20.0,1,0,4,2
Merc 230,22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
Fiat 128,32.4,4,78.7,66,4.08,2.2,19.47,1,1,4,1


We can apply multiple logic rules in the filter() function such as:
- %in% - Group membership
- is.na - is NA
- &,|,! - Boolean operators
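
The rules listed above can be sketched as follows, using the built-in `mtcars` data. The small data frame `toy` is hypothetical and only serves the `is.na()` case.

```r
library(dplyr)

# %in%: keep cars with 4 or 6 cylinders
small_engines <- mtcars %>% filter(cyl %in% c(4, 6))
nrow(small_engines)

# & and !: 4-cylinder cars that are not manual (am == 1 means manual)
auto_4cyl <- mtcars %>% filter(cyl == 4 & !(am == 1))
auto_4cyl

# |: cars that are either very light or very powerful
extremes <- mtcars %>% filter(wt < 1.7 | hp > 300)
extremes

# is.na(): drop rows with a missing value (hypothetical toy data)
toy <- data.frame(id = 1:3, val = c(2.5, NA, 3.2))
toy_complete <- toy %>% filter(!is.na(val))
toy_complete
```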

#### group_by( ) function:
Group data by categorical variables.  
Often, observations are nested within groups or categories and our goal is to perform statistical analysis both at the observation level and at the group level. The group_by() function allows us to create these categorical groupings.

In [17]:
cars.cyl <- mtcars %>% group_by(cyl)
head(cars.cyl, 4)


mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
21.0,6,160,110,3.9,2.62,16.46,0,1,4,4
21.0,6,160,110,3.9,2.875,17.02,0,1,4,4
22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
21.4,6,258,110,3.08,3.215,19.44,1,0,3,1


The group_by() function is a silent function: it does not visibly change the data, it only records the grouping for later operations to use. The real magic of group_by() comes when we compute summary statistics, which we will cover shortly.

#### summarise( ) function:
Perform summary statistics on variables.  
The goal of all this data wrangling is to be able to perform statistical analysis on our data. The summarise() function covers most of the initial summary statistics needed in exploratory data analysis.

In [18]:
# an important metric for cars is the fuel consumption
mtcars %>% summarise(Min = min(mpg),
                     Median = median(mpg, na.rm=TRUE),
                     Mean = mean(mpg, na.rm=TRUE),
                     Var = var(mpg, na.rm=TRUE),
                     SD = sd(mpg, na.rm=TRUE),
                     Max = max(mpg, na.rm=TRUE),
                     N = n())

Min,Median,Mean,Var,SD,Max,N
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>
10.4,19.2,20.09062,36.3241,6.026948,33.9,32


This information is useful, but being able to compare summary statistics at multiple levels is when you really start to gather some insights. This is where the group_by() function comes in.

In [19]:
cars.cyl %>% summarise(Min = min(mpg),
                     Median = median(mpg, na.rm=TRUE),
                     Mean = mean(mpg, na.rm=TRUE),
                     Var = var(mpg, na.rm=TRUE),
                     SD = sd(mpg, na.rm=TRUE),
                     Max = max(mpg, na.rm=TRUE),
                     N = n())

cyl,Min,Median,Mean,Var,SD,Max,N
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>
4,21.4,26.0,26.66364,20.338545,4.509828,33.9,11
6,17.8,19.7,19.74286,2.112857,1.453567,21.4,7
8,10.4,15.2,15.1,6.553846,2.560048,19.2,14
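
Grouping is not limited to one variable. The sketch below groups `mtcars` by two columns; the `.groups = "drop"` argument assumes dplyr 1.0.0 or later and returns an ungrouped result.

```r
library(dplyr)

# mean mpg and number of cars for each cylinder / transmission combination
by_cyl_am <- mtcars %>%
  group_by(cyl, am) %>%
  summarise(Mean = mean(mpg), N = n(), .groups = "drop")
by_cyl_am
```

Each row of the result corresponds to one combination of `cyl` and `am` that occurs in the data.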


#### arrange( ) function:
Order variable values.  
Often, we want to view observations in rank order for one or more variables. The arrange() function allows us to order data by variables in ascending or descending order.

In [20]:
mtcars %>% arrange(cyl) %>% head()

Unnamed: 0_level_0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
Merc 240D,24.4,4,146.7,62,3.69,3.19,20.0,1,0,4,2
Merc 230,22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
Fiat 128,32.4,4,78.7,66,4.08,2.2,19.47,1,1,4,1
Honda Civic,30.4,4,75.7,52,4.93,1.615,18.52,1,1,4,2
Toyota Corolla,33.9,4,71.1,65,4.22,1.835,19.9,1,1,4,1
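
The example above sorts in ascending order. For descending order, dplyr provides the `desc()` helper, which can be mixed with ascending keys. A short sketch:

```r
library(dplyr)

# sort by cylinders (ascending), then by mpg (descending) within each group
top_mpg <- mtcars %>% arrange(cyl, desc(mpg)) %>% head(3)
top_mpg
```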


#### join( ) functions:
Join two datasets together.  
Often we have separate dataframes with both common and differing variables for similar observations, and we wish to join them. There are multiple join functions providing different ways to join dataframes: inner_join, left_join, right_join, full_join, semi_join, anti_join.

In [21]:
# Dataframe “x”:
x <- data.frame(name=c('John', 'Paul', 'George', 'Ringo', 'Stuart', 'Pete'),
               instrument=c('guitar', 'bass', 'guitar', 'drums', 'bass', 'drums'))
x
# Dataframe “y”:
y <- data.frame(name=c('John', 'Paul', 'George', 'Ringo', 'Brian'),
               band=c(T, T, T, T, F))
y

name,instrument
<chr>,<chr>
John,guitar
Paul,bass
George,guitar
Ringo,drums
Stuart,bass
Pete,drums


name,band
<chr>,<lgl>
John,True
Paul,True
George,True
Ringo,True
Brian,False


##### Left Join
Include all of x, and matching rows of y

In [22]:
x %>% left_join(y)

[1m[22mJoining, by = "name"


name,instrument,band
<chr>,<chr>,<lgl>
John,guitar,True
Paul,bass,True
George,guitar,True
Ringo,drums,True
Stuart,bass,
Pete,drums,


##### Inner Join
Include only rows in both x and y that have a matching value

In [23]:
inner_join(x,y)

[1m[22mJoining, by = "name"


name,instrument,band
<chr>,<chr>,<lgl>
John,guitar,True
Paul,bass,True
George,guitar,True
Ringo,drums,True


##### Semi Join
Include rows of x that match y but only keep the columns from x

In [24]:
semi_join(x,y)

[1m[22mJoining, by = "name"


name,instrument
<chr>,<chr>
John,guitar
Paul,bass
George,guitar
Ringo,drums


##### Anti Join
Opposite of semi_join: include rows of x that have no match in y, keeping only the columns from x

In [25]:
anti_join(x,y)

[1m[22mJoining, by = "name"


name,instrument
<chr>,<chr>
Stuart,bass
Pete,drums
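
The list of join functions also includes full_join, which is not demonstrated above. The sketch below uses two small hypothetical data frames (`members`, `status`) and names the key explicitly with `by`, which also suppresses the "Joining, by" message.

```r
library(dplyr)

# hypothetical data frames sharing the key column "name"
members <- data.frame(name = c("John", "Paul", "Stuart"),
                      instrument = c("guitar", "bass", "bass"))
status <- data.frame(name = c("John", "Paul", "Brian"),
                     band = c(TRUE, TRUE, FALSE))

# full join: keep all rows from both sides, filling gaps with NA
all_rows <- full_join(members, status, by = "name")
all_rows
```

Stating the key explicitly is a cautious habit: it avoids accidental joins on columns that merely happen to share a name.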


#### mutate( ) function:
Creates new variables.  
Often we want to create a new variable that is a function of the current variables in our dataframe or even just add a new variable. The mutate() function allows us to add new variables while preserving the existing variables.

In [26]:
# back to car dataset
# let's create a column for power per cylinder and sort on it
mtcars %>% mutate(hp_cyl = hp/cyl) %>% arrange(hp_cyl) %>% head()

Unnamed: 0_level_0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb,hp_cyl
Unnamed: 0_level_1,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
Honda Civic,30.4,4,75.7,52,4.93,1.615,18.52,1,1,4,2,13.0
Merc 240D,24.4,4,146.7,62,3.69,3.19,20.0,1,0,4,2,15.5
Toyota Corolla,33.9,4,71.1,65,4.22,1.835,19.9,1,1,4,1,16.25
Fiat 128,32.4,4,78.7,66,4.08,2.2,19.47,1,1,4,1,16.5
Fiat X1-9,27.3,4,79.0,66,4.08,1.935,18.9,1,1,4,1,16.5
Valiant,18.1,6,225.0,105,2.76,3.46,20.22,1,0,3,1,17.5
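
mutate() also works on grouped data, in which case summary functions are evaluated within each group. As a sketch, the following adds the deviation of each car's mpg from the mean of its cylinder group; the column name `mpg_vs_group` is our own choice.

```r
library(dplyr)

cars_rel <- mtcars %>%
  group_by(cyl) %>%
  # mean(mpg) is computed per cylinder group, not over the whole table
  mutate(mpg_vs_group = mpg - mean(mpg)) %>%
  ungroup()
head(cars_rel)
```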


<a id='dtable'></a>

## data.table Package

data.table is an R package that provides an enhanced version of data.frame, the standard data structure for storing data in base R.  
It is generally considered faster and more memory efficient, especially on large datasets.  
It is also concise: many operations take fewer lines of code than the data.frame equivalent.  
Perhaps the best way to introduce the new structure is to compare it with data.frame:

<a id='versus'></a>

## data.table vs. data.frame
The key difference is a third argument, by, which is a game changer: it makes the syntax read much like SQL (i plays the role of WHERE, j of SELECT, and by of GROUP BY).  
<br>  
  
### data.frame  
syntax for selection: df[i, j]
- i: row selection
- j: column selection
<br><br>  
  
### data.table  
syntax for selection: dt[i, j, by]
- i: row selection
- j: column selection or function to apply
- by: grouping  
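
To make the three arguments concrete, here is a minimal sketch on the built-in `mtcars` data, converted with `as.data.table()`. The object names `cars_dt` and `res` are our own.

```r
library(data.table)

cars_dt <- as.data.table(mtcars)

# i: rows with manual transmission; j: mean mpg; by: one result per cyl
# roughly: SELECT cyl, AVG(mpg) AS mean_mpg FROM cars WHERE am = 1 GROUP BY cyl
res <- cars_dt[am == 1, .(mean_mpg = mean(mpg)), by = cyl]
res
```

With a plain data.frame the same result would need separate filtering, aggregation, and grouping steps.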

In [27]:
if (!require("data.table")) {
    install.packages("data.table")
    library(data.table)
}

Loading required package: data.table


Attaching package: 'data.table'


The following objects are masked from 'package:dplyr':

    between, first, last




### Read from file
The package comes with a general-purpose reader, `fread`, which is more customizable than the corresponding read.csv: 

In [28]:
# read the file in df
df <- read.csv('data/df_dt.csv')
str(df)
head(df)
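# read the file into a data.table
# fread is usually much faster than read.csv on large files
# the result is also a data.frame (it has both classes)
dt <- fread('data/df_dt.csv')
str(dt)
head(dt)
# fread kept Col3 as character ("T"/"F"), so convert it to logical
dt$Col3 <- dt$Col3 == 'T'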
head(dt)

'data.frame':	12 obs. of  5 variables:
 $ Index: chr  "A1" "B1" "C1" "D1" ...
 $ Col1 : int  1 5 9 13 1 5 9 13 1 5 ...
 $ Col2 : num  2.4 6.2 10.3 14.4 2.3 6.3 10.4 14.4 2.2 6.4 ...
 $ Col3 : logi  TRUE TRUE FALSE TRUE TRUE TRUE ...
 $ Col4 : chr  "A" "A" "B" "B" ...


Unnamed: 0_level_0,Index,Col1,Col2,Col3,Col4
Unnamed: 0_level_1,<chr>,<int>,<dbl>,<lgl>,<chr>
1,A1,1,2.4,True,A
2,B1,5,6.2,True,A
3,C1,9,10.3,False,B
4,D1,13,14.4,True,B
5,A2,1,2.3,True,A
6,B2,5,6.3,True,A


Classes 'data.table' and 'data.frame':	12 obs. of  5 variables:
 $ Index: chr  "A1" "B1" "C1" "D1" ...
 $ Col1 : int  1 5 9 13 1 5 9 13 1 5 ...
 $ Col2 : num  2.4 6.2 10.3 14.4 2.3 6.3 10.4 14.4 2.2 6.4 ...
 $ Col3 : chr  "T" "T" "F" "T" ...
 $ Col4 : chr  "A" "A" "B" "B" ...
 - attr(*, ".internal.selfref")=<externalptr> 


Index,Col1,Col2,Col3,Col4
<chr>,<int>,<dbl>,<chr>,<chr>
A1,1,2.4,T,A
B1,5,6.2,T,A
C1,9,10.3,F,B
D1,13,14.4,T,B
A2,1,2.3,T,A
B2,5,6.3,T,A


Index,Col1,Col2,Col3,Col4
<chr>,<int>,<dbl>,<lgl>,<chr>
A1,1,2.4,True,A
B1,5,6.2,True,A
C1,9,10.3,False,B
D1,13,14.4,True,B
A2,1,2.3,True,A
B2,5,6.3,True,A
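
As an illustration of how `fread` can be customized, the sketch below reads only some columns and rows. It uses the `text`, `select`, and `nrows` arguments; the CSV lines are hypothetical and passed inline so the example does not depend on a file.

```r
library(data.table)

# hypothetical CSV content, one element per line
csv_lines <- c("Index,Col1,Col2,Col4",
               "A1,1,2.4,A",
               "B1,5,6.2,A",
               "C1,9,10.3,B")

# read only two columns and the first two data rows
part <- fread(text = csv_lines, select = c("Index", "Col2"), nrows = 2)
part
```

Restricting columns and rows at read time can save memory on large files.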


### Rows Selection

In [29]:
# select rows in df
df[1:3,]

# select rows in dt
# the comma can be omitted
dt[1:3]


Unnamed: 0_level_0,Index,Col1,Col2,Col3,Col4
Unnamed: 0_level_1,<chr>,<int>,<dbl>,<lgl>,<chr>
1,A1,1,2.4,True,A
2,B1,5,6.2,True,A
3,C1,9,10.3,False,B


Index,Col1,Col2,Col3,Col4
<chr>,<int>,<dbl>,<lgl>,<chr>
A1,1,2.4,True,A
B1,5,6.2,True,A
C1,9,10.3,False,B


### Filtering

In [30]:
# filter rows in df
df[df$Col3 & df$Col4 == 'A',]

# filter rows in dt
# usually faster on large tables
# the comma is omitted
# columns can be used as variables, without the dt$ prefix
dt[Col3 & Col4 == 'A']

Unnamed: 0_level_0,Index,Col1,Col2,Col3,Col4
Unnamed: 0_level_1,<chr>,<int>,<dbl>,<lgl>,<chr>
1,A1,1,2.4,True,A
2,B1,5,6.2,True,A
5,A2,1,2.3,True,A
6,B2,5,6.3,True,A
9,A3,1,2.2,True,A
10,B3,5,6.4,True,A


Index,Col1,Col2,Col3,Col4
<chr>,<int>,<dbl>,<lgl>,<chr>
A1,1,2.4,True,A
B1,5,6.2,True,A
A2,1,2.3,True,A
B2,5,6.3,True,A
A3,1,2.2,True,A
B3,5,6.4,True,A


### Sorting

In [31]:
# order in df
df[order(df$Col1, df$Col2),]

# order in dt
# usually faster on large tables
dt[order(Col1, Col2)]

Unnamed: 0_level_0,Index,Col1,Col2,Col3,Col4
Unnamed: 0_level_1,<chr>,<int>,<dbl>,<lgl>,<chr>
9,A3,1,2.2,True,A
5,A2,1,2.3,True,A
1,A1,1,2.4,True,A
2,B1,5,6.2,True,A
6,B2,5,6.3,True,A
10,B3,5,6.4,True,A
3,C1,9,10.3,False,B
7,C2,9,10.4,False,B
11,C3,9,10.5,False,B
4,D1,13,14.4,True,B


Index,Col1,Col2,Col3,Col4
<chr>,<int>,<dbl>,<lgl>,<chr>
A3,1,2.2,True,A
A2,1,2.3,True,A
A1,1,2.4,True,A
B1,5,6.2,True,A
B2,5,6.3,True,A
B3,5,6.4,True,A
C1,9,10.3,False,B
C2,9,10.4,False,B
C3,9,10.5,False,B
D1,13,14.4,True,B
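
`dt[order(...)]` returns a sorted copy. data.table also offers `setorder()`, which sorts the table in place, and a minus sign for descending order. A minimal sketch on a hypothetical table `dt_s`:

```r
library(data.table)

# hypothetical table
dt_s <- data.table(Index = c("A1", "B1", "C1"),
                   Col1 = c(1L, 5L, 9L),
                   Col2 = c(2.4, 6.2, 10.3))

# sorted copy, descending by Col1; dt_s itself is unchanged
sorted_copy <- dt_s[order(-Col1)]

# sort dt_s itself, by reference (no copy is made)
setorder(dt_s, -Col1)
dt_s
```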


### Columns Selection

In [32]:
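# select a column of type chr

# in df both forms return a character vector
# (a factor in R < 4.0, where stringsAsFactors defaulted to TRUE)
df$Index
df[, 'Index']

# in dt these forms return a vector
dt$Index
dt[, Index]
# but this form returns a data.table with one column
dt[, 'Index']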

Index
<chr>
A1
B1
C1
D1
A2
B2
C2
D2
A3
B3


In [33]:
# select multiple columns

# in df returns data.frame
df[, c('Col1', 'Col2')]

# in dt returns data.table
dt[, c('Col1', 'Col2')]
# also data.table
dt[, list(Col1, Col2)]
# also data.table (. means list in data.table)
dt[, .(Col1, Col2)]
# this form returns a single vector: all values of Col1 followed by all values of Col2
dt[, c(Col1, Col2)]

Col1,Col2
<int>,<dbl>
1,2.4
5,6.2
9,10.3
13,14.4
1,2.3
5,6.3
9,10.4
13,14.4
1,2.2
5,6.4


Col1,Col2
<int>,<dbl>
1,2.4
5,6.2
9,10.3
13,14.4
1,2.3
5,6.3
9,10.4
13,14.4
1,2.2
5,6.4


Col1,Col2
<int>,<dbl>
1,2.4
5,6.2
9,10.3
13,14.4
1,2.3
5,6.3
9,10.4
13,14.4
1,2.2
5,6.4


Col1,Col2
<int>,<dbl>
1,2.4
5,6.2
9,10.3
13,14.4
1,2.3
5,6.3
9,10.4
13,14.4
1,2.2
5,6.4


In [34]:
# using variable for column name or id
cname <- 'Col2'
cnum <- 3

# for df this returns a vector whose type depends on the column
df[, cname]
df[, cnum]

# for dt this returns a data.table, but the .. prefix is required: it means 'one level up',
# i.e. look for cname and cnum in the calling environment, not among the dt columns
dt[, ..cname]
dt[, ..cnum]

Col2
<dbl>
2.4
6.2
10.3
14.4
2.3
6.3
10.4
14.4
2.2
6.4


Col2
<dbl>
2.4
6.2
10.3
14.4
2.3
6.3
10.4
14.4
2.2
6.4
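
Besides the `..` prefix, data.table offers two other ways to select columns whose names are stored in a variable: `with = FALSE` and `.SDcols`. The sketch below, on a hypothetical table `dt_c`, shows that all three give the same result.

```r
library(data.table)

# hypothetical table and a vector of wanted column names
dt_c <- data.table(Col1 = 1:3,
                   Col2 = c(2.4, 6.2, 10.3),
                   Col4 = c("A", "A", "B"))
wanted <- c("Col1", "Col2")

sel_a <- dt_c[, ..wanted]                  # .. prefix: look one level up
sel_b <- dt_c[, wanted, with = FALSE]      # treat j as column names
sel_c <- dt_c[, .SD, .SDcols = wanted]     # subset of data restricted to wanted
sel_a
```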


### Different table functionality

In [35]:
# cross-tabulation of Col3 by Col4

# with df
table(df$Col3, df$Col4)

# with dt
# usually faster on large tables
# .N is a special symbol holding the number of rows in each group
dt[, .(.N), by = .(Col3, Col4)]

       
        A B
  FALSE 0 3
  TRUE  6 3

Col3,Col4,N
<lgl>,<chr>,<int>
True,A,6
False,B,3
True,B,3
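
The count column can be given a name, and `keyby` can be used instead of `by` to get the groups back in sorted order. A minimal sketch on a hypothetical table `dt_n`:

```r
library(data.table)

# hypothetical table
dt_n <- data.table(Col3 = c(TRUE, TRUE, FALSE, TRUE),
                   Col4 = c("A", "A", "B", "B"))

# named count per group; keyby sorts the result by the grouping columns
counts <- dt_n[, .(count = .N), keyby = .(Col4, Col3)]
counts
```

Unlike table(), this result lists only the combinations that occur in the data.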


### Different aggregate functionality

In [36]:
# some aggregates: mean of Col2 per Col3 and Col4, for rows where Col3 is TRUE

# with df
aggregate(Col2~Col3+Col4, data=df, FUN=mean, subset=df$Col3)

# with dt
# usually faster on large tables
dt[Col3 == TRUE, .(mean(Col2)), by = .(Col3, Col4)]

Col3,Col4,Col2
<lgl>,<chr>,<dbl>
True,A,4.3
True,B,14.4


Col3,Col4,V1
<lgl>,<chr>,<dbl>
True,A,4.3
True,B,14.4
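
In the data.table output above the aggregate column gets the default name V1. The sketch below, on a hypothetical table `dt_a`, names the results explicitly and also shows the `:=` operator, which adds a column to the table by reference.

```r
library(data.table)

# hypothetical table
dt_a <- data.table(Col2 = c(2.4, 6.2, 10.3, 14.4),
                   Col4 = c("A", "A", "B", "B"))

# several named aggregates per group
stats <- dt_a[, .(mean_col2 = mean(Col2), n = .N), by = Col4]
stats

# := adds the group mean as a new column of dt_a itself (no copy)
dt_a[, group_mean := mean(Col2), by = Col4]
dt_a
```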



<span style='background: rgb(128, 128, 128, .15); width: 100%; display: block; padding: 10px 0 10px 10px'>This is the Jupyter notebook version of the __R for Official Statistics__ produced by Eurostat; the content is available [on GitHub](https://github.com/eurostat/e-learning/tree/main/r-official-statistics).
<br>The text and code are released under the [EUPL-1.2 license](https://github.com/eurostat/e-learning/blob/main/LICENSE).</span>