<!--NAVIGATION-->
<span style='background: rgb(128, 128, 128, .15); width: 100%; display: block; padding: 10px 0 10px 10px'>< [Quiz 2](02.03-Quiz.ipynb) | [Contents](00.00-Index.ipynb) | [Data sources for Official Statistics](03.02-Official-Statistics.ipynb) ></span>

<a href="https://colab.research.google.com/github/eurostat/e-learning/blob/main/r-official-statistics/03.01-Read-Data.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>


<a id='top'></a>

# Read Data
## Content  
- [Working Directory and File Paths](#wd)
- [CSV and TSV Files](#csv)
- [Excel Files](#excel)
- [RIO Package](#rio)
- [RDA/RDS Binary Files](#rdas)
- [Other Packages](#other)

<a id='wd'></a>

## Working Directory and File Paths
When you run R, it starts in a default location in your computer's file system called the working directory. You can check your working directory with the getwd() function.

In [1]:
getwd()

The working directory acts as your starting point for accessing other files on your computer. To load data into R, you either need to put the data file in your working directory, change your working directory to the folder containing the data or supply the data's file path to the data reading function.

You can change your working directory by passing a new file path, in quotes, to the setwd() function.

In [2]:
setwd('./data')
getwd()

> Note: In a local R environment, you can use forward slashes for your file path even in Windows which normally uses backslashes.  
  
> Note: You can also use '.' for the current folder and '..' for the parent folder.  
  
Rather than worrying about slashes and backslashes, you can use the file.path() function to build a valid path:

In [3]:
my_path <- file.path("..", "data")
print(my_path)
setwd(my_path)
getwd()

[1] "../data"
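The third option mentioned earlier, supplying the file path directly to the reading function, can be sketched as follows. This is a minimal illustration that assumes a temporary folder and a made-up file name (`example.csv`); neither is part of the course data.

```r
# Hypothetical folder and file, created in a temporary directory
# so the sketch does not touch your real working directory.
data_dir <- file.path(tempdir(), "data")
dir.create(data_dir, showWarnings = FALSE)

csv_path <- file.path(data_dir, "example.csv")
write.csv(data.frame(id = 1:3, value = c(10, 20, 30)), csv_path, row.names = FALSE)

# Read the file by its full path; the working directory is not changed.
wd_before <- getwd()
example <- read.csv(csv_path)
wd_after <- getwd()

print(example)
identical(wd_before, wd_after)
```

Passing a path built with file.path() avoids calling setwd() at all, which keeps the rest of a script independent of where it is run from.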


Here are a few examples of other functions related to files and folders:

In [4]:
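# List all files and folders in the specified folder
dir('.')
# list.files() is the same function under another name;
# pattern is a regular expression, here "file names ending in .ipynb"
list.files("..", pattern = '\\.ipynb$')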
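Because the pattern argument of list.files() is interpreted as a regular expression rather than a shell wildcard, it can help to try it on a small, controlled folder. The sketch below assumes a temporary folder and invented file names.

```r
# Work in a temporary folder with hypothetical file names.
demo_dir <- file.path(tempdir(), "listing_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("a.ipynb", "b.ipynb", "notes.txt", "ipynb_backup.csv")))

# "\\.ipynb$" means: a literal dot, then "ipynb", at the end of the name.
notebooks <- list.files(demo_dir, pattern = "\\.ipynb$")
print(notebooks)

# full.names = TRUE returns paths that can be passed directly to a reading function.
notebook_paths <- list.files(demo_dir, pattern = "\\.ipynb$", full.names = TRUE)
file.exists(notebook_paths)
```

Note that `ipynb_backup.csv` is not matched, because the pattern is anchored to the end of the file name.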


<a id='csv'></a>

## CSV and TSV Files
You can use the read.csv() function to read CSV files into R.  
utils::read.csv() is the standard base R function; the readr package provides a corresponding function, readr::read_csv(), which many users find more convenient.

In [2]:
# row.names = 1 tells read.csv() that the first column contains the row names
small <- read.csv('small.csv', row.names = 1)
small
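To see what row.names = 1 changes, here is a small self-contained sketch. It does not use small.csv; it writes a made-up two-row data frame to a temporary file and reads it back in both ways.

```r
# Hypothetical data frame with row names A and B.
small_demo <- data.frame(Col1 = c(1, 5), Col2 = c(2, 6), row.names = c("A", "B"))
csv_file <- tempfile(fileext = ".csv")
write.csv(small_demo, csv_file)  # row names are written as an unnamed first column

# Without row.names, the first column is read as ordinary data and named "X".
plain <- read.csv(csv_file)
names(plain)

# With row.names = 1, the first column becomes the row names instead.
indexed <- read.csv(csv_file, row.names = 1)
rownames(indexed)
names(indexed)
```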

For a TSV file example we will use a file downloaded from the internet. The function is the same, with the separator parameter added:

In [6]:
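# Base URL of the download service and the name of the file to fetch
path <- 'https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&downfile=data%2F'
filename <- 'aact_ali01.tsv.gz'
# mode = 'wb' keeps the gzip archive intact on Windows
download.file(paste0(path, filename), filename, mode = 'wb')
# read.csv() decompresses .gz files transparently; sep = '\t' selects tab-separated input
aact_ali <- read.csv(filename, sep = '\t')
typeof(aact_ali)  # a data frame is stored internally as a list
head(aact_ali)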

| | itm_newa.geo.time | X2021 | X2020 | X2019 | X2018 | X2017 | X2016 | X2015 | X2014 | X2013 | ⋯ | X1982 | X1981 | X1980 | X1979 | X1978 | X1977 | X1976 | X1975 | X1974 | X1973 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | ⋯ | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| 1 | 40000,AT | 112.32 e | 113.4 | 115.88 | 117.15 | 118.63 | 119.57 | 120.49 | 122.33 | 124.19 | ⋯ | : | : | : | : | : | : | : | : | : | : |
| 2 | 40000,BE | 56.78 e | 56.08 | 55.59 | 55.66 | 55.28 | 55.35 | 57.12 | 56.84 | 57.9 | ⋯ | 112.90 | 115.10 | 119.40 | 125.10 | 126.00 | 130.30 | 136.20 | 143.60 | 149.90 | 155.80 |
| 3 | 40000,BG | 166.00 e | 179.5 | 190.4 | 220.2 | 236.4 | 253.9 | 274.3 | 296.4 | 321.2 | ⋯ | : | : | : | : | : | : | : | : | : | : |
| 4 | 40000,CH | 72.11 e | 72.67 | 73.86 | 75.17 | 74.63 | 75.61 | 76.41 | 77.0 | 77.3 | ⋯ | : | : | : | : | : | : | : | : | : | : |
| 5 | 40000,CY | 16.49 e | 16.46 | 16.83 | 16.31 | 17.59 | 18.26 | 17.12 | 17.38 | 18.45 | ⋯ | : | : | : | : | : | : | : | : | : | : |
| 6 | 40000,CZ | 95.37 e | 95.37 | 102.02 | 104.34 | 104.6 | 104.48 | 104.8 | 104.9 | 105.1 | ⋯ | : | : | : | : | : | : | : | : | : | : |
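The output above shows three features that typically need attention before analysis: the first column combines two codes (for example "40000,AT"), missing values are marked with ":", and some numbers carry a flag such as "e", which is why several columns are read as text. One possible way to clean this is sketched below on a tiny hand-made data frame that only mimics the structure; the helper `clean_values()` is a hypothetical name, not part of any package.

```r
# Tiny hand-made data frame mimicking the structure shown above.
raw <- data.frame(
  itm_newa.geo.time = c("40000,AT", "40000,BE"),
  X2021 = c("112.32 e", "56.78 e"),
  X1982 = c(":", "112.90"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop flags and the ":" marker, then convert to numeric.
clean_values <- function(x) {
  x <- gsub("[^0-9.-]", "", x)  # keep only digits, decimal point and minus sign
  x[x == ""] <- NA              # what was ":" is now empty, so mark it as missing
  as.numeric(x)
}

# Split the combined key column into its two codes.
keys <- do.call(rbind, strsplit(raw$itm_newa.geo.time, ",", fixed = TRUE))

tidy <- data.frame(
  itm_newa = keys[, 1],
  geo = keys[, 2],
  X2021 = clean_values(raw$X2021),
  X1982 = clean_values(raw$X1982),
  stringsAsFactors = FALSE
)
print(tidy)
```

This sketch simply discards the flags; if they matter for your analysis, you would need to store them in a separate column before removing them.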


After you process your data, whether by cleaning it or by running some analysis, you will likely end up with new data. You can write an R data frame to a CSV file with the write.csv() function.
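A minimal sketch of writing a data frame with write.csv() and reading it back is shown here. The data frame `results` and the file name `results.csv` are invented for illustration, and the file is written to a temporary folder.

```r
# Hypothetical results; the NA stands for a missing value.
results <- data.frame(
  geo = c("AT", "BE", "BG"),
  value = c(112.32, 56.78, NA),
  stringsAsFactors = FALSE
)

out_file <- file.path(tempdir(), "results.csv")

# row.names = FALSE omits the automatic 1, 2, 3 index column from the file.
write.csv(results, out_file, row.names = FALSE)

# Inspect the raw text that was written, then read it back as a data frame.
readLines(out_file)
round_trip <- read.csv(out_file)
print(round_trip)
```

Without row.names = FALSE, the row index is written as an extra unnamed column, which then has to be handled again when the file is read.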

<a id='excel'></a>

## Excel Files
One of the packages for working with Excel files is readxl.

In [7]:
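library("readxl")
# col_names supplies the column names, so the first row of the sheet is read as data
small2 <- read_excel('small.xls', col_names = c('A', 'B', 'C', 'D'))
typeof(small2)
# read_excel() returns a tibble; convert it to a plain data frame
small_df <- as.data.frame(small2)
small_df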

| A | B | C | D |
|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 2 | 3 | 4 |
| 5 | 6 | 7 | 8 |
| 9 | 10 | 11 | 12 |
| 13 | 14 | 15 | 16 |


<a id='rio'></a>

## RIO Package
A more convenient solution for all the situations described above is the rio package, which unifies many input formats behind a single interface:

In [8]:
library(rio)

# install_formats()
# CSV
rio_csv <- import("small.csv")
rownames(rio_csv) <- rio_csv[, 1]
rio_csv <- rio_csv[, -1]
head(rio_csv)

# compressed (gzip) TSV
rio_txt <- import("aact_ali01.tsv.gz")
typeof(rio_txt)
head(rio_txt)

# Excel XLS
rio_xlsx <- import("small.xls", col_names=c('A', 'B', 'C', 'D'))
rio_xlsx


| | Col1 | Col2 | Col3 | Col4 |
|---|---|---|---|---|
| | `<int>` | `<int>` | `<int>` | `<int>` |
| A | 1 | 2 | 3 | 4 |
| B | 5 | 6 | 7 | 8 |
| C | 9 | 10 | 11 | 12 |
| D | 13 | 14 | 15 | 16 |


data.table::fread() does not support reading from connections. Using utils::read.table() instead.



| | itm_newa.geo.time | X2021 | X2020 | X2019 | X2018 | X2017 | X2016 | X2015 | X2014 | X2013 | ⋯ | X1982 | X1981 | X1980 | X1979 | X1978 | X1977 | X1976 | X1975 | X1974 | X1973 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | ⋯ | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| 1 | 40000,AT | 112.32 e | 113.4 | 115.88 | 117.15 | 118.63 | 119.57 | 120.49 | 122.33 | 124.19 | ⋯ | : | : | : | : | : | : | : | : | : | : |
| 2 | 40000,BE | 56.78 e | 56.08 | 55.59 | 55.66 | 55.28 | 55.35 | 57.12 | 56.84 | 57.9 | ⋯ | 112.90 | 115.10 | 119.40 | 125.10 | 126.00 | 130.30 | 136.20 | 143.60 | 149.90 | 155.80 |
| 3 | 40000,BG | 166.00 e | 179.5 | 190.4 | 220.2 | 236.4 | 253.9 | 274.3 | 296.4 | 321.2 | ⋯ | : | : | : | : | : | : | : | : | : | : |
| 4 | 40000,CH | 72.11 e | 72.67 | 73.86 | 75.17 | 74.63 | 75.61 | 76.41 | 77.0 | 77.3 | ⋯ | : | : | : | : | : | : | : | : | : | : |
| 5 | 40000,CY | 16.49 e | 16.46 | 16.83 | 16.31 | 17.59 | 18.26 | 17.12 | 17.38 | 18.45 | ⋯ | : | : | : | : | : | : | : | : | : | : |
| 6 | 40000,CZ | 95.37 e | 95.37 | 102.02 | 104.34 | 104.6 | 104.48 | 104.8 | 104.9 | 105.1 | ⋯ | : | : | : | : | : | : | : | : | : | : |


| | A | B | C | D |
|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 1 | 2 | 3 | 4 |
| 2 | 5 | 6 | 7 | 8 |
| 3 | 9 | 10 | 11 | 12 |
| 4 | 13 | 14 | 15 | 16 |


As you can see, the behaviour is not quite identical to that of the dedicated functions, but in most situations rio hides the format-specific details and offers a simpler, more uniform interface.  
The same package also provides the export() function, which saves data in the format you choose. Let's save all three objects as RDS files; we'll need them in the next chapter.

In [9]:
export(rio_csv, 'rio_csv.rds')
export(rio_txt, 'rio_txt.rds')
export(rio_xlsx, 'rio_xlsx.rds')

export(list(x = rio_csv, y = rio_xlsx), 'rio_csv.rda')


<a id='rdas'></a>

## RDA/RDS Binary Files

R comes with its own binary format. There are two types of binary R files: one for multiple objects (RDA) and one for single objects (RDS). Before looking at the functions for reading and writing them, let us compare the two formats:

#### For multiple objects: RDA (R data) files
- File extension (typically): .rda, .Rda, or .RData.
- Can hold one or several R objects.
- Preserves the name(s) of the stored object(s).
- load(): read/load existing files.
- save(): write/save data.

#### For single objects: RDS (serialized R data) files
- File extension (typically): .rds or .Rds.
- Can hold only a single object.
- Does not preserve the name of the saved object.
- readRDS(): read/load existing files.
- saveRDS(): write/save data.

  
Some examples:

In [10]:
# rio_csv was previously saved as a data frame in the file rio_csv.rds
# row names and column names are preserved
rio_csv2 <- readRDS('rio_csv.rds')
rio_csv2

# rio_csv and rio_xlsx were saved together, under the names x and y, in the file rio_csv.rda
# load() recreates x and y in the current environment and returns their names
loaded_names <- load('rio_csv.rda')
loaded_names
# objects() (or ls()) lists everything in the current environment
objects()
x
y

| | Col1 | Col2 | Col3 | Col4 |
|---|---|---|---|---|
| | `<int>` | `<int>` | `<int>` | `<int>` |
| A | 1 | 2 | 3 | 4 |
| B | 5 | 6 | 7 | 8 |
| C | 9 | 10 | 11 | 12 |
| D | 13 | 14 | 15 | 16 |


| | Col1 | Col2 | Col3 | Col4 |
|---|---|---|---|---|
| | `<int>` | `<int>` | `<int>` | `<int>` |
| A | 1 | 2 | 3 | 4 |
| B | 5 | 6 | 7 | 8 |
| C | 9 | 10 | 11 | 12 |
| D | 13 | 14 | 15 | 16 |


| | A | B | C | D |
|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 1 | 2 | 3 | 4 |
| 2 | 5 | 6 | 7 | 8 |
| 3 | 9 | 10 | 11 | 12 |
| 4 | 13 | 14 | 15 | 16 |
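The difference between the two formats can also be tried with base R alone, without rio. The following is a minimal sketch using made-up objects (`scores`, `labels`) and temporary files; loading into a separate environment is one possible way to avoid overwriting objects that already exist under the same names.

```r
# Hypothetical objects to store.
scores <- data.frame(id = 1:3, score = c(0.5, 0.7, 0.9))
labels <- c("low", "mid", "high")

rds_file <- tempfile(fileext = ".rds")
rda_file <- tempfile(fileext = ".rda")

# RDS: one object, name not stored; you choose the name when reading.
saveRDS(scores, rds_file)
scores_copy <- readRDS(rds_file)

# RDA: several objects, stored under their original names.
save(scores, labels, file = rda_file)

# Load into a separate environment so nothing in the
# global environment is silently replaced.
loaded_env <- new.env()
loaded_names <- load(rda_file, envir = loaded_env)
print(loaded_names)

identical(scores, scores_copy)
identical(loaded_env$labels, labels)
```

Because load() returns only the names of the restored objects, the objects themselves are accessed through the environment, for example `loaded_env$scores`.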


<a id='other'></a>

## Other Packages

Additional packages make it possible to read a wide variety of data formats. The following list is far from complete, but might be of interest to you at some point:

- __sf, sp__: reading, writing, and handling spatial data (polygons, points, lines).
- __raster, stars__: handling gridded spatial data.
- __rjson__: reading and parsing JSON data.
- __xml2__: reading, parsing, manipulating, creating, and saving XML files.
- __png, jpeg__: reading and writing raster images.
- __ncdf4__: reading and writing NetCDF files.
- __RSQLite, RMySQL, influxdbr, RPostgres__: reading from and writing to different databases.
- __haven__: support for data from other statistical systems (SAS, SPSS, Stata, …).



<span style='background: rgb(128, 128, 128, .15); width: 100%; display: block; padding: 10px 0 10px 10px'>This is the Jupyter notebook version of the __R for Official Statistics__ produced by Eurostat; the content is available [on GitHub](https://github.com/eurostat/e-learning/tree/main/python-official-statistics).
<br>The text and code are released under the [EUPL-1.2 license](https://github.com/eurostat/e-learning/blob/main/LICENSE).</span>