# Getting Data In and Out of R
There are a few principal functions for reading data into R:

- read.table, read.csv, for reading tabular data
- readLines, for reading lines of a text file
- source, for reading in R code files (inverse of dump)
- dget, for reading in R code files (inverse of dput)
- load, for reading in saved workspaces
- unserialize, for reading single R objects in binary form 

There are analogous functions for writing data to files:

- write.table, for writing tabular data to text files (e.g. CSV) or connections
- writeLines, for writing character data line-by-line to a file or connection
- dump, for dumping a textual representation of multiple R objects
- dput, for outputting a textual representation of an R object
- save, for saving an arbitrary number of R objects in binary format (possibly compressed) to a file
- serialize, for converting an R object into a binary format for outputting to a connection (or file)
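
As a rough illustration of how these reading and writing functions pair up, the sketch below round-trips a small object through four of the pairs. The data frame, its values, and the temporary file paths are made up for the example.

```r
# A small data frame with made-up values, for illustration only
df <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)

# Tabular text: write.table() / read.table()
csv_path <- tempfile(fileext = ".csv")
write.table(df, csv_path, sep = ",", row.names = FALSE)
df_back <- read.table(csv_path, header = TRUE, sep = ",", stringsAsFactors = FALSE)

# Line-oriented text: writeLines() / readLines()
txt_path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), txt_path)
lines_back <- readLines(txt_path)

# Textual representation of a single object: dput() / dget()
dput_path <- tempfile(fileext = ".R")
dput(df, file = dput_path)
df_dget <- dget(dput_path)

# Binary format: save() / load(); load() restores objects under their original names
rda_path <- tempfile(fileext = ".rda")
save(df, file = rda_path)
restored <- new.env()
load(rda_path, envir = restored)
ls(restored)
```

Loading into a separate environment, as done here, is one way to avoid overwriting objects that already exist in the workspace.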


## Reading Data Files with read.table()

In [2]:
# see the help for the read.table() function
?read.table

#### Usage
read.table(file, header = FALSE, sep = "", quote = "\"'",
           dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
           row.names, col.names, as.is = !stringsAsFactors,
           na.strings = "NA", colClasses = NA, nrows = -1,
           skip = 0, check.names = TRUE, fill = !blank.lines.skip,
           strip.white = FALSE, blank.lines.skip = TRUE,
           comment.char = "#",
           allowEscapes = FALSE, flush = FALSE,
           stringsAsFactors = default.stringsAsFactors(),
           fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)

stringsAsFactors:
- should character variables be coded as factors?
- before R 4.0.0 this defaulted to TRUE, on the assumption that strings represented levels of a categorical variable; since R 4.0.0 the default is FALSE
- much data today is free text that does not represent categorical variables
- you may want to set this to FALSE in those cases
- on R versions before 4.0.0, if you always want this to be FALSE, you can set a global option via options(stringsAsFactors = FALSE)
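
To see the effect of this argument, the following sketch reads the same small table twice. The inline data and the variable names are invented for the example, and the text argument is used only so that no file is needed.

```r
# Hypothetical inline data standing in for a CSV file
csv_text <- "name,group\nAda,a\nGrace,b\nAlan,a"

# Character columns become factors
as_factor <- read.table(text = csv_text, header = TRUE, sep = ",",
                        stringsAsFactors = TRUE)

# Character columns stay as plain character vectors
as_char <- read.table(text = csv_text, header = TRUE, sep = ",",
                      stringsAsFactors = FALSE)

class(as_factor$group)
class(as_char$group)
levels(as_factor$group)
```

Passing the argument explicitly, as above, gives the same result regardless of which R version's default is in effect.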


You can usually call read.table without specifying any other arguments.

In this case, R will automatically
- skip lines that begin with a #
- figure out how many rows there are (and how much memory needs to be allocated)
- figure out what type of variable is in each column of the table.

Telling R all these things directly makes R run faster and more efficiently. 
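
The difference between letting R work these things out and stating them up front can be sketched as follows. The inline text, column names, and values are assumptions made for the example; on a table this small there is no measurable speed difference, so the point is only the shape of the two calls.

```r
# Hypothetical whitespace-separated data with one commented line
raw_text <- paste(
  "id score label",
  "# a commented line, skipped by default",
  "1 2.5 x",
  "2 3.0 y",
  sep = "\n"
)

# Let read.table() skip the comment and guess the column types
auto <- read.table(text = raw_text, header = TRUE, stringsAsFactors = FALSE)
sapply(auto, class)

# Tell read.table() the row count and column types directly
explicit <- read.table(text = raw_text, header = TRUE,
                       nrows = 2, comment.char = "#",
                       colClasses = c("integer", "numeric", "character"))
sapply(explicit, class)
```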

In [6]:
data <- read.table("fake-file.csv", sep=",")
head(data)

V1,V2,V3,V4,V5
,First,Last,Gender,Birthdate
0.0,Forest,Kutch,M,1950-10-06
1.0,Mignon,Bahringer,F,1971-12-07
2.0,Loyd,Koelpin,F,1989-09-16
3.0,Newt,Hahn,M,1940-04-29
4.0,Ama,Waelchi,M,1958-03-30


In [15]:
attributes(data)

The read.csv() function is identical to read.table except that some of the defaults are set differently (like the sep argument).

In [17]:
data <- read.csv("fake-file.csv")
head(data)

X,First,Last,Gender,Birthdate
0,Forest,Kutch,M,1950-10-06
1,Mignon,Bahringer,F,1971-12-07
2,Loyd,Koelpin,F,1989-09-16
3,Newt,Hahn,M,1940-04-29
4,Ama,Waelchi,M,1958-03-30
5,Calla,Spinka,M,1988-10-25
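
The different defaults explain why the read.table() output earlier shows columns named V1 to V5 with the header row read as data: read.table() defaults to header = FALSE, while read.csv() defaults to header = TRUE and sep = ",". A minimal sketch, using a small made-up file rather than fake-file.csv:

```r
# Write a tiny hypothetical CSV file to a temporary location
path <- tempfile(fileext = ".csv")
writeLines(c("First,Last,Born",
             "Ada,Lovelace,1815",
             "Alan,Turing,1912"), path)

# Without header = TRUE, the header line is treated as a data row
no_header <- read.table(path, sep = ",", stringsAsFactors = FALSE)
names(no_header)
nrow(no_header)

# With header = TRUE and sep = ",", read.table() matches read.csv() on this file
via_table <- read.table(path, header = TRUE, sep = ",", stringsAsFactors = FALSE)
via_csv <- read.csv(path, stringsAsFactors = FALSE)
names(via_table)
nrow(via_table)
```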


### Reading in Larger Datasets with read.table
With much larger datasets, there are a few things that you can do that will make your life easier and will prevent R from choking.
- Read the help page for read.table, which contains many hints
- Make a rough calculation of the memory required.
- If the dataset is larger than the amount of RAM on your computer, you can probably stop right here.
- Set comment.char = "" if there are no commented lines in your file.
- Use the colClasses argument. Specifying this option instead of using the default can make read.table run MUCH faster, often twice as fast.
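
The rough memory calculation mentioned in the list can be sketched as below, assuming a hypothetical data frame of 1,500,000 rows and 120 columns, all numeric. R stores each numeric value in 8 bytes; the row and column counts are invented for the example.

```r
# Hypothetical dimensions of an all-numeric data frame
n_rows <- 1500000
n_cols <- 120

# Each numeric (double) value takes 8 bytes
bytes <- n_rows * n_cols * 8

# Convert to GiB (2^30 bytes)
gib <- bytes / 2^30
round(gib, 2)
```

This counts only the raw storage for the values. Reading the file is likely to need additional working memory on top of that, so the figure is best treated as a lower bound.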


In [18]:
?read.table()




In [13]:
# Read only the first 100 rows so that the column classes can be inferred cheaply.
# Setting nrows doesn’t make R run faster but it helps with memory usage. A mild overestimate is okay.
initial <- read.table("fake-file-1000.csv", nrows = 100, sep=",", stringsAsFactors = FALSE )
attributes(initial)

In [14]:
head(initial)

V1,V2,V3,V4,V5
,First,Last,Gender,Birthdate
0.0,Akeelah,Medhurst,M,1956-04-08
1.0,Charissa,Wisoky,F,2000-04-06
2.0,Charle,Swaniawski,F,1967-01-30
3.0,Amaya,Wilkinson,M,1966-03-11
4.0,Benton,Murphy,F,2005-09-17


In [15]:
# generate a vector with the class of each column
classes <- sapply(initial, class)
classes

In [16]:
# now a more efficient way to load a dataset
tabAll <- read.table("fake-file-1000.csv", colClasses = classes, stringsAsFactors = FALSE, sep=",")


“number of items read is not a multiple of the number of columns”

In [17]:
head(tabAll)

V1,V2,V3,V4,V5
,First,Last,Gender,Birthdate
0.0,Akeelah,Medhurst,M,1956-04-08
1.0,Charissa,Wisoky,F,2000-04-06
2.0,Charle,Swaniawski,F,1967-01-30
3.0,Amaya,Wilkinson,M,1966-03-11
4.0,Benton,Murphy,F,2005-09-17
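
Note that in the output above the header line was read as a data row, so every column in the sample comes back as character and the colClasses vector carries no real type information. A variant of the same workflow with header = TRUE might look like the sketch below. It generates its own made-up file so that it can be run on its own; the column names, values, and the tab_all name are assumptions made for the example.

```r
# Build a hypothetical 1000-row CSV file to read back in
set.seed(1)
n <- 1000
fake <- data.frame(id = seq_len(n),
                   First = sample(c("Ada", "Alan", "Grace"), n, replace = TRUE),
                   Score = runif(n),
                   stringsAsFactors = FALSE)
path <- tempfile(fileext = ".csv")
write.csv(fake, path, row.names = FALSE)

# Step 1: read a small sample, with the header, to infer the column classes
initial <- read.table(path, header = TRUE, sep = ",", nrows = 100,
                      stringsAsFactors = FALSE)
classes <- sapply(initial, class)
classes

# Step 2: read the full file, passing the classes, the row count,
# and comment.char = "" since the file has no commented lines
tab_all <- read.table(path, header = TRUE, sep = ",",
                      colClasses = classes, nrows = n,
                      comment.char = "", stringsAsFactors = FALSE)
dim(tab_all)
```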


### Using the readr Package
The readr package deals with reading in large flat files quickly.
- It provides replacements for functions like read.table() and read.csv().
- The analogous functions in readr are read_table() and read_csv().


In [22]:
# read_csv() example (requires the readr package to be installed)
library(readr)
data <- read_csv("fake-file-1000.csv", col_names = TRUE)

“Missing column names filled in: 'X1' [1]”
Parsed with column specification:
cols(
  X1 = col_double(),
  First = col_character(),
  Last = col_character(),
  Gender = col_character(),
  Birthdate = col_date(format = "")
)


In [23]:
head(data)

X1,First,Last,Gender,Birthdate
0,Akeelah,Medhurst,M,1956-04-08
1,Charissa,Wisoky,F,2000-04-06
2,Charle,Swaniawski,F,1967-01-30
3,Amaya,Wilkinson,M,1966-03-11
4,Benton,Murphy,F,2005-09-17
5,Sheena,Walker,M,1976-01-12
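
Instead of letting read_csv() guess the column specification, the specification can be passed through the col_types argument, much like colClasses in read.table(). The sketch below assumes readr is installed and skips itself otherwise. It writes a small file with a named first column (id is an invented name) and two rows taken from the output above.

```r
# Only run when readr is available
if (requireNamespace("readr", quietly = TRUE)) {
  # Small hypothetical file; every column is named, so no name is filled in
  path <- tempfile(fileext = ".csv")
  writeLines(c("id,First,Birthdate",
               "0,Akeelah,1956-04-08",
               "1,Charissa,2000-04-06"), path)

  # State the column types explicitly rather than relying on guessing
  people <- readr::read_csv(
    path,
    col_types = readr::cols(
      id = readr::col_integer(),
      First = readr::col_character(),
      Birthdate = readr::col_date(format = "")
    )
  )
  str(people)
}
```

Supplying col_types also suppresses the column specification message that read_csv() prints when it has to guess.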


### Using Textual and Binary Formats for Storing Data
