# Data Frames

Data frames are one of the most commonly used data structures in R. A
data frame is laid out like a spreadsheet: each row is one record from a
set of records, and each column is one attribute or property of those
records.

| name    | age | isStudent | favColor |
|---------|-----|-----------|----------|
| James   | 23  | TRUE      | red      |
| Sally   | 43  | FALSE     | green    |
| Rupert  | 19  | TRUE      | blue     |
| Octavia | 21  | TRUE      | yellow   |
| Belinda | 33  | FALSE     | orange   |

In R, each column of a data frame is a vector, so every element in a
given column has the same data type. Different columns can have
different types.
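
To see why a column can hold only one type, here is a small sketch of how R coerces mixed values in a vector. The variable names `mixed` and `df` are made up for this illustration.

```r
# A vector holds a single type, so mixing types in c() coerces
# every element to one common type (here: character).
mixed <- c(23, "red", TRUE)
class(mixed)   # "character"
mixed          # "23" "red" "TRUE"

# In a data frame, each column keeps its own type.
df <- data.frame(age = c(23, 43), isStudent = c(TRUE, FALSE))
col_types <- sapply(df, class)
col_types      # age is "numeric", isStudent is "logical"
```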

Here is an example that builds the table above from four vectors, one per column:

In [None]:
name <- c("James", "Sally", "Rupert", "Octavia", "Belinda") 
age <- c(23, 43, 19, 21, 33) 
isStudent <- c(TRUE, FALSE, TRUE, TRUE, FALSE)
favColor <- c("red", "green", "blue", "yellow", "orange")

customers <- data.frame(name, 
                        age, 
                        isStudent, 
                        favColor, 
                        stringsAsFactors=FALSE)

Notice that each vector has the same length. This is important when you
are creating data frames.

In [None]:
length(name)
length(age)
length(isStudent)
length(favColor)

[1] 5

[1] 5

[1] 5

[1] 5

The `str()` function shows the structure of the data frame:

In [None]:
str(customers)

'data.frame':   5 obs. of  4 variables:
 $ name     : chr  "James" "Sally" "Rupert" "Octavia" ...
 $ age      : num  23 43 19 21 33
 $ isStudent: logi  TRUE FALSE TRUE TRUE FALSE
 $ favColor : chr  "red" "green" "blue" "yellow" ...
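
Besides `str()`, a few base R functions report the same facts one at a time. The sketch below rebuilds the data frame so it runs on its own.

```r
# Rebuild the example data frame so this block is self-contained.
customers <- data.frame(
  name      = c("James", "Sally", "Rupert", "Octavia", "Belinda"),
  age       = c(23, 43, 19, 21, 33),
  isStudent = c(TRUE, FALSE, TRUE, TRUE, FALSE),
  favColor  = c("red", "green", "blue", "yellow", "orange"),
  stringsAsFactors = FALSE
)

dim(customers)     # 5 4  (rows, then columns)
nrow(customers)    # 5
ncol(customers)    # 4
names(customers)   # "name" "age" "isStudent" "favColor"
```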

Let's look at the first few rows of the data frame:

In [None]:
head(customers)

     name age isStudent favColor
1   James  23      TRUE      red
2   Sally  43     FALSE    green
3  Rupert  19      TRUE     blue
4 Octavia  21      TRUE   yellow
5 Belinda  33     FALSE   orange

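If we try to create a data frame from vectors of different lengths, we
usually get an error message. R recycles a shorter vector only when the
longer length is a whole multiple of the shorter one.

Here we try it with a vector of length 5 and one of length 2, which
fails because 5 is not a multiple of 2.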

In [None]:
data.frame(x = c(1,2,3,4,5), y = c(1,2)) 
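
As a sketch of what happens, the block below catches the error with `tryCatch()` so the message can be inspected, and then shows a case where recycling succeeds. The names `result` and `recycled` are made up for this illustration.

```r
# Lengths 5 and 2: 5 is not a multiple of 2, so data.frame() stops
# with an error. tryCatch() captures the message as a string.
result <- tryCatch(
  data.frame(x = c(1, 2, 3, 4, 5), y = c(1, 2)),
  error = function(e) conditionMessage(e)
)
print(result)

# Lengths 6 and 2: the shorter vector is recycled three times,
# so this call succeeds and produces 6 rows.
recycled <- data.frame(x = 1:6, y = c(1, 2))
nrow(recycled)   # 6
recycled$y       # 1 2 1 2 1 2
```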

There are various ways to refer to the rows and columns of a data
frame. We can refer to a column by name using a dollar sign. Here is
the name column:

In [None]:
customers$name

[1] "James"   "Sally"   "Rupert"  "Octavia" "Belinda"

And the age column:

In [None]:
customers$age

[1] 23 43 19 21 33
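
Because `$` returns an ordinary vector, the result can be passed to any vector function. The sketch below also shows `[[ ]]`, which gives the same column by name. It uses a cut-down copy of the data frame so it runs on its own.

```r
# A two-column copy of the example data, for a self-contained run.
customers <- data.frame(
  name = c("James", "Sally", "Rupert", "Octavia", "Belinda"),
  age  = c(23, 43, 19, 21, 33),
  stringsAsFactors = FALSE
)

by_dollar  <- customers$age
by_bracket <- customers[["age"]]       # same column, selected by name
identical(by_dollar, by_bracket)       # TRUE

mean(customers$age)                    # 27.8
over_30 <- customers$name[customers$age > 30]
over_30                                # "Sally" "Belinda"
```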

You can remove a column of data by just setting it to NULL:

In [None]:
customers$favColor <- NULL
str(customers)

'data.frame':   5 obs. of  3 variables:
 $ name     : chr  "James" "Sally" "Rupert" "Octavia" ...
 $ age      : num  23 43 19 21 33
 $ isStudent: logi  TRUE FALSE TRUE TRUE FALSE
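
Setting a column to NULL changes the data frame in place. If the original should stay intact, one option is to build a trimmed copy by selecting the columns to keep, as sketched below. The names `drop_cols` and `trimmed` are made up for this illustration.

```r
# Rebuild the original four-column data frame.
customers <- data.frame(
  name      = c("James", "Sally", "Rupert", "Octavia", "Belinda"),
  age       = c(23, 43, 19, 21, 33),
  isStudent = c(TRUE, FALSE, TRUE, TRUE, FALSE),
  favColor  = c("red", "green", "blue", "yellow", "orange"),
  stringsAsFactors = FALSE
)

# Keep every column except the ones listed in drop_cols.
# drop = FALSE ensures the result is still a data frame.
drop_cols <- c("isStudent", "favColor")
trimmed <- customers[, setdiff(names(customers), drop_cols), drop = FALSE]

names(trimmed)     # "name" "age"
names(customers)   # still all four columns
```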

You can add a column of data by assigning a vector to a new column name
in the data frame:

In [None]:
favMusic <- c("jazz", "classical", "electronic", "country", "electronic")
customers$favMusic <- favMusic 
str(customers)

'data.frame':   5 obs. of  4 variables:
 $ name     : chr  "James" "Sally" "Rupert" "Octavia" ...
 $ age      : num  23 43 19 21 33
 $ isStudent: logi  TRUE FALSE TRUE TRUE FALSE
 $ favMusic : chr  "jazz" "classical" "electronic" "country" ...
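
The same assignment works for columns computed from existing ones. In this sketch the column names `isOver21` and `ageInMonths` are made up for illustration.

```r
# A two-column copy of the example data, for a self-contained run.
customers <- data.frame(
  name = c("James", "Sally", "Rupert", "Octavia", "Belinda"),
  age  = c(23, 43, 19, 21, 33),
  stringsAsFactors = FALSE
)

# Each new column is computed element by element from the age column.
customers$isOver21    <- customers$age > 21
customers$ageInMonths <- customers$age * 12

customers$isOver21      # TRUE TRUE FALSE FALSE TRUE
customers$ageInMonths   # 276 516 228 252 396
```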

You can also get summary statistics for the data in a data frame:

In [None]:
summary(customers)

     name                age       isStudent         favMusic        
 Length:5           Min.   :19.0   Mode :logical   Length:5          
 Class :character   1st Qu.:21.0   FALSE:2         Class :character  
 Mode  :character   Median :23.0   TRUE :3         Mode  :character  
                    Mean   :27.8                                     
                    3rd Qu.:33.0                                     
                    Max.   :43.0                                     

You can see here that the mean age is 27.8 and that the minimum and
maximum ages are 19 and 43. There are 3 who are students and 2 who are
not. For name and favMusic there is not much it can tell us: it says
those are “character” columns and that each has 5 entries.
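
For a character column, counting the distinct values is often more informative than `summary()`. A minimal sketch using `table()` on the favMusic values, plus `summary()` on a single numeric vector, follows. The names `counts` and `age_summary` are made up for this illustration.

```r
favMusic <- c("jazz", "classical", "electronic", "country", "electronic")
age      <- c(23, 43, 19, 21, 33)

# table() counts how often each distinct value occurs.
counts <- table(favMusic)
counts[["electronic"]]     # 2

# summary() also works on a single numeric vector.
age_summary <- summary(age)
age_summary[["Mean"]]      # 27.8
age_summary[["Max."]]      # 43
```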

We can also refer to one particular entry in the data frame by using
indices. The first index is for the row and the second index is for the
column.

For example, this is the entry in the 2nd row and the 3rd column:

In [None]:
customers[2,3]

[1] FALSE

For reference, here is the whole data frame again:

In [None]:
head(customers)

     name age isStudent   favMusic
1   James  23      TRUE       jazz
2   Sally  43     FALSE  classical
3  Rupert  19      TRUE electronic
4 Octavia  21      TRUE    country
5 Belinda  33     FALSE electronic

Here is the entry for the 4th row and the 1st column:

In [None]:
customers[4,1]

[1] "Octavia"

This is the 2nd row in its entirety:

In [None]:
customers[2,]

   name age isStudent  favMusic
2 Sally  43     FALSE classical
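
The same bracket notation extends to whole columns, several rows at once, and logical conditions. The sketch below rebuilds the data frame so it runs on its own. The result names (`ages`, `first_and_third`, `students`, `sally_music`) are made up for this illustration.

```r
customers <- data.frame(
  name      = c("James", "Sally", "Rupert", "Octavia", "Belinda"),
  age       = c(23, 43, 19, 21, 33),
  isStudent = c(TRUE, FALSE, TRUE, TRUE, FALSE),
  favMusic  = c("jazz", "classical", "electronic", "country", "electronic"),
  stringsAsFactors = FALSE
)

# Leaving the row index empty selects every row of column 2.
ages <- customers[, 2]

# A vector of row numbers selects several rows at once.
first_and_third <- customers[c(1, 3), ]

# A logical vector keeps only the rows where it is TRUE;
# a column can be named instead of numbered.
students <- customers[customers$isStudent, "name"]
students                    # "James" "Rupert" "Octavia"

sally_music <- customers[2, "favMusic"]
sally_music                 # "classical"
```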
