# Data frames and factors in R

Hi guys, today's topic is data frames and factors.

**Data frames**:

Data frames are probably the data structure we will use most often. A data frame can hold different types of data: each column can be of any type, but all values within one column must share the same type.

We use `data.frame()` to create a data frame. Here is the general form.
```R
df <- data.frame(col1, col2, col3,...)
```
Now, we create a data frame.

In [1]:
studentID <- c(1, 2, 3, 4)
age <- c(21, 22, 22, 19)
major <- c('STAT', 'CS', 'EDU', 'STAT')
GPA <- c(3.1, 3.2, 3.3, 3.4)
name <- c('Shawn', 'Jack', 'Peter', 'Mary')
gender <- c('M', 'M', 'M', 'F')

# Create the data frame
df_1 <- data.frame(studentID, name, age, gender, major, GPA)
# Keep an identical copy for later use
df_2 <- df_1
df_1

studentID,name,age,gender,major,GPA
1,Shawn,21,M,STAT,3.1
2,Jack,22,M,CS,3.2
3,Peter,22,M,EDU,3.3
4,Mary,19,F,STAT,3.4
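
To make the "one type per column" rule concrete, here is a minimal sketch. The data frame `demo` and its columns are made up for illustration and are not part of today's data; `stringsAsFactors = FALSE` is set explicitly so the result looks the same in every R version.

```R
# Hypothetical example data, not part of the tutorial's data set
demo <- data.frame(
  id    = c(1, 2, 3),
  label = c('a', 'b', 'c'),
  flag  = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE  # keep text columns as character
)

# Each column has exactly one class, but the classes differ between columns
col_types <- sapply(demo, class)
col_types

# A plain vector cannot mix types: everything is coerced to one type
mixed <- c(1, 'a', TRUE)
class(mixed)
```

This is why a data frame, and not a single vector, is the natural container for a table with numbers and text side by side.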


So far, we have a small, simple data frame with only four rows and six columns. That rarely happens in real life. When the data is enormous, printing the whole thing wastes a lot of time and memory. What should we do? R provides the following functions to help us get a glimpse of a data frame.

1. `head()` and `tail()` show the first and last six rows of the data frame. Six is the default; if you want, you can use `head(df, n = 10)` to see the first 10 rows. The same goes for `tail()`.
2. `str()` shows the structure of the data frame: the number of rows and columns, plus the type and first values of each column.

In [2]:
str(df_1)

'data.frame':	4 obs. of  6 variables:
 $ studentID: num  1 2 3 4
 $ name     : Factor w/ 4 levels "Jack","Mary",..: 4 1 3 2
 $ age      : num  21 22 22 19
 $ gender   : Factor w/ 2 levels "F","M": 2 2 2 1
 $ major    : Factor w/ 3 levels "CS","EDU","STAT": 3 1 2 3
 $ GPA      : num  3.1 3.2 3.3 3.4


3. Use `summary()` for basic summary information about each column.

In [3]:
summary(df_1)

   studentID       name        age       gender  major        GPA       
 Min.   :1.00   Jack :1   Min.   :19.0   F:1    CS  :1   Min.   :3.100  
 1st Qu.:1.75   Mary :1   1st Qu.:20.5   M:3    EDU :1   1st Qu.:3.175  
 Median :2.50   Peter:1   Median :21.5          STAT:2   Median :3.250  
 Mean   :2.50   Shawn:1   Mean   :21.0                   Mean   :3.250  
 3rd Qu.:3.25             3rd Qu.:22.0                   3rd Qu.:3.325  
 Max.   :4.00             Max.   :22.0                   Max.   :3.400  
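
Our four-row data frame is too small to show what `head()` and `tail()` really do, so here is a short sketch on `mtcars`, a data set that ships with R. Using `mtcars` is my choice for illustration only; any larger data frame would work the same way.

```R
# mtcars is a built-in data set with 32 rows
first_rows <- head(mtcars, n = 3)   # first 3 rows instead of the default 6
last_rows  <- tail(mtcars, n = 3)   # last 3 rows

first_rows
last_rows

# Compare the size of the glimpse with the size of the full data frame
nrow(first_rows)
nrow(mtcars)
```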

**Subset**:

Subsetting a data frame works almost the same way as the subsetting we talked about before. Let's check some quick examples.

1. Select row or rows

In [4]:
# select row or rows
df_1[1, ]

studentID,name,age,gender,major,GPA
1,Shawn,21,M,STAT,3.1


In [5]:
df_1[c(1, 2), ]

studentID,name,age,gender,major,GPA
1,Shawn,21,M,STAT,3.1
2,Jack,22,M,CS,3.2


2. Select column or columns. You can use `df$colname` to select a single column.

In [6]:
# select column or columns
df_1$studentID

In [7]:
df_1[, 1]

In [8]:
df_1[, c(1, 2)]

studentID,name
1,Shawn
2,Jack
3,Peter
4,Mary


3. Select rows and columns together

In [9]:
df_1[c(1, 2), c(1, 2)]

studentID,name
1,Shawn
2,Jack


4. To select rows that meet a condition, we can use `subset()` for conditional filtering. I will only show a simple example here because we will talk about a powerful package, `dplyr`, later. That package gives us many friendly, readable, and convenient ways to manipulate data.

In [10]:
# Get the male students with a GPA larger than 3.1
subset(df_1, gender == 'M' & GPA > 3.1)

studentID,name,age,gender,major,GPA
2,Jack,22,M,CS,3.2
3,Peter,22,M,EDU,3.3
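
The same conditional filter can also be written with the `[rows, columns]` syntax from the earlier examples. The sketch below uses a cut-down stand-in called `students` (a hypothetical name, with only three of our columns) so it can run on its own.

```R
# Small stand-in for the tutorial's data (fewer columns, same values)
students <- data.frame(
  name   = c('Shawn', 'Jack', 'Peter', 'Mary'),
  gender = c('M', 'M', 'M', 'F'),
  GPA    = c(3.1, 3.2, 3.3, 3.4),
  stringsAsFactors = FALSE
)

# Conditional filtering with subset()
by_subset <- subset(students, gender == 'M' & GPA > 3.1)

# The same filter as a logical vector in the row position of [ , ]
keep <- students$gender == 'M' & students$GPA > 3.1
by_index <- students[keep, ]

# Columns can be picked by name instead of by position
students[keep, c('name', 'GPA')]

by_subset
by_index
```

`subset()` is shorter because it lets us write column names without the `students$` prefix; the bracket form is more explicit about what is happening.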


**Update data frame**

If you have a data frame and want to update it, R lets you do so in the following ways.


In [11]:
# change one value
df_1[1, 6] <- 4.0

# add a new column
df_1$state <- c('MO', "IN", 'CA', 'NY')

# delete a column
df_1$state <- NULL

# delete columns
df_1 <- df_1[, -c(6, 5)]

# add a new row
df_1 <- rbind(df_1, c(5, 'Frany', 25, 'F'))

# delete rows: remove the second and third rows
df_1 <- df_1[-c(2, 3),]

“invalid factor level, NA generated”

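A few points need attention.

1. When we add a row or a column, be careful: a new row should supply one value for every column, and a new column one value for every row. If the lengths do not match, the result is rarely what you want: depending on the case, R raises an error, recycles values, or fills in `NA`. Try the following code yourself.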

```R
> df_1 <- rbind(df_1, c(10, 'Frany'))
```
2. We use `rbind()` to add a new row. You can think of it as creating a small new vector and then binding it to the data frame as a row.
3. To delete rows or columns, remember to put `-` before the selected row or column indices.
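
One caveat with passing a plain vector to `rbind()`: a vector holds only one type, so `c(3, 'Peter', 22)` is entirely character. A hedged alternative is to pass a one-row data frame instead, which keeps each column's type. The names `students` and `new_row` below are made up for this sketch.

```R
students <- data.frame(
  studentID = c(1, 2),
  name      = c('Shawn', 'Jack'),
  age       = c(21, 22),
  stringsAsFactors = FALSE
)

# The new row is itself a data frame, so numbers stay numeric
# and text stays character
new_row <- data.frame(studentID = 3, name = 'Peter', age = 22,
                      stringsAsFactors = FALSE)

students <- rbind(students, new_row)
str(students)
```

The column names of `new_row` have to match those of `students`, because `rbind()` lines up data frames by column name.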

So far, we have covered most of the basic operations on data frames. I will cover the advanced part when we talk about the `dplyr` package. Before we stop, I want to mention that there is a problem in today's code: R printed a warning. Let's see what it is and how to solve it.

**Factor**:

Let's check the error. 

```R
> df_2 <- rbind(df_2, c(5, 'Frany', 25, 'F', 'EE', 2.5))
Warning messages:
1: In `[<-.factor`(`*tmp*`, ri, value = "Frany") :
  invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, ri, value = "EE") :
  invalid factor level, NA generated
```
As you can see, R returns two warnings (not errors, since the code still runs): `Frany` and `EE` are invalid factor levels. What does this mean?

In [12]:
df_2 <- rbind(df_2, c(5, 'Frany', 25, 'F', 'EE', 2.5))

“invalid factor level, NA generated”

In [13]:
df_2

studentID,name,age,gender,major,GPA
1,Shawn,21,M,STAT,3.1
2,Jack,22,M,CS,3.2
3,Peter,22,M,EDU,3.3
4,Mary,19,F,STAT,3.4
5,,25,F,,2.5


R stored `name`, `gender`, and `major` as factors. If we try to add a value to one of these columns that is not among its existing levels, R replaces it with `NA`. Before we solve this problem, we need to understand what a factor is.

So what is a factor? We can understand it this way: variables generally fall into two types, continuous variables and categorical variables.

> A **categorical variable** is a variable that can take on one of a limited, and usually fixed, number of possible values, assigning each individual or other unit of observation to a particular group or nominal category on the basis of some qualitative property. 
>
> -- Wikipedia

Some examples help us understand categorical variables.

1. Usually, we treat gender as two categories, such as Male and Female.
2. Sometimes, we split age into categories: teenager, youth, adult, and senior.
3. Suppose you want to buy a new cell phone. The brands are categorical data: Apple, Samsung, Xiaomi, OnePlus, and so on.

Now you have an idea of what a categorical variable is. You may also notice that some categorical variables have an order while the rest do not. For `gender`, Male and Female are on equal footing; they have no order. However, `age` has an order: $\text{teenager} < \text{youth} < \text{adult} < \text{senior}$. Whether to treat a variable as categorical depends on your analysis goal and questions.

In R, we use `factor()` to create a factor.

In [14]:
# data without order
cella <- c('Male', 'Female')
gender <- factor(cella, ordered = FALSE)
# data with order
cellb <- c('Bachelor', 'Master', "Ph.D")
degree <- factor(cellb, ordered = TRUE)

In [15]:
gender

In [16]:
degree

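We can see that if we set `ordered = TRUE`, R shows us the order of the levels. Here the order is the default one: the levels are the sorted unique values (alphabetical for character data), which in this example happens to match the order we want. If you want to set your own order, use the `levels` argument as follows.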

In [17]:
# data with own order
cellc <- c('Samsung', 'Apple', 'Xiaomi', 'OnePlus')
phone <- factor(cellc, ordered = TRUE, levels = c('Apple', 'Xiaomi', 'Samsung', 'OnePlus'))
phone
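
To see what an ordered factor buys us, and why the default level order can be a trap, here is a small sketch. The `sizes` example is made up for illustration.

```R
# Explicit levels: the order is exactly what we wrote
sizes <- factor(c('small', 'large', 'medium'),
                levels = c('small', 'medium', 'large'),
                ordered = TRUE)
levels(sizes)

# Ordered factors support comparisons
sizes[1] < sizes[2]        # is 'small' below 'large'?
as.character(max(sizes))   # the highest level present in the data

# Without levels =, R sorts the values, which is alphabetical here
default_order <- factor(c('small', 'large', 'medium'), ordered = TRUE)
levels(default_order)
```

With the default, `large` would rank below `small`, so it is usually safer to spell out `levels` whenever the order matters.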

Sometimes we already have fixed levels. In this case, we can combine `factor()` and `levels()` to do something interesting.

In [18]:
celld <- c(1, 3, 2, 3, 3, 3, 2, 2, 1, 3, 2, 1)
fdata <- factor(celld)
levels(fdata) <- c('I', 'II', 'III')
fdata

See, we converted `double` data to `factor` data. This method saves a lot of time when we need to enter a large amount of categorical data.
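
The same relabeling can be done in a single call with the `levels` and `labels` arguments of `factor()`. This is a sketch of an alternative, with a shorter made-up vector `codes`.

```R
codes <- c(1, 3, 2, 3, 1)

# levels = the values found in the data, labels = the names to show instead
grades <- factor(codes, levels = c(1, 2, 3), labels = c('I', 'II', 'III'))
grades

# Count how many times each label occurs
table(grades)
```

Stating `levels` explicitly also guards against a code that happens to be missing from the data, which would otherwise shift the labels.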

Those are the basic operations on a `factor` in R. There is one last thing to note before we solve the error. When we created the data frame, R converted the character columns to factors (this is the default in R versions before 4.0.0; from 4.0.0 on, `data.frame()` keeps them as character unless you pass `stringsAsFactors = TRUE`). In that case, any new item we add to such a column must already exist among the original levels.

In [19]:
test <- factor(c('a', 'b', 'c'))
test

In [20]:
# Try to add a new item
test[4] <- 'd' 

“invalid factor level, NA generated”

In [21]:
# R uses NA in place of the new item
test

In [22]:
# add an item that already exists in the original factor levels
test[5] <- 'a'
test
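
If we want to keep the data as a factor, one option is to extend the levels before assigning the new value. This is a sketch of that idea on a fresh copy of `test`.

```R
test <- factor(c('a', 'b', 'c'))

# Register 'd' as a valid level first ...
levels(test) <- c(levels(test), 'd')

# ... then the assignment works without a warning
test[4] <- 'd'
test

# No missing values this time
any(is.na(test))
```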

Now we know the reason for the warning: we need to convert the factor columns to character before we can add a new item. The `as.character()` function can help us.

In [23]:
df_2

studentID,name,age,gender,major,GPA
1,Shawn,21,M,STAT,3.1
2,Jack,22,M,CS,3.2
3,Peter,22,M,EDU,3.3
4,Mary,19,F,STAT,3.4
5,,25,F,,2.5


In [24]:
df_2$name <- as.character(df_2$name)
df_2$major <- as.character(df_2$major)
df_2 <- rbind(df_2, c(6, 'Jack', 25, 'M', 'EE', 2.5))
# Use str() to check the structure again. Note that the numeric columns are
# now chr, because c() stored each new row as a character vector.
str(df_2)

'data.frame':	6 obs. of  6 variables:
 $ studentID: chr  "1" "2" "3" "4" ...
 $ name     : chr  "Shawn" "Jack" "Peter" "Mary" ...
 $ age      : chr  "21" "22" "22" "19" ...
 $ gender   : Factor w/ 2 levels "F","M": 2 2 2 1 1 2
 $ major    : chr  "STAT" "CS" "EDU" "STAT" ...
 $ GPA      : chr  "3.1" "3.2" "3.3" "3.4" ...


In [25]:
# We successfully added a new row without missing values
df_2[6, ]

studentID,name,age,gender,major,GPA
6,Jack,25,M,EE,2.5
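
When a data frame has many factor columns, converting them one by one gets tedious. Here is a minimal sketch that converts all of them at once and then adds a row as a one-row data frame so the numeric column stays numeric. The `students` data frame is a hypothetical stand-in, and `stringsAsFactors = TRUE` is set only to reproduce the factor columns in any R version.

```R
students <- data.frame(
  name  = c('Shawn', 'Jack'),
  major = c('STAT', 'CS'),
  GPA   = c(3.1, 3.2),
  stringsAsFactors = TRUE   # force factor columns for this demonstration
)

# Find the factor columns and convert each one to character
is_fct <- sapply(students, is.factor)
students[is_fct] <- lapply(students[is_fct], as.character)

# A new name and a new major are no longer a problem
students <- rbind(students,
                  data.frame(name = 'Frany', major = 'EE', GPA = 2.5,
                             stringsAsFactors = FALSE))
str(students)
```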


OK, we solved the error and finished this tutorial.

Next time, we will talk about the `list` in R; it will be short.

See you!