# Study Note - BDU R101(RP0101EN) 
The Learning Objectives:
- The basics of R
- Writing your own R scripts
- How to use R to solve problems related to movies data
- The fundamentals of R Syntax
- Vectors, lists, matrix, arrays and dataframes
- Reading and writing data in R

The Syllabus:
- Module 1 - R basics
- Module 2 - Data structures in R
- Module 3 - R programming fundamentals
- Module 4 - Working with data in R
- Module 5 - Strings and Dates in R

<hr>
## Module 1: R Basics 

Load the first five rows of the movie dataset for the exercises.

In [3]:
data = read.csv("/resources/movies-db.csv", nrows = 5, header = TRUE)   
data #read the example dataset

                name year length_min     genre average_rating cost_millions
1          Toy Story 1995         81 Animation            8.3          30.0
2              Akira 1998        125 Animation            8.1          10.4
3 The Breakfast Club 1985         97     Drama            7.9           1.0
4         The Artist 2011        100   Romance            8.0          15.0
5       Modern Times 1936         87    Comedy            8.6           1.5
  foreign age_restriction
1       0               0
2       1              14
3       0              14
4       1              12
5       0              10

### Math in R

Total time of the first two movies

In [4]:
81 + 125

[1] 206

Convert the minutes to hours

In [5]:
206/60

[1] 3.433333

**Math Operations in R include:**
- addition: 4+2
- subtraction: 4-2
- multiplication: 4*2
- division: 4/2
- exponentiation: 4^2

### Variables in R

Assign value to a variable 'x'

In [6]:
x <- 81 + 125

In [9]:
x

[1] 206

Perform math operations using variables

In [10]:
y <- x/60

In [11]:
y

[1] 3.433333

Note: variables are typically assigned using ***<-*** but can also be assigned using ***=***, as in x<-1 or x=1. 

Variables can also be reassigned (overwritten).

In [12]:
x <- 97 + 100
x

[1] 197

In [13]:
x <- x/60
x

[1] 3.283333

Note: variables in R occupy memory. It's good practice to remove a variable from memory with the ***rm(my_variable)*** command when it is no longer needed.

In [14]:
rm(x) #remove x variable from the memory

In [15]:
rm(y) #remove y variable from the memory

The order of operations follows the usual rules of arithmetic.
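A small sketch of how precedence plays out, reusing the run times from the cells above. The variable names are illustrative only.

```r
# Exponentiation binds tighter than multiplication/division,
# which bind tighter than addition/subtraction.
no_parens   <- 81 + 125 / 60      # division happens first
with_parens <- (81 + 125) / 60    # parentheses force the sum first
power_first <- 2 * 3^2            # 3^2 is evaluated before the multiplication

print(no_parens)
print(with_parens)
print(power_first)
```

When in doubt, adding parentheses makes the intended order explicit.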

### Strings in R

Assign a movie name to the variable 'movie1'

In [16]:
movie1 <- "Toy Story"
movie1

[1] "Toy Story"

### Vector in R

A vector is a one-dimensional array of objects and a simple way to store data. A vector can hold any number of elements, but they must all be of the same type. The ***c( )*** function creates a vector (note the lowercase c; R is case-sensitive). There are three common types of vector: numeric, character, and logical.

To create a vector that contains the run time of the first two movies. Convert the time to hours. 

In [17]:
c(81, 125)/60

[1] 1.350000 2.083333

To assign the vector to a variable movie_length.

In [18]:
movie_length <- c(81, 125)
movie_length/60

[1] 1.350000 2.083333

To create a numeric vector in two ways.

In [19]:
c(1,2,3,4,5,6)

[1] 1 2 3 4 5 6

In [20]:
c(1:6)

[1] 1 2 3 4 5 6

In [21]:
c(6:1) #store the numbers in decreasing order

[1] 6 5 4 3 2 1

To create a character vector.

In [22]:
c("Toy Story","Akira", "The Breakfast Club", "The Artist", "Modern Times")

[1] "Toy Story"          "Akira"              "The Breakfast Club"
[4] "The Artist"         "Modern Times"      

To create a logical vector (***TRUE/FALSE, T/F***).

In [23]:
movie_ratings <- c(7.3, 8.5, 8.3, 6.5, 6.9)
movie_ratings > 7.5

[1] FALSE  TRUE  TRUE FALSE FALSE

### Factors in R

Factors in R are vectors that can take only a limited number of distinct values, i.e. categorical variables. By contrast, a continuous variable can take any of an unlimited number of values. 

To transform a vector to factor using ***factor( )***.

In [24]:
genre_vector <- c("comedy","comedy","animation","animation","crime")
genre_factor <- factor(genre_vector)
genre_factor

[1] comedy    comedy    animation animation crime    
Levels: animation comedy crime

To compare the output of ***summary( )*** command on character vector and factor.

In [25]:
summary(genre_vector)

   Length     Class      Mode 
        5 character character 

In [26]:
summary(genre_factor)

animation    comedy     crime 
        2         2         1 

To transform a vector to an ordered factor.

In [27]:
movielength_vector <- c("very short", "short", "medium", "short", "long", "very short", "very long")
mvlength_factor <- factor(movielength_vector, ordered = TRUE, levels = c("very short", "short", "medium", "long", "very long"))
mvlength_factor

[1] very short short      medium     short      long       very short very long 
Levels: very short < short < medium < long < very long
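Because the levels of an ordered factor have a rank, its elements can be compared. The sketch below rebuilds the factor from the cell above; the two result variables are names chosen here for illustration.

```r
# Same ordered factor as in the cell above
movielength_vector <- c("very short", "short", "medium", "short", "long", "very short", "very long")
mvlength_factor <- factor(movielength_vector, ordered = TRUE,
                          levels = c("very short", "short", "medium", "long", "very long"))

# Comparisons follow the level order, not alphabetical order
first_shorter_than_third <- mvlength_factor[1] < mvlength_factor[3]  # "very short" vs "medium"
at_least_medium <- mvlength_factor >= "medium"

print(first_shorter_than_third)
print(at_least_medium)
print(table(mvlength_factor))   # counts per level, listed in level order
```

The same comparison on an unordered factor would give NA with a warning, which is one reason to set ordered = TRUE for ranked categories.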

### Vector Operations in R

To name elements of a vector using ***names( )*** function.

In [28]:
year <- c(1995, 1998, 1985, 2011, 1936)
names(year) <- c("Toy Story","Akira", "The Breakfast Club", "The Artist", "Modern Times")
year["Akira"]

Akira 
 1998 

To find the length of a vector using ***length( )*** function.

In [29]:
length(year)

[1] 5

To sort a vector using ***sort( )*** function.

In [30]:
year_sorted <- sort(year)  #sort in ascending order
year_sorted

      Modern Times The Breakfast Club          Toy Story              Akira 
              1936               1985               1995               1998 
        The Artist 
              2011 

To find the smallest and largest numbers using ***min( )*** and ***max( )*** functions.

In [31]:
min(year)

[1] 1936

In [32]:
max(year)

[1] 2011

To compute the average of numbers in two ways using ***sum( )*** or ***mean( )***.

In [2]:
cost_2014 <- c(8.6, 8.5, 8.1)  #create a new vector for demonstration
sum(cost_2014)/length(cost_2014)  #compute average using 'sum()' function

[1] 8.4

In [3]:
mean(cost_2014)  #compute average using 'mean()' function

[1] 8.4

To view the descriptive statistics of a vector using ***summary( )*** function.

In [4]:
summary(cost_2014)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   8.10    8.30    8.50    8.40    8.55    8.60 

To retrieve an element of vector using the index. 
<br>*Note: Indexing in R starts at 1, unlike Python, where it starts at 0.*

In [5]:
cost_2014[2]  #retrieve the second element

[1] 8.5

In [6]:
cost_2014[c(2,3)]  #retrieve the second and third elements

[1] 8.5 8.1

In [7]:
cost_2014[1:3]  #retrieve the first to third elements

[1] 8.6 8.5 8.1

To remove an element of a vector from the output (not from the vector) using a negative index. 
<br>*Note: This differs from Python, where a negative index counts from the end; in R it excludes the element at that position.* 

In [8]:
cost_2014[-1]  #remove the first element from the output

[1] 8.5 8.1

In [10]:
cost_2014[1]  #the first element is not removed from the vector

[1] 8.6

In [11]:
cost_2014[4]  #accessing an index beyond the vector's length returns NA

[1] NA

To retrieve elements that meet a certain criterion.

In [12]:
cost_2014[cost_2014 > 8.3]  #retrieve elements that are larger than 8.3

[1] 8.6 8.5

Missing values are represented with **NA**. It is common to have missing values in real world datasets. 

In [14]:
age_restric <- c(14, 12, 10, NA, 18, NA) 
age_restric  #there might be unknown age restrictions for some movies

[1] 14 12 10 NA 18 NA
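Missing values propagate through most calculations, so it helps to detect or skip them explicitly. A minimal sketch using the same vector; the result names are illustrative.

```r
age_restric <- c(14, 12, 10, NA, 18, NA)

print(is.na(age_restric))                     # TRUE where a value is missing
n_missing <- sum(is.na(age_restric))          # number of missing values
mean_raw  <- mean(age_restric)                # NA: arithmetic with NA gives NA
mean_ok   <- mean(age_restric, na.rm = TRUE)  # drop NAs before averaging

print(n_missing)
print(mean_raw)
print(mean_ok)
```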

To perform arithmetic on vectors.

In [19]:
age_restriction <- c(0, 14, 14, 12, 10)  #the age restrictions for first five movies
sequences <- c(2, 3, 0, 2, 6) #same length as the vector above; a shorter vector would be recycled
multiply <- age_restriction * sequences #vector multiplication
multiply

[1]  0 42  0 24 60

In [20]:
cost_2014 * 10  #multiplication between vector and number

[1] 86 85 81
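Multiplying a vector by a single number works because R recycles the shorter operand. A hedged sketch of that behaviour, with made-up values in the second example:

```r
age_restriction <- c(0, 14, 14, 12, 10)

# Length-1 operand: the single value is reused for every element
plus_one <- age_restriction + 1

# Shorter operand whose length divides the longer one: recycled silently
pairs <- c(1, 2, 3, 4) * c(10, 100)

print(plus_one)
print(pairs)
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning, so matching lengths is the safer habit.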

<hr>
## Module 2: Data Structures in R

### Arrays in R
An array is a structure that contains data of the **same type**, whether that's character strings, numeric values, or integers. Arrays can be **multi-dimensional** as well, so the data can be contained in multiple rows and columns.

#### What is the difference between an array and a vector?

Vectors are always one-dimensional, like a single row of data. An array, on the other hand, can be multidimensional (for example, stored as rows and columns). The "dimension" (the dim argument) gives the size along each dimension, e.g. c(2,3) means 2 rows and 3 columns.
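To make the difference concrete, the sketch below shapes the same 24 numbers three ways and inspects them with dim(). The values and variable names are placeholders, not part of the movie data.

```r
values <- 1:24

v <- values                              # plain vector: no dim attribute
m <- array(values, dim = c(4, 6))        # 2-D: 4 rows, 6 columns
a <- array(values, dim = c(2, 3, 4))     # 3-D: 2 rows, 3 columns, 4 layers

print(dim(v))        # NULL for a plain vector
print(dim(m))
print(dim(a))
print(a[1, 2, 3])    # row 1, column 2, layer 3
```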

To create an array from a vector using ***array( )*** function.

In [30]:
movie_vector <- c("Toy Story","Akira", "The Breakfast Club", "The Artist", "Modern Times", "Jumanji")
movie_array <- array(movie_vector, dim = c(2,3))
movie_array

     [,1]        [,2]                 [,3]          
[1,] "Toy Story" "The Breakfast Club" "Modern Times"
[2,] "Akira"     "The Artist"         "Jumanji"     

To access an array by specifying the row and column of the element.

In [31]:
movie_array[1, ]  #access the entire first row with empty value for column

[1] "Toy Story"          "The Breakfast Club" "Modern Times"      

In [32]:
movie_array[,1]  #access the entire first column with empty value for row

[1] "Toy Story" "Akira"    

In [33]:
movie_array[1,1]  #access the first element

[1] "Toy Story"

### Matrix in R
A matrix is similar in structure to an array. The main difference is that a matrix must be **two dimensional**.

To create a matrix from a vector using ***matrix( )*** function with specifying ***nrow*** and ***ncol***.

In [34]:
movie_matrix <- matrix(movie_vector, nrow=2, ncol=3)
movie_matrix  #default order is by column

     [,1]        [,2]                 [,3]          
[1,] "Toy Story" "The Breakfast Club" "Modern Times"
[2,] "Akira"     "The Artist"         "Jumanji"     

In [36]:
movie_matrix <- matrix(movie_vector, nrow=2, ncol=3, byrow=TRUE) 
movie_matrix #change order to by row

     [,1]         [,2]           [,3]                
[1,] "Toy Story"  "Akira"        "The Breakfast Club"
[2,] "The Artist" "Modern Times" "Jumanji"           

To access a certain subset of a matrix.

In [38]:
movie_matrix[1, 2:3]  #access the elements in row 1 and column 2 to 3. 

[1] "Akira"              "The Breakfast Club"

### List in R
In R, a list is a collection of objects, similar to a vector.
But unlike a vector, the elements inside of a list can **differ in terms of data type**.

To **create** a list using ***list( )*** function.

In [40]:
movie <- list("Toy Story", 1995, c("Animation", "Adventure", "Comedy"))
movie

[[1]]
[1] "Toy Story"

[[2]]
[1] 1995

[[3]]
[1] "Animation" "Adventure" "Comedy"   


To **access** items in a list using the index.

In [41]:
movie[2]  #access the second item in the movie list

[[1]]
[1] 1995


In [42]:
movie[2:3]  #access the second to third items in the movie list

[[1]]
[1] 1995

[[2]]
[1] "Animation" "Adventure" "Comedy"   


To **name** the individual variables in a list.

In [88]:
movie <- list(name="Toy Story", year=1995, genre=c("Animation", "Adventure", "Comedy"))
movie

$name
[1] "Toy Story"

$year
[1] 1995

$genre
[1] "Animation" "Adventure" "Comedy"   


To **access** the named list with ***$*** symbol followed by the name,   ***[name]***, or ***index***.

In [89]:
movie$genre   #access the genre element in the movie list

[1] "Animation" "Adventure" "Comedy"   

In [90]:
movie["genre"]   #access the genre element in the movie list

$genre
[1] "Animation" "Adventure" "Comedy"   


In [91]:
movie[3]  #access the items in list using index

$genre
[1] "Animation" "Adventure" "Comedy"   
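Note that single brackets return a list, while double brackets (or $) return the element itself. A short sketch of the difference on the same named list:

```r
movie <- list(name = "Toy Story", year = 1995,
              genre = c("Animation", "Adventure", "Comedy"))

sub_list <- movie["year"]     # single brackets: a list of length 1
value    <- movie[["year"]]   # double brackets: the element itself

print(class(sub_list))
print(class(value))
print(value + 1)              # arithmetic works on the extracted value
```

Arithmetic on sub_list would fail, since it is still a list.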


To **add** new items with *** <- *** symbol. A new item is appended to the **end** of the list.

In [47]:
movie["age"] <- 5  #add an element named age with a value of 5
movie

$name
[1] "Toy Story"

$year
[1] 1995

$genre
[1] "Animation" "Adventure" "Comedy"   

$age
[1] 5


To **change** the value of elements in a list.

In [48]:
movie["age"] <- 6  #overwrite an element
movie

$name
[1] "Toy Story"

$year
[1] 1995

$genre
[1] "Animation" "Adventure" "Comedy"   

$age
[1] 6


To **remove** an element from a list.

In [49]:
movie["age"] <- NULL  #remove an element from the list
movie

$name
[1] "Toy Story"

$year
[1] 1995

$genre
[1] "Animation" "Adventure" "Comedy"   


### Data Frame in R
A data frame is a type of structure that contains **correlated information**. So for example, a data frame would be a great structure for storing these movie titles along with their corresponding years.

To **create** a data frame using the ***data.frame( )*** function. <br> *Note: To avoid invalid factor levels, set stringsAsFactors to FALSE. **stringsAsFactors = FALSE** tells R to keep character variables as they are rather than converting them to factors. This helps to avoid a warning when inserting new rows into the data frame. (Since R 4.0.0, FALSE is the default.)*

In [78]:
movie <- data.frame(name=c("Toy Story","Akira", "The Breakfast Club", "The Artist", "Modern Times", "Fight Club"), year=c(1995, 1998, 1985, 2011, 1936, 1999), stringsAsFactors=FALSE)
movie

                name year
1          Toy Story 1995
2              Akira 1998
3 The Breakfast Club 1985
4         The Artist 2011
5       Modern Times 1936
6         Fight Club 1999

To **access** the data using ***$*** symbol or the ***[index]***.

In [51]:
movie$name  #retrieve all the names of the movie dataframe

[1] Toy Story          Akira              The Breakfast Club The Artist        
[5] Modern Times       Fight Club        
6 Levels: Akira Fight Club Modern Times The Artist ... Toy Story

In [53]:
movie[1]  #retrieve the first column

                name
1          Toy Story
2              Akira
3 The Breakfast Club
4         The Artist
5       Modern Times
6         Fight Club

In [54]:
movie[2]  #retrieve the second column

  year
1 1995
2 1998
3 1985
4 2011
5 1936
6 1999

In [55]:
movie[1,2]  #retrieve value in row 1 and column 2

[1] 1995
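Rows can also be selected by a logical condition rather than by position. A minimal sketch on the same data frame; the 1990 cut-off and the result names are arbitrary choices for illustration.

```r
movie <- data.frame(name = c("Toy Story", "Akira", "The Breakfast Club",
                             "The Artist", "Modern Times", "Fight Club"),
                    year = c(1995, 1998, 1985, 2011, 1936, 1999),
                    stringsAsFactors = FALSE)

recent       <- movie[movie$year > 1990, ]      # matching rows, all columns
recent_names <- movie$name[movie$year > 1990]   # one column, filtered

print(recent)
print(recent_names)
```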

To get information about the data frame structure using ***str( )***, ***head( )***, ***tail( )*** functions.

In [58]:
str(movie)  #view the structure of movie dataframe

'data.frame':	6 obs. of  2 variables:
 $ name: Factor w/ 6 levels "Akira","Fight Club",..: 6 1 5 4 3 2
 $ year: num  1995 1998 1985 2011 1936 ...


In [59]:
head(movie)  #retrieve the first 6 records

                name year
1          Toy Story 1995
2              Akira 1998
3 The Breakfast Club 1985
4         The Artist 2011
5       Modern Times 1936
6         Fight Club 1999

In [60]:
tail(movie)  #retrieve the last 6 records

                name year
1          Toy Story 1995
2              Akira 1998
3 The Breakfast Club 1985
4         The Artist 2011
5       Modern Times 1936
6         Fight Club 1999

To ***insert*** a new column and a new row.

In [79]:
movie['length'] <- c(81, 125, 97, 100, 87, 139) #insert a new column

In [80]:
movie

                name year length
1          Toy Story 1995     81
2              Akira 1998    125
3 The Breakfast Club 1985     97
4         The Artist 2011    100
5       Modern Times 1936     87
6         Fight Club 1999    139

In [81]:
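movie <- rbind(movie, data.frame(name="Dr. Strangelove", year=1964, length=94, stringsAsFactors=FALSE))
movie  # insert a new row as a one-row data frame; c(...) would coerce every value to character
# a warning message was received here before setting stringsAsFactors = FALSE 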

                name year length
1          Toy Story 1995     81
2              Akira 1998    125
3 The Breakfast Club 1985     97
4         The Artist 2011    100
5       Modern Times 1936     87
6         Fight Club 1999    139
7    Dr. Strangelove 1964     94

To **delete** rows or columns.

In [82]:
movie <- movie[-7,]  #delete row 7
movie

                name year length
1          Toy Story 1995     81
2              Akira 1998    125
3 The Breakfast Club 1985     97
4         The Artist 2011    100
5       Modern Times 1936     87
6         Fight Club 1999    139

In [84]:
movie["length"] <- NULL  #delete column "length"
movie

                name year
1          Toy Story 1995
2              Akira 1998
3 The Breakfast Club 1985
4         The Artist 2011
5       Modern Times 1936
6         Fight Club 1999

<hr>
### A Summary of Five Data Structures in R

<table style="width:100%">
  <caption>Five Data Structures in R</caption>
  <tr>
    <th>Data Structure</th>
    <th>No. of Dimensions</th>
    <th>No. of Data Types</th>
    <th>Order</th>
    <th>Syntax</th>
  </tr>
  <tr>
    <th>Vector<br>Factor</th>
    <td>One-Dimension</td>
    <td>Single Type</td>
    <td>Ordered</td>
    <td>c(...)<br>factor(...)</td>
  </tr>
  <tr>
    <th>Array</th>
    <td>Multiple-Dimensions</td>
    <td>Single Type</td>
    <td>Not Ordered</td>
    <td>array(vector, dim = c(row, column))</td>
  </tr>
  <tr>
    <th>Matrix</th>
    <td>Two-Dimensions</td>
    <td>Single Type</td>
    <td>Not Ordered</td>
    <td>matrix(vector, nrow=row, ncol=column)</td>
  </tr>
  <tr>
    <th>List</th>
    <td>One-Dimension</td>
    <td>Multiple Types</td>
    <td>Not Ordered</td>
    <td>list(...)</td>
  </tr>
  <tr>
    <th>Data Frame</th>
    <td>Two-Dimensions</td>
    <td>Multiple Types</td>
    <td>Not Ordered</td>
    <td>data.frame(column_1=c(...),..., column_n=c(...), stringsAsFactors=FALSE)</td>
  </tr>
</table>


<hr>
## Module 3 - R Programming Fundamentals

### Conditions and Loops in R

To use an ***if statement*** to check whether a movie was released after the year 2000.

In [1]:
movie_year <- 1997

In [3]:
if(movie_year > 2000){
    print('Movie year is greater than 2000')
} else{
    print('Movie year is not greater than 2000')
}

[1] "Movie year is not greater than 2000"


**Comparison and Logical Operators**
<li> "<" less than? </li>
<li> ">" greater than? </li>
<li> "<=" less than or equal to? </li>
<li> ">=" greater than or equal to? </li>
<li> "==" is equal to? </li>
<li> "!=" is not equal to? </li>
<li> "&" and </li>
<li> "|" or </li>
<li> "!" not </li>
<li> "%in%" is found in? </li>
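A few of the operators listed above in use. The genre value and variable names are made up for this sketch.

```r
movie_year  <- 1997
movie_genre <- "Drama"

is_nineties <- movie_year >= 1990 & movie_year < 2000      # both must hold
is_light    <- movie_genre %in% c("Comedy", "Animation")   # membership test
not_light   <- !is_light                                   # negation
either      <- is_nineties | is_light                      # at least one holds

print(c(is_nineties, is_light, not_light, either))
```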

To use ***for loop*** to cycle through all the values in a vector.

In [5]:
years <- c(1995, 1985, 2011, 1936, 1999)
for (yr in years){
    print(yr)
}  #print all values in the years vector

[1] 1995
[1] 1985
[1] 2011
[1] 1936
[1] 1999


In [6]:
for (yr in years){
    if(yr < 1980){
        print("Old movie")
    } else{
        print("Not that old")
    }
}  #embed the if statement in the for loop to customize the output

[1] "Not that old"
[1] "Not that old"
[1] "Not that old"
[1] "Old movie"
[1] "Not that old"


To use ***while loop*** to execute commands repeatedly as long as a condition holds. 

In [7]:
count <- 1

while(count<=5){
    print(c("Iteration number:", count))
    count <- count + 1
}  #print the iteration number until 5

[1] "Iteration number:" "1"                
[1] "Iteration number:" "2"                
[1] "Iteration number:" "3"                
[1] "Iteration number:" "4"                
[1] "Iteration number:" "5"                
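In the output above, c( ) turns the number into a string and prints a two-element vector. If a single message per iteration is preferred, paste( ) is one option; the messages vector below is an addition for illustration.

```r
count <- 1
messages <- character(0)   # collect messages so they can be inspected later

while (count <= 5) {
    msg <- paste("Iteration number:", count)  # one string per iteration
    print(msg)
    messages <- c(messages, msg)
    count <- count + 1
}
```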


### Functions in R

A **function** is a block of code that can be **re-used** in different parts of a program. Generally speaking, functions can be divided into **pre-defined functions** and **user-defined functions**. Pre-defined functions are the functions that are already defined for you, whether they're built in to R or provided in a separate package. Examples of pre-defined functions are ***mean( )*** and ***sort( )***.

To define a new function using ***<- function(parameter){...}*** statement.

In [9]:
printHelloWorld <- function(){
    print("Hello World")
}
printHelloWorld()

[1] "Hello World"


In [33]:
add <- function(x, y){
    return(x+y)
}
add(3, 4)

[1] 7

**Note:** The "return" statement can be used to explicitly output a value from the function. When "return" is encountered, anywhere in the function, the corresponding value will be output and the function will exit. Keep in mind that if the function lacks a return statement, then R will automatically return the value of the last evaluated expression. The "return" statement is particularly useful when you need an "if else" block, since the final output value will be dependent on some condition.

In [11]:
isGoodRating <- function(rating){
    if(rating < 7){
        return("No")
    }else{
        return("Yes")
    }
}
isGoodRating(4)

[1] "No"

In [12]:
isGoodRating <- function(rating, threshold=7){  
    if(rating < threshold){
        return("No")
    }else{
        return("Yes")
    }
}    #Default input values in functions
isGoodRating(8)

[1] "Yes"

In [13]:
isGoodRating(8, threshold = 8.5)  #Default value can be overwritten

[1] "No"
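As the note on "return" mentions, a function without an explicit return gives back its last evaluated expression. A sketch of the same rating check written that way; ratingLabel is a hypothetical name, not part of the course material.

```r
# Hypothetical helper: no explicit return()
ratingLabel <- function(rating, threshold = 7) {
    if (rating < threshold) "No" else "Yes"   # the value of the if/else is returned
}

print(ratingLabel(4))
print(ratingLabel(8))
print(ratingLabel(8, threshold = 8.5))
```

Both styles behave the same here; the explicit form can be easier to read when a function has several exit points.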

***Note: functions can be embedded into other functions.***

To distinguish **Global (<<-) and Local (<-)** variables. In the following example, *y* is a global variable and *temp* is a local variable, which cannot be accessed outside the function it belongs to.

In [14]:
myFunction <- function(){
    y <<-3.14  #y is defined as global variable
    temp <- 'Hello World' #temp is defined as local variable
    return(temp)
}
myFunction()

[1] "Hello World"

In [15]:
y  #global variable can be accessed outside the function

[1] 3.14

In [16]:
temp  #local variable is not defined outside the function

ERROR: Error in eval(expr, envir, enclos): object 'temp' not found


### Objects and Classes in R

In R, an object is a data structure that has attributes, and methods that act on those attributes. Types of object classes include:
<li>Numeric (default type)</li>
<li>Character</li>
<li>Logical</li>
<li>Integer (requested explicitly, e.g. with as.integer)</li>

To create different object classes and check their class type using ***class(object)*** statement.

In [18]:
average_rating <- 8.3  #create a numeric object class
average_rating
class(average_rating)  #check the class type

[1] 8.3

[1] "numeric"

In [19]:
movies <- c("Toy Story", "Akira")  #create a character object
movies
class(movies)  #check the class type

[1] "Toy Story" "Akira"    

[1] "character"

In [20]:
logical_vector <- c(T, F, F, T)  #create a logical object with TRUE or FALSE boolean values
logical_vector
class(logical_vector)  #check the class type

[1]  TRUE FALSE FALSE  TRUE

[1] "logical"

In [21]:
age_restriction <- c(12, 10, 18, 18)
class(age_restriction)  #by default, it is numeric type
integer_vector <- as.integer(age_restriction) #create an integer class
class(integer_vector)

[1] "numeric"

[1] "integer"

To convert the class types.

In [22]:
year <- as.character(1995)  #convert numeric class to character class
class(year)

[1] "character"

***Note: when numbers and characters are combined in a vector (which supports only a single data type), the class is automatically converted to character.***

In [23]:
combined <- c("Toy Story", 1995, "Akira", 1998)
class(combined)  #will be character

[1] "character"
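Values that were coerced to character can be converted back when they look like numbers. A minimal sketch on the same vector; suppressWarnings( ) is used here only to keep the output quiet.

```r
combined <- c("Toy Story", 1995, "Akira", 1998)

years <- as.numeric(combined[c(2, 4)])   # "1995", "1998" become 1995, 1998
print(class(years))
print(years + 1)

# Non-numeric strings cannot be converted: the result is NA (normally with a warning)
titles_as_number <- suppressWarnings(as.numeric(combined[c(1, 3)]))
print(titles_as_number)
```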

### Debugging in R

**Errors** can often be difficult to locate in a large body of code. The process of finding the source of these programming bugs and fixing them, is known as **debugging**.

To catch errors with the ***tryCatch( )*** function. A "tryCatch" statement runs the code normally if no error occurs.
If the code raises an error and an ***error*** handler is supplied, the handler runs instead of stopping the program; without a handler, the error is raised as usual.

In [24]:
tryCatch(10+10)  #no errors in the statement
tryCatch("a"+10)  #invalid statement

[1] 20

ERROR: Error in "a" + 10: non-numeric argument to binary operator


In [25]:
tryCatch(10+10, error = function(e)
    print("Oops, something went wrong!"))

[1] 20

In [26]:
tryCatch("a"+10, error = function(e)
    print("Oops, something went wrong!"))

[1] "Oops, something went wrong!"
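The handler receives the error as its argument, so the original message can be reported and a fallback value returned. A sketch under that idea; safeAdd is a hypothetical wrapper name, and the finally clause is optional.

```r
# Hypothetical wrapper: returns NA instead of stopping on an error
safeAdd <- function(a, b) {
    tryCatch(
        a + b,
        error = function(e) {
            message("Caught: ", conditionMessage(e))  # report the original message
            NA                                        # fallback value
        },
        finally = message("safeAdd finished")         # runs whether or not an error occurred
    )
}

ok  <- safeAdd(10, 10)
bad <- safeAdd("a", 10)
print(ok)
print(bad)
```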


To handle warnings, the ***tryCatch( )*** function can be used as well.

In [27]:
as.integer("A")  #will generate a warning message

In eval(expr, envir, enclos): NAs introduced by coercion

[1] NA

In [29]:
tryCatch(as.integer("A"), warning = function(e) print("Warning."))
    #catch the warning message



<hr>
## Module 4 - Working with Data in R

### Reading CSV, Excel, and Built-in Datasets in R

<li>To read csv data files using ***read.csv("file_path")*** statement and assign it to a variable.</li> 
<li>To read Excel data files using ***read_excel("file_path")*** statement and assign it to a variable.</li>
<br> ***Note:*** *Unlike CSV, R does not have a native function for reading Excel files. To add this functionality, run the "install.packages" function to install a package that provides it ("read_excel" comes from the "readxl" package). Once a package is installed, it does not need to be installed again unless it is uninstalled. Whenever you use a library that is not native to R, you have to load it into the R environment by calling the "library" function.*

In [34]:
my_data <- read.csv("/resources/movies-db.csv")  
#The default data structure is data frame
my_data  #to view the data

                                        name year length_min     genre
1                                  Toy Story 1995         81 Animation
2                                      Akira 1998        125 Animation
3                         The Breakfast Club 1985         97     Drama
4                                 The Artist 2011        100   Romance
5                               Modern Times 1936         87    Comedy
6                                 Fight Club 1999        139     Drama
7                                City of God 2002        130     Crime
8                           The Untouchables 1987        119     Drama
9                       Star Wars Episode IV 1977        121    Action
10                           American Beauty 1999        122     Drama
11                                      Room 2015        118     Drama
12                           Dr. Strangelove 1964         94    Comedy
13                                  The Ring 1998         95    Horror
14    

<li>To **access data** from dataset using index.</li>

In [35]:
my_data['name']

                                        name
1                                  Toy Story
2                                      Akira
3                         The Breakfast Club
4                                 The Artist
5                               Modern Times
6                                 Fight Club
7                                City of God
8                           The Untouchables
9                       Star Wars Episode IV
10                           American Beauty
11                                      Room
12                           Dr. Strangelove
13                                  The Ring
14           Monty Python and the Holy Grail
15                       High School Musical
16                         Shaun of the Dead
17                               Taxi Driver
18                  The Shawshank Redemption
19                              Interstellar
20                                    Casino
21                            The Goodfellas
22        

In [36]:
my_data[1, ]  #select the first row

       name year length_min     genre average_rating cost_millions foreign
1 Toy Story 1995         81 Animation            8.3            30       0
  age_restriction
1               0

In [38]:
my_data[1, c("name", "length_min")]  #select the first row and selected columns

       name length_min
1 Toy Story         81

<li>To access **built-in datasets** in R package.</li>

In [42]:
data()  #see the list of built-in datasets



In [40]:
help(co2)  #see the description of the built-in dataset "co2"



In [43]:
help(Titanic)  #see the description of the built-in dataset "Titanic"



In [44]:
co2  #don't need to import built-in dataset 

        Jan    Feb    Mar    Apr    May    Jun    Jul    Aug    Sep    Oct
1959 315.42 316.31 316.50 317.56 318.13 318.00 316.39 314.65 313.68 313.18
1960 316.27 316.81 317.42 318.87 319.87 319.43 318.01 315.74 314.00 313.68
1961 316.73 317.54 318.38 319.31 320.42 319.61 318.42 316.63 314.83 315.16
1962 317.78 318.40 319.53 320.42 320.85 320.45 319.45 317.25 316.11 315.27
1963 318.58 318.92 319.70 321.22 322.08 321.31 319.58 317.61 316.05 315.83
1964 319.41 320.07 320.74 321.40 322.06 321.73 320.27 318.54 316.54 316.71
1965 319.27 320.28 320.73 321.97 322.00 321.71 321.05 318.71 317.66 317.14
1966 320.46 321.43 322.23 323.54 323.91 323.59 322.24 320.20 318.48 317.94
1967 322.17 322.34 322.88 324.25 324.83 323.93 322.38 320.76 319.10 319.24
1968 322.40 322.99 323.73 324.86 325.40 325.20 323.98 321.95 320.18 320.09
1969 323.83 324.26 325.47 326.50 327.21 326.54 325.72 323.50 322.22 321.62
1970 324.89 325.82 326.77 327.97 327.91 327.50 326.18 324.53 322.93 322.90
1971 326.01 326.51 327.01

### Reading Text Files into R

<li>To read text files into R using ***readLines( )*** statement.</li><br> ***Note:*** *The "text" variable receives a character vector, containing one item for each of the lines in the file. Keep in mind that a "line" is not the same thing as a sentence. Instead, lines are broken up by individual line breaks, which essentially form new paragraphs.*

In [47]:
text <- readLines("/resources/The_Artist.txt")
text

[1] "The Artist is a 2011 French romantic comedy-drama in the style of a black-and-white silent film. It was written, directed, and co-edited by Michel Hazanavicius, produced by Thomas Langmann and starred Jean Dujardin and Bérénice Bejo. The story takes place in Hollywood, between 1927 and 1932, and focuses on the relationship of an older silent film star and a rising young actress as silent cinema falls out of fashion and is replaced by the talkies."                                                                                                                                                          
[2] ""                                                                                                                                                                                                                                                                                                                                                                                               

In [50]:
length(text)
# count up the number of elements in the vector.

[1] 5

In [51]:
nchar(text) 
# count up the number of characters in each line of our character vector.

[1] 450   0 604   0   0

In [53]:
file.size("/resources/The_Artist.txt")
# the file size in bytes

[1] 1065

<li>To read text files into R using ***scan( )*** function </li><br>
***Note:*** *The "scan" function can read a text file by word, rather than by line. Again, the first argument is simply a file path. The second argument, "what", is an empty string, which tells "scan" to read character data; by default "scan" expects numbers and raises an error on ordinary text.*

In [56]:
text <- scan("/resources/The_Artist.txt", "")
text

  [1] "The"             "Artist"          "is"              "a"              
  [5] "2011"            "French"          "romantic"        "comedy-drama"   
  [9] "in"              "the"             "style"           "of"             
 [13] "a"               "black-and-white" "silent"          "film."          
 [17] "It"              "was"             "written,"        "directed,"      
 [21] "and"             "co-edited"       "by"              "Michel"         
 [25] "Hazanavicius,"   "produced"        "by"              "Thomas"         
 [29] "Langmann"        "and"             "starred"         "Jean"           
 [33] "Dujardin"        "and"             "Bérénice"        "Bejo."          
 [37] "The"             "story"           "takes"           "place"          
 [41] "in"              "Hollywood,"      "between"         "1927"           
 [45] "and"             "1932,"           "and"             "focuses"        
 [49] "on"              "the"             "relationship"    "of"
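To compare the two readers without relying on the course file, the sketch below writes a tiny temporary file (its contents are made up) and reads it both ways.

```r
# Write a small two-line file to a temporary location
path <- tempfile(fileext = ".txt")
writeLines(c("The Artist is a 2011 film.", "It is silent."), path)

by_line <- readLines(path)                       # one element per line
by_word <- scan(path, what = "", quiet = TRUE)   # one element per whitespace-separated word

print(length(by_line))
print(length(by_word))
print(by_word)

unlink(path)   # remove the temporary file
```

Here what = "" plays the same role as the empty string passed as the second argument above.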

### Writing and Saving to Files in R

<li> To export data as a text file using ***write( )*** function.</li>
<li> To export data as a csv file using ***write.csv( )*** function.</li>
<li> To export data as an Excel file using ***write.xlsx( )*** function.</li>
<li> To save R objects in .RData files using ***save( )*** function.</li>
<br>
***Note:*** *Saving data frames to an Excel file relies on an external package, such as the "xlsx" package used here. Run the "install.packages" function first. The "library" function then loads the package into the R environment so that it can be used.*



In [58]:
m <- matrix(c(1,2,3,4,5,6), nrow=2, ncol=3)
m

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

In [66]:
write(m, file="/resources/matrix_as_text.text", ncolumns = 3, sep=" ") 
# export as a text file

In [64]:
write.csv(m, file="/resources/dataset_1.csv", row.names = FALSE)
# export as a csv file

In [65]:
write.table(m, file="/resources/dataset_2.csv", row.names = FALSE, col.names = FALSE, sep = ",")
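# export as a csv file without a header row or row names; write.csv( ) is a wrapper around write.table( )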
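
To see what the two calls actually put on disk, the files can be read back as plain lines. This is a sketch with hypothetical temporary files rather than the /resources paths:

```r
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)
csv_path <- tempfile(fileext = ".csv")   # hypothetical temporary files
tbl_path <- tempfile(fileext = ".csv")

# write.csv adds a header row with generated column names V1, V2, V3
write.csv(m, file = csv_path, row.names = FALSE)
csv_lines <- readLines(csv_path)
csv_lines

# write.table with col.names = FALSE writes only the values
write.table(m, file = tbl_path, row.names = FALSE, col.names = FALSE, sep = ",")
tbl_lines <- readLines(tbl_path)
tbl_lines
```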

In [67]:
install.packages("xlsx")
library(xlsx)

Installing package into ‘/resources/common/R/Library’
(as ‘lib’ is unspecified)
also installing the dependency ‘xlsxjars’

Loading required package: rJava
Loading required package: xlsxjars


In [72]:
write.xlsx(m, file="/resources/dataset_3.xlsx", sheetName="Sheet1", col.names = TRUE, row.names = FALSE)

In [71]:
save(m, file = "/resources/vars.RData", safe = TRUE)
# save the R object "m" to an .RData file; it can be restored later with load( )
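
The counterpart of save( ) is load( ), which restores the saved objects under their original names. A minimal round-trip sketch, assuming a hypothetical temporary file:

```r
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)
path <- tempfile(fileext = ".RData")   # hypothetical temporary file

save(m, file = path)
rm(m)                    # remove the object from memory

restored <- load(path)   # load() returns the names of the restored objects
restored
m                        # the matrix is available again
```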

<hr>
## Module 5 - Strings and Dates in R

### String Operations in R

In [74]:
summary <- readLines("/resources/The_Artist.txt")
summary

[1] "The Artist is a 2011 French romantic comedy-drama in the style of a black-and-white silent film. It was written, directed, and co-edited by Michel Hazanavicius, produced by Thomas Langmann and starred Jean Dujardin and Bérénice Bejo. The story takes place in Hollywood, between 1927 and 1932, and focuses on the relationship of an older silent film star and a rising young actress as silent cinema falls out of fashion and is replaced by the talkies."                                                                                                                                                          
[2] ""                                                                                                                                                                                                                                                                                                                                                                                               

In [75]:
#count the number of characters in the first string
nchar(summary[1])

[1] 450

In [76]:
#convert the character to uppercase
toupper(summary[1])

[1] "THE ARTIST IS A 2011 FRENCH ROMANTIC COMEDY-DRAMA IN THE STYLE OF A BLACK-AND-WHITE SILENT FILM. IT WAS WRITTEN, DIRECTED, AND CO-EDITED BY MICHEL HAZANAVICIUS, PRODUCED BY THOMAS LANGMANN AND STARRED JEAN DUJARDIN AND BÉRÉNICE BEJO. THE STORY TAKES PLACE IN HOLLYWOOD, BETWEEN 1927 AND 1932, AND FOCUSES ON THE RELATIONSHIP OF AN OLDER SILENT FILM STAR AND A RISING YOUNG ACTRESS AS SILENT CINEMA FALLS OUT OF FASHION AND IS REPLACED BY THE TALKIES."

In [77]:
#convert the character to lower case
tolower(summary[1])

[1] "the artist is a 2011 french romantic comedy-drama in the style of a black-and-white silent film. it was written, directed, and co-edited by michel hazanavicius, produced by thomas langmann and starred jean dujardin and bérénice bejo. the story takes place in hollywood, between 1927 and 1932, and focuses on the relationship of an older silent film star and a rising young actress as silent cinema falls out of fashion and is replaced by the talkies."

In [78]:
#replace a specific set of characters in string. Replace space with "-".
chartr(" ", "-", summary[1])

[1] "The-Artist-is-a-2011-French-romantic-comedy-drama-in-the-style-of-a-black-and-white-silent-film.-It-was-written,-directed,-and-co-edited-by-Michel-Hazanavicius,-produced-by-Thomas-Langmann-and-starred-Jean-Dujardin-and-Bérénice-Bejo.-The-story-takes-place-in-Hollywood,-between-1927-and-1932,-and-focuses-on-the-relationship-of-an-older-silent-film-star-and-a-rising-young-actress-as-silent-cinema-falls-out-of-fashion-and-is-replaced-by-the-talkies."
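
Note that chartr( ) translates single characters one for one, so the old and new character sets must have the same length. For replacing whole substrings, gsub( ) is the usual choice. A small sketch on a made-up string:

```r
s <- "black and white"

# Character-by-character translation: space -> "-"
dashed <- chartr(" ", "-", s)
dashed

# Several characters at once: a -> x, b -> y, c -> z
translated <- chartr("abc", "xyz", s)
translated

# Substring replacement needs gsub() instead
replaced <- gsub(" and ", " & ", s)
replaced
```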

In [79]:
#split the string
char_list <- strsplit(summary[1], " ")
word_list <- unlist(char_list)
word_list

 [1] "The"             "Artist"          "is"              "a"              
 [5] "2011"            "French"          "romantic"        "comedy-drama"   
 [9] "in"              "the"             "style"           "of"             
[13] "a"               "black-and-white" "silent"          "film."          
[17] "It"              "was"             "written,"        "directed,"      
[21] "and"             "co-edited"       "by"              "Michel"         
[25] "Hazanavicius,"   "produced"        "by"              "Thomas"         
[29] "Langmann"        "and"             "starred"         "Jean"           
[33] "Dujardin"        "and"             "Bérénice"        "Bejo."          
[37] "The"             "story"           "takes"           "place"          
[41] "in"              "Hollywood,"      "between"         "1927"           
[45] "and"             "1932,"           "and"             "focuses"        
[49] "on"              "the"             "relationship"    "of"             
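
The split words still carry punctuation ("film.", "written,"). One possible clean-up, sketched here on a shortened sentence from the summary, is to strip trailing punctuation with gsub( ); the variable names are only for illustration:

```r
sentence <- "It was written, directed, and co-edited by Michel Hazanavicius."
words <- unlist(strsplit(sentence, " "))

# Remove punctuation at the end of each word only, so "co-edited" keeps its hyphen
clean_words <- gsub("[[:punct:]]+$", "", words)
clean_words
```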

In [80]:
#sort the character vector
sorted_list <- sort(word_list)
sorted_list

 [1] "1927"            "1932,"           "2011"            "a"              
 [5] "a"               "a"               "actress"         "an"             
 [9] "and"             "and"             "and"             "and"            
[13] "and"             "and"             "and"             "Artist"         
[17] "as"              "Bejo."           "Bérénice"        "between"        
[21] "black-and-white" "by"              "by"              "by"             
[25] "cinema"          "co-edited"       "comedy-drama"    "directed,"      
[29] "Dujardin"        "falls"           "fashion"         "film"           
[33] "film."           "focuses"         "French"          "Hazanavicius,"  
[37] "Hollywood,"      "in"              "in"              "is"             
[41] "is"              "It"              "Jean"            "Langmann"       
[45] "Michel"          "of"              "of"              "of"             
[49] "older"           "on"              "out"             "place"          

In [81]:
#concatenate the character vector into a single string
paste(sorted_list, collapse = " ")

[1] "1927 1932, 2011 a a a actress an and and and and and and and Artist as Bejo. Bérénice between black-and-white by by by cinema co-edited comedy-drama directed, Dujardin falls fashion film film. focuses French Hazanavicius, Hollywood, in in is is It Jean Langmann Michel of of of older on out place produced relationship replaced rising romantic silent silent silent star starred story style takes talkies. the the the The The Thomas was written, young"

In [83]:
#isolate the specific portion of the string
sub_string <- substr(summary[1], start = 4, stop = 50)
sub_string

[1] " Artist is a 2011 French romantic comedy-drama "

In [84]:
#remove white space from the beginning and end of the string
trimws(sub_string)

[1] "Artist is a 2011 French romantic comedy-drama"

In [86]:
#count back from the last character
library(stringr)  #load the stringr package (it must already be installed)
str_sub(summary[1], -8, -1)  #retrieve the last eight characters of the string

[1] "talkies."
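
If stringr is not available, the same result can be obtained in base R by computing the positions from nchar( ). A minimal sketch on a shortened stand-in string:

```r
s <- "is replaced by the talkies."
n <- nchar(s)

# Last eight characters: positions n - 7 to n
last_eight <- substr(s, n - 7, n)
last_eight
```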

### The Date Format in R

In [88]:
#Date String Formatting
as.Date("27/06/94","%d/%m/%y")

[1] "1994-06-27"
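
The format codes work in both directions: as.Date( ) uses them to parse a string, and format( ) uses them to print a date. A short sketch with the same date (%d = day, %m = month, %y = two-digit year, %Y = four-digit year):

```r
d <- as.Date("27/06/94", format = "%d/%m/%y")

iso <- format(d, "%Y-%m-%d")      # ISO style, the default print format
dotted <- format(d, "%d.%m.%Y")   # day.month.year
iso
dotted

# A string already in ISO style needs no format argument
same <- as.Date("1994-06-27") == d
same
```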

In [89]:
#Calculate the number of days between two dates
as.Date("1994/06/27") - as.Date("1959/01/01")

Time difference of 12961 days
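
The result of subtracting two dates is a "difftime" object, not a plain number. A minimal sketch of how to get the bare number or another unit:

```r
gap <- as.Date("1994-06-27") - as.Date("1959-01-01")

# Drop the "Time difference of ... days" wrapper
days <- as.numeric(gap)
days

# difftime() lets the unit be chosen explicitly
weeks <- difftime(as.Date("1994-06-27"), as.Date("1959-01-01"), units = "weeks")
weeks
```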

In [90]:
# compare two dates
as.Date("1994/06/27") > as.Date("1959/01/01")

[1] TRUE

In [91]:
# subtract the number of days
as.Date("1994/06/27") - 14

[1] "1994-06-13"

In [92]:
# show the current date in the system
Sys.Date()

[1] "2017-02-25"

In [93]:
# show the current date with details
date()

[1] "Sat Feb 25 14:12:45 2017"

In [95]:
# show the timestamp
Sys.time()

[1] "2017-02-25 14:13:05 UTC"

In [96]:
# show the weekdays/months/quarters
weekdays(Sys.Date())
months(Sys.Date())
quarters(Sys.Date())

[1] "Saturday"

[1] "February"

[1] "Q1"

In [97]:
# convert the date to a Julian day count (days since the origin, 1970-01-01 by default)
julian(Sys.Date())

[1] 17222
attr(,"origin")
[1] "1970-01-01"
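
The Julian value is simply the number of days since the origin, so it can be converted back to a date. A sketch using the fixed date from the output above instead of Sys.Date( ), so the result is reproducible:

```r
d <- as.Date("2017-02-25")

j <- julian(d)
day_count <- as.numeric(j)   # strip the "origin" attribute
day_count

# Going back from the day count to a date
back <- as.Date(day_count, origin = "1970-01-01")
back
```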

In [98]:
# create a sequence of dates from a start date, a step size and a length
seq(Sys.Date(), by="month", length.out=4)

[1] "2017-02-25" "2017-03-25" "2017-04-25" "2017-05-25"
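
Other step sizes work the same way, and an end date can be given instead of a length. A small sketch with a fixed start date (chosen here only to make the output reproducible):

```r
start <- as.Date("2017-02-25")

# Three dates, one week apart
weekly <- seq(start, by = "week", length.out = 3)
weekly

# Every day from the start date up to and including an end date
daily <- seq(start, to = as.Date("2017-03-01"), by = "day")
daily
```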

### Regular Expression in R
A regular expression (regex or regexp) is, in theoretical computer science and formal language theory, a sequence of characters that defines a search pattern. Regular expressions are used to match patterns in strings and text.

In [99]:
# load a csv file for demonstration
email <- read.csv("/resources/email_list.csv")
email

          Name              Email
1     John Doe   doej@example.com
2     Jane Doe    jadoe@sample.ca
3    Mark Mann   mmann@example.ca
4  Barry Goode bgoode@example.com
5     Joe Star    joes@sample.com
6  Susan Quinn    qsus@example.br
7   Alice Erin  erina@example.com
8 Frank Irving  irving@sample.com

In [100]:
# extract the indices of the elements that match the pattern "@.*"
grep("@.*", c("test@testing.com", "not an email", "test2@testing.com"))

[1] 1 3

In [102]:
# extract the elements that match the pattern "@.*"
grep("@.*", c("test@testing.com", "not an email", "test2@testing.com"), value = TRUE)

[1] "test@testing.com"  "test2@testing.com"

In [105]:
# substitute strings found by the regular expression 
# The second argument serves as the replacement string.
gsub("@.*", "@newdomain.com", c("test@testing.com", "not an email", "test2@testing.com"))

[1] "test@newdomain.com"  "not an email"        "test2@newdomain.com"

In [106]:
# extract the matched substrings using the "regexpr" function,
# which is like a more detailed "grep": it returns the position and length of each match.
matches <- regexpr("@.*", c("test@testing.com", "not an email", "test2@testing.com"))
regmatches(c("test@testing.com", "not an email", "test2@testing.com"), matches)

[1] "@testing.com" "@testing.com"
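
Note that regmatches( ) silently drops the elements without a match, so the result here has two entries for three inputs. One way to keep the result aligned with the input, sketched under the assumption that NA should mark a non-match:

```r
x <- c("test@testing.com", "not an email", "test2@testing.com")
m <- regexpr("@.*", x)

# regexpr() returns -1 where nothing matched
found <- as.vector(m) != -1

# Start with NA everywhere, then fill in the matches at their positions
domains <- rep(NA_character_, length(x))
domains[found] <- regmatches(x, m)
domains
```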

Apply the same approach to the email dataset to extract the domain part of each address.

In [109]:
# extract the pattern in the email address
matches <- regexpr("@.*\\.", email[,'Email'])
# add the pattern to a new column
email[, "Domain"] = regmatches(email[, 'Email'], matches)
email

          Name              Email    Domain
1     John Doe   doej@example.com @example.
2     Jane Doe    jadoe@sample.ca  @sample.
3    Mark Mann   mmann@example.ca @example.
4  Barry Goode bgoode@example.com @example.
5     Joe Star    joes@sample.com  @sample.
6  Susan Quinn    qsus@example.br @example.
7   Alice Erin  erina@example.com @example.
8 Frank Irving  irving@sample.com  @sample.
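
The extracted values still include the "@" and the trailing ".". If only the name in between is wanted, a capture group with sub( ) is one option. This is a sketch on two addresses copied from the table, not part of the course material:

```r
emails <- c("doej@example.com", "jadoe@sample.ca")

# Keep only what sits between "@" and the last "."
domain_name <- sub(".*@(.*)\\..*", "\\1", emails)
domain_name

# Keep everything after the "@"
full_domain <- sub(".*@", "", emails)
full_domain
```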

**Regular Expression Applications**
<li> Data Extraction </li>
<li> Data Cleaning </li>
<li> Data Analysis </li>
<li> Data Validation </li>
<li> Text Mining </li>
<li> Parsing </li>

A regular expression may be followed by one of several repetition quantifiers:

<li> ? : The preceding item is optional and will be matched at most once. </li>
<li> * : The preceding item will be matched zero or more times. </li>
<li> + : The preceding item will be matched one or more times. </li>
<li> {n} : The preceding item is matched exactly n times. </li>
<li> {n,} : The preceding item is matched n or more times. </li>
<li> {n,m} : The preceding item is matched at least n times, but not more than m times. </li>
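
The quantifiers can be tried out with grepl( ), which returns TRUE or FALSE for each element. In this sketch each quantifier is applied to the letter "u" in a made-up word list; "^" and "$" anchor the pattern to the whole string:

```r
x <- c("color", "colour", "colouur", "colr")

optional    <- grepl("^colou?r$", x)      # zero or one "u"
zero_plus   <- grepl("^colou*r$", x)      # zero or more "u"
one_plus    <- grepl("^colou+r$", x)      # one or more "u"
exactly_two <- grepl("^colou{2}r$", x)    # exactly two "u"
at_least_1  <- grepl("^colou{1,}r$", x)   # same as "+"
up_to_1     <- grepl("^colou{0,1}r$", x)  # same as "?"

optional
zero_plus
one_plus
exactly_two
```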

<hr>
### About the Author:  
This is Yimeng's study note on the BDU course R101. 

**Study Period: February 24th-25th, 2017**

<hr>
Copyright &copy; 2017