
# From Python to R
### Brendan Shea, PhD
This chapter provides a brief introduction to **R**, a programming language and environment designed for statistical computing and graphics. It is widely used by statisticians and data analysts to develop statistical software and carry out data analysis. In the context of data science, R provides an extensive array of packages and built-in functions for data analysis, statistical modeling, and visualization.

Differences between R and Python in data science primarily revolve around their origins and design philosophies:

1. R is rooted in statistical analysis, with a design that inherently understands the needs of data modeling and visualization. Python, while versatile and powerful in data science, is a general-purpose language that has been adapted for data analysis through libraries like Pandas and SciPy.

2. R has a slight edge in the complexity and variety of statistical models available, and its graphics packages like `ggplot2` are considered to be more sophisticated in terms of capabilities for creating advanced statistical plots.

3. R's community is traditionally composed of statisticians and academics, leading to a wealth of packages for nearly every statistical test or model imaginable. Python's data science community is broader, attracting professionals from various backgrounds including software engineering, leading to a robust set of tools that are often seen as more user-friendly.

4. R's syntax is thought to have a somewhat steeper learning curve for those not already familiar with statistical software, whereas Python's syntax is often praised for being more intuitive and easier for beginners to grasp.

In the end, however, both Python and R are widely used in data science, and it's beneficial to have exposure to both languages. Google Colab supports both Python and R (and, at the time of writing, these were the *only* languages it supported).

## Data Types in R
**Vectors** are the most basic data structure in R. They hold elements of the same type. Let's say we're exploring a story where characters collect gems of different values. A vector could represent the values of gems that a character has collected:

In [4]:
gem_values <- c(50, 100, 200, 150)
gem_values # display values

Here, `gem_values` is a numeric vector containing four elements. The `c()` function combines values into a vector.

Moving on to **matrices**, imagine our characters are in a grid-based world, and we want to represent the number of gems in different grid locations. A matrix could represent this:

In [2]:
grid_gems <- matrix(c(1, 0, 3, 4, 2, 0, 5, 1, 3), nrow = 3, ncol = 3)
grid_gems # display matrix

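| | `[,1]` | `[,2]` | `[,3]` |
| --- | --- | --- | --- |
| `[1,]` | 1 | 4 | 5 |
| `[2,]` | 0 | 2 | 1 |
| `[3,]` | 3 | 0 | 3 |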


This matrix `grid_gems` has 3 rows and 3 columns, showing the count of gems in a 3x3 grid.
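One detail worth noticing is that `matrix()` fills its cells column by column by default, which is why the first three values (1, 0, 3) end up in the first column rather than the first row. The short sketch below shows how the `byrow = TRUE` argument changes this, and how a single cell can be read with `[row, column]` indexing. The names `values` and `grid_by_row` are introduced here only for illustration.

```r
values <- c(1, 0, 3, 4, 2, 0, 5, 1, 3)

# Default: values fill the matrix one column at a time
grid_gems <- matrix(values, nrow = 3, ncol = 3)
grid_gems[, 1]    # first column: 1 0 3

# byrow = TRUE fills one row at a time instead
grid_by_row <- matrix(values, nrow = 3, ncol = 3, byrow = TRUE)
grid_by_row[1, ]  # first row: 1 0 3

dim(grid_gems)    # number of rows and columns: 3 3
grid_gems[2, 3]   # row 2, column 3 of the default layout: 1
```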

**Data Frames** are akin to datasets or tables in other software. They can have columns of different types. Suppose we're tracking different types of items (e.g., gems, keys) found by each character:

In [3]:
characters <- c("Aragem", "Rubella", "Topazia")
gems <- c(5, 3, 8)
keys <- c(2, 2, 4)
inventory <- data.frame(characters, gems, keys)

inventory # display data frame

| characters | gems | keys |
| --- | --- | --- |
| `<chr>` | `<dbl>` | `<dbl>` |
| Aragem | 5 | 2 |
| Rubella | 3 | 2 |
| Topazia | 8 | 4 |


The `inventory` data frame has three columns: character names, number of gems, and number of keys.
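Once a data frame exists, a few base R functions are handy for inspecting it and pulling out pieces. The following is a minimal sketch that rebuilds the same `inventory` so it can run on its own; it assumes R 4.0 or later, where text columns stay as character vectors by default.

```r
characters <- c("Aragem", "Rubella", "Topazia")
gems <- c(5, 3, 8)
keys <- c(2, 2, 4)
inventory <- data.frame(characters, gems, keys)

str(inventory)        # compact summary: column names, types, first values
nrow(inventory)       # number of rows: 3
ncol(inventory)       # number of columns: 3
inventory$gems        # one column, returned as a numeric vector: 5 3 8
inventory[2, ]        # the whole second row (Rubella)
inventory[2, "keys"]  # a single cell: 2
```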

**Lists** in R can contain different types and sizes of elements. For instance, a character's profile including their name, age, and the types of gems they like could be a list:

In [6]:
character_profile <-
  list(name = "Aragem", age = 42, favorite_gems = c("Emerald", "Sapphire"))

character_profile

Here, `character_profile` includes a string, a numeric value, and a vector of strings.
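Individual elements of a list can be pulled out by name, either with `$` or with double square brackets. The sketch below illustrates both; the `level` field added at the end is hypothetical and exists only to show that a list can grow after it is created.

```r
character_profile <-
  list(name = "Aragem", age = 42, favorite_gems = c("Emerald", "Sapphire"))

character_profile$name              # "Aragem"
character_profile[["age"]]          # 42 (double brackets also extract one element)
character_profile$favorite_gems[2]  # "Sapphire" (second item of the inner vector)
names(character_profile)            # "name" "age" "favorite_gems"
length(character_profile)           # 3 top-level elements

# Hypothetical extra field, added only for illustration
character_profile$level <- 7
str(character_profile)              # compact overview of every element
```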

**Factors** are used to represent categorical data. If we have a list of characters' roles in our story, we could use a factor to categorize them:

In [7]:
roles <- factor(c("Warrior", "Mage", "Archer", "Mage"))
roles
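A factor stores each distinct category once (its *levels*) and records, for every element, which level it belongs to. As a rough illustration, the base R functions below show the levels, count how often each role occurs, and reveal the integer codes R keeps behind the scenes.

```r
roles <- factor(c("Warrior", "Mage", "Archer", "Mage"))

levels(roles)      # distinct categories, sorted alphabetically: "Archer" "Mage" "Warrior"
nlevels(roles)     # number of categories: 3
table(roles)       # count of each role: Archer 1, Mage 2, Warrior 1
as.integer(roles)  # underlying integer codes: 3 2 1 2
```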

**Logical types** are straightforward: they represent boolean values. We could track whether a character has completed a quest:

In [8]:
quest_completed <- c(TRUE, FALSE, TRUE)
quest_completed

The vector `quest_completed` tells us which characters (perhaps in the same order as our `characters` vector) have completed a quest.

**Numeric** and **Integer** types both store numbers: numeric values can include decimals, while integers are whole numbers. Suppose each character has a certain amount of gold:

In [9]:
gold <- c(100.5, 200, 150) # Numeric
gold

In [10]:
steps_walked <- c(500L, 700L, 450L) # Integer, denoted by 'L'
steps_walked

`gold` is a numeric vector with decimals, whereas `steps_walked` is an integer vector showing how many steps each character has walked.
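To check which of the two types a vector actually has, `class()` and `typeof()` can be used. The sketch below also illustrates a common surprise for newcomers: a whole number typed without the `L` suffix is still stored as a double.

```r
gold <- c(100.5, 200, 150)           # no L suffix, so stored as double
steps_walked <- c(500L, 700L, 450L)  # L suffix, so stored as integer

class(gold)           # "numeric"
typeof(gold)          # "double"
class(steps_walked)   # "integer"
typeof(steps_walked)  # "integer"

is.integer(200)       # FALSE: a whole number without L is still a double
is.integer(200L)      # TRUE

# Mixing the two types in arithmetic produces a double
typeof(steps_walked + gold)  # "double"
```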

Lastly, **Character** types are text strings. Suppose we want to record the title given to each character after an achievement:

In [11]:
titles <- c("The Brave", "The Wise", "The Swift")
titles

The `titles` vector holds these honorary titles as character strings.

## Printing with `cat()`
The `cat` function in R is used to concatenate and print objects. It is particularly useful for creating custom-formatted strings and can handle different types of objects by converting them to a character string.

The examples below reuse the variables from our story to show how `cat()` can print and format values as informative messages.

#### Printing a simple message with a vector value:

In [12]:
cat("The values of gems collected are:", gem_values, "\n")

The values of gems collected are: 50 100 200 150 


#### Custom message for a data frame's content:

In [23]:
cat("Character inventory:\n",
  "Names:", inventory$characters, "\n",
  "- Gems:", inventory$gems, "\n",
  "- Keys:", inventory$keys, "\n")

Character inventory:
 Names: Aragem Rubella Topazia 
 - Gems: 5 3 8 
 - Keys: 2 2 4 


Here, we use `$` to access each column of the data frame `inventory`.

#### Printing lists:

For lists, since they can contain different types of elements, we can print each element with a custom message:

In [22]:
cat("Character profile:\n",
  "Name:", character_profile$name, "- Age:", character_profile$age, "\n",
  "Favorite gems:", toString(character_profile$favorite_gems), "\n")

Character profile:
 Name: Aragem - Age: 42 
 Favorite gems: Emerald, Sapphire 


`toString()` is used to collapse the elements of the `favorite_gems` vector into a single, comma-separated string.

#### To display a factor with custom formatting:

In [19]:
cat("Character roles are:", levels(roles), "\n")

Character roles are: Archer Mage Warrior 


This prints the distinct levels of the factor `roles` (each category once, in alphabetical order), not the role of each individual character.

#### Printing logical values with an explanatory message:

In [16]:
cat("Quest completion status:",
  ifelse(quest_completed, "Completed", "Not completed"),
  "\n")

Quest completion status: Completed Not completed Completed 


The `ifelse()` function returns "Completed" or "Not completed" for each logical value in `quest_completed`, and `cat()` then prints the results.

For numeric and integer vectors, you might want to format the output to control the number of decimal places:

In [24]:
cat("Gold amounts:", format(gold, nsmall = 2), "\n") # nsmall = 2 shows at least two decimal places
cat("Steps walked:", steps_walked, "\n")

Gold amounts: 100.50 200.00 150.00 
Steps walked: 500 700 450 


As you can see, `cat()` plays a role similar to Python's `print()`, with a particular focus on the formatted display of data. It is a powerful function, and you shouldn't expect to master it all right away!
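Two behaviors of `cat()` are worth keeping in mind when coming from Python. It does not add a newline on its own (hence the `"\n"` in the examples above), and its `sep` argument controls what is placed between the pieces. The sketch below contrasts it with R's own `print()`; the `titles` vector is recreated so the snippet can run independently.

```r
titles <- c("The Brave", "The Wise", "The Swift")

print(titles)      # shows the [1] index prefix and quotes around each string
cat(titles, "\n")  # plain text, separated by spaces, no quotes

# sep controls what is placed between the pieces (a single space by default)
cat("Aragem", "Rubella", "Topazia", sep = ", ")
cat("\n")          # cat() does not end the line by itself, so we do it explicitly

cat("Titles:", titles, sep = "\n")  # one piece per line
cat("\n")
```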

## Understanding Vectors

**Vectors** are fundamental in R, as they are the simplest type of data structure. Unlike Python, where lists are the go-to linear data structure and can contain elements of different types, R vectors are **homogeneous**, meaning all elements must be of the same type. When you attempt to mix types, such as combining numbers and strings, R will coerce the elements to a common type, following a fixed hierarchy (e.g., numeric values become character strings).

There are six basic data types that vectors can hold in R:

-   logical
-   integer
-   double (often called numeric)
-   complex
-   character
-   raw
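Each of the six types listed above can be observed directly with `typeof()`. The values in this small sketch are arbitrary examples chosen only to produce one vector of each type.

```r
typeof(TRUE)         # "logical"
typeof(5L)           # "integer"
typeof(5)            # "double"
typeof(2 + 3i)       # "complex"
typeof("gem")        # "character"
typeof(as.raw(255))  # "raw"

# class() reports doubles under the friendlier name "numeric"
class(5)             # "numeric"
```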

#### Creation
You can create vectors using the `c()` function, which stands for 'concatenate' or 'combine':

In [25]:
numbers <- c(1, 2, 3, 4, 5)  # Numeric vector
characters <- c("a", "b", "c")  # Character vector
booleans <- c(TRUE, FALSE, TRUE)  # Logical vector

### Basic Operations and Characteristics
R is **vectorized**, which means that operations are applied to each element of the vector without the need for explicit looping. For instance:

In [27]:
numbers
numbers * 2  # Multiplies each element of the vector by 2

Elements in a vector are **indexed** starting with 1 (not 0 as in Python). You can access elements with square brackets:

In [29]:
characters[2] # second element (not third!)

The `length()` function gives you the number of elements in a vector:

In [30]:
length(numbers)
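Square brackets accept more than a single position. The sketch below, using a hypothetical vector named `letters_vec`, shows a few common indexing patterns. Note in particular that a negative index in R *drops* that position, whereas in Python it counts from the end.

```r
letters_vec <- c("a", "b", "c", "d", "e")  # hypothetical example vector

letters_vec[1]                    # first element: "a"
letters_vec[length(letters_vec)]  # last element: "e"
letters_vec[2:4]                  # a range of positions: "b" "c" "d"
letters_vec[c(1, 5)]              # several chosen positions: "a" "e"
letters_vec[-1]                   # everything except position 1: "b" "c" "d" "e"
letters_vec[10]                   # past the end gives NA rather than an error
```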

If you try to combine different types, R will **coerce** them into one type, following a hierarchy that runs from the least flexible type to the most flexible (logical < integer < numeric < complex < character):

In [31]:
mixed <- c(1, "a")
str(mixed)  # Will show that all elements are now of character type

 chr [1:2] "1" "a"
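The same hierarchy can be seen step by step by combining different pairs of types. The following sketch uses arbitrary values; the last line shows that an explicit conversion function such as `as.numeric()` can go in the opposite direction when the text really does contain numbers.

```r
typeof(c(TRUE, 2L))   # "integer": TRUE becomes 1L
typeof(c(2L, 3.5))    # "double"
typeof(c(3.5, "a"))   # "character"

c(TRUE, FALSE, 5)     # 1 0 5: logicals become numbers
c(1.5, TRUE, "gem")   # "1.5" "TRUE" "gem": everything becomes text

# Explicit conversion from text back to numbers
as.numeric(c("10", "20"))  # 10 20
```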


### Basic Mathematical Operations

Vectors support arithmetic operations, which are performed element-wise. This means that if you add two vectors together, R will add the first element of the first vector to the first element of the second vector, the second element to the second element, and so on:

In [32]:
v1 <- c(10, 20, 30)
v2 <- c(1, 2, 3)
total <- v1 + v2  # Results in c(11, 22, 33); named `total` to avoid masking the built-in sum()

cat("The sum is:", total)

The sum is: 11 22 33

Similarly, subtraction, multiplication, and division are also done element-wise:

In [34]:
difference <- v1 - v2
product <- v1 * v2
quotient <- v1 / v2

cat("Difference: ", difference,
  "\nProduct: ", product,
  "\nQuotient: ", quotient)


Difference:  9 18 27 
Product:  10 40 90 
Quotient:  10 10 10

You can also perform operations between a vector and a single number (**scalar**), where the operation is applied to each element:

In [35]:
doubled <- v1 * 2
doubled
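The same pattern works with the other arithmetic operators, and many built-in functions are vectorized in the same way. A brief sketch, reusing `v1` from above:

```r
v1 <- c(10, 20, 30)

v1 + 5        # 15 25 35
v1 / 10       # 1 2 3
v1 ^ 2        # 100 400 900
sqrt(v1 ^ 2)  # built-in functions are applied element by element too: 10 20 30

# A "scalar" in R is simply a vector of length one
length(5)     # 1
is.vector(5)  # TRUE
```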


## Table: R Code to Know

| R Code Example | Description |
| --- | --- |
| `vec <- c(1, 2, 3)` | Create a numeric vector with the elements 1, 2, and 3. |
| `char_vec <- c("a", "b", "c")` | Create a character vector with the elements 'a', 'b', and 'c'. |
| `logic_vec <- c(TRUE, FALSE, TRUE)` | Create a logical vector with the elements TRUE, FALSE, and TRUE. |
| `vec[2]` | Access the second element of a vector. |
| `length(vec)` | Get the number of elements in a vector. |
| `names(vec) <- c("first", "second", "third")` | Assign names to the elements of a vector. |
| `sum(vec)` | Calculate the sum of the elements in a vector. |
| `mean(vec)` | Calculate the mean of the elements in a vector. |
| `vec * 2` | Multiply each element of a vector by 2. |
| `cat("The value is:", vec[1])` | Print a message followed by the first element of a vector. |
| `df$column` | Access a specific column in a data frame named `df`. |
| `vec > 2` | Test whether each element of a vector is greater than 2. |
| `!logic_vec` | Negate the elements of a logical vector. |
| `vec1 + vec2` | Add two vectors element-wise. |
| `c(vec, 4)` | Append an element to the end of a vector. |
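Several entries in the table (names, `sum()`, `mean()`, comparisons, negation, and appending) have not appeared in a worked example so far. The sketch below strings them together using the table's own variable names, as one possible way to try them out.

```r
vec <- c(1, 2, 3)
logic_vec <- c(TRUE, FALSE, TRUE)

names(vec) <- c("first", "second", "third")
vec["second"]     # look up an element by name: 2

sum(vec)          # 6
mean(vec)         # 2
vec > 2           # FALSE FALSE TRUE (displayed with the element names)
!logic_vec        # FALSE TRUE FALSE

vec <- c(vec, 4)  # c() returns a new, longer vector; reassign to keep it
length(vec)       # 4
```

Note that `c(vec, 4)` does not modify `vec` in place, so the result has to be assigned back to a variable if you want to keep it.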

## Exercises
### Exercise 1: Creating Numeric Vectors

-   Create a numeric vector with the numbers from 1 to 5.
-   Hint: Use the `c()` function to combine values into a vector.

### Exercise 2: Creating Character Vectors

-   Construct a character vector with the names of the seven days of the week.
-   Hint: Remember that character strings must be enclosed in quotes.

### Exercise 3: Mixing Types in Vectors

-   Try to create a vector that contains both numbers and characters. What happens?
-   Hint: R will coerce the data types to be consistent. Observe the result.

### Exercise 4: Vector Arithmetic

-   Create two numeric vectors, `a` and `b`, each with 5 elements, and calculate their sum.
-   Hint: Use `+` to add vectors of the same length.

### Exercise 5: Accessing Vector Elements

-   Create a vector with 10 elements and access the 4th element.
-   Hint: Use the square brackets `[]` with the index of the element you want to access.

In [None]:
# Exercise 1

In [None]:
# Exercise 2

In [None]:
# Exercise 3

In [None]:
# Exercise 4

In [None]:
# Exercise 5

### Exercise 6: Basic Statistical Operations

-   Calculate the mean and standard deviation of a numeric vector with at least 5 elements.
-   Hint: Use the `mean()` and `sd()` functions.

### Exercise 7: Printing with Formatting

-   Use the `cat()` function to print out a statement that includes elements from a vector. For example, print "The sum of the vector is: " followed by the actual sum.
-   Hint: You'll need to perform a calculation within the `cat()` function and use commas to separate text and calculation.

### Exercise 8 (Challenge): Understanding Logical Vectors

-   Logical vectors in R represent sequences of TRUE and FALSE values. They are the result of logical operations and are very useful for subsetting and conditional testing.
-   Generate a logical vector that signifies whether the numbers 1 through 5 are greater than 3.
-   To do this, you will compare a numeric vector (of numbers 1 to 5) to the number 3, and R will perform the comparison element-wise and produce a new "logical vector".
-   Hint: Use a comparison operator like `>` with the `c()` function to create your numeric vector.

### Exercise 9 (Challenge): Exploring Named Vectors

-   In R, you can assign names to the elements of a vector, which can be especially helpful for readability and referencing elements by name instead of by position.
-   Create a numeric vector with three elements. Assign names to each element so that you have a named vector where each element corresponds to "first", "second", and "third".
-   Hint: Use the `names()` function to assign names to your vector after you've created it with the `c()` function, like so: `names(my_vector) <- c("name1", "name2", "name3")`.

In [None]:
# Exercise 6

In [None]:
# Exercise 7

In [None]:
# Exercise 8

In [None]:
# Exercise 9