# Mapping from `datascience` to R
**Authors**: Eric Van Dusen, Soham Mandal

This notebook serves as an introduction to basic R terminology, data structures and commands. The functions introduced will be analogous to those in Berkeley's `datascience` module, with examples provided for each.

## 1. Basics of R

R is a command-line driven program: the user can enter expressions, create variables, and define and run functions in the R console. In the Jupyter notebook interface, code chunks can be run as individual cells either by clicking 'Run' in the toolbar above or by using the shortcut `shift + enter`.
<br>

### 1.1 Importing and loading packages

In Python, we use the following syntax to install packages:
```python
!pip install datascience
```
And we load them using:
```python
import numpy as np
from datascience import Table
```

In R, we use `install.packages('package_name')` to install new packages from the CRAN repository. For a full list of available packages, refer to https://cran.r-project.org/web/packages/available_packages_by_name.html

It is not necessary to reinstall packages every time we quit or reload an R session. Once a package is installed, we can load it in each new session using `library('package_name')`.

In [26]:
# example: install a package
install.packages('ggplot2')


The downloaded binary packages are in
	/var/folders/nz/bp6h770d6kq3p86l3jm192jw0000gn/T//RtmpNCpKhd/downloaded_packages


In [27]:
# example: loading a package
library('ggplot2')
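
Because installation is only needed once, it can be convenient to install a package only when it is missing. The sketch below assumes a hypothetical helper named `ensure_package` (it is not part of base R); it relies on the base functions `requireNamespace()`, `install.packages()` and `library()`. The call uses `stats`, which ships with R, so nothing is downloaded.

```r
# Hypothetical helper (not part of base R): install a package only if it is
# missing, then load it.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)              # runs only when the package is absent
  }
  library(pkg, character.only = TRUE)  # character.only lets us pass a string
}

# 'stats' is bundled with R, so this call only loads it
ensure_package('stats')
```

The same call with `'ggplot2'` would install the package on first use and simply load it afterwards.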

### 1.2 Arithmetic and Logical Operators

Here are the basic arithmetic operations in Python:
<br>
```python
import math
import numpy as np

print(2 + 3) # add numbers
print(3**4) # powers
print(pow(3, 4)) # powers
print(math.sqrt(4**4)) # functions
print(21 % 5) # 21 mod 5
print(math.log(10)) # take log
print(math.exp(2)) # exponential
print(np.abs(-2)) # absolute value
print(2*math.pi) # mathematical constant

# scientific notation
print(5000000000 * 1000)
print(5e9 * 1e3)
```
In Python, we need to import `numpy` and `math` for certain mathematical operations. In R, however, these capabilities are built-in and no imports are required.
<br>
<br>
Running the following cells will demonstrate some basic operations performed in R.

In [1]:
# adding two numbers
2 + 3

In [2]:
# raising to a power
3 ^ 4

In [3]:
# square roots
sqrt(4 ^ 4)

In [4]:
# 21 mod 5
21 %% 5 

In [5]:
# taking the log
log(10)

In [6]:
# exponential
exp(2) 

In [7]:
# using mathematical constants
2 * pi 

In [8]:
# absolute value
abs(-2)

In [9]:
# scientific notation
5e9 * 1e3
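
A few more built-in operations have direct Python counterparts. The following is a small sketch using only base R; the variable names are chosen here for illustration.

```r
# integer division, like 21 // 5 in Python
quotient <- 21 %/% 5
quotient                      # 4

# logarithm with an explicit base, like math.log(100, 10) in Python
log_base10 <- log(100, base = 10)
log_base10                    # 2

# shortcuts for common bases
log10(1000)                   # 3
log2(8)                       # 3

# rounding to a given number of decimal places
round(exp(1), 3)              # 2.718
```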

Now, recall the logical operations in Python:
```python
print((1 > 0) and (3 <= 5))
print((1 < 0) or (3 > 5))
print((3 == 9/3) or (2 < 1) )
print(not(2 != 4/3))
```
<br>
In R, the comparison operators are `<`, `<=`, `>`, `>=`, `==` for exact equality and `!=` for inequality.
<br>
Python's `and`, `or`, `not` are replaced by `&`, `|`, `!`.
<br>
The boolean values `True`/`False` in Python correspond to `TRUE`/`FALSE` in R (notice the difference in case).
<br><br>
Run the cells below to see how logical operators work in R.

In [10]:
(1 > 0) & (3 <= 5)

In [11]:
(1 < 0) | (3 > 5)

In [12]:
(3 == 9/3) | (2 < 1)

In [13]:
!(2 != 4/3)
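
Two details are worth a quick illustration. First, `&` and `|` work element by element on vectors, while base R also provides `&&` and `||` for single values. Second, because `==` tests exact equality, comparing floating-point results can be surprising; `all.equal()` compares within a small tolerance. This is a minimal sketch with variable names chosen for illustration.

```r
# element-wise AND on two logical vectors
elementwise <- c(TRUE, FALSE, TRUE) & c(TRUE, TRUE, FALSE)
elementwise                          # TRUE FALSE FALSE

# && works on single values, much like Python's `and`
single <- (1 > 0) && (3 <= 5)
single                               # TRUE

# exact equality can fail for floating-point arithmetic
exact <- (0.1 + 0.2 == 0.3)
exact                                # FALSE

# all.equal() allows for a small numerical tolerance
approx <- isTRUE(all.equal(0.1 + 0.2, 0.3))
approx                               # TRUE
```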

### 1.3 Assigning variables

In R, the assignment operator is `<-`. In most (not all) contexts, the `=` operator can be used as an alternative. It is recommended to use `<-` as standard usage to avoid mistakes.

Variable names in R are case sensitive, which means `A` and `a` are different symbols and refer to different variables.

In [14]:
# run this cell
val <- 3
print(val) # same usage of print function as in Python 3

Val <- 7 # case-sensitive!
print(Val)

[1] 3
[1] 7
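
One context where `=` and `<-` behave differently is inside a function call: `=` names an argument, while `<-` performs an assignment. The sketch below uses the base function `median()`; the variable names are chosen here for illustration.

```r
# `=` inside a call names the argument `x` of median(); it does not assign
m1 <- median(x = c(1, 5, 9))
m1                                   # 5

# `<-` inside a call creates the variable `values` and passes it along
m2 <- median(values <- c(2, 4, 6))
m2                                   # 4
values                               # 2 4 6
```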


Vectors in R are analogous to lists in Python, and even more closely to numpy arrays, since all elements of a vector share one type. The syntax for vectors is of the form `c(a, b, c)`, where `c()` combines its arguments into a vector.
<br>
```python
# A comparison: create a list of numbers in python
a = [0.125, 4.75, -1.3]
a1 = np.array([0.125, 4.75, -1.3])
a1
```

In [15]:
# run this cell
a <- c(1, 2, 3)
a
b <- c(4, 5, 6)
b

In [16]:
# combine two vectors
ab <- c(a, b)
ab
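
Like numpy arrays, R vectors support element-wise arithmetic, and mixing types in `c()` coerces every element to a common type. A small sketch, redefining `a` and `b` with the same values as above so it runs on its own:

```r
a <- c(1, 2, 3)
b <- c(4, 5, 6)

# arithmetic is applied element by element
sum_ab <- a + b
sum_ab                        # 5 7 9
a * 2                         # 2 4 6

# mixing types coerces everything to one type (here: character)
mixed <- c(1, 'a', TRUE)
mixed                         # "1" "a" "TRUE"
class(mixed)                  # "character"

# number of elements, like len() in Python
length(a)                     # 3
```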

In Python, we used `np.arange` to create a numpy array with a start, end and a step value as follows:
```python
a = np.arange(4, 9, 1) # creates [4 5 6 7 8]
```
<br>
In R, we can use the `seq` function to do the same. The end element is included in `seq`, unlike `np.arange`.

In [18]:
# run this cell
seq1 <- seq(from=4, to=9, by=1)
seq1 # Notice the output difference with np.arange
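
A few other ways of building sequences, shown as a sketch with base R only (the variable names are illustrative). The colon operator is a shortcut for steps of 1, and `length.out` fixes the number of elements instead of the step size.

```r
# step of 2: the end value is included only if the steps land on it
by_two <- seq(from = 4, to = 9, by = 2)
by_two                        # 4 6 8

# colon shortcut; 4:8 matches np.arange(4, 9, 1)
colon_seq <- 4:8
colon_seq                     # 4 5 6 7 8

# five evenly spaced values between 0 and 1
even <- seq(from = 0, to = 1, length.out = 5)
even                          # 0.00 0.25 0.50 0.75 1.00
```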

There are more parameters available for the `seq` function. To pull up more information about an R function, we can use either `?seq` or `help(seq)`.

Another important difference is that in R, indexing starts at 1, unlike 0 in Python.

In [21]:
seq1[1] # extracting element at first index of vector
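
Indexing differs from Python in a few other ways: ranges include both endpoints, and a negative index drops an element rather than counting from the end. A short sketch, redefining `seq1` as above so it runs on its own:

```r
seq1 <- seq(from = 4, to = 9, by = 1)   # 4 5 6 7 8 9

# slices include both endpoints
middle <- seq1[2:4]
middle                        # 5 6 7

# a negative index removes that position (it does not count from the end)
without_first <- seq1[-1]
without_first                 # 5 6 7 8 9

# last element
last <- seq1[length(seq1)]
last                          # 9

# logical indexing keeps the elements where the condition is TRUE
large <- seq1[seq1 > 6]
large                         # 7 8 9
```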

## 2. Dataframes: Storing tabular data

In Python's `datascience` module, we used `Table` to build our dataframes and used commands such as `select()`, `where()`, `group()`, `column()` etc. In this section, we will go over some basic commands to work with the most commonly used data structure in R: Dataframes.

### 2.1 Creating a Dataframe

In Python, this is how we create tables from scratch by extending an empty table:
```python
t = Table().with_columns([
     'letter', ['a', 'b', 'c', 'z'],
     'count',  [  9,   3,   3,   1],
     'points', [  1,   2,   2,  10],
 ])
```
<br> 
In R, we can initialize a dataframe using `data.frame()`. For a full list of parameters and options, refer to [this guide](https://www.rdocumentation.org/packages/base/versions/3.5.0/topics/data.frame).

In versions of R before 4.0.0, `data.frame` coerces all character variables to factors by default; since R 4.0.0, the default is `stringsAsFactors = FALSE`. Specifying `stringsAsFactors = FALSE` explicitly keeps the strings as character variables in either case.

In [25]:
# example: creating a dataframe in R
t <- data.frame(letter = c('a', 'b', 'c', 'z'),
                count = c(9, 3, 3, 1),
                points = c(1, 2, 2, 10),
                stringsAsFactors = FALSE
               )
t

letter,count,points
a,9,1
b,3,2
c,3,2
z,1,10
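
To check how each column is stored, we can look at its class. The sketch below builds the same data under the illustrative name `tbl` (so the notebook's `t` is left untouched) and compares the two settings of `stringsAsFactors`.

```r
tbl <- data.frame(letter = c('a', 'b', 'c', 'z'),
                  count = c(9, 3, 3, 1),
                  points = c(1, 2, 2, 10),
                  stringsAsFactors = FALSE)

# class of every column
col_classes <- sapply(tbl, class)
col_classes    # letter: "character", count: "numeric", points: "numeric"

# with stringsAsFactors = TRUE the strings become a factor
tbl_factor <- data.frame(letter = c('a', 'b', 'c', 'z'),
                         stringsAsFactors = TRUE)
class(tbl_factor$letter)      # "factor"
```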


More often, we will need to create a dataframe by importing data from a .csv file. In `datascience`, this is how we read data from a csv:
```python
Table.read_table('sample.csv')
```

In R, we can use `read.csv()` to read data from a CSV file. For a full list of parameters, refer to [this guide](https://www.rdocumentation.org/packages/utils/versions/3.5.0/topics/read.table).

In [37]:
# example: reading baby.csv (Located in current working directory)
baby <- read.csv('baby.csv')
head(baby) # display first few rows of dataframe

X,Birth.Weight,Gestational.Days,Maternal.Age,Maternal.Height,Maternal.Pregnancy.Weight,Maternal.Smoker
1,120,284,27,62,100,False
2,113,282,33,64,135,False
3,128,279,28,64,115,True
4,108,282,23,67,125,True
5,136,286,25,62,93,False
6,138,244,33,62,178,False
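
The dots in column names such as `Birth.Weight` suggest that `read.csv()` adjusted the headers: by default (`check.names = TRUE`) it turns headers into syntactically valid R names, replacing spaces with dots. The sketch below illustrates this with a small temporary file whose contents are made up for the example and are not taken from `baby.csv`.

```r
# write a tiny CSV to a temporary file (illustrative data)
csv_path <- tempfile(fileext = '.csv')
writeLines(c('Birth Weight,Maternal Smoker',
             '120,False',
             '113,True'),
           csv_path)

# default: headers are converted to valid R names
df_default <- read.csv(csv_path)
names(df_default)                   # "Birth.Weight" "Maternal.Smoker"
class(df_default$Maternal.Smoker)   # "logical"

# check.names = FALSE keeps the headers exactly as written
df_raw <- read.csv(csv_path, check.names = FALSE)
names(df_raw)                       # "Birth Weight" "Maternal Smoker"
```

With the original headers kept, a column has to be accessed with backticks or quotes, for example ``df_raw$`Birth Weight` `` or `df_raw[, 'Birth Weight']`.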


In [38]:
# view data summary
summary(baby)

       X           Birth.Weight   Gestational.Days  Maternal.Age  
 Min.   :   1.0   Min.   : 55.0   Min.   :148.0    Min.   :15.00  
 1st Qu.: 294.2   1st Qu.:108.0   1st Qu.:272.0    1st Qu.:23.00  
 Median : 587.5   Median :120.0   Median :280.0    Median :26.00  
 Mean   : 587.5   Mean   :119.5   Mean   :279.1    Mean   :27.23  
 3rd Qu.: 880.8   3rd Qu.:131.0   3rd Qu.:288.0    3rd Qu.:31.00  
 Max.   :1174.0   Max.   :176.0   Max.   :353.0    Max.   :45.00  
 Maternal.Height Maternal.Pregnancy.Weight Maternal.Smoker
 Min.   :53.00   Min.   : 87.0             Mode :logical  
 1st Qu.:62.00   1st Qu.:114.2             FALSE:715      
 Median :64.00   Median :125.0             TRUE :459      
 Mean   :64.05   Mean   :128.5                            
 3rd Qu.:66.00   3rd Qu.:139.0                            
 Max.   :72.00   Max.   :250.0                            

In [39]:
# example: Load csv from URL
sat <- read.csv('http://data8.org/textbook/notebooks/sat2014.csv')
head(sat)

State,Participation.Rate,Critical.Reading,Math,Writing,Combined
North Dakota,2.3,612,620,584,1816
Illinois,4.6,599,616,587,1802
Iowa,3.1,605,611,578,1794
South Dakota,2.9,604,609,579,1792
Minnesota,5.9,598,610,578,1786
Michigan,3.8,593,610,581,1784


In [41]:
# view information about dataframe
summary(baby) # view data summary
nrow(sat) # display no. of rows
dim(sat) # view dimensions (rows, cols)
colnames(sat) # view column names

       X           Birth.Weight   Gestational.Days  Maternal.Age  
 Min.   :   1.0   Min.   : 55.0   Min.   :148.0    Min.   :15.00  
 1st Qu.: 294.2   1st Qu.:108.0   1st Qu.:272.0    1st Qu.:23.00  
 Median : 587.5   Median :120.0   Median :280.0    Median :26.00  
 Mean   : 587.5   Mean   :119.5   Mean   :279.1    Mean   :27.23  
 3rd Qu.: 880.8   3rd Qu.:131.0   3rd Qu.:288.0    3rd Qu.:31.00  
 Max.   :1174.0   Max.   :176.0   Max.   :353.0    Max.   :45.00  
 Maternal.Height Maternal.Pregnancy.Weight Maternal.Smoker
 Min.   :53.00   Min.   : 87.0             Mode :logical  
 1st Qu.:62.00   1st Qu.:114.2             FALSE:715      
 Median :64.00   Median :125.0             TRUE :459      
 Mean   :64.05   Mean   :128.5                            
 3rd Qu.:66.00   3rd Qu.:139.0                            
 Max.   :72.00   Max.   :250.0                            
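
A few related base R functions are also handy for inspecting a dataframe. This sketch uses a small dataframe under the illustrative name `tbl` so that it runs on its own.

```r
tbl <- data.frame(letter = c('a', 'b', 'c', 'z'),
                  count = c(9, 3, 3, 1),
                  points = c(1, 2, 2, 10),
                  stringsAsFactors = FALSE)

n_cols <- ncol(tbl)     # number of columns
n_cols                  # 3

str(tbl)                # compact overview: type and first values of each column

last_rows <- tail(tbl, 2)   # last two rows, the counterpart of head()
last_rows
```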

### 2.2 Accessing values in dataframe

In Python, we can use `column` to access values in a particular column as follows:
```python
In [10]: t.column('letter')
Out[10]: 
array(['a', 'b', 'c', 'z'], 
      dtype='<U1')
```

In R, to access values in a particular column, we can use the `$` operator or the following syntax: `df[, colname]`

In [50]:
# accessing column values
t$letter
t[, 'letter'] # Can also use t[, 1] to access column at first index
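
Selecting several columns at once, similar to `select()` in `datascience`, uses the same bracket syntax with a vector of column names. A minimal sketch, using the illustrative name `tbl` for a copy of the dataframe:

```r
tbl <- data.frame(letter = c('a', 'b', 'c', 'z'),
                  count = c(9, 3, 3, 1),
                  points = c(1, 2, 2, 10),
                  stringsAsFactors = FALSE)

# double brackets return the column as a vector, like tbl$letter
letters_vec <- tbl[['letter']]
letters_vec                          # "a" "b" "c" "z"

# several columns: the result is still a dataframe
two_cols <- tbl[, c('letter', 'points')]
two_cols

# drop = FALSE keeps a single column as a dataframe instead of a vector
one_col <- tbl[, 'letter', drop = FALSE]
is.data.frame(one_col)               # TRUE
```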

In Python, we can use `rows` to access a row:
```python
In [15]: t.rows[0]
Out[15]: Row(points=1, letter='a', count=9)
```

In R, we can use the following syntax to access row data, where the row is given by its index or its row name: `df[row, ]`

In [55]:
# example: Access first row of dataframe
t[1, ]

letter,count,points
a,9,1


We can also access a specific value in the dataframe by specifying the row and column as follows:

In [57]:
# extracting one value
t[1, 'letter']
# slicing the dataframe
t[1:3, 'count']
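
The row position can also hold a logical condition, which keeps only the rows where the condition is `TRUE`. This plays the role of `where()` in `datascience`. A minimal sketch, again using the illustrative name `tbl`:

```r
tbl <- data.frame(letter = c('a', 'b', 'c', 'z'),
                  count = c(9, 3, 3, 1),
                  points = c(1, 2, 2, 10),
                  stringsAsFactors = FALSE)

# keep rows whose count is greater than 2
frequent <- tbl[tbl$count > 2, ]
frequent$letter                      # "a" "b" "c"

# subset() expresses the same idea and can select columns at the same time
two_points <- subset(tbl, points == 2, select = c(letter, points))
two_points$letter                    # "b" "c"
```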

### 2.3 Manipulating data

Adding a new column in `datascience` is done with the `with_column()` method as follows:
```python
In [23]: t.with_column('vowel', ['yes', 'no', 'no', 'no'])
Out[23]: 
points | letter | count | vowel
1      | a      | 9     | yes
2      | b      | 3     | no
2      | c      | 3     | no
10     | z      | 1     | no
```
In R, we can use `data$newcolumn <- datavector` to add a new column to an existing dataframe.

In [52]:
# example: Adding a new column
t$vowel <- c('yes', 'no', 'no', 'no')
t

letter,count,points,vowel
a,9,1,yes
b,3,2,no
c,3,2,no
z,1,10,no


We can also derive a new column from an existing one by transforming it and assigning the result to a new column name.

In [54]:
# Example: Adding twice the count to the dataframe
t$double <- t$count * 2
t

letter,count,points,vowel,double
a,9,1,yes,18
b,3,2,no,6
c,3,2,no,6
z,1,10,no,2
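
Other common `datascience` operations have base R counterparts as well. The sketch below shows one possible mapping, not the only one: removing a column (like `drop()`), sorting rows with `order()` (like `sort()`), and summarizing by group with `aggregate()` (like `group()`). The name `tbl` is illustrative.

```r
tbl <- data.frame(letter = c('a', 'b', 'c', 'z'),
                  count = c(9, 3, 3, 1),
                  points = c(1, 2, 2, 10),
                  stringsAsFactors = FALSE)

# add a column, then remove it again by assigning NULL
tbl$double <- tbl$count * 2
tbl$double <- NULL
names(tbl)                           # "letter" "count" "points"

# sort rows by count in ascending order
sorted <- tbl[order(tbl$count), ]
sorted$letter                        # "z" "b" "c" "a"

# total points for each distinct value of count
totals <- aggregate(points ~ count, data = tbl, FUN = sum)
totals                               # count 1, 3, 9 with points 10, 4, 1
```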
