# STA 141A Fundamentals of Statistical Data Science

### Lecture 1, 9/28/23, Basics of R

### Today's topics

- Course Organization
- Statistical Computing
- Basics of R
- Data classes  
- Vectors

### Course Organization

This course is an introduction to statistical programming with <tt>R</tt>. 
It covers the basics of data processing, fundamentals of data analysis, and important extensions. 
Upon completion of the course, you will be able to implement sophisticated methods that you will learn as you continue your studies. The course consists of three parts: 

1. Basics of R
2. Regression analysis with R
3. Selected topics in statistical computing

The final grade is determined by 
- eight homeworks (15%),
- two exams on October 19th and November 21st (each 25%),
- project due December 15th (35%).

For comprehensive information about the course, please consult [Canvas](https://canvas.ucdavis.edu/courses/815764). 

The project will be collaborative work with three to five group members. You will procure a data set, develop a research question you intend to answer, and showcase methods covered in this class. 

[Kaggle](https://www.kaggle.com/datasets) will help you to come up with good data sets and ideas. 

The groups have to be formed by October 20th. A project proposal is due November 4th. 

#### Ethics

This is a programming class. Using assistance is part of programming and is encouraged. This can be AI-based, or from online sources (e.g., [stackoverflow](https://stackoverflow.com/questions)). 

However, you will be graded on your proficiency in coding. In all assignments, make sure that you display your own contribution. Submitting AI-generated code, answers from online sources, or even classmates' solutions will not be enough to pass the course. Furthermore, if you pass off someone else's work as your own, then you are engaging in academic misconduct. 


All material of this course will be made available online on Canvas and Piazza. Use the latter for any inquiries regarding organization, homework or lectures. 

Office hours: 
* Peter Kramlinger: T 10:45-11:45 AM, MSB 1143
* Emily Chang: R 11 AM-12 PM, MSB 1117
* Pranash Ramesh: T.B.A.

### Statistical Computing: Overview 

Modern statistical theory relies heavily on numerical procedures. These include bootstrapping, random number generation (RNG) and neural networks [\[Gelman & Vehtari, 2021\]](https://arxiv.org/pdf/2012.00174.pdf). Often, these methods have to be tailored to a specific problem. This creates the need for a statistician to be able to code in a versatile programming language. 

On the other hand, statistical methods are applied to real data sets. These have to be cleaned, processed and understood before they can be worked on. Therefore, a statistician needs easy access to tools for these tasks. 

R is a statistical programming language that strikes a compromise between these two main requirements. 

The need for exploratory data analysis sparked the development of the statistical programming language <tt>S</tt> in the 1970s. [R](https://www.r-project.org/) is its very successful successor. It is an intuitive language that is forgiving to beginners and often requires little effort to accomplish statistical tasks. 

The freely available R distribution and many extensions (called *packages*) are available on [CRAN](https://cran.r-project.org/) (Comprehensive R Archive Network). 

You can run R from the command line. However, we strongly recommend using the free integrated development environment [RStudio](https://posit.co/download/rstudio-desktop/). It will help you run code, produce visualizations, create markdown documents and build your own packages. 

<p align="center"> 
    <img src="../source/01-RStudio.png" alt="R Studio"/>
</p>

In the editor, to execute the line the cursor is on (or a selected chunk of code), hit <kbd>Ctrl</kbd> + <kbd>Enter</kbd> on Windows or <kbd>&#8984;</kbd> + <kbd>return</kbd> on Mac. 

In this lecture, I will run code via Jupyter notebook slides. You will need [Jupyter notebook](https://jupyter.org/) to complete the homework. 

Note that UC Davis offers [Jupyter lab](https://jupyter.libretexts.org/hub/login) to run notebooks online. This service is not maintained by the Department of Statistics and we cannot guarantee its reliability. 

R can be used as a simple calculator:

In [None]:
1 + 3
5 - 1
2 * 2
2^2 
8 / 2
(1 + 12 / 4)^(3 + 2) / 16^2
9 %% 5 # modulo operation

More complex computations require us to save values to variables. This can be done with the assignment operator `<-` or `=`. In RStudio, `<-` can be typed with <kbd>Alt</kbd> + <kbd>-</kbd> on Windows or <kbd>&#8997;</kbd> + <kbd>-</kbd> on Mac. 
Variable names may contain letters, numbers, `_` and `.`, but no other special characters, and they must not start with a number or `_`. 

In [None]:
x <- 4 
y <- 5 
z.0 = z_star = y - x 
z.0 = 3

They are available in the workspace (*environment*) and can be retrieved by: 

In [None]:
z.0
z_star

In [None]:
x + y

In [None]:
x + 3

In [None]:
X # case sensitivity, throws an error!

The mathematical operations used above, `+`, `*`, etc., are functions mapping real values to real values. A plethora of other functions are built into <tt>R</tt> as well. 

In [None]:
sqrt(4)
exp(0)
log(z.0) # z.0 = 3 after the re-assignment above

In [None]:
log(0)
log(-1) # NaN = not a number, this command throws a warning

In [None]:
sin(0)
sin(pi) # the variable pi is built in as well
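
Because numbers are stored with finite precision, `sin(pi)` does not return exactly zero. The following sketch illustrates this and shows one common way to compare numbers up to a tolerance; the function `all.equal` and the threshold `1e-10` are not part of the lecture and are only chosen for illustration.

```r
s <- sin(pi)
s                        # a tiny number, but not exactly 0
s == 0                   # FALSE: exact comparison fails
abs(s) < 1e-10           # TRUE: compare against a small, hand-picked threshold
isTRUE(all.equal(s, 0))  # TRUE: all.equal compares up to a numerical tolerance
```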

Information about a function can be accessed by putting `?` in front of its name: 


In [None]:
?factorial # prints information about the function

In [None]:
factorial(5)
5 * 4 * 3 * 2 
prod(1:5)
factorial(x = 5) # within a function call, arguments are passed with =, the assignment operator <- must not be used!

What if we use ` <- ` within a function call? 

In [None]:
z # z has not been defined so far, so this throws an error

In [None]:
factorial(z <- 3)

In [None]:
z

It's not immediately clear that a variable assignment has taken place. This is bad practice and should be avoided! 
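
The difference between `=` and `<-` inside a function call can be made visible with a small sketch. The helper names `unchanged` and `changed` are introduced here only for illustration.

```r
x <- 4
factorial(x = 5)    # 120; here x only names the argument of factorial
unchanged <- x      # 4: the variable x in the workspace is untouched

factorial(x <- 5)   # 120; but 5 is first assigned to x in the workspace
changed <- x        # 5: x has been overwritten as a side effect

x <- 4              # restore the value used in the lecture
c(unchanged, changed)
```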

The value of a variable can be printed by calling the variable in the console. Outside the top level, e.g. inside functions or loops, values are not displayed automatically and have to be printed explicitly with `print`. This is very useful for debugging. 

In [None]:
z_star 
print(z_star) # the [i] at the start of an output line is the index of the first element shown on that line
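
A minimal sketch of when `print` is needed. It uses a `for` loop, which has not been introduced yet and only serves as an example of code that does not run at the top level.

```r
for (i in 1:3) {
  i^2         # evaluated, but nothing is displayed inside the loop
  print(i^2)  # displayed
}

# For longer vectors the output wraps over several lines;
# each line starts with the index of its first element
print(101:130)
```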

There are other non-mathematical basic functions you should know. E.g., you can list all objects in the environment (e.g., variables, custom functions): 

In [None]:
ls()

Objects can be deleted from the environment by calling `rm`. 

In [None]:
rm(z.0, z_star)
ls()

Any object can be saved to a file with `save` and retrieved with `load`. 

In [None]:
save(x, y, file = "../source/example.RData")
rm(list=ls()) # removes everything, check ?rm
ls()

In [None]:
x

In [None]:
load("../source/example.RData") # restores the saved objects x and y
ls()
x
y
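
The same round trip can be tried without relying on the folder structure of the course. The sketch below writes to a temporary file created with `tempfile`; the object names `a`, `b`, `tmp` and `loaded` are made up for this example.

```r
a <- 1:3
b <- "text"
tmp <- tempfile(fileext = ".RData")  # path to a temporary file

save(a, b, file = tmp)  # write both objects to the file
rm(a, b)                # delete them from the environment

loaded <- load(tmp)     # load returns the names of the restored objects
loaded                  # "a" "b"
a                       # 1 2 3

file.remove(tmp)        # clean up the temporary file
```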

You can navigate from your *working directory* using standard path syntax. 

In [None]:
getwd() # displays the current working directory

In [None]:
setwd("../") # set the path in quotation marks; ".." refers to the parent folder
getwd()

In [None]:
setwd("./lectures") # "." marks the current directory
getwd()

### Data classes 

In R, there are five common basic data classes: <kbd>integer</kbd>, <kbd>numeric</kbd>, <kbd>complex</kbd>, <kbd>logical</kbd> (`TRUE`/`FALSE`), <kbd>character</kbd>. 
We can check the data type with the function `class`. `str` displays the value and class (i.e., structure) of an object. 

In [None]:
x <- 4
class(x) # outputs the string 'numeric'

In [None]:
str(x) # displays class abbreviation and value

Note that although the value of `x` is a whole number, it is stored as numeric. To force a variable to be of integer type, we have to state this explicitly: 

In [None]:
y <- 4L
z <- as.integer(x)
class(y)
class(z)
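
The following sketch shows some consequences of the distinction between integer and numeric. The function `typeof`, which reports the internal storage type, is not used elsewhere in this lecture.

```r
x <- 4
y <- 4L

typeof(x)        # "double": numeric values are stored as floating point numbers
typeof(y)        # "integer"
x == y           # TRUE: the values are equal
identical(x, y)  # FALSE: the storage types differ

class(y + 1L)    # "integer": integer arithmetic stays integer
class(y + 1)     # "numeric": mixing with a numeric value gives numeric
class(y / 2L)    # "numeric": division always returns a numeric value
```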

Contrary to many other programming languages, re-assigning a different data type to a variable is not a problem in R. 

In [None]:
x <- 3L
class(x)

In [None]:
x <- x + 3i
class(x)

The <kbd>logical</kbd> data type is useful for boolean operations.  

In [None]:
4 == 3 # EQUAL

In [None]:
4 > 3 # GREATER 

In [None]:
3 >= 3 # GREATER OR EQUAL

In [None]:
3 != 3 # NOT EQUAL

In [None]:
!TRUE # NEGATION

In [None]:
x <- (3 >= 8) & !(3 > 2) # AND
str(x)

Instead of `TRUE` and `FALSE` one may use `T` and `F`. 

In [None]:
y <- T # same as y <- TRUE

In [None]:
y & x # AND 
y | x # OR
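
A word of caution on the shorthands, as a small sketch: unlike `TRUE` and `FALSE`, the names `T` and `F` are ordinary variables and can be overwritten. The variable `overwritten` is made up for this example.

```r
T & TRUE                 # TRUE
T <- FALSE               # allowed: T is not a reserved word
overwritten <- T & TRUE  # FALSE, which is easy to overlook in a long script
overwritten
rm(T)                    # delete our own T; the built-in T is visible again
T                        # TRUE
# TRUE <- FALSE          # would throw an error: TRUE is a reserved word
```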

<kbd>Character</kbd> strings are generated by using either `""` or `''`. 

In [None]:
"Hello World!"

In [None]:
paste('Hello', 'World!')

In [None]:
paste0('This is a gr', 8, " lecture!") # paste0 inserts no separator, note the extra space before "lecture"!

Missing values are coded either as `NaN` (not a number, cf. the `log(-1)` example above) or `NA` (not available). 

In [None]:
x <- NA
str(x)
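
`NA` and `NaN` behave slightly differently when tested. A short sketch using the base functions `is.na` and `is.nan`, which are not covered above:

```r
is.na(NA)    # TRUE
is.na(NaN)   # TRUE: NaN also counts as missing
is.nan(NA)   # FALSE: NA is not "not a number"
is.nan(NaN)  # TRUE

NA == NA     # NA: comparisons with a missing value are themselves missing
NA > 1       # NA
class(NA)    # "logical"
class(NaN)   # "numeric"
```

To check for missing values, use `is.na(x)` rather than `x == NA`.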

### Vectors

Vectors are very common objects in R. A vector is a single object that collects several values of the same class. Usually, vectors are created with the `c` function, which concatenates values into a vector. Alternatively, several built-in functions return a vector.  

In [None]:
c(5,1,2,3,4,2) # concatenate values
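
All elements of a vector share one data class. If values of different classes are concatenated, R converts them to the most general class among them. A small sketch (the variable `mixed` is made up for this example):

```r
class(c(1L, 2L))   # "integer"
class(c(1L, 2.5))  # "numeric": the integer is converted
class(c(1, TRUE))  # "numeric": TRUE becomes 1

mixed <- c(1, "a", TRUE)
mixed              # "1" "a" "TRUE"
class(mixed)       # "character": everything is converted to strings
```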

In [None]:
1:6 # sequence with difference 1

In [None]:
seq(from = 1, to = 12, by = 2) # ?seq, same as seq(1, 12, 2) 

In [None]:
seq(1, 12, 2)

In [None]:
seq(1, 6, len = 4) # specify the length of the output instead of the difference; len is short for length.out

In [None]:
seq(1, 6, by = 1) # specify the difference instead of the length of the output; same as seq(1, 6, l = 6)

In [None]:
rep(c(1,2,3), each = 2) # ?rep, repeat each input element two times

#### Summary statistics

The argument `na.rm = TRUE` removes `NA`-entries before processing. 

In [None]:
x <- c(5, pi, -6, 2.4, 2L, 1, 12, 99, NA)
length(x)
min(x, na.rm = TRUE)
median(x, na.rm = TRUE)
mean(x, na.rm = TRUE)
max(x, na.rm = TRUE)
quantile(x, c(0.25, 0.75), na.rm = TRUE)

In [None]:
summary(x, na.rm = TRUE)

In [None]:
sd(x, na.rm = TRUE)

In [None]:
sd # calling a function without parentheses prints its source code

In [None]:
sqrt(var(x, na.rm = TRUE))
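
For comparison, the sample standard deviation can be computed step by step. This is a sketch of the usual formula with denominator $n - 1$; the names `x_obs`, `n` and `sd_manual` are made up for this example.

```r
x <- c(5, pi, -6, 2.4, 2L, 1, 12, 99, NA)

x_obs <- x[!is.na(x)]  # keep the observed (non-missing) values
n <- length(x_obs)     # number of observed values

# square root of the sum of squared deviations divided by n - 1
sd_manual <- sqrt(sum((x_obs - mean(x_obs))^2) / (n - 1))
sd_manual
all.equal(sd_manual, sd(x, na.rm = TRUE))  # TRUE
```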

Besides computing summary statistics, you can perform arithmetic operations on the vector itself. They are applied element-wise. 

In [None]:
x

In [None]:
x + 1

In [None]:
x + rep(c(7,4,4), times = 3)

In [None]:
x + c(1,2) # throws warning, but still gives output! 
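
The behaviour above is called recycling: the shorter vector is repeated until it matches the length of the longer one. A sketch with small vectors (the names `r1`, `r2`, `r3` are made up for this example):

```r
r1 <- 1:6 + c(10, 20)      # 11 22 13 24 15 26; no warning, 6 is a multiple of 2
r2 <- 1:6 + c(10, 20, 30)  # 11 22 33 14 25 36; no warning, 6 is a multiple of 3
r3 <- 1:5 + c(10, 20)      # 11 22 13 24 15; warning, 5 is not a multiple of 2
r1
r2
r3
```

Adding a single number, as in `x + 1`, is the special case where a vector of length one is recycled.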

In [None]:
x^2

In [None]:
x[3] = 3
sqrt(x)
log(x)

#### Combinatorics

In [None]:
x[3] = -6
x

In [None]:
sort(x) # ?sort

In [None]:
order(x) # ?order

In [None]:
x[order(x)]
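
The relation between `sort` and `order`, and their treatment of `NA`, can be seen in a small sketch. The vector `v` is made up for this example.

```r
v <- c(5, -6, 12, NA, 1)

order(v)     # 2 5 1 3 4: positions of the elements in increasing order, NA last
v[order(v)]  # -6 1 5 12 NA
sort(v)      # -6 1 5 12: by default, sort drops NA

sort(v, na.last = TRUE)         # -6 1 5 12 NA
v[order(v, decreasing = TRUE)]  # 12 5 1 -6 NA
```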

In [None]:
rev(x) # check ?rev

In [None]:
set.seed(123) # setting the seed allows replication of pseudo-random results
sample(x, size = 5, replace = TRUE) # -6 is sampled twice! 
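
A short sketch of what setting the seed achieves: resetting the same seed reproduces the same draws. The names `first`, `second` and `perm` are made up for this example.

```r
set.seed(123)
first <- sample(1:100, size = 5, replace = TRUE)

set.seed(123)  # reset to the same seed
second <- sample(1:100, size = 5, replace = TRUE)

identical(first, second)  # TRUE: the same pseudo-random numbers are drawn

perm <- sample(1:5)  # without size and replace: a random permutation
sort(perm)           # 1 2 3 4 5
```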

#### Subsetting Vectors

Elements of vectors can be accessed via square brackets `[...]`. Unlike in Python, the first element has index `1`. 


In [None]:
x

In [None]:
x[5]

In [None]:
x[c(6, 2:4)] # we can subset x by indexing multiple entries

In [None]:
x[-c(4:6)] # negative indices remove entries, here the fourth to sixth element

In [None]:
x

In [None]:
which.max(x)

In [None]:
x

In [None]:
x[which.max(x)] # check ?which.max
max(x, na.rm = T)

To ease illustration, let's remove the `NA` value in `x`: 

In [None]:
x <- na.omit(x)
x
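
An alternative way to drop missing values is logical indexing with `is.na`. The sketch below compares both approaches on a made-up vector `v`; note that `na.omit` additionally records which positions were removed.

```r
v <- c(5, NA, 1, 12, NA)

v_clean <- v[!is.na(v)]  # keep only the entries that are not NA
v_omit <- na.omit(v)     # same values, plus information on removed positions

v_clean                    # 5 1 12
attr(v_omit, "na.action")  # positions 2 and 5 were removed
all(v_clean == v_omit)     # TRUE
```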

As an alternative to indexing, elements in vectors can also be accessed via logical vectors: 

In [None]:
x > 3

In [None]:
x[x > 3] # x > 3 returns a logical vector

In [None]:
x[c(F, F, T, T, T, F, F, F)] 

In [None]:
x>3
which(x>3)

In [None]:
x[x>3]
x[which(x>3)]

In [None]:
abs(x[x>3] - x[which(x>3)]) > 10^-3 # boolean operations on vectors are element-wise
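
The two ways of subsetting agree here because `x` no longer contains `NA`. With missing values they differ, as this sketch with a made-up vector `v` shows:

```r
v <- c(5, NA, 1, 12)

v > 3            # TRUE NA FALSE TRUE
v[v > 3]         # 5 NA 12: an NA in the logical vector yields an NA entry
which(v > 3)     # 1 4: which only returns the positions that are TRUE
v[which(v > 3)]  # 5 12
```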

In [None]:
c(5, pi) %in% x # checks whether 5 and the constant pi are elements of x

Accessing the last entry is tricky as there is no `end` analogue. 

In [None]:
x[length(x)]
rev(x)[1]
tail(x, 7) # the last seven entries; tail(x, 1) returns only the last one, check ?tail and ?head

In [None]:
x[length(x)-1]
rev(x)[2]
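
The functions `head` and `tail` also accept negative arguments, which is convenient for working with the ends of a vector. A sketch with a made-up vector `v`:

```r
v <- c(10, 20, 30, 40, 50)

tail(v, 1)   # 50: the last entry
tail(v, 2)   # 40 50: the last two entries
head(v, 2)   # 10 20: the first two entries
head(v, -1)  # 10 20 30 40: everything except the last entry
tail(v, -2)  # 30 40 50: everything except the first two entries
```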

### Exercise

- Use three conceptually different ways to create the vector `c(1, 2, 3, 0, 1, 2, 3, 0)`. Code golf: Use as few characters as possible to create the vector. I need six. Can you beat that? Next, apply the `log` function to that vector. Use three conceptually different ways to remove all non-positive values. 
- Compute the value of the Gaussian density with mean $\mu = 3$ and variance $\sigma^2 = 2$ at $x = 0.3$ by hand and compare it to the result of the built-in function `dnorm`. Use `?dnorm` for help. 