<div class="alert alert-block alert-warning">
<h1 style="color:black">Let's Get Started!</h1>

Welcome to your first Jupyter Notebook. A notebook is a collection of "cells" in which you can run code in various programming languages. This notebook is set up to run "R". R is a powerful open-source language and environment for statistical computing and graphics!

<blockquote> <b><mark>1. Variables:</mark></b> Let's begin by assigning a variable named x to have the value of 2.3. The code is already in the next cell. To run the code, make sure your cursor is in the cell and then press <code>ENTER</code> while holding down the <code>SHIFT</code> key. </blockquote>

In [2]:
x = 2.3

<blockquote>You have not asked for any output so let's verify that x has the value 2.3. In the next cell, type either <code>x</code> or <code>print(x)</code> and then run the cell by pressing <code>ENTER</code> while holding down the <code>SHIFT</code> key.</blockquote>

In [3]:
print(x)

[1] 2.3


<div class="alert alert-block alert-warning">
An alternative way to assign the value of <code>x</code> to be <code>2.3</code> is to use the "assignment operator", which is an arrow made from the "less than symbol" and a dash like this: 

`x <- 2.3 ` 
    
This is more common among R "purists". Check out __[this blog](https://blog.revolutionanalytics.com/2008/12/use-equals-or-arrow-for-assignment.html)__ if you are interested in reading more about the difference. 
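As a small sketch of the two assignment styles (the names `a` and `b` are illustrative and are not used elsewhere in this notebook), both forms store the same value:

```r
# Illustrative names only; they are not part of the lab.
a = 2.3     # assignment with the equals sign
b <- 2.3    # assignment with the arrow operator

identical(a, b)   # TRUE: both variables hold the same value
```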

<hr style="height:2px;border-width:0;color:black;background-color:black">

<b><mark>2. Vectors:</mark></b> A vector is created with the function <code>c()</code>, which combines its arguments into a single vector. 

<blockquote>For example, to define a vector x to be (3,1,7,4,-1,8), type the following into the next open cell.
      

`x = c(3,1,7,4,-1,8)`
    
(You may want to write down this vector. Throughout this notebook you will be asked to perform operations on x and predict the results. It may be easier for you to not have to keep scrolling back to look at it.)
          

Access elements using square brackets. R indexing starts from 1. 
    (By comparison, some languages instead use a zero to access the first element in a vector.) 
    

What do you expect the code


`x[3]-x[1]`


to produce? After you have guessed, add this command into the same cell below your vector definition and run the cell to check your answer.
        </blockquote>

In [5]:
x = c(3,1,7,4,-1,8)
x[3] - x[1]

<blockquote>Access two or more consecutive elements using a starting index, then a colon, then an ending index. 
    
What do you expect from the code  
    
`x[2:3]` ?
    
Try it by running the next cell!    
</blockquote>

In [6]:
# Run this cell
x[2:3]
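<blockquote>Here is a short sketch of single-element and range indexing. It uses a hypothetical vector <code>v</code>, not the lab's x, so that the guessing exercises above are not given away.</blockquote>

```r
# Hypothetical vector for illustration only.
v = c(10, 20, 30, 40, 50)

v[1]          # 10 (indexing starts at 1, so this is the first element)
v[2:4]        # 20 30 40 (elements two through four)
v[5] - v[1]   # 40
```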

<div class="alert alert-block alert-warning">
    Notice how we have included a "comment" in the previous cell. Lines starting with "#" are not executed as code!

<blockquote>
Typing  <br>  
    
`x[-2]`
    
will return what remains from the vector x when the second element is removed. However, it will not overwrite the vector x. To overwrite x, we would have to assign the result back to x like this.
    
`x = x[-2]`    
    
Of course, we could also define a new vector like this
    
`y = x[-2]`
    
To remove the second through fourth elements of x, we type
    
`x[-(2:4)]`
    
To obtain or remove non-consecutive elements like, for example, the second and fifth elements, we use a vector of indices like
    
`x[c(2,5)]` or `x[-c(2,5)]`    

</blockquote>
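<blockquote>The negative-index commands above can be sketched on a hypothetical vector <code>v</code> (an assumption for illustration, so that the lab's x is left untouched). Note that <code>v</code> itself never changes unless we reassign it.</blockquote>

```r
# Hypothetical vector for illustration only.
v = c(10, 20, 30, 40, 50)

v[-2]         # 10 30 40 50 (everything except the second element)
v[-(2:4)]     # 10 50       (second through fourth elements removed)
v[c(2, 5)]    # 20 50       (only the second and fifth elements)
v[-c(2, 5)]   # 10 30 40    (second and fifth elements removed)

v             # 10 20 30 40 50 (unchanged, because nothing was reassigned)
```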

In [None]:
# This cell is here in case you want to use it to play with any of the commands just mentioned.
# You can type and run code in this cell!



<blockquote>In the next cell, define a vector named y whose 
elements are (2,7,-2,4,0,3.3). Then add the 
vectors x and y. Is the result what you expected? </blockquote>


<blockquote>In the next cell, we will add a single number to 
 every element of x by typing x+7. 

While it doesn't make sense mathematically to add a scalar to a vector, it makes sense in R!
    
Give it a try.   
</blockquote>
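<blockquote>As a sketch of what is going on, R works element by element: two vectors of the same length are combined position by position, and a single number is "recycled" so that it is applied to every element. The vectors <code>v</code> and <code>w</code> below are hypothetical and are used only for illustration.</blockquote>

```r
# Hypothetical vectors for illustration only.
v = c(10, 20, 30)
w = c(1, 2, 3)

v + w   # 11 22 33 (added position by position)
v + 7   # 17 27 37 (the 7 is recycled across all elements)
v * 2   # 20 40 60 (the same idea works for multiplication)
```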


<blockquote >A sequence of consecutive numbers, say 1 through 6,
 can be produced  by typing  <br>

`1:6`

 or

`seq(1,6)`

 Try it below.</blockquote>

<blockquote>The "sequence" function we used in the last cell 
can also take a third argument. To produce a sequence
from 1 to 6, while incrementing by 2, type the following into
    the next cell and run the cell.
Do the results make sense to you? <br>

`seq(1,6,2)`
    
</blockquote>
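<blockquote>A few more sketches of the third argument, which can also be given by its name <code>by</code>. The particular endpoints and increments here are arbitrary choices for illustration.</blockquote>

```r
s1 = seq(0, 1, by = 0.25)   # 0.00 0.25 0.50 0.75 1.00 (fractional steps)
s2 = seq(10, 1, by = -3)    # 10 7 4 1 (a negative step counts down)
s3 = seq(1, 10, by = 4)     # 1 5 9 (stops at the last value not past 10)

s1
s2
s3
```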

<blockquote>One of the coolest things about R is its ability to find all elements of a vector meeting certain conditions. To find all elements of x that are greater than or equal to 5, type  <br>  
    
`x[x>=5]`
    
To find all elements that are equal to 7 type
    
`x[x==7]`
    
and to find all elements that are not equal to 7 type
    
`x[x!=7]`    
</blockquote>

In [None]:
# Run this cell
x[x>=5]
x[x==7]
x[x!=7]

<blockquote>
Sometimes you'll want to know the locations of the elements of x that are greater than or equal to 5. <br>
    
`which(x>=5)`
    
Try it by running the next cell. There is an additional command there. From the results, can you figure out what it does?
</blockquote>    

In [None]:
# Run this cell
which(x>=5)
x>=5

<blockquote>We can find locations in one vector and use them to pull elements out of another.

Let's try it.
</blockquote>

In [None]:
# Run this cell
x
y
y[x==7]
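<blockquote>To see how this works, it can help to store the comparison in its own variable. The sketch below uses hypothetical vectors <code>v</code> and <code>tag</code> and a hypothetical name <code>mask</code>. The comparison produces one TRUE or FALSE per element, <code>which</code> turns it into positions, and square brackets keep the elements where it is TRUE.</blockquote>

```r
# Hypothetical vectors for illustration only.
v = c(4, 9, 2, 9)
tag = c("a", "b", "c", "d")

mask = v == 9   # FALSE TRUE FALSE TRUE (one logical value per element)
which(mask)     # 2 4 (the positions where mask is TRUE)
tag[mask]       # "b" "d" (elements of tag at those positions)
```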

<blockquote>We can return values of x satisfying two (or more) conditions. For example, typing <br>

`
    x[x>1 & x<7]
`   
                   
will return all values of x that are greater than 1 <u>and</u> also less than 7.

To return all values that are either greater than or equal to 7 <u>or</u> less than 1, use a vertical line like this

`
    x[x>=7 | x<1]
`    

What do you think the following line of code will produce?

`
    x[(x>1 & x<7)| x<0]
`

Try it in the next cell to see if you are correct!

</blockquote>
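<blockquote>The same patterns, sketched on a hypothetical vector <code>v</code> with different cutoffs so that the question above is left for you to answer:</blockquote>

```r
# Hypothetical vector for illustration only.
v = c(-2, 0, 3, 5, 8)

v[v > 0 & v < 6]             # 3 5    (both conditions must hold)
v[v < 0 | v > 6]             # -2 8   (at least one condition must hold)
v[(v > 0 & v < 6) | v < 0]   # -2 3 5 (parentheses group the "and" first)
```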

<hr style="height:2px;border-width:0;color:black;background-color:black">

 <blockquote><b><mark>3. Built-In Functions:</mark></b> There are hundreds of built-in functions in the base R package and in the extra libraries that you can install.


Read the commands in the next cell before running it and try to guess what the output will be for each!
</blockquote>

In [None]:
sum(x)
prod(x)
length(x)
mean(x)
exp(x)
log(x)    # A warning is expected here!
sqrt(x)   # A warning is expected here!

<blockquote>To square each element of the vector x, we type <code>x^2</code>.

The <mark>sample variance</mark> of the data held in the vector x is 
    
$$
S^{2} = \frac{\sum_{i=1}^{n}(X_{i}-\overline{X})^{2}}{n-1} = \frac{\sum_{i=1}^{n} X_{i}^{2} - (\sum_{i=1}^{n} X_{i})^{2}/n}{n-1}
$$
        
We can compute it with a built-in variance function but let's also try it using the sum and "squaring" functions.    

(Don't worry if you don't know what sample variance is and are unfamiliar with this formula. We will talk about it in this course!)
</blockquote>

In [None]:
var(x)
(sum(x^2)-sum(x)^2/length(x))/(length(x)-1)

# After running this cell, go back, delete the second 
# line, and try to reproduce it! 
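<blockquote>As a check on the formula, the sketch below computes both forms of the sample variance for a hypothetical data vector <code>d</code> and compares them with <code>var</code>. It uses <code>all.equal</code> rather than <code>==</code> because results computed in different ways can differ by tiny rounding errors.</blockquote>

```r
# Hypothetical data for illustration only.
d = c(2, 4, 4, 4, 5, 5, 7, 9)
n = length(d)

v1 = sum((d - mean(d))^2) / (n - 1)        # definition: squared deviations from the mean
v2 = (sum(d^2) - sum(d)^2 / n) / (n - 1)   # shortcut form from the formula above

all.equal(v1, v2)       # TRUE
all.equal(v1, var(d))   # TRUE
```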

<blockquote>Many built-in functions in R are "guessable". When they are not, they are usually easy to find on the internet with your favorite search engine!</blockquote>

<hr style="height:2px;border-width:0;color:black;background-color:black">

<blockquote> <b><mark>4. Loops and Conditional Statements:</mark></b> A "for loop" allows us to go through several iterations of a task. For example, run the next cell to print out the numbers 1 through 3. 
</blockquote>

In [None]:
for (i in 1:3)
{
    print(i)
}

<blockquote>Unlike in a language such as Python, indentation has no effect on how the code runs here and is used only to keep things nice and organized. The code in the previous cell could be written all on one line or in any other way such as <br>

    
`for (i in 1:3){
 print(i)}`    
</blockquote>

<blockquote>Let's try something more computational. Let's go through the elements of the vector x and increase the first element by 1, the second element by 2, et cetera. You'll need to add a line with just "x" or "print(x)" to the following cell if you want to actually see the results.</blockquote>

In [None]:
# Run this cell
for (i in 1:length(x))
{
    x[i] = x[i]+i
}
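<blockquote>A side note that goes slightly beyond this lab: <code>1:length(x)</code> works well here, but for an empty vector it becomes <code>1:0</code>, which counts down and gives <code>1 0</code>. The base R function <code>seq_along</code> avoids this. The names <code>empty</code>, <code>count</code>, and <code>v</code> below are hypothetical.</blockquote>

```r
# Hypothetical empty vector for illustration only.
empty = c()
1:length(empty)     # 1 0 (a loop over this would run twice!)
seq_along(empty)    # integer(0) (a loop over this would not run at all)

count = 0
for (i in seq_along(empty))
{
    count = count + 1
}
count               # 0 (the loop body never ran)

# For a non-empty vector, seq_along(v) is the same as 1:length(v).
v = c(5, 6, 7)
for (i in seq_along(v))
{
    v[i] = v[i] + i
}
v                   # 6 8 10
```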

<blockquote><mark style="background: #ffcccb!important">Herein lies an important lesson about R.</mark> "Looping" through operations can be quite slow when you are coding something more complicated. It is good practice to always think about ways to avoid loops by harnessing the power of matrix/vector manipulation in R. For example, we can restore the vector x to hold its original values by writing another loop, <b><u>or</u></b>, we can make a sequence from 1 to the length of x like this

`1:length(x)`
    
and subtract it all at once like this
    
`x = x - (1:length(x))`  
    
Try this in the next cell and add a line so that you can see the results. 

</blockquote>
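<blockquote>To see that the two approaches agree, here is a sketch on a hypothetical vector <code>v</code> (so that the lab's x is not changed). The names <code>looped</code> and <code>vectorized</code> are illustrative.</blockquote>

```r
# Hypothetical vector for illustration only.
v = c(5, 6, 7, 8)

# Loop version: subtract i from the i-th element, one element at a time.
looped = v
for (i in 1:length(v))
{
    looped[i] = v[i] - i
}

# Vectorized version: subtract the whole sequence at once.
vectorized = v - (1:length(v))

looped                          # 4 4 4 4
identical(looped, vectorized)   # TRUE
```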

<blockquote>A "while loop" will keep executing as long as its condition is true. For example, the following prints the numbers 1 through 3.<br>

`
i = 1
while(i < 4)
{
   print(i)
   i = i+1
}           
`   
<br>            
(This is just an illustration of the "while loop". There are much better ways to achieve the desired result for this simple task!)           
</blockquote>
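<blockquote>A "while loop" is most useful when we do not know in advance how many iterations are needed. As a sketch with hypothetical names <code>value</code> and <code>steps</code>, the following keeps doubling a number until it exceeds 100 and counts how many doublings that took.</blockquote>

```r
value = 1
steps = 0
while (value <= 100)
{
    value = value * 2    # double the current value
    steps = steps + 1    # count this iteration
}

value   # 128 (the first power of 2 above 100)
steps   # 7
```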

<div class="alert alert-block alert-warning">At this point in this tutorial, the vector x should have its original values. However, since we have done so many manipulations and you may have run some cells more than once while exploring, check in the next cell to make sure that the vector x is still (3,1,7,4,-1,8). If it is not, redefine x so that it holds these values. 

<blockquote> An "if statement" will check if given conditions are met. For example, let's define a vector $y$ that will take on values $1$ or $-1$ for the $i$th entry, depending on whether the corresponding entry of $x$ is less than or equal to $5$ or greater than $5$.
    
Try the following code in the next cell. In the first line, we initialize a new vector y as a vector of zeros having the same length as x. ("rep" means "repeat".)


</blockquote>

In [None]:
# Run this cell.
y = rep(0,length(x))   

for (i in 1:length(x))
{
    if(x[i]<=5)
    {
        y[i] = 1
    }
    else
    { 
        y[i] = -1          
    }           
}    

y # This line is included so that y is shown at the end.


<blockquote>The better and more "R way" to do the same thing would be to type <br>

`
 y<-rep(-1,length(x))
 y[x <= 5]<-1                     
`    
             
Try it in the next cell and add a line to display the results.              
</blockquote>
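<blockquote>Another option, not used in this lab, is the base R function <code>ifelse</code>, which checks the condition for every element and picks one of two values accordingly. The sketch below uses a hypothetical vector <code>v</code> and result <code>s</code>.</blockquote>

```r
# Hypothetical vector for illustration only.
v = c(3, 9, 5, 12)

# 1 where the element is at most 5, and -1 otherwise.
s = ifelse(v <= 5, 1, -1)
s   # 1 -1 1 -1
```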
    

<div class="alert alert-block alert-warning"><b>***Note:</b> In a Jupyter notebook, lines of code do not run until we hold down <code>Shift</code> and hit <code>Enter</code>. If you are using R software outside of a Jupyter notebook, there is a good chance the code is being interpreted line by line each time you hit <code>Enter</code> to move to the next line. In this case, if you were to type
    

<code>if(x[1]<=5)
 {
    y[1] = 1
 }
 else
 { 
    y[1] = -1          
 }          
</code>
    
when you hit enter after the bracket on the 4th line, R is going to assign y[1] to be 1 if x[1] is less than or equal to 5 and will not expect another command for the case that x[1]>5. The second part will not be executed and, indeed, you will get an error when the next command starts with "else". One way to avoid this is to let R know that the "else" is coming before it executes the 4th line, by putting the "else" on this 4th line.
    
 <code>if(x[1]<=5)
 {
    y[1] = 1
 }else
 { 
    y[1] = -1          
 }          
</code>   

<hr style="height:2px;border-width:0;color:black;background-color:black">

<blockquote> 
    <b><mark>5. Matrices:</mark></b> For completeness, we will cover matrices and simple matrix operations but only briefly as we will not really be using them in this course. 
    
Let's define a 6 by 6 matrix of zeros and call it "A".
</blockquote>


In [None]:
# Run this code
A = matrix(0,6,6)

<blockquote>We can access, for example, 
    <ul><li>the (1,2) entry of A by typing <code>A[1,2]</code></li> <li>the second row of A by typing <code>A[2,]</code> </li><li>the first column of A by typing <code>A[,1]</code></li></ul> 

A, however, is not a very interesting matrix. Let's populate the first row with the vector x. The code is already in the next cell.
    
Add some additional lines to the cell to populate the second row with the vector y and the third row with the sum of the two vectors x and y.  Type `A` and run the cell to see the result.  
</blockquote>


In [None]:
# Modify and run this cell.
A[1,] = x
A[2,] = y
A[3,] = x+y

<blockquote>As with vectors, typing <code>A*A</code> will perform elementwise multiplication. In order to perform actual matrix multiplication, where each element is the dot product of a row of the first matrix with a column of the second, we would type <code>A%*%A</code>. 

Try both operations in the next cell.
</blockquote>
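<blockquote>The difference is easier to see on a small matrix. The sketch below uses a hypothetical 2 by 2 matrix <code>B</code>. Note that <code>matrix</code> fills its entries column by column.</blockquote>

```r
# Hypothetical matrix for illustration only.
# Filled column by column, so the rows are (1, 3) and (2, 4).
B = matrix(c(1, 2, 3, 4), 2, 2)

B * B     # elementwise: rows (1, 9) and (4, 16)
B %*% B   # matrix product: rows (7, 15) and (10, 22)
```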

<blockquote>Suppose that we want to compute the inverse of the matrix A. You might be surprised to find that "inverse(A)" does not work. Instead we use the command <code>solve(A)</code>. However, we would get an error if we tried this now because our silly example matrix is not invertible!
</blockquote>
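<blockquote>Here is a sketch of <code>solve</code> on a hypothetical invertible matrix <code>B</code>, together with a matrix of zeros <code>Z</code> that has no inverse. The function <code>tryCatch</code> is used only to catch the error so that the cell keeps running, and the message "not invertible" is our own choice.</blockquote>

```r
# Hypothetical invertible matrix with rows (2, 1) and (1, 3).
B = matrix(c(2, 1, 1, 3), 2, 2)
Binv = solve(B)

B %*% Binv                       # the 2 by 2 identity matrix, up to rounding
all.equal(B %*% Binv, diag(2))   # TRUE

# A matrix of zeros has no inverse, so solve() stops with an error.
Z = matrix(0, 2, 2)
result = tryCatch(solve(Z), error = function(e) "not invertible")
result                           # "not invertible"
```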

<hr style="height:2px;border-width:0;color:black;background-color:black">

<blockquote> 
    <b><mark>6. Data Frames:</mark></b> A "data frame" is like a matrix but is far more flexible as it can consist of many different types of data. We will be using data frames a lot in this course!
    
There is a plain text file in the same directory as this lab called "dogs". We will read it into R with a function called <code>read.table</code>. Run the next cell to see the results.  
</blockquote>


In [None]:
# Run this cell.
dogs<-read.table("dogs")
dogs


<blockquote>At the top of the resulting table you should see the labels "V1", "V2", and "V3". These are generic column names for a data frame. However, it appears that the file already had column names as a "header row". Let's try reading the data in while telling R that we already have column names in the file.</blockquote>

In [None]:
# Run this cell.
dogs<-read.table("dogs",header=TRUE)
dogs
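<blockquote>To see the effect of <code>header=TRUE</code> without relying on the "dogs" file, the sketch below writes a tiny made-up table to a temporary file and reads it both ways. The names and numbers are hypothetical and are not the contents of the real file.</blockquote>

```r
# Hypothetical data written to a temporary file; NOT the real "dogs" file.
path <- tempfile()
writeLines(c("name age weight",
             "Rex 3 12.5",
             "Bella 7 8.2"), path)

no_header <- read.table(path)                   # first line treated as data
with_header <- read.table(path, header = TRUE)  # first line used as column names

names(no_header)     # "V1" "V2" "V3"
nrow(no_header)      # 3 (the header line was counted as a row)
names(with_header)   # "name" "age" "weight"
nrow(with_header)    # 2

unlink(path)         # remove the temporary file
```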

<blockquote>Much better! 


We can grab an entire column using its name after a dollar sign. Try typing 
    
`dogs$age`

in the next cell and running the cell.
</blockquote>

In [None]:
dogs$age

<blockquote>The columns for age and weight hold "dbl" or "double" type variables. "Doubles" are numeric variables with decimal points. Looking at the data frame above, it appears that "name" is a "factor" variable. (R versions before 4.0.0 read text columns as factors by default. In R 4.0.0 and later they are read as "character" columns, in which case the conversion below changes nothing.) Factor variables are useful for categorizing data into "types". For example, if every dog was one of three types called "A", "B", or "C", we would store that as a factor variable. Here, the dogs' names are not really categories. Let's change the variable type and make it a "character" column.  </blockquote>

In [None]:
# Run this cell.
dogs$name<-as.character(dogs$name)
dogs
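<blockquote>The function <code>class</code> reports the type of a column, which is a handy way to confirm the conversion. The sketch below builds a small hypothetical data frame <code>pets</code> and asks for factors explicitly with <code>stringsAsFactors = TRUE</code> so that it behaves the same in any version of R.</blockquote>

```r
# Hypothetical data frame for illustration only.
pets <- data.frame(name = c("Rex", "Bella"),
                   age = c(3, 7),
                   stringsAsFactors = TRUE)

before <- class(pets$name)   # "factor"
levels(pets$name)            # "Bella" "Rex" (the categories, sorted)

pets$name <- as.character(pets$name)
after <- class(pets$name)    # "character"

before
after
class(pets$age)              # "numeric"
```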

<blockquote>We can get the dimension of the data frame by typing <code>dim(dogs)</code>. This will return a vector whose first element is the number of rows of the data frame and whose second is the number of columns. We can then specifically pick out, for example, the number of rows as follows.</blockquote>

In [None]:
dim(dogs)[1]

<blockquote>Typing <code>dogs[3,1]</code> will return the (3,1) entry of the data frame. To return the entire third row, type <code>dogs[3,]</code>.</blockquote>

In [None]:
# Try it!


<blockquote>We can return the second column of the data frame by typing <code>dogs[,2]</code> or by referring to the column by its name and typing <code>dogs$age</code>.</blockquote>

In [None]:
# Try it!


<blockquote>Suppose that we want to see the names of all dogs whose age exceeds $5.5$ units.</blockquote>

In [None]:
# Run this cell.
dogs$name[dogs$age>5.5]

<blockquote>Let's average all of the dog weights.

`mean(dogs$weight)`

</blockquote>

In [None]:
# Try it!


<blockquote>To average the weights of only the dogs whose age exceeds 5.5 type the following.

`mean(dogs$weight[dogs$age>5.5])`

</blockquote>

In [None]:
# Run this cell.
# Tip: In your head, read these square brackets as "such that".
mean(dogs$weight[dogs$age>5.5])
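<blockquote>The same "such that" pattern can be sketched on a small hypothetical data frame <code>pets</code> (made-up values, not the "dogs" data). Storing the condition in a variable, here called <code>older</code>, lets us reuse it, and placing it before the comma in square brackets keeps whole rows.</blockquote>

```r
# Hypothetical data frame for illustration only.
pets <- data.frame(name = c("Rex", "Bella", "Max"),
                   age = c(3, 7, 6),
                   weight = c(12.5, 8.2, 20.0),
                   stringsAsFactors = FALSE)

older <- pets$age > 5.5     # FALSE TRUE TRUE

pets$name[older]            # "Bella" "Max"
mean(pets$weight[older])    # 14.1, the average of 8.2 and 20.0
pets[older, ]               # all columns for the rows where older is TRUE
```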

<blockquote>Let's plot the dogs' ages versus weights. We can do this by typing <code>plot(dogs$age,dogs$weight)</code>.

In the next cell we have included other arguments in the plot function. Can you figure out what they do?
</blockquote>

In [None]:
# Run this cell.
plot(dogs$age,dogs$weight,xlab="age",ylab="weight",main="My Wonderful Plot")

<blockquote>There are many different "plotting characters" available in R. <u>Return to the previous cell.</u> After "My Wonderful Plot" but before the closing parenthesis, type a comma and then <code>pch=2</code>. Run the cell. Try some other numbers!</blockquote>

<blockquote>We will now remake the plot with solid filled blue points. Then, we will add a red point at $(6,5)$.</blockquote>

In [None]:
# Run this cell.
plot(dogs$age,dogs$weight,xlab="age",ylab="weight",main="My Wonderful Plot",pch=19,col="blue")
points(6,5,pch=19,col="red")

<blockquote>Return to the previous cell and replace the line<br>
  
`points(6,5,pch=19,col="red")`
    
with<br>
    
`points(dogs$age[dogs$weight<10],dogs$weight[dogs$weight<10],pch=19,col="red")`    

    
Think about what this line is doing!  
</blockquote>

<blockquote>To conclude this tutorial, let's make a histogram of the dog weights. In the next cell, type<br>

`hist(dogs$weight)`

Can you change the title on this plot?
</blockquote>

<blockquote>The histogram that we just made has "Frequency" on the $y$-axis. In this case, the height of each bar represents the total number of dogs in the data set whose weight falls in the range that makes up the base ("bin") of that rectangle. Throughout this course, we will want to use "Density" instead of "Frequency". When the $y$-axis uses "density", the height of each bar will be such that the area of each bar is the proportion of dogs whose weight falls into the associated range. This will become an estimate of the probability of any dog's weight being in that range if we were to sample more dogs. Run the next cell. </blockquote>

In [None]:
# Run this cell.
hist(dogs$weight,prob=T)

In [None]:
# Let's check out the proportion of dogs in our sample whose weight is less than 5 units.
length(dogs$weight[dogs$weight<5])/length(dogs$weight)

<blockquote>Note that the first rectangle has base width $5$ and therefore area $(5)(0.04)=0.2$.</blockquote>
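<blockquote>The relationship between density, bin width, and proportion can be checked numerically. The sketch below uses a hypothetical vector of weights <code>w</code> (not the "dogs" data) and calls <code>hist</code> with <code>plot = FALSE</code>, which returns the bin information without drawing anything.</blockquote>

```r
# Hypothetical weights for illustration only.
w <- c(3, 4, 7, 8, 12, 13, 14, 18, 21, 23)

h <- hist(w, plot = FALSE)    # compute the bins without drawing the plot
widths <- diff(h$breaks)      # the width of each bin
areas <- h$density * widths   # area of each bar = height times width

sum(areas)                               # 1 (the bar areas add up to one)
all.equal(areas, h$counts / length(w))   # TRUE (each area is a proportion)
```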

<blockquote>Finally, let's change the width of the bins of this histogram. That is, let's change where we put the "breaks" and make each bar only 2 units wide. We will make a sequence of numbers that will cover the range of the data. (We have already checked the range of the data by typing <code>min(dogs$weight)</code> and <code>max(dogs$weight)</code>.)</blockquote>

In [None]:
br<-seq(0,24,2)
hist(dogs$weight,prob=T,breaks=br)
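<blockquote>Instead of reading the endpoints off by eye, the breaks can be computed from the data. This is only a sketch on a hypothetical vector of weights <code>w</code>. It rounds the minimum down and the maximum up to multiples of 2 so that every value is covered, and it assumes the data are not all equal to one single multiple of 2.</blockquote>

```r
# Hypothetical weights for illustration only.
w <- c(3.2, 4.8, 7.1, 12.5, 18.9, 22.4)

lo <- 2 * floor(min(w) / 2)     # 2: the largest multiple of 2 at or below the minimum
hi <- 2 * ceiling(max(w) / 2)   # 24: the smallest multiple of 2 at or above the maximum
br <- seq(lo, hi, by = 2)       # 2 4 6 ... 24

h <- hist(w, breaks = br, plot = FALSE)
sum(h$counts) == length(w)      # TRUE (every value landed in some bin)
```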

<hr>
That was fun. Let's get back to the course, shall we?