# R Exercise 0 : Fuel Efficiency


Choosing a fuel efficient car will not only help slow climate change but will also save you money.
The difference between a car that gets 20 miles per gallon (MPG) and one that gets 30 MPG can
amount to over $500 in avoided gasoline costs per year.
After introducing R and some basic commands, this exercise looks at data on the gas mileage of cars as
reported at www.fueleconomy.gov.

## Getting Started with R
The R software can conveniently perform most of the calculations in statistics. Think of R as a
calculator for statistics where the many dedicated buttons are replaced by a keyboard where you
type the commands for what you want to do.


Interacting with R is done in a command-and-answer manner: you issue commands and R provides
answers. 

In RStudio, you issue these commands by typing them in the console after the prompt:
```
>
```
For example, see the screen capture below of an *RStudio Console*. I have typed in the command:  
```
> 2 + 2
```
The output is: 
```
[1] 4
```
--------------------------------------

![](https://i.imgur.com/GB68FVM.gif)

--------------------------------------

After a leading [1], R returns the correct answer. 


However, here in **<span class="girk">Jupyter Notebook</span>** you do **not** need to use a ">" symbol, nor will you see a leading [1].

Type the command in the input box below:
```
2+2 
```
Then run the code using **SHIFT + ENTER** and see what your output is. 
You can also click the box and then hit the **Run** button at the top of the page.  

You should have seen a 4 appear below the input. 

R uses +, -, *, /, sqrt and ^ for the usual math notations. Parentheses are used to group expressions. "log(x)" is used for the natural log (otherwise known as ln) and "log10(x)" is used for the log of base 10.
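
As a quick illustration of these operators, here is a small sketch. The numbers are arbitrary and are not part of the exercise that follows.

```r
# Arbitrary numbers chosen only to illustrate the operators
(3 + 5) * 2    # parentheses are evaluated first: 16
2^10           # ^ raises to a power: 1024
sqrt(16)       # square root: 4
log(exp(1))    # natural log of e: 1
log10(1000)    # base-10 log: 3
```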

**<span class="girk">EXERCISE 1:</span>** Let's try some more complicated expressions before we move any further. In the **input box** below, use R to find values for each expression. (Hint: To run more than **one** expression in the same input box, just put each on a *new line*.)

1. 52 x 17.75
2. 365/4 
3. 125 x (10.61 − 7.27)
4. $\sqrt{(4 + 3)(2 + 1)}$
5. log10(250)
6. ln(250)

If you get stuck, click on the solution below to see the correct input syntax and output.


In [None]:
52*17.75
365/4 
125*(10.61-7.27)
sqrt((4 + 3)*(2 + 1))
log10(250)
log(250)

From now on, each Exercise will have an input box below it. Type and run your R code answers there. You can use your scratchpad for scratchwork or testing, but know that the code in the scratchpad is not preserved. If you ever lose your place, click on the table of contents button on the upper toolbar.

![](https://i.imgur.com/qwoPOd6.gif)


## Working with data

Statistics is about analyzing data sets which likely will have more than one data point. Unlike most
calculators, R works naturally well with data sets.
Data on the average fuel efficiency of the auto fleets of various manufacturers in 2008 are given in
the table below.

![table.JPG](attachment:table.JPG)

## Storing data
Let’s start by entering the data into R for analysis.
We use the function c() to *combine* numbers into a data set. 

For example, if the data set had values 1,2,3:

```
Input: c(1,2,3)
Output: 1,2,3

``` 



**<span class="girk">EXERCISE 2:</span>** Combine the values for the 2008 MPG into a data set using the c() function:

In [None]:
c(22.8, 23.9, 22.3, 21.9, 22.3, 23.1, 22.3, 21.2, 19.6, 19.3)

When you do this, the numbers are combined and then printed – then they are forgotten! 

*Note:* If you run this command in an *RStudio Console*, a [1] appears first. This helps keep track of how many numbers are in the data vector (we call a
variable that stores data a data vector). When there are several rows of numbers output, the
number in square brackets indicates the position of the first number in that row, which makes the
meaning of the leading [1] clear. Take a look at a screencap of the console output below:

![](https://i.imgur.com/dBcKFSt.png)

The leading [1] indicates 22.8 is the 1st value. The leading [8] indicates that 21.2 is the 8th value.


We need to **store** the data so we can reuse it. To do this, we assign the output to a **variable** using an
**equals** sign. 

For example:

```
x = c(1,2,3)

```
Stores the dataset of 1,2,3 into variable 'x'. 
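
A minimal sketch of storing and then reusing a data set, using the same values 1, 2, 3. R also accepts the arrow `<-` for assignment, and many R users prefer it. This exercise keeps using `=`, and the variable `y` below is only for illustration.

```r
x = c(1, 2, 3)     # store the data set in x (nothing is printed)
x                  # typing the name prints the stored values
y <- c(1, 2, 3)    # the arrow form of assignment does the same thing
identical(x, y)    # TRUE: both variables hold the same data
```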


**<span class="girk"><span class="girk">EXERCISE</span> 3:</span>** Store our 2008 MPG dataset into a variable called 'mpg' (you should not get any output). Now type in 'mpg' and run it. Do you see the dataset you saved?

In [None]:
mpg = c(22.8, 23.9, 22.3, 21.9, 22.3, 23.1, 22.3, 21.2, 19.6, 19.3)
mpg

## Manipulating data using functions
In R, data sets are explored, summarized, and analyzed by applying functions to data vectors. A basic
usage looks like **function_name(data_vector_name)**. Don’t forget the parentheses! Many functions will
have extra arguments to change their default behavior.

Many things can be done with the output of a function. It may simply answer your question. Or you
may want to store it for later usage, or you may use it directly with another function.

For example, R has functions **max()** and **min()** to find the maximum and minimum values in a data
vector. The function **range()** returns both of these values.
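
The three uses of a function's output described above can be sketched as follows. The vector `temps` and its values are made up for illustration and are not part of the exercise data.

```r
temps = c(61.5, 70.2, 58.9, 66.4)   # hypothetical values, for illustration only

max(temps)                    # the output may simply answer the question
hi = max(temps)               # or be stored for later use
hi - min(temps)               # the stored value is reused here
round(sqrt(hi), digits = 2)   # or passed straight to another function;
                              # digits is an extra argument that changes round()'s default
range(temps)                  # smallest and largest value together
```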


<span class="girk">**EXERCISE 4:**</span> Let's get familiar with some important commands.

1. Find the maximum value of mpg 
2. Find the minimum value of mpg
3. Find the range of mpg 
4. The summary() function is even more useful. Find a summary of mpg. Notice what statistical values it tells you! 

In [None]:
max(mpg)
min(mpg)
range(mpg)
summary(mpg)

Finding the mean: The average value of a data vector can be found in several ways, as illustrated next. For the data in
mpg we can do it all by hand.

<span class="girk">**EXERCISE 5:**</span> Add up all the values in your data vector and divide by the number of values to find the mean.


In [None]:
(22.8 + 23.9 + 22.3 + 21.9 + 22.3 + 23.1 + 22.3 + 21.2 + 19.6 + 19.3)/10


But, why should we type the data values in when they are already stored into mpg? We can let the
computer do the addition using the sum() function. Rather than counting the ten numbers we added, we can let the computer find the length using length():

<span class="girk">**EXERCISE 6:**</span> Use **sum()** to sum the values of mpg first and then divide by the number of values using **length()**. Check you got the same mean value as before!


In [None]:
 sum(mpg)/length(mpg)

This works fine, but as finding the average is a common task in statistics, there is a built-in function,
mean(), for this. 

<span class="girk">**EXERCISE 7**</span>: Use **mean()** to find the mean of your data vector. Check you got the same mean value as before!

In [None]:
mean(mpg)

## Using indices
The entries in a data vector come with a natural order: the first, second, ..., nth. Being able to access
the values by their index can extend the ways we can look at a data vector.
Accessing a single value in a data set can be done using square brackets [ ]. 

For example, the nth value of mpg can be found with:
```
mpg[n]
```

Arithmetic can also be used with this notation. The difference between the mth and nth entries is: 
```
mpg[m]-mpg[n]
```

If you want more than one entry:

```
mpg[c(m,n)]
```
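
Here is a small sketch of the three index forms with concrete numbers. The vector `v` is made up so that the answers are easy to check by eye.

```r
v = c(10, 20, 30, 40)   # made-up data vector

v[2]           # the 2nd entry: 20
v[4] - v[1]    # difference between the 4th and 1st entries: 30
v[c(1, 3)]     # the 1st and 3rd entries together: 10 30
```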



<span class="girk">**EXERCISE 8:**</span> 

1. Find the 7th entry of mpg
2. Find the difference between the 3rd and 8th entries of mpg
3. Find the 1st and 5th values of mpg in ***one line of R code***

In [None]:
mpg[7]
mpg[3]-mpg[8]
mpg[c(1, 5)]

## Logical expressions
Indices can also be logical expressions, allowing one to further explore the data.
We can ask what fleets had fuel efficiency greater than 23 mpg:


In [None]:
mpg > 23

The answer is TRUE or FALSE for each value in the data vector mpg. When using such answers as
indices, the values corresponding to **TRUE** are returned.

In [None]:
 mpg[mpg > 23]

Logical expressions used for indices must be the same length as the data vector. Other logical
questions are possible using >, >=, <, <=, == (double equals signs), and ! for the negative. Expressions
can be combined using & (and), | (or).

For example, values less than or equal to 20 are

In [None]:
mpg[mpg <= 20]

Values meeting either of the conditions are found with

In [None]:
 mpg[mpg <= 20 | mpg > 23]
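
The operators == and ! have not been shown yet, so here is a small sketch of them, together with &, on a made-up vector `v` (not the exercise data).

```r
v = c(18, 20, 23, 25)   # made-up data vector

v == 20               # FALSE TRUE FALSE FALSE
v[v == 20]            # values exactly equal to 20: 20
v[!(v > 20)]          # values that are NOT greater than 20: 18 20
v[v > 18 & v < 25]    # values meeting both conditions: 20 23
```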

<span class="girk">**EXERCISE 9:**</span> 

1. What values in your data vector are strictly greater than 21?
2. What values in your data vector are less than 19.5 or greater than 22.5 (inclusive for both)?
3. What values in your data vector are greater than 19 AND less than 20 (non-inclusive for both)?

In [None]:
mpg[mpg > 21]
mpg[mpg <= 19.5 | mpg >= 22.5]
mpg[mpg > 19 & mpg < 20]

##  What index was that?
A natural question to ask is what index has a value that fulfills a logical expression. For example,
when is something at its maximum, or minimum? The **which()** command can answer in terms of the
index.

For example, the following command outputs the index that holds the largest value in the data vector:

In [None]:
which(mpg == max(mpg))

<span class="girk">**EXERCISE 10:**</span> Using the same logic, which index holds the smallest (minimum) value in the dataset?

In [None]:
which(mpg == min(mpg))
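
When the extreme value occurs more than once, which() reports every matching index. The sketch below uses a made-up vector `v` with a tie. As an aside, base R also has which.max() and which.min(), which return only the first matching index.

```r
v = c(3, 9, 4, 9)     # made-up vector in which the maximum appears twice

which(v == max(v))    # every index holding the maximum: 2 4
which.max(v)          # only the first such index: 2
which.min(v)          # index of the minimum: 1
```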

##  Graphical views
R has many functions that produce graphics for viewing a data set. For example, we can make a
simple plot of the data using plot():


In [None]:
plot(mpg)

After running this command, a graphic should appear (below the cell here, or in a plot window in RStudio) showing an admittedly boring plot. By
default, this plots the values in the order in which they were entered. The x-axis label, Index, refers
to the position in the data vector of the data point. We’ll explore more interesting plots later.

#  Real data sets
All of the previous work could have been done by hand or with a calculator. To illustrate why a
computer is a much better tool for statistics, let’s use a larger dataset.
Data on cars manufactured in 1993 are available in the MASS library. Libraries are collections of data
and/or functions that can be optionally installed in R. The MASS library is loaded into an R session
with the command:


In [None]:
 library(MASS)

The data are contained in the dataset Cars93. The different variable names in Cars93 are returned
with the names() function. 

In [None]:
 names(Cars93)

Most names are self-explanatory. However if the variable is not clear to you, more information can
be found on the help page for the dataset, which can be viewed with the command

In [None]:
help(Cars93)

Specific variables (i.e. data vectors) can be referred to using the name of the dataset, followed by a
$, followed by the variable name.

In this case "Cars93" + "$" + variable_name
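
To see the dataset$variable pattern on something small, here is a sketch with a tiny data frame. The name `fleet`, its columns, and its values are all hypothetical and stand in for a real dataset such as Cars93.

```r
# A tiny made-up dataset with two variables (columns)
fleet = data.frame(maker = c("A", "B", "C"),
                   mpg   = c(22.8, 23.9, 22.3))

names(fleet)      # the variable names: "maker" "mpg"
fleet$mpg         # the mpg data vector: 22.8 23.9 22.3
max(fleet$mpg)    # functions work on it like on any data vector: 23.9
```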

<span class="girk">**EXERCISE 11:**</span> Using this format, get the following data vectors for the following variables:

1. Type
2. Model
3. RPM





In [None]:
 Cars93$Type
 Cars93$Model
 Cars93$RPM

Did you notice the tab called **Levels**? This is a list of all the different *types* in the data vector.
For example, there are 93 entries in "Type", but there are only 6 possible "Type" options: Compact, Large, Midsize, Small, Sporty, and Van.

To save ourselves some typing, we can associate the variable names with the data vectors directly
using the command attach():


In [None]:
 attach(Cars93)

Now try typing any variable name (you don't need Cars93 or the $ symbol anymore!):

In [None]:
MPG.highway

But be careful if you have multiple datasets attached with the same variable names! Only the data
vector from the dataset most recently attached will be active.
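
The masking behaviour can be seen in a small sketch. The data frames `a` and `b` and their shared column `score` are made up for illustration. detach() removes a dataset from the search path again.

```r
a = data.frame(score = c(1, 2, 3))      # made-up dataset
b = data.frame(score = c(10, 20, 30))   # second made-up dataset, same variable name

attach(a)
attach(b)          # R prints a message that score from a is now masked
latest = score     # taken from b, the most recently attached dataset
detach(b)
earlier = score    # with b detached, score refers to a again
detach(a)

latest             # 10 20 30
earlier            # 1 2 3
```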

<span class="girk">**EXERCISE 12:**</span> Find the maximum highway efficiency (MPG). Which manufacturer and model does this correspond to?

*Note* when you type a part of the name and press Tab, the name can be autocompleted. E.g.
try typing ‘Manu’ and then press Tab.

In [None]:
max(MPG.highway)
Manufacturer[which(MPG.highway == max(MPG.highway))]
Model[which(MPG.highway == max(MPG.highway))]

## Histograms and density estimates
We know that fuel efficiency of the various models is widely variable. The shape of this distribution
can be seen with several graphics. The histogram is a familiar one:

In [None]:
hist(MPG.highway)

We can visually estimate the mean and median from the histogram as the mean is a balancing point,
and the median splits the area into equal areas. For symmetric distributions these are the same, but
from this histogram we see this isn’t so.

An estimate from the figure for the median might be somewhere between 25 and 30 miles per
gallon, for the mean a bit more, maybe around 30, because the right tail is longer.
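
One way to check a visual estimate like this is to compute both values and mark them on the histogram. The sketch below is only a suggestion: it refers to the variable as Cars93$MPG.highway so that it runs on its own, and the line styles are arbitrary choices.

```r
library(MASS)                       # provides the Cars93 dataset
hwy = Cars93$MPG.highway

hist(hwy, main = "Highway mileage", xlab = "MPG")
abline(v = mean(hwy), lty = 1)      # solid vertical line at the mean
abline(v = median(hwy), lty = 2)    # dashed vertical line at the median

mean(hwy)                           # compare with your estimate
median(hwy)
```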

The difference between highway and city mileage should be positive for most cars (not hybrids), but
is it always the same amount? Make a histogram of the variable (MPG.highway - MPG.city). From
the histogram assess: Are there any cars whose city mileage is better than their highway mileage?
What is the maximum amount for the difference between the two mileages?
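
A possible way to approach these questions is sketched below. The name `gap` is made up for this example. The last two lines are an optional numerical check of what you read off the histogram.

```r
library(MASS)                                   # provides the Cars93 dataset
gap = Cars93$MPG.highway - Cars93$MPG.city      # highway minus city, one value per car

hist(gap, main = "Highway minus city mileage", xlab = "Difference (MPG)")

range(gap)      # smallest and largest difference
sum(gap < 0)    # number of cars whose city mileage beats their highway mileage
```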

A density plot summarizes the shape of a distribution in a manner that is similar to the histogram.
To see the differences, we add a density estimate to our histogram. It is important to add the
argument probability=TRUE to the hist() function so that the total area under the histogram is
normalized to be 1. Now the relative frequencies will be shown instead of frequencies.

<span class="girk">**EXERCISE 12:**</span> Plot a histogram of MPG.highway. Use hist() with the argument 'probability = TRUE' to normalize the area.


In [None]:
 hist(MPG.highway, probability = TRUE)
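
To convince yourself that the normalized bars really have total area 1, you can add up height times width for every bar. This sketch assumes the default bins. It uses plot = FALSE so that hist() only returns the bin information instead of drawing.

```r
library(MASS)                                  # provides the Cars93 dataset

h = hist(Cars93$MPG.highway, plot = FALSE)     # compute the bins without drawing
area = sum(h$density * diff(h$breaks))         # bar height times bar width, summed
area                                           # should be 1 (up to rounding)
```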

The density estimate is found with density(). It can be added to a graphic using lines() (or plot() if a
new graphic is desired):


In [None]:
hist(MPG.highway, probability = TRUE)
lines(density(MPG.highway))

It isn’t really practical to draw two histograms for comparison, but it is for densities. The following
commands plot density estimates for both highway and city mileage.


In [None]:
plot(density(MPG.highway))
lines(density(MPG.city))

Do they have similar shapes? Similar centers? Similar spreads?
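
If one curve looks cut off, the axis limits were set by the first curve only. The sketch below is one possible way to fit both curves in the same graphic and tell them apart. The line types, title, and legend position are arbitrary choices.

```r
library(MASS)                               # provides the Cars93 dataset

d_hwy  = density(Cars93$MPG.highway)
d_city = density(Cars93$MPG.city)

# Choose axis limits wide enough for both curves
plot(d_hwy,
     xlim = range(d_hwy$x, d_city$x),
     ylim = range(d_hwy$y, d_city$y),
     main = "Mileage densities", xlab = "MPG")
lines(d_city, lty = 2)                      # dashed line for city mileage
legend("topright", legend = c("highway", "city"), lty = c(1, 2))
```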

## Boxplots
Boxplots conveniently summarize the center, spread and shape of a distribution in a manner that
allows several to be displayed in the same graphic.
A simple univariate boxplot of the highway mileage is created with the function boxplot():


In [None]:
boxplot(MPG.highway, main = "Highway mileage (MPG)")

Multiple boxplots can be made in a similar manner. The command

In [None]:
boxplot(MPG.highway, MPG.city, main = "Mileage (MPG)", names = c("highway", "city"))

will show the two boxplots on the same scale.

Using R’s notation for model construction (~), you can build and display an informal “model” of
highway fuel efficiency as a function of Type. 

<span class="girk">**EXERCISE 13:**</span> Try the command: boxplot(MPG.highway ~ Type, main = "Highway mileage by type")


In [None]:
boxplot(MPG.highway ~ Type, main = "Highway mileage by type")

## Bivariate plots
Many factors influence the fuel efficiency of automobiles. We’ve already graphically observed the
relation with the categorical variable Type. Now let’s look at some continuous variables. 

<span class="girk">**EXERCISE 14:**</span> Try the command: plot(MPG.highway ~ EngineSize)

In [None]:
plot(MPG.highway ~ EngineSize)

What is the visual relation between these two variables? You might also try plotting MPG.highway
against the variables Weight, Length, or Fuel.tank.capacity. What happens if you try to plot MPG.highway against a categorical variable without using the
function boxplot()? 

<span class="girk">**EXERCISE 15:**</span> Try the command: plot(MPG.highway ~ DriveTrain)


In [None]:
plot(MPG.highway ~ DriveTrain)

<span class="girk">**EXERCISE 16:**</span> Generally, R is pretty smart about recognizing and treating variables appropriately. But try plotting MPG.highway against Passengers:

In [None]:
plot(MPG.highway ~ Passengers)

Why do you think the plot looks this way? 

<span class="girk">**EXERCISE 17:**</span> Now try: boxplot(MPG.highway ~ Passengers)

In [None]:
boxplot(MPG.highway ~ Passengers)
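
One way to investigate the two plots above is to look at how R stores Passengers, and then tell R explicitly to treat it as a categorical variable with factor(). This is a sketch, and the name `p` is made up for the example.

```r
library(MASS)                        # provides the Cars93 dataset

class(Cars93$Passengers)             # how R stores this variable
p = factor(Cars93$Passengers)        # treat each passenger count as a category
levels(p)                            # the distinct passenger counts

# With a categorical variable on the right-hand side, plot() draws boxplots
plot(Cars93$MPG.highway ~ p, xlab = "Passengers", ylab = "Highway MPG")
```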

If time allows, try some other data explorations on your own …

## Afterword
If you would like to explore more recent automobile data, you can download the file **“Cars2012.txt”** from Canvas.
In order to import data to R, you need to first set the working directory to where your data file is stored. Use the following command, replacing the path with the path to the folder where you have the data (note that forward slashes (/) rather than backslashes (\) are used in R).
```
setwd("Your/Path/Here")
```
You can then read in the data to the R object “Cars2012” using the command:

```
Cars2012 = read.table("Cars2012.txt", header=TRUE)
```

*Data importation will be explained more in depth in the next exercise, RX1.*

The argument “header=TRUE” indicates that the data have column headers.
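
If you do not have the file at hand, the effect of header=TRUE can be tried on a throwaway file. Everything in this sketch (the data frame `toy`, its values, and the temporary file) is made up for illustration and has nothing to do with Cars2012.txt.

```r
toy = data.frame(model = c("A", "B"), mpg = c(30.5, 27.1))   # made-up data

tmp = tempfile(fileext = ".txt")            # a temporary file, so no setwd() is needed
write.table(toy, tmp, row.names = FALSE)    # the first line of the file holds the column names

back = read.table(tmp, header = TRUE)       # header=TRUE turns that first line into variable names
names(back)                                 # "model" "mpg"
back$mpg                                    # 30.5 27.1

unlink(tmp)                                 # delete the temporary file
```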

Attach the new dataset so you don’t need to use the $ sign every time you access the new data vectors:

```
attach(Cars2012)
```

View a summary of the data:
```
summary(Cars2012)
```