# More Python Concepts

(mechanics)=

> *Form follows function*
>
>-- Louis Sullivan

In the chapter on [Getting started with Python](getting-started-with-python) our main goal was to, well, get started with Python. As we go through the book we'll run into a lot of new Python concepts, which I'll explain alongside the relevant data analysis concepts. However, there are still quite a few things that I need to talk about now, otherwise we'll run into problems when we start trying to work with data and do statistics. So that's the goal in this chapter: to build on the introductory content from the last chapter, to get you to the point that we can start using Python for statistics. Broadly speaking, the chapter comes in two parts. The first half of the chapter is devoted to the "mechanics" of Python: installing and importing libraries, managing the workspace, navigating the file system, and loading and saving data. In the second half, I'll talk more about what kinds of variables exist in Python, and take a closer look at two kinds that we will use constantly: lists and data frames. I'll finish up by talking a little bit about the help documentation in Python as well as some other avenues for finding assistance. In general, I'm not trying to be comprehensive in this chapter; I'm trying to make sure that you've got the basic foundations needed to tackle the content that comes later in the book. However, a lot of the topics are revisited in more detail later.

## Using comments

Before discussing any of the more complicated stuff, I want to introduce the **_comment_** character, `#`. It has a simple meaning: it tells Python to ignore everything else you've written on this line. You won't have much need of the `#` character immediately, but it's very useful later on when writing scripts. However, while you don't need to use it, I want to be able to include comments in my Python extracts. For instance, if you read this: [^note1]

[^note1]: Notice that I used `print(keeper)` rather than just typing `keeper`. Later on in the text I'll sometimes use the `print()` function to display things because I think it helps make clear what I'm doing, but in practice people rarely do this.


In [2]:

seeker = 3.1415           # create the first variable
lover = 2.7183            # create the second variable
keeper = seeker * lover   # now multiply them to create a third one
print( keeper )           # print out the value of 'keeper'


8.539539450000001


it's a lot easier to understand what I'm doing than if I just write this:

In [3]:

seeker = 3.1415
lover = 2.7183
keeper = seeker * lover
print( keeper )    


8.539539450000001


From now on, you'll start seeing `#` characters appearing in the code extracts, with some human-readable explanatory remarks next to them. These are still perfectly legitimate commands, since Python knows that it should ignore the `#` character and everything after it. But hopefully they'll help make things a little easier to understand.

(packageinstall)=

## Installing and importing 


There is lots to love about Python as a programming language. Although it has its quirks and peculiarities like any language (programming or natural), it is relatively flexible and welcoming to newcomers, while still being very, very powerful. But one of the best things about Python isn't even the language itself; it is the rich ecosystem of code written by other people that you can use to make Python do things for you. These **libraries** or **packages** [^note2] contain code that people have written to solve particular problems, and then kindly made available for other people, like you and me, so that we don't have to spend our time reinventing the wheel. By installing and importing libraries, you can achieve very complicated things with only a few lines of your own code, by standing on the shoulders of others. Just ask Cueball from the webcomic xkcd:[^note3]

[^note2]: There are some subtle differences between libraries, packages, and modules, but we don't need to concern ourselves with these here, and I may well mix up these words in the text. The key thing is, they are bits of code that we need to import to make stuff happen in Python.  
[^note3]: https://xkcd.com/353/

![xkcdpython](https://imgs.xkcd.com/comics/python.png)

When doing anything other than the very most basic forms of data analysis in Python, we will almost always need to use libraries. However, before we get started, there's a critical distinction that you need to understand, which is the difference between having a package **_installed_** on your computer, and having a package **_imported_** in Python. I do not have any idea how many Python libraries are available out there, but it is a lot. Thousands. If you install Python on your computer, you won't get all of them, just a handful of the standard ones. Depending on how you install Python on your computer, you may have more or fewer libraries installed, but either way, there are thousands more out there that you do not currently have installed. So that's what installed means: it means "it's on your computer somewhere". The critical thing to remember is that just because something is on your computer doesn't mean Python can use it. In order for Python to be able to *use* one of your installed libraries, that library must also be "imported". Basically what it boils down to is this:

>A library must be installed before it can be imported.

>A library must be imported before it can be used.

This two step process might seem a little odd at first, but the designers of Python had very good reasons to do it this way,[^note4] and you get the hang of it pretty quickly.

I won't get into the details of installing libraries here, simply because it is too much for me to tackle. If you are using Python in an online environment, you may already have access to all the libraries mentioned in this book. If you are working with Python on your own computer, the exact details of how you install packages may vary. If you want to use Python on your own computer, and are just getting started, I recommend [Anaconda](https://www.anaconda.com/products/individual#Downloads) as a relatively easy way to install Python and get quick access to all the most common and important libraries.

[^note4]: Basically, the reason is that there are thousands of libraries, and probably thousands of authors of libraries, and no-one really knows what all of them do. Keeping the installation separate from the loading minimizes the chances that two libraries will interact with each other in a nasty way. 
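If you are ever unsure whether a library is installed, you can ask Python to go looking for it without actually importing it. The sketch below uses `importlib.util.find_spec` from the standard library; the helper name `is_installed` and the deliberately silly library name are my own inventions for illustration.

```python
import importlib.util


def is_installed(name):
    """Return True if Python can find a library called `name` on this computer."""
    return importlib.util.find_spec(name) is not None


# 'statistics' ships with Python itself, so it should always be found
print(is_installed("statistics"))

# a made-up name that (hopefully!) nobody has installed
print(is_installed("a_library_that_does_not_exist"))
```

Note that finding a library this way only tells you it is installed. You still have to `import` it before you can use it.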

### What libraries does this book use?

In this book, I have made a concerted effort to limit the number of libraries needed. Often you will find that you can use different libraries to achieve the same results, and sometimes one of these may suit your needs more than another. This is something that can make doing analysis by code rather than pointing and clicking in a dedicated statistics program a bit off-putting; in Excel, there is usually only one way to do things, while in Python, there are many. I think this is part of what makes doing statistics using code better, though: you can make your own informed choices, and do *exactly* the analysis you want to do; you don't have to accept some piece of software's default settings. However, the point of this book is to get you started doing data analysis and statistics in Python, not to show you all the different ways you could achieve the same goal, so in an effort to keep things simple, I have tried to limit the libraries used in this book to a few of the most important and most common ones for doing statistics with Python. The most prominent ones are: `numpy`, `scipy`, `pandas`, `matplotlib`, `seaborn`, `statistics`, `math`, and `statsmodels`, but I may use others as well, as needed.

### How will I import libraries in this book?

Once you import a library into Python's active memory, you don't need to do it again. In writing this book, each chapter is a Python file.[^note5] So, if I have imported e.g. `numpy` early in the chapter, I don't need to do it again in a later section of the chapter. But, normally I will, because I want the code snippets in this book to be as easy as possible to copy and paste into your own computer. If I don't put the `import` command at the top of the snippet, and you have not already imported the library, then you might copy and paste my code into your computer and get an error message. Then again, sometimes I might forget to put the import statement in, or I might think it should be obvious, or I might just get lazy, so make sure to keep an eye out for this!

[^note5]: Well, actually it is an `.ipynb` file, but let's not bicker and argue about who killed who.

### Importing libraries

Assuming you have the libraries you need installed on your computer, or can access them in the virtual Python environment you are using in your browser, you will need to import them before you can actually use them. So, for instance, if I want to find the sum of five numbers, I can write

In [13]:
numbers = [4, 5, 1, 2, 6]
sum(numbers)

18

because the authors of Python felt that adding numbers together was such a basic thing that there should be a built-in command for it. At least, I assume so. I don't really know what the authors of Python thought. But, oddly enough, Python doesn't have a built-in command called "mean". So if I want to know the mean of those same five numbers, I cannot just write

In [14]:
mean(numbers)

NameError: name 'mean' is not defined

because Python doesn't know what mean, er, means. Luckily, we don't have to resort to first finding the sum and then dividing by the number of numbers, because there are libraries that _do_ have built-in commands for finding means. The `statistics` library is one. To use the commands in this library, we first have to `import` it. This gives us access to all the many useful commands in the `statistics` library, one of which is `mean`:

In [16]:
import statistics

numbers = [4, 5, 1, 2, 6]

statistics.mean(numbers)

3.6

You probably noticed the `.` in the code above. This is the way we tell Python that we want to use a command called `mean` which is found inside the library `statistics`. Without the `.`, even though we have imported `statistics`, which has a command called `mean`, we still can't just write `mean(numbers)`. We have to tell Python where to look for this command. This all seems very cumbersome, but it's really not so bad, there are good reasons for doing it this way[^note4], and you will get used to it fairly quickly.
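For comparison, here is the do-it-yourself route mentioned above (find the sum, then divide by the number of numbers) next to the library route. This is just a sketch; `len()`, which counts the items in a list, is a built-in command that we will meet properly later in this chapter, and the variable names `mean_by_hand` and `mean_from_library` are mine.

```python
import statistics

numbers = [4, 5, 1, 2, 6]

# the do-it-yourself route: add everything up, then divide by how many there are
mean_by_hand = sum(numbers) / len(numbers)

# the library route: let the statistics library do the work
mean_from_library = statistics.mean(numbers)

print(mean_by_hand)        # 3.6
print(mean_from_library)   # 3.6
```

Both routes give the same answer, but the second one says what it means, which makes the code easier to read.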

One of the ways in which Python is quite flexible is that it gives you some options in terms of how you import libraries. More precisely, you can:

>Choose to import only a portion of a library

>Rename libraries or portions of libraries when importing

Let's say we don't want to import the entire `statistics` library — we only want the `mean` command. We can achieve this like this:

In [25]:
from statistics import mean

Why would we want to do this? Well, one good reason is that now we *can* simply write `mean(numbers)`; we no longer have to write out `statistics.mean(numbers)`:

In [26]:
numbers = [4, 5, 1, 2, 6]
mean(numbers)

3.6

Is this the height of laziness? Maybe. But if you start writing the same thing over and over again, saving a few characters here and there is pretty sweet. And this brings us to the other import option: renaming libraries. It is common practice in Python to give libraries abbreviations when we import them. Many of the most common libraries have conventional abbreviations, although you could use anything you like. Thus, you will often see e.g.

In [27]:
import numpy as np
import seaborn as sns

This is very convenient, but be careful: if you e.g. import `numpy` as `np`, then Python will only recognize it as `np`, at least for the time your code is in Python's active memory. Also, although you can use whatever abbreviations you like, I highly recommend sticking to the conventional ones, for your sake and others'. It's kind of fun the first time to do something like

In [28]:
import statistics as why_you_gotta_be_so

why_you_gotta_be_so.mean(numbers)

3.6

but good code should be easy to read by yourself and others, and if you start playing too fast and loose with renaming, it starts to get less clear what's going on.
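To pull the three import styles together, here is a small side-by-side sketch. The abbreviation `st` is not a convention I am aware of; I made it up purely for illustration.

```python
import statistics                 # style 1: import the whole library
from statistics import mean       # style 2: import just one command from it
import statistics as st           # style 3: import the library under a shorter name

numbers = [4, 5, 1, 2, 6]

print(statistics.mean(numbers))   # 3.6
print(mean(numbers))              # 3.6
print(st.mean(numbers))           # 3.6

# the renamed library is the very same library, just with a different label
print(st is statistics)           # True
```

All three calls run exactly the same code. The only thing that changes is how much you have to type, and how obvious it is to a reader where `mean` came from.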

## Listing the objects in active memory

Let's suppose that you're reading through this book, and what you're doing is sitting down with it once a week and working through a whole chapter in each sitting. Not only that, you've been following my advice and typing in all these commands into Python. So far during this chapter, you'd have typed quite a few commands, although not all of them actually created variables.

An important part of learning to program is to develop the ability to keep a mental model of what Python knows and doesn't know at any given time active in your mind. This sounds very abstract, and it is, but as you become more familiar with coding I think you will see what I mean. I won't dwell on this here, but it may be useful to take a quick peek at what I mean. If you are working in e.g. a Jupyter Notebook (and I do suggest you do this, at least at first), then by typing `%who` you can see a list of all the variables that Python is currently aware of. So, in my case, I get the following:

In [35]:
%who

keeper	 lover	 mean	 mean_numbers	 np	 numbers	 seeker	 sns	 statistics	 
sum_numbers	 why_you_gotta_be_so	 


Here we can see variables that we defined, like `keeper` and `lover`, and also libraries that we imported (and renamed), like `np` and `sns`, as well as the library `statistics` which I then ill-advisedly re-imported and renamed `why_you_gotta_be_so`. To see more details on these variables, we can type `%whos`

In [36]:
%whos

Variable              Type        Data/Info
-------------------------------------------
keeper                float       8.539539450000001
lover                 float       2.7183
mean                  function    <function mean at 0x7fd3d2cdfc20>
mean_numbers          float       3.6
np                    module      <module 'numpy' from '/op<...>kages/numpy/__init__.py'>
numbers               list        n=5
seeker                float       3.1415
sns                   module      <module 'seaborn' from '/<...>ges/seaborn/__init__.py'>
statistics            module      <module 'statistics' from<...>python3.7/statistics.py'>
sum_numbers           int         18
why_you_gotta_be_so   module      <module 'statistics' from<...>python3.7/statistics.py'>


This tells us that e.g. `keeper` is a floating-point decimal number with the value 8.539539450000001, and shows us the true names of the objects we have renamed on import. These commands that start with a `%` sign, by the way, are called "magic" commands, and can only be used in environments like Jupyter, which support them. If you are not working in such an environment, you can use the command `dir()`, which achieves much the same thing, but will also show you lots of information you probably aren't interested in at this stage.
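If you do end up using `dir()`, one way to cut down on the clutter is to skip every name that starts with an underscore, since those are mostly Python's own internal bookkeeping. This is only a rough sketch: exactly which names are left over depends on the environment you are working in, so I haven't shown any output here.

```python
seeker = 3.1415
lover = 2.7183

# dir() with no arguments lists every name Python currently knows about;
# here we keep only the names that do not start with an underscore
names = [name for name in dir() if not name.startswith("_")]

print(names)
```

The list that comes back should include `seeker` and `lover`, along with anything else you have defined or imported so far.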

### Removing variables from the workspace

Looking over that list of variables, it occurs to me that I really don't need them any more. I created them originally just to make a point, but they don't serve any useful purpose anymore, and now I want to get rid of them.  I'll show you how to do this, but first I want to warn you -- there's no "undo" option for variable removal. Once a variable is removed, it's gone forever unless you save it to disk. I'll show you how to do *that* in a later section, but quite clearly we have no need for these variables at all, so we can safely get rid of them by using the `del()` command.

In [37]:
del(keeper, lover)

With `%who` or `dir()` we can check that they are gone.

In [38]:
%who

mean	 mean_numbers	 np	 numbers	 seeker	 sns	 statistics	 sum_numbers	 why_you_gotta_be_so	 



If you want to remove all the variables in memory, and you are working in a Jupyter environment, then `%reset` is a handy way to do this, although I must say that in practice I rarely, if ever, have a need to remove variables from memory. There is usually no harm in them sitting around unused, and if you define a new variable with the same name as an old one, it will just write over the old variable with the same name.
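Here is a small sketch of both ideas: writing over a variable by reusing its name, and then removing it altogether. The values and the `still_there` flag are made up for illustration.

```python
keeper = 8.54          # create a variable
keeper = "finders"     # reusing the name simply writes over the old value

del keeper             # now remove it altogether

try:
    print(keeper)
    still_there = True
except NameError:
    # Python no longer knows about anything called 'keeper'
    still_there = False

print(still_there)     # False
```

Once the variable is deleted, asking for it gives the same kind of `NameError` we saw earlier when we asked for `mean` before importing it: as far as Python is concerned, the name simply doesn't exist.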

(load)=
## Loading and saving data


There are many types of files that can be relevant to us when doing data analysis, but two are especially important from the perspective of this book:


- *Comma separated value (CSV) files* are those with a .csv file extension. These are just regular old text files, and they can be opened with almost any software. It's quite typical for people to store data in CSV files, precisely because they're so simple.
- *Script files* are those with a .py file extension or an .ipynb extension. These aren't data files at all; rather, they're used to save a collection of commands that you want Python to execute later. They're just text files, but we won't make use of them until later.
 

In this section I'll talk about how to import data from a CSV file. First though, we need to make a quick digression, and talk about file systems. I know this is not a very exciting topic, but it is absolutely _critical_ to doing data analysis. If you want to work with your data in Python, you need to be able to tell Python where the data is located.

(filesystem)=
### Filesystem paths

In this section I describe the basic idea behind file locations and file paths. Regardless of whether you're using Windows, macOS or Linux, every file on the computer is assigned a (fairly) human readable address, and every address has the same basic structure: it describes a *path* that starts from a *root* location, through a series of *folders* (or if you're an old-school computer user, *directories*), and finally ends up at the file.

On a Windows computer the root is the physical drive[^note7] on which the file is stored, and for most home computers the name of the hard drive that stores all your files is C: and therefore most file names on Windows begin with C:. After that come the folders, and on Windows the folder names are separated by a `\` symbol. So, the complete path to this book on my Windows computer might be something like this:

```
C:\Users\dan\Rbook\pythonbook.pdf
```
and what that *means* is that the book is called pythonbook.pdf, and it's in a folder called `Rbook` which itself is in a folder called `dan` which itself is ... well, you get the idea. On Linux, Unix and macOS systems, the addresses look a little different, but they're more or less identical in spirit. Instead of using the backslash, folders are separated using a forward slash, and unlike Windows, they don't treat the physical drive as being the root of the file system. So, the path to this book on my Mac might be something like this:

```
/Users/dan/Rbook/pythonbook.pdf
```

So that's what we mean by the "path" to a file, and before we move on, it is critical that you learn how to copy the path to a file on your computer so that you can paste it into Python. There are (again!) multiple ways to do this on the various operating systems, and it doesn't really matter which method you use. A quick search will lead you to many, many online tutorials; just find a method that works for you, on your computer.

[^note7]: Well, the partition, technically.
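Python's standard library includes a module called `pathlib` that understands both styles of path, and it can be a handy way to see how a path breaks down into its parts. The sketch below uses example paths like the ones above. One thing to watch out for: inside an ordinary Python string the backslash has a special meaning, so a Windows path is best written as a "raw" string, with an `r` in front of the opening quotation mark.

```python
from pathlib import PurePosixPath, PureWindowsPath

# the r"..." form tells Python to leave the backslashes alone
win_path = PureWindowsPath(r"C:\Users\dan\Rbook\pythonbook.pdf")
mac_path = PurePosixPath("/Users/dan/Rbook/pythonbook.pdf")

print(win_path.drive)    # C:
print(win_path.parts)    # ('C:\\', 'Users', 'dan', 'Rbook', 'pythonbook.pdf')

print(mac_path.parts)    # ('/', 'Users', 'dan', 'Rbook', 'pythonbook.pdf')
print(mac_path.parent)   # /Users/dan/Rbook
print(mac_path.name)     # pythonbook.pdf
```

The doubled backslash in the printed Windows root is just Python's way of displaying a single backslash inside a string.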

(loadingcsv)=

### Loading data from CSV files into Python

One quite commonly used data format is the humble "comma separated value" file, also called a CSV file, and usually bearing the file extension .csv. CSV files are just plain old-fashioned text files, and what they store is basically just a table of data. This is illustrated below, which shows a file called booksales.csv that I've created. As you can see, each column corresponds to a variable, and each row represents the book sales data for one month. The first row doesn't contain actual data though: it has the names of the variables.

    
    

```{image} ../img/mechanics/booksalescsv.png
:alt: booksales
:width: 600px
:align: center
```
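To convince yourself that a CSV file really is just text, here is a tiny sketch that uses only Python's standard library. The three lines of "file" contents are invented for illustration (they are not the real contents of booksales.csv), and `io.StringIO` simply lets us treat a string as if it were a file.

```python
import csv
import io

# a made-up, three-line stand-in for the contents of a CSV file
csv_text = "month,sales\nJanuary,0\nFebruary,100\n"

# split the text into rows, and each row into its comma-separated values
rows = list(csv.reader(io.StringIO(csv_text)))

header = rows[0]    # the first row holds the names of the variables
data = rows[1:]     # every later row holds the data for one month

print(header)       # ['month', 'sales']
print(data)         # [['January', '0'], ['February', '100']]
```

Notice that everything comes back as text, even the numbers. Sorting out which columns are numbers is one of the chores that `pandas` takes care of for us.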

As is often the case, there are many different ways to get the data from a CSV file into Python so that you can begin doing things with it. Here we will use the `pandas` library, which happens to have a handy command called `read_csv()` which does just what it says.

We can't just read the data in willy-nilly, though. We need some place to put it. As you may have already guessed, we need to define a variable to put our data into. We haven't talked about variable types yet, and now is not the time, but let it suffice to say that there are different kinds of variables, and some of them can store structured data like the rows and columns in a CSV file. `pandas` calls this kind of variable a "dataframe". You can name your dataframe whatever you like, of course, but by convention they are often called "df", so we'll do that too. Thus:

In [46]:
# import pandas, and call it "pd" for short

import pandas as pd

# make a new dataframe variable, and use the "read_csv" command from the pandas library to put the contents
# of the file located at "/Users/ethan/Documents/GitHub/pythonbook/Data/booksales.csv" in the dataframe df

df = pd.read_csv("/Users/ethan/Documents/GitHub/pythonbook/Data/booksales.csv")


Here I have put the full path into the parentheses following `pd.read_csv`, but often I prefer to save the path as a variable, and put that variable into the parentheses instead, like this:

In [None]:
import pandas as pd

file = "/Users/ethan/Documents/GitHub/pythonbook/Data/booksales.csv"

df = pd.read_csv(file)

Either way works, but I think this looks nicer, and it also has an additional advantage: it makes the code more versatile. Right now we are just loading a single CSV, but if we want to load many CSV files, it might be useful to write a [loop](loops) that puts different paths into the same variable `file`. But that is a discussion for another time.

### Saving a dataframe as a CSV

Sometimes we create new dataframes using Python. Maybe the CSV we loaded had lots of information that we don't need, or maybe we have loaded several CSVs in, taken a few columns of data from each one, and then combined these into a new dataframe. Or perhaps we have done some calculations on the original data and added a column with e.g. the sum of each row. If we want to save this new dataframe as a CSV, `pandas` has a command for that as well:

``df.to_csv('/Users/ethan/Desktop/my_file.csv')``

Every `pandas` dataframe has the built-in ability to be exported as a CSV file. We just need to tell Python what the new file should be called, and where we want it to go in our filesystem. Pretty straightforward, really.
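As a quick sanity check, here is a sketch of a full round trip: save a small dataframe to a CSV file, then load it straight back in. The data, the name `sales_df` and the file name are all made up, and I write to a temporary folder so that nothing is left lying around on your computer. I have also added `index=False`, which tells `to_csv()` not to write the row numbers into the file; without it, those row numbers come back as an extra column called `Unnamed: 0` when you reload the file.

```python
import os
import tempfile

import pandas as pd

# a tiny made-up dataframe to practise on
sales_df = pd.DataFrame({"month": ["January", "February"], "sales": [0, 100]})

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "my_file.csv")

    sales_df.to_csv(path, index=False)   # save the dataframe as a CSV file...
    sales_again = pd.read_csv(path)      # ...and load it straight back in

print(sales_again)
```

If everything went to plan, `sales_again` has the same two columns and the same two rows that we started with.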

(useful)=
## Useful things to know about variables

In the chapter on [Getting started with Python](getting-started-with-python), I talked a lot about variables, how they're assigned and some of the things you can do with them, but there are a lot of additional complexities. That's not a surprise of course. However, some of those issues are worth drawing your attention to now. So that's the goal of this section: to cover a few extra topics. As a consequence, this section is basically a bunch of things that I want to briefly mention, but don't really fit in anywhere else. In short, I'll talk about several different issues in this section, which are only loosely connected to one another.

(types)=
### Variable types

As we've seen, Python allows you to store different kinds of data. We have seen variables that store text (_strings_), numbers (_integers_ or _floats_), and even whole datasets (_dataframes_). These are just three of the many different types of variable that Python can store. Other common variable types in Python include _dictionaries_, _lists_, and _tuples_. It's important that we remember what kind of information each variable stores (and even more important that Python remembers) since different kinds of variables allow you to do different things to them. For instance, if your variables have numerical information in them, then it's okay to add them together:

In [56]:
x = 1    # x is a number
y = 2    # y is a number
z = x + y
print(z)

3


But if they contain character data, Python will still let you add the variables, but the outcome might be unexpected:

In [58]:
x = "1"   # x is character, as indicated by the quotation marks
y = "2"   # y is character, as indicated by the quotation marks
x + y           

'12'

To us, there isn't really that big a difference between 1 and "1", but to Python, these are entirely different classes of things.
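If you find yourself with numbers that have been stored as text, the built-in commands `int()` and `float()` will convert them for you. The sketch below also shows what happens if you try to mix the two types without converting; the `result` variable is only there so we can see what happened.

```python
x = "1"    # text
y = "2"    # text

print(x + y)             # 12  (the two pieces of text are glued together)
print(int(x) + int(y))   # 3   (converted to integers first, they are added)

try:
    result = 1 + "1"     # a number plus a piece of text...
except TypeError as error:
    result = None        # ...is something Python refuses to guess about
    print("Python refused:", error)
```

When the types don't match, Python would rather stop with an error than quietly pick an interpretation for you.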

### Checking variable types

Okay, let's suppose that I've forgotten what kind of data I stored in the variable `x` (which happens depressingly often). Python provides a function that will let us find out: `type()`

In [62]:
x = "hello world"     # x is text (aka a "string")

type(x)

str

In [64]:
import pandas as pd

file = "/Users/ethan/Documents/GitHub/pythonbook/Data/booksales.csv"

x = pd.read_csv(file)

type(x)

pandas.core.frame.DataFrame

In [65]:
x = 100     # x is an integer
type(x)

int

In [66]:
x = 3.14
type(x)

float

Exciting, no?

(lists)=

## Lists

A kind of variable that shows up all the time in data analysis with Python is the list. A list is just what it sounds like, it is a single variable that contains a list of items. Just about any variable type you can think of can be listed in a Python list:

In [74]:
shopping = ["apples", "pears", "bananas"] # a list of strings
scores = [90, 65, 100, 82]                # a list of integers
mixed = ["cats", 7, 309.42]               # a list of mixed strings, integers, and floats
everything = [shopping, scores, mixed]    # a list of lists! (I avoid the name "all", which Python already uses for a built-in command)

In [75]:
print(shopping)
print(scores)
print(mixed)
print(everything)

['apples', 'pears', 'bananas']
[90, 65, 100, 82]
['cats', 7, 309.42]
[['apples', 'pears', 'bananas'], [90, 65, 100, 82], ['cats', 7, 309.42]]


Let's say I am so enamored with Python that I actually decided to keep my shopping list in a Python list. Seems unlikely, I know, but bear with me. Later, I realize I have forgotten what I wrote on the list. This does kind of sound like me, actually. To see the contents of the entire list, I can use `print()`, the way I did above. But let's say I only want to see the second item on the list. Python has a way to access specific items in lists, but it will seem strange at first!

Let's take a look. To access an item in a list, we need to know its `index`, that is, its location in the list. We indicate an index with square brackets. So to find the item with the index 2 in my shopping list, I can write `shopping[2]`, like so:

In [80]:
shopping = ["apples", "pears", "bananas"]
shopping[2]

'bananas'

What?!?!? We asked for item 2, and Python gives us "bananas"? But "bananas" is the third item in the list?!?! What's going on?

The simple answer for this is that Python uses "zero-based indexing": basically, Python starts counting at zero. So "apples" is in the zeroth position, "pears" is in the first position, and "bananas" is in the second position. If it helps, maybe think of it like buildings in Europe that start with the ground floor, and then go up to the first floor, etc.

Now just when you have started to get used to zero-indexing, try negative indexing on for size. We can also count backwards from the end of the list, by using negative indices such as `shopping[-2]`. But be careful: when you use negative indexing, Python behaves the way you might have originally expected it to. Thus, `shopping[-1]` will return "bananas", `shopping[-2]` will give us "pears", etc. That's just how Python is.
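Here are both ways of counting laid out next to each other, using the same shopping list:

```python
shopping = ["apples", "pears", "bananas"]

# counting forwards from the start of the list begins at zero
print(shopping[0])     # apples
print(shopping[1])     # pears
print(shopping[2])     # bananas

# counting backwards from the end of the list begins at -1
print(shopping[-1])    # bananas
print(shopping[-2])    # pears
print(shopping[-3])    # apples
```

So every item in the list has two addresses: one counted from the front, and one counted from the back.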

### Finding the length of a list

One last thing on lists for now: it can often be useful to check how many items are in your list. With the toy examples we are using here, of course, it is easy to see how long the list is, because we just typed in the items ourselves. But in actual data analysis, we often deal with very long lists that contain an unknown number of items. In these cases we can use `len()` to check how long the list is:

In [81]:
len(shopping)

3
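Because Python starts counting at zero, the index of the last item is always one less than the length of the list. The sketch below shows this, and also what happens if you ask for an index that is off the end of the list; the variable names `n_items`, `last_item` and `out_of_range` are just for illustration.

```python
shopping = ["apples", "pears", "bananas"]

n_items = len(shopping)             # 3 items in the list...
last_item = shopping[n_items - 1]   # ...so the last one lives at index 2
print(last_item)                    # bananas

try:
    shopping[n_items]               # there is nothing at index 3
    out_of_range = False
except IndexError:
    out_of_range = True

print(out_of_range)                 # True
```

Asking for an item that isn't there gives an `IndexError`, which is Python's way of telling you that you have walked off the end of the list.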

(dataframes)=

## Data frames

It's now time to go back and take a closer look at dataframes.

In order to understand why we use dataframes, it helps to try to see what problem they solve. So let's imagine a little scenario in which I collected some data from nine participants. Let's say I divided the participants into two groups ("test" and "control"), and gave them a task. I then recorded their score on the task, as well as the time it took them to complete the task. I also noted down how old they were.

The data look like this:

In [83]:
age = [17, 19, 21, 37, 18, 19, 47, 18, 19]
score = [12, 10, 11, 15, 16, 14, 25, 21, 29]
rt = [3.552, 1.624, 6.431, 7.132, 2.925, 4.662, 3.634, 3.635, 5.234]
group = ["test", "test", "test", "test", "test", "control", "control", "control", "control"]

So there are four variables in active memory: `age`, `rt`, `group` and `score`. And it just so happens that all four of them are the same size (i.e., they're all lists with 9 elements). Aaaand it just so happens that `age[0]` corresponds to the age of the first person, and `rt[0]` is the response time of that very same person, etc. In other words, you and I both know that all four of these variables correspond to the *same* data set, and all four of them are organised in exactly the same way. 

However, Python *doesn't* know this! As far as it's concerned, there's no reason why the `age` variable has to be the same length as the `rt` variable; and there's no particular reason to think that `age[1]` has any special relationship to `score[1]` any more than it has a special relationship to `score[4]`. In other words, when we store everything in separate variables like this, Python doesn't know anything about the relationships between things. It doesn't even really know that these variables actually refer to a proper data set. The data frame fixes this: if we store our variables inside a data frame, we're telling Python to treat these variables as a single, fairly coherent data set. 

To see how it does this, let's create one. So how do we create a data frame? One way we've already seen: if we use `pandas` to import our data from a CSV file, it will store it as a data frame. A second way is to create it directly from some existing lists using the `pandas.DataFrame()` function. All you have to do is tell it which variables you want to include in the data frame, and what each one should be called. The output is, well, a data frame. So, if I want to store all four variables from my experiment in a data frame called `df` I can do so like this[^note6]:

[^note6]: Although it really doesn't matter at this point, you may have noticed a new symbol here: the "curly brackets" or "curly braces". Python uses these to indicate yet another variable type: the dictionary. Here we are using the dictionary variable type in passing to feed our lists into a `pandas` dataframe.

In [84]:
df = pd.DataFrame(
    {'age': age,
     'score': score,
     'rt': rt,
     'group': group
    })

In [85]:
df

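|   | age | score | rt    | group   |
|---|-----|-------|-------|---------|
| 0 | 17  | 12    | 3.552 | test    |
| 1 | 19  | 10    | 1.624 | test    |
| 2 | 21  | 11    | 6.431 | test    |
| 3 | 37  | 15    | 7.132 | test    |
| 4 | 18  | 16    | 2.925 | test    |
| 5 | 19  | 14    | 4.662 | control |
| 6 | 47  | 25    | 3.634 | control |
| 7 | 18  | 21    | 3.635 | control |
| 8 | 19  | 29    | 5.234 | control |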


Note that `df` is a completely self-contained variable. Once you've created it, it no longer depends on the original variables from which it was constructed. That is, if we make changes to the original `age` variable, it will *not* lead to any changes to the age data stored in `df`. 
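Here is a small sketch that checks both of the claims made in this section. To avoid trampling on the real data, I use cut-down copies with made-up names (`small_age`, `small_score`, `small_df`). The first half shows that changing a list after the data frame has been built leaves the data frame untouched. The second half shows that, unlike a loose collection of lists, a data frame insists that all of its columns are the same length.

```python
import pandas as pd

small_age = [17, 19, 21]
small_score = [12, 10, 11]

small_df = pd.DataFrame({'age': small_age, 'score': small_score})

# change the original list *after* the data frame has been created
small_age[0] = 99

print(small_age)                   # [99, 19, 21]
print(small_df['age'].tolist())    # [17, 19, 21]

# now try to build a data frame from lists of different lengths
try:
    pd.DataFrame({'age': [17, 19, 21], 'score': [12, 10]})
    lengths_accepted = True
except ValueError as error:
    lengths_accepted = False
    print("pandas refused:", error)
```

So the data frame keeps its own copy of the data, and it won't let the columns drift out of step with one another.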

### Pulling out the contents of the data frame using `$`

```{r echo=FALSE}
rm(age, gender, group, score)
```


At this point, our workspace contains only the one variable, a data frame called `expt`. But as we can see when we told R to print the variable out, this data frame contains 4 variables, each of which has 9 observations. So how do we get this information out again? After all, there's no point in storing information if you don't use it, and there's no way to use information if you can't access it. So let's talk a bit about how to pull information out of a data frame. 

The first thing we might want to do is pull out one of our stored variables, let's say `score`. One thing you might try to do is ignore the fact that `score` is locked up inside the `expt` data frame. For instance, you might try to print it out like this:
```{r error=TRUE}
score
```
This doesn't work, because R doesn't go "peeking" inside the data frame unless you explicitly tell it to do so. There's actually a very good reason for this, which I'll explain in a moment, but for now let's just assume R knows what it's doing. How do we tell R to look inside the data frame? As is always the case with R there are several ways. The simplest way is to use the `$` operator to extract the variable you're interested in, like this:
```{r}
expt$score
```
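The `$` operator belongs to R. For a `pandas` data frame such as the `df` we built earlier in this chapter, the usual way to pull out a single column is to put the column name in square brackets. Here is a minimal sketch; it rebuilds a two-column version of the data under the name `df_small` (a name I've made up for this example) so that it runs on its own:

```python
import pandas as pd

df_small = pd.DataFrame({
    'age': [17, 19, 21, 37, 18, 19, 47, 18, 19],
    'score': [12, 10, 11, 15, 16, 14, 25, 21, 29],
})

# Square brackets with a column name return that column as a pandas Series
scores = df_small['score']
print(scores)

# If you would rather have an ordinary Python list, convert it
print(scores.tolist())
```

Just as in R, typing `score` on its own would not work here: the column lives inside the data frame, and you have to say so explicitly.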



### Getting information about a data frame

One problem that sometimes comes up in practice is that you forget what you called all your variables. Normally you might try to type `objects()` or `who()`, but neither of those commands will tell you what the names are for those variables inside a data frame! One way is to ask R to tell you what the *names* of all the variables stored in the data frame are, which you can do using the `names()` function:
```{r}
names(expt)
```
An alternative method is to use the `who()` function, as long as you tell it to look at the variables inside data frames. If you set `expand = TRUE` then it will not only list the variables in the workspace, but it will "expand" any data frames that you've got in the workspace, so that you can see what they look like. That is:
```{r}
who(expand = TRUE)
```
or, since `expand` is the first argument in the `who()` function you can just type `who(TRUE)`. I'll do that a lot in this book.
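For a `pandas` data frame, the same questions can be answered with a few attributes and methods. The sketch below uses a toy three-row data frame called `df_small` (invented for this example, not the experiment data) and shows three things I find handy: the column names, the number of rows and columns, and a compact overview.

```python
import pandas as pd

# A small made-up data frame, just to have something to inspect
df_small = pd.DataFrame({
    'age': [17, 19, 21],
    'score': [12, 10, 11],
    'group': ['test', 'test', 'control'],
})

# The column names, as a plain list
print(list(df_small.columns))

# Number of rows and number of columns
print(df_small.shape)

# A compact overview: column names, non-missing counts and data types
df_small.info()
```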


### Looking for more on data frames?

There's a lot more that can be said about data frames: they're fairly complicated beasts, and the longer you use R the more important it is to make sure you really understand them. We'll talk a lot more about them in Chapter \@ref(datahandling).







## Lists{#lists}

The next kind of data I want to mention are **_lists_**. Lists are an extremely fundamental data structure in R, and as you start making the transition from a novice to a savvy R user you will use lists all the time. I don't use lists very often in this book -- not directly -- but most of the advanced data structures in R are built from lists (e.g., data frames are actually a specific type of list). Because lists are so important to how R stores things, it's useful to have a basic understanding of them. Okay, so what is a list, exactly? Like data frames, lists are just "collections of variables." However, unlike data frames -- which are basically supposed to look like a nice "rectangular" table of data -- there are no constraints on what kinds of variables we include, and no requirement that the variables have any particular relationship to one another. In order to understand what this actually *means*, the best thing to do is create a list, which we can do using the `list()` function. If I type this as my command:
```{r}
Dan <- list( age = 34,
            nerd = TRUE,
            parents = c("Joe","Liz") 
)
```
R creates a new list variable called `Dan`, which is a bundle of three different variables: `age`, `nerd` and `parents`. Notice that the `parents` variable is longer than the others. This is perfectly acceptable for a list, but it wouldn't be for a data frame. If we now print out the variable, you can see the way that R stores the list:
```{r}
print( Dan )
```
As you might have guessed from those `$` symbols everywhere, the variables are stored in exactly the same way that they are for a data frame (again, this is not surprising: data frames *are* a type of list). So you will (I hope) be entirely unsurprised and probably quite bored when I tell you that you can extract the variables from the list using the `$` operator, like so:
```{r}
Dan$nerd
```
If you need to add new entries to the list, the easiest way to do so is to again use `$`, as the following example illustrates. If I type a command like this
```{r}
Dan$children <- "Alex"
```
then R adds a new entry called `children` at the end of the list, and assigns it a value of `"Alex"`. If I were now to `print()` this list out, you'd see a new entry at the bottom of the printout. Finally, it's actually possible for lists to contain other lists, so it's quite possible that I would end up using a command like `Dan$children$age` to find out how old my son is. Or I could try to remember it myself I suppose. 
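In Python, the closest built-in counterpart to an R list with named entries is probably the dictionary (the "curly brackets" variable type mentioned in the footnote earlier). The sketch below repeats the `Dan` example as a dictionary called `dan`. The nested dictionary at the end, including the age it contains, is made up purely for illustration.

```python
# A dictionary bundling three values of different types and lengths
dan = {
    'age': 34,
    'nerd': True,
    'parents': ['Joe', 'Liz'],
}

# Look up an entry by its key
print(dan['nerd'])

# Add a new entry by assigning to a key that doesn't exist yet
dan['children'] = 'Alex'
print(dan)

# Dictionaries can contain other dictionaries
# (the age below is an invented value)
dan['children'] = {'name': 'Alex', 'age': 3}
print(dan['children']['age'])
```

As with R lists, there is no requirement that the entries have the same length or the same type.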




## Formulas{#formulas}
 
The last kind of variable that I want to introduce before finally being able to start talking about statistics is the **_formula_**. Formulas were originally introduced into R as a convenient way to specify a particular type of statistical model (see Chapter \@ref(regression)) but they're such handy things that they've spread. Formulas are now used in a lot of different contexts, so it makes sense to introduce them early.

Stated simply, a formula object is a variable, but it's a special type of variable that specifies a relationship between other variables. A formula is specified using the "tilde operator" `~`. A very simple example of a formula is shown below:^[Note that, when I write out the formula, R doesn't check to see if the `out` and `pred` variables actually exist: it's only later on when you try to use the formula for something that this happens.]
```{r}
formula1 <- out ~ pred
formula1
```
The *precise* meaning of this formula depends on exactly what you want to do with it, but in broad terms it means "the `out` (outcome) variable, analysed in terms of the `pred` (predictor) variable". That said, although the simplest and most common form of a formula uses the "one variable on the left, one variable on the right" format, there are others. For instance, the following examples are all reasonably common:
```{r}
formula2 <-  out ~ pred1 + pred2   # more than one variable on the right
formula3 <-  out ~ pred1 * pred2   # different relationship between predictors 
formula4 <-  ~ var1 + var2         # a 'one-sided' formula
```
and there are many more variants besides. Formulas are pretty flexible things, and so different functions will make use of different formats, depending on what the function is intended to do.
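Python has no built-in formula type. As far as I'm aware, the usual convention (used, for example, by the formula interface in `statsmodels`) is to write the formula as an ordinary string and hand it to a modelling function together with a data frame. The sketch below uses no statistics package at all: it only shows what such strings look like, and uses a small helper, `split_formula()`, which I've made up to separate the two sides of the tilde.

```python
# In Python, a formula is typically written as an ordinary string
formula1 = "out ~ pred"
formula2 = "out ~ pred1 + pred2"   # more than one variable on the right
formula3 = "out ~ pred1 * pred2"   # different relationship between predictors
formula4 = "~ var1 + var2"         # a 'one-sided' formula

# Nothing is checked when the string is created: the variables named in it
# don't have to exist until some function actually uses the formula.

def split_formula(formula):
    """Hypothetical helper: return the left and right sides of a formula string."""
    left, _, right = formula.partition("~")
    return left.strip(), right.strip()

print(split_formula(formula2))
print(split_formula(formula4))
```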


## Generic functions{#generics}

There's one really important thing that I omitted when I discussed functions earlier on in Section \@ref(usingfunctions), and that's the concept of a **_generic function_**. The two most notable examples that you'll see in the next few chapters are `summary()` and `plot()`, although you've already seen an example of one working behind the scenes, and that's the `print()` function. The thing that makes generics different from the other functions is that their behaviour changes, often quite dramatically, depending on the `class()` of the input you give them. The easiest way to explain the concept is with an example. With that in mind, let's take a closer look at what the `print()` function actually does. I'll do this by creating a formula, and printing it out in a few different ways. First, let's stick with what we know:
```{r}
my.formula <- blah ~ blah.blah    # create a variable of class "formula"
print( my.formula )               # print it out using the generic print() function
```
So far, there's nothing very surprising here. But there's actually a lot going on behind the scenes here. When I type `print( my.formula )`, what actually happens is the `print()` function checks the class of the `my.formula` variable. When the function discovers that the variable it's been given is a formula, it goes looking for a function called `print.formula()`, and then delegates the whole business of printing out the variable to the `print.formula()` function.^[For readers with a programming background: what I'm describing is the very basics of how S3 methods work. However, you should be aware that R has two entirely distinct systems for doing object oriented programming, known as S3 and S4. Of the two, S3 is simpler and more informal, whereas S4 supports all the stuff that you might expect of a fully object oriented language. Most of the generics we'll run into in this book use the S3 system, which is convenient for me because I'm still trying to figure out S4. ] For what it's worth, the name for a "dedicated" function like `print.formula()` that exists only to be a special case of a generic function like `print()` is a **_method_**, and the name for the process in which the generic function passes off all the hard work onto a method is called **_method dispatch_**. You won't need to understand the details at all for this book, but you do need to know the gist of it; if only because a lot of the functions we'll use are actually generics. Anyway, to help expose a little more of the workings to you, let's bypass the `print()` function entirely and call the formula method directly:
```{r eval=FALSE}
print.formula( my.formula )       # print it out using the print.formula() method

## Appears to be deprecated
```
There's no difference in the output at all. But this shouldn't surprise you because it was actually the `print.formula()` method that was doing all the hard work in the first place. The `print()` function itself is a lazy bastard that doesn't do anything other than select which of the methods is going to do the actual printing. 

Okay, fair enough, but you might be wondering what would have happened if `print.formula()` didn't exist? That is, what happens if there isn't a specific method defined for the class of variable that you're using? In that case, the generic function passes off the hard work to a "default" method, whose name in this case would be `print.default()`. Let's see what happens if we bypass the `print()` function, and try to print out `my.formula` using the `print.default()` function:
```{r}
print.default( my.formula )      # print it out using the print.default() method
```
Hm. You can kind of see that it is trying to print out the same formula, but there's a bunch of ugly low-level details that have also turned up on screen. This is because the `print.default()` method doesn't know anything about formulas, and doesn't know that it's supposed to be hiding the obnoxious internal gibberish that R produces sometimes. 

At this stage, this is about as much as we need to know about generic functions and their methods. In fact, you can get through the entire book without learning any more about them than this, so it's probably a good idea to end this discussion here.
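If you'd like to see the same idea in Python, the standard library has a loosely similar mechanism in `functools.singledispatch`. It is not how Python's own `print()` works, so treat this as an analogy rather than a translation. In the sketch below, `describe()` is a function I've invented: the undecorated version plays the role of the default method, and each registered version plays the role of a method for one particular type.

```python
from functools import singledispatch

@singledispatch
def describe(x):
    # The "default" method: used when no more specific one is registered
    return f"something of type {type(x).__name__}"

@describe.register
def _(x: int):
    # The method for whole numbers
    return f"the whole number {x}"

@describe.register
def _(x: list):
    # The method for lists
    return f"a list with {len(x)} elements"

print(describe(3))           # dispatched to the int method
print(describe([1, 2, 3]))   # dispatched to the list method
print(describe(2.5))         # no float method, so the default is used
```

The caller only ever types `describe(...)`; picking which version does the work happens behind the scenes, which is the gist of method dispatch.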

## Getting help{#help}

The very last topic I want to mention in this chapter is where to go to find help. Obviously, I've tried to make this book as helpful as possible, but it's not even close to being a comprehensive guide, and there's thousands of things it doesn't cover. So where should you go for help? 


### How to read the help documentation

I have somewhat mixed feelings about the help documentation in R. On the plus side, there’s a lot of it, and it’s very thorough. On the minus side, there’s a lot of it, and it’s very thorough. There’s so much help documentation that it sometimes doesn’t help, and most of it is written with an advanced user in mind. Often it feels like most of the help files work on the assumption that the reader already understands everything about R except for the specific topic that it’s providing help for. What that means is that, once you’ve been using R for a long time and are beginning to get a feel for how to use it, the help documentation is awesome. These days, I find myself really liking the help files (most of them anyway). But when I first started using R I found it very dense.

To some extent, there’s not much I can do to help you with this. You just have to work at it yourself; once you’re moving away from being a pure beginner and are becoming a skilled user, you’ll start finding the help documentation more and more helpful. In the meantime, I’ll help as much as I can by trying to explain to you what you’re looking at when you open a help file. To that end, let’s look at the help documentation for the `load()` function. To do so, I type either of the following:

```{r eval=FALSE}
?load 
help("load")
```

When I do that, R goes looking for the help file for the "load" topic. If it finds one, RStudio takes it and displays it in the help panel. Alternatively, you can try a fuzzy search for a help topic:

```{r eval=FALSE}
??load 
help.search("load")

```

This will bring up a list of possible topics that you might want to follow up on. Regardless, at some point you’ll find yourself looking at an actual help file. And when you do, you’ll see there’s quite a lot of stuff written down there, and it comes in a pretty standardised format. So let’s go through it slowly, using the "`load`" topic as our example. Firstly, at the very top we see this:
```{block2, type='rmdnote'}
<table width="100%" summary="page for load {base}"><tr><td>load {base}</td><td style="text-align: right;">R Documentation</td></tr></table>

<h4>Reload Saved Datasets</h4>

<h5>Description</h5>

<p>Reload datasets written with the function <code>save</code>.
</p>
```


Fairly straightforward. The next section describes how the function is used:
```{block2, type='rmdnote'}
<h5>Usage</h5>

<pre>
load(file, envir = parent.frame(), verbose = FALSE)
</pre>
```
In this instance, the usage section is actually pretty readable. It’s telling you that there are three arguments to the `load()` function: the first one is called `file`, the second one is called `envir`, and the third one is called `verbose`. It’s also telling you that there are default values for the `envir` and `verbose` arguments; so if the user doesn’t specify what the value of `envir` should be, then R will assume that `envir = parent.frame()`, and likewise it will assume that `verbose = FALSE`. In contrast, the `file` argument has no default value at all, so the user must specify a value for it. So in one sense, this section is very straightforward.

The problem, of course, is that you don’t know what the `parent.frame()` function actually does, so it’s hard for you to know what the `envir = parent.frame()` bit is all about. What you could do is then go look up the help documents for the `parent.frame()` function (and sometimes that’s actually a good idea), but often you’ll find that the help documents for those functions are just as dense as (if not more dense than) the help file that you’re currently reading. As an alternative, my general approach when faced with something like this is to skim over it, see if I can make any sense of it. If so, great. If not, I find that the best thing to do is ignore it. In fact, the first time I read the help file for the `load()` function, I had no idea what any of the `envir` related stuff was about. But fortunately I didn’t have to: the default setting here (i.e., `envir = parent.frame()`) is actually the thing you want in about 99% of cases, so it’s safe to ignore it. 

Basically, what I’m trying to say is: don’t let the scary, incomprehensible parts of the help file intimidate you. Especially because there’s often some parts of the help file that will make sense. Of course, I guarantee you that sometimes this strategy will lead you to make mistakes... often embarrassing mistakes. But it’s still better than getting paralysed with fear. 
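Python functions carry the same kind of information as the Usage section: which arguments exist, and which of them have default values. One way to inspect it is the `inspect.signature()` function from the standard library. In the sketch below, `load_data()` is a hypothetical function I've written only so that its signature resembles the Usage line above; it doesn't load anything.

```python
import inspect

# A hypothetical function whose arguments mimic the Usage line above
def load_data(file, envir=None, verbose=False):
    return file

sig = inspect.signature(load_data)
print(sig)

# Go through the arguments and report which ones have defaults
for name, param in sig.parameters.items():
    if param.default is inspect.Parameter.empty:
        print(name, "-> no default, must be supplied")
    else:
        print(name, "-> default is", repr(param.default))
```

Reading a signature this way tells you the same thing as before: `file` has to be supplied by the user, while the other two arguments can safely be left alone until you know you need them.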

So, let’s continue on. The next part of the help documentation discusses each of the arguments, and what they’re supposed to do:
```{block2, type='rmdnote'}
<h5>Arguments</h5>

<table summary="R argblock">
<tr valign="top"><td><code>file</code></td>
<td>
<p>a (readable binary-mode) <a href="../../base/help/connection">connection</a> or a character string
giving the name of the file to load (when <a href="../../base/help/tilde expansion">tilde expansion</a>
is done).</p>
</td></tr>
<tr valign="top"><td><code>envir</code></td>
<td>
<p>the environment where the data should be loaded.</p>
</td></tr>
<tr valign="top"><td><code>verbose</code></td>
<td>
<p>should item names be printed during loading?</p>
</td></tr>
</table>
```

Okay, so what this is telling us is that the `file` argument needs to be a string (i.e., text data) which tells R the name of the file to load. It also seems to be hinting that there’s other possibilities too (e.g., a “binary mode connection”), and you probably aren’t quite sure what “tilde expansion” means^[It’s extremely simple, by the way. We discussed it in Section 4.4, though I didn’t call it by that name. Tilde expansion is the thing where R recognises that, in the context of specifying a file location, the tilde symbol `~` corresponds to the user home directory (e.g., `/Users/dan/`).]. But overall, the meaning is pretty clear.

Turning to the `envir` argument, it’s now a little clearer what the Usage section was babbling about. The `envir` argument specifies the name of an environment (see Section 4.3 if you’ve forgotten what environments are) into which R should place the variables when it loads the file. Almost always, this is a no-brainer: you want R to load the data into the same damn environment in which you’re invoking the `load()` command. That is, if you’re typing `load()` at the R prompt, then you want the data to be loaded into your workspace (i.e., the global environment). But if you’re writing your own function that needs to load some data, you want the data to be loaded inside that function’s private workspace. And in fact, that’s exactly what the `parent.frame()` thing is all about. It’s telling the `load()` function to send the data to the same place that the `load()` command itself was coming from. As it turns out, if we’d just ignored the `envir` bit we would have been totally safe. Which is nice to know. 

Moving on, next up we get a detailed description of what the function actually does:
```{block2, type='rmdnote'}
<h5>Details</h5>

<p><code>load</code> can load <span style="font-family: Courier New, Courier; color: #666666;"><b>R</b></span> objects saved in the current or any earlier
format.  It can read a compressed file (see <code><a href="../../base/help/save">save</a></code>)
directly from a file or from a suitable connection (including a call
to <code><a href="../../base/help/url">url</a></code>).
</p>
<p>A not-open connection will be opened in mode <code>"rb"</code> and closed
after use.  Any connection other than a <code><a href="../../base/help/gzfile">gzfile</a></code> or
<code><a href="../../base/help/gzcon">gzcon</a></code> connection will be wrapped in <code><a href="../../base/help/gzcon">gzcon</a></code>
to allow compressed saves to be handled: note that this leaves the
connection in an altered state (in particular, binary-only), and that
it needs to be closed explicitly (it will not be garbage-collected).
</p>
<p>Only <span style="font-family: Courier New, Courier; color: #666666;"><b>R</b></span> objects saved in the current format (used since <span style="font-family: Courier New, Courier; color: #666666;"><b>R</b></span> 1.4.0)
can be read from a connection.  If no input is available on a
connection a warning will be given, but any input not in the current
format will result in a error.
</p>
<p>Loading from an earlier version will give a warning about the
&lsquo;magic number&rsquo;: magic numbers <code>1971:1977</code> are from <span style="font-family: Courier New, Courier; color: #666666;"><b>R</b></span> &lt;
0.99.0, and <code>RD[ABX]1</code> from <span style="font-family: Courier New, Courier; color: #666666;"><b>R</b></span> 0.99.0 to <span style="font-family: Courier New, Courier; color: #666666;"><b>R</b></span> 1.3.1.  These are all
obsolete, and you are strongly recommended to re-save such files in a
current format.
</p>
<p>The <code>verbose</code> argument is mainly intended for debugging.  If it
is <code>TRUE</code>, then as objects from the file are loaded, their
names will be printed to the console.  If <code>verbose</code> is set to
an integer value greater than one, additional names corresponding to
attributes and other parts of individual objects will also be printed.
Larger values will print names to a greater depth.
</p>
<p>Objects can be saved with references to namespaces, usually as part of
the environment of a function or formula.  Such objects can be loaded
even if the namespace is not available: it is replaced by a reference
to the global environment with a warning.  The warning identifies the
first object with such a reference (but there may be more than one).
</p>
```

Then it tells you what the output value of the function is:

```{block2, type='rmdnote'}
<h5>Value</h5>

<p>A character vector of the names of objects created, invisibly.
</p>
```

This is usually a bit more interesting, but since the `load()` function is mainly used to load variables into the workspace rather than to return a value, it’s no surprise that this doesn’t do much or say much. Moving on, we sometimes see a few additional sections in the help file, which can be different depending on what the function is:

```{block2, type='rmdnote'}
<h5>Warning</h5>

<p>Saved <span style="font-family: Courier New, Courier; color: #666666;"><b>R</b></span> objects are binary files, even those saved with
<code>ascii = TRUE</code>, so ensure that they are transferred without
conversion of end of line markers.  <code>load</code> tries to detect such a
conversion and gives an informative error message.
</p>
<p><code>load(&lt;file&gt;)</code> replaces all existing objects with the same names
in the current environment (typically your workspace,
<code><a href="../../base/help/.GlobalEnv">.GlobalEnv</a></code>) and hence potentially overwrites important data.
It is considerably safer to use <code>envir = </code> to load into a
different environment, or to <code><a href="../../base/help/attach">attach</a>(file)</code> which
<code>load()</code>s into a new entry in the <code><a href="../../base/help/search">search</a></code> path.
</p>

<h5>Note</h5>

<p><code>file</code> can be a UTF-8-encoded filepath that cannot be translated to
the current locale.
</p>
```

Yeah, yeah. Warning, warning, blah blah blah. Towards the bottom of the help file, we see something like this, which suggests a bunch of related topics that you might want to look at. These can be quite helpful:

```{block2, type='rmdnote'}
<h5>See Also</h5>

<p><code><a href="../../base/help/save">save</a></code>, <code><a href="../../base/help/download.file">download.file</a></code>; further
<code><a href="../../base/help/attach">attach</a></code> as wrapper for <code>load()</code>.
</p>
<p>For other interfaces to the underlying serialization format, see
<code><a href="../../base/help/unserialize">unserialize</a></code> and <code><a href="../../base/help/readRDS">readRDS</a></code>.
</p>
```

Finally, it gives you some examples of how to use the function(s) that the help file describes. These are supposed to be proper R commands, meaning that you should be able to type them into the console yourself and they’ll actually work. Sometimes it can be quite helpful to try the examples yourself. Anyway, here they are for the "`load`" help file:
```{block2, type='rmdnote'}
<h5>Examples</h5>

<pre>


## save all data
xx &lt;- pi # to ensure there is some data
save(list = ls(all = TRUE), file= "all.rda")
rm(xx)

## restore the saved values to the current environment
local({
   load("all.rda")
   ls()
})

xx &lt;- exp(1:3)
## restore the saved values to the user's workspace
load("all.rda") ## which is here *equivalent* to
## load("all.rda", .GlobalEnv)
## This however annihilates all objects in .GlobalEnv with the same names !
xx # no longer exp(1:3)
rm(xx)
attach("all.rda") # safer and will warn about masked objects w/ same name in .GlobalEnv
ls(pos = 2)
##  also typically need to cleanup the search path:
detach("file:all.rda")

## clean up (the example):
unlink("all.rda")


## Not run: 
con &lt;- url("http://some.where.net/R/data/example.rda")
## print the value to see what objects were created.
print(load(con))
close(con) # url() always opens the connection

## End(Not run)</pre>
```
As you can see, they’re pretty dense, and not at all obvious to the novice user. However, they do provide good examples of the various different things that you can do with the `load()` function, so it’s not a bad idea to have a look at them, and to try not to find them too intimidating.
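Python has its own built-in route to the help documentation, which works in much the same spirit. The built-in `help()` function prints the documentation for whatever you hand it, and most functions also carry their documentation around as a string, the "docstring", in an attribute called `__doc__`. A minimal sketch, using the built-in `len()` function simply because every Python installation has it:

```python
# Print the help page for the built-in len() function
help(len)

# The same documentation is available as an ordinary string
doc = len.__doc__
print(doc)
```

The advice from above carries over: skim the page, take what makes sense, and don't be intimidated by the parts that don't.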

### Other resources

- The Rseek website (www.rseek.org). One thing that I really find annoying about the R help documentation is that it's hard to search properly. When coupled with the fact that the documentation is dense and highly technical, it's often a better idea to search or ask online for answers to your questions. With that in mind, the Rseek website is great: it's an R specific search engine. I find it really useful, and it's almost always my first port of call when I'm looking around.
- The R-help mailing list (see http://www.r-project.org/mail.html for details). This is the official R help mailing list. It can be very helpful, but it's *very* important that you do your homework before posting a question. The list gets a lot of traffic. While the people on the list try as hard as they can to answer questions, they do so for free, and you *really* don't want to know how much money they could charge on an hourly rate if they wanted to apply market rates. In short, they are doing you a favour, so be polite. Don't waste their time asking questions that can be easily answered by a quick search on Rseek (it's rude), and make sure that your question is clear and that all of the relevant information is included. In short, read the posting guidelines carefully (http://www.r-project.org/posting-guide.html), and make use of the `help.request()` function that R provides to check that you're actually doing what's expected of you.


## Summary

This chapter continued where Chapter \@ref(introR) left off. The focus was still primarily on introducing basic R concepts, but this time at least you can see how those concepts are related to data analysis:

- [Installing, loading and updating packages](#packageinstall). Knowing how to extend the functionality of R by installing and using packages is critical to becoming an effective R user
- Getting around. Section \@ref(workspace) talked about how to manage your workspace and how to keep it tidy. Similarly, Section \@ref(navigation) talked about how to get R to interact with the rest of the file system.
- [Loading and saving data](#load). Finally, we encountered actual data files. Loading and saving data is obviously a crucial skill, one we discussed in Section \@ref(load).
- [Useful things to know about variables](#useful). In particular, we talked about special values, element names and classes.
- More complex types of variables. R has a number of important variable types that will be useful when analysing real data. I talked about factors in Section \@ref(factors), data frames in Section \@ref(dataframes), lists in Section \@ref(lists) and formulas in Section \@ref(formulas).
- [Generic functions](#generics). How is it that some functions seem to be able to do lots of different things? Section \@ref(generics) tells you how.
- [Getting help](#help). Assuming that you're not looking for counselling, Section \@ref(help) covers several possibilities. If you are looking for counselling, well, this book really can't help you there. Sorry. 

Taken together, Chapters \@ref(introR) and \@ref(mechanics) provide enough of a background that you can finally get started doing some statistics! Yes, there's a lot more R concepts that you ought to know (and we'll talk about some of them in Chapters \@ref(datahandling) and \@ref(scripting)), but I think that we've talked quite enough about programming for the moment. It's time to see how your experience with programming can be used to do some data analysis...