# Common tasks in python pandas
This cookbook is meant to help you get started with the things you need to know for class. If you think of something useful to add, let me know and I'll consider adding it. I will update this as we go along if I have forgotten anything.

**I have some more wordy descriptions at the bottom, plus links to more resources.**

### This is just the very beginning
You will learn more 'common tasks' as the course progresses.

I will jump right in and add more stuff as needed/requested.

This quick startup is intended to be even more elementary than other guides you might see, focusing just on what we need at this point in the course, and not very comprehensive.

### Run this first
Makes pandas available to you.

In [1]:
import pandas as pd

### Lists
A sequence of data of any kind. A single list can hold multiple types.

Here I create a list of length three, save it as an object called `me`, and check its length with the *function* called `len`.

Comments are text in your code that python does not attempt to run. They start with a hash `#`.

In [2]:
# no output just saving
me = ['Brendan', 'Robert', 'Brown']

In [3]:
# show the output of function len with argument me
len(me)

3

In [4]:
# separate multiple outputs with a comma
len(me), me

(3, ['Brendan', 'Robert', 'Brown'])

### Data types
Words are not numbers are not True/False (boolean) values. These are different data types. We will primarily use

- strings, written str in python
- numbers with decimals, called float
- integers, called int
- True/False values, called bool

Check the type of something with the *function* called `type`

In [5]:
type(1), type(1.0), type("one"), type(True)

(int, float, str, bool)

### Object types
A list is a type of python object. We will see more. 

The `type` function works here too.

In [6]:
# I am the list type, as you can see
type(me)

list

### Indexing lists
To select a single list item from `me`, or range of list items, use square brackets. This is called indexing.

Indexing starts at 0 and goes up to the length minus one for a list.

In [7]:
# one thing at a time
me[0], me[2]

('Brendan', 'Brown')

In [8]:
# multiple with :
# right endpoint index not included!
me[0:2]

['Brendan', 'Robert']

In [9]:
# open ended grabs all remaining index values
me[1:]

['Robert', 'Brown']

In [10]:
# right endpoint index not included!
me[:2]

['Brendan', 'Robert']
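
To make the "length minus one" rule concrete, here is a small sketch. The list `names` is just a stand-in I made up with the same three items as `me`; asking for an index past the end raises an `IndexError`.

```python
# a stand-in list with the same three items as `me`
names = ['Brendan', 'Robert', 'Brown']

# the last valid index is the length minus one
last_index = len(names) - 1
last_item = names[last_index]

# asking for an index past the end raises an IndexError
try:
    names[len(names)]
    went_past_end = False
except IndexError:
    went_past_end = True

last_index, last_item, went_past_end
```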

### Value assignment
Replace or assign a single value by referencing its index and setting it with `=`.

Lists can have multiple types of data! This makes them relatively special.

In [11]:
# no output, just doing assignment
me[0] = 1010
me[2] = False

Assign to an index of an object that doesn't yet exist and you will get an error.

In [12]:
# we have not created anything called m so this will produce an error
# remove the comment to see for yourself

# m[0] = "oops"
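
If you would rather see the error without stopping your notebook, one option is to catch it. This is only a sketch: `not_created_yet` is a hypothetical name that has never been assigned, and `try`/`except` is something we have not covered, so treat it as a preview.

```python
# `not_created_yet` is a hypothetical name that was never assigned
try:
    not_created_yet[0] = "oops"
    error_name = None
except NameError as err:
    # python reports that the name is not defined
    error_name = type(err).__name__

error_name
```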

### List comprehensions
Run a function on each item in the list, then return a list of the results.

This is called a list comprehension and is very useful.

Here, check the type of each element in the list and return the results as another list.

`x` is a placeholder for a generic element of the list `me`. `for x in me` says to set `x` to each item of the list `me` in turn.

In [13]:
out = [type(x) for x in me]

type(out), out, me

(list, [int, str, bool], [1010, 'Robert', False])

In [14]:
# x is a placeholder. I could have named it any other name (within some naming rules you need not know now)

[type(anything) for anything in me]

[int, str, bool]
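
The same pattern works with any function or method, not just `type`. A quick sketch, using a stand-in list `names` that still holds the original three strings:

```python
names = ['Brendan', 'Robert', 'Brown']

# the length of each name
lengths = [len(x) for x in names]

# an upper-case copy of each name, using the string method .upper()
shouting = [x.upper() for x in names]

lengths, shouting
```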

### Pandas Series
A series is an object type from pandas. 

Create one from a list using the *pandas function pd.Series*

In [15]:
me_name = pd.Series(me)

In [16]:
type(me_name)

pandas.core.series.Series

In [17]:
me_name

0      1010
1    Robert
2     False
dtype: object

This looks the same. **Why bother?**

A series has many specialized tools in pandas that do not work on lists, beyond the scope of this little intro. See the more comprehensive beginner guides linked at the bottom.

I will mention a few things though.
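
As one small illustration of a tool that a series has and a list does not, arithmetic on a series applies to every element at once. The numbers and names below are made up for the sketch.

```python
import pandas as pd

heights_list = [185, 102, 71]
heights = pd.Series(heights_list)

# arithmetic on a series applies to every element
heights_m = heights / 100

# the same operator on a plain list raises a TypeError
try:
    heights_list / 100
    list_division_works = True
except TypeError:
    list_division_works = False

heights_m.tolist(), list_division_works
```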

### Indexing series
Indexes play an important role in pandas.

A series can be indexed the same way as a list. But it also can be indexed with more useful labels.

Access the index of a series with `.index` after the series object name.

In [18]:
# access the first element, same as for a list
me_name[0]

1010

In [19]:
# or create the series with a more informative index
# here I wrote over the old object called me_name and replaced it with the new one
me_name = pd.Series(['Brendan', 'Robert', 'Brown'], index = ['first', 'middle', 'last'])
me_name

first     Brendan
middle     Robert
last        Brown
dtype: object

In [20]:
# view the index.  it's its own pandas object type, called Index
me_name.index, type(me_name.index)

(Index(['first', 'middle', 'last'], dtype='object'),
 pandas.core.indexes.base.Index)

### Accessing series elements by index
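Now you can access elements in the series by the new index names. You can still access them by position too, but use `.iloc[]` for that: recent versions of pandas warn about (or no longer allow) a plain number in square brackets once the index holds labels.

In [21]:
me_name['first'], me_name.iloc[0]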

('Brendan', 'Brendan')

### Dictionaries
A dictionary is yet another object type that initially will look similar to a series. It holds items called `values`, each of which is looked up by a name called a `key`.

Create one using curly braces `{}`. Here is a dictionary holding the same information as the me_name series.

Dictionary elements can even be lists.

In [22]:
me_again = {'first': 'Brendan', 'middle': 'Robert', 'last': 'Brown'}

type(me_again), me_again

(dict, {'first': 'Brendan', 'middle': 'Robert', 'last': 'Brown'})

In [23]:
friends = {'first': ['Brendan', 'Winnie'], 'middle': ['Robert', 'The'], 'last': ['Brown', 'Pooh']}

friends

{'first': ['Brendan', 'Winnie'],
 'middle': ['Robert', 'The'],
 'last': ['Brown', 'Pooh']}
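
A short sketch of the things you will most often do with a dictionary: look up a value by its key, list the keys and values, and add a new pair. The dictionary `person` and the key `'species'` are my own examples.

```python
person = {'first': 'Winnie', 'middle': 'The', 'last': 'Pooh'}

# look up one value by its key
first_name = person['first']

# list all keys and all values
keys = list(person.keys())
values = list(person.values())

# add a new key/value pair by assignment
person['species'] = 'bear'

first_name, keys, values, person
```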

### Create a series from a dictionary
As you expect, the keys become the index when you convert a dictionary to a series.

In [24]:
me_again = pd.Series({'first': 'Brendan', 'middle': 'Robert', 'last': 'Brown'})

me_again.index, me_again.values

(Index(['first', 'middle', 'last'], dtype='object'),
 array(['Brendan', 'Robert', 'Brown'], dtype=object))

### Pandas data frame

A data frame is the pandas version of a spreadsheet.

In this course, you should think of it this way: columns hold variables, rows hold observations.

Create a data frame from a dictionary. **Keys now become column names!**

In [25]:
# just creating the object, no output
d = pd.DataFrame(friends)

In [26]:
type(d)

pandas.core.frame.DataFrame

In [27]:
d

| | first | middle | last |
|---|---|---|---|
| 0 | Brendan | Robert | Brown |
| 1 | Winnie | The | Pooh |


Or create a data frame from a list of series. **Series indexes now become column names!**

In [28]:
pooh_name = pd.Series(['Winnie', 'The', 'Pooh'], index = ['first', 'middle', 'last'])
pd.DataFrame([me_name, pooh_name])

| | first | middle | last |
|---|---|---|---|
| 0 | Brendan | Robert | Brown |
| 1 | Winnie | The | Pooh |


### Data frame columns and indexes
With two dimensions in a data frame we need two kinds of indices. 

View column names with `.columns`. It's considered an 'index' of column information.

In [29]:
d.columns

Index(['first', 'middle', 'last'], dtype='object')

View *row* indexes with `.index`

In [30]:
d.index

RangeIndex(start=0, stop=2, step=1)

Data frames can also be created with more useful row index names.

**Important: Indexes should uniquely identify your observations.**

We can't have two bears walking around now can we? How will we know which is which?

In [31]:
d = pd.DataFrame(friends, index = ['human', 'bear'])
d

| | first | middle | last |
|---|---|---|---|
| human | Brendan | Robert | Brown |
| bear | Winnie | The | Pooh |
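
To see why unique indexes matter, here is a sketch with two bears. The data frames `good` and `bad` are made up for the example; `.is_unique` is an attribute of a pandas index.

```python
import pandas as pd

friends = {'first': ['Brendan', 'Winnie'], 'middle': ['Robert', 'The'], 'last': ['Brown', 'Pooh']}

good = pd.DataFrame(friends, index=['human', 'bear'])
bad = pd.DataFrame(friends, index=['bear', 'bear'])

# .is_unique is True only when no index label repeats
good_ok = good.index.is_unique
bad_ok = bad.index.is_unique

# with a repeated label, .loc returns every matching row
two_bears = bad.loc['bear']

good_ok, bad_ok, two_bears.shape
```

With the repeated label, asking for `'bear'` gives back both rows, so you can no longer pick out one observation by name.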


### Accessing rows of a data frame with .loc and .iloc
There are several ways to access rows/columns of a data frame. I will show only `.loc` and `.iloc` because they are the most useful across a variety of cases.

First I need more friends.

In [32]:
eeyore = pd.Series(['Eeyore', 'the', 'Donkey'], index = ['first', 'middle', 'last'])

d = pd.DataFrame([me_name, pooh_name, eeyore], index = ['human', 'bear', 'horse'])

d

| | first | middle | last |
|---|---|---|---|
| human | Brendan | Robert | Brown |
| bear | Winnie | The | Pooh |
| horse | Eeyore | the | Donkey |


Every row of a data frame is a series object.

Access the rows of a data frame using row index *names* with `.loc[]`.

In [33]:
# first row
type(d.loc['human']), d.loc['human']

(pandas.core.series.Series,
 first     Brendan
 middle     Robert
 last        Brown
 Name: human, dtype: object)

Access the rows of a data frame by *position (numbers starting from 0)* with `.iloc[]`.

In [34]:
# first row
type(d.iloc[0]), d.iloc[0]

(pandas.core.series.Series,
 first     Brendan
 middle     Robert
 last        Brown
 Name: human, dtype: object)

Access multiple rows using `:` as before. This works in both `.loc` and `.iloc`.

**Notice I get back a data frame here, not a series, since I selected multiple rows.**

In [35]:
d.loc['bear':]

| | first | middle | last |
|---|---|---|---|
| bear | Winnie | The | Pooh |
| horse | Eeyore | the | Donkey |


In [36]:
type(d.loc['bear':])

pandas.core.frame.DataFrame
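
One difference worth knowing when you slice with `:` is that `.loc` includes the right endpoint label, while `.iloc` leaves out the right endpoint position, just like a list. A sketch with a cut-down copy of the data frame (I left out the middle names to keep it short):

```python
import pandas as pd

d = pd.DataFrame(
    {'first': ['Brendan', 'Winnie', 'Eeyore'], 'last': ['Brown', 'Pooh', 'Donkey']},
    index=['human', 'bear', 'horse'],
)

# label slice: the right endpoint 'bear' IS included
by_label = d.loc['human':'bear']

# position slice: position 1 is NOT included, just like a list
by_position = d.iloc[0:1]

list(by_label.index), list(by_position.index)
```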

Or select several specific ones by using a list of indexes in `.loc[]`

In [37]:
d.loc[['human', 'horse']]

| | first | middle | last |
|---|---|---|---|
| human | Brendan | Robert | Brown |
| horse | Eeyore | the | Donkey |


### Accessing columns of a data frame
This works similarly, but now we need to distinguish rows from columns in `.loc` and `.iloc`

Do that by placing a comma in `[row_stuff, column_stuff]`

Every column of a data frame is a series object too.

Access the columns of a data frame using column *names* with `.loc[]`.

In [38]:
# first column, all rows with :
type(d.loc[:, 'first']), d.loc[:, 'first']

(pandas.core.series.Series,
 human    Brendan
 bear      Winnie
 horse     Eeyore
 Name: first, dtype: object)

Or use `.iloc` where columns are now numbered starting from zero, left to right.

In [39]:
# first column, all rows with :
type(d.iloc[:, 0]), d.iloc[:, 0]

(pandas.core.series.Series,
 human    Brendan
 bear      Winnie
 horse     Eeyore
 Name: first, dtype: object)

Again you can select multiple columns with either method.

In [40]:
# returns a data frame!
d.loc[:, :'middle']

| | first | middle |
|---|---|---|
| human | Brendan | Robert |
| bear | Winnie | The |
| horse | Eeyore | the |


In [41]:
type(d.loc[:, :'middle'])

pandas.core.frame.DataFrame

Or select several specific ones by using a list of columns in `.loc[]`

In [42]:
d.loc[:, ['first', 'last']]

| | first | last |
|---|---|---|
| human | Brendan | Brown |
| bear | Winnie | Pooh |
| horse | Eeyore | Donkey |


### Selecting rows and columns

This should be easy now. Check and see if they show what you expect. Some examples

In [43]:
d

| | first | middle | last |
|---|---|---|---|
| human | Brendan | Robert | Brown |
| bear | Winnie | The | Pooh |
| horse | Eeyore | the | Donkey |


In [44]:
d.loc['bear':, :'middle']

| | first | middle |
|---|---|---|
| bear | Winnie | The |
| horse | Eeyore | the |


In [45]:
# a single object of type str
type(d.loc['bear', 'middle']), d.loc['bear', 'middle']

(str, 'The')

In [46]:
d.iloc[1:, 0]

bear     Winnie
horse    Eeyore
Name: first, dtype: object

In [47]:
d.iloc[[0, 1], 0]

human    Brendan
bear      Winnie
Name: first, dtype: object

In [48]:
d.iloc[[0, 2], [0, 2]]

| | first | last |
|---|---|---|
| human | Brendan | Brown |
| horse | Eeyore | Donkey |


In [49]:
d.loc[['human', 'horse'], ['first', 'last']]

| | first | last |
|---|---|---|
| human | Brendan | Brown |
| horse | Eeyore | Donkey |


At some point you might decide that you want to use your index information as a variable in its own right. Indices and columns, you'll see, have some different methods available to them.

To place your index information into a new column, use the `reset_index` method:

In [52]:
d.reset_index()

| | index | first | middle | last |
|---|---|---|---|---|
| 0 | human | Brendan | Robert | Brown |
| 1 | bear | Winnie | The | Pooh |
| 2 | horse | Eeyore | the | Donkey |
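
Note that `reset_index` gives back a new data frame and leaves the original alone, so save the result if you want to keep it. A sketch on a smaller made-up data frame; the column name `'kind'` is my own choice, and `rename` and `set_index` are other data frame methods you may find handy here.

```python
import pandas as pd

d = pd.DataFrame(
    {'first': ['Brendan', 'Winnie'], 'last': ['Brown', 'Pooh']},
    index=['human', 'bear'],
)

# reset_index returns a new data frame; d itself is unchanged
flat = d.reset_index()

# give the new column a more meaningful name ('kind' is just my choice)
flat = flat.rename(columns={'index': 'kind'})

# set_index goes the other way: a column becomes the row index again
back = flat.set_index('kind')

list(flat.columns), list(d.index), list(back.index)
```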


## Summarizing numerical information

A few basic methods for getting summary information from a series. We will see more in class, or browse the list of methods for series in the pandas [documentation for series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html)

In [64]:
height = pd.Series([185, 102, 71], index = ['brendan', 'winnie', 'eeyore'])
weight = pd.Series([77, 1.4, 2.3], index = ['brendan', 'winnie', 'eeyore'])

In [65]:
# average, min, max heights in the series
height.mean(), height.min(), height.max()

(119.33333333333333, 71, 185)

And some methods for data frames. The `describe` *method* gives an overview. Since each column is a pandas series, you can also run the methods above on individual columns.

In [66]:
d = pd.DataFrame({'height' : height, 'weight' : weight})

In [67]:
d

| | height | weight |
|---|---|---|
| brendan | 185 | 77.0 |
| winnie | 102 | 1.4 |
| eeyore | 71 | 2.3 |


In [68]:
d.describe()

| | height | weight |
|---|---|---|
| count | 3.0 | 3.0 |
| mean | 119.333333 | 26.9 |
| std | 58.943476 | 43.390206 |
| min | 71.0 | 1.4 |
| 25% | 86.5 | 1.85 |
| 50% | 102.0 | 2.3 |
| 75% | 143.5 | 39.65 |
| max | 185.0 | 77.0 |


In [69]:
d.loc[:, 'height'].mean()

119.33333333333333
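
A few more summaries in the same spirit, as a sketch. The data frame is rebuilt here so the cell runs on its own; `median` and `idxmax` are series methods, and calling `mean` on the whole data frame gives one mean per column.

```python
import pandas as pd

d = pd.DataFrame(
    {'height': [185, 102, 71], 'weight': [77, 1.4, 2.3]},
    index=['brendan', 'winnie', 'eeyore'],
)

# column means for the whole data frame at once (returns a series)
col_means = d.mean()

# median of a single column
median_weight = d.loc[:, 'weight'].median()

# index label of the row with the largest height
tallest = d.loc[:, 'height'].idxmax()

col_means, median_weight, tallest
```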

# Further reading

### More detailed beginner guides

If this cookbook is not enough, or you don't like this style, some good introductions to the two packages we will use throughout the course are here:

- [Introduction to Pandas](https://developers.arcgis.com/python/guide/part3-introduction-to-pandas/)
- [Introduction to Numpy](https://developers.arcgis.com/python/guide/part2-introduction-to-numpy/)

- [10 minutes to Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html)
- [Pandas cookbook (from the docs site)](https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html)
- [Numpy for absolute beginners](https://numpy.org/doc/stable/user/absolute_beginners.html)

The first two resources are from the ArcGIS website. ArcGIS is software that we will not use but that has extensive interaction with python. These guides are nonetheless very good and thorough.

The remaining three are from the **documentation pages for pandas and numpy**, which have many details on all possible functionality for these packages.


### If you have never coded before

Don't worry. The first hour or two might be rather frustrating, but once you start getting used to it things should go more smoothly. 

- expect errors and crazy-looking error messages
- walk through the basic stuff above
- follow the links to more detailed tutorials
- play!
- show up to the data homework demos on Thursdays
- ask me for help


### Very very high level intro
We will be writing lines of code, which give instructions to your computer to do something, in the python language. By 'running' code I will mean the instructions we have written are actually executed (or attempted to be executed).

The building blocks of code used in this class will be

- functions and methods: takes input(s), does something and returns output(s)
- data objects: lists, numpy arrays, pandas series, data frames
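
A small sketch of how to tell a function from a method by the way it is written. The series `heights` is made up for the example.

```python
import pandas as pd

heights = pd.Series([185, 102, 71])

# a function: the object goes inside the parentheses
n = len(heights)

# a method: written after the object, with a dot
tallest = heights.max()

n, tallest
```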

You will see examples of these throughout this cookbook, and I will try to call out what is what to orient you.

### Jupyter notebooks

Each little cell holds code (or text, like this one). They can be run in any order but keep your program in top-to-bottom order!

For example, if you reference an object called d then be sure your code to create that object is in a cell above where you use it.

Some initial tips:

- strive to write code that produces at most one output (meaning a printed table or plot) for each cell. A cell need not produce any output.

- run the cell with Shift+Enter.

- right-click the cell to see more options.

- export your final product running all cells by selecting File > Export Notebook as > Export Notebook to HTML

Get more help, e.g. more keyboard shortcuts, through the Jupyter Lab help menu at Help > JupyterLab Reference > Notebooks

### Step one: Each notebook in this class should have the following cell before all other code 
Packages/libraries in python contain functions and other tools you might want to use, in addition to those python loads automatically.

To make these available, you need to import them. And usually you want to import them with a convenient abbreviation for ease of use, such as pd to stand for pandas.

In [50]:
import pandas as pd

Often you also will want to import numpy, the other package we will use regularly in the course. Do that using the standard abbreviation for this package, np:

In [51]:
import numpy as np

These two cells of code, when run, make available all of the functionality in pandas and numpy.
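
Once imported, you reach each package's tools through its abbreviation. A minimal sketch, with made-up numbers and labels:

```python
import numpy as np
import pandas as pd

# a numpy array built from a list
a = np.array([185, 102, 71])

# numpy functions are reached through the np abbreviation
a_mean = np.mean(a)

# pandas tools are reached through pd; a series can be built from the array
s = pd.Series(a, index=['brendan', 'winnie', 'eeyore'])

type(a), a_mean, s.loc['winnie']
```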