# Geopandas

Geopandas provides a pandas-like syntax for geospatial operations.

Let's start by reading in some data.
Because it's familiar, we'll again use the US Census state boundaries and congressional districts. 

First, let's set up some paths.
We'll use the `os` module to do this in a more system-agnostic way.

In [None]:
import geopandas as gpd
import os
import pandas as pd  # needed later for pd.read_csv

downloads = os.path.expanduser(os.path.join("~", "Downloads"))

statefile = os.path.join(downloads, "tl_2018_us_state", "tl_2018_us_state.shp")
cdfile = os.path.join(downloads, "tl_2018_us_cd116", "tl_2018_us_cd116.shp")

In [None]:
states = gpd.read_file(statefile)

In [None]:
states

Even though we read it from a shapefile, the underlying `states` object is a data frame, just like the ones you've worked with previously.
It supports all of the same Pandas operations you are already used to:

In [None]:
states.head()

In [None]:
states.info()

(Quick aside: If you're curious _why_ string columns have type object, [here](https://stackoverflow.com/a/21020411/2477097) is a concise answer from Stack Overflow).

## Pandas review: Working with columns

### Selecting columns

Pandas offers many ways to select individual columns.


(1) Passing the column name as a string (note the quotes!).

In [None]:
states["NAME"]

(2) As an _attribute_ of the data frame object, via the `object.attribute` syntax.

In [None]:
states.NAME

---
**EXERCISE**:
Select the column that stores the state abbreviations.
Do this twice --- once with each method.

In [None]:
# Answer here...

**QUESTION**:
What are some advantages of method 1?
What are some advantages of method 2?
Which method do you prefer (right now) and why?

{Answer here}

---
By replacing method 1's string with a `list`, you can select multiple columns by name.

In [None]:
states[["NAME", "ALAND"]]

Recall that strings and lists, like all objects in Python, can be assigned to variables.
So, the above could be achieved with code like the following (pay attention to which code is and is not quoted):

In [None]:
name_col = "NAME"
states[name_col]

In [None]:
my_cols = [name_col, "ALAND", "AWATER"]
print(my_cols)

In [None]:
states[my_cols]
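
To make the difference between the two forms concrete, here is a minimal sketch on a small made-up table (the `toy` data frame and its values are hypothetical, not the Census data). A single string gives back a `Series`, while a list of names gives back a `DataFrame`, even if the list has only one element.

```python
import pandas as pd

# Hypothetical toy table standing in for `states`
toy = pd.DataFrame({
    "NAME": ["Alpha", "Beta"],
    "ALAND": [100, 200],
    "AWATER": [10, 20],
})

one_col = toy["NAME"]        # string -> Series
one_col_df = toy[["NAME"]]   # one-element list -> DataFrame
my_cols = ["NAME", "ALAND"]
two_cols = toy[my_cols]      # list stored in a variable -> DataFrame

print(type(one_col).__name__)      # Series
print(type(one_col_df).__name__)   # DataFrame
print(list(two_cols.columns))      # ['NAME', 'ALAND']
```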

---
**EXERCISE**:
Use this syntax to create a table of state names and abbreviations.

In [None]:
# Answer here

### Creating new columns

First, let's create a "sandbox" data frame that we can safely modify by using the `DataFrame.copy` method.

In [None]:
states2 = states.copy()

> WEEDS: The reason we can't just do something like `states2 = states` is that assignment in Python never copies an object: rather than copying the entire `states` object, it just creates a new reference to the same object.
> In other words, if we just did `states2 = states`, any changes we made to `states2` _would also apply to_ `states`, because both of those variables point to the same object in the computer's memory.
> This is actually pretty useful for memory efficiency, but is a bit counterintuitive and counter to our immediate aims here, hence `states2 = states.copy()`.
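
If you want to see this for yourself, the following sketch uses a tiny made-up data frame (`toy`, `alias`, and `sandbox` are hypothetical names) to contrast plain assignment with `copy()`.

```python
import pandas as pd

toy = pd.DataFrame({"NAME": ["Alpha", "Beta"]})  # hypothetical data

alias = toy            # no copy: a second name for the same object
sandbox = toy.copy()   # an independent copy

alias["via_alias"] = 1     # also shows up in `toy`
sandbox["via_copy"] = 2    # does not touch `toy`

print(alias is toy)                 # True
print("via_alias" in toy.columns)   # True
print("via_copy" in toy.columns)    # False
```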

You can use the same syntax you used for selecting columns (Method 1) to create new ones.

In [None]:
states2["dummy_column"] = 42
states2.info()

However, note that you _cannot_ create new columns by using the `object.attribute` syntax.
Confusingly, this _will not throw an error_, but will silently do nothing.*

> \* Technically, it creates a new attribute on the object that you will be able to access and do things with. But this is pretty non-standard practice, and if you're not careful, you could accidentally overwrite an important method and stop the data frame from working correctly. So I would avoid this.
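
Here is a small sketch of that behavior on a made-up data frame (the names `toy`, `dummy_attr`, and `dummy_column` are just for illustration). Assigning a single value through the attribute syntax leaves the columns untouched, while the bracket syntax creates a real column.

```python
import pandas as pd

toy = pd.DataFrame({"NAME": ["Alpha", "Beta"]})  # hypothetical data

toy.dummy_attr = 42        # sets a plain Python attribute, not a column
toy["dummy_column"] = 42   # creates a real column

print("dummy_attr" in toy.columns)     # False
print("dummy_column" in toy.columns)   # True
print(toy.dummy_attr)                  # 42
```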

Note that although we passed only one value to the data frame, it was automatically "recycled" to every row.

In [None]:
states2["dummy_column"].head()

A common use case for creating new columns is doing math on existing columns.
For example, let's convert the state land areas from square meters to hectares (1 ha = 100 m x 100 m = 10,000 m$^2$).

In [None]:
# 10000.0 makes the float result explicit (in Python 3, / is true division, so 10000 would also work)
states2["ALAND_ha"] = states2["ALAND"] / 10000.0
states2[["NAME", "ALAND", "ALAND_ha"]].head()

(In case you haven't seen it before, `e+06` is scientific notation, i.e. `6.22e+06` means $6.22 \times 10^6$).
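
As a quick check of both points, the sketch below uses plain Python numbers (the value of `area_m2` is made up). In Python 3, `/` is true division even between integers, `//` is floor division, and scientific-notation literals are ordinary floats.

```python
area_m2 = 25000   # hypothetical area in square meters

print(area_m2 / 10000)    # 2.5 (true division)
print(area_m2 // 10000)   # 2   (floor division)

# Scientific notation is just another way to write a float
print(6.22e+06 == 6220000.0)   # True
```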

---
**EXERCISE**:
Create a column `TOTAL_AREA` that is the sum of land and water area for each state.

In [None]:
# Answer here

---
Column assignment can also be used to modify existing columns in place.

The `REGION` and `DIVISION` columns are stored as strings, but since they are all integers, let's convert them to make our lives easier.

In [None]:
states2[["REGION", "DIVISION"]].head()

To convert columns to a different type, use the `astype()` method.

One way to do this for multiple columns is with a `for` loop:

In [None]:
int_cols = ["REGION", "DIVISION"]
for col in int_cols:
    states2[col] = states2[col].astype(int)
states2.info()

Fortunately, Pandas also allows this method to work on multiple columns in one go:

In [None]:
states2[int_cols] = states2[int_cols].astype(int)
states2.info()
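
To see why the conversion "makes our lives easier", consider this sketch on a made-up table (the values in `toy` are hypothetical). While the columns hold strings, `+` glues the text together. After `astype(int)`, it adds numbers.

```python
import pandas as pd

toy = pd.DataFrame({
    "REGION": ["3", "1"],
    "DIVISION": ["5", "1"],
})

# Strings: `+` concatenates
before = (toy["REGION"] + toy["DIVISION"]).tolist()

int_cols = ["REGION", "DIVISION"]
toy[int_cols] = toy[int_cols].astype(int)

# Integers: `+` adds
after = (toy["REGION"] + toy["DIVISION"]).tolist()

print(before)   # ['35', '11']
print(after)    # [8, 2]
```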

**QUESTION**: The `STATEFP` and `STATENS` columns are also digits that are stored as strings.
Why might it be a good idea to keep them as strings?
(HINT: Consider ZIP codes. Look up the ZIP code for Brookline, MA, and try to store it as an integer. What happens?) 

{Your answer here}

**QUESTION**:
Why might `float` be a better type for the `ALAND` and `AWATER` columns?

{Your answer here}

**EXERCISE**:
Convert these two columns to `float`.
Use the `info()` method to confirm that your code worked.

In [None]:
# Your answer here

**EXERCISE**:
Create new columns `FRAC_LAND` and `FRAC_WATER` containing the area fractions of land and water, respectively, for each state.

## Pandas review: Working with rows

Somewhat confusingly, the `[]` operator in Pandas can also be used to select rows by position, but _only_ if it is passed a _slice_.

In [None]:
states[0:4]

Again, this _only_ works for _slices_ defined with the `start:end` syntax.
The following will not work:

In [None]:
states[1]

To select individual rows this way, you will have to be clever about creating slices containing only one number.

In [None]:
states[3:4]

To review, the `head(n)` and `tail(n)` methods can be used to select the first and last `n` rows of a data frame.

In [None]:
states.head(3)

In [None]:
states.tail(4)

A more useful way to select rows is based on values of specific columns.

First, let's see how comparison operators (`==`, `>`, etc.) work on columns.

In [None]:
states["NAME"] == "West Virginia"

Note that this returns a `Series` (Pandas class for a column) of type `bool` (boolean), with value `True` where the condition is met (only in the first row) and `False` everywhere else.

We can then use this series for subsetting.

In [None]:
wv = states["NAME"] == "West Virginia"
states[wv]

You can combine multiple conditions using the `&` ("and") and `|` ("or") operators. Note that each condition has to be wrapped in parentheses, `()`.

For example, this selects some really small states (area < $10^9 \text{m}^2$).

In [None]:
states[states["ALAND"] < 1e9]
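
A short aside on writing thresholds like this: in Python, `XeN` means $X \times 10^N$, so $10^9$ is written `1e9`. It is easy to slip and type `10e9`, which is ten times larger. The check below uses only plain Python.

```python
print(1e9 == 10**9)     # True
print(10e9 == 10**10)   # True
print(10e9 / 1e9)       # 10.0
```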

This selects really small states that are also in region 3 (again, pay attention to the parentheses).

In [None]:
# REGION is stored as a string in `states`, so compare against "3", not 3
states[(states["ALAND"] < 1e9) & (states["REGION"] == "3")]

Meanwhile, this selects some really small states and some really large ones (area less than $10^9 \text{m}^2$ or greater than $10^{12} \text{m}^2$).

In [None]:
states[(states["ALAND"] < 1e9) | (states["ALAND"] > 1e12)]
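
The sketch below replays these combinations on a tiny made-up table (all names and numbers in `toy` are hypothetical) so the results are easy to check by eye. It also shows that conditions can be stored in variables first, which sidesteps the parentheses issue.

```python
import pandas as pd

toy = pd.DataFrame({
    "NAME": ["Tiny", "Medium", "Huge"],
    "ALAND": [5e8, 5e10, 2e12],
    "REGION": [3, 1, 4],
})

# Each condition is a boolean Series
small = toy["ALAND"] < 1e9
huge = toy["ALAND"] > 1e12

either = toy[small | huge]
both = toy[(toy["ALAND"] < 1e9) & (toy["REGION"] == 3)]

print(either["NAME"].tolist())   # ['Tiny', 'Huge']
print(both["NAME"].tolist())     # ['Tiny']
```

The parentheses matter because `&` and `|` bind more tightly than `<` and `==`. Without them, Python tries to evaluate `1e9 & toy["REGION"]` first, which typically fails with an error.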

Another useful operation is the `isin` method for selecting values that are members of a specific set.

In [None]:
my_subset = ["WV", "AK", "OK"]
states[states["STUSPS"].isin(my_subset)]

Finally, any of these conditions can be negated with the `~` operator. For example, the following code selects every state _except_ West Virginia and Florida.

In [None]:
states[~(states.STUSPS.isin(["WV", "FL"]))]
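
As a small sketch of how `isin` and `~` fit together (using a hypothetical four-row table), note that the membership test can be stored once and then used both as-is and negated.

```python
import pandas as pd

toy = pd.DataFrame({"STUSPS": ["WV", "FL", "AK", "OK"]})  # hypothetical rows

in_subset = toy["STUSPS"].isin(["WV", "AK"])

print(toy[in_subset]["STUSPS"].tolist())    # ['WV', 'AK']
print(toy[~in_subset]["STUSPS"].tolist())   # ['FL', 'OK']
```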

---
**EXERCISE**:
Select states whose land area is between $10^9$ and $10^{10}$ square meters.

In [None]:
# Answer here

**EXERCISE**:
Select states in regions 1 and 2.

In [None]:
# Answer

**EXERCISE**: Select every state that is _not_ in region 9.

In [None]:
# Answer

---

## Geopandas: Basics and simple maps

Let's look more closely at the structure of the `states` `DataFrame`.

In [None]:
states.info()

Note that the last column is called `geometry` and has type `geometry`.
This column stores the spatial information used by GIS software and analysis tools.

In [None]:
states["geometry"].head()

We are dealing with vector data, so the geometry information consists of the vector type (e.g. Polygon, MultiPolygon, LineString, Point) and its corresponding coordinate pairs.

Let's see what happens if we try to plot this.

In [None]:
states.plot()

It's not a very good one, but it's a map of the US states!

Let's see if we can zoom in on a few states using the syntax we used earlier.

In [None]:
states[states["NAME"] == "West Virginia"].plot()

---
**EXERCISE**:
Plot the states in region 1.

In [None]:
# Answer here

In [None]:
# REGION is stored as a string in `states`, so compare against "1"
states[states.REGION == "1"].plot()

---

### Thematic maps

Geopandas makes it really easy to make thematic maps.
Simply pass a column name to the `column` argument of `plot()`.

In [None]:
states[states.REGION.isin(["1", "2"])].plot(column = "DIVISION")

For our current purposes, these crude, simple maps will suffice.
You can find more information on adjusting map aesthetics in the [geopandas manual](http://geopandas.org/mapping.html).

## Spatial operations

Next, let's use a spatial join to match each congressional district to the state that contains it.

In [None]:
cd116 = gpd.read_file(cdfile)

In [None]:
cd116[["NAMELSAD", "geometry"]].head()

In [None]:
states[["NAME", "geometry"]].head()

In [None]:
# `predicate` replaced the deprecated `op` argument in geopandas 0.10
districts = gpd.sjoin(cd116, states, predicate = "within")
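
To get a feel for what a "within" join does, here is a minimal sketch on made-up geometry. The `regions` and `parts` layers, their names, and their coordinates are all hypothetical. It assumes geopandas 0.10 or newer (for `predicate`) and uses `shapely`, which geopandas depends on.

```python
import geopandas as gpd
from shapely.geometry import box

# Hypothetical "states": two side-by-side squares
regions = gpd.GeoDataFrame(
    {"region_name": ["West", "East"]},
    geometry=[box(0, 0, 10, 10), box(10, 0, 20, 10)],
    crs="EPSG:4326",
)

# Hypothetical "districts": smaller boxes, each inside one square
parts = gpd.GeoDataFrame(
    {"part_name": ["W-1", "W-2", "E-1"]},
    geometry=[box(1, 1, 4, 4), box(5, 5, 9, 9), box(12, 2, 18, 8)],
    crs="EPSG:4326",
)

# Attach to each part the attributes of the region it lies within
joined = gpd.sjoin(parts, regions, how="inner", predicate="within")
print(joined[["part_name", "region_name"]])
```

With an inner join, any feature on the left that is not entirely within some feature on the right is dropped, so it is worth comparing the number of rows before and after the join.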

In [None]:
districts.sort_values(["NAME", "NAMELSAD"])

In [None]:
districts[districts.NAME == "Alabama"].plot(column = "NAMELSAD")

### Combine this with election data

In [None]:
election_file = os.path.join(downloads, "1976-2016-president.csv")
election_full = pd.read_csv(election_file)

In [None]:
election_full

In [None]:
election_2016 = election_full[election_full.year == 2016]
election_2016.head()

In [None]:
election_2016_geo = states.merge(
    election_2016,
    left_on = "STUSPS",
    right_on = "state_po"
)
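
The sketch below shows what this kind of merge does to the row count, using two small made-up tables (the vote numbers are invented). Because each state matches several candidate rows, the state's columns are repeated once per candidate.

```python
import pandas as pd

# Hypothetical stand-ins for `states` and `election_2016`
shapes = pd.DataFrame({
    "STUSPS": ["WV", "MD"],
    "NAME": ["West Virginia", "Maryland"],
})
votes = pd.DataFrame({
    "state_po": ["WV", "WV", "MD", "MD"],
    "party": ["democrat", "republican", "democrat", "republican"],
    "candidatevotes": [10, 30, 60, 40],   # made-up numbers
})

merged = shapes.merge(votes, left_on="STUSPS", right_on="state_po")

print(merged.shape)              # (4, 5)
print(merged["NAME"].tolist())
```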

In [None]:
election_2016_geo["candidate_frac"] = election_2016_geo.candidatevotes / election_2016_geo.totalvotes

In [None]:
election_2016_geo[
    ~(election_2016_geo.state.isin(["Alaska", "Hawaii"])) &
    (election_2016_geo.party == "democrat")
].plot(column = "candidate_frac")

For each state, get the winning candidate (the one with the most votes, by group).

In [None]:
election_winner = election_2016_geo.loc[
    election_2016_geo.groupby("NAME")["candidatevotes"].idxmax()
]
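
If the `groupby(...).idxmax()` pattern is new to you, this sketch walks through it on a tiny made-up table (the vote counts are invented). `idxmax` returns, for each group, the row label of the largest value, and `.loc` then pulls out those rows.

```python
import pandas as pd

votes = pd.DataFrame({
    "NAME": ["West Virginia", "West Virginia", "Maryland", "Maryland"],
    "party": ["democrat", "republican", "democrat", "republican"],
    "candidatevotes": [10, 30, 60, 40],   # made-up numbers
})

# One row label per state: the row with the most votes
winner_idx = votes.groupby("NAME")["candidatevotes"].idxmax()
winners = votes.loc[winner_idx]

print(winner_idx.tolist())          # [2, 1]
print(winners[["NAME", "party"]])
```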

In [None]:
election_winner[election_winner.state != "Alaska"].plot(column = "party")

In [None]:
election_2016_geo[election_2016_geo.state == "Maryland"]

In [None]:
elect_sub = election_2016_geo[
    (election_2016_geo.party.isin(["republican", "democrat"])) &
    ~(election_2016_geo.writein)
][["state", "party", "candidatevotes", "geometry"]]

In [None]:
elect_wide = elect_sub.pivot(
    index = ["state", "geometry"],
    columns = "party",
    values = "candidatevotes"
)
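
Here is a minimal sketch of the same long-to-wide reshaping on a made-up table (the numbers are invented, and the `margin` column is only there for illustration). It uses a single index column to keep things simple.

```python
import pandas as pd

long = pd.DataFrame({
    "state": ["West Virginia", "West Virginia", "Maryland", "Maryland"],
    "party": ["democrat", "republican", "democrat", "republican"],
    "candidatevotes": [10, 30, 60, 40],   # made-up numbers
})

# One row per state, one column per party
wide = long.pivot(index="state", columns="party", values="candidatevotes")

# With parties side by side, comparisons become simple column math
wide["margin"] = wide["republican"] - wide["democrat"]

print(wide)
```

Note that `pivot` raises an error if any index/column combination appears more than once, so the input needs at most one row per state and party.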

In [None]:
elect_wide.info()

In [None]:
election_2016_geo[election_2016_geo.party == "republican"].plot(
    column = "candidatevotes"
)

In [None]:
election_2016_geo.head()

In [None]:
states.info()

### Applying functions to columns

In the previous examples, arithmetic operators (e.g. `+`, `/`) worked without modification on Pandas `Series` (columns).
In other words, `states["ALAND"] + states["AWATER"]` is assumed to mean "add every element of `ALAND` to every element of `AWATER`".
However, many (in fact, most) functions will not work this way.

Fortunately, Pandas provides a convenient syntax for applying a function to every element of a `Series`.

Let's start by defining a simple function that adds `.shp` to the state name.
Recall that you have at least two options for doing this in Python.

In [None]:
mystring = "Arkansas"
mystring + ".shp"

In [None]:
"%s.shp" % mystring

Let's wrap the second option in a simple function:

In [None]:
def create_filename(s):
    result = "%s.shp" % s
    return result

In [None]:
create_filename("Arkansas")

In [None]:
create_filename(states["NAME"])

This runs without an error, but it does not give us one file name per state: `%s` formats the printed representation of the entire `Series` into a single string.

Other functions fail outright when handed a `Series`.
For a second example, let's define a function to convert the state name (which has uppercase letters and spaces) to a more file-friendly name that is all lowercase and replaces spaces with dashes (`-`).

Let's interactively identify and test the relevant `str` methods we need to use.

In [None]:
mystring = "West Virginia"
mystring.lower()

In [None]:
mystring.replace(" ", "-")

Now, define a function that combines these two pieces.

In [None]:
def file_friendly(s):
    s_lowercase = s.lower()
    s_file = s_lowercase.replace(" ", "-")
    return s_file

Test it out on a few cases.

In [None]:
file_friendly("West Virginia")

In [None]:
file_friendly("Washington, D.C.")

That's not a great file name -- periods and commas can confuse operating systems. Let's modify the function to remove those.

In [None]:
def file_friendly(s):
    s_lowercase = s.lower()
    s_file = s_lowercase.replace(" ", "-")
    s_file = s_file.replace(",", "")
    s_file = s_file.replace(".", "")
    return s_file

In [None]:
file_friendly("Washington, D.C.")

Much better!

Now, if we try to use this on a `Series`, we get an error.

In [None]:
file_friendly(states["NAME"])

Note the error: `'Series' object has no attribute 'lower'`.
That's because our code literally tried to do this:

```python
s_lowercase = states["NAME"].lower()
```

...and `lower` is not something that a `Series` knows how to do.

To make this work, we can instead use the `apply` method, which takes a _function_ as its argument and applies that function to every element of a `Series`.

In [None]:
states["NAME"].apply(file_friendly)
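
For a self-contained look at element-wise string work, the sketch below runs a `file_friendly`-style function over a small hypothetical `Series` with `apply`, and then gets the same result from Pandas' built-in `.str` accessor. The two-element `names` series is made up. Which style you pick is mostly a matter of taste: `apply` works with any function, while `.str` covers common string operations without writing one.

```python
import pandas as pd

def file_friendly(s):
    """Lowercase, dash-separated name without commas or periods."""
    return s.lower().replace(" ", "-").replace(",", "").replace(".", "")

names = pd.Series(["West Virginia", "Washington, D.C."])  # hypothetical data

# Option 1: apply a plain Python function to every element
via_apply = names.apply(file_friendly)

# Option 2: chain the equivalent vectorized string methods
via_str = (
    names.str.lower()
    .str.replace(" ", "-", regex=False)
    .str.replace(",", "", regex=False)
    .str.replace(".", "", regex=False)
)

print(via_apply.tolist())   # ['west-virginia', 'washington-dc']
print(via_str.tolist())     # ['west-virginia', 'washington-dc']
```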

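### Selecting rows and columns by position

Another useful approach for selecting rows and columns by number is the `iloc` indexer (note the square brackets rather than parentheses).
It takes two selectors, one for rows and one for columns; each can be a single integer or a slice.
Recall that a slice with no arguments (`:`) means "everything".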

In [None]:
states.iloc[0:2,:]

In [None]:
states.iloc[3,:]

Note that unlike the previous methods, which returned `DataFrame` objects with a subset of rows (even if those `DataFrame`s only had one row!), selecting a row with `iloc` returns a `Series`.

Similarly, the following code can be used to select a column by index.

In [None]:
# Recall: Python uses zero-based indexing!
# NAME is the 7th column in the data frame, so its index is 6
states.iloc[:, 6]

In [None]:
states.iloc[3,6]
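
To summarize what each form of `iloc` gives back, here is a sketch on a small made-up table (`toy` and its values are hypothetical). A slice keeps the `DataFrame` shape, a single integer drops a dimension, and a one-element list is a handy way to keep a single row as a `DataFrame`.

```python
import pandas as pd

toy = pd.DataFrame({
    "NAME": ["Alpha", "Beta", "Gamma"],
    "ALAND": [100, 200, 300],
})

rows = toy.iloc[0:2, :]       # slice of rows -> DataFrame
row = toy.iloc[1, :]          # single row -> Series
col = toy.iloc[:, 0]          # single column -> Series
cell = toy.iloc[1, 0]         # single row and column -> one value
row_df = toy.iloc[[1], :]     # one-element list -> one-row DataFrame

print(type(rows).__name__, type(row).__name__, type(row_df).__name__)
print(cell)   # Beta
```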