# Lecture 3 - Dictionaries, Pandas, and Geopandas
![](images/panda.jpeg)

![](images/UoA_missing_map_poster.png)

### A cool thing I saw on the net this week:
Open Street Map and LLMs (GPT)
https://github.com/rowheat02/osm-gpt

### Objectives
- Dictionaries
  - What, why, how
- Pandas
  - The industry and scientific standard means of tabular data handling*
  - https://pandas.pydata.org/
- Geopandas
  - Making pandas spatial!
  - https://geopandas.org/en/stable/

### Dictionaries
- A dictionary is another type of variable
  - It is useful for storing loosely structured information
  - It uses a key:value structure
  - Denoted using curly brackets {}
  - Often combined with a list
  

In [None]:
webstersDictionary = {
                        "apple":"a delicious round fruit",
                        "bench":"a place to sit next to the footpath",
                        "candy":"something that's tasty but causes cavities",
                        "door":"a portal to another room",
                        "entomologist":"someone that studies bugs",
                        "farce":"a sausage meat mixture",
                        "gas":"a petrochemical nearing end of life",
                        "hatchback":"a car with a rear that entirely opens",
                        "ice":"the solid state of water",
                        "juice":"a sugary-tasting fruit drink",
                        "k****":"asdfasdf",
                     }
webstersDictionary["bench"]

In [None]:
webstersDictionary


### Dictionary functions
- mydict.keys() gives you the keys of the key:value pairings
- mydict.items() gives you the contents as (key, value) pairs
- mydict.values() gives you the values

In [None]:
# Uncomment each line of code one by one and check out what is returned by each function call
webstersDictionary.keys()
# webstersDictionary.values()
# webstersDictionary.items()

In [None]:
# We can use a for loop to run through the keys in dictionary:
for i in webstersDictionary.keys():
    print(i)

In [None]:
# Or we could use it as a way to format a printing of the entire dictionary
for i in webstersDictionary.keys(): # using the key we get from this line...
    print (i + ":\n\t" + webstersDictionary[i]) # ...to call the item/value from the dictionary as we loop through it
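The loop above looks each value up by its key. A minimal sketch of the same printout using `.items()`, which hands back each key and value together. The two-entry `demo_dict` is a made-up stand-in for the lecture dictionary.

```python
# demo_dict is a hypothetical two-entry stand-in for the lecture dictionary
demo_dict = {
    "apple": "a delicious round fruit",
    "ice": "the solid state of water",
}

lines = []
for word, definition in demo_dict.items():  # each item is a (key, value) pair
    lines.append(word + ":\n\t" + definition)

print("\n".join(lines))
```

Unpacking into two names (`word, definition`) saves the second lookup, which reads a little cleaner once the dictionary gets big.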

### Dictionaries are often multi layered (multi-dimensional)
- Often dictionaries contain lists of dictionaries
- Yes, it is confusing, but creates more freedom

In [None]:
# Here a multi-level dictionary lets us store the key 'people',
# within which we can then store more information about those people, in further dictionaries.
mydict = {"people":[
            {
                "name":"michael",
                 "role":"teacher"
            },
            {
                "name":"sila",
                 "role":"teacher"
            },
            {
                "name":"amber",
                 "role":"tutor"
            }    
            ]}
# print(mydict)
print(mydict["people"])


In [None]:
# We can add another dictionary within that first one, that contains a totally different set of information...
mydict = {"people":[
            {"name":"michael","role":"teacher"},
            {"name":"sila","role":"teacher"},
            {"name":"amber","role":"tutor"},    
        ],
         "mascots":[
             {"name":"pepper","species":"dog"},
             {"name":"pip","species":"dog"},
             {"name":"harriet","species":"cat"}
         ]}
# print(mydict)
print(mydict["people"])

In [None]:
# Combine this multi-level dictionary with a for loop and a conditional statement...
for i in mydict["mascots"]: #for every 1st level dict with this key...
    print(str(i))
    if i["species"] == "cat":        # Within that dict, if the value of the key 'species' is cat...
        print ("all hail our new ruler! bow to "+i["name"]+"!") # Do this...
    else:
        print ("who\'s the best? \n\t"+i["name"]+"\'s the best!\n") # Otherwise do this.
        
# And we get the ability to selectively call and use data from different levels within it.
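To make the "different levels" idea concrete, here is a small sketch of reaching into a nested dictionary one step at a time. The `pets` variable is a made-up copy of the mascots data, so it doesn't disturb `mydict`.

```python
# pets is a made-up copy of the mascots part of the lecture dictionary
pets = {
    "mascots": [
        {"name": "pepper", "species": "dog"},
        {"name": "pip", "species": "dog"},
        {"name": "harriet", "species": "cat"},
    ]
}

# key -> list -> index -> key, one level at a time
first_name = pets["mascots"][0]["name"]

# the same loop-plus-if pattern, written as a list comprehension
dog_names = [m["name"] for m in pets["mascots"] if m["species"] == "dog"]

print(first_name, dog_names)
```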

### Why use dictionaries?
- Dictionaries are really useful as they create a bit more freedom in our data structure.
- In comparison, in an array it's a really bad idea to have mixed data types and uneven lengths; dictionaries, however, are great for this type of data.
<br></br>
- This moves us towards 'unstructured' data. Lots of the info we get from the web is fairly unstructured because it relies on incomplete datasets. 
  - For example: some data from Twitter (or X as it is now known, booooo) has geolocation, and some does not. 
  <br></br>
- Having said that, using a dictionary doesn't mean _no_ structure, it just means that not all elements of the structure will be there. 
  - For example:

In [None]:
# Here we have some pretty unstructured data as a dict...
mydict = {
    "restaurants":[
        {
            "name":"McDonalds",
            "nickname":"Maccas",
            "known-for":"Big Mac",
            "likely-result":"heart-Attack",
            "rating":3.3            
        },
        {
            "name":"Burger King",
            "nickname":"The King",
            "known-for":"Whopper",
            "rating":2
        },
        {
            "name":"Burger Wisconsin",
            "nickname":"Burg-Wickies",
            "known-for":"Expensive Trash!",            
        },
    ]
}

# Here it is as a complex array but... 
restaurants = ["McDonalds","Burger King","Burger Wisconsin"]
nicknames = ["Maccas","The King","Burg-Wickies"]
known_for = ["Big Mac","Whopper","Expensive Trash"]
likely_result = ["heart attack",None,None]
rating = [3.3,2,None]

# It's messy and not easy to access items

In [None]:
# For example, let's try to get the info of the first record...
print(restaurants[0]+", nickname:"+nicknames[0]+", known for:"+known_for[0]+", likely result:"+likely_result[0]+", rating:"+str(rating[0]))

# Ok that worked, but what about the second record?
#print(restaurants[1]+", nickname:"+nicknames[1]+", known for:"+known_for[1]+", likely result:"+likely_result[1]+", rating:"+str(rating[1]))


In [None]:
# Let's compare that to using the key from a dictionary for the broad class and...
# ... then an index call the specific record (try 0, then try 1)
print(mydict.keys())
mydict["restaurants"][0]

In [None]:
# In addition to accessing those individual records, we can of course call the whole dictionary
print(mydict["restaurants"])

### This is hard to see... I want only the ratings for my map, thanks.
- Use a loop to get each restaurant, set the inner dictionary to be the variable 'r'
- Once we have 'r' we ask for the ratings...

In [None]:
# Let's try this... loop over all records in the restaurants dict and access their rating.
mydict = {"restaurants":[
        {   "name":"McDonalds",
            "nickname":"Maccas",
            "known-for":"Big Mac",
            "likely-result":"heart-Attack",
            "rating":3.3},
        {   "name":"Burger King",
            "nickname":"The King",
            "known-for":"Whopper",
            "rating":2},
        {   "name":"Burger Wisconsin",
            "nickname":"Burg-Wickies",
            "known-for":"Expensive Trash!"},]}

for resto in mydict["restaurants"]:
    print(resto["rating"])

### Ah, zut alors!
- A record is completely missing a bit of information!
- Let's fix this using a simple if statement
- [Let's google that for us](https://www.bing.com/search?q=python+dictionary+keyerror&qs=n&form=QBRE&sp=-1&pq=python+dictionary+keyerror&sc=3-26&sk=&cvid=239D59BE272A4BE19C94C9B98C874DF0)

In [None]:
# hmm. Error!!
# A KeyError is raised when we try to return the value of a dict key that doesn't exist.
# To avoid it we can use a method of the dict type, mydict.get().
# get() returns the value if the key is there, or None (the null value) if it is not.
# We can use this with an if statement to see if it exists!

for i in mydict["restaurants"]: # Again, loop over all the restaurants
    if i.get("rating") is not None: # IF a record has the rating key, use it
        print(i["rating"])
    else:
        print("No rating for "+str(i["name"])) # ELSE if there is no rating key, handle the missing value

        
# Clean it up a bit... 
# for i in mydict["restaurants"]: 
#     if i.get("rating") is not None: print(i["name"]+"'s rating is: "+str(i["rating"]))
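Two other ways to handle a missing key, sketched on a made-up record called `resto`: the `in` operator checks whether a key exists, and `get()` takes an optional second argument to hand back instead of None.

```python
# resto is a hypothetical record with no "rating" key
resto = {"name": "Burger Wisconsin", "nickname": "Burg-Wickies"}

has_rating = "rating" in resto                    # membership test on the keys
rating = resto.get("rating")                      # None, because the key is missing
rating_text = resto.get("rating", "no rating")    # second argument is the fallback value

print(has_rating, rating, rating_text)
```

The fallback form is handy when you just want something printable and don't need a separate else branch.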

# Pandas
![](images/panda.jpeg)

## Pandas
![](images/pandas_logo.jpg)

- What is pandas?
  - Pandas is a _very_ powerful data handling and processing library for Python. 
  - It has a blazing fast ability to load and save data from a wide variety of formats (csv,json,excel, etc)
  - It can transform data very quickly, too.
- How do I get it?
  - Run command prompt from anaconda
    - If you forget how, have a look at last week's lecture :)
  - Type: pip install pandas
    - if you are having trouble with this, check in with us in lecture or lab time

### Introducing, the dataframe
- Pandas is all organized around a concept called a DataFrame
- The dataframe is a powerful 2D table, with labelled rows and columns
- Pandas brings data transformation, statistical analysis, and plotting/visualization directly to you
![](images/pandas_dataframe.jpg)

In [None]:
# To get started, we need to import the pandas library
# if you get lost in class today, I highly recommend the pandas website
# the tutorials on the site are excellent!
# https://pandas.pydata.org/docs/getting_started
import pandas as pd

# To make a dataframe we can easily construct one ourselves...
# ...creating a DataFrame using a dictionary.
df = pd.DataFrame(
    {
        "Name":[
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        
        "Sex": ["male", "male", "female"],
        "Age": [22, 35, 58],
    })
df



In [None]:
df = pd.read_csv("data/Age.csv")

In [None]:
print(df)

In [None]:
df.dtypes

### Query the table, asking for a single column of information
- In pandas, a column is a 'series'

In [None]:
# Get a series (column) of data
df["Name"]

In [None]:
print(df[["Name", "pop23_ov65"]])

print(df.iloc[:, 3])

In [None]:
df.head()

In [None]:
df[df["Name"] == "Waitemata Local Board Area"]

In [28]:
ov10m_old = df[(df["pop23_ov65"] > 1000) & (df["Name"] == "Waitemata Local Board Area") ]
print(ov10m_old)

    Code                        Name  pop13_u15  pop13_1529  pop13_3064  \
85  7610  Waitemata Local Board Area       7881       29865       34473   

    pop13_ov65  pop13_tot  pop18_u15  pop18_1529  pop18_3064  pop18_ov65  \
85        4914      77136       7818       30387       38118        6543   

    pop18_tot  pop23_u15  pop23_1529  pop23_3064  pop23_ov65  pop23_tot  
85      82866       7206       26775       39333        8232      81546  
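The query above combines two conditions. A minimal sketch of how that works, on a made-up three-row table (`toy_df` and its column names are assumptions, not the real Age.csv):

```python
import pandas as pd

# toy_df is a made-up table, not the lecture's Age.csv
toy_df = pd.DataFrame({
    "Name": ["Area A", "Area B", "Area C"],
    "pop_ov65": [500, 1500, 8000],
})

is_large = toy_df["pop_ov65"] > 1000   # a boolean Series, one True/False per row

# & means "and", | means "or"; each condition needs its own parentheses
both = toy_df[(toy_df["pop_ov65"] > 1000) & (toy_df["Name"] != "Area C")]
either = toy_df[is_large | (toy_df["Name"] == "Area A")]

print(both)
print(either)
```

Python's own `and` / `or` don't work here because pandas compares whole columns at once, hence the `&` and `|` symbols.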


In [30]:
df.iloc[0:4, 2:5]
print(df.head(20))

     Code                                Name  pop13_u15  pop13_1529  \
0    3500             South Taranaki District       6078        4788   
1    2000                    Waitomo District       2133        1584   
2    7500                   Invercargill City      10254        9891   
3    7604          Kaipatiki Local Board Area      15993       17985   
4    7620           Papakura Local Board Area      11136        9864   
5    7618   Otara-Papatoetoe Local Board Area      19605       18951   
6   99900  Area Outside Territorial Authority          3           3   
7    7300                  Southland District       6531        4980   
8    5100                     Tasman District       9432        6387   
9    3600                    Ruapehu District       2766        2175   
10   6200                     Selwyn District       9933        7977   
11   7200                     Clutha District       3501        2658   
12   6300                  Ashburton District       6423        

### Pandas makes statistically summarizing data easy

In [None]:
# We can also simply ask pandas for stats
# https://pandas.pydata.org/docs/getting_started/intro_tutorials/06_calculate_statistics.html
df.describe()

### ...and we can make graphs really easily

In [None]:
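# We can also ask pandas for a really simple graph of the data
# https://pandas.pydata.org/docs/getting_started/intro_tutorials/04_plotting.html
# df now holds the Age.csv table, so we plot one of its population columns for the first 10 rows
df.head(10).plot.barh(x='Name',y='pop23_ov65')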

# There are lots of different options for the type and styling of plots, far too much for today!
# for example these are all the types of plots!
#   'area','bar','barh','box','density', 
#   'hexbin', 'hist', 'kde', 'line', 'pie', 'scatter'

## Pandas and Dictionaries
![](images/panda_dictionary_reading.jpg)

In [None]:
# Re-set/ declare our dictionary once more...
mydict = {
    "restaurants":[
        {
            "name":"McDonalds",
            "nickname":"Maccas",
            "known-for":"Big Mac",
            "likely-result":"heart-Attack",
            "rating":3.3            
        },
        {
            "name":"Burger King",
            "nickname":"The King",
            "known-for":"Whopper",
            "rating":2
        },
        {
            "name":"Burger Wisconsin",
            "nickname":"Burg-Wickies",
            "known-for":"Expensive Trash!",            
        },
    ]
}

## Pandas and dictionaries, a match made in paradise
- You may have noticed in the code before that we are feeding pandas a dictionary
- The dictionary is a lot like our restaurants data!
  - But it is slightly different. 
    - Pandas sets its dictionary up with the columns (series) as the dict keys
    - The rows come from the arrays (lists) stored under each dict key, so every list must be the same length. 
      - This means it is more 'structured'
      
![](images/unstructured-dict_vs_pandas.jpg)

### Lets make our restaurant dict more pandas-like
- We need to do two things. 
  1. Organize the structure into series and rows 
  2. Handle the missing values

In [None]:
# Remember our dictionary is mydict
# Remember that I can use get() to return the key data i want OR None if it doesn't exist

# set up a structure of the pandas data
restos = {
    "name":[], # The values are created empty here, ready to get the data in
    "nickname":[],
    "known-for":[],
    "likely-result":[],
    "rating":[]
}

# We put the data into the dict using append
for r in mydict["restaurants"]:
    restos["name"].append(r.get("name")) # the data or None if it does not exist
    restos["nickname"].append(r.get("nickname"))
    restos["known-for"].append(r.get("known-for"))
    restos["likely-result"].append(r.get("likely-result"))
    restos["rating"].append(r.get("rating"))           
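As an aside, pandas can also build a dataframe straight from a list of dictionaries, filling in any missing keys itself. A minimal sketch, using a trimmed, made-up copy of the restaurant records (`records` is not the lecture variable):

```python
import pandas as pd

# records is a trimmed, hypothetical copy of the restaurant list
records = [
    {"name": "McDonalds", "rating": 3.3, "likely-result": "heart-Attack"},
    {"name": "Burger King", "rating": 2},
    {"name": "Burger Wisconsin"},
]

records_df = pd.DataFrame(records)  # one row per dict, one column per key seen
print(records_df)
```

Doing it by hand with `append` and `get()` is still worth knowing, because it shows exactly what pandas is doing for us.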

In [None]:
restos

In [None]:
# Brilliant. Now lets turn that into a pandas dataframe...

resto_df = pd.DataFrame(restos)
resto_df

# oooo look what pandas has done with that missing data... (None vs NaN)

### Pandas makes transforming data easy
- We could get some stats and make a plot
  - If you don't have matplotlib installed, you'll need to do so
    - Open cmd.exe prompt from anaconda
    - Type: conda install matplotlib
- Pandas has a LOT of functionality, we can only really scratch the surface
  - We will look at: 
    - Creating dataframes
    - Making plots
    - Making statistical summaries
    - Grouping information
    - Getting the head and tail of a large set of data

In [None]:
import matplotlib
# Quick line of test code commented out here, use it to test if your matplotlib is working
#resto_df.plot.bar(x='name',y='rating')

# Describe is a very useful function built into PD!
resto_df.describe()

### ... lets try some working with some real data.
- Use Pandas to open the Titanic Passenger Data
  - One of the reasons pandas is so useful is that it gives us amazing tools to handle info
- Create some graphs and stats for it.

In [None]:
# Dataset and docs available at: 
# https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/06_calculate_statistics.html
titanic = pd.read_csv("data/titanic.csv") 
# titanic
titanic

### Head and tail
- These functions allow us to limit the number of rows we see
- Very useful when we have massive datasets but want to check what they look like...


In [None]:
# titanic.head(10)
# titanic.tail(10)
t_n_b = pd.concat([titanic.head(5),titanic.tail(5)]) # sneaky combine the two for first and last! 
t_n_b

### The describe function
- Tells us the basic information about a dataframe, or a series
    - Works on all levels of your dataset!

In [None]:
# titanic.describe()
# titanic["Sex"].describe()
titanic["Age"].describe()

### Value counts function
- Tells us how many of each

In [None]:
# Get a count of the number of passengers by Sex
# titanic["Sex"].value_counts(sort=False)

titanic["Age"].value_counts(sort=True).tail(20)

### Groupby function
- groupby allows us to summarize values by the categories in a series
    - Excel has this function...
    - Pandas does it better/on more data!
        - The covid uk gov database... 

![](images/rand_karma.jpg)

![](images/groupby_karma_result.jpg)

![](images/groupby_max.JPG)

In [None]:
# Pandas makes organizing your data pretty straightforward

# Before we looked at age, but what about age and sex?
#titanic["Age"].describe()

#titanic.groupby("Sex").mean(numeric_only=True) # numeric_only skips text columns such as Name

titanic.groupby("Sex")["Age"].mean()
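If groupby still feels like magic, here is a sketch on a tiny made-up table (`passengers` is not the real Titanic data) where you can check the answers in your head. It also shows `agg`, which asks for several summaries at once.

```python
import pandas as pd

# passengers is a tiny made-up table, not the real Titanic data
passengers = pd.DataFrame({
    "Sex": ["male", "female", "male", "female"],
    "Age": [20, 30, 40, 50],
})

mean_age = passengers.groupby("Sex")["Age"].mean()   # one number per group
summary = passengers.groupby("Sex")["Age"].agg(["min", "max", "count"])

print(mean_age)
print(summary)
```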

In [None]:
# Get a count of those that did, and did not survive, by sex
# 0 = dead, 1 = alive
survived_df = titanic.groupby("Sex")["Survived"].value_counts(sort=False) # New operator 'value_counts'
survived_df

In [None]:
# Lets plot the survivability
survived_df.plot.bar(x='Sex',y="Survived")

# But... does this plot accord with what we know of the disaster? (hint: absolute vs relative)
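One way to follow up the hint is to compare absolute counts with relative shares. A minimal sketch on made-up data (`toy` is an assumption, with four men and two women, not the Titanic table):

```python
import pandas as pd

# toy is made-up data; 0 = died, 1 = survived, as in the Titanic table
toy = pd.DataFrame({
    "Sex": ["male"] * 4 + ["female"] * 2,
    "Survived": [0, 0, 0, 1, 1, 0],
})

counts = toy.groupby("Sex")["Survived"].value_counts()                 # absolute
shares = toy.groupby("Sex")["Survived"].value_counts(normalize=True)   # relative, sums to 1 per group
rate = toy.groupby("Sex")["Survived"].mean()                           # share of 1s per group

print(counts)
print(shares)
print(rate)
```

Because survival is coded 0/1, the mean of the column is the survival rate, which makes groups of very different sizes comparable.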

# Lets make this interesting
- While making this lecture I got interested in harvesting Reddit.com data
  - Reddit is an 'open api' meaning all the publicly posted information is free to grab and play with
      - For now...
  - I've used the Reddit library PRAW (python reddit) to download a ton of Reddit data
  - If you would like to know how I did that, I've included a python notebook in this week's files called RedditDataScraper
- The dataset contains all the top level comments in the posts on the /r/Auckland subreddit as of start-2022
  - 6,492 posts

In [None]:
# Let's take a look at what pandas sees
# By default, pandas displays only the first and last 5 rows when we ask to see a large table
filename="data/old_rAuckland_top_comments.csv"
rAuckland_df = pd.read_csv(filename,parse_dates=["created_datetime"]) # read the file and parse the datetime text as pandas datetime objects

rAuckland_df

In [None]:
# what are the stats of this? Headings/cols are a bit messy...
rAuckland_df.describe()

### Now that we have a lot of data lets investigate!

In [None]:
# lots of comments
# usually not that many upvotes, though usually positive 
# there is at least one very negative post!
rAuckland_df["comment_score"].describe()

In [None]:
# what does it look like as a plot?
rAuckland_df["comment_score"].plot()

### Investigation. When is the best time to post for max comments?

In [None]:
# https://pandas.pydata.org/docs/getting_started/intro_tutorials/09_timeseries.html
import matplotlib.pyplot as plt
fig,axs = plt.subplots(figsize=(12,4))
# Take a look how the operators have been chained together here...
rAuckland_df.groupby(rAuckland_df["created_datetime"].dt.hour)["comment_score"].max().plot(kind='bar',rot=0,ax=axs)
plt.xlabel("hour of the day");
plt.ylabel("max comment score");

### Does comment length correlate to karma?

In [None]:
# https://pandas.pydata.org/docs/getting_started/intro_tutorials/10_text_data.html
# Combining a pandas column with the pandas string method .str.len(), which works like Python's built-in len() on every row
rAuckland_df["comment_length"] = rAuckland_df["comment_body"].str.len() 
rAuckland_df.plot.scatter(y="comment_score",x="comment_length",alpha=0.5,figsize=(12,6),logx=True) # Check out the axis!

### Investigation. Who has the most single comment karma from r/auckland?
- To answer this question we have to do something quite algorithmically taxing: groupby
  - To group all of the comments that belong to specific users, pandas has to sort, then summarize
  - In this groupby clause we use the .max() method, which keeps the highest comment score for each author. It's kind of like the summary statistic you can add to a spatial join function in ArcGIS (*boo-hiss*) or QGIS (*yaaaay*)
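The choice of summary method changes the question being answered. A small sketch on a made-up comments table (`comments` and the two authors are assumptions) shows `.max()` next to `.sum()`:

```python
import pandas as pd

# comments is a made-up stand-in for the r/Auckland table
comments = pd.DataFrame({
    "comment_author": ["ana", "ana", "ben", "ben", "ben"],
    "comment_score": [10, 3, 4, 4, 4],
})

by_author = comments.groupby("comment_author")["comment_score"]
best_single = by_author.max()   # highest single comment per author
total = by_author.sum()         # all of an author's scores added up

print(best_single)
print(total)
```

Here `.max()` answers "who wrote the single best comment?", while `.sum()` answers "who has the most karma overall?".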

In [None]:
fig,axs = plt.subplots(figsize=(12,4))
# In this clause we use groupby. It's a taxing operation... bit of a pause here as it thinks..
# Also, wow, we would need to do something about those x-axis labels
rAuckland_df.groupby(rAuckland_df["comment_author"])["comment_score"].max().plot(kind='bar',rot=0,ax=axs)
plt.xlabel("comment author");
plt.ylabel("max comment score");

### Final investigation. What user has the highest karma, from r/Auckland?

In [None]:
# Create a new series 'k' by grouping name with summed karma
k = rAuckland_df.groupby(rAuckland_df["comment_author"])["comment_score"].sum() 
# Convert the series to a dataframe (series 1 col, df many cols)
l = pd.DataFrame(k) 
# Output a sorted version, and make permanent
l.sort_values(by=["comment_score"],ascending=False,inplace=True)

# The default version of this is too big a table, so lets just grab
# the top and bottom 5
overall_karma = pd.concat([l.head(5),l.tail(5)])
overall_karma

In [None]:
overall_karma.plot()

# Geopandas
- Geopandas is the same as pandas with two important differences
  1. it adds the 'GeoSeries'
  2. it adds the 'GeoDataFrame'
![](images/geopandas_logo.jpg)

### First, install geopandas

- Open the Anaconda Prompt
  - conda install -c conda-forge geopandas
- https://geopandas.org/getting_started.html

<p style="background:black">
<code style="background:black;color:white">(base) C:\Users\YOUR_USERNAME> conda install -c conda-forge geopandas
</code>
</p>

### Then test the installation
- To see if it works, just import the library

### Useful packages
- Geopandas (duh!)
- folium (we will install this in a bit)
  - if there is issue, ask in lab!
- fiona (pre-installed: data import and export using GDAL tools)
- PySAL (not needed yet: spatial analysis)
- Cartopy (pre-installed: cartography and projections)
- shapely (pre-installed: it comes with geopandas and handles the geometry)

- remember pip and conda!
<p style="background:black">
<code style="background:black;color:white">(base) C:\Users\YOUR_USERNAME> pip install descartes 
</code>
</p>

In [None]:
import pandas as pd
import geopandas


## Geopandas uses the same structure but adds geometry
![](images/geodataframe.png)
- https://geopandas.org/getting_started/introduction.html

### What is geometry?
- Good question. 
  - The purpose of geopandas is to add geometry (spatial) data
  - But it's also to add spatial operations, too. 
- First, however, we do need to know what geometry is:
  - It's a representation of a spatial location
    - It can only have one CRS (coordinate reference system)
  - It can come in several types, well beyond point, line, and polygon
    - We can actually mix points, lines, and polygons in the same geodataframe
      - I _really_ don't recommend this. It's like mixing array items, but worse
  - Spatial information is stored as spatially encoded objects (using a library called _shapely_; we don't really need to know about it, but it is in turn built on GEOS)
- Let's use an example we are familiar with and make it spatial

In [None]:
# Return of our toy dict!
mydict = {
    "restaurants":[
        {
            "name":"McDonalds",
            "nickname":"Maccas",
            "known-for":"Big Mac",
            "likely-result":"heart-Attack",
            "rating":3            
        },
        {
            "name":"Burger King",
            "nickname":"The King",
            "known-for":"Whopper",
            "rating":2
        },
        {
            "name":"Burger Wisconsin",
            "nickname":"Burg-Wickies",
            "known-for":"Expensive Trash!",            
        },
    ]
}

### Now add a spatial location for it
- Ripped from Google Maps
![](images/burger_location_gmaps.JPG)

In [None]:
# Add the lat/lon
mydict = {
    "restaurants":[
        {
            "name":"McDonalds",
            "nickname":"Maccas",
            "known-for":"Big Mac",
            "likely-result":"heart-Attack",
            "rating":3,
            "longitude":174.7650551,
            "latitude":-36.8500934,
        },
        {
            "name":"Burger King",
            "nickname":"The King",
            "known-for":"Whopper",
            "rating":2,
            "longitude":174.76595,
            "latitude":-36.8462059,
        },
        {
            "name":"Burger Wisconsin",
            "nickname":"Burg-Wickies",
            "known-for":"Expensive Trash!",
            "longitude":174.7459792,
            "latitude":-36.855049
        },
    ]
}

### Convert to dataframe-like dict

In [None]:
# Same again on the convert front...
restos = {
    "name":[],
    "nickname":[],
    "known-for":[],
    "likely-result":[],
    "rating":[],
    "latitude":[],
    "longitude":[],
}

for r in mydict["restaurants"]:
    restos["name"].append(r.get("name")) # the data is set to None if it does not exist
    restos["nickname"].append(r.get("nickname"))
    restos["known-for"].append(r.get("known-for"))
    restos["likely-result"].append(r.get("likely-result"))
    restos["rating"].append(r.get("rating")) 
    restos["latitude"].append(r.get("latitude")) 
    restos["longitude"].append(r.get("longitude")) 


In [None]:
restos_df = pd.DataFrame(restos) # make a dataframe

# Use geopandas to take the raw floats of the lat lon and turn them into a geospatially meaningful object
restos_gdf = geopandas.GeoDataFrame(
    restos_df, geometry=geopandas.points_from_xy(restos_df.longitude, restos_df.latitude))

# POINT
print(restos_gdf)

In [None]:
import matplotlib.pyplot as plt
# Get the built in basic outlines... (geopandas.datasets was deprecated in geopandas 0.14 and removed in 1.0, so this cell needs an older geopandas)
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))

# We restrict to New Zealand
ax = world[world.name == 'New Zealand'].plot(color='white', edgecolor='black')

# We can now plot our ``GeoDataFrame``.
restos_gdf.plot(ax=ax, color='red')

plt.show()

### Things we can do with geopandas
- Geopandas is primarily used for data manipulation. 
- Most of the analysis and cartographic operations are handled by other packages that integrate well with pandas 
- Geopandas is great with spatial data management though, and this means we can
  - Join data (table join and spatial join) 
- Today we are just going to cover the basics. 
  - Getting data into geopandas
    - Create some data from scratch (done!)
    - Open some data in a csv file with x,y locations
    - Open a shapefile
  - Style the data with plots and background maps
    - Take a look at folium
  - Adding data to spatial data
    - Table join
    - Spatial join
  - Basic geo-stats
    - Just some little things for your assignments. nothing major.
    - Area, length, overlay, search
  
    
  

### Loading data
- Really easy to do with geopandas.
- Here is how you load a shapefile
  - Note here that we are actually loading a zip file!
    - This is simply awesome that we can do this, as it means we no longer have to mess about with .shp .shx .prj .dbf
    - You can actually store your entire dataset in a zip file with multiple folders and datasets. it is simply fantastic!


In [None]:
# Requires fiona, conda install fiona
import geopandas
data_location = "zip://data/doc-tracks.zip" # a string to hold the file location (relative path!)
tracks_gdf = geopandas.read_file(data_location) # read the file
tracks_gdf.head()

## helpful tip!
# if you only want to test that things are working you can tell geopandas to only load a few spatial features
# all you have to do is send in the rows argument
# https://geopandas.org/docs/user_guide/io.html
#test_tracks_gdf = geopandas.read_file(data_location, rows=10)

- And we can have a look at the data too

In [None]:
# tracks_gdf.head() # show me the rows

import matplotlib.pyplot as plt
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))

# We restrict to New Zealand - a filter on the name attribute, just like in plain pandas
ax = world[world.name == 'New Zealand'].plot(
    color='white', edgecolor='black')
tracks_gdf.plot(ax=ax, color='red')
plt.show()


### The matplotlib map looks cool in a retro way
- What we really want is to make it look like a modern web map
- We can use the folium library
- Install folium using

<p style="background:black">
<code style="background:black;color:white">(base) C:\Users\YOUR_USERNAME> pip install folium 
</code>
</p>

- or if you are using the geo_env (on a lab computer)

<p style="background:black">
<code style="background:black;color:white">(geo_env) C:\Users\YOUR_USERNAME> conda install -c conda-forge folium 
</code>
</p>

- Folium also works great with the Python API for GEE map display, just for those folks interested in that side of things

In [None]:
import pandas as pd
import geopandas
import folium # this requires folium to be installed. from cmd.exe prompt use: pip install folium
import matplotlib.pyplot as plt

from shapely.geometry import Point


In [None]:
# Other maps to try: tiles='OpenStreetMap' , 'cartodbpositron'
# (the Stamen tile sets are no longer built into recent folium releases)
tracks_map = folium.Map(location = [-39.8,174.7], tiles = "OpenStreetMap", zoom_start = 6)

# folium is a lot like leaflet
# folium's GeoJson layer takes GeoJSON, so we convert the geodataframe first
gjson = tracks_gdf.to_crs(epsg=4326).to_json() # convert to geojson, and make sure the shapefile is in WGS84 (SRID:4326)
lines = folium.GeoJson(gjson) # use the geojson variable, and create the features in folium format

tracks_map.add_child(lines) # add the data to folium, 'child' is what folium calls layers
tracks_map # show the map, ooo it be interactive!


- If you want to know more, I found [this website useful](https://ocefpaf.github.io/python4oceanographers/blog/2015/12/14/geopandas_folium/)

### Geodataframes are pandas dataframes too
- Because the geodataframe is built on top of the pandas library, we can actually do everything we could do with pandas before.
  - For example, we could describe and plot!

In [None]:
tracks_gdf.describe()


In [None]:
tracks_gdf["STLength"].plot.line()


### What about csv, thanks.
- Yes csv are totally possible too. 
  - Open the csv as a pandas dataframe
  - Convert to geopandas 
  - Et voilà!

In [None]:
file_location = 'data/myLocations.csv'

import pandas as pd

myLocations_df = pd.read_csv(file_location)

myLocations_df


In [None]:
myLocations_gdf = geopandas.GeoDataFrame(myLocations_df,geometry=geopandas.points_from_xy(myLocations_df.x,myLocations_df.y),crs='EPSG:4326')
myLocations_gdf.head()


In [None]:
# other maps to try: tiles='OpenStreetMap' , 'cartodbpositron' (the Stamen tile sets are no longer built into recent folium releases)
myLocations_map = folium.Map(location = [-39.8,174.7], tiles = "cartodbpositron", zoom_start = 6)

myLocations_gjson = myLocations_gdf.to_json() # convert to geojson (it is already in wgs84)
myLocations_points = folium.features.GeoJson(myLocations_gjson) # use the geojson variable, and create the features in folium format

myLocations_map.add_child(myLocations_points) # add the data to folium
myLocations_map # show the map

# fun fact! 
# if you need to create random coordinates, use the following excel equations
# =RANDBETWEEN(-4160,-3700)/100
# =RANDBETWEEN(17300,17800)/100

## Adding data using a join
- Before now, we haven't used any kind of data joins. 
- But, as you may know, joining data is a big thing in GIS
  - Joins allow us to link multiple datasets together
  - Joins allow us to easily make non-spatial data -> spatial

- Doing table joins is actually not a geopandas thing, it's a pandas thing.
  - We could have done this before the break, but we didn't really need to 
  - Using pandas, we can do table joins relatively easily.
- Lets consider the previous geopandas data table, and some data we want to join to it  

In [None]:
myLocations_gdf


In [None]:
myLocations_joinData_df = pd.read_csv('data/myLocations_tableJoin.csv')
myLocations_joinData_df


### In Geopandas we use merge for a table join
- Furthermore, we use a 'left join'
  - A left join means that the features on the left (original) table will be preserved
  - More on this when we talk about databases...
- The operation is called 'on' the GeoDataFrame, which makes it the left table
- The result of the operation is a new GeoDataFrame
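
A minimal sketch of what a left join preserves, using two small made-up tables (the names and values below are invented for illustration and are not the lecture's CSV files). Every row of the left table survives, and rows with no match on the right are filled with NaN:

```python
import pandas as pd

# Hypothetical "left" table: the rows we want to keep
left_df = pd.DataFrame({"name": ["Alpha", "Bravo", "Charlie"], "x": [174.7, 175.1, 176.0]})

# Hypothetical "right" table: extra attributes, with no entry for "Charlie"
right_df = pd.DataFrame({"location_name": ["Alpha", "Bravo", "Delta"], "visitors": [10, 20, 30]})

# Left join: keep all rows of left_df, attach matching columns from right_df
merged_df = left_df.merge(right_df, how="left", left_on="name", right_on="location_name")
print(merged_df)
```

Note that "Delta" exists only in the right table, so it is dropped, while "Charlie" is kept with empty values.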

In [None]:
# The joining fields have different names in the two tables, so we name each one with left_on and right_on.
# (If both tables used the same field name, a single on='...' would be enough.)
joined_gdf = myLocations_gdf.merge(myLocations_joinData_df, how='left', left_on='name', right_on='location_name')
joined_gdf


### Okay, but we want spatial joins!
- We are a GIS class after all.
- In the data folder, I've included the SA2 (Statistical Area 2) shapefile
- In our spatial join, we are going to append the names of the SA2s to the random locations that we've been adding to above

In [None]:
# First we need to open our shapefile
data_location = "zip://data/SA2.zip" # a string to hold the file location

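# Note that we are reading it straight from the zip. PANDAS POWER!
SA2_gdf = geopandas.read_file(data_location) # read the file (read_file does not reproject, so we do that next)
SA2_gdf = SA2_gdf.to_crs(epsg=4326) # reproject to WGS84 so it matches our points (assumes the shapefile has a CRS defined)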
SA2_gdf.head()


- The operation name for a spatial join is sjoin
  - Info on [sjoin](https://geopandas.org/docs/user_guide/mergingdata.html)
  - sjoin takes the parameter 'predicate' (called 'op' in GeoPandas releases before 0.10) to represent the topology rule for the spatial join
    - For example: intersects, contains, within, touches, overlaps
  - It also takes 'how'; as before, we'll use 'left' for a left join
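
As a rough sketch of how the topology rule and the left join interact, here is a tiny spatial join on invented geometry (the zone and point names are hypothetical). It assumes a recent GeoPandas, where the rule is passed as `predicate`; older releases call the same parameter `op`:

```python
import geopandas
from shapely.geometry import Point, Polygon

# Two hypothetical square "zones" in WGS84 (EPSG:4326)
zones_gdf = geopandas.GeoDataFrame(
    {"zone_name": ["West", "East"]},
    geometry=[
        Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
        Polygon([(2, 0), (3, 0), (3, 1), (2, 1)]),
    ],
    crs="EPSG:4326",
)

# Three hypothetical points: one in each zone, one outside both
points_gdf = geopandas.GeoDataFrame(
    {"name": ["a", "b", "c"]},
    geometry=[Point(0.5, 0.5), Point(2.5, 0.5), Point(5, 5)],
    crs="EPSG:4326",
)

# Left join: every point is kept; points outside all zones get empty zone attributes
points_with_zone = geopandas.sjoin(points_gdf, zones_gdf, how="left", predicate="within")
print(points_with_zone[["name", "zone_name"]])
```

Both layers share the same CRS here, which is what sjoin expects; reproject one of them with `to_crs` first if they differ.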

In [None]:
# Note the left join again, and the spatial style of operation (predicate, called op in older GeoPandas releases)
sjoined_gdf = geopandas.sjoin(joined_gdf, SA2_gdf, predicate='intersects', how='left')

sjoined_gdf.head(10)


### Last Thing.
- Can we do analysis with geopandas?
  - Yes we can :)
  - buffer
  - dissolve
  - area/length

- Back to our trusty doc-tracks

In [None]:
import geopandas
data_location = "zip://data/doc-tracks.zip" # a string to hold the file location
tracks_gdf = geopandas.read_file(data_location) # read the file
tracks_gdf = tracks_gdf.head(100) # but really, let's not go overboard: keep only the first 100 tracks
tracks_gdf


### Buffer
- This falls under ['Geometric Manipulations'](https://geopandas.org/docs/user_guide/geometric_manipulations.html)
  - Other similar operations include:
    - envelope (extent)
    - centroid (center)
    - simplify
    - intersection

In [None]:
# first change to NZTM 2000 so we can measure properly in meters
tracks_gdf = tracks_gdf.to_crs(epsg="2193")


In [None]:
# buffer the tracks, in map units (will be defined by your CRS)
buffer_tracks = tracks_gdf.buffer(1000)


In [None]:
# Other tiles to try: tiles='OpenStreetMap' , tiles='cartodbpositron' (Stamen tiles are no longer built into recent folium releases)
buffer_map = folium.Map(location = [-39.8,174.7], tiles = "OpenStreetMap", zoom_start = 6) # named buffer_map so we don't shadow Python's built-in map()

gjson = buffer_tracks.to_crs(epsg='4326').to_json() # convert to geojson, and make sure the data is in WGS84 (SRID:4326)
polygons = folium.features.GeoJson(gjson) # use the geojson variable, and create the features in folium format

buffer_map.add_child(polygons) # add the data to folium
buffer_map # show the map


### Dissolve
- Dissolve is both a spatial dissolve and an aggregate function, like groupby.
  - Here we give it an attribute field (series) to group by, even though we don't really need the groups. (Recent GeoPandas versions also merge everything into one feature if you call dissolve() with no 'by' argument.)
  - We can append a series that just equals 1 for the whole column for our purposes.
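
To see the dummy-field idea in isolation, here is a small sketch on invented geometry: two overlapping circles (made by buffering two hypothetical points) are dissolved into a single feature. No CRS is set, so the units are arbitrary:

```python
import geopandas
from shapely.geometry import Point

# Two hypothetical points 1 unit apart, buffered by 1 unit so the circles overlap
circles = geopandas.GeoDataFrame(
    geometry=geopandas.GeoSeries([Point(0, 0), Point(1, 0)]).buffer(1)
)

circles["disfield"] = 1  # dummy value shared by every row
merged = circles.dissolve(by="disfield")  # one output row per unique value of disfield

print(len(circles), "->", len(merged))
print(merged.geometry.iloc[0].geom_type)
```

Because the circles overlap, the dissolved area is smaller than the sum of the two original areas.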
  

In [None]:
buffer_tracks = geopandas.GeoDataFrame(geometry=buffer_tracks.geometry)

buffer_tracks['disfield'] = 1 # dummy value
b = buffer_tracks.dissolve(by='disfield') # Note the type of geometry that this generates
b.head()


In [None]:
# Other tiles to try: tiles='OpenStreetMap' , tiles='cartodbpositron' (Stamen tiles are no longer built into recent folium releases)
dissolve_map = folium.Map(location = [-39.8,174.7], tiles = "OpenStreetMap", zoom_start = 6) # named dissolve_map so we don't shadow Python's built-in map()

# folium is a lot like leaflet
# here we hand folium the data as geojson
gjson = b.to_crs(epsg='4326').to_json() # convert to geojson, and make sure the data is in WGS84 (SRID:4326)
polygons = folium.features.GeoJson(gjson) # use the geojson variable, and create the features in folium format

dissolve_map.add_child(polygons) # add the data to folium
dissolve_map # show the map


### Area and Length
- These are built in and easy to use
- For the area of a polygon:

In [None]:
buffer_tracks.head(5)
buffer_tracks.area


In [None]:
buffer_tracks["area"] = buffer_tracks.area


In [None]:
buffer_tracks.head(5)


### Length
- Length is the same thing, but we need to use a line
  - We can use the tracks file

In [None]:
data_location = "zip://data/doc-tracks.zip" # a string to hold the file location
tracks_gdf = geopandas.read_file(data_location) # read the file
tracks_gdf.head()

final_example = tracks_gdf.head(10).to_crs(epsg='4326') # Expect a red box to pop up due to this...
final_example["length_of_track"] = final_example.length
final_example


### Wait a second... 
- Those length values are off. Why?
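
One way to see the problem: EPSG:4326 stores coordinates in degrees, so a length computed in that CRS is also in degrees, and a degree is not a fixed distance on the ground. The sketch below uses a rough spherical-Earth approximation (about 111.32 km per degree at the equator, an assumed constant) to show how the ground distance covered by one degree of longitude shrinks towards the poles:

```python
import math

KM_PER_DEGREE_AT_EQUATOR = 111.32  # approximate value, spherical-Earth assumption


def km_per_degree_of_longitude(latitude_deg):
    """Approximate ground distance (km) covered by one degree of longitude."""
    return KM_PER_DEGREE_AT_EQUATOR * math.cos(math.radians(latitude_deg))


# Equator, roughly Wellington's latitude, roughly Stewart Island's latitude
for lat in (0, -41, -47):
    print(lat, round(km_per_degree_of_longitude(lat), 1))
```

This is why we measure in a projected CRS with metre units, such as NZTM 2000 (EPSG:2193).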

In [None]:
data_location = "zip://data/doc-tracks.zip" # a string to hold the file location
tracks_gdf = geopandas.read_file(data_location) # read the file
tracks_gdf.head()
# first change to NZTM 2000 so we can measure properly in meters
tracks_gdf = tracks_gdf.to_crs(epsg="2193")


final_example = tracks_gdf.head(10).copy() # easier if we make a copy of the object to keep things clean...
final_example["length_of_track"] = final_example.length
final_example


### Searching for a specific item in the Geopandas Dataframe
- One of the important things you can do with pandas is to use the 'loc' indexer. It allows us to 'search' in our data.
  - You can find a specific name, or evaluate a boolean.
    - The boolean operations allow us to 'select' data, e.g. find all highways with a speed limit of 100 km/h or higher.
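
The highway example can be sketched with plain pandas on a small invented table (the road names and speed limits are made up for illustration). The same pattern works on a GeoDataFrame:

```python
import pandas as pd

# Hypothetical road table, invented for illustration
roads_df = pd.DataFrame({
    "road": ["SH1", "SH2", "Queen Street", "SH16"],
    "kind": ["highway", "highway", "street", "highway"],
    "speed_limit": [100, 80, 30, 100],
})

# Find a specific row by name
queen_street = roads_df.loc[roads_df["road"] == "Queen Street"]

# Combine two boolean conditions with & (each one wrapped in parentheses)
fast_highways = roads_df.loc[(roads_df["kind"] == "highway") & (roads_df["speed_limit"] >= 100)]
print(fast_highways)
```

The expression inside the square brackets is a boolean series, and loc keeps only the rows where it is True.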

In [None]:
data_location = "zip://data/SA2.zip" # A string to hold the file location
SA2_gdf = geopandas.read_file(data_location) # Read the file
poly = SA2_gdf.loc[SA2_gdf['SA22018__1']=='Coromandel'] # Find a specific SA2 by name
poly


In [None]:
# Find all of the SA2s with a really large land area
poly = SA2_gdf.loc[SA2_gdf['LAND_AREA_']>5000]
poly


### And that's all, folks!
- We covered a lot of ground today
  - Dictionaries
  - Pandas
  - Geopandas

# Homework

### 1. Create a pandas dataframe from the dictionary in the cell below, then tell me Michael's average in the series "karma". Round down to the nearest whole number.

In [None]:
postdata = {'posts': [{'author': 'Michael', 'postid': 22, 'karma': 15},
  {'author': 'Michael', 'postid': 23, 'karma': 15},
  {'author': 'Michael', 'postid': 25, 'karma': 6},
  {'author': 'Michael', 'postid': 26, 'karma': 200},
  {'author': 'Michael', 'postid': 27, 'karma': 76},
  {'author': 'Michael', 'postid': 28, 'karma': 2},
  {'author': 'Sila', 'postid': 29, 'karma': 73},
  {'author': 'Michael', 'postid': 30, 'karma': 3},
  {'author': 'Michael', 'postid': 31, 'karma': 1},
  {'author': 'Michael', 'postid': 32, 'karma': 5},
  {'author': 'Michael', 'postid': 33, 'karma': 15},
  {'author': 'Michael', 'postid': 34, 'karma': 54},
  {'author': 'Amber', 'postid': 35, 'karma': 15},
  {'author': 'Michael', 'postid': 36, 'karma': 16},
  {'author': 'Michael', 'postid': 37, 'karma': 65},
  {'author': 'Michael', 'postid': 38, 'karma': 25},
  {'author': 'Michael', 'postid': 39, 'karma': 66},
  {'author': 'Michael', 'postid': 40, 'karma': 12},
  {'author': 'Amber', 'postid': 41, 'karma': 32},
  {'author': 'Michael', 'postid': 42, 'karma': 61},
  {'author': 'Michael', 'postid': 43, 'karma': 63},
  {'author': 'Michael', 'postid': 44, 'karma': 78},
  {'author': 'Sila', 'postid': 45, 'karma': 63},
  {'author': 'Michael', 'postid': 46, 'karma': 98},
  {'author': 'Michael', 'postid': 47, 'karma': 97},
  {'author': 'Michael', 'postid': 48, 'karma': 16},
  {'author': 'Sila', 'postid': 49, 'karma': 96},
  {'author': 'Michael', 'postid': 50, 'karma': 22},
  {'author': 'Michael', 'postid': 51, 'karma': 33},
  {'author': 'Michael', 'postid': 52, 'karma': 6},
  {'author': 'Amber', 'postid': 53, 'karma': 66},
  {'author': 'Michael', 'postid': 54, 'karma': 47},
  {'author': 'Michael', 'postid': 55, 'karma': 32},
  {'author': 'Michael', 'postid': 56, 'karma': 15}]}

### 2. In the supplied zip file "doc-regions" (in the data directory), what is the name of the smallest region by area?