# Section 04: Spatial Joins and Handling Missing Data

In [None]:
# Import libraries
import os
import matplotlib.pyplot as plt
import matplotlib.lines as mlines
from matplotlib.colors import ListedColormap
import numpy as np
import pandas as pd
from shapely.geometry import box
import geopandas as gpd

In [None]:
# Import data
data_path = os.path.join("data/")

country_bound_us = gpd.read_file(os.path.join(data_path, "usa", 
                                              "usa-boundary-dissolved.shp"))
                                 
state_bound_us = gpd.read_file(os.path.join(data_path, "usa", 
                                            "usa-states-census-2014.shp"))
                               
pop_places = gpd.read_file(os.path.join(data_path, "global", 
                                        "ne_110m_populated_places_simple", 
                                        "ne_110m_populated_places_simple.shp"))
                                        
ne_roads = gpd.read_file(os.path.join(data_path, "global", 
                                      "ne_10m_roads", "ne_10m_roads.shp"))

Next, dissolve the state data by region as you did in the previous lesson.



In [None]:
# Simplify the country boundary just a little bit to make this run faster
country_bound_us_simp = country_bound_us.simplify(.2, preserve_topology=True)

# Clip the roads to the US boundary - this will take about a minute to execute
roads_cl = gpd.clip(ne_roads, country_bound_us_simp)
roads_cl.crs = ne_roads.crs

# Dissolve states by region
regions_agg = state_bound_us.dissolve(by="region")

### Spatial Joins in Python
Just as you might in ArcMap or QGIS, you can perform spatial joins in Python. A spatial join appends the attributes of one layer to another based on the spatial relationship between their features.

For example, if you have a roads layer for the United States and you want to assign the “region” attribute to every road that falls in a particular region, you would use a spatial join. To apply a join, you can use the ```geopandas.sjoin()``` function as follows:

```.sjoin(layer-to-add-region-to, region-polygon-layer)```

```sjoin``` arguments:

The ```op``` argument specifies the type of spatial relationship used for the join (geopandas 0.10 and later name this argument ```predicate```):

- ```intersects```: Returns True if the boundary and interior of the object intersect in any way with those of the other.
- ```within```: Returns True if the object’s boundary and interior intersect only with the interior of the other (not its boundary or exterior).
- ```contains```: Returns True if the object’s interior contains the boundary and interior of the other object and their boundaries do not touch at all.

[You can read more about each type here.](https://shapely.readthedocs.io/en/stable/manual.html?highlight=binary%20predicates#binary-predicates)
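
To make the three relationships above concrete, here is a minimal sketch that calls the matching ```shapely``` predicates directly. The square "region" and the toy features are invented for illustration and are not part of the lesson data.

```python
from shapely.geometry import LineString, Point, box

# Hypothetical square "region" and three toy features (not lesson data)
region = box(0, 0, 10, 10)
inside_pt = Point(5, 5)                         # lies in the region's interior
crossing_road = LineString([(5, 5), (15, 5)])   # starts inside, ends outside
outside_pt = Point(20, 20)                      # lies completely outside

predicate_results = {
    "inside_pt intersects region": inside_pt.intersects(region),
    "inside_pt within region": inside_pt.within(region),
    "region contains inside_pt": region.contains(inside_pt),
    "crossing_road intersects region": crossing_road.intersects(region),
    "crossing_road within region": crossing_road.within(region),
    "outside_pt intersects region": outside_pt.intersects(region),
}

for name, value in predicate_results.items():
    print(f"{name}: {value}")
```

The road that crosses the region boundary intersects the region but is not within it, which is why ```intersects``` is usually the more forgiving choice for line features.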

The ```how``` argument accepts the following options (taken directly from the [geopandas code on GitHub](https://github.com/geopandas/geopandas/blob/main/geopandas/tools/sjoin.py#L18)):

- ```left```: use keys from left_df; retain only left_df geometry column
- ```right```: use keys from right_df; retain only right_df geometry column
- ```inner```: use intersection of keys from both dfs; retain only left_df geometry column
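
The difference between ```how="inner"``` and ```how="left"``` is easiest to see on a tiny example. The sketch below uses two invented layers (```regions``` and ```places``` are hypothetical names, not lesson data) and relies on the default spatial relationship, which is ```intersects```.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Hypothetical toy layers: two square "regions" and three "places"
regions = gpd.GeoDataFrame(
    {"region": ["West", "East"]},
    geometry=[box(0, 0, 10, 10), box(10, 0, 20, 10)],
    crs="EPSG:4326",
)
places = gpd.GeoDataFrame(
    {"name": ["a", "b", "c"]},
    geometry=[Point(2, 2), Point(15, 5), Point(30, 30)],  # "c" is in no region
    crs="EPSG:4326",
)

# Inner join: keep only the places that fall in some region
inner = gpd.sjoin(places, regions, how="inner")

# Left join: keep every place; unmatched places get missing region attributes
left = gpd.sjoin(places, regions, how="left")

print(inner[["name", "region"]])
print(left[["name", "region"]])
```

With the inner join, place "c" disappears from the result. With the left join, it is kept and its ```region``` value is missing.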

In [None]:
# Roads within region
roads_region = gpd.sjoin(roads_cl, 
                         regions_agg, 
                         how="inner", 
                         op='intersects')

# Notice that once you have joined the data, attributes from the
# regions_agg object (i.e. the region) are attached to each road feature
roads_region[["featurecla", "index_right", "ALAND"]].head()

Reproject and plot the data

In [None]:
# Reproject to Albers for plotting
country_albers = country_bound_us.to_crs('epsg:5070')
roads_albers = roads_region.to_crs('epsg:5070')

In [None]:
# Plot the data
fig, ax = plt.subplots(figsize=(12, 8))

country_albers.plot(alpha=1,
                    facecolor="none",
                    edgecolor="black",
                    zorder=10,
                    ax=ax)

roads_albers.plot(column='index_right',
                  ax=ax,
                  legend=True)

# Adjust legend location
leg = ax.get_legend()
leg.set_bbox_to_anchor((1.15,1))

ax.set_axis_off()
plt.axis('equal')
plt.show()

If you want to customize your legend even further, you can once again use loops to do so.



In [None]:
# First, create a dictionary with the attributes of each legend item
road_attrs = {'Midwest': ['black'],
              'Northeast': ['grey'],
              'Southeast': ['m'],
              'Southwest': ['purple'],
              'West': ['green']}

# Plot the data
fig, ax = plt.subplots(figsize=(12, 8))

# Reproject the regions so they share a CRS with the other layers
regions_agg.to_crs('epsg:5070').plot(edgecolor="black",
                                     ax=ax)
country_albers.plot(alpha=1,
                    facecolor="none",
                    edgecolor="black",
                    zorder=10,
                    ax=ax)

for ctype, data in roads_albers.groupby('index_right'):
    data.plot(color=road_attrs[ctype][0],
              label=ctype,
              ax=ax)
    
# This approach works to place the legend when you have defined labels
plt.legend(bbox_to_anchor=(1.0, 1), loc=2)
ax.set_axis_off()
plt.axis('equal')
plt.show()

### Calculate Line Segment Length


In [None]:
# Turn off scientific notation
pd.options.display.float_format = '{:.4f}'.format

# Sum the existing length_km attribute for each region
road_albers_length = roads_albers[['index_right', 'length_km']]
road_albers_length.groupby('index_right').sum()

# Calculate each segment's length from its geometry; the units are
# meters because EPSG:5070 is a projected CRS measured in meters
roads_albers['rdlength'] = roads_albers.length
sub = roads_albers[['rdlength', 'index_right']].groupby('index_right').sum()
sub

## Handling Missing Data
This lesson covers how to rename and clean up attribute data using **geopandas**.



In [None]:
# Import roads shapefile
sjer_roads_path = os.path.join("data", "california", "madera-county-roads",
                               "tl_2013_06039_roads.shp")
sjer_roads = gpd.read_file(sjer_roads_path)

type(sjer_roads)

### Explore Data Values

There are several ways to use ```pandas``` to explore your data and determine if you have any missing values.

- To find the number of missing values per column in a DataFrame, you can run ```dfname.isnull().sum()```
- To look at the unique values for a specific column of a DataFrame, you can run ```dfname['column'].unique()```
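
As a quick illustration of both checks, here is a minimal sketch on a made-up attribute table. The column names mirror the roads data, but the values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical attribute table with missing values and an empty string
roads = pd.DataFrame({
    "FULLNAME": ["Road 1", None, "Road 3", "Road 4"],
    "RTTYP": ["M", "S", "", np.nan],
})

# Count the missing values (None or NaN) in each column
missing_per_column = roads.isnull().sum()
print(missing_per_column)

# List the distinct values in one column, including any empty strings
print(roads["RTTYP"].unique())
```

Note that the empty string is not counted as missing by ```.isnull()```, so it only shows up when you inspect the unique values.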



In [None]:
sjer_roads.isnull().sum()


Based on this method, there are no ```NaN``` or ```None``` type objects as values in the ```GeoDataFrame```. Double check the unique values in the road type column.



In [None]:
# View data type 
print(type(sjer_roads['RTTYP']))

# View unique attributes for each road in the data
print(sjer_roads['RTTYP'].unique())

### Replacing Values
- If the value you want to replace is a ```NaN``` or ```NoneType```, you can use ```dfname.loc[dfname['column'].isnull(), 'column'] = 'newvalue'```

- Or you can use the pandas ```.fillna()``` method, which takes the value that you want to use in place of the missing ones.
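
The two approaches in the list above can be sketched on a small, made-up column. The values are invented for illustration, and both options are expected to give the same result.

```python
import numpy as np
import pandas as pd

# Hypothetical road-type column containing missing values
roads = pd.DataFrame({"RTTYP": ["M", np.nan, "S", None]})

# Option 1: select the missing cells with a boolean mask, then assign
by_mask = roads.copy()
by_mask.loc[by_mask["RTTYP"].isnull(), "RTTYP"] = "Unknown"

# Option 2: let fillna() find and replace the missing cells
by_fillna = roads.copy()
by_fillna["RTTYP"] = by_fillna["RTTYP"].fillna("Unknown")

print(by_mask["RTTYP"].tolist())
print(by_fillna["RTTYP"].tolist())
```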

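Hmmmm, there’s a road type that’s given an empty ```string``` as a name. It would be helpful to fix this before doing more analysis or mapping with this dataset.

There are several ways to deal with this issue. One is to use the ```.fillna()``` method, as in the cell below, to replace all instances of ```None``` in the attribute data with some new value. In this case, you will use ‘Unknown’. Note that ```.fillna()``` only replaces missing values (```None``` or ```NaN```); an empty string is an ordinary value and has to be replaced with ```.replace()``` instead.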

In [None]:
# Replace missing values in the road type column with 'Unknown'
sjer_roads["RTTYP"] = sjer_roads["RTTYP"].fillna("Unknown")
print(sjer_roads['RTTYP'].unique())

Alternatively, you can use the ```.isnull()``` function to select all attribute cells with a value equal to ```null``` and set those to ‘Unknown’.

If the value you want to change is not ```NaN``` or a ```NoneType```, then you will have to specify the original value that you want to change, for example with the ```.replace()``` method.

In [None]:
sjer_roads.head()
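
For a value that is not missing, such as an empty string, ```.replace()``` is one option. The sketch below runs on a made-up column (the values are invented) and also shows that ```.fillna()``` leaves empty strings untouched.

```python
import pandas as pd

# Hypothetical road-type column where the unknown type is an empty string
roads = pd.DataFrame({"RTTYP": ["M", "", "S", ""]})

# fillna() does nothing here because an empty string is not a missing value
after_fillna = roads["RTTYP"].fillna("Unknown")

# replace() swaps the exact original value for the new one
after_replace = roads["RTTYP"].replace("", "Unknown")

print(after_fillna.tolist())
print(after_replace.tolist())
```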


### Removing Values


In some specific instances, you will want to remove ```NaN``` values from your ```DataFrame```. To do this, you can use the ```pandas``` ```.dropna()``` method. Note that, by default, this method removes every row that has a ```NaN``` value in any of its columns.
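
A minimal sketch of this behavior on a made-up table follows. The column names mirror the roads data, but the values are invented. The ```subset``` argument is shown as one way to limit the check to specific columns.

```python
import numpy as np
import pandas as pd

# Hypothetical attribute table with one missing value in each column
roads = pd.DataFrame({
    "FULLNAME": ["Road 1", None, "Road 3"],
    "RTTYP": ["M", "S", np.nan],
})

# Default behavior: drop any row with a missing value in any column
all_complete = roads.dropna()

# Only drop rows where the road type itself is missing
rttyp_complete = roads.dropna(subset=["RTTYP"])

print(all_complete)
print(rttyp_complete)
```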