# Introduction
We will learn about two common manipulations for geospatial data: geocoding and table joins.

## Geocoding
Geocoding is the process of converting the name of a place or an address to a location on a map. Looking up a place by name in Google Maps is an everyday example.

In [1]:
from geopandas.tools import geocode

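To use the geopandas geocoder, we need to provide:<br>
1) The name or address as a Python string<br>
2) The name of the provider. To avoid having to provide an API key, we will use the OpenStreetMap Nominatim geocoder.<br>
<br>
If the geocoding is successful, it returns a GeoDataFrame with 2 columns:<br>
1) The "geometry" column, which contains the location as a Point (x is the longitude, y is the latitude)<br>
2) The "address" column, which contains the full address<br>
<br>
geopandas delegates the lookup to geopy, whose Nominatim class can also be called directly. In that case the result is a geopy Location object rather than a GeoDataFrame.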
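As a minimal sketch of the geopandas call described above: the helper name `geocode_place` is hypothetical, and the `user_agent` value is simply reused from the geopy cell in this notebook. Calling the helper needs network access to the Nominatim service, so it is only defined here and not run.

```python
from geopandas.tools import geocode


def geocode_place(name, user_agent="ad_application"):
    """Geocode one place name with the OpenStreetMap Nominatim provider.

    geocode_place is a hypothetical helper for illustration. Extra keyword
    arguments such as user_agent are passed on to the geopy geocoder.
    """
    # On success this returns a GeoDataFrame with "geometry" and "address" columns
    return geocode(name, provider="nominatim", user_agent=user_agent)


# Usage (requires network access):
# result = geocode_place("Taj mahal")
# point = result.geometry.iloc[0]   # point.y is the latitude, point.x the longitude
```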

In [2]:
from geopy.geocoders import Nominatim

nom = Nominatim(user_agent='ad_application')
place = 'Taj mahal'
result = nom.geocode(place)
result

Location(Taj Mahal, Taj Mahal Internal Path, Taj Ganj, Agra, Uttar Pradesh, 282001, India, (27.1750123, 78.04209683661315, 0.0))

In [3]:
result.latitude

27.1750123

In [4]:
result.longitude

78.04209683661315
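The latitude and longitude printed above can be turned into a geometry object for later use in a GeoDataFrame. The sketch below assumes shapely is installed (it is a geopandas dependency); the variable name `taj_mahal` is ours. Note the argument order: shapely points are (x, y), which for geographic coordinates means (longitude, latitude).

```python
from shapely.geometry import Point

# Values copied from the geocoding result shown above
latitude, longitude = 27.1750123, 78.04209683661315

# Point takes (x, y), i.e. (longitude, latitude)
taj_mahal = Point(longitude, latitude)

print(taj_mahal.y, taj_mahal.x)  # latitude, then longitude
```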

A common use case is geocoding many different addresses. For instance, say we want to obtain the locations of 100 top universities in Europe.

In [5]:
import pandas as pd

universities = pd.read_csv('../Datasets/University.csv')
universities.head()

Unnamed: 0.1,Unnamed: 0,Universities
0,1,Andrews University
1,3,Alabama A&M University
2,4,Arizona State University (ASU)
3,5,Alabama State University
4,6,Athens State University


In [6]:
universities.drop('Unnamed: 0', axis=1, inplace=True)

In [7]:
universities.head()

Unnamed: 0,Universities
0,Andrews University
1,Alabama A&M University
2,Arizona State University (ASU)
3,Alabama State University
4,Athens State University


In [8]:
universities.Universities.dtype

dtype('O')

In [10]:
import numpy as np
import geopandas as gpd


In [19]:
from shapely.geometry import Point

def my_geocoder(name):
    # Geocode one place name; return NaN/None placeholders if the lookup fails
    try:
        location = nom.geocode(name)  # geopy Location, or None if nothing was found
        point = Point(location.longitude, location.latitude)  # Point takes (x, y) = (lon, lat)
        return pd.Series({'Latitude': point.y, 'Longitude': point.x, 'geometry': point})
    except Exception:
        return pd.Series({'Latitude': np.nan, 'Longitude': np.nan, 'geometry': None})

universities[['Latitude', 'Longitude', 'geometry']] = universities['Universities'].apply(my_geocoder)

print("{}% of addresses were geocoded!".format(
    universities["Latitude"].notna().mean() * 100))

# Drop universities that were not successfully geocoded
universities = universities.loc[universities["Latitude"].notna()]
universities = gpd.GeoDataFrame(universities, geometry=universities.geometry, crs='EPSG:4326')
universities.head()

## Table joins
We will see how to combine data from different sources.<br>
### Attribute join
To combine information from multiple DataFrames with a shared index, we use pd.DataFrame.join(). We refer to this way of joining data, by simply matching values in the index, as an attribute join.<br>
When performing an attribute join with a GeoDataFrame, it's best to use gpd.GeoDataFrame.merge().
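To make the difference between matching on the index and matching on a column concrete, here is a small pandas-only sketch. The two toy tables are assumptions made up for illustration; the country figures are taken from the europe_stats table in this notebook.

```python
import pandas as pd

# Two small tables that describe overlapping sets of countries
population = pd.DataFrame({"name": ["Russia", "Norway", "France"],
                           "pop_est": [142257519, 5320045, 67106161]})
gdp = pd.DataFrame({"name": ["France", "Russia", "Sweden"],
                    "gdp_md_est": [2699000.0, 3745000.0, 498100.0]})

# join() matches rows on the index, so the shared key must be the index first.
# The default is a left join: Norway is kept, with a missing gdp_md_est.
joined = population.set_index("name").join(gdp.set_index("name"))

# merge() matches rows on a named column. The default is an inner join:
# only countries present in both tables remain.
merged = population.merge(gdp, on="name")

print(joined)
print(merged)
```

With merge(), the key stays an ordinary column, which is convenient when the table also carries a geometry column.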

In [14]:
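# Note: gpd.datasets was deprecated in geopandas 0.14 and removed in 1.0;
# on newer versions, load a local copy of the Natural Earth data instead
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))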
europe = world.loc[world.continent == 'Europe'].reset_index(drop=True)

europe_stats = europe[['name', 'pop_est', 'gdp_md_est']]
europe_boundaries = europe[['name', 'geometry']]

In [15]:
europe_boundaries.head()

Unnamed: 0,name,geometry
0,Russia,"MULTIPOLYGON (((178.725 71.099, 180.000 71.516..."
1,Norway,"MULTIPOLYGON (((15.143 79.674, 15.523 80.016, ..."
2,France,"MULTIPOLYGON (((-51.658 4.156, -52.249 3.241, ..."
3,Sweden,"POLYGON ((11.027 58.856, 11.468 59.432, 12.300..."
4,Belarus,"POLYGON ((28.177 56.169, 29.230 55.918, 29.372..."


In [16]:
europe_stats.head()

Unnamed: 0,name,pop_est,gdp_md_est
0,Russia,142257519,3745000.0
1,Norway,5320045,364700.0
2,France,67106161,2699000.0
3,Sweden,9960487,498100.0
4,Belarus,9549747,165400.0


We will join europe_boundaries with europe_stats, which contains the estimated population and gross domestic product for each country.

We do the attribute join in the code cell below. The "on" argument is set to the column name that is used to match rows in europe_boundaries to rows in europe_stats.

In [17]:
# Use an attribute join to merge data about countries in Europe
europe = europe_boundaries.merge(europe_stats, on='name')
europe.head()

Unnamed: 0,name,geometry,pop_est,gdp_md_est
0,Russia,"MULTIPOLYGON (((178.725 71.099, 180.000 71.516...",142257519,3745000.0
1,Norway,"MULTIPOLYGON (((15.143 79.674, 15.523 80.016, ...",5320045,364700.0
2,France,"MULTIPOLYGON (((-51.658 4.156, -52.249 3.241, ...",67106161,2699000.0
3,Sweden,"POLYGON ((11.027 58.856, 11.468 59.432, 12.300...",9960487,498100.0
4,Belarus,"POLYGON ((28.177 56.169, 29.230 55.918, 29.372...",9549747,165400.0


### Spatial Join
Another type of join is a spatial join. With a spatial join, we combine GeoDataFrames based on the spatial relationship between the objects in their geometry columns. For instance, suppose we have a GeoDataFrame universities containing the geocoded addresses of European universities. We can use a spatial join to match each university to its corresponding country. We do this with gpd.sjoin().

In [18]:
# The syntax is
# european_universities = gpd.sjoin(universities, europe)

The spatial join above looks at the "geometry" columns in both GeoDataFrames. If a Point object from the universities GeoDataFrame intersects a Polygon object from the europe GeoDataFrame, the corresponding rows are combined and added as a single row of the european_universities GeoDataFrame. Countries without a matching university (and universities without a matching country) are omitted from the results.
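The behaviour described above can be checked on toy data. This is a sketch under stated assumptions: the two square "countries" and three point "universities" are invented for illustration and are not the notebook's datasets.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Hypothetical toy data: two square "countries" and three point "universities"
countries = gpd.GeoDataFrame(
    {"country": ["A", "B"]},
    geometry=[Polygon([(0, 0), (2, 0), (2, 2), (0, 2)]),
              Polygon([(3, 0), (5, 0), (5, 2), (3, 2)])],
    crs="EPSG:4326",
)
points = gpd.GeoDataFrame(
    {"university": ["U1", "U2", "U3"]},
    geometry=[Point(1, 1), Point(4, 1), Point(10, 10)],
    crs="EPSG:4326",
)

# Default join: only points that intersect a polygon are kept,
# so U3, which lies outside both squares, is dropped
matched = gpd.sjoin(points, countries)
print(matched[["university", "country"]])
```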

The gpd.sjoin() function is customizable for different types of joins, through the how and predicate arguments (predicate was named op in geopandas versions before 0.10). For instance, you can do the equivalent of a SQL left (or right) join by setting how='left' (or how='right').
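As a hedged illustration of these options, the sketch below runs a left join on invented toy data. It assumes a geopandas version where the spatial-relationship argument is called predicate (older releases call it op).

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Hypothetical toy data, not the notebook's datasets
countries = gpd.GeoDataFrame(
    {"country": ["A", "B"]},
    geometry=[Polygon([(0, 0), (2, 0), (2, 2), (0, 2)]),
              Polygon([(3, 0), (5, 0), (5, 2), (3, 2)])],
    crs="EPSG:4326",
)
points = gpd.GeoDataFrame(
    {"university": ["U1", "U2", "U3"]},
    geometry=[Point(1, 1), Point(4, 1), Point(10, 10)],
    crs="EPSG:4326",
)

# Left join: every point is kept; points outside all polygons get a missing country
left = gpd.sjoin(points, countries, how="left", predicate="within")
print(left[["university", "country"]])
```

Here U3 stays in the result with a missing country, whereas the default inner join would drop it.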