# **Operations on spatial data attributes using Python**

This lesson focuses on Python methods for dealing with attribute data of spatial layers, including: 

* getting descriptive summaries
* subsetting data based on conditions
* calculating or adding new columns to the attribute table
* sorting values in a column
* reclassifying values
* removing columns
* merging data from two tables based on a common field.

Let's use geopandas to read in and plot the shapefile for this tutorial:

In [None]:
import geopandas
america = geopandas.read_file('america.shp')
america.plot()

Several pandas methods are helpful for exploring the attribute data of a spatial layer, such as `head()`, `describe()` and `info()`.

`describe()` is useful to generate a summary with descriptive statistics for numeric columns, including central tendency and dispersion measures (e.g., average, standard deviation, quartiles):

In [None]:
america.describe()

`info()` prints a summary of the (Geo)DataFrame showing the number of entries (i.e., rows) and columns. 

Furthermore, `info()` prints each column name with the corresponding count of non-null values and its data type. 

In [None]:
america.info()
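
The three exploration methods mentioned above can be tried without the shapefile. The following is a minimal sketch on a hypothetical toy table (the `toy` DataFrame and its values are made up for illustration); the same calls should behave the same way on a GeoDataFrame, since it inherits them from pandas.

```python
import pandas as pd

# Hypothetical toy attribute table; the values are made up for illustration
toy = pd.DataFrame({
    'CNTRY_NAME': ['A', 'B', 'C', 'D'],
    'ID_CNTRY': [3, 1, 4, 2],
    'SQMI_CNTRY': [10.0, 20.0, 30.0, 40.0],
})

# head(n) returns the first n rows (5 when n is omitted)
first_rows = toy.head(2)

# describe() summarises numeric columns only, so 'CNTRY_NAME' is left out
summary = toy.describe()
mean_area = summary.loc['mean', 'SQMI_CNTRY']

# info() prints row count, column names, non-null counts and dtypes
toy.info()
```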

## **Subsetting data**

GeoPandas is built on top of pandas, so pandas functionality can be applied to GeoDataFrames.

We can query the attribute data of a GeoDataFrame to find features that meet a given condition in the same way we would do it with a pandas DataFrame.

For example, let's subset the `america` GeoDataFrame to create a new GeoDataFrame named `north_america` that contains only two countries, Canada and the United States:

In [None]:
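# .copy() makes the subset independent of `america`, so it can be modified later without a SettingWithCopyWarning
north_america = america[america['CNTRY_NAME'].isin(['Canada','United States'])].copy()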
north_america.plot()
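
Conditions other than `isin()` work the same way. As a hedged sketch on a hypothetical toy table (made-up names and areas, not the tutorial's shapefile), the example below filters with a comparison, combines two conditions, and uses `query()` as an alternative syntax:

```python
import pandas as pd

# Hypothetical toy table; the area values are made up for illustration
toy = pd.DataFrame({
    'CNTRY_NAME': ['Canada', 'United States', 'Mexico', 'Belize'],
    'SQMI_CNTRY': [100.0, 90.0, 20.0, 1.0],
})

# A single comparison produces a boolean mask used to select rows
large = toy[toy['SQMI_CNTRY'] > 50]

# Combine conditions with & (and) or | (or); each condition needs parentheses
large_not_canada = toy[(toy['SQMI_CNTRY'] > 50) & (toy['CNTRY_NAME'] != 'Canada')]

# query() expresses the same kind of filter as a string
small = toy.query('SQMI_CNTRY < 50')
```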

## **Adding a new column to a GeoDataFrame**

To add a new column to a GeoDataFrame, type the name of the GeoDataFrame followed by the name of the new column in square brackets, then an equal sign and either the new values or an expression that computes them.

For instance, let's create a new column in the `north_america` GeoDataFrame called 'region' filled with the string 'North America' for all the records:

In [None]:
north_america['region'] = 'North America'
north_america.head()
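
A new column can also be computed from an existing one. The sketch below uses a hypothetical toy table with made-up areas; the column name `SQKM_CNTRY` is an assumption chosen for illustration. It also shows why a subset is usually copied before columns are added to it.

```python
import pandas as pd

# Hypothetical toy table; the area values are made up for illustration
toy = pd.DataFrame({
    'CNTRY_NAME': ['A', 'B'],
    'SQMI_CNTRY': [100.0, 50.0],
})

# Compute a column from an existing one: 1 square mile = 2.589988 km²
toy['SQKM_CNTRY'] = toy['SQMI_CNTRY'] * 2.589988

# Copy a subset before adding columns so the original table is not affected
subset = toy[toy['SQMI_CNTRY'] > 60].copy()
subset['region'] = 'Example region'
```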

## **Sorting values in a column in a GeoDataFrame**

The `sort_values()` method can be used to sort values in a column.

For example, the rows of the `america` GeoDataFrame are not ordered by the 'ID_CNTRY' column:

In [None]:
america.head()

Let's run `sort_values()` to sort the GeoDataFrame by the 'ID_CNTRY' column.

Let's update the GeoDataFrame modifying it in place with `inplace=True`:

In [None]:
america.sort_values(by='ID_CNTRY', inplace=True)
america.head()
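
`sort_values()` does not have to modify the table in place. As a minimal sketch on a hypothetical toy table (made-up values), the example below sorts in descending order into a new object and resets the row index after sorting:

```python
import pandas as pd

# Hypothetical toy table; the values are made up for illustration
toy = pd.DataFrame({
    'CNTRY_NAME': ['A', 'B', 'C'],
    'ID_CNTRY': [2, 3, 1],
})

# Without inplace=True a sorted copy is returned and `toy` keeps its order
sorted_desc = toy.sort_values(by='ID_CNTRY', ascending=False)

# Sorting keeps the old index labels; reset_index(drop=True) renumbers them
sorted_reset = toy.sort_values(by='ID_CNTRY').reset_index(drop=True)
```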

## **Reclassifying values in a column in a GeoDataFrame**

The pandas `cut()` function can help to reclassify values in a GeoDataFrame column. We can use `cut()` to bin the values in a numeric column to create a categorical variable.

For this tutorial, let's run `cut()` to reclassify the 'ID_CNTRY' column in the `america` GeoDataFrame into 3 categories:

* countries with ID_CNTRY 1-13 will be classified as `south` (South America, e.g., Argentina, Bolivia, etc.)
* countries with ID_CNTRY 14-21 will be classified as `central` (Central America, e.g., Belize, Costa Rica, etc.) 
* countries with ID_CNTRY 22-23 will be classified as `north` (North America, i.e., Canada and United States).

In [None]:
import pandas as pd
america['region'] = pd.cut(x=america.ID_CNTRY, bins=[0,13,21,23], labels=['south', 'central', 'north'])
america
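
To see how the bin edges map to the three ID ranges, here is a small sketch on a hypothetical Series of IDs (not read from the shapefile). By default `cut()` builds intervals that are open on the left and closed on the right, so `bins=[0,13,21,23]` yields (0, 13], (13, 21] and (21, 23]; values outside those intervals become missing values.

```python
import pandas as pd

# Hypothetical IDs at the edges of each class
ids = pd.Series([1, 13, 14, 21, 22, 23])
labels = ['south', 'central', 'north']

# Same bins and labels as in the cell above
regions = pd.cut(x=ids, bins=[0, 13, 21, 23], labels=labels)

# 0 and 24 fall outside (0, 23], so they are returned as NaN
outside = pd.cut(x=pd.Series([0, 24]), bins=[0, 13, 21, 23], labels=labels)
```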

## **Removing columns from a GeoDataFrame**

Sometimes it can be helpful to remove one or more columns from a (Geo)DataFrame, for example to keep only the attributes needed for an analysis or to produce a smaller output file.

This can be achieved with the pandas `drop()` method, passing the list of columns to remove to the 'columns' parameter:

In [None]:
america.drop(columns=['SQMI_CNTRY'], inplace=True)
america.head()
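
The sketch below, on a hypothetical toy table with made-up values, shows a few variations: dropping several columns into a new object, ignoring names that do not exist, and keeping columns by selecting them instead. When selecting columns from a GeoDataFrame, the 'geometry' column would normally need to be included so the result remains a spatial layer.

```python
import pandas as pd

# Hypothetical toy table; the values are made up for illustration
toy = pd.DataFrame({
    'CNTRY_NAME': ['A', 'B'],
    'ID_CNTRY': [1, 2],
    'SQMI_CNTRY': [10.0, 20.0],
})

# Drop several columns at once; without inplace=True a new table is returned
slim = toy.drop(columns=['ID_CNTRY', 'SQMI_CNTRY'])

# errors='ignore' skips column names that are not present instead of raising
same = toy.drop(columns=['NOT_A_COLUMN'], errors='ignore')

# The opposite approach: list the columns to keep
kept = toy[['CNTRY_NAME', 'ID_CNTRY']]
```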

## **Merging a GeoDataFrame with a DataFrame based on a common column**

The `merge()` function can be used for merging tables (also referred to as 'joining' tables), for example to merge data from a .csv table to a GeoDataFrame.

Let's merge the `america` GeoDataFrame with a DataFrame read from a .csv containing population data for each country. 

Let's read in the 'pop_by_country_america.csv' file first and take a look at it:

In [None]:
population = pd.read_csv('pop_by_country_america.csv')
population.head()

Both tables have a column called 'CNTRY_NAME' with identical names for each country, so we can perform the merging based on this key column.

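Let's run the `merge()` method to merge the `population` DataFrame into the `america` GeoDataFrame based on the 'CNTRY_NAME' column, saving the result as `america`. Calling `merge()` on the GeoDataFrame (rather than on the DataFrame) keeps the result a GeoDataFrame: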

In [None]:
america = america.merge(population, on='CNTRY_NAME')
america
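
By default `merge()` performs an inner join, so rows whose key has no match in the other table are dropped. As a hedged sketch with two hypothetical toy tables (the `POP` column and all values are made up), the example below compares the default with `how='left'` and uses `indicator=True` to list the unmatched keys:

```python
import pandas as pd

# Hypothetical toy tables; 'Belize' has no row in the population table
layer = pd.DataFrame({'CNTRY_NAME': ['Canada', 'Mexico', 'Belize']})
pop = pd.DataFrame({'CNTRY_NAME': ['Canada', 'Mexico'], 'POP': [100, 200]})

# Default inner join: only keys present in both tables are kept
inner = layer.merge(pop, on='CNTRY_NAME')

# Left join keeps every row of `layer`; indicator=True adds a '_merge' column
left = layer.merge(pop, on='CNTRY_NAME', how='left', indicator=True)

# Rows marked 'left_only' found no match in `pop`
unmatched = left.loc[left['_merge'] == 'left_only', 'CNTRY_NAME'].tolist()
```

Checking `unmatched` before saving the merged layer is one way to catch spelling differences between the two key columns.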