# An overview of GeoPandas

You have already been introduced in this course to Pandas, the Python Data Analysis Library: https://pandas.pydata.org

## A quick recap on Pandas

Pandas is a Python package "providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive". 

It provides us with a range of capabilities:

- DataFrame object for data manipulation with integrated indexing.
- Tools for reading and writing data between in-memory data structures and different file formats.
- Data alignment and integrated handling of missing data.
- Reshaping and pivoting of data sets.
- Label-based slicing, fancy indexing, and subsetting of large data sets.
- Data structure column insertion and deletion.
- Group by engine allowing split-apply-combine operations on data sets.
- Data set merging and joining.
- Hierarchical axis indexing to work with high-dimensional data in a lower-dimensional data structure.
- Time series functionality: date range generation and frequency conversions, moving window statistics, moving window linear regressions, date shifting and lagging.
- Data filtering.
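
As a quick reminder of two of these capabilities (split-apply-combine and missing-data handling), here is a minimal sketch. The `samples` table and its column names are made up for illustration and are not part of the course data.

```python
import pandas as pd

# Hypothetical example data: land-cover samples, one with a missing area
samples = pd.DataFrame({
    "land_cover": ["forest", "urban", "forest", "urban", "water"],
    "area_km2": [2.0, 1.5, 3.0, None, 0.5],
})

# Split by land cover, then sum each group; missing values (NaN) are skipped by default
total_area = samples.groupby("land_cover")["area_km2"].sum()

print(total_area)
```

The result is a pandas Series indexed by land-cover class, which can be sliced and merged like any other pandas object.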




## So what is special about GeoPandas?

"GeoPandas is a project to add support for geographic data to pandas objects. It currently implements GeoSeries and GeoDataFrame types which are subclasses of pandas.Series and pandas.DataFrame respectively. GeoPandas objects can act on shapely geometry objects and perform geometric operations."

See the Git repo for more information: https://github.com/geopandas/geopandas

The GeoPandas dataframe holds a geometry column which enables Cartesian geometry operations (meaning it can interpret pairs of numerical coordinates in space).

The coordinate reference system (CRS) is stored as the `crs` attribute on an object, and is automatically set when loading from a file that carries CRS information. Objects may be transformed to new coordinate systems with the `to_crs()` method.
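
A minimal sketch of these two ideas, assuming GeoPandas and Shapely are installed. The single point below is chosen for illustration only (a longitude/latitude pair near the GMU campus).

```python
import geopandas as gpd
from shapely.geometry import Point

# One point given as (longitude, latitude), tagged with a CRS in degrees
points = gpd.GeoSeries([Point(-77.3154, 38.8363)], crs="EPSG:4326")

# The CRS travels with the object as an attribute
print(points.crs)

# Reproject from degrees (EPSG:4326) to meters (EPSG:3857)
points_m = points.to_crs("EPSG:3857")
print(points_m.iloc[0])
```

Note that `to_crs()` returns a new object rather than changing the original in place.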

Here we will cover the following basic operations:

- Reading data to a geopandas dataframe
- Manipulating column data 
- Creating a new column
- Changing coordinate reference systems
- Writing data from a GeoPandas dataframe to a file


### Reading vector shapefile data to a GeoPandas dataframe

Let's read in the shapefile we previously used of GMU. 

It's stored in the GitHub repository in the `shapes` folder, with the filename `gmu.shp`.

To load this in, we can find the current folder using the `os` package which we previously used, as follows, via the `getcwd` function:

In [1]:
import os

## getcwd stands for 'get current working directory'
current_dir = os.getcwd()

print(current_dir)      

D:\Github\satellite-image-analysis\notebooks\week9


The `current_dir` variable is merely a string of the directory path which we can manipulate.

Thus, from here we can navigate up and down directories by adding on new parts to this string. Our trusty double period, which we previously used (`..`), enables us to navigate up the file path. For example:

In [2]:
## getcwd stands for 'get current working directory'
current_dir = os.getcwd()

path = current_dir + '/..'

print(path)    

D:\Github\satellite-image-analysis\notebooks\week9/..


Now that we have added the double period to our string, a computer interpreting this path essentially reads 'go up one folder from week9'.

What we want to do is have our string navigate to the main `satellite-image-analysis` folder, which means we need to go up two folders (from `week9` to `notebooks`, then to `satellite-image-analysis`), and finally go into the `shapes` folder, as follows:


In [3]:
## getcwd stands for 'get current working directory'
current_dir = os.getcwd()

path = current_dir + '/../../shapes'

print(path)    

D:\Github\satellite-image-analysis\notebooks\week9/../../shapes


Once in the `shapes` folder, we need to get the `gmu.shp` file. Therefore, we need to add this filename to the path:

In [4]:
## getcwd stands for 'get current working directory'
current_dir = os.getcwd()

path = current_dir + '/../../shapes/gmu.shp'

print(path)    

D:\Github\satellite-image-analysis\notebooks\week9/../../shapes/gmu.shp
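
The printed path mixes backslashes and forward slashes because we joined the pieces by hand. As an alternative sketch (not the approach used in the rest of this notebook), the standard library can build the same path with consistent separators:

```python
import os

current_dir = os.getcwd()

# os.path.join inserts the correct separator for the operating system
path = os.path.join(current_dir, "..", "..", "shapes", "gmu.shp")

# normpath collapses the '..' segments into a clean path
path = os.path.normpath(path)

print(path)
```

Both forms point to the same file; the normalized version is simply easier to read when debugging.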


Now we're ready to read in the data using the path we've specified.

Let's first load GeoPandas which should already be installed in your environment. 

Then we can use the GeoPandas function `read_file`, passing the `path` to the shapefile we want to load. The coordinate reference system is read from the file itself (for a shapefile, from its accompanying `.prj` file), so it does not need to be passed as an argument.


In [5]:
import geopandas as gpd

# Load the file as the variable named data
data = gpd.read_file(path)
print(data)

     id                                           geometry
0  None  POLYGON ((-77.31540 38.83630, -77.29790 38.836...


You can see here when we print that `data` is a dataframe containing an `id` column and a `geometry` column.
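
It is worth checking which CRS was attached on load. The sketch below uses a hypothetical rectangle in place of the shapefile (so it runs without the course data) and shows how `set_crs()` can declare a CRS when none was stored with the data.

```python
import geopandas as gpd
from shapely.geometry import box

# Hypothetical rectangle standing in for data that arrived without CRS metadata
gdf = gpd.GeoDataFrame(
    {"id": [None]},
    geometry=[box(-77.3154, 38.8205, -77.2979, 38.8363)],
)

print(gdf.crs)  # None

# set_crs declares what the coordinates already are; it does not reproject them
if gdf.crs is None:
    gdf = gdf.set_crs("EPSG:4326")

print(gdf.crs)
```

Use `set_crs()` only when you know which CRS the coordinates are in; use `to_crs()` when you want to transform them into a different one.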

## GeoPandas Examples

GeoPandas provides us with some great functionality. For example, we can change the CRS as follows:

In [6]:
# The previous crs was in degrees (epsg:4326), so let's change to meters ('epsg:3857')
data = data.to_crs('epsg:3857')
print(data)

     id                                           geometry
0  None  POLYGON ((-8606710.958 4698250.004, -8604762.8...


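Now that the coordinates are in meters, we can take the area of this shape. Note that EPSG:3857 (Web Mercator) inflates areas away from the equator, so treat the result as approximate: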

In [7]:
# Due to our current CRS, the area will be in square meters
data['area'] = data['geometry'].area 
print(data)

     id                                           geometry          area
0  None  POLYGON ((-8606710.958 4698250.004, -8604762.8...  3.414576e+06
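
To get a feel for how much the choice of projection affects an area calculation, the following sketch compares Web Mercator with UTM zone 18N (EPSG:32618). The rectangle is a hypothetical stand-in for the campus polygon, and the choice of UTM zone 18N as a suitable local CRS for northern Virginia is an assumption made for illustration.

```python
import geopandas as gpd
from shapely.geometry import box

# Hypothetical rectangle roughly where the campus sits (lon/lat in degrees)
campus = gpd.GeoSeries(
    [box(-77.3154, 38.8205, -77.2979, 38.8363)],
    crs="EPSG:4326",
)

# Area in Web Mercator, converted to square kilometers
area_mercator_km2 = campus.to_crs("EPSG:3857").area.iloc[0] / 1e6

# Area in UTM zone 18N, a local projected CRS with little distortion here
area_utm_km2 = campus.to_crs("EPSG:32618").area.iloc[0] / 1e6

print(round(area_mercator_km2, 2), round(area_utm_km2, 2))
```

The Web Mercator figure comes out noticeably larger, which is why a local or equal-area CRS is usually preferred when the area itself matters.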


The beauty is that we can manipulate this like a normal pandas dataframe.

For example, let's convert our square meters into square kilometers, which requires us to divide by 1e6.

Remember, we can select a column by indexing with square brackets (e.g. `data['area']` gets the area column), and create a new column the same way (e.g. `data['area_km2']` is the new column we wish to make).

In [8]:
data['area_km2'] = data['area'] / 1e6
print(data['area_km2'])

0    3.414576
Name: area_km2, dtype: float64


We can see the whole dataframe structure with our new column, as follows:

In [9]:
print(data)

     id                                           geometry          area  \
0  None  POLYGON ((-8606710.958 4698250.004, -8604762.8...  3.414576e+06   

   area_km2  
0  3.414576  


We are able to loop over the rows of a GeoDataFrame the same way we would a normal DataFrame, by using the `iterrows()` method, as follows:

In [10]:
for row in data.iterrows():
    print(row)

(0, id                                                       None
geometry    POLYGON ((-8606710.958478265 4698250.004406621...
area                                           3414575.784904
area_km2                                             3.414576
Name: 0, dtype: object)


This means we can access and print specific parts of each row. 

The important thing to remember is that you have the row index (here it's a zero) and then the actual row information.

For example, we can break out the row index here using `[0]`, and the row information using `[1]`:

In [11]:
for row in data.iterrows():
    
    ##this will print our row index
    print(row[0]) 
    print('')
    print('')
    ##this will print our row information
    print(row[1])

0


id                                                       None
geometry    POLYGON ((-8606710.958478265 4698250.004406621...
area                                           3414575.784904
area_km2                                             3.414576
Name: 0, dtype: object
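
Because each item is an (index, row) pair, a common shorthand is to unpack it directly in the `for` statement. A minimal sketch, using a small hypothetical pandas DataFrame in place of our GeoDataFrame:

```python
import pandas as pd

# Hypothetical stand-in for the GeoDataFrame above
df = pd.DataFrame({"id": [None], "area_km2": [3.414576]})

labels = []
for index, row in df.iterrows():
    # index is the row label; row is a Series holding that row's values
    labels.append("row {}: {} km2".format(index, row["area_km2"]))

print(labels)
```

This is equivalent to using `row[0]` and `row[1]`, but the names make the loop body easier to read.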


We can then access just the geometry as follows:

In [12]:
for row in data.iterrows():
    
    ##this will print our row geometry
    print(row[1]['geometry'])

POLYGON ((-8606710.958478265 4698250.004406621, -8604762.867389381 4698250.004406621, -8604762.867389381 4696499.43231632, -8606707.988080831 4696492.346566786, -8606710.958478265 4698250.004406621))


And we can carry out any manipulations we want in this loop, such as taking the area (let's reuse this calculation, since you are already familiar with it):

In [13]:
for row in data.iterrows():
    
    ##this will calculate the area of our row geometry in square kilometers
    area_km2 = (row[1]['geometry'].area / 1e6)
    
    ##this will round our area to 1 decimal place
    area_km2 = round(area_km2, 1)
    
    print("The area of GMU campus is {} square kilometers".format(area_km2))

The area of GMU campus is 3.4 square kilometers


We can loop over our GeoDataFrame and extract any information we want, and write it to a DataFrame. For example:

In [14]:
output = []

for row in data.iterrows():
    
    ##this will calculate the area of our row geometry in square kilometers
    area_km2 = (row[1]['geometry'].area / 1e6)

    ##we can append this information to a list as a dictionary
    output.append({
        'index': row[0],
        'area_km2': area_km2,
    })

print(output)

[{'index': 0, 'area_km2': 3.414575784903849}]


Once we have this list of dictionaries we can create a pandas dataframe from it as follows:

In [15]:
import pandas as pd

output = []

for row in data.iterrows():
    
    ##this will calculate the area of our row geometry in square kilometers
    area_km2 = (row[1]['geometry'].area / 1e6)

    ##we can append this information to a list as a dictionary
    output.append({
        'index': row[0],
        'area_km2': area_km2,
        'any_other_properties': 'test_properties',
    })

## Let's convert our list of dicts to a pandas dataframe
output = pd.DataFrame(output)

## Write the dataframe to a .csv file
output.to_csv('output.csv')

print('Completed writing "output.csv"')

Completed writing "output.csv"
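
For a simple calculation like this one, the loop can also be replaced by column-wise operations. The sketch below is an alternative to the loop, not a correction of it, and uses two hypothetical squares already expressed in a metric CRS.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box

# Hypothetical GeoDataFrame in a metric CRS: squares with 1000 m and 2000 m sides
gdf = gpd.GeoDataFrame(
    geometry=[box(0, 0, 1000, 1000), box(0, 0, 2000, 2000)],
    crs="EPSG:3857",
)

# Column-wise arithmetic takes the place of the explicit loop
summary = pd.DataFrame({
    "index": gdf.index,
    "area_km2": gdf.geometry.area / 1e6,
})

print(summary)
```

The loop remains useful when each row needs more involved, step-by-step processing.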


We can also write any geospatial information we produce to a shapefile.

Let's convert a GeoJSON data structure into a GeoPandas dataframe and then write it out.

For example:

In [16]:
## Here is a GeoJSON point feature, stored as a dictionary.
gmu_point = {
        'type': 'Feature',
        'geometry': {
            'type': 'Point',
            'coordinates': (-77.31540, 38.83630),
        },
        'properties': {}
    }

my_list_of_dicts = []

my_list_of_dicts.append(gmu_point)

## Now we can specify a GeoDataFrame, providing the list of dicts and the CRS.
output = gpd.GeoDataFrame.from_features(my_list_of_dicts, crs='epsg:4326')

## Finally, let's write this GeoDataFrame to a shapefile. 
output.to_file('output.shp', crs='epsg:4326')

print('Completed writing "output.shp"')

