# Intro to `geopandas`

The beauty of `geopandas` is that it enables us to manage spatial data using the Python Data Analysis Library (`pandas`): https://pandas.pydata.org

Let's start by installing the package:

In [1]:
# Example
import sys
# !conda install --yes --prefix {sys.prefix} numpy=1.22
# !conda install --yes --prefix {sys.prefix} geopandas
import geopandas

## A quick recap on `pandas`

Pandas is a Python package "providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive". 

It provides us with a range of capabilities:

- DataFrame object for data manipulation with integrated indexing.
- Tools for reading and writing data between in-memory data structures and different file formats.
- Data alignment and integrated handling of missing data.
- Reshaping and pivoting of data sets.
- Label-based slicing, fancy indexing, and subsetting of large data sets.
- Data structure column insertion and deletion.
- Group by engine allowing split-apply-combine operations on data sets.
- Data set merging and joining.
- Hierarchical axis indexing to work with high-dimensional data in a lower-dimensional data structure.
- Time series functionality: date range generation and frequency conversions, moving window statistics, moving window linear regressions, date shifting and lagging.
- Data filtering.
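To make a few of these capabilities concrete, here is a minimal sketch using a small, made-up table (the `region` and `population` columns are hypothetical and not part of the tutorial data). It touches on missing-data handling, column insertion, group by, and label-based selection.

```python
import numpy as np
import pandas as pd

# Hypothetical example data (not from the tutorial files)
df = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "population": [120.0, np.nan, 80.0, 100.0],
})

# Integrated handling of missing data: replace the gap with zero
df["population"] = df["population"].fillna(0)

# Column insertion: derive a new column from an existing one
df["population_thousands"] = df["population"] / 1000

# Split-apply-combine using the group by engine
totals = df.groupby("region")["population"].sum()

# Label-based selection on the grouped result
north_total = totals.loc["north"]
print(totals)
```

The same operations work on a GeoDataFrame, since it is a subclass of the pandas DataFrame.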




## So what is special about `geopandas`?

"GeoPandas is a project to add support for geographic data to pandas objects. It currently implements GeoSeries and GeoDataFrame types which are subclasses of pandas.Series and pandas.DataFrame respectively. GeoPandas objects can act on shapely geometry objects and perform geometric operations."

See the Git repo for more information: https://github.com/geopandas/geopandas

The GeoPandas dataframe holds a geometry column which enables Cartesian geometry operations (meaning it can interpret pairs of numerical coordinates in space).

The coordinate reference system (CRS) can be stored as an attribute on an object, and is automatically set when loading from a file. Objects may be transformed to new coordinate systems with the `to_crs()` method.
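As a rough illustration of the CRS attribute and `to_crs()`, the sketch below builds a one-row GeoDataFrame by hand rather than loading a file. The point coordinates and the `name` column are assumptions chosen purely for illustration.

```python
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical point, with coordinates chosen for illustration only
gdf = gpd.GeoDataFrame(
    {"name": ["example"]},
    geometry=[Point(-77.31, 38.83)],
    crs="EPSG:4326",  # longitude/latitude in decimal degrees
)

# The CRS is stored as an attribute on the object
print(gdf.crs)

# to_crs() returns a new object with transformed coordinates
gdf_m = gdf.to_crs("EPSG:3857")
print(gdf_m.crs)
print(gdf_m.geometry.iloc[0])
```

Note that `to_crs()` does not modify the original object, so the result needs to be assigned to a variable.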

Here we will cover the following basic operations:

- Reading data to a geopandas dataframe
- Manipulating column data 
- Creating a new column
- Changing coordinate reference systems
- Writing data from a geopandas dataframe to a file


### Reading vector shapefile data to a `geopandas` dataframe

Let us read in the shapefile for GMU. 

To load it, we first find the current folder using the `getcwd` function from the `os` package, which we used previously:

In [2]:
import os

## getcwd stands for 'get current working directory'
current_dir = os.getcwd()

print(current_dir)    

C:\Users\eoughton\Desktop\Github\satellite-image-analysis\notebooks


The `current_dir` variable is merely a string of the directory path which we can manipulate.

In [3]:
## getcwd stands for 'get current working directory'
current_dir = os.getcwd()

path = current_dir + '/files' + '/gmu.shp'

print(path)    

C:\Users\eoughton\Desktop\Github\satellite-image-analysis\notebooks/files/gmu.shp
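Notice that the printed path mixes backslashes and forward slashes. This usually still works on Windows, but a more portable option is to let `os.path.join` choose the separator. The sketch below assumes the same `files/gmu.shp` layout as above.

```python
import os

current_dir = os.getcwd()

# os.path.join inserts the correct separator for the operating system
path = os.path.join(current_dir, "files", "gmu.shp")
print(path)

# Optionally check the file exists before trying to read it
print(os.path.exists(path))
```

If `os.path.exists(path)` prints `False`, the notebook is probably being run from a different working directory than expected.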


Now we're ready to read in the data using the path we've specified.

Let's first load `geopandas` which should already be installed in your environment. 

Then we can use the GeoPandas function `read_file` and provide the following arguments:
- `path`, which contains the path to the shapefile we want to load, and
- `crs`, which states the coordinate reference system (the CRS is normally read from the file itself, so this argument is usually unnecessary)


In [4]:
import geopandas as gpd

#load the file as the variable named data
data = gpd.read_file(path, crs='epsg:4326') 
print(data)

   FID                                           geometry
0    0  POLYGON ((-77.31540 38.83630, -77.30000 38.836...


## Basic `geopandas` functions

`geopandas` provides us with some great functionality. For example, we can change the CRS as follows:

In [5]:
# The previous CRS was in decimal degrees (epsg:4326), so let's change to meters ('epsg:3857')
data = data.to_crs('epsg:3857')
print(data)

   FID                                           geometry
0    0  POLYGON ((-8606710.958 4698250.004, -8604996.6...


Now that we are working with a CRS in meters, we can calculate the area of this shape as follows:

In [6]:
# Due to our current CRS, the area will be in square meters
data['area'] = data['geometry'].area 
print(data)

   FID                                           geometry          area
0    0  POLYGON ((-8606710.958 4698250.004, -8604996.6...  2.768233e+06
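One caveat worth keeping in mind: `epsg:3857` (Web Mercator) does use meters, but it stretches distances away from the equator, so areas are inflated by roughly 1/cos²(latitude), which is around 1.6 at this latitude. The sketch below compares it with a local UTM zone (`EPSG:32618`, UTM zone 18N, which is assumed here to be a suitable choice for this location). The rectangle is hypothetical and is not the real GMU boundary.

```python
import geopandas as gpd
from shapely.geometry import box

# Hypothetical rectangle in longitude/latitude (not the real GMU boundary)
rectangle = box(-77.315, 38.825, -77.300, 38.836)
gdf = gpd.GeoDataFrame(geometry=[rectangle], crs="EPSG:4326")

# Area in Web Mercator versus a local UTM zone, both in square meters
area_web_mercator = gdf.to_crs("EPSG:3857").area.iloc[0]
area_utm = gdf.to_crs("EPSG:32618").area.iloc[0]

# A ratio well above 1 shows how much Web Mercator inflates the area
ratio = area_web_mercator / area_utm
print(round(ratio, 2))
```

For quick exploration the Web Mercator figure may be good enough, but when the area itself matters, a local projected CRS or an equal-area CRS is generally the safer choice.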


The beauty is we can manipulate this as a normal pandas dataframe.

So, for example, let's convert our square meters into square kilometers (which requires us to divide by 1e6).

Remember, we can select a column by using square brackets to index (e.g. `data['area']` gets the area column), and we can create a new column this way too (e.g. `data['area_km2']` is the new column we wish to make).

In [7]:
data['area_km2'] = data['area'] / 1e6
print(data['area_km2'])

0    2.768233
Name: area_km2, dtype: float64


We can see the whole dataframe structure with our new column, as follows:

In [8]:
print(data)

   FID                                           geometry          area  \
0    0  POLYGON ((-8606710.958 4698250.004, -8604996.6...  2.768233e+06   

   area_km2  
0  2.768233  


We are able to loop over any content in a GeoDataFrame the same way we would a normal DataFrame, by using the `iterrows()` function, as follows:

In [33]:
for row in data.iterrows():
    print(row)

(0, FID                                                         0
geometry    POLYGON ((-8606710.958478265 4698250.004406621...
area                                            2768233.05248
area_km2                                             2.768233
Name: 0, dtype: object)


This means we can access and print specific parts of each row. 

The important thing to remember is that you have the row index (here it is a zero) and then the actual row information.

For example, we can break out the row index here using `[0]`, and the row information using `[1]`:

In [34]:
for row in data.iterrows():
    
    ##this will print our row index
    print(row[0]) 
    print('')
    print('')
    ##this will print our row information
    print(row[1])

0


FID                                                         0
geometry    POLYGON ((-8606710.958478265 4698250.004406621...
area                                            2768233.05248
area_km2                                             2.768233
Name: 0, dtype: object


We can then access just the geometry as follows:

In [35]:
for row in data.iterrows():
    
    ##this will print our row geometry
    print(row[1]['geometry'])

POLYGON ((-8606710.958478265 4698250.004406621, -8604996.638320047 4698250.004406621, -8604996.638320047 4696635.23423711, -8606599.63898747 4696635.23423711, -8606710.958478265 4696635.23423711, -8606710.958478265 4698250.004406621))


And we can carry out any manipulations we want in this loop, such as taking the area (let's reuse this, as we used it before, so it will be familiar):

In [36]:
for row in data.iterrows():
    
    ##this will calculate the area of our row geometry in square kilometers
    area_km2 = (row[1]['geometry'].area / 1e6)
    
    ##this will round our area to 1 decimal place
    area_km2 = round(area_km2, 1)
    
    print("The area of GMU campus is {} square kilometers".format(area_km2))

The area of GMU campus is 2.8 square kilometers


## Exercise 1

Using `geopandas`, create a path to the OpenCelliD points data we explored in the first exercise of the previous notebook (in `~/satellite-image-analysis/shapes`). 

- Load in the points data.
- Print the head of the dataframe for the top 10 rows.
- Print the length of the dataframe to find the total number of rows. 
- Using a loop, iterate over each point, printing the associated attribute information.

Next, repeat this sequence for the fiber linestring data, but also:

- Estimate the length of each fiber route in kilometers.

Finally, repeat this sequence for the boundary polygon datasets, but instead:

- Find the area in square kilometers of each ETH region. 

## Alternative options for looping over a dataframe

Finally, it is important to note there are many other options for looping over a dataframe.  

For example, the method we already covered uses the `iterrows()` function:

In [20]:
for row in data.iterrows():
    print(row)

(0, FID                                                         0
geometry    POLYGON ((-8606710.958478265 4698250.004406621...
area                                            2768233.05248
area_km2                                             2.768233
Name: 0, dtype: object)


To access the information provided, we need to index into the tuple that `iterrows()` yields (e.g., `row[1]`) before specifying the key of the variable we want to obtain (e.g., `row[1]['area_km2']`):

In [22]:
for row in data.iterrows():
    print(row[1]['area_km2'])

2.768233052479508


But there are alternative options which do not require us to index into our iterator.

For example, an alternative way to loop over a (geo)pandas dataframe is to unpack the index and the row (`for idx, row`) at the loop stage.

In [15]:
for idx, row in data.iterrows():
    print(row['area_km2'])

2.768233052479508


Finally, one option is to convert our dataframe into a list of dictionaries using `.to_dict('records')`, as follows:

In [17]:
my_list_of_dicts = data.to_dict('records')
my_list_of_dicts

[{'FID': 0,
  'geometry': <shapely.geometry.polygon.Polygon at 0x225810dd8b0>,
  'area': 2768233.052479508,
  'area_km2': 2.768233052479508}]

And then we can treat the loop part as a normal list, as follows:

(meaning we do not need `iterrows()` as that is a pandas/geopandas function only required for a dataframe)

In [19]:
for item in my_list_of_dicts:
    print(item)

{'FID': 0, 'geometry': <shapely.geometry.polygon.Polygon object at 0x00000225810DD8B0>, 'area': 2768233.052479508, 'area_km2': 2.768233052479508}
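Another option, not covered above, is `itertuples()`, which yields one namedtuple per row so that columns can be read as attributes. The sketch below uses a plain pandas DataFrame as a stand-in for `data` (without the geometry column), reusing the area value printed earlier. The same method should also be available on a GeoDataFrame.

```python
import pandas as pd

# Stand-in for `data` without the geometry column, using the area printed above
df = pd.DataFrame({"FID": [0], "area": [2768233.052479508]})
df["area_km2"] = df["area"] / 1e6

# itertuples() yields one namedtuple per row; columns become attributes
areas = []
for row in df.itertuples():
    areas.append(row.area_km2)
    print(row.Index, row.area_km2)
```

Here the row index is available as `row.Index`, and each column is accessed with dot notation rather than square brackets.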


In reality, it does not matter which way you iterate over the data in your dataframe. 

However, you should be aware that the approach you select will affect how you later index into each item to access the necessary information.