# Intro to `pandas` and `geopandas`

There are a very large number of open-source libraries we can import. 

The Python data analysis package `pandas` is a favorite: https://pandas.pydata.org/

Pandas is built around the **DataFrame**, which stores tabular data in rows and columns. You may have encountered this data structure in R or Matlab. If not, you can think of it as similar to a spreadsheet (except with far greater flexibility and power in Python).

In this tutorial, we will focus on manipulation of dataframes using pandas. 

<div class="alert alert-block alert-info">
    
<b>Note:</b> Several libraries build on pandas, such as geopandas for geospatial data, and you can harness their power when performing highly specialised tasks. **Remember, Google is your friend!**
</div>

To begin working with dataframes, we first install and import pandas (as we did for packages in previous sessions). 

<div class="alert alert-block alert-warning">

<b>!! Note !!</b> Pandas is not included in every Python distribution, so where it is not available it should be installed before starting the tutorial. You may need to exit and reinstall. 

</div>

In [5]:
# Example
import sys
!conda install --yes --prefix {sys.prefix} pandas

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Retrieving notices: ...working... done




  current version: 22.9.0
  latest version: 23.3.1

Please update conda by running

    $ conda update -n base -c defaults conda




## Creating a `pandas` dataframe

Here we focus on ways of creating dataframes by hand. In most day-to-day programming, however, you are likely to work with data in other formats, which can usually be converted to a dataframe with a single line of code. 

Nevertheless, a dataframe can be created by collecting lists in a dictionary and then converting the dictionary to a dataframe, as shown in the following steps using an example of satellite missions. 

First, let's create a list with our information.

In [6]:
# Example
satellites = ['LandSat','Sentinel','PlanetScope','Starlink','Iridium','OneWeb']
year = [1972,2013,2015,2018,1999,2021]
government_owned = [True,True,False,False,False,False]
mission = ['Remote sensing','Remote sensing','Remote sensing',\
        'Communication','Communication','Communication']
satellites

['LandSat', 'Sentinel', 'PlanetScope', 'Starlink', 'Iridium', 'OneWeb']

Next, we need to combine the lists into a dictionary:

In [7]:
# Example
sat_data = {
    'satellite': satellites,
    'year': year,
    'government_owned': government_owned,
    'mission': mission
}
sat_data

{'satellite': ['LandSat',
  'Sentinel',
  'PlanetScope',
  'Starlink',
  'Iridium',
  'OneWeb'],
 'year': [1972, 2013, 2015, 2018, 1999, 2021],
 'government_owned': [True, True, False, False, False, False],
 'mission': ['Remote sensing',
  'Remote sensing',
  'Remote sensing',
  'Communication',
  'Communication',
  'Communication']}

Finally, we convert the dictionary into a dataframe using the `pandas` constructor `pd.DataFrame()`.

In [9]:
# Example
import pandas as pd
df = pd.DataFrame(sat_data)
df

     satellite  year  government_owned         mission
0      LandSat  1972              True  Remote sensing
1     Sentinel  2013              True  Remote sensing
2  PlanetScope  2015             False  Remote sensing
3     Starlink  2018             False   Communication
4      Iridium  1999             False   Communication
5       OneWeb  2021             False   Communication
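
As an aside, data often arrives as a list of records (one dict per row), for example after parsing JSON. The sketch below, which assumes the same column names as above and uses the hypothetical variable name `df_records`, suggests that `pd.DataFrame` accepts this layout directly, without building separate lists first:

```python
import pandas as pd

# Each dict is one row; the keys become the column names
records = [
    {'satellite': 'LandSat', 'year': 1972, 'mission': 'Remote sensing'},
    {'satellite': 'Starlink', 'year': 2018, 'mission': 'Communication'},
]

df_records = pd.DataFrame(records)
print(df_records)
```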


The prepared dataframe can be written to other formats, such as an Excel spreadsheet (`to_excel()`) or a CSV file (`to_csv()`). For example:

In [10]:
# Example
df.to_csv('satellite_missions.csv', index=False)
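
To see what `index=False` does, here is a small sketch that writes a CSV both ways and reads each back. It uses an in-memory buffer (`io.StringIO`) instead of a file on disk, and `df_demo` is a hypothetical stand-in for the dataframe above:

```python
import io

import pandas as pd

df_demo = pd.DataFrame({'satellite': ['LandSat', 'Sentinel'], 'year': [1972, 2013]})

# Default behaviour: the row index is written as an extra, unnamed first column
with_index = pd.read_csv(io.StringIO(df_demo.to_csv()))

# index=False: only the data columns are written
without_index = pd.read_csv(io.StringIO(df_demo.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

When the index is written and the file is read back, pandas labels the extra column `Unnamed: 0`, which is usually not wanted for a plain 0, 1, 2, ... index.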

## Exercise

Have a go at manually creating a `pandas` dataframe called 'image_data' using the following:

    - A column called 'id' containing four numbers from 0-3.
    - A column called 'longitude' containing 0.02, 0.05, 0.06 and 0.08.
    - A column called 'latitude' containing 1.54, 1.65, 1.48 and 1.59.
    - A column called 'filename' containing 'multiband0.tiff', 'multiband1.tiff', 'multiband2.tiff' and 'multiband3.tiff'.
    
Write this dataframe to a .csv file called 'my_csv.csv'. 
        

In [6]:
#Enter your attempt below:


## Accessing data

`pandas` provides a lot of functionality. 

We can easily import a .csv file into our notebook using the `pd.read_csv()` function.  

In [11]:
df = pd.read_csv('satellite_missions.csv')
df

     satellite  year  government_owned         mission
0      LandSat  1972              True  Remote sensing
1     Sentinel  2013              True  Remote sensing
2  PlanetScope  2015             False  Remote sensing
3     Starlink  2018             False   Communication
4      Iridium  1999             False   Communication
5       OneWeb  2021             False   Communication


Then we can view the first few rows of data using the `head()` method:

In [13]:
# Example
df.head(n=3)

     satellite  year  government_owned         mission
0      LandSat  1972              True  Remote sensing
1     Sentinel  2013              True  Remote sensing
2  PlanetScope  2015             False  Remote sensing


You can view the last few rows of your data using the `tail()` method:

In [16]:
# Example
df.tail(n=3)

   satellite  year  government_owned        mission
3   Starlink  2018             False  Communication
4    Iridium  1999             False  Communication
5     OneWeb  2021             False  Communication


You can find out how many rows and columns are in your data using the `shape` attribute, which returns a `(rows, columns)` tuple:

In [17]:
# Example
df.shape

(6, 4)

You can get the names of the columns using the `columns` attribute:

In [19]:
# Example
list(df.columns)

['satellite', 'year', 'government_owned', 'mission']

You can access a single column using the following syntax:

In [20]:
# Example
df['satellite']

0        LandSat
1       Sentinel
2    PlanetScope
3       Starlink
4        Iridium
5         OneWeb
Name: satellite, dtype: object

You can access multiple columns by passing a list of column names:

In [21]:
# Example
df[['satellite','year']]

     satellite  year
0      LandSat  1972
1     Sentinel  2013
2  PlanetScope  2015
3     Starlink  2018
4      Iridium  1999
5       OneWeb  2021
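
The two selection styles return different object types, which matters when you chain further operations. A minimal sketch, using a hypothetical three-row dataframe `df_demo`:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'satellite': ['LandSat', 'Sentinel', 'PlanetScope'],
    'year': [1972, 2013, 2015],
})

one_col = df_demo['satellite']        # single brackets give a pandas Series
one_col_df = df_demo[['satellite']]   # double brackets give a one-column DataFrame

print(type(one_col).__name__)
print(type(one_col_df).__name__)

# A Series can be turned into a plain Python list
print(one_col.tolist())
```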


You can access an individual row by specifying its index label with `.loc`, e.g.:

In [22]:
# Example
df.loc[4]

satellite                 Iridium
year                         1999
government_owned            False
mission             Communication
Name: 4, dtype: object

You can access the rows within a given range of index labels (both ends included) as follows:

In [45]:
# Example
df.loc[3:4]

   satellite  year  government_owned        mission
3   Starlink  2018             False  Communication
4    Iridium  1999             False  Communication
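
Note that `.loc` slices by label and includes the end label, whereas `.iloc` slices by position and, like ordinary Python slicing, excludes the end. A short sketch on a hypothetical dataframe `df_demo` illustrates the difference:

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'satellite': ['LandSat', 'Sentinel', 'PlanetScope', 'Starlink', 'Iridium']}
)

by_label = df_demo.loc[3:4]       # labels 3 and 4, end label included
by_position = df_demo.iloc[3:4]   # position 3 only, end position excluded

print(by_label['satellite'].tolist())
print(by_position['satellite'].tolist())
```

With the default 0, 1, 2, ... index the labels and positions coincide, so the only visible difference is whether the last row is included.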


You can select rows that meet a condition by subsetting, e.g. satellites launched in 2015 or after:

In [23]:
# Example
df[df['year'] >= 2015]

     satellite  year  government_owned         mission
2  PlanetScope  2015             False  Remote sensing
3     Starlink  2018             False   Communication
5       OneWeb  2021             False   Communication


We can also apply multiple conditions at the same time. For example, selecting satellites launched in 2015 or after which are also remote sensing satellites. Each condition goes in its own parentheses, and the conditions are combined with `&`:

In [24]:
# Example
df[(df['year'] >= 2015) & (df['mission'] == 'Remote sensing')]


     satellite  year  government_owned         mission
2  PlanetScope  2015             False  Remote sensing
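
Boolean conditions can be stored in variables and combined with `&` (and), `|` (or) and `~` (not), which tends to read more clearly than one long expression. A minimal sketch on a hypothetical four-row dataframe `df_demo`, including the equivalent `query()` form:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'satellite': ['LandSat', 'PlanetScope', 'Starlink', 'OneWeb'],
    'year': [1972, 2015, 2018, 2021],
    'mission': ['Remote sensing', 'Remote sensing', 'Communication', 'Communication'],
})

recent = df_demo['year'] >= 2015
remote = df_demo['mission'] == 'Remote sensing'

both = df_demo[recent & remote]     # recent AND remote sensing
either = df_demo[recent | remote]   # recent OR remote sensing
not_remote = df_demo[~remote]       # NOT remote sensing

# The same AND filter written as a query string
both_query = df_demo.query("year >= 2015 and mission == 'Remote sensing'")

print(both['satellite'].tolist())
```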


You can add a new column to the existing dataframe as follows:

In [27]:
# Example
df['mission_code'] = ''
df

     satellite  year  government_owned         mission mission_code
0      LandSat  1972              True  Remote sensing
1     Sentinel  2013              True  Remote sensing
2  PlanetScope  2015             False  Remote sensing
3     Starlink  2018             False   Communication
4      Iridium  1999             False   Communication
5       OneWeb  2021             False   Communication
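
An empty column is usually only a placeholder. One way to fill it, sketched here under the assumption that each mission type should get a short code, is to map the values of an existing column through a dictionary. The codes `'RS'` and `'COM'` are hypothetical and chosen only for illustration:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'satellite': ['LandSat', 'Starlink'],
    'mission': ['Remote sensing', 'Communication'],
})

# Hypothetical short codes, invented for this example
codes = {'Remote sensing': 'RS', 'Communication': 'COM'}

# map() looks up each value of 'mission' in the dictionary
df_demo['mission_code'] = df_demo['mission'].map(codes)
print(df_demo)
```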


## Exercise

Now let's use the dataframe you created in the first exercise to practise extracting information.

First, print the shape of 'image_data':

In [19]:
#Enter your attempt below:


Now print the top rows of 'image_data' to inspect the contents:

In [20]:
#Enter your attempt below:


Subset the 'filename' column, convert to a list, and inspect the contents:

In [21]:
#Enter your attempt below:


Subset only the `longitude` and `latitude` columns, and then use a new method called `.to_dict('records')` to convert the dataframe of coordinates to a list of dicts:

In [22]:
#Enter your attempt below:


Next subset those rows which have a 'latitude' between 1.5 and 1.6:

In [23]:
#Enter your attempt below:


Finally, subset those rows which have a 'longitude' between 0.05 and 0.1:

In [24]:
#Enter your attempt below:


# Intro to `geopandas`

The beauty of `geopandas` is that it enables us to manage spatial data using the Python Data Analysis Library (pandas): https://pandas.pydata.org

Let's start by installing the package:

In [28]:
# Example
import sys
!conda install --yes --prefix {sys.prefix} numpy=1.22
!conda install --yes --prefix {sys.prefix} geopandas
import geopandas

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: D:\Anaconda\envs\sia

  added / updated specs:
    - numpy=1.22


The following packages will be DOWNGRADED:

  numpy                               1.23.5-py39h3b20f71_0 --> 1.22.3-py39h7a0a035_0 None
  numpy-base                          1.23.5-py39h4da318b_0 --> 1.22.3-py39hca35cd5_0 None


Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done
Retrieving notices: ...working... done




  current version: 22.9.0
  latest version: 23.3.1

Please update conda by running

    $ conda update -n base -c defaults conda






Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... failed with initial frozen solve. Retrying with flexible solve.
Solving environment: ...working... failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: D:\Anaconda\envs\sia

  added / updated specs:
    - geopandas


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    boost-cpp-1.73.0           |      h2bbff1b_12          16 KB
    cairo-1.16.0               |       haedb8bc_4         1.9 MB
    fontconfig-2.14.1          |       hc0defaf_1         198 KB
    gdal-3.6.2                 |   py39h36fb4bc_0         1.8 MB
    geotiff-1.7.0              |       h4545760_1         133 KB
    kealib-1.5.0        

## A quick recap on `pandas`

Pandas is a Python package "providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive". 

It provides us with a range of capabilities:

- DataFrame object for data manipulation with integrated indexing.
- Tools for reading and writing data between in-memory data structures and different file formats.
- Data alignment and integrated handling of missing data.
- Reshaping and pivoting of data sets.
- Label-based slicing, fancy indexing, and subsetting of large data sets.
- Data structure column insertion and deletion.
- Group by engine allowing split-apply-combine operations on data sets.
- Data set merging and joining.
- Hierarchical axis indexing to work with high-dimensional data in a lower-dimensional data structure.
- Time-series functionality: date range generation and frequency conversions, moving window statistics, moving window linear regressions, date shifting and lagging.
- Provides data filtration.
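
As one illustration of the capabilities listed above, the split-apply-combine pattern can be sketched with `groupby`. The example reuses the satellite data from earlier in this tutorial; the names `df_demo` and `summary` are hypothetical:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'satellite': ['LandSat', 'Sentinel', 'PlanetScope', 'Starlink', 'Iridium', 'OneWeb'],
    'year': [1972, 2013, 2015, 2018, 1999, 2021],
    'mission': ['Remote sensing'] * 3 + ['Communication'] * 3,
})

# Split the rows by mission, apply aggregations to 'year',
# and combine the results into one summary table
summary = df_demo.groupby('mission')['year'].agg(['count', 'min', 'max'])
print(summary)
```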




## So what is special about `geopandas`?

"GeoPandas is a project to add support for geographic data to pandas objects. It currently implements GeoSeries and GeoDataFrame types which are subclasses of pandas.Series and pandas.DataFrame respectively. GeoPandas objects can act on shapely geometry objects and perform geometric operations."

See the Git repo for more information: https://github.com/geopandas/geopandas

A GeoDataFrame holds a geometry column which enables Cartesian geometry operations (meaning it can interpret pairs of numerical coordinates in space). 

The coordinate reference system (CRS) is stored as an attribute on the object, and is automatically set when loading from a file that defines one. Objects may be transformed to new coordinate systems with the `to_crs()` method. 

Here we will cover the following basic operations:

- Reading data to a geopandas dataframe
- Manipulating column data 
- Creating a new column
- Changing coordinate reference systems
- Writing data from a geopandas dataframe to a file


### Reading vector shapefile data to a `geopandas` dataframe

Let's read in the shapefile for GMU. 

To load this, we first find the current folder using the `getcwd` function from the `os` package, which we used previously:

In [30]:
import os

## getcwd stands for 'get current working directory'
current_dir = os.getcwd()

print(current_dir)    

C:\Users\edwar\Desktop\satellite-image-analysis\notebooks


The `current_dir` variable is merely a string of the directory path which we can manipulate.

In [31]:
## getcwd stands for 'get current working directory'
current_dir = os.getcwd()

path = current_dir + '/files' + '/gmu.shp'

print(path)    

C:\Users\edwar\Desktop\satellite-image-analysis\notebooks/files/gmu.shp
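
Notice that the printed path mixes backslashes and forward slashes. Windows generally tolerates this, but an alternative worth knowing is the standard-library `pathlib` module, which picks the right separator for the operating system. A minimal sketch, assuming the same `files/gmu.shp` layout as above:

```python
from pathlib import Path

# Path.cwd() is the pathlib equivalent of os.getcwd()
current_dir = Path.cwd()

# The '/' operator joins path parts with the correct separator for the OS
path = current_dir / 'files' / 'gmu.shp'

print(path)
print(path.exists())  # False unless the shapefile is actually present
```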


Now we're ready to read in the data using the path we've specified.

Let's first load `geopandas` which should already be installed in your environment. 

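Then we can use the GeoPandas function `read_file`, passing `path`, which contains the path to the shapefile we want to load. The coordinate reference system is normally read from the file itself; if the file carries none, we can declare it with `set_crs()` (here EPSG:4326, i.e. longitude and latitude in decimal degrees).


In [47]:
import geopandas as gpd

# Load the file as the variable named data
data = gpd.read_file(path)

# Declare the CRS only if the file did not provide one
if data.crs is None:
    data = data.set_crs('epsg:4326')
print(data)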

   FID                                           geometry
0    0  POLYGON ((-77.31540 38.83630, -77.30000 38.836...


## Basic `geopandas` functions

`geopandas` provides us with some great functionality. For example, we can change the CRS as follows:

In [48]:
# The previous CRS was in decimal degrees (epsg:4326), so let's change to meters ('epsg:3857')
data = data.to_crs('epsg:3857')
print(data)

   FID                                           geometry
0    0  POLYGON ((-8606710.958 4698250.004, -8604996.6...

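Now that we are working with a CRS in meters, we can take the area of this shape as follows. Bear in mind that EPSG:3857 (Web Mercator) inflates areas away from the equator, so the result is an approximation rather than the true ground area: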

In [39]:
# Due to our current CRS, the area will be in square meters
data['area'] = data['geometry'].area 
print(data)

   FID                                           geometry          area
0    0  POLYGON ((-8606710.958 4698250.004, -8604996.6...  2.768233e+06


The beauty is that we can manipulate this as a normal pandas dataframe.

So let's, for example, convert our square meters into square kilometers (which requires us to divide by 1e6).

Remember, we can select a column by using square brackets to index (e.g. `data['area']` gets the area column), and then create a new column this way too (e.g. `data['area_km2']` is the new column we wish to make).

In [60]:
data['area_km2'] = data['area'] / 1e6
print(data['area_km2'])

0    2.768233
Name: area_km2, dtype: float64
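
A word of caution on this number: EPSG:3857 (Web Mercator) stretches lengths by roughly 1/cos(latitude), so areas are inflated by roughly the square of that factor. The back-of-the-envelope sketch below assumes a spherical Mercator approximation and uses the latitude of about 38.84 degrees from the geometry printed earlier. It is a rough correction, not a substitute for reprojecting to an equal-area or local UTM CRS:

```python
import math

lat_deg = 38.8363          # latitude of the polygon's northern edge, in degrees
area_3857_km2 = 2.768233   # area as computed in EPSG:3857

# Web Mercator stretches lengths by about 1/cos(latitude),
# so areas are inflated by about 1/cos(latitude)**2
scale = 1 / math.cos(math.radians(lat_deg))
area_ground_km2 = area_3857_km2 / scale ** 2

print(round(area_ground_km2, 2))
```

Under these assumptions the ground area comes out at roughly 1.7 square kilometers, noticeably smaller than the Web Mercator figure.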


We can see the whole dataframe structure with our new column, as follows:

In [61]:
print(data)

   FID                                           geometry          area  \
0    0  POLYGON ((-8606710.958 4698250.004, -8604996.6...  2.768233e+06   

   area_km2  
0  2.768233  


We are able to loop over any content in a GeoDataFrame the same way we would a normal DataFrame, by using the `iterrows()` method, as follows:

In [62]:
for row in data.iterrows():
    print(row)

(0, FID                                                         0
geometry    POLYGON ((-8606710.958478265 4698250.004406621...
area                                            2768233.05248
area_km2                                             2.768233
Name: 0, dtype: object)


This means we can access and print specific parts of each row. 

The important thing to remember is that you have the row index (here it's a zero) and then the actual row information.

For example, we can break out the row index here using `[0]`, and the row information using `[1]`:

In [63]:
for row in data.iterrows():
    
    ##this will print our row index
    print(row[0]) 
    print('')
    print('')
    ##this will print our row information
    print(row[1])

0


FID                                                         0
geometry    POLYGON ((-8606710.958478265 4698250.004406621...
area                                            2768233.05248
area_km2                                             2.768233
Name: 0, dtype: object
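
Because each item is an `(index, row)` tuple, a common idiom is to unpack it directly in the `for` statement instead of indexing with `[0]` and `[1]`. A minimal sketch on a plain pandas dataframe, where the second row is a hypothetical value added only so the loop has more than one iteration:

```python
import pandas as pd

df_demo = pd.DataFrame({'FID': [0, 1], 'area_km2': [2.768233, 1.5]})

# Unpack the (index, row) tuple directly in the for statement
for idx, row in df_demo.iterrows():
    print(idx, row['area_km2'])
```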


We can then access just the geometry as follows:

In [64]:
for row in data.iterrows():
    
    ##this will print our row geometry
    print(row[1]['geometry'])

POLYGON ((-8606710.958478265 4698250.004406621, -8604996.638320047 4698250.004406621, -8604996.638320047 4696635.23423711, -8606599.63898747 4696635.23423711, -8606710.958478265 4696635.23423711, -8606710.958478265 4698250.004406621))


And we can carry out any manipulations we want in this loop, such as taking the area (let's reuse this calculation as we used it before, so you will be familiar):

In [65]:
for row in data.iterrows():
    
    ##this will calculate the area of our row geometry in square kilometers
    area_km2 = (row[1]['geometry'].area / 1e6)
    
    ##this will round our area to 1 decimal place
    area_km2 = round(area_km2, 1)
    
    print("The area of GMU campus is {} square kilometers".format(area_km2))

The area of GMU campus is 2.8 square kilometers
