# Intro to `pandas`

Python has a very large number of open-source libraries that we can import.

The Python data analysis package `pandas` is a favorite: https://pandas.pydata.org/

`pandas` is built around **Dataframes**, which store data in labelled rows and columns. You may have encountered this data structure in R or Matlab. If not, you can think of it as similar to a spreadsheet (except with far greater flexibility and power in Python).

In this tutorial, we will focus on manipulation of dataframes using pandas. 

<div class="alert alert-block alert-info">
    
<b>Note:</b> Several libraries build on pandas, such as geopandas for geospatial data, and you can use them for highly specialised tasks. **Remember Google is your friend!**
</div>

To begin working with dataframes, we first install and import pandas (as we did for packages in previous sessions). 

<div class="alert alert-block alert-warning">

<b>!! Note !!</b> pandas is not part of the Python standard library, so install it before starting the tutorial if it is not already available. You may need to exit and restart the notebook after installing.

</div>
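Before installing, it can help to check whether pandas is already available. The following is a minimal sketch using only the standard library; the variable name `pandas_available` is illustrative and not part of the original tutorial.

```python
import importlib.util

# find_spec returns None when the package cannot be found in this environment
pandas_available = importlib.util.find_spec("pandas") is not None

if pandas_available:
    import pandas as pd
    print("pandas version:", pd.__version__)
else:
    print("pandas is not installed in this environment")
```

If the check reports that pandas is missing, run the install cell that follows.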

In [1]:
# Example
import sys
!conda install --yes --prefix {sys.prefix} pandas

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: C:\Users\eoughton\Anaconda3\envs\test

  added / updated specs:
    - pandas


The following packages will be UPDATED:

  pandas                               1.4.3-py39hd77b12b_0 --> 1.5.3-py39hf11a4ad_0 



Downloading and Extracting Packages


Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done




  current version: 23.1.0
  latest version: 23.7.4

Please update conda by running

    $ conda update -n base -c defaults conda

Or to minimize the number of packages updated during conda update use

     conda install conda=23.7.4




## Creating a `pandas` dataframe

Here we focus on ways of creating dataframes. In most day-to-day programming, however, you are likely to work with data in other formats, which you can convert to a dataframe with a single line of code.

Nevertheless, a dataframe can be created by combining lists into a dictionary and then converting the dictionary to a dataframe, as shown in the subsequent steps, using an example of satellite missions.

First, let's create some lists with our information.

In [2]:
# Example
satellites = ['LandSat','Sentinel','PlanetScope','Starlink','Iridium','OneWeb']
year = [1972,2013,2015,2018,1999,2021]
government_owned = [True,True,False,False,False,False]
mission = ['Remote sensing','Remote sensing','Remote sensing','Communication','Communication','Communication']

satellites, year, government_owned, mission

(['LandSat', 'Sentinel', 'PlanetScope', 'Starlink', 'Iridium', 'OneWeb'],
 [1972, 2013, 2015, 2018, 1999, 2021],
 [True, True, False, False, False, False],
 ['Remote sensing',
  'Remote sensing',
  'Remote sensing',
  'Communication',
  'Communication',
  'Communication'])

Next, we need to combine the lists into a dictionary:

In [3]:
# Example
sat_data = {
    'satellite': satellites,
    'year': year,
    'government_owned': government_owned,
    'mission': mission
}
sat_data

{'satellite': ['LandSat',
  'Sentinel',
  'PlanetScope',
  'Starlink',
  'Iridium',
  'OneWeb'],
 'year': [1972, 2013, 2015, 2018, 1999, 2021],
 'government_owned': [True, True, False, False, False, False],
 'mission': ['Remote sensing',
  'Remote sensing',
  'Remote sensing',
  'Communication',
  'Communication',
  'Communication']}

Finally, we convert the dictionary into a dataframe using the `pandas` constructor `pd.DataFrame()`.

In [4]:
# Example
import pandas as pd
df = pd.DataFrame(sat_data)
df

     satellite  year  government_owned         mission
0      LandSat  1972              True  Remote sensing
1     Sentinel  2013              True  Remote sensing
2  PlanetScope  2015             False  Remote sensing
3     Starlink  2018             False   Communication
4      Iridium  1999             False   Communication
5       OneWeb  2021             False   Communication
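A dictionary of lists is only one of the inputs that `pd.DataFrame()` accepts. As a minimal sketch, assuming the same kind of data arrives as one dictionary per satellite (the names `records` and `df_records` are illustrative), the conversion is still a single line:

```python
import pandas as pd

# Hypothetical input: one dictionary per satellite (a "records" layout)
records = [
    {'satellite': 'LandSat', 'year': 1972, 'government_owned': True, 'mission': 'Remote sensing'},
    {'satellite': 'Starlink', 'year': 2018, 'government_owned': False, 'mission': 'Communication'},
]

# Each dictionary becomes a row; the keys become the column names
df_records = pd.DataFrame(records)

print(df_records.shape)            # (2, 4)
print(list(df_records.columns))    # ['satellite', 'year', 'government_owned', 'mission']
```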


The dataframe can be written out to other formats, such as an Excel spreadsheet or a CSV file. For example:

In [5]:
# Example
df.to_csv('satellite_missions.csv', index=False)
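To see what `index=False` changes, the sketch below writes a small dataframe with and without the index and reads both versions back. It uses in-memory text buffers rather than files so nothing is left on disk; the names `df_demo`, `with_index` and `without_index` are illustrative.

```python
import io
import pandas as pd

df_demo = pd.DataFrame({'satellite': ['LandSat', 'Sentinel'], 'year': [1972, 2013]})

# Write to in-memory buffers instead of files
with_index = io.StringIO()
without_index = io.StringIO()
df_demo.to_csv(with_index)                    # default: the index is written as an unnamed first column
df_demo.to_csv(without_index, index=False)    # the index is left out

# Read both versions back and compare the column names
cols_with = list(pd.read_csv(io.StringIO(with_index.getvalue())).columns)
cols_without = list(pd.read_csv(io.StringIO(without_index.getvalue())).columns)

print(cols_with)       # ['Unnamed: 0', 'satellite', 'year']
print(cols_without)    # ['satellite', 'year']
```

Leaving the index out avoids the extra `Unnamed: 0` column when the file is read again later.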

## Exercise 1

Have a go at manually creating a `pandas` dataframe called 'image_data' using the following:

    - A column called 'id' containing four numbers from 0-3.
    - A column called 'longitude' containing 0.02, 0.05, 0.06 and 0.08.
    - A column called 'latitude' containing 1.54, 1.65, 1.48 and 1.59.
    - A column called 'filename' containing 'multiband0.tiff', 'multiband1.tiff', 'multiband2.tiff' and 'multiband3.tiff'.
    
Write this dataframe to a .csv file called 'my_csv.csv'. 
        

In [6]:
#Enter your attempt below:


## Accessing data

`pandas` provides many functions for reading and inspecting data.

We can easily import a .csv file into our notebook using the `pd.read_csv()` function.

In [7]:
df = pd.read_csv('satellite_missions.csv')
df

     satellite  year  government_owned         mission
0      LandSat  1972              True  Remote sensing
1     Sentinel  2013              True  Remote sensing
2  PlanetScope  2015             False  Remote sensing
3     Starlink  2018             False   Communication
4      Iridium  1999             False   Communication
5       OneWeb  2021             False   Communication


Then we can view the first few rows of data using the `.head()` method:

In [8]:
# Example
df.head(n=3)

     satellite  year  government_owned         mission
0      LandSat  1972              True  Remote sensing
1     Sentinel  2013              True  Remote sensing
2  PlanetScope  2015             False  Remote sensing


You can view the last few rows of your data using the `.tail()` method.

In [9]:
# Example
df.tail(n=3)

   satellite  year  government_owned        mission
3   Starlink  2018             False  Communication
4    Iridium  1999             False  Communication
5     OneWeb  2021             False  Communication


You can find out how many rows and columns are in your data using the `.shape` attribute (note that it is an attribute, so there are no parentheses).

In [10]:
# Example
df.shape

(6, 4)

You can find out the names of the columns using the `.columns` attribute.

In [11]:
# Example
list(df.columns)

['satellite', 'year', 'government_owned', 'mission']
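Because `.shape` is a plain tuple and `.columns` behaves like a sequence of names, both can be used directly in ordinary Python code. A minimal sketch on a small illustrative dataframe (`df_demo` is not part of the original tutorial):

```python
import pandas as pd

df_demo = pd.DataFrame({
    'satellite': ['LandSat', 'Sentinel', 'PlanetScope'],
    'year': [1972, 2013, 2015],
})

# Unpack the (rows, columns) tuple into two variables
n_rows, n_cols = df_demo.shape
print(n_rows, n_cols)              # 3 2

# len() on a dataframe gives the number of rows
print(len(df_demo))                # 3

# Check whether a column exists before using it
has_year = 'year' in df_demo.columns
print(has_year)                    # True
```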

You can access a single column using the following syntax.

In [12]:
# Example
df['satellite']

0        LandSat
1       Sentinel
2    PlanetScope
3       Starlink
4        Iridium
5         OneWeb
Name: satellite, dtype: object

You can access multiple columns using the following line of code.

In [13]:
# Example
df[['satellite','year']]

     satellite  year
0      LandSat  1972
1     Sentinel  2013
2  PlanetScope  2015
3     Starlink  2018
4      Iridium  1999
5       OneWeb  2021
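The two access styles return different types, which matters for what you can do next. The sketch below, on a small illustrative dataframe (`df_demo`, `single` and `frame` are assumed names), shows that single brackets give a Series while double brackets give a dataframe:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'satellite': ['LandSat', 'Sentinel', 'PlanetScope'],
    'year': [1972, 2013, 2015],
})

single = df_demo['satellite']      # single brackets -> pandas Series
frame = df_demo[['satellite']]     # double brackets -> one-column DataFrame

print(type(single).__name__)       # Series
print(type(frame).__name__)        # DataFrame

# A Series can be turned into a plain Python list
print(single.tolist())             # ['LandSat', 'Sentinel', 'PlanetScope']
```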


You can access an individual row by passing its index label to `.loc`, e.g.:

In [14]:
# Example
df.loc[4]

satellite                 Iridium
year                         1999
government_owned            False
mission             Communication
Name: 4, dtype: object

You can access a range of rows by slicing with `.loc`. Unlike standard Python slicing, both endpoints are included:

In [15]:
# Example
df.loc[3:4]

   satellite  year  government_owned        mission
3   Starlink  2018             False  Communication
4    Iridium  1999             False  Communication
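`.loc` selects by index label, whereas the related `.iloc` selects by integer position. The difference shows up in slicing and after filtering, as this minimal sketch illustrates (the dataframe `df_demo` is rebuilt here so the block runs on its own):

```python
import pandas as pd

df_demo = pd.DataFrame({
    'satellite': ['LandSat', 'Sentinel', 'PlanetScope', 'Starlink', 'Iridium', 'OneWeb'],
    'year': [1972, 2013, 2015, 2018, 1999, 2021],
})

by_label = df_demo.loc[3:4]        # label-based: rows labelled 3 and 4 (end included)
by_position = df_demo.iloc[3:4]    # position-based: only the row at position 3 (end excluded)
print(len(by_label), len(by_position))    # 2 1

# After filtering, the original labels are kept, so labels and positions differ
recent = df_demo[df_demo['year'] >= 2015]
print(list(recent.index))                 # [2, 3, 5]
print(recent.iloc[0]['satellite'])        # PlanetScope (first remaining row)
print(recent.loc[2]['satellite'])         # PlanetScope (row labelled 2)
```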


You can select rows that meet a condition by subsetting, e.g. satellites launched in 2015 or after.

In [16]:
# Example
df[df['year'] >= 2015]

     satellite  year  government_owned         mission
2  PlanetScope  2015             False  Remote sensing
3     Starlink  2018             False   Communication
5       OneWeb  2021             False   Communication
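Subsetting works because the comparison produces a Series of True/False values, one per row, and the dataframe keeps only the rows marked True. A minimal sketch that makes the intermediate step visible (the names `df_demo`, `mask` and `subset` are illustrative):

```python
import pandas as pd

df_demo = pd.DataFrame({
    'satellite': ['LandSat', 'Sentinel', 'PlanetScope', 'Starlink', 'Iridium', 'OneWeb'],
    'year': [1972, 2013, 2015, 2018, 1999, 2021],
})

# The comparison is evaluated row by row and returns a boolean Series
mask = df_demo['year'] >= 2015
print(mask.tolist())     # [False, False, True, True, False, True]
print(mask.sum())        # 3 (True counts as 1)

# Passing the mask in square brackets keeps only the True rows
subset = df_demo[mask]
print(subset['satellite'].tolist())    # ['PlanetScope', 'Starlink', 'OneWeb']
```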


We can also apply multiple conditions at the same time. For example, selecting satellites launched in 2015 or after that are also remote sensing satellites:

In [17]:
# Example
# Combine both conditions into a single mask with & (each condition in parentheses)
df[(df['year'] >= 2015) & (df['mission'] == 'Remote sensing')]


     satellite  year  government_owned         mission
2  PlanetScope  2015             False  Remote sensing
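A common way to combine conditions is to join boolean masks with `&` (and) or `|` (or), wrapping each condition in parentheses. The sketch below also shows `DataFrame.query()` as an alternative spelling of the same filter; the variable names are illustrative.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'satellite': ['LandSat', 'Sentinel', 'PlanetScope', 'Starlink', 'Iridium', 'OneWeb'],
    'year': [1972, 2013, 2015, 2018, 1999, 2021],
    'mission': ['Remote sensing', 'Remote sensing', 'Remote sensing',
                'Communication', 'Communication', 'Communication'],
})

is_recent = df_demo['year'] >= 2015
is_remote = df_demo['mission'] == 'Remote sensing'

both = df_demo[is_recent & is_remote]      # rows meeting both conditions
either = df_demo[is_recent | is_remote]    # rows meeting at least one condition

# The same "both" filter written as a query string
same_as_both = df_demo.query("year >= 2015 and mission == 'Remote sensing'")

print(both['satellite'].tolist())     # ['PlanetScope']
print(len(either))                    # 5
```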


You can add a new column to the existing dataframe as follows:

In [18]:
# Example
df['mission_code'] = ''
df

     satellite  year  government_owned         mission  mission_code
0      LandSat  1972              True  Remote sensing
1     Sentinel  2013              True  Remote sensing
2  PlanetScope  2015             False  Remote sensing
3     Starlink  2018             False   Communication
4      Iridium  1999             False   Communication
5       OneWeb  2021             False   Communication
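A new column does not have to start empty; it can be derived from an existing one. As a minimal sketch, assuming some made-up short codes (the `code_lookup` values 'RS' and 'COM' are hypothetical, not from the original data), `.map()` translates each mission into its code:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'satellite': ['LandSat', 'Starlink', 'Iridium'],
    'mission': ['Remote sensing', 'Communication', 'Communication'],
})

# Hypothetical lookup table from mission name to a short code
code_lookup = {'Remote sensing': 'RS', 'Communication': 'COM'}

# .map() looks up every value of the 'mission' column in the dictionary
df_demo['mission_code'] = df_demo['mission'].map(code_lookup)

print(df_demo['mission_code'].tolist())    # ['RS', 'COM', 'COM']
```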


## Exercise 2

Now let us use the dataframe you created in the first exercise to practise extracting information.

First, print the shape of 'image_data':

In [19]:
#Enter your attempt below:


Now print the top rows of 'image_data' to inspect the contents:

In [20]:
#Enter your attempt below:


Subset the 'filename' column, convert to a list, and inspect the contents:

In [21]:
#Enter your attempt below:


Subset only the `longitude` and `latitude` columns, and then use a new function called `.to_dict('records')` to convert the dataframe of coordinates to a list of dicts:

In [22]:
#Enter your attempt below:


Next subset those rows which have a 'latitude' between 1.5 and 1.6:

In [23]:
#Enter your attempt below:


Finally, subset those rows which have a 'longitude' between 0.05 and 0.1:

In [24]:
#Enter your attempt below:
