# Dataframe Processing

We will use a few of the files from the course content:

* [Credit Cards](https://d2l.bowvalleycollege.ca/d2l/le/content/408503/viewContent/6071873/View)
* [Names](https://d2l.bowvalleycollege.ca/d2l/le/content/408503/viewContent/6071856/View)
* [OPSD](https://d2l.bowvalleycollege.ca/d2l/le/content/408503/viewContent/6071875/View)

We will use each of these files to demonstrate [Pandas Dataframes](https://pandas.pydata.org/). Please refer to the [Pandas API Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/index.html).

In [1]:
import pandas as pd
from math import pi

## What is a Dataframe

To begin let's look at what a Dataframe is by making a new empty dataframe.

In [None]:
# Make an empty call to make an empty dataframe
newdf = pd.DataFrame()
print(newdf)

## Adding columns

We can add columns when we create a dataframe, or afterwards.

In [None]:
# Create an empty dataframe that already has two named columns
newdf = pd.DataFrame(columns = ["Identifier", "Description"])
print(newdf)

# Add columns after creation
newdf = pd.DataFrame()
newdf.insert(0, "Identifier", None)
newdf.insert(1, "Description", None)
print(newdf)

## Do Not Manage Columns using Attribute Access

Even though we can add and modify columns using the period `.` after the dataframe, this
is discouraged because column names can collide with the dataframe's own methods and properties.

In [None]:
# Please do not do this
newdf = pd.DataFrame()
newdf.insert(0, "Identifier", None)
print(newdf)
newdf.Identifier = ["Hello"]
print(newdf)
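
The following sketch shows two ways attribute access can mislead. The column names `count` and `total` are made up for illustration; `count` was chosen because it is also the name of a real dataframe method.

```python
import warnings

import pandas as pd

# Hypothetical frame whose column name "count" collides with DataFrame.count()
df = pd.DataFrame({"count": [3, 5], "Identifier": ["a", "b"]})

# Attribute access finds the method, not the column
attribute_is_method = callable(df.count)

# Bracket indexing always returns the column
column_values = df["count"].tolist()

# Assigning to a new attribute name does not create a column.
# Pandas emits a UserWarning here, which this sketch silences.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df.total = [1, 2]
created_column = "total" in df.columns

print(attribute_is_method, column_values, created_column)
```

Bracket indexing, `df["count"]`, avoids both problems.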

## Indexing in Dataframes

The columns of a dataframe can be accessed directly by indexing on the dataframe, because
a dataframe behaves like a dictionary of named columns:

* `df["Column Name"]` selects all the rows of the column.
* `df[["Column A", "Column B"]]` selects all the rows of a list of columns.

To access a dataframe along one or both dimensions, rows or columns, we use the special `loc`
attribute, which accepts two-dimensional indexing:

* `df.loc[:, "Column Name"]` selects all the rows of the column.
* `df.loc[row]` selects a single row.
* `df.loc[start:stop]` selects a slice of rows.
* `df.loc[:, ["Column A", "Column B"]]` selects all the rows of a list of columns.

We can use location indexing to modify both rows and columns, including inserting, updating, and deleting.
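
As a minimal sketch of the `loc` forms listed above, the frame below uses made-up data and non-default row labels (10, 20, 30, 40) so that labels are not confused with positions. Note that label slices with `loc` include the stop label, unlike ordinary Python slices.

```python
import pandas as pd

# Hypothetical frame with explicit row labels
df = pd.DataFrame(
    {"Code": ["UV", "WT", "ABC", "Z"], "Value": [1, 2, 3, 4]},
    index=[10, 20, 30, 40],
)

one_column = df.loc[:, "Code"]               # Series: all rows of one column
one_row = df.loc[20]                         # Series: the row labelled 20
label_slice = df.loc[10:30]                  # DataFrame: labels 10, 20 and 30
two_columns = df.loc[:, ["Code", "Value"]]   # DataFrame: all rows, two columns
one_cell = df.loc[30, "Code"]                # scalar: a single cell

print(len(label_slice), one_cell)
```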

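One of the most common tasks with dataframes is to make new columns that are calculated from
old columns. For this case Pandas provides a simple assignment syntax, where the right-hand
side supplies one value per row:
```python
def makealist(n):
    # Placeholder calculation: return one value for each of the n rows
    return [None] * n

df["calculated column"] = makealist(len(df))
```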

In [None]:
# Using direct indexing
newdf = pd.DataFrame()
newdf["Identifier"] = ["45", "23", "-10"]
print(newdf)

# Add another couple columns
newdf["Code"] = None
newdf["Description"] = None
print(newdf)

# Our first look at loc: set every row of one column, first with a single value
# and then with a list of values
newdf.loc[:, "Code"] = "UV"
print(newdf)
newdf.loc[:, "Code"] = ["UV", "WT", "ABC"]
print(newdf)

# Let's add an entire row
newdf.loc["new row"] = ["76", "IJ", "This is an index example"]
print(newdf)

# We can overwrite
newdf.loc[0] = ["23-i76", "Z", "Complex number"]
print(newdf)

# Single cell modification
newdf.loc[0, "Description"] = "Complicated Number"
print(newdf)

## Indexing is General

In [None]:
print(newdf.loc["new row"])

## Always check the column data types

In [None]:
print(newdf.dtypes)

# Fix some datatypes
newdf["Identifier"] = newdf["Identifier"].astype(str)
print(newdf.dtypes)
print(newdf)
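
As a hedged illustration of fixing datatypes, the sketch below converts a small made-up text column in two ways: to the Pandas string type with `astype`, and to a nullable integer with `pd.to_numeric`. The sample values mirror the identifiers used above. Passing `errors="coerce"` turns values that cannot be parsed into missing values instead of raising an error.

```python
import pandas as pd

# Sample identifiers; the first one is not a valid integer
df = pd.DataFrame({"Identifier": ["23-i76", "23", "-10", "76"]})

# Dedicated string type instead of the generic object type
as_string = df["Identifier"].astype(pd.StringDtype())

# Parse numbers, turning unparseable text into missing values,
# then store the result as a nullable integer
as_number = pd.to_numeric(df["Identifier"], errors="coerce").astype(pd.Int64Dtype())

print(as_string.dtype)
print(as_number)
```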

## File I/O

We can directly read and write a number of data file formats using dataframes, including
Excel and CSV. After every read, check how Pandas translated the datatypes, because its
guesses are often wrong.

In [None]:
# Pandas' implicit type guessing casts string columns, including those that contain
# empty values, as objects.
hurricanes = pd.read_csv("../data/hurdat2.csv")
print(hurricanes.dtypes)

Let's try again and tell Pandas the column datatypes.

In [2]:
# Explicitly use the Pandas data types that can handle missing values.
hurricanes = pd.read_csv(
    "../data/hurdat2.csv",
    parse_dates = [ "Observed" ],
    dtype = {
        "Identifier": pd.StringDtype(),
        "Basin Code": pd.StringDtype(),
        "Basin Name": pd.StringDtype(),
        "Storm Number": pd.Int64Dtype(),
        "Season Year": pd.Int64Dtype(),
        "Storm Name": pd.StringDtype(),
        "Tracks": pd.Int64Dtype(),
        "Track Code": pd.StringDtype(),
        "Track Type": pd.StringDtype(),
        "Storm Code": pd.StringDtype(),
        "Storm Type": pd.StringDtype(),
        "Latitude": pd.Float64Dtype(),
        "Longitude": pd.Float64Dtype(),
        "Maximum Wind (kt)": pd.Int64Dtype(),
        "Minimum Pressure (mbar)": pd.Int64Dtype(),
        "NE Radius (nmi)": pd.Int64Dtype(),
        "SE Radius (nmi)": pd.Int64Dtype(),
        "SW Radius (nmi)": pd.Int64Dtype(),
        "NW Radius (nmi)": pd.Int64Dtype(),
        "Eye Radius (nmi)": pd.Int64Dtype()
    }
)
print(hurricanes.dtypes)

Identifier                      string[python]
Basin Code                      string[python]
Basin Name                      string[python]
Storm Number                             Int64
Season Year                              Int64
Storm Name                      string[python]
Tracks                                   Int64
Observed                   datetime64[ns, UTC]
Track Code                      string[python]
Track Type                      string[python]
Storm Code                      string[python]
Storm Type                      string[python]
Latitude                               Float64
Longitude                              Float64
Maximum Wind (kt)                        Int64
Minimum Pressure (mbar)                  Int64
NE Radius (nmi)                          Int64
SE Radius (nmi)                          Int64
SW Radius (nmi)                          Int64
NW Radius (nmi)                          Int64
Eye Radius (nmi)                         Int64
dtype: object
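
The effect of explicit datatypes can be reproduced without the course file. This sketch reads a made-up three-line CSV from memory; the column names are borrowed from the hurricane data but the values are illustrative only.

```python
import io

import pandas as pd

# Made-up CSV text; the second data row has no wind value
text = "Storm Name,Maximum Wind (kt)\nALEX,50\nRAMON,\nPILAR,45\n"

# Let Pandas guess: the missing value forces the integer column to float64
guessed = pd.read_csv(io.StringIO(text))

# Tell Pandas the types: the nullable Int64 type keeps integers and marks the gap as <NA>
explicit = pd.read_csv(
    io.StringIO(text),
    dtype={"Storm Name": pd.StringDtype(), "Maximum Wind (kt)": pd.Int64Dtype()},
)

print(guessed.dtypes)
print(explicit.dtypes)
```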

## Searching Tabular Data

Row and column indexing works well when you know where you need to go, much like having a
street address. What do we do when we do not know the precise row and column, but rather
want to find matches according to some criteria? In this case we use vectorized versions of the
logical and mathematical operators (they implicitly loop over the elements), and combine
them with Boolean indexing.

### Tabular Slicing

Select all the columns of 10 rows.

In [None]:
hurricanes[45:55]

In [None]:
hurricanes[45:46]

In [None]:
type(hurricanes[45:46])

In [None]:
hurricanes["Minimum Pressure (mbar)"]

In [None]:
hurricanes[["Identifier", "Basin Code"]]

Allowed arguments to a Pandas dataframe index `[]`:

* A numerical slice `start:stop`, slices rows.
* A single string `"Column"`, slices one column.
* A list of columns `["Column A", "Column B"]`, slices a set of columns.
* A list of Booleans `True` and `False` that has the same number of rows as the dataframe.
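
The four kinds of argument can be tried on a small frame. This is a sketch with a handful of sample rows rather than the full hurricane file.

```python
import pandas as pd

# Small sample frame with hurricane-style columns
df = pd.DataFrame({
    "Identifier": ["AL011851", "AL011851", "EP202023", "EP202023"],
    "Basin Code": ["AL", "AL", "EP", "EP"],
    "Maximum Wind (kt)": [80, 80, 35, 30],
})

row_slice = df[1:3]                               # numerical slice: rows at positions 1 and 2
one_column = df["Maximum Wind (kt)"]              # single string: one column, as a Series
two_columns = df[["Identifier", "Basin Code"]]    # list of columns: a narrower DataFrame
masked = df[[True, False, False, True]]           # list of Booleans: rows marked True

print(row_slice.shape, two_columns.shape, masked.index.tolist())
```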

Chain a row slice and a column selection, in either order.

In [None]:
hurricanes["Identifier"][45:46]

In [None]:
hurricanes[45:46]["Identifier"]

In [None]:
type(hurricanes[45:46]["Identifier"])

In [None]:
type(hurricanes[45:46][["Identifier", "Basin Code"]])

In [None]:
hurricanes[45:55][["Identifier", "Basin Code"]]

Select the columns first and then the rows.

In [None]:
hurricanes[["Identifier", "Basin Code"]][45:55]

There are some performance rules of thumb when slicing. Generally slice by the narrowest
selection first to reduce the amount of computation.

* If you want to select a small number of columns and most rows, then slice by columns and
then rows.
* If you are selecting a small number of rows, and most columns, then slice by rows first.
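
Whichever order is chosen, the selected data is the same. The sketch below checks this on a small sample frame and also shows `loc` doing both selections in a single step. It makes no claim about speed, which depends on the data.

```python
import pandas as pd

# Small sample frame with the default 0..n-1 row labels
df = pd.DataFrame({
    "Identifier": ["AL011851", "AL011851", "EP202023", "EP202023"],
    "Basin Code": ["AL", "AL", "EP", "EP"],
    "Maximum Wind (kt)": [80, 80, 35, 30],
})

rows_then_columns = df[1:3][["Identifier", "Basin Code"]]
columns_then_rows = df[["Identifier", "Basin Code"]][1:3]

# loc selects rows and columns together; its label slice 1:2 includes row 2
single_step = df.loc[1:2, ["Identifier", "Basin Code"]]

print(rows_then_columns.equals(columns_then_rows))
print(rows_then_columns.equals(single_step))
```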

### Boolean Indexing

What happens if we replace the numerical slice with a list of Booleans? According to the
[documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html), the
list must be the same length as the dataframe.
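
A quick way to see the length rule in action is to pass a list that is too short. This sketch uses a three-row frame of made-up identifiers.

```python
import pandas as pd

df = pd.DataFrame({"Identifier": ["a", "b", "c"]})

# A Boolean list of the right length keeps the rows marked True
selected = df[[True, False, True]]

# A Boolean list of the wrong length is rejected
try:
    df[[True, False]]
    mismatch_rejected = False
except ValueError as error:
    mismatch_rejected = True
    print(error)

print(selected.index.tolist(), mismatch_rejected)
```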

In [None]:
# Note that the underscore _ is a placeholder for an unused variable
print(len(hurricanes))
chosen = [ False for _ in range(len(hurricanes))]
chosen[45:55] = [ True for _ in range(10) ]
chosen[80000] = True
chosen[40:60]

Let's try the same three slices but with the Boolean list, starting with all the columns.

In [None]:
hurricanes[chosen]

In [None]:
hurricanes[45:55]

Select the rows and then the columns.

In [None]:
hurricanes[chosen][["Identifier", "Basin Code"]]

Select the columns and then the rows.

In [None]:
hurricanes[["Identifier", "Basin Code"]][chosen]

The Boolean index can be generated by any means, most commonly with the vectorized
versions of the comparison operators. Let's find all the observations with a maximum wind
of at least 150 kt. Usually this is done in a single step but we will break it down. The
following table maps between the scalar syntax and the vectorized syntax. Vectorized
operations implicitly loop over the data and are usually much faster and more concise than
writing explicit loops. In general, explicitly looping through dataframes is very
slow.

| Scalar      | Vector              |
|-------------|---------------------|
|`==`         |`==`                 |
|`<=`         |`<=`                 |
|`>=`         |`>=`                 |
|`>`          |`>`                  |
|`<`          |`<`                  |
|`+`          |`+`                  |
|`-`          |`-`                  |
|`*`          |`*`                  |
|`/`          |`/`                  |
|`%`          |`%`                  |
|`**`         |`**`                 |
|`and`        |`&`                  |
|`or`         |`\|`                 |
|`not`        |`~`                  |
|`is None`    |`isna()` `isnull()`  |
|`is not None`|`notna()` `notnull()`|
|`for in`     |`apply()`            |

The most important point to remember is that each comparison needs to be wrapped in parentheses
`()`, because `&` and `|` bind more tightly than the comparison operators and would otherwise
be evaluated first.
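
To illustrate why the parentheses matter, this sketch compares the two spellings on two short made-up series. Without parentheses, Python evaluates `150 | pressure` first, and the remaining chained comparison fails because a whole series has no single truth value.

```python
import pandas as pd

# Made-up wind (kt) and pressure (mbar) readings
wind = pd.Series([160, 100, 40])
pressure = pd.Series([950, 890, 1000])

# Correct: each comparison is evaluated before the | operator
extreme = (wind >= 150) | (pressure <= 900)

# Incorrect: parsed as wind >= (150 | pressure) <= 900
try:
    wind >= 150 | pressure <= 900
    unparenthesized_failed = False
except ValueError:
    unparenthesized_failed = True

print(extreme.tolist(), unparenthesized_failed)
```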

In [None]:
# Implicitly loop through all the values of wind speed and compare to 150.
extremewind = hurricanes["Maximum Wind (kt)"] >= 150
print(type(extremewind))
print(extremewind)

# Now grab only those records
extremewindDF = hurricanes[extremewind]

In [3]:
# Combine two vectorized comparisons: wind of at least 150 kt, or pressure of at most
# 900 mbar (missing pressures are filled with 1015 so they never match).
extremestorm = (
    (hurricanes["Maximum Wind (kt)"] >= 150) |
    (hurricanes["Minimum Pressure (mbar)"].fillna(1015) <= 900)
)
print(type(extremestorm))
print(extremestorm)

# Now grab only those records
extremestormDF = hurricanes[extremestorm]

<class 'pandas.core.series.Series'>
0        False
1        False
2        False
3        False
4        False
         ...  
85922    False
85923    False
85924    False
85925    False
85926    False
Length: 85927, dtype: boolean


For a more complicated example, suppose we wanted to get all the records for storms that
had an extreme observation at any point. To do this we can use the `isin` method to find all
the storm track records based on the identifier.

In [4]:
# The storm identifiers with at least one extreme observation
identifiers = hurricanes[extremestorm]["Identifier"]
print(identifiers)

# Logical test that each track identifier is one of the extreme storm identifiers
extremetracks = hurricanes["Identifier"].isin(identifiers)
notmostdangerous = ~extremetracks
print(extremetracks)

# Get the data
extremetracksDF = hurricanes[extremetracks]
notmostdangerousDF = hurricanes[notmostdangerous]

16757    AL141932
16758    AL141932
16759    AL141932
16760    AL141932
17941    AL031935
           ...   
77515    EP202009
81176    EP202015
81177    EP202015
81178    EP202015
81179    EP202015
Name: Identifier, Length: 82, dtype: string
0        False
1        False
2        False
3        False
4        False
         ...  
85922    False
85923    False
85924    False
85925    False
85926    False
Name: Identifier, Length: 85927, dtype: bool
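
The same pattern is easier to follow on a tiny table. In this sketch the storm identifiers `S1` to `S3` and their wind speeds are invented. Storm `S1` is extreme in only one of its two records, yet `isin` returns both.

```python
import pandas as pd

# Invented track records for three storms
tracks = pd.DataFrame({
    "Identifier": ["S1", "S1", "S2", "S2", "S3"],
    "Maximum Wind (kt)": [90, 155, 60, 70, 150],
})

# Records that are extreme on their own
extreme = tracks["Maximum Wind (kt)"] >= 150

# Identifiers of storms with at least one extreme record
identifiers = tracks[extreme]["Identifier"]

# Every record of those storms, and every record of the remaining storms
in_extreme_storm = tracks["Identifier"].isin(identifiers)
whole_storms = tracks[in_extreme_storm]
others = tracks[~in_extreme_storm]

print(whole_storms)
print(others)
```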


### Missing Values

Another critical query is to determine which records are missing values. We can generate a
logical index of missing values using the `isna` method or its alias `isnull`. We can also
reverse this using the negation operator `~`, or equivalently the `notna` or `notnull`
methods. The `fillna()` method is used to replace missing values with a specified
default value.

In [None]:
# True where the track code is missing
missingtype = hurricanes["Track Code"].isna()
print(missingtype)
hurricanes[missingtype]
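
A small sketch of the three methods on a made-up series of track codes, assuming the same nullable string type used when reading the file:

```python
import pandas as pd

# Made-up track codes with two missing values
codes = pd.Series(["L", pd.NA, "O", pd.NA], dtype=pd.StringDtype())

missing = codes.isna()       # True where the value is missing
present = codes.notna()      # the exact opposite of isna()
filled = codes.fillna("O")   # a new series; codes itself is unchanged

print(missing.tolist())
print(filled.tolist())
```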

As an example of replacing missing values we can supply a default indicator to the track
record.

In [9]:
print("Fill the track code")
print(hurricanes["Track Code"].fillna("O"))
print("Fill the track type")
print(hurricanes["Track Type"].fillna("Observation"))

# Note the changes are not permanent.
print("Changes are not saved!")
print(hurricanes[[ "Track Code", "Track Type" ]])

Fill the track code
0        O
1        O
2        O
3        O
4        L
        ..
85922    O
85923    O
85924    O
85925    O
85926    O
Name: Track Code, Length: 85927, dtype: string
Fill the track type
0        Observation
1        Observation
2        Observation
3        Observation
4           Landfall
            ...     
85922    Observation
85923    Observation
85924    Observation
85925    Observation
85926    Observation
Name: Track Type, Length: 85927, dtype: string
Changes are not saved!
      Track Code Track Type
0           <NA>       <NA>
1           <NA>       <NA>
2           <NA>       <NA>
3           <NA>       <NA>
4              L   Landfall
...          ...        ...
85922       <NA>       <NA>
85923       <NA>       <NA>
85924       <NA>       <NA>
85925       <NA>       <NA>
85926       <NA>       <NA>

[85927 rows x 2 columns]


### Calculated Columns

Another common task is to create new columns from the existing data. In this example we
will estimate the area of the storm using the vectorized math operations. We have already
created a Pandas series, a column, of Booleans. We can assign the Booleans to a new
column.

In [None]:
# The LHS and RHS must have equal number of rows.
hurricanes["Is Extreme"] = extremetracks

The one exception to requiring equal lengths is broadcasting a single scalar value across
all rows.

In [6]:
hurricanes["Broadcast Test"] = False
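
The length rule and its scalar exception can be checked on a three-row frame. The column name `Too Short` is made up for this sketch.

```python
import pandas as pd

df = pd.DataFrame({"Identifier": ["a", "b", "c"]})

# A scalar is broadcast to every row
df["Broadcast Test"] = False

# A list with one value per row is accepted
df["Is Extreme"] = [True, False, True]

# A list with the wrong number of values is rejected
try:
    df["Too Short"] = [1, 2]
    length_mismatch_rejected = False
except ValueError as error:
    length_mismatch_rejected = True
    print(error)

print(df)
```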

Note how the next cell chains vectorized functions together. The formula we use is
$$
\text{Area} = \frac{\pi}{4}\left(\text{NE}^2 + \text{SE}^2 + \text{SW}^2 + \text{NW}^2 \right)
$$

In [12]:
hurricanes["Estimated Area"] = (
    pi * (
        hurricanes["NE Radius (nmi)"]**2 +
        hurricanes["SE Radius (nmi)"]**2 +
        hurricanes["SW Radius (nmi)"]**2 +
        hurricanes["NW Radius (nmi)"]**2
    ) / 4
).round().astype(pd.Int64Dtype()).fillna(400)
hurricanes

Unnamed: 0,Identifier,Basin Code,Basin Name,Storm Number,Season Year,Storm Name,Tracks,Observed,Track Code,Track Type,...,Maximum Wind (kt),Minimum Pressure (mbar),NE Radius (nmi),SE Radius (nmi),SW Radius (nmi),NW Radius (nmi),Eye Radius (nmi),Is Extreme,Broadcast Test,Estimated Area
0,AL011851,AL,Atlantic,1,1851,,14,1851-06-25 00:00:00+00:00,O,Observation,...,80,,,,,,,False,False,400
1,AL011851,AL,Atlantic,1,1851,,14,1851-06-25 06:00:00+00:00,O,Observation,...,80,,,,,,,False,False,400
2,AL011851,AL,Atlantic,1,1851,,14,1851-06-25 12:00:00+00:00,O,Observation,...,80,,,,,,,False,False,400
3,AL011851,AL,Atlantic,1,1851,,14,1851-06-25 18:00:00+00:00,O,Observation,...,80,,,,,,,False,False,400
4,AL011851,AL,Atlantic,1,1851,,14,1851-06-25 21:00:00+00:00,L,Landfall,...,80,,,,,,,False,False,400
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
85922,EP202023,EP,East Pacific,20,2023,RAMON,25,2023-11-26 06:00:00+00:00,O,Observation,...,35,1004,40,30,0,30,30,False,False,2670
85923,EP202023,EP,East Pacific,20,2023,RAMON,25,2023-11-26 12:00:00+00:00,O,Observation,...,30,1006,0,0,0,0,20,False,False,0
85924,EP202023,EP,East Pacific,20,2023,RAMON,25,2023-11-26 18:00:00+00:00,O,Observation,...,25,1008,0,0,0,0,20,False,False,0
85925,EP202023,EP,East Pacific,20,2023,RAMON,25,2023-11-27 00:00:00+00:00,O,Observation,...,25,1008,0,0,0,0,20,False,False,0
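
As a sanity check on the area formula, this sketch recomputes it for two sets of radii copied from rows of the notebook output (40, 30, 0, 30 and 75, 90, 60, 20), plus one hypothetical row with missing radii to show where the default of 400 comes from.

```python
from math import pi

import pandas as pd

# Two rows of radii from the notebook output and one hypothetical all-missing row
radii = pd.DataFrame({
    "NE Radius (nmi)": pd.array([40, 75, pd.NA], dtype="Int64"),
    "SE Radius (nmi)": pd.array([30, 90, pd.NA], dtype="Int64"),
    "SW Radius (nmi)": pd.array([0, 60, pd.NA], dtype="Int64"),
    "NW Radius (nmi)": pd.array([30, 20, pd.NA], dtype="Int64"),
})

# Sum of four quarter circles: pi / 4 * (NE^2 + SE^2 + SW^2 + NW^2)
area = (
    pi * (
        radii["NE Radius (nmi)"]**2 +
        radii["SE Radius (nmi)"]**2 +
        radii["SW Radius (nmi)"]**2 +
        radii["NW Radius (nmi)"]**2
    ) / 4
).round().astype(pd.Int64Dtype()).fillna(400)

print(area.tolist())
```

For the first row, the sum of squares is 1600 + 900 + 0 + 900 = 3400, and 3400π/4 ≈ 2670.35, which rounds to the 2670 shown in the output.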


In [14]:
completequadrants = (
    (hurricanes["NE Radius (nmi)"] > 0) &
    (hurricanes["SE Radius (nmi)"] > 0) &
    (hurricanes["SW Radius (nmi)"] > 0) &
    (hurricanes["NW Radius (nmi)"] > 0)
)
hurricanes[completequadrants]

Unnamed: 0,Identifier,Basin Code,Basin Name,Storm Number,Season Year,Storm Name,Tracks,Observed,Track Code,Track Type,...,Maximum Wind (kt),Minimum Pressure (mbar),NE Radius (nmi),SE Radius (nmi),SW Radius (nmi),NW Radius (nmi),Eye Radius (nmi),Is Extreme,Broadcast Test,Estimated Area
43729,AL012004,AL,Atlantic,1,2004,ALEX,25,2004-08-02 12:00:00+00:00,O,Observation,...,50,992,75,90,60,20,,False,False,13921
43730,AL012004,AL,Atlantic,1,2004,ALEX,25,2004-08-02 18:00:00+00:00,O,Observation,...,50,993,75,90,50,30,,False,False,13450
43731,AL012004,AL,Atlantic,1,2004,ALEX,25,2004-08-03 00:00:00+00:00,O,Observation,...,60,987,75,90,50,30,,False,False,13450
43732,AL012004,AL,Atlantic,1,2004,ALEX,25,2004-08-03 06:00:00+00:00,O,Observation,...,70,983,75,90,50,40,,False,False,14000
43733,AL012004,AL,Atlantic,1,2004,ALEX,25,2004-08-03 12:00:00+00:00,O,Observation,...,85,974,75,90,50,40,,False,False,14000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
85879,EP192023,EP,East Pacific,19,2023,PILAR,41,2023-11-01 18:00:00+00:00,O,Observation,...,45,999,60,50,60,70,30,False,False,11467
85880,EP192023,EP,East Pacific,19,2023,PILAR,41,2023-11-02 00:00:00+00:00,O,Observation,...,45,999,60,50,60,70,30,False,False,11467
85881,EP192023,EP,East Pacific,19,2023,PILAR,41,2023-11-02 06:00:00+00:00,O,Observation,...,45,999,50,40,50,60,20,False,False,8011
85882,EP192023,EP,East Pacific,19,2023,PILAR,41,2023-11-02 12:00:00+00:00,O,Observation,...,50,998,50,40,50,60,20,False,False,8011


We can overwrite columns as well. This is useful for storing the filled defaults.

In [None]:
# By re-assigning the columns we save the changes made by filling the missing values.
hurricanes["Track Code"] = hurricanes["Track Code"].fillna("O")
hurricanes["Track Type"] = hurricanes["Track Type"].fillna("Observation")
print(hurricanes)


      Identifier Basin Code    Basin Name  Storm Number  Season Year  \
0       AL011851         AL      Atlantic             1         1851   
1       AL011851         AL      Atlantic             1         1851   
2       AL011851         AL      Atlantic             1         1851   
3       AL011851         AL      Atlantic             1         1851   
4       AL011851         AL      Atlantic             1         1851   
...          ...        ...           ...           ...          ...   
85922   EP202023         EP  East Pacific            20         2023   
85923   EP202023         EP  East Pacific            20         2023   
85924   EP202023         EP  East Pacific            20         2023   
85925   EP202023         EP  East Pacific            20         2023   
85926   EP202023         EP  East Pacific            20         2023   

      Storm Name  Tracks                  Observed Track Code   Track Type  \
0           <NA>      14 1851-06-25 00:00:00+00:00       

In [11]:
print(hurricanes["Track Code"].isna())

0        False
1        False
2        False
3        False
4        False
         ...  
85922    False
85923    False
85924    False
85925    False
85926    False
Name: Track Code, Length: 85927, dtype: bool


## Saving Dataframes

For most file formats that Pandas can read using a `pandas.read_{file type}()` function,
there is a corresponding `pandas.DataFrame.to_{file type}()` method for storing the data.
Importantly, you do not have to read and write files in the same format. This makes Pandas a
flexible [data format conversion tool](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html).
Among the common formats in use today, Pandas supports:

* [`read_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) and [`to_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html)
* [`read_json()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html) and [`to_json()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_json.html)
* [`read_html()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html) and [`to_html()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_html.html)
* [`read_xml()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_xml.html) and [`to_xml()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_xml.html)
* [`read_excel()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html) and [`to_excel()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html)

The Pandas I/O API is documented
[here](https://pandas.pydata.org/pandas-docs/stable/reference/io.html). Note how the `to`
methods are attached to each dataframe, whereas the `read` functions belong to the Pandas
library itself and create new dataframes.

Let's try saving our updated hurricanes dataframe as an Excel file.

In [17]:
hurricanes[[ "Identifier", "Estimated Area" ]].to_excel(
    "../data/hurdat2.xlsx",
    sheet_name = "Storm Areas"
)


In [18]:
hurricanes.to_csv("../data/hurricanes-final.csv")
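
A short sketch of format conversion that avoids touching the disk: the frame below holds two sample rows, is written to CSV text in memory, read back, and then converted to JSON. By default `to_csv()` also writes the row index as an unnamed first column, which comes back as `Unnamed: 0` on the next read. Passing `index=False` leaves it out.

```python
import io
import json

import pandas as pd

# Two sample rows of identifiers and estimated areas
df = pd.DataFrame({
    "Identifier": ["AL012004", "EP202023"],
    "Estimated Area": [13921, 2670],
})

# With no path given, to_csv() returns the CSV text
with_index = df.to_csv()
without_index = df.to_csv(index=False)

# Reading the text back shows the effect of the saved index
reread = pd.read_csv(io.StringIO(with_index))
clean = pd.read_csv(io.StringIO(without_index))

# Convert the same data to JSON, one object per row
as_json = df.to_json(orient="records")
records = json.loads(as_json)

print(list(reread.columns))
print(list(clean.columns))
print(as_json)
```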