In [2]:
import pandas as pd

open_seven_days_df = pd.read_parquet("../../data/pandas/open_seven_days_df.parquet")

A theme in this course will be learning transformations across languages: the ability to select the proper tool for the job depends on knowing _what tools exist_.

In this lesson, we'll cover creating columns, selecting, and filtering in pandas.

Creating a column in pandas is as simple as assigning to it with the syntax:

```python
dataframe['column_name'] = column_value
```

In [3]:
import numpy as np

open_seven_days_df["closed_open"] = np.where(
    open_seven_days_df["standardHours.thursday"] == "Closed", "Closed", "Open"
)
# the comparison already yields booleans, so np.where is not needed here
open_seven_days_df["is_closed"] = (
    open_seven_days_df["standardHours.thursday"] == "Closed"
)

Now you might be asking, "When are we assigning a single value to a column vs. performing a calculation on a column?" That would be a great question! A single value is broadcast to every row, while an array or Series is assigned element by element. The calculation side relies on _vectorization_: the process of performing calculations on entire columns at once.

Certain operations can be vectorized and act on other columns, while others need to be _applied_ row-by-row. We'll talk about applying row-wise functions later in the course, but for now we'll focus on vectorized operations. 
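As a small preview of that difference, here is a minimal sketch on a made-up three-row dataframe (`toy_df` and its values are hypothetical; only the column name is borrowed from the lesson's data). It builds the same label twice: once with a vectorized `np.where`, and once by applying a Python function to one value at a time.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column name mirrors the lesson's dataset.
toy_df = pd.DataFrame(
    {"standardHours.thursday": ["Closed", "9:00AM - 5:00PM", "All Day"]}
)

# Vectorized: the comparison and the selection run over the whole column at once.
vectorized = np.where(
    toy_df["standardHours.thursday"] == "Closed", "Closed", "Open"
)

# Applied: the lambda is called once per value in the column.
applied = toy_df["standardHours.thursday"].apply(
    lambda hours: "Closed" if hours == "Closed" else "Open"
)

# Both approaches produce the same labels.
print(vectorized.tolist() == applied.tolist())
```

On a frame this small the two are interchangeable; on large frames the vectorized form is usually noticeably faster, which is why it is the default choice when one exists.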

In [4]:
open_seven_days_df.columns

Index(['relevanceScore', 'designation', 'weatherInfo', 'addresses',
       'operating_hours', 'entrancePasses', 'name', 'description',
       'directionsUrl', 'fees', 'topics', 'states', 'entranceFees', 'contacts',
       'activities', 'url', 'longitude', 'id', 'images', 'directionsInfo',
       'fullName', 'parkCode', 'latLong', 'latitude', 'category',
       'operating_hours_description', 'exceptions', 'standardHours.friday',
       'standardHours.sunday', 'standardHours.thursday',
       'standardHours.tuesday', 'standardHours.saturday',
       'standardHours.monday', 'standardHours.wednesday', 'monday_hours',
       'tuesday_hours', 'wednesday_hours', 'thursday_hours', 'friday_hours',
       'saturday_hours', 'sunday_hours', 'open_seven_days_a_week',
       'closed_open', 'is_closed'],
      dtype='object')

In [None]:
open_seven_days_df["open_closed"] = (
    "Today, the park is: " + open_seven_days_df["closed_open"]
)

open_seven_days_df["open_closed"]

It's also possible to select rows in pandas using `iloc` and `loc`. As the names suggest, `iloc` selects by _integer position_ (0 is always the first row), while `loc` selects by _index label_.

In [5]:
# this gets the first row of the dataframe
open_seven_days_df.iloc[0:1]

Unnamed: 0,relevanceScore,designation,weatherInfo,addresses,operating_hours,entrancePasses,name,description,directionsUrl,fees,...,monday_hours,tuesday_hours,wednesday_hours,thursday_hours,friday_hours,saturday_hours,sunday_hours,open_seven_days_a_week,closed_open,is_closed
6,1,Memorial Parkway,Summers on the parkway are generally hot and h...,"[{'city': 'McLean', 'countryCode': 'US', 'line...",{'description': 'The George Washington Memoria...,[],George Washington,The George Washington Memorial Parkway was des...,http://www.nps.gov/gwmp/planyourvisit/directio...,[],...,All Day,All Day,All Day,All Day,All Day,All Day,All Day,True,Open,False


In [6]:
# loc slices by index label and includes both endpoints; the output shows only label 6, which happens to be the first row :)
open_seven_days_df.loc[6:7]

Unnamed: 0,relevanceScore,designation,weatherInfo,addresses,operating_hours,entrancePasses,name,description,directionsUrl,fees,...,monday_hours,tuesday_hours,wednesday_hours,thursday_hours,friday_hours,saturday_hours,sunday_hours,open_seven_days_a_week,closed_open,is_closed
6,1,Memorial Parkway,Summers on the parkway are generally hot and h...,"[{'city': 'McLean', 'countryCode': 'US', 'line...",{'description': 'The George Washington Memoria...,[],George Washington,The George Washington Memorial Parkway was des...,http://www.nps.gov/gwmp/planyourvisit/directio...,[],...,All Day,All Day,All Day,All Day,All Day,All Day,All Day,True,Open,False
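The two cells above return the same row for different reasons. The sketch below uses a hypothetical toy frame (`toy_df`, with an index of 6, 9, 12 chosen to imitate a frame whose rows have been filtered) to show how the slicing rules differ: `iloc` excludes the stop position, while `loc` includes the stop label.

```python
import pandas as pd

# Hypothetical toy frame with a non-default index, as you get after filtering rows.
toy_df = pd.DataFrame(
    {"name": ["George Washington", "Zion", "Arches"]},
    index=[6, 9, 12],
)

# iloc is position-based: the slice 0:1 excludes the stop, so only the first row.
by_position = toy_df.iloc[0:1]

# loc is label-based: the slice 6:9 includes both endpoints, so two rows.
by_label = toy_df.loc[6:9]

# Label 7 does not exist; on a sorted index the slice keeps only labels within 6..7.
missing_stop = toy_df.loc[6:7]

print(by_position.index.tolist())
print(by_label.index.tolist())
print(missing_stop.index.tolist())
```

This is one plausible reading of why `loc[6:7]` returned a single row above: the slice keeps every label between 6 and 7, and 6 is the only one present in the output.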


Filtering in pandas is most easily accomplished by supplying a boolean condition when selecting data, for example:

In [7]:
parks_df = pd.read_parquet("../../data/nps/nps_public_data_parks.parquet")

parks_df[parks_df["fullName"] == "Zion National Park"]

Unnamed: 0,relevanceScore,designation,weatherInfo,addresses,operatingHours,entrancePasses,name,description,directionsUrl,fees,...,activities,url,longitude,id,images,directionsInfo,fullName,parkCode,latLong,latitude
257,1,National Park,Zion is known for a wide range of weather cond...,"[{'type': 'Physical', 'line2': '1 Zion Park Bl...","[{'name': 'Zion National Park', 'standardHours...",[],Zion,Follow the paths where people have walked for ...,http://www.nps.gov/zion/planyourvisit/directio...,[],...,"[{'name': 'Arts and Culture', 'id': '09DF0950-...",https://www.nps.gov/zion/index.htm,-113.026514,41BAB8ED-C95F-447D-9DA1-FCC4E4D808B2,[{'url': 'https://www.nps.gov/common/uploads/s...,"Zion National Park's main, south entrance and ...",Zion National Park,zion,"lat:37.29839254, long:-113.0265138",37.298393


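We can combine any number of boolean conditions with `&` (and) and `|` (or) to filter a dataframe. Wrap each comparison in parentheses, because `&` and `|` bind more tightly than `==`, `<`, and `>`: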

In [None]:
parks_df[parks_df["states"].str.contains("UT") & parks_df["states"].str.contains("AZ")]

In [None]:
parks_df[
    (parks_df["states"].str.contains("UT") & parks_df["states"].str.contains("AZ"))
    | parks_df["states"].str.contains("WY")
]

In [None]:
parks_df[
    (parks_df["longitude"] < -140)
    & (parks_df["latitude"] > 60)
    & (parks_df["designation"] == "National Park")
]
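Long conditions like these can get hard to read. One option, sketched here on a hypothetical toy frame (`toy_df` and its coordinates are made up), is to give each condition a name first. Each named mask is a boolean Series, so they combine with `&`, `|`, and `~` (not) without extra parentheses around the comparisons.

```python
import pandas as pd

# Hypothetical toy data; the column names mirror the lesson's parks_df.
toy_df = pd.DataFrame(
    {
        "fullName": ["Park A", "Park B", "Park C"],
        "longitude": [-150.0, -110.0, -145.0],
        "latitude": [63.0, 37.0, 58.0],
    }
)

# Each comparison produces a boolean Series with one value per row.
west = toy_df["longitude"] < -140
north = toy_df["latitude"] > 60

both = toy_df[west & north]          # rows where both conditions hold
either = toy_df[west | north]        # rows where at least one holds
neither = toy_df[~(west | north)]    # rows where neither holds

print(both["fullName"].tolist())
print(either["fullName"].tolist())
print(neither["fullName"].tolist())
```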

We can select a subset of columns by passing a list of column names, and combine that with our filtering, too:

In [None]:
parks_df[["fullName", "states"]]

In [None]:
parks_df[
    (parks_df["longitude"] < -140)
    & (parks_df["latitude"] > 60)
    & (parks_df["designation"] == "National Park")
][["fullName", "states"]]
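The filter-then-select pattern above can also be written as a single `loc` call, which takes a row selector and a column selector together. The sketch below compares the two forms on a hypothetical toy frame (`toy_df` and its values are made up for illustration).

```python
import pandas as pd

# Hypothetical toy data; the column names mirror the lesson's parks_df.
toy_df = pd.DataFrame(
    {
        "fullName": ["Park A", "Park B", "Park C"],
        "states": ["AK", "UT", "AK"],
        "designation": ["National Park", "National Park", "National Monument"],
        "latitude": [63.0, 37.0, 61.0],
    }
)

mask = (toy_df["latitude"] > 60) & (toy_df["designation"] == "National Park")

# Two steps: filter the rows, then select columns from the result.
chained = toy_df[mask][["fullName", "states"]]

# One step: loc takes the row mask and the column list together.
single_step = toy_df.loc[mask, ["fullName", "states"]]

print(chained.equals(single_step))
```

Both forms return the same rows and columns when reading data. The single `loc` call is generally the safer habit if you later want to assign values to the selection, since assigning through a chained selection may act on a temporary copy.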