# How did I decide which cities to include?

This notebook walks through how I created a list of cities to include in the tracker. First, let's import pandas.

In [1]:
import pandas as pd

Then, let's read in our raw data. We'll use it as a starting point.

In [35]:
data = pd.read_csv("../data/raw_data-max.csv")

data.head()

Unnamed: 0.1,Unnamed: 0,CMANAME,Month,Day,Year,value
0,0,Toronto,3,1,1840,8.3
1,1,Toronto,3,2,1840,7.8
2,2,Toronto,3,3,1840,11.1
3,3,Toronto,3,4,1840,15.0
4,4,Toronto,3,5,1840,6.7
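
The `Unnamed: 0` style columns in the preview look like row indexes that were written out by earlier `to_csv` calls. That is an assumption about how the file was produced, not something the data confirms. A minimal sketch, using a small in-memory CSV rather than the real file, of dropping such columns after reading:

```python
import io
import pandas as pd

# Toy CSV imitating the layout of the preview; the real file is not reproduced here.
raw = io.StringIO(
    "Unnamed: 0.1,Unnamed: 0,CMANAME,Month,Day,Year,value\n"
    "0,0,Toronto,3,1,1840,8.3\n"
    "1,1,Toronto,3,2,1840,7.8\n"
)

sample = pd.read_csv(raw)

# Drop every column whose name starts with "Unnamed"
# (assumed here to be leftover indexes, not real measurements).
leftover = [c for c in sample.columns if c.startswith("Unnamed")]
clean = sample.drop(columns=leftover)

print(clean.columns.tolist())
```

None of the steps below depend on these columns, so this is optional tidying. Writing intermediate files with `to_csv(..., index=False)` would avoid creating them in the first place.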


Now, we'll use two criteria to filter our cities.

1. They should have a minimum number of years with at least one weather station reporting data.
2. They should have at least one active weather station reporting data recently (in 2022). We don't want to track cities for which there is no active gathering process!

Let's start by counting the number of years during which a weather station was active in each city.

In [36]:
numYearsCollected = (data
                     .pivot_table(index="CMANAME", values=["Year"], aggfunc="nunique")  # count distinct years per city
                     .sort_values("Year")
                     )

numYearsCollected.head()

CMANAME,Year
High River,3
Ingersoll,10
Camrose,14
Wasaga Beach,16
Campbellton (partie du Québec / Quebec part),23


Let's take a sidebar to check out Ingersoll. If those 10 years were 2013 to 2022, we might want to track it. Which years are they?

In [37]:
data.loc[data["CMANAME"] == "Ingersoll", "Year"].unique()

array([1872, 1873, 1874, 1875, 1876, 1879, 1880, 1881, 1887, 1888],
      dtype=int64)
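
To make the span and the gaps explicit, here is a small sketch that works directly on the years printed above. It only summarizes that array and does not touch the raw data.

```python
import numpy as np

# Years reported for Ingersoll, copied from the output above.
years = np.array([1872, 1873, 1874, 1875, 1876, 1879, 1880, 1881, 1887, 1888])

first, last = int(years.min()), int(years.max())

# Years inside the first-to-last span with no observations at all.
missing = sorted(set(range(first, last + 1)) - set(years.tolist()))

print(f"first: {first}, last: {last}, gap years: {missing}")
```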

All in the 1800s! We may want to exclude that one... On to the next criterion, which will accomplish just that.

In [39]:
lastYearCollected = (data
                     .pivot_table(index="CMANAME", values=["Year"], aggfunc="max")
                     .sort_values("Year", ascending=False)
                     )

lastYearCollected.head()

CMANAME,Year
Abbotsford - Mission,2022
Powell River,2022
Sarnia,2022
Salmon Arm,2022
Saint-Hyacinthe,2022


Looks like there are at least a few cities with weather stations reporting in 2022. How many?

In [42]:
lastYearCollected.loc[lastYearCollected["Year"] == 2022, "Year"].count()

121

121! That's plenty. These criteria should work just fine. Now, let's put our two dataframes together...

In [43]:
df = lastYearCollected.join(numYearsCollected, rsuffix="_")
df.columns = ["Latest Year", "Number of Years"]

df

CMANAME,Latest Year,Number of Years
Abbotsford - Mission,2022,114
Powell River,2022,99
Sarnia,2022,77
Salmon Arm,2022,122
Saint-Hyacinthe,2022,94
...,...,...
Wasaga Beach,1981,16
Portage la Prairie,1971,86
Pembroke,1969,75
Camrose,1941,14
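
As an aside, the same two-column summary could probably be built in one pass with `groupby` and named aggregation instead of two pivot tables and a join. The sketch below runs on a toy dataframe (the city names `A`, `B`, `C` and their years are made up for illustration) and is an alternative, not the approach used in this notebook.

```python
import pandas as pd

# Toy stand-in for the raw data; city names and years are illustrative only.
toy = pd.DataFrame({
    "CMANAME": ["A", "A", "A", "B", "B", "C"],
    "Year": [2020, 2021, 2021, 1900, 1901, 2022],
})

summary = (
    toy.groupby("CMANAME")["Year"]
    # Named aggregation: the dict keys become the output column names.
    .agg(**{"Latest Year": "max", "Number of Years": "nunique"})
    .sort_values("Latest Year", ascending=False)
)

print(summary)
```

Naming the columns inside `agg` removes the need for `rsuffix` and the manual rename.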


...and filter using thresholds that are, admittedly, arbitrary: keep only cities with a station reporting in 2022 and with data in at least 10 distinct years.

In [44]:
to_include = df.loc[(df["Latest Year"] == 2022) & (df["Number of Years"] >= 10), :]

to_include.head()

CMANAME,Latest Year,Number of Years
Abbotsford - Mission,2022,114
Powell River,2022,99
Sarnia,2022,77
Salmon Arm,2022,122
Saint-Hyacinthe,2022,94
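
Since both thresholds are arbitrary, it can help to keep them as parameters. The sketch below wraps the filter in a hypothetical helper, `select_cities` (not part of the tracker's module), and tries it on four rows copied from the joined table above.

```python
import pandas as pd

def select_cities(summary, latest_year=2022, min_years=10):
    """Return names of cities meeting both criteria (hypothetical helper)."""
    mask = (summary["Latest Year"] == latest_year) & (summary["Number of Years"] >= min_years)
    return summary[mask].index.to_list()

# Rows copied from the joined table above.
example = pd.DataFrame(
    {"Latest Year": [2022, 2022, 1981, 1941], "Number of Years": [114, 99, 16, 14]},
    index=pd.Index(
        ["Abbotsford - Mission", "Powell River", "Wasaga Beach", "Camrose"],
        name="CMANAME",
    ),
)

print(select_cities(example))
print(select_cities(example, min_years=100))
```

Raising `min_years` makes it easy to see how sensitive the final list is to the cutoff.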


Looks like we have plenty of cities we can include. We'll convert the index of this dataframe to a list and use it in our module.

In [32]:
to_include.index.to_list()

['Saint John',
 'Saguenay',
 'Saint-Hyacinthe',
 'Saint-Georges',
 'Abbotsford - Mission',
 'Rivière-du-Loup',
 'Regina',
 'Québec',
 'Quesnel',
 'Prince Rupert',
 'Prince George',
 'Prince Albert',
 'Powell River',
 'Port Alberni',
 'Peterborough',
 'Petawawa',
 'Penticton',
 'Rouyn-Noranda',
 'Salmon Arm',
 'Shawinigan',
 'Saskatoon',
 'Woodstock',
 'Wood Buffalo',
 'Winnipeg',
 'Winkler',
 'Windsor',
 'Williams Lake',
 'Whitehorse',
 'Weyburn',
 'Victoriaville',
 'Victoria',
 'Vernon',
 'Vancouver',
 "Val-d'Or",
 'Truro',
 'Trois-Rivières',
 'Trail',
 'Toronto',
 'Sault Ste. Marie',
 'Sept-Îles',
 'Parksville',
 'Sherbrooke',
 'Sorel-Tracy',
 'Squamish',
 'Sarnia',
 'St. Catharines - Niagara',
 'Summerside',
 'Swift Current',
 'Terrace',
 'Thetford Mines',
 'Thunder Bay',
 'Timmins',
 "St. John's",
 'Owen Sound',
 'North Battleford',
 "Ottawa - Gatineau (Ontario part / partie de l'Ontario)",
 'Essa',
 'Edmundston',
 'Edmonton',
 'Duncan',
 'Drummondville',
 'Dawson Creek',
 'Cranbro