![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)

# Climate Across Provinces

Studying climate patterns across Canadian provinces offers insight into how weather varies from region to region. Analyzing temperatures during specific months can reveal important information about impacts on agriculture, biodiversity, and disaster preparedness. Such data helps us understand regional climate variations and make informed decisions about managing resources and planning in the face of changing temperatures.

*Curriculum Connections*
- [Investigating factors affecting Climate](https://www.alberta.ca/curriculum-science) 

*Investigating Questions*
- Which provinces experience the highest and lowest annual temperatures? What geographical distinctions contribute to these temperature differences?
- Does the variation in elevation significantly affect temperatures across provinces?
- What natural elements, such as proximity to large water bodies or the presence of mountain ranges, can play a role in influencing temperatures across provinces?

***

### Import the Data
The code below imports the Python libraries we need to gather and organize the data to answer our questions. `▶Run` the code cell below.

In [None]:
import pandas as pd
import plotly.express as px 
import geopandas as gpd
import folium
import re
print("Libraries imported.")

### Analysis

Let's begin exploring our dataset containing information about provincial temperatures across Canada. The dataset was obtained from [Kaggle](https://www.kaggle.com/datasets/hemil26/canada-weather), and its data was scraped from [Wikipedia](https://en.wikipedia.org/wiki/Temperature_in_Canada).

When the dataset is imported, it is loaded into a dataframe. You can think of a dataframe as a spreadsheet, where each row represents a record or an observation and each column represents a different attribute or variable.

In [None]:
climate_data = pd.read_csv('https://raw.githubusercontent.com/callysto/data-files/main/Science/ClimateAcrossProvinces/canada_weather.csv')
climate_data

Our dataset has *10* columns. Going from left to right, `Community` gives the city and province for each record, `Weather station` gives the three-character airport code for that city, `Location` gives the latitude and longitude of the weather station, and `Elevation` gives the station's height above sea level. The remaining columns hold average high and low temperatures for January, for July, and for the whole year.
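
A quick way to confirm a dataframe's structure without scrolling through it is to look at its `shape`, `columns`, and `dtypes`. The sketch below uses a small made-up dataframe called `toy_data` (its values are illustrative and not taken from the real dataset), but the same attributes work on `climate_data`.

```python
import pandas as pd

# Hypothetical two-row dataframe that mimics a few of the dataset's columns
toy_data = pd.DataFrame({
    'Community': ['Exampleville, AB', 'Sampletown, ON'],
    'Elevation': ['1,084 m (3,556 ft)', '59 m (194 ft)'],
    'July(Avg. high °C (°F))': ['23.2 (73.8)', '26.4 (79.5)'],
})

n_rows, n_cols = toy_data.shape   # (number of rows, number of columns)
print(f"{n_rows} rows, {n_cols} columns")
print(list(toy_data.columns))     # column names, from left to right
print(toy_data.dtypes)            # text columns appear as object (or str in newer pandas)
```

If the temperature columns show up as text rather than numbers, that is a sign they need cleaning before any calculations.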

### Data Cleaning

Now that we've gotten a better sense of what type of information is inside our dataset, let's perform data cleaning.

Data cleaning is like tidying up information to make it useful and accurate. Just like you clean and organize your room, data cleaning helps make data neat and organized.

Imagine you have a bunch of information about your classmates, like their names, ages, and favorite colors. But sometimes, mistakes or errors can happen when collecting this information. For example, someone may have misspelled a name or entered the wrong age for a classmate.

Data cleaning involves finding and fixing these mistakes. It's important because clean data helps us make better decisions and find meaningful patterns.

In our situation, there are a couple of things we want to adjust in our dataframe: separating the provincial codes from the city names, and changing the temperature and elevation columns so that they contain only numbers, which makes data manipulation possible later on. We'll go over these steps in more depth in the upcoming cells.

To begin, let's take the provincial code from each entry in `Community` (found at the end of each city name) and put the codes into their own column.

Note: To reduce showing large outputs of **climate_data** every time something is adjusted, we will be using *.sample* to show a random sample of 5 different rows in the data as a means of showing change in the dataframe.

In [None]:
climate_data['Province'] = climate_data.Community.str[-2:]
climate_data.sample(5)
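
The slicing step above relies on every `Community` value ending in a two-letter code. As a minimal sketch, assuming the values follow a "City, XX" pattern (the names below are made up for illustration), we can compare slicing with splitting on the last comma, which also gives us the city name on its own.

```python
import pandas as pd

# Hypothetical community names in the assumed "City, XX" format
communities = pd.Series(['Exampleville, AB', 'Sampletown, ON', 'Demo Harbour, NS'])

# Same approach as the notebook: the last two characters are the province code
province_by_slice = communities.str[-2:]

# Alternative: split once on the last comma, then trim the leftover whitespace
parts = communities.str.rsplit(',', n=1, expand=True)
city = parts[0].str.strip()
province_by_split = parts[1].str.strip()

print(province_by_slice.tolist())
print(province_by_split.tolist())
print(city.tolist())
```

Both approaches give the same codes here. Splitting is a little more forgiving if an entry has trailing spaces, while slicing is shorter to write.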

Perfect! We now have a new column called `Province` which contains only provincial codes. The next issue we want to tackle is in regard to how our data is presented in columns, particularly the temperature-related columns. 

In these columns, each entry holds a Celsius value followed by a Fahrenheit value in brackets, and pandas treats the whole entry as a string (text). We only want the Celsius value, stored as a float (decimal-point number). To get it, we find the first *(* bracket and keep only the part of the string before it. We then replace the typographic minus sign (−) with the hyphen-minus (-) that Python recognizes, and convert the remaining string into a float.


In [None]:
def clean_and_convert_to_float(s):
    # Keep the text before the first '(' and swap the typographic minus (−) for a hyphen-minus (-)
    cleaned_value = s.split('(')[0].strip().replace('−', '-')
    try:
        return float(cleaned_value)
    except ValueError:
        return None  
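
Before applying the function to whole columns, it can help to try it on a few strings by hand. The sample strings below are made up to resemble the dataset's "Celsius (Fahrenheit)" format; the function is repeated so the cell runs on its own.

```python
def clean_and_convert_to_float(s):
    # Keep the text before the first '(' and normalize the minus sign
    cleaned_value = s.split('(')[0].strip().replace('−', '-')
    try:
        return float(cleaned_value)
    except ValueError:
        return None

# Hypothetical entries in the assumed "°C (°F)" format
negative_example = clean_and_convert_to_float('−31.0 (−23.8)')
positive_example = clean_and_convert_to_float('21.9 (71.4)')
unparseable_example = clean_and_convert_to_float('n/a')

print(negative_example)
print(positive_example)
print(unparseable_example)
```

The minus sign replacement matters because Python's `float()` does not parse the typographic minus sign. Entries that cannot be converted come back as `None`, which pandas will show as a missing value.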

In [None]:
columns_to_clean = [
    'Annual(Avg. high °C (°F))',
    'January(Avg. high °C (°F))',
    'January(Avg. low °C (°F))',
    'July(Avg. high °C (°F))',
    'July(Avg. low °C (°F))',
    'Annual(Avg. low °C (°F))'
]

for column in columns_to_clean:
    climate_data[column] = climate_data[column].apply(clean_and_convert_to_float)
climate_data.sample(5)

Now our temperature values are plain decimal numbers, including negative ones. The final thing we need to fix is the `Elevation` column. As with the temperature columns, we want a single number in each entry for easier data manipulation later on. The approach is similar: we find the first *m* and keep only the part of the string before it. Some numbers also contain commas, such as 1,084 m. To handle this, we remove every *,* by replacing it with an empty string.

In [None]:
def fix_elevation(text):
    new_elevation = text.split('m')[0].replace(',', '').strip()
    try:
        return float(new_elevation)
    except ValueError:
        return None

In [None]:
climate_data['Elevation'] = climate_data['Elevation'].apply(fix_elevation)
climate_data.sample(5)
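
As an alternative to applying a function row by row, the same elevation cleanup can be sketched with pandas string methods that work on the whole column at once. This is one possible approach, not the notebook's method, and the sample values below are made up.

```python
import pandas as pd

# Hypothetical elevation strings in the assumed "metres (feet)" format
elevations = pd.Series(['1,084 m (3,556 ft)', '59 m (194 ft)', 'unknown'])

numeric_text = (
    elevations
    .str.replace(',', '', regex=False)                      # drop thousands separators
    .str.extract(r'^\s*(-?\d+(?:\.\d+)?)', expand=False)    # keep the leading number
)

# Convert to floats; anything that is not a number becomes NaN
elevation_m = pd.to_numeric(numeric_text, errors='coerce')
print(elevation_m.tolist())
```

Entries without a leading number end up as `NaN` (pandas' missing value marker) instead of causing an error.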

### Highest and Lowest Temperatures

Perfect! Now that we're done cleaning our data, we can move on to analyzing and visualizing it. Since we have data for January and July, which correspond to *winter* and *summer* respectively, let's find which community has the coldest and hottest temperatures in our dataframe.

Let's do this by finding the lowest average temperature found in the column `January(Avg. low °C (°F))` (representing winter) and the highest average temperature found in the column `July(Avg. high °C (°F))`  (representing summer).

In [None]:
lowest_jan_temp_data = climate_data[climate_data['January(Avg. low °C (°F))'] == climate_data['January(Avg. low °C (°F))'].min()]
highest_july_temp_data = climate_data[climate_data['July(Avg. high °C (°F))'] == climate_data['July(Avg. high °C (°F))'].max()]

location_jan = lowest_jan_temp_data['Community'].to_string(index=False)
location_july = highest_july_temp_data['Community'].to_string(index=False)

jan_temp = lowest_jan_temp_data['January(Avg. low °C (°F))'].to_string(index=False)
july_temp = highest_july_temp_data['July(Avg. high °C (°F))'].to_string(index=False)

print(f"The lowest average temperature recorded in January in the dataset was in {location_jan}, with a recorded temperature of {jan_temp}°C")
print(f"The highest average temperature recorded in July in the dataset was in {location_july}, with a recorded temperature of {july_temp}°C")
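
The cell above filters the dataframe for rows equal to the minimum or maximum. Another way to sketch the same lookup is with `idxmin` and `idxmax`, which return the index label of the smallest or largest value. The dataframe and column names below are made up for illustration.

```python
import pandas as pd

# Hypothetical cleaned data with shortened column names
toy_data = pd.DataFrame({
    'Community': ['Coldplace, NU', 'Mildplace, BC', 'Warmplace, ON'],
    'January low': [-30.0, -2.5, -8.0],
    'July high': [7.0, 24.0, 28.5],
})

# .loc with the label from idxmin/idxmax returns the whole matching row
coldest = toy_data.loc[toy_data['January low'].idxmin()]
hottest = toy_data.loc[toy_data['July high'].idxmax()]

print(f"Coldest January low: {coldest['Community']} at {coldest['January low']}°C")
print(f"Hottest July high: {hottest['Community']} at {hottest['July high']}°C")
```

One difference to keep in mind: if two communities tie, `idxmin` and `idxmax` return only the first one, while the filtering approach returns every tied row.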

It appears that the lowest average January low was recorded in Resolute, Nunavut and the highest average July high was recorded in Kamloops, British Columbia.

This makes sense given the geographical and climatic differences between the two places. Nunavut is a territory in northern Canada, characterized by its high latitude and Arctic climate. Resolute lies well north of the Arctic Circle, so in January, the heart of winter, it receives very little sunlight and is exposed to polar air masses, which leads to extremely cold temperatures.

British Columbia, on the other hand, is a province on the west coast of Canada. The Pacific Ocean moderates temperatures along its coastline, but Kamloops lies in the province's southern interior, where mountain ranges block much of that ocean influence. Its dry, sheltered valley setting likely contributes to the hot July afternoons seen in the data.

However, looking at both the average highs and the average lows for each month might give a more complete picture of the overall climate in these regions. Let's do this by averaging the high and low columns for January, and then doing the same for July.

In [None]:
climate_data['January Average'] = climate_data[['January(Avg. low °C (°F))', 'January(Avg. high °C (°F))']].mean(axis=1)
climate_data['July Average'] = climate_data[['July(Avg. low °C (°F))', 'July(Avg. high °C (°F))']].mean(axis=1)
climate_data.sample(5)
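
To see what `mean(axis=1)` does, here is a minimal sketch on made-up numbers. With `axis=1`, pandas averages across the columns of each row rather than down each column.

```python
import pandas as pd

# Hypothetical low/high pairs; the last row is missing its low value
toy_data = pd.DataFrame({
    'low': [-20.0, 10.0, None],
    'high': [-10.0, 25.0, 30.0],
})

# Row-wise mean of the two columns
toy_data['average'] = toy_data[['low', 'high']].mean(axis=1)
print(toy_data)
```

Note that missing values are skipped by default, so the last row's average is simply its one available value. If a column failed to clean properly, the resulting average could be misleading without raising any error.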

In [None]:
avg_jan_temp = climate_data[climate_data['January Average'] == climate_data['January Average'].min()]
avg_july_temp = climate_data[climate_data['July Average'] == climate_data['July Average'].max()]

location_jan = avg_jan_temp['Community'].to_string(index=False)
location_july = avg_july_temp['Community'].to_string(index=False)

jan_temp = avg_jan_temp['January Average'].to_string(index=False)
july_temp = avg_july_temp['July Average'].to_string(index=False)

print(f"The lowest overall average temperature recorded in January in the dataset was in {location_jan}, with a recorded temperature of {jan_temp}°C")
print(f"The highest overall average temperature recorded in July in the dataset was in {location_july}, with a recorded temperature of {july_temp}°C")

The output is now different: the highest overall July average belongs to Windsor, Ontario. Because `July Average` is the mean of the July high and low columns, and Kamloops has the highest `July(Avg. high °C (°F))` value, Windsor must have a noticeably higher `July(Avg. low °C (°F))` value than Kamloops to come out ahead.

One likely reason is that Windsor, situated in southern Ontario close to the Great Lakes, has humid summer air that tends to hold heat overnight. Kamloops, in a dry interior valley, typically cools off more after sunset.

Let's also find the highest and lowest annual average temperatures in our dataframe and see whether they differ from our findings for January and July.

In [None]:
overall_highest_temp = climate_data[climate_data['Annual(Avg. high °C (°F))'] == climate_data['Annual(Avg. high °C (°F))'].max()]
overall_lowest_temp = climate_data[climate_data['Annual(Avg. low °C (°F))'] == climate_data['Annual(Avg. low °C (°F))'].min()]
highest_overall_loc = overall_highest_temp['Community'].to_string(index=False)
lowest_overall_loc = overall_lowest_temp['Community'].to_string(index=False)
highest_temp = overall_highest_temp['Annual(Avg. high °C (°F))'].to_string(index=False)
lowest_temp = overall_lowest_temp['Annual(Avg. low °C (°F))'].to_string(index=False)

print(f"The overall lowest annual temperature recorded in the dataset was in {lowest_overall_loc}, with a recorded temperature of {lowest_temp}°C")
print(f"The overall highest annual temperature recorded in the dataset was in {highest_overall_loc}, with a recorded temperature of {highest_temp}°C")

Looking at the output, there appear to be no differences compared to the January and July outputs. This generally makes sense: a community that records the coldest temperatures in January, the heart of winter, tends to have a climate that stays cold throughout the year. In contrast, a community with the highest July temperatures, at the peak of summer, tends to stay warmer on an annual basis.

### Elevation Effects

Now that we've looked at the communities with the coldest and hottest average temperatures, let's see if *elevation* has any significant impact on the overall temperatures in the dataframe. We can do this by plotting the elevations of cities in the dataframe against the annual average temperatures (both high and low).

In [None]:
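# Build the figure first, then display it, so elevation_fig still refers to the figure
elevation_fig = px.scatter(climate_data, x='Elevation', y='Annual(Avg. high °C (°F))', color='Province', hover_data=['Community'])
elevation_fig.show()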

In [None]:
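# Build the figure first, then display it, so elevation_fig still refers to the figure
elevation_fig = px.scatter(climate_data, x='Elevation', y='Annual(Avg. low °C (°F))', color='Province', hover_data=['Community'])
elevation_fig.show()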

Looking at the figures, there isn't a clear case that elevation is the main driver of temperature. Starting with the annual average highs, some cities with similarly low elevations, such as Resolute and Victoria, have strikingly different values: the former has an annual average high of -12.7°C while the latter has an annual average high of 14.4°C. The annual average lows show a similar pattern.
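
Alongside reading the scatter plots, one way to put a number on the relationship is a correlation coefficient. The sketch below uses pandas' `Series.corr`, which computes the Pearson correlation by default, on a made-up dataframe (the values and the short column names are hypothetical). A value near 0 suggests little linear relationship, while values near -1 or 1 suggest a strong one.

```python
import pandas as pd

# Hypothetical elevations (m) and annual average highs (°C)
toy_data = pd.DataFrame({
    'Elevation': [10.0, 50.0, 300.0, 700.0, 1100.0],
    'Annual high': [14.0, -12.0, 9.0, 11.0, 8.0],
})

# Pearson correlation between the two columns, always between -1 and 1
r = toy_data['Elevation'].corr(toy_data['Annual high'])
print(f"Correlation between elevation and annual high: {r:.2f}")
```

On the real data, the same call could be run with `climate_data['Elevation']` and either annual temperature column. Keep in mind that a correlation only describes a straight-line relationship and does not account for other factors such as latitude.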

### Visualizing Provinces

In this final section, we'll visualize climate differences between provinces on a map using folium. To begin, let's read the GeoJSON data for the provinces in Canada. GeoJSON is a file format for geographic features; here it describes the boundaries and shapes of the different provinces in Canada.

In [None]:
prov_data = gpd.read_file('https://raw.githubusercontent.com/callysto/data-files/main/Science/ClimateAcrossProvinces/geopandas.geojson')
prov_data

Looking at the imported data, the only columns of real importance are the geometries of each province, which refers to the boundaries for each province, and the actual province names. Let's get rid of the other columns to remove any unnecessary confusion in the dataframe. 

In [None]:
prov_data.drop(columns=['prov_name_en', 'geo_point_2d', 'prov_area_code', 'year'],inplace=True)

Furthermore, let's convert the names of the provinces to their abbreviations in order to match the format of our *climate_data* dataframe.

In [None]:
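# Map the French province names in the file to their two-letter codes
province_codes = {
    'Alberta': 'AB',
    'Manitoba': 'MB',
    'Yukon': 'YT',
    'Terre-Neuve-et-Labrador': 'NL',
    'Nouveau-Brunswick': 'NB',
    'Saskatchewan': 'SK',
    'Nouvelle-Écosse': 'NS',
    'Territoires du Nord-Ouest': 'NT',
    'Île-du-Prince-Édouard': 'PE',
    'Nunavut': 'NU',
    'Québec': 'QC',
    'Ontario': 'ON',
    'Colombie-Britannique': 'BC'
}
# Assign the result back to the column instead of replacing in place on a selected column
prov_data['prov_name_fr'] = prov_data['prov_name_fr'].replace(province_codes)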

prov_data.rename(columns={'prov_name_fr': "Province"}, inplace=True)  
prov_data

Now we can merge our two dataframes, *climate_data* and *prov_data*, into one dataframe so that each city has the right `geometry` data for its province. This works because both dataframes contain a `Province` column. Using a left join with *prov_data* on the left keeps every province's geometry in the resulting dataframe, named `data_w_climate`, and attaches the climate details for each community in that province.

In [None]:
data_w_climate = prov_data.merge(climate_data, on='Province', how='left')
data_w_climate.sample(5)
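
To see how a left join behaves, here is a minimal sketch with two made-up dataframes. The `indicator=True` option adds a `_merge` column showing whether each row found a match, which is a handy way to spot provinces with no climate records.

```python
import pandas as pd

# Hypothetical stand-ins for prov_data and climate_data
provinces = pd.DataFrame({'Province': ['AB', 'BC', 'PE']})
stations = pd.DataFrame({
    'Province': ['AB', 'AB', 'BC'],
    'Community': ['Exampleville, AB', 'Sampletown, AB', 'Demo Harbour, BC'],
})

# Left join: every row of `provinces` is kept, matched rows are repeated per station
merged = provinces.merge(stations, on='Province', how='left', indicator=True)
print(merged)

# Provinces that had no matching station
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Province'].tolist()
print(unmatched)
```

In this toy example, `AB` appears twice (once per station) and `PE` is kept with a missing `Community`, since a left join never drops rows from the left dataframe.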

Perfect! Now we have a merged dataframe that contains provincial geometry data for each entry. One last step before plotting is to average the values within each province so that we have a single number per province. Afterwards, we can start visualizing provincial differences, starting with `Annual(Avg. high °C (°F))`.

In [None]:
agg_data = data_w_climate.groupby("Province")["Annual(Avg. high °C (°F))"].mean().reset_index()
merged_data = prov_data.merge(agg_data, on="Province")
merged_data
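
The grouping step can be sketched on made-up numbers as follows. Here we also count how many stations went into each average, since a provincial mean based on one or two stations is less representative than one based on many. The dataframe and its short column names are hypothetical.

```python
import pandas as pd

# Hypothetical station-level data
stations = pd.DataFrame({
    'Province': ['AB', 'AB', 'BC', 'BC'],
    'Annual high': [8.0, 10.0, 14.0, 16.0],
})

# One row per province: the mean annual high and the number of stations behind it
province_means = (
    stations
    .groupby('Province')['Annual high']
    .agg(mean_high='mean', station_count='count')
    .reset_index()
)
print(province_means)
```

Calling `reset_index()` turns `Province` back into a regular column, which is what the later merge and the choropleth map expect.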

In [None]:
# Initialize the map and store it in highest_temp_map
highest_temp_map = folium.Map(location=[50, -65], zoom_start=3)

folium.Choropleth(
    geo_data=prov_data,
    data=merged_data,
    name="choropleth",
    columns=["Province", "Annual(Avg. high °C (°F))"],
    fill_color="YlOrRd",
    key_on="feature.properties.Province",
    legend_name="Average Highest Temperature (Annual)",
).add_to(highest_temp_map)

folium.LayerControl().add_to(highest_temp_map)
highest_temp_map

Looking at the folium map, compare provinces like British Columbia and Ontario with the rest of the country: darker shades indicate a higher average annual high temperature.

In [None]:
agg_data = data_w_climate.groupby("Province")["Annual(Avg. low °C (°F))"].mean().reset_index()
merged_data_low_temp = prov_data.merge(agg_data, on="Province")

# Initialize the map and store it in lowest_temp_map
lowest_temp_map = folium.Map(location=[50, -65], zoom_start=3)

folium.Choropleth(
    geo_data=prov_data,
    data=merged_data_low_temp,
    name="choropleth",
    columns=["Province", "Annual(Avg. low °C (°F))"],
    fill_color="PuBu",
    key_on="feature.properties.Province",
    legend_name="Average Lowest Temperature (Annual)",
).add_to(lowest_temp_map)

folium.LayerControl().add_to(lowest_temp_map)
lowest_temp_map

# Conclusion

In this notebook, we imported temperature data for communities across Canada, cleaned it, and compared the coldest and warmest locations. We also looked at whether elevation explains temperature differences and mapped average annual temperatures by province using folium.

Perhaps you can try extension activities such as returning to the investigating questions, for example by exploring how proximity to large bodies of water or the presence of mountain ranges relates to the temperatures in the dataset.

[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)