# <span style="color:#5c2f2f">Vancouver in Pink</span> 
### 🌸<span style="color:pink">Cherry Blossoms in Vancouver</span>🌸

May 12, 2025
Adji Rahardjo


# Introduction

Vancouver always blooms beautifully in April, especially when the parks and streets are lush and pink with cherry blossom trees. What started as a [gift in 1925 from Japan](https://vcbf.ca/history-of-our-cherry-trees/) has evolved over a century, turning the city into one of the top destinations for cherry blossoms according to [this article](https://www.lonelyplanet.com/articles/best-cherry-blossoms-around-the-world) from Lonely Planet. I always thought they were beautiful, but my wife is obsessed! Not just with cherry blossoms but with anything that is pink. With the privilege of living in Vancouver, I would like to explore this data to find the best spot where I could take my wife on a date. Where in the city are cherry blossoms most dense? Where are the biggest ones that may have a nice canopy? If there are many varieties, how do they differ in size?

This project looks into the cherry blossom trees in Vancouver, using the City of Vancouver's data about trees in the city. The [original data source](https://opendata.vancouver.ca/explore/dataset/public-trees/information/?disjunctive.neighbourhood_name&disjunctive.on_street&disjunctive.species_name&disjunctive.common_name) contains over 180,000 rows. It has been sampled, cleaned, and wrangled down to 5,000 rows and saved in [UBC-MDS GitHub](https://github.com/UBC-MDS/data_viz_wrangled/blob/main/data/Trees_data_sets/small_vancouver_trees.csv) for practice.


We will answer the following questions:

1. What neighbourhood has the most cherry blossom trees?
2. What street has the most cherry blossom trees?
3. Which cherry blossom variety has the largest trees on average?
4. Is there a relationship between tree diameter and height_range_id?


# Analysis

In [1]:
# Import the libraries needed

import altair as alt
import pandas as pd
import json

# Import the data

trees_df = pd.read_csv('small_unique_vancouver.csv')

## The Dataset

The following table explains each field in the dataset we are using. Some time has passed since the data was pulled, so the fields we see here are not necessarily still on the [Vancouver open data portal](https://opendata.vancouver.ca/explore/dataset/public-trees/information/?disjunctive.neighbourhood_name&disjunctive.on_street&disjunctive.species_name&disjunctive.common_name).

| column | description | 
|:--------|:--------|
|  Unnamed: 0  |  an index from the original dataset (the column has no name in the CSV)   |
|  std_street   |  the street name a tree is on   |
|  on_street   |  the street intersecting std_street. default is std_street if there is no intersecting street.   |
|  species_name  |  the tree species name   |
|  neighbourhood_name |  the neighbourhood a tree is in   |
|  date_planted  |  date planted   |
|  diameter  |  tree diameter in inches   |
|  street_side_name  |  which side of street a tree is on   |
|  genus_name  |  the tree's genus name   |
|  assigned  |  old column that is deprecated in the most updated dataset   |
|  civic_number  |  street number for a tree   |
|  plant_area  |  old column that is deprecated in the most updated dataset   |
|  curb  |  boolean if a tree is on a curb  |
|  tree_id  |  unique id assigned to each tree   |
|  common_name  |  the tree common name   |
|  height_range_id  |  classified height range of a tree in 10 feet increments    |
|  on_street_block  |  the block of the on_street a tree is in   |
|  cultivar_name  |  cultivar name   |
|  root_barrier  |  maybe boolean on whether a root barrier is installed   |
|  latitude  |  geographic latitude   |
|  longitude  |  geographic longitude   |
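
Because `height_range_id` is a class rather than a measurement, it can help to translate an id back into feet. Here is a minimal sketch, assuming that id `n` covers the 10-foot band starting at `10 * n` feet. The helper name `height_range_to_feet` is hypothetical and is not part of the dataset or any library.

```python
def height_range_to_feet(height_range_id: int, step: int = 10) -> tuple[int, int]:
    """Return the (lower, upper) bounds in feet for a height_range_id.

    Assumption: id n covers the band from n * step up to (n + 1) * step feet.
    """
    if height_range_id < 0:
        raise ValueError("height_range_id must be non-negative")
    lower = height_range_id * step
    return lower, lower + step


# Print the assumed band for a few ids
for range_id in (1, 2, 3):
    low, high = height_range_to_feet(range_id)
    print(f"height_range_id {range_id}: {low}-{high} ft")
```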

Let's also look at whether our data has null values we should be aware of.

In [2]:
trees_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 21 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Unnamed: 0          5000 non-null   int64  
 1   std_street          5000 non-null   object 
 2   on_street           5000 non-null   object 
 3   species_name        5000 non-null   object 
 4   neighbourhood_name  5000 non-null   object 
 5   date_planted        2363 non-null   object 
 6   diameter            5000 non-null   float64
 7   street_side_name    5000 non-null   object 
 8   genus_name          5000 non-null   object 
 9   assigned            5000 non-null   object 
 10  civic_number        5000 non-null   int64  
 11  plant_area          4950 non-null   object 
 12  curb                5000 non-null   object 
 13  tree_id             5000 non-null   int64  
 14  common_name         5000 non-null   object 
 15  height_range_id     5000 non-null   int64  
 16  on_str

There are quite a few null values in date_planted and cultivar_name, and a few in plant_area. This analysis does not use those columns, so the missing values are not a concern. We will drop other columns we do not need, such as 'root_barrier' and 'assigned'.

Given that this is a sample of a dataset, observations may differ if we choose to use the original dataset. However, methods and approaches used in this analysis should be applicable towards the original dataset.
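
The null check above can also be condensed to only the columns that have missing values. The sketch below uses a small made-up frame (`toy_df`, not the real `trees_df`) so it runs on its own. The same two lines should apply to `trees_df` unchanged.

```python
import pandas as pd

# Toy frame standing in for trees_df; the values are invented for illustration.
toy_df = pd.DataFrame({
    "tree_id": [1, 2, 3, 4],
    "date_planted": ["2010-10-18", None, None, "2004-03-02"],
    "cultivar_name": ["AKEBONO", "KWANZAN", None, "KWANZAN"],
})

# Count missing values per column and keep only the columns that have any
null_counts = toy_df.isna().sum()
null_counts = null_counts[null_counts > 0].sort_values(ascending=False)
print(null_counts)
```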

In [3]:
# Clean the dataframe and drop irrelevant columns

# Keep only trees whose common name contains 'flowering cherry' (case-insensitive)
cherry_df = trees_df[trees_df['common_name'].str.contains('flowering cherry', case=False, na=False)]

# Drop unnecessary columns: the unnamed index column, 'assigned', and 'root_barrier'
cherry_clean_df = cherry_df.drop(columns=[cherry_df.columns[0], 'assigned', 'root_barrier'])

cherry_clean_df.head()

Unnamed: 0,std_street,on_street,species_name,neighbourhood_name,date_planted,diameter,street_side_name,genus_name,civic_number,plant_area,curb,tree_id,common_name,height_range_id,on_street_block,cultivar_name,latitude,longitude
21,ST. CATHERINES ST,E 49TH AV,SERRULATA,Sunset,,14.0,ODD,PRUNUS,6499,4,Y,44256,KWANZAN FLOWERING CHERRY,3,800,KWANZAN,49.225494,-123.0872
42,W 35TH AV,W 35TH AV,SERRULATA,Shaughnessy,,11.0,EVEN,PRUNUS,2028,12,Y,33656,KWANZAN FLOWERING CHERRY,2,2000,KWANZAN,49.239992,-123.152677
60,CAMOSUN ST,CAMOSUN ST,SERRULATA,Dunbar-Southlands,,16.0,ODD,PRUNUS,4475,N,Y,204485,KWANZAN FLOWERING CHERRY,2,4400,KWANZAN,49.24643,-123.1969
62,E 10TH AV,CAROLINA ST,SERRULATA,Mount Pleasant,,12.0,ODD,PRUNUS,580,6,Y,9073,KWANZAN FLOWERING CHERRY,2,2600,KWANZAN,49.261203,-123.091148
63,W 59TH AV,W 59TH AV,X YEDOENSIS,Marpole,2010-10-18,3.0,ODD,PRUNUS,1239,10,Y,126657,AKEBONO FLOWERING CHERRY,1,1200,AKEBONO,49.217274,-123.133047


## Questions

### Question 1: What neighbourhood has the most cherry blossom trees?

Let's plot our data on a map and see which neighbourhood is the most pink!

In [4]:
# Rename the dataset field to match the GeoJSON field so they can be connected
cherry_clean_df = cherry_clean_df.rename(columns={'neighbourhood_name':'name'})


In [5]:

url_geojson = 'https://raw.githubusercontent.com/UBC-MDS/exploratory-data-viz/refs/heads/binder/data/local-area-boundary.geojson'
data_geojson_remote = alt.Data(url=url_geojson, format=alt.DataFormat(property='features', type='json'))

# Selection object
click = alt.selection_single(fields=['name'])

# Define the Vancouver map
vancouver_map = alt.Chart(data_geojson_remote).mark_geoshape(
    stroke='#811331' #add dark pink borders
).encode(
    color=alt.condition(click, alt.value('pink'), alt.value('lightgray')),
    tooltip=['name:N']
).transform_lookup(
    lookup='properties.name', # GeoJSON field used as the lookup key
    from_=alt.LookupData(cherry_clean_df, key='name', fields=['name']) # dataset and key column to match against
).project(type='identity', reflectY=True).add_selection(click).properties(width=600, height=300)

# Define the bar chart
neighbourhood_plot = alt.Chart(cherry_clean_df).mark_bar().encode(
    alt.Y('count()', title='Number of Cherry Blossom Trees'),
    alt.X('name:N', sort='-y', title='Neighbourhood'),
    color=alt.condition(click, alt.value('pink'), alt.value('lightgray'))
).add_selection(click).properties(
    title='Count of Cherry Blossom by Neighbourhood',
    width=600, height=300
)

# Link the two plots
vancouver_map & neighbourhood_plot

The plots above show the neighbourhood boundaries within the city of Vancouver and, below the map, the count of cherry blossom trees by neighbourhood. You may click on either plot to highlight the same neighbourhood on the other plot.

We can see that Renfrew-Collingwood has the most cherry blossom trees, with Dunbar-Southlands and Victoria-Fraserview trailing close behind. Victoria-Fraserview is relatively close to Renfrew-Collingwood, so maybe those are the two best neighbourhoods to head to!
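
The ranking read off the bar chart can also be checked as a plain table. This is a minimal sketch on made-up rows (`toy_cherry_df` is a stand-in for `cherry_clean_df` after the rename to `name`), so the counts it prints are illustrative only.

```python
import pandas as pd

# Toy stand-in for cherry_clean_df; the rows are invented for illustration.
toy_cherry_df = pd.DataFrame({
    "name": ["Sunset", "Marpole", "Sunset", "Shaughnessy", "Sunset", "Marpole"],
})

# Count trees per neighbourhood, sorted from most to fewest
neighbourhood_counts = (
    toy_cherry_df["name"]
    .value_counts()
    .rename_axis("name")
    .reset_index(name="tree_count")
)
print(neighbourhood_counts.head(3))
```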

However, a picture is most 'Instagrammable' if it's taken on a street lined with cherry blossom trees with a lush canopy. That brings us to the next question:

### Question 2: What street has the most cherry blossom trees?

Given the large number of streets we have (200+), let's filter the streets down to the top 50.

In [6]:
# Count trees per street and keep only the top 50 streets
top50_streets = (
    cherry_clean_df.groupby('std_street')
    .size()
    .reset_index(name='count')
    .sort_values(by='count', ascending=False)
    .head(50)
)

In [7]:
#the bar chart
street_plot = alt.Chart(top50_streets).mark_bar(color='pink').encode(
    alt.X('count', title='Number of Trees'),
    alt.Y('std_street:N', sort='x', title='Street', axis=None),
    tooltip=['std_street:N']).properties(
    title='Count of Cherry Blossom Trees by Street (Top 50)', height=650
)

# the texts inside bar chart
text = alt.Chart(top50_streets).mark_text(align='left', dx=-100, color='black', size=10).encode(
    alt.X('count'),
    alt.Y('std_street:N', sort='x'),
    text=alt.Text('std_street:N')
)

streets = street_plot + text

streets

Here is a ranking of streets by the number of trees planted on them, with the highest at the bottom. I guess I should take my wife for a walk along West 16th Avenue! But perhaps we shouldn't just look at the count of trees; maybe the bigger the tree, the more blossoms and pink!

### Question 3: Which cherry blossom trees are bigger in general?

First we look at the diameter of the trees and then the height via height_range_id. A high value for both could mean a nice lush canopy.

In [8]:
filtered_cherry_df = cherry_df[['common_name', 'height_range_id', 'diameter']]

click2 = alt.selection_single(fields=['common_name'], bind='legend')
diameter_plot = alt.Chart(filtered_cherry_df).transform_density(
    'diameter',
    groupby=['common_name'],
    as_=['diameter','density']).mark_area(opacity=0.5).encode(
    alt.X('diameter'),
    alt.Y('density:Q'),
    color='common_name:N',
    opacity=alt.condition(click2, alt.value(0.8), alt.value(0.05))).add_selection(click2).properties(title='Cherry Blossom Kinds and their Diameter', width=450)

height_plot = alt.Chart(filtered_cherry_df).transform_density(
    'height_range_id',
    groupby=['common_name'],
    as_=['height_range_id','density']).mark_area(opacity=0.5).encode(
    alt.X('height_range_id'),
    alt.Y('density:Q'),
    color='common_name:N',
    opacity=alt.condition(click2, alt.value(0.8), alt.value(0.05))).add_selection(click2).properties(title='Cherry Blossom Kinds and their Height', width=450)

density = (height_plot | diameter_plot)
density


Above are plots of the diameter and height of each of the four flowering cherry varieties. You can click on the legend to highlight each variety's distribution. The Kwanzan has trees with diameters of more than 40 inches! It has more trees at the wider end of the diameter range than the other varieties. The Akebono seems to have more trees with narrower diameters.

Looking at the height range, the Kwanzan has more trees in height ranges 2 and 3 (20-40 feet) than in height range 1, with a few in ranges 4 and 5. The original dataset records tree height in more detail, which would perhaps be better suited to this plot.

However, with what we currently have, there is some indication that the Kwanzan may be the best variety to look for. Keep in mind that the sampled table contains more Kwanzan trees than any other variety, so applying the same methodology to the full dataset might show a different picture.
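
Since the sample is unbalanced across varieties, it may help to put the sample size next to the size statistics. This is a sketch on invented rows (`toy_df` is hypothetical and not the real data). It assumes the same `common_name`, `diameter`, and `height_range_id` columns used in the plots.

```python
import pandas as pd

# Toy rows for illustration only; real values come from cherry_df.
toy_df = pd.DataFrame({
    "common_name": ["KWANZAN FLOWERING CHERRY"] * 4 + ["AKEBONO FLOWERING CHERRY"] * 2,
    "diameter": [14.0, 11.0, 16.0, 12.0, 3.0, 5.0],
    "height_range_id": [3, 2, 2, 2, 1, 1],
})

# Per-variety sample size alongside median diameter and height range
variety_summary = (
    toy_df.groupby("common_name")
    .agg(
        n_trees=("diameter", "size"),
        median_diameter=("diameter", "median"),
        median_height_range=("height_range_id", "median"),
    )
    .sort_values("n_trees", ascending=False)
)
print(variety_summary)
```

A variety with only a handful of trees in the sample gives a much less reliable median than one with many, which is worth remembering when comparing rows.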

Are the trees that are wide in diameter also taller, resulting in a lush pink canopy?
Let's check the relationship between height and diameter by answering the next question:
  

### Question 4: Is there a relationship between tree diameter and height_range_id?

In [10]:
# Make the slider and define its range
slider = alt.binding_range(name='diameter', min=0, max=45, step=0.5)

# Selection driven by the slider
select_height = alt.selection_single(fields=['diameter'],
                                     bind=slider,
                                     init={'diameter': 25})

# chart
relationship = alt.Chart(cherry_clean_df).mark_circle(size=70).encode(
    alt.X('diameter:Q', title='Diameter'),
    alt.Y('height_range_id:O', title='Height Range ID', sort='descending')
    ).properties(
    title='Diameter vs Height Range', height=250
)

relationship.encode(opacity=alt.condition(select_height,
        alt.value(1), alt.value(0.01))
).add_selection(select_height)

Every circle in this plot is a tree; however, because the diameter is rounded and the height range is ordinal, circles may stack on top of each other. The effect is faint, but as you move the slider to the right, more circles appear toward the higher end of the y-axis and fewer toward the lower end, indicating a slight relationship between the two. A Seaborn kdeplot might show this relationship better; however, for the purposes of this project we stick with Altair.
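
Because the plot only hints at the relationship, a single number can complement it. One option, which is not part of the original analysis, is a Spearman rank correlation, which suits an ordinal variable like `height_range_id`. The sketch below computes it as the Pearson correlation of the ranks on a made-up frame (`toy_df` is hypothetical), so the printed value says nothing about the real trees.

```python
import pandas as pd

# Invented values for illustration; replace with cherry_clean_df columns.
toy_df = pd.DataFrame({
    "diameter": [3.0, 5.0, 8.0, 11.0, 14.0, 16.0, 22.0, 30.0],
    "height_range_id": [1, 1, 2, 2, 2, 3, 3, 4],
})

# Spearman's rho is the Pearson correlation of the ranks; ties get average ranks
diameter_rank = toy_df["diameter"].rank()
height_rank = toy_df["height_range_id"].rank()
rho = diameter_rank.corr(height_rank)
print(f"Spearman rank correlation: {rho:.2f}")
```

A value near 1 would mean wider trees almost always fall in higher height ranges, while a value near 0 would mean no monotonic relationship.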

# Discussion

I am now better equipped to decide on the best places for a cherry blossom date. Using the map, I identified the neighbourhoods with the most cherry blossom trees and where they are, which could be useful for those who primarily use public transit and want a neighbourhood that isn't too far from the main transit routes. I also ranked the streets with the most cherry blossom trees, so I can prioritize which ones to visit for the best view.

I also found that, based on our data, the Kwanzan variety appears to be generally wider and taller in Vancouver. The data we are using is a sample, though, so results could differ if we apply the same methodology to the full dataset. Once we do that and better confirm our assumption, we could plot where all the Kwanzan trees are, filtered by diameter or height, to find the larger ones.

I also found a slight relationship between height and diameter, which makes biological sense. Though the relationship isn't very clear, I think a seaborn kdeplot could show the positive relationship better in the future.

Through this project, I have made the data-driven decision to visit W 16th Avenue and the Renfrew-Collingwood neighbourhood.


## Dashboard

In [11]:
(diameter_plot | height_plot) & ((vancouver_map & neighbourhood_plot) | streets) & (relationship.encode(opacity=alt.condition(select_height,
        alt.value(1), alt.value(0.01))
).add_selection(select_height))

## References
Not all the work in this notebook is original. Parts that were borrowed from other resources are as follows:

### Resources used
- Programming in Python for Data Science sample final project by Junghoo Kim
- Learning materials from UBC KCDS - [Data Visualization](https://viz-learn.mds.ubc.ca/)
- Data Source from UBC-MDS team sampled from [Public trees dataset](https://opendata.vancouver.ca/explore/dataset/public-trees/information/?disjunctive.neighbourhood_name&disjunctive.on_street&disjunctive.species_name&disjunctive.common_name)
- City of Vancouver [GeoJSON data](https://raw.githubusercontent.com/UBC-MDS/exploratory-data-viz/refs/heads/binder/data/local-area-boundary.geojson) provided by UBC-MDS team 
- Altair and Vega documentation including, but not limited to,
    - [Geoshape](https://altair-viz.github.io/user_guide/marks/geoshape.html)
    - [Custom Color Mapping](https://vega.github.io/vega/docs/schemes/)

- Vancouver Cherry Blossom Festival website - [History of Our Cherry Blossom](https://vcbf.ca/history-of-our-cherry-trees/)
- Lonely Planet article on [Top Places to see Cherry Blossom Around the World](https://www.lonelyplanet.com/articles/best-cherry-blossoms-around-the-world)

