# Exploratory Data Analysis, Vancouver street trees

## Fatemeh Salim

In [1]:
import altair as alt
import pandas as pd

Here we are going to work with the Vancouver street trees data set. I chose a smaller version that contains only 5,000 rows. Let's import the data and look at the first few rows before starting the exploratory data analysis.

In [2]:
trees_df = pd.read_csv(
    "https://raw.githubusercontent.com/UBC-MDS/data_viz_wrangled/main/data/Trees_data_sets/small_vancouver_trees.csv",
    parse_dates=["date_planted"],
)

In [3]:
trees_df.head()

Unnamed: 0.1,Unnamed: 0,std_street,on_street,species_name,neighbourhood_name,date_planted,diameter,street_side_name,genus_name,assigned,...,plant_area,curb,tree_id,common_name,height_range_id,on_street_block,cultivar_name,root_barrier,latitude,longitude
0,19886,W 10TH AV,W 10TH AV,BIGNONIOIDES,Kitsilano,NaT,34.0,ODD,CATALPA,N,...,10,Y,9945,COMMON CATALPA,5,3200,,N,49.2634,-123.1771
1,7941,W 59TH AV,W 59TH AV,SACCHARINUM,Marpole,NaT,20.0,ODD,ACER,Y,...,16,Y,50427,SILVER MAPLE,4,700,,N,49.217059,-123.120787
2,4613,W 47TH AV,W 47TH AV,PLATANOIDES,Kerrisdale,NaT,24.0,ODD,ACER,N,...,12,Y,43456,NORWAY MAPLE,5,2200,,N,49.229119,-123.159841
3,7388,COMMERCIAL DRIVE,COMMERCIAL DRIVE,EUCHLORA X,Grandview-Woodland,NaT,8.0,EVEN,TILIA,N,...,C,Y,69099,CRIMEAN LINDEN,3,1300,,N,49.272647,-123.069463
4,1894,E 55TH AV,E 55TH AV,SPECIES,Victoria-Fraserview,NaT,14.0,EVEN,ABIES,N,...,B,Y,164752,CRIMSON SUNSET NORWAY MAPLE,5,1900,,N,49.219958,-123.067159


# Questions of interest


# Description & Review of Data

In [4]:
trees_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 21 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   Unnamed: 0          5000 non-null   int64         
 1   std_street          5000 non-null   object        
 2   on_street           5000 non-null   object        
 3   species_name        5000 non-null   object        
 4   neighbourhood_name  5000 non-null   object        
 5   date_planted        2338 non-null   datetime64[ns]
 6   diameter            5000 non-null   float64       
 7   street_side_name    5000 non-null   object        
 8   genus_name          5000 non-null   object        
 9   assigned            5000 non-null   object        
 10  civic_number        5000 non-null   int64         
 11  plant_area          4963 non-null   object        
 12  curb                5000 non-null   object        
 13  tree_id             5000 non-null   int64       


To answer these questions, I need only the columns selected below.
I kept date_planted in the data frame for now. However, I won't use it since more than half of the dates are missing.

In [5]:
trees_df = trees_df[
    [
        "on_street",
        "species_name",
        "neighbourhood_name",
        "date_planted",
        "diameter",
        "genus_name",
        "common_name",
        "height_range_id",
        "root_barrier",
    ]
]
trees_df

Unnamed: 0,on_street,species_name,neighbourhood_name,date_planted,diameter,genus_name,common_name,height_range_id,root_barrier
0,W 10TH AV,BIGNONIOIDES,Kitsilano,NaT,34.0,CATALPA,COMMON CATALPA,5,N
1,W 59TH AV,SACCHARINUM,Marpole,NaT,20.0,ACER,SILVER MAPLE,4,N
2,W 47TH AV,PLATANOIDES,Kerrisdale,NaT,24.0,ACER,NORWAY MAPLE,5,N
3,COMMERCIAL DRIVE,EUCHLORA X,Grandview-Woodland,NaT,8.0,TILIA,CRIMEAN LINDEN,3,N
4,E 55TH AV,SPECIES,Victoria-Fraserview,NaT,14.0,ABIES,CRIMSON SUNSET NORWAY MAPLE,5,N
...,...,...,...,...,...,...,...,...,...
4995,E 6TH AV,MORDENSIS,Mount Pleasant,2011-11-02,3.0,CRATAEGUS,TOBA HAWTHORN,1,N
4996,E 22ND AV,PSEUDOPLATANUS,Kensington-Cedar Cottage,NaT,12.5,ACER,SYCAMORE MAPLE,4,N
4997,WILLOW ST,OXYACANTHA,Fairview,NaT,20.0,CRATAEGUS,ENGLISH HAWTHORN,3,N
4998,W 19TH AV,XX,Riley Park,2017-05-15,3.0,MAGNOLIA,MAGNOLIA 'GALAXY',1,N
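
The share of missing planting dates mentioned above can be checked directly. The sketch below is a minimal illustration on a made-up stand-in frame (`toy_df` and its values are assumptions, not rows from the real data); the same expression should work on `trees_df`.

```python
import pandas as pd

# Toy stand-in for trees_df; all values here are made up for illustration.
toy_df = pd.DataFrame(
    {
        "neighbourhood_name": ["Kitsilano", "Marpole", "Fairview", "Oakridge"],
        "date_planted": pd.to_datetime(["2011-11-02", None, None, None]),
        "diameter": [3.0, 20.0, 24.0, 8.0],
    }
)

# Fraction of missing values per column, largest first.
missing_share = toy_df.isna().mean().sort_values(ascending=False)
print(missing_share)
```

On the real data, the `info()` output above reports 2,338 non-null planting dates out of 5,000 rows, which is roughly 53% missing.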


In [6]:
trees_df.describe()

Unnamed: 0,diameter,height_range_id
count,5000.0,5000.0
mean,12.1329,2.6998
std,9.310923,1.550923
min,0.25,0.0
25%,4.25,2.0
50%,10.0,2.0
75%,17.0,4.0
max,182.0,9.0


In [7]:
trees_df.describe(exclude="number", datetime_is_numeric=True)


Unnamed: 0,on_street,species_name,neighbourhood_name,date_planted,genus_name,common_name,root_barrier
count,5000,5000,5000,2338,5000,5000,5000
unique,607,157,22,,62,339,2
top,W KING EDWARD AV,SERRULATA,Kensington-Cedar Cottage,,ACER,KWANZAN FLOWERING CHERRY,N
freq,59,464,441,,1277,363,4662
mean,,,,2003-10-10 07:57:19.863131008,,,
min,,,,1989-11-15 00:00:00,,,
25%,,,,1997-12-11 06:00:00,,,
50%,,,,2003-04-10 12:00:00,,,
75%,,,,2009-11-06 00:00:00,,,
max,,,,2019-04-16 00:00:00,,,


# Exploratory visualizations

Let's first take a look at which columns are categorical and which ones are numerical.

In [8]:
categorical_columns = trees_df.select_dtypes("object").columns.tolist()
categorical_columns

['on_street',
 'species_name',
 'neighbourhood_name',
 'genus_name',
 'common_name',
 'root_barrier']

In [9]:
numerical_columns = trees_df.select_dtypes("number").columns.tolist()
numerical_columns

['diameter', 'height_range_id']

Now we can start answering the questions posed at the beginning of this notebook.

### Question 1: Which neighbourhoods in Vancouver have the most trees?

In [10]:
neighbourhood_trees = (
    alt.Chart(trees_df)
    .mark_bar()
    .encode(
        alt.X("count()", title="Count of trees planted"),
        alt.Y("neighbourhood_name", sort="x", title="Neighbourhood"),
    )
).properties(title="Neighbourhood tree counts")
neighbourhood_trees


The bar chart above shows that **Kensington-Cedar Cottage**, **Renfrew-Collingwood**, and **Hastings-Sunrise** are the top three neighbourhoods by number of trees planted.
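
The ranking read off the bar chart can also be cross-checked as a table. This is a minimal sketch on made-up rows (`toy_df` is an assumption); on the real data the same call would be applied to `trees_df["neighbourhood_name"]`.

```python
import pandas as pd

# Toy stand-in for trees_df; the rows are made up for illustration.
toy_df = pd.DataFrame(
    {
        "neighbourhood_name": ["Kensington-Cedar Cottage"] * 3
        + ["Renfrew-Collingwood"] * 2
        + ["Kitsilano"]
    }
)

# value_counts() sorts by frequency in descending order by default.
neighbourhood_counts = toy_df["neighbourhood_name"].value_counts()
top_two = neighbourhood_counts.head(2)
print(top_two)
```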

### Question 2: Are height range and diameter of trees related?


In [11]:
tree_size_plot_scatter = (
    alt.Chart(trees_df)
    .mark_circle()
    .encode(alt.X("diameter"), alt.Y("height_range_id"))
)
tree_size_plot_line = (
    alt.Chart(trees_df)
    .mark_line(color = 'Green')
    .encode(alt.X("mean(diameter)"), alt.Y("height_range_id"))
)
tree_size_plot_scatter + tree_size_plot_line

Using only the mean diameter to answer this question can hide how the diameters are spread within each height range. So I decided to combine a scatter plot of all the diameter points with a line plot of the mean diameter.
There is one outlier point. I am going to remove it and repeat the chart to get a better view.


In [12]:
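# Drop the single extreme diameter so it does not stretch the x-axis or skew the mean.
trees_no_outlier = trees_df[trees_df["diameter"] < 80]
tree_size_plot_scatter = (
    alt.Chart(trees_no_outlier)
    .mark_circle()
    .encode(
        alt.X("diameter", title="Diameter"),
        alt.Y("height_range_id", title="Height range"),
    )
)
# Use the same filtered data for the mean line so both layers agree.
tree_size_plot_line = (
    alt.Chart(trees_no_outlier)
    .mark_line(color="green")
    .encode(
        alt.X("mean(diameter)", title="Mean of diameter"),
        alt.Y("height_range_id", title="Height range"),
    )
)
tree_size_plot_scatter + tree_size_plot_line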

From this plot, we can tell that taller trees have, on average, a bigger diameter. However, the scatter plot also shows a good number of tall trees with a small diameter.

Calculating the correlation shows a positive relationship between these two columns.
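
A single extreme diameter can shift the mean for its height range noticeably, which is why the outlier matters for the mean line as well as for the points. The sketch below illustrates this on made-up values (`toy_df` and `DIAMETER_CUTOFF` are assumptions; the cutoff mirrors the threshold of 80 used in the chart code).

```python
import pandas as pd

# Toy stand-in for trees_df; 182.0 plays the role of the outlier.
toy_df = pd.DataFrame(
    {
        "diameter": [4.0, 10.0, 16.0, 182.0],
        "height_range_id": [1, 2, 4, 2],
    }
)

DIAMETER_CUTOFF = 80  # same threshold as in the chart code
filtered_df = toy_df[toy_df["diameter"] < DIAMETER_CUTOFF]

# Mean diameter per height range, with and without the outlier.
mean_all = toy_df.groupby("height_range_id")["diameter"].mean()
mean_filtered = filtered_df.groupby("height_range_id")["diameter"].mean()
print(mean_all, mean_filtered, sep="\n")
```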



In [13]:
corr_df = (
    trees_df[numerical_columns].corr("pearson").stack().reset_index(name="correlation")
)
corr_df

Unnamed: 0,level_0,level_1,correlation
0,diameter,diameter,1.0
1,diameter,height_range_id,0.752331
2,height_range_id,diameter,0.752331
3,height_range_id,height_range_id,1.0
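
Because `height_range_id` looks like an ordered category rather than a continuous measurement, a rank-based (Spearman) correlation may be a useful companion to the Pearson value above. The sketch below computes it as the Pearson correlation of the ranks, on made-up values (`toy_df` is an assumption), and reshapes the result the same way as `corr_df`.

```python
import pandas as pd

# Toy stand-in for the two numerical columns; values are made up.
toy_df = pd.DataFrame(
    {
        "diameter": [3.0, 8.0, 12.5, 20.0, 34.0],
        "height_range_id": [1, 2, 2, 4, 5],
    }
)

# Spearman correlation is the Pearson correlation of the ranks
# (tied values receive their average rank).
spearman_corr = toy_df.rank().corr(method="pearson")

# Long ("tidy") form with one row per pair of columns.
tidy_corr = spearman_corr.stack().reset_index(name="correlation")
print(tidy_corr)
```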


Now let's explore flowering cherry trees. These trees are beautiful in spring, so photographers and tourists may want to know where to find them. Here I am going to answer question 3.

### Question 3: Which neighbourhoods have the most flowering cherry trees?

In [14]:
cherry_trees = trees_df[trees_df["common_name"] == "KWANZAN FLOWERING CHERRY"]
cherry_trees

Unnamed: 0,on_street,species_name,neighbourhood_name,date_planted,diameter,genus_name,common_name,height_range_id,root_barrier
6,BROUGHTON ST,SERRULATA,West End,NaT,24.0,PRUNUS,KWANZAN FLOWERING CHERRY,3,N
14,NASSAU DRIVE,SERRULATA,Victoria-Fraserview,NaT,16.0,PRUNUS,KWANZAN FLOWERING CHERRY,3,N
46,W 11TH AV,SERRULATA,Fairview,NaT,17.0,PRUNUS,KWANZAN FLOWERING CHERRY,2,N
61,E 23RD AV,SERRULATA,Riley Park,NaT,26.0,PRUNUS,KWANZAN FLOWERING CHERRY,3,N
90,E 28TH AV,SERRULATA,Renfrew-Collingwood,NaT,38.0,PRUNUS,KWANZAN FLOWERING CHERRY,3,N
...,...,...,...,...,...,...,...,...,...
4928,E 21ST AV,SERRULATA,Kensington-Cedar Cottage,NaT,24.5,PRUNUS,KWANZAN FLOWERING CHERRY,2,N
4962,ALBERTA ST,SERRULATA,Oakridge,NaT,19.5,PRUNUS,KWANZAN FLOWERING CHERRY,2,N
4976,PARKER ST,SERRULATA,Grandview-Woodland,NaT,29.0,PRUNUS,KWANZAN FLOWERING CHERRY,3,N
4981,W 20TH AV,SERRULATA,Arbutus-Ridge,NaT,10.0,PRUNUS,KWANZAN FLOWERING CHERRY,2,N
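
The filter above keeps only one exact common name. If other flowering cherry cultivars should be counted too, a substring match is one possible alternative. This is a sketch under that assumption, on made-up rows (`toy_df` and the second cherry name are illustrative only).

```python
import pandas as pd

# Toy stand-in for trees_df; names and rows are made up for illustration.
toy_df = pd.DataFrame(
    {
        "common_name": [
            "KWANZAN FLOWERING CHERRY",
            "AKEBONO FLOWERING CHERRY",
            "NORWAY MAPLE",
            "PISSARD PLUM",
        ],
        "neighbourhood_name": ["West End", "Fairview", "Kerrisdale", "Shaughnessy"],
    }
)

# regex=False treats the pattern as a plain substring.
is_flowering_cherry = toy_df["common_name"].str.contains(
    "FLOWERING CHERRY", regex=False
)
all_cherry_trees = toy_df[is_flowering_cherry]
print(all_cherry_trees)
```

Whether the broader match is appropriate depends on which trees should count as "flowering cherry" for the question at hand.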


In [15]:
title = alt.TitleParams(
    "Mount Pleasant neighbourhood has the most cherry trees",
    subtitle="Downtown Vancouver has the fewest cherry trees",
)
neighbourhood_cherry = (
    alt.Chart(cherry_trees, title=title)
    .mark_bar()
    .encode(
        alt.X("count()"), alt.Y("neighbourhood_name", sort="x", title="Neighbourhood")
    )
)
neighbourhood_cherry


### Question 4: Which neighbourhoods have the tallest cherry trees?

In [16]:
neighbourhood_cherry = (
    alt.Chart(cherry_trees, height=250, width=150)
    .mark_bar()
    .encode(
        alt.X("count()", title = ""),
        alt.Y("neighbourhood_name", sort="x"),
        color=alt.Color("height_range_id"),
        tooltip="count()",
    )
    .facet("height_range_id")
    .properties(title="a")
)
neighbourhood_cherry

Five neighbourhoods have a few trees in height range 4: **Mount Pleasant**, **Dunbar-Southlands**, **Kerrisdale**, **Fairview**, and **West Point Grey**. However, each of these neighbourhoods has fewer than 5 such trees. We can also see that the **Victoria-Fraserview** neighbourhood has 19 cherry trees in height range 3.
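
The counts read from the faceted chart can also be laid out as a table of neighbourhood by height range. The sketch below uses `pd.crosstab` on made-up rows (`toy_cherry` is an assumption standing in for `cherry_trees`).

```python
import pandas as pd

# Toy stand-in for cherry_trees; the rows are made up for illustration.
toy_cherry = pd.DataFrame(
    {
        "neighbourhood_name": [
            "Victoria-Fraserview",
            "Victoria-Fraserview",
            "Fairview",
            "Fairview",
            "Kerrisdale",
        ],
        "height_range_id": [3, 3, 2, 4, 4],
    }
)

# Rows are neighbourhoods, columns are height ranges, cells are tree counts.
height_table = pd.crosstab(
    toy_cherry["neighbourhood_name"], toy_cherry["height_range_id"]
)
print(height_table)
```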

### Question 5: Distribution of diameter of flowering cherry trees for different heights?


In [17]:
(
    alt.Chart(cherry_trees)
    .mark_tick()
    .encode(alt.X("diameter"), alt.Y("height_range_id"))
)

From the plot above, we can tell that cherry trees with a diameter bigger than 25 are among the taller trees.

In [18]:
(
    alt.Chart(cherry_trees)
    .transform_density(
        "diameter", groupby=["height_range_id"], as_=["diameter", "density"]
    )
    .mark_area()
    .encode(x="diameter", y="density:Q", color="height_range_id")
)

The most common diameter differs between height ranges among cherry trees. For example, the most common diameter for the shortest cherry trees is about 5 inches, whereas for the tallest cherry trees it is about 32 inches.

However, the density plot for height range 4 seems to be cut off at both ends, which suggests there are not enough examples in that range to draw an accurate conclusion.
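
One way to check whether a height range has enough trees to support a density estimate is to count the rows per group next to a simple summary. This is a minimal sketch on made-up values (`toy_cherry` is an assumption standing in for `cherry_trees`).

```python
import pandas as pd

# Toy stand-in for cherry_trees; the values are made up for illustration.
toy_cherry = pd.DataFrame(
    {
        "height_range_id": [1, 1, 2, 2, 2, 3, 3, 4],
        "diameter": [3.0, 5.0, 10.0, 14.0, 17.0, 24.0, 26.0, 32.0],
    }
)

# Number of trees and median diameter for each height range.
summary = toy_cherry.groupby("height_range_id")["diameter"].agg(["count", "median"])
print(summary)
```

A group with very few rows, like height range 4 in this toy example, is a hint that its density curve should be read with caution.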

### Question 6: How is cherry tree diameter distributed across neighbourhoods?

In [19]:
diameter_order = (
    cherry_trees.groupby("neighbourhood_name")["diameter"]
    .median()
    .sort_values()
    .index.tolist()
)
box = (
    alt.Chart(cherry_trees)
    .mark_boxplot()
    .encode(alt.X("diameter:Q"), alt.Y("neighbourhood_name:N", sort=diameter_order))
    .properties(title=" Cherry trees diameter for neighbourhood")
)
bar = (
    alt.Chart(cherry_trees)
    .mark_bar()
    .encode(
        alt.X("diameter:Q"),
        alt.Y("neighbourhood_name:N", sort=diameter_order),
        tooltip="diameter",
    )
    .properties(title=" Cherry trees diameter for neighbourhood")
)
box | bar


We can tell from the plots above that **Killarney** has the thickest trees, both in terms of median diameter and number of thick trees. From the bar chart, or by hovering over the box plot, the maximum diameter for this neighbourhood is 34. The bar chart shows that **Victoria-Fraserview** has trees whose diameter reaches 46. However, the box plot shows that the median tree diameter in this neighbourhood is lower than in Killarney. What makes the bar for Victoria-Fraserview longer is a few trees that exceed 30 inches in diameter.
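
The difference between the two charts comes down to median versus maximum: the bar length in the bar chart appears to be driven by the largest diameters, while the box plot centres on the median. The sketch below puts both statistics side by side on made-up values (`toy_cherry` is an assumption standing in for `cherry_trees`).

```python
import pandas as pd

# Toy stand-in for cherry_trees; the values are made up for illustration.
toy_cherry = pd.DataFrame(
    {
        "neighbourhood_name": ["Killarney"] * 3 + ["Victoria-Fraserview"] * 4,
        "diameter": [30.0, 32.0, 34.0, 12.0, 16.0, 20.0, 46.0],
    }
)

# Median and maximum diameter per neighbourhood, ordered by the median.
diameter_summary = (
    toy_cherry.groupby("neighbourhood_name")["diameter"]
    .agg(["median", "max"])
    .sort_values("median")
)
print(diameter_summary)
```

In this toy example one neighbourhood has the larger maximum while the other has the larger median, which mirrors the pattern described above.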

### Question 7: What are the top 10 most popular trees in Vancouver?

In [20]:
common_trees = (
    trees_df["common_name"]
    .value_counts()[:10]
    .sort_values(ascending=False)
    .reset_index()
)
common_trees

Unnamed: 0,index,common_name
0,KWANZAN FLOWERING CHERRY,363
1,PISSARD PLUM,301
2,NORWAY MAPLE,219
3,CRIMEAN LINDEN,151
4,BOWHALL RED MAPLE,105
5,NIGHT PURPLE LEAF PLUM,98
6,KOBUS MAGNOLIA,93
7,HEDGE MAPLE,93
8,RED MAPLE,92
9,PYRAMIDAL EUROPEAN HORNBEAM,85
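
The column names produced by `value_counts().reset_index()` differ between pandas versions (newer versions name them after the original column and `count` rather than `index`), so code that later refers to an `"index"` column may not be portable. A sketch of a version-independent alternative is to take the top names straight from the index. The frame `toy_df` and the constant `TOP_N` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for trees_df; the rows are made up for illustration.
toy_df = pd.DataFrame(
    {
        "common_name": ["KWANZAN FLOWERING CHERRY"] * 3
        + ["PISSARD PLUM"] * 2
        + ["NORWAY MAPLE"]
    }
)

TOP_N = 2  # the notebook uses 10 on the real data

# The index of value_counts() holds the names, most frequent first.
top_names = toy_df["common_name"].value_counts().head(TOP_N).index

# Keep only rows whose common name is among the top names.
popular_df = toy_df[toy_df["common_name"].isin(top_names)]
print(popular_df)
```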


In [21]:
common_trees_df = trees_df[trees_df["common_name"].isin(common_trees["index"])]
common_trees_df

Unnamed: 0,on_street,species_name,neighbourhood_name,date_planted,diameter,genus_name,common_name,height_range_id,root_barrier
2,W 47TH AV,PLATANOIDES,Kerrisdale,NaT,24.0,ACER,NORWAY MAPLE,5,N
3,COMMERCIAL DRIVE,EUCHLORA X,Grandview-Woodland,NaT,8.0,TILIA,CRIMEAN LINDEN,3,N
5,ADERA ST,CERASIFERA,Kerrisdale,NaT,1.0,PRUNUS,PISSARD PLUM,2,N
6,BROUGHTON ST,SERRULATA,West End,NaT,24.0,PRUNUS,KWANZAN FLOWERING CHERRY,3,N
7,CHURCHILL ST,CERASIFERA,Shaughnessy,NaT,9.0,PRUNUS,PISSARD PLUM,2,N
...,...,...,...,...,...,...,...,...,...
4976,PARKER ST,SERRULATA,Grandview-Woodland,NaT,29.0,PRUNUS,KWANZAN FLOWERING CHERRY,3,N
4981,W 20TH AV,SERRULATA,Arbutus-Ridge,NaT,10.0,PRUNUS,KWANZAN FLOWERING CHERRY,2,N
4983,W 22ND AV,PLATANOIDES,Dunbar-Southlands,NaT,25.0,ACER,NORWAY MAPLE,6,N
4985,W 10TH AV,PLATANOIDES,West Point Grey,NaT,19.0,ACER,NORWAY MAPLE,5,N


Let's explore this new data frame that I made.

In [22]:
common_trees_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1600 entries, 2 to 4987
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   on_street           1600 non-null   object        
 1   species_name        1600 non-null   object        
 2   neighbourhood_name  1600 non-null   object        
 3   date_planted        471 non-null    datetime64[ns]
 4   diameter            1600 non-null   float64       
 5   genus_name          1600 non-null   object        
 6   common_name         1600 non-null   object        
 7   height_range_id     1600 non-null   int64         
 8   root_barrier        1600 non-null   object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(6)
memory usage: 125.0+ KB


In [23]:
common_trees_df.describe()

Unnamed: 0,diameter,height_range_id
count,1600.0,1600.0
mean,13.891719,2.8125
std,7.988612,1.343399
min,0.25,1.0
25%,7.5,2.0
50%,13.0,3.0
75%,19.0,4.0
max,46.0,9.0


### Question 8:  Visualize the distributions of all numerical columns for popular trees in Vancouver.

Let's first find the categorical and numerical columns in the common trees data frame.

In [24]:
categorical_columns = common_trees_df.select_dtypes("object").columns.tolist()
categorical_columns

['on_street',
 'species_name',
 'neighbourhood_name',
 'genus_name',
 'common_name',
 'root_barrier']

In [25]:
numerical_columns = common_trees_df.select_dtypes("number").columns.tolist()
numerical_columns

['diameter', 'height_range_id']

Now we can use this information to answer question 8 and visualize the distributions of all numerical columns in the common trees data frame. This will help us understand the data better.

In [26]:
(
    alt.Chart(common_trees_df)
    .mark_bar()
    .encode(
        alt.X(alt.repeat(), type="quantitative", bin=alt.Bin(maxbins=25)),
        alt.Y("count()"),
    )
    .properties(width=250, height=150)
    .repeat(numerical_columns, columns=4)
)


The diameter plot has at least two peaks. Most of the trees have a diameter between 2 and 4 inches and are in height range 2 or 3.

### Question 9: What is the most frequent combination of height and diameter among popular trees in Vancouver?




In [27]:
(
    alt.Chart(common_trees_df)
    .mark_rect()
    .encode(
        alt.X("diameter", bin=alt.Bin(maxbins=30)),
        alt.Y("height_range_id", bin=alt.Bin(maxbins=30)),
        alt.Color("count()", title=None),
    )
    .properties(width=350, height=350)
)

From the heat map above, the most frequent combination of height and diameter among popular trees in Vancouver is a diameter between 2 and 4 and height range id 1.
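
The most frequent cell of the heat map can also be found numerically by binning the diameter and counting combinations. This is a minimal sketch on made-up values; `toy_df` and the 2-unit bin width are assumptions and need not match the bins Altair chooses.

```python
import pandas as pd

# Toy stand-in for common_trees_df; the values are made up for illustration.
toy_df = pd.DataFrame(
    {
        "diameter": [2.5, 3.0, 3.5, 3.9, 10.0, 12.0, 24.0],
        "height_range_id": [1, 1, 1, 2, 2, 3, 4],
    }
)

# Left-closed bins of width 2: [0, 2), [2, 4), ...
bin_edges = list(range(0, 50, 2))
diameter_bin = pd.cut(toy_df["diameter"], bins=bin_edges, right=False)

# Count each (diameter bin, height range) pair that occurs, largest first.
combo_counts = (
    toy_df.assign(diameter_bin=diameter_bin)
    .groupby(["diameter_bin", "height_range_id"], observed=True)
    .size()
    .sort_values(ascending=False)
)
most_common = combo_counts.index[0]
print(most_common, combo_counts.iloc[0])
```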

### Question 10: Visualize the count of all categorical aspects of popular trees in Vancouver.


I am hoping to get a better understanding of the most frequent species, tree name, and genus among the popular trees by answering this question.

In [28]:
tree_category_plot = (
    alt.Chart(common_trees_df, height=250, width=300)
    .mark_bar()
    .encode(alt.X("count()"), alt.Y(alt.repeat(), type="nominal", sort="x"))
    .properties(width=250)
    .repeat(categorical_columns[1:], columns=2)
)
tree_category_plot


From these repeated plots, we can tell that **Cerasifera** is the most common species, **Kwanzan flowering cherry** is the most common tree, and **Prunus** is the most common genus. **Renfrew-Collingwood** has the most popular trees in Vancouver.
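
The most frequent value in each categorical column can be confirmed without reading it off the charts. The sketch below does this on made-up rows (`toy_df` is an assumption standing in for `common_trees_df`).

```python
import pandas as pd

# Toy stand-in for common_trees_df; the rows are made up for illustration.
toy_df = pd.DataFrame(
    {
        "species_name": ["CERASIFERA", "CERASIFERA", "SERRULATA"],
        "genus_name": ["PRUNUS", "PRUNUS", "PRUNUS"],
        "common_name": [
            "PISSARD PLUM",
            "PISSARD PLUM",
            "KWANZAN FLOWERING CHERRY",
        ],
    }
)

# idxmax() on the counts returns the label with the highest frequency.
most_common = {col: toy_df[col].value_counts().idxmax() for col in toy_df.columns}
print(most_common)
```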


### Question 11: Explore the relationship between categorical and numerical columns in popular tree data frame.

Answering this question will help us better understand how height and diameter vary across species, genera, and neighbourhoods.

In [29]:
diameter_order = []
for groupby_col in ["species_name", "neighbourhood_name", "genus_name", "common_name"]:
    # Select the diameter column before aggregating so only it is summarized.
    diameter_order.extend(
        common_trees_df.groupby(groupby_col)["diameter"]
        .median()
        .sort_values()
        .index.to_list()
    )
# diameter_order


In [30]:
(
    alt.Chart(common_trees_df)
    .mark_boxplot()
    .encode(
        alt.X(alt.repeat("column"), type="quantitative"),
        alt.Y(alt.repeat("row"), type="nominal", sort=diameter_order),
    )
    .properties(width=350, height=350)
    .repeat(column=numerical_columns, row=categorical_columns[1:4])
)


This exploration of categorical and numerical columns leads to some interesting results.
Among the species, **Platanoides** has the largest median diameter and median height.
The median tree thickness is largest in the **Marpole** neighbourhood.
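
The median-based ordering behind the box plots collects the categories of several columns into a single list. A possible alternative, sketched here on made-up values, is to keep one ordering per column in a dictionary so each ordering can be inspected on its own. The names `toy_df`, `group_columns`, and `orders` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for common_trees_df; the values are made up for illustration.
toy_df = pd.DataFrame(
    {
        "species_name": ["PLATANOIDES", "PLATANOIDES", "SERRULATA", "SERRULATA"],
        "neighbourhood_name": ["Marpole", "Kitsilano", "Marpole", "Kitsilano"],
        "diameter": [24.0, 20.0, 10.0, 6.0],
    }
)

group_columns = ["species_name", "neighbourhood_name"]

# For each column, list its categories from smallest to largest median diameter.
orders = {
    col: toy_df.groupby(col)["diameter"].median().sort_values().index.tolist()
    for col in group_columns
}
print(orders)
```

The last entry of each list is the category with the largest median diameter, which is the quantity the remarks above refer to.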


# Concluding remarks

This section explains which **five** plots I am going to include in my report and how they will be changed for the audience.

**1**: The plot for question 1. I can add a more explanatory title and subtitle, remove the x-axis, and instead show each neighbourhood's tree count beside its bar.

**2**: The second plot from question 2. It needs better axis labels and an explanatory title. A tooltip can be added to the line chart to show that the line marks the mean diameter.

**3**: The plot from question 4. It needs a title.

**4**: The plots from question 6. The axis titles and plot title need work.

**5**: The plot from question 9. The y-axis ticks can be changed to integers, and the axis titles and plot title need work.
For a public audience, I think I will change this plot to one made of squares whose size and colour reflect the count of observations. That is probably easier to understand.
