# Week 2: Data types and insights from a column of data

We can learn a lot from exploring a single column of data.

This notebook walks through identifying and fixing data types, selecting columns, computing summary statistics, and interpreting the results.

In [1]:
import pandas as pd

### 1. NYC street trees

These are records of street trees maintained by NYC Parks, from [NYC OpenData](https://data.cityofnewyork.us/Environment/Forestry-Tree-Points/hn5i-inap/about_data).

In [2]:
trees = pd.read_csv(
    'https://data.cityofnewyork.us/api/views/hn5i-inap/rows.csv?accessType=DOWNLOAD',
    usecols=[
        'OBJECTID',
        'GenusSpecies',
        'DBH',
        'StumpDiameter',
        'TPStructure',
        'TPCondition',
        'Location',
        'PlantedDate'
    ]
)
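To see what `usecols` does without downloading the full file, here is a minimal sketch on a tiny made-up CSV. The rows and the `Borough` column are hypothetical and are not taken from the real dataset.

```python
import io
import pandas as pd

# A tiny, made-up CSV standing in for the real download.
# 'Borough' is a hypothetical extra column used only for illustration.
csv_text = """OBJECTID,GenusSpecies,DBH,Borough
1,Quercus rubra - northern red oak,15,Brooklyn
2,Pyrus calleryana - Callery pear,14,Staten Island
"""

# Only the columns named in usecols are loaded; the others are skipped.
toy = pd.read_csv(io.StringIO(csv_text), usecols=['OBJECTID', 'DBH'])
print(toy.columns.tolist())
```

Loading only the columns you need keeps a large file like this one smaller in memory.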

In [10]:
trees.sample(10)

| | OBJECTID | DBH | TPStructure | TPCondition | StumpDiameter | GenusSpecies | PlantedDate | Location |
|---|---|---|---|---|---|---|---|---|
| 655241 | 4646880 | 6.0 | Full | Good | 0.0 | Prunus serrulata 'Green leaf' - 'Green leaf' J... | | POINT (-73.79401925236839 40.78529595669278) |
| 981470 | 12730118 | 5.0 | Full | Good | | Quercus bicolor - swamp white oak | | POINT (-74.12460855847837 40.57737064520913) |
| 955691 | 10579306 | 24.0 | Full | Good | | Liquidambar styraciflua - sweetgum | | POINT (-73.88365764364066 40.884794805717235) |
| 1061450 | 15888991 | 3.0 | Full | Excellent | | Acer tataricum 'Hot Wings' - 'Hot Wings' Tatar... | 2024-12-06 05:00:00.0000000 | POINT (-73.91958305981771 40.712999298385974) |
| 717632 | 4755294 | 2.0 | Full | Good | | Cornus mas - Cornelian cherry | | POINT (-73.86110068416039 40.85727192092965) |
| 107061 | 1414849 | 26.0 | Full | Good | | Tilia americana - American basswood | | POINT (-73.76316050418015 40.76515320014331) |
| 969666 | 11591661 | 3.0 | Full | Excellent | | Koelreuteria paniculata - goldenrain tree | 2021-05-06 04:00:00.0000000 | POINT (-74.0167136858958 40.6779190299109) |
| 356239 | 2617256 | 14.0 | Full | Fair | 0.0 | Pyrus calleryana - Callery pear | | POINT (-74.1331973707107 40.55501589395529) |
| 235669 | 2403028 | 15.0 | Full | Good | | Quercus rubra - northern red oak | | POINT (-73.9892178015829 40.59489180617588) |
| 993850 | 13449665 | 3.0 | Full | Excellent | | Nyssa sylvatica 'Wildfire' - 'Wildfire' Black gum | 2022-12-14 05:00:00.0000000 | POINT (-74.03469100785696 40.63939692775647) |


What can you infer about these data from this sample?

- What is each row?
- What type is each column?

What limitations or biases might there be in these data?

Check the data types:

In [None]:
trees.dtypes
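As a rough guide to reading that output, here is a small sketch using three rows copied from the sample above. The frame name `toy` is just for illustration. Whole numbers are stored as integers, a numeric column with missing values becomes a float, and text appears as `object` or as a string dtype, depending on the pandas version.

```python
import pandas as pd

# Three rows copied from the sample above; None marks a missing value.
toy = pd.DataFrame({
    'OBJECTID': [4646880, 12730118, 10579306],
    'DBH': [6.0, 5.0, 24.0],
    'TPCondition': ['Good', 'Good', 'Good'],
    'StumpDiameter': [0.0, None, None],
})

print(toy.dtypes)

# Helper checks from pandas.api.types confirm the kind of each column.
is_int = pd.api.types.is_integer_dtype(toy['OBJECTID'])
is_float = pd.api.types.is_float_dtype(toy['StumpDiameter'])
print(is_int, is_float)
```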

`OBJECTID` looks like an id, not a measure. Is it unique?

In [None]:
trees['OBJECTID'].is_unique

How many trees are in these data?

In [None]:
trees['OBJECTID'].nunique()
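Counting distinct ids only equals counting rows when the ids are unique. The sketch below uses a hypothetical Series of ids with one repeat to show how `.is_unique`, `len()`, `.nunique()`, and `.duplicated()` relate.

```python
import pandas as pd

# Hypothetical ids; 102 appears twice.
ids = pd.Series([101, 102, 102, 104])

print(ids.is_unique)           # False, because 102 repeats
print(len(ids))                # 4 rows
print(ids.nunique())           # 3 distinct ids
print(ids.duplicated().sum())  # 1 row repeats an earlier id
```

If `.is_unique` is `True`, then `len()` and `.nunique()` agree and either one gives the number of records.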

`TPCondition` is the health and condition of the tree. How are trees doing?

What's the most frequent (modal) condition?

In [None]:
trees['TPCondition'].mode()
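Note that `.mode()` returns a Series rather than a single value, because more than one value can tie for most frequent. A minimal sketch on made-up conditions:

```python
import pandas as pd

# Made-up values where 'Good' and 'Fair' each appear twice.
conditions = pd.Series(['Good', 'Fair', 'Good', 'Fair', 'Poor'])

modes = conditions.mode()
print(modes.tolist())  # both tied values, in sorted order

# Take the first entry when a single value is needed.
first_mode = modes.iloc[0]
print(first_mode)
```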

How many trees are in that condition?

In [None]:
(
    trees['TPCondition']
    .value_counts()
    .head(1)
)

What proportion of trees is that?

In [None]:
(
    trees['TPCondition']
    .value_counts(normalize=True) # this returns proportions, instead of counts
    .head(1)
)
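By default, `value_counts()` leaves out missing values, so these proportions are shares of the rows that have a recorded value. The sketch below, on a made-up Series with one missing entry, shows how passing `dropna=False` changes the denominator.

```python
import pandas as pd

# Made-up conditions; None stands for a missing value.
conditions = pd.Series(['Good', 'Good', 'Fair', None])

# Default: missing values are excluded, so 'Good' is 2 of 3.
default_share = conditions.value_counts(normalize=True)
print(default_share)

# dropna=False counts missing values too, so 'Good' is 2 of 4.
with_missing = conditions.value_counts(normalize=True, dropna=False)
print(with_missing)
```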

Almost half (about 46%) of the trees with a recorded condition are 'Good'.

And the rest?

In [17]:
trees['TPCondition'].value_counts()

TPCondition
Good         493725
Fair         293429
Dead         112475
Excellent     94465
Poor          45461
Unknown       30117
Critical       5820
Name: count, dtype: int64

This is an ordinal variable; we can assign an order to these values:

In [None]:
trees['TPCondition'] = (
    pd.Categorical(
        values=trees['TPCondition'],
        categories=[
            'Unknown',
            'Dead',
            'Critical',
            'Poor',
            'Fair',
            'Good',
            'Excellent'
        ],
        ordered=True
    )
)

trees['TPCondition'].head()

... then sort them:

In [None]:
(
    trees['TPCondition']
    .value_counts()
    .sort_index(ascending=False)
)
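An ordered categorical supports more than sorting: comparisons, `.min()`, and `.max()` all follow the order we declared instead of alphabetical order. A small sketch on four made-up values, using the same category order as above:

```python
import pandas as pd

levels = ['Unknown', 'Dead', 'Critical', 'Poor', 'Fair', 'Good', 'Excellent']

# Four made-up conditions stored as an ordered categorical.
toy = pd.Series(
    pd.Categorical(
        ['Good', 'Poor', 'Excellent', 'Fair'],
        categories=levels,
        ordered=True
    )
)

# Comparisons use the declared order, not the alphabet.
at_least_fair = toy >= 'Fair'
print(at_least_fair.tolist())

print(toy.min(), toy.max())
print(toy.sort_values().tolist())
```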

What is the most common tree species?

In [None]:
(
    trees['GenusSpecies']
    .value_counts()
    .head(10)
)

In [None]:
(
    trees['GenusSpecies']
    .value_counts(normalize=True)
    .head(10)
)

`DBH` is "diameter at breast height", a standard measure for the size of the tree.

Let's take a look at the range of sizes:

In [None]:
trees['DBH'].mean()

In [None]:
trees['DBH'].median()

In [None]:
trees['DBH'].describe()

What do these central values tell you about the typical size of trees?
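As a hint, the mean and median can sit far apart when a few values are much larger than the rest. The sketch below uses only the ten `DBH` values from the random sample shown earlier, so the numbers illustrate the idea and do not describe the full dataset.

```python
import pandas as pd

# DBH values from the ten sampled rows shown earlier (not the full data).
dbh = pd.Series([6.0, 5.0, 24.0, 3.0, 2.0, 26.0, 3.0, 14.0, 15.0, 3.0])

sample_mean = dbh.mean()
sample_median = dbh.median()

# A few large trees pull the mean above the median.
print(sample_mean, sample_median)
```

When the mean is well above the median, the distribution has a long tail of large values, and the median is often the better description of a "typical" tree.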

(You might notice that these data include a datetime-type column and a geometry-type column. Those types are a bit more complex; we'll tackle them later in the course.)

# Tasks:

- What proportion of trees are rated as having a "Full" structure (labeled `TPStructure`)?
- What is the largest stump size?
- How many trees are (perhaps erroneously) labeled with a stump diameter of 0?
- What is the most common tree diameter measurement?
- What is the largest tree diameter?
- What are the 20 largest tree diameters? (_hint_: look up the `.nlargest()` method)

In [None]:
### Your code here

In [None]:
### Your code here