# Week 2: Data types and insights from a column of data

We can learn a lot from exploring a single column of data.

This notebook walks through identifying and fixing data types, selecting columns, computing summary statistics, and interpreting the results.

In [2]:
import pandas as pd

### 1. NYC street trees

These are records of street trees maintained by NYC Parks, from [NYC OpenData](https://data.cityofnewyork.us/Environment/Forestry-Tree-Points/hn5i-inap/about_data).

In [4]:
trees = pd.read_csv(
    'https://data.cityofnewyork.us/api/views/hn5i-inap/rows.csv?accessType=DOWNLOAD',
    usecols=[
        'OBJECTID',
        'GenusSpecies',
        'DBH',
        'StumpDiameter',
        'TPStructure',
        'TPCondition',
        'Location',
        'PlantedDate'
    ]
)

In [10]:
trees.sample(10)

| | OBJECTID | DBH | TPStructure | TPCondition | StumpDiameter | GenusSpecies | PlantedDate | Location |
|---|---|---|---|---|---|---|---|---|
| 655241 | 4646880 | 6.0 | Full | Good | 0.0 | Prunus serrulata 'Green leaf' - 'Green leaf' J... | | POINT (-73.79401925236839 40.78529595669278) |
| 981470 | 12730118 | 5.0 | Full | Good | | Quercus bicolor - swamp white oak | | POINT (-74.12460855847837 40.57737064520913) |
| 955691 | 10579306 | 24.0 | Full | Good | | Liquidambar styraciflua - sweetgum | | POINT (-73.88365764364066 40.884794805717235) |
| 1061450 | 15888991 | 3.0 | Full | Excellent | | Acer tataricum 'Hot Wings' - 'Hot Wings' Tatar... | 2024-12-06 05:00:00.0000000 | POINT (-73.91958305981771 40.712999298385974) |
| 717632 | 4755294 | 2.0 | Full | Good | | Cornus mas - Cornelian cherry | | POINT (-73.86110068416039 40.85727192092965) |
| 107061 | 1414849 | 26.0 | Full | Good | | Tilia americana - American basswood | | POINT (-73.76316050418015 40.76515320014331) |
| 969666 | 11591661 | 3.0 | Full | Excellent | | Koelreuteria paniculata - goldenrain tree | 2021-05-06 04:00:00.0000000 | POINT (-74.0167136858958 40.6779190299109) |
| 356239 | 2617256 | 14.0 | Full | Fair | 0.0 | Pyrus calleryana - Callery pear | | POINT (-74.1331973707107 40.55501589395529) |
| 235669 | 2403028 | 15.0 | Full | Good | | Quercus rubra - northern red oak | | POINT (-73.9892178015829 40.59489180617588) |
| 993850 | 13449665 | 3.0 | Full | Excellent | | Nyssa sylvatica 'Wildfire' - 'Wildfire' Black gum | 2022-12-14 05:00:00.0000000 | POINT (-74.03469100785696 40.63939692775647) |


What can you infer about these data from this sample?

- What is each row?
- What type is each column?

What limitations or biases might there be in these data?

Check the data types:

In [None]:
trees.dtypes

`OBJECTID` looks like an id, not a measure. Is it unique?

In [None]:
trees['OBJECTID'].is_unique

How many trees are in these data?

In [None]:
trees['OBJECTID'].nunique()
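To see why both checks are worth running, here is a minimal sketch on a toy frame. The frame `toy` and its values are made up for illustration and are not the tree data. When an id repeats, `is_unique` is `False` and `nunique()` is smaller than the number of rows.

```python
import pandas as pd

# Made-up ids for illustration; the value 3 appears twice
toy = pd.DataFrame({'OBJECTID': [1, 2, 3, 3]})

ids_are_unique = toy['OBJECTID'].is_unique   # False, because 3 is repeated
n_rows = len(toy)                            # number of rows
n_distinct = toy['OBJECTID'].nunique()       # number of distinct ids

# Every row that shares its id with another row, for inspection
repeated = toy[toy['OBJECTID'].duplicated(keep=False)]

print(ids_are_unique, n_rows, n_distinct)
print(repeated)
```

If the ids are unique, the count of distinct ids equals the count of rows, so either number tells you how many records you have.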

`TPCondition` is the health and condition of the tree. How are trees doing?

What's the most frequent (modal) condition?

In [None]:
(
    trees['TPCondition']
    .value_counts()
    .head(1)
)

In [None]:
(
    trees['TPCondition']
    .value_counts(normalize=True) # this returns proportions, instead of counts
    .head(1)
)

Almost half the trees are rated 'Good'.
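One detail to keep in mind when reading proportions: by default `value_counts` leaves out missing values, so the shares are out of the rows that have a value. A small sketch on a made-up column (`cond` is a hypothetical name, not part of the tree data):

```python
import pandas as pd

# Toy column with one missing value
cond = pd.Series(['Good', 'Good', 'Fair', None])

counts = cond.value_counts()                  # counts of non-missing values
shares = cond.value_counts(normalize=True)    # shares of non-missing values

# Include missing values in the denominator
shares_all = cond.value_counts(normalize=True, dropna=False)

print(counts)
print(shares)
print(shares_all)
```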

And the rest?

In [17]:
trees['TPCondition'].value_counts()

TPCondition
Good         493725
Fair         293429
Dead         112475
Excellent     94465
Poor          45461
Unknown       30117
Critical       5820
Name: count, dtype: int64

This is an ordinal variable; we can assign an order to these values:

In [None]:
trees['TPCondition'] = (
    pd.Categorical(
        values=trees['TPCondition'],
        categories=[
            'Unknown',
            'Dead',
            'Critical',
            'Poor',
            'Fair',
            'Good',
            'Excellent'
        ],
        ordered=True
    )
)

trees['TPCondition'].head()

... then sort them:

In [None]:
(
    trees['TPCondition']
    .value_counts()
    .sort_index(ascending=False)
)
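Once a column is an ordered categorical, you can do more than sort it. The sketch below uses a made-up three-level scale (not the tree categories) to show `.min()`, `.max()`, and comparisons against a level. It also shows that any value missing from the `categories` list becomes a missing value, so a typo in that list silently drops data.

```python
import pandas as pd

# Toy ratings on a made-up three-level scale
ratings = pd.Series(['Good', 'Poor', 'Excellent', 'Good'])
levels = ['Poor', 'Good', 'Excellent']

ordered = pd.Series(
    pd.Categorical(values=ratings, categories=levels, ordered=True)
)

worst = ordered.min()                        # lowest level present
best = ordered.max()                         # highest level present
at_least_good = (ordered >= 'Good').sum()    # rows rated 'Good' or better
counts_in_order = ordered.value_counts().sort_index()

# 'Okay' is not in the category list, so it turns into a missing value
with_unlisted = pd.Series(
    pd.Categorical(values=['Good', 'Okay'], categories=levels, ordered=True)
)
n_lost = with_unlisted.isna().sum()

print(worst, best, at_least_good, n_lost)
print(counts_in_order)
```

Comparing the number of missing values before and after the conversion is a quick way to confirm that every original value was listed.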

What is the most common tree species?

In [None]:
(
    trees['GenusSpecies']
    .value_counts()
    .head(10)
)

In [None]:
(
    trees['GenusSpecies']
    .value_counts(normalize=True) 
    .head(10)
)

'DBH' is "diameter at breast height", a standard measure for the size of the tree.

Let's take a look at the range of sizes:

In [None]:
trees['DBH'].mean()

In [None]:
trees['DBH'].median()

In [None]:
trees['DBH'].describe()

What do these central values tell you about the typical size of trees?
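As a hint for reading the gap between the two, here is a sketch with made-up diameters (not drawn from the tree data). One very large value pulls the mean up, while the median barely notices it.

```python
import pandas as pd

# Made-up diameters: six small trees and one very large one
diameters = pd.Series([2, 3, 3, 4, 5, 6, 40])

mean_d = diameters.mean()       # 63 / 7 = 9.0
median_d = diameters.median()   # middle value of the sorted list = 4.0

print(mean_d, median_d)
```

A mean well above the median suggests a long tail of large values.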

(You might notice that these data include a datetime-type column and a geometry-type column. Those types are a bit more complex; we'll tackle them later in the course.)

### 2. NYC weather

These data show daily weather conditions measured in New York City.

In [2]:
weather = pd.read_csv('https://raw.githubusercontent.com/dlevine01/urban-data-analysis-course/refs/heads/main/Data/Source%20Data/weather_data_nyc_centralpark_2016.csv')

First, inspect this data a bit:
- what does each row represent?
- what does each column represent?

In [None]:
weather.head()

What data types does it seem like each column should be?

Check that pandas read the correct data types:

In [None]:
weather.dtypes

Uh oh! It looks like it did not.

Pandas infers data types when it reads a file. If it can't tell, or if a column has mixed types, it falls back on the catch-all 'object' type, which holds strings or a mix of strings and numbers.
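A minimal sketch of that fallback, using a tiny made-up CSV (the text in `csv_text` is an assumption for illustration, not the weather file). A single non-numeric entry is enough to keep the whole column from being read as numbers.

```python
import io
import pandas as pd

# Tiny made-up CSV: the 'rain' column has one non-numeric entry
csv_text = "day,rain\n1,0.00\n2,T\n3,0.25\n"
toy = pd.read_csv(io.StringIO(csv_text))

# 'day' is read as an integer column; 'rain' is not numeric
# (it shows as 'object', or as a string dtype in newer pandas versions)
print(toy.dtypes)

rain_is_numeric = pd.api.types.is_numeric_dtype(toy['rain'])
print(rain_is_numeric)
```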

We can check why that column did not get parsed as numeric:

In [None]:
(
    weather
    .sort_values('precipitation')
)

See something that is not a number?

Let's filter to see all the rows with letters where we expect numbers:

In [None]:
(
    weather
    [
        weather['precipitation'].str.isalpha()
    ]
)
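`.str.isalpha()` only catches entries made up entirely of letters, so it would miss something like `n/a` or `1,000`. A more general sketch, on a made-up column named `raw`, is to list every value that fails numeric conversion. It uses `pd.to_numeric` with `errors='coerce'`, which turns unparseable values into missing values.

```python
import pandas as pd

# Made-up column of text values; 'T' is the odd one out
raw = pd.Series(['0.00', 'T', '0.25', '1.1', 'T'])

# Values that cannot be parsed become missing (NaN)
as_number = pd.to_numeric(raw, errors='coerce')

# Keep the original text wherever parsing failed
offenders = raw[as_number.isna() & raw.notna()]
offender_counts = offenders.value_counts()

print(offender_counts)
```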

If you found this dataset in the wild, this is where you would check the data documentation to learn what the "T" code means. I'll save you some work: it stands for "trace", meaning precipitation fell but too little to measure, so no amount was recorded.

So now we have a choice: do we want that to be None, or zero?

When computing a mean or a median, a None won't count, but a zero will.

This is not a trivial decision!
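To make the stakes concrete, here is a sketch with four made-up days, two of which have no recorded amount. The names `as_missing` and `as_zero` are hypothetical, for illustration only.

```python
import pandas as pd

# Four made-up days; two have no recorded amount
as_missing = pd.Series([0.5, None, 0.1, None])

# The same days, with the gaps counted as zero instead
as_zero = as_missing.fillna(0)

mean_missing = as_missing.mean()   # (0.5 + 0.1) / 2 days with a value
mean_zero = as_zero.mean()         # (0.5 + 0.1) / 4 days in total

print(mean_missing, mean_zero)
```

The same data gives a mean twice as large under one choice as under the other.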

For now, we will treat the not-recorded values as None.

First, convert the column to numeric. If we don't tell pandas what to do with values it can't turn into numbers, it will raise an error:

In [None]:
pd.to_numeric(weather['precipitation'])

(If we had skipped the exploration above, this would be a good warning that something in this column is not what we expect)
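If you want to see the message without stopping the notebook, you can catch the error. This is a sketch on a made-up column, and it assumes the failure surfaces as a `ValueError`.

```python
import pandas as pd

# Made-up column with one value that is not a number
raw = pd.Series(['0.00', 'T', '0.25'])

try:
    pd.to_numeric(raw)
    error_message = None
except ValueError as err:
    # The message names the value pandas could not parse
    error_message = str(err)

print(error_message)
```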

If we tell pandas to 'coerce' the errors, it will replace values it can't turn into numbers with missing values (shown as `NaN`).

In [None]:
pd.to_numeric(weather['precipitation'], errors='coerce')

(An extra caution about coercing text to numbers: check for commas used as thousands separators in large numbers. By default, `to_numeric` won't understand those, so you need to strip the commas first.)
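A sketch of that pitfall, on made-up values: coercing directly turns the comma-formatted numbers into missing values without any warning, while removing the commas first keeps them.

```python
import pandas as pd

# Made-up text values with thousands separators
raw = pd.Series(['1,200', '35', '4,000,000'])

# Coercing directly loses every value that contains a comma
naive = pd.to_numeric(raw, errors='coerce')

# Remove the commas as plain text (not as a regex), then convert
cleaned = pd.to_numeric(raw.str.replace(',', '', regex=False))

print(naive)
print(cleaned)
```

When reading from a file, `pd.read_csv` also accepts a `thousands=','` argument that handles this at load time.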

Assign this re-cast column to a new column.

In [9]:
weather['precipitation_n'] = pd.to_numeric(weather['precipitation'], errors='coerce')

(We could also overwrite the existing column by assigning the transformation to the same name. You will often see this approach. But the downside is that it destroys your original data. If later on we find that we should have parsed this a different way, it's better if we still have the original data to refer to. Don't make invisible mistakes.)

Phew! Now our data is in the format we expect and we can start analyzing it.

How hot is the hottest day?

In [None]:
weather['maximum temperature'].max()

What's the average temperature?

In [None]:
weather['average temperature'].mean()

Is this about the same as the representative middle temperature?

In [None]:
weather['average temperature'].median()

What does the difference tell you about the skew of the data?

What's the average rainfall?

In [None]:
weather['precipitation_n'].mean()

How about the rainfall on a typical day?

In [None]:
weather['precipitation_n'].median()

What does this difference tell you?

On how many days was there any rainfall?

There's not a single built-in method for that like there is for `.mean()` or `.median()`, but you can string together a few methods:

` > 0` returns `True` if the value is greater than zero:

In [None]:
weather['precipitation_n'] > 0

You can also use the syntax `.gt(0)`:

In [None]:
weather['precipitation_n'].gt(0)

But you still want to condense this new column to a summary statistic. pandas counts `True` as 1 and `False` as 0, so the total of this column is the number of instances of `True`.

In [None]:
weather['precipitation_n'].gt(0).sum()

Because the mean is computed as the sum divided by the count, the mean of a boolean column like this is the proportion of values that are `True` (multiply it by 100 to get the percent that meet the condition).

In [None]:
weather['precipitation_n'].gt(0).mean()
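One caveat: a missing value compared with `> 0` gives `False`, so days with no recorded amount count as "not rainy" and still sit in the denominator. The sketch below uses a made-up series named `rain` to show the difference between the share of all days and the share of days with a measurement.

```python
import pandas as pd

# Five made-up days; one has no recorded amount
rain = pd.Series([0.0, 0.3, None, 1.2, 0.0])

is_rainy = rain.gt(0)              # the missing day compares as False
n_rainy = is_rainy.sum()           # number of True values

share_all = is_rainy.mean()                    # 2 rainy out of 5 days
share_measured = rain.dropna().gt(0).mean()    # 2 rainy out of 4 measured days

print(n_rainy, share_all, share_measured)
```

Which denominator is right depends on the choice made earlier about how to treat the unrecorded values.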

In [None]:
weather['maximum temperature'].mean()

In [None]:
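(
    weather['average temperature']
    # compare each day's average to the lowest daily maximum of the year
    .gt(weather['maximum temperature'].min())
    # the mean of a boolean column is the share of True values
    .mean()
)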

# Tasks:

- What portion of trees are rated as having a "Full" structure (labeled `TPStructure`)?
- How many days have a high temperature over 90 degrees?
- What is the lowest temperature recorded?
- What are the mean and median maximum temperature?
- Which are closer together, the mean and median of the maximum temperature or the minimum temperature?
- How many days did it snow?

In [None]:
### Your code here

Extra credit:

You saw above how you can assign a transformed value to a new column. You can also create a column from operations on multiple columns.

Add a new column 'temperature range' as the maximum temperature minus the minimum temperature.

What is the average of this daily temperature fluctuation?

In [None]:
### Your code here