# Insights from a column of data

We can learn a lot from exploring a single column of data.

This notebook walks through selecting columns, computing summary statistics, and interpreting the results.

In [1]:
import pandas as pd
import seaborn as sns

In [2]:
weather = pd.read_csv('https://raw.githubusercontent.com/dlevine01/urban-data-analysis-course/refs/heads/main/Data/Source%20Data/weather_data_nyc_centralpark_2016.csv')

First, inspect this data a bit:
- what does each row represent?
- what does each column represent?

In [None]:
weather.head()

What data types does it seem like each column should be?

Check that pandas read the correct data types:

In [None]:
weather.dtypes

Uh oh! It looks like it did not.

Pandas infers data types when it reads a file. If it can't tell, or if a column has mixed types, it falls back on the catch-all 'object', which holds strings or a mix of strings and numbers.

We can check why the `precipitation` column did not get parsed as numeric:
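To see this fallback in isolation, here is a minimal sketch on a made-up three-row CSV (the column names `day`, `temp`, and `rain` are hypothetical and not part of the weather file). A single letter code is enough to stop a column from being read as numeric. Depending on your pandas version, the text column may be reported as 'object' or as a dedicated string type.

```python
import io
import pandas as pd

# Hypothetical toy data: the 'rain' column has one letter code among the numbers.
csv_text = "day,temp,rain\n1,40,0.00\n2,42,T\n3,38,0.25\n"
toy = pd.read_csv(io.StringIO(csv_text))

# 'temp' is inferred as an integer column; 'rain' is not numeric because of the 'T'.
print(toy.dtypes)

# Every value in the affected column was read as text, including '0.00' and '0.25'.
print(toy['rain'].tolist())
```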

In [None]:
(
    weather
    .sort_values('precipitation')
)

See something that is not a number?

Let's filter to see all the rows with letters where we expect numbers:

In [None]:
(
    weather
    [
        weather['precipitation'].str.isalpha()
    ]
)
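Note that `.str.isalpha()` only flags values made up entirely of letters, so it would miss codes such as "N/A". As a rough alternative sketch on a hypothetical series (the values below are invented for illustration), you can ask which values fail to convert to a number. This uses `pd.to_numeric` with `errors='coerce'`, which turns anything it cannot parse into a missing value.

```python
import pandas as pd

# Hypothetical raw values, as they might appear in a text column.
raw = pd.Series(['0.10', 'T', '0', 'N/A', '1.5'])

# True where the value is present but cannot be parsed as a number.
failed = pd.to_numeric(raw, errors='coerce').isna() & raw.notna()

# The distinct problem values, in order of first appearance.
odd_values = raw[failed].unique().tolist()
print(odd_values)
```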

If you found this dataset in the wild, this is where you would go check the data documentation to find out what the "T" code means. I'll save you some work and tell you it stands for "trace": some precipitation was observed, but too little to measure, so no amount was recorded.

So now we have a choice: do we want that to be None (a missing value), or zero?

When computing a mean or a median, pandas skips a missing value (it displays these as NaN), but it counts a zero.
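A small sketch makes the difference concrete. The three readings below are invented for illustration: the same data gives a different mean depending on whether the unknown day is stored as missing or as zero.

```python
import numpy as np
import pandas as pd

# Hypothetical readings: two measured days and one day with no usable value.
as_missing = pd.Series([0.5, np.nan, 1.0])
as_zero = as_missing.fillna(0)

mean_missing = as_missing.mean()  # NaN is skipped: (0.5 + 1.0) / 2
mean_zero = as_zero.mean()        # zero is counted: (0.5 + 0 + 1.0) / 3

print(mean_missing, mean_zero)
```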

This is not a trivial decision!

For now, we will treat the "T" values as None.

First, convert the column to numeric. If we don't tell pandas what to do with values it can't turn into numbers, it will raise an error:

In [None]:
pd.to_numeric(weather['precipitation'])

(If we had skipped the exploration above, this would be a good warning that something in this column is not what we expect)

If we tell pandas to 'coerce' the errors, it will replace values it can't turn into numbers with missing values (shown as NaN).

In [None]:
pd.to_numeric(weather['precipitation'], errors='coerce')

(An extra caution about coercing text to numbers: check for commas used as thousands separators in large numbers. By default, `to_numeric` won't understand those, so you need to strip the commas first.)
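As a sketch of that caution, using a hypothetical column of counts that is not part of the weather data:

```python
import pandas as pd

# Hypothetical column of counts written with thousands separators.
counts = pd.Series(['1,200', '35', '2,000,000'])

# Coercing directly turns the comma-formatted values into NaN.
naive = pd.to_numeric(counts, errors='coerce')

# Removing the commas first lets every value parse.
cleaned = pd.to_numeric(counts.str.replace(',', '', regex=False), errors='coerce')

print(naive.tolist())
print(cleaned.tolist())
```

When the data comes from a CSV file, `pd.read_csv` also accepts a `thousands=','` argument that handles this at read time.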

Assign this re-cast column to a new column.

In [9]:
weather['precipitation_n'] = pd.to_numeric(weather['precipitation'], errors='coerce')

(We could also overwrite the existing column by assigning the transformation to the same name. You will often see this approach. But the downside is that it destroys your original data. If later on we find that we should have parsed this a different way, it's better if we still have the original data to refer to. Don't make invisible mistakes.)

Phew! Now our data is in the format we expect and we can start analyzing it.

How hot is the hottest day?

In [None]:
weather['maximum temperature'].max()

What's the average temperature?

In [None]:
weather['average temperature'].mean()

Is this about the same as the representative middle temperature?

In [None]:
weather['average temperature'].median()

What does the difference tell you about the skew of the data?

What's the average rainfall?

In [None]:
weather['precipitation_n'].mean()

How about the rainfall on a typical day?

In [None]:
weather['precipitation_n'].median()

What does this difference tell you?
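If it helps to reason about the gap, here is a sketch on invented numbers (not the Central Park data): a week of mostly dry days with one heavy storm. The storm pulls the mean upward, while the median stays at the value of the middle day.

```python
import pandas as pd

# Hypothetical daily rainfall in inches: mostly dry days plus one heavy storm.
rain = pd.Series([0.0, 0.0, 0.0, 0.0, 0.1, 0.2, 2.5])

rain_mean = rain.mean()      # pulled upward by the single storm
rain_median = rain.median()  # the middle value once the days are sorted

print(rain_mean, rain_median)
```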

On how many days was there any rainfall?

There's not a single built-in method for that like there is for `.mean()` or `.median()`, but you can string together a few methods:

` > 0` returns `True` if the value is greater than zero:

In [None]:
weather['precipitation_n'] > 0

You can also use the syntax `.gt(0)`:

In [None]:
weather['precipitation_n'].gt(0)

But you still want to condense this new column to a summary statistic. Pandas counts `True` as 1 and `False` as 0, so the sum of this column is the number of `True` values.

In [None]:
weather['precipitation_n'].gt(0).sum()

Because the mean is computed as the sum divided by the count, the mean of a boolean column like this is the proportion of values that are `True` (multiply it by 100 to get the percentage that meet the condition).

In [None]:
weather['precipitation_n'].gt(0).mean()
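One thing to keep in mind: a comparison against a missing value evaluates to `False`, so days whose precipitation was coerced to NaN still count in the denominator. The sketch below uses made-up readings to show the two possible denominators. Which one is appropriate depends on the question being asked.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with one missing value.
rain = pd.Series([0.0, 0.3, np.nan, 1.2])

wet = rain.gt(0)                           # the NaN row compares as False
share_all_days = wet.mean()                # 2 wet days out of 4 rows
share_measured = wet[rain.notna()].mean()  # 2 wet days out of 3 measured rows

print(share_all_days, share_measured)
```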

In [None]:
weather['maximum temperature'].mean()

In [None]:
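(
    weather['average temperature']
    # True where the day's average exceeds the lowest daily maximum
    .gt(weather['maximum temperature'].min())
    # proportion of days that meet the condition
    .mean()
)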

# Tasks:

- How many days have a high temperature over 90 degrees?
- What is the lowest temperature recorded?
- What are the mean and median of the maximum temperature?
- Which are closer together, the mean and median of the maximum temperature or the minimum temperature?
- How many days did it snow?

In [None]:
### Your code here

Extra credit:

You saw above how you can assign a transformed value to a new column. You can also create a column from operations on multiple columns.
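As a sketch of the general pattern, here is a toy table that has nothing to do with the weather data (the `orders` frame and its column names are hypothetical). Arithmetic between two columns is applied row by row, and the result can be assigned to a new column name.

```python
import pandas as pd

# Hypothetical toy table, unrelated to the weather data.
orders = pd.DataFrame({'unit price': [2.5, 4.0], 'quantity': [4, 3]})

# Multiply the two columns row by row and store the result as a new column.
orders['total'] = orders['unit price'] * orders['quantity']

print(orders)
```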

Add a new column 'temperature range' as the maximum temperature minus the minimum temperature.

What is the average of this daily temperature fluctuation?

In [None]:
### Your code here