# Exploring Data

Let's look into a new data set, `temperatures.csv` -- we'll see later on how this is still related to wildfires. This data was compiled from a number of queries from the NOAA (National Oceanic and Atmospheric Administration) website.

Citation:
    NOAA National Centers for Environmental information, Climate at a Glance: Regional Time Series, published October 2018, retrieved on October 18, 2018 from https://www.ncdc.noaa.gov/cag/regional/time-series

In [None]:
import babypandas as bpd
import numpy as np

temps = bpd.read_csv('../../data/temperatures.csv')

The most important task when working with a new dataset is to get familiar with the data by exploring it -- that way it will be easier for us to analyze the data in the future.

Let's start by understanding what the table actually contains. The easiest way to do this is by simply looking at the first few rows of the data.

In [None]:
temps.iloc[range(3)]

A data description explains that the table contains a Year (2000 to 2018), Month (1 to 12), Region (a climate region), and Average Temperature (the average temperature for that year/month/region in degrees Fahrenheit)... but I'm not a meteorologist, so what climate regions are we working with?

The Region column is considered to be a *categorical* feature, because its values are not numbers, and each value can only come from a finite set of choices. Since we know there is a limit to how many climate regions there are, it makes sense to use the `.unique()` method to list out all of the regions in our table.

In [None]:
temps.get('Region').unique()

Based on the names of the regions, we can deduce the general area of the U.S. that each region covers.

```{tip}
Questions about your data set should also be looked up online! Effective data exploration revolves around quickly becoming a semi-expert in the domain of the data set.
```

The rest of the columns seem pretty self-explanatory in terms of their meaning, so the next step of exploration often revolves around getting a sense of what our data 'looks' like, then eventually discovering patterns and interesting aspects of our data.

## Introduction to proportions

One quick way to explore our data is to see how much of the data satisfies a certain condition.

Say we are curious -- for the sake of exploration -- about how many of the average temperatures were at or below the freezing point (32 degrees Fahrenheit). We know how to perform a query to only get the rows where this is true.

In [None]:
freezing = temps[temps.get('Average Temperature') <= 32]

And we can use `.shape[0]` to find out how many rows are in this selection.

In [None]:
freezing.shape[0]

That doesn't seem so useful. Is 270 big or small? Is it closer to half of our data set, or less than a single percent?

A raw count, like 270, isn't insightful unless we know what we're comparing it against!

Fortunately, we can *make* it insightful by dividing it by the number of rows in our full data set.

In [None]:
prop_freezing = freezing.shape[0] / temps.shape[0]
prop_freezing

The result is called a {dterm}`proportion`. A proportion expresses the fraction of the total that satisfies a condition -- and proportions really do come from fractions!

When someone asks for the "proportion of ...", they're wondering what fraction of data satisfies the condition they're about to propose.

For example, if we had four data points and only one of them was below freezing, we'd consider the *proportion of points below freezing* to be $\frac{1}{4}$ or $0.25$.

Notice that you can turn this into a percent by just multiplying it by 100. So out of our full data set, roughly $0.133 \times 100\% = 13.3\%$ of the recorded temperatures were at or below freezing (brrrrr).
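The four-point example can be checked directly. This is a minimal sketch using NumPy on made-up temperatures (the array `example_temps` is an assumption for illustration, not data from `temperatures.csv`), and it uses the same "at or below 32" condition as the query above.

```python
import numpy as np

# Four made-up temperatures (degrees Fahrenheit); exactly one is at or below freezing.
example_temps = np.array([28.0, 41.5, 55.0, 73.2])

# Comparing an array to a number gives an array of True/False values.
is_freezing = example_temps <= 32

# Count the True values, then divide by the total number of points.
num_freezing = np.count_nonzero(is_freezing)
prop = num_freezing / len(example_temps)

# Multiply by 100 to express the proportion as a percent.
percent = prop * 100

print(prop)     # 0.25
print(percent)  # 25.0
```

The count (1) only becomes meaningful once it is divided by the total (4).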

What proportion of recorded temperatures do you think were above freezing?

In [None]:
not_freezing = temps[temps.get('Average Temperature') > 32]
prop_not_freezing = not_freezing.shape[0] / temps.shape[0]
prop_not_freezing

The final important observation we need to make is that the proportion of points that *do* satisfy a condition and the proportion of points that *don't* satisfy that condition will add up to one. And, since we're just calculating the size of a subset of our data, proportions will *always* be between zero and one.

```{margin}
...unless there's one of those pesky floating point errors! Then they might add up to slightly more or less than one.
```

In [None]:
prop_freezing + prop_not_freezing
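Because of those floating point errors, it is usually safer to compare a sum of proportions to one with a small tolerance rather than with `==`. Here is a small sketch with made-up counts (the numbers are hypothetical), using `math.isclose` from the standard library.

```python
import math

total = 10
num_satisfy = 3                     # made-up count of rows that meet a condition
num_not = total - num_satisfy       # every other row does not meet it

prop_satisfy = num_satisfy / total
prop_not = num_not / total
total_prop = prop_satisfy + prop_not

# Compare with a tolerance, since floats are only approximations.
sums_to_one = math.isclose(total_prop, 1.0)
print(sums_to_one)                  # True

# A classic example of floating point rounding.
classic = 0.1 + 0.2
print(classic == 0.3)               # False
print(math.isclose(classic, 0.3))   # True
```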

That's all on proportions for now, but don't worry! We'll work with proportions again in the future.

## Your first chart

There's an age-old adage that "a picture is worth a thousand words" -- the same often holds true for data science!

Throughout the exploration process, we'll be relying on charts to help us visualize and ultimately understand the data. While Python has many libraries that can create charts, Babypandas has the capability built in by using the `.plot()` method. With this method, we can create many types of charts -- including some you may be familiar with like bar charts and line charts, as well as others like histograms and scatter plots.

The general syntax to create a chart using Babypandas is:
```html
<table>.plot(kind='<chart_type>', x='<column_name>', y='<column_name>')
```

If the plot uses two features, then *x* will be on the horizontal axis and *y* will be on the vertical axis -- we'll see this momentarily. If the chart only needs a single feature, then only *y* needs to be specified.

To actually understand what's going on, let's try making our first chart. Below we define an example table of ice cream cones, along with the number of cones sold and a yumminess score (out of 10) for each flavor.

In [None]:
icecream = bpd.DataFrame().assign(
    Flavor=['Chocolate', 'Vanilla', 'Strawberry'],
    Cones_Sold=[7, 5, 4],
    Yumminess=[8, 9, 2]
)
icecream

We can use a {dterm}`bar chart` to compare the number of cones sold of each flavor. Since we're interested in seeing how the number of cones sold changes as a result of the flavor changing, we specify *x* as `Flavor`, and *y* as `Cones_Sold`.

In the resulting chart, each flavor will have its own bar, and the height of the bar corresponds to the number of cones sold -- so in this case, the bar for Chocolate is 7 units tall since there were 7 chocolate cones sold, but the bar for Strawberry is only 4 units tall since there were 4 strawberry cones sold.

```{margin}
Try opening the interactive version of this page and seeing what other charts you can create by changing the `kind`, `x`, and `y` arguments.
```

In [None]:
icecream.plot(kind='bar', x='Flavor', y='Cones_Sold')

You've done it! We made a chart from a table. Perhaps you're already getting some ideas for how useful these charts can be -- read on to see just how much we rely on them while exploring our data.

## Distributions

As you work with data scientists and statisticians throughout your life, you'll often hear people asking about the {dterm}`distribution` of a feature.

This generally boils down to answering three important questions about a feature:
1. What is a typical value for this feature?
2. What range of possible values would be expected?
3. Are we more likely to see high values within that range? Low values? Both high values *and* low values, but not values in the middle?

```{margin}
Technically, you *can* describe the shape using measurements such as **skewness** and **kurtosis** -- you'll learn about these in Math 189 -- however, it's considerably more challenging to *interpret* these values than it is to read a chart.
```

Well, the first two seem pretty approachable through metrics that we already know how to calculate -- such as using the mean of a feature to find the typical value, or calculating the min and max to get a range of possible values. But the third question seems a lot more challenging to represent just by using numbers.
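As a quick illustration of the first two questions, here is a sketch that computes a mean, minimum, and maximum with NumPy. The array `sample` holds made-up values chosen for illustration; it is not taken from `temperatures.csv`.

```python
import numpy as np

# Made-up monthly average temperatures (degrees Fahrenheit).
sample = np.array([31.0, 38.5, 47.0, 58.5, 66.0, 74.5,
                   79.0, 77.5, 69.0, 57.0, 44.0, 34.0])

typical = sample.mean()   # question 1: a typical value
lowest = sample.min()     # question 2: the low end of the range
highest = sample.max()    # question 2: the high end of the range

print(typical, lowest, highest)
```

These three numbers say nothing about whether the values cluster near the top, the bottom, or both ends of the range, which is exactly what the third question asks.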

But we have the power of charts! And we can use a specific type of chart, called a histogram, to answer all three of those questions.

### Histograms

A {dterm}`histogram` is a type of chart that gives us a visual representation of the distribution of a numerical feature by expressing how many times the data shows up in a given range.

We can create a histogram of our recorded temperatures by simply specifying `kind='hist'` and setting `y` to the name of the column we're interested in. This example also specifies two additional arguments that may seem mysterious, but soon you'll know what they do!

In [None]:
temps.plot(kind='hist', y='Average Temperature',
           bins=np.arange(0, 100+10, 10), density=True)

You can think of a histogram as similar to a bar chart in concept. Each bar corresponds to a range of values called a **bin** -- for example, at the very left of the chart there is a bin from zero degrees to ten degrees. Then, the height of each bar corresponds to how much of the data falls into that bin.

Important to note: the bins of a histogram will always include the lower number, but *exclude* the higher number. Also, we'll usually express these bins using the mathematical notation for intervals so the bin containing temperatures from 0 degrees to 9.99999... degrees is written as $[0, 10)$.
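To see the edge rule in action, here is a small sketch using `np.histogram`, which counts how many values land in each bin. The values are made up so that several of them sit exactly on a bin edge; this shows NumPy's behavior and assumes the plotting method bins data the same way.

```python
import numpy as np

# Made-up values chosen to sit on or near bin edges.
values = np.array([0, 9.99, 10, 19.5, 20])

# Bin edges 0, 10, 20, 30 define three bins.
edges = np.arange(0, 30 + 10, 10)

counts, _ = np.histogram(values, bins=edges)
print(counts)   # [2 2 1]
```

The value 10 is counted in $[10, 20)$, not in $[0, 10)$, and 20 is counted in the next bin up. One caveat from NumPy's documentation: the right-most bin is the exception, since it also includes its upper edge.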

![histograms](../images/histogram-annotation.jpg)

All sorts of information about our data can be gleaned from just the histogram. For example, in the histogram above we can tell that the bin with the greatest frequency of values was $[60, 70)$ since that bar is the highest, whereas no data fell in the bin $[90, 100)$ since that bar doesn't show up at all! By thinking about where the balance point of the shape is, we can estimate the mean of the feature. By looking at the extremes of our histogram we know that our temperature data is never greater than 90, and never less than 0. But, since the bars at the extremes are short, we know that it's more likely for us to find temperatures between 20 and 80 degrees. And, we can notice that the mass of the histogram is bunched up towards the higher end but stretched out on the low end, so it's more likely for us to see a high value but every once in a while we might find a *very* low value.

### Setting the level of detail

You'll notice that in addition to specifying `kind` and `y`, we also set `bins`.

Recall that bins are the ranges that the data falls into.

We can specify the level of detail either by setting `bins` to the number of bins we want, or by setting it to an array of bin endpoints. Using `np.arange` is a convenient way to create equal bins of a set width.
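The two ways of specifying bins can be compared with `np.histogram`, which accepts the same two forms. This is a sketch on made-up values (the array `values` is hypothetical), intended only to show how the endpoints are chosen in each case.

```python
import numpy as np

# Made-up temperatures for illustration.
values = np.array([12.0, 35.5, 48.0, 61.0, 64.5, 67.0, 72.5, 88.0])

# Option 1: an array of bin endpoints, evenly spaced 10 degrees apart.
# The stop value is 100 + 10 because np.arange leaves out the stop itself.
edges = np.arange(0, 100 + 10, 10)
counts_by_edges, _ = np.histogram(values, bins=edges)

# Option 2: just the number of bins. The endpoints are then spread evenly
# between the smallest and largest values in the data.
counts_by_number, auto_edges = np.histogram(values, bins=4)

print(edges)             # [  0  10  20  30  40  50  60  70  80  90 100]
print(counts_by_edges)   # [0 1 0 1 1 0 3 1 1 0]
print(auto_edges)
```

Passing endpoints gives bins with round, predictable boundaries, while passing a number lets the data decide where the bins start and end.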

```{margin}
Notice how the shape changes as we decrease our bins from one-hundred to two.

With too many bins it looks spikey and is hard to find a smooth pattern. With too few bins our view becomes overly simplified and we risk overlooking patterns.
```

![The level of detail decreases, but it becomes easier to see trends](../images/histogram-detail.gif)

### The math behind histograms

Often, we use histograms to quickly get an intuition about the *shape* of our data. However, if we study the *numbers* a bit more closely, we can extract important information about the proportion of data falling in certain ranges. We just need to make sure we set the *density* argument to `True` when we call the plotting method.

Remember that proportions are almost always more insightful than raw counts. Unfortunately, the default behavior of the plotting method is to plot a histogram of counts (yuck!).

In [None]:
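# Two histograms: one with counts, one with density.
# Caption: The shapes are the same, but the counts aren't so useful when we
# don't know what the total is!
bins = np.arange(0, 100+10, 10)
temps.plot(kind='hist', y='Average Temperature', bins=bins)
temps.plot(kind='hist', y='Average Temperature', bins=bins, density=True)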

When we set the *density* argument to `True`, however, something magical happens: the proportion of data falling in a given bin is equal to the *area* of that bin's bar. To find the area, multiply the width of the bin by the height of the bar. And if we really must find a count, we can still calculate it from the proportion by multiplying by the total number of entries in our data set.

If we want to find the proportion of data falling in multiple bins, we just add the areas of those bins.

This also means that the total area of the histogram sums to 1 -- the proportion of values falling between the minimum and maximum values is 1!
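The area rule can be verified numerically. The sketch below uses `np.histogram` with `density=True` on made-up values (the array `values` is hypothetical) and assumes the plotting method computes bar heights the same way.

```python
import numpy as np

# Made-up temperatures for illustration.
values = np.array([12.0, 35.5, 48.0, 61.0, 64.5, 67.0, 72.5, 88.0])
edges = np.arange(0, 100 + 10, 10)

# With density=True, the heights are scaled so that the areas are proportions.
heights, _ = np.histogram(values, bins=edges, density=True)
widths = np.diff(edges)      # every bin here is 10 degrees wide
areas = heights * widths     # area of each bar = proportion in that bin

# Proportion in [60, 70): 3 of the 8 values, so about 0.375.
prop_60s = areas[6]

# Proportion in [60, 80): add the areas of the two bins.
prop_60_to_80 = areas[6] + areas[7]

# Recover a count by multiplying the proportion by the number of values.
count_60s = prop_60s * len(values)

print(prop_60s, prop_60_to_80, count_60s, areas.sum())
```

Note that the bar *heights* are not proportions themselves; only height times width is.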

### Categorical distributions

If we're working with a categorical variable it's important to look at the frequency of the categories. For this, we can use a bar chart in conjunction with `.groupby` and `.count`.

The resulting chart is no longer a histogram since we're no longer working with numerical bins, but the end result is similar: we get a sense of the frequency of different values of the feature. And, similar to the histogram, it is preferable to work with proportions instead of raw counts. We accomplish this by performing element-wise division on our counts by the total number of rows in our data.

In [None]:
region_counts = temps.get(['Region', 'Year']).groupby('Region').count()

# We can assign a new column as the proportion of each region
region_props = (
    region_counts
    .assign(proportion=region_counts.get('Year') / temps.shape[0])
    .drop(columns=['Year'])
)

region_props.plot(kind='bar', y='proportion')

Unsurprisingly, the proportion of data is equal for every single climate region. That's what we expect in this situation (each climate region has a measurement for every month on record from 2000 to 2018). Since we know that we should expect all of the climate regions to have the same proportion of measurements, if the chart showed us that the frequency *wasn't* equal then we'd know something was wrong with our data!
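The same sanity check can be done without a chart. Here is a sketch using `collections.Counter` from the standard library on a tiny made-up list of region labels (the list is illustrative and much smaller than the real column).

```python
from collections import Counter

# Made-up region labels, one per row of a tiny table.
regions = ['South', 'West', 'South', 'Northeast', 'West', 'Northeast']

counts = Counter(regions)
total = len(regions)

# Each category's count divided by the total number of rows.
props = {region: count / total for region, count in counts.items()}

# If every region should be measured equally often, all counts should match.
all_equal = len(set(counts.values())) == 1

print(props)
print(all_equal)   # True
```

If `all_equal` came back `False` on data that should be balanced, that would be a cue to look for missing or duplicated rows.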

## Trends over time

Since our new data set contains a column that progresses through time (the Year), it's possible for us to discover trends in our data. For example, how has average temperature changed over time?

We can group by the year and aggregate using the mean, but once again it's challenging to spot trends in a table.

In [None]:
temp_over_time = (
    temps.get(['Year', 'Average Temperature'])
    .groupby('Year')
    .mean()
)
temp_over_time

The human brain is hardwired to detect patterns, but in order to see a pattern amongst numbers, we need to keep all of the numbers in our head. Visualizations like {dterm}`line charts` address this issue!

Once again, the column that we want to be the horizontal axis is our table index, so we set *x* to `None` (or don't specify it).

In [None]:
temp_over_time.plot(kind='line', x=None, y='Average Temperature')

A word of caution: notice that the chart boundaries shrink to fit the data as closely as possible. While this can be nice because it allows us to focus on the changes in the data, it can also be very misleading! For example, the average temperature from 2000 to 2005 never varied by more than a degree, but the line plot still looks very dramatic!

```{margin}
This is another situation where *domain knowledge* is important. How important is a single degree of change in average temperature? Probably not anywhere near as critical as this plot makes it look!
```

In [None]:
temp_over_time[temp_over_time.index <= 2005].plot(y='Average Temperature')
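One way to guard against being misled by a tightly cropped axis is to check the actual size of the swing in numbers. The sketch below uses made-up yearly averages (the array `yearly_avg` is hypothetical, not computed from `temperatures.csv`).

```python
import numpy as np

# Made-up yearly averages (degrees Fahrenheit) for 2000 through 2005.
years = np.arange(2000, 2006)
yearly_avg = np.array([53.1, 53.6, 53.2, 53.3, 53.0, 53.5])

# The full swing between the warmest and coolest year, in degrees.
spread = yearly_avg.max() - yearly_avg.min()

print(round(spread, 2))   # 0.6
```

A swing of well under one degree looks far less dramatic as a number than it does as a zoomed-in line. If the plotting method returns a Matplotlib axes object, as the `ax=` usage later in this chapter suggests, calling `set_ylim` on it would be one way to widen the vertical range, though that is an assumption about the return value.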

## Comparing groups

We've seen a bar chart as the very first example of a chart in this chapter, and we've seen it again to compare the proportions of categories. In general, {dterm}`bar charts` are used to give *context* when comparing multiple values between groups.

For any value that can be calculated on a group -- be it the mean, minimum, count, or other metric -- the bar chart lets us easily spot the groups with the highest values, and highlights disparities between groups.

Let's see what the *minimum* reported temperature is for each climate region.

In [None]:
min_temp_by_region = (
    temps.get(['Region', 'Average Temperature'])
    .groupby('Region')
    .min()
)
min_temp_by_region.plot(kind='bar', y='Average Temperature')

The bar chart makes it instantly apparent that there's a large disparity between the minimum temperatures in the regions. We can pick out the regions with the lowest and highest minimum temperature... but it would be better if the labels weren't so hard to read.

There are actually two changes we want to make to this chart. First, since it's challenging to read the labels when they're vertical, we can make the whole chart horizontal by setting *kind* to `'barh'`. Second, we can make it easier to find patterns in the ranking of the regions by first sorting our table, then plotting.

In [None]:
(
    min_temp_by_region
    .sort_values(by='Average Temperature', ascending=False)
    .plot(kind='barh', y='Average Temperature')
)

Now that the bar chart is sorted based on the minimum temperature -- essentially ranking the climate regions -- it becomes much easier for us to spot additional patterns. For instance, did you notice that all of the regions that contain 'South' have some of the highest minimum temperatures? Also, there's a really big drop between the minimum temperature in the Ohio Valley region and the Northern Rockies and Plains region -- looks like 10 degrees! I wonder why that is.
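Sorting also makes it possible to find the biggest drop between neighbors in the ranking by calculation instead of by eye. This is a sketch in plain Python on made-up values; the region names and temperatures are placeholders, not results from the data.

```python
# Made-up minimum temperatures by region (degrees Fahrenheit).
min_temps = {
    'Region A': 41.0,
    'Region B': 12.5,
    'Region C': 29.0,
    'Region D': 25.5,
}

# Rank the regions from highest minimum to lowest, as in the sorted chart.
ranked = sorted(min_temps.items(), key=lambda pair: pair[1], reverse=True)

# Measure the drop between each region and the next one down the ranking.
drops = [
    (upper[0], lower[0], upper[1] - lower[1])
    for upper, lower in zip(ranked, ranked[1:])
]

# The largest drop marks the biggest gap between neighbors.
biggest = max(drops, key=lambda d: d[2])
print(biggest)   # ('Region D', 'Region B', 13.0)
```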

## Relationships between features

We've seen how values can change depending on time, or depending on a category (group) -- but how do values change depending on other values?

In [None]:
# TODO: Aggregate climate data to include additional features for scatter

In statistics, if we see a trend in the scatter plot between two features, we consider there to be an {dterm}`association` between the features.

## Comparing charts

It's often helpful to overlay charts on the same axes so that they're easier to compare.

In [None]:
sw_annual = temps[temps.get('Region').str.contains('Southwest')].groupby('Year').mean()
w_annual = temps[temps.get('Region').str.contains('West')].groupby('Year').mean()
sw_annual
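The two selections above depend on how text matching treats capital letters. Assuming `.str.contains` behaves like the pandas method of the same name, it looks for the pattern anywhere inside each value and is case-sensitive by default. The sketch below mimics that with Python's `in` operator on an assumed list of region names; check `.unique()` for the real ones.

```python
# Region names assumed for illustration.
region_names = ['Southwest', 'West', 'Northwest', 'South']

# Case-sensitive substring tests, similar in spirit to .str.contains.
matches_southwest = [name for name in region_names if 'Southwest' in name]
matches_west = [name for name in region_names if 'West' in name]
matches_lower_west = [name for name in region_names if 'west' in name]

print(matches_southwest)    # ['Southwest']
print(matches_west)         # ['West']
print(matches_lower_west)   # ['Southwest', 'Northwest']
```

With these names, searching for `'West'` picks out only the West region because the `w` in `Southwest` and `Northwest` is lowercase. A lowercase pattern would silently pull in other regions, so it is worth checking which rows a text filter actually selects.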

In [None]:
ax = sw_annual.plot(x=None, y='Average Temperature')
w_annual.plot(ax=ax, x=None, y='Average Temperature')

In [None]:
ax = sw_annual.plot(kind='hist', y='Average Temperature')
w_annual.plot(ax=ax, kind='hist', y='Average Temperature')
temps.groupby('Year').mean().plot(ax=ax, kind='hist', y='Average Temperature')

In [None]:
temps.groupby('Year').mean().plot(x=None, y='Average Temperature')