# Exploring Data

Let's look into a new data set, `temperatures.csv` -- we'll see later on how this is still related to wildfires. This data was compiled from a number of queries from the NOAA (National Oceanic and Atmospheric Administration) website.

Citation:
    NOAA National Centers for Environmental information, Climate at a Glance: Regional Time Series, published October 2018, retrieved on October 18, 2018 from https://www.ncdc.noaa.gov/cag/regional/time-series

In [None]:
import babypandas as bpd
import numpy as np

temps = bpd.read_csv('../../data/temperatures.csv')

The most important task when working with a new dataset is to get familiar with the data by exploring it -- that way it will be easier for us to analyze the data in the future.

Let's start by understanding what the table actually contains. The easiest way to do this is by simply looking at the first few rows of the data.

In [None]:
temps.iloc[range(3)]

A data description explains that the table contains a Year (2000 to 2018), Month (1 to 12), Region (a climate region), and Average Temperature (the average temperature for that year/month/region in degrees Fahrenheit)... but I'm not a meteorologist, so what climate regions are we working with?

The Region column is considered to be a *categorical* feature, because its values are not numbers, and each value can only come from a finite set of choices. Since we know there is a limit to how many climate regions there are, it makes sense to use the `.unique()` method to list out all of the regions in our table.

In [None]:
temps.get('Region').unique()

Based on the names of the regions, we can deduce the general area of the U.S. that each region covers.

```{tip}
Questions about your data set should also be looked up online! Effective data exploration revolves around quickly becoming a semi-expert in the domain of the data set.
```

The rest of the columns seem pretty self-explanatory in terms of their meaning, so the next step of exploration often revolves around getting a sense of what our data 'looks' like, then eventually discovering patterns and interesting aspects of our data.

## Your first chart

There's an age-old adage that "a picture is worth a thousand words" -- the same often holds true for data science!

Throughout the exploration process, we'll be relying on charts to help us visualize and ultimately understand the data. While Python has many libraries that can create charts, Babypandas has the capability built in by using the `.plot()` method. With this method, we can create many types of charts -- including some you may be familiar with like bar charts and line charts, as well as others like histograms and scatter plots.

The general syntax to create a chart using Babypandas is:
```html
<table>.plot(kind='<chart_type>', x='<column_name>', y='<column_name>')
```

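In most charts, `x` is the column whose values are placed along the horizontal axis and `y` is the column whose values are placed along the vertical axis. Usually we're most interested in how *y* is changing, so if a type of chart only needs a single feature specified, then we specify `y`.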

Let's try one out. Below we define an example table of ice cream flavors, along with the number of cones sold and a yumminess score (out of 10) for each flavor.

In [None]:
icecream = bpd.DataFrame().assign(
    Flavor=['Chocolate', 'Vanilla', 'Strawberry'],
    Cones_Sold=[7, 5, 4],
    Yumminess=[8, 9, 2]
)
icecream

We can use a bar chart to compare the number of cones sold of each flavor. The height of each bar corresponds to the value -- so in this case, the bar for Chocolate is 7 units tall since there were 7 chocolate cones sold, while the bar for Strawberry is only 4 units tall since there were 4 strawberry cones sold.

In [None]:
icecream.plot(kind='bar', x='Flavor', y='Cones_Sold')

```{margin}
Try opening the interactive version of this page and seeing what other charts you can create by changing the `kind`, `x`, and `y` arguments.
```

You've done it! We made a chart from a table. Perhaps you're already getting some ideas for how useful these charts can be -- read on to see just how much we rely on them while exploring our data.

## Distributions

As you work with data scientists and statisticians throughout your life, you'll often hear people asking about the {dterm}`distribution` of a feature.

This generally boils down to answering three important questions about a feature:
1. What is a typical value for this feature?
2. What range of possible values would be expected?
3. Are we more likely to see high values within that range? Low values? Both high values *and* low values, but not values in the middle?

Well, the first two seem pretty approachable through metrics that we already know how to calculate -- such as using the mean of a feature to find the typical value, or calculating the min and max to get a range of possible values. But the third question seems a lot more challenging to represent just by using numbers.
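To make the difficulty with the third question concrete, here is a small sketch using two made-up arrays (they are not from `temperatures.csv`). Both have the same mean, min, and max, even though one is bunched in the middle and the other is split between the extremes.

```python
import numpy as np

# Two made-up features with the same summary numbers but different shapes.
bunched = np.array([0, 5, 5, 5, 10])   # most values sit in the middle
split = np.array([0, 0, 5, 10, 10])    # most values sit at the extremes

for values in (bunched, split):
    # Typical value (mean) and range (min, max) are identical for both.
    print(values.mean(), values.min(), values.max())
```

The three numbers agree for both arrays, so on their own they cannot tell us where within the range the values tend to fall.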

```{margin}
Technically, you *can* describe the shape using measurements such as **skewness** and **kurtosis** -- you'll learn about these in Math 189 -- but these values are considerably more challenging to *interpret* than a chart.
```

But we have the power of charts! And we can use a specific type of chart, called a histogram, to answer all three of those questions.

### Histograms

A {dterm}`histogram` is a type of chart that gives us a visual representation of the distribution of a numerical feature by expressing how many times the data shows up in a given range.

We can create a histogram of our recorded temperatures by simply specifying `kind='hist'` and setting `y` to the name of the column we're interested in. We'll also specify some additional arguments that will be explained momentarily.

In [None]:
temps.plot(kind='hist', y='Average Temperature',
           bins=np.arange(0, 100+10, 10), density=True)

You can think of a histogram as similar to a bar chart in concept. Each bar corresponds to a range of values -- for example from zero degrees to ten degrees -- and we call the range a **bin**. Then, the height of each bar corresponds to how much of the data falls into that bin.

The bins of a histogram include the lower number, but *exclude* the higher number. The one exception is the last bin, which includes both of its endpoints. We'll usually express these bins using the mathematical notation for intervals, so the bin containing temperatures from 0 degrees to 9.99999... degrees is written as $[0, 10)$.
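As a rough check of how values on a bin boundary are counted, the sketch below uses NumPy's `np.histogram` on a handful of made-up values. This assumes the plotting method bins data the same way `np.histogram` does, which is worth confirming on your own data.

```python
import numpy as np

# Made-up temperatures, including values that sit exactly on bin boundaries.
sample = np.array([0.0, 9.99, 10.0, 15.0, 20.0])
edges = np.arange(0, 30 + 10, 10)  # bin endpoints: 0, 10, 20, 30

# counts[i] is the number of values falling in the i-th bin.
counts, _ = np.histogram(sample, bins=edges)
print(counts)  # [2 2 1]
```

The value 10.0 is counted in $[10, 20)$ rather than $[0, 10)$, because each bin excludes its upper endpoint.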

All sorts of information about our data can be gleaned from just the histogram. For example, in the histogram above we can tell that the bin with the greatest frequency of values was $[60, 70)$ since that bar is the tallest, whereas no data fell in the bin $[90, 100)$, so that bar doesn't show up at all!

By imagining where the center of mass of the shape would be (if you tried to balance the shape on your finger, where along the bottom edge would you place it?), we can get an idea of where the mean of the data set is.

By looking at the extremes of our histogram, we know that our temperature data is never greater than 90, and never less than 0. But since the bars at the extremes are short, we know that most of the temperatures are actually between 20 and 80 degrees. We can also notice that the mass of the histogram is more bunched up towards the higher values than the lower values, so it's more likely for us to see a high value within our possible range than to see a low value.

### Setting the level of detail

[you'll notice that in addition to specifying the kind and y, we also set bins.]

[recall that bins are the ranges that data falls into] [how many bins ]

[can specify level of detail by either setting bins to the number of bins we want, or by setting it to an array of bin endpoints] [np.arange to set equal bins]
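The two ways of setting `bins` can be sketched with `np.histogram`, which accepts the same two forms. The `sample` array is made up for illustration.

```python
import numpy as np

# Made-up values standing in for a numerical column.
sample = np.array([12, 25, 37, 48, 52, 61, 68, 75])

# Option 1: give a number of bins. The edges are spaced evenly
# between the smallest and largest value in the data.
counts_a, edges_a = np.histogram(sample, bins=3)

# Option 2: give the bin endpoints. np.arange(0, 100 + 10, 10) yields
# 0, 10, ..., 100, which is 11 endpoints and therefore 10 equal bins.
edges_b = np.arange(0, 100 + 10, 10)
counts_b, _ = np.histogram(sample, bins=edges_b)

print(edges_a, counts_a)
print(edges_b, counts_b)
```

Passing endpoints takes a little more typing, but it keeps the bins at round numbers instead of letting them depend on the smallest and largest values in the data.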

(Do we need to discuss unequal bins (?) The only time I've even seen it brought up is on tests in this class.)

### The math behind histograms

[also specified density, otherwise it would have set the heights to the number of occurrences in each bin]

[when using density, the area of each bar is equal to the proportion of data that falls into that bin. this is done by multiplying the bin width by the height of the bar] [can still calculate count from proportion by multiplying by the number of entries in our dataset]
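The relationship between bar height, bar area, and counts can be checked numerically. This is a minimal sketch on made-up values, using `np.histogram` with `density=True` as a stand-in for the heights that the chart would draw.

```python
import numpy as np

# Made-up values standing in for a numerical column.
sample = np.array([3, 12, 15, 18, 21, 24, 27, 29, 35, 38])
edges = np.arange(0, 40 + 10, 10)  # bin endpoints: 0, 10, 20, 30, 40

counts, _ = np.histogram(sample, bins=edges)                 # occurrences per bin
heights, _ = np.histogram(sample, bins=edges, density=True)  # bar heights

widths = np.diff(edges)              # width of each bin
areas = heights * widths             # area of a bar = proportion of data in its bin
recovered = areas * len(sample)      # proportion * number of entries = count

print(areas.sum())   # the bar areas add up to 1 (up to rounding)
print(recovered)     # matches counts (up to rounding)
```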

### Categorical distributions

[When working with categorical variable, can use .groupby and .count to get the frequency of values in each category, (then divide by rows to get proportion?), then use a bar plot to visualize.] [this is often referred to as the distribution of a categorical variable]
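One way the counting step might look is sketched below. It uses plain pandas and a made-up `readings` table rather than `temps`, on the assumption that `groupby`, `count`, and `get` behave the same way in Babypandas.

```python
import pandas as pd

# Made-up table: one row per recorded reading.
readings = pd.DataFrame({
    'Region': ['West', 'South', 'West', 'Northeast', 'West', 'South'],
    'Month': [1, 1, 2, 1, 3, 2],
})

# Count the rows in each category. After grouping, every remaining
# column holds the same count, so we pick any one of them.
counts = readings.groupby('Region').count().get('Month')

# Divide by the total number of rows to turn counts into proportions.
proportions = counts / readings.shape[0]
print(proportions)
```

A bar chart of these proportions, one bar per region, would then show the distribution of the categorical feature.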

## Trends over time

[since we have a column that defines time (year), it's interesting to see how our data has changed over time.] [we can use groupby, but looking at a table doesn't make it easy to spot patterns] [another core reason to visualize is to make it easier to discover patterns]

In [None]:
temp_over_time = (
    temps.get(['Year', 'Average Temperature'])
    .groupby('Year')
    .mean()
)
temp_over_time

[not so easy to see what's going on (plus, we're working with lots of values that we need to remember in our head!)]

In [None]:
temp_over_time.plot(kind='line', y='Average Temperature')

[x defaults to index]

## Comparing values

[In general, bar charts are used to compare values] [a primary use of data visualization is to give *context* when comparing values.]

In [None]:
temps.groupby('Region').mean().plot(kind='barh', y='Average Temperature')

[easier to explore by sorting then plotting] [default behavior is to sort by the *labels*, we want to sort by the *values*]
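The difference between the two orderings can be seen on a small made-up table. The sketch uses plain pandas, assuming `groupby`, `mean`, and `sort_values` behave the same way in Babypandas. The values are invented for illustration.

```python
import pandas as pd

# Made-up readings for three regions.
region_means = pd.DataFrame({
    'Region': ['West', 'South', 'Northeast', 'West', 'South', 'Northeast'],
    'Average Temperature': [55.0, 65.0, 40.0, 57.0, 67.0, 44.0],
}).groupby('Region').mean()

# After grouping, rows are ordered by label: Northeast, South, West.
by_label = list(region_means.index)

# Sorting by the values instead orders the regions from coldest to warmest.
by_value = region_means.sort_values(by='Average Temperature')
print(by_label)
print(list(by_value.index))
```

Calling `.plot(kind='barh', y='Average Temperature')` on the sorted table would then draw the bars in order of their lengths, which makes neighboring regions easier to compare.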

## Relationships between features

[we've seen how values can change depending on time, or depending on a category (group) -- how about how values change depending on other values?]

In [None]:
# TODO: Aggregate climate data to include additional features for scatter

## Comparing charts

[Nice to overlay charts in order to make them easier to compare]

In [None]:
sw_annual = temps[temps.get('Region').str.contains('Southwest')].groupby('Year').mean()
w_annual = temps[temps.get('Region').str.contains('West')].groupby('Year').mean()
sw_annual

In [None]:
ax = sw_annual.plot(x=None, y='Average Temperature')
w_annual.plot(ax=ax, x=None, y='Average Temperature')

In [None]:
ax = sw_annual.plot(kind='hist', y='Average Temperature')
w_annual.plot(ax=ax, kind='hist', y='Average Temperature')
temps.groupby('Year').mean().plot(ax=ax, kind='hist', y='Average Temperature')

In [None]:
temps.groupby('Year').mean().plot(x=None, y='Average Temperature')