# Data Structure

Real-world data is messy, and there's no end to the challenges you'll run into.  The good news is that there are a few ways to look at a dataset that quickly reveal what kind of information it carries, which gives you a nice head start on addressing any issues.  Regardless of the type of data you're working with, structures and hierarchies are generally present, and they convey the information required for certain types of analysis and modeling.

Let's start with unit structures and nested hierarchies contained in the data.

What do we mean by data structures?  These are the formats and levels in the data, which you can think of as how the data relates to itself.  Whenever the conversation turns to what kind of data is available for your project, you should immediately start asking about what's known as the unit, or level, of analysis so you understand how to analyze the data properly.  After that, we want to start thinking about the "shape" of our data, i.e. wide vs. long.

- Unit of Analysis
- Wide vs. Long Data

Understanding this kind of detail as early as possible is super important when we start talking to our subject matter experts about data.

**<h3>Unit of Analysis</h3>**

_Unit of analysis_, or level of analysis, is first up when we start analyzing the structure of our data.  You can think of this as the details contained in a _row_ of data.  It always made sense to me to also think of it as having some kind of subject, e.g. by Customer, by Region, or by Manufacturer.  If our data has been prefiltered down to just one Customer, then you'll only have one unit/level, but it's still the highest _object_ level that everything else in the data relates to.

```{note}
Unit of Analysis is the lowest grain of the dataset, conveying what information is recorded in a row of data.
```

For example, if we have a sales report with a column called "Customer", then I would say our unit of analysis is "Sales, by Customer".  We may also see nested hierarchies such as "by Customer, by Region".  So you might see Costco as the customer, with 3 Regions (Midwest, Northeast, and South) that all roll up to the Costco banner level.  You will also frequently find a time frame or frequency in your data, so it might be something like "by Customer, by Region, by Day".

These are all examples of the unit of analysis that map to whatever the data in your set actually is.  If it's a daily sales report, then it's "Sales, by Customer, by Region, by Day".  Each Customer-Region will have sales for each day for the overall time horizon of your data set.
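One quick way to confirm a suspected unit of analysis is to check that its columns uniquely identify each row.  The sketch below is only an illustration: the `sales` table, its column names, and its values are made up for this example, not taken from a real report.

```python
import pandas as pd

# Hypothetical daily sales report: one row per customer, region, and day
sales = pd.DataFrame({
    "customer": ["Costco"] * 6,
    "region":   ["Midwest", "Northeast", "South"] * 2,
    "day":      pd.to_datetime(["2024-01-01"] * 3 + ["2024-01-02"] * 3),
    "sales":    [120.0, 95.5, 143.2, 110.0, 101.3, 150.8],
})

# Columns we believe define the unit of analysis
grain = ["customer", "region", "day"]

# True if any combination of the grain columns appears more than once
has_duplicates = sales.duplicated(subset=grain).any()

# Number of rows recorded for each customer-region pair (one per day here)
rows_per_pair = sales.groupby(["customer", "region"]).size()

print("Duplicate rows at this grain:", has_duplicates)
print(rows_per_pair)
```

If `has_duplicates` comes back `True`, the data is probably recorded at a finer grain than you assumed, and that's a good question to take back to your subject matter experts.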

It'll be much easier to grasp if we jump in and look at some actual data.  Below we show 10 random rows from the classic Iris flower dataset.  This one is about as simple as it gets.

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns

dat = sns.load_dataset('iris')
dat.sample(10)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
134,6.1,2.6,5.6,1.4,virginica
60,5.0,2.0,3.5,1.0,versicolor
89,5.5,2.5,4.0,1.3,versicolor
53,5.5,2.3,4.0,1.3,versicolor
59,5.2,2.7,3.9,1.4,versicolor
104,6.5,3.0,5.8,2.2,virginica
68,6.2,2.2,4.5,1.5,versicolor
20,5.4,3.4,1.7,0.2,setosa
83,6.0,2.7,5.1,1.6,versicolor
45,4.8,3.0,1.4,0.3,setosa


We can see from the table above that this set is 150 observations of individual flowers, recording 4 measurements and 1 variable indicating the specific species of flower.

This one is pretty straightforward.  What do you think?  What would you say the unit of analysis is?  What does each row correspond to?

If you asked me, I'd say it's -

> Measurements, by **Flower** (each row holds the measurements for just one flower)

<br>

Ok, that was an easy one.  How about the Gasoline Consumption dataset below from the `pydataset` library?

In [2]:
from pydataset import data
dat = data("Gasoline")
dat[['country','year','lgaspcar']]

Unnamed: 0,country,year,lgaspcar
1,AUSTRIA,1960,4.173244
2,AUSTRIA,1961,4.100989
3,AUSTRIA,1962,4.073177
4,AUSTRIA,1963,4.059509
5,AUSTRIA,1964,4.037689
...,...,...,...
338,U.S.A.,1974,4.798626
339,U.S.A.,1975,4.804932
340,U.S.A.,1976,4.814891
341,U.S.A.,1977,4.811032


What do you think?  Can you see the categorical and hierarchical levels in the data?  If I was working on this project, I would say the unit of analysis for our data is

> Gas Consumption (lgaspcar), by **Country**, by **Year**

Did you come up with the same?

Maybe you're asking why understanding the unit of analysis matters.  Well, there are a number of super important reasons.  First and foremost, you need to know how to read the data so you can analyze what's being recorded.  One of the first things you might want to do is build some exploratory, descriptive summary views.  You'll have a very difficult time knowing what questions to ask and how to look at the data if you don't know how it's structured.

For the gasoline consumption example above, we can ask a few basic investigatory questions now that we understand it's "Gas Consumption, by **Country**, by **Year**".  We'll cover this in much more detail in the upcoming {doc}`../Chapter5/EDA` section, but for now see if you can follow along without too much explanation.

Below we show the "counts", i.e. the number of records for each Country by Year.

In [3]:
# Count how many records each country has for each year
dat.groupby('country')['year'].value_counts().unstack(0)

year,AUSTRIA,BELGIUM,CANADA,DENMARK,FRANCE,GERMANY,GREECE,IRELAND,ITALY,JAPAN,NETHERLA,NORWAY,SPAIN,SWEDEN,SWITZERL,TURKEY,U.K.,U.S.A.
1960,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
1961,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
1962,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
1963,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
1964,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
1965,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
1966,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
1967,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
1968,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
1969,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1


Now that we know every country has exactly one record for each year shown, what if we wanted to aggregate the data in some way?  You'd need to know what levels you want to roll your values up to.  See below where we have calculated the average gasoline consumption (lgaspcar) by Country for all of the years in the dataset.

In [4]:
# Calculate the mean gasoline consumption for each country over all of the years
agg_dat = dat.groupby(['country'])['lgaspcar'].mean().reset_index().rename({'lgaspcar': 'avg_lgaspcar'}, axis = 1)
agg_dat.sort_values(by = ['avg_lgaspcar'], ascending = False)

Unnamed: 0,country,avg_lgaspcar
15,TURKEY,5.766355
6,GREECE,4.878679
2,CANADA,4.862402
17,U.S.A.,4.819075
9,JAPAN,4.699642
14,SWITZERL,4.237586
7,IRELAND,4.22556
3,DENMARK,4.189886
11,NORWAY,4.109773
10,NETHERLA,4.080338


This dataset is pretty simple too, so no doubt you could have figured all of this out without realizing you were actually considering the unit of analysis as you went along.  Still, notice all of the things we needed to know about our data for even a simple aggregation.
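When there are more levels in the unit of analysis, the columns you group by decide which levels survive the roll-up.  Here is a minimal sketch, assuming a made-up `sales` table at the grain "Sales, by Customer, by Region, by Day" (all names and numbers are hypothetical).

```python
import pandas as pd

# Hypothetical data at the grain "Sales, by Customer, by Region, by Day"
sales = pd.DataFrame({
    "customer": ["Costco"] * 4 + ["Target"] * 4,
    "region":   ["Midwest", "Midwest", "South", "South"] * 2,
    "day":      pd.to_datetime(["2024-01-01", "2024-01-02"] * 4),
    "sales":    [100.0, 110.0, 80.0, 90.0, 50.0, 60.0, 70.0, 40.0],
})

# Roll up one level: total sales by customer and day (regions collapsed)
by_customer_day = sales.groupby(["customer", "day"], as_index=False)["sales"].sum()

# Roll up two levels: total sales by customer (regions and days collapsed)
by_customer = sales.groupby("customer", as_index=False)["sales"].sum()

print(by_customer_day)
print(by_customer)
```

Each result has its own unit of analysis: the first is "Sales, by Customer, by Day" and the second is just "Sales, by Customer".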

When our data starts to get more complicated, and I promise it will in the real world, you will absolutely want to make sure you take adequate time in the beginning scoping stages of your project to think about what levels you have so you can be better prepared to ask a lot of questions and interrogate your data properly.

**<h3>Long vs. Wide Data</h3>**

Next up we'll take a look at some of the common data _shapes_ you'll run into.

Basically the idea here is that data can be recorded in a couple of different ways.  Different structures require different operations for creating new data, aggregating, building graphs and plots, modeling, and many other data munging techniques.  None of these formats is necessarily right or wrong.  As you'll come to learn, the shape is really just a function of how the data was generated or stored, and what matters is whether that shape aligns with what we need to do with it.  If it doesn't, then you'll need to do something about it by reshaping the data somehow.  This is where we go next.

<h4>Long  Data</h4>

Let's see what we're talking about by creating a small synthetic set of a student's test scores.  The unit of analysis for the dataset would be

> Scores, by Class, by Date.

In [5]:
import pandas as pd
import numpy as np

dat = pd.DataFrame({"date": pd.to_datetime(["2024-01-01"] * 4 + ["2024-01-02"] * 4 + ["2024-01-03"] * 4),
                    "class": ["History", "Science", "Math", "Art"] * 3,
                    "score": np.random.choice(range(90, 101), 12)})
dat

Unnamed: 0,date,class,score
0,2024-01-01,History,100
1,2024-01-01,Science,94
2,2024-01-01,Math,91
3,2024-01-01,Art,90
4,2024-01-02,History,98
5,2024-01-02,Science,95
6,2024-01-02,Math,91
7,2024-01-02,Art,99
8,2024-01-03,History,94
9,2024-01-03,Science,100


Want to venture a guess as to whether this is _wide_ or _long_ data?

That's correct!  I have no doubt you said _long_ data.  Good job.

```{note}
Long data consists of one row per unit of analysis per time period.  Therefore, each subject will have multiple rows if there are multiple other levels in the unit of analysis.
```

<h4>Wide Data</h4>

So how about _wide_ data then?  I'm sure you can already imagine what it might look like.  Imagine casting all of the individual levels of a variable out into their own columns.  So each Class will now have its own column, and the value will be the Score.  Pretty simple.

```{note}
Wide data will have each subject's information spread out horizontally in a single row, with each variable in a separate column, vs. having repeated rows for all of the additional variable levels of analysis.
``` 
  
Instead of creating a new wide set to illustrate, let's introduce the topic of _pivoting_ and transform our long data we've already created by "casting" it into a wide format.

In [6]:
# Reshape our long dataset into wide
wide_dat = (dat
                .groupby(['date','class'])['score']
                .mean()
                .unstack('class')
                .rename_axis(None, axis = 'columns')
                .reset_index())
# Roughly equivalent when each date-class pair is unique:
# wide_dat = dat.pivot(index = 'date', columns = 'class', values = 'score').reset_index()
wide_dat

Unnamed: 0,date,Art,History,Math,Science
0,2024-01-01,90.0,100.0,91.0,94.0
1,2024-01-02,99.0,98.0,91.0,95.0
2,2024-01-03,95.0,94.0,93.0,100.0


Pretty straightforward, right?  You can see each row now contains just one Date record, with all of the student's scores distributed by Class within that one row.
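The reshape above averages the scores with `groupby` before unstacking, so it still works if a date-class pair shows up more than once.  The sketch below, using a made-up `scores` table with one such repeat, contrasts pandas' `pivot`, which refuses to reshape duplicates, with `pivot_table`, which aggregates them.

```python
import pandas as pd

# Hypothetical long data with a repeated date-class pair (two History scores on one day)
scores = pd.DataFrame({
    "date":  pd.to_datetime(["2024-01-01", "2024-01-01", "2024-01-01"]),
    "class": ["History", "History", "Math"],
    "score": [90, 100, 95],
})

# pivot() cannot place two values in one cell, so it raises a ValueError
try:
    scores.pivot(index="date", columns="class", values="score")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() aggregates the duplicates instead (mean here)
wide = scores.pivot_table(index="date", columns="class", values="score", aggfunc="mean")

print("pivot failed:", pivot_failed)
print(wide)
```

Which one you want depends on whether a duplicate is expected (aggregate it) or a sign that you've misread the unit of analysis (let it fail and investigate).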

So far so good.  You'll come to understand that this topic of data structures and formats can actually become super complicated and difficult to deal with.  This is why we spend so much of our time working with and massaging the data.  It often takes quite a bit of effort to get data in the right shape for the analysis we want to perform.
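Getting data into the right shape sometimes means going the other way, from wide back to long.  Here is a minimal sketch using `melt` on a small made-up table laid out like the wide scores table (the values are illustrative only).

```python
import pandas as pd

# Small wide table: one row per date, one column per class (scores are made up)
wide = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-02"]),
    "Art": [90, 99],
    "History": [100, 98],
})

# Melt the class columns back into rows: one row per date and class
long = (wide
        .melt(id_vars="date", var_name="class", value_name="score")
        .sort_values(["date", "class"])
        .reset_index(drop=True))

print(long)
```

The result is back at the unit of analysis "Scores, by Class, by Date", with one row for each date-class pair.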

Let's now move to the next section and talk about the different Types of data we can expect to run across.