# Data Formats & Types

This section covers data formats and data types.  The data you'll run across carries structures and hierarchies we should be aware of, and information can be recorded in different ways, some of which are required for certain types of analysis and modeling.

Let's start with unit structures and nested hierarchies contained in the data.

<h3>Data Formats</h3>

What do we mean by data formats?  These are structures in the data that describe how the data relates to itself.  Two very common ones are worth thinking about as soon as you start working with project sponsors and scoping out your projects.

- Unit of Analysis
- Wide vs. Long Data

<h5>Unit of Analysis</h5>

_Unit of analysis_, or level of analysis, is generally the first thing we want to understand when we start analyzing the structure of our data.  At a minimum, I like to think of it as some kind of subject the data is organized by, e.g. by Customer, by Region, or by Manufacturer.  If our data has been prefiltered down to just one Customer, then you'll only have one unit/level, but it's still the highest _object_ level that everything else in the data relates to.

For example, if we have a sales report with a column called "Customer", then I would say our data scope is "sales, by Customer".  We may also see nested hierarchies such as "by Customer, by Region".  So you might see Costco as the Customer, with 3 Regions (Midwest, Northeast, and South) that all roll up to the Costco header level.  You will also frequently run into a time frame or frequency in your data.  So it might be something like "by Customer, by Region, by Day".
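To make the roll-up idea concrete, here is a minimal sketch using pandas.  The `sales` frame and its numbers are made up for illustration; only the Customer and Region names come from the example above.

```python
import pandas as pd

# Hypothetical sales figures; the numbers are invented for illustration
sales = pd.DataFrame({
    "Customer": ["Costco", "Costco", "Costco"],
    "Region": ["Midwest", "Northeast", "South"],
    "Sales": [120.0, 95.5, 80.0],
})

# Roll the Region level up to the Customer (header) level
by_customer = sales.groupby("Customer", as_index=False)["Sales"].sum()
print(by_customer)
```

The three Region rows collapse into a single Costco row, which is what "rolling up" to the header level means here.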

These are all examples of the unit of analysis, and they map to whatever the data in your set actually is.  If it's a daily sales report, then it's "sales, by Customer, by Region, by Day".  Each Customer-Region pair will have sales for each day in the overall time span of your data set.
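One way to check a guess about the unit of analysis is to confirm that the columns you think define it never repeat as a combination.  The sketch below assumes a small, made-up `daily` frame; the column names follow the "by Customer, by Region, by Day" example and the values are hypothetical.

```python
import pandas as pd

# Hypothetical daily sales; values are invented for illustration
daily = pd.DataFrame({
    "Customer": ["Costco"] * 4,
    "Region": ["Midwest", "Midwest", "South", "South"],
    "Day": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-01", "2024-01-02"]
    ),
    "Sales": [10.0, 12.0, 7.0, 9.0],
})

unit_cols = ["Customer", "Region", "Day"]

# If these columns really define the unit of analysis, no combination repeats
n_duplicates = int(daily.duplicated(subset=unit_cols).sum())
is_unique_grain = n_duplicates == 0
print(is_unique_grain)

# Each Customer-Region pair should have one row per day in the time span
days_per_pair = daily.groupby(["Customer", "Region"])["Day"].nunique()
print(days_per_pair)
```

If `is_unique_grain` comes back `False`, the data is either at a finer level than you assumed or contains duplicate records, and both are worth knowing before any modeling starts.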

It'll be much easier to grasp if we jump in and look at some actual data.  Below we read in the iris data set from the well-known UCI ML Data Set repository {cite}`UCI_2024`.  This classic data set is about as simple as it gets.

In [11]:
# Import libraries
from ucimlrepo import fetch_ucirepo, list_available_datasets
import pandas as pd

# Uncomment below to see which datasets are available
#list_available_datasets()

# Import Iris data (id = 53) and see what we get
dat_in = fetch_ucirepo(id = 53)
type(dat_in)

ucimlrepo.dotdict.dotdict

Ok, so `dat_in` is a `dotdict`, which behaves like a dictionary.  Good news, we know how to see what's in there.  We could just print the whole thing to the screen, but that would take up a lot of real estate.  So how about we just try looking at the `.keys()` to see if we can tell what we want?

In [12]:
print(dat_in.keys())

dict_keys(['data', 'metadata', 'variables'])


Ok nice.  We probably want what's stored in the "data" key if I had to guess.  Let's see what's in there next.

In [13]:
# Print the "data" object from the "dat_in" dictionary
dat_in['data']

{'ids': None,
 'features':      sepal length  sepal width  petal length  petal width
 0             5.1          3.5           1.4          0.2
 1             4.9          3.0           1.4          0.2
 2             4.7          3.2           1.3          0.2
 3             4.6          3.1           1.5          0.2
 4             5.0          3.6           1.4          0.2
 ..            ...          ...           ...          ...
 145           6.7          3.0           5.2          2.3
 146           6.3          2.5           5.0          1.9
 147           6.5          3.0           5.2          2.0
 148           6.2          3.4           5.4          2.3
 149           5.9          3.0           5.1          1.8
 
 [150 rows x 4 columns],
 'targets':               class
 0       Iris-setosa
 1       Iris-setosa
 2       Iris-setosa
 3       Iris-setosa
 4       Iris-setosa
 ..              ...
 145  Iris-virginica
 146  Iris-virginica
 147  Iris-virginica
 148  Iris-virgini

Hmm.  We can see the "data" object is another dictionary housing what we're actually looking for.  Looks like there's a data frame called "original" that holds the complete original data set.  That's what we want.  Go get it.
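When a nested dictionary is too long to scan by printing it, a compact summary of each key can help.  The following is a minimal sketch, not part of `ucimlrepo`: `summarize` is a hypothetical helper, and `nested` is a made-up stand-in whose keys mirror the ones discussed above.

```python
import pandas as pd

# Hypothetical stand-in for a nested dictionary such as dat_in['data'];
# the keys mirror the ones discussed above, the values are made up
nested = {
    "ids": None,
    "features": pd.DataFrame({"sepal length": [5.1, 4.9]}),
    "targets": pd.DataFrame({"class": ["Iris-setosa", "Iris-setosa"]}),
    "original": pd.DataFrame(
        {"sepal length": [5.1, 4.9], "class": ["Iris-setosa", "Iris-setosa"]}
    ),
}

def summarize(d):
    """Return one short line per key: the value's type and, if present, its shape."""
    lines = []
    for key, value in d.items():
        shape = getattr(value, "shape", None)
        suffix = f", shape={shape}" if shape is not None else ""
        lines.append(f"{key}: {type(value).__name__}{suffix}")
    return lines

for line in summarize(nested):
    print(line)
```

Each key gets a single line, so it is easy to spot which entries are data frames and how large they are.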

In [15]:
# Extract the original dataset from the "data" dictionary
dat = dat_in['data']['original'].copy()

# Replace the spaces in the column names with underscores (definitely a good habit to get into)
dat.columns = dat.columns.str.replace(' ', '_')
dat

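|     | sepal_length | sepal_width | petal_length | petal_width | class          |
|-----|--------------|-------------|--------------|-------------|----------------|
| 0   | 5.1          | 3.5         | 1.4          | 0.2         | Iris-setosa    |
| 1   | 4.9          | 3.0         | 1.4          | 0.2         | Iris-setosa    |
| 2   | 4.7          | 3.2         | 1.3          | 0.2         | Iris-setosa    |
| 3   | 4.6          | 3.1         | 1.5          | 0.2         | Iris-setosa    |
| 4   | 5.0          | 3.6         | 1.4          | 0.2         | Iris-setosa    |
| ... | ...          | ...         | ...          | ...         | ...            |
| 145 | 6.7          | 3.0         | 5.2          | 2.3         | Iris-virginica |
| 146 | 6.3          | 2.5         | 5.0          | 1.9         | Iris-virginica |
| 147 | 6.5          | 3.0         | 5.2          | 2.0         | Iris-virginica |
| 148 | 6.2          | 3.4         | 5.4          | 2.3         | Iris-virginica |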
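On the underscore habit mentioned in the code comment above, one practical payoff is that underscored names work as plain identifiers.  The sketch below uses a tiny made-up frame (`df` and its values are hypothetical) to compare filtering before and after the rename.

```python
import pandas as pd

# Tiny hypothetical frame with spaces in the column names
df = pd.DataFrame({
    "sepal length": [5.1, 4.9, 6.7],
    "petal width": [0.2, 0.2, 2.3],
})

# With spaces, query() needs backticks around the column name
before = df.query("`sepal length` > 5")

# After the rename, the column can be used as a plain identifier
df.columns = df.columns.str.replace(" ", "_")
after = df.query("sepal_length > 5")

print(before.shape, after.shape)

# Attribute-style access also works once the space is gone
print(df.sepal_length.max())
```

Both filters select the same rows; the renamed version is just easier to type and harder to get wrong.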


In [21]:
dat['class'].value_counts()

class
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: count, dtype: int64

<h5>Long Data</h5>

<h5>Wide Data</h5>

<h3>Data Types</h3>

What are the types of data we may encounter?

- Cross sectional
- Time series
- Panel
- Text

There are other, more exotic types of data to be sure, but these cover the ones you'll deal with most often.

<h5>Cross Sectional</h5>

<h5>Time Series</h5>

<h5>Panel</h5>

<h5>Text</h5>