# The Many Shapes of Data

Data rarely shows up in the neat spreadsheet we hope for. It arrives **messy, uneven, and unexpected**: a mix of numbers, text, images, timestamps, social connections, and whatever else the world records. Before we can analyze anything, we must learn to recognize the shape that data takes.

Engineers sometimes describe data as “raw.” That sounds intimidating, but it really means the information hasn’t been processed yet. It is potential, waiting to become evidence, insight, or sometimes, just noise.

### What Counts as Data?

If you sit in a coffee shop for ten minutes and simply observe, you’ll see data everywhere: prices on a menu, Wi-Fi signal strength, the time between notifications on your phone, the music playlist, the number of people wearing headphones, and the language of their conversations. All of it is data, even if nobody wrote it down.

In computing and data science, we usually store data as numbers, text, images, audio, or combinations of these. But the real distinction isn’t the material; it’s the structure.

> **What is Data?** Data is raw information: facts, measurements, observations, or descriptions that can be stored, transmitted, and analyzed. In data science, data refers to representations of information that can be digitized and manipulated by computational systems. It can take many forms, including numerical, categorical, text, image, audio, temporal, or relational.

>> Depending on how it is organized, data may be structured, semi-structured, or unstructured.

> Scientists, analysts, and businesses collect data because the patterns hidden within it help them understand behavior, make decisions, and predict what comes next.

>> **Remember:** data becomes valuable only after it is interpreted and transformed into knowledge or used to answer meaningful questions.

At its simplest, data is recorded information about the world. It might be a number, a word, a photograph, a sound, or a sensor reading. On its own, data does not explain or prove anything; it simply captures what exists or what has happened.

Data becomes useful when we organize it, analyze it, and interpret it in context. A single heart rate reading means very little; a series of readings across time can reveal stress, fitness, or disease. Data does not guarantee truth; it offers evidence.





# How Do We Describe or Measure Data?

Before we think about how data is stored or structured, it is useful to ask a simpler question: *what kind of information are we dealing with?* Not all data behaves the same way. Some can be counted, some can be measured with precision, and some can only be described or categorized.

Broadly, based on what the data represents, we can divide it into two classical types:

- **Quantitative data**, which expresses numerical quantities (how much, how many, how fast)
- **Qualitative data**, which expresses categories, labels, or descriptive attributes (what kind, which type)

### **Quantitative Data**

Quantitative data expresses **quantities**. It tells us *how much*, *how many*, or *how fast*. This kind of data can be measured numerically and analyzed mathematically. We can add it, average it, compare it, and track how it changes over time.

Quantitative data often comes in two forms:

- **Discrete:** countable values (e.g., number of laptops, number of cars, number of students)
- **Continuous:** measurable values that vary smoothly (e.g., height, weight, temperature, time)

Discrete values make jumps (from 1 to 2 to 3), while continuous values can take any value within a range.

### **Qualitative Data**

Qualitative data describes **qualities, categories, or labels**. Instead of telling us *how much* of something exists, it tells us *what kind* it is.

Examples include hair color, species, survey responses, movie genres, or customer satisfaction levels. Qualitative data can be grouped into categories, but arithmetic operations like averaging or subtracting do not make sense here.
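A small Python sketch can make the distinction concrete. The records and field names below (`laptops`, `height_cm`, `hair_color`) are invented for illustration. The point is only that quantitative fields support arithmetic, while qualitative fields are summarized by counting categories.

```python
from collections import Counter
from statistics import mean

# Hypothetical records: one discrete, one continuous, one qualitative field.
people = [
    {"laptops": 1, "height_cm": 162.5, "hair_color": "black"},
    {"laptops": 2, "height_cm": 175.0, "hair_color": "brown"},
    {"laptops": 1, "height_cm": 180.3, "hair_color": "black"},
]

# Quantitative data supports arithmetic such as averaging.
avg_laptops = mean(p["laptops"] for p in people)    # discrete counts
avg_height = mean(p["height_cm"] for p in people)   # continuous measurements

# Qualitative data is summarized by counting how often each category occurs.
hair_counts = Counter(p["hair_color"] for p in people)

print(avg_laptops, round(avg_height, 1), hair_counts.most_common(1))
```

Asking for the "average hair color" has no meaningful answer, which is why the sketch counts categories instead.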

<center>
  <img src="https://static.wixstatic.com/media/89aacd_282be84a1c254a1ebb1e25e3e2613dfd~mv2.png/v1/fill/w_980,h_513,al_c,q_90,usm_0.66_1.00_0.01,enc_avif,quality_auto/89aacd_282be84a1c254a1ebb1e25e3e2613dfd~mv2.png" width="600">
  <br>
  <em><strong>Figure 1.</strong> Classical categorization of data types. Quantitative data may be discrete or continuous, while qualitative data describes categorical properties. Source: International Journal of Neurolinguistics & Gestalt Psychology.</em>
</center>

### **Why This Matters in Data Science**

These distinctions shape how we visualize, summarize, and model data. For example:

- Heights (**continuous**) may be plotted on a histogram
- Survey answers (**qualitative**) may be shown as bar charts
- Counts (**discrete**) may be modeled with Poisson or binomial distributions
- Categorical labels may be encoded for machine learning (e.g., **one-hot encoding**)

Knowing *what kind* of data we have is often the first step in deciding *what tools* to use.
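As one illustration of the last bullet, here is a minimal sketch of one-hot encoding in plain Python. The `one_hot` helper and the genre labels are hypothetical; in practice a library routine would usually do this, but the idea is the same: each category becomes its own 0/1 column.

```python
def one_hot(labels):
    """Return (categories, rows), where each row is a 0/1 indicator vector."""
    categories = sorted(set(labels))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for label in labels:
        row = [0] * len(categories)
        row[index[label]] = 1  # mark the column that matches this label
        rows.append(row)
    return categories, rows

genres = ["comedy", "drama", "comedy", "horror"]  # hypothetical labels
categories, encoded = one_hot(genres)

print(categories)
for label, row in zip(genres, encoded):
    print(label, row)
```

Each row contains exactly one `1`, so the encoding carries the label without implying any order or distance between categories.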


# How Is Data Organized or Stored?

So far, we have described data based on what it represents (quantitative vs. qualitative). Data science also cares about a different and equally important dimension: **how the data is organized for computation**. The structure of data influences how easily it can be stored, queried, cleaned, and analyzed.

Unlike quantitative/qualitative categories (which describe meaning), structural categories describe the **shape and format** of data.

In data science, the key difference isn’t what the data *is* made of (numbers, pixels, words), but how it is *organized*. In practice, we encounter three major structural families:

1. **Structured data**: rigid, organized, predictable  
2. **Semi-structured data**: flexible, loosely organized  
3. **Unstructured data**: no predefined schema; does not fit neatly into tables (text, images, audio, video)  



<center>
  <img src="https://skyvia.com/learn/images/scheme-3.png" width="550">
  <br>
  <img src="https://media.licdn.com/dms/image/v2/D5622AQHyT5xvgFUu1w/feedshare-shrink_800/B56Zicwt_ZHkAg-/0/1754976676754?e=2147483647&v=beta&t=SuiwEQUB8thRo8tscXW0QILLc_7rUSagTEndFqXU-qI" width="500">
  <br>
  <em><strong>Figure 2.</strong> Structured, Semi-Structured, and Unstructured data formats. Sources: Skyvia, LinkedIn</em>
</center>

These categories apply to text, numbers, images, audio, and almost any modern dataset. Each category demands different tools, techniques, and assumptions.

<center>
  <img src="https://cdn.prod.website-files.com/670526c69cb938e8bd8b4754/68481a4d8d2149316f373142_10th_June_2025_E.jpg" width="550">
  <br>
  <em><strong>Figure 3.</strong> Comparison of Structured, Semi-Structured, and Unstructured data. Source: Arya.ai</em>
</center>

In modern systems, all three often coexist: structured transactional data, semi-structured event logs, and unstructured user content.



## **1. Structured Data: The Orderly World**


Structured data behaves politely: it is neatly organized, predictable, and follows a predefined schema. It typically appears in **tables** of rows and columns, where each field has a specific data type (e.g., integer, string, boolean).

Example structures include:

- Tabular data (spreadsheets, CSV files)
- Relational databases (PostgreSQL, MySQL)
- Time-based (temporal) tables (stock prices, sensor logs)



<center>
  <img src="https://miro.medium.com/v2/1*i3rsfDZ_bu7OkHoM6UjKCQ.jpeg" width="450">
  <br>
  <em><strong>Figure 4.</strong> Structured data. Source: Medium</em>
</center>

Structured data is easy to query (SQL), easy to validate, and easy to join across tables. It is the backbone of business analytics and scientific measurement.
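To make "easy to query, validate, and join" concrete, here is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database. The `customers` and `orders` tables, their columns, and all values are hypothetical. The schema declares a type for every column, and a single SQL statement joins the two tables and totals spending per customer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# The schema fixes the columns and their types up front.
conn.executescript("""
    CREATE TABLE customers (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount      REAL NOT NULL
    );
    INSERT INTO customers (id, name) VALUES (1, 'Ada'), (2, 'Lin');
    INSERT INTO orders (customer_id, amount) VALUES (1, 20.0), (1, 5.5), (2, 12.0);
""")

# Join the two tables and total the spending per customer.
totals = conn.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    GROUP BY c.name
    ORDER BY total DESC
""").fetchall()

print(totals)
conn.close()
```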


### **(a) Tabular Data: The Spreadsheet World**

Tabular data is what most people picture first: a **rectangular grid** of rows and columns, familiar from Excel or pandas. Each row represents an entity and each column a property. A database table of airline flights, for example, might have these columns:

- departure airport
- arrival airport
- time
- delay
- airline

<center>
  <img src="https://devopedia.org/images/article/231/4844.1571654694.jpg" width="450">
  <br>
  <em><strong>Figure 5.</strong> Conceptual diagram illustrating internal structure or hierarchy. Source: Devopedia</em>
</center>

There’s a reason the spreadsheet survived decades of technological change: it works. Its shape matches how humans reason about comparisons (Who spent more? Which day had the highest sales?).
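The flight table described above can be sketched in a few lines of Python using the standard `csv` module. The rows are invented, and the airport codes and airline names are placeholders. Once the data is in rows and columns, a comparison question such as "which flight was delayed the most?" becomes a one-line operation.

```python
import csv
import io

# Hypothetical flight records; every value here is invented for illustration.
raw = """departure,arrival,time,delay,airline
JFK,LAX,08:15,12,Alpha Air
ORD,SFO,09:40,45,Beta Blue
ATL,SEA,13:05,0,Alpha Air
"""

# Each row becomes a dict keyed by the column names in the header line.
flights = list(csv.DictReader(io.StringIO(raw)))

# CSV stores everything as text, so convert the delay column to integers.
for row in flights:
    row["delay"] = int(row["delay"])

# Comparison question: which flight had the longest delay?
worst = max(flights, key=lambda row: row["delay"])
print(worst["departure"], "->", worst["arrival"], worst["delay"])
```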




### **(b) Time-Based Data: The Rhythm of Change**

Not everything stays constant. Stock prices drift by the second, weather shapes the day hour by hour, and fitness trackers quietly log your heartbeat as you sleep.

Time-based (or **temporal**) data puts events in order and lets us ask how something changes rather than merely what it is. Analysts in finance, meteorology, and operations live in this world. The key idea is not the value itself, but the **trajectory**.

A single temperature reading is trivia; a year of temperature readings reveals climate.

<center>
  <img src="https://chartexpo.com/blog/wp-content/uploads/2024/02/time-series-graph-examples.jpg" width="600">
  <br>
  <em><strong>Figure 6.</strong> Examples of time-series visualizations showing how values evolve over time. Source: ChartExpo</em>
</center>

Temporal data allows us to detect trends (upward/downward), cycles (seasonal patterns), anomalies (abrupt changes), and responses to interventions (before/after effects). Many forecasting tasks—from weather to sales to traffic—are built on time-series models.
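Two of these ideas, trends and anomalies, can be sketched with very little code. The readings below are hypothetical, and the helpers `moving_average` and `jumps` are illustrative names, not functions from any library. A moving average smooths the series to expose its trend, and a simple threshold on consecutive differences flags abrupt changes. Real anomaly detection is usually more careful than this.

```python
from statistics import mean

# Hypothetical hourly temperature readings in degrees Celsius.
readings = [20.0, 20.5, 21.0, 21.5, 30.0, 22.0, 22.5]

def moving_average(values, window):
    """Average each run of `window` consecutive values."""
    return [mean(values[i:i + window]) for i in range(len(values) - window + 1)]

def jumps(values, threshold):
    """Indices where a value differs from its predecessor by more than `threshold`."""
    return [
        i for i in range(1, len(values))
        if abs(values[i] - values[i - 1]) > threshold
    ]

smoothed = moving_average(readings, window=3)
anomalies = jumps(readings, threshold=5.0)

print(smoothed)
print(anomalies)
```

The spike at index 4 is flagged twice, once on the way up and once on the way back down, which is typical of difference-based checks.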


---

## **2. Semi-Structured Data: The Flexible Middle**

Semi-structured data does not fit neatly into tables because it does not prescribe fixed columns, but it still contains **markers, tags, or key-value pairs** that impose some organization (structure and hierarchy).


Common examples include:

- JSON (web APIs, configuration data)
- XML and HTML (documents and web pages)
- logs (server events, application traces)
- key-value pairs (NoSQL databases)

<center>
  <img src="https://miro.medium.com/v2/resize:fit:1400/1*3mMje9YpBjZHPO-VyWtv4Q.png" width="450">
  <br>
  <em><strong>Figure 7.</strong> Semi-structured data can be represented in formats such as JSON and XML. Source: Medium</em>
</center>


Semi-structured data is more flexible than structured data. It supports nesting and irregular formats, which is why modern web APIs and mobile apps use JSON extensively.
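The sketch below shows what nesting and irregularity look like in practice, using Python's built-in `json` module. The payload is a made-up API response: both records have an `id` and a nested `user` object, but only one lists `langs` and only one has a `device` key. Code that reads such data has to tolerate missing fields.

```python
import json

# Hypothetical API response: records share some keys but not all.
payload = """
[
  {"id": 1, "user": {"name": "Ada", "langs": ["en", "fr"]}},
  {"id": 2, "user": {"name": "Lin"}, "device": "mobile"}
]
"""

records = json.loads(payload)

for rec in records:
    name = rec["user"]["name"]             # nested field, present in both records
    langs = rec["user"].get("langs", [])   # optional field, default to empty list
    device = rec.get("device", "unknown")  # optional field, default to "unknown"
    print(rec["id"], name, langs, device)
```

Forcing these records into a fixed table would require either empty cells or a decision about how to represent the list of languages, which is exactly the rigidity that semi-structured formats avoid.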


---

## **3. Unstructured Data: The Untamed Majority**

Most digital information refuses to sit neatly in tables or hierarchical tags. This is **unstructured data**, and it now dominates the modern world.

Unstructured data lacks a predefined schema and is difficult to store in rows and columns. It includes rich media, human language, and other natural forms of expression.

Examples include:

- images (satellite photos, medical scans, selfies)
- audio (voice commands, music, podcasts)
- video (streaming content, surveillance footage)
- text (tweets, emails, reviews, articles)
- graphs of social or biological networks



For decades, computers struggled to make sense of this category. The recent boom in machine learning (NLP for text, CNNs for images, transformers for almost everything) is largely about taming unstructured information.



### **(a) Images: Data You Can See**

A picture is not a number, until a model turns it into one.

Images are deceptively simple. What we perceive as a photograph is actually a grid of **pixels**, tiny squares that each hold values for red, green, and blue. In a standard 8-bit image, each value ranges from 0 to 255 and describes how bright that channel should appear.

**Image formats involve trade-offs**: JPEG uses lossy compression and discards some detail; PNG is lossless and preserves every pixel; medical imaging prefers lossless formats because a mistake isn’t aesthetic, it’s scientific.

Compression, resolution, and color channels are not trivia. They determine how algorithms “see.”
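To see the pixel grid directly, here is a tiny sketch in plain Python. The 2x2 image is invented, and `to_grayscale` is a hypothetical helper that uses a plain channel average. That is a simplification: weighted formulas are common in practice, and real code would normally use an image or array library.

```python
# A hypothetical 2x2 RGB image: each pixel is (red, green, blue), 0 to 255.
image = [
    [(255, 0, 0), (0, 255, 0)],        # red pixel, green pixel
    [(0, 0, 255), (255, 255, 255)],    # blue pixel, white pixel
]

def to_grayscale(img):
    """Convert each pixel to one brightness value by averaging its channels."""
    return [[sum(pixel) // 3 for pixel in row] for row in img]

gray = to_grayscale(image)
height, width = len(image), len(image[0])

print(height, width, gray)
```

To an algorithm, the photograph is nothing more than these numbers, so choices about resolution and channels directly change its input.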



---


## **A Special Case: Graphs as Data with Relationships**

Some data is best understood through **relationships**, not just individual values. A social network, a map of roads, or protein interactions inside a cell are examples.

Graphs introduce their own structure:

- **nodes** represent objects (e.g., people, airports, proteins)
- **edges** represent relationships (e.g., friendships, flights, bindings)

The power lies in the **pattern of connections** rather than the elements themselves. Who follows whom? Which cities connect? Which proteins interact? Influence, connectivity, centrality, and path structure become more important than single values.

<center>
  <img src="https://cdn-media-1.freecodecamp.org/images/9KFiyFYi9bMktsJkMKLKaeJl31heUN9A-xrr" width="400">
  <br>
  <em><strong>Figure 8.</strong> Example graph representation showing nodes connected by edges. Source: freeCodeCamp</em>
</center>
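A minimal sketch of these ideas, assuming a small invented friendship network stored as an adjacency list: nodes are dictionary keys, edges are the entries in each node's list. The helpers `degree` and `shortest_path_length` are illustrative names. The first is a very simple connectivity measure, and the second uses breadth-first search to count the hops between two people.

```python
from collections import deque

# Hypothetical undirected friendship graph stored as an adjacency list.
friends = {
    "Ana": ["Ben", "Cy"],
    "Ben": ["Ana", "Dee"],
    "Cy": ["Ana"],
    "Dee": ["Ben", "Eli"],
    "Eli": ["Dee"],
}

def degree(graph, node):
    """Number of edges touching `node`."""
    return len(graph[node])

def shortest_path_length(graph, start, goal):
    """Breadth-first search; returns the hop count, or None if unreachable."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None

print(degree(friends, "Ana"), shortest_path_length(friends, "Cy", "Eli"))
```

Neither question can be answered by looking at any single node in isolation; the answers come from the pattern of connections.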

---

## **Why Data Structure Matters**

Data structure determines:

- how easily data can be queried and indexed
- what tools or languages we use (SQL vs. parsers vs. ML models)
- how expensive or difficult preprocessing becomes
- how storage systems are designed (databases vs. file systems vs. cloud object stores)

For example, structured data works well with SQL and joins. Semi-structured data works well with hierarchical queries and document stores. Unstructured data may require embeddings, tokenization, image processing, or deep learning.

Modern data systems often mix all three. A single application may store user profiles in a relational table, telemetry in a JSON log, and user photos as unstructured images.

