# The Many Shapes of Data

Data rarely shows up in the neat spreadsheet we hope for. It arrives **messy, uneven, unexpected**, a mix of numbers, text, images, timestamps, social connections, and whatever the world records. Before we can analyze anything, we must learn how to recognize the shape that data takes.

Engineers sometimes describe data as “raw.” That sounds intimidating, but it really means the information hasn’t been processed yet. It is potential, waiting to become evidence, insight, or sometimes, just noise.

### What Counts as Data?

If you sit in a coffee shop for ten minutes and simply observe, you’ll see data everywhere: prices on a menu, Wi-Fi signal strength, the time between notifications on your phone, the music playlist, the number of people wearing headphones, and the language of their conversations. All of it is data, even if nobody wrote it down.

In computing and data science, we usually store data as numbers, text, images, audio, or combinations of these. But the real distinction isn’t the material; it’s the structure.

> **What is Data?** Data is raw information: facts, measurements, observations, or descriptions that can be stored, transmitted, and analyzed. In data science, data refers to representations of information that can be digitized and manipulated by computational systems. It can take many forms, including numerical, categorical, text, image, audio, temporal, or relational.

>> Depending on how it is organized, data may be structured, semi-structured, or unstructured.

> Scientists, analysts, and businesses collect data because the patterns hidden within it help them understand behavior, make decisions, and predict what comes next.

>> **Remember:** data becomes valuable only after it is interpreted and transformed into knowledge or used to answer meaningful questions.

At its simplest, data is recorded information about the world. It might be a number, a word, a photograph, a sound, or a sensor reading. On its own, data does not explain or prove anything; it simply captures what exists or what has happened.

Data becomes useful when we organize it, analyze it, and interpret it in context. A single heart rate reading means very little; a series of readings across time can reveal stress, fitness, or disease. Data does not guarantee truth; it offers evidence.





# How Do We Describe or Measure Data?

Before we think about how data is stored or structured, it is useful to ask a simpler question: *what kind of information are we dealing with?* Not all data behaves the same way. Some can be counted, some can be measured with precision, and some can only be described or categorized.

Broadly, based on what the data represents, we can divide it into two classical types:

- **Quantitative data**, which expresses numerical quantities (how much, how many, how fast)
- **Qualitative data**, which expresses categories, labels, or descriptive attributes (what kind, which type)

### **Quantitative Data**

Quantitative data expresses **quantities**. It tells us *how much*, *how many*, or *how fast*. This kind of data can be measured numerically and analyzed mathematically. We can add it, average it, compare it, and track how it changes over time.

Quantitative data often comes in two forms:

- **Discrete:** countable values (e.g., number of laptops, number of cars, number of students)
- **Continuous:** measurable values that vary smoothly (e.g., height, weight, temperature, time)

Discrete values make jumps (from 1 to 2 to 3), while continuous values can take any value within a range.

### **Qualitative Data**

Qualitative data describes **qualities, categories, or labels**. Instead of telling us *how much* of something exists, it tells us *what kind* it is.

Examples include hair color, species, survey responses, movie genres, or customer satisfaction levels. Qualitative data can be grouped into categories, but arithmetic operations like averaging or subtracting do not make sense here.

<center>
  <img src="https://static.wixstatic.com/media/89aacd_282be84a1c254a1ebb1e25e3e2613dfd~mv2.png/v1/fill/w_980,h_513,al_c,q_90,usm_0.66_1.00_0.01,enc_avif,quality_auto/89aacd_282be84a1c254a1ebb1e25e3e2613dfd~mv2.png" width="600">
  <br>
  <em><strong>Figure 1.</strong> Classical categorization of data types. Quantitative data may be discrete or continuous, while qualitative data describes categorical properties. Source: International Journal of Neurolinguistics & Gestalt Psychology.</em>
</center>

### **Why This Matters in Data Science**

These distinctions shape how we visualize, summarize, and model data. For example:

- Heights (**continuous**) may be plotted on a histogram
- Survey answers (**qualitative**) may be shown as bar charts
- Counts (**discrete**) may be modeled with Poisson or binomial distributions
- Categorical labels may be encoded for machine learning (e.g., **one-hot encoding**)

Knowing *what kind* of data we have is often the first step in deciding *what tools* to use.
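
As a small illustration of how the data type drives the tool, the sketch below uses only Python's standard library on a made-up three-person sample. The variable names and values are assumptions for illustration: a continuous variable is averaged, a qualitative one is counted, and the categories are one-hot encoded by hand.

```python
from collections import Counter
from statistics import mean

# Hypothetical sample: one continuous and one qualitative variable.
heights_cm = [162.5, 175.0, 181.2]          # continuous (quantitative)
hair_colors = ["brown", "black", "brown"]   # qualitative (categorical)

# Arithmetic is meaningful for quantitative data.
mean_height = mean(heights_cm)

# For qualitative data we count categories instead of averaging.
color_counts = Counter(hair_colors)

# One-hot encoding: one 0/1 column per category, in sorted order.
categories = sorted(set(hair_colors))       # ['black', 'brown']
one_hot = [[int(color == c) for c in categories] for color in hair_colors]

print(mean_height)
print(color_counts)
print(one_hot)
```

In practice a library such as pandas or scikit-learn would usually handle the encoding; the manual version is shown only to make the idea visible.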


# How Is Data Organized or Stored?

So far, we have described data based on what it represents (quantitative vs. qualitative). Data science also cares about a different and equally important dimension: **how the data is organized for computation**. The structure of data influences how easily it can be stored, queried, cleaned, and analyzed.

Unlike quantitative/qualitative categories (which describe meaning), structural categories describe the **shape and format** of data.

In data science, the key difference isn’t what the data *is* made of (numbers, pixels, words), but how it is *organized*. In practice, we encounter three major structural families:

1. **Structured data**: rigid, organized, predictable  
2. **Semi-structured data**: flexible, loosely organized  
3. **Unstructured data**: non-tabular content (such as text, images, audio, and video) that has no predefined schema and does not fit neatly into tables  



<center>
 <img src="https://skyvia.com/learn/images/scheme-3.png" width="550">
 <br>
   <img src="https://media.licdn.com/dms/image/v2/D5622AQHyT5xvgFUu1w/feedshare-shrink_800/B56Zicwt_ZHkAg-/0/1754976676754?e=2147483647&v=beta&t=SuiwEQUB8thRo8tscXW0QILLc_7rUSagTEndFqXU-qI" width="500">

  <br>
  <em><strong>Figure 2.</strong> Structured, semi-structured, and unstructured data formats. Source: Skyvia (top) and LinkedIn (bottom)</em>
</center>

These categories apply to text, numbers, images, audio, and almost any modern dataset. Each category demands different tools, techniques, and assumptions.

<center>
  <img src="https://cdn.prod.website-files.com/670526c69cb938e8bd8b4754/68481a4d8d2149316f373142_10th_June_2025_E.jpg" width="550">
  <br>
  <em><strong>Figure 3.</strong> Comparison of Structured, Semi-Structured, and Unstructured data. Source: Arya.ai</em>
</center>

In modern systems, all three often coexist: structured transactional data, semi-structured event logs, and unstructured user content.



## **1. Structured Data: The Orderly World**


Structured data is neatly organized, predictable, and follows a predefined schema. Structured data behaves politely. It typically appears in **tables**, with rows and columns, where each field has a specific data type (e.g., integer, string, boolean).

Example structures include:

- Tabular data (spreadsheets, CSV files)
- Relational databases (PostgreSQL, MySQL)
- Time-based (temporal) tables (stock prices, sensor logs)



<center>
  <img src="https://miro.medium.com/v2/1*i3rsfDZ_bu7OkHoM6UjKCQ.jpeg" width="450">
  <br>
  <em><strong>Figure 4.</strong> Structured data. Source: Medium</em>
</center>

Structured data is easy to query (SQL), easy to validate, and easy to join across tables. It is the backbone of business analytics and scientific measurement.


### **(a) Tabular Data: The Spreadsheet World**

Tabular data is what most people picture first: a **rectangular grid** of rows and columns, familiar from Excel or pandas. Each row represents an entity, each column a property. A database table of airline flights, for example, might have these columns:

- departure airport
- arrival airport
- time
- delay
- airline

<center>
  <img src="https://devopedia.org/images/article/231/4844.1571654694.jpg" width="450">
  <br>
  <em><strong>Figure 5.</strong> Conceptual diagram illustrating internal structure or hierarchy. Source: Devopedia</em>
</center>

There’s a reason the spreadsheet survived decades of technological change: it works. Its shape matches how humans reason about comparisons (Who spent more? Which day had the highest sales?).
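
The flight table above can be sketched in plain Python as a list of rows, where each row is a dictionary keyed by column name. The records and the `delay_min` column name are invented for illustration; the point is that a rectangular shape makes comparison questions easy to ask.

```python
# Hypothetical flight records: each dict is a row, each key a column.
flights = [
    {"departure": "IAD", "arrival": "SFO", "time": "08:15", "delay_min": 12, "airline": "UA"},
    {"departure": "ATL", "arrival": "JFK", "time": "09:40", "delay_min": 45, "airline": "DL"},
    {"departure": "ORD", "arrival": "DEN", "time": "13:05", "delay_min": 0, "airline": "UA"},
]

# "Which flight had the longest delay?" is a comparison across rows.
worst = max(flights, key=lambda row: row["delay_min"])

# "Which airline is delayed more on average?" groups rows by one column.
delays_by_airline = {}
for row in flights:
    delays_by_airline.setdefault(row["airline"], []).append(row["delay_min"])
avg_delay = {airline: sum(d) / len(d) for airline, d in delays_by_airline.items()}

print(worst["departure"], worst["arrival"], worst["delay_min"])
print(avg_delay)
```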




### **(b) Time-Based Data: The Rhythm of Change**

Not everything stays constant. Stock prices drift by the second, weather shapes the day hour by hour, and fitness trackers quietly log your heartbeat as you sleep.

Time-based (or **temporal**) data puts events in order and lets us ask how something changes rather than merely what it is. Analysts in finance, meteorology, and operations live in this world. The key idea is not the value itself, but the **trajectory**.

A single temperature reading is trivia; a year of temperature readings reveals climate.

<center>
  <img src="https://chartexpo.com/blog/wp-content/uploads/2024/02/time-series-graph-examples.jpg" width="600">
  <br>
  <em><strong>Figure 6.</strong> Examples of time-series visualizations showing how values evolve over time. Source: ChartExpo</em>
</center>

Temporal data allows us to detect trends (upward/downward), cycles (seasonal patterns), anomalies (abrupt changes), and responses to interventions (before/after effects). Many forecasting tasks—from weather to sales to traffic—are built on time-series models.
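
A minimal sketch of two of these ideas, smoothing to expose a trend and flagging abrupt changes, is shown below. The readings, the window size, and the threshold are all assumed values chosen for illustration, not a recommended method for real sensor data.

```python
# Hypothetical hourly temperature readings (°C); the list index is the hour.
readings = [20.1, 20.4, 20.8, 21.0, 27.5, 21.3, 21.6]

def moving_average(values, window):
    """Mean of each run of `window` consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

# Smoothing makes the slow upward drift easier to see.
smoothed = moving_average(readings, window=3)

# Flag abrupt changes: a jump of more than `threshold` from the previous reading.
threshold = 3.0
anomalies = [
    i for i in range(1, len(readings))
    if abs(readings[i] - readings[i - 1]) > threshold
]

print(smoothed)
print(anomalies)  # hours where the value jumped (both into and out of the spike)
```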


---

## **2. Semi-Structured Data: The Flexible Middle**

Semi-structured data does not fit neatly into tables because it does not prescribe a fixed set of columns, but it still contains **markers, tags, or key-value pairs** that impose some organization (structure and hierarchy).


Common examples include:

- JSON (web APIs, configuration data)
- XML and HTML (documents and web pages)
- logs (server events, application traces)
- key-value pairs (NoSQL databases)
<center>
  <img src="https://miro.medium.com/v2/resize:fit:1400/1*3mMje9YpBjZHPO-VyWtv4Q.png" width="450">
  <br>
  <em><strong>Figure 7.</strong> Semi-structured data can be represented using JSON, XML, and similar formats. Source: Medium (miro.medium.com)</em>
</center>


Semi-structured data is more flexible than structured data. It supports nesting and irregular formats, which is why modern web APIs and mobile apps use JSON extensively.
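
To make "irregular" concrete, the sketch below parses a made-up payload in which every record has an `id` and a `name` but the remaining fields differ. The payload and field names are assumptions for illustration.

```python
import json

# Hypothetical payload: records share some keys but not all.
raw = """
[
  {"id": 1, "name": "Alice", "email": "alice@example.com"},
  {"id": 2, "name": "Bob", "phones": ["555-0100", "555-0101"]},
  {"id": 3, "name": "Chen", "address": {"city": "Springfield", "zip": "00000"}}
]
"""
users = json.loads(raw)

# dict.get() returns a default when a key is absent,
# so irregular records do not raise KeyError.
emails = [u.get("email") for u in users]
cities = [u.get("address", {}).get("city") for u in users]

# The "schema" is only discoverable by looking at the data itself.
all_keys = sorted({key for u in users for key in u})

print(emails)
print(cities)
print(all_keys)
```

A table would need a column for every key in `all_keys`, mostly empty; the nested form stores only what each record actually has.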


---

## **3. Unstructured Data: The Untamed Majority**

Most digital information refuses to sit neatly in tables or hierarchical tags. This is **unstructured data**, and it now dominates the modern world.

Unstructured data lacks a predefined schema and is difficult to store in rows and columns. It includes rich media, human language, and other natural forms of expression.

Examples include:

- images (satellite photos, medical scans, selfies)
- audio (voice commands, music, podcasts)
- video (streaming content, surveillance footage)
- text (tweets, emails, reviews, articles)
- graphs of social or biological networks



For decades, computers struggled to make sense of this category. The recent boom in machine learning (NLP for text, CNNs for images, transformers for nearly everything) is largely about taming unstructured information.



### **(a) Images: Data You Can See**

A picture is not a number, until a model turns it into one.

Images are deceptively simple. What we perceive as a photograph is actually a grid of **pixels**, tiny squares holding values for red, green, and blue. In a standard 8-bit image, each value ranges from 0 to 255 and describes how bright that channel should appear.

**Image formats involve trade-offs**:  JPEG compresses aggressively and loses some detail; PNG preserves every pixel; medical imaging prefers lossless formats because a mistake isn’t aesthetic, it’s scientific.

Compression, resolution, and color channels are not trivia. They determine how algorithms “see.”



---


## **Special Case of Data: "Graphs" — Data with Relationships**

Some data is best understood through **relationships**, not just individual values. A social network, a map of roads, or protein interactions inside a cell are examples.

Graphs introduce their own structure:

- **nodes** represent objects (e.g., people, airports, proteins)
- **edges** represent relationships (e.g., friendships, flights, bindings)

The power lies in the **pattern of connections** rather than the elements themselves. Who follows whom? Which cities connect? Which proteins interact? Influence, connectivity, centrality, and path structure become more important than single values.
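
A common way to hold such a graph in code is an adjacency list: a mapping from each node to its neighbors. The sketch below uses an invented five-person friendship network to compute each node's degree (a simple measure of centrality) and a shortest path found by breadth-first search.

```python
from collections import deque

# Hypothetical undirected friendship graph as an adjacency list.
graph = {
    "Ana": ["Ben", "Cy"],
    "Ben": ["Ana", "Dee"],
    "Cy": ["Ana", "Dee"],
    "Dee": ["Ben", "Cy", "Eli"],
    "Eli": ["Dee"],
}

# Degree: how many edges touch each node.
degree = {node: len(neighbors) for node, neighbors in graph.items()}
most_connected = max(degree, key=degree.get)

def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path as a list of nodes, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None

path = shortest_path(graph, "Ana", "Eli")
print(most_connected)
print(path)
```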

<center>
  <img src="https://cdn-media-1.freecodecamp.org/images/9KFiyFYi9bMktsJkMKLKaeJl31heUN9A-xrr" width="400">
  <br>
  <em><strong>Figure 8.</strong> Example graph representation showing nodes connected by edges. Source: freeCodeCamp</em>
</center>

---

## **Why Data Structure Matters**

Data structure determines:

- how easily data can be queried and indexed
- what tools or languages we use (SQL vs. parsers vs. ML models)
- how expensive or difficult preprocessing becomes
- how storage systems are designed (databases vs. file systems vs. cloud object stores)

For example, structured data works well with SQL and joins. Semi-structured data works well with hierarchical queries and document stores. Unstructured data may require embeddings, tokenization, image processing, or deep learning.

Modern data systems often mix all three. A single application may store user profiles in a relational table, telemetry in a JSON log, and user photos as unstructured images.





# **Data Formats in Data Science**

Once we understand how data is described and how it is organized, the next layer to consider is how data is **formatted** for storage, transport, and consumption by tools. Formats determine how information is encoded on disk, how it moves between systems, and how efficiently it can be parsed or processed.

Different formats are optimized for different tasks: tabular data favors simple delimiters, hierarchical data favors nested keys, documents favor markup, and rich media requires specialized encoding. Understanding formats is essential for working with modern data pipelines.

These structural categories map naturally to different file formats used in practice:

- **Structured** → CSV, SQL tables  
- **Semi-structured** → JSON, XML, HTML, logs  
- **Unstructured** → JPEG, MP3, MP4, PDF, TIFF

The choice of format influences the tools we use, the complexity of data cleaning, and the computational effort required to extract meaningful information.

---

## **(I) CSV and TSV**

CSV (Comma-Separated Values) and TSV (Tab-Separated Values) are among the simplest and most widely used data formats. They store **tabular**, structured data in plain text, where each row represents a record and each column corresponds to a field.

> A **delimiter** is the character that separates values in a row. Common delimiters include commas (`,`), tabs (`\t`), semicolons (`;`), and colons (`:`).

**CSV** uses commas as the delimiter. It is popular because it is lightweight, human-readable, and compatible with nearly every data analysis tool, from Excel to Python's *pandas*. Spreadsheet files are routinely exported to CSV for this reason:

<center>
  <img src="https://www.myexcelonline.com/wp-content/uploads/2024/05/httpsoutranking.s3.amazonaws.com62459967_Aditi20Lundia648726082024-05-25T133A343A08.137891_Easy_Excel_Convert_XLSX_Files_to_CSV_Format_Quickly_-1.png" width="600">
  <br>
  <em><strong>Figure 9.</strong> Example workflow showing how spreadsheet formats such as Excel (XLSX) can be converted to CSV for use in data processing. Source: myexcelonline.com</em>
</center>

Python makes CSV loading straightforward:
```python
import pandas as pd
df = pd.read_csv("data.csv")
```

**TSV** uses tabs instead of commas, which avoids extra quoting when fields themselves contain commas.

These formats are ideal for data exchange and quick inspection, but lack explicit typing, indexing, and schema validation.
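
The sketch below reads the same made-up records from CSV and TSV text with Python's built-in `csv` module; `io.StringIO` stands in for an open file, and the data is invented for illustration. It shows the quoting that CSV needs when a field contains a comma, and it shows that every value arrives as a string because the format carries no type information.

```python
import csv
import io

# Hypothetical file contents. Note the quotes around "Paris, France" in the CSV.
csv_text = 'name,city,score\nAlice,"Paris, France",91\nBob,Lima,78\n'
tsv_text = "name\tcity\tscore\nAlice\tParis, France\t91\nBob\tLima\t78\n"

csv_rows = list(csv.DictReader(io.StringIO(csv_text)))
tsv_rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))

# Both parse to the same records.
print(csv_rows == tsv_rows)

# No explicit typing: the score is the string "91" until we convert it.
print(type(csv_rows[0]["score"]))
scores = [int(row["score"]) for row in csv_rows]
print(scores)
```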

---

## **(II) JSON**

JSON (JavaScript Object Notation) represents **semi-structured**, hierarchical data. It supports nesting, arrays, and variable-length fields, making it flexible for web APIs, mobile applications, and configuration data.

JSON structures are built from **key–value pairs**, where each key identifies a field and each value holds the associated data.


### **JSON Syntax Basics**

JSON follows a small set of structural rules, and its smallest unit is the **key–value pair**:

- Data is represented as key–value pairs.
- Data elements are separated by commas.
- Curly brackets `{ }` enclose **objects**, which contain key–value pairs.

<center>
  <img src="https://i.sstatic.net/VHjrV.gif" width="600">
</center>


- Square brackets `[ ]` enclose **arrays**, which hold ordered lists of values.

<center>
  <img src="  https://i.sstatic.net/OCsS0.gif" width="600">
</center>

- **Keys** are strings: sequences of characters surrounded by double quotation marks.
- **Values** may be strings, numbers, booleans, arrays, objects, or `null`.
<center>
  <img src="  https://i.sstatic.net/7zOHB.gif" width="600">
</center>

**Remember:** a colon is placed between each key and its value, and commas separate the pairs. Keys are always written in double quotation marks; values are quoted only when they are strings.
As a result, a JSON object with string values looks like this:

> `{"key1": "value1", "key2": "value2", "key3": "value3"}`




Example JSON snippet:

```json
{"name": "Alice", "age": 24, "languages": ["Python", "SQL"]}
```

JSON also supports nesting through objects and arrays. For example:

```json
{
  "name": "Alice",
  "age": 24,
  "languages": ["Python", "SQL"],
  "education": {
    "degree": "BS",
    "major": "Computer Science",
    "year": 2025
  }
}
```
Arrays can also hold objects, as in this list of students:

```json
{
  "students": [
    {"firstName": "Tom", "lastName": "Jackson"},
    {"firstName": "Linda", "lastName": "Garner"},
    {"firstName": "Adam", "lastName": "Cooper"}
  ]
}
```

JSON is human-readable, self-describing, and machine-friendly. Most modern RESTful APIs return data in JSON.

In **Python**, the json module is commonly used to work with JSON data.
> We use **json.dumps()** to convert Python objects to JSON format and **json.loads()** to convert JSON data back to Python objects.


<center>
  <img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRfyeZOdVW-kq0VMNTKBWv2_hGvsaWqqM9E3g&s" width="400">
</center>

Example:
```python
import json

# Python object (dict)
person = {
    "name": "Alice",
    "age": 24,
    "languages": ["Python", "SQL"]
}

# Convert to JSON string
json_str = json.dumps(person)
print(json_str)

# Convert back to Python object
person_back = json.loads(json_str)
print(person_back["languages"])
```
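
Nested JSON is navigated by chaining keys and indexes, because `json.loads()` maps JSON objects to Python dictionaries and JSON arrays to Python lists. The sketch below uses a made-up document that combines a nested object and an array of objects; the contents are assumptions for illustration.

```python
import json

# Hypothetical document with a nested object and an array of objects.
doc = json.loads("""
{
  "name": "Alice",
  "education": {"degree": "BS", "major": "Computer Science", "year": 2025},
  "students": [
    {"firstName": "Tom", "lastName": "Jackson"},
    {"firstName": "Linda", "lastName": "Garner"}
  ]
}
""")

# Objects become dicts, arrays become lists.
print(type(doc["education"]), type(doc["students"]))

# Chain keys to reach nested values.
major = doc["education"]["major"]

# Loop over an array of objects.
last_names = [student["lastName"] for student in doc["students"]]

# indent= produces a readable, multi-line string.
pretty = json.dumps(doc["education"], indent=2, sort_keys=True)

print(major)
print(last_names)
print(pretty)
```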

## **(III) HTML and XML**


HTML (HyperText Markup Language) and XML (eXtensible Markup Language) are markup languages that describe structure using tags.


**HTML (HyperText Markup Language)** describes the structure and display of web pages. It defines elements such as headings, paragraphs, links, and tables. HTML focuses on **presentation** and is interpreted by web browsers.


Example HTML snippet:
```html
<p>Hello, world!</p>
<a href="https://umd.edu">Visit UMD</a>
```
HTML is not intended to store arbitrary data for computation. Instead, it tells browsers how content should appear and how users interact with it.

**XML (eXtensible Markup Language)** uses tags to describe **data and relationships** rather than presentation. Unlike HTML, XML does not have predefined tags. Users may *define their own custom tags*, which makes XML flexible for representing custom structures and domain-specific information.

XML is **self-describing**, **hierarchical**, and can represent nested structures similar to JSON.

> XML is often used for data storage or exchange, especially in enterprise systems, configuration files, and legacy APIs.

Example XML snippet:
```xml
<person>
  <name>Alice</name>
  <age>24</age>
  <languages>
    <language>Python</language>
    <language>SQL</language>
  </languages>
</person>
```

XML values are contained inside tags, and fields can repeat or nest, which gives XML flexibility for representing structured documents.
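
The snippet above can be read in Python with the standard library's `xml.etree.ElementTree` module. This is a minimal sketch: the XML is embedded as a string, whereas a real program would typically load it from a file or a network response.

```python
import xml.etree.ElementTree as ET

xml_text = """
<person>
  <name>Alice</name>
  <age>24</age>
  <languages>
    <language>Python</language>
    <language>SQL</language>
  </languages>
</person>
""".strip()

root = ET.fromstring(xml_text)

# find() returns the first matching child; .text is the content between the tags.
name = root.find("name").text
age = int(root.find("age").text)   # XML text is a string until converted

# findall() with a path returns every repeated <language> element.
languages = [elem.text for elem in root.findall("languages/language")]

print(root.tag, name, age, languages)
```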

> HTML is primarily for presentation, while XML is more general-purpose and can encode hierarchical data for communication between systems.

---
**Where They Appear**

* HTML → browsers, webpages, crawlers, scraping

* XML → configuration files, enterprise systems, legacy APIs, document formats (e.g., DOCX uses XML internally)

* JSON → REST APIs, mobile apps, NoSQL databases, log formats

All three formats belong to the broader landscape of semi-structured data, sitting between rigid tables and free-form text.


---


## **(IV) Relational Databases**

Relational databases store data in a structured form using **tables** (rows and columns) and a well-defined **schema**. Each table represents an entity (e.g., `students`, `courses`, `orders`), and each row represents a record. Tables can be linked through **keys**, allowing meaningful relationships to be expressed and queried efficiently.

Relational formats enforce consistency through datatypes, primary keys, foreign keys, and constraints. They excel in transactional systems (e.g., banking, e-commerce, logistics).


### **Why "Relational"?**

The term **relational** comes from mathematics: each table is a **relation**, each row is a **tuple**, and each column is an **attribute**. Tables become powerful when they are **related** to one another through shared keys.

These relationships allow us to combine information across tables (e.g., *students ↔ courses ↔ enrollments*) without duplicating data across the system.

<!-- --- -->

### **Keys:** Linking Information

Relational databases rely on two important types of keys:

- A **primary key** uniquely identifies a row within a table (e.g., `student_id`).
- A **foreign key** references a primary key in another table, creating a relationship.

Example (conceptual):
- `students(id)` is the primary key for students.
- `enrollments(student_id)` is a foreign key pointing back to `students(id)`.

Keys enforce consistency and prevent invalid references (e.g., enrolling a student that does not exist).
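
The sketch below shows that rejection in action using Python's built-in `sqlite3` module and an in-memory database. The table layout follows the conceptual example above, while the `course` column and the inserted values are assumptions for illustration. Note that SQLite checks foreign keys only after `PRAGMA foreign_keys = ON`; other database systems enforce them by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE enrollments (
        student_id INTEGER REFERENCES students(id),
        course TEXT
    )
""")

conn.execute("INSERT INTO students (id, name) VALUES (1, 'Alice')")
conn.execute("INSERT INTO enrollments VALUES (1, 'Databases')")  # valid reference

# Enrolling a student that does not exist violates the foreign key.
try:
    conn.execute("INSERT INTO enrollments VALUES (99, 'Databases')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# The shared key lets us combine the two tables.
enrolled = conn.execute(
    "SELECT s.name, e.course FROM students s "
    "JOIN enrollments e ON e.student_id = s.id"
).fetchall()

print(rejected)
print(enrolled)
```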



### **What is a Database Schema?**

A **schema** is the blueprint of the database. It defines:

- what tables exist
- what columns each table contains
- the datatype of each column
- how tables relate to one another (via keys)
- rules or constraints that ensure valid data

Example schema (informal):
```sql
CREATE TABLE students (
    id    INTEGER PRIMARY KEY,  -- unique identifier for each student
    name  TEXT,
    major TEXT,
    year  INTEGER
);
```

<center>
  <img src="https://planetscale.com/assets/blog/content/schema-design-101-relational-databases/db72cc3ac506bec544588454972113c4dc3abe50-1953x1576.png" width="450">
  <br>
<em><strong>Figure 10.</strong> Relational schema showing tables linked through keys. Here, <code>customer_id</code> is a <strong>primary key</strong> in the <code>customers</code> table and a <strong>foreign key</strong> in the <code>orders</code> table, linking orders to customers. Data is stored once and referenced through keys. Source: PlanetScale</em>


</center>



Schemas enforce consistency and make it possible to write queries that join or filter across multiple tables.

---

### **DBMS: Database Management Systems**

A **DBMS (Database Management System)** is software that stores, manages, and provides controlled access to databases. While a **database** refers to the data itself, the **DBMS** manages how the data is stored, retrieved, updated, indexed, and protected.

Relational DBMSs (also called **RDBMS**) implement the relational model and use SQL for querying. Common examples include:

- PostgreSQL
- MySQL
- SQLite
- SQL Server
- Oracle

> In short: **the database stores the data; the DBMS manages the data.**

**Key idea of DBMS:** applications and users never talk directly to the raw data; the DBMS sits in the middle, handling queries, security, concurrency, and storage.
<!-- --- -->

<center>
  <img src="https://hbs-rcs.github.io/sql-novice-survey/fig/DBMS.PNG" width="430">
  <br><br>
  <img src="https://media.licdn.com/dms/image/v2/D4D22AQGwRT6iBD_9ew/feedshare-shrink_1280/B4DZm_KBf.HYAs-/0/1759848707243?e=2147483647&v=beta&t=ia51sBb3fopnuk2rJXOD-vxZbIm1Coln3V4pBUTI9Ys" width="430">
  <br>
  <em><strong>Figure 11.</strong> Conceptual views of a DBMS. Both diagrams illustrate how applications interact with a database through a DBMS layer that manages queries, storage, and controlled access. Source: HBS SQL Novice Survey (top) and LinkedIn (bottom).</em>
</center>



### **SQL and Querying**

The language used to interact with relational databases is **SQL (Structured Query Language)**. SQL supports powerful operations for filtering, joining, aggregating, and updating structured data.


Relational systems enforce consistency through:

- **data types** (integer, string, date, boolean, etc.)  
- **primary keys** (unique identifiers for rows)  
- **foreign keys** (references linking tables)  
- **constraints** (rules that maintain data integrity)  

These properties make relational databases ideal for transactional systems such as banking, logistics, retail, reservations, and scientific measurement.

**Example Table: Students**

| id | name  | major              | year |
|----|-------|-------------------|------|
| 1  | Alice | Computer Science  | 2025 |
| 2  | Bob   | Mathematics       | 2024 |

**Example SQL Query**

```sql
SELECT name, major
FROM students
WHERE year = 2025;
```
*(We will cover SQL in a later chapter.)*
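
To try the query without installing a database server, one option is Python's built-in `sqlite3` module with an in-memory database. The sketch below recreates the example table and runs the same query; passing the year as a `?` parameter rather than pasting it into the SQL string is a general habit worth adopting early.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, major TEXT, year INTEGER)"
)
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?, ?)",
    [
        (1, "Alice", "Computer Science", 2025),
        (2, "Bob", "Mathematics", 2024),
    ],
)

# Same query as above, with the year supplied as a parameter.
rows = conn.execute(
    "SELECT name, major FROM students WHERE year = ?", (2025,)
).fetchall()

print(rows)
```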

Relational databases excel when data must be consistent, structured, and queryable, especially when multiple tables relate to each other (e.g., students ↔ courses ↔ enrollments). That is why they are the default choice in any domain where correctness and structure matter.

---
## **BONUS Part: NoSQL Databases**

Relational databases shine when data is highly structured and relationships are well-defined. But not all data fits neatly into tables. Modern applications produce logs, events, nested objects, user interactions, sensor streams, and media: data that grows fast and changes shape over time.

**NoSQL** databases *relax the rigid schema* of relational systems to support higher scalability and more flexible data models. Instead of enforcing a single blueprint (schema), NoSQL systems allow the structure of records to evolve.

Different NoSQL databases optimize for different shapes of data:

- **Document stores** (e.g., MongoDB, Couchbase)  
  Store JSON-like objects with nested fields. Great for user profiles, product catalogs, or application events.

- **Key–value stores** (e.g., Redis, DynamoDB)  
  Store simple (key → value) mappings. Ideal for caching, session storage, configuration, and real-time lookups.

- **Wide column stores** (e.g., Bigtable, Cassandra)  
  Organize data by column families. Used in time-series analytics, telemetry, and large distributed workloads.

- **Graph databases** (e.g., Neo4j)  
  Represent nodes and edges directly. Useful for social networks, recommendation engines, fraud detection, and biological pathways.


<center>
  <img src="https://www.scylladb.com/wp-content/uploads/differences-between-sql-databases-and-nosql-databases.png" width="430">
  <br>
  <em><strong>Figure 12.</strong> High-level comparison of relational (SQL) and non-relational (NoSQL) databases, highlighting differences in schema, consistency, and scalability. Source: ScyllaDB</em>
</center>



NoSQL systems often target large-scale, distributed environments and are designed to **scale horizontally** (add more machines) rather than vertically (buy a bigger machine).

NoSQL gained popularity in the era of cloud and internet-scale platforms, where data volumes are high, latency matters, and structure is fluid. Common scenarios include:

- millions of user profiles with different fields  
- real-time click streams and telemetry data  
- nested JSON returned from APIs  
- social networks with rich relationships  
- sensor and IoT data  
- rapidly evolving application data  

In short, **NoSQL trades strict relational structure for flexibility and scale**. It does not replace relational databases; it complements them.

---


## **(V) Images, Audio, and Video**

Not all data is textual or numerical. Rich media formats dominate modern applications, especially in machine learning, robotics, healthcare, and online platforms.

Examples include:

- **Images**: JPEG, PNG, TIFF  
- **Audio**: WAV, MP3, FLAC  
- **Video**: MP4, AVI, MOV  

These formats encode pixels, waveforms, or frames in compressed or uncompressed forms.

### **How Image Data Works**

Images are typically organized as a grid of **pixels**, the tiny building blocks that form the picture. Each pixel holds **color information** through separate **channels**. In a standard color image:

- the channels are **red, green, and blue (RGB)**  
- each channel acts like a separate layer of information  
- each channel value ranges from **0 to 255**, representing intensity  
  - `0` means no intensity (dark)
  - `255` means full intensity (bright)
<center>
  <img src="https://process.filestackapi.com/cache=expiry:max/pAVCjc9LSYSFYnkM4HtK" width="600">
  <br>
  <em><strong>Figure 13.</strong> Visualization related to image data structures or pixel representation. Source: Filestack</em>
</center>


Combining intensities across channels produces the full range of visible colors. Machine learning models (e.g., vision transformers, CNNs) operate directly on these pixel arrays and often convert them into structured **embeddings** for downstream tasks such as classification or detection.
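
The pixel grid can be sketched without any imaging library as nested Python lists. The 2×2 image below is invented for illustration, and the grayscale weights are the widely used ITU-R BT.601 luma coefficients, one common choice among several.

```python
# A hypothetical 2x2 RGB image: rows of pixels, each pixel an (R, G, B) tuple.
image = [
    [(255, 0, 0), (0, 255, 0)],        # red,  green
    [(0, 0, 255), (255, 255, 255)],    # blue, white
]
height, width = len(image), len(image[0])

# Split out the red channel as its own 2D layer.
red_channel = [[pixel[0] for pixel in row] for row in image]

def to_gray(pixel):
    """Weighted sum of the channels (BT.601 luma weights), rounded to an integer."""
    r, g, b = pixel
    return round(0.299 * r + 0.587 * g + 0.114 * b)

gray = [[to_gray(pixel) for pixel in row] for row in image]

print(height, width)
print(red_channel)
print(gray)
```

Libraries such as NumPy or Pillow store the same information as a height × width × 3 array, which is the form most models consume.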

Audio and video extend the same idea along a time axis: audio stores **waveforms** (sequences of amplitude samples) over time; video stores sequences of **image frames**, sometimes with sound and metadata attached.
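
As a rough sketch of what "a waveform over time" means in code, the example below generates 10 milliseconds of a pure tone as a list of samples, and represents a short video as a list of frames. The sample rate, frequency, and frame size are assumed values for illustration; real files add compression and metadata on top of this.

```python
import math

sample_rate = 8000   # samples per second (assumed)
frequency = 440.0    # tone frequency in Hz (assumed)
duration_ms = 10

n_samples = sample_rate * duration_ms // 1000   # 80 samples

# Audio: one amplitude value per time step, each between -1.0 and 1.0.
waveform = [
    math.sin(2 * math.pi * frequency * t / sample_rate)
    for t in range(n_samples)
]

# Video: a sequence of frames, each frame a grid of RGB pixels.
# Here the same all-black 3x4 still frame is repeated 24 times.
frame = [[(0, 0, 0)] * 4 for _ in range(3)]
video = [frame for _ in range(24)]

print(len(waveform), waveform[:3])
print(len(video), len(video[0]), len(video[0][0]))
```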

Machine learning models for speech, vision, and multimodal data help translate these rich formats into structured forms that can be queried, compared, or used for prediction.


## **Choosing the Right Format**

There is no single “best” format. CSV is ideal for spreadsheets; JSON is ideal for APIs; SQL is ideal for transactions; NoSQL is ideal for flexible schemas; media formats are essential for perception tasks. Data scientists must often convert between formats depending on the application.
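
Conversion between formats is often only a few lines. The sketch below turns made-up CSV text into JSON using the standard library; the column names are assumptions, and the explicit `int()` calls are needed because CSV carries no type information.

```python
import csv
import io
import json

# Hypothetical CSV content; io.StringIO stands in for an open file.
csv_text = "id,name,year\n1,Alice,2025\n2,Bob,2024\n"

records = []
for row in csv.DictReader(io.StringIO(csv_text)):
    # Convert numeric fields explicitly; everything in CSV is read as text.
    records.append({
        "id": int(row["id"]),
        "name": row["name"],
        "year": int(row["year"]),
    })

json_text = json.dumps(records, indent=2)
print(json_text)
```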




# Where Does Data Come From?

Historically, datasets arrived through organized channels: government surveys, institutional studies, corporate databases. Today, data spills from everyday systems. Almost every action, transaction, or device leaves a trace.

Common sources include:

- transactional databases
- sensors and wearables
- web APIs
- web scraping
- user interactions
- application logs

Some sources deliver structured data; others arrive messy and require transformation before analysis.

## **Scraping vs. APIs**

When acquiring data from the web, two patterns are common: **web scraping** and **APIs**.

**Web scraping** imitates a browser, reading HTML and extracting what a human would see. It is useful when no official data interface exists, but it is fragile and must stay within ethical and legal boundaries.

**APIs**, in contrast, are explicit contracts:

> “If you send a structured request, I will send structured information.”

APIs typically return machine-readable data (often JSON), avoiding the need to parse HTML manually.

Before scraping, it is ***always worth checking*** whether an API exists.


### **How Web Scraping Works (Conceptual)**

<center>
  <img src="https://monashdatafluency.github.io/python-web-scraping/images/request.png" width="400">
</center>

1. Your script sends an HTTP GET request to a URL.  
2. The server returns an HTML page.  
3. Your program parses and extracts the relevant parts.



### **(a) Beautiful Soup and Parsing HTML**

When websites lack an API, we fall back to HTML, the layer designed for humans. **Beautiful Soup** converts that visual layer into structured data.

> **Beautiful Soup** is a Python library for parsing HTML. It takes the raw HTML returned by a website and lets you navigate it and extract useful parts such as tables, links, or text.

HTML is hierarchical: tags contain tags, attributes provide meaning, and text sits inside. Beautiful Soup lets us locate elements (`div`, `table`, `span`) and extract text, links, or tabular content.
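
To make the hierarchy concrete, here is a small sketch that parses an HTML string directly, so it runs without a network connection. The markup, class names, and links are invented for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div class="listing">
  <h2>Open Positions</h2>
  <ul>
    <li><a href="/jobs/1">Data Analyst</a> <span class="city">Austin</span></li>
    <li><a href="/jobs/2">ML Engineer</a> <span class="city">Denver</span></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Tags contain tags: start at the outer <div>, then walk inward
listing = soup.find("div", class_="listing")
heading = listing.find("h2").get_text(strip=True)

jobs = []
for item in listing.find_all("li"):
    link = item.find("a")
    jobs.append({
        "title": link.get_text(strip=True),  # text inside the tag
        "url": link["href"],                 # attribute value
        "city": item.find("span", class_="city").get_text(strip=True),
    })
```

Each `<li>` becomes one dictionary, which is already the row-like structure a DataFrame expects.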

<center>
  <img src="https://miro.medium.com/v2/resize:fit:1400/1*6UIaApn54TOkhOQ607z-cw.jpeg" width="420">
  <img src="https://2.bp.blogspot.com/-oeOzu13C26U/V1_2uXbFE4I/AAAAAAAAALI/2RmiWjCb--YUVO6MAg3pG5eIOISFkVwBgCLcB/s1600/custom-web-scraping-624x301.png" width="420">
  <img src="https://monashdatafluency.github.io/python-web-scraping/section-2-HTML-based-scraping/figures/html_structure.png" width="420">
  <br>
  <em><strong>Figure X.</strong> HTML-based scraping workflow: identify elements visually, locate tags, extract content, convert to structured data.</em>
</center>

Scraping is useful for product data, weather reports, job postings, and research tables: information that exists publicly but without a formal data interface.



#### **Tiny Example: Requests + Beautiful Soup**

```python
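import requests
from bs4 import BeautifulSoup

url = "https://example.com"
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on HTTP errors (4xx/5xx)
soup = BeautifulSoup(response.text, "html.parser")

# Collect the text of every <h2> heading on the page
titles = [tag.get_text(strip=True) for tag in soup.find_all("h2")]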
```

> Pattern: download → parse → select → extract → clean

Not every page has the same structure, so scraping often involves inspection and adaptation.

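**Bonus Trick:** Scraping Tables with pandas

pandas can parse HTML tables directly with `read_html`, which returns a list of DataFrames, one per table it finds (an HTML parser such as `lxml` must be installed):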
```python
import pandas as pd

df_list = pd.read_html("https://example.com/table")
df_list[0].head()
```

If the page contains `<table> ... </table>` tags, this often works immediately: no loops and no manual parsing.

---
### **BONUS Pandas Example: Scraping Population Data from an HTML Table**
Suppose you want to collect the latest population estimates for U.S. states. The site has no public API but shows an HTML table in the browser, a perfect case for scraping.

The page might display data like this:

| State        | 2024 Population | Growth Rate | Area (sq mi) | Density (per sq mi) |
|-------------|-----------------|-------------|--------------|---------------------|
| California  | 39,074,000      | -0.59%      | 163,695      | 240                 |
| Texas       | 31,161,000      | +1.44%      | 268,597      | 116                 |
| Florida     | 23,655,000      | +1.37%      | 65,758       | 345                 |
| …           | …               | …           | …            | …                   |

Because this data lives inside HTML `<table>` tags, we can use `requests` to download the page, `BeautifulSoup` to parse the HTML, and then convert the table into a Pandas `DataFrame`. Here is the conceptual Python snippet that accomplishes this:

```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

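# 1. Fetch the HTML (the site may block scripts or change its layout)
url = "https://worldpopulationreview.com/states"
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on HTTP errors
html = response.text

# 2. Parse with BeautifulSoup
soup = BeautifulSoup(html, "html.parser")

# 3. Find the first table on the page
table = soup.find("table")
if table is None:
    raise ValueError("No <table> found; the page may render its data with JavaScript")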

# 4.1 Extract rows
rows = table.find_all("tr")

# 4.2 Loop over the rows of the table
data = []
for row in rows:
    # Get all header or data cells in this row
    cells = row.find_all(["td", "th"])
    # Extract the text and strip extra spaces
    texts = [cell.get_text(strip=True) for cell in cells]
    if texts:  # skip empty rows
        data.append(texts)

# 5. First row is the header, the rest is the body
header = data[0]
body = data[1:]

# 6. Build a DataFrame from the scraped data
df = pd.DataFrame(body, columns=header)

# Take a quick look
df.head()
```

Once in a DataFrame `df`, the data becomes programmable: you can clean, convert types, sort, filter, and analyze.
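
Scraped cells arrive as text, so a first cleaning pass usually converts them to numbers. The sketch below starts from the three sample rows shown in the table above, typed in by hand rather than scraped; the new column names `population` and `growth_pct` are illustrative.

```python
import pandas as pd

# Scraped values are strings, exactly as they appear on the page
df = pd.DataFrame({
    "State": ["California", "Texas", "Florida"],
    "2024 Population": ["39,074,000", "31,161,000", "23,655,000"],
    "Growth Rate": ["-0.59%", "+1.44%", "+1.37%"],
})

# Remove thousands separators, then convert to integers
df["population"] = df["2024 Population"].str.replace(",", "", regex=False).astype(int)

# Drop the percent sign and convert to a float (still in percent units)
df["growth_pct"] = df["Growth Rate"].str.rstrip("%").astype(float)

# With numeric columns, sorting and filtering behave as expected
growing = df[df["growth_pct"] > 0].sort_values("population", ascending=False)
```

Without the conversion, sorting would compare the values as text, character by character, rather than by magnitude.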

---

#### **Why Scraping Is Fragile**
Scrapers often break because HTML is designed for humans, not for machines.
Small changes can cause failure:
- renaming CSS classes
- changing layout
- adding ads or pop-ups
- switching to JavaScript-rendered tables

Contrast:
- APIs are contracts
- HTML is performance

APIs are meant for machines; HTML is meant for people viewing it in a browser.
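
A small sketch shows how little it takes to break a scraper. Both HTML snippets and the class names below are invented: the second version of the page differs only in one renamed CSS class, yet the same selector no longer matches anything.

```python
from bs4 import BeautifulSoup

page_v1 = '<div><span class="price">19.99</span></div>'
page_v2 = '<div><span class="product-price">19.99</span></div>'  # class renamed

def extract_price(html):
    """Return the price as a float, or None if the expected tag is missing."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("span", class_="price")
    return float(tag.get_text(strip=True)) if tag is not None else None

before = extract_price(page_v1)  # selector matches
after = extract_price(page_v2)   # same code, redesigned page: no match
```

Checking for `None` instead of assuming the tag exists turns a crash into a missing value that can be detected and logged.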

### **Typical API Workflow**

APIs remove the guesswork of scraping by returning structured data directly. The workflow is straightforward:

1. Send a request (often a `GET`) to an API endpoint  
2. Receive structured data (usually JSON)  
3. Convert the JSON into a DataFrame for analysis

**Example (conceptual):**
```python
import requests

response = requests.get("https://api.example.com/weather", timeout=10)
response.raise_for_status()  # stop on HTTP errors
data = response.json()  # parsed JSON (dicts and lists)
```
This is the programmatic equivalent of asking a well-behaved machine for information.
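
Step 3 of the workflow, turning JSON into a DataFrame, can be sketched offline. The payload below is a hypothetical response written by hand so the snippet runs without a network call; real APIs use their own field names.

```python
import pandas as pd

# Hypothetical parsed JSON, as response.json() might return it
data = [
    {"city": "Oslo", "main": {"temp": 4.5, "humidity": 81}},
    {"city": "Cairo", "main": {"temp": 29.0, "humidity": 35}},
]

# json_normalize flattens nested objects into dotted column names
df = pd.json_normalize(data)
columns = sorted(df.columns)
```

Nested keys such as `main` → `temp` become flat columns like `main.temp`, which makes the result behave like an ordinary table.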

### **(b) RESTful APIs (Representational State Transfer)**
RESTful APIs dominate modern data access: a client requests information, a server processes the request, and structured data (often JSON) is returned.

> **RESTful API:** A style of API where data is organized as resources (e.g., /users, /repos) accessed using HTTP verbs such as GET, POST, PUT, and DELETE.

<center> <img src="https://miro.medium.com/v2/1*f-4u01cDYiy6N5IRBktZnw.png" width="400"> <br> <em><strong>Figure 14.</strong> High-level overview of a RESTful API: structured request → structured response. Source: Medium</em> </center>
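
To see how resources and verbs combine, the sketch below builds two requests without sending them. The host `api.example.com` and the `/users` resource are placeholders, not a real service.

```python
import json
import requests

base = "https://api.example.com"  # placeholder host

# GET /users?per_page=5 : read a collection of resources
read = requests.Request("GET", f"{base}/users", params={"per_page": 5}).prepare()

# POST /users : create a new resource from a JSON body
create = requests.Request("POST", f"{base}/users", json={"name": "Ada"}).prepare()

read_url = read.url                    # query string is encoded into the URL
create_body = json.loads(create.body)  # the JSON payload that would be sent
content_type = create.headers["Content-Type"]
```

Preparing a request without sending it is a convenient way to inspect exactly what a client would transmit.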


#### **Tiny Example: Requests + RESTful APIs**

The GitHub API is a public **RESTful API** that returns structured JSON data. Unlike scraping, no HTML parsing is required.

```python
import requests
import pandas as pd

# 1. Send a GET request to a RESTful API endpoint
url = "https://api.github.com/repositories"
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop on HTTP errors (for example, when rate-limited)

# 2. Parse the JSON response
data = response.json()

# 3. Convert JSON into a DataFrame
df = pd.json_normalize(data)

df.head()
```

For data science, the key takeaway is practical: RESTful APIs provide predictable, machine-readable access to fresh data. Modern dashboards, pipelines, and ML systems rely on them.


> **A Note on Etiquette** Data availability is not the same as data permission. Good acquisition respects rate limits, terms of service, and the human effort behind the page or system being scraped.
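
Respecting rate limits can be as simple as pausing between requests. The sketch below is one possible approach, not a standard. The `Throttle` class and its one-second default are illustrative, as are the injectable `clock` and `sleep` parameters, which are there so the logic can be tested without waiting.

```python
import time

class Throttle:
    """Ensure at least `min_interval` seconds pass between calls to wait()."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)  # pause until the interval has passed
                now = self._clock()
        self._last = now

# Usage: call throttle.wait() before each request in a loop
throttle = Throttle(min_interval=0.01)
for _ in range(3):
    throttle.wait()
    # requests.get(...) would go here
```

A fixed pause is the simplest policy; a site's documented limits or terms of service should take precedence where they exist.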




## Summary of the Chapter: **The Point of All This**


Data does not come in a single form. It may appear as numbers in a table, sounds from a microphone, pixels in an image, words in a review, or connections in a network. Each shape of data brings its own assumptions and tools, and more importantly, its own way of seeing the world. A spreadsheet invites comparison; a time series reveals change; a graph exposes relationships; and an image asks to be interpreted.

The first step in data science is simply recognizing **what kind of information we have** and **how it is organized**. We do not analyze a voice recording the way we analyze an invoice, nor do we treat a timestamp like a photograph. Structure matters. It determines what questions we can ask and what answers we can trust.

Understanding data shapes is also practical. It tells us which formats are useful, which tools to reach for, and how much cleaning or transformation will be required. We do not clean data just to make it pretty; we shape it so that it can answer a question.

The next chapter takes us from **recognizing data** to **reasoning with data**. It will focus on asking better questions and on the principles of **experimental design**, where curiosity meets structure and evidence begins to speak.



# Knowledge Check