# Data Structures
A data structure is a particular way of organising and storing data in a computer so that it can be accessed and modified efficiently.

A data structure consists of:
 - A collection of data values
 - The relationships among them
 - And the functions or operations that can be applied to the data
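
As a minimal sketch of these three ingredients, consider a stack. The class name `Stack` and its methods are chosen here purely for illustration: the values live in a list, the relationship among them is their insertion order, and the operations are `push` and `pop`.

```python
class Stack:
    """A minimal last-in, first-out (LIFO) stack, for illustration only."""

    def __init__(self):
        # The collection of data values; their order is the relationship among them.
        self._items = []

    def push(self, item):
        # Operation: place a value on top of the stack.
        self._items.append(item)

    def pop(self):
        # Operation: remove and return the most recently added value.
        return self._items.pop()

    def __len__(self):
        return len(self._items)


stack = Stack()
stack.push("a")
stack.push("b")
top = stack.pop()  # the last value pushed comes out first
```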

3 different data structures:
- Structured Data
- Unstructured Data
- Semi-structured Data 

# Structured Data

- Structured data is data that adheres to a pre-defined data model and is therefore straightforward to analyse

- Structured data conforms to a tabular format with relationships between the different rows and columns

    
| Date | Sensor ID | Temperature C | Humidity % |
| --- | --- | --- | --- |
| 2022-07-03 | 346 | 23 | 74 |
| 2022-07-04 | 345 | 13 | 64 |
| 2022-07-06 | 343 | 25 | 78 |
| 2022-07-10 | 346 | 30 | 56 |
| 2022-08-03 | 343 | 26 | 72 |


 - Common examples of structured data are spreadsheet files (CSV), HDF5 files, and SQL databases
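
Because every row follows the same model, the sensor table above can be analysed with very little code. The sketch below assumes the table has been saved as CSV text; the column names (`sensor_id`, `temperature_c`, ...) and the function `mean_temperature_by_sensor` are made up for this example.

```python
import csv
import io
from collections import defaultdict

# The sensor table, written as CSV text (in practice this would be a file).
CSV_TEXT = """date,sensor_id,temperature_c,humidity_pct
2022-07-03,346,23,74
2022-07-04,345,13,64
2022-07-06,343,25,78
2022-07-10,346,30,56
2022-08-03,343,26,72
"""


def mean_temperature_by_sensor(csv_text):
    """Return {sensor_id: mean temperature} for the given CSV text."""
    readings = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Each field is addressed by its column name, as defined in the header.
        readings[int(row["sensor_id"])].append(float(row["temperature_c"]))
    return {sensor: sum(temps) / len(temps) for sensor, temps in readings.items()}


means = mean_temperature_by_sensor(CSV_TEXT)
```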

## The Model

- Structured data depends on the existence of a data model: a model of how data can be stored, processed and accessed.
- Because of a data model, each field is discrete and can be accessed separately or jointly along with data from other fields. 
  - That is why structured data is extremely powerful: it is possible to quickly aggregate data from various locations in the database.

![](bookstore.svg)
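
A minimal sketch of aggregating data from several tables, using Python's built-in `sqlite3` module. The `authors` and `books` tables and their contents are hypothetical and are not taken from the diagram; they only illustrate how discrete fields can be combined across tables.

```python
import sqlite3

# In-memory database with a hypothetical two-table bookstore model.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE books (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        price REAL NOT NULL,
        author_id INTEGER NOT NULL REFERENCES authors(id)
    );
    INSERT INTO authors VALUES (1, 'Author A'), (2, 'Author B');
    INSERT INTO books VALUES
        (1, 'Book One', 10.0, 1),
        (2, 'Book Two', 20.0, 1),
        (3, 'Book Three', 15.0, 2);
""")

# Join the two tables and aggregate per author.
rows = conn.execute("""
    SELECT a.name, COUNT(*) AS n_books, SUM(b.price) AS total_price
    FROM books AS b
    JOIN authors AS a ON a.id = b.author_id
    GROUP BY a.name
    ORDER BY a.name
""").fetchall()
conn.close()
```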

# Unstructured Data

 - Unstructured data is information that either does not have a predefined data model or is not organised in a pre-defined manner.
 - Unstructured information is typically text-heavy, but may contain data such as dates, numbers, facts...

The ability to analyse unstructured data is especially relevant in the context of Big Data, since a large part of data in organisations is unstructured. 

Compared to structured data, it usually has irregularities and ambiguities that make it difficult to manage using traditional programs.
 
- Some examples of unstructured data: Pictures, PDF docs, audio & video files...
    - Imagine a research institute that wants to hire a web developer and receives a bunch of applications and CVs (Docs, PDFs, emails...).
    - It is not so easy to create a model and store all the details in a structured way.
![](cvs.jpg)  
  
The ability to extract value from unstructured data is one of the main drivers behind the quick growth of Big Data.
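
To give a feel for what extracting value from free text involves, here is a rough sketch that pulls e-mail addresses and years out of a made-up CV snippet with regular expressions. The text and both patterns are assumptions for illustration; real documents are far messier, and simple patterns like these will miss or misread many cases.

```python
import re

# A made-up fragment of free text, as it might appear in a CV or e-mail.
cv_text = """
Jane Doe, web developer. Contact: jane.doe@example.org
Worked with JavaScript and Python from 2015 to 2021.
"""

# Deliberately simple patterns: good enough for a demo, not for production.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")

emails = EMAIL_RE.findall(cv_text)
years = [int(y) for y in YEAR_RE.findall(cv_text)]
```

Nothing in the text says which year is a start date or which word is a skill; that interpretation has to be added by the program, which is what makes unstructured data hard to manage.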

# Semi-structured Data

Semi-structured data is a form of structured data that does not conform to the formal structure of data models.
- It contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. 
- It is also known as a self-describing structure.
- Unlike structured data, it does not depend on a predefined model.
    - A new item to store may have one or more new fields previously non-existent.
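
A small sketch of this flexibility, using JSON and Python's standard `json` module. The records and field names are invented for the example: the second record carries a `languages` field that the first one lacks, and both are still valid and self-describing.

```python
import json

# Two made-up records; the second one introduces a field the first lacks.
records_json = """
[
  {"name": "Ana", "skills": ["Python", "SQL"]},
  {"name": "Ben", "skills": ["JavaScript"], "languages": {"English": "C1"}}
]
"""

records = json.loads(records_json)

# The keys (tags) describe the data, so the set of fields can be discovered.
all_fields = sorted({key for record in records for key in record})

# Code reading the data must tolerate missing fields.
languages = [record.get("languages", {}) for record in records]
```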

 - Examples of semi-structured data formats: 
   - JSON and derivatives
   - Markup languages like XML
   - Pickle (Python; not safe: loading untrusted data can execute arbitrary code)

 - NoSQL databases like MongoDB store the data in BSON format (similar to JSON)
 - Working example: 
   - MAC [Manfred Awesomic CV](https://github.com/getmanfred/mac#manfred-awesomic-cv)<span style="color:lightgray;font-size:80%">[CTRL + W] to close</span>

Semi-structured data is considerably easier to analyse than unstructured data.

Many Big Data solutions and tools have the ability to manage either JSON or XML. This reduces the complexity of analysing semi-structured data, compared to unstructured data.


# Database Management Systems (DBMS)
 - A software layer on top of the OS used for creating & managing databases.
 - A DBMS usually stores the data in one or more files in the file system (the user/application does not need to know the details of the underlying storage)
 ![](DB.svg)
 - Besides the data itself, organized in tables, databases may include:
     - Indexes (make searches a lot more efficient)
     - Relationships between tables
     - Views (virtual tables based on the result of a query)
     - Procedures or built-in functions (to keep specific logic in the DBMS: validation, access control, some calculations)
 - Inserting or accessing data in a database is very different from reading or writing to the file system.
     - It is done via queries that specify the kind of data, format and order
 - There are DBMSs for both structured data (SQL, e.g. MySQL, PostgreSQL, SQLite) and semi-structured data (NoSQL, e.g. MongoDB)
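
The pieces listed above can be tried out with SQLite through Python's built-in `sqlite3` module. This is only a sketch: the table `readings`, the index `idx_readings_sensor` and the view `warm_readings` are names invented here, and the rows reuse the sensor values from the structured data example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # the DBMS hides the storage details
conn.executescript("""
    CREATE TABLE readings (
        date TEXT NOT NULL,
        sensor_id INTEGER NOT NULL,
        temperature_c REAL NOT NULL,
        humidity_pct REAL NOT NULL
    );
    -- Index: speeds up searches by sensor.
    CREATE INDEX idx_readings_sensor ON readings (sensor_id);
    -- View: a virtual table based on the result of a query.
    CREATE VIEW warm_readings AS
        SELECT date, sensor_id, temperature_c
        FROM readings
        WHERE temperature_c >= 25;
""")

# Data is inserted through queries, not by writing to a file directly.
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?, ?)",
    [
        ("2022-07-03", 346, 23, 74),
        ("2022-07-04", 345, 13, 64),
        ("2022-07-06", 343, 25, 78),
        ("2022-07-10", 346, 30, 56),
        ("2022-08-03", 343, 26, 72),
    ],
)

# The query states which data is wanted and in which order.
warm = conn.execute(
    "SELECT date, sensor_id FROM warm_readings ORDER BY date DESC"
).fetchall()
conn.close()
```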

# Advantages of Data Structures and DBMS
 * Data is efficiently stored in a structured and organized way that makes faster and more complex queries possible
 * Easily used by machine learning algorithms. The specific and organized nature of structured data allows for easy manipulation
 * No need to reinvent the wheel. There are a lot of tools and libraries that have been developed for using and analyzing structured data.