## Table of Contents

  - [Table of Contents](#Table-of-Contents)
- [Y1 Data](#Y1-Data)
  - [1.1 Introduction to Data in Digital Engineering](#1.1-Introduction-to-Data-in-Digital-Engineering)
  - [1.2 Types of Data in Digital Engineering](#1.2-Types-of-Data-in-Digital-Engineering)
  - [1.3 Data Collection Methods](#%3Cfont%3E-1.3-Data-Collection-Methods-%3C/font%3E)
  - [1.4 Data Management and Storage](#1.4-Data-Management-and-Storage)
  - [1.5 Data Challenges](#1.5-Data-Challenges)
- [🏠 Home](../../welcomePage.ipynb)

# Y1 Data
## 1.1 Introduction to Data in Digital Engineering

### <font color = '#646464'> 1.1.1 Definition of Data </font>

Data is essentially raw, unprocessed information that hasn’t yet been organized or analyzed. Think of it as the basic facts, numbers, or observations collected for future use. This information can come from a variety of places such as devices that measure things (like temperature sensors), surveys we fill out, business transactions, or even simple observations of our surroundings.

Data can be split into two main types:

1. **Quantitative Data**: This type of data deals with numbers. For example, it can be the number of people who attended an event or the temperature recorded by a thermometer. Quantitative data helps us measure and compare things.

2. **Qualitative Data**: This type is descriptive rather than numerical. It includes things like colors, names, or a person's opinion about something. Qualitative data provides details that can't easily be counted or measured but are still valuable for understanding situations.

<center>
  <img src="Module 1 Content/quantitative vs qualitative.png" alt="Quantitative vs Qualitative Data" width="500"/>
</center>

Data can be found in many formats including:
- **Text**: Letters or words, such as a sentence in a document.
- **Numbers**: Simple values like 50 or 3.14, which can represent anything from age to the length of an object.
- **Images**: Pictures that might be used for analysis or documentation.
- **Audio**: Sounds that can include anything from speech to music.
- **Video**: Moving images and sounds, such as recordings from cameras.

This raw data, once organized and processed, can help us understand patterns, make decisions, or find solutions to problems. It is the first step in creating information and knowledge that we can use to make informed choices in various fields, such as science and healthcare.
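
As a rough illustration of the quantitative/qualitative distinction described earlier in this section, the sketch below treats numeric Python values as quantitative and everything else as qualitative. The helper name `classify` and the sample observations are made up for this example. Real data often needs more judgment: a ZIP code is written with digits, for instance, but it is a label rather than a measurement.

```python
from numbers import Number

def classify(value):
    """Return 'quantitative' for numeric values, 'qualitative' otherwise."""
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(value, Number) and not isinstance(value, bool):
        return "quantitative"
    return "qualitative"

# Hypothetical observations, not taken from a real dataset
observations = {
    "attendees": 120,
    "temperature_c": 21.5,
    "favorite_color": "blue",
    "customer_opinion": "The part feels sturdy",
}

for name, value in observations.items():
    print(f"{name}: {classify(value)}")
```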

The cell below is your first exposure to interactive widgets! It runs a short exercise in which you classify each example as qualitative or quantitative data.

In [None]:
import sys
sys.path.append('Module 1 Content')  # Adjust the path as necessary

from functions import quiz1  # import only the helper this cell uses
quiz1()

### <font color = '#646464'> 1.1.2 Importance of Data in Digital Engineering </font>

Data is essential in Digital Engineering. It helps engineers make better decisions, improve how things work, and predict future problems. By using data, engineers can create detailed digital versions of real-world systems (called digital twins) and improve how things are designed and made. This results in less downtime, better products, and faster innovation. Here are some of the key ways data is important in Digital Engineering:

1. **Improved Accuracy and Precision**: With real-time data, engineers can create products more accurately, ensuring they meet all specifications and work as expected. Boeing uses real-time data to ensure precise assembly of the 787 Dreamliner, resulting in highly accurate and high-quality aircraft.

2. **Better Monitoring and Control**: Engineers can use data from connected devices to monitor systems in real-time. This allows them to see how things are performing, spot problems early, and make quick fixes. General Electric’s gas turbines are continuously monitored with sensors that provide real-time data, enabling engineers to detect and address issues before they lead to failure.

3. **Collaboration and Communication**: Data sharing helps teams from different departments or locations work together more effectively. By accessing and sharing information, teams can solve problems together and improve overall results. Ford’s global design teams collaborate efficiently by sharing data on vehicle performance and manufacturing processes, speeding up the development of new models.

4. **Predictive Maintenance**: By analyzing data, engineers can predict when machines or equipment are likely to break down. This allows them to fix things before they fail, saving money and reducing downtime. Rolls-Royce’s "TotalCare" program predicts engine maintenance needs using data analytics, minimizing unplanned downtime and maximizing operational efficiency.

5. **Driving Innovation**: Data can help engineers discover new patterns and trends. This can lead to new ideas for improving products, processes, or even creating entirely new products. Tesla collects vast amounts of data from its cars to improve its autonomous driving system, continually enhancing the technology's safety and performance.

## 1.2 Types of Data in Digital Engineering

### <font color = '#646464'> 1.2.1 Structured vs. Unstructured Data </font>

<center> <img src="Module 1 Content/Data Structured Un.png" alt="Structured vs. Unstructured Data" width="500"/> </center>

#### <font color = '#646464'> *1.2.1.1 Structured Data* </font>

**Definition**: Structured data is organized, easily searchable, and typically stored in relational databases or spreadsheets. It follows a predefined format or schema, making it simple to manage and analyze using traditional data processing techniques.

**Characteristics**:
- **Schema-Defined**: Structured data adheres to a specific schema, which defines the structure, such as tables, columns, and data types.
- **Easily Searchable**: Due to its organized format, structured data can be easily queried (searched and retrieved when needed) using SQL or other database query languages.
- **Quantitative**: Often consists of numerical values, dates, and categorical data that can be used for statistical analysis and reporting.

**Example**:
The table below records 3D print jobs run while testing different print materials and settings, including whether each job failed. The data is well-organized and presents key information about the printing settings and outcomes.

| Print Job ID | Material Type | Print Speed (mm/s) | Layer Height (mm) | Temperature (°C) | Supports Used (Yes/No) | Failed (Yes/No) |
|--------------|---------------|--------------------|-------------------|------------------|------------------------|-----------------|
| 001          | PLA           | 50                 | 0.2               | 210              | Yes          | No              |
| 002          | ABS           | 60                 | 0.3               | 230              | No           | Yes             |
| 003          | PETG          | 40                 | 0.15              | 220              | Yes          | No              |
| 004          | PLA           | 55                 | 0.2               | 210              | No           | No              |
| 005          | ABS           | 45                 | 0.25              | 240              | Yes          | Yes             |
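
To make "schema-defined" and "easily searchable" concrete, here is a minimal sketch that loads the rows above into an in-memory SQLite database using Python's built-in `sqlite3` module and queries them with SQL. The table name `print_jobs` and the column names are choices made for this example, not part of the course materials.

```python
import sqlite3

# Rows copied from the print-job table above
rows = [
    ("001", "PLA", 50, 0.2, 210, "Yes", "No"),
    ("002", "ABS", 60, 0.3, 230, "No", "Yes"),
    ("003", "PETG", 40, 0.15, 220, "Yes", "No"),
    ("004", "PLA", 55, 0.2, 210, "No", "No"),
    ("005", "ABS", 45, 0.25, 240, "Yes", "Yes"),
]

conn = sqlite3.connect(":memory:")  # temporary database that lives in memory
# The schema fixes each column's name and type up front
conn.execute(
    """
    CREATE TABLE print_jobs (
        job_id TEXT PRIMARY KEY,
        material TEXT,
        speed_mm_s REAL,
        layer_height_mm REAL,
        temperature_c REAL,
        support TEXT,
        failed TEXT
    )
    """
)
conn.executemany("INSERT INTO print_jobs VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# Query: which jobs failed, and with which material?
failed_jobs = conn.execute(
    "SELECT job_id, material FROM print_jobs WHERE failed = 'Yes' ORDER BY job_id"
).fetchall()
print(failed_jobs)  # [('002', 'ABS'), ('005', 'ABS')]
conn.close()
```

Because every row follows the same schema, a one-line query is enough to answer the question. In this small sample both failed jobs used ABS, though five rows are far too few to draw a conclusion.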

#### <font color = '#646464'> *1.2.1.2 Unstructured Data* </font>

**Definition**: Unstructured data is raw and unorganized, lacking a predefined format or schema. It can come in various forms, including text, images, videos, and audio files. Unstructured data is more challenging to manage and analyze using traditional data processing techniques, but it holds valuable insights.

**Characteristics**:
- **No Fixed Schema**: Unstructured data does not adhere to a specific schema, making it more flexible but harder to organize and analyze.
- **Diverse Formats**: Can be in various formats such as text documents, emails, social media posts, images, videos, and audio recordings.
- **Qualitative**: Often consists of qualitative information that requires advanced analytics techniques, such as natural language processing (NLP) and image recognition, to extract meaningful insights.

**Example**: Returning to the 3D printing example above, the structured sensor data might show the printer operating normally, with no issues detected in temperature, speed, or material feed. Images captured during the print, however, could show that the part is failing: uneven layers, warping, or incomplete structures that the sensor readings did not capture. These images are unstructured data and need substantial processing before useful information can be extracted, but they reveal printing defects and alignment problems that the sensor data alone could not identify.

<center> <img src="Module 1 Content/Print Failure.png" alt="Print Failure" width="700"/> </center>

## <font> 1.3 Data Collection Methods </font>

### <font color = '#646464'> 1.3.1 Sensors and IoT Devices </font>
Sensors and IoT devices automatically collect real-time data from equipment and environments. They provide immediate insights into operations and enable predictive maintenance by identifying potential issues early, improving overall efficiency and reducing downtime.
<center> <img src="Module 1 Content/thermalvideo.gif" alt="Thermal video" width="400"/> <img src="Module 1 Content/Temperature IoT.jpeg" alt="Temperature IoT device" width="350"/> </center>

### <font color = '#646464'> 1.3.2 Manual Data Entry </font>
Manual data entry involves human operators recording data into systems. This method captures detailed and qualitative information that sensors may not detect, making it essential for scenarios where automation is not feasible and detailed documentation is required.

### <font color = '#646464'> 1.3.3 Data Logging Systems </font>
Data logging systems continuously and automatically record data over time. These systems ensure consistent data collection, facilitate historical analysis for trend identification, and help maintain data integrity by reducing the risk of human error.
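
The idea of a data logger can be sketched in a few lines: read a value, attach a timestamp, append it to a file, and repeat. In the sketch below, `read_temperature` is a stand-in that returns simulated values (there is no real sensor here), and the function names, temperature range, and start time are all assumptions made for illustration. A real logger would also wait between samples and write to a file on disk.

```python
import csv
import io
import random
from datetime import datetime, timedelta

def read_temperature(rng):
    """Stand-in for a real sensor read; returns a simulated value in °C."""
    return round(rng.uniform(205.0, 215.0), 1)

def log_readings(n_samples, interval_s, out_file, seed=0):
    """Write n_samples timestamped readings to out_file as CSV rows."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    writer = csv.writer(out_file)
    writer.writerow(["timestamp", "temperature_c"])
    t = datetime(2024, 1, 1, 8, 0, 0)  # fixed start time for the demo
    for _ in range(n_samples):
        writer.writerow([t.isoformat(), read_temperature(rng)])
        t += timedelta(seconds=interval_s)  # simulated clock, no real waiting

buffer = io.StringIO()  # in-memory file used in place of a file on disk
log_readings(n_samples=5, interval_s=60, out_file=buffer)
print(buffer.getvalue())
```

Since every row is written by the same code in the same format, the log stays consistent over time, which is what makes later trend analysis straightforward.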



### <font color = '#646464'> 1.3.4 CAD Metadata </font>
Computer-Aided Design (CAD) metadata includes detailed information about the design and specifications of manufactured parts. This metadata captures dimensions, tolerances, material properties, and other design attributes, providing a comprehensive digital representation that is crucial for quality control, process optimization, and building digital twins.

<center> <img src="Module 1 Content/cadmetadata.png" alt="CAD metadata" width="800"/> </center>
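
One way to picture how CAD metadata supports quality control is to store a few design attributes in a small record and compare a measured part against them. The `PartMetadata` class, its fields, and the bracket values below are hypothetical; real CAD systems expose far richer metadata through their own formats and tools.

```python
from dataclasses import dataclass

@dataclass
class PartMetadata:
    """Hypothetical subset of the metadata a CAD file might carry."""
    part_name: str
    material: str
    nominal_length_mm: float
    tolerance_mm: float  # allowed deviation on either side of nominal

    def within_tolerance(self, measured_mm):
        """True if a measured length falls inside nominal ± tolerance."""
        return abs(measured_mm - self.nominal_length_mm) <= self.tolerance_mm

# Example part with made-up design values
bracket = PartMetadata("bracket", "PLA", nominal_length_mm=80.0, tolerance_mm=0.2)

for measured in (80.1, 80.35):
    status = "pass" if bracket.within_tolerance(measured) else "fail"
    print(f"{measured} mm: {status}")
```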

## 1.4 Data Management and Storage

In digital engineering, data needs to be stored safely and accessed easily. There are three common ways to do this:  

- **Databases** organize data so it’s easy to find and use.  
- **Data warehouses** collect data from different places to help with analysis.  
- **Cloud storage** saves data online, so it can be accessed from anywhere. 

## 1.5 Data Challenges

In the world of digital engineering, having high-quality data is essential for making accurate decisions and reliable predictions. However, there are several challenges that can affect the quality of the data, which in turn can influence the results of any analysis. These challenges include missing data, errors (like noise), outliers, and issues with data formatting. Understanding these problems is important because they can significantly impact the conclusions we draw from the data.

### <font color = '#646464'> 1.5.1 Missing Values </font>

Sometimes, data is simply not available. This is called **missing data**. It can happen for a variety of reasons, such as sensors malfunctioning, human error, or interruptions in data collection processes. When some data points are missing, it can lead to incomplete analysis and result in biased or inaccurate conclusions. For example, if we are monitoring a machine's performance and some data about temperature or pressure is missing, we might not be able to fully understand how well the machine is working.

**Why it's a problem**: Missing data can distort the results, making it difficult to make informed decisions.

**Detailed Machine Monitoring Data**

| Time       | Machine ID | Temperature (°C) | Pressure (psi) | Output (units) |
|:----------:|:---------:|:----------------:|:--------------:|:--------------:|
| 08:00:00   | M01       | 180              | **<span style="color:red">missing</span>**  | 450            |
| 08:01:00   | M02       | **<span style="color:red">missing</span>** | 150  | 430            |
| 08:02:00   | M01       | 175              | 145            | **<span style="color:red">missing</span>** |
| 08:03:00   | M03       | 190              | 155            | 470            |
| 08:04:00   | M02       | **<span style="color:red">missing</span>** | **<span style="color:red">missing</span>** | 420            |
| 08:05:00   | M01       | 185              | 148            | 455            |
| 08:06:00   | M03       | 195              | 157            | 475            |
| 08:07:00   | M01       | 180              | 149            | **<span style="color:red">missing</span>** |
| 08:08:00   | M02       | **<span style="color:red">missing</span>** | 152  | 435            |
| 08:09:00   | M03       | 188              | **<span style="color:red">missing</span>** | 468          |
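
A first step when working with a table like this is simply to measure how much is missing. The sketch below copies the rows above into plain Python tuples, using `None` for each missing entry, and counts the gaps per column. The column names are chosen for this example.

```python
from statistics import mean

# Rows from the machine monitoring table above; None marks a missing value
rows = [
    ("08:00:00", "M01", 180, None, 450),
    ("08:01:00", "M02", None, 150, 430),
    ("08:02:00", "M01", 175, 145, None),
    ("08:03:00", "M03", 190, 155, 470),
    ("08:04:00", "M02", None, None, 420),
    ("08:05:00", "M01", 185, 148, 455),
    ("08:06:00", "M03", 195, 157, 475),
    ("08:07:00", "M01", 180, 149, None),
    ("08:08:00", "M02", None, 152, 435),
    ("08:09:00", "M03", 188, None, 468),
]
columns = ["time", "machine_id", "temperature_c", "pressure_psi", "output_units"]

# Count missing entries per column
missing_counts = {
    name: sum(row[i] is None for row in rows) for i, name in enumerate(columns)
}
print(missing_counts)

# Rows with every field present
complete_rows = [row for row in rows if None not in row]
print(len(complete_rows), "of", len(rows), "rows are complete")

# Mean temperature over the readings that exist
temps = [row[2] for row in rows if row[2] is not None]
print(round(mean(temps), 1))
```

Only 3 of the 10 rows are complete. Every temperature reading for machine M02 is missing, so the average temperature computed here reflects only M01 and M03. This is an example of how missing data can quietly bias a result.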

### <font color = '#646464'> 1.5.2 Noise and Outliers </font>

- **Noise** refers to random, irrelevant fluctuations in the data that don't actually represent anything meaningful. It’s like static or interference that muddles the real information.
  
- **Outliers** are data points that differ greatly from most of the others. For example, if most of the temperatures in a machine’s operating environment are between 50 and 60 degrees, but one reading shows 150 degrees, that’s an outlier. It could be caused by a malfunction, or it could be valid data that needs to be understood in context.

Both noise and outliers can cause confusion and lead to incorrect conclusions if not identified and properly handled. For instance, an outlier might look like an important signal, but it could just be a mistake. Similarly, noise might hide the true pattern in the data.

**Why it's a problem**: Noise and outliers can distort analysis and lead to misleading results.
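
One common way to flag outliers is the interquartile range (IQR) rule, and one simple way to damp noise is a moving average. The sketch below applies both to a made-up list of temperature readings that mirrors the 50 to 60 degree example. The function names and the threshold `k=1.5` are conventional choices rather than anything prescribed by this module, and a flagged value should be investigated before it is removed, since it may be valid.

```python
from statistics import quantiles

def flag_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def moving_average(values, window=3):
    """Smooth a series by averaging each run of `window` consecutive values."""
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]

# Hypothetical temperature readings: mostly 50-60, with one reading of 150
readings = [52, 55, 58, 54, 150, 57, 53, 56, 59, 51]

outliers = flag_outliers(readings)
print(outliers)  # [150]

# Smooth the remaining readings to damp small random fluctuations
cleaned = [v for v in readings if v not in outliers]
print(moving_average(cleaned))
```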

**Example**: We will intentionally add noise to a dataset to demonstrate its effect visually. It's important to understand that in real-world scenarios, noise often accompanies data naturally due to various reasons such as errors in data collection, environmental interference, or variability in measurement devices. This demonstration will help you visualize how noise can obscure patterns in data and complicate the analysis.

In [None]:
import sys
sys.path.append('Module 1 Content')  # Adjust the path as necessary

from functions import example1  # import only the helper this cell uses
example1()

### <font color = '#646464'> 1.5.3 Inconsistent Formats and Duplicates </font>

Sometimes, data is stored in different formats, even though it represents the same thing. For example, one dataset might list dates as "MM/DD/YYYY," while another uses "DD/MM/YYYY." This inconsistency can make it harder to combine or analyze the data correctly.
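
The date problem is easy to reproduce. In the sketch below, the same string parses without error under both conventions but yields two different dates, which is why the format has to be known (or recorded) rather than guessed. The sample string is made up for this example.

```python
from datetime import datetime

raw = "03/04/2024"  # ambiguous: March 4 or April 3?

as_month_first = datetime.strptime(raw, "%m/%d/%Y")  # read as MM/DD/YYYY
as_day_first = datetime.strptime(raw, "%d/%m/%Y")    # read as DD/MM/YYYY

# Converting to ISO 8601 (YYYY-MM-DD) removes the ambiguity once the source format is known
print(as_month_first.date().isoformat())  # 2024-03-04
print(as_day_first.date().isoformat())    # 2024-04-03
```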

**Duplicates** occur when the same information is recorded multiple times. This often happens when data is collected from different sources or systems. Duplicates can waste resources and create confusion by showing incorrect totals or averages.

**Why it's a problem**: Inconsistent formats can slow down data processing and make it harder to get reliable results, while duplicates can lead to inaccurate analysis.

The table below illustrates these issues. Printer types appear in several spellings, such as "Laser Printer," "laser printer," and "LASER printer," and material types vary in the same way ("abs" and "ABS"; "PETG," "PET-G," and "petg"). These variations make it difficult to standardize the data for analysis. Once the spellings are normalized, several rows turn out to repeat the same printer and material combination, and some of those rows disagree on usage hours (150 versus 15, or 200 versus 250), so totals and averages become unreliable. Left uncorrected, such inconsistencies and duplicates undermine data processing and analysis, ultimately affecting decision-making and resource allocation.

| ID | Printer Type        | Material Type | Usage Hours |
|----|---------------------|---------------|-------------|
| 1  | Laser Printer       | PLA           | 100         |
| 2  | Inkjet Printer      | abs           | 150         |
| 3  | laser printer       | PLA           | 100         |
| 4  | InkJet Printer      | ABS           | 150         |
| 5  | Dot Matrix Printer  | PETG          | 200         |
| 6  | Inkjet printer      | abs           | 150         |
| 7  | Dot-Matrix Printer  | PET-G         | 250         |
| 8  | LASER printer       | PLA           | 100         |
| 9  | dot matrix printer  | petg          | 250         |
| 10 | Inkjet Printer      | ABS           | 15         |
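
A typical cleanup for a table like this has two steps: normalize the text fields to one spelling, then drop rows that become identical. The sketch below does this for the rows above. The normalization rules (lowercase printer names, uppercase materials, hyphens removed) and the helper names are choices made for this example.

```python
import re

# Rows copied from the table above
rows = [
    (1, "Laser Printer", "PLA", 100),
    (2, "Inkjet Printer", "abs", 150),
    (3, "laser printer", "PLA", 100),
    (4, "InkJet Printer", "ABS", 150),
    (5, "Dot Matrix Printer", "PETG", 200),
    (6, "Inkjet printer", "abs", 150),
    (7, "Dot-Matrix Printer", "PET-G", 250),
    (8, "LASER printer", "PLA", 100),
    (9, "dot matrix printer", "petg", 250),
    (10, "Inkjet Printer", "ABS", 15),
]

def normalize_printer(name):
    """Lowercase, turn hyphens into spaces, and collapse repeated spaces."""
    return re.sub(r"\s+", " ", name.replace("-", " ")).strip().lower()

def normalize_material(name):
    """Uppercase and drop hyphens, so 'PET-G' and 'petg' both become 'PETG'."""
    return name.replace("-", "").strip().upper()

seen = set()
unique_rows = []
for _id, printer, material, hours in rows:
    key = (normalize_printer(printer), normalize_material(material), hours)
    if key not in seen:  # keep only the first occurrence of each record
        seen.add(key)
        unique_rows.append(key)

for row in unique_rows:
    print(row)
```

The ten rows collapse to five. Rows that differ only in usage hours (for example, the inkjet/ABS rows with 150 and 15 hours) are kept as separate records, because code alone cannot tell whether 15 is a typo or a genuine value. That decision needs someone who knows the data.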


### <center>[🏠 Home](../../welcomePage.ipynb)     [Module 2 ▶︎](Module2.ipynb)</center>