# <font color = 'red'>1. *Introduction to Data in Digital Engineering (Handling Unstructured Data)*</font>

<font size = 3>This module focuses on handling unstructured data in an engineering context, using height map scans as the working example. A height map on its own provides only raw measurements; it becomes meaningful once the relevant features are extracted from it. Feature extraction turns the unstructured scan into information that supports analysis and decision-making. The sections below cover the tools and techniques used for that extraction.
</font>
#### What is a 3D Scan?
A 3D scan is a digital representation of the three-dimensional shape of an object. A scanning device captures the size and shape of a physical object by recording data on its shape, color, and sometimes texture. The resulting scan forms a point cloud, a collection of data points in three-dimensional space, which can be used to create a 3D model. These scans are used across industries such as manufacturing, entertainment, healthcare, and archaeology to create models for analysis, reproduction, or digital rendering.

<center>
    <img src="Module 1 Content/img/01.jpg" alt="Alt Text" width="400"/>
</center>

#### Objective

This training module develops the key skills needed to work with unstructured data across its entire processing lifecycle. The 3D scan is one specific example, but the core concepts and techniques apply to many types of unstructured data in different fields. The module covers each stage, from raw data collection to feature extraction and analysis.

#### - Cleaning Noise from Data:
The module will cover techniques for identifying and removing irrelevant or extraneous data, similar to cleaning up noise in a 3D scan. This involves recognizing what constitutes useful information in a dataset and isolating the parts that are important for analysis. Whether it's removing unwanted noise in a height map or irrelevant data in other forms of unstructured data, this skill is crucial for ensuring that only relevant features are considered.
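
As a minimal sketch of this idea, the snippet below masks height readings outside a chosen range by replacing them with NaN, which is the same convention the interactive cells later in this notebook use. The function name `mask_outside_range`, the small `scan` array, and the bounds are illustrative assumptions, not part of the module's `functions` helpers.

```python
import numpy as np

def mask_outside_range(height_map, lower, upper):
    """Return a copy with values outside [lower, upper] replaced by NaN."""
    z = np.asarray(height_map, dtype=float).copy()
    with np.errstate(invalid="ignore"):  # comparisons against NaN are simply False
        outside = (z < lower) | (z > upper)
    z[outside] = np.nan
    return z

# Hypothetical 3x4 scan: one spike (9.0) and one missing reading (NaN)
scan = np.array([
    [0.1, 0.2, 0.1, 9.0],
    [0.0, np.nan, 0.2, 0.1],
    [0.1, 0.1, 0.0, 0.2],
])
cleaned = mask_outside_range(scan, lower=-1.0, upper=1.0)
kept = np.count_nonzero(np.isfinite(cleaned))
print(f"{kept} of {scan.size} readings kept")
```

Using NaN rather than deleting values keeps the array shape intact, so the masked data can still be plotted as a heatmap.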

#### - Data Calibration:
Calibration techniques are an essential part of improving the quality of unstructured data. By applying mathematical methods such as regression to the raw data, you can adjust the data points to better fit the expected values or to correct for errors in the data collection process. This ensures that the resulting dataset is as accurate as possible and ready for further analysis.
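
The following sketch illustrates regression-based calibration on a one-dimensional profile. It assumes a hypothetical flat reference whose readings drift linearly; the drift is estimated with `scipy.stats.linregress` and subtracted. The variable names and the synthetic numbers are assumptions chosen for illustration only.

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)
position = np.arange(200, dtype=float)

# Hypothetical readings of a flat reference: linear drift plus small noise
measured = 0.01 * position + 0.3 + rng.normal(0.0, 0.005, position.size)

# Estimate the drift, then subtract the fitted line from every reading
fit = linregress(position, measured)
calibrated = measured - (fit.slope * position + fit.intercept)

print(f"estimated slope: {fit.slope:.4f} per sample")
print(f"residual std after calibration: {calibrated.std():.4f}")
```

After the subtraction, what remains is the measurement noise around zero, which is what a flat reference should read.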

#### - Feature Extraction and Analysis:
After cleaning and calibrating the data, the next step is extracting meaningful features from it. This is where the core value of unstructured data lies: extracting useful information transforms raw data into insight. Whether fitting geometric shapes to 3D scan data or identifying key patterns in other types of unstructured data, this step is what turns raw measurements into actionable knowledge.


#### Press ▶️ to Select 3D Data to Load

In [5]:
import pandas as pd
import numpy as np
import os
from scipy.stats import linregress
import matplotlib.pyplot as plt
import seaborn as sns
from ipywidgets import Dropdown, interact
from IPython.display import clear_output
import sys
sys.path.append('Module 1 Content')  # Adjust the path as necessary

from functions import *



# Function to get a list of CSV files in the 'data' folder
def get_csv_files():
    data_dir = './Module 1 Content/data'  # Directory where CSV files are located
    return [os.path.join(data_dir, f) for f in os.listdir(data_dir) if f.endswith('.csv')]

# Function to load the selected CSV file into a global variable
def load_selected_csv(file):
    global data
    clear_output(wait=True)
    print(f"Loading {file}...")
    data = load_data(file)  # Assuming load_data is a custom function to load CSV into data
    plot(data)  # Assuming plot is a custom function to plot data

# Dropdown widget to select a CSV file
csv_files = get_csv_files()
file_selector = Dropdown(options=csv_files, description="Select file:")

# Interactive widget to update the global variable based on selected file
interact(load_selected_csv, file=file_selector);


### Plot Calibration Instructions

In the context of plot calibration, the primary goal is to ensure that the data we are working with is accurate and aligned, especially when it comes to working with physical objects or scans. The process described here involves using a flat and stable region of the scan, often referred to as the "bed," to help correct any distortions or biases present in the entire dataset.

The first step is to **select the bed region**. This flat area is crucial because it serves as a reference for what should be level or stable in the scan. By defining this region accurately using sliders, we ensure that we are working with a reliable base point from which to assess the rest of the data.

Next, we **fit a linear regression line** to the bed region. Why do we do this? A linear regression helps us capture the overall slope of the bed. Since the bed is expected to be flat, any deviation from flatness (such as tilting or unevenness) will be reflected in the slope of this line. Fitting a regression line allows us to quantify this slope mathematically, providing us with a precise measurement of how much the bed might be tilted or uneven.

Finally, we **apply slope calibration**. After obtaining the slope from the bed region, we use it to adjust the entire scan so that the bed reads as level. Correcting the whole dataset against this flat, stable reference removes the bias that an uneven surface or setup would otherwise introduce into later measurements and analysis.
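
The three steps can be sketched as follows. The notebook's own `correct_tilt` helper lives in the `functions` module and is not shown here, so this is not its implementation: the sketch swaps in a least-squares plane fit, which extends the regression line to both image axes. The function name `level_with_bed_plane` and the synthetic scan are assumptions.

```python
import numpy as np

def level_with_bed_plane(height_map, lower, upper):
    """Subtract a least-squares plane fitted to the bed pixels (illustrative only)."""
    z = np.asarray(height_map, dtype=float)
    rows, cols = np.indices(z.shape)
    # Step 1: select the bed region by height range; NaN comparisons are False
    with np.errstate(invalid="ignore"):
        bed = (z >= lower) & (z <= upper)
    # Step 2: fit z ≈ a*col + b*row + c using bed pixels only
    design = np.column_stack([cols[bed], rows[bed], np.ones(bed.sum())])
    (a, b, c), *_ = np.linalg.lstsq(design, z[bed], rcond=None)
    # Step 3: remove the fitted plane from every pixel; NaNs stay NaN
    return z - (a * cols + b * rows + c)

# Hypothetical scan: tilted bed plus a 2 mm tall block
rows, cols = np.indices((50, 60))
scan = 0.001 * cols + 0.002 * rows + 0.3
scan[20:30, 25:40] += 2.0

leveled = level_with_bed_plane(scan, lower=-1.0, upper=1.0)
print("max bed residual:", np.abs(leveled[:10, :10]).max())
print("block height:", leveled[25, 30])
```

Because only bed pixels enter the fit, the block does not pull the plane upward, and its height is measured relative to the leveled bed.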



<center>
    <img src="Module 1 Content/img/02.jpg" alt="Alt Text" width="800"/>
</center>

#### Press ▶️ to Select Bed Region for Calibration <font color = 'green'>(Scan Bed)</font>

In [6]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from ipywidgets import FloatSlider, interactive
from IPython.display import display

# Assuming 'data' is your original 2D numpy array
data_temp = np.copy(data)  # Use the original data directly with NaNs preserved

# Flatten the data to prepare for the scatter plot
points = data_temp.flatten()

# Initialize the bounds that later cells read (module-level names are already global)
upper_bound_plate = np.nanmax(points)
lower_bound_plate = np.nanmin(points)

def update_plots(y1, y2):
    global upper_bound_plate, lower_bound_plate
    # Update global variables with the slider values
    upper_bound_plate = y2
    lower_bound_plate = y1
    
    # Create a mask for values outside the slider range, respecting NaNs
    mask = (data_temp < y1) | (data_temp > y2)
    filtered_data = np.copy(data_temp)
    filtered_data[mask] = np.nan  # Set values outside the range to NaN
    
    # Plotting using subplots
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
    
    # Scatter plot on the first subplot
    valid_points = ~np.isnan(points)  # Mask to remove NaNs for the scatter plot
    ax1.scatter(np.arange(len(points))[valid_points], points[valid_points], marker='o', linestyle='-', s=0.01)
    ax1.set_title('Flattened Array Values')
    ax1.set_xlabel('Index')
    ax1.set_ylabel('Height Value')
    ax1.grid(True)
    ax1.axhline(y=lower_bound_plate, color='r', linestyle='--')
    ax1.axhline(y=upper_bound_plate, color='g', linestyle='--')
    
    # Heatmap on the second subplot
    sns.heatmap(filtered_data, cmap='viridis', cbar=False, ax=ax2)
    ax2.set_aspect(aspect= 'equal')
    ax2.set_title('Filtered Data Heatmap')
    ax2.set_xticks([])
    ax2.set_yticks([])
    
    plt.tight_layout()
    plt.show()

# Set up sliders for the interactive lines
y1_slider = FloatSlider(min=np.nanmin(points), max=np.nanmax(points), step=0.01, value=np.nanmin(points), description='Minimum')
y2_slider = FloatSlider(min=np.nanmin(points), max=np.nanmax(points), step=0.01, value=np.nanmax(points), description='Maximum')

interactive_plot = interactive(update_plots, y1=y1_slider, y2=y2_slider)
output = interactive_plot.children[-1]
display(interactive_plot)



#### Press ▶️ to Start Calibration

In [7]:
import ipywidgets as widgets
from IPython.display import display

button = widgets.Button(description="Calibrate Scan")
output = widgets.Output()

def on_button_clicked(b):
    global data  # Make sure to modify the global 'data' variable
    with output:
        data = correct_tilt(data, lower_bound_plate, upper_bound_plate)
        print("Data has been updated.")

button.on_click(on_button_clicked)
display(button, output)


### Feature Extraction Instructions

In the feature extraction process, the goal is to focus on the relevant part of the data while removing any unnecessary information that could distract from the analysis. The first step in this process is to **define the area** of interest.

To do this, you carefully adjust the sliders to select the region of the plot that contains the part you are most concerned with, whether it's a specific feature or the object you're studying. This allows you to isolate the relevant data from the rest of the plot. The **bed**, which we previously calibrated, is not part of the area of interest, so it must be excluded from this selection. By removing the bed, we ensure that the data we're working with is focused on the specific object or feature of interest, allowing for a more precise analysis. 

This process helps us filter out irrelevant data and concentrate on what truly matters, ensuring that our subsequent steps in analysis, such as fitting models or extracting more detailed features, are based only on the data that is most important for the task at hand.


#### Press ▶️ to select region of interest for feature extraction: <font color = 'red'>(Part)</font>

In [8]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from ipywidgets import FloatSlider, interactive
from IPython.display import display

# Work on a copy of the current 'data' array (tilt-corrected if the calibration step was run)
data_temp = np.copy(data)  # NaNs are preserved

# Flatten the data to prepare for the scatter plot
points = data_temp.flatten()

# Initialize the bounds that the feature-fitting cell reads (module-level names are already global)
upper_bound = np.nanmax(points)
lower_bound = np.nanmin(points)

def update_plots(y1, y2):
    global upper_bound, lower_bound
    # Update global variables with the slider values
    upper_bound = y2
    lower_bound = y1
    
    # Create a mask for values outside the slider range, respecting NaNs
    mask = (data_temp < y1) | (data_temp > y2)
    filtered_data = np.copy(data_temp)
    filtered_data[mask] = np.nan  # Set values outside the range to NaN
    
    # Plotting using subplots
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
    
    # Scatter plot on the first subplot
    valid_points = ~np.isnan(points)  # Mask to remove NaNs for the scatter plot
    ax1.scatter(np.arange(len(points))[valid_points], points[valid_points], marker='o', linestyle='-', s=0.01)
    ax1.set_title('Flattened Array Values')
    ax1.set_xlabel('Index')
    ax1.set_ylabel('Height Value')
    ax1.grid(True)
    ax1.axhline(y=lower_bound, color='r', linestyle='--')
    ax1.axhline(y=upper_bound, color='g', linestyle='--')
    
    # Heatmap on the second subplot
    sns.heatmap(filtered_data, cmap='viridis', cbar=False, ax=ax2)
    ax2.set_aspect(aspect= 'equal')
    ax2.set_title('Filtered Data Heatmap')
    ax2.set_xticks([])
    ax2.set_yticks([])
    
    plt.tight_layout()
    plt.show()

# Set up sliders for the interactive lines
y1_slider = FloatSlider(min=np.nanmin(points), max=np.nanmax(points), step=0.01, value=np.nanmin(points), description='Minimum')
y2_slider = FloatSlider(min=np.nanmin(points), max=np.nanmax(points), step=0.01, value=np.nanmax(points), description='Maximum')

interactive_plot = interactive(update_plots, y1=y1_slider, y2=y2_slider)
output = interactive_plot.children[-1]
display(interactive_plot)



### Rationale for Image Conversion in Feature Extraction

In engineering, rather than developing custom algorithms from scratch, we often seek existing, well-established solutions that can be applied to our specific problem. This approach is both efficient and practical, allowing us to leverage the expertise and resources of the broader engineering and research communities.

In the case of feature extraction from 3D scan data, the decision to **convert the scan into an image format** is driven by the need to simplify the data for analysis. By transforming the 3D data into a 2D image, we can more easily apply a range of proven image processing techniques that are specifically designed to work with visual data. This conversion step is essential because it facilitates the use of tools and methods that have been optimized for shape recognition and feature extraction.

The choice to use **image processing tools** is based on the availability of robust, efficient libraries in the image processing field. Python in particular offers well-developed libraries for image-based feature extraction. Libraries such as OpenCV and scikit-image are used extensively in both academic and industrial applications for tasks such as edge detection, pattern recognition, and shape fitting. By converting the 3D scan into an image, we can apply these widely used tools directly and focus on the problem at hand rather than reimplementing existing algorithms.
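
One possible way to perform such a conversion is sketched below: the height values are rescaled linearly to the 0 to 255 range of an 8-bit grayscale image, and missing readings are mapped to 0. The function name `height_map_to_uint8` and the choice to send NaN to black are assumptions made for illustration; the notebook's feature cell thresholds the heights directly into a binary image instead.

```python
import numpy as np

def height_map_to_uint8(height_map):
    """Linearly rescale finite heights to 0..255; NaN pixels become 0."""
    z = np.asarray(height_map, dtype=float)
    finite = np.isfinite(z)
    lo, hi = z[finite].min(), z[finite].max()
    # Guard against a perfectly flat map, where hi == lo
    scale = 255.0 / (hi - lo) if hi > lo else 0.0
    image = np.zeros(z.shape, dtype=np.uint8)
    image[finite] = np.round((z[finite] - lo) * scale).astype(np.uint8)
    return image

# Hypothetical 2x2 height map with one missing reading
demo = np.array([[0.0, 1.0],
                 [np.nan, 4.0]])
image = height_map_to_uint8(demo)
print(image.dtype, image.shape)
print(image)
```

An array of type `uint8` is the format most OpenCV and scikit-image routines expect for grayscale input.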


### Explanation of the Code and its Functionality

The code in the next cell extracts a contour from a binary image of the selected region and fits a geometric shape to it. The following steps outline the core functionality:

1. **Binary Data Preparation**: The data is first thresholded into a binary format: pixels whose height falls within the selected range (`lower_bound` to `upper_bound`) are marked as `1` (white), and all other pixels are marked as `0` (black). This step isolates the relevant features from the background.

2. **Contour Detection**: The contours in the binary image are identified using the `cv2.findContours()` function. Contours represent the boundaries of connected regions in the binary image. The largest contour by area is selected for further analysis because it is assumed to correspond to the object of interest. It is then simplified with `cv2.approxPolyDP()`, where the Epsilon Factor slider sets the tolerance as a fraction of the contour's perimeter; larger values give a coarser polygon.

3. **Shape Fitting**: Based on the selected shape (Rectangle, Triangle, Square, Circle, or Ellipse), the code fits a specific geometric model to the detected contour.

   - **Rectangle**: The code uses `cv2.minAreaRect()` to fit a rectangle to the contour, and then the rectangle’s corners are extracted and scaled to physical units (in millimeters). The width and height of the rectangle are printed as part of the output.
   
   - **Triangle**: The function `cv2.minEnclosingTriangle()` fits the smallest enclosing triangle around the contour. The vertices of the triangle are extracted and scaled, and the coordinates of these vertices are printed.
   
   - **Square**: Similar to the rectangle fitting, a square is fitted by taking the longer side of the minimum-area rectangle and using that length for all sides. The square is centered on that rectangle and drawn axis-aligned, so the rectangle's rotation is not carried over. Its side length is printed.
   
   - **Circle**: The function `cv2.minEnclosingCircle()` is used to fit the smallest enclosing circle to the contour. The center and radius of the circle are calculated, scaled, and displayed in the output.

   - **Ellipse**: If the simplified contour has at least 5 points, the code uses `cv2.fitEllipse()` to fit an ellipse to it. The center and the two axis lengths are scaled and displayed together with the orientation angle. With a large Epsilon Factor the simplified contour may have fewer than 5 points, in which case no ellipse is fitted.

#### Press ▶️ to select feature

In [9]:
import numpy as np
import cv2
import matplotlib.pyplot as plt
from matplotlib.patches import Polygon, Ellipse
import ipywidgets as widgets
from IPython.display import display

# Define the scale factor (0.02 mm per pixel)
scale_factor = 0.02

# Define the function to execute upon shape and epsilon selection
def fit_and_plot(shape, epsilon_factor):
    binary_data = np.where((data >= lower_bound) & (data <= upper_bound), 1, 0)
    contours, _ = cv2.findContours(binary_data.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

    if contours:
        largest_contour = max(contours, key=cv2.contourArea)
        epsilon = epsilon_factor * cv2.arcLength(largest_contour, True)
        approx = cv2.approxPolyDP(largest_contour, epsilon, True)

        plt.figure(figsize=(6, 6))
        plt.imshow(binary_data, cmap='gray', origin='lower', extent=[0, binary_data.shape[1] * scale_factor, 0, binary_data.shape[0] * scale_factor])
        
        if shape == 'Rectangle':
            rect = cv2.minAreaRect(approx)
            box = cv2.boxPoints(rect) * scale_factor
            plt.gca().add_patch(Polygon(box, closed=True, color='red', fill=False, linewidth=2))
            print(f"Rectangle width: {rect[1][0] * scale_factor} mm, height: {rect[1][1] * scale_factor} mm")
        
        elif shape == 'Triangle':
            # minEnclosingTriangle returns (area, vertices); reshape the vertices to (3, 2)
            triangle = cv2.minEnclosingTriangle(approx)[1].reshape(-1, 2) * scale_factor
            plt.gca().add_patch(Polygon(triangle, closed=True, color='blue', fill=False, linewidth=2))
            print("Triangle vertices (mm):", triangle)
        
        elif shape == 'Square':
            rect = cv2.minAreaRect(approx)
            side = max(rect[1]) * scale_factor
            center = np.array(rect[0]) * scale_factor
            box = np.array([
                [center[0] - side / 2, center[1] - side / 2],
                [center[0] + side / 2, center[1] - side / 2],
                [center[0] + side / 2, center[1] + side / 2],
                [center[0] - side / 2, center[1] + side / 2],
            ])
            plt.gca().add_patch(Polygon(box, closed=True, color='green', fill=False, linewidth=2))
            print(f"Square side length: {side} mm")
        
        elif shape == 'Circle':
            center, radius = cv2.minEnclosingCircle(approx)
            center_scaled = np.array(center) * scale_factor
            radius_scaled = radius * scale_factor
            circle_patch = plt.Circle(center_scaled, radius_scaled, color='orange', fill=False, linewidth=2)
            plt.gca().add_patch(circle_patch)
            print(f"Circle radius: {radius_scaled} mm")
        
        elif shape == 'Ellipse':
            if len(approx) >= 5:  # Ensure there are at least 5 points to fit an ellipse
                ellipse = cv2.fitEllipse(approx)
                ellipse_center = np.array(ellipse[0]) * scale_factor
                axes = np.array(ellipse[1]) * scale_factor
                angle = ellipse[2]
                plt.gca().add_patch(Ellipse(xy=ellipse_center, width=axes[0], height=axes[1], angle=angle, edgecolor='purple', fill=False, linewidth=2))
                print(f"Ellipse center: {ellipse_center} mm, axes: {axes} mm, angle: {angle}")
            else:
                print("Not enough points to fit an ellipse.")


        plt.xlabel('X (mm)')
        plt.ylabel('Y (mm)')
        plt.show()
    else:
        print("No contours found in the binary image.")

# Widgets for shape and epsilon factor selection
shape_selector = widgets.Dropdown(options=['Rectangle', 'Triangle', 'Square', 'Ellipse', 'Circle'], description='Shape:')
epsilon_slider = widgets.FloatSlider(value=0.01, min=0.0, max=0.1, step=0.005, description='Epsilon Factor:', readout_format='.3f')

# Link widgets to the function
interactive_plot = widgets.interactive_output(fit_and_plot, {'shape': shape_selector, 'epsilon_factor': epsilon_slider})

# Display the widgets and the plot
display(widgets.VBox([shape_selector, epsilon_slider]), interactive_plot)



The 3D scans captured by technicians or engineers contain valuable data, but in raw, unstructured form they carry little meaning. The data is a collection of points and measurements without context, and without further processing it is not directly useful for analysis or decision-making. This is common in engineering, where raw scans and measurements often need additional interpretation before they can be applied.

Once we calibrate for tilt, isolate the area of interest, and fit geometric shapes, the raw data becomes a set of structured, quantitative metrics that describe the scanned object. These metrics can be integrated with other engineering parameters.

By joining the extracted features with existing engineering data, we gain a more complete understanding of the object being studied. Instead of working with ambiguous, unprocessed scan data, engineers can rely on refined metrics that give a clearer picture of the object's characteristics and support better-informed decisions.
