# Data Generation Blueprint

## Introduction

Welcome to the **Data Generation Blueprint** notebook for Darepo! This notebook serves as the foundation for designing and building the data generation functionality of the Darepo web app.

**Goal**: *To create a flexible and scalable schema for generating realistic, structured tabular data that can be used for application testing, training ML and AI models, statistical analysis, and educational purposes.*

This notebook:
- Defines the different categories of data types (e.g., dates, categorical, numerical, and text).
- Breaks down these categories into subcategories and provides examples.
- Outlines constraints, relationships, and metadata attributes needed to create meaningful and realistic dummy data.

## Data Schema Design

This section defines and organizes the different categories of data types used in Darepo. The structure ensures a flexible and scalable schema, covering a variety of business scenarios while generating realistic dummy data. Each category is broken down into subcategories to capture the different ways data may appear in real datasets.

#### 1. Date

This category includes any time-related data, such as timestamps, calendar dates, and times of day. It is divided into:
   - **Timestamps**: Complete date and time values (e.g., `2024-10-13 14:35:00`).
   - **Dates Only**: Year, month, and day (e.g., `2024-10-13`).
   - **Times Only**: Hours, minutes, and seconds (e.g., `14:35:00`).
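
As a minimal sketch of how the three subcategories relate, the snippet below draws one random timestamp and derives the date-only and time-only values from it. The helper name `random_timestamp`, the 2024 date window, and the seed are illustrative assumptions, not part of Darepo.

```python
import random
from datetime import datetime, timedelta


def random_timestamp(rng, start, end):
    """Return a datetime drawn uniformly (to the second) from [start, end]."""
    span_seconds = int((end - start).total_seconds())
    return start + timedelta(seconds=rng.randint(0, span_seconds))


rng = random.Random(42)  # seeded so the sample is reproducible
start = datetime(2024, 1, 1)               # assumed window start
end = datetime(2024, 12, 31, 23, 59, 59)   # assumed window end

ts = random_timestamp(rng, start, end)

timestamp_value = ts.strftime("%Y-%m-%d %H:%M:%S")  # Timestamps
date_value = ts.date().isoformat()                  # Dates Only
time_value = ts.time().isoformat()                  # Times Only

print(timestamp_value, date_value, time_value)
```

Deriving the date and time from a single timestamp keeps the three representations consistent with each other when they appear in the same row.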


#### 2. Categorical

This category contains all non-numerical data that falls into distinct groups or categories. It is further divided into:
   - **Static Categories**: Fixed categories like product types, locations, or gender (e.g., “Electronics,” “New York,” “Male”).
   - **Dynamic Categories**: Context-specific categories that may vary over time (e.g., order status like “Pending,” “Shipped,” “Delivered”).
   - **Identifiers**: Unique strings such as product IDs, customer IDs, or user names.
   - **Boolean Categories**: Simple true/false or yes/no fields.
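
The four categorical subcategories could be generated along these lines. This is only a sketch: the value lists, the `CUST-` ID format, the status weights, and the `make_row` helper are hypothetical choices made for illustration.

```python
import random

rng = random.Random(7)  # seeded for reproducibility

# Static category: a fixed list of allowed values (hypothetical examples).
PRODUCT_TYPES = ["Electronics", "Clothing", "Groceries"]
# Dynamic category: values that depend on where a record is in its lifecycle.
ORDER_STATUSES = ["Pending", "Shipped", "Delivered"]


def make_row(rng, index):
    """Build one row containing each categorical subcategory."""
    return {
        # Identifier: unique by construction because it is derived from the index.
        "customer_id": f"CUST-{index:05d}",
        "product_type": rng.choice(PRODUCT_TYPES),
        # Weighted choice, assuming most orders have already been delivered.
        "order_status": rng.choices(ORDER_STATUSES, weights=[0.2, 0.3, 0.5])[0],
        # Boolean category: True or False with equal probability.
        "is_member": rng.random() < 0.5,
    }


rows = [make_row(rng, i) for i in range(1, 6)]
for row in rows:
    print(row)
```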
   

#### 3. Numerical

Numerical data is divided into continuous and discrete values, with two further subcategories for derived values and bounded ranges:
   - **Continuous Data**: Numeric values that can take on a wide range of values, often including decimals (e.g., prices, weights, temperatures).
   - **Discrete Data**: Numeric values counted in whole numbers (e.g., quantities, counts, ratings on a scale).
   - **Derived or Computed Data**: Values calculated based on other columns (e.g., “Total Price” = “Quantity” * “Price per Unit”).
   - **Ranges**: For numbers falling within a defined interval (e.g., ages from 18 to 65).
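
A rough sketch of the numerical subcategories in one record follows. The column names, the price bounds, and the `make_order` helper are assumptions for illustration; only the 18 to 65 age range and the "Total Price" formula come from the list above.

```python
import random

rng = random.Random(0)  # seeded for reproducibility


def make_order(rng):
    """Build one record with discrete, continuous, ranged, and derived values."""
    quantity = rng.randint(1, 10)                    # discrete: whole-number count
    unit_price = round(rng.uniform(5.0, 200.0), 2)   # continuous: price with decimals
    customer_age = rng.randint(18, 65)               # range: bounded to 18..65 inclusive
    total_price = round(quantity * unit_price, 2)    # derived: computed from other columns
    return {
        "quantity": quantity,
        "unit_price": unit_price,
        "customer_age": customer_age,
        "total_price": total_price,
    }


orders = [make_order(rng) for _ in range(100)]
print(orders[0])
```

Computing the derived column after the source columns are drawn, rather than sampling it independently, is what keeps the row internally consistent.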
   

#### 4. Text Data (Separate Category)

Text can sometimes be treated as categorical data, but it is kept as its own category because its length and structure vary widely:
   - **Short Text**: Fields like "First Name" or "Job Title" that have a limited length.
   - **Long Text**: Descriptions, comments, or any unstructured text (e.g., product descriptions, customer feedback).
   - **Structured Text Patterns**: Text data that follows specific formats (e.g., email addresses, phone numbers, postal codes).
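
One possible way to generate each kind of text is sketched below. The name and word lists, the `example.com` email shape, and the `NNN-NNN-NNNN` phone format are assumptions chosen for the example; real formats vary by locale.

```python
import random
import re
import string

rng = random.Random(1)  # seeded for reproducibility

FIRST_NAMES = ["Ada", "Grace", "Alan", "Linus"]  # hypothetical sample values
WORDS = ["fast", "reliable", "compact", "durable", "quiet", "light"]

# Patterns that the structured text generated below is expected to match.
EMAIL_PATTERN = re.compile(r"^[a-z]+\.[0-9]{3}@example\.com$")
PHONE_PATTERN = re.compile(r"^\d{3}-\d{3}-\d{4}$")


def short_text(rng):
    """Short text: pick a value from a small vocabulary."""
    return rng.choice(FIRST_NAMES)


def long_text(rng, n_words=12):
    """Long text: join random words into one unstructured sentence."""
    sentence = " ".join(rng.choice(WORDS) for _ in range(n_words))
    return sentence.capitalize() + "."


def email(rng, first_name):
    """Structured text: build an address from a name plus a 3-digit suffix."""
    return f"{first_name.lower()}.{rng.randint(0, 999):03d}@example.com"


def phone(rng):
    """Structured text: ten random digits in NNN-NNN-NNNN form."""
    digits = "".join(rng.choice(string.digits) for _ in range(10))
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"


name = short_text(rng)
description = long_text(rng)
email_value = email(rng, name)
phone_value = phone(rng)
print(name, email_value, phone_value)
print(description)
```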

#### 5. Hierarchical or Relational Data

Some datasets have hierarchical or nested relationships (e.g., categories and subcategories, regions and cities). Including this helps design columns that connect related data (e.g., "Country" -> "State" -> "City"). This is particularly useful for schemas involving inventory or sales data, where hierarchical relationships are common.
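
A nested mapping is one simple way to model the "Country" -> "State" -> "City" chain so that generated rows never pair a city with the wrong state. The `LOCATIONS` data and the `random_location` helper below are illustrative assumptions, not a fixed Darepo structure.

```python
import random

rng = random.Random(3)  # seeded for reproducibility

# Hypothetical hierarchy: country -> state/province -> list of cities.
LOCATIONS = {
    "United States": {
        "New York": ["New York City", "Buffalo"],
        "California": ["Los Angeles", "San Diego"],
    },
    "Canada": {
        "Ontario": ["Toronto", "Ottawa"],
    },
}


def random_location(rng):
    """Sample top-down so each level is valid for the level above it."""
    country = rng.choice(sorted(LOCATIONS))
    state = rng.choice(sorted(LOCATIONS[country]))
    city = rng.choice(LOCATIONS[country][state])
    return {"country": country, "state": state, "city": city}


locations = [random_location(rng) for _ in range(20)]
print(locations[0])
```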

#### 6. Data Integrity and Constraints

Defining constraints and rules ensures the generated data is meaningful and realistic:
   - **Value Constraints**: For example, age values must fall between 0 and 120.
   - **Uniqueness**: Fields like "Email" or "ID" must be unique.
   - **Foreign Key Relationships**: Values in a linked column (e.g., Customer ID) must match existing records in another table.
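
The three constraint types can be enforced at generation time and then verified afterwards. The sketch below assumes two hypothetical tables, `customers` and `orders`, and a hypothetical `check_constraints` helper that returns a list of violations.

```python
import random

rng = random.Random(5)  # seeded for reproducibility


def unique_emails(rng, n):
    """Uniqueness: keep drawing until n distinct addresses exist."""
    seen = set()
    while len(seen) < n:
        seen.add(f"user{rng.randint(0, 9999):04d}@example.com")
    return sorted(seen)


# Value constraint: ages are drawn only from the allowed 0..120 range.
customers = [
    {"customer_id": i, "email": e, "age": rng.randint(0, 120)}
    for i, e in enumerate(unique_emails(rng, 5), start=1)
]

# Foreign key: every order picks its customer_id from existing customers.
customer_ids = [c["customer_id"] for c in customers]
orders = [
    {"order_id": i, "customer_id": rng.choice(customer_ids)}
    for i in range(1, 11)
]


def check_constraints(customers, orders):
    """Return a list of human-readable violations (empty if all rules hold)."""
    problems = []
    emails = [c["email"] for c in customers]
    if len(emails) != len(set(emails)):
        problems.append("duplicate email")
    for c in customers:
        if not 0 <= c["age"] <= 120:
            problems.append(f"age out of range for customer {c['customer_id']}")
    known_ids = {c["customer_id"] for c in customers}
    for o in orders:
        if o["customer_id"] not in known_ids:
            problems.append(f"order {o['order_id']} references unknown customer")
    return problems


print(check_constraints(customers, orders))
```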

#### 7. Metadata for Columns

Metadata documents each column’s purpose, data type, and constraints, making it easier to generate realistic and relevant data. Examples include:
   - **Data Type**: Specifies whether the column is date, categorical, numerical, text, etc.
   - **Range**: Defines the minimum and maximum values allowed for numerical data.
   - **Format**: Specifies the expected pattern for date columns or structured text (e.g., `DD-MM-YYYY` for dates, or a fixed pattern for phone numbers).
   - **Nullability**: Indicates whether the column can have missing (null) values.
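
Column metadata of this kind could be captured in a small record type. The `ColumnSpec` dataclass, its field names, and the `is_valid` helper below are hypothetical; they show one way to carry data type, range, format, and nullability together, assuming date formats are written as `strftime` patterns.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class ColumnSpec:
    """Metadata for one column (field names are illustrative)."""
    name: str
    data_type: str                   # "date", "categorical", "numerical", or "text"
    minimum: Optional[float] = None  # Range: lower bound for numerical data
    maximum: Optional[float] = None  # Range: upper bound for numerical data
    fmt: Optional[str] = None        # Format: strftime pattern for date columns
    nullable: bool = False           # Nullability: whether None is allowed


def is_valid(spec, value):
    """Check a single value against the column's metadata."""
    if value is None:
        return spec.nullable
    if spec.data_type == "numerical":
        if spec.minimum is not None and value < spec.minimum:
            return False
        if spec.maximum is not None and value > spec.maximum:
            return False
        return True
    if spec.data_type == "date" and spec.fmt is not None:
        try:
            datetime.strptime(value, spec.fmt)
        except ValueError:
            return False
    return True


age = ColumnSpec("age", "numerical", minimum=0, maximum=120)
order_date = ColumnSpec("order_date", "date", fmt="%d-%m-%Y", nullable=True)

print(is_valid(age, 35), is_valid(age, 130), is_valid(order_date, "13-10-2024"))
```

Keeping the metadata separate from the generator means the same spec can drive both data generation and validation of the result.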