# Project Report: Domain-Specific Language (DSL) for Laboratory Data Analysis

**Course:** Principles of Compiler Design  
**Semester:** Fall 1404 (Winter 2026)  
**Team Members:** Mohammad Ali Ahmadian, Ali Jabbaripour, Seyed Ahmed Mousavi Malvajerdi

---

## 1. Executive Summary

This project designs and implements a Domain-Specific Language (DSL) that simplifies laboratory data analysis. Users (researchers or lab technicians) work with their data in **natural Persian** (or in English-like intermediate commands) instead of writing Python code by hand. The compiler translates these high-level instructions into executable Python scripts that use `pandas`, `matplotlib`, and `seaborn`.

## 2. System Architecture & Pipeline

The compiler operates through a four-stage pipeline designed to abstract complexity away from the user:

### Phase 1: Pre-processing (The Translation Layer)
Since parsing free-form natural Persian is complex, the system uses a Translation Engine based on Regular Expressions (Regex).

*   **Input:** Natural Persian sentences (e.g., *“Data ra az file data.csv barghozari kon”*, i.e., “Load the data from the file data.csv”).
*   **Process:** The engine maps Persian keywords and sentence structures to a standardized, English-like Intermediate Representation (IR).
*   **Output:** Structured IR (e.g., `LOAD "data.csv" INTO df`).
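
The mapping from sentences to IR can be sketched as a small table of regular expressions. This is a minimal illustration, not the project's actual rule set: the `RULES` table, the `translate` function, and the `FILTER Temperature > 100` IR form are assumptions. Only the `LOAD "data.csv" INTO df` output and the two input sentences come from this report.

```python
import re

# Hypothetical translation rules: (compiled pattern, IR template).
# The real engine presumably has many more rules and Persian-script patterns.
RULES = [
    (
        re.compile(r"^data ra az file (?P<path>\S+) barghozari kon$", re.IGNORECASE),
        'LOAD "{path}" INTO df',
    ),
    (
        re.compile(
            r'^filter where "(?P<col>\w+)" (?P<op>>=|<=|==|!=|>|<) '
            r"(?P<val>-?\d+(?:\.\d+)?)\.?$",
            re.IGNORECASE,
        ),
        "FILTER {col} {op} {val}",
    ),
]


def translate(sentence: str) -> str:
    """Translate one input sentence into a line of intermediate representation."""
    text = " ".join(sentence.split())  # collapse repeated whitespace
    for pattern, template in RULES:
        match = pattern.match(text)
        if match:
            return template.format(**match.groupdict())
    raise ValueError(f"No translation rule matches: {sentence!r}")


print(translate("Data ra az file data.csv barghozari kon"))
print(translate('Filter where "Temperature" > 100.'))
```

Keeping the rules in a data table rather than in code means new sentence forms can be added without touching the matching loop.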

### Phase 2: Lexical & Syntactic Analysis (Parsing)
*   **Tool:** The project utilizes the **Lark** parsing library for Python.
*   **Grammar:** An EBNF (Extended Backus-Naur Form) grammar defines the rules for the Intermediate Representation.
*   **Process:** The parser reads the IR and constructs an **Abstract Syntax Tree (AST)**, which represents the hierarchical structure of the logic.

### Phase 3: Semantic Analysis & Code Generation
*   **Mechanism:** A `CodeGenerator` class (acting as a Transformer/Visitor) traverses the AST.
*   **Logic:** For every node in the tree (e.g., a `PlotNode` or `FilterNode`), the generator constructs the equivalent valid Python code.
*   **Libraries Used:**
    *   `pandas` for data manipulation (filtering, cleaning, sorting).
    *   `matplotlib` / `seaborn` for visualization.
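
The traversal can be illustrated with a small visitor over hand-built nodes. This is a sketch under assumptions: the report names `CodeGenerator`, `PlotNode`, and `FilterNode`, but `LoadNode`, the node fields, and the `visit_*` method names are hypothetical, and the project's generator may instead be a Lark `Transformer` working directly on the parse tree.

```python
from dataclasses import dataclass


@dataclass
class LoadNode:  # hypothetical node type
    path: str


@dataclass
class FilterNode:  # fields are assumptions
    column: str
    op: str
    value: float


@dataclass
class PlotNode:  # fields are assumptions
    kind: str
    column: str


class CodeGenerator:
    """Visitor that turns a list of AST nodes into Python source text."""

    def generate(self, nodes):
        lines = ["import pandas as pd", "import matplotlib.pyplot as plt", ""]
        for node in nodes:
            # Dispatch on the node's class name, e.g. visit_FilterNode.
            visit = getattr(self, f"visit_{type(node).__name__}")
            lines.extend(visit(node))
        return "\n".join(lines) + "\n"

    def visit_LoadNode(self, node):
        return [f"df = pd.read_csv({node.path!r})"]

    def visit_FilterNode(self, node):
        return [f"df = df[df[{node.column!r}] {node.op} {node.value!r}]"]

    def visit_PlotNode(self, node):
        if node.kind != "histogram":
            raise NotImplementedError(f"plot kind not covered in this sketch: {node.kind}")
        title = f"Histogram of {node.column}"
        filename = f"plot_{node.column.lower()}.png"
        return [
            "plt.figure()",
            f"df[{node.column!r}].hist()",
            f"plt.title({title!r})",
            f"plt.savefig({filename!r})",
        ]


program = [
    LoadNode("experiment_results.csv"),
    FilterNode("Temperature", ">", 100),
    PlotNode("histogram", "Voltage"),
]
print(CodeGenerator().generate(program))
```

Using `repr` (`!r`) for column names and paths keeps quoting correct in the emitted source even when a name contains quotes.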

### Phase 4: Execution & Reporting
The system saves the generated code into `generated_code.py` and automatically executes the script.

*   **Outputs:**
    *   Statistical summaries saved to `report.txt`.
    *   Charts/Graphs saved as image files (PNG).
    *   Logs of the operation.

## 3. Functional Capabilities (Language Features)

The DSL supports a wide range of data science operations, categorized as follows:

### A. Data Management
*   **LOAD:** Supports importing data from CSV, Excel, and JSON formats.
*   **SAVE:** Exports the processed data to a new file.
*   **DUPLICATE:** Creates a copy of the current dataset for safe experimentation.

### B. Data Cleaning (CLEAN)
*   **Missing Values:** Automatically fills `NaN` values with the mean, median, or a specific constant.
*   **Outlier Removal:** Uses the Interquartile Range (IQR) method to detect and remove statistical outliers from specific columns.
*   **Type Conversion:** Standardizes numeric columns.
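
The missing-value and type-conversion steps might translate to pandas roughly as follows. The `fill_missing` helper and its `strategy` argument are illustrative names, not the code the compiler actually emits. IQR-based outlier removal is shown in the example workflow in Section 5.

```python
import pandas as pd


def fill_missing(df, column, strategy="mean", constant=None):
    """Coerce a column to numeric and fill its missing values."""
    out = df.copy()
    # Type conversion: entries that cannot be parsed as numbers become NaN.
    out[column] = pd.to_numeric(out[column], errors="coerce")

    if strategy == "mean":
        fill_value = out[column].mean()
    elif strategy == "median":
        fill_value = out[column].median()
    elif strategy == "constant":
        if constant is None:
            raise ValueError("strategy 'constant' needs a constant value")
        fill_value = constant
    else:
        raise ValueError(f"unknown strategy: {strategy}")

    out[column] = out[column].fillna(fill_value)
    return out


# Made-up sample: one missing entry and one non-numeric entry.
raw = pd.DataFrame({"Voltage": ["1.0", "3.0", None, "bad"]})
print(fill_missing(raw, "Voltage", strategy="mean"))
```

Note that coercion treats unparseable entries such as `"bad"` the same way as missing ones, so they are filled too.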

### C. Data Analysis & Inspection
*   **DESCRIBE / HEAD:** Generates statistical summaries (mean, std dev, count) or shows the first few rows of data.
*   **CALC:** Performs specific math operations (Mean, Max, Min, Sum, Count, Correlation) on selected columns.
*   **CORRELATE:** Calculates the correlation matrix between two or more variables.
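
A `CALC` command could be dispatched to pandas through a lookup table, as in this sketch. The `CALC_METHODS` table and the `calc` function are assumptions made for illustration. The keyword list is taken from the bullet above.

```python
import pandas as pd

# Assumed mapping from CALC keywords to pandas Series method names.
CALC_METHODS = {
    "MEAN": "mean",
    "MAX": "max",
    "MIN": "min",
    "SUM": "sum",
    "COUNT": "count",
}


def calc(df, operation, column, other=None):
    """Evaluate one CALC operation on a column (or a pair, for correlation)."""
    op = operation.upper()
    if op == "CORRELATION":
        if other is None:
            raise ValueError("CORRELATION needs a second column")
        return df[column].corr(df[other])  # Pearson correlation by default
    if op not in CALC_METHODS:
        raise ValueError(f"unsupported CALC operation: {operation}")
    return getattr(df[column], CALC_METHODS[op])()


# Made-up sample data for illustration.
sample = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [2.0, 4.0, 6.0]})
print(calc(sample, "MEAN", "x"))
print(calc(sample, "CORRELATION", "x", "y"))
```

For the other commands in this group, `df.describe()`, `df.head()`, and `df[["x", "y"]].corr()` are the pandas calls that most directly correspond to `DESCRIBE`, `HEAD`, and `CORRELATE`.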

### D. Data Transformation
*   **FILTER:**
    *   Simple: $\text{col} > \text{value}$
    *   Range: $\text{value}_1 < \text{col} < \text{value}_2$
    *   Complex: Supports logical operators (AND/OR).
*   **SORT:** Sorts data ascending or descending based on one or more columns.
*   **SEARCH:** Finds rows containing specific substrings within string columns.
*   **LEVELING (Binning):** Categorizes numerical data into levels (e.g., converting test scores into “Low”, “Medium”, “High”).
*   **GROUP BY:** Groups data by a categorical column and calculates aggregates (e.g., “Average temperature per City”).

### E. Visualization (PLOT)
The DSL supports generating various plot types with automatic labeling:
*   **Histogram:** For distribution analysis.
*   **Scatter Plot:** For relationship analysis between two variables.
*   **Box Plot:** For statistical distribution and outlier visualization.
*   **Line Chart:** For trend analysis.
*   **Bar Chart:** For categorical comparisons.

## 4. Technical Implementation Details

### The Grammar (EBNF)
The project defines specific grammar rules for statements. For example, a filter statement might look like this in the EBNF definition:
```ebnf
filter_stmt: "FILTER" condition
condition: col_name operator value
```

### AST Visualization
To debug the compiler, the system includes a `visualize_ast` function. It uses the `graphviz` library to generate a visual representation (`ast.dot`) of the parse tree, showing how the compiler interprets the user’s commands.

### Error Handling & Logging
*   **Syntactic Errors:** If the user inputs a command that doesn’t match the grammar, the parser returns a friendly error indicating the line number.
*   **Runtime Errors:** The generated Python code includes `try-except` blocks to handle issues like “File Not Found” or “Column does not exist” gracefully.

## 5. Example Workflow

### User Input (Persian/Mixed):
1.  Load file "experiment_results.csv".
2.  Clean outlier data in column "Voltage".
3.  Filter where "Temperature" > 100.
4.  Plot histogram of "Voltage".

### Generated Python Code (Simplified):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load
df = pd.read_csv("experiment_results.csv")

# Clean Outliers (IQR Logic)
Q1 = df['Voltage'].quantile(0.25)
Q3 = df['Voltage'].quantile(0.75)
IQR = Q3 - Q1
# Filter out data outside range: [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
df = df[~((df['Voltage'] < (Q1 - 1.5 * IQR)) | (df['Voltage'] > (Q3 + 1.5 * IQR)))]

# Filter
df = df[df['Temperature'] > 100]

# Plot
plt.figure()
df['Voltage'].hist()
plt.title("Histogram of Voltage")
plt.savefig("plot_voltage.png")
```


## 6. Conclusion

This project shows that a DSL can make laboratory data analysis accessible to people who do not program. By hiding the programming layer, it lets domain experts focus on the results of their experiments rather than on the implementation details of the analysis tools.