# Architecture Diagram

This notebook builds a **system architecture diagram** with the `diagrams` library. The diagram shows the components of a PDF text-extraction system (**frontend**, **backend**, **database**, **orchestration pipeline**, and **external services**) and how data flows between them. Here's a breakdown:

1. **User Access**:
   - The diagram starts with a **User** interacting with the **Streamlit frontend**.

2. **Frontend Cluster**:
   - The **Frontend** section contains a **Streamlit server**. The user interacts with Streamlit, which passes API calls to the backend.

3. **Backend (FastAPI) Cluster**:
   - The **FastAPI server** handles the backend logic: it processes user requests and talks to the other services. **Swagger UI**, a tool for API documentation and testing, sits in the same cluster; in the code, the edge to it is drawn from the Streamlit frontend.

4. **Database**:
   - A **PostgreSQL database** stores user credentials and any PDF data. The FastAPI server uses a **SQL API Key** to interact with the database.

5. **Airflow Pipeline**:
   - **Airflow** orchestrates the data pipeline for PDF extraction tasks. FastAPI triggers it, and it fans out to the two extraction tools in the next cluster.

6. **PDF Extraction Cluster**:
   - Two tools are responsible for extracting text from PDFs:
     - **PyPDF**: A Python library for working with PDF files.
     - **OpenAI Text Extractor**: A service that uses OpenAI's API for extracting and processing text from PDFs.

7. **AWS S3 Storage**:
   - Extracted data from the PDF services is stored in **AWS S3**, with communication secured using an **S3 API Key**.

8. **OpenAI API**:
   - The **OpenAI API** is used for fetching answers and for tasks like text generation. The first version of the diagram draws it as a generic compute node (`Rack`); the second uses a `Custom` node with the OpenAI logo. FastAPI sends queries to OpenAI, and the responses are forwarded back to the frontend.

9. **Data Flow and Communication**:
   - The arrows and edges depict how data flows between components, with **API keys** used for authentication and security at various stages. For example:
     - The user interacts with Streamlit, which sends API requests (secured via JWT authentication) to FastAPI.
     - FastAPI communicates with the database (using an SQL API key), Airflow (via an Airflow API key), and OpenAI (to fetch answers).
     - Extracted PDF content is stored in S3, and the OpenAI API is also used for text extraction.
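
The flow described above can also be written down as plain data, which makes it easy to check that every component is connected. The following is a minimal sketch, not part of the notebook: the node names and edge labels are copied from the `Edge(...)` calls in the diagram code, while `EDGES` and `reachable` are illustrative names introduced here.

```python
from collections import deque

# Directed edges (source, target, label), mirroring the arrows in the diagram code.
# An empty label means the arrow is drawn without an Edge(label=...).
EDGES = [
    ("User", "Streamlit", ""),
    ("Streamlit", "FastAPI", "API Key: JWT Auth"),
    ("FastAPI", "SQL Database", "SQL API Key"),
    ("Streamlit", "Swagger UI", ""),
    ("FastAPI", "Airflow Pipeline", "Airflow API Key"),
    ("Airflow Pipeline", "PyPDF", ""),
    ("Airflow Pipeline", "OpenAI Text Extractor", ""),
    ("PyPDF", "AWS S3", "S3 API Key"),
    ("OpenAI Text Extractor", "AWS S3", "S3 API Key"),
    ("FastAPI", "OpenAI API", "Fetch Answer from OpenAI"),
    ("OpenAI API", "Streamlit", "Use FastAPI to Streamlit"),
    ("OpenAI API", "OpenAI Text Extractor", "Text Extraction"),
]


def reachable(start, edges=EDGES):
    """Return the set of nodes reachable from `start` (including itself) via BFS."""
    graph = {}
    for src, dst, _label in edges:
        graph.setdefault(src, []).append(dst)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

Calling `reachable("User")` returns every node in the diagram, which confirms that a user request can, in principle, end up in S3 through FastAPI and Airflow.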

This diagram gives an overview of how the components (**frontend, backend, storage, and external services**) work together in a data-driven architecture.

In [1]:
!pip install diagrams


Defaulting to user installation because normal site-packages is not writeable
Collecting diagrams
  Using cached diagrams-0.23.4-py3-none-any.whl (24.6 MB)
Collecting typed-ast<2.0.0,>=1.5.4
  Using cached typed_ast-1.5.5-cp39-cp39-win_amd64.whl (139 kB)
Collecting graphviz<0.21.0,>=0.13.2
  Downloading graphviz-0.20.3-py3-none-any.whl (47 kB)
Installing collected packages: typed-ast, graphviz, diagrams
Successfully installed diagrams-0.23.4 graphviz-0.20.3 typed-ast-1.5.5




In [2]:
pip install Pillow


Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.




In [3]:
import os
print(os.getcwd())  # Prints the current working directory



f:\NORTHEASTERN\DAMG 7245\Git Repo\Assignment2\Automated Text Extraction\ArchitectureDiagram


In [4]:
import os
print(os.listdir())  # List all files in the current directory


['ArchitectureDiagramCode.ipynb']


In [6]:
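# google.colab exists only inside Google Colab; on a local kernel this import fails (see the error below).
from google.colab import files
files.download("complex_architecture_diagram_openai_explicit.png")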


ModuleNotFoundError: No module named 'google.colab'
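
The error above comes from running a Colab-only import on a local kernel. One way to make the same cell work in both environments is sketched below, assuming the goal is simply to get hold of the generated PNG. `in_colab` and `fetch_diagram` are hypothetical helper names, not part of the notebook or of any library.

```python
import importlib.util
from pathlib import Path


def in_colab():
    """Return True when the google.colab package is importable (a Colab runtime)."""
    try:
        return importlib.util.find_spec("google.colab") is not None
    except ImportError:
        # Raised when the parent package `google` is not installed at all.
        return False


def fetch_diagram(png_name):
    """Trigger a browser download in Colab; otherwise just return the local path."""
    path = Path(png_name).resolve()
    if not path.exists():
        raise FileNotFoundError(f"{path} has not been generated yet")
    if in_colab():
        from google.colab import files  # only reached inside Colab
        files.download(str(path))
    return path
```

On a local kernel, `fetch_diagram(...)` returns the absolute path of the file, so it can be opened directly from disk.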

In [2]:
!pip install diagrams


Collecting diagrams
  Downloading diagrams-0.23.4-py3-none-any.whl.metadata (7.0 kB)
Collecting typed-ast<2.0.0,>=1.5.4 (from diagrams)
  Downloading typed_ast-1.5.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.7 kB)
Downloading diagrams-0.23.4-py3-none-any.whl (24.6 MB)
Downloading typed_ast-1.5.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (824 kB)
Installing collected packages: typed-ast, diagrams
Successfully installed diagrams-0.23.4 typed-ast-1.5.5


In [6]:


from diagrams import Diagram, Cluster, Edge
from diagrams.aws.storage import S3
from diagrams.onprem.client import User
from diagrams.onprem.compute import Server
from diagrams.onprem.database import PostgreSQL
from diagrams.onprem.workflow import Airflow
from diagrams.programming.flowchart import Document
from diagrams.generic.compute import Rack  # Use generic compute service for OpenAI

# The first argument is the diagram title; with no filename= given, the output file name is derived from it
with Diagram("complex_architecture_diagram_openai_explicit2", show=False, outformat="png"):
    # User accessing the frontend
    user = User("User")

    # Streamlit frontend cluster
    with Cluster("Frontend"):
        frontend = Server("Streamlit")

    # FastAPI backend cluster
    with Cluster("Backend (FastAPI)"):
        fastapi = Server("FastAPI")
        swagger = Server("Swagger UI")

    # Database for storing user credentials and PDF data
    database = PostgreSQL("SQL Database")

    # Airflow for pipeline orchestration
    airflow = Airflow("Airflow Pipeline")

    # PDF extraction tools (PyPDF, OpenAI Text Extractor)
    with Cluster("PDF Extraction"):
        pypdf = Document("PyPDF")
        openai_text = Document("OpenAI Text Extractor")

    # Cloud storage (S3)
    s3_storage = S3("AWS S3")

    # Generic service representing OpenAI API
    openai_service = Rack("OpenAI API")

    # Data flow with API key descriptions
    user >> frontend >> Edge(label="API Key: JWT Auth") >> fastapi >> Edge(label="SQL API Key") >> database
    frontend >> swagger
    fastapi >> Edge(label="Airflow API Key") >> airflow
    airflow >> [pypdf, openai_text] >> Edge(label="S3 API Key") >> s3_storage
    fastapi >> Edge(label="Fetch Answer from OpenAI") >> openai_service
    openai_service >> Edge(label="Use FastAPI to Streamlit") >> frontend

    # Adding arrow from OpenAI API to OpenAI Text Extractor
    openai_service >> Edge(label="Text Extraction") >> openai_text
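
The `diagrams` library renders through Graphviz, so the cell above needs the Graphviz `dot` executable on `PATH`; the `graphviz` package that pip installs is only the Python binding. The sketch below shows one way to check both the prerequisite and the result. `graphviz_available` and `diagram_written` are illustrative helper names introduced here, not functions of the library.

```python
import shutil
from pathlib import Path


def graphviz_available():
    """Return True if the Graphviz `dot` executable can be found on PATH."""
    return shutil.which("dot") is not None


def diagram_written(filename, directory="."):
    """Return True if the rendered file exists in `directory` and is non-empty."""
    path = Path(directory) / filename
    return path.is_file() and path.stat().st_size > 0
```

Running `graphviz_available()` before the diagram cell, and `diagram_written("complex_architecture_diagram_openai_explicit2.png")` after it, gives a quick sanity check in place of listing the directory by hand.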


In [7]:
from google.colab import files
files.download("complex_architecture_diagram_openai_explicit2.png")

In [6]:
from diagrams import Diagram, Cluster, Edge
from diagrams.aws.storage import S3
from diagrams.onprem.client import User
from diagrams.onprem.compute import Server
from diagrams.onprem.database import PostgreSQL
from diagrams.onprem.workflow import Airflow
from diagrams.programming.flowchart import Document
from diagrams.generic.compute import Rack
from diagrams.custom import Custom  # Import Custom for using logo

# Path to the OpenAI logo image (ensure it's in the correct directory)
openai_logo_path = "openai_logo.png"

# The first argument is the diagram title; with no filename= given, the output file name is derived from it
with Diagram("complex_architecture_diagram_openai_explicit3", show=False, outformat="png"):
    # User accessing the frontend
    user = User("User")

    # Streamlit frontend cluster
    with Cluster("Frontend"):
        frontend = Server("Streamlit")

    # FastAPI backend cluster
    with Cluster("Backend (FastAPI)"):
        fastapi = Server("FastAPI")
        swagger = Server("Swagger UI")

    # Database for storing user credentials and PDF data
    database = PostgreSQL("SQL Database")

    # Airflow for pipeline orchestration
    airflow = Airflow("Airflow Pipeline")

    # PDF extraction tools (PyPDF, OpenAI Text Extractor)
    with Cluster("PDF Extraction"):
        pypdf = Document("PyPDF")
        openai_text = Document("OpenAI Text Extractor")

    # Cloud storage (S3)
    s3_storage = S3("AWS S3")

    # OpenAI API node with custom logo
    openai_service = Custom("OpenAI API", openai_logo_path)

    # Data flow with API key descriptions
    user >> frontend >> Edge(label="API Key: JWT Auth") >> fastapi >> Edge(label="SQL API Key") >> database
    frontend >> swagger
    fastapi >> Edge(label="Airflow API Key") >> airflow
    airflow >> [pypdf, openai_text] >> Edge(label="S3 API Key") >> s3_storage
    fastapi >> Edge(label="Fetch Answer from OpenAI") >> openai_service
    openai_service >> Edge(label="Use FastAPI to Streamlit") >> frontend

    # Adding arrow from OpenAI API to OpenAI Text Extractor
    openai_service >> Edge(label="Text Extraction") >> openai_text
