In [14]:
from dotenv import dotenv_values
environment_variables = dotenv_values()

In [15]:
OPENAI_API_KEY = environment_variables["OPENAI_API_KEY"]
MISTRAL_API_KEY = environment_variables["MISTRAL_API_KEY"]
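
Indexing the dictionary directly raises a bare `KeyError` when a key is absent, and `dotenv_values()` can also return an empty or `None` value for a key that is declared without content. A minimal sketch of a stricter lookup follows; the helper name `require_key` and the stand-in dictionary `example_env` are hypothetical and not part of the original notebook.

```python
def require_key(env: dict, name: str) -> str:
    """Return env[name], raising a descriptive error if it is missing or empty."""
    value = env.get(name)
    if not value:
        raise KeyError(f"{name} is missing or empty; add it to your .env file")
    return value


# Stand-in dictionary used instead of a real .env file (hypothetical values).
example_env = {"OPENAI_API_KEY": "sk-example", "MISTRAL_API_KEY": ""}
openai_key = require_key(example_env, "OPENAI_API_KEY")
```

In the notebook, the same helper could be called with `environment_variables` in place of `example_env`.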

In [16]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.chains.summarize import load_summarize_chain

# Summarize a single document

In [17]:
from langchain_community.document_loaders.text import TextLoader
from IPython.display import Markdown, display
text = TextLoader("../sample_text.txt").load()[0]
# display(Markdown(text.page_content))

In [18]:
from langchain_mistralai import ChatMistralAI
llm = ChatMistralAI(model="mistral-small-latest", temperature=0.3, api_key=MISTRAL_API_KEY)

In [19]:
prompt_template = """
Write a comprehensive summary of the following text. The summary should:
1. Highlight the main points and key ideas
2. Include important details and supporting evidence
3. Maintain the original meaning and intent
4. Be well-structured and coherent

Text to summarize:
{text}

Comprehensive Summary:
"""
prompt = PromptTemplate(template=prompt_template, input_variables=["text"])
chain = prompt | llm | StrOutputParser()

In [20]:
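# Pass the raw text, not the Document object, so metadata does not leak into the prompt
summary = chain.invoke({"text": text.page_content})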
# Markdown(summary)

# Summarize multiple documents

In [21]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=4000, chunk_overlap=200, separators=["\n\n", "\n", " ", ""]
)
docs = text_splitter.create_documents([text.page_content])
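
The `chunk_overlap` setting means that consecutive chunks share some trailing and leading characters, so context is not lost at a boundary. The sketch below illustrates only that overlap idea with a fixed-width sliding window. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on the listed separators), and the function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Split text into fixed-width windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, which is the one-character overlap requested.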

In [22]:
docs

[Document(metadata={}, page_content="# Comprehensive Summary of Data Processing Engines: Ray, Dask, and Apache Spark\n\n## Introduction\nThis summary provides a detailed comparison and integrated overview of three prominent data processing engines—Ray, Dask, and Apache Spark. Each engine is evaluated based on its core functionalities, performance capabilities, and ideal use cases in data science and machine learning (ML). Insights are drawn from various analyses, webinars, and blog posts discussing the evolution of machine learning tools and the importance of robust compute engines in handling large datasets and complex computations.\n\n## Overview of Data Processing Engines\n\n### Ray\nRay is recognized for its ease of use and efficiency in managing distributed applications, particularly in reinforcement learning and deep learning tasks. It employs a master-worker architecture, enabling effective scaling with both CPUs and GPUs. Ray's unique primitives (Tasks, Objects, Actors) facilit

In [23]:
map_prompt_template = """
Write a concise summary of the following text, focusing on the key points:
{text}

Concise Summary:
"""

combine_prompt_template = """
You are provided with multiple summaries from different sections of a document or article.
Your task is to create a comprehensive, well-structured final summary that:
1. Integrates all the important information from the individual summaries
2. Presents a coherent overview of the entire content
3. Organizes the information logically with appropriate headings and structure
4. Eliminates redundancy while preserving important details

Individual summaries:
{text}

Comprehensive Final Summary:
"""

map_prompt = PromptTemplate(
    template=map_prompt_template, input_variables=["text"]
)
combine_prompt = PromptTemplate(
    template=combine_prompt_template, input_variables=["text"]
)

In [24]:
summary_chain = load_summarize_chain(
    llm,
    chain_type="map_reduce",
    map_prompt=map_prompt,
    combine_prompt=combine_prompt,
    verbose=False,
)
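
Conceptually, the `map_reduce` strategy summarizes each chunk on its own with the map prompt, then feeds the joined partial summaries to the combine prompt. The sketch below shows that flow in plain Python with a stand-in model. `map_reduce_summarize` and `fake_llm` are hypothetical names, the blank-line joiner is an assumption, and any extra handling the library may apply to long intermediate results is omitted.

```python
from typing import Callable, List


def map_reduce_summarize(
    chunks: List[str],
    summarize: Callable[[str], str],
    map_template: str,
    combine_template: str,
) -> str:
    """Summarize each chunk separately, then summarize the joined summaries."""
    # Map step: one model call per chunk.
    partial_summaries = [summarize(map_template.format(text=chunk)) for chunk in chunks]
    # Reduce step: a single model call over all partial summaries.
    joined = "\n\n".join(partial_summaries)
    return summarize(combine_template.format(text=joined))


calls: List[str] = []


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model: records the prompt and returns a numbered label."""
    calls.append(prompt)
    return f"summary {len(calls)}"


final = map_reduce_summarize(
    ["chunk one", "chunk two", "chunk three"],
    fake_llm,
    map_template="Summarize: {text}",
    combine_template="Combine: {text}",
)
print(len(calls))  # 4: three map calls plus one combine call
print(final)       # summary 4
```

The number of model calls grows with the number of chunks, which is the main cost to weigh when choosing `chunk_size`.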

In [27]:
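# The chain's input key is "input_documents"; the result holds "input_documents" and "output_text"
summary = summary_chain.invoke({"input_documents": docs})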

In [30]:
summary.keys()

dict_keys(['input_documents', 'output_text'])

In [32]:
Markdown(summary["output_text"])

### Comprehensive Summary of Data Processing Engines: Ray, Dask, and Apache Spark

#### Overview
This summary integrates key points from various sections to provide a coherent overview of three prominent data processing engines: Ray, Dask, and Apache Spark. Each engine has unique strengths, architectures, and use cases, making them suitable for different types of data processing and machine learning tasks.

#### Strengths and Architectures

**Ray**:
- **Strengths**: Known for ease of use, efficient distributed applications, and strong performance in reinforcement and deep learning.
- **Architecture**: Utilizes a master-worker model, supporting both CPUs and GPUs.
- **Unique Features**: Includes built-in AI libraries, making it ideal for generative AI and time series forecasting.
- **Performance**: High cost efficiency for large-scale workloads, with autoscaling capabilities.
- **Job Concurrency**: Supports gang scheduling but does not allow multiple jobs on the same cluster.
- **Memory Management**: Features shared-memory architecture, built-in checkpointing, and task retries.

**Dask**:
- **Strengths**: Offers seamless Python integration and easy scaling from single-node to distributed environments.
- **Architecture**: Employs a task graph for parallel processing and fault tolerance.
- **Compatibility**: Works well with NumPy and Scikit-learn.
- **Performance**: Best suited for medium-scale tasks and exploratory data analysis.
- **Job Concurrency**: Can share clusters but is not recommended for multiple jobs.
- **Memory Management**: Can spill data to disk, but has higher overhead due to Python and potential data loss if the scheduler fails.

**Apache Spark**:
- **Strengths**: Boasts a robust ecosystem and strong support for big data analytics.
- **Architecture**: Uses a driver node for resource management and task scheduling.
- **Unique Features**: Implements lazy evaluation, fault tolerance, and is effective for feature engineering and real-time machine learning.
- **Performance**: Optimized planning and fault tolerance, but requires tuning for autoscaling.
- **Job Concurrency**: Features a thread-safe scheduler, supporting multiple jobs within a single application.
- **Memory Management**: Utilizes RDD lineage and task rescheduling for fault tolerance, requiring tuning for memory-intensive workloads.

#### Developer Experience

- **Ray**: Provides Pythonic debugging but can be complex due to its C++ foundation.
- **Dask**: Fully Python-based, offering easy debugging and comprehensive documentation.
- **Spark**: Mature with solid documentation, but serialization overhead complicates debugging.

#### Use Cases in Data Science and Machine Learning

- **Scientific Computing**: Ray is preferred for scalability, Dask for Python libraries, while Spark can be suboptimal.
- **Time Series Forecasting**: Both Ray and Dask are effective, whereas Spark lacks specialized libraries.
- **Geospatial Data Analysis**: Ray and Dask excel, while Spark uses Apache Sedona.

#### Language Support and Community Engagement

- **Spark**: Supports Java, Scala, SQL, and R, with strong community and commercial support.
- **Ray and Dask**: Vary in support levels, with Spark leading in GitHub metrics.

#### Real-World Applications and Case Studies

- **Examples**: NVIDIA (Spark RAPIDS), Riot Games (Spark), and OpenAI (Ray) highlight the importance of choosing the right compute engine.
- **Onehouse Platform**: Integrates Ray, Dask, and Spark, offering managed ingestion and optimization.

#### Conclusion

- **Ray**: Best for memory-intensive and machine learning workloads.
- **Dask**: Ideal for lightweight data processing and exploratory analysis.
- **Spark**: Excels in large-scale data analytics with a robust ecosystem.

Practitioners should consider the data scale, workload complexity, GPU needs, and team familiarity when selecting a compute engine. Each engine has distinct advantages tailored to specific types of data processing and machine learning tasks.