In [75]:
import numpy as np
import pandas as pd
from datasets import load_dataset
# from sklearn.datasets import fetch_20newsgroups
from pprint import pprint
from sentence_transformers import SentenceTransformer

### Talking points

- A jump from the previous material (not directly related to what we've learned so far)
- Just something cool
- Think of it as "what else can we do with text and deep learning?"

# RAG

- embedding documents
- vector search
- contextual summarization

## Step 1: Pulling a dataset

In [None]:
# Load the dataset
ds = load_dataset("SetFit/20_newsgroups")
ds


Repo card metadata block was not found. Setting CardData to empty.
Generating train split: 100%|██████████| 11314/11314 [00:00<00:00, 479647.00 examples/s]
Generating test split: 100%|██████████| 7532/7532 [00:00<00:00, 780114.03 examples/s]


DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 11314
    })
    test: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 7532
    })
})

In [50]:
df_train = pd.DataFrame(ds['train'])
df_query = pd.DataFrame(ds['test'])

df_train.shape, df_query.shape

((11314, 3), (7532, 3))

In [None]:
# Optional: subsample to the talk.politics groups
# (sentence-transformers is fast enough that this isn't required)
# df_train = df_train[df_train['label_text'].str.startswith('talk.politics')]
# df_query = df_query[df_query['label_text'].str.startswith('talk.politics')]

# df_train.shape, df_query.shape


((1575, 3), (1050, 3))

# Embedding
Let's use a transformer model to create an embedding that represents each document.

In [86]:
# Load the embedding model
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


In [87]:
embeddings = model.encode(df_train['text'].tolist(), show_progress_bar=True)
emebddings_dim = embeddings.shape[1]

Batches: 100%|██████████| 354/354 [02:31<00:00,  2.33it/s]


# Vector Store

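(Not strictly necessary: with only ~11k vectors, a brute-force search over the embedding matrix would work just as well.)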

In [88]:
import faiss

# Flat (exact) index using L2 distance; it starts out empty
index = faiss.IndexFlatL2(emebddings_dim)
index.ntotal

0

In [89]:
index.add(embeddings)
index.ntotal

11314
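
At roughly 11k vectors, a flat index amounts to exhaustive search. A minimal NumPy sketch of the same idea (squared L2 distance to every stored vector, smallest first) might look like the following. `brute_force_l2_search` and the toy arrays are hypothetical names for illustration, not part of faiss.

```python
import numpy as np

def brute_force_l2_search(corpus, queries, k):
    """Exact k-nearest-neighbor search by squared L2 distance."""
    corpus = np.asarray(corpus, dtype=np.float32)
    queries = np.atleast_2d(np.asarray(queries, dtype=np.float32))
    # Pairwise differences, shape (n_queries, n_corpus, dim).
    # Fine for a handful of queries; memory grows quickly with many queries.
    diffs = queries[:, None, :] - corpus[None, :, :]
    dists = np.einsum('qnd,qnd->qn', diffs, diffs)
    # Positions of the k smallest distances per query, in ascending order
    idx = np.argsort(dists, axis=1)[:, :k]
    return np.take_along_axis(dists, idx, axis=1), idx

# Toy data (assumption: stands in for the real embedding matrix)
rng = np.random.default_rng(0)
toy_corpus = rng.normal(size=(50, 8)).astype(np.float32)
toy_query = toy_corpus[7] + 0.01  # a slightly perturbed copy of row 7
toy_dists, toy_idx = brute_force_l2_search(toy_corpus, toy_query, k=5)
```

With the notebook's variables, the equivalent call would be `brute_force_l2_search(embeddings, vec, k=5)`.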

# Let's try it out on a document

In [90]:
i = 100
doc = df_query.iloc[i]
print(doc.text)

Hello All!

    It is my understanding that all True-Type fonts in Windows are loaded in
prior to starting Windows - this makes getting into Windows quite slow if you
have hundreds of them as I do.  First off, am I correct in this thinking -
secondly, if that is the case - can you get Windows to ignore them on boot and
maybe make something like a PIF file to load them only when you enter the
applications that need fonts?  Any ideas?


Chris


In [91]:
## Create a query vector
vec = model.encode(doc.text)
vec.shape

(768,)

In [92]:
# Let's find the nearest neighbors using faiss.
# search() returns a (distances, indices) tuple with one row per query.
ind = index.search(vec.reshape(1, -1), k=5)
ind

(array([[0.8118304 , 0.83757585, 0.9386765 , 0.9496102 , 0.9903735 ]],
       dtype=float32),
 array([[9448, 4746, 1693, 8315, 8888]]))
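
The first array holds distances and the second holds row positions in the indexed matrix. With `IndexFlatL2` the distances are squared L2. If the embeddings are unit-length (an assumption here; it is worth checking with `np.linalg.norm(embeddings, axis=1)`), squared L2 maps to cosine similarity via d^2 = 2 - 2*cos. A small sketch, where `squared_l2_to_cosine` is a hypothetical helper:

```python
import numpy as np

def squared_l2_to_cosine(sq_dists):
    """Convert squared L2 distances between unit vectors to cosine similarity."""
    return 1.0 - np.asarray(sq_dists, dtype=np.float64) / 2.0

# Distances copied from the search output above
sq_dists = np.array([0.8118304, 0.83757585, 0.9386765, 0.9496102, 0.9903735])
cosines = squared_l2_to_cosine(sq_dists)

# Sanity check of the identity on two random unit vectors
rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 768))
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)
identity_holds = np.isclose(np.sum((a - b) ** 2), 2 - 2 * (a @ b))
```

Under the unit-length assumption, the top hit corresponds to a cosine similarity of roughly 0.59.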

In [104]:
# ind[1][0] holds the row positions of the 5 nearest training documents
similar_articles = df_train.iloc[ind[1][0]].reset_index(drop=True).text
similar_articles

0    I just installed a new TrueType font under MS-...
1    I would like to change all of the system fonts...
2    ...\n\nThis is a common problem with highly co...
3    Hi\nI'm having a problem with TrueType fonts i...
4    OK...I've heard rumors about this...I might ha...
Name: text, dtype: object

In [105]:
print(similar_articles[0])

I just installed a new TrueType font under MS-Windows 3.1
but though all the applications display the font correctly on the
screen, quite a few of them fail to print out the document correctly
(on a LaserJet 4 - non-PostScript printer).

When I use the font in CorelDRAW, the document prints correctly, so I assume
CorelDRAW knows that the font has to be downloaded to the printer.

But when I use the Windows accessory Write, the printer prints square
boxes in place of the characters of the new font. Yet, Write does
display the font correctly on the screen.

I looked through all the Windows and LaserJet manuals, but got nowhere.
All of them just make the statement that TrueType fonts will print
exactly as you see them on the screen---so I assume Windows knows that a font
has to be downloaded automatically---but, how to make it do that????


In [None]:
# Seems good?


# Paraphrasing responses together

### How to use a local LLM?

We will use a lightweight Qwen model: https://ollama.com/library/qwen

In [110]:
from ollama import chat, generate
from ollama import ChatResponse

response: ChatResponse = chat(model='qwen:0.5b', messages=[
  {
    'role': 'user',
    'content': 'Why is the sky blue?',
  },
])

response

ChatResponse(model='qwen:0.5b', created_at='2025-06-02T20:11:52.77478Z', done=True, done_reason='stop', total_duration=460126250, load_duration=28134708, prompt_eval_count=14, prompt_eval_duration=45579667, eval_count=92, eval_duration=385452083, message=Message(role='assistant', content="The color of the sky can be influenced by several factors, including temperature, humidity, and atmosphere pressure.\n\nIn general, the temperature in the sky will affect its color. If it's hot, the clouds may be more dark and less blue. On the other hand, if it's cold, the clouds may be more clear and more blue.\n\nOverall, the color of the sky is determined by several factors, including temperature, humidity, and atmosphere pressure.", thinking=None, images=None, tool_calls=None))

In [116]:
generate(model='qwen:0.5b', prompt="Why is the sky blue?")

GenerateResponse(model='qwen:0.5b', created_at='2025-06-02T20:13:26.535447Z', done=True, done_reason='stop', total_duration=523075458, load_duration=25420583, prompt_eval_count=14, prompt_eval_duration=53058791, eval_count=109, eval_duration=444124292, response='The color of the sky蓝 arises from several processes, including:\n\n1. Ray scattering: When light passes through a.cloud or atmosphere, scattered light will be distributed in all directions, creating a bluish color.\n\n2. Ray absorption: When light enters an object, some of the absorbed light is reflected back into space. The intensity of this reflected light depends on the object being reflected from and the angle of incidence.\n\nTherefore, when light passes through a cloud or atmosphere, it will be scattered in all directions, creating a bluish color.', thinking=None, context=[151644, 872, 198, 10234, 374, 279, 12884, 6303, 30, 151645, 198, 151644, 77091, 198, 785, 1894, 315, 279, 12884, 100400, 47182, 504, 3807, 11364, 11, 2

In [129]:
# Let's create a prompt

context = "\n\n".join(similar_articles)
prompt = f"""You are a helpful assistant. For a given user question, summarize or paraphrase the key points that are relevant to this question.
Note that the context is not a direct response to answer user question. Rather, we want to provide the user with a summary of what other people are asking on the topic. 
Your summary should not be too short. At least 4-8 sentences. In english.

User Question: 
--------------
```
{doc.text}
```

Context:
--------
{context}

Summary:"""

print(prompt)

You are a helpful assistant. For a given user question, summarize or paraphrase the key points that are relevant to this question.
Note that the context is not a direct response to answer user question. Rather, we want to provide the user with a summary of what other people are asking on the topic. 
Your summary should not be too short. At least 4-8 sentences. In english.

User Question: 
--------------
```
Hello All!

    It is my understanding that all True-Type fonts in Windows are loaded in
prior to starting Windows - this makes getting into Windows quite slow if you
have hundreds of them as I do.  First off, am I correct in this thinking -
secondly, if that is the case - can you get Windows to ignore them on boot and
maybe make something like a PIF file to load them only when you enter the
applications that need fonts?  Any ideas?


Chris
```

Context:
--------
I just installed a new TrueType font under MS-Windows 3.1
but though all the applications display the font correctly on t
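
Small models have a limited context window, and five full newsgroup posts can get long. One possible way to keep the prompt bounded is to cap each retrieved document before joining. The sketch below is an illustration only: `build_prompt`, `max_chars_per_doc`, and the shortened instruction text are hypothetical, and a character cap is only a rough proxy for a token budget.

```python
def build_prompt(question, docs, max_chars_per_doc=1500):
    """Assemble a summarization prompt, truncating each retrieved document."""
    clipped = []
    for n, doc_text in enumerate(docs, 1):
        body = doc_text.strip()
        if len(body) > max_chars_per_doc:
            # Cut overly long posts and mark the cut explicitly
            body = body[:max_chars_per_doc].rstrip() + " [truncated]"
        clipped.append(f"[Document {n}]\n{body}")
    context = "\n\n".join(clipped)
    return (
        "Summarize the key points in the context that are relevant to the "
        "user question. Write 4-8 sentences in English.\n\n"
        f"User Question:\n{question.strip()}\n\n"
        f"Context:\n{context}\n\n"
        "Summary:"
    )

# Toy inputs (assumption: stand-ins for doc.text and similar_articles)
toy_prompt = build_prompt(
    "How do I load fonts lazily?",
    ["a" * 3000, "short post"],
    max_chars_per_doc=100,
)
```

With the notebook's variables this would be called as `build_prompt(doc.text, similar_articles)`. Numbering the documents also makes it easier to see which post a summary sentence came from.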

In [132]:
print(generate(model='qwen:0.5b', prompt=prompt).response)

The summary provides information about the problem you are experiencing with TrueType fonts in WIndows 3.1.
To address this issue, there are several ways to do so:

  1. Use alternative font formats such as TTF or PPTTT.
  2. Set the Windows font escape sequence to include characters from various fonts, including those from TrueType fonts.
  3. Use a more robust text editor to ensure proper handling of TrueType fonts and other document format files.

Overall, there are several ways to address this issue and improve the user experience with TrueType fonts in WIndows 3.1.


## Fallback

In [106]:
from transformers import pipeline

# Load summarization pipeline
summarizer = pipeline("summarization", "facebook/bart-large-cnn")

# Summarize each article
for i, article in enumerate(similar_articles, 1):
    summary = summarizer(article, max_length=100, min_length=30, do_sample=False)[0]['summary_text']
    print(f"\n--- Summary of Article {i} ---\n")
    print(summary)


Device set to use mps:0



--- Summary of Article 1 ---

I just installed a new TrueType font under MS-Windows 3.1. Though all the applications display the font correctly on the screen, quite a few of them fail to print out the document correctly.

--- Summary of Article 2 ---

I would like to change all of the system fonts in windows... I have a program that will generate system fonts from truetype, but i was wondering if there is a problem to help you set up all your systemfonts.

--- Summary of Article 3 ---

Microsoft admits to a problem with older versions of the PostScript printer driver. You can get around the problem by adjusting the parameter OutlineThreshold. The default is 256.

--- Summary of Article 4 ---

Sometimes windows uses Cyrillic when its supposed to use Times Roman. The PC-Tools Backup (version 7.1) has one line of Cyrilic text in its opening banner. Importing a Word for Windows 5.2 also results in Cyril.

--- Summary of Article 5 ---

I've heard rumors about this...I might have even seen 
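
BART-based summarizers accept a limited input length (1024 tokens for this model family), so very long posts may overflow. One hedged workaround is to split an article into smaller pieces and summarize each piece. The sketch below uses whitespace-separated words as a crude proxy for tokens; `chunk_words` and the 400-word default are hypothetical choices, not something the pipeline provides.

```python
def chunk_words(text, max_words=400):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]

# Toy input: 950 words should yield chunks of 400, 400 and 150 words
toy_chunks = chunk_words("word " * 950, max_words=400)
chunk_sizes = [len(chunk.split()) for chunk in toy_chunks]
```

Each chunk could then be passed to `summarizer` in turn and the partial summaries joined. For the short posts retrieved here this is likely unnecessary.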

In [None]:
### Not exactly the same (don't expect this model to do what Qwen can do)