# Final Project: NeuroDump - thoughts organizer

## 1. Environment setup

### 1.1. Environment

In [None]:
#!pip install -r requirements.txt --- UPDATE THIS AT THE END

### 1.2. Tools and Libraries

In [2]:
import subprocess

### 1.3. Custom functions

## First experiment: Mistral via Ollama

In [3]:
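def query_ollama(model: str, prompt: str) -> str:
    """Send a prompt to a local Ollama model through the CLI and return its reply."""
    result = subprocess.run(
        ["ollama", "run", model],
        input=prompt.encode('utf-8'),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    if result.returncode != 0:                              # surface CLI errors instead of returning an empty string
        raise RuntimeError(result.stderr.decode('utf-8'))
    return result.stdout.decode('utf-8')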

In [None]:
# Model card: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2

In [4]:
prompt = """
I had an idea for a new project. Also I need to buy cat food. 
Later I thought maybe I should send that message to Alex.

1. Group related sentences.
2. Label each group as: [Idea, Task, Message]
3. Reorder and rewrite each clearly.
"""

response = query_ollama("mistral", prompt)
print(response)

 Here's the organized and rewritten version of your text according to your instructions:

[Idea]
- I had an idea for a new project.

[Task]
- Later I thought maybe I should buy cat food.

[Message]
- [Idea: Sharing my idea about a new project]
- [Task: Suggesting that Alex might help with buying cat food]

Rewritten and Reordered version:

1. First, I'd like to share an idea I had for a new project - do you think you could help discuss it?
2. Later, I thought it would be good if we could buy some cat food together as I need to take care of my pet. Let me know what you think.




It didn't work as I expected (the model merged the cat food task into the message to Alex), and it took too long to return an answer. Refining the prompt might help, but the results would still be hard to test. Changing approach: classical NLP does the organizing, and generative AI models are used only for rewriting.

## Craft pipeline

   1. Input: "unstructured note"
   2. Split text into chunks of words
      1. nltk.word_tokenize()
      2. text.split()
      3. RecursiveCharacterTextSplitter
   3. Embed all chunks
      1. sentence-transformers (all-MiniLM-L6-v2)
      2. vector storage: faiss or chromadb 
   4. Cluster the embeddings by semantic similarity (use the whole database to identify the clusters)
      1. HDBSCAN
      2. output: fragment 1 (theme A), fragment 2 (theme B)
   5. Label each fragment with classifier = name the themes (based on existing themes from db)
      1. cosine similarity to existing cluster centroids + string-matching (e.g., fuzzywuzzy)
      2. label generation: Prompt to local LLM (Mistral via Ollama)
   6. Edit each fragment into a coherent text (generative AI)
      1. ollama run mistral + smart prompt
      2. output: "clean note 1" + label A, "clean note 2" + label B
   7. Store each "clean note" in the corresponding folder (folder = theme = label)
      1. Python os + pathlib for folders, later
      2. Notion API
   8. Improve the model
   9. Streamlit + add inputs
   10. Stats - EDA
       1. pandas, matplotlib
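
Step 5.1 of the pipeline above (cosine similarity to existing cluster centroids, plus string matching on labels) could be sketched roughly as follows. This is only an illustration under assumptions: the function names, the toy 3-dimensional centroids, and the `threshold` and `cutoff` values are hypothetical, and the standard-library `difflib` is swapped in for fuzzywuzzy so the sketch has no extra dependencies.

```python
import math
from difflib import SequenceMatcher


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_theme(vector, centroids, threshold=0.5):
    """Return the closest existing theme, or None if nothing is similar enough."""
    best_label, best_score = None, -1.0
    for label, centroid in centroids.items():
        score = cosine_similarity(vector, centroid)
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= threshold else None


def fuzzy_match_label(candidate, known_labels, cutoff=0.8):
    """Reuse an existing label when the candidate string is nearly identical to it."""
    def ratio(known):
        return SequenceMatcher(None, candidate.lower(), known.lower()).ratio()

    best = max(known_labels, key=ratio, default=None)
    if best is not None and ratio(best) >= cutoff:
        return best
    return candidate          # no close match: keep the new label


# Toy centroids (assumption); real ones would be means of the chunk embeddings per theme.
centroids = {"decision trees": [1.0, 0.1, 0.0], "clustering": [0.0, 0.2, 1.0]}
theme = nearest_theme([0.9, 0.2, 0.1], centroids)
merged = fuzzy_match_label("Decision tree", list(centroids))
print(theme, "|", merged)
```

Returning `None` when no centroid is close enough leaves room for step 5.2, where a new label would be generated by the local LLM.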

In [None]:
#%pip install hdbscan



In [1]:
import warnings
warnings.filterwarnings('ignore')
import os
from IPython.display import display, Markdown
import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import SentenceTransformerEmbeddings
import hdbscan

### 1. Input unstructured document
Incremental version: only files that are not yet listed in the processed-files log are processed.

In [2]:
folder_path = "./data/mock_notes/"
log_path = "./data/processed_files.txt"
persist_dir = "./data/chroma_db"

In [3]:
# 1.1. Read processed files log
if os.path.exists(log_path):
    with open(log_path, "r") as f:
        processed_files = set(line.strip() for line in f)
else:
    processed_files = set()


# 1.2. Find new files
file_list = [f for f in os.listdir(folder_path) if f.endswith('.txt')]
new_files = [f for f in file_list if f not in processed_files]
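
The incremental behaviour can be checked in isolation. The sketch below wraps the same logic in two helper functions (`find_new_files` and `mark_processed` are hypothetical names, not part of the notebook) and runs them twice against a throwaway directory with two made-up notes; the second run should find nothing new.

```python
import os
import tempfile


def find_new_files(folder, log_file, extension=".txt"):
    """Return files in `folder` with the given extension that are not yet in the log."""
    processed = set()
    if os.path.exists(log_file):
        with open(log_file, "r", encoding="utf-8") as f:
            processed = {line.strip() for line in f if line.strip()}
    candidates = sorted(name for name in os.listdir(folder) if name.endswith(extension))
    return [name for name in candidates if name not in processed]


def mark_processed(log_file, filenames):
    """Append processed file names to the log, one per line."""
    with open(log_file, "a", encoding="utf-8") as f:
        for name in filenames:
            f.write(name + "\n")


# Demo in a temporary directory (file names are placeholders).
with tempfile.TemporaryDirectory() as tmp:
    notes_dir = os.path.join(tmp, "notes")
    os.makedirs(notes_dir)
    demo_log = os.path.join(tmp, "processed_files.txt")
    for name in ("a.txt", "b.txt"):
        with open(os.path.join(notes_dir, name), "w", encoding="utf-8") as f:
            f.write("demo")

    first_run = find_new_files(notes_dir, demo_log)    # both notes are new
    mark_processed(demo_log, first_run)
    second_run = find_new_files(notes_dir, demo_log)   # nothing left to process

print(first_run, second_run)
```

Sorting the directory listing is a design choice of this sketch: `os.listdir` returns names in arbitrary order, so sorting keeps the processing order reproducible between runs.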

### 2. Document segmentation (text split)
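Split each document into overlapping chunks of 80 word tokens (with a 20-token overlap between consecutive chunks), which is enough to capture the semantic context. Single sentences or words would lose the context, and paragraphs are not a reliable boundary in unstructured notes. Note that `word_tokenize` counts punctuation marks as tokens, so a chunk holds slightly fewer than 80 actual words.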

In [4]:
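### function to split a document into overlapping word chunks (rejoined with TreebankWordDetokenizer) ###

def split_by_words_de(text, chunk_size=80, overlap=20):
    if overlap >= chunk_size:                               # otherwise the window step below would be <= 0
        raise ValueError("overlap must be smaller than chunk_size")
    words = word_tokenize(text)
    detok = TreebankWordDetokenizer()                       # detokenizer to join the tokens in a more natural way
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):    # each window starts (chunk_size - overlap) tokens after the previous one
        chunk = words[i:i + chunk_size]
        if chunk:
            chunks.append(detok.detokenize(chunk))
    return chunks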
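
To make the window arithmetic concrete: with `chunk_size=80` and `overlap=20`, each window starts 60 tokens after the previous one, so a note of `n` tokens yields `ceil(n / 60)` chunks. The sketch below only reproduces the index arithmetic of the loop above; `window_starts` is a hypothetical helper and the 400-token length is made up.

```python
import math


def window_starts(n_tokens, chunk_size=80, overlap=20):
    """Start indices of the sliding windows, mirroring the range() in split_by_words_de."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    return list(range(0, n_tokens, step))


n_tokens = 400                                   # hypothetical note length, in tokens
starts = window_starts(n_tokens)
windows = [(s, min(s + 80, n_tokens)) for s in starts]

print(len(windows), math.ceil(n_tokens / 60))    # both are 7
print(windows[-1])                               # (360, 400): the last window is shorter
```

One consequence of this scheme: when the tail left for the last window is no longer than the overlap (for example, 130 tokens give windows `(0, 80)`, `(60, 130)`, `(120, 130)`), the last chunk repeats text already fully contained in the previous one.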

In [5]:
# 2. Split document into chunks of words

if not new_files:
    print("No new files to process.")
else:
    all_chunks = []
    all_metadatas = []
    for fname in new_files:
        with open(os.path.join(folder_path, fname), 'r', encoding='utf-8') as f:
            text = f.read()
    
        chunks = split_by_words_de(text)                                    #split in chunks
        display(f"Original file: {fname}", Markdown(f"```\n{text}\n```"))   #display original file(s)
        print(f"{fname} = {len(chunks)} chunks")                            #display number of chunks per file
        for i, chunk in enumerate(chunks, 1):
            display(Markdown(f"**Chunk {i}:**\n{chunk}\n"))                 #display each chunk (stray ")" removed from the f-string)
        all_chunks.extend(chunks)
        all_metadatas.extend([{"source": fname}] * len(chunks))   

'Original file: note-3-dt-r-ul.txt'

```
lecture today was fast af. started with trees. entropy vs gini impurity — diff metrics to decide best split. both OK. CART = binary tree = each node has 2 splits. sklearn uses this. code: from sklearn.tree import DecisionTreeClassifier model = DecisionTreeClassifier(criterion='entropy', max_depth=4)
trees prone to overfit — esp if depth unbounded. pruning = way to fix. early stopping or post-prune. bagging helps = ensemble. RandomForest = multiple trees on bootstrapped samples + rand subset of features per split. reduces variance.
trees = interpretability good, but unstable to data changes.
then prof jumped into regression. regularization = shrink model capacity. Ridge = L2 norm = λ * Σ(w²). Lasso = L1 norm = λ * Σ|w|. Ridge keeps all weights ≠ 0, Lasso can zero out → sparse. Lasso good for feature selection. ElasticNet = mix of both — good if features correlated
code ex:
from sklearn.linear_model import ElasticNet model = ElasticNet(alpha=0.1, l1_ratio=0.5) model.fit(X, y)
important: scale features before fitting regularized models — otherwise magnitudes skew the penalty. StandardScaler or RobustScaler if outliers.
tune α via cross-val — use GridSearchCV or RandomizedSearchCV.
metrics: RMSE, R². underfit vs overfit — regularization helps balance bias/var tradeoff.
last part was unsupervised learning. clustering w/o labels. k-means = most used. init centers, assign pts, recalc, repeat. problem: sensitive to init. use k-means++. elbow method not always clear. silhouette score better maybe.
PCA + k-means often combined for vis. t-SNE only for viz — not for modeling. clusters in t-SNE are sometimes fake.
DBSCAN = cluster via density. can detect noise. great for shape-agnostic clusters. params hard to tune tho. hierarchical = dendrograms. use ‘ward’ linkage. but slow w/ big data.
general note: sklearn models consistent API — fit / predict / score.
also: why trees don’t need scaling? bcz splits based on order not value. contrast w/ reg models.
each algo has tradeoffs — no free lunch! choose based on data, task, interpretability needs.
```

note-3-dt-r-ul.txt = 7 chunks


**Chunk 1:**
lecture today was fast af . started with trees . entropy vs gini impurity — diff metrics to decide best split . both OK. CART = binary tree = each node has 2 splits . sklearn uses this . code: from sklearn.tree import DecisionTreeClassifier model = DecisionTreeClassifier (criterion='entropy', max_depth=4) trees prone to overfit — esp if depth unbounded . pruning = way to fix . early stopping or post-prune . bagging helps = ensemble.)


**Chunk 2:**
if depth unbounded . pruning = way to fix . early stopping or post-prune . bagging helps = ensemble . RandomForest = multiple trees on bootstrapped samples + rand subset of features per split . reduces variance . trees = interpretability good, but unstable to data changes . then prof jumped into regression . regularization = shrink model capacity . Ridge = L2 norm = λ * Σ (w²). Lasso = L1 norm = λ *)


**Chunk 3:**
. Ridge = L2 norm = λ * Σ (w²). Lasso = L1 norm = λ * Σ|w| . Ridge keeps all weights ≠ 0, Lasso can zero out → sparse . Lasso good for feature selection . ElasticNet = mix of both — good if features correlated code ex: from sklearn.linear_model import ElasticNet model = ElasticNet (alpha=0.1, l1_ratio=0.5) model.fit (X, y) important: scale features before fitting regularized)


**Chunk 4:**
= ElasticNet (alpha=0.1, l1_ratio=0.5) model.fit (X, y) important: scale features before fitting regularized models — otherwise magnitudes skew the penalty . StandardScaler or RobustScaler if outliers . tune α via cross-val — use GridSearchCV or RandomizedSearchCV . metrics: RMSE, R² . underfit vs overfit — regularization helps balance bias/var tradeoff . last part was unsupervised learning . clustering w/o labels . k-means = most used . init centers, assign pts)


**Chunk 5:**
last part was unsupervised learning . clustering w/o labels . k-means = most used . init centers, assign pts, recalc, repeat . problem: sensitive to init . use k-means++ . elbow method not always clear . silhouette score better maybe . PCA + k-means often combined for vis . t-SNE only for viz — not for modeling . clusters in t-SNE are sometimes fake . DBSCAN = cluster via density . can detect noise . great)


**Chunk 6:**
modeling . clusters in t-SNE are sometimes fake . DBSCAN = cluster via density . can detect noise . great for shape-agnostic clusters . params hard to tune tho . hierarchical = dendrograms . use ‘ ward ’ linkage . but slow w/ big data . general note: sklearn models consistent API — fit / predict / score . also: why trees don ’ t need scaling? bcz splits based on order not value . contrast w/)


**Chunk 7:**
also: why trees don ’ t need scaling? bcz splits based on order not value . contrast w/ reg models . each algo has tradeoffs — no free lunch! choose based on data, task, interpretability needs.)


'Original file: note-1-dt-r.txt'

```
entropy = -p * log2(p) – yeah info gain is difference in entropy before and after split — okay so DT picks feature that max info gain at each node. Gini impurity also similar but faster? less bias? not 100% sure. trees go deep and then prone back? no, prune. to prevent overfitting. training error low but generalization bad. CART uses binary splits – only yes/no right? sklearn.tree.DecisionTreeClassifier(max_depth=3) — yeah that’s what prof used.
note: good to visualize trees but not for high dim data. lots of axis-aligned splits, hard to interpret when too many features. oh and trees are unstable — small data change = big model change. they said bagging can help that — RandomForest.
wait then they jumped to regularization — lasso vs ridge. ridge adds λ * sum(w²), shrinks weights, but all stay ≠ 0. Lasso adds λ * sum(|w|) — forces some to zero. ohhh good for feature selection. balance bias-variance tradeoff. λ too big = underfit. low λ = overfit. prof wrote this on board: from sklearn.linear_model import Lasso model = Lasso(alpha=0.1) model.fit(X_train, y_train)
ElasticNet = mix of both? ratio param controls mix. good when multicollinearity or many small coeffs. visualize loss function — lasso diamond corners cause zeros. interesting.
btw they said don’t scale trees but do scale for lasso etc. bcz regularization depends on magnitude. std scaling or minmax okay.
might try gridsearch to tune alpha — sklearn.model_selection.GridSearchCV
decision boundary of tree is step-like, not smooth like linear models.
ok Q: why trees overfit more than lasso? more flexible model class I think?
```

note-1-dt-r.txt = 6 chunks


**Chunk 1:**
entropy = -p * log2 (p) – yeah info gain is difference in entropy before and after split — okay so DT picks feature that max info gain at each node . Gini impurity also similar but faster? less bias? not 100% sure . trees go deep and then prone back? no, prune . to prevent overfitting . training error low but generalization bad . CART uses binary splits – only yes/no right)


**Chunk 2:**
. to prevent overfitting . training error low but generalization bad . CART uses binary splits – only yes/no right? sklearn.tree.DecisionTreeClassifier (max_depth=3) — yeah that ’ s what prof used . note: good to visualize trees but not for high dim data . lots of axis-aligned splits, hard to interpret when too many features . oh and trees are unstable — small data change = big model change . they said bagging can help that)


**Chunk 3:**
oh and trees are unstable — small data change = big model change . they said bagging can help that — RandomForest . wait then they jumped to regularization — lasso vs ridge . ridge adds λ * sum (w²), shrinks weights, but all stay ≠ 0 . Lasso adds λ * sum (|w|) — forces some to zero . ohhh good for feature selection . balance bias-variance tradeoff . λ too big =)


**Chunk 4:**
— forces some to zero . ohhh good for feature selection . balance bias-variance tradeoff . λ too big = underfit . low λ = overfit . prof wrote this on board: from sklearn.linear_model import Lasso model = Lasso (alpha=0.1) model.fit (X_train, y_train) ElasticNet = mix of both? ratio param controls mix . good when multicollinearity or many small coeffs . visualize loss function — lasso diamond corners cause zeros . interesting.)


**Chunk 5:**
good when multicollinearity or many small coeffs . visualize loss function — lasso diamond corners cause zeros . interesting . btw they said don ’ t scale trees but do scale for lasso etc . bcz regularization depends on magnitude . std scaling or minmax okay . might try gridsearch to tune alpha — sklearn.model_selection.GridSearchCV decision boundary of tree is step-like, not smooth like linear models . ok Q: why trees overfit more than lasso? more flexible)


**Chunk 6:**
step-like, not smooth like linear models . ok Q: why trees overfit more than lasso? more flexible model class I think?)


'Original file: note-2-ul.txt'

```
unsup = no labels. kmeans = simplest one but still used a lot. init k centroids randomly (k-means++ better), assign pts, recalc centroids, repeat. converge when no pt changes. but result depends on init + scale. scale important! feature w bigger range dominates dist calc — always standardize first.
elbow method = plot inertia vs k — look for bend, but not always obvious. inertia = sum of dist² to centroid. alt metric = silhouette score — between -1 and 1. close to 1 = well-clustered.
clustering ≠ classification. labels are not known. use cases: market segmentation, gene expr clustering, anomaly detection (esp dbscan). DBSCAN better for weird shapes, dense clusters — uses eps + min_samples. tricky to tune tho. forms clusters based on density, noisy pts marked as outliers (label -1). sklearn DBSCAN ex: from sklearn.cluster import DBSCAN model = DBSCAN(eps=0.5, min_samples=5) model.fit(X)
hierarchical clustering = agglomerative or divisive — we focus on bottom-up (agglomerative). start w all pts as indiv cluster, merge closest pairs step by step. dendrogram = tree of merges. can "cut" tree at diff levels = diff num clusters. linkage: single, complete, avg. sklearn has AgglomerativeClustering.
before clustering, can reduce dim (PCA) for speed + viz. tSNE/UMAP for 2D plot = better for human eye but not for modeling. PCA = linear, tSNE = non-linear. tSNE distorts structure globally. good for pattern discovery.
spectral clustering = build similarity graph → Laplacian → eigenvectors → k-means in lower-dim eigenspace. nice when structure is graphy, not spherical.
most clustering algos rely on distance metric — Euclidean default. alt: cosine sim (for text), manhattan, etc.
eval: hard bcz no true label. silhouette best for most. DB index too. compare within/between cluster distance. can also visualize clusters to judge quality.
problem: k-means assumes spherical clusters, equal size. not true for real-world. if data has diff density or shapes → fails.
open Q: how to know if clusters mean anything in real world?
scaling is essential — StandardScaler or MinMaxScaler from sklearn.
pipeline ex:
from sklearn.pipeline import make_pipeline pipe = make_pipeline(StandardScaler(), KMeans(n_clusters=3)) pipe.fit(X)
```

note-2-ul.txt = 8 chunks


**Chunk 1:**
unsup = no labels . kmeans = simplest one but still used a lot . init k centroids randomly (k-means++ better), assign pts, recalc centroids, repeat . converge when no pt changes . but result depends on init + scale . scale important! feature w bigger range dominates dist calc — always standardize first . elbow method = plot inertia vs k — look for bend, but not always obvious . inertia =)


**Chunk 2:**
. elbow method = plot inertia vs k — look for bend, but not always obvious . inertia = sum of dist² to centroid . alt metric = silhouette score — between -1 and 1. close to 1 = well-clustered . clustering ≠ classification . labels are not known . use cases: market segmentation, gene expr clustering, anomaly detection (esp dbscan). DBSCAN better for weird shapes, dense clusters — uses eps +)


**Chunk 3:**
, anomaly detection (esp dbscan). DBSCAN better for weird shapes, dense clusters — uses eps + min_samples . tricky to tune tho . forms clusters based on density, noisy pts marked as outliers (label -1). sklearn DBSCAN ex: from sklearn.cluster import DBSCAN model = DBSCAN (eps=0.5, min_samples=5) model.fit (X) hierarchical clustering = agglomerative or divisive — we focus on bottom-up (agglomerative). start w)


**Chunk 4:**
(X) hierarchical clustering = agglomerative or divisive — we focus on bottom-up (agglomerative). start w all pts as indiv cluster, merge closest pairs step by step . dendrogram = tree of merges . can "cut" tree at diff levels = diff num clusters . linkage: single, complete, avg . sklearn has AgglomerativeClustering . before clustering, can reduce dim (PCA) for speed + viz . tSNE/UMAP for)


**Chunk 5:**
sklearn has AgglomerativeClustering . before clustering, can reduce dim (PCA) for speed + viz . tSNE/UMAP for 2D plot = better for human eye but not for modeling . PCA = linear, tSNE = non-linear . tSNE distorts structure globally . good for pattern discovery . spectral clustering = build similarity graph → Laplacian → eigenvectors → k-means in lower-dim eigenspace . nice when structure is graphy, not spherical . most clustering algos rely on)


**Chunk 6:**
→ k-means in lower-dim eigenspace . nice when structure is graphy, not spherical . most clustering algos rely on distance metric — Euclidean default . alt: cosine sim (for text), manhattan, etc . eval: hard bcz no true label . silhouette best for most . DB index too . compare within/between cluster distance . can also visualize clusters to judge quality . problem: k-means assumes spherical clusters, equal size . not)


**Chunk 7:**
. can also visualize clusters to judge quality . problem: k-means assumes spherical clusters, equal size . not true for real-world . if data has diff density or shapes → fails . open Q: how to know if clusters mean anything in real world? scaling is essential — StandardScaler or MinMaxScaler from sklearn . pipeline ex: from sklearn.pipeline import make_pipeline pipe = make_pipeline (StandardScaler (), KMeans (n_clusters=3)) pipe.fit ()


**Chunk 8:**
: from sklearn.pipeline import make_pipeline pipe = make_pipeline (StandardScaler (), KMeans (n_clusters=3)) pipe.fit (X))


### 3. Embedding new chunks from new files

In [6]:
# 3.1. Load the embedding model
embeddings = SentenceTransformerEmbeddings(model_name='all-MiniLM-L6-v2')


# 3.2. Add new chunks to Chroma DB (texts are embedded on insertion)
vectorstore = None
if os.path.exists(persist_dir):
    print(f"Loading existing Chroma DB from: {persist_dir}")
    vectorstore = Chroma(
        persist_directory=persist_dir,
        embedding_function=embeddings           # the constructor takes embedding_function, not embedding
    )
    if new_files:                               # all_chunks only exists when there are new files
        vectorstore.add_texts(texts=all_chunks, metadatas=all_metadatas)
elif new_files:
    print(f"Creating new Chroma DB in: {persist_dir}")
    vectorstore = Chroma.from_texts(
        texts=all_chunks,
        metadatas=all_metadatas,
        embedding=embeddings,
        persist_directory=persist_dir
    )
if vectorstore is not None:
    vectorstore.persist()


# 3.3. Update processed files log
with open(log_path, "a") as f:
    for fname in new_files:
        f.write(fname + "\n")
print(f"Processed and added {len(new_files)} new files to Chroma DB.")



Creating new Chroma DB in: ./data/chroma_db
Processed and added 3 new files to Chroma DB.




### 4. Cluster by semantic similarity
1. First experiment: HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise). The results were not good, so it was moved to another file.
2. Second experiment: k-means.


In [None]:
### function to retrieve the chunks of a specific file from the vector store

def get_file_chunks(fname):
    results = vectorstore.get(where={"source": fname})      # filter by the "source" metadata set at insertion
    file_chunks = results['documents']
    print(f"Found {len(file_chunks)} chunks from {fname}")
    for chunk in file_chunks:
        display(Markdown(f"```\n{chunk}\n```"))             # display the contents of each chunk for checking
    return file_chunks                                      # return the list for further use

In [17]:
# 4.1. Get the chunks of file 1 and embed them

file1_chunks = get_file_chunks("note-1-dt-r.txt")
file1_embeddings = embeddings.embed_documents(file1_chunks)

Found 6 chunks from note-1-dt-r.txt


```
entropy = -p * log2 (p) – yeah info gain is difference in entropy before and after split — okay so DT picks feature that max info gain at each node . Gini impurity also similar but faster? less bias? not 100% sure . trees go deep and then prone back? no, prune . to prevent overfitting . training error low but generalization bad . CART uses binary splits – only yes/no right
```

```
. to prevent overfitting . training error low but generalization bad . CART uses binary splits – only yes/no right? sklearn.tree.DecisionTreeClassifier (max_depth=3) — yeah that ’ s what prof used . note: good to visualize trees but not for high dim data . lots of axis-aligned splits, hard to interpret when too many features . oh and trees are unstable — small data change = big model change . they said bagging can help that
```

```
oh and trees are unstable — small data change = big model change . they said bagging can help that — RandomForest . wait then they jumped to regularization — lasso vs ridge . ridge adds λ * sum (w²), shrinks weights, but all stay ≠ 0 . Lasso adds λ * sum (|w|) — forces some to zero . ohhh good for feature selection . balance bias-variance tradeoff . λ too big =
```

```
— forces some to zero . ohhh good for feature selection . balance bias-variance tradeoff . λ too big = underfit . low λ = overfit . prof wrote this on board: from sklearn.linear_model import Lasso model = Lasso (alpha=0.1) model.fit (X_train, y_train) ElasticNet = mix of both? ratio param controls mix . good when multicollinearity or many small coeffs . visualize loss function — lasso diamond corners cause zeros . interesting.
```

```
good when multicollinearity or many small coeffs . visualize loss function — lasso diamond corners cause zeros . interesting . btw they said don ’ t scale trees but do scale for lasso etc . bcz regularization depends on magnitude . std scaling or minmax okay . might try gridsearch to tune alpha — sklearn.model_selection.GridSearchCV decision boundary of tree is step-like, not smooth like linear models . ok Q: why trees overfit more than lasso? more flexible
```

```
step-like, not smooth like linear models . ok Q: why trees overfit more than lasso? more flexible model class I think?
```

In [18]:
from sklearn.cluster import KMeans

n_clusters = 2  # or 1, 2, 3 depending on your expectation
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
labels = kmeans.fit_predict(file1_embeddings)
print("KMeans cluster labels:", labels)

KMeans cluster labels: [0 0 0 1 1 1]
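
Before labeling (step 5), the cluster labels need to be mapped back to the chunk texts so that each cluster becomes one fragment. A minimal sketch of that grouping, assuming a hypothetical helper `group_by_label` and placeholder strings instead of the real chunks:

```python
from collections import defaultdict


def group_by_label(chunks, labels):
    """Collect chunk texts under their cluster label, preserving chunk order."""
    if len(chunks) != len(labels):
        raise ValueError("chunks and labels must have the same length")
    groups = defaultdict(list)
    for chunk, label in zip(chunks, labels):
        groups[int(label)].append(chunk)     # int() also accepts NumPy integer labels
    return dict(groups)


# Placeholder texts standing in for the six chunks of note-1-dt-r.txt.
demo_chunks = [f"chunk {i}" for i in range(1, 7)]
demo_labels = [0, 0, 0, 1, 1, 1]             # the labels printed above
fragments = group_by_label(demo_chunks, demo_labels)
print(fragments)
```

In the notebook itself the call would presumably be `group_by_label(file1_chunks, labels)`, which relies on `file1_chunks` and `labels` being in the same order, as they are when the embeddings are computed directly from `file1_chunks`.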




In [None]:
# 5. Label clusters with classifier

In [None]:
# 6. Evaluate classification

In [None]:
# 7. Rewrite selected outputs with [llama 3.2]

In [None]:
# 8. Save outputs (original + generated) into folders

In [None]:
# 9. Improve the model

In [None]:
# 10. Streamlit for user input

In [None]:
# 11. Generate stats from repository (EDA)