# Experimentation

Just make copies of this notebook to try out new angles.

In [7]:
import dhlab as dh
from IPython.display import HTML, Markdown
import os
import sqlite3
import pandas as pd
import tiktoken

`dh.css` gives nice styling (not everyone agrees, though, so just delete it, or pass a file argument if you have your own CSS).

In [8]:
dh.css()

## File locations

The n-grams are already indexed. Note, however, that these are bigrams from 2015; more are available today. They have been downloaded to a separate disk.

In [2]:
ngram_root = "/mnt/disk4/NB-ngram-assoc/"

In [9]:
uni = "/mnt/disk4/NB-ngram-assoc/unigram-one-row.db"
bi = "/mnt/disk4/NB-ngram-assoc/bigram-one-row.db"
tri = "/mnt/disk4/NB-ngram-assoc/trigram-one-row.db"

In [10]:
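def query(db, sql, params=()):
    from contextlib import closing
    # sqlite3's own context manager only wraps a transaction; closing() also closes the connection
    with closing(sqlite3.connect(db)) as con:
        cur = con.cursor()
        res = cur.execute(sql, params).fetchall()
    return res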

Queries can be run with `query`.

In [11]:
query(bi, "select * from sqlite_master limit 2")

[('table',
  'bigram',
  'bigram',
  2,
  'CREATE TABLE bigram (freq int, lang varchar, first varchar, second varchar, json text, assoc float, pmi float)'),
 ('index',
  '_lfsf_',
  'bigram',
  5854348,
  'CREATE INDEX _lfsf_ on bigram (lang,first,second,freq)')]

In [12]:
query(tri, "select first, second,third, assoc, freq from trigram where lang='nob' and second = 'spiser'  order by assoc desc limit 10")

[('jærbuer', 'spiser', 'kirsebær', 23.55522887335257, 13),
 ('Overfallsmenn', 'spiser', 'ikke', 20.609993505366152, 11),
 ('nøden', 'spiser', 'fanden', 20.05728960628191, 83),
 ('sekstiåttere', 'spiser', 'ikke', 18.693989407879293, 69),
 ('frokost', 'spiser', 'Vilde', 18.537601256676304, 113),
 ('Hvorfor', 'spiser', 'mesteren', 17.25426539638431, 13),
 ('konkurransedagen', 'spiser', 'jeg', 17.05598532838897, 10),
 ('Hva', 'spiser', 'fiskene', 16.108569299517168, 10),
 ('Vi', 'spiser', 'frokost', 15.824185530260493, 167),
 ('Hege', 'spiser', 'ett', 15.498541255648027, 13)]
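The trigram query above writes the word `spiser` directly into the SQL string. As a small sketch of how the `params` argument of `query` could be used instead, the block below builds a throwaway database with a made-up `trigram` table and passes the language, the word and the limit as bound parameters. The file name `demo.db`, the table layout and all rows are invented for illustration and are not taken from the real database.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def query(db, sql, params=()):
    # Same idea as the notebook's helper: open, run one statement, fetch everything
    with closing(sqlite3.connect(db)) as con:
        return con.execute(sql, params).fetchall()


# Hypothetical toy database; the rows are invented for illustration only
demo_db = os.path.join(tempfile.mkdtemp(), "demo.db")
with closing(sqlite3.connect(demo_db)) as con:
    con.execute(
        "CREATE TABLE trigram (freq int, lang varchar, first varchar, "
        "second varchar, third varchar, assoc float)"
    )
    con.executemany(
        "INSERT INTO trigram VALUES (?, ?, ?, ?, ?, ?)",
        [
            (167, "nob", "Vi", "spiser", "frokost", 15.8),
            (10, "nob", "Hva", "spiser", "fiskene", 16.1),
            (50, "nob", "Vi", "drikker", "kaffe", 12.0),
            (20, "nno", "Vi", "et", "frukost", 9.0),
        ],
    )
    con.commit()

# The ? placeholders are filled from the params tuple, in order
sql = """
    SELECT first, second, third, assoc, freq
    FROM trigram
    WHERE lang = ? AND second = ?
    ORDER BY assoc DESC
    LIMIT ?
"""
hits = query(demo_db, sql, ("nob", "spiser", 10))
print(hits)
```

Binding values this way makes it easy to loop over many words without rebuilding the SQL string each time.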

## Code for building the matrix

Here is the code for building the matrices. Below, the code is set up for the left context and then the right context: the right context of a word $a$ is every word to its right, $\{x \mid \beta(a, x)\}$, where $\beta$ encodes the bigram database, and the left context is every word to its left, $\{x \mid \beta(x, a)\}$. All queries are run against the SQLite database, and we set threshold values on frequency and PMI.
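To make the two set definitions concrete, here is a minimal sketch that treats $\beta$ as a plain Python set of (first, second) pairs. The pairs and the helper names `right_context` and `left_context` are made up for illustration; the real data lives in the SQLite database.

```python
# Toy bigram relation beta, stored as a set of (first, second) pairs.
# The pairs are invented for illustration.
beta = {
    ("å", "løpe"),
    ("løpe", "fort"),
    ("løpe", "hjem"),
    ("sterk", "kaffe"),
    ("kaffe", "med"),
}


def right_context(a, bigrams):
    # {x | beta(a, x)}: every word seen directly after a
    return {second for first, second in bigrams if first == a}


def left_context(a, bigrams):
    # {x | beta(x, a)}: every word seen directly before a
    return {first for first, second in bigrams if second == a}


print(right_context("løpe", beta))  # words to the right of "løpe"
print(left_context("løpe", beta))   # words to the left of "løpe"
```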

## Sparse matrix 



### Version with the words before and after

For a given word, both the preceding word and the following word are collected to build the matrix. For _løpe_ ("run"), for example, both _å_ and _fort_ are included, based on _å løpe_ ("to run") and _løpe fort_ ("run fast").

The code returns a pair (matrix, word index).

In [40]:
def create_sparse_matrix(db_path, vocab_size=500000):
   from scipy.sparse import csr_matrix
   import numpy as np
   
   rows = []
   cols = []
   data = []
   
   with sqlite3.connect(db_path) as conn:
       cur = conn.cursor()
       
       # Create temporary table for vocabulary
       print("Creating temporary vocabulary table...")
       cur.execute("""
           CREATE TEMP TABLE top_words AS
           SELECT first as word, SUM(freq) as total_freq 
           FROM bigram 
           WHERE lang='nob'
           GROUP BY first 
           ORDER BY total_freq DESC 
           LIMIT ?
       """, (vocab_size,))
       
       # Create index on the temporary table
       cur.execute("CREATE INDEX idx_top_words ON top_words(word)")
       
       # Get word to index mapping
       print("Creating vocabulary mapping...")
       word_to_idx = {word: idx for idx, (word, freq) 
                     in enumerate(cur.execute("SELECT word, total_freq FROM top_words"))}
       
       # Stream bigrams with fixed column references
       print("Streaming bigrams...")
       cur.execute("""
           SELECT b.first, b.second, b.pmi 
           FROM bigram b
           JOIN top_words w1 ON b.first = w1.word
           JOIN top_words w2 ON b.second = w2.word
           WHERE b.lang='nob'
       """)
       
       count = 0
       while True:
           chunk = cur.fetchmany(10000)
           if not chunk:
               break
               
           for first, second, pmi in chunk:
               # Add right context
               rows.append(word_to_idx[first])
               cols.append(word_to_idx[second])
               data.append(pmi)
               
               # Add left context for the same bigram
               rows.append(word_to_idx[second])
               cols.append(word_to_idx[first])
               data.append(pmi)
           
           count += len(chunk)
           if count % 2000000 == 0:
               print(f"Processed {count} bigrams...")
   
   print("Creating sparse matrix...")
   matrix = csr_matrix((data, (rows, cols)), shape=(len(word_to_idx), len(word_to_idx)))
   return matrix, word_to_idx

Build the matrix

In [41]:
matrix, word_to_idx = create_sparse_matrix(bi)

Creating temporary vocabulary table...
Creating vocabulary mapping...
Streaming bigrams...
Processed 2000000 bigrams...
Processed 4000000 bigrams...
Processed 6000000 bigrams...
Processed 8000000 bigrams...
Processed 10000000 bigrams...
Processed 12000000 bigrams...
Processed 14000000 bigrams...
Creating sparse matrix...
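As a small-scale illustration of what `create_sparse_matrix` builds, the sketch below repeats its inner loop on three invented bigrams. Each bigram is written into the matrix twice, once per direction, so the result is symmetric. Note also that `csr_matrix` sums duplicate coordinates, so when both "a b" and "b a" occur, their PMI values are added together in the same cell. The words and PMI values are made up.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Invented (first, second, pmi) rows; note that both "a b" and "b a" occur
toy_bigrams = [("a", "b", 2.0), ("b", "a", 3.0), ("a", "c", 1.5)]
toy_idx = {"a": 0, "b": 1, "c": 2}

rows, cols, data = [], [], []
for first, second, pmi in toy_bigrams:
    # Right context: row of `first`, column of `second`
    rows.append(toy_idx[first])
    cols.append(toy_idx[second])
    data.append(pmi)
    # Left context for the same bigram: row of `second`, column of `first`
    rows.append(toy_idx[second])
    cols.append(toy_idx[first])
    data.append(pmi)

toy = csr_matrix((data, (rows, cols)), shape=(3, 3))
dense = toy.toarray()
print(dense)
```

In the printed matrix, the cell for ("a", "b") holds 5.0 rather than 2.0 because the two directions were summed. Whether that summing is desirable depends on the analysis.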


### Versions for right and left context

For a given word, the words that follow it are collected as its right context, and the words that precede it as its left context.

So for _kaffe_ ("coffee") we get one set from _sterk kaffe_ ("strong coffee") and another from _kaffe med_ ("coffee with"), although both _sterk_ and _med_ can occur on either side.

The code computes both and returns a triple (left matrix, right matrix, word index).

In [15]:
def create_context_matrices(db_path, vocab_size=500000):
    from scipy.sparse import csr_matrix
    import numpy as np
    
    # Separate lists for left and right contexts
    left_rows, left_cols, left_data = [], [], []  # for (x, target) pairs
    right_rows, right_cols, right_data = [], [], []  # for (target, x) pairs
    
    with sqlite3.connect(db_path) as conn:
        cur = conn.cursor()
        
        # Create temporary vocabulary table (same as before)
        print("Creating temporary vocabulary table...")
        cur.execute("""
            CREATE TEMP TABLE top_words AS
            SELECT first as word, SUM(freq) as total_freq 
            FROM bigram 
            WHERE lang='nob'
            GROUP BY first 
            ORDER BY total_freq DESC 
            LIMIT ?
        """, (vocab_size,))
        
        cur.execute("CREATE INDEX idx_top_words ON top_words(word)")
        
        # Get word to index mapping
        print("Creating vocabulary mapping...")
        word_to_idx = {word: idx for idx, (word, freq) 
                      in enumerate(cur.execute("SELECT word, total_freq FROM top_words"))}
        
        # Stream bigrams and separate into left/right contexts
        print("Streaming bigrams...")
        cur.execute("""
            SELECT b.first, b.second, b.pmi 
            FROM bigram b
            JOIN top_words w1 ON b.first = w1.word
            JOIN top_words w2 ON b.second = w2.word
            WHERE b.lang='nob'
        """)
        
        count = 0
        while True:
            chunk = cur.fetchmany(10000)
            if not chunk:
                break
                
            for first, second, pmi in chunk:
                # (first, second) represents right context for 'first'
                right_rows.append(word_to_idx[first])
                right_cols.append(word_to_idx[second])
                right_data.append(pmi)
                
                # (first, second) represents left context for 'second'
                left_rows.append(word_to_idx[second])  # target word
                left_cols.append(word_to_idx[first])   # context word
                left_data.append(pmi)
            
            count += len(chunk)
            if count % 1000000 == 0:
                print(f"Processed {count} bigrams...")
    
    print("Creating sparse matrices...")
    # Matrix where rows are target words and columns are left context words
    left_matrix = csr_matrix((left_data, (left_rows, left_cols)), 
                           shape=(len(word_to_idx), len(word_to_idx)))
    
    # Matrix where rows are target words and columns are right context words
    right_matrix = csr_matrix((right_data, (right_rows, right_cols)), 
                            shape=(len(word_to_idx), len(word_to_idx)))
    
    return left_matrix, right_matrix, word_to_idx

In [18]:
left, right, context_word_to_idx = create_context_matrices(bi, vocab_size=500000)

Creating temporary vocabulary table...
Creating vocabulary mapping...
Streaming bigrams...
Processed 1000000 bigrams...
Processed 2000000 bigrams...
Processed 3000000 bigrams...
Processed 4000000 bigrams...
Processed 5000000 bigrams...
Processed 6000000 bigrams...
Processed 7000000 bigrams...
Processed 8000000 bigrams...
Processed 9000000 bigrams...
Processed 10000000 bigrams...
Processed 11000000 bigrams...
Processed 12000000 bigrams...
Processed 13000000 bigrams...
Processed 14000000 bigrams...
Creating sparse matrices...
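One consequence of how `create_context_matrices` fills its lists is worth spelling out: the left matrix gets exactly the same entries as the right matrix with rows and columns swapped, so it is the transpose of the right matrix. The sketch below checks this on three invented bigrams; the words and PMI values are made up.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Invented (first, second, pmi) rows
toy_bigrams = [("sterk", "kaffe", 4.0), ("kaffe", "med", 2.5), ("sterk", "med", 0.5)]
toy_idx = {"sterk": 0, "kaffe": 1, "med": 2}
n = len(toy_idx)

firsts = [toy_idx[first] for first, second, pmi in toy_bigrams]
seconds = [toy_idx[second] for first, second, pmi in toy_bigrams]
pmis = [pmi for first, second, pmi in toy_bigrams]

# Rows are target words; columns are the words to the right of the target
right_toy = csr_matrix((pmis, (firsts, seconds)), shape=(n, n))
# Rows are target words; columns are the words to the left of the target
left_toy = csr_matrix((pmis, (seconds, firsts)), shape=(n, n))

print(right_toy.toarray())
print(left_toy.toarray())
print(np.array_equal(left_toy.toarray(), right_toy.toarray().T))
```

If memory is tight, this suggests that only one of the two matrices needs to be built from the database, with the other obtained by transposing it.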


## Build the compressed matrix

Here we use TruncatedSVD, but other methods can be used.

In [21]:
from sklearn.decomposition import TruncatedSVD

# Create and fit the SVD
left_svd = TruncatedSVD(n_components=200, random_state=42)
right_svd = TruncatedSVD(n_components=200, random_state=42)

left_embeddings = left_svd.fit_transform(left)
right_embeddings = right_svd.fit_transform(right)
print("Shape of embeddings:", left_embeddings.shape, right_embeddings.shape)
print("Explained left variance ratio:", left_svd.explained_variance_ratio_.sum())
print("Explained right variance ratio:", right_svd.explained_variance_ratio_.sum())

Shape of embeddings: (500000, 200) (500000, 200)
Explained left variance ratio: 0.29979637079691973
Explained right variance ratio: 0.29887136580229257


In [42]:
from sklearn.decomposition import TruncatedSVD

# Create and fit the SVD
svd = TruncatedSVD(n_components=200, random_state=42)
embeddings = svd.fit_transform(matrix)

print("Shape of embeddings:", embeddings.shape)
print("Explained variance ratio:", svd.explained_variance_ratio_.sum())

Shape of embeddings: (500000, 200)
Explained variance ratio: 0.311545461099871
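As one possible example of another method, the sketch below uses `scipy.sparse.linalg.svds` on a small random sparse matrix that stands in for the PMI matrix. This is an assumption about what an alternative could look like, not something the notebook itself runs. The matrix size, density and `k` are arbitrary.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Random sparse matrix standing in for the PMI matrix (invented data)
rng = np.random.default_rng(0)
dense = rng.random((50, 50))
dense[dense < 0.9] = 0.0  # keep roughly 10% of the cells
toy = csr_matrix(dense)

k = 5
u, s, vt = svds(toy, k=k)

# Sort the components from largest to smallest singular value
order = np.argsort(s)[::-1]
u, s = u[:, order], s[order]

# Scale the left singular vectors by the singular values,
# analogous to what TruncatedSVD.fit_transform returns (up to sign)
toy_embeddings = u * s
print(toy_embeddings.shape)
print(s)
```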


## Save the embeddings

The embeddings are saved as NumPy arrays; remove whatever is not needed.

In [23]:
import numpy as np

In [24]:
np.save('nob_left_embeddings_bigram.npy', left_embeddings)

In [25]:
np.save('nob_right_embeddings_bigram.npy', right_embeddings)

In [33]:
np.save('nob_embeddings_bigram.npy', embeddings)

In [34]:
# Save both word-to-index mappings
import json
with open('word_to_idx_bigram.json', 'w', encoding='utf-8') as f:
    json.dump(word_to_idx, f, ensure_ascii=False)
with open('word_to_idx_bigram_contexts.json', 'w', encoding='utf-8') as f:
    json.dump(context_word_to_idx, f, ensure_ascii=False)
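For completeness, here is a minimal sketch of the round trip: saving an array and a mapping in the same way as above, then reading them back with `np.load` and `json.load`. The file names, the temporary directory and the toy data are made up so that the example does not touch the real files.

```python
import json
import os
import tempfile

import numpy as np

# Hypothetical file names in a temporary directory
tmp_dir = tempfile.mkdtemp()
emb_path = os.path.join(tmp_dir, "toy_embeddings.npy")
idx_path = os.path.join(tmp_dir, "toy_word_to_idx.json")

# Invented toy data: three words with two dimensions each
toy_embeddings = np.arange(6, dtype=float).reshape(3, 2)
toy_word_to_idx = {"kaffe": 0, "løpe": 1, "å": 2}

# Save in the same way as the cells above
np.save(emb_path, toy_embeddings)
with open(idx_path, "w", encoding="utf-8") as f:
    json.dump(toy_word_to_idx, f, ensure_ascii=False)

# Load both back; the JSON values come back as ints, so they index rows directly
loaded_embeddings = np.load(emb_path)
with open(idx_path, encoding="utf-8") as f:
    loaded_word_to_idx = json.load(f)

print(loaded_embeddings[loaded_word_to_idx["løpe"]])
```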


## Test the embeddings

In [29]:
def find_nearest(word, embeddings, word_to_idx, n=10):
    from sklearn.metrics.pairwise import cosine_similarity
    
    if word not in word_to_idx:
        return "Word not in vocabulary"
    
    word_idx = word_to_idx[word]
    word_vector = embeddings[word_idx].reshape(1, -1)
    
    # Cosine similarity between the query word and every row
    similarities = cosine_similarity(word_vector, embeddings)[0]
    
    # Invert the mapping once instead of scanning it for every hit
    idx_to_word = {idx: w for w, idx in word_to_idx.items()}
    
    # Take the top n+1 rows, drop the query word itself, keep n
    top = similarities.argsort()[::-1][:n + 1]
    indices = [idx for idx in top if idx != word_idx][:n]
    
    return [(idx_to_word[idx], similarities[idx]) for idx in indices]
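The ranking that `find_nearest` computes can also be sketched in plain NumPy, which may help show what cosine similarity does: each row is scaled to unit length, and the similarity is then just a dot product. The three words and their two-dimensional vectors below are invented. The sketch assumes that no row is all zeros, since such a row would cause a division by zero.

```python
import numpy as np

# Invented toy vocabulary and vectors
toy_words = ["spise", "drikke", "bil"]
toy_vectors = np.array([[1.0, 1.0], [2.0, 2.1], [1.0, -1.0]])

# Normalise every row to unit length (assumes no all-zero rows)
unit = toy_vectors / np.linalg.norm(toy_vectors, axis=1, keepdims=True)

# Cosine similarity of every word with the query word at index 0
query_idx = 0
sims = unit @ unit[query_idx]

# Sort from most to least similar and skip the query word itself
ranking = [
    (toy_words[i], float(sims[i]))
    for i in np.argsort(-sims)
    if i != query_idx
]
print(ranking)
```

Here "drikke" points in almost the same direction as "spise" and ranks first, while "bil" is orthogonal to it and gets a similarity of about zero.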

## Testing

In [30]:
find_nearest("folkestyre", left_embeddings, context_word_to_idx, 20)

[('demokrati', 0.7748139528955333),
 ('sikkerhetssystem', 0.7560614309893858),
 ('utbyggingsmønster', 0.7484329430291474),
 ('distribusjonsnett', 0.7248803844701253),
 ('samvirke', 0.7190325434001414),
 ('skolebibliotek', 0.7184723683923118),
 ('politisamarbeid', 0.7082358370995934),
 ('skatteregnskap', 0.7012121197083403),
 ('industrielt', 0.6993679135443505),
 ('forbedringsarbeid', 0.6993178306850556),
 ('tiltaksarbeid', 0.697016644089604),
 ('byråkrati', 0.6962829587107863),
 ('foreldreskap', 0.693503199578171),
 ('importvern', 0.69124670617893),
 ('kulturarbeid', 0.6904486467555324),
 ('evalueringsarbeid', 0.6886498036890953),
 ('familieliv', 0.6866255236358859),
 ('kjønnsliv', 0.6850864522081741),
 ('samhold', 0.6843587961158817),
 ('kontrollsystem', 0.6780904663097107)]

In [45]:
find_nearest("folkestyre", right_embeddings, context_word_to_idx, 10)

[('imperialisme', 0.957410742904829),
 ('ansvarslæring', 0.9562066658776511),
 ('bedriftsdemokrati', 0.9526920649674018),
 ('urbefolkning', 0.9415133284607968),
 ('immanens', 0.9408424669641113),
 ('relativitet', 0.9372458794302525),
 ('mannsarbeid', 0.9367848331240615),
 ('islamisme', 0.9324445061244013),
 ('tilbakeliggenhet', 0.932390893165072),
 ('opportunisme', 0.9308411706505932)]

In [46]:
find_nearest("folkestyre", embeddings, word_to_idx, 10)

[('fagstyre', 0.8134198609463499),
 ('foreldreskap', 0.8062806780890421),
 ('sorgarbeid', 0.793484955554102),
 ('flertallsstyre', 0.7914832581651849),
 ('bomiljø', 0.7890958154964803),
 ('medborgerskap', 0.7882201084305864),
 ('skatteregnskap', 0.7878211147804894),
 ('lokaldemokrati', 0.7829862781320682),
 ('deltakerdemokrati', 0.7788566695770515),
 ('kjønnsliv', 0.777391363323904)]

In [47]:
find_nearest("spise", embeddings, word_to_idx, 10)

[('drikke', 0.8795432513950372),
 ('spille', 0.8161752586706609),
 ('sove', 0.8115023031497804),
 ('kjøpe', 0.8102030587129718),
 ('dø', 0.7905000589369365),
 ('sitte', 0.7845383675111297),
 ('leke', 0.7755767249861851),
 ('danse', 0.7684195793455394),
 ('snakke', 0.7662041866569884),
 ('vente', 0.7629548597919222)]

In [38]:
find_nearest("spise", right_embeddings, context_word_to_idx, 10)

[('drikke', 0.8708795088698218),
 ('spist', 0.8391220462918889),
 ('spiser', 0.8026804078043023),
 ('spiste', 0.7606383162441575),
 ('drukket', 0.7505574575879221),
 ('drikker', 0.7236008987434204),
 ('kjøpe', 0.7175299143798848),
 ('drakk', 0.6983177967705037),
 ('kopp', 0.683305928554896),
 ('servert', 0.6639450946358891)]

In [39]:
find_nearest("spise", left_embeddings, context_word_to_idx, 10)

[('spille', 0.9642619853637652),
 ('kjøre', 0.9631011425133194),
 ('føle', 0.958907210235288),
 ('kjøpe', 0.9574394843696666),
 ('danse', 0.9534318350667251),
 ('sende', 0.9515402713713874),
 ('handle', 0.9515395980785162),
 ('øve', 0.9510206710242984),
 ('slippe', 0.9498104734017206),
 ('flytte', 0.9463676615654408)]