# Experiments: Model Hyperparameter Tuning

In this notebook we run experiments to find the best hyperparameters for the TF-IDF and Doc2Vec models.

## Methodology

1.  **Golden Query Set**: We define a set of queries and, for each query, the document we expect to be retrieved.
2.  **Grid Search**: We iterate over combinations of parameters for each model.
3.  **Evaluation**: For each combination we compute the mean rank and the mean similarity score of the expected documents.
4.  **Conclusions**: We select the parameter set with the lowest mean rank, breaking ties by the highest mean score.
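
The evaluation step can be illustrated with a minimal sketch. The helper name `rank_and_score` and the toy similarity values are hypothetical and not taken from the notebook. Ranks are 1-based, so rank 1 means the expected document was the top hit.

```python
def rank_and_score(sims, expected_idx):
    """Return the 1-based rank and the similarity of the expected document.

    `sims` holds one similarity value per document (higher = more similar).
    """
    # Sort document indices from most to least similar
    order = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
    return order.index(expected_idx) + 1, sims[expected_idx]


# Toy data: similarities of one query to four documents (made-up values)
toy_sims = [0.10, 0.45, 0.30, 0.05]
rank, score = rank_and_score(toy_sims, expected_idx=2)
print(rank, score)  # document 2 is the second most similar

# Averaging over several queries gives the two metrics used for comparison
per_query = [rank_and_score(toy_sims, i) for i in (1, 2)]
mean_rank = sum(r for r, _ in per_query) / len(per_query)
mean_score = sum(s for _, s in per_query) / len(per_query)
print(mean_rank, mean_score)
```

A lower mean rank is better, while a higher mean score is better, which is why the two are sorted in opposite directions later on.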

In [1]:
import sys
import itertools

import pandas as pd
from tqdm import tqdm

# Add the current directory to the import path before importing the local `service` package
sys.path.append('.')

from service.document_service import DocumentService
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Display settings and data loading
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', None)
doc_service = DocumentService()
documents = doc_service.load_documents()
print(f"Loaded {len(documents)} documents.")

Loaded 1000 documents.


## 1. Defining the Golden Query Set

**Fill in the dictionary below.** Enter each query together with the name of the file that, in your judgment, answers it best.

In [2]:
GOLDEN_SET = {
    "Violence on board during US flight": "kaggle_1.txt",
    "Statistics regarding new virus vaccination fall campaign": "kaggle_0.txt",
    "Supreme Court spouse talks to Capitol riot investigators": "kaggle_14.txt",
    "Rapper gives financial support to Bronx education institution": "kaggle_50.txt",
    "Lawyers argue for life imprisonment for mass murderer": "kaggle_188.txt",
    "New anchor takes over prime time slot on cable news": "kaggle_450.txt",
    "Movie director discusses complex adult film industry character": "kaggle_667.txt",
    "Fatal accident involving college sports team students": "kaggle_999.txt"
}
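
The experiment functions later in this notebook record a rank of -1 when an expected file name is not found and leave that query out of the averages, so a typo in a file name would silently shrink the evaluation set. A quick check before running the grid could catch this. The sketch below is only an illustration: `missing_expected_docs` is a hypothetical helper, and the example dictionary and name list stand in for `GOLDEN_SET` and the loaded document names.

```python
def missing_expected_docs(golden_set, doc_names):
    """Return the queries whose expected file is absent from the corpus."""
    known = set(doc_names)
    return {q: name for q, name in golden_set.items() if name not in known}


# Hypothetical stand-ins for GOLDEN_SET and the loaded document names
example_golden_set = {
    "query about flights": "kaggle_1.txt",
    "query about vaccines": "kaggle_0.txt",
    "query with a typo in the file name": "kagle_14.txt",
}
example_doc_names = [f"kaggle_{i}.txt" for i in range(1000)]

missing = missing_expected_docs(example_golden_set, example_doc_names)
print(missing)  # only the entry with the misspelled file name
```

In the notebook this could be called as `missing_expected_docs(GOLDEN_SET, [d.name for d in documents])`, assuming each document exposes a `name` attribute as the cells below do.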


## 2. TF-IDF Experiments

In [9]:
def run_tfidf_experiment(docs, golden_set, vectorizer_params):
    doc_contents = [d.content for d in docs]
    doc_names = [d.name for d in docs]
    
    vectorizer = TfidfVectorizer(**vectorizer_params)
    tfidf_matrix = vectorizer.fit_transform(doc_contents)
    
    results = {}
    for query, expected_doc_name in golden_set.items():
        query_processed = DocumentService.preprocess_text(query)
        query_vector = vectorizer.transform([query_processed])
        
        sims = cosine_similarity(query_vector, tfidf_matrix).flatten()
        
        # Look up the similarity score of the expected document
        try:
            expected_doc_idx = doc_names.index(expected_doc_name)
            score = sims[expected_doc_idx]
            
            # Determine its rank (1 = most similar document)
            sorted_indices = sims.argsort()[::-1]
            rank = list(sorted_indices).index(expected_doc_idx) + 1
            
            results[query] = {'rank': rank, 'score': score}
        except ValueError:
            results[query] = {'rank': -1, 'score': -1}
            
    return results

# Parameter grid to test for TF-IDF
tfidf_param_grid = {
    'ngram_range': [(1, 1), (1, 2), (2, 2)],
    'min_df': [1, 2, 3, 5],
    'max_df': [0.8, 0.9],
    'sublinear_tf': [True, False],
    'norm': ['l2']
}



tfidf_results = []
params_list = list(itertools.product(*tfidf_param_grid.values()))

for params in tqdm(params_list, desc="TF-IDF Experiments"):
    current_params = dict(zip(tfidf_param_grid.keys(), params))
    
    # Run the experiment for all queries in GOLDEN_SET
    query_results = run_tfidf_experiment(documents, GOLDEN_SET, current_params)
    
    ranks = [res['rank'] for res in query_results.values() if res['rank'] != -1]
    scores = [res['score'] for res in query_results.values() if res['score'] != -1]
    
    if ranks:
        avg_rank = sum(ranks) / len(ranks)
        avg_score = sum(scores) / len(scores)
        result_entry = {**current_params, 'avg_rank': avg_rank, 'avg_score': avg_score, 'all_results': query_results}
        tfidf_results.append(result_entry)

tfidf_df = pd.DataFrame(tfidf_results)
tfidf_df.sort_values(by=['avg_rank', 'avg_score'], ascending=[True, False], inplace=True)

print("Best parameters for TF-IDF")
display(tfidf_df.head(100))

TF-IDF Experiments: 100%|██████████| 48/48 [00:00<00:00, 51.25it/s]


Best parameters for TF-IDF


index,ngram_range,min_df,max_df,sublinear_tf,norm,avg_rank,avg_score,all_results
28,"(1, 2)",5,0.8,True,l2,141.125,0.179287,"{'Violence on board during US flight': {'rank': 9, 'score': 0.21996089551565737}, 'Statistics regarding new virus vaccination fall campaign': {'rank': 11, 'score': 0.18954643879981192}, 'Supreme Court spouse talks to Capitol riot investigators': {'rank': 28, 'score': 0.19314707622584368}, 'Rapper gives financial support to Bronx education institution': {'rank': 5, 'score': 0.25862309051365917}, 'Lawyers argue for life imprisonment for mass murderer': {'rank': 59, 'score': 0.08955198808071897}, 'New anchor takes over prime time slot on cable news': {'rank': 53, 'score': 0.11167377483939141}, 'Movie director discusses complex adult film industry character': {'rank': 961, 'score': 0.0}, 'Fatal accident involving college sports team students': {'rank': 3, 'score': 0.37179015473094784}}"
30,"(1, 2)",5,0.9,True,l2,141.125,0.179287,"{'Violence on board during US flight': {'rank': 9, 'score': 0.21996089551565737}, 'Statistics regarding new virus vaccination fall campaign': {'rank': 11, 'score': 0.18954643879981192}, 'Supreme Court spouse talks to Capitol riot investigators': {'rank': 28, 'score': 0.19314707622584368}, 'Rapper gives financial support to Bronx education institution': {'rank': 5, 'score': 0.25862309051365917}, 'Lawyers argue for life imprisonment for mass murderer': {'rank': 59, 'score': 0.08955198808071897}, 'New anchor takes over prime time slot on cable news': {'rank': 53, 'score': 0.11167377483939141}, 'Movie director discusses complex adult film industry character': {'rank': 961, 'score': 0.0}, 'Fatal accident involving college sports team students': {'rank': 3, 'score': 0.37179015473094784}}"
29,"(1, 2)",5,0.8,False,l2,141.125,0.174523,"{'Violence on board during US flight': {'rank': 9, 'score': 0.21996089551565737}, 'Statistics regarding new virus vaccination fall campaign': {'rank': 11, 'score': 0.185013592740212}, 'Supreme Court spouse talks to Capitol riot investigators': {'rank': 32, 'score': 0.17722173653810155}, 'Rapper gives financial support to Bronx education institution': {'rank': 5, 'score': 0.2455921702276022}, 'Lawyers argue for life imprisonment for mass murderer': {'rank': 58, 'score': 0.08493210772175865}, 'New anchor takes over prime time slot on cable news': {'rank': 50, 'score': 0.11167377483939141}, 'Movie director discusses complex adult film industry character': {'rank': 961, 'score': 0.0}, 'Fatal accident involving college sports team students': {'rank': 3, 'score': 0.37179015473094784}}"
31,"(1, 2)",5,0.9,False,l2,141.125,0.174523,"{'Violence on board during US flight': {'rank': 9, 'score': 0.21996089551565737}, 'Statistics regarding new virus vaccination fall campaign': {'rank': 11, 'score': 0.185013592740212}, 'Supreme Court spouse talks to Capitol riot investigators': {'rank': 32, 'score': 0.17722173653810155}, 'Rapper gives financial support to Bronx education institution': {'rank': 5, 'score': 0.2455921702276022}, 'Lawyers argue for life imprisonment for mass murderer': {'rank': 58, 'score': 0.08493210772175865}, 'New anchor takes over prime time slot on cable news': {'rank': 50, 'score': 0.11167377483939141}, 'Movie director discusses complex adult film industry character': {'rank': 961, 'score': 0.0}, 'Fatal accident involving college sports team students': {'rank': 3, 'score': 0.37179015473094784}}"
21,"(1, 2)",2,0.8,False,l2,141.5,0.115261,"{'Violence on board during US flight': {'rank': 9, 'score': 0.13549555621392861}, 'Statistics regarding new virus vaccination fall campaign': {'rank': 12, 'score': 0.11422741238953942}, 'Supreme Court spouse talks to Capitol riot investigators': {'rank': 37, 'score': 0.09805660853936517}, 'Rapper gives financial support to Bronx education institution': {'rank': 6, 'score': 0.1521451850606366}, 'Lawyers argue for life imprisonment for mass murderer': {'rank': 46, 'score': 0.07966297015364651}, 'New anchor takes over prime time slot on cable news': {'rank': 61, 'score': 0.05746337486394994}, 'Movie director discusses complex adult film industry character': {'rank': 960, 'score': 0.0}, 'Fatal accident involving college sports team students': {'rank': 1, 'score': 0.2850402079901924}}"
23,"(1, 2)",2,0.9,False,l2,141.5,0.115261,"{'Violence on board during US flight': {'rank': 9, 'score': 0.13549555621392861}, 'Statistics regarding new virus vaccination fall campaign': {'rank': 12, 'score': 0.11422741238953942}, 'Supreme Court spouse talks to Capitol riot investigators': {'rank': 37, 'score': 0.09805660853936517}, 'Rapper gives financial support to Bronx education institution': {'rank': 6, 'score': 0.1521451850606366}, 'Lawyers argue for life imprisonment for mass murderer': {'rank': 46, 'score': 0.07966297015364651}, 'New anchor takes over prime time slot on cable news': {'rank': 61, 'score': 0.05746337486394994}, 'Movie director discusses complex adult film industry character': {'rank': 960, 'score': 0.0}, 'Fatal accident involving college sports team students': {'rank': 1, 'score': 0.2850402079901924}}"
20,"(1, 2)",2,0.8,True,l2,141.75,0.118537,"{'Violence on board during US flight': {'rank': 10, 'score': 0.13549555621392861}, 'Statistics regarding new virus vaccination fall campaign': {'rank': 11, 'score': 0.11927572570999612}, 'Supreme Court spouse talks to Capitol riot investigators': {'rank': 36, 'score': 0.11091071398868566}, 'Rapper gives financial support to Bronx education institution': {'rank': 7, 'score': 0.15667348983894472}, 'Lawyers argue for life imprisonment for mass murderer': {'rank': 45, 'score': 0.08343759120461683}, 'New anchor takes over prime time slot on cable news': {'rank': 64, 'score': 0.05746337486394994}, 'Movie director discusses complex adult film industry character': {'rank': 960, 'score': 0.0}, 'Fatal accident involving college sports team students': {'rank': 1, 'score': 0.2850402079901924}}"
22,"(1, 2)",2,0.9,True,l2,141.75,0.118537,"{'Violence on board during US flight': {'rank': 10, 'score': 0.13549555621392861}, 'Statistics regarding new virus vaccination fall campaign': {'rank': 11, 'score': 0.11927572570999612}, 'Supreme Court spouse talks to Capitol riot investigators': {'rank': 36, 'score': 0.11091071398868566}, 'Rapper gives financial support to Bronx education institution': {'rank': 7, 'score': 0.15667348983894472}, 'Lawyers argue for life imprisonment for mass murderer': {'rank': 45, 'score': 0.08343759120461683}, 'New anchor takes over prime time slot on cable news': {'rank': 64, 'score': 0.05746337486394994}, 'Movie director discusses complex adult film industry character': {'rank': 960, 'score': 0.0}, 'Fatal accident involving college sports team students': {'rank': 1, 'score': 0.2850402079901924}}"
24,"(1, 2)",3,0.8,True,l2,142.0,0.129049,"{'Violence on board during US flight': {'rank': 11, 'score': 0.1476307094111076}, 'Statistics regarding new virus vaccination fall campaign': {'rank': 11, 'score': 0.13186222618553373}, 'Supreme Court spouse talks to Capitol riot investigators': {'rank': 34, 'score': 0.13820973589749602}, 'Rapper gives financial support to Bronx education institution': {'rank': 9, 'score': 0.19259617888597483}, 'Lawyers argue for life imprisonment for mass murderer': {'rank': 55, 'score': 0.08955198808071897}, 'New anchor takes over prime time slot on cable news': {'rank': 52, 'score': 0.07994512991097191}, 'Movie director discusses complex adult film industry character': {'rank': 961, 'score': 0.0}, 'Fatal accident involving college sports team students': {'rank': 3, 'score': 0.25259811900037404}}"
26,"(1, 2)",3,0.9,True,l2,142.0,0.129049,"{'Violence on board during US flight': {'rank': 11, 'score': 0.1476307094111076}, 'Statistics regarding new virus vaccination fall campaign': {'rank': 11, 'score': 0.13186222618553373}, 'Supreme Court spouse talks to Capitol riot investigators': {'rank': 34, 'score': 0.13820973589749602}, 'Rapper gives financial support to Bronx education institution': {'rank': 9, 'score': 0.19259617888597483}, 'Lawyers argue for life imprisonment for mass murderer': {'rank': 55, 'score': 0.08955198808071897}, 'New anchor takes over prime time slot on cable news': {'rank': 52, 'score': 0.07994512991097191}, 'Movie director discusses complex adult film industry character': {'rank': 961, 'score': 0.0}, 'Fatal accident involving college sports team students': {'rank': 3, 'score': 0.25259811900037404}}"
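
The first column of the table is the position of each combination in the order produced by `itertools.product`, assuming no combination was skipped by the `if ranks:` check. The sketch below rebuilds the same grid to show how the 48 combinations arise and which parameters sit at index 28. It is an illustration only and does not rerun the experiment.

```python
import itertools

# Same grid as in the cell above
tfidf_param_grid = {
    'ngram_range': [(1, 1), (1, 2), (2, 2)],
    'min_df': [1, 2, 3, 5],
    'max_df': [0.8, 0.9],
    'sublinear_tf': [True, False],
    'norm': ['l2'],
}

# One dict per combination; the last key varies fastest
combos = [
    dict(zip(tfidf_param_grid.keys(), values))
    for values in itertools.product(*tfidf_param_grid.values())
]

print(len(combos))  # 3 * 4 * 2 * 2 * 1 combinations
print(combos[28])   # the combination stored at index 28
```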


In [10]:
best_row = tfidf_df.iloc[0]

best_row['all_results']


{'Violence on board during US flight': {'rank': 9,
  'score': np.float64(0.21996089551565737)},
 'Statistics regarding new virus vaccination fall campaign': {'rank': 11,
  'score': np.float64(0.18954643879981192)},
 'Supreme Court spouse talks to Capitol riot investigators': {'rank': 28,
  'score': np.float64(0.19314707622584368)},
 'Rapper gives financial support to Bronx education institution': {'rank': 5,
  'score': np.float64(0.25862309051365917)},
 'Lawyers argue for life imprisonment for mass murderer': {'rank': 59,
  'score': np.float64(0.08955198808071897)},
 'New anchor takes over prime time slot on cable news': {'rank': 53,
  'score': np.float64(0.11167377483939141)},
 'Movie director discusses complex adult film industry character': {'rank': 961,
  'score': np.float64(0.0)},
 'Fatal accident involving college sports team students': {'rank': 3,
  'score': np.float64(0.37179015473094784)}}
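
The mean rank of 141.125 is dominated by a single query: the expected document for the film-director query has a similarity of 0.0 and lands at rank 961. A cosine similarity of 0.0 between TF-IDF vectors means the query and the document share no weighted term, so the position among other zero-score documents depends only on sort order. The sketch below uses the ranks from the output above to show how much that one query shifts the average. The median is shown purely for comparison; it is not a metric this notebook uses.

```python
from statistics import mean, median

# Ranks of the expected documents for the best TF-IDF configuration,
# copied from the output above
ranks = [9, 11, 28, 5, 59, 53, 961, 3]

print(mean(ranks))    # 141.125, matches avg_rank in the table
print(median(ranks))  # 19.5

# Mean of the remaining seven ranks, without the query that scored 0.0
without_outlier = [r for r in ranks if r != 961]
print(mean(without_outlier))
```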

## 3. Doc2Vec Experiments

In [14]:
def run_doc2vec_experiment_batch(docs, golden_set, model_params):
    tagged_docs = [TaggedDocument(doc.content.split(), [doc.name]) for doc in docs]
    doc_names = [d.name for d in docs]

    # Train the model ONCE for the given parameter set
    model = Doc2Vec(tagged_docs, **model_params)
    
    results = {}
    for query, expected_doc_name in golden_set.items():
        query_tokens = DocumentService.preprocess_text(query, return_tokens=True)
        # Use more inference epochs for a more stable query vector
        # (`epochs` replaces the `steps` argument of older gensim versions).
        # Note: no seed is set here, so results may differ between runs.
        query_vector = model.infer_vector(query_tokens, epochs=100)
        
        try:
            # Get similarities to all documents, sorted from most to least similar
            sims = model.dv.most_similar([query_vector], topn=len(docs))
            sim_names = [name for name, _ in sims]
            
            rank = sim_names.index(expected_doc_name) + 1
            score = next(s for name, s in sims if name == expected_doc_name)
            
            results[query] = {'rank': rank, 'score': score}
        except (ValueError, KeyError):
            results[query] = {'rank': -1, 'score': -1}
            
    return results

# Extended parameter grid for Doc2Vec
doc2vec_param_grid = {
    'vector_size': [200, 300],
    'window': [5, 10],
    'min_count': [1, 2],
    'epochs': [100, 200],
    'dm': [0, 1],
    'alpha': [0.025, 0.05],
    'dbow_words': [1]
}

doc2vec_results = []
params_list_d2v = list(itertools.product(*doc2vec_param_grid.values()))

for params in tqdm(params_list_d2v, desc="Doc2Vec Experiments"):
    current_params = dict(zip(doc2vec_param_grid.keys(), params))
    
    # Batch run over all queries
    query_results = run_doc2vec_experiment_batch(documents, GOLDEN_SET, current_params)
    
    ranks = [res['rank'] for res in query_results.values() if res['rank'] != -1]
    scores = [res['score'] for res in query_results.values() if res['score'] != -1]
    
    if ranks:
        avg_rank = sum(ranks) / len(ranks)
        avg_score = sum(scores) / len(scores)
        result_entry = {**current_params, 'avg_rank': avg_rank, 'avg_score': avg_score, 'all_results': query_results}
        doc2vec_results.append(result_entry)

doc2vec_df = pd.DataFrame(doc2vec_results)
doc2vec_df.sort_values(by=['avg_rank', 'avg_score'], ascending=[True, False], inplace=True)

print("Best parameters for Doc2Vec (Summary)")
# Drop all_results from the view to keep the table readable
display(doc2vec_df.drop(columns=['all_results']).head(100))

Doc2Vec Experiments:   2%|▏         | 1/64 [00:04<05:09,  4.92s/it]
Exception ignored in: 'gensim.models.word2vec_inner.our_dot_float'
Exception ignored in: 'gensim.models.word2vec_inner.our_dot_float'
Doc2Vec Experiments:   3%|▎         | 2/64 [00:09<04:52,  4.72s/it]
Exception ignored in: 'gensim.models.word2vec_inner.our_dot_float'
Doc2Vec Experiments:   6%|▋         | 4/64 [00:13<02:51,  2.86s/it]
Exception ignored in: 'gensim.models.word2vec_inner.our_dot_float'
Doc2Vec Experiments:   8%|▊         | 5/64 [00:22<04:58,  5.06s/it]
Exception ignored in: 'gensim.models.word2vec_inner.our_dot_float'
Exception ignored in: 'gensim.models.word2vec_inner.our_dot_float'
Doc2Vec Experiments:  20%|██        | 13/64 [00:56<03:40,  4.33s/it]
Exception ignored in: 'gensim.models.word2vec_inner.our_dot_float'
Doc2Vec Experiments:  25%|██▌       | 16/64 [01:10<03:25,  4.28s/it]
Exception ignored in: 'gensim.models.word2vec_inner.our_dot_float'
Doc2Vec Experiments:  27%|██▋       | 17/64 [01:17<03:57,  5

Best parameters for Doc2Vec (Summary)

index,vector_size,window,min_count,epochs,dm,alpha,dbow_words,avg_rank,avg_score
9,200,5,2,100,0,0.05,1,19.75,0.489195
12,200,5,2,200,0,0.025,1,23.125,0.492849
45,300,5,2,200,0,0.05,1,25.0,0.469414
41,300,5,2,100,0,0.05,1,28.875,0.47004
13,200,5,2,200,0,0.05,1,28.875,0.459404
8,200,5,2,100,0,0.025,1,29.5,0.506306
44,300,5,2,200,0,0.025,1,35.875,0.486903
29,200,10,2,200,0,0.05,1,37.125,0.477147
61,300,10,2,200,0,0.05,1,46.125,0.461959
28,200,10,2,200,0,0.025,1,48.375,0.488038
