***For the level and question columns only: create a CSV file containing all rows***




In [None]:
# --- Step 1: read the source file ---
import pandas as pd

DATA_PATH = "/content/train.xls"
# Read the file from the Colab working directory.
# Note: despite the .xls extension, the file is parsed as CSV text here;
# a genuine Excel workbook would need pd.read_excel instead.
df = pd.read_csv(DATA_PATH)

# --- Step 2: keep only the two columns of interest ---
filtered_df = df[["question", "level"]]

# --- Step 3: drop duplicate questions ---
filtered_df = filtered_df.drop_duplicates(subset=["question"], keep="first")

# --- Step 4: save to CSV ---
filtered_df.to_csv("train_filtered.csv", index=False, encoding="utf-8")

# Display the first rows as a sanity check
filtered_df.head()


Unnamed: 0,question,level
0,Which magazine was started first Arthur's Maga...,medium
1,The Oberoi family is part of a hotel company t...,medium
2,Musician and satirist Allie Goertz wrote a son...,hard
3,What nationality was James Henry Miller's wife?,medium
4,Cadmium Chloride is slightly soluble in this c...,medium


If the CSV is uploaded manually, you can start from here.

# ***Exercise 3 of the project***

***Loading the file and removing duplicates from the CSV***

In [3]:
import pandas as pd

# --- Step 1: load the file ---
filtered_df = pd.read_csv("/content/train-filtered_question_level.csv")

# --- Step 2: drop duplicate questions ---
filtered_df = filtered_df.drop_duplicates(subset=["question"], keep="first")

# ***A-1***

***Importing libraries and a first look at the labels***

---



In [5]:
from sklearn.model_selection import train_test_split

# Check that the expected columns exist and inspect the data
print(filtered_df.columns)
print(filtered_df.head())

# Show global label distribution for 'level'
print("\nGlobal distribution of 'level':")
print(filtered_df["level"].value_counts(normalize=True))


Index(['question', 'level'], dtype='object')
                                            question   level
0  Which magazine was started first Arthur's Maga...  medium
1  The Oberoi family is part of a hotel company t...  medium
2  Musician and satirist Allie Goertz wrote a son...    hard
3    What nationality was James Henry Miller's wife?  medium
4  Cadmium Chloride is slightly soluble in this c...  medium

Global distribution of 'level':
level
medium    0.628149
easy      0.198688
hard      0.173162
Name: proportion, dtype: float64


***Balanced split into train / validation / test (with stratify)***

In [6]:
# Define split proportions
TEST_SIZE = 0.15      # 15% of total data for test
VAL_SIZE = 0.15       # 15% of total data for validation
RANDOM_STATE = 42     # For reproducibility

# Compute validation size relative to the remaining data after test split
val_size_relative = VAL_SIZE / (1 - TEST_SIZE)  # e.g., 0.15 / 0.85

print("Relative validation size (from train_val):", val_size_relative)

# Step 1: Split into train_val and test with stratification on 'level'
train_val_df, test_df = train_test_split(
    filtered_df,
    test_size=TEST_SIZE,
    stratify=filtered_df["level"],
    random_state=RANDOM_STATE
)

# Step 2: Split train_val into train and validation with stratification on 'level'
train_df, val_df = train_test_split(
    train_val_df,
    test_size=val_size_relative,
    stratify=train_val_df["level"],
    random_state=RANDOM_STATE
)

print("Finished stratified split into train / validation / test.")


Relative validation size (from train_val): 0.17647058823529413
Finished stratified split into train / validation / test.
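
The relative validation size follows from simple arithmetic: once 15% of the rows are held out for test, 85% remain, so taking 0.15 / 0.85 of that remainder corresponds to 15% of the original data. A minimal check of that arithmetic, using only the constants from the cell above (the names `remaining`, `val_fraction_of_total`, and `train_fraction_of_total` are illustrative and not part of the notebook):

```python
TEST_SIZE = 0.15
VAL_SIZE = 0.15

remaining = 1 - TEST_SIZE                    # share of rows left after the test split
val_size_relative = VAL_SIZE / remaining     # share of the remainder used for validation

# Convert the relative sizes back into shares of the full dataset
val_fraction_of_total = val_size_relative * remaining
train_fraction_of_total = (1 - val_size_relative) * remaining

print(f"validation share of all rows: {val_fraction_of_total:.2f}")
print(f"train share of all rows:      {train_fraction_of_total:.2f}")
```

The two-step split therefore yields roughly a 70 / 15 / 15 partition.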


***Checking that the split is stratified and has the desired proportions***

In [7]:
def print_split_info(df, name):
    print(f"\n{name}:")
    print("Number of rows:", len(df))
    print("Label distribution for 'level':")
    print(df["level"].value_counts(normalize=True))

print("Total rows in original filtered_df:", len(filtered_df))

print_split_info(train_df, "Train set")
print_split_info(val_df, "Validation set")
print_split_info(test_df, "Test set")


Total rows in original filtered_df: 90418

Train set:
Number of rows: 63292
Label distribution for 'level':
level
medium    0.628152
easy      0.198682
hard      0.173166
Name: proportion, dtype: float64

Validation set:
Number of rows: 13563
Label distribution for 'level':
level
medium    0.628180
easy      0.198702
hard      0.173118
Name: proportion, dtype: float64

Test set:
Number of rows: 13563
Label distribution for 'level':
level
medium    0.628106
easy      0.198702
hard      0.173192
Name: proportion, dtype: float64


# **A-2**

***Text preprocessing (tokenization + lemmatization)***

In [15]:
import re
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize, pos_tag

# Download required NLTK resources (only once)
nltk.download("punkt")
nltk.download('averaged_perceptron_tagger_eng')
nltk.download("wordnet")
nltk.download("omw-1.4")
nltk.download("averaged_perceptron_tagger")
nltk.download("stopwords")

lemmatizer = WordNetLemmatizer()
eng_stops = set(stopwords.words("english"))
BE_FORMS = {"am", "is", "are", "was", "were", "be", "been", "being"}

def get_wordnet_pos(tag: str):
    """
    Map POS tag from nltk.pos_tag to a WordNet POS tag.
    This helps the lemmatizer pick the correct base form.
    """
    if tag.startswith("J"):
        return wordnet.ADJ
    if tag.startswith("V"):
        return wordnet.VERB
    if tag.startswith("N"):
        return wordnet.NOUN
    if tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN

# Regex patterns for cleaning
url_email_handle_re = re.compile(r"(https?://\S+|www\.\S+|\S+@\S+|[@#]\w+)", re.IGNORECASE)
digits_re = re.compile(r"\d+")            # digits -> _number
non_letter_re = re.compile(r"[^a-z_ ]+")  # after lowercase, keep only a-z, space, underscore

def process_text_value(text: str) -> str:
    """
    Full preprocessing for a single text value, in the order applied:
    - Remove URLs, emails, and @handles/#hashtags
    - Tokenize
    - POS tagging
    - Normalize 'be' verb forms to 'be'
    - Lemmatization with POS
    - Lowercase
    - Replace digit runs with '_number'
    - Remove non-letter characters (keep a-z, space, underscore)
    - Stopword removal (optional; commented out in the body below)
    Returns a cleaned string with space-separated tokens.
    """
    if not isinstance(text, str):
        return ""

    t = text

    # Remove URLs, emails, handles, hashtags
    t = url_email_handle_re.sub(" ", t)

    # Tokenize and POS-tag on original text
    tokens = word_tokenize(t)
    tagged = pos_tag(tokens)

    lemmas = []
    for tok, pos in tagged:
        # Normalize 'be' forms early
        if tok.lower() in BE_FORMS:
            lemmas.append("be")
            continue

        # Map tag to WordNet POS and lemmatize
        wn_pos = get_wordnet_pos(pos)
        lemma = lemmatizer.lemmatize(tok, wn_pos)
        lemmas.append(lemma)

    # Lowercase
    lemmas = [w.lower() for w in lemmas]

    # Replace digits inside tokens with '_number'
    lemmas = [digits_re.sub("_number", w) for w in lemmas]

    # Keep only a-z / underscore / spaces, and clean token by token
    clean_lemmas = []
    for w in lemmas:
        w2 = non_letter_re.sub(" ", w).strip()
        if not w2:
            continue
        # If cleaning produced multiple parts, split them
        for part in w2.split():
            clean_lemmas.append(part)

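    # Optional stopword removal (disabled as written; uncomment to enable).
    # Note: the sample output of the next cell contains no stopwords, so it
    # appears to have been produced with this step enabled.
    # clean_lemmas = [w for w in clean_lemmas if w not in eng_stops]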

    # Join back into a single string
    return " ".join(clean_lemmas)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


***Apply preprocessing to train / val / test***

In [16]:
# Apply processing to 'question' column in each split
train_df["clean_text"] = train_df["question"].apply(process_text_value)
val_df["clean_text"]   = val_df["question"].apply(process_text_value)
test_df["clean_text"]  = test_df["question"].apply(process_text_value)

# Also create a token list from the cleaned string (for Word2Vec etc.)
train_df["tokens"] = train_df["clean_text"].apply(lambda s: s.split())
val_df["tokens"]   = val_df["clean_text"].apply(lambda s: s.split())
test_df["tokens"]  = test_df["clean_text"].apply(lambda s: s.split())

train_df[["question", "clean_text"]].head()


Unnamed: 0,question,clean_text
72693,How many studio albums has the band that playe...,many studio album band play warrior call release
24973,When was the team that won the 1981-82 Turkish...,team win _number _number turkish first footbal...
19388,What genre of music do Sonic Reign and Emperor...,genre music sonic reign emperor fall
67473,What is Rollkommando Hamann?,rollkommando hamann
62224,Who starred with Jay Mohr in a 1997 American r...,star jay mohr _number american romantic comedy
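
The `_number _number` pair in the second row comes from the digit and non-letter regexes acting on a token like `1981-82`. The sketch below isolates just those two cleaning steps; it deliberately skips POS tagging and lemmatization, and uses plain whitespace splitting as a stand-in for `word_tokenize`, so it is an illustration rather than a replacement for `process_text_value`. The helper name `clean_tokens` is hypothetical.

```python
import re

# Same patterns as in the preprocessing cell
digits_re = re.compile(r"\d+")
non_letter_re = re.compile(r"[^a-z_ ]+")

def clean_tokens(tokens):
    """Lowercase, replace digit runs with '_number', and drop other non-letters."""
    cleaned = []
    for tok in tokens:
        tok = digits_re.sub("_number", tok.lower())
        tok = non_letter_re.sub(" ", tok).strip()
        # A token such as '_number-_number' splits into two parts here
        cleaned.extend(tok.split())
    return cleaned

# Whitespace splitting stands in for word_tokenize in this sketch
sample = "Won the 1981-82 Turkish Cup".split()
result = clean_tokens(sample)
print(result)
```

Because the hyphen is replaced by a space, a season range becomes two separate `_number` tokens instead of one.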


***TF-IDF representation (document → vector)***

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Define a TF-IDF vectorizer
# You can adjust max_features depending on dataset size
tfidf_vectorizer = TfidfVectorizer(
    max_features=10000,   # limit vocabulary size (optional)
    ngram_range=(1, 1),   # unigrams only
)

# Fit on train set and transform all splits
X_train_tfidf = tfidf_vectorizer.fit_transform(train_df["clean_text"])
X_val_tfidf   = tfidf_vectorizer.transform(val_df["clean_text"])
X_test_tfidf  = tfidf_vectorizer.transform(test_df["clean_text"])

print("TF-IDF shapes:")
print("X_train_tfidf:", X_train_tfidf.shape)
print("X_val_tfidf:  ", X_val_tfidf.shape)
print("X_test_tfidf: ", X_test_tfidf.shape)
print("(Num of documents, max_features)")


TF-IDF shapes:
X_train_tfidf: (63292, 10000)
X_val_tfidf:   (13563, 10000)
X_test_tfidf:  (13563, 10000)
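
To make the numbers inside these matrices less opaque, here is a small standard-library sketch of the weighting that `TfidfVectorizer` applies with its default settings: the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, multiplied by the raw term count and then L2-normalised per document. The toy documents and the helper `tfidf_vector` are assumptions made for illustration; this does not reproduce sklearn's tokenizer or sparse output.

```python
import math
from collections import Counter

# Toy "training" documents, already preprocessed
docs = [
    "genre music sonic reign",
    "studio album band",
    "band play music",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
df = Counter(term for toks in tokenized for term in set(toks))

# Smoothed IDF (the default smooth_idf=True formula)
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

def tfidf_vector(tokens):
    """Return an L2-normalised {term: weight} mapping; unseen terms are ignored."""
    tf = Counter(tokens)
    raw = {term: tf[term] * idf[term] for term in tf if term in idf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()} if norm else {}

vec = tfidf_vector("band play music".split())
print(vec)
```

Terms that occur in fewer training documents (here `play`) receive a larger weight than common ones (`band`, `music`), and words that were never seen during fitting contribute nothing, which is why the vectorizer is fitted on the train split only.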


***Train Word2Vec on tokens (word embeddings)***

In [24]:
!pip install gensim
from gensim.models import Word2Vec

# Prepare sentences for Word2Vec (list of token lists)
train_sentences = train_df["tokens"].tolist()

# Train a Word2Vec model on the training set only
w2v_model = Word2Vec(
    sentences=train_sentences,
    vector_size=100,   # size of word vectors
    window=5,          # context window
    min_count=2,       # ignore words with total frequency < 2
    workers=4,         # number of CPU threads (Colab usually supports this)
    sg=1,              # 1 = skip-gram, 0 = CBOW
    seed=42            # fixed seed; exact reproducibility also requires workers=1
)

print("Word2Vec model trained.")
print("Vocabulary size:", len(w2v_model.wv.key_to_index))
print("Vector size:", w2v_model.vector_size)


Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
Installing collected packages: gensim
Successfully installed gensim-4.4.0
Word2Vec model trained.
Vocabulary size: 26984
Vector size: 100


***Build document vectors from Word2Vec (TF-IDF weighted average)***

In [25]:
import numpy as np

# Build a dictionary: word -> IDF score, based on the TF-IDF vocabulary
idf_scores = dict(zip(tfidf_vectorizer.get_feature_names_out(),
                      tfidf_vectorizer.idf_))

def document_vector(tokens, use_tfidf_weight=True):
    """
    Compute a single document vector from word vectors.
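    By default each token occurrence is weighted by its IDF score, so a
    repeated token contributes in proportion to its term frequency.
    Tokens missing from the TF-IDF vocabulary fall back to weight 1.0.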
    - tokens: list of preprocessed, lemmatized tokens
    - use_tfidf_weight: if True, weight each word vector by its IDF
    """
    vectors = []
    weights = []

    for tok in tokens:
        if tok in w2v_model.wv:
            vec = w2v_model.wv[tok]
            if use_tfidf_weight:
                weight = idf_scores.get(tok, 1.0)
            else:
                weight = 1.0
            vectors.append(vec * weight)
            weights.append(weight)

    if not vectors:
        # If no token has a vector, return a zero vector
        return np.zeros(w2v_model.vector_size, dtype=np.float32)

    vectors = np.vstack(vectors)
    weights = np.array(weights, dtype=np.float32)

    # Weighted average: sum(w_i * v_i) / sum(w_i)
    return vectors.sum(axis=0) / weights.sum()

# Build document-level vectors for each split
X_train_w2v = np.vstack(train_df["tokens"].apply(lambda toks: document_vector(toks, use_tfidf_weight=True)))
X_val_w2v   = np.vstack(val_df["tokens"].apply(lambda toks: document_vector(toks, use_tfidf_weight=True)))
X_test_w2v  = np.vstack(test_df["tokens"].apply(lambda toks: document_vector(toks, use_tfidf_weight=True)))

print("Word2Vec document matrices shapes:")
print("X_train_w2v:", X_train_w2v.shape)
print("X_val_w2v:  ", X_val_w2v.shape)
print("X_test_w2v: ", X_test_w2v.shape)


Word2Vec document matrices shapes:
X_train_w2v: (63292, 100)
X_val_w2v:   (13563, 100)
X_test_w2v:  (13563, 100)
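
The weighted average `sum(w_i * v_i) / sum(w_i)` used by `document_vector` can be checked by hand on a toy case. The sketch below uses plain Python lists instead of NumPy, and the two-dimensional vectors and weights are made-up values chosen only so the arithmetic is easy to follow; `weighted_average`, `toy_vectors`, and `toy_idf` are hypothetical names.

```python
def weighted_average(vectors, weights):
    """Return sum(w_i * v_i) / sum(w_i) for equal-length vectors."""
    total_weight = sum(weights)
    dim = len(vectors[0])
    return [
        sum(w * v[i] for v, w in zip(vectors, weights)) / total_weight
        for i in range(dim)
    ]

# Toy 2-dimensional "word vectors" and IDF-style weights (made-up values)
toy_vectors = {"band": [1.0, 0.0], "rollkommando": [0.0, 1.0]}
toy_idf = {"band": 1.0, "rollkommando": 3.0}

tokens = ["band", "rollkommando"]
doc_vec = weighted_average(
    [toy_vectors[t] for t in tokens],
    [toy_idf[t] for t in tokens],
)
print(doc_vec)
```

The rarer word, with the larger weight, pulls the document vector towards its own direction, which is the intended effect of IDF weighting.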


# ***B-1: Binary classification***

***Filter to binary classes (easy, hard)***

In [29]:
# Keep only 'easy' and 'hard' classes
binary_train = train_df[train_df["level"].isin(["easy", "hard"])].copy()
binary_val   = val_df[val_df["level"].isin(["easy", "hard"])].copy()
binary_test  = test_df[test_df["level"].isin(["easy", "hard"])].copy()

print("Train size:", len(binary_train))
print("Validation size:", len(binary_val))
print("Test size:", len(binary_test))

print("\nTrain label distribution:")
print(binary_train["level"].value_counts(normalize=True))


Train size: 23535
Validation size: 5043
Test size: 5044

Train label distribution:
level
easy    0.534311
hard    0.465689
Name: proportion, dtype: float64


***Encode labels (easy=0, hard=1)***

In [30]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

y_train = le.fit_transform(binary_train["level"])
y_val   = le.transform(binary_val["level"])
y_test  = le.transform(binary_test["level"])

print("Label classes:", le.classes_)  # ['easy' 'hard']


Label classes: ['easy' 'hard']


***Build TF-IDF for the binary subsets***

In [31]:
# Reuse the same TF-IDF vectorizer that was already fitted on full train_df
X_train_tfidf_bin = tfidf_vectorizer.transform(binary_train["clean_text"])
X_val_tfidf_bin   = tfidf_vectorizer.transform(binary_val["clean_text"])
X_test_tfidf_bin  = tfidf_vectorizer.transform(binary_test["clean_text"])

print("Binary TF-IDF shapes:")
print("X_train_tfidf_bin:", X_train_tfidf_bin.shape)
print("X_val_tfidf_bin:  ", X_val_tfidf_bin.shape)
print("X_test_tfidf_bin: ", X_test_tfidf_bin.shape)


Binary TF-IDF shapes:
X_train_tfidf_bin: (23535, 10000)
X_val_tfidf_bin:   (5043, 10000)
X_test_tfidf_bin:  (5044, 10000)


***Build Word2Vec document vectors for the binary subsets***

In [32]:
import numpy as np

# Assuming you already have w2v_model and document_vector() defined

X_train_w2v_bin = np.vstack(
    binary_train["tokens"].apply(lambda toks: document_vector(toks, use_tfidf_weight=True))
)
X_val_w2v_bin = np.vstack(
    binary_val["tokens"].apply(lambda toks: document_vector(toks, use_tfidf_weight=True))
)
X_test_w2v_bin = np.vstack(
    binary_test["tokens"].apply(lambda toks: document_vector(toks, use_tfidf_weight=True))
)

print("Binary Word2Vec shapes:")
print("X_train_w2v_bin:", X_train_w2v_bin.shape)
print("X_val_w2v_bin:  ", X_val_w2v_bin.shape)
print("X_test_w2v_bin: ", X_test_w2v_bin.shape)


Binary Word2Vec shapes:
X_train_w2v_bin: (23535, 100)
X_val_w2v_bin:   (5043, 100)
X_test_w2v_bin:  (5044, 100)


***Utility: evaluate model***

In [33]:
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

def evaluate_model(model_name, representation_name, y_true, y_pred):
    print(f"\n=== {model_name} + {representation_name} ===")
    print("Accuracy:", accuracy_score(y_true, y_pred))
    print("F1 Score:", f1_score(y_true, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_true, y_pred))


***TF-IDF + Naive Bayes***

In [34]:
from sklearn.naive_bayes import MultinomialNB

nb_tfidf = MultinomialNB()
nb_tfidf.fit(X_train_tfidf_bin, y_train)

pred_val = nb_tfidf.predict(X_val_tfidf_bin)

evaluate_model("Naive Bayes", "TF-IDF", y_val, pred_val)



=== Naive Bayes + TF-IDF ===
Accuracy: 0.6014277215942891
F1 Score: 0.4964929859719439
Confusion Matrix:
 [[2042  653]
 [1357  991]]


***TF-IDF + Logistic Regression***

In [35]:
from sklearn.linear_model import LogisticRegression

lr_tfidf = LogisticRegression(max_iter=2000)
lr_tfidf.fit(X_train_tfidf_bin, y_train)

pred_val = lr_tfidf.predict(X_val_tfidf_bin)

evaluate_model("Logistic Regression", "TF-IDF", y_val, pred_val)



=== Logistic Regression + TF-IDF ===
Accuracy: 0.6652786040055523
F1 Score: 0.631601920558708
Confusion Matrix:
 [[1908  787]
 [ 901 1447]]


***Word2Vec + Naive Bayes***

In [36]:
from sklearn.naive_bayes import GaussianNB

nb_w2v = GaussianNB()
nb_w2v.fit(X_train_w2v_bin, y_train)

pred_val = nb_w2v.predict(X_val_w2v_bin)

evaluate_model("Naive Bayes (Gaussian)", "Word2Vec", y_val, pred_val)



=== Naive Bayes (Gaussian) + Word2Vec ===
Accuracy: 0.5683125123934166
F1 Score: 0.4553415061295972
Confusion Matrix:
 [[1956  739]
 [1438  910]]


***Word2Vec + Logistic Regression***

In [37]:
lr_w2v = LogisticRegression(max_iter=2000)
lr_w2v.fit(X_train_w2v_bin, y_train)

pred_val = lr_w2v.predict(X_val_w2v_bin)

evaluate_model("Logistic Regression", "Word2Vec", y_val, pred_val)



=== Logistic Regression + Word2Vec ===
Accuracy: 0.6178861788617886
F1 Score: 0.5386641129997606
Confusion Matrix:
 [[1991  704]
 [1223 1125]]


# ***B-1: Multi-class classification (3 classes)***

***Build multi-class subsets (easy, medium, hard)***

In [38]:
# Keep only the three target classes
target_levels = ["easy", "medium", "hard"]

multi_train = train_df[train_df["level"].isin(target_levels)].copy()
multi_val   = val_df[val_df["level"].isin(target_levels)].copy()
multi_test  = test_df[test_df["level"].isin(target_levels)].copy()

print("Train size:", len(multi_train))
print("Validation size:", len(multi_val))
print("Test size:", len(multi_test))

print("\nTrain label distribution:")
print(multi_train["level"].value_counts(normalize=True))

print("\nUnique levels in all splits:")
print("Train:", multi_train["level"].unique())
print("Val:  ", multi_val["level"].unique())
print("Test: ", multi_test["level"].unique())


Train size: 63292
Validation size: 13563
Test size: 13563

Train label distribution:
level
medium    0.628152
easy      0.198682
hard      0.173166
Name: proportion, dtype: float64

Unique levels in all splits:
Train: ['medium' 'hard' 'easy']
Val:   ['hard' 'medium' 'easy']
Test:  ['medium' 'easy' 'hard']


***Encode labels (3 classes)***

In [39]:
from sklearn.preprocessing import LabelEncoder

le_multi = LabelEncoder()

y_train_multi = le_multi.fit_transform(multi_train["level"])
y_val_multi   = le_multi.transform(multi_val["level"])
y_test_multi  = le_multi.transform(multi_test["level"])

print("Label classes (order):", le_multi.classes_)  # LabelEncoder sorts classes alphabetically: ['easy' 'hard' 'medium']


Label classes (order): ['easy' 'hard' 'medium']
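
The order matters for reading the confusion matrices that follow: `LabelEncoder` stores its classes in sorted order, so `medium` ends up as index 2 rather than sitting between `easy` and `hard`. A standard-library sketch that mimics this mapping (the variable names are illustrative):

```python
levels = ["medium", "hard", "easy", "medium"]

# Sorted unique labels, mirroring how the encoder orders its classes
classes = sorted(set(levels))
class_to_id = {name: idx for idx, name in enumerate(classes)}
encoded = [class_to_id[name] for name in levels]

print(classes)
print(encoded)
```

The integer codes are therefore arbitrary identifiers and carry no difficulty ordering.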


***TF-IDF representation for multi-class***

In [40]:
# Transform clean_text into TF-IDF vectors using the existing fitted vectorizer
X_train_tfidf_multi = tfidf_vectorizer.transform(multi_train["clean_text"])
X_val_tfidf_multi   = tfidf_vectorizer.transform(multi_val["clean_text"])
X_test_tfidf_multi  = tfidf_vectorizer.transform(multi_test["clean_text"])

print("TF-IDF shapes (multi-class):")
print("X_train_tfidf_multi:", X_train_tfidf_multi.shape)
print("X_val_tfidf_multi:  ", X_val_tfidf_multi.shape)
print("X_test_tfidf_multi: ", X_test_tfidf_multi.shape)


TF-IDF shapes (multi-class):
X_train_tfidf_multi: (63292, 10000)
X_val_tfidf_multi:   (13563, 10000)
X_test_tfidf_multi:  (13563, 10000)


***Word2Vec document vectors for multi-class***

In [41]:
import numpy as np

# Build document-level vectors using the existing Word2Vec model
X_train_w2v_multi = np.vstack(
    multi_train["tokens"].apply(lambda toks: document_vector(toks, use_tfidf_weight=True))
)
X_val_w2v_multi = np.vstack(
    multi_val["tokens"].apply(lambda toks: document_vector(toks, use_tfidf_weight=True))
)
X_test_w2v_multi = np.vstack(
    multi_test["tokens"].apply(lambda toks: document_vector(toks, use_tfidf_weight=True))
)

print("Word2Vec document shapes (multi-class):")
print("X_train_w2v_multi:", X_train_w2v_multi.shape)
print("X_val_w2v_multi:  ", X_val_w2v_multi.shape)
print("X_test_w2v_multi: ", X_test_w2v_multi.shape)


Word2Vec document shapes (multi-class):
X_train_w2v_multi: (63292, 100)
X_val_w2v_multi:   (13563, 100)
X_test_w2v_multi:  (13563, 100)


***Evaluation helper (Accuracy, macro-F1, confusion matrix)***

In [42]:
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

def evaluate_multi(model_name, representation_name, y_true, y_pred, label_encoder):
    """
    Print accuracy, macro F1, and confusion matrix for a multi-class setting.
    """
    print(f"\n=== {model_name} + {representation_name} ===")
    print("Accuracy:", accuracy_score(y_true, y_pred))
    print("Macro F1:", f1_score(y_true, y_pred, average="macro"))
    print("\nConfusion Matrix (rows=true, cols=pred):")
    print(confusion_matrix(y_true, y_pred))
    print("Label order:", label_encoder.classes_)


***TF-IDF + Multinomial Naive Bayes (3 classes)***

In [43]:
from sklearn.naive_bayes import MultinomialNB

nb_tfidf_multi = MultinomialNB()
nb_tfidf_multi.fit(X_train_tfidf_multi, y_train_multi)

pred_val_nb_tfidf = nb_tfidf_multi.predict(X_val_tfidf_multi)

evaluate_multi("Naive Bayes (Multinomial)", "TF-IDF", y_val_multi, pred_val_nb_tfidf, le_multi)



=== Naive Bayes (Multinomial) + TF-IDF ===
Accuracy: 0.6323822163238222
Macro F1: 0.28222654756253046

Confusion Matrix (rows=true, cols=pred):
[[  99   11 2585]
 [  17    4 2327]
 [  32   14 8474]]
Label order: ['easy' 'hard' 'medium']


***TF-IDF + Logistic Regression (3 classes)***

In [44]:
from sklearn.linear_model import LogisticRegression

lr_tfidf_multi = LogisticRegression(max_iter=2000)
lr_tfidf_multi.fit(X_train_tfidf_multi, y_train_multi)

pred_val_lr_tfidf = lr_tfidf_multi.predict(X_val_tfidf_multi)

evaluate_multi("Logistic Regression", "TF-IDF", y_val_multi, pred_val_lr_tfidf, le_multi)



=== Logistic Regression + TF-IDF ===
Accuracy: 0.6435154464351545
Macro F1: 0.3894805596490973

Confusion Matrix (rows=true, cols=pred):
[[ 732   27 1936]
 [ 126   38 2184]
 [ 451  111 7958]]
Label order: ['easy' 'hard' 'medium']


***Word2Vec + Gaussian Naive Bayes (3 classes)***

In [45]:
from sklearn.naive_bayes import GaussianNB

nb_w2v_multi = GaussianNB()
nb_w2v_multi.fit(X_train_w2v_multi, y_train_multi)

pred_val_nb_w2v = nb_w2v_multi.predict(X_val_w2v_multi)

evaluate_multi("Naive Bayes (Gaussian)", "Word2Vec", y_val_multi, pred_val_nb_w2v, le_multi)



=== Naive Bayes (Gaussian) + Word2Vec ===
Accuracy: 0.43567057435670575
Macro F1: 0.35943072535855597

Confusion Matrix (rows=true, cols=pred):
[[1388  357  950]
 [ 972  370 1006]
 [3119 1250 4151]]
Label order: ['easy' 'hard' 'medium']


***Word2Vec + Logistic Regression (3 classes)***

In [46]:
lr_w2v_multi = LogisticRegression(max_iter=2000)
lr_w2v_multi.fit(X_train_w2v_multi, y_train_multi)

pred_val_lr_w2v = lr_w2v_multi.predict(X_val_w2v_multi)

evaluate_multi("Logistic Regression", "Word2Vec", y_val_multi, pred_val_lr_w2v, le_multi)



=== Logistic Regression + Word2Vec ===
Accuracy: 0.6273685762736858
Macro F1: 0.2779021393580056

Confusion Matrix (rows=true, cols=pred):
[[  93    0 2602]
 [  24    0 2324]
 [ 104    0 8416]]
Label order: ['easy' 'hard' 'medium']
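
A useful reference point for all four multi-class results is a classifier that always predicts the majority class. The sketch below computes that baseline from the validation class counts, which are the row sums of the confusion matrices above (2695 easy, 2348 hard, 8520 medium); the variable names are illustrative.

```python
# Validation-set class counts, taken from the row sums of the confusion matrices above
val_counts = {"easy": 2695, "hard": 2348, "medium": 8520}

total = sum(val_counts.values())
majority_class = max(val_counts, key=val_counts.get)

# A classifier that always predicts the majority class
baseline_accuracy = val_counts[majority_class] / total

# F1 is zero for every class that is never predicted, so only the
# majority class contributes to the macro average
tp = val_counts[majority_class]
fp = total - tp          # every other example is wrongly labelled as the majority class
f1_majority = 2 * tp / (2 * tp + fp)
baseline_macro_f1 = f1_majority / len(val_counts)

print(f"majority class: {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"baseline macro F1: {baseline_macro_f1:.4f}")
```

Seen against this baseline, Logistic Regression on Word2Vec (accuracy 0.6274, no `hard` predictions at all) is slightly below always answering `medium`, and the two TF-IDF models exceed it by only a small margin in accuracy.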


***Defining helper functions for the hyperparameter experiments***

In [47]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

def evaluate_scores(y_true, y_pred):
    """
    Compute accuracy and macro F1 score.
    """
    acc = accuracy_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred, average="macro")
    return acc, f1


def tune_nb_tfidf(X_train, y_train, X_val, y_val, alphas, representation_name="TF-IDF"):
    """
    Hyperparameter tuning for Multinomial Naive Bayes on TF-IDF features.
    Varies the smoothing parameter 'alpha' and prints validation performance.
    Returns a list of results (alpha, accuracy, f1).
    """
    results = []
    print(f"\n=== Naive Bayes (Multinomial) + {representation_name} — alpha sweep ===")
    for a in alphas:
        model = MultinomialNB(alpha=a)
        model.fit(X_train, y_train)
        y_pred = model.predict(X_val)
        acc, f1 = evaluate_scores(y_val, y_pred)
        results.append({"alpha": a, "accuracy": acc, "f1_macro": f1})
        print(f"alpha = {a:>4}  ->  Accuracy = {acc:.4f},  Macro F1 = {f1:.4f}")
    # Print best by F1
    best = max(results, key=lambda r: r["f1_macro"])
    print(f"\nBest alpha by macro F1: {best['alpha']} (Accuracy={best['accuracy']:.4f}, F1={best['f1_macro']:.4f})")
    return results


def tune_logistic(
    X_train,
    y_train,
    X_val,
    y_val,
    Cs,
    max_iter=1000,
    representation_name="TF-IDF",
    model_name_suffix=""
):
    """
    Hyperparameter tuning for Logistic Regression on arbitrary features
    (TF-IDF or Word2Vec).
    Varies the regularization strength C and prints validation performance.
    Returns a list of results (C, accuracy, f1).
    """
    results = []
    print(f"\n=== Logistic Regression {model_name_suffix} + {representation_name} — C sweep (max_iter={max_iter}) ===")
    for c in Cs:
        clf = LogisticRegression(C=c, max_iter=max_iter)
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_val)
        acc, f1 = evaluate_scores(y_val, y_pred)
        results.append({"C": c, "accuracy": acc, "f1_macro": f1})
        print(f"C = {c:>5}  ->  Accuracy = {acc:.4f},  Macro F1 = {f1:.4f}")
    # Print best by F1
    best = max(results, key=lambda r: r["f1_macro"])
    print(f"\nBest C by macro F1: {best['C']} (Accuracy={best['accuracy']:.4f}, F1={best['f1_macro']:.4f})")
    return results





=== Naive Bayes (Multinomial) + TF-IDF (multi-class) — alpha sweep ===
alpha =  0.1  ->  Accuracy = 0.6283,  Macro F1 = 0.3005
alpha =  0.5  ->  Accuracy = 0.6311,  Macro F1 = 0.2910
alpha =  1.0  ->  Accuracy = 0.6324,  Macro F1 = 0.2822

Best alpha by macro F1: 0.1 (Accuracy=0.6283, F1=0.3005)

=== Logistic Regression  + TF-IDF (multi-class) — C sweep (max_iter=2000) ===
C =   0.1  ->  Accuracy = 0.6389,  Macro F1 = 0.3151
C =   1.0  ->  Accuracy = 0.6435,  Macro F1 = 0.3895
C =  10.0  ->  Accuracy = 0.6149,  Macro F1 = 0.4249

Best C by macro F1: 10.0 (Accuracy=0.6149, F1=0.4249)

=== Logistic Regression  + Word2Vec (multi-class) — C sweep (max_iter=2000) ===
C =   0.1  ->  Accuracy = 0.6275,  Macro F1 = 0.2696
C =   1.0  ->  Accuracy = 0.6274,  Macro F1 = 0.2779
C =  10.0  ->  Accuracy = 0.6273,  Macro F1 = 0.2785

Best C by macro F1: 10.0 (Accuracy=0.6273, F1=0.2785)
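
The sweeps select the best setting by macro F1, and the choice of criterion changes the answer. A short sketch using the three Logistic Regression + TF-IDF rows printed above (values copied from that output, rounded as shown there):

```python
# Validation results copied from the Logistic Regression + TF-IDF sweep above
results = [
    {"C": 0.1, "accuracy": 0.6389, "f1_macro": 0.3151},
    {"C": 1.0, "accuracy": 0.6435, "f1_macro": 0.3895},
    {"C": 10.0, "accuracy": 0.6149, "f1_macro": 0.4249},
]

best_by_accuracy = max(results, key=lambda r: r["accuracy"])
best_by_f1 = max(results, key=lambda r: r["f1_macro"])

print("Best C by accuracy:", best_by_accuracy["C"])
print("Best C by macro F1:", best_by_f1["C"])
```

Weaker regularisation (larger C) trades some overall accuracy for better coverage of the minority classes, so the selection metric should match what the task actually cares about.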


# **Reminder:**

### ✔ Accuracy
The share of the model's predictions that were correct, out of all examples.

**How to read it?**  
If the model predicted correctly 70% of the time → Accuracy = 0.70

**When is it a good measure?**  
When the data is balanced (all classes appear in roughly equal numbers).

**The drawback:**  
If one class appears far more often than the others, the metric can be misleading.

---

### ✔ F1 Score
A metric that combines Precision and Recall into a single balanced number (their harmonic mean).

**How to read it?**  
High = the model finds most examples of the class and rarely assigns the class by mistake.  
Low = the model either misses many examples of the class or assigns it wrongly too often.

**When is it used?**  
When it is important to identify every class well, or when the classes are imbalanced.

---

### ✔ Macro F1
Computes F1 for each class separately, then takes the simple average of the per-class scores.

**How to read it?**  
Every class gets equal weight, even if it has few examples.

**Why does it matter?**  
In problems where some classes are rare, Accuracy can be misleading, whereas Macro F1 reveals whether the model also performs well on the small classes.
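
As a worked example of these definitions, the sketch below recomputes accuracy and macro F1 from the confusion matrix of the Multinomial Naive Bayes + TF-IDF run above. It uses the identity F1 = 2·TP / (2·TP + FP + FN) and assumes rows are true classes and columns are predictions; the helper `per_class_f1` is a hypothetical name.

```python
# Confusion matrix from the Multinomial Naive Bayes + TF-IDF run (rows = true, cols = predicted)
labels = ["easy", "hard", "medium"]
cm = [[99, 11, 2585],
      [17, 4, 2327],
      [32, 14, 8474]]

def per_class_f1(cm, k):
    """F1 for class k, computed as 2*TP / (2*TP + FP + FN)."""
    tp = cm[k][k]
    fp = sum(cm[i][k] for i in range(len(cm))) - tp   # predicted k, true class differs
    fn = sum(cm[k]) - tp                              # true k, predicted something else
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

f1_scores = [per_class_f1(cm, k) for k in range(len(labels))]
macro_f1 = sum(f1_scores) / len(f1_scores)
accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))

for name, score in zip(labels, f1_scores):
    print(f"{name:>6}: F1 = {score:.4f}")
print(f"macro F1 = {macro_f1:.4f}, accuracy = {accuracy:.4f}")
```

The per-class scores show why the two metrics disagree: `medium` is classified well, while `easy` and `hard` are almost never predicted, which accuracy hides and macro F1 exposes.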


---


***Hyperparameter experiments***

In [48]:
# ============================================
# Additional Hyperparameter Experiments
# ============================================

# ----------------------------------------------------------
# 1) Naive Bayes + TF-IDF with more alpha values
# ----------------------------------------------------------

nb_alphas_extended = [0.01, 0.1, 0.5, 1.0, 2.0]
nb_tfidf_results_extended = tune_nb_tfidf(
    X_train_tfidf_multi,
    y_train_multi,
    X_val_tfidf_multi,
    y_val_multi,
    alphas=nb_alphas_extended,
    representation_name="TF-IDF (multi-class) — extended alpha"
)

# ----------------------------------------------------------
# 2) Logistic Regression + TF-IDF with extended C values
# ----------------------------------------------------------

lr_C_extended = [0.01, 0.1, 1.0, 10.0, 50.0, 100.0]
lr_tfidf_results_extended = tune_logistic(
    X_train_tfidf_multi,
    y_train_multi,
    X_val_tfidf_multi,
    y_val_multi,
    Cs=lr_C_extended,
    max_iter=3000,  # slightly higher, helps convergence
    representation_name="TF-IDF (multi-class) — extended C",
    model_name_suffix=""
)

# ----------------------------------------------------------
# 3) Logistic Regression + TF-IDF — small max_iter test
# ----------------------------------------------------------

lr_tfidf_small_iter = tune_logistic(
    X_train_tfidf_multi,
    y_train_multi,
    X_val_tfidf_multi,
    y_val_multi,
    Cs=[1.0],
    max_iter=200,  # small iteration budget, to test whether the solver still converges
    representation_name="TF-IDF (multi-class) — small max_iter",
    model_name_suffix=""
)

# ----------------------------------------------------------
# 4) Logistic Regression + TF-IDF — large max_iter test
# ----------------------------------------------------------

lr_tfidf_large_iter = tune_logistic(
    X_train_tfidf_multi,
    y_train_multi,
    X_val_tfidf_multi,
    y_val_multi,
    Cs=[1.0],
    max_iter=5000,  # generous iteration budget, intended to let the solver converge
    representation_name="TF-IDF (multi-class) — large max_iter",
    model_name_suffix=""
)

# ----------------------------------------------------------
# 5) Logistic Regression + Word2Vec — extended C values
# ----------------------------------------------------------

lr_w2v_extended = tune_logistic(
    X_train_w2v_multi,
    y_train_multi,
    X_val_w2v_multi,
    y_val_multi,
    Cs=lr_C_extended,
    max_iter=3000,
    representation_name="Word2Vec (multi-class) — extended C",
    model_name_suffix=""
)



=== Naive Bayes (Multinomial) + TF-IDF (multi-class) — extended alpha — alpha sweep ===
alpha = 0.01  ->  Accuracy = 0.6277,  Macro F1 = 0.3025
alpha =  0.1  ->  Accuracy = 0.6283,  Macro F1 = 0.3005
alpha =  0.5  ->  Accuracy = 0.6311,  Macro F1 = 0.2910
alpha =  1.0  ->  Accuracy = 0.6324,  Macro F1 = 0.2822
alpha =  2.0  ->  Accuracy = 0.6330,  Macro F1 = 0.2763

Best alpha by macro F1: 0.01 (Accuracy=0.6277, F1=0.3025)

=== Logistic Regression  + TF-IDF (multi-class) — extended C — C sweep (max_iter=3000) ===
C =  0.01  ->  Accuracy = 0.6280,  Macro F1 = 0.2572
C =   0.1  ->  Accuracy = 0.6389,  Macro F1 = 0.3151
C =   1.0  ->  Accuracy = 0.6435,  Macro F1 = 0.3895
C =  10.0  ->  Accuracy = 0.6149,  Macro F1 = 0.4249
C =  50.0  ->  Accuracy = 0.5998,  Macro F1 = 0.4276
C = 100.0  ->  Accuracy = 0.5991,  Macro F1 = 0.4280

Best C by macro F1: 100.0 (Accuracy=0.5991, F1=0.4280)

=== Logistic Regression  + TF-IDF (multi-class) — small max_iter — C sweep (max_iter=200) ===
C =   1.0  

# ***Everything up to this point is new***