# Crowdflower Search Results Relevance

An educational take on one of the more popular past [Kaggle challenges](https://www.kaggle.com/c/crowdflower-search-relevance/data), from 2015. In the second notebook (crowdflower_llm_brute) I solve this task with LLMs, just for fun; here I try to stay sensible and reach a reasonably high score quickly ("good enough" for educational purposes, using mostly sklearn).

In [65]:
import pandas as pd
df = pd.read_csv('train.csv')
print(df.describe())
print(df.isna().sum())
df.sample(3)

                 id  median_relevance  relevance_variance
count  10158.000000      10158.000000        10158.000000
mean   16353.103071          3.309805            0.377863
std     9447.106683          0.980666            0.389707
min        1.000000          1.000000            0.000000
25%     8078.750000          3.000000            0.000000
50%    16349.500000          4.000000            0.471000
75%    24570.750000          4.000000            0.471000
max    32668.000000          4.000000            1.470000
id                        0
query                     0
product_title             0
product_description    2444
median_relevance          0
relevance_variance        0
dtype: int64


| | id | query | product_title | product_description | median_relevance | relevance_variance |
|---|---|---|---|---|---|---|
| 8447 | 27179 | vanilla scented perfumes | Aquolina Pink Sugar Women's 3.4-ounce Eau de T... | Introduced by the design house of Aquolina in ... | 2 | 1.095 |
| 4296 | 13898 | polar heart rate monitor | Polar V800 GPS & Heart Rate Monitor Watch Set | | 4 | 0.0 |
| 2091 | 6712 | wreck it ralph | Wreck-It Ralph (Blu-ray/DVD, 2013, 2-Disc Set)... | | 3 | 0.943 |


In [66]:
df_test = pd.read_csv('test.csv')
print(df_test.describe())
print(df_test.isna().sum())
df_test.sample(3)

                 id
count  22513.000000
mean   16328.282992
std     9424.576451
min        3.000000
25%     8201.000000
50%    16329.000000
75%    24464.000000
max    32671.000000
id                        0
query                     0
product_title             0
product_description    5427
dtype: int64


| | id | query | product_title | product_description |
|---|---|---|---|---|
| 20502 | 29757 | wall clocks | Lorell Radio Control Wall Clock | Lorell Radio Control Wall Clock - Digital - Qu... |
| 13443 | 19510 | full tang knife | Whetstone Black Forest Full Tang Cherry Pakkaw... | Whetstone Cutlery's Black Forest knife with sh... |
| 17392 | 25224 | decorative pillows | Cotton Tale Girly Pillow Pack (Set of 3) | |


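Note that the description is missing for about 24% of the rows (2444 of 10158 in train, 5427 of 22513 in test), so it makes sense to concatenate title and description rather than rely on the description alone. The cell below keeps that variant commented out and matches on query plus title only. The title and description fields also contain HTML tags, which are best stripped, so run some preprocessing on both sets: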

In [68]:
from bs4 import BeautifulSoup
import re

def clean_text(dirty_text):
    # Missing values arrive as NaN; return an empty string instead of the literal "nan"
    if pd.isna(dirty_text):
        return ''
    soup = BeautifulSoup(str(dirty_text), "html.parser")
    text_only = soup.get_text()
    # Drop everything except ASCII letters, digits and whitespace
    cleaned = re.sub(r'[^A-Za-z0-9\s]+', '', text_only)
    # Collapse runs of whitespace into single spaces
    cleaned = re.sub(r'\s+', ' ', cleaned).strip()
    return cleaned.lower()

df['product_title'] = df['product_title'].apply(clean_text)
df['product_description'] = df['product_description'].apply(clean_text)
df_test['product_title'] = df_test['product_title'].apply(clean_text)
df_test['product_description'] = df_test['product_description'].apply(clean_text)
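# Variant that also includes the description (not used below):
# df['matching'] = df['query'].str.cat(df['product_title'].str.cat(df['product_description'], sep=' ', na_rep=''), sep=' ')
# df_test['matching'] = df_test['query'].str.cat(df_test['product_title'].str.cat(df_test['product_description'], sep=' ', na_rep=''), sep=' ')
# Join query and title with a space so the last query word does not fuse with the first title word
df['matching'] = df['query'].str.cat(df['product_title'], sep=' ')
df_test['matching'] = df_test['query'].str.cat(df_test['product_title'], sep=' ')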
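
A missing description read by pandas is a float NaN, and passing it through `str()` yields the literal token "nan", which would then end up in the vocabulary. The following is a minimal sketch on toy rows (the data and the `toy` name are illustrative, not taken from the dataset) of joining the three text fields with an explicit separator while treating missing values as empty strings:

```python
import pandas as pd

# Toy rows; the second one has a missing description, as roughly a quarter of the real rows do.
toy = pd.DataFrame({
    "query": ["wall clocks", "decorative pillows"],
    "product_title": ["lorell radio control wall clock", "cotton tale girly pillow pack"],
    "product_description": ["digital quartz clock", float("nan")],
})

# Pitfall: str() turns a missing value into the literal token "nan".
as_text = toy["product_description"].apply(str)
print(as_text.tolist())

# Safer: replace missing values with an empty string, join with a space,
# and strip the trailing space left behind by an empty description.
matching = toy["query"].str.cat(
    [toy["product_title"], toy["product_description"].fillna("")], sep=" "
).str.strip()
print(matching.tolist())
```

Whether adding the description actually helps the score is a separate question; it is much longer and noisier than the title, so it is worth checking with cross-validation rather than assuming.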



Now we need some way to measure similarity between various queries and results:

In [69]:
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.metrics.pairwise import cosine_similarity
# get a nice mix of models for ensemble
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.decomposition import TruncatedSVD
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import cohen_kappa_score, make_scorer

# fit the vectorizer on the full vocabulary (train and test 'matching' text, i.e. query + title)
tfid = TfidfVectorizer(min_df=3, max_features=None,
                       strip_accents='unicode', analyzer='word', token_pattern=r'\w{1,}',
                       ngram_range=(1, 5), use_idf=True, smooth_idf=True, sublinear_tf=True,
                       stop_words='english').fit(pd.concat([df['matching'], df_test['matching']]))
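
The cell above imports `cosine_similarity` but the notebook feeds the raw TF-IDF matrix to the models instead. As a possible extra feature, and not something this notebook uses, one could compute a per-row cosine similarity between the query and the title. A minimal sketch on toy strings (all names here are illustrative), relying on the fact that `TfidfVectorizer` L2-normalises rows by default so that a row-wise dot product equals the cosine:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

queries = ["wall clocks", "full tang knife", "decorative pillows"]
titles = [
    "lorell radio control wall clock",
    "whetstone black forest full tang knife",
    "polar heart rate monitor watch",
]

# Fit one vocabulary over both fields so query and title vectors share columns.
vec = TfidfVectorizer().fit(queries + titles)
q = vec.transform(queries)  # rows are L2-normalised (norm='l2' is the default)
t = vec.transform(titles)

# Element-wise product summed per row = dot product of row i of q with row i of t.
row_cosine = np.asarray(q.multiply(t).sum(axis=1)).ravel()
print(row_cosine.round(3))
```

The third pair shares no tokens, so its similarity is zero; a single scalar like this could be appended to the SVD components as one more column.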

Now we can build the actual features and labels and train a model. Scoring uses the quadratic weighted kappa, as defined by the competition, and the first model is a C-Support Vector Classifier (`SVC`). Before fitting, the pipeline reduces dimensionality with TruncatedSVD and applies standard scaling. For now, only rows with zero `relevance_variance` are used for training:

In [70]:
x_train = tfid.transform(df['matching'][df['relevance_variance']==0].to_numpy())
x_test = tfid.transform(df_test['matching'].to_numpy())
y = df['median_relevance'][df['relevance_variance']==0].to_numpy()
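
Filtering on `relevance_variance == 0` throws away every row where the ratings differed. One possible alternative, which this notebook does not use, is to keep all rows and down-weight the noisy ones through `sample_weight`. The sketch below uses synthetic data, and the weighting formula `1 / (1 + variance)` is an assumption chosen for illustration only:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-ins for the SVD features, labels and rating variance.
X = np.column_stack([np.linspace(-1, 1, 40), np.cos(np.arange(40))])
y = np.where(X[:, 0] > 0, 4, 1)
variance = np.tile([0.0, 0.471, 1.47, 0.0], 10)

# Hypothetical weighting: rows with unanimous ratings get weight 1, noisy rows less.
weights = 1.0 / (1.0 + variance)

pipe = Pipeline([("scaler", StandardScaler()), ("svm", SVC())])
# The "<step>__sample_weight" keyword routes the weights to the SVC step only.
pipe.fit(X, y, svm__sample_weight=weights)
pred = pipe.predict(X)
print(pred[:5], weights[:4])
```

The same `svm__sample_weight=...` keyword can be passed to `GridSearchCV.fit`, which slices the weights per fold.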

In [87]:
svm = Pipeline([('svd', TruncatedSVD()),
                ('scaler', StandardScaler()),
                ('svm', SVC())])
qwk_scorer = make_scorer(cohen_kappa_score, weights='quadratic', greater_is_better=True)
param_grid = {'svd__n_components': [420, 430, 440],
              'svm__C': [32, 34, 36]}
model = GridSearchCV(estimator=svm, param_grid=param_grid, scoring=qwk_scorer,
                     verbose=0, n_jobs=-1, refit=True, cv=4)
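
For reference, the quadratic weighted kappa behind `qwk_scorer` can be written as $\kappa = 1 - \sum_{ij} w_{ij} O_{ij} / \sum_{ij} w_{ij} E_{ij}$ with $w_{ij} = (i-j)^2/(N-1)^2$, where $O$ is the observed confusion matrix, $E$ the expected one under independent raters, and $N$ the number of classes. Below is a minimal hand-rolled sketch (the function name and the toy labels are illustrative) checked against scikit-learn's `cohen_kappa_score`:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def quadratic_weighted_kappa(y_true, y_pred, labels):
    k = len(labels)
    index = {label: i for i, label in enumerate(labels)}
    # Observed confusion matrix: rows are true labels, columns are predictions.
    observed = np.zeros((k, k))
    for t, p in zip(y_true, y_pred):
        observed[index[t], index[p]] += 1
    # Expected counts if the two "raters" were independent.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    # Quadratic penalty grows with the squared distance between the two ratings.
    i, j = np.indices((k, k))
    weights = (i - j) ** 2 / (k - 1) ** 2
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

y_true = [1, 2, 3, 4, 4, 3, 2, 1]
y_pred = [1, 2, 4, 4, 3, 3, 1, 1]
manual = quadratic_weighted_kappa(y_true, y_pred, labels=[1, 2, 3, 4])
reference = cohen_kappa_score(y_true, y_pred, weights="quadratic", labels=[1, 2, 3, 4])
print(manual, reference)
```

Because of the squared distance, predicting 1 for a true 4 costs nine times as much as predicting 3, which is why plain accuracy is a poor proxy for this metric.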

In [88]:
model.fit(x_train, y)
print(f"best score: {model.best_score_}")

Fitting 4 folds for each of 9 candidates, totalling 36 fits
best score: 0.5683069851949141


In [86]:
model.best_params_

{'svd__n_components': 430, 'svm__C': 34}
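
`best_params_` alone does not say how much the candidates actually differ. A small sketch, on synthetic data with a deliberately tiny grid (both are assumptions for illustration), of reading the per-candidate mean and fold-to-fold spread from `cv_results_`:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic data standing in for the real features.
X, y = make_classification(n_samples=120, n_features=10, random_state=0)
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=4).fit(X, y)

# One row per candidate: mean score, spread across folds, and rank.
results = pd.DataFrame(search.cv_results_)[
    ["param_C", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score"))
```

If the gap between neighbouring candidates is smaller than `std_test_score`, the grid is probably too fine to distinguish them.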

Now we can repeat the exercise with a different model, for instance KNN:

In [94]:
from sklearn.neighbors import KNeighborsClassifier
knn = Pipeline([('svd', TruncatedSVD()),
                ('scaler', StandardScaler()),
                ('knn', KNeighborsClassifier())])

param_grid1 = {'svd__n_components': [420, 430, 440],
               'knn__n_neighbors': [3, 5, 10]}
model1 = GridSearchCV(estimator=knn, param_grid=param_grid1, scoring=qwk_scorer,
                      verbose=0, n_jobs=-1, refit=True, cv=4)

In [95]:
model1.fit(x_train, y)
print(f"best score: {model1.best_score_}")

Fitting 4 folds for each of 9 candidates, totalling 36 fits
best score: 0.5262470131103949


In [96]:
model1.best_params_

{'knn__n_neighbors': 3, 'svd__n_components': 430}

And simple logistic regression:

In [109]:
from sklearn.linear_model import LogisticRegression

lr = Pipeline([('svd', TruncatedSVD()),
               ('scaler', StandardScaler()),
               ('lr', LogisticRegression(solver='saga', max_iter=10000))])

param_grid2 = {'svd__n_components': [410, 420, 430],
               'lr__C': [20, 30, 50]}
model2 = GridSearchCV(estimator=lr, param_grid=param_grid2, scoring=qwk_scorer,
                      verbose=0, n_jobs=-1, refit=True, cv=4)

In [110]:
model2.fit(x_train, y)
print(f"best score: {model2.best_score_}")

Fitting 4 folds for each of 9 candidates, totalling 36 fits
best score: 0.4737610482925336

In [108]:
model2.best_params_

{'lr__C': 30, 'svd__n_components': 420}


Finally, a linear classifier trained with stochastic gradient descent, tuning only the regularization strength `alpha`:

In [143]:
from sklearn.linear_model import SGDClassifier

sgd = Pipeline([('svd', TruncatedSVD(n_components=400)),
                ('scaler', StandardScaler()),
                ('sgd', SGDClassifier(loss='modified_huber', max_iter=10000,
                                      learning_rate='optimal', early_stopping=True,
                                      n_iter_no_change=5))])

param_grid3 = {
    "sgd__alpha": [1e-9, 1e-8, 1e-7, 1e-6, 1e-5],
    # "sgd__loss": ['hinge','squared_error','modified_huber'],
    # "sgd__penalty": ['l1','l2','elasticnet']
}

model3 = GridSearchCV(estimator=sgd, param_grid=param_grid3, scoring=qwk_scorer,
                      verbose=0, n_jobs=-1, refit=True, cv=4)

In [144]:
model3.fit(x_train, y)
print(f"best score: {model3.best_score_}")

best score: 0.4114817749868326


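Now we can take the best estimator of each type and combine them in a stacking classifier (scikit-learn's `StackingClassifier`, with logistic regression as the final estimator), this time trained on the full set, i.e. also including rows with a high reported variance of the relevance.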

In [147]:
x_train = tfid.transform(df['matching'].to_numpy())
x_test = tfid.transform(df_test['matching'].to_numpy())
y = df['median_relevance'].to_numpy()

In [163]:
from sklearn.ensemble import StackingClassifier

# SVC without probability estimates: stacking then falls back to its decision_function
model.best_estimator_.set_params(svm__probability=False)
# StackingClassifier clones the tuned pipelines and refits them on the data passed to fit()
vcl = StackingClassifier(
    estimators=[
        ('svm', model.best_estimator_),
        ('knn', model1.best_estimator_),
        ('lr', model2.best_estimator_),
        ('sgd', model3.best_estimator_)],
    final_estimator=LogisticRegression(max_iter=10000),
    n_jobs=8)

vcl.fit(x_train, y)
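
To make the stacking step less of a black box: `StackingClassifier` refits clones of the base estimators, collects their out-of-fold outputs via internal cross-validation, and trains the final estimator on those outputs. A minimal sketch on synthetic data (the dataset and the two base models are stand-ins, not the tuned pipelines from this notebook):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic 3-class problem standing in for the relevance labels.
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

stack = StackingClassifier(
    estimators=[("svm", SVC()), ("knn", KNeighborsClassifier(n_neighbors=3))],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=4,  # folds used to build the out-of-fold inputs of the final estimator
)
stack.fit(X, y)

# transform() exposes what the final estimator sees: the base models' outputs
# (decision_function for SVC, predict_proba for KNN), stacked column-wise.
meta_features = stack.transform(X)
print(meta_features.shape, stack.predict(X[:5]))
```

Unlike a voting classifier, which averages or counts the base predictions with fixed rules, the final estimator here learns how much to trust each base model.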

In [164]:
ypred = vcl.predict(x_train)

And the $\kappa$ score on the full train set (the same data the model was fitted on, so an optimistic estimate) is:

In [167]:
cohen_kappa_score(y,ypred,weights='quadratic')

0.779149935036425
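
A score computed on the training data tends to overstate what to expect on unseen rows. A less optimistic estimate can be had from out-of-fold predictions with `cross_val_predict`. The sketch below uses synthetic data and a 1-nearest-neighbour model (both assumptions, picked because 1-NN memorises its training set and makes the gap obvious):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 4-class data standing in for the relevance labels 1..4.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
clf = KNeighborsClassifier(n_neighbors=1)

# Scoring on the data used for fitting: every point is its own nearest neighbour.
train_kappa = cohen_kappa_score(y, clf.fit(X, y).predict(X), weights="quadratic")

# Out-of-fold: each prediction comes from a model that never saw that row.
oof_pred = cross_val_predict(clf, X, y, cv=4)
oof_kappa = cohen_kappa_score(y, oof_pred, weights="quadratic")
print(train_kappa, oof_kappa)
```

Applied to `vcl`, the same call would refit the whole stack once per fold, so it is slow, but the result should be closer to the leaderboard score than the in-sample figure.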

Now verify with Kaggle on the test set:

In [176]:
df_test['prediction'] = vcl.predict(x_test)

In [177]:
dfsub = pd.read_csv('sampleSubmission.csv')

In [180]:
df_test[['id', 'prediction']].to_csv('submission1.csv', index=False)  # Kaggle scores: private 0.54, public 0.51