## What is this competition all about?

* Given product information such as `product_description`, `product_title`, and `product_uid`, together with a search query, we are asked to predict the relevancy score for the test set.
* The training set contains information about each product such as _id_, _product title_, _product uid_, and _search query_, along with the _relevancy score_.
* The attribute set contains technical details about the product along with the product identification number _product uid_; for most products this data isn't present.
* The description set contains the product description along with the _product uid_ (for all products).
* The test set contains information about the product such as _id_, _product title_, and _search terms_, for which the _relevancy score_ has to be predicted.
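To make the relationship between these files concrete, here is a minimal sketch with invented rows. Only `product_uid`, `product_title`, `search_term`, `relevance` and `product_description` follow the description above; the `name` and `value` columns of the attribute frame are an assumption. A left join on `product_uid` keeps every training row, and `indicator=True` shows which rows found attribute data.

```python
import pandas as pd

# Toy rows; the values are invented, only the key column names follow the data set.
train = pd.DataFrame({
    "id": [1, 2, 3],
    "product_uid": [100, 100, 200],
    "product_title": ["angle bracket", "angle bracket", "deck coating"],
    "search_term": ["angle bracket", "l bracket", "deck over"],
    "relevance": [3.0, 2.5, 3.0],
})
descriptions = pd.DataFrame({
    "product_uid": [100, 200],
    "product_description": ["steel angle bracket", "resurfacing deck coating"],
})
# Assumed layout: one row per (product, attribute name, attribute value).
attributes = pd.DataFrame({
    "product_uid": [100],
    "name": ["Material"],
    "value": ["Steel"],
})

# Every training row gets its description (one description per product_uid).
merged = train.merge(descriptions, on="product_uid", how="left")

# Attributes exist only for some products; the indicator column shows which.
coverage = train.merge(
    attributes[["product_uid"]].drop_duplicates(),
    on="product_uid", how="left", indicator=True,
)
has_attributes = coverage["_merge"] == "both"
print(merged.shape)
print(has_attributes.tolist())
```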



## Loading packages


In [None]:
import numpy as np
import pandas as pd

%matplotlib inline 
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)





## What is Home Depot product search relevance?


Shoppers rely on Home Depot’s product authority to find and buy the latest products and to get timely solutions to their home improvement needs.
From installing a new ceiling fan to remodeling an entire kitchen, with the click of a mouse or tap of the screen, customers expect the correct results to their queries – quickly. 
Speed, accuracy and delivering a frictionless customer experience are essential.




## Let's get familiar with the data!

### Training data


In [None]:
training_data = pd.read_csv("../input/train.csv", encoding="ISO-8859-1")

In [None]:
print("training data shape is:",training_data.shape)
print("training data has empty values:",training_data.isnull().values.any())
training_data.head(3)

As we can see, there are no empty fields such as NaN, but two interesting questions come up:
1. What is the length of the shortest sentence in the search terms?
2. What is the length of the longest sentence in the search terms?


In [None]:
search_terms_df = training_data[training_data.search_term.str.len() == training_data.search_term.str.len().min()]
print("Mean: {}".format(search_terms_df.relevance.mean()))
search_terms_df


In [None]:
search_terms_df = training_data[training_data.search_term.str.len() == training_data.search_term.str.len().max()]
print("Mean: {}".format(search_terms_df.relevance.mean()))
search_terms_df

It appears that the following two hypotheses have to be tested:
1. The more words appear in a search query, the higher the chance that the result is relevant.
2. Search queries with fewer words get lower relevance scores.




In [None]:
# Note: .str.len() counts characters, not words
training_data['search_query_length'] = training_data.search_term.str.len()

In [None]:
training_data

It looks like there is no correlation between the length of the search query and the relevance score.
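One way to check this numerically rather than by eye is to correlate query length with relevance. The sketch below uses a small invented frame in place of `training_data`; `n_chars` and `n_words` are hypothetical column names, and the Pearson coefficient is just one possible measure.

```python
import pandas as pd

# Invented rows standing in for training_data; only the column names
# search_term and relevance come from the notebook.
toy = pd.DataFrame({
    "search_term": ["angle bracket", "l bracket", "deck over",
                    "rain shower head", "4*8 beadboard paneling"],
    "relevance": [3.0, 2.5, 3.0, 2.33, 1.67],
})

# Length in characters (what .str.len() measures) and in words.
toy["n_chars"] = toy["search_term"].str.len()
toy["n_words"] = toy["search_term"].str.split().str.len()

# Pearson correlation of each length measure with the relevance score.
corr = toy[["n_chars", "n_words", "relevance"]].corr()
print(corr["relevance"])
```

On the real data, a coefficient close to zero for both measures would support the statement above.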
## How much relevancy do the following cases score?
1. Search terms that consist of digits and words (for example "4*8 beadboard paneling").
2. Search terms that consist of digits with a character in between (for example "3 x 2").
3. Search terms that consist of digits only (if there are any).
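The three cases can be told apart with regular expressions. The sketch below is one possible reading of the descriptions above; the pattern names and the `classify` helper are hypothetical and are not the expressions used in the cells that follow.

```python
import re

# Hypothetical patterns for the three cases listed above.
ONLY_DIGITS = re.compile(r"^\d+$")
DIGITS_AROUND_CHAR = re.compile(r"^\d+\s*\D\s*\d+$")
HAS_DIGIT = re.compile(r"\d")
HAS_LETTER = re.compile(r"[A-Za-z]")


def classify(term):
    """Return a label for the digit pattern found in a search term."""
    term = term.strip()
    if ONLY_DIGITS.match(term):
        return "digits only"
    if DIGITS_AROUND_CHAR.match(term):
        return "digits around a character"
    if HAS_DIGIT.search(term) and HAS_LETTER.search(term):
        return "digits and words"
    return "other"


for term in ["4*8 beadboard paneling", "3 x 2", "1234", "angle bracket"]:
    print(term, "->", classify(term))
```

The order of the checks matters: "3 x 2" also contains a digit and a letter, so the more specific patterns are tested first.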

In [None]:
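# Search terms that contain at least one digit and at least one letter
has_digit = training_data.search_term.str.contains(r"\d")
has_letter = training_data.search_term.str.contains(r"[A-Za-z]")
mask = has_digit & has_letter
r = training_data.loc[mask, ['product_title', 'search_term','relevance']].groupby('relevance')
r.head(10)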

In [None]:
training_data[training_data.search_term.str.contains(r"\d+\w.+\d+")]

In [None]:
training_data[training_data.search_term.str.contains(r"_\d+")]

In [None]:
training_data[training_data.search_term.str.contains("^\\d+ . \\d+$")]


In [None]:
# testing_data = pd.read_csv("../input/test.csv", encoding="ISO-8859-1")
# attribute_data = pd.read_csv('../input/attributes.csv')
# descriptions = pd.read_csv('../input/product_descriptions.csv')

In [None]:
import nltk

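def compute_exponent_of_conv(word_dist):
    # n is the number of distinct tokens; each value is a token's count
    n = len(word_dist)
    # skip counts of 1: log(1) == 0 would make the ratio divide by zero
    counts = [c for c in word_dist.values() if c > 1]
    if not counts:
        return float("nan")
    return max(np.log(n) / np.log(c) for c in counts)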


# df = training_data[training_data.search_term.str.contains("_\d+")]
df = training_data[training_data.search_term.str.contains("^\\d+ . \\d+$")]

lowest_relevancy_score = df[df.relevance < 3].search_term.values.tolist()
high_score = df[df.relevance == 3].search_term.values.tolist()

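def make_plot(words):
    # join with a space so the last word of one query does not fuse with the first word of the next
    all_words = ' '.join(words)
    tokens = nltk.tokenize.word_tokenize(all_words)
    word_dist = nltk.FreqDist(tokens)
    title = compute_exponent_of_conv(word_dist)
    word_dist.plot(title=str(title))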
    
    
make_plot(lowest_relevancy_score)
make_plot(high_score)
make_plot(training_data[training_data.relevance == 1].search_term.values.tolist()[:8000])



It is rather hard to say anything about these two graphs; both tend to have the same rate of convergence.
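For reference, the number in each plot title comes from `compute_exponent_of_conv`: the largest value of log(n) / log(c) over the token counts c, where n is the number of distinct tokens. The sketch below reproduces that calculation with `collections.Counter` on invented queries. Whitespace splitting stands in for the NLTK tokenizer, and skipping tokens that occur only once is an assumption made here, because log(1) = 0 would leave the ratio undefined.

```python
import math
from collections import Counter

# Invented queries; whitespace splitting stands in for nltk's tokenizer.
queries = ["deck over", "deck paint", "deck stain", "rain shower head", "shower head"]
counts = Counter(word for q in queries for word in q.split())

n = len(counts)  # number of distinct tokens
# Ratio log(n) / log(count) for every token seen more than once.
ratios = {w: math.log(n) / math.log(c) for w, c in counts.items() if c > 1}
exponent = max(ratios.values()) if ratios else float("nan")

print(counts.most_common(3))
print(round(exponent, 3))
```

Because the maximum is taken, the value is driven by the least frequent of the repeated tokens, which may explain why different subsets end up with similar numbers.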

## Ideas
1. Build TF-IDF vectors for each relevance rating and experiment with them for prediction.
2. Find the most common words that are shared.
3. Building on 2, plot the most common words and how widely they are shared.
4. Use TF-IDF for the relevance score.
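Idea 1 could start from something like the sketch below. It pools the search terms of each relevance rating into one document and scores words with a plain textbook TF-IDF (term frequency times the log of the inverse document frequency). The data are invented and the names `terms_by_rating`, `doc_freq` and `tfidf` are hypothetical; a library implementation such as scikit-learn's `TfidfVectorizer` would likely be used in practice, and its weighting differs slightly from this formula.

```python
import math
from collections import Counter

# Invented search terms grouped by relevance rating.
terms_by_rating = {
    1: ["shower door", "door hinge"],
    2: ["deck paint", "shower head"],
    3: ["deck stain", "deck over"],
}

# One pooled bag of words per rating.
docs = {r: Counter(" ".join(terms).split()) for r, terms in terms_by_rating.items()}
n_docs = len(docs)

# Document frequency: in how many ratings does each word occur?
doc_freq = Counter(word for bag in docs.values() for word in bag)


def tfidf(bag):
    """Textbook TF-IDF: (count / total words) * log(n_docs / doc_freq)."""
    total = sum(bag.values())
    return {w: (c / total) * math.log(n_docs / doc_freq[w]) for w, c in bag.items()}


scores = {r: tfidf(bag) for r, bag in docs.items()}
for rating, s in scores.items():
    top = max(s, key=s.get)
    print(rating, top, round(s[top], 3))
```

Words that occur under several ratings, such as "deck" here, are down-weighted, which is the property that makes the scores a candidate feature for separating ratings.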

In [None]:
total_words = training_data.search_term.unique().tolist()[:10]

In [None]:
make_plot(total_words)

In [None]:
total_words = training_data[training_data.relevance==3].search_term.unique().tolist()[:10]
make_plot(total_words)

In [None]:
total_words = training_data[training_data.relevance==2].search_term.unique().tolist()[:10]
make_plot(total_words)

In [None]:
total_words = training_data[training_data.relevance==1].search_term.unique().tolist()[:10]
make_plot(total_words)

In [None]:
attributes_data = pd.read_csv("../input/attributes.csv", encoding="ISO-8859-1")

In [None]:
total_words = training_data[training_data.relevance<3].search_term.unique().tolist()[:10]


## Interesting
It doesn't make sense to gather more data than the unique search terms, since the plot then looks different from the graphs above.
Let's try another approach and look at how the attributes affect the score.

In [None]:
training_data[training_data.search_term.str.contains(r"_\d+")]

The next question is whether there is at least one shared word after lemmatisation.

## What can we plot?
1. Number of shared words between search query and product metadata
2. Number of words in search query
3. Number of non-shared words
4. __Number of shared words between search title and attributes__
5. Number of shared words between search title and description
6. Number of total shared words
7. Number of total non-shared words
8. Number of common elements in TF-IDF
9. Number of only words
10. Relevance grouped by part of speech, e.g. tag frequency vs relevance
11. Symmetric difference between search_terms and product title, stored as a feature
12. Corrected Jaccard distance / symmetric distance
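Several of the counts in this list reduce to set operations on token lists. A minimal sketch, assuming lower-cased whitespace tokens; the helper `overlap_features` and its key names are hypothetical and do not appear in the notebook:

```python
def overlap_features(query, text):
    """Word-overlap counts between a search query and a product text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return {
        "n_query_words": len(query.split()),  # words in the search query
        "n_shared": len(q & t),               # distinct words found in both
        "n_query_only": len(q - t),           # query words missing from the text
        "n_symmetric_diff": len(q ^ t),       # words found in exactly one of the two
    }


features = overlap_features("angle bracket", "Simpson Strong-Tie 12-Gauge Angle")
print(features)
```

Applied row by row to the query and title (or description) columns, each key would become one candidate feature to correlate with relevance.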




In [None]:
################begin testing
## let's create first the cleaning functions
from bs4 import BeautifulSoup
import lxml
import re
import nltk
from nltk.corpus import stopwords # Import the stop word list
from nltk.metrics import edit_distance
from string import punctuation
from collections import Counter
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')


def remove_html_tag(text):
    soup = BeautifulSoup(text, 'lxml')
    text = soup.get_text().replace('Click here to review our return policy for additional information regarding returns', '')
    return text

# build the stop-word set and punctuation table once instead of on every call
STOP_WORDS = set(stopwords.words('english'))
PUNCT_TABLE = str.maketrans('', '', punctuation)


def str_stemmer_tokens(tokens):
    # NOTE: despite the name, this cleans tokens and does not stem them
    # remove punctuation from each token
    tokens = [w.translate(PUNCT_TABLE) for w in tokens]
    # keep alphabetic tokens longer than one character that are not stop words
    tokens = [w for w in tokens if w.isalpha() and len(w) > 1 and w not in STOP_WORDS]
    return ' '.join(tokens)


def str_stemmer(doc):
    # split into tokens by white space, then clean them
    return str_stemmer_tokens(doc.split())

def str_stemmer_title(s):
    return " ".join(map(stemmer.stem, s))

def str_common_word(str1, str2):
    # count the distinct words of str1 that occur in str2 (substring match)
    whole_set = set(str1.split())
    return sum(int(str2.find(word)>=0) for word in whole_set)


def get_shared_words_mut(row_data):
    return np.sum([str_common_word2(*row_data[:-1]), str_common_word2(*row_data[1:])])


def get_shared_words_imut(row_data):
    return np.sum([str_common_word(*row_data[:-1]), str_common_word2(*row_data[1:])])
    
from nltk.corpus import brown, stopwords
from nltk.cluster.util import cosine_distance
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from collections import Counter


def sentence_similarity(columns,stopwords=None):
    sent1, sent2 = columns[0], columns[1]
    if stopwords is None:
        stopwords = []
 
    sent1 = [w.lower() for w in sent1]
    sent2 = [w.lower() for w in sent2]
 
    all_words = list(set(sent1 + sent2))
 
    vector1 = [0] * len(all_words)
    vector2 = [0] * len(all_words)
 
    # build the vector for the first sentence
    for w in sent1:
        if w in stopwords:
            continue
        vector1[all_words.index(w)] += 1
 
    # build the vector for the second sentence
    for w in sent2:
        if w in stopwords:
            continue
        vector2[all_words.index(w)] += 1
 
    return 1 - cosine_distance(vector1, vector2)

def get_jaccard_sim(columns): 
    str1, str2 = columns[0], columns[1]
    a = set(str1) 
    b = set(str2)
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))


descriptions = pd.read_csv('../input/product_descriptions.csv')
training_data = pd.merge(training_data, descriptions, 
                         on="product_uid", how="left")


In [None]:
training_data['search_term_tokens'] = training_data.search_term.str.lower().str.split()
training_data['product_title_tokens'] = training_data.product_title.str.lower().str.split()
training_data['product_description_tokens'] = training_data.product_description.str.lower().str.split()


In [None]:
training_data['search_term'] = [str_stemmer_title(_) for _ in training_data.search_term_tokens.values.tolist()]
training_data['product_title'] = [str_stemmer_tokens(_) for _ in training_data.product_title_tokens.values.tolist()]
training_data['product_description'] = [str_stemmer_tokens(_) for _ in training_data.product_description_tokens.values.tolist()]


## 1. Count the number of shared words between search query and product metadata


In [None]:
def str_common_word2(tokens1, tokens2):
    # count the tokens of tokens2 that also occur in tokens1
    part_of_first = set(tokens1)
    return sum(1 for word in tokens2 if word in part_of_first)


# Use the token-list columns: with the string columns, set() and the loop
# would compare single characters instead of words.
training_data['c_search_term_product'] = [
    str_common_word2(s1, s2) for s1,s2 in training_data[['search_term_tokens', 'product_title_tokens']].values.tolist()
]

training_data['c_search_term_desc'] = [
    str_common_word2(s1, s2) for s1,s2 in training_data[['search_term_tokens', 'product_description_tokens']].values.tolist()
]

training_data['c_search_term_s'] = [
    str_common_word2(s1, s2)+str_common_word2(s1, s3) for s1,s2,s3 in training_data[['search_term_tokens', 'product_title_tokens', 'product_description_tokens']].values.tolist()
]



In [None]:
training_data[['c_search_term_product', 'c_search_term_desc', 'relevance']].corr()

In [None]:
def str_common_diff(tokens1, tokens2):
    # number of distinct tokens that occur in exactly one of the two lists
    return len(set(tokens1).symmetric_difference(set(tokens2)))

def str_common_dij(tokens1, tokens2):
    # number of distinct tokens that occur in both lists
    return len(set(tokens1).intersection(set(tokens2)))

# Token-list columns are used so that set() collects words, not characters.
training_data['c_search_term_product_dij'] = [
    str_common_dij(s1, s2)+str_common_dij(s1, s3) for s1,s2,s3 in training_data[['search_term_tokens', 'product_title_tokens', 'product_description_tokens']].values.tolist()
]

training_data['c_search_term_product_symm'] = [
    str_common_diff(s1, s2)+str_common_diff(s1, s3) for s1,s2,s3 in training_data[['search_term_tokens', 'product_title_tokens', 'product_description_tokens']].values.tolist()
]

training_data[['c_search_term_product_dij', 'c_search_term_product_symm', 'c_search_term_product', 'relevance']].corr()

In [None]:
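# Redefines get_jaccard_sim from the cleaning cell, adding a guard for an empty union
def get_jaccard_sim(columns):
    str1, str2 = columns[0], columns[1]
    a = set(str1)
    b = set(str2)
    c = a.intersection(b)
    union_size = len(a) + len(b) - len(c)
    # two empty inputs have an empty union; return 0.0 instead of dividing by zero
    return float(len(c)) / union_size if union_size else 0.0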



training_data['j_dis_sqt'] = [get_jaccard_sim(rows) for rows in training_data[["search_term_tokens","product_title_tokens"]].values]
training_data['j_dis_sqd'] = [get_jaccard_sim(rows) for rows in training_data[["search_term_tokens","product_description_tokens"]].values]
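As a worked example of what these columns contain: `get_jaccard_sim` returns the Jaccard similarity, the size of the intersection of two token sets divided by the size of their union. Despite the `j_dis_` prefix, higher values therefore mean more overlap, and the corresponding distance is one minus that value. The sketch below uses invented token lists and a hypothetical standalone function `jaccard`; returning 0.0 for two empty inputs is an assumption.

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard similarity of two token collections (0.0 if both are empty)."""
    a, b = set(tokens_a), set(tokens_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


query = "angle bracket".split()
title = "steel angle bracket 12 gauge".split()

similarity = jaccard(query, title)  # 2 shared tokens out of 5 distinct ones
distance = 1 - similarity
print(similarity, distance)
```

Long descriptions inflate the union, so the similarity against the description is usually much smaller than against the title for the same query.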


In [None]:
# from sklearn.model_selection import train_test_split
# X = training_data['whole']
# y = training_data.product_description

# X_train, X_test, y_train, y_test = train_test_split(X, y)