# Home Depot’s Search Relevance Prediction

In our project, we would help Home Depot’s to improve their customers' shopping experience by developing a model that can accurately predict the relevance of search results.<br>
Home depot is the equivalent of selling hardware (like Google vertically) , when we type in the search box, "I want to build a house’’, all the tools used to build the house would show up. This is the background.<br>
The core of this project is natural language processing, the nature of NLP is the conversion of a natural language into a computer language, so the first thing we need to understand natural language processing steps, roughly from text, Tokenize, Lemma or Stemming, stopwords, to Word_List (this is the typical text pretreatment) and then do the feature engineering, which is the text into digital, finally we can build some model to train the data and We get what we want.<br>
The most different and interesting part of this project is that we can create different features by ourselves instead of simply calling third-party libraries, which we will explain in detail later.<br>
In order to achieve the final goal, we first import some basic library.<br>


### Import library

In [1]:
import numpy as np    ### matrix processing library
import pandas as pd   ### table processing library
### model abour classifier and regression
from sklearn.ensemble import RandomForestClassifier, BaggingRegressor  
from nltk.stem.snowball import SnowballStemmer    ### text processing about stemming
import os

### Load data

In [35]:
### load train.csv, test.csv,  product_descriptons.csv

df_train = pd.read_csv('train.csv', encoding = "ISO-8859-1")  
df_pro = pd.read_csv('product_descriptions.csv', encoding='ISO-8859-1')
df_test = pd.read_csv('test.csv', encoding = "ISO-8859-1")
# df_attr = pd.read_csv('attributes.csv', encoding='ISO-8859-1')

### Add product description to all table.

Product description will be very useful since we find that the corpus of title is too small, and we need a more detailed Description who has more corpus information to support our search, or if there are two similar products using two different words as product title, you will never be able to find them.<br>
Product_uid here to match description.<br>


In [36]:
### for text processing, concat train file and product description file.

df_all = pd.merge(df_train, df_pro, how='left', on='product_uid')

### for text processing, concat test file and product description file.
df_all_test = pd.merge(df_test, df_pro, how='left', on='product_uid')
df_all_test.head()

Unnamed: 0,id,product_uid,product_title,search_term,product_description
0,1,100001,Simpson Strong-Tie 12-Gauge Angle,90 degree bracket,"Not only do angles make joints stronger, they ..."
1,4,100001,Simpson Strong-Tie 12-Gauge Angle,metal l brackets,"Not only do angles make joints stronger, they ..."
2,5,100001,Simpson Strong-Tie 12-Gauge Angle,simpson sku able,"Not only do angles make joints stronger, they ..."
3,6,100001,Simpson Strong-Tie 12-Gauge Angle,simpson strong ties,"Not only do angles make joints stronger, they ..."
4,7,100001,Simpson Strong-Tie 12-Gauge Angle,simpson strong tie hcc668,"Not only do angles make joints stronger, they ..."


### Show relevance.

And then let's count the relevance frequency.

In [4]:
df_y = pd.DataFrame(df_train['relevance'])
df_y['num'] = 1
df_y_g = df_y['num'].groupby(df_y['relevance'])
df_y_g.sum()

relevance
1.00     2105
1.25        4
1.33     3006
1.50        5
1.67     6780
1.75        9
2.00    11730
2.25       11
2.33    16060
2.50       19
2.67    15202
2.75       11
3.00    19125
Name: num, dtype: int64

- Conclusion: The relevance score ranges from 1 to 3, with a grade of 0.33 or 0.34. Among them, 1.25, 1.50, etc. are too few samples, which can be ignored and divided into 7 categories. The regression problem can be converted into a classification problem for consideration. The regression problem should be treated first.

### Text preprocessing for text model preformance.

First, we import libraries for stemming, remove stopwords and string similarity calculation.




In [39]:
#### import library about text preprocessing

from nltk import SnowballStemmer  
from nltk.corpus import stopwords
import re
import Levenshtein  ### Library for string similarity calculation
import nltk
stemmer = SnowballStemmer('english')

For natural language processing, we want to improve the performance of the model and greatly reduce the complexity of the text, so we do the following: first, we remove useless symbols.But we should pay attention to some of the symbols is particularly important and cannot be removed, like [". ", "/", "-", "%"]

In [38]:
### For improving text model performance, to process text string, 
### remove useless symbol. 

pattern_replace_pair_list = [
            (r"<.+?>", r""),
            # html codes
            (r"&nbsp;", r" "),
            (r"&amp;", r"&"),
            (r"&#39;", r"'"),
            (r"/>/Agt/>", r""),
            (r"</a<gt/", r""),
            (r"gt/>", r""),
            (r"/>", r""),
            (r"<br", r""),
            # can't remove [".", "/", "-", "%"] as they are useful in numbers, 
           ### e.g., 1.97, 1-1/2, 10%, etc.
            (r"[ &<>)(_,;:!?\+^~@#\$\*]+", r" "),
            (r"'s\\b", r""),
            (r"[']+", r""),
            #(r'([A-Z][a-z]+|[a-z]+|\d+)', r'\1 '),
            (r'(\d?)([a-zA-Z]+)', r'\1 \2 '),
            #(r'(/d+)', r' \1 '),
            (r'([A-Z][a-z]+)', r' \1 '),
        ]

#### Some  function.

we write some feature functions that we can call directly later, for example<br>
1. Convert the number to the word<br>
2. Lower, remove symbol stop word.<br>
3. The stemmer<br>
4. The feature extraction function<br>
(1) how many words in str1 of str2<br>
This is a function that can calculate the validity characteristics of a keyword, string1 is the title string, string2 is our search string, and we can calculate the validity of a keyword simply by calculating how many times it occurs<br>
(2) in contrast, we can also calculate how many words are not in str1 of str2<br>
(3) Determine how many words in the title appear in str2, str3 together (string3 is a description string)<br>
6.Calculate similarity between two words<br>
7.Calculate similarity between two sentences<br>
8.Build a dictionary of stop words<br>


In [37]:
### convert number to word.

dic = {1:'one', 2:'two', 3:'three', 4:'four', 5:'five',6:'six', 7:'seven', 8:'eight', 9:'night',0:'zero'}

### 
def dashrep(matchobj):
    if len(matchobj.group())==1:
        return dic[int(matchobj.group())]
    else:
        return matchobj.group() 

# lower,  remove symbol stop word.
def transform(text):
    for pattern, replace in pattern_replace_pair_list:
        try:
            text = re.sub(pattern, replace, text)
        except:
            pass
    #text = re.sub(r'[\d]+', dashrep, text)
    text = re.sub(r"\s+", " ", text).strip()
    return ' '.join([word for word in text.lower().split() if word not in dic_stopwords])

#word_list = "Package stopwords is already up-to-date".split(" ")
#filtered_words = [word for word in word_list if word not in stopwords.words('english')]

### stemmer
def str_stemmer(s):
    # not stmmer, lower
    if isinstance(s, float):
        s = unicode(s)
    return ' '.join([stemmer.stem(word) for word in s.split()])

### Feature extraction.

#  how many word in str1 of str2.
def str_common_word(str1, str2):
    return sum(int(str2.find(word)>=0) for word in str1.split())

#  how many word not in str1 of str2.

def str_notcommon_word(str1, str2):
    return sum(int(str2.find(word)==-1) for word in str1.split())

# str1:title,str2:search,str3:desc

# Determine how many words in the title appear in str2, str3 together

def str_common_desc_pro_word(str1, str2, str3):
    return sum(int(str2.find(word)>=0 and str3.find(word)>=0) for word in str1.split())

### Ratio between two words
def word_vs_word_ratio(str1, str2):
    ratio = 0
    count = 0
    for word1 in str1.split():
        for word2 in str2.split():
            ratio = Levenshtein.ratio(word1, word2)+ratio
            count+=1
    return ratio/max(count,1)

### Ratio between search term and target term
def search_vs_word_ratio(str_search, str_des):
    ratio = 0
    if len(str_search) ==0:
        return 0
    for word in str_des:
        ratio = max(Levenshtein.ratio(str_search, word), ratio)
    return ratio

### Calculate similarity between two words
def similarity(word1, word2):    
    from nltk.corpus import wordnet as wn
    word_1 = wn.synsets(word1)
    word_2 = wn.synsets(word2)
    sl = 0.
    for el1 in word_1:
        for el2 in word_2:
            val = el1.path_similarity(el2)
            if val is not None:
                sl = max(val,sl)
                if sl > 0.8:
                    break
    return sl


### Calculate similarity between two sentences.
def similarity_sentences(str1, str2):
    sl, count, Ntotal, Nmatch = 0., 0., 0., 0.
    for word1 in nltk.pos_tag(str1.split()):
        score = 0
        for word2 in nltk.pos_tag(str2.split()):
            score = max(Levenshtein.ratio(word1[0],word2[0]),score)
            if score < 0.75:
                if word1[1][0]==word2[1][0]:
                    score = max(similarity(word1[0], word2[0]),score)
            #print score, word1, word2
            if score < 0.75:
                continue
            sl += score
            count += 1
            break
        if score > 0.7:
            Nmatch += 1
    if count == 0:
        return [0., 0.]
    return  [sl/count,Nmatch/max(len(str1.split()),1)] 

In [40]:
### build a dictionary of sopt words.
nltk.download('stopwords')
dic_stopwords = dict(zip(stopwords.words('english'),range(len(stopwords.words('english')))))

[nltk_data] Downloading package stopwords to /home/zwl/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


#### Remove punctuation, lowercase, stop words by calling above functions.

we use the above text processing and feature extraction functions to process training text data，same to test data

In [41]:
### Use the above text processing and feature extraction 
### functions to process training text data
df_all['search_term_transform'] = df_all['search_term'].map(lambda x:transform(x))
df_all['product_title_transform'] = df_all['product_title'].map(lambda x:transform(x))
df_all['pro_des_trans'] = df_all['product_description'].map(lambda x:transform(x))

### Use the above text processing and feature extraction 
### functions to process test text data
df_all_test['search_term_transform'] = df_all_test['search_term'].map(lambda x:transform(x))
df_all_test['product_title_transform'] = df_all_test['product_title'].map(lambda x:transform(x))
df_all_test['pro_des_trans'] = df_all_test['product_description'].map(lambda x:transform(x))

#### Stemming by calling above functions.


By the way, stemming is very important in search, because if we're going to do similarity checking, the easiest thing to do is to calculate the validity of your keywords. <br>
If your keywords are apples, but the text itself is an apple, it won't match


In [42]:
### Use the above text processing and feature extraction functions to perform 
### stem processing on the training text data

df_all['search_term_transform_stem'] = df_all['search_term_transform'].map(lambda x:str_stemmer(x))
df_all['product_title_transform_stem'] = df_all['product_title_transform'].map(lambda x:str_stemmer(x))
df_all['pro_des_trans_stem'] = df_all['pro_des_trans'].map(lambda x:str_stemmer(x))

### Use the above text processing and feature extraction functions to perform stem processing
### on the test text data
df_all_test['search_term_transform_stem'] = df_all_test['search_term_transform'].map(lambda x:str_stemmer(x))
df_all_test['product_title_transform_stem'] = df_all_test['product_title_transform'].map(lambda x:str_stemmer(x))
df_all_test['pro_des_trans_stem'] = df_all_test['pro_des_trans'].map(lambda x:str_stemmer(x))

then we concat text words.Combine text to facilitate subsequent text feature extraction

In [43]:
### concat text words.Combine text to facilitate subsequent text feature extraction

### concat train text.
df_all['all_texts_transform'] = df_all['product_title_transform'] + ' . ' + df_all['pro_des_trans']
df_all['all_texts_trans_stemm'] = df_all['product_title_transform_stem'] + ' . ' + df_all['pro_des_trans_stem']

### concat test text.
df_all_test['all_texts_transform'] = df_all_test['product_title_transform'] + ' . ' + df_all_test['pro_des_trans']
df_all_test['all_texts_trans_stemm'] = df_all_test['product_title_transform_stem'] + ' . ' + df_all_test['pro_des_trans_stem']


In [44]:
df_all_test.head()

Unnamed: 0,id,product_uid,product_title,search_term,product_description,search_term_transform,product_title_transform,pro_des_trans,search_term_transform_stem,product_title_transform_stem,pro_des_trans_stem,all_texts_transform,all_texts_trans_stemm
0,1,100001,Simpson Strong-Tie 12-Gauge Angle,90 degree bracket,"Not only do angles make joints stronger, they ...",90 degree bracket,simpson strong - tie 12- gauge angle,angles make joints stronger also provide consi...,90 degre bracket,simpson strong - tie 12- gaug angl,angl make joint stronger also provid consist s...,simpson strong - tie 12- gauge angle . angles ...,simpson strong - tie 12- gaug angl . angl make...
1,4,100001,Simpson Strong-Tie 12-Gauge Angle,metal l brackets,"Not only do angles make joints stronger, they ...",metal l brackets,simpson strong - tie 12- gauge angle,angles make joints stronger also provide consi...,metal l bracket,simpson strong - tie 12- gaug angl,angl make joint stronger also provid consist s...,simpson strong - tie 12- gauge angle . angles ...,simpson strong - tie 12- gaug angl . angl make...
2,5,100001,Simpson Strong-Tie 12-Gauge Angle,simpson sku able,"Not only do angles make joints stronger, they ...",simpson sku able,simpson strong - tie 12- gauge angle,angles make joints stronger also provide consi...,simpson sku abl,simpson strong - tie 12- gaug angl,angl make joint stronger also provid consist s...,simpson strong - tie 12- gauge angle . angles ...,simpson strong - tie 12- gaug angl . angl make...
3,6,100001,Simpson Strong-Tie 12-Gauge Angle,simpson strong ties,"Not only do angles make joints stronger, they ...",simpson strong ties,simpson strong - tie 12- gauge angle,angles make joints stronger also provide consi...,simpson strong tie,simpson strong - tie 12- gaug angl,angl make joint stronger also provid consist s...,simpson strong - tie 12- gauge angle . angles ...,simpson strong - tie 12- gaug angl . angl make...
4,7,100001,Simpson Strong-Tie 12-Gauge Angle,simpson strong tie hcc668,"Not only do angles make joints stronger, they ...",simpson strong tie hcc 668,simpson strong - tie 12- gauge angle,angles make joints stronger also provide consi...,simpson strong tie hcc 668,simpson strong - tie 12- gaug angl,angl make joint stronger also provid consist s...,simpson strong - tie 12- gauge angle . angles ...,simpson strong - tie 12- gaug angl . angl make...


then， we Create two pandas to store training features and test features。<br>
we use the previous feature function to extract training text features and test text features separately<br>
for example：<br>
length of key words.<br>
How many keywords coincide with search terms in the title<br>
How many keywords coincide with the search term in the description<br>
and the output are as follows<br>

In [45]:
### Create two pandas to store training features and test features
df_model = pd.DataFrame({})
df_model_test = pd.DataFrame({})

#3） text features.

### Apply the previous feature function 
### to extract training text features and test text features separately

# length of key words.
df_model['len_of_query'] = df_all['search_term'].map(lambda x: len(x.split())).astype(np.int64)
df_model_test['len_of_query'] = df_all_test['search_term'].map(lambda x: len(x.split())).astype(np.int64)

# How many keywords coincide with search terms in the title
df_model['commons_in_title'] = df_all.apply(lambda x:str_common_word(x['search_term'], x['product_title']), axis=1)
df_model_test['commons_in_title'] = df_all_test.apply(lambda x:str_common_word(x['search_term'], x['product_title']), axis=1)

#How many keywords coincide with the search term in the description
#df_all['commons_in_desc'] = df_all.apply(lambda x: str_common_word(x['search_term'], x['product_description'], axis = 1)
df_model['commons_in_desc'] = df_all.apply(lambda x: str_common_word(x['search_term'], x['product_description']), axis=1)
df_model_test['commons_in_desc'] = df_all_test.apply(lambda x: str_common_word(x['search_term'], x['product_description']), axis=1)
# df_all = df_all.drop(['search_term', 'product_title', 'product_description'], axis=1)
df_model_test.head() 

Unnamed: 0,len_of_query,commons_in_title,commons_in_desc
0,3,0,1
1,3,1,1
2,3,0,0
3,3,0,1
4,4,0,1


and we use the previous feature function to extract training text features, then get some new features.<br>
For example:<br>
search_term and product_title comparison<br>
search_term and product_description comparison<br>
How many keywords overlap in the product title<br>
How many keywords overlap in the description<br>

In [46]:
import numpy as np

### Apply the previous feature function 
### to extract training text features。

df_model['len_search_term']=df_all['search_term_transform_stem'].map(
        lambda x:len(x.split())).astype(np.int64)

### new feature.

# 1. search_term and product_title comparison
df_model['dist_in_title'] = df_all.apply(lambda x:word_vs_word_ratio((x['search_term_transform_stem']),x['product_title_transform_stem']), axis=1)

df_model['dist_in_title1'] = df_all.apply(lambda x:search_vs_word_ratio((x['search_term_transform_stem']),x['product_title_transform_stem']), axis=1)

# search_term and product_description comparison
df_model['dist_in_desc'] = df_all.apply(lambda x:word_vs_word_ratio((x['search_term_transform_stem']),x['pro_des_trans_stem']), axis=1)

df_model['dist_in_desc1'] = df_all.apply(lambda x:search_vs_word_ratio((x['search_term_transform_stem']),x['pro_des_trans_stem']), axis=1)

df_model['len_of_query']=df_all['search_term_transform_stem'].map(
        lambda x:len(x.split())).astype(np.int64)
df_model['len_search'] = df_all['search_term_transform_stem'].map(lambda x:len(x))

# How many keywords overlap in the product title
df_model['commons_in_title']=df_all.apply(
    lambda x:str_common_word(
    x['search_term_transform_stem'],x['product_title_transform_stem']),axis=1)

# How many keywords overlap in the description
df_model['commons_in_desc'] = df_all.apply(lambda x:str_common_word(x['search_term_transform_stem'], x['pro_des_trans_stem']), axis=1)

df_model['common_in_desc_pro']=df_all.apply(lambda x: str_common_desc_pro_word(x['search_term_transform_stem'],x['pro_des_trans_stem'],x['product_title_transform_stem']), axis=1)

# df_model['nn_word_in_search'] = df_all['search_term_transform_stem'].map(nn_word_numbers_In_Search)

df_model['queryvstitle'] = df_model.apply(lambda x: float(x['commons_in_title'])/max((x['len_of_query']),1), axis=1)

df_model['product_uid'] =df_all['product_uid']

df_model['queryvsdesc'] = df_model.apply(lambda x: float(x['commons_in_desc'])/max(x['len_of_query'],1), axis=1)

and we use the previous feature function to extract test text features, then get some new features.<br>
For example:<br>
search_term and product_title comparison<br>
search_term and product_description comparison<br>
How many keywords overlap in the product title<br>
How many keywords overlap in the description<br>

In [47]:
import numpy as np

### Apply the previous feature function 
### to extract test text features。

df_model_test['len_search_term']=df_all_test['search_term_transform_stem'].map(
        lambda x:len(x.split())).astype(np.int64)

### new feature.

# 1. search_term and product_title comparison
df_model_test['dist_in_title'] = df_all_test.apply(lambda x:word_vs_word_ratio((x['search_term_transform_stem']),x['product_title_transform_stem']), axis=1)

df_model_test['dist_in_title1'] = df_all_test.apply(lambda x:search_vs_word_ratio((x['search_term_transform_stem']),x['product_title_transform_stem']), axis=1)

# search_term and product_description comparison
df_model_test['dist_in_desc'] = df_all_test.apply(lambda x:word_vs_word_ratio((x['search_term_transform_stem']),x['pro_des_trans_stem']), axis=1)

df_model_test['dist_in_desc1'] = df_all_test.apply(lambda x:search_vs_word_ratio((x['search_term_transform_stem']),x['pro_des_trans_stem']), axis=1)

df_model_test['len_of_query']=df_all_test['search_term_transform_stem'].map(
        lambda x:len(x.split())).astype(np.int64)
df_model_test['len_search'] = df_all_test['search_term_transform_stem'].map(lambda x:len(x))

# How many keywords overlap in the product title
df_model_test['commons_in_title']=df_all_test.apply(
    lambda x:str_common_word(
    x['search_term_transform_stem'],x['product_title_transform_stem']),axis=1)

# How many keywords overlap in the description
df_model_test['commons_in_desc'] = df_all_test.apply(lambda x:str_common_word(x['search_term_transform_stem'], x['pro_des_trans_stem']), axis=1)

df_model_test['common_in_desc_pro']=df_all_test.apply(lambda x: str_common_desc_pro_word(x['search_term_transform_stem'],x['pro_des_trans_stem'],x['product_title_transform_stem']), axis=1)

# df_model['nn_word_in_search'] = df_all['search_term_transform_stem'].map(nn_word_numbers_In_Search)

df_model_test['queryvstitle'] = df_model_test.apply(lambda x: float(x['commons_in_title'])/max((x['len_of_query']),1), axis=1)

df_model_test['product_uid'] =df_all_test['product_uid']

df_model_test['queryvsdesc'] = df_model_test.apply(lambda x: float(x['commons_in_desc'])/max(x['len_of_query'],1), axis=1)

In [15]:
df_model_test.head()

Unnamed: 0,len_of_query,commons_in_title,commons_in_desc,len_search_term,dist_in_title,dist_in_title1,dist_in_desc,dist_in_desc1,len_search,common_in_desc_pro,queryvstitle,product_uid,queryvsdesc
0,2,1,1,2,0.19995,0.153846,0.179414,0.153846,12,1,0.5,100001,0.5
1,2,1,1,2,0.07982,0.2,0.123923,0.2,9,1,0.5,100001,0.5
2,1,1,1,1,0.203613,0.4,0.188907,0.4,4,1,1.0,100002,1.0
3,3,1,1,3,0.254357,0.117647,0.218599,0.117647,16,1,0.333333,100005,0.333333
4,2,2,2,2,0.257484,0.142857,0.224937,0.142857,13,2,1.0,100005,1.0


The above briefly shows the features of our work so far.

#### Using TF IDF method to compute text feature.

In addition to the above methods, we can also use more powerful algorithms to obtain some features.For example, Using TF IDF method to compute the text feature. <br>
The TF IDF is a kind of statistical method, the principle is = TF * IDF.  It assesses the importance of a word in a document or a corpus. if a word or phrase in an article has high frequency (TF), and rarely appears in other articles (IDF), we would say this word or phrase has great distinguishing ability. And the TF - IDF values will increase with the word show more times in the document.  It will also decrease with the increase of the number of words appearing in the corpus.<br>
For the TFIDF model, we need to build a dictionary using the data set.<br>
Here we chose to create our dictionary in the environment of gensim. Then we can calculate that there are 28531 unique words which we use to create a corpus. <br>
What is important is that we intend to convert every word into a digit, so that when we are using the tf-idf model, the same word will become the same number,which makes it simpler and clearer. Also since the corpus are generally very big, so we will use an iterator.<br>
Then we write a class to sweep up all our corpus and turn them into a simple count of words. This would be a bag of words.


In [48]:
from gensim.utils import tokenize
from gensim.corpora.dictionary import Dictionary

#dictionary.
### For the TFIDF model, we need to build a thesaurus using the data set.

dictionary = Dictionary(list(tokenize(x, errors='ignore')) for x in df_all['all_texts_trans_stemm'].values)
print(dictionary)

class MyCorpus(object):
    def __iter__(self):
        for x in df_all['all_texts_trans_stemm'].values:
            yield dictionary.doc2bow(list(tokenize(x, errors='ignore')))

#
corpus = MyCorpus()

Dictionary(28531 unique tokens: ['alon', 'also', 'angl', 'bent', 'coat']...)


then we Training the TFIDF model directly

A quick test to convert 'hello world, good morning' into a number (here's one stopwords, so 3 outputs )

In [49]:
### Training the TFIDF model

from gensim.models.tfidfmodel import TfidfModel
tfidf = TfidfModel(corpus)

tfidf[dictionary.doc2bow(list(tokenize('hello world, good morning', errors='ignore')))]

[(806, 0.36030376943542936),
 (2358, 0.33818710980924394),
 (11067, 0.8693737242920857)]

Taking the large string as the reference vector and the small string with complement of  0 can solve the problem that the size does not match when comparing strings.

then we use TFIDF model to convert text to TFIDF value，and creating a cosine similarity comparison method

In [50]:
from gensim.similarities import MatrixSimilarity

#### Use TFIDF model to convert text to TFIDF value

def to_tfidf(text):
    res = tfidf[dictionary.doc2bow(list(tokenize(text, errors='ignore')))]
    return res

# Creating a cosine similarity comparison method

def cos_sim(text1, text2):
    tfidf1 = to_tfidf(text1)
    tfidf2 = to_tfidf(text2)
    index = MatrixSimilarity([tfidf1],num_features=len(dictionary))
    sim = index[tfidf2]
    return float(sim[0])

then Calculate TFIDF features on training and test data，we can get the new feature：<br>
similarity between search term and product title<br>
and similarity between search term and product description<br>

In [51]:
### Calculate TFIDF features on training and test data

#Calculate similarity between search term and product title
df_model['tfidf_cos_sim_in_title'] = df_all.apply(lambda x: cos_sim(x['search_term_transform_stem'], x['product_title_transform_stem']), axis=1)
df_model_test['tfidf_cos_sim_in_title'] = df_all_test.apply(lambda x: cos_sim(x['search_term_transform_stem'], x['product_title_transform_stem']), axis=1)

#Calculate similarity between search term and product description
df_model['tfidf_cos_sim_in_desc'] = df_all.apply(lambda x: cos_sim(x['search_term_transform_stem'], x['pro_des_trans_stem']), axis=1)
df_model_test['tfidf_cos_sim_in_desc'] = df_all_test.apply(lambda x: cos_sim(x['search_term_transform_stem'], x['pro_des_trans_stem']), axis=1)

In [52]:
df_model_test.head()

Unnamed: 0,len_of_query,commons_in_title,commons_in_desc,len_search_term,dist_in_title,dist_in_title1,dist_in_desc,dist_in_desc1,len_search,common_in_desc_pro,queryvstitle,product_uid,queryvsdesc,tfidf_cos_sim_in_title,tfidf_cos_sim_in_desc
0,3,0,1,3,0.075893,0.117647,0.127041,0.117647,16,0,0.0,100001,0.333333,0.0,0.0
1,3,1,1,3,0.113459,0.125,0.153707,0.125,15,1,0.333333,100001,0.333333,0.0,0.0
2,3,1,1,3,0.162306,0.125,0.13175,0.125,15,1,0.333333,100001,0.333333,0.345086,0.081922
3,3,3,3,3,0.264254,0.105263,0.205404,0.105263,18,3,1.0,100001,1.0,0.861288,0.271635
4,5,3,3,5,0.158553,0.074074,0.136194,0.074074,26,3,0.6,100001,0.6,0.861288,0.271635


### Using Word2Vec method to compute text feature.

There's a big difference between w2v and tf-idf. For tf-idf, all we need to know is which word elements are contained in a whole text and then we are all set.<br>

But w2v needs to consider the split of the sentence hierarchy, so we can’t use the TF-IDF corpus directly. Here, we need to sort out the sentences/words first.<br>
First, we import nltk which also comes with a powerful sentence splitter.<br>
Next  we make long text into list of sentences, then turn sentences into list of words.<br>
Since the sentences do not need these hierarchies, they are all flat, so we gave flatten to list of lists.<br>
We divide up the words in the sentence. You can use the tokenizer just from Gensim or the word_tokenizer from NLTK.<br>


In [53]:
import nltk
nltk.download('punkt')

#1）nltk also comes with a powerful sentence splitter. [Call tool]
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

#2）We first make long text into list of sentences, and then turn sentences into list of words: [text-> sentence]
sentences = [tokenizer.tokenize(x) for x in df_all['all_texts_trans_stemm'].values]

#3）We gave flatten to list of lists. [Sentence-> flatten]
sentences = [y for x in sentences for y in x] #

#4）We divide up the words in the sentence. You can use the tokenizer just from Gensim or the word_tokenizer from nltk[sentence -> words]
from nltk.tokenize import word_tokenize
w2v_corpus = [word_tokenize(x) for x in sentences]

[nltk_data] Downloading package punkt to /home/zwl/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [22]:
# import pickle
# with open(  'data.pkl','wb' ) as f:
# pickle.dump(w2v_corpus, f)

And train our predictive database into word vectors, at this point, each word can read out its w2v vector like a dictionary:

In [54]:
from gensim.models.word2vec import Word2Vec

#5） Train our predictive database into word vectors [Words-> Training Corpus Model]
# model = Word2Vec(w2v_corpus, size=128, window=5, min_count=5, workers=4)
model = Word2Vec.load('w2v')

In [55]:
model

<gensim.models.word2vec.Word2Vec at 0x7f376d175668>

Like TFiDF, we can turn the columns of textual into w2v vectors. <br>
The difference here is that TFiDF is for every sentence, while w2v is for every word. Therefore, we can average the w2v vector of a sentence and calculate the similarity of word vectors between texts as the average vector then we calculate the similarity of word vectors between texts.<br>


In [56]:
import numpy as np
#6) You can get a vector of each word, but each sentence is composed of multiple words, and each word vector is averaged,
vocab = model.vocabulary
print(vocab)

### Get word vector for text
def get_vector(text):
    res = np.zeros([128])
    count = 0
    for word in word_tokenize(text):
        res += model[word]
        count+=1
    return res/max(count,1)

from scipy import spatial
### Calculate the similarity of word vectors between texts
def w2v_cos_sim(text1, text2):
    try:
        w2v1 = get_vector(text1)
        w2v2 = get_vector(text2)
        sim = 1 - spatial.distance.cosine(w2v1, w2v2)
        return float(sim)
    except:
        return float(0)

<gensim.models.word2vec.Word2VecVocab object at 0x7f3766c97358>


last,we use the previous word vector feature function to extract training text features and test text features<br>
Then we can get a numblized large table with many features created by our own as follows:

In [57]:
### feature about word2vec.

### Apply the previous word vector 
### feature function to extract training text features and test text features

df_model['w2v_cos_sim_in_title'] = df_all.apply(lambda x: w2v_cos_sim(x['search_term_transform_stem'], x['product_title_transform_stem']), axis=1)
df_model['w2v_cos_sim_in_desc'] = df_all.apply(lambda x: w2v_cos_sim(x['search_term_transform_stem'], x['pro_des_trans_stem']), axis=1)

df_model_test['w2v_cos_sim_in_title'] = df_all_test.apply(lambda x: w2v_cos_sim(x['search_term_transform_stem'], x['product_title_transform_stem']), axis=1)
df_model_test['w2v_cos_sim_in_desc'] = df_all_test.apply(lambda x: w2v_cos_sim(x['search_term_transform_stem'], x['pro_des_trans_stem']), axis=1)

  # Remove the CWD from sys.path while we load stuff.


In [58]:
df_model_test.head()

Unnamed: 0,len_of_query,commons_in_title,commons_in_desc,len_search_term,dist_in_title,dist_in_title1,dist_in_desc,dist_in_desc1,len_search,common_in_desc_pro,queryvstitle,product_uid,queryvsdesc,tfidf_cos_sim_in_title,tfidf_cos_sim_in_desc,w2v_cos_sim_in_title,w2v_cos_sim_in_desc
0,3,0,1,3,0.075893,0.117647,0.127041,0.117647,16,0,0.0,100001,0.333333,0.0,0.0,0.0,0.0
1,3,1,1,3,0.113459,0.125,0.153707,0.125,15,1,0.333333,100001,0.333333,0.0,0.0,0.0,0.0
2,3,1,1,3,0.162306,0.125,0.13175,0.125,15,1,0.333333,100001,0.333333,0.345086,0.081922,0.0,0.0
3,3,3,3,3,0.264254,0.105263,0.205404,0.105263,18,3,1.0,100001,1.0,0.861288,0.271635,0.0,0.0
4,5,3,3,5,0.158553,0.074074,0.136194,0.074074,26,3,0.6,100001,0.6,0.861288,0.271635,0.0,0.0


### Using GBDT method to train model.

In this project, we chose Gradient Boosting Decision Tree to do the training since it has good performance in ensembling.

For the training dataset, we can get train feature x and train label y, then get the test feature for testing dataset.<br>
By the way, since there is no relevance direct result in training data set, in order to ensure the effectiveness of our method, we first split the train data set, we divide train set and test set using 0.2 ratio. And then directly use GradientBoostingRegressor to train the  GBDT model using training dataset.<br>


In [59]:
### For training data, get train feature x and train label y.

y = df_all['relevance'].values
x = df_model.values

#### Get test feature.

test_x = df_model_test.values

In [60]:
print('x: ', x.shape)
print('y: ', y.shape)
print('test x...', test_x.shape)

x:  (74067, 17)
y:  (74067,)
test x... (166693, 17)


In [61]:
### divide train set and test set using 0.2 ratio../

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
y_train.shape

(59253,)

In [62]:
from sklearn.ensemble import GradientBoostingRegressor

In [63]:
### gbdt model..

gbdt = GradientBoostingRegressor()
gbdt

GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
                          learning_rate=0.1, loss='ls', max_depth=3,
                          max_features=None, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=100,
                          n_iter_no_change=None, presort='auto',
                          random_state=None, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)

In [64]:
### train gbdt model using train data.
gbdt.fit(x_train, y_train)

GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
                          learning_rate=0.1, loss='ls', max_depth=3,
                          max_features=None, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=100,
                          n_iter_no_change=None, presort='auto',
                          random_state=None, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)

and predict and compute MSE on test set of the original train dataset

In [34]:
#### predict and compute MSE on test set.

y_pred = gbdt.predict(x_test)
from sklearn.metrics import mean_squared_error as mse

### compute MSE on test set.
mse(y_pred, y_test)

0.22196498084458471

- The correlation prediction error is 0.22, the performance is relatively good, and the error is very low.

#### Use the trained model to predict test data

In [68]:
test_y_pred = gbdt.predict(test_x)
print(test_y_pred.shape)
test_y_pred

(166693,)


array([1.86558007, 2.04259776, 2.25202692, ..., 2.3285694 , 2.53979406,
       2.31565136])

In [67]:
df_test.shape

(166693, 4)

In [72]:

#### Save csv file about test data.
from pandas import DataFrame as df
res = pd.concat( [df(df_test), df(test_y_pred)], axis=1)
res.to_csv('predict_res.csv', header=None)