# TODO
### TVID transformers for product_title and product_description throw errors because the encoding for those columns is whack (and when i read_csv and encode in encoding="ISO-8859-1" it doesn't pipeline the features)

# Home Depot Product Search Relevance
The goal of this analysis is to determine how to predict relevance of a search on Home Depot's website. The training data were labelled by crowdsourcing humans, but the hope is that the text and numerical features will be enough to predict relevance via machine learning

In [3]:
# Import my library stack
import pandas as pd, numpy as np, matplotlib.pyplot as plt
import os
import pprint
import copy
import gc
import re
%matplotlib inline

# Some nice display tools for ipython
from IPython.display import display, HTML

# There are several files that the Kaggle competition included for this analysis
DATADIR = "%s/home_depot_2015/"%os.environ["KAGGLE_DATA_DIR"] 

# 1: Overview of the data
There are three files I will take a peek at here:
    
* **train.csv** -- The training set, which contains products, searches, and relevance scores
* **test.csv** -- The test set, which contains products and searches --> I am to predict relevance scores
    
* **product_descriptions.csv** -- Contains product id and a plain text description of the product
    
* **attributes.csv** -- Contains product id and several attributes, but for only a *subset* of products

I'm just going to preview the first few rows of each.

In [4]:
# Preview the first 5 rows of a csv file given the path to it
def preview_data(file, name):
    print display(HTML("<h3>First three rows of %s</h3>"%name))
    preview_df = pd.read_csv(file, encoding="ISO-8859-1")
    print display(preview_df.head(1))

def get_path(file):
    return "%s%s"%(DATADIR, file)

# Define the files for later
f_train = get_path('train.csv')
f_test = get_path('test.csv')
f_desc = get_path('product_descriptions.csv')
f_attr = get_path('attributes.csv')

# Do all four
files = [(f_train, 'train'), (f_test, 'test'), (f_desc, 'descriptions'), (f_attr, 'attributes')]
map(lambda x: preview_data(x[0], x[1]), files)

None


Unnamed: 0,id,product_uid,product_title,search_term,relevance
0,2,100001,Simpson Strong-Tie 12-Gauge Angle,angle bracket,3


None


None


Unnamed: 0,id,product_uid,product_title,search_term
0,1,100001,Simpson Strong-Tie 12-Gauge Angle,90 degree bracket


None


None


Unnamed: 0,product_uid,product_description
0,100001,"Not only do angles make joints stronger, they ..."


None


None


Unnamed: 0,product_uid,name,value
0,100001,Bullet01,Versatile connector for various 90Â° connectio...


None


[None, None, None, None]

# 2: Feature engineering
After looking at these spreadsheets, I realize there isn't a ton of information with which to work. My initial thought is to do some sort of a word matching procedure (e.g. see if one of the search words matches one of the words in the title (or description, or attributes). Better still, I could take the individual letters in each word of the search query and try to see if they appear consecutively in the raw string of the title, description, or attributes.

Let's give that a try.

In [5]:
# I will definitely want to use multiprocessing in the coming steps
from multiprocessing import Pool

### Custom manipulation of attributes
The attributes file has a dump of attributes and their respective product_uid values. I want two things out of this file:
* A concatenation of all the "values"; that is, all of the raw text information
* Specifically the brand name (marked as "MFG Brand Name")

In [6]:
# The attributes are a little trickier. There may be 0 or many attributes per 1 product_uid
# I want to concatenate all strings belonging to a particular product_uid

# Data[0] = product_uid; data[1] = name; data[2] = value
def collapse_attr(data):
    attr = {}
    for d in data:
        # d is an array of form [product_uid, name, value]
        if not np.isnan(d[0]):
            i = str(int(d[0]))
            # Concatenate the attribute as a string
            # Also add the brand if it exists
            if i in attr:
                attr[i]['string'] = "%s %s"%( attr[i]['string'], str(d[2]) )
                if d[1] == 'MFG Brand Name':
                    attr[i]['brand'] = str(d[2])
            else:
                attr[i] = {'string': str(d[2])}
                
                if d[1] == 'MFG Brand Name':
                    attr[i]['brand'] = str(d[2])
                else:
                    attr[i]['brand'] = ''

    return attr

In [7]:
# Create the training attribute array, which we will append to the dataframe in the pipeline
attr = pd.read_csv(f_attr)
ATTR_ARR = collapse_attr(np.array(attr))

## Custom functions for processing the strings
These will format the strings, add alternate suffixes, and add some common abbreviations if applicable. Since we're dealing with Home Depot data, we have a general idea of what types of abbreviations we might encounter.

In [8]:
# Given a word, return all forms of it and its abbreviations
def abbrev(s):
    abrv_groups = [
        ["'", "in", "inches", "inch"],
        ["pounds", "pound", "lbs", "lb"],
        ["sqft", "sq", "sf", "square", "squared", "foot", "feet", "\"", "ft", "inch", "inches", "in", "'"],
        ["cf", "cu", "cubic", "cubed", "inch", "foot", "feet", "'", "\"", "ft", "in", "inches"],
        ["gal", "gallon", "gallons"],
        ["g", "gram", "grams", "kg", "kilogram", "kilo"],
        ["oz", "ounces", "ounce"],
        ["cm", "centimeters", "centimeter"],
        ["m", "meter", "meters"],
        ["mm", "milimeter", "millimeter", "milimeters", "millimeters"],
        ["a", "amp", "amps", "ampere", "amperes"],
        ["w", "watt", "watts"],
        ["v", "volt", "volts"],
        ["whirpool","whirlpool", "whirlpoolga", "whirlpoolstainless","stainless"],
        ["and", "&", "+", "&amp;"],
        ["x", "by", "*"],
        ["deg", "degree", "degrees", "°", "angle"],
        ['dia', 'diameter']
    ]
    
    # If we can match the word in an abbreviation group, return the whole group
    for g in abrv_groups:
        if s in g:
            return g
        
    # If we can't match anything just return an empty array
    return []

# Turn the string into a series of words
def process_string(s):
    
    __words = np.array(s.split(" "))
    
    # Split by special, but include those characters
    # This regex splits by the characters [', ", /, *, -], but INCLUDES those characters
    _words = list(np.hstack(map(lambda i: re.split(r"(plus|[('\"\-*/)])", i), __words)))

    words = _words
    
    # Get rid of commas
    words = map(lambda x: x.replace(',', ''), words)
    # Get rid of semicolons
    words = map(lambda x: x.replace(';', ''), words)
    # Get rid of colons
    words = map(lambda x: x.replace(':', ''), words)
    # Get rid of periods
    words = map(lambda x: x.replace('.', ''), words)
    # Get rid of blanks
    words = filter(lambda x: x != ' ' and x != '', words)
    return words


def pre_process_strings(query):
    
    # Lowercase all the things
    query = str(query).lower()
    
    # Split the query into an array of char arrays
    query_words = process_string(query)

    return query_words


# This function processes strings like "4x4" or "4'x4'"
def process_x_by(s):
    if any(i.isdigit() for i in s) and "x" in s:
        new_s = list(filter(lambda x: x!='', s.split("x"))); new_s.append(s); new_s.append("x")
        return new_s
    else:
        return [s]

# Split strings that have unit suffixes (e.g. 10in, 50g, etc)
def process_unit_suffixes(s):
    strings = ["mm", "cm", "m", "g", "kg", "in", "ft", "a", "w", "v", "oz", "gal", "cf"]
    # Inefficient but I can't think of a better way
    for i in strings:
        # If any of the above strings is in s, return that string + the number
        if i in s and any(j.isdigit() for j in s):
            arr = list(filter(lambda x: x!='', s.split(i))); arr.append(i); arr.append(s)
            return arr

    return [s]

# Get a list of words similar to the word if applicable
# This will get called with a word in the QUERY
def extension_words(word):
    
    # Make damn sure everything is lower case
    w = word.lower()
    
    # Process strings of the form "AxB"
    ret_words = process_x_by(w)
    
    # Include abbreviation words
    abbr = abbrev(w)
    
    # Flatten array
    ret_words = list(np.hstack([ret_words, abbr]))
    
    
    # If the word is small (<4 chars), contains a number, or contains a special character,
    #     only add s and return
    if any(i.isdigit() for i in w) or len(list(word)) < 4 or any(i in w for i in ["-", "*", "'", "\"", "/"]):
        ret_words.append("%ss"%w)
        return filter(lambda x: x!='' and x!=' ', ret_words)

    
    # A list of suffixes
    suffixes = ['s', 'ed', 'ing', 'n', 'en', 'er', 'est', 'ise', 'fy', 'ly',
               'ful', 'able', 'ible', 'hood', 'ess', 'ness', 'less', 'ism',
               'ment', 'ist', 'al', 'ish', 'tion']
    
    # If the word ends in one of these suffixes, add the smaller version
    # to strings; otherwise, add this to the end of the word and add that
    for x in xrange(len(suffixes)):
        l = len(suffixes[x])
        if w[-l:] == suffixes[x]:
            ret_words.append(w[0:-l])
        else:
            ret_words.append(w+suffixes[x])

    return filter(lambda x: x!='' and x!=' ', ret_words)

## Custom functions for doing word searches
These will determine if words in the query are in the matching string. We want to know three basic things:
* Does the string contain *any* of the query words?
* What fraction of the query words are in the matching words?
* What fraction of the chars making up query words are found in the matching words?

In [9]:
# Determine if the word is in the comparison string
#   @returns 1 or 0
def match_word(word, compare):
    strings = extension_words(word)
    if not strings: return 0
    return min(1, sum( map(lambda s: 1 if s in compare else 0, strings ) ))

# Determine the number of times the word (or any version of it) matches a string
#   @returns array of match counts
def match_word_count(word, compare):  
    strings = filter( lambda x: x!='' and x!=' ', extension_words(word) )
    if not strings: return 0
    return max( map(lambda s: compare.count(s) , strings ) )
    
    
    
## STRING MATCHING
##=========================================================

# Get the number of unique words that are matched
def matched_words_string(query, to_match):
    query_words = pre_process_strings(query)
    return sum( map(lambda x: match_word(x, to_match.lower()), query_words) ) if query_words else 0
    
    
# Get the count of all words matched (i.e. if a word is matched more than once, it is counted multiple times)
def count_matched_words_string(query, to_match):
    query_words = pre_process_strings(query)
    return sum( map(lambda x: match_word_count(x, to_match.lower()), query_words) )

## Word matching to the words in the string
def matched_words_words(query, to_match):
    query_words = pre_process_strings(query)
    to_match_words = to_match.split(" ")
    sums = map(lambda y: sum( map(lambda x: match_word(x, to_match.lower()), query_words) ), \
                to_match_words) if query_words else 0
    return sum(sums)

# Count of words matching the words in the string
def count_matched_words_words(query, to_match):
    query_words = pre_process_strings(query)
    to_match_words = to_match.split(" ")
    counts = map(lambda y: sum( map(lambda x: match_word_count(x, to_match.lower()), query_words) ), \
                to_match_words) if query_words else 0
    return sum(counts)
        


## QUERY MATCHING
##=========================================================

# Whether or not the whole query is in the string
def matched_query(query, to_match):
    return query in to_match

# How many times the whole query is in the string
def count_matched_query(query, to_match):
    return to_match.count(query)

# Whether or not the last word of the query is in the to_match string
def last_word(query, to_match):
    last_q = query.split(" ")[-1]
    return 1 if last_q in to_match else 0

# Whether or not a word from the query is the FIRST word in the to_match string
def first_word(query, to_match):
    first_q = query.split(" ")[0]
    return 1 if first_q in to_match else 0



# 3: Feature Engineering Pipeline
I will start by engineering new features and removing the long strings in my data set. Specifically, I want to add

* Match rates of query relating to title and description (determined by char_match_fraction function)
* String length columns of query, description, and title columns

I will go ahead and build a new training set based on this.

### Lambda functions and Pool
This is a lot of extra code but multiprocessing will speed things up significantly and the lambda functions have to be defined in scope.

In [23]:
# POOL lambda functions (need to be defined outside the function that calls them)
# With multiprocessing, we can't use lambda, so I will define some basic functions here
#================================================================================
import string
import sys
printable = set(string.printable)

# An encoding function because apparently some of these aren't readable as utf-8...
def encode(x):
    return filter(lambda i: i in printable, x)
    #return filter(lambda i: i in printable, x.encode('utf-8').strip())

def en(strs):
    if isinstance(strs, str):
        return encode(strs)
    else:
        return strs
    """
    # If it's a single item, encode it right away
    if isinstance(strs, (float, int, np.float64, np.int64)):
        return encode(str(int(strs)))
    elif isinstance(strs, str):
        return encode(strs)
    
    # If it's an array, encode an array
    elif isinstance(strs, list):
        ret_strs = []
        for x in strs:
            # We want to encode this item in utf-8 and remove all non-ascii characters
            if isinstance(x, (float, int, np.float64, np.int64, str)):
                return encode(str(x))
            else:
                ret_strs.append(x)
        return ret_strs
    else:
        return strs
    """
    
# Attribute stuff
def lambda_in_attr(a):
    a = str(int(float(en(a))))
    return ATTR_ARR[a]['string'] if a in ATTR_ARR else ''
def lambda_brand(a):
    a = str(int(float(en(a))))
    return ATTR_ARR[a]['brand'] if a in ATTR_ARR else ''


# Lengths
def lambda_char_len(a):
    return float(len(list(str(a))))
def lambda_word_len(a):
    return float(len(str(a).split(" ")))


# Matched words
def l_matched_words_string(a):
    try:
        return float(matched_words_string( str(a[0]), str(a[1]) ))
    except:
        print a
        return
    
def l_count_matched_words_string(a):
    return float(count_matched_words_string( str(a[0]), str(a[1]) ))
def l_matched_words_words(a):
    return float(matched_words_words( str(a[0]), str(a[1]) ))
def l_count_matched_words_words(a):
    return float(count_matched_words_words( str(a[0]), str(a[1]) ))


# Matched queries
def l_matched_query(a):
    return matched_query( str(a[0]), str(a[1]) )
def l_count_matched_query(a):
    return float(count_matched_query( str(a[0]), str(a[1]) ))


# Binaries
def l_last_word(a):
    return last_word( str(a[0]), str(a[1]) )
def l_first_word(a):
    return first_word( str(a[0]), str(a[1]) )


# An abstracted multiprocessor functional so that I can remove MP when debugging
# All functions are applied to a map
POOL = Pool(maxtasksperchild=1000)
def run_pool(f, iterator):
    #m = map(f, iterator)
    m = POOL.map(f, iterator)
    return pd.Series(m)

### Feature pipeline
Here I will add all of my features to the dataframe that will be used as X_train

In [28]:
import time
import sys


#================================================================================
## DATA PIPELINE
#================================================================================
# Given the data (train or test) and description files,
# perform a series of operations to produce a data set on which we can do ML
def feature_pipeline(data_file, **kwargs):
    
    # Define my multiprocessing pool and start the timer
    
    start = time.time()
    
    
    #============
    # Read files
    #============
    
    # Read the initial train.csv and join it to product descriptions
    #_df = pd.read_csv(data_file, encoding="ISO-8859-1")
    _df = pd.read_csv(data_file)

    # Add in descriptions because they are 1:1
    df = pd.merge(_df, pd.read_csv(f_desc), how='outer')
    
    # If there is an attribute for a product uid, join it
    df['attr'] = run_pool(lambda_in_attr, df['product_uid'])
    df['brand'] = run_pool(lambda_brand, df['product_uid'])
    
    
    # Construct the columns to be added to the dataframe. Each will be added with run_pool defined above.
    
    # Char lengths
    char_lengths = [
        ['desc_char_l', lambda_char_len, df['product_description'] ],
        ['title_char_l', lambda_char_len, df['product_title'] ],
        ['query_char_l', lambda_char_len, df['search_term'] ],
        ['attr_char_l', lambda_char_len, df['attr'] ],
        ['brand_char_l', lambda_char_len, df['brand'] ]
    ]

    
    # Word lengths
    word_lengths = [
        ['desc_word_l', lambda_word_len, 'product_description'],
        ['title_word_l', lambda_word_len, 'product_title'],
        ['query_word_l', lambda_word_len, 'search_term'],
        ['attr_word_l', lambda_word_len, 'attr'],
        ['brand_word_l', lambda_word_len, 'brand']
    ]

    
    # Zip the data into tuples
    desc_zip = np.dstack( ( np.array(df['search_term']), np.array(df['product_description'].apply(en)) ))[0]
    title_zip = np.dstack( (np.array(df['search_term']), np.array(df['product_title'].apply(en)) ))[0]
    attr_zip = np.dstack( (np.array(df['search_term']), np.array(df['attr'].apply(en)) ))[0]
    brand_zip = np.dstack( (np.array(df['search_term']), np.array(df['brand'].apply(en)) ))[0]
    
    
    # Number of unique words that are matched to the comparison string (float)
    matched_strings = [
        ['desc_matched_string', l_matched_words_string, desc_zip],
        ['title_matched_string', l_matched_words_string, title_zip],
        ['attr_matched_string', l_matched_words_string, attr_zip],
        ['brand_matched_string', l_matched_words_string, brand_zip]
    ]
    
    
    # Total number of query words matched to the comparison string (float)
    count_string = [
        ['desc_count_string', l_count_matched_words_string, desc_zip],
        ['title_count_string', l_count_matched_words_string, title_zip],
        ['attr_count_string', l_count_matched_words_string, attr_zip]
    ]

    
    # Whether or not the whole query (string) can be found in the comparison string (bool)
    query_matched = [
        ['desc_query_matched', l_matched_query, desc_zip],
        ['title_query_matched', l_matched_query, title_zip],
        ['attr_query_matched', l_matched_query, attr_zip]
    ]
    
    
    # How many times the query is in the comparison string (float)
    query_count = [
        ['desc_query_count', l_count_matched_query, desc_zip],
        ['title_query_count', l_count_matched_query, title_zip],
        ['attr_query_count', l_count_matched_query, attr_zip]
    ]
    
    
    # Is the last word of the query in the comparison string? (bool)
    last_word = [
        ['desc_last_word', l_last_word, desc_zip],
        ['title_last_word', l_last_word, title_zip]
    ]
    
    # Is the first word of the query in the comparison string? (bool)
    first_word = [
        ['desc_first_word', l_first_word, desc_zip],
        ['title_first_word', l_first_word, title_zip]
    ]
    
    ## Add the columns
    cols_to_add = [char_lengths, word_lengths]
    for cols in cols_to_add:
        for col in cols:
            df[col[0]] = run_pool(col[1], col[2])
    
    # Fraction of unique words matched divided by number of unique words in the query (float)
    #df['desc_frac_matched'] = df['desc_matched_string'] / df['query_word_l']
    #df['title_frac_matched'] = df['title_matched_string'] / df['query_word_l']
    #df['attr_frac_matched_query'] = df['attr_matched_string'] / df['query_word_l']
    # Also by the attr length
    #df['attr_frac_matched_attr'] = df['attr_matched_string'] / df['attr_word_l']
    
    
    # Drop NaNs
    copy_df = copy.deepcopy(df.dropna())
    
    print display(HTML("<font color='blue'><b>Data pipelined in %s s</b></font>"%(time.time()-start)))
    return copy_df


In [29]:
X_train = feature_pipeline(f_train)
y_train = X_train['relevance'].values

None


In [30]:
X_train.head()


Unnamed: 0,id,product_uid,product_title,search_term,relevance,product_description,attr,brand,desc_char_l,title_char_l,query_char_l,attr_char_l,brand_char_l,desc_word_l,title_word_l,query_word_l,attr_word_l,brand_word_l
0,2,100001,Simpson Strong-Tie 12-Gauge Angle,angle bracket,3.0,"Not only do angles make joints stronger, they ...",Versatile connector for various 90° connection...,Simpson Strong-Tie,847,33,13,413,18,1,1,1,1,1
1,3,100001,Simpson Strong-Tie 12-Gauge Angle,l bracket,2.5,"Not only do angles make joints stronger, they ...",Versatile connector for various 90° connection...,Simpson Strong-Tie,847,33,9,413,18,1,1,1,1,1
2,9,100002,BEHR Premium Textured DeckOver 1-gal. #SC-141 ...,deck over,3.0,BEHR Premium Textured DECKOVER is an innovativ...,"Brush,Roller,Spray 6.63 in 7.76 in 6.63 in Rev...",BEHR Premium Textured DeckOver,1102,79,9,905,30,1,1,1,1,1
3,16,100005,Delta Vero 1-Handle Shower Only Faucet Trim Ki...,rain shower head,2.33,Update your bathroom with the Delta Vero Singl...,Combo Tub and Shower No Includes the trim kit ...,Delta,694,78,16,616,5,1,1,1,1,1


# 4: Plot the Feature Distributions
As a sanity check, it is good to check out the first few lines of my data frame and also to graph the features to make sure there are actual distributions of them.

In [None]:
# Plot the distribution (histogram) of my features
def plot_hist(col, name):
    fig = plt.figure()
    ax = fig.add_subplot(111)
    
    plt.title('%s' %name, fontsize=15)
    #fig.colorbar(cax)
    plt.hist(col)

    plt.show()

def run_feature_plots(df):
    # Feature plots
    skip_cols = ['id', 'product_uid', 'relevance','search_term', 'brand',
                     'product_title','product_description','attr']
    
    features = list(df.drop(skip_cols, axis=1).columns.values)
    # Plot a bunch of stuff
    dim = len(features)/3 + 1 if len(features)%3 > 0 else len(features)/3

    f, axarr = plt.subplots(dim, 3, figsize=(16,20))
    plt.tight_layout(pad=3)
    
    for i in range(0, dim):
        # For each row
        for j in range(0, 3):
            # For each element in the row
            if (i*3 + j) < len(features):
                # As long as the chart exists in the tuple
                axarr[i][j].hist( df[ features[i*3+j] ], color='orange' )
                axarr[i][j].set_title( features[i*3+j], fontsize=15 )

In [None]:
#run_feature_plots(X_train)

# 5: Learning
A few notes about the distributions:

* The string length columns look to be distributed pretty nicely
* The description matches are heavily favored to the right (meaning the strings match well); we would expect this from a search engine
* The relevance scores are also heavily favored to the right (again, we expect this engine to work reasonably well, so this makes makes sense)

Everything so far looks reasonable. Now I will go ahead and set up a machine learning pipeline to test some algorithms on the training/test data.

## Pipeline Transformers

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Define my custom pipeline
class CustomPipeline(BaseEstimator, TransformerMixin):
    
    def fit(self, x, y=None):
        return self
    
    def transform(self, df):
        drop_cols = ['id', 'product_uid', 'relevance','search_term', 'brand',
                     'product_title','product_description','attr']
        new_df = df.drop(drop_cols, axis=1).values
        return new_df
    
# This is a 
class TextPipeline(BaseEstimator, TransformerMixin):
    
    def __init__(self, key):
        self.key = key
    
    def fit(self, x, y=None):
        return self
    
    def transform(self, data_dict):
        # Convert dict into a string
        return data_dict[self.key].apply(str)

In [None]:
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor as BR
from sklearn import pipeline, grid_search
from sklearn.pipeline import FeatureUnion

## Use a random forest
rfr = RandomForestRegressor(n_jobs=-1, n_estimators=400, max_depth=15, verbose=0)

# Use tf-idf sk-learn functions to vectorize the documents
tfidf = TfidfVectorizer(ngram_range=(1, 1), stop_words='english')
# After tf-idf, reduce dimensionality of the vector
tsvd = TruncatedSVD(n_components=10)

## Define the pipeline
_pipeline = pipeline.Pipeline([
    ('union', FeatureUnion(
        transformer_list = [
            ('cst',  CustomPipeline()),
            ('txt1', pipeline.Pipeline([
                ('s1', TextPipeline(key='search_term')),
                ('tfidf1', tfidf), 
                ('tsvd1', tsvd)
            ])),
            ('txt2', pipeline.Pipeline([
                ('s2', TextPipeline(key='product_title')), 
                ('tfidf2', tfidf),
                ('tsvd2', tsvd)
            ])),
            #('txt3', pipeline.Pipeline([('s3', TextPipeline(key='product_description')), ('tfidf3', tfidf), ('tsvd3', tsvd)])),
            ('txt4', pipeline.Pipeline([
                ('s4', TextPipeline(key='brand')),
                ('tfidf4', tfidf), 
                ('tsvd4', tsvd)
            ]))        
        ],
        transformer_weights = {
            'cst': 1.0,
            'txt1': 0.5,
            'txt2': 0.25,
            #'txt3': 0.05,
            'txt4': 0.5
        },
        #n_jobs = -1
    )), 
    ('rfr', rfr)])



In [None]:
from sklearn.metrics import mean_squared_error, make_scorer

# Define the loss function; this is a custom root-MSE (RMSE) function with tighter errors
def f_mse(y, y_pred):
    return mean_squared_error(y, y_pred)**0.5
RMSE = make_scorer(f_mse, greater_is_better=False)

# Param grid for GridSearch
param_grid = { 'rfr__max_features': [10],'rfr__max_depth': [20] }

# Arguments for GridSearch
grid_search_args = {
    'estimator': _pipeline,
    'param_grid': param_grid,
    #'n_jobs': -1,
    'cv': 4,
    'verbose': 0,
    'scoring': RMSE
}

# Define the model
model = grid_search.GridSearchCV(**grid_search_args)

In [None]:
# RUN GridSearch
start = time.time()
model.fit(X_train, y_train)

print "GridSearchCV completed in %s s"%(time.time()-start)
print "Best parameters found by grid search: %s"%model.best_params_
print "Best CV score: %s"%model.best_score_

# 6: Test Set
Now I will move over to the test set. I will predict based on the model I just generated.

### 6.1 Pipeline Features

In [None]:
df_test = feature_pipeline(f_test)

# Separate the ids
df_test_ids = df_test['id']

# Need to add this column temporarily; it will get dropped in the pipeline
df_test['relevance'] = 1

### 6.2 Predict test_y

In [None]:
# Predict!
test_y = model.predict(df_test)
final_y = map(lambda x: 1 if x < 1. else 3 if x > 3. else x, test_y)

### 6.3 Look at distributions
After looking at the data, I want to look at the distribution and compare it to the one from the test set.

In [None]:
plot_hist(final_y, 'Scores in Test Set')

In [None]:
plot_hist(train_y, 'Normalized Relevance')

In [None]:
print "Test mean: %s, std: %s"%(np.mean(final_y), np.std(final_y))
print "Training mean: %s, std: %s"%(np.mean(train_y), np.std(train_y))

# 7: Submission
Now I am finally ready to write the submission file!

In [None]:
print np.shape(test_y)

In [None]:
file_name = "submission4"

path = "%s/%s.csv"%(DATADIR, file_name)
pd.DataFrame({"id": df_test_ids.apply(int), "relevance": test_y}).to_csv(path, index=False)