# Mercari Price Suggestion Challenge

For the Kaggle competition "Mercari Price Suggestion Challenge," the task was to write code that predicts the prices of listed items. This is my approach, in three steps:

1. EDA
2. Data Engineering
3. Machine Learning

The challenge ended in mid-February 2018, and I placed in the top 60% of more than 2,300 submissions. This was my first Kaggle competition.


## Exploratory Data Analysis

The data is organized as follows:

- train_id or test_id - the id of the listing
- name - the title of the listing
- item_condition_id - the condition of the item, as provided by the seller
- category_name - category of the listing
- brand_name - the brand of the item
- price - the price that the item was sold for, also the **target variable**
- shipping - 1 if shipping fee is paid by seller and 0 by buyer
- item_description - the full description of the item

The natural first step is to look at the distribution of prices.


In [None]:
import pandas as pd
import numpy as np
import string
import seaborn as sns
import matplotlib.pyplot as plt
from collections import Counter
from wordcloud import WordCloud
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import svm
from sklearn import preprocessing
from sklearn.ensemble import RandomForestRegressor

test=pd.read_csv("C:/Users/Malte/Documents/My repositories/Mercari/test.tsv",sep="\t")

train=pd.read_csv("C:/Users/Malte/Documents/My repositories/Mercari/train.tsv",sep="\t")

train.fillna("",inplace=True)
test.fillna("",inplace=True)

In [None]:
fig,(ax) = plt.subplots(1,2,figsize=(10,5))
sns.distplot(train["price"],ax=ax[0],axlabel="Price Distribution")
sns.distplot(np.log1p(train["price"]),ax=ax[1],axlabel="Log1p Price Distribution")

Looking at the price distribution, two observations become clear:

1. The bulk of the prices sit at the low end of the range.
2. After transforming the prices logarithmically, we can see that the distribution is slightly positively skewed.
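
The log transform used in the right-hand plot can be sketched on a handful of made-up prices (the values below are illustrative and not taken from the data). `np.log1p` computes log(1 + x), which stays defined for a price of 0, and `np.expm1` inverts it, so predictions made on the log scale can be mapped back to prices.

```python
import numpy as np

# Hypothetical prices, chosen only to illustrate a right-skewed sample.
prices = np.array([0.0, 3.0, 10.0, 17.0, 29.0, 250.0, 2000.0])

log_prices = np.log1p(prices)    # log(1 + price); defined even when price == 0
restored = np.expm1(log_prices)  # inverse transform: exp(x) - 1

# The transform compresses the long right tail: the largest value sits much
# closer to the median on the log scale than on the raw scale.
raw_ratio = prices.max() / np.median(prices)
log_ratio = log_prices.max() / np.median(log_prices)
print(raw_ratio, log_ratio)
```

Using `log1p` rather than a plain `log` is a small design choice that avoids negative infinity for zero-priced items.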

Let's now look at how prices are affected by whether the seller pays for shipping.

In [None]:
sns.barplot(x="shipping",y="price",data=train,ci=None)

Shipping does seem to have a slight effect on pricing. 
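
A bar plot of means hides how many rows stand behind each bar and is sensitive to outliers. As a rough sketch, on a made-up frame that only borrows the column names `shipping` and `price`, a grouped summary puts counts, means and medians side by side:

```python
import pandas as pd

# Hypothetical rows using the same column names as the training data.
toy = pd.DataFrame({
    "shipping": [0, 0, 0, 1, 1, 1],
    "price":    [30.0, 20.0, 100.0, 10.0, 15.0, 35.0],
})

# Count, mean and median price per shipping flag.
summary = toy.groupby("shipping")["price"].agg(["count", "mean", "median"])
print(summary)
```

The same call with `train` in place of `toy` would give the numbers behind the bar plot; comparing mean and median shows how much a few expensive items pull the average.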

What about the condition of the item?

In [None]:
sns.barplot(x="item_condition_id",y="price",data=train,ci=None)

The item condition does not have the expected effect on prices.

So far we have looked at item prices in a very broad manner. Let's look at categories in greater detail to gain a deeper understanding. The category names come as single, unsplit strings; the first ten look like this:

In [None]:
print(train["category_name"][:10])

In order to work with the categories more effectively, we have to change the data to separate the main category and the two subcategories. 

In [None]:
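def cat_split(row):
    # Expect exactly "main/sub1/sub2"; anything else gets a placeholder label.
    try:
        txt1, txt2, txt3 = row.split('/')
        return txt1, txt2, txt3
    except (ValueError, AttributeError):
        return ("No Label", "No Label", "No Label")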

train["cat_1"], train["cat_2"], train["cat_3"] = zip(*train["category_name"].apply(lambda val: cat_split(val)))
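
As an alternative sketch, the same split can be done with pandas string methods. Note one deliberate difference from `cat_split`: with `n=2`, a path with more than three levels keeps its remaining levels in the third column instead of falling back to "No Label". The sample strings below are made up for illustration.

```python
import pandas as pd

# Hypothetical category strings in the same "main/sub1/sub2" layout.
cats = pd.Series([
    "Men/Tops/T-shirts",
    "Electronics/Computers & Tablets/iPad/Tablet/eBook Readers",
    "",
])

# n=2 limits the split to two cuts, so deeper levels stay in the third column.
parts = cats.str.split("/", n=2, expand=True)
parts.columns = ["cat_1", "cat_2", "cat_3"]

# Rows with fewer than three parts (here the empty string) get the placeholder.
incomplete = parts["cat_3"].isna()
parts.loc[incomplete, :] = "No Label"
print(parts)
```

Whether deeper paths should be kept or treated as unlabeled is a judgment call; the vectorized version mainly saves the row-by-row `apply`.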

We can now look at the categories and price differences in greater detail.

In [None]:
fig,ax=plt.subplots(1,1,figsize=(20,10))
sns.barplot(x="cat_1",y="price",data=train,ax=ax,ci=None)
ax.set(xlabel="Main Category")

There does seem to be quite a variety of prices between the categories.

Since each item is further described by two subcategories, we will look at those as well. For both, we take the ten most frequently occurring labels and display their average price.

In [None]:
def top_x_by_mean_price(df,cat,x):
    most_freq_items=df[cat].value_counts()
    most_freq_items=list(most_freq_items.index[:x])
    if "" in most_freq_items:
        most_freq_items.remove("")
    df_top_10=df[df[cat].isin(most_freq_items)]
    df_top_10=df_top_10.groupby(cat)
    df_top_10_by_price= df_top_10["price"].mean().reset_index()
    df_top_10_by_price.sort_values("price",ascending=False,inplace=True)
    fig,ax=plt.subplots(figsize=(x+10,10))
    ax.set_title("Top {} {} by occurrence, sorted by mean article value".format(str(x),cat))
    sns.barplot(x=cat,y="price",data=df_top_10_by_price,ax=ax,ci=None)
    
top_x_by_mean_price(train,"cat_2",10)


In [None]:
top_x_by_mean_price(train,"cat_3",10)


Categories definitely play a role, which is why we need to prepare them for our machine learning algorithm.

In [None]:
def categorizer(df,col):
    df[col]=df[col].astype("category")
    df[col]=df[col].cat.codes
    return df

for e in ["brand_name","cat_1","cat_2","cat_3"]:
    categorizer(train,e)
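
One caveat worth flagging: `cat.codes` numbers the labels in the sorted order of whatever values are present, so encoding `train` and `test` separately can assign different integers to the same label. A possible way around this, sketched on made-up frames, is to fix the category list from both frames first. The helper name `shared_codes` and the toy data are hypothetical.

```python
import pandas as pd

# Hypothetical frames that share a label ("Nike") but not the full label set.
train_toy = pd.DataFrame({"brand_name": ["Nike", "Apple", "Nike"]})
test_toy = pd.DataFrame({"brand_name": ["Sony", "Nike"]})

def shared_codes(train_df, test_df, col):
    # Build one category list from both frames so the codes agree between them.
    categories = sorted(set(train_df[col]) | set(test_df[col]))
    dtype = pd.CategoricalDtype(categories=categories)
    train_codes = train_df[col].astype(dtype).cat.codes
    test_codes = test_df[col].astype(dtype).cat.codes
    return train_codes, test_codes

train_codes, test_codes = shared_codes(train_toy, test_toy, "brand_name")
print(list(train_codes), list(test_codes))
```

Here "Nike" maps to the same integer in both frames, which would not be the case if each frame were encoded on its own.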

These kinds of platforms also rely on descriptions of their products. Let's first see whether having a description makes a difference in price.

In [None]:
def described(x):
    # Listings without a description carry a placeholder containing "description yet".
    if "description yet" in x:
        return 0
    else:
        return 1
    
train["has_description"]=train["item_description"].apply(lambda x: described(x))

plt.figure(figsize=(20,15))
plt.hist(train["price"].loc[train["has_description"]==1],label="Has Description",bins=60,color="blue",alpha=0.6,range=[0,250])
plt.hist(train["price"].loc[train["has_description"]==0],label="Does not have Description",bins=60,alpha=0.6,range=[0,250])
plt.legend()
plt.show()

The distributions are similar in shape; the difference in height simply reflects how many items fall into each group. Let's look at the lengths of the descriptions now.

In [None]:
train["desc_length"]=train["item_description"].apply(lambda x: len(x))

In [None]:
# Newer seaborn versions require keyword x/y and use `height` instead of `size`.
sns.lmplot(x="price",y="desc_length",data=train,fit_reg=True,scatter_kws={'alpha':0.3},aspect=1,height=15)

There seems to be a minimal correlation at best. 
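
To put a number on that impression, one option is to compute a correlation coefficient alongside the plot. The sketch below uses made-up values under the same column names; Pearson measures linear association, and correlating the ranks gives Spearman's coefficient, which is less sensitive to a few very expensive items.

```python
import pandas as pd

# Hypothetical prices and description lengths, for illustration only.
toy = pd.DataFrame({
    "price":       [10.0, 12.0, 8.0, 50.0, 15.0, 9.0],
    "desc_length": [120, 15, 300, 80, 18, 240],
})

# Pearson correlation on the raw values.
pearson = toy["price"].corr(toy["desc_length"])

# Spearman correlation, computed as the Pearson correlation of the ranks.
spearman = toy["price"].rank().corr(toy["desc_length"].rank())
print(pearson, spearman)
```

Values close to 0 would support the reading of the scatter plot; values near 1 or -1 would contradict it.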

One more thing to examine is the term frequency-inverse document frequency (TF-IDF), which weights how often a word appears in a description by how rare that word is across all descriptions. Put differently, it highlights words that are frequent in one description but uncommon overall. Let's start with a word cloud first.
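
As a small illustration of the idea, the sketch below fits scikit-learn's `TfidfVectorizer` with its default settings on three made-up descriptions. A word that occurs in every description ("new") ends up with a lower inverse document frequency than a word that occurs in only one ("vintage").

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical item descriptions.
docs = [
    "new shirt never worn",
    "new shoes size 9",
    "new vintage leather jacket",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # sparse matrix, one row per description

# vocabulary_ maps each term to its column index; idf_ holds the learned
# inverse document frequency for each column.
idf = {term: vectorizer.idf_[i] for term, i in vectorizer.vocabulary_.items()}
print(idf["new"], idf["vintage"])
```

With the default tokenizer, single-character tokens such as "9" are dropped, so they do not appear in the vocabulary.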