In [48]:
%cd "C:\Users\andrewmauro\Desktop\springboard\Project Excercises\Kaggle - Mercari Price Suggestion"

C:\Users\andrewmauro\Desktop\springboard\Project Excercises\Kaggle - Mercari Price Suggestion


Data Ingestion, Wrangling, and Manipulation

Purpose: 

To prepare the Mercari Price dataset for visualization by cleaning the text columns, extracting additional information from them, and calculating average prices grouped by categorical features. In addition, natural language processing (NLP) will be performed to obtain additional features for prediction.

Procedures

The following data wrangling steps were performed:

1. The 'category_name' field was split into up to five levels of category names, with each level increasing in specificity.
2. NULL values in the text columns were replaced with 'unknown' (or 'no description yet' for item descriptions).
3. Average prices were calculated, grouping by each categorical variable (with the exception of item description).
4. Natural language processing was performed on the 'item_description' field. The text was normalized by lemmatizing all words in each item description and removing stop words. The following additional features were derived for prediction:
(a) Bag-of-words counts
(b) Term frequency-inverse document frequency (TF-IDF) weights


Section 1 - Data Ingestion

The code below imports the training set and draws a random sample of 1,000 rows, since the full data set is very large.

In [49]:
#Section 1.1 - Import packages and Dataset
import pandas as pd
import numpy as np

#text analytics
#regular expressions
import re

price = pd.read_table(filepath_or_buffer = 'train.tsv', sep = '\t', index_col = 'train_id')

#check data set size
#mb = 329893 / 1024
#print(mb) #322 MB
#We noted that our data set is very large

#create sample set
price = price.sample(n = 1000)




In [50]:
price.shape


(1000, 7)

In [51]:

price.head(10)

train_id,name,item_condition_id,category_name,brand_name,price,shipping,item_description
890397,Girls Dress Up Boots Toddler's 9,1,Kids/Girls 2T-5T/Shoes,,14.0,0,Adorable girls dress up boots. Toddlers Size 9...
33944,Short booties,3,Women/Shoes/Boots,,12.0,0,No description yet
419683,Laura Mercier Medium Deep Setting Powder,1,Beauty/Makeup/Face,Laura Mercier,32.0,1,New in the box. Authentic. Laura Mercier Trans...
84819,New Adidas jogger pants,1,Men/Sweats & Hoodies/Track & Sweat Pants,Adidas,36.0,0,"Brand new with tags , a small was too small fo..."
1001447,Rebecca Minkoff Satchel -vmena,2,Women/Women's Handbags/Satchel,Rebecca Minkoff,220.0,1,Touches of whipstitch trim and edgy 3D studs s...
677544,iPhone 6 6s glitter bling phone case,1,"Electronics/Cell Phones & Accessories/Cases, C...",,9.0,0,Brand-new iPhone 6s/6 case. It has a back cove...
1241517,Cosmic cleanse scent sations mix lot,1,Home/Home Décor/Home Fragrance,,28.0,1,New Latest pour date 9/27/16 Hoodrat things Ho...
352442,BBW Beautiful Day refill bulb,1,Home/Home Décor/Home Fragrance,Bath & Body Works,10.0,0,Bath & Body Works Wallflowers Fragrance Refill...
633256,Charlotte Russe Flats,3,Women/Shoes/Flats,,16.0,0,Worn only once size 6 Charlotte Russe flats
1444189,Men's American Eagle jeans,3,"Men/Jeans/Classic, Straight Leg",American Eagle,23.0,0,Distressed size 31x32. Original straight is th...


Section 2 - Data Wrangling, Manipulation, and Natural Language Processing

The code below creates additional features from the training set for visualization and predictive modeling. All text columns are converted to lower case, and the category name column is split into up to five category levels, with each successive level offering greater specificity as to the item type.

Additionally, null values in our text columns will be replaced with "unknown" or, in the case of the item description column, "no description yet."

We will also calculate mean prices for categorical variables.
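
The mean-price step is described here but is not shown in the cells that follow. As a minimal sketch, assuming the column names `meanBrand`, `countBrand` and `meanLevelOne` (borrowed from the commented-out visualization export) and a small made-up DataFrame called `toy`, `groupby(...).transform` could attach each group's average price to every row:

```python
import pandas as pd

# Toy data standing in for the Mercari sample; all values are invented for illustration.
toy = pd.DataFrame({
    'brand_name': ['adidas', 'adidas', 'unknown', 'unknown', 'unknown'],
    'catOne': ['men', 'men', 'women', 'women', 'home'],
    'price': [36.0, 24.0, 12.0, 16.0, 28.0],
})

# transform() returns one value per original row, so the result aligns with toy's index.
toy['meanBrand'] = toy.groupby('brand_name')['price'].transform('mean')
toy['countBrand'] = toy.groupby('brand_name')['price'].transform('count')
toy['meanLevelOne'] = toy.groupby('catOne')['price'].transform('mean')

print(toy)
```

Because these averages are built from the target itself, they would probably need to be computed on the training rows only and then joined to any holdout set, so that holdout prices do not leak into the predictors.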

In [52]:
#Section 2.1 - Data Wrangling and Manipulation - Split category name column, replace NULLs and then
##create dummy variables from a subset of predictors

#switch all text columns to lower
for col in ['name', 'category_name', 'brand_name', 'item_description']:
    price[col] = price[col].str.lower()

#split category name column into up to five levels
#(reindex pads to five columns in case the sample contains no five-level category)
catCols = ['catOne', 'catTwo', 'catThree', 'catFour', 'catFive']
catLevels = price['category_name'].str.split('/', expand = True).reindex(columns = range(5))
for i, col in enumerate(catCols):
    price[col] = catLevels[i]

#replace empty brand names and category names with 'unknown'
#(fillna avoids the chained assignment that triggers SettingWithCopyWarning)
price['brand_name'] = price['brand_name'].fillna('unknown')
price[catCols] = price[catCols].fillna('unknown')
price['item_description'] = price['item_description'].fillna('no description yet')


#pull out item descriptions for NLP analysis
description = price.loc[:, ['item_description']].copy()
description['lengthDescription'] = description['item_description'].str.len()

#create dummy variables with subset of predictors
price = price.loc[:, ['item_condition_id', 'brand_name', 'shipping', 'catOne', 'catTwo', 'price']]

price = pd.get_dummies(price, drop_first = True)

price = pd.merge(description, price, left_index = True, right_index = True)



#fields for visualization
#price.loc[:, ['item_condition_id', 'brand_name', 'shipping', 'catOne', 'catTwo', 'meanBrand', 'countBrand', 'meanLevelOne', 'countLevelOne', 'meanLevelTwo', 'countLevelTwo', 'price']].to_csv('trainViz.csv')

#select sample for analysis below
#price = pd.DataFrame.sample(price, n = 300)


In [53]:
price.head()

train_id,item_description,lengthDescription,item_condition_id,shipping,price,brand_name_adidas,brand_name_adidas originals,brand_name_aerie,brand_name_air jordan,brand_name_alo yoga,...,catTwo_tops & blouses,catTwo_toy,catTwo_toys,"catTwo_tv, audio & surveillance",catTwo_underwear,catTwo_unknown,catTwo_video games & consoles,catTwo_weddings,catTwo_women's accessories,catTwo_women's handbags
890397,adorable girls dress up boots. toddlers size 9...,159,1,0,14.0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
33944,no description yet,18,3,0,12.0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
419683,new in the box. authentic. laura mercier trans...,96,1,1,32.0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
84819,"brand new with tags , a small was too small fo...",114,1,0,36.0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1001447,touches of whipstitch trim and edgy 3d studs s...,355,2,1,220.0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


Section 3 - Natural Language Processing Pre-processing Steps

We will adjust the item description column to:

(1) Include only alphabetic characters from our text
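(2) Lemmatize all words (reduce each word to its dictionary form so that variants such as "boots" and "boot" are grouped together). The lemmatizer is called without part-of-speech tags, so every token is treated as a noun.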
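
To make step (1) concrete, here is a small standalone sketch using only the standard library. The helper name `keep_alphabetic` is hypothetical, and the whitespace tidy-up is an addition for readability rather than part of the notebook's own cleaning step. It shows why digits and punctuation vanish from the cleaned descriptions:

```python
import re

def keep_alphabetic(text):
    """Drop every character that is not a letter or whitespace, then tidy spacing."""
    letters_only = re.sub(r'[^a-zA-Z\s]+', '', text)
    # Collapse the runs of spaces left behind by removed tokens.
    return ' '.join(letters_only.split())

sample = 'brand-new iphone 6s/6 case. it has a back cover'
print(keep_alphabetic(sample))  # brandnew iphone s case it has a back cover
```

Note that hyphenated words are fused ("brand-new" becomes "brandnew") and model numbers such as "6s/6" shrink to a stray "s", which may remove information that is relevant to price.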

In [54]:
#Section 3.1 - import packages
import nltk
#the WordNet lemmatizer needs the WordNet corpus; if it is missing, run nltk.download('wordnet') once

#include only alphabetic characters
price['item_description'] = price['item_description'].str.replace(r'[^a-zA-Z\s]+', '', regex = True)

#lemmatize

#define lemmatizer function
toke = nltk.tokenize.WhitespaceTokenizer()
wnl = nltk.stem.wordnet.WordNetLemmatizer()
def lemmatize_text(text):
    #split on whitespace, lemmatize each token (as a noun by default) and re-join
    return ' '.join(wnl.lemmatize(w) for w in toke.tokenize(text))

price['item_description'] = price['item_description'].apply(lemmatize_text)

#review
#print(price[['item_description']].head())


In [55]:
print(price.item_description.values)

[ 'adorable girl dress up boot toddler size white and black net top with bow tie and a black suede bottom fully lined with faux fur to keep those toe warm'
 'no description yet'
 'new in the box authentic laura mercier translucent medium deep setting powder retail for rm'
 'brand new with tag a small wa too small for me white pant with black stripe not taking low ball offer'
 'touch of whipstitch trim and edgy d stud set rebecca minkoffs roadready regan satchel apart trendright suede softens the motoinspired silhouette open to offer suede double handle detachable adjustable crossbody strap exterior zip pocket three interior slip pocket interior zip pocket l x w x h handle drop strap drop'
 'brandnew iphone s case it ha a back cover shining protective bumper and is super cute'
 'new latest pour date hoodrat thing hoodrat thing cut from my monster loaf so i could share the amazingness approx oz oz scent shot blackout babe witch can dream too ecstasy on the rock party with heidi flower ta

Section 3.2

NLP Feature #1 - Bag of Words Vectors

Our first NLP feature is a "bag of words" matrix: each distinct word in the item_description column is assigned a column, and each row holds the number of times each word occurs in the corresponding description. The occurrence of particular words can then be used as a predictor of item price.

Fit/Transform

We do this through the fit_transform method, which calls the vectorizer's fit and transform methods in turn. Fit learns the vocabulary from the data, mapping each word to a column index. Transform then applies that mapping, converting each description into a vector of word counts. This is similar to pre-processing, but with a specific use case in mind.
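
The split between fit and transform can be illustrated with a pure-Python sketch. This is not scikit-learn's implementation (CountVectorizer also lowercases, tokenizes with a regular expression and can drop stop words); the function names `fit_vocabulary` and `transform_counts` and the two toy documents are made up for illustration:

```python
from collections import Counter

def fit_vocabulary(documents):
    """'Fit': learn a word -> column index mapping from the corpus."""
    words = sorted({word for doc in documents for word in doc.split()})
    return {word: idx for idx, word in enumerate(words)}

def transform_counts(documents, vocabulary):
    """'Transform': turn each document into a row of word counts."""
    rows = []
    for doc in documents:
        counts = Counter(doc.split())
        row = [0] * len(vocabulary)
        for word, n in counts.items():
            if word in vocabulary:  # words unseen during fit are ignored
                row[vocabulary[word]] = n
        rows.append(row)
    return rows

docs = ['new black boot', 'black suede boot black']
vocab = fit_vocabulary(docs)
matrix = transform_counts(docs, vocab)
print(vocab)   # {'black': 0, 'boot': 1, 'new': 2, 'suede': 3}
print(matrix)  # [[1, 1, 1, 0], [2, 1, 0, 1]]
```

Keeping the two steps separate matters once a holdout set exists: the vocabulary would be fitted on the training descriptions only, and the same mapping would then be used to transform the holdout descriptions.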

In [56]:
#Section 3.2 - Natural Language Processing - Bag of Words

#import packages
from sklearn.feature_extraction.text import CountVectorizer

#create word count vectorizer - test with full data set, and then use a sample
countVectorizer = CountVectorizer(stop_words = 'english')

countTrain = countVectorizer.fit_transform(price.item_description.values)


In [57]:
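#Obtain the first ten features of the count vector
#countVectorizer.get_feature_names_out()[:10]

#toarray() densifies the sparse matrix; get_feature_names_out() replaces get_feature_names(), which newer scikit-learn releases removed
countDf = pd.DataFrame(countTrain.toarray(), columns = countVectorizer.get_feature_names_out())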

#print(countDf.shape)
print(countDf)

     aai  ab  abc  abcbuybnib  ability  abrasion  abrasive  absolutely  \
0      0   0    0           0        0         0         0           0   
1      0   0    0           0        0         0         0           0   
2      0   0    0           0        0         0         0           0   
3      0   0    0           0        0         0         0           0   
4      0   0    0           0        0         0         0           0   
5      0   0    0           0        0         0         0           0   
6      0   0    0           0        0         0         0           0   
7      0   0    0           0        0         0         0           0   
8      0   0    0           0        0         0         0           0   
9      0   0    0           0        0         0         0           0   
10     0   0    0           0        0         0         0           0   
11     0   0    0           0        0         0         0           0   
12     0   0    0           0        0

Section 3.3

NLP Feature #2 - TF-IDF Feature

A metric that weighs a term's frequency within a given item description against how common that term is across all item descriptions may be predictive of the item's price: words that are frequent in one description but rare overall receive the highest weights. We will use the TfidfVectorizer class from the sklearn library to derive this feature for our analysis.
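
As a rough worked example of the weighting, the sketch below computes TF-IDF by hand for three invented descriptions. It assumes the smoothed formula that scikit-learn documents as its default, idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row, and it skips stop-word removal and the max_df filter. The function name `tfidf_rows` is hypothetical:

```python
import math
from collections import Counter

def tfidf_rows(documents):
    """Return (vocabulary, rows) of L2-normalised TF-IDF weights."""
    tokenised = [doc.split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenised for word in tokens})
    n_docs = len(documents)
    # Document frequency: number of documents that contain each word.
    doc_freq = Counter(word for tokens in tokenised for word in set(tokens))
    # Smoothed inverse document frequency (assumed scikit-learn default).
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocabulary}

    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights = [counts[w] * idf[w] for w in vocabulary]
        norm = math.sqrt(sum(x * x for x in weights))
        rows.append([x / norm if norm else 0.0 for x in weights])
    return vocabulary, rows

docs = ['new boot', 'new zip pocket', 'new boot zip']
vocab, rows = tfidf_rows(docs)
for doc, row in zip(docs, rows):
    print(doc, [round(x, 3) for x in row])
```

In the second description, "pocket" appears in only one document and so outweighs "zip" (two documents), which in turn outweighs "new" (all three documents), even though each word occurs once.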



In [58]:
#Section 3.3 - NLP TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize a TfidfVectorizer object: tfidfVectorizer
# max_df = 0.7 ignores terms that appear in more than 70% of the descriptions
tfidfVectorizer = TfidfVectorizer(stop_words = "english", max_df = 0.7)

# Transform the training data: tfidfTrain
tfidfTrain = tfidfVectorizer.fit_transform(price.item_description.values)

# Print the first 10 features
print(tfidfVectorizer.get_feature_names_out()[:10].tolist())

# Print the first 5 vectors of the tfidf training data
print(tfidfTrain.toarray()[:5])


['aai', 'ab', 'abc', 'abcbuybnib', 'ability', 'abrasion', 'abrasive', 'absolutely', 'acccomidate', 'accelerate']
[[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]


In [59]:
tfidfDF = pd.DataFrame(tfidfTrain.toarray(), columns = tfidfVectorizer.get_feature_names_out())
#print(tfidfDF)

tfidfDF

Unnamed: 0,aai,ab,abc,abcbuybnib,ability,abrasion,abrasive,absolutely,acccomidate,accelerate,...,zales,zara,zero,zest,zia,zip,zipper,zippered,zipup,zirconia
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.239717,0.000000,0.000000,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0


Section 4 - Conclusion of Analysis and Saving of Wrangled Datasets

In [60]:
#Section 4 - Save a local copy of data frames for prediction
#from sklearn.model_selection import train_test_split

#create train set and holdout set
#X_train, X_test, y_train, y_test = train_test_split(price.drop(columns = 'price'), price['price'], test_size = 0.25, random_state = 56)

#write to csv
#price.to_csv('priceWrangle.csv')


#X_train.to_csv('trainPredictors.csv')
#y_train.to_csv('trainOutcomes.csv')
#X_test.to_csv('testPredictors.csv')
#y_test.to_csv('testOutcomes.csv')

Conclusion: We have successfully wrangled and manipulated our data frame of prices. We have additional predictor variables related to high-level item categories, high-level item category average prices, and item description length. We have also applied natural language processing to the item description column to obtain word-count and TF-IDF (term frequency relative to document frequency) predictors.