## Rapidly Prototyping a Machine Learning Pipeline

The purpose of this workshop is to show how easy it is to take an idea and turn it into a successful Machine Learning application.

### Part 1: The idea/loading data
In line with Tuesday's panel around "autocoders," we'll demonstrate how to use text descriptions to predict codes. Along the way, we'll show some good machine learning practices. While this notebook focuses on coding products using descriptions, the approach should work for many scenarios where a dataset has both text descriptions and their associated codes.

#### The data
This data comes from 

Specifically, this is the "concordance file" for the Harmonized System's import/export codes. This gives us the code and a description of what products fit into that code: perfect for the sort of automatic coding we want to do. We've included the file here (in the `data/` subfolder) for ease-of-use.

The import and export codes are slightly different at the 10-digit level; however, the codes are hierarchical and at the 6-digit level they are the same (as defined by an international standards group). We'll model at an even less granular level than that, the 4-digit level, based on the amount of data that we have.

First, let's take a look in a text editor. Jupyter has one that's enough for this.

From here, taking a look at the `imp-stru.txt` file will give us the "schema."

Now, we'll use a combination of python and pandas to ingest and clean this data.

In [10]:
import pandas as pd
pd.options.display.max_colwidth = 0

import matplotlib
%matplotlib inline

# cap the number of worker processes: use at most 8, leave one core free on
# smaller machines, and never go below 1 (n_jobs=0 is not a valid setting)
import multiprocessing
NCPUS = max(1, min(8, multiprocessing.cpu_count() - 1))

In [2]:
!pip install nltk
import nltk
nltk.download('stopwords')

[nltk_data] Error loading stopwords: <urlopen error [Errno -3]
[nltk_data]     Temporary failure in name resolution>


False

Now that we have a file of descriptions and codes, let's load it into a data frame and start processing/looking around.

In [11]:
# the file is tab-separated and has no header row, so pandas numbers the columns 0 and 1.
# We'll give them proper names in the next cell.
df = pd.read_csv("data/rdc-catalog-train.tsv", sep='\t', header=None)
df.head(2)

Unnamed: 0,0,1
0,Replacement Viewsonic VG710 LCD Monitor 48Watt AC Adapter 12V 4A,3292>114>1231
1,HP COMPAQ Pavilion DV6-1410EZ 4400mAh 48Wh 6 Cell Li-ion 10.8V Black Compatible Battery,3292>1370>4767>3975>1420


In [12]:
df = pd.read_csv('data/rdc-catalog-train.tsv', header=None, sep='\t', names=['desc', 'code'])
print("df has", len(df), "rows")
print("df has", df['code'].nunique(), 'unique product codes')
df.head()

df has 800000 rows
df has 3008 unique product codes


Unnamed: 0,desc,code
0,Replacement Viewsonic VG710 LCD Monitor 48Watt AC Adapter 12V 4A,3292>114>1231
1,HP COMPAQ Pavilion DV6-1410EZ 4400mAh 48Wh 6 Cell Li-ion 10.8V Black Compatible Battery,3292>1370>4767>3975>1420
2,Bonjour,2296>3597>2989
3,Two Pack 6V 12Ah Eaton POWERRITE PRO II 2400 6V 12Ah UPS Replacement Battery - SPS BRAND,3292>114>1231
4,Generations Small Side Table White,4015>3636>1319>1409>3606


Great. We have 800,000 rows and 3,008 unique product codes. Some descriptions are likely to appear more than once with the same code, and from the perspective of the model, 2 copies of the same description/code pair are no better than 1, so we'll want to retain only one copy of those. Additionally, we need to standardize and process this text, which will probably leave us with more duplicate text strings as well. Let's start processing and find out.
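To see why cleaning tends to create additional duplicates, here is a minimal sketch using only the standard library. The records and the `normalize` helper are hypothetical stand-ins, not the notebook's actual data or cleaning steps.

```python
import re

# Hypothetical toy records: (description, code) pairs, not taken from the dataset.
records = [
    ("Small Side Table - White", "4015>3636"),
    ("small side table white", "4015>3636"),
    ("Small Side Table - White", "4015>3636"),
    ("LCD Monitor AC Adapter", "3292>114"),
]

def normalize(desc):
    # lowercase and keep only word-like tokens, joined by single spaces
    return " ".join(re.findall(r"\w+", desc.lower()))

# exact duplicates collapse when we put the pairs in a set
exact_unique = set(records)
# after normalizing, differently formatted copies collapse as well
normalized_unique = {(normalize(desc), code) for desc, code in records}

print(len(records), "records,",
      len(exact_unique), "unique before cleaning,",
      len(normalized_unique), "unique after cleaning")
```

The second count is smaller because descriptions that differ only in case or punctuation become identical once cleaned.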

### Part 2: Cleaning / processing the data

There are two main libraries for cleaning and processing text data in Python:
- `spaCy`
- `nltk`

In this case, we'll use nltk, but spaCy is equally good.

Additionally, depending on the type of text that you have, there are many different ways to process and "extract features" (i.e. create variables for modelling) from that text. For example, if you're working with phrases/sentences, spaCy has good tools for determining which part-of-speech each word in a sentence maps to.

In our case, we have very simple product descriptions (perhaps closer to "tags" than sentences), so less processing is required. We'll start out by using regular expressions. This is a good tutorial to learn more about those: https://www.datacamp.com/community/tutorials/python-regular-expression-tutorial

In [13]:
# we'll do this step-by-step, to illustrate
df['desc_stripped'] = df['desc'].str.strip()

# this is pulling 3 random rows, instead of the top rows, just for some variety
# note the new column
df.sample(3)

Unnamed: 0,desc,code,desc_stripped
441385,Unique Bargains Ladies Girls Hairdressing Stretchy Hair Bands Ropes Ties 50 Pcs Black,3625>3641>1599>2026,Unique Bargains Ladies Girls Hairdressing Stretchy Hair Bands Ropes Ties 50 Pcs Black
516513,Vince Camuto Kamaye Womens Suede Platforms & Wedges,1608>1206>1632>4904,Vince Camuto Kamaye Womens Suede Platforms & Wedges
552436,MightySkins Protective Vinyl Skin Decal for Pelican Tumbler 32 oz wrap cover sticker skins Pink Roses,3292>3581>3145>2201,MightySkins Protective Vinyl Skin Decal for Pelican Tumbler 32 oz wrap cover sticker skins Pink Roses


In [14]:
df['desc_lower'] = df['desc_stripped'].str.lower()

df.head(2)

Unnamed: 0,desc,code,desc_stripped,desc_lower
0,Replacement Viewsonic VG710 LCD Monitor 48Watt AC Adapter 12V 4A,3292>114>1231,Replacement Viewsonic VG710 LCD Monitor 48Watt AC Adapter 12V 4A,replacement viewsonic vg710 lcd monitor 48watt ac adapter 12v 4a
1,HP COMPAQ Pavilion DV6-1410EZ 4400mAh 48Wh 6 Cell Li-ion 10.8V Black Compatible Battery,3292>1370>4767>3975>1420,HP COMPAQ Pavilion DV6-1410EZ 4400mAh 48Wh 6 Cell Li-ion 10.8V Black Compatible Battery,hp compaq pavilion dv6-1410ez 4400mah 48wh 6 cell li-ion 10.8v black compatible battery


In [15]:
import re
# this is in general one of the more useful regular expressions to know
# the '\w' searches for word-like tokens, and the '+' says "one or more"
# combined, this gets us 'one or more characters', i.e. word tokens w/o commas, spaces, etc.
WORD_REGEX = r'\w+'

def find_words(desc):
    return re.findall(WORD_REGEX, desc)
df['desc_word_list'] = df['desc_lower'].apply(find_words)
df['desc_words_only'] = df['desc_word_list'].str.join(' ')

df.head()

Unnamed: 0,desc,code,desc_stripped,desc_lower,desc_word_list,desc_words_only
0,Replacement Viewsonic VG710 LCD Monitor 48Watt AC Adapter 12V 4A,3292>114>1231,Replacement Viewsonic VG710 LCD Monitor 48Watt AC Adapter 12V 4A,replacement viewsonic vg710 lcd monitor 48watt ac adapter 12v 4a,"[replacement, viewsonic, vg710, lcd, monitor, 48watt, ac, adapter, 12v, 4a]",replacement viewsonic vg710 lcd monitor 48watt ac adapter 12v 4a
1,HP COMPAQ Pavilion DV6-1410EZ 4400mAh 48Wh 6 Cell Li-ion 10.8V Black Compatible Battery,3292>1370>4767>3975>1420,HP COMPAQ Pavilion DV6-1410EZ 4400mAh 48Wh 6 Cell Li-ion 10.8V Black Compatible Battery,hp compaq pavilion dv6-1410ez 4400mah 48wh 6 cell li-ion 10.8v black compatible battery,"[hp, compaq, pavilion, dv6, 1410ez, 4400mah, 48wh, 6, cell, li, ion, 10, 8v, black, compatible, battery]",hp compaq pavilion dv6 1410ez 4400mah 48wh 6 cell li ion 10 8v black compatible battery
2,Bonjour,2296>3597>2989,Bonjour,bonjour,[bonjour],bonjour
3,Two Pack 6V 12Ah Eaton POWERRITE PRO II 2400 6V 12Ah UPS Replacement Battery - SPS BRAND,3292>114>1231,Two Pack 6V 12Ah Eaton POWERRITE PRO II 2400 6V 12Ah UPS Replacement Battery - SPS BRAND,two pack 6v 12ah eaton powerrite pro ii 2400 6v 12ah ups replacement battery - sps brand,"[two, pack, 6v, 12ah, eaton, powerrite, pro, ii, 2400, 6v, 12ah, ups, replacement, battery, sps, brand]",two pack 6v 12ah eaton powerrite pro ii 2400 6v 12ah ups replacement battery sps brand
4,Generations Small Side Table White,4015>3636>1319>1409>3606,Generations Small Side Table White,generations small side table white,"[generations, small, side, table, white]",generations small side table white
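As a quick check of what `\w+` keeps and drops, the sketch below runs the same pattern on a few short strings. The `tokenize` name is a hypothetical stand-in for `find_words` above.

```python
import re

WORD_REGEX = r"\w+"

def tokenize(desc):
    # \w matches letters, digits and the underscore; anything else separates tokens
    return re.findall(WORD_REGEX, desc)

# hyphens and decimal points split tokens, as in row 1 of the table above
print(tokenize("li-ion 10.8v"))        # ['li', 'ion', '10', '8v']
# symbols such as '&' vanish entirely
print(tokenize("platforms & wedges"))  # ['platforms', 'wedges']
# the underscore counts as a word character, so it is kept
print(tokenize("snake_case stays"))    # ['snake_case', 'stays']
```

Note that splitting "10.8v" into "10" and "8v" loses the fact that it was a single measurement. For this quick prototype that is acceptable, but it is the kind of detail worth revisiting if voltages or sizes turn out to matter.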


To clean things up a bit, let's get rid of some of these intermediate columns.

In practice, this is a useful thing to do because it will free up memory. If you find your code is using too much memory or is running very slowly, consider deleting unnecessary columns in this manner.

In [16]:
# running this cell more than once will cause an error
# because when you try to delete something that has
# already been deleted, it's no longer there
del df['desc_stripped']
del df['desc_lower']
del df['desc_word_list']

df.head()

Unnamed: 0,desc,code,desc_words_only
0,Replacement Viewsonic VG710 LCD Monitor 48Watt AC Adapter 12V 4A,3292>114>1231,replacement viewsonic vg710 lcd monitor 48watt ac adapter 12v 4a
1,HP COMPAQ Pavilion DV6-1410EZ 4400mAh 48Wh 6 Cell Li-ion 10.8V Black Compatible Battery,3292>1370>4767>3975>1420,hp compaq pavilion dv6 1410ez 4400mah 48wh 6 cell li ion 10 8v black compatible battery
2,Bonjour,2296>3597>2989,bonjour
3,Two Pack 6V 12Ah Eaton POWERRITE PRO II 2400 6V 12Ah UPS Replacement Battery - SPS BRAND,3292>114>1231,two pack 6v 12ah eaton powerrite pro ii 2400 6v 12ah ups replacement battery sps brand
4,Generations Small Side Table White,4015>3636>1319>1409>3606,generations small side table white


### Part 3: Initial Model
Now that we've gotten English words only, we can try a simple model. Let's get into the modelling approach and the package that we'll use to implement it, `scikit-learn`.

We'll be implementing a _bag-of-words_ model. The idea is very simple: each word becomes a separate variable. For each variable, the value is the number of times that word occurs in that particular record. Let's demonstrate with a quick example.

In [17]:
# let's try this process on the first few and see what we get back
first_few_only = df.head(2)
# look at the final results, for reference
first_few_only['desc_words_only']

0    replacement viewsonic vg710 lcd monitor 48watt ac adapter 12v 4a                       
1    hp compaq pavilion dv6 1410ez 4400mah 48wh 6 cell li ion 10 8v black compatible battery
Name: desc_words_only, dtype: object

In [18]:
# scikit-learn calls this process "vectorizing", i.e. turning a sentence into a vector of variables.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

cv = CountVectorizer()
# this will actually convert our first few descriptions into vectors
tfd = cv.fit_transform(first_few_only['desc_words_only'])
# by default, it's a sparse matrix
tfd.toarray()

array([[0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
        1, 1, 1],
       [1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1,
        0, 0, 0]], dtype=int64)

For explainability, let's cleanly present this mapping:

In [19]:
# don't worry about this, it's for pedagogical purposes
columns = [x[0] for x in sorted(list(cv.vocabulary_.items()), key=lambda x: x[1])]
pd.DataFrame(tfd.toarray(), columns=columns, index=first_few_only['desc_words_only'])

Unnamed: 0_level_0,10,12v,1410ez,4400mah,48watt,48wh,4a,8v,ac,adapter,...,dv6,hp,ion,lcd,li,monitor,pavilion,replacement,vg710,viewsonic
desc_words_only,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
replacement viewsonic vg710 lcd monitor 48watt ac adapter 12v 4a,0,1,0,0,1,0,1,0,1,1,...,0,0,0,1,0,1,0,1,1,1
hp compaq pavilion dv6 1410ez 4400mah 48wh 6 cell li ion 10 8v black compatible battery,1,0,1,1,0,1,0,1,0,0,...,1,1,1,0,1,0,1,0,0,0
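For intuition, the same kind of count matrix can be built by hand with the standard library. This is a minimal sketch on hypothetical, shortened descriptions, not a reimplementation of `CountVectorizer`. One difference is that this sketch splits on whitespace and keeps every token, whereas `CountVectorizer` by default ignores single-character tokens, which is why "6" has no column above.

```python
from collections import Counter

# Hypothetical, shortened descriptions (already cleaned)
docs = [
    "two pack 6v 12ah battery 6v 12ah",
    "replacement battery pack",
]

# vocabulary: every distinct token, sorted so each one gets a stable column index
vocabulary = sorted({tok for doc in docs for tok in doc.split()})

def to_vector(doc):
    # count each token in this record, then read the counts off in vocabulary order
    counts = Counter(doc.split())
    return [counts[tok] for tok in vocabulary]

vectors = [to_vector(doc) for doc in docs]

print(vocabulary)
for vec_row in vectors:
    print(vec_row)
```

Repeated words show up as counts greater than 1, and words absent from a record show up as 0, which is why the real matrix is stored in sparse form.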


Now that we get the idea, let's do this for the entire dataset and see what it looks like.

In [20]:
cv = CountVectorizer()
cv.fit_transform(df["desc_words_only"])

<800000x427297 sparse matrix of type '<class 'numpy.int64'>'
	with 8185293 stored elements in Compressed Sparse Row format>

We won't try to convert this into an array as above, because it would be very memory-intensive. But we can see we have 427,297 unique words.

Now we have variables. But what exactly are we going to model? The trade-off is that the more levels of the categorization we use, the fewer examples we'll have in each category.
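To make the trade-off concrete, here is a toy sketch with hypothetical category paths written in the same `a>b>c` style as the `code` column. The `prefix` helper and the `summary` dictionary are illustrative names only.

```python
from collections import Counter

# Hypothetical category paths, not taken from the dataset
codes = [
    "3292>114>1231", "3292>114>1231", "3292>1370>4767",
    "4015>3636>1319", "4015>3754>2000", "4015>3636>1409",
]

def prefix(code, depth):
    # keep only the first `depth` levels of the path
    return ">".join(code.split(">")[:depth])

summary = {}
for depth in (1, 2, 3):
    counts = Counter(prefix(c, depth) for c in codes)
    # (number of categories, size of the smallest category)
    summary[depth] = (len(counts), min(counts.values()))
    print("depth", depth, "->", len(counts), "categories, smallest has",
          min(counts.values()), "record(s)")
```

Each extra level gives more categories to predict and fewer records to learn each one from.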

In [21]:
df["codes"] = df["code"].str.split(">")
df["first_code"] = df["codes"].str[0]
df["second_code"] = df["codes"].str[1]
df["first_second"] = df["first_code"] + ">" + df["second_code"]

df["first_second"].value_counts().describe()

count    108.000000  
mean     7331.740741 
std      16112.232764
min      1.000000    
25%      283.500000  
50%      1227.000000 
75%      5660.500000 
max      84438.000000
Name: first_second, dtype: float64

We see that some categories do have only 1 description. Let's see how many there are:

In [22]:
vcs = df["first_second"].value_counts()
print(len(vcs[vcs == 1]), "codes with only 1 desc")
vcs[vcs == 1]

1 codes with only 1 desc


2199>2819    1
Name: first_second, dtype: int64

We'd probably want to investigate these categories more closely. 


#### Training / test sets
As a bare minimum, we need at least 2 records in any category we want to attempt to model. This is because we need to split our data into two pieces: the _training set_, which we'll develop the model on, and the _test set_, which we'll subsequently evaluate it on. We want to see how the model performs on descriptions it's never seen before.

In practice, we almost certainly want more than 2, but we'll continue along here.
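The sketch below illustrates why at least 2 records per category are needed when the split is stratified. It is a simplified, standard-library illustration and not the algorithm `scikit-learn` uses. The function name, the 25% test fraction and the toy labels are assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) with every label present on both sides."""
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label, idx in by_label.items():
        if len(idx) < 2:
            # a single record cannot be in both the training and the test set
            raise ValueError(f"label {label!r} has only {len(idx)} record")
        rng.shuffle(idx)
        # at least one record goes to the test set, at least one stays in training
        n_test = min(len(idx) - 1, max(1, round(len(idx) * test_fraction)))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx

# Hypothetical labels: a common, a medium and a rare category
labels = ["a"] * 8 + ["b"] * 4 + ["c"] * 2
train_idx, test_idx = stratified_split(labels)

print("train labels:", sorted({labels[i] for i in train_idx}))
print("test labels: ", sorted({labels[i] for i in test_idx}))
```

Even the rare category "c" ends up on both sides of the split, which a purely random split would not guarantee.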

In addition, [_cross validation_](https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6), which we won't get into today, is an important technique in machine learning that prevents us from "overfitting" the model to the sample of data we're using. It's easy to do in python with `scikit-learn`.

In [23]:
from sklearn.model_selection import train_test_split
def make_test_train(df, column_name):
    # first, let's remove duplicates
    deduped = df.drop_duplicates(subset=[column_name])
    print("after deduping, we have", len(deduped), "records")

    # now, let's remove any category with <2 records
    vcs = deduped["first_second"].value_counts()
    to_include = vcs[vcs > 1].index
    final_dataset = deduped[deduped["first_second"].isin(to_include)]
    print("after removing <2 record categories, there are", len(final_dataset), "records")

    # the stratify is important -- 
    # it's making sure that we have an instance of each category in both the train and test sets

    train, test = train_test_split(final_dataset, stratify=final_dataset["first_second"])
    print("training set has", len(train), "records --", 100 * len(train) / len(final_dataset), 
          "percent -- and test set has", len(test), "records")
    
    return train, test
train, test = make_test_train(df, "desc_words_only")

after deduping, we have 799539 records
after removing <2 record categories, there are 791368 records
training set has 593526 records -- 75.0 percent -- and test set has 197842 records


Now that we've split up our data, we can choose a classifier. We'll use one of the simplest out there: Logistic Regression, often known as "logit." Normally, logistic regression is a binary classifier. In our case, because we're categorizing roughly 100 two-level codes (108 before filtering), we'll actually be training one binary model per code and selecting the highest-probability prediction. This is known as "one-vs-all" (or "one-vs-rest"). There are other schemes to convert binary classifiers into multi-class classifiers.
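The one-vs-all idea can be sketched in a few lines. Everything here is hypothetical: the class names, the hand-picked word weights and the `predict` helper are illustrations, not what `scikit-learn` learns or how it is implemented internally.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One binary "is it this class?" model per class. Each model maps a word to a
# weight; words the model has never seen contribute 0. Weights are made up.
models = {
    "battery":   {"battery": 2.0, "li": 1.0, "ion": 1.0, "table": -1.0},
    "furniture": {"table": 2.5, "side": 1.0, "battery": -1.5},
    "monitor":   {"lcd": 2.0, "monitor": 2.0},
}

def predict(tokens):
    # score every binary model independently...
    probs = {
        label: sigmoid(sum(weights.get(tok, 0.0) for tok in tokens))
        for label, weights in models.items()
    }
    # ...then keep the class whose model is most confident
    return max(probs, key=probs.get), probs

label, probs = predict("small side table white".split())
print(label, {k: round(v, 2) for k, v in probs.items()})
```

Because each model is trained and scored separately, the per-class probabilities need not sum to 1. Only their ranking is used.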

One more trick: logistic regression has no closed-form solution, so it is always fit iteratively. Instead of a full-batch solver, we'll use an approach that updates the weights one example at a time and runs much more quickly on large, sparse data, called "stochastic gradient descent." SGD, as it's abbreviated, is also the optimizer at the core of neural net training, where back-propagation ("backprop") supplies the gradients. We'll leave it to you to convince yourself that this approach is good enough. You can get into the math or just try it out in Python!
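If you'd like to try it out, here is a minimal pure-Python sketch of stochastic gradient descent for binary logistic regression. It is an illustration only and not what `SGDClassifier` does internally (which adds regularization and a learning-rate schedule, among other things). The toy data, learning rate and epoch count are assumptions.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_logistic(X, y, lr=0.5, epochs=200, seed=0):
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)              # visit the examples in a random order
        for i in order:                 # update after every single example
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            err = p - y[i]              # gradient of the log loss w.r.t. the score
            w = [wj - lr * err * xj for wj, xj in zip(w, X[i])]
            b -= lr * err
    return w, b

# Toy bag-of-words counts over the hypothetical vocabulary ["battery", "table"]
X = [[1, 0], [2, 0], [0, 1], [0, 2]]
y = [1, 1, 0, 0]                        # 1 = "battery" class, 0 = anything else

w, b = sgd_logistic(X, y)
preds = [int(sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5) for x in X]
print("weights:", [round(wj, 2) for wj in w], "bias:", round(b, 2))
print("predictions:", preds)
```

The learned weight for "battery" comes out positive and the one for "table" negative, which is the kind of per-word evidence the full model learns for each code.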

In [24]:
from sklearn.linear_model import SGDClassifier

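# note: 'log' is the logistic-regression loss; scikit-learn 1.1+ calls it 'log_loss'
# (the old name was removed in 1.3), so pass loss='log_loss' on recent versions
def make_model(train, vec, column_name, loss='log'):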
    X = vec.fit_transform(train[column_name])
    y = train["first_second"]

    clf = SGDClassifier(n_jobs=NCPUS, loss=loss, max_iter=1000, tol=1e-5)
    clf.fit(X, y)
    return clf, vec
clf, vec = make_model(train, CountVectorizer(), "desc_words_only")



Now we have a trained model in the `clf` variable. Let's see how it does on a simple metric: overall accuracy. That is, "out of every code the model predicted, what fraction did it get right?"

In [25]:
def evaluate_model(clf, vec, train, test, column_name):
    X = vec.transform(train[column_name])
    y = train["first_second"]
    X_test = vec.transform(test[column_name])
    y_test_pred = clf.predict(X_test)
    y_test_true = test["first_second"]
    print("in-sample accuracy: ", (clf.predict(X) == y).mean())
    print("test set accuracy: ", (y_test_pred == y_test_true).mean())
    return y_test_true, y_test_pred
    
y_test_true, y_test_pred = evaluate_model(clf, vec, train, test, "desc_words_only")

in-sample accuracy:  0.862681668537
test set accuracy:  0.835444445568


Not bad! Now, we can drill down a bit in a few ways. A productive way to do so is to look at failing codes.

In [21]:
from sklearn.metrics import classification_report
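rep = classification_report(y_test_true, y_test_pred, output_dict=True)
# newer scikit-learn versions add a scalar 'accuracy' entry; keep only the per-row dicts
rows = {k: v for k, v in rep.items() if isinstance(v, dict)}
# sort so the largest categories (highest support) come first
sorted_by_support = sorted(rows.items(), key=lambda x: x[1]['support'], reverse=True)
pd.DataFrame([s[1] for s in sorted_by_support], index=[s[0] for s in sorted_by_support]).head(30)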


Unnamed: 0,f1-score,precision,recall,support
micro avg,0.833913,0.833913,0.833913,197842
macro avg,0.552782,0.750469,0.490477,197842
weighted avg,0.828068,0.836773,0.833913,197842
3292>1370,0.89843,0.865799,0.933618,21090
2199>4592,0.888386,0.846615,0.934492,19326
4015>3754,0.79629,0.722548,0.886794,18780
4015>2337,0.957402,0.969152,0.945934,15444
4015>3636,0.816503,0.753821,0.890556,14400
3292>3581,0.915983,0.920118,0.911885,10282
1608>2320,0.97884,0.980292,0.977393,9466


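Let's pick a code to investigate, here `4015>2824`, by listing the test rows where either the true or the predicted code matches it and the prediction is wrong.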

In [None]:
test_reset = test.reset_index()
test_reset["pred"] = y_test_pred

In [None]:
cols = ["desc_words_only", "first_second", "pred"]
CODE = "4015>2824"
test_reset[((test_reset["first_second"] == CODE)\
           | (test_reset["pred"] == CODE))\
           & (test_reset["first_second"] != test_reset["pred"])][cols]

In [22]:
from nltk.corpus import stopwords
all_stopwords = set(stopwords.words('english'))
all_stopwords.add("count")
def remove_stopwords(desc):
    return " ".join(d for d in desc.split() if d not in all_stopwords)
def remove_single_char(desc):
    return " ".join(d for d in desc.split() if len(d) > 1)
df["desc_no_stopwords"] = df["desc_words_only"].apply(remove_stopwords).apply(remove_single_char)
df.head()

Unnamed: 0,desc,code,desc_words_only,codes,first_code,second_code,first_second,desc_no_stopwords
0,Replacement Viewsonic VG710 LCD Monitor 48Watt AC Adapter 12V 4A,3292>114>1231,replacement viewsonic vg710 lcd monitor 48watt ac adapter 12v 4a,"[3292, 114, 1231]",3292,114,3292>114,replacement viewsonic vg710 lcd monitor 48watt ac adapter 12v 4a
1,HP COMPAQ Pavilion DV6-1410EZ 4400mAh 48Wh 6 Cell Li-ion 10.8V Black Compatible Battery,3292>1370>4767>3975>1420,hp compaq pavilion dv6 1410ez 4400mah 48wh 6 cell li ion 10 8v black compatible battery,"[3292, 1370, 4767, 3975, 1420]",3292,1370,3292>1370,hp compaq pavilion dv6 1410ez 4400mah 48wh cell li ion 10 8v black compatible battery
2,Bonjour,2296>3597>2989,bonjour,"[2296, 3597, 2989]",2296,3597,2296>3597,bonjour
3,Two Pack 6V 12Ah Eaton POWERRITE PRO II 2400 6V 12Ah UPS Replacement Battery - SPS BRAND,3292>114>1231,two pack 6v 12ah eaton powerrite pro ii 2400 6v 12ah ups replacement battery sps brand,"[3292, 114, 1231]",3292,114,3292>114,two pack 6v 12ah eaton powerrite pro ii 2400 6v 12ah ups replacement battery sps brand
4,Generations Small Side Table White,4015>3636>1319>1409>3606,generations small side table white,"[4015, 3636, 1319, 1409, 3606]",4015,3636,4015>3636,generations small side table white


In [23]:
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer('english')
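# NOTE: stemmer.stem treats its whole argument as a single token, so applied to a
# full description it only changes the end of the string (e.g. "red" -> "r" below).
# To stem every word instead, apply:
#   lambda d: " ".join(stemmer.stem(w) for w in d.split())
df["desc_no_stopwords_stemmed"] = df["desc_no_stopwords"].apply(stemmer.stem)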

In [24]:
df.sample(5)

Unnamed: 0,desc,code,desc_words_only,codes,first_code,second_code,first_second,desc_no_stopwords,desc_no_stopwords_stemmed
426722,What Time Is It? Wall Clock - Red,1395>2736>1026>2013,what time is it wall clock red,"[1395, 2736, 1026, 2013]",1395,2736,1395>2736,time wall clock red,time wall clock r
288153,Numi Tea Organic Chocolate Pu-Erh - Case of 6 - 16 Bag Pu-erh Tea,3730>1887>3044>3352>1346,numi tea organic chocolate pu erh case of 6 16 bag pu erh tea,"[3730, 1887, 3044, 3352, 1346]",3730,1887,3730>1887,numi tea organic chocolate pu erh case 16 bag pu erh tea,numi tea organic chocolate pu erh case 16 bag pu erh tea
142899,Mightyskins Protective Vinyl Skin Decal Cover for Samsung Galaxy Note LTE Cell Phone wrap sticker skins Fantasy Angel,3292>3581>3145>2201,mightyskins protective vinyl skin decal cover for samsung galaxy note lte cell phone wrap sticker skins fantasy angel,"[3292, 3581, 3145, 2201]",3292,3581,3292>3581,mightyskins protective vinyl skin decal cover samsung galaxy note lte cell phone wrap sticker skins fantasy angel,mightyskins protective vinyl skin decal cover samsung galaxy note lte cell phone wrap sticker skins fantasy angel
206943,Little Tikes Fish 'n Splash Water Table,1395>2736>3899>2131>407,little tikes fish n splash water table,"[1395, 2736, 3899, 2131, 407]",1395,2736,1395>2736,little tikes fish splash water table,little tikes fish splash water t
623037,Durable XXXL 180T Waterproof Dust Motorcycle Cover Outdoor UV Protector Black Blue,2199>1974>2821,durable xxxl 180t waterproof dust motorcycle cover outdoor uv protector black blue,"[2199, 1974, 2821]",2199,1974,2199>1974,durable xxxl 180t waterproof dust motorcycle cover outdoor uv protector black blue,durable xxxl 180t waterproof dust motorcycle cover outdoor uv protector black blu


In [29]:
train, test = make_test_train(df, "desc_no_stopwords_stemmed")
clf, vec = make_model(train, CountVectorizer(ngram_range=(1, 2)), "desc_no_stopwords_stemmed")
y_test_true, y_pred = evaluate_model(clf, vec, train, test, "desc_no_stopwords_stemmed")

after deduping, we have 784060 records
after removing <2 record categories, there are 775892 records
training set has 581919 records -- 75.0 percent -- and test set has 193973 records




in-sample accuracy:  0.9171362337369977
test set accuracy:  0.8529022080392632


In [30]:
train, test = make_test_train(df, "desc_no_stopwords_stemmed")
clf, vec = make_model(train, TfidfVectorizer(ngram_range=(1, 2)), "desc_no_stopwords_stemmed")
y_test_true, y_pred = evaluate_model(clf, vec, train, test, "desc_no_stopwords_stemmed")

after deduping, we have 784060 records
after removing <2 record categories, there are 775892 records
training set has 581919 records -- 75.0 percent -- and test set has 193973 records




in-sample accuracy:  0.7502229691761224
test set accuracy:  0.7488980425110712


In [26]:
train, test = make_test_train(df, "desc_words_only")
clf, vec = make_model(train, CountVectorizer(ngram_range=(1, 3)), "desc_words_only")
y_test_true, y_pred = evaluate_model(clf, vec, train, test, "desc_words_only")

after deduping, we have 799539 records
after removing <2 record categories, there are 791368 records
training set has 593526 records -- 75.0 percent -- and test set has 197842 records




in-sample accuracy:  0.946843103756
test set accuracy:  0.862006045228


What's going on here? We can see, for example, that code 1605 "octopus prepared or preserved nesoi" was mis-predicted into this class. It may well be because of the "prepared or preserved nesoi" bit. The model may be incorrectly attributing this to an important part of the 2008 code. 

We can deal with this using a _feature scaling_ technique known as TF-IDF (term frequency - inverse document frequency). In essence, we're helping the model determine what features should be weighted and which ones should be ignored.

The idea with TF-IDF is that instead of weighting each word by its raw count in that particular record, we'll weight it with more contextual information. There are many TF-IDF schemes, but they essentially all boil down to this:

$$
\frac{\text{\# times word occurs in record}}{\text{\# unique records the word occurs in}}
$$

In other words, the less frequently a word occurs across the entire set of descriptions, the more important it presumably is. Thus, since we frequently see the terms "nesoi", "prepared", and "preserved", for example, we'll weight those less.
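The ratio above can be computed directly with the standard library. This sketch uses hypothetical descriptions and the plain ratio exactly as written. `scikit-learn`'s `TfidfVectorizer` uses a smoothed, logarithmic variant of the denominator and normalizes each row by default, so its numbers will differ.

```python
from collections import Counter

# Hypothetical descriptions (already cleaned)
docs = [
    "replacement battery pack",
    "replacement lcd monitor adapter",
    "replacement side table",
]

tokenized = [d.split() for d in docs]

# document frequency: in how many records does each word appear?
doc_freq = Counter(tok for toks in tokenized for tok in set(toks))

def tfidf(tokens):
    # (# times word occurs in record) / (# unique records the word occurs in)
    counts = Counter(tokens)
    return {tok: counts[tok] / doc_freq[tok] for tok in counts}

weights = [tfidf(toks) for toks in tokenized]
for doc, w in zip(docs, weights):
    print(doc, "->", {tok: round(v, 2) for tok, v in w.items()})
```

"replacement" appears in every record and so gets the lowest weight, while words unique to one record keep their full count.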

Let's look at the same example with the `CountVectorizer` as above, but using the `TfidfVectorizer` instead.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer()

# this will actually convert our first few descriptions into vectors
tfd = vec.fit_transform(first_few_only['desc_words_only'])
# by default, it's a sparse matrix
tfd.toarray()
# don't worry about this, it's for pedagogical purposes
columns = [x[0] for x in sorted(list(vec.vocabulary_.items()), key=lambda x: x[1])]
pd.DataFrame(tfd.toarray(), columns=columns, index=first_few_only['desc_words_only'])

And, without further ado, let's try a model.

In [None]:
vec = TfidfVectorizer()
X = vec.fit_transform(train["desc_words_only"])
clf = SGDClassifier(n_jobs=-1, alpha=.00001)
clf.fit(X, train["first_second"])

In [None]:
# use the TF-IDF vectorizer the classifier was trained with (not the earlier `cv`)
X_test = vec.transform(test["desc_words_only"])
y_test_pred = clf.predict(X_test)
y_test_true = test["first_second"]
print("in-sample accuracy: ", (clf.predict(X) == train["first_second"]).mean())
print("test set accuracy: ", (y_test_pred == y_test_true).mean())

In [None]:
deduped = df.drop_duplicates(subset=["desc_no_stopwords"])
print("after deduping, we have", len(deduped), "records")

# now, let's remove any category with <2 records
vcs = deduped["first_second"].value_counts()
to_include = vcs[vcs > 1].index
final_dataset = deduped[deduped['first_second'].isin(to_include)]
print("after removing <2 record categories, there are", len(final_dataset), "records")

# the stratify is important -- 
# it's making sure that we have an instance of each category in both the train and test sets

train, test = train_test_split(final_dataset, stratify=final_dataset["first_second"])
print("training set has", len(train), "records --", 100 * len(train) / len(final_dataset), 
      "percent -- and test set has", len(test), "records")

In [None]:
from sklearn.linear_model import SGDClassifier

cv = CountVectorizer()
X = cv.fit_transform(train["desc_no_stopwords"])
y = train["first_second"]

clf = SGDClassifier(alpha=.00001, n_jobs=-1)
clf.fit(X, y)

X_test = cv.transform(test["desc_no_stopwords"])
y_test_pred = clf.predict(X_test)
y_test_true = test["first_second"]
print("initial accuracy: ", (y_test_pred == y_test_true).mean())

In [None]:
deduped = df.drop_duplicates(subset=["desc_no_stopwords_stemmed"])
print("after deduping, we have", len(deduped), "records")

# now, let's remove any category with <2 records
vcs = deduped["first_second"].value_counts()
to_include = vcs[vcs > 1].index
final_dataset = deduped[deduped.first_second.isin(to_include)]
print("after removing <2 record categories, there are", len(final_dataset), "records")

# the stratify is important -- 
# it's making sure that we have an instance of each category in both the train and test sets

train, test = train_test_split(final_dataset, stratify=final_dataset["first_second"])
print("training set has", len(train), "records --", 100 * len(train) / len(final_dataset), 
      "percent -- and test set has", len(test), "records")

In [None]:
from sklearn.linear_model import SGDClassifier

cv = CountVectorizer()
X = cv.fit_transform(train["desc_no_stopwords_stemmed"])
y = train["first_second"]

clf = SGDClassifier(alpha=.00001, n_jobs=-1)
clf.fit(X, y)

X_test = cv.transform(test["desc_no_stopwords_stemmed"])
y_test_pred = clf.predict(X_test)
y_test_true = test["first_second"]
print("initial accuracy: ", (y_test_pred == y_test_true).mean())

In [None]:
from sklearn.linear_model import SGDClassifier

cv = TfidfVectorizer()
# features: stemmed descriptions; target: the two-level code
X = cv.fit_transform(train["desc_no_stopwords_stemmed"])
y = train["first_second"]

clf = SGDClassifier()
clf.fit(X, y)

X_test = cv.transform(test["desc_no_stopwords_stemmed"])
y_test_pred = clf.predict(X_test)
y_test_true = test["first_second"]
print("initial accuracy: ", (y_test_pred == y_test_true).mean())

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
rf.fit(X, y)

In [None]:
y_test_pred = rf.predict(X_test)
y_test_true = test["first_second"]
print("initial accuracy: ", (y_test_pred == y_test_true).mean())