# Pink and Shrink

Limited rendering on GitHub; an external view is available with nbviewer.

Please click the circle to the right if possible to view all the graphs!

## First things first

Install the Amazon product wrapper:

`pip install python-amazon-product-api`

Other installations may be required:
* nltk
* pandas 
* lxml 
* bokeh

In [1]:
from __future__ import absolute_import, division, print_function
from amazonproduct import API
import os
from nltk.corpus.reader.plaintext import CategorizedPlaintextCorpusReader
from nltk import word_tokenize, FreqDist
from nltk.corpus import stopwords
import string
from nltk.classify import NaiveBayesClassifier, util
import random
import pandas as pd
import re

from bokeh.io import show
from bokeh.models import ColumnDataSource, PrintfTickFormatter, LabelSet
from bokeh.plotting import figure
from bokeh.models.glyphs import HBar
from bokeh.io import output_notebook
output_notebook()

## Set up API

Fill in your own info here; an example dictionary is shown below.

In [1]:
# sample configuration dictionary
cfg = {
    'access_key': 'ABCDEFG1234X',
    'secret_key': 'Ydjkei78HdkffdklieAHDJWE3134',
    'associate_tag': 'redtoad-10',
    'locale': 'us'
}

In [3]:
api = API(cfg = cfg)

### List of Allowed Index (for US)

To search for items and pull their descriptions, one needs both an "index" and a "node ID". The "node ID" is numeric and belongs to the tree of categories that range from the broad to the specific, while the "index" is a one-level broad description.
Sometimes it's hard to know which index to look into, but product searches require one, and it cannot simply be "All".

These are the indices allowed:
'Wine','Wireless','ArtsAndCrafts','Miscellaneous','Electronics','Jewelry','MobileApps','Photo','Shoes',
'KindleStore','Automotive','Vehicles','Pantry','MusicalInstruments','DigitalMusic','GiftCards','FashionBaby',
'FashionGirls','GourmetFood','HomeGarden','MusicTracks','UnboxVideo','FashionWomen','VideoGames','FashionMen',
'Kitchen','Video','Software','Beauty','Grocery','FashionBoys','Industrial','PetSupplies','OfficeProducts',
'Magazines','Watches','Luggage','OutdoorLiving','Toys','SportingGoods','PCHardware','Movies','Books','Collectibles',
'Handmade','VHS','MP3Downloads','HomeAndBusinessServices','Fashion','Tools','Baby','Apparel','Marketplace','DVD',
'Appliances','Music','LawnAndGarden','WirelessAccessories','Blended','HealthPersonalCare','Classical'

# Finding Mirrored Categories

After just some browsing of Amazon and its tree of categories, you can find that Amazon already has conveniently mirrored male/female node IDs for a lot of product categories. This is easier for men's vs. women's fashion categories, as the subtrees are similarly structured. For others, where the men vs. women distinction is made later as a filter of sorts, such as vitamins/minerals and deodorant/antiperspirant, I have yet to find a solution. Some of the items with mirrored node IDs found are:

* socks
* wallets
* pants
* shirts
* eyewear
* razor
* running shoes
* perfume/cologne

One constraint to take into account is that we want a category that is specific, but not so specific that we cannot retrieve enough items from the sum of its subcategories. Recall that the API restricts results to 20 pages, which translates to only roughly 200 items per node ID, so we circumvent this by pulling from each of the "child nodes" in addition to the original node.

We need enough item description data to form a corpus, so generally speaking, the larger the number of items the better. We can also pull from different indices (e.g. for shoes, there's the obvious "shoes" but also "fashion" and "sportinggoods"). The good thing is that each Amazon item has a unique item code, which we use to name the file, so duplicates are overwritten and won't be over-represented.

Unfortunately, this method of getting around the item limit implies approximately equal representation across the subcategories, while in reality some subcategories may be vastly more popular than others. Another caveat is that the same product in different sizes, colors, styles, etc. will each have its own unique code, which may over-represent a particular description.

For ItemSearch requests that use the BrowseNode parameter, results are sorted by BestSeller ranking, so we can at least trust that the items returned are the most popular for their type.
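
The workaround described above can be sketched without touching the API. This is a minimal illustration, not the notebook's actual fetching code: `fetch_items` is a hypothetical stand-in for an item search on one node, and the node IDs, ASINs, and descriptions are made up. It shows how pulling from a parent node plus its children, and keying each saved description by its item code, keeps an item that appears under several nodes from being counted twice.

```python
# Hypothetical stand-in for one item search: node ID -> list of (ASIN, description).
# The node IDs, ASINs, and descriptions below are invented for illustration.
FAKE_RESULTS = {
    100: [("B000AAA111", "Lightweight road shoe"), ("B000BBB222", "Trail shoe with grip")],
    101: [("B000AAA111", "Lightweight road shoe"), ("B000CCC333", "Cushioned daily trainer")],
    102: [("B000DDD444", "Spiked track shoe")],
}


def fetch_items(node_id):
    """Return the (ASIN, description) pairs for a node; empty if the node is unknown."""
    return FAKE_RESULTS.get(node_id, [])


def collect_descriptions(parent, children, tag):
    """Merge items from a parent node and its children, keyed by output filename."""
    files = {}
    for node_id in [parent] + list(children):
        for asin, description in fetch_items(node_id):
            # Same ASIN -> same filename, so a repeat simply overwrites the earlier entry.
            files["{}_{}.txt".format(asin, tag)] = description
    return files


corpus_files = collect_descriptions(parent=100, children=[101, 102], tag="f")
print(len(corpus_files))
```

Because each item code maps to one filename, an item reached through several nodes or indices collapses to a single file, while size and color variants with distinct codes still survive as separate files.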

In [58]:
shoes_running = {}
shoes_running['f'] = 679360011
shoes_running['m'] = 679286011

perfume = {}
perfume['f'] = 11056931
perfume['m'] = 11056761

watch = {}
watch['f'] = 6358543011
watch['m'] = 6358539011

razor = {}
razor['f'] = 13269991011
razor['m'] = 13271080011

eyewear = {}
eyewear['m'] = 7072330011
eyewear['f'] = 7072321011

glasses = {}
glasses['m'] = 2474995011
glasses['f'] = 2474971011

socks = {}
socks['f'] = 1044886
socks['m'] = 1045708

wallets = {}
wallets['m'] = 2475895011
wallets['f'] = 2475898011

shirts = {}
shirts['m'] = 11444073011
shirts['f'] = 11444120011

pants = {}
pants['f'] = 11443933011
pants['m'] = 11443885011

jeans = {}
jeans['f'] = 1048188
jeans['m'] = 1045564

# Checking symmetry

Some interesting aspects can already be gleaned from the subcategories of what seem to be symmetric categories.

Things like running shoes and shirts seem to have matching subcategories (child_nodes).

In [18]:
print_child_nodes(shoes_running['f'])
print("----------------------------")
print_child_nodes(shoes_running['m'])

Road Running (14210388011)
Track & Field & Cross Country (3412255011)
Trail Running (1264582011)
----------------------------
Road Running (14210389011)
Track & Field & Cross Country (3420973011)
Trail Running (1264575011)


In [19]:
print_child_nodes(shirts['f'])
print("----------------------------")
print_child_nodes(shirts['m'])

Compression Tops (9590778011)
Polo Shirts (11444121011)
T-Shirts (11444122011)
Tank Tops (11444123011)
----------------------------
Compression Tops (9590779011)
Polo Shirts (11444074011)
T-Shirts (11444075011)
Tank Tops (11444076011)


## Ensuring Symmetry

But other items show more differences across the gender mirror. One early artifact of this difference is that the category called "perfume" for women is called "cologne" for men.

Here, "safety razors" and "straight razors" are only present as subcategories for men, while "razors with soap bars" are only present for women. 
"Essential Oils" is a unique subcategory for perfume for women. As is "Capri Pants".
There are 5 more subcategories for women's socks from "Leg Warmers" to "Sheers" to "No Show & Liner Socks".

One option is to pull only from the same subcategories to ensure that symmetry is completely preserved. In other words, we can exclude those unique subcategories via their node IDs. We can do this to a certain extent for products we can find in multiple indices and still ensure a relatively large corpus. But if we do not do this and "capri" shows up as a keyword for women's pants, or "essential" or "oil" for women's perfume, we would know the reason.
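
One way to build such an exclusion list is to compare child-node names across the two sides and drop any subcategory that has no counterpart. The sketch below hard-codes the razor child nodes that `print_child_nodes` reports for this category; the helper name `unmatched_node_ids` is an assumption for illustration, and matching on the display name assumes mirrored subcategories are named identically.

```python
# Child nodes for the razor category, copied from the print_child_nodes output.
razor_children = {
    "f": {
        "Cartridges & Refills": 13269995011,
        "Disposable Razors": 13269997011,
        "Razor Systems": 13269993011,
        "Razors with Soap Bars": 13295188011,
    },
    "m": {
        "Razor Systems": 13271082011,
        "Cartridges & Refills": 13271084011,
        "Disposable Razors": 13271086011,
        "Safety Razors": 13271088011,
        "Straight Razors": 13271090011,
    },
}


def unmatched_node_ids(children):
    """Return node IDs of subcategories whose name appears on only one side."""
    shared = set(children["f"]) & set(children["m"])
    unmatched = []
    for side in ("f", "m"):
        for name, node_id in sorted(children[side].items()):
            if name not in shared:
                unmatched.append(node_id)
    return unmatched


razor_exclude = unmatched_node_ids(razor_children)
print(razor_exclude)
```

A name-based comparison like this would miss pairs that are mirrored in substance but labeled differently, so the result is best treated as a starting point to review by hand.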

In [20]:
print_child_nodes(razor['f'])
print("----------------------------")
print_child_nodes(razor['m'])

Cartridges & Refills (13269995011)
Disposable Razors (13269997011)
Razor Systems (13269993011)
Razors with Soap Bars (13295188011)
----------------------------
Razor Systems (13271082011)
Cartridges & Refills (13271084011)
Disposable Razors (13271086011)
Safety Razors (13271088011)
Straight Razors (13271090011)


In [21]:
print_child_nodes(socks['f'])
print("----------------------------")
print_child_nodes(socks['m'])

Casual Socks (2376196011)
Dress & Trouser Socks (2376197011)
Athletic Socks (1044920)
Leg Warmers (2376198011)
Tights (1044936)
Sheers (1044934)
Slipper Socks (2376199011)
No Show & Liner Socks (2376200011)
----------------------------
Casual Socks (2476509011)
Dress & Trouser Socks (1045726)
Athletic Socks (1045724)


In [22]:
print_child_nodes(perfume['f'])
print("----------------------------")
print_child_nodes(perfume['m'])

Body Sprays (3783161)
Cologne (11057051)
Eau Fraiche (16262036011)
Eau de Parfum (11057071)
Eau de Toilette (11057081)
Essential Oils (11057091)
Sets (11057111)
----------------------------
Body Sprays (16262034011)
Cologne (11059721)
Eau Fraiche (16262035011)
Eau de Parfum (363235011)
Eau de Toilette (363236011)
Sets (11059741)


In [23]:
print_child_nodes(pants['f'])
print("----------------------------")
print_child_nodes(pants['m'])

Capri Pants (11443934011)
Insulated Pants (11443935011)
Shell Pants (11443936011)
----------------------------
Insulated Pants (11443887011)
Shell Pants (11443888011)


In [24]:
print_child_nodes(eyewear['f'])
print("----------------------------")
print_child_nodes(eyewear['m'])

Sunglasses (2474971011)
Eyeglass Cases (3478925011)
Eyeglass Chains (3478927011)
Eyewear Frames (3478921011)
Replacement Sunglass Lenses (3508162011)
----------------------------
Sunglasses (2474995011)
Eyeglass Cases (3478924011)
Eyeglass Chains (3478926011)
Eyewear Frames (3478922011)
Replacement Sunglass Lenses (3508163011)


## Categories that are "Predictable"

The Naive Bayes algorithm used is very simple and interpretable, so the products whose gender can be somewhat accurately predicted by this method have distinguishing words that appear mostly in one category. The setup is also very straightforward, with an 80/20 split for training/testing data sets. More complexity might be added in the future, but for now this will do.
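
The split can be sketched as follows. This is a simplified illustration of a per-label 80/20 split with a fixed seed, using toy labeled items rather than real descriptions; the function name `split_by_label` is hypothetical.

```python
import random


def split_by_label(labeled, fraction=0.8, seed=9876543210):
    """Shuffle each label's items with a fixed seed, then cut at `fraction`."""
    train, test = [], []
    for label in sorted(labeled):
        items = list(labeled[label])  # copy so the caller's list is left untouched
        random.Random(seed).shuffle(items)
        cut = int(round(len(items) * fraction))
        train += [(item, label) for item in items[:cut]]
        test += [(item, label) for item in items[cut:]]
    return train, test


# Toy data: ten placeholder "descriptions" per label.
toy = {
    "f": ["f_doc_{}".format(i) for i in range(10)],
    "m": ["m_doc_{}".format(i) for i in range(10)],
}
train, test = split_by_label(toy)
print(len(train), len(test))
```

Splitting within each label keeps both classes represented in the test set, and the fixed seed makes the shuffle reproducible between runs.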

With that said, the test accuracy is surprisingly high, from the high 70s to the high 80s for things like razors, perfume, shirts, pants, and wallets. Keep in mind that a human may not be able to predict whether a product description is for a women's or a men's product with 100% accuracy.

There may be reasons these categories do well that go back to the data itself. For perfume at least, it seems there are fairly standardized formats for many of the product descriptions, e.g. the brand and the mix of aromas it presents. This kind of format is well suited, as it likely offers keywords without much clutter or noise. It could also be, however, the expected problem of duplicate and/or overused product descriptions, both of which increase the association between the gender label and particular words that show up again and again for this reason alone.
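
One rough way to gauge how much duplication is at play is to count descriptions that are identical after normalizing case and whitespace. The sketch below runs on made-up strings; `duplicate_share` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies compare equal."""
    return " ".join(text.lower().split())


def duplicate_share(descriptions):
    """Return the fraction of descriptions that repeat an earlier one."""
    counts = Counter(normalize(d) for d in descriptions)
    repeats = sum(n - 1 for n in counts.values())
    return repeats / float(len(descriptions)) if descriptions else 0.0


# Invented examples: the first two differ only in case and spacing.
sample = [
    "Eau de Toilette spray with notes of sage",
    "eau de toilette  spray with notes of SAGE",
    "Floral perfume with freesia and orchid",
    "Woody cologne with tobacco and cardamom",
]
print(duplicate_share(sample))
```

A high share in one corpus would suggest that some informative words owe their weight to copied text rather than to a broader pattern.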

Note that this section is not very DRY so will need further work. 

In [110]:
hb_indices = ['HealthPersonalCare', 'Beauty']
for index in hb_indices:
    Razor_Corpus = build_corpus(index, "RazorCorpus", razor)

In [125]:
razor_exclude = [13295188011, 13271088011, 13271090011]
for index in hb_indices:
    Razor_Corpus2 = build_corpus(index, "RazorCorpus2", razor, exclude = razor_exclude)

In [26]:
Razor_Corpus = read_corpus("RazorCorpus/")
Razor_Corpus2 = read_corpus("RazorCorpus2/")

In [38]:
#RazorFD = get_FD(Razor_Corpus)
razor_labeled = create_features(Razor_Corpus)
razorSC = NB_classify(razor_labeled)
razor_features = store_most_informative_features(razorSC, 30)
graph_features(razor_features, 'Razors Unfiltered')

Train accuracy: 94.5804195804
 Test accuracy: 85.3146853147


In [37]:
razor_labeled2 = create_features(Razor_Corpus2)
razorSC2 = NB_classify(razor_labeled2)
razor_features2 = store_most_informative_features(razorSC2, 30)
graph_features(razor_features2, 'Razors Filtered')

Train accuracy: 94.1025641026
 Test accuracy: 87.7551020408


## Graph of Interest 1

Removing the categories of "razors with soap bars" for women and "safety razors" and "straight razors" for men definitely makes a noticeable difference. One of the more obvious changes is the loss of "straight" as an informative word for men. It also removed "moisturizing" for women. Interestingly, it also removed "Venus", which I can only presume made a lot of those razors with soap bars, such as the "embrace".

It seems that "beard" and "professional" are more associated with safety and straight razors, as they were no longer there after the filter.

Practical features are also reflected, such as "legs" for women. The name "intuition" as a razor for women stuck, and so has "butter", which implies some moisturizing still.

Qualitatively, "sharp", "edge", "angle", and "precision" as informative words for men are not too surprising given the general connotations, as opposed to the "silky" and "curves" that appeared for women before. Products for women also appear to emphasize "convenience" over a "superior" experience.

In [None]:
for index in hb_indices:
    Perfume_Corpus = build_corpus(index, "PerfumeCorpus", perfume)

In [155]:
perfume_exclude = [11057091]
for index in hb_indices:
    Perfume_Corpus2 = build_corpus("Beauty", "PerfumeCorpus2", perfume, perfume_exclude)

In [34]:
Perfume_Corpus = read_corpus("PerfumeCorpus")
Perfume_Corpus2 = read_corpus("PerfumeCorpus2")

In [35]:
#PerfumeFD = get_FD(Perfume_Corpus)
perfume_labeled = create_features(Perfume_Corpus)
perfumeSC = NB_classify(perfume_labeled)
perfume_features = store_most_informative_features(perfumeSC,30)
graph_features(perfume_features, "Perfume")

Train accuracy: 89.8746383799
 Test accuracy: 80.3846153846


In [39]:
perfume_labeled2 = create_features(Perfume_Corpus2)
perfumeSC2 = NB_classify(perfume_labeled2)
perfume_features2 = store_most_informative_features(perfumeSC2,30)
graph_features(perfume_features2, "Perfume Filtered")

Train accuracy: 90.8613445378
 Test accuracy: 81.512605042


## Graph of Interest 2

Despite filtering out the "essential oils" category for women, "essential" still makes an appearance as the most indicative word for women.

"Womenly" scents include aloe, orchid, freesia. "Manly" scents include rosewood, sage, tobacco, cardamom. Unsurprisingly, it is "flowery" vs. "spices" (and other so-called rugged things like ashes and timber).

The main points that catch my attention are the adjectives "delicate" and "pure" for women, compared with "manly", "masculine", "man", and even "homme" for men. It seems important to emphasize the masculinity of the product for men. Given the lack of the word "feminine" for women, however, femininity does not appear to be something perfumes for women want to highlight in the same way. Nonetheless, the liberal application of such seemingly subjective adjectives is fascinating.

In [115]:
outfit_indices = ["SportingGoods", "Apparel", "Fashion"]
for index in outfit_indices:
    Pants_Corpus = build_corpus(index, "PantsCorpus", pants)

In [149]:
#PantsFD = get_FD(Pants_Corpus)
pants_features = create_features(Pants_Corpus)
pantsSC = NB_classify(pants_features)
pants_info = store_most_informative_features(pantsSC, 30)
graph_features(pants_info, "Pants")

Training accuracy: 91.4130434783
Test accuracy: 88.6956521739


In [59]:
Jeans_Corpus = build_corpus("Fashion", "JeansCorpus", jeans)
Jeans_Corpus = build_corpus("Apparel", "JeansCorpus", jeans)

No child nodes found
No child nodes found
No child nodes found
No child nodes found


In [57]:
#Jeans_Corpus = read_corpus("JeansCorpus/")
jeans_labeled = create_features(Jeans_Corpus)
jeansSC = NB_classify(jeans_labeled)
jeans_features = store_most_informative_features(jeansSC, 30)
graph_features(jeans_features, "Jeans")

Train accuracy: 93.2038834951
 Test accuracy: 89.6103896104


In [148]:
for index in outfit_indices:
    Shirt_Corpus = build_corpus(index, "ShirtCorpus", shirts)

In [154]:
#ShirtFD = get_FD(Shirt_Corpus)
shirt_labeled = create_features(Shirt_Corpus)
shirtSC = NB_classify(shirt_labeled)
shirt_features = store_most_informative_features(shirtSC, 30)
graph_features(shirt_features, "Shirts")

Training accuracy: 93.75
Test accuracy: 86.3372093023


In [132]:
socks_exclude = [2376198011, 1044936, 1044934, 2376199011, 2376200011]
for index in outfit_indices:
    Socks_Corpus = build_corpus(index, "SocksCorpus", socks, socks_exclude)

In [146]:
#SocksFD = get_FD(Socks_Corpus)
sock_labeled = create_features(Socks_Corpus)
socksSC = NB_classify(sock_labeled)
socks_features = store_most_informative_features(socksSC,30)
graph_features(socks_features, "Socks")

Training accuracy: 91.7400881057
Test accuracy: 80.0884955752


In [180]:
Wallet_Corpus = build_corpus("Fashion", "WalletCorpus", wallets)
# WalletFD = get_FD(Wallet_Corpus)

No child nodes found
No child nodes found


In [181]:
wallet_labeled = create_features(Wallet_Corpus)
walletSC = NB_classify(wallet_labeled)
wallet_features = store_most_informative_features(walletSC, 20)
graph_features(wallet_features, "Wallets")

Train accuracy: 98.064516129
 Test accuracy: 78.9473684211


## Less Predictable (by Naive Bayes)

Things like shoes and watches seem to do less well. Their accuracies on the test cases fall below 70%, although it is debatable whether humans could do much better. And for glasses, the classifier does so badly that it ties with random assignment at 50% accuracy.

It could have more to do with the quality of the descriptions for these products. They may contain a large number of "nondescript" descriptions that tend to be brief and generic, or messy and filled with technical measurements, copied and pasted over and over. The subcategories may also differ too much to create a coherent theme, as may be the case for glasses, which included cases and chains.

In [41]:
shoes_indices = ["SportingGoods", "Fashion", "Shoes"]
for i in shoes_indices:
    Shoes_Corpus = build_corpus(i, "ShoesCorpus", shoes_running)

In [150]:
# ShoesFD = get_FD(Shoes_Corpus)
shoes_features = create_features(Shoes_Corpus)
shoesSC = NB_classify(shoes_features)
shoes_info = store_most_informative_features(shoesSC, 30)
graph_features(shoes_info, "Running Shoes")

Training accuracy: 85.0509626274
Test accuracy: 66.0633484163


In [182]:
watch_indices = ['Fashion', "Electronics", "Jewelry"]
for i in watch_indices:
    Watch_Corpus = build_corpus(i, "WatchesCorpus", watch)

In [186]:
watch_label = create_features(Watch_Corpus) 
watchSC = NB_classify(watch_label)
watch_features = store_most_informative_features(watchSC, 30)
graph_features(watch_features, "Watches")

Train accuracy: 77.8618732261
 Test accuracy: 61.3636363636


In [170]:
Eyewear_Corpus = build_corpus("Fashion", "EyewearCorpus", eyewear)
Eyewear_Corpus = build_corpus("Fashion", "EyewearCorpus", glasses)

No child nodes found
No child nodes found


In [176]:
# EyewearFD = get_FD(Eyewear_Corpus)
eyewear_labeled = create_features(Eyewear_Corpus)
eyewearSC = NB_classify(eyewear_labeled)
eyewear_features = store_most_informative_features(eyewearSC, 20)
graph_features(eyewear_features, "Eyewear")

Train accuracy: 77.0913770914
 Test accuracy: 50.0


## All the Functions and Methods

Some really helpful tools:

### NLTK
Convenient corpus manipulation. Very easy to label and process in batch, filter out stopwords, and get the filenames. Has a built-in Naive Bayes classifier. Can easily get frequency distributions.
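
As a rough picture of the filtering step, the sketch below turns a token list into presence features using only the standard library. The tiny `STOPWORDS` set is a stand-in assumed for illustration (NLTK's English list is much longer), and `bag_of_words` is a hypothetical name.

```python
import string

# A tiny illustrative stopword list; NLTK's English list is much longer.
STOPWORDS = {"a", "an", "and", "for", "of", "the", "with"}


def bag_of_words(tokens):
    """Map each kept token (lowercased) to True, ignoring stopwords and punctuation."""
    skip = STOPWORDS | set(string.punctuation)
    return {tok.lower(): True for tok in tokens if tok.lower() not in skip}


tokens = ["The", "Razor", "with", "a", "SHARP", "edge", ",", "for", "men", "."]
features = bag_of_words(tokens)
print(sorted(features))
```

Each description becomes a dictionary of the words it contains, so a word counts once per description no matter how often it is repeated within it.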

### Bokeh
Takes a lot to customize, but will be useful for interactive graphics on a site later.

### Python Amazon Product API
Besides not being available for Python 3, it is generally easy to use. Having realized the first issue a bit late, I might stick with it unless there are noticeable lapses in fetching/parsing. I experimented with another wrapper that seemed less intuitive at first, and could experiment with others later.

In [28]:
def graph_features(featlist, name):
    
    """
    Function takes as input a list of features as output by store_most_informative_features, and a str naming the product
    Function creates a labeled dataframe and transforms it by
        keeping positive values for "f" features, negating values for "m" features, and sorting by ratio,
        then uses Bokeh to graph the most informative features from most "f" indicating to most "m" indicating
    Function returns None
    """
    featdf = pd.DataFrame(featlist)
    featdf.columns = ['word','gender','ratio']
    if (featdf[featdf.gender=='m'].ratio > 0).any():
        featdf.loc[featdf.gender=='m','ratio'] *= -1
    featdf.sort_values(by='ratio',ascending = True, inplace = True)
    x_max = featdf.ratio.abs().max() + 5
    p = figure(y_range=list(featdf.word), x_range=(-x_max,x_max), 
               title="Most Informative Words in Descriptions for " + name)

    p.hbar(y='word', right='ratio', left = 0, height = 0.8, legend = "Female",
           color = 'deeppink', source=ColumnDataSource(featdf[featdf.gender=='f']))
    p.hbar(y='word', left ='ratio', right = 0, height = 0.8, legend = "Male",
           color = 'deepskyblue', source=ColumnDataSource(featdf[featdf.gender=='m']))
    labels_m = LabelSet(x=0, y='word', text='word', level='glyph', x_offset = 5,
              source=ColumnDataSource(featdf[featdf.gender=='m']), y_offset = -9, render_mode='canvas')
    labels_f = LabelSet(x=0, y='word', text='word', level='glyph', x_offset = -5, text_align = 'right',
              source=ColumnDataSource(featdf[featdf.gender=='f']), y_offset = -9, render_mode='canvas')
    p.add_layout(labels_m)
    p.add_layout(labels_f)
    p.yaxis.visible = False
    #p.y_range.range_padding = 0.1
    p.xaxis.axis_label = "Likelihood Ratio"
    p.xaxis.axis_label_text_font_size = '1em'
    p.xaxis.major_label_text_font_size = '1em'
    p.title.text_font_size = '1.2em'
    p.xaxis[0].formatter = PrintfTickFormatter(format="%uX")
    p.legend.location = "bottom_right"
    p.legend.label_text_font_size = '1em'
    p.xgrid.grid_line_color = None
    p.ygrid.grid_line_color = None
    show(p)
#    return featdf

In [6]:
def store_most_informative_features(classifier, n=20):
    # function acts on a Naive Bayes classifier object output by NLTK
    # function outputs a list of the n most informative features, each with the label it leans toward and the ratio
    cpdist = classifier._feature_probdist
    feat_list = []

    for (fname, fval) in classifier.most_informative_features(n):
        def labelprob(l):
            return cpdist[l, fname].prob(fval)

        # labels that have seen this feature value, sorted from least to most likely
        labels = sorted([l for l in classifier._labels
                         if fval in cpdist[l, fname].samples()],
                        key=labelprob)
        if len(labels) == 1:
            continue
        l0 = labels[0]
        l1 = labels[-1]
        if cpdist[l0, fname].prob(fval) == 0:
            ratio = 'INF'
        else:
            ratio = (cpdist[l1, fname].prob(fval) /
                               cpdist[l0, fname].prob(fval))
        feat_list.append([fname, ("%s" % l1)[:6], ratio])
    return feat_list
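
To make the ratio concrete, here is a small worked sketch on toy data. It is not how NLTK computes its estimates internally: the additive `smoothing` constant and the helper name `presence_ratio` are assumptions of this illustration, and it expects exactly two labels.

```python
def presence_ratio(word, docs_by_label, smoothing=0.5):
    """Ratio of smoothed P(word present | label) between the two labels.

    Returns (word, leaning_label, ratio) with the higher probability on top.
    """
    probs = {}
    for label, docs in docs_by_label.items():
        hits = sum(1 for doc in docs if word in doc)
        # Additive smoothing over two outcomes (present / absent); an assumption
        # of this sketch that keeps the ratio finite when a word never appears.
        probs[label] = (hits + smoothing) / (len(docs) + 2 * smoothing)
    low, high = sorted(probs, key=probs.get)
    return word, high, probs[high] / probs[low]


# Toy corpus: each document is the set of words in one invented description.
toy_docs = {
    "f": [{"silky", "smooth"}, {"silky", "legs"}, {"smooth", "legs"}, {"silky"}],
    "m": [{"sharp", "edge"}, {"sharp", "smooth"}, {"edge"}, {"precision"}],
}
print(presence_ratio("silky", toy_docs))
```

In this toy corpus "silky" appears in three of four "f" documents and in no "m" documents, so it leans strongly toward "f", while a word like "smooth" that appears on both sides gets a ratio much closer to 1.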

In [7]:
def NB_classify(feature, fraction = 0.8, cats = ['f','m']):
    """
    Function takes as inputs:
        feature(dic) as dictionary of labeled features
        fraction(0-1.0) as proportion of raw files to include in the training set
    Function scrambles file order, kept the same scramble here with seeding for reference purposes
        takes the portion of training data to train the NLTK Naive Bayes classifier, prints its accuracy,
        prints accuracy of the classifier on the test data
    Function outputs the Naive Bayes Sentiment classifier object
    """
    test = []
    train = []
    for cat in cats:
        featlist = feature[cat]
        split = int(round(len(featlist) * fraction))
        random.seed(9876543210)
        random.shuffle(featlist)
        train += featlist[:split]
        test += featlist[split:]
    sentiment_classifier = NaiveBayesClassifier.train(train)
    print("Train accuracy: {}".format(util.accuracy(sentiment_classifier, train) * 100))
    print(" Test accuracy: {}". format(util.accuracy(sentiment_classifier, test) * 100))
    return sentiment_classifier

In [8]:
def get_FD(corpus, n = 10, cat = None):
    # Function takes categorized NLTK corpus and returns FreqDist object while printing n most common words of corpus
    words = corpus.words(categories = cat)
    filtered = filter_lower(words)
    FD = FreqDist(filtered)
    print(FD.most_common(n))
    return FD

In [9]:
def build_corpus(index, dirname, dictionary, exclude = None):
    """
    function takes as inputs:
        index(str) as optioned by Amazon
        dirname(str) as chosen by user for folder for output
        dictionary(dic) of pair matched parent node IDs, with keys as 'f' or 'm'
        exclude(list) of child node IDs to skip, or None to keep all child nodes
    function retrieves item descriptions by calling save_item_files() for the node itself,
        and for each child node that is not in the exclude list
    function outputs:
        Corpus, an NLTK categorized corpus, labeling each item
    """
    corpusdir = dirname + "/"
    if not os.path.isdir(corpusdir):
        os.mkdir(corpusdir)
    
    for k,v in dictionary.items():
        save_item_files(corpusdir, index, v , tag = k)
        for childnode in get_child_nodes(v):
            # skip child nodes that were explicitly excluded
            if exclude is not None and childnode in exclude:
                continue
            save_item_files(corpusdir, index, childnode, tag = k)
            
    corpus = read_corpus(corpusdir)
    return corpus

In [10]:
def read_corpus(corpusdir):
    if not os.path.isdir(corpusdir):
        os.mkdir(corpusdir)
    # escape the dot so only names ending in a literal ".txt" match
    corpus = CategorizedPlaintextCorpusReader(corpusdir, r'.*\.txt', cat_pattern=r'\w+_([fm])\.txt')
    return corpus

In [11]:
def create_features(corpus, cats = ['f','m']):
    # function takes categorized NLTK corpus, and creates a dictionary of features labeled 'f' or 'm'
    dictout = {}
    for cat in cats:
        dictout[cat] = [
        (bag_filter_lower(corpus.words(fileids=[f])), cat) \
        for f in corpus.fileids(categories=cat)]
    return dictout

In [12]:
def save_item_files(corpusdir, SearchIndex, Node, tag=""):
    """
    function takes as inputs:
        corpusdir(str) of the directory where corpus files will be created and saved to
        SearchIndex(str) of item to be searched as optioned by Amazon
        node(int) of the node ID for the item search
        tag(str) of the "tag" that goes at end of filename to indicate "f" or "m"
    function uses created Amazon api object and saves description of each item as a text file
        with item ASIN as filename, with usually the indicated tag at the end following underscore
    function returns nothing
    """
    items = api.item_search(
        SearchIndex, ResponseGroup = 'EditorialReview', BrowseNode = Node)

    for i, item in enumerate(items):
        try:
            content = item.EditorialReviews.EditorialReview.Content.text
            content = re.sub('<[^<]+?>', '', content)
            with open(corpusdir + item.ASIN.text + "_" + tag + '.txt','w') as fout:
                fout.write(content.encode('utf-8'))
                
        except AttributeError:
            pass
        


In [13]:
def filter_lower(words):
    # function filters for stopwords and punctuations from a list of words and returns list with all lowercase words
    uselesswords = stopwords.words("english") + list(string.punctuation)
    #otherwords = ["\s", "br", "b", "</", "><", ".</", "/>", ".<", ">-", """+)""", """)<\\"""]
    #uselesswords += otherwords
    return [word.lower() for word in words if word.lower() not in uselesswords]

In [14]:
def bag_filter_lower(words):
    # function filters for stopwords and punctuations from a list of words 
    # and returns dictionary with all lowercase words with value 1
    uselesswords = stopwords.words("english") + list(string.punctuation)
    return {word.lower():1 for word in words if word.lower() not in uselesswords}

In [15]:
def get_child_nodes(node_id):
    # function takes the node ID(int) and returns a list of child_nodes
    child_nodes = []
    result = api.browse_node_lookup(node_id)
    try:
        for child in result.BrowseNodes.BrowseNode.Children.BrowseNode:
            nodeout = child.BrowseNodeId.text
            child_nodes.append(int(nodeout))
    except AttributeError:
        print("No child nodes found")
    return child_nodes

In [16]:
def print_child_nodes(node_id):
    # function takes as input the node ID(int) and prints descriptions and node ID of children nodes
    result = api.browse_node_lookup(node_id)
    for child in result.BrowseNodes.BrowseNode.Children.BrowseNode:
        print('{} ({})'.format(child.Name, child.BrowseNodeId))