Creating a sample of documents to use as inputs to a classification task. 

In [8]:
import json
from pathlib import Path
import pandas as pd 

In [32]:
DATA_DIR = Path('../../data/wiki10')
TAGS_PATH = DATA_DIR / 'tags.json'
CLF_LABELS_PATH = DATA_DIR / 'clf0-singlelabel.csv'

In [33]:
with TAGS_PATH.open() as input_fileobj:
    tags = json.load(input_fileobj)

We will create a sample that should be easy to classify, to serve as a benchmark for testing. Future tests will use more diverse and imbalanced classes.

We choose four tags that were "distant" from each other in our doc2vec-evaluate analysis.

In [5]:
targets = {'food', 'software', 'religion', 'music'} 

In [17]:
df = pd.DataFrame.from_dict(tags, orient='index')
df = df[df.tag.isin(targets)]  # keep only documents whose dominant tag is a target

In [16]:
df.head()

Unnamed: 0,title,user_count,tags,tag,mapped_tag
0001f7867e0d59fa2797608d85fb35ac,Advergaming,31,"{'games': 0.3225806451612903, 'web': 0.1290322...",games,game
00071c1d56e4601dee50016d95af90fe,Likert scale,93,"{'statistics': 0.44086021505376344, 'psycholog...",statistics,math
000e9edf0163688ef62a4592546109fb,Technical communication,13,"{'writing': 0.07692307692307693, 'management':...",writing,literature
0012a3afc219030f26e3adb10eb5526a,Gehenna,14,{'religion': 0.2857142857142857},religion,religion
001312da1e617115951209bcd0fe1697,Galois theory,11,"{'math': 0.5454545454545454, 'mathematics': 0....",math,math


In [19]:
df.tag.value_counts()

music       985
religion    605
software    443
food        304
Name: tag, dtype: int64

Because each document is multi-labelled, some documents may still carry labels from more than one class in our target list.

The 'tag' column shown above is the dominant label for each document (the tag with the highest score; see 01_cluster_tags), and it will be our ground truth. We now filter out any document whose 'tags' column (a dict of all tags for that document) contains one or more of the other target tags, so that an apparent misclassification cannot simply be a consequence of the document being multi-labelled. This gives us a simple test case for benchmarking our models.

In [27]:
df[df.apply(lambda x: len((targets - {x.tag}) & x.tags.keys()) > 0, axis=1)]

Unnamed: 0,title,user_count,tags,tag,mapped_tag
0233336c1e06fa0f3fc17a83b4d7c626,Scorewriter,12,"{'music': 0.6666666666666666, 'software': 0.5,...",music,music
079b1471a99c79eac5587a7c3ee7531d,Last.fm,121,"{'music': 0.5041322314049587, 'web2.0': 0.1322...",music,music
07daff5529134e4723e595fbdea273c3,Rockbox,66,"{'music': 0.2727272727272727, 'software': 0.19...",music,music
09e0bdec4922049cf5c54427eb140748,iTunes version history,28,"{'software': 0.32142857142857145, 'history': 0...",software,
0dbf8657a6906208f055f39fa023ed20,Apple Lossless,14,"{'music': 0.5714285714285714, 'software': 0.21...",music,music
0e06f20b78d2809e084d88fdc3ba0429,Logic Pro,10,"{'music': 0.3, 'software': 0.1}",music,music
0e934a92d7dc961c7edabe8c10d5d930,LimeWire,28,"{'music': 0.21428571428571427, 'software': 0.0...",music,music
0ecc3f4c2a0b50270d23c4082b68258d,Milarepa,12,"{'religion': 0.08333333333333333, 'psychology'...",religion,religion
11554b475548eae74e2ced25972437bc,Cantillation,13,"{'music': 0.23076923076923078, 'religion': 0.0...",music,music
126bcce30578c8b8b7d67d06ee047685,Demeter,11,"{'music': 0.09090909090909091, 'religion': 0.0...",music,music


The biggest crossover appears to come from music-related software. We will now exclude these records.

In [28]:
df = df[df.apply(lambda x: len((targets - {x.tag}) & x.tags.keys()) == 0, axis=1)]
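
The exclusion rule can be checked on a small example. The sketch below uses a hypothetical three-row frame (the rows, index values and the helper name `has_other_target` are made up for illustration) with the same 'tag' and 'tags' columns, and applies the same set logic: remove the dominant tag from the target set, then test whether any remaining target appears among the document's tags.

```python
import pandas as pd

targets = {'food', 'software', 'religion', 'music'}

# Hypothetical toy rows mimicking the 'tag' and 'tags' columns of the real frame.
toy = pd.DataFrame(
    {
        'tag': ['music', 'music', 'religion'],
        'tags': [
            {'music': 0.6, 'software': 0.5},  # also carries another target tag
            {'music': 0.9, 'history': 0.1},   # extra tag, but not a target
            {'religion': 0.3},                # single tag
        ],
    },
    index=['a', 'b', 'c'],
)


def has_other_target(row):
    """Return True if the row carries a target tag other than its dominant one."""
    other_targets = targets - {row['tag']}
    return bool(other_targets & row['tags'].keys())


ambiguous = toy.apply(has_other_target, axis=1)
clean = toy[~ambiguous]  # rows 'b' and 'c' remain
```

Only row 'a' is dropped: 'history' is not in the target set, so row 'b' is still treated as unambiguous.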

In [29]:
df.tag.value_counts() 

music       931
religion    588
software    429
food        303
Name: tag, dtype: int64

Balancing the classes by sampling 300 documents from each tag.

In [37]:
df = (df.groupby('tag')
        .apply(lambda x: x.sample(300))
        .reset_index(level=0, drop=True)) 
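
The sampling above is unseeded, so each run draws a different subset. A possible variation, shown here as a sketch on hypothetical toy data, uses `DataFrameGroupBy.sample` (available in pandas 1.1 and later) with a fixed `random_state`. The seed value and the choice to derive the per-class size from the smallest class are assumptions for illustration; the notebook itself hard-codes 300.

```python
import pandas as pd

# Hypothetical imbalanced toy frame; the real frame has one row per document.
toy = pd.DataFrame({'tag': ['music'] * 5 + ['food'] * 3 + ['religion'] * 4})

# Assumption: size every class to the smallest one instead of a fixed number.
n_per_class = toy['tag'].value_counts().min()

# A fixed random_state makes the draw repeatable; the original index is kept.
balanced = toy.groupby('tag').sample(n=n_per_class, random_state=0)
```

Because the original index is preserved, no `reset_index` call is needed afterwards.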

In [38]:
df.tag.value_counts() 

food        300
software    300
religion    300
music       300
Name: tag, dtype: int64

In [40]:
df.to_csv(CLF_LABELS_PATH, index=True, index_label='id', columns=['tag']) 
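
As a quick sanity check, the written file can be read back with the 'id' column as the index. The sketch below uses a temporary file and a hypothetical two-row frame rather than the real CLF_LABELS_PATH, but the `to_csv` arguments are the same as above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy frame with an extra column that should not be written out.
toy = pd.DataFrame(
    {'tag': ['music', 'food'], 'title': ['x', 'y']},
    index=['a1', 'b2'],
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'labels.csv'
    # Same arguments as the notebook: the index is written as 'id', and only 'tag' is kept.
    toy.to_csv(path, index=True, index_label='id', columns=['tag'])
    labels = pd.read_csv(path, index_col='id')
```

The resulting `labels` frame has the document id as its index and a single 'tag' column.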