Create a sample of documents to use as input to a classification task.

In [1]:
import json
from pathlib import Path
import pandas as pd 

In [2]:
DATA_DIR = Path('../../data/wiki10')
TAGS_PATH = DATA_DIR / 'tags.json'
CLF_LABELS_PATH = DATA_DIR / 'clf0-singlelabel.csv'

In [3]:
with TAGS_PATH.open() as input_fileobj:
    tags = json.load(input_fileobj)

We will create a sample that should be easy to classify, to serve as a benchmark for testing. Future tests will use more diverse and imbalanced classes.

Choosing six tag types which were "distant" from each other in our doc2vec-evaluate analysis.

In [4]:
targets = {'food', 'software', 'religion', 'music', 'politics', 'math'} 

In [5]:
df = pd.DataFrame.from_dict(tags, orient='index')
df = df.loc[df.tag.isin(targets)]

In [6]:
df.head()

Unnamed: 0,title,user_count,tags,tag,mapped_tag
0012a3afc219030f26e3adb10eb5526a,Gehenna,14,{'religion': 0.2857142857142857},religion,religion
001312da1e617115951209bcd0fe1697,Galois theory,11,"{'math': 0.5454545454545454, 'mathematics': 0....",math,math
001f7d3f2890f8f2fb3ef7601a2cacaf,Syncretism,51,"{'religion': 0.27450980392156865, 'philosophy'...",religion,religion
0029a27cb935e387433556f6a54d117c,Habeas corpus,60,"{'politics': 0.2, 'words': 0.06666666666666667}",politics,politics
0088076ebf3cf27718ee17b5e8c2c063,Guitar effects,12,"{'music': 0.4166666666666667, 'science': 0.083...",music,music


In [7]:
df.tag.value_counts()

music       985
politics    619
religion    605
math        449
software    443
food        304
Name: tag, dtype: int64

Because each document is multi-labelled, some documents may still carry labels from more than one class in our target list.

The 'tag' column shown above is the dominant label for each document (the tag with the highest score, see 01_cluster_tags) and will be our ground truth. We now filter out any document whose 'tags' column (a dict of all tags for that doc) contains one or more of the other target tags, so that a misclassification cannot be explained by the document simply being multi-labelled. This gives us a simple test case for benchmarking our models.
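To make the filter condition concrete, here is a minimal sketch on two made-up records (the ids `doc_a` and `doc_b`, their scores, and the helper `other_targets` are hypothetical and not part of the dataset or the notebook). A record counts as ambiguous when its 'tags' dict contains any target tag other than its dominant 'tag'.

```python
import pandas as pd

targets = {'food', 'software', 'religion', 'music', 'politics', 'math'}

# Hypothetical toy records, for illustration only.
toy = pd.DataFrame(
    {
        'tags': [
            {'music': 0.6, 'software': 0.5},    # carries a second target tag
            {'math': 0.5, 'mathematics': 0.3},  # 'mathematics' is not a target
        ],
        'tag': ['music', 'math'],
    },
    index=['doc_a', 'doc_b'],
)


def other_targets(row):
    """Return the target tags on this row other than its dominant tag."""
    return (targets - {row.tag}) & row.tags.keys()


# True where a document carries at least one competing target tag.
is_ambiguous = toy.apply(lambda row: len(other_targets(row)) > 0, axis=1)
```

Here `doc_a` is flagged because 'software' is a competing target, while `doc_b` is kept because 'mathematics' is not in the target set.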

In [8]:
df[df.apply(lambda x: len((targets - {x.tag}) & x.tags.keys()) > 0, axis=1)]

Unnamed: 0,title,user_count,tags,tag,mapped_tag
023127c2dce678f21ea7b87dabe5cc71,Pete Seeger,20,"{'music': 0.4, 'politics': 0.1, 'biography': 0...",music,music
0233336c1e06fa0f3fc17a83b4d7c626,Scorewriter,12,"{'music': 0.6666666666666666, 'software': 0.5,...",music,music
0344201ccd47f440016393d24960b527,Condorcet method,53,"{'politics': 0.41509433962264153, 'math': 0.16...",politics,politics
035e4cd56dd70a26f53e8cacecdf4871,Home Taping Is Killing Music,41,"{'music': 0.4634146341463415, 'history': 0.195...",music,music
079b1471a99c79eac5587a7c3ee7531d,Last.fm,121,"{'music': 0.5041322314049587, 'web2.0': 0.1322...",music,music
07daff5529134e4723e595fbdea273c3,Rockbox,66,"{'music': 0.2727272727272727, 'software': 0.19...",music,music
0988858ab4b02605d452ee7cbdd16dae,Voting paradox,10,"{'politics': 0.4, 'math': 0.3, 'mathematics': ...",politics,politics
09e0bdec4922049cf5c54427eb140748,iTunes version history,28,"{'software': 0.32142857142857145, 'history': 0...",software,
0aa029bba46bd6ca34e4e11f519ec966,Fred Phelps,14,"{'religion': 0.42857142857142855, 'people': 0....",religion,religion
0b5f461c681d0ddaa61a0e222bca77f5,Discrete Fourier transform,28,"{'math': 0.32142857142857145, 'music': 0.07142...",math,math


The biggest crossover appears to come from music-related software. We will now exclude all records that carry more than one target tag.

In [9]:
df = df[df.apply(lambda x: len((targets - {x.tag}) & x.tags.keys()) == 0, axis=1)]

In [10]:
df.tag.value_counts() 

music       894
politics    581
religion    535
math        428
software    416
food        300
Name: tag, dtype: int64

Balancing the classes by downsampling each one to 300 documents, the size of the smallest class (food).

In [11]:
df = (df.groupby('tag')
        .apply(lambda x: x.sample(300, random_state=0))
        .reset_index(level=0, drop=True)) 

In [12]:
df.tag.value_counts() 

politics    300
math        300
religion    300
food        300
music       300
software    300
Name: tag, dtype: int64
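As an aside, the hard-coded 300 could be derived from the data instead. The following is a minimal sketch on made-up labels (the ids and the name `n_per_class` are hypothetical), assuming pandas 1.1 or later, where `DataFrameGroupBy.sample` is available. It is an alternative to the `groupby(...).apply(...)` call above, not the method used in this notebook, and it may select different rows even with the same seed.

```python
import pandas as pd

# Hypothetical toy labels: three 'music' documents, two 'food' documents.
toy = pd.DataFrame(
    {'tag': ['music', 'music', 'music', 'food', 'food']},
    index=['m1', 'm2', 'm3', 'f1', 'f2'],
)

# Size of the smallest class; every class is downsampled to this size.
n_per_class = int(toy.tag.value_counts().min())

# Draw n_per_class rows from each class without replacement.
balanced = toy.groupby('tag').sample(n=n_per_class, random_state=0)
```

Deriving the sample size this way keeps the classes balanced if the earlier filtering step changes and the smallest class ends up with fewer than 300 documents.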

In [13]:
df.to_csv(CLF_LABELS_PATH, index=True, index_label='id', columns=['tag']) 
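The shape of the saved file can be checked with a quick round trip. This is a minimal sketch that writes made-up rows to an in-memory buffer rather than to `CLF_LABELS_PATH` (the ids `doc_a` and `doc_b` are hypothetical). It shows that only the 'tag' column is written, with the document id stored under the header 'id'.

```python
import io

import pandas as pd

# Hypothetical toy rows standing in for the balanced sample.
toy = pd.DataFrame(
    {'title': ['Gehenna', 'Galois theory'], 'tag': ['religion', 'math']},
    index=['doc_a', 'doc_b'],
)

# Same arguments as the to_csv call above, written to memory instead of disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=True, index_label='id', columns=['tag'])

# Read the labels back, restoring the document id as the index.
buffer.seek(0)
labels = pd.read_csv(buffer, index_col='id')
```

The 'title' column is dropped on write, so downstream code needs to join back to the original data by id if it needs anything beyond the label.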