# Document tagging

*This notebook first appeared as a [blog post](//betatim.github.io/posts/bumping) on [Tim Head](//betatim.github.io)'s blog.*

*License: [MIT](http://opensource.org/licenses/MIT)*

*(C) 2016, Tim Head.*
*Feel free to use, distribute, and modify with the above attribution.*

In [2]:
%config InlineBackend.figure_format='retina'
%matplotlib inline

In [3]:
from collections import Counter

import numpy as np
np.random.seed(3)
import matplotlib.pyplot as plt

import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import DictVectorizer

In [117]:
fname = '/Users/thead/Downloads/Train.csv'
df = pd.read_csv(fname, nrows=100000, index_col='Id', engine='c')

In [None]:
df.head()

In [None]:
df.Tags = df.Tags.map(lambda x: x.split())

In [None]:
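def encode_tags(tags, n_tags=40):
    """Encode each row's tags as a 0/1 indicator vector over the n_tags most common tags."""
    counts = Counter()
    for row in tags:
        counts.update(row)

    # Column order: the most common tags, sorted alphabetically.
    keys = sorted(tag for tag, _ in counts.most_common(n_tags))
    # Map each kept tag to its column so lookups avoid repeated list.index() scans.
    column = {tag: j for j, tag in enumerate(keys)}

    encoded = np.zeros((len(tags), n_tags))
    for i, row in enumerate(tags):
        for tag in row:
            j = column.get(tag)
            if j is not None:
                encoded[i, j] = 1

    return encoded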
    
encode_tags(df.Tags[:10])
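
The same multi-hot encoding can probably be had from scikit-learn's `MultiLabelBinarizer`. The sketch below is one possible alternative, not what the notebook uses; `toy_tags` is made-up data standing in for `df.Tags`, and `n_tags` is set to 2 only to keep the example small.

```python
from collections import Counter

from sklearn.preprocessing import MultiLabelBinarizer

# Made-up stand-in for df.Tags: one list of tags per question.
toy_tags = [["php", "upload"], ["firefox", "php"], ["php", "api"], ["firefox", "r"]]

# Keep the n_tags most frequent tags, sorted alphabetically as in encode_tags.
n_tags = 2
counts = Counter(tag for row in toy_tags for tag in row)
keep = sorted(tag for tag, _ in counts.most_common(n_tags))

# Drop the rarer tags, then let the binarizer build the indicator matrix.
filtered = [[tag for tag in row if tag in keep] for row in toy_tags]
mlb = MultiLabelBinarizer(classes=keep)
encoded = mlb.fit_transform(filtered)

print(mlb.classes_)
print(encoded)
```

Each row of `encoded` has one column per kept tag, in the order given by `mlb.classes_`.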

In [118]:
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_code = False
        self.text = []
        
    def handle_starttag(self, tag, attrs):
        if tag == 'code':
            self.in_code = True

    def handle_endtag(self, tag):
        if tag == 'code':
            self.in_code = False

    def handle_data(self, data):
        if not self.in_code:
            self.text.append(data)


def clean_body(body):
    extractor = TextExtractor()
    extractor.feed(body)
    return ' '.join(extractor.text)


df['CleanBody'] = df.Body.map(clean_body)
        
tfidf = TfidfVectorizer(stop_words='english')

In [119]:
df['Text'] = df.Title + ' ' + df.CleanBody

In [120]:
df.head()

Id,Title,Body,Tags,CleanBody,Text
1,How to check if an uploaded file is an image w...,<p>I'd like to check if an uploaded file is an...,php image-processing file-upload upload mime-t...,I'd like to check if an uploaded file is an im...,How to check if an uploaded file is an image w...
2,How can I prevent firefox from closing when I ...,"<p>In my favorite editor (vim), I regularly us...",firefox,"In my favorite editor (vim), I regularly use c...",How can I prevent firefox from closing when I ...
3,R Error Invalid type (list) for variable,<p>I am import matlab file and construct a dat...,r matlab machine-learning,I am import matlab file and construct a data f...,R Error Invalid type (list) for variable I am ...
4,How do I replace special characters in a URL?,"<p>This is probably very simple, but I simply ...",c# url encoding,"This is probably very simple, but I simply can...",How do I replace special characters in a URL? ...
5,How to modify whois contact details?,<pre><code>function modify(.......)\n{\n $mco...,php api file-get-contents,"\n\n using this modify function, displays warn...",How to modify whois contact details? \n\n usin...


In [None]:
tags = Counter()
for v in df.Tags:
    tags.update(v)
    
tags.most_common(40)

In [None]:
len(tags.keys())

In [None]:
# Centre one bin on each integer tag count (1 to 5) so counts of 4 and 5 are not merged.
plt.hist(df.Tags.apply(len), range=(0.5, 5.5), bins=5)

In [None]:
# Number of questions for each tag count.
df.Tags.apply(len).value_counts().sort_index()

In [28]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups

import pyLDAvis
pyLDAvis.enable_notebook()


In [106]:
#dataset = fetch_20newsgroups(shuffle=True, random_state=1,
#                             remove=('headers', 'footers', 'quotes'))


cats = ['talk.religion.misc', 'alt.atheism', 'comp.graphics', 'sci.med', 'sci.space']
dataset = fetch_20newsgroups(subset='train', categories=cats,
                             shuffle=True, random_state=1,
                             remove=('headers', 'footers', 'quotes'))

In [107]:
Counter(dataset.target)

Counter({0: 480, 1: 584, 2: 594, 3: 593, 4: 377})

In [108]:
dataset.target_names

['alt.atheism', 'comp.graphics', 'sci.med', 'sci.space', 'talk.religion.misc']

In [125]:
def norm(a):
    a = np.asarray(a)
    return a/(a.sum(axis=1)[:,np.newaxis])

def norm_(a):
    a = np.asarray(a)
    return a/a.sum()

vect = CountVectorizer(stop_words='english', max_df=0.95, min_df=2)
# `n_topics` was renamed to `n_components` in newer scikit-learn releases.
lda = LatentDirichletAllocation(n_components=10)

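# Corpus to model: the first 1000 questions from Train.csv.
# Use `docs = dataset['data']` instead to model the 20 newsgroups subset.
docs = df.Text[:1000]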

vectorised = vect.fit_transform(docs)
doc_topic_prob = lda.fit_transform(vectorised)
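
scikit-learn's `components_` holds unnormalised topic-word weights, which is why `norm` rescales each row before the matrix is treated as a probability distribution. A minimal sketch of that row normalisation, using a made-up 2x3 array in place of `lda.components_`:

```python
import numpy as np


def norm(a):
    # Divide every row by its own sum so that each row adds up to 1.
    a = np.asarray(a, dtype=float)
    return a / a.sum(axis=1, keepdims=True)


# Made-up weights: 2 topics over a 3-word vocabulary.
toy_components = np.array([[2.0, 1.0, 1.0],
                           [0.5, 0.5, 4.0]])

topic_term = norm(toy_components)
row_sums = topic_term.sum(axis=1)

print(topic_term)
print(row_sums)
```

Here `keepdims=True` plays the same role as the `[:, np.newaxis]` indexing in the cell above: it keeps the row sums as a column so the division broadcasts row by row.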

In [126]:
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()
    
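# `get_feature_names()` was replaced by `get_feature_names_out()` in newer scikit-learn releases.
print_top_words(lda, vect.get_feature_names_out(), 15)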

Topic #0:
xml iphone api object want code application user development rows use exception app using entity
Topic #1:
javascript jquery use want using like code thread way html page create rails value thanks
Topic #2:
android page image problem code set file using method type make user facebook text css
Topic #3:
http php data com array 00 form using like 10 ve number code want string
Topic #4:
frac infty mathrm tests active session cdot integral int_ dx int_0 pdf question prove integration
Topic #5:
server database table sql windows using error access client connection id data install tables network
Topic #6:
java node function org nodes tree hibernate axis parent 1234 case graph attribute prefix length
Topic #7:
mode report random software crystal mac good restart emacs spanned numbers animation sleep nginx volume
Topic #8:
files width file swf utf detect scale height makefile segmentation windows repair rotation vista copied
Topic #9:
like using code use file way application want try
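
The slice `topic.argsort()[:-n_top_words - 1:-1]` in `print_top_words` picks the indices of the largest weights, largest first. A small sketch of how it behaves, with a made-up vocabulary and weight vector rather than the fitted model:

```python
import numpy as np

# Made-up vocabulary and topic weights, for illustration only.
feature_names = ["api", "code", "file", "java"]
topic = np.array([0.1, 0.5, 0.2, 0.9])
n_top_words = 2

# argsort() orders indices by ascending weight; the negative-step slice
# walks back from the end and stops after n_top_words entries.
top_idx = topic.argsort()[:-n_top_words - 1:-1]
top_words = [feature_names[i] for i in top_idx]

print(top_idx)
print(top_words)
```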

In [123]:
# pyLDAvis.prepare expects: topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency
opts = dict(vocab=vect.get_feature_names_out(),
            doc_topic_dists=norm(doc_topic_prob),
            # Document length is the total token count per document, not the number of distinct terms.
            doc_lengths=np.asarray(vectorised.sum(axis=1)).ravel(),
            topic_term_dists=norm(lda.components_),
            term_frequency=norm_(vectorised.sum(axis=0).tolist()[0]),)

In [124]:
import warnings
warnings.filterwarnings('ignore')
pyLDAvis.prepare(**opts, mds='tsne')