<a href="https://colab.research.google.com/github/brianray/topic_time_models/blob/master/Topic_Model_over_Time_Example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


Full article: https://medium.com/@brianray_7981/king-man-queen-5-saussures-semiotics-applied-to-modern-natural-language-processing-and-eb3b1fcd33a0



Credits



*   Download script [link](https://github.com/bmabey/pyLDAvis/blob/master/notebooks/Gensim%20Newsgroup.ipynb)



In [0]:
%%bash
mkdir -p data
pushd data
if [ -d "20news-bydate-train" ]
then
  echo "The data has already been downloaded..."
else
  wget http://qwone.com/%7Ejason/20Newsgroups/20news-bydate.tar.gz
  tar xfv 20news-bydate.tar.gz
  rm 20news-bydate.tar.gz
fi
echo "Lets take a look at the groups..."
ls 20news-bydate-train/
popd

/content/data /content
The data has already been downloaded...
Lets take a look at the groups...
alt.atheism
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
misc.forsale
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
soc.religion.christian
talk.politics.guns
talk.politics.mideast
talk.politics.misc
talk.religion.misc
/content


In [0]:
from glob import glob
import re
import string
import funcy as fp
from gensim import models
from gensim.corpora import Dictionary, MmCorpus
import nltk
import pandas as pd

In [0]:
!pip install funcy

Collecting funcy
  Downloading https://files.pythonhosted.org/packages/47/a4/204fa23012e913839c2da4514b92f17da82bf5fc8c2c3d902fa3fa3c6eec/funcy-1.11-py2.py3-none-any.whl
Installing collected packages: funcy
Successfully installed funcy-1.11


In [0]:
# quick and dirty....
EMAIL_REGEX = re.compile(r"[a-z0-9\.\+_-]+@[a-z0-9\._-]+\.[a-z]*")
FILTER_REGEX = re.compile(r"[^a-z '#]")
TOKEN_MAPPINGS = [(EMAIL_REGEX, "#email"), (FILTER_REGEX, ' ')]

def tokenize_line(line):
    res = line.lower()
    for regexp, replacement in TOKEN_MAPPINGS:
        res = regexp.sub(replacement, res)
    return res.split()
    
def tokenize(lines, token_size_filter=2):
    tokens = fp.mapcat(tokenize_line, lines)
    return [t for t in tokens if len(t) > token_size_filter]
    

def load_doc(filename):
    group, doc_id = filename.split('/')[-2:]
    with open(filename, errors='ignore') as f:
        doc = f.readlines()
    return {'group': group,
            'doc': doc,
            'tokens': tokenize(doc),
            'id': doc_id}


docs = pd.DataFrame(list(map(load_doc, glob('data/20news-bydate-train/*/*')))).set_index(['group','id'])
docs.head()

group,id,doc,tokens
sci.crypt,15613,"[From: mcbay@clam.com (George McBay)\n, Subjec...","[from, #email, george, mcbay, subject, what, t..."
sci.crypt,14992,[Subject: Re: Illegal Wiretaps (was Denning's ...,"[subject, illegal, wiretaps, was, denning's, t..."
sci.crypt,15726,"[From: croley@magic.mcc.com (David Croley)\n, ...","[from, #email, david, croley, subject, new, en..."
sci.crypt,15616,"[From: rdippold@qualcomm.com (Ron ""Asbestos"" D...","[from, #email, ron, asbestos, dippold, subject..."
sci.crypt,15339,[From: pmetzger@snark.shearson.com (Perry E. M...,"[from, #email, perry, metzger, subject, once, ..."
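To make the tokenizer's behaviour concrete, here is a minimal standalone sketch that repeats the regexes from the cell above and runs them on two made-up header lines. The sample address and text are hypothetical, and `itertools.chain.from_iterable` stands in for `funcy.mapcat` only so the sketch has no extra dependency.

```python
import re
from itertools import chain

# Same patterns as the cell above, repeated so this sketch runs on its own.
EMAIL_REGEX = re.compile(r"[a-z0-9\.\+_-]+@[a-z0-9\._-]+\.[a-z]*")
FILTER_REGEX = re.compile(r"[^a-z '#]")
TOKEN_MAPPINGS = [(EMAIL_REGEX, "#email"), (FILTER_REGEX, " ")]

def tokenize_line(line):
    res = line.lower()
    for regexp, replacement in TOKEN_MAPPINGS:
        res = regexp.sub(replacement, res)
    return res.split()

def tokenize(lines, token_size_filter=2):
    # Assumption: chain.from_iterable replaces funcy.mapcat in this sketch.
    tokens = chain.from_iterable(map(tokenize_line, lines))
    return [t for t in tokens if len(t) > token_size_filter]

# Hypothetical input lines, shaped like a newsgroup post header.
sample_lines = [
    "From: alice@example.com (Alice Smith)\n",
    "Subject: Re: What's new in 3D?\n",
]
sample_tokens = tokenize(sample_lines)
print(sample_tokens)

# tokenize() expects an iterable of lines. A bare string is iterated
# character by character, so nothing survives the length filter.
print(tokenize("What's new in 3D?"))
# Wrapping the string in a list gives the intended result.
print(tokenize(["What's new in 3D?"]))
```

Email addresses collapse to the `#email` placeholder, digits and punctuation become spaces, and tokens of two characters or fewer are dropped.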


In [0]:
def nltk_stopwords():
    return set(nltk.corpus.stopwords.words('english'))

def prep_corpus(docs, additional_stopwords=None, no_below=5, no_above=0.5):
    print('Building dictionary...')
    dictionary = Dictionary(docs)
    stopwords = nltk_stopwords().union(additional_stopwords or set())
    # Look up ids only for stopwords that actually occur in the dictionary.
    stopword_ids = [dictionary.token2id[w] for w in stopwords if w in dictionary.token2id]
    dictionary.filter_tokens(stopword_ids)
    dictionary.compactify()
    # Drop tokens seen in fewer than `no_below` documents or in more than
    # a `no_above` fraction of them; keep_n=None keeps everything that remains.
    dictionary.filter_extremes(no_below=no_below, no_above=no_above, keep_n=None)
    dictionary.compactify()

    print('Building corpus...')
    corpus = [dictionary.doc2bow(doc) for doc in docs]

    return dictionary, corpus
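As a rough illustration of what the `no_below` / `no_above` filtering and the bag-of-words conversion do, here is a standard-library sketch on a toy corpus. It is not gensim's implementation: the helper names `filter_by_document_frequency` and `to_bow` are hypothetical, the alphabetical id assignment is an assumption made for readability, and stopword removal is left out.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below=5, no_above=0.5):
    """Keep tokens found in at least `no_below` documents and in at most
    a `no_above` fraction of all documents; return a token -> id mapping."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(docs)
    kept = sorted(t for t, n in doc_freq.items() if no_below <= n <= max_docs)
    # Assumption: ids are assigned alphabetically; gensim's ids may differ.
    return {token: idx for idx, token in enumerate(kept)}

def to_bow(doc, token2id):
    """Convert a token list into sorted (token_id, count) pairs."""
    counts = Counter(t for t in doc if t in token2id)
    return sorted((token2id[t], n) for t, n in counts.items())

# Toy corpus: "the" is in every document, "rare" and "engine" in only one.
toy_docs = [
    ["the", "key", "cipher", "key"],
    ["the", "key", "rare"],
    ["the", "cipher"],
    ["the", "engine"],
]
toy_token2id = filter_by_document_frequency(toy_docs, no_below=2, no_above=0.5)
toy_corpus = [to_bow(doc, toy_token2id) for doc in toy_docs]
print(toy_token2id)
print(toy_corpus)
```

With these settings only `cipher` and `key` survive: `the` is too common and the single-document tokens are too rare, so the last toy document ends up with an empty bag of words.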

In [0]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [0]:
dictionary, corpus = prep_corpus(docs['tokens'])


Building dictionary...
Building corpus...


In [0]:
%%time
lda = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=50, passes=10)
                                      
lda.save('newsgroups_50_lda.model')

CPU times: user 3min 44s, sys: 2min 5s, total: 5min 49s
Wall time: 3min 3s


In [0]:
import pyLDAvis
# Note: pyLDAvis 3.3+ renames this module to pyLDAvis.gensim_models.
import pyLDAvis.gensim as gensimvis

In [0]:
vis_data = gensimvis.prepare(lda, corpus, dictionary)
pyLDAvis.display(vis_data)

In [0]:

# Helper for creating the Kaggle config if one does not exist.
# Set HELP = True to be prompted for credentials.
HELP = False
import getpass
import json
import os

kaggle_json_path = os.path.expanduser("~/.kaggle/kaggle.json")
try:
  with open(kaggle_json_path, "r") as f:
    assert "username" in f.read()
except Exception as e:
  print(e)
  if HELP:
    config_dict = dict(username=getpass.getpass("Kaggle Username "),
                       key=getpass.getpass("Kaggle Key "))
    # Create ~/.kaggle if it is missing, then write the credentials.
    os.makedirs(os.path.dirname(kaggle_json_path), exist_ok=True)
    with open(kaggle_json_path, "w") as f:
      json.dump(config_dict, f)
    print("wrote to {}".format(kaggle_json_path))
  else:
    raise Exception("please fix ~/.kaggle/kaggle.json or set HELP=True")
# Restrict the file to the current user, as the Kaggle CLI expects.
os.chmod(kaggle_json_path, 0o600)
print(kaggle_json_path)

/root/.kaggle/kaggle.json


In [0]:
!kaggle competitions download -c transfer-learning-on-stack-exchange-tags

cooking.csv.zip: Skipping, found more recently modified local copy (use --force to force download)
crypto.csv.zip: Skipping, found more recently modified local copy (use --force to force download)
robotics.csv.zip: Skipping, found more recently modified local copy (use --force to force download)
biology.csv.zip: Skipping, found more recently modified local copy (use --force to force download)
sample_submission.csv.zip: Skipping, found more recently modified local copy (use --force to force download)
travel.csv.zip: Skipping, found more recently modified local copy (use --force to force download)
diy.csv.zip: Skipping, found more recently modified local copy (use --force to force download)
test.csv.zip: Skipping, found more recently modified local copy (use --force to force download)


biology.csv.zip				 newsgroups_50_lda.model.id2word
cooking.csv.zip				 newsgroups_50_lda.model.state
crypto.csv.zip				 robotics.csv.zip
data					 sample_data
diy.csv.zip				 sample_submission.csv.zip
newsgroups_50_lda.model			 test.csv.zip
newsgroups_50_lda.model.expElogbeta.npy  travel.csv.zip


In [0]:
import pandas as pd
import zipfile

zf = zipfile.ZipFile('biology.csv.zip') 
df = pd.read_csv(zf.open('biology.csv'))
df['group'] = 'biology'
print('biology')
print(len(df))

# append the others
for group in ['cooking', 'crypto', 'diy', 'robotics', 'test', 'travel']:
  zf = zipfile.ZipFile(group + '.csv.zip') 
  df2 = pd.read_csv(zf.open(group + '.csv'))
  df2['group'] = group
  df = pd.concat([df, df2])  # DataFrame.append was removed in pandas 2.0
  print(group)
  print(len(df))
df

biology
13196
cooking
28600
crypto
39032
diy
64950
robotics
67721
test
149647
travel
168926


,content,group,id,tags,title
0,"<p>In prokaryotic translation, how critical fo...",biology,1,ribosome binding-sites translation synthetic-b...,What is the criticality of the ribosome bindin...
1,<p>Does anyone have any suggestions to prevent...,biology,2,rna biochemistry,How is RNAse contamination in RNA based experi...
2,<p>Tortora writes in <em>Principles of Anatomy...,biology,3,immunology cell-biology hematology,Are lymphocyte sizes clustered in two groups?
3,<p>Various people in our lab will prepare a li...,biology,4,cell-culture,How long does antibiotic-dosed LB maintain goo...
4,<p>Are there any cases in which the splicing m...,biology,5,splicing mrna spliceosome introns exons,Is exon order always preserved in splicing?
5,<p>I'm interested in sequencing and analyzing ...,biology,6,dna biochemistry molecular-biology,How can I avoid digesting protein-bound DNA?
6,<p>I'm looking for resources or any informatio...,biology,8,neuroscience synapses,Under what conditions do dendritic spines form?
7,<p>I shipped 10 µL of my vector miniprep to a ...,biology,9,plasmids,How should I ship plasmids?
8,<p>I noticed within example experiments in cla...,biology,10,molecular-genetics gene-expression experimenta...,What is the reason behind choosing the reporte...
9,"<p>According to the endosymbiont theory, mitoc...",biology,11,evolution mitochondria chloroplasts,How many times did endosymbiosis occur?


In [0]:
df.rename(columns={'content':'doc'}, inplace=True)

In [0]:
%%time
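# tokenize() expects an iterable of lines, so wrap each post in a list.
df['tokens'] = df['doc'].apply(lambda doc: tokenize([doc]))

In [0]:
# Build the dictionary from the Stack Exchange posts (df), not the newsgroup docs.
dictionary, corpus = prep_corpus(df['tokens'])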

Building dictionary...
Building corpus...


In [0]:
%%time
lda = models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=7, passes=10)
                                      
lda.save('stack_overflow_7.model')

CPU times: user 1min 18s, sys: 12 s, total: 1min 30s
Wall time: 1min 14s


In [0]:
vis_data = gensimvis.prepare(lda, corpus, dictionary)
pyLDAvis.display(vis_data)