#### The motivating system diagramme:
![Topics](../../Notebooks/images/Topic-Extraction-Modeling-Persistence.jpg)

One question I'd like to answer is whether (and where) we might find ready-made **Innovation-Centric** datasets _rather than building one by hand_, which would entail combing through the set of articles in our JSON collection.

Places we have found:
* ["USPTO Brief Summary Text"](https://patentsview.org/download/brf_sum_text)
* ["USPTO Detailed Summary Text"](https://patentsview.org/download/detail_desc_text)
* ["National Bureau of Economic Research: Innovation and R&D"](https://www.nber.org/taxonomy/term/656?page=1&perPage=100)
* ["Online Appendix to Technological Innovation, Resource Allocation and Growth"](https://mitsloan.mit.edu/shared/ods/documents?PublicationDocumentID=5894)


In [None]:
# !ls ../../Notebooks/images

### Some experimentation with topic detection and associated keywords
#### Latent Dirichlet Allocation
LDA is used to assign the text in a document to one or more topics. It builds a topic-per-document model and a words-per-topic model, each with a Dirichlet prior.

* Each document is modeled as a multinomial distribution of topics and each topic is modeled as a multinomial distribution of words.
* LDA assumes that every chunk of text we feed into it contains words that are somehow related. Therefore, choosing the right corpus of data is crucial.
* It also assumes documents are produced from a mixture of topics. Those topics then generate words based on their probability distribution.
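
The generative story in the bullets above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the two topics and their word probabilities (`toy_topics`) are hand-picked and hypothetical, and the topic-word distributions are fixed here rather than drawn from a Dirichlet as full LDA would do. Fitting LDA means running this story in reverse, inferring the mixtures from observed words.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution by normalising Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, doc_alpha, length, rng):
    """Generate `length` words following LDA's generative story."""
    # 1. Each document gets its own mixture over topics.
    theta = sample_dirichlet(doc_alpha, rng)
    topics = list(topic_word)
    words = []
    for _ in range(length):
        # 2. Pick a topic for this word position from the document's mixture.
        topic = rng.choices(topics, weights=theta, k=1)[0]
        # 3. Pick a word from that topic's word distribution.
        vocab, probs = zip(*topic_word[topic].items())
        words.append(rng.choices(vocab, weights=probs, k=1)[0])
    return theta, words

# Hypothetical topics with hand-picked word probabilities (illustration only).
toy_topics = {
    "space": {"orbit": 0.5, "nasa": 0.3, "launch": 0.2},
    "crypto": {"key": 0.6, "clipper": 0.25, "cipher": 0.15},
}
rng = random.Random(400)
theta, words = generate_document(toy_topics, doc_alpha=[0.5, 0.5], length=8, rng=rng)
print("topic mixture:", [round(t, 3) for t in theta])
print("words:", words)
```

A small `doc_alpha` (below 1) tends to produce documents dominated by a single topic, while larger values give more even mixtures.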

I used this as a ["spirit guide"](https://towardsdatascience.com/nlp-extracting-the-main-topics-from-your-dataset-using-lda-in-minutes-21486f5aa925); [this notebook](https://github.com/priya-dwivedi/Deep-Learning/blob/master/topic_modeling/LDA_Newsgroup.ipynb) and [the scikit-learn real-world datasets page](https://scikit-learn.org/stable/datasets/real_world.html) are also useful.


In [24]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

from pprint import pprint
# Gensim and NLTK libraries
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

# NLTK
import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import PorterStemmer  # explicit import instead of a wildcard
nltk.download('wordnet')  # corpus data needed by WordNetLemmatizer

# Stemmer
stemmer = SnowballStemmer("english")

import numpy as np
np.random.seed(400)

[nltk_data] Downloading package wordnet to /home/deleidos/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [14]:
dir(fetch_20newsgroups())

['DESCR', 'data', 'filenames', 'target', 'target_names']

In [15]:
%%time
categories = [
'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x', 
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space'
]
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, shuffle = True)
newsgroups_test = fetch_20newsgroups(subset='test',   categories=categories, shuffle = True)

print(f"Train number: {len(newsgroups_train['data'])}")
print(f"Test number: {len(newsgroups_test['data'])}")

Train number: 5309
Test number: 3534
CPU times: user 461 ms, sys: 58.3 ms, total: 520 ms
Wall time: 518 ms


In [16]:
pprint(list(newsgroups_train.target_names))


['comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space']


In [17]:
# for i in range (4000,4100):
#     print(f"newsgroups_test : {i} :\n{newsgroups_test['data'][6]}", end="\n\n")

In [18]:
print(newsgroups_train.filenames.shape, newsgroups_train.target.shape)

(5309,) (5309,)


In [19]:
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(newsgroups_train.data)
vectors.shape

In [22]:
# The extracted TF-IDF vectors are very sparse: on average only about 150 components per
# sample are non-zero (see the output below), out of the vectors.shape[1] vocabulary features.
vectors.nnz / float(vectors.shape[0])

150.02976078357506
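
To make the sparsity figure concrete, here is a minimal sketch of the same calculation on toy data. The rows and the vocabulary size (`toy_rows`, `n_features`) are hypothetical stand-ins for the real matrix; for the actual `vectors` object the density would be `vectors.nnz / (vectors.shape[0] * vectors.shape[1])`.

```python
# Hypothetical sparse rows: each dict maps a feature index to a non-zero TF-IDF weight.
toy_rows = [
    {0: 0.7, 3: 0.7},
    {1: 1.0},
    {0: 0.4, 2: 0.5, 4: 0.77},
]
n_features = 5  # assumed vocabulary size for the toy data

nnz = sum(len(row) for row in toy_rows)       # total number of stored non-zero entries
avg_nnz_per_doc = nnz / len(toy_rows)         # same quantity as vectors.nnz / vectors.shape[0]
density = nnz / (len(toy_rows) * n_features)  # fraction of the full matrix that is non-zero

print(f"average non-zeros per document: {avg_nnz_per_doc}")
print(f"density: {density:.2f}")
```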

Note 1: 
["sklearn.datasets.fetch_20newsgroups_vectorized"](https://scikit-learn.org/0.19/modules/generated/sklearn.datasets.fetch_20newsgroups_vectorized.html#sklearn.datasets.fetch_20newsgroups_vectorized) is a function which returns ready-to-use tfidf features instead of file names.

Note 2:
It is easy for a classifier to overfit on particular things that appear in the 20 Newsgroups data, such as newsgroup headers. Many classifiers achieve very high F-scores, but their results would not generalize to other documents that aren’t from this window of time.
For this reason, the functions that load 20 Newsgroups data provide a parameter called **remove**, which specifies what kinds of information to strip out of each file. **remove** should be a tuple containing any subset of ('headers', 'footers', 'quotes'), which removes headers, signature blocks, and quotation blocks respectively.

In [25]:
clf = MultinomialNB(alpha=.01)
# NOTE: subset='all' also contains the training documents, so the "test" score below
# can be inflated; use subset='test' for a true held-out evaluation.
newsgroups_test = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'), categories=categories)
vectors_test = vectorizer.transform(newsgroups_test.data)
vectors_train = vectorizer.transform(newsgroups_train.data)

# Fit once on the training vectors, then score both splits (f1_score expects y_true first)
clf.fit(vectors, newsgroups_train.target)
pred_test = clf.predict(vectors_test)
test_metrics = metrics.f1_score(newsgroups_test.target, pred_test, average='macro')
print(f"test_metrics = {test_metrics}")

pred_train = clf.predict(vectors_train)
train_metrics = metrics.f1_score(newsgroups_train.target, pred_train, average='macro')
print(f"train_metrics = {train_metrics}")


test_metrics = 0.8693128151003656
train_metrics = 0.9982998470701069


In [26]:
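def show_top10(classifier, vectorizer, categories):
    # Newer scikit-learn releases dropped get_feature_names(); prefer get_feature_names_out() when present
    get_names = getattr(vectorizer, "get_feature_names_out", None) or vectorizer.get_feature_names
    feature_names = np.asarray(get_names())
    for i, category in enumerate(categories):
        # feature_log_prob_[i] holds log P(word | class i); the older coef_ attribute mirrored it
        top10 = np.argsort(classifier.feature_log_prob_[i])[-10:]
        print("%s: %s" % (category, " ".join(feature_names[top10])))

show_top10(clf, vectorizer, newsgroups_train.target_names)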

comp.graphics: edu it for in is graphics and of to the
comp.os.ms-windows.misc: for in of is and edu it to windows the
comp.sys.ibm.pc.hardware: edu for is drive of it scsi and to the
comp.sys.mac.hardware: in it is apple and of mac edu to the
comp.windows.x: it com motif in is and of window to the
sci.crypt: it in clipper that is and key of to the
sci.electronics: for it edu you in is and of to the
sci.med: pitt edu that it in and is to of the
sci.space: edu nasa that is and in space to of the




## OLDER experimental code
### Let's assume that

    1. user has selected an article
    2. we will use the summary to gain some understanding of the topic
    3. we can then present the topics to launch a new search

In [27]:
# Helper functions for the pre-processing steps

# Lemmatize the token as a verb, then stem the result
def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# Tokenize, drop stopwords and short tokens (3 characters or fewer), then lemmatize and stem
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))

    return result

In [28]:
# A typical summary from our news feeds
summary = """
ShotSpotter evidence has increasingly been admitted in court cases around the country, now totaling some 200.\n
But even as its use has expanded in court, ShotSpotter’s technology has drawn scrutiny.\n
Some courts, too, have been less than impressed with the ShotSpotter system.\n
Nonetheless, the split three-judge panel concluded that other evidence prosecutors presented was enough to uphold Godinez’s conviction.\n
Max, who requested it, said such material could be used to cast doubt on the validity and reliability of ShotSpotter evidence in cases nationwide.
"""

In [29]:
print("Original document: ")
summary = summary.replace("\n", " ")

# Naive whitespace split, for comparison with the preprocessed tokens
words = summary.split(' ')
print(words)
print("\n\nTokenized and lemmatized document: ")
print(preprocess(summary))

Original document: 
['', 'ShotSpotter', 'evidence', 'has', 'increasingly', 'been', 'admitted', 'in', 'court', 'cases', 'around', 'the', 'country,', 'now', 'totaling', 'some', '200.', '', 'But', 'even', 'as', 'its', 'use', 'has', 'expanded', 'in', 'court,', 'ShotSpotter’s', 'technology', 'has', 'drawn', 'scrutiny.', '', 'Some', 'courts,', 'too,', 'have', 'been', 'less', 'than', 'impressed', 'with', 'the', 'ShotSpotter', 'system.', '', 'Nonetheless,', 'the', 'split', 'three-judge', 'panel', 'concluded', 'that', 'other', 'evidence', 'prosecutors', 'presented', 'was', 'enough', 'to', 'uphold', 'Godinez’s', 'conviction.', '', 'Max,', 'who', 'requested', 'it,', 'said', 'such', 'material', 'could', 'be', 'used', 'to', 'cast', 'doubt', 'on', 'the', 'validity', 'and', 'reliability', 'of', 'ShotSpotter', 'evidence', 'in', 'cases', 'nationwide.', '']


Tokenized and lemmatized document: 
['shotspott', 'evid', 'increas', 'admit', 'court', 'case', 'countri', 'total', 'expand', 'court', 'shotspott',

In [30]:
processed_docs = []
processed_docs.append(preprocess(summary))

'''
Create a dictionary from 'processed_docs' that maps each unique token to an integer id
(and records how many documents each token appears in) using gensim.corpora.Dictionary
and call it 'dictionary'
'''
dictionary = gensim.corpora.Dictionary(processed_docs)

In [31]:
'''
Checking dictionary created
'''
# Print the first 11 (id, token) pairs; items() is the standard Mapping API
for count, (k, v) in enumerate(dictionary.items()):
    if count > 10:
        break
    print(k, v)

0 admit
1 case
2 cast
3 conclud
4 convict
5 countri
6 court
7 doubt
8 draw
9 evid
10 expand


#### Gensim filter_extremes

    filter_extremes(no_below=5, no_above=0.5, keep_n=100000)

Filter out tokens that appear in

    1. fewer than no_below documents (absolute number), or
    2. more than no_above documents (fraction of total corpus size, not absolute number).

After (1) and (2), keep only the first keep_n most frequent tokens (or keep all if keep_n is None).
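
The filtering rule can be illustrated with a small pure-Python sketch. This is a simplified re-statement of the documented rule, not gensim's implementation; `filter_extremes_sketch` and `toy_docs` are hypothetical names introduced for the example.

```python
from collections import Counter

def filter_extremes_sketch(docs, no_below=5, no_above=0.5, keep_n=100000):
    """Return the tokens that survive the filtering rules (simplified sketch)."""
    num_docs = len(docs)
    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter(token for doc in docs for token in set(doc))
    survivors = [
        token for token, df in doc_freq.items()
        if df >= no_below and df <= no_above * num_docs
    ]
    # Keep the keep_n tokens with the highest document frequency (all if keep_n is None).
    survivors.sort(key=lambda token: (-doc_freq[token], token))
    return survivors if keep_n is None else survivors[:keep_n]

toy_docs = [
    ["court", "evid", "case"],
    ["court", "evid", "panel"],
    ["court", "technolog"],
    ["court", "evid", "doubt"],
]
kept = filter_extremes_sketch(toy_docs, no_below=2, no_above=0.75)
print(kept)  # ['evid']: 'court' is in every document, the other tokens are in only one

# With a single document, thresholds such as no_below=15 leave nothing behind.
single_doc = filter_extremes_sketch(toy_docs[:1], no_below=15, no_above=0.1)
print(single_doc)  # []
```

The second call shows why the thresholds only make sense for a corpus with many documents: a token cannot appear in 15 documents when only one exists.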

In [32]:
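'''
OPTIONAL STEP
Remove very rare and very common words:

- words appearing in fewer than 15 documents
- words appearing in more than 10% of all documents

NOTE: 'processed_docs' holds a single summary here, so these thresholds remove every token;
skip this step (or relax the thresholds) unless the dictionary is built from a larger corpus.
'''
dictionary.filter_extremes(no_below=15, no_above=0.1, keep_n=100000)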

#### Gensim doc2bow

    doc2bow(document)

  * Convert document (a list of words) into the bag-of-words format = list of (token_id, token_count) 2-tuples. Each word is assumed to be a tokenized and normalized string (either unicode or utf8-encoded). No further preprocessing is done on the words in document; apply tokenization, stemming etc. before calling this method.
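
As a rough illustration of the bag-of-words format, the sketch below counts tokens with `collections.Counter` and emits sorted `(token_id, token_count)` pairs. It is a stand-in for the idea rather than gensim's code; `doc2bow_sketch`, `toy_token2id` and `toy_tokens` are hypothetical, with ids chosen to match the dictionary preview printed earlier.

```python
from collections import Counter

def doc2bow_sketch(tokens, token2id):
    """Map a tokenised document to sorted (token_id, token_count) pairs."""
    # Tokens missing from the vocabulary are skipped in this sketch.
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Hypothetical vocabulary, using ids consistent with the dictionary preview above.
toy_token2id = {"case": 1, "court": 6, "evid": 9}
toy_tokens = ["court", "evid", "court", "case", "shotspott"]

bow = doc2bow_sketch(toy_tokens, toy_token2id)
print(bow)  # [(1, 1), (6, 2), (9, 1)]
```

Word order is discarded: only which tokens occur, and how often, is kept.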

In [33]:
'''
Create the bag-of-words model for each document, i.e. for each document we create a list of
(token_id, token_count) pairs reporting which words appear and how many times. Save this to 'bow_corpus'
'''
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

In [34]:
'''
Preview BOW for our sample preprocessed document
'''
document_num = 0
bow_doc_x = bow_corpus[document_num]

# Each entry of the bag-of-words document is a (token_id, token_count) pair
for token_id, token_count in bow_doc_x:
    print("Word {} (\"{}\") appears {} time.".format(token_id,
                                                     dictionary[token_id],
                                                     token_count))