# Background of the work

The Institute of Materials Research and Engineering (IMRE) would like to extract the core skills of its staff from their scientific publications in peer-reviewed journals. A record of these publications is available in the "Publication Release Form" (PRF) database. This work extracts keywords from each publication's title and abstract, and matches (stores) these keywords with the respective first authors, who are assumed to be the experts in the subject. That is, a staff member's expertise is taken to be constituted by the contents of their first-author publications.

# Import libraries

In [1]:
import pandas as pd
import numpy as np
import re
import math
import collections
import itertools, nltk, string
import random
from operator import itemgetter 
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords
import gensim
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import text 

# Define functions

In [2]:
def cleanColNames(vector): # vector = PRF.columns
    newCol = []
    for i in vector:
        i = i.replace(' ', '_')
        i = i.replace('(', '')
        i = i.replace(')', '')
        i = i.replace('/', '_or_')
        newCol.append(i)
    return newCol

In [3]:
def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

In [4]:
def getAuthorPubKeywords(index, tfidfModel, fit):
    # Relies on the global PRF_merged dataframe built further down in the notebook.
    features = tfidfModel.get_feature_names() # look up the vocabulary once
    row = fit[index].toarray()[0] # dense TF-IDF weights of this document only
    nonzero = [ind for ind, weight in enumerate(row) if weight != 0]
    doc_keywords = pd.DataFrame(
    {'first_author': PRF_merged.Staff_Name[index],
     'corr_author': PRF_merged.Corresponding_Authors[index],
     'date': PRF_merged.Publication_Date[index],
     'keywords': [features[ind] for ind in nonzero],
     'tfidf': [row[ind] for ind in nonzero]
    })
    return doc_keywords.sort_values(by = 'tfidf', ascending = False).reset_index()


# Get, preliminarily clean and explore data

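The "PRF" (Publication Release Form) system stores all publication information. Two extracts are used here: one listing first authors and one listing corresponding authors.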

In [5]:
PRF_first = pd.read_csv('/Users/yingjiang/Dropbox/Learnings/Stats_data/Projects/IMRE_work/Manpower/PRF_clean_first.csv')
PRF_corr = pd.read_csv('/Users/yingjiang/Dropbox/Learnings/Stats_data/Projects/IMRE_work/Manpower/PRF_clean_corr.csv')

In [6]:
PRF_first.columns = cleanColNames(PRF_first.columns)
PRF_corr.columns = cleanColNames(PRF_corr.columns)

In [7]:
PRF_first.head()

Unnamed: 0,Staff_Name,Title_of_Paper,Publication_Release_Number,Significance_of_Paper,Publication_Date,Project_Finance_Code,Project_Title
0,Afriyanti SUMBOJA,Progress in Development of Flexible Metal-Air ...,AS/15-736,The review paper discuss the latest developmen...,4/1/16,IMRE/12-2P0503,Development of Zinc-air Rechargeable Batteries
1,Afriyanti SUMBOJA,Manganese oxide catalyst grown on carbon paper...,AS/15-056,The directly grown MnOx catalyst is grown via ...,8/1/15,IMRE/12-2P0503,Development of Zinc-air Rechargeable Batteries
2,Agata Maria BRZOZOWSKA,Biomimicking Micropatterned Surfaces and Their...,PF/14-251,Paper describes and sumarizes research being a...,7/1/14,IMRE/10-2C0209,IMAS WP6 - Surface Topology / Morphology Engin...
3,Albertus Denny HANDOKO,Hydrothermal Growth of Piezoelectrically Activ...,DG/12-341,"First demonstration of how to make a (Na,K)nbO...",1/1/13,,
4,AN Tao,Co3O4 Nanoparticles grown on N-doped Vulcan Ca...,AS/15-177,"The Co3O4/NVC hybrid represents an efficient, ...",9/1/15,IMRE/12-2P0504,Advanced Nano-structured Porous Materials and ...


In [8]:
PRF_corr.head()

Unnamed: 0,Title_of_Paper,Journal_Title,Publication_Release_Number,Significance_of_Paper,Corresponding_Authors
0,Characteristics of InAs/InGaAs/GaAs QDs on GeO...,Journal of Physics D: Applied Physics,PF/12-679,To realize high-speed optical interconnects fo...,Andrew Ngo Chun Yong
1,Bandgap engineering of 1.3µm quantum dot struc...,Journal of Crystal Growth,PF/10-558,Investigated the effects of growth parameters ...,Andrew Ngo Chun Yong
2,Electro-absorption characteristics of single-m...,IEEE Photonics Technology Letters,PF/10-625,To characterize the electro-absorption propert...,Andrew Ngo Chun Yong
3,Effects of annealing and p-doping on the two-s...,IEEE T. Nanotechnol.,PF/12-674,"In this letter, we investigate the effects of ...",Andrew NGO Chun Yong
4,Semiconductor Quantum Dots for Photonic Modula...,Scholar's Press,PF/14-522,"This book describes the growth,fabrication and...",Andrew Ngo Chun Yong


In [9]:
missing_authors = [ind for ind, auth in enumerate(PRF_first.Staff_Name) if auth.lower() == 'nan']
missing_titles = [ind for ind, title in enumerate(PRF_first.Title_of_Paper) if title.lower() == 'nan']
missing_abstracts = [ind for ind, abst in enumerate(PRF_first.Significance_of_Paper) if abst.lower() == 'nan']
print missing_authors, missing_titles, missing_abstracts

missing_authors = [ind for ind, auth in enumerate(PRF_corr.Corresponding_Authors) if auth.lower() == 'nan']
missing_titles = [ind for ind, title in enumerate(PRF_corr.Title_of_Paper) if title.lower() == 'nan']
missing_abstracts = [ind for ind, abst in enumerate(PRF_corr.Significance_of_Paper) if abst.lower() == 'nan']
print missing_authors, missing_titles, missing_abstracts

[] [] []
[] [] []


No missing values.
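The check above compares each cell with the string 'nan', which works here because every cell already holds text; a real NaN float has no `.lower()` method. A minimal sketch of a more direct test follows, assuming pandas is available. The `toy` frame is a made-up stand-in for `PRF_first`, and the block is written for Python 3, unlike the Python 2 cells of this notebook.

```python
import pandas as pd

# Hypothetical stand-in for PRF_first; the real frame is read from CSV above.
toy = pd.DataFrame({
    'Staff_Name': ['A', None, 'C'],
    'Title_of_Paper': ['t1', 't2', 't3'],
    'Significance_of_Paper': ['s1', 's2', None],
})

cols = ['Staff_Name', 'Title_of_Paper', 'Significance_of_Paper']

# isnull() flags real missing values (NaN/None); sum() counts them per column.
missing_counts = toy[cols].isnull().sum()

# Row labels that have a missing value, listed per column.
missing_rows = {col: toy.index[toy[col].isnull()].tolist() for col in cols}

print(missing_counts.to_dict())
print(missing_rows)
```

An empty list for every column would correspond to the `[] [] []` output printed by the notebook.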

In [10]:
for ind, i in enumerate(PRF_first.Publication_Release_Number):
    if i in PRF_corr.Publication_Release_Number.values:
        print ind

0
3
6
7
8
9
10
11
12
13
14
15
16
18
19
20
25
26
27
28
30
31
32
33
34
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
59
60
61
62
63
65
66
67
68
69
70
71
72
74
75
76
77
78
79
80
82
83
84
85
86
87
89
92
93
94
96
97
98
99
100
101
103
105
106
107
108
109
110
111
112
115
116
122
123
124
125
126
127
128
129
130
131
132
135
136
137
138
139
140
141
142
143
144
145
146
147
149
150
152
153
154
155
156
157
161
162
163
164
165
166
167
168
169
171
172
173
174
175
176
177
178
179
181
182
184
185
186
187
188
189
190
191
192
193
194
195
196
197
199
200
201
202
203
204
205
206
207
208
209
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
256
257
258
260
261
262
263
264
265
266
267
268
269
270
271
272
274
275
276
277
278
279
280
283
284
286
287
288
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
315
316
317
318
319
320
321
322
326
327
328
329
330
331
332
333
335


Many publications appear in both tables: 271 of the 508 first-author rows have a release number that also appears in the corresponding-author table.
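Rather than printing every overlapping row index, the overlap can be counted directly. The sketch below uses plain Python lists and a set; the release numbers are toy values chosen for illustration, not the real tables, and the code is written for Python 3.

```python
# Toy release numbers; in the notebook these would come from
# PRF_first.Publication_Release_Number and PRF_corr.Publication_Release_Number.
first_ids = ['AS/15-736', 'AS/15-056', 'PF/14-251', 'DG/12-341']
corr_ids = ['PF/12-679', 'AS/15-736', 'DG/12-341', 'DG/12-341']

# A set gives constant-time membership tests and ignores repeated entries.
corr_set = set(corr_ids)

# Positions in the first-author list whose release number also occurs
# in the corresponding-author list.
overlap_positions = [pos for pos, rel in enumerate(first_ids) if rel in corr_set]

print(len(overlap_positions), overlap_positions)
```

Reporting the count alongside the positions makes the size of the overlap visible at a glance.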

In [11]:
PRF_merged = pd.merge(left = PRF_first,
                      right = PRF_corr,
                      how = 'outer',
                      left_on = ['Publication_Release_Number', 'Title_of_Paper', 'Significance_of_Paper'],
                      right_on = ['Publication_Release_Number', 'Title_of_Paper', 'Significance_of_Paper'])

In [12]:
PRF_merged.head(20)

Unnamed: 0,Staff_Name,Title_of_Paper,Publication_Release_Number,Significance_of_Paper,Publication_Date,Project_Finance_Code,Project_Title,Journal_Title,Corresponding_Authors
0,Afriyanti SUMBOJA,Progress in Development of Flexible Metal-Air ...,AS/15-736,The review paper discuss the latest developmen...,4/1/16,IMRE/12-2P0503,Development of Zinc-air Rechargeable Batteries,Functional Materials Letters,LIU ZHao Lin
1,Afriyanti SUMBOJA,Progress in Development of Flexible Metal-Air ...,AS/15-736,The review paper discuss the latest developmen...,4/1/16,IMRE/12-2P0503,Development of Zinc-air Rechargeable Batteries,Functional Materials Letters,ZONG Yun
2,Afriyanti SUMBOJA,Manganese oxide catalyst grown on carbon paper...,AS/15-056,The directly grown MnOx catalyst is grown via ...,8/1/15,IMRE/12-2P0503,Development of Zinc-air Rechargeable Batteries,,
3,Agata Maria BRZOZOWSKA,Biomimicking Micropatterned Surfaces and Their...,PF/14-251,Paper describes and sumarizes research being a...,7/1/14,IMRE/10-2C0209,IMAS WP6 - Surface Topology / Morphology Engin...,,
4,Albertus Denny HANDOKO,Hydrothermal Growth of Piezoelectrically Activ...,DG/12-341,"First demonstration of how to make a (Na,K)nbO...",1/1/13,,,CrystEngComm,Gregory GOH Kia Liang
5,AN Tao,Co3O4 Nanoparticles grown on N-doped Vulcan Ca...,AS/15-177,"The Co3O4/NVC hybrid represents an efficient, ...",9/1/15,IMRE/12-2P0504,Advanced Nano-structured Porous Materials and ...,,
6,Andrew NG Ming Hua,Highly Sensitive Reduced Graphene Oxide Microe...,PF/14-426,We report a flexible reduced graphene oxide mi...,3/1/15,IMRE/12-1P0904,Scalable Patterning of Graphene,,
7,Andrew Ngo Chun Yong,Bandgap engineering of 1.3µm quantum dot struc...,PF/10-558,Investigated the effects of growth parameters ...,5/1/11,IMRE/08-8C0301,Compact Terahertz Source for Frequency Domain ...,Journal of Crystal Growth,Andrew Ngo Chun Yong
8,Andrew Ngo Chun Yong,Electro-absorption characteristics of single-m...,PF/10-625,To characterize the electro-absorption propert...,12/1/10,,,IEEE Photonics Technology Letters,Andrew Ngo Chun Yong
9,Andrew Ngo Chun Yong,Semiconductor Quantum Dots for Photonic Modula...,PF/14-522,"This book describes the growth,fabrication and...",2/1/14,,,Scholar's Press,Andrew Ngo Chun Yong


After the join, some rows have a missing first author or a missing corresponding author (i.e. one or the other isn't an IMRE author). These missing values are replaced with the string 'nonIMRE'.
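The effect of the outer join and the 'nonIMRE' replacement can be sketched on two toy tables. The names and release numbers below are invented, and `indicator=True` is an addition for illustration (the notebook does not use it): it records which side each row came from. The block assumes pandas and Python 3.

```python
import pandas as pd

# Invented miniature versions of the first-author and corresponding-author tables.
first = pd.DataFrame({'Publication_Release_Number': ['P1', 'P2'],
                      'Staff_Name': ['Alice', 'Bob']})
corr = pd.DataFrame({'Publication_Release_Number': ['P2', 'P3'],
                     'Corresponding_Authors': ['Carol', 'Dave']})

# Outer join keeps papers found in either table; the _merge column
# says 'left_only', 'right_only' or 'both' for each row.
merged = pd.merge(first, corr, how='outer',
                  on='Publication_Release_Number', indicator=True)

# Authors absent from one side are missing after the join; label them explicitly.
merged['Staff_Name'] = merged['Staff_Name'].fillna('nonIMRE')
merged['Corresponding_Authors'] = merged['Corresponding_Authors'].fillna('nonIMRE')

print(merged)
```

Only the paper present in both toy tables keeps two real names; the other two rows receive 'nonIMRE' on the side that was absent.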

In [13]:
# fillna on the column avoids chained assignment, which may act on a copy of the data
PRF_merged['Staff_Name'] = PRF_merged['Staff_Name'].fillna('nonIMRE')
PRF_merged['Corresponding_Authors'] = PRF_merged['Corresponding_Authors'].fillna('nonIMRE')

In [14]:
missing_first = [ind for ind, auth in enumerate(PRF_merged.Staff_Name) if str(auth).lower() == 'nan']
missing_corr = [ind for ind, auth in enumerate(PRF_merged.Corresponding_Authors) if str(auth).lower() == 'nan']
print missing_first, missing_corr

[] []


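Which rows list the same IMRE staff member as both first and corresponding author? (The comparison is an exact string match, so names that differ in capitalisation or whitespace are not counted.)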

In [15]:
both_IMRE = [ind for ind, auth in enumerate(PRF_merged.Staff_Name) if auth == PRF_merged.Corresponding_Authors[ind]]
len(both_IMRE)

41

Strip leading and trailing whitespace from corresponding author names:

In [16]:
PRF_merged['Corresponding_Authors'] = PRF_merged['Corresponding_Authors'].apply(lambda x: x.strip())

In [17]:
PRF_merged.head(20)

Unnamed: 0,Staff_Name,Title_of_Paper,Publication_Release_Number,Significance_of_Paper,Publication_Date,Project_Finance_Code,Project_Title,Journal_Title,Corresponding_Authors
0,Afriyanti SUMBOJA,Progress in Development of Flexible Metal-Air ...,AS/15-736,The review paper discuss the latest developmen...,4/1/16,IMRE/12-2P0503,Development of Zinc-air Rechargeable Batteries,Functional Materials Letters,LIU ZHao Lin
1,Afriyanti SUMBOJA,Progress in Development of Flexible Metal-Air ...,AS/15-736,The review paper discuss the latest developmen...,4/1/16,IMRE/12-2P0503,Development of Zinc-air Rechargeable Batteries,Functional Materials Letters,ZONG Yun
2,Afriyanti SUMBOJA,Manganese oxide catalyst grown on carbon paper...,AS/15-056,The directly grown MnOx catalyst is grown via ...,8/1/15,IMRE/12-2P0503,Development of Zinc-air Rechargeable Batteries,,nonIMRE
3,Agata Maria BRZOZOWSKA,Biomimicking Micropatterned Surfaces and Their...,PF/14-251,Paper describes and sumarizes research being a...,7/1/14,IMRE/10-2C0209,IMAS WP6 - Surface Topology / Morphology Engin...,,nonIMRE
4,Albertus Denny HANDOKO,Hydrothermal Growth of Piezoelectrically Activ...,DG/12-341,"First demonstration of how to make a (Na,K)nbO...",1/1/13,,,CrystEngComm,Gregory GOH Kia Liang
5,AN Tao,Co3O4 Nanoparticles grown on N-doped Vulcan Ca...,AS/15-177,"The Co3O4/NVC hybrid represents an efficient, ...",9/1/15,IMRE/12-2P0504,Advanced Nano-structured Porous Materials and ...,,nonIMRE
6,Andrew NG Ming Hua,Highly Sensitive Reduced Graphene Oxide Microe...,PF/14-426,We report a flexible reduced graphene oxide mi...,3/1/15,IMRE/12-1P0904,Scalable Patterning of Graphene,,nonIMRE
7,Andrew Ngo Chun Yong,Bandgap engineering of 1.3µm quantum dot struc...,PF/10-558,Investigated the effects of growth parameters ...,5/1/11,IMRE/08-8C0301,Compact Terahertz Source for Frequency Domain ...,Journal of Crystal Growth,Andrew Ngo Chun Yong
8,Andrew Ngo Chun Yong,Electro-absorption characteristics of single-m...,PF/10-625,To characterize the electro-absorption propert...,12/1/10,,,IEEE Photonics Technology Letters,Andrew Ngo Chun Yong
9,Andrew Ngo Chun Yong,Semiconductor Quantum Dots for Photonic Modula...,PF/14-522,"This book describes the growth,fabrication and...",2/1/14,,,Scholar's Press,Andrew Ngo Chun Yong


In [18]:
print PRF_merged.shape
print len(PRF_merged.Publication_Release_Number.unique())
print len(PRF_merged.Staff_Name.unique())
print len(PRF_merged.Corresponding_Authors.unique())

(1036, 9)
966
169
99


Some papers list more than one first author or corresponding author: the first-author table has 2 more rows than unique release numbers, and the corresponding-author table has 67 more.

In [19]:
print PRF_first.shape
print len(PRF_first.Publication_Release_Number.unique())
print PRF_corr.shape
print len(PRF_corr.Publication_Release_Number.unique())

(508, 7)
506
(907, 5)
840


# Extract keywords

This approach uses the NLTK and scikit-learn libraries to compute TF-IDF weights.

## Create and pre-process corpus
- A corpus is created from all titles and abstract texts within the 'PRF_merged' dataframe.
- The corpus is prefiltered to remove non-ascii characters.

In [20]:
printable = set(string.printable)
corpus = [filter(lambda x: x in printable, i+j) for i, j in zip(PRF_merged.Title_of_Paper, PRF_merged.Significance_of_Paper)]

corpus # A list of documents

['Progress in Development of Flexible Metal-Air Batteries  The review paper discuss the latest development on flexible metal air batteries which are of interest as an energy storage for low cost electronic devices.  ',
 'Progress in Development of Flexible Metal-Air Batteries  The review paper discuss the latest development on flexible metal air batteries which are of interest as an energy storage for low cost electronic devices.  ',
 'Manganese oxide catalyst grown on carbon paper as air cathode for high performance rechargeable zinc-air batteries  The directly grown MnOx catalyst is grown via a simple immersion process. It reduces the contact resistance and enhances the discharge/charge profile of the zincair batteries. Zincair batteries with the directly grown catalyst show a discharge volta',
 'Biomimicking Micropatterned Surfaces and Their Effect on Marine Biofouling  Paper describes and sumarizes research being an objective of IMAS project.  ',
 'Hydrothermal Growth of Piezoelect

- Change to all lower-case letters.
- Remove punctuation and stopwords.
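The cells in this part of the notebook rely on Python 2 behaviour (`filter` returning a string, `str.translate(None, ...)`). A minimal Python 3 sketch of the same cleaning steps follows. The function name `clean` and the tiny `STOPWORDS` set are assumptions made for illustration; the notebook itself uses scikit-learn's `ENGLISH_STOP_WORDS`.

```python
import string

PRINTABLE = set(string.printable)

# Translation table that deletes every ASCII punctuation character.
DROP_PUNCT = str.maketrans('', '', string.punctuation)

# Tiny illustrative stopword set; not the list used in the notebook.
STOPWORDS = {'of', 'the', 'a', 'on', 'for', 'in', 'and'}

def clean(doc, stopwords=STOPWORDS):
    """Drop non-ASCII characters, lower-case, strip punctuation, remove stopwords."""
    ascii_only = ''.join(ch for ch in doc if ch in PRINTABLE)
    no_punct = ascii_only.lower().translate(DROP_PUNCT)
    return ' '.join(word for word in no_punct.split() if word not in stopwords)

print(clean('Bandgap engineering of 1.3µm quantum-dot structures'))
```

Because punctuation is deleted rather than replaced by a space, hyphenated terms are fused into one word, which is why tokens such as "zincair" and "metalair" appear in the outputs of this notebook.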

In [21]:
# Note: str.translate(None, chars) is the Python 2 signature for deleting characters.
corpus_preprocess = {}
for ind, doc in enumerate(corpus):
    lowers = doc.lower()
    no_punctuation = lowers.translate(None, string.punctuation)
    corpus_preprocess[ind] = no_punctuation # lower-cased, punctuation-free text (not yet tokenized)


corpus_preprocess_nostop = {}
for ind, doc in enumerate(corpus):
    lowers = doc.lower()
    no_punctuation = lowers.translate(None, string.punctuation)
    # Drop scikit-learn's generic English stopwords word by word
    no_stopwords = filter(lambda a: a not in text.ENGLISH_STOP_WORDS, no_punctuation.split())
    corpus_preprocess_nostop[ind] = ' '.join(no_stopwords)

In [22]:
print corpus_preprocess[4]
print corpus_preprocess_nostop[4]

hydrothermal growth of piezoelectrically active lead free naknbo3litao3 thin films  first demonstration of how to make a naknbo3litao3 based leadfree film piezoelectrically active  
hydrothermal growth piezoelectrically active lead free naknbo3litao3 films demonstration make naknbo3litao3 based leadfree film piezoelectrically active


## Explore removal of scientific stopwords 
Explore the further removal of stopwords that are common in scientific literature but are not found in a normal stopword dictionary.

### Explore frequency of word occurrence in entire corpus.


In [23]:
corpus_merged = ' '.join(corpus_preprocess_nostop.values()).split()
tf_raw = collections.Counter(corpus_merged)
print len(corpus_merged)
print len(tf_raw)
print(tf_raw)

27185
6234
Counter({'using': 188, 'nanoparticles': 185, 'high': 172, 'films': 166, 'applications': 149, 'properties': 137, 'cells': 136, 'paper': 132, 'based': 128, 'solar': 118, 'surface': 105, 'synthesis': 101, 'polymer': 99, 'oxide': 98, 'gold': 91, 'organic': 89, 'developed': 88, 'new': 85, 'layer': 85, 'performance': 81, 'grown': 78, 'metal': 78, 'method': 75, 'carbon': 73, 'materials': 72, 'graphene': 71, 'growth': 71, 'emission': 69, 'highly': 68, 'structure': 68, 'nanostructures': 67, 'low': 67, 'effect': 67, 'temperature': 67, 'cell': 66, 'film': 66, 'hybrid': 64, 'batteries': 63, 'work': 62, 'substrates': 62, 'review': 62, 'polymers': 62, 'electron': 62, 'efficient': 61, 'prepared': 61, 'study': 60, 'formation': 59, 'imaging': 59, 'zno': 57, 'deposition': 56, 'enhanced': 55, 'conjugated': 55, 'delivery': 54, 'application': 54, 'report': 54, 'optical': 54, 'single': 53, 'synthesized': 53, 'solution': 52, 'novel': 50, 'used': 49, 'energy': 49, 'reaction': 49, 'phase': 48, 'semi

There are 6234 unique words among the 27185 tokens.

Many of the highly frequent words are stopwords (e.g. device, improved, various, designed, occurring ~20-30 times or more). However, even some low-frequency words are stopwords (e.g. candidate, mixtures, fully, black, occurring ~3-5 times).

### Stem the corpus to aggregate vocabulary.

In [24]:
stemmer = PorterStemmer()
def tokenize(doc):
    # doc is a single string; the name avoids shadowing sklearn's `text` module imported above
    tokens = nltk.word_tokenize(doc)
    stems = stem_tokens(tokens, stemmer)
    return stems
vocab_stemmed = tokenize(' '.join(corpus_merged))

In [25]:
vocab_stemmed

[u'progress',
 u'develop',
 u'flexibl',
 u'metalair',
 u'batteri',
 u'review',
 u'paper',
 u'discuss',
 u'latest',
 u'develop',
 u'flexibl',
 u'metal',
 u'air',
 u'batteri',
 u'energi',
 u'storag',
 u'low',
 u'cost',
 u'electron',
 u'devic',
 u'progress',
 u'develop',
 u'flexibl',
 u'metalair',
 u'batteri',
 u'review',
 u'paper',
 u'discuss',
 u'latest',
 u'develop',
 u'flexibl',
 u'metal',
 u'air',
 u'batteri',
 u'energi',
 u'storag',
 u'low',
 u'cost',
 u'electron',
 u'devic',
 u'manganes',
 u'oxid',
 u'catalyst',
 u'grown',
 u'carbon',
 u'paper',
 u'air',
 u'cathod',
 u'high',
 u'perform',
 u'recharg',
 u'zincair',
 u'batteri',
 u'directli',
 u'grown',
 u'mnox',
 u'catalyst',
 u'grown',
 u'simpl',
 u'immers',
 u'process',
 u'reduc',
 u'contact',
 u'resist',
 u'enhanc',
 u'dischargecharg',
 u'profil',
 u'zincair',
 u'batteri',
 u'zincair',
 u'batteri',
 u'directli',
 u'grown',
 u'catalyst',
 u'discharg',
 u'volta',
 u'biomimick',
 u'micropattern',
 u'surfac',
 u'effect',
 u'marin',
 

In [26]:
tf_stemmed = collections.Counter(vocab_stemmed)
print len(vocab_stemmed)
print len(tf_stemmed)
print(tf_stemmed)

27185
4883
Counter({u'use': 287, u'film': 232, u'nanoparticl': 210, u'applic': 204, u'cell': 202, u'high': 172, u'polym': 162, u'develop': 154, u'properti': 143, u'effect': 143, u'oxid': 139, u'structur': 139, u'paper': 136, u'surfac': 131, u'base': 130, u'layer': 121, u'report': 120, u'demonstr': 118, u'solar': 118, u'studi': 111, u'materi': 108, u'substrat': 107, u'fabric': 103, u'synthesi': 101, u'metal': 99, u'enhanc': 99, u'effici': 95, u'electron': 92, u'perform': 91, u'organ': 91, u'gold': 91, u'nanostructur': 91, u'control': 90, u'carbon': 89, u'method': 88, u'prepar': 87, u'deposit': 86, u'new': 85, u'fluoresc': 83, u'review': 79, u'temperatur': 79, u'work': 78, u'grown': 78, u'emiss': 76, u'batteri': 76, u'imag': 76, u'mechan': 73, u'devic': 72, u'graphen': 71, u'growth': 71, u'plasmon': 70, u'process': 70, u'hybrid': 69, u'highli': 68, u'low': 67, u'investig': 67, u'catalyst': 65, u'activ': 65, u'synthes': 64, u'function': 62, u'result': 60, u'format': 60, u'conjug': 60, u'i

Stemming shrank the vocabulary from 6234 to 4883.

### Check the number of documents each word appears in

If a word appears in many documents, it is more likely a stopword. If it appears frequently but only in a few documents, it is more likely a keyword. This is essentially the TF-IDF strategy:

weight = tf * log(N/df)
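A small worked example may make the weighting concrete. The sketch below applies the formula above to a made-up corpus of three already-stemmed documents, with tf taken as the collection-wide count and df as the number of documents containing the term. It is written for Python 3; the explicit `float()` is there because `N/df` with two integers is floor division under Python 2.

```python
import math
from collections import Counter

# Made-up, already-stemmed toy corpus: one token list per document.
docs = [
    ['zinc', 'air', 'batteri', 'use'],
    ['gold', 'nanoparticl', 'use'],
    ['zinc', 'oxid', 'film', 'use', 'zinc'],
]
N = len(docs)

# tf: how often each term occurs in the whole collection.
tf = Counter(tok for doc in docs for tok in doc)

# df: in how many documents each term occurs at least once.
df = Counter(tok for doc in docs for tok in set(doc))

# weight = tf * log(N / df); float() keeps the ratio exact under Python 2 as well.
weight = {t: tf[t] * math.log(float(N) / df[t]) for t in tf}

for term, w in sorted(weight.items(), key=lambda kv: kv[1], reverse=True):
    print(term, tf[term], df[term], round(w, 3))
```

In this toy corpus "use" occurs in every document, so log(N/df) is log(1) and its weight is 0, whereas "zinc" occurs three times in only two documents and receives the largest weight.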

In [27]:
corpus_stemmed = [tokenize(doc) for doc in corpus_preprocess_nostop.values()]

In [28]:
# Check that the words in vocab_stemmed are the same as those in corpus_stemmed:
vocab_stemmed == sum(corpus_stemmed, []) # the sum() function combines the list of lists

True

#### Attempt #1: Compute simple term frequencies and document frequencies

In [29]:
term = list(tf_stemmed.keys()) # Unique stemmed terms in the collection
term_freq = [tf_stemmed[t] for t in term] # Overall freq of each term in the entire collection

doc_freq = []
for word in term:
    # Number of documents that contain the word at least once
    doc_freq.append(sum(1 for doc in corpus_stemmed if word in doc))
print len(doc_freq)

4883


In [30]:
doc_freq

[1,
 1,
 42,
 7,
 1,
 2,
 1,
 1,
 3,
 1,
 2,
 5,
 1,
 1,
 1,
 1,
 2,
 5,
 1,
 1,
 1,
 1,
 1,
 1,
 3,
 1,
 2,
 1,
 1,
 2,
 2,
 2,
 2,
 1,
 1,
 1,
 1,
 2,
 1,
 21,
 3,
 1,
 4,
 2,
 1,
 1,
 2,
 1,
 2,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 5,
 2,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 2,
 1,
 1,
 1,
 1,
 1,
 1,
 3,
 1,
 1,
 1,
 9,
 1,
 2,
 4,
 1,
 34,
 2,
 3,
 1,
 2,
 1,
 17,
 12,
 2,
 11,
 2,
 22,
 4,
 20,
 2,
 1,
 7,
 5,
 4,
 2,
 1,
 2,
 7,
 1,
 6,
 4,
 1,
 1,
 2,
 15,
 1,
 2,
 1,
 2,
 1,
 1,
 1,
 1,
 8,
 1,
 1,
 1,
 4,
 1,
 3,
 1,
 3,
 1,
 2,
 1,
 1,
 1,
 2,
 2,
 15,
 6,
 1,
 2,
 8,
 1,
 2,
 1,
 1,
 1,
 2,
 1,
 1,
 10,
 2,
 1,
 1,
 102,
 5,
 5,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 1,
 2,
 3,
 60,
 1,
 13,
 10,
 28,
 1,
 3,
 1,
 2,
 1,
 1,
 1,
 1,
 2,
 1,
 7,
 4,
 2,
 1,
 18,
 11,
 1,
 1,
 1,
 1,
 6,
 1,
 1,
 6,
 1,
 2,
 8,
 1,
 3,
 1,
 1,
 14,
 1,
 1,
 5,
 6,
 3,
 3,
 1,
 2,
 2,
 1,
 1,
 1,
 2,
 19,
 1,
 10,
 1,
 3,
 72,
 1,
 1,
 2,
 1,
 2,
 2,
 2,
 1,
 1,
 2,
 3,
 1,
 1,
 1,
 13,
 1,
 1,
 4,
 2,


In [31]:
print len(term), len(term_freq), len(doc_freq)

4883 4883 4883


In [32]:
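vocab_signif = pd.DataFrame(
    {'term': term,
     'term_freq': term_freq,
     'doc_freq': doc_freq,
     # Note: under Python 2, len(corpus_stemmed)/df divides two integers and floors the result
     # before the log is taken (e.g. 1036/143 gives 7); the scores printed below reflect this.
     # Use float(len(corpus_stemmed))/df for the exact ratio N/df.
     'tfidf_score': [tf*math.log(len(corpus_stemmed)/df) for tf, df in zip(term_freq, doc_freq)]
    })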

In [33]:
vocab_signif.sort_values(by = 'tfidf_score', ascending = False)

Unnamed: 0,doc_freq,term,term_freq,tfidf_score
2370,143,film,232,451.451155
2151,131,nanoparticl,210,408.641131
3832,241,use,287,397.866482
3770,130,cell,202,393.073850
923,165,applic,204,365.518932
4560,107,polym,162,355.950382
3888,135,high,172,334.696546
963,91,oxid,139,333.307443
3897,91,surfac,131,314.124281
1095,73,solar,118,311.408765


In [34]:
vocab_signif.sort_values(by = 'term_freq', ascending = False).reset_index()

Unnamed: 0,index,doc_freq,term,term_freq,tfidf_score
0,3832,241,use,287,397.866482
1,2370,143,film,232,451.451155
2,2151,131,nanoparticl,210,408.641131
3,923,165,applic,204,365.518932
4,3770,130,cell,202,393.073850
5,3888,135,high,172,334.696546
6,4560,107,polym,162,355.950382
7,2065,138,develop,154,299.670163
8,797,118,properti,143,297.360140
9,4533,119,effect,143,297.360140


In [35]:
vocab_signif.sort_values(by = 'doc_freq', ascending = False).reset_index()

Unnamed: 0,index,doc_freq,term,term_freq,tfidf_score
0,3832,241,use,287,397.866482
1,923,165,applic,204,365.518932
2,2370,143,film,232,451.451155
3,2065,138,develop,154,299.670163
4,3888,135,high,172,334.696546
5,2750,132,paper,136,264.643780
6,2151,131,nanoparticl,210,408.641131
7,3770,130,cell,202,393.073850
8,643,120,report,120,249.532985
9,4533,119,effect,143,297.360140


Document frequency (or its inverse) seems to be a better indicator of whether a word is a stopword. Term frequency is confounded because both stopwords and highly popular research subjects score high on it.

In [36]:
vocab_signif.sort_values(by = 'doc_freq', ascending = False, inplace = True)
vocab_signif.reset_index()

Unnamed: 0,index,doc_freq,term,term_freq,tfidf_score
0,3832,241,use,287,397.866482
1,923,165,applic,204,365.518932
2,2370,143,film,232,451.451155
3,2065,138,develop,154,299.670163
4,3888,135,high,172,334.696546
5,2750,132,paper,136,264.643780
6,2151,131,nanoparticl,210,408.641131
7,3770,130,cell,202,393.073850
8,643,120,report,120,249.532985
9,4533,119,effect,143,297.360140


In [52]:
print vocab_signif.iloc[:150, :]

      doc_freq         term  term_freq  tfidf_score
3832       241          use        287   397.866482
923        165       applic        204   365.518932
2370       143         film        232   451.451155
2065       138      develop        154   299.670163
3888       135         high        172   334.696546
2750       132        paper        136   264.643780
2151       131  nanoparticl        210   408.641131
3770       130         cell        202   393.073850
643        120       report        120   249.532985
4533       119       effect        143   297.360140
797        118     properti        143   297.360140
287        116     demonstr        118   245.374102
4171       114     structur        139   305.414216
2568       111         base        130   285.639195
4560       107        polym        162   355.950382
152        102        studi        111   255.586945
3508        93       materi        108   258.972689
1804        93     synthesi        101   242.187423
3897        

#### Attempt #2: Use the Kullback-Leibler Divergence criterion

In [37]:
term_coll = list(tf_stemmed.keys()) # Unique stemmed terms in the collection
term_coll_freq = [tf_stemmed[t] for t in term_coll] # Overall freq of each term in the entire collection

# Count of each term in each document
term_doc_freq = [[doc.count(t) for doc in corpus_stemmed] for t in term_coll]
# This is a matrix of M terms (rows) x N documents (columns)

In [38]:
print len(term_doc_freq)
print len(term_doc_freq[0])

4883
1036


Use the refined Kullback-Leibler divergence measure to assign a weight to every term in the retrieved documents. The assigned weight gives some indication of how important the term is (Lo et al., "Automatically Building a Stopword List for an Information Retrieval System"):

$$w_x = P_x \cdot \log_2 \frac{P_x}{P_c}$$

This is the contribution of a single term x to the Kullback–Leibler divergence from Q to P,

$$D_{KL}(P \| Q) = \sum_x P(x) \log_2 \frac{P(x)}{Q(x)}$$

which is a measure of the information gained when one revises one's beliefs from the prior probability distribution Q to the posterior probability distribution P. In other words, it is the amount of information lost when Q is used to approximate P.

For each term x:
- Retrieve all documents that contain x.
- tfx = the total number of times term x appears in these documents.
- lx = the combined length of these documents (total number of words).
- Px = tfx / lx
- Pc = F / tokenc, where F is the term frequency of x in the whole collection and tokenc is the total number of tokens in the whole collection. Because every occurrence of x lies in a document that contains x, F is the same as tfx.

Therefore, when a term appears in only a few specific documents, its Px is much larger than Pc (a large divergence) and the value of w is large. Conversely, when a term is spread across most of the collection, Px is close to Pc and the value of w is close to 0. Terms with small (near-zero) values of w are designated stopwords.
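The steps above can be sketched on a made-up corpus of four stemmed documents. This is an illustrative Python 3 reading of the definitions given here, not the notebook's own cells; the function name `kl_weight` and the toy data are assumptions. It divides by the total token count for Pc and uses a base-2 logarithm, as in the formula.

```python
import math

# Made-up, already-stemmed toy corpus: one token list per document.
docs = [
    ['zinc', 'air', 'batteri', 'use'],
    ['gold', 'nanoparticl', 'use', 'imag'],
    ['zinc', 'oxid', 'film', 'use'],
    ['polym', 'film', 'use', 'solar'],
]
total_tokens = sum(len(d) for d in docs)  # tokenc: all tokens in the collection

def kl_weight(term):
    """Weight w = Px * log2(Px / Pc) for one term, following the list above."""
    containing = [d for d in docs if term in d]       # documents that contain the term
    tf_x = sum(d.count(term) for d in containing)     # occurrences in those documents
    l_x = sum(len(d) for d in containing)             # combined length of those documents
    p_x = tf_x / float(l_x)
    p_c = tf_x / float(total_tokens)                  # F equals tf_x
    return p_x * math.log2(p_x / p_c)

weights = {t: kl_weight(t) for d in docs for t in d}

for term, w in sorted(weights.items(), key=lambda kv: kv[1]):
    print(term, round(w, 3))
```

Since F equals tfx, the ratio Px/Pc reduces to tokenc/lx, so a term that occurs in every document ("use" in this toy corpus) gets a weight of exactly 0, while a term confined to a single document ("gold") gets the largest weight.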


In [39]:
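# Note: the denominator here is len(tf_stemmed), the vocabulary size (4883), whereas the
# definition above divides by tokenc, the total number of tokens (len(vocab_stemmed) = 27185).
# The weights printed below come from this cell as written.
Pc = [i / float(len(tf_stemmed)) for i in term_coll_freq]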

In [40]:
tfx = [sum(terms) for terms in term_doc_freq]

In [41]:
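lx = []
# Loop names are chosen so that the earlier `term` and `doc_freq` lists are not overwritten.
for term_counts in term_doc_freq:
    lx_term = 0
    for doc_ind, count in enumerate(term_counts):
        if count != 0: # the document contains the term
            lx_term += len(corpus_stemmed[doc_ind])
    lx.append(lx_term)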

In [42]:
print len(Pc)
print len(tfx)
print len(lx)

4883
4883
4883


In [43]:
Px = [i/float(j) for i, j in zip(tfx, lx)]

In [44]:
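# math.log is the natural logarithm; the formula above uses log2, which differs only by a
# constant factor and leaves the ranking of the terms unchanged.
KL_wt = [i * math.log(i/float(j)) for i, j in zip(Px, Pc)]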
len(KL_wt)

4883

In [45]:
# Arrange the terms in order of KL_wt
vocab_signif_KL = pd.DataFrame(
    {'term': term_coll,
     'KullbackLeibler_weight': KL_wt
    })
vocab_signif_KL = vocab_signif_KL.sort_values(by = 'KullbackLeibler_weight').reset_index()

In [46]:
print vocab_signif_KL.iloc[:100, :]
print vocab_signif_KL.iloc[150:200, :]

    index  KullbackLeibler_weight          term
0    3832               -0.015404           use
1    3888                0.006602          high
2    2370                0.008952          film
3     923                0.010983        applic
4     643                0.011535        report
5    2065                0.011822       develop
6    3770                0.012623          cell
7    4171                0.013637      structur
8    2750                0.014130         paper
9     287                0.014331      demonstr
10   4533                0.016509        effect
11    797                0.016943      properti
12   2151                0.019820   nanoparticl
13   2568                0.020432          base
14    152                0.020534         studi
15   1091                0.024265      substrat
16   1804                0.026805      synthesi
17   3897                0.027025        surfac
18   1060                0.027064        fabric
19   4493                0.027631       

Interesting observation: the KL stopword list contains real stopwords, such as "good", "introduc", "review", as well as lower-order / general-content subject knowledge words, such as element and material names ("tin", "silicon", "carbon", "composit"), technique names ("spectroscopi", "epitaxi"), and disciplines or domains ("semiconductor", "electrocatalyst", "batteri", "plasmon", "protein", "vacuum").

One can potentially build layers of subject knowledge words, in increasing levels of generality. Then one can truly characterize the inter-relatedness of one person's expertise to another.

#### Conclusion:
Use the KL approach to build the scientific stopword list.

#### Attempt #3: By inspection, pick the scientific stopword list.

In [47]:
print vocab_signif.iloc[40:50, :]
print vocab_signif.iloc[90:100, :]
print vocab_signif.iloc[140:150, :]
print vocab_signif.iloc[190:200, :]
print vocab_signif.iloc[240:250, :]
print vocab_signif.iloc[290:300, :]
print vocab_signif.iloc[360:370, :]

      doc_freq      term  term_freq  tfidf_score
3113        61   synthes         64   177.445678
3884        61   process         70   194.081211
3129        61  fluoresc         83   230.124864
3243        60     devic         72   203.991361
2661        60     emiss         76   215.324214
169         60    highli         68   192.658507
674         58    result         60   169.992801
2479        57    mechan         73   210.997138
824         56     grown         78   225.448997
4653        55    growth         71   205.216395
      doc_freq        term  term_freq  tfidf_score
2749        37     thermal         55   183.271248
2920        37      differ         42   139.952589
4220        37       singl         53   176.606839
3350        36      recent         41   136.620385
2566        36  photovolta         47   156.613612
2695        36       simpl         39   129.955976
3944        36       light         43   143.284794
2723        36        coat         49   163.278021
29

By inspection, the top ~370 words, ranked by the number of documents they occur in, are taken to be scientific stopwords.

Note that this is imperfect. Words such as "fluoresc" and "ferroelectr" are obviously keywords, but belong to the level of general subject knowledge.

To truly differentiate the two, one might need to label the words and use supervised learning.

In [53]:
print vocab_signif_KL.iloc[200:250, :]
print vocab_signif_KL.iloc[250:300, :]

     index  KullbackLeibler_weight         term
200   3034                0.087915        charg
201   3350                0.088034       recent
202   3952                0.088320        allow
203   2152                0.088388    treatment
204    392                0.088544           xp
205   2839                0.088592       energi
206    300                0.088832     incorpor
207   3087                0.089092            8
208   1304                0.089260      current
209   1227                0.089336  environment
210   4331                0.089408      contain
211    110                0.089507      conduct
212    663                0.089514   microscopi
213   2040                0.089775      copolym
214   3820                0.089814        appli
215   1962                0.089863            s
216   3429                0.090008  polyethylen
217   4866                0.090119          gan
218   2930                0.090326    construct
219   4477                0.090357      

### Add the scientific stopwords to the generic stopword list.

In [48]:
sci_stopwords = vocab_signif.term[:370].tolist()
own_stopwords = text.ENGLISH_STOP_WORDS.union(sci_stopwords) # It's a frozenset

Using the KL approach:

In [49]:
sci_stopwords_KL = vocab_signif_KL.term[]

SyntaxError: invalid syntax (<ipython-input-49-bbb70d86fecb>, line 1)

## Model the text using TFIDF
### Run tokenize function
Each document in the cleaned corpus is tokenized and stemmed.

In [None]:
print corpus_preprocess[4]
print corpus_preprocess_nostop[4]
print corpus_stemmed[4]

In [None]:
i = 0
doc = corpus_preprocess[i]

def tokenize(doc):
    # Tokenize one document string and stem every token
    tokens = nltk.word_tokenize(doc)
    stems = stem_tokens(tokens, stemmer)
    return stems

### Run tfidf
- The tokenize function (which operates on a single body of text) is passed to the TF-IDF vectorizer to create the TF-IDF model. scikit-learn's built-in English stopwords, extended with the scientific stopwords selected above, are removed from each document in the model-building process.
- The model is fitted on, and applied to, every document in the corpus by the fit_transform() function.
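How the vectorizer turns a corpus into per-document keyword weights can be sketched on three made-up sentences. This is a simplified illustration, assuming scikit-learn is installed and Python 3 is used: it relies on the vectorizer's default tokenizer rather than the stemming `tokenize` function, and `extra_stopwords` and `top_keywords` are hypothetical names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up miniature corpus: one string per document.
docs = [
    'zinc air batteries with manganese oxide catalyst',
    'gold nanoparticles for imaging',
    'zinc oxide thin films',
]
extra_stopwords = ['with', 'for']  # hypothetical custom stopword list

vectorizer = TfidfVectorizer(stop_words=extra_stopwords)
matrix = vectorizer.fit_transform(docs)  # sparse matrix, shape (n_documents, n_terms)

# get_feature_names() was replaced by get_feature_names_out() in newer releases.
if hasattr(vectorizer, 'get_feature_names_out'):
    features = list(vectorizer.get_feature_names_out())
else:
    features = vectorizer.get_feature_names()

def top_keywords(doc_index, n=3):
    """Return up to n terms of one document, highest TF-IDF weight first."""
    row = matrix[doc_index].toarray()[0]
    ranked = sorted(((w, features[i]) for i, w in enumerate(row) if w > 0),
                    reverse=True)
    return [term for w, term in ranked[:n]]

print(matrix.shape)
print(top_keywords(0))
```

Terms shared between documents ("zinc", "oxide") receive a lower inverse document frequency than terms unique to one document, so they fall below the unique terms in each ranking.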

In [None]:
tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words=list(own_stopwords)) # stop_words is passed as a list
fit = tfidf.fit_transform(corpus_preprocess.values())

In [None]:
fit.shape # A vocabulary of 4519 words, across 1036 documents

In [None]:
getAuthorPubKeywords(index = 0, tfidfModel=tfidf, fit=fit)

In [None]:
getAuthorPubKeywords(index = 1, tfidfModel=tfidf, fit=fit)

In [None]:
getAuthorPubKeywords(index = 5, tfidfModel=tfidf, fit=fit)