# Exploratory data analysis

I have a database with the following structure:

        {'title': title,
        'authors': authors,
        'published': published,
        'keywords': keywords,
        'url': url,
        'paper': text}
        
Steps:
* topic modeling
* recommendation system

For topic modeling, I need an idea of how many topics there might be, so I start by looking into the keywords.

In [4]:
from pymongo import MongoClient
import re
import pandas as pd

In [5]:
client = MongoClient()
db = client.lingbuzz
papers = db.get_collection('papers')

In [18]:
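# Count the documents that actually contain extracted paper text
number = 0
for doc in papers.find():
    try:
        paper = doc['paper']
        number += 1
    except KeyError:
        # No 'paper' field: the text was never extracted for this entry
        pass
print(number)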

750


In [6]:
len(papers.distinct('keywords', { 'paper': { '$exists': True } }))

736

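There are 750 papers with extracted text (a `paper` field) in the db. The 736 above is the number of distinct raw `keywords` entries among them, not a number of papers.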
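The same count can be obtained on the server side instead of looping over every document. A minimal sketch, assuming a pymongo version that provides `Collection.count_documents` (3.7 or later); `count_with_field` is a hypothetical helper name, not part of the original notebook:

```python
def count_with_field(collection, field):
    """Count the documents in `collection` that contain `field`.

    `collection` is assumed to be a pymongo Collection; any object with a
    compatible `count_documents(filter)` method works as well.
    """
    return collection.count_documents({field: {'$exists': True}})
```

In the notebook this would be called as `count_with_field(papers, 'paper')`.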

## Keywords
Let's see how many distinct keywords there are and their frequencies:

In [62]:
db_with_papers = papers.find({ 'paper': { '$exists': True } })

In [63]:
keywords = []

for doc in db_with_papers:
    keywords += [re.split(',|;', keyword) for keyword in doc['keywords']]

In [64]:
keywords 

[['czech passives',
  ' passive vs. past participles',
  ' czech clitics',
  ' past czech auxiliary',
  ' grammatical morphemes',
  ' instrumental case',
  ' post-syntactic insertion',
  ' post-syntactic derivation',
  ' semantics',
  ' morphology',
  ' syntax',
  ' morphology',
  ' syntax'],
 ['czech dp',
  ' universal dp',
  ' determiners',
  ' functional domain',
  ' word order in dps',
  ' semantics of dp',
  ' modifiers of n',
  ' semantics',
  ' morphology',
  ' syntax'],
 ['sign language',
  ' strong pronouns',
  ' pointing',
  ' focus',
  ' semantics',
  ' morphology',
  ' syntax'],
 ['syntax',
  ' morphology',
  ' extended projections',
  ' selection',
  ' agreement',
  ' multi-verbal constructions',
  ' auxiliaries',
  ' light verbs',
  ' ndebele',
  ' bantu'],
 ['sluicing',
  ' ellipsis licensing',
  ' pair-list readings',
  ' scope',
  ' parallelism',
  ' semantics',
  ' syntax'],
 ['quantifier domain restriction',
  ' ellipsis',
  ' inverse scope',
  ' semantics',
  ' synt

In [68]:
keywords_list = [item.strip() for sublist in keywords for item in sublist]

In [69]:
keywords_list

['czech passives',
 'passive vs. past participles',
 'czech clitics',
 'past czech auxiliary',
 'grammatical morphemes',
 'instrumental case',
 'post-syntactic insertion',
 'post-syntactic derivation',
 'semantics',
 'morphology',
 'syntax',
 'morphology',
 'syntax',
 'czech dp',
 'universal dp',
 'determiners',
 'functional domain',
 'word order in dps',
 'semantics of dp',
 'modifiers of n',
 'semantics',
 'morphology',
 'syntax',
 'sign language',
 'strong pronouns',
 'pointing',
 'focus',
 'semantics',
 'morphology',
 'syntax',
 'syntax',
 'morphology',
 'extended projections',
 'selection',
 'agreement',
 'multi-verbal constructions',
 'auxiliaries',
 'light verbs',
 'ndebele',
 'bantu',
 'sluicing',
 'ellipsis licensing',
 'pair-list readings',
 'scope',
 'parallelism',
 'semantics',
 'syntax',
 'quantifier domain restriction',
 'ellipsis',
 'inverse scope',
 'semantics',
 'syntax',
 'disjunction',
 'alternative questions',
 'wh-quantification',
 'alternative semantics',
 'mandar

In [71]:
import collections
counter = collections.Counter(keywords_list)

In [74]:
# A Counter's length is its number of distinct keys
print(len(counter))

2436


There are 2436 distinct keywords for our 750 papers (too many to visualize in a countplot). Let's look at their frequencies:

In [89]:
# most_common() returns (keyword, count) pairs sorted by descending count
print(counter.most_common())

[('syntax', 601), ('semantics', 265), ('morphology', 236), ('phonology', 130), ('agreement', 43), ('case', 30), ('distributed morphology', 26), ('japanese', 26), ('focus', 25), ('ellipsis', 22), ('negation', 22), ('icelandic', 20), ('phases', 19), ('english', 19), ('russian', 18), ('german', 18), ('scope', 16), ('locality', 16), ('dm', 16), ('binding', 15), ('prosody', 15), ('argument structure', 15), ('allomorphy', 13), ('tense', 13), ('agree', 13), ('movement', 13), ('control', 12), ('epp', 12), ('nanosyntax', 12), ('bantu', 11), ('sluicing', 11), ('french', 11), ('aspect', 11), ('left periphery', 11), ('linearization', 11), ('phase', 11), ('spanish', 11), ('minimalism', 10), ('generative syntax', 10), ('head movement', 10), ('pronouns', 10), ('language acquisition', 9), ('optimality theory', 9), ('merge', 9), ('coordination', 9), ('recursion', 9), ('universal grammar', 9), ('information structure', 9), ('dutch', 9), ('roots', 9), ('adjectives', 9), ('wh-movement', 9), ('clitics', 9)
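These are occurrence counts: a paper that lists a keyword twice (the first paper above has 'syntax' and 'morphology' twice) contributes twice. A minimal sketch of counting each keyword at most once per paper, assuming `doc['keywords']` is a list of raw strings as in the loop above; the helper names and the toy input are illustrative, not query results:

```python
import re
from collections import Counter

def split_keywords(raw_keywords):
    """Split raw keyword strings on ',' or ';' and strip surrounding whitespace."""
    parts = []
    for raw in raw_keywords:
        parts.extend(p.strip() for p in re.split(r'[,;]', raw))
    return [p for p in parts if p]  # drop empty fragments

def document_frequency(keyword_lists):
    """Count in how many papers each keyword appears (at most once per paper)."""
    doc_freq = Counter()
    for raw_keywords in keyword_lists:
        doc_freq.update(set(split_keywords(raw_keywords)))
    return doc_freq

# Toy input shaped like doc['keywords'] for two papers (made up for illustration)
sample = [
    ['czech passives, semantics, morphology, syntax', 'morphology, syntax'],
    ['sign language; focus', 'semantics, morphology, syntax'],
]
doc_freq = document_frequency(sample)
print(doc_freq['syntax'], doc_freq['focus'])
```

With these counts, a statement such as "N papers out of 750 deal with syntax" refers to papers rather than keyword occurrences.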

Clearly, most of the papers deal with syntax: the keyword occurs 601 times for 750 papers, although this is an occurrence count and some papers list the same keyword more than once (the first paper above has 'syntax' twice). Umbrella topics such as syntax, semantics, morphology and phonology are frequent, which is to be expected, and probably each paper falls into at least one of these categories. Within these categories there are several topics, and a sub-topic can belong to more than one umbrella topic. For instance, 'agreement' can be a subtopic of both syntax and morphology, while 'distributed morphology' and 'case' can both fall under morphology. A flat list of topics is therefore not enough: we need a network of topics.
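One possible starting point for such a network is keyword co-occurrence: two keywords are linked whenever they appear on the same paper, and the link weight is the number of papers they share. This is a minimal sketch under that assumption, not a method the notebook commits to; `cooccurrence_counts` is a hypothetical helper and the input is a toy sample built from keywords shown earlier:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(papers_keywords):
    """Count how many papers each unordered keyword pair shares."""
    pairs = Counter()
    for keywords in papers_keywords:
        # Deduplicate and sort so every pair has one canonical order
        unique = sorted(set(keywords))
        pairs.update(combinations(unique, 2))
    return pairs

# Toy input: one cleaned keyword list per paper
sample = [
    ['sluicing', 'scope', 'semantics', 'syntax'],
    ['ellipsis', 'inverse scope', 'semantics', 'syntax'],
    ['agreement', 'morphology', 'syntax'],
]
edges = cooccurrence_counts(sample)
print(edges[('semantics', 'syntax')])  # 2
```

The resulting pairs can serve as weighted edges of a keyword graph, in which umbrella topics such as 'syntax' should show up as highly connected nodes.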

## Texts

In [95]:
db_with_papers = papers.find({ 'paper': { '$exists': True } })

In [96]:
for doc in db_with_papers[:10]:
    print(doc['paper'][:1000], '\n')

ANALYTIC PASSIVES IN CZECH    Ludmila Veselovská & Petr Karlík   INTRODUCTION: DEFINING THE PROBLEM   Like other languages, Czech also has pairs of sentences clearly related both in their  form and in their meaning. The example (1) illustrates the phenomena of passivisation   which is the topic of our paper.  According to the traditional terminology  (1a)  demonstrates an active structure and (1b) the related passive structure.     (1)   (b)      Pavel je chválen Petrem  Paul   is praised  by Peter   (a)      Petr chválí Pavla      Peter praises Paul      The semantic relation between (1a) and (1b) can be stated as an intuition that both  examples describe the same extralinguistic situation and have ‘similar truth values”  (each implies the other). The formal similarity between (1a) and (1b) follows from the  fact that both examples contain close to identical lexical material. The distinctions  between (1a) and (1b) can be summarised as follows.     (2)   (a)  Morphology:    (i)   (i

In [6]:
# Load only the _id and paper fields of documents that have extracted text
df = pd.DataFrame(list(papers.find({ 'paper': { '$exists': True } }, {'paper': 1})))

In [7]:
df.head()

|   | _id | paper |
|---|-----|-------|
| 0 | 598b44c407d7df07719383e2 | ANALYTIC PASSIVES IN CZECH Ludmila Veselovs... |
| 1 | 598b44c407d7df07719383e5 | UNIVERSAL DP-ANALYSIS IN ARTICLELESS LANGUAGE:... |
| 2 | 598b44c407d7df07719383e8 | Strong Pronominals in ASL and LSF* Philippe ... |
| 3 | 598b44c407d7df07719383f0 | THE UNIVERSITY OF CHICAGO INFLECTIONAL DEPEND... |
| 4 | 598b44c407d7df07719383fc | Multiple Sluicing, Scope, and Superiority: Con... |


In [19]:
# Print the position and the first 20 characters of every paper
for index, paper in enumerate(df.paper):
    print(index, paper[:20])

0 ANALYTIC PASSIVES IN
1 UNIVERSAL DP-ANALYSI
2 Strong Pronominals i
3 THE UNIVERSITY OF CH
4 Multiple Sluicing, S
5 	     Quantifier Dom
6 Two disjunctions in 
7 Iconic Pragmatics*  
8   Moraic Onsets in A
9    1   Revolutionary
10 Back to the Future: 
11   To appear in Bánré
12 To appear in Proceed
13 The role of incremen
14 AxParts and Case in 
15 Temmerman, Tanja and
16 Gerrit Kentner Goeth
17 ONLY: An NPI-license
18 1  The loi de positi
19 An Argument for Zwar
20      Minimalist Sema
21 Alessandra Giorgi   
22 Modeling syntactic a
23 LUDMILA VESELOVSKÁ A
24 The scope of alterna
25 Iconic Plurality*   
26 Speech production pl
27 Speech production pl
28 WHY THE NULL COMPLEM
29 How real are adjecti
30 Daniel Altshuler EVE
31 Le charme discret of
32 CHAPTER 15   ON THE 
33 BANGLA NEGATIVE POLA
34 Frame setters and th
35 Prosody as an argume
36    On the (un)interp
37                     
38 (Grammatical)	gender
39 A Proposal for Lingu
40 Vocabulary insertion
41 Memorization and the
42

Some papers seem to be problematic (blank or garbled beginnings), so let's look at their first 100 characters:

In [20]:
prob = [37, 57, 81, 95, 176, 223, 251, 253, 257, 260, 273, 283, 285, 286, 336, 338, 349, 405, 418, 455, 462,
        539, 571, 611, 614, 628, 636, 638, 654, 682, 700, 701]
for n in prob: 
    print(n, df.paper.iloc[n][:100])

37                       On the double-headed analysis of “Headless” relative clauses*                 
57 1" 2" 3" 4" 5" 6" 7" 8" 9" 10" 11" 12" 13" 14" 15" 16" 17" 18" 19" 20" 21" 22"              Word-int
81                                       (1)         VoiceP               3            agent         Vo
95 	     	     	     	     On	   the	   similarity	   between	   syntax	   and	   actions.	     An	   a
176 國立清華大學   博士論文                            題目： 中文名稱：漢語的詞組性空範疇     英文名稱：The Phrasal Empty Categories   
223 (cid:1)(cid:2)(cid:3)(cid:4)(cid:5)(cid:6)(cid:7)(cid:8)(cid:8)(cid:9)(cid:10)(cid:10)(cid:7)(cid:11
251 
253 
257 
260 
273 
283                        Object Shift and Scrambling in North and West Germanic:   A Case Study in Sym
285                           JOSEPH SABBAGH                 ORDERING A

In [21]:
still_prob = [223, 251, 253, 257, 260, 273, 462]
df.paper.iloc[223]

'(cid:1)(cid:2)(cid:3)(cid:4)(cid:5)(cid:6)(cid:7)(cid:8)(cid:8)(cid:9)(cid:10)(cid:10)(cid:7)(cid:11)(cid:12)(cid:3)(cid:13)(cid:4)(cid:14)(cid:5)(cid:7)  (cid:7)  (cid:7) (cid:7)  (cid:15)(cid:16)(cid:17)(cid:7)(cid:1)(cid:18)(cid:19)(cid:15)(cid:20)(cid:21)(cid:7)(cid:22)(cid:23)(cid:7)(cid:1)(cid:24)(cid:20)(cid:15)(cid:10)(cid:20)(cid:25)(cid:7)(cid:20)(cid:19)(cid:20)(cid:24)(cid:16)(cid:22)(cid:26)(cid:20)(cid:7)  (cid:27)(cid:14)(cid:28)(cid:29)(cid:5)(cid:7)(cid:26)(cid:14)(cid:14)(cid:3)(cid:30)(cid:31) (cid:7)  &(cid:25)(cid:12)(cid:4)$(cid:12)(cid:5)(cid:7)’(cid:5)(cid:4)(cid:11)(cid:12)(cid:3)(cid:13)(cid:4)((cid:30))(cid:7)  /0(cid:12)0(cid:31)0(cid:11)0(cid:3)(cid:14)(cid:14)(cid:3)(cid:30)(cid:31) 1,(cid:12)(0,(cid:12)(cid:4)$(cid:12)(cid:5)#(cid:5)(cid:4)(cid:11)0(cid:5),(cid:7)  !(cid:7)  (cid:7)  (cid:7)  "#(cid:4)$(cid:14)(cid:7)(cid:9)(cid:29)(cid:5)$(cid:12)(cid:5)(cid:7)%(cid:30)(cid:5)(cid:6)(cid:29)(cid:12)(cid:3)$(cid:7)  &*’(cid:7)+(cid:3)#(cid:13)(cid:13)(ci

In [22]:
df.paper.iloc[251]

'\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c'

In [23]:
df.paper.iloc[253]

'\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c'

In [24]:
df.paper.iloc[257]

'\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c'

In [25]:
df.paper.iloc[260]

'\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c'

In [26]:
df.paper.iloc[273]

'\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c'

In [30]:
df.paper.iloc[462][-300:]

'\uf024\uf027\uf00e\uf002\uf008\uf00d\uf001\uf005\uf028\uf001\uf02a\uf008\uf035\uf00a\uf035\uf001\uf02d\uf009\uf006\uf008\uf001\uf005\uf028\uf001\uf034\uf005\uf00d\uf010\uf02d\uf00b\uf02d\uf008\uf00d\uf00f\uf02b\uf001 \uf002\uf005\uf011\uf025\uf008\uf00d\uf001\uf001 \uf00f\uf005\uf009\uf006\uf003 \uf003  \uf02e\uf03f\uf02f\uf001  \uf00a\uf004\uf004\uf013\uf00d\uf005\uf004\uf008\uf010\uf015\uf00d\uf015\uf005\uf008\uf005\uf015\uf00d\uf003\uf001 \uf00a\uf039\uf039\uf01b\uf00d\uf00d\uf015\uf005\uf008\uf005\uf015\uf00d\uf013\uf004\uf005\uf003\uf002  \uf012\uf015\uf001\uf001 \uf012\uf00c\uf001  \uf026\uf00b\uf027\uf00f\uf008\uf001\uf011\uf005\uf001\uf006\uf009\uf007\uf008\uf001\uf00b\uf026\uf026\uf005\uf00d\uf029\uf009\uf024\uf00a\uf001\uf011\uf005\uf001\uf066\uf034\uf025\uf009\uf011\uf008\uf066\uf001\uf02a\uf024\uf005\uf024\uf039\uf01a\uf00b\uf011\uf009\uf007\uf008\uf02b\uf001\uf026\uf027\uf006\uf011\uf027\uf00d\uf008\uf001  \uf00a\uf00d\uf00d\uf016\uf004\uf01b\uf00d\uf015\uf005\uf008\uf005

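These seven papers contain no usable text (only unmapped `(cid:N)` glyph codes, runs of form feeds, or private-use characters), so they have to be eliminated from any further analysis.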

In [None]:
indexes_to_eliminate = [223, 251, 253, 257, 260, 273, 462]
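The list above was compiled by hand. As a cross-check, unreadable texts could also be flagged automatically. This is a minimal sketch of one possible heuristic: the share of alphabetic characters in a text after removing `(cid:N)` codes. The helper names and the 0.3 threshold are assumptions chosen for illustration and would need tuning on the real data:

```python
import re

def readable_ratio(text):
    """Share of alphabetic characters in `text`, ignoring '(cid:N)' glyph codes.

    Form feeds and private-use glyphs are not alphabetic, so texts made of
    them score 0.
    """
    if not text:
        return 0.0
    cleaned = re.sub(r'\(cid:\d+\)', '', text)
    letters = sum(1 for ch in cleaned if ch.isalpha())
    return letters / len(text)

def find_unreadable(texts, min_ratio=0.3):
    """Return the positions of texts whose readable ratio is below `min_ratio`."""
    return [i for i, text in enumerate(texts) if readable_ratio(text) < min_ratio]

# Toy input imitating the kinds of text seen above
sample = [
    'ANALYTIC PASSIVES IN CZECH',
    '(cid:1)(cid:2)(cid:3)(cid:4)',
    '\x0c\x0c\x0c',
    '\uf024\uf027\uf00e',
]
print(find_unreadable(sample))  # [1, 2, 3]
```

In the notebook, `find_unreadable(df.paper)` could be compared with the hand-picked `indexes_to_eliminate`, and the rows could then be removed with `df.drop(df.index[indexes_to_eliminate])`.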