# Inserting the data in MongoDB

What I have so far:
* a directory with downloaded pdfs
* a .txt file with all the urls to the pdfs that I am actually using
* an all2.json file with the additional information in the following format: 
        {'title': title,
        'authors': authors,
        'published': published,
        'keywords': keywords,
        'url': url}
        
What I need:
* put everything in MongoDB by matching the additional information to the converted pdfs. The problem is that I do not have much to match on: each pdf's file name is part of its download url, and that url got cut off at a different point than the url stored in the db.

In [4]:
import json
import os
from pymongo import MongoClient
import codecs
import pandas as pd

The metadata contains data for all papers that are available on Lingbuzz. I only selected a subset of them.   
To do: match pdfs to url in db, insert new field 'paper' with the converted file (which is one huge string).  
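
Since each converted paper ends up as one huge string in a single field, it may be worth checking its size before inserting: MongoDB caps a BSON document at 16 MB. The following is a minimal sketch, not part of the original notebook; `fits_in_document` and the `headroom` allowance for the other fields are assumptions made for illustration.

```python
# MongoDB's maximum BSON document size is 16 MB
MAX_BSON_BYTES = 16 * 1024 * 1024

def fits_in_document(text, headroom=64 * 1024):
    """Return True if `text`, encoded as UTF-8, still leaves `headroom` bytes.

    `headroom` is a rough, assumed allowance for title, authors, keywords and url.
    """
    return len(text.encode('utf-8')) + headroom <= MAX_BSON_BYTES

print(fits_in_document('a short paper'))
```

A converted paper that fails this check would have to be split or stored some other way; ordinary articles are normally far below the limit.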

## Create db and insert metadata

In [7]:
with open('../all2.json') as f:
    meta_data = json.load(f)

In [8]:
meta_data

[{'authors': ['Yusuke Imanishi'],
  'keywords': ['resumption, possessor wh phrases, the clause-mate condition on resumptive chains, island (in)sensitivity, kaqchikel (mayan), syntax'],
  'published': ['To appear in Studia Linguistica'],
  'title': 'The clause-mate condition on resumption: Evidence from Kaqchikel',
  'url': '/lingbuzz/003606'},
 {'authors': ['Prakash Mondal'],
  'keywords': ['natural language; cognition; possible minds; mentality; organisms; machines; computation, semantics, syntax'],
  'published': ['Leiden/Boston: Brill (2017)'],
  'title': "The Preliminary Material of 'Natural Language and Possible Minds: How Language Uncovers the Cognitive Landscape of Nature'",
  'url': '/lingbuzz/003607'},
 {'authors': ['Lida Veselovska'],
  'keywords': ['czech passives, passive vs. past participles, czech clitics, past czech auxiliary, grammatical morphemes, instrumental case, post-syntactic insertion, post-syntactic derivation, semantics, morphology, syntax, morphology, syntax']

In [6]:
client = MongoClient()
db = client.lingbuzz
#db.create_collection("papers")
papers = db.get_collection('papers')
#papers.insert_many(meta_data)
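
Re-running `insert_many` on a collection that is already filled would duplicate every record, which is presumably why that line is commented out. One alternative, sketched here under the assumption that `url` is unique per paper, is to upsert on `url` instead. `build_upsert_specs` is a hypothetical helper, not something from the original notebook.

```python
def build_upsert_specs(records):
    """Turn metadata records into (filter, update) pairs keyed on 'url'."""
    return [({'url': rec['url']}, {'$set': rec}) for rec in records]

sample = [{'title': 'A title', 'url': '/lingbuzz/000001'}]  # made-up record
for flt, update in build_upsert_specs(sample):
    print(flt, update)
    # with a live collection: papers.update_one(flt, update, upsert=True)
```

With `upsert=True`, `update_one` inserts the record when no document has that `url` and updates it otherwise, so the cell can be run repeatedly.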

In [7]:
for doc in papers.find()[:5]:
    print(doc)

{'_id': ObjectId('598b44c407d7df07719383e0'), 'title': 'The clause-mate condition on resumption: Evidence from Kaqchikel', 'authors': ['Yusuke Imanishi'], 'published': ['To appear in Studia Linguistica'], 'keywords': ['resumption, possessor wh phrases, the clause-mate condition on resumptive chains, island (in)sensitivity, kaqchikel (mayan), syntax'], 'url': '/lingbuzz/003606'}
{'_id': ObjectId('598b44c407d7df07719383e1'), 'title': "The Preliminary Material of 'Natural Language and Possible Minds: How Language Uncovers the Cognitive Landscape of Nature'", 'authors': ['Prakash Mondal'], 'published': ['Leiden/Boston: Brill (2017)'], 'keywords': ['natural language; cognition; possible minds; mentality; organisms; machines; computation, semantics, syntax'], 'url': '/lingbuzz/003607'}
{'_id': ObjectId('598b44c407d7df07719383e2'), 'title': 'Analytic Passives in Czech', 'authors': ['Lida Veselovska'], 'published': ['Zeitschrift für Slawistik 49 (2) 2004: 163–235'], 'keywords': ['czech passive

## Insert papers

Due to bad scraping, I have to match the pdfs that I actually downloaded to the metadata in the DB through a txt file containing the urls that I downloaded.

In [8]:
# put urls that I downloaded in df to easily find the files that match
with open('../papers/urls.txt') as f:
    urls = [url.strip() for url in f if url.strip()]  # skip blank lines
splitted_urls = []
for url in urls:
    splitted = url.split('/')
    # columns: 0 = scheme and host, 1 = path as stored in the db's 'url' field, 2 = pdf file name
    for_match = ['/'.join(splitted[:3]), '/'+'/'.join(splitted[3:5]), splitted[5]]
    splitted_urls.append(for_match)
url_df = pd.DataFrame(splitted_urls)
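
The positional `split('/')` above relies on every url having the same shape. A slightly more defensive variant, sketched below, parses the url with the standard library and builds a plain dict from the `/lingbuzz/...` key to the file name. The example url is made up, and `key_and_filename` is a hypothetical helper; it assumes the path has the form `/<collection>/<id>/<file>`.

```python
from urllib.parse import urlsplit

def key_and_filename(url):
    """Split a download url into the '/lingbuzz/<id>' key and the file name."""
    parts = urlsplit(url).path.strip('/').split('/')
    if len(parts) < 3:
        raise ValueError('unexpected url shape: ' + url)
    return '/' + '/'.join(parts[:2]), parts[2]

# made-up url, only to show the shape
example_urls = ['http://example.org/lingbuzz/000001/paper_v1.pdf']
file_by_key = dict(key_and_filename(u) for u in example_urls)
print(file_by_key)
```

A dict lookup such as `file_by_key.get(doc['url'])` is an exact match, so it cannot accidentally hit a url that merely contains the key as a substring.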

In [9]:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
import io

In [10]:
## stolen and adapted from https://stackoverflow.com/questions/5725278/how-do-i-use-pdfminer-as-a-library/8325135#8325135
def convert_pdf_to_txt(path):
    """Extract the text of the pdf at `path` and return it as one string."""
    rsrcmgr = PDFResourceManager()
    retstr = io.StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    maxpages = 0     # 0 means no page limit
    caching = True
    pagenos = set()  # an empty set means all pages
    # the with block closes the file even if a page fails to parse
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, caching=caching, check_extractable=True):
            interpreter.process_page(page)
    device.close()
    string = retstr.getvalue()
    retstr.close()
    return string

In [22]:
# find matching pdfs, convert and insert in db
for doc in papers.find()[:5]:
    to_match = doc['url']
    # exact match on the url column; str.contains would also accept substrings
    # and then fail on the equality lookup
    matches = url_df[url_df.iloc[:,1]==to_match]
    if not matches.empty:
        url = matches.iloc[:,2].iloc[0]
        text = convert_pdf_to_txt('../papers/'+url)
        text = text.replace('\n', ' ').replace('\r', ' ')
        print(doc['url'] == to_match)
        papers.update_one({"url": to_match},{"$set": {"paper": text}})

True
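
Replacing `\n` and `\r` with spaces leaves runs of several spaces wherever the pdf had blank lines or padding. If single-spaced text is wanted, one option (an assumption about what the text is later used for, not something the notebook requires) is to collapse all whitespace in one pass. `normalize_whitespace` is a hypothetical helper.

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return ' '.join(text.split())

sample = 'ANALYTIC PASSIVES IN CZECH  \n\nLudmila Veselovská & Petr Karlík \n'
print(normalize_whitespace(sample))
```

Because `str.split()` without arguments treats form feeds as whitespace too, this also removes page-break characters that a pdf converter may emit.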


In [37]:
for doc in papers.find()[:5]:
    print(doc)

{'_id': ObjectId('598b44c407d7df07719383e0'), 'title': 'The clause-mate condition on resumption: Evidence from Kaqchikel', 'authors': ['Yusuke Imanishi'], 'published': ['To appear in Studia Linguistica'], 'keywords': ['resumption, possessor wh phrases, the clause-mate condition on resumptive chains, island (in)sensitivity, kaqchikel (mayan), syntax'], 'url': '/lingbuzz/003606'}
{'_id': ObjectId('598b44c407d7df07719383e1'), 'title': "The Preliminary Material of 'Natural Language and Possible Minds: How Language Uncovers the Cognitive Landscape of Nature'", 'authors': ['Prakash Mondal'], 'published': ['Leiden/Boston: Brill (2017)'], 'keywords': ['natural language; cognition; possible minds; mentality; organisms; machines; computation, semantics, syntax'], 'url': '/lingbuzz/003607'}
{'_id': ObjectId('598b44c407d7df07719383e2'), 'title': 'Analytic Passives in Czech', 'authors': ['Lida Veselovska'], 'published': ['Zeitschrift für Slawistik 49 (2) 2004: 163–235'], 'keywords': ['czech passive

In [42]:
for doc in papers.find()[:5]:
    # only documents with a matching pdf have a 'paper' field
    if 'paper' in doc:
        print(doc['paper'])

ANALYTIC PASSIVES IN CZECH  

Ludmila Veselovská & Petr Karlík 

INTRODUCTION: DEFINING THE PROBLEM 

Like other languages, Czech also has pairs of sentences clearly related both in their 
form and in their meaning. The example (1) illustrates the phenomena of passivisation  
which is the topic of our paper.  According to the traditional terminology  (1a) 
demonstrates an active structure and (1b) the related passive structure. 
  
(1) 

(b)  
 

Pavel je chválen Petrem 
Paul   is praised  by Peter 

(a)  
 

Petr chválí Pavla     
Peter praises Paul    

The semantic relation between (1a) and (1b) can be stated as an intuition that both 
examples describe the same extralinguistic situation and have ‘similar truth values” 
(each implies the other). The formal similarity between (1a) and (1b) follows from the 
fact that both examples contain close to identical lexical material. The distinctions 
between (1a) and (1b) can be summarised as follows.  
 
(2) 

(a)  Morphology:  

(i)  
(i