You are given a set of PDF files stored in a folder.
Your task is to build an index of where each term from a
given list is mentioned in these PDF files, i.e. an index of the
terms' occurrences.

**Input:** Name of the folder where the PDF files are stored
N – the number of terms in the list
The comma-separated list of N terms (N < 20).

**Output:** A CSV file with
the following structure: *term; pdf_file_name; page*
(a page is listed once per term, even if the term occurs on it several times).
The terms must be ordered lexicographically.
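
The task as stated, looking up a given list of terms page by page, can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the approach used in the cells below: it assumes the page texts have already been extracted, and `build_index`, `pages_by_file` and the sample data are hypothetical names chosen for illustration. Matching is a case-insensitive whole-word search, so inflected forms such as plurals are not counted.

```python
import re

def build_index(pages_by_file, terms):
    """Return sorted (term, file, page) rows, one row per page a term occurs on."""
    rows = set()
    for term in terms:
        # Whole-word, case-insensitive match; re.escape keeps the term literal.
        pattern = re.compile(r'\b' + re.escape(term) + r'\b', re.IGNORECASE)
        for filename, pages in pages_by_file.items():
            for page_n, text in enumerate(pages, start=1):
                if pattern.search(text):
                    rows.add((term, filename, page_n))
    # Tuples sort by term first, then file name, then page number.
    return sorted(rows)

# Hypothetical sample: two files, page texts already extracted.
sample = {
    'a.pdf': ['Graph theory basics. A graph has nodes.', 'Trees are acyclic graphs.'],
    'b.pdf': ['A tree is a connected graph.'],
}
index = build_index(sample, ['tree', 'graph'])
for row in index:
    print('; '.join(str(field) for field in row))
```

Collecting rows in a set is what keeps a page listed only once per term, however many times the term appears on it.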

**Libraries:**
- PDF: requires the [slate3k](https://pypi.org/project/slate3k/) library, since PyPDF2 ... PyPDF4 failed to extract text from the complex PDF files.
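- Keywords: requires *gensim* (the `gensim.summarization` module used below was removed in gensim 4.0, so a 3.x release is needed).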

In [1]:
from os import listdir
import re
import csv
import slate3k as slate
from gensim.summarization import keywords

# Input.
folder = 'pdf/'
n_terms = 5

In [2]:
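# Rough cleanup: drop short lines (page numbers, headers) and flatten newlines.
def cleanup(pages):
    for i, page in enumerate(pages):
        # Blank out lines of at most 20 characters.
        page = re.sub(r'^.{0,20}$', '', page, flags=re.M)
        # Collapse each run of newlines into a single space.
        page = re.sub(r'\n+', ' ', page)
        pages[i] = page
    return pages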

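# Extract up to n_terms keywords per page, keyed by 0-based page index.
def get_keywords(pages, n_terms=5):
    keyword_list = {}
    for key, page in enumerate(pages):
        # Pass the POS filter as a tuple; a bare 'NN' string would be read as the characters 'N', 'N'.
        found = keywords(page, words=n_terms, lemmatize=True, scores=False, pos_filter=('NN',))
        # Drop the empty string that split() yields when no keywords are found.
        words = sorted(w for w in found.split('\n') if w)
        keyword_list[key] = words
    return keyword_list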
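
To see what the cleanup step does, the same two substitutions can be applied to a single made-up page. This is an illustrative sketch only: `cleanup_page`, its `max_short` parameter and the sample text are hypothetical, and the 20-character threshold is the one used in `cleanup` above.

```python
import re

def cleanup_page(page, max_short=20):
    # Blank out lines of at most max_short characters (page numbers, running headers).
    page = re.sub(r'^.{0,%d}$' % max_short, '', page, flags=re.M)
    # Collapse each run of newlines into one space.
    return re.sub(r'\n+', ' ', page)

raw = (
    "Page 3\n"
    "This sentence is long enough to survive the cleanup step.\n"
    "Short caption\n"
    "Another line of body text that is longer than twenty characters.\n"
)
cleaned = cleanup_page(raw)
print(repr(cleaned))
```

The length threshold is a blunt heuristic: it also discards short lines of real body text, such as the last line of a paragraph, which is why the step is labeled rough.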

In [3]:
# Load all pdf files.
files = [f for f in listdir(folder) if f.endswith('.pdf')]

# Create dict of content.
data = {}
for file in files:
    record = {}
    with open(folder + file, 'rb') as fh:
        doc = slate.PDF(fh)
        record['npages'] = len(doc)
        record['pages'] = cleanup(doc)
        record['keywords'] = get_keywords(record['pages'], n_terms)
    data[file] = record

In [4]:
# Format for csv.
res = []
for filename, record in data.items():
    for page_n, terms in record['keywords'].items():
        for term in terms:
            res.append([term, filename, page_n + 1])

# Order rows lexicographically by term (then by file and page).
res.sort()

# Write to csv.
with open('task2.csv', 'w', newline='') as fh:
    fields = ['term', 'file', 'page']
    writer = csv.DictWriter(fh, fieldnames=fields)
    writer.writeheader()
    for record in res:
        writer.writerow({
            'term': record[0],
            'file': record[1],
            'page': record[2],
        })

print(f'Saved {len(res)} records to "task2.csv".')

Saved 66 records to "task2.csv".
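
The output description separates fields with semicolons, while `csv.DictWriter` writes commas by default. If a semicolon-delimited file is what is wanted, a sketch along these lines could be used. Treating the semicolons as the literal delimiter is an assumption about the intended format, and `rows` holds made-up sample data. The sketch writes to an in-memory buffer so it can be run without touching `task2.csv`.

```python
import csv
import io

# Hypothetical rows in the same [term, file, page] shape as `res`.
rows = [['tree', 'b.pdf', 1], ['graph', 'b.pdf', 2], ['graph', 'a.pdf', 1]]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=';', lineterminator='\n')
writer.writerow(['term', 'file', 'page'])
# sorted() orders by term, then file, then page.
writer.writerows(sorted(rows))

output = buffer.getvalue()
print(output)
```

When writing to a real file instead of a buffer, opening it with `newline=''` lets the `csv` module control line endings itself.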
