# Run Keyword Extraction

This notebook uses the functions defined in three other notebooks to extract keywords from PDFs.

***
# Setup

### Install Packages
* pdfplumber
* PyPDF2 (currently not used here)

In [1]:
!pip install pdfplumber

Collecting pdfplumber
  Downloading pdfplumber-0.5.28.tar.gz (45 kB)
Collecting pdfminer.six==20200517
  Downloading pdfminer.six-20200517-py3-none-any.whl (5.6 MB)
Collecting Wand
  Downloading Wand-0.6.7-py2.py3-none-any.whl (139 kB)
Collecting pycryptodome
  Downloading pycryptodome-3.10.1-cp35-abi3-manylinux2010_x86_64.whl (1.9 MB)
Building wheels for collected packages: pdfplumber
  Building wheel for pdfplumber (setup.py) ... done
  Created wheel for pdfplumber: filename=pdfplumber-0.5.28-py3-none-any.whl size=32240 sha256=e87edcaaf3cf0706a376a295fa272ba22e74f5e2793808d9a9e7374df528b5de
  Stored in directory: /root/.cache/pip/wheels/f2/b1/a0/c0a77b756d580f53b3806ae0e0b3ec945a8d05fca1d6e10cc1
Successfully built pdfplumbe


### Mount Google Drive
* Files from three folders: Memorandums, Resolutions, SanJoseFiles


In [2]:
from google.colab import drive

drive.mount('/content/gdrive')

Mounted at /content/gdrive


### Import Other Notebooks
  - PDFToTokens.ipynb
  - FindStopWords.ipynb
  - TokensToKeywords.ipynb

In [39]:
%run "/content/gdrive/My Drive/Colab Notebooks/PDFToTokens.ipynb"

In [42]:
%run "/content/gdrive/My Drive/Colab Notebooks/FindStopWords.ipynb"

In [31]:
%run "/content/gdrive/My Drive/Colab Notebooks/TokensToKeywords.ipynb"

***
# Get Keywords


In [6]:
import os
import string
import pandas as pd

import pdfplumber
import spacy

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

Get keywords for a single file from the project drive, which is reachable through a shortcut I added to my own drive.


In [7]:
# File location
pathBeforeFile = "/content/gdrive/My Drive/#proj-city-agenda-scraper/Agenda_Scraper_Files/Legistar/"
filename = "SanJose1.pdf"
pathToPDF = pathBeforeFile + filename

# Top 10 keywords
getKeywords(pathToPDF).head(10)

|   | Keywords | TF-IDF |
|---|---|---|
| 0 | police | 4.034953 |
| 1 | community | 4.663562 |
| 2 | department | 4.663562 |
| 3 | chief | 4.897176 |
| 4 | question | 4.897176 |
| 5 | policy | 5.133565 |
| 6 | provide | 5.44372 |
| 7 | statement | 5.44372 |
| 8 | process | 5.53903 |
| 9 | service | 5.53903 |
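
`getKeywords` is defined in TokensToKeywords.ipynb, which is not shown here, so its exact scoring is not known from this notebook. As a rough illustration of the general TF-IDF idea only, the sketch below scores the tokens of one document against a small corpus using just the standard library. The function name `tfidf_scores` and the toy token lists are assumptions made for this example, and the smoothed IDF form `ln((1 + n) / (1 + df)) + 1` is one common choice rather than necessarily the one used by the project.

```python
import math
from collections import Counter


def tfidf_scores(documents, target_index):
    """Score each token of one document by TF-IDF against a small corpus.

    documents: list of token lists (assumed already cleaned and lowercased).
    target_index: position of the document whose keywords are wanted.
    Returns (token, score) pairs, highest score first.
    """
    n_docs = len(documents)

    # Document frequency: the number of documents each token appears in
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    target = documents[target_index]
    term_counts = Counter(target)

    scores = {}
    for token, count in term_counts.items():
        tf = count / len(target)  # relative frequency in the target document
        idf = math.log((1 + n_docs) / (1 + doc_freq[token])) + 1  # smoothed IDF
        scores[token] = tf * idf

    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy corpus: three already-tokenized documents
docs = [
    ["police", "chief", "policy", "police"],
    ["budget", "policy", "park"],
    ["park", "budget", "housing"],
]
ranked = tfidf_scores(docs, 0)
print([token for token, _ in ranked])
```

In this toy corpus, a word that is frequent in the target document and rare elsewhere ("police") ranks above a word shared with other documents ("policy").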


***
# Save Files


### Save PDFs as Txt Files

In [None]:
# Location of PDFs
pathToPDFsFolder = "/content/gdrive/My Drive/#proj-city-agenda-scraper/Agenda_Scraper_Files/Legistar/"
# Location of folder where text files are to be saved (must be created beforehand)
pathToTxtFilesFolder = "/content/gdrive/My Drive/CFSJ/TxtFiles/"

print("Starting to convert all PDFs to txt files.")
print("-"*50)

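# List only the PDFs so that stray files in the folder are not passed to getText
pdfFilenames = sorted(f for f in os.listdir(pathToPDFsFolder) if f.lower().endswith(".pdf"))

for i,filename in enumerate(pdfFilenames):
    pathToFile = os.path.join(pathToPDFsFolder, filename)
    print(f"Saving file #{i+1}: {filename}")
    text = getText(pathToFile)
    # Save text as .txt file
    txtFile = savePDFAsTxt(text, filename, pathToTxtFilesFolder)
    print("\n" + "-"*50)

# Use the list length so the summary also works when the folder is empty
print(f"Finished: Saved {len(pdfFilenames)} txt files.")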
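
The output folder above has to exist before the cell is run. As a hedged sketch of how that requirement could be avoided, the helper below creates the folder on demand and writes the text next to a name derived from the PDF. `save_text_as_txt` is a hypothetical stand-in written for this example; it is not the `savePDFAsTxt` function from PDFToTokens.ipynb, whose implementation and file naming are not shown here.

```python
import os


def save_text_as_txt(text, pdf_filename, output_folder):
    """Write text to <output_folder>/<PDF name without extension>.txt.

    Creates output_folder if it does not exist and returns the path written.
    """
    os.makedirs(output_folder, exist_ok=True)
    base_name, _ = os.path.splitext(pdf_filename)
    txt_path = os.path.join(output_folder, base_name + ".txt")
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write(text)
    return txt_path
```

Writing with an explicit `utf-8` encoding keeps the result independent of the platform default, which can matter for text extracted from PDFs.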

### Save Keywords as Csv Files
* Currently, this uses CommonWords.csv
* Takes almost 20 minutes

In [None]:
# Location of PDFs
pathToPDFsFolder = "/content/gdrive/My Drive/#proj-city-agenda-scraper/Agenda_Scraper_Files/Legistar/"
# Location of where to save the keywords csv files
pathToKeywordsFolder = "/content/gdrive/My Drive/CFSJ/Keywords-spaCy/"
pathToCommonWordsCsv = "/content/gdrive/My Drive/CFSJ/StopWords/CommonWords.csv"

print("Starting to save keywords for all PDFs to csv files.")
print("-"*50)

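# List only the PDFs so that stray files in the folder are not passed to getKeywords
pdfFilenames = sorted(f for f in os.listdir(pathToPDFsFolder) if f.lower().endswith(".pdf"))

for i,filename in enumerate(pdfFilenames):
    # Get path to pdf file
    pathToPDF = os.path.join(pathToPDFsFolder, filename)
    print(f"Saving file #{i+1}: {filename}")
    # Get keywords dataframe
    keywords_df = getKeywords(pathToPDF, pathToCommonWordsCsv)
    # Save as csv
    csvFile = saveKeywordsAsCsv(keywords_df, filename, pathToKeywordsFolder)
    print("\n" + "-"*50)

# Use the list length so the summary also works when the folder is empty
print(f"Finished: Saved {len(pdfFilenames)} csv files.")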
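
Because this cell takes almost 20 minutes, an interrupted Colab session means starting over. One possible way to resume is to skip PDFs that already have a keywords file. The sketch below assumes each csv is named after its PDF (for example `SanJose1.pdf` and `SanJose1.csv`); that naming is an assumption, since `saveKeywordsAsCsv` is defined in another notebook, and `pending_pdfs` is a hypothetical helper written for this example.

```python
import os


def pending_pdfs(pdf_folder, keywords_folder):
    """Return the PDF filenames that have no matching .csv in keywords_folder."""
    done = set()
    if os.path.isdir(keywords_folder):
        # Base names (without extension) of the csv files already saved
        done = {
            os.path.splitext(name)[0]
            for name in os.listdir(keywords_folder)
            if name.lower().endswith(".csv")
        }
    return sorted(
        name
        for name in os.listdir(pdf_folder)
        if name.lower().endswith(".pdf")
        and os.path.splitext(name)[0] not in done
    )
```

Looping over `pending_pdfs(pathToPDFsFolder, pathToKeywordsFolder)` instead of the full folder listing would then process only the files that are still missing.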


### Save Common Words

This is run once to create the CommonWords.csv file. It uses all of the keywords csv files, which should be created from stop words that do not include the common words themselves. This requires modifying getStopWords(...) in the PDFToTokens notebook.

(This should take only a few seconds to run.)

In [15]:
#pathToStopWordsFolder = "/content/gdrive/My Drive/CFSJ/StopWords/"
#pathToKeywordsFolder = "/content/gdrive/My Drive/CFSJ/Keywords-spaCy/"
#os.makedirs(pathToStopWordsFolder, exist_ok=True)
#df_commonWords = saveCommonWords(pathToStopWordsFolder, pathToKeywordsFolder)

Saved as CommonWords.csv
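
`saveCommonWords` is defined in another notebook, so the criterion it uses is not visible here. As a sketch of one plausible approach, the function below counts how many keyword files list each word among their first rows and returns the words found in the most files. It assumes each csv has a `Keywords` column, as in the table shown earlier; the function name `most_common_keywords` and the `top_n` and `per_file` parameters are hypothetical.

```python
import csv
import os
from collections import Counter


def most_common_keywords(keywords_folder, top_n=15, per_file=10):
    """Rank words by the number of keyword files that list them.

    Only the first per_file rows of each csv are considered, and each word
    is counted at most once per file.
    """
    file_counts = Counter()
    for name in sorted(os.listdir(keywords_folder)):
        if not name.lower().endswith(".csv"):
            continue
        path = os.path.join(keywords_folder, name)
        with open(path, newline="", encoding="utf-8") as f:
            rows = list(csv.DictReader(f))
        file_counts.update({row["Keywords"] for row in rows[:per_file]})

    # Most files first; ties broken alphabetically for a stable result
    ranked = sorted(file_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_n]]
```

Counting files rather than total occurrences favors words that are uninformative across the whole collection, which is the usual motivation for a corpus-specific stop word list.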


In [47]:
# View top 15 common words
pathToCommonWordsCsv = "/content/gdrive/My Drive/CFSJ/StopWords/CommonWords.csv"
getCommonWords(pathToCommonWordsCsv)[:15]

['item',
 'require',
 'public',
 'mayor',
 'approve',
 'date',
 'subject',
 'adopt',
 'provide',
 'august',
 'follow',
 'service',
 'change',
 'propose',
 'action']