# Keyword Extraction with spaCy Processing
***
### Table of Contents 

  1. [Setup](#setup)

  2. [Functions](#createfunctions)

  3. [Save Files](#save-as-txt)

  4. [Run Tests](#run-tests)

<a name="setup"></a>
***
# Setup

### Install Packages
* PyPDF2
* pdfplumber
* spaCyPDFReader (cannot install)

In [1]:
!pip install PyPDF2

Collecting PyPDF2
  Downloading PyPDF2-1.26.0.tar.gz (77 kB)
Building wheels for collected packages: PyPDF2
  Building wheel for PyPDF2 (setup.py) ... done
  Created wheel for PyPDF2: filename=PyPDF2-1.26.0-py3-none-any.whl size=61101 sha256=171d4b76602129a7d0831992f2920a925dc22440b9c185936ee8543655949e37
  Stored in directory: /root/.cache/pip/wheels/80/1a/24/648467ade3a77ed20f35cfd2badd32134e96dd25ca811e64b3
Successfu

In [2]:
!pip install pdfplumber

Collecting pdfplumber
  Downloading pdfplumber-0.5.28.tar.gz (45 kB)
Collecting pdfminer.six==20200517
  Downloading pdfminer.six-20200517-py3-none-any.whl (5.6 MB)
Collecting Wand
  Downloading Wand-0.6.7-py2.py3-none-any.whl (139 kB)
Collecting pycryptodome
  Downloading pycryptodome-3.10.1-cp35-abi3-manylinux2010_x86_64.whl (1.9 MB)
Building wheels for collected packages: pdfplumber
  Building wheel for pdfplumber (setup.py) ... done
  Created wheel for pdfplumber: filename=pdfplumber-0.5.28-py3-none-any.whl size=32240 sha256=e39d2a3bb20620c648f6b40f066a46b456b3c7244466e6fbe31c9a3acc0410d7
  Stored in directory: /root/.cache/pip/wheels/f2/b1/a0/c0a77b756d580f53b3806ae0e0b3ec945a8d05fca1d6e10cc1
Successfully built pdfplumbe

In [None]:
!pip install spacypdfreader

ERROR: Could not find a version that satisfies the requirement spacypdfreader (from versions: none)
ERROR: No matching distribution found for spacypdfreader



### Mount Google Drive
* Files from three folders: Memorandums, Resolutions, SanJoseFiles


In [3]:
from google.colab import drive

drive.mount('/content/gdrive')

Mounted at /content/gdrive


<a name="createfunctions"></a>
*** 
# Functions

1.  Preprocessing Text: load pdf raw text
    * getText_PyPDF2(pathToFile)
    * getText(pathToFile)  **<---new**
    * preprocessText(text)


2.  Get Stop Words: find common words and stop words
    * get_nlp( )  **<---updated**
    * getListKeywords(pathToFile)  **<---new**
    * updateDict(words, wordsDict) **<---new**
    * findFrequentWords( )  **<---new**
    * saveCommonWords(pathBeforeFile)  **<---new**
    * getCommonWords( )  **<---new**
    * getStopWords(words) **<---updated**


3.  Processing Text: lemmatize and tokenize text
    * lemmatizeText(text, nlp)  **<---new**
    * processText(sentence, stopWords, nlp)
    * getSentences(text, nlp)
    * getTokens(sentences, nlp)


4.  TF-IDF: compute scores and get keywords
    * get_TF_IDF(tokens)
    * getKeywords(pathToFile)
  

5.  Test: compare code's keywords with correct keywords
    * getCorrectKeywords(pathToFile)
    * testFilename(pathToFile)


6. Save Files: save pdfs as .txt, save keywords in .csv, save common words  **<---new**
    * savePDFAsTxt(filename, text)  **<---new**
    * saveKeywordsAsCsv(filename, keywords_df) **<---new**

### 1. Preprocessing Text

Load text from PDF

In [4]:
"""
Get raw text from pdf file as string using PyPDF2
"""
def getText_PyPDF2(pathToFile: str) -> str:
    # Open the pdf in binary mode; the with-block closes it when done
    with open(pathToFile, 'rb') as pdfFile:
        PDF_Reader = PyPDF2.PdfFileReader(pdfFile)

        # Get total number of pages in document
        numPages = PDF_Reader.getNumPages()
        #print(f"There are {numPages} pages in the file.\n")

        # Combine text from all pages into one string
        text = ""
        for pg_number in range(numPages):
            page = PDF_Reader.getPage(pg_number)
            text += page.extractText()

    return text

In [5]:
"""
Extract raw text from pdf using pdfplumber

Problems with PyPDF2:
  - only extracted 2 of 13 pages for a file
  - added many newline chars where they don't exist in the original 

Improvements with pdfplumber:
  - able to extract ALL text, including header, footer, image captions
  - keeps general format of original pdf, just makes it all left-aligned
  - all words extracted are as they appear in pdf 
    (significantly reduced the "fake" words)

Initially found here: 
https://towardsdatascience.com/how-to-extract-text-from-pdf-245482a96de7
"""
def getText(pathToFile: str) -> str:
    # Open pdf file; the with-block closes it when done
    with pdfplumber.open(pathToFile) as pdfFile:
        # Extract text from each page and store into one string
        allText = ""
        for pageObject in pdfFile.pages:
            # extract_text() can return None for a page without text
            pageText = pageObject.extract_text() or ""
            allText += pageText

    return allText

In [6]:
"""
Get text from pdf as a spacy doc object using spacypdfreader
(spacypdfreader was not able to get installed)
"""


In [7]:
"""
Preprocess text by replacing newline with a space
"""
def preprocessText(text: str) -> str:
    # Replace each newline with a space (case is left unchanged here)
    preprocessedText = text.replace("\n", " ")
    return preprocessedText

### 2. Get Stop Words

In [8]:
"""
Create spacy Language object to parse English text
"""
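def get_nlp():
    # The 'en' shortcut only exists in spaCy 2.x;
    # the full model name works in both 2.x and 3.x
    nlp = spacy.load('en_core_web_sm')
    return nlp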

In [9]:
"""
Get list of words from keywords csv
"""
def getListKeywords(pathToFile: str) -> list:
    df = pd.read_csv(pathToFile)
    words = list(df['Keywords'])
    return words

In [10]:
"""
Update frequency of words in hash map
"""
def updateDict(words, wordsDict):
    for word in words:
        if word not in wordsDict:
            wordsDict[word] = 1 #add word
        else: # if word is in dict
            wordsDict[word] += 1 #increase frequency
    return wordsDict
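
The same frequency count can be sketched with `collections.Counter` from the standard library. This is an alternative illustration, not the notebook's own code; `countWords` and the sample keyword lists are hypothetical.

```python
from collections import Counter

def countWords(keywordLists):
    """Count how often each word appears across several keyword lists."""
    wordsCounter = Counter()
    for words in keywordLists:
        # update() adds 1 per occurrence, like updateDict does
        wordsCounter.update(words)
    return wordsCounter

# Hypothetical keyword lists standing in for two keywords csv files
sampleLists = [["date", "item", "adopt"], ["date", "item", "project"]]
counts = countWords(sampleLists)

# most_common() returns (word, frequency) pairs, most frequent first
print(counts.most_common(2))
```

A `Counter` is a dict subclass, so `pd.DataFrame(counts.items(), columns=['Word', 'Frequency'])` would work on it the same way as on `wordsDict`.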

In [11]:
"""
Find the most common words in all SanJose# files
"""
def findFrequentWords():
    # Path to keywords files
    pathBeforeFile = "/content/gdrive/My Drive/CFSJ/Keywords-spaCy/"
    # Names of all files of keywords csv
    files = os.listdir(pathBeforeFile)

    # Count frequency of words across all files
    wordsDict = {} # hash map with key=word, value=frequency
    for filename in files:
        # Get all words in one file
        pathToFile = pathBeforeFile + filename
        words = getListKeywords(pathToFile)
        wordsDict = updateDict(words, wordsDict)
    
    # Create dataframe from dict
    wordsFrequency = pd.DataFrame(wordsDict.items(), 
                                  columns=['Word', 'Frequency'])
    wordsFrequency = wordsFrequency.sort_values('Frequency', ascending=False)
    wordsFrequency = wordsFrequency.reset_index(drop=True)
    
    return wordsFrequency

In [12]:
"""
Save file with words and their frequency
"""
def saveCommonWords(pathBeforeFile):
    filename = "CommonWords.csv"
    pathToFile = pathBeforeFile + filename
    df_commonWords = findFrequentWords()
    # Save df as .csv file
    df_commonWords.to_csv(pathToFile, index=False)
    print(f"Saved as {filename}")
    return df_commonWords

In [13]:
"""
Get list of words in descending order of frequency (more to less)
"""
def getCommonWords():
    pathBeforeFile = "/content/gdrive/My Drive/CFSJ/StopWords/"
    filename = "CommonWords.csv"
    pathToFile = pathBeforeFile + filename
    try:
        df = pd.read_csv(pathToFile)
        commonWords = list(df['Word'])
        return commonWords
    except FileNotFoundError:
        print("Could not find file for common words.")
        # Return an empty list so callers can still slice the result
        return []

In [28]:
""" 
Get list of stop words from spacy, 
optionally can add extra words
"""
def getStopWords(words=None) -> list:
    stopWords = spacy.lang.en.stop_words.STOP_WORDS #set
    # Will update/change the common words later
    #commonWords = getCommonWords()[:15] #top 15 common words

    # Use None as the default to avoid a shared mutable default argument
    extraWords = list(words) if words else []
    stopWordsList = list(stopWords) + extraWords #+ commonWords

    stopWordsList += ['san', 'jose', 'josé', 'city', 'council', 
                      'meeting', 'resolution', 'memorandum',
                      'event', 'file', 'document', 'agenda',
                      'draft', 'contact', 'office', 'resource',
                      'clerk', 'final', 'california',]
    return stopWordsList

### 3. Processing Text

In [15]:
"""
Lemmatize a string of text and return the list of lemmas
"""
def lemmatizeText(text, nlp) -> list:
    tokens = nlp(text)
    lemmatizedWords = []
    for token in tokens:
        lemmatizedWords.append(token.lemma_)
    return lemmatizedWords

In [16]:
"""
Processing text: lemmatize, remove stop words, make lowercase
Input:
    sentence: str, 
    stopWords: list, 
    nlp: spacy.lang.en.English

Note: spaCy's lemmatization depended on capitalization. When the same
word appeared twice in a sentence with different casing (e.g. Fees and
fees), spaCy lemmatized fees to 'fee' but did not lemmatize Fees.
Other times it lemmatized both forms (e.g. Authorizes and authorizes).
This is why the text is lemmatized twice below.
"""
def processText(sentence, stopWords, nlp) -> list:
    # Lemmatize 2 times
    # Lemmatize entire sentence
    tokens_lemmatized = lemmatizeText(sentence, nlp)
    # Make each word of sentence into lower case words
    tokens_lowercase = [word.lower() for word in tokens_lemmatized]
    # Lemmatize each word to ensure they are lemmatized 
    tokens_lemmatized = [lemmatizeText(word, nlp)[0] for word in tokens_lowercase]
    
    # Removing stop words and any punctuation or numeric strings
    list_tokens = []
    for word in tokens_lemmatized:
        if word not in stopWords and word.isalpha():
            list_tokens.append(word)
    
    # Remove single letters
    list_tokens = [word for word in list_tokens if len(word)>1]

    # Return preprocessed list of tokens
    return list_tokens  
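
The filtering rules at the end of `processText` can be tried on their own, without loading a spaCy model. The sketch below assumes the words are already lemmatized and lowercased; `filterTokens`, the sample words, and the tiny stop-word list are hypothetical stand-ins.

```python
def filterTokens(words, stopWords):
    """Keep alphabetic words longer than one letter that are not stop words."""
    # A set makes each membership check fast, even for long stop-word lists
    stopWordsSet = set(stopWords)
    return [word for word in words
            if word not in stopWordsSet and word.isalpha() and len(word) > 1]

# Hypothetical input: lemmatized, lowercased words from one sentence
sampleWords = ["the", "city", "authorize", "fee", "2021", "a", "x", "permit", "."]
sampleStopWords = ["the", "a", "city"]

filtered = filterTokens(sampleWords, sampleStopWords)
print(filtered)
```

Here "2021" and "." fail `isalpha()`, "x" is a single letter, and "the", "a", and "city" are stop words, so only the remaining three words survive.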

In [17]:
"""
Get sentences from pdf's preprocessed text using spacy's trained pipeline
"""
def getSentences(text: str, nlp) -> list:
    #nlp = get_nlp() #spacy.load('en_core_web_sm')
    doc = nlp(text)
    sentences = list(doc.sents)
    # Span.string was removed in spaCy 3; Span.text works in 2.x and 3.x
    sentences = [sent.text.strip() for sent in sentences]
    return sentences

In [18]:
"""
Break sentences into words by processing them
"""
def getTokens(sentences: list, nlp) -> list:
    stopWords = getStopWords()
    # Process sentences to get words
    tokens = []
    for sentence in sentences:
        current_tokens = processText(sentence, stopWords, nlp)
        tokens += current_tokens
    # Return words with alphabet only
    return tokens

### 4. TF-IDF

In [19]:
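"""
Score the list of tokens with TfidfVectorizer, and
return a dataframe with tokens and scores.

Each token is passed to the vectorizer as its own one-word document,
so the stored score is the token's IDF value: the more often a token
occurs in the file, the lower its score.
"""
def get_TF_IDF(tokens: list):
    # Fit the vectorizer; each token counts as one document
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_vectorizer.fit(tokens)

    tf_idf = list(tfidf_vectorizer.idf_) #IDF scores
    # get_feature_names() was deprecated in scikit-learn 1.0 and removed
    # in 1.2 in favor of get_feature_names_out()
    features = list(tfidf_vectorizer.get_feature_names()) #words/tokens

    # Store results in dataframe
    scores_df = pd.DataFrame(list(zip(features, tf_idf)), 
                             columns=['Keywords', 'TF-IDF'])
    # Sort ascending (small to large), so the most frequent tokens come first
    scores_df = scores_df.sort_values('TF-IDF').reset_index(drop=True)
    
    return scores_df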
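
Because every token is fitted as a separate one-word document, the score can be reproduced by hand. With scikit-learn's default `smooth_idf=True`, the value stored in `idf_` is ln((1 + n) / (1 + df)) + 1, where n is the number of documents (here, the number of tokens) and df is the number of documents containing the term (here, how often the token occurs). The following is a minimal standard-library sketch, assuming each token is a single lowercase word of at least two characters so the vectorizer keeps it unchanged; `idfScores` and the sample tokens are hypothetical.

```python
import math
from collections import Counter

def idfScores(tokens):
    """Return (token, score) pairs sorted from lowest to highest score."""
    numDocs = len(tokens)      # every token is one "document"
    docFreq = Counter(tokens)  # documents containing each token
    scores = {token: math.log((1 + numDocs) / (1 + freq)) + 1
              for token, freq in docFreq.items()}
    # Sort ascending by score, then alphabetically to break ties
    return sorted(scores.items(), key=lambda pair: (pair[1], pair[0]))

# Hypothetical token list: 'fire' occurs 3 times, 'hall' twice, 'exam' once
sampleTokens = ["fire", "hall", "fire", "exam", "hall", "fire"]
ranked = idfScores(sampleTokens)
for token, score in ranked:
    print(f"{token}: {score:.6f}")
```

The most frequent token gets the lowest score, which is why the dataframe is sorted in ascending order before the top keywords are taken.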

In [20]:
"""
Combine all functions to get keywords dataframe from just path to file
"""
def getKeywords(pathToFile: str): 
  # Run process for extracting keywords
  nlp = get_nlp()
  rawText = getText(pathToFile)
  preprocessedText = preprocessText(rawText)
  sentences = getSentences(preprocessedText, nlp)
  tokens = getTokens(sentences, nlp)

  # Get dataframe with keywords and scores
  keywords_df = get_TF_IDF(tokens)

  return keywords_df

### 5. Test 

In [21]:
"""
Extract words from filename, 
which are separated by an underscore _
"""
def getCorrectKeywords(pathToFile: str) -> list:
    filename = pathToFile.split('/')[-1]
    correctKeywords = filename.replace(".pdf", "").split('_')
    correctKeywords = [word.lower() for word in correctKeywords]
    return correctKeywords
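
As a worked example, the file name `Fire_Department_Exam_Free_Use_Hall.pdf` used in the tests yields six expected keywords. The sketch below is a variant that uses `os.path` instead of string replacement, so only the final extension is removed; `keywordsFromFilename` is a hypothetical name and the folder in the path is shortened for illustration.

```python
import os

def keywordsFromFilename(pathToFile):
    """Split a file name like A_B_C.pdf into lowercase keywords."""
    filename = os.path.basename(pathToFile)         # drop the folders
    name, _extension = os.path.splitext(filename)   # drop the extension
    return [word.lower() for word in name.split('_')]

expected = keywordsFromFilename("Resolutions/Fire_Department_Exam_Free_Use_Hall.pdf")
print(expected)
```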

In [22]:
"""
Test one file
Compare correct keywords with the top 10 keywords computed with tf-idf
"""
def testFilename(pathToFile: str) -> float:
    # Get top 10 keywords using tf-idf
    tf_idf_keywords = getKeywords(pathToFile)['Keywords'].to_list()[:10]

    # Get actual keywords from file name
    correctKeywords = getCorrectKeywords(pathToFile)

    # Count number of correct words in keywords
    numCorrectWordsFound = 0
    for keyword in correctKeywords:
        if keyword in tf_idf_keywords:
            numCorrectWordsFound += 1

    # Print percent of correct words found
    correctPercentage = round((numCorrectWordsFound/len(correctKeywords))*100, 2)
    print(f"{correctPercentage}% of keywords were found.")

    return correctPercentage
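
The score is the share of file-name keywords that appear among the top 10 extracted keywords. The sketch below recomputes it for the memorandum `Historic_Landmark_Designation_Property.pdf`, using the top-10 list from that file's test output in this notebook; `percentFound` is a hypothetical helper that isolates the counting step.

```python
def percentFound(correctKeywords, extractedKeywords):
    """Percent of correct keywords that occur in the extracted keywords."""
    extracted = set(extractedKeywords)
    numFound = sum(1 for keyword in correctKeywords if keyword in extracted)
    return round(numFound / len(correctKeywords) * 100, 2)

# Keywords taken from the file name
correctKeywords = ["historic", "landmark", "designation", "property"]
# Top 10 keywords reported for this file in the test output
top10 = ["house", "historic", "property", "contract", "landmark",
         "act", "preservation", "mill", "year", "south"]

result = percentFound(correctKeywords, top10)
print(f"{result}% of keywords were found.")
```

Three of the four file-name keywords are in the list ("designation" is missing), which gives the 75.0% reported for this file.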

### 6. Save Files

In [23]:
"""
Save extracted text into .txt file,
using the same name (e.g. SanJose1.pdf => SanJose1.txt)
"""
def savePDFAsTxt(filename: str, text: str):
    # Create folder, if it doesn't exist 
    pathToFolder = "/content/gdrive/My Drive/CFSJ/"
    os.makedirs(pathToFolder+"TxtFiles", exist_ok=True)

    # Create txt file name
    name = filename[:-4] #remove .pdf
    txtFilename = name + ".txt"

    # Write the pdf's text to the txt file;
    # the with-block closes the file so the text is flushed to disk
    with open(pathToFolder+"TxtFiles/"+txtFilename, 'w') as txtFile:
        txtFile.write(text.strip())
    print(f"Saved as {txtFilename}")
    return txtFilename

In [24]:
"""
Save dataframe with keywords and tf-idf scores into .csv file,
using the same name (e.g. SanJose1.pdf => SanJose1.csv)

keywords_df is the result from function getKeywords()
"""
def saveKeywordsAsCsv(filename, keywords_df):
    # Create folder to store file
    pathToFolder = "/content/gdrive/My Drive/CFSJ/"
    os.makedirs(pathToFolder+"Keywords-spaCy", exist_ok=True)

    # Create csv file name
    name = filename[:-4]
    csvFilename = name + ".csv"

    # Save file in folder
    folder = "Keywords-spaCy/"
    keywords_df.to_csv(pathToFolder + folder + csvFilename, index=False)
    print(f"Saved as {csvFilename}")

    return csvFilename

<a name="save-as-txt"></a>
*** 
# Save Files
  1. Save PDFs as txt files
  2. Save keywords dataframe as csv files
  3. Save common words as csv

In [25]:
import os
import string
import pandas as pd

import PyPDF2
import pdfplumber
import spacy

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

### Save PDFs as Txt Files

In [None]:
pathBeforeFile = "/content/gdrive/My Drive/CFSJ/SanJoseFiles/"

print("Starting to convert all PDFs to txt files.")
print("-"*50)

for i,filename in enumerate(os.listdir(pathBeforeFile)):
    pathToFile = pathBeforeFile + filename
    print(f"Saving file #{i+1}: {filename}")
    text = getText(pathToFile)
    txtFile = savePDFAsTxt(filename, text)
    print("\n" + "-"*50)

print(f"Finished: Saved {i+1} txt files.")

Starting to convert all PDFs to txt files.
--------------------------------------------------
Saving file #1: SanJose19.pdf
Saved as SanJose19.txt

--------------------------------------------------
Saving file #2: SanJose4.pdf
Saved as SanJose4.txt

--------------------------------------------------
Saving file #3: SanJose13.pdf
Saved as SanJose13.txt

--------------------------------------------------
Saving file #4: SanJose17.pdf
Saved as SanJose17.txt

--------------------------------------------------
Saving file #5: SanJose11.pdf
Saved as SanJose11.txt

--------------------------------------------------
Saving file #6: SanJose15.pdf
Saved as SanJose15.txt

--------------------------------------------------
Saving file #7: SanJose9.pdf
Saved as SanJose9.txt

--------------------------------------------------
Saving file #8: SanJose7.pdf
Saved as SanJose7.txt

--------------------------------------------------
Saving file #9: SanJose5.pdf
Saved as SanJose5.txt

--------------------

In [26]:
# SanJose1 and SanJose2
pathBeforeFile = "/content/gdrive/My Drive/CFSJ/SanJoseFiles/"

filename1 = "SanJose1.pdf"
text1 = getText(pathBeforeFile + filename1)
txtFile1 = savePDFAsTxt(filename1, text1)

filename2 = "SanJose2.pdf"
text2 = getText(pathBeforeFile + filename2)
txtFile2 = savePDFAsTxt(filename2, text2)

Saved as SanJose1.txt
Saved as SanJose2.txt


### Save Keywords as Csv Files

- Requires CommonWords.csv

In [31]:
pathBeforeFile = "/content/gdrive/My Drive/CFSJ/SanJoseFiles/"

print("Starting to save keywords for all PDFs to csv files.")
print("-"*50)

for i,filename in enumerate(os.listdir(pathBeforeFile)):
    pathToFile = pathBeforeFile + filename
    print(f"Saving file #{i+1}: {filename}")
    # Get keywords dataframe
    keywords_df = getKeywords(pathToFile)
    csvFile = saveKeywordsAsCsv(filename, keywords_df)
    print("\n" + "-"*50)

print(f"Finished: Saved {i+1} csv files.")

Starting to save keywords for all PDFs to csv files.
--------------------------------------------------
Saving file #1: SanJose19.pdf
Saved as SanJose19.csv

--------------------------------------------------
Saving file #2: SanJose4.pdf
Saved as SanJose4.csv

--------------------------------------------------
Saving file #3: SanJose13.pdf
Saved as SanJose13.csv

--------------------------------------------------
Saving file #4: SanJose17.pdf
Saved as SanJose17.csv

--------------------------------------------------
Saving file #5: SanJose11.pdf
Saved as SanJose11.csv

--------------------------------------------------
Saving file #6: SanJose15.pdf
Saved as SanJose15.csv

--------------------------------------------------
Saving file #7: SanJose9.pdf
Saved as SanJose9.csv

--------------------------------------------------
Saving file #8: SanJose7.pdf
Saved as SanJose7.csv

--------------------------------------------------
Saving file #9: SanJose5.pdf
Saved as SanJose5.csv

----------

### Save Common Words
- Requires keywords .csv files

In [None]:
pathBeforeFile = "/content/gdrive/My Drive/CFSJ/StopWords/"
os.makedirs(pathBeforeFile, exist_ok=True)
df_commonWords = saveCommonWords(pathBeforeFile)

Saved as CommonWords.csv


<a name="run-tests"></a>
*** 
# Run Tests
(For memorandum and resolution files).
  1. Test One File
  2. Test All Files

### Test One File

Memorandum

In [29]:
# Run test for one file
pathBeforeFile = "/content/gdrive/My Drive/CFSJ/Memorandums/"
filename = "Historic_Landmark_Designation_Property.pdf"
pathToFile = pathBeforeFile + filename

# Test
fileTestResult = testFilename(pathToFile=pathToFile)

# Top 10 keywords
getKeywords(pathToFile).head(10)

75.0% of keywords were found.


|   | Keywords | TF-IDF |
|---|---|---|
| 0 | house | 4.524365 |
| 1 | historic | 4.650658 |
| 2 | property | 4.670076 |
| 3 | contract | 4.913023 |
| 4 | landmark | 5.269697 |
| 5 | act | 5.269697 |
| 6 | preservation | 5.269697 |
| 7 | mill | 5.306065 |
| 8 | year | 5.423848 |
| 9 | south | 5.60617 |


Resolution

In [30]:
# Run test for one file
pathBeforeFile = "/content/gdrive/My Drive/CFSJ/Resolutions/"
filename = "Fire_Department_Exam_Free_Use_Hall.pdf"
pathToFile = pathBeforeFile + filename

# Test
fileTestResult = testFilename(pathToFile=pathToFile)

# Top 10 keywords
getKeywords(pathToFile).head(10)

100.0% of keywords were found.


|   | Keywords | TF-IDF |
|---|---|---|
| 0 | use | 3.855032 |
| 1 | free | 3.855032 |
| 2 | fire | 3.855032 |
| 3 | department | 3.988564 |
| 4 | saturday | 4.142714 |
| 5 | hall | 4.325036 |
| 6 | august | 4.325036 |
| 7 | authorize | 4.325036 |
| 8 | process | 4.54818 |
| 9 | exam | 4.54818 |


### Test All Files

Memorandums

In [None]:
pathBeforeFile = "/content/gdrive/My Drive/CFSJ/Memorandums/"

print("Starting Tests.")
print("-"*50)

for i,filename in enumerate(os.listdir(pathBeforeFile)):
    pathToFile = pathBeforeFile + filename
    print(f"Test {i+1}: {filename}")
    result = testFilename(pathToFile=pathToFile)
    print("\n" + "-"*50)

print("Tests completed.")

Starting Tests.
--------------------------------------------------
Test 1: Downtown_Rezone_Addendum_Environmental.pdf
25.0% of keywords were found.

--------------------------------------------------
Test 2: Chief_Police_Questions_Policy_Selection.pdf
60.0% of keywords were found.

--------------------------------------------------
Test 3: Dumpster_Day_Brooktree_Vinci_Flickinger.pdf
40.0% of keywords were found.

--------------------------------------------------
Test 4: Juneteenth_Holiday.pdf
100.0% of keywords were found.

--------------------------------------------------
Test 5: Marriott_Townplace_Suites_Hotel_Vesting_Development_Permit.pdf
14.29% of keywords were found.

--------------------------------------------------
Test 6: Demolition_Permit_Site_Development_Construction_Building.pdf
50.0% of keywords were found.

--------------------------------------------------
Test 7: Audit_Peer_Review.pdf
100.0% of keywords were found.

--------------------------------------------------


Resolutions

* PyPDF2 was not able to get text from "Environmental_Mixed_Use_Construction.pdf" (or SanJose16.pdf); the result was just many newline chars.
* pdfplumber does extract text from SanJose16.pdf.

In [None]:
pathBeforeFile = "/content/gdrive/My Drive/CFSJ/Resolutions/"

print("Starting Tests.")
print("-"*50)

for i,filename in enumerate(os.listdir(pathBeforeFile)):
    pathToFile = pathBeforeFile + filename
    print(f"Test {i+1}: {filename}")
    result = testFilename(pathToFile=pathToFile)
    print("\n" + "-"*50)

print("Tests completed.")

Starting Tests.
--------------------------------------------------
Test 1: Financing_Commercial.pdf
50.0% of keywords were found.

--------------------------------------------------
Test 2: Fire_Department_Exam_Free_Use_Hall.pdf
100.0% of keywords were found.

--------------------------------------------------
Test 3: Vacate_Almaden_Property_Surplus.pdf
25.0% of keywords were found.

--------------------------------------------------
Test 4: Environmental_Mixed_Use_Construction.pdf
50.0% of keywords were found.

--------------------------------------------------
Tests completed.


*** 
# Other

### View Common Words
- Will change this later

In [None]:
df_commonWords = getCommonWords()
df_commonWords[:20]

['date',
 'require',
 'item',
 'approve',
 'adopt',
 'provide',
 'propose',
 'project',
 'service',
 'follow',
 'post',
 'recommend',
 'include',
 'state',
 'report',
 'pursuant',
 'government',
 'receive',
 'associate',
 'use']

In [None]:
pathBeforeFile = "/content/gdrive/My Drive/CFSJ/StopWords/"
pd.read_csv(pathBeforeFile + "CommonWords.csv").head(15)

|   | Word | Frequency |
|---|---|---|
| 0 | date | 16 |
| 1 | require | 16 |
| 2 | item | 16 |
| 3 | approve | 16 |
| 4 | adopt | 15 |
| 5 | provide | 14 |
| 6 | propose | 13 |
| 7 | project | 13 |
| 8 | service | 13 |
| 9 | follow | 13 |
