# Keyword Extraction with spaCy Processing
***
### Table of Contents 

  1. [Setup](#setup)

  2. [Functions](#createfunctions)

  3. [Run Tests](#run-tests)

  4. [Save as Txt Files](#save-as-txt)

<a name="setup"></a>
***
# Setup

### Install Packages
* PyPDF2
* pdfplumber
* spaCyPDFReader (cannot install)

In [1]:
!pip install PyPDF2



In [2]:
!pip install pdfplumber



In [11]:
!pip install spacypdfreader

[31mERROR: Could not find a version that satisfies the requirement spacypdfreader (from versions: none)[0m
[31mERROR: No matching distribution found for spacypdfreader[0m



### Mount Google Drive
* Files from three folders: Memorandums, Resolutions, SanJoseFiles


In [12]:
from google.colab import drive

drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


<a name="createfunctions"></a>
*** 
# Functions

1.  Processing Text: load pdf text and create tokens
    * getText_PyPDF2(pathToFile)
    * getText(pathToFile)  **<---new**
    * preprocessText(text)
    * getStopWords( ) **<---updated**
    * getParser( )
    * processText(sentence, stopWords, parser)
    * getSentences(text)
    * getTokens(sentences)

3.  TF-IDF: compute scores and get keywords
    * get_tf_idf(tokens)
    * getKeywords(pathToFile)
  
4.  Test: compare code's keywords with correct keywords
    * getCorrectKeywords(pathToFile)
    * testFilename(pathToFile)

5. Save Files: save text as .txt, save keywords in .csv **<---new**
    * saveAsTxt(filename, text)  **<---new**
    * saveKeywordsAsCsv(filename, keywords_df) **<---new**

### 1. Processing Text

Load text from PDF

In [3]:
"""
Get raw text from pdf file as string using PyPDF2
"""
def getText_PyPDF2(pathToFile: str) -> str:
    # Load pdf file
    pdfFile = open(pathToFile, 'rb')
    PDF_Reader = PyPDF2.PdfFileReader(pdfFile)

    # Get total number of pages in document
    numPages = PDF_Reader.getNumPages()
    #print(f"There are {numPages} pages in the file.\n")

    # Combine text from all pages into one string
    text = ""
    for pg_number in range(numPages):
      page = PDF_Reader.getPage(pg_number)
      page_text = page.extractText()
      text += page_text
    
    return text

In [4]:
"""
Extract raw text from pdf using pdfplumber

Problems with PyPDF2
  - only extracted 2 of 13 pages for a file
  - added many newline chars where they don't exist in the original 

Improvements with pdfplumber
  - able to extract ALL text, including header, footer, image captions
  - keeps general format of original pdf, just makes it all left-aligned
  - all words extracted are as they appear in pdf 
    (significantly reduced the "fake" words)

Initially found here: 
https://towardsdatascience.com/how-to-extract-text-from-pdf-245482a96de7
"""
def getText(pathToFile: str) -> str:
    # Open pdf file
    pdfFile = pdfplumber.open(pathToFile)

    # Get list of all pages' objects
    allPages = pdfFile.pages

    # Extract text from each page and store into one string
    allText = ""
    for pageObject in allPages:
        pageText = pageObject.extract_text()
        allText += pageText
    
    return allText

In [5]:
"""
Get text from pdf as a spacy doc object using spacypdfreader
(spacypdfreader was not able to get installed)
"""

'\nGet text from pdf as a spacy doc object using spacypdfreader\n(spacypdfreader was not able to get installed)\n'

Process Text

In [6]:
"""
Preprocess text by replacing newline with a space
"""
def preprocessText(text: str) -> str:
    # Make lower case, and remove newline
    preprocessedText = text.replace("\n", " ")
    return preprocessedText

In [7]:
""" 
Get list of stop words from spacy
"""
def getStopWords():
  stopWords = spacy.lang.en.stop_words.STOP_WORDS
  stopWordsList = list(stopWords) #was a set
  stopWordsList += ['san', 'jose', 'josé', 'city', 'council', 
                    'meeting', 'resolution', 'memorandum',
                    'event', 'file', 'document', 'agenda',
                    'draft', 'contact', 'office', 'resource',
                    'clerk', 'final', 'california',]
  return stopWordsList

In [8]:
"""
Create spacy Language object to parse English text
"""
def getParser():
    parser = English()
    return parser

In [9]:
"""
Processing text: lemmatize, remove stop words, make lowercase
Input:
    sentence: str, 
    stopWords: set, 
    parser: spacy.lang.en.English
"""

def processText(sentence, stopWords, parser) -> list:
    
    # Create token object 
    tokens = parser(sentence)
    
    # Lemmatize each token and make them lower case
    tokens_lemmatized = [word.lemma_ for word in tokens]
    tokens_lowercase = [(word.lower().strip()) for word in tokens_lemmatized]
    
    # Removing stop words and any punctuation or numeric strings
    list_tokens = []
    for word in tokens_lowercase:
        if word not in stopWords and word.isalpha():
            list_tokens.append(word)
    
    # Remove single letters
    list_tokens = [word for word in list_tokens if len(word)>1]
    
    # Return preprocessed list of tokens
    return list_tokens  

In [10]:
"""
Get sentences from pdf's preprocessed text using spacy's trained pipeline
"""
def getSentences(text: str) -> list:
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(text)
    sentences = list(doc.sents)
    sentences = [sent.string.strip() for sent in sentences]
    return sentences

In [11]:
"""
Break sentences into words by processing them
"""
def getTokens(sentences: list) -> list:
    stopWords = getStopWords()
    parser = getParser()
    # Process sentences to get words
    tokens = []
    for sentence in sentences:
        current_tokens = processText(sentence, stopWords, parser)
        tokens += current_tokens
    # Return words with alphabet only
    return tokens

### 2. TF-IDF

In [12]:
"""
Run tf-idf algorithm on the list of tokens, and
return a dataframe with tokens and scores
"""
def get_TF_IDF(tokens: list):
    
    # compute TF-IDF
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_vectorizer.fit(tokens)

    tf_idf = list(tfidf_vectorizer.idf_) #scores
    features = list(tfidf_vectorizer.get_feature_names()) #words/tokens

    # store results in dataframe
    scores_df = pd.DataFrame(list(zip(features, tf_idf)), 
                             columns=['Keywords', 'TF-IDF'])
    # sort by score in ascending - small to large - order
    scores_df = scores_df.sort_values('TF-IDF').reset_index(drop=True)
    
    return scores_df

In [13]:
"""
Combine all functions to get keywords dataframe from just path to file
"""
def getKeywords(pathToFile: str): 
  rawText = getText(pathToFile=pathToFile)
  preprocessedText = preprocessText(text = rawText)
  sentences = getSentences(text = preprocessedText)
  tokens = getTokens(sentences = sentences)

  # get dataframe with keywords and scores
  keywords_df = get_TF_IDF(tokens)

  return keywords_df

### 3. Test 

In [14]:
"""
Extract words from filename, 
which are separated by an underscore _
"""
def getCorrectKeywords(pathToFile: str) -> list:
    filename = pathToFile.split('/')[-1]
    correctKeywords = filename.replace(".pdf", "").split('_')
    correctKeywords = [word.lower() for word in correctKeywords]
    return correctKeywords

In [15]:
"""
Test one file
Compare correct keywords with the top 10 keywords computed with tf-idf
"""
def testFilename(pathToFile: str) -> float:
    # get top 10 keywords using tf-idf
    tf_idf_keywords = getKeywords(pathToFile)['Keywords'].to_list()[:10]

    # get actual keywords from file name
    correctKeywords = getCorrectKeywords(pathToFile)

    numCorrectWordsFound = 0
    for keyword in correctKeywords:
        if keyword in tf_idf_keywords:
            #print(f"{keyword} was found.")
            numCorrectWordsFound += 1
            
        #else:
            #print(f"The word '{keyword}' was not found in tf-idf keywords.")
    
    correctPercentage = round((numCorrectWordsFound/len(correctKeywords))*100, 2)
    print(f"{correctPercentage}% of keywords were found.")

    return correctPercentage

### 4. Save Files

In [16]:
"""
Save extracted text into .txt file,
using the same name (e.g. SanJose1.pdf => SanJose1.txt)
"""
def savePDFAsTxt(filename: str, text: str):
    # Create folder, if it doesn't exist 
    pathToFolder = "/content/gdrive/My Drive/CFSJ/"
    os.makedirs(pathToFolder+"TxtFiles", exist_ok=True)

    # Create txt file name
    name = filename[:-4] #remove .pdf
    txtFilename = name + ".txt"

    # Make txt file with name
    txtFile = open(pathToFolder+"TxtFiles/"+txtFilename, 'w+')    
    
    # Write to file with string of pdf's text
    txtFile.write(text.strip())
    print(f"Saved as {txtFilename}")
    return txtFile

In [17]:
"""
Save dataframe with keywords and tf-idf scores into .csv file,
using the same name (e.g. SanJose1.pdf => SanJose1.csv)

keywords_df is result from function getKeywords()s
"""
def saveKeywordsAsCsv(filename, keywords_df):
    # Create folder to store file
    pathToFolder = "/content/gdrive/My Drive/CFSJ/"
    os.makedirs(pathToFolder+"Keywords-spaCy", exist_ok=True)

    # Create csv file name
    name = filename[:-4]
    csvFilename = name + ".csv"

    # Save file in folder
    folder = "Keywords-spaCy/"
    keywords_df.to_csv(pathToFolder + folder + csvFilename, index=False)
    print(f"Saved as {csvFilename}")

    return csvFilename

<a name="run-tests"></a>
*** 
# Run Tests
  1. Load libraries
  2. Test 1 File
  3. Test All Files

In [18]:
import os
import string
import pandas as pd

import PyPDF2
import pdfplumber
import spacy
from spacy.lang.en import English

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

### Test One File

Memorandum

In [19]:
# Run test for one file
pathBeforeFile = "/content/gdrive/My Drive/CFSJ/Memorandums/"
filename = "Historic_Landmark_Designation_Property.pdf"
pathToFile = pathBeforeFile + filename

# Test
fileTestResult = testFilename(pathToFile=pathToFile)

# Top 10 keywords
getKeywords(pathToFile).head(10)

75.0% of keywords were found.


Unnamed: 0,Keywords,TF-IDF
0,house,4.547931
1,historic,4.657131
2,property,4.847174
3,contract,5.053026
4,preservation,5.27617
5,act,5.27617
6,mills,5.350278
7,landmark,5.350278
8,south,5.612642
9,historical,5.663935


Resolution

In [20]:
# Run test for one file
pathBeforeFile = "/content/gdrive/My Drive/CFSJ/Resolutions/"
filename = "Fire_Department_Exam_Free_Use_Hall.pdf"
pathToFile = pathBeforeFile + filename

# Test
fileTestResult = testFilename(pathToFile=pathToFile)

# Top 10 keywords
getKeywords(pathToFile).head(10)

100.0% of keywords were found.


Unnamed: 0,Keywords,TF-IDF
0,use,3.883403
1,free,3.883403
2,fire,3.883403
3,saturday,4.171085
4,department,4.171085
5,hall,4.353407
6,august,4.353407
7,process,4.57655
8,exam,4.57655
9,captain,4.57655


### Test All Files

Memorandums

In [21]:
pathBeforeFile = "/content/gdrive/My Drive/CFSJ/Memorandums/"

print("Starting Tests.")
print("-"*50)

for i,filename in enumerate(os.listdir(pathBeforeFile)):
    pathToFile = pathBeforeFile + filename
    print(f"Test {i+1}: {filename}")
    result = testFilename(pathToFile=pathToFile)
    print("\n" + "-"*50)

print("Tests completed.")

Starting Tests.
--------------------------------------------------
Test 1: Downtown_Rezone_Addendum_Environmental.pdf
25.0% of keywords were found.

--------------------------------------------------
Test 2: Chief_Police_Questions_Policy_Selection.pdf
80.0% of keywords were found.

--------------------------------------------------
Test 3: Dumpster_Day_Brooktree_Vinci_Flickinger.pdf
40.0% of keywords were found.

--------------------------------------------------
Test 4: Juneteenth_Holiday.pdf
100.0% of keywords were found.

--------------------------------------------------
Test 5: Marriott_Townplace_Suites_Hotel_Vesting_Development_Permit.pdf
14.29% of keywords were found.

--------------------------------------------------
Test 6: Demolition_Permit_Site_Development_Construction_Building.pdf
50.0% of keywords were found.

--------------------------------------------------
Test 7: Audit_Peer_Review.pdf
100.0% of keywords were found.

--------------------------------------------------


Resolutions

* Was not able to get text from "Environmental_Mixed_Use_Construction.pdf" (or SanJose16.pdf) with PyPDF2
* just shows many newline chars


* SanJose16.pdf works with pdfplumber

In [22]:
pathBeforeFile = "/content/gdrive/My Drive/CFSJ/Resolutions/"

print("Starting Tests.")
print("-"*50)

for i,filename in enumerate(os.listdir(pathBeforeFile)):
    pathToFile = pathBeforeFile + filename
    print(f"Test {i+1}: {filename}")
    result = testFilename(pathToFile=pathToFile)
    print("\n" + "-"*50)

print("Tests completed.")

Starting Tests.
--------------------------------------------------
Test 1: Financing_Commercial.pdf
50.0% of keywords were found.

--------------------------------------------------
Test 2: Fire_Department_Exam_Free_Use_Hall.pdf
100.0% of keywords were found.

--------------------------------------------------
Test 3: Vacate_Almaden_Property_Surplus.pdf
25.0% of keywords were found.

--------------------------------------------------
Test 4: Environmental_Mixed_Use_Construction.pdf
25.0% of keywords were found.

--------------------------------------------------
Tests completed.


<a name="save-as-txt"></a>
*** 
# Save Files
Save PDFs as txt files, and save keywords dataframe as csv files.

### Save PDFs as Txt Files

In [23]:
pathBeforeFile = "/content/gdrive/My Drive/CFSJ/SanJoseFiles/"

print("Starting to convert all PDFs to txt files.")
print("-"*50)

for i,filename in enumerate(os.listdir(pathBeforeFile)):
    pathToFile = pathBeforeFile + filename
    print(f"Saving file #{i+1}: {filename}")
    text = getText(pathToFile)
    txtFile = savePDFAsTxt(filename, text)
    print("\n" + "-"*50)

print(f"Finished: Saved {i+1} txt files.")

Starting to convert all PDFs to txt files.
--------------------------------------------------
Saving file #1: SanJose19.pdf
Saved as SanJose19.txt

--------------------------------------------------
Saving file #2: SanJose4.pdf
Saved as SanJose4.txt

--------------------------------------------------
Saving file #3: SanJose13.pdf
Saved as SanJose13.txt

--------------------------------------------------
Saving file #4: SanJose17.pdf
Saved as SanJose17.txt

--------------------------------------------------
Saving file #5: SanJose11.pdf
Saved as SanJose11.txt

--------------------------------------------------
Saving file #6: SanJose15.pdf
Saved as SanJose15.txt

--------------------------------------------------
Saving file #7: SanJose9.pdf
Saved as SanJose9.txt

--------------------------------------------------
Saving file #8: SanJose7.pdf
Saved as SanJose7.txt

--------------------------------------------------
Saving file #9: SanJose5.pdf
Saved as SanJose5.txt

--------------------

### Save Keywords as Csv Files

In [24]:
pathBeforeFile = "/content/gdrive/My Drive/CFSJ/SanJoseFiles/"

print("Starting to save keywords for all PDFs to csv files.")
print("-"*50)

for i,filename in enumerate(os.listdir(pathBeforeFile)):
    pathToFile = pathBeforeFile + filename
    print(f"Saving file #{i+1}: {filename}")
    # Get keywords dataframe
    keywords_df = getKeywords(pathToFile)
    csvFile = saveKeywordsAsCsv(filename, keywords_df)
    print("\n" + "-"*50)

print(f"Finised: Saved {i+1} csv files.")

Starting to save keywords for all PDFs to csv files.
--------------------------------------------------
Saving file #1: SanJose19.pdf
Saved as SanJose19.csv

--------------------------------------------------
Saving file #2: SanJose4.pdf
Saved as SanJose4.csv

--------------------------------------------------
Saving file #3: SanJose13.pdf
Saved as SanJose13.csv

--------------------------------------------------
Saving file #4: SanJose17.pdf
Saved as SanJose17.csv

--------------------------------------------------
Saving file #5: SanJose11.pdf
Saved as SanJose11.csv

--------------------------------------------------
Saving file #6: SanJose15.pdf
Saved as SanJose15.csv

--------------------------------------------------
Saving file #7: SanJose9.pdf
Saved as SanJose9.csv

--------------------------------------------------
Saving file #8: SanJose7.pdf
Saved as SanJose7.csv

--------------------------------------------------
Saving file #9: SanJose5.pdf
Saved as SanJose5.csv

----------