## Author: Maria Garcia 
Date created: January 2, 2024

# InfoSearch-Extractor

## AI-Powered Document Search and Information Extraction

A system that searches for a given topic within a collection of PDFs and extracts the relevant information from the documents. The project combines AI and natural language processing with practical utility: a tool for efficient document search and information extraction.

### Overview
Picture this: 891 e-books, a treasure trove of Data Science and Machine Learning wonders. Why? Because book lovers are on a quest for knowledge! 

### The Challenge
1. It's like finding a needle in a haystack when searching through all these gems.
2. Each book has a wealth of information, but finding specific topics can be time-consuming.


It's time to craft a solution that transforms this exploration into an effortless search!


### Seamless Exploration
Embark on a seamless exploration of your chosen topic within a curated folder of PDFs. This AI aims to sift through the complexities, presenting you with a refined selection of documents that genuinely matter. 

### Extraction with Purpose
Gone are the days of information overload. This system does not just search; it extracts meaningful information, distilling the essence of each document and focusing only on the most relevant and impactful details. 

This system is more than a tool; it's your key to unlocking the vast power of knowledge. Focusing on specific topics and information within curated PDF files brings you precisely the information you seek—focused intelligence is at your fingertips.


### Technologies and Tools Used:

- Programming Language: Python
- PDF Parsing Libraries: PyPDF2, pdfplumber
- Topic Modeling: Gensim - Latent Dirichlet Allocation (LDA)
- Search Mechanism: TF-IDF
- Named Entity Recognition: spaCy 


## 1. Dataset Preparation: E-Book Collection

- **Number of E-Books:** 891
- **Total Size:** 14.77 GB
- **Topic Focus:** Data Science and Machine Learning
- **Compilation Date:** February 23, 2022

**Library Essentials:**

In [1]:
import os
import pandas as pd
import pdfplumber
from tqdm import tqdm
from gensim import corpora, models
from pprint import pprint
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

import spacy
import ipywidgets as widgets
from IPython.display import display

In [2]:
# Access PDF files in a designated folder 
def access_pdfs(source_folder, sample_size=13):
    # View the files
    pdf_files = [file for file in os.listdir(source_folder) if file.endswith(".pdf")]

    # Check # and Total Size of the files
    total_pdf_files = len(pdf_files)
    total_size = sum(os.path.getsize(os.path.join(source_folder, pdf_file)) for pdf_file in pdf_files) / (1024 * 1024)

    # Data Preview
    print(f"1. Number of E-books:   {total_pdf_files} PDF files accessed")
    print(f"2. Total Size: \t\t{total_size:.2f} MB")
    print(f"\n\n PDF files found in the source folder:\n\n{pdf_files[:sample_size]}")  # Display only the specified sample size
    accessed_pdf_files = [os.path.join(source_folder, pdf_file) for pdf_file in pdf_files]
    return accessed_pdf_files[:sample_size]

source_folder = "/Users/mialaarnigarcia/Desktop/Project_DS_and_ML_Folder"

# Apply the function to the accessed e-books
pdf_files_found = access_pdfs(source_folder, sample_size=13)

1. Number of E-books:   891 PDF files accessed
2. Total Size: 		14087.48 MB


 PDF files found in the source folder:

['Mastering Machine Learning with Python in Six Steps A Practical Implementation Guide to Predictive Data Analytics Using Python by Manohar Swamynathan (auth.).pdf', 'Football Hackers The Science and Art of a Data Revolution by Christoph Biermann.pdf', 'Hyperparameter Optimization in Machine Learning Make Your Machine Learning and Deep Learning Models More Efficient by Tanay Agrawal.pdf', 'The hundred-page machine learning book by Burkov, Andriy.pdf', 'Signal Processing and Machine Learning for Brain–Machine Interfaces by Toshihisa Tanaka, Mahnaz Arvaneh.pdf', 'Financial signal processing and machine learning by Akansu, Ali N. Kulkarni, Sanjeev Malioutov, Dmitry.pdf', 'XML and Web Technologies for Data Sciences with R by Deborah Nolan, Duncan Temple Lang (auth.).pdf', 'Fundamentals of Machine Learning for Predictive Data Analytics Algorithms, Worked Examples, and Case S
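
The printed size (14087.48 MB) and the 14.77 GB listed in the dataset summary describe the same collection in different units. The function divides bytes by 1024 * 1024, so its "MB" are binary megabytes. A minimal sketch of the conversion to decimal gigabytes, where the helper name `mib_to_decimal_gb` is hypothetical:

```python
def mib_to_decimal_gb(size_mib):
    """Convert binary megabytes (bytes / 1024**2) to decimal gigabytes (bytes / 10**9)."""
    size_bytes = size_mib * 1024 * 1024
    return size_bytes / 1_000_000_000

# Value taken from the output above
print(f"{mib_to_decimal_gb(14087.48):.2f} GB")  # prints "14.77 GB"
```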

In [3]:
# See samples
pdf_file_names = [os.path.basename(pdf_file) for pdf_file in pdf_files_found]
df = pd.DataFrame({"File Name": pdf_file_names})
display(df)

| | File Name |
|---|---|
| 0 | Mastering Machine Learning with Python in Six ... |
| 1 | Football Hackers The Science and Art of a Data... |
| 2 | Hyperparameter Optimization in Machine Learnin... |
| 3 | The hundred-page machine learning book by Burk... |
| 4 | Signal Processing and Machine Learning for Bra... |
| 5 | Financial signal processing and machine learni... |
| 6 | XML and Web Technologies for Data Sciences wit... |
| 7 | Fundamentals of Machine Learning for Predictiv... |
| 8 | Data Science Algorithms in a Week Top 7 algori... |
| 9 | Fundamentals of Deep Learning Designing Next-G... |


Separate the Author Name from the PDF File Title

In [4]:
# Extract title and author for easy scan
df['E-Book Title'] = df['File Name'].str.extract(r'^(.*?) by ')
df['Author'] = df['File Name'].str.extract(r' by (.*)\.pdf$')
display(df)

| | File Name | E-Book Title | Author |
|---|---|---|---|
| 0 | Mastering Machine Learning with Python in Six ... | Mastering Machine Learning with Python in Six ... | Manohar Swamynathan (auth.) |
| 1 | Football Hackers The Science and Art of a Data... | Football Hackers The Science and Art of a Data... | Christoph Biermann |
| 2 | Hyperparameter Optimization in Machine Learnin... | Hyperparameter Optimization in Machine Learnin... | Tanay Agrawal |
| 3 | The hundred-page machine learning book by Burk... | The hundred-page machine learning book | Burkov, Andriy |
| 4 | Signal Processing and Machine Learning for Bra... | Signal Processing and Machine Learning for Bra... | Toshihisa Tanaka, Mahnaz Arvaneh |
| 5 | Financial signal processing and machine learni... | Financial signal processing and machine learning | Akansu, Ali N. Kulkarni, Sanjeev Malioutov, Dm... |
| 6 | XML and Web Technologies for Data Sciences wit... | XML and Web Technologies for Data Sciences with R | Deborah Nolan, Duncan Temple Lang (auth.) |
| 7 | Fundamentals of Machine Learning for Predictiv... | Fundamentals of Machine Learning for Predictiv... | John D. Kelleher, Brian Mac Namee, Aoife DArcy |
| 8 | Data Science Algorithms in a Week Top 7 algori... | Data Science Algorithms in a Week Top 7 algori... | David Natingga |
| 9 | Fundamentals of Deep Learning Designing Next-G... | Fundamentals of Deep Learning Designing Next-G... | Nikhil Buduma, Nicholas Locascio |
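
The title/author split relies on the file names following a "Title by Author.pdf" pattern. As a rough sketch, the same two regular expressions can be tried on a single file name with the standard library. The helper name `split_title_author` and the file name `notes.pdf` are hypothetical.

```python
import re

# The same two patterns used in the pandas cell, compiled once
TITLE_PATTERN = re.compile(r"^(.*?) by ")
AUTHOR_PATTERN = re.compile(r" by (.*)\.pdf$")

def split_title_author(file_name):
    """Return (title, author); a part is None when its pattern does not match."""
    title = TITLE_PATTERN.search(file_name)
    author = AUTHOR_PATTERN.search(file_name)
    return (title.group(1) if title else None,
            author.group(1) if author else None)

print(split_title_author(
    "Football Hackers The Science and Art of a Data Revolution by Christoph Biermann.pdf"
))
# A file name without " by " yields (None, None) instead of raising an error
print(split_title_author("notes.pdf"))
```

Because the title pattern is non-greedy, a title that itself contains " by " would be cut at the first occurrence, so file names of that kind may need manual checking.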


## 2. PDF Parsing:

Use a PDF parsing library such as PyPDF2 or pdfplumber in Python to extract text content from the PDF documents.


- We go through each PDF file in the pdf_files_found list, open it with pdfplumber, iterate through its pages, and collect the text content of each page in a list.
- PyPDF2 is an open-source alternative that provides the essential functionality for extracting and manipulating PDF content.
- It is also cross-platform, so it can be used on different operating systems.

Note: 

For complex or encrypted PDFs, other libraries may be more suitable than pdfplumber. PyMuPDF or pypdfium2 may be a better fit, depending on the requirements of a project.

In [5]:
# Storage of extracted texts from all PDFs (one entry per page)
extracted_text_list = []

# Iterate through each PDF file with tqdm
for pdf_file in tqdm(pdf_files_found, desc="Processing PDFs"):
    with pdfplumber.open(pdf_file) as pdf:
        # Iterate through pages and extract text
        for page in pdf.pages:
            # extract_text() may return None for pages without a text layer;
            # store an empty string so later joins and searches do not fail
            text = page.extract_text() or ""

            # Append the extracted text to the list
            extracted_text_list.append(text)

print("Now extracted_text_list contains the texts from all PDFs.")

Processing PDFs: 100%|██████████████████████████| 13/13 [06:01<00:00, 27.78s/it]

Now extracted_text_list contains the texts from all PDFs.
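
The loop above stores only the page text, so a search hit cannot be traced back to a book or page. One possible variation, shown here as a sketch rather than the notebook's actual approach, keeps the file name and page number with each page. The names `collect_page_records` and `FakePage` are hypothetical. `FakePage` stands in for a pdfplumber page so the example runs without a PDF file.

```python
def collect_page_records(file_name, pages):
    """Return one record per page with its source file and 1-based page number.

    `pages` is any iterable of objects exposing extract_text(),
    for example the pages of an opened pdfplumber document.
    """
    records = []
    for page_number, page in enumerate(pages, start=1):
        text = page.extract_text() or ""  # treat pages without text as empty
        records.append({"file": file_name, "page": page_number, "text": text})
    return records


class FakePage:
    """Stand-in for a PDF page so the sketch runs without a real file."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


records = collect_page_records("example.pdf", [FakePage("Decision Trees"), FakePage(None)])
for record in records:
    print(record)
```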





## 3. Topic Modeling:
- Leverage Gensim's Latent Dirichlet Allocation (LDA) to identify key topics within the extracted text.


- Gensim's LDA model expects its input as a corpus in which each document is represented as a bag of words.

- All extracted text is joined into a single string, so the entire corpus is treated as a "single document" for LDA modeling.

- We tokenize the extracted text, create a dictionary and a corpus, and train an LDA model.

- The num_topics parameter specifies the number of topics we want the model to identify. Parameters may be adjusted based on the dataset and needs.

- After training the base LDA model on this subset of the data, the output lists the "identified topics" along with their "associated words" and the "weights" of those words within each topic.

In [6]:
# Combine all extracted text into a single string
combined_text = ' '.join(extracted_text_list)

# Preprocess the text (tokenization, removing stop words, etc.)
# For simplicity, we'll use a basic tokenization here.
tokenized_text = [word for word in combined_text.lower().split() if word.isalnum()]

# Dictionary from the tokenized text and a corpus that represents the document as a bag of words
dictionary = corpora.Dictionary([tokenized_text])
corpus = [dictionary.doc2bow(tokenized_text)]

# Train the LDA model
lda_model = models.LdaModel(corpus, num_topics=3, id2word=dictionary, passes=15)

# Print the topics
pprint(lda_model.print_topics())

[(0,
  '0.071*"the" + 0.028*"of" + 0.028*"to" + 0.026*"a" + 0.022*"and" + '
  '0.019*"in" + 0.017*"is" + 0.012*"for" + 0.011*"we" + 0.011*"that"'),
 (1,
  '0.047*"the" + 0.016*"of" + 0.015*"to" + 0.014*"and" + 0.014*"a" + '
  '0.013*"in" + 0.010*"is" + 0.008*"that" + 0.007*"we" + 0.007*"for"'),
 (2,
  '0.001*"the" + 0.001*"of" + 0.000*"to" + 0.000*"and" + 0.000*"in" + '
  '0.000*"a" + 0.000*"is" + 0.000*"we" + 0.000*"this" + 0.000*"for"')]


Output Interpretation:

The output represents the identified topics along with the associated words and their weights within each topic. Each tuple in the list corresponds to a topic, and the topics are represented as a combination of words with their associated weights.

For example, in the first topic (index 0):

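 - The word "the" has a weight of 0.071.
 - The word "in" has a weight of 0.019.
 - Similarly, other words and their weights are listed.

All three topics are dominated by common function words such as "the", "of", and "to", because the basic tokenization above does not remove stop words.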
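
The topics printed above consist mostly of words such as "the" and "of". A possible refinement, sketched here with the standard library only, is to drop stop words and build one bag of words per page instead of one for the whole corpus. The stop-word set is a tiny hypothetical list for illustration, and `bag_of_words` is a hypothetical helper, not part of Gensim.

```python
from collections import Counter

# Tiny illustrative stop-word set (hypothetical; a real run would use a fuller list)
STOP_WORDS = {"the", "of", "to", "a", "and", "in", "is", "for", "we", "that", "this"}

def bag_of_words(text):
    """Lowercase, keep alphanumeric tokens, drop stop words, and count the rest."""
    tokens = [word for word in text.lower().split()
              if word.isalnum() and word not in STOP_WORDS]
    return Counter(tokens)

pages = [
    "The decision tree is a model for classification",
    "We train the network and the network learns",
]
# One bag of words per page, so each page stays a separate document
bags = [bag_of_words(page) for page in pages]
for bag in bags:
    print(dict(bag))
```

Keeping pages as separate documents gives a topic model several word distributions to compare, which a single merged document cannot provide.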


## 4. Search Functionality:
- We need a search mechanism that takes a user-inputted query and searches for relevant documents within the PDFs.
- For efficient text searching, let us utilize techniques such as TF-IDF (Term Frequency-Inverse Document Frequency).

Why TF-IDF?
- The TfidfVectorizer from scikit-learn converts text data into numerical vectors. We then calculate the cosine similarity between the user query and each document and identify the most relevant documents based on their similarity scores.
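
To make the weighting concrete, here is a minimal sketch of the textbook formulation, tf * log(N / df), using only the standard library. It is not what scikit-learn computes exactly: TfidfVectorizer applies a smoothed IDF and L2 normalization by default, so its numbers differ. The function name `tf_idf` and the three sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    "decision trees classify data",
    "neural networks classify data",
    "decision trees split data",
]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

A term that occurs in every document ("data" here) receives a weight of zero, while a term unique to one document ("neural") receives the highest weight. This is why TF-IDF favors distinctive words over common ones.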

Why Cosine Similarity?
- Cosine similarity measures how similar two vectors are.
- In our case, these are the vector representations of the "user query" and "each document".
- It is the cosine of the angle between the two vectors in a multi-dimensional space.

Cosine Similarity Merit:
- It helps compare documents of different lengths and identify the most relevant ones by quantifying how similar the directions of their vector representations are.
- It is useful when the magnitude of the vectors is less important than their direction.
- It is scale-invariant, meaning it focuses on the orientation of vectors rather than their length.
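
The scale-invariance point can be checked with a small worked example. This is a plain-Python sketch of the standard formula, the dot product divided by the product of the vector lengths. The function name `cosine_similarity_pair` and the sample vectors are hypothetical.

```python
import math

def cosine_similarity_pair(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

query = [1.0, 2.0, 0.0]
scaled_doc = [2.0, 4.0, 0.0]     # same direction as the query, twice the length
unrelated_doc = [0.0, 0.0, 3.0]  # shares no terms with the query

print(cosine_similarity_pair(query, scaled_doc))
print(cosine_similarity_pair(query, unrelated_doc))
```

Doubling every component leaves the similarity at its maximum, so a long page is not favored over a short one merely for being longer.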

In [7]:
# User-inputted query
user_query = "important information"

# Preprocess the user query 
user_query_tokens = [word for word in user_query.lower().split() if word.isalnum()]

# Preprocess the extracted text
preprocessed_text_list = [" ".join([word for word in text.lower().split() if word.isalnum()]) for text in extracted_text_list]

# Combine the user query and preprocessed text
all_texts = preprocessed_text_list + [" ".join(user_query_tokens)]

# Create a TF-IDF vectorizer
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(all_texts)

# Calculate cosine similarity between the user query and each document
cosine_similarities = cosine_similarity(tfidf_matrix[-1], tfidf_matrix[:-1]).flatten()

# Get the indices of documents with the highest similarity - Top 3
top_indices = cosine_similarities.argsort()[:-4:-1] 

print("Most relevant documents:\n")
for index in top_indices:
    print(f"Document {index + 1}: {extracted_text_list[index]}\n")

Most relevant documents:

Document 3212: Decision Trees
Without further knowledge, we would not be able to classify each row correctly.
Fortunately, there is one more question that can be asked about each row which classifies
each row correctly. For the row with the attribute water=cold, the swimming preference is
no. For the row with the attribute water=warm, the swimming preference is yes.
To summarize, starting with the root node, we ask a question at every node and based on
the answer, we move down the tree until we reach a leaf node where we find the class of
the data item corresponding to those answers.
This is how we can use a ready-made decision tree to classify samples of the data. But it is
also important to know how to construct a decision tree from the data.
Which attribute has a question at which node? How does this reflect on the construction of
a decision tree? If we change the order of the attributes, can the resulting decision tree
classify better than another tree?
In
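
The slice `[:-4:-1]` in the cell above is compact but easy to misread. It takes the last three entries of the ascending order and reverses them. A plain-Python sketch of the same idea, with a hypothetical helper `top_k_indices` and made-up scores:

```python
def top_k_indices(scores, k=3):
    """Indices of the k highest scores, best first."""
    ascending = sorted(range(len(scores)), key=lambda i: scores[i])
    # Walk backwards from the end and stop after k items,
    # mirroring the [:-4:-1] slice for k = 3
    return ascending[:-(k + 1):-1]

scores = [0.10, 0.90, 0.30, 0.70, 0.50]
print(top_k_indices(scores))  # prints [1, 3, 4]
```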

Findings:
1. The top 3 topics by cosine similarity are machine learning (of course), decision trees, and reinforcement learning.
2. There's a common theme of information theory, particularly in the context of decision trees and reinforcement learning.
3. The documents provide a mix of theoretical concepts and practical examples, making them relevant for understanding and implementing machine learning algorithms.

## 5. Information Extraction:


#### The Magic of spaCy

Imagine diving into the vast sea of books, each page filled with the beauty of language. To make sense of this ocean of words, we needed a guide, and spaCy emerged as the perfect companion.

#### Breaking it Down: The Art of Tokenization
- In the realm of text, words are the building blocks. spaCy's magic lies in its ability to break down sentences into these meaningful elements – words, punctuation, and more. It's like deciphering the secret code of language.

#### Understanding the Dance of Words: Part-of-Speech Tagging
- Words, like actors on a stage, play different roles in a sentence. spaCy's Part-of-Speech tagging helps us identify these roles – who is the noun, the verb, or the adjective? It's like unraveling the intricate dance of language.

#### Spotting the Stars: Named Entity Recognition (NER)
- In our exploration, we encountered stars – named entities like people, places, and organizations. spaCy's NER capabilities shine in spotting these stars, helping us extract specific information and bring it into the spotlight.

#### The Need for Speed: spaCy’s Efficiency
- Navigating through the vast library of books is no small feat. spaCy was designed with speed in mind, swiftly processing large volumes of text. It's like having a nimble guide leading you through the pages at a pace that keeps up with your curiosity.

#### The Treasure Map: Pre-trained Models
- Picture this – instead of starting from scratch, spaCy comes with pre-trained models, like a treasure map for multiple languages. It saves time and effort, guiding us to the hidden gems in the world of words.

#### Harmony in Workflows: Integration with Machine Learning
- Our journey doesn’t end with understanding words; we want to weave the insights into a larger tapestry. spaCy integrates with machine learning pipelines, allowing us to incorporate language understanding into our broader exploration.

spaCy becomes the guide, the decoder, and the companion on our literary journey. It transforms the language of books into a tapestry of insights, making our exploration not just efficient but also easier.

In [8]:
# Load spaCy English model
nlp = spacy.load("en_core_web_sm")

In [9]:
# Choose a document index of the document to analyze
document_index = 7 

# Get the text content of the chosen document
document_text = extracted_text_list[document_index]

# Use spaCy to preprocess
doc = nlp(document_text)

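# Extract keywords: alphabetic tokens that are not stop words, in reading order
keywords = [token.text for token in doc if token.is_alpha and not token.is_stop]

# Keep the first few keywords (they are not ranked by frequency)
num_keywords_to_display = 10  
top_keywords = keywords[:num_keywords_to_display]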

# Print extracted keywords
print(f"Top {num_keywords_to_display} Keywords:\n")
print(top_keywords)

Top 10 Keywords:

['Contents', 'Polynomial', 'Regression', 'Multivariate', 'Regression', 'Multicollinearity', 'Variation', 'Inflation', 'Factor', 'VIF']
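
The list above holds the first ten keywords in reading order, which is why "Regression" appears twice. If a frequency ranking is wanted instead, one option is to count the keywords. This sketch uses the standard library, with a hypothetical helper `most_frequent_keywords` applied to the keywords printed above.

```python
from collections import Counter

def most_frequent_keywords(keywords, top_n=10):
    """Return (keyword, count) pairs sorted by descending count, ignoring case."""
    counts = Counter(keyword.lower() for keyword in keywords)
    return counts.most_common(top_n)

sample = ["Contents", "Polynomial", "Regression", "Multivariate", "Regression",
          "Multicollinearity", "Variation", "Inflation", "Factor", "VIF"]
print(most_frequent_keywords(sample, top_n=3))
```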


## 6. Input queries and Extract Information
- A simple user interface that allows users to input queries and view the extracted information.

In [None]:
# Prompt the user to enter a topic or query
user_query = input("Enter a topic or query: ")

# Perform a search to find relevant documents
relevant_documents = []

for i, document_text in enumerate(extracted_text_list):
    # Perform your search logic here, for example, checking if the user query is in the document text
    if user_query.lower() in document_text.lower():
        relevant_documents.append(i)

# Display the information from relevant documents
for document_index in relevant_documents:
    print(f"Extracted Information from Document {document_index + 1}:")
    print(extracted_text_list[document_index])

In [13]:
def search_and_display(query):
    # Start from an empty result list so earlier searches do not accumulate
    relevant_documents = []

    for i, document_text in enumerate(extracted_text_list):
        if query.lower() in document_text.lower():
            relevant_documents.append(i)

    output.clear_output()

    with output:
        # Display additional information
        print(f"1. Number of E-books related to the topic: {len(relevant_documents)}")
        print(f"\n2. List of Documents: \n{', '.join([f'Document {i+1}' for i in relevant_documents])}")
        print("\n3. Information extracted:")

        # Display the information from relevant documents
        for document_index in relevant_documents:
            print(f"Extracted Information from Document {document_index + 1}:")
            print(extracted_text_list[document_index])

# Text input
search_input = widgets.Text(placeholder="Enter a topic or query.", description="Search:")

# Button 
search_button = widgets.Button(description="Search")

# Widget to display output
output = widgets.Output()

# Function to handle button click
def on_search_button_click(b):
    search_and_display(search_input.value)

# Attach the function to the button click event
search_button.on_click(on_search_button_click)

# Display the widgets
display(search_input, search_button, output)

Text(value='', description='Search:', placeholder='Enter a topic or query.')

Button(description='Search', style=ButtonStyle())

Output()

## 7. Evaluation and Improvement:
We can evaluate the accuracy and effectiveness of the system by comparing its search results and information extraction against a manually annotated set of relevant information.
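
One common way to make this comparison concrete is to compute precision and recall of the returned documents against the annotated ones. The following is a minimal sketch: the helper `precision_recall` and the page numbers are hypothetical, and no such evaluation was run in this notebook.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of retrieved document indices against annotated ones."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall

# Hypothetical example: the system returned pages 3, 7, 9 and 12;
# a human annotator marked pages 3, 9 and 20 as relevant
precision, recall = precision_recall([3, 7, 9, 12], [3, 9, 20])
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Precision shows how much of what was returned is relevant. Recall shows how much of the relevant material was found. Tracking both helps reveal whether a change to the search step trades one for the other.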

## 8. Summary:
Imagine you have 891 e-books or more, each holding the secrets of Data Science and Machine Learning.
As a passionate book lover and knowledge seeker, navigating this treasure trove is a bit overwhelming.

Going back to the Challenge:
1. It's like finding a needle in a haystack when searching through all these gems.
2. Each book has a wealth of information, but finding specific topics can be time-consuming.

Introducing the Solution:
- Enter Gensim's Latent Dirichlet Allocation (LDA) model, a tool for extracting hidden topics from a collection of documents.
- But here's the twist: before unleashing the power of LDA, we combine all these e-books into a single string.

Why the Merge?
- LDA works with a "corpus," treating each piece of text as a document.
- Our definition of a "document" isn't each e-book but a continuous text block. So, we merge them into a single string.
- This helps LDA capture relationships and patterns that might span across different parts of the documents.

Path taken:
- The combined text undergoes tokenization and preprocessing, transforming it into a clean, structured format.
- Think of it as preparing the ingredients before cooking - ensuring everything is in order.

Simplifying the Plot:
- Merging simplifies the initial implementation. It's a quick way to test the LDA model on the entire corpus without worrying about document boundaries.
- A pragmatic approach for experimentation, but not a one-size-fits-all solution.

Keep in Mind:
- This strategy might only suit some scenarios. If preserving the uniqueness of each document is crucial or if your documents are lengthy and contain different fields, this method might lose meaningful distinctions.
- Combining or separating documents depends on what you want to achieve.


**With the stage set, topics discovered, and information extracted, the search through the e-books becomes not just manageable but easier.**

Sources:
- Machine Learning Design Interview - Khang Pham
- Analytical Skills for AI and Data Science - Daniel Vaughan
- OpenAI. (2024). ChatGPT [Large language model]. 