<a href="https://colab.research.google.com/github/angelogener/CorpusForEduryone/blob/main/Chemistry_Corpus_Creator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chemistry Corpus Creator

Most of the corpus-creation code was pioneered by Jan; I have only adapted it to suit my own work, so most of the credit for this Colab goes to him. Thanks to everyone in Social Good for their hard work and for pushing on!

Note: Some parts of my code can probably be trimmed, and you are very welcome to take what is here and optimize it!

# Step 1: Initialize Google Colab with libraries and files

The libraries we will use are:

* **PyPDF2** (for parsing PDFs)
* **google.colab** (for file input)
* **re**, Python's built-in regular expression library (for parsing text and filtering out unnecessary characters)
* **nltk**, the Natural Language Toolkit (for processing human language)
* **docx** (for writing .docx files)




In [None]:
!pip install PyPDF2
!pip install python-docx

Collecting python-docx
  Downloading python_docx-1.1.0-py3-none-any.whl (239 kB)
     239.6/239.6 kB 3.4 MB/s eta 0:00:00
Installing collected packages: python-docx
Successfully installed python-docx-1.1.0


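As Jan briefly touches on, 'punkt', 'stopwords', and 'wordnet' are NLTK data packages used to clean up our words: 'punkt' backs the tokenizer, 'stopwords' supplies the stop word list, and 'wordnet' backs the lemmatizer.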

In [None]:
import re
import nltk
from PyPDF2 import PdfReader
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from docx import Document
from google.colab import files
uploaded = files.upload()

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

Saving Nelson-Chemistry-11-Glossary.pdf to Nelson-Chemistry-11-Glossary.pdf
Saving Nelson-Chemistry-12_glossary_index.pdf to Nelson-Chemistry-12_glossary_index.pdf


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

# Step 2: Create Functions to Process and Create Documents

I'll build my corpus from relevant words, starting with the Ontario chemistry textbook glossaries. First we collect all the text from these glossaries.

We will take three steps:
* Create a function to read **any** PDF (short fragments are filtered out later, in Step 4)
* Process the text by removing unnecessary components: ***uppercase letters, punctuation, stopwords, and non-base-form words (e.g. plurals)***
* Create the final corpus!
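
The cleaning steps listed above can be sketched without any downloads. This is a minimal, standard-library-only illustration and not the notebook's actual pipeline: `SAMPLE_STOPWORDS` is a tiny hypothetical stand-in for NLTK's English stop word list, `clean_tokens` is a hypothetical name, the sample sentence is made up, and lemmatization is left out.

```python
import re

# Hypothetical, tiny stand-in for NLTK's English stop word list.
SAMPLE_STOPWORDS = {"the", "of", "a", "an", "is", "in", "that", "and"}


def clean_tokens(text: str) -> list[str]:
    """Lowercase, replace non-letters with spaces, split, and drop stop words."""
    text = text.lower()
    # Same pattern as the notebook: anything that is not a letter or whitespace.
    text = re.sub(r"[^a-zA-Z\s]+", " ", text)
    return [tok for tok in text.split() if tok not in SAMPLE_STOPWORDS]


sample = "Acid: a substance that increases the concentration of H+ ions in water."
print(clean_tokens(sample))
# ['acid', 'substance', 'increases', 'concentration', 'h', 'ions', 'water']
```

Note the stray `h` left over from `H+`: stripping digits and symbols can leave single-letter fragments behind.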



In [None]:
"""
Extracts the text of any given PDF as a single string
"""
def read_pdf(pdf: str) -> str:
    # Instantiate a new reader
    reader = PdfReader(pdf)
    pdf_text = ''

    for page in reader.pages:
      content = page.extract_text()

      # Only append non-empty pages
      if content:
        pdf_text += content

    return pdf_text

"""
Uses nltk to convert most of the words to their base form.
We want this function to clean as many words as possible before we
have to clean the rest by hand (since chemistry has various terms that
may not exist in nltk's data).
"""

def preprocess(text: str):
    # Lowercase, remove punctuation, and tokenize
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]+', ' ', text)
    tokens = word_tokenize(text)

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word not in stop_words]

    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
    return lemmatized_tokens


"""
Using the other functions listed above, we finally create a corpus of
relevant words.
"""
def create_corpus(texts: list[str]) -> list[str]:
    corpus = []
    for text in texts:
        preprocessed_text = preprocess(text)
        corpus.extend(preprocessed_text)
    return corpus

"""
From a corpus, write a document to be cleaned by hand and re-uploaded
for the TF table calculation.
"""
def create_document(corpus: list[str]):
  # Instantiate a new Document
  document = Document()

  # Format!
  document.add_heading('Chemistry Corpus')

  # Fill pages
  corpus_words = ", ".join(corpus)
  document.add_paragraph(corpus_words)

  # Save
  document.save('chemistry corpus.docx')


"""
A function to help locate long joined words in the finalized
corpus doc.
"""
def long_words(words: list[str]) -> list[str]:

  to_check = []
  for word in words:

    # I will be setting 10 as the minimum word length of concern
    # and I do not want repeating instances as I will search these up
    if len(word) >= 10 and word not in to_check:
      to_check.append(word)

  return to_check

"""
Makes a document from words to check.
"""
def create_long(corpus: list[str]):
  # Instantiate a new Document
  document = Document()

  # Format!
  document.add_heading('Long Words')

  # Fill pages
  corpus_words = ", ".join(corpus)
  document.add_paragraph(corpus_words)

  # Save
  document.save('long words.docx')
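
The `long_words` helper above checks `word not in to_check` against a list, which rescans the list for every word. As a possible alternative, here is a small sketch that keeps first-seen order while dropping duplicates in one pass. The name `unique_long_words`, the `min_length` parameter, and the sample tokens are hypothetical and not part of the notebook.

```python
def unique_long_words(words: list[str], min_length: int = 10) -> list[str]:
    """Return each word of at least min_length characters once, in first-seen order."""
    # dict.fromkeys keeps insertion order and drops duplicates in one pass.
    return list(dict.fromkeys(w for w in words if len(w) >= min_length))


tokens = ["electronegativity", "acid", "stoichiometry", "electronegativity", "base"]
print(unique_long_words(tokens))
# ['electronegativity', 'stoichiometry']
```

For a glossary-sized corpus the difference is small; the main benefit is that the intent (unique, ordered) is visible in one line.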



# Step 3: Process!

Put it all together to start processing our Chemistry Glossaries!


In [None]:
# Read PDFs

texts = []
texts.append(read_pdf('Nelson-Chemistry-11-Glossary.pdf'))
texts.append(read_pdf('Nelson-Chemistry-12_glossary_index.pdf'))

corpus = create_corpus(texts)
words_to_check = long_words(corpus)

create_document(corpus)
create_long(words_to_check)

# TF Table Calculations
Now that the corpus has been made, we calculate each term's frequency relative to the total number of words in the corpus.
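
As a quick worked example of that ratio, the sketch below computes a term frequency as count divided by total and formats it as a percentage. The function name `term_frequency` and the numbers are made up for illustration.

```python
def term_frequency(count: int, total: int) -> float:
    """Share of the corpus taken up by one word: count / total."""
    if total <= 0:
        raise ValueError("total must be positive")
    return count / total


# Hypothetical numbers: a word seen 5 times in a 200-word corpus.
tf = term_frequency(5, 200)
print(f"{tf:.2%}")  # 2.50%
```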

# Step 4: Import new libraries and upload processed documents
Since we have processed our Word document, we upload it and import any other libraries needed for this stage of the process.

Note: Because "p" turned out to be the most frequent entry, I omit entries shorter than 3 characters. This removes stray articles and fragments, but also the short acronyms of some molecules and polymers.
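
To see what such a length filter keeps and drops, here is a small sketch on a made-up string. Where the notebook's "p" entries actually come from is not shown in the source; hyphenated or symbol-laden terms are only one plausible origin.

```python
import re

# Made-up sample text, not taken from the glossaries.
text = "pH, p-orbital, O2 and NaCl"

# Same character filter as the notebook, then a plain whitespace split.
tokens = re.sub(r"[^a-zA-Z\s]+", " ", text.lower()).split()
print(tokens)  # ['ph', 'p', 'orbital', 'o', 'and', 'nacl']

# Keep only tokens with at least 3 characters.
kept = [t for t in tokens if len(t) > 2]
print(kept)  # ['orbital', 'and', 'nacl']
```

The filter removes the stray `p` and `o`, but it also removes `ph`, which is the trade-off the note describes.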

In [None]:
import pandas as pd
uploaded = files.upload()

Saving chemistry corpus.docx to chemistry corpus (1).docx


In [None]:
"""
Similar to read_pdf, but takes a .docx file instead
"""
def upload_doc(path: str) -> list[str]:
  document = Document(path)
  cleaned_words = []

  # Repeat the process
  for paragraph in document.paragraphs:
    text = paragraph.text
    text = re.sub(r'[^a-zA-Z\s]+', ' ', text)
    tokens = word_tokenize(text)

    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word not in stop_words]

    # Lemmatization
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]

    # Filter out words less than 3 characters
    lemmatized_tokens = [word for word in lemmatized_tokens if len(word) > 2]

    cleaned_words.extend(lemmatized_tokens)

  return cleaned_words

"""
Makes a dictionary pointing words to their count
"""
def make_dict(words: list[str]) -> dict[str, int]:
  word_freq = {}
  for word in words:

    if word in word_freq:
      word_freq[word] += 1

    else:
      word_freq[word] = 1

  return word_freq
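
For comparison, Python's standard library can build the same word-to-count mapping with `collections.Counter`. This is a sketch on a made-up word list, not a change to the notebook's `make_dict`.

```python
from collections import Counter

# Made-up sample words.
words = ["acid", "base", "acid", "atom", "acid", "base"]

# Counter builds the word -> count mapping in one call.
word_freq = Counter(words)

print(dict(word_freq))           # {'acid': 3, 'base': 2, 'atom': 1}
print(word_freq.most_common(2))  # [('acid', 3), ('base', 2)]
```

`most_common` is handy for a quick look at the top entries before building a full table.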


# Step 5: Produce Dataframe

In [None]:
cleaned = upload_doc('clean corpus.docx')
total = len(cleaned)

word_dict = make_dict(cleaned)

# We turn our word_dict into a dictionary that the dataframe can use

for_df = {'Words': [], 'Term Frequency': []}

for word in word_dict:

  # Append unique words
  for_df['Words'].append(word)

  # Append the ratio of the frequency / total words
  # Using the f-string to format into a percent
  to_percent = word_dict[word]/total
  percentage = f"{to_percent:.2%}"

  for_df['Term Frequency'].append(percentage)

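# Turn into Data Frame
term_freq = pd.DataFrame(for_df)

# Sort in descending order. The column holds strings such as "2.01%",
# so compare on the numeric value rather than on the text.
freq_sorted = term_freq.sort_values(
    by='Term Frequency',
    ascending=False,
    key=lambda col: col.str.rstrip('%').astype(float),
)

print(freq_sorted)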

               Words Term Frequency
9           reaction          2.01%
5               acid          1.61%
56              atom          1.51%
125           energy          1.23%
8           chemical          1.13%
...              ...            ...
1273  electronically          0.01%
1272      alkalinity          0.01%
1270          linked          0.01%
1267           alone          0.01%
2316          zeeman          0.01%

[2317 rows x 2 columns]
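
The table stores term frequencies as formatted strings. One thing worth keeping in mind: strings sort character by character, which matches numeric order only while every value stays below 10%. A small sketch with made-up values shows the difference:

```python
# Made-up percentages, including one above 10%.
percentages = ["2.01%", "10.50%", "0.01%", "1.61%"]

# Plain string sort: compares character by character, so "2.01%" outranks "10.50%".
as_text = sorted(percentages, reverse=True)

# Numeric sort: strip the percent sign and compare as floats.
as_numbers = sorted(percentages, key=lambda p: float(p.rstrip("%")), reverse=True)

print(as_text)     # ['2.01%', '10.50%', '1.61%', '0.01%']
print(as_numbers)  # ['10.50%', '2.01%', '1.61%', '0.01%']
```

In the output above every frequency is under 10%, so both orderings agree there.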
