## Importing Packages

In [1]:
import os
import re
from nltk import word_tokenize, sent_tokenize # widely used text tokenizer
from nltk.stem.porter import PorterStemmer # an approximate method of stemming words
ps = PorterStemmer()

In this notebook, we test an approach to stripping menu, header, and footer items from scraped web pages.
Because these elements appear on nearly every page of a site, we assume that any text common to multiple pages is superfluous and filter it out.

This notebook assumes that extra spaces from scraping are left intact.
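
To illustrate why the extra spaces matter, here is a minimal sketch that splits a made-up scraped string on double spaces, the same delimiter the filtering cell further down uses. The sample text is hypothetical. Note that a run of three spaces leaves a leading space on the following phrase, so "Contact Us" and " Contact Us" would be counted as different phrases.

```python
# Hypothetical scraped text: menu items separated by runs of spaces
raw_text = "Home  About Us   Contact Us  Welcome to our school."

# Split on double spaces and drop empty pieces
phrases = [p for p in raw_text.split("  ") if p != ""]
print(phrases)
# ['Home', 'About Us', ' Contact Us', 'Welcome to our school.']
```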

The implementation makes one pass through all the pages, counting how many times each section of text is encountered.  A second pass then filters out every section that was encountered more than once.
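
The two passes can be sketched on a toy, in-memory site as follows. This is only an illustration under simplifying assumptions: the pages are hypothetical, `filter_repeated` is a name made up for this sketch, and the duplicate-page handling and file I/O of the real cell are left out.

```python
from collections import Counter

def filter_repeated(pages):
    """Drop phrases that occur more than once across all pages.

    `pages` is a list of pages, each a list of phrase strings.
    """
    # First pass: count every phrase across the whole site
    counts = Counter(phrase for page in pages for phrase in page)
    # Second pass: keep only phrases seen exactly once
    return [[p for p in page if counts[p] == 1] for page in pages]

# Hypothetical three-page site sharing a menu and footer
pages = [
    ["Home", "About Us", "Our school was founded in 2010.", "Contact Us"],
    ["Home", "About Us", "Lab hours are 8 AM to 8 PM.", "Contact Us"],
    ["Home", "About Us", "Register online today.", "Contact Us"],
]
filtered = filter_repeated(pages)
print(filtered)
```

Only the page-specific sentence survives on each page, since the menu and footer phrases are each counted three times.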

Additionally, I've reimplemented dictionary analysis as an override.  This means that when we find text we want to filter out, we can run dictionary analysis on it to see if we should actually keep it instead.  Currently, the dictionary being used is a compilation of all the Essentialism/Traditionalism, Progressivism, and Ritualism Summer 2017 keywords, as well as some additional keywords gathered manually.  I've found that this dictionary is too broad and produces too many hits, which is why I've currently disabled using dictionary analysis as an override.
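
As a rough illustration of the override, the sketch below decides whether a phrase survives filtering. It is a simplification, not the notebook's method: `should_keep` and the two-word `keywords` set are hypothetical, and it uses plain lowercase word matching instead of the stemmed, multi-word matching defined later.

```python
def should_keep(phrase, seen_count, keywords, dict_override=False):
    """Decide whether a phrase survives filtering."""
    if seen_count == 1:
        return True  # unique text is always kept
    if dict_override:
        # Repeated text is rescued if it contains any keyword
        words = phrase.lower().split()
        return any(word in keywords for word in words)
    return False

keywords = {"mission", "curriculum"}  # hypothetical stand-in dictionary
kept_with_override = should_keep("Our mission statement", 5, keywords, dict_override=True)
kept_without_override = should_keep("Our mission statement", 5, keywords, dict_override=False)
kept_menu_item = should_keep("Contact Us", 5, keywords, dict_override=True)
print(kept_with_override, kept_without_override, kept_menu_item)
# True False False
```

With a broad dictionary, many repeated menu items would contain at least one keyword and be rescued, which is the problem described above.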

In [2]:
dir_location = "sample_page/" # Location where pages get read from
file_format = dir_location + "html Link {}.txt"
filtered_file_format = "sample_page_filtered_split/html Link {}.txt" # Location where filtered pages are written

dict_override = False # whether or not to use dictionary analysis as a fail safe

## Set Up Custom Dictionary Analysis

In [3]:
additional_keywords = ["values", "positive", "academics", "skills", "perseverance", 'purpose',
                       'direction', 'mission', 'vision', 'vision', 'mission', 'our purpose',
                       'our ideals', 'ideals', 'our cause', 'curriculum','curricular',
                       'method', 'pedagogy', 'pedagogical', 'approach', 'model', 'system',
                       'structure','philosophy', 'philosophical', 'beliefs', 'believe',
                       'principles', 'creed', 'credo', 'values','moral', 'history', 'our story',
                       'the story', 'school story', 'background', 'founding', 'founded',
                       'established','establishment', 'our school began', 'we began',
                       'doors opened', 'school opened']
custom_dict = set()

# Function to load in the dictionaries for each philosophy
def load_dict(custom_dict, file_path):
    with open(file_path) as f:
        for line in f:
            custom_dict.add(ps.stem(line.rstrip("\n"))) # Stem dictionary entries
    return custom_dict

custom_dict = load_dict(custom_dict, "custom_dicts_s17/prog_dict.txt")
custom_dict = load_dict(custom_dict, "custom_dicts_s17/ess_dict.txt")
custom_dict = load_dict(custom_dict, "custom_dicts_s17/rit_dict.txt")

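# Note: these keywords are added as-is (unstemmed), while dict_match stems text
# before lookup, so a keyword only matches if it is identical to its own stem
for keyword in additional_keywords:
    custom_dict.add(keyword)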

print(len(custom_dict), "dictionary entries loaded")
custom_dict_list = list(custom_dict)
custom_dict_list.sort(key = lambda x: x.lower())
custom_dict_list[:10]

707 dictionary entries loaded


['abstract think',
 'abstract thought',
 'academ',
 'academics',
 'acceler',
 'account',
 'achiev',
 'achievement gain',
 'achievement gap',
 'activ']

In [4]:
# Get length of longest entry in custom dictionary
max_entry_length = max([len(entry.split()) for entry in custom_dict])
debug = False # Prints out the matching dictionary terms, for debugging

# Function that performs the dictionary analysis.
# Also removes punctuation, stems the phrase being analyzed, and checks for multiple length
# dictionary entries
# Now returns the number of dictionary hits found
def dict_match(phrase, custom_dict):
    # regex to keep only word characters (letters, digits, underscore) and whitespace.
    # Effectively removes punctuation
    phrase = re.sub(r'[^\w\s]', '', phrase)
    
    # Do dictionary analysis for word chunks of lengths max_entry_length down to 1,
    # so that longer entries are matched (and removed) before shorter ones
    counts = 0
    for length in range(max_entry_length, 0, -1):
        phrase, len_counts = dict_match_len(phrase, custom_dict, length)
        counts += len_counts
    return counts

# Helper function to run dictionary analysis for given length word chunks
# This function stems the phrases before checking, and the dictionary entries were
# also stemmed upon creation
# Now returns an updated copy of phrase, which has removed the dictionary hits
# Also returns the number of dictionary hits
def dict_match_len(phrase, custom_dict, length):
    splitted_phrase = phrase.split()
    if len(splitted_phrase) < length:
        return phrase, 0
    hits_indices, counts = [], 0
    for i in range(len(splitted_phrase) - length + 1):
        # Build a chunk of 'length' words and stem it
        stemmed_word = ps.stem(" ".join(splitted_phrase[i:i+length]))
        if stemmed_word in custom_dict:
            hits_indices.append(i) # Store the index of the word that has a dictionary hit
            counts += 1
            if debug:
                print(stemmed_word)
    # Iterate through list of matching word indices and remove the matches
    for i in range(len(hits_indices)-1, -1, -1):
        splitted_phrase = splitted_phrase[:hits_indices[i]] + \
        splitted_phrase[hits_indices[i] + length:]
    # Rebuild the modified phrase, with matches removed
    return " ".join(splitted_phrase), counts

In [5]:
debug = True
if debug:
    dict_match("Example sentence accelerate abstract think \
    cats cats cats achievement gain learning", custom_dict)
debug = False

abstract think
achievement gain
acceler
learn
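
The debug output above lists the two-word hits before the single-word ones because longer chunks are checked first and removed from the phrase once matched. The sketch below shows that idea in isolation. It is a simplified stand-in rather than the functions above: `count_hits` is a hypothetical name, there is no stemming or punctuation removal, and matches at a given length never overlap.

```python
def count_hits(phrase, entries):
    """Collect dictionary hits, matching longer entries before shorter ones."""
    words = phrase.lower().split()
    max_len = max(len(entry.split()) for entry in entries)
    hits = []
    for length in range(max_len, 0, -1):
        i, remaining = 0, []
        while i < len(words):
            chunk = " ".join(words[i:i + length])
            if i + length <= len(words) and chunk in entries:
                hits.append(chunk)
                i += length  # consume the matched words
            else:
                remaining.append(words[i])
                i += 1
        words = remaining  # shorter lengths only see unmatched words
    return hits

# Hypothetical dictionary where a two-word entry contains two one-word entries
entries = {"achievement gap", "achievement", "gap"}
hits = count_hits("closing the achievement gap", entries)
print(hits)
# ['achievement gap']
```

Because the two-word entry is consumed first, the phrase counts as one hit instead of three.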


## Filter Out Repeated Text

In [6]:
num_pages = len([f for f in os.listdir(dir_location) if f[0] != "."])
seen_counts = dict()
pages, pages_set = [], set()

# First iterate through each page to find overlaps
for i in range(num_pages):
    with open(file_format.format(i)) as f: # close the file once it has been read
        page_text = f.read()
    
    # Split raw text into phrases on double spaces, dropping empty pieces.
    # An odd-length run of spaces leaves a leading space on the next phrase.
    page_text_split = [phrase for phrase in page_text.split("  ") if phrase != ""]
    
    # Check if page is a duplicate.  If so, ignore it.
    # Because of this, duplicate pages will get turned into empty files
    if str(page_text_split) in pages_set:
        pages.append([])
        continue

    pages.append(page_text_split)
    pages_set.add(str(page_text_split))

    # Count the number of times each phrase was seen across the entire website
    for phrase in page_text_split:
        if phrase in seen_counts:
            # If there's a dictionary match, set the seen_count value to 1,
            # which ensures that the phrase won't get filtered out
            if dict_override and dict_match(phrase, custom_dict):
                seen_counts[phrase] = 1
            else:
                seen_counts[phrase] += 1
        else:
            seen_counts[phrase] = 1

# Iterate again through each page to only keep non-overlapping phrases
for i in range(len(pages)):
    filtered_page_text = ""
    for j in range(len(pages[i])):
        if seen_counts[pages[i][j]] == 1:
            filtered_page_text += pages[i][j] + " "

    # Save the filtered page into a different directory
    with open(filtered_file_format.format(str(i)), "w") as f:
        f.write(filtered_page_text)

Below we can see exactly which phrases got filtered out, along with the number of times they were encountered.

In [7]:
for elem in seen_counts.keys():
    if seen_counts[elem] > 1:
        print(elem, seen_counts[elem])

Explore 16
 x: 20
Home 22
About Us 22
School Information 25
Lab Hours 19
Calendar 22
Frequently Asked Questions 19
Directions 19
Student Handbook 17
School Board 22
School Improvement Council 18
Contact Us 22
Leadership 17
News 25
Careers 19
Register Your Student 18
 Follow Us 20
Traducciones ~ Translations ~ 번역 17
Richland Two Charter High School 27
From the News Room View All News 2
 Upcoming Events View Calendar 2
7900 Brookmont Lane  |  Columbia, SC 29203  |  Phone: (803)-419-1348 |  Fax: (803)-935-1212 17
Accessibility | © 2017 Richland School District Two 20
Original text Contribute a better translation  18
 Home 16
 Richland Two Charter High School 10
 Explore 4
Your Name: 2
Site Translations 3
Fall Break Hours 2
R2i2 is offering Dual Credit courses for the 2017-2018 School Year 2
Welcome to Our New Website! 2
Page Not Found 2
Our Schools 3
I'm a Richland Two... 3
Student or Parent 3
Employee 3
Future Family 3
Community Member/Local Business 3
Prospective Employee 3
About Richla

Below we can see all of the remaining text in one place after filtering.

In [8]:
filtered_full_output = ""
for i in range(num_pages):
    with open(filtered_file_format.format(i)) as f:
        page_text = f.read()
    filtered_full_output += page_text + " "
print(filtered_full_output)

 Fall Break Hours On November 20th and 21st, we'll close at 3 PM.
No school November 22nd through the 24th. R2i2 is offering Dual Credit courses for the 2017-2018 School Year Earn college credit while you complete high school! Welcome to Our New Website! Sleek. Shiny. Built for the future. Just like us. Connected   The Richland Two Charter High School was founded in 2010 for students looking for a modern approach to their high school learning experience. Students at the Charter School customize their education using our virtual curriculum and design a flexible attendance schedule to meet their individual needs. In 2013, the Charter School opened its doors to students in the 9th through 12th grade. Our extended hours of operation allow students the flexibility to work during the day and complete their coursework by taking advantage of our nighttime hours. Our major advantage over other online schools is our faculty and staff. By employing a full time school counselor, English teacher, M

In [9]:
max_page_score = (-1, -1)
for i in range(num_pages):
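    with open(filtered_file_format.format(i)) as f:
        page_text = f.read()
    
    words = page_text.split()
    if words: # skip empty or whitespace-only pages to avoid dividing by zero
        # Score is the number of dictionary hits per word on the page
        page_score = dict_match(page_text, custom_dict) / len(words)
        if page_score > max_page_score[0]:
            max_page_score = (page_score, i)
with open(filtered_file_format.format(max_page_score[1])) as f:
    max_text = f.read()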
print("Page with the highest dictionary score:\n\n" + max_text)

Page with the highest dictionary score:

The Richland Two Charter High School was founded in 2010 for students looking for a modern approach to their high school learning experience. Students at the Charter School customize their education using our virtual curriculum and design a flexible attendance schedule to meet their individual needs. In 2013, the Charter School opened its doors to students in the 9th through 12th grade. Our extended hours of operation allow students the flexibility to work during the day and complete their coursework by taking advantage of our nighttime hours. Our major advantage over other online schools is our faculty and staff. By employing a full time school counselor, English teacher, Math teacher, Work Based Learning Coordinator and Lab Specialist, students can ask questions and receive individualized attention.  Your Future Starts Now! Blended Virtual High School Environment 9 th through 12 th Grades Flexible School Day Small School Experience SC High Sch
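
The score used in the cell above is the number of dictionary hits divided by the number of words on the page, so a short, keyword-dense page can outrank a long page with more hits in total. Here is a minimal sketch of that normalization. The hit counts and page texts are made up, and `dictionary_score` is a hypothetical helper rather than part of the notebook.

```python
def dictionary_score(hit_count, text):
    """Dictionary hits per word; 0.0 for pages with no words."""
    words = text.split()
    return hit_count / len(words) if words else 0.0

# Hypothetical pages: (number of dictionary hits, page text)
pages = {
    "long_page": (10, "word " * 1000),
    "short_page": (3, "word " * 20),
    "blank_page": (0, "   "),
}
scores = {name: dictionary_score(hits, text) for name, (hits, text) in pages.items()}
best_page = max(scores, key=scores.get)
print(best_page, scores[best_page])
```

Here the short page wins with 3 hits in 20 words, against 10 hits in 1000 words for the long page.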