# Download and Clean Sample Data #
This notebook downloads sample texts of several different lengths for testing our summary algorithm.  First, we'll download a full-length novel, Frankenstein by Mary Shelley.  Second, we'll download a short story, Flowers for Algernon.  Third, we'll download the four-page story Hills Like White Elephants.  Fourth, we'll download one of the longest factual Wikipedia entries, the article on Elvis Presley.  Fifth, we'll look at a collection of Word documents, to explore summarization of groups of texts.  We'll clean each of these into plain text, in the format expected by the summary algorithm.  The cleaning process differs for each source because each page's HTML is laid out differently, but the end goal is always the same: a simple string containing the document, or a dict of strings for a document group, saved as a pickle for use in other notebooks.

### Raw Text Locations: ###
Frankenstein: https://www.gutenberg.org/files/84/84-h/84-h.htm


Flowers for Algernon: https://www.alcaweb.org/arch.php/resource/view/172077


Hills like White Elephants: https://www.macmillanhighered.com/BrainHoney/Resource/6702/digital_first_content/trunk/test/literature_full/asset/downloadables/AnnotatedText_HillsLikeWhiteElephants.html

Elvis Presley: https://en.wikipedia.org/wiki/Elvis_Presley

In [4]:
#import dependencies
import requests
from bs4 import BeautifulSoup
import pickle
import re

### Download and clean Frankenstein ###
This is a full-length book, to test creation of a summary based on a very long single text.

In [None]:
#grab the text, using Beautiful Soup to parse the HTML
url = "https://www.gutenberg.org/files/84/84-h/84-h.htm"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
raw_full_text = soup.text

In [None]:
#Cut the top and bottom of the page, so that we only have the text of the book.
start_text = "Letter 1\n\nTo Mrs. Saville, England."
end_text = "*** END OF THE PROJECT GUTENBERG EBOOK FRANKENSTEIN ***"
raw_full_text = raw_full_text[raw_full_text.index(start_text):raw_full_text.index(end_text)].replace("\r\n"," ").replace("\n", " ")
#escape miscellaneous Unicode characters: anything outside Latin-1 (e.g. curly quotes) becomes a literal \uXXXX sequence.
full_text = raw_full_text.encode('raw_unicode_escape').decode()
#show that we found the expected length
words_count = len(full_text.split(" "))
pages_count = int(words_count/500)#quick estimate, real page count is dependent on page and font size.
print ("Approximate word count:",words_count)
print ("Approximate page count:",pages_count)
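The same word and page estimate is repeated for every text in this notebook, so it could be wrapped in a small helper. The sketch below is one possible version; the name `estimate_length` and the `words_per_page` parameter are assumptions for illustration, not part of the original notebook. It also uses `split()` with no argument, which ignores runs of whitespace, whereas `split(" ")` counts an empty string for every doubled space, so the two counts can differ slightly.

```python
def estimate_length(text, words_per_page=500):
    """Return (word_count, page_count) for a plain-text string.

    Hypothetical helper: 500 words per page is the same rough
    estimate used in the cells of this notebook.
    """
    # split() with no argument splits on any run of whitespace,
    # so doubled spaces do not produce empty "words".
    word_count = len(text.split())
    # Integer division gives the number of full pages.
    page_count = word_count // words_per_page
    return word_count, page_count


sample = "It was  on a dreary night of November"
words, pages = estimate_length(sample, words_per_page=4)
print("Approximate word count:", words)
print("Approximate page count:", pages)
```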

In [None]:
#save this clean text for later use
with open('sample texts/frankenstien.pkl', 'wb') as file:
    pickle.dump(full_text, file)
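Other notebooks will read these pickles back, so it can be worth checking that a saved text loads cleanly. The following is a minimal sketch of a save and load round trip; `save_text` and `load_text` are hypothetical helper names, and the demo writes to a temporary directory rather than to `sample texts/`. The helper also creates the target folder if it is missing, which the cell above assumes already exists.

```python
import os
import pickle
import tempfile


def save_text(text, path):
    """Pickle `text` to `path`, creating the parent folder if needed."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as file:
        pickle.dump(text, file)


def load_text(path):
    """Load a previously pickled object from `path`."""
    with open(path, "rb") as file:
        return pickle.load(file)


# Round trip in a throwaway directory so nothing real is overwritten.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "sample texts", "demo.pkl")
    save_text("You will rejoice to hear that no disaster...", demo_path)
    restored = load_text(demo_path)
    print("Round trip ok:", restored.startswith("You will rejoice"))
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.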

### Download and clean Flowers for Algernon ###
This is a short story, to test creation of a summary based on a text that is short but still longer than most context windows.  This story is also challenging because it is deliberately written in poor English, to represent the main character's mental state.

In [None]:
#grab the text, using Beautiful Soup to parse the HTML
url = "https://www.alcaweb.org/arch.php/resource/view/172077"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
raw_full_text = soup.text

In [None]:
#Cut the top and bottom of the page, so that we only have the text of the story.
start_text = "Progris riport 1 martch 3."
end_text = "chanse put some flown on Algernons grave in the bak yard."
full_text = raw_full_text[raw_full_text.index(start_text):raw_full_text.index(end_text)+len(end_text)].replace("\r\n"," ").replace("\n", " ").replace("\t","")

#show that we found the expected length
words_count = len(full_text.split(" "))
pages_count = int(words_count/500)#quick estimate, real page count is dependent on page and font size.
print ("Approximate word count:",words_count)
print ("Approximate page count:",pages_count)
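The start and end marker trimming used above could be factored into a reusable function. This is a sketch under stated assumptions: `slice_between` is a hypothetical name, and unlike the cell above it searches for the end marker only after the start marker and raises a descriptive error when a marker is missing (useful if the source page changes).

```python
def slice_between(text, start_text, end_text):
    """Return the part of `text` from `start_text` through `end_text`, inclusive."""
    try:
        start = text.index(start_text)
    except ValueError:
        raise ValueError(f"start marker not found: {start_text!r}") from None
    try:
        # Search from `start` so an earlier copy of the end marker is ignored.
        end = text.index(end_text, start) + len(end_text)
    except ValueError:
        raise ValueError(f"end marker not found: {end_text!r}") from None
    return text[start:end]


page = "Site header. Progris riport 1 martch 3. ... in the bak yard. Site footer."
story = slice_between(page, "Progris riport 1", "in the bak yard.")
print(story)
```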

In [None]:
#save this clean text for later use
with open('sample texts/algernon.pkl', 'wb') as file:
    pickle.dump(full_text, file)

### Download and clean Hills like White Elephants ###
This is a short story, to test creation of a summary based on text that is only a few pages long.

In [None]:
#grab the text, using Beautiful Soup to parse the HTML
url = "https://www.macmillanhighered.com/BrainHoney/Resource/6702/digital_first_content/trunk/test/literature_full/asset/downloadables/AnnotatedText_HillsLikeWhiteElephants.html"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
raw_full_text = soup.find_all('p')

In [None]:
#join the text of all paragraphs together, separated by spaces:
raw_full_text = " ".join(p.text for p in raw_full_text)

In [None]:
#Cut the top and bottom of the page, so that we only have the text of the story.
start_text = "The hills across the valley of the Ebro were long and white."
end_text = "“I feel fine,” she said. “There’s nothing wrong with me. I feel fine.”"
full_text = raw_full_text[raw_full_text.index(start_text):raw_full_text.index(end_text)+len(end_text)].replace("\r\n"," ").replace("\n", " ").replace("\t","")

#show that we found the expected length
words_count = len(full_text.split(" "))
pages_count = int(words_count/500)#quick estimate, real page count is dependent on page and font size.
print ("Approximate word count:",words_count)
print ("Approximate page count:",pages_count)
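The chained `replace` calls used to flatten line breaks and tabs could also be written as a single regular expression. The sketch below is an alternative, not the notebook's method: `normalize_whitespace` is a hypothetical helper that collapses every run of whitespace into one space, so it differs slightly from the cell above, which deletes tabs outright and can leave doubled spaces behind.

```python
import re


def normalize_whitespace(text):
    """Collapse any run of spaces, tabs, or line breaks into one space."""
    # \s+ matches one or more whitespace characters of any kind.
    return re.sub(r"\s+", " ", text).strip()


messy = "The hills\r\nacross\tthe valley\n\nof the Ebro  were long and white."
print(normalize_whitespace(messy))
```

Because doubled spaces are removed, a word count based on `split(" ")` becomes more accurate after this step.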

In [None]:
#save this clean text for later use
with open('sample texts/hills.pkl', 'wb') as file:
    pickle.dump(full_text, file)

### Download and clean Elvis Presley's Wikipedia article ###
This is a long factual article, to test creation of a summary based on non-fiction text.

In [None]:
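#grab the text through the MediaWiki API (XML output), using Beautiful Soup to parse the response
url = "https://en.wikipedia.org/w/api.php?format=xml&action=query&prop=extracts&titles=Elvis_Presley&redirects=true"
response = requests.get(url)
#name the parser explicitly, so Beautiful Soup does not have to guess one
soup = BeautifulSoup(response.text, 'html.parser')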

In [None]:
#pull out the text, then strip any HTML tags that remain:
full_text_with_tags = soup.get_text()
full_text = re.sub('<[^<]+?>', '', full_text_with_tags)

#cut the top and bottom of the page
start_text = "Elvis Aaron Presley (January 8, 1935 – August 16, 1977), often referred"
end_text = "albums. In the 1970s, his most heavily promoted and bestselling LP releases tended to be concert albums."
full_text = full_text[full_text.index(start_text):full_text.index(end_text)+len(end_text)].replace("\r\n"," ").replace("\n", " ").replace("\t","")

#print(full_text)
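To see what the tag-stripping regular expression does, here is a small standalone sketch on a made-up snippet. The `strip_tags` name and the sample markup are assumptions for illustration. The pattern is the same one used above; the extra `html.unescape` call is an addition that turns entities such as `&amp;` back into plain characters. A regex like this is only an approximation of HTML parsing and can misfire on text that contains a literal `<` followed later by `>`.

```python
import html
import re

# Same pattern as in the cell above: "<", then one or more
# characters that are not "<" (as few as possible), then ">".
TAG_PATTERN = re.compile(r"<[^<]+?>")


def strip_tags(markup):
    """Remove HTML-style tags, then decode HTML entities."""
    no_tags = TAG_PATTERN.sub("", markup)
    return html.unescape(no_tags)


snippet = "<p><b>Elvis Aaron Presley</b> was an American singer &amp; actor.</p>"
print(strip_tags(snippet))
```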

In [None]:
#show that we found the expected length
words_count = len(full_text.split(" "))
pages_count = int(words_count/500)#quick estimate, real page count is dependent on page and font size.
print ("Approximate word count:",words_count)
print ("Approximate page count:",pages_count)

In [None]:
#save this clean text for later use
with open('sample texts/elvis.pkl', 'wb') as file:
    pickle.dump(full_text, file)

### Clean the sample group of documents ###
The sample docs cleaned here are a collection of Word documents.  These are not public, so they are not included in the git repo.  Feel free to drop in your own documents.

In [None]:
!pip install docx2txt

In [None]:
import docx2txt
import glob
import os

directory = glob.glob('sample texts/*.docx')
docs = {}
for file_name in directory:
    #print(file_name)
    with open(file_name, 'rb') as infile:
        doc = docx2txt.process(infile)
        #key each document by its file name, without the directory
        docs[os.path.basename(file_name)] = doc

print("Loaded in %s docs."%len(docs))

Alternatively, let's grab a bunch of Amazon reviews and use them as separate documents.  
Here are some 1 and 5 star reviews for https://www.amazon.com/dp/B09DXZB7JQ

In [2]:
docs = {}
docs["review_001"] = "My daughter absolutely loves this set from Panel Sound (not sure why named panel sound) but the bag is a great addition as it keeps the paddles and balls organized, its a messenger style bag as opposed to book bags but it does the trick. We have not used the included cooling towels as we have larger ones for us to use down here in S. Florida. My daughter loved it so much we bought a second set for her to play with my parents when she visits their house. I'm not a pickle ball specialist, but the paddles seem great to me and 60 days later the original ball she used (put her name on it) is still going strong."
docs["review_002"] = """The Panel Sound USAPA Approved Pickleball Paddle Set has exceeded my expectations, delivering a complete package for pickleball enthusiasts of all skill levels. With its lightweight paddles, versatile ball options, and thoughtful accessories, this set caters to indoor and outdoor play with style and precision.
Pros:
Quality Paddles: The included fiberglass pickleball paddles are USAPA approved, ensuring professional-grade quality. Their lightweight yet sturdy construction enhances control and power during gameplay.
Variety of Balls: The set includes both indoor and outdoor balls, offering adaptability to different playing environments. This flexibility lets you enjoy pickleball wherever you choose.
Comprehensive Set: The addition of a carrying case, cooling towels, and various ball types demonstrates the manufacturer's attention to detail, making this set a convenient and complete solution for pickleball enthusiasts.
Enhanced Gameplay: The paddles' lightweight design and responsive construction contribute to improved gameplay, allowing for precise shots and better maneuverability on the court.
Durability and Portability: The quality materials used in the construction of the paddles and accessories ensure their longevity. The carrying case and cooling towels add portability and ease to your pickleball adventures.
Cons:
Cooling Towel Size: Some users might find the cooling towels on the smaller side, potentially requiring more frequent re-wetting during extended play sessions.
Personal Preference: While the paddles are versatile, some players might have personal preferences for specific paddle designs or grip styles.
In summary, the Panel Sound Pickleball Paddle Set offers an excellent combination of performance, variety, and convenience. The quality paddles, diverse ball options, and thoughtful extras like the carrying case and cooling towels create a comprehensive package for pickleball enthusiasts. While there might be minor considerations such as cooling towel size and personal preference, the overall benefits and attention to detail make this set a solid choice for enhancing your pickleball experience both indoors and outdoors."""
docs["review_003"] = "My husband and I were invited to play Pickleball with some friends and we’d never play before. We found these and they were a good value with a bag, balls, and towels included. We had to Google which balls were for where lol so I wish the instructions mentioned that, but overall they worked well! We’re no expert by any means but I’m petite and not athletic at all, and they’re surprisingly easy to move and handle. They also work for my husband and he’s a big larger. So far so good!"
docs["review_004"] = """Pickle ball is all the rage now. I live close to this tennis court and I have watched a mostly empty court now become used quite frequently. Decided to try pickle all and ordered this set after browsing a few reviews. Seems to be good.
Grip is ehh..grippy and ergonomic
Carrying case is a nice addition.
Balls are seemingly good quality
I have nothing to compare this to but i didn’t get the feeling that it is an inferior product. Seems good"""
docs["review_005"] = "What can I say. My husband wanted these because it was a fad and we only used them once or twice. Hopefully we'll use them again. I love the carry case and everything it came with."
docs["review_006"] = "These paddles were great at the start. But after only four individual days of play, they broke! At first, it was only one paddle that came loose during a game. I thought that was odd, being that we only used it according to its purpose. There was no rough treatment of the paddle, we were just playing a game. So, we borrowed a paddle from someone to finish our match. But during the game the second paddle became loose and started wobbling. Now there's no power or control in the paddle. It just wobbles around like a cracked piece of wood being held together by the grip tape! This is frustrating because the time for a return has passed and no one could've predicted that the paddle wouldn't last beyond a month. I would like a refund or at least new paddles."
docs["review_007"] = "Feels too light with no power compared to other paddles i used. I wish i could return these but passed the 30 d timeline"
docs["review_008"] = "Do not buy! Product comes from China and cannot contact; the item is warranted and will not be honored because you can’t get in touch. Amazon will do nothing to help! Disgraceful."
docs["review_009"] = "We used it once and the paddle broke in half. Get a different brand that’s more sturdy."
docs["review_010"] = "I just bought these in June. It is March and one of the paddles is shattered inside. We were not careless in caring for them either. Poor Quality. I cannot find any information as to if there is a warranty on them either."

In [5]:
#save this clean text for later use
with open('sample texts/docs.pkl', 'wb') as file:
    pickle.dump(docs, file)
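Before handing the document group to another notebook, a quick look at the size of each entry can catch empty or truncated documents. The sketch below assumes the same dict-of-strings layout as `docs`; `summarize_docs` and the two demo reviews are hypothetical, and the round trip uses an in-memory pickle instead of the real file.

```python
import pickle


def summarize_docs(docs):
    """Map each document name to its whitespace-separated word count."""
    return {name: len(text.split()) for name, text in docs.items()}


# Made-up stand-ins for the real reviews.
demo_docs = {
    "review_a": "Great paddles, sturdy bag.",
    "review_b": "Broke after one game.",
}

# Pickle and unpickle in memory to mimic saving and reloading the group.
restored = pickle.loads(pickle.dumps(demo_docs))

for name, count in summarize_docs(restored).items():
    print(name, "has", count, "words")
```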