12.1 Web pages are excellent sources of text to use in NLP tasks. In the following IPython session, you’ll use the requests library to download the www.python.org home page’s content. This is called web scraping. You’ll then use the Beautiful Soup library to extract only the text from the page. Eliminate the stop words in the resulting text, then use the wordcloud module to create a word cloud based on the text.

In [21]:
import requests
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
from textblob import TextBlob
from wordcloud import WordCloud

nltk.download('stopwords')
response = requests.get('https://www.python.org')
# use html.parser rather than html5lib, because html5lib also returned the script content
soup = BeautifulSoup(response.content, 'html.parser')
# get the text from the webpage; stripping the tags
text = soup.get_text(strip=True)
# get the stop words
stops = stopwords.words('english')
# convert the text into a textblob
blob = TextBlob(text)
# remove stop words from the text
textWithoutStops = [word for word in blob.words if word not in stops]
# join the words back together in a sentence to be used by word cloud
textWithoutStopString = ' '.join(textWithoutStops)
# set up the word cloud
wordCloud = WordCloud(colormap='prism', background_color='white')
# generate the word cloud from the text
wordCloud.generate(textWithoutStopString)
# save the word cloud
wordCloud.to_file('PythonOrg.png')

print(textWithoutStops)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\cgrot\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['Welcome', 'Python.orgNotice', 'While', 'JavaScript', 'essential', 'website', 'interaction', 'content', 'limited', 'Please', 'turn', 'JavaScript', 'full', 'experience.Skip', 'content▼ClosePythonPSFDocsPyPIJobsCommunity▲The', 'Python', 'NetworkDonate≡MenuSearch', 'This', 'SiteGOAASmallerLargerResetSocializeFacebookTwitterChat', 'IRCAboutApplicationsQuotesGetting', 'StartedHelpPython', 'BrochureDownloadsAll', 'releasesSource', 'codeWindowsmacOSOther', 'PlatformsLicenseAlternative', 'ImplementationsDocumentationDocsAudio/Visual', 'TalksBeginner', "'s", 'GuideDeveloper', "'s", 'GuideFAQNon-English', 'DocsPEP', 'IndexPython', 'BooksPython', 'EssaysCommunityDiversityMailing', 'ListsIRCForumsPSF', 'Annual', 'Impact', 'ReportPython', 'ConferencesSpecial', 'Interest', 'GroupsPython', 'LogoPython', 'WikiCode', 'ConductCommunity', 'AwardsGet', 'InvolvedShared', 'StoriesSuccess', 'StoriesArtsBusinessEducationEngineeringGovernmentScientificSoftware', 'DevelopmentNewsPython', 'NewsPSF', 'Newsletter
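
The output above contains fused tokens such as 'Python.orgNotice', and capitalized stop words such as 'While' and 'This' survive the filter. A likely cause is that get_text(strip=True) joins the page's text fragments with no separator, and that the stop-word comparison is case-sensitive. The following is a minimal sketch of one possible adjustment, not the original solution: it passes separator=' ' to get_text and lowercases each word before the stop-word check. The html string and the small stops set are made up for illustration; in the session above they would be response.content and stopwords.words('english').

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for the downloaded page
html = ('<div><h1>Welcome to Python.org</h1>'
        '<p>Notice: While JavaScript is not essential</p></div>')
soup = BeautifulSoup(html, 'html.parser')

# strip=True alone joins adjacent text fragments with nothing between them
fused = soup.get_text(strip=True)

# separator=' ' puts a space between fragments, so words stay separate
spaced = soup.get_text(separator=' ', strip=True)

# Hypothetical stand-in for stopwords.words('english')
stops = {'to', 'while', 'is', 'not'}

# lowercase each word before the membership test so 'While' is also removed
kept = [word for word in spaced.split() if word.lower() not in stops]

print(fused)
print(spaced)
print(kept)
```

Using a set for the stop words also makes each membership test faster than scanning a list, which may matter on longer pages.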

12.2 Using the text from Exercise 12.1, create a TextBlob, then tokenize it into Sentences and Words, and extract its noun phrases.

In [19]:
sentences = blob.sentences
print(sentences)
words = blob.words
print(words)
nouns = blob.noun_phrases
print(nouns)

[Sentence("Welcome to Python.orgNotice:While JavaScript is not essential for this website, your interaction with the content will be limited."), Sentence("Please turn JavaScript on for the full experience.Skip to content▼ClosePythonPSFDocsPyPIJobsCommunity▲The Python NetworkDonate≡MenuSearch This SiteGOAASmallerLargerResetSocializeFacebookTwitterChat on IRCAboutApplicationsQuotesGetting StartedHelpPython BrochureDownloadsAll releasesSource codeWindowsmacOSOther PlatformsLicenseAlternative ImplementationsDocumentationDocsAudio/Visual TalksBeginner's GuideDeveloper's GuideFAQNon-English DocsPEP IndexPython BooksPython EssaysCommunityDiversityMailing ListsIRCForumsPSF Annual Impact ReportPython ConferencesSpecial Interest GroupsPython LogoPython WikiCode of ConductCommunity AwardsGet InvolvedShared StoriesSuccess StoriesArtsBusinessEducationEngineeringGovernmentScientificSoftware DevelopmentNewsPython NewsPSF NewsletterPSF NewsPyCon US NewsNews from the CommunityEventsPython EventsUser Gr