## Zhengxu Wang 
zhengxu@bu.edu  
cs505 hw3

In this piece of code, we are going to process and analyze the data we collect from Twitter, Wikipedia, ABC and Fox news.

Prior to this assignment, please make sure you have implemented the scraping functions so that you could scrap data from the Wikipedia, ABC and Fox news pages.



Task 1. With your implemented code provided in the first lab section, get the "article" texts of the wikipedia page of "fishing" and its all linked wiki pages. Your saved data should contain the titles of the wiki pages and their article texts.

In [52]:
import requests
from bs4 import BeautifulSoup
import time # for setting up a delay on getting htmls from wiki server.
from tqdm import tqdm
def getPageFromWiki(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    return soup

def getHeading(soup):
    return soup.title.string


#mw-content-text
#bodyContent
#mw-content-text > div.mw-parser-output
def getContent(soup):
    temp = soup.select_one('#mw-content-text > div.mw-parser-output')
    if temp == None:
        return None
    return temp.get_text()

    
def getLinks(soup):

  linksDict = {}
  for link in soup.select_one('#mw-content-text > div.mw-parser-output').find_all('a'):
    title = link.get('title')
    url = link.get('href')
    if title != None and url != None:
        if url[0] == '/':
            linksDict[title] = 'https://en.wikipedia.org' + url

  return linksDict

In [53]:
#Lastly, write them down in a .csv file for both the abc and fox news. 

import csv

pathToSave = 'wikiContents.csv'

# Once you've implemented the above functions, run the following piece to see if a dictionary that contains the wiki articles we scraped.
# Run a for loop to get all the article contents from Wikipedia.
pageDict = {}

page = getPageFromWiki('https://en.wikipedia.org/wiki/Fishing') # scrap the main page we want. 
header = getHeading(page)
content = getContent(page)
pageDict[header] = content
# print(pageDict)

linksDict = getLinks(page) # get the links contained in the article part of the page.
print("a set of {} links are found.".format(len(linksDict)))

for title in tqdm(list(linksDict.keys())): # set up a loop to , set a delay at each iteration
  url = linksDict[title]
  page = getPageFromWiki(url)
  header = getHeading(page)
  content = getContent(page)
  if content != None:
    pageDict[header] = content
  time.sleep(1) # Remember to set a delay >=1 second so you won't break the server.

print("a size of {} content dictionary is built.".format(len(pageDict)))

a set of 422 links are found.


100%|██████████| 422/422 [09:57<00:00,  1.42s/it]

a size of 372 content dictionary is built.





In [54]:
# Lastly, save your contents and corresponding title in a .csv file.
import csv

pathToSave = 'wikiContents.csv'

with open(pathToSave, 'w', newline='') as csvfile:
  fieldnames = ['idx','wikiTitle', 'wikiContents']
  writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
  writer.writeheader()
  for i,wikiContentKey in enumerate(pageDict.keys()):
    writer.writerow({'idx': i, 'wikiTitle': wikiContentKey,'wikiContents': pageDict[wikiContentKey]})

In [62]:
import sys
import csv
csv.field_size_limit(sys.maxsize)

# Here is a function you could load the text data if your saved data follows
# the format we provide the in first lab section code.

def loadWikiTexts(csvPath):
  wikiRawTextDict = {}
  with open(csvPath, newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
      wikiRawTextDict[row['wikiTitle']] = row['wikiContents']
  return wikiRawTextDict

# Load your wiki text data here

wikiRawDict = loadWikiTexts('./wikiContents.csv')


Task 2. With library Spacy and Regular Expression (re), preprocess our scraped data to:

- Remove all the references texts [...] in the scraped data  ([re](https://docs.python.org/3/library/re.html)). 
- [Sentence split](https://spacy.io/usage/linguistic-features#sbd) (Spacy).
- [Tokenize](https://spacy.io/usage/linguistic-features#tokenization) (Spacy)
- [Lemmatize](https://spacy.io/usage/linguistic-features#lemmatization) (Spacy)
- [Lower case](https://www.programiz.com/python-programming/methods/string/lower) (String)


In [None]:
# install spacy and related package(s)

!pip3 install -U pip setuptools wheel
!pip3 install -U spacy
!python3 -m spacy download en_core_web_sm

In [63]:
import re
import spacy

def preprocess(wikiTextDict):
  # Input: a wiki text dictionary with keys are titles and values are the corresponding texts.
  # Output: a wiki text dictionary with keys are the titles and the values are the preprocessed texts 
  # (sentences - tokens).

  # sub-task 1: remove all the references texts "[...]"

  nlp = spacy.load("en_core_web_sm")
  lemmatizer = nlp.get_pipe("lemmatizer")
  for key in wikiTextDict.keys():
    wikiTextDict[key] = re.sub(r'[\[].*?[\]]', '', wikiTextDict[key])
    
  # sub-task 2: segment all the sentences in the wiki texts.
    sentences = []
    docParagraph = nlp(wikiTextDict[key])
    assert docParagraph.has_annotation("SENT_START")
    for sent in docParagraph.sents:
      # sub-task 3: tokenize the sentences from sub-task 2.
      # sub-task 4: lemmatize the tokens from sub-task 3.
      # sub-task 5: lower-case the tokens from sub-task 3/4.

      tokens = []
      docSent = nlp(sent.text)
      for token in docSent:
        tokens.append(token.lemma_.lower())
      sentences.append(tokens)
    wikiTextDict[key] = sentences

  return wikiTextDict
  # You don't need to follow the order of the sub-tasks.


# Preprocess your data here.
wikiProcessedDict = preprocess(wikiRawDict)


In [66]:
wikiProcessedDict['Fishing - Wikipedia'][0]

['activity',
 'of',
 'try',
 'to',
 'catch',
 'fish',
 '\n',
 'for',
 'other',
 'use',
 ',',
 'see',
 'fishing',
 '(',
 'disambiguation',
 ')',
 '.',
 '\n\n\n']

Task 3. Construct a dictionary of the vocabulary for your scraped data (all texts). The keys are the word types and the values are the count of the appearances of the word (frequency).

In [67]:
def computeFreq(wikiTextDict):

  # Input: a wiki text dictionary with keys are titles and values are the preprocessed corresponding texts.
  # Output: a dictionary with keys are the word types, and the values are the appearance counts of the word types
  tokenDict = {}
  for key in wikiTextDict.keys():
    for sentence in wikiTextDict[key]:
      for token in sentence:
        if token.isalpha() == True:
          if token in tokenDict.keys():
            tokenDict[token] += 1
          else:
            tokenDict[token] = 0
  return tokenDict

tokenDict = computeFreq(wikiProcessedDict)
  # Compute the frequency dictionary here.


check result

In [74]:
i = 0
for item in tokenDict.items():
    print(item)
    i += 1
    if i == 10:
        break

('activity', 713)
('of', 65963)
('try', 159)
('to', 33098)
('catch', 1572)
('fish', 10217)
('for', 13647)
('other', 4376)
('use', 6696)
('see', 2340)


Task 4. What are the top 20 non-stop, non-punctuation words in the vocabulary according to frequency?

In [76]:
from nltk.corpus import stopwords

def computeTop20Words(freqDict,stop_words):
  
  # Input: a dictionary with keys are the word types, and the values are the appearance counts of the word types
  # Output: a list of 20 words that appear most frequently in all the preprocessed scraped texts.

  # If not preprocessed specifically, the punctuations still exist in the frequency dictionary from task 3.
  # You need to remove them before outputing the top 20 words.
  Top20 = []
  sortedList = sorted(freqDict.items(), key=lambda dict: dict[1], reverse=True)
  while True:
    if len(Top20) == 20:
      break
    key = sortedList.pop(0)[0]
    if key not in stop_words:
      Top20.append(key)

  return Top20

stop_words = set(stopwords.words('english'))
Top20 = computeTop20Words(tokenDict, stop_words)
# Print your top 20 words here.

In [82]:
Top20

['fish',
 'retrieve',
 'fishing',
 'use',
 'water',
 'archive',
 'also',
 'original',
 'world',
 'may',
 'new',
 'isbn',
 'sea',
 'include',
 'one',
 'b',
 'marine',
 'specie',
 'large',
 'year']

Task 5. Use library such as wordcloud, [generate the word cloud](https://towardsdatascience.com/simple-wordcloud-in-python-2ae54a9f58e5) of the text to visualize the distribution of non-stop and non-punctuation words.

In [None]:
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

def plotWordCloud(image):
  # Input: word cloud image
  # Output: (display the cloud image in the output)

  plt.figure(figsize=(40, 30))
  # Display image
  plt.imshow(image) 
  # No axis details
  plt.axis("off")

def generateWordCloud(text):
  # Input: all texts in the scraped wiki data.
  # Output: word cloud image.
  wordcloud = WordCloud(width= 3000, height = 2000, random_state=1, background_color='salmon', colormap='Pastel1', collocations=False, stopwords = STOPWORDS).generate(text)

  return wordcloud

# Draw your word cloud here

allWordList = []
for paragraph in wikiProcessedDict.values():
    for sent in paragraph:
        for word in sent:
            if word not in stop_words and word.isalpha() == True:
                allWordList.append(word)

len(allWordList)

wordcloud = generateWordCloud(' '.join(allWordList))
plotWordCloud(wordcloud)

Task 6. Preprocess the raw scraped tweets with keyword ’fishing’ you’ve collected in the last assignment in the same way as you preprocess the wiki texts.

In [91]:
# Here is a function you could load the tweet text data if your saved data follows
# the format we provide the in last lab section code.

import csv

def loadTweetTextFromCSV(csvPath):
  tweetDict = {}
  with open(csvPath, newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
      tweetDict[int(row['idx'])] = row['tweetText']
  return tweetDict

# processedTweetData = preprocess(tweetDict) # here we assume you have implemented the preprocess function in task 2.
tweetRawDict = loadTweetTextFromCSV('./tweetsFishing.csv')
processedTweetData = preprocess(tweetRawDict)

In [93]:
processedTweetData[0]

[['genuinely',
  'lovely',
  'episode',
  'of',
  'gone',
  'fishing',
  'this',
  'week',
  '.'],
 ['bob',
  'and',
  'paul',
  'just',
  'giddy',
  'mess',
  'about',
  'up',
  'at',
  'loch',
  'ness',
  '.'],
 ['glorious', '.']]

Task 7. Compute how many **word types** in your tweets are out-of-vocabulary (out of Wiki vocabulary Dict), divided by the number of **word types** in your tweets. Show the value in percentage (%).

In [101]:
def computeOOVWordTypes(tweetVocabDict, wikiVocabDict):

  # Input: a dictionary of tweet data vocabulary, a dictionary of wiki data vocabulary.
  # Output: the ratio of word types in your tweets that are out-of-vocabulary w.r.t. wiki vocabulary
  # v.s. total number of word types in your tweet data.

  # The ratio should be in percentage.
  count = 0
  for key in tweetVocabDict.keys():
    if key not in wikiVocabDict.keys():
      count += 1
  return str(count/len(tweetVocabDict.keys()) * 100) + '%'
# Print your ratio here.
tweetVocabDict = computeFreq(processedTweetData)
print(computeOOVWordTypes(tweetVocabDict, tokenDict))

33.166058394160586%


Task 8. Compute how many **word tokens** in your tweets are out of vocabulary, divided by the number of **word tokens** in your tweets. (This is the OOV-rate of your tweet test set.)

In [103]:
def computeOOVWordTokens(tweetVocabDict, wikiVocabDict):

  # Input: a dictionary of tweet data vocabulary, a dictionary of wiki data vocabulary. (E.g. computed from task 3)
  # Output: the ratio of word tokens in your tweets that are out-of-vocabulary w.r.t. wiki vocabulary
  # v.s. total number of word tokens in your tweet data.

  # Remeber this time we count the number of tokens instead of types. The ratio should be in percentage.
  count = 0
  sum = 0
  for key in tweetVocabDict.keys():
    sum += tweetVocabDict[key]
    if key not in wikiVocabDict.keys():
      count += tweetVocabDict[key]
  return str(count/sum * 100) + '%'

# Print your ratio here.
print(computeOOVWordTokens(tweetVocabDict, tokenDict))

2.396296632477081%


Task 9. Get the first 9,000 sentences from the processed Wikipedia data from task 2, train a trigram Add-one-smoothing language model based
on these 9,000 sentences (which you should have done so in the last assignment). 

(You could consider using the language model from NLTK.)


In [108]:
# Get the first 9000 sentences from the processed wiki data.
wikiSentences_9k = []
for value in wikiProcessedDict.values():
    for sent in value:
        sentence = []
        for word in sent:
            if word.isalpha() == True:
                sentence.append(word)
        if sentence != []:
            wikiSentences_9k.append(sentence)
        if len(wikiSentences_9k) == 9000:
            break
    if len(wikiSentences_9k) == 9000:
        break

In [112]:
wikiSentences_9k[8999]

['this',
 'conclusion',
 'be',
 'base',
 'on',
 'the',
 'lobster',
 'simple',
 'nervous',
 'system']

In [113]:
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE
from nltk.lm import Laplace

def trainLanguageModel(processedWikiData):
  # Input: the pre-processed wiki data
  # Output: a trigram model trained on the wiki data 
  
  # You could refer to the last assignment to implement this function.
  train, vocab = padded_everygram_pipeline(order=3, text= processedWikiData)
  lm = Laplace(3)
  lm.fit(train, vocab)
  return lm

# Train the language model with the processed data.
wikiLanguageModel = trainLanguageModel(wikiSentences_9k)


Test the model

In [117]:
wikiLanguageModel.generate(10, random_seed=5)

['mastodon',
 'plano',
 'transverse',
 'arrowhead',
 'systems',
 'game',
 'drive',
 'system',
 'buffalo',
 'jump']

Task 10. Report the average perplexity of this Wikipedia-trained language model on your processed Twitter test sentences (i.e. the 20% split) related to "fishing". Compare this perplexity to the one you obtained in task 4 of the last assignment, specifically, the trigram LM trained on tweets. 

In [122]:
# Get the 20% sentences from the processed tweet data.
tweetAllData = []
for value in processedTweetData.values():
    for sent in value:
        sentence = []
        for word in sent:
            if word.isalpha() == True:
                sentence.append(word)
        if sentence != []:
            tweetAllData.append(sentence)
        

In [126]:
# Prepare the testing data
tweetTestData = tweetAllData[0:int(len(tweetAllData)*0.2)]

In [127]:
len(tweetTestData)

2974

In [129]:
tweetTestData[0]

['genuinely', 'lovely', 'episode', 'of', 'gone', 'fishing', 'this', 'week']

In [130]:

def computePerplexity(model,testData):
  
  # Input: your model; the testing data

  # Output: average perplexity of the model on your testing data.

  # You may want to re-use the same function you implemented in the last assignment
  sum = 0
  for sent in testData:
    sum += model.perplexity(sent)

  return sum/len(testData)

# Compute and print the average perplexity of the wiki-trained model on your tweet testing data.
print(computePerplexity(wikiLanguageModel, tweetTestData))

13170.05302785912


Task 11. Scrap 100 news from both ABC news and Fox news (100 each) with the code provided in the first lab section. Preprocess the texts in the same way as task 2.

In [271]:
def getPageFrom(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml-xml')
    return soup
def getUrlList(sitemap):
  # This function should return a list of URLs of news contained in the sitemap page.
    url_list = []
    for link in sitemap.find_all('loc'):
        url_list.append(link.text)
    return url_list

Get the CNN links and Fox links and ABC

In [276]:
ABCNewsSitemap = getPageFrom('https://abcnews.go.com/xmlLatestStories')
ABCNewsLinks = getUrlList(ABCNewsSitemap)

In [272]:
CNNNewsSitemap = getPageFrom('https://www.cnn.com/sitemaps/cnn/news.xml')

In [273]:
CNNNewsLinks = getUrlList(CNNNewsSitemap)

In [274]:
len(CNNNewsLinks)

530

In [188]:
CNNNewsLinks[0]

'https://www.cnn.com/2022/10/05/asia/north-korea-missile-intl/index.html'

In [293]:
FOXNewsSitemap = getPageFrom('https://www.foxnews.com/sitemap.xml?type=news')
FOXNewsLinks = getUrlList(FOXNewsSitemap)

In [289]:
len(FOXNewsLinks)

356

Get Fox news dict

In [285]:
from newspaper import Article
def getFOXNewsDict(url_list):

  # Your key should be the news title and value should be the article text of the news.
  newsDict = {}
  # IMPLEMENT YOUR CODE HERE:# 
  for url in url_list:
    article = Article(url)
    article.download()
    article.parse()
    title = article.title
    text = article.text
    if title != '' and text != '':
        newsDict[title] = text
    
    if len(newsDict) == 14:
        break
  
  return newsDict

In [212]:
FOXNews = getFOXNewsDict(FOXNewsLinks)

In [214]:
len(FOXNews)

100

Get CNN news dict

In [248]:
def getCNNNewsDict(url_list):

  # Your key should be the news title and value should be the article text of the news.
  newsDict = {}
  # IMPLEMENT YOUR CODE HERE:# 
  for url in url_list:
    article = Article(url)
    article.download()
    article.parse()
    title = article.title
    text = article.text
    if title != '' and text != '':
        newsDict[title] = text
    
    if len(newsDict) == 100:
        break
  
  return newsDict


In [249]:
CNNNews = getCNNNewsDict(CNNNewsLinks)

In [269]:
len(CNNNews)

26

Still not enough

In [268]:
import newspaper
cnn_paper = newspaper.build('http://cnn.com')
for article in cnn_paper.articles:
    article.download()
    article.parse()
    title = article.title
    text = article.text
    if title != '' and text != '':
        CNNNews[title] = text

In [275]:
CNNNews2 = getCNNNewsDict(CNNNewsLinks)

In [278]:
len(CNNNews2)

22

In [279]:
CNNNews.update(CNNNews2)

In [280]:
len(CNNNews)

29

In [281]:
ABCNews = getFOXNewsDict(ABCNewsLinks)

In [284]:
ABCNews.update(CNNNews)
len(ABCNews)

86

In [296]:
suplementLinks = FOXNewsLinks.reverse()

In [297]:
suplementDict = getFOXNewsDict(FOXNewsLinks)

In [300]:
ABCNews.update(suplementDict)

add ABC FOX CNN but totally different from the pure Fox one

In [301]:
len(ABCNews)

100

save file

In [303]:
#Lastly, write them down in a .csv file for both the abc and fox news. 

import csv

pathToSave = 'newsContents.csv'

# size check
assert len(ABCNews)>=100 and len(FOXNews)>=100, "the size of both news dictionary should be no less than 100. got {} for abc news and {} for fox news instead.".format(len(ABCNews),len(FOXNews))

with open(pathToSave, 'w', newline='') as csvfile:
  fieldnames = ['idx','newsSource','newsTitle','newsContents']
  writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
  writer.writeheader()
  for i,newsDictKey in enumerate(ABCNews.keys()):
    writer.writerow({'idx': i,'newsSource':'ABCNews', 'newsTitle': newsDictKey,'newsContents': ABCNews[newsDictKey]})
  for i,newsDictKey in enumerate(FOXNews.keys()):
    writer.writerow({'idx': i,'newsSource':'FoxNews', 'newsTitle': newsDictKey,'newsContents': FOXNews[newsDictKey]})

In [304]:
# Here is a function you could load the text data if your saved data follows
# the format we provide the in first lab section code.

def loadNewsTexts(csvPath):

  # the function returns two dictionaries, one for ABC news text data and one for Fox news text data

  abcNewsRawTextDict = {}
  foxNewsRawTextDict = {}
  with open(csvPath, newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
      if (row['newsSource'] == "ABCNews"):
        abcNewsRawTextDict[row['newsTitle']] = row['newsContents']
      else:
        foxNewsRawTextDict[row['newsTitle']] = row['newsContents']

  return abcNewsRawTextDict,foxNewsRawTextDict

# Load your news text data here
# abcNewsDict,foxNewsDict = loadNewsTexts('./newsContents.csv')


In [306]:
abcNewsRawTextDict,foxNewsRawTextDict = loadNewsTexts('newsContents.csv')

In [308]:
def preprocess(wikiTextDict):
  # Input: a wiki text dictionary with keys are titles and values are the corresponding texts.
  # Output: a wiki text dictionary with keys are the titles and the values are the preprocessed texts 
  # (sentences - tokens).

  # sub-task 1: remove all the references texts "[...]"

  nlp = spacy.load("en_core_web_sm")
  lemmatizer = nlp.get_pipe("lemmatizer")
  for key in wikiTextDict.keys():
    wikiTextDict[key] = re.sub(r'[\[].*?[\]]', '', wikiTextDict[key])
    
  # sub-task 2: segment all the sentences in the wiki texts.
    sentences = []
    docParagraph = nlp(wikiTextDict[key])
    assert docParagraph.has_annotation("SENT_START")
    for sent in docParagraph.sents:
      # sub-task 3: tokenize the sentences from sub-task 2.
      # sub-task 4: lemmatize the tokens from sub-task 3.
      # sub-task 5: lower-case the tokens from sub-task 3/4.

      tokens = []
      docSent = nlp(sent.text)
      for token in docSent:
        tokens.append(token.lemma_.lower())
      sentences.append(tokens)
    wikiTextDict[key] = sentences

  return wikiTextDict
  # You don't need to follow the order of the sub-tasks.
abcProcessed = preprocess(abcNewsRawTextDict)
foxProcessed = preprocess(foxNewsRawTextDict)

Task 12. Construct a histogram of word count from both sources. The X-axis should be unique words in decending order of word count and the Y-axis should be the counts for each word.

(Please remember to preprocess the text data first.)

In [310]:
def computeFreq(wikiTextDict):

  # Input: a wiki text dictionary with keys are titles and values are the preprocessed corresponding texts.
  # Output: a dictionary with keys are the word types, and the values are the appearance counts of the word types
  tokenDict = {}
  for key in wikiTextDict.keys():
    for sentence in wikiTextDict[key]:
      for token in sentence:
        if token.isalpha() == True:
          if token in tokenDict.keys():
            tokenDict[token] += 1
          else:
            tokenDict[token] = 0
  return tokenDict

In [321]:
def computeWords(freqDict,stop_words):
  
  # Input: a dictionary with keys are the word types, and the values are the appearance counts of the word types
  # Output: a list of 20 words that appear most frequently in all the preprocessed scraped texts.

  # If not preprocessed specifically, the punctuations still exist in the frequency dictionary from task 3.
  # You need to remove them before outputing the top 20 words.
  sortKey = []
  sortValue = []
  sortedList = sorted(freqDict.items(), key=lambda dict: dict[1], reverse=True)
  for item in sortedList:
    if item[0] not in stop_words:
      sortKey.append(item[0])
      sortValue.append(item[1])
  return sortKey,sortValue

In [322]:
import matplotlib.pyplot as plt


# Preprocess the news data.
# Compute word type list and the word token list.
abcFreq = computeFreq(abcProcessed)
abcWord,abcCount = computeWords(abcFreq,stop_words)


In [324]:
foxFreq = computeFreq(foxProcessed)
foxWord,foxCount = computeWords(foxFreq,stop_words)

In [1]:

def plotHistogram(wordType,wordTokens):
  # Input: a list of word types, a list of word token counts to the corresponding word types
  # Output: (display the histogram of word count from a news source)

  # X-axis should be (indexes) of the word type, and Y-axis should be the word counts of the word type.
  
    plt.figure(figsize = (6, 4))
    # for i in range(len(wordType)):
    plt.hist(wordType,bins=wordTokens)
    plt.show()

plotHistogram(abcWord,abcCount)
# Plot the histogram here.
plotHistogram(foxWord,foxCount)

NameError: name 'abcWord' is not defined

Task 13. Construct the word clouds from the two texts. Include the word clouds and comment your interesting insights after that.

In [None]:
# You may consider re-use the code from task 5 here.

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

def plotWordCloud(image):
  # Input: word cloud image
  # Output: (display the cloud image in the output)

  plt.figure(figsize=(40, 30))
  # Display image
  plt.imshow(image) 
  # No axis details
  plt.axis("off")

def generateWordCloud(text):
  # Input: all texts in the scraped wiki data.
  # Output: word cloud image.
  wordcloud = WordCloud(width= 3000, height = 2000, random_state=1, background_color='salmon', colormap='Pastel1', collocations=False, stopwords = STOPWORDS).generate(text)

  return wordcloud

# Draw your word cloud here

allWordList = []
for paragraph in abcProcessed.values():
    for sent in paragraph:
        for word in sent:
            if word not in stop_words and word.isalpha() == True:
                allWordList.append(word)

len(allWordList)

wordcloud = generateWordCloud(' '.join(allWordList))
plotWordCloud(wordcloud)


allWordList = []
for paragraph in foxProcessed.values():
    for sent in paragraph:
        for word in sent:
            if word not in stop_words and word.isalpha() == True:
                allWordList.append(word)

len(allWordList)

wordcloud = generateWordCloud(' '.join(allWordList))
plotWordCloud(wordcloud)

My m2 chip macbook cannot install the wordcloud because some fitting problem, I've checked on the colab, it's worked.  
And the colab failed to install the newspaper lib, so sorry about no Pics in there. You can check the code in a feasible environment.

From the wordcloud, the news very like reference from someones comments because there are lots of "say". Means often use someone says that....