<a href="https://colab.research.google.com/github/farhanwadia/MIE1624/blob/master/Course%20Presentation/RSS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web Scraping with RSS Feeds

Prepared by Group 14

The purpose of this notebook is to show how to scrape news articles from the RSS feeds of various news sources and how to apply text processing techniques to the scraped text. This notebook can be accessed at https://github.com/farhanwadia/MIE1624/blob/master/Course%20Presentation/RSS.ipynb

## 1. Installation

In [1]:
!git clone https://github.com/farhanwadia/MIE1624.git

Cloning into 'MIE1624'...
remote: Enumerating objects: 48, done.
remote: Counting objects: 100% (48/48), done.
remote: Compressing objects: 100% (42/42), done.
remote: Total 48 (delta 10), reused 20 (delta 3), pack-reused 0
Unpacking objects: 100% (48/48), 270.61 KiB | 1.63 MiB/s, done.


In [2]:
%cd MIE1624
%cd 'Course Presentation'

/content/MIE1624
/content/MIE1624/Course Presentation


In [3]:
!ls

 GP.ipynb					    new_york_times.csv
'In-class presentation assignment - Group 14.pdf'   RSS.ipynb
 le_devoir.csv					    toronto_star.csv


In [4]:
!pip install feedparser
!pip install newspaper3k

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting feedparser
  Downloading feedparser-6.0.10-py3-none-any.whl (81 kB)
     81.1/81.1 KB 1.4 MB/s eta 0:00:00
Collecting sgmllib3k
  Downloading sgmllib3k-1.0.0.tar.gz (5.8 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: sgmllib3k
  Building wheel for sgmllib3k (setup.py) ... done
  Created wheel for sgmllib3k: filename=sgmllib3k-1.0.0-py3-none-any.whl size=6066 sha256=7dab0cbf388b8c7ced3725d3745bafb3991e19c28f9b2c3cfb918e4a86434130
  Stored in directory: /root/.cache/pip/wheels/83/63/2f/117884c3b19d46b64d3d61690333aa80c88dc14050e269c546
Successfully built sgmllib3k
Installing collected packages: sgmllib3k, feedparser
Successfully installed feedparser-6.0.10 sgmllib3k-1.0.0
Looking in indexes: https://pypi.org/simple, https://us-python

## 2. Working with RSS Feeds

### New York Times

A list of all RSS feeds from the New York Times can be accessed at https://www.nytimes.com/rss.

Let's use the World feed from https://rss.nytimes.com/services/xml/rss/nyt/World.xml as an example:

#### Form the dataframe

In [5]:
import feedparser

d = feedparser.parse('https://rss.nytimes.com/services/xml/rss/nyt/World.xml')

In [6]:
# Get a list of all possible fields from the RSS
all_fields = []
for field in d.entries[0]:
    all_fields.append(field)

print(all_fields)

['title', 'title_detail', 'links', 'link', 'id', 'guidislink', 'summary', 'summary_detail', 'authors', 'author', 'author_detail', 'published', 'published_parsed', 'media_content', 'media_credit', 'credit']
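
To make these field names less abstract, here is a minimal sketch of what an RSS item looks like and how its child elements become fields. It uses only the standard library's `xml.etree.ElementTree` on a small hand-written feed. The feed content and the `parse_items` helper are invented for illustration, and `feedparser` does considerably more than this (date parsing, namespaces, tolerance for malformed feeds).

```python
import xml.etree.ElementTree as ET

# A tiny hand-written RSS 2.0 document (illustrative content, not a real feed)
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example World News</title>
    <item>
      <title>First headline</title>
      <link>https://example.com/articles/1</link>
      <pubDate>Tue, 07 Mar 2023 00:48:52 +0000</pubDate>
      <description>Short summary of the first article.</description>
    </item>
    <item>
      <title>Second headline</title>
      <link>https://example.com/articles/2</link>
      <pubDate>Mon, 06 Mar 2023 23:00:38 +0000</pubDate>
      <description>Short summary of the second article.</description>
    </item>
  </channel>
</rss>
"""

def parse_items(rss_text):
    """Return one dict per <item>, keyed by the tag name of each child element."""
    root = ET.fromstring(rss_text)
    return [
        {child.tag: (child.text or "").strip() for child in item}
        for item in root.iter("item")
    ]

items = parse_items(SAMPLE_RSS)
for item in items:
    print(item["title"], "->", item["link"])
```

`feedparser` exposes the `pubDate` element as `published` and `description` as `summary`, which is why those names appear in the list above.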


In [7]:
# Define the fields of interest that we want to obtain from the RSS
fields = ['title', 'published', 'summary', 'author', 'link']

In [8]:
import pandas as pd

# Create a list of lists to hold the required RSS data from each entry
data = []
for entry in d.entries:
    # .get() returns None instead of raising when an entry lacks a field
    row = [entry.get(field) for field in fields]
    data.append(row)

# Convert the list of lists to a DataFrame
df = pd.DataFrame(data, columns=fields)

In [9]:
df.head()

|  | title | published | summary | author | link |
|---|---|---|---|---|---|
| 0 | Ukraine’s Top Generals Want to Keep Fighting f... | Tue, 07 Mar 2023 00:48:52 +0000 | Military commanders told President Volodymyr Z... | The New York Times | https://www.nytimes.com/live/2023/03/06/world/... |
| 1 | Protests Over Netanyahu’s Judiciary Overhaul S... | Mon, 06 Mar 2023 23:00:38 +0000 | The military leadership is concerned that ange... | Ronen Bergman and Patrick Kingsley | https://www.nytimes.com/2023/03/06/world/middl... |
| 2 | Iran’s Rulers, Shaken by Protests, Now Face Cu... | Mon, 06 Mar 2023 22:30:53 +0000 | Years of Western sanctions are partly to blame... | Vivian Yee | https://www.nytimes.com/2023/03/06/world/middl... |
| 3 | Ukrainian Soldiers, Nearly Encircled in Bakhmu... | Mon, 06 Mar 2023 08:00:19 +0000 | The battle for Bakhmut is not over — at least ... | Carlotta Gall and Daniel Berehulak | https://www.nytimes.com/2023/03/06/world/europ... |
| 4 | The Story of Multicultural Canada, Told in Hum... | Mon, 06 Mar 2023 22:11:00 +0000 | Some of Toronto’s best dining options are mom-... | Norimitsu Onishi | https://www.nytimes.com/2023/03/05/world/canad... |


In [10]:
print("The shape of the dataframe is", df.shape)

The shape of the dataframe is (61, 5)
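
The `published` column holds dates as RFC 822-style strings, which sort alphabetically rather than chronologically. Below is a minimal sketch of converting such strings to timezone-aware datetimes with the standard library's `email.utils.parsedate_to_datetime`. The two strings are examples of the date formats found in the feeds used in this notebook; the variable names are illustrative.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime

# Two date formats seen in the feeds: a numeric offset and a named US zone
nyt_date = "Tue, 07 Mar 2023 00:48:52 +0000"
star_date = "Mon, 6 Mar 2023 17:30:00 EST"

nyt_dt = parsedate_to_datetime(nyt_date)
star_dt = parsedate_to_datetime(star_date)

print(nyt_dt.isoformat())
print(star_dt.isoformat())

# Timezone-aware datetimes compare correctly across different offsets
print(nyt_dt > star_dt)
```

Assuming the DataFrame built above, something like `df['published'].apply(parsedate_to_datetime)` should convert the whole column in one step.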


#### Retrieving the article text for each row of the DataFrame

In [11]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [12]:
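from newspaper import Article

# Download and parse each article. The texts are kept in a list, in feed
# order, so they line up with the DataFrame rows (a dict keyed by link would
# silently drop duplicate links and misalign the column).
article_texts = []
for link in df["link"]:
  article = Article(link)
  article.download()
  article.parse()
  # article.nlp() is only needed for keywords and summaries, not for the text
  article_texts.append(article.text)

df['text'] = article_texts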

In [13]:
df.head()

|  | title | published | summary | author | link | text |
|---|---|---|---|---|---|---|
| 0 | Ukraine’s Top Generals Want to Keep Fighting f... | Tue, 07 Mar 2023 00:48:52 +0000 | Military commanders told President Volodymyr Z... | The New York Times | https://www.nytimes.com/live/2023/03/06/world/... | Ukrainian forces continue to defend the easter... |
| 1 | Protests Over Netanyahu’s Judiciary Overhaul S... | Mon, 06 Mar 2023 23:00:38 +0000 | The military leadership is concerned that ange... | Ronen Bergman and Patrick Kingsley | https://www.nytimes.com/2023/03/06/world/middl... | A plan by Prime Minister Benjamin Netanyahu to... |
| 2 | Iran’s Rulers, Shaken by Protests, Now Face Cu... | Mon, 06 Mar 2023 22:30:53 +0000 | Years of Western sanctions are partly to blame... | Vivian Yee | https://www.nytimes.com/2023/03/06/world/middl... | As their currency plunged to new lows recently... |
| 3 | Ukrainian Soldiers, Nearly Encircled in Bakhmu... | Mon, 06 Mar 2023 08:00:19 +0000 | The battle for Bakhmut is not over — at least ... | Carlotta Gall and Daniel Berehulak | https://www.nytimes.com/2023/03/06/world/europ... | CHASIV YAR, Ukraine — Lined up in the dark in ... |
| 4 | The Story of Multicultural Canada, Told in Hum... | Mon, 06 Mar 2023 22:11:00 +0000 | Some of Toronto’s best dining options are mom-... | Norimitsu Onishi | https://www.nytimes.com/2023/03/05/world/canad... | SCARBOROUGH, Ontario — At a tiny strip mall wh... |


In [14]:
df.to_csv("new_york_times.csv", encoding='utf-8', index=False)

### Toronto Star

#### Function Development
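Create two helper functions to assist with the scraping process: one prints the fields available in a feed, and the other builds a DataFrame of feed entries together with their article text.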

In [15]:
def print_RSS_fields(rss_link):
    """Print the field names available on the first entry of an RSS feed."""
    d = feedparser.parse(rss_link)
    print(list(d.entries[0]))

def df_from_RSS(rss_link, fields):
    """Build a DataFrame with the requested RSS fields plus the article text."""
    d = feedparser.parse(rss_link)

    # One row per entry; .get() yields None when an entry lacks a field
    data = [[entry.get(field) for field in fields] for entry in d.entries]

    # Convert the list of lists to a DataFrame
    df = pd.DataFrame(data, columns=fields)

    # Download and parse each article, keeping the texts in row order
    article_texts = []
    for link in df["link"]:
        article = Article(link)
        article.download()
        article.parse()
        article_texts.append(article.text)

    df['text'] = article_texts

    return df

A list of RSS feeds for the Toronto Star can be found here: https://www.thestar.com/about/rssfeeds.html

Let's use the Top Stories RSS feed.

In [16]:
print_RSS_fields('https://www.thestar.com/content/thestar/feed.RSSManagerServlet.articles.topstories.rss')

['title', 'title_detail', 'links', 'link', 'id', 'guidislink', 'authors', 'author', 'author_detail', 'published', 'published_parsed', 'summary', 'summary_detail', 'media_content', 'media_thumbnail', 'href', 'content', 'media_credit', 'credit']


In [17]:
fields = ['title', 'published', 'author', 'link']

df = df_from_RSS('https://www.thestar.com/content/thestar/feed.RSSManagerServlet.articles.topstories.rss', fields)

df.head()

|  | title | published | author | link | text |
|---|---|---|---|---|---|
| 0 | Chinese interference in Canada? Chinese Canadi... | Mon, 6 Mar 2023 17:30:00 EST | Joanna Chiu - Staff Reporter | https://www.thestar.com/politics/federal/2023/... | The first time Cheuk Kwan and Sheng Xue testif... |
| 1 | Justin Trudeau to appoint special rapporteur t... | Mon, 6 Mar 2023 14:41:00 EST | Tonda MacCharles - Ottawa Bureau | https://www.thestar.com/politics/federal/2023/... | OTTAWA — Amid rising political pressure, the L... |
| 2 | This hospital’s deal with a corporation of sur... | Mon, 6 Mar 2023 19:02:00 EST | Kenyon Wallace - Investigative Reporter,Megan ... | https://www.thestar.com/news/canada/2023/03/06... | Members of the Ontario NDP are pressing the pr... |
| 3 | These five cities have seen home prices fall b... | Mon, 6 Mar 2023 06:00:00 EST | Clarrie Feinstein - Business Reporter | https://www.thestar.com/business/2023/03/06/th... | Some Ontario cities’ home prices have declined... |
| 4 | Largest cross-border drug case in Toronto poli... | Mon, 6 Mar 2023 20:05:00 EST | Wendy Gillis - Staff Reporter,Jason Miller - C... | https://www.thestar.com/news/gta/2023/03/06/la... | With fanfare that included tables piled high w... |


In [18]:
df.to_csv("toronto_star.csv", encoding='utf-8', index=False)

### Le Devoir

A list of RSS feeds for Le Devoir can be found here: https://www.ledevoir.com/flux-rss

Let's use the World (le Monde) RSS feed.

In [19]:
print_RSS_fields('https://www.ledevoir.com/rss/section/monde.xml?id=76')

['surtitle', 'title', 'title_detail', 'published', 'published_parsed', 'links', 'link', 'id', 'guidislink', 'tags', 'summary', 'summary_detail', 'authors', 'author', 'author_detail']


In [20]:
fields = ['title', 'published', 'author', 'link']

df = df_from_RSS('https://www.ledevoir.com/rss/section/monde.xml?id=76', fields)

df.head()

|  | title | published | author | link | text |
|---|---|---|---|---|---|
| 0 | Un mardi de «mise à l’arrêt» s’annonce en Fran... | Mon, 06 Mar 2023 20:48:42 -0500 | webmestre@ledevoir.com (Lucie Peytermann) | https://www.ledevoir.com/monde/europe/784249/-... | La France se prépare à une journée d’action dé... |
| 1 | L’armée ukrainienne refuse d’abandonner ses po... | Mon, 06 Mar 2023 20:46:08 -0500 | webmestre@ledevoir.com (Emmanuel Peuchot) | https://www.ledevoir.com/monde/europe/784240/-... | Le président Volodymyr Zelensky a ordonné lund... |
| 2 | En Afghanistan, les hommes sont de retour à l’... | Mon, 06 Mar 2023 20:43:04 -0500 | webmestre@ledevoir.com (Estelle Emonet) | https://www.ledevoir.com/monde/asie/784228/-en... | Les hommes ont repris les cours lundi dans les... |
| 3 | Taïwan alarmé par la hausse du budget de la Dé... | Mon, 06 Mar 2023 10:52:30 -0500 | webmestre@ledevoir.com (Agence France-Presse) | https://www.ledevoir.com/monde/asie/784245/-ta... | Le ministre taïwanais de la Défense, Chiu Kuo-... |
| 4 | L’opposante biélorusse Tikhanovskaïa condamnée... | Mon, 06 Mar 2023 09:03:50 -0500 | webmestre@ledevoir.com (Agence France-Presse) | https://www.ledevoir.com/monde/europe/784235/-... | Un tribunal biélorusse a condamné lundi par co... |


In [21]:
df.to_csv("le_devoir.csv", encoding='utf-8-sig', index=False)

### CBC

A list of RSS feeds for the CBC can be found here: https://www.cbc.ca/rss/

Let's use the World RSS feed.

In [22]:
print_RSS_fields('https://rss.cbc.ca/lineup/world.xml')

['title', 'title_detail', 'links', 'link', 'id', 'guidislink', 'published', 'published_parsed', 'authors', 'author', 'tags', 'summary', 'summary_detail']


In [25]:
fields = ['title', 'published', 'author', 'link']

df = df_from_RSS('https://rss.cbc.ca/lineup/world.xml', fields)

df.head()

|  | title | published | author | link | text |
|---|---|---|---|---|---|
| 0 | Russia's Wagner group leader suggests Moscow h... | Sat, 12 Feb 2022 11:04:48 EST |  | https://www.cbc.ca/news/world/ukraine-war-day-... | The founder of Russia's Wagner mercenary force... |
| 1 | Twitter says things 'normal' again after exper... | Mon, 6 Mar 2023 12:46:20 EST | Reuters | https://www.cbc.ca/news/business/twitter-issue... | Thousands of Twitter users reported problems a... |
| 2 | As holidaying Canadians return to Cuba, Cubans... | Mon, 6 Mar 2023 04:00:00 EST | Evan Dyer | https://www.cbc.ca/news/politics/cuba-nicaragu... | Canadian tourists returned to Cuba in large nu... |
| 3 | Exiled opposition leader vows to fight on afte... | Mon, 6 Mar 2023 15:57:38 EST | Reuters | https://www.cbc.ca/news/world/belarus-oppositi... | Exiled Belarusian opposition leader Sviatlana ... |
| 4 | Toblerone chocolate maker drops iconic Matterh... | Mon, 6 Mar 2023 15:38:13 EST | The Associated Press | https://www.cbc.ca/news/world/toblerone-chocol... | The makers of Toblerone are stripping images o... |


In [26]:
df.to_csv("cbc.csv", encoding='utf-8', index=False)

## 3. Text Processing

In [27]:
# To demonstrate the text processing techniques, we use the New York Times
# dataset created above. Let's start by loading it from the CSV file.

nytimes_df = pd.read_csv("new_york_times.csv", encoding='utf-8')
nytimes_df.head()

|  | title | published | summary | author | link | text |
|---|---|---|---|---|---|---|
| 0 | Ukraine’s Top Generals Want to Keep Fighting f... | Tue, 07 Mar 2023 00:48:52 +0000 | Military commanders told President Volodymyr Z... | The New York Times | https://www.nytimes.com/live/2023/03/06/world/... | Ukrainian forces continue to defend the easter... |
| 1 | Protests Over Netanyahu’s Judiciary Overhaul S... | Mon, 06 Mar 2023 23:00:38 +0000 | The military leadership is concerned that ange... | Ronen Bergman and Patrick Kingsley | https://www.nytimes.com/2023/03/06/world/middl... | A plan by Prime Minister Benjamin Netanyahu to... |
| 2 | Iran’s Rulers, Shaken by Protests, Now Face Cu... | Mon, 06 Mar 2023 22:30:53 +0000 | Years of Western sanctions are partly to blame... | Vivian Yee | https://www.nytimes.com/2023/03/06/world/middl... | As their currency plunged to new lows recently... |
| 3 | Ukrainian Soldiers, Nearly Encircled in Bakhmu... | Mon, 06 Mar 2023 08:00:19 +0000 | The battle for Bakhmut is not over — at least ... | Carlotta Gall and Daniel Berehulak | https://www.nytimes.com/2023/03/06/world/europ... | CHASIV YAR, Ukraine — Lined up in the dark in ... |
| 4 | The Story of Multicultural Canada, Told in Hum... | Mon, 06 Mar 2023 22:11:00 +0000 | Some of Toronto’s best dining options are mom-... | Norimitsu Onishi | https://www.nytimes.com/2023/03/05/world/canad... | SCARBOROUGH, Ontario — At a tiny strip mall wh... |


In [28]:
import random
import string
from collections import defaultdict

import nltk
from nltk.corpus import stopwords
from nltk.probability import ConditionalFreqDist
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.util import ngrams

# Download the NLTK resources used by the functions below
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


In [36]:
# defining all the text processing techniques discussed in the presentation as 
# functions

def number_of_words(text):
  """
  Counts the total number of words in a text

  'text': This argument strictly needs to be of type "string"
  """
  return len(text.split())

def number_of_characters(text):
  """
  Counts the total number of characters in a text
  
  'text': This argument strictly needs to be of type "string"
  """
  return len(text)

def average_word_length(text):
  """
  Returns the average word length of a text, truncated to an integer

  'text': This argument strictly needs to be of type "string"
  """
  words = text.split()
  # Avoid a division by zero on empty or whitespace-only text
  if not words:
    return 0
  avg_word_len = sum(len(word) for word in words) / len(words)
  return int(avg_word_len)

def change_case(text, case_ = 'lower'):
  """
  Returns the text converted to the case specified by the case_ argument

  'text': This argument strictly needs to be of type "string"
  'case_': can either be 'lower' or 'upper'
  """
  if case_ == 'lower':
    return text.lower()
  elif case_ == 'upper':
    return text.upper()
  # Fail loudly instead of silently returning None
  raise ValueError("case_ must be 'lower' or 'upper'")

def count_stopwords(text):
  """
  Returns the total number of stopwords in a text.
  Vocabulary for stopwords is from nltk module,

  'text': This argument strictly needs to be of type "string"
  """
  stop_words = set(stopwords.words('english'))
  num_of_stopwords = len([word for word in text.split() if word.lower() in stop_words])
  return num_of_stopwords


def tokenize(text, how = 'word'):
  """
  Tokenizes the text based on nltk module's basic tokenization techniques

  'text': This argument strictly needs to be of type "string"
  'how': This argument can be one of ['word', 'sentence', 'whitespace'],
         indicating the strategy for tokenizing
  """
  if how == 'word':
    return nltk.word_tokenize(text)
  elif how == 'sentence':
    return nltk.sent_tokenize(text)
  elif how == 'whitespace':
    # WhitespaceTokenizer must be instantiated before calling tokenize()
    return nltk.tokenize.WhitespaceTokenizer().tokenize(text)
  raise ValueError("how must be 'word', 'sentence' or 'whitespace'")

def remove_punct(text):
  """
  Removes punctuation marks from a specified text

  'text': This argument strictly needs to be of type "string"
  """
  
  # Tokenize the text into individual words
  tokens = tokenize(text)

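  # Remove punctuation marks from each token, then drop tokens that become
  # empty (e.g. a standalone "." or ",") so no double spaces are produced
  table = str.maketrans('', '', string.punctuation)
  words = [word.translate(table) for word in tokens]
  words = [word for word in words if word]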

  # Combine the words back into a single string
  cleaned_text = ' '.join(words)

  return cleaned_text

def remove_stopwords(text):
  """
  Performs stopwords removal on the specified text

  'text': This argument strictly needs to be of type "string"
  """

  # Tokenize the text
  tokens = tokenize(text, how = 'word')
    
  # Remove stopwords
  stopwords_list = stopwords.words('english')
  filtered_tokens = [token for token in tokens if token.lower() not in stopwords_list]
    
  # Join the filtered tokens back into a string
  filtered_text = ' '.join(filtered_tokens)
    
  return filtered_text

def stem_text(text):
  """
  Returns the stemmed text based on the "PorterStemmer" stemming strategy of
  the nltk module. This is a widely used stemming algorithm that removes common 
  suffixes from English words. It is implemented in nltk as the PorterStemmer 
  class.

  'text': This argument strictly needs to be of type "string"
  """
  stemmer = PorterStemmer()
  tokens = text.split()
  stemmed_tokens = [stemmer.stem(token) for token in tokens]
  stemmed_text = ' '.join(stemmed_tokens)
  return stemmed_text

def lemmatize_text(text):
  """
  Lemmatizes text using the WordNetLemmatizer class from the nltk.stem module.

  'text': This argument strictly needs to be of type "string"
  """
  tokens = tokenize(text, how = 'word')
  lemmatizer = WordNetLemmatizer()

  lemmatized_tokens = []
  for word in tokens:
    lemmatized_tokens.append(lemmatizer.lemmatize(word))
  
  lemmatized_text = ' '.join(lemmatized_tokens)
  
  return lemmatized_text

def normalize_text(text):
  """
  This function acts as a pipeline of the functions defined above, for the
  purpose of normalizing the text

  'text': This argument strictly needs to be of type "string"
  """
  normalized_text = change_case(text, case_ = 'lower')
  normalized_text = remove_punct(normalized_text)
  # normalized_text = remove_stopwords(normalized_text)
  normalized_text = lemmatize_text(normalized_text)

  return normalized_text

In [30]:
sample_text = """ The quick brown fox jumps over the lazy dog. The dog is not actually lazy, 
                  but rather quite active. He loves to run and play all day long, chasing after squirrels 
                  and rabbits in the park. Unfortunately, he sometimes gets into fights with other dogs, 
                  which can be quite dangerous. As his owner, it's my job to make sure he stays safe and healthy, 
                  both physically and emotionally.
              """

print('# of Words:', number_of_words(sample_text), '\n')
print('# of Characters:', number_of_characters(sample_text), '\n')
print('Avg. word length:', average_word_length(sample_text), '\n')

# of Words: 68 

# of Characters: 470 

Avg. word length: 4 



In [31]:
print('Normalized text:', normalize_text(sample_text), '\n')

Normalized text: the quick brown fox jump over the lazy dog the dog is not actually lazy but rather quite active he love to run and play all day long chasing after squirrel and rabbit in the park unfortunately he sometimes get into fight with other dog which can be quite dangerous a his owner it s my job to make sure he stay safe and healthy both physically and emotionally 



### Predicting the next word in a sentence using the N-grams model

In [37]:
def n_gram_model(text):
  trigrams = list(nltk.ngrams(list(text.split()), 3, pad_left=True, pad_right=True, left_pad_symbol='<s>', right_pad_symbol='</s>'))

  # make conditional frequencies dictionary  
  cfdist = ConditionalFreqDist()
  for w1, w2, w3 in trigrams:
    cfdist[(w1, w2)][w3] += 1

  # transform frequencies to probabilities
  for w1_w2 in cfdist:
    total_count = float(sum(cfdist[w1_w2].values()))
    for w3 in cfdist[w1_w2]:
      cfdist[w1_w2][w3] /= total_count

  return cfdist

def predict(model, user_input):
  # Normalize the prompt the same way as the training text
  words = normalize_text(user_input).split()
  if len(words) < 2:
    print("Please enter at least two words.")
    return

  while True:
    # The trigram context is the last two words typed or generated so far
    candidates = dict(model[words[-2], words[-1]])
    if not candidates:
      print("No prediction available for:", ' '.join(words[-2:]))
      return

    # display predictions from highest to lowest probability
    prediction = sorted(candidates, key=candidates.get, reverse=True)
    print("Trigram model predictions: ", prediction)

    # pick the next word at random, weighted by its probability
    next_word = random.choices(list(candidates), weights=list(candidates.values()), k=1)[0]

    # add predicted word to user input
    words.append(next_word)
    print(' '.join(words))

    ask = input("Do you want to generate another word? (type 'y' for yes or 'n' for no): ")
    if ask.lower() != 'y':
      print("done")
      return

In [38]:
normalized_text = normalize_text(sample_text)
model = n_gram_model(normalized_text)
predict(model, "quick brown fox")

Trigram model predictions:  ['jump']
quick brown fox jump
Do you want to generate another word? (type 'y' for yes or 'n' for no): y
Trigram model predictions:  ['over']
quick brown fox jump over
Do you want to generate another word? (type 'y' for yes or 'n' for no): y
Trigram model predictions:  ['the']
quick brown fox jump over the
Do you want to generate another word? (type 'y' for yes or 'n' for no): n
done
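
The probabilities behind these predictions are maximum likelihood estimates: the number of times a word follows a two-word context, divided by the number of times that context occurs. A minimal standard-library sketch of that calculation follows. The `trigram_probabilities` helper and the toy sentence are illustrative, and unlike the NLTK-based function above it adds no `<s>` or `</s>` padding.

```python
from collections import Counter, defaultdict

def trigram_probabilities(tokens):
    """Map each (w1, w2) context to {w3: P(w3 | w1, w2)}."""
    counts = defaultdict(Counter)
    # Slide a window of three consecutive tokens over the text
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        counts[(w1, w2)][w3] += 1

    # Divide each count by the total count of its context
    return {
        context: {word: n / sum(following.values()) for word, n in following.items()}
        for context, following in counts.items()
    }

tokens = "the dog runs and the dog sleeps and the dog runs".split()
probs = trigram_probabilities(tokens)
print(probs[("the", "dog")])
```

Here "the dog" is followed by "runs" twice and by "sleeps" once, so the estimated probabilities are 2/3 and 1/3.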


In [40]:
# using the New York Times data
text = nytimes_df.loc[0, 'text']
#print(text, '\n')

In [41]:
normalized_text = normalize_text(text)
model = n_gram_model(normalized_text)
predict(model, "with Russian forces")

Trigram model predictions:  ['and', 'have', 'russian', 'attacking', '“']
with russian force russian
Do you want to generate another word? (type 'y' for yes or 'n' for no): y
Trigram model predictions:  ['force']
with russian force russian force
Do you want to generate another word? (type 'y' for yes or 'n' for no): n
done
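
The `“` among the predictions above survives normalization because `string.punctuation` only contains ASCII punctuation, while news text is full of curly quotes and long dashes. One possible remedy, sketched below, is to drop every character whose Unicode category is a punctuation class. The helper name `strip_unicode_punct` is hypothetical and is not part of the notebook's pipeline.

```python
import string
import unicodedata

def strip_unicode_punct(text):
    """Drop every character whose Unicode category starts with 'P' (punctuation)."""
    kept = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    # Collapse the whitespace left behind by removed characters
    return " ".join(kept.split())

# Curly quotes are not covered by the ASCII-only string.punctuation
print("“" in string.punctuation)

cleaned = strip_unicode_punct("“Hold the line,” he said — again.")
print(cleaned)
```

Note that this also removes typographic apostrophes, so "Ukraine’s" becomes "Ukraines", and that it keeps symbols such as `$` or `+`, which `string.punctuation` would remove. Whether that trade-off is acceptable depends on the task.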


### "Bag of Words" Document Representation using Term Frequency (TF)

Bag of Words is a document representation technique used in natural language processing to convert a text document into a numerical vector that can be used for various machine learning tasks. In this technique, the text of a document is first preprocessed by tokenizing the words and removing any stop words and punctuation marks. The resulting set of words is then used to create a vocabulary, where each word is represented by a unique index.

The Bag of Words model can then be represented using two methods: Term Frequency (TF) and Term Frequency-Inverse Document Frequency (TF-IDF).

In the TF method, each document is represented by a vector where each entry corresponds to the frequency of the corresponding word in the document. For example, consider the following two sentences:

The quick brown fox jumps over the lazy dog.

The lazy dog sleeps all day.

The corresponding Bag of Words representation using the TF method would be:

```
            quick brown fox  jumps over the   lazy  dog   sleeps all   day
Sentence 1: 0.11  0.11  0.11 0.11  0.11 0.22  0.11  0.11  0      0     0  
Sentence 2: 0     0     0    0     0    0.167 0.167 0.167 0.167  0.167 0.167  
```
The entry in each vector corresponds to the relative frequency of the corresponding word in the document. For example, the first sentence has 0.11 in the "quick" column because the word "quick" appears once among the nine words of the sentence (1/9 ≈ 0.11), while "the" appears twice (2/9 ≈ 0.22).
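
The numbers in the table can be reproduced with a few lines of standard-library Python. This is a minimal sketch; the `term_frequencies` helper is illustrative and only strips the full stop, which is enough for these two sentences.

```python
from collections import Counter

def term_frequencies(sentence):
    """Return {word: count / total number of words} for a simple sentence."""
    words = sentence.lower().replace(".", "").split()
    counts = Counter(words)
    return {word: count / len(words) for word, count in counts.items()}

tf1 = term_frequencies("The quick brown fox jumps over the lazy dog.")
tf2 = term_frequencies("The lazy dog sleeps all day.")

print(round(tf1["the"], 3), round(tf1["quick"], 3))
print(round(tf2["the"], 3))
```

The entries of each sentence sum to 1, since every word contributes its share of the sentence length.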

scikit-learn's `TfidfVectorizer` class implements this 'Bag of Words' representation when the boolean argument `use_idf` is set to **False**, as can be seen below. Note that the vectorizer L2-normalizes each row by default (`norm='l2'`), so its entries are scaled counts rather than the proportions shown in the table above.
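
To see what the row normalization applied by `TfidfVectorizer` (`norm='l2'` by default) does to term counts, the sketch below normalizes the counts of the second example sentence by hand, once by the sum of counts (L1) and once by the Euclidean length (L2). It does not call scikit-learn, and the `normalize` helper is illustrative.

```python
import math
from collections import Counter

def normalize(counts, norm):
    """Scale raw term counts so the vector has unit L1 or L2 norm."""
    if norm == "l1":
        denom = sum(abs(c) for c in counts.values())
    elif norm == "l2":
        denom = math.sqrt(sum(c * c for c in counts.values()))
    else:
        raise ValueError("norm must be 'l1' or 'l2'")
    return {word: c / denom for word, c in counts.items()}

counts = Counter("the lazy dog sleeps all day".split())
l1 = normalize(counts, "l1")
l2 = normalize(counts, "l2")

print(round(l1["dog"], 3))  # proportion of the sentence, as in the table
print(round(l2["dog"], 3))  # entry of a unit-length vector
```

With six distinct words, the L1 entry is 1/6 and the L2 entry is 1/√6. Passing `norm='l1'` together with `use_idf=False` to `TfidfVectorizer` should therefore give proportions comparable to the table, subject to the vectorizer's own tokenization rules.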

In [42]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a list of all articles
texts = nytimes_df['text'].to_list()

# Create an instance of the TfidfVectorizer class with IDF weighting disabled
vectorizer = TfidfVectorizer(use_idf = False)

# Fit the vectorizer on the documents
vectorizer.fit(texts)

# Transform the articles into a term frequency (TF) matrix
tf_matrix = vectorizer.transform(texts)
tf_matrix = tf_matrix.toarray()

print(tf_matrix)

[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.03468962]
 [0.         0.         0.         ... 0.         0.         0.        ]]


In [43]:
tf_matrix.shape

(61, 3597)

This matrix is the **term frequency 'bag of words' representation** of the New York Times articles. Its shape is (# of articles, # of unique words in the corpus). Let's add a column `bow_tf` to our New York Times dataframe that holds the term frequency bag of words representation of the corresponding article.

In [44]:
nytimes_df['bow_tf'] = pd.Series(list(tf_matrix))

### "Bag of Words" Document Representation using Term Frequency and Inverse Document Frequency (TF-IDF)

In the TF-IDF method, each entry in the vector is the product of the term frequency and inverse document frequency of the corresponding word. This method gives more weight to words that are rare in the corpus and less weight to words that are common. The formula for calculating the TF-IDF weight of a word is:
```
TF-IDF(w, d) = TF(w, d) * IDF(w)
```
where TF(w, d) is the term frequency of word w in document d and IDF(w) is the inverse document frequency of word w across all documents in the corpus.

Continuing with the same example, 

The quick brown fox jumps over the lazy dog.

The lazy dog sleeps all day.

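the corresponding Bag of Words representation using the TF-IDF method, taking IDF(w) = ln(N / df(w)) with N = 2 sentences and df(w) the number of sentences containing w, would be:
```
            quick brown fox   jumps over  the lazy dog sleeps all   day
Sentence 1: 0.077 0.077 0.077 0.077 0.077 0   0    0   0      0     0
Sentence 2: 0     0     0     0     0     0   0    0   0.116  0.116 0.116
```
The entry in each vector corresponds to the TF-IDF weight of the corresponding word in the document. For example, the first sentence has a TF-IDF weight of 0.077 in the "quick" column because "quick" has a term frequency of 1/9 and appears in only one of the two sentences, so its IDF is ln(2/1) ≈ 0.693. The words "the", "lazy" and "dog" appear in both sentences, so their IDF is ln(2/2) = 0 and their weight vanishes. scikit-learn's `TfidfVectorizer` uses a smoothed IDF by default, ln((1 + N) / (1 + df(w))) + 1, and normalizes each row, so its values differ from this hand calculation.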
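
The formula can be worked through by hand on the two sentences. The sketch below is a minimal standard-library version that assumes the plain definition IDF(w) = ln(N / df(w)), where N is the number of documents and df(w) the number of documents containing w. The `tf_idf` helper is illustrative and is not what scikit-learn computes internally.

```python
import math
from collections import Counter

docs = [
    "the quick brown fox jumps over the lazy dog".split(),
    "the lazy dog sleeps all day".split(),
]

def tf_idf(docs):
    """Return one {word: TF * IDF} dict per document, with IDF = ln(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for doc in docs for word in set(doc))

    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights

w1, w2 = tf_idf(docs)
print(round(w1["quick"], 3), w1["the"])
print(round(w2["sleeps"], 3), w2["dog"])
```

Under this definition, a word that occurs in every document has IDF ln(1) = 0 and therefore a weight of zero, while words unique to one document keep a positive weight. This is the sense in which TF-IDF favours rare words.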

In [45]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a list of all articles
texts = nytimes_df['text'].to_list()

# Create an instance of the TfidfVectorizer class
vectorizer = TfidfVectorizer(use_idf = True)

# Fit the vectorizer on the documents
vectorizer.fit(texts)

# Transform the articles into a TF-IDF matrix
tfidf_matrix = vectorizer.transform(texts)
tfidf_matrix = tfidf_matrix.toarray()

print(tfidf_matrix)

[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.06032222]
 [0.         0.         0.         ... 0.         0.         0.        ]]


The column `bow_tfidf` is added to the New York Times dataframe to hold the **term frequency - inverse document frequency 'bag of words' representation** of the corresponding articles.

In [46]:
nytimes_df['bow_tfidf'] = pd.Series(list(tfidf_matrix))

In summary, the Bag of Words model with the TF and TF-IDF methods represents a document as a numerical vector where each entry corresponds to the frequency or importance of the corresponding word in the document. These representations can then be used in various machine learning tasks such as classification, clustering, and information retrieval.

### Word Embeddings

Word embeddings are a type of vector representation used in natural language processing to capture the meaning of words in a way that can be easily processed by machine learning algorithms. Unlike Bag of Words representations, word embeddings are able to capture the semantic relationships between words and can represent the meaning of a word based on its context.

Word embeddings are typically created using a technique called word2vec, which is a neural network-based approach. The basic idea behind word2vec is to train a neural network either to predict the words surrounding a given word (the skip-gram variant) or to predict a word from its surrounding words (the continuous bag-of-words, or CBOW, variant). The context of a word refers to the words that appear in its vicinity, such as the words that come before and after it in a sentence or paragraph.

The word2vec model consists of a shallow neural network with one input layer, one hidden layer, and one output layer. The input layer has one node per word in the vocabulary, and the hidden layer has as many nodes as the embedding dimension, which is a user-defined parameter. The output layer consists of a set of softmax nodes, each representing a word in the vocabulary. In the skip-gram variant, the objective of the network is to predict the probability distribution of context words over the vocabulary given the input word.

During training, the network is fed with pairs of words and their context words, and the objective is to minimize the cross-entropy loss between the predicted probability distribution and the actual context words. The weights connecting the input layer to the hidden layer are then used as the word embeddings for the corresponding words in the vocabulary.

The resulting word embeddings are typically dense vectors of fixed dimensionality, where each element represents a latent feature or aspect of the corresponding word. These vectors can be used as input to various machine learning algorithms for various NLP tasks such as sentiment analysis, named entity recognition, and machine translation.
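
The "pairs of words and their context words" used for training can be made concrete. The sketch below generates (target, context) pairs in the style of the skip-gram variant, using a sliding window over a token list. The `skipgram_pairs` helper and the window size are illustrative choices, and real implementations add refinements such as subsampling of frequent words.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every word within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        # Clamp the window to the boundaries of the token list
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

pairs = skipgram_pairs("the quick brown fox".split(), window=1)
print(pairs)
```

With a window of 1, each word is paired with its immediate neighbours only. Widening the window lets more distant words count as context.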

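For the purpose of this presentation, instead of writing the whole neural network model for implementing word2vec, we use the `W2VTransformer` class from the `gensim.sklearn_api` module, as shown below. This module ships with gensim 3.x and was removed in gensim 4.0, so the cell requires an older gensim release.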

In [47]:
from gensim.sklearn_api import W2VTransformer

# Create a model that represents each word by a 10-dimensional embedding vector
word2vec = W2VTransformer(size=10, min_count=1, seed=1)

# Split each article into a list of whitespace-separated tokens, which is the
# input format the model expects
corpus = []
for text in nytimes_df['text'].to_list():
  corpus.append(text.split())

word2vec.fit(corpus)



In [48]:
## word embedding for the word "country"
country_vec = word2vec.transform("country")
print(country_vec, '\n', country_vec.shape)

[[ 0.05276789 -0.04101641 -0.00537425  0.06918377  0.01668434 -0.02814205
   0.01882263 -0.03083917 -0.06442052 -0.01823299]] 
 (1, 10)


Since we initialized the model with `size=10`, the embedding vector of every word has dimension 10.

In [49]:
## word embedding for the word "company"
company_vec = word2vec.transform("company")
print(company_vec, '\n', company_vec.shape)

[[ 0.04399081 -0.02204123 -0.02133559  0.06640895 -0.00159628  0.03069001
   0.02628162 -0.02725606 -0.05027699 -0.03825242]] 
 (1, 10)
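
A common way to compare two embedding vectors is cosine similarity, the cosine of the angle between them. The sketch below applies it to the two vectors printed above, copied by hand into plain lists; the `cosine_similarity` helper is illustrative. With a corpus of only 61 articles and 10 dimensions, the resulting number should not be read as a reliable measure of how related the two words are.

```python
import math

# Embeddings copied from the outputs above
country = [0.05276789, -0.04101641, -0.00537425, 0.06918377, 0.01668434,
           -0.02814205, 0.01882263, -0.03083917, -0.06442052, -0.01823299]
company = [0.04399081, -0.02204123, -0.02133559, 0.06640895, -0.00159628,
           0.03069001, 0.02628162, -0.02725606, -0.05027699, -0.03825242]

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

similarity = cosine_similarity(country, company)
print(round(similarity, 3))
```

Values close to 1 indicate vectors pointing in nearly the same direction, values near 0 indicate unrelated directions, and negative values indicate opposing ones.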
