# Web Scraping with RSS Feeds

Prepared by Group 14

The purpose of this notebook is to show how to scrape news articles from the RSS feeds of various news sources and apply text processing techniques to them. This notebook can be accessed at https://github.com/farhanwadia/MIE1624/blob/master/Course%20Presentation/RSS.ipynb

## 1. Installation

In [2]:
!git clone https://github.com/farhanwadia/MIE1624.git

Cloning into 'MIE1624'...
remote: Enumerating objects: 39, done.
remote: Counting objects: 100% (39/39), done.
remote: Compressing objects: 100% (35/35), done.
remote: Total 39 (delta 8), reused 11 (delta 2), pack-reused 0
Unpacking objects: 100% (39/39), 156.78 KiB | 863.00 KiB/s, done.


In [3]:
%cd MIE1624
%cd 'Course Presentation'

/content/MIE1624
/content/MIE1624/Course Presentation


In [4]:
!ls

 GP.ipynb  'In-class presentation assignment - Group 14.pdf'   RSS.ipynb


In [5]:
!pip install feedparser
!pip install newspaper3k

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting feedparser
  Downloading feedparser-6.0.10-py3-none-any.whl (81 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 81.1/81.1 KB 2.7 MB/s eta 0:00:00
Collecting sgmllib3k
  Downloading sgmllib3k-1.0.0.tar.gz (5.8 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: sgmllib3k
  Building wheel for sgmllib3k (setup.py) ... done
  Created wheel for sgmllib3k: filename=sgmllib3k-1.0.0-py3-none-any.whl size=6066 sha256=caa2dfdde7071b047f5fdeb96250b1c18cdbc865598673a0bd9b58d5159ddbb7
  Stored in directory: /root/.cache/pip/wheels/83/63/2f/117884c3b19d46b64d3d61690333aa80c88dc14050e269c546
Successfully built sgmllib3k
Installing collected packages: sgmllib3k, feedparser
Successfully installed feedparser-6.0.10 sgmllib3k-1.0.0
Looking in indexes: https://pypi.org/simple, https://us-python

## 2. Working with RSS Feeds

### New York Times

A list of all RSS feeds from the New York Times can be accessed at https://www.nytimes.com/rss.

Let's use the World feed from https://rss.nytimes.com/services/xml/rss/nyt/World.xml as an example:

#### Form the dataframe

In [6]:
import feedparser

d = feedparser.parse('https://rss.nytimes.com/services/xml/rss/nyt/World.xml')

In [7]:
# Get a list of all possible fields from the RSS
all_fields = []
for field in d.entries[0]:
    all_fields.append(field)

print(all_fields)

['title', 'title_detail', 'links', 'link', 'id', 'guidislink', 'summary', 'summary_detail', 'authors', 'author', 'author_detail', 'published', 'published_parsed', 'tags', 'media_content', 'content']


In [8]:
# Define the fields of interest that we want to obtain from the RSS
fields = ['title', 'published', 'summary', 'author', 'link']

In [9]:
import pandas as pd

# Create a list of lists to hold the required RSS data from each entry
data = []
for entry in d.entries:
    row = [entry[field] for field in fields]
    data.append(row)

# Convert the list of lists to a df
df = pd.DataFrame(data, columns = fields)

In [10]:
df.head()

Unnamed: 0,title,published,summary,author,link
0,"Ukrainian Soldiers, Nearly Encircled in Bakhmu...","Mon, 06 Mar 2023 08:00:19 +0000",The battle for Bakhmut is not over — at least ...,Carlotta Gall and Daniel Berehulak,https://www.nytimes.com/2023/03/06/world/europ...
1,Ukraine’s Top Generals Want to Keep Fighting f...,"Mon, 06 Mar 2023 19:12:15 +0000",Military commanders told President Volodymyr Z...,The New York Times,https://www.nytimes.com/live/2023/03/06/world/...
2,"The Story of Multicultural Canada, Told in Hum...","Sun, 05 Mar 2023 22:04:46 +0000",Some of Toronto’s best dining options are mom-...,Norimitsu Onishi,https://www.nytimes.com/2023/03/05/world/canad...
3,Historical Disputes Kept Them at Odds. Can Seo...,"Mon, 06 Mar 2023 15:59:37 +0000",Icy relations between the two have long been a...,Choe Sang-Hun,https://www.nytimes.com/2023/03/05/world/asia/...
4,Dalit Journalist Takes On India’s Caste Injust...,"Mon, 06 Mar 2023 16:27:16 +0000",Meena Kotwal started a news outlet focused on ...,Karan Deep Singh,https://www.nytimes.com/2023/03/06/world/asia/...


In [11]:
print("The shape of the dataframe is", df.shape)

The shape of the dataframe is (62, 5)
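
The field lists printed for different feeds are not identical, so indexing an entry with a field it lacks would raise a `KeyError`. Because feed entries behave like dictionaries, a more tolerant variant of the loop above could use `.get()` with a default value. The sketch below uses plain dictionaries as stand-ins for feed entries so that it runs offline; `rows_from_entries`, `sample_entries`, and `wanted_fields` are hypothetical names that are not part of the notebook.

```python
def rows_from_entries(entries, fields, default=None):
    """Build one row per entry, filling fields an entry lacks with `default`."""
    return [[entry.get(field, default) for field in fields] for entry in entries]

# Stand-in entries (made-up data); the second one has no 'author' field
sample_entries = [
    {"title": "First story", "author": "A. Writer", "link": "https://example.com/1"},
    {"title": "Second story", "link": "https://example.com/2"},
]
wanted_fields = ["title", "author", "link"]

rows = rows_from_entries(sample_entries, wanted_fields)
print(rows)
```

Rows built this way can be passed to `pd.DataFrame(rows, columns=wanted_fields)` exactly as before, with missing values showing up as `None`.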


#### Retrieving the article texts for the DataFrame

In [12]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [13]:
from newspaper import Article

links = df["link"]

article_text_dict = {}
for link in links:
  article = Article(link)
  article.download()
  article.parse()
  article.nlp()
  article_text_dict[link] = article.text
  
df['text'] = list(article_text_dict.values())

In [14]:
df.head()

Unnamed: 0,title,published,summary,author,link,text
0,"Ukrainian Soldiers, Nearly Encircled in Bakhmu...","Mon, 06 Mar 2023 08:00:19 +0000",The battle for Bakhmut is not over — at least ...,Carlotta Gall and Daniel Berehulak,https://www.nytimes.com/2023/03/06/world/europ...,"CHASIV YAR, Ukraine — Lined up in the dark in ..."
1,Ukraine’s Top Generals Want to Keep Fighting f...,"Mon, 06 Mar 2023 19:12:15 +0000",Military commanders told President Volodymyr Z...,The New York Times,https://www.nytimes.com/live/2023/03/06/world/...,Ukraine’s top generals want to bolster their f...
2,"The Story of Multicultural Canada, Told in Hum...","Sun, 05 Mar 2023 22:04:46 +0000",Some of Toronto’s best dining options are mom-...,Norimitsu Onishi,https://www.nytimes.com/2023/03/05/world/canad...,"SCARBOROUGH, Ontario — At a tiny strip mall wh..."
3,Historical Disputes Kept Them at Odds. Can Seo...,"Mon, 06 Mar 2023 15:59:37 +0000",Icy relations between the two have long been a...,Choe Sang-Hun,https://www.nytimes.com/2023/03/05/world/asia/...,SEOUL — When it comes to South Korea and Japan...
4,Dalit Journalist Takes On India’s Caste Injust...,"Mon, 06 Mar 2023 16:27:16 +0000",Meena Kotwal started a news outlet focused on ...,Karan Deep Singh,https://www.nytimes.com/2023/03/06/world/asia/...,"After two years on the job, she filed an offic..."


In [15]:
df.to_csv("new_york_times.csv", encoding='utf-8', index=False)

### Toronto Star

#### Function Development
Create helper functions to assist with the scraping process.

In [16]:
def print_RSS_fields(rss_link):
    
    d = feedparser.parse(rss_link)

    all_fields = []
    for field in d.entries[0]:
        all_fields.append(field)
    print(all_fields)

def df_from_RSS(rss_link, fields):
    
    d = feedparser.parse(rss_link)
    
    # Create a list of lists to hold the required RSS data from each entry
    data = []
    for entry in d.entries:
        row = [entry[field] for field in fields]
        data.append(row)

    # Convert the list of lists to a df
    df = pd.DataFrame(data, columns = fields)

    links = df["link"]

    article_text_dict = {}
    for link in links:
        article = Article(link)
        article.download()
        article.parse()
        article.nlp()
        article_text_dict[link] = article.text
    
    df['text'] = list(article_text_dict.values())

    return df

A list of RSS feeds for the Toronto Star can be found here: https://www.thestar.com/about/rssfeeds.html

Let's use the Top Stories RSS feed.

In [17]:
print_RSS_fields('https://www.thestar.com/content/thestar/feed.RSSManagerServlet.articles.topstories.rss')

['title', 'title_detail', 'links', 'link', 'id', 'guidislink', 'authors', 'author', 'author_detail', 'published', 'published_parsed', 'summary', 'summary_detail', 'media_content', 'media_thumbnail', 'href', 'content', 'media_credit', 'credit']


In [18]:
fields = ['title', 'published', 'author', 'link']

df = df_from_RSS('https://www.thestar.com/content/thestar/feed.RSSManagerServlet.articles.topstories.rss', fields)

df.head()

Unnamed: 0,title,published,author,link,text
0,These five cities have seen home prices fall b...,"Mon, 6 Mar 2023 06:00:00 EST",Clarrie Feinstein - Business Reporter,https://www.thestar.com/business/2023/03/06/th...,Some Ontario cities’ home prices have declined...
1,We asked the Ontario government about its cont...,"Mon, 6 Mar 2023 05:00:00 EST","Robert Cribb - Staff Reporter,Declan Keogh and...",https://www.thestar.com/news/investigations/20...,Transportation Minister Caroline Mulroney’s of...
2,Toronto’s deputy mayor renews plea for federal...,"Mon, 6 Mar 2023 10:22:49 EST",David Rider - City Hall Bureau Chief,https://www.thestar.com/news/city_hall/2023/03...,Deputy Mayor Jennifer McKelvie is picking up w...
3,Even royals can face eviction. What to do when...,"Mon, 6 Mar 2023 12:18:00 EST",Aisling Murphy - Staff Reporter,https://www.thestar.com/news/gta/2023/03/02/ev...,"In the latest royal family drama, Prince Harry..."
4,‘Keto-like’ diet may be linked to increased ri...,"Mon, 6 Mar 2023 10:52:49 EST",Joshua Chong - Staff Reporter,https://www.thestar.com/life/2023/03/06/keto-l...,The ketogenic — or keto — diet has been all th...


In [19]:
df.to_csv("toronto_star.csv", encoding='utf-8', index=False)

### Le Devoir

A list of RSS feeds for Le Devoir can be found here: https://www.ledevoir.com/flux-rss

Let's use the World (le Monde) RSS feed.

In [19]:
print_RSS_fields('https://www.ledevoir.com/rss/section/monde.xml?id=76')

['surtitle', 'title', 'title_detail', 'published', 'published_parsed', 'links', 'link', 'id', 'guidislink', 'tags', 'summary', 'summary_detail', 'authors', 'author', 'author_detail']


In [20]:
fields = ['title', 'published', 'author', 'link']

df = df_from_RSS('https://www.ledevoir.com/rss/section/monde.xml?id=76', fields)

df.head()

Unnamed: 0,title,published,author,link,text
0,Des histoires de sécheresse,"Sat, 04 Mar 2023 15:15:23 -0500",aprovost@ledevoir.com (Anne-Marie Provost),https://www.ledevoir.com/monde/afrique/784006/...,"6 Shake Guyo, 72 ans, Malabot | Les éleveurs p..."
1,Dernière ligne droite pour éviter un naufrage ...,"Sat, 04 Mar 2023 15:01:55 -0500",webmestre@ledevoir.com (Amélie Bottollier-Depois),https://www.ledevoir.com/monde/784115/-dernier...,Les États membres de l’ONU tentaient toujours ...
2,Le Canada interpellé alors que le nombre de ré...,"Sat, 04 Mar 2023 12:29:47 -0500",webmestre@ledevoir.com (Dylan Robertson),https://www.ledevoir.com/monde/784107/-le-cana...,Les Nations Unies se préparent à une nouvelle ...
3,"Combats acharnés à Bakhmout, visite du ministr...","Sat, 04 Mar 2023 10:33:26 -0500",webmestre@ledevoir.com (Agence France-Presse),https://www.ledevoir.com/monde/europe/784100/-...,Le ministre russe de la Défense a mené une ins...
4,La solidarité de l’OTAN mise à l’épreuve,"Sat, 04 Mar 2023 04:09:41 -0500",mvastel@ledevoir.com (Marie Vastel),https://www.ledevoir.com/monde/784053/un-an-de...,Un an après que la Russie a lancé son invasion...


In [21]:
df.to_csv("le_devoir.csv", encoding='utf-8-sig', index=False)

### CBC

A list of RSS feeds for the CBC can be found here: https://www.cbc.ca/rss/

Let's use the World RSS feed.

In [22]:
print_RSS_fields('https://rss.cbc.ca/lineup/world.xml')

['title', 'title_detail', 'links', 'link', 'id', 'guidislink', 'published', 'published_parsed', 'authors', 'author', 'author_detail', 'tags', 'summary', 'summary_detail']


In [31]:
#### ROUGH
d = feedparser.parse('https://rss.cbc.ca/lineup/world.xml')

all_fields = []
for field in d.entries[0]:
    all_fields.append(field)

print(all_fields)

fields = ['title', 'published', 'summary', 'author', 'link']

data = []
for i, entry in enumerate(d.entries):
    row = []
    for field in fields:
        row.append(d.entries[i][field])
    data.append(row)

df = pd.DataFrame(data, columns = fields)

links = df["link"]

df.iloc[-1]

['title', 'title_detail', 'links', 'link', 'id', 'guidislink', 'published', 'published_parsed', 'authors', 'author', 'author_detail', 'tags', 'summary', 'summary_detail']


title        Movement out of India that 'disseminates hate'...
published                         Wed, 1 Mar 2023 17:19:59 EST
summary      <img alt="INDIA-RELIGION/" height="259" src="h...
author                                               Lisa Xing
link         https://www.cbc.ca/news/canada/rss-hindutva-in...
Name: 19, dtype: object

In [34]:
#### ROUGH
links = df["link"]

article_text_dict = {}
for link in links:
  article = Article(link)
  article.download()
  article.parse()
  article.nlp()
  article_text_dict[link] = article.text
  
# df['text'] = list(article_text_dict.values())

print(links.iloc[-1], article_text_dict[links.iloc[-1]])

https://www.cbc.ca/news/canada/rss-hindutva-india-report-1.6764114?cmp=rss Canada shouldn't allow a movement out of India that "disseminates hate" and victimizes religious minority groups to entrench itself in this country, according to a report released Wednesday by the National Council of Canadian Muslims and the World Sikh Organization of Canada.

The report, called Rashtriya Swayamsevak Sangh (RSS) Network in Canada, documents the roots of the RSS movement in India and its extensive global reach, promoting far right views in various ways.

"It's one of the most influential organizations in the world," said Steven Zhou, a spokesperson for the National Council of Canadian Muslims.

The council and the World Sikh Organization of Canada are trying to draw attention to what academics, including some in Canada, say they have witnessed for years — an increasing influence and threat from a movement closely linked to the government in New Delhi that they say promotes discrimination against 

In [23]:
fields = ['title', 'published', 'author', 'link']

df = df_from_RSS('https://rss.cbc.ca/lineup/world.xml', fields)

df.head()

ValueError: ignored
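
The traceback for this `ValueError` is not shown, so its cause is uncertain. One possibility (an assumption) is a length mismatch: `df_from_RSS` stores the texts in a dictionary keyed by link, so if a feed lists the same link twice the dictionary ends up with fewer values than the DataFrame has rows, and assigning those values as a column fails. The sketch below shows a variant that keeps exactly one text per row and records `None` when a fetch fails. `collect_texts` and `fake_fetch` are hypothetical names; `fake_fetch` stands in for the `Article` download and parse steps so that the example runs offline.

```python
def collect_texts(links, fetch_text):
    """Return one text per link, in order; None marks a failed fetch."""
    texts = []
    for link in links:
        try:
            texts.append(fetch_text(link))
        except Exception:
            # Keep the row aligned even if this article cannot be retrieved
            texts.append(None)
    return texts

def fake_fetch(link):
    # Stand-in for the Article download/parse steps (assumption, no network used)
    if "broken" in link:
        raise RuntimeError("download failed")
    return "text of " + link

links = ["https://example.com/a", "https://example.com/a", "https://example.com/broken"]

# Dictionary keyed by link: the duplicate link collapses into a single entry
by_link = {link: fake_fetch(link) for link in links if "broken" not in link}
# List aligned with the rows: one value per link, including the failed one
texts = collect_texts(links, fake_fetch)

print(len(links), len(by_link), len(texts))  # prints: 3 1 3
```

In the notebook, `fetch_text` would wrap the newspaper calls, and the result could be assigned with `df['text'] = collect_texts(df['link'], fetch_text)`.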

In [None]:
df.to_csv("cbc.csv", encoding='utf-8', index=False)

## 3. Text Processing

In [20]:
# For the purpose of demonstrating text processing techniques, we will use
# the New York Times dataset created above. Let's start by importing the dataset.

nytimes_df = pd.read_csv("new_york_times.csv", encoding='utf-8')
nytimes_df.head()

Unnamed: 0,title,published,summary,author,link,text
0,"Ukrainian Soldiers, Nearly Encircled in Bakhmu...","Mon, 06 Mar 2023 08:00:19 +0000",The battle for Bakhmut is not over — at least ...,Carlotta Gall and Daniel Berehulak,https://www.nytimes.com/2023/03/06/world/europ...,"CHASIV YAR, Ukraine — Lined up in the dark in ..."
1,Ukraine’s Top Generals Want to Keep Fighting f...,"Mon, 06 Mar 2023 19:12:15 +0000",Military commanders told President Volodymyr Z...,The New York Times,https://www.nytimes.com/live/2023/03/06/world/...,Ukraine’s top generals want to bolster their f...
2,"The Story of Multicultural Canada, Told in Hum...","Sun, 05 Mar 2023 22:04:46 +0000",Some of Toronto’s best dining options are mom-...,Norimitsu Onishi,https://www.nytimes.com/2023/03/05/world/canad...,"SCARBOROUGH, Ontario — At a tiny strip mall wh..."
3,Historical Disputes Kept Them at Odds. Can Seo...,"Mon, 06 Mar 2023 15:59:37 +0000",Icy relations between the two have long been a...,Choe Sang-Hun,https://www.nytimes.com/2023/03/05/world/asia/...,SEOUL — When it comes to South Korea and Japan...
4,Dalit Journalist Takes On India’s Caste Injust...,"Mon, 06 Mar 2023 16:27:16 +0000",Meena Kotwal started a news outlet focused on ...,Karan Deep Singh,https://www.nytimes.com/2023/03/06/world/asia/...,"After two years on the job, she filed an offic..."


In [65]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.util import ngrams
from nltk.probability import ConditionalFreqDist
from collections import defaultdict
import string
import random

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [76]:
# defining all the text processing techniques discussed in the presentation as 
# functions

def number_of_words(text):
  """
  Counts the total number of words in a text

  'text': This argument strictly needs to be of type "string"
  """
  return len(text.split())

def number_of_characters(text):
  """
  Counts the total number of characters in a text
  
  'text': This argument strictly needs to be of type "string"
  """
  return len(text)

def average_word_length(text):
  """
  Returns the average word length of a text

  'text': This argument strictly needs to be of type "string"
  """
  words = text.split()
  avg_word_len = sum(len(word) for word in words) / len(words)
  return int(avg_word_len)

def change_case(text, case_ = 'lower'):
  """
  Returns the transformed text as specified by the case_ argument

  'text': This argument strictly needs to be of type "string"
  'case_': can either be 'lower' or 'upper'
  """
  if case_ == 'lower':
    return text.lower()
  elif case_ == 'upper':
    return text.upper()

def count_stopwords(text):
  """
  Returns the total number of stopwords in a text.
  The stopword vocabulary comes from the nltk module.

  'text': This argument strictly needs to be of type "string"
  """
  stop_words = set(stopwords.words('english'))
  num_of_stopwords = len([word for word in text.split() if word.lower() in stop_words])
  return num_of_stopwords


def tokenize(text, how = 'word'):
  """
  Tokenizes the text based on nltk module's basic tokenization techniques

  'text': This argument strictly needs to be of type "string"
  'how': This argument can be one of ['word', 'sentence', 'whitespace'],
         indicating the tokenization strategy
  """
  if how == 'word':
    return nltk.word_tokenize(text)
  elif how == 'sentence':
    return nltk.sent_tokenize(text)
  elif how == 'whitespace':
    # WhitespaceTokenizer must be instantiated before tokenizing
    return nltk.tokenize.WhitespaceTokenizer().tokenize(text)

def remove_punct(text):
  """
  Removes punctuation marks from a specified text

  'text': This argument strictly needs to be of type "string"
  """
  
  # Tokenize the text into individual words
  tokens = tokenize(text)

  # Remove punctuation marks from each word
  table = str.maketrans('', '', string.punctuation)
  words = [word.translate(table) for word in tokens]

  # Combine the words back into a single string
  cleaned_text = ' '.join(words)

  return cleaned_text

def remove_stopwords(text):
  """
  Performs stopwords removal on the specified text

  'text': This argument strictly needs to be of type "string"
  """

  # Tokenize the text
  tokens = tokenize(text, how = 'word')
    
  # Remove stopwords
  stopwords_list = stopwords.words('english')
  filtered_tokens = [token for token in tokens if token.lower() not in stopwords_list]
    
  # Join the filtered tokens back into a string
  filtered_text = ' '.join(filtered_tokens)
    
  return filtered_text

def stem_text(text):
  """
  Returns the stemmed text based on the "PorterStemmer" stemming strategy of
  the nltk module. This is a widely used stemming algorithm that removes common 
  suffixes from English words. It is implemented in nltk as the PorterStemmer 
  class.

  'text': This argument strictly needs to be of type "string"
  """
  stemmer = PorterStemmer()
  tokens = text.split()
  stemmed_tokens = [stemmer.stem(token) for token in tokens]
  stemmed_text = ' '.join(stemmed_tokens)
  return stemmed_text

def lemmatize_text(text):
  """
  Lemmatizes text using the WordNetLemmatizer class from the nltk.stem module.

  'text': This argument strictly needs to be of type "string"
  """
  tokens = tokenize(text, how = 'word')
  lemmatizer = WordNetLemmatizer()

  lemmatized_tokens = []
  for word in tokens:
    lemmatized_tokens.append(lemmatizer.lemmatize(word))
  
  # Join with spaces so the words stay separated
  lemmatized_text = ' '.join(lemmatized_tokens)
  
  return lemmatized_text

def normalize_text(text):
  """
  This function acts as a pipeline for the functions defined above, for the
  purpose of normalizing the text

  'text': This argument strictly needs to be of type "string"
  """
  normalized_text = change_case(text, case_ = 'lower')
  normalized_text = remove_punct(normalized_text)
  # normalized_text = remove_stopwords(normalized_text)
  normalized_text = lemmatize_text(normalized_text)

  return normalized_text

In [77]:
sample_text = """ The quick brown fox jumps over the lazy dog. The dog is not actually lazy, 
                  but rather quite active. He loves to run and play all day long, chasing after squirrels 
                  and rabbits in the park. Unfortunately, he sometimes gets into fights with other dogs, 
                  which can be quite dangerous. As his owner, it's my job to make sure he stays safe and healthy, 
                  both physically and emotionally.
              """

print('# of Words:', number_of_words(sample_text), '\n')
print('# of Characters:', number_of_characters(sample_text), '\n')
print('Avg. word length:', average_word_length(sample_text), '\n')

# of Words: 68 

# of Characters: 470 

Avg. word length: 4 



In [78]:
print('Normalized text:', normalize_text(sample_text), '\n')

Normalized text: the quick brown fox jump over the lazy dog the dog is not actually lazy but rather quite active he love to run and play all day long chasing after squirrel and rabbit in the park unfortunately he sometimes get into fight with other dog which can be quite dangerous a his owner it s my job to make sure he stay safe and healthy both physically and emotionally 



### Predicting the next word in a sentence using the N-grams model

In [79]:
def n_gram_model(text):
  trigrams = list(nltk.ngrams(list(text.split()), 3, pad_left=True, pad_right=True, left_pad_symbol='<s>', right_pad_symbol='</s>'))

  # make conditional frequencies dictionary  
  cfdist = ConditionalFreqDist()
  for w1, w2, w3 in trigrams:
    cfdist[(w1, w2)][w3] += 1

  # transform frequencies to probabilities
  for w1_w2 in cfdist:
    total_count = float(sum(cfdist[w1_w2].values()))
    for w3 in cfdist[w1_w2]:
      cfdist[w1_w2][w3] /= total_count

  return cfdist

def predict(model, user_input):
  user_input = normalize_text(user_input)
  user_input = user_input.split()

  w1 = len(user_input) - 2
  w2 = len(user_input)
  prev_words = user_input[w1:w2]

  # display prediction from highest to lowest maximum likelihood
  prediction = sorted(dict(model[prev_words[0], prev_words[1]]), key=lambda x: dict(model[prev_words[0], prev_words[1]])[x], reverse=True)
  print("Trigram model predictions: ", prediction)

  word = []
  weight = []
  for key, prob in dict(model[prev_words[0], prev_words[1]]).items():
    word.append(key)
    weight.append(prob)

  # pick from a weighted random probability of predictions
  next_word = random.choices(word, weights=weight, k=1)
  
  # add predicted word to user input
  user_input.append(next_word[0])
  print(' '.join(user_input))

  ask = input("Do you want to generate another word? (type 'y' for yes or 'n' for no): ")
  if ask.lower() == 'y':
    # Pass the extended sentence back as a plain string
    predict(model, ' '.join(user_input))
  elif ask.lower() == 'n':
    print("done")

In [80]:
normalized_text = normalize_text(sample_text)
model = n_gram_model(normalized_text)
predict(model, "quick brown fox")

Trigram model predictions:  ['jump']
quick brown fox jump
Do you want to generate another word? (type 'y' for yes or 'n' for no): y
Trigram model predictions:  ['over']
quick brown fox jump over
Do you want to generate another word? (type 'y' for yes or 'n' for no): y
Trigram model predictions:  ['the']
quick brown fox jump over the
Do you want to generate another word? (type 'y' for yes or 'n' for no): n
done
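
The trigram model above maps each pair of consecutive words to a probability distribution over the word that follows. The same idea can be sketched with only the standard library, which may make the counting and normalizing steps easier to follow. `build_trigram_model` and `toy_text` are hypothetical names, and this sketch leaves out the padding symbols used in `n_gram_model`.

```python
from collections import Counter, defaultdict

def build_trigram_model(text):
    """Map each (w1, w2) pair to {w3: P(w3 | w1, w2)} using relative counts."""
    words = text.split()
    counts = defaultdict(Counter)
    # Slide a window of three words over the text and count each continuation
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        counts[(w1, w2)][w3] += 1

    model = {}
    for pair, followers in counts.items():
        total = sum(followers.values())
        model[pair] = {word: n / total for word, n in followers.items()}
    return model

toy_text = "the dog runs and the dog sleeps and the dog runs"
toy_model = build_trigram_model(toy_text)
print(toy_model[("the", "dog")])
```

Here "the dog" is followed by "runs" twice and "sleeps" once, so the model assigns them probabilities of 2/3 and 1/3.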


In [81]:
# using the New York Times data
text = nytimes_df.loc[0, 'text']
print(text, '\n')

CHASIV YAR, Ukraine — Lined up in the dark in civilian vehicles, lights dimmed, a company of soldiers waited silently at the side of a road. Farther behind, a second company was parked, an occasional light inside a car revealing the face of a soldier. Still farther back, a third company was moving into place.

After months of epic struggle, the fight over the Ukrainian city of Bakhmut had seemed in recent days to be reaching a climax, with Russian forces close to encircling the city and some Ukrainian units pulling out.

Then, early Saturday, Ukrainian assault brigades went on the attack. Over the weekend, hundreds of troops joined the counteroffensive, mounting assaults from the ground and pounding Russian positions with artillery from the surrounding hills.

Ukrainian commanders acknowledged that their forces in Bakhmut still faced the risk of encirclement, but the fighting over the weekend showed that a military that has surprised the world with its doggedness was not yet ready to g

In [83]:
normalized_text = normalize_text(text)
model = n_gram_model(normalized_text)
predict(model, "with Russian forces")

Trigram model predictions:  ['close']
with russian force close
Do you want to generate another word? (type 'y' for yes or 'n' for no): y
Trigram model predictions:  ['to']
with russian force close to
Do you want to generate another word? (type 'y' for yes or 'n' for no): y
Trigram model predictions:  ['encircling']
with russian force close to encircling
Do you want to generate another word? (type 'y' for yes or 'n' for no): y
Trigram model predictions:  ['the']
with russian force close to encircling the
Do you want to generate another word? (type 'y' for yes or 'n' for no): n
done


### "Bag of Words" Document Representation using Term Frequency (TF)

Bag of Words is a document representation technique used in natural language processing to convert a text document into a numerical vector that can be used for various machine learning tasks. In this technique, the text of a document is first preprocessed by tokenizing the words and removing any stop words and punctuation marks. The resulting set of words is then used to create a vocabulary, where each word is represented by a unique index.

The Bag of Words model can then be represented using two methods: Term Frequency (TF) and Term Frequency-Inverse Document Frequency (TF-IDF).

In the TF method, each document is represented by a vector where each entry corresponds to the frequency of the corresponding word in the document. For example, consider the following two sentences:

The quick brown fox jumps over the lazy dog.

The lazy dog sleeps all day.

The corresponding Bag of Words representation using the TF method would be:

```
            quick brown fox  jumps over the   lazy  dog   sleeps all   day
Sentence 1: 0.11  0.11  0.11 0.11  0.11 0.22  0.11  0.11  0      0     0  
Sentence 2: 0     0     0    0     0    0.167 0.167 0.167 0.167  0.167 0.167  
```
The entry in each vector corresponds to the relative frequency of the corresponding word in the document. For example, the first sentence has 0.11 in the "quick" column because the word "quick" appears once among the sentence's nine words (1/9 ≈ 0.11).
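
The values in the table above can be reproduced with a few lines of standard-library Python. This is only a sketch of the term frequency calculation for the two example sentences; `term_frequencies` is a hypothetical helper, and the tokenization (lowercasing and dropping periods) is a simplifying assumption.

```python
from collections import Counter

def term_frequencies(sentence):
    """Return {word: count / total words} for one sentence."""
    # Simplified tokenization: lowercase, drop periods, split on whitespace
    words = sentence.lower().replace(".", "").split()
    counts = Counter(words)
    return {word: n / len(words) for word, n in counts.items()}

tf_1 = term_frequencies("The quick brown fox jumps over the lazy dog.")
tf_2 = term_frequencies("The lazy dog sleeps all day.")

print(round(tf_1["the"], 2), round(tf_2["dog"], 3))  # prints: 0.22 0.167
```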

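scikit-learn's `TfidfVectorizer` class implements the 'Bag of Words' representation explained above; the only modification needed is setting the boolean argument `use_idf` to **False**, as shown below. Note that by default the vectorizer also L2-normalizes each document vector (`norm='l2'`), so the values it produces differ from the simple proportions in the table above.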

In [96]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a list of all articles
texts = [article for article in nytimes_df['text'].to_list()]

# Create an instance of the TfidfVectorizer class
vectorizer = TfidfVectorizer(use_idf = False)

# Fit the vectorizer on the documents
vectorizer.fit(texts)

# Transform the articles into a term frequency matrix
tf_matrix = vectorizer.transform(texts)
tf_matrix = tf_matrix.toarray()

print(tf_matrix)

[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.04649626 0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.04822428 0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.05735393 0.         0.        ]]


In [100]:
tf_matrix.shape

(62, 3674)

This matrix is the **term frequency 'bag of words' representation** of the New York Times articles. Its shape is (62, 3674): there are 62 articles, and the vocabulary of the corpus (the collection of article texts) contains 3674 unique words. Let's add a column `bow_tf` to the New York Times dataframe that holds the term frequency bag of words representation of each article.

In [97]:
nytimes_df['bow_tf'] = pd.Series(list(tf_matrix))

### "Bag of Words" Document Representation using Term Frequency and Inverse Document Frequency (TF-IDF)

In the TF-IDF method, each entry in the vector is the product of the term frequency and inverse document frequency of the corresponding word. This method gives more weight to words that are rare in the corpus and less weight to words that are common. The formula for calculating the TF-IDF weight of a word is:
```
TF-IDF(w, d) = TF(w, d) * IDF(w)
```
where TF(w, d) is the term frequency of word w in document d and IDF(w) is the inverse document frequency of word w across all documents in the corpus.

Continuing with the same example, 

The quick brown fox jumps over the lazy dog.

The lazy dog sleeps all day.

the corresponding Bag of Words representation using the TF-IDF method would be:
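```
            quick brown fox   jumps over  the lazy dog sleeps all   day
Sentence 1: 0.077 0.077 0.077 0.077 0.077 0   0    0   0      0     0
Sentence 2: 0     0     0     0     0     0   0    0   0.116  0.116 0.116
```
The entry in each vector corresponds to the TF-IDF weight of the corresponding word in the document. These values use IDF(w) = ln(N / df(w)), where N = 2 is the number of sentences and df(w) is the number of sentences that contain w. For example, the first sentence has a TF-IDF weight of 0 in the "dog" column because "dog" appears in both sentences, so its IDF is ln(2/2) = 0. By contrast, "quick" appears only in the first sentence, so its weight there is 0.11 × ln(2) ≈ 0.077.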
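
The text does not say which IDF variant it has in mind, and several are in common use. As a sketch, assume the unsmoothed form IDF(w) = ln(N / df(w)), where N is the number of documents and df(w) is the number of documents containing w. Under that assumption, words that occur in every document get a weight of zero, while words unique to one document get the largest weights. `tf_idf` is a hypothetical helper written for this illustration.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: TF * IDF} dict per document, with IDF(w) = ln(N / df(w))."""
    # Simplified tokenization: lowercase, drop periods, split on whitespace
    tokenized = [doc.lower().replace(".", "").split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word
    doc_freq = Counter(word for words in tokenized for word in set(words))

    weights = []
    for words in tokenized:
        counts = Counter(words)
        weights.append({
            word: (n / len(words)) * math.log(n_docs / doc_freq[word])
            for word, n in counts.items()
        })
    return weights

docs = ["The quick brown fox jumps over the lazy dog.", "The lazy dog sleeps all day."]
w1, w2 = tf_idf(docs)
print(round(w1["quick"], 3), w1["dog"], round(w2["sleeps"], 3))  # prints: 0.077 0.0 0.116
```

Library implementations often smooth the IDF term. scikit-learn's default, for instance, computes ln((1 + N) / (1 + df(w))) + 1 and then normalizes each vector, so its numbers will not match this sketch.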

In [98]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a list of all articles
texts = [article for article in nytimes_df['text'].to_list()]

# Create an instance of the TfidfVectorizer class
vectorizer = TfidfVectorizer(use_idf = True)

# Fit the vectorizer on the documents
vectorizer.fit(texts)

# Transform the articles into a TF-IDF matrix
tfidf_matrix = vectorizer.transform(texts)
tfidf_matrix = tfidf_matrix.toarray()

print(tfidf_matrix)

[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.09319666 0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.09181915 0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.08804082 0.         0.        ]]


The column `bow_tfidf` is added to the New York Times dataframe to hold the **term frequency - inverse document frequency 'bag of words' representation** of the corresponding articles.

In [99]:
nytimes_df['bow_tfidf'] = pd.Series(list(tfidf_matrix))

In summary, the Bag of Words model with the TF and TF-IDF methods represents a document as a numerical vector where each entry corresponds to the frequency or importance of the corresponding word in the document. These representations can then be used in various machine learning tasks such as classification, clustering, and information retrieval.

### Word Embeddings

Word embeddings are a type of vector representation used in natural language processing to capture the meaning of words in a way that can be easily processed by machine learning algorithms. Unlike Bag of Words representations, word embeddings are able to capture the semantic relationships between words and can represent the meaning of a word based on its context.

Word embeddings can be created using a technique called word2vec, which is a neural network-based approach. The basic idea behind word2vec is to train a neural network on words and their contexts: either predicting a word from its surrounding words (the CBOW variant) or predicting the surrounding words from the word itself (the skip-gram variant). The context of a word refers to the words that appear in its vicinity, such as the words that come before and after it in a sentence or paragraph.

The word2vec model consists of a shallow neural network with one input layer, one hidden layer, and one output layer. The input layer corresponds to the words in the vocabulary, and each word is represented by an embedding vector whose dimension can be a user-defined parameter. The output layer consists of a set of softmax nodes, each representing a word in the vocabulary. The objective of the network is to predict the probability distribution of the words in the vocabulary given the input word.

During training, the network is fed with pairs of words and their context words, and the objective is to minimize the cross-entropy loss between the predicted probability distribution and the actual context words. The weights connecting the input layer to the hidden layer are then used as the word embeddings for the corresponding words in the vocabulary.
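
To make the notion of training pairs concrete, here is a sketch of how (word, context word) pairs might be generated for the skip-gram formulation. `skipgram_pairs` and its `window` parameter are hypothetical names chosen for this illustration; real implementations add refinements such as subsampling of frequent words and negative sampling, which are not shown.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the quick brown fox".split()
for pair in skipgram_pairs(tokens, window=1):
    print(pair)
```

With a window of 1, each word is paired only with its immediate neighbours, so "quick" yields the pairs ("quick", "the") and ("quick", "brown").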

The resulting word embeddings are typically dense vectors of fixed dimensionality, where each element represents a latent feature or aspect of the corresponding word. These vectors can be used as input to machine learning algorithms for NLP tasks such as sentiment analysis, named entity recognition, and machine translation.

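For the purpose of this presentation, instead of writing the whole neural network model for implementing word2vec, we use the `W2VTransformer` class from the `gensim.sklearn_api` module, as shown below. Note that this module is part of gensim 3.x and was removed in gensim 4.0, so the cell requires an older gensim release.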

In [None]:
from gensim.sklearn_api import W2VTransformer

# creating a model to represent each word by a 10 dimensional embedding vector
word2vec = W2VTransformer(size=10, min_count=1, seed=1)

# tokenizing each article text into a list of words, the input format the model expects
corpus = []
for text in nytimes_df['text'].to_list():
  corpus.append(list(text.split()))

word2vec.fit(corpus)

In [112]:
## word embedding for the word "country"
country_vec = word2vec.transform("country")
print(country_vec, '\n', country_vec.shape)

[[-0.02184694  0.03425032 -0.03293595 -0.08557533  0.00280237  0.0948814
   0.06911288  0.0607648   0.0174236   0.02138556]] 
 (1, 10)


Since we initialized the model with `size=10`, the embedding (word vector) for every word has dimension 10.

In [114]:
## word embedding for the word "company"
company_vec = word2vec.transform("company")
print(company_vec, '\n', company_vec.shape)

[[-0.0598026   0.02213237  0.00524647 -0.07460058 -0.01489155  0.06536892
   0.07121351  0.01691799  0.00978078  0.02069317]] 
 (1, 10)
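
One common way to compare two embeddings is cosine similarity, which measures the angle between the vectors regardless of their length. The sketch below applies it to the two vectors printed above, using only the standard library. `cosine_similarity` is a hypothetical helper, and with a corpus this small and only 10 dimensions the resulting number should not be read as a reliable measure of semantic closeness.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Vectors copied from the outputs above
country = [-0.02184694, 0.03425032, -0.03293595, -0.08557533, 0.00280237,
           0.0948814, 0.06911288, 0.0607648, 0.0174236, 0.02138556]
company = [-0.0598026, 0.02213237, 0.00524647, -0.07460058, -0.01489155,
           0.06536892, 0.07121351, 0.01691799, 0.00978078, 0.02069317]

print(cosine_similarity(country, company))
```

Values close to 1 indicate vectors pointing in nearly the same direction, values near 0 indicate unrelated directions, and negative values indicate opposing directions.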
