# Keywords Extraction Using TF-IDF
<b><div style="text-align: right">[TOTAL POINTS: 14]</div></b>

In the reading material, you learned about the complete pipeline for an NLP task, including raw text extraction, text preprocessing, and applying machine learning models. In this assignment, you will apply that knowledge to extract keywords from text articles: you will scrape a text document, preprocess it, and extract keywords from it. Note that you won't be fitting any machine learning model in this assignment. Instead, you will use TF-IDF vectorization to extract the keywords.


## Imports

In [None]:
!pip install -q nltk
!pip install -q scikit-learn

In [None]:
import pandas as pd
import numpy as np
import re
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


## Dataset Description

You will first fit the TF-IDF vectorizer on the [Medium Articles](https://www.kaggle.com/hsankesara/medium-articles) dataset. This dataset contains a collection of Medium articles on ML, AI, and data science, along with details such as the author, number of claps, and reading time.

*Source:* https://www.kaggle.com/hsankesara/medium-articles

*Author:* Heet Sankesara

*License:* https://creativecommons.org/publicdomain/zero/1.0/



In [None]:
dataset = pd.read_csv('articles.csv')
dataset.head()

Unnamed: 0,author,claps,reading_time,link,title,text
0,Justin Lee,8.3K,11,https://medium.com/swlh/chatbots-were-the-next...,Chatbots were the next big thing: what happene...,"Oh, how the headlines blared:\nChatbots were T..."
1,Conor Dewey,1.4K,7,https://towardsdatascience.com/python-for-data...,Python for Data Science: 8 Concepts You May Ha...,If you’ve ever found yourself looking up the s...
2,William Koehrsen,2.8K,11,https://towardsdatascience.com/automated-featu...,Automated Feature Engineering in Python – Towa...,Machine learning is increasingly moving from h...
3,Gant Laborde,1.3K,7,https://medium.freecodecamp.org/machine-learni...,Machine Learning: how to go from Zero to Hero ...,If your understanding of A.I. and Machine Lear...
4,Emmanuel Ameisen,935,11,https://blog.insightdatascience.com/reinforcem...,Reinforcement Learning from scratch – Insight ...,Want to learn about applied Artificial Intelli...


In [None]:
print("No. of instances: {}".format(dataset.shape[0]))

No. of instances: 337


For this assignment, you will use only the `text` column from the dataset. You won't have to preprocess the texts yourself, as you have already done it a couple of times. So let's quickly load the preprocessed articles from the file `preprocessed_articles.txt`.

In [None]:
# define an empty list to contain the preprocessed articles
preprocessed_articles = []

# open the file and read it line by line (one article per line)
with open('preprocessed_articles.txt', 'r') as filehandle:
    for line in filehandle:
        # strip the trailing line break, if there is one
        article_text = line.rstrip('\n')

        # add the article to the list
        preprocessed_articles.append(article_text)

In [None]:
for line in preprocessed_articles[:10]:
    print(line)

oh headlin blare chatbot next big thing hope sky high bright eye bushi tail industri wa ripe new era innov wa time start social machin whi road sign point toward insan success mobil world congress chatbot main headlin confer organ cite overwhelm accept event inevit shift focu brand corpor chatbot fact onli signific question around chatbot wa would monopol field whether chatbot would take first place one year answer question becaus even ecosystem platform domin chatbot first technolog develop talk grandios term slump spectacularli age old hype cycl unfold familiar fashion expect built built kind fizzl predict paradim shift materi app tellingli still aliv well look back breathless optim turn slightli baffl wa chatbot revolut promis digit ethan bloch sum gener consensu accord dave feldman vice presid product design heap chatbot take one difficult problem fail took sever fail bot interfac user differ way big divid text vs speech begin comput interfac wa written word user type command manua

## Exercise 1: Fitting CountVectorizer
<b><div style="text-align: right">[POINTS: 2]</div></b>

In this assignment, you will implement the TF-IDF vectorizer using the `TfidfTransformer`. For this, you'll first need to get the count matrix using the `CountVectorizer`.

**Tasks:**

1. Import the `CountVectorizer` from sklearn.
2. Instantiate an object of the `CountVectorizer` class as `count_vectorizer`.
3. Fit the `count_vectorizer` on the `preprocessed_articles` and transform it into the feature matrix `X_count`.


In [None]:
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = None
X_count = None

### Ex-1-Task-1
### BEGIN SOLUTION
# YOUR CODE HERE
count_vectorizer = CountVectorizer()
X_count = count_vectorizer.fit_transform(preprocessed_articles)
### END SOLUTION

if not isinstance(X_count, np.ndarray):
    X_count = X_count.toarray()

In [None]:
### INTENTIONALLY LEFT BLANK

Let's have a look at 10 words from our vocabulary.

In [None]:
print(list(count_vectorizer.vocabulary_.keys())[:10])

['oh', 'headlin', 'blare', 'chatbot', 'next', 'big', 'thing', 'hope', 'sky', 'high']


## Exercise 2: Fitting TfidfTransformer
<b><div style="text-align: right">[POINTS: 3]</div></b>


So far, you have implemented TF-IDF either with the `TfidfVectorizer` or from scratch. Here you will learn another way to implement it, using the [`TfidfTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html). `TfidfTransformer` takes a count matrix containing the word counts of the documents, such as the one produced by the `CountVectorizer`, and computes the TF-IDF values from it.
Using `TfidfTransformer` is similar to using other methods from sklearn. First, you instantiate the `TfidfTransformer` class. Then you use the `fit` method to compute the IDF values for the words in the vocabulary. Finally, you use the `transform` method to compute the TF-IDF value and transform the count matrix into a feature matrix containing TF-IDF values.
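
The fit and transform steps can be sketched on a tiny, made-up count matrix. The matrix `toy_counts` is an assumption for illustration. The manual formula in the sketch is the smoothed IDF that scikit-learn applies with its default settings (`smooth_idf=True`), and the default `norm='l2'` scales each output row to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical count matrix: 3 documents (rows) x 3 terms (columns)
toy_counts = np.array([
    [2, 1, 0],
    [1, 0, 1],
    [3, 0, 0],
])

toy_transformer = TfidfTransformer()  # defaults: smooth_idf=True, norm='l2'
toy_transformer.fit(toy_counts)       # learns one IDF value per column

# With smoothing, idf(t) = ln((1 + n) / (1 + df(t))) + 1
n_docs = toy_counts.shape[0]
doc_freq = (toy_counts > 0).sum(axis=0)
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
print(toy_transformer.idf_)
print(manual_idf)

# transform() multiplies counts by IDF, then L2-normalizes each row
toy_tfidf = toy_transformer.transform(toy_counts).toarray()
print(toy_tfidf.round(3))
```

The first column appears in every document, so it receives the lowest IDF value; the rarer terms in the other columns receive higher ones.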

However, in this assignment, you won't have to create a TF-IDF feature matrix from the documents, as you won't be applying any machine learning model. Instead, you will just compute the IDF values for the words in the vocabulary and use them to compute the TF-IDF values for a new test document.

**Tasks:**

1. Import the `TfidfTransformer` from sklearn.
2. Instantiate an object of the `TfidfTransformer` class as `tfidf_transformer`.
3. Fit the `tfidf_transformer` on the feature matrix `X_count`.

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = None

### Ex-2-Task-1
### BEGIN SOLUTION
# YOUR CODE HERE
tfidf_transformer = TfidfTransformer()
tfidf_transformer.fit(X_count)
### END SOLUTION

In [None]:
### INTENTIONALLY LEFT BLANK

## Exercise 3: Scraping a webpage containing a blog post
<b><div style="text-align: right">[POINTS: 1]</div></b>

Now that you have learned the vocabulary and its IDF scores from the Medium articles dataset, you can use them to compute the TF-IDF scores of new text documents. Let's start by scraping a webpage containing a blog post. You will scrape the page [AI to the Rescue: COVID-19 Recovery](https://insights.fusemachines.com/ai-to-the-rescue-the-covid-19-recovery/) from [Insights- Fusemachines](https://insights.fusemachines.com/).

**Tasks:**

1. Import the `requests` module.
2. Send a GET request to the address stored in the `URL` variable and store the result in the variable `response`.


In [None]:
URL = 'https://insights.fusemachines.com/ai-to-the-rescue-the-covid-19-recovery/'
response = None

### Ex-3-Task-1
### BEGIN SOLUTION
# YOUR CODE HERE
import requests
response = requests.get(URL)
# raise NotImplementedError()
### END SOLUTION

print(response.status_code)

200


In [None]:
### INTENTIONALLY LEFT BLANK

## Exercise 4: Extracting the blog post
<b><div style="text-align: right">[POINTS: 3]</div></b>

The entire HTML page is now stored in the `response` variable. Next, you will use `Beautiful Soup` to extract the content from the page. The body text of the blog post sits inside \<p> tags.

**Tasks:**

1. Import `BeautifulSoup` class from `bs4` module.

2. Create an object, `soup` of the `BeautifulSoup` class, and initialize it with two parameters, the content of our response and the `html.parser`.

3. Find all the \<p> tags using the `find_all` method of the object `soup` and store it in the variable `paras`.

4. Extract the text from all the \<p> tags using the `get_text` method and store them in the list `texts`. $[$Hint: Use a for loop to access each element of the list `paras` and use the `get_text` method of each element to extract the text content of the element$]$
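
To see how `find_all` and `get_text` behave before running them on the real page, here is a small sketch on a hand-written HTML string. The snippet `toy_html` and the `toy_` variable names are made up for illustration and stand in for the downloaded page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet standing in for a downloaded page
toy_html = """
<html><body>
  <h1>Title</h1>
  <p>First <b>paragraph</b>.</p>
  <div><p>Second paragraph.</p></div>
</body></html>
"""

toy_soup = BeautifulSoup(toy_html, 'html.parser')

# find_all returns every <p> element, however deeply it is nested
toy_paras = toy_soup.find_all('p')

# get_text drops nested tags such as <b> and keeps only the text
toy_texts = [para.get_text() for para in toy_paras]
print(toy_texts)
```

The heading is skipped because it is not a \<p> element, and the `<b>` markup inside the first paragraph is removed while its text is kept.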

In [None]:
soup = None
paras = None
texts = []

### Ex-4-Task-1
### BEGIN SOLUTION
# YOUR CODE HERE
from bs4 import BeautifulSoup

# Create a BeautifulSoup object with the HTML content from the response
soup = BeautifulSoup(response.content, 'html.parser')
paras = soup.find_all('p')

texts = []
for para in paras:
    text = para.get_text()
    texts.append(text)
# raise NotImplementedError()
### END SOLUTION

article = " ".join(texts)
print(article)

Even before the COVID-19 crisis struck and plunged the world into recession, people and businesses were adopting AI to take on all varieties of tasks. With most of the global workforce working from home, AI has become all the more important and will continue to affect and shape the way the world recovers. During past economic recessions, 14% of companies were able to increase sales by adopting digital solutions.  Consumer behavior  It’s no surprise consumers are turning online, a behavior that began long before the pandemic. Consumers are operating nearly entirely online, indeed, they have no other choice. Under lockdown, today’s consumers are choosing reduced contact means of getting goods and services. This is especially true for younger and higher-income consumers. In today’s digitized world, one can order food, take classes, and work all without leaving home. This trend is unlikely to disappear once the crisis is over.  Automate repetitive tasks so people can focus on creative task

In [None]:
### INTENTIONALLY LEFT BLANK

## Exercise 5: Preprocessing the blog post
<b><div style="text-align: right">[POINTS: 2]</div></b>

Now that you have the complete blog post stored in the variable `article`, you need to apply some basic preprocessing to it.

**Tasks:**

1. Remove punctuation marks, symbols, and numbers from `article`. Store the cleaned article in the variable `cleaned_article`.

2. Tokenize `cleaned_article` into its constituent words and store them in the variable `tokens`.

3. Lowercase each token in `tokens`.

4. Remove every token in `tokens` that is a stopword. Note: The English stopwords are stored in the list `eng_stopwords`.

5. Create an object, `stemmer` of the class `PorterStemmer`, and use it to stem each token in `tokens`.
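
The order of these steps matters: the stopword list is lowercase, so tokens must be lowercased before they are compared against it. Below is a rough sketch of the same pipeline on a single made-up sentence. To keep it free of downloads, it assumes a tiny hand-written stopword set (`toy_stopwords`) and splits on whitespace instead of calling `word_tokenize`; both are simplifications for illustration only.

```python
import re
from nltk.stem import PorterStemmer

toy_text = "The 3 chatbots were running online!"
toy_stopwords = {"the", "were"}  # hypothetical tiny list, for illustration only

# 1. Replace every run of non-letters (digits, punctuation) with a space
toy_clean = re.sub('[^A-Za-z]+', ' ', toy_text)

# 2-3. Split into tokens (simplified: whitespace split) and lowercase them
toy_tokens = [token.lower() for token in toy_clean.split()]

# 4. Drop stopwords; this only works because the tokens are already lowercase
toy_tokens = [token for token in toy_tokens if token not in toy_stopwords]

# 5. Reduce each remaining token to its stem
toy_stemmer = PorterStemmer()
toy_stems = [toy_stemmer.stem(token) for token in toy_tokens]
print(toy_stems)
```

If the stopword filter ran before lowercasing, the capitalized "The" would not match the lowercase entry and would slip through.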

In [None]:
eng_stopwords = stopwords.words('english')
cleaned_article = None
tokens = None

### Ex-5-Task-1
### BEGIN SOLUTION
# YOUR CODE HERE

cleaned_article = re.sub('[^A-Za-z]+',' ',article)
tokens = word_tokenize(cleaned_article)
tokens = [token.lower() for token in tokens]
tokens = [token for token in tokens if token not in eng_stopwords]
stemmer = PorterStemmer()
tokens = [stemmer.stem(token) for token in tokens]
# raise NotImplementedError()
### END SOLUTION

# join the stemmed tokens back into a single string
preprocessed_article = " ".join(tokens)
print(preprocessed_article)

even covid19 crisi struck plung world recess peopl busi adopt ai take varieti task global workforc work home ai becom import continu affect shape way world recov past econom recess 14 compani abl increas sale adopt digit solut consum behavior surpris consum turn onlin behavior began long pandem consum oper nearli entir onlin inde choic lockdown today consum choos reduc contact mean get good servic especi true younger higherincom consum today digit world one order food take class work without leav home trend unlik disappear crisi autom repetit task peopl focu creativ task busi take advantag uniqu time capit onlin market consum willing partak set tone busi function postcovid19 market leverag ai alway provid competit advantag power autom repetit time consum task previous done human ai human better spend time demand dynam task channel creativ need boost suppli chain effici reduc need human interact exactli benefit adopt ai help busi recov post covid19 data help process scale al make possib

In [None]:
### INTENTIONALLY LEFT BLANK

## Exercise 6: Calculating TF-IDF Scores
<b><div style="text-align: right">[POINTS: 2]</div></b>

Now that you have preprocessed the article, let's compute its TF-IDF representation. You will use the vocabulary and the IDF values computed previously on the Medium articles dataset, so you will not fit the `count_vectorizer` and `tfidf_transformer` objects on the new article. Instead, you will use them to transform the new article into a TF-IDF vector.

**Tasks:**

1. Transform the `preprocessed_article` into a count vector using the `transform` method of the `count_vectorizer` object you created earlier.

2. Transform the count vector created by the `count_vectorizer` into a TF-IDF vector using the `transform` method of the `tfidf_transformer` object you created earlier. Store the result in `article_tfidf`.





In [None]:
article_tfidf = None

### Ex-6-Task-1
### BEGIN SOLUTION
# YOUR CODE HERE
count_vector = count_vectorizer.transform([preprocessed_article])
article_tfidf = tfidf_transformer.transform(count_vector)
# raise NotImplementedError()
### END SOLUTION

if not isinstance(article_tfidf, np.ndarray):
    article_tfidf = article_tfidf.toarray()

article_tfidf = pd.DataFrame(article_tfidf, columns=count_vectorizer.get_feature_names_out())
article_tfidf

Unnamed: 0,aaand,aaauu,aae,aah,aarhu,aaron,ab,aback,abaixo,abandon,...,zone,zoo,zoologist,zoom,zorn,zoubin,ztm,zuckerberg,zumba,zurich
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
### INTENTIONALLY LEFT BLANK

## Exercise 7: Extracting the Keywords
<b><div style="text-align: right">[POINTS: 1]</div></b>

You have the TF-IDF representation of the article. You can now pick the top $n$ words with the highest TF-IDF values as the keywords.

**Tasks:**

1. Sort the columns of the dataframe `article_tfidf` in the descending order of their values. Store the sorted dataframe as `article_tfidf_sorted`.

2. Pick the first $20$ column names from `article_tfidf_sorted` and store them in the variable `keywords`.
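
Sorting the columns of a one-row dataframe is less common than sorting rows, so here is a small sketch of the idea on made-up scores. The dataframe `toy_scores` and its values are hypothetical; the second approach, based on `Series.nlargest`, is an alternative shown for comparison and is not required by the exercise.

```python
import pandas as pd

# Hypothetical one-row TF-IDF dataframe: one column per term
toy_scores = pd.DataFrame([[0.1, 0.7, 0.0, 0.4]],
                          columns=['busi', 'ai', 'zoo', 'help'])

# axis=1 reorders the columns; by=0 says "use the values in the row labeled 0"
toy_sorted = toy_scores.sort_values(by=0, axis=1, ascending=False)
toy_keywords = list(toy_sorted.columns[:2])
print(toy_keywords)

# Alternative: treat the single row as a Series and take its largest entries
toy_keywords_alt = list(toy_scores.iloc[0].nlargest(2).index)
print(toy_keywords_alt)
```

Both approaches return the terms with the two highest scores, in descending order of score.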

In [None]:
article_tfidf_sorted = None

### Ex-7-Task-1
### BEGIN SOLUTION
# YOUR CODE HERE
article_tfidf_sorted = article_tfidf.sort_values(by=0, axis=1, ascending=False)
keywords = article_tfidf_sorted.columns[:20]
# raise NotImplementedError()
### END SOLUTION

keywords = list(keywords)
keywords

['ai',
 'custom',
 'help',
 'compani',
 'busi',
 'leverag',
 'consum',
 'food',
 'remot',
 'boost',
 'adopt',
 'unpreced',
 'recov',
 'healthcar',
 'recess',
 'crisi',
 'manufactur',
 'onlin',
 'loyalti',
 'holist']

In [None]:
### INTENTIONALLY LEFT BLANK

You can see that the extracted keywords give us a pretty good idea about what the article is about.

# Well Done!