# Embedding using TF-IDF

Word embeddings are word representations in which words with similar meanings have similar representations. Many embedding techniques exist; this notebook uses TF-IDF, which stands for Term Frequency-Inverse Document Frequency. TF-IDF is a statistical measure of how relevant a word is to a document within a collection of documents. It is the product of two quantities: the term frequency (how often the word appears in the document) and the inverse document frequency (a weight that shrinks as the word appears in more documents of the collection).
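To make the product of the two quantities concrete, here is a minimal sketch that computes a textbook TF-IDF score by hand. The toy corpus and the helper names `tf`, `idf` and `tf_idf` are assumptions made for illustration. The formula used is the plain `tf * log(N / df)` variant, which is not necessarily the exact weighting a given library applies (scikit-learn, for example, smooths the IDF and normalizes each row by default).

```python
import math

# Toy corpus (hypothetical): each string plays the role of one document.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the bird sat on the branch",
]
tokenized = [d.split() for d in docs]

def tf(term, tokens):
    # Term frequency: share of the document's tokens that equal `term`.
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    # Inverse document frequency: log(N / number of documents containing term).
    # Assumes `term` occurs in at least one document of the corpus.
    n_containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

# "cat" occurs in only one document, so it gets a positive weight there.
print(tf_idf("cat", tokenized[0], tokenized))
# "the" occurs in every document, so its IDF (and its score) is zero.
print(tf_idf("the", tokenized[0], tokenized))
```

A word that appears everywhere carries no information about any single document, which is why its weight collapses to zero in this variant.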

In this notebook we extract text from a Wikipedia page and apply the TF-IDF technique to the extracted text.

### Importing the libraries

* We import `TfidfVectorizer` from sklearn for the TF-IDF computations
* `requests` and `BeautifulSoup` are used to extract text from the Wikipedia page
* NLTK is used to process the extracted text
* pandas is used to hold the resulting TF-IDF values in a dataframe

In [22]:
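import pandas as pd
import requests
import nltk
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

# Sentence tokenizer models used by nltk.tokenize.sent_tokenize
# (recent NLTK releases may ask for 'punkt_tab' instead)
# Note: stopwords.words('english') also needs the 'stopwords' corpus;
# run nltk.download('stopwords') once if it is not installed yet.
nltk.download('punkt')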

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\gaura\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Defining the different functions

Connecting to the Wikipedia page is the first step of the process. The initResp() function sends the HTTP request and returns the response. The extractText() function, defined later, uses it to fetch the page before extracting the text.

In [3]:
def initResp(url):
    response = requests.get(url)
    return response

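Once we are connected to the Wikipedia page, we need to extract the title and the text written about the topic for further processing. titleExtractor() and contextExtractor() do this using the BeautifulSoup library: the first returns the contents of the page's `<title>` tag, and the second concatenates the text of all `<p>` elements.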

In [4]:
def titleExtractor(soup):
    title = soup.find('title')
    return title.string

def contextExtractor(soup):
    context = " "
    for i in soup.select('p'):
        context = context + i.getText()
    return context
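The behaviour of the two extractors can be checked offline on a small inline HTML string instead of a live page. The sketch below is only an illustration: the HTML snippet is made up, and joining the paragraphs with a space is an assumption added here so that the last word of one paragraph does not fuse with the first word of the next.

```python
from bs4 import BeautifulSoup

# Hypothetical page content standing in for a downloaded Wikipedia page.
html = """
<html>
  <head><title>Sample page - Wikipedia</title></head>
  <body>
    <p>First paragraph.</p>
    <div>Not a paragraph.</div>
    <p>Second paragraph.</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Same idea as titleExtractor: the text inside the <title> tag.
title = soup.find("title").string

# Same idea as contextExtractor: the text of every <p> element only.
paragraphs = [p.get_text() for p in soup.select("p")]
text = " ".join(paragraphs)

print(title)
print(text)
```

Only the `<p>` elements contribute to the text, so content in other tags (here the `<div>`) is ignored.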

Before computing TF-IDF we remove the stop words; otherwise very common words such as "the" or "of" would also enter the computation. stopWordDel() uses the English stop-word list from NLTK to remove all stop words present in the text data.

In [8]:
def stopWordDel(text):
    # English stop words from NLTK, stored as a set for fast membership tests
    stop_words = set(stopwords.words('english'))
    # Keep only the whitespace-separated tokens that are not stop words
    return " ".join(w for w in text.split() if w not in stop_words)
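The filtering step can be tried without downloading any NLTK data. The sketch below assumes a deliberately tiny, hypothetical stop-word set (`STOP_WORDS`) and a stand-in function `remove_stop_words`; it is not the NLTK list, only an illustration of how whitespace splitting and exact matching behave.

```python
# Hypothetical, deliberately tiny stop-word set used only for this sketch;
# the notebook itself uses the much larger NLTK English list.
STOP_WORDS = {"the", "is", "a", "of", "and", "in"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    # Split on whitespace and keep tokens that are not exact stop-word matches.
    return " ".join(w for w in text.split() if w not in stop_words)

sample = "the history of the term is complex, and the debate continues"
print(remove_stop_words(sample))

# Matching is case-sensitive, so text should be lowercased first.
print(remove_stop_words("The cat"))
print(remove_stop_words("The cat".lower()))
```

Because tokens are compared exactly, a capitalized "The" survives unless the text is lowercased beforehand, and a token with attached punctuation such as "complex," is kept as-is.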

### Extracting Data and Computing TF-IDF

Now that the helper functions are defined, we can turn to the functions that perform the extraction and the computation.

* extractText() first connects to the Wikipedia page and then extracts the title and the text using titleExtractor() and contextExtractor(). The extracted text is converted to lowercase before it is passed to stopWordDel(). After the stop words are removed, digit characters are also deleted from the text. Finally, the processed text is split into sentences using the NLTK library and stored in a list, which the function returns.

In [19]:
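def extractText(url):
    response = initResp(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    title = titleExtractor(soup)  # extracted, but not used further below
    context = contextExtractor(soup)
    # Lowercase first so that capitalized stop words are matched too
    res = stopWordDel(context.lower())
    # Drop every digit character from the text
    filteredContext = ''.join(ch for ch in res if not ch.isdigit())
    # Split the cleaned text into a list of sentences
    sentenceList = nltk.tokenize.sent_tokenize(filteredContext)
    return sentenceList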
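The digit-removal step in extractText() works character by character, which has a visible side effect on tokens that mix letters and digits. The sketch below shows this on a made-up sentence, next to a regular-expression variant that removes only standalone numbers. The regex variant is an alternative suggested here for comparison, not what the notebook does.

```python
import re

# Hypothetical sample text, already lowercased as in extractText().
text = "the term was coined in 1944. it entered law in the 20th century."

# Character-level filter, as used in extractText(): every digit is dropped.
no_digits = "".join(ch for ch in text if not ch.isdigit())
print(no_digits)

# Alternative (an assumption, not the notebook's method): drop only tokens
# that consist entirely of digits, leaving mixed tokens such as "20th" intact.
no_numbers = re.sub(r"\b\d+\b", "", text)
print(no_numbers)
```

With the character-level filter, "20th" becomes "th", which then shows up as its own term in the vocabulary.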

* computeIDF() uses extractText() to obtain the sentence list from the Wikipedia page and passes it to TfidfVectorizer from sklearn, treating each sentence as one document. The TfidfVectorizer tokenizes the documents, learns the vocabulary and the inverse document frequency weightings, and can also encode new documents. The function returns a dataframe of TF-IDF values with one row per sentence and one column per vocabulary term.

In [20]:
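def computeIDF(url):
    sentenceList = extractText(url)
    vectorizer = TfidfVectorizer()
    # Sparse matrix: one row per sentence, one column per vocabulary term
    vectors = vectorizer.fit_transform(sentenceList)
    # get_feature_names() was removed in scikit-learn 1.2;
    # get_feature_names_out() is its replacement (available since 1.0)
    feature_names = vectorizer.get_feature_names_out()
    df = pd.DataFrame(vectors.toarray(), columns=feature_names)
    return df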

### Results:

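Here we pass the Wikipedia URL for the article on genocide to the computeIDF() function. It returns a dataframe containing the TF-IDF values, with one row per sentence and one column per term. The head of that dataframe is displayed below; most entries are zero because each sentence contains only a few of the terms in the vocabulary.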

In [23]:
df = computeIDF('https://en.wikipedia.org/wiki/Genocide')
df.head()

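| | abetting | abortion | absolute | abstained | academics | acceptance | accepted | access | accomplished | according | ... | yazidi | yazidis | years | yemen | yet | york | yugoslavia | zdravko | zepa | γένος |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.209056 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |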
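Since the dataframe is mostly zeros, it is often easier to look at the highest-weighted terms of a single row than at the head of the table. The sketch below shows one way to do that with pandas; the small dataframe, its column names and its values are hypothetical stand-ins for the output of computeIDF().

```python
import pandas as pd

# Hypothetical TF-IDF values: two sentences (rows) over four terms (columns).
df = pd.DataFrame(
    [[0.0, 0.7, 0.0, 0.3],
     [0.5, 0.0, 0.5, 0.0]],
    columns=["law", "genos", "crime", "term"],
)

# The two highest-weighted terms of the first sentence.
top = df.iloc[0].nlargest(2)
print(top)

# Fraction of entries that are exactly zero, as a rough sparsity measure.
sparsity = (df == 0).mean().mean()
print(sparsity)
```

On the real dataframe the same calls would be applied to whichever row (sentence) is of interest.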
