<a href="https://colab.research.google.com/github/anandpol98/Information_Retrieval_TF_IDF/blob/main/Term_Document_Matrix_IRipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Mounting Google Drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Unzipping Business Document

In [None]:
!unzip "/content/drive/MyDrive/business.zip" -d "/content/"

# Importing Libraries

In [3]:
import os,re,math
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import nltk
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

# Functions for Preprocessing Text

This cell defines all the text-preprocessing functions:

*   Remove Punctuations/Numbers
*   Tokenization using NLTK library
*   Remove Stopwords
*   Lemmatization





In [4]:

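# Removing punctuation, digits, and newlines

def remove_punc(data):
  # Matches are deleted, not replaced by a space, so the text on either side
  # of a removed character is joined (e.g. "long-term" becomes "longterm").
  return re.sub(r'[!"#$%&\'()*+,\-./:;<=>?@\[\]^_`{|}~0-9\n]', '', data)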


# Tokenization using NLTK library

def tokenization(data):
  return word_tokenize(data)



# Remove stopwords

def remove_stopwards(tokens):
  from nltk.corpus import stopwords
  stop_words = set(stopwords.words('english'))   # a set gives fast membership tests
  # build a new list instead of removing items from the list being iterated
  return [token for token in tokens if token not in stop_words]



# Lemmatization (each token is treated as a verb via pos='v')

def lemmatization(tokens):
  from nltk.stem import WordNetLemmatizer
  lemmatizer = WordNetLemmatizer()
  return [lemmatizer.lemmatize(word, pos='v') for word in tokens]
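
Because `remove_punc` deletes the matched characters outright, words separated only by a hyphen or a newline end up fused, which is where vocabulary entries such as `aaarated` or `abroadstanchart` can come from. The sketch below contrasts that behaviour with a hypothetical variant, `replace_with_space`, which is not part of the notebook. It uses only the standard library, and the sample sentence is made up for illustration.

```python
import re

# Same character class as remove_punc, written as a raw string.
PATTERN = r"""[!"#$%&'()*+,\-./:;<=>?@\[\]^_`{|}~0-9\n]"""

def delete_chars(text):
    """Delete every match, as remove_punc does."""
    return re.sub(PATTERN, '', text)

def replace_with_space(text):
    """Hypothetical variant: turn each match into a space, then collapse whitespace."""
    return ' '.join(re.sub(PATTERN, ' ', text).split())

sample = "AAA-rated bonds sold abroad\nStanChart gained 5%."
deleted = delete_chars(sample)
spaced = replace_with_space(sample)

print(repr(deleted))
print(repr(spaced))
```

Whether a hyphenated word should become one token or two is a design choice; the variant simply keeps neighbouring words from merging.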

# Removing business_index files from set of documents

In [None]:
# Removing index files

index_pattern = re.compile(r'[0-9]*_business_index\.utf8')
files = sorted(f for f in os.listdir('/content/business')
               if not index_pattern.fullmatch(f))   # keep every file that is not an index file
print("File names are as follows : \n",files)

# Preprocessing all documents individually and storing the preprocessed data at the respective index of a list of strings

*   Preprocess each document and store the result as a single string in the **corpus** list.
*   After this step, **corpus[i]** contains the preprocessed data of the **document at index i**.



In [6]:
corpus = []
for i in sorted(files):

  # read the file (UTF-8 is assumed from the .utf8 extension) and keep only
  # the content of its <text> element, extracted with BeautifulSoup
  with open('/content/business/' + i, encoding='utf-8') as f:
    dat = f.read()
  soup = BeautifulSoup(dat)
  dat = soup.find('text').text

  # remove punctuation
  data = remove_punc(dat)

  # convert text into tokens
  tokens = tokenization(data)

  # convert all tokens to lowercase
  tokens = [item.lower() for item in tokens]

  # remove stopwords from the generated tokens
  tokens = remove_stopwards(tokens)

  # lemmatize each token to its root form
  stem_lemma = lemmatization(tokens)

  # join all root-form terms into a single string
  str1 = ' '.join(stem_lemma)

  # append the string, so corpus[k] holds the k-th document
  corpus.append(str1)

For example, corpus[0] holds the preprocessed data of the first document, which looks like this:

In [7]:
corpus[0]

'telegraph calcutta business corporate brief kanoria chemicals industries invest rs crore set mw power plant chemical unit target rs crore turnover thermal power plant would set outlay rs crore chloralkali plant would set cost rs crore chairman manage director r v kanoria say hummingbird ltd lead global provider integrate enterprise content management ecm solutions launch comprehensive content library consolidation solution law firm part enterprise content integration solution solution design help law firm library consolidation efforts minimise time require consolidation ensure complete data integrity availability throughout process canon image technology company draw plan capture per cent rapidly grow market colour laser multifunction devices india market estimate rs crore sierra atlantic lead player offshoring enterprise applications assess fully compliant maturity level five software engineer institute sei capability mature model cmm hikal ltd enter longterm arrangement bayer cropsc

# **Using Sklearn**

In [23]:
# use the standard approach for the tf-idf representation

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()               # object creation
X = tfidf.fit_transform(corpus)         # fit the model on the corpus and build the tf-idf matrix
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print("Terms after tf-idf matrix process are : \n",tfidf.get_feature_names_out()[:100].tolist())      # first 100 terms of the learned vocabulary


Terms after tf-idf matrix process are : 
 ['aa', 'aaa', 'aaaind', 'aaarated', 'aaastable', 'aad', 'aai', 'aaifr', 'aalayance', 'aamir', 'aarohi', 'aaron', 'aarvee', 'aastable', 'aazmao', 'ab', 'aback', 'abacus', 'abandon', 'abani', 'abarrel', 'abate', 'abb', 'abbots', 'abbott', 'abbreviate', 'abbs', 'abc', 'abcs', 'abdomen', 'abdul', 'abdullah', 'abe', 'abel', 'aber', 'aberdeen', 'aberration', 'abeyance', 'abhang', 'abhay', 'abhijit', 'abhiyan', 'abide', 'abilities', 'ability', 'able', 'ablr', 'abn', 'abnormal', 'abolish', 'abolishment', 'abolition', 'abort', 'abovementioned', 'abp', 'abrasive', 'abrasives', 'abreast', 'abroad', 'abroadstanchart', 'abrupt', 'abs', 'absence', 'absences', 'absent', 'absolute', 'absolutely', 'absolve', 'absorb', 'absorbable', 'absorbers', 'absorption', 'absorptive', 'abstain', 'absurd', 'abu', 'abubakir', 'abundant', 'abuse', 'abusive', 'abuy', 'abuzz', 'abysmal', 'abysmally', 'ac', 'academia', 'academic', 'academics', 'academies', 'academy', 'acc', 'acce
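
For orientation, here is a rough sketch of what `TfidfVectorizer` computes with its default settings, as I understand them from the scikit-learn documentation: raw term counts, a smoothed idf of `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each document row. The function name `sklearn_style_tf_idf` is hypothetical, the toy documents are made up, and whitespace splitting stands in for the vectorizer's own tokenizer, so treat it as an illustration rather than a reimplementation.

```python
import math
from collections import Counter

def sklearn_style_tf_idf(docs):
    """Approximate TfidfVectorizer defaults on whitespace-tokenized documents."""
    tokenized = [doc.split() for doc in docs]
    terms = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed idf; never zero, even for a term present in every document.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in terms}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[t] * idf[t] for t in terms]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0  # avoid dividing by zero
        rows.append([v / norm for v in row])               # L2-normalise the row
    return terms, rows

toy_docs = ["export policy export", "policy trade", "policy centre"]
toy_terms, toy_rows = sklearn_style_tf_idf(toy_docs)
for name, row in zip(toy_docs, toy_rows):
    print(name, [round(v, 3) for v in row])
```

The row normalisation means every document vector has unit length, so these scores live on a different scale from an unnormalised tf × idf product.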

# **Shape of Term-Document Matrix formed using Sklearn library**

In [9]:
print("Shape of Term-Document Matrix is : ",X.shape)

Shape of Term-Document Matrix is :  (1994, 20044)


Converting X into Pandas dataframe

In [10]:
df_X = pd.DataFrame(data = X.toarray(),index = sorted(files),columns = tfidf.get_feature_names_out())   # use get_feature_names() on scikit-learn older than 1.0

In [None]:
print(df_X)

In [12]:
top_5_doc = sorted(files)[:5]
top_5_doc

['1040901_business_story_3700171.utf8',
 '1040901_business_story_3700827.utf8',
 '1040901_business_story_3701515.utf8',
 '1040901_business_story_3701518.utf8',
 '1040901_business_story_3701887.utf8']

In [13]:
type(df_X.sort_index()[:5])

pandas.core.frame.DataFrame

# **Q4 - First 5 documents' top 5 terms and their tf-idf scores obtained using Sklearn library**

In [14]:
for i in range(5):
  print(str(i+1) + " document's top 5 words and their tf-idf score are as follows\n")
  print("\nWords  |  Tf-idf Scores\n")
  print((df_X.loc[top_5_doc[i]]).sort_values(ascending = False)[:5],"\n\n")

1 document's top 5 words and their tf-idf score are as follows


Words  |  Tf-idf Scores

consolidation    0.181328
content          0.178707
kanoria          0.178479
idbi             0.176259
hikal            0.172957
Name: 1040901_business_story_3700171.utf8, dtype: float64 


2 document's top 5 words and their tf-idf score are as follows


Words  |  Tf-idf Scores

policy         0.292382
export         0.229180
trade          0.216347
maidan         0.184744
entitlement    0.143509
Name: 1040901_business_story_3700827.utf8, dtype: float64 


3 document's top 5 words and their tf-idf score are as follows


Words  |  Tf-idf Scores

patni     0.448705
centre    0.312209
anna      0.208451
salai     0.208451
patnis    0.197759
Name: 1040901_business_story_3701515.utf8, dtype: float64 


4 document's top 5 words and their tf-idf score are as follows


Words  |  Tf-idf Scores

bharat        0.437181
petro         0.347059
kochi         0.233013
refineries    0.215340
behuria       0.2114
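
Picking the top few terms of a document does not require sorting the whole row of the matrix. As a small sketch, the hypothetical helper `top_k_terms` below uses `heapq.nlargest` from the standard library on a toy score vector. The term names and numbers are placeholders, not values from this corpus.

```python
import heapq

def top_k_terms(terms, scores, k=5):
    """Return the k (term, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, zip(terms, scores), key=lambda pair: pair[1])

# Placeholder vocabulary and scores, purely for illustration.
toy_vocab = ["alpha", "beta", "gamma", "delta", "epsilon"]
toy_scores = [0.44, 0.0, 0.23, 0.35, 0.01]

top = top_k_terms(toy_vocab, toy_scores, k=3)
for term, score in top:
    print(term, score)
```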



---



# **Using Tf-Idf classic manual approach**

**Class Definition and its functions**

In [15]:
class tf_idf_matrix:
  
  # Constructor - initialize members of the class

  def __init__(self,corpus,files):
    self.corpus = corpus            # corpus data
    self.set_tokens = set()         # will be filled with the unique terms across all documents in the corpus

    for i in self.corpus:
      self.set_tokens.update(i.split())         # storing unique terms (tokens) in the "set_tokens" member

    self.files = files  # storing file names 
    self.tf = np.zeros((len(self.corpus),len(self.set_tokens)))             # "tf" is the term-frequency matrix
    self.tf_idf =  np.zeros((len(self.corpus),len(self.set_tokens)))        # "tf_idf" is the term-document matrix
    self.idf_t =  np.zeros(len(self.set_tokens))                             # "idf_t" stores the inverse document frequency for each term
  
  def transform(self,doc_no):
    # use self rather than the global "obj", so the method works on any instance
    return self.tf_idf[sorted(self.files).index(doc_no)][:]  # tf-idf vector of the document, with scores for all terms
    

    
  def fit(self):

    # defining some local variables

    tf =  np.zeros((len(self.corpus),len(self.set_tokens)))
    corpus = self.corpus
    set_tokens = self.set_tokens
    lst = list(set_tokens)


    # tf matrix calculation
    # Note: str.count counts substring occurrences (so "expo" is also counted
    # inside "export"), and len(corpus[i]) is the document length in characters.

    for i in range(len(corpus)):
      for j in range(len(lst)):
          tf[i][j] = (corpus[i]).count(lst[j])

      tf[i][:] = np.divide((tf[i][:]),len(corpus[i]))

    

    # idf_t vector calculation
    # Note: "in" on a string is also a substring test, not a whole-token test.

    df_t =  np.zeros(len(lst))  # document frequency for each term
    idf_t =  np.zeros(len(lst)) # inverse document frequency for each term

    for i in range(len(lst)):
      for j in range(len(corpus)):
        if lst[i] in corpus[j]:
          df_t[i] = df_t[i] +  1

      idf_t[i] = math.log(len(corpus)/df_t[i])



    #tf_idf matrix computation

    tf_idf = np.zeros((len(corpus),len(lst))) 

    for i in range(len(corpus)):
      tf_idf[i][:] = [a*b for a,b in zip(tf[i][:],idf_t[:])]


    # storing values into member variables

    self.tf_idf = tf_idf     
    self.idf_t = idf_t
    self.tf = tf
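
The `fit` method above works on the document strings directly, so `str.count` and `in` match substrings and the term frequency is divided by the number of characters. For comparison, here is a minimal token-based sketch of the same classic formula, tf × ln(N / df). The function name `classic_tf_idf` and the toy documents are hypothetical and only the standard library is used. It is not the notebook's implementation, and it would not reproduce the scores printed below.

```python
import math
from collections import Counter

def classic_tf_idf(docs):
    """Return (terms, matrix) with matrix[i][j] = tf(term j, doc i) * ln(N / df(term j)).

    docs is a list of whitespace-separated strings, like corpus above.
    """
    tokenized = [doc.split() for doc in docs]
    terms = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)

    # Document frequency counted on whole tokens, once per document.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log(n_docs / df[t]) for t in terms}

    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1          # number of tokens; guard against empty documents
        matrix.append([counts[t] / total * idf[t] for t in terms])
    return terms, matrix

toy_corpus = ["export policy export", "expo centre", "policy centre trade"]
tfidf_terms, tfidf_matrix = classic_tf_idf(toy_corpus)

# Substring counting sees "expo" twice in the first document; token counting sees none.
substring_hits = toy_corpus[0].count("expo")
token_hits = toy_corpus[0].split().count("expo")
print(substring_hits, token_hits)
```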


Creating object of class

In [16]:
obj = tf_idf_matrix(corpus,files)   #class object creation

**Calling Fit method**

In [17]:
obj.fit()   # fitting the data for tf-idf matrix formation

In [18]:
obj.idf_t   # printing idf_t vector

array([1.16817847, 6.90475077, 6.90475077, ..., 0.68714716, 7.59789795,
       7.59789795])

**Calling Transform Method for Document "*1040901_business_story_3700171.utf8*"**

In [19]:
print("Term-document vector for particular doc no:\n",obj.transform("1040901_business_story_3700171.utf8"))

Term-document vector for particular doc no:
 [0.00188112 0.         0.         ... 0.00073768 0.         0.        ]


# **Shape of Term-Document Matrix formed using Classic Manual approach**

In [21]:
print("Shape of Term-Document Matrix formed is : ", np.shape(obj.tf_idf))

Shape of Term-Document Matrix formed is :  (1994, 20062)
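
The manual matrix has 20062 columns against 20044 from Sklearn. One plausible contributor, stated here as an assumption rather than something verified on this corpus, is tokenization: the class splits on whitespace and keeps one-letter tokens, whereas the default `token_pattern` of `TfidfVectorizer` only matches tokens of two or more word characters. The sketch applies that pattern with the standard `re` module to a fragment taken from corpus[0].

```python
import re

# Default token_pattern of scikit-learn's CountVectorizer / TfidfVectorizer.
SKLEARN_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

fragment = "r v kanoria say plant"   # initials survive preprocessing as one-letter tokens

split_tokens = fragment.split()
sklearn_like_tokens = re.findall(SKLEARN_TOKEN_PATTERN, fragment)

print(split_tokens)
print(sklearn_like_tokens)
```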


# **Q4 - First 5 documents TOP 5 terms and their tf-idf scores obtained using classic manual approach**

In [22]:
terms = list(obj.set_tokens)   # built once; same ordering as the columns created in fit()
for i in range(5):
    top_5 = np.argsort(obj.tf_idf[i][:])[-5:]
    print(str(i+1) + " document's top 5 words and their tf-idf score are as follows\n")
    print(sorted(obj.tf_idf[i][:], reverse= True)[:5])
    print("\nWords    |     Tf-idf Scores\n")
    for j in reversed(top_5):   # highest score first
        print(terms[j],"------",obj.tf_idf[i][j])
    print("\n")

1 document's top 5 words and their tf-idf score are as follows

[0.006977225616590095, 0.006668388179712177, 0.006428835252912167, 0.006254953114037803, 0.005967918924977709]

Words    |     Tf-idf Scores

kanoria ------ 0.006977225616590095
hikal ------ 0.006668388179712177
library ------ 0.006428835252912167
consolidation ------ 0.006254953114037803
idbi ------ 0.005967918924977709


2 document's top 5 words and their tf-idf score are as follows

[0.007921357912634848, 0.007793657170012391, 0.007325242948823728, 0.006699305404232055, 0.006699305404232055]

Words    |     Tf-idf Scores

export ------ 0.007921357912634848
policy ------ 0.007793657170012391
expo ------ 0.007325242948823728
pragati ------ 0.006699305404232055
maidan ------ 0.006699305404232055


3 document's top 5 words and their tf-idf score are as follows

[0.02930278973976647, 0.011241088463507297, 0.010955873036080437, 0.009956381787976695, 0.00851457040041009]

Words    |     Tf-idf Scores

patni ------ 0.0293027897