### Implementation of a simple vector space model using tf-idf

This notebook implements a simple vector space model for information retrieval, using tf-idf term weights.

To build an end-to-end information retrieval system with tf-idf, I followed these general steps:

1. I collected the set of documents to search through. In this case I used a collection of roughly 100,000 news articles.

2. I pre-processed the documents to remove irrelevant information, using text pre-processing techniques such as punctuation removal, tokenization, stop-word removal and stemming.

3. Next, I converted the pre-processed documents into a matrix of tf-idf values. This involves calculating the term frequency (tf) of each term in each document and the inverse document frequency (idf) of each term across the collection, then multiplying the two to get the tf-idf weight of each term in each document.

4. I took a search query from the user to retrieve relevant documents. The query must be pre-processed in the same way as the documents.

5. I calculated the similarity between the query and each document using the tf-idf matrix. I considered two similarity measures: the dot product of the query and document vectors, and their cosine similarity. `TfidfVectorizer` L2-normalizes its vectors by default, so here the two give the same scores.

6. Finally, I ranked the documents by their similarity to the query, with the most relevant documents appearing first.
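
The steps above can be sketched on a toy corpus using only the standard library. This is a minimal illustration, not the code used in this notebook: the documents, the query and the helper names (`tfidf_vector`, `dot`, `cosine`) are made up for the example, and it uses the textbook weight `tf * log(N / df)`, whereas `TfidfVectorizer` applies a smoothed idf and L2 normalization by default.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already lowercased and stripped of punctuation
docs = [
    "machine learning ethics debate",
    "north korea missile test",
    "amazon invests in machine intelligence",
]
query = "machine learning ethics"

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf_vector(tokens):
    """Map each known term to tf * idf, with idf = log(N / df)."""
    tf = Counter(tokens)
    return {t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf if t in doc_freq}


def dot(u, v):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(weight * v[term] for term, weight in u.items() if term in v)


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0


doc_vectors = [tfidf_vector(tokens) for tokens in tokenized]
query_vector = tfidf_vector(query.split())

# Rank document indices by cosine similarity, most similar first
ranking = sorted(
    range(n_docs),
    key=lambda i: cosine(query_vector, doc_vectors[i]),
    reverse=True,
)

for rank, i in enumerate(ranking, start=1):
    print(rank, round(cosine(query_vector, doc_vectors[i]), 3), docs[i])
```

The first toy document shares three terms with the query and ranks first, while the document with no terms in common scores zero and ranks last. A term that occurs in every document gets an idf of `log(1) = 0` under this textbook formula, which is one reason library implementations smooth it.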


In [101]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
import time
import csv
import pickle

In [22]:
# read in the data we cleaned and saved
df = pd.read_csv('train_data.csv')
df.head()

| Unnamed: 0 | id | title | content | clean_content |
|---|---|---|---|---|
| 0 | 17283 | House Republicans Fret About Winning Their Hea... | WASHINGTON — Congressional Republicans have... | washington congressional republicans new fear ... |
| 1 | 17284 | Rift Between Officers and Residents as Killing... | After the bullet shells get counted, the blood... | bullet shells get counted blood dries votive c... |
| 2 | 17285 | Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ... | When Walt Disney’s “Bambi” opened in 1942, cri... | walt disneys bambi opened 1942 critics praised... |
| 3 | 17286 | Among Deaths in 2016, a Heavy Toll in Pop Musi... | Death may be the great equalizer, but it isn’t... | death may great equalizer isnt necessarily eve... |
| 4 | 17287 | Kim Jong-un Says North Korea Is Preparing to T... | SEOUL, South Korea — North Korea’s leader, ... | seoul south korea north koreas leader kim said... |


In [97]:
# summary information
print(f'\n\tTHIS DATA COLLECTION HAS\n\n\t {df.shape[0]}\t\n\n\t DOCUMENTS')



	THIS DATA COLLECTION HAS

	 99999	

	 DOCUMENTS


### Pre-process the data
Here we tokenize, stem and clean the text of noise so that we can use it to build the model.

In [None]:
%%time
# takes a while to run
documents = list(df['content'].values)

# Build the stop-word set and the stemmer once, not once per document
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess_document(document):
    # Tokenize the lowercased document into individual words
    words = nltk.word_tokenize(document.lower())

    # Remove stop words (common words that do not provide useful information)
    words = [word for word in words if word not in stop_words]

    # Apply stemming to reduce words to their base forms
    words = [stemmer.stem(word) for word in words]

    # Drop any stems that are themselves stop words
    return [word for word in words if word not in stop_words]

# preprocess the documents and join the tokens of each one back into a string
processed_docs = [" ".join(preprocess_document(doc)) for doc in documents]

In [8]:
# save to csv for easier access -- run only once
df['clean_content'] = processed_docs
df.to_csv('train_data2.csv', index=False)

### Building the Model

In [24]:
# load the pre-processed data
df = pd.read_csv('train_data.csv')

# empty documents are read back from the csv as NaN; replace them with empty strings
df = df.fillna('')

processed_docs = list(df['clean_content'].values)

In [81]:
# create the vector space model using TfidfVectorizer
vectorizer = TfidfVectorizer()

X = vectorizer.fit_transform(processed_docs)

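# save the tf-idf document matrix to a file
# (the fitted vectorizer itself is not saved and must stay in memory to transform queries)
with open("tfidf.pickle", "wb") as file:
    pickle.dump(X, file)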
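
Because the pickle above holds only the matrix, a new session would have to refit the vectorizer before it could transform a query. One possible alternative, sketched here on toy data, is to pickle the fitted vectorizer and the matrix together. The documents, the dictionary keys and the file name `tfidf_bundle.pickle` are assumptions made for this illustration and are not used elsewhere in the notebook.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for processed_docs
toy_docs = [
    "machine learning ethics",
    "north korea missile test",
    "amazon machine intelligence",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# Illustrative file name, written to a temporary directory
path = os.path.join(tempfile.mkdtemp(), "tfidf_bundle.pickle")

# Save the fitted vectorizer together with the document matrix
with open(path, "wb") as f:
    pickle.dump({"vectorizer": toy_vectorizer, "matrix": toy_matrix}, f)

# Later, possibly in a new session: load both and score a query
with open(path, "rb") as f:
    bundle = pickle.load(f)

toy_query_vector = bundle["vectorizer"].transform(["machine ethics"])
scores = (bundle["matrix"] @ toy_query_vector.T).toarray().ravel()
best = int(scores.argmax())
print(toy_docs[best])
```

Pickles written by one scikit-learn version are not guaranteed to load in another, so a bundle like this is best treated as a cache that can be rebuilt from the cleaned csv.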


In [89]:
# Load the tf-idf document matrix from the file
with open("tfidf.pickle", "rb") as file:
    saved_tfidf = pickle.load(file)


# read the search query from the user
query = input("\nEnter a search query:\n ")

print('Processing Query...please wait..\n')
print('---'*10)

time.sleep(5)

print('Displaying search results\n')

# preprocess the query by lowercasing, tokenizing and removing stopwords
stop_words = set(stopwords.words("english"))
tokens = word_tokenize(query.lower())

filtered_tokens = [token for token in tokens if token not in stop_words]

query = " ".join(filtered_tokens)

# transform the query using the same vector space model
query_vector = vectorizer.transform([query])

# score each document by the dot product of its vector with the query vector;
# TfidfVectorizer L2-normalizes its vectors by default, so this equals the cosine similarity
similarity = (saved_tfidf @ query_vector.T).toarray()

df2 = df.copy()
df2['similarity_measure'] = similarity.flatten()
results = df2.sort_values(by='similarity_measure',ascending=False)
top_5 = results.head()


# summarize the content of the top 5 docs
stopWords = set(stopwords.words("english"))

for i, (_, row) in enumerate(top_5.iterrows(), start=1):

    text = row['content']
    title = row['title'].strip()
    doc_id = row['id']
    score = round(row['similarity_measure'], 4)

    # Creating a frequency table to keep the score of each word
    freqTable = dict()
    for word in word_tokenize(text):
        word = word.lower()
        if word in stopWords:
            continue
        freqTable[word] = freqTable.get(word, 0) + 1

    # Creating a dictionary to keep the score of each sentence:
    # the summed frequencies of the words found in it (substring match)
    sentences = sent_tokenize(text)
    sentenceValue = dict()

    for sentence in sentences:
        lowered = sentence.lower()
        for word, freq in freqTable.items():
            if word in lowered:
                sentenceValue[sentence] = sentenceValue.get(sentence, 0) + freq

    # Average value of a sentence from the original text
    # (0 if no sentence could be scored, which avoids a division by zero)
    average = int(sum(sentenceValue.values()) / len(sentenceValue)) if sentenceValue else 0

    # Storing sentences that score above the average into our summary
    summary = ''
    for sentence in sentences:
        if sentenceValue.get(sentence, 0) > average:
            summary += " " + sentence

    print(f'DocumentID: {doc_id}')
    print(f'Rank: {i}')
    print(f'Title of Retrieved Document {i}: {title}')
    print(f'Similarity to query: {score}\n')
    print('Summary of Retrieved Document:')
    print('---'*7)
    print(summary)
    print('\n\n')





Enter a search query:
  The ethics of machine learning and state of artificial intelligence


Processing Query...please wait..

------------------------------
Displaying search results

DocumentID: 75603
Rank: 1
Title of Retrieved Document 1: Amazon Is Making It Easier for Companies to Track You
Similarity to query: 0.3805

Summary of Retrieved Document:
---------------------
   Like “big data” and “social media” before it, the term “artificial intelligence” has become so buzzworthy at this point that it’s largely lost meaning. If everything seems to be powered by A. I. that’s because many companies are desperate to be perceived as leaders in machine learning (or deep learning, or natural language generation, all of which fall under the A. I. umbrella) —  even when they’re not. Artificial intelligence is an increasingly powerful force in the world, even as our grasp of what A. I. is and does continuously evolves. ” So it makes sense that Jeff Bezos, the Amazon founder and CEO, spent a good chunk of his latest letter to Amazon shareholders focused on artificial intelligence. It