

# Vector Space Model

This notebook ranks documents against search queries using the Vector Space Model (VSM) with cosine similarity as the ranking formula.

The documents, from a fixed repository, are scored and ranked for similarity against a test set of queries. The results are exported for evaluation with the trec_eval tool.

In the final section, a user can manually enter a free-form text query and run it against the existing document repository using the same ranking model, which is useful for exploratory testing.

## Imports and setup

In [1]:
import math
import numpy as np
import pandas as pd
import csv
import os
import nltk
from nltk.corpus import reuters
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.text import log
import xml.etree.ElementTree as ET

nltk.download('reuters')
nltk.download('punkt')
nltk.download('stopwords')

stop_words = set(stopwords.words("english"))

[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [2]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Part 1 - Ranking by document titles
In this section we score each search query against every document title and create a shortlist of the top 100 most similar documents per query (by title).

### Setup

In [3]:
# Create base dataframe for recording results
df_Results = pd.DataFrame(columns=['Query_ID','Doc_ID', 'VSM_Score','Query_Desc', 'Doc_Desc'])

In [4]:
df_Results.drop(df_Results.index,inplace=True)

### Bring in the data

Indexed queries and documents prepared in the previous notebook

In [5]:
os.chdir("/content/drive/MyDrive/CA6005I - Mechanics of Search/Assignment1/Files_Indexed")

Document titles file

In [6]:
# Import from prepared CSV file - read doc IDs and titles to array
with open('Indexed_Titles.csv', 'r') as file:
    reader = csv.reader(file)
    documents = []
    documentIDs = []
    for row in reader:
        documentIDs.append(row[1])
        documents.append(row[2])

Search queries file

In [7]:
# Import from prepared CSV file - read query IDs and search strings to array
with open('Indexed_Queries.csv', 'r') as file:
    reader = csv.reader(file)
    queries = []
    queryIDs = []
    for row in reader:
        queries.append(row[2])
        queryIDs.append((row[1]))

### Vectorisation

Preprocessing and stopwords removal

In [8]:
# Split document titles into individual words and remove stop words
def preprocess(documents):
    preprocessed_docs = []
    for doc in documents:
        words = doc.lower().split()
        words = [word for word in words if word not in stop_words]
        preprocessed_docs.append(words)
    return preprocessed_docs
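
As a small illustration of what this preprocessing step produces, the sketch below applies the same lowercase, whitespace-split and stop-word filter to two sample titles. The names `preprocess_demo`, `demo_stop_words` and `demo_titles` are hypothetical, and the tiny stop-word set only stands in for NLTK's English list so the example runs without downloads.

```python
# Hypothetical mini stop-word list, standing in for NLTK's English list
demo_stop_words = {"the", "of", "for", "a", "in", "and"}

def preprocess_demo(documents, stop_words):
    """Lowercase each document, split on whitespace and drop stop words."""
    return [
        [word for word in doc.lower().split() if word not in stop_words]
        for doc in documents
    ]

demo_titles = [
    "The use of models for the determination of critical flutter speeds .",
    "Flutter model testing at transonic speeds",
]
demo_tokens = preprocess_demo(demo_titles, demo_stop_words)
print(demo_tokens[0])
# ['use', 'models', 'determination', 'critical', 'flutter', 'speeds', '.']
```

Because the split is on whitespace only, a free-standing punctuation token such as `.` survives as a vocabulary entry, and "models" and "model" remain distinct terms since no stemming is applied.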

In [9]:
preprocessed_docs = preprocess(documents)

In [10]:
# Create vocabulary from the documents
vocab = sorted(set(word for doc in preprocessed_docs for word in doc))

Vectorisation

In [11]:
# Convert title document into a vector representation using the vocabulary
def vectorise(doc, vocab):
    vector = np.zeros(len(vocab))
    for word in doc:
        if word in vocab:
            vector[vocab.index(word)] += 1
    return vector
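
Looking up each word with `vocab.index(word)` scans the vocabulary list every time, which can become slow for long documents and large vocabularies. A possible alternative, sketched here under the assumption that a word-to-position dictionary is built once up front, gives the same term-count vectors. The names `build_index` and `vectorise_fast` are hypothetical and not part of the notebook.

```python
import numpy as np

def build_index(vocab):
    """Map each vocabulary word to its column position."""
    return {word: position for position, word in enumerate(vocab)}

def vectorise_fast(doc, word_to_index):
    """Term-count vector for a tokenised document; unknown words are ignored."""
    vector = np.zeros(len(word_to_index))
    for word in doc:
        position = word_to_index.get(word)
        if position is not None:
            vector[position] += 1
    return vector

demo_vocab = sorted({"flutter", "model", "speeds", "wing"})
demo_index = build_index(demo_vocab)
demo_vector = vectorise_fast(["flutter", "model", "flutter", "jet"], demo_index)
print(demo_vocab)   # ['flutter', 'model', 'speeds', 'wing']
print(demo_vector)  # [2. 1. 0. 0.]
```

The word "jet" is not in the demo vocabulary, so it contributes nothing to the vector, which matches how out-of-vocabulary query terms are handled above.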

In [12]:
# Vectorise the preprocessed documents
vectors = [vectorise(doc, vocab) for doc in preprocessed_docs]

### Similarity

Compute cosine similarity between two vectors
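
For reference, the quantity computed in the next cell can be written as follows, where $\mathbf{u}$ is the query's term-count vector and $\mathbf{v}$ a document's term-count vector over the same vocabulary (the symbols are chosen to match the function arguments):

$$
\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert} = \frac{\sum_i u_i v_i}{\sqrt{\sum_i u_i^2} \, \sqrt{\sum_i v_i^2}}
$$

As a worked example, for $\mathbf{u} = (1, 1, 0)$ and $\mathbf{v} = (1, 0, 0)$ the dot product is $1$ and the norms are $\sqrt{2}$ and $1$, so the score is $1/\sqrt{2} \approx 0.707$. With non-negative term counts the score always lies between 0 and 1.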

In [13]:
def calculate_cosine_similarity(u, v):
    # Cosine similarity is undefined when either vector has zero length
    # (e.g. a query with no in-vocabulary terms); score those pairs as 0
    denominator = np.linalg.norm(u) * np.linalg.norm(v)
    if denominator == 0:
        return 0.0
    return np.dot(u, v) / denominator
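
To see where a NaN score can come from, the short check below evaluates the raw formula on a few hand-made vectors. It is only an illustrative sketch: the vectors and the helper name `raw_cosine` are hypothetical, and `np.errstate` is used solely to silence the expected warning.

```python
import numpy as np

def raw_cosine(u, v):
    """Plain cosine formula with no guard against zero-length vectors."""
    with np.errstate(invalid="ignore"):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

query_vec = np.array([1.0, 1.0, 0.0])
doc_same_direction = np.array([2.0, 2.0, 0.0])   # same terms, doubled counts
doc_orthogonal = np.array([0.0, 0.0, 3.0])       # no terms in common
empty_query = np.zeros(3)                        # every query word is out of vocabulary

print(raw_cosine(query_vec, doc_same_direction))  # approximately 1.0
print(raw_cosine(query_vec, doc_orthogonal))      # 0.0
print(np.isnan(raw_cosine(empty_query, doc_orthogonal)))  # True: 0/0 is NaN
```

The NaN arises from dividing zero by zero when one of the vectors has no non-zero entries, which is why such pairs are given a score of 0.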

### Process queries

For each query, a similarity score is computed for every document

In [None]:
rows = []
# For each query (queryIDs and queries are parallel lists)
for queryID, rawquery in zip(queryIDs, queries):

  # Lowercase the query, split on whitespace and remove stop words
  query = [word for word in rawquery.lower().split() if word not in stop_words]

  query_vec = vectorise(query, vocab)

  # Compute cosine similarity for all documents (previously vectorised above)
  similarities = [calculate_cosine_similarity(query_vec, vector) for vector in vectors]

  # Record one result row per (query, document) pair
  for docID, doc, score in zip(documentIDs, documents, similarities):
    rows.append([int(queryID), int(docID), score, rawquery, doc])

# Build the dataframe once: DataFrame.append was removed in pandas 2.0
# and is slow when called inside a loop
df_Results = pd.DataFrame(rows, columns=df_Results.columns)
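
Scoring one document at a time works, but the same scores can be obtained for a whole query in a single matrix operation. The sketch below assumes the document vectors are stacked into a 2-D array with one row per document. The function name `cosine_scores` and the demo data are hypothetical.

```python
import numpy as np

def cosine_scores(query_vec, doc_matrix):
    """Cosine similarity of one query vector against every row of doc_matrix."""
    doc_norms = np.linalg.norm(doc_matrix, axis=1)
    denominators = doc_norms * np.linalg.norm(query_vec)
    dots = doc_matrix @ query_vec
    # Rows with a zero denominator keep the default score of 0
    scores = np.zeros(len(doc_matrix))
    np.divide(dots, denominators, out=scores, where=denominators > 0)
    return scores

demo_matrix = np.array([
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0],   # empty document
])
demo_query = np.array([1.0, 0.0, 0.0])
demo_scores = cosine_scores(demo_query, demo_matrix)
print(demo_scores)
```

In the notebook's terms, `doc_matrix` would correspond to `np.array(vectors)`. Whether the speed-up matters depends on the size of the repository and vocabulary.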

Sort the results by query ID, then by score in descending order within each query. Finally, optionally, retain only the top results for each query, e.g. 10, 50, 100...

In [16]:
df_SortedResults = df_Results.sort_values(by=['Query_ID', 'VSM_Score'], ascending=[True, False])

In [17]:
# Restrict to top 100 results
df_TopResults = df_SortedResults.groupby('Query_ID').head(100).reset_index(drop=True)

In [18]:
df_TopResults.insert(4, 'Rank',0)

In [19]:
df_TopResults['Rank'] = df_TopResults.groupby('Query_ID').cumcount() + 1
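
The sort, top-N and rank steps can be checked on a toy dataframe. The values below are made up purely for illustration, and a cut-off of 2 is used instead of 100 so the effect is visible.

```python
import pandas as pd

# Hypothetical scores for two queries
demo_results = pd.DataFrame({
    "Query_ID": [1, 1, 1, 2, 2],
    "Doc_ID": [10, 11, 12, 10, 11],
    "VSM_Score": [0.2, 0.9, 0.5, 0.1, 0.4],
})

# Query ID ascending, score descending within each query
demo_sorted = demo_results.sort_values(by=["Query_ID", "VSM_Score"], ascending=[True, False])

# Keep the 2 best documents per query, then number them 1..n within each query
demo_top = demo_sorted.groupby("Query_ID").head(2).reset_index(drop=True)
demo_top["Rank"] = demo_top.groupby("Query_ID").cumcount() + 1
print(demo_top)
```

For query 1 this keeps documents 11 and 12 (ranks 1 and 2) and drops document 10, which has the lowest score.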

In [20]:
# Export final results to CSV for final analysis (outside of this notebook)
df_TopResults.to_csv("Export_VSM_Top100_by_Title.csv")
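
The exported CSV is analysed outside this notebook. If the ranking is to be passed to trec_eval, it would need to be in the usual run-file layout of six whitespace-separated fields per line: query ID, the literal `Q0`, document ID, rank, score and a run tag. The following is a minimal sketch of such a conversion, not the conversion actually used for this assignment. The function name `to_trec_run_lines`, the run tag and the demo values are assumptions.

```python
import pandas as pd

def to_trec_run_lines(df_top, run_tag="vsm_title"):
    """Format ranked results as 'qid Q0 docid rank score tag' lines."""
    lines = []
    for row in df_top.itertuples(index=False):
        lines.append(
            f"{int(row.Query_ID)} Q0 {int(row.Doc_ID)} {int(row.Rank)} "
            f"{row.VSM_Score:.6f} {run_tag}"
        )
    return lines

# Hypothetical ranked results for one query
demo_top = pd.DataFrame({
    "Query_ID": [1, 1],
    "Doc_ID": [11, 12],
    "VSM_Score": [0.5, 0.25],
    "Rank": [1, 2],
})
demo_lines = to_trec_run_lines(demo_top)
print("\n".join(demo_lines))
# 1 Q0 11 1 0.500000 vsm_title
# 1 Q0 12 2 0.250000 vsm_title
```

The lines could then be written to a text file and compared against a relevance judgements (qrels) file.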

## Part 2 - Ranking by document contents
In this section we score each search query against the contents (main body) of every document and create a shortlist of the top 100 most similar documents per query (by contents).

### Setup

In [21]:
# Create base dataframe for recording results
df_Results = pd.DataFrame(columns=['Query_ID','Doc_ID', 'VSM_Score'])

In [22]:
df_Results.drop(df_Results.index,inplace=True)

### Bring in the data

Indexed queries and documents prepared in the previous notebook

In [23]:
os.chdir("/content/drive/MyDrive/CA6005I - Mechanics of Search/Assignment1/Files_Indexed")

Document contents file

In [25]:
# Import from prepared CSV file - read doc IDs and contents to array
with open('Indexed_Contents.csv', 'r') as file:
    reader = csv.reader(file)
    documents = []
    documentIDs = []
    for row in reader:
        documentIDs.append(row[1])
        documents.append(row[2])

Search queries file

In [26]:
# Import from prepared CSV file - read query IDs and search strings to array
with open('Indexed_Queries.csv', 'r') as file:
    reader = csv.reader(file)
    queries = []
    queryIDs = []
    for row in reader:
        queries.append(row[2])
        queryIDs.append((row[1]))

### Vectorisation

Preprocessing and stopwords removal

In [27]:
# Split document contents into individual words and remove stop words
def preprocess(documents):
    preprocessed_docs = []
    for doc in documents:
        words = doc.lower().split()
        words = [word for word in words if word not in stop_words]
        preprocessed_docs.append(words)
    return preprocessed_docs

In [28]:
preprocessed_docs = preprocess(documents)

In [29]:
# Create vocabulary from the documents
vocab = sorted(set(word for doc in preprocessed_docs for word in doc))

Vectorisation

In [30]:
# Convert document contents into a vector representation using the vocabulary
def vectorise(doc, vocab):
    vector = np.zeros(len(vocab))
    for word in doc:
        if word in vocab:
            vector[vocab.index(word)] += 1
    return vector

In [31]:
# Vectorize preprocessed documents
vectors = [vectorise(doc, vocab) for doc in preprocessed_docs]

### Similarity

Compute cosine similarity between two vectors

In [32]:
def calculate_cosine_similarity(u, v):
    # Cosine similarity is undefined when either vector has zero length
    # (e.g. a query with no in-vocabulary terms); score those pairs as 0
    denominator = np.linalg.norm(u) * np.linalg.norm(v)
    if denominator == 0:
        return 0.0
    return np.dot(u, v) / denominator

### Process queries

For each query, a similarity score is computed for every document

In [None]:
rows = []
# For each query (queryIDs and queries are parallel lists)
for queryID, rawquery in zip(queryIDs, queries):

  # Lowercase the query, split on whitespace and remove stop words
  query = [word for word in rawquery.lower().split() if word not in stop_words]

  query_vec = vectorise(query, vocab)

  # Compute cosine similarity for all documents (previously vectorised above)
  similarities = [calculate_cosine_similarity(query_vec, vector) for vector in vectors]

  # Record one result row per (query, document) pair
  for docID, score in zip(documentIDs, similarities):
    rows.append([int(queryID), int(docID), score])

# Build the dataframe once: DataFrame.append was removed in pandas 2.0
# and is slow when called inside a loop
df_Results = pd.DataFrame(rows, columns=df_Results.columns)

Sort the results by query ID, then by score in descending order within each query. Finally, optionally, retain only the top results for each query, e.g. 10, 50, 100...

In [36]:
df_SortedResults = df_Results.sort_values(by=['Query_ID', 'VSM_Score'], ascending=[True, False])

In [37]:
# Restrict to top 100 results
df_TopResults = df_SortedResults.groupby('Query_ID').head(100).reset_index(drop=True)

In [38]:
df_TopResults['Rank'] = df_TopResults.groupby('Query_ID').cumcount() + 1

In [39]:
df_TopResults.to_csv("Export_VSM_Top100_by_Content.csv")

## Part 3 - Test a single query

Enter a free-form query to search against the document repository

### Setup

Read indexed document titles data into dataframe - title to be used in search results summary


In [66]:
os.chdir("/content/drive/MyDrive/CA6005I - Mechanics of Search/Assignment1/Files_Indexed")
# Read the indexed titles straight into a dataframe, supplying the column names
df_titles = pd.read_csv("Indexed_Titles.csv", names=['Index','Doc_ID', 'Title'])



Create base dataframe for recording results


In [67]:
# Start from an empty dataframe with the expected columns
df_Results = pd.DataFrame(columns=['Query_ID','Doc_ID', 'VSM_Score', 'Rank', 'Title'])

### Bring in the data

Indexed documents prepared in the previous notebook

Document Contents file

In [68]:
os.chdir("/content/drive/MyDrive/CA6005I - Mechanics of Search/Assignment1/Files_Indexed")

# Import from prepared CSV file - read doc IDs and contents to array
with open('Indexed_Contents.csv', 'r') as file:
    reader = csv.reader(file)
    documents = []
    documentIDs = []
    for row in reader:
        documentIDs.append(row[1])
        documents.append(row[2])

### Vectorisation

Preprocessing and stopwords removal

In [69]:
# Split document contents into individual words and remove stop words
def preprocess(documents):
    preprocessed_docs = []
    for doc in documents:
        words = doc.lower().split()
        words = [word for word in words if word not in stop_words]
        preprocessed_docs.append(words)
    return preprocessed_docs

In [70]:
preprocessed_docs = preprocess(documents)

In [71]:
# Create vocabulary from the documents
vocab = sorted(set(word for doc in preprocessed_docs for word in doc))

Vectorisation

In [72]:
# Convert document contents into a vector representation using the vocabulary
def vectorise(doc, vocab):
    vector = np.zeros(len(vocab))
    for word in doc:
        if word in vocab:
            vector[vocab.index(word)] += 1
    return vector

In [73]:
# Vectorize preprocessed documents
vectors = [vectorise(doc, vocab) for doc in preprocessed_docs]

### Similarity

Compute cosine similarity between two vectors

In [74]:
def calculate_cosine_similarity(u, v):
    # Cosine similarity is undefined when either vector has zero length
    # (e.g. a query with no in-vocabulary terms); score those pairs as 0
    denominator = np.linalg.norm(u) * np.linalg.norm(v)
    if denominator == 0:
        return 0.0
    return np.dot(u, v) / denominator

### Process queries
- Type a query ==> similarity score is computed for every document.

- Results display top 10 ranked documents and a title summary for each.

- Open a document file using the listed document ID.

Enter query

In [None]:
# query = 'what blah blah laws must be OBEYED when constructing aeroelastic models of heated high speed aircraft'
# query = 'fly me to the moon in a high speed turbo jet'
query = 'experimental model techniques and equipment'
# query = 'similarity laws for stressing heated wings'

df_Results.drop(df_Results.index,inplace=True)
processed_query = query.split()

for i in range(len(processed_query)):
    processed_query[i] = processed_query[i].lower()
processed_query = [string for string in processed_query if string not in stop_words]  

query_vec = vectorise(processed_query, vocab)

# Compute cosine similarity for all documents (previously vectorised above)
similarities = [calculate_cosine_similarity(query_vec, vector) for vector in vectors]

# Record one result row per document; Rank and Title are filled in later
rows = [["USER", int(docID), score, 0, ""] for docID, score in zip(documentIDs, similarities)]
# Build the dataframe once: DataFrame.append was removed in pandas 2.0
df_Results = pd.DataFrame(rows, columns=df_Results.columns)

Sort the results by query ID, then by score in descending order. Finally, retain only the top results for the query, here the top 10.

In [79]:
df_SortedResults = []
df_TopResults = []
df_SortedResults = df_Results.sort_values(by=['Query_ID', 'VSM_Score'], ascending=[True, False])
# Restrict to top 10 results
df_TopResults = df_SortedResults.groupby('Query_ID').head(10).reset_index(drop=True)
df_TopResults['Rank'] = df_TopResults.groupby('Query_ID').cumcount() + 1

for index, row in df_titles.iterrows():
  df_TopResults.loc[(df_TopResults.Doc_ID == row['Doc_ID']), 'Title'] = row['Title']


print("--- QUERY: " + query + "\n")
df_TopResults

--- QUERY: experimental model techniques and equipment



| | Query_ID | Doc_ID | VSM_Score | Rank | Title |
|---|---|---|---|---|---|
| 0 | USER | 874 | 0.292153 | 1 | the use of models for the determination of cri... |
| 1 | USER | 879 | 0.257248 | 2 | flutter model testing at transonic speeds |
| 2 | USER | 800 | 0.255377 | 3 | wall interference at transonic speeds on a hem... |
| 3 | USER | 878 | 0.218218 | 4 | experimental model techniques and equipment fo... |
| 4 | USER | 594 | 0.200446 | 5 | wind tunnel techniques for the measurements of... |
| 5 | USER | 271 | 0.2 | 6 | an experimental test of compressibility transf... |
| 6 | USER | 462 | 0.185695 | 7 | photo-thermoelasticity |
| 7 | USER | 713 | 0.180894 | 8 | static longitudinal stability characteristics ... |
| 8 | USER | 1164 | 0.179928 | 9 | effect of ground proximity on the aerodynamic ... |
| 9 | USER | 836 | 0.17609 | 10 | analytical and experimental investigation of s... |
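
Filling in titles row by row with `iterrows` walks the whole titles table for a ten-row result. One alternative, sketched here with hypothetical demo data, is a left merge on `Doc_ID`, which keeps the ranked order of the results. The placeholder `Title` column is dropped first so the merge does not produce suffixed duplicate columns.

```python
import pandas as pd

# Hypothetical top results with an empty placeholder Title column
demo_top = pd.DataFrame({
    "Doc_ID": [20, 10],
    "VSM_Score": [0.5, 0.25],
    "Rank": [1, 2],
    "Title": ["", ""],
})
# Hypothetical titles table
demo_titles = pd.DataFrame({
    "Doc_ID": [10, 20, 30],
    "Title": ["first demo title", "second demo title", "third demo title"],
})

demo_with_titles = (
    demo_top.drop(columns="Title")
    .merge(demo_titles[["Doc_ID", "Title"]], on="Doc_ID", how="left")
)
print(demo_with_titles)
```

A left merge leaves the title as missing (NaN) for any document ID absent from the titles table, rather than silently dropping the row.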


In [80]:
intdocno = 874

os.chdir("/content/drive/MyDrive/CA6005I - Mechanics of Search/Assignment1/Files_Individual_Docs")
xml_file = "document_" + str(intdocno) + ".xml"

# parse the XML file
tree = ET.parse(xml_file)

# get the root element of the XML file
root = tree.getroot()

print("--- QUERY: " + query + "\n")
print("--- DOCUMENT: " + "\n")

# print the contents of the XML file
for child in root:
    print(ET.tostring(child, encoding='unicode'))

--- QUERY: experimental model techniques and equipment

--- DOCUMENT: 

<docno>874</docno>

<title>the use of models for the determination of critical flutter speeds .</title>

<author>duncan, w.j.</author>

<bib>r + m 1425, july 1931 .</bib>

<text>the use of models for the determination of critical flutter speeds .
the use of model tests in the prediction of full-scale critical flutter
speeds is now well established, and the technique of such tests is
therefore worthy of discussion . in order to obtain critical speeds for
the model within the speed range of ordinary wind tunnels it is
necessary that the model should differ in some respect from a mere small
suggested by mckinnon wood the modification of the model consists in a
reduction of its effective stiffnesses . this method has the defect /in
most cases probably not serious/ that the model experiment is conducted
at a reynolds number much below that for full-scale . in the present
paper it is pointed out that an alternative metho