# Multinomial Language Model

This notebook implements a multinomial language model for retrieving documents in response to a search query.

The documents, from a fixed repository, are scored and ranked for similarity against a test set of queries. The output results are used for evaluation using the trec_eval tool.

In the final section, a user can enter a free-form text query and run it against the same document repository with the same ranking model, which is useful for exploratory testing.

## Imports and setup

In [1]:
import math
import numpy as np
import pandas as pd
from collections import Counter
import csv
import os
import nltk
from nltk.corpus import reuters
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.text import log
import xml.etree.ElementTree as ET

nltk.download('reuters')
nltk.download('punkt')
nltk.download('stopwords')

stop_words = set(stopwords.words("english"))

[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [2]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Part 1 - Ranking by document titles
In this section we score each search query against the document titles and create a shortlist of the top 100 relevant documents per query (by title).

### Setup

In [None]:
# Create base dataframe for recording results
df_Results = pd.DataFrame(columns=['Query_ID','Doc_ID', 'Multinomial_Score','Query_Desc', 'Doc_Desc'])

In [None]:
df_Results.drop(df_Results.index,inplace=True)

### Bring in the data

Indexed queries and documents prepared in the previous notebook.

In [None]:
os.chdir("/content/drive/MyDrive/CA6005I - Mechanics of Search/Assignment1/Files_Indexed")

Document titles file

In [None]:
# Import from prepared CSV file - read doc IDs and titles to array
with open('Indexed_Titles.csv', 'r') as file:
    reader = csv.reader(file)
    documents = []
    documentIDs = []
    for row in reader:
        documentIDs.append(row[1])
        documents.append(row[2])

Search queries file

In [None]:
# Import from prepared CSV file - read query IDs and search strings to array
with open('Indexed_Queries.csv', 'r') as file:
    reader = csv.reader(file)
    queries = []
    queryIDs = []
    for row in reader:
        queries.append(row[2])
        queryIDs.append(row[1])

### Preprocessing

In [None]:
# Tokenize the documents into words
tokenized_docs = []
for doc in documents:
    words = doc.lower().split()
    words = [word for word in words if word not in stop_words]
    tokenized_docs.append(words)

# Compute the vocabulary
vocab = set([word for doc in tokenized_docs for word in doc])

# Compute the document-term matrix
doc_term_matrix = np.zeros((len(documents), len(vocab)))
for i, doc in enumerate(tokenized_docs):
    for j, word in enumerate(vocab):
        doc_term_matrix[i, j] = doc.count(word)
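
The nested loop above calls `doc.count(word)` once for every document and vocabulary word, which can become slow as the vocabulary grows. The following is a minimal sketch of an equivalent construction that touches only the non-zero cells. The name `vocab_index` and the toy documents are assumptions introduced here for illustration; they are not part of the notebook.

```python
from collections import Counter

import numpy as np

# Toy tokenized documents standing in for `tokenized_docs` (illustrative only).
toy_docs = [
    ["hypersonic", "flow", "flow"],
    ["viscous", "flow"],
]

# Sort the vocabulary so each word gets a stable column index.
vocab_index = {
    word: j for j, word in enumerate(sorted({w for d in toy_docs for w in d}))
}

# Fill only the non-zero cells: one Counter pass per document.
matrix = np.zeros((len(toy_docs), len(vocab_index)))
for i, doc in enumerate(toy_docs):
    for word, count in Counter(doc).items():
        matrix[i, vocab_index[word]] = count

print(matrix)
```

A dictionary with sorted keys also makes the column order reproducible across runs, which iterating over a `set` of strings does not guarantee.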

In [None]:
preprocessed_docs = []
for doc in documents:
    words = doc.lower().split()
    words = [word for word in words if word not in stop_words]
    preprocessed_docs.append(words)
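
Because `str.split()` leaves punctuation attached to words, `flow,` and `flow` are counted as different terms. Below is a hedged sketch of a shared preprocessing helper that keeps only alphanumeric runs. The function name `preprocess` and the tiny stop-word set are assumptions for illustration; the notebook itself uses NLTK's English stop-word list.

```python
import re

# Tiny illustrative stop-word set; the notebook uses NLTK's English list instead.
DEMO_STOP_WORDS = {"the", "of", "a", "in", "with"}

def preprocess(text, stop_words=DEMO_STOP_WORDS):
    """Lowercase, split on non-alphanumeric characters, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in stop_words]

print(preprocess("Inviscid hypersonic airflows, with coupled non-equilibrium processes ."))
```

Note that this pattern also splits hyphenated terms such as `non-equilibrium` into two tokens, which may or may not be desirable for this collection.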

### Process queries and scores

For each query, a similarity score is computed for every document as the dot product of the query and document term-count vectors.

In [None]:
rows = []
# For each query, paired with its ID
for queryID, rawquery in zip(queryIDs, queries):

    tokenized_query = rawquery.lower().split()
    tokenized_query = [token for token in tokenized_query if token not in stop_words]

    # Compute the query-term vector (same vocabulary order as doc_term_matrix)
    query_term_vector = np.zeros(len(vocab))
    for i, word in enumerate(vocab):
        query_term_vector[i] = tokenized_query.count(word)

    # Compute the document scores: dot product of term counts
    doc_scores = np.dot(doc_term_matrix, query_term_vector)

    # Record one result row per document
    for docID, score, doc in zip(documentIDs, doc_scores, documents):
        rows.append([int(queryID), int(docID), score, rawquery, doc])

# Build the results dataframe in one step (DataFrame.append was removed in pandas 2.0)
df_Results = pd.DataFrame(rows, columns=df_Results.columns)
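
The loop above scores each document by the dot product of raw term counts between query and document. For comparison, here is a minimal sketch of query-likelihood scoring under a unigram multinomial model with add-one (Laplace) smoothing. The function name, the smoothing choice, and the toy data are assumptions made for illustration and are not the notebook's method. The multinomial coefficient is omitted because it is identical for every document and does not affect the ranking.

```python
import math
from collections import Counter

def query_log_likelihood(query_tokens, doc_tokens, vocab_size):
    """Log P(query | doc) under a unigram model with add-one smoothing."""
    counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    return sum(
        math.log((counts[w] + 1) / (doc_len + vocab_size))
        for w in query_tokens
    )

# Toy collection (illustrative only).
toy_docs = {
    "d1": ["hypersonic", "flow", "flow"],
    "d2": ["viscous", "boundary", "layer"],
}
vocab_size = len({w for doc in toy_docs.values() for w in doc})
query = ["hypersonic", "flow"]

scores = {
    d: query_log_likelihood(query, toks, vocab_size)
    for d, toks in toy_docs.items()
}
best = max(scores, key=scores.get)
print(best, scores)
```

Smoothing keeps the score finite when a query term does not occur in a document, whereas an unsmoothed model would assign that document zero probability.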

Sort the results by query ID, then by score in descending order within each query. Finally, optionally, retain only the top results for each query, e.g. 10, 50, 100...

In [None]:
df_SortedResults = df_Results.sort_values(by=['Query_ID', 'Multinomial_Score'], ascending=[True, False])

In [None]:
# Restrict to top 100 results
df_TopResults = df_SortedResults.groupby('Query_ID').head(100).reset_index(drop=True)

In [None]:
df_TopResults.insert(4, 'Rank',0)

In [None]:
df_TopResults['Rank'] = df_TopResults.groupby('Query_ID').cumcount() + 1
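
The sort, truncate, and rank steps above can be checked on a small example. The sketch below uses a toy dataframe (the values are made up for illustration) with the same column names and keeps the top 2 results per query instead of 100.

```python
import pandas as pd

toy = pd.DataFrame({
    "Query_ID": [2, 1, 1, 2, 1],
    "Doc_ID": [10, 11, 12, 13, 14],
    "Multinomial_Score": [3.0, 1.0, 5.0, 4.0, 2.0],
})

# Query IDs ascending, scores descending within each query.
ranked = toy.sort_values(
    by=["Query_ID", "Multinomial_Score"], ascending=[True, False]
)

# Keep the top 2 per query, then number the rows 1..k inside each group.
top = ranked.groupby("Query_ID").head(2).reset_index(drop=True)
top["Rank"] = top.groupby("Query_ID").cumcount() + 1
print(top)
```

Because `cumcount()` numbers rows in their current order, the rank is only meaningful if the frame has already been sorted by score.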

In [None]:
# Export final results to CSV for final analysis (outside of this notebook)
df_TopResults.to_csv("Export_Multinomial_Top100_by_Title.csv")
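
trec_eval reads a whitespace-separated run file with six columns per line: query ID, the literal `Q0`, document ID, rank, score, and a run tag. The following sketch shows one way the ranked dataframe could be converted to that layout. The function name `to_trec_run`, the run tag, and the toy rows are hypothetical and not taken from the notebook.

```python
import pandas as pd

def to_trec_run(df, score_col="Multinomial_Score", run_tag="multinomial_title"):
    """Format ranked results as TREC run lines: qid Q0 docno rank score tag."""
    lines = []
    for row in df.itertuples(index=False):
        lines.append(
            f"{int(row.Query_ID)} Q0 {int(row.Doc_ID)} {int(row.Rank)} "
            f"{float(getattr(row, score_col)):.4f} {run_tag}"
        )
    return lines

# Toy ranked results (illustrative only).
toy = pd.DataFrame({
    "Query_ID": [1, 1],
    "Doc_ID": [401, 625],
    "Multinomial_Score": [12.0, 7.0],
    "Rank": [1, 2],
})
run_lines = to_trec_run(toy)
print("\n".join(run_lines))
```

The resulting lines can be written to a plain text file and passed to trec_eval together with the relevance judgments file.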

## Part 2 - Ranking by document contents
In this section we score each search query against the document contents (the main body of each document) and create a shortlist of the top 100 relevant documents per query (by contents).

### Setup

In [None]:
# Create base dataframe for recording results
df_Results = pd.DataFrame(columns=['Query_ID','Doc_ID', 'Multinomial_Score'])

In [None]:
df_Results.drop(df_Results.index,inplace=True)

### Bring in the data

In [None]:
os.chdir("/content/drive/MyDrive/CA6005I - Mechanics of Search/Assignment1/Files_Indexed")

Indexed queries and documents prepared in the previous notebook.

Document contents file

In [None]:
# Import from prepared CSV file - read doc IDs and contents to array
with open('Indexed_Contents.csv', 'r') as file:
    reader = csv.reader(file)
    documents = []
    documentIDs = []
    for row in reader:
        documentIDs.append(row[1])
        documents.append(row[2])

Search queries file

In [None]:
# Import from prepared CSV file - read query IDs and search strings to array
with open('Indexed_Queries.csv', 'r') as file:
    reader = csv.reader(file)
    queries = []
    queryIDs = []
    for row in reader:
        queries.append(row[2])
        queryIDs.append(row[1])

### Preprocessing

In [None]:
# Tokenize the documents into words
tokenized_docs = []
for doc in documents:
    words = doc.lower().split()
    words = [word for word in words if word not in stop_words]
    tokenized_docs.append(words)

# Compute the vocabulary
vocab = set([word for doc in tokenized_docs for word in doc])

# Compute the document-term matrix
doc_term_matrix = np.zeros((len(documents), len(vocab)))
for i, doc in enumerate(tokenized_docs):
    for j, word in enumerate(vocab):
        doc_term_matrix[i, j] = doc.count(word)

In [None]:
preprocessed_docs = []
for doc in documents:
    words = doc.lower().split()
    words = [word for word in words if word not in stop_words]
    preprocessed_docs.append(words)

### Similarity scoring

For each query, a similarity score is computed for every document as the dot product of the query and document term-count vectors.

In [None]:
rows = []
# For each query, paired with its ID
for queryID, rawquery in zip(queryIDs, queries):

    tokenized_query = rawquery.lower().split()
    tokenized_query = [token for token in tokenized_query if token not in stop_words]

    # Compute the query-term vector (same vocabulary order as doc_term_matrix)
    query_term_vector = np.zeros(len(vocab))
    for i, word in enumerate(vocab):
        query_term_vector[i] = tokenized_query.count(word)

    # Compute the document scores: dot product of term counts
    doc_scores = np.dot(doc_term_matrix, query_term_vector)

    # Record one result row per document
    for docID, score in zip(documentIDs, doc_scores):
        rows.append([int(queryID), int(docID), score])

# Build the results dataframe in one step (DataFrame.append was removed in pandas 2.0)
df_Results = pd.DataFrame(rows, columns=df_Results.columns)
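
Since every query is scored against the same document-term matrix, the per-query loop above can also be expressed as a single matrix product. The sketch below assumes a fixed word-to-column mapping (`vocab_index`) and a helper `count_matrix`, both introduced here for illustration, and uses toy data rather than the real collection.

```python
import numpy as np

# Fixed word-to-column mapping shared by documents and queries (illustrative).
vocab_index = {"flow": 0, "hypersonic": 1, "viscous": 2}

def count_matrix(token_lists, vocab_index):
    """Rows are texts, columns are vocabulary words, cells are term counts."""
    m = np.zeros((len(token_lists), len(vocab_index)))
    for i, tokens in enumerate(token_lists):
        for tok in tokens:
            j = vocab_index.get(tok)
            if j is not None:  # ignore words outside the vocabulary
                m[i, j] += 1
    return m

docs = count_matrix([["hypersonic", "flow", "flow"], ["viscous", "flow"]], vocab_index)
queries = count_matrix([["hypersonic", "flow"], ["viscous", "unknown"]], vocab_index)

# scores[q, d] is the count dot product between query q and document d.
scores = queries @ docs.T
print(scores)
```

Each row of `scores` then corresponds to one query and can be sorted independently to obtain that query's ranking.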

Sort the results by query ID, then by score in descending order within each query. Finally, optionally, retain only the top results for each query, e.g. 10, 50, 100...

In [None]:
df_TopResults

In [None]:
df_SortedResults = df_Results.sort_values(by=['Query_ID', 'Multinomial_Score'], ascending=[True, False])

In [None]:
# Restrict to top 100 results
df_TopResults = df_SortedResults.groupby('Query_ID').head(100).reset_index(drop=True)

In [None]:
df_TopResults['Rank'] = df_TopResults.groupby('Query_ID').cumcount() + 1

In [None]:
df_TopResults.to_csv("Export_Multinomial_Top100_Queries_by_Content.csv")

## Part 3 - Test a single query

Enter a freeform query search against the documents repository.

### Setup

Read the indexed document titles into a dataframe; the titles are used in the search results summary.


In [3]:
os.chdir("/content/drive/MyDrive/CA6005I - Mechanics of Search/Assignment1/Files_Indexed")
# Read document IDs and titles directly into a dataframe
df_titles = pd.read_csv("Indexed_Titles.csv", names=['Index','Doc_ID', 'Title'])



Create base dataframe for recording results


In [7]:
df_Results =[]
df_Results = pd.DataFrame(columns=['Query_ID','Doc_ID', 'LM_Score', 'Rank', 'Title'])
df_Results.drop(df_Results.index,inplace=True)

### Bring in the data

Indexed queries and documents prepared in the previous notebook.

Document contents file

In [4]:
os.chdir("/content/drive/MyDrive/CA6005I - Mechanics of Search/Assignment1/Files_Indexed")

# Import from prepared CSV file - read doc IDs and contents to array
with open('Indexed_Contents.csv', 'r') as file:
    reader = csv.reader(file)
    documents = []
    documentIDs = []
    for row in reader:
        documentIDs.append(row[1])
        documents.append(row[2])

### Preprocessing

In [5]:
# Tokenize the documents into words
tokenized_docs = []
for doc in documents:
    words = doc.lower().split()
    words = [word for word in words if word not in stop_words]
    tokenized_docs.append(words)

# Compute the vocabulary
vocab = set([word for doc in tokenized_docs for word in doc])

# Compute the document-term matrix
doc_term_matrix = np.zeros((len(documents), len(vocab)))
for i, doc in enumerate(tokenized_docs):
    for j, word in enumerate(vocab):
        doc_term_matrix[i, j] = doc.count(word)

### Process queries
- Type a query ==> similarity score is computed for every document.

- Results display top 10 ranked documents and a title summary for each.

- Open a document file using the listed document ID.

Enter query

In [None]:
#query = "what similarity laws must be obeyed when constructing aeroelastic models of heated high speed aircraft"
#query = "what is the capital of France?"
query = "inviscid hypersonic airflows with coupled"


doc_scores = []
df_Results.drop(df_Results.index,inplace=True)

tokenized_query = query.lower().split()
tokenized_query = [string for string in tokenized_query if string not in stop_words]  

# Compute the query-term vector
query_term_vector = np.zeros(len(vocab))
for i, word in enumerate(vocab):
    query_term_vector[i] = tokenized_query.count(word)

# Compute the document scores
doc_scores = np.dot(doc_term_matrix, query_term_vector)

# Record one result row per document
rows = [["USER", int(docID), score, 0, ""] for docID, score in zip(documentIDs, doc_scores)]
# Build the results dataframe in one step (DataFrame.append was removed in pandas 2.0)
df_Results = pd.DataFrame(rows, columns=df_Results.columns)

Sort the results by score in descending order, then retain only the top results for the query (here the top 10).

In [14]:
df_SortedResults = []
df_TopResults = []
df_SortedResults = df_Results.sort_values(by=['Query_ID', 'LM_Score'], ascending=[True, False])
# Restrict to top 10 results
df_TopResults = df_SortedResults.groupby('Query_ID').head(10).reset_index(drop=True)
df_TopResults['Rank'] = df_TopResults.groupby('Query_ID').cumcount() + 1

for index, row in df_titles.iterrows():
  df_TopResults.loc[(df_TopResults.Doc_ID == row['Doc_ID']), 'Title'] = row['Title']


print("--- QUERY: " + query + "\n")
df_TopResults

--- QUERY: inviscid hypersonic airflows with coupled



| | Query_ID | Doc_ID | LM_Score | Rank | Title |
|---|---|---|---|---|---|
| 0 | USER | 401 | 12.0 | 1 | inviscid hypersonic airflows with coupled non-... |
| 1 | USER | 625 | 7.0 | 2 | viscous and inviscid nonequilibrium gas flows |
| 2 | USER | 1310 | 7.0 | 3 | survey of inviscid hypersonic flow theory for ... |
| 3 | USER | 37 | 7.0 | 4 | a new technique for investigating heat transfe... |
| 4 | USER | 373 | 7.0 | 5 | the generalized expansion method and its appli... |
| 5 | USER | 572 | 6.0 | 6 | boundary layer displacement and leading edge b... |
| 6 | USER | 1296 | 6.0 | 7 | non-equilibrium expansions of air with coupled... |
| 7 | USER | 25 | 6.0 | 8 | inviscid hypersonic flow over blunt-nosed slen... |
| 8 | USER | 305 | 6.0 | 9 | hypersonic strong viscous interaction on a fla... |
| 9 | USER | 540 | 5.0 | 10 | use of local similarity concepts in hypersonic... |
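
For a single query, the top results and their titles can also be obtained without building a row for every document or looping over the titles dataframe. The sketch below is one possible approach, assuming the scores are held in a NumPy array aligned with the document IDs. The toy IDs, scores, and titles are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data standing in for documentIDs, doc_scores and df_titles (illustrative).
doc_ids = np.array([401, 625, 37, 540])
doc_scores = np.array([12.0, 7.0, 7.0, 5.0])
titles = pd.DataFrame({
    "Doc_ID": [37, 401, 540, 625],
    "Title": ["t37", "t401", "t540", "t625"],
})

k = 3
# Stable sort on negated scores: highest first, ties keep document order.
order = np.argsort(-doc_scores, kind="stable")[:k]

top = pd.DataFrame({
    "Doc_ID": doc_ids[order],
    "LM_Score": doc_scores[order],
    "Rank": np.arange(1, len(order) + 1),
})
# A left merge keeps the ranked order and attaches each title in one step.
top = top.merge(titles, on="Doc_ID", how="left")
print(top)
```

The merge replaces the row-by-row title lookup and leaves the title empty (NaN) for any document ID that is missing from the titles file.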


Display document

In [15]:
intdocno = 401

os.chdir("/content/drive/MyDrive/CA6005I - Mechanics of Search/Assignment1/Files_Individual_Docs")
xml_file = "document_" + str(intdocno) + ".xml"

# parse the XML file
tree = ET.parse(xml_file)

# get the root element of the XML file
root = tree.getroot()

print("--- QUERY: " + query + "\n")
print("--- DOCUMENT: " + "\n")

# print the contents of the XML file
for child in root:
    print(ET.tostring(child, encoding='unicode'))

--- QUERY: inviscid hypersonic airflows with coupled

--- DOCUMENT: 

<docno>401</docno>

<title>inviscid hypersonic airflows with coupled non-equilibrium
processes .</title>

<author>hall,j.g., eschenroeder,a.w. and marrone,p.v.</author>

<bib>ias paper 62-67, 1962.</bib>

<text>inviscid hypersonic airflows with coupled non-equilibrium
processes .
  analyses have been made of the effects of coupled chemical rate
processes in external inviscid hypersonic airflows at high enthalpy
levels .  exact (numerical) solutions have been obtained by the
inverse method for inviscid airflow over a near-spherical nose
under flight conditions where substantial nonequilibrium prevails
through the nose region .  typical conditions considered include
nose radii of the order of 1 ft at an altitude of 250,000 ft and
velocities of 15,000 and 23,000 ft per sec .
  the results illustrate the general importance of the coupling
among the reactions considered .  these included
dissociation-recombination, bimole