
In [1]:
# Import NLTK and download the tokenizer and stopword data
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.stem import PorterStemmer

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [2]:
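# Documents
# Note: a backslash at the end of a line inside a string literal joins the
# next line with no space in between, so "Law." and "He's" fuse into
# "Law.He's"; the tokenized output further down reflects this.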
document_D1 = "Macavity's a Mystery Cat: he's called the Hidden Paw—\
For he's the master criminal who can defy the Law.\
He's the bafflement of Scotland Yard, the Flying Squad's despair:\
For when they reach the scene of crime—Macavity's not there!"

document_D2 = "Macavity, Macavity, there's no one like Macavity,\
He's broken every human law, he breaks the law of gravity.\
His powers of levitation would make a fakir stare,\
And when you reach the scene of crime—Macavity's not there!\
You may seek him in the basement, you may look up in the air—\
But I tell you once and once again, Macavity's not there!"

# Queries
VSM_query = "Cats breaking the law"
boolean_query = "cat AND breaking AND law"

Part a) Tokenize the documents, remove whitespace characters and punctuation, and case-fold to lowercase.


In [3]:
# Function to tokenise, remove punctuation, and case fold to lower case
def preprocess(document):
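  # Split the text into word and punctuation tokens
  tokens = word_tokenize(document)
  # Keep purely alphabetic tokens and lowercase them; any token that still
  # contains a non-letter (e.g. "crime—Macavity") is dropped entirely
  tokens = [word.lower() for word in tokens if word.isalpha()]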
  return tokens

In [4]:
D1_processed = preprocess(document_D1)
D2_processed = preprocess(document_D2)
VSM_processed = preprocess(VSM_query)

In [5]:
print(D1_processed)
print(D2_processed)
print(VSM_processed)

['macavity', 'a', 'mystery', 'cat', 'he', 'called', 'the', 'hidden', 'he', 'the', 'master', 'criminal', 'who', 'can', 'defy', 'the', 'the', 'bafflement', 'of', 'scotland', 'yard', 'the', 'flying', 'squad', 'despair', 'for', 'when', 'they', 'reach', 'the', 'scene', 'of', 'not', 'there']
['macavity', 'macavity', 'there', 'no', 'one', 'like', 'macavity', 'he', 'broken', 'every', 'human', 'law', 'he', 'breaks', 'the', 'law', 'of', 'powers', 'of', 'levitation', 'would', 'make', 'a', 'fakir', 'stare', 'and', 'when', 'you', 'reach', 'the', 'scene', 'of', 'not', 'there', 'you', 'may', 'seek', 'him', 'in', 'the', 'basement', 'you', 'may', 'look', 'up', 'in', 'the', 'i', 'tell', 'you', 'once', 'and', 'once', 'again', 'macavity', 'not', 'there']
['cats', 'breaking', 'the', 'law']
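
The lists above are missing some words, for example `crime` in both documents and `law` in document 1. The `word.isalpha()` filter discards any token that still contains a non-letter, such as `crime—Macavity` or the fused `Law.He`. As a rough alternative, assuming it is acceptable to split on every non-letter character, the standard-library sketch below keeps those words. `simple_tokenize` is a hypothetical helper, not part of NLTK, and it also keeps clitic fragments such as the `s` left over from `Macavity's`.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and return every maximal run of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())

# Sample taken from the last line of document_D1
sample = "the scene of crime—Macavity's not there!"
tokens = simple_tokenize(sample)
print(tokens)
# ['the', 'scene', 'of', 'crime', 'macavity', 's', 'not', 'there']
```

The stray `s` would have to be filtered separately (NLTK's English stopword list, used in Part b, happens to include it), so this is a trade-off rather than a drop-in replacement for `preprocess`.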


Part b) Remove stop words from both the VSM query and documents.

In [6]:
# English stopwords
stop_words=set(stopwords.words("english"))

In [7]:
# Function to remove stopwords
def preprocess_stopwords(document):
  # Keep only the tokens that are not in the stopword set
  filtered_tokens = [word for word in document if word not in stop_words]
  return filtered_tokens

In [8]:
D1_stopwords = preprocess_stopwords(D1_processed)
D2_stopwords = preprocess_stopwords(D2_processed)
VSM_stopwords = preprocess_stopwords(VSM_processed)

In [9]:
print(D1_stopwords)
print(D2_stopwords)
print(VSM_stopwords)

['macavity', 'mystery', 'cat', 'called', 'hidden', 'master', 'criminal', 'defy', 'bafflement', 'scotland', 'yard', 'flying', 'squad', 'despair', 'reach', 'scene']
['macavity', 'macavity', 'one', 'like', 'macavity', 'broken', 'every', 'human', 'law', 'breaks', 'law', 'powers', 'levitation', 'would', 'make', 'fakir', 'stare', 'reach', 'scene', 'may', 'seek', 'basement', 'may', 'look', 'tell', 'macavity']
['cats', 'breaking', 'law']
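
The order of the steps matters here: the stopword entries are lowercase, so tokens need to be case-folded (Part a) before they are filtered. The sketch below illustrates this with a tiny hand-picked set, `TINY_STOP_WORDS`, which is a hypothetical stand-in for the much longer NLTK list; `remove_stopwords` is likewise an illustrative name.

```python
# Hypothetical, hand-picked stand-in for NLTK's English stopword list
TINY_STOP_WORDS = {"the", "of", "a", "he"}

def remove_stopwords(tokens, stop_words):
    """Return the tokens that are not in stop_words, preserving order."""
    return [token for token in tokens if token not in stop_words]

raw = ["The", "master", "criminal"]
not_folded = remove_stopwords(raw, TINY_STOP_WORDS)
folded = remove_stopwords([t.lower() for t in raw], TINY_STOP_WORDS)

print(not_folded)  # ['The', 'master', 'criminal']  ('The' slips through)
print(folded)      # ['master', 'criminal']
```

Storing the stopwords in a `set`, as the notebook does, keeps each membership test constant-time on average.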


Part c) Stem the documents and the queries using the Porter stemming algorithm.

In [10]:
# Initialise porter stemmer
porter = PorterStemmer()

# Function to stem using Porter stemming algorithm
def preprocess_stemming(document):
  stemmed_tokens = [porter.stem(word) for word in document]
  return stemmed_tokens

In [11]:
D1_stemmed = preprocess_stemming(D1_stopwords)
D2_stemmed = preprocess_stemming(D2_stopwords)
VSM_stemmed = preprocess_stemming(VSM_stopwords)

In [12]:
print(D1_stemmed)
print(D2_stemmed)
print(VSM_stemmed)

['macav', 'mysteri', 'cat', 'call', 'hidden', 'master', 'crimin', 'defi', 'bafflement', 'scotland', 'yard', 'fli', 'squad', 'despair', 'reach', 'scene']
['macav', 'macav', 'one', 'like', 'macav', 'broken', 'everi', 'human', 'law', 'break', 'law', 'power', 'levit', 'would', 'make', 'fakir', 'stare', 'reach', 'scene', 'may', 'seek', 'basement', 'may', 'look', 'tell', 'macav']
['cat', 'break', 'law']
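
Stemming is what lets the query match the documents at all: `cats` and `breaking` in the query become `cat` and `break`, the same stems produced by `cat` and `breaks` in the documents. The irregular form `broken` is left unchanged, so it does not match `break`. The sketch below checks the overlap using the stemmed lists copied from the output above; the variable names are illustrative only.

```python
# Stemmed tokens copied from the printed output above
d1_terms = ['macav', 'mysteri', 'cat', 'call', 'hidden', 'master', 'crimin',
            'defi', 'bafflement', 'scotland', 'yard', 'fli', 'squad',
            'despair', 'reach', 'scene']
d2_terms = ['macav', 'macav', 'one', 'like', 'macav', 'broken', 'everi',
            'human', 'law', 'break', 'law', 'power', 'levit', 'would', 'make',
            'fakir', 'stare', 'reach', 'scene', 'may', 'seek', 'basement',
            'may', 'look', 'tell', 'macav']
query_terms = ['cat', 'break', 'law']

# Query stems that also occur in each document
shared_d1 = sorted(set(query_terms) & set(d1_terms))
shared_d2 = sorted(set(query_terms) & set(d2_terms))

print("Document 1 shares:", shared_d1)  # ['cat']
print("Document 2 shares:", shared_d2)  # ['break', 'law']
```

Neither document contains all three query stems.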


Part d) Construct an inverted index with terms in dictionary order.

Inverted indexing help: https://www.geeksforgeeks.org/create-inverted-index-for-file-using-python/

In [13]:
# Collect the unique stemmed terms from both documents
terms = list(set(D1_stemmed + D2_stemmed))

In [14]:
# Sorting terms alphabetically
terms.sort()
print(terms)

['bafflement', 'basement', 'break', 'broken', 'call', 'cat', 'crimin', 'defi', 'despair', 'everi', 'fakir', 'fli', 'hidden', 'human', 'law', 'levit', 'like', 'look', 'macav', 'make', 'master', 'may', 'mysteri', 'one', 'power', 'reach', 'scene', 'scotland', 'seek', 'squad', 'stare', 'tell', 'would', 'yard']


In [15]:
inverted_index = {}

# Creating inverted index
for term in terms:
  documents = []
  if term in D1_stemmed:
    documents.append("Document 1")
  if term in D2_stemmed:
    documents.append("Document 2")
  inverted_index[term] = documents

In [16]:
# Printing Inverted Index
for term, documents in inverted_index.items():
  print(term, "->",", ".join(documents))

bafflement -> Document 1
basement -> Document 2
break -> Document 2
broken -> Document 2
call -> Document 1
cat -> Document 1
crimin -> Document 1
defi -> Document 1
despair -> Document 1
everi -> Document 2
fakir -> Document 2
fli -> Document 1
hidden -> Document 1
human -> Document 2
law -> Document 2
levit -> Document 2
like -> Document 2
look -> Document 2
macav -> Document 1, Document 2
make -> Document 2
master -> Document 1
may -> Document 2
mysteri -> Document 1
one -> Document 2
power -> Document 2
reach -> Document 1, Document 2
scene -> Document 1, Document 2
scotland -> Document 1
seek -> Document 2
squad -> Document 1
stare -> Document 2
tell -> Document 2
would -> Document 2
yard -> Document 1


Part e1) Perform a simple match using the AND operator