## Latent Semantic Indexing (LSI)
Latent Semantic Indexing (LSI) is a technique in natural language processing and information retrieval that uncovers the underlying relationships between terms and documents. **By capturing the latent structure in the data, it can retrieve relevant documents even when they do not share the exact wording of a query.**

- **Term-Document Matrix:** The initial step in LSI involves creating a term-document matrix where each row represents a unique term, and each column represents a document. The entries in the matrix indicate the frequency of terms in the documents.

- **Singular Value Decomposition (SVD):** LSI applies SVD to the term-document matrix \( A \), factoring it as \( A = U \Sigma V^T \):
  - \( U \): Term matrix, where each row represents a term in the latent space.
  - \( \Sigma \): Diagonal matrix of singular values, sorted in decreasing order, representing the importance of each dimension.
  - \( V^T \): Document matrix, where each column represents a document in the latent space.

- **Dimensionality Reduction:** By keeping the top \( k \) singular values from \( \Sigma \), together with the corresponding columns of \( U \) and rows of \( V^T \), LSI reduces the dimensionality of the original term-document matrix.

**This reduction captures the most significant patterns in the data while ignoring noise and less important details.**
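
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the notebook's own implementation: the helper name `truncated_svd` and the toy matrix `A` are assumptions made only for this example.

```python
import numpy as np

def truncated_svd(A, k):
    """Return the rank-k factors (U_k, s_k, Vt_k) of matrix A."""
    # full_matrices=False gives the compact SVD: U is (m, r), Vt is (r, n)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

# Hypothetical toy matrix: 4 terms (rows) x 3 documents (columns)
A = np.array([
    [1, 1, 0],
    [1, 2, 0],
    [0, 0, 1],
    [0, 0, 2],
], dtype=float)

U_k, s_k, Vt_k = truncated_svd(A, k=2)
A_k = U_k @ np.diag(s_k) @ Vt_k   # rank-2 approximation of A
print(np.round(A_k, 2))
```

Note that `np.linalg.svd` returns the third factor already transposed, so `Vt_k` holds one column per document.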

### Library Initialization

In [1]:
import numpy as np
from numpy.linalg import norm
import pandas as pd
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

In [2]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Error loading punkt: <urlopen error [Errno 8] nodename nor
[nltk_data]     servname provided, or not known>
[nltk_data] Error loading stopwords: <urlopen error [Errno 8] nodename
[nltk_data]     nor servname provided, or not known>


False

### Defining the Preprocessing Function

In [3]:
def preprocess(text):
    """Lowercase, strip punctuation and digits, drop stop words, and stem each string in `text`."""
    clean_text = []
    stemmer = PorterStemmer()
    stop_words = set(stopwords.words('english'))

    for t in text:
        # Remove punctuation, then digits
        clean = re.sub(r'[^\w\s]', '', t.lower())
        clean = re.sub(r'\d+', '', clean)
        tokens = word_tokenize(clean)
        stemmed_tokens = [stemmer.stem(token) for token in tokens if token not in stop_words]
        clean_text.append(' '.join(stemmed_tokens))

    return clean_text
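
To see what the two substitutions in `preprocess` do before tokenization and stemming, the sketch below applies them to a single string using only the standard library. The sample sentence is hypothetical, and a plain whitespace split stands in for `word_tokenize`, so stop-word removal and stemming are deliberately left out.

```python
import re

# Hypothetical sample sentence, not taken from the corpus
sample = "Last month, I bought 2 birds!"

no_punct = re.sub(r'[^\w\s]', '', sample.lower())   # drop punctuation
no_digits = re.sub(r'\d+', '', no_punct)            # drop runs of digits
tokens = no_digits.split()                          # whitespace split stands in for word_tokenize

print(tokens)
```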

### Importing Cleaned Corpus

In [4]:
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.max_colwidth', 1000)

dataset = pd.read_csv("../corpus/clean-corpus-inggris.csv")
dataset

Unnamed: 0,teks
0,call bird habit
1,brother like bird month father gave black bird
2,antoni bird lost
3,greedi characterist hate


### Importing Un-Preprocessed Corpus

In [5]:
validate = pd.read_csv("../corpus/corpus-inggris.csv").head(4)
validate

Unnamed: 0,id,text,topic
0,ENG1,"They called him a bird, because of his habit",bird
1,ENG2,My brother likes bird and after a month my father gave him a black bird,bird
2,ENG3,Antony has a bird and he lost it,bird
3,ENG4,Greedy is the most characteristic that I hate,hate


### Corpus Preparation

In [6]:
corpus = dataset.teks.tolist()
corpus

['call bird habit',
 'brother like bird month father gave black bird',
 'antoni bird lost',
 'greedi characterist hate']

#### Displaying the unique words

In [7]:
unique_words = set()
for sentence in corpus:
    words = sentence.split()
    unique_words.update(words)
unique_words

{'antoni',
 'bird',
 'black',
 'brother',
 'call',
 'characterist',
 'father',
 'gave',
 'greedi',
 'habit',
 'hate',
 'like',
 'lost',
 'month'}

In [8]:
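clean_query = []

# Keep asking until at least one query word survives preprocessing and appears in the corpus vocabulary.
while True:
    print("Insert a query:")
    query = input()
    list_query = query.split(' ')
    clean_query = preprocess(list_query)                                  # each word is cleaned and stemmed separately
    clean_query = [word for word in clean_query if word != '']            # drop words emptied by preprocessing
    clean_query = [word for word in clean_query if word in unique_words]  # keep only terms known to the corpus
    if clean_query:
        break

print("List of words in the query:\n",list_query)
print("List of the queries:\n",clean_query)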

Insert a query:


 last month, i have a bird named anthony, but it's dead right now


List of words in the query:
 ['last', 'month,', 'i', 'have', 'a', 'bird', 'named', 'anthony,', 'but', "it's", 'dead', 'right', 'now']
List of the queries:
 ['month', 'bird']


### TF (Term Frequency)

In [9]:
def tf(text):
    word_count_per_document = {}

    for i, sentence in enumerate(text, start=0):
        words = sentence.split()
        for word in words:
            if word in word_count_per_document:
                if i in word_count_per_document[word]:
                    word_count_per_document[word][i] += 1
                else:
                    word_count_per_document[word][i] = 1
            else:
                word_count_per_document[word] = {i: 1}

    df_term_frequency = pd.DataFrame(word_count_per_document)
    df_term_frequency.fillna(0, inplace=True)
    return df_term_frequency.T
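
The same term-by-document counting can be sketched without pandas, which may make the logic easier to follow. The function name `term_document_counts` is a hypothetical helper introduced only for this example; it is an alternative formulation, not the notebook's `tf` function.

```python
from collections import Counter

def term_document_counts(docs):
    """Return (terms, matrix) where matrix[i][j] counts terms[i] in docs[j]."""
    counts = [Counter(doc.split()) for doc in docs]
    # Keep first-seen order of terms, matching the dict-based version above
    terms = list(dict.fromkeys(word for doc in docs for word in doc.split()))
    # A Counter returns 0 for missing keys, so no explicit fill step is needed
    matrix = [[c[term] for c in counts] for term in terms]
    return terms, matrix

docs = ['call bird habit', 'brother like bird month father gave black bird']
terms, matrix = term_document_counts(docs)
for term, row in zip(terms, matrix):
    print(term, row)
```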

#### Computing the document term frequency

In [10]:
tf_document= tf(corpus).T.sort_index().T
tf_document

Unnamed: 0,0,1,2,3
call,1.0,0.0,0.0,0.0
bird,1.0,2.0,1.0,0.0
habit,1.0,0.0,0.0,0.0
brother,0.0,1.0,0.0,0.0
like,0.0,1.0,0.0,0.0
month,0.0,1.0,0.0,0.0
father,0.0,1.0,0.0,0.0
gave,0.0,1.0,0.0,0.0
black,0.0,1.0,0.0,0.0
antoni,0.0,0.0,1.0,0.0
lost,0.0,0.0,1.0,0.0
greedi,0.0,0.0,0.0,1.0
characterist,0.0,0.0,0.0,1.0
hate,0.0,0.0,0.0,1.0


#### Computing the query term frequency

In [11]:
tf_query= pd.DataFrame(tf(clean_query).sum(axis=1), columns=["query"])
tf_query

Unnamed: 0,query
month,1.0
bird,1.0


### Array Transformation
#### Array of Documents

In [12]:
# Convert the term-frequency table into a list of rows: one row per term, one column per document.
corpus = tf_document.astype(int).values.tolist()

corpus

[[1, 0, 0, 0],
 [1, 2, 1, 0],
 [1, 0, 0, 0],
 [0, 1, 0, 0],
 [0, 1, 0, 0],
 [0, 1, 0, 0],
 [0, 1, 0, 0],
 [0, 1, 0, 0],
 [0, 1, 0, 0],
 [0, 0, 1, 0],
 [0, 0, 1, 0],
 [0, 0, 0, 1],
 [0, 0, 0, 1],
 [0, 0, 0, 1]]

In [13]:
# np.linalg.svd returns the third factor already transposed, so V here holds V^T.
U, S, V = np.linalg.svd(corpus)

In [14]:
U_rounded = np.round(U[:,:(tf_document.shape[1])], decimals=2)
print(U_rounded)

[[-0.08  0.    0.39  0.5 ]
 [-0.71  0.    0.34 -0.  ]
 [-0.08  0.    0.39  0.5 ]
 [-0.28  0.   -0.22 -0.  ]
 [-0.28  0.   -0.22 -0.  ]
 [-0.28  0.   -0.22 -0.  ]
 [-0.28  0.   -0.22 -0.  ]
 [-0.28  0.   -0.22 -0.  ]
 [-0.28  0.   -0.22 -0.  ]
 [-0.08  0.    0.39 -0.5 ]
 [-0.08  0.    0.39 -0.5 ]
 [ 0.   -0.58  0.    0.  ]
 [ 0.   -0.58  0.    0.  ]
 [ 0.   -0.58  0.    0.  ]]


In [15]:
S = np.diag(S)
S_rounded = np.round(S, decimals=2)
print(S_rounded)

[[3.34 0.   0.   0.  ]
 [0.   1.73 0.   0.  ]
 [0.   0.   1.7  0.  ]
 [0.   0.   0.   1.41]]
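
One common way to read these singular values is as shares of the matrix's total "energy" (the sum of squared singular values), which can guide the choice of \( k \). The sketch below uses the rounded values printed above; the helper `smallest_k` and the 0.70 threshold are assumptions chosen purely for illustration.

```python
import numpy as np

# Singular values as printed above (rounded to two decimals)
singular_values = np.array([3.34, 1.73, 1.70, 1.41])

energy = singular_values ** 2
cumulative = np.cumsum(energy) / energy.sum()   # share captured by the first k dimensions

def smallest_k(cumulative, threshold):
    """Smallest k whose leading singular values reach the given share."""
    return int(np.searchsorted(cumulative, threshold) + 1)

k = smallest_k(cumulative, 0.70)  # 0.70 is an arbitrary illustrative threshold
print(np.round(cumulative, 3), k)
```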


In [16]:
# V holds V^T, so V.T recovers V, with one row per document.
V_rounded = np.round(V.T, decimals=2)
print(V_rounded)

[[-0.26 -0.    0.66  0.71]
 [-0.93 -0.   -0.37  0.  ]
 [-0.26 -0.    0.66 -0.71]
 [ 0.   -1.    0.    0.  ]]


In [17]:
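# Rank-1 approximation (k = 1): first column of U, first singular value, first row of V^T.
# The factors were rounded beforehand, so the product carries a small rounding error.
X_results = np.round(U_rounded[:,:1].dot(S_rounded[:1,:1]).dot(V_rounded.T[:1]), decimals=3)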
print(X_results)

[[0.069 0.248 0.069 0.   ]
 [0.617 2.205 0.617 0.   ]
 [0.069 0.248 0.069 0.   ]
 [0.243 0.87  0.243 0.   ]
 [0.243 0.87  0.243 0.   ]
 [0.243 0.87  0.243 0.   ]
 [0.243 0.87  0.243 0.   ]
 [0.243 0.87  0.243 0.   ]
 [0.243 0.87  0.243 0.   ]
 [0.069 0.248 0.069 0.   ]
 [0.069 0.248 0.069 0.   ]
 [0.    0.    0.    0.   ]
 [0.    0.    0.    0.   ]
 [0.    0.    0.    0.   ]]
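
How much is lost by keeping only one dimension can be checked numerically: for a truncated SVD, the Frobenius norm of the residual equals the square root of the sum of the squared discarded singular values. The sketch below rebuilds the term-document matrix from the cell above and works with unrounded factors; the helper `rank_k` is a hypothetical name used only here.

```python
import numpy as np

# Term-document counts from the "Array of Documents" cell (14 terms x 4 documents)
A = np.array([
    [1, 0, 0, 0], [1, 2, 1, 0], [1, 0, 0, 0],
    [0, 1, 0, 0], [0, 1, 0, 0], [0, 1, 0, 0],
    [0, 1, 0, 0], [0, 1, 0, 0], [0, 1, 0, 0],
    [0, 0, 1, 0], [0, 0, 1, 0],
    [0, 0, 0, 1], [0, 0, 0, 1], [0, 0, 0, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

def rank_k(U, s, Vt, k):
    """Rank-k approximation built from unrounded SVD factors."""
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A_1 = rank_k(U, s, Vt, 1)
error = np.linalg.norm(A - A_1)            # Frobenius norm of the residual
expected = np.sqrt(np.sum(s[1:] ** 2))     # energy of the discarded singular values
print(error, expected)
```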


#### Array of the Query

In [18]:
query = []
for i in tf_document.index:
    try:
        documents = (tf_query.loc[i]).astype(int).values.tolist()
        query.append(documents)
    except KeyError:
        # Term does not occur in the query
        query.append([0])

query = np.array(query)
query

array([[0],
       [1],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0]])

In [19]:
U, S, V = np.linalg.svd(query)

In [20]:
U_rounded_q = np.round(U[:,:4], decimals=2)
print("U = ", U_rounded_q)
print("\n")

S = np.diag(S)
S_rounded_q = np.round(S, decimals=2)
print("S = ",S_rounded_q)
print("\n")

V_rounded_q = np.round(V.T, decimals=2)
print("VT = ",V_rounded_q)
print("\n")

U =  [[ 0.   -0.71  0.    0.  ]
 [ 0.71  0.5   0.    0.  ]
 [ 0.    0.    1.    0.  ]
 [ 0.    0.    0.    1.  ]
 [ 0.    0.    0.    0.  ]
 [ 0.71 -0.5   0.    0.  ]
 [ 0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.  ]]


S =  [[1.41]]


VT =  [[1.]]




In [21]:
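# U_rounded_q[:1, :1] keeps only the top-left entry of U (0.0 here), so the product is a 1x1 zero matrix.
X_results_q = np.round(U_rounded_q[:1,:1].dot(S_rounded_q).dot(V_rounded_q), decimals=3)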
print(X_results_q)

[[0.]]


In [35]:
X_results_q_new = np.append(X_results_q, np.zeros(len(unique_words)-1, dtype=int))
X_results_q_new = X_results_q_new.astype(int).tolist()
X_results_q_new

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
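
For comparison, a widely used alternative to decomposing the query on its own is to "fold" it into the latent space of the documents, using \( \hat{q} = \Sigma_k^{-1} U_k^T q \), and then compare it with the document coordinates. The sketch below is not the approach taken in this notebook; it rebuilds the term-document matrix from the cleaned corpus, and the choice of `k = 2` is an assumption made only for illustration.

```python
import numpy as np

corpus = ['call bird habit',
          'brother like bird month father gave black bird',
          'antoni bird lost',
          'greedi characterist hate']

# Rows are terms in first-seen order, columns are documents
terms = list(dict.fromkeys(w for doc in corpus for w in doc.split()))
A = np.array([[doc.split().count(t) for doc in corpus] for t in terms], dtype=float)

k = 2  # assumed number of latent dimensions, chosen only for illustration
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Query term counts in the same term order as the rows of A
q = np.array([1.0 if t in ('bird', 'month') else 0.0 for t in terms])

# Fold-in: q_hat = Sigma_k^{-1} U_k^T q
q_hat = (U_k.T @ q) / s_k

# Each document's latent coordinates are a column of Vt_k
doc_vecs = Vt_k.T   # shape (n_docs, k)
scores = doc_vecs @ q_hat / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_hat))
print(np.round(scores, 3))
```

With so few documents and dimensions, the three bird documents end up pointing in the same latent direction and therefore tie; a larger corpus or a larger `k` would be needed to separate them.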

In [24]:
lists_query = [index.tolist()[0] for index in query]
lists_query

[0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

In [38]:
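# norm(X_results) without an axis argument is the Frobenius norm of the whole matrix,
# so every document is divided by the same constant rather than by its own vector length.
cosine = np.dot(lists_query, X_results)/(norm(X_results)*norm(lists_query))
similarity_score = np.round(cosine, decimals=3)
similarity_score

array([0.182, 0.651, 0.182, 0.   ])

In [39]:
results = {'score': similarity_score,
           'text': validate.text}

pd.DataFrame(results).sort_values(by='score', ascending=False)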

Unnamed: 0,score,text
1,0.651,My brother likes bird and after a month my father gave him a black bird
0,0.182,"They called him a bird, because of his habit"
2,0.182,Antony has a bird and he lost it
3,0.0,Greedy is the most characteristic that I hate
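
The scores above divide every document by the Frobenius norm of the whole reconstructed matrix. A per-document cosine similarity would instead normalize each column by its own length. The sketch below shows one way to do that; `cosine_scores` is a hypothetical helper, the toy data is made up for the example, and applying it to `X_results` would give different numbers from those reported above.

```python
import numpy as np

def cosine_scores(query_vec, doc_matrix):
    """Cosine similarity between query_vec and each column of doc_matrix."""
    query_vec = np.asarray(query_vec, dtype=float)
    doc_matrix = np.asarray(doc_matrix, dtype=float)
    doc_norms = np.linalg.norm(doc_matrix, axis=0)   # one norm per document column
    denom = doc_norms * np.linalg.norm(query_vec)
    dots = query_vec @ doc_matrix
    # Zero-norm columns (or an all-zero query) get a score of 0 instead of NaN
    return np.divide(dots, denom, out=np.zeros_like(dots), where=denom > 0)

# Hypothetical toy data: 3 terms x 2 documents
docs = np.array([[1, 0], [1, 0], [0, 2]])
scores = cosine_scores([1, 1, 0], docs)
print(scores)
```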
