## Vector Space Model

The vector space model represents documents and terms as vectors in a multi-dimensional space.

Each dimension corresponds to a unique term in the entire corpus of documents.
\
&nbsp;

<img src="../Resource/Images/vector_space_model.jpg" alt="Vector Space Model" style="width:400px;"/>


<h5><b>Document-Term Matrix (DTM)</b><br></h5>
Rows in this matrix represent documents and columns represent terms (words or phrases). Each cell holds the number of times the term occurs in the document.<br>

<b>Example:</b> <br>
<ul>
    <li><b>Doc1: </b>"I love data"</li>
    <li><b>Doc2: </b>"I love AI and data"</li>
    <li><b>Doc3: </b>"AI is the future"</li>
</ul>
<br>
<b>Corresponding Matrix</b> (columns for "and", "is", and "the" are not shown): <br><br>
<table>
    <tr>
        <th></th>
        <th>I</th>
        <th>love</th>
        <th>data</th>
        <th>AI</th>
        <th>future</th>
    </tr>
    <tr>
        <th>Doc1</th>
        <th>1</th>
        <th>1</th>
        <th>1</th>
        <th>0</th>
        <th>0</th>
    </tr>
    <tr>
        <th>Doc2</th>
        <th>1</th>
        <th>1</th>
        <th>1</th>
        <th>1</th>
        <th>0</th>
    </tr>
    <tr>
        <th>Doc3</th>
        <th>0</th>
        <th>0</th>
        <th>0</th>
        <th>1</th>
        <th>1</th>
    </tr>
</table>
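
The matrix above can be rebuilt with a few lines of plain Python. This is a minimal sketch, not a production tokenizer: it assumes whitespace tokenization and case-sensitive matching, and the names `docs`, `vocabulary`, and `document_term_matrix` are hypothetical helpers introduced only for illustration. The column order is copied from the table.

```python
# Example corpus from the text above
docs = {
    "Doc1": "I love data",
    "Doc2": "I love AI and data",
    "Doc3": "AI is the future",
}

# Column order copied from the table; "and", "is", "the" are not shown there
vocabulary = ["I", "love", "data", "AI", "future"]


def document_term_matrix(docs, vocabulary):
    """Return {document name: list of term counts in vocabulary order}."""
    matrix = {}
    for name, text in docs.items():
        tokens = text.split()  # assumption: simple whitespace tokenization
        matrix[name] = [tokens.count(term) for term in vocabulary]
    return matrix


dtm = document_term_matrix(docs, vocabulary)
for name, row in dtm.items():
    print(name, row)
```

Each printed row corresponds to one document row of the table.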

<h5><b>Term Frequency - Inverse Document Frequency (TF-IDF) (*)</b><br></h5>
A measure that reflects the importance of a term within a document relative to its importance across all documents in the corpus. It helps in highlighting important terms while downplaying common terms.<br>
<br>
<div><b>Term Frequency (TF):</b> How often a word appears in a document.</div>
<br>
<div>
    <table>
        <tr>
            <th>TF(t,d) = (Number of times term t appears in document d) / (Total number of terms in document d)</th>
        </tr>
    </table>
</div>
<br>
<div><b>Inverse Document Frequency (IDF):</b> How rare or unique a word is across all documents.</div>
<br>
<div>
    <table>
        <tr>
            <th>IDF(t, D) = log[(Total number of documents in corpus D) / (Number of documents containing term t)]</th>
        </tr>
    </table>
</div>
<br>
<div><b>TF-IDF:</b> The product of the term frequency and the inverse document frequency.</div>
<br>
<div>
    <table>
        <tr>
            <th>TF-IDF(t, d, D) = TF(t, d) x IDF(t, D)</th>
        </tr>
    </table>
</div>

<b>Corresponding Matrix (TF-IDF):</b> <br><br>
<br>
<div><b>Step 1: Substitute the counts into the formula</b></div>
<table>
    <tr>
        <th></th>
        <th>I</th>
        <th>love</th>
        <th>data</th>
        <th>AI</th>
        <th>future</th>
    </tr>
    <tr>
        <th>Doc1</th>
        <th>[(1/3) x log(3/2)]</th>
        <th>[(1/3) x log(3/2)]</th>
        <th>[(1/3) x log(3/2)]</th>
        <th>0</th>
        <th>0</th>
    </tr>
    <tr>
        <th>Doc2</th>
        <th>[(1/5) x log(3/2)]</th>
        <th>[(1/5) x log(3/2)]</th>
        <th>[(1/5) x log(3/2)]</th>
        <th>[(1/5) x log(3/2)]</th>
        <th>0</th>
    </tr>
    <tr>
        <th>Doc3</th>
        <th>0</th>
        <th>0</th>
        <th>0</th>
        <th>[(1/4) x log(3/2)]</th>
        <th>[(1/4) x log(3/1)]</th>
    </tr>
</table>

<br>
<div><b>Step 2: Final Result</b> (base-10 logarithm, values truncated to three decimal places)</div>
<table>
    <tr>
        <th></th>
        <th>I</th>
        <th>love</th>
        <th>data</th>
        <th>AI</th>
        <th>future</th>
    </tr>
    <tr>
        <th>Doc1</th>
        <th>0.058</th>
        <th>0.058</th>
        <th>0.058</th>
        <th>0</th>
        <th>0</th>
    </tr>
    <tr>
        <th>Doc2</th>
        <th>0.035</th>
        <th>0.035</th>
        <th>0.035</th>
        <th>0.035</th>
        <th>0</th>
    </tr>
    <tr>
        <th>Doc3</th>
        <th>0</th>
        <th>0</th>
        <th>0</th>
        <th>0.044</th>
        <th>0.119</th>
    </tr>
</table>
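
The worked example can be checked with a short script. This is a sketch under stated assumptions: the logarithm is taken to be base 10 (which matches the numbers in the table), documents are tokenized on whitespace, and results are truncated rather than rounded to three decimals. The helpers `tf`, `idf`, and `truncate` are hypothetical names introduced for illustration.

```python
import math

docs = {
    "Doc1": "I love data",
    "Doc2": "I love AI and data",
    "Doc3": "AI is the future",
}
# Column order copied from the tables; every term here occurs in the corpus
vocabulary = ["I", "love", "data", "AI", "future"]

# Assumption: whitespace tokenization, so Doc2 has 5 terms and Doc3 has 4
tokens = {name: text.split() for name, text in docs.items()}
corpus = list(tokens.values())


def tf(term, doc_tokens):
    """Occurrences of term divided by the total number of terms in the document."""
    return doc_tokens.count(term) / len(doc_tokens)


def idf(term, corpus):
    """log10(number of documents / number of documents containing term)."""
    containing = sum(1 for doc_tokens in corpus if term in doc_tokens)
    return math.log10(len(corpus) / containing)


def truncate(value, places=3):
    """Cut off (not round) after the given number of decimal places."""
    factor = 10 ** places
    return math.floor(value * factor) / factor


tfidf = {
    name: [truncate(tf(term, doc_tokens) * idf(term, corpus)) for term in vocabulary]
    for name, doc_tokens in tokens.items()
}

for name, row in tfidf.items():
    print(name, row)
```

Truncation is used only to reproduce the table. Rounding would give 0.059 for the Doc1 entries, since (1/3) x log10(3/2) is approximately 0.0587.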

<h5><b>Vectorization</b><br></h5>
<br>
<div>After applying TF-IDF, each document becomes a <b>vector</b> - essentially a list of numbers - in a high-dimensional space.</div>
<br>
<ul>
    <li>Each dimension in this space represents a unique word (term) from the entire collection of documents.</li>
    <li>Each document is a point (vector) in this space, where the value in each dimension shows how important that term is in the document.</li>
</ul>
<br>
<div><b>Document vectors (dimension order: I, love, data, AI, future):</b></div>
<br>
<table>
    <tr>
        <th>Doc1</th>
        <th>[0.058, 0.058, 0.058, 0, 0]</th>
    </tr>
    <tr>
        <th>Doc2</th>
        <th>[0.035, 0.035, 0.035, 0.035, 0]</th>
    </tr>
    <tr>
        <th>Doc3</th>
        <th>[0, 0, 0, 0.044, 0.119]</th>
    </tr>
</table>

<h5><b>Cosine Similarity</b><br></h5>

<div>
    <img src="../Resource/Images/cosin_formula.png" alt="Cosine similarity formula" style="width:400px;"/>
</div>

<ul>
    <li>It measures the cosine of the angle between two vectors in a high-dimensional space.</li>
    <li>It ignores the magnitude of the vectors and depends only on their direction.</li>
    <li>This is ideal for text, because documents can be different lengths but still have similar content.</li>
</ul>
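
As a rough illustration of these points, cosine similarity can be computed directly from its definition (dot product divided by the product of the vector lengths). The function below is a minimal sketch rather than a library API. The input vectors are the TF-IDF rows from the Vectorization section, and returning 0.0 for a zero-length vector is a convention assumed here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)


# TF-IDF vectors from the Vectorization section
doc1 = [0.058, 0.058, 0.058, 0, 0]
doc2 = [0.035, 0.035, 0.035, 0.035, 0]
doc3 = [0, 0, 0, 0.044, 0.119]

sim_12 = cosine_similarity(doc1, doc2)  # about 0.866: three shared terms
sim_13 = cosine_similarity(doc1, doc3)  # 0.0: no shared terms
sim_23 = cosine_similarity(doc2, doc3)  # about 0.173: only "AI" is shared

# Scaling a vector changes its magnitude but not its direction
sim_scaled = cosine_similarity(doc1, [10 * x for x in doc1])  # about 1.0

print(round(sim_12, 3), round(sim_13, 3), round(sim_23, 3), round(sim_scaled, 3))
```

Doc1 and Doc2 score high even though Doc2 is longer, which is the length-independence described in the list.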

<div><b>Interpretation of Cosine Similarity Scores</b></div>
<br>
<table>
    <tr>
        <th>Similarity Score</th>
        <th>Interpretation</th>
    </tr>
    <tr>
        <th>1.0</th>
        <th>Exactly the same direction (identical content)</th>
    </tr>
    <tr>
        <th>0.8 - 1.0</th>
        <th>Highly similar</th>
    </tr>
    <tr>
        <th>0.5 - 0.8</th>
        <th>Moderately similar</th>
    </tr>
    <tr>
        <th>0.2 - 0.5</th>
        <th>Slightly similar</th>
    </tr>
    <tr>
        <th>0.0</th>
        <th>No similarity</th>
    </tr>
    <tr>
        <th>&lt; 0.0</th>
        <th>Opposite meaning (rare in text analysis)</th>
    </tr>
</table>


In [None]:
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.tokenize import word_tokenize

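# Sample documents
documents = [
    "The quick brown fox jumps over the lazy dog.",
    "A brown dog chased the fox.",
    "The dog is lazy."
]

# Sample query
query = "brown dog"

# Fit TF-IDF on the documents, then map the query into the same vector space.
# TfidfVectorizer smooths IDF and L2-normalizes rows by default, so its
# weights differ from the hand-computed formulas above.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)
query_vector = vectorizer.transform([query])

# Rank documents by cosine similarity to the query, most similar first
scores = cosine_similarity(query_vector, doc_vectors).flatten()
for idx in scores.argsort()[::-1]:
    print(f"{scores[idx]:.3f}  {documents[idx]}")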


