### Minimum Edit Distance

- [Cosine Similarity, cosine distance explained](https://www.youtube.com/watch?v=m_CooIRM3UI)
- [Counting Words with CountVectorizer](https://investigate.ai/text-analysis/counting-words-with-scikit-learns-countvectorizer/)

In [1]:
#%pip install scikit-learn pandas

In [2]:
from sklearn.metrics.pairwise import cosine_similarity, cosine_distances

# [6, 2] is [3, 1] scaled by 2, so the two vectors point in the same direction
cosine_similarity([[3, 1]], [[6, 2]])

array([[1.]])

In [3]:
# cosine distance = 1 - cosine similarity; the tiny nonzero result is floating-point error
cosine_distances([[3, 1]], [[6, 2]])

array([[1.11022302e-16]])
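
The two results above follow from the definition: cosine similarity is the dot product divided by the product of the vector lengths, and cosine distance is one minus that value. The sketch below reproduces the calculation with the standard library only. The helper name `cosine_sim` is made up for illustration and is not part of scikit-learn.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two equal-length vectors (hypothetical helper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Same direction, different length: similarity is 1 up to rounding error
same_direction = cosine_sim([3, 1], [6, 2])

# Perpendicular vectors share nothing: similarity is 0
orthogonal = cosine_sim([1, 0], [0, 1])

# Distance is the complement of similarity
distance = 1 - same_direction

print(same_direction, orthogonal, distance)
```

Because the square roots are computed in floating point, the similarity of the parallel vectors may differ from 1 in the last bit, which is why scikit-learn reports a distance around `1e-16` rather than an exact zero.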

In [4]:
doc1 = """
As the others have said, don't use the disk cache because of how slow it is. 
Even lowering the number of GPU layers (which then splits it between GPU VRAM 
and system RAM) slows it down tremendously. Keeping that in mind, the 13B file 
is almost certainly too large. Remember that the 13B is a reference to the 
number of parameters, not the file size.
"""

doc2 = """
Just using OPT-Nerys models as an example (huggingface model repository), 13B 
is over 25GB, which is too large to split between your GPU and RAM. 6B is 13+GB, 
so it could be used with a 50/50 split (which leaves a little VRAM for tokens 
and such) or the 2.7B model which is under 6GB and could be loaded fully into 
VRAM.
"""

doc3 = """
Don't ignore using system ram. It keeps getting cheaper. 2xDDR3200's, 
32GB total are under $100. I use 13B models, and usually start with 80% GPU 
layers, and set disk layers to zero. I get text generation within 5 seconds, 
so it feels like a human co-writer or chat. I do occasionally get an out of 
memory, but it's rarely an overall memory problem - just means I have to put 
less on the GPU (which implies more on the CPU), and that adds a bit of time.
"""

doc4 = """
Just replying to everybody so I can let y'all know that I got it working 
under 6GB, thanks for the help! I pretty much knew what 13B meant and all 
that but I wasn't sure about how much 8GB can handle considering with the 
layer system which was very confusing when setting up (it wasn't explained 
in a way I could understand lol)
"""

import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer

# Fit a single vocabulary over all four documents so every row shares the same columns
docs = {'doc1': doc1, 'doc2': doc2, 'doc3': doc3, 'doc4': doc4}
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(list(docs.values()))

# One row per document, one column per word; words absent from a document count as 0
df = pd.DataFrame(counts.toarray(),
                  index=list(docs.keys()),
                  columns=vectorizer.get_feature_names_out())
cosine_similarity(df)

array([[1.        , 0.34905556, 0.42006978, 0.32807142],
       [0.34905556, 1.        , 0.30820565, 0.2397661 ],
       [0.42006978, 0.30820565, 1.        , 0.28288798],
       [0.32807142, 0.2397661 , 0.28288798, 1.        ]])
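
Rows and columns of this matrix follow the document order (doc1 to doc4), the diagonal is each document compared with itself, and the matrix is symmetric. As a rough sketch of how one might read it, the snippet below copies the printed values into a plain list and ranks the six distinct pairs. The names `labels`, `sim`, and `pairs` are illustrative only.

```python
from itertools import combinations

labels = ["doc1", "doc2", "doc3", "doc4"]

# Values copied from the similarity matrix printed above
sim = [
    [1.0,        0.34905556, 0.42006978, 0.32807142],
    [0.34905556, 1.0,        0.30820565, 0.2397661],
    [0.42006978, 0.30820565, 1.0,        0.28288798],
    [0.32807142, 0.2397661,  0.28288798, 1.0],
]

# Keep only the upper triangle: each unordered pair once, no self-comparisons
pairs = sorted(
    ((sim[i][j], labels[i], labels[j])
     for i, j in combinations(range(len(labels)), 2)),
    reverse=True,
)

most_similar = pairs[0]
least_similar = pairs[-1]

for score, a, b in pairs:
    print(f"{a} vs {b}: {score:.3f}")
```

With raw word counts, common function words carry much of the overlap, so these scores say more about shared vocabulary than shared meaning.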