<a href="https://colab.research.google.com/github/generative-world/ml_dl_concepts/blob/master/sentence_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Sentence Transformers**


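Sentence Transformers is a library that **generates dense vector representations** (embeddings: fixed-length arrays of floating-point numbers) **for sentences or text**, capturing their semantic meaning. Sentences with similar meanings are mapped to vectors that lie close together.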

You can use Sentence Transformers to transform sentences into fixed-size vectors that you can then use for downstream tasks such as:

* Sentence similarity
* Document clustering
* Question answering
* Text classification
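
Sentence similarity, the first task above, usually comes down to comparing two embedding vectors with cosine similarity. The sketch below illustrates the idea with made-up 3-dimensional vectors; the numbers and variable names are assumptions chosen for illustration, not real model output (real embeddings from a model such as `all-MiniLM-L6-v2` have 384 dimensions).

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical "embeddings" (made-up numbers, not produced by a model).
liked = [0.9, 0.1, 0.2]     # stands in for "I liked the movie"
enjoyed = [0.8, 0.2, 0.1]   # stands in for a sentence with similar meaning
wasted = [-0.7, 0.6, 0.1]   # stands in for "Wasted my time"

sim_close = cosine_similarity(liked, enjoyed)
sim_far = cosine_similarity(liked, wasted)

print(f"similar pair:    {sim_close:.3f}")
print(f"dissimilar pair: {sim_far:.3f}")
```

With real embeddings, the same function can be applied to any two rows of the array returned by `model.encode(...)`.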





In [33]:
# Import Required Libraries

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sentence_transformers import SentenceTransformer

In [32]:
# Define a small in-memory dataset of messages and their sentiment labels
data = {
    'message': ['I liked the movie', 'Movie is not good', 'Wasted my time', 'liked the movie'],
    'sentiment': ['positive', 'negative', 'negative', 'positive']
}

In [34]:
df = pd.DataFrame(data)
df.head()

,message,sentiment
0,I liked the movie,positive
1,Movie is not good,negative
2,Wasted my time,negative
3,liked the movie,positive


In [35]:
X = df['message']
y = df['sentiment'].map({'positive': 1, 'negative': 0})

In [36]:
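# Load a small pretrained model; it maps each sentence to a 384-dimensional vector
model = SentenceTransformer('all-MiniLM-L6-v2')
# encode() expects a string or a list of strings, so convert the Series first
X_embeddings = model.encode(X.tolist())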


In [37]:
# Fit a logistic regression classifier on the sentence embeddings
lg = LogisticRegression()
lg.fit(X_embeddings, y)
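
A classifier fitted on only four sentences can be right or wrong by a narrow margin, so it can help to look at class probabilities rather than hard labels alone. The sketch below shows `predict_proba` on toy 2-dimensional vectors that stand in for sentence embeddings; `X_toy`, `y_toy`, and `X_new` are assumptions made up for illustration, not data from this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy 2-dimensional vectors standing in for sentence embeddings (assumption).
X_toy = np.array([[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]])
y_toy = np.array([0, 0, 1, 1])  # 0 = negative, 1 = positive

clf = LogisticRegression()
clf.fit(X_toy, y_toy)

X_new = np.array([[0.1, 0.1], [1.0, 1.0]])
# One row per sample, one column per class, in the order of clf.classes_
proba = clf.predict_proba(X_new)

for row, label in zip(proba, clf.predict(X_new)):
    print(f"label={label}  P(negative)={row[0]:.2f}  P(positive)={row[1]:.2f}")
```

In this notebook the equivalent call would be `lg.predict_proba(...)` on the encoded sentences; probabilities close to 0.5 suggest the model is unsure about a prediction.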

In [46]:
test = ['The movie is awsome', 'I liked the movie', 'I hated the movie', 'I disliked the movie', 'Movie is worst']
test_embeddings = model.encode(test)
predictions = lg.predict(test_embeddings)

for sentence, prediction in zip(test, predictions):
    print(f"Sentence: {sentence} -> Predicted Label: {prediction}")

Sentence: The movie is awsome -> Predicted Label: 1
Sentence: I liked the movie -> Predicted Label: 1
Sentence: I hated the movie -> Predicted Label: 1
Sentence: I disliked the movie -> Predicted Label: 1
Sentence: Movie is worst -> Predicted Label: 0
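
Two of the predictions above look wrong: "I hated the movie" and "I disliked the movie" are both labeled 1 (positive). With only four training sentences this is not surprising. A quick way to quantify it is to compare the printed predictions with the labels a reader would expect. In the sketch below, `predicted` is copied from the output above, while `expected` is an assumption (it does not appear in the original notebook).

```python
sentences = [
    'The movie is awsome',
    'I liked the movie',
    'I hated the movie',
    'I disliked the movie',
    'Movie is worst',
]

# Predictions copied from the output above, in the same order as `sentences`.
predicted = [1, 1, 1, 1, 0]

# Assumed ground truth (not in the original): 1 = positive, 0 = negative.
expected = [1, 1, 0, 0, 0]

correct = sum(p == e for p, e in zip(predicted, expected))
accuracy = correct / len(expected)
print(f"Accuracy: {correct}/{len(expected)} = {accuracy:.2f}")

# Collect the sentences the classifier got wrong under the assumed labels.
misses = [s for s, p, e in zip(sentences, predicted, expected) if p != e]
print("Misclassified:", misses)
```

Under these assumed labels the classifier gets 3 of 5 sentences right; a larger and more varied training set would likely be needed for it to separate "liked" from "hated" or "disliked".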
