# Text Classification Using Embeddings
This notebook shows how to build a classifier using Cohere's embeddings.
<img src="https://github.com/cohere-ai/notebooks/raw/main/notebooks/images/simple-classifier-embeddings.png"
style="width:100%; max-width:600px"
alt="first we embed the text in the dataset, then we use that to train a classifier"/>

The example classification task here will be sentiment analysis of film reviews. We'll train a simple classifier to detect whether a film review is negative (class 0) or positive (class 1).

We'll go through the following steps:

1. Get the dataset.
2. Get the embeddings of the reviews (for both the training set and the test set).
3. Train a classifier using the training set.
4. Evaluate the performance of the classifier on the testing set.

In [None]:
# Let's first install Cohere's python SDK
!pip install cohere

## 1. Get the dataset

In [None]:
import pandas as pd
# Get the SST2 training and test sets
df_train = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv', delimiter='\t', header=None)
df_test = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/test.tsv', delimiter='\t', header=None)

In [None]:
# Let's glance at the dataset
df_train.head()
print(f"Review #1 text: {df_train.iloc[0, 0]}")
print(f"Review #1 class: {df_train.iloc[0, 1]}")
print(f"Review #2 text: {df_train.iloc[1, 0]}")
print(f"Review #2 class: {df_train.iloc[1, 1]}")

Review #1 text: a stirring , funny and finally transporting re imagining of beauty and the beast and 1930s horror films
Review #1 class: 1
Review #2 text: apparently reassembled from the cutting room floor of any given daytime soap
Review #2 class: 0


Since this is a toy example, we'll use only a small subset of each dataset: 100 training examples and 100 test examples. Increase these numbers to get a better classifier and a more reliable evaluation.

In [None]:
n_train_samples = 100 # Increase for better performance (e.g. 500)
n_test_samples = 100 # increase for better evaluation (e.g. 500)

# Sample from the dataset
train = df_train.sample(n_train_samples)
test = df_test.sample(n_test_samples)

sentences_train = list(train.iloc[:,0].values)
sentences_test = list(test.iloc[:,0].values)

labels_train  = list(train.iloc[:,1].values)
labels_test  = list(test.iloc[:,1].values)
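Because `sample` draws rows at random, every run of the cell above picks different reviews, so the scores further down will vary. The sketch below shows one way to make the draw repeatable with the `random_state` parameter of `DataFrame.sample`, and to check how many examples of each class were drawn. The small frame is a made-up stand-in with the same layout as the SST2 files (text in column 0, label in column 1), not the real data.

```python
import pandas as pd

# Tiny stand-in frame with the same layout as the SST2 files:
# column 0 holds the review text, column 1 holds the label (0 or 1).
df = pd.DataFrame({
    0: [f"review {i}" for i in range(20)],
    1: [i % 2 for i in range(20)],
})

# random_state fixes the random draw, so reruns pick the same rows.
sample_a = df.sample(10, random_state=0)
sample_b = df.sample(10, random_state=0)
same_rows = sample_a.index.equals(sample_b.index)

# Count how many examples of each class ended up in the sample.
class_counts = sample_a[1].value_counts().to_dict()

print(f"Same rows on both draws: {same_rows}")
print(f"Examples per class: {class_counts}")
```

In the notebook, the equivalent change would be `df_train.sample(n_train_samples, random_state=0)`; the seed value 0 is arbitrary.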


## 2. Get the embeddings of the reviews
We're now ready to retrieve the embeddings from the API.

In [None]:
# import cohere, and start a client session
import cohere
co = cohere.Client("") # ADD YOUR API KEY HERE

# embed sentences from both train and test set on baseline-shrimp
embeddings_train = co.embed(model='baseline-shrimp', texts=sentences_train).embeddings
embeddings_test = co.embed(model='baseline-shrimp', texts=sentences_test).embeddings
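The cell above sends all 100 sentences in a single call. If you raise the sample sizes, you may prefer to send the texts in smaller chunks. The helper below is a minimal sketch of that idea: `embed_in_batches`, its `batch_size` default of 32, and the `fake_embed` stand-in are all hypothetical names and values chosen for illustration, not part of the Cohere SDK. Check the API documentation for any actual per-call limits.

```python
from typing import Callable, List, Sequence


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 32,
) -> List[List[float]]:
    """Embed texts in consecutive chunks and return vectors in input order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    vectors: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        vectors.extend(embed_fn(batch))
    return vectors


# Hypothetical stand-in for the API call: returns a 2-number "vector"
# (character count, space count) per text so the example runs offline.
def fake_embed(batch: List[str]) -> List[List[float]]:
    return [[float(len(t)), float(t.count(" "))] for t in batch]


demo = embed_in_batches(["a b", "abcd", "x y z"], fake_embed, batch_size=2)
print(demo)
```

With the client from the cell above, `embed_fn` could be `lambda batch: co.embed(model='baseline-shrimp', texts=batch).embeddings`.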

We now have two sets of embeddings: `embeddings_train` contains the embeddings of the training sentences, while `embeddings_test` contains the embeddings of the testing sentences.

Curious what an embedding looks like? We can print one:

In [None]:
print(f"Review text: {sentences_train[0]}")
print(f"Embedding vector: {embeddings_train[0]}")

Review text: there 's a sheer unbridled delight in the way the story unfurls
Embedding vector: [0.3060446, -0.5590798, 1.3441724, -0.31918874, 0.10319942, -0.611176, -0.45803407, -0.922121, 0.46373415, 0.82814246, -0.7204705, -0.085331775, -0.24679375, 0.33021235, 0.34427735, 0.016704587, 0.21665218, 0.99444443, -1.1864822, -1.2618482, 0.56724435, -0.40024626, -1.1538417, 0.028833997, 0.018415181, -0.70796496, -0.50904775, -0.46730208, -0.8696139, -0.11697644, 0.36991414, 1.8658482, -0.18236113, 0.086177096, 0.5138425, 0.3988735, -1.7110026, 0.12344775, 0.9969331, -1.2200501, 1.3504177, -0.41682506, 0.52232707, 0.713298, 0.6877792, -0.43937904, 0.2173301, -0.41831672, -1.5772377, -0.3469063, -0.04403664, 0.7759863, -0.8057619, -0.5124225, 0.38261724, 0.24261853, 0.434947, 0.8528693, -0.53545344, -0.45064825, -1.603972, 0.7302793, 0.5154807, 0.60574394, -1.3626307, 0.56966937, -1.918972, 0.056056302, -0.5665736, -0.8589762, 0.8349136, 0.5967386, -0.30615053, -0.19666299, -0.25781697, -0
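A full vector is long and hard to read. A quicker sanity check is to confirm that you received one vector per sentence and that all vectors have the same length. The sketch below does this for any list of lists; `embedding_shape` is a hypothetical helper and the `toy` vectors are made up for illustration.

```python
from typing import Sequence, Tuple


def embedding_shape(embeddings: Sequence[Sequence[float]]) -> Tuple[int, int]:
    """Return (number of vectors, vector length); fail if lengths differ."""
    if not embeddings:
        return (0, 0)
    dim = len(embeddings[0])
    for i, vec in enumerate(embeddings):
        if len(vec) != dim:
            raise ValueError(f"vector {i} has length {len(vec)}, expected {dim}")
    return (len(embeddings), dim)


# Made-up vectors standing in for the API response.
toy = [[0.31, -0.56, 1.34], [0.10, -0.61, -0.46]]
shape = embedding_shape(toy)
print(shape)
```

In the notebook you would call `embedding_shape(embeddings_train)` and expect the first number to equal `n_train_samples`.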

## 3. Train a classifier using the training set
Now that we have the embeddings, we can train our classifier. We'll use an SVM from sklearn.

In [None]:
# import support vector machine code
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


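# initialize the support vector machine. class_weight='balanced' weights each
# class inversely to its frequency in the training labels. Our training set has
# roughly equal numbers of positive and negative sentences, so the effect is
# small here, but it guards against a random sample that happens to be skewed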
svm_classifier = make_pipeline(StandardScaler(), SVC(class_weight='balanced')) 

# fit the support vector machine
svm_classifier.fit(embeddings_train, labels_train)


Pipeline(memory=None,
         steps=[('standardscaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('svc',
                 SVC(C=1.0, break_ties=False, cache_size=200,
                     class_weight='balanced', coef0=0.0,
                     decision_function_shape='ovr', degree=3, gamma='scale',
                     kernel='rbf', max_iter=-1, probability=False,
                     random_state=None, shrinking=True, tol=0.001,
                     verbose=False))],
         verbose=False)
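To make `class_weight='balanced'` concrete: scikit-learn documents this mode as weighting each class by `n_samples / (n_classes * count_of_that_class)`. The sketch below reproduces that formula in plain Python. `balanced_class_weights` is a hypothetical helper and the 40/60 label split is made up to show the effect on a skewed sample.

```python
from collections import Counter
from typing import Dict, Sequence


def balanced_class_weights(labels: Sequence[int]) -> Dict[int, float]:
    """Weight each class by n_samples / (n_classes * class count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Made-up skewed sample: 40 negative and 60 positive labels.
toy_labels = [0] * 40 + [1] * 60
weights = balanced_class_weights(toy_labels)
print(weights)
```

The rarer class gets a weight above 1 and the more common class a weight below 1. With a perfectly even split, both weights are exactly 1.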

## 4. Evaluate the performance of the classifier on the testing set

In [None]:
# get the score from the test set, and print it out to screen!
score = svm_classifier.score(embeddings_test, labels_test)
print(f"Validation accuracy on baseline-shrimp is {100*score}%!")

Validation accuracy on baseline-shrimp is 70.0%!
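For a classifier, `score` returns the mean accuracy: the fraction of test examples whose predicted label matches the true label. The sketch below computes the same quantity by hand and also counts each (true label, predicted label) pair, which shows where the errors fall. The helper names and toy labels are made up for illustration.

```python
from collections import Counter
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("need at least one example")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Counter:
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))


# Made-up labels: 3 of the 5 predictions are correct.
toy_true = [1, 1, 0, 0, 1]
toy_pred = [1, 0, 0, 1, 1]
acc = accuracy(toy_true, toy_pred)
counts = confusion_counts(toy_true, toy_pred)
print(acc)
print(dict(counts))
```

In the notebook, the predictions would come from `svm_classifier.predict(embeddings_test)` and the true labels from `labels_test`.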


Let's try a larger model, baseline-otter, to see if we can improve our score.

In [None]:

# embed sentences from both train and test set on baseline-otter
embeddings_train = co.embed(model='baseline-otter', texts=sentences_train).embeddings
embeddings_test = co.embed(model='baseline-otter', texts=sentences_test).embeddings

# run the same process as before
svm_classifier = make_pipeline(StandardScaler(), SVC(class_weight='balanced')) 
svm_classifier.fit(embeddings_train, labels_train)
score = svm_classifier.score(embeddings_test, labels_test)
print(f"Validation accuracy on baseline-otter is {100*score}%!")




Validation accuracy on baseline-otter is 81.0%!


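With the bigger model, accuracy rose from 70% to 81% in this run. This trend typically holds: bigger models tend to create
more semantically rich embeddings, which can be used to solve a wider variety of downstream tasks. With only 100 test
examples and an unseeded random sample, though, the exact numbers will vary from run to run.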
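To get a feel for how much an accuracy measured on 100 examples can move, the sketch below computes a rough 95% interval using the normal approximation to a binomial proportion. This is a simplification swapped in for illustration, not something the notebook itself does, and `accuracy_interval` is a hypothetical helper. The two inputs are the scores printed above.

```python
import math
from typing import Tuple


def accuracy_interval(acc: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation interval for an accuracy measured on n examples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    standard_error = math.sqrt(acc * (1.0 - acc) / n)
    low = max(0.0, acc - z * standard_error)
    high = min(1.0, acc + z * standard_error)
    return (low, high)


# Scores reported above, each measured on 100 test examples.
shrimp = accuracy_interval(0.70, 100)
otter = accuracy_interval(0.81, 100)
print(f"baseline-shrimp: {shrimp[0]:.3f} to {shrimp[1]:.3f}")
print(f"baseline-otter:  {otter[0]:.3f} to {otter[1]:.3f}")
```

The two intervals overlap, which is one reason to raise `n_test_samples` before drawing firm conclusions. Because both models were scored on the same test sentences, a paired comparison would be a more precise check than comparing the two intervals.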

This was a small-scale example, meant as a proof of concept and designed to illustrate how you can quickly build a custom
classifier from a small amount of labelled data and Cohere's embeddings.