# Text Classification Using Embeddings
This notebook shows how to build a classifier using Cohere's embeddings.
<img src="https://github.com/cohere-ai/notebooks/raw/main/notebooks/images/simple-classifier-embeddings.png"
alt="first we embed the text in the dataset, then we use that to train a classifier"/>

The example classification task here will be sentiment analysis of film reviews. We'll train a simple classifier to detect whether a film review is negative (class 0) or positive (class 1).

We'll go through the following steps:

1. Get the dataset
2. Get the embeddings of the reviews (for both the training set and the test set)
3. Train a classifier using the training set
4. Evaluate the performance of the classifier on the testing set

In [None]:
# Let's first install Cohere's python SDK
!pip install cohere

## 1. Get the dataset

In [None]:
import pandas as pd
# Get the SST2 training and test sets
df_train = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv', delimiter='\t', header=None)
df_test = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/test.tsv', delimiter='\t', header=None)

In [None]:
# Let's glance at the dataset
df_train.head()
print(f"Review #1 text: {df_train.iloc[0, 0]}")
print(f"Review #1 class: {df_train.iloc[0, 1]}")
print(f"Review #2 text: {df_train.iloc[1, 0]}")
print(f"Review #2 class: {df_train.iloc[1, 1]}")

Review #1 text: a stirring , funny and finally transporting re imagining of beauty and the beast and 1930s horror films
Review #1 class: 1
Review #2 text: apparently reassembled from the cutting room floor of any given daytime soap
Review #2 class: 0


We'll only use a subset of the training and testing datasets in this example. 

In [None]:
n_train_samples = 500
n_test_samples = 500

# Sample from the dataset
train = df_train.sample(n_train_samples)
test = df_test.sample(n_test_samples)

sentences_train = list(train.iloc[:,0].values)
sentences_test = list(test.iloc[:,0].values)

labels_train  = list(train.iloc[:,1].values)
labels_test  = list(test.iloc[:,1].values)
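
Because `sample` draws rows at random, each run of the notebook picks a different subset. The sketch below shows one way to make the draw repeatable and to check how many examples of each class were picked. It runs on a small made-up DataFrame (`df_demo` is a hypothetical stand-in for `df_train`), and the `random_state` value is an arbitrary choice rather than something used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical stand-in for df_train: column 0 is the text, column 1 is the label
df_demo = pd.DataFrame({
    0: [f"review {i}" for i in range(20)],
    1: [i % 2 for i in range(20)],
})

# A fixed random_state makes the sample identical across runs
sample_a = df_demo.sample(10, random_state=0)
sample_b = df_demo.sample(10, random_state=0)
same_rows = sample_a.index.equals(sample_b.index)

# Count how many examples of each class ended up in the sample
class_counts = sample_a.iloc[:, 1].value_counts().to_dict()

print(f"Same rows on both draws: {same_rows}")
print(f"Examples per class: {class_counts}")
```

The same two ideas should carry over to `df_train.sample(n_train_samples, random_state=...)` and `train.iloc[:, 1].value_counts()`.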


## 2. Get the embeddings of the reviews
We're now ready to retrieve the embeddings from the API.

In [None]:
# import cohere, and start a client session
import cohere
co = cohere.Client("") # ADD YOUR API KEY HERE

# embed sentences from both train and test set on baseline-shrimp
embeddings_train = co.embed(model='baseline-shrimp', texts=sentences_train).embeddings
embeddings_test = co.embed(model='baseline-shrimp', texts=sentences_test).embeddings

We now have two sets of embeddings: `embeddings_train` contains the embeddings of the training sentences, while `embeddings_test` contains the embeddings of the test sentences.

Curious what an embedding looks like? We can print it:

In [None]:
print(f"Review text: {sentences_train[0]}")
print(f"Embedding vector: {embeddings_train[0]}")

Review text: its metaphors are opaque enough to avoid didacticism , and the film succeeds as an emotionally accessible , almost mystical work
Embedding vector: [0.6579068, 0.16875336, 0.4072475, -0.94259065, 0.6713255, -0.38336453, -1.2589791, -0.5653332, 0.14386642, 0.5655985, -0.24851555, 0.5713835, 0.39206874, 0.8174008, -0.26184016, -1.1561943, 0.23619102, -0.3591043, -0.77078646, -1.963628, 0.8563678, 0.12427265, -1.9898235, -0.22819875, 0.11939957, 0.5999593, 0.25658482, -0.3031655, -0.088325396, -0.10845581, 0.811858, 0.93025625, 0.6680047, -1.010159, -0.4077053, -0.22306879, -0.06685991, -1.5153711, -0.9042679, -0.9925903, 0.508977, -1.2123089, -0.75614417, 0.992275, 0.95189476, -0.5295757, -1.7159832, -0.16067825, -0.22996473, 0.5751526, -1.0977967, 0.11184283, -0.42436096, 0.6324947, -0.64518857, 0.07545187, -0.22488812, 0.3991404, -0.15333608, 0.47543943, 0.072336696, 0.038673982, -0.41471612, -0.15128434, -0.25229073, 0.6113767, -0.27524865, 1.3870261, -0.72940207, -0.87819
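
Printing a full vector is hard to read, so it can help to look at the shape of the whole set instead. The sketch below converts a list of embeddings into a NumPy array and reports how many vectors there are and how long each one is. It uses randomly generated numbers (`embeddings_demo` is a hypothetical stand-in for `embeddings_train`); the real vector length depends on the model.

```python
import numpy as np

# Hypothetical stand-in for embeddings_train: 5 vectors of 8 floats each
rng = np.random.default_rng(0)
embeddings_demo = rng.normal(size=(5, 8)).tolist()

# A list of equal-length lists converts directly into a 2D array
X = np.asarray(embeddings_demo)
n_texts, n_dims = X.shape

# Length (L2 norm) of each embedding vector
norms = np.linalg.norm(X, axis=1)

print(f"{n_texts} embeddings, {n_dims} dimensions each")
print(f"First three values of the first embedding: {X[0, :3]}")
```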

## 3. Train a classifier using the training set
Now that we have the embeddings, we can train our classifier. We'll use an SVM from sklearn.

In [None]:
# import support vector machine code
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


# initialize the support vector machine. class_weight='balanced' reweights
# classes inversely to their frequency; our training set has roughly equal
# numbers of positive and negative sentences, so it acts mainly as a safeguard
svm_classifier = make_pipeline(StandardScaler(), SVC(class_weight='balanced')) 

# fit the support vector machine
svm_classifier.fit(embeddings_train, labels_train)


Pipeline(memory=None,
         steps=[('standardscaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('svc',
                 SVC(C=1.0, break_ties=False, cache_size=200,
                     class_weight='balanced', coef0=0.0,
                     decision_function_shape='ovr', degree=3, gamma='scale',
                     kernel='rbf', max_iter=-1, probability=False,
                     random_state=None, shrinking=True, tol=0.001,
                     verbose=False))],
         verbose=False)
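
Before touching the test set, one optional sanity check is to cross-validate the same pipeline on the training data. The sketch below does this on synthetic data from `make_classification`, which stands in for the embeddings and labels; the names `X_demo` and `y_demo` and the choice of 5 folds are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for (embeddings_train, labels_train)
X_demo, y_demo = make_classification(n_samples=200, n_features=16, random_state=0)

# Same pipeline as above: standardize each feature, then fit an SVM
pipeline = make_pipeline(StandardScaler(), SVC(class_weight='balanced'))

# 5-fold cross-validation: fit on 4 folds, score accuracy on the held-out fold
cv_scores = cross_val_score(pipeline, X_demo, y_demo, cv=5)

print(f"Accuracy per fold: {cv_scores}")
print(f"Mean accuracy: {cv_scores.mean():.3f}")
```

Because the scaler sits inside the pipeline, it is refit on each training fold, so no statistics from the held-out fold leak into training.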

## 4. Evaluate the performance of the classifier on the testing set

In [None]:
# get the score from the test set, and print it out to screen!
score = svm_classifier.score(embeddings_test, labels_test)
print(f"Validation accuracy on baseline-shrimp is {100*score}%!")

Validation accuracy on baseline-shrimp is 84.8%!
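
The `score` method of a classifier pipeline returns plain accuracy, which hides how the errors split between the two classes. A confusion matrix shows that split. The sketch below uses synthetic data in place of the embeddings (all variable names here are illustrative) and checks that the accuracy derived from the matrix agrees with `score`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the embeddings and labels
X, y = make_classification(n_samples=300, n_features=16, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

clf = make_pipeline(StandardScaler(), SVC(class_weight='balanced')).fit(X_tr, y_tr)

# Rows are true classes, columns are predicted classes (0 = negative, 1 = positive)
cm = confusion_matrix(y_te, clf.predict(X_te), labels=[0, 1])

# Correct predictions sit on the diagonal
accuracy_from_cm = np.trace(cm) / cm.sum()
score = clf.score(X_te, y_te)

print(cm)
print(f"Accuracy from matrix: {accuracy_from_cm:.3f}, from score(): {score:.3f}")
```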


Let's try a larger model, baseline-otter, to see whether we can improve our score.

In [None]:

# embed sentences from both train and test set on baseline-otter
embeddings_train = co.embed(model='baseline-otter', texts=sentences_train).embeddings
embeddings_test = co.embed(model='baseline-otter', texts=sentences_test).embeddings
# run the same process as before
svm_classifier = make_pipeline(StandardScaler(), SVC(class_weight='balanced')) 
svm_classifier.fit(embeddings_train, labels_train)
score = svm_classifier.score(embeddings_test, labels_test)
print(f"Validation accuracy on baseline-otter is {100*score}%!")




With a bigger model, we can further improve the accuracy score! This trend typically holds: bigger models tend to create
more semantically rich embeddings, which can be used to solve a wider variety of downstream tasks. 
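
Since the train-and-score steps are identical for every model, they can be wrapped in a small helper that takes the embedding function as an argument. The sketch below is one way to do that; `evaluate_embedder` and `toy_embed` are hypothetical names, and `toy_embed` is a word-counting stand-in so the example runs without an API key.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def evaluate_embedder(embed_fn, texts_train, y_train, texts_test, y_test):
    """Embed both sets with embed_fn, fit the SVM pipeline, return test accuracy."""
    clf = make_pipeline(StandardScaler(), SVC(class_weight='balanced'))
    clf.fit(embed_fn(texts_train), y_train)
    return clf.score(embed_fn(texts_test), y_test)


def toy_embed(texts):
    """Hypothetical stand-in for an embedding model: counts two keywords."""
    return [[t.count("good"), t.count("bad")] for t in texts]


toy_train = ["good good film", "a good story", "bad bad film", "a bad story"]
toy_train_labels = [1, 1, 0, 0]
toy_test = ["good acting", "bad acting"]
toy_test_labels = [1, 0]

toy_score = evaluate_embedder(toy_embed, toy_train, toy_train_labels,
                              toy_test, toy_test_labels)
print(f"Accuracy with the toy embedder: {100 * toy_score}%")
```

In this notebook, the embedding function could be something like `lambda texts: co.embed(model='baseline-otter', texts=texts).embeddings`, called once per model name to compare.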

This was a small-scale example, meant as a proof of concept and designed to illustrate how you can quickly build a custom
classifier using a small amount of labelled data and Cohere's embeddings. 