This app is a text classification model that uses a pre-trained transformer-based language model, specifically DistilBERT, to generate embeddings for text documents.

In [None]:
!pip install openai


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting openai
  Downloading openai-0.27.4-py3-none-any.whl (70 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
Collecting async-timeout<5.0,>=4.0.0a3
  Downloading async_timeout-4.0.2-py3-none-any.whl (5.8 kB)
Collecting yarl<2.0,>=1.0
  Downloading yarl-1.9.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (269 kB)
Collecting multidict<7.0,>=4.5
  Downloading multidict-6.0.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (114 kB)

In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transformers-4.28.1


In [None]:
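import torch  # must be imported before torch.device is used in this cell

# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')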


In [None]:
import numpy as np
import torch
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix
from transformers import AutoTokenizer, AutoModel
import os
import hashlib

# Load a smaller subset of the 20 Newsgroups dataset
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
documents, labels = newsgroups.data[:1000], newsgroups.target[:1000]

# Load a smaller and more efficient transformer model
model_name = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

# Function to generate embeddings for text documents
def generate_embeddings(texts, batch_size=32):
    cache_dir = '.embeddings_cache'
    os.makedirs(cache_dir, exist_ok=True)
    cache_path = os.path.join(cache_dir, hashlib.md5('\n'.join(texts).encode('utf-8')).hexdigest() + '.npy')
    if os.path.exists(cache_path):
        embeddings = np.load(cache_path)
    else:
        embeddings = []
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i+batch_size]
            input_ids = tokenizer(batch, return_tensors='pt', padding=True, truncation=True, max_length=512)['input_ids']
            with torch.no_grad():
                outputs = model(input_ids.to(device))
            embeddings.append(outputs.last_hidden_state.mean(dim=1).cpu().numpy())
        embeddings = np.concatenate(embeddings)
        np.save(cache_path, embeddings)
    return embeddings

# Generate embeddings for the text documents
X = generate_embeddings(documents)
y = np.array(labels)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the SVM classifier
clf = SVC(kernel='linear')
clf.fit(X_train, y_train)

# Make predictions and evaluate the classifier
y_pred = clf.predict(X_test)
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.bias', 'vocab_projector.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Classification Report:
               precision    recall  f1-score   support

           0       0.56      0.45      0.50        11
           1       0.11      0.20      0.14        10
           2       0.45      0.56      0.50         9
           3       0.50      0.24      0.32        17
           4       0.00      0.00      0.00         7
           5       0.62      0.62      0.62        13
           6       0.50      0.62      0.56         8
           7       0.50      0.18      0.27        11
           8       0.18      0.55      0.27        11
           9       0.50      0.33      0.40        12
          10       0.60      0.46      0.52        13
          11       0.20      0.30      0.24        10
          12       0.23      0.43      0.30         7
          13       0.67      0.40      0.50        10
          14       1.00      0.42      0.59        12
          15       0.75      0.75      0.75         8
          16       0.17      0.11      0.13         9
   

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
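
A note on the cache in `generate_embeddings` above: the file name is an MD5 hash of the joined texts only, so embeddings saved under one `model_name` would be reloaded unchanged if the model were later swapped. The sketch below shows one way to fold the model name and maximum length into the key. `embedding_cache_key` is a hypothetical helper, not part of the notebook, and it uses SHA-256 instead of the original MD5 purely as an illustrative choice.

```python
import hashlib

def embedding_cache_key(texts, model_name, max_length=512):
    """Return a hex digest that changes when the texts or the settings change.

    Hypothetical helper for illustration; the notebook hashes the texts only.
    """
    h = hashlib.sha256()
    # Null separators keep the model name and max_length from running together
    h.update(f"{model_name}\0{max_length}\0".encode('utf-8'))
    for text in texts:
        encoded = text.encode('utf-8')
        # Length-prefix each text so ['ab', 'c'] and ['a', 'bc'] hash differently
        h.update(len(encoded).to_bytes(8, 'big'))
        h.update(encoded)
    return h.hexdigest()

docs = ['first document', 'second document']
key_a = embedding_cache_key(docs, 'distilbert-base-uncased')
key_b = embedding_cache_key(docs, 'bert-base-uncased')
key_c = embedding_cache_key(docs, 'distilbert-base-uncased')
```

Here `key_a` and `key_c` are identical because the inputs match, while `key_b` differs because only the model name changed. The digest could replace the MD5 hex string when building `cache_path`.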


In [None]:
import numpy as np
import torch
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix
from transformers import AutoTokenizer, AutoModel
import os
import hashlib

# Load a smaller subset of the 20 Newsgroups dataset
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
documents, labels = newsgroups.data[:1000], newsgroups.target[:1000]

# Load a smaller and more efficient transformer model
model_name = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

# Function to generate embeddings for text documents
def generate_embeddings(texts, batch_size=32):
    cache_dir = '.embeddings_cache'
    os.makedirs(cache_dir, exist_ok=True)
    cache_path = os.path.join(cache_dir, hashlib.md5('\n'.join(texts).encode('utf-8')).hexdigest() + '.npy')
    if os.path.exists(cache_path):
        embeddings = np.load(cache_path)
    else:
        embeddings = []
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i+batch_size]
            input_ids = tokenizer(batch, return_tensors='pt', padding=True, truncation=True, max_length=512)['input_ids']
            with torch.no_grad():
                outputs = model(input_ids.to(device))
            embeddings.append(outputs.last_hidden_state.mean(dim=1).cpu().numpy())
        embeddings = np.concatenate(embeddings)
        np.save(cache_path, embeddings)
    return embeddings

# Generate embeddings for the text documents
X = generate_embeddings(documents)
y = np.array(labels)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the SVM classifier
clf = SVC(kernel='linear')
clf.fit(X_train, y_train)

# Make predictions and evaluate the classifier
y_pred = clf.predict(X_test)
print("Classification Report:\n", classification_report(y_test, y_pred, zero_division=0))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.bias', 'vocab_projector.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Classification Report:
               precision    recall  f1-score   support

           0       0.56      0.45      0.50        11
           1       0.11      0.20      0.14        10
           2       0.45      0.56      0.50         9
           3       0.50      0.24      0.32        17
           4       0.00      0.00      0.00         7
           5       0.62      0.62      0.62        13
           6       0.50      0.62      0.56         8
           7       0.50      0.18      0.27        11
           8       0.18      0.55      0.27        11
           9       0.50      0.33      0.40        12
          10       0.60      0.46      0.52        13
          11       0.20      0.30      0.24        10
          12       0.23      0.43      0.30         7
          13       0.67      0.40      0.50        10
          14       1.00      0.42      0.59        12
          15       0.75      0.75      0.75         8
          16       0.17      0.11      0.13         9
   

The app above is a text classification pipeline that uses a pre-trained transformer language model, DistilBERT (`distilbert-base-uncased`), to generate an embedding for each text document by averaging the model's final hidden states over the token positions. These embeddings are then used as features to train a linear support vector machine (SVM) that classifies the documents into the 20 Newsgroups categories.
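
One detail of the embedding step is worth spelling out. `generate_embeddings` passes only `input_ids` to the model and takes `last_hidden_state.mean(dim=1)`, so the padding positions added by `padding=True` are averaged into each document vector. The sketch below shows a padding-aware mean on toy NumPy arrays rather than real model output. `masked_mean_pool` is a hypothetical helper, and its `attention_mask` argument is assumed to be the 0/1 mask the tokenizer returns alongside `input_ids`.

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per document, ignoring padded positions.

    hidden_states:  float array of shape (batch, seq_len, hidden_dim)
    attention_mask: 0/1 array of shape (batch, seq_len); 1 marks a real token
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    # Clip the token count to at least 1 to avoid dividing by zero
    counts = np.clip(mask.sum(axis=1), 1.0, None)
    return summed / counts

# Toy batch: 2 documents, 3 token positions, 2 hidden dimensions
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last position is padding
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],      # no padding
])
mask = np.array([[1, 1, 0],
                 [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)  # padding excluded
plain = hidden.mean(axis=1)              # padding included, as in the notebook
```

For the first toy document the masked mean is `[2.0, 3.0]`, whereas the plain mean is pulled toward the padding vector. For the second document, which has no padding, the two agree. Whether this changes the classification scores reported above has not been measured here.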