## Classify Text into Labels

https://js.langchain.com/docs/tutorials/classification/


Tagging means labeling a document with classes such as:

- sentiment
- language
- style (formal, informal, etc.)
- covered topics
- political tendency

![Overview](../../assets/images/classification_workflow.png)
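
A label set like the one listed above can be written down as a type before any model is involved. The sketch below is a minimal, hypothetical example: the `Classification` shape, its allowed values, and the `parseClassification` helper are assumptions made for illustration, not part of LangChain or of this notebook. It only validates a value that has already been parsed, such as JSON returned by a model.

```typescript
// Hypothetical label schema; the field names and allowed values are assumptions.
type Sentiment = "positive" | "neutral" | "negative";
type Style = "formal" | "informal";

interface Classification {
  sentiment: Sentiment;
  language: string; // e.g. "English"
  style: Style;
}

const SENTIMENTS: readonly string[] = ["positive", "neutral", "negative"];
const STYLES: readonly string[] = ["formal", "informal"];

// Validate an untrusted value (for example, JSON parsed from a model reply).
function parseClassification(value: unknown): Classification {
  if (typeof value !== "object" || value === null) {
    throw new Error("Expected an object");
  }
  const record = value as Record<string, unknown>;
  const sentiment = record.sentiment;
  const language = record.language;
  const style = record.style;
  if (typeof sentiment !== "string" || SENTIMENTS.indexOf(sentiment) === -1) {
    throw new Error("Invalid sentiment: " + String(sentiment));
  }
  if (typeof language !== "string" || language.length === 0) {
    throw new Error("Invalid language: " + String(language));
  }
  if (typeof style !== "string" || STYLES.indexOf(style) === -1) {
    throw new Error("Invalid style: " + String(style));
  }
  return { sentiment: sentiment as Sentiment, language, style: style as Style };
}

const exampleTags = parseClassification(
  JSON.parse('{"sentiment":"positive","language":"Spanish","style":"informal"}')
);
console.log(exampleTags);
```

Rejecting unknown label values at this boundary keeps downstream code from having to handle labels the schema never defined.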


#### Quickstart

In [3]:
// Load environment variables
import * as dotenv from "dotenv";
dotenv.config();
// Check if the OPENAI_API_KEY environment variable is set
if (!process.env.OPENAI_API_KEY) {
  throw new Error("Missing OPENAI_API_KEY environment variable");
}

import { ChatOpenAI } from "@langchain/openai";
const llm = new ChatOpenAI({
  model: "gpt-4o-mini",
  temperature: 0
});



#### Loading documents

In [2]:
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";

const loader = new PDFLoader("./data/nke-10k-2023.pdf");

const docs = await loader.load();
console.log(docs.length);

107


* The string content of the page

In [4]:
docs[0].pageContent.slice(0, 200);

Table of Contents
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K
(Mark One)
☑ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934
FO


* Metadata containing the file name and page number

In [5]:
docs[0].metadata;

{
  source: './data/nke-10k-2023.pdf',
  pdf: {
    version: '1.10.100',
    info: {
      PDFFormatVersion: '1.4',
      IsAcroFormPresent: false,
      IsXFAPresent: false,
      Title: '0000320187-23-000039',
      Author: 'EDGAR Online, a division of Donnelley Financial Solutions',
      Subject: 'Form 10-K filed on 2023-07-20 for the period ending 2023-05-31',
      Keywords: '0000320187-23-000039; ; 10-K',
      Creator: 'EDGAR Filing HTML Converter',
      Producer: 'EDGRpdf Service w/ EO.Pdf 22.0.40.0',
      CreationDate: "D:20230720162200-04'00'",
      ModDate: "D:20230720162208-04'00'"
    },
    metadata: null,
    totalPages: 107
  },
  loc: { pageNumber: 1 }
}


* Splitting the pages into overlapping chunks

In [6]:
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

const textSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
});

const allSplits = await textSplitter.splitDocuments(docs);

allSplits.length;

513
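
To make `chunkSize` and `chunkOverlap` concrete, here is a simplified sliding-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that class also tries to break on separators such as paragraphs and newlines); `slidingWindowChunks` is a hypothetical helper that only shows how consecutive chunks share `chunkOverlap` characters.

```typescript
// Illustrative only: a fixed-size character window, not LangChain's splitter.
function slidingWindowChunks(
  text: string,
  chunkSize: number,
  chunkOverlap: number
): string[] {
  if (chunkSize <= 0) {
    throw new Error("chunkSize must be positive");
  }
  if (chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("chunkOverlap must be in [0, chunkSize)");
  }
  const step = chunkSize - chunkOverlap; // how far the window advances each time
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once the window has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

const alphabetChunks = slidingWindowChunks("abcdefghijklmnopqrstuvwxyz", 10, 3);
console.log(alphabetChunks);
// ["abcdefghij", "hijklmnopq", "opqrstuvwx", "vwxyz"]
```

Each chunk starts with the last three characters of the previous one, which is the role the 200-character overlap plays in the cell above: text cut at a chunk boundary still appears whole in one of the two neighbors.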


#### Embeddings

In [7]:
// Load environment variables
import * as dotenv from "dotenv";
dotenv.config();

if (!process.env.OPENAI_API_KEY) {
  throw new Error("Missing OPENAI_API_KEY environment variable");
}

import { OpenAIEmbeddings } from "@langchain/openai";

const embeddings = new OpenAIEmbeddings({
  model: "text-embedding-3-large"
});

const vector1 = await embeddings.embedQuery(allSplits[0].pageContent);
const vector2 = await embeddings.embedQuery(allSplits[1].pageContent);

console.assert(vector1.length === vector2.length);
console.log(`Generated vectors of length ${vector1.length}\n`);
console.log(vector1.slice(0, 10));

Generated vectors of length 3072

[
  0.014386530965566635,
  -0.01680644601583481,
  -0.0011675760615617037,
  0.010676880367100239,
  0.02285623550415039,
  -0.02829439751803875,
  -0.000561766151804477,
  0.04193633794784546,
  -0.0011202081805095077,
  0.06616208702325821
]


#### Vector stores

In [8]:
import { MemoryVectorStore } from "langchain/vectorstores/memory";

const vectorStore = new MemoryVectorStore(embeddings);
await vectorStore.addDocuments(allSplits);
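
Similarity search over a vector store comes down to scoring the query embedding against every stored embedding and keeping the best matches. The sketch below assumes cosine similarity as the metric and uses tiny hand-written vectors in place of real embeddings; `cosineSimilarity` and `topK` are illustrative helpers, not LangChain APIs.

```typescript
// Cosine similarity: dot product divided by the product of the vector lengths.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // treat zero vectors as dissimilar
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force search: score every stored vector, sort descending, keep k.
function topK(
  query: number[],
  vectors: number[][],
  k: number
): { index: number; score: number }[] {
  return vectors
    .map((vector, index) => ({ index, score: cosineSimilarity(query, vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

const storedVectors = [
  [0, 1],
  [1, 0],
  [1, 1],
];
console.log(topK([1, 0], storedVectors, 2));
```

A linear scan like this is workable for a few hundred chunks, as in this notebook; larger collections usually call for an indexed vector database.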


* Usage

In [9]:
const results1 = await vectorStore.similaritySearch(
  "When was Nike incorporated?"
);

results1[0];

Document {
  pageContent: 'Table of Contents\n' +
    'PART I\n' +
    'ITEM 1. BUSINESS\n' +
    'GENERAL\n' +
    'NIKE, Inc. was incorporated in 1967 under the laws of the State of Oregon. As used in this Annual Report on Form 10-K (this "Annual Report"), the terms "we," "us," "our,"\n' +
    '"NIKE" and the "Company" refer to NIKE, Inc. and its predecessors, subsidiaries and affiliates, collectively, unless the context indicates otherwise.\n' +
    'Our principal business activity is the design, development and worldwide marketing and selling of athletic footwear, apparel, equipment, accessories and services. NIKE is\n' +
    'the largest seller of athletic footwear and apparel in the world. We sell our products through NIKE Direct operations, which are comprised of both NIKE-owned retail stores\n' +
    'and sales through our digital platforms (also referred to as "NIKE Brand Digital"), to retail accounts and to a mix of independent distributors, licensees and sales',
  metadata: 

In [10]:
const results2 = await vectorStore.similaritySearchWithScore(
  "What was Nike's revenue in 2023?"
);

results2[0];

[
  Document {
    pageContent: 'Table of Contents\n' +
      'FISCAL 2023 NIKE BRAND REVENUE HIGHLIGHTS\n' +
      'The following tables present NIKE Brand revenues disaggregated by reportable operating segment, distribution channel and major product line:\n' +
      'FISCAL 2023 COMPARED TO FISCAL 2022\n' +
      '•NIKE, Inc. Revenues were $51.2 billion in fiscal 2023, which increased 10% and 16% compared to fiscal 2022 on a reported and currency-neutral basis, respectively.\n' +
      'The increase was due to higher revenues in North America, Europe, Middle East & Africa ("EMEA"), APLA and Greater China, which contributed approximately 7, 6,\n' +
      '2 and 1 percentage points to NIKE, Inc. Revenues, respectively.\n' +
      '•NIKE Brand revenues, which represented over 90% of NIKE, Inc. Revenues, increased 10% and 16% on a reported and currency-neutral basis, respectively. This\n' +
      "increase was primarily due to higher revenues in Men's, the Jordan Brand, Women's and Kids'

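**Cell 11** and **Cell 12** both run a similarity search against the vector store. They differ in the input they accept: Cell 11 takes a precomputed embedding vector, while Cell 12 takes a raw query string (it repeats the `similaritySearchWithScore` call from In [10]).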

---

### **Cell 11**


In [None]:
const embedding = await embeddings.embedQuery(
  "How were Nike's margins impacted in 2023?"
);

const results3 = await vectorStore.similaritySearchVectorWithScore(
  embedding,
  1
);

results3[0];



#### **Key Details**:
1. **Input**:
   - The query string `"How were Nike's margins impacted in 2023?"` is first embedded into a vector using `embeddings.embedQuery`.
   - The resulting embedding (a numerical vector) is then passed to `similaritySearchVectorWithScore`.

2. **Method Used**:
   - `similaritySearchVectorWithScore` performs a similarity search directly using the precomputed embedding vector.

3. **Output**:
   - Returns an array of `[Document, score]` pairs for the documents most similar to the query embedding.
   - The second argument, `1`, is `k`: only the top result is returned.

4. **Use Case**:
   - Use this when you already have a precomputed embedding for the query or want to control the embedding process separately.

---

### **Cell 12**


In [None]:
const results2 = await vectorStore.similaritySearchWithScore(
  "What was Nike's revenue in 2023?"
);

results2[0];



#### **Key Details**:
1. **Input**:
   - The query string `"What was Nike's revenue in 2023?"` is passed directly to the `similaritySearchWithScore` method.

2. **Method Used**:
   - `similaritySearchWithScore` internally embeds the query string using the `embeddings` instance associated with the `vectorStore`.
   - It then performs a similarity search using the generated embedding.

3. **Output**:
   - Returns an array of `[Document, score]` pairs for the documents most similar to the query.
   - No `k` is passed here, so the method falls back to its default (`k = 4` in LangChain.js's base `VectorStore` class) and can return several results.

4. **Use Case**:
   - Use this when you want to perform a similarity search directly with a query string without manually embedding it first.

---

### **Key Differences**

| **Aspect**                | **Cell 11**                                                                 | **Cell 12**                                                                 |
|---------------------------|-----------------------------------------------------------------------------|-----------------------------------------------------------------------------|
| **Input**                 | Precomputed embedding (via `embedQuery`).                                  | Query string directly.                                                     |
| **Embedding Process**     | Embedding is done manually before the search.                              | Embedding is handled internally by `similaritySearchWithScore`.            |
| **Method Used**           | `similaritySearchVectorWithScore`.                                         | `similaritySearchWithScore`.                                               |
| **Flexibility**           | Allows custom embedding logic or reuse of precomputed embeddings.          | Simpler and more convenient for direct query searches.                     |
| **Use Case**              | When you want to control the embedding process or reuse embeddings.         | When you want a quick similarity search with a query string.               |

---

### Summary
- **Cell 11**: You manually embed the query and then perform a similarity search using the embedding.
- **Cell 12**: You pass the query string directly, and the vector store handles embedding and searching internally.

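For the same query, `k`, and embeddings model, both approaches return the same results, because the string-based method embeds the query and then runs the same vector search. **Cell 11** gives you more control over the embedding step, while **Cell 12** is more convenient for quick searches.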
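
The relationship between the two methods can be shown with a toy store. Everything in this sketch is hypothetical: `toyEmbed` is a stand-in for a real embeddings model (it just counts letters), `ToyVectorStore` is not LangChain's `MemoryVectorStore`, and the methods are synchronous for brevity, whereas the real ones return promises. The point is only that the string-based method can be a thin wrapper around the vector-based one.

```typescript
type ScoredDoc = [string, number];

// Toy stand-in for an embeddings model: counts of the letters a-z.
function toyEmbed(text: string): number[] {
  const counts: number[] = [];
  for (let i = 0; i < 26; i++) counts.push(0);
  const lower = text.toLowerCase();
  for (let i = 0; i < lower.length; i++) {
    const code = lower.charCodeAt(i) - 97; // 97 is "a"
    if (code >= 0 && code < 26) counts[code] += 1;
  }
  return counts;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

class ToyVectorStore {
  private entries: { text: string; vector: number[] }[] = [];

  addTexts(texts: string[]): void {
    for (const text of texts) {
      this.entries.push({ text, vector: toyEmbed(text) });
    }
  }

  // Vector path: the caller supplies a precomputed embedding.
  similaritySearchVectorWithScore(query: number[], k: number): ScoredDoc[] {
    return this.entries
      .map((entry): ScoredDoc => [entry.text, cosine(query, entry.vector)])
      .sort((x, y) => y[1] - x[1])
      .slice(0, k);
  }

  // String path: embed the query, then delegate to the vector path.
  similaritySearchWithScore(query: string, k = 4): ScoredDoc[] {
    return this.similaritySearchVectorWithScore(toyEmbed(query), k);
  }
}

const toyStore = new ToyVectorStore();
toyStore.addTexts(["gross margin decreased", "revenue increased", "incorporated in oregon"]);
console.log(toyStore.similaritySearchWithScore("margin", 1));
console.log(toyStore.similaritySearchVectorWithScore(toyEmbed("margin"), 1));
```

Both calls print the same pair, which is why the choice between the two methods is about where the embedding happens rather than about the search itself.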

In [11]:
const embedding = await embeddings.embedQuery(
  "How were Nike's margins impacted in 2023?"
);

const results3 = await vectorStore.similaritySearchVectorWithScore(
  embedding,
  1
);

results3[0];

[
  Document {
    pageContent: 'Table of Contents\n' +
      'GROSS MARGIN\n' +
      'FISCAL 2023 COMPARED TO FISCAL 2022\n' +
      'For fiscal 2023, our consolidated gross profit increased 4% to $22,292 million compared to $21,479 million for fiscal 2022. Gross margin decreased 250 basis points to\n' +
      '43.5% for fiscal 2023 compared to 46.0% for fiscal 2022 due to the following:\n' +
      '*Wholesale equivalent\n' +
      'The decrease in gross margin for fiscal 2023 was primarily due to:\n' +
      '•Higher NIKE Brand product costs, on a wholesale equivalent basis, primarily due to higher input costs and elevated inbound freight and logistics costs as well as\n' +
      'product mix;\n' +
      '•Lower margin in our NIKE Direct business, driven by higher promotional activity to liquidate inventory in the current period compared to lower promotional activity in\n' +
      'the prior period resulting from lower available inventory supply;\n' +
      '•Unfavorable changes in 

#### Retrievers
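
The retriever in this section uses `searchType: "mmr"`, short for maximal marginal relevance. MMR picks results one at a time, trading relevance to the query against similarity to results already picked, so near-duplicate chunks are less likely to crowd the list. The sketch below is a generic version of that idea, assuming cosine similarity and a `lambda` weight between 0 and 1; `mmrSelect` is a hypothetical helper, and LangChain's own implementation may differ in its details.

```typescript
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Returns the indices of the chosen candidates, in selection order.
// lambda = 1 ranks purely by relevance; lower values favor diversity.
function mmrSelect(
  query: number[],
  candidates: number[][],
  k: number,
  lambda = 0.5
): number[] {
  const selected: number[] = [];
  const remaining: number[] = candidates.map((_, i) => i);
  while (selected.length < k && remaining.length > 0) {
    let bestPos = 0;
    let bestScore = -Infinity;
    for (let pos = 0; pos < remaining.length; pos++) {
      const idx = remaining[pos];
      const relevance = cosine(query, candidates[idx]);
      // Redundancy: highest similarity to anything already selected.
      let redundancy = selected.length === 0 ? 0 : -Infinity;
      for (const s of selected) {
        redundancy = Math.max(redundancy, cosine(candidates[idx], candidates[s]));
      }
      const score = lambda * relevance - (1 - lambda) * redundancy;
      if (score > bestScore) {
        bestScore = score;
        bestPos = pos;
      }
    }
    selected.push(remaining[bestPos]);
    remaining.splice(bestPos, 1);
  }
  return selected;
}

const mmrCandidates = [
  [1, 0.9], // most relevant
  [1, 0.8], // near-duplicate of the first
  [0.2, 1], // less relevant, but different
];
console.log(mmrSelect([1, 1], mmrCandidates, 2, 0.5)); // [0, 2]
console.log(mmrSelect([1, 1], mmrCandidates, 2, 1)); // [0, 1]
```

With `lambda = 0.5` the near-duplicate is skipped in favor of the more distinct candidate; with `lambda = 1` the selection reduces to plain relevance ranking.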

In [12]:
const retriever = vectorStore.asRetriever({
  searchType: "mmr",
  searchKwargs: {
    fetchK: 1,
  },
});

await retriever.batch([
  "When was Nike incorporated?",
  "What was Nike's revenue in 2023?",
]);

[
  [
    Document {
      pageContent: 'Table of Contents\n' +
        'PART I\n' +
        'ITEM 1. BUSINESS\n' +
        'GENERAL\n' +
        'NIKE, Inc. was incorporated in 1967 under the laws of the State of Oregon. As used in this Annual Report on Form 10-K (this "Annual Report"), the terms "we," "us," "our,"\n' +
        '"NIKE" and the "Company" refer to NIKE, Inc. and its predecessors, subsidiaries and affiliates, collectively, unless the context indicates otherwise.\n' +
        'Our principal business activity is the design, development and worldwide marketing and selling of athletic footwear, apparel, equipment, accessories and services. NIKE is\n' +
        'the largest seller of athletic footwear and apparel in the world. We sell our products through NIKE Direct operations, which are comprised of both NIKE-owned retail stores\n' +
        'and sales through our digital platforms (also referred to as "NIKE Brand Digital"), to retail accounts and to a mix of independent di