<img src="https://drive.google.com/uc?export=view&id=1wYSMgJtARFdvTt5g7E20mE4NmwUFUuog" width="200">

[![Build Fast with AI](https://img.shields.io/badge/BuildFastWithAI-GenAI%20Bootcamp-blue?style=for-the-badge&logo=artificial-intelligence)](https://www.buildfastwithai.com/genai-course)
[![EduChain GitHub](https://img.shields.io/github/stars/satvik314/educhain?style=for-the-badge&logo=github&color=gold)](https://github.com/satvik314/educhain)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1I6md-4crUiK8gR4HZ73RoclUSetDgfYC?usp=sharingZ)

# 🌟 **Nomic: Empowering Large-Scale Data Insights**

Nomic is an open-source platform designed to help you **analyze, structure, and interact with large-scale datasets** across various modalities, including text, images, embeddings, audio, and video. 🚀

## ✨ **Key Features of Nomic Atlas**:
- 📁 **Data Organization**: Efficiently manage and organize diverse data types, such as text, images, and embeddings.
- 🌐 **Interactive Visualization**: Create shareable maps to explore complex data relationships with ease.
- 🔍 **Advanced Search**: Perform semantic searches across millions of data points to uncover patterns and insights.
- 🧹 **Data Cleaning and Tagging**: Tag, clean, and deduplicate datasets to ensure data quality.


### **Setup and Installation**

In [None]:
!pip install nomic datasets

### **Login to Nomic**


In [None]:
!nomic login

### **Setup Secrets**

In [None]:
from google.colab import userdata

# Read the API token stored as the Colab secret named 'nomic_token'.
Token = userdata.get('nomic_token')
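
Outside Colab, `google.colab.userdata` cannot be imported. A minimal sketch of a fallback, assuming the token is also exported as an environment variable. The helper `load_nomic_token` and the variable name `NOMIC_TOKEN` are hypothetical choices for illustration, not something the Nomic client defines:

```python
import os


def load_nomic_token(secret_name="nomic_token", env_var="NOMIC_TOKEN"):
    """Return the token from Colab secrets if available, else from the environment.

    `secret_name` and `env_var` are illustrative defaults (assumptions).
    """
    try:
        # Only importable inside Google Colab.
        from google.colab import userdata
    except ImportError:
        userdata = None

    if userdata is not None:
        return userdata.get(secret_name)

    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your Nomic token.")
    return token
```

The returned value can then be used wherever the notebook passes a token to the login call.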

### **Login to Nomic with Token**

In [None]:
import nomic
nomic.cli.login(
    token=Token,
)


### **Import Required Libraries**

In [None]:
from nomic import atlas, embed
import numpy as np
from datasets import load_dataset

### **📚 Load AG News Dataset**

In [None]:
dataset = load_dataset('ag_news')['train']


### **📊 Select a Subset of the Dataset**

In [None]:
max_documents = 10000
# Sample without replacement so that no document appears twice in the subset.
subset_idxs = np.random.choice(len(dataset), size=max_documents, replace=False).tolist()
documents = [dataset[i] for i in subset_idxs]
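
The difference between sampling with and without replacement, and the effect of a fixed seed, can be sketched with the standard library alone. The population size, sample size, and seed below are illustrative assumptions, not values from this notebook. In NumPy the corresponding switch is the `replace` argument of `np.random.choice`.

```python
import random

population_size = 1_000   # stand-in for len(dataset)
sample_size = 100         # stand-in for max_documents
rng = random.Random(42)   # fixed seed so the subset is reproducible

# Without replacement: every index appears at most once.
unique_idxs = rng.sample(range(population_size), k=sample_size)

# With replacement: the same index may be drawn more than once.
repeated_idxs = rng.choices(range(population_size), k=sample_size)

print(len(set(unique_idxs)), "distinct indices without replacement")
print(len(set(repeated_idxs)), "distinct indices with replacement")
```

Sampling without replacement guarantees that the number of distinct documents equals the requested subset size.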


### **🛠️ Define Functions for Token Usage and Embedding Generation**

In [None]:
usages = []

def print_total_tokens(usages):
    # Despite its name, this returns (rather than prints) the cumulative token count.
    return sum(usage['total_tokens'] for usage in usages)


In [None]:
def generate_embeddings(documents):
    batch_size = 256
    document_embeddings = []

    batch = []
    for idx, doc in enumerate(documents):
        batch.append(doc['text'])
        # Flush when the batch is full or when the last document has been reached,
        # so a final partial batch is not silently dropped.
        if (idx + 1) % batch_size == 0 or idx == len(documents) - 1:
            batch_embeddings = embed.text(texts=batch, model='nomic-embed-text-v1')
            usages.append(batch_embeddings['usage'])
            document_embeddings.extend(batch_embeddings['embeddings'])
            # Show this batch's usage and the cumulative token count.
            print(usages[-1], print_total_tokens(usages))

            batch = []

    return np.array(document_embeddings)
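
The batching pattern can be checked in isolation without calling the embedding API. A minimal sketch, where `iter_batches` and the stub `fake_embed` are hypothetical helpers standing in for the real `embed.text` call:

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, including a final partial batch."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def fake_embed(texts):
    """Stand-in for an embedding call: one 2-d vector per text (not a real model)."""
    return [[float(len(t)), 0.0] for t in texts]


texts = [f"document {i}" for i in range(10)]
batch_sizes = []
vectors = []
for batch in iter_batches(texts, batch_size=4):
    batch_sizes.append(len(batch))
    vectors.extend(fake_embed(batch))

print(batch_sizes)   # [4, 4, 2]
print(len(vectors))  # 10
```

Slicing by a fixed stride avoids a separate "is this the last item" check, so the trailing partial batch is handled automatically.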


In [None]:
document_embeddings = generate_embeddings(documents)
print(document_embeddings.shape)
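
Embeddings like these are typically compared with cosine similarity, which is the usual basis for semantic search. The following is a toy sketch with hand-written 2-d vectors, not real model output. The function and variable names are illustrative assumptions:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query = [1.0, 0.0]
candidates = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

# Score every candidate against the query and rank from most to least similar.
scores = {name: cosine_similarity(query, vec) for name, vec in candidates.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Vector length does not affect the score, only direction does, which is why `[2.0, 0.0]` is a perfect match for `[1.0, 0.0]`.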


### **🗺️ Create Atlas Map for AG News Dataset (`ag_news_25k.csv` Sample)**


In [None]:
from nomic import atlas
import pandas

news_articles = pandas.read_csv(
    'https://raw.githubusercontent.com/nomic-ai/maps/main/data/ag_news_25k.csv'
)

atlas.map_data(
    data=news_articles,
    indexed_field='text',
    identifier="Example-text-dataset-news"
)

2025-01-28 19:00:08.995 | INFO     | nomic.dataset:_create_project:867 - Organization name: `mukeshofficial685`
2025-01-28 19:00:09.334 | INFO     | nomic.dataset:_create_project:895 - Creating dataset `example-text-dataset-news`
2025-01-28 19:00:09.433 | INFO     | nomic.atlas:map_data:145 - Uploading data to Atlas.
100%|██████████| 5/5 [00:02<00:00,  2.48it/s]
2025-01-28 19:00:11.601 | INFO     | nomic.dataset:_add_data:1714 - Upload succeeded.
2025-01-28 19:00:11.603 | INFO     | nomic.atlas:map_data:163 - `mukeshofficial685/example-text-dataset-news`: Data upload succeeded to dataset`
2025-01-28 19:00:12.199 | INFO     | nomic.dataset:create_index:1301 - Created map `Example-text

### **🗺️ Generate Atlas Map for AG News Dataset (Random Subset of 100K Documents, No Duplicates)**


In [None]:
from nomic import atlas
import numpy as np
from datasets import load_dataset

dataset = load_dataset('ag_news')['train']

max_documents = 100000
subset_idxs = np.random.choice(len(dataset), size=max_documents, replace=False).tolist()
documents = [dataset[i] for i in subset_idxs]

project = atlas.map_data(
    data=documents,
    indexed_field='text',
    identifier='News 100k Example',
    description='News 100k Example from the ag_news dataset hosted on huggingface.',
)
print(project.maps)

2025-01-28 19:02:03.984 | INFO     | nomic.dataset:_create_project:867 - Organization name: `mukeshofficial685`
2025-01-28 19:02:04.310 | INFO     | nomic.dataset:_create_project:895 - Creating dataset `news-100k-example`
2025-01-28 19:02:04.405 | INFO     | nomic.atlas:map_data:145 - Uploading data to Atlas.
100%|██████████| 20/20 [00:04<00:00,  4.05it/s]
2025-01-28 19:02:09.453 | INFO     | nomic.dataset:_add_data:1714 - Upload succeeded.
2025-01-28 19:02:09.461 | INFO     | nomic.atlas:map_data:163 - `mukeshofficial685/news-100k-example`: Data upload succeeded to dataset`
2025-01-28 19:02:10.199 | INFO     | nomic.dataset:create_index:1301 - Created map `News 100k Example` in data

[News 100k Example: https://atlas.nomic.ai/data/mukeshofficial685/news-100k-example]


### **🧠 Topic Extraction**

In [None]:
from pprint import pprint

import numpy as np
from datasets import load_dataset
from nomic import atlas

dataset = load_dataset('ag_news')['train']

max_documents = 10000
subset_idxs = np.random.choice(len(dataset), size=max_documents, replace=False).tolist()
documents = [dataset[i] for i in subset_idxs]

project = atlas.map_data(data=documents,
                         indexed_field='text',
                         identifier='News 10k For Topic Extraction',
                         description='News 10k For Topic Extraction')

with project.wait_for_dataset_lock():
    pprint(project.maps[0].topics.group_by_topic(topic_depth=1)[0])

### **🌍 Create Atlas Map for English-German Translations**

In [None]:
from nomic import atlas
from datasets import load_dataset

dataset = load_dataset("bbaaaa/iwslt14-de-en", split="train")

max_documents = 50_000
selected = dataset[:max_documents]["translation"]

documents = []
for doc in selected:
    en_data = {"text": doc["en"], "en": doc["en"], "de": doc["de"], "language": "en"}
    de_data = {"text": doc["de"], "en": doc["en"], "de": doc["de"], "language": "de"}
    documents.append(en_data)
    documents.append(de_data)

project = atlas.map_data(
    data=documents,
    indexed_field='text',
    identifier='English-German 50k Translations',
    description='50k Examples from the iwslt14-de-en dataset hosted on huggingface.',
    embedding_model='gte-multilingual-base',
)
print(project.maps)
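
The record-building step above can be tried on a couple of in-memory pairs before loading the full dataset. A minimal sketch, where the two sample sentences are made up for illustration and are not taken from iwslt14:

```python
# Hypothetical translation pairs in the same {"en": ..., "de": ...} shape.
pairs = [
    {"en": "Good morning.", "de": "Guten Morgen."},
    {"en": "Thank you.", "de": "Danke."},
]

documents = []
for pair in pairs:
    # Emit one record per language so both sides of a pair get embedded.
    for language in ("en", "de"):
        documents.append({
            "text": pair[language],  # the field that is indexed
            "en": pair["en"],
            "de": pair["de"],
            "language": language,
        })

print(len(documents))
```

Each pair yields two records, so 50,000 pairs produce 100,000 points on the map, and the `language` field makes it possible to tell the two sides of a pair apart.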