<a href="https://colab.research.google.com/github/aaubs/ds22/blob/master/notebooks/M5-db-nosql-lntro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Unstructured (NoSQL) Databases

## NoSQL Databases Logic
* NoSQL stands for "Not only SQL"
* Designed to handle unstructured data and scalability challenges
* Flexible schema allows for evolving data structures
* Horizontal scalability for handling large volumes of data
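* Often relax strict consistency (for example, eventual consistency) in exchange for availability and scalability, whereas SQL databases typically provide ACID guarantees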


## Advantages of NoSQL Databases
* Schema flexibility for changing data requirements
* High scalability for large data volumes and high read/write rates
* High availability and fault tolerance through data replication
* Optimized for specific use cases, depending on the type of NoSQL database

## Types and Structure of NoSQL Databases

### 1. Document Databases
* Store data in documents (usually JSON or BSON)
* Documents can have nested structures
* Documents can be organized in collections
* Examples: MongoDB, Couchbase, RavenDB

**Applications:**
* Content management systems (CMS) with complex data structures
* Real-time analytics for handling unstructured data
* IoT data management for handling diverse data from multiple devices
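
To make the document model concrete, here is a minimal sketch that uses only Python's standard library. The collection name and all field names are invented for illustration; the point is that two documents in one collection can have different, nested shapes.

```python
import json

# Two documents in the same hypothetical "articles" collection.
# They share some fields but not all: no schema forces them to match.
articles = [
    {
        "title": "Intro to NoSQL",
        "author": {"name": "Mette", "email": "mette@example.com"},  # nested object
        "tags": ["databases", "nosql"],                              # nested list
    },
    {
        "title": "Graph basics",
        "tags": ["graphs"],
        "comments": [{"user": "Jeppe", "text": "Nice overview"}],   # field the first document lacks
    },
]

# Documents map directly onto JSON, so a round trip loses nothing.
serialized = json.dumps(articles)
restored = json.loads(serialized)

# Code that reads documents has to tolerate missing fields.
authors = [doc.get("author", {}).get("name", "unknown") for doc in restored]
print(authors)  # ['Mette', 'unknown']
```

The flexibility moves responsibility to the reader of the data: since nothing guarantees that `author` exists, lookups fall back to a default.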

### 2. Key-Value Stores
* Store data as key-value pairs
* Keys are unique identifiers for the data
* Values can be any data type, including complex structures
* Examples: Redis, Amazon DynamoDB, Riak KV

**Applications:**
* Caching for improving the performance of data retrieval
* Session management for storing user-specific data across sessions
* Configuration data storage for application settings and metadata
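
The caching and session use cases can be sketched with a toy in-memory key-value store whose entries expire. This is an illustration of the idea only, not how Redis or DynamoDB are implemented; the class name `TTLCache` and the injectable clock are assumptions made so that expiry can be shown without waiting.

```python
import time


class TTLCache:
    """A toy in-memory key-value store whose entries expire after `ttl` seconds."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock   # injectable so expiry can be demonstrated without sleeping
        self._store = {}     # key -> (value, expiry timestamp)

    def set(self, key, value):
        self._store[key] = (value, self.clock() + self.ttl)

    def get(self, key, default=None):
        entry = self._store.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._store[key]   # lazily evict expired entries on read
            return default
        return value


# Simulated clock: a one-element list holding the current time in seconds.
now = [0.0]
sessions = TTLCache(ttl=30, clock=lambda: now[0])
sessions.set("session:42", {"user": "Mette", "cart": ["book"]})

fresh = sessions.get("session:42")    # still valid
now[0] = 31.0                         # advance the clock past the TTL
expired = sessions.get("session:42")  # entry has expired
print(fresh, expired)
```

The store never inspects the value; everything is looked up by key, which is what makes this model simple and fast.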

### 3. Column-Family Stores
* Store data in column families (loosely related to tables)
* Data is organized as rows and columns
* Designed for high write and read performance on large-scale data
* Examples: Apache Cassandra, HBase, ScyllaDB

**Applications:**
* Time-series data management for high write and read workloads
* Event logging and analytics for large-scale systems
* Recommendation systems for analyzing user behavior and preferences
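
A rough mental model of a column-family layout is a nested mapping: row key, then column family, then column name. The sketch below uses plain dictionaries and invented sensor data; real systems such as Cassandra or HBase add partitioning, on-disk sorting, and replication on top of this idea.

```python
from collections import defaultdict

# row key -> column family -> column name -> value
# Rows are sparse: each row stores only the columns it actually has.
table = defaultdict(lambda: defaultdict(dict))


def put(row_key, family, column, value):
    table[row_key][family][column] = value


def get_family(row_key, family):
    """Read one column family of one row; missing rows or families yield {}."""
    return dict(table.get(row_key, {}).get(family, {}))


# Time-series style usage: one row per sensor and day, one column per timestamp.
put("sensor-1#2023-03-22", "readings", "07:00", 21.5)
put("sensor-1#2023-03-22", "readings", "07:01", 21.7)
put("sensor-1#2023-03-22", "meta", "unit", "C")
put("sensor-2#2023-03-22", "readings", "07:00", 19.9)  # this row has no "meta" family

readings = get_family("sensor-1#2023-03-22", "readings")
missing_meta = get_family("sensor-2#2023-03-22", "meta")
print(readings, missing_meta)
```

New timestamps become new columns in an existing row, which is why this layout suits append-heavy workloads.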

### 4. Graph Databases
* Store data as nodes (entities) and edges (relationships)
* Designed for graph traversal and querying connected data
* Examples: Neo4j, Amazon Neptune, ArangoDB

**Applications:**
* Social network analysis for exploring connections between users
* Fraud detection for uncovering complex patterns in financial transactions
* Knowledge graphs for storing and querying complex, interrelated information
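
The kind of traversal a graph database is built for can be sketched with an adjacency list and a breadth-first search. The `follows` graph and the function name `within_hops` are invented for illustration; a graph database would express the same question in a query language such as Cypher.

```python
from collections import deque

# Directed "follows" relationships: node -> list of neighbours.
follows = {
    "Mette": ["Jeppe", "John"],
    "Jeppe": ["David"],
    "John": ["David", "Pernille"],
    "David": [],
    "Pernille": ["Alexandra"],
    "Alexandra": [],
}


def within_hops(graph, start, max_hops):
    """Breadth-first traversal: {node: hop distance} for nodes at most max_hops from start."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, []):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    del distance[start]  # the start node is not part of the answer
    return distance


reachable = within_hops(follows, "Mette", 2)
print(reachable)  # {'Jeppe': 1, 'John': 1, 'David': 2, 'Pernille': 2}
```

Alexandra is three hops from Mette, so she is left out of the two-hop result.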

In this tutorial, we'll introduce NoSQL databases, specifically using the TinyDB library in Python. TinyDB is a lightweight, document-oriented NoSQL database, perfect for beginners to learn the basics of NoSQL.

In [3]:
!pip install tinydb

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting tinydb
  Downloading tinydb-4.7.1-py3-none-any.whl (24 kB)
Installing collected packages: tinydb
Successfully installed tinydb-4.7.1


# TinyDB

In [None]:
import pandas as pd
from tinydb import TinyDB, Query

## Simple DB example

In [3]:
# Initialize a new TinyDB instance
db = TinyDB('my_database.json')

**Populating the database:** Let's get some data in first:

In [4]:
# Populate the database with some data
students = [
    {'name': 'Mette', 'age': 26, 'major': 'Data Science'},
    {'name': 'Jeppe', 'age': 22, 'major': 'Data Science'},
    {'name': 'John', 'age': 23, 'major': 'Computer Science'},
    {'name': 'David', 'age': 24, 'major': 'Data Science'},
    {'name': 'Pernille', 'age': 25, 'major': 'Economics'},
    {'name': 'Alexandra', 'age': 21, 'major': 'Economics'},
]

# Insert the students into the database
for student in students:
    db.insert(student)

**Querying the database:** TinyDB allows us to perform various types of queries on the database. Let's explore some common query operations:

In [5]:
# Query object to perform queries
Student = Query()

In [6]:
# Find all students majoring in Data Science
data_science_students = db.search(Student.major == 'Data Science')
print("Data Science students:", data_science_students)

Data Science students: [{'name': 'Mette', 'age': 26, 'major': 'Data Science'}, {'name': 'Jeppe', 'age': 22, 'major': 'Data Science'}, {'name': 'David', 'age': 24, 'major': 'Data Science'}]


In [7]:
# Find all students who are 24 years old
students_24 = db.search(Student.age == 24)
print("Students aged 24:", students_24)

Students aged 24: [{'name': 'David', 'age': 24, 'major': 'Data Science'}]


In [8]:
# Find all students whose name starts with 'A'
students_starting_with_A = db.search(Student.name.matches('^A.*'))
print("Students whose name starts with 'A':", students_starting_with_A)

Students whose name starts with 'A': [{'name': 'Alexandra', 'age': 21, 'major': 'Economics'}]


In [9]:
# Find all students who are majoring in Data Science and are 22 years old
data_science_and_22 = db.search((Student.major == 'Data Science') & (Student.age == 22))
print("Data Science students aged 22:", data_science_and_22)

Data Science students aged 22: [{'name': 'Jeppe', 'age': 22, 'major': 'Data Science'}]


**Updating data in the database:** To update an entry in the database, use the `update` method. It returns the IDs of the documents it changed.

In [10]:
# Update Alexandra's age
db.update({'age': 23}, Student.name == 'Alexandra')

[6]

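**Removing data from the database:** To remove entries from the database, use the `remove` method. It returns the IDs of the removed documents; no student has the major used below, so the result is an empty list.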

In [11]:
# Remove all Business Modelling students
db.remove(Student.major == 'Business Modelling')

[]

## Nested DB structure


In [12]:
# Clear the database before inserting new data
db.truncate()

In [13]:
# Insert data into the database
db.insert({
    'type': 'product',
    'name': 'Product 1',
    'price': 10.0,
    'categories': [
        {'id': 1, 'name': 'Category 1'},
        {'id': 2, 'name': 'Category 2'}
    ],
    'attributes': {
        'color': 'red',
        'size': 'M',
        'weight': 2.5
    }
})

db.insert({
    'type': 'product',
    'name': 'Product 2',
    'price': 15.0,
    'categories': [
        {'id': 2, 'name': 'Category 2'},
        {'id': 3, 'name': 'Category 3'}
    ],
    'attributes': {
        'color': 'blue',
        'size': 'L',
        'weight': 3.0
    }
})

2

In [14]:
# Query object for the product documents
Product = Query()

In [15]:
# Find products with a specific category
category_id = 2
result = db.search(Product.categories.any(lambda cat: cat['id'] == category_id))

for product in result:
    print(f"Product Name: {product['name']}, Categories: {product['categories']}")

Product Name: Product 1, Categories: [{'id': 1, 'name': 'Category 1'}, {'id': 2, 'name': 'Category 2'}]
Product Name: Product 2, Categories: [{'id': 2, 'name': 'Category 2'}, {'id': 3, 'name': 'Category 3'}]


In [16]:
# Find products with specific attributes
result = db.search((Product.attributes.color == 'red') & (Product.attributes.size == 'M'))

for product in result:
    print(f"Product Name: {product['name']}, Attributes: {product['attributes']}")

Product Name: Product 1, Attributes: {'color': 'red', 'size': 'M', 'weight': 2.5}


# ML workflow

## Traditional ML

In this example, we'll demonstrate how to use TinyDB in a supervised machine learning workflow using the popular Iris dataset from the scikit-learn library. We'll use a K-Nearest Neighbors (KNN) classifier for this multi-class classification problem.

In [4]:
from tinydb import TinyDB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
data = iris['data']
target = iris['target']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=42)

# Store the training data in a TinyDB database
db = TinyDB('iris_training_data.json')
db.truncate()  # Clear the database

# Insert all rows in one call; inserting one by one rewrites the JSON file each time
db.insert_multiple(
    {'features': features.tolist(), 'label': int(label)}
    for features, label in zip(X_train, y_train)
)

# Train a K-Nearest Neighbors classifier using the data in TinyDB
knn = KNeighborsClassifier(n_neighbors=3)
training_data = db.all()

X_train_db = [entry['features'] for entry in training_data]
y_train_db = [entry['label'] for entry in training_data]

knn.fit(X_train_db, y_train_db)

# Test the classifier on the testing set
y_pred = knn.predict(X_test)

# Calculate the accuracy of the classifier
accuracy = accuracy_score(y_test, y_pred)

print(f'Accuracy: {accuracy:.2f}')

# Clean up
db.close()


Accuracy: 1.00


## NLP

In [None]:
!pip install transformers
!pip install datasets

In [18]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from torch.utils.data import Dataset
from tinydb import TinyDB, Query
from datasets import load_dataset

In [None]:
dataset = load_dataset("ag_news")

In [10]:
# set up database
db = TinyDB('ag_news.json')
db.truncate()

In [11]:
def add_to_db(data, split):
    documents = [{'text': item['text'], 'label': item['label'], 'split': split} for item in data]
    db.insert_multiple(documents)

add_to_db(dataset['train'], 'train')
add_to_db(dataset['test'], 'test')

In [12]:
# Custom dataset class that inherits from torch.utils.data.Dataset.
# It reads all documents of the requested split from TinyDB once, at construction time,
# and tokenizes each item when it is accessed.
class TinyDBDataset(Dataset):
    def __init__(self, db, split, tokenizer, max_length=128):
        self.db = db
        self.split = split
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.data = db.search(Query().split == split)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        item = self.data[index]
        encoding = self.tokenizer(item['text'], truncation=True, padding='max_length', max_length=self.max_length)
        return {
            'input_ids': torch.tensor(encoding['input_ids'], dtype=torch.long),
            'attention_mask': torch.tensor(encoding['attention_mask'], dtype=torch.long),
            'labels': torch.tensor(item['label'], dtype=torch.long)
        }

In [13]:
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=4)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [14]:
train_dataset = TinyDBDataset(db, 'train', tokenizer)
test_dataset = TinyDBDataset(db, 'test', tokenizer)

In [31]:
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=2,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    logging_dir='./logs',
    logging_steps=10,
    save_steps=100,
    evaluation_strategy="steps",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy"
)


In [32]:
from datasets import load_metric

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = logits.argmax(axis=-1)
    return metric.compute(predictions=predictions, references=labels)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
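
The `compute_metrics` function above takes the arg-max of the logits over the class dimension and compares it with the labels. The sketch below performs the same calculation with NumPy only; the function name `accuracy_from_logits` and the sample numbers are made up for illustration. It returns a dictionary with an `accuracy` key, so it could stand in for the metric object if `load_metric` is not available in the installed `datasets` version.

```python
import numpy as np


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    logits = np.asarray(logits)
    labels = np.asarray(labels)
    predictions = logits.argmax(axis=-1)  # index of the largest logit in each row
    return {"accuracy": float((predictions == labels).mean())}


# Four examples, four classes (as in AG News)
demo_logits = [
    [2.0, 0.1, 0.3, 0.0],  # predicts class 0
    [0.2, 0.1, 3.1, 0.4],  # predicts class 2
    [0.5, 1.5, 0.2, 0.1],  # predicts class 1
    [0.1, 0.2, 0.3, 2.2],  # predicts class 3
]
demo_labels = [0, 2, 3, 3]

result = accuracy_from_logits(demo_logits, demo_labels)
print(result)  # {'accuracy': 0.75}
```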

In [33]:
trainer.train()



OutOfMemoryError: ignored

In [None]:
# Evaluate the model
eval_results = trainer.evaluate()

# Print the evaluation results
print("Evaluation Results:", eval_results)

# Save the trained model and tokenizer
model.save_pretrained("./saved_model")
tokenizer.save_pretrained("./saved_model")

## Try on your own

In [116]:
!wget https://raw.githubusercontent.com/xsarinix/py_and_wine/master/winemag-data-130k-v2.json

--2023-03-22 07:14:54--  https://raw.githubusercontent.com/xsarinix/py_and_wine/master/winemag-data-130k-v2.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 79279294 (76M) [text/plain]
Saving to: ‘winemag-data-130k-v2.json’


2023-03-22 07:14:55 (154 MB/s) - ‘winemag-data-130k-v2.json’ saved [79279294/79279294]



In [None]:
import json
from tinydb import TinyDB

# Load the wine dataset from JSON file
with open("winemag-data-130k-v2.json", "r") as file:
    wine_data = json.load(file)

# Initialize the TinyDB instance
db = TinyDB("wine_reviews_db.json")
db.truncate()  # Clear the database before inserting new data

# Prepare the dataset to be inserted into TinyDB
data_to_insert = [
    {"type": "data", "description": data["description"], "points": int(data["points"])}
    for data in wine_data
    if data["description"] is not None and data["points"] is not None
]

# Store the dataset in TinyDB
db.insert_multiple(data_to_insert)