<a href="https://colab.research.google.com/github/aaubs/ds22/blob/master/notebooks/M5-db-nosql-lntro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Unstructured (NoSQL) Databases

## NoSQL Databases Logic
* NoSQL stands for "Not only SQL"
* Designed to handle unstructured data and scalability challenges
* Flexible schema allows for evolving data structures
* Horizontal scalability for handling large volumes of data
* Often trade strict (ACID) consistency for availability and scalability, for example through eventual consistency
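
To make horizontal scalability concrete, here is a minimal sketch of hash-based sharding, where each key is assigned to one of several nodes. The three "nodes" are plain dictionaries and the helper names (`shard_for`, `put`, `get`) are made up for illustration; real systems add replication and rebalancing on top of this idea.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Three hypothetical nodes, each modelled as a plain dict
shards = [{} for _ in range(3)]

def put(key: str, value: dict) -> None:
    # The key alone decides which node stores the value
    shards[shard_for(key, len(shards))][key] = value

def get(key: str):
    # Reads go to the same node the key was written to
    return shards[shard_for(key, len(shards))].get(key)

put("user:1", {"name": "Mette"})
put("user:2", {"name": "Jeppe"})
print(get("user:1"))
```

Because the shard is derived from the key, adding more nodes spreads both storage and traffic, which is the core of scaling out rather than up.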


## Advantages of NoSQL Databases
* Schema flexibility for changing data requirements
* High scalability for large data volumes and high read/write rates
* High availability and fault tolerance through data replication
* Optimized for specific use cases, depending on the type of NoSQL database

## Types and Structure of NoSQL Databases

### 1. Document Databases
* Store data in documents (usually JSON or BSON)
* Documents can have nested structures
* Documents can be organized in collections
* Examples: MongoDB, Couchbase, RavenDB

**Applications:**
* Content management systems (CMS) with complex data structures
* Real-time analytics for handling unstructured data
* IoT data management for handling diverse data from multiple devices
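
As a rough illustration of the document model, the sketch below builds a nested document as a Python dictionary and round-trips it through JSON. The blog-post fields are invented for the example and do not come from any particular database.

```python
import json

# A hypothetical blog-post document with a nested author and a list of comments
post = {
    "title": "Intro to NoSQL",
    "author": {"name": "Mette", "email": "mette@example.com"},
    "tags": ["nosql", "databases"],
    "comments": [
        {"user": "Jeppe", "text": "Nice overview"},
        {"user": "John", "text": "What about graphs?"},
    ],
}

# A "collection" groups documents; they do not need identical fields
posts = [post, {"title": "Draft", "tags": []}]

# Documents serialise directly to JSON and back without a fixed schema
restored = json.loads(json.dumps(post))
commenters = [c["user"] for c in restored["comments"]]
print(commenters)
```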

### 2. Key-Value Stores
* Store data as key-value pairs
* Keys are unique identifiers for the data
* Values can be any data type, including complex structures
* Examples: Redis, Amazon DynamoDB, Riak KV

**Applications:**
* Caching for improving the performance of data retrieval
* Session management for storing user-specific data across sessions
* Configuration data storage for application settings and metadata
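
The caching and session use cases can be sketched with a toy in-memory store that supports expiry. This is an illustrative class written for this tutorial, not the API of Redis or any other product; the `KeyValueStore` name and `ttl_seconds` parameter are assumptions.

```python
import time

class KeyValueStore:
    """Toy in-memory key-value store with optional expiry."""

    def __init__(self):
        self._data = {}

    def set(self, key, value, ttl_seconds=None):
        # Store the value together with its expiry time (None = never expires)
        expires_at = None if ttl_seconds is None else time.monotonic() + ttl_seconds
        self._data[key] = (value, expires_at)

    def get(self, key, default=None):
        value, expires_at = self._data.get(key, (default, None))
        if expires_at is not None and time.monotonic() >= expires_at:
            # Expired entries are removed lazily on read
            del self._data[key]
            return default
        return value

store = KeyValueStore()
store.set("session:42", {"user": "Pernille", "cart": ["book"]}, ttl_seconds=60)
store.set("config:theme", "dark")
print(store.get("session:42"))
```

Lookups only ever go through the key, which is why this model is fast but unsuitable for queries on the values themselves.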

### 3. Column-Family Stores
* Store data in column families (loosely related to tables)
* Data is organized as rows and columns
* Designed for high write and read performance on large-scale data
* Examples: Apache Cassandra, HBase, ScyllaDB

**Applications:**
* Time-series data management for high write and read workloads
* Event logging and analytics for large-scale systems
* Recommendation systems for analyzing user behavior and preferences
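
One way to picture a column-family layout is as nested maps: row key, then column family, then column. The sketch below uses plain dictionaries and invented sensor data; it is a simplified mental model and ignores how Cassandra or HBase actually store data on disk.

```python
from collections import defaultdict

# row key -> column family -> column name -> value
table = defaultdict(lambda: defaultdict(dict))

def put(row_key, family, column, value):
    table[row_key][family][column] = value

def get_family(row_key, family):
    # Use .get so that reading a missing row does not create it
    return dict(table.get(row_key, {}).get(family, {}))

put("sensor-1#2023-03-01T10:00", "reading", "temperature", 21.5)
put("sensor-1#2023-03-01T10:00", "reading", "humidity", 40)
put("sensor-1#2023-03-01T10:00", "meta", "location", "Aalborg")
put("sensor-1#2023-03-01T10:05", "reading", "temperature", 21.7)

# Rows are sparse; a time range is scanned by sorting keys with a shared prefix
rows = sorted(k for k in table if k.startswith("sensor-1#"))
temps = [table[k]["reading"].get("temperature") for k in rows]
print(temps)
```

Putting the timestamp into the row key is what makes time-series scans cheap in this model.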

### 4. Graph Databases
* Store data as nodes (entities) and edges (relationships)
* Designed for graph traversal and querying connected data
* Examples: Neo4j, Amazon Neptune, ArangoDB

**Applications:**
* Social network analysis for exploring connections between users
* Fraud detection for uncovering complex patterns in financial transactions
* Knowledge graphs for storing and querying complex, interrelated information
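
Graph traversal can be sketched with an adjacency list and a breadth-first search. The "follows" edges below are made-up data, and `shortest_path` is a hypothetical helper; graph databases expose this kind of traversal through query languages such as Cypher rather than hand-written loops.

```python
from collections import deque

# Nodes are users; directed edges are "follows" relationships (hypothetical data)
edges = [("Mette", "Jeppe"), ("Jeppe", "John"), ("John", "David"), ("Mette", "Pernille")]

graph = {}
for src, dst in edges:
    graph.setdefault(src, []).append(dst)
    graph.setdefault(dst, [])

def shortest_path(start, goal):
    """Breadth-first search returning one shortest path, or None if unreachable."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in graph.get(path[-1], []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None

print(shortest_path("Mette", "David"))
```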

In this tutorial, we'll introduce NoSQL databases, specifically using the TinyDB library in Python. TinyDB is a lightweight, document-oriented NoSQL database, perfect for beginners to learn the basics of NoSQL.

In [2]:
!pip install tinydb

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting tinydb
  Downloading tinydb-4.7.1-py3-none-any.whl (24 kB)
Installing collected packages: tinydb
Successfully installed tinydb-4.7.1


# TinyDB

In [3]:
import pandas as pd
from tinydb import TinyDB, Query

## Simple DB example

In [4]:
# Initialize a new TinyDB instance
db = TinyDB('my_database.json')

**Populating the database:** Let's get some data in first:

In [5]:
# Populate the database with some data
students = [
    {'name': 'Mette', 'age': 26, 'major': 'Data Science'},
    {'name': 'Jeppe', 'age': 22, 'major': 'Data Science'},
    {'name': 'John', 'age': 23, 'major': 'Computer Science'},
    {'name': 'David', 'age': 24, 'major': 'Data Science'},
    {'name': 'Pernille', 'age': 25, 'major': 'Economics'},
    {'name': 'Alexandra', 'age': 21, 'major': 'Economics'},
]

# Insert the students into the database
for student in students:
    db.insert(student)

**Querying the database:** TinyDB allows us to perform various types of queries on the database. Let's explore some common query operations:

In [6]:
# Query object to perform queries
Student = Query()

In [7]:
# Find all students majoring in Data Science
data_science_students = db.search(Student.major == 'Data Science')
print("Data Science students:", data_science_students)

Data Science students: [{'name': 'Mette', 'age': 26, 'major': 'Data Science'}, {'name': 'Jeppe', 'age': 22, 'major': 'Data Science'}, {'name': 'David', 'age': 24, 'major': 'Data Science'}]


In [8]:
# Find all students who are 24 years old
students_24 = db.search(Student.age == 24)
print("Students aged 24:", students_24)

Students aged 24: [{'name': 'David', 'age': 24, 'major': 'Data Science'}]


In [9]:
# Find all students whose name starts with 'A'
students_starting_with_A = db.search(Student.name.matches('^A.*'))
print("Students whose name starts with 'A':", students_starting_with_A)

Students whose name starts with 'A': [{'name': 'Alexandra', 'age': 21, 'major': 'Economics'}]


In [10]:
# Find all students who are majoring in Data Science and are 22 years old
data_science_and_22 = db.search((Student.major == 'Data Science') & (Student.age == 22))
print("Data Science students aged 22:", data_science_and_22)

Data Science students aged 22: [{'name': 'Jeppe', 'age': 22, 'major': 'Data Science'}]
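
Conceptually, each search above evaluates a condition against every stored document and returns the ones that match. The sketch below mimics that behaviour in plain Python using the same student records; the `search` helper is hypothetical and is not how TinyDB is implemented internally.

```python
import re

students = [
    {'name': 'Mette', 'age': 26, 'major': 'Data Science'},
    {'name': 'Jeppe', 'age': 22, 'major': 'Data Science'},
    {'name': 'John', 'age': 23, 'major': 'Computer Science'},
    {'name': 'David', 'age': 24, 'major': 'Data Science'},
    {'name': 'Pernille', 'age': 25, 'major': 'Economics'},
    {'name': 'Alexandra', 'age': 21, 'major': 'Economics'},
]

def search(documents, predicate):
    """Return every document for which the predicate is true."""
    return [doc for doc in documents if predicate(doc)]

# Equality test, regex match, and a combined (AND) condition
data_science = search(students, lambda d: d['major'] == 'Data Science')
starts_with_a = search(students, lambda d: re.match(r'^A.*', d['name']) is not None)
ds_and_22 = search(students, lambda d: d['major'] == 'Data Science' and d['age'] == 22)

print([d['name'] for d in data_science])
```

Without indexes, every query is a full scan of this kind, which is fine for small teaching datasets but is the main reason such a simple store does not scale.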


**Updating data in the database:** To update an entry in the database, use the update method.

In [11]:
# Update Alexandra's age
db.update({'age': 23}, Student.name == 'Alexandra')

[6]

**Removing data from the database:** To remove entries from the database, use the remove method. 

In [12]:
# Remove all Business Modelling students (no student has this major, so nothing is removed)
db.remove(Student.major == 'Business Modelling')

[]

## Nested DB structure


In [21]:
# Clear the database before inserting new data
db.truncate()

In [22]:
# Insert data into the database
db.insert({
    'type': 'product',
    'name': 'Product 1',
    'price': 10.0,
    'categories': [
        {'id': 1, 'name': 'Category 1'},
        {'id': 2, 'name': 'Category 2'}
    ],
    'attributes': {
        'color': 'red',
        'size': 'M',
        'weight': 2.5
    }
})

db.insert({
    'type': 'product',
    'name': 'Product 2',
    'price': 15.0,
    'categories': [
        {'id': 2, 'name': 'Category 2'},
        {'id': 3, 'name': 'Category 3'}
    ],
    'attributes': {
        'color': 'blue',
        'size': 'L',
        'weight': 3.0
    }
})

2

In [23]:
# Query object for the product queries
Product = Query()

In [24]:
# Find products with a specific category
category_id = 2
result = db.search(Product.categories.any(lambda cat: cat['id'] == category_id))

for product in result:
    print(f"Product Name: {product['name']}, Categories: {product['categories']}")

Product Name: Product 1, Categories: [{'id': 1, 'name': 'Category 1'}, {'id': 2, 'name': 'Category 2'}]
Product Name: Product 2, Categories: [{'id': 2, 'name': 'Category 2'}, {'id': 3, 'name': 'Category 3'}]


In [25]:
# Find products with specific attributes
result = db.search((Product.attributes.color == 'red') & (Product.attributes.size == 'M'))

for product in result:
    print(f"Product Name: {product['name']}, Attributes: {product['attributes']}")

Product Name: Product 1, Attributes: {'color': 'red', 'size': 'M', 'weight': 2.5}
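
The nested queries above can be read as path lookups into each document. As a rough plain-Python equivalent, the sketch below follows a dotted path through nested dictionaries and checks list elements with `any`. The `get_path` helper is an assumption made for illustration and is not part of TinyDB.

```python
products = [
    {
        'name': 'Product 1',
        'categories': [{'id': 1, 'name': 'Category 1'}, {'id': 2, 'name': 'Category 2'}],
        'attributes': {'color': 'red', 'size': 'M', 'weight': 2.5},
    },
    {
        'name': 'Product 2',
        'categories': [{'id': 2, 'name': 'Category 2'}, {'id': 3, 'name': 'Category 3'}],
        'attributes': {'color': 'blue', 'size': 'L', 'weight': 3.0},
    },
]

def get_path(doc, path, default=None):
    """Follow a dotted path such as 'attributes.color' through nested dicts."""
    current = doc
    for key in path.split('.'):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current

# Match on nested attribute values
red_medium = [
    p['name'] for p in products
    if get_path(p, 'attributes.color') == 'red' and get_path(p, 'attributes.size') == 'M'
]

# Match if any element of the nested list satisfies the condition
in_category_2 = [p['name'] for p in products if any(c['id'] == 2 for c in p['categories'])]

print(red_medium, in_category_2)
```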


# ML workflow

## Traditional ML

In this example, we'll demonstrate how to use TinyDB in a supervised machine learning workflow using the popular Iris dataset from the scikit-learn library. We'll use a K-Nearest Neighbors (KNN) classifier for this multi-class classification problem.

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load the Iris dataset from scikit-learn
iris = load_iris()
X = iris.data
y = iris.target

# Initialize the TinyDB instance
db = TinyDB('iris_example_db.json')

# Clear the database before inserting new data
db.truncate()

# Store the dataset in TinyDB
for xi, yi in zip(X, y):
    # Cast the NumPy integer label to a plain int so it can be serialized to JSON
    db.insert({'type': 'data', 'x': xi.tolist(), 'y': int(yi)})

# Preprocess the data: retrieve the data from TinyDB and split it
Data = Query()
data = db.search(Data.type == 'data')
X = [d['x'] for d in data]
y = [d['y'] for d in data]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the KNN classifier
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# Evaluate the model
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

# Store the model's results in TinyDB
db.insert({'type': 'result', 'accuracy': accuracy, 'report': report})

# Retrieve the results from TinyDB
Result = Query()
results = db.search(Result.type == 'result')

for result in results:
    print(f"Accuracy: {result['accuracy']}\nClassification Report:\n{result['report']}")

## Transformer-based NLP

In [53]:
!pip install transformers
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.27.2-py3-none-any.whl (6.8 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.13.3-py3-none-any.whl (199 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.13.3 tokenizers-0.13.2 transformers-4.27.2
Looking in indexes: https://pypi.org/simple, http

In [78]:
import json
import os

import torch
from torch.optim import AdamW  # replaces the deprecated transformers.AdamW
from torch.utils.data import Dataset, DataLoader

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

from transformers import DistilBertTokenizer, DistilBertModel, DistilBertConfig

In [None]:
!wget -O wine_review.json "https://github.com/xsarinix/py_and_wine/blob/master/winemag-data-130k-v2.json?raw=true"

In [43]:

# Load the Wine Reviews dataset from the JSON file
with open('wine_review.json', 'r') as f:
    wine_data = json.load(f)

In [44]:
# Initialize the TinyDB instance
db = TinyDB('wine_reviews_db.json')

# Clear the database before inserting new data
db.truncate()

In [None]:
# Prepare the dataset to be inserted in TinyDB
data_to_insert = [
    {'type': 'data', 'description': data['description'], 'points': data['points']}
    for data in wine_data
    if data['description'] is not None and data['points'] is not None
]

In [None]:
# Store the dataset in TinyDB
db.insert_multiple(data_to_insert)

In [65]:
# Load the data from TinyDB
db = TinyDB('wine_reviews_db.json')
Data = Query()
data = db.search(Data.type == 'data')
X = [d['description'] for d in data]
y = [d['points'] for d in data]

In [70]:
# Initialize the tokenizer and model
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
config = DistilBertConfig.from_pretrained('distilbert-base-uncased')
config.num_labels = 1
model = DistilBertModel.from_pretrained('distilbert-base-uncased', config=config)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_transform.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [71]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [74]:
# Create a custom Dataset class for the wine reviews
class WineReviewsDataset(Dataset):
    def __init__(self, X, y, tokenizer):
        self.X = X
        self.y = y
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
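        # Tokenize one review, padding or truncating it to 512 tokens
        encoding = self.tokenizer(
            self.X[idx],
            return_tensors='pt',
            padding='max_length',
            max_length=512,
            truncation=True,
        )
        input_ids = encoding['input_ids'].squeeze()
        # Cast the score to float in case it is stored as a string in the JSON data
        return input_ids, torch.tensor(float(self.y[idx]), dtype=torch.float32)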

In [76]:
# Create DataLoaders for streaming the data in batches
train_dataset = WineReviewsDataset(X_train, y_train, tokenizer)
test_dataset = WineReviewsDataset(X_test, y_test, tokenizer)
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=8, shuffle=False)

In [81]:
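# Training parameters
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
# DistilBertModel has no prediction head, so add a linear layer that maps
# the first-token ([CLS]) embedding to a single regression output
regressor = torch.nn.Linear(model.config.dim, 1).to(device)
criterion = torch.nn.MSELoss()
optimizer = AdamW(list(model.parameters()) + list(regressor.parameters()), lr=5e-5)

In [82]:
# Train the model
num_epochs = 3
for epoch in range(num_epochs):
    model.train()
    for batch_X, batch_y in train_dataloader:
        batch_X, batch_y = batch_X.to(device), batch_y.to(device)
        # Mask padding tokens so that attention ignores them
        attention_mask = (batch_X != tokenizer.pad_token_id).long()
        optimizer.zero_grad()
        cls_embedding = model(batch_X, attention_mask=attention_mask).last_hidden_state[:, 0]
        outputs = regressor(cls_embedding).squeeze(-1)  # shape: (batch_size,)
        loss = criterion(outputs, batch_y)
        loss.backward()
        optimizer.step()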

TypeError: ignored