# dataset + sqlite3

### context management for transactions in data environments


**GOAL**

> * Being able to keep track of downloads, paths (locations), and file names without having to explicitly import, load, and save, repeating the same steps back-to-back multiple times.

*For example, calling the `download(video_id)` method can return `(path, file-name, meta-info)`, where the file-name is the **`video_id`** and the meta-info is the **`comment_count`** (the count is available and can be returned as an int).*


> * The database should hold the following values (all of them can be obtained from the *`david.youtube.scraper.download`* method):
    
- file_path : str

- file_name : str
        
- file_meta : int
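
As a rough illustration of the record described above, the sketch below stores one `(file_path, file_name, file_meta)` row using the standard-library `sqlite3` module instead of `dataset`. The table name `context` and the in-memory database are assumptions made for the example.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE context ("
    "id INTEGER PRIMARY KEY, "
    "file_path TEXT, file_name TEXT, file_meta INTEGER)"
)

# One download record: where it was saved, the video id, the comment count.
record = ("downloads", "4Dk3jOSbz_0", 100)
with conn:  # commits on success, rolls back on error
    conn.execute(
        "INSERT INTO context (file_path, file_name, file_meta) VALUES (?, ?, ?)",
        record,
    )

rows = conn.execute(
    "SELECT file_path, file_name, file_meta FROM context"
).fetchall()
print(rows)
```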

In [28]:
import sys
sys.path.append('/home/ego/Github/david/')

import os
from os.path import exists, join, isfile
from collections import namedtuple

import dataset
import pandas as pd

from david.utils import pointer
from david.youtube.scraper import _write2json
from david.pipeline import TextMetrics, TextPreprocess

#### Connecting to a database

> It is also possible to define the URL as an environment variable called `DATABASE_URL` so you can initialize the database connection without explicitly passing a URL:

```python
# connecting to a SQLite database
db = dataset.connect('sqlite:///mydatabase.db')
```
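
A minimal sketch of the environment-variable route quoted above: resolve the URL from `DATABASE_URL` and fall back to a local SQLite file. The helper name `resolve_database_url` and the fallback value are hypothetical; the resulting string would then be passed to `dataset.connect(...)`.

```python
import os


def resolve_database_url(default: str = "sqlite:///mydatabase.db") -> str:
    """Return DATABASE_URL from the environment, or `default` if it is unset."""
    return os.environ.get("DATABASE_URL", default)


# Set the variable for the sake of the example; normally the shell does this.
os.environ["DATABASE_URL"] = "sqlite:///context_manager.db"
url = resolve_database_url()
print(url)
```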

#### Storing data

> To store some data you need to get a reference to a table. You don’t need to worry about whether the table already exists or not, since dataset will create it automatically:

* Insert a new record.

```python
# get a reference to the table 'user'.
table = db['user']
table.insert(dict(name='John Doe', age=46, country='China'))

# dataset will create "missing" columns any time you insert a dict with an unknown key.
table.insert(dict(name='Jane Doe', age=37, country='France', gender='female'))
```

* Updating existing entries.

> The list of filter columns given as the second argument selects which rows to update, matching on the values of those columns in the first argument. If you don’t want to match on a particular value, just use the auto-generated `id` column.

```python
table.update(dict(name='John Doe', age=47), ['name'])
```
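
To make the role of the filter columns concrete, here is roughly the SQL that an update keyed on `name` corresponds to, sketched with the standard-library `sqlite3` module. This illustrates the idea only and is not dataset's actual implementation; the helper `update_by_keys` is hypothetical.

```python
import sqlite3


def update_by_keys(conn, table, row, keys):
    """Update rows of `table` whose `keys` columns match the values in `row`."""
    set_cols = [c for c in row if c not in keys]
    set_sql = ", ".join(f"{c} = ?" for c in set_cols)
    where_sql = " AND ".join(f"{k} = ?" for k in keys)
    params = [row[c] for c in set_cols] + [row[k] for k in keys]
    # Identifiers are interpolated directly, so only trusted names belong here.
    cur = conn.execute(f"UPDATE {table} SET {set_sql} WHERE {where_sql}", params)
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.execute("INSERT INTO user (name, age) VALUES ('John Doe', 46)")
conn.execute("INSERT INTO user (name, age) VALUES ('Jane Doe', 37)")

# Rows are matched on 'name'; every other key in the dict is written.
changed = update_by_keys(conn, "user", {"name": "John Doe", "age": 47}, ["name"])
ages = dict(conn.execute("SELECT name, age FROM user").fetchall())
print(changed, ages)
```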

In [31]:
DATABASE_URL = 'sqlite:///context_manager.db'
CONTEXT_TABLE = 'context'

db = dataset.connect(DATABASE_URL)

# when a transaction is executed, e.g. a user calling scraper.download(some_video, count=100),
# the following three parameters are created; all three are ALWAYS expected.

trans_1 = dict(path='downloads', name='4Dk3jOSbz_0', entries=100) # meta = entries
trans_2 = dict(path='downloads', name='BmYZH7xt8sU', entries=4252)
trans_1, trans_2

({'path': 'downloads', 'name': '4Dk3jOSbz_0', 'entries': 100},
 {'path': 'downloads', 'name': 'BmYZH7xt8sU', 'entries': 4252})

In [32]:
# pass the keyword arguments to the database and save
# (i need to add a time of download index!)

table = db[CONTEXT_TABLE]
table.insert(trans_1)
table.insert(trans_2)
db.commit()
table.columns

['id', 'path', 'name', 'entries']
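
The note above about adding a download time could be handled by stamping each transaction before it is inserted. A minimal sketch, assuming a hypothetical column name `downloaded_at` and a hypothetical helper `with_timestamp`; `dataset` would create the extra column on insert, as described earlier.

```python
from datetime import datetime, timezone


def with_timestamp(transaction: dict) -> dict:
    """Return a copy of `transaction` with an ISO-8601 UTC timestamp added."""
    stamped = dict(transaction)
    stamped["downloaded_at"] = datetime.now(timezone.utc).isoformat()
    return stamped


trans_1 = dict(path="downloads", name="4Dk3jOSbz_0", entries=100)
stamped = with_timestamp(trans_1)
print(sorted(stamped))
```

Storing the time as an ISO-8601 string keeps the value readable in SQLite and sortable as plain text.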

In [37]:
# creates the database if it doesn't exist at the address
%ls *.db

context_manager.db


#### Using Transactions

> You can group a set of database updates in a transaction. In that case, all updates are committed at once or, in case of exception, all of them are reverted. Transactions are supported through a context manager, so they can be used through a with statement:

```python
with dataset.connect() as tx:
    tx['user'].insert(dict(name='John Doe', age=46, country='China'))

# you can get same functionality by invoking the methods:
# begin(), commit() and rollback() explicitly:

db = dataset.connect()
db.begin()
try:
    db['user'].insert(dict(name='John Doe', age=46, country='China'))
    db.commit()
except Exception:
    db.rollback()

# nested transactions are supported too:
db = dataset.connect()
with db as tx1:
    tx1['user'].insert(dict(name='John Doe', age=46, country='China'))
    with db as tx2:
        tx2['user'].insert(dict(name='Jane Doe', age=37, country='France', gender='female'))
```
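
The all-or-nothing behaviour described above can be seen with the standard-library `sqlite3` module, whose connection object is also a context manager. This is a sketch of the concept rather than of `dataset` itself; the table and values are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (name TEXT UNIQUE, age INTEGER)")
conn.commit()

try:
    with conn:  # one transaction: commit on success, rollback on exception
        conn.execute("INSERT INTO user VALUES ('John Doe', 46)")
        # Violates the UNIQUE constraint, so the whole block is reverted.
        conn.execute("INSERT INTO user VALUES ('John Doe', 47)")
except sqlite3.IntegrityError:
    pass

count = conn.execute("SELECT COUNT(*) FROM user").fetchone()[0]
print(count)
```

Because the second insert fails, the first one is rolled back as well and the table stays empty.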

In [82]:
from collections import namedtuple
from collections import Counter

def context_pointer(name: str, *args):
    '''
    * Creating a new pointer: returns a namedtuple class.

        >>> File = context_pointer('File', ['path', 'name', 'entries'])
        >>> File.__doc__
        'File(path, name, entries)'

        >>> file = File(path='downloads', name='vdsjhdsj11', entries=30)
        >>> file
        File(path='downloads', name='vdsjhdsj11', entries=30)

    * Calling the `_asdict()` method returns a dict object.

        >>> file_dict = file._asdict()
        >>> file_dict['name']
        'vdsjhdsj11'
    '''
    return namedtuple(name, *args)

File = context_pointer('File', ['path', 'name', 'entries'])

file = File(path='downloads', name='vdsjhdsj11', entries=30)
file

File(path='downloads', name='vdsjhdsj11', entries=30)

In [56]:
file.name + '.json'

'vdsjhdsj11.json'

In [81]:
file_dict = file._asdict()
file_dict['path']
file._fields

('path', 'name', 'entries')
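
If typed fields are wanted, `typing.NamedTuple` offers the same interface as the `namedtuple` above with annotations attached. A small sketch; the class below is an alternative to `context_pointer`, not part of the `david` package.

```python
from typing import NamedTuple


class File(NamedTuple):
    path: str
    name: str
    entries: int


file = File(path="downloads", name="vdsjhdsj11", entries=30)
print(file._fields)          # ('path', 'name', 'entries')
print(dict(file._asdict()))  # plain dict, ready to pass to table.insert(...)
```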

In [83]:
trans_3 = Counter(path='downloads', name='vdsjhdsj11', entries=30)
trans_3

Counter({'path': 'downloads', 'name': 'vdsjhdsj11', 'entries': 30})

In [85]:
dict(trans_3)['name']

'vdsjhdsj11'

In [63]:
with dataset.connect(DATABASE_URL) as tx:
    tx[CONTEXT_TABLE].insert(file_dict)

In [18]:
from collections import defaultdict
from typing import DefaultDict, List, Set, Tuple, Type


def constant_factory_str(value):
    '''
    >>> d = defaultdict(constant_factory_str('<missing>'))
    >>> d.update(name='John', action='ran')
    >>> '%(name)s %(action)s to %(object)s' % d
    'John ran to <missing>'
    '''
    return lambda: value


def _lists(collection):
    # lists grouping factory.
    d = defaultdict(list)
    for k, v in collection:
        d[k].append(v)
    return d


def _sets(collection):
    # sets builder factory.
    d = defaultdict(set)
    for k, v in collection:
        d[k].add(v)
    return d


def _ints(collection):
    # ints counting factory.
    d = defaultdict(int)
    for k in collection:
        d[k] += 1
    return d


def constant_factory(
        collection: List,
        func: Type,
        sort_items: bool = False):
    '''
    Dictionary Factory Builder.

    Parameters:
    ----------

    `collection` : list
        A list of (key, value) tuples when `func` is `set` or `list`;
        any iterable of hashable items when `func` is `int`.

    `func` : type -> [set|list|int]

    `sort_items` : (bool)
        If True, the collection is returned as a sorted list of (key, value) tuples.
        If False (default), the collection is returned as a defaultdict.

    func : object[Types]:
    --------------------

        * set  -  Building a dictionary of sets.
        * int  -  Counting frequencies, e.g. how often each sequence occurs.
        * list -  Grouping a sequence of key-value pairs into a dict-of-lists.

        >>> constant_factory([('blue', 4), ('blue', 2)], func=list, sort_items=True)
        [('blue', [4, 2])]

        >>> S = ['python is A', 'python is A', 'java is D', 'java is C']
        >>> constant_factory(S, int, sort_items=True)
        [('java is C', 1), ('java is D', 1), ('python is A', 2)]

    '''
    if func is set:
        collection = _sets(collection)

    elif func is int:
        collection = _ints(collection)

    elif func is list:
        collection = _lists(collection)

    if sort_items:
        return sorted(collection.items())
    else:
        return collection

In [19]:
G = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]
constant_factory(G, list)

defaultdict(list, {'yellow': [1, 3], 'blue': [2, 4], 'red': [1]})

In [20]:
constant_factory(G, list, sort_items=True)

[('blue', [2, 4]), ('red', [1]), ('yellow', [1, 3])]

In [21]:
constant_factory(G, set)

defaultdict(set, {'yellow': {1, 3}, 'blue': {2, 4}, 'red': {1}})

In [22]:
constant_factory(G, set, sort_items=True)

[('blue', {2, 4}), ('red', {1}), ('yellow', {1, 3})]

In [23]:
constant_factory('collection', int)

defaultdict(int, {'c': 2, 'o': 2, 'l': 2, 'e': 1, 't': 1, 'i': 1, 'n': 1})

In [24]:
W = ['hello', 'world', 'hello', 'python']
constant_factory(W, int)

defaultdict(int, {'hello': 2, 'world': 1, 'python': 1})

In [25]:
constant_factory(W, int, sort_items=True)

[('hello', 2), ('python', 1), ('world', 1)]

In [26]:
S = ['python is A', 'python is A', 'java is D', 'java is C']
constant_factory(S, int)

defaultdict(int, {'python is A': 2, 'java is D': 1, 'java is C': 1})

In [27]:
constant_factory(S, int, sort_items=True)

[('java is C', 1), ('java is D', 1), ('python is A', 2)]
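
For the counting case (`func=int`), the standard library's `collections.Counter` gives the same tallies without a custom helper. A short sketch for comparison:

```python
from collections import Counter

S = ['python is A', 'python is A', 'java is D', 'java is C']
counts = Counter(S)

# Same sorted (key, count) pairs as constant_factory(S, int, sort_items=True).
print(sorted(counts.items()))
# The single most frequent item, which a plain defaultdict does not offer.
print(counts.most_common(1))
```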

In [5]:
import re
from typing import List, Sequence

def tokenize(text: str) -> List[str]:
    '''Return the tokens of a sentence including punctuation:

        >>> tokenize('The apple. Where is the apple?')
        ['The', 'apple', '.', 'Where', 'is', 'the', 'apple', '?']
    '''
    # split on runs of non-word characters; the capturing group keeps the
    # separators so punctuation survives, and whitespace-only pieces are dropped.
    return [x.strip() for x in re.split(r'(\W+)', text) if x.strip()]

In [6]:
tokenize('The apple. Where is the apple?')

['The', 'apple', '.', 'Where', 'is', 'the', 'apple', '?']
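
An alternative sketch that avoids splitting altogether: `re.findall` with a pattern that matches either a run of word characters or a single punctuation mark. The function name `tokenize_findall` is hypothetical.

```python
import re
from typing import List


def tokenize_findall(text: str) -> List[str]:
    """Return word and punctuation tokens, dropping whitespace."""
    return re.findall(r"\w+|[^\w\s]", text)


tokens = tokenize_findall('The apple. Where is the apple?')
print(tokens)
```

Unlike a split on `\W+`, this emits each punctuation character as its own token, so `...` becomes three tokens rather than one.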

In [7]:
Sequence??