# NewsGroups Dataset Vignette

In this vignette, I will show you how to create a database for storing and manipulating newsgroup posts with `doctable`.

## Introduction to dataset

We will be using the [20 Newsgroups dataset](http://qwone.com/~jason/20Newsgroups/) for this vignette. This is the [sklearn website description](https://scikit-learn.org/stable/datasets/real_world.html#the-20-newsgroups-text-dataset):

_The 20 newsgroups dataset comprises around 18000 newsgroups posts on 20 topics split in two subsets: one for training (or development) and the other one for testing (or for performance evaluation). The split between the train and test set is based upon a messages posted before and after a specific date._

We use sklearn's [fetch_20newsgroups](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html#sklearn.datasets.fetch_20newsgroups) function to download and access posts from the three politics newsgroups.

In [13]:
import sklearn.datasets
newsgroups = sklearn.datasets.fetch_20newsgroups(categories=['talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc'])
newsgroups.keys(), len(newsgroups['data'])

(dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR']), 1575)

This is an example of a newsgroup post.

In [33]:
print(newsgroups['data'][0])

From: golchowy@alchemy.chem.utoronto.ca (Gerald Olchowy)
Subject: Re: Help fight the Clinton Administration's invasion of your privacy
Organization: University of Toronto Chemistry Department
Lines: 16

In article <9308@blue.cis.pitt.edu> cjp+@pitt.edu (Casimir J Palowitch) writes:
>The Clinton Administration wants to "manage" your use of digital
>encryption. This includes a proposal which would limit your use of
>encryption to a standard developed by the NSA, the technical details of 
>which would remain classified with the government.
>
>This cannot be allowed to happen.
>

It is a bit unfair to call blame the Clinton Administration alone...this
initiative was underway under the Bush Administration...it is basically
a bipartisan effort of the establishment Demopublicans and
Republicrats...the same bipartisan effort that brought the S&L scandal,
and BCCI, etc.

Gerald



The post looks very similar to an email, so we will use Python's `email` package to parse the text and return a dictionary containing the relevant fields. Our `parse_newsgroup` function shows how we can extract metadata fields like author, subject, and organization from the message, as well as the main text body.

In [61]:
import email

def parse_newsgroup(email_text):
    message = email.message_from_string(email_text)
    return {
        'author': message['from'],
        'subject': message['Subject'],
        'organization': message['Organization'],
        'lines': int(message['Lines']),
        'text': message.get_payload(),
    }

parse_newsgroup(newsgroups['data'][0])

{'author': 'golchowy@alchemy.chem.utoronto.ca (Gerald Olchowy)',
 'subject': "Re: Help fight the Clinton Administration's invasion of your privacy",
 'organization': 'University of Toronto Chemistry Department',
 'lines': 16,
 'text': 'In article <9308@blue.cis.pitt.edu> cjp+@pitt.edu (Casimir J Palowitch) writes:\n>The Clinton Administration wants to "manage" your use of digital\n>encryption. This includes a proposal which would limit your use of\n>encryption to a standard developed by the NSA, the technical details of \n>which would remain classified with the government.\n>\n>This cannot be allowed to happen.\n>\n\nIt is a bit unfair to call blame the Clinton Administration alone...this\ninitiative was underway under the Bush Administration...it is basically\na bipartisan effort of the establishment Demopublicans and\nRepublicrats...the same bipartisan effort that brought the S&L scandal,\nand BCCI, etc.\n\nGerald\n'}
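
As a small aside on why `parse_newsgroup` can mix `message['from']` and `message['Subject']`: header lookup on an `email` message ignores case, and a missing header returns `None` instead of raising. The sketch below demonstrates this on a hypothetical sample post (the address and text are made up for illustration).

```python
import email

# Hypothetical sample post, shaped like the newsgroup messages above.
sample_post = (
    "From: alice@example.com (Alice Example)\n"
    "Subject: Re: test message\n"
    "Organization: Example Org\n"
    "Lines: 2\n"
    "\n"
    "First line of the body.\n"
    "Second line of the body.\n"
)

message = email.message_from_string(sample_post)

# Header lookup ignores case, so 'from' and 'From' return the same value.
author_lower = message['from']
author_title = message['From']

# A header that is absent returns None rather than raising KeyError.
missing = message['X-Not-Present']

# For a non-multipart message, get_payload() returns the body as a string.
body = message.get_payload()
print(message['Subject'], missing, repr(body))
```

Note that header values come back as strings, which is why `parse_newsgroup` converts `Lines` with `int()`.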

## Creating a database schema

The first step is to create a database schema appropriate for the newsgroup dataset by defining a container dataclass with the `@schema` decorator. The `schema` decorator converts the class into a [`dataclass`](https://realpython.com/python-data-classes/) with [slots](https://docs.python.org/3/reference/datamodel.html#slots) enabled (provided `__slots__ = []` is given in the definition) and makes it inherit from `DocTableRow`, which adds some additional functionality. The type hints associated with each variable are used in the schema definition for the new tables, and arguments to `Col()`, `IDCol()`, `AddedCol()`, and `UpdatedCol()` are mostly passed through to `dataclasses.field` (see the [docs](https://doctable.org/ref/doctable/schemas/field_columns.html#Col) for more detail), so all dataclass functionality is maintained. The [doctable schema guide](doctable_schema.html) explains more about schema and schema object definitions.

Here I define a `NewsgroupDoc` class to represent a single document and define `__slots__` so the decorator can automatically create a slot class. Each member variable will act as a column in our database schema, and the first variable we define is an `id` column with the default value `IDCol()`. This is a special function that translates to a schema that uses the `id` column as the primary key and enables auto-incrementing. Because `id` has a default, every variable defined after it must have a default as well; this is a requirement of dataclasses.

I also define a couple of methods as part of our schema class. They are ignored in the schema creation process, but allow us to manipulate the object within Python. The `author_email` property extracts just the email address from the author field. Note that even though it is a property, it is defined as a method and therefore is not considered when creating the class schema. I also define a `classmethod` that can be used to create a new `NewsgroupDoc` from the newsgroup text; this replaces the functionality of the `parse_newsgroup` function we created above. This way, the class knows how to create itself from the raw newsgroup text.
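
To illustrate the point about which members become columns, here is a minimal sketch using only the standard library's `dataclasses` module. It is not doctable's implementation; the `PostRow` class and its members are hypothetical stand-ins. It shows that only annotated attributes count as fields, while properties and classmethods are left out.

```python
import dataclasses
import typing

@dataclasses.dataclass
class PostRow:
    # Hypothetical stand-in for a schema class; not doctable's implementation.
    id: typing.Optional[int] = None
    author: typing.Optional[str] = None
    lines: typing.Optional[int] = None

    @property
    def author_name(self) -> str:
        # A property has no class-level annotation, so it is not a field.
        return self.author.split('(')[-1].rstrip(')')

    @classmethod
    def from_header(cls, author: str, lines: str) -> 'PostRow':
        # Alternate constructor that converts the raw header strings.
        return cls(author=author, lines=int(lines))

# Only the annotated attributes are reported as dataclass fields.
field_names = [f.name for f in dataclasses.fields(PostRow)]
row = PostRow.from_header('alice@example.com (Alice Example)', '16')
print(field_names, row.author_name)
```

Giving every field a default also keeps the class definition valid: a plain dataclass raises a `TypeError` at definition time if a field without a default follows one that has a default.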

In [92]:
import sys
sys.path.append('..')
import doctable

import re
import email
import dataclasses

@doctable.schema
class NewsgroupDoc:
    __slots__ = []
    
    # schema columns
    id: int = doctable.IDCol()
    author: str = None
    subject: str = None
    organization: str = None
    lines: int = None
    text: str = None
        
    @property
    def author_email(self, pattern=re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')):
        '''Get the author\'s email address from the author field text.
        '''
        return re.search(pattern, self.author)[0]
    

    @classmethod
    def from_string(cls, newsgroup_text):
        '''Create a NewsgroupDoc object from the original newsgroup string.
        '''
        message = email.message_from_string(newsgroup_text)
        return cls(
            author = message['from'],
            subject = message['Subject'],
            organization = message['Organization'],
            lines = int(message['Lines']),
            text = message.get_payload(),
        )
        
        
# for example, we create a new NewsgroupDoc from the first newsgroup article
ngdoc = NewsgroupDoc.from_string(newsgroups['data'][0])
print(ngdoc.author)
ngdoc.author_email

golchowy@alchemy.chem.utoronto.ca (Gerald Olchowy)


'golchowy@alchemy.chem.utoronto.ca'
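
The `author_email` property indexes the result of `re.search` directly, which would raise a `TypeError` if an author field contained no email address. A minimal sketch of a more defensive variant follows; `extract_email` is a hypothetical helper name, and whether any post in this dataset actually lacks an address is an assumption I have not checked.

```python
import re
from typing import Optional

# An email pattern like the one used by author_email above.
EMAIL_PATTERN = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')

def extract_email(author: Optional[str]) -> Optional[str]:
    """Return the first email-like substring of author, or None if there is none."""
    if author is None:
        return None
    match = EMAIL_PATTERN.search(author)
    return match.group(0) if match else None

print(extract_email('golchowy@alchemy.chem.utoronto.ca (Gerald Olchowy)'))
```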

To make sure the `NewsgroupDoc` will translate to the database schema we expect, we can create a new `DocTable` object that uses it as a schema. We use the `schema` argument of the `DocTable` constructor to specify the schema, and display it below. Note that most fields were translated to `VARCHAR` type fields, but `id` and `lines` were translated to `INTEGER` types based on their type hints.

In [96]:
ng_table = doctable.DocTable(target=':memory:', tabname='documents', schema=NewsgroupDoc)
ng_table.schema_table()

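| | name | type | nullable | default | autoincrement | primary_key |
|---|---|---|---|---|---|---|
| 0 | id | INTEGER | False | | auto | 1 |
| 1 | author | VARCHAR | True | | auto | 0 |
| 2 | subject | VARCHAR | True | | auto | 0 |
| 3 | organization | VARCHAR | True | | auto | 0 |
| 4 | lines | INTEGER | True | | auto | 0 |
| 5 | text | VARCHAR | True | | auto | 0 |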
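
For readers who think in SQL, the schema table above corresponds roughly to the hand-written `CREATE TABLE` statement in the sketch below. This DDL is my own approximation for illustration, not the statement doctable actually issues; it uses the standard library's `sqlite3` module to create the table and read the column definitions back.

```python
import sqlite3

# Hand-written DDL that approximates the schema table above; an assumption
# for illustration, not the statement doctable actually issues.
ddl = """
CREATE TABLE documents (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    author VARCHAR,
    subject VARCHAR,
    organization VARCHAR,
    lines INTEGER,
    text VARCHAR
)
"""

conn = sqlite3.connect(':memory:')
conn.execute(ddl)

# PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk) per column.
columns = conn.execute('PRAGMA table_info(documents)').fetchall()
for cid, name, col_type, notnull, default, pk in columns:
    print(cid, name, col_type, pk)
```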


To better describe the data we are interested in, we now create a class that inherits from `DocTable`. This class will act as the main interface for working with our dataset. We use the `_tabname_` and `_schema_` class attributes to define the table name and schema so we don't need to include them in the constructor. We also define a method `count_author_emails`, which we will describe later.

In [99]:
import collections

class NewsgroupTable(doctable.DocTable):
    _tabname_ = 'documents'
    _schema_ = NewsgroupDoc
    
    def count_author_emails(self, *args, **kwargs):
        author_emails = self.select('author', *args, **kwargs)
        return collections.Counter(author_emails)

# create a new table instance    
ng_table = NewsgroupTable(target=':memory:')
ng_table.schema_table()

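| | name | type | nullable | default | autoincrement | primary_key |
|---|---|---|---|---|---|---|
| 0 | id | INTEGER | False | | auto | 1 |
| 1 | author | VARCHAR | True | | auto | 0 |
| 2 | subject | VARCHAR | True | | auto | 0 |
| 3 | organization | VARCHAR | True | | auto | 0 |
| 4 | lines | INTEGER | True | | auto | 0 |
| 5 | text | VARCHAR | True | | auto | 0 |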
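
As written, `count_author_emails` selects the `author` column and tallies the values with a `Counter`, so it appears to count full author strings rather than extracted email addresses. The sketch below reproduces that select-then-count pattern with the standard library's `sqlite3` standing in for doctable; the rows and the `count_authors` helper are hypothetical.

```python
import collections
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE documents (id INTEGER PRIMARY KEY, author VARCHAR, lines INTEGER)'
)
# Hypothetical rows; two posts share the same author string.
conn.executemany(
    'INSERT INTO documents (author, lines) VALUES (?, ?)',
    [
        ('alice@example.com (Alice)', 10),
        ('bob@example.com (Bob)', 25),
        ('alice@example.com (Alice)', 40),
    ],
)

def count_authors(connection, min_lines=0):
    # Select one column, then tally identical values with a Counter.
    rows = connection.execute(
        'SELECT author FROM documents WHERE lines >= ?', (min_lines,)
    )
    return collections.Counter(author for (author,) in rows)

print(count_authors(conn).most_common(1))
```

The optional `min_lines` filter plays the role that the forwarded `*args, **kwargs` play in the method above: the caller narrows the selection and the counting logic stays the same.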


In [100]:
ng_docs = [NewsgroupDoc.from_string(text) for text in newsgroups['data']]
ng_docs

ValueError: invalid literal for int() with base 10: '78\n\t<Apr15.175334.72079@yuma.ACNS.ColoState.EDU>'
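
The error above shows that at least one post has a `Lines` header whose value continues onto a folded line, so `int()` cannot convert it. One possible fix, sketched here under the assumption that the leading digits are the intended line count, is to read only the leading integer and return `None` when there is none. The `parse_line_count` name is hypothetical.

```python
import re
from typing import Optional

def parse_line_count(header_value: Optional[str]) -> Optional[int]:
    """Return the leading integer of a Lines header, or None if there is none."""
    if header_value is None:
        return None
    match = re.match(r'\s*(\d+)', header_value)
    return int(match.group(1)) if match else None

# The value from the error message above.
print(parse_line_count('78\n\t<Apr15.175334.72079@yuma.ACNS.ColoState.EDU>'))
```

Inside `from_string`, `lines = parse_line_count(message['Lines'])` could replace the bare `int(...)` call; since the `lines` column is nullable, a `None` result can still be stored.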

Many text analysis applications include operations for process-based parallelization, and our example is no exception.

In [None]:
class process_and_store:
    pass  # placeholder



In [None]:
# build a pipeline whose only step is the parse_newsgroup function defined above
parser = doctable.ParsePipeline([
    parse_newsgroup
])


for email_text in newsgroups['data']:
    email_data = parse_newsgroup(email_text)

In [49]:


import multiprocessing
with multiprocessing.Pool(10) as p:
    print(p)

<multiprocessing.pool.Pool state=RUN pool_size=10>
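
The pool above is created but given no work. A minimal sketch of how parsing could be distributed across processes follows; the function names are hypothetical, and it parses only the subject header to keep the worker simple.

```python
import email
import multiprocessing

def parse_subject(post_text):
    # Worker function: defined at module level so it can be pickled
    # and sent to the worker processes.
    return email.message_from_string(post_text)['Subject']

def parse_subjects_parallel(post_texts, processes=2):
    # Pool.map splits the inputs across worker processes and returns
    # the results in the same order as the inputs.
    with multiprocessing.Pool(processes) as pool:
        return pool.map(parse_subject, post_texts)
```

When running this as a script, call `parse_subjects_parallel(newsgroups['data'])` from under an `if __name__ == '__main__':` guard, since some platforms start worker processes by re-importing the main module.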
