This notebook is based on the gensim tutorial on GitHub:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Corpora_and_Vector_Spaces.ipynb

# Training a model using the CDC as corpus

Following the gensim tutorial on GitHub, we will try to train a model that relates a sentence to a document. In this case, we test the approach using the "Codigo de Defesa do Consumidor" (CDC).

To remember:
- Gensim is a tool for discovering the semantic structure of documents by examining a corpus (a collection of text documents) and producing a vector representation of the text in that corpus.
- The vector representation is then used to train a model, an algorithm that creates different representations of the data, which are usually more semantic.

## Corpus

In [26]:
# Each chapter is one document. For testing, we start with two documents and will
# add more to the reading / training as further chapters are translated and prepared.

# use context managers so the files are closed after reading
with open('../../Documents/cdc_en/chapter_4_4', 'r') as chapter_file:
    chapterText44 = chapter_file.read()

with open('../../Documents/cdc_en/chapter_4_3', 'r') as chapter_file:
    chapterText43 = chapter_file.read()

chapterText43

"CHAPTER IV\nOf the Quality of Products and Services, of the Prevention and the Reparation of the Damages\n\nSECTION III\nResponsibility for Product and Service Addiction\n\nArticle 18. Suppliers of durable or non-durable consumer products are jointly and severally liable for defects in quality or quantity that render them unfit or inadequate for their consumption or for their reduction in value, as well as those resulting from disparity, Indications in the container, packaging, labeling or advertising message, respecting the variations due to their nature, and the consumer may require the replacement of the vitiated parts.\n\nParagraph 1. If the defect is not remedied within a maximum period of thirty days, the consumer may, alternatively and at his option, require:\n\nI - the replacement of the product with another of the same species, in perfect conditions of use;\n\nII - the immediate return of the amount paid, monetarily updated, without prejudice to any losses and damages;\n\nIII

In [17]:
# set of frequent, unimportant words for this experiment
stoplist = set('for a of the and to in art. - by (vetoed). i ii iii iv v vi vii viii ix x or paragraph its until section is be than may as if that there any with not one two three four five'.split())

corpus = [chapterText43, chapterText44]
# tokenize each document: lowercase, split on whitespace, drop stop words
texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in corpus
]
texts

[['chapter',
  'quality',
  'products',
  'services,',
  'prevention',
  'reparation',
  'damages',
  'responsibility',
  'product',
  'service',
  'addiction',
  'article',
  '18.',
  'suppliers',
  'durable',
  'non-durable',
  'consumer',
  'products',
  'are',
  'jointly',
  'severally',
  'liable',
  'defects',
  'quality',
  'quantity',
  'render',
  'them',
  'unfit',
  'inadequate',
  'their',
  'consumption',
  'their',
  'reduction',
  'value,',
  'well',
  'those',
  'resulting',
  'from',
  'disparity,',
  'indications',
  'container,',
  'packaging,',
  'labeling',
  'advertising',
  'message,',
  'respecting',
  'variations',
  'due',
  'their',
  'nature,',
  'consumer',
  'require',
  'replacement',
  'vitiated',
  'parts.',
  '1.',
  'defect',
  'remedied',
  'within',
  'maximum',
  'period',
  'thirty',
  'days,',
  'consumer',
  'may,',
  'alternatively',
  'at',
  'his',
  'option,',
  'require:',
  'replacement',
  'product',
  'another',
  'same',
  'species,',
 

In [18]:
# remove words that appear only once across the corpus
from collections import defaultdict
from pprint import pprint  # pretty-printer

frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

processed_corpus = [[token for token in text if frequency[token] > 1] for text in texts]

pprint(processed_corpus)

[['chapter',
  'quality',
  'products',
  'services,',
  'prevention',
  'reparation',
  'damages',
  'product',
  'service',
  'article',
  'suppliers',
  'durable',
  'non-durable',
  'consumer',
  'products',
  'are',
  'jointly',
  'severally',
  'liable',
  'defects',
  'quality',
  'quantity',
  'them',
  'unfit',
  'inadequate',
  'their',
  'consumption',
  'their',
  'reduction',
  'value,',
  'well',
  'those',
  'resulting',
  'from',
  'indications',
  'container,',
  'packaging,',
  'labeling',
  'advertising',
  'message,',
  'due',
  'their',
  'nature,',
  'consumer',
  'replacement',
  '1.',
  'defect',
  'period',
  'days,',
  'consumer',
  'may,',
  'alternatively',
  'at',
  'replacement',
  'product',
  'another',
  'same',
  'species,',
  'immediate',
  'return',
  'amount',
  'paid,',
  'monetarily',
  'updated,',
  'without',
  'prejudice',
  'losses',
  'damages;',
  'proportional',
  'reduction',
  'price.',
  '2.',
  'reduce',
  'period',
  'provided',
  'les

In [19]:
# Associate each word with a unique id, i.e., create a dictionary
from gensim import corpora

dictionary = corpora.Dictionary(processed_corpus)
print(dictionary)

Dictionary(113 unique tokens: ['chapter', 'quality', 'products', 'services,', 'prevention']...)
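
As a rough illustration of what such a dictionary holds, the mapping from token to integer id can be sketched in plain Python. The helper `build_token2id` and the toy corpus below are hypothetical and are not gensim's implementation; the sketch assumes that ids are handed out in order of first appearance.

```python
def build_token2id(documents):
    """Assign each distinct token an integer id in order of first appearance."""
    token2id = {}
    for document in documents:
        for token in document:
            if token not in token2id:
                # the next free id equals the number of tokens seen so far
                token2id[token] = len(token2id)
    return token2id


# toy corpus, invented for this sketch
toy_corpus = [
    ["consumer", "product", "defect", "consumer"],
    ["product", "service", "supplier"],
]
toy_token2id = build_token2id(toy_corpus)
print(toy_token2id)
# {'consumer': 0, 'product': 1, 'defect': 2, 'service': 3, 'supplier': 4}
```

Repeated tokens such as "consumer" keep the id they received the first time, so the dictionary size equals the number of distinct tokens.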


## Vectorization

To infer the latent structure in our corpus, we need a way to represent documents that we can manipulate mathematically. One approach is to represent each document as a vector. There are various ways to create a vector representation of a document; a simple one is the bag-of-words model. Under the bag-of-words model, each document is represented by a vector containing the frequency count of each word in the dictionary. For example, given a dictionary containing the words ['coffee', 'milk', 'sugar', 'spoon'], a document consisting of the string "coffee milk coffee" could be represented by the vector [2, 1, 0, 0], where the entries of the vector are (in order) the occurrences of "coffee", "milk", "sugar" and "spoon" in the document. The length of the vector is the number of entries in the dictionary.
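
The coffee example can be reproduced with a few lines of plain Python. This is only a sketch of the dense bag-of-words idea; the helper name `dense_bow` is made up for illustration and is not part of gensim.

```python
from collections import Counter


def dense_bow(tokens, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(tokens)
    # Counter returns 0 for words that never occur in the document
    return [counts[word] for word in vocabulary]


vocabulary = ["coffee", "milk", "sugar", "spoon"]
vector = dense_bow("coffee milk coffee".split(), vocabulary)
print(vector)  # [2, 1, 0, 0]
```

The vector always has one entry per dictionary word, which is why most entries are zero for short documents and large dictionaries.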

In [20]:
print(dictionary.token2id)

{'chapter': 0, 'quality': 1, 'products': 2, 'services,': 3, 'prevention': 4, 'reparation': 5, 'damages': 6, 'product': 7, 'service': 8, 'article': 9, 'suppliers': 10, 'durable': 11, 'non-durable': 12, 'consumer': 13, 'are': 14, 'jointly': 15, 'severally': 16, 'liable': 17, 'defects': 18, 'quantity': 19, 'them': 20, 'unfit': 21, 'inadequate': 22, 'their': 23, 'consumption': 24, 'reduction': 25, 'value,': 26, 'well': 27, 'those': 28, 'resulting': 29, 'from': 30, 'indications': 31, 'container,': 32, 'packaging,': 33, 'labeling': 34, 'advertising': 35, 'message,': 36, 'due': 37, 'nature,': 38, 'replacement': 39, '1.': 40, 'defect': 41, 'period': 42, 'days,': 43, 'may,': 44, 'alternatively': 45, 'at': 46, 'another': 47, 'same': 48, 'species,': 49, 'immediate': 50, 'return': 51, 'amount': 52, 'paid,': 53, 'monetarily': 54, 'updated,': 55, 'without': 56, 'prejudice': 57, 'losses': 58, 'damages;': 59, 'proportional': 60, 'price.': 61, '2.': 62, 'reduce': 63, 'provided': 64, 'less': 65, 'more':

In [21]:
example_claim = "I made the purchase and payment for two products in \
the PurpleFire.com store on December 9th. And they have 15 days to do \
the posting. But there is no doing. And I've been waiting for more than \
1 month. I contacted the company several times and talked to Amanda \
(amandapt1313@gmail.com), but they did not solve my problem. I have all \
the conversations filed. And I want to point out here that the \
PurpleFire store has no CNPJ and no address on the site. I would not recommend \
anyone to make any purchase with the store mentioned."

vec_claim = dictionary.doc2bow(example_claim.lower().split())
vec_claim

[(2, 1), (66, 1), (72, 1)]
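
The result is a sparse list of (token id, count) pairs, and tokens that are not in the dictionary are left out. A minimal sketch of that behaviour in plain Python is shown below; `sparse_bow`, `toy_ids` and the sample tokens are hypothetical names and data chosen for illustration, not gensim code.

```python
from collections import Counter


def sparse_bow(tokens, token2id):
    """Count known tokens only and return sorted (token_id, count) pairs."""
    counts = Counter(token2id[token] for token in tokens if token in token2id)
    return sorted(counts.items())


# toy dictionary, invented for this sketch
toy_ids = {"products": 0, "store": 1, "purchase": 2}
claim_tokens = "i made the purchase for two products in the store store".split()
claim_vector = sparse_bow(claim_tokens, toy_ids)
print(claim_vector)  # [(0, 1), (1, 2), (2, 1)]
```

Only three of the eleven tokens appear in the toy dictionary, so the other eight contribute nothing to the vector.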

In [22]:
# vectorize the corpus: convert each document to a bag-of-words vector
bow_corpus = [dictionary.doc2bow(text) for text in processed_corpus]
bow_corpus

[[(0, 1),
  (1, 5),
  (2, 6),
  (3, 2),
  (4, 1),
  (5, 2),
  (6, 2),
  (7, 6),
  (8, 3),
  (9, 7),
  (10, 2),
  (11, 1),
  (12, 1),
  (13, 6),
  (14, 6),
  (15, 4),
  (16, 2),
  (17, 4),
  (18, 4),
  (19, 2),
  (20, 3),
  (21, 3),
  (22, 2),
  (23, 7),
  (24, 2),
  (25, 4),
  (26, 2),
  (27, 3),
  (28, 4),
  (29, 2),
  (30, 3),
  (31, 2),
  (32, 2),
  (33, 2),
  (34, 2),
  (35, 3),
  (36, 3),
  (37, 3),
  (38, 2),
  (39, 4),
  (40, 4),
  (41, 1),
  (42, 2),
  (43, 1),
  (44, 2),
  (45, 2),
  (46, 3),
  (47, 3),
  (48, 2),
  (49, 2),
  (50, 6),
  (51, 3),
  (52, 3),
  (53, 3),
  (54, 3),
  (55, 3),
  (56, 6),
  (57, 4),
  (58, 3),
  (59, 2),
  (60, 3),
  (61, 2),
  (62, 4),
  (63, 2),
  (64, 4),
  (65, 2),
  (66, 2),
  (67, 1),
  (68, 2),
  (69, 1),
  (70, 2),
  (71, 1),
  (72, 2),
  (73, 3),
  (74, 3),
  (75, 8),
  (76, 2),
  (77, 2),
  (78, 2),
  (79, 2),
  (80, 2),
  (81, 2),
  (82, 2),
  (83, 2),
  (84, 1),
  (85, 4),
  (86, 5),
  (87, 2),
  (88, 3),
  (89, 1),
  (90, 2),
  (91, 2)

## Model

Now that we have vectorized our corpus, we can begin to transform it using models. We use "model" as an abstract term for a transformation from one document representation to another. In gensim, documents are represented as vectors, so a model can be thought of as a transformation between two vector spaces. The details of this transformation are learned from the training corpus.

One simple example of a model is tf-idf. The tf-idf model transforms vectors from the bag-of-words representation to a vector space where the frequency counts are weighted according to the relative rarity of each word in the corpus.
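
To make the weighting concrete, here is a minimal pure-Python sketch of one common tf-idf variant. It assumes an inverse document frequency of log2(N / df) followed by L2 normalisation; the function names `fit_idf` and `tfidf_transform` and the toy data are invented for this sketch, and it should not be read as gensim's actual implementation.

```python
import math


def fit_idf(bow_docs):
    """Compute log2(N / df) for every token id seen in the training corpus."""
    n_docs = len(bow_docs)
    doc_freq = {}
    for doc in bow_docs:
        for token_id, _count in doc:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1
    return {token_id: math.log2(n_docs / df) for token_id, df in doc_freq.items()}


def tfidf_transform(bow_doc, idf):
    """Weight counts by idf, drop zero weights, then scale to unit length."""
    weighted = [(token_id, count * idf.get(token_id, 0.0)) for token_id, count in bow_doc]
    weighted = [(token_id, weight) for token_id, weight in weighted if weight > 0.0]
    norm = math.sqrt(sum(weight * weight for _, weight in weighted))
    return [(token_id, weight / norm) for token_id, weight in weighted] if norm else []


# toy bag-of-words corpus with two documents, invented for this sketch
toy_bow = [
    [(0, 2), (1, 1)],  # document 0 contains token ids 0 and 1
    [(0, 1), (2, 3)],  # document 1 contains token ids 0 and 2
]
toy_idf = fit_idf(toy_bow)
query_weights = tfidf_transform([(0, 4), (2, 1)], toy_idf)
print(query_weights)  # [(2, 1.0)]
```

Under this scheme, a token that occurs in every training document gets weight zero. With only two documents, a query can therefore shrink to very few terms, and a query with a single surviving term ends up with weight 1.0 after normalisation.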

In [25]:
from gensim import models

# train the model
tfidf = models.TfidfModel(bow_corpus)

# transform a query string into tf-idf weights
query = "i would like to have my money back based on the product that i bought in 25 days back then. It suffered some decadential loss but still not satisfied with my product."
tfidf[dictionary.doc2bow(query.lower().split())]

[(111, 1.0)]