This notebook is based on the gensim tutorial on GitHub:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Corpora_and_Vector_Spaces.ipynb

# Training a model using the CDC as the corpus

Following the gensim tutorial on GitHub, we will try to train a model that relates a sentence to a document. In this case, we test the approach on the "Codigo de Defesa do Consumidor" (CDC), the Brazilian Consumer Defense Code.

To remember:
- Gensim is a tool for discovering the semantic structure of documents: it examines a corpus (a collection of text documents) and produces a vector representation of the text in that corpus.
- The vector representation is then used to train a model, which is an algorithm that creates a different representation of the data, usually a more semantic one.

## Corpus

In [29]:
# Each chapter is one document. For testing, we start with two documents and will
# read and train on more as soon as further chapters are translated and prepared.

#textIOWrapper44 = open('../../Documents/cdc_en/chapter_4_4', 'r')
#chapterText44 = textIOWrapper44.read()
#chapterText44

#textIOWrapper43 = open('../../Documents/cdc_en/chapter_4_3', 'r')
#chapterText43 = textIOWrapper43.read()
#chapterText43

import os, codecs
directory = "../../Documents/cdc_en"
corpus = []

# print(os.listdir(directory))
for filename in os.listdir(directory):
    # For ASCII-only files, a plain open() is enough:
    # with open(os.path.join(directory, filename), 'r') as fileOpened:

    # Non-ASCII files: decode as UTF-8; `with` closes the file after reading
    with codecs.open(os.path.join(directory, filename), 'r', 'utf-8') as fileOpened:
        textRead = fileOpened.read()
    corpus.append(textRead)

    #print(textRead)
    #if filename.endswith(".asm") or ...
#print(corpus)


In [104]:
# STOPWORDS!
# set of frequent words that are not informative for this experiment
stopWordlist = set('well good bad can could may might would this those less more same her his our mine my from until only them was were will am among instead otherwise above under what when where do does have had has who that which whom shall , they other are under their it into by for a an of the and to in art. -   or paragraph its section is be than may as if there any with not one two three four five your on'.split())
# also treat the numerals 1 to 10 as stop words
stopWordlist.update(str(i) for i in range(1, 11))

particularSWDocumentRelated = '§ chapter (vetoed). i ii iii iv v vi vii viii ix x'
stopWordlist.update(particularSWDocumentRelated.split())

print(stopWordlist)

#corpus = [chapterText43, chapterText44]
# lowercase each document, split it on whitespace, and drop the stop words
texts = [[word for word in document.lower().split() if word not in stopWordlist]
         for document in corpus]
#texts

{'with', ',', 'those', 'had', 'would', 'not', 'one', 'is', 'any', 'his', 'do', 'does', 'under', 'chapter', 'for', 'shall', 'instead', 'might', 'bad', 'than', 'vi', 'x', 'paragraph', '§', 'iii', 'they', 'other', 'may', 'them', '5', 'and', 'art.', 'as', 'have', 'from', 'will', 'your', 'who', 'otherwise', 'their', 'our', '7', 'only', 'an', 'has', 'her', 'until', 'by', 'there', '3', '8', 'or', 'iv', 'am', 'of', 'four', 'good', 'that', 'was', 'viii', '(vetoed).', 'on', 'where', '1', 'into', 'be', 'i', 'its', 'this', 'if', 'three', 'above', '-', 'same', 'less', 'well', 'can', '9', 'which', 'when', '2', 'more', 'what', 'it', 'the', 'vii', 'ii', 'to', 'a', 'v', '6', 'whom', 'among', 'five', '4', 'ix', '10', 'mine', 'two', 'could', 'section', 'my', 'in', 'were', 'are'}


In [105]:
# KEEP ONLY WORDS THAT APPEAR MORE THAN ONCE IN THE WHOLE CORPUS
# (the frequency is counted across all documents / chapters, not per document)
from collections import defaultdict
frequency = defaultdict(int)

for text in texts:
    for token in text:
        frequency[token] += 1
    
processed_corpus = [[token for token in text if frequency[token] > 1] for text in texts]

from pprint import pprint #pretty-printer
pprint(processed_corpus)

[['quality',
  'products',
  'services,',
  'prevention',
  'reparation',
  'damages',
  'protection',
  'health',
  'article',
  'products',
  'services',
  'placed',
  'consumer',
  'market',
  'risks',
  'health',
  'safety',
  'consumers,',
  'except',
  'considered',
  'result',
  'suppliers',
  'circumstances,',
  'information',
  'about',
  'single',
  'paragraph.',
  'case',
  'industrial',
  'product,',
  'must',
  'provide',
  'information',
  'referred',
  'article,',
  'through',
  'appropriate',
  'must',
  'product.',
  'supplier',
  'products',
  'services',
  'harmful',
  'dangerous',
  'health',
  'safety',
  'must',
  'adequate',
  'about',
  'harmfulness',
  'without',
  'prejudice',
  'adoption',
  'article',
  'supplier',
  'place',
  'consumer',
  'market',
  'product',
  'service',
  'harmfulness',
  'health',
  '1.',
  'supplier',
  'products',
  'services',
  'that,',
  'after',
  'consumer',
  'market,',
  'aware',
  'dangerousness',
  'must',
  'inform',
  'c

In [106]:
# Associating each word with a unique id, i.e., creating a dictionary
from gensim import corpora

dictionary = corpora.Dictionary(processed_corpus)
print(dictionary)

Dictionary(266 unique tokens: ['quality', 'products', 'services,', 'prevention', 'reparation']...)
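
To make the idea of a token-to-id dictionary concrete, here is a minimal pure-Python sketch. It assumes ids are handed out in order of first appearance, which is only an illustration: the helper `build_token2id` and the toy documents are hypothetical, and gensim's `Dictionary` may order ids differently depending on the version.

```python
def build_token2id(tokenized_docs):
    """Map every distinct token to an integer id (first-seen order, by assumption)."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token2id:
                # the next free id is simply the current size of the mapping
                token2id[token] = len(token2id)
    return token2id

# hypothetical toy documents, already tokenized
toy_docs = [['quality', 'products', 'health'],
            ['products', 'consumer', 'health']]
toy_token2id = build_token2id(toy_docs)
print(toy_token2id)
# {'quality': 0, 'products': 1, 'health': 2, 'consumer': 3}
```

Repeated tokens such as 'products' keep the id they received the first time they were seen, so the mapping has one entry per distinct token.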


## Vectorization

To infer the latent structure in our corpus we need a way to represent documents that we can manipulate mathematically. One approach is to represent each document as a vector. There are various approaches for creating a vector representation of a document but a simple example is the bag-of-words model. Under the bag-of-words model each document is represented by a vector containing the frequency counts of each word in the dictionary. For example, given a dictionary containing the words ['coffee', 'milk', 'sugar', 'spoon'] a document consisting of the string "coffee milk coffee" could be represented by the vector [2, 1, 0, 0] where the entries of the vector are (in order) the occurrences of "coffee", "milk", "sugar" and "spoon" in the document. The length of the vector is the number of entries in the dictionary.
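
The coffee example above can be reproduced with a few lines of plain Python. This is only a sketch: the helper `bag_of_words` is a hypothetical name, not part of gensim. The last step converts the dense vector into sparse `(id, count)` pairs, which is the form `doc2bow` prints in the cells of this notebook.

```python
from collections import Counter

def bag_of_words(tokens, vocabulary):
    """Return one count per vocabulary entry, in vocabulary order."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

toy_vocabulary = ['coffee', 'milk', 'sugar', 'spoon']
dense_vector = bag_of_words("coffee milk coffee".split(), toy_vocabulary)
print(dense_vector)   # [2, 1, 0, 0]

# sparse form: keep only the non-zero entries as (id, count) pairs
sparse_vector = [(i, c) for i, c in enumerate(dense_vector) if c > 0]
print(sparse_vector)  # [(0, 2), (1, 1)]
```

The sparse form pays off on real corpora, where each document uses only a small fraction of the dictionary.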

In [107]:
print(dictionary.token2id)

{'quality': 0, 'products': 1, 'services,': 2, 'prevention': 3, 'reparation': 4, 'damages': 5, 'protection': 6, 'health': 7, 'article': 8, 'services': 9, 'placed': 10, 'consumer': 11, 'market': 12, 'risks': 13, 'safety': 14, 'consumers,': 15, 'except': 16, 'considered': 17, 'result': 18, 'suppliers': 19, 'circumstances,': 20, 'information': 21, 'about': 22, 'single': 23, 'paragraph.': 24, 'case': 25, 'industrial': 26, 'product,': 27, 'must': 28, 'provide': 29, 'referred': 30, 'article,': 31, 'through': 32, 'appropriate': 33, 'product.': 34, 'supplier': 35, 'harmful': 36, 'dangerous': 37, 'adequate': 38, 'harmfulness': 39, 'without': 40, 'prejudice': 41, 'adoption': 42, 'place': 43, 'product': 44, 'service': 45, '1.': 46, 'that,': 47, 'after': 48, 'market,': 49, 'aware': 50, 'dangerousness': 51, 'inform': 52, 'competent': 53, 'consumers': 54, 'means': 55, '2.': 56, 'previous': 57, 'at': 58, 'expense': 59, '3.': 60, 'whenever': 61, 'federal': 62, 'responsibility': 63, 'fact': 64, 'manufac

In [108]:
example_claim = "I made the purchase and payment for two products in \
the PurpleFire.com store on December 9th. And they have 15 days to do \
the posting. But there is no doing. And I've been waiting for more than \
1 month. I contacted the company several times and talked to Amanda \
(amandapt1313@gmail.com), but they did not solve my problem. I have all \
the conversations filed. And I want to point out here that the \
PurpleFire store has no CNPJ and no address on the site. I would not recommend \
anyone to make any purchase with the store mentioned."

vec_claim = dictionary.doc2bow(example_claim.lower().split())
vec_claim

[(1, 1), (96, 1), (181, 1), (208, 1)]

In [113]:
# convert every document in the processed corpus to a bag-of-words vector
bow_corpus = [dictionary.doc2bow(text) for text in processed_corpus]
print(bow_corpus)

[[(0, 1), (1, 5), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 5), (8, 2), (9, 4), (10, 1), (11, 3), (12, 2), (13, 1), (14, 3), (15, 2), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 2), (22, 2), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 4), (29, 1), (30, 2), (31, 1), (32, 1), (33, 1), (34, 1), (35, 4), (36, 1), (37, 1), (38, 1), (39, 2), (40, 1), (41, 1), (42, 1), (43, 1), (44, 2), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 2), (51, 2), (52, 2), (53, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1)], [(0, 2), (1, 2), (2, 2), (3, 1), (4, 1), (5, 3), (8, 3), (10, 2), (11, 3), (13, 2), (17, 2), (18, 1), (20, 2), (21, 2), (23, 1), (24, 1), (29, 1), (31, 1), (32, 1), (36, 1), (40, 1), (42, 1), (43, 1), (44, 6), (45, 5), (47, 1), (49, 1), (54, 2), (56, 1), (57, 1), (60, 1), (63, 2), (64, 1), (65, 7), (66, 2), (67, 1), (68, 1), (69, 3), (70, 2), (71, 2), (72, 2), (73, 2), (74, 2), (75, 2), (76, 1), (77, 1), (78, 1), (79, 1), (80, 2), (81, 

## Model

Now that we have vectorized our corpus, we can begin to transform it using models. We use model as an abstract term referring to a transformation from one document representation to another. In gensim, documents are represented as vectors, so a model can be thought of as a transformation between two vector spaces. The details of this transformation are learned from the training corpus.
One simple example of a model is tf-idf. The tf-idf model transforms vectors from the bag-of-words representation to a vector space where the frequency counts are weighted according to the relative rarity of each word in the corpus.
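
As a rough illustration of that weighting, the sketch below computes tf-idf by hand on a toy bag-of-words corpus. It assumes idf = log2(N / document frequency) and L2 normalization of the result; these choices, the helper names `train_idf` and `tfidf_transform`, and the toy data are all assumptions of the sketch. gensim's `TfidfModel` is configurable, so its numbers may differ.

```python
import math
from collections import Counter

def train_idf(bow_corpus):
    """Learn {token_id: idf} from the corpus, with idf = log2(N / document frequency)."""
    n_docs = len(bow_corpus)
    doc_freq = Counter(token_id for doc in bow_corpus for token_id, _ in doc)
    return {token_id: math.log2(n_docs / df) for token_id, df in doc_freq.items()}

def tfidf_transform(bow, idf):
    """Weight counts by idf, drop unknown or zero-weight ids, then L2-normalize."""
    weighted = [(token_id, count * idf[token_id])
                for token_id, count in bow
                if idf.get(token_id, 0.0) > 0.0]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    return [(token_id, w / norm) for token_id, w in weighted] if norm else []

# toy corpus over the ids 0=coffee, 1=milk, 2=sugar, 3=spoon
toy_bow_corpus = [
    [(0, 2), (1, 1)],  # coffee x2, milk
    [(1, 1), (2, 1)],  # milk, sugar
    [(1, 2), (3, 1)],  # milk x2, spoon
]
toy_idf = train_idf(toy_bow_corpus)
toy_weights = tfidf_transform([(0, 1), (1, 1), (2, 1)], toy_idf)
print(toy_weights)
```

In this toy corpus "milk" (id 1) occurs in every document, so its idf is zero and it drops out of the transformed vector, while the rarer "coffee" and "sugar" share the remaining weight equally.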

In [114]:
from gensim import models
# train the model
tfidf = models.TfidfModel(bow_corpus)
# transform the string
tfidf[dictionary.doc2bow("i would like to have my money back based on the product that i bought in 25 days back then. It suffered some decadential loss but still not satisfied with my product.".lower().split())]

[(34, 0.4220143629672883),
 (44, 0.1430780415410627),
 (249, 0.6330215444509324),
 (262, 0.6330215444509324)]