# Modern NLP in Python
### _- Or -_
## What we can learn about food by analyzing a million Yelp reviews

#### Before we begin...
- The examples in this notebook were taken from this [notebook](http://nbviewer.jupyter.org/github/skipgram/modern-nlp-in-python/blob/master/executable/Modern_NLP_in_Python.ipynb#)

## The Yelp Dataset
[**The Yelp Dataset**](https://www.yelp.com/dataset_challenge/) is a dataset published by the business review service [Yelp](http://yelp.com) for academic research and educational purposes.

**Note:** To run this notebook on your own machine, you will need to download your own copy of the Yelp dataset. The dataset can be downloaded by following the steps below:
1. Go to the Yelp dataset page [here](https://www.yelp.com/dataset_challenge/)
1. Click "Get the Data"
1. Read and agree to Yelp's terms of use.

The current dataset consists of:
- __552K__ users
- __77K__ businesses
- __2.2M__ user reviews

When we look only at restaurants, there are approximately __55K__ restaurants with approximately __3.2M__ user reviews written about them.

The data is provided in several files in _.json_ format. We will use the following files for our demonstration:
- __yelp\_academic\_dataset\_business.json__ &mdash; _the records for individual businesses_
- __yelp\_academic\_dataset\_review.json__ &mdash; _the records for reviews users wrote about businesses_

The files are text files (UTF-8) with one _json object_ per line, each one corresponding to an individual data record. Let's look at a few examples.

In [1]:
with open('dataset/business.json', encoding='utf_8') as f:
    first_business_record = f.readline() 

print (first_business_record)

{"business_id": "FYWN1wneV18bWNgQjJ2GNg", "name": "Dental by Design", "neighborhood": "", "address": "4855 E Warner Rd, Ste B9", "city": "Ahwatukee", "state": "AZ", "postal_code": "85044", "latitude": 33.3306902, "longitude": -111.9785992, "stars": 4.0, "review_count": 22, "is_open": 1, "attributes": {"AcceptsInsurance": true, "ByAppointmentOnly": true, "BusinessAcceptsCreditCards": true}, "categories": ["Dentists", "General Dentistry", "Health & Medical", "Oral Surgeons", "Cosmetic Dentists", "Orthodontists"], "hours": {"Friday": "7:30-17:00", "Tuesday": "7:30-17:00", "Thursday": "7:30-17:00", "Wednesday": "7:30-17:00", "Monday": "7:30-17:00"}}



The business records consist of _key: value_ pairs containing information about the business. A few attributes we will be interested in for this demonstration include:
- __business\_id__ &mdash; _unique identifier for a business_
- __categories__ &mdash; _an array containing the categories the business belongs to_

The _categories_ attribute is of special interest. This demonstration focuses on restaurants, which are indicated by the presence of the _Restaurants_ tag in the _categories_ array. In addition, the _categories_ array may contain more detailed information about restaurants, such as the type of food they serve.
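The category check described above can be sketched on a single record. The record below is a shortened, made-up example in the same one-object-per-line format, and `is_restaurant` is a hypothetical helper name used only for illustration.

```python
import json

# Hypothetical, shortened business record (not taken from the dataset).
sample_line = ('{"business_id": "abc123", "name": "Example Diner", '
               '"categories": ["Restaurants", "Burgers"]}')


def is_restaurant(record):
    """Return True if the record's categories array contains 'Restaurants'."""
    # `or []` guards against a missing or null categories field; this is a
    # defensive assumption, not something seen in the sample record above.
    return 'Restaurants' in (record.get('categories') or [])


business = json.loads(sample_line)
print(business['business_id'], is_restaurant(business))
```

The remaining entries of `categories` (here `Burgers`) are where more specific information, such as the type of food, would show up.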

The review records are structured similarly &mdash; _key: value_ pairs containing information about who wrote the review, the review itself, and which business the review refers to.

In [2]:
with open('dataset/review.json', encoding='utf_8') as f:
    first_review_record = f.readline()
    
print (first_review_record)

{"review_id":"v0i_UHJMo_hPBq9bxWvW4w","user_id":"bv2nCi5Qv5vroFiqKGopiw","business_id":"0W4lkclzZThpx3V65bVgig","stars":5,"date":"2016-05-28","text":"Love the staff, love the meat, love the place. Prepare for a long line around lunch or dinner hours. \n\nThey ask you how you want you meat, lean or something maybe, I can't remember. Just say you don't want it too fatty. \n\nGet a half sour pickle and a hot pepper. Hand cut french fries too.","useful":0,"funny":0,"cool":0}



A few attributes worth noting in the review record are:
- __business\_id__ &mdash; _unique identifier for a business_
- __text__ &mdash; _the natural language text the user wrote_

The _text_ attribute will be our focus!

_json_ is a very handy format for data interchange, but it is not very convenient for modeling work. Let's apply some preprocessing to transform the data into a more usable format. The next code block will:
1. Read each business record and convert it to a Python `dict`
2. Discard the businesses that are not restaurants
3. Create a `frozenset` of the business IDs for restaurants, which will be used in the next step

In [9]:
import json

restaurant_ids = set()
max_r = 100
i = 0
# open the businesses file
with open('dataset/business.json', encoding='utf_8') as f:
    
    # iterate through each line (json record) in the file
    for business_json in f:
        
        # convert the json record to a Python dict
        business = json.loads(business_json)
        
        # if this business is not a restaurant, skip to the next one
        if 'Restaurants' not in business['categories']:
            continue
            
        # add the restaurant business id to our restaurant_ids set
        restaurant_ids.add(business['business_id'])
        
        i = i + 1
        
        if (i >= max_r):
            break

# turn restaurant_ids into a frozenset, as we don't need to change it anymore
restaurant_ids = frozenset(restaurant_ids)

# print the number of unique restaurant ids in the dataset
print ('{:,}'.format(len(restaurant_ids)), u'restaurants in the dataset.')

100 restaurants in the dataset.


Next, we will create a file that contains only the reviews about restaurants, with one review per line in the file.
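Because each review has to fit on exactly one line, line breaks inside a review need to be escaped before writing and restored after reading. A minimal sketch of that round trip follows; the helper names `escape_newlines` and `unescape_newlines` are hypothetical, and the sample text is made up.

```python
def escape_newlines(text):
    """Replace real line breaks with the two-character sequence backslash + n."""
    return text.replace('\n', '\\n')


def unescape_newlines(line):
    """Drop the trailing line terminator, then restore the escaped line breaks."""
    return line.rstrip('\n').replace('\\n', '\n')


review_text = "Love the staff.\n\nGet a half sour pickle."

# One review becomes exactly one line in the output file.
stored_line = escape_newlines(review_text) + '\n'
restored = unescape_newlines(stored_line)

print(repr(stored_line))
print(restored == review_text)
```

One caveat of this simple scheme: a review that already contains a literal backslash followed by `n` would be turned into a line break when read back.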

In [10]:
%%time
preprocess = True
# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if preprocess:
    
    review_count = 0

    # create & open a new file in write mode
    with open('review.txt', 'w', encoding='utf_8') as review_txt_file:

        # open the existing review json file
        with open('dataset/review.json', encoding='utf_8') as review_json_file:

            # loop through all reviews in the existing file and convert to dict
            for review_json in review_json_file:
                review = json.loads(review_json)

                # if this review is not about a restaurant, skip to the next one
                if review['business_id'] not in restaurant_ids:
                    continue

                # write the restaurant review as a line in the new file
                # escape newline characters in the original review text
                review_txt_file.write(review[u'text'].replace('\n', '\\n') + '\n')
                review_count += 1

    print (u'''Text from {:,} restaurant reviews
              written to the new txt file.'''.format(review_count))
    
else:
    
    with open('review.txt', encoding='utf_8') as review_txt_file:
        for review_count, line in enumerate(review_txt_file):
            pass
        
    print ('Text from {:,} restaurant reviews in the txt file.'.format(review_count + 1))

Text from 4,889 restaurant reviews
              written to the new txt file.
CPU times: user 57.3 s, sys: 3.99 s, total: 1min 1s
Wall time: 1min 2s


## spaCy &mdash; Industrial-Strength NLP in Python

![spaCy](https://s3.amazonaws.com/skipgram-images/spaCy.png)

[**spaCy**](https://spacy.io) is a library for natural language processing in Python. spaCy's goal is to take recent advances in natural language processing out of research and put them into practice.

spaCy supports many of the tasks involved in building a natural language preprocessing pipeline:
- Tokenization
- Text normalization, such as lowercasing and lemmatization
- Part-of-speech tagging
- Syntactic dependency parsing
- Sentence boundary detection
- Named entity recognition and annotation

In [8]:
import spacy
import pandas as pd
import itertools as it

nlp = spacy.load('en')

Let's grab a sample of reviews to play with.

In [14]:
with open('review.txt', encoding='utf_8') as f:
    sample_review = '\n'.join(list(it.islice(f, 4, 9)))
    sample_review = sample_review.replace('\\n', '\n')
        
print (sample_review)

Dr. Longwill and his staff are top notch. I have never felt more comfortable at a dentist's office than I do here. They are very friendly and are sure to explain everything thoroughly. I live about 45 minutes from the office and I don't mind the drive, they are worth it! Sarah at the front desk always brightens my day and that is saying something when one finds themselves at a dentist's office!

Fantastic dentist and supporting staff. I called Dr. Ghorshi's office and Joan set me up with an appointment. She was incredibly understanding of my schedule and found me a time that would work. I went in and met Dr. Ghorshi. He was wonderful. He tends to be more conservative with treatment and truly wants what is best. I couldn't recommend this office more!

I had a great experience with Steve yesterday. Having had multiple bad experiences at dentists i was greatly surprised at how incredible this place is.Steve is a very careful, thorough and caring person.Here they have thought of everything

Let's hand these reviews to spaCy...

In [15]:
%%time
parsed_review = nlp(sample_review)

CPU times: user 133 ms, sys: 35.2 ms, total: 168 ms
Wall time: 167 ms


In [16]:
print (parsed_review)

Dr. Longwill and his staff are top notch. I have never felt more comfortable at a dentist's office than I do here. They are very friendly and are sure to explain everything thoroughly. I live about 45 minutes from the office and I don't mind the drive, they are worth it! Sarah at the front desk always brightens my day and that is saying something when one finds themselves at a dentist's office!

Fantastic dentist and supporting staff. I called Dr. Ghorshi's office and Joan set me up with an appointment. She was incredibly understanding of my schedule and found me a time that would work. I went in and met Dr. Ghorshi. He was wonderful. He tends to be more conservative with treatment and truly wants what is best. I couldn't recommend this office more!

I had a great experience with Steve yesterday. Having had multiple bad experiences at dentists i was greatly surprised at how incredible this place is.Steve is a very careful, thorough and caring person.Here they have thought of everything

Looks the same! What happened under the hood?

What about sentence detection and segmentation?

In [17]:
for num, sentence in enumerate(parsed_review.sents):
    print ('Sentence {}:'.format(num + 1))
    print (sentence)
    print ('')

Sentence 1:
Dr. Longwill and his staff are top notch.

Sentence 2:
I have never felt more comfortable at a dentist's office than I do here.

Sentence 3:
They are very friendly and are sure to explain everything thoroughly.

Sentence 4:
I live about 45 minutes from the office

Sentence 5:
and I don't mind the drive, they are worth it!

Sentence 6:
Sarah at the front desk always brightens my day and that is saying something when one finds themselves at a dentist's office!



Sentence 7:
Fantastic dentist and supporting staff.

Sentence 8:
I called Dr. Ghorshi's office and Joan set me up with an appointment.

Sentence 9:
She was incredibly understanding of my schedule and found me a time that would work.

Sentence 10:
I went in and met Dr. Ghorshi.

Sentence 11:
He was wonderful.

Sentence 12:
He tends to be more conservative with treatment and truly wants what is best.

Sentence 13:
I couldn't recommend this office more!



Sentence 14:
I had a great experience with Steve yesterday.

Sen

What about named entity recognition?

In [18]:
for num, entity in enumerate(parsed_review.ents):
    print ('Entity {}:'.format(num + 1), entity, '-', entity.label_)
    print ('')

Entity 1: Longwill - PERSON

Entity 2: about 45 minutes - TIME

Entity 3: Sarah - PERSON

Entity 4: my day - DATE

Entity 5: Ghorshi - PERSON

Entity 6: Joan - PERSON

Entity 7: Ghorshi - PERSON

Entity 8: Steve - PERSON

Entity 9: yesterday - DATE

Entity 10: Steve - PERSON

Entity 11: 
 - GPE

Entity 12: This morning - TIME

Entity 13: the day yesterday - DATE

Entity 14: 
 - GPE

Entity 15: Nelson - PERSON

Entity 16: Wendy - PERSON

Entity 17: Phoenix - GPE

Entity 18:   - NORP

Entity 19: 
 - GPE



What about part-of-speech tagging?

In [19]:
token_text = [token.orth_ for token in parsed_review]
token_pos = [token.pos_ for token in parsed_review]

pd.DataFrame(list(zip(token_text, token_pos)),
             columns=['token_text', 'part_of_speech'])

| | token_text | part_of_speech |
|---|---|---|
| 0 | Dr. | PROPN |
| 1 | Longwill | PROPN |
| 2 | and | CCONJ |
| 3 | his | ADJ |
| 4 | staff | NOUN |
| 5 | are | VERB |
| 6 | top | ADJ |
| 7 | notch | NOUN |
| 8 | . | PUNCT |
| 9 | I | PRON |


What about text normalization, like lemmatization and shape analysis?

In [43]:
token_lemma = [token.lemma_ for token in parsed_review]
token_shape = [token.shape_ for token in parsed_review]

pd.DataFrame(list(zip(token_text, token_lemma, token_shape)),
             columns=['token_text', 'token_lemma', 'token_shape'])

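| | token_text | token_lemma | token_shape |
|---|---|---|---|
| 0 | Love | love | Xxxx |
| 1 | coming | come | xxxx |
| 2 | here | here | xxxx |
| 3 | . | . | . |
| 4 | Yes | yes | Xxx |
| 5 | the | the | xxx |
| 6 | place | place | xxxx |
| 7 | always | always | xxxx |
| 8 | needs | need | xxxx |
| 9 | the | the | xxx |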
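The `token_shape` values above look like the token's characters mapped to character classes, with long runs cut short. The sketch below is an approximation written to reproduce the rows shown here, not spaCy's own implementation; `approximate_shape` and the `max_run` cutoff of 4 are assumptions made for illustration.

```python
def approximate_shape(text, max_run=4):
    """Map characters to classes (X, x, d) and truncate runs of the same class."""
    shape = []
    last_class = None
    run_length = 0
    for char in text:
        if char.isalpha():
            char_class = 'X' if char.isupper() else 'x'
        elif char.isdigit():
            char_class = 'd'
        else:
            # Punctuation and other symbols are kept as they are.
            char_class = char
        if char_class == last_class:
            run_length += 1
        else:
            last_class = char_class
            run_length = 1
        # Keep at most `max_run` consecutive characters of the same class.
        if run_length <= max_run:
            shape.append(char_class)
    return ''.join(shape)


for word in ['Love', 'coming', 'Yes', '.']:
    print(word, approximate_shape(word))
```

Shapes are useful as a coarse feature: "coming", "place" and "always" all collapse to the same `xxxx` pattern regardless of their length.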


What about token-level entity analysis?

In [21]:
token_entity_type = [token.ent_type_ for token in parsed_review]
token_entity_iob = [token.ent_iob_ for token in parsed_review]

df = pd.DataFrame(list(zip(token_text, token_entity_type, token_entity_iob)),
             columns=['token_text', 'entity_type', 'inside_outside_begin'])
df

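| | token_text | entity_type | inside_outside_begin |
|---|---|---|---|
| 0 | Dr. | | O |
| 1 | Longwill | PERSON | B |
| 2 | and | | O |
| 3 | his | | O |
| 4 | staff | | O |
| 5 | are | | O |
| 6 | top | | O |
| 7 | notch | | O |
| 8 | . | | O |
| 9 | I | | O |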


In [24]:
df.groupby('inside_outside_begin').count()

| inside_outside_begin | token_text | entity_type |
|---|---|---|
| B | 19 | 19 |
| I | 6 | 6 |
| O | 386 | 386 |


In [25]:
df[ df['inside_outside_begin'] == 'I' ]

| | token_text | entity_type | inside_outside_begin |
|---|---|---|---|
| 40 | 45 | TIME | I |
| 41 | minutes | TIME | I |
| 66 | day | DATE | I |
| 222 | morning | TIME | I |
| 238 | day | DATE | I |
| 239 | yesterday | DATE | I |
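The B/I/O tags mark where an entity begins (`B`), continues (`I`), or is absent (`O`), so multi-token entities can be rebuilt from per-token tags. A minimal sketch of that decoding follows; `iob_to_spans` is a hypothetical helper, and the token and tag lists are written out by hand to match the "about 45 minutes" TIME entity from the output above.

```python
def iob_to_spans(tokens, iob_tags, entity_types):
    """Group tokens into (entity text, entity type) pairs using B/I/O tags."""
    spans = []
    current_tokens, current_type = [], None
    for token, iob, ent_type in zip(tokens, iob_tags, entity_types):
        if iob == 'B':
            # A new entity starts; close the previous one if it is still open.
            if current_tokens:
                spans.append((' '.join(current_tokens), current_type))
            current_tokens, current_type = [token], ent_type
        elif iob == 'I' and current_tokens:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # 'O' (or a stray 'I'): close any open entity.
            if current_tokens:
                spans.append((' '.join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((' '.join(current_tokens), current_type))
    return spans


tokens = ['I', 'live', 'about', '45', 'minutes', 'from', 'the', 'office']
iob_tags = ['O', 'O', 'B', 'I', 'I', 'O', 'O', 'O']
entity_types = ['', '', 'TIME', 'TIME', 'TIME', '', '', '']

spans = iob_to_spans(tokens, iob_tags, entity_types)
print(spans)
```

This also explains the counts in the grouped table: every entity contributes exactly one `B` token, while `I` tokens appear only for entities longer than one token.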


What about the variety of other token-level attributes, such as whether the token:
- is a stopword
- is punctuation
- is whitespace
- represents a number
- is part of spaCy's default vocabulary?

In [26]:
token_attributes = [(token.orth_,
                     token.prob,
                     token.is_stop,
                     token.is_punct,
                     token.is_space,
                     token.like_num,
                     token.is_oov)
                    for token in parsed_review]

df = pd.DataFrame(token_attributes,
                  columns=['text',
                           'log_probability',
                           'stop?',
                           'punctuation?',
                           'whitespace?',
                           'number?',
                           'out of vocab.?'])

df.loc[:, 'stop?':'out of vocab.?'] = (df.loc[:, 'stop?':'out of vocab.?']
                                       .applymap(lambda x: u'Yes' if x else u''))
                                               
df

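| | text | log_probability | stop? | punctuation? | whitespace? | number? | out of vocab.? |
|---|---|---|---|---|---|---|---|
| 0 | Dr. | -20.0 | | | | | Yes |
| 1 | Longwill | -20.0 | | | | | Yes |
| 2 | and | -20.0 | Yes | | | | Yes |
| 3 | his | -20.0 | Yes | | | | Yes |
| 4 | staff | -20.0 | | | | | Yes |
| 5 | are | -20.0 | Yes | | | | Yes |
| 6 | top | -20.0 | Yes | | | | Yes |
| 7 | notch | -20.0 | | | | | Yes |
| 8 | . | -20.0 | | Yes | | | Yes |
| 9 | I | -20.0 | | | | | Yes |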


If the text you'd like to process is general-purpose text (i.e., not domain-specific, like medical literature), spaCy is ready to use out of the box.

## Phrase Modeling

_Phrase modeling_ is an approach to learning combinations of words in order to join them into a single token. The formula used for phrase modeling is:

$$\frac{count(A\ B) - count_{min}}{count(A) * count(B)} * N > threshold$$

...where:
* $count(A)$ is the number of times token $A$ appears in the corpus
* $count(B)$ is the number of times token $B$ appears in the corpus
* $count(A\ B)$ is the number of times the tokens $A\ B$ appear in the corpus in that order
* $N$ is the vocabulary size
* $count_{min}$ is a user-defined parameter to ensure that accepted phrases occur a minimum number of times
* $threshold$ is a user-defined parameter to control how strong a relationship between two tokens the model requires before accepting them as a phrase
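To make the formula concrete, here is a small worked sketch in plain Python. The function name `phrase_score` and all of the counts are made up for illustration; only the arithmetic follows the formula above.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Score a candidate phrase 'A B' using the formula above."""
    return (count_ab - min_count) / (count_a * count_b) * vocab_size


threshold = 10.0  # hypothetical user-defined threshold

# Two fairly rare tokens that often appear together: a strong candidate.
strong = phrase_score(count_a=200, count_b=150, count_ab=60,
                      vocab_size=10_000, min_count=5)

# Two very common tokens that co-occur just as often: a weak candidate.
weak = phrase_score(count_a=5_000, count_b=4_000, count_ab=60,
                    vocab_size=10_000, min_count=5)

print(round(strong, 3), strong > threshold)
print(round(weak, 4), weak > threshold)
```

Both pairs co-occur 60 times, but only the first clears the threshold, because dividing by $count(A) * count(B)$ penalizes pairs made of tokens that are frequent on their own.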

The [**gensim**](https://radimrehurek.com/gensim/index.html) library will be used for phrase modeling &mdash; the [**Phrases**](https://radimrehurek.com/gensim/models/phrases.html) class in particular.

In [13]:
from gensim.models import Phrases
from gensim.models.word2vec import LineSentence

As we perform phrase modeling, we will be doing some iterative data transformation at the same time. Our roadmap for data preparation includes:

1. Segment the text of complete reviews into sentences and normalize the text
1. First-order phrase modeling $\rightarrow$ _apply the first-order phrase model to transform sentences_
1. Second-order phrase modeling $\rightarrow$ _apply the second-order phrase model to transform sentences_
1. Apply text normalization and the second-order phrase model to the text of complete reviews
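The reason for running two passes can be sketched without any library: a pass only joins adjacent pairs, so a second pass over already-joined tokens is what produces three-word phrases. In the sketch below, `apply_phrases` is a hypothetical helper and the phrase sets are hand-written stand-ins for what a trained model might learn; this is not how gensim implements it.

```python
def apply_phrases(tokens, phrases):
    """Greedily join adjacent token pairs found in `phrases` with an underscore."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            merged.append(tokens[i] + '_' + tokens[i + 1])
            i += 2  # both tokens were consumed by the phrase
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical phrase sets standing in for trained first- and second-order models.
first_order = {('grocery', 'store'), ('do', 'not')}
second_order = {('grocery_store', 'cake')}

sentence = 'do not buy the grocery store cake'.split()
bigram_tokens = apply_phrases(sentence, first_order)
trigram_tokens = apply_phrases(bigram_tokens, second_order)

print(bigram_tokens)
print(trigram_tokens)
```

The second pass sees `grocery_store` as a single token, which is what lets it pair with `cake`.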

First, let's define a few helper functions that we will use for text normalization. In particular, the `lemmatized_sentence_corpus` generator function will use spaCy to:
- Iterate over the reviews in the `review.txt` file we created earlier
- Segment the reviews into individual sentences
- Remove punctuation and excess whitespace
- Lemmatize the text

In [30]:
def punct_space(token):
    """
    helper function to eliminate tokens
    that are pure punctuation or whitespace
    """
    
    return token.is_punct or token.is_space

def line_review(filename):
    """
    generator function to read in reviews from the file
    and un-escape the original line breaks in the text
    """
    
    with open(filename, encoding='utf_8') as f:
        for review in f:
            yield review.replace('\\n', '\n')
            
def lemmatized_sentence_corpus(filename):
    """
    generator function to use spaCy to parse reviews,
    lemmatize the text, and yield sentences
    """
    
    for parsed_review in nlp.pipe(line_review(filename),
                                  batch_size=10000, n_threads=4):
        
        for sent in parsed_review.sents:
            yield ' '.join([token.lemma_ for token in sent
                             if not punct_space(token)])

Let's use the `lemmatized_sentence_corpus` generator to loop over the original review text, segmenting the reviews into individual sentences and normalizing the text. We will write this data back out to a new file (`unigram_sentences.txt`), with one normalized sentence per line. We will use this data to learn our phrase models.
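Storing one sentence per line means the corpus can be streamed from disk as many times as needed instead of being held in memory. The sketch below illustrates that idea with a small restartable reader; `LineTokens` is a hypothetical stand-in written for this example (not gensim's `LineSentence`), and the two sentences are made up.

```python
import os
import tempfile


class LineTokens:
    """Restartable iterable: each pass re-opens the file and yields one token list per line."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding='utf_8') as f:
            for line in f:
                # Tokens were joined with single spaces when the file was written.
                yield line.split()


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sentences.txt')
    with open(path, 'w', encoding='utf_8') as f:
        f.write('the sauce be good\n')
        f.write('this place be great\n')

    sentences = LineTokens(path)
    first_pass = list(sentences)
    second_pass = list(sentences)  # a plain generator would be empty here

print(first_pass)
print(first_pass == second_pass)
```

A plain generator would be exhausted after the first loop, which matters whenever the same sentences are read once to train a model and again to transform them.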

In [11]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if 1 == 1:

    with open('unigram_sentences.txt', 'w', encoding='utf_8') as f:
        for sentence in lemmatized_sentence_corpus('review.txt'):
            f.write(sentence + '\n')

CPU times: user 3min 2s, sys: 48.2 s, total: 3min 50s
Wall time: 2min 47s


In [14]:
unigram_sentences = LineSentence('unigram_sentences.txt')

In [15]:
for unigram_sentence in it.islice(unigram_sentences, 230, 240):
    print (' '.join(unigram_sentence))
    print ('')

definitely not a place to go if -PRON- can not take -PRON- time

hand down this be las vegas good mom and pop style italian cuisine off the strip

-PRON- usually do not order a meatball anywhere else because the meatball marinara from this place be the gold standard that most other establishment can not mimic

there be just so much flavor in this and in every other item -PRON- have order here

-PRON- have to be hungry if -PRON- order the chicken parmesan because -PRON- be a large piece of batter chicken

the garden salad with -PRON- oil and vinegar dressing have many ingredient

the spaghetti and meatball be always good as the sauce be soak well into the noodle

this place have great discount on drink and appetizer most likely to drive business

$ 3.50 well and beer such as stella artois be a win

-PRON- friend order nachos but -PRON- do not look like -PRON- like -PRON- much



In [17]:
%%time
preprocess2 = True
# this is a bit time consuming - make the if statement True
# if you want to execute modeling yourself.
if preprocess2:

    bigram_model = Phrases(unigram_sentences)

    bigram_model.save('bigram_model.txt')
    
# load the finished model from disk
bigram_model = Phrases.load('bigram_model.txt')

CPU times: user 1.74 s, sys: 42.3 ms, total: 1.78 s
Wall time: 1.81 s


In [18]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if 1 == 1:

    with open('bigram_sentences.txt', 'w', encoding='utf_8') as f:
        
        for unigram_sentence in unigram_sentences:
            
            bigram_sentence = ' '.join(bigram_model[unigram_sentence])
            
            f.write(bigram_sentence + '\n')



CPU times: user 3.82 s, sys: 18.9 ms, total: 3.84 s
Wall time: 3.87 s


In [19]:
bigram_sentences = LineSentence('bigram_sentences.txt')

In [20]:
for bigram_sentence in it.islice(bigram_sentences, 230, 240):
    print (' '.join(bigram_sentence))
    print ('')

definitely not a place to go if -PRON- can not take -PRON- time

hand_down this be las_vegas good mom and pop style italian cuisine off the strip

-PRON- usually do_not order a meatball anywhere_else because the meatball marinara from this_place be the gold standard that most other establishment can not mimic

there be just so much flavor in this and in every other_item -PRON- have order here

-PRON- have to be hungry if -PRON- order the chicken_parmesan because -PRON- be a large piece_of batter chicken

the garden salad with -PRON- oil and vinegar dressing have many ingredient

the spaghetti and meatball be always good as the sauce be soak well into the noodle

this_place have great discount on drink and appetizer most_likely to drive business

$ 3.50 well and beer such_as stella artois be a win

-PRON- friend order nachos but -PRON- do_not look_like -PRON- like -PRON- much



In [22]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute modeling yourself.
if 1 == 1:

    trigram_model = Phrases(bigram_sentences)

    trigram_model.save('trigram_model.txt')
    
# load the finished model from disk
trigram_model = Phrases.load('trigram_model.txt')

CPU times: user 1.72 s, sys: 35.2 ms, total: 1.76 s
Wall time: 1.78 s


In [23]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if 1 == 1:

    with open("trigram_sentences.txt", 'w', encoding='utf_8') as f:
        
        for bigram_sentence in bigram_sentences:
            
            trigram_sentence = ' '.join(trigram_model[bigram_sentence])
            
            f.write(trigram_sentence + '\n')



CPU times: user 3.51 s, sys: 13.2 ms, total: 3.52 s
Wall time: 3.54 s


In [24]:
trigram_sentences = LineSentence('trigram_sentences.txt')

In [25]:
for trigram_sentence in it.islice(trigram_sentences, 230, 240):
    print (' '.join(trigram_sentence))
    print ('')

definitely not a place to go if -PRON- can not take -PRON- time

hand_down this be las_vegas good mom and pop style italian cuisine off the strip

-PRON- usually do_not order a meatball anywhere_else because the meatball marinara from this_place be the gold standard that most other establishment can not mimic

there be just so_much flavor in this and in every other_item -PRON- have order here

-PRON- have to be hungry if -PRON- order the chicken_parmesan because -PRON- be a large piece_of batter chicken

the garden salad with -PRON- oil and vinegar dressing have many ingredient

the spaghetti and meatball be always good as the sauce be soak well into the noodle

this_place have great discount on drink and appetizer most_likely to drive business

$ 3.50 well and beer such_as stella artois be a win

-PRON- friend order nachos but -PRON- do_not look_like -PRON- like -PRON- much



In [31]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if 1 == 1:

    with open('trigram_reviews.txt', 'w', encoding='utf_8') as f:
        
        for parsed_review in nlp.pipe(line_review('review.txt'),
                                      batch_size=10000, n_threads=4):
            
            # lemmatize the text, removing punctuation and whitespace
            unigram_review = [token.lemma_ for token in parsed_review
                              if not punct_space(token)]
            
            # apply the first-order and second-order phrase models
            bigram_review = bigram_model[unigram_review]
            trigram_review = trigram_model[bigram_review]
            
            # remove any remaining stopwords
            trigram_review = [term for term in trigram_review
                              if term not in spacy.lang.en.STOP_WORDS]
            
            # write the transformed review as a line in the new file
            trigram_review = u' '.join(trigram_review)
            f.write(trigram_review + '\n')



CPU times: user 3min 17s, sys: 1min 6s, total: 4min 24s
Wall time: 3min 20s


Let's preview the results. We'll grab one review from the file with the original, untransformed text, grab the same review from the file with the normalized and transformed text, and compare the two.

In [32]:
print ('Original:' + '\n')

for review in it.islice(line_review('review.txt'), 11, 12):
    print (review)

print ('----' + '\n')
print ('Transformed:' + '\n')

with open('trigram_reviews.txt', encoding='utf_8') as f:
    for review in it.islice(f, 11, 12):
        print (review)

Original:

Their signature fruit meringue cakes are good but save yourself the disappointment and don't get the tiramisu. Tiramisu is always my bday cake so I know a good one when I have it. I decided to try this place this year and I have to say even T&T's grocery store cakes are better than it.

----

Transformed:

-PRON- signature fruit meringue_cake good save -PRON- disappointment do_not tiramisu tiramisu -PRON- bday cake -PRON- know good -PRON- -PRON- -PRON- decide try this_place year -PRON- t&t 's grocery_store cake good -PRON-
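One detail visible in the transformed text is that tokens such as `do_not` and `this_place` survive even though their parts are stop words. That follows from the order of operations: phrases are joined first and stop words are removed afterwards. A minimal sketch, using a small made-up stop word set rather than spaCy's full list:

```python
# Hypothetical subset of stop words, for illustration only.
stop_words = {'the', 'at', 'do', 'not', 'this', 'and', 'a'}

# Tokens as they might look after the phrase models were applied.
terms = 'do_not order the tiramisu at this_place'.split()

# Joined phrases are single tokens, so they no longer match any stop word.
filtered = [term for term in terms if term not in stop_words]

# Filtering before joining would have removed 'do', 'not' and 'this' instead.
filtered_too_early = [term for term in 'do not order the tiramisu at this place'.split()
                      if term not in stop_words]

print(filtered)
print(filtered_too_early)
```

Removing stop words before phrase modeling would lose the negation in `do_not` entirely, which is why the filter comes last.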

