# Modern NLP in Python
### _- Or -_
## What we can learn about food by analyzing 1 million Yelp reviews

#### Before we begin...
- The examples in this notebook were taken from this [notebook](http://nbviewer.jupyter.org/github/skipgram/modern-nlp-in-python/blob/master/executable/Modern_NLP_in_Python.ipynb#)

## The Yelp Dataset
[**The Yelp Dataset**](https://www.yelp.com/dataset_challenge/) is a dataset published by the business review service [Yelp](http://yelp.com) for academic research and educational purposes.

**Note:** To run this notebook on your own machine, you will need to download your own copy of the Yelp dataset. The dataset can be downloaded by following the steps below:
1. Go to the Yelp dataset page [here](https://www.yelp.com/dataset_challenge/)
1. Click "Get the Data"
1. Read and agree to Yelp's terms of use.

The current dataset consists of:
- __552K__ users
- __77K__ businesses
- __2.2M__ user reviews

Restricting attention to restaurants, there are approximately __55K__ restaurants with approximately __3.2M__ user reviews written about them.

The data is provided in several files in _.json_ format. We will use the following files for our demonstration:
- __yelp\_academic\_dataset\_business.json__ &mdash; _the records for individual businesses_
- __yelp\_academic\_dataset\_review.json__ &mdash; _the records for reviews users wrote about businesses_

The files are text files (UTF-8) with one _json object_ per line, each one corresponding to an individual data record. Let's look at some examples.

In [46]:
with open('dataset/business.json', encoding='utf_8') as f:
    first_business_record = f.readline() 

print (first_business_record)

{"business_id": "FYWN1wneV18bWNgQjJ2GNg", "name": "Dental by Design", "neighborhood": "", "address": "4855 E Warner Rd, Ste B9", "city": "Ahwatukee", "state": "AZ", "postal_code": "85044", "latitude": 33.3306902, "longitude": -111.9785992, "stars": 4.0, "review_count": 22, "is_open": 1, "attributes": {"AcceptsInsurance": true, "ByAppointmentOnly": true, "BusinessAcceptsCreditCards": true}, "categories": ["Dentists", "General Dentistry", "Health & Medical", "Oral Surgeons", "Cosmetic Dentists", "Orthodontists"], "hours": {"Friday": "7:30-17:00", "Tuesday": "7:30-17:00", "Thursday": "7:30-17:00", "Wednesday": "7:30-17:00", "Monday": "7:30-17:00"}}



The business records consist of _key: value_ pairs containing information about the business. A few attributes we will be interested in for this demonstration include:
- __business\_id__ &mdash; _unique identifier for a business_
- __categories__ &mdash; _an array containing the categories the business belongs to_

The _categories_ attribute is of special interest. This demonstration focuses on restaurants, which are indicated by the presence of the _Restaurants_ tag in the _categories_ array. In addition, the _categories_ array may contain more detailed information about restaurants, such as the type of food they serve.
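
As a minimal sketch of the check described above, the snippet below parses JSON records and tests for the _Restaurants_ tag. The two sample records are made up for illustration and are far smaller than real Yelp records. `is_restaurant` is a hypothetical helper name, and the guard against a missing or null `categories` field is an assumption about the data, not something the dataset description states.

```python
import json

# Two made-up records in the same one-object-per-line layout (illustrative only)
sample_lines = [
    '{"business_id": "a1", "categories": ["Restaurants", "Pizza"]}',
    '{"business_id": "b2", "categories": ["Dentists", "Health & Medical"]}',
]

def is_restaurant(business):
    """Return True if the record carries the 'Restaurants' category tag."""
    # Assumption: categories could be absent or null, so fall back to an empty list
    return 'Restaurants' in (business.get('categories') or [])

for line in sample_lines:
    business = json.loads(line)
    print(business['business_id'], is_restaurant(business))
```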

The review records are structured similarly: _key: value_ pairs containing information about who wrote the review, the review itself, and which business the review refers to.

In [6]:
with open('dataset/review.json', encoding='utf_8') as f:
    first_review_record = f.readline()
    
print (first_review_record)

{"review_id":"v0i_UHJMo_hPBq9bxWvW4w","user_id":"bv2nCi5Qv5vroFiqKGopiw","business_id":"0W4lkclzZThpx3V65bVgig","stars":5,"date":"2016-05-28","text":"Love the staff, love the meat, love the place. Prepare for a long line around lunch or dinner hours. \n\nThey ask you how you want you meat, lean or something maybe, I can't remember. Just say you don't want it too fatty. \n\nGet a half sour pickle and a hot pepper. Hand cut french fries too.","useful":0,"funny":0,"cool":0}



A few attributes to note in the review record:
- __business\_id__ &mdash; _unique identifier for a business_
- __text__ &mdash; _the natural language text the user wrote_

The _text_ attribute will be our focus!

_json_ is a very useful format for data exchange, but it is not commonly used for modeling work. Let's do some preprocessing to turn the data into a more usable format. The next code block will:
1. Read each business record and convert it to a Python `dict`
2. Discard the businesses that are not restaurants
3. Create a `frozenset` of the business IDs for restaurants, which will be used in the next step

In [7]:
import json

restaurant_ids = set()

# open the businesses file
with open('dataset/business.json', encoding='utf_8') as f:
    
    # iterate through each line (json record) in the file
    for business_json in f:
        
        # convert the json record to a Python dict
        business = json.loads(business_json)
        
        # if this business is not a restaurant, skip to the next one
        if u'Restaurants' not in business[u'categories']:
            continue
            
        # add the restaurant business id to our restaurant_ids set
        restaurant_ids.add(business[u'business_id'])

# turn restaurant_ids into a frozenset, as we don't need to change it anymore
restaurant_ids = frozenset(restaurant_ids)

# print the number of unique restaurant ids in the dataset
print ('{:,}'.format(len(restaurant_ids)), u'restaurantes no dataset.')

54,618 restaurantes no dataset.


Next, we will create a file that contains only the reviews about restaurants, with one review per line in the file.

In [52]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if 0 == 1:
    
    review_count = 0

    # create & open a new file in write mode
    with open('review.txt', 'w', encoding='utf_8') as review_txt_file:

        # open the existing review json file
        with open('dataset/review.json', encoding='utf_8') as review_json_file:

            # loop through all reviews in the existing file and convert to dict
            for review_json in review_json_file:
                review = json.loads(review_json)

                # if this review is not about a restaurant, skip to the next one
                if review[u'business_id'] not in restaurant_ids:
                    continue

                # write the restaurant review as a line in the new file
                # escape newline characters in the original review text
                review_txt_file.write(review[u'text'].replace('\n', '\\n') + '\n')
                review_count += 1

    print (u'''Text from {:,} restaurant reviews
              written to the new txt file.'''.format(review_count))
    
else:
    
    with open('review.txt', encoding='utf_8') as review_txt_file:
        for review_count, line in enumerate(review_txt_file):
            pass
        
    print ('Text from {:,} restaurant reviews in the txt file.'.format(review_count + 1))

Text from 3,223,214 restaurant reviews in the txt file.
CPU times: user 6.38 s, sys: 2.18 s, total: 8.56 s
Wall time: 9.76 s
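
The newline escaping used when writing `review.txt` keeps each review on a single line, and it can be reversed later with the opposite replacement. Here is a minimal sketch of that round trip, using a made-up review string:

```python
# A made-up multi-line review (illustrative only)
original = "Great fries.\n\nLong line at lunch."

# Escape: turn each real newline into the two characters backslash + n
escaped = original.replace('\n', '\\n')

# Un-escape: turn the two-character sequence back into a real newline
restored = escaped.replace('\\n', '\n')

print(escaped)
print(restored == original)
```

One limitation of this simple scheme: a review that already contains a literal backslash followed by `n` would come back with a real newline in its place.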


## spaCy &mdash; Industrial-Strength NLP in Python

![spaCy](https://s3.amazonaws.com/skipgram-images/spaCy.png)

[**spaCy**](https://spacy.io) is a natural language processing library for Python. spaCy's goal is to take recent advances in natural language processing out of research and put them into practice.

spaCy supports many tasks commonly associated with building a natural language preprocessing pipeline:
- Tokenization
- Text normalization, such as lowercasing and stemming/lemmatization
- Part-of-speech tagging
- Syntactic dependency parsing
- Sentence boundary detection
- Named entity recognition and annotation

In [9]:
import spacy
import pandas as pd
import itertools as it

nlp = spacy.load('en')

Let's grab a sample of reviews to play with.

In [37]:
with open('review.txt', encoding='utf_8') as f:
    sample_review = '\n'.join(list(it.islice(f, 4, 9)))
    sample_review = sample_review.replace('\\n', '\n')
        
print (sample_review)

Love coming here. Yes the place always needs the floor swept but when you give out  peanuts in the shell how won't it always be a bit dirty. 

The food speaks for itself, so good. Burgers are made to order and the meat is put on the grill when you order your sandwich. Getting the small burger just means 1 patty, the regular is a 2 patty burger which is twice the deliciousness. 

Getting the Cajun fries adds a bit of spice to them and whatever size you order they always throw more fries (a lot more fries) into the bag.

Had their chocolate almond croissant and it was amazing! So light and buttery and oh my how chocolaty.

If you're looking for a light breakfast then head out here. Perfect spot for a coffee/latté before heading out to the old port

Who would have guess that you would be able to get fairly decent Vietnamese restaurant in East York? 

Not quite the same as Chinatown in terms of pricing (slightly higher) but definitely one of the better Vietnamese restaurants outside of the

Let's hand these reviews to spaCy...

In [38]:
%%time
parsed_review = nlp(sample_review)

CPU times: user 146 ms, sys: 41.2 ms, total: 187 ms
Wall time: 154 ms


In [39]:
print (parsed_review)

Love coming here. Yes the place always needs the floor swept but when you give out  peanuts in the shell how won't it always be a bit dirty. 

The food speaks for itself, so good. Burgers are made to order and the meat is put on the grill when you order your sandwich. Getting the small burger just means 1 patty, the regular is a 2 patty burger which is twice the deliciousness. 

Getting the Cajun fries adds a bit of spice to them and whatever size you order they always throw more fries (a lot more fries) into the bag.

Had their chocolate almond croissant and it was amazing! So light and buttery and oh my how chocolaty.

If you're looking for a light breakfast then head out here. Perfect spot for a coffee/latté before heading out to the old port

Who would have guess that you would be able to get fairly decent Vietnamese restaurant in East York? 

Not quite the same as Chinatown in terms of pricing (slightly higher) but definitely one of the better Vietnamese restaurants outside of the

Looks the same! What happened under the hood?

What about sentence detection and segmentation?

In [40]:
for num, sentence in enumerate(parsed_review.sents):
    print ('Sentence {}:'.format(num + 1))
    print (sentence)
    print ('')

Sentence 1:
Love coming here.

Sentence 2:
Yes the place always needs the floor swept but when you give out  peanuts in the shell how won't it always be a bit dirty. 



Sentence 3:
The food speaks for itself, so good.

Sentence 4:
Burgers are made to order and the meat is put on the grill when you order your sandwich.

Sentence 5:
Getting the small burger just means 1 patty, the regular is a 2 patty burger which is twice the deliciousness. 



Sentence 6:
Getting the Cajun fries adds a bit of spice to them and whatever size you order they always throw more fries (a lot more fries) into the bag.



Sentence 7:
Had their chocolate almond croissant and it was amazing!

Sentence 8:
So light and buttery

Sentence 9:
and oh my how chocolaty.



Sentence 10:
If you're looking for a light breakfast then head out here.

Sentence 11:
Perfect spot for a coffee/latté before heading out to the old port

Who would have guess that you would be able to get fairly decent Vietnamese restaurant in East 

What about named entity recognition?

In [41]:
for num, entity in enumerate(parsed_review.ents):
    print ('Entity {}:'.format(num + 1), entity, '-', entity.label_)
    print ('')

Entity 1: 1 - CARDINAL

Entity 2: 2 patty - QUANTITY

Entity 3: Cajun - ORG

Entity 4: Vietnamese - NORP

Entity 5: East York - GPE

Entity 6: one - CARDINAL

Entity 7: Vietnamese - NORP

Entity 8: the morning - TIME

Entity 9: 6 - CARDINAL

Entity 10: 9 - CARDINAL

Entity 11: 
 - GPE



What about part-of-speech tagging?

In [42]:
token_text = [token.orth_ for token in parsed_review]
token_pos = [token.pos_ for token in parsed_review]

pd.DataFrame(list(zip(token_text, token_pos)),
             columns=['token_text', 'part_of_speech'])

| | token_text | part_of_speech |
|---|---|---|
| 0 | Love | NOUN |
| 1 | coming | VERB |
| 2 | here | ADV |
| 3 | . | PUNCT |
| 4 | Yes | INTJ |
| 5 | the | DET |
| 6 | place | NOUN |
| 7 | always | ADV |
| 8 | needs | VERB |
| 9 | the | DET |


What about text normalization, such as lemmatization and shape analysis?

In [43]:
token_lemma = [token.lemma_ for token in parsed_review]
token_shape = [token.shape_ for token in parsed_review]

pd.DataFrame(list(zip(token_text, token_lemma, token_shape)),
             columns=['token_text', 'token_lemma', 'token_shape'])

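| | token_text | token_lemma | token_shape |
|---|---|---|---|
| 0 | Love | love | Xxxx |
| 1 | coming | come | xxxx |
| 2 | here | here | xxxx |
| 3 | . | . | . |
| 4 | Yes | yes | Xxx |
| 5 | the | the | xxx |
| 6 | place | place | xxxx |
| 7 | always | always | xxxx |
| 8 | needs | need | xxxx |
| 9 | the | the | xxx |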
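
The `token_shape` column abstracts each token's characters into a pattern. The function below is a rough pure-Python approximation of that idea, written to reproduce the shapes in the table above. `approx_shape` is a hypothetical name, and spaCy's own implementation may differ in its details.

```python
def approx_shape(text):
    """Map letters to X/x and digits to d, keeping at most 4 of the same symbol in a row."""
    shape = []
    last_symbol = None
    run_length = 0
    for ch in text:
        if ch.isalpha():
            symbol = 'X' if ch.isupper() else 'x'
        elif ch.isdigit():
            symbol = 'd'
        else:
            symbol = ch  # punctuation and other characters are kept as-is
        # Count how many times the same symbol has repeated consecutively
        run_length = run_length + 1 if symbol == last_symbol else 1
        last_symbol = symbol
        if run_length <= 4:
            shape.append(symbol)
    return ''.join(shape)

for word in ['Love', 'coming', 'Yes', '.']:
    print(word, approx_shape(word))
```

Truncating long runs is why both `coming` and `always` collapse to `xxxx`: the shape captures capitalization and character class, not exact length.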


What about token-level entity analysis?

In [44]:
token_entity_type = [token.ent_type_ for token in parsed_review]
token_entity_iob = [token.ent_iob_ for token in parsed_review]

pd.DataFrame(list(zip(token_text, token_entity_type, token_entity_iob)),
             columns=['token_text', 'entity_type', 'inside_outside_begin'])

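| | token_text | entity_type | inside_outside_begin |
|---|---|---|---|
| 0 | Love | | O |
| 1 | coming | | O |
| 2 | here | | O |
| 3 | . | | O |
| 4 | Yes | | O |
| 5 | the | | O |
| 6 | place | | O |
| 7 | always | | O |
| 8 | needs | | O |
| 9 | the | | O |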
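
In the inside/outside/begin scheme, `B` marks the first token of an entity, `I` marks a token that continues it, and `O` marks a token outside any entity. As a minimal sketch of how token-level tags map back to entity spans, the function below groups tagged tokens. `spans_from_iob` is a hypothetical helper, and the sample tags are hand-written to be consistent with the `East York - GPE` entity listed earlier, not actual spaCy output.

```python
def spans_from_iob(tokens):
    """Group (text, entity_type, iob) triples into (entity_text, entity_type) spans."""
    spans = []
    current_words, current_type = [], None
    for text, ent_type, iob in tokens:
        if iob == 'B':
            # A new entity starts; close any entity still open
            if current_words:
                spans.append((' '.join(current_words), current_type))
            current_words, current_type = [text], ent_type
        elif iob == 'I' and current_words:
            # Continue the entity that is currently open
            current_words.append(text)
        else:
            # Outside any entity; close the open one, if any
            if current_words:
                spans.append((' '.join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        spans.append((' '.join(current_words), current_type))
    return spans

# Hand-written tags (illustrative only)
tagged = [
    ('restaurant', '', 'O'),
    ('in', '', 'O'),
    ('East', 'GPE', 'B'),
    ('York', 'GPE', 'I'),
    ('?', '', 'O'),
]
print(spans_from_iob(tagged))
```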


What about a variety of other token-level attributes, such as whether a token:
- is a stopword
- is punctuation
- is whitespace
- represents a number
- is part of spaCy's default vocabulary

In [45]:
token_attributes = [(token.orth_,
                     token.prob,
                     token.is_stop,
                     token.is_punct,
                     token.is_space,
                     token.like_num,
                     token.is_oov)
                    for token in parsed_review]

df = pd.DataFrame(token_attributes,
                  columns=['text',
                           'log_probability',
                           'stop?',
                           'punctuation?',
                           'whitespace?',
                           'number?',
                           'out of vocab.?'])

df.loc[:, 'stop?':'out of vocab.?'] = (df.loc[:, 'stop?':'out of vocab.?']
                                       .applymap(lambda x: u'Yes' if x else u''))
                                               
df

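| | text | log_probability | stop? | punctuation? | whitespace? | number? | out of vocab.? |
|---|---|---|---|---|---|---|---|
| 0 | Love | -20.0 | | | | | Yes |
| 1 | coming | -20.0 | | | | | Yes |
| 2 | here | -20.0 | Yes | | | | Yes |
| 3 | . | -20.0 | | Yes | | | Yes |
| 4 | Yes | -20.0 | | | | | Yes |
| 5 | the | -20.0 | Yes | | | | Yes |
| 6 | place | -20.0 | | | | | Yes |
| 7 | always | -20.0 | Yes | | | | Yes |
| 8 | needs | -20.0 | | | | | Yes |
| 9 | the | -20.0 | Yes | | | | Yes |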


If the text you would like to process is general-purpose text (i.e., not domain-specific text such as medical literature), spaCy is ready to use out of the box.

## Phrase Modeling

_Phrase modeling_ is an approach to learning combinations of words so that they can be joined into a single token. The formula used for phrase modeling is:

$$\frac{count(A\ B) - count_{min}}{count(A) * count(B)} * N > threshold$$

...where:
* $count(A)$ is the number of times token $A$ appears in the corpus
* $count(B)$ is the number of times token $B$ appears in the corpus
* $count(A\ B)$ is the number of times the tokens $A\ B$ appear in the corpus in that order
* $N$ is the vocabulary size
* $count_{min}$ is a user-defined parameter to ensure that accepted phrases occur a minimum number of times
* $threshold$ is a user-defined parameter controlling how strong a relationship between two tokens the model requires before accepting them as a phrase
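
To make the formula concrete, here is a minimal sketch that evaluates it for one candidate pair. All counts, the vocabulary size, and the parameter values are made-up numbers chosen for illustration, and `phrase_score` is a hypothetical helper, not a gensim function.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Evaluate the left-hand side of the phrase-scoring formula."""
    return (count_ab - min_count) / (count_a * count_b) * vocab_size

# Made-up counts for a pair such as "ice cream" (illustrative only)
score = phrase_score(count_a=1000, count_b=800, count_ab=500,
                     vocab_size=50000, min_count=5)

threshold = 10.0  # illustrative value
print(score, score > threshold)
```

With these numbers the score is roughly 30.9, so the pair would be accepted as a phrase. If the two tokens co-occurred only 5 times, the numerator would drop to zero and the pair would be rejected regardless of the threshold.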

The [**gensim**](https://radimrehurek.com/gensim/index.html) library will be used for phrase modeling, in particular its [**Phrases**](https://radimrehurek.com/gensim/models/phrases.html) class.

In [47]:
from gensim.models import Phrases
from gensim.models.word2vec import LineSentence

As we perform phrase modeling, we will be doing some iterative data transformation at the same time. Our roadmap for data preparation includes:

1. Segment the text of complete reviews into sentences and normalize the text
1. First-order phrase modeling $\rightarrow$ _apply the first-order phrase model to transform sentences_
1. Second-order phrase modeling $\rightarrow$ _apply the second-order phrase model to transform sentences_
1. Apply text normalization and the second-order phrase model to the text of complete reviews
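
The reason for two passes is that each pass joins pairs of adjacent tokens, so a three-word phrase only emerges when the second pass pairs a word with an already-joined token. The toy sketch below illustrates this with hand-picked phrase sets standing in for what trained models would learn. It does not use gensim, and `join_phrases` is a hypothetical helper.

```python
def join_phrases(tokens, phrases):
    """Join adjacent token pairs found in `phrases` with an underscore, left to right."""
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in phrases:
            out.append(tokens[i] + '_' + tokens[i + 1])
            i += 2  # both tokens were consumed by the joined phrase
        else:
            out.append(tokens[i])
            i += 1
    return out

# Hand-picked phrase sets (illustrative stand-ins for trained models)
first_order = {('ice', 'cream')}
second_order = {('vanilla', 'ice_cream')}

sentence = 'taste like vanilla ice cream'.split()
bigram_sentence = join_phrases(sentence, first_order)
trigram_sentence = join_phrases(bigram_sentence, second_order)

print(bigram_sentence)
print(trigram_sentence)
```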

First, let's define a few helper functions that we will use for text normalization. In particular, the `lemmatized_sentence_corpus` generator function will use spaCy to:
- Iterate over the 3.2 million reviews in the `review.txt` file we created earlier
- Split the reviews into individual sentences
- Remove punctuation and excess whitespace
- Lemmatize the text

In [49]:
def punct_space(token):
    """
    helper function to eliminate tokens
    that are pure punctuation or whitespace
    """
    
    return token.is_punct or token.is_space

def line_review(filename):
    """
    generator function to read in reviews from the file
    and un-escape the original line breaks in the text
    """
    
    with open(filename, encoding='utf_8') as f:
        for review in f:
            yield review.replace('\\n', '\n')
            
def lemmatized_sentence_corpus(filename):
    """
    generator function to use spaCy to parse reviews,
    lemmatize the text, and yield sentences
    """
    
    for parsed_review in nlp.pipe(line_review(filename),
                                  batch_size=10000, n_threads=4):
        
        for sent in parsed_review.sents:
            yield u' '.join([token.lemma_ for token in sent
                             if not punct_space(token)])

Let's use the `lemmatized_sentence_corpus` generator to loop over the original review text, segmenting the reviews into individual sentences and normalizing the text. We will write this data back out to a new file (`unigram_sentences.txt`), with one normalized sentence per line. We will use this data to learn our phrase models.

In [None]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if 1 == 1:

    with open('unigram_sentences.txt', 'w', encoding='utf_8') as f:
        for sentence in lemmatized_sentence_corpus('review.txt'):
            f.write(sentence + '\n')

In [20]:
unigram_sentences = LineSentence('unigram_sentences.txt')
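
`LineSentence` streams a text file with one sentence per line and yields each sentence as a list of whitespace-separated tokens, so the whole corpus never has to fit in memory. The generator below is a rough pure-Python approximation of that behavior for illustration only. `iter_line_sentences` is a hypothetical name, and the real class has additional options this sketch ignores.

```python
import io

def iter_line_sentences(lines):
    """Yield each non-empty line as a list of whitespace-separated tokens."""
    for line in lines:
        tokens = line.split()
        if tokens:
            yield tokens

# An in-memory stand-in for a one-sentence-per-line file (illustrative only)
fake_file = io.StringIO("so delicious\n\nprice\n")

for tokens in iter_line_sentences(fake_file):
    print(tokens)
```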

In [21]:
for unigram_sentence in it.islice(unigram_sentences, 230, 240):
    print (' '.join(unigram_sentence))
    print ('')

no it be not the best food in the world but the service greatly help the perception and it do not taste bad

so back in the late 90 there use to be this super kick as cinnamon ice cream like an apple pie ice cream without the apple or the pie crust

so delicious

however now there be some shit tastic replacement that taste like vanilla ice cream with last year 's red hot in the middle totally gross

fortunately our server be nice enough to warn me about the change and bring me a sample so i only have to suffer the death of a childhood memory rather than also have to pay for it

the portion be big and fill just do not come for the ice cream

i have pretty much be eat at various king pretty regularly since i be a child when my parent would take my sister and i into the fox chapel location often

lately me and my girl have be visit the heidelburg location

i love the food it really taste homemade much like something a grandmother would make complete with gob of butter and side dish

price

In [23]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute modeling yourself.
if 0 == 1:

    bigram_model = Phrases(unigram_sentences)

    bigram_model.save('bigram_model.txt')
    
# load the finished model from disk
bigram_model = Phrases.load('bigram_model.txt')

CPU times: user 5.91 s, sys: 3.14 s, total: 9.05 s
Wall time: 11 s


In [25]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if 0 == 1:

    with open('bigram_sentences.txt', 'w', encoding='utf_8') as f:
        
        for unigram_sentence in unigram_sentences:
            
            bigram_sentence = ' '.join(bigram_model[unigram_sentence])
            
            f.write(bigram_sentence + '\n')

CPU times: user 4 µs, sys: 1e+03 ns, total: 5 µs
Wall time: 8.11 µs


In [26]:
bigram_sentences = LineSentence('bigram_sentences.txt')

In [27]:
for bigram_sentence in it.islice(bigram_sentences, 230, 240):
    print (' '.join(bigram_sentence))
    print ('')

no it be not the best food in the world but the service greatly help the perception and it do not taste bad

so back in the late 90 there use to be this super kick as cinnamon ice_cream like an apple_pie ice_cream without the apple or the pie crust

so delicious

however now there be some shit tastic replacement that taste like vanilla_ice cream with last year 's red hot in the middle totally gross

fortunately our server be nice enough to warn me about the change and bring me a sample so i only have to suffer the death of a childhood_memory rather_than also have to pay for it

the portion be big and fill just do not come for the ice_cream

i have pretty much be eat at various king pretty regularly since i be a child when my parent would take my sister and i into the fox_chapel location often

lately me and my girl have be visit the heidelburg location

i love the food it really taste homemade much like something a grandmother would make complete with gob of butter and side dish

price

In [29]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute modeling yourself.
if 0 == 1:

    trigram_model = Phrases(bigram_sentences)

    trigram_model.save('trigram_model.txt')
    
# load the finished model from disk
trigram_model = Phrases.load('trigram_model.txt')

CPU times: user 4.85 s, sys: 3.17 s, total: 8.02 s
Wall time: 9.58 s


We'll apply our trained second-order phrase model to our first-order transformed sentences, write the results out to a new file, and explore a few of the second-order transformed sentences.

In [31]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if 0 == 1:

    with open("trigram_sentences.txt", 'w', encoding='utf_8') as f:
        
        for bigram_sentence in bigram_sentences:
            
            trigram_sentence = u' '.join(trigram_model[bigram_sentence])
            
            f.write(trigram_sentence + '\n')

CPU times: user 8 µs, sys: 4 µs, total: 12 µs
Wall time: 21.9 µs


In [32]:
trigram_sentences = LineSentence('trigram_sentences.txt')

In [33]:
for trigram_sentence in it.islice(trigram_sentences, 230, 240):
    print (' '.join(trigram_sentence))
    print ('')

no it be not the best food in the world but the service greatly help the perception and it do not taste bad

so back in the late 90 there use to be this super kick as cinnamon_ice_cream like an apple_pie ice_cream without the apple or the pie crust

so delicious

however now there be some shit tastic replacement that taste like vanilla_ice_cream with last year 's red hot in the middle totally gross

fortunately our server be nice enough to warn me about the change and bring me a sample so i only have to suffer the death of a childhood_memory rather_than also have to pay for it

the portion be big and fill just do not come for the ice_cream

i have pretty much be eat at various king pretty regularly since i be a child when my parent would take my sister and i into the fox_chapel location often

lately me and my girl have be visit the heidelburg location

i love the food it really taste homemade much like something a grandmother would make complete with gob of butter and side dish

price

In [35]:
%%time

# this is a bit time consuming - make the if statement True
# if you want to execute data prep yourself.
if 1 == 1:

    with open('trigram_reviews.txt', 'w', encoding='utf_8') as f:
        
        for parsed_review in nlp.pipe(line_review('review.txt'),
                                      batch_size=10000, n_threads=4):
            
            # lemmatize the text, removing punctuation and whitespace
            unigram_review = [token.lemma_ for token in parsed_review
                              if not punct_space(token)]
            
            # apply the first-order and second-order phrase models
            bigram_review = bigram_model[unigram_review]
            trigram_review = trigram_model[bigram_review]
            
            # remove any remaining stopwords
            trigram_review = [term for term in trigram_review
                              if term not in spacy.en.STOPWORDS]
            
            # write the transformed review as a line in the new file
            trigram_review = u' '.join(trigram_review)
            f.write(trigram_review + '\n')

CPU times: user 5 µs, sys: 1e+03 ns, total: 6 µs
Wall time: 11.9 µs
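
The order of operations in the cell above matters: stopwords are removed only after the phrase models have been applied. The toy sketch below shows the effect with a tiny made-up stopword set and a hand-written token list. spaCy's real stopword list is much longer, and these tokens are not actual pipeline output.

```python
# A tiny hypothetical stopword set (illustrative only)
stopwords = {'the', 'be', 'and', 'it', 'for', 'of', 'a', 'rather', 'than'}

# Hand-written tokens as they might look after phrase modeling
trigram_review = ['the', 'portion', 'be', 'big', 'rather_than', 'small']

# Drop standalone stopwords; joined phrases are compared as whole tokens
filtered = [term for term in trigram_review if term not in stopwords]

print(filtered)
```

Because `rather_than` is compared as a single token, it survives the filter even though both of its parts are in the stopword set. Filtering before phrase modeling would have removed the parts and prevented the phrase from being joined.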


Let's preview the results. We'll grab one review from the file with the original, untransformed text, grab the same review from the file with the normalized and transformed text, and compare the two.

In [36]:
print ('Original:' + '\n')

for review in it.islice(line_review('review.txt'), 11, 12):
    print (review)

print ('----' + '\n')
print ('Transformed:' + '\n')

with open('trigram_reviews.txt', encoding='utf_8') as f:
    for review in it.islice(f, 11, 12):
        print (review)

Original:

A great townie bar with tasty food and an interesting clientele. I went to check this place out on the way home from the airport one Friday night and it didn't disappoint. It is refreshing to walk into a townie bar and not feel like the music stops and everyone in the place is staring at you - I'm guessing the mixed crowd of older hockey fans, young men in collared shirts, and thirtysomethings have probably seen it all during their time at this place. 

The staff was top notch - the orders were somewhat overwhelming as they appeared short-staffed for the night, but my waitress tried to keep a positive attitude for my entire visit. The other waiter was wearing a hooded cardigan, and I wanted to steal it from him due to my difficulty in finding such a quality article of clothing.

We ordered a white pizza - large in size, engulfed in cheese, full of garlic flavor, flavorful hot sausage. An overall delicious pizza, aside from 2 things: 1, way too much grease (I know this comes 

You can see that most of the grammatical structure has been scrubbed from the text &mdash; capitalization, articles/conjunctions, punctuation, spacing, etc. However, much of the general semantic *meaning* is still present. Also, multi-word concepts such as "`friday_night`" and "`above_average`" have been joined into single tokens, as expected. The review text is now ready for higher-level modeling. 