# Creating a tagger for flight offer phrases
### Like this one: "¡CDMX a Bogotá 🇨🇴 $4,659!"

In [None]:
# libraries
import pandas as pd

In [None]:
# Load dataset:
vuelos = pd.read_csv('data/vuelos.csv', index_col=0)
with pd.option_context('max_colwidth', 800):
    print(vuelos.loc[:100:5][['label']])

Most of the offers follow a simple pattern: *Origin - Destination - Price - Extras*. While extracting this may seem easy for a regular expression, it is not (see this notebook for reference).

The idea is to create a tagger that can extract this information. The first task, however, is to define the pieces of information we want to extract. Following the pattern described above:

 - **DST**: Destination 
 - **ORI**: Origin 
 - **PRC**: Price 
 - **OTH**: Extras (any other information in the offer)
 
| Text | ORI | DST | PRC | OTH |
|------|-----|-----|-----|-----|
| ¡CUN a Holanda \$8,885! Sin escala EE.UU | CUN | Holanda | 8,885 | Sin escala EE.UU |
| ¡CDMX a Noruega \$10,061! (Y agrega 9 noches de hotel por \$7,890!) | CDMX | Noruega | 10,061 | Y agrega 9 noches de hotel por \$7,890! |
| ¡Todo México a Pisa, Toscana Italia \$12,915! Sin escala EE.UU (Y por \$3,975 agrega 13 noches hotel) | México | Pisa, Toscana Italia | 12,915 | Sin escala EE.UU (Y por \$3,975 agrega 13 noches hotel) |
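
To make the regular-expression difficulty concrete, here is a minimal sketch of a naive pattern. It is a hypothetical illustration (the names `NAIVE_OFFER` and `naive_extract` are made up here, and this is not the approach from the notebook referenced above): it handles the simplest offer but lets decorations and qualifiers leak into the fields.

```python
import re

# Hypothetical naive pattern: "¡<origin> a <destination> $<price>!" plus optional extras.
NAIVE_OFFER = re.compile(
    r'¡(?P<ori>.+?) a (?P<dst>.+?) \$(?P<prc>[\d,]+)!\s*(?P<oth>.*)'
)


def naive_extract(offer):
    """Return the four fields as a dict, or None if the pattern does not match."""
    match = NAIVE_OFFER.match(offer)
    return match.groupdict() if match else None


simple = naive_extract('¡CUN a Holanda $8,885! Sin escala EE.UU')
with_flag = naive_extract('¡CDMX a Bogotá 🇨🇴 $4,659!')
with_qualifier = naive_extract(
    '¡Todo México a Pisa, Toscana Italia $12,915! Sin escala EE.UU'
)

print(simple)
print(with_flag['dst'])
print(with_qualifier['ori'])
```

The first offer splits cleanly, but the flag emoji ends up inside the destination of the second, and the third reports "Todo México" as the origin where the table above expects "México". Each such case would need its own special rule, which is the motivation for a learned tagger.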

## Tokenize and POS-tag the dataset 
We need to generate a *csv* file that we can tag (manually 😨) that consists of:
```
token1    POS tag    Label
token2    POS tag    Label
token3    POS tag    Label
```

Here `Label` is one of DST, ORI, PRC, OTH, or NA, and it is assigned manually (again: 😨).
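
As a sketch of what the labelled file is meant to encode, the snippet below groups tokens by label to recover the fields of one offer, treating CUN as the origin and Holanda as the destination. The rows are written by hand for illustration: the token split is an assumption, the POS column holds a `'_'` placeholder instead of real tagger output, and `collect_fields` is a hypothetical helper.

```python
from collections import defaultdict

# Hand-written rows in the (token, POS tag, label) layout described above.
# The POS column is a '_' placeholder, not real tagger output.
rows = [
    ('¡', '_', 'NA'),
    ('CUN', '_', 'ORI'),
    ('a', '_', 'NA'),
    ('Holanda', '_', 'DST'),
    ('$', '_', 'NA'),
    ('8,885', '_', 'PRC'),
    ('!', '_', 'NA'),
    ('Sin', '_', 'OTH'),
    ('escala', '_', 'OTH'),
    ('EE.UU', '_', 'OTH'),
]


def collect_fields(rows):
    """Join the tokens that share a label, ignoring tokens labelled NA."""
    fields = defaultdict(list)
    for token, _pos, label in rows:
        if label != 'NA':
            fields[label].append(token)
    return {label: ' '.join(tokens) for label, tokens in fields.items()}


fields = collect_fields(rows)
print(fields)
```

Tokens labelled NA (punctuation, the connecting "a", the currency sign) carry no field information and are simply dropped.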

In [None]:
from nltk.tag.stanford import StanfordPOSTagger

spanish_postagger = StanfordPOSTagger('/Users/antonioferegrino/stanford_nlp/'
                                      'stanford-postagger-full-2018-02-27/models/spanish-distsim.tagger', 
                                      '/Users/antonioferegrino/stanford_nlp/'
                                      'stanford-postagger-full-2018-02-27/stanford-postagger.jar')

print(spanish_postagger.tag('Pepe Pecas pica papas con un pico, con un pico pica papas Pepe Pecas.'.split()))

In [None]:
from nltk.tokenize import TweetTokenizer
tknz = TweetTokenizer()

print(tknz.tokenize('¡CUN a Puerto Rico $3,296! (Agrega 3 noches de hotel por $2,778)'))

In [None]:
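import csv

out_file = 'data/tags.csv'
# newline='' lets the csv module control line endings; UTF-8 keeps accents and emoji intact
with open(out_file, 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile, delimiter=',', quoting=csv.QUOTE_MINIMAL)
    for label in vuelos['label']:
        # Tokenize each offer, then POS-tag its tokens
        tags = spanish_postagger.tag(tknz.tokenize(label))
        for token, pos in tags:
            # 'XXXXXX' is a placeholder to be replaced by hand with the real label
            writer.writerow([token, pos, 'XXXXXX'])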
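
Since every row starts with the `XXXXXX` placeholder, it can help to check how much of the file still needs manual labels. The following is a minimal sketch, assuming the three-column layout written above; `find_unlabelled` and `ALLOWED_LABELS` are hypothetical names, and the sample data is made up so the snippet runs without the real file.

```python
import csv
import io

# The label set described earlier in this notebook.
ALLOWED_LABELS = {'DST', 'ORI', 'PRC', 'OTH', 'NA'}


def find_unlabelled(csvfile):
    """Return (row_number, token, label) for rows whose label is not allowed."""
    problems = []
    reader = csv.reader(csvfile)
    for row_number, (token, _pos, label) in enumerate(reader, start=1):
        if label not in ALLOWED_LABELS:
            problems.append((row_number, token, label))
    return problems


# Made-up sample standing in for data/tags.csv; '_' is a placeholder POS tag.
sample = io.StringIO(
    'CUN,_,ORI\n'
    'a,_,NA\n'
    'Holanda,_,XXXXXX\n'
    '"8,885",_,PRC\n'
)
problems = find_unlabelled(sample)
print(problems)
```

To run the same check on the real file, open `data/tags.csv` with `newline=''` and `encoding='utf-8'` and pass the file object to `find_unlabelled`.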
        

In [None]:
# Iterating over a DataFrame yields column names, so select the column as a Series
for label in vuelos.loc[:100:5, 'label']:
    print(label)