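### Assignment goal: become proficient at loading text data with Torchtext

This assignment uses the movie reviews from the [polarity](http://www.cs.cornell.edu/people/pabo/movie-review-data/) dataset to practice loading data with torchtext. The data is in the attached polarity.tsv.

Hint: for this assignment you can try [torchtext.data.TabularDataset](https://torchtext.readthedocs.io/en/latest/data.html#tabulardataset), which makes reading the data simpler.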
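
As a rough sketch of the hint, `TabularDataset` can read the TSV file directly and attach a `Field` to each column. This assumes polarity.tsv has no header row and exactly two tab-separated columns (text, then label). The helper name `load_polarity`, the whitespace tokenizer, and the use of `LabelField` are illustrative choices made here, not part of the assignment.

```python
def load_polarity(path="./polarity.tsv", train_ratio=0.8):
    """Hypothetical helper: read a two-column TSV with TabularDataset and split it."""
    # Imported lazily so that defining the helper does not require torchtext.
    from torchtext.legacy import data

    # Whitespace tokenization keeps the sketch free of the spaCy dependency.
    text_field = data.Field(sequential=True, tokenize=str.split, lower=True)
    label_field = data.LabelField()

    dataset = data.TabularDataset(
        path=path,
        format="tsv",
        # The order of the fields must match the column order in the file.
        fields=[("text", text_field), ("label", label_field)],
        skip_header=False,  # assumption: the file has no header row
    )
    # With a float ratio, Dataset.split returns (train, test).
    train_data, test_data = dataset.split(split_ratio=train_ratio)
    return train_data, test_data, text_field, label_field
```

Calling `load_polarity()` would replace the manual `Example.fromlist` loop with a single dataset constructor, at the cost of less control over how each row is parsed.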

### Load packages

In [4]:
import torch
import pandas as pd
import numpy as np
from torchtext.legacy import data, datasets

In [5]:
# Explore the data
# Each row holds a review text and a class label; the label marks the review as positive or negative
input_data = pd.read_csv('./polarity.tsv', delimiter='\t', header=None, names=['text', 'label'])
input_data

| | text | label |
|---|---|---|
| 0 | films adapted from comic books have had plenty... | 1 |
| 1 | every now and then a movie comes along from a ... | 1 |
| 2 | you've got mail works alot better than it dese... | 1 |
| 3 | jaws is a rare film that grabs your attentio... | 1 |
| 4 | moviemaking is a lot like being the general ma... | 1 |
| ... | ... | ... |
| 1995 | if anything , " stigmata " should be taken as ... | 0 |
| 1996 | john boorman's " zardoz " is a goofy cinematic... | 0 |
| 1997 | the kids in the hall are an acquired taste .it... | 0 |
| 1998 | there was a time when john carpenter was a gre... | 0 |


In [7]:
input_data.iloc[0]

text     films adapted from comic books have had plenty...
label                                                    1
Name: 0, dtype: object

### Build a pipeline to generate the data

In [8]:
import re

In [9]:
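def clear_non_char(words):
    # `words` is the token list produced by the tokenizer
    words = ' '.join(words)
    # Replace every run of non-word characters (punctuation, symbols) with one space
    words = re.sub(r'\W+', ' ', words)
    words = words.split()

    return words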
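
To make the effect of this cleanup step concrete, the sketch below applies the same join, substitute, split sequence to a short hand-written token list. The sample tokens are made up for illustration; they only mimic how a tokenizer might split contractions and possessives.

```python
import re

def clear_non_char(words):
    # Same logic as the notebook function: join tokens, collapse non-word runs, re-split.
    text = ' '.join(words)
    text = re.sub(r'\W+', ' ', text)
    return text.split()

# Illustrative tokens (not taken from polarity.tsv)
sample_tokens = ['it', 'was', "n't", 'the', 'film', "'s", 'fault', '!']
cleaned = clear_non_char(sample_tokens)
print(cleaned)
```

Apostrophes count as non-word characters, so a token such as `n't` comes out as the two tokens `n` and `t`, and tokens made only of punctuation disappear.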

In [26]:
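# Create the Fields
# Note: dtype=torch.float64 is why the text batch further down prints as floats;
# the default, torch.long, is the dtype nn.Embedding expects for token indices.
text_field = data.Field(sequential=True, preprocessing=clear_non_char, tokenize='spacy', lower=True, dtype=torch.float64)
# sequential=False keeps each label as a single value; with the default use_vocab=True
# the label vocabulary also contains '<unk>', so the classes are indexed from 1.
label_field = data.Field(sequential=False)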

In [34]:
# Build the examples in shuffled order
examples = []
fields = [('text', text_field), ('label', label_field)]
for _, (text, label) in input_data.sample(frac=1).iterrows():
    examples.append(data.Example.fromlist(data=[text, label], fields=fields))

# Split the examples 8:2
split_idx = int(len(examples) * 0.8)
train_ex = examples[:split_idx]
test_ex = examples[split_idx:]

# Create the training and testing datasets
train_data = data.Dataset(examples=train_ex, fields={'text': text_field, 'label': label_field})
test_data = data.Dataset(examples=test_ex, fields={'text': text_field, 'label': label_field})

train_data[0].label, train_data[0].text

(0,
 ['tina', 'fetch', 'me', 'the', 'axe', 'a', 'favourite', 'book', 'of', 'mine',
  'called', 'the', 'golden', 'turkey', 'awards', 'relates', 'the', 'story',
  'that', 'when', 'mommie', 'dearest', 'was', 'unleashed', 'upon',
  'unsuspecting', 'audiences', 'back', 'in', '1981', 'paramount', 'soon',
  'realised', 'they', 'had', 'a', 'problem', 'on', 'their', 'hands', 'it',
  'was', 'n', 't', 'just', 'the', 'film', 's', 'disappointing', 'box',
  'office', 'performance', 'indeed', 'in', 'the', 'coming', 'years', 'some',
  'people', 'would', 'be', 'going', 'back', 'to', 'see', 'it', 'two', 'three',
  'even', 'six', 'times', 'no', 'the', 'main', 'problem', 'was', 'that',
  'what', 'was', 'intended', 'as', 'a', 'serious', 'biopic', 'of', 'screen',
  'queen', 'joan', 'crawford', 'was', 'turning', 'into', 'the', 'laugh',
  'riot',
  ...
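
The shuffle in this cell calls `sample(frac=1)` without a seed, so the 8:2 split changes from run to run. A minimal sketch of a reproducible shuffle-and-split using only the standard library follows; `split_examples` and its `seed` parameter are hypothetical names introduced here, and the integers stand in for real examples.

```python
import random

def split_examples(examples, train_ratio=0.8, seed=0):
    """Shuffle a copy of `examples` with a fixed seed and cut it at `train_ratio`."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

# Ten placeholder items split into 8 training and 2 testing items
train_part, test_part = split_examples(range(10))
print(len(train_part), len(test_part))
```

With pandas, passing `random_state` to `sample` would be another way to get the same effect.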

In [42]:
# Build the vocabulary
text_field.build_vocab(train_data)
label_field.build_vocab(train_data)

print(f"Vocabularies of index 0-9: {text_field.vocab.itos[:10]} \n")
print(f"words to index {text_field.vocab.stoi}")

Vocabularies of index 0-9: ['<unk>', '<pad>', 'the', 'a', 'and', 'of', 'to', 'is', 'in', 's'] 
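
The printed list shows the special tokens first, followed by words in order of decreasing frequency. The sketch below is a simplified stand-in for that idea, written with `collections.Counter`; it is not the torchtext implementation, and `build_vocab` with its alphabetical tie-breaking rule is an assumption made for illustration.

```python
from collections import Counter

def build_vocab(token_lists, specials=('<unk>', '<pad>')):
    """Return (itos, stoi): specials first, then tokens by descending frequency."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Most frequent first; ties are broken alphabetically for a deterministic order.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    itos = list(specials) + ordered
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

itos, stoi = build_vocab([['the', 'film', 'is', 'the', 'best'], ['the', 'film']])
# Tokens missing from the vocabulary fall back to the '<unk>' index.
unk = stoi['<unk>']
indices = [stoi.get(tok, unk) for tok in ['the', 'film', 'unseen']]
print(itos, indices)
```

Building the vocabulary from `train_data` only, as the notebook does, means words that appear only in the test set are mapped to `<unk>`.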



In [44]:
# Create iterators for the training and testing data
train_iter = data.Iterator(train_data, batch_size=3, sort_key=lambda ex: len(ex.text))
test_iter = data.Iterator(test_data, batch_size=3, sort_key=lambda ex: len(ex.text))

In [45]:
for train_batch in train_iter:
    print(train_batch.text, train_batch.text.shape)
    print(train_batch.label, train_batch.label.shape)
    break

tensor([[4.6400e+02, 2.7000e+01, 3.2100e+02],
        [3.6100e+03, 5.0000e+00, 4.2330e+03],
        [7.0000e+00, 2.0000e+00, 1.3808e+04],
        ...,
        [1.5180e+03, 1.0000e+00, 1.0000e+00],
        [5.0000e+00, 1.0000e+00, 1.0000e+00],
        [5.2160e+03, 1.0000e+00, 1.0000e+00]], dtype=torch.float64) torch.Size([1166, 3])
tensor([2, 1, 1]) torch.Size([3])
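
The text batch has shape `[1166, 3]` because a `Field` lays batches out as sequence length by batch size by default, and shorter reviews are filled with the `<pad>` index (1 in the vocabulary above). A simplified sketch of that padding step in plain Python follows; `pad_batch` is a hypothetical helper rather than the torchtext implementation, and the index values are illustrative.

```python
def pad_batch(sequences, pad_index=1):
    """Pad index sequences to equal length and return them as rows of time steps."""
    max_len = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_index] * (max_len - len(seq)) for seq in sequences]
    # Transpose from (batch, seq_len) to (seq_len, batch), the default Field layout.
    return [list(step) for step in zip(*padded)]

# Three sequences of different lengths (illustrative indices)
batch = pad_batch([[464, 3610, 7, 1518], [27, 5], [321, 4233, 13808]])
for row in batch:
    print(row)
```

Because every review in a batch is padded to the longest one, grouping reviews of similar length into the same batch reduces the amount of padding.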
