# TorchText

TorchText is a PyTorch-based library. PyTorch is an open source machine learning framework developed by [Facebook](https://github.com/facebook). TorchText provides data processing utilities for text and a collection of popular datasets for natural language processing. If you are working on an NLP task in PyTorch, TorchText is worth checking out.

TorchText supports tokenization through the `get_tokenizer()` function. `get_tokenizer(tokenizer, language='en')` returns a tokenizer function that converts a string into a list of tokens. The `tokenizer` argument can be either a callable or the name of a supported tokenizer (`basic_english`, `spacy`, `subword`, `moses`, `toktok`, `revtok`). Let's begin by installing torchtext:

`pip3 install torchtext`

In [1]:
# Turn off pretty printing
%pprint

Pretty printing has been turned OFF


In [2]:
text = "Good muffins cost $3.88. Please buy me two of them.\n\nThanks.🙂😍"

### Using Libraries

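The `tokenizer` argument accepts any of these names: `basic_english`, `spacy`, `subword`, `moses`, `toktok`, `revtok`. Names other than `basic_english` typically require the corresponding third-party package to be installed (for example, spaCy for `spacy`).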

In [3]:
from torchtext.data import get_tokenizer

tokenizer = get_tokenizer('basic_english')
tokens = tokenizer(text)
tokens


['good', 'muffins', 'cost', '$3', '.', '88', '.', 'please', 'buy', 'me', 'two', 'of', 'them', '.', 'thanks', '.', '🙂😍']
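
The split of `$3.88` into `'$3'`, `'.'`, `'88'` comes from how `basic_english` normalizes the text before splitting on whitespace. The following is a rough pure-Python approximation of that behavior, not the library's code: the function name `basic_english_sketch` and the exact rule list are assumptions chosen for illustration, and the real tokenizer may apply rules this sketch omits.

```python
import re

# Assumed normalization rules (illustrative; not copied from torchtext):
# pad common punctuation with spaces so each mark becomes its own token.
_RULES = [
    (re.compile(r"'"), " ' "),
    (re.compile(r'"'), ""),
    (re.compile(r"\."), " . "),
    (re.compile(r","), " , "),
    (re.compile(r"\("), " ( "),
    (re.compile(r"\)"), " ) "),
    (re.compile(r"!"), " ! "),
    (re.compile(r"\?"), " ? "),
    (re.compile(r"[;:]"), " "),
    (re.compile(r"\s+"), " "),
]


def basic_english_sketch(line):
    """Lowercase, apply the padding rules in order, then split on whitespace."""
    line = line.lower()
    for pattern, replacement in _RULES:
        line = pattern.sub(replacement, line)
    return line.split()


text = "Good muffins cost $3.88. Please buy me two of them.\n\nThanks.🙂😍"
sketch_tokens = basic_english_sketch(text)
print(sketch_tokens)
```

On the sample `text`, the sketch yields the same tokens as the output above. No rule touches emoji, so `🙂😍` stays a single token.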

### Using Custom Function

We can also pass our own tokenizer function to `get_tokenizer`. When given a callable, `get_tokenizer` returns it as is. For example, the `create_tokens` function below removes all ASCII punctuation, lowercases the text, and splits it on whitespace.

In [4]:
import re
import string

In [5]:
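def create_tokens(sentence):
    # Delete every ASCII punctuation character, then lowercase the text
    table = str.maketrans('', '', string.punctuation)
    sentence = sentence.translate(table).lower()
    # Collapse runs of spaces and trim both ends
    sentence = re.sub(' +', ' ', sentence).strip()
    # split() with no argument splits on any whitespace, including newlines
    return sentence.split()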

In [6]:
tokenizer = get_tokenizer(create_tokens)
tokens = tokenizer(text)
tokens

['good', 'muffins', 'cost', '388', 'please', 'buy', 'me', 'two', 'of', 'them', 'thanks🙂😍']
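
Because `create_tokens` deletes punctuation before splitting, `$3.88` collapses to `388` and the emoji stay attached to `thanks`. If that is not what you want, any other callable that maps a string to a list of strings can be passed in the same way. As a sketch, the hypothetical `regex_tokens` function below (not part of torchtext) keeps each punctuation mark and symbol as a separate token, using only the standard library.

```python
import re

# A run of word characters, or a single character that is neither
# a word character nor whitespace (punctuation, currency signs, emoji).
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def regex_tokens(sentence):
    """Hypothetical custom tokenizer: lowercase, then extract tokens by regex."""
    return _TOKEN_RE.findall(sentence.lower())


text = "Good muffins cost $3.88. Please buy me two of them.\n\nThanks.🙂😍"
regex_result = regex_tokens(text)
print(regex_result)
```

It can be wrapped with `get_tokenizer(regex_tokens)` in the same way as `create_tokens`. Whether to drop punctuation or keep it as separate tokens depends on the downstream task.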