# PyTorch Text - Text Classification with the torchtext library
Notebook for following along with the PyTorch text classification tutorial, which uses the torchtext library to build the dataset for text classification analysis. Source: [PyTorch website tutorial](https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html). <br><br>

### Choices for data

<br>

### Libraries and Modules
Importing the necessary libraries and modules for the notebook.

In [1]:
#Import cell
import glob
import matplotlib as mpl
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import math
import numpy as np
import os
import pandas as pd
import pickle as pk
import random
import re
import string
import time
import torch
import torch.nn as nn
import torch.nn.functional as F

from io import open
from torch.utils.data import DataLoader
from torchtext.datasets import AG_NEWS
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
torch.manual_seed(1247) #setting seed value
print(f"Device: {device}. Cuda available: {torch.cuda.is_available()}")
print("Imports complete")

Device: cpu. Cuda available: False
Imports complete


<br>

### Data Loading and Manipulation Functions
<b>Functions:</b><br>
<ul>
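    <li>collate_batch - runs the label and text pipelines over a batch of (label, text) pairs and returns the labels, the concatenated token IDs, and the start offset of each text as tensors</li>
    <li>yield_tokens - tokenizes each text in data_iter and yields the token lists for build_vocab_from_iterator()</li>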
</ul>

In [2]:
#Data loading and manipulation function definition cell
def collate_batch(batch):
    label_list, text_list, offsets = [], [], [0]
    for (_label, _text) in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        offsets.append(processed_text.size(0))
    label_list = torch.tensor(label_list, dtype=torch.int64)
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text_list = torch.cat(text_list)
    return label_list.to(device), text_list.to(device), offsets.to(device)


def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)
        
        
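# Labels in the data are 1-based; shift them to 0-based class indices
label_pipeline = lambda x: int(x) - 1
# Tokenize the text, then map each token to its vocab index
text_pipeline = lambda x: vocab(tokenizer(x))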

print("Data loading and manipulation functions defined.")

Data loading and manipulation functions defined.
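
The offset bookkeeping in `collate_batch` is easier to follow on plain lists. The sketch below is a dependency-free illustration only: `compute_offsets` is a hypothetical helper, not part of torchtext or this notebook, and it mirrors the `offsets[:-1]` plus `cumsum` step using `itertools.accumulate`. Each offset is the index in the flattened token list where one text begins.

```python
from itertools import accumulate


def compute_offsets(token_id_lists):
    """Flatten a batch of token-ID lists and return each list's start index.

    Hypothetical helper that mirrors the offset logic of collate_batch
    without tensors.
    """
    lengths = [len(ids) for ids in token_id_lists]
    # Exclusive prefix sum: start at 0 and drop the last length,
    # the same effect as torch.tensor(offsets[:-1]).cumsum(dim=0).
    offsets = list(accumulate([0] + lengths[:-1]))
    flat = [tok for ids in token_id_lists for tok in ids]
    return flat, offsets


# Three made-up "texts" of 2, 3 and 1 token IDs.
batch = [[475, 21], [30, 5297, 7], [12]]
flat, offsets = compute_offsets(batch)
print(flat)
print(offsets)
```

Slicing `flat[offsets[i]:offsets[i + 1]]` recovers the i-th text, which is why a single flat tensor plus offsets is enough to describe a batch of variable-length texts.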


### Importing and preparing data sets
Importing and preparing the data for the models.

In [3]:
#Importing data sets
train_iter = iter(AG_NEWS(split='train'))

#Printing demonstration training data
for i in range(3): print(next(train_iter), "\n")

print(f"\nData sets successfully imported, running on device: {device}")

(3, "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.") 

(3, 'Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\\which has a reputation for making well-timed and occasionally\\controversial plays in the defense industry, has quietly placed\\its bets on another part of the market.') 

(3, "Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring crude prices plus worries\\about the economy and the outlook for earnings are expected to\\hang over the stock market next week during the depth of the\\summer doldrums.") 


Data sets successfully imported, running on device: cpu
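
Each sample printed above is a `(label, text)` pair with a 1-based integer label. As a rough sketch, the snippet below attaches class names to those labels and applies the same shift that `label_pipeline` performs. The name mapping is an assumption taken from the AG News dataset description rather than from this notebook, and `AG_NEWS_LABELS`, `to_class_index`, and `describe` are hypothetical names used only for illustration.

```python
# Assumed class names for the 1-based AG_NEWS labels (from the dataset
# description, not defined anywhere in this notebook).
AG_NEWS_LABELS = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}


def to_class_index(label):
    """Shift a 1-based label to a 0-based class index, like label_pipeline."""
    return int(label) - 1


def describe(sample, width=40):
    """Format a (label, text) pair as 'index (name): truncated text'."""
    label, text = sample
    return f"{to_class_index(label)} ({AG_NEWS_LABELS[label]}): {text[:width]}"


sample = (3, "Oil and Economy Cloud Stocks' Outlook (Reuters)")
print(describe(sample))
```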


In [4]:
#Build a vocab with the raw training dataset, generating data batch and iter
tokenizer = get_tokenizer('basic_english')
train_iter = AG_NEWS(split='train')
     
vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=['<unk>'])
vocab.set_default_index(vocab['<unk>'])

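#Recreate the iterator, since building the vocab may have exhausted the first one
train_iter = AG_NEWS(split='train')
dataloader = DataLoader(train_iter, batch_size=8, shuffle=False,
                        collate_fn=collate_batch)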

print(vocab(['here', 'is', 'an', 'example']))
print(text_pipeline('here is an example'))
print(label_pipeline('10'))

[475, 21, 30, 5297]
[475, 21, 30, 5297]
9
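
The role of `specials=['<unk>']` and `set_default_index` can be illustrated with a toy lookup table. This is a minimal sketch under stated assumptions: `ToyVocab` is a hypothetical class, not the torchtext `Vocab`, and it assigns indices in order of first appearance, so its indices will not match the ones printed above. What it shows is the fallback: any token missing from the table maps to the index of `<unk>`.

```python
class ToyVocab:
    """Hypothetical stand-in for a vocab with a default index for unknowns."""

    def __init__(self, tokens, unk_token="<unk>"):
        # The special token takes index 0, as specials are placed first.
        self.stoi = {unk_token: 0}
        for tok in tokens:
            # Assign the next free index to each previously unseen token.
            self.stoi.setdefault(tok, len(self.stoi))
        self.default_index = self.stoi[unk_token]

    def __call__(self, tokens):
        # Unknown tokens fall back to the default index instead of raising.
        return [self.stoi.get(tok, self.default_index) for tok in tokens]


toy_vocab = ToyVocab("here is an example".split())
print(toy_vocab(["here", "is", "an", "example"]))
print(toy_vocab(["here", "is", "a", "zebra"]))
```

Without a default index, looking up an out-of-vocabulary token would have to fail, which is why the notebook calls `vocab.set_default_index(vocab['<unk>'])` before using `text_pipeline`.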


<br>

### Class Definitions
<b>Classes:</b><br>
<ul>
    <li></li>
</ul>

In [5]:
#Class definition cell

print("Classes defined.")

Classes defined.


<br>

### Calculation functions
<b>Functions:</b><br>
<ul>
    <li></li>
</ul>

In [6]:
#Calculation functions cell
    
print("Calculation functions defined.")

Calculation functions defined.


<br>

### Plotting functions
<b>Functions:</b>
<ul>
    <li></li>
</ul>

In [7]:
#Plotting functions Cell
%matplotlib inline

print("Plotting functions defined.")

Plotting functions defined.


<br>

### Main code

<br>