# Load Data

### Using the data from Wikipedia, I'm going to try training a couple of different models and see how each of them performs. The goal is to see how accurately each model predicts the language of a given string.

In [1]:
import pandas as pd

# read dataframe
language_df = pd.read_pickle('language_data.pickle')

# shuffle dataframe
language_df = language_df.sample(frac=1).reset_index(drop=True)
language_df.head()

Unnamed: 0,language,sentence,language_code,tokens
0,it,Le ragazze la adorarono e Alice Liddell chiese...,6,"[ragazza, adorare, e, alice, liddell, chiesa, ..."
1,ru,В сказке впервые появились цветные иллюстрации.,4,"[сказка, впервые, появиться, цветной, иллюстра..."
2,ru,"Первоначальный стих Саути, по мнению Хораса Гр...",4,"[первоначальный, стих, саути, мнение, хораса, ..."
3,ru,Вот их полный список в порядке появления в кни...,4,"[полный, список, порядок, появление, книга, им..."
4,de,In Nickolas Cooks Adaption Alice in Zombieland...,3,"[nickolas, cooks, adaption, alice, zombieland,..."


# LangDetect Library

### Before I train anything, I want to see how accurate the langdetect library is on this data.

In [2]:
from collections import Counter
from langdetect import detect as ld

def get_lang(text):
    try:
        return ld(text)
    except Exception:  # langdetect raises when it finds no usable features in the text
        return 'fail'

accuracy_list = []
lang_targets = language_df['language'].tolist()
ld_preds = language_df['sentence'].apply(lambda x: get_lang(x)).tolist()
for t, p in zip(lang_targets, ld_preds):
    accuracy_list.append(t==p)

counts = Counter(accuracy_list)
print("Accuracy:", counts[True]/len(accuracy_list))

Accuracy: 0.8568568568568569


### 86% isn't bad. The misses may say more about the quality of the data than about the langdetect library.
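One way to check whether the misses cluster in a few languages is a per-language breakdown. This is a minimal sketch, assuming the targets and predictions are parallel lists of language codes like `lang_targets` and `ld_preds` above. The name `per_language_accuracy` and the toy lists are made up for illustration and are not results from this notebook.

```python
from collections import Counter

def per_language_accuracy(targets, preds):
    """Return {language: fraction of sentences predicted correctly}."""
    totals = Counter(targets)
    correct = Counter(t for t, p in zip(targets, preds) if t == p)
    return {lang: correct[lang] / totals[lang] for lang in totals}

# toy data for illustration only (hypothetical, not from the dataset)
toy_targets = ['it', 'it', 'ru', 'ru', 'de', 'de']
toy_preds = ['it', 'es', 'ru', 'ru', 'de', 'fail']

for lang, acc in sorted(per_language_accuracy(toy_targets, toy_preds).items()):
    print(f"{lang}: {acc:.2f}")
```

Calling it as `per_language_accuracy(lang_targets, ld_preds)` should show whether a single language accounts for most of the errors.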

# BOW Feature Extraction

### Since this model's going to be trained to classify languages, I suspect that word order doesn't play a critical role. I think that the uniqueness of the words in each language should be enough for the model to learn how to make the correct distinctions. If that's the case then a bag-of-words (BOW) model should be enough, so I'll start by creating a count vectorizer.
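To make the bag-of-words idea concrete, here is a small pure-Python sketch of what a count vectorizer does with pre-tokenized input: each distinct token gets a column, and each document becomes a vector of counts, so word order is thrown away. The function names and the toy documents are hypothetical. This is an illustration of the technique, not scikit-learn's implementation.

```python
from collections import Counter

def build_vocabulary(token_lists):
    """Map every distinct token to a column index (sorted for a stable order)."""
    vocab = sorted({tok for tokens in token_lists for tok in tokens})
    return {tok: i for i, tok in enumerate(vocab)}

def to_count_vector(tokens, vocabulary):
    """Count each known token; tokens outside the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for tok, n in Counter(tokens).items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector

# toy documents for illustration only
toy_docs = [['alice', 'e', 'alice'], ['сказка', 'и', 'алиса']]
vocabulary = build_vocabulary(toy_docs)
vectors = [to_count_vector(doc, vocabulary) for doc in toy_docs]
print(vocabulary)
print(vectors)
```

The two toy documents share no columns, which is the kind of separation I'm hoping the real vocabularies give the classifier.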

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

tokens = language_df['tokens'].values
y = language_df['language_code'].values
#y = language_df['language_vector'].tolist() # for one-hot-encoding

tokens_train, tokens_test, y_train, y_test = train_test_split(tokens, y, test_size = 0.25, random_state=1000)

# use an identity analyzer since I'm applying CountVectorizer to lists of lemmas
vectorizer = CountVectorizer(analyzer=lambda x: x)

# fit_transform combines the fit and transform steps on the training tokens
X_train = vectorizer.fit_transform(tokens_train)
X_test = vectorizer.transform(tokens_test)

print(X_train.shape)
print(X_test.shape)

(1498, 7986)
(500, 7986)


# Naive Bayes MultinomialNB

In [4]:
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.956
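For intuition about what the classifier is doing with the count vectors, here is a from-scratch sketch of multinomial Naive Bayes with Laplace smoothing. It is a simplified illustration and not scikit-learn's code. The function names and toy sentences are hypothetical, it works on token lists and string labels rather than the sparse matrix and integer codes used above, and `alpha=1.0` is chosen to match the default smoothing of `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(token_lists, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class log likelihoods."""
    vocab = {tok for tokens in token_lists for tok in tokens}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(token_lists, labels):
        word_counts[label].update(tokens)

    model = {}
    for label, n_docs in class_counts.items():
        # total tokens in this class plus the smoothing mass
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            'log_prior': math.log(n_docs / len(labels)),
            'log_likelihood': {
                tok: math.log((word_counts[label][tok] + alpha) / denom)
                for tok in vocab
            },
        }
    return model

def predict_multinomial_nb(model, tokens):
    """Pick the class with the highest log posterior; unseen tokens are skipped."""
    def score(label):
        params = model[label]
        return params['log_prior'] + sum(
            params['log_likelihood'][tok]
            for tok in tokens if tok in params['log_likelihood'])
    return max(model, key=score)

# toy training data for illustration only
toy_tokens = [['the', 'cat'], ['the', 'dog'], ['der', 'hund'], ['die', 'katze']]
toy_labels = ['en', 'en', 'de', 'de']
toy_model = train_multinomial_nb(toy_tokens, toy_labels)
print(predict_multinomial_nb(toy_model, ['the', 'cat']))
print(predict_multinomial_nb(toy_model, ['der', 'hund', 'zzz']))
```

Because each language's frequent words are rare or absent in the others, even this simple per-word scoring separates the classes, which fits the reasoning behind using BOW features here.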


### The Multinomial Naive Bayes Classifier has an accuracy of 95.6%. I think that's pretty good! I could imagine sampling some sentences from a website, passing them to the model and choosing the language the model guesses the most. With 95.6% accuracy for each guess, sampling a few sentences should yield the correct language. Of course, only having vocabulary from one Wikipedia article probably means I'd see this accuracy go down if I tested it on a brand new article.
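The voting idea can be sketched in a few lines. The first function picks the most frequent prediction. The second gives a rough estimate of how often a strict majority of `n` guesses would be right if every guess were independent with accuracy `p`. That independence assumption is optimistic, since sentences from one page share vocabulary, so treat the number as a back-of-the-envelope figure. Both function names are hypothetical.

```python
from collections import Counter
from math import comb

def majority_language(predictions):
    """Return the most frequent predicted label among the sampled sentences."""
    return Counter(predictions).most_common(1)[0][0]

def prob_majority_correct(p, n):
    """Probability that more than half of n independent guesses are right."""
    needed = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(needed, n + 1))

# toy predictions for five sampled sentences (illustration only)
print(majority_language(['de', 'de', 'nl', 'de', 'en']))
print(f"{prob_majority_correct(0.956, 5):.4f}")
```

With more than two languages the correct label only needs a plurality, not a strict majority, so under the same assumptions this estimate is on the conservative side.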

# Linear Neural Network

### Now I'll see if a simple Linear Neural Network performs any better or worse than the Naive Bayes classifier using the same BOW vectors. 


In [5]:
import torch
import numpy as np
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader

In [6]:
# create fully connected network

class NN(nn.Module):
    def __init__(self, vocab_size, hidden_size, num_classes):
        super(NN, self).__init__()
        self.fc1 = nn.Linear(vocab_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

In [7]:
# set device

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# convert sklearn vectors to torch tensors
# dense layer deals with float datatype

X_train_tensor = torch.from_numpy(X_train.toarray()).float()
X_test_tensor = torch.from_numpy(X_test.toarray()).float()
y_train_tensor = torch.from_numpy(y_train)
y_test_tensor = torch.from_numpy(y_test)

In [8]:
# hyperparameters

vocab_size = X_train_tensor.shape[1]
hidden_size = 4000
num_classes = 8
learning_rate = 0.001
batch_size = 32
num_epochs = 3

# load data
# TensorDataset pairs each BOW vector with its target language code

train_data = TensorDataset(X_train_tensor, y_train_tensor)
train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
test_data = TensorDataset(X_test_tensor, y_test_tensor)
test_loader = DataLoader(test_data, batch_size=batch_size, shuffle=True)

In [9]:
# initialize network

model = NN(
    vocab_size=vocab_size, 
    hidden_size=hidden_size, 
    num_classes=num_classes ).to(device)

# loss and optimizer

# CrossEntropyLoss() requires integer-encoded target

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# train network

for epoch in range(num_epochs):
    for batch_idx, (data, targets) in enumerate(train_loader):
        data = data.to(device=device)
        targets = targets.to(device=device)

        # forward
        scores = model(data)
        loss = criterion(scores, targets)

        # backward
        optimizer.zero_grad()
        loss.backward()

        # gradient descent or adam step
        optimizer.step()
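The comment about integer-encoded targets is easier to see with the loss written out by hand. `nn.CrossEntropyLoss` takes raw logits and a class index, and for one sample it amounts to the negative log-softmax at that index. Below is a pure-Python sketch of that per-sample computation. The function name and toy logits are hypothetical, and PyTorch additionally averages over the batch by default.

```python
import math

def cross_entropy(logits, target):
    """Negative log-softmax of the logit at the integer target index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

# toy logits for a three-class problem (illustration only)
toy_logits = [2.0, 0.5, -1.0]
loss_best = cross_entropy(toy_logits, 0)   # target is the highest-scoring class
loss_worst = cross_entropy(toy_logits, 2)  # target is the lowest-scoring class
print(loss_best, loss_worst)
```

The target is used only to pick out one entry of the logits, which is why the network's output layer has no softmax and the labels stay as plain integers.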

In [10]:
# check accuracy on training & test to see how our model performs

def check_accuracy(loader, model):
    num_correct = 0
    num_samples = 0
    model.eval() # evaluation mode

    # don't have to compute gradient when checking the accuracy
    with torch.no_grad(): 
        for x, y in loader:
            x = x.to(device=device)
            y = y.to(device=device)

            scores = model(x)
            # scores shape: (batch_size, num_classes)
            _, predictions = scores.max(1)
            num_correct += (predictions == y).sum()
            num_samples += predictions.size(0)

        print(f'Got {num_correct} / {num_samples} with accuracy {float(num_correct)/float(num_samples)*100:.2f}')

    model.train() # return model to train

check_accuracy(train_loader, model)
check_accuracy(test_loader, model)

Got 1497 / 1498 with accuracy 99.93
Got 476 / 500 with accuracy 95.20


### The Neural Network performs about as well as the Naive Bayes classifier after three epochs. It's a lot more code, and it may be unnecessary for this task, but it's interesting to see how the neural net performs!
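A quick parameter count puts the size of this network in perspective. The sketch below uses the vocabulary size from the shape printed earlier and the hyperparameters set above. `linear_params` is a hypothetical helper that counts the weights and biases of one `nn.Linear` layer.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features

# values used in this notebook
vocab_size, hidden_size, num_classes = 7986, 4000, 8

total = (linear_params(vocab_size, hidden_size)
         + linear_params(hidden_size, num_classes))
print(f"{total:,} trainable parameters")
```

That is roughly 32 million parameters fitted to 1,498 training sentences, which may help explain why the training accuracy is near perfect while the test accuracy is a few points lower.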

# Integer Map Feature Extraction

### I don't think any sort of semantic analysis is necessary for this particular task. But just for the fun of it I want to see if a 1D convolutional neural network with word embeddings will perform any better or worse than the previous BOW models. This time, instead of a count vectorizer, I'll create integer map vectors using Keras' preprocessing library.
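Here is a pure-Python sketch of what an integer map is: each token is replaced by its rank in the training vocabulary, index 0 is kept for padding, and every sequence is padded at the end to a fixed length. The function names are hypothetical and this is not the Keras implementation. In particular, ties in frequency are broken alphabetically here, which is an arbitrary choice, and overly long sequences keep their last `maxlen` ids.

```python
from collections import Counter

def build_word_index(token_lists):
    """Rank tokens by frequency; index 0 is reserved for padding."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    ranked = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: i for i, tok in enumerate(ranked, start=1)}

def encode_and_pad(tokens, word_index, maxlen):
    """Map tokens to integers, drop unknown tokens, then post-pad with zeros."""
    ids = [word_index[tok] for tok in tokens if tok in word_index]
    ids = ids[-maxlen:]  # keep the last maxlen ids when the sequence is too long
    return ids + [0] * (maxlen - len(ids))

# toy documents for illustration only
toy_docs = [['alice', 'e', 'alice'], ['e', 'liddell']]
word_index = build_word_index(toy_docs)
print(word_index)
print(encode_and_pad(['alice', 'liddell', 'zzz'], word_index, maxlen=5))
```

Unlike the count vectors, these sequences keep word order, which is what lets a convolution slide over neighbouring tokens.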

In [11]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=7500) # keep 7500 most common
tokenizer.fit_on_texts(tokens_train)

X_train = tokenizer.texts_to_sequences(tokens_train)
X_test = tokenizer.texts_to_sequences(tokens_test)

vocab_size = len(tokenizer.word_index) + 1 # Adding 1 because of reserved 0 index

maxlen = 100 # this cuts sequences that exceed 100

X_train = pad_sequences(X_train, padding='post', maxlen=maxlen)
X_test = pad_sequences(X_test, padding='post', maxlen=maxlen)

print('\n', 'word tokens:', tokens_train[2])
print('\n', 'padded vectors:', X_train[2])


 word tokens: ['песня', 'морской', 'кадриль', 'поёт', 'черепаха', 'квази', 'пародировать', 'стихотворение', 'мэри', 'хауитт', 'паук', 'муха']

 padded vectors: [ 247  375  292  881   96  206  476   55  248 2564 2565 2566    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0]


# Convolutional Neural Network

In [12]:
# create simple CNN

class CNN(nn.Module):
    """A 1D Convolutional Neural Network"""
    def __init__(self, 
                 pretrained_embedding=None,
                 freeze_embedding=False,
                 vocab_size=None,
                 embed_dim=128,
                 filter_sizes=[3, 4, 5],
                 num_filters=[100, 100, 100],
                 num_classes=8,
                 dropout=0.5):

        """
        The constructor for the CNN class.

        Args:
            pretrained_embedding: Accepted but not used in this version.
            freeze_embedding (bool): Set to False to fine-tune pretrained vectors
                (also unused here, since no pretrained vectors are loaded).
            vocab_size (int): Number of rows in the embedding table.
            embed_dim (int): Dimension of word vectors.
            filter_sizes (List[int]): List of filter sizes.
            num_filters (List[int]): List of number of filters, has the same length as filter_sizes.
            num_classes (int): Number of classes.
            dropout (float): Dropout rate.
        """

        super(CNN, self).__init__()
        # Embedding layer
        self.embed_dim = embed_dim
        self.embedding = nn.Embedding(num_embeddings=vocab_size,
                                      embedding_dim=self.embed_dim,
                                      padding_idx=0,
                                      max_norm=5.0)
        # Conv Network
        self.conv1d_list = nn.ModuleList([
            nn.Conv1d(in_channels=self.embed_dim,
                      out_channels=num_filters[i],
                      kernel_size=filter_sizes[i])
            for i in range(len(filter_sizes))
        ])
        # Fully-connected layer and Dropout
        self.fc = nn.Linear(np.sum(num_filters), num_classes)
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, input_ids):
        # Get embeddings from input_ids
        # Output shape: (b, max_len, embed_dim)
        x_embed = self.embedding(input_ids).float()

        # Permute x_embed to match input shape requirement of nn.Conv1d
        # Output shape: (b, embed_dim, max_len)
        x_reshaped = x_embed.permute(0, 2, 1)

        # Apply CNN and ReLU. Output shape: (b, num_filters[i], L_out)
        x_conv_list = [F.relu(conv1d(x_reshaped)) 
            for conv1d in self.conv1d_list]

        # Max pooling. Output shape: (b, num_filters[i], 1)
        x_pool_list = [F.max_pool1d(x_conv, kernel_size=x_conv.shape[2])
            for x_conv in x_conv_list]
        
        # Concatenate x_pool_list to feed the fully connected layer.
        # Output shape: (b, sum(num_filters))
        x_fc = torch.cat([x_pool.squeeze(dim=2) for x_pool in x_pool_list], dim=1)
        
        # Compute logits. Output shape: (b, num_classes)
        logits = self.fc(self.dropout(x_fc))

        return logits
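The `L_out` in the shape comments of `forward` can be worked out with the output-length formula given in the `nn.Conv1d` documentation. A small sketch, using the padded length of 100 and the default filter sizes from this notebook; `conv1d_out_len` is a hypothetical helper name.

```python
def conv1d_out_len(seq_len, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1D convolution for the given settings."""
    return (seq_len + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

maxlen = 100  # padded sequence length used in this notebook
for k in [3, 4, 5]:
    print(f"kernel_size={k}: L_out={conv1d_out_len(maxlen, k)}")
```

Max pooling over the whole `L_out` axis then reduces each filter to a single value, so the three branches concatenate into `sum(num_filters)` features regardless of the filter size.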

In [13]:
# set device

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# convert train and test sets to tensors

X_train_tensor = torch.tensor(X_train)
X_test_tensor = torch.tensor(X_test)
y_train_tensor = torch.from_numpy(y_train)
y_test_tensor = torch.from_numpy(y_test)

# hyperparameters

num_classes = 8
# vocab_size was already set from the tokenizer above
batch_size = 32
num_epochs = 3
learning_rate = 0.01

# load data

train_data = TensorDataset(X_train_tensor, y_train_tensor)
train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
test_data = TensorDataset(X_test_tensor, y_test_tensor)
test_loader = DataLoader(test_data, batch_size=batch_size, shuffle=True)

In [15]:
# initialize network

cnn_model = CNN(vocab_size=vocab_size, num_classes=num_classes).to(device)

# loss and optimizer

# CrossEntropyLoss() requires integer-encoded target, not one-hot-encoded target

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(cnn_model.parameters(), lr=learning_rate)

# train network

for epoch in range(num_epochs):
    for batch_idx, (data, targets) in enumerate(train_loader):
        data = data.to(device=device)
        targets = targets.to(device=device)

        # forward
        logits = cnn_model(data)
        loss = criterion(logits, targets)

        # backward
        optimizer.zero_grad()
        loss.backward()

        # gradient descent or adam step
        optimizer.step()

In [16]:
# check accuracy on training & test to see how our model performs

def check_accuracy(loader, model):
    num_correct = 0
    num_samples = 0
    model.eval() # evaluation mode

    # don't have to compute gradient when checking the accuracy
    with torch.no_grad(): 
        for x, y in loader:
            x = x.to(device=device)
            y = y.to(device=device)

            logits = model(x)  # use the model passed in, not the global cnn_model
            _, predictions = logits.max(1)
            num_correct += (predictions == y).sum()
            num_samples += predictions.size(0)

        print(f'Got {num_correct} / {num_samples} with accuracy {float(num_correct)/float(num_samples)*100:.2f}')

    model.train() # return model to train

check_accuracy(train_loader, cnn_model)
check_accuracy(test_loader, cnn_model)

Got 1497 / 1498 with accuracy 99.93
Got 460 / 500 with accuracy 92.00


### This model's accuracy is a little lower than the BOW Linear Neural Net's. As I mentioned earlier, I suspect the reason for the lower accuracy is that semantic relationships within each sentence don't matter as much as the uniqueness of words for language classification.

### But 92% is still pretty good! 

### I decided to try implementing my own language detection models in response to an interview question I was asked recently, and I wanted to test whether the ideas I gave during that interview could actually work. The question was how I would go about detecting which language a website is written in. I thought the best way to go about it would be to use a model like the ones I trained in this notebook and feed it a sample of processed sentences from the website in question. This might not be the best way to accomplish this task, but it's given me a good excuse to practice implementing different types of neural networks in PyTorch.