# Load Data

### This notebook was inspired by an interview question I was asked recently: how would I go about detecting what language a website is written in? Sometimes the language is simply declared in the HTML, but I want to see if I can train a model that detects the language by looking at sample sentences. Using data I scraped from Wikipedia, I'm going to train a couple of different models and see how each of them performs. The goal is to determine the language a given sentence is written in with as high an accuracy as possible.
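
### For the easy case where the page declares its language, here is a minimal sketch that reads the `lang` attribute of the `<html>` tag with the standard library. The `LangAttrParser` class and `declared_language` helper are names I made up for illustration, and real pages may declare the language elsewhere or not at all.

```python
from html.parser import HTMLParser

class LangAttrParser(HTMLParser):
    """Records the lang attribute of the first <html> tag, if any."""
    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this
        if tag == "html" and self.lang is None:
            self.lang = dict(attrs).get("lang")

def declared_language(html_text):
    """Return the declared language code, or None when the page has none."""
    parser = LangAttrParser()
    parser.feed(html_text)
    return parser.lang

print(declared_language('<html lang="de"><body>Hallo</body></html>'))  # de
print(declared_language('<html><body>Hello</body></html>'))            # None
```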

In [1]:
import pandas as pd

# read dataframe
language_df = pd.read_pickle('language_data.pickle')

# shuffle dataframe
language_df = language_df.sample(frac=1).reset_index(drop=True)
language_df.head()

Unnamed: 0,language,sentence,language_code,tokens
0,en,"Alice eats them, and they reduce her again in ...",0,"[alice, eat, reduce, size]"
1,de,Alice’s Abenteuer im Wunderland Filmhörspiele ...,3,"[alice, ’s, abenteuer, wunderland, filmhörspie..."
2,ja,『アリス』の注釈者 は、アリスの物語は（「あらゆる偉大な空想物語と同様に」）どんな象徴的解釈...,1,"[アリス, 注釈, 者, アリス, 物語, あらゆる, 偉大, 空想, 物語, 同様, どん..."
3,it,"–""Alice nel Paese delle Meraviglie"" rimanda qui.",6,"[alice, meraviglie, rimandare]"
4,zh,愛麗絲觀察著這個過程，並跟蛙先生講了一堆晦澀難懂的話，最後還是讓自己走進屋內,7,"[愛麗, 絲, 觀察, 著這, 過程, 並, 跟, 蛙, 先生, 講, 了, 晦澀, 難懂,..."


# LangDetect Library

### Before I train anything, I want to see how accurate the langdetect library is on this data.

In [2]:
from collections import Counter
from langdetect import detect as ld

def get_lang(text):
    try:
        return ld(text)
    except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
        return 'fail'

lang_targets = language_df['language'].tolist()
ld_preds = language_df['sentence'].apply(get_lang).tolist()

# True where langdetect's prediction matches the labeled language
accuracy_list = [t == p for t, p in zip(lang_targets, ld_preds)]

counts = Counter(accuracy_list)
print("Accuracy:", counts[True]/len(accuracy_list))

Accuracy: 0.8563563563563563


### 86% isn't bad. The misses may reflect the quality of the data rather than a weakness in the langdetect library.
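
### One thing worth ruling out before blaming the data: langdetect reports Chinese as `zh-cn` or `zh-tw`, while the labels here use the bare code `zh`, so an exact string comparison would count those sentences as misses. Below is a minimal sketch of normalizing codes before scoring. The `normalize_code` and `accuracy` helpers and the sample lists are hypothetical, and I haven't re-run the notebook with this change.

```python
def normalize_code(code):
    """Collapse region-tagged codes such as 'zh-cn' to the base language 'zh'."""
    return code.split('-')[0].lower()

def accuracy(targets, preds):
    """Fraction of predictions whose normalized code equals the target label."""
    matches = [t == normalize_code(p) for t, p in zip(targets, preds)]
    return sum(matches) / len(matches)

# made-up labels and detector outputs, for illustration only
targets = ['en', 'zh', 'zh', 'ja']
preds = ['en', 'zh-cn', 'zh-tw', 'fail']

print(accuracy(targets, preds))  # 0.75
```

### With an exact comparison, only the first of these four made-up predictions would count as correct.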

# BOW Feature Extraction

### Since this model's going to be trained to classify languages, I don't think that word order will play a critical role. The uniqueness of the words in each language should be enough for a model to learn how to make the correct distinctions. If that's the case, then a Bag of Words model should be all I need, so I'll start by creating a count vectorizer.

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

tokens = language_df['tokens'].values
y = language_df['language_code'].values
#y = language_df['language_vector'].tolist() # for one-hot-encoding

# use an identity analyzer since each sample is already a list of lemmas
vectorizer = CountVectorizer(analyzer=lambda x: x)
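
### To make the Bag of Words idea concrete, here is a pure-Python sketch of what a count vectorizer with an identity analyzer roughly does: build a vocabulary from the token lists, then count each token per sentence. The helper names are hypothetical, and this illustrates the idea rather than scikit-learn's actual implementation, which returns a sparse matrix.

```python
def build_vocabulary(token_lists):
    """Map every distinct token to a column index (alphabetical order)."""
    vocab = sorted({tok for tokens in token_lists for tok in tokens})
    return {tok: i for i, tok in enumerate(vocab)}

def to_count_vector(tokens, vocab):
    """Count occurrences of each vocabulary token; unseen tokens are ignored."""
    vector = [0] * len(vocab)
    for tok in tokens:
        if tok in vocab:
            vector[vocab[tok]] += 1
    return vector

# two token lists taken from the dataframe preview above
docs = [['alice', 'eat', 'reduce', 'size'],
        ['alice', 'meraviglie', 'rimandare']]

vocab = build_vocabulary(docs)
print(list(vocab))                      # ['alice', 'eat', 'meraviglie', 'reduce', 'rimandare', 'size']
print(to_count_vector(docs[0], vocab))  # [1, 1, 0, 1, 0, 1]
print(to_count_vector(docs[1], vocab))  # [1, 0, 1, 0, 1, 0]
```

### Word order is thrown away entirely, which is exactly the assumption I'm making about this task.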

# K-Fold Cross Validation

### There are many different classification models to choose from. Naturally, I want to choose the model with the highest accuracy. To accomplish this, I'll use ten-fold cross validation to compare the accuracies of the Support Vector Machine, Multinomial Naive Bayes, and Random Forest models.

In [4]:
# classification models
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier

models = [SVC(), MultinomialNB(), RandomForestClassifier()]

from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

def get_score(model, vectorizer):
    clf = Pipeline([('vect', vectorizer), ('model', model)])
    return cross_val_score(clf, tokens, y, cv=10) # ndarray

for model in models:
    scores = get_score(model, vectorizer)
    print(f'{model}: {scores.mean()}')

SVC(): 0.7717713567839196
MultinomialNB(): 0.9499497487437185
RandomForestClassifier(): 0.8293391959798996
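
### For anyone unfamiliar with how the folds are formed, here is a small sketch of plain k-fold splitting over sample indices. The `k_fold_indices` function is my own illustration. As far as I know, `cross_val_score` with an integer `cv` and a classifier uses stratified folds, so this unstratified version only shows the general idea.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; fold sizes differ by at most one."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        # everything outside the held-out fold is used for training
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

# 1998 sentences, the size of this dataset (1798 train + 200 test below)
folds = list(k_fold_indices(1998, 10))
print([len(test) for _, test in folds])  # [200, 200, 200, 200, 200, 200, 200, 200, 199, 199]
```

### Each sentence is held out exactly once, and the reported score is the mean accuracy over the ten held-out folds.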


### Naive Bayes outperforms the other two models! And its average accuracy is pretty high. Similar to cross validation, for the actual model I'll use a tenth of the data for testing and the rest to train the model.

In [7]:
from sklearn.model_selection import train_test_split

tokens_train, tokens_test, y_train, y_test = train_test_split(tokens, y, test_size=0.10)

# fit_transform learns the vocabulary from the training tokens and vectorizes them in one step;
# transform then reuses that vocabulary for the test tokens
X_train = vectorizer.fit_transform(tokens_train)
X_test = vectorizer.transform(tokens_test)

print(X_train.shape)
print(X_test.shape)

(1798, 8866)
(200, 8866)


In [8]:
from sklearn import metrics

clf = MultinomialNB()
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

0.95

### The Multinomial Naive Bayes classifier has an accuracy of 95%. That's not bad. I could imagine sampling some sentences from a website, passing them to the model, and choosing the language the model guesses most often. With 95% accuracy for each guess, sampling a few sentences should yield the correct language. Of course, the vocabulary comes from only one Wikipedia article, so I'd probably see this accuracy go down if I tested it on a brand new article.
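
### The voting idea can be sketched in a few lines. `majority_vote` picks the most common prediction, and `prob_majority_correct` estimates how often more than half of the sampled sentences are classified correctly. Both names are hypothetical, and the probability assumes each sentence is an independent guess with the same 95% accuracy, which real pages won't satisfy exactly.

```python
from collections import Counter
from math import comb

def majority_vote(predictions):
    """Return the label predicted most often across the sampled sentences."""
    return Counter(predictions).most_common(1)[0][0]

def prob_majority_correct(p, n):
    """Probability that more than half of n independent guesses are correct."""
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(need, n + 1))

# made-up per-sentence predictions for one page
print(majority_vote(['de', 'de', 'en', 'de', 'it']))  # de
print(round(prob_majority_correct(0.95, 5), 4))       # 0.9988
```

### Under those assumptions, five sentences are already enough to push the page-level accuracy above 99.8%. This is a lower bound, since the correct language can also win the vote without a strict majority.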

# Linear Neural Network

### Now I'll see if a simple Linear Neural Network performs any better or worse than the Naive Bayes classifier using the same BOW feature extraction. 


In [9]:
import torch
import numpy as np
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader

In [10]:
# create fully connected network

class NN(nn.Module):
    def __init__(self, vocab_size, hidden_size, num_classes):
        super(NN, self).__init__()
        self.fc1 = nn.Linear(vocab_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

In [11]:
# set device

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# convert sklearn sparse matrices to torch tensors

# dense layers expect float inputs; toarray() returns a plain ndarray
X_train_tensor = torch.from_numpy(X_train.toarray()).float()
X_test_tensor = torch.from_numpy(X_test.toarray()).float()
y_train_tensor = torch.from_numpy(y_train)
y_test_tensor = torch.from_numpy(y_test)

In [12]:
# hyperparameters

vocab_size = X_train_tensor.shape[1]
hidden_size = 4000
num_classes = 8
learning_rate = 0.001
batch_size = 32
num_epochs = 3

# load data
# TensorDataset pairs each BOW vector with its target language code

train_data = TensorDataset(X_train_tensor, y_train_tensor)
train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
test_data = TensorDataset(X_test_tensor, y_test_tensor)
test_loader = DataLoader(test_data, batch_size=batch_size, shuffle=False) # order doesn't matter for evaluation

In [13]:
# initialize network

model = NN(
    vocab_size=vocab_size,
    hidden_size=hidden_size,
    num_classes=num_classes,
).to(device)

# loss and optimizer

criterion = nn.CrossEntropyLoss() # CrossEntropyLoss() requires integer-encoded target
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# train network

for epoch in range(num_epochs):
    for batch_idx, (data, targets) in enumerate(train_loader):
        data = data.to(device=device)
        targets = targets.to(device=device)

        # forward
        scores = model(data)
        loss = criterion(scores, targets)

        # backward
        optimizer.zero_grad()
        loss.backward()

        # gradient descent or adam step
        optimizer.step()
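
### The network's forward pass returns raw scores with no softmax because `nn.CrossEntropyLoss` applies a log-softmax internally and then takes the negative log-probability of the integer target class. Here is a pure-Python sketch of that computation for a single sample. The `cross_entropy` function is my own illustration, not PyTorch's implementation, which also averages over the batch by default.

```python
import math

def cross_entropy(logits, target):
    """Negative log-softmax probability of the target class for one sample."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[target]

# an untrained model that scores all 8 languages equally
print(round(cross_entropy([0.0] * 8, 3), 4))  # 2.0794, i.e. log(8)

# a confident, correct prediction gives a much smaller loss
confident = [0.0] * 8
confident[3] = 10.0
print(cross_entropy(confident, 3) < cross_entropy([0.0] * 8, 3))  # True
```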

In [14]:
# check accuracy on training & test to see how our model performs

def check_accuracy(loader, model):
    num_correct = 0
    num_samples = 0
    model.eval() # evaluation mode

    # don't have to compute gradient when checking the accuracy
    with torch.no_grad(): 
        for x, y in loader:
            x = x.to(device=device)
            y = y.to(device=device)

            scores = model(x)
            # scores has shape (batch, num_classes), i.e. up to 32 x 8 here
            _, predictions = scores.max(1)
            num_correct += (predictions == y).sum().item()
            num_samples += predictions.size(0)

        print(f'Got {num_correct} / {num_samples} with accuracy {float(num_correct)/float(num_samples)*100:.2f}')

    model.train() # return model to train

check_accuracy(train_loader, model)
check_accuracy(test_loader, model)

Got 1797 / 1798 with accuracy 99.94
Got 196 / 200 with accuracy 98.00


### The neural network reaches 98% accuracy on the test set after three epochs! That's an improvement over the Naive Bayes classifier's 95% on the same 200 test sentences. It's definitely more code, but I'd say it's worth it for the increased accuracy.
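
### Since the test set only has 200 sentences, it's worth a rough look at how much these two accuracies could move by chance. This sketch uses the normal approximation to a binomial proportion. The `accuracy_interval` helper is hypothetical, the count of 190 correct for Naive Bayes is derived from its 0.95 score on 200 sentences, and the approximation is crude this close to 100%.

```python
import math

def accuracy_interval(num_correct, num_samples, z=1.96):
    """Approximate 95% interval for an accuracy measured on num_samples items."""
    p = num_correct / num_samples
    half_width = z * math.sqrt(p * (1 - p) / num_samples)
    return max(0.0, p - half_width), min(1.0, p + half_width)

nb_low, nb_high = accuracy_interval(190, 200)  # Naive Bayes: 0.95 * 200
nn_low, nn_high = accuracy_interval(196, 200)  # neural network

print(f'Naive Bayes:    {nb_low:.3f} to {nb_high:.3f}')
print(f'Neural network: {nn_low:.3f} to {nn_high:.3f}')
```

### The two ranges overlap, so repeating the comparison over several random splits would be a more reliable way to confirm the gap.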