# Language Detection

<img src="https://cdn.pixabay.com/photo/2015/08/24/20/13/welcome-905562_1280.png" width="500px" align="left" />
<div style="clear: both"></div>

## Question?

- How good are you at detecting different languages?
- What do you think is your accuracy? How many mistakes would you make?

And what about a machine? How well do you think a machine would do at this task?

Well, let's go ahead and find out!

Download data
---

First things first, we need a dataset to train our machine on. One of the quickest and most straightforward sources of text in different languages is Wikipedia.

So let's go ahead and extract the main text from a Wikipedia page.

In [None]:
import requests # Library to retrieve web pages
from bs4 import BeautifulSoup # Library to parse them

def get_text(url):
    page = BeautifulSoup(requests.get(url).text, 'html.parser') # Get page
    paragraphs = page.find_all('p') # .. paragraphs
    return '\n'.join([p.text for p in paragraphs]) # and texts!

# Let's check the results
get_text('http://github.com')[:100]

Looks good! Let's get some data for each language

In [None]:
import json

pages = {
    'Deutsch': 'https://de.wikipedia.org/wiki/Schweiz',
    'Français': 'https://fr.wikipedia.org/wiki/Suisse',
    'Italiano': 'https://it.wikipedia.org/wiki/Svizzera',
    'Rumantsch': 'https://rm.wikipedia.org/wiki/Svizra',
    'English': 'https://en.wikipedia.org/wiki/Switzerland',
    # Todo - add more languages from Wikipedia!
}

# Retrieve texts
texts = { language: get_text(url) for language, url in pages.items() }
languages = list(texts.keys())
print(json.dumps(languages))

Encode data for ML
---

Which features distinguish the different languages well?

* `{switzerland, suisse, schweiz, svizzera}` work well for those documents, but **don't generalize** well to new ones!
* `{is, est, ist, è}` work well and generalize better, but are handpicked - what if we have 100 languages?
* Also, not all languages have words separated by spaces!

Idea: **character N-grams**. Extract the most frequent combinations of N characters for each language. But how should we choose N?

* Small Ns generalize better, but are not necessarily relevant to distinguish between languages
* Large Ns are more language-specific, but don't necessarily generalize well to new documents

Using **3-grams** is a good tradeoff.
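To make the idea concrete, here is a minimal sketch that extracts character 3-grams with the standard library only. The helper `char_ngrams` and the two toy sentences are assumptions made for illustration; they are not part of the notebook's pipeline, which uses `CountVectorizer`.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Return every run of n consecutive characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Hypothetical toy sentences, not taken from the Wikipedia pages
english = "the cat and the dog"
german = "der hund und die katze"

# Count the 3-grams of each sentence and keep the most frequent ones
top_english = Counter(char_ngrams(english)).most_common(3)
top_german = Counter(char_ngrams(german)).most_common(3)
print(top_english)
print(top_german)
```

Even on such short inputs, the most frequent 3-grams differ between the two sentences, which is the property the classifier relies on.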

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# .. to extract char N-grams
vectorizer = CountVectorizer(max_features=100, ngram_range=(3,3), analyzer='char')

# .. create "vocabulary"
vocabulary = []
for language, text in texts.items():
    ngrams = vectorizer.fit([text]).get_feature_names_out().tolist() # Extract popular N-grams (scikit-learn >= 1.0)
    vocabulary.extend(ngrams) # Store them
    
# Remove potential duplicates
vocabulary = list(set(vocabulary))
print(json.dumps(vocabulary))

Prepare data
---

We are going to train our model on some data. Let's extract short, random text snippets from our web pages.
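Before training, each piece of text is turned into a vector of N-gram counts over the fixed vocabulary, and each language label into a one-hot vector. The sketch below is a simplified, pure-Python stand-in for those two steps, assuming a tiny made-up vocabulary; the helpers `count_vector` and `one_hot` are hypothetical and do not reproduce every detail of `CountVectorizer` or `to_categorical`.

```python
def count_vector(text, vocabulary, n=3):
    """Count how often each vocabulary n-gram occurs in text."""
    text = text.lower()  # mirror CountVectorizer's default lowercasing
    index = {ngram: i for i, ngram in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for start in range(len(text) - n + 1):
        ngram = text[start:start + n]
        if ngram in index:  # n-grams outside the vocabulary are ignored
            vector[index[ngram]] += 1
    return vector

def one_hot(label, num_classes):
    """Encode a class index as a vector with a single 1.0 entry."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

# Hypothetical toy vocabulary, for illustration only
toy_vocabulary = ["the", "der", "und", "and"]
features = count_vector("The cat and the dog", toy_vocabulary)
target = one_hot(1, 3)
print(features, target)
```

Because the vocabulary is fixed, every input maps to a vector of the same length, which is what the network's input layer expects.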

In [None]:
from tensorflow.keras.utils import to_categorical
import numpy as np

# Extract random snippets
X, y = [], []
N = 10000 # Number of data points
length = 30 # Length of each snippet, in characters

for _ in range(N):
    # Random language
    language = np.random.choice(languages)
    y.append(languages.index(language))
    
    # Random text subset
    idx = np.random.randint(0, len(texts[language]) - length)
    sample_text = texts[language][idx:idx+length]
    X.append(sample_text)
    
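# Vectorize (converted to a dense array, which Keras handles more reliably than SciPy sparse matrices)
vectorizer = CountVectorizer(vocabulary=vocabulary, ngram_range=(3,3), analyzer='char')
X_vectorized = vectorizer.transform(X).toarray()
y_encoded = to_categorical(y)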

Train model
---

Now we come to the fun part! Let's create a simple neural network and train it on the sentences that we just prepared!
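A network with a single dense softmax layer computes one weighted sum of the input counts per language and then normalizes those sums into probabilities. The following sketch shows that computation in plain Python. All numbers are made up, and the per-language weight layout is chosen for readability; it is not meant to match how Keras stores its weights.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def dense_softmax(features, weights, biases):
    """One dense layer: a weighted sum per class, followed by softmax."""
    scores = [
        sum(f * w for f, w in zip(features, class_weights)) + b
        for class_weights, b in zip(weights, biases)
    ]
    return softmax(scores)

# Hypothetical numbers: 3 n-gram counts in, 2 languages out
features = [2, 0, 1]
weights = [[0.5, -0.2, 0.1],   # weights for language 0
           [-0.3, 0.4, 0.2]]   # weights for language 1
biases = [0.0, 0.1]
probabilities = dense_softmax(features, weights, biases)
print(probabilities)
```

Training adjusts the weights and biases so that the correct language receives the highest probability.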

In [None]:
import tensorflow as tf

# Create neural network
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(units=len(languages), activation='softmax', input_shape=[len(vocabulary)]))
model.summary()

In [None]:
# Compile model
model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss='categorical_crossentropy',
    metrics=['acc']
)

# Trick: stop training when validation accuracy stops improving (optional)
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_acc', patience=2)

# Train
history = model.fit(
    x=X_vectorized, y=y_encoded, batch_size=32, epochs=20, # max "epochs"
    validation_split=0.7, # hold out the last 70% of the samples for validation
    callbacks=[early_stopping]
)

In [None]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import matplotlib.pyplot as plt

# Plot results
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(12, 4))

# Plot loss values
ax1.set_title('loss: {:.4f}'.format(history.history['val_loss'][-1]))
ax1.plot(history.history['val_loss'], label='validation')
ax1.plot(history.history['loss'], label='training')
ax1.legend()

# Plot accuracy values
ax2.set_title('accuracy: {:.2f}%'.format(history.history['val_acc'][-1]*100))
ax2.plot(history.history['val_acc'], label='validation')
ax2.plot(history.history['acc'], label='training')
ax2.legend()

plt.show()

Test model
---

Finally, let's test our model on new sentences!
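The model returns one probability per language for each sentence, and the predicted language is simply the one with the highest probability. Here is a minimal sketch of that last step, using made-up probabilities and a hypothetical helper `most_likely`:

```python
def most_likely(probabilities, labels):
    """Return the label with the highest probability, and that probability."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return labels[best], probabilities[best]

# Hypothetical labels and made-up probabilities, for illustration only
example_labels = ['Deutsch', 'Français', 'English']
example_row = [0.05, 0.90, 0.05]

label, confidence = most_likely(example_row, example_labels)
print(label, confidence)
```

Keeping the full probability row, as the table in the next cell does, is often more informative than the single best label, because it shows how close the runner-up was.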

In [None]:
import pandas as pd

sample_texts = [
    'il n’y a pas le feu au lac',
    'der april macht was er will',
    'success is a team sport'
]
preds = model.predict(vectorizer.transform(sample_texts).toarray())
pd.DataFrame(preds, index=sample_texts, columns=languages)

Export the model for the web!
---

In [None]:
# Save model
tf.keras.models.save_model(model, 'classifier.h5')

If you are running the notebook via [Google Colab](https://colab.research.google.com/), first install the TensorFlow.js converter:

```bash
!pip install tensorflowjs
```

In [None]:
import shutil

# Prepare model for TensorFlow.js
!tensorflowjs_converter --input_format keras 'classifier.h5' 'tfjs-model'

# Zip the result!
shutil.make_archive('tfjs-model', 'zip', 'tfjs-model');