# Language detection
1. Messages are stored in a tab-separated CSV file.
2. We load the messages into a Spark DataFrame.
3. We use a previously trained fastText model to predict the language of each message.
4. The trained fastText model is wrapped in a custom `fasttext_lang_classifier` library.

In [2]:
spark

In [3]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

In [4]:
# Load data
schema = StructType([
    StructField("sentence_id", IntegerType(), True),
    StructField("language_code", StringType(), True),
    StructField("text", StringType(), True)])
messages = spark.read.csv('data/sentences.csv', schema=schema, sep='\t')

## Load our custom fastText language classifier lib

In [5]:
import fasttext_lang_classifier
udf_predict_language = udf(fasttext_lang_classifier.predict_language)

### Test predictor

In [6]:
udf_predict_language.func('Hello world!')

'eng'

In [7]:
udf_predict_language.func('Moi maailma!')

'fin'

## Predict language

In [8]:
%%time
messages = messages.withColumn('predicted_lang',
                               udf_predict_language(col('text')))

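This is fast because Spark evaluates lazily: `withColumn` only adds the UDF to the query plan, and no prediction runs until an action such as `show()` is called.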
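
As a plain-Python analogy for this deferred execution (not Spark code), a generator also does no work until a value is requested. In the sketch below, `slow_predict` is a hypothetical stand-in for the real predictor, and the `calls` list exists only to show when work actually happens.

```python
calls = []  # records every text the stand-in predictor actually processes

def slow_predict(text):
    """Hypothetical stand-in for the real language predictor."""
    calls.append(text)
    return "eng"

texts = ["Hello world!", "Good morning.", "See you later."]

# Building the generator is like calling withColumn: nothing is computed yet.
predictions = (slow_predict(t) for t in texts)
calls_after_definition = len(calls)

# Pulling one result is like an action: work happens only now,
# and only for the single row that was requested.
first = next(predictions)
calls_after_first = len(calls)

print(calls_after_definition, calls_after_first)  # prints: 0 1
```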

In [9]:
%%time
messages.show()

+-----------+-------------+--------------------+--------------+
|sentence_id|language_code|                text|predicted_lang|
+-----------+-------------+--------------------+--------------+
|          1|          cmn|              我們試試看！|           cmn|
|          2|          cmn|             我该去睡觉了。|           cmn|
|          3|          cmn|             你在干什麼啊？|           cmn|
|          4|          cmn|              這是什麼啊？|           cmn|
|          5|          cmn|今天是６月１８号，也是Muirie...|           cmn|
|          6|          cmn|       生日快乐，Muiriel！|           cmn|
|          7|          cmn|      Muiriel现在20岁了。|           cmn|
|          8|          cmn|       密码是"Muiriel"。|           cmn|
|          9|          cmn|            我很快就會回來。|           cmn|
|         10|          cmn|               我不知道。|           cmn|
|         11|          cmn|        我不知道應該說什麼才好。|           cmn|
|         12|          cmn|           這個永遠完不了了。|           cmn|
|         13|          cmn|     我只是不知道應該

## Predict test samples

In [19]:
from pyspark.sql import functions

In [11]:
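def get_msg(data_col):
    """Drop the leading label token and return the message text."""
    # The first space-separated token is the label; the rest is the message.
    msg = ' '.join(data_col.split(' ')[1:])
    return msg

# Wrap as a Spark UDF (the return type defaults to StringType).
get_message = udf(get_msg)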

In [23]:
test = spark.read.text('data/fasttext_train.txt')

In [24]:
test = test.withColumn('language', functions.substring_index(test.value, ' ', 1))
test = test.withColumn('language', functions.regexp_replace(test.language, '__label__', ''))
test = test.withColumn('message', get_message(col('value')))
test = test.select(['language', 'message'])
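
The cell above appears to assume that each line of the file has the form `__label__<lang> <text>`, judging by the `__label__` prefix and the split on the first space. A minimal plain-Python sketch of the same split follows; `parse_fasttext_line` is a hypothetical helper name, not part of the notebook or of fastText.

```python
LABEL_PREFIX = "__label__"

def parse_fasttext_line(line):
    """Split '__label__eng Hello world!' into ('eng', 'Hello world!')."""
    # Everything before the first space is the label; the rest is the message.
    label, _, message = line.partition(" ")
    # Strip the prefix only when it sits at the start of the label.
    if label.startswith(LABEL_PREFIX):
        label = label[len(LABEL_PREFIX):]
    return label, message

print(parse_fasttext_line("__label__eng Hello world!"))
print(parse_fasttext_line("__label__fin Moi maailma!"))
```

Unlike `regexp_replace`, which removes `__label__` wherever it occurs in the column, this sketch strips it only as a prefix; for well-formed lines the two give the same result.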

In [25]:
test = test.withColumn('predicted_lang',
                       udf_predict_language(col('message')))

In [26]:
test.sample(False, 0.01, 42).limit(5).toPandas()

| | language | message | predicted_lang |
|---|---|---|---|
| 0 | cmn | 我不知道。 | cmn |
| 1 | deu | Unglücklicherweise stimmt es. | deu |
| 2 | deu | Das ist das Dümmste, was ich je gesagt habe. | deu |
| 3 | deu | Wenn du keine Kinder kriegen kannst, kannst du... | deu |
| 4 | deu | Seien wir ehrlich, es ist unmöglich. Wir werde... | deu |


In [28]:
test.groupBy(['language', 'predicted_lang']).count().sort('count', ascending=False).limit(100).toPandas()

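| | language | predicted_lang | count |
|---|---|---|---|
| 0 | eng | eng | 616234 |
| 1 | tur | tur | 461610 |
| 2 | epo | epo | 438097 |
| 3 | rus | rus | 424834 |
| 4 | ita | ita | 419386 |
| 5 | deu | deu | 312408 |
| 6 | fra | fra | 269938 |
| 7 | spa | spa | 220595 |
| 8 | por | por | 203944 |
| 9 | hun | hun | 163616 |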


In [31]:
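# Rows where the prediction matches the true label
correct = test.where("language == predicted_lang").count()

In [32]:
# Rows where the prediction differs from the true label
incorrect = test.where("language != predicted_lang").count()

In [35]:
# Overall accuracy = correct / total
print(correct, incorrect, correct / (correct + incorrect))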

4639591 140912 0.9705236038969121


Whoa, a whopping 97% accuracy using default values. That is pretty good. We should, however, look at mean per-class accuracy rather than plain accuracy, because the classes are highly imbalanced.
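
Mean class accuracy averages the per-language accuracies (the fraction of each true language's sentences that were predicted correctly), so every language counts equally regardless of how many sentences it has. The sketch below is one possible way to compute it from `(language, predicted_lang, count)` tuples; the function name `mean_class_accuracy` and the toy counts are assumptions for illustration, not results from this notebook.

```python
from collections import defaultdict

def mean_class_accuracy(rows):
    """rows: iterable of (language, predicted_lang, count) tuples.

    Assumes at least one row with a positive count per language.
    """
    total = defaultdict(int)    # sentences per true language
    correct = defaultdict(int)  # correctly predicted sentences per true language
    for language, predicted, count in rows:
        total[language] += count
        if language == predicted:
            correct[language] += count
    per_class = {lang: correct[lang] / total[lang] for lang in total}
    # Unweighted mean: each language contributes equally.
    return sum(per_class.values()) / len(per_class), per_class

# Made-up counts for illustration only.
toy_rows = [
    ("eng", "eng", 90), ("eng", "deu", 10),
    ("yue", "yue", 1), ("yue", "cmn", 3),
]
mean_acc, per_class = mean_class_accuracy(toy_rows)
print(per_class)
print(mean_acc)
```

In this toy data plain accuracy is 91/104, about 0.875, while the mean class accuracy is much lower because the small `yue` class is mostly misclassified. The rows collected from a grouping like `test.groupBy(['language', 'predicted_lang']).count()` should be usable as input, since Spark `Row` objects unpack like tuples.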

### Some mistakes

In [42]:
test.where("language != predicted_lang").limit(50).toPandas()

| | language | message | predicted_lang |
|---|---|---|---|
| 0 | cmn | 聖誕節快樂！ | yue |
| 1 | spa | "Gracias." "De nada." | por |
| 2 | fra | "Comment te sens-tu ?", demanda-t-il. | spa |
| 3 | jpn | 何かしてみましょう。 | cmn |
| 4 | jpn | 何してるの？ | cmn |
| 5 | jpn | 今日は６月１８日で、ムーリエルの誕生日です！ | cmn |
| 6 | jpn | ムーリエルは２０歳になりました。 | cmn |
| 7 | jpn | パスワードは「Muiriel」です。 | cmn |
| 8 | jpn | すぐに戻ります。 | cmn |
| 9 | jpn | 何と言ったら良いか分かりません。 | cmn |
