# Language detection
1. Messages are stored in a tab-separated CSV file.
2. We load the messages into a Spark DataFrame.
3. We use a previously trained fastText model to predict the language of each message.
4. The trained fastText model is wrapped in a custom `fasttext_lang_classifier` library.

In [1]:
spark

In [3]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

In [4]:
# Load data
schema = StructType([
    StructField("sentence_id", IntegerType(), True),
    StructField("language_code", StringType(), True),
    StructField("text", StringType(), True)])
messages = spark.read.csv('data/sentences.csv', schema=schema, sep='\t')
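
As a quick check of the schema outside Spark, the same three-column, tab-separated layout can be parsed with the standard library. This is only a sketch: the two sample rows are copied from the output shown later in this notebook, and `parse_rows` is a hypothetical helper, not part of the pipeline.

```python
import csv
import io


def parse_rows(tsv_text):
    """Parse tab-separated text into (sentence_id, language_code, text) tuples."""
    # QUOTE_NONE keeps quote characters inside sentences as plain text.
    reader = csv.reader(io.StringIO(tsv_text), delimiter='\t',
                        quoting=csv.QUOTE_NONE)
    return [(int(sentence_id), language_code, text)
            for sentence_id, language_code, text in reader]


sample = '1\tcmn\t我們試試看！\n10\tcmn\t我不知道。\n'
rows = parse_rows(sample)
```

Each tuple lines up with the `sentence_id`, `language_code`, and `text` fields declared in the schema.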

## Load our custom fastText language classifier lib

In [5]:
import fasttext_lang_classifier

In [6]:
# Wrap the predictor as a Spark UDF; udf() defaults to a string return type
udf_predict_language = udf(fasttext_lang_classifier.predict_language)


### Test predictor

In [7]:
udf_predict_language.func('Hello world!')

'eng'

In [8]:
udf_predict_language.func('Moi maailma!')

'fin'

## Predict language

In [9]:
%%time
messages = messages.withColumn('predicted_lang',
                               udf_predict_language(col('text')))

CPU times: user 7 ms, sys: 2.35 ms, total: 9.35 ms
Wall time: 123 ms


In [10]:
%%time
messages.show()

+-----------+-------------+--------------------+--------------+
|sentence_id|language_code|                text|predicted_lang|
+-----------+-------------+--------------------+--------------+
|          1|          cmn|              我們試試看！|           cmn|
|          2|          cmn|             我该去睡觉了。|           cmn|
|          3|          cmn|             你在干什麼啊？|           cmn|
|          4|          cmn|              這是什麼啊？|           cmn|
|          5|          cmn|今天是６月１８号，也是Muirie...|           cmn|
|          6|          cmn|       生日快乐，Muiriel！|           cmn|
|          7|          cmn|      Muiriel现在20岁了。|           cmn|
|          8|          cmn|       密码是"Muiriel"。|           cmn|
|          9|          cmn|            我很快就會回來。|           cmn|
|         10|          cmn|               我不知道。|           cmn|
|         11|          cmn|        我不知道應該說什麼才好。|           cmn|
|         12|          cmn|           這個永遠完不了了。|           cmn|
|         13|          cmn|     我只是不知道應該
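
To go beyond eyeballing the table, the predicted codes can be compared with the labelled ones. A minimal sketch in plain Python, assuming the two columns have already been brought back as `(language_code, predicted_lang)` pairs; `accuracy` is a hypothetical helper, and the pairs below are just the first three rows of the table above.

```python
def accuracy(pairs):
    """Fraction of (expected, predicted) pairs whose codes match."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError('no rows to evaluate')
    matches = sum(1 for expected, predicted in pairs if expected == predicted)
    return matches / len(pairs)


# First three rows of the table above: (language_code, predicted_lang)
shown_rows = [('cmn', 'cmn'), ('cmn', 'cmn'), ('cmn', 'cmn')]
print(accuracy(shown_rows))
```

On the full dataset the same comparison is better done inside Spark, so that only the aggregate leaves the cluster.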

In [11]:
udf_predict_language.func('Hello World Moi Maailma')

'eng'

In [13]:
fasttext_lang_classifier.model.predict_proba(['世'])

[[('__label__cmn', 0.871094)]]
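
The raw prediction above is a nested list of `(label, probability)` pairs, with each label carrying a `__label__` prefix. A minimal sketch of turning that into a language code, with a confidence cutoff for short or ambiguous input; `top_language`, the `min_prob` default of 0.5, and the `'und'` fallback are assumptions for illustration, not part of the custom library.

```python
LABEL_PREFIX = '__label__'


def top_language(predictions, min_prob=0.5, fallback='und'):
    """Return the most probable language code for a single input text.

    `predictions` has the shape shown above: one inner list per input,
    each holding (label, probability) pairs.
    """
    candidates = predictions[0]
    if not candidates:
        return fallback
    label, prob = max(candidates, key=lambda pair: pair[1])
    if prob < min_prob:
        return fallback
    # Strip the prefix only when it is present.
    if label.startswith(LABEL_PREFIX):
        return label[len(LABEL_PREFIX):]
    return label
```

For example, `top_language([[('__label__cmn', 0.871094)]])` gives `'cmn'`, while raising `min_prob` to 0.9 would return the fallback for the same input.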

In [24]:
test = spark.read.text('data/fasttext_train.txt')

In [25]:
test.collect()

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.



Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2881, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-25-dbd655a80081>", line 1, in <module>
    test.collect()
  File "/usr/local/Cellar/apache-spark/2.2.0/libexec/python/pyspark/sql/dataframe.py", line 438, in collect
    port = self._jdf.collectToPython()
  File "/usr/local/Cellar/apache-spark/2.2.0/libexec/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1131, in __call__
    answer = self.gateway_client.send_command(command)
  File "/usr/local/Cellar/apache-spark/2.2.0/libexec/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 883, in send_command
    response = connection.send_command(command)
  File "/usr/local/Cellar/apache-spark/2.2.0/libexec/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1028, in send_command
    answer = smart_decode(self.stream.readline()[:-1])
  File "/usr/local/Cell

KeyboardInterrupt: 
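
`collect()` brings every row back to the driver, which is likely why this cell was interrupted on a large training file. `DataFrame.take(n)` or `show(n)` return only a few rows instead. For a quick look without Spark at all, here is a sketch using only the standard library; `peek` is a hypothetical helper.

```python
from itertools import islice


def peek(path, n=5, encoding='utf-8'):
    """Return the first n lines of a text file without reading the rest."""
    with open(path, encoding=encoding) as handle:
        # islice stops after n lines, so large files are not loaded fully.
        return [line.rstrip('\n') for line in islice(handle, n)]
```

Calling `peek('data/fasttext_train.txt')` would list the first five training lines, assuming the file is UTF-8 encoded.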

In [None]:
test = spark.read.format?