# Language Detection & Translation in Py

Class,

So far we've worked only with English text. Next, we head into the domain of other languages.

The aim is to demo the capabilities of __[Py's TextBlob module](http://textblob.readthedocs.io/en/dev/index.html)__ for detecting languages and translating text between them. Run the following, preferably in Spyder or Jupyter.

Install textblob thus from the CMD window:
> pip install -U textblob

Also install textblob corpora thus:
> python -m textblob.download_corpora

You'll notice that textblob is an amazingly light library for what it purports to do. The reason is that it provides an API to the Google Translate backend, where the heavy lifting happens.

OK, let's start.

In [1]:
from textblob import TextBlob
import time

Let's start with a simple, small example. See below. P.S. The "u" before the text string marks it as a Unicode literal. In Python 3 every string is already Unicode, so the prefix is optional.
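To see what the "u" prefix does in practice, here is a minimal sketch, assuming Python 3. It shows that the prefixed and unprefixed literals are the same thing, and that UTF-8 only comes into play when a string is encoded to bytes. The variable names are illustrative.

```python
# Assumes Python 3, where every str is already a Unicode string.
plain = "My name is Anmol"
prefixed = u"My name is Anmol"

# The "u" prefix changes nothing: both literals have the same type and value.
print(type(prefixed) is str, plain == prefixed)

# UTF-8 is an encoding, relevant only when text is turned into bytes.
hindi = "मेरा नाम अनमोल है"
utf8_bytes = hindi.encode("utf-8")
print(len(hindi), "code points ->", len(utf8_bytes), "bytes in UTF-8")
```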

### Language Translation  

In [2]:
eng_blob = TextBlob(u"My name is Anmol")
type(eng_blob)
print("\n")  # print a new line
print(eng_blob)   # print the blob object



My name is Anmol


In [3]:
# English to Hindi translation
t0 = time.time()
hindi_blob = eng_blob.translate(to='hi')  # "hi" is the ISO 639-1 code for Hindi.
t1 = time.time()
time_taken = t1 - t0

print(hindi_blob)
print("\n")  # print a new line
print(round(time_taken, 2))


मेरा नाम अनमोल है


0.18


Neat, eh? One function "translate()" packs quite a punch. Didn't take a whole lot of time either.
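The t0/t1 pattern above recurs in every timed cell. One way to package it, as a sketch: a small helper that runs any callable and returns its result along with the elapsed seconds. The name `timed` is made up for this example and is not part of TextBlob. It uses `time.perf_counter()`, which is meant for measuring short durations.

```python
import time

def timed(func, *args, **kwargs):
    """Run func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Demonstration with a local function so the sketch runs offline.
result, seconds = timed(str.upper, "my name is anmol")
print(result)
print(round(seconds, 2))
```

With a blob in hand, the call would look like `hindi_blob, seconds = timed(eng_blob.translate, to='hi')`, assuming the installed TextBlob release still provides `translate()`.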

If you know Hindi, can you read the translation and judge how good or bad it is?

Another way to check is to translate the Hindi back into English and see how well the machine does. See below:

In [4]:
# Back to Eng translation
t0 = time.time()
hin2eng_blob = hindi_blob.translate(to='en')
t1 = time.time()
time_taken = t1 - t0

print(hin2eng_blob)
print("\n")  # print a new line
print(round(time_taken, 2))

My name is precious


0.08
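The round trip turned "Anmol" into "precious": the Hindi word अनमोल means precious or priceless, and on the way back the translator treated the name as an ordinary word. A rough way to put a number on such drift is sketched below with the standard library's `difflib`. The function name `round_trip_similarity` is an assumption for this example, and a word-overlap ratio is only a crude proxy for translation quality.

```python
from difflib import SequenceMatcher

def round_trip_similarity(original, back_translated):
    """Return a 0-to-1 similarity ratio between two texts, compared word by word."""
    a = original.lower().split()
    b = back_translated.lower().split()
    return SequenceMatcher(None, a, b).ratio()

# Three of the four words survive the round trip in the example above.
score = round_trip_similarity("My name is Anmol", "My name is precious")
print(round(score, 2))  # prints 0.75
```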


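A large number of languages can be translated now. The two-letter codes used here (such as 'hi' and 'en') are ISO 639-1 codes, and they are listed alongside the three-letter __[ISO-639-2 codes available here.](https://www.loc.gov/standards/iso639-2/php/code_list.php)__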
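For reference, here are the codes that appear in this notebook, as a small sketch. The dictionary and the helper name `describe_code` are illustrative only. The two-letter entries are ISO 639-1 codes and the three-letter ones are their ISO 639-2 counterparts.

```python
# ISO 639-1 (two-letter) code -> (ISO 639-2 three-letter code, English name)
LANGUAGE_CODES = {
    "en": ("eng", "English"),
    "hi": ("hin", "Hindi"),
    "te": ("tel", "Telugu"),
    "es": ("spa", "Spanish"),
}

def describe_code(code):
    """Return a readable label for a two-letter code, or raise for unknown ones."""
    try:
        three_letter, name = LANGUAGE_CODES[code]
    except KeyError:
        raise ValueError(f"Code {code!r} is not in this small lookup table") from None
    return f"{name} (ISO 639-1: {code}, ISO 639-2: {three_letter})"

print(describe_code("hi"))
```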

Another quick example, translating both 'eng_blob' and 'hindi_blob' to Telugu to see if there are differences...

In [5]:
# Eng to Telugu and Hindi to Telugu translation
t0 = time.time()

eng2tel_blob = eng_blob.translate(to='te')
hin2tel_blob = hindi_blob.translate(to='te')

t1 = time.time()
time_taken = t1 - t0

print(eng2tel_blob)
print("\n")  # print a new line
print(hin2tel_blob)
print("\n")  # print a new line

print(round(time_taken, 2))

నా పేరు అన్మోల్


నా పేరు విలువైనది


0.64


### Language Detection 

We can use the "detect_language()" method to find the language of a text. See the example below.

In [6]:
unkown_blob = TextBlob(u"¡Hola! Mi nombre es Ana. Tengo veinticinco años. Vivo en Miami, Florida con mi esposo y dos hijos. Tengo un hijo en kínder y una hija en primer grado. Ambos van a una escuela pública. Mi esposo es mecánico. Los fines de semana, él trabaja en un restaurante lavando platos. El restaurante usualmente tiene más clientes los fines de semana, entonces necesitan gente extra para lavar platos.")

unkown_blob.detect_language()

'es'

As we can see from the above output, it's Spanish (the ISO 639-1 code for Spanish is 'es').
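Since the heavy lifting happens in the translation backend, the notebook does not show how detection works. To give a feel for the idea, here is a toy offline detector that counts common function words per language. This is a stand-in technique for illustration, not what TextBlob or Google Translate actually does. The word lists and the name `guess_language` are assumptions, and it only knows English and Spanish.

```python
import re

# Tiny, hand-picked lists of very common words; real detectors use far richer models.
STOPWORDS = {
    "en": {"the", "is", "and", "of", "in", "my", "with", "to", "for", "on", "have"},
    "es": {"el", "los", "de", "en", "y", "es", "mi", "con", "un", "una", "para", "tengo"},
}

def guess_language(text):
    """Return the code whose word list matches the most tokens, or None if nothing matches."""
    tokens = re.findall(r"\w+", text.lower())
    scores = {code: sum(token in words for token in tokens)
              for code, words in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(guess_language("¡Hola! Mi nombre es Ana. Vivo en Miami, Florida con mi esposo y dos hijos."))
print(guess_language("My name is Anmol"))
```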

Let's translate this Spanish text to English to see what it means.

In [7]:
unkown_blob.translate(to='en')

TextBlob("Hello! My name is Ana. I'm twenty-five years old. I live in Miami, Florida with my husband and two children. I have a son in kindergarten and a daughter in first grade. Both go to a public school. My husband is a mechanic. On weekends, he works in a restaurant washing dishes. The restaurant usually has more customers on weekends, so they need extra people to wash dishes.")

Can we do this at scale? What if, instead of a paragraph or two, my corpus featured pages upon pages of text? Should work just as well, I surmise. Why not check and see, eh?
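One practical wrinkle at scale: each translate() call depends on a remote service, so a long run may hit the occasional failed request. A hedged sketch of a retry wrapper follows. The helper `call_with_retry`, its parameters, and the `flaky_translate` stand-in are all made up for this example so that it runs offline. In the notebook, the callable would be something like `lambda: eng_blob.translate(to='hi')`.

```python
import time

def call_with_retry(func, retries=3, delay=0.5):
    """Call func(); on an exception, wait and try again, up to `retries` attempts in total."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return func()
        except Exception as error:  # broad on purpose: no specific error type is assumed
            last_error = error
            if attempt < retries:
                time.sleep(delay * attempt)  # wait a little longer after each failure
    raise last_error

# Offline stand-in for a network call: fails twice, then succeeds.
attempts = {"count": 0}

def flaky_translate():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated network failure")
    return "मेरा नाम अनमोल है"

print(call_with_retry(flaky_translate, retries=3, delay=0.01))
```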

I'm reading in a part of our old favorite - IBM data. I'll analyze only a piece of it (~ the first 1000 characters).

In [8]:
from urllib import request
url = "https://raw.githubusercontent.com/sudhir-voleti/sample-data-sets/master/International%20Business%20Machines%20(IBM)%20Q3%202016%20Results%20-%20Earnings%20Call%20Transcript.txt"
response = request.urlopen(url)
ibm = response.read().decode('utf8')

ibm_blob = TextBlob(ibm[:1000])
print(ibm_blob)

International Business Machines Corporation. (NYSE:IBM)
Q3 2016 Results Earnings Conference Call
October 17, 2016, 05:00 PM ET
Executives
Patricia Murphy - Vice President-Investor Relations
Martin Schroeter - Senior Vice President and Chief Financial Officer
Analysts
Toni Sacconaghi - Bernstein
Katy Huberty - Morgan Stanley
Tien-tsin Huang - JPMorgan
Steve Milunovich - UBS
Lou Miscioscia - CLSA
Wamsi Mohan - Bank of America Merrill Lynch
David Grossman - Stifel
Keith Bachman - BMO
Amit Daryanani - RBC Capital Markets
Operator
Patricia Murphy
Thank you. This is Patricia Murphy, Vice President of Investor Relations for IBM. I'm here today with Martin Schroeter, IBM's Senior Vice President and Chief Financial Officer. I'd like to welcome you to our third quarter earnings presentation.
The prepared remarks will be available within a couple of hours and a replay of the webcast will be posted by this time tomorrow.
I'll remind you that certain comments made in this presen
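
Note that slicing at a fixed 1000 characters cuts the text mid-word ("presen"). For longer corpora, one option is to split on line boundaries and translate piece by piece. A minimal sketch follows. The function name `chunk_lines` and the size limit are illustrative choices, not something TextBlob prescribes.

```python
def chunk_lines(text, max_chars=1000):
    """Group whole lines into chunks of at most max_chars characters each.

    A single line longer than max_chars is kept whole as its own chunk.
    """
    chunks, current = [], ""
    for line in text.splitlines(keepends=True):
        # Start a new chunk when adding this line would exceed the limit.
        if current and len(current) + len(line) > max_chars:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks

sample = ("Executives\n"
          "Patricia Murphy - Vice President-Investor Relations\n"
          "Analysts\n"
          "Toni Sacconaghi - Bernstein\n")
pieces = chunk_lines(sample, max_chars=60)
for piece in pieces:
    print(repr(piece))
```

Each piece could then be wrapped in its own `TextBlob(piece)` and translated separately, with the results joined afterwards.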

In [9]:
t0 = time.time()
ibm_hindi = ibm_blob.translate(to="hi")
t1 = time.time()
time_taken = t1 - t0

print(ibm_hindi)
print("\n")
print(round(time_taken, 2))

अंतर्राष्ट्रीय व्यापार मशीनें निगम। (NYSE: आईबीएम)
Q3 2016 परिणाम आय सम्मेलन सम्मेलन
17 अक्टूबर 2016, 05:00 बजे ईटी
कार्यकारी अधिकारियों
पेट्रीसिया मर्फी - उपाध्यक्ष-निवेशक संबंध
मार्टिन श्रोएटर - वरिष्ठ उपाध्यक्ष और मुख्य वित्तीय अधिकारी
विश्लेषकों
टोनी सेकोनाघी - बर्नस्टीन
कैटी ह्यूबर्टी - मॉर्गन स्टेनली
टीएन-टीएस हुआंग - जेपी मॉर्गन
स्टीव मिलुनोविच - यूबीएस
लू मिसिसोशिया - सीएलएसए
वामसी मोहन - बैंक ऑफ अमेरिका मेरिल लिंच
डेविड ग्रॉसमैन - स्टिफ़ेल
कीथ बच्चन - बीएमओ
अमित दरयानी - आरबीसी कैपिटल मार्केट्स
ऑपरेटर
पेट्रीसिया मर्फी
धन्यवाद। यह पेट्रीसिया मर्फी, आईबीएम के लिए निवेशक संबंधों के उपाध्यक्ष हैं। मैं आज मार्टिन Schroeter, IBM के वरिष्ठ उपाध्यक्ष और मुख्य वित्तीय अधिकारी के साथ हूँ। मैं हमारी तीसरी तिमाही की आय प्रस्तुति में आपका स्वागत करना चाहता हूं।
तैयार टिप्पणी कुछ घंटों के भीतर उपलब्ध होगी और कल तक वेबकास्ट का एक रिप्ले पोस्ट किया जाएगा।
मैं आपको याद दिलाता हूँ कि इस पूर्व निर्धारित टिप्पणी


0.09


Well, what do you think?

I'll sign off this markdown here.

Sudhir Voleti