# Handling languages
In this section we address the issue of languages, since our dataset does not contain only English lyrics.

In [1]:
import pandas as pd
import sqlite3
from textblob import TextBlob

Let's first take the lyrics with their words from our database:

In [2]:
conn = sqlite3.connect("../datasets/mxm_dataset.db")

In [3]:
cursor = conn.cursor()
cursor.execute("SELECT track_id, word, count FROM lyrics ORDER BY track_id;")
track_word_count = cursor.fetchall()

In [4]:
sqldb_frame = pd.DataFrame(track_word_count, columns=["track_id", "word", "count"])
del track_word_count
sqldb_frame["word"] = sqldb_frame["word"].astype(str)
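
As a possible shortcut for the loading steps above, pandas can run the query and build the DataFrame in one call with `pd.read_sql_query`. The sketch below uses a small in-memory database with made-up rows (`demo_conn` and `demo_frame` are hypothetical names) so that it runs without `mxm_dataset.db`; the table layout is assumed to match the `lyrics` table queried above.

```python
import sqlite3
import pandas as pd

# Toy in-memory database standing in for mxm_dataset.db (hypothetical rows).
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE lyrics (track_id TEXT, word TEXT, count INTEGER)")
demo_conn.executemany(
    "INSERT INTO lyrics VALUES (?, ?, ?)",
    [("TR_B", "love", 3), ("TR_A", "far", 1), ("TR_A", "you", 2)],
)

# read_sql_query runs the query and builds the DataFrame in one step,
# taking the column names from the result set.
demo_frame = pd.read_sql_query(
    "SELECT track_id, word, count FROM lyrics ORDER BY track_id;", demo_conn
)
demo_frame["word"] = demo_frame["word"].astype(str)
demo_conn.close()
```

This avoids holding the intermediate list of tuples in memory alongside the DataFrame.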

In [5]:
lyrics_words = sqldb_frame.set_index("track_id")

### Song lyrics
The lyrics are reduced to the stems of their words.
Fortunately, it seems safe to assume that any English song contains at least some short words such as "I", "you", "me", "far", or "to".

We look at the lyrics of a sample song.

In [None]:
lyrics_words.loc["TRAADKW128E079503A"].sort_values(by="count").head(10)

track_id,word,count
TRAADKW128E079503A,someth,1
TRAADKW128E079503A,money,1
TRAADKW128E079503A,ani,1
TRAADKW128E079503A,late,1
TRAADKW128E079503A,ai,1
TRAADKW128E079503A,save,1
TRAADKW128E079503A,lose,1
TRAADKW128E079503A,mean,1
TRAADKW128E079503A,myself,1
TRAADKW128E079503A,far,1


We will use [pyenchant](http://pythonhosted.org/pyenchant/) to check whether our words belong to the US English dictionary. The method is simple: if more than half of a song's words belong to the English dictionary, we consider the song English. We do not require every word to be English because the lyrics dataset contains many misspellings and the words are stored as stems; a stem such as "someth" in the sample above, for example, is not a dictionary word.

In [None]:
import enchant

d = enchant.Dict("en_US")

def is_english(word):
    # True if the en_US dictionary recognizes the word
    return d.check(word)

lyrics_words["is_english"] = lyrics_words.word.apply(is_english)
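
Because the same stems recur across many tracks, one optional optimization is to check each distinct word once and map the results back. The sketch below is a minimal illustration under assumptions: `flag_words` is a hypothetical helper, and `toy_checker` with its tiny vocabulary stands in for `d.check` so that the example runs without pyenchant installed.

```python
import pandas as pd

def flag_words(words, checker):
    """Apply checker once per distinct word, then map the results back."""
    verdicts = {word: checker(word) for word in words.unique()}
    return words.map(verdicts)

# Hypothetical stand-in for d.check, so the sketch runs without pyenchant.
toy_vocabulary = {"money", "late", "save"}
calls = []

def toy_checker(word):
    calls.append(word)  # record each lookup to show how many are made
    return word in toy_vocabulary

toy_words = pd.Series(["money", "someth", "money", "late", "someth"])
toy_flags = flag_words(toy_words, toy_checker)
```

Here five rows need only three lookups; with the real dictionary, `flag_words(lyrics_words.word, d.check)` would follow the same pattern.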

In [None]:
# A track counts as English when strictly more than half of its words pass the check.
lyrics_is_english = lyrics_words.is_english.groupby(lyrics_words.index).mean() > 0.5
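
To see how the "more than half" rule behaves, here is a small sketch on a made-up frame (the track ids `T1` to `T3` and their flags are invented for illustration). It computes the share of words per track that pass the check and applies a strict threshold, so an exact 50/50 split is treated as not English.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "track_id": ["T1", "T1", "T1", "T2", "T2", "T2", "T3", "T3"],
        "is_english": [True, True, False, False, False, True, True, False],
    }
).set_index("track_id")

# Fraction of words per track that passed the dictionary check.
english_share = toy["is_english"].groupby(level=0).mean()

# "More than half" as a strict threshold: a 50/50 tie (T3) is not English.
toy_is_english = english_share > 0.5
```

`T1` (two of three words) is kept, while `T2` (one of three) and the tied `T3` are rejected.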

In [None]:
tracks_english = lyrics_is_english[lyrics_is_english==True].reset_index().track_id

In [None]:
tracks_english.to_csv(r'../datasets/tracks_english.csv')

In [None]:
tracks_english.head()