# Data Prep

In this notebook, we prepare the data used in the experiments. We work with the Anadolu Agency news data set, crawled in 2015. We keep only news in Turkish from the categories `"yaşam", "politika", "spor", "türkiye", "dünya", "ekonomi"`, which leaves us with 1337 documents.

We then vectorize these documents into a document-term matrix. We discard words that appear in more than 60% of the documents or in fewer than 5 of them, as well as words in the Text2ARFF function words list by YTU Kemik Group. We end up with 6226 terms.
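
The pruning rule can be illustrated without scikit-learn. The following is a minimal pure-Python sketch, not the code path used in this notebook (which relies on `CountVectorizer`); the helper name `prune_vocabulary` and the toy documents are hypothetical.

```python
from collections import Counter

def prune_vocabulary(docs, max_df=0.6, min_df=5, stop_words=()):
    """Return the sorted terms that survive document-frequency pruning.

    docs: list of token lists. A term is kept if it appears in at least
    `min_df` documents, in at most a fraction `max_df` of all documents,
    and is not listed in `stop_words`.
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    stop = set(stop_words)
    return sorted(
        t for t, n in df.items()
        if n >= min_df and n <= max_df * n_docs and t not in stop
    )

# Hypothetical toy corpus of 10 documents
docs = [["ve", "spor", "lig"]] * 5 + [["ve", "ekonomi"]] * 4 + [["ve", "nadir"]]
kept = prune_vocabulary(docs, max_df=0.6, min_df=5, stop_words=["ve"])
print(kept)
```

Here `ve` is dropped both as a function word and because it occurs in every document, while `ekonomi` and `nadir` occur in too few documents.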

We then take the TR Wiktionary etymology lookup prepared previously and try to match the words in the vocabulary to it aggressively, as follows:

- For each word in the vocabulary, we try to match it to the etymology lookup
- If we get no hit, we discard the last letter of the word and try again, removing letters one at a time
- If we still have no match once we are down to four letters, we stop
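
The steps above can be sketched as a small loop. This is an illustrative iterative variant of the matching procedure, assuming the lookup is a plain dict from word to etymology; the function name `match_with_stripping` and the two-entry lookup are hypothetical.

```python
def match_with_stripping(word, lookup, min_len=4):
    """Look up `word`, dropping trailing letters until a hit is found.

    Returns (matched_form, etymology), or None if no match is found by
    the time the candidate is down to `min_len` letters.
    """
    candidate = word
    while candidate:
        source = lookup.get(candidate)
        if source is not None:
            return candidate, source
        if len(candidate) <= min_len:
            return None  # down to min_len letters without a match: stop
        candidate = candidate[:-1]
    return None

# Hypothetical two-entry lookup, for illustration only
toy_lookup = {"kulüp": "Fransızca", "kitap": "Arapça"}
print(match_with_stripping("kitaplar", toy_lookup))
print(match_with_stripping("evler", toy_lookup))
```

Stripping letters only recovers stems that are literal prefixes of the inflected word, so a form such as `kulübü` would not reach `kulüp` in this toy lookup. Misses of this kind are one reason a manual pass is still needed.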

Then, since the TR Wiktionary coverage is observably poor, we complete the dictionary by manually inspecting each of the 6226 terms in the vocabulary and assigning each to the language it was borrowed from. For this, we mainly use Nisanyan Sozluk, the most comprehensive online resource. In doing so:

- We omit words that are obviously proper nouns
- We collapse Turkish origins into one group: Turkish. This includes Old, Middle, and Modern Turkish, plus a handful of Mongolian words

The resulting distribution of the vocabulary is as follows.

- NO LANG: 766
- Arapça (Arabic): 1235
- Farsça (Persian): 183
- Fransızca (French): 486
- İngilizce (English): 69
- İtalyanca (Italian): 61
- Türkçe (Turkish): 3397
- Yunanca (Greek): 28
- Grand Total: 6225

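Finally, we match the vocabulary with the data set and look at the distribution of non-Turkish words (loanwords) across categories. The distribution of etymologies differs noticeably between categories, although not drastically: for example, French loanwords make up about 33% of loanword occurrences in `spor` and 29% in `ekonomi`, against 17% to 22% in the other categories.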

Lastly, we pickle all of the data. The final data files are as follows:

- `vocab_trwiki.txt` : a pipe-separated file of word-etymology matches, before manual correction. Do not use this file
- `vocab_trwiki_manual.csv` : the same matches, comma-separated and with etymologies corrected manually using Nisanyan Sozluk
- `fWords.txt` : the Text2ARFF function words list
- `trwiktionary_etym_lookup.pkl` : the pickled Python dict used for the etymology lookups

The following data file can be used directly, without needing any of the files above:

- `datafile.pkl`: contains all the data needed for the project. A pickle of the `AADataFile` class. See class reference in `data/datafile.py`.

In [71]:
%matplotlib inline
from matplotlib import pyplot as pl
import numpy as np 
import pandas as pd
from pymongo import MongoClient
import pprint
import codecs
import pickle

cxn = MongoClient()
table = cxn["mkk"]["aa"]

data = table.find({"language": "",  "categories": {"$in":["yaşam", "politika", "spor", "türkiye", "dünya", "ekonomi"]}})

In [103]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

Make the dataset

In [120]:
cats = []
text = []

for doc in data:
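    doc_text = doc.get("text") or []  # guard against a missing "text" field
    if len(doc_text) > 0:
        # separate the title from the body so that the two do not fuse into one word
        join_text = (doc.get("title") or "") + " " + " ".join(doc_text)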
        cats.append("-".join(doc.get("categories")))
        text.append(join_text)

In [62]:
# get a list of function words

with codecs.open("fWords.txt", encoding="latin5") as f:
    ls = f.readlines()

# strip trailing whitespace and newlines; a list is what CountVectorizer expects
fwords = [line.rstrip() for line in ls]

In [66]:
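# vectorize: drop terms found in more than 60% of the documents or in fewer than 5,
# and drop the function words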

cvec = CountVectorizer(max_df=.6, min_df=5, stop_words=fwords)
cvec.fit(text)
dt = cvec.transform(text)

In [95]:
dt

<1337x6226 sparse matrix of type '<type 'numpy.int64'>'
	with 116459 stored elements in Compressed Sparse Row format>

In [73]:
# open the pickle in binary mode and close the file when done
with open("trwiktionary_etym_lookup.pkl", "rb") as f:
    etym_lookup = pickle.load(f)

In [90]:
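def recursive_etym_search(w, lookup):
    """Return (matched_form, etymology) for w, stripping trailing letters
    until a match is found; return None if there is no match at four letters."""
    source = lookup.get(w)
    if source is None and len(w) < 5:
        return None
    elif source is None:
        # recurse with the lookup that was passed in, not the global one
        return recursive_etym_search(w[:-1], lookup)
    else:
        return (w, source)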

with codecs.open("vocab_trwiki.txt", "w", "utf-8") as f:
    for w in cvec.vocabulary_:
        search = recursive_etym_search(w, etym_lookup)
        if search is not None:
            f.write("|".join([w, search[0], search[1]]) + "\n") 
        else:
            f.write(w + "||" + "\n")

In [106]:
etym_dict = {}

with codecs.open("vocab_trwiki_manual.csv", "r", "utf-8") as f:
    lines = f.readlines()
    for l in lines:
        a = l.split(",")
        etym_dict[a[0]] = a[2].rstrip()

In [186]:
vocab_list = list(cvec.vocabulary_.items())
sorted_vocab_list = sorted(vocab_list, key=lambda x: x[1])
vocab_arr = map(lambda x: x[0], sorted_vocab_list)

In [188]:
etym_arr = np.array(map(lambda x: etym_dict.get(x), vocab_arr))
etym_arr
le = LabelEncoder()
onehot = OneHotEncoder()

labels = le.fit_transform(etym_arr)
print le.classes_

# Make the 6226 x 9 etymology matrix
etym_matrix = onehot.fit_transform(labels.reshape(-1, 1))

[None u'' u'Arap\xe7a' u'Fars\xe7a' u'Frans\u0131zca' u'T\xfcrk\xe7e'
 u'Yunanca' u'\u0130ngilizce' u'\u0130talyanca']
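
What the label encoding and one-hot encoding steps produce can be seen on a small example. The following is a NumPy-only sketch of the same idea, not the scikit-learn calls used above; the array `toy_labels` and its contents are hypothetical.

```python
import numpy as np

# Hypothetical etymology labels for five terms
toy_labels = np.array(["Arapça", "Türkçe", "Türkçe", "Fransızca", "Arapça"])

# classes: sorted unique labels; codes: index of each term's class
classes, codes = np.unique(toy_labels, return_inverse=True)

# One row per term, one column per class, a single 1 in each row
one_hot = np.eye(len(classes), dtype=int)[codes]
print(classes)
print(one_hot)
```

Each row of `one_hot` corresponds to a term and each column to an etymology class, which is the layout of the term-etymology matrix built above.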


In [191]:
dt_d = dt.toarray()
etym_d = etym_matrix.toarray()

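# T[d, e, t] is the count of term t in document d if t belongs to etymology class e, else 0
# (a dense array of shape documents x etymology classes x terms)
T = dt_d[:, np.newaxis, :] * etym_d.T[np.newaxis, :, :]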

In [192]:
# coarse-grain the categories: keep only the first category of each document
course_cats = map(lambda x: x.split("-")[0], cats)

In [196]:
# Go over the categories one by one, and report the distribution of non-Turkish etymology words

np.set_printoptions(precision=3, suppress=True)

print le.classes_

catsarr = np.array(course_cats)
for cat in np.unique(catsarr):
    A = T[catsarr==cat, :, :]
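    # sum over documents and terms, then keep only the loanword classes
    # (indices into le.classes_, skipping None, the empty label and Turkish at index 5)
    s = A.sum((0,2))[[2,3,4,6,7,8]]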
    if A.shape[0] > 10:
        print cat, A.shape[0]
        print s / s.sum()

[None u'' u'Arap\xe7a' u'Fars\xe7a' u'Frans\u0131zca' u'T\xfcrk\xe7e'
 u'Yunanca' u'\u0130ngilizce' u'\u0130talyanca']
dünya 271
[ 0.653  0.1    0.196  0.013  0.017  0.02 ]
ekonomi 117
[ 0.508  0.072  0.286  0.011  0.039  0.084]
politika 95
[ 0.652  0.103  0.217  0.011  0.006  0.012]
spor 195
[ 0.43   0.078  0.33   0.002  0.12   0.04 ]
türkiye 495
[ 0.668  0.088  0.207  0.012  0.01   0.016]
yaşam 163
[ 0.648  0.136  0.169  0.009  0.013  0.025]
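
The per-category totals do not strictly require the dense three-dimensional array. As a sketch under the assumption that only the summed counts are needed, the same numbers can be obtained from a matrix product; the names `toy_dt`, `toy_etym`, `toy_cats` and `etymology_counts` are hypothetical, and the data is made up.

```python
import numpy as np

# Hypothetical toy data: 4 documents, 5 terms, 3 etymology classes
toy_dt = np.array([[1, 0, 2, 0, 1],
                   [0, 3, 0, 1, 0],
                   [2, 1, 0, 0, 0],
                   [0, 0, 1, 1, 1]])
toy_etym = np.array([[1, 0, 0],   # term 0 belongs to class 0
                     [0, 1, 0],
                     [0, 0, 1],
                     [0, 1, 0],
                     [1, 0, 0]])
toy_cats = np.array(["spor", "spor", "ekonomi", "ekonomi"])

def etymology_counts(dt, etym, mask):
    """Total term occurrences per etymology class for the masked documents."""
    return dt[mask].sum(axis=0).dot(etym)

# Dense-tensor version, as in the cell above, for comparison
toy_T = toy_dt[:, np.newaxis, :] * toy_etym.T[np.newaxis, :, :]
for cat in np.unique(toy_cats):
    mask = toy_cats == cat
    via_tensor = toy_T[mask].sum((0, 2))
    via_product = etymology_counts(toy_dt, toy_etym, mask)
    assert (via_tensor == via_product).all()
    print(cat, via_product / via_product.sum())
```

The product form avoids materialising an array of size documents x classes x terms, which matters more as the vocabulary grows.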


In [174]:
etym_matrix[4903,:].toarray()

array([[ 0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])

In [198]:
lang = u"Frans\u0131zca"
cat = "spor"

lang_ix = le.classes_ == lang
cat_ix = (catsarr == cat)

words_ix = np.nonzero(T[cat_ix, lang_ix, :])[1]

for z in words_ix:
    print [k for k,v in cvec.vocabulary_.iteritems() if v == z][0]

gazetecilere
gazetecinin
puanla
stadı
teknik
antrenmanda
antrenmanın
direktör
grubu
kulübü
lig
pozisyon
spor
süper
teknik
antrenmanda
direktör
koordinasyon
kulübü
park
program
programı
teknik
deplasmanda
grubu
lig
ligdeki
salonda
spor
süper
deplasmanda
ekibin
ekip
ekiplerinden
filelere
grubu
grup
kontrol
lig
puanı
spor
direktörlük
direktörü
ekibinin
kritik
kulübü
kulüp
sekreter
teknik
disiplin
federasyonu
kart
profesyonel
bursaspor
kongre
kulübü
kongre
kulübü
otel
sitesinden
bursa
bursaspor
elektronik
federasyonu
kontrol
lig
sistem
sistemi
sistemle
spor
stadı
süper
bursaspor
sportif
ajansı
federasyon
federasyonu
fotoğraf
fotoğrafları
fotoğrafını
kategorisinde
salonu
spor
antrenman
antrenörü
kiloda
madalya
şampiyonası
şampiyonu
ajansı
direktörlük
direktörü
fotoğraf
fotoğrafları
fotoğrafını
kariyerinde
spor
teknik
deplasmanda
direktörü
grubu
kadrosunda
lig
ligdeki
ligin
spor
süper
teknik
şampiyonluğa
şansı
başantrenörü
ekibi
federasyonu
final
kulüp
ligi
net
olimpiyat
sezon
sezonu
süper
e
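
The listing above contains repeats because the nonzero term indices are collected once per document. A possible way to get each word once, and to avoid scanning the whole vocabulary for every index, is to invert the vocabulary mapping up front. This is a sketch on made-up data; `toy_vocabulary` and `toy_words_ix` are hypothetical stand-ins for `cvec.vocabulary_` and `words_ix`.

```python
# Hypothetical miniature of a term -> column index mapping
toy_vocabulary = {"lig": 0, "spor": 1, "teknik": 2}

# Invert once so that each index lookup is a single dict access
index_to_word = {ix: word for word, ix in toy_vocabulary.items()}

toy_words_ix = [2, 0, 2, 1]  # column indices, repeated across documents
unique_words = sorted({index_to_word[ix] for ix in toy_words_ix})
print(unique_words)
```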

In [199]:
class AADataFile(object):
    
    def __init__(self, DT, TE, etym_classes, categories, vocabulary):
        """
        A python object for serializing the data needed for the AA data set 
        experiments. The parameters are as follows:
        
        :param DT: a 1337x6226 sparse matrix containing the document-term counts
            of shape (nr_documents, nr_terms)
        :param TE: a 6226x9 sparse matrix containing the term-etymology assignments
            (nr_terms, nr_etymologies)
        :param etym_classes: a (9,) shaped array containing the names of etymology
            classes
        :param categories: a (1337,) shaped array containing the coarse-grained
            categories of the news items
        :param vocabulary: a (6226,) python list of 2-tuples. The first element of 
            the tuple is the word itself, and the second element refers to the index 
            of the term (i.e. to index the columns of DT)
        """
        self.DT = DT
        self.TE = TE
        self.etym_classes = etym_classes
        self.categories = categories
        self.vocabulary = vocabulary

In [1]:
from datafile import AADataFile

In [2]:
aa_data = AADataFile(dt, 
                     etym_matrix, 
                     le.classes_, 
                     catsarr, 
                     sorted_vocab_list)

In [208]:
# write the pickle in binary mode and close the file when done
with open("datafile.pkl", "wb") as f:
    pickle.dump(aa_data, f)