<h3>Milestone 3</h3>
<br>
A neat way to build a sentiment analyzer for a target language that has no such tool, e.g. Finnish, is to machine-translate some English training data into the language of interest and train a classifier on the result, getting a sentiment analyzer almost for free, provided the idea works. Pick a suitable English dataset and machine-translate it to Finnish or any other foreign language you know, using for example Google Translate or Bing Translate. You can either feed the text in through the web interface or create an account; Bing, for example, allows a free quota of 2M characters. Evaluate the resulting sentiment classifier both qualitatively and quantitatively on a small sample of target-language sentences. You can sample texts from places like Reddit, online shop product reviews, movie reviews, online discussion forums, etc.

For this milestone, we use the Python library <i>googletrans</i> to translate the IMDB dataset into Finnish. We then train a classifier on the translated dataset and evaluate it on our own manually annotated set of Finnish movie reviews.

In [1]:
# How the library works

from googletrans import Translator

translator = Translator()
tr_object = translator.translate('I don\'t like to watch movies.', dest='fi')
print(tr_object.text)

En halua katsella elokuvia.


In [2]:
# Read in IMDB dataset

import os
import random

# get the file names in each directory
imdb_train_neg = [f for f in os.listdir('aclImdb/train/neg')]
imdb_train_pos = [f for f in os.listdir('aclImdb/train/pos')]
imdb_test_neg = [f for f in os.listdir('aclImdb/test/neg')]
imdb_test_pos = [f for f in os.listdir('aclImdb/test/pos')]
print("File lists ready.")
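def read(filenames, path):
    """Read each single-line review file under path into a list of strings."""
    texts = []
    for name in filenames:
        # one review per file, on a single line; UTF-8 encoding is assumed
        with open(os.path.join(path, name), "rt", encoding="utf-8") as inp_file:
            text = inp_file.readlines()
            assert len(text)==1
            texts.append(text[0].strip())
    return texts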

# read all the files in each directory into a list
imdb_train_neg = read(imdb_train_neg,'aclImdb/train/neg')
imdb_train_pos = read(imdb_train_pos,'aclImdb/train/pos')
imdb_test_neg = read(imdb_test_neg,'aclImdb/test/neg')
imdb_test_pos = read(imdb_test_pos,'aclImdb/test/pos')

# add labels for texts
imdb_train_neg = [(text,-1) for text in imdb_train_neg]
imdb_train_pos = [(text,1) for text in imdb_train_pos]
imdb_test_neg = [(text,-1) for text in imdb_test_neg]
imdb_test_pos = [(text,1) for text in imdb_test_pos]

File lists ready.
[1, 1, -1, -1, 1, 1, -1, -1, 1, 1, -1, 1, 1, 1, -1, -1, 1, -1, 1, 1]


In [4]:
# What the IMDB data looks like

for text,label in zip(imdb_train_texts[:3],imdb_train_labels):
    print(text[:100],'...\t',label)

It's been a long time since I saw this mini-series and I am happy to say its remembered merits have  ...	 1
I'll say this to begin with:...Why, oh why, can't WB do what these short film directors do? Sandy is ...	 1
I rated this a 3. The dubbing was as bad as I have seen. The plot - yuck. I'm not sure which ruined  ...	 -1


In [23]:
# Translate the English text to Finnish
# Challenge: Google Translate accepts at most 5000 characters per query

imdb_train_translator = translator.translate(imdb_train_texts[92],src='en',dest='fi')

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

In [30]:
# Solution 1: filter out the long reviews

# combine and shuffle the training set, then drop the reviews that are too long
imdb_train = imdb_train_neg + imdb_train_pos
random.shuffle(imdb_train)  # shuffle before filtering so the filtered list is shuffled too
print('Number of reviews we originally have (training set):',len(imdb_train))
imdb_train_filtered = [(text,label) for text,label in imdb_train if len(text)<5000]
print('Number of reviews left (training set):',len(imdb_train_filtered))

imdb_train_texts = [text for text, label in imdb_train_filtered]
imdb_train_labels = [label for text, label in imdb_train_filtered]
#print(train_texts[:3])
#print(imdb_train_labels[:20])

imdb_test = imdb_test_neg + imdb_test_pos
print('Number of reviews we originally have (test set):',len(imdb_test))
imdb_test_filtered = [(text,label) for text,label in imdb_test if len(text)<5000]
print('Number of reviews left (test set):',len(imdb_test_filtered))

imdb_test_texts = [text for text, label in imdb_test_filtered]
imdb_test_labels = [label for text, label in imdb_test_filtered]
#print(imdb_train_labels[:20])

Number of reviews we originally have (training set): 25000
Number of reviews left (training set): 24687
Number of reviews we originally have (test set): 25000
Number of reviews left (test set): 24705


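We don't lose much data: only 313 of the 25,000 training reviews and 295 of the 25,000 test reviews (about 1.2% each) are 5000 characters or longer :D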

In [31]:
%%time
imdb_train_translator = translator.translate(imdb_train_texts,src='en',dest='fi')
imdb_train_tekstit = [teksti.text for teksti in imdb_train_translator]

imdb_test_translator = translator.translate(imdb_test_texts,src='en',dest='fi')
imdb_test_tekstit = [teksti.text for teksti in imdb_test_translator]

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Passing the whole list in a single call fails with the same JSONDecodeError :(

In [32]:
%%time
# Solution 2: translate the reviews one at a time in a for loop (slow, but it works)

imdb_train = imdb_train_neg + imdb_train_pos
random.shuffle(imdb_train)
imdb_train_tekstit = []
imdb_train_leimat = []
for text,label in imdb_train:
    try:
        imdb_train_translator = translator.translate(text,src='en',dest='fi')
        imdb_train_tekstit.append(imdb_train_translator.text)
        imdb_train_leimat.append(label)
    except Exception: # failed requests surface as JSONDecodeError; skip those reviews
        pass

print('Number of reviews we originally have (training set):',len(imdb_train))
print('Number of reviews left (training set):',len(imdb_train_tekstit))

Number of reviews we originally have (training set): 25000
Number of reviews left (training set): 21545
CPU times: user 3min 1s, sys: 4.89 s, total: 3min 5s
Wall time: 2h 28min


In [35]:
# The same one-at-a-time loop for the test set

imdb_test = imdb_test_neg + imdb_test_pos
imdb_test_tekstit = []
imdb_test_leimat = []
for text,label in imdb_test:
    try:
        imdb_test_translator = translator.translate(text,src='en',dest='fi')
        imdb_test_tekstit.append(imdb_test_translator.text)
        imdb_test_leimat.append(label)
    except Exception: # failed requests surface as JSONDecodeError; skip those reviews
        pass

print('Number of reviews we originally have (test set):',len(imdb_test))
print('Number of reviews left (test set):',len(imdb_test_tekstit))

Number of reviews we originally have (test set): 25000
Number of reviews left (test set): 22222
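
The loops above silently drop every review whose request fails, which costs roughly 11 to 14% of the data. One option we did not try is to retry a failed request a few times before giving up. Below is a minimal sketch of that idea; `translate_with_retry` and its parameters are hypothetical names, and the stand-in `flaky_translate` only simulates a failing request so the example runs without network access.

```python
import time

def translate_with_retry(translate_fn, text, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call translate_fn(text), retrying with exponential backoff on failure.

    Returns the translation, or None if every attempt failed.
    """
    for attempt in range(retries):
        try:
            return translate_fn(text)
        except Exception:
            # wait base_delay, then 2x, then 4x ... before the next attempt
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)
    return None

# Stand-in translator (an assumption for the demo): fails twice, then succeeds
calls = {'n': 0}
def flaky_translate(text):
    calls['n'] += 1
    if calls['n'] < 3:
        raise ValueError('simulated failed request')
    return text.upper()

result = translate_with_retry(flaky_translate, 'hello', sleep=lambda seconds: None)
print(result)
```

With the real library, `translate_fn` could be something like `lambda t: translator.translate(t, src='en', dest='fi').text`. Whether retrying actually recovers the lost reviews depends on why the requests fail, which we did not investigate.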


In [40]:
# save the translated data
import pickle
with open('imdb_fi.pickle','wb') as f:
    pickle.dump([imdb_train_tekstit,imdb_train_leimat,imdb_test_tekstit,imdb_test_leimat],f)

Things we could have done:
split the long reviews into fragments short enough to translate, instead of dropping them
tune the TF-IDF hyperparameters
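
Fragmenting the long reviews could look roughly like the sketch below. This is not something we ran: `split_into_chunks` is a hypothetical helper, and it splits at spaces only, so a fragment may end in the middle of a sentence, which could hurt translation quality.

```python
def split_into_chunks(text, max_chars=5000):
    """Split text into pieces shorter than max_chars, breaking at spaces."""
    chunks, current = [], ''
    for word in text.split():
        candidate = word if not current else current + ' ' + word
        if len(candidate) < max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = word  # a single word longer than the limit is kept whole
    if current:
        chunks.append(current)
    return chunks

# Demo on an artificial 15000-character "review"
long_review = 'word ' * 3000
chunks = split_into_chunks(long_review)
print(len(chunks), [len(c) for c in chunks])
```

Each fragment would then be translated separately and the translations joined back into one review, keeping the original label.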

In [41]:
# sanity check
assert len(imdb_train_tekstit)==len(imdb_train_leimat)
assert len(imdb_test_tekstit)==len(imdb_test_leimat)

In [42]:
# Vectorizing

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from eli5 import show_weights

space_tokenizer = lambda text: text.split()


# Featurization and vectorization
vectorizer = TfidfVectorizer(tokenizer=space_tokenizer, ngram_range=(1,2))

imdb_vectorizer = vectorizer.fit(imdb_train_tekstit)
imdb_train_X = imdb_vectorizer.transform(imdb_train_tekstit)
#devel_X = vectorizer.transform(devel_texts)
imdb_test_X = imdb_vectorizer.transform(imdb_test_tekstit)
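
To make the features concrete, here is a small pure-Python sketch that approximates what a whitespace tokenizer with `ngram_range=(1,2)` produces for one text. `word_ngrams` is a hypothetical helper and the Finnish sentence is a made-up example; the real vectorizer additionally computes the TF-IDF weights.

```python
def word_ngrams(text, ngram_range=(1, 2)):
    """List the word n-grams a whitespace tokenizer yields for one text."""
    tokens = text.lower().split()  # TfidfVectorizer lowercases by default
    low, high = ngram_range
    features = []
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            features.append(' '.join(tokens[i:i + n]))
    return features

features = word_ngrams('Elokuva oli huono, todella huono.')
print(features)
# ['elokuva', 'oli', 'huono,', 'todella', 'huono.',
#  'elokuva oli', 'oli huono,', 'huono, todella', 'todella huono.']
```

Because the text is split on whitespace only, punctuation stays attached to the words, so 'huono,' and 'huono.' end up as separate features.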

In [43]:
%%time

# Training SVM classifier

imdb_classifier = LinearSVC(
    C=1.0,
    class_weight=None,
    max_iter=1000,
    loss='squared_hinge'
)

imdb_classifier.fit(imdb_train_X, imdb_train_leimat)

CPU times: user 2.32 s, sys: 12 ms, total: 2.34 s
Wall time: 2.34 s


In [44]:
%%time

# Predict

imdb_pred_labels = imdb_classifier.predict(imdb_test_X)

CPU times: user 40 ms, sys: 0 ns, total: 40 ms
Wall time: 39.8 ms


In [46]:
# Results

# Evaluation and analysis
imdb_accuracy = accuracy_score(imdb_test_leimat, imdb_pred_labels)

print('IMDB accuracy {:.2%}'.format(imdb_accuracy))

print("IMDB dataset results")
print(classification_report(imdb_test_leimat, imdb_pred_labels))

show_weights(imdb_classifier, vec=imdb_vectorizer)

IMDB accuracy 87.42%
IMDB dataset results
             precision    recall  f1-score   support

         -1       0.88      0.86      0.87     11133
          1       0.87      0.89      0.88     11089

avg / total       0.87      0.87      0.87     22222



Weight?   Feature
+3.019    loistava
+2.917    paras
+2.738    erinomainen
+2.708    ja
+2.619    hieman
+2.482    erittäin
+2.470    parhaista
          … 1133476 more positive …
          … 1048197 more negative …
-2.420    huono,


<h3>Evaluation on real Finnish data</h3>
<br>
Our self-annotated data consists of 24 Finnish movie reviews (10 negative, 14 positive), with Leffatykki as a source.

In [53]:
import csv

def parse_finnish_movie_reviews(filename):
    with open(filename, 'r', encoding='utf-8', newline='') as csvfile:  # UTF-8 assumed for the Finnish text
        reader = csv.reader(csvfile,delimiter='|')
        for row in reader:
            assert len(row) == 2
            yield row[0],int(row[-1])

fi_dataset = [*parse_finnish_movie_reviews('/home/akeele/sentiment_detection/finnish_movie_reviews.txt')]
fi_tekstit = [text for text, label in fi_dataset]
fi_leimat = [label for text,label in fi_dataset]

In [54]:
# see the reviews
for t, l in zip(fi_tekstit[:3],fi_leimat[:3]):
    print(l)
    print(t)
    print('\n')

-1
Amatöörimäinen toimintapommi on kivulias kokemus kaksituntisena. Se olisi sitä jopa tunnin mittaisena. Tai lyhärinä.


-1
Elokuvan mainoslause on: Nyt huijataan koko rahan edestä. Totta, koko pääsylipun ja poppareiden hinnan edestä.


1
Vankilasta vapautuneesta naisesta kertova karhea draama osoittaa, ettei ihmisyyden synkälle puolelle rohkeasti katsovan Samppa Batalin Äpärä-esikoinen ollut onnen kantamoinen.




In [55]:
fi_tekstit_vectorized = imdb_vectorizer.transform(fi_tekstit)
fi_pred_labels = imdb_classifier.predict(fi_tekstit_vectorized)

In [56]:
fi_accuracy = accuracy_score(fi_leimat, fi_pred_labels)

print('Accuracy on the Finnish dataset {:.2%}'.format(fi_accuracy))

print("Finnish dataset results")
print(classification_report(fi_leimat, fi_pred_labels))

Accuracy on the Finnish dataset 83.33%
Finnish dataset results
             precision    recall  f1-score   support

         -1       0.88      0.70      0.78        10
          1       0.81      0.93      0.87        14

avg / total       0.84      0.83      0.83        24
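
With only 24 reviews, the 83.33% accuracy (20 correct out of 24) is a rough estimate. As a hedged illustration of how rough, the sketch below computes a 95% Wilson score interval for that proportion; `wilson_interval` is a hypothetical helper, and the interval treats the 24 reviews as a random sample, which is itself an assumption.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives about 95%)."""
    p = correct / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    margin = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return center - margin, center + margin

low, high = wilson_interval(20, 24)
print('{:.1%} to {:.1%}'.format(low, high))
```

The interval comes out at roughly 64% to 93%, so the result on the Finnish reviews is compatible with the 87.42% on the translated test set but does not pin the accuracy down tightly.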



In [60]:
for text, gold_standard, prediction in zip(fi_tekstit,fi_leimat, fi_pred_labels):
    if gold_standard!=prediction:
        print('Gold standard:',gold_standard)
        print('Text:',text)
        print('\n')

Gold standard: -1
Text: Vakooja, joka keitti teetä ja tarjoili keksejä.


Gold standard: -1
Text: Asghar Farhadi kohtaa Lost in Translationin. Kirjaimellisessa ja huonossa mielessä.


Gold standard: -1
Text: Ruotsalainen romanttinen draamakomedia kertoo vaivaannuttavan kankean tarinan viiden pariskunnan riutumisesta rakkauden ohdakkeisella tiellä.


Gold standard: 1
Text: Puhelin soi,teinejä kuolee, valkonaamainen murhaaja juoksentelee takapihalla puukko kädessä...Murhaaja on valinnut uhrinsa jo etukäteen. Vai onko? Kuka tietää... On olemassa surkeita kauhuelokuvia ja on olemassa hyviä kauhuelokuvia. Ongelmana on vain se että nykyään on liian paljon huonoja kauhuelokuvia. Mutta tämä! Scream on upea esimerkki harvinaisista hyvistä kauhuelokuvista. Nykyään kauhuelokuvaissa vain tapetaan ilman kunnon motiivia tai juonta. Mutta Screamissa on se kaikki! Kunnon motiivi ja (uskokaa tai älkää) upea juoni! Näyttelijät ansaitsisivat kylläkin Oscar-palkinnon. Varsinkin Neve Campbell. Hänen kiljum