## Evaluation of models trained on the IMDb movie database with test data from Amazon reviews

### Download resources

Download and prepare the datasets ([1], pages 237-239)

In [1]:
import requests
import os
import tarfile
import re

import sentimental_hwglu.utils as hw_u

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

from nltk.corpus import stopwords

from sentimental_hwglu.naive_sa import NaiveSA

In [2]:
hw_u.prepareIMDBdataset()
df_imdb = hw_u.loadIMDBdataset()
df_amz = hw_u.loadAmazonDataset('/data/zibaldone/projects/ai/betchelorZhanna/data/datasets/All_Amazon_Review.csv')
df_amz = df_amz[df_amz.reviewText.notnull()]

 downloading http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz to /data/zibaldone/projects/ai/betchelorZhanna/data/archives as aclImdb_v1.tar.gz
 File already downloaded
 preparing csv /data/zibaldone/projects/ai/betchelorZhanna/data/datasets/aclImdb_v1.csv file for IMDb in directory /data/zibaldone/projects/ai/betchelorZhanna/data/datasets/aclImdb - seed = 0
 csv file already created


### Training a logistic regression model for document classification

In this section we train a logistic regression model to classify the reviews as _positive_ or _negative_.
We will:
- split the documents into a training set and a test set
- use a `GridSearchCV` object with 5-fold stratified cross-validation to find the optimal parameter combination for the logistic regression model.
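The two steps can be sketched with scikit-learn as follows. This is a minimal illustration, not the notebook's actual training code: the toy corpus, the pipeline step names `vect` and `clf`, and the choice of `C` as the single searched parameter are all assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the reviews (1 = positive, 0 = negative).
subjects = ["movie", "plot", "acting"]
pos_words = ["great", "wonderful", "excellent", "lovely"]
neg_words = ["terrible", "boring", "awful", "dreadful"]
docs = [f"the {s} was {w}" for s in subjects for w in pos_words + neg_words]
labels = [1 if w in pos_words else 0 for s in subjects for w in pos_words + neg_words]

# Stratified split keeps the class balance in both the training and the test part.
X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.25, stratify=labels, random_state=0)

pipeline = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Assumption: only the inverse regularization strength C is searched.
param_grid = {"clf__C": [0.1, 1.0, 10.0]}

grid = GridSearchCV(
    pipeline,
    param_grid,
    scoring="accuracy",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
grid.fit(X_train, y_train)

print("best parameters:", grid.best_params_)
print("cross-validation accuracy: %.3f" % grid.best_score_)
print("test accuracy: %.3f" % grid.score(X_test, y_test))
```

Putting the vectorizer inside the pipeline means it is refitted on each training fold, so no vocabulary or document-frequency information leaks from the validation fold into training.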

In [4]:
def amazon_vs_imdb(df_amz, df_imdb, n_train=50000, n_test=50000):
    """Train on Amazon reviews, test on IMDb reviews; n_train/n_test <= 0 selects all rows."""
    if n_train > 0: X_train, y_train = df_amz.reviewText[:n_train], df_amz.sentiment[:n_train]
    else:           X_train, y_train = df_amz.reviewText,           df_amz.sentiment
    if n_test > 0:  X_test, y_test = df_imdb.reviews[:n_test], df_imdb.sentiment[:n_test]
    else:           X_test, y_test = df_imdb.reviews,          df_imdb.sentiment
    return X_train, X_test, y_train, y_test

def imdb_vs_amazon(df_amz, df_imdb, n_train=50000, n_test=50000):
    """Train on IMDb reviews, test on Amazon reviews; n_train/n_test <= 0 selects all rows."""
    if n_train > 0: X_train, y_train = df_imdb.reviews[:n_train], df_imdb.sentiment[:n_train]
    else:           X_train, y_train = df_imdb.reviews,           df_imdb.sentiment
    if n_test > 0:  X_test, y_test = df_amz.reviewText[:n_test],  df_amz.sentiment[:n_test]
    else:           X_test, y_test = df_amz.reviewText,           df_amz.sentiment
    return X_train, X_test, y_train, y_test
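Both helpers follow the same convention: a positive `n_train` or `n_test` takes the first `n` rows, and any other value takes the whole column. The sketch below shows that convention on two miniature data frames. The helper `first_n_or_all` and the sample rows are hypothetical; only the column names `reviewText`, `reviews` and `sentiment` come from the cell above.

```python
import pandas as pd

def first_n_or_all(texts, labels, n):
    """Return the first n rows, or every row when n <= 0 (hypothetical helper)."""
    if n > 0:
        return texts[:n], labels[:n]
    return texts, labels

# Hypothetical miniature frames using the column names from the notebook.
df_amz = pd.DataFrame({
    "reviewText": ["works fine", "broke after a day", "love it", "not worth it"],
    "sentiment": [1, 0, 1, 0],
})
df_imdb = pd.DataFrame({
    "reviews": ["a moving film", "dull and far too long", "great cast"],
    "sentiment": [1, 0, 1],
})

# Train on all IMDb rows (n = -1), test on the first two Amazon rows (n = 2).
X_train, y_train = first_n_or_all(df_imdb.reviews, df_imdb.sentiment, -1)
X_test, y_test = first_n_or_all(df_amz.reviewText, df_amz.sentiment, 2)

print(len(X_train), len(X_test))
```

Because the slices always take the leading rows, a truncated subset is only representative if the rows are already in random order.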

Here we use only one parameter:

In [None]:
def testNaiveVSFtidVect(df_amz_, df_imdb_, n_train=-1, n_test=-1):
    """Score the naive and the TF-IDF helpers in both train/test directions with both tokenizers."""
    scores = {}
    for name, f in [['train_on_amazon_test_on_imdb', amazon_vs_imdb], ['train_on_imdb_test_on_amazon', imdb_vs_amazon]]:
        print(" --------------------------------")
        print(" function: ", name)
        X_train, X_test, y_train, y_test = f(df_amz_, df_imdb_, n_train=n_train, n_test=n_test)
        # ===================================================================================== #
        for tk_name, tk in [['tokenizer', hw_u.tokenizer], ['porter', hw_u.tokenizer_porter]]:
            # ===================================================================================== #
            score, _ = hw_u.test_naive(X_train, y_train, X_test, y_test, tk=tk, name=name, tk_name=tk_name, model=NaiveSA)
            print(" tokenizer ", tk_name, ' -> ', name, " NAIVE SA score: ", score)
            scores[name + '_' + tk_name + '_' + 'naive'] = score
            # ===================================================================================== #
            score, _ = hw_u.test_FtidfVectorier(X_train, y_train, X_test, y_test, tk=tk, name=name, tk_name=tk_name)
            print(" tokenizer ", tk_name, ' -> ', name, " TfidfVectorizer score: ", score)
            scores[name + '_' + tk_name + '_' + 'FtidfVectorier'] = score
            # ===================================================================================== #

    for k, v in scores.items():
        print("[%s] Test accuracy: %.3f" % (k, v))
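The loop above compares two tokenizers from `sentimental_hwglu.utils`, whose implementations are not shown in this notebook. The following sketch shows what such a pair might look like, assuming a plain whitespace tokenizer and a variant that stems each token with NLTK's `PorterStemmer`. The function bodies are assumptions for illustration and may differ from the module's actual code.

```python
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer(text):
    """Split on whitespace without further normalization (assumed behavior)."""
    return text.split()

def tokenizer_porter(text):
    """Split on whitespace and reduce each token to its Porter stem (assumed behavior)."""
    return [porter.stem(word) for word in text.split()]

print(tokenizer("runners like running"))
print(tokenizer_porter("runners like running"))
```

Stemming maps inflected forms such as "running" and "run" to the same token, which shrinks the vocabulary. Whether that helps or hurts accuracy is what the comparison in the loop measures.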

In [None]:
testNaiveVSFtidVect(df_amz, df_imdb)
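The core idea of the call above, fitting on one review domain and scoring on the other, can be sketched without the project helpers. The function `cross_domain_accuracy` and the toy reviews are hypothetical, and the TF-IDF plus logistic regression pipeline is an assumption about the kind of model being evaluated, not the code of `hw_u.test_FtidfVectorier`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

def cross_domain_accuracy(X_train, y_train, X_test, y_test):
    """Fit on one review domain and return accuracy on another (hypothetical helper)."""
    model = Pipeline([
        ("vect", TfidfVectorizer()),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))

# Hypothetical toy data: movie reviews for training, product reviews for testing.
movie_reviews = [
    "a great and moving film",
    "wonderful acting and a great plot",
    "a terrible and boring film",
    "awful acting and a boring plot",
]
movie_labels = [1, 1, 0, 0]
product_reviews = [
    "great product, works as described",
    "terrible quality, broke after a day",
]
product_labels = [1, 0]

score = cross_domain_accuracy(movie_reviews, movie_labels, product_reviews, product_labels)
print("train_on_imdb_test_on_amazon accuracy: %.3f" % score)
```

A vectorizer fitted on the training domain ignores words that occur only in the test domain, which is one reason cross-domain accuracy can fall below in-domain accuracy.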

### Bibliography

[1] Raschka, Sebastian, Joshua Patterson, and Corey Nolet. _"Machine learning in python: Main developments and technology trends in data science, machine learning, and artificial intelligence."_ Information 11.4 (2020): 193.

[2] Ni, Jianmo, Jiacheng Li, and Julian McAuley. _"Justifying recommendations using distantly-labeled reviews and fine-grained aspects."_ Empirical Methods in Natural Language Processing (EMNLP), 2019.