## The IMDb Movie Database and Sentiment Analysis

In this notebook we use the IMDb movie review dataset by Maas et al. to train a machine learning model for sentiment analysis.

We follow the steps described in [1], Chapter 8.

### Downloading resources

Download and prepare the dataset ([1], pages 237-239).

In [47]:
import requests
import os
import tarfile
import re

import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

import nltk
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords


In [2]:
#### Fetch the data
url = "http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
basedir = "/home/bau/Documents/Zhanna/BechelorArbeit/data/"
outdir = os.path.join(basedir,'datasets')
outdir_zip = os.path.join(basedir,'archives')
dirdatabase = "/home/bau/Documents/Zhanna/BechelorArbeit/data/datasets/aclImdb"
filename = "aclImdb_v1.tar.gz"
csv_filename = os.path.join(outdir, "aclImdb_v1.csv")
csv_filename_clean = os.path.join(outdir, "aclImdb_vclean.csv")

def download_resource(file_url, outputdirectory, nameout):
    print(" downloading {} to {} as {}".format(file_url, outputdirectory, nameout))
    outfile = os.path.join(outputdirectory, nameout)

    if os.path.exists(outfile):
        raise RuntimeError("File already downloaded")

    r = requests.get(file_url, stream=True)
    r.raise_for_status()  # fail early on HTTP errors instead of saving an error page

    with open(outfile, "wb") as of:
        for chunk in r.iter_content(chunk_size=1024):
            # write the archive to disk one chunk at a time
            if chunk:
                of.write(chunk)

def uncompress(filein, outdir):
    print(" uncompressing file {} to {}".format(filein, outdir))
    # the with-statement closes the archive automatically
    with tarfile.open(filein) as file:
        file.extractall(outdir)

def pandas2csv(df, fileout, shuffle=False, seed=0):
    if shuffle:
        np.random.seed(seed)
        df = df.reindex(np.random.permutation(df.index))
    df.to_csv(fileout, index=False)

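def preprocessor(text):
    # drop HTML tags; the closing '>' is not matched here and is turned into
    # a space by the \W+ substitution below
    text = re.sub(r'<[^>]*', '', text)
    # collect emoticons such as :) ;-) =D before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # lowercase, replace non-word characters by spaces, append emoticons without '-'
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    return text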

def aclImdb2csv(dirin, outfile, seed=0):
    print(" preparing csv {} file for IMDb in directory {} - seed = {}".format(outfile, dirin, seed))
    if os.path.exists(outfile):
        raise RuntimeError("csv file already created")
    labels = {"pos": 1, "neg": 0}
    rows = []
    for s in ('test', 'train'):
        for l in ('pos', 'neg'):
            path = os.path.join(dirin, s, l)
            for file in os.listdir(path):
                with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:
                    text = infile.read()
                rows.append([text, labels[l]])
    # build the frame once; concatenating inside the loop copies it on every iteration
    df = pd.DataFrame(rows, columns=['reviews', 'sentiment'])
    df.reviews = df.reviews.apply(preprocessor)
    np.random.seed(seed)
    df = df.reindex(np.random.permutation(df.index))
    df.to_csv(outfile, index=False)

def prepareIMDBdataset():
    try:
        download_resource(url, outdir_zip, filename)
        uncompress(os.path.join(outdir_zip, filename), outdir)
    except Exception as e: print(" " + str(e))
    try:
        aclImdb2csv(dirdatabase, csv_filename)
    except Exception as e: print(" " + str(e))

def loadIMDBdataset(filename=csv_filename_clean):
    df = pd.read_csv(filename)
    return df

porter = PorterStemmer()

def tokenizer(text):
    return text.split()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]



In [3]:
prepareIMDBdataset()
df = loadIMDBdataset()

 downloading http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz to /home/bau/Documents/Zhanna/BechelorArbeit/data/archives as aclImdb_v1.tar.gz
 File already downloaded
 preparing csv /home/bau/Documents/Zhanna/BechelorArbeit/data/datasets/aclImdb_v1.csv file for IMDb in directory /home/bau/Documents/Zhanna/BechelorArbeit/data/datasets/aclImdb - seed = 0
 csv file already created


In [4]:
df.head()

|   | reviews | sentiment |
|---|---------|-----------|
| 0 | ahh i didn t order no amazing hit show we ll ... | 1 |
| 1 | this episode has just aired in the uk what a d... | 0 |
| 2 | this movie is quite possibly one of the most h... | 0 |
| 3 | when you see the cover of the dvd you re convi... | 1 |
| 4 | i found the documentary entitled fast cheap an... | 0 |


### Words to vectors: bag-of-words and feature vectors

The *bag-of-words* model lets us represent text as a numerical feature vector.
The idea behind the model is simple:
- We build a _vocabulary_ of unique _tokens_ (e.g. words) from the whole set of documents.
- For each document we construct a feature vector that contains how often each word occurs in that document.
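
The two steps can be sketched in plain Python. This is only an illustration under simple assumptions (lowercasing, whitespace tokenization, alphabetical vocabulary order); the helper name `to_count_vector` is hypothetical and not part of any library.

```python
from collections import Counter

docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining and the weather is sweet',
]

# Tokenize: lowercase and split on whitespace (a deliberately simple tokenizer).
tokenized = [doc.lower().split() for doc in docs]

# Vocabulary: every unique token gets an integer index (alphabetical order here).
unique_tokens = sorted({token for tokens in tokenized for token in tokens})
vocabulary = {token: i for i, token in enumerate(unique_tokens)}

def to_count_vector(tokens, vocabulary):
    """Return the per-token counts of one document, in vocabulary order."""
    counts = Counter(tokens)
    return [counts.get(token, 0) for token in vocabulary]

bag = [to_count_vector(tokens, vocabulary) for tokens in tokenized]
print(vocabulary)
for row in bag:
    print(row)
```

Each row has one entry per vocabulary token, so most entries are zero for short documents; this is why such vectors are usually stored as sparse matrices.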

#### Feature vector playground

##### Bag-of-words
We use the _bag-of-words_ model implemented in **scikit-learn**, the `CountVectorizer` class.

In [5]:
count = CountVectorizer()
docs = np.array([
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining and the weather is sweet'
])

bag = count.fit_transform(docs)
print("The generated vocabulary is:")
print(count.vocabulary_)
print("The vectorized version of the documents is:")
print(bag.toarray())

The generated vocabulary is:
{'the': 5, 'sun': 3, 'is': 1, 'shining': 2, 'weather': 6, 'sweet': 4, 'and': 0}
The vectorized version of the documents is:
[[0 1 1 1 0 1 0]
 [0 1 0 0 1 1 1]
 [1 2 1 1 1 2 1]]


##### Tf-idf measure: word relevance

The _term frequency-inverse document frequency_ (tf-idf) is used to assess the relevance of a word.
It is defined as:
$$\text{tf-idf}(t, d) = tf(t, d) \cdot idf(t, d),$$
where
- $tf(t, d)$ is the raw term frequency, i.e. the number of times the term $t$ occurs in document $d$ (e.g. 2 for the word _is_ in _The sun is shining and the weather is sweet_),
- $idf(t, d)$ is the inverse document frequency, i.e.
$$idf(t, d) = \log\frac{n_d}{1 + df(d, t)}$$
- $n_d$ is the total number of documents,
- $df(d, t)$ is the number of documents $d$ that contain the term $t$.

In `scikit-learn` we use the `TfidfTransformer` class.

In [6]:
tfidf = TfidfTransformer()
np.set_printoptions(precision=2)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

[[0.   0.43 0.56 0.56 0.   0.43 0.  ]
 [0.   0.43 0.   0.   0.56 0.43 0.56]
 [0.4  0.48 0.31 0.31 0.31 0.48 0.31]]
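
The printed values do not follow directly from the plain formula, because `TfidfTransformer`, to our understanding of its default settings, uses a smoothed idf, $\ln\frac{1 + n_d}{1 + df(d, t)} + 1$, and then scales each document vector to unit L2 length. The sketch below recomputes the third row by hand under that assumption; the variable names are ours.

```python
import math

# Raw counts of the third document and document frequencies, read off the
# CountVectorizer output (column order: and, is, shining, sun, sweet, the, weather).
tf_doc3 = [1, 2, 1, 1, 1, 2, 1]
df_counts = [1, 3, 2, 2, 2, 3, 2]
n_docs = 3

# Smoothed idf (assumed scikit-learn default): ln((1 + n_d) / (1 + df)) + 1
idf = [math.log((1 + n_docs) / (1 + df_t)) + 1 for df_t in df_counts]

# Unnormalized tf-idf, followed by L2 normalization of the document vector.
raw = [tf * w for tf, w in zip(tf_doc3, idf)]
norm = math.sqrt(sum(v * v for v in raw))
tfidf_doc3 = [round(v / norm, 2) for v in raw]
print(tfidf_doc3)
```

The word _is_ occurs in every document, so its idf is the smallest one, yet it still receives a comparatively large weight in the third document because it occurs twice there.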


#### Cleaning the text data

We have to clean the text and remove all unwanted characters.
For example, the reviews contain `html` tags:

In [7]:
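m = 30  # number of context characters shown on each side of the tag
for doc_n, doc in enumerate(df.reviews):
    n = doc.find('<br')
    if n >= 0:
        # print the document number, with a leading ellipsis if the text is cut on the left
        print(str(doc_n) + ")" + ("..." if n > m else ''), end='')
        print(doc[max(n-m, 0): n + m] + " ...")
        break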

In [8]:
preprocessor('</a>This :) is :( a test :-)!')

' this is a test :) :( :)'
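
The leading space in the result comes from the `>` of `</a>`, which the tag pattern leaves behind and the `\W+` substitution then turns into a space. A possible variant that removes complete tags is sketched below; `preprocessor_strict` is a hypothetical name, and this is not necessarily the cleaning used to produce the saved CSV file.

```python
import re

def preprocessor_strict(text):
    # Hypothetical variant: remove complete tags, including the closing '>'.
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons before punctuation is stripped.
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, replace runs of non-word characters by one space,
    # then append the emoticons with the '-' nose removed.
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    return text

result = preprocessor_strict('</a>This :) is :( a test :-)!')
print(repr(result))
```

Note that removing a tag without inserting a space can join two words that were separated only by the tag, so the choice between the two patterns is a trade-off.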

In [9]:
# df.reviews = df.reviews.apply(preprocessor)
# pandas2csv(df, csv_filename_clean)

In [10]:
# df = loadIMDBdataset(csv_filename_clean)
# df.head()

#### Splitting documents into tokens

To split a sentence into its tokens we use the `NLTK` library. Its `PorterStemmer` reduces each word to its stem.

For example:

In [11]:
tokenizer_porter('I was running in the wood')

['i', 'wa', 'run', 'in', 'the', 'wood']

##### Stop words

Stop words are words that are extremely common in all kinds of text: they carry little or no information that is useful for classifying documents.
Examples are _"is"_, _"and"_, _"has"_, and so on.

We use the stop words provided by the NLTK library:

In [12]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/bau/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [13]:
stop = stopwords.words('english')
[w for w in tokenizer_porter('a runner likes running and runs a lot') if w not in stop]

['runner', 'like', 'run', 'run', 'lot']
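
The filtering step itself does not depend on NLTK. The sketch below uses a hypothetical miniature stop-word set (`stop_small`) and a hypothetical helper (`remove_stop_words`) and skips stemming, so the surviving tokens keep their original form.

```python
# Hypothetical miniature stop-word list; the notebook uses NLTK's English list instead.
stop_small = {'a', 'and', 'is', 'the', 'has'}

def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in the stop-word collection."""
    return [t for t in tokens if t not in stop_words]

tokens = 'a runner likes running and runs a lot'.split()
filtered = remove_stop_words(tokens, stop_small)
print(filtered)
```

Storing the stop words in a `set` rather than a list makes each membership test a constant-time lookup, which matters when the filter runs over tens of thousands of reviews.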

### Training a logistic regression model for document classification

In this section we train a logistic regression model to classify the movie reviews as _positive_ or _negative_.
We will:
- split the documents into a training set and a test set,
- use a `GridSearchCV` object with 5-fold stratified cross-validation to find the best parameter combination for the logistic regression model.
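
To make "stratified" concrete, the following sketch deals the indices of each class round-robin over the folds, so that every fold keeps roughly the class proportions of the full set. It is a simplified illustration with hypothetical labels and a hypothetical helper `stratified_folds`; scikit-learn's own splitter may assign the indices differently.

```python
from collections import defaultdict

def stratified_folds(labels, n_folds):
    """Assign each sample index to a fold so that class proportions stay similar."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        # deal the indices of each class round-robin over the folds
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds

# Hypothetical labels: 10 positive and 10 negative reviews.
labels = [1, 0] * 10
folds = stratified_folds(labels, n_folds=5)
for i, test_idx in enumerate(folds):
    train_idx = [j for j in range(len(labels)) if j not in test_idx]
    n_pos = sum(labels[j] for j in test_idx)
    print("fold", i, "test size", len(test_idx), "positives", n_pos, "train size", len(train_idx))
```

In each of the five rounds one fold serves as validation data and the other four as training data; the reported score is the mean over the rounds.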

In [48]:
# fraction of the documents held out for testing (the remaining 10% are used for training)
split_percentage_tests = 0.9
X_train, X_test, y_train, y_test = train_test_split(df.reviews, df.sentiment, test_size=split_percentage_tests, random_state=42)

In [15]:
tfidf = TfidfVectorizer(strip_accents=None, lowercase=False, preprocessor=None)
param_grid = [
    {
        'vect__ngram_range': [(1, 1)],
        'vect__stop_words':  [stop, None],
        'vect__tokenizer':   [tokenizer, tokenizer_porter],
        'clf__penalty':      ['l1', 'l2'],
        'clf__C':            [1.0, 10.0, 100.0],
        # 'vect__verbose':     [True],
        # 'clf__verbose':      [True],
    },
    {
        'vect__ngram_range': [(1, 1)],
        'vect__stop_words':  [stop, None],
        'vect__tokenizer':   [tokenizer, tokenizer_porter],
        'vect__use_idf':     [False],
        'vect__norm':        [None],
        'clf__penalty':      ['l1', 'l2'],
        # 'clf__C':            [1.0, 10.0, 100.0],
        # 'clf__verbose':      [True],
    },
]

# 'liblinear' supports both the 'l1' and the 'l2' penalty used in the grid
lr_tfidf = Pipeline([('vect', tfidf), ('clf', LogisticRegression(random_state=0, solver='liblinear'))])

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid, scoring='accuracy', cv=5, verbose=1, n_jobs=-1)

Now we can train the model:

In [16]:
# gs_lr_tfidf.fit(X_train, y_train)
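
Since the full search is expensive, here is a small self-contained sketch of how a fitted `GridSearchCV` object can be inspected. The toy corpus, the reduced grid and `cv=3` are assumptions made only to keep the example fast; they are not the settings used for the IMDb data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 6 positive and 6 negative one-line "reviews".
texts = ["great film", "loved it", "wonderful acting",
         "a great story", "really loved this", "wonderful film",
         "bad film", "hated it", "terrible acting",
         "a bad story", "really hated this", "terrible film"]
labels = [1] * 6 + [0] * 6

pipe = Pipeline([('vect', TfidfVectorizer()),
                 ('clf', LogisticRegression(random_state=0, solver='liblinear'))])
grid = {'vect__ngram_range': [(1, 1), (1, 2)], 'clf__C': [1.0, 10.0]}

gs = GridSearchCV(pipe, grid, scoring='accuracy', cv=3)
gs.fit(texts, labels)

print("best parameters:", gs.best_params_)
print("best CV accuracy: %.3f" % gs.best_score_)
clf = gs.best_estimator_   # pipeline refitted on all data with the best parameters
prediction = clf.predict(["a wonderful story"])
print(prediction)
```

Parameter names follow the `step__parameter` convention of `Pipeline`, which is why the grid keys start with `vect__` and `clf__`.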

Here we use a single parameter combination only:

In [49]:
tfidf = TfidfVectorizer(strip_accents=None, lowercase=False, preprocessor=None)

params = {
        'vect__ngram_range': (1, 1),
        'vect__stop_words':  stop,
        'vect__tokenizer':   tokenizer,
        'clf__penalty':      'l2',
        'clf__C':            10.0,
        # 'vect__verbose':     [True],
        # 'clf__verbose':      [True],
    }

lr_tfidf = Pipeline([('vect', tfidf), ('clf', LogisticRegression(random_state=0))])
lr_tfidf.set_params(**params)

lr_tfidf.fit(X_train, y_train)



In [50]:
score = lr_tfidf.score(X_test, y_test)
print("Test accuracy: %.3f" % score)

Test accuracy: 0.872


In [69]:
n_errors = 0
n_docs = len(X_test)
for k in range(n_docs):
    doc = X_test[k:k+1]                 # one-element Series, keeps the original index
    label = y_test[k:k+1].values[0]
    if lr_tfidf.predict(doc)[0] != label:
        # print class probabilities, true label and the misclassified review
        probs = lr_tfidf.predict_proba(doc)
        print(probs, ") ", label, " => ", doc)
        n_errors += 1
print("n_errors: ", n_errors)
print("score:    ", 1 - n_errors / n_docs)

[[0.72 0.28]] )  1  =>  39489    let me start out by saying i m a big carrey fa...
Name: reviews, dtype: object
[[0.5 0.5]] )  0  =>  12609    this is the first non zombie subgenre review i...
Name: reviews, dtype: object
[[0.43 0.57]] )  0  =>  15118    eaten alive follows a young woman janet agren ...
Name: reviews, dtype: object
[[0.5 0.5]] )  0  =>  42294    if good intentions were enough to produce a go...
Name: reviews, dtype: object
[[0.52 0.48]] )  1  =>  24712    this has to be one of the best and most useful...
Name: reviews, dtype: object
[[0.42 0.58]] )  0  =>  36056    friday night with jonathan ross must have thos...
Name: reviews, dtype: object
[[0.89 0.11]] )  1  =>  48224    i d honestly give this movie a solid 7 5 but i...
Name: reviews, dtype: object
[[0.22 0.78]] )  0  =>  3928    i m one of the millions of columbo addicts all...
Name: reviews, dtype: object
[[0.03 0.97]] )  0  =>  29600    well i watch tons of movies and this one reall...
Name: reviews, dtype: object

KeyboardInterrupt: 

In [65]:
# lr_tfidf.predict_proba(X_test[k:k+1])
print(lr_tfidf.predict(X_test[k:k+1])[0])
print(y_test[k:k+1].values[0])


0
0
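
Calling `predict` once per review is slow. The error count and the score can instead be derived from one batched prediction, as sketched below. The two arrays are hypothetical stand-ins for `y_test.values` and `lr_tfidf.predict(X_test)`; they are not results from the IMDb data.

```python
import numpy as np

# Hypothetical stand-ins for y_test.values and lr_tfidf.predict(X_test).
y_true = np.array([1, 0, 0, 1, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0, 1, 0])

wrong = np.flatnonzero(y_pred != y_true)   # positions of the misclassified documents
n_errors = wrong.size
accuracy = 1 - n_errors / y_true.size

print("misclassified positions:", wrong)
print("n_errors:", n_errors)
print("score:   ", accuracy)
```

The positions in `wrong` can then be used with `X_test.iloc[wrong]` to look at the misclassified reviews themselves.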


### Bibliography

[1] Raschka, Sebastian, Joshua Patterson, and Corey Nolet. _"Machine learning in python: Main developments and technology trends in data science, machine learning, and artificial intelligence."_ Information 11.4 (2020): 193.