# Beyond accuracy evaluation (bacceval)

## Notes

### Metrics

 * Diversity at 100 (div@100): we keep only the first 100 items of the ranking and compute their diversity in terms of topic, TFIDF and style, i.e. the mean of the pairwise distances: $$diversity(R) = \frac{\sum_{i=1}^{|R|}\sum_{j=i+1}^{|R|} dist(R_i, R_j)}{\frac{{|R|}^2-{|R|}}{2}}$$
 where $R$ is a recommendation list (the score is then averaged over all lists). We keep only the first 100 items because with all 1000, every model would get the same diversity (the full rankings presumably contain the same candidates, and a mean of pairwise distances ignores order).
 We use the TFIDF, style and topic vector representations with the cosine distance. A fourth diversity is based on the mean of Jaccard distances. Note that this distance is not [the extension to n sets described on Wikipedia](https://fr.wikipedia.org/wiki/Indice_et_distance_de_Jaccard) (the intersection of many documents would simply be an empty set or a set of stop words...) but the mean of the pairwise distances, as in this [article](https://sci-hub.tw/https://ieeexplore.ieee.org/abstract/document/4812525) and [this one](http://www.l3s.de/~siersdorfer/sources/2012/fp055-deng.pdf) (refined diversity Jaccard):
 $$JD(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}$$ where $A$ and $B$ are the sets of words of item A and item B. TODO: state whether stop words are removed.
 * Novelty at 100 (nov@100): same idea, but between the user history $H$ and $R$.
 $$novelty(R, H) = \frac{\sum_{i=1}^{|R|}\sum_{j=1}^{|H|} dist(R_i, H_j)}{|R| \cdot |H|}$$
 * Strict novelty at 100 (snov@100): same idea, but for each recommended item we keep only the minimum distance to the history, $mindist(R_i, H) = \min_{j} dist(R_i, H_j)$.
 $$strictnovelty(R, H) = \frac{\sum_{i=1}^{|R|} mindist(R_i, H)}{|R|}$$
 * Serendipity at 100 (ser@100): the ratio of relevant items that the evaluated model recommended and the primitive model did not. With $R$ the recommendation set of the evaluated model, $P$ the recommendation set of the primitive model and $T$ the set of relevant items, and for cases where $T \setminus P \neq \emptyset$, we define serendipity as:
 $$serendipity(R, P, T) = \frac{|R \cap (T \setminus P)|}{|T \setminus P|}$$
 Cases where $T \setminus P = \emptyset$ are not relevant because the primitive model already predicted all relevant items, so no model can be serendipitous. These cases are excluded from the average over all users (TODO: give the percentage of cases where $T \setminus P = \emptyset$).
 There are two primitive models. The first is the TFIDF model with historyRef=1, lowercasing and lemmatization. The second takes the set of words of the history, without stop words, and looks for the best Jaccard similarity among the candidates.
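
The novelty, strict novelty and serendipity formulas above are not implemented in this notebook yet (their metric dictionaries are empty). A minimal pure-Python sketch of what they compute, assuming a generic `dist` callable; the function names and the toy 1-D data are hypothetical and only serve as a worked example:

```python
def novelty(R, H, dist):
    """Mean distance between every recommended item and every history item."""
    return sum(dist(r, h) for r in R for h in H) / (len(R) * len(H))

def strict_novelty(R, H, dist):
    """Mean, over recommended items, of the distance to the closest history item."""
    return sum(min(dist(r, h) for h in H) for r in R) / len(R)

def serendipity(R, P, T):
    """Share of the relevant items missed by the primitive model P that R contains."""
    unexpected = set(T) - set(P)
    if not unexpected:
        # Undefined case: the primitive model already found every relevant item,
        # so this user is skipped when averaging.
        return None
    return len(set(R) & unexpected) / len(unexpected)

# Toy 1-D "items" with the absolute difference as distance (hypothetical data).
toy_dist = lambda a, b: abs(a - b)
toy_R, toy_H = [0.0, 1.0, 3.0], [1.0, 2.0]

print(novelty(toy_R, toy_H, toy_dist))         # (1 + 2 + 0 + 1 + 2 + 1) / 6
print(strict_novelty(toy_R, toy_H, toy_dist))  # (1 + 0 + 1) / 3
print(serendipity(["a", "b", "c"], ["a"], ["a", "b", "d"]))  # |{b}| / |{b, d}|
```

Returning `None` for the undefined serendipity case keeps it easy to filter those users out before taking the mean.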

### TODO

 * Do not normalize until the very end, once we have ALL the diversities, including the diversity of random.

### Why does dbert-ft make it possible to maximize both diversity and accuracy?

 * It yields a vector space that is more distant, less semantic, and more focused on weakly semantic sequences.
 * The model is therefore very complementary, since it manages to find articles that interest the user purely through style but that diverge in terms of topic...
 * We can give part of an explanation by mentioning the TFIDF focus (non-specificity) that I mentioned in the style chapter.
 * Another explanation through serendipity?
 * Or because it recognizes the sources? TODO: look at the inter-user source overlap. Build a baseline that recommends the majority source?


## Commands

In [None]:
# oomstopper --no-tail bacceval ; killbill bacceval ; cd ~/twinews-logs ; jupython -o nohup-bacceval-$HOSTNAME-$(date +%Y-%m-%d.%M-%S).out --venv st-venv ~/Workspace/Python/Datasets/Twinews/twinews/evaluation/bacceval.ipynb

In [None]:
# Other runs:
# oomstopper --no-tail bacceval ; cd ~/twinews-logs ; jupython -o nohup-bacceval-$HOSTNAME-$(date +%Y-%m-%d.%M-%S).out --venv st-venv ~/Workspace/Python/Datasets/Twinews/twinews/evaluation/bacceval.ipynb

In [None]:
# Multiple runs:
# oomstopper --no-tail bacceval ; killbill bacceval ; cd ~/twinews-logs ; jupython --no-tail -o nohup-bacceval-$HOSTNAME-$(date +%Y-%m-%d.%M-%S).out --venv st-venv ~/Workspace/Python/Datasets/Twinews/twinews/evaluation/bacceval.ipynb ; sleep 30 ; oomstopper --no-tail bacceval ; cd ~/twinews-logs ; jupython --no-tail -o nohup-bacceval-$HOSTNAME-$(date +%Y-%m-%d.%M-%S).out --venv st-venv ~/Workspace/Python/Datasets/Twinews/twinews/evaluation/bacceval.ipynb ; sleep 30 ; oomstopper --no-tail bacceval ; cd ~/twinews-logs ; jupython --no-tail -o nohup-bacceval-$HOSTNAME-$(date +%Y-%m-%d.%M-%S).out --venv st-venv ~/Workspace/Python/Datasets/Twinews/twinews/evaluation/bacceval.ipynb ; sleep 30 ; oomstopper --no-tail bacceval ; cd ~/twinews-logs ; jupython --no-tail -o nohup-bacceval-$HOSTNAME-$(date +%Y-%m-%d.%M-%S).out --venv st-venv ~/Workspace/Python/Datasets/Twinews/twinews/evaluation/bacceval.ipynb ; sleep 30 ; oomstopper --no-tail bacceval ; cd ~/twinews-logs ; jupython --no-tail -o nohup-bacceval-$HOSTNAME-$(date +%Y-%m-%d.%M-%S).out --venv st-venv ~/Workspace/Python/Datasets/Twinews/twinews/evaluation/bacceval.ipynb ; sleep 30 ; oomstopper --no-tail bacceval ; cd ~/twinews-logs ; jupython --no-tail -o nohup-bacceval-$HOSTNAME-$(date +%Y-%m-%d.%M-%S).out --venv st-venv ~/Workspace/Python/Datasets/Twinews/twinews/evaluation/bacceval.ipynb

## Imports

In [1]:
import os ; os.environ["CUDA_VISIBLE_DEVICES"] = ""
isNotebook = '__file__' not in locals()
TEST = isNotebook

In [2]:
from systemtools.hayj import *
from systemtools.location import *
from systemtools.basics import *
from systemtools.file import *
from systemtools.printer import *
from databasetools.mongo import *
from datastructuretools.cache import *
from newstools.goodarticle.utils import *
from nlptools.preprocessing import *
from nlptools.news import parser as newsParser
from machinelearning.iterator import *
from twinews.utils import *
from twinews.evaluation import metrics
from twinews.evaluation.utils import *
from twinews.models.genericutils import *
from twinews.models.ranking import *
import time
import numpy as np  # used below for np.mean; made explicit rather than relying on star imports
import pymongo

['.',
 'the',
 ',',
 'to',
 'and',
 'a',
 'of',
 'in',
 'for',
 'on',
 'that',
 'is',
 'with',
 '-',
 'it',
 'at',
 'as',
 'from',
 '"',
 'be',
 'by',
 'this',
 'have',
 'an',
 'are',
 'but',
 'has',
 'was',
 'not',
 '__int_2__',
 'they',
 'more',
 'or',
 'who',
 'one',
 'their',
 'about',
 'we',
 'will',
 'said',
 'which',
 'all',
 'also',
 '__int_4__',
 'up',
 'when',
 'been',
 'out',
 'can',
 ':',
 'he',
 'there',
 '(',
 'do',
 'than',
 'what',
 'new',
 'if',
 'other',
 'so',
 'time',
 'would',
 'were',
 'i',
 'you',
 'after',
 'people',
 'had',
 'some',
 ')',
 'into',
 'like',
 'his',
 'its',
 'just',
 'over',
 'first',
 'year',
 'no',
 'them',
 'two',
 'years',
 'could',
 'our',
 'how',
 'now',
 '__int_1__',
 'most',
 'only',
 'those',
 'because',
 'many',
 "'",
 'while',
 'get',
 'make',
 'last',
 'even',
 'where',
 'these',
 'did',
 'before',
 'through',
 'way',
 '__int_3__',
 '?',
 'being',
 'any',
 'work',
 'well',
 'then',
 'much',
 'made',
 'back',
 'take',
 'she',
 'may',
 

## Init

In [None]:
# Defining logger:
logger = Logger(tmpDir('logs') + "/bacceval.log") if isNotebook else Logger("bacceval-" + getHostname() + "-" + getDateSec() + ".log")
tt = TicToc(logger=logger)
tt.tic()

In [None]:
# Making the cache, a dict-like object ((cacheKey, url) --> vector) that keeps data until only 2 GB of RAM are free:
genericCaches = dict()
newsCollection = getNewsCollection(logger=logger)
def getter(key, logger=None, verbose=True):
    global newsCollection
    global genericCaches
    global genericFields
    if newsCollection is None:
        newsCollection = getNewsCollection(logger=logger, verbose=verbose)
    cacheKey, url = key
    field = genericFields[cacheKey]
    if cacheKey in genericCaches:
        genericCache = genericCaches[cacheKey]
    else:
        genericCache = getGenericCache(cacheKey, logger=logger, verbose=verbose)
        genericCaches[cacheKey] = genericCache
    row = newsCollection.findOne({'url': url}, projection={field: True})
    theHash = objectToHash(row[field])
    return genericCache[theHash]

In [None]:
# We define primitive models and cache keys:
pmodels = {1: "tfidf-4b89a", 2: "tfidf-71fb5"}
cacheKeys = {"tfidf", "dbert-ft", "nmf"}

In [None]:
# Making the cache instance (don't forget to purge it at the end):
cache = Cache(getter, logger=logger)
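
To illustrate the idea of a dict-like cache driven by a getter, here is a minimal sketch. It is not the `Cache` class from `datastructuretools`: `LazyCache`, `max_items` and `toy_getter` are hypothetical names, and a simple item bound stands in for the free-RAM check.

```python
class LazyCache:
    """Dict-like object that calls `getter(key)` on a miss and memoises the result."""

    def __init__(self, getter, max_items=None):
        self.getter = getter
        self.max_items = max_items  # hypothetical bound, replaces the free-RAM check
        self.data = {}

    def __getitem__(self, key):
        if key not in self.data:
            if self.max_items is not None and len(self.data) >= self.max_items:
                # Evict the oldest entry (dicts keep insertion order in Python 3.7+).
                self.data.pop(next(iter(self.data)))
            self.data[key] = self.getter(key)
        return self.data[key]

    def purge(self):
        self.data.clear()

calls = []

def toy_getter(key):
    # Stand-in for the database lookup: records the call and builds a fake value.
    calls.append(key)
    model, url = key
    return model + ":" + url

toy_cache = LazyCache(toy_getter, max_items=2)
toy_cache[("tfidf", "u1")]
toy_cache[("tfidf", "u1")]  # second access is served from memory
print(len(calls))           # the getter ran only once
```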

In [None]:
# We get scores collection and the rankings GridFS:
twinewsScores = getTwinewsScores(logger=logger)
twinewsRankings = getTwinewsRankings(logger=logger)

In [None]:
def cacheForceFeeding(model, maxItems=None, logger=None, verbose=True):
    global cache
    global newsCollection
    urls = list(newsCollection.distinct("url"))
    if maxItems is not None:
        urls = urls[:maxItems]
    for i, url in enumerate(pb(urls, printRatio=0.01, message="Force-feeding the cache...",
                               logger=logger, verbose=verbose)):
        # Accessing the key is enough to load the vector into the cache:
        cache[(model, url)]
        # We check the free RAM every 1000 items:
        if i % 1000 == 0 and freeRAM() < 2:
            logWarning("Stopping because no RAM left.", logger)
            break

## Diversity

In [None]:
def basicDistPrint(url1, url2, dist, prob=1.0, logger=None, verbose=True):
    # Debug helper: prints the word sets (and their intersection) of pairs
    # whose distance is unusually high or low.
    if verbose:
        if getRandomFloat() < prob and (dist >= 0.96 or dist < 0.84):
            log(dist, logger)
            # t1 = getNewsField(url1, 'detokText')
            # t2 = getNewsField(url2, 'detokText')
            t1Words = set(flattenLists(getNewsField(url1, 'sentences', verbose=False)))
            t2Words = set(flattenLists(getNewsField(url2, 'sentences', verbose=False)))
            inter = t1Words.intersection(t2Words)
            log("-" * 20, logger)
            bp(t1Words, 4, logger)
            log("-" * 20, logger)
            bp(t2Words, 4, logger)
            log("-" * 20, logger)
            bp(inter, 5, logger)
            log(len(inter), logger)
            log("#" * 20, logger)
            log("#" * 20, logger)

In [None]:
def tfidfDiversityAt100(urls, logger=None, verbose=False):
    return diversity(urls, 'tfidf', at=100, logger=logger, verbose=verbose)
def styleDiversityAt100(urls, logger=None, verbose=False):
    return diversity(urls, 'dbert-ft', at=100, logger=logger, verbose=verbose)
def topicDiversityAt100(urls, logger=None, verbose=False):
    return diversity(urls, 'nmf', at=100, logger=logger, verbose=verbose)
def diversity(urls, model, at=100, distance="cosine", logger=None, verbose=False):
    global cache
    assert isinstance(urls, list)
    urls = urls[:at]
    assert len(urls) == at
    vectors = vstack([cache[(model, url)] for url in urls])
    distances = getDistances(vectors, vectors, metric=distance, verbose=False)
    pairwiseCount = 0
    distSum = 0
    for i in range(at):
        for u in range(i+1, at):
            dist = distances[i][u]
            distSum += dist
            pairwiseCount += 1
            basicDistPrint(urls[i], urls[u], dist, verbose=TEST and verbose, logger=logger)
    assert pairwiseCount == (at**2 - at) / 2 # \frac{{|R|}^2-{|R|}}{2}
    return distSum / pairwiseCount
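
As a sanity check of the computation above, the same mean pairwise cosine distance can be sketched in pure Python on toy vectors. The names `cosine_distance`, `mean_pairwise_distance` and `toy_vectors` are hypothetical; the real function works on cached vectors and `getDistances`.

```python
import math
from itertools import combinations

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def mean_pairwise_distance(vectors, at=100):
    top = vectors[:at]
    # Each unordered pair is counted once: (n^2 - n) / 2 pairs for n items.
    pairs = list(combinations(top, 2))
    assert len(pairs) == (len(top) ** 2 - len(top)) // 2
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)

# Two identical vectors and one orthogonal vector (hypothetical data):
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
print(mean_pairwise_distance(toy_vectors))  # mean of the distances 1, 0 and 1
```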

## Jaccard diversity

In [None]:
def swJaccardRepr(url, *args, **kwargs):
    return __jaccardRepr(url, 200, *args, **kwargs)
def jaccardRepr(url, *args, **kwargs):
    return __jaccardRepr(url, 0, *args, **kwargs)
def __jaccardRepr\
(
    url,
    stopWordAmount,
    lowercase=True,
    logger=None, verbose=True,
):
    global newsCollection
    global STOP_WORDS
    assert '__int_1__' in STOP_WORDS
    if stopWordAmount is None or stopWordAmount == 0:
        sw = None
    else:
        sw = set(STOP_WORDS[:stopWordAmount])
    sentences = getNewsField(url, 'sentences', verbose=False)
    tokens = flattenLists(sentences)
    if lowercase:
        tokens = [e.lower() for e in tokens]
    tokens = set(tokens)
    if sw is not None and len(sw) > 0:
        tokens = set([e for e in tokens if e not in sw])
    return tokens
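
A small worked example of the word-set representation, assuming tokenized sentences are lists of lists of strings. `word_set`, `toy_stop_words` and `toy_sentences` are hypothetical; the toy stop words are simply the first six entries of the list printed in the Imports section.

```python
def word_set(sentences, stop_words=None, stop_word_amount=0, lowercase=True):
    # Flatten the sentences into a single list of tokens.
    tokens = [tok for sent in sentences for tok in sent]
    if lowercase:
        tokens = [tok.lower() for tok in tokens]
    tokens = set(tokens)
    # Drop the `stop_word_amount` most frequent stop words, if requested.
    if stop_words and stop_word_amount:
        tokens -= set(stop_words[:stop_word_amount])
    return tokens

toy_stop_words = [".", "the", ",", "to", "and", "a"]
toy_sentences = [["The", "cat", "sat", "."], ["A", "dog", "and", "the", "cat", "."]]

print(sorted(word_set(toy_sentences)))                        # keeps stop words and punctuation
print(sorted(word_set(toy_sentences, toy_stop_words, 6)))     # only content words remain
```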

In [None]:
def jaccardDistance(url1, url2, cache):
    words1 = cache[url1]
    words2 = cache[url2]
    return 1 - len(words1.intersection(words2)) / len((words1.union(words2)))

In [None]:
def swJaccardDiversityAt100(urls, logger=None, verbose=False):
    return jaccardDiversity(urls, True, at=100, logger=logger, verbose=verbose)
def jaccardDiversityAt100(urls, logger=None, verbose=False):
    return jaccardDiversity(urls, False, at=100, logger=logger, verbose=verbose)
def jaccardDiversity(urls, useSW, at=100, logger=None, verbose=False):
    global swJaccardCache
    global jaccardCache
    assert isinstance(urls, list)
    urls = urls[:at]
    assert len(urls) == at
    pairwiseCount = 0
    distSum = 0
    for i in range(at):
        for u in range(i+1, at):
            dist = jaccardDistance(urls[i], urls[u], swJaccardCache if useSW else jaccardCache)
            distSum += dist
            pairwiseCount += 1
            basicDistPrint(urls[i], urls[u], dist, verbose=TEST and verbose, logger=logger)
    assert pairwiseCount == (at**2 - at) / 2 # \frac{{|R|}^2-{|R|}}{2}
    return distSum / pairwiseCount
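
The reason for averaging pairwise Jaccard distances instead of using the n-set extension can be seen on a toy example: as soon as no word is shared by all documents, the n-set distance saturates at 1, while the pairwise mean still reflects partial overlaps. The function names and `docs` below are hypothetical.

```python
from itertools import combinations

def jaccard_distance(a, b):
    return 1 - len(a & b) / len(a | b)

def mean_pairwise_jaccard(sets):
    pairs = list(combinations(sets, 2))
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)

def n_set_jaccard(sets):
    # Extension to n sets: intersection of all sets over union of all sets.
    return 1 - len(set.intersection(*sets)) / len(set.union(*sets))

# Every pair of documents shares one word, but no word appears in all three.
docs = [{"cat", "dog"}, {"cat", "bird"}, {"dog", "bird"}]
print(mean_pairwise_jaccard(docs))  # about 0.667: each pair shares 1 word out of 3
print(n_set_jaccard(docs))          # 1.0: the global intersection is empty
```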

In [None]:
swJaccardCache = Cache(swJaccardRepr, logger=logger)
jaccardCache = Cache(jaccardRepr, logger=logger)

## Continuous evaluation

In [None]:
# We init an eval data cache:
def evalDataGetter(splitVersion, logger=None, verbose=True):
    log("Downloading eval data version " + str(splitVersion) + "...", logger, verbose=verbose)
    return getEvalData(splitVersion, logger=logger, verbose=verbose, maxExtraNews=0)
evalDataCache = Cache(evalDataGetter, logger=logger)

In [None]:
# Misc params:
iterations = 1 if isNotebook else 10000000
sleep = 0 if isNotebook else 30
exceptionSleep = 10

In [None]:
# Metrics for local:
diversityMetrics = \
{
    'div@100': tfidfDiversityAt100,
    'style-div@100': styleDiversityAt100,
    'topic-div@100': topicDiversityAt100,
}
jaccardDiversityMetrics = \
{
    'jacc-div@100': jaccardDiversityAt100,
    'swjacc-div@100': swJaccardDiversityAt100,
}
noveltyMetrics = {}
strictNoveltyMetrics = {}
serendipityMetrics = {}
tipiNum = lambda: tipiNumber(toInteger=True)
if TEST:
    diversityMetrics = dictSelect(diversityMetrics, {})
    # cacheForceFeeding('tfidf', maxItems=300, logger=logger)
elif tipiNum() in {60, 61, 62, 63}:
    logWarning("Caching only tfidf representations.", logger)
    diversityMetrics = dictSelect(diversityMetrics, {'div@100'})
    # cacheForceFeeding('tfidf', logger=logger)
elif tipiNum() in {90, 92, 93, 95}:
    pass
    # logWarning("Caching only dbert-ft representations.", logger)
    # diversityMetrics = dictSelect(diversityMetrics, {'style-div@100'})
    # cacheForceFeeding('dbert-ft', logger=logger)
elif tipiNum() in {1, 2, 3, 4}:
    pass
    # logWarning("Caching only nmf representations.", logger)
    # diversityMetrics = dictSelect(diversityMetrics, {'topic-div@100'})
    # cacheForceFeeding('nmf', logger=logger)
else:
    logWarning("We take all metrics, the cache can be overflowed...", logger)
metricFuncts = mergeDicts(diversityMetrics, jaccardDiversityMetrics, noveltyMetrics, strictNoveltyMetrics, serendipityMetrics)
log("Current metric functions:\n" + b(metricFuncts.keys(), 5), logger)

In [None]:
# For a certain amount of iterations:
for i in range(iterations):
    # We get all model keys, in random order:
    modelsKeys = shuffle(list(twinewsRankings.keys()))
    if TEST:
        modelsKeys = [e for e in modelsKeys if "combin" not in e]
        modelsKeys = modelsKeys[:1]
    # For all model instances:
    for modelKey in modelsKeys:
        # We init the eval data to None:
        evalData = None
        rankings = None
        # For all metrics:
        for metricKey, metricFunct in metricFuncts.items():
            # If we didn't add the score previously:
            if not twinewsScores.has({'id': modelKey, 'metric': metricKey}):
                try:
                    # We print infos:
                    log("Computing " + metricKey + " score of " + modelKey + "...", logger)
                    # We get all data:
                    meta = twinewsRankings.getMeta(modelKey)
                    splitVersion = meta['splitVersion']
                    maxUsers = meta['maxUsers']
                    modelName = meta['model']
                    # We get eval data:
                    if evalData is None:
                        evalData = evalDataCache[splitVersion]
                    candidates = evalData['candidates']
                    # We get rankings:
                    if rankings is None:
                        log("Downloading rankings of " + modelKey + "...", logger)
                        rankings = twinewsRankings[modelKey]
                        if rankings is None or len(rankings) == 0:
                            raise Exception("Rankings of " + modelKey + " don't exist anymore, you need to regenerate them.")
                        else:
                            checkRankings(rankings, candidates, maxUsers=maxUsers)
                        log("Done.", logger)
                    # Init scores:
                    scores = []
                    # Diversity:
                    if metricKey in diversityMetrics or metricKey in jaccardDiversityMetrics:
                        userIds = list(rankings.keys())
                        if TEST:
                            userIds = userIds[:10]
                        for userId in pb(userIds, logger=logger, message="Computing " + metricKey + " of " + modelKey):
                            for currentRankings in rankings[userId]:
                                assert len(currentRankings) >= 100
                                assert isinstance(currentRankings, list)
                                assert isinstance(currentRankings[0], str) or isinstance(currentRankings[0], tuple)
                                if isinstance(currentRankings[0], tuple):
                                    currentUrls = [e[0] for e in currentRankings]
                                else:
                                    currentUrls = currentRankings
                                score = metricFunct(currentUrls, logger=logger)
                                scores.append(score)
                        if not TEST:
                            assert len(scores) >= len(rankings)
                    # Novelty:
                    elif metricKey in noveltyMetrics:
                        pass
                    # Strict novelty:
                    elif metricKey in strictNoveltyMetrics:
                        pass
                    # Serendipity:
                    elif metricKey in serendipityMetrics:
                        pass
                    # We average all scores:
                    score = float(np.mean(scores))
                    # And finally we add the score in the db:
                    if not TEST:
                        addTwinewsScore(modelKey, metricKey, score, verbose=False)
                    # We print result:
                    log(metricKey + " score of " + modelKey + ": " + str(truncateFloat(score, 3)), logger)
                except Exception as e:
                    if isNotebook:
                        raise e
                    else:
                        logError(str(e), logger)
                        time.sleep(exceptionSleep)
    if sleep > 0:
        log("Sleeping " + str(sleep) + " seconds for the iteration " + str(i) + " on " + str(iterations) + "...", logger)
        time.sleep(sleep)
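
Stripped of logging and database access, the loop above follows a "fill in only the missing (model, metric) scores" pattern. A minimal sketch under stated assumptions: an in-memory dict stands in for the scores collection, `load_rankings` stands in for the rankings download, and all names are hypothetical.

```python
def fill_missing_scores(model_keys, metric_functs, load_rankings, scores):
    for model_key in model_keys:
        rankings = None  # downloaded lazily, at most once per model
        for metric_key, metric_funct in metric_functs.items():
            if (model_key, metric_key) in scores:
                continue  # already computed in a previous run
            if rankings is None:
                rankings = load_rankings(model_key)
            # One score per ranking list, then the mean over all lists:
            per_list = [metric_funct(urls)
                        for user_lists in rankings.values()
                        for urls in user_lists]
            scores[(model_key, metric_key)] = sum(per_list) / len(per_list)
    return scores

loads = []

def toy_load(model_key):
    # Stand-in for the download: user id --> list of ranking lists.
    loads.append(model_key)
    return {"user1": [["a", "b"]], "user2": [["c"], ["d", "e", "f"]]}

toy_metrics = {"len": len, "first": lambda urls: float(len(urls[0]))}
toy_scores = {("m1", "len"): 2.0}  # pretend this score was stored earlier
fill_missing_scores(["m1", "m2"], toy_metrics, toy_load, toy_scores)
print(sorted(toy_scores))
```

Loading rankings lazily means a model whose scores are all present costs no download at all, which matters when several machines run the same loop concurrently.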

## Testing

In [None]:
if isNotebook and TEST:
    urls = shuffle(list(newsCollection.distinct("url")))[:100]

In [None]:
if isNotebook and TEST:
    log(tfidfDiversityAt100(urls), logger)
    log(styleDiversityAt100(urls), logger)
    log(topicDiversityAt100(urls), logger)

In [None]:
if isNotebook and TEST:
    urls = shuffle(list(newsCollection.distinct("url")))[:100]
    jaccardDistance(urls[0], urls[1], swJaccardCache)
    jaccardDiversity(urls, False, verbose=True, logger=logger)

## End

In [None]:
cache.purge()

In [None]:
tt.toc()