# Tutorial 3: History of tokens (all_content)

# 1. Querying all content


In [None]:
from wikiwho_wrapper import WikiWhoAPI, APIQuerier
api = WikiWhoAPI()
querier = APIQuerier(api)
df = querier.all_content(article="bioglass") 
df.head(3)

The DataFrame columns above deserve an explanation:

 - **article_title**: The title of the article.
 - **page_id**: The ID of the article page.
 - **o_rev_id**: The ID of the revision where the token was added for the first time in the article.
 - **o_editor**: The user ID of the editor who added the token for the first time (directly related to o_rev_id). User IDs are integers, are unique across the whole of Wikipedia, and can be used to fetch the current name of a user.
 - **token**: Actual token value as string. A token can appear multiple times, i.e. it is not unique. This is because a word can be repeated in the document.
 - **token_id**: The token ID assigned internally by the WikiWho algorithm. The token_id describes a token uniquely because it takes into consideration the context in which the token appears (e.g. paragraph, sentence, position in the sentence).
 - **in**: The ID of the revision in which the token was reinserted (`-1` if the row describes the first-time insertion).
 - **out**: The ID of the revision in which the token was removed (`-1` if the token was not removed).



# 2. Cases for in and out columns

The `in` and `out` columns are useful for extracting information about the history of a token. Here are some cases:

## 2.1 Multiple deletions and reinsertions

The token 'sida' was originally inserted in revision '189370281', deleted in '189370332', reinserted in '189371159', deleted again in '189371182', reinserted again in '189537330', and finally deleted in '191585577'.

In [None]:
df[df['token_id'] == 378]
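
The history described above can also be read out programmatically. The following is a minimal sketch, not part of the wrapper: `token_timeline` is a hypothetical helper, the toy DataFrame is rebuilt by hand from the revision IDs quoted above (so it runs without querying the API), and it assumes the rows of a token appear in chronological order.

```python
import pandas as pd

# Toy rows rebuilt from the revision IDs quoted above for token_id 378 ('sida').
# The real DataFrame has more columns; only the ones needed here are included.
toy = pd.DataFrame({
    "token_id": [378, 378, 378],
    "token": ["sida", "sida", "sida"],
    "o_rev_id": [189370281, 189370281, 189370281],
    "in": [-1, 189371159, 189537330],
    "out": [189370332, 189371182, 191585577],
})


def token_timeline(frame, token_id):
    """Hypothetical helper: list the (event, revision) pairs of one token.

    Assumes the rows of a token are already in chronological order.
    """
    rows = frame[frame["token_id"] == token_id]
    events = []
    for o_rev, rev_in, rev_out in zip(rows["o_rev_id"], rows["in"], rows["out"]):
        if rev_in == -1:
            # No reinsertion recorded: this row starts at the original insertion.
            events.append(("inserted", int(o_rev)))
        else:
            events.append(("reinserted", int(rev_in)))
        if rev_out != -1:
            events.append(("deleted", int(rev_out)))
    return events


timeline = token_timeline(toy, 378)
for event, revision in timeline:
    print(event, revision)
```

With the real data, the same helper could be called as `token_timeline(df, 378)`, provided the column names match those listed in section 1.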

## 2.2 Reinserted token
The token 'bioglass' was originally inserted in revision '18064039', deleted in '758323388', and reinserted in '758323485'.

In [None]:
df[df['token_id'] == 0]

## 2.3 Never deleted token
The token 'is' was inserted in revision '18064039' and has never been taken out.

In [None]:
df[df['token_id'] == 2]

# 3. Basic filters on the dataframe

A `-1` in the `in` column means that there was no reinsertion, i.e. the row describes the first-time insertion (recorded in `o_rev_id`).

A `-1` in the `out` column means that the token has not been taken out since the insertion described by that row, whether that was the first-time insertion (`o_rev_id`) or a reinsertion (`in`).

This can be used to filter interesting information.

## 3.1 (First-time) token insertions

In [None]:
df[df['in'] == -1].head()

## 3.2 Token reinsertions

In [None]:
df[df['in'] != -1].head()

## 3.3 Token Deletions

In [None]:
df[df['out'] != -1].head()

## 3.4 Tokens that still exist in the page

In [None]:
df[df['out'] == -1].head()
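
The single-condition filters above can be combined with boolean masks. The sketch below uses a hand-made toy DataFrame built from the three tokens discussed in section 2, so it runs without the API. It assumes that 'bioglass' has not been deleted again after its reinsertion (its last `out` is set to `-1`), which the text suggests but does not state.

```python
import pandas as pd

# Toy rows for the tokens from section 2 (token_id 0, 2 and 378).
# Assumption: the last 'bioglass' row has out == -1 (not deleted again).
toy = pd.DataFrame({
    "token_id": [0, 0, 2, 378, 378, 378],
    "token": ["bioglass", "bioglass", "is", "sida", "sida", "sida"],
    "o_rev_id": [18064039, 18064039, 18064039,
                 189370281, 189370281, 189370281],
    "in": [-1, 758323485, -1, -1, 189371159, 189537330],
    "out": [758323388, -1, -1, 189370332, 189371182, 191585577],
})

first_insertion = toy["in"] == -1   # row describes the original insertion
still_present = toy["out"] == -1    # token was not removed after that row

# Tokens that were inserted once and never touched again.
never_deleted = toy[first_insertion & still_present]

# Tokens that were deleted at some point but are back in the page.
restored = toy[~first_insertion & still_present]

print(never_deleted[["token_id", "token"]])
print(restored[["token_id", "token"]])
```

The same masks should apply unchanged to the full `df`, since they rely only on the `in` and `out` columns.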

## 3.5 Counting the number of times that each token has been inserted or reinserted

In [None]:
sizes = df.groupby(['token_id']).size().sort_values(ascending=False)
sizes.head(10)
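
Since each row corresponds to one insertion or reinsertion, the same grouping idea can be split into separate counts of deletions and reinsertions. This is a small sketch on hand-made toy data for the three tokens from section 2. The `summary` table and its column names are illustrative choices, not something the wrapper provides, and the last 'bioglass' row is assumed to have `out == -1`.

```python
import pandas as pd

# Toy rows for the tokens from section 2 (token_id 0, 2 and 378).
# Assumption: the last 'bioglass' row has out == -1 (not deleted again).
toy = pd.DataFrame({
    "token_id": [0, 0, 2, 378, 378, 378],
    "in": [-1, 758323485, -1, -1, 189371159, 189537330],
    "out": [758323388, -1, -1, 189370332, 189371182, 191585577],
})

# Rows with a real revision ID in 'out' are deletions;
# rows with a real revision ID in 'in' are reinsertions.
deletions = toy[toy["out"] != -1].groupby("token_id").size()
reinsertions = toy[toy["in"] != -1].groupby("token_id").size()

# Align both counts on every token_id; tokens without events get 0.
summary = (
    pd.DataFrame({"deletions": deletions, "reinsertions": reinsertions})
    .reindex(sorted(toy["token_id"].unique()))
    .fillna(0)
    .astype(int)
)
print(summary.sort_values("deletions", ascending=False))
```

Tokens with many deletions and reinsertions are often the ones involved in disagreements between editors, so sorting by these counts is a simple way to surface them.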
