# Workflow

1. Download the data dump (a pair of files per language)
2. Process the data dump to a structured format
3. Download pageviews data
4. Process pageviews data to a structured format per language
5. Merge pageviews and wiki data
6. Filter to articles we want to keep for embedding
    - Use the pageviews data as a proxy for importance/relevance
7. Embed the articles
8. Store the embeddings


### 1 Downloading the data dumps
Example pair of files for the Icelandic Wikipedia:
- iswiki-20240901-pages-articles-multistream.xml.bz2
    - Contains the actual wiki articles
- iswiki-20240901-pages-articles-multistream-index.txt.bz2
    - Contains the offsets of the articles in the xml file
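
To illustrate how the two files fit together, here is a minimal sketch. It assumes each index line has the form `offset:page_id:title` and that `offset` is the byte position of a bz2 stream inside the xml file. The function names are hypothetical and are not taken from `src/wikipedia/multiprocess_large_wiki_dump.py`.

```python
import bz2


def parse_index_line(line: str) -> tuple[int, int, str]:
    """Split one index line, assumed to look like 'offset:page_id:title'."""
    # Titles can contain ':' themselves, so split at most twice.
    offset, page_id, title = line.rstrip("\n").split(":", 2)
    return int(offset), int(page_id), title


def read_stream(dump_path: str, offset: int, chunk_size: int = 64 * 1024) -> str:
    """Decompress the single bz2 stream that starts at byte `offset`."""
    decompressor = bz2.BZ2Decompressor()
    parts = []
    with open(dump_path, "rb") as f:
        f.seek(offset)
        # Stop as soon as this stream ends; later streams are left untouched.
        while not decompressor.eof:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            parts.append(decompressor.decompress(chunk))
    return b"".join(parts).decode("utf-8")
```

Because each stream can be decompressed on its own, different offsets can be handed to different worker processes without decompressing the whole dump first.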

### 2 Process the data dump to a structured format
This is done in the script `src/wikipedia/multiprocess_large_wiki_dump.py`

We opt for a duckdb database file per language as the output of this step. 

A simple .parquet file for a pandas dataframe could be used for small languages, but the larger languages would simply not fit into memory.

### 3 Download pageviews data
The pageviews dumps are essentially simple time series data, published as monthly files.

 - Each pageviews dump file contains data for a single month.
 - Each pageviews dump file has the data for all the languages.
 - Each page appears multiple times in the file (once per day, I think)

### 4 Process pageviews data to a structured format
This is done in the script `src/wikipedia/process_pageviews.py`

Two things are done here:
 - Extract the data for the languages we want to keep
 - Aggregate the data to a monthly level instead of daily
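
The two steps above can be sketched roughly as follows. This is not the code in `process_pageviews.py`. It assumes each line is whitespace-separated as `wiki_code page_title page_id access_method count ...`, which should be checked against the pageviews readme before relying on it. The function name `aggregate_pageviews` is hypothetical.

```python
from collections import defaultdict
from typing import Iterable


def aggregate_pageviews(
    lines: Iterable[str], wanted_wikis: set[str]
) -> dict[tuple[str, str], int]:
    """Sum view counts per (wiki, title) for the wikis we want to keep."""
    totals: dict[tuple[str, str], int] = defaultdict(int)
    for line in lines:
        fields = line.split()
        # Assumed layout: wiki, title, page id, access method, count, ...
        if len(fields) < 5:
            continue  # skip malformed lines
        wiki, title, count = fields[0], fields[1], fields[4]
        if wiki not in wanted_wikis or not count.isdigit():
            continue
        # A page can appear on several lines; add them all up.
        totals[(wiki, title)] += int(count)
    return dict(totals)
```

Passing a file handle as `lines` keeps memory use proportional to the number of kept pages rather than to the size of the dump file.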

### 5 & 6 Merge pageviews and wiki data and do filtering
This is done with a notebook. 

E.g. 
`notebooks/wikipedia_embeddings_clean/05_english_final_filtering.ipynb`
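
The notebook is not reproduced here. As a rough illustration of the merge-and-filter idea only, the sketch below joins articles to pageview totals by title and keeps those above a threshold. The function name, the `min_views` parameter, and the assumption that pageview titles use underscores where dump titles use spaces are all illustrative, not taken from the notebook.

```python
def select_articles(
    articles: dict[str, str], pageviews: dict[str, int], min_views: int
) -> list[tuple[str, int]]:
    """Return (title, views) for articles with at least `min_views` views."""
    merged = []
    for title in articles:
        # Assumption: pageview titles use '_' where dump titles use ' '.
        views = pageviews.get(title.replace(" ", "_"), 0)
        merged.append((title, views))
    kept = [(title, views) for title, views in merged if views >= min_views]
    # Most viewed first; ties are broken alphabetically for stable output.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return kept
```

Articles with no pageview record are treated as having zero views, so they drop out for any positive threshold.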


### 7 & 8 Embed the articles and store the embeddings
This is done with a notebook.

E.g. 
`notebooks/wikipedia_embeddings_clean/07_english_embed.ipynb`




---

# Official data dumps

Wikipedia has an official data dump site: https://dumps.wikimedia.org/

Mirror sites also exist. For example, the September 2024 data dump for the Icelandic Wikipedia is available at:  
https://mirror.accum.se/mirror/wikimedia.org/dumps/iswiki/20240901/

Look for these files:
- basic data dump: iswiki-20240901-pages-articles.xml.bz2 
- with metadata: iswiki-20240901-pages-meta-current.xml.bz2
- with multistream: 
    - iswiki-20240901-pages-articles-multistream.xml.bz2
    - iswiki-20240901-pages-articles-multistream-index.txt.bz2
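
The file names above follow a regular pattern, so the multistream pair for another language or date can probably be derived as in this small sketch. The helper name `multistream_urls` is hypothetical, and the URL layout is inferred from the example listing above rather than from official documentation.

```python
def multistream_urls(
    wiki: str, date: str, base: str = "https://dumps.wikimedia.org"
) -> tuple[str, str]:
    """Build the (xml, index) URLs for one wiki and dump date."""
    # Pattern inferred from e.g. iswiki-20240901-pages-articles-multistream.xml.bz2
    prefix = f"{base}/{wiki}/{date}/{wiki}-{date}-pages-articles-multistream"
    return prefix + ".xml.bz2", prefix + "-index.txt.bz2"


# Example: the Icelandic dump on the mirror mentioned above.
xml_url, index_url = multistream_urls(
    "iswiki", "20240901", base="https://mirror.accum.se/mirror/wikimedia.org/dumps"
)
```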

**PageViews data**
 - https://dumps.wikimedia.org/other/pageview_complete/readme.html
 - https://dumps.wikimedia.org/other/pageviews/
    - some doc: https://meta.wikimedia.org/wiki/Research:Page_view
 - Also pageviews complete:
    https://dumps.wikimedia.org/other/pageview_complete/2024/2024-09/


## Previous work by others

 - https://upstash.com/blog/indexing-wikipedia   
    - References:
    - https://github.com/earwig/mwparserfromhell
    - https://huggingface.co/datasets/wikimedia/wikipedia/blob/script/wikipedia.py


---

# Pageviews as a proxy for importance

After looking into ways to estimate the importance of Wikipedia pages, pageviews seem to be the best available proxy for importance.  

Getting the pageviews data is not as straightforward as I thought it would be.  
However, after much digging I found two main ways to get the data. 
 - Data dumps:
    - https://dumps.wikimedia.org/other/pageview_complete/
        - Need to use the monthly files, and preferably at least a year into the past...
 - API:
    - https://wikimedia.org/api/rest_v1/#/Pageviews_data/get_metrics_pageviews
    - https://doc.wikimedia.org/generated-data-platform/aqs/analytics-api/documentation/getting-started.html

The downside of pageviews in general is that they favor older pages.  
We may want to combine a threshold of x total pageviews over the last year or two with new pages (if we can get that data), or use the last-edited date instead.

The API is probably straightforward: a date range (e.g. 2 years) can be specified, so there is no manual work to combine monthly files.  
However, for millions of pages this will still take a lot of time, and we need a way to rate limit our requests.  
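
As a sketch of what the API route could look like, the snippet below builds a per-article request URL and spaces out calls with a simple throttle. The `Throttle` class, the default parameter values, and the chosen interval are illustrative assumptions; the actual rate limits should be taken from the API documentation linked above. No request is sent here.

```python
import time
from urllib.parse import quote

BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def per_article_url(project, article, start, end,
                    access="all-access", agent="user", granularity="monthly"):
    """Build a per-article pageviews URL; start and end are YYYYMMDD strings."""
    # Titles use underscores and must be percent-encoded, including '/'.
    title = quote(article.replace(" ", "_"), safe="")
    return f"{BASE}/{project}/{access}/{agent}/{title}/{granularity}/{start}/{end}"


class Throttle:
    """Ensure at least `min_interval` seconds pass between calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` before each request keeps the request rate bounded. Even so, at one request per page, millions of pages would take a long time, which is why the data dump is the quicker starting point.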

To get started quickly, the data dump is much faster for me. 



