# Downloads
This notebook handles all the downloads:
* Turkish Wikipedia Dumps
* ORES Topic classification of Wikipedia Articles
* WikiText: all Turkish Wikipedia articles text


## 1) Download the Wikipedia Dump

The MediaWiki history dumps of all wikis can be downloaded from the link below.

https://dumps.wikimedia.org/other/mediawiki_history/


The structure of the dumps is straightforward. First choose the dump version on the page (only the versions of the two previous months are available), then choose the wiki to download. Turkish wikis start with ```tr``` and several of them are available, ```trwiki``` being the main Turkish Wikipedia. The dumps are split by year and must be combined to cover all the activity on Turkish Wikipedia since 2002.

The detailed documentation of the dumps can be found in the link below.

https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history_dumps

After the data was downloaded, it was processed: the dumps from the different years were combined into a single file and the separate yearly files were deleted. The notebook ```process_data.ipynb``` handles those operations.
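The combining step can be sketched as follows. This is a minimal illustration, not the code in ```process_data.ipynb```: it assumes the yearly files share the same columns and carry no header row, so they can be concatenated byte for byte into one multi-stream bz2 file. The helper name `combine_bz2_parts` and the output file name in the usage note are hypothetical.

```python
import shutil
from pathlib import Path


def combine_bz2_parts(part_paths, combined_path, delete_parts=False):
    """Concatenate bz2 files byte for byte into one multi-stream bz2 file."""
    combined_path = Path(combined_path)
    with open(combined_path, "wb") as out:
        for part in part_paths:
            with open(part, "rb") as src:
                # Copy the compressed bytes as-is; no decompression is needed
                shutil.copyfileobj(src, out)
    if delete_parts:
        # Remove the yearly files only after the combined file is written
        for part in part_paths:
            Path(part).unlink()
    return combined_path
```

For example, `combine_bz2_parts([f'{DUMPS_PATH}/tr-{y}.tsv.bz2' for y in YEARS], f'{DUMPS_PATH}/tr-all.tsv.bz2')` would merge the yearly Turkish dumps in chronological order. Python's `bz2` module, and therefore `pd.read_csv`, reads multi-stream files transparently.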


In [1]:
import sys
import wget
from pathlib import Path
import logging
import pandas as pd
import numpy as np

%load_ext autoreload
%autoreload 2

# Add other codes if you desire other language wikis
CODES = ["tr"]

# The path inside iccluster111 where the dumps will be downloaded
DUMPS_PATH = "/dlabdata1/turkish_wiki"

# Configure to the dump you want to download
DUMP_URL = 'https://dumps.wikimedia.org/other/mediawiki_history'
DUMP_VERSION = '2021-01'
DUMP_FE = 'tsv.bz2'

# Range of years of the dumps to be downloaded.
YEARS = list(range(2002, 2022))

# Create the download path if it doesn't exist
Path(DUMPS_PATH).mkdir(parents=True, exist_ok=True)

In [3]:
# Download the desired dumps
for code in CODES:
    logging.warning(f'Processing {code}...')
    # enwiki files are split by month; all other wikis are split by year
    if code == 'en':
        periods = [f'{year}-{month:02d}' for year in YEARS for month in range(1, 13)]
    else:
        periods = [str(year) for year in YEARS]
    for period in periods:
        url = f'{DUMP_URL}/{DUMP_VERSION}/{code}wiki/{DUMP_VERSION}.{code}wiki.{period}.{DUMP_FE}'
        loc = f'{DUMPS_PATH}/{code}-{period}.{DUMP_FE}'
        logging.warning(f'Download {url}...')
        try:
            wget.download(url, loc)
        except Exception as e:
            # Report the failing file and the reason, then continue with the next one
            logging.error(f'Error when downloading {code}-{period}: {e}')
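Since the loop covers many files, it can help to make each download resumable. The sketch below is one possible approach, not what this notebook does: it uses the standard library's `urllib` instead of `wget`, skips files that already exist, retries on failure, and writes to a temporary `.part` file so that an interrupted download is never mistaken for a complete one. The name `download_if_missing` and its parameters are hypothetical.

```python
import logging
import shutil
import time
import urllib.request
from pathlib import Path


def download_if_missing(url, dest, retries=3, wait_seconds=5):
    """Download url to dest unless dest already exists; return True on success."""
    dest = Path(dest)
    if dest.exists():
        logging.info(f'Skipping {dest}, already downloaded')
        return True
    tmp = dest.with_name(dest.name + '.part')
    for attempt in range(1, retries + 1):
        try:
            with urllib.request.urlopen(url) as response, open(tmp, 'wb') as out:
                shutil.copyfileobj(response, out)
            tmp.replace(dest)  # expose the file only once it is complete
            return True
        except OSError as e:  # URLError and HTTPError are subclasses of OSError
            logging.error(f'Attempt {attempt}/{retries} failed for {url}: {e}')
            if attempt < retries:
                time.sleep(wait_seconds)
    tmp.unlink(missing_ok=True)  # clean up a partial file, if any
    return False
```

Replacing `wget.download(url, loc)` with `download_if_missing(url, loc)` would let the cell be re-run after a failure without fetching the finished years again.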



## 2) Get ORES Topics
The URL in the cell below downloads the ORES topic classification of Wikipedia articles across all language editions of Wikipedia. The resulting DataFrame contains, among others, the name of the project (e.g. ```trwiki```, ```enwiki```), the ID of the page in Wikidata (e.g. [Q43](https://www.wikidata.org/wiki/Q43)), the page ID, and the probabilities of the article belonging to each of the ORES categories.


In [4]:
url = "https://ndownloader.figshare.com/files/26338159"
loc = "/scratch/ira"
try:
    logging.warning(f'Download {url}...')
    wget.download(url, loc)
except Exception as e:
    # Log the reason for the failure instead of silently swallowing it
    logging.error(f'Error when downloading {url}: {e}')



#### How to load the ORES data

In [4]:
df = pd.read_csv('/scratch/ira/topics_all_wikipedia_articles_202012.tsv.bz2', sep="\t", nrows=100)

In [6]:
df.head()

| Unnamed: 0 | wiki_db | qid | pid | num_outlinks | Culture.Biography.Biography* | Culture.Biography.Women | Culture.Food_and_drink | Culture.Internet_culture | Culture.Linguistics | Culture.Literature | ... | STEM.Computing | STEM.Earth_and_environment | STEM.Engineering | STEM.Libraries_&_Information | STEM.Mathematics | STEM.Medicine_&_Health | STEM.Physics | STEM.STEM* | STEM.Space | STEM.Technology |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | abwiki | Q40349 | 2444 | 66 | 0.019 | 0.004 | 0.001 | 0.0 | 0.003 | 0.005 | ... | 0.0 | 0.001 | 0.003 | 0.0 | 0.0 | 0.002 | 0.0 | 0.013 | 0.0 | 0.002 |
| 1 | abwiki | Q2657 | 2855 | 51 | 0.0 | 0.002 | 0.0 | 0.003 | 0.0 | 0.012 | ... | 0.003 | 0.004 | 0.006 | 0.0 | 0.003 | 0.0 | 0.001 | 0.446 | 0.001 | 0.005 |
| 2 | abwiki | Q713229 | 4439 | 2 | 0.0 | 0.0 | 0.002 | 0.0 | 0.008 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.997 | 0.0 | 0.001 | 1.0 | 0.0 | 0.001 |
| 3 | abwiki | Q7821 | 7819 | 7 | 0.207 | 0.001 | 0.005 | 0.0 | 0.006 | 0.033 | ... | 0.0 | 0.03 | 0.028 | 0.007 | 0.001 | 0.022 | 0.025 | 0.064 | 0.0 | 0.008 |
| 4 | abwiki | Q19590 | 9095 | 7 | 0.002 | 0.003 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.047 | 0.012 | 0.0 |
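The preview above only loads the first 100 rows, all of which belong to ```abwiki```. To extract the Turkish articles from the full file without loading every language edition into memory, the file can be streamed row by row. The sketch below uses the standard library's `csv` and `bz2` modules rather than pandas; the helpers `iter_wiki_rows` and `top_topic` are hypothetical, and the rule "a topic column is any column whose name contains a dot" is an assumption based on the column names shown in the preview.

```python
import bz2
import csv


def iter_wiki_rows(path, wiki_db='trwiki'):
    """Yield the rows of one wiki from the bz2-compressed ORES TSV, one at a time."""
    with bz2.open(path, 'rt', encoding='utf-8', newline='') as f:
        for row in csv.DictReader(f, delimiter='\t'):
            if row['wiki_db'] == wiki_db:
                yield row


def top_topic(row, exclude_starred=False):
    """Return (topic, probability) for the most probable topic of a row."""
    # Assumption: topic columns are exactly those whose name contains a dot
    scores = {
        name: float(value)
        for name, value in row.items()
        if '.' in name and not (exclude_starred and name.endswith('*'))
    }
    best = max(scores, key=scores.get)
    return best, scores[best]
```

Columns ending in `*`, such as `STEM.STEM*`, compete with the other topics by default; passing `exclude_starred=True` leaves them out of the comparison.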


## 3) Get WikiText

The cell below downloads the textual content of the Turkish Wikipedia articles as of 01.02.2021. This dump is used to get article topics using WikiPDA, in the notebook ```get_wikipda_categories.ipynb```. Documentation and a guide for WikiPDA can be found at the link below.

https://github.com/epfl-dlab/WikiPDA

In [2]:
url = "https://dumps.wikimedia.org/trwiki/20210201/trwiki-20210201-pages-articles-multistream.xml.bz2"
loc = "/dlabdata1/turkish_wiki"
try:
    logging.warning(f'Download {url}...')
    wget.download(url, loc)
except Exception:
    logging.error(f'Error when downloading {url}')

ERROR:root:Error when downloading https://dumps.wikimedia.org/trwiki/20210201/trwiki-20210201-pages-articles-multistream.xml.bz2
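Once the download succeeds, the dump can be sanity-checked before it is passed to WikiPDA. The sketch below is a minimal illustration and is not part of WikiPDA: it assumes the file follows the usual MediaWiki XML export layout, with `<page>` elements that contain a `<title>` and a `<text>` element, and streams through it without decompressing it to disk. The helpers `local_name` and `iter_pages` are hypothetical.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit('}', 1)[-1]


def iter_pages(dump_path):
    """Yield (title, text) for each <page> in a bz2-compressed MediaWiki XML dump."""
    with bz2.open(dump_path, 'rb') as f:
        for _, elem in ET.iterparse(f, events=('end',)):
            if local_name(elem.tag) != 'page':
                continue
            title, text = None, None
            for child in elem.iter():
                name = local_name(child.tag)
                if name == 'title':
                    title = child.text
                elif name == 'text':
                    text = child.text or ''
            yield title, text
            elem.clear()  # drop the finished page's children to limit memory use
```

Printing the first few titles, for instance with `itertools.islice(iter_pages(path), 5)`, is usually enough to confirm that the file is complete and readable.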
