## Can we get only the "good" articles?

Also, can we apply a filter such as pageviews to exclude the less popular articles?

Use the Norwegian Wikipedia as an example. It is larger than the Icelandic Wikipedia, but not too large to work with.

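From the front page (in English: "Welcome to Wikipedia, the free encyclopedia that you can improve. 635,757 articles in Bokmål and Riksmål"):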
```
Velkommen til Wikipedia,
den frie encyklopedien som du kan forbedre.
635 757 artikler på bokmål og riksmål
```

In [22]:
from sqlalchemy import create_engine, text
import pandas as pd

# Create a database engine
# engine = create_engine('mysql+pymysql://wikiuser:wikipassword@localhost:3306/iswiki')
engine = create_engine('mysql+pymysql://wikiuser:wikipassword@localhost:3306/nowiki')


# get all tables available
df_tables = pd.read_sql_query("SHOW TABLES", engine)
df_tables

Unnamed: 0,Tables_in_nowiki
0,category
1,categorylinks
2,page
3,page_props
4,page_restrictions
5,pagelinks
6,site_stats
7,sites


In [3]:
df_site_stats = pd.read_sql_query("SELECT * FROM site_stats", engine)

df_site_stats

Unnamed: 0,ss_row_id,ss_total_edits,ss_good_articles,ss_total_pages,ss_users,ss_images,ss_active_users
0,1,24600826,634867,1832530,617593,4,1040


## Can we find the roughly 650K "good" articles?

### First, what would the Python code have processed?

In [3]:
import bz2
import xml.etree.ElementTree as ET

class WikiArticleCounterBZ2:
    def __init__(self):
        self.total_articles = 0
        self.redirect_count = 0
        self.non_redirect_count = 0

    def count_articles(self, bz2_file):
        # Open the .bz2 file and stream its decompressed contents
        with bz2.open(bz2_file, 'rb') as file:
            # Use iterparse to go through the XML without loading it all into memory
            for event, elem in ET.iterparse(file, events=('start', 'end')):
                if event == 'end' and elem.tag.endswith('page'):
                    ns = elem.find('./{*}ns')
                    if ns is not None and ns.text == '0':  # Main namespace (articles)
                        self.total_articles += 1
                        text = elem.findtext('.//{*}text')
                        
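                        # NOTE: this only matches the English '#REDIRECT' keyword.
                        # Redirects written with a localized keyword (for example
                        # '#OMDIRIGERING' on Norwegian Wikipedia) are counted as
                        # non-redirects, and pages with empty text as redirects.
                        if text and not text.lower().startswith('#redirect'):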
                            self.non_redirect_count += 1
                        else:
                            self.redirect_count += 1

                    elem.clear()  # Clear to save memory

        print(f"Total articles: {self.total_articles}")
        print(f"Non-redirect articles: {self.non_redirect_count}")
        print(f"Redirect articles: {self.redirect_count}")
        
        return self.total_articles, self.non_redirect_count, self.redirect_count

def count_wikipedia_articles_bz2(file_path):
    counter = WikiArticleCounterBZ2()
    return counter.count_articles(file_path)

# Usage
bz2_dump_file = "../../data/wikipedia/nowiki-20240901-pages-articles.xml.bz2"
total_articles, non_redirects, redirects = count_wikipedia_articles_bz2(bz2_dump_file)

print(f"Total number of articles in the dump: {total_articles}")
print(f"Number of non-redirect articles: {non_redirects}")
print(f"Number of redirect articles: {redirects}")


Total articles: 990640
Non-redirect articles: 973791
Redirect articles: 16849
Total number of articles in the dump: 990640
Number of non-redirect articles: 973791
Number of redirect articles: 16849


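**This is too many articles: we are expecting around 650K main articles.** The redirect count (16,849) also looks suspiciously low. A likely cause is that the check only matches the `#redirect` prefix, while Norwegian Wikipedia also accepts the localized `#OMDIRIGERING` keyword, so many redirects would be counted as articles.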
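Rather than inspecting the wikitext prefix, the dump itself can be asked whether a page is a redirect. The sketch below assumes the dump follows the MediaWiki XML export schema, where a redirect page carries a `<redirect>` child element directly under `<page>`. The function name `count_main_namespace_pages` and the tiny sample document are made up for illustration; this has not been run against the real nowiki dump.

```python
import io
import xml.etree.ElementTree as ET

def count_main_namespace_pages(xml_stream):
    """Return (non_redirects, redirects) for namespace-0 pages in an export stream."""
    non_redirects = 0
    redirects = 0
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        # Strip the XML namespace: '{http://...}page' -> 'page'
        if elem.tag.rsplit("}", 1)[-1] != "page":
            continue
        if elem.findtext("./{*}ns") == "0":
            # Assumption: redirects are marked with a <redirect> child element
            if elem.find("./{*}redirect") is not None:
                redirects += 1
            else:
                non_redirects += 1
        elem.clear()  # free the memory held by the finished <page>
    return non_redirects, redirects

# Made-up miniature export, only to exercise the function
sample = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">
  <page><title>Example A</title><ns>0</ns>
    <revision><text>Some article text</text></revision></page>
  <page><title>Example B</title><ns>0</ns><redirect title="Example A" />
    <revision><text>#OMDIRIGERING [[Example A]]</text></revision></page>
  <page><title>Diskusjon:Example A</title><ns>1</ns>
    <revision><text>Talk page</text></revision></page>
</mediawiki>"""

print(count_main_namespace_pages(io.BytesIO(sample)))  # prints (1, 1)
```

For the real dump, the same function should accept the file object returned by `bz2.open(bz2_dump_file, 'rb')`.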



In [4]:
df_tables

Unnamed: 0,Tables_in_nowiki
0,category
1,page
2,page_props
3,page_restrictions
4,pagelinks
5,site_stats
6,sites


In [5]:
df_category = pd.read_sql_query("SELECT * FROM category", engine)

df_category.head(3)

Unnamed: 0,cat_id,cat_title,cat_pages,cat_subcats,cat_files
0,1,b'FN-traktater',27,2,0
1,2,b'Internasjonale_avtaler',140,20,0
2,3,b'G\xc3\xa4vleborgs_l\xc3\xa4n',18,6,0
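The titles come back as `bytes` (shown as `b'...'`), with UTF-8 escapes and underscores instead of spaces. A small helper along these lines could make them readable; `decode_title` is a hypothetical name, and UTF-8 is assumed as the encoding.

```python
def decode_title(value, encoding="utf-8"):
    """Turn a bytes title such as b'Foo_bar' into the readable string 'Foo bar'."""
    if isinstance(value, (bytes, bytearray)):
        value = bytes(value).decode(encoding, errors="replace")
    if isinstance(value, str):
        return value.replace("_", " ")
    return value  # leave None and other non-text values untouched

print(decode_title(b'G\xc3\xa4vleborgs_l\xc3\xa4n'))  # prints Gävleborgs län
```

On the DataFrame this could be applied with something like `df_category["cat_title"].map(decode_title)`.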


In [7]:
df_page = pd.read_sql_query("SELECT * FROM page", engine)
df_page.head(3)

Unnamed: 0,page_id,page_namespace,page_title,page_is_redirect,page_is_new,page_random,page_touched,page_links_updated,page_latest,page_len,page_content_model,page_lang
0,1,0,b'Det_norske_Arbeiderparti',1,0,0.094351,b'20240815223517',b'20230311143657',15529121,83,b'wikitext',
1,2,0,b'Akershus',0,0,0.035703,b'20240820193426',b'20240820193434',24630634,60775,b'wikitext',
2,3,0,b'Aalesund_FK',1,0,0.847204,b'20240822135716',b'20230311143657',15344499,95,b'wikitext',
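The `page_namespace` and `page_is_redirect` columns suggest a direct way to count candidate articles: main namespace, not a redirect. The sketch below runs the query against a toy in-memory SQLite table with made-up rows so it can be tried without the MySQL server; the same SQL string should also work with `pd.read_sql_query(COUNT_ARTICLES_SQL, engine)`, though that has not been verified here. Note that `ss_good_articles` may additionally require a page to contain at least one internal link (the MediaWiki default counting method), so the two numbers need not match exactly.

```python
import sqlite3

# Plain SQL; the name COUNT_ARTICLES_SQL is just for this example
COUNT_ARTICLES_SQL = """
SELECT COUNT(*) AS n_articles
FROM page
WHERE page_namespace = 0      -- main (article) namespace
  AND page_is_redirect = 0    -- skip redirects
"""

# Toy stand-in for the real `page` table (made-up rows, illustration only)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page (page_id INTEGER, page_namespace INTEGER, page_is_redirect INTEGER)"
)
conn.executemany(
    "INSERT INTO page VALUES (?, ?, ?)",
    [(1, 0, 1), (2, 0, 0), (3, 0, 1), (4, 1, 0), (5, 0, 0)],
)

n_articles = conn.execute(COUNT_ARTICLES_SQL).fetchone()[0]
print(n_articles)  # prints 2
```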


In [9]:
df_page.page_namespace.value_counts()


page_namespace
0       990646
1       249192
3       247255
14      206008
10       59033
2        31248
4        27518
828       5879
100       4462
2600      3329
5         2398
15        2097
11        1551
8         1141
12         379
9          124
101        111
829         84
13          71
6            4
7            1
Name: count, dtype: int64
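The namespace numbers are easier to read with labels. The mapping below covers the standard MediaWiki namespaces 0 to 15 plus 828/829 (Scribunto modules); other ids such as 100, 101 and 2600 are wiki- or extension-specific, so the sketch deliberately leaves them unlabeled. `NAMESPACE_NAMES` and `namespace_label` are names chosen for this example.

```python
NAMESPACE_NAMES = {
    0: "Main (articles)", 1: "Talk",
    2: "User", 3: "User talk",
    4: "Project", 5: "Project talk",
    6: "File", 7: "File talk",
    8: "MediaWiki", 9: "MediaWiki talk",
    10: "Template", 11: "Template talk",
    12: "Help", 13: "Help talk",
    14: "Category", 15: "Category talk",
    828: "Module", 829: "Module talk",
}

def namespace_label(ns_id):
    """Return a readable label, falling back to 'Other (<id>)' for unknown ids."""
    return NAMESPACE_NAMES.get(ns_id, f"Other ({ns_id})")

# A few of the counts from the output above
counts = {0: 990646, 1: 249192, 14: 206008, 100: 4462}
for ns_id, n in counts.items():
    print(f"{namespace_label(ns_id):<18} {n:>8}")
```

Only namespace 0 holds articles, and that bucket still includes redirects.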

In [13]:
df_page_props = pd.read_sql_query("SELECT * FROM page_props LIMIT 50", engine)
df_page_props.head(3)



Unnamed: 0,pp_page,pp_propname,pp_value,pp_sortkey
0,2,b'kartographer_frames',b'1',
1,2,b'page_image_free',b'Norway_Counties_Akershus_Position.svg',
2,2,b'wikibase_item',b'Q50615',


In [15]:
df_pagelinks = pd.read_sql_query("SELECT * FROM pagelinks LIMIT 50", engine)

df_pagelinks.head(3)

Unnamed: 0,pl_from,pl_from_namespace,pl_target_id
0,122978,0,1
1,2208781,0,2
2,2689,0,17


In [18]:
df_sites = pd.read_sql_query("SELECT * FROM sites LIMIT 50", engine)

df_sites.head(3)

Unnamed: 0,site_id,site_global_key,site_type,site_group,site_source,site_language,site_protocol,site_domain,site_data,site_forward,site_config
0,1,b'aawiki',b'mediawiki',b'wikipedia',b'local',b'aa',b'https',b'gro.aidepikiw.aa.',"b'a:1:{s:5:""paths"";a:2:{s:9:""file_path"";s:29:""...",0,b'a:0:{}'
1,2,b'aawiktionary',b'mediawiki',b'wiktionary',b'local',b'aa',b'https',b'gro.yranoitkiw.aa.',"b'a:1:{s:5:""paths"";a:2:{s:9:""file_path"";s:30:""...",0,b'a:0:{}'
2,3,b'aawikibooks',b'mediawiki',b'wikibooks',b'local',b'aa',b'https',b'gro.skoobikiw.aa.',"b'a:1:{s:5:""paths"";a:2:{s:9:""file_path"";s:29:""...",0,b'a:0:{}'
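The `site_domain` values look odd because the domain is stored reversed, character by character, with a trailing dot. A helper like the following (hypothetical name `readable_domain`) turns them back into ordinary host names.

```python
def readable_domain(site_domain):
    """Reverse a stored domain such as b'gro.aidepikiw.aa.' back to 'aa.wikipedia.org'."""
    if isinstance(site_domain, (bytes, bytearray)):
        site_domain = bytes(site_domain).decode("utf-8")
    # Reverse the characters, then drop the dot that ends up in front
    return site_domain[::-1].lstrip(".")

print(readable_domain(b'gro.aidepikiw.aa.'))  # prints aa.wikipedia.org
```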


In [21]:
df_page_restrictions = pd.read_sql_query("SELECT * FROM page_restrictions LIMIT 50", engine)

df_page_restrictions.head(3)

Unnamed: 0,pr_page,pr_type,pr_level,pr_cascade,pr_expiry,pr_id
0,3031,b'move',b'sysop',0,,19
1,129786,b'edit',b'sysop',0,b'infinity',45
2,129786,b'move',b'sysop',0,b'infinity',46


In [26]:
df_categorylinks = pd.read_sql_query("SELECT * FROM categorylinks", engine)

df_categorylinks.head(3)



Unnamed: 0,cl_from,cl_to,cl_sortkey,cl_timestamp,cl_sortkey_prefix,cl_collation,cl_type
0,1,b'Omdirigeringer_fra_eldre_skriveform',"b'02P\x04DFLN>2\x04*L,2:02LH*LP:\x01\x1c\x01\x...",2011-09-12 15:52:40,b'',b'uca-nb-u-kn',b'page'
1,2,b'11\xc2\xb0\xc3\x98',b'*>2LN8RN\x03\x06*>2LN8RN\x01\x15\x01\xdc\xbe...,2022-06-01 15:01:56,b'Akershus',b'uca-nb-u-kn',b'page'
2,2,b'60\xc2\xb0N',b'*>2LN8RN\x03\x06*>2LN8RN\x01\x15\x01\xdc\xbe...,2022-06-01 15:01:56,b'Akershus',b'uca-nb-u-kn',b'page'


In [31]:
df_categorylinks.cl_to.value_counts().head(20)

cl_to
b'Artikler_med_autoritetsdatalenker_fra_Wikidata'                      320507
b'Artikler_uten_autoritetsdatalenker_fra_Wikidata'                     229144
b'Enkeltmenn'                                                          148009
b'Sider_med_referanser_fra_utsagn'                                     138948
b'Artikler_med_offisielle_lenker_fra_Wikidata'                         114834
b'Alle_spirer'                                                          97328
b'Sider_med_kart'                                                       97100
b'Artikler_hvor_bilde_er_hentet_fra_Wikidata_\xe2\x80\x93_biografi'     91974
b'Artikler_med_sportslenker_fra_Wikidata'                               89201
b'Spirer_2023-11'                                                       72167
b'Sm\xc3\xa5_spirer'                                                    65236
b'Artikler_hvor_bilde_er_hentet_fra_Wikidata'                           56243
b'Enkeltkvinner'                                          

In [28]:
len(df_categorylinks)  # 6,008,394 rows

6008394

In [29]:
len(df_categorylinks.cl_to.unique()) 

209589
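Many of the most frequent categories above look like maintenance or tracking categories rather than topics. A rough first pass could drop them by prefix before counting. The prefix list below is an assumption taken from the names visible in the output and is certainly incomplete; everything that does not match is simply kept, without any claim that it is a topical category. Hidden categories are usually flagged with the `hiddencat` property in `page_props`, which might be a more robust signal, but that has not been checked here.

```python
from collections import Counter

# Assumed maintenance/tracking prefixes, read off the value_counts output
MAINTENANCE_PREFIXES = (
    "Artikler_", "Sider_med_", "Alle_spirer", "Spirer_", "Små_spirer",
)

def is_maintenance_category(cl_to):
    """True if the category name starts with one of the assumed prefixes."""
    name = cl_to.decode("utf-8") if isinstance(cl_to, bytes) else cl_to
    return name.startswith(MAINTENANCE_PREFIXES)

def remaining_category_counts(cl_to_values):
    """Count category links after dropping the assumed maintenance categories."""
    return Counter(v for v in cl_to_values if not is_maintenance_category(v))

sample = [
    b'Alle_spirer', b'Enkeltmenn', b'Sider_med_kart',
    b'Enkeltmenn', b'Sm\xc3\xa5_spirer',
]
print(remaining_category_counts(sample).most_common())  # prints [(b'Enkeltmenn', 2)]
```

With the real data, `df_categorylinks.cl_to` could be passed in place of `sample`.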

### So this is mostly not viable without more effort to understand the data better
### And even then this approach may not work very well.

### Should look at other quality signals... pageviews and article length?
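As a starting point, a length-and-pageviews filter could look like the sketch below. `page_len` is available in the `page` table; pageview counts are not in this database and would have to be obtained separately, so the `views` dictionary, the thresholds and the function name `select_popular_articles` are all placeholders, and the view counts are made up.

```python
def select_popular_articles(pages, views, min_len=2000, min_views=100):
    """Keep titles whose length and view count both reach the thresholds.

    pages: iterable of (title, page_len) pairs
    views: dict mapping title -> view count (missing titles count as 0)
    """
    selected = []
    for title, page_len in pages:
        if page_len >= min_len and views.get(title, 0) >= min_views:
            selected.append(title)
    return selected

# page_len values from the page table sample; view counts are invented
pages = [("Akershus", 60775), ("Det_norske_Arbeiderparti", 83), ("Aalesund_FK", 95)]
views = {"Akershus": 1500, "Aalesund_FK": 900}

print(select_popular_articles(pages, views))  # prints ['Akershus']
```

The two short pages here are redirects, which suggests `page_len` alone already separates stubs and redirects from full articles to some degree.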
