### Getting Data for My Final Project

I want to use the 2023_pageviews data to get the following fields:
- article titles
- qid
- lang_code
- country_code
- pageviews

So far, I am a little confused about where I should be working on this project and how to use the DuckDB databases.

Eni said: use the wiki_pageviews duckdb. 
<br> You can follow the second tutorial, but swap the URL and the table name (instead of data_table use wiki_pageviews in the queries) and most of the queries will work, although the ones that do aggregation are a bit slow on this big database. It's better to select the rows that you need and save them in a CSV and then do the operations on the file.

The steps below are from DuckDB_Tutorial.

I don't know what rows I need though

1. Set up in Google Colab

In [2]:
import duckdb

# Placeholder for the database connection. It will be initialized later with the URL.
conn = duckdb.connect()
conn

<_duckdb.DuckDBPyConnection at 0x10ae0f4b0>

In [3]:
# Install and Load the HTTPFS extension
# This is required to access remote files over the web (HTTP/S)
conn.execute("INSTALL httpfs;")
conn.execute("LOAD httpfs;")

<_duckdb.DuckDBPyConnection at 0x10ae0f4b0>

2. Connect to the Remote Database Source

In [4]:
# This is one of several DuckDB databases hosted on the CS server.
# This database has the DPDP data for all countries.
database_url = "https://cs.wellesley.edu/~eni/duckdb/all_wiki.duckdb"

# Attach the remote file as a database named 'web_db' and start using it
try:
    conn.execute(f"ATTACH '{database_url}' AS web_db (READ_ONLY);")
    conn.execute("USE web_db;")
    print(f"Successfully attached database from: {database_url}")
except Exception as e:
    print(f"Error attaching database: {e}")

FloatProgress(value=0.0, layout=Layout(width='auto'), style=ProgressStyle(bar_color='black'))

Successfully attached database from: https://cs.wellesley.edu/~eni/duckdb/all_wiki.duckdb


3. Show the Tables

In [5]:
query = "PRAGMA show_tables"
result = conn.sql(query)
result

┌────────────────┐
│      name      │
│    varchar     │
├────────────────┤
│ wiki_pageviews │
└────────────────┘

In [6]:
table_name = "wiki_pageviews"
query = f"PRAGMA table_info('web_db.{table_name}');"

# We can apply the method .df() to the result of the query to convert it into a dataframe
column_info_df = conn.sql(query).df()
column_info_df

Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,date,DATE,False,,False
1,1,country_code,VARCHAR,False,,False
2,2,project,VARCHAR,False,,False
3,3,article,VARCHAR,False,,False
4,4,qid,VARCHAR,False,,False
5,5,pageviews,BIGINT,False,,False


I want the date, country_code, project (lang_code comes from it), article, qid, and pageviews

4. SQL Commands for the Table

In [7]:
query_1 = """
SELECT * FROM wiki_pageviews
LIMIT 10;
"""
result_1 = conn.sql(query_1).df() # after executing, convert to df for better printout

result_1

Unnamed: 0,date,country_code,project,article,qid,pageviews
0,2023-02-06,DZ,ar.wikipedia,ÙØªÙØ§Ø²Ù_Ø£Ø¶ÙØ§Ø¹,Q45867,108
1,2023-02-06,DZ,ar.wikipedia,Ø§ÙØ£ÙØ¯ÙØ³,Q123559,145
2,2023-02-06,AR,en.wikipedia,Robledo_Puch,Q3181149,99
3,2023-02-06,AR,es.wikipedia,Ojo_de_Horus,Q211286,135
4,2023-02-06,AR,es.wikipedia,Estaciones_del_aÃ±o,Q24384,171
5,2023-02-06,AR,es.wikipedia,Isla_de_Alcatraz,Q131354,126
6,2023-02-06,AR,es.wikipedia,Volkswagen_Gol,Q275442,148
7,2023-02-06,AR,es.wikipedia,RÃ­o_Cuarto_(ciudad),Q983451,179
8,2023-02-06,AR,es.wikipedia,Todo_Noticias,Q3244714,325
9,2023-02-06,AR,es.wikipedia,Tres_metros_sobre_el_cielo_(pelÃ­cula_de_2010),Q944385,112


Now that I have my dataframe, I can get the specific information I need:
- date
- article titles
- qid
- lang_code
- country_code
- pageviews

I also need to pick countries to look at

Let me filter by date first. For this first part, I only want one month of data, so I am going to pick March 2023.

The table_info output above lists the date column as type DATE, so I can compare it against a DATE literal.

In [32]:
query_2 = """
SELECT date, country_code, project, article, qid, pageviews
FROM wiki_pageviews
WHERE DATE_TRUNC('month', date) = DATE '2023-03-01'
"""
df = conn.sql(query_2).df()

In [27]:
df.tail()

Unnamed: 0,date,country_code,project,article,qid,pageviews
11913520,2023-03-31,US,en.wikipedia,Jerry_Nadler,Q505598,512
11913521,2023-03-31,US,en.wikipedia,68â95â99.7_rule,Q847822,530
11913522,2023-03-31,US,fr.wikipedia,France,Q142,685
11913523,2023-03-31,US,uk.wikipedia,YouTube,Q866,3071
11913524,2023-03-31,US,zh.wikipedia,æ­æ´åÂ·ç¦å°æ©æ¯,Q4653,557


Let me get the lang_code from the project title with pandas first

In [33]:
df[['lang_code', 'project']] = df['project'].str.split('.', n=1, expand=True)
df.tail()

Unnamed: 0,date,country_code,project,article,qid,pageviews,lang_code
11913520,2023-03-31,US,wikipedia,Jerry_Nadler,Q505598,512,en
11913521,2023-03-31,US,wikipedia,68â95â99.7_rule,Q847822,530,en
11913522,2023-03-31,US,wikipedia,France,Q142,685,fr
11913523,2023-03-31,US,wikipedia,YouTube,Q866,3071,uk
11913524,2023-03-31,US,wikipedia,æ­æ´åÂ·ç¦å°æ©æ¯,Q4653,557,zh


In [25]:
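# Earlier attempt, and it is buggy: df["project"][0] is only the first row's string, so [0:2]
# gives one value ("ar") that is copied to every row. The str.split cell above is the one to use.
df["lang_code"] = df["project"][0][0:2]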
df = df.drop('project', axis=1)
df.tail()

Unnamed: 0,date,country_code,article,qid,pageviews,lang_code
11913520,2023-03-31,US,Jerry_Nadler,Q505598,512,ar
11913521,2023-03-31,US,68â95â99.7_rule,Q847822,530,ar
11913522,2023-03-31,US,France,Q142,685,ar
11913523,2023-03-31,US,YouTube,Q866,3071,ar
11913524,2023-03-31,US,æ­æ´åÂ·ç¦å°æ©æ¯,Q4653,557,ar


Okay, now I have my dataframe for March 2023. Next, I need to get my country dataframes.

I want to include the US, Japan, UK, India, and Germany. These are supposedly the countries that use Wikipedia the most.
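Since the month and the five countries are known at this point, the filtering could also happen inside the SQL query, so that only the needed rows ever reach pandas (which is what Eni's note suggests). A minimal sketch, assuming the `wiki_pageviews` table and column names from the table_info output above. `build_month_query` is a hypothetical helper name, and the half-open date range is a substitution for the `DATE_TRUNC` filter, not something from the tutorial:

```python
from datetime import date

def build_month_query(countries, year, month, table="wiki_pageviews"):
    """Build a SELECT for one calendar month and a fixed list of country codes.

    Hypothetical helper; table and column names follow the table_info output.
    """
    start = date(year, month, 1)
    # First day of the following month (December rolls over to January).
    if month == 12:
        end = date(year + 1, 1, 1)
    else:
        end = date(year, month + 1, 1)

    # The codes come from our own hard-coded list, so they are inlined after a
    # sanity check instead of being passed as query parameters.
    for code in countries:
        if not (len(code) == 2 and code.isalpha() and code.isupper()):
            raise ValueError(f"unexpected country code: {code!r}")
    in_list = ", ".join(f"'{code}'" for code in countries)

    return (
        "SELECT date, country_code, project, article, qid, pageviews\n"
        f"FROM {table}\n"
        f"WHERE date >= DATE '{start.isoformat()}' AND date < DATE '{end.isoformat()}'\n"
        f"  AND country_code IN ({in_list})"
    )

query = build_month_query(["US", "JP", "GB", "IN", "DE"], 2023, 3)
print(query)
```

With the connection from the setup cells, the result would be loaded the same way as before, with `conn.sql(query).df()`. I have not timed this against the remote database, so whether it is noticeably faster is an open question.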

In [34]:
USdf = df[(df['country_code'] == 'US')]
USdf.head()

Unnamed: 0,date,country_code,project,article,qid,pageviews,lang_code
1681,2023-03-01,US,wikipedia,Kawasaki_disease,Q265936,684,en
1682,2023-03-01,US,wikipedia,The_Elder_Scrolls_IV:_Oblivion,Q49607,530,en
1683,2023-03-01,US,wikipedia,Marathon_Man_(film),Q1195727,523,en
1684,2023-03-01,US,wikipedia,Eleanor_Tomlinson,Q1582005,697,en
1685,2023-03-01,US,wikipedia,Alice_Neel,Q460186,1044,en


In [35]:
JPdf = df[(df['country_code'] == 'JP')]
JPdf.head()

Unnamed: 0,date,country_code,project,article,qid,pageviews,lang_code
951,2023-03-01,JP,wikipedia,Compartment_No._6,Q107092356,104,en
952,2023-03-01,JP,wikipedia,çå®é«ç°æ´¾,Q10437214,167,ja
953,2023-03-01,JP,wikipedia,ãã«ã¨ãã¹ãã¨å¬åç£,Q483263,285,ja
954,2023-03-01,JP,wikipedia,ä¼½è¶,Q28084,299,ja
955,2023-03-01,JP,wikipedia,å½é72ç³»é»è»,Q11421672,181,ja


In [36]:
UKdf = df[(df['country_code'] == 'GB')]
UKdf.head()

Unnamed: 0,date,country_code,project,article,qid,pageviews,lang_code
1501,2023-03-01,GB,wikipedia,Bill_Murray,Q29250,386,en
1502,2023-03-01,GB,wikipedia,Russian_cruiser_Moskva,Q2992278,95,en
1503,2023-03-01,GB,wikipedia,Ashes_to_Ashes_(British_TV_series),Q725195,124,en
1504,2023-03-01,GB,wikipedia,Green_Boots,Q3541506,162,en
1505,2023-03-01,GB,wikipedia,Red_Dead_Redemption,Q548203,194,en


In [37]:
INdf = df[(df['country_code'] == 'IN')]
INdf.head()

Unnamed: 0,date,country_code,project,article,qid,pageviews,lang_code
633,2023-03-01,IN,wikipedia,à¤à¤¶à¥à¤,Q8589,177,bh
634,2023-03-01,IN,wikipedia,à¦ªà¦¾à¦ à¦¾à¦¨_(à¦à¦²à¦à§à¦à¦¿à¦¤à§à¦°),Q114620212,98,bn
635,2023-03-01,IN,wikipedia,Hussain_Kuwajerwala,Q5949546,225,en
636,2023-03-01,IN,wikipedia,Resident_Evil_(film),Q153484,145,en
637,2023-03-01,IN,wikipedia,Sherilyn_Fenn,Q229993,109,en


In [38]:
DEdf = df[(df['country_code'] == 'DE')]
DEdf.head()

Unnamed: 0,date,country_code,project,article,qid,pageviews,lang_code
470,2023-03-01,DE,wikipedia,Carles_Puigdemont,Q4740163,101,br
471,2023-03-01,DE,wikipedia,Liste_von_Pistolen,Q60526,149,de
472,2023-03-01,DE,wikipedia,Priyanka_Chopra_Jonas,Q158957,215,de
473,2023-03-01,DE,wikipedia,Denis_Wladimirowitsch_Puschilin,Q16514790,109,de
474,2023-03-01,DE,wikipedia,Dominica,Q784,211,de


Now I can create my csv file

In [4]:
import pandas as pd

In [41]:
csv_data = pd.concat([USdf, JPdf, UKdf, INdf, DEdf], ignore_index=True)
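The five per-country dataframes plus `pd.concat` can probably be collapsed into a single filter with `isin`. A small sketch on made-up rows (the `sample` frame is a stand-in for the March dataframe, not real data):

```python
import pandas as pd

# Tiny stand-in for the March dataframe; article names and counts are made up.
sample = pd.DataFrame({
    "country_code": ["US", "FR", "JP", "GB", "BR", "IN", "DE"],
    "article": ["A", "B", "C", "D", "E", "F", "G"],
    "pageviews": [684, 120, 167, 386, 99, 177, 149],
})

countries = ["US", "JP", "GB", "IN", "DE"]

# Keep only rows whose country_code is in the list, then renumber the index.
selected = sample[sample["country_code"].isin(countries)].reset_index(drop=True)
print(selected)
```

One difference from the concat version: `isin` keeps the rows in their original order (by date), while concat stacks them country by country. That should not matter for the later groupby steps.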

In [43]:
csv_data.tail()

Unnamed: 0,date,country_code,project,article,qid,pageviews,lang_code
5636456,2023-03-31,DE,wikipedia,The_Glory_(TV_series),Q113197148,211,en
5636457,2023-03-31,DE,wikipedia,Evan_Gershkovich,Q117337455,1032,en
5636458,2023-03-31,DE,wikipedia,ÙØ±ÛÙÛÙ_ÙÙÙØ±Ù,Q4616,121,fa
5636459,2023-03-31,DE,wikipedia,ÙØ¯ÛÙ_Ø¨Ø§Ø²ÙÙØ¯,Q106396209,93,fa
5636460,2023-03-31,DE,wikipedia,Fabio_Cannavaro,Q102027,142,it


In [68]:
csv_data.drop(["project"], axis=1, inplace = True)

In [69]:
csv_data.head()

Unnamed: 0,date,country_code,article,qid,pageviews,lang_code
0,2023-03-01,US,Kawasaki_disease,Q265936,684,en
1,2023-03-01,US,The_Elder_Scrolls_IV:_Oblivion,Q49607,530,en
2,2023-03-01,US,Marathon_Man_(film),Q1195727,523,en
3,2023-03-01,US,Eleanor_Tomlinson,Q1582005,697,en
4,2023-03-01,US,Alice_Neel,Q460186,1044,en


In [70]:
csv_data.to_csv("final-project-data.csv", index=False)

Now, my next step is to get all the qids for my articles

In [45]:
qid_df = pd.read_csv('final-project-data.csv')

In [48]:
qid_df = qid_df[["article", "qid"]]

In [49]:
qid_df.head()

Unnamed: 0,article,qid
0,Kawasaki_disease,Q265936
1,The_Elder_Scrolls_IV:_Oblivion,Q49607
2,Marathon_Man_(film),Q1195727
3,Eleanor_Tomlinson,Q1582005
4,Alice_Neel,Q460186
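Because the CSV has one row per article per day per country, the same QID shows up many times in this column. Only the distinct QIDs need to be sent to Wikidata. A minimal sketch in plain Python; `unique_qids` is a hypothetical helper, and the `None` in the sample stands for a possibly missing QID (I have not checked whether the real data has any):

```python
def unique_qids(qids):
    """Return QIDs in first-seen order, dropping repeats and non-QID values."""
    # dict.fromkeys keeps insertion order and drops duplicates in one pass.
    seen = dict.fromkeys(
        q for q in qids if isinstance(q, str) and q.startswith("Q")
    )
    return list(seen)

# A few values shaped like the qid column, with repeats and one missing entry.
daily_rows = ["Q265936", "Q49607", "Q265936", None, "Q1195727", "Q49607"]
print(unique_qids(daily_rows))
```

On the dataframe itself, `qid_df["qid"].dropna().drop_duplicates().tolist()` should give the same kind of list.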


Next, we can get script information about the articles using the code from the 4_get_wikidata file

In [35]:
import requests
import json, os

WIKIDATA_API_ENDPOINT = "https://www.wikidata.org/w/api.php"

def fetch_complete_entity_data(qid):
    """
    Fetches all available structured data for a single Wikidata entity (QID)
    using the official Wikibase API action=wbgetentities.

    Args:
        qid (str): The Wikidata Item ID (e.g., 'Q83285' for Durres).

    Returns:
        dict: The complete raw JSON data for the entity, or an error dictionary.
    """

    # Parameters for the MediaWiki API, using the 'wbgetentities' action
    params = {
        'action': 'wbgetentities',
        'ids': qid,
        'format': 'json',
        # Request all relevant data: claims (properties), labels, descriptions, sitelinks (Wikipedia links)
        'props': 'claims|labels|descriptions|sitelinks|aliases',
    }

    # Add a User-Agent header as recommended by Wikidata API policies
    # https://www.wikidata.org/wiki/Wikidata:Contact_the_development_team#User-Agent
    headers = {
        'User-Agent': 'Colab-Wikidata-Example/1.0 (https://colab.research.google.com; colab-user@example.com)'
    }

    try:
        response = requests.get(WIKIDATA_API_ENDPOINT, 
                                params=params, 
                                headers=headers, 
                                timeout=10)
        response.raise_for_status()  # Raises an HTTPError for bad responses (4xx or 5xx)

        data = response.json()

        # Check for potential errors in the API response structure
        if 'error' in data:
            return {"error": f"API Error for {qid}: {data['error']['info']}"}

        # The core data is nested under ['entities'][qid]
        entity_data = data.get('entities', {}).get(qid)

        if entity_data:
            return entity_data
        else:
            return {"error": f"Entity {qid} not found or no data returned."}

    except requests.exceptions.RequestException as e:
        return {"error": f"Network or API request error: {e}"}
    except json.JSONDecodeError:
        return {"error": "Failed to decode JSON response."}

I am going to run this on 1 qid to understand better how this code works: 

In [52]:
fetch_complete_entity_data("Q265936")

{'type': 'item',
 'id': 'Q265936',
 'labels': {'de': {'language': 'de', 'value': 'Kawasaki-Syndrom'},
  'ar': {'language': 'ar', 'value': 'داء كاواساكي'},
  'ca': {'language': 'ca', 'value': 'malaltia de Kawasaki'},
  'dv': {'language': 'dv', 'value': 'ކަވަސާކީ ސިންޑްރޯމް'},
  'en': {'language': 'en', 'value': 'Kawasaki disease'},
  'es': {'language': 'es', 'value': 'Enfermedad de Kawasaki'},
  'et': {'language': 'et', 'value': 'Kawasaki haigus'},
  'fa': {'language': 'fa', 'value': 'نشانگان کاوازاکی'},
  'fi': {'language': 'fi', 'value': 'Kawasakin tauti'},
  'fr': {'language': 'fr', 'value': 'maladie de Kawasaki'},
  'he': {'language': 'he', 'value': 'מחלת קווסאקי'},
  'hu': {'language': 'hu', 'value': 'Kawasaki-szindróma'},
  'it': {'language': 'it', 'value': 'sindrome di Kawasaki'},
  'ja': {'language': 'ja', 'value': '川崎病'},
  'ms': {'language': 'ms', 'value': 'Penyakit Kawasaki'},
  'nl': {'language': 'nl', 'value': 'Ziekte van Kawasaki'},
  'pl': {'language': 'pl', 'value': 'Cho

I am not totally sure what all of this means
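As far as I can tell, the response is one nested dictionary. `labels` maps a language code to the entity's name in that language, `descriptions` does the same for short descriptions, `claims` maps a property ID (like P31) to a list of statements, and `sitelinks` lists the Wikipedia pages for the entity. A rough sketch that boils one entity down to an overview. The `mini_entity` below is a hand-written miniature in the same shape, with the two labels copied from the output above and a placeholder claim value, so it is not real API output:

```python
def summarize_entity(entity, lang="en"):
    """Return a small overview of one wbgetentities entity record."""
    labels = entity.get("labels", {})
    claims = entity.get("claims", {})
    return {
        "id": entity.get("id"),
        "label": labels.get(lang, {}).get("value"),
        "label_languages": sorted(labels),
        "property_ids": sorted(claims),
        "statements_per_property": {pid: len(s) for pid, s in claims.items()},
        "sitelinks": sorted(entity.get("sitelinks", {})),
    }

mini_entity = {
    "type": "item",
    "id": "Q265936",
    "labels": {
        "en": {"language": "en", "value": "Kawasaki disease"},
        "de": {"language": "de", "value": "Kawasaki-Syndrom"},
    },
    "claims": {
        # Property ID -> list of statements; the value ID is a placeholder.
        "P279": [
            {"mainsnak": {"datavalue": {"type": "wikibase-entityid",
                                        "value": {"id": "Q_PLACEHOLDER"}}}}
        ],
    },
    # Illustrative sitelink entry, assumed from the article title in my data.
    "sitelinks": {"enwiki": {"site": "enwiki", "title": "Kawasaki disease"}},
}

summary = summarize_entity(mini_entity)
print(summary)
```

Running `summarize_entity(fetch_complete_entity_data("Q265936"))` on the real response should give a much shorter printout than the raw dictionary.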

I also have this code?

In [36]:
def _chunk_list(lst, n):
    """
    Yields successive n-sized chunks from lst.
    """
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

In [37]:
def fetch_labels_for_qids(qids: list[str], lang='en'):
    """
    Fetches labels for a list of Wikidata QIDs or Property IDs.
    Handles API limits by chunking the requests.

    Args:
        qids (list[str]): A list of Wikidata Item IDs or Property IDs (e.g., ['Q515', 'P31']).
        lang (str): The language code for the labels (default is 'en').

    Returns:
        dict: A dictionary mapping QID to its label, or an error dictionary.
    """
    if not qids:
        return {}

    # Wikidata API limit for 'ids' parameter is typically 50
    MAX_IDS_PER_REQUEST = 50
    all_labels_map = {}

    # Chunk the QID list to respect the API limit
    for qid_chunk in _chunk_list(qids, MAX_IDS_PER_REQUEST):
        params = {
            'action': 'wbgetentities',
            'ids': '|'.join(qid_chunk), # Join IDs with '|' so one request covers the whole chunk
            'format': 'json',
            'props': 'labels',
            'languages': lang,
        }

        headers = {
            'User-Agent': 'Colab-Wikidata-Example/1.0 (https://colab.research.google.com; colab-user@example.com)'
        }

        try:
            response = requests.get(WIKIDATA_API_ENDPOINT, params=params, headers=headers, timeout=10)
            response.raise_for_status() # Raises an HTTPError for bad responses (4xx or 5xx)

            data = response.json()

            if 'error' in data:
                # If an error occurs in one chunk, return it immediately or log and continue
                return {"error": f"API Error fetching labels for chunk {qid_chunk}: {data['error']['info']}"}

            for qid_key, entity_info in data.get('entities', {}).items():
                label = entity_info.get('labels', {}).get(lang, {}).get('value')
                if label:
                    all_labels_map[qid_key] = label

        except requests.exceptions.RequestException as e:
            return {"error": f"Network or API request error for chunk {qid_chunk}: {e}"}
        except json.JSONDecodeError:
            return {"error": "Failed to decode JSON response for a label chunk."}

    return all_labels_map

In [38]:
def extract_labeled_claim_values(claims: dict, property_labels: dict) -> dict:
    """
    Extracts the main value for each claim, resolves QID values to labels,
    and returns a dictionary of 'property_label': 'value' pairs.

    Args:
        claims (dict): The 'claims' section of a Wikidata entity's data.
        property_labels (dict): A dictionary mapping Property IDs (P-numbers) to their labels.

    Returns:
        dict: A dictionary where keys are property labels and values are their extracted/resolved values.
    """
    labeled_values = {}
    qids_to_resolve = set() # Collect all QIDs that need labels

    # First pass: Extract raw values and collect QIDs
    extracted_raw_values = {}
    for prop_id, statements in claims.items():
        prop_label = property_labels.get(prop_id, prop_id) # Use ID if label not found
        
        # We often care about the primary value of the first statement for simplicity
        if statements:
            main_snak = statements[0].get('mainsnak')
            if not main_snak or 'datavalue' not in main_snak: # Skip if no main value
                continue

            data_value = main_snak['datavalue']
            value_type = data_value.get('type')

            if value_type == 'wikibase-entityid':
                qid_value = data_value['value']['id']
                extracted_raw_values[prop_label] = qid_value # Store QID for later resolution
                qids_to_resolve.add(qid_value)
            elif value_type == 'string' or value_type == 'external-id':
                extracted_raw_values[prop_label] = data_value['value']
            elif value_type == 'quantity':
                # Format quantity with unit if available
                amount = data_value['value']['amount']
                unit = data_value['value'].get('unit', '').replace('http://www.wikidata.org/entity/', '')
                if unit and unit != '1': # '1' is the URI for dimensionless unit
                    # Attempt to add unit to QID list for resolution
                    if unit.startswith('Q'):
                        qids_to_resolve.add(unit)
                        extracted_raw_values[prop_label] = (amount, unit) # Store as tuple for later unit resolution
                    else:
                        extracted_raw_values[prop_label] = f"{amount} {unit}" # Simple string for non-QID units
                else:
                    extracted_raw_values[prop_label] = amount
            elif value_type == 'time':
                # Simple representation for time
                extracted_raw_values[prop_label] = data_value['value']['time']
            elif value_type == 'globecoordinate':
                latitude = data_value['value']['latitude']
                longitude = data_value['value']['longitude']
                extracted_raw_values[prop_label] = f"Lat: {latitude}, Lon: {longitude}"
            elif value_type == 'monolingualtext':
                extracted_raw_values[prop_label] = data_value['value']['text']
            # Add more types as needed
            else:
                # For unhandled types or complex structures, just show the raw datavalue
                extracted_raw_values[prop_label] = f"[Unhandled Type: {value_type}]"

    # Second pass: Resolve QID values and units to labels
    if qids_to_resolve:
        resolved_value_labels = fetch_labels_for_qids(list(qids_to_resolve))
        if "error" in resolved_value_labels:
            print(f"Warning: Could not resolve some value labels: {resolved_value_labels['error']}")
            # Proceed with raw QIDs if resolution fails
            pass

        for prop_label, value in extracted_raw_values.items():
            if isinstance(value, str) and value.startswith('Q'):
                labeled_values[prop_label] = resolved_value_labels.get(value, value) # Use raw QID if label not found
            elif isinstance(value, tuple) and len(value) == 2 and value[1].startswith('Q'): # Handle quantity with QID unit
                amount, unit_qid = value
                unit_label = resolved_value_labels.get(unit_qid, unit_qid)
                labeled_values[prop_label] = f"{amount} {unit_label}"
            else:
                labeled_values[prop_label] = value
    else:
        labeled_values = extracted_raw_values # No QIDs to resolve

    return labeled_values

In [39]:
def test_one(QID):
    """
    Demonstrates fetching the complete JSON data for a given QID string
    and then resolving labels for properties and their values.
    """
    # QID passed in by the caller (the tutorial's original example was Durres)
    qid_example = QID
    print(f"--- Fetching ALL structured data for {qid_example} \n")

    entity_data = fetch_complete_entity_data(qid_example)

    if "error" in entity_data:
        print(f"Error: {entity_data['error']}")
        return

    # Display main entity's label and description
    print(f"--- Main Entity Details ({qid_example}) ---")
    entity_label = entity_data.get('labels', {}).get('en', {}).get('value', 'No label found')
    entity_description = entity_data.get('descriptions', {}).get('en', {}).get('value', 'No description found')
    print(f"Label: {entity_label}")
    print(f"Description: {entity_description}\n")

    print("--- Full Raw JSON Structure (Truncated for readability) ---")

    # We will print the Claims section specifically to show the attribute:value pairs
    claims = entity_data.get('claims', {}) # This is the full claims dict
    print(f"\nTotal Properties (Claims) Found: {len(claims)}\n")

    # Get labels for the property IDs themselves
    property_ids = list(claims.keys())
    property_labels = fetch_labels_for_qids(property_ids)
    if "error" in property_labels:
        #print(f"Error fetching property labels: {property_labels['error']}")
        property_labels = {pid: pid for pid in property_ids} # Fallback to IDs if labels fail
    else:
        #print("Property IDs found for this entity:")
        # Print property IDs with their labels
        labeled_properties_overview = {pid: property_labels.get(pid, 'Label Not Found') for pid in property_ids}
        #print(json.dumps(labeled_properties_overview, indent=2))

    # Now, extract and label the claim values
    print("\n--- Extracted Labeled Claim Values ---")
    labeled_claim_values = extract_labeled_claim_values(claims, property_labels)
    print(json.dumps(labeled_claim_values, indent=2, ensure_ascii=False))

    print("\n--- Details for 'P31' (instance of) ---")

    if 'P31' in claims:
        # P31 is 'instance of', and it will contain an array of statements
        p31_statements = claims['P31']

        # Iterate over the values found for P31
        extracted_value_qids = []
        for statement in p31_statements:
            # The value is usually nested deep in the datavalue section
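            main_snak = statement['mainsnak']
            # 'datavalue' is missing for "no value" / "unknown value" statements, so use .get()
            data_value = main_snak.get('datavalue', {})
            if data_value.get('type') == 'wikibase-entityid':
                value_qid = data_value['value']['id']
                extracted_value_qids.append(value_qid)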

        # Get the label for the P31 property itself
        p31_label = property_labels.get('P31', 'Label Not Found for P31')
        print(f"Property P31 label: '{p31_label}'")

        # Get the labels for the extracted QID values
        value_labels = fetch_labels_for_qids(extracted_value_qids)

        if "error" in value_labels:
            print(f"Error fetching value labels: {value_labels['error']}")
        else:
            print(f"Raw QID values for 'instance of' (P31): {extracted_value_qids}")
            labeled_values = [value_labels.get(qid, 'Label Not Found') for qid in extracted_value_qids]
            print(f"Labeled values for 'instance of' (P31): {labeled_values}")
    else:
        print("P31 property not found in claims.")

    print("\n------------------------------------------------------------")
    print("This raw data contains every single piece of structured information available for the entity.")



In [40]:
def process_qids_to_jsonl(qid_list, output_filename="entity_data.jsonl"):
    """
    Processes a list of QIDs, fetches structured data, labels it, and stores
    the results (or errors) into a JSONL file.
    
    Args:
        qid_list (list): A list of QID strings (e.g., ['Q534', 'Q142', 'Q999']).
        output_filename (str): The name of the JSONL file to write results to.
    """
    print(f"Starting processing for {len(qid_list)} QIDs.")
    print(f"Results will be written to '{output_filename}'.")
    
    successful_count = 0
    failed_count = 0

    with open(output_filename, 'w', encoding='utf-8') as f:
        for qid in qid_list:
            print(f"Processing {qid}...")
            
            # Initialize the base record structure
            record = {"QID": qid, "status": "failed", "error_message": None}
            
            try:
                # 1. Fetch raw entity data (using your existing function)
                entity_data = fetch_complete_entity_data(qid)

                if "error" in entity_data:
                    # Handle API/Not Found error directly
                    record["error_message"] = entity_data['error']
                    failed_count += 1
                else:
                    # 2. Extract basic details
                    entity_label = entity_data.get('labels', {}).get('en', {}).get('value', 'No label found')
                    entity_description = entity_data.get('descriptions', {}).get('en', {}).get('value', 'No description found')
                    claims = entity_data.get('claims', {})
                    
                    # 3. Get labels for the properties themselves (using your existing function)
                    property_ids = list(claims.keys())
                    property_labels = fetch_labels_for_qids(property_ids)

                    if "error" in property_labels:
                        # Fallback for label fetching error
                        property_labels = {pid: pid for pid in property_ids} 
                        print(f"  Warning: Failed to fetch property labels for {qid}. Using IDs.")
                    
                    # 4. Extract and label all claim values (using your existing function)
                    labeled_claim_values = extract_labeled_claim_values(claims, property_labels)

                    # 5. Structure the final dictionary for successful outcome
                    record.update({
                        "status": "success",
                        "label": entity_label,
                        "description": entity_description,
                        "attributes": labeled_claim_values
                    })
                    record.pop("error_message") # Remove error key on success
                    successful_count += 1
            
            except Exception as e:
                # Catch any unexpected execution errors
                record["error_message"] = f"Unexpected execution error: {type(e).__name__} - {e}"
                failed_count += 1

            # 6. Write the final record (whether success or failure) to the JSONL file
            json_line = json.dumps(record, ensure_ascii=False)
            f.write(json_line + '\n')
    
    print("\n--- Processing Complete ---")
    print(f"Total Processed: {len(qid_list)}")
    print(f"Successful Records: {successful_count}")
    print(f"Failed Records: {failed_count}")
    print("---------------------------\n")

In [57]:
test_one("Q265936")

--- Fetching ALL structured data for Q265936 

--- Main Entity Details (Q265936) ---
Label: Kawasaki disease
Description: human disease in which blood vessels throughout the body become inflamed

--- Full Raw JSON Structure (Truncated for readability) ---

Total Properties (Claims) Found: 53


--- Extracted Labeled Claim Values ---
{
  "Commons category": "Kawasaki disease",
  "OMIM ID": "611775",
  "MedlinePlus ID": "000989",
  "DiseasesDB": "7121",
  "eMedicine ID": "965367",
  "NDL Authority ID": "00565244",
  "Freebase ID": "/m/040k6g",
  "image": "Kawasaki Disease.png",
  "Gran Enciclopèdia Catalana ID (former scheme)": "0262801",
  "Patientplus ID": "kawasaki-disease-pro",
  "Disease Ontology ID": "DOID:13378",
  "NCI Thesaurus ID": "C34825",
  "subclass of": "lymphadenitis",
  "health specialty": "immunology",
  "genetic association": "PPM1L",
  "exact match": "http://purl.obolibrary.org/obo/DOID_13378",
  "UMLS CUI": "C2936917",
  "symptoms and signs": "strawberry tongue",
  "Q

Now I will do this for the first 20 qids in my dataframe

In [59]:
top20 = qid_df.head(20)

In [60]:
top20 = top20['qid'].tolist()

In [61]:
top20

['Q265936',
 'Q49607',
 'Q1195727',
 'Q1582005',
 'Q460186',
 'Q486306',
 'Q869018',
 'Q857634',
 'Q709133',
 'Q675937',
 'Q962932',
 'Q18432',
 'Q3311525',
 'Q2181925',
 'Q122248',
 'Q192814',
 'Q254038',
 'Q4357239',
 'Q30113',
 'Q381941']

In [71]:
import time

In [72]:
start_time = time.perf_counter()

if __name__ == "__main__":
    #test_one("Q83285") # Article about Durres
    #test_one("Q7186")  # Article about Marie Curie

    # I'm putting the list here, but you'll have a file with a list of QIDs here.
    qid_list_to_process = top20
    
    output_file = "entity_results.jsonl"

    # Run the main function
    process_qids_to_jsonl(qid_list_to_process, output_file)

end_time = time.perf_counter()
elapsed_time = end_time - start_time

print(f"Code execution time: {elapsed_time:.4f} seconds")

Starting processing for 20 QIDs.
Results will be written to 'entity_results.jsonl'.
Processing Q265936...
Processing Q49607...
Processing Q1195727...
Processing Q1582005...
Processing Q460186...
Processing Q486306...
Processing Q869018...
Processing Q857634...
Processing Q709133...
Processing Q675937...
Processing Q962932...
Processing Q18432...
Processing Q3311525...
Processing Q2181925...
Processing Q122248...
Processing Q192814...
Processing Q254038...
Processing Q4357239...
Processing Q30113...
Processing Q381941...

--- Processing Complete ---
Total Processed: 20
Successful Records: 20
Failed Records: 0
---------------------------

Code execution time: 40.2595 seconds
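About 2 seconds per QID is probably mostly network round trips (an assumption, I have not profiled it): `process_qids_to_jsonl` typically makes three or more API calls per QID, one for the entity and then separate ones for property labels and value labels. Since `wbgetentities` accepts several IDs joined with `|`, which `fetch_labels_for_qids` already relies on with its limit of 50, the entity fetches could be batched in the same way. A sketch of just the batching step; `batched_entity_params` is a hypothetical helper and is not part of 4_get_wikidata:

```python
def batched_entity_params(qids, batch_size=50):
    """Yield one wbgetentities parameter dict per batch of up to batch_size IDs."""
    for start in range(0, len(qids), batch_size):
        batch = qids[start:start + batch_size]
        yield {
            "action": "wbgetentities",
            "ids": "|".join(batch),  # several IDs in a single request
            "format": "json",
            "props": "claims|labels|descriptions",
        }

# 120 made-up IDs, used only to count how many requests would be needed.
example_qids = [f"Q{n}" for n in range(1, 121)]
requests_needed = list(batched_entity_params(example_qids))
print(len(requests_needed), "requests instead of", len(example_qids))
```

Each parameter dict would be passed to `requests.get(WIKIDATA_API_ENDPOINT, params=..., headers=...)`, and the `entities` part of the response would then have one key per requested ID. I have not tested how large or slow a 50-entity response with full claims is.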


At roughly 2 seconds per QID (40.26 seconds for 20), fetching Wikidata for all 5,636,461 rows in my CSV would take about 130 days, so I am going to change my CSV file to only have the top 1,000 articles for each country for the month.
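The estimate can be checked with a little arithmetic. The sketch below scales the timing from the 20-QID run to three sizes: the 5,636,461 rows in the CSV, the 488,066 article-country pairs counted further down, and 5,000 (top 1,000 for each of 5 countries). It assumes the time per QID stays constant, which is a simplification:

```python
SECONDS_PER_DAY = 24 * 60 * 60

def estimate_days(n_items, sample_seconds=40.2595, sample_size=20):
    """Scale the measured run (40.2595 s for 20 QIDs) up to n_items."""
    seconds_per_item = sample_seconds / sample_size
    return n_items * seconds_per_item / SECONDS_PER_DAY

scenarios = [
    ("rows in the CSV", 5_636_461),
    ("article-country pairs", 488_066),
    ("top 1,000 x 5 countries", 5_000),
]
for name, n_items in scenarios:
    print(f"{name}: {estimate_days(n_items):.2f} days")
```

The first line comes out at a bit over 130 days, which matches the figure above. Fetching each distinct article only once already cuts that by roughly a factor of ten, and the top-1,000 cutoff brings it down to a few hours.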

First, I am going to see how many articles I actually have when I aggregate the pageviews for articles over the month

In [5]:
articles_df = pd.read_csv('final-project-data.csv')

In [6]:
articles_df.head()

Unnamed: 0,date,country_code,article,qid,pageviews,lang_code
0,2023-03-01,US,Kawasaki_disease,Q265936,684,en
1,2023-03-01,US,The_Elder_Scrolls_IV:_Oblivion,Q49607,530,en
2,2023-03-01,US,Marathon_Man_(film),Q1195727,523,en
3,2023-03-01,US,Eleanor_Tomlinson,Q1582005,697,en
4,2023-03-01,US,Alice_Neel,Q460186,1044,en


In [7]:
articles = articles_df.groupby('article')['pageviews'].sum().reset_index()

In [8]:
articles.head()

Unnamed: 0,article,pageviews
0,"!aaaH-aH_,yawA_eM_ekaT_oT_gnimoC_er'yehT",3428
1,"""C""_Is_for_(Please_Insert_Sophomoric_Genitalia...",1006
2,"""Christmas_tree""_files",371
3,"""Crimea_is_a_'red_line'_for_Putin"":_Dr._Jeremy...",1105
4,"""Freeway""_Rick_Ross",32293


I forgot to group by country as well, so the totals above mix pageviews from different countries

In [84]:
articles = articles_df.groupby(['article', "country_code"])['pageviews'].sum().reset_index()

In [86]:
articles.head()

Unnamed: 0,article,country_code,pageviews
0,"!aaaH-aH_,yawA_eM_ekaT_oT_gnimoC_er'yehT",DE,99
1,"!aaaH-aH_,yawA_eM_ekaT_oT_gnimoC_er'yehT",GB,412
2,"!aaaH-aH_,yawA_eM_ekaT_oT_gnimoC_er'yehT",US,2917
3,"""C""_Is_for_(Please_Insert_Sophomoric_Genitalia...",IN,1006
4,"""Christmas_tree""_files",GB,371


Let me also convert the Unicode characters now
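If the odd-looking titles are UTF-8 text that was decoded as Latin-1 somewhere along the way (an assumption, I have not confirmed where the garbling happens), the usual repair is to encode back to Latin-1 and decode as UTF-8. A sketch with a hypothetical helper `fix_mojibake` that leaves a string alone when the repair does not apply:

```python
def fix_mojibake(text):
    """Undo 'UTF-8 bytes read as Latin-1' garbling; return text unchanged otherwise."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1 (already proper
        # Unicode) or the bytes are not valid UTF-8, so leave it as it is.
        return text

# Written with escapes so the example does not depend on this file's encoding;
# on screen the garbled title looks like "Estaciones_del_a" + "\u00c3\u00b1" + "o".
garbled = "Estaciones_del_a\u00c3\u00b1o"
print(fix_mojibake(garbled))
print(fix_mojibake("Volkswagen_Gol"))  # plain ASCII passes through unchanged
```

Applied to the column, this would be `articles_df["article"].map(fix_mojibake)`. Titles where some bytes were lost entirely cannot be recovered this way, and if pandas already shows the titles correctly then nothing needs fixing.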

In [87]:
len(articles)

488066

Okay, I am going to cut down the number of articles from the country specific dataframes

I'm just going to remake my country specific dataframes because I don't want to rerun all my code

In [None]:
USdf
JPdf
UKdf
INdf
DEdf

I need to sort by pageviews then just keep the top 10,000 or so
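A minimal sketch of that step: sum the daily pageviews per article within each country, sort, and keep the first `n` rows per country. `top_articles_per_country` is a hypothetical helper, and the QIDs and counts in `daily` are made up. I group by `lang_code` and `qid` as well so that the same title in two language editions is not merged:

```python
import pandas as pd

def top_articles_per_country(daily, n):
    """Sum daily pageviews per article and country, keep the n most viewed per country."""
    keys = ["country_code", "lang_code", "article", "qid"]
    # Note: groupby drops rows whose key columns contain missing values.
    monthly = daily.groupby(keys, as_index=False)["pageviews"].sum()
    monthly = monthly.sort_values(
        ["country_code", "pageviews"], ascending=[True, False]
    )
    # head(n) on the grouped frame keeps the first n rows of each country.
    return monthly.groupby("country_code").head(n).reset_index(drop=True)

# Made-up rows in the same shape as final-project-data.csv.
daily = pd.DataFrame({
    "date": ["2023-03-01", "2023-03-02", "2023-03-01",
             "2023-03-01", "2023-03-02", "2023-03-01"],
    "country_code": ["US", "US", "US", "DE", "DE", "DE"],
    "lang_code": ["en", "en", "en", "de", "de", "en"],
    "article": ["A", "A", "B", "C", "C", "A"],
    "qid": ["Q1", "Q1", "Q2", "Q3", "Q3", "Q1"],
    "pageviews": [100, 50, 120, 10, 30, 99],
})

top = top_articles_per_country(daily, n=1)
print(top)
```

For the real data the call would be `top_articles_per_country(articles_df, n=1000)`, or whichever cutoff I settle on.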

In [44]:
USdf = articles_df[(articles_df['country_code'] == 'US')]
JPdf = articles_df[(articles_df['country_code'] == 'JP')]
UKdf = articles_df[(articles_df['country_code'] == 'GB')]
INdf = articles_df[(articles_df['country_code'] == 'IN')]
DEdf = articles_df[(articles_df['country_code'] == 'DE')]

In [45]:
USdf.head()

Unnamed: 0,date,country_code,article,qid,pageviews,lang_code
0,2023-03-01,US,Kawasaki_disease,Q265936,684,en
1,2023-03-01,US,The_Elder_Scrolls_IV:_Oblivion,Q49607,530,en
2,2023-03-01,US,Marathon_Man_(film),Q1195727,523,en
3,2023-03-01,US,Eleanor_Tomlinson,Q1582005,697,en
4,2023-03-01,US,Alice_Neel,Q460186,1044,en


In [22]:
# Assign each result: in a notebook cell, a bare expression is discarded unless it is the last line.
US_totals = USdf.groupby(['article', "country_code"])['pageviews'].sum().reset_index()
JP_totals = JPdf.groupby(['article', "country_code"])['pageviews'].sum().reset_index()
UK_totals = UKdf.groupby(['article', "country_code"])['pageviews'].sum().reset_index()
IN_totals = INdf.groupby(['article', "country_code"])['pageviews'].sum().reset_index()
DE_totals = DEdf.groupby(['article', "country_code"])['pageviews'].sum().reset_index()
DE_totals

Unnamed: 0,article,country_code,pageviews
0,"!aaaH-aH_,yawA_eM_ekaT_oT_gnimoC_er'yehT",DE,99
1,"""Hello,_World!""_program",DE,417
2,"""No_Way_to_Prevent_This"",_Says_Only_Nation_Whe...",DE,2748
3,"""Weird_Al""_Yankovic",DE,1149
4,$,DE,1397
...,...,...,...
85679,é½ææ,DE,141
85680,é½å¿,DE,100
85681,êµ­ì _ì¬ì±ì_ë,DE,135
85682,ìí¤ë¯¸ëì´_íêµ­,DE,95


In [46]:
USdf['article'] = USdf['article'].astype(str).str.encode('utf-8').str.decode('utf-8')
JPdf['article'] = JPdf['article'].astype(str).str.encode('utf-8').str.decode('utf-8')
UKdf['article'] = UKdf['article'].astype(str).str.encode('utf-8').str.decode('utf-8')
INdf['article'] = INdf['article'].astype(str).str.encode('utf-8').str.decode('utf-8')
DEdf['article'] = DEdf['article'].astype(str).str.encode('utf-8').str.decode('utf-8')


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  USdf['article'] = USdf['article'].astype(str).str.encode('utf-8').str.decode('utf-8')

In [47]:
USdf.groupby(['article', "country_code"])['pageviews'].sum().reset_index()
JPdf.groupby(['article', "country_code"])['pageviews'].sum().reset_index()
UKdf.groupby(['article', "country_code"])['pageviews'].sum().reset_index()
INdf.groupby(['article', "country_code"])['pageviews'].sum().reset_index()
DEdf.groupby(['article', "country_code"])['pageviews'].sum().reset_index()

Unnamed: 0,article,country_code,pageviews
0,"!aaaH-aH_,yawA_eM_ekaT_oT_gnimoC_er'yehT",DE,99
1,"""Hello,_World!""_program",DE,417
2,"""No_Way_to_Prevent_This"",_Says_Only_Nation_Whe...",DE,2748
3,"""Weird_Al""_Yankovic",DE,1149
4,$,DE,1397
...,...,...,...
85680,é½ææ,DE,141
85681,é½å¿,DE,100
85682,êµ­ì _ì¬ì±ì_ë,DE,135
85683,ìí¤ë¯¸ëì´_íêµ­,DE,95


In [48]:
DEdf

Unnamed: 0,date,country_code,article,qid,pageviews,lang_code
4754982,2023-03-01,DE,Carles_Puigdemont,Q4740163,101,br
4754983,2023-03-01,DE,Liste_von_Pistolen,Q60526,149,de
4754984,2023-03-01,DE,Priyanka_Chopra_Jonas,Q158957,215,de
4754985,2023-03-01,DE,Denis_Wladimirowitsch_Puschilin,Q16514790,109,de
4754986,2023-03-01,DE,Dominica,Q784,211,de
...,...,...,...,...,...,...
5636456,2023-03-31,DE,The_Glory_(TV_series),Q113197148,211,en
5636457,2023-03-31,DE,Evan_Gershkovich,Q117337455,1032,en
5636458,2023-03-31,DE,ÙØ±ÛÙÛÙ_ÙÙÙØ±Ù,Q4616,121,fa
5636459,2023-03-31,DE,ÙØ¯ÛÙ_Ø¨Ø§Ø²ÙÙØ¯,Q106396209,93,fa


I am still getting some weird titles
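A note on why the titles still look wrong: `.str.encode('utf-8').str.decode('utf-8')` is a round trip, so it returns exactly the string it was given. Titles like `1._MÃ¤rz` look like UTF-8 bytes that were decoded as Latin-1 somewhere along the way. That is an assumption about how the data was read, not something the notebook confirms. Under that assumption, a minimal sketch of a repair is to go back to bytes with Latin-1 and decode as UTF-8. `fix_mojibake` is a hypothetical helper name.

```python
def fix_mojibake(text: str) -> str:
    """Undo a UTF-8-read-as-Latin-1 mix-up; return the input unchanged if it does not apply."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1, or the bytes are not valid UTF-8,
        # so this title was probably fine already (or is damaged beyond this simple repair).
        return text


broken = "1._MÃ¤rz"

# The UTF-8 round trip used above changes nothing.
same_as_before = broken.encode("utf-8").decode("utf-8")

# The Latin-1 -> UTF-8 round trip repairs the title.
repaired = fix_mojibake(broken)
print(same_as_before, repaired)
```

On a DataFrame column this could be applied with `df['article'].map(fix_mojibake)`. If the file itself was read with the wrong encoding, passing `encoding='utf-8'` to `pd.read_csv` may be the cleaner fix.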

In [53]:
len(USdf)

3000

In [49]:
USdf.sort_values(by='pageviews', ascending=False, inplace=True)
JPdf.sort_values(by='pageviews', ascending=False, inplace=True)
UKdf.sort_values(by='pageviews', ascending=False, inplace=True)
INdf.sort_values(by='pageviews', ascending=False, inplace=True)
DEdf.sort_values(by='pageviews', ascending=False, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  USdf.sort_values(by='pageviews', ascending=False, inplace=True)

In [54]:
len(USdf)

3000

Wait, I need to consolidate these first

In [50]:
USdf = USdf.head(3000)
JPdf = JPdf.head(3000)
UKdf = UKdf.head(3000)
INdf = INdf.head(3000)
DEdf = DEdf.head(3000)


I am going to keep the top 3000 rows for each country

In [51]:
csv_data2 = pd.concat([USdf, JPdf, UKdf, INdf, DEdf], ignore_index=True)

In [52]:
csv_data2.head()

Unnamed: 0,date,country_code,article,qid,pageviews,lang_code
0,2023-03-15,US,Main_Page,Q5296,7132908,en
1,2023-03-16,US,Main_Page,Q5296,4532076,en
2,2023-03-01,US,Cookie_(informatique),Q178995,4251750,fr
3,2023-03-17,US,Main_Page,Q5296,4233371,en
4,2023-03-10,US,Cookie_(informatique),Q178995,4158637,fr
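The output above shows `Main_Page` several times because each row is one day of pageviews, so the top rows are not the same thing as the top articles. A small sketch of the difference, using made-up toy numbers (not values from the real data):

```python
import pandas as pd

# Toy daily rows: one article appears on three different days.
daily = pd.DataFrame({
    "article": ["Main_Page", "Main_Page", "Main_Page", "YouTube", "Jimmy_Carter"],
    "qid": ["Q5296", "Q5296", "Q5296", "Q866", "Q23685"],
    "country_code": ["US"] * 5,
    "pageviews": [900, 800, 700, 600, 500],
})

# Top 3 daily rows: the same article fills every slot.
top_rows = daily.sort_values("pageviews", ascending=False).head(3)

# Sum per article first, then take the top 3: one row per article.
totals = daily.groupby(["article", "qid", "country_code"], as_index=False)["pageviews"].sum()
top_articles = totals.nlargest(3, "pageviews")

print(top_rows["article"].nunique(), top_articles["article"].tolist())
```

Summing before cutting to the top N is what "consolidate these first" amounts to.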


In [28]:
len(csv_data2)

15000

In [30]:
csv_data2.to_csv("final-project-data2.csv", index=False)

In [None]:
articles = articles_df.groupby(['article', "country_code"])['pageviews'].sum().reset_index()

In [60]:
USdf_new = USdf.drop('date', axis=1) 
JPdf_new = JPdf.drop('date', axis=1) 
UKdf_new = UKdf.drop('date', axis=1) 
INdf_new = INdf.drop('date', axis=1) 
DEdf_new = DEdf.drop('date', axis=1) 

In [62]:
USdf.groupby(['article', 'country_code', 'qid'])['pageviews'].sum().reset_index(name='total_pageviews')
JPdf.groupby(['article', 'country_code', 'qid'])['pageviews'].sum().reset_index(name='total_pageviews')
UKdf.groupby(['article', 'country_code', 'qid'])['pageviews'].sum().reset_index(name='total_pageviews')
INdf.groupby(['article', 'country_code', 'qid'])['pageviews'].sum().reset_index(name='total_pageviews')
DEdf.groupby(['article', 'country_code', 'qid'])['pageviews'].sum().reset_index(name='total_pageviews')


Unnamed: 0,article,country_code,qid,total_pageviews
0,(469705)_ÇKÃ¡Ì¦gÃ¡ra,DE,Q15035845,7186
1,1._MÃ¤rz,DE,Q2393,14034
2,10._MÃ¤rz,DE,Q2397,11293
3,11._MÃ¤rz,DE,Q2398,6103
4,12._MÃ¤rz,DE,Q2402,6542
...,...,...,...,...
1401,Ø¨Ø±Ø§Ø¯Ø±Ø§Ù_ÙÛÙØ§,DE,Q108901009,4806
1402,Ø±ÙØ²_Ø¬ÙØ§ÙÛ_Ø²Ù,DE,Q38964,9790
1403,ÙÙØªâØ³ÛÙ,DE,Q1568159,5631
1404,ã¡ã¤ã³ãã¼ã¸,DE,Q5296,62573


In [63]:
USdf.sort_values(by='pageviews', ascending=False, inplace=True)
JPdf.sort_values(by='pageviews', ascending=False, inplace=True)
UKdf.sort_values(by='pageviews', ascending=False, inplace=True)
INdf.sort_values(by='pageviews', ascending=False, inplace=True)
DEdf.sort_values(by='pageviews', ascending=False, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  DEdf.sort_values(by='pageviews', ascending=False, inplace=True)
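The warning keeps appearing because the country frames were made by filtering `articles_df`, and pandas cannot tell whether changing them in place is meant to change the original too. A minimal sketch of one way to avoid it, assuming the goal is an independent frame per country: take an explicit `.copy()` and assign the sorted result instead of using `inplace=True`. The toy rows below are made up for illustration.

```python
import pandas as pd

# Toy stand-in for articles_df.
articles_df = pd.DataFrame({
    "country_code": ["DE", "US", "DE", "DE"],
    "article": ["Dominica", "YouTube", "Liste_von_Pistolen", "Main_Page"],
    "pageviews": [211, 3071, 149, 500],
})

# .copy() makes DEdf its own DataFrame rather than a slice of articles_df.
DEdf = articles_df[articles_df["country_code"] == "DE"].copy()

# Reassign instead of sorting in place.
DEdf = DEdf.sort_values(by="pageviews", ascending=False)

# Column assignment on the copy does not touch articles_df.
DEdf["article"] = DEdf["article"].astype(str)

print(DEdf["pageviews"].tolist())
```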


In [64]:
csv_data3 = pd.concat([USdf, JPdf, UKdf, INdf, DEdf], ignore_index=True)

In [66]:
len(csv_data3)

15000

In [65]:
csv_data3.to_csv("final-project-qids.csv", index=False)

In [58]:
qid_df2 = qid_df2[["article", "qid"]]

In [59]:


if __name__ == "__main__":
    #test_one("Q83285") # Article about Durres
    #test_one("Q7186")  # Article about Marie Curie

    # I'm putting the list here, but you'll have a file with a list of QIDs here.
    qid_list_to_process = qid_df2
    
    output_file = "entity_results.jsonl2"

    # Run the main function
    process_qids_to_jsonl(qid_list_to_process, output_file)





Starting processing for 15000 QIDs.
Results will be written to 'entity_results.jsonl2'.
Processing article...
Processing qid...

--- Processing Complete ---
Total Processed: 15000
Successful Records: 0
Failed Records: 2
---------------------------
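The log says `Processing article...` and `Processing qid...` because iterating over a DataFrame yields its column names, not its rows. The body of `process_qids_to_jsonl` is not shown here, so this is an inference from the log, assuming the function simply loops over whatever it is given. A small sketch with a toy two-row frame:

```python
import pandas as pd

# Toy stand-in for qid_df2 after selecting the two columns.
qid_df2 = pd.DataFrame({
    "article": ["Main_Page", "YouTube", "Main_Page"],
    "qid": ["Q5296", "Q866", "Q5296"],
})

# Looping over the DataFrame itself gives the column labels.
as_iterated = list(qid_df2)

# What the function most likely needs: a plain list of unique QID strings.
qid_list = qid_df2["qid"].dropna().unique().tolist()

print(as_iterated, qid_list)
```

Passing a list like `qid_list` matches the later runs, where a plain list of QIDs is processed successfully.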



This got too messy and I am now confused, so I am going to do this again in a more streamlined fashion and hopefully get what I want

1. Split my final-project-data.csv into its respective countries

In [158]:
newdf = pd.read_csv('final-project-data.csv')

In [159]:
US = newdf[newdf['country_code'].str.contains("US", case=False, na=False)]

In [160]:
len(US)

1252328

In [161]:
JP = newdf[newdf['country_code'].str.contains("JP", case=False, na=False)]
UK = newdf[newdf['country_code'].str.contains("UK", case=False, na=False)]
IN = newdf[newdf['country_code'].str.contains("IN", case=False, na=False)]
DE = newdf[newdf['country_code'].str.contains("DE", case=False, na=False)]

In [176]:
len(UK)

0
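`len(UK)` is 0 because the dataset uses the country code `GB` for the United Kingdom (the earlier cells filter on `'GB'`), so nothing contains `"UK"`. A minimal sketch of the filter with an exact match, which also avoids accidental substring hits. The toy frame is made up for illustration.

```python
import pandas as pd

# Toy stand-in for newdf.
newdf = pd.DataFrame({
    "country_code": ["US", "GB", "GB", "IN", "DE", "JP"],
    "pageviews": [684, 412, 371, 1006, 99, 104],
})

# Substring search for "UK" finds nothing, as in the cell above.
no_match = newdf[newdf["country_code"].str.contains("UK", case=False, na=False)]

# Exact comparison against the code the data actually uses.
UK = newdf[newdf["country_code"] == "GB"].copy()

print(len(no_match), len(UK))
```

With the filter left as `"UK"`, the UK frame stays empty through the later steps, so the combined frame only has four countries in it.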

In [162]:
US['article'] = US['article'].astype(str).str.encode('utf-8').str.decode('utf-8')
JP['article'] = JP['article'].astype(str).str.encode('utf-8').str.decode('utf-8')
UK['article'] = UK['article'].astype(str).str.encode('utf-8').str.decode('utf-8')
IN['article'] = IN['article'].astype(str).str.encode('utf-8').str.decode('utf-8')
DE['article'] = DE['article'].astype(str).str.encode('utf-8').str.decode('utf-8')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  US['article'] = US['article'].astype(str).str.encode('utf-8').str.decode('utf-8')

In [180]:
US.shape, JP.shape

((5000, 4), (5000, 4))

In [163]:
US

Unnamed: 0,date,country_code,article,qid,pageviews,lang_code
0,2023-03-01,US,Kawasaki_disease,Q265936,684,en
1,2023-03-01,US,The_Elder_Scrolls_IV:_Oblivion,Q49607,530,en
2,2023-03-01,US,Marathon_Man_(film),Q1195727,523,en
3,2023-03-01,US,Eleanor_Tomlinson,Q1582005,697,en
4,2023-03-01,US,Alice_Neel,Q460186,1044,en
...,...,...,...,...,...,...
1252323,2023-03-31,US,Jerry_Nadler,Q505598,512,en
1252324,2023-03-31,US,68â95â99.7_rule,Q847822,530,en
1252325,2023-03-31,US,France,Q142,685,fr
1252326,2023-03-31,US,YouTube,Q866,3071,uk


In [164]:
len(US)

1252328

In [165]:
US = US.groupby(['article', 'qid', 'country_code'])['pageviews'].sum().reset_index(name='total_pageviews')


In [166]:
len(US)

105113

In [167]:
US.head()

Unnamed: 0,article,qid,country_code,total_pageviews
0,"!aaaH-aH_,yawA_eM_ekaT_oT_gnimoC_er'yehT",Q3990384,US,2917
1,"""Crimea_is_a_'red_line'_for_Putin"":_Dr._Jeremy...",Q117038809,US,1105
2,"""Freeway""_Rick_Ross",Q606032,US,31601
3,"""Hangman""_Adam_Page",Q16525240,US,4915
4,"""Hello,_World!""_program",Q131303,US,35450


In [168]:
JP.head()

Unnamed: 0,date,country_code,article,qid,pageviews,lang_code
1252328,2023-03-01,JP,Compartment_No._6,Q107092356,104,en
1252329,2023-03-01,JP,çå®é«ç°æ´¾,Q10437214,167,ja
1252330,2023-03-01,JP,ãã«ã¨ãã¹ãã¨å¬åç£,Q483263,285,ja
1252331,2023-03-01,JP,ä¼½è¶,Q28084,299,ja
1252332,2023-03-01,JP,å½é72ç³»é»è»,Q11421672,181,ja


In [169]:
JP = JP.groupby(['article', 'qid', 'country_code'])['pageviews'].sum().reset_index(name='total_pageviews')
UK = UK.groupby(['article', 'qid', 'country_code'])['pageviews'].sum().reset_index(name='total_pageviews')
IN = IN.groupby(['article', 'qid', 'country_code'])['pageviews'].sum().reset_index(name='total_pageviews')
DE = DE.groupby(['article', 'qid', 'country_code'])['pageviews'].sum().reset_index(name='total_pageviews')

In [177]:
len(UK)

0

In [170]:
US.sort_values(by='total_pageviews', ascending=False, inplace=True)
JP.sort_values(by='total_pageviews', ascending=False, inplace=True)
UK.sort_values(by='total_pageviews', ascending=False, inplace=True)
IN.sort_values(by='total_pageviews', ascending=False, inplace=True)
DE.sort_values(by='total_pageviews', ascending=False, inplace=True)

In [171]:
US.head()

Unnamed: 0,article,qid,country_code,total_pageviews
58626,Main_Page,Q5296,US,89005625
21448,Cookie_(informatique),Q178995,US,49289112
45655,Jimmy_Carter,Q23685,US,4964868
101896,ã¡ã¤ã³ãã¼ã¸,Q5296,US,4061575
99998,YouTube,Q866,US,3624806


In [106]:
len(US)

105113

In [172]:
US = US.head(5000)
JP = JP.head(5000)
UK = UK.head(5000)
IN = IN.head(5000)
DE = DE.head(5000)

In [173]:
all_qids_df = pd.concat([US, JP, UK, IN, DE], ignore_index=True)

In [174]:
len(all_qids_df)

20000

In [175]:
all_qids_df.to_csv("top5000_each.csv", index=False)
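The per-country steps above (filter, group, sort, head, concat) can also be done in one pass, which avoids repeating each line five times. This is only a sketch under the assumption that the columns are named as in `final-project-data.csv`. `top_n_per_country` is a hypothetical helper and the toy rows are made up.

```python
import pandas as pd


def top_n_per_country(df: pd.DataFrame, countries: list, n: int) -> pd.DataFrame:
    """Sum pageviews per (country, article, qid) and keep the n largest totals per country."""
    subset = df[df["country_code"].isin(countries)]
    totals = (
        subset.groupby(["country_code", "article", "qid"], as_index=False)["pageviews"]
        .sum()
        .rename(columns={"pageviews": "total_pageviews"})
    )
    totals = totals.sort_values("total_pageviews", ascending=False)
    # head(n) within each country group keeps the sorted order.
    return totals.groupby("country_code").head(n).reset_index(drop=True)


toy = pd.DataFrame(
    [
        ("US", "A", "Q1", 10), ("US", "A", "Q1", 5), ("US", "B", "Q2", 12),
        ("US", "C", "Q3", 1), ("GB", "A", "Q1", 7), ("GB", "D", "Q4", 9),
        ("JP", "E", "Q5", 3),
    ],
    columns=["country_code", "article", "qid", "pageviews"],
)

result = top_n_per_country(toy, ["US", "GB"], n=2)
print(result)
```

On the real data the call would look like `top_n_per_country(newdf, ["US", "JP", "GB", "IN", "DE"], 5000)`.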

In [149]:
all_qids_df.head()

Unnamed: 0,article,qid,total_pageviews
0,Main_Page,Q5296,89005625
1,Cookie_(informatique),Q178995,49289112
2,Jimmy_Carter,Q23685,4964868
3,ã¡ã¤ã³ãã¼ã¸,Q5296,4061575
4,YouTube,Q866,3624806


In [150]:
all_qids = all_qids_df['qid'].unique().tolist()

In [151]:
len(all_qids)

16108

In [153]:
if __name__ == "__main__":
    #test_one("Q83285") # Article about Durres
    #test_one("Q7186")  # Article about Marie Curie

    # I'm putting the list here, but you'll have a file with a list of QIDs here.
    qid_list_to_process = all_qids[:5]
    
    output_file = "entity_results-test.jsonl"

    # Run the main function
    process_qids_to_jsonl(qid_list_to_process, output_file)

Starting processing for 5 QIDs.
Results will be written to 'entity_results-test.jsonl'.
Processing Q5296...
Processing Q178995...
Processing Q23685...
Processing Q866...
Processing Q42253...

--- Processing Complete ---
Total Processed: 5
Successful Records: 5
Failed Records: 0
---------------------------



In [154]:
if __name__ == "__main__":
    #test_one("Q83285") # Article about Durres
    #test_one("Q7186")  # Article about Marie Curie

    # I'm putting the list here, but you'll have a file with a list of QIDs here.
    qid_list_to_process = all_qids
    
    output_file = "entity_results2.jsonl"

    # Run the main function
    process_qids_to_jsonl(qid_list_to_process, output_file)

Starting processing for 16108 QIDs.
Results will be written to 'entity_results2.jsonl'.
Processing Q5296...
Processing Q178995...
Processing Q23685...
Processing Q866...
Processing Q42253...
Processing Q2429697...
Processing Q747452...
Processing Q322056...
Processing Q115564437...
Processing Q35127...
Processing Q114929139...
Processing Q14752155...
Processing Q834730...
Processing Q193555...
Processing Q108673301...
Processing Q181817...
Processing Q106997...
Processing Q80322358...
Processing Q214289...
Processing Q17460747...
Processing Q83808444...
Processing Q87131973...
Processing Q105883400...
Processing Q349852...
Processing Q254947...
Processing Q105840301...
Processing Q56274719...
Processing Q21738166...
Processing Q95...
Processing Q15908324...
Processing Q110664384...
Processing Q26457...
Processing Q445017...
Processing Q4109...
Processing Q24891605...
Processing Q31970512...
Processing Q59656031...
Processing Q112183099...
Processing Q285450...
Processing Q22686...
Proc

I have all of my wiki information now. What do I do with it?