# Create Wiki DuckDB database

I have a big CSV file in my Google Drive, which I want to convert into a DuckDB database. I'm using Google Colab because the virtual machine behind a Colab notebook has more memory and a better GPU than a typical local machine.

**Note:** You also have access to the CSV files through [this shared folder](https://drive.google.com/drive/folders/1zlNVoRALmlYETU8pIiU883LCxYptAcJB?usp=sharing), but you will not be able to run the code below as is, because the file path refers to "MyDrive". You will need to have the CSV file in your own Drive and modify the path to point to it. This tutorial is especially important if you created your own CSV file from the 2022-2025 DPDP files.

In [1]:
import duckdb

## Mount Google Drive

These commands connect the notebook to the files in Google Drive. A dialog will ask you to give Colab access to your Drive; click Accept or Continue, since the access is temporary, not permanent.


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Provide path names for the input CSV file and the new DuckDB file to be created.

In [3]:
INPUT_CSV_PATH = '/content/drive/MyDrive/Classroom/CS 234 Fall 2025/Final Project/all_wiki_data.csv'
OUTPUT_DUCKDB_FILENAME = 'all_wiki.duckdb'

# The path where the new DuckDB file will be created in the virtual machine where the Colab is running
OUTPUT_DUCKDB_PATH = f'/content/{OUTPUT_DUCKDB_FILENAME}'
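
Before starting a long conversion, it may be worth confirming that the input path really points to a file and checking which delimiter it uses. The sketch below is only an illustration: `peek_first_line` and `guess_delimiter` are hypothetical helper names, and the delimiter guess is a rough heuristic (tabs versus commas in the first line), not a replacement for DuckDB's own detection.

```python
import os


def peek_first_line(path, max_chars=500):
    """Return the first line of a text file (truncated), or raise if the file is missing."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Input file not found: {path}")
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        return f.readline()[:max_chars].rstrip("\r\n")


def guess_delimiter(line):
    """Rough guess: tab-delimited if the line contains more tabs than commas."""
    return "\t" if line.count("\t") > line.count(",") else ","
```

In the notebook, `guess_delimiter(peek_first_line(INPUT_CSV_PATH))` would then show whether the file looks tab- or comma-delimited.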

## Convert CSV into the DuckDB file

This step connects to a new, empty DuckDB file, reads the CSV from the mounted Drive using `read_csv_auto`, and writes the data to a table. Note that the file is tab-delimited despite its `.csv` extension, which is why the query passes `delim='\t'`. To reduce the size of the database, I am leaving out two fields that I don't deem necessary: country (I will keep country_code) and page_id (I will keep qid).

The new table that I will be creating is titled "wiki_pageviews".

In [4]:
import time
start = time.time() # moment I started to run this cell

conn = duckdb.connect(database=OUTPUT_DUCKDB_PATH)

conn.sql(f"""
    CREATE OR REPLACE TABLE wiki_pageviews AS
    SELECT
        date,
        country_code,
        project,
        article,
        qid,
        pageviews
    FROM read_csv_auto('{INPUT_CSV_PATH}', SAMPLE_SIZE=-1, delim='\t', names=[
        'date',
        'country',        -- Will be read, but ignored by SELECT
        'country_code',
        'project',
        'page_id',        -- Will be read, but ignored by SELECT
        'article',
        'qid',
        'pageviews'
    ])
""")
end = time.time() # moment the conversion was complete
totalTime = end - start
print(f"Conversion took {totalTime} seconds.")

Conversion took 2851.5585854053497 seconds.


It took about 47.5 minutes (2,852 seconds) for the database to be created from the CSV file.
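
The elapsed time printed by the cell is easier to read in minutes and seconds. A small sketch of that conversion, where `format_duration` is a hypothetical helper name and not part of the original notebook:

```python
def format_duration(seconds):
    """Format a duration in seconds as 'M min S s', rounded to the nearest second."""
    minutes, secs = divmod(int(round(seconds)), 60)
    return f"{minutes} min {secs} s"


print(format_duration(2851.5585854053497))  # → 47 min 32 s
```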

## Verification of the created database

Count the total number of rows in the table.

In [5]:
row_count = conn.sql("SELECT COUNT(*) FROM wiki_pageviews;").fetchone()[0]

In [6]:
print(f"This table has {row_count} rows.")

This table has 267488614 rows.


I will also show the first few rows of the table.

In [7]:
query = """
SELECT * FROM wiki_pageviews
LIMIT 10;
"""
result = conn.sql(query).df()
result

Unnamed: 0,date,country_code,project,article,qid,pageviews
0,2023-02-06,DZ,ar.wikipedia,متوازي_أضلاع,Q45867,108
1,2023-02-06,DZ,ar.wikipedia,الأندلس,Q123559,145
2,2023-02-06,AR,en.wikipedia,Robledo_Puch,Q3181149,99
3,2023-02-06,AR,es.wikipedia,Ojo_de_Horus,Q211286,135
4,2023-02-06,AR,es.wikipedia,Estaciones_del_año,Q24384,171
5,2023-02-06,AR,es.wikipedia,Isla_de_Alcatraz,Q131354,126
6,2023-02-06,AR,es.wikipedia,Volkswagen_Gol,Q275442,148
7,2023-02-06,AR,es.wikipedia,Río_Cuarto_(ciudad),Q983451,179
8,2023-02-06,AR,es.wikipedia,Todo_Noticias,Q3244714,325
9,2023-02-06,AR,es.wikipedia,Tres_metros_sobre_el_cielo_(película_de_2010),Q944385,112


## Generating tables for German and French articles

Below I show how we can query the database to create CSV files with a specific portion of the data. I am interested in comparing political articles in French and German for two countries: France and Germany. So, I will filter the data by project and country. For the moment, I am only getting the article names and their total pageviews.

### German table

In [8]:
create_de_wiki_query = """
CREATE OR REPLACE TABLE de_wiki_pageviews AS
SELECT
    article,
    SUM(pageviews) AS total_pageviews
FROM
    wiki_pageviews
WHERE
    project = 'de.wikipedia' AND country_code = 'DE'
GROUP BY
    article;
"""

conn.sql(create_de_wiki_query)
print("Table 'de_wiki_pageviews' created successfully.")

Table 'de_wiki_pageviews' created successfully.


Check the number of rows for the new table:

In [9]:
row_count = conn.sql("SELECT COUNT(*) FROM de_wiki_pageviews;").fetchone()[0]
print(f"This table has {row_count} rows.")

This table has 286157 rows.


Let's see how many articles have more than 1000 pageviews.

In [10]:
query_high_pageviews = """
SELECT
    article,
    total_pageviews
FROM
    de_wiki_pageviews
WHERE
    total_pageviews > 1000
ORDER BY
    total_pageviews DESC;
"""

high_pageview_articles = conn.sql(query_high_pageviews).df()
display(high_pageview_articles)

Unnamed: 0,article,total_pageviews
0,PornHub,10852025.0
1,Liste_der_größten_Auslegerbrücken,10671097.0
2,ChatGPT,5355151.0
3,Deutschland,5269414.0
4,ZDF,3567736.0
...,...,...
138010,Fluoridierung,1001.0
138011,Benjamin_Bathurst,1001.0
138012,Strecke_77,1001.0
138013,Guinea-Pavian,1001.0


Just under half of the articles have more than 1000 views: 138015 out of 286157, or about 48%.

Save the entire table into a file:

In [11]:
output_csv_path = 'de_wiki_pageviews.csv'
conn.sql(f"COPY de_wiki_pageviews TO '{output_csv_path}' (HEADER, DELIMITER ',');")
print(f"Table 'de_wiki_pageviews' successfully saved to {output_csv_path}")

Table 'de_wiki_pageviews' successfully saved to de_wiki_pageviews.csv
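
To check that the export is complete, the CSV file can be read back and its rows counted. A minimal sketch using only the standard library; `count_csv_rows` is a hypothetical helper name, and it assumes a comma-delimited file with one header row, as written by the `COPY` command above.

```python
import csv


def count_csv_rows(path):
    """Return (header, number_of_data_rows) for a comma-delimited CSV file with a header row."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        n_rows = sum(1 for _ in reader)
    return header, n_rows
```

Calling `count_csv_rows('de_wiki_pageviews.csv')` should return the two column names and, if the export is complete, the same row count reported for the table.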


### French table

I will repeat the same steps, but for the French table.

In [12]:
create_fr_wiki_query = """
CREATE OR REPLACE TABLE fr_wiki_pageviews AS
SELECT
    article,
    SUM(pageviews) AS total_pageviews
FROM
    wiki_pageviews
WHERE
    project = 'fr.wikipedia' AND country_code = 'FR'
GROUP BY
    article;
"""

conn.sql(create_fr_wiki_query)
print("Table 'fr_wiki_pageviews' created successfully.")

Table 'fr_wiki_pageviews' created successfully.


Check the total number of rows:

In [13]:
row_count = conn.sql("SELECT COUNT(*) FROM fr_wiki_pageviews;").fetchone()[0]
print(f"This table has {row_count} rows.")

This table has 212679 rows.


One thing we can notice right away is that the German table has significantly more entries (about 286K) than the French table (about 213K).

Now let's check the number of French articles with more than 1000 pageviews:

In [14]:
query_high_pageviews = """
SELECT
    article,
    total_pageviews
FROM
    fr_wiki_pageviews
WHERE
    total_pageviews > 1000
ORDER BY
    total_pageviews DESC;
"""

high_pageview_articles = conn.sql(query_high_pageviews).df()
display(high_pageview_articles)

Unnamed: 0,article,total_pageviews
0,Cookie_(informatique),127668934.0
1,Gabriel_Attal,5038309.0
2,Jordan_Bardella,4674491.0
3,France,4143897.0
4,Kylian_Mbappé,3368832.0
...,...,...
106504,Terre_indigo,1001.0
106505,Cynophagie,1001.0
106506,Objets_magiques_de_Harry_Potter,1001.0
106507,La_solitudine_(chanson),1001.0


Unlike the German table, where slightly fewer than half of the articles pass this threshold, here slightly more than half do: 106509 out of 212679 articles (about 50.1%) have more than 1000 views (over two years). Another difference for France is that two of the most viewed pages (Gabriel Attal and Jordan Bardella) belong to two young politicians in France.

We need to be careful when we compare Wikipedia usage in France and Germany, since the population of Germany is about 84 million while the population of France is only about 66.5 million.
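
One simple way to account for the population difference is to divide pageviews by population. The sketch below is only a rough illustration: it uses the population figures quoted in the text and the totals for the two country articles from the result tables, and `views_per_million` is a hypothetical helper name. It ignores factors such as internet access or readers who use other language editions.

```python
# Population figures as quoted in the text, in millions.
POPULATION_MILLIONS = {"DE": 84.0, "FR": 66.5}


def views_per_million(total_pageviews, country_code):
    """Pageviews divided by the population (in millions) of the given country."""
    return total_pageviews / POPULATION_MILLIONS[country_code]


# Totals for the two country articles, taken from the result tables above.
de_rate = views_per_million(5269414, "DE")  # Deutschland
fr_rate = views_per_million(4143897, "FR")  # France
print(f"DE: {de_rate:,.0f} views per million, FR: {fr_rate:,.0f} views per million")
```

For these two articles the normalized rates come out close to each other (roughly 62.7K versus 62.3K views per million inhabitants), even though the raw totals differ by more than a million views.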

In [15]:
output_csv_path = 'fr_wiki_pageviews.csv'
conn.sql(f"COPY fr_wiki_pageviews TO '{output_csv_path}' (HEADER, DELIMITER ',');")
print(f"Table 'fr_wiki_pageviews' successfully saved to {output_csv_path}")

Table 'fr_wiki_pageviews' successfully saved to fr_wiki_pageviews.csv
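
Since the German and French steps differ only in the table name, project, and country code, the query could be generated by a small function. This is a sketch, not the notebook's original approach: `build_pageviews_query` is a hypothetical helper, and the check for quotes and semicolons is a basic guard, not full SQL escaping.

```python
def build_pageviews_query(table_name, project, country_code):
    """Return the CREATE TABLE statement used above for one project/country pair."""
    # Only simple identifiers and values are expected; reject characters
    # that would break the generated SQL.
    for value in (table_name, project, country_code):
        if "'" in value or '"' in value or ";" in value:
            raise ValueError(f"Unexpected character in {value!r}")
    return f"""
CREATE OR REPLACE TABLE {table_name} AS
SELECT
    article,
    SUM(pageviews) AS total_pageviews
FROM
    wiki_pageviews
WHERE
    project = '{project}' AND country_code = '{country_code}'
GROUP BY
    article;
"""


for table, project, code in [("de_wiki_pageviews", "de.wikipedia", "DE"),
                             ("fr_wiki_pageviews", "fr.wikipedia", "FR")]:
    query = build_pageviews_query(table, project, code)
    # conn.sql(query)  # uncomment in the notebook, where `conn` is the open DuckDB connection
    print(query)
```

Adding another language and country pair would then only require one more entry in the list.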


**Final note:** If you executed this notebook in Google Colab, the files that you created are stored only temporarily in the cloud. You need to download them to your computer **before you close the connection to the cloud**. To do so, look for the folder icon in the left-side menu in Google Colab. It opens a tab that shows three folders: .., .config, and sample_data. It should also show the files that you created. Download them to your computer to keep them safe.
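
As an alternative to downloading by hand, the files could be copied into the mounted Drive so that they persist after the session ends. The sketch below assumes the Drive is still mounted; `copy_outputs` is a hypothetical helper, and the destination folder in the usage note is only an example name. It is probably safest to call `conn.close()` before copying the `.duckdb` file so that all data is written to disk.

```python
import os
import shutil


def copy_outputs(filenames, dest_dir):
    """Copy each existing file into dest_dir and return the list of copied paths."""
    os.makedirs(dest_dir, exist_ok=True)
    copied = []
    for name in filenames:
        if os.path.isfile(name):
            copied.append(shutil.copy(name, dest_dir))
        else:
            print(f"Skipping missing file: {name}")
    return copied
```

For example, `copy_outputs(['/content/all_wiki.duckdb', 'de_wiki_pageviews.csv', 'fr_wiki_pageviews.csv'], '/content/drive/MyDrive/wiki_outputs')` would copy the database and both CSV files into a `wiki_outputs` folder in your Drive.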