This experiment notebook is the basis for the lake_to_dwh.py file. It reads the newspaper HTML files, transforms them, and prepares the features needed for a straightforward analysis. The analysis looks at the word "klima" and how each newspaper's use of it changes over time. See the README for more.

In [5]:
import os
import sys
import logging
sys.path.append(os.path.abspath("pylib"))

import pandas as pd
from handle_sqlite import save_dataframe_to_db, read_table_as_dataframe
from handle_data_processing import process_newspaper_with_context


# Load all the newspapers
Here we will load the CSV for one day.

Test: save a DataFrame to SQLite and read it back.

In [2]:
test = {'Name': ['Tom', 'nick', 'chris', 'jack'],
        'Age': [20, 21, 19, 18]}

# Create DataFrame
df = pd.DataFrame(test)
# df (pd.DataFrame): The DataFrame to save.
# table_name (str): The name of the table in the database.
# db_path (str): Path to the SQLite database file.
# if_exists (str): How to behave if the table already exists. Options are 'fail', 'replace', 'append'.
save_dataframe_to_db(df, "test_table", "data_output/dwh_data.db", "replace")

2025-02-21 22:32:27,558 - INFO - Data saved to table 'test_table' in 'data_output/dwh_data.db' successfully.


In [3]:
read_table_as_dataframe("test_table", "data_output/dwh_data.db")

2025-02-21 22:32:31,360 - INFO - Data read from table 'test_table' in 'data_output/dwh_data.db' successfully.


Unnamed: 0,Name,Age
0,Tom,20
1,nick,21
2,chris,19
3,jack,18

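The helpers `save_dataframe_to_db` and `read_table_as_dataframe` live in `pylib/handle_sqlite.py`, which is not shown in this notebook. As a rough sketch of what such a round trip can look like, the block below uses `sqlite3` together with `DataFrame.to_sql` and `pandas.read_sql_query`. The names `save_df` and `read_df` are hypothetical stand-ins, not the project's actual functions, and the database is a throwaway temp file rather than `data_output/dwh_data.db`.

```python
import os
import sqlite3
import tempfile
from contextlib import closing

import pandas as pd


def save_df(df, table_name, db_path, if_exists="replace"):
    """Hypothetical stand-in for save_dataframe_to_db."""
    # closing() makes sure the connection is closed even if to_sql raises.
    with closing(sqlite3.connect(db_path)) as conn:
        # index=False keeps the DataFrame index out of the table.
        df.to_sql(table_name, conn, if_exists=if_exists, index=False)
        conn.commit()


def read_df(table_name, db_path):
    """Hypothetical stand-in for read_table_as_dataframe."""
    with closing(sqlite3.connect(db_path)) as conn:
        return pd.read_sql_query(f'SELECT * FROM "{table_name}"', conn)


# Round trip against a throwaway database file.
db_path = os.path.join(tempfile.mkdtemp(), "sketch.db")
people = pd.DataFrame({"Name": ["Tom", "nick"], "Age": [20, 21]})
save_df(people, "test_table", db_path)
restored = read_df("test_table", db_path)
```

With `if_exists="append"` a second call adds rows to the existing table instead of replacing it, which is the mode a daily load would probably want.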

Orchestrate the processing

In [7]:
newspapers = [
    {"file_name": "data_input/data_lake/2021-04-02-54books.html", "encoding": "utf-8", "name": "54books", "date": "2021-04-02"},
    {"file_name": "data_input/data_lake/2021-04-02-abendblatt.html", "encoding": "utf-8", "name": "abendblatt", "date": "2021-04-02"},
    {"file_name": "data_input/data_lake/2025-01-14-vice-de.html", "encoding": "utf-8", "name": "vice", "date": "2025-01-24"},
    {"file_name": "data_input/data_lake/2021-04-24-tagesschau.html", "encoding": "utf-8", "name": "tagesschaus", "date": "2021-04-24"},
    {"file_name": "data_input/data_lake/2025-01-24-tagesschau.html", "encoding": "utf-8", "name": "tagesschaus", "date": "2025-01-24"},
]


metadata_collection = []
context_collection = []

for newspaper in newspapers:
    try:
        metadata, context_data = process_newspaper_with_context(newspaper)
                
        # Add a unique ID for each newspaper in the metadata and add to context
        newspaper_id = len(metadata_collection) + 1  # This can be a simple counter for unique IDs (or use UUID)
        metadata["newspaper_id"] = newspaper_id  # Add newspaper_id to metadata
        
        # Append the metadata to its respective collection
        metadata_collection.append(metadata)
        
        # Append the context data with id to its respective collection if 'klima' was found at least once
        if metadata['klima_mentions_count'] > 0:
            # First add the same newspaper_id to each context data
            for context in context_data:
                context["newspaper_id"] = newspaper_id
            context_collection.extend(context_data) # Using extend here because context_data is already a list of dicts

    except Exception as e:
        logging.error(f"Error processing {newspaper['name']} for {newspaper['date']}: {e}")

# Convert to DataFrame after processing all newspapers
final_metadata_df = pd.DataFrame(metadata_collection)
final_context_df = pd.DataFrame(context_collection)

# Save the results to the database
#save_dataframe_to_db(final_metadata_df, "newspapers", db_path="dwh_data.db", if_exists="replace")
#save_dataframe_to_db(final_context_df, "context", db_path="dwh_data.db", if_exists="replace")

2025-02-21 22:34:22,620 - INFO - Processing newspaper: 54books (2021-04-02)
2025-02-21 22:34:22,663 - INFO - No 'klima' mentions found in 54books for 2021-04-02.
2025-02-21 22:34:22,664 - INFO - Processing newspaper: abendblatt (2021-04-02)
2025-02-21 22:34:22,776 - INFO - 1 'klima' mentions in abendblatt for 2021-04-02.
2025-02-21 22:34:22,777 - INFO - Processing newspaper: vice (2025-01-24)
2025-02-21 22:34:22,812 - INFO - No 'klima' mentions found in vice for 2025-01-24.
2025-02-21 22:34:22,812 - INFO - Processing newspaper: tagesschaus (2021-04-24)
2025-02-21 22:34:23,029 - INFO - 2 'klima' mentions in tagesschaus for 2021-04-24.
2025-02-21 22:34:23,030 - INFO - Processing newspaper: tagesschaus (2025-01-24)
2025-02-21 22:34:23,254 - INFO - 4 'klima' mentions in tagesschaus for 2025-01-24.


For every collected path, open and process the newspaper and append the result to a list.

In [8]:
final_metadata_df

Unnamed: 0,newspaper,data_published,klima_mentions_count,newspaper_id
0,54books,2021-04-02,0,1
1,abendblatt,2021-04-02,1,2
2,vice,2025-01-24,0,3
3,tagesschaus,2021-04-24,2,4
4,tagesschaus,2025-01-24,4,5


In [9]:
final_context_df

Unnamed: 0,pre_context,post_context,prefix,suffix,newspaper_id
0,Was geht vor:,oder Mieterinteresse? Hamburg,,schutz,2
1,Adam. Biden zur,Jobmotor nicht Jobkiller,,politik,4
2,Kampf gegen den,schafft Jobs -,,wandel,4
3,Startseite Wissen Gesundheit,& Umwelt Forschung,,,5
4,Millionen Kinder betroffen,schränkt Schulbildung weltweit,,wandel,5
5,weltweit ein Der,hat zunehmend Einfluss,,wandel,5
6,Wetterthema Quanten und,Welche Rolle spielen,,,5

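The columns above suggest that each mention is stored as the three words before it, the three words after it, and whatever is attached to "klima" inside the same word (for example "schutz" from "Klimaschutz"). The real logic is in `process_newspaper_with_context`, which is not shown here, so the following is only a sketch of one way to produce rows of that shape from plain text. The function name `extract_klima_contexts`, the `window` parameter, and the sample sentence are assumptions made for illustration.

```python
import re

# Case-insensitive so that "Klimaschutz" and "klima" both match.
KLIMA = re.compile(r"klima", re.IGNORECASE)


def extract_klima_contexts(text, window=3):
    """Hypothetical sketch: one dict per whitespace-separated word containing 'klima'."""
    words = text.split()
    rows = []
    for i, word in enumerate(words):
        match = KLIMA.search(word)
        if match is None:
            continue
        rows.append({
            # Up to `window` words on either side of the matching word.
            "pre_context": " ".join(words[max(0, i - window):i]),
            "post_context": " ".join(words[i + 1:i + 1 + window]),
            # Parts of the same word before and after "klima".
            "prefix": word[:match.start()],
            "suffix": word[match.end():],
        })
    return rows


# Made-up sample sentence, built to mirror the first row of the table above.
sample = "Was geht vor: Klimaschutz oder Mieterinteresse? Hamburg streitet"
rows = extract_klima_contexts(sample)
```

Because the text is split on whitespace only, punctuation stays attached to its word, so a sentence-final "Klimawandel." would yield the suffix "wandel." in this sketch.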

Take the list and append it to SQLite.

In [31]:
final_df

Unnamed: 0,pre_context,post_context,prefix,suffix,newspaper,date_published
0,Was geht vor:,oder Mieterinteresse? Hamburg,,schutz,abendblatt,2021-04-02
0,Startseite Wissen Gesundheit,& Umwelt Forschung,,,tagesschaus,2025-01-24
1,Millionen Kinder betroffen,schränkt Schulbildung weltweit,,wandel,tagesschaus,2025-01-24
2,weltweit ein Der,hat zunehmend Einfluss,,wandel,tagesschaus,2025-01-24
3,Wetterthema Quanten und,Welche Rolle spielen,,,tagesschaus,2025-01-24

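The cell that builds `final_df` is not part of the cells shown here. One way to get a table with the same columns from the two DataFrames above is to join the context rows to the metadata on `newspaper_id`. This is a sketch under that assumption, not necessarily how `final_df` was produced; the frame names `metadata_df`, `context_df`, and `joined_df` are hypothetical, and the sample rows are copied from the outputs above.

```python
import pandas as pd

# Two metadata rows and one context row, copied from the outputs above.
metadata_df = pd.DataFrame({
    "newspaper": ["54books", "abendblatt"],
    "data_published": ["2021-04-02", "2021-04-02"],
    "klima_mentions_count": [0, 1],
    "newspaper_id": [1, 2],
})
context_df = pd.DataFrame({
    "pre_context": ["Was geht vor:"],
    "post_context": ["oder Mieterinteresse? Hamburg"],
    "prefix": [""],
    "suffix": ["schutz"],
    "newspaper_id": [2],
})

joined_df = (
    context_df
    # validate raises if a newspaper_id appears more than once in the metadata.
    .merge(
        metadata_df[["newspaper_id", "newspaper", "data_published"]],
        on="newspaper_id",
        how="left",
        validate="many_to_one",
    )
    # The metadata column is spelled data_published; final_df uses date_published.
    .rename(columns={"data_published": "date_published"})
    .drop(columns="newspaper_id")
)
```

Keeping `newspaper_id` as the only link between the two tables means the newspaper name and date are stored once, and a join like this can rebuild the flat view whenever the analysis needs it.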

In [25]:
final_df['pre_context']

0                   Was geht vor:
0    Startseite Wissen Gesundheit
1      Millionen Kinder betroffen
2                weltweit ein Der
3         Wetterthema Quanten und
Name: pre_context, dtype: object

In [26]:
final_df['post_context']

0     oder Mieterinteresse? Hamburg
0                & Umwelt Forschung
1    schränkt Schulbildung weltweit
2            hat zunehmend Einfluss
3              Welche Rolle spielen
Name: post_context, dtype: object