## From Data Lake to Data Warehouse

### What the Notebook Does

- **Data Loading and Parsing:**  
  The notebook reads HTML files containing newspaper articles along with their corresponding metadata from a CSV file. It then uses BeautifulSoup to parse the HTML and extract only the relevant content needed for further analysis.

- **Data Processing and Preparation:**  
  The extracted content is processed to isolate the contexts in which the term "klima" appears. This includes capturing the surrounding text to better understand the usage and meaning of the word in each article.

- **Data Storage:**  
  The processed data is structured and stored in a SQLite database with two tables, so that it is organized, easily accessible, and ready for further analysis. The notebook also exports the data as CSV for easy import into other programs.
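
The context step above can be sketched on plain text as follows. This is only an illustration, not the code in `handle_data_processing`: the function name `extract_contexts` and the `window` parameter are hypothetical, and it assumes that `prefix` and `suffix` hold the parts of a compound word before and after "klima" (for example, `suffix = "wandel"` for "Klimawandel").

```python
import re

def extract_contexts(text, target="klima", window=5):
    """Return one dict per whitespace-separated word that contains `target`.

    Hypothetical helper: matching is case-insensitive, and the context is
    measured in words, not characters.
    """
    words = text.split()
    pattern = re.compile(re.escape(target), re.IGNORECASE)
    rows = []
    for i, word in enumerate(words):
        match = pattern.search(word)
        if match is None:
            continue
        rows.append({
            # Up to `window` words before and after the matching word
            "pre_context": " ".join(words[max(0, i - window):i]),
            "post_context": " ".join(words[i + 1:i + 1 + window]),
            # Assumed meaning: the rest of the word around the target
            "prefix": word[:match.start()],
            "suffix": word[match.end():],
        })
    return rows

sample = "Der Bundesrat diskutiert heute den Klimawandel und das Weltklima in Bern."
contexts = extract_contexts(sample, window=3)
for row in contexts:
    print(row)
```

Splitting on whitespace keeps the sketch short, but punctuation stays attached to words (a trailing "." would end up in `suffix`), so a real pipeline would probably need proper tokenization.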

### Data Format

- **Table: newspapers**  
  Stores metadata about each newspaper's main page, including the publication details corresponding to a single day.  
  Each entry represents the main page of a newspaper for one day, as the dataset is derived from crawling the main page rather than individual articles. 
  **Columns:**  
  - `newspaper_id`  
  - `newspaper_name`  
  - `data_published`  
  - `klima_mentions_count`

- **Table: context**  
  Contains detailed text snippets surrounding the target word "klima". Each row's `newspaper_id` refers to one newspaper main page from one specific day.  
  **Columns:**  
  - `newspaper_id`  
  - `pre_context`  
  - `post_context`  
  - `prefix`  
  - `suffix`
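
For orientation, the two tables could be declared roughly as below. This is a sketch under assumptions, not the schema created by `batch_process_newspapers`: the column types, the primary key, and the foreign key are guesses, the sample rows are invented for illustration, and the table name `newspapers` follows the read-back call later in the notebook.

```python
import sqlite3

# In-memory database so the sketch does not touch data_output/dwh_data.db
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE newspapers (
    newspaper_id         INTEGER PRIMARY KEY,
    newspaper_name       TEXT NOT NULL,
    data_published       TEXT NOT NULL,      -- column name as used in the notebook
    klima_mentions_count INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE context (
    newspaper_id INTEGER NOT NULL REFERENCES newspapers(newspaper_id),
    pre_context  TEXT,
    post_context TEXT,
    prefix       TEXT,
    suffix       TEXT
);
""")

# Invented sample rows, only to show how the tables relate
conn.execute(
    "INSERT INTO newspapers VALUES (?, ?, ?, ?)",
    (1, "example-paper", "2021-04-01", 2),
)
conn.executemany(
    "INSERT INTO context VALUES (?, ?, ?, ?, ?)",
    [
        (1, "diskutiert heute den", "und das Weltklima", "", "wandel"),
        (1, "Klimawandel und das", "in Bern.", "Welt", ""),
    ],
)

# Count the context rows per main page by joining on newspaper_id
rows = conn.execute("""
    SELECT n.newspaper_name, n.data_published, COUNT(c.newspaper_id)
    FROM newspapers AS n
    LEFT JOIN context AS c ON c.newspaper_id = n.newspaper_id
    GROUP BY n.newspaper_id
""").fetchall()
print(rows)
```

The join shows the intended relationship: one row per main page and day, and any number of context rows pointing back to it through `newspaper_id`.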

### Why This Approach

- **Focused Analysis:**  
  By isolating the contexts where "klima" is mentioned, the notebook prepares data specifically tailored to analyze the evolution of the term's usage over time.

- **Data Organization:**  
  Storing data in a structured SQLite database facilitates efficient querying and analysis, ensuring that subsequent analytical processes can be performed seamlessly.

- **Reproducibility and Scalability:**  
  This clear separation of tasks, from data extraction to storage, supports a reproducible workflow that can easily be extended or modified for future analytical targets.

For additional details and background, please refer to the README file.


In [None]:
import os
import sys
import glob
import csv
sys.path.append(os.path.abspath("pylib"))

import pandas as pd
from handle_sqlite import read_table_as_dataframe
from handle_data_processing import batch_process_newspapers

### Load all the newspapers
Here we load the CSV files that contain the details for each newspaper, such as path, date, and status code. There is one such file per day.

In [20]:
# Use glob to list all CSV files in the specified directory with date format in their names
csv_files = glob.glob('data_input/data-lake/*-*.csv')

# Sort by file name (which contains the date), ignoring the directory path,
# so that progress can later be followed from the start date to the end date
csv_files.sort(key=os.path.basename)

print(f'Count of total days: {len(csv_files)}')

Count of total days: 1401


Now we read the CSV files one by one to collect the HTML file paths, keeping only the entries with status 200 (OK).

In [None]:
# Initialize an empty list to store one dict per newspaper record
newspapers = []

# Process each CSV file
for csv_file in csv_files:
    # Load CSV
    with open(csv_file, mode='r', encoding='utf-8') as file:
        reader = csv.DictReader(file)
        
        for row in reader:
            if row['status'] == '200':
                newspapers.append({
                    'name': row['name'],
                    'date': row['date'],
                    'file_name': row['file_name'],
                    'encoding': row['encoding']
                })
    

In [None]:
for newspaper in newspapers:
    print(newspaper)
    break

In [None]:
batch_process_newspapers(newspapers, batch_size=512, num_workers=12, db_path="data_output/dwh_data.db", input_path_prefix="data_input")

### Check the saved data

In [None]:

meta_data = read_table_as_dataframe("newspapers", "data_output/dwh_data.db")
meta_data.head()

In [None]:
context_data = read_table_as_dataframe("context", "data_output/dwh_data.db")
context_data.head()

### Export as CSV

In [None]:
import datetime
today = datetime.datetime.now().strftime("%Y-%m-%d")

In [None]:
# f-string so that {today} is replaced by the current date in the file name
meta_data.to_csv(f"dwh_meta_{today}.csv", index=False)

In [None]:
# f-string so that {today} is replaced by the current date in the file name
context_data.to_csv(f"dwh_context_{today}.csv", index=False)

In [None]:
meta_data.newspaper_name.nunique()

In [None]:
meta_data.data_published.nunique()