## From Data Lake to Data Warehouse

### What the Notebook Does

- **Data Loading and Parsing:**  
  The notebook reads HTML files containing newspaper articles along with their corresponding metadata from a CSV file. It then uses BeautifulSoup to parse the HTML and extract only the relevant content needed for further analysis.

- **Data Processing and Preparation:**  
  The extracted content is processed to isolate the contexts in which the term "klima" appears. This includes capturing the surrounding text to better understand the usage and meaning of the word in each article.

- **Data Storage:**  
  The processed data is stored in a SQLite database with two tables, which keeps it organized and ready to query. The notebook also exports the data as CSV for easy import into other programs.
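
The context extraction described above can be sketched roughly as follows. This is a minimal illustration and not the code in `handle_data_processing`: the function name `extract_klima_contexts`, the `window` parameter, and the regular expression are hypothetical, and it assumes that `prefix` and `suffix` hold the parts of a compound word before and after "klima". The input is plain text, as it might look after BeautifulSoup has stripped the HTML.

```python
import re

# Hypothetical pattern: optional word characters directly before and after "klima".
KLIMA_PATTERN = re.compile(r"(\w*)klima(\w*)", re.IGNORECASE)


def extract_klima_contexts(text, window=50):
    """Return one dict per word in `text` that contains 'klima'."""
    contexts = []
    for match in KLIMA_PATTERN.finditer(text):
        start, end = match.span()
        contexts.append({
            # Up to `window` characters before the matched word
            "pre_context": text[max(0, start - window):start],
            # Word part before "klima" (assumed meaning of the column)
            "prefix": match.group(1),
            # Word part after "klima" (assumed meaning of the column)
            "suffix": match.group(2),
            # Up to `window` characters after the matched word
            "post_context": text[end:end + window],
        })
    return contexts


sample = "Der Klimawandel bleibt Thema. Das Weltklima ändert sich."
contexts = extract_klima_contexts(sample, window=10)
for ctx in contexts:
    print(ctx)
```

The keys of each dictionary mirror the column names of the `context` table, so one dictionary would correspond to one row.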

### Data Format

- **Table: newspapers**  
  Stores metadata about each newspaper's main page for a single day.  
  Each entry represents one newspaper's main page on one day, because the dataset was built by crawling main pages rather than individual articles.  
  **Columns:**  
  - `newspaper_id`  
  - `newspaper_name`  
  - `data_published`  
  - `klima_mentions_count`

- **Table: context**  
  Contains the text snippets surrounding each occurrence of the target word "klima". The `newspaper_id` refers to a newspaper main page from one specific day.  
  **Columns:**  
  - `newspaper_id`  
  - `pre_context`  
  - `post_context`  
  - `prefix`  
  - `suffix`
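
To make the relationship between the two tables concrete, here is a small sketch using Python's built-in `sqlite3` module. The column types, the foreign-key reference, and the sample rows are assumptions for illustration. The actual schema is created inside `batch_process_newspapers` and may differ. The table name `newspapers` follows the name read back in the verification cell later in the notebook.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE newspapers (
    newspaper_id         INTEGER PRIMARY KEY,
    newspaper_name       TEXT,
    data_published       TEXT,     -- assumed to be an ISO date string
    klima_mentions_count INTEGER
);
CREATE TABLE context (
    newspaper_id INTEGER REFERENCES newspapers(newspaper_id),
    pre_context  TEXT,
    post_context TEXT,
    prefix       TEXT,
    suffix       TEXT
);
""")

# Made-up rows for illustration only.
conn.execute("INSERT INTO newspapers VALUES (1, 'example-paper', '2021-01-01', 2)")
conn.executemany(
    "INSERT INTO context VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Der ", " bleibt Thema", "", "wandel"),
        (1, "Das ", " ändert sich", "Welt", ""),
    ],
)

# Join each snippet back to the main page it came from.
rows = conn.execute("""
    SELECT n.newspaper_name, n.data_published, c.prefix, c.suffix
    FROM context AS c
    JOIN newspapers AS n USING (newspaper_id)
    ORDER BY c.rowid
""").fetchall()
print(rows)
conn.close()
```

One main page can have many context rows, which is why the snippets live in their own table and carry only the `newspaper_id`.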

### Why This Approach

- **Focused Analysis:**  
  By isolating the contexts where "klima" is mentioned, the notebook prepares data specifically tailored to analyze the evolution of the term's usage over time.

- **Data Organization:**  
  Storing the data in a structured SQLite database makes it efficient to query in subsequent analysis steps.

- **Reproducibility and Scalability:**  
  Separating the tasks, from data extraction through storage, supports a reproducible workflow that can easily be extended or modified for future analytical targets.

For additional details and background, please refer to the README file.


In [None]:
import os
import sys
import glob
import csv

# Add custom library path
sys.path.append(os.path.abspath("pylib"))

import pandas as pd
from handle_sqlite import read_table_as_dataframe
from handle_data_processing import batch_process_newspapers

### Load All the Newspapers

In this section, we load the CSV files that contain details for each newspaper, such as file path, date, and HTTP status code. Each file represents data from one day.

In [None]:
# Use glob to list all CSV files in the specified directory that follow a date format in their names
csv_files = glob.glob('data_input/data-lake/*-*.csv')

We sort the files by the date portion of the filename (ignoring the directory path). Because the processing step takes a long time, this makes it easy to track progress from the start date to the end date.

In [None]:
# Sort by the filename, which contains the date, ignoring the directory path
csv_files.sort(key=os.path.basename)

# Output the total count of days (CSV files) to confirm the number of entries
print(f'Count of total days: {len(csv_files)}')
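
Sorting by filename gives chronological order only if the date is written in a format that sorts lexicographically, such as `YYYY-MM-DD`. The following sketch illustrates this with hypothetical file names. The actual naming in `data_input/data-lake` is only known to match `*-*.csv`.

```python
import os

# Hypothetical file names; the real files may be named differently.
paths = [
    "data_input/data-lake/2021-03-01.csv",
    "data_input/data-lake/2020-12-31.csv",
    "data_input/data-lake/2021-01-15.csv",
]

# ISO-style dates sort chronologically when compared as plain strings.
paths.sort(key=os.path.basename)
sorted_names = [os.path.basename(p) for p in paths]
print(sorted_names)
```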

Now, we will read the CSV files one by one to retrieve the HTML file paths, including only those with a status code of 200 (OK).

In [None]:
# Initialize an empty list to store newspaper data
newspapers = []

# Process each CSV file
for csv_file in csv_files:
    # Open and read the CSV file
    with open(csv_file, mode='r', encoding='utf-8') as file:
        reader = csv.DictReader(file)
        
        # For each row, check if the status code is 200 (OK) and append the newspaper data to the list
        for row in reader:
            if row['status'] == '200':
                newspapers.append({
                    'name': row['name'],
                    'date': row['date'],
                    'file_name': row['file_name'],
                    'encoding': row['encoding']
                })
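
Note that `csv.DictReader` returns every field as a string, which is why the status is compared with `'200'` and not with the integer `200`. A self-contained sketch with made-up rows in the same column layout:

```python
import csv
import io

# Made-up sample rows; only the column names are taken from the cell above.
sample_csv = io.StringIO(
    "name,date,file_name,encoding,status\n"
    "example-paper,2021-01-01,example/2021-01-01.html,utf-8,200\n"
    "other-paper,2021-01-01,other/2021-01-01.html,utf-8,404\n"
)

rows = list(csv.DictReader(sample_csv))

# Every value is a str, so the comparison has to use the string '200'.
ok_rows = [row for row in rows if row["status"] == "200"]

print(type(rows[0]["status"]).__name__)
print([row["name"] for row in ok_rows])
```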
    

For verification, display the first two newspaper entries to inspect their structure.

In [None]:
# Display the first two entries to verify the structure of the newspaper data
newspapers[:2]

### Do the Actual Processing with Batch Processing and Multiprocessing

Here we process the newspaper data in batches using multiprocessing. This step prepares the data by extracting the relevant HTML content and storing the results in a SQLite database.

In [None]:
batch_process_newspapers(newspapers, batch_size=512, num_workers=12,
                         db_path="data_output/dwh_data.db", input_path_prefix="data_input")
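
The internals of `batch_process_newspapers` live in `pylib/handle_data_processing.py` and are not shown here. As a rough idea of how batching and multiprocessing can be combined, here is a minimal sketch. All names in it (`make_batches`, `count_klima`, `process_in_batches`) are hypothetical, and the worker only counts occurrences in text that is already in memory instead of parsing HTML files.

```python
from multiprocessing import Pool


def make_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def count_klima(entry):
    """Hypothetical worker: count 'klima' in an already-loaded text."""
    return entry["name"], entry["text"].lower().count("klima")


def process_in_batches(entries, batch_size, num_workers):
    """Process entries batch by batch, spreading each batch over worker processes."""
    results = []
    with Pool(processes=num_workers) as pool:
        for batch in make_batches(entries, batch_size):
            # Assumption: a real pipeline could write each finished batch
            # to SQLite here, so memory use stays bounded by the batch size.
            results.extend(pool.map(count_klima, batch))
    return results


if __name__ == "__main__":
    demo = [{"name": f"paper-{i}", "text": "Klima und Klimaschutz"} for i in range(5)]
    print(process_in_batches(demo, batch_size=2, num_workers=2))
```

Working in batches limits how many parsed pages are held in memory at once, while the worker pool keeps several CPU cores busy within each batch.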

### Check the Saved Data

After processing, we load the saved data from the SQLite database to verify that the data has been stored correctly.

In [None]:

# Read the 'newspapers' table from the database and display the first few rows
meta_data = read_table_as_dataframe("newspapers", "data_output/dwh_data.db")
meta_data.head()

In [None]:
# Read the 'context' table from the database and display the first few rows
context_data = read_table_as_dataframe("context", "data_output/dwh_data.db")
context_data.head()

### Export the Processed Data as CSV Files

Finally, we export the stored data as CSV files to facilitate further analysis in programs that cannot read SQLite files.

In [None]:
import datetime

# Get today's date in YYYY-MM-DD format
today = datetime.datetime.now().strftime("%Y-%m-%d")

# Export the metadata and context data as CSV files, embedding today's date in the filenames
meta_data.to_csv(f"dwh_meta_{today}.csv", index=False)
context_data.to_csv(f"dwh_context_{today}.csv", index=False)

In [None]:
# Count the unique newspaper names in the metadata (expected: 64)
# This helps verify that all expected newspapers are present
meta_data.newspaper_name.nunique()

In [None]:
# Check the number of unique publication dates in the metadata (must equal 'Count of total days': 1401)
# This ensures that we have distinct entries for each day the main page was crawled
meta_data.data_published.nunique()