# Balanced Sampling of News Articles by Country

This script performs the following steps to prepare a balanced news dataset:

- **Load and prepare data:**  
  Load news articles from a specified folder, keeping only the selected countries (`Bulgaria`, `Italy`, `Netherlands`, `United_Kingdom`) and the relevant news outlets, using the `load_and_prepare_data` function.

- **Filter results:**  
  Print the total number of articles remaining after non-relevant outlets are filtered out.

- **Create balanced sample:**  
  Draw a balanced sample of news articles (`total_samples = 2000`) distributed as evenly as possible across the selected countries using the `balanced_sample` function. The run shown here returns 1995 rows (497 to 500 per country), slightly below the target.

- **Summary and preview:**  
  Print the sample size, the number of articles per country, and the first few rows of the sampled dataset.

- **Save to CSV:**  
  Save the balanced sample as a CSV file for later use in analysis or modeling.
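
The balanced-sampling step can be illustrated with a minimal sketch. This is not the project's `balanced_sample` from `dataloader` (its implementation is not shown here); the function name `balanced_sample_sketch`, the `country_col` and `seed` parameters, and the toy data are all assumptions made for illustration. The idea is to give each country an equal quota and cap it at the number of rows that country actually has.

```python
import pandas as pd


def balanced_sample_sketch(df, total_samples, countries, country_col="country", seed=42):
    """Hypothetical helper: draw about total_samples rows, split evenly across countries."""
    per_country = total_samples // len(countries)  # equal quota per country
    parts = []
    for country in countries:
        subset = df[df[country_col] == country]
        # Never request more rows than the country actually has.
        n = min(per_country, len(subset))
        parts.append(subset.sample(n=n, random_state=seed))
    return pd.concat(parts, ignore_index=True)


# Toy data standing in for the real articles (assumed structure).
toy = pd.DataFrame({
    "country": ["Bulgaria"] * 10 + ["Italy"] * 3 + ["Netherlands"] * 8,
    "uri": range(21),
})
sample = balanced_sample_sketch(
    toy, total_samples=12, countries=["Bulgaria", "Italy", "Netherlands"]
)
print(sample["country"].value_counts())
```

In this toy run the quota is 4 per country, but Italy has only 3 rows, so the sample ends up with 11 rows instead of 12. A sample can fall short of its target in this way whenever a group cannot fill its quota.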


In [1]:
import os
from dataloader import load_and_prepare_data, balanced_sample, load_human_annotated_for_translation
from config import NEWS_FOLDER, SELECTED_COUNTRIES

# Load articles for the selected countries and keep only the relevant outlets
data = load_and_prepare_data(
    news_folder="~/webdav/ASCOR-FMG-5580-RESPOND-news-data (Projectfolder)/", 
    countries=["Bulgaria", "Italy", "Netherlands", "United_Kingdom"], 
    outlet_dir="~/webdav/ASCOR-FMG-5580-RESPOND-news-data (Projectfolder)/selected_outlets/"
)


📊 Article count per country BEFORE outlet filtering:
-----------------------------------------------------

🔍 Looking for: /home/akroon/webdav/ASCOR-FMG-5580-RESPOND-news-data (Projectfolder)/Bulgaria_news.csv
✅ Found: /home/akroon/webdav/ASCOR-FMG-5580-RESPOND-news-data (Projectfolder)/Bulgaria_news.csv


📄 Bulgaria: 260300 articles loaded before filtering
📰 Bulgaria: 224455 articles AFTER outlet filtering

🔍 Looking for: /home/akroon/webdav/ASCOR-FMG-5580-RESPOND-news-data (Projectfolder)/Italy_news.csv
✅ Found: /home/akroon/webdav/ASCOR-FMG-5580-RESPOND-news-data (Projectfolder)/Italy_news.csv


📄 Italy: 910942 articles loaded before filtering
📰 Italy: 670971 articles AFTER outlet filtering

🔍 Looking for: /home/akroon/webdav/ASCOR-FMG-5580-RESPOND-news-data (Projectfolder)/Netherlands_news.csv
✅ Found: /home/akroon/webdav/ASCOR-FMG-5580-RESPOND-news-data (Projectfolder)/Netherlands_news.csv
📄 Netherlands: 121188 articles loaded before filtering
📰 Netherlands: 61566 articles AFTER outlet filtering

🔍 Looking for: /home/akroon/webdav/ASCOR-FMG-5580-RESPOND-news-data (Projectfolder)/United_Kingdom_news.csv
✅ Found: /home/akroon/webdav/ASCOR-FMG-5580-RESPOND-news-data (Projectfolder)/United_Kingdom_news.csv


📄 United_Kingdom: 1039569 articles loaded before filtering
📰 United_Kingdom: 532096 articles AFTER outlet filtering


In [2]:
print(f'N of all news articles after filtering out non-relevant outlets: {len(data)}')
#data.to_csv('~/webdav/ASCOR-FMG-5580-RESPOND-news-data (Projectfolder)/output/news_filtered_Bulgaria_Italy_Netherlands_United_Kingdom_outlets.csv')

N of all news articles after filtering out non-relevant outlets: 1489088
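
As a quick consistency check, the per-country counts printed by the loader can be summed and compared with the total above. The sketch below copies the logged numbers into a plain dictionary (the name `counts` is only illustrative) and also prints the share of articles each country retains after outlet filtering.

```python
# (before, after) outlet-filtering counts, copied from the loader output.
counts = {
    "Bulgaria": (260300, 224455),
    "Italy": (910942, 670971),
    "Netherlands": (121188, 61566),
    "United_Kingdom": (1039569, 532096),
}

# Sum of the post-filtering counts; this should match len(data).
total_after = sum(after for _, after in counts.values())
print(f"Total after filtering: {total_after}")

# Share of articles that survive outlet filtering, per country.
for country, (before, after) in counts.items():
    print(f"{country}: {after / before:.1%} of articles retained")
```

The summed total is 1489088, the same value reported by `len(data)`, so no rows are lost or duplicated when the per-country frames are combined.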


In [3]:
total_samples = 2000
df_news_sample = balanced_sample(data, total_samples=total_samples, countries=SELECTED_COUNTRIES)

print(f"Sample created with {len(df_news_sample)} rows")
print("Samples per country:")
print(df_news_sample["country"].value_counts())

print("\nSample preview:")
print(df_news_sample.head())

sample_csv_path = f"~/webdav/ASCOR-FMG-5580-RESPOND-news-data (Projectfolder)/output/news_sample_{total_samples}.csv"
df_news_sample.to_csv(sample_csv_path, index=False)
print(f"\nSample saved to: {sample_csv_path}")

Sample created with 1995 rows
Samples per country:
country
United_Kingdom    500
Italy             499
Netherlands       499
Bulgaria          497
Name: count, dtype: int64

Sample preview:
   Unnamed: 0        uri lang  isDuplicate        date      time  \
0       87066  787939780  bul        False  2018-01-04  08:12:00   
1       30824  786664095  bul        False  2018-01-01  16:41:00   
2      249635  809889599  bul        False  2018-02-09  13:30:00   
3      249481  815894766  bul        False  2018-02-19  10:55:00   
4       15509  835882007  bul        False  2018-03-21  18:16:00   

                   dateTime dateTimePub dataType  sim  ...  \
0 2018-01-04 08:12:00+00:00         NaN     news  0.0  ...   
1 2018-01-01 16:41:00+00:00         NaN     news  0.0  ...   
2 2018-02-09 13:30:00+00:00         NaN     news  0.0  ...   
3 2018-02-19 10:55:00+00:00         NaN     news  0.0  ...   
4 2018-03-21 18:16:00+00:00         NaN     news  0.0  ...   

                            