# **Information Processing and Retrieval**

**Project developed by:**
- Diogo Fonte - up202004175
- Rodrigo Figueiredo - up202005216
- Sofia Rodrigo - up202301429
- Vítor Cavaleiro - up202004724

## **Environment Setup**

In [31]:
import pandas as pd
import numpy as np
import os
import json

# Data Preparation

## Data Ingestion

### All The News - Collection of Articles from 18 publishers

In [32]:
# The original dataset is a SQLite .db file that was exported to JSON with SQLiteStudio

# Load the exported table (the file handle is closed once the JSON has been read)
with open("../data/all-the-news/all-the-news-conv.json", encoding="utf8") as f:
    data = json.load(f)
table = data["objects"][0]

# Get the column definitions and the rows
columns = table["columns"]
rows = table["rows"]

# Get the column names
column_names = [column["name"] for column in columns]

# Create the resulting dictionary: one list of values per column
result = {column_name: [] for column_name in column_names}

# Fill each column with the corresponding value of every row
for row in rows:
    for i, column_name in enumerate(column_names):
        result[column_name].append(row[i])

# The index is written too, so it is read back later as the 'Unnamed: 0' column
pd.DataFrame.from_dict(result).to_csv('all_the_news.csv', encoding='utf-8')
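
The column-by-column loop above can also be written by passing the rows and the column names straight to the `DataFrame` constructor. The following is only a minimal sketch, assuming the export keeps the `objects[0]` layout with `columns` and `rows` used above; the helper name `export_to_frame` and the sample data are hypothetical.

```python
import pandas as pd


def export_to_frame(data):
    """Build a DataFrame from a JSON export shaped like the one above.

    Assumes data["objects"][0] holds "columns" (a list of dicts with a
    "name" key) and "rows" (a list of row lists in the same column order).
    """
    table = data["objects"][0]
    column_names = [column["name"] for column in table["columns"]]
    return pd.DataFrame(table["rows"], columns=column_names)


# Tiny hypothetical sample that mimics the export layout
sample = {
    "objects": [
        {
            "columns": [{"name": "id"}, {"name": "title"}],
            "rows": [[1, "First article"], [2, None]],
        }
    ]
}

frame = export_to_frame(sample)
print(frame)
```

This avoids the intermediate dictionary, which may matter for a table with a few hundred thousand rows.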

In [33]:
all_the_news = pd.read_csv('all_the_news.csv', encoding='utf-8')
all_the_news.isna().sum()

Unnamed: 0          0
id                  0
title               5
author          54071
date            34274
content         37072
year            34274
month           34274
publication     29384
category        57091
digital         32689
section        151232
url            127008
dtype: int64
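
Raw counts are easier to judge next to the share of rows they represent. As a rough sketch, `isna().mean()` gives the fraction of missing values per column; the `sample` frame below is a hypothetical stand-in for `all_the_news`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for all_the_news
sample = pd.DataFrame({
    "title": ["a", "b", "c", None],
    "author": [np.nan, np.nan, "x", "y"],
    "url": [np.nan, np.nan, np.nan, np.nan],
})

# Missing count and missing share per column, worst columns first
missing = pd.DataFrame({
    "missing": sample.isna().sum(),
    "share": sample.isna().mean().round(2),
}).sort_values("share", ascending=False)
print(missing)
```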

In [34]:
# Drops irrelevant columns
all_the_news = all_the_news.drop(columns=['Unnamed: 0', 'id', 'year', 'month', 'digital', 'section'])
all_the_news = all_the_news.rename(columns={"publication": "publisher"})
all_the_news.head()

| | title | author | date | content | publisher | category | url |
|---|---|---|---|---|---|---|---|
| 0 | Agent Cooper in Twin Peaks is the audience: on... | \nTasha Robinson\n | 2017-05-31 | And never more so than in Showtime’s new... | Verge | Longform | |
| 1 | AI, the humanity! | \nSam Byford\n | 2017-05-30 | AlphaGo’s victory isn’t a defeat for hum... | Verge | Longform | |
| 2 | The Viral Machine | \nKaitlyn Tiffany\n | 2017-05-25 | Super Deluxe built a weird internet empi... | Verge | Longform | |
| 3 | How Anker is beating Apple and Samsung at thei... | \nNick Statt\n | 2017-05-22 | Steven Yang quit his job at Google in th... | Verge | Longform | |
| 4 | Tour Black Panther’s reimagined homeland with ... | \nKwame Opam\n | 2017-05-15 | Ahead of Black Panther’s 2018 theatrical... | Verge | Longform | |
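
The author values in this preview are wrapped in `\n`. Assuming these are real newline characters in the data, one possible cleanup is to strip surrounding whitespace with the `.str.strip()` accessor, which leaves missing values untouched. The `sample` frame below is a hypothetical stand-in that reuses two of the author strings shown above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the author column of all_the_news
sample = pd.DataFrame({"author": ["\nTasha Robinson\n", "\nSam Byford\n", np.nan]})

# Remove leading and trailing whitespace, including newlines; NaN stays NaN
sample["author"] = sample["author"].str.strip()
print(sample)
```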


In [35]:
rows_count = all_the_news.shape[0]
print("Number of rows: ", rows_count)

Number of rows:  225804


### BBC News

In [36]:
main_folder = "../data/bbc_news_collection/"
news = []

# Iterate through the subfolders of the 5 categories (business, entertainment, politics, sport, tech)
for subfolder in os.listdir(main_folder):
    subfolder_path = os.path.join(main_folder, subfolder)
    if not os.path.isdir(subfolder_path):
        continue

    for filename in os.listdir(subfolder_path):
        if not filename.endswith(".txt"):
            continue

        with open(os.path.join(subfolder_path, filename), "r", encoding="utf-8") as file:
            lines = file.readlines()

        # Collect one plain dict per article; a single DataFrame is built after the loop
        news.append({
            "title": lines[0].strip(),  # The first line is the title
            "author": np.nan,  # No author information
            "date": "2005-12-31",  # The same fixed date is assigned to every article
            "content": "".join(lines[1:]).replace("\n", " ").strip(),  # The rest is the content
            "publisher": "BBC",
            "category": subfolder,  # The folder name is the category
            "url": np.nan,  # No URL information
        })

bbc_news = pd.DataFrame(news)
bbc_news.to_csv("BBC_articles.csv", index=False)
bbc_news.head()

| | title | author | date | content | publisher | category | url |
|---|---|---|---|---|---|---|---|
| 0 | Musicians to tackle US red tape | | 2005-12-31 | Musicians' groups are to tackle US visa regula... | BBC | entertainment | |
| 1 | U2's desire to be number one | | 2005-12-31 | U2, who have won three prestigious Grammy Awar... | BBC | entertainment | |
| 2 | Rocker Doherty in on-stage fight | | 2005-12-31 | Rock singer Pete Doherty has been involved in ... | BBC | entertainment | |
| 3 | Snicket tops US box office chart | | 2005-12-31 | The film adaptation of Lemony Snicket novels h... | BBC | entertainment | |
| 4 | Ocean's Twelve raids box office | | 2005-12-31 | Ocean's Twelve, the crime caper sequel starrin... | BBC | entertainment | |
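
The title/content split used in the ingestion loop can be isolated in a small function, which makes it easy to check on a string without touching the file system. This is a sketch only: `parse_article` is a hypothetical helper, the sample text is made up, and the guard for empty input is an addition that the loop above does not have.

```python
def parse_article(text):
    """Split raw article text into (title, content).

    The first line is taken as the title and the remaining lines are
    joined with spaces into a single content string.
    """
    lines = text.splitlines()
    if not lines:
        # Empty file: nothing to index, so return empty fields
        return "", ""
    title = lines[0].strip()
    content = " ".join(lines[1:]).strip()
    return title, content


# Made-up sample text laid out like the article files: title, blank line, body
title, content = parse_article("Sample title\n\nFirst paragraph.\nSecond paragraph.\n")
print(title)
print(content)
```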


In [37]:
rows_count_bbc = bbc_news.shape[0]
print("Number of rows: ", rows_count_bbc)

Number of rows:  2225


## Merge of Datasets