<img src="../figs/holberton_logo.png" alt="logo" width="500"/>

# A Machine Learning Framework for Automated Albanian News Article Title Classification

## 1. Executive Summary

**Automated news article classification** is a technique to **classify text data into predefined categories**. This has many uses, such as improving search results, understanding topics, analyzing sentiments, and recommending content. In Albanian, despite the growth of digital content, there is a lack of large text datasets, which hinders the progress of natural language processing (NLP) research and applications.

This work makes two main contributions:

1. We introduce a new dataset of 9,600 news article titles spanning several categories.
2. We use this dataset to evaluate the performance of different machine learning algorithms for classifying topics.

Our experiments show that recurrent neural networks (RNNs) perform better than simpler classifiers and ensemble methods for this task.

### Applications

Text classification is a core task in natural language processing (NLP) with many practical uses, including:

- **Information Retrieval and Summarization**: Efficiently finding and summarizing relevant information from large datasets.
- **News Aggregation**: Grouping news articles by topics for easier navigation.
- **Customer Feedback Segmentation**: Analyzing and categorizing customer reviews and feedback for better insights.
- **Content Personalization**: Tailoring content to individual user preferences to enhance user experience.

### Challenges

While text classification has many applications, it also faces several challenges, especially for the Albanian language:

- **Limited Text Corpora**: Albanian has fewer large datasets available, hindering the development of NLP models.
- **Ambiguity in News Articles**: Articles can cover multiple topics, making it difficult to assign a single category. For example, a sports article might also discuss cultural or social aspects.
- **Grammatical and Lexical Variability**: The structure and style of Albanian text differ significantly from English, posing challenges for accurate classification.
- **Resource Constraints**: Low-resource languages like Albanian lack the extensive training data that sophisticated machine learning models need to perform well.

## 2. Problem Definition

Let $\mathcal{D}$ be a dataset with $N$ news article titles. Each title $x_i$ is linked to a category label $y_i$ such that $y_i \in \{1, 2, \ldots, \mathcal{K}\}$, where $\mathcal{K}$ is the total number of categories. Each news title $x_i$ consists of a sequence of words, represented as $x_i = (w_{i1}, w_{i2}, \ldots, w_{iN_i})$, where $N_i$ is the number of words in title $x_i$. These words are the features used for classification.

The goal is to use the training data to learn a function $f: \mathcal{X} \mapsto \mathcal{Y}$, where $\mathcal{X}$ is the set of news article titles and $\mathcal{Y}$ is the set of category labels. The function $f$ maps each title $x_i$ to a predicted category, and it should accurately predict the category of new, unseen titles. The objective is to find the optimal function $f^*$ that minimizes a predefined loss function $\mathcal{L}(f(x_i), y_i)$ summed over the entire dataset $\mathcal{D}$:

$$
f^* = \underset{f}{\text{argmin}} \sum_{i=1}^N \mathcal{L}(f(x_i), y_i)
$$

where $N$ is the total number of news article titles, and $\mathcal{L}$ is the loss incurred by the model's prediction.
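To make the objective concrete, the sketch below evaluates the summed loss for one candidate $f$ on a toy dataset. It assumes a zero-one loss as $\mathcal{L}$ (the text does not fix a particular loss), and the titles, the label numbering, and `keyword_classifier` are made-up illustrations, not part of the dataset or the models used in this work.

```python
from typing import Callable, Sequence


def zero_one_loss(predicted: int, actual: int) -> int:
    """Assumed loss: 0 for a correct prediction, 1 otherwise."""
    return 0 if predicted == actual else 1


def total_loss(
    f: Callable[[str], int],
    titles: Sequence[str],
    labels: Sequence[int],
    loss: Callable[[int, int], int] = zero_one_loss,
) -> int:
    """Sum of L(f(x_i), y_i) over all (title, label) pairs."""
    return sum(loss(f(x), y) for x, y in zip(titles, labels))


def keyword_classifier(title: str) -> int:
    """Hypothetical candidate f: a single keyword rule."""
    return 1 if "ndeshje" in title.lower() else 2


# Toy data; hypothetical label numbering: 1 = sport, 2 = economy, 3 = politics.
titles = [
    "Ndeshje e fortë në kampionat",
    "Rritet çmimi i naftës",
    "Qeveria miraton buxhetin",
]
labels = [1, 2, 3]

print(total_loss(keyword_classifier, titles, labels))  # 1 (only the third title is misclassified)
```

Finding $f^*$ then amounts to searching over candidate functions for the one with the smallest total loss. In practice a differentiable surrogate such as cross-entropy is usually minimized instead of the zero-one loss.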


## 3. Proposed Methodology

Our approach consists of three main steps:

1. **Dataset Creation**: We collect news article titles from different categories using web scraping.
2. **Data Processing**: We convert the collected titles into numerical tokens using tokenization and also convert the labels into numerical values.
3. **Model Training**: We build and train various classification models, including traditional methods, ensemble techniques, and deep learning models, to accurately classify new, unseen news article titles.

The figure below illustrates this process.

<img src="../figs/5-portfolio/approach.png" alt="approach overview" width="800"/>
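The data processing step (titles to numerical tokens, labels to numerical values) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: whitespace tokenization, the `<pad>` and `<unk>` entries, the fixed `max_len`, and all function names are choices made for this example and are not necessarily those of the actual pipeline.

```python
def build_vocab(titles):
    """Map each distinct word to an integer id; ids 0 and 1 are reserved."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for title in titles:
        for word in title.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode_title(title, vocab, max_len):
    """Convert a title to a fixed-length list of token ids (truncate, then pad)."""
    ids = [vocab.get(word, vocab["<unk>"]) for word in title.split()][:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


def encode_labels(labels):
    """Map category names to consecutive integers in alphabetical order."""
    label_to_id = {name: i for i, name in enumerate(sorted(set(labels)))}
    return [label_to_id[name] for name in labels], label_to_id


# Toy, made-up titles and labels for illustration only.
train_titles = ["qeveria miraton buxhetin", "skuadra fiton ndeshjen"]
train_labels = ["politics", "sport"]

vocab = build_vocab(train_titles)
label_ids, label_to_id = encode_labels(train_labels)

print(encode_title("qeveria fiton sot", vocab, max_len=5))  # [2, 6, 1, 0, 0]
print(label_ids)                                            # [0, 1]
```

The word `sot` does not occur in the toy training titles, so it maps to the `<unk>` id. Padding to a fixed length is what lets sequence models such as RNNs process titles in batches.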





## 4. Technical Approach

### 4.1 Data Preparation

We **create a dataset by scraping the web for news article titles, covering six main categories: politics, economy, current affairs, sport, culture, and lifestyle**. To make our approach robust, we include categories that might overlap, like culture and lifestyle. 

After collecting the data, we preprocess it to ensure uniform formatting. This includes **removing punctuation and special symbols, converting text to lowercase, and discarding titles that are too short**. We balance the dataset to represent each category equally. A human supervisor reviews the data to ensure its quality and relevance.
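The preprocessing described above could look roughly like the following sketch. Several details are assumptions, since the text does not specify them: the minimum length of three words, balancing by random downsampling to the smallest category, and every function name here are hypothetical.

```python
import random
import re
from collections import defaultdict


def clean_title(title):
    """Lowercase, replace punctuation and symbols with spaces, collapse whitespace."""
    title = title.lower()
    # \w is Unicode-aware in Python 3, so Albanian letters such as ë and ç are kept.
    title = re.sub(r"[^\w\s]|_", " ", title)
    return " ".join(title.split())


def filter_short(rows, min_words=3):
    """Drop (title, category) rows whose title has fewer than min_words words."""
    return [(t, c) for t, c in rows if len(t.split()) >= min_words]


def balance(rows, seed=0):
    """Downsample every category to the size of the smallest one."""
    by_category = defaultdict(list)
    for title, category in rows:
        by_category[category].append((title, category))
    if not by_category:
        return []
    n = min(len(items) for items in by_category.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = []
    for category in sorted(by_category):
        balanced.extend(rng.sample(by_category[category], n))
    return balanced


def prepare(rows, min_words=3, seed=0):
    """Clean, filter, and balance a list of (title, category) rows."""
    cleaned = [(clean_title(t), c) for t, c in rows]
    return balance(filter_short(cleaned, min_words), seed)


print(clean_title("Kryeministri: «Buxheti 2024» u miratua!"))
# kryeministri buxheti 2024 u miratua
```

The human review step mentioned above is not automated here; a sketch like this only covers the mechanical cleaning.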

#### Key Steps

1. **Import Libraries**: Import `requests`, `BeautifulSoup`, and `csv`.

2. **Scrape Headlines Function**:
    - **Purpose**: Scrape headlines from a given URL and add them to a list.
    - **Process**:
        - Send an HTTP GET request to the URL.
        - Parse the HTML content using BeautifulSoup.
        - Extract headlines (`<h3>` tags) and append them to the list, skipping the first and the last five `<h3>` elements, which belong to the page header and footer.
        - Catch HTTP, connection, and timeout errors and report them without stopping the run.

3. **Main Function**:
    - **Purpose**: Scrape multiple URLs and save all headlines to a CSV file.
    - **Process**:
        - Initialize an empty list to store headlines.
        - For each base URL, scrape a specified number of pages.
        - Call `scrape_headlines` for each page URL.
        - Write the collected headlines to a CSV file with 'Headline' and 'Category' columns (the 'Category' column stores the URL of the page each headline came from).

4. **Example Usage**:
    - Define a list of URLs to scrape.
    - Set the number of pages to scrape per URL.
    - Specify the output CSV file name.
    - Call the `main` function with these parameters.



```python
import requests
from bs4 import BeautifulSoup
import csv

def scrape_headlines(url, all_headlines):
    """
    Scrapes headlines from the given URL and appends them to the all_headlines list.

    Args:
        url (str): The URL of the page to scrape.
        all_headlines (list of tuples): The list to which scraped headlines will be appended.
    """
    try:
        # Send the GET request using default headers; the timeout keeps a
        # stalled server from blocking the run indefinitely
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # Raises HTTPError for bad responses (4xx or 5xx)

        soup = BeautifulSoup(response.content, 'html.parser')
        headlines = soup.find_all('h3')

        # Append each headline to the list
        for headline in headlines[1:-5]:       # Skip header and footer <h3> tags
            text = headline.text.strip()
            all_headlines.append((text, url))  # Keep the page URL as the category field

    except requests.exceptions.HTTPError as http_err:
        print(f"HTTP error occurred for {url}: {http_err}")
    except requests.exceptions.ConnectionError as conn_err:
        print(f"Connection error occurred for {url}: {conn_err}")
    except requests.exceptions.Timeout as timeout_err:
        print(f"Timeout error occurred for {url}: {timeout_err}")
    except requests.exceptions.RequestException as req_err:
        print(f"An error occurred for {url}: {req_err}")
    except Exception as e:
        print(f"An unexpected error occurred for {url}: {e}")

def main(urls, num_pages_per_url, output_csv):
    """
    Scrapes multiple URLs and saves all headlines to a single CSV file.

    Args:
        urls (list of str): List of base URLs for scraping.
        num_pages_per_url (int): Number of pages to scrape for each URL.
        output_csv (str): The path to the CSV file where headlines will be saved.
    """
    all_headlines = []

    for base_url in urls:
        print(f"Scraping URL: {base_url}")  # Debugging: Print which URL is being scraped
        for i in range(1, num_pages_per_url + 1):
            url = f'{base_url}/page/{i}/'
            # print(f"Scraping page: {url}")  # Debugging: Print page URL
            scrape_headlines(url, all_headlines)

    # Write all headlines to a single CSV file
    with open(output_csv, mode='w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(['Headline', 'Category'])  # Write header row
        writer.writerows(all_headlines)            # Write all headlines

    print(f"All headlines saved to {output_csv}")

# Provide here the urls to scrape from
urls = [
    'https://fjala.al/category/ekonomia',
    'https://fjala.al/category/politics',
    'https://fjala.al/category/aktualitet/kronike',
    'https://fjala.al/category/sport',
    'https://fjala.al/category/lifestyle/show-biz'
]

num_pages_per_url = 5
output_csv = 'all_headlines.csv'
main(urls, num_pages_per_url, output_csv)
```


```text
Scraping URL: https://fjala.al/category/ekonomia
Scraping URL: https://fjala.al/category/politics
Scraping URL: https://fjala.al/category/aktualitet/kronike
Scraping URL: https://fjala.al/category/sport
Scraping URL: https://fjala.al/category/lifestyle/show-biz
All headlines saved to all_headlines.csv
```
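Because the scraper stores the page URL in the 'Category' column, a later step has to turn that URL into a category name. The helper below is one possible way to do this and is not part of the original code: `category_from_url` is a hypothetical name, and it assumes the `/category/<slug>/page/<n>/` URL layout used by the scraper above.

```python
from urllib.parse import urlparse


def category_from_url(url):
    """Return the path segment(s) between 'category' and 'page', or None."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if "category" not in parts:
        return None
    start = parts.index("category") + 1
    end = parts.index("page") if "page" in parts else len(parts)
    slug = parts[start:end]
    return "/".join(slug) if slug else None


print(category_from_url("https://fjala.al/category/sport/page/3/"))
# sport
print(category_from_url("https://fjala.al/category/aktualitet/kronike/page/1/"))
# aktualitet/kronike
```

Nested categories are returned with their parent segment so that no information is lost; mapping these slugs to the final label names is left to the labeling step.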
