# News Article Classification and Analysis

This notebook performs a comprehensive analysis of news articles, encompassing language detection, context classification, headline extraction, and sentiment analysis. It leverages various NLP techniques and pre-trained models to categorize and summarize textual data.

**The key steps include:**

1.  **Data Loading and Cleaning:** Loading the dataset from a CSV file and preprocessing the text content.
2.  **Language Detection:** Identifying the language of each article using the `langdetect` library.
3.  **Context Classification:** Assigning one context label (politics, technology, social, sports, or economy) to each article using transformer-based zero-shot classification.
4.  **Headline Extraction:** Generating concise headlines for articles using summarization pipelines.
5.  **Sentiment Analysis:** Determining the sentiment (positive, negative, neutral) of articles using the multilingual `cardiffnlp/twitter-xlm-roberta-base-sentiment` model.

Each step relies on a single pre-trained model or library. The final output includes the original content, headline, predicted context, detected language, and sentiment label for each article, and the enriched records are then inserted into a MongoDB collection.

In [22]:
import pandas as pd
from transformers import pipeline
from langdetect import detect
from pymongo import MongoClient
import os


In [23]:
df = pd.read_csv(r'../data/cleaned_data/clean_data.csv')

In [24]:
# Drop missing rows first: astype(str) would turn NaN into the string 'nan'
df.dropna(subset=['content'], inplace=True)
df['content'] = df['content'].astype(str)

In [25]:
df.shape

(1119, 7)
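
The order of the two cleaning operations matters: casting to `str` before dropping missing values turns them into ordinary strings such as `'nan'`, so a later `dropna` finds nothing to remove. The sketch below shows the same effect with plain Python values. `clean_contents` is a hypothetical helper used only for illustration, not part of the notebook.

```python
import math


def clean_contents(values):
    """Drop missing or blank entries first, then cast the rest to str."""
    cleaned = []
    for value in values:
        # Treat None and float NaN as missing values
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        text = str(value).strip()
        if text:  # skip entries that are empty after stripping whitespace
            cleaned.append(text)
    return cleaned


raw = ["First article", float("nan"), None, "   ", 42]

# Casting first hides the missing values behind ordinary strings
cast_first = [str(v) for v in raw]
print(cast_first)           # ['First article', 'nan', 'None', '   ', '42']
print(clean_contents(raw))  # ['First article', '42']
```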

In [26]:
df['content'].head()

0    Poland’s interior ministry said seven drones a...
1    Appointed by President Emmanuel Macron on Tues...
2    The attack could have a chilling effect in the...
3    Ukraine’s President Volodymyr Zelensky describ...
4    “There are opportunities everywhere,” he said,...
Name: content, dtype: object

## Language Detection, Title Extraction, and Context Classification

## 1: Language Detection

In [27]:
df['language'] = df['content'].apply(detect)
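
`langdetect.detect` raises a `LangDetectException` when a text has no usable features (for example an empty or digits-only string), which would abort the whole `apply`. Its results can also vary between runs unless `DetectorFactory.seed` is set before the first call. A minimal sketch of a defensive wrapper follows. `safe_detect`, its `default` value, and the stand-in `fake_detector` are assumptions for illustration; in the notebook the detector argument would be `detect`.

```python
def safe_detect(text, detector, default="unknown"):
    """Return detector(text), or `default` if the text is blank or detection fails."""
    if not isinstance(text, str) or not text.strip():
        return default
    try:
        return detector(text)
    except Exception:
        # langdetect raises LangDetectException when it finds no usable features
        return default


# Stand-in detector for illustration only; it mimics a failure on digits-only input
def fake_detector(text):
    if text.isdigit():
        raise ValueError("No features in text")
    return "en"


print(safe_detect("Poland's interior ministry said ...", fake_detector))  # en
print(safe_detect("12345", fake_detector))                                # unknown
print(safe_detect("", fake_detector))                                     # unknown
```

With the real library this could be applied as `df['content'].apply(lambda t: safe_detect(t, detect))`.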

In [28]:
df[['content','language']].head()

|   | content | language |
|---|---------|----------|
| 0 | Poland’s interior ministry said seven drones a... | en |
| 1 | Appointed by President Emmanuel Macron on Tues... | en |
| 2 | The attack could have a chilling effect in the... | en |
| 3 | Ukraine’s President Volodymyr Zelensky describ... | en |
| 4 | “There are opportunities everywhere,” he said,... | en |


## 2: Context Classification  

### Choose model

In [29]:
# =========== Heavy Models (Use GPU) ===========
# ==== Context ====
classifier = pipeline(
    "zero-shot-classification",
    model="typeform/distilbert-base-uncased-mnli", 
    device=0
)

Device set to use cuda:0


### Apply Models

In [30]:
labels = ["politics", "technology", "social", "sports", "economy"]

def extract_features_context(text):
    """Return the highest-scoring candidate label for `text`."""
    # The pipeline returns the candidate labels sorted by descending score
    return classifier(text, candidate_labels=labels)['labels'][0]

# Apply classification row by row (the function returns a plain string,
# so no pd.Series wrapper is needed)
df['context'] = df['content'].apply(extract_features_context)

In [31]:
df[['content', 'context']]

|   | content | context |
|---|---------|---------|
| 0 | Poland’s interior ministry said seven drones a... | sports |
| 1 | Appointed by President Emmanuel Macron on Tues... | politics |
| 2 | The attack could have a chilling effect in the... | politics |
| 3 | Ukraine’s President Volodymyr Zelensky describ... | politics |
| 4 | “There are opportunities everywhere,” he said,... | social |
| ... | ... | ... |
| 1114 | Сегодня Центральный банк России установил новы... | sports |
| 1115 | Сегодня Центральный банк России установил новы... | social |
| 1116 | Однако настоящая дорога не всегда черно-белая,... | social |
| 1117 | Неожиданные выводы исследователей: какое молок... | economy |


In [32]:
df['context'].value_counts()

context
politics      480
technology    239
social        219
economy        91
sports         90
Name: count, dtype: int64
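
The zero-shot pipeline returns a dictionary with parallel `labels` and `scores` lists, so the top label is always assigned even when the model is barely more confident in it than in the alternatives. One possible safeguard, sketched below, is to fall back to a catch-all label when the best score is low. The `pick_context` helper, the `0.5` threshold, the `"other"` label, and the example scores are all assumptions for illustration and were not produced by the model.

```python
def pick_context(result, min_score=0.5, fallback="other"):
    """Return the best label from a zero-shot result, or `fallback` if its score is too low."""
    labels, scores = result["labels"], result["scores"]
    if not labels:
        return fallback
    # Do not rely on ordering: pick the pair with the highest score explicitly
    best_label, best_score = max(zip(labels, scores), key=lambda pair: pair[1])
    return best_label if best_score >= min_score else fallback


# Hand-written results in the shape the pipeline returns; the scores are made up
confident = {
    "sequence": "...",
    "labels": ["politics", "economy", "social", "technology", "sports"],
    "scores": [0.81, 0.08, 0.05, 0.04, 0.02],
}
uncertain = {
    "sequence": "...",
    "labels": ["sports", "politics", "social", "economy", "technology"],
    "scores": [0.28, 0.26, 0.20, 0.15, 0.11],
}

print(pick_context(confident))  # politics
print(pick_context(uncertain))  # other
```

A threshold like this trades coverage for precision; a suitable value would need to be checked against a hand-labelled sample of articles.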

## 3: Headline extraction

#### Choose model

In [33]:
# =========== Medium Models (Use GPU) ===========
# ==== Headline ====
summarizer = pipeline(
    "summarization",
    model="sshleifer/distilbart-cnn-12-6", 
    device=0, 
    torch_dtype="auto"
)

Device set to use cuda:0


#### Apply Model

In [34]:
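def extract_features_headline(text):
    """Summarize the start of `text` into a short headline; return None on failure."""
    # Character-level cut (not tokens) to keep the summarizer input short
    cleaned_text = text[:512]

    try:
        headline = summarizer(
            cleaned_text,
            max_length=20,
            min_length=5,
            do_sample=False
        )[0]['summary_text']
    except Exception:
        # Return None for this row instead of the entire 'title' column
        return None

    return headline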

In [None]:
# Keep existing titles; generate a headline only where the title is missing
df['title'] = df.apply(
    lambda row: extract_features_headline(row['content']) if pd.isna(row['title']) else row['title'],
    axis=1
)

In [37]:
df.headline.value_counts()

headline
What's new to streaming this week?                                                                                   3
حرب غزة                                                                                                              3
Photos this week                                                                                                     2
Moon phase today                                                                                                     2
Nepal                                                                                                                2
                                                                                                                    ..
«Сколько техники было оставлено афганцам- масштабы поражают!»- ветеран Афгана откровенно рассказал о своей службе    1
Мудрые строки Дементьева, которые должны прочитать все родители хотя бы раз в жизни                                  1
Астролог Василиса Володина поделилась в
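
The `text[:512]` cut in `extract_features_headline` counts characters, not tokens, and can end in the middle of a word or sentence. A minimal sketch of a boundary-aware cut is shown below. `truncate_at_boundary` is a hypothetical helper, and its punctuation pattern only covers `.`, `!`, and `?`, so text that uses other sentence marks (for example the Arabic question mark) falls back to the word-boundary cut.

```python
import re


def truncate_at_boundary(text, max_chars=512):
    """Cut text to at most max_chars, preferring a sentence end, then a word boundary."""
    text = " ".join(text.split())  # collapse runs of whitespace
    if len(text) <= max_chars:
        return text
    # Look one character past the limit to see what follows the window
    probe = text[:max_chars + 1]
    sentence_ends = [m.end() for m in re.finditer(r"[.!?](?=\s)", probe)]
    if sentence_ends:
        return probe[:sentence_ends[-1]]
    # No sentence end found: cut at the last space so no word is split
    cut = probe.rfind(" ")
    return probe[:cut] if cut > 0 else probe[:max_chars]


print(truncate_at_boundary(
    "First sentence here. Second sentence is much longer and keeps going", 30
))  # First sentence here.
print(truncate_at_boundary("alpha beta gamma delta", 12))  # alpha beta
```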

## 4: Sentiment Analysis


In [38]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

model_name = "cardiffnlp/twitter-xlm-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).to('cuda')


In [39]:
# truncation=True makes the 512-token limit explicit and avoids the tokenizer warning
sentiment_pipeline = pipeline(
    "sentiment-analysis",
    model=model,
    tokenizer=tokenizer,
    max_length=512,
    truncation=True
)

df["sentiment"] = df["content"].apply(lambda x: sentiment_pipeline(x)[0]['label'])

Device set to use cuda:0


In [40]:
df[['content', 'sentiment']].head()

|   | content | sentiment |
|---|---------|-----------|
| 0 | Poland’s interior ministry said seven drones a... | neutral |
| 1 | Appointed by President Emmanuel Macron on Tues... | negative |
| 2 | The attack could have a chilling effect in the... | negative |
| 3 | Ukraine’s President Volodymyr Zelensky describ... | negative |
| 4 | “There are opportunities everywhere,” he said,... | neutral |


In [41]:
df['sentiment'].value_counts()

sentiment
negative    643
neutral     404
positive     72
Name: count, dtype: int64
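
Calling the pipeline once per row through `apply` sends one article at a time to the GPU. Transformers pipelines also accept a list of texts and return one result per text, so the articles could be labelled in chunks instead. The sketch below shows the chunking logic only. `batched`, `label_in_batches`, the batch size, and the stand-in `fake_classify` are assumptions for illustration; in the notebook the `classify` argument would be `sentiment_pipeline`.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def label_in_batches(texts, classify, batch_size=32):
    """Run `classify` on chunks of texts and collect the 'label' of each result."""
    labels = []
    for chunk in batched(list(texts), batch_size):
        labels.extend(result["label"] for result in classify(chunk))
    return labels


# Stand-in classifier for illustration only; it mimics the list-of-dicts output shape
def fake_classify(chunk):
    return [
        {"label": "negative" if "attack" in text else "neutral", "score": 1.0}
        for text in chunk
    ]


texts = ["The attack could have a chilling effect", "Moon phase today", "Photos this week"]
print(label_in_batches(texts, fake_classify, batch_size=2))
# ['negative', 'neutral', 'neutral']
```

With the real model this could be used as `df["sentiment"] = label_in_batches(df["content"].tolist(), sentiment_pipeline)`.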

## Final Data

In [42]:
df[['content', 'headline', 'context','sentiment' , 'language']]

|   | content | headline | context | sentiment | language |
|---|---------|----------|---------|-----------|----------|
| 0 | Poland’s interior ministry said seven drones a... | NATO shoots down Russian drones in Polish airs... | sports | neutral | en |
| 1 | Appointed by President Emmanuel Macron on Tues... | France hit by protests and disruption as new p... | politics | negative | en |
| 2 | The attack could have a chilling effect in the... | Israel strikes Hamas leadership in Qatar in un... | politics | negative | en |
| 3 | Ukraine’s President Volodymyr Zelensky describ... | Russian aerial bomb kills at least 25 civilian... | politics | negative | en |
| 4 | “There are opportunities everywhere,” he said,... | She was deported nine times. Then, like others... | social | neutral | en |
| ... | ... | ... | ... | ... | ... |
| 1114 | Сегодня Центральный банк России установил новы... | Курс юаня к рублю на сегодня 11 сентября 2025 ... | sports | neutral | ru |
| 1115 | Сегодня Центральный банк России установил новы... | Курс тенге к рублю на сегодня 11 сентября 2025... | social | neutral | ru |
| 1116 | Однако настоящая дорога не всегда черно-белая,... | Пересечение сплошной линии | social | neutral | ru |
| 1117 | Неожиданные выводы исследователей: какое молок... | Ученые раскрыли секрет | economy | neutral | ru |


## Import Data to MongoDB

In [None]:
# ==== MongoDB connection ====
client = MongoClient("mongodb://localhost:27017/")
db = client["insightbot"]
collection = db["articles"]

# ==== Get the next free id (highest stored id + 1, or 0 for an empty collection) ====
last_doc = collection.find_one(sort=[("id", -1)])
next_id = last_doc["id"] + 1 if last_doc else 0

# ==== Drop duplicates and reset index ====
df = df.drop_duplicates(subset=["title", "content"], keep="first")
df = df.reset_index(drop=True)

# ==== Assign new IDs and prepare for insertion ====
# Starting at next_id avoids reusing the id of the last stored document
df["id"] = range(next_id, next_id + len(df))
df = df.drop(columns=["_id"], errors="ignore")

# ==== Insert data into MongoDB ====
data_dict = df.to_dict("records")
if data_dict:
    collection.insert_many(data_dict)
    print(f"Inserted {len(data_dict)} records into MongoDB, IDs {next_id} → {next_id + len(df) - 1}.")
else:
    print("No new records to insert.")

Inserted 1439 records into MongoDB, IDs 0 → 1438.
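
Sequential IDs only stay unique across runs if numbering continues one past the highest `id` already stored. The sketch below expresses the deduplication and ID assignment as pure functions on lists of dictionaries, so the logic can be checked without a database. `drop_duplicate_records` and `assign_ids` are hypothetical helpers for illustration, with `last_id=None` standing for an empty collection.

```python
def drop_duplicate_records(records, keys=("title", "content")):
    """Keep the first record for each combination of the given keys."""
    seen = set()
    unique = []
    for record in records:
        fingerprint = tuple(record.get(key) for key in keys)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(record)
    return unique


def assign_ids(records, last_id=None):
    """Return copies of the records with sequential ids starting after last_id."""
    start = 0 if last_id is None else last_id + 1
    return [{**record, "id": start + offset} for offset, record in enumerate(records)]


records = [
    {"title": "A", "content": "x"},
    {"title": "A", "content": "x"},  # exact duplicate of the first record
    {"title": "B", "content": "y"},
]
unique = drop_duplicate_records(records)

print([r["id"] for r in assign_ids(unique)])                # [0, 1]
print([r["id"] for r in assign_ids(unique, last_id=1438)])  # [1439, 1440]
```

This approach does not guard against two processes inserting at the same time; a unique index on `id` in MongoDB would be one way to surface such collisions.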
