## BBC News Recommendation System

#### Loading Dataset

In [34]:
# Load the tab-separated dataset and randomly sample 500 rows (fixed seed for reproducibility)

import numpy as np
import pandas as pd


df = pd.read_csv('bbc-news-data.csv', sep='\t').sample(500, random_state = 115)

df.head()

| | category | filename | title | content |
|---|---|---|---|---|
| 587 | entertainment | 078.txt | Baby becomes new Oscar favourite | Clint Eastwood's boxing drama Million Dollar ... |
| 1412 | sport | 100.txt | Mido makes third apology | Ahmed 'Mido' Hossam has made another apology ... |
| 748 | entertainment | 239.txt | Fightstar take to the stage | Charlie Simpson took his new band Fightstar t... |
| 1706 | sport | 394.txt | Tindall wants second opinion | England centre Mike Tindall is to seek a seco... |
| 758 | entertainment | 249.txt | Soul sensation ready for awards | South West teenage singing sensation, Joss St... |
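
The leftmost column of the output above holds the original row labels (such as 587 and 1412), because `sample` keeps the index of the source frame. This matters later, when positions returned by `argsort` are used to look rows up with `.iloc`. A minimal sketch of the difference, using a hypothetical ten-row frame named `toy` rather than the BBC data:

```python
import pandas as pd

# Hypothetical stand-in for the BBC DataFrame (not part of the notebook)
toy = pd.DataFrame({"title": [f"article {n}" for n in range(10)]})
sample = toy.sample(4, random_state=115)

# Index labels are inherited from the original frame
print(sample.index.tolist())

# Position-based lookup works directly on the sampled frame
first_title = sample["title"].iloc[0]

# Label-based lookup needs the original label, not the position
same_title = sample["title"].loc[sample.index[0]]

# reset_index(drop=True) makes labels and positions coincide
renumbered = sample.reset_index(drop=True)
print(renumbered.index.tolist())
```

Using `.iloc` throughout, as the notebook does, avoids having to reset the index after sampling.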


In [35]:
df['category'].unique()

array(['entertainment', 'sport', 'politics', 'tech', 'business'],
      dtype=object)

In [36]:
# Count missing values in each column

df.isnull().sum()

category    0
filename    0
title       0
content     0
dtype: int64

#### Preprocessing and Testing

In [44]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


# Apply TF-IDF Vectorizer to the 'content' column
tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=2, stop_words='english')
tfidf_matrix = tfidf.fit_transform(df['content'])

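# Prompt for a query and project it into the fitted TF-IDF space
def get_user_input():
    user_input = input("Enter a news topic or description: ")
    # Reuse the vectorizer fitted on df['content'] so the query shares its vocabulary
    tf_idf_input = tfidf.transform([user_input])
    return tf_idf_input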

# Function to get top matching articles
def get_news(tfidf_matrix, tf_idf_input):
    cosine_similarities = cosine_similarity(tf_idf_input, tfidf_matrix)
    top_n = 5

    # Get top indices and corresponding similarity scores
    top_indices = cosine_similarities.argsort()[0][-top_n:]
    top_scores = cosine_similarities[0][top_indices]

    results = [
        (
            df['title'].iloc[i],
            df['category'].iloc[i],
            df['content'].iloc[i],
            score,
        )
        for i, score in zip(top_indices, top_scores)
    ]
    return results

if __name__ == '__main__':
    flag = True
    while flag:
        tf_idf_input = get_user_input()
        result = get_news(tfidf_matrix, tf_idf_input)

        print("\nTop 5 Recommended News:")
        for title, category, description, score in result[::-1]:  # Reverse to show highest scores first
            print(f"-> {title} (Category: {category}, Similarity: {round(score, 2)})\n   {description[:300]}...")  # Limit description to 300 chars

        print("\nDo you want to find more NEWS? (yes/no)")
        user_choice = input().lower()
        if user_choice != "yes":
            flag = False


Enter a news topic or description:  economy related



Top 5 Recommended News:
-> BBC poll indicates economic gloom (Category: business, Similarity: 0.13)
    Citizens in a majority of nations surveyed in a BBC World Service poll believe the world economy is worsening.  Most respondents also said their national economy was getting worse. But when asked about their own family's financial outlook, a majority in 14 countries said they were positive about th...
-> IMF 'cuts' German growth estimate (Category: business, Similarity: 0.12)
    The International Monetary Fund is to cut its 2005 growth forecast for the German economy from 1.8% to 0.8%, the Financial Times Deutschland reported.  The IMF will also reduce its growth estimate for the 12-member eurozone economy from 2.2% to 1.6%, the newspaper reported. The German economy has b...
-> UK 'risks breaking golden rule' (Category: business, Similarity: 0.11)
    The UK government will have to raise taxes or rein in spending if it wants to avoid breaking its "golden rule", a report suggests. 

 no
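
The ranking step can also be tried without the interactive prompt. The sketch below is a minimal illustration on a hypothetical four-document corpus; the texts and the `rank_documents` helper are made up for this example and are not part of the notebook. It uses the same `TfidfVectorizer` and `cosine_similarity` calls, but with the default n-gram range and `min_df` so that the tiny corpus keeps its vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus standing in for df['content']
docs = [
    "The economy is slowing and growth forecasts were cut",
    "The striker scored twice in the football match",
    "The band released a new album and announced a tour",
    "Central bank raises interest rates as the economy overheats",
]

vectorizer = TfidfVectorizer(stop_words='english')
doc_matrix = vectorizer.fit_transform(docs)

def rank_documents(query, top_n=2):
    """Return (doc_index, score) pairs, highest similarity first."""
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, doc_matrix)[0]
    # argsort is ascending, so reverse it before taking the top_n positions
    top = scores.argsort()[::-1][:top_n]
    return [(int(i), float(scores[i])) for i in top]

ranked = rank_documents("economy growth")
for idx, score in ranked:
    print(f"{score:.2f}  {docs[idx]}")
```

Sorting in descending order inside the helper removes the need to reverse the results at print time.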


### Usage examples:

Example queries to enter at the prompt, grouped by category:

* Business:
  * "latest trends in the stock market"
  * "global trade agreements and their effects"
* Politics:
  * "current political events in the UK"
  * "the latest developments in international relations"
* Entertainment:
  * "upcoming award shows and nominations"
  * "interviews with popular musicians"
* Tech:
  * "cybersecurity threats and prevention"
  * "the impact of technology on education"
* Sport:
  * "latest football scores and highlights"
  * "news and analysis of the Premier League"

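Because the vectorizer is fitted on a 500-article sample with `min_df=2`, a query may share no terms with the fitted vocabulary. In that case every cosine similarity is 0 and the returned "top 5" is arbitrary. The sketch below shows one possible way to detect this before ranking. The toy corpus and the `has_vocabulary_overlap` helper are hypothetical and not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; in the notebook this would be df['content']
docs = [
    "Stock market trends and global trade agreements",
    "Premier League football scores and highlights",
]
vectorizer = TfidfVectorizer(stop_words='english')
vectorizer.fit(docs)

def has_vocabulary_overlap(query):
    """True if at least one query term is in the fitted vocabulary."""
    query_vec = vectorizer.transform([query])
    # A sparse row with no stored non-zero entries contains no known terms
    return query_vec.nnz > 0

for q in ["latest football scores", "quantum computing breakthroughs"]:
    if has_vocabulary_overlap(q):
        print(f"'{q}': at least one term matches the corpus vocabulary")
    else:
        print(f"'{q}': no known terms, similarity scores would all be 0")
```

In the notebook, the same check could be applied to `tf_idf_input` before calling `get_news`, with a message asking the user to rephrase the query.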
### This is the end, thank you.