# Machine Learning Challenge

**Done by:** Fateen Ahmed
**Email:** fateenahmed.2k@gmail.com

## Objectives

**Level 1: The Basics** - Develop a model that categorizes news articles into their respective categories.

**Level 2: The Intermediate** - Generate abstracts that summarize the articles clearly and concisely.

**Level 3: The Advanced** - Produce captions for each news article's image that accurately reflect the content.

**Level 4: The Mastery** - Implement a real-time UI web app for inference, facilitating user interaction.

**Level 5: The Hero** - Detect articles related to Palestine and categorize them under a new subcategory called "FreePalestine".


---


### Level 1: The Basics

In [16]:
# Initially the dataset is loaded from local machine.
import pandas as pd

training_data_path = '/Users/fateenahmed/Downloads/N24News 2/news/nytimes_train.json'
testing_data_path = '/Users/fateenahmed/Downloads/N24News 2/news/nytimes_test.json'

# Loading into dataframes
train_dataframe = pd.read_json(training_data_path)
test_dataframe = pd.read_json(testing_data_path)

# Displaying the first few rows
print(train_dataframe.head())

        section                                           headline  \
0       Theater  Before 'Moonlight' and 'The Walking Dead,' a F...   
1    Television  What's on TV Wednesday: 'Crip Camp' and 'Dark ...   
2        Sports  Rays Stick to Their Plan and Reach a 2nd World...   
3  Art & Design    For Robert Rauschenberg, No Artist Is an Island   
4       Theater  Jimmy Buffett's 'Margaritaville' Musical Sets ...   

                                         article_url  \
0  https://www.nytimes.com/2017/02/21/theater/dan...   
1  https://www.nytimes.com/2020/03/25/arts/televi...   
2  https://www.nytimes.com/2020/10/17/sports/base...   
3  https://www.nytimes.com/2017/05/11/arts/design...   
4  https://www.nytimes.com/2017/06/05/theater/jim...   

                                             article  \
0  Danai Gurira and Andre Holland in a theater at...   
1  CRIP CAMP: A DISABILITY REVOLUTION (2020) Stre...   
2  The Tampa Bay Rays told Charlie Morton it woul...   
3  We tend to thin

In [3]:
# Initial Analysis

# Checking the columns, distribution, and missing values.
print("Column names in the training dataset:", train_dataframe.columns.tolist())
print("\nDistribution of news categories in the training dataset:\n", train_dataframe['section'].value_counts())
print("\nMissing values in each column of the training dataset:\n", train_dataframe.isnull().sum())

Column names in the training dataset: ['section', 'headline', 'article_url', 'article', 'abstract', 'article_id', 'image', 'caption', 'image_id']

Distribution of news categories in the training dataset:
 Opinion            2437
Art & Design       2431
Television         2419
Music              2416
Travel             2413
Real Estate        2413
Books              2412
Health             2409
Theater            2409
Sports             2407
Science            2387
Fashion & Style    2385
Food               2385
Movies             2384
Technology         2376
Dance              2365
Media              2363
Style              2147
Automobiles        1456
Economy            1398
Your Money         1020
Global Business     955
Education           672
Well                529
Name: section, dtype: int64

Missing values in each column of the training dataset:
 section        0
headline       0
article_url    0
article        0
abstract       0
article_id     0
image          0
caption        

The next step is text preprocessing, which prepares the raw text for NLP tasks. The Natural Language Toolkit (`nltk`) library is used because it provides a ready-made set of text processing tools and is easy to use.

Text preprocessing tasks:

Converting all text to lowercase to ensure uniformity.

Removing URLs using the `re` library, as they do not contribute to the understanding of the text's content.

Eliminating punctuation and numbers, so that the analysis focuses on the words, which carry the semantic weight of the text.

Tokenizing with the `word_tokenize` function from `nltk`, which splits the cleaned text into individual words (tokens) that are manageable units for analysis.

Removing common words such as "and", "is", or "in" using the predefined stopword list from `nltk.corpus.stopwords`. These words are generally treated as noise in text analysis because they occur frequently across texts of all topics and provide little unique information about the content of any single document.

Joining the tokens that remain after stopword removal back into a single string, which gives a cleaned version of the text for further processing or analysis.

In [4]:
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import nltk

nltk.download('punkt')
nltk.download('stopwords')

english_stop_words = set(stopwords.words('english'))

def preprocess_text_article(text):
    text = text.lower()
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'[^a-z\s]', '', text)
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in english_stop_words]
    return ' '.join(tokens)

train_dataframe['combined_text'] = train_dataframe['headline'] + " " + train_dataframe['article']
train_dataframe['combined_text'] = train_dataframe['combined_text'].apply(preprocess_text_article)

print(train_dataframe['combined_text'].head())

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/fateenahmed/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/fateenahmed/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


0    moonlight walking dead friendship born classro...
1    whats tv wednesday crip camp dark phoenix crip...
2    rays stick plan reach nd world series tampa ba...
3    robert rauschenberg artist island tend think a...
4    jimmy buffetts margaritaville musical sets bro...
Name: combined_text, dtype: object
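
The output above shows side effects of the character filter, such as "what's" becoming "whats" and "2nd" becoming "nd". The following is a minimal sketch of the first three cleaning steps only (lowercasing, URL removal, character filtering), without NLTK, so the behaviour of the two regular expressions can be inspected in isolation. The function name `clean_text` and the sample string are made up for illustration.

```python
import re

def clean_text(text):
    """Lowercase, strip URLs, then keep only a-z letters and whitespace."""
    text = text.lower()
    text = re.sub(r'http\S+', '', text)    # drop anything that starts with http
    text = re.sub(r'[^a-z\s]', '', text)   # drop digits, punctuation, accented letters
    return text

# Made-up sample in the style of the dataset's headlines.
sample = "What's on TV: the Rays reach a 2nd World Series https://www.nytimes.com/x"
cleaned = clean_text(sample)
print(cleaned.split())
```

Because the filter deletes characters instead of replacing them with spaces, contractions and ordinals are fused or shortened. Whether that matters for classification is an open question that this sketch does not answer.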


For modeling, TF-IDF (Term Frequency-Inverse Document Frequency) is applied via `TfidfVectorizer` to convert the combined text into a matrix of numerical features, keeping only the 10,000 most frequent terms to balance computational efficiency with feature representation. This transformation lets the ML algorithms process and learn from textual data. After feature extraction, the dataset is split into training and validation sets using `train_test_split`, with 20% of the data held out for validation. The split allows the models to be trained on one subset of the data and evaluated on another.

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_features=10000)
X_features = tfidf_vectorizer.fit_transform(train_dataframe['combined_text'])
y_labels = train_dataframe['section']

X_train_features, X_validation_features, y_train_labels, y_validation_labels = train_test_split(
    X_features, y_labels, test_size=0.2, random_state=50)
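
The cell above fits the vectorizer on the full dataset before splitting, so the validation articles also contribute to the vocabulary and IDF weights. The sketch below shows one possible alternative on a made-up toy corpus: split the raw texts first, fit the vectorizer on the training part only, and reuse it to transform the validation part. All variable names and the toy data are assumptions for illustration, and `max_features=5` stands in for the 10,000 used above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Made-up toy corpus; the real notebook uses train_dataframe['combined_text'].
toy_corpus = [
    "rays win world series game",
    "museum opens new art show",
    "rays lose game in boston",
    "art show draws museum crowd",
    "new musical opens on broadway",
]
toy_labels = ["Sports", "Art & Design", "Sports", "Art & Design", "Theater"]

# Split the raw texts first, then fit TF-IDF on the training texts only.
train_texts, val_texts, y_tr, y_val = train_test_split(
    toy_corpus, toy_labels, test_size=0.2, random_state=50)

toy_vectorizer = TfidfVectorizer(max_features=5)
X_tr = toy_vectorizer.fit_transform(train_texts)  # learns vocabulary and IDF
X_val = toy_vectorizer.transform(val_texts)       # reuses them unchanged

print(sorted(toy_vectorizer.vocabulary_))
print(X_tr.shape, X_val.shape)
```

With `max_features` set, only the terms with the highest corpus-wide frequency are kept, so both matrices have exactly five columns here.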

To classify news articles into their respective categories, the Logistic Regression model from `sklearn.linear_model` is used. Logistic Regression is well-suited for multiclass classification tasks and offers a good balance between simplicity and performance for text classification.

First, the Logistic Regression model is initialised with an increased `max_iter` parameter to help it converge on this large, high-dimensional dataset. The model is then trained (`fit`) on the training set, which consists of the TF-IDF features (`X_train_features`) and their corresponding labels (`y_train_labels`).

After training, the model is used to predict the categories of articles in the validation set (`X_validation_features`). The performance of the model is evaluated using the `classification_report` from `sklearn.metrics`, which provides key metrics such as precision, recall, and F1-score for each category. This evaluation helps us understand how well the model can generalize to unseen data.

In [9]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report


logistic_regression_model = LogisticRegression(max_iter=1000)
logistic_regression_model.fit(X_train_features, y_train_labels)

y_predicted_labels = logistic_regression_model.predict(X_validation_features)

print(classification_report(y_validation_labels, y_predicted_labels))

                 precision    recall  f1-score   support

   Art & Design       0.86      0.91      0.88       474
    Automobiles       0.94      0.93      0.94       259
          Books       0.89      0.91      0.90       518
          Dance       0.97      0.94      0.96       478
        Economy       0.90      0.84      0.87       273
      Education       0.79      0.81      0.80       128
Fashion & Style       0.76      0.71      0.73       485
           Food       0.87      0.92      0.89       478
Global Business       0.85      0.80      0.83       184
         Health       0.83      0.92      0.87       513
          Media       0.83      0.85      0.84       474
         Movies       0.83      0.90      0.86       480
          Music       0.93      0.92      0.92       471
        Opinion       0.87      0.87      0.87       520
    Real Estate       0.92      0.93      0.92       452
        Science       0.88      0.88      0.88       461
         Sports       0.94    

The model achieved an overall accuracy of 86% in classifying news articles into their respective categories, with strong performance across most of the 24 categories. Categories like 'Dance', 'Sports', and 'Real Estate' were among the top performers with F1-scores of 0.92 or higher. However, categories such as 'Style' and 'Well' had lower F1-scores, which suggests room for improvement.
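
For readers who want the headline numbers without scanning the full report, the sketch below shows how overall accuracy and per-class F1 can be computed directly with `sklearn.metrics`. The label lists are made up for illustration and are not taken from the validation set; in the notebook they would be `y_validation_labels` and `y_predicted_labels`.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels, chosen so that some classes are confused with their neighbours.
y_true = ["Dance", "Sports", "Style", "Style", "Well", "Dance"]
y_pred = ["Dance", "Sports", "Fashion & Style", "Style", "Health", "Dance"]

accuracy = accuracy_score(y_true, y_pred)

# average=None returns one F1 value per label, in the order given by `labels`.
class_names = sorted(set(y_true))
per_class_f1 = dict(zip(
    class_names,
    f1_score(y_true, y_pred, labels=class_names, average=None, zero_division=0),
))

print(f"accuracy: {accuracy:.2f}")
for name, score in per_class_f1.items():
    print(f"{name}: {score:.2f}")
```

Sorting the resulting dictionary by value is a quick way to list the weakest categories first.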

In [15]:
# Option 2 - Naive Bayes Classifier

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(train_dataframe['combined_text'], train_dataframe['section'], test_size=0.2, random_state=42)

model = make_pipeline(TfidfVectorizer(), MultinomialNB())

model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred, zero_division=0))

                 precision    recall  f1-score   support

   Art & Design       0.81      0.91      0.85       489
    Automobiles       0.98      0.77      0.86       266
          Books       0.73      0.80      0.76       476
          Dance       0.98      0.94      0.96       499
        Economy       0.71      0.64      0.67       286
      Education       0.00      0.00      0.00       132
Fashion & Style       0.69      0.75      0.72       465
           Food       0.91      0.80      0.85       450
Global Business       0.95      0.21      0.35       192
         Health       0.62      0.91      0.74       458
          Media       0.78      0.83      0.80       456
         Movies       0.93      0.43      0.59       469
          Music       0.90      0.90      0.90       516
        Opinion       0.41      0.85      0.55       478
    Real Estate       0.75      0.95      0.84       493
        Science       0.95      0.75      0.84       498
         Sports       0.90    

### Level 2: The Intermediate

Next, TF-IDF (Term Frequency-Inverse Document Frequency) vectorization and cosine similarity are used to identify the most informative sentences in an article and compile them into a concise summary. The article is first split into individual sentences, which are then transformed into a matrix of TF-IDF features. The cosine similarity between every pair of sentence vectors is computed, and each sentence is scored by the sum of its similarities to all sentences, on the assumption that sentences with higher overall similarity better capture the main themes of the article. The sentences are ranked by this score, and the top ones are selected to form the summary, with the number of sentences set by the `num_sentences` parameter.

In [13]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import nltk

def summarize_article_tf_idf(article_text, num_sentences=5):

    sentences = nltk.sent_tokenize(article_text)
    
    tfidf_vectorizer = TfidfVectorizer()
    sentence_tfidf_vectors = tfidf_vectorizer.fit_transform(sentences)
    
    sentence_cosine_similarity = cosine_similarity(sentence_tfidf_vectors)
    
    sentence_scores = sentence_cosine_similarity.sum(axis=1)
    
    ranked_sentence_indices = np.argsort(sentence_scores, axis=0)[::-1]
    top_sentences = [sentences[index] for index in ranked_sentence_indices.flatten()][:num_sentences]
    
    article_summary = ' '.join(top_sentences)
    return article_summary

# Example case
example_text = train_dataframe.iloc[3]['article']  # Ensure the column name matches
print("Original Article:\n", example_text)
print("\nSummarized Article:\n", summarize_article_tf_idf(example_text))

Original Article:
 We tend to think of artists as natural loners, off in their studios, wrestling with their inner selves. But "Robert Rauschenberg: Among Friends," which opens at the Museum of Modern Art on May 21, points us in a different direction. It situates Rauschenberg's work amid that of two dozen fellow artists who provided an audience for one another in New York City in the '50s and '60s, the years of bohemia's final flourish. Not that exchanges between artists are ever simple. Rauschenberg's "Bed," for instance, is a landmark painting from 1955 that stands about six feet tall, with a stapled-on pillow, a cotton sheet splotched with red and yellow, and rivulets of white pigment dripping onto a patchwork quilt. Like any true masterpiece, "Bed" can support multiple readings. You can see it, for starters, as a brilliant sendup of the overt emotionalism of Abstract Expressionism and say that it puts '50s-style paint-slinging to bed.

In his defense, Rauschenberg, who died in 2008
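
`summarize_article_tf_idf` joins the selected sentences in rank order, so the summary can read out of sequence. The sketch below is one possible variation, not the method used above: it scores sentences the same way but re-sorts the chosen indices so the summary follows the article's order. It takes a list of already-split sentences to stay independent of NLTK; the function name and the example sentences are made up.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def summarize_in_article_order(sentences, num_sentences=2):
    """Pick the highest-scoring sentences, then restore their original order."""
    tfidf = TfidfVectorizer().fit_transform(sentences)
    scores = cosine_similarity(tfidf).sum(axis=1)         # one score per sentence
    top_indices = np.argsort(scores)[::-1][:num_sentences]
    return ' '.join(sentences[i] for i in sorted(top_indices))

# Made-up sentences: the first and third overlap heavily, the others do not.
toy_sentences = [
    "Rauschenberg made paintings with friends in New York.",
    "The weather was mild.",
    "Friends in New York shaped the paintings Rauschenberg made.",
    "Lunch is served at noon.",
]
toy_summary = summarize_in_article_order(toy_sentences, num_sentences=2)
print(toy_summary)
```

The only change from the ranked version is the `sorted(top_indices)` call, so the two can be swapped freely when comparing summaries.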

Next, a spaCy model is used to detect named entities in the text. The function below iterates through each sentence and checks whether it contains any of the named entities identified by spaCy. If so, the sentence's score is incremented by one, so sentences containing named entities are treated as more important for the summary. This approach is based on the intuition that sentences with named entities tend to carry more informational weight, which makes them good candidates for inclusion in the summary.

In [17]:
import spacy

nlp_model = spacy.load("en_core_web_sm")

def boost_scores_with_named_entities(sentences, named_entity_list, sentence_scores):
    for index, sentence in enumerate(sentences):
       
        if any(entity.text in sentence for entity in named_entity_list):
            sentence_scores[index] += 1
    
    return sentence_scores
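
To show the effect of the boost without loading a spaCy model, the sketch below repeats the function and calls it with a hypothetical `Entity` stand-in that only exposes a `.text` attribute, which is all the function reads from a spaCy entity. The sentences, entities, and starting scores are made up.

```python
from collections import namedtuple
import numpy as np

# Hypothetical stand-in for a spaCy entity span; only `.text` is used.
Entity = namedtuple("Entity", ["text"])

def boost_scores_with_named_entities(sentences, named_entity_list, sentence_scores):
    for index, sentence in enumerate(sentences):
        # Case-sensitive substring match against each entity's text.
        if any(entity.text in sentence for entity in named_entity_list):
            sentence_scores[index] += 1
    return sentence_scores

toy_sentences = [
    "Rauschenberg died in 2008.",
    "Artists are natural loners.",
    "The show opens at the Museum of Modern Art.",
]
toy_entities = [Entity("Museum of Modern Art"), Entity("Rauschenberg")]
base_scores = np.array([0.5, 0.5, 0.5])

# Pass a copy because the function updates the score array in place.
boosted = boost_scores_with_named_entities(toy_sentences, toy_entities, base_scores.copy())
print(boosted)
```

Two properties are worth noting: a sentence gets the same +1 whether it mentions one entity or several, and the array passed in is modified in place.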

Each sentence is transformed into a vector representation by averaging the vectors of its tokens. The function processes the input sentence with the loaded spaCy model to produce a `Doc` object, then averages the vectors of the tokens that have one. If the sentence is empty or has no tokens with vectors, it returns a zero vector of a predefined length. Note that `en_core_web_sm` does not ship static word vectors and only provides context-sensitive tensors; a larger model such as `en_core_web_md` would be needed for true word vectors.

In [18]:
import spacy
import numpy as np

nlp_model = spacy.load("en_core_web_sm")

def compute_sentence_vector(sentence_text):
    doc = nlp_model(sentence_text)
    
    # Collect the vectors first so an empty sentence never reaches np.mean,
    # which would return NaN and emit a runtime warning.
    token_vectors = [token.vector for token in doc if token.has_vector]

    if not token_vectors:
        return np.zeros((nlp_model.vocab.vectors_length,))

    return np.mean(token_vectors, axis=0)
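
The averaging step can be illustrated without spaCy. The sketch below uses a made-up three-dimensional lookup table in place of a real model's vectors; `toy_vectors`, `VECTOR_DIM`, and `mean_pool` are hypothetical names introduced only for this example.

```python
import numpy as np

# Made-up 3-dimensional "word vectors" standing in for a real model's table.
toy_vectors = {
    "art": np.array([1.0, 0.0, 0.0]),
    "museum": np.array([0.0, 1.0, 0.0]),
    "show": np.array([0.0, 0.0, 1.0]),
}
VECTOR_DIM = 3

def mean_pool(tokens, vectors=toy_vectors, dim=VECTOR_DIM):
    """Average the vectors of known tokens; fall back to zeros if none are known."""
    found = [vectors[token] for token in tokens if token in vectors]
    if not found:
        return np.zeros(dim)
    return np.mean(found, axis=0)

pooled = mean_pool(["art", "museum", "unknown"])  # "unknown" is skipped
empty = mean_pool([])                             # no tokens at all
print(pooled, empty)
```

Keeping the fallback vector the same length as the real vectors matters, because the sentence vectors are later stacked into one array for `cosine_similarity`.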

The summarizer is then updated to combine both ideas: sentences are scored by the cosine similarity of their spaCy sentence vectors, and the scores are boosted for sentences that contain named entities.

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.tokenize import sent_tokenize
import spacy
import numpy as np

nlp = spacy.load("en_core_web_sm")

def enhanced_summarize_article(article_text, num_sentences=5):
    sentences = sent_tokenize(article_text)
    doc = nlp(article_text)
    named_entities = doc.ents

    enhanced_vectors = np.array([compute_sentence_vector(sentence) for sentence in sentences])

    cosine_matrix = cosine_similarity(enhanced_vectors)

    scores = cosine_matrix.sum(axis=1)

    scores = boost_scores_with_named_entities(sentences, named_entities, scores)

    ranked_sentences = [sentences[i] for i in np.argsort(scores, axis=0)[::-1]]
    summary = ' '.join(ranked_sentences[:num_sentences])
    return summary

# Example case
example_article = train_dataframe.iloc[1]['article']
print("Original Article:\n", example_article)
print("\nSummarized Article:\n", enhanced_summarize_article(example_article, num_sentences=5))

Original Article:
 CRIP CAMP: A DISABILITY REVOLUTION (2020) Stream on Netflix. This documentary, the latest offering from Barack Obama and Michelle Obama's production company, draws a direct line between a Catskills summer camp and the American disability rights movement of the 1970s. Directed by Jim LeBrecht and Nicole Newnham, the film begins by focusing on Camp Jened, which was founded in the early 1950s and served as a community for campers with disabilities. But it eventually shifts focus to look at the adult lives of some of the camp's alumni, several of whom became prominent activists. In his review for The New York Times, Ben Kenigsberg wrote that the film "unfolds from a perspective of lived experience." Newnham and LeBrecht, he added, "deftly juggle a large cast of characters past and present, accomplishing the not-so-easy task of making all the personalities distinct."

DARK PHOENIX (2019) 9 p.m. on HBO. In an interview with The Times last year, the actress Sophie Turner di

In [17]:
import spacy
import numpy as np
from nltk.tokenize import sent_tokenize

# Assuming nlp has been loaded with spaCy and necessary functions defined:
nlp_model = spacy.load("en_core_web_sm")

def advanced_summarize_article(article_text, num_sentences=3):
    doc = nlp_model(article_text)
    sentences = [str(sent) for sent in list(doc.sents)]
    sentence_ranks = {}

    # Extract text of named entities for checking their presence in sentences
    named_entities_texts = {ent.text.lower() for ent in doc.ents}

    for i, sent in enumerate(sentences):
        sentence_ranks[i] = 0
        for token in nlp_model(sent):
            # Check if token's lowercased text is a named entity or if token is a noun/verb
            if token.text.lower() in named_entities_texts or token.pos_ in ('NOUN', 'VERB'):
                sentence_ranks[i] += 1

    # Rank sentences based on their scores
    ranked_sentences = sorted(sentence_ranks.keys(), key=lambda k: sentence_ranks[k], reverse=True)
    summary_sentences = [sentences[i] for i in ranked_sentences[:num_sentences]]

    # Join the selected sentences to form the summary
    summary = ' '.join(summary_sentences)
    return summary

# Example case, ensure you replace with an actual article text or DataFrame access
example_article = train_dataframe.iloc[1]['article']
print("Summarized Article:", advanced_summarize_article(example_article, num_sentences=3))

Summarized Article: In an interview with The Times last year, the actress Sophie Turner discussed the moment when Simon Kinberg, the writer and director of "Dark Phoenix," made clear how much the movie would rely on Turner's performance. For the most part, Manohla Dargis wrote in her review for The Times, Kinberg "just moves characters from point A to B, pausing for face-to-face heart to hearts before the next blowout." This documentary, the latest offering from Barack Obama and Michelle Obama's production company, draws a direct line between a Catskills summer camp and the American disability rights movement of the 1970s.


### Level 3: The Advanced

The next task is to generate captions for images associated with articles. This is done by combining the article's headline, abstract, and body into a single text, then using the spaCy NLP library to identify named entities, nouns, and verbs within this text, extracting these elements as key information for caption creation. Depending on whether an existing caption is present, the code either generates a new caption by incorporating the first identified noun and verb or improves the existing one by appending up to three named entities.

In [11]:
import spacy
import pandas as pd
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


nlp = spacy.load("en_core_web_sm")

def combine_textual_data(headline, abstract, article_body):
    return f"{headline} {abstract} {article_body}"

def extract_key_information_for_captions(text):
    doc = nlp(text)
    entities = [ent.text for ent in doc.ents if not ent.text.isspace()]  # Named entities
    nouns = [token.text for token in doc if token.pos_ == 'NOUN' and not token.text.isspace()]  # Nouns
    verbs = [token.text for token in doc if token.pos_ == 'VERB' and not token.text.isspace()]  # Verbs
    return entities, nouns, verbs

def generate_or_improve_caption(existing_caption, entities, nouns, verbs):
    if existing_caption:
        return f"{existing_caption} Featuring {', '.join(entities[:3])}." if entities else existing_caption
    else:
        return f"Captured moment of {nouns[0]} in action with {verbs[0]}." if nouns and verbs else "Image related to the article."

def process_dataset_for_captions(df):
    texts = [combine_textual_data(row['headline'], row['abstract'], row['article']) for _, row in df.iterrows()]

    entities_info = []
    # Keep the tagger enabled: token.pos_ is needed for the noun and verb lists below.
    for doc in nlp.pipe(texts, disable=["parser"], batch_size=100):
        entities = [ent.text for ent in doc.ents if not ent.text.isspace()]
        nouns = [token.text for token in doc if token.pos_ == 'NOUN' and not token.text.isspace()]
        verbs = [token.text for token in doc if token.pos_ == 'VERB' and not token.text.isspace()]
        entities_info.append((entities, nouns, verbs))
    
    for index, (entities, nouns, verbs) in enumerate(entities_info):
        new_caption = generate_or_improve_caption(df.at[index, 'caption'], entities, nouns, verbs)
        df.at[index, 'generated_caption'] = new_caption

    return df

enhanced_df = process_dataset_for_captions(train_dataframe)
print(enhanced_df[['image', 'caption', 'generated_caption']].head()) 



                                               image  \
0  https://static01.nyt.com/images/2017/02/22/art...   
1  https://static01.nyt.com/images/2020/03/25/art...   
2  https://static01.nyt.com/images/2020/10/19/spo...   
3  https://static01.nyt.com/images/2017/05/14/art...   
4  https://static01.nyt.com/images/2017/06/06/art...   

                                             caption  \
0  Danai Gurira and André Holland in a theater at...   
1  Judy Heumann in &ldquo;Crip Camp: A Disability...   
2  The Rays celebrated after the final out of the...   
3  Robert Rauschenberg performing in "Pelican" in...   
4  Jimmy Buffett, performing in 2016. His musical...   

                                   generated_caption  
0  Danai Gurira and André Holland in a theater at...  
1  Judy Heumann in &ldquo;Crip Camp: A Disability...  
2  The Rays celebrated after the final out of the...  
3  Robert Rauschenberg performing in "Pelican" in...  
4  Jimmy Buffett, performing in 2016. His musical..

In [20]:
import pandas as pd
import spacy
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
import nltk
from nltk.corpus import stopwords

# Load spaCy model and NLTK data
nlp = spacy.load("en_core_web_sm")
nltk.download('punkt')
nltk.download('stopwords')

# Set of English stopwords
stop_words = set(stopwords.words('english'))

def extract_keywords(text):
    # Use spaCy for named entities
    doc = nlp(text)
    entities = [ent.text for ent in doc.ents if ent.text.lower() not in stop_words]
    
    # Use NLTK for frequent terms, excluding stopwords and non-alphabetic words
    words = word_tokenize(text.lower())
    common_terms = [word for word in words if word.isalpha() and word not in stop_words]
    
    # Frequency distribution of common terms, excluding entities
    fdist = FreqDist(common_terms)
    top_common_terms = [word for word, freq in fdist.most_common(3) if word not in entities]
    
    # Combine and deduplicate
    keywords = list(set(entities + top_common_terms))
    return keywords

def generate_enhanced_caption(original_caption, max_length=100):
    keywords = extract_keywords(original_caption)
    if not keywords:
        # If no keywords are extracted, return the original or a shortened version
        return original_caption if len(original_caption) <= max_length else original_caption[:max_length-3] + "..."
    
    # Integrate keywords more contextually
    enhanced_portion = ', '.join(keywords[:3])
    new_caption = f"{enhanced_portion}. {original_caption}"
    
    # Shorten the caption if it exceeds the maximum length, trying to preserve the enhanced portion
    if len(new_caption) > max_length:
        cut_off_point = max_length - len(enhanced_portion) - 15
        original_caption_short = original_caption[:cut_off_point] + "..." if len(original_caption) > cut_off_point else original_caption
        new_caption = f"{enhanced_portion}. {original_caption_short}"
    
    return new_caption

# Assuming 'train_dataframe' is already loaded with the dataset
# Applying the enhanced caption generation to a random subset of 100 rows
subset_df = train_dataframe.sample(n=100, random_state=42)
subset_df['generated_caption'] = subset_df['caption'].apply(generate_enhanced_caption)

# Displaying the original and generated captions for review
print(subset_df[['caption', 'generated_caption']].head())



                                                 caption  \
37421  View of Emerald Bay in Lake Tahoe from Inspira...   
40743  A metallic bronze ch&egrave;vre leather Herm&e...   
48227  The Swedish Academy's formal gathering at the ...   
47135  A Spanish protester criticizes Swiss requests ...   
46191  Jessica Grose, at work in her home, is the edi...   

                                       generated_caption  
37421  bay, Lake Tahoe, Inspiration Point. View of Em...  
40743  metallic, egrave, bronze. A metallic bronze ch...  
48227  formal, Academy, Stockholm. The Swedish Academ...  
47135  protester, Spain, criticizes. A Spanish protes...  
46191  weekly, grose, NYT Parenting. Jessica Grose, a...  


To enhance the image captions, named entities and common terms are extracted from the text and incorporated into the original captions. As before, spaCy identifies the named entities (excluding any that are stopwords), and NLTK tokenizes the text and finds the most frequent terms, excluding stopwords and non-alphabetic tokens.
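
The sample output above contains "egrave" as a keyword, which comes from the HTML entity `&egrave;` in the raw caption. The sketch below shows one possible way to avoid that: decode HTML entities with the standard library's `html.unescape` before counting terms. It uses `collections.Counter` and a tiny made-up stopword set in place of NLTK, and the caption is a made-up example in the style of the dataset; `top_terms` and `TOY_STOP_WORDS` are hypothetical names.

```python
import html
import re
from collections import Counter

# Hypothetical, deliberately small stopword list for this sketch only.
TOY_STOP_WORDS = {"a", "the", "in", "of", "and", "at", "is", "from"}

def top_terms(caption, n=3):
    """Return the n most frequent non-stopword terms after decoding HTML entities."""
    text = html.unescape(caption).lower()     # "&egrave;" becomes "è"
    words = re.findall(r"[^\W\d_]+", text)    # runs of letters, accents included
    words = [word for word in words if word not in TOY_STOP_WORDS]
    return [word for word, _ in Counter(words).most_common(n)]

# Made-up caption containing an HTML entity.
raw_caption = "A metallic bronze ch&egrave;vre leather bag, the bronze bag of the season."
print(top_terms(raw_caption))
```

The same `html.unescape` call could be applied to the `caption` column once, before any keyword extraction, so that entities such as `&ldquo;` are decoded in the displayed captions as well.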

In [21]:
import pandas as pd
import spacy
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
import nltk
from nltk.corpus import stopwords


nltk.download('punkt')
nltk.download('stopwords')
nlp = spacy.load("en_core_web_sm")
stop_words = set(stopwords.words('english'))

def extract_keywords(text):
    doc = nlp(text)
    entities = [ent.text for ent in doc.ents if ent.text.lower() not in stop_words]
    words = word_tokenize(text.lower())
    common_terms = [word for word in words if word.isalpha() and word not in stop_words]
    fdist = FreqDist(common_terms)
    top_common_terms = [word for word, freq in fdist.most_common(3) if word not in entities]
    keywords = list(set(entities + top_common_terms))
    return keywords

def generate_enhanced_caption(row):
    
    consolidated_text = f"{row['headline']}. {row['abstract']}. {row['article']}"
    article_keywords = extract_keywords(consolidated_text)

    caption_keywords = extract_keywords(row['caption']) if pd.notna(row['caption']) else []
    
    combined_keywords = list(dict.fromkeys(article_keywords + caption_keywords))[:5]
    
    enhanced_portion = ', '.join(combined_keywords)
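    # NaN is truthy in Python, so check for a usable caption explicitly.
    has_caption = pd.notna(row['caption']) and bool(row['caption'])
    new_caption = f"{enhanced_portion}. {row['caption']}" if has_caption else f"This image relates to {enhanced_portion}."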

    return new_caption[:100]

subset_df = train_dataframe.head(100).copy() 

subset_df['generated_caption'] = subset_df.apply(generate_enhanced_caption, axis=1)

print(subset_df[['caption', 'generated_caption']].head())


                                             caption  \
0  Danai Gurira and André Holland in a theater at...   
1  Judy Heumann in &ldquo;Crip Camp: A Disability...   
2  The Rays celebrated after the final out of the...   
3  Robert Rauschenberg performing in "Pelican" in...   
4  Jimmy Buffett, performing in 2016. His musical...   

                                   generated_caption  
0  third-years', one, The Walking Dead, A few yea...  
1  Crip Camp', FXM, China, 2019, the Shaolin Temp...  
2  Milwaukee, Roy Halladay, Boston, Saturday, Bra...  
3  1954, Brooklyn, John Cage, Bob, Charles Atlas....  
4  the Marquis Theater, buffett, Jimmy Buffett's,...  


### Level 4: The Mastery

In [20]:
import gradio as gr
import pandas as pd
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics.pairwise import cosine_similarity  # used by enhanced_summarize_article
from nltk.tokenize import word_tokenize, sent_tokenize
import numpy as np
import nltk
from nltk.corpus import stopwords
import re

nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)

# Initialize spaCy for English language
nlp = spacy.load("en_core_web_sm")

# Initialize the TF-IDF Vectorizer and Logistic Regression model
tfidf_vectorizer = TfidfVectorizer(max_features=10000)
logistic_regression_model = LogisticRegression(max_iter=1000)

def preprocess_text_article(text):
    text = text.lower()
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'[^a-z\s]', '', text)
    tokens = word_tokenize(text)
    english_stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in english_stop_words]
    return ' '.join(tokens)

def compute_sentence_vector(sentence):
    doc = nlp(sentence)
    sentence_vector = np.mean([token.vector for token in doc if token.has_vector], axis=0)
    if np.isnan(sentence_vector).any():
        return np.zeros((nlp.vocab.vectors_length,))
    return sentence_vector

def boost_scores_with_named_entities(sentences, named_entities, scores):
    named_entity_texts = {ent.text for ent in named_entities}
    for i, sentence in enumerate(sentences):
        if any(ent in sentence for ent in named_entity_texts):
            scores[i] += 1
    return scores

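def enhanced_summarize_article(article_text, num_sentences=5):
    """Extractive summary: rank sentences by similarity to the rest, boosted by named entities."""
    sentences = sent_tokenize(article_text)
    if not sentences:
        return ''  # nothing to summarize; cosine_similarity would fail on empty input
    doc = nlp(article_text)
    named_entities = doc.ents
    enhanced_vectors = np.array([compute_sentence_vector(sentence) for sentence in sentences])
    # Score each sentence by its summed cosine similarity to every sentence
    cosine_matrix = cosine_similarity(enhanced_vectors)
    scores = cosine_matrix.sum(axis=1)
    scores = boost_scores_with_named_entities(sentences, named_entities, scores)
    # Highest-scoring sentences first
    ranked_sentences = [sentences[i] for i in np.argsort(scores, axis=0)[::-1]]
    summary = ' '.join(ranked_sentences[:num_sentences])
    return summary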

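def extract_keywords(text):
    """Return unique alphabetic, non-stop-word tokens in order of first appearance."""
    doc = nlp(text)
    tokens = (token.text.lower() for token in doc if token.is_alpha and not token.is_stop)
    # dict.fromkeys de-duplicates while keeping order, so keywords[:5] is deterministic
    return list(dict.fromkeys(tokens))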

def load_data():
    training_data_path = '/Users/fateenahmed/Downloads/N24News 2/news/nytimes_train.json'  # Update this path
    train_dataframe = pd.read_json(training_data_path)
    train_dataframe['combined_text'] = train_dataframe['headline'] + " " + train_dataframe['article']
    train_dataframe['preprocessed_text'] = train_dataframe['combined_text'].apply(preprocess_text_article)
    return train_dataframe['preprocessed_text'], train_dataframe['section']

def train_model():
    global tfidf_vectorizer, logistic_regression_model
    X, y = load_data()
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
    logistic_regression_model.fit(X_train_tfidf, y_train)
    
    # Evaluate on the held-out 20% split of the training data (not nytimes_test.json)
    X_test_tfidf = tfidf_vectorizer.transform(X_test)
    y_pred = logistic_regression_model.predict(X_test_tfidf)
    print(classification_report(y_test, y_pred))

# Call train_model() to ensure models are trained and vectorizer is fitted
train_model()

def predict_category(article_body):
    processed_text = preprocess_text_article(article_body)
    tfidf_features = tfidf_vectorizer.transform([processed_text])
    category = logistic_regression_model.predict(tfidf_features)[0]
    return str(category)

def generate_abstract(article_body):
    return enhanced_summarize_article(article_body, num_sentences=5)

def generate_enhanced_caption(article_title, article_body, image_caption):
    combined_text = f"{article_title} {article_body}"
    keywords = extract_keywords(combined_text + " " + image_caption)
    enhanced_caption = ", ".join(keywords[:5])
    return enhanced_caption if enhanced_caption else image_caption

def process_article(article_title, article_body, image_caption_text=""):
    category = predict_category(article_body)
    abstract = generate_abstract(article_body)
    enhanced_caption = generate_enhanced_caption(article_title, article_body, image_caption_text)
    return category, abstract, enhanced_caption

interface = gr.Interface(
    fn=process_article,
    inputs=[
        gr.Textbox(label="Article Title", placeholder="Enter Article Title Here..."),
        gr.Textbox(label="Article Body", placeholder="Enter Article Body Here...", lines=7),
        gr.Textbox(label="Image Caption", placeholder="Enter Image Caption Here...")
    ],
    outputs=[
        gr.Text(label="Predicted Category"),
        gr.Text(label="Generated Abstract"),
        gr.Text(label="Enhanced Caption")
    ],
    title="Ahmed's News Article Processing App",
    description="This app predicts the category of news articles, generates abstracts, and enhances captions for images based on the provided image caption text."
)

if __name__ == "__main__":
    interface.launch()
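
The summarizer in the cell above scores each sentence by its summed cosine similarity to all sentences and then adds 1 to any sentence that contains a named-entity string. The sketch below reproduces that scoring in plain Python so the effect of the boost is easy to see. The helper names (`cosine`, `rank_sentences`) and the two-dimensional toy vectors are assumptions made for illustration; in the notebook the vectors come from spaCy and the similarity matrix from scikit-learn.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_sentences(sentences, vectors, entity_texts):
    """Return sentence indices sorted from highest to lowest score."""
    scores = []
    for i, sentence in enumerate(sentences):
        # Centrality: summed similarity to every sentence, itself included
        score = sum(cosine(vectors[i], other) for other in vectors)
        # Entity boost: +1 if any entity string occurs in the sentence
        if any(entity in sentence for entity in entity_texts):
            score += 1
        scores.append(score)
    return sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)

# Toy, hand-made inputs (hypothetical; real vectors would come from spaCy)
toy_sentences = [
    "Jimmy Buffett opens a musical.",
    "The show is set on an island.",
    "Tickets go on sale soon.",
]
toy_vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
toy_entities = {"Jimmy Buffett"}

order = rank_sentences(toy_sentences, toy_vectors, toy_entities)
print([toy_sentences[i] for i in order])
```

Without the entity set, the middle sentence ranks first because it is the most similar to the other two; with it, the sentence naming the entity moves to the top. A flat +1 is large relative to similarity sums on short articles, so the boost can dominate the ranking there.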


[nltk_data] Error loading punkt: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:997)>
[nltk_data] Error loading stopwords: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:997)>
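
The log above shows both NLTK downloads failing on SSL certificate verification, yet the cell still goes on to print a classification report, which suggests `punkt` and `stopwords` were already present on disk. A minimal sketch of checking for the resources explicitly, rather than relying on a silent download, is shown below. The helper `find_missing_resources` is hypothetical; it takes the lookup function as an argument so it can be exercised without NLTK installed, and in the notebook that function would be `nltk.data.find`, which raises `LookupError` for a resource it cannot locate.

```python
def find_missing_resources(resource_paths, finder):
    """Return the resource paths for which `finder` raises LookupError."""
    missing = []
    for path in resource_paths:
        try:
            finder(path)
        except LookupError:
            missing.append(path)
    return missing

# Resource paths for the two packages the cell tries to download
REQUIRED_NLTK_RESOURCES = ["tokenizers/punkt", "corpora/stopwords"]

# Intended use in the notebook (assumes nltk is installed):
#   import nltk
#   missing = find_missing_resources(REQUIRED_NLTK_RESOURCES, nltk.data.find)
#   if missing:
#       raise RuntimeError(f"NLTK data not available locally: {missing}")
```

Failing early with the list of missing resources gives a clearer message than the `LookupError` that would otherwise surface later, inside `word_tokenize` or `stopwords.words`.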


                 precision    recall  f1-score   support

   Art & Design       0.89      0.92      0.90       489
    Automobiles       0.93      0.95      0.94       266
          Books       0.88      0.89      0.89       476
          Dance       0.97      0.95      0.96       499
        Economy       0.88      0.82      0.85       286
      Education       0.82      0.85      0.83       132
Fashion & Style       0.72      0.73      0.73       465
           Food       0.87      0.94      0.90       450
Global Business       0.87      0.84      0.85       192
         Health       0.80      0.88      0.84       458
          Media       0.82      0.89      0.85       456
         Movies       0.81      0.91      0.86       469
          Music       0.92      0.91      0.91       516
        Opinion       0.86      0.86      0.86       478
    Real Estate       0.92      0.93      0.93       493
        Science       0.88      0.88      0.88       498
         Sports       0.95    