## Step 1: Cleaning and Tokenization

pip install spacy

python -m spacy download en_core_web_sm

pip install pandas

The script loads the dataset from an Excel file, strips punctuation, lemmatizes the text with spaCy while dropping stop words, and stores the result in a new DataFrame column, `Cleaned Article`.

In [19]:
import pandas as pd
import re
import spacy


nlp = spacy.load('en_core_web_sm')

# Load the dataset
df = pd.read_excel('Assignment.xlsx')

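def clean_and_tokenize(article):
    # Remove punctuation: anything that is not a word character or whitespace
    article = re.sub(r'[^\w\s]', '', article)
    # Collapse runs of whitespace into a single space
    article = re.sub(r'\s+', ' ', article)
    doc = nlp(article)
    # Keep the lemma of every token that is not a stop word
    lemmatized_tokens = [token.lemma_ for token in doc if not token.is_stop]
    return ' '.join(lemmatized_tokens)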


df['Cleaned Article'] = df['Article'].apply(clean_and_tokenize)

df.head()

Unnamed: 0,Article,Cleaned Article
0,"Retailers, the makers of foods marketed for we...",retailer maker food market weight loss type co...
1,"Move over, Ozempic — there’s a new drug in tow...",Ozempic s new drug town Eli Lillys Zepbound ac...
2,Sept 14 (Reuters) - Bristol Myers Squibb (BMY....,Sept 14 Reuters Bristol Myers Squibb BMYN say ...
3,Austin Wolcott was 18 years old and pretty sur...,Austin Wolcott 18 year old pretty sure not sur...
4,"Cancer, often referred to as the “emperor of a...",cancer refer emperor malady unyielding adversa...
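
To see what the two regular expressions in `clean_and_tokenize` do on their own, the sketch below applies them to a made-up sample string without involving spaCy. The helper name `strip_punct_and_squeeze` and the sample text are illustrative and not part of the original notebook.

```python
import re

def strip_punct_and_squeeze(text):
    """Apply the same two substitutions used in clean_and_tokenize."""
    # Drop every character that is neither a word character nor whitespace
    text = re.sub(r'[^\w\s]', '', text)
    # Collapse runs of spaces, tabs, and newlines into a single space
    text = re.sub(r'\s+', ' ', text)
    return text

sample = "Move over,  Ozempic: there's a new\tdrug in town!"
print(strip_punct_and_squeeze(sample))
```

Note that the apostrophe is removed along with the other punctuation, so `there's` becomes `theres`. This is one reason possessives such as `Lillys` appear in the cleaned column.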


## Step 2: Mood Analysis (VADER)

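VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon- and rule-based sentiment analysis tool designed for social media text. It looks up each word's valence (polarity and intensity) in its lexicon, applies rules for negation, intensifiers, and punctuation, and combines the results into a normalized `compound` score between -1 and 1. The code below labels an article Positive when `compound` is at least 0.05, Negative when it is at most -0.05, and Neutral otherwise.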

pip install nltk

import nltk

nltk.download('vader_lexicon')


In [20]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Create the analyzer once and reuse it for every article
sia = SentimentIntensityAnalyzer()

def get_vader_mood(article):
    # 'compound' is VADER's normalized overall score in [-1, 1]
    compound = sia.polarity_scores(article)['compound']
    if compound >= 0.05:
        return 'Positive'
    elif compound <= -0.05:
        return 'Negative'
    else:
        return 'Neutral'

df['VADER Mood'] = df['Cleaned Article'].apply(get_vader_mood)


df.head()


Unnamed: 0,Article,Cleaned Article,VADER Mood
0,"Retailers, the makers of foods marketed for we...",retailer maker food market weight loss type co...,Positive
1,"Move over, Ozempic — there’s a new drug in tow...",Ozempic s new drug town Eli Lillys Zepbound ac...,Negative
2,Sept 14 (Reuters) - Bristol Myers Squibb (BMY....,Sept 14 Reuters Bristol Myers Squibb BMYN say ...,Negative
3,Austin Wolcott was 18 years old and pretty sur...,Austin Wolcott 18 year old pretty sure not sur...,Negative
4,"Cancer, often referred to as the “emperor of a...",cancer refer emperor malady unyielding adversa...,Negative
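
The thresholding step can be looked at separately from VADER itself. The following is a minimal sketch, assuming a compound score is already available; the function name `label_from_compound` and the sample scores are hypothetical and chosen only to show the boundary behaviour.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a mood label."""
    if compound >= threshold:
        return 'Positive'
    if compound <= -threshold:
        return 'Negative'
    return 'Neutral'

# Illustrative scores, not taken from the dataset
for score in (0.62, 0.05, 0.0, -0.049, -0.3):
    print(score, label_from_compound(score))
```

Scores exactly at the threshold count as Positive or Negative, so only values strictly between -0.05 and 0.05 are labeled Neutral.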


## Step 3: Mood Analysis (TextBlob)

This step uses TextBlob, a Python library, to analyze the sentiment of the cleaned articles in the DataFrame. It defines a function that labels each article Positive, Negative, or Neutral according to the sign of TextBlob's polarity score, which ranges from -1 to 1.


pip install textblob



In [21]:
from textblob import TextBlob

def get_textblob_mood(article):
    blob = TextBlob(article)
    polarity = blob.sentiment.polarity
    if polarity > 0:
        return 'Positive'
    elif polarity < 0:
        return 'Negative'
    else:
        return 'Neutral'

df['TextBlob Mood'] = df['Cleaned Article'].apply(get_textblob_mood)

df.head()


Unnamed: 0,Article,Cleaned Article,VADER Mood,TextBlob Mood
0,"Retailers, the makers of foods marketed for we...",retailer maker food market weight loss type co...,Positive,Positive
1,"Move over, Ozempic — there’s a new drug in tow...",Ozempic s new drug town Eli Lillys Zepbound ac...,Negative,Negative
2,Sept 14 (Reuters) - Bristol Myers Squibb (BMY....,Sept 14 Reuters Bristol Myers Squibb BMYN say ...,Negative,Negative
3,Austin Wolcott was 18 years old and pretty sur...,Austin Wolcott 18 year old pretty sure not sur...,Negative,Positive
4,"Cancer, often referred to as the “emperor of a...",cancer refer emperor malady unyielding adversa...,Negative,Positive
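
The two tools disagree on rows 3 and 4 of the output above. One quick way to quantify this is an agreement rate over the two label columns. The sketch below hard-codes the five labels shown in the table rather than reading them from `df`, so it runs on its own; the variable names are illustrative.

```python
from collections import Counter

# Labels copied from the five rows displayed above
vader = ['Positive', 'Negative', 'Negative', 'Negative', 'Negative']
textblob = ['Positive', 'Negative', 'Negative', 'Positive', 'Positive']

# Count each (VADER, TextBlob) label pair
pairs = Counter(zip(vader, textblob))

# Fraction of rows where both tools give the same label
agreement = sum(v == t for v, t in zip(vader, textblob)) / len(vader)

print(pairs)
print(agreement)
```

On the full dataset, the same comparison could be made directly on `df['VADER Mood']` and `df['TextBlob Mood']`.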


## TF-IDF (Theme Extraction)

The code extracts themes from the cleaned articles using TF-IDF vectorization, then builds a short sentence from the nouns, verbs, and adjectives found among each article's themes.

TF (term frequency) = number of times a word occurs in a document / total number of words in that document

IDF (inverse document frequency) = log(total number of documents / number of documents containing the word)

In practice the two are multiplied: TF-IDF = TF × IDF
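
The TF and IDF definitions above can be worked through by hand on a tiny corpus. This is a sketch of the textbook formulas, assuming the natural logarithm in the IDF term; the three toy documents and the helper names `tf`, `idf`, and `tf_idf` are made up for illustration. scikit-learn's `TfidfVectorizer` with default settings uses a variant (raw counts for TF, a smoothed IDF, and L2 normalization of each row), so its numbers will differ from these.

```python
import math

# Toy corpus: each document is already tokenized (made-up data)
docs = [
    ["drug", "weight", "loss", "drug"],
    ["cancer", "therapy", "cell"],
    ["cancer", "drug", "trial"],
]

def tf(term, doc):
    """Share of the document's tokens that are `term`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Log of (number of documents / documents containing `term`).

    Assumes `term` occurs in at least one document.
    """
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

for term in ("drug", "weight"):
    print(term, round(tf_idf(term, docs[0], docs), 4))
```

In the first document, `drug` occurs twice and `weight` once, yet `weight` gets the higher TF-IDF score because it appears in only one of the three documents while `drug` appears in two.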

pip install scikit-learn

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
vectorizer = TfidfVectorizer()

tfidf_matrix = vectorizer.fit_transform(df['Cleaned Article'])

feature_names = vectorizer.get_feature_names_out()

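# Number of top-scoring terms to keep per article
N = 60
# Sort each row's column indices by descending TF-IDF score and keep the first N
top_indices = np.argsort(tfidf_matrix.toarray())[:, ::-1][:, :N]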

themes = []
for indices in top_indices:
    article_themes = [feature_names[idx] for idx in indices]
    themes.append(', '.join(article_themes))

df['Themes'] = themes

def generate_sentences_from_themes(theme):
    doc = nlp(theme)
    # Group the theme words by part of speech
    nouns = [token.text for token in doc if token.pos_ == 'NOUN']
    verbs = [token.text for token in doc if token.pos_ == 'VERB']
    adjectives = [token.text for token in doc if token.pos_ == 'ADJ']

    # Build "<first noun> <first verb> <all adjectives>."
    sentence_parts = []
    if nouns:
        sentence_parts.append(nouns[0].capitalize())
        # Guard against themes with no verb, which would otherwise raise IndexError
        if verbs:
            sentence_parts.append(verbs[0])
    if adjectives:
        sentence_parts.extend(adjectives)

    # Fall back to the raw theme string when no usable words were found
    if sentence_parts:
        sentence = ' '.join(sentence_parts) + "."
    else:
        sentence = theme.capitalize() + "."

    return sentence

df['Theme Sentences'] = df['Themes'].apply(generate_sentences_from_themes)

df.head()



Unnamed: 0,Article,Cleaned Article,VADER Mood,TextBlob Mood,Themes,Theme Sentences
0,"Retailers, the makers of foods marketed for we...",retailer maker food market weight loss type co...,Positive,Positive,"drug, weight, glossier, executive, like, loss,...",Drug tell glossier big kind asterisk female an...
1,"Move over, Ozempic — there’s a new drug in tow...",Ozempic s new drug town Eli Lillys Zepbound ac...,Negative,Negative,"obesity, overweight, drug, weight, zepbound, d...",Obesity approve overweight high scientific pol...
2,Sept 14 (Reuters) - Bristol Myers Squibb (BMY....,Sept 14 Reuters Bristol Myers Squibb BMYN say ...,Negative,Negative,"bristol, cell, therapy, blood, eliquis, thin, ...",Cell continue thin generic different revlimid ...
3,Austin Wolcott was 18 years old and pretty sur...,Austin Wolcott 18 year old pretty sure not sur...,Negative,Positive,"cart, tcells, cancer, cell, therapy, wolcott, ...",Cart feel solid dipersio new human common happ...
4,"Cancer, often referred to as the “emperor of a...",cancer refer emperor malady unyielding adversa...,Negative,Positive,"cancer, therapy, cart, treatment, cell, patien...",Cancer make cytomed affordable associate effec...
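
The row-wise `np.argsort(...)[:, ::-1][:, :N]` expression picks the N highest-scoring terms for each article. The dependency-free sketch below shows the same idea for a single row, plus an optional filter that drops zero-score terms. The helper `top_n_terms`, the `skip_zeros` flag, and the scores are hypothetical and not taken from the notebook.

```python
def top_n_terms(scores, feature_names, n, skip_zeros=False):
    """Return the n feature names with the highest scores, best first."""
    # Indices ordered by descending score
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    if skip_zeros:
        # Leave out terms that do not occur in this document at all
        order = [i for i in order if scores[i] > 0]
    return [feature_names[i] for i in order[:n]]

feature_names = ['cancer', 'cell', 'drug', 'therapy', 'weight']
row = [0.10, 0.00, 0.55, 0.20, 0.40]  # made-up TF-IDF scores for one article

print(top_n_terms(row, feature_names, 3))
print(top_n_terms(row, feature_names, 5, skip_zeros=True))
```

Without such a filter, an article with fewer than N distinct terms has its theme list padded with words that score zero, which may be worth checking for short articles when N is as large as 60.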


## Aspect-Based Sentiment Analysis

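This step defines a function that performs aspect-level sentiment analysis on each article: it extracts aspects (nouns and proper nouns) from every sentence and labels each one Positive, Negative, or Neutral with VADER. Each aspect word is scored on its own, so the label reflects the word's lexicon valence rather than the sentence around it.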

In [32]:
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from textblob import TextBlob
import spacy


nlp = spacy.load('en_core_web_sm')

def aspect_sentiment_analysis(article):
    sia = SentimentIntensityAnalyzer()
    blob = TextBlob(article)

    aspect_sentiments = []

    # Split the article into sentences with TextBlob
    for sentence in blob.sentences:
        # Treat every noun and proper noun in the sentence as an aspect
        aspects = [token.text for token in nlp(sentence.raw) if token.pos_ in ['NOUN', 'PROPN']]

        aspect_sentiment = {}
        for aspect in aspects:
            # Score the aspect word on its own, without the surrounding sentence
            vader_score = sia.polarity_scores(aspect)
            if vader_score['compound'] >= 0.05:
                aspect_sentiment[aspect] = 'Positive'
            elif vader_score['compound'] <= -0.05:
                aspect_sentiment[aspect] = 'Negative'
            else:
                aspect_sentiment[aspect] = 'Neutral'

        # One dictionary of aspect labels per sentence
        aspect_sentiments.append(aspect_sentiment)

    return aspect_sentiments




df['Aspect Sentiments'] = df['Cleaned Article'].apply(aspect_sentiment_analysis)

df.head()


Unnamed: 0,Article,Cleaned Article,VADER Mood,TextBlob Mood,Themes,Theme Sentences,Aspect Sentiments
0,"Retailers, the makers of foods marketed for we...",retailer maker food market weight loss type co...,Positive,Positive,"drug, weight, glossier, executive, like, loss,...",Drug tell glossier big kind asterisk female an...,"[{'retailer': 'Neutral', 'maker': 'Neutral', '..."
1,"Move over, Ozempic — there’s a new drug in tow...",Ozempic s new drug town Eli Lillys Zepbound ac...,Negative,Negative,"obesity, overweight, drug, weight, zepbound, d...",Obesity approve overweight high scientific pol...,"[{'Ozempic': 'Neutral', 'drug': 'Neutral', 'to..."
2,Sept 14 (Reuters) - Bristol Myers Squibb (BMY....,Sept 14 Reuters Bristol Myers Squibb BMYN say ...,Negative,Negative,"bristol, cell, therapy, blood, eliquis, thin, ...",Cell continue thin generic different revlimid ...,"[{'Sept': 'Neutral', 'Reuters': 'Neutral', 'Br..."
3,Austin Wolcott was 18 years old and pretty sur...,Austin Wolcott 18 year old pretty sure not sur...,Negative,Positive,"cart, tcells, cancer, cell, therapy, wolcott, ...",Cart feel solid dipersio new human common happ...,"[{'Austin': 'Neutral', 'Wolcott': 'Neutral', '..."
4,"Cancer, often referred to as the “emperor of a...",cancer refer emperor malady unyielding adversa...,Negative,Positive,"cancer, therapy, cart, treatment, cell, patien...",Cancer make cytomed affordable associate effec...,"[{'cancer': 'Negative', 'emperor': 'Neutral', ..."
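
Most aspects in the output above come out Neutral. A likely reason is that each noun is scored in isolation, and a word that is absent from the sentiment lexicon contributes no valence. The sketch below imitates that lookup with a toy lexicon. The valences are made up and are NOT VADER's real values, and the sketch labels by sign instead of the ±0.05 compound thresholds used in the notebook.

```python
# Made-up valences for illustration only; not taken from VADER's lexicon
TOY_LEXICON = {'cancer': -3.0, 'loss': -1.5, 'hope': 2.0}

def label_aspect(word, lexicon=TOY_LEXICON):
    """Label a single word by the sign of its lexicon valence."""
    valence = lexicon.get(word.lower(), 0.0)  # unknown words get 0.0
    if valence > 0:
        return 'Positive'
    if valence < 0:
        return 'Negative'
    return 'Neutral'

aspects = ['cancer', 'emperor', 'retailer', 'hope']
labels = {aspect: label_aspect(aspect) for aspect in aspects}
print(labels)
```

Scoring the whole sentence that contains the aspect, instead of the bare word, would be one way to bring context into the label, at the cost of giving every aspect in a sentence the same score.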
