<a href="https://colab.research.google.com/github/Vishal123-max/Fist-Java-Code/blob/main/NLP_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP Assignment - 2

**Dataset:** FinancialPhraseBank

**Group:** Group 76

#### Team Members
| Name | Roll No. | Contributions |
|------|----------|---------------|
| Taniya Yadav | 2024DA04119 | 100% |
| SRUTILEKHA DAS | 2022DC04094 | 100% |
| Vishal Yadav | 2024DA04134 | 100% |

**1. Introduction**

This notebook uses the FinancialPhraseBank dataset, which contains financial news headlines, each labeled with one of three sentiments (positive, negative, or neutral).

Our tasks include:
- Exploratory Data Analysis (EDA)
- Preprocessing
- Topic Modeling with LDA (10 topics)
- Coherence Score
- Visualization of topics
- Dependency Parsing of two sentences


**2. Load and Explore Dataset (EDA)**

In [2]:
import pandas as pd

# Load first 2000 rows
df = pd.read_csv("all-data.csv", nrows=2000, names=["Sentiment", "News Headline"], encoding='latin1')

# Quick look
print(df.head())
print("\nSentiment distribution:\n", df['Sentiment'].value_counts())
print("\nMissing values:\n", df.isnull().sum())

  Sentiment                                      News Headline
0   neutral  According to Gran , the company has no plans t...
1   neutral  Technopolis plans to develop in stages an area...
2  negative  The international electronic industry company ...
3  positive  With the new production plant the company woul...
4  positive  According to the company 's updated strategy f...

Sentiment distribution:
 Sentiment
positive    1031
neutral      902
negative      67
Name: count, dtype: int64

Missing values:
 Sentiment        0
News Headline    0
dtype: int64


**Explanation:**
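- We load only the first 2000 rows of `all-data.csv` and supply the column names explicitly through `names`.
- The sentiment distribution in this sample is heavily imbalanced: 1031 positive, 902 neutral, and only 67 negative headlines.
- Neither column contains missing values.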
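
To make the imbalance concrete, the counts printed above can be turned into class shares. This is a minimal sketch that hard-codes the three counts from the `value_counts()` output instead of re-reading the CSV; the names `shares` and `imbalance_ratio` are illustrative and not part of the original notebook.

```python
from collections import Counter

# Counts copied from the value_counts() output of the EDA cell.
counts = Counter({"positive": 1031, "neutral": 902, "negative": 67})

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Print the classes from largest to smallest share.
for label, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{label:<8} {share:6.2%}")

# Ratio between the largest and the smallest class.
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"largest/smallest class ratio: {imbalance_ratio:.1f}")
```

The negative class makes up only about 3% of the sample, so any result derived from these 2000 rows mostly reflects positive and neutral headlines.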


**3. Preprocessing**

In [3]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # remove punctuation/numbers
    tokens = text.split()
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    return tokens

df['tokens'] = df['News Headline'].apply(preprocess)
df[['News Headline','tokens']].head()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


| | News Headline | tokens |
|---|---------------|--------|
| 0 | According to Gran , the company has no plans t... | [according, gran, company, plan, move, product... |
| 1 | Technopolis plans to develop in stages an area... | [technopolis, plan, develop, stage, area, less... |
| 2 | The international electronic industry company ... | [international, electronic, industry, company,... |
| 3 | With the new production plant the company woul... | [new, production, plant, company, would, incre... |
| 4 | According to the company 's updated strategy f... | [according, company, updated, strategy, year, ... |


**Explanation:**
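- Lowercase the text and strip every character that is not a letter or whitespace, which removes punctuation and digits.
- Tokenize by splitting on whitespace.
- Remove English stopwords using the NLTK list.
- Lemmatize each remaining token with the WordNet lemmatizer (for example, "plans" becomes "plan").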
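
The first three steps can be traced on a single sentence without downloading any NLTK data. This is a simplified sketch under stated assumptions: `TOY_STOPWORDS` is a hypothetical hand-picked list standing in for NLTK's English stopwords, the input sentence is made up for illustration, and lemmatization is left out, so "plans" stays as it is here.

```python
import re

# Hypothetical mini stopword list; the notebook uses NLTK's full English list.
TOY_STOPWORDS = {"to", "the", "has", "no", "in", "an", "of"}

def toy_preprocess(text, stop_words=TOY_STOPWORDS):
    """Lowercase, drop non-letters, split on whitespace, remove stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)  # removes punctuation and digits
    return [word for word in text.split() if word not in stop_words]

tokens = toy_preprocess("According to Gran , the company has no plans in 2009 .")
print(tokens)  # ['according', 'gran', 'company', 'plans']
```

Note that the regular expression also deletes numbers such as "2009", so amounts and years never reach the topic model.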


**4. Topic Modeling with LDA (10 Topics)**

In [4]:
!pip install gensim

from gensim import corpora, models

# Create dictionary and corpus
dictionary = corpora.Dictionary(df['tokens'])
corpus = [dictionary.doc2bow(text) for text in df['tokens']]

# Build LDA model with 10 topics
lda_model = models.LdaModel(corpus=corpus,
                            id2word=dictionary,
                            num_topics=10,
                            random_state=42,
                            passes=10,
                            alpha='auto')

# Print topics
for idx, topic in lda_model.print_topics(-1):
    print(f"Topic {idx}: {topic}")

Topic 0: 0.012*"company" + 0.011*"year" + 0.008*"finnish" + 0.007*"said" + 0.007*"contract" + 0.006*"signed" + 0.005*"agreement" + 0.005*"awarded" + 0.005*"finland" + 0.005*"result"
Topic 1: 0.030*"million" + 0.015*"eur" + 0.015*"said" + 0.013*"company" + 0.013*"finnish" + 0.013*"oyj" + 0.010*"percent" + 0.009*"hel" + 0.009*"net" + 0.009*"maker"
Topic 2: 0.015*"company" + 0.013*"new" + 0.012*"market" + 0.010*"said" + 0.007*"n" + 0.006*"finland" + 0.006*"paper" + 0.006*"pharmaceutical" + 0.006*"capacity" + 0.006*"finnish"
Topic 3: 0.099*"eur" + 0.055*"mn" + 0.033*"profit" + 0.0
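
The `dictionary` and `corpus` built in the cell above encode each token list as a bag of words, that is, a list of `(token_id, count)` pairs. The following sketch imitates that encoding with the standard library only. The helpers `build_vocab` and `to_bow` are hypothetical, and assigning ids in first-seen order is an assumption of this sketch that may differ from gensim's numbering.

```python
from collections import Counter

def build_vocab(token_lists):
    """Map each distinct token to an integer id, in first-seen order."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def to_bow(tokens, vocab):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())

docs = [["company", "plan", "company"], ["plan", "develop", "area"]]
vocab = build_vocab(docs)
bows = [to_bow(d, vocab) for d in docs]

print(vocab)  # {'company': 0, 'plan': 1, 'develop': 2, 'area': 3}
print(bows)   # [[(0, 2), (1, 1)], [(1, 1), (2, 1), (3, 1)]]
```

Word order is discarded in this representation; LDA only sees how often each token occurs in each headline.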

**5. Coherence Score**

In [5]:
from gensim.models import CoherenceModel

coherence_model_lda = CoherenceModel(model=lda_model, texts=df['tokens'],
                                     dictionary=dictionary, coherence='c_v')
coherence_score = coherence_model_lda.get_coherence()
print("Coherence Score:", coherence_score)

Coherence Score: 0.35557873178592125
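
For intuition about what a coherence score rewards, the sketch below implements the simpler UMass-style measure, not the `c_v` measure used in the cell above. It sums `log((D(w_later, w_earlier) + 1) / D(w_earlier))` over pairs of a topic's top words, where `D` counts the documents containing the given words. The function name `umass_coherence` and the toy documents are illustrative assumptions, and the function assumes every top word occurs in at least one document.

```python
import math

def umass_coherence(top_words, documents):
    """UMass-style coherence of one topic's ordered top words."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            earlier, later = top_words[j], top_words[i]
            score += math.log((doc_freq(later, earlier) + 1) / doc_freq(earlier))
    return score

# Toy token lists, invented for illustration only.
documents = [
    ["eur", "mn", "profit"],
    ["eur", "profit", "sales"],
    ["contract", "signed", "agreement"],
    ["contract", "agreement", "eur"],
]

coherent_score = umass_coherence(["eur", "profit"], documents)
mixed_score = umass_coherence(["eur", "signed"], documents)
print(coherent_score, mixed_score)
```

Words that tend to appear in the same documents ("eur", "profit") score higher than words that never co-occur ("eur", "signed"), which is the behavior any coherence measure is meant to capture.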


**6. Topic Visualization**

In [6]:
!pip install pyLDAvis
import pyLDAvis
import pyLDAvis.gensim_models

pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)
vis


**7. Dependency Parsing (Two Sentences)**

In [7]:
import spacy
from spacy import displacy
import random

nlp = spacy.load("en_core_web_sm")

# Select two random sentences with at least 10 words (seeded so the choice is reproducible)
random.seed(42)
long_sentences = df[df['News Headline'].str.split().str.len() >= 10]['News Headline']
samples = random.sample(list(long_sentences), 2)

for sent in samples:
    doc = nlp(sent)
    displacy.render(doc, style="dep", jupyter=True)


**8. Conclusion**
- We performed EDA and preprocessing on the first 2000 rows of FinancialPhraseBank.
- We extracted 10 topics using LDA.
- We computed a `c_v` coherence score of about 0.356 to evaluate topic quality.
- We visualized the topics with pyLDAvis.
- We parsed two sentences of at least 10 words with the spaCy dependency parser.

