# Assignment 3
### Daniel Mehta

## Exercise 1: [Beginner's Guide to Named Entity Recognition (NER) in NLTK Library - MLK - Machine Learning Knowledge](https://machinelearningknowledge.ai/beginners-guide-to-named-entity-recognition-ner-in-nltk-library-python/)

In [7]:
#!pip install nltk==3.8.1

In [8]:
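import nltk
from nltk import word_tokenize, pos_tag

# One-time resource downloads (resource names as used by NLTK 3.8.1).
for resource in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
    nltk.download(resource, quiet=True)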

text = "NASA awarded Elon Musk’s SpaceX a $2.9 billion contract to build the lunar lander."
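tokens = word_tokenize(text)
tag = pos_tag(tokens)  # part-of-speech tags are the input to the NE chunker
print(tag)

# Group the tagged tokens into named-entity chunks (PERSON, ORGANIZATION, ...).
ne_tree = nltk.ne_chunk(tag)
print(ne_tree)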

[('NASA', 'NNP'), ('awarded', 'VBD'), ('Elon', 'NNP'), ('Musk', 'NNP'), ('’', 'NNP'), ('s', 'VBD'), ('SpaceX', 'NNP'), ('a', 'DT'), ('$', '$'), ('2.9', 'CD'), ('billion', 'CD'), ('contract', 'NN'), ('to', 'TO'), ('build', 'VB'), ('the', 'DT'), ('lunar', 'NN'), ('lander', 'NN'), ('.', '.')]
(S
  (ORGANIZATION NASA/NNP)
  awarded/VBD
  (PERSON Elon/NNP Musk/NNP)
  ’/NNP
  s/VBD
  (ORGANIZATION SpaceX/NNP)
  a/DT
  $/$
  2.9/CD
  billion/CD
  contract/NN
  to/TO
  build/VB
  the/DT
  lunar/NN
  lander/NN
  ./.)
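
In the output above, the typographic apostrophe (U+2019) ends up as its own `NNP` token and the following `s` is tagged `VBD`. One possible workaround, sketched here as an assumption and not part of the original exercise, is to map curly quotes to ASCII before calling `word_tokenize`. The names `normalize_quotes` and `_QUOTE_MAP` are hypothetical.

```python
# Map common typographic quote characters to their ASCII equivalents.
_QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark (curly apostrophe)
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def normalize_quotes(text: str) -> str:
    """Return text with curly quotes replaced by straight ASCII quotes."""
    return text.translate(_QUOTE_MAP)


raw_text = "NASA awarded Elon Musk\u2019s SpaceX a $2.9 billion contract to build the lunar lander."
clean_text = normalize_quotes(raw_text)
print(clean_text)
```

Passing `clean_text` to `word_tokenize` should let the tokenizer treat `'s` as a possessive clitic, although the resulting tags and chunks were not re-run here.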


In [9]:
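# Requires the Penn Treebank sample corpus: nltk.download("treebank")
sent = nltk.corpus.treebank.tagged_sents()
# The sentences are already POS-tagged, so the first one can be chunked directly.
print(nltk.ne_chunk(sent[0]))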

(S
  (PERSON Pierre/NNP)
  (ORGANIZATION Vinken/NNP)
  ,/,
  61/CD
  years/NNS
  old/JJ
  ,/,
  will/MD
  join/VB
  the/DT
  board/NN
  as/IN
  a/DT
  nonexecutive/JJ
  director/NN
  Nov./NNP
  29/CD
  ./.)
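
The printed tree is easy to read but awkward to process further. A minimal sketch of pulling the labelled chunks out as plain tuples follows; it assumes the flat structure that `nltk.ne_chunk` produces (each child of the root is either a `(word, tag)` tuple or a labelled chunk of such tuples). The helper name `extract_entities` is hypothetical.

```python
def extract_entities(tree):
    """Collect (label, text) pairs from the chunks of a named-entity tree.

    Assumes a flat tree: every child is either a (word, tag) tuple or a
    labelled chunk whose items are (word, tag) tuples.
    """
    entities = []
    for child in tree:
        # Chunks (nltk.Tree objects) expose .label(); plain tuples do not.
        if hasattr(child, "label"):
            words = [word for word, _tag in child]
            entities.append((child.label(), " ".join(words)))
    return entities
```

For the tree printed above, `extract_entities(nltk.ne_chunk(sent[0]))` should return `[('PERSON', 'Pierre'), ('ORGANIZATION', 'Vinken')]`, which makes the split of one name into two entities easy to spot.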


In [10]:
import spacy
nlp = spacy.load("en_core_web_sm")

doc = nlp("NASA awarded Elon Musk’s SpaceX a $2.9 billion contract to build the lunar lander.")
for token in doc:
    print(token.text, token.ent_iob_, token.ent_type_)


NASA B ORG
awarded O 
Elon B PERSON
Musk I PERSON
’s I PERSON
SpaceX I PERSON
a O 
$ B MONEY
2.9 I MONEY
billion I MONEY
contract O 
to O 
build O 
the O 
lunar O 
lander O 
. O 
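
The token-level IOB tags above can be merged into entity spans: `B` opens a span, `I` continues it, and `O` closes it. The sketch below does this in plain Python on the rows printed above; `group_iob` is a hypothetical helper, and tokens are joined with single spaces, so the spacing is only approximate. In spaCy itself, `doc.ents` already provides the merged spans.

```python
def group_iob(rows):
    """Merge (text, iob, label) rows into (entity_text, label) spans."""
    spans = []
    tokens, label = [], None
    for text, iob, ent_type in rows:
        if iob == "I" and tokens:
            # Continue the span that is currently open.
            tokens.append(text)
            continue
        # Any other tag closes the span in progress, if there is one.
        if tokens:
            spans.append((" ".join(tokens), label))
        # "B" opens a new span; "O" (or a stray "I") leaves none open.
        tokens, label = ([text], ent_type) if iob == "B" else ([], None)
    if tokens:
        spans.append((" ".join(tokens), label))
    return spans


rows = [
    ("NASA", "B", "ORG"), ("awarded", "O", ""),
    ("Elon", "B", "PERSON"), ("Musk", "I", "PERSON"),
    ("\u2019s", "I", "PERSON"), ("SpaceX", "I", "PERSON"),
    ("a", "O", ""), ("$", "B", "MONEY"), ("2.9", "I", "MONEY"),
    ("billion", "I", "MONEY"), ("contract", "O", ""),
]
entities = group_iob(rows)
print(entities)
```

Grouping makes the model's mistake visible: `SpaceX` is absorbed into the `PERSON` span that starts at `Elon` and is not tagged as a separate organization.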


## Exercise 2:
### a) Find a new text dataset

In [14]:
import kagglehub
import os

# Download latest version
path = kagglehub.dataset_download("rmisra/news-category-dataset")
json_file = os.path.join(path, "News_Category_Dataset_v3.json")


print("Path to dataset files:", path)

Path to dataset files: /kaggle/input/news-category-dataset


### b) Convert it into CSV format

In [15]:
import json
import pandas as pd

In [18]:
csv_file = "News_Category_Dataset.csv"

# The file is in JSON Lines format (one JSON object per line), hence lines=True.
df = pd.read_json(json_file, lines=True)
df.to_csv(csv_file, index=False)
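
For files too large to load into a DataFrame, the same conversion can be streamed with the standard library only. This is a sketch under the assumption that every record has the same keys as the first one; `jsonl_to_csv` is a hypothetical helper, not part of pandas or kagglehub.

```python
import csv
import json


def jsonl_to_csv(jsonl_path, csv_path):
    """Stream a JSON Lines file into a CSV file and return the row count."""
    count = 0
    with open(jsonl_path, "r", encoding="utf-8") as src, \
         open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = None
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            if writer is None:
                # Assumption: the first record defines the column set.
                writer = csv.DictWriter(dst, fieldnames=list(record.keys()))
                writer.writeheader()
            writer.writerow(record)
            count += 1
    return count
```

A call such as `jsonl_to_csv(json_file, "News_Category_Dataset.csv")` would write the same columns as the pandas version, without building the whole table in memory.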

In [20]:
df.head()

| | link | headline | category | short_description | authors | date |
|---|---|---|---|---|---|---|
| 0 | https://www.huffpost.com/entry/covid-boosters-... | Over 4 Million Americans Roll Up Sleeves For O... | U.S. NEWS | Health experts said it is too early to predict... | Carla K. Johnson, AP | 2022-09-23 |
| 1 | https://www.huffpost.com/entry/american-airlin... | American Airlines Flyer Charged, Banned For Li... | U.S. NEWS | He was subdued by passengers and crew when he ... | Mary Papenfuss | 2022-09-23 |
| 2 | https://www.huffpost.com/entry/funniest-tweets... | 23 Of The Funniest Tweets About Cats And Dogs ... | COMEDY | "Until you have a dog you don't understand wha... | Elyse Wanshel | 2022-09-23 |
| 3 | https://www.huffpost.com/entry/funniest-parent... | The Funniest Tweets From Parents This Week (Se... | PARENTING | "Accidentally put grown-up toothpaste on my to... | Caroline Bologna | 2022-09-23 |
| 4 | https://www.huffpost.com/entry/amy-cooper-lose... | Woman Who Called Cops On Black Bird-Watcher Lo... | U.S. NEWS | Amy Cooper accused investment firm Franklin Te... | Nina Golgowski | 2022-09-22 |


### c) Repeat Exercise 1 on the new dataset

In [22]:
# Run the NLTK pipeline (tokenize, POS-tag, NE-chunk) on the first five headlines.
for i in range(5):
    text = df['headline'].iloc[i]

    tokens = word_tokenize(text)
    tag = pos_tag(tokens)
    print(tag)

    ne_tree = nltk.ne_chunk(tag)
    print(ne_tree)

[('Over', 'IN'), ('4', 'CD'), ('Million', 'NNP'), ('Americans', 'NNPS'), ('Roll', 'NNP'), ('Up', 'NNP'), ('Sleeves', 'NNP'), ('For', 'IN'), ('Omicron-Targeted', 'NNP'), ('COVID', 'NNP'), ('Boosters', 'NNP')]
(S
  Over/IN
  4/CD
  Million/NNP
  Americans/NNPS
  (PERSON Roll/NNP Up/NNP Sleeves/NNP)
  For/IN
  Omicron-Targeted/NNP
  COVID/NNP
  Boosters/NNP)
[('American', 'NNP'), ('Airlines', 'NNPS'), ('Flyer', 'NNP'), ('Charged', 'NNP'), (',', ','), ('Banned', 'NNP'), ('For', 'IN'), ('Life', 'NNP'), ('After', 'IN'), ('Punching', 'VBG'), ('Flight', 'NNP'), ('Attendant', 'NNP'), ('On', 'IN'), ('Video', 'NNP')]
(S
  (GPE American/NNP)
  (ORGANIZATION Airlines/NNPS Flyer/NNP)
  Charged/NNP
  ,/,
  (GPE Banned/NNP)
  (ORGANIZATION For/IN Life/NNP)
  After/IN
  Punching/VBG
  (PERSON Flight/NNP Attendant/NNP)
  On/IN
  Video/NNP)
[('23', 'CD'), ('Of', 'IN'), ('The', 'DT'), ('Funniest', 'NNP'), ('Tweets', 'NNPS'), ('About', 'NNP'), ('Cats', 'NNP'), ('And', 'CC'), ('Dogs', 'NNP'), ('This', 'DT')

In [24]:
nlp = spacy.load("en_core_web_sm")
for i in range(5):
    text = df['headline'].iloc[i]
    doc = nlp(text)

    print(f"\nHeadline {i+1}: {text}")
    for token in doc:
        print(token.text, token.ent_iob_, token.ent_type_)



Headline 1: Over 4 Million Americans Roll Up Sleeves For Omicron-Targeted COVID Boosters
Over O 
4 O 
Million O 
Americans O 
Roll O 
Up O 
Sleeves O 
For O 
Omicron O 
- O 
Targeted O 
COVID O 
Boosters O 

Headline 2: American Airlines Flyer Charged, Banned For Life After Punching Flight Attendant On Video
American B ORG
Airlines I ORG
Flyer B PERSON
Charged I PERSON
, O 
Banned B ORG
For I ORG
Life I ORG
After I ORG
Punching I ORG
Flight I ORG
Attendant I ORG
On I ORG
Video I ORG

Headline 3: 23 Of The Funniest Tweets About Cats And Dogs This Week (Sept. 17-23)
23 B CARDINAL
Of O 
The O 
Funniest O 
Tweets O 
About O 
Cats O 
And O 
Dogs O 
This O 
Week O 
( O 
Sept. B DATE
17 I DATE
- I DATE
23 I DATE
) O 

Headline 4: The Funniest Tweets From Parents This Week (Sept. 17-23)
The O 
Funniest O 
Tweets O 
From O 
Parents O 
This B DATE
Week I DATE
( O 
Sept. B DATE
17 I DATE
- I DATE
23 I DATE
) O 

Headline 5: Woman Who Called Cops On Black Bird-Watcher Loses Lawsuit Against Ex-Emp