<a href="https://colab.research.google.com/github/Vyoma-garg/Natural-Language-Processing/blob/main/Stanford_Spacy_NER.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Getting the dataset**

In [None]:
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


For my future reference:

---


**The 20 newsgroups dataset comprises around 18,000 newsgroup posts on 20 topics, split into two subsets: one for training (or development) and one for testing (or performance evaluation). The split between the train and test sets is based on whether a message was posted before or after a specific date.**

In [None]:
from pprint import pprint
pprint(list(newsgroups_train.target_names))

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']


In [None]:
print(newsgroups_train.filenames.shape)
print(newsgroups_train.target.shape)

(11314,)
(11314,)


Default extraction location: **~/scikit_learn_data/20news_home**

In [None]:
newsgroups_train.filenames[:10]

array(['/root/scikit_learn_data/20news_home/20news-bydate-train/rec.autos/102994',
       '/root/scikit_learn_data/20news_home/20news-bydate-train/comp.sys.mac.hardware/51861',
       '/root/scikit_learn_data/20news_home/20news-bydate-train/comp.sys.mac.hardware/51879',
       '/root/scikit_learn_data/20news_home/20news-bydate-train/comp.graphics/38242',
       '/root/scikit_learn_data/20news_home/20news-bydate-train/sci.space/60880',
       '/root/scikit_learn_data/20news_home/20news-bydate-train/talk.politics.guns/54525',
       '/root/scikit_learn_data/20news_home/20news-bydate-train/sci.med/58080',
       '/root/scikit_learn_data/20news_home/20news-bydate-train/comp.sys.ibm.pc.hardware/60249',
       '/root/scikit_learn_data/20news_home/20news-bydate-train/comp.os.ms-windows.misc/10008',
       '/root/scikit_learn_data/20news_home/20news-bydate-train/comp.sys.mac.hardware/50502'],
      dtype='<U86')

In [None]:
newsgroups_train.target_names[:10]

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball']

In [None]:
newsgroups_train.target[:10]

array([ 7,  4,  4,  1, 14, 16, 13,  3,  2,  4])
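
Each integer in `target` is an index into `target_names`. As a minimal sketch, using only the values copied from the cell outputs above (the variable names `first_targets` and `first_labels` are introduced here for illustration), the first ten labels can be decoded like this:

```python
# Category names copied from the target_names output above.
target_names = [
    'alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc',
    'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x',
    'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball',
    'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med',
    'sci.space', 'soc.religion.christian', 'talk.politics.guns',
    'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc',
]

# First ten integer labels copied from the target output above.
first_targets = [7, 4, 4, 1, 14, 16, 13, 3, 2, 4]

# Decode each integer label into its newsgroup name.
first_labels = [target_names[i] for i in first_targets]
print(first_labels[:3])
```

The decoded names line up with the directory names in the first ten `filenames` entries shown earlier, starting with `rec.autos`.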

### **(i) Stanford NER tagging**

In [None]:
import os

import nltk  # needed for nltk.__version__ below
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

nltk.__version__

'3.2.5'

In [None]:
!wget http://nlp.stanford.edu/software/stanford-ner-2015-04-20.zip
!unzip stanford-ner-2015-04-20.zip 

In [None]:
from nltk.tag.stanford import StanfordNERTagger
jar = "stanford-ner-2015-04-20/stanford-ner-3.5.2.jar"
model = "stanford-ner-2015-04-20/classifiers/" 
st = StanfordNERTagger(model + "english.all.3class.distsim.crf.ser.gz", jar, encoding='utf8') 

The StanfordTokenizer will be deprecated in version 3.2.5.
Please use nltk.tag.corenlp.CoreNLPPOSTagger or nltk.tag.corenlp.CoreNLPNERTagger instead.
  super(StanfordNERTagger, self).__init__(*args, **kwargs)


**Stanford tagging with 5000 documents**

In [None]:
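# Take the first 5000 training documents.
data_2 = newsgroups_train.data[0:5000]
# str() on a list returns its repr (brackets, quotes, escaped newlines) as one long string.
data_2 = str(data_2)
data_2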

In [None]:
from itertools import groupby

import pandas as pd

# Tokenize the combined text and tag every token with the Stanford NER model.
tokenized_text = word_tokenize(data_2)
classified_text = st.tag(tokenized_text)

entities = []
labels = []

# Merge each run of consecutive tokens that share a tag into one entity; skip "O" (non-entity) tokens.
for tag, chunk in groupby(classified_text, lambda x: x[1]):
    if tag != "O":
        entities.append(' '.join(w for w, t in chunk))
        labels.append(tag)

entities_all = list(zip(entities, labels))
#entities_unique = list(set(zip(entities, labels))) #unique entities
classified_text_df = pd.DataFrame(entities_all, columns=["Entities", "Labels"])
classified_text_df
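
The `groupby` step in the cell above can be tried on its own. The sketch below is only an illustration: `toy_tagged` is a hypothetical hand-written list in the `(token, tag)` shape that `st.tag()` returns, not real tagger output, and `chunk_entities` is a helper name introduced here.

```python
from itertools import groupby

# Hypothetical (token, tag) pairs shaped like st.tag() output; not real tagger results.
toy_tagged = [
    ("Bill", "PERSON"), ("Clinton", "PERSON"), ("visited", "O"),
    ("New", "LOCATION"), ("York", "LOCATION"), ("and", "O"),
    ("Paris", "LOCATION"), ("France", "LOCATION"),
]

def chunk_entities(tagged):
    """Join each run of consecutive tokens sharing a non-"O" tag into one entity."""
    chunks = []
    for tag, group in groupby(tagged, key=lambda pair: pair[1]):
        if tag != "O":
            chunks.append((" ".join(token for token, _ in group), tag))
    return chunks

toy_chunks = chunk_entities(toy_tagged)
print(toy_chunks)
```

One consequence of grouping purely by tag: two different entities of the same type that sit next to each other (here "Paris" and "France") are merged into a single entity string.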

**Potential tags in 5000 documents**

In [None]:
#entities_df_ = classified_text_df[classified_text_df['Labels'] != 'O']
classified_text_df.Labels.unique()

array(['ORGANIZATION', 'PERSON', 'LOCATION'], dtype=object)

**All LOCATION entities in 5000 docs**

In [None]:
entities_df_loc = classified_text_df[classified_text_df['Labels'] == 'LOCATION']
#entities_df_loc.drop_duplicates(subset=['Entities'],keep='first', inplace=True)
entities_df_loc

In [None]:
entities_df_person = classified_text_df[classified_text_df['Labels'] == 'PERSON']
#entities_df_person.drop_duplicates(subset=['Entities'],keep='first', inplace=True)
entities_df_person

### **(ii) Spacy NER Tagging**

In [None]:
import spacy 
spacy.prefer_gpu()
from spacy import displacy
spacy.__version__

'3.0.6'

In [None]:
#Download spacy models
!python -m spacy download en_core_web_md
import en_core_web_md

In [None]:
nlp = spacy.load('en_core_web_md')

**with 5000 documents**

In [None]:
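import pandas as pd

entities = []
labels = []
position_start = []
position_end = []

# Iterate over the documents themselves: data_2 was converted to a single string above,
# so looping over it would yield one character at a time.
for text in newsgroups_train.data[0:5000]:
    doc = nlp(text)

    for ent in doc.ents:
        entities.append(ent.text)
        labels.append(ent.label_)
        # Character offsets are relative to the start of each document.
        position_start.append(ent.start_char)
        position_end.append(ent.end_char)

spacy_df = pd.DataFrame({'Entities': entities, 'Labels': labels, 'Position_Start': position_start, 'Position_End': position_end})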

In [None]:
spacy_df

**All entities with PERSON Tag in 5000 documents using Spacy**

In [None]:
spacy_df_person = spacy_df[spacy_df['Labels'] == 'PERSON']
#spacy_df_person.drop_duplicates(subset=['Entities'],keep='first', inplace=True)
spacy_df_person

**All entities with LOC Tag in 5000 documents using Spacy**

In [None]:
spacy_df_loc= spacy_df[spacy_df['Labels']== 'LOC']
#spacy_df_loc_gpe.drop_duplicates(subset=['Entities'],keep='first', inplace=True)
spacy_df_loc

**All entities with GPE Tag in 5000 documents using Spacy**

In [None]:
spacy_df_gpe= spacy_df[spacy_df['Labels']== 'GPE']
#spacy_df_loc_gpe.drop_duplicates(subset=['Entities'],keep='first', inplace=True)
spacy_df_gpe

**All entities with LOC AND GPE Tag in 5000 documents using Spacy**

In [None]:
#spacy_df_loc_gpe = spacy_df[(spacy_df['Labels'] == 'GPE') & (spacy_df['Labels'] == 'LOC') ]
spacy_df_loc_gpe = spacy_df[spacy_df['Labels'].isin(['GPE', 'LOC']) ]
#spacy_df_loc_gpe.drop_duplicates(subset=['Entities'],keep='first', inplace=True)
spacy_df_loc_gpe

## **1(b) Top 100 LOC and PERSON entities** 

### **STANFORD**

In [None]:
count_entities_df_loc=entities_df_loc['Entities'].value_counts()
count_entities_df_loc

**Top 100 LOC entities using Stanford NER**

In [None]:
count_entities_df_loc[0:100]

US          357
Israel      296
U.S.        158
USA         137
Canada      130
           ... 
Finland      19
Bethesda     19
Quebec       19
CA           19
Iraq         19
Name: Entities, Length: 100, dtype: int64
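
For reference, the same kind of frequency ranking can be sketched without pandas using `collections.Counter`. The list `toy_entities` below is hypothetical toy data standing in for the `Entities` column, and `top_n` is a helper name introduced here.

```python
from collections import Counter

# Hypothetical toy entity strings standing in for entities_df_loc['Entities'].
toy_entities = ["US", "Israel", "US", "Canada", "U.S.", "US", "Israel"]

def top_n(entities, n):
    """Return the n most frequent entity strings with their counts, most frequent first."""
    return Counter(entities).most_common(n)

top_two = top_n(toy_entities, 2)
print(top_two)
```

As in the output above, spelling variants such as "US" and "U.S." are counted as separate entities because the ranking works on the exact strings.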



---

Count of the PERSON entities in 5000 documents.

In [None]:
count_entities_df_person=entities_df_person['Entities'].value_counts()
count_entities_df_person

**Top 100 PERSON entities using Stanford NER**

In [None]:
count_entities_df_person[0:100]

Jesus              318
Clinton            159
John                88
Christ              78
Matthew             67
                  ... 
Steven Bellovin     14
Reagan              14
Stalin              14
Gerald Olchowy      14
Sherri Nichols      14
Name: Entities, Length: 100, dtype: int64



---




### **SPACY**

**Top 100 LOC and GPE entities using Spacy NER Tagging**

In [None]:
count_spacy_df_loc_gpe=spacy_df_loc_gpe['Entities'].value_counts()
count_spacy_df_loc_gpe

In [None]:
count_spacy_df_loc_gpe[0:100]

Israel       356
US           344
U.S.         178
Canada       167
Turkey       145
            ... 
Egypt         23
Portland      23
NC            22
Ann Arbor     22
Denver        21
Name: Entities, Length: 100, dtype: int64

**Top 100 PERSON entities in documents using Spacy**

In [None]:
count_spacy_df_person=spacy_df_person['Entities'].value_counts()
count_spacy_df_person

In [None]:
count_spacy_df_person[0:100]

Clinton                 176
Serdar Argic            113
Matthew                  99
Mac                      93
geb@cs.pitt.edu          71
                       ... 
Hussein                  16
Phill Hallam-Baker       16
Timothy C. May           16
Michael A. Covington     16
David Veal Univ          16
Name: Entities, Length: 100, dtype: int64