# Named Entity Recognition (NER) & Scattertext

Install scattertext with pip. The conda and conda-forge versions are out of date and currently don't work because of version conflicts.

In [None]:
!pip install scattertext

## Imports

In [25]:
import pandas as pd

import spacy
nlp = spacy.load("en_core_web_sm")
from spacy import displacy

import scattertext as st

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

## NER

Named Entity Recognition (NER) finds, classifies, and labels important pieces of information in text documents, such as people, organizations, places, and dates. spaCy's tools use both rule-based systems and machine learning models to accomplish this.

Below, we'll grab a few Reddit posts from the GPT3 subreddit and see how spaCy labels the text using its built-in visualization tool, displaCy.

In [88]:
gpt3 = pd.read_csv("gpt3_data.csv")
gpt3 = gpt3[gpt3['content'].notnull()].copy()  # copy so later column assignments don't trigger SettingWithCopyWarning

In [99]:
gpt3_example = gpt3.iloc[34:38]

for c in gpt3_example['content']:
    doc = nlp(str(c))
    displacy.render(doc, style='ent')

We can also print a list of each named entity and its classification. Ultimately, these outputs can be formatted as a table, dictionary or other object for analysis.

In [100]:
for c in gpt3_example['content']:
    doc = nlp(str(c))
    for ent in doc.ents:
        print(ent.text, ent.label_)

AI ORG
India GPE
Philippines GPE
AI ORG
AI Rising ORG
AI ORG
80% PERCENT
AI ORG
over 1 million CARDINAL
Philippine NORP
2028 DATE
India GPE
AI ORG
AI ORG
PS:* ORG
one CARDINAL
AI ORG
5000 CARDINAL
RAG ORG
1 CARDINAL
2 CARDINAL
3 CARDINAL
OpenAI GPE
PromptTools ORG
RAG PERSON
integrations](https://github.com ORG
OpenAI PERSON
Anthropic, Google Vertex/PaLM ORG
Llama PERSON
Replicate ORG
Weaviate PERSON
Pinecone, Qdrant
* ORG
LangChain ORG
RAG ORG
minutes TIME
example](https://github.com ORG
Jamie Dimon](https://www.google.com/ PERSON
69i57j69i60.7764j0j4&sourceid CARDINAL
UTF-8 FAC
AI ORG
fewer days DATE
AI ORG
AI ORG
thousands CARDINAL
3.5-day DATE
AI ORG
30,000 CARDINAL
5-day DATE
weeks DATE
JPMorgan ORG
AI ORG
3.5-day weeks DATE
PS:* ORG
one CARDINAL
AI ORG
5000 CARDINAL
GPT 4 LAW
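
As a minimal sketch of the formatting idea mentioned above, the entity pairs can be collected into a list of dictionaries and tallied. The helper names `entities_to_records` and `count_entities` and the stand-in `Ent` tuples are assumptions made for illustration so that the snippet runs without a model. In the notebook, `doc.ents` could be passed in directly, since the functions only rely on the `.text` and `.label_` attributes used in the loop above.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for spaCy spans so the sketch runs on its own;
# real spans from doc.ents expose the same .text and .label_ attributes.
Ent = namedtuple("Ent", ["text", "label_"])


def entities_to_records(ents):
    """Turn an iterable of entity-like objects into a list of dicts."""
    return [{"text": e.text, "label": e.label_} for e in ents]


def count_entities(records):
    """Count how often each (text, label) pair occurs."""
    return Counter((r["text"], r["label"]) for r in records)


# A few pairs copied from the output above, used as sample input.
sample = [Ent("AI", "ORG"), Ent("India", "GPE"), Ent("AI", "ORG"), Ent("2028", "DATE")]
records = entities_to_records(sample)
counts = count_entities(records)

# Print one tab-separated row per distinct entity, most frequent first.
for (text, label), n in counts.most_common():
    print(f"{text}\t{label}\t{n}")
```

The `records` list can also be handed to `pd.DataFrame(records)` to get a table with `text` and `label` columns.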


spaCy uses a number of methods to classify named entities, one of which draws on part-of-speech (POS) tags and syntactic dependencies. As such, we can also visualize POS tags and dependencies in spaCy.

In [102]:
sent = nlp("This is a sentence.")
displacy.render(sent, style="dep")

# Scattertext

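Scattertext is a visualization tool that draws an interactive scatterplot of the terms in a corpus split into two categories. Each term is positioned by how frequently it occurs in each category, with one axis per category, so terms characteristic of one category end up far from the diagonal.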
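
To make the positioning idea concrete, here is a rough, library-free sketch of how per-category term counts could be turned into plot coordinates with a dense-rank scaling, the same kind of transform the examples later in this notebook request via `st.Scalers.dense_rank`. It is not scattertext's actual implementation: the function `dense_rank_scale`, the toy documents, and the choice of which category goes on which axis are all assumptions made for illustration.

```python
from collections import Counter


def dense_rank_scale(counts):
    """Map each term's count to [0, 1] by dense rank (tied counts share a rank)."""
    distinct = sorted(set(counts.values()))
    rank_of = {value: rank for rank, value in enumerate(distinct)}
    top = max(len(distinct) - 1, 1)  # avoid dividing by zero when all counts are equal
    return {term: rank_of[c] / top for term, c in counts.items()}


# Toy two-category corpus (made-up text, for illustration only).
docs = {
    "a": "jobs jobs jobs taxes economy",
    "b": "taxes taxes economy freedom freedom freedom",
}
vocab = sorted({w for text in docs.values() for w in text.split()})
freq = {cat: Counter(text.split()) for cat, text in docs.items()}

# Give every term a count in both categories (0 if absent), then scale each axis.
x = dense_rank_scale({t: freq["b"][t] for t in vocab})
y = dense_rank_scale({t: freq["a"][t] for t in vocab})

for term in vocab:
    print(f"{term:8s} x={x[term]:.2f} y={y[term]:.2f}")
```

In this toy data, "jobs" lands in one corner (frequent in category "a", absent from "b") and "freedom" in the opposite corner, while "economy" and "taxes" sit closer to the middle.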

### Built-in example: 2012 US political convention speeches

First, we'll work through a modified version of the built-in example from scattertext:


In [101]:
eng_stopwords = set(stopwords.words('english'))

cdata = st.SampleCorpora.ConventionData2012.get_data().assign(
    parse=lambda df: df.text.apply(st.whitespace_nlp_with_sentences)
)

corpus = (
    st.CorpusFromParsedDocuments(cdata, category_col='party', parsed_col='parse')
    .build()
    .remove_terms(eng_stopwords, ignore_absences=True)
    .get_unigram_corpus()
    .compact(st.AssociationCompactor(2000))
)

html = st.produce_scattertext_explorer(
    corpus,
    category='democrat',
    category_name='Democratic',
    not_category_name='Republican',
    minimum_term_frequency=0,
    pmi_threshold_coefficient=0,
    width_in_pixels=1000,
    metadata=corpus.get_df()['speaker'],
    transform=st.Scalers.dense_rank,
    #include_gradient=True,
    #left_gradient_term='More Republican',
    #middle_gradient_term='Metric: Dense Rank Difference',
    #right_gradient_term='More Democratic',
)

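# Write with an explicit encoding so non-ASCII terms don't depend on the platform default
open('./scattertext0.html', 'w', encoding='utf-8').write(html)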

1677778

To view the result, go to the notebook's working folder and open the output file, `scattertext0.html`, in a web browser.
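
If hunting for the file is inconvenient, one option on a local machine is to save the page through a small helper that returns a `file://` address you can paste into (or open in) a browser. This is only a sketch: `save_html` is a hypothetical helper, not part of scattertext, and the demo below writes a placeholder page to a temporary folder rather than the real explorer output.

```python
import tempfile
from pathlib import Path


def save_html(html, path):
    """Write an HTML string as UTF-8 and return a file:// URI for it."""
    target = Path(path).resolve()
    target.write_text(html, encoding="utf-8")
    return target.as_uri()


# Demo with a placeholder page in a temporary folder; in the notebook,
# pass the `html` string from produce_scattertext_explorer instead.
with tempfile.TemporaryDirectory() as tmp:
    uri = save_html("<html><body>demo</body></html>", Path(tmp) / "demo.html")
    saved_text = (Path(tmp) / "demo.html").read_text(encoding="utf-8")
    print(uri)
    # import webbrowser; webbrowser.open(uri)  # would open the default browser
```

On a hosted notebook service the file lives on the remote machine, so it would still need to be downloaded before it can be opened locally.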

### Scattertext with Reddit data

Let's apply this to our Reddit data! We can compare two different subreddits: GPT3 and MachineLearning.


In [92]:
ml = pd.read_csv("MachineLearning_data.csv")
ml["subreddit"] = "MachineLearning"
gpt3["subreddit"] = "gpt3"

reddit = pd.concat([gpt3, ml], ignore_index=True)[["content","subreddit"]]
reddit = reddit.loc[reddit.content.notnull(),:]
reddit = reddit.assign(
    parse=lambda df: df.content.apply(st.whitespace_nlp_with_sentences)
)

corpus = (
    st.CorpusFromParsedDocuments(reddit, category_col='subreddit', parsed_col='parse')
    .build()
    .remove_terms(eng_stopwords, ignore_absences=True)
    .get_unigram_corpus()
    .compact(st.AssociationCompactor(2000))
)

html = st.produce_scattertext_explorer(
    corpus,
    category='gpt3',
    category_name='gpt3',
    not_category_name='MachineLearning',
    minimum_term_frequency=0,
    pmi_threshold_coefficient=0,
    width_in_pixels=1000,
    transform=st.Scalers.dense_rank,
    #include_gradient=True,
    #left_gradient_term='More MachineLearning',
    #middle_gradient_term='Metric: Dense Rank Difference',
    #right_gradient_term='More gpt3',
)
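
# Write with an explicit encoding so non-ASCII terms don't depend on the platform default
open('./scattertext_reddit.html', 'w', encoding='utf-8').write(html)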

2063632