## Intro to Natural Language Processing in Python

- Making sense of language using statistics and computers
- basics of NLP 
    - topic identification
    - text classification
- NLP applications
    - chatbots
    - translation
    - sentiment analysis

Regular expressions
- strings with a special syntax that lets us match patterns in other strings
- applications
    - find web links in a document
    - parse email addresses
    - remove unwanted characters
        
            import re
            # match a pattern (first argument) against the start of a string (second argument)
            re.match('abc', 'abcdef')

            # \w+ matches one word (a run of letters, digits, or underscores)
            word_regex = r'\w+'
            re.match(word_regex, 'hi there!')

common regex patterns

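        \w+ - match a word (one or more word characters)
        \d - match a digit
        \s - match a whitespace character
        .* - wildcard (any characters, zero or more)
        + or * - greedy quantifiers (one or more / zero or more)
        
        # a capitalized letter negates the class
        \S - not a whitespace character
        [a-z] - any lowercase ASCII letter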
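
A small sketch of these patterns in action, using only the standard `re` module. The sample string is made up for illustration.

```python
import re

sample = "Room 42 has 3 desks"

digits_each = re.findall(r"\d", sample)    # one match per digit
digits_runs = re.findall(r"\d+", sample)   # + groups consecutive digits
non_space = re.findall(r"\S+", sample)     # runs of non-whitespace characters
lowercase = re.findall(r"[a-z]+", sample)  # runs of lowercase ASCII letters
greedy = re.match(r".*", sample).group()   # .* consumes as much as it can

print(digits_each)
print(digits_runs)
print(non_space)
print(lowercase)
print(greedy)
```

Comparing `digits_each` with `digits_runs` shows what the `+` quantifier changes: the same digits are found, but grouped into whole numbers.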
        
Python's re module
        
        import re
        split - split a string on regex
        findall - find all patterns in a string
        search - search for a pattern anywhere in a string
        match - match a pattern at the beginning of a string
        
        return - list, iterator, new string, or match object (depending on the function)
        
        re.split(r'\s+', 'Split on spaces.')
        returns: ['Split', 'on', 'spaces.']
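
The four functions can be compared side by side. This is a minimal sketch on a made-up sentence; the variable names are illustrative only.

```python
import re

text = "Split on spaces. Then count 2 cats."

parts = re.split(r"\s+", text)          # list of substrings
numbers = re.findall(r"\d+", text)      # list of every match
first_hit = re.search(r"cats", text)    # match object, or None if absent
at_start = re.match(r"Split", text)     # match object: pattern fits at index 0
not_at_start = re.match(r"cats", text)  # None: 'cats' is not at index 0

print(parts)
print(numbers)
print(first_hit.span())
print(at_start.group())
print(not_at_start)
```

Because `search` and `match` return `None` when nothing is found, it is usually safer to check the result before calling `.group()` or `.span()` on it.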

In [None]:
my_string = "Let's write RegEx"

In [1]:
re.findall(r'\s+', my_string)
Out[1]:
[' ', ' ']
In [2]:
re.findall(r'\w+', my_string)
Out[2]:
['Let', 's', 'write', 'RegEx']
In [3]:
re.findall(r'[a-z]', my_string)
Out[3]:
['e', 't', 's', 'w', 'r', 'i', 't', 'e', 'e', 'g', 'x']
In [4]:
re.findall(r"\w", my_string)
Out[4]:
['L', 'e', 't', 's', 'w', 'r', 'i', 't', 'e', 'R', 'e', 'g', 'E', 'x']

In [None]:
In [1]:
my_string
Out[1]:
"Let's write RegEx!  Won't that be fun?  I sure think so.  Can you find 4 sentences?  Or perhaps, all 19 words?"

# Write a pattern to match sentence endings: sentence_endings
sentence_endings = r"[.?!]"

# Split my_string on sentence endings and print the result
print(re.split(sentence_endings, my_string))
["Let's write RegEx", "  Won't that be fun", '  I sure think so', '  Can you find 4 sentences', '  Or perhaps, all 19 words', '']

# Find all capitalized words in my_string and print the result
capitalized_words = r"[A-Z]\w+"
print(re.findall(capitalized_words, my_string))
['Let', 'RegEx', 'Won', 'Can', 'Or']

# Split my_string on spaces and print the result
spaces = r"\s+"
print(re.split(spaces, my_string))
["Let's", 'write', 'RegEx!', "Won't", 'that', 'be', 'fun?', 'I', 'sure', 'think', 'so.', 'Can', 'you', 'find', '4', 'sentences?', 'Or', 'perhaps,', 'all', '19', 'words?']

# Find all digits in my_string and print the result
digits = r"\d"
print(re.findall(digits, my_string))
['4', '1', '9']

# Find all digits in my_string and print the result
digits = r"\d+"
print(re.findall(digits, my_string))
['4', '19']

###### Introduction to tokenization
- turning a string or document into smaller chunks (tokens)
- one step in preparing a text for NLP
- can create your own rules using regex
- examples
    - breaking out words or sentences
    - separating punctuation
    - separating all hashtags in a tweet
- why tokenize?
    - easier to map parts of speech, match common words, remove unwanted tokens
   
        from nltk.tokenize import word_tokenize
        word_tokenize('Hi there!')
        OUTPUT: ['Hi', 'there', '!']
        
- other nltk tokenizers
- sent_tokenize: tokenize a doc into sentences
- regexp_tokenize: tokenize a string or doc based on a regex pattern
- TweetTokenizer: special class just for tweet tokenization, allows separating hashtags, mentions, and repeated exclamation points

difference between re.search() and re.match()
- match only matches at the beginning of the string and goes no further than the pattern allows
- search scans the whole string and returns the first match found anywhere
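
A minimal sketch of the difference, using a throwaway string chosen for illustration:

```python
import re

text = "abcdef"

# match is anchored at index 0, so a pattern that starts later fails
match_cd = re.match(r"cd", text)
# search scans forward and returns the first occurrence
search_cd = re.search(r"cd", text)
# when the pattern does start at index 0, both succeed
match_abc = re.match(r"abc", text)

print(match_cd)
print(search_cd.span())
print(match_abc.span())
```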

In [None]:
## Word tokenization with NLTK
# Import necessary modules
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

# Split scene_one into sentences: sentences
sentences = sent_tokenize(scene_one)

# Use word_tokenize to tokenize the fourth sentence: tokenized_sent
tokenized_sent = word_tokenize(sentences[3])

# Make a set of unique tokens in the entire scene: unique_tokens
unique_tokens = set(word_tokenize(scene_one))

# Print the unique tokens result
print(unique_tokens)

{"'em", 'they', 'dorsal', 'found', 'then', 'an', 'Court', 'martin', 'bring', '#', 'temperate', 'It', 'King', 'So', 'covered', 'Who', "'", 'sovereign', 'velocity', 'who', 'am', 'wings', 'anyway', 'just', 'warmer', 'Wait', 'Arthur', 'KING', 'does', "'s", 'could', 'The', 'Are', 'suggesting', 'have', 'kingdom', 'minute', 'SOLDIER', 'carrying', 'halves', 'mean', 'yeah', 'at', 'winter', 'this', 'its', 'may', 'from', 'search', 'Patsy', 'if', '1', ',', 'Where', 'servant', 'Halt', 'feathers', 'defeator', 'through', 'bird', 'maybe', '?', 'wants', 'guiding', 'why', 'is', 'not', 'Supposing', 'go', 'ARTHUR', '.', 'Well', 'European', 'agree', 'ratios', 'needs', 'son', 'five', 'weight', 'Pendragon', 'creeper', 'Not', 'swallow', 'Listen', 'be', 'zone', 'husk', 'right', 'every', 'wind', 'Found', 'your', 'African', 'swallows', 'SCENE', 'But', 'They', "'re", 'simple', 'length', 'course', 'No', 'bangin', 'are', 'strand', 'grip', 'tell', '[', 'he', '2', 'carried', 'breadth', 'under', 'must', 'in', 'my', 'a', 'tropical', 'carry', 'held', 'strangers', 'matter', 'there', 'land', 'question', 'fly', "'m", 'Britons', 'other', 'climes', 'will', 'England', '--', 'We', 'A', 'Whoa', 'ridden', 'me', ':', 'them', 'get', 'plover', 'Camelot', 'master', 'maintain', 'clop', 'using', 'yet', 'Saxons', 'but', 'Pull', 'seek', 'use', 'pound', 'times', 'That', 'ounce', 'sun', 'order', 'castle', 'these', "'d", 'migrate', 'Am', 'second', 'all', 'point', 'you', 'or', '!', 'Please', 'I', 'Will', 'Uther', 'where', 'snows', 'You', 'In', ']', 'the', 'it', 'together', 'house', 'here', 'since', 'coconuts', 'on', 'speak', 'What', 'ask', 'empty', 'line', 'trusty', 'coconut', 'and', 'knights', 'two', 'Oh', 'beat', 'grips', 'of', 'Mercea', 'south', 'horse', '...', 'got', "n't", 'Yes', "'ve", 'goes', 'court', 'our', 'to', 'forty-three', 'air-speed', 'one', 'non-migratory', 'by', 'do', 'interested', 'with', 'back', 'join', 'Ridden', 'lord', 'that'}

In [None]:
## More regex with re.search()

# Search for the first occurrence of "coconuts" in scene_one: match
match = re.search("coconuts", scene_one)

# Print the start and end indexes of match
print(match.start(), match.end())

OUTPUT: 
    580 588
    
# Write a regular expression to search for anything in square brackets: pattern1
pattern1 = r"\[.*]"

# Use re.search to find the first text in square brackets
print(re.search(pattern1, scene_one))

OUTPUT:
    <_sre.SRE_Match object; span=(9, 32), match='[wind] [clop clop clop]'>

    
# Find the script notation at the beginning of the fourth sentence and print it
# Create a pattern to match the script notation (e.g. Character:), assigning the result to pattern2. 
# Remember that you will want to match any words or spaces that precede the : 
# (such as the space within SOLDIER #1:).
pattern2 = r"[\w\s]+:"
print(re.match(pattern2, sentences[3]))

OUTPUT:
    <_sre.SRE_Match object; span=(0, 7), match='ARTHUR:'>

###### Advanced tokenization with regex

- OR is expressed with |
- define a group using ()
- define explicit character ranges using []
        
        import re
        match_digits_and_words = r'(\d+|\w+)'
        re.findall(match_digits_and_words, 'He has 11 cats.')
        RETURNS: ['He', 'has', '11', 'cats']

![image.png](attachment:image.png)

Note that the hyphen, period, and comma need the OR character to look for those special characters

In this example, match is used with a character range that covers lowercase ASCII letters, digits, and spaces.
- It is greedy because of the +
- once it hits the comma in the string it can't match any further

        import re
        my_str = 'match lowercase spaces nums like 12, but no commas'
        re.match(r'[a-z0-9 ]+', my_str)
        Returns: match object with match='match lowercase spaces nums like 12'
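
A runnable version of the character-range example, plus a second range that also admits a hyphen and a period. The second string is made up for illustration. Inside square brackets a period is already literal; the backslashes before `-` and `.` are a defensive habit that keeps the hyphen from being read as a range operator.

```python
import re

my_str = "match lowercase spaces nums like 12, but no commas"
# Greedy match over lowercase letters, digits and spaces; stops at the comma
prefix = re.match(r"[a-z0-9 ]+", my_str).group()

text = "visit my-site.com, now"
# Lowercase letters plus a literal hyphen and a literal period
tokens = re.findall(r"[a-z\-\.]+", text)

print(prefix)
print(tokens)
```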

In [None]:
## Choosing a tokenizer to retain sentence punctuation as separate tokens, 
# but have '#1' remain a single token.

pattern1 = r"(\w+|\?|!)"
regexp_tokenize(my_string, pattern1)
Out[13]:
['SOLDIER',
 '1',
 'Found',
 'them',
 '?',
 'In',
 'Mercea',
 '?',
 'The',
 'coconut',
 's',
 'tropical',
 '!']

pattern2 = r"(\w+|#\d|\?|!)"
regexp_tokenize(my_string, pattern2)
Out[14]:

['SOLDIER',
 '#1',
 'Found',
 'them',
 '?',
 'In',
 'Mercea',
 '?',
 'The',
 'coconut',
 's',
 'tropical',
 '!']

pattern3 = r"(#\d\w+\?!)"
regexp_tokenize(my_string, pattern3)
Out[15]:
[]

pattern4 = r"\s+"
regexp_tokenize(my_string, pattern4)
Out[16]:
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']

In [None]:
## Regex with NLTK tokenization

# Import the necessary modules
from nltk.tokenize import regexp_tokenize, TweetTokenizer

# Define a regex pattern to find hashtags: pattern1
pattern1 = r"#\w+"
# Use the pattern on the first tweet in the tweets list
hashtags = regexp_tokenize(tweets[0], pattern1)
print(hashtags)

OUTPUT: 
    ['#nlp', '#python']
    
# Write a pattern that matches both mentions (@) and hashtags (#)
# Inside [], | is a literal character, so [@#] is enough to mean "@ or #"
pattern2 = r"([@#]\w+)"
# Use the pattern on the last tweet in the tweets list
mentions_hashtags = regexp_tokenize(tweets[-1], pattern2)
print(mentions_hashtags)

OUTPUT:
    ['@datacamp', '#nlp', '#python']
    
# Use the TweetTokenizer to tokenize all tweets into one list
# Create an instance of TweetTokenizer called tknzr and use it inside a list comprehension 
# to tokenize each tweet into a new list called all_tokens.
# To do this, use the .tokenize() method of tknzr, with t as your iterator variable.
tknzr = TweetTokenizer()
all_tokens = [tknzr.tokenize(t) for t in tweets]
print(all_tokens)

OUTPUT: 
    
[['This', 'is', 'the', 'best', '#nlp', 'exercise', 'ive', 'found', 'online', '!', '#python'], ['#NLP', 'is', 'super', 'fun', '!', '<3', '#learning'], ['Thanks', '@datacamp', ':)', '#nlp', '#python']]


In [None]:
## Non-ascii tokenization

# Tokenize and print all words in german_text
all_words = word_tokenize(german_text)
print(all_words)

# Tokenize and print only capital words
capital_words = r"[A-ZÜ]\w+"
print(regexp_tokenize(german_text, capital_words))

# Tokenize and print only emoji
emoji = "[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]"
print(regexp_tokenize(german_text, emoji))

OUTPUT:
['Wann', 'gehen', 'wir', 'Pizza', 'essen', '?', '🍕', 'Und', 'fährst', 'du', 'mit', 'Über', '?', '🚕']
['Wann', 'Pizza', 'Und', 'Über']
['🍕', '🚕']

###### Charting word length with nltk

use matplotlib to plot a histogram
        
        import matplotlib.pyplot as plt
        from nltk.tokenize import word_tokenize
        words = word_tokenize("This is a pretty cool tool!")
        word_lengths = [len(w) for w in words]
        plt.hist(word_lengths)
        plt.show()

In [None]:
# Split the script into lines: lines
# Use the .split() method on holy_grail with the newline character ('\n') as the argument.
#lines = re.split('\n', holy_grail)
lines = holy_grail.split('\n')

# Replace all script lines for speaker
# re.sub() requires 3 arguments: The pattern, the replacement, and the string. 
# The pattern is given for you; the replacement is '' and the string is l.
pattern = r"[A-Z]{2,}(\s)?(#\d)?([A-Z]{2,})?:"
lines = [re.sub(pattern, '', l) for l in lines]

# Tokenize each line: tokenized_lines
# Use regexp_tokenize() as the output expression of your list comprehension, 
# with s and "\w+" as the arguments.
tokenized_lines = [regexp_tokenize(s,r'\w+') for s in lines]

# Make a frequency list of lengths: line_num_words
# To create line_num_words, use len(t_line) as the output expression of the list comprehension.
line_num_words = [len(t_line) for t_line in tokenized_lines]

# Plot a histogram of the line lengths
plt.hist(line_num_words)

# Show the plot
plt.show()
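
The same pipeline can be tried without the full script or matplotlib. This is a sketch under stated assumptions: the three lines are made up in the style of the script, `re.findall` stands in for `regexp_tokenize`, and a `Counter` of line lengths stands in for the histogram.

```python
import re
from collections import Counter

# Hypothetical sample lines, not taken from the real holy_grail text
lines = [
    "KING ARTHUR: Whoa there!",
    "SOLDIER #1: Halt! Who goes there?",
    "[clop clop clop]",
]

# Same speaker pattern as in the exercise above
pattern = r"[A-Z]{2,}(\s)?(#\d)?([A-Z]{2,})?:"
stripped = [re.sub(pattern, "", line) for line in lines]

# Tokenize each line into words and count the words per line
tokenized_lines = [re.findall(r"\w+", line) for line in stripped]
line_num_words = [len(t_line) for t_line in tokenized_lines]

# Frequency of each line length (the data a histogram would bin)
length_counts = Counter(line_num_words)

print(stripped)
print(line_num_words)
print(length_counts)
```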

## CHAPTER 2 - Simple Topic Identification

###### Word Counts with bag-of-words

Bag-of-words
- basic method for finding topics in a text
- create tokens using tokenization
- count tokens
- the more frequent a word is, the more significant it is likely to be in the text
        
        from nltk.tokenize import word_tokenize
        from collections import Counter
        counter = Counter(word_tokenize("""The cat is in the box. The cat likes the box. The box is over the cat."""))
        OUTPUT: Returns a Counter object, similar to a dictionary
        Counter({'.':3, 
                 'The':3,
                 'box':3,
                 'cat':3,
                 'in':1,
                 ...
                 'the':3})
'The' and 'the' have separate entries; this can be corrected by lowercasing the text before tokenizing
        
        counter.most_common(2)
        #returns two most common words
        #list of tuples
        [('The', 3),('box', 3)]

In [None]:
# Import Counter
from collections import Counter

# Tokenize the article: tokens
tokens = word_tokenize(article)

# Convert the tokens into lowercase: lower_tokens
lower_tokens = [t.lower() for t in tokens]

# Create a Counter with the lowercase tokens: bow_simple
bow_simple = Counter(lower_tokens)

# Print the 10 most common tokens
print(bow_simple.most_common(10))

OUTPUT:
    [(',', 151), ('the', 150), ('.', 89), ('of', 81), ("''", 68), ('to', 63), ('a', 60), ('in', 44), ('and', 41), ('debugging', 40)]

###### Simple text preprocessing

preprocessing helps make better input data
- tokenization and lowercasing are types of preprocessing for NLP
- also lemmatization/stemming: shortening words to their root stems
- removing stop words, punctuation, or unwanted tokens

        from collections import Counter
        from nltk.corpus import stopwords
        from nltk.tokenize import word_tokenize
        text = """The cat is in the box. The cat likes the box. The box is over the cat."""
        # keep only alphabetic strings (drops numbers and punctuation)
        tokens = [w for w in word_tokenize(text.lower()) if w.isalpha()]
        no_stops = [t for t in tokens if t not in stopwords.words('english')]
        Counter(no_stops).most_common(2)
        
        Output: [('cat',3), ('box',3)]
        # removing stop words and non-alphabetic tokens gives more useful counts
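
For a version that runs without downloading NLTK data, the same steps can be sketched with the standard library. Two assumptions to note: `re.findall` stands in for `word_tokenize`, and `stop_words` is a tiny hand-picked set, not NLTK's English stop word list.

```python
import re
from collections import Counter

text = """The cat is in the box. The cat likes the box. The box is over the cat."""

# Hypothetical stop word list covering only this example
stop_words = {"the", "is", "in", "over"}

# Lowercase, tokenize, and keep alphabetic tokens only
tokens = [t for t in re.findall(r"\w+", text.lower()) if t.isalpha()]

# Drop stop words, then count what remains
no_stops = [t for t in tokens if t not in stop_words]
top_two = Counter(no_stops).most_common(2)

print(top_two)
```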

In [None]:
## Text preprocessing

# Import WordNetLemmatizer
from nltk.stem import WordNetLemmatizer

# Retain alphabetic words: alpha_only
alpha_only = [t for t in lower_tokens if t.isalpha()]

# Remove all stop words: no_stops
no_stops = [t for t in alpha_only if t not in english_stops]

# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Lemmatize all tokens into a new list: lemmatized
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]

# Create the bag-of-words: bow
bow = Counter(lemmatized)

# Print the 10 most common tokens
print(bow.most_common(10))

OUTPUT:
[('debugging', 40), ('system', 25), ('software', 16), ('bug', 16), ('problem', 15), ('tool', 15), ('computer', 14), ('process', 13), ('term', 13), ('used', 12)]


###### Introduction to Gensim

- open source NLP library
- uses top academic models to perform complex tasks
    - building document or word vectors
    - Word vectors are multi-dimensional mathematical representations of words created using deep learning methods. They give us insight into relationships between words in a corpus.
    - performing topic identification and document comparison
![image.png](attachment:image.png)

can see relationships between words, similar comparisons, etc
- build corpora (singular: corpus), sets of texts used to perform NLP tasks
            
            from gensim.corpora.dictionary import Dictionary
            from nltk.tokenize import word_tokenize
            my_documents = [...]  # placeholder: a list of movie review strings
            tokenized_docs = [word_tokenize(doc.lower()) for doc in my_documents]
            # dictionary class, beginning of corpus
            dictionary = Dictionary(tokenized_docs)
            # look at token ids
            dictionary.token2id
            
            corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
            corpus
            # returns list of lists of tuples
            # each list represents a document
            # each tuple represents the token ID from the dictionary and the token frequency in the document

Gensim models can be easily saved, updated, and reused
Dictionary can also be updated
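
To make the shape of `token2id` and the bag-of-words corpus concrete, here is a pure-Python sketch of the idea. It is an illustration only, not gensim's implementation: the documents are made up, and ids are assigned in first-seen order, which may differ from the ids gensim assigns.

```python
from collections import Counter

# Hypothetical, already tokenized documents
tokenized_docs = [
    ["the", "movie", "was", "great"],
    ["the", "plot", "was", "thin", "but", "the", "cast", "was", "great"],
]

# Give each new token the next free integer id
token2id = {}
for doc in tokenized_docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))


def doc2bow(doc):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())


# One list of (token_id, count) tuples per document
corpus = [doc2bow(doc) for doc in tokenized_docs]

print(token2id)
print(corpus)
```

Each inner list describes one document, and each tuple pairs a token id with the number of times that token occurs in the document.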

In [None]:
## Creating and querying a corpus with gensim
## documents already preprocessed by lowercasing all words, tokenizing them, and removing stop words and punctuation

# Import Dictionary
from gensim.corpora.dictionary import Dictionary

# Create a Dictionary from the articles: dictionary
dictionary = Dictionary(articles)

# Select the id for "computer": computer_id
computer_id = dictionary.token2id.get("computer")

# Use computer_id with the dictionary to print the word
print(dictionary.get(computer_id))

# Create a MmCorpus: corpus
## Use a list comprehension in which you iterate over articles to create a gensim MmCorpus from dictionary.
## In the output expression, use the .doc2bow() method on dictionary with article as the argument.
corpus = [dictionary.doc2bow(article) for article in articles]

# Print the first 10 word ids with their frequency counts from the fifth document
print(corpus[4][:10])


In [None]:
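## Gensim bag-of-words

# Imports used below (assumed to be preloaded in the original exercise)
from collections import defaultdict
import itertools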

# Save the fifth document: doc
doc = corpus[4]

# Sort the doc for frequency: bow_doc
bow_doc = sorted(doc, key=lambda w: w[1], reverse=True)

# Print the top 5 words of the document alongside the count
for word_id, word_count in bow_doc[:5]:
    print(dictionary.get(word_id), word_count)
    
# Create the defaultdict: total_word_count
## defaultdict allows us to initialize a dictionary that will assign a default value 
## to non-existent keys. By supplying the argument int, we are able to ensure that any 
## non-existent keys are automatically assigned a default value of 0. 
total_word_count = defaultdict(int)

## itertools.chain.from_iterable() allows us to iterate through a set of sequences 
## as if they were one continuous sequence. Using this function, we can easily iterate 
## through our corpus object (which is a list of lists)

## inside the second for loop, increment each word_id of total_word_count by word_count.
for word_id, word_count in itertools.chain.from_iterable(corpus):
    total_word_count[word_id] += word_count
    
    
# Create a sorted list from the defaultdict: sorted_word_count 
sorted_word_count = sorted(total_word_count.items(), key=lambda w: w[1], reverse=True) 

# Print the top 5 words across all documents alongside the count
for word_id, word_count in sorted_word_count[:5]:
    print(dictionary.get(word_id), word_count)
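
The aggregation step can be checked on a toy corpus. In this sketch the ids and counts are made up and no gensim objects are involved.

```python
from collections import defaultdict
import itertools

# Hypothetical corpus: two documents as lists of (word_id, count) tuples
corpus = [[(0, 2), (1, 1)], [(1, 3), (2, 1)]]

# Missing keys start at 0, so += works without an explicit check
total_word_count = defaultdict(int)

# chain.from_iterable flattens the list of lists into one stream of tuples
for word_id, word_count in itertools.chain.from_iterable(corpus):
    total_word_count[word_id] += word_count

# Sort (word_id, total) pairs from most to least frequent
sorted_word_count = sorted(total_word_count.items(), key=lambda w: w[1], reverse=True)

print(sorted_word_count)
```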

###### Tf-idf with gensim

- term frequency - inverse document frequency
- allows you to determine the most important words in each document
- each corpus may have shared words beyond just stopwords
- these words should be down-weighted in importance
- Example from astronomy: 'sky'
- ensures the most common words don't show up as key words

        from gensim.models.tfidfmodel import TfidfModel
        tfidf = TfidfModel(corpus)
        #call on the second document in corpus
        tfidf[corpus[1]]
        #output is token ids and weights
        [(0, 0.17),
         (1, 0.17),
         (9, 0.30),
         (10, 0.77),  #higher weight means higher importance
         ...]
         
tf-idf can be calculated by multiplying term frequency by inverse document frequency
- Term frequency = share of the document's tokens that are the word (count of the word / total tokens in the document)
- Inverse document frequency = logarithm of the total number of documents in the corpus divided by the number of documents containing the term

- ex: calculate the tf-idf weight for the word "computer", which appears five times in a document containing 100 words. Given a corpus containing 200 documents, with 20 documents mentioning the word "computer"
    - tf-idf weight for "computer" = (5 / 100) * log(200 / 20)
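
The worked example can be checked numerically. This sketch assumes the natural logarithm; the base is a convention, and libraries may use a different base or normalize the resulting weights, so their numbers need not match this one. The helper name `tf_idf` is illustrative.

```python
import math


def tf_idf(term_count, doc_length, n_docs, n_docs_with_term):
    """Plain tf-idf: term frequency times log inverse document frequency."""
    tf = term_count / doc_length
    idf = math.log(n_docs / n_docs_with_term)  # natural log (assumption)
    return tf * idf


# "computer": 5 occurrences in a 100-word document,
# found in 20 of the 200 documents in the corpus
weight = tf_idf(5, 100, 200, 20)
print(round(weight, 4))  # 0.1151
```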

In [None]:
## Tfidf with Wikipedia

# Create a new TfidfModel using the corpus: tfidf
from gensim.models.tfidfmodel import TfidfModel
tfidf = TfidfModel(corpus)

# Calculate the tfidf weights of doc: tfidf_weights
tfidf_weights = tfidf[doc]

# Print the first five weights
print(tfidf_weights[:5])

OUTPUT:
    [(24, 0.0022836332291091273), (39, 0.0043409401554717324), (41, 0.008681880310943465), (55, 0.011988285029371418), (56, 0.005482756770026296)]
    
# Sort the weights from highest to lowest: sorted_tfidf_weights
sorted_tfidf_weights = sorted(tfidf_weights, key=lambda w: w[1], reverse=True)

# Print the top 5 weighted words
for term_id, weight in sorted_tfidf_weights[:5]:
    print(dictionary.get(term_id), weight)
    
OUTPUT:
reverse 0.4884961428651127
infringement 0.18674529210288995
engineering 0.16395041814479536
interoperability 0.12449686140192663
reverse-engineered 0.12449686140192663

## CHAPTER 3 Named Entity Recognition (NER)

- NLP task to identify important named entities in text
    - people, places, organizations, dates, states, titles, and other categories
- can be used alongside topic identification or alone
- answer basic natural language questions - who? what? when? where?
![image.png](attachment:image.png)

- The Stanford CoreNLP library
    - NLTK allows interaction with NER via its own model and also the Stanford library
    - requires installing required Java files (Java based)
    - setting system environment variables
    - integrated into Python via nltk module, or can be used alone or operated as an API server
    - good support for:
        - named entity recognition
        - coreference (or linking pronouns and entities together)
        - dependency trees (help with parsing meaning/relationships amongst words or phrases in sentence)
        
Using nltk for Named Entity Recognition (NER)

        import nltk
        sentence = '''In New York, I like to ride the Metro to visit MOMA 
                      and some restaurants rated well by Ruth Reichl.'''
                      
        #preprocess with tokenization
        tokenized_sent = nltk.word_tokenize(sentence)
        
        # add tags for proper nouns, pronouns, adjectives, verbs, other english grammar
        tagged_sent = nltk.pos_tag(tokenized_sent)
        
        #look at tags
        #NNP = proper noun, singular
        tagged_sent[:3]
        [('In', 'IN'), ('New', 'NNP'), ('York', 'NNP')]
        
        #pass tagged sentence into (named entity chunk) ne_chunk function
        print(nltk.ne_chunk(tagged_sent))
        
        #returns sentence as a tree with leaves and subtrees representing more complex grammar
        #uses trained statistical and grammatical parsers (does not reference a database)
![image-2.png](attachment:image-2.png)

In [None]:
## NER with NLTK

# Tokenize the article into sentences: sentences
sentences = nltk.sent_tokenize(article)

# Tokenize each sentence into words: token_sentences
token_sentences = [nltk.word_tokenize(sent) for sent in sentences]

# Tag each tokenized sentence into parts of speech: pos_sentences
pos_sentences = [nltk.pos_tag(sent) for sent in token_sentences] 

# Create the named entity chunks: chunked_sentences
chunked_sentences = nltk.ne_chunk_sents(pos_sentences, binary=True)

# Test for stems of the tree with 'NE' tags
for sent in chunked_sentences:
    for chunk in sent:
        if hasattr(chunk, "label") and chunk.label() == "NE":
            print(chunk)
            
OUTPUT:
(NE Uber/NNP)
(NE Beyond/NN)
(NE Apple/NNP)
(NE Uber/NNP)
(NE Uber/NNP)
(NE Travis/NNP Kalanick/NNP)
(NE Tim/NNP Cook/NNP)
(NE Apple/NNP)
(NE Silicon/NNP Valley/NNP)
(NE CEO/NNP)
(NE Yahoo/NNP)
(NE Marissa/NNP Mayer/NNP)

In [None]:
## Charting practice

# Create the defaultdict: ner_categories
ner_categories = defaultdict(int)

# Create the nested for loop
## Fill up the dictionary with values for each of the keys. 
## Remember, the keys will represent the label().
## In the outer for loop, iterate over chunked_sentences, using sent as your iterator variable.
## In the inner for loop, iterate over sent. 
## If the condition is true, increment the value of each key by 1.
## Remember to use the chunk's .label() method as the key!

for sent in chunked_sentences:
    for chunk in sent:
        if hasattr(chunk, 'label'):
            ner_categories[chunk.label()] += 1
            
# Create a list from the dictionary keys for the chart labels: labels
## For the pie chart labels, create a list called labels 
## from the keys of ner_categories, which can be accessed using .keys().
labels = list(ner_categories.keys())

# Create a list of the values: values
values = [ner_categories.get(v) for v in labels]

## Use plt.pie() to create a pie chart for each of the NER categories. 
## Along with values and labels=labels, pass the extra keyword arguments 
## autopct='%1.1f%%' and startangle=140 to add percentages to the chart and rotate the initial start angle.
plt.pie(values, labels=labels, autopct='%1.1f%%', startangle=140)

# Display the chart
plt.show()

![image.png](attachment:image.png)

###### Introduction to SpaCy

- NLP library similar to gensim, with different implementations
- focus on creating NLP pipelines to generate models and corpora
- open source with extra libraries and tools
    - ex: Displacy - a visualization tool for viewing parse trees and interactive text
- to use SpaCy for NER, need to download pre-trained word vectors

        import spacy
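        nlp = spacy.load('en')  # older spaCy shortcut; spaCy 3+ expects a full model name such as 'en_core_web_sm'
        nlp.entity # entity recognizer object to find entities in text (spaCy 3+: nlp.get_pipe('ner'))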
        
        doc = nlp("""Berlin is the capital of Germany; 
                     and the residence of Chancellor Angela Merkel.""")
        #document attributes stored as attribute .ents
        doc.ents
        #returns main entities in text
        (Berlin, Germany, Angela Merkel)
        
        #use indexing to investigate the label of each entity
        print(doc.ents[0], doc.ents[0].label_)
        Berlin GPE  #Geopolitical Entity
        
- integrates with other great features like pipeline creation
- different entity types compared to nltk
- informal language corpora (good for tweets and chat messages)
- quickly growing

In [None]:
## Comparing NLTK with spaCy NER
## to minimize execution times, 
## specify the keyword arguments tagger=False, parser=False, matcher=False
## because you only care about the entity in this exercise.

# Import spacy
import spacy

# Instantiate the English model: nlp
nlp = spacy.load('en', tagger=False, parser=False, matcher=False)

# Create a new document: doc
doc = nlp(article)

# Print all of the found entities and their labels
for ent in doc.ents:
    print(ent.label_, ent.text)
    
OUTPUT:
ORG Uber
ORG Uber
ORG Apple
ORG Uber
ORG Uber
PERSON Travis Kalanick
ORG Uber
PERSON Tim Cook
ORG Apple
CARDINAL Millions
ORG Uber
GPE drivers’
LOC Silicon Valley’s
ORG Yahoo
PERSON Marissa Mayer
MONEY $186m

###### Multilingual NER with polyglot

multilingual NER with NLP library, polyglot
- uses word vectors to perform simple tasks like entity recognition


why polyglot?
- main benefit is the wide variety of languages (>130 languages)
- can be used for transliteration (converting text by swapping characters from one language's script to another's)
![image-2.png](attachment:image-2.png)


In [None]:
## French NER with polyglot
## the Text class of polyglot has been imported from polyglot.text.

# Create a new text object using Polyglot's Text class: txt
txt = Text(article)

# Print each of the entities found
for ent in txt.entities:
    print(ent)
OUTPUT:
['Charles', 'Cuvelliez']
['Charles', 'Cuvelliez']
['Bruxelles']
['l’IA']
['Julien', 'Maldonato']
['Deloitte']
['Ethiquement']
['l’IA']
['.']

# Print the type of ent
print(type(ent))
OUTPUT:
    polyglot.text.Chunk
    
# Create the list of tuples: entities
entities = [(ent.tag, ' '.join(ent)) for ent in txt.entities]

# Print entities
print(entities)

OUTPUT:
[('I-PER', 'Charles Cuvelliez'), ('I-PER', 'Charles Cuvelliez'), ('I-ORG', 'Bruxelles'), ('I-PER', 'l’IA'), ('I-PER', 'Julien Maldonato'), ('I-ORG', 'Deloitte'), ('I-PER', 'Ethiquement'), ('I-LOC', 'l’IA'), ('I-PER', '.')]

In [None]:
## Spanish NER with polyglot
## The Text object has been created as txt, and each entity has been printed
## Your specific task is to determine how many of the entities contain the words "Márquez" or "Gabo" 
['Lina']
['Castillo']
['Teresa', 'Lozano', 'Long']
['Universidad', 'de', 'Texas']
['Austin']
...

# Initialize the count variable: count
count = 0

# Iterate over all the entities
for ent in txt.entities:
    # Check whether the entity contains 'Márquez' or 'Gabo'
    if "Márquez" in ent or "Gabo" in ent: 
        # Increment count
        count += 1

# Print count
print(count)

# Calculate the percentage of entities that refer to "Gabo": percentage
percentage = count / len(txt.entities)
print(percentage)

29
0.2959

## CHAPTER 4 - Building a supervised learning classifier

###### Classifying fake news using supervised learning with NLP

- use scikit-learn to create supervised learning data from text with bag-of-words model or tf-idf as features
- collect and preprocess data
- determine label
- split into training and testing sets
- extract features from text to predict label using bow
- evaluate trained model using test set

###### Building word count vectors with scikit-learn

- predicting movie genre
- goal: create bag-of-words vectors for the movie plots to see if we can predict genre based on words in the plot summary

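        import pandas as pd
        from sklearn.model_selection import train_test_split
        from sklearn.feature_extraction.text import CountVectorizer
        df = ...  # placeholder: movie data loaded with pandas
        X = df['plot']
        y = df['Sci-Fi']  #y=labels
        # split into training and test sets (illustrative settings)
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=53)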
        
        # turns text into bag of words vector (similar to Gensim corpus)
        # also removes English stop words
        count_vectorizer = CountVectorizer(stop_words='english')
        
        #each token is now a feature for the ML model
        #generates a mapping of words to IDs, plus vectors
        #that represent how many times each word appears in the plot summary
        count_train = count_vectorizer.fit_transform(X_train.values)
        
        #call transform on the test data to create bag of words using the same dictionary
        # train and test set need to be consistent
        count_test = count_vectorizer.transform(X_test.values)
        
if there is not much data, words in the test set may not appear in the training set; CountVectorizer.transform ignores such unseen words, so they contribute nothing to the prediction
- remedies: add more training data or remove the unknown words from the test set
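
The fit-on-train, transform-on-test idea can be sketched in pure Python. This is a simplified stand-in, not scikit-learn's code: the helper names and sample sentences are hypothetical, stop words are not removed, and the token pattern mirrors what I understand to be `CountVectorizer`'s default (lowercased tokens of two or more word characters). Words unseen during fitting are dropped at transform time.

```python
import re
from collections import Counter

TOKEN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return TOKEN.findall(text.lower())


def fit_vocabulary(train_docs):
    """Map every training token to a column index, in sorted order."""
    tokens = sorted({tok for doc in train_docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def transform(docs, vocabulary):
    """Build one count vector per document using the fitted vocabulary only."""
    rows = []
    for doc in docs:
        counts = Counter(tok for tok in tokenize(doc) if tok in vocabulary)
        rows.append([counts[tok] for tok in vocabulary])
    return rows


train = ["robots invade the moon", "a quiet love story"]
test = ["robots fall in love on mars"]

vocab = fit_vocabulary(train)
count_test = transform(test, vocab)

print(list(vocab))
print(count_test)
```

Only `robots` and `love` from the test sentence land in a column; `fall`, `in`, `on`, and `mars` were never seen in training, so the test vector carries no information about them.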

In [None]:
## CountVectorizer for text classification
## use pandas alongside scikit-learn to create a sparse text vectorizer 
## you can use to train and test a simple supervised model.

# Import the necessary modules
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Print the head of df
print(df.head())

# Create a series to store the labels: y
y = df.label

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['text'], y, test_size=0.33, random_state=53)

# Initialize a CountVectorizer object: count_vectorizer
count_vectorizer = CountVectorizer(stop_words='english')

# Transform the training data using only the 'text' column values: count_train 
count_train = count_vectorizer.fit_transform(X_train)

# Transform the test data using only the 'text' column values: count_test 
count_test = count_vectorizer.transform(X_test)

# Print the first 10 features of the count_vectorizer
# (scikit-learn 1.2+ removed get_feature_names(); use get_feature_names_out() there)
print(count_vectorizer.get_feature_names()[:10])

OUTPUT:

                                                text label  
0  Daniel Greenfield, a Shillman Journalism Fello...  FAKE  
1  Google Pinterest Digg Linkedin Reddit Stumbleu...  FAKE  
2  U.S. Secretary of State John F. Kerry said Mon...  REAL  
3  — Kaydee King (@KaydeeKing) November 9, 2016 T...  FAKE  
4  Its primary day in New York and front-runners...  REAL  

['00', '000', '0000', '00000031', '000035', '00006', '0001', '0001pt', '000ft', '000km']


In [None]:
## TfidfVectorizer for text classification

# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize a TfidfVectorizer object: tfidf_vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)

# Transform the training data: tfidf_train 
tfidf_train = tfidf_vectorizer.fit_transform(X_train)

# Transform the test data: tfidf_test 
tfidf_test = tfidf_vectorizer.transform(X_test)

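# Print the first 10 features
# (get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2)
print(tfidf_vectorizer.get_feature_names_out()[:10].tolist())

# Print the first 5 vectors of the tfidf training data
# (.toarray() is the portable spelling of the older .A shorthand)
print(tfidf_train.toarray()[:5])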

OUTPUT:
['00', '000', '001', '008s', '00am', '00pm', '01', '01am', '02', '024']
[[0.         0.01928563 0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.02895055 0.         ... 0.         0.         0.        ]
 [0.         0.03056734 0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]

In [None]:
## Inspecting the vectors

import pandas as pd

# Create the CountVectorizer DataFrame: count_df
count_df = pd.DataFrame(count_train.toarray(), columns=count_vectorizer.get_feature_names_out())

# Create the TfidfVectorizer DataFrame: tfidf_df
tfidf_df = pd.DataFrame(tfidf_train.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# Print the head of count_df
print(count_df.head())

# Print the head of tfidf_df
print(tfidf_df.head())

# Calculate the difference in columns: difference
difference = set(tfidf_df.columns) - set(count_df.columns)
print(difference)

# Check whether the DataFrames are equal
print(count_df.equals(tfidf_df))


OUTPUT:
    
   000  00am  0600        10  100  107   11  110  1100   12    ...      \
0  0.0   0.0   0.0  0.000000  0.0  0.0  0.0  0.0   0.0  0.0    ...       
1  0.0   0.0   0.0  0.105636  0.0  0.0  0.0  0.0   0.0  0.0    ...       
2  0.0   0.0   0.0  0.000000  0.0  0.0  0.0  0.0   0.0  0.0    ...       
3  0.0   0.0   0.0  0.000000  0.0  0.0  0.0  0.0   0.0  0.0    ...       
4  0.0   0.0   0.0  0.000000  0.0  0.0  0.0  0.0   0.0  0.0    ...       

    younger  youth  youths  youtube  ypg  yuan  zawahiri  zeitung      zero  \
0  0.000000    0.0     0.0      0.0  0.0   0.0       0.0      0.0  0.033579   
1  0.000000    0.0     0.0      0.0  0.0   0.0       0.0      0.0  0.000000   
2  0.000000    0.0     0.0      0.0  0.0   0.0       0.0      0.0  0.000000   
3  0.015175    0.0     0.0      0.0  0.0   0.0       0.0      0.0  0.000000   
4  0.000000    0.0     0.0      0.0  0.0   0.0       0.0      0.0  0.000000   

   zerohedge  
0        0.0  
1        0.0  
2        0.0  
3        0.0  
4        0.0  

[5 rows x 5111 columns]
set()
False

###### Training and testing a classification model with scikit-learn

- Naive Bayes is a common baseline for NLP classification problems because it is grounded in probability
- it answers: given a particular piece of data, how likely is a particular outcome?
    - Ex: if the plot has a spaceship, how likely is it to be sci-fi?
    - Ex: given a spaceship and an alien, how likely is it now to be sci-fi?
- each word from CountVectorizer acts as a feature
- simple and effective, but not always the best option now that other algorithms and neural networks are available
- NB may not work as well with floats (tf-idf weighted inputs), but try it first; SVMs or linear models are alternatives for these inputs
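
The spaceship question above can be worked through by hand. This is a minimal sketch of multinomial Naive Bayes with add-one smoothing on a made-up four-plot corpus; the data and the `predict_proba` helper are illustrative assumptions, not the course dataset or scikit-learn's implementation.

```python
import math
from collections import Counter

# Toy training data (made up for illustration): (plot words, genre)
train = [
    ("spaceship alien planet", "sci-fi"),
    ("spaceship crew alien war", "sci-fi"),
    ("love wedding family", "romance"),
    ("love spaceship letter", "romance"),
]

labels = sorted({label for _, label in train})
vocab = sorted({w for text, _ in train for w in text.split()})

# Per-class word counts and per-class document counts
word_counts = {label: Counter() for label in labels}
doc_counts = Counter()
for text, label in train:
    word_counts[label].update(text.split())
    doc_counts[label] += 1

def predict_proba(words, alpha=1.0):
    """Return P(label | words) under multinomial Naive Bayes with additive smoothing."""
    log_scores = {}
    for label in labels:
        total = sum(word_counts[label].values())
        # Log prior: share of training documents with this label
        score = math.log(doc_counts[label] / len(train))
        for w in words:
            if w not in vocab:
                continue  # ignore words never seen in training
            score += math.log(
                (word_counts[label][w] + alpha) / (total + alpha * len(vocab))
            )
        log_scores[label] = score
    # Normalize so the class probabilities sum to 1
    norm = sum(math.exp(s) for s in log_scores.values())
    return {label: math.exp(s) / norm for label, s in log_scores.items()}

one_word = predict_proba(["spaceship"])
two_words = predict_proba(["spaceship", "alien"])
print(one_word)
print(two_words)
```

On this toy corpus, adding "alien" to "spaceship" raises the sci-fi probability from about 0.58 to about 0.80, because "alien" never appears in a romance plot.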

        from sklearn.naive_bayes import MultinomialNB # also works for multi-class classification
        from sklearn import metrics
        nb_classifier = MultinomialNB()
        nb_classifier.fit(count_train, y_train)
        pred = nb_classifier.predict(count_test)
        metrics.accuracy_score(y_test, pred)
        #fraction of correct guesses out of total guesses
        
        #can also check the confusion matrix
        metrics.confusion_matrix(y_test, pred, labels=[0,1])
        #returns an array; convert to a DataFrame for readability
        #main diagonal shows the correct predictions
        #true labels down the side, predicted labels across the top

In [None]:
## Training and testing the "fake news" model with CountVectorizer

# Import the necessary modules
from sklearn import metrics
from sklearn.naive_bayes import MultinomialNB

# Instantiate a Multinomial Naive Bayes classifier: nb_classifier
nb_classifier = MultinomialNB()

# Fit the classifier to the training data
nb_classifier.fit(count_train, y_train)

# Create the predicted tags: pred
pred = nb_classifier.predict(count_test)

# Calculate the accuracy score: score
score = metrics.accuracy_score(y_test, pred)
print(score)

# Calculate the confusion matrix: cm
cm = metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])
print(cm)

OUTPUT:
0.893352462936394
[[ 865  143]
 [  80 1003]]

In [None]:
## Training and testing the "fake news" model with TfidfVectorizer

# Create a Multinomial Naive Bayes classifier: nb_classifier
nb_classifier = MultinomialNB()

# Fit the classifier to the training data
nb_classifier.fit(tfidf_train, y_train)

# Create the predicted tags: pred
pred = nb_classifier.predict(tfidf_test)

# Calculate the accuracy score: score
score = metrics.accuracy_score(y_test, pred)
print(score)

# Calculate the confusion matrix: cm
cm = metrics.confusion_matrix(y_test, pred, labels=['FAKE', 'REAL'])
print(cm)

OUTPUT:
0.8565279770444764
[[ 739  269]
 [  31 1052]]
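
The two accuracy scores can be recovered from the confusion matrices. A small sketch, assuming scikit-learn's layout where rows are true labels and columns are predicted labels, both in the order `['FAKE', 'REAL']`; `accuracy` and `recall_per_class` are hypothetical helper names.

```python
# Confusion matrices copied from the CountVectorizer and TfidfVectorizer outputs
cm_count = [[865, 143], [80, 1003]]
cm_tfidf = [[739, 269], [31, 1052]]
labels = ["FAKE", "REAL"]

def accuracy(cm):
    """Correct predictions (main diagonal) divided by all predictions."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

def recall_per_class(cm):
    """Share of each true class that was predicted correctly."""
    return {labels[i]: cm[i][i] / sum(cm[i]) for i in range(len(cm))}

print(accuracy(cm_count), recall_per_class(cm_count))
print(accuracy(cm_tfidf), recall_per_class(cm_tfidf))
```

Read this way, the tf-idf model labels more REAL articles correctly (1052 vs. 1003) but mislabels 269 FAKE articles as REAL, compared with 143 for the count-based model, which explains its lower overall accuracy.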

###### Simple NLP, complex problems

Next steps to improve the model:
- Tweaking alpha levels.
- Trying a new classification model.
- Training on a larger dataset.
- Improving text preprocessing.
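
For context on the first item: in `MultinomialNB`, `alpha` is the additive (Laplace/Lidstone) smoothing parameter. The sketch below illustrates the smoothing formula with made-up counts; `smoothed_prob` is a hypothetical helper, not part of scikit-learn.

```python
def smoothed_prob(word_count, total_count, vocab_size, alpha):
    """Estimate P(word | class) with additive smoothing."""
    return (word_count + alpha) / (total_count + alpha * vocab_size)

# Made-up numbers: a class with 1000 word occurrences over a 50-word vocabulary
total_count, vocab_size = 1000, 50

for alpha in (0.1, 0.5, 1.0):
    unseen = smoothed_prob(0, total_count, vocab_size, alpha)      # word never seen in this class
    frequent = smoothed_prob(100, total_count, vocab_size, alpha)  # word seen 100 times
    print(alpha, unseen, frequent)
```

A larger `alpha` shifts probability toward words the class has never seen and flattens the differences between words, so the best value depends on the data.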

In [None]:
## Improving your model

import numpy as np

# Create the list of alphas: alphas
alphas = np.arange(0, 1, .1)

# Define train_and_predict()
def train_and_predict(alpha):
    # Instantiate the classifier: nb_classifier
    nb_classifier = MultinomialNB(alpha=alpha)
    # Fit to the training data
    nb_classifier.fit(tfidf_train, y_train)
    # Predict the labels: pred
    pred = nb_classifier.predict(tfidf_test)
    # Compute accuracy: score
    score = metrics.accuracy_score(y_test, pred)
    return score

# Iterate over the alphas and print the corresponding score
for alpha in alphas:
    print('Alpha: ', alpha)
    print('Score: ', train_and_predict(alpha))
    print()

OUTPUT:
    Alpha:  0.0
    Score:  0.8813964610234337
    
    Alpha:  0.1
    Score:  0.8976566236250598
    
    Alpha:  0.2
    Score:  0.8938307030129125
    
    Alpha:  0.30000000000000004
    Score:  0.8900047824007652
    
    Alpha:  0.4
    Score:  0.8857006217120995
    
    Alpha:  0.5
    Score:  0.8842659014825442
    
    Alpha:  0.6000000000000001
    Score:  0.874701099952176
    
    Alpha:  0.7000000000000001
    Score:  0.8703969392635102
    
    Alpha:  0.8
    Score:  0.8660927785748446
    
    Alpha:  0.9
    Score:  0.8589191774270684
    

In [None]:
## Inspecting your model
## map the important vector weights back to actual words using some simple inspection techniques.

# Get the class labels: class_labels
class_labels = nb_classifier.classes_

# Extract the features: feature_names
feature_names = tfidf_vectorizer.get_feature_names_out()

# Zip the feature names together with the weights and sort by weight: feat_with_weights
# (coef_ was deprecated in scikit-learn 0.24 and later removed from MultinomialNB;
# for two classes, coef_[0] held the same values as feature_log_prob_[1])
feat_with_weights = sorted(zip(nb_classifier.feature_log_prob_[1], feature_names))

# Print the first class label and the 20 lowest-weighted feat_with_weights entries
print(class_labels[0], feat_with_weights[:20])

# Print the second class label and the 20 highest-weighted feat_with_weights entries
print(class_labels[1], feat_with_weights[-20:])

OUTPUT:
FAKE [(-12.641778440826338, '0000'), (-12.641778440826338, '000035'), (-12.641778440826338, '0001'), (-12.641778440826338, '0001pt'), (-12.641778440826338, '000km'), (-12.641778440826338, '0011'), (-12.641778440826338, '006s'), (-12.641778440826338, '007'), (-12.641778440826338, '007s'), (-12.641778440826338, '008s'), (-12.641778440826338, '0099'), (-12.641778440826338, '00am'), (-12.641778440826338, '00p'), (-12.641778440826338, '00pm'), (-12.641778440826338, '014'), (-12.641778440826338, '015'), (-12.641778440826338, '018'), (-12.641778440826338, '01am'), (-12.641778440826338, '020'), (-12.641778440826338, '023')]
REAL [(-6.790929954967984, 'states'), (-6.765360557845786, 'rubio'), (-6.751044290367751, 'voters'), (-6.701050756752027, 'house'), (-6.695547793099875, 'republicans'), (-6.6701912490429685, 'bush'), (-6.661945235816139, 'percent'), (-6.589623788689862, 'people'), (-6.559670340096453, 'new'), (-6.489892292073901, 'party'), (-6.452319082422527, 'cruz'), (-6.452076515575875, 'state'), (-6.397696648238072, 'republican'), (-6.376343060363355, 'campaign'), (-6.324397735392007, 'president'), (-6.2546017970213645, 'sanders'), (-6.144621899738043, 'obama'), (-5.756817248152807, 'clinton'), (-5.596085785733112, 'said'), (-5.357523914504495, 'trump')]
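
The weights printed above are log probabilities, which is why they are all negative. A short sketch of the sort-and-inspect pattern on three (weight, word) pairs copied from the output; `math.exp` converts a log probability back to a probability.

```python
import math

# (weight, word) pairs copied from the output above
pairs = [
    (-5.357523914504495, "trump"),
    (-12.641778440826338, "0000"),
    (-5.596085785733112, "said"),
]

# Tuples sort by their first element, so the lowest weights come first
ranked = sorted(pairs)

for weight, word in ranked:
    print(word, weight, math.exp(weight))
```

The many identical values of -12.64 in the first list are consistent with words that share the same smoothed minimum weight, so that list says little about which words signal FAKE; the high-weight end is the more informative one.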