# Introduction

This notebook demonstrates basic NLP techniques on Bills data. It was prepared for the NLP demo presentation:

https://docs.google.com/presentation/d/11fhVC7oKYUiyp7u2ZGQo-Ns1KYg9zo-ecAGVqzzHGp0/edit

The code is based on the techniques in [this TF-IDF notebook](https://github.com/datakind/NLP_Social_Sector/blob/master/notebooks/tf-idf.ipynb).

In addition, some visualizations were taken from:
    
https://spacy.io/usage/visualizers


In [175]:
import os
import sys

#!{sys.executable} -m pip install sumy
#!{sys.executable} -m pip install spacy

import re
import sumy
import json
import spacy
from spacy import displacy  # used by the displacy.render calls below
import requests
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib as mpl
from zipfile import ZipFile
import plotly.express as px
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.summarizers.lex_rank import LexRankSummarizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

#!{sys.executable} -m pip install -U spacy
!{sys.executable} -m spacy download en_core_web_sm
import en_core_web_sm
#import custom_ner_model

import nltk
nltk.download('punkt')
nltk.download('vader_lexicon')


print("Done")

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
Done


[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [4]:
def preprocess_tokenize(raw_text, spacy_tokenizer, lemmatize=True):
    """ Preprocess and tokenize for TF-IDF, Topic Modeling, Classification.
    Removes characters that shouldn't have been scraped.
    Sets everything to lower case.
    Removes stopwords and non alpha tokens in spacy case.

    Parameters
    ----------
    raw_text : str
        raw bill text

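    spacy_tokenizer: spacy Language pipeline
        Loaded spaCy pipeline (e.g. en_core_web_sm.load()), optionally with additional stopwords

    lemmatize: bool
        If True, return lemmas (token.lemma_); otherwise return the token text (token.text)

    Returns
    -------
    list of str
        Lemmas or token texts of the alphabetic, non-stopword, non-numeric tokens
        https://spacy.io/api/token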

    """
    # lowercase
    raw_text = raw_text.lower()
    # Remove literal "\n" and "\u..." escape sequences left over from scraping
    raw_text = re.sub(r'\\n', ' ', raw_text)
    raw_text = re.sub(r'\\u[0-9]*b?', ' ', raw_text)
    doc = spacy_tokenizer(raw_text)
    if lemmatize:
        return [t.lemma_ for t in doc if (t.is_alpha and not (t.is_stop or t.like_num))]
    else:
        return [t.text for t in doc if (t.is_alpha and not (t.is_stop or t.like_num))]
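
To see what the two `re.sub` calls in `preprocess_tokenize` do on their own, here is a minimal sketch that uses only the standard library. The helper name `clean_scraped_text` and the sample string are assumptions made for illustration; they are not part of the original notebook.

```python
import re

def clean_scraped_text(raw_text):
    """Hypothetical helper: the lowercase + regex steps from preprocess_tokenize."""
    raw_text = raw_text.lower()
    # Replace a literal backslash followed by "n" (an escaped newline left by scraping)
    raw_text = re.sub(r'\\n', ' ', raw_text)
    # Replace a literal "\u" followed by digits and an optional trailing "b" (e.g. "\u200b")
    raw_text = re.sub(r'\\u[0-9]*b?', ' ', raw_text)
    return raw_text

# Made-up sample in which the escape sequences appear as literal characters
sample = r"Section 1.\nThe committee\u200b shall meet."
cleaned = clean_scraped_text(sample)
print(cleaned.split())
```

Note that the second pattern only consumes digits after `\u`, so an escape containing hex letters, such as a literal `\u00a0`, would leave `a0` behind. Whether that matters depends on what the scraped bill text actually contains.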


## Set our test text

In [190]:
# Set the text we want to process
text = """Mental health and behavioral problems in New Hampshire children 
and students, as studied in June.""".replace("\n","")

In [197]:
# Load the spaCy English pipeline
nlp = en_core_web_sm.load()


## Stopwords removal

In [203]:
# Show stop words removal
nlp = en_core_web_sm.load()
text_no_stopwords = nlp(text)
text_no_stopwords = [t.text for t in text_no_stopwords if (t.is_alpha and not (t.is_stop or t.like_num))]
text_no_stopwords = ' '.join(text_no_stopwords)

print("Before")
print(text)

print("\nAfter")
print(text_no_stopwords)


Before
Mental health and behavioral problems in New Hampshire children and students, as studied in June.

After
Mental health behavioral problems Hampshire children students studied June
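
The filtering step can be sketched without spaCy to show the mechanism. The stop list below is a tiny hypothetical subset chosen to reproduce the output above; spaCy's real list is much larger, and its tokenizer is more careful than the regular expression used here.

```python
import re

# Hypothetical, tiny stop list for illustration only
STOP_WORDS = {"and", "in", "as", "new"}

text = ("Mental health and behavioral problems in New Hampshire children "
        "and students, as studied in June.")

# Keep alphabetic tokens only, then drop any whose lowercase form is a stop word
tokens = re.findall(r"[A-Za-z]+", text)
kept = [t for t in tokens if t.lower() not in STOP_WORDS]
print(" ".join(kept))
```

One side effect is visible in the notebook output: "New" is treated as a stop word, so "New Hampshire" is reduced to "Hampshire". If place names matter for later steps, the stop list may need adjusting.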


## Lemmatization

In [205]:
# Show lemmatizing
nlp = en_core_web_sm.load()
lemmas = nlp(text)
lemmas = [t.lemma_ for t in lemmas if (t.is_alpha and not (t.is_stop or t.like_num))]
lemmas = " ".join(lemmas)

print("Before")
print(text)

print("\nAfter")
print(lemmas)

Before
Mental health and behavioral problems in New Hampshire children and students, as studied in June.

After
mental health behavioral problem Hampshire child student study June
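
As a rough illustration of what lemmatization does, the sketch below maps inflected forms to their base forms with a plain dictionary. The table is hypothetical and built only from the word pairs visible in the output above. spaCy's own lemmatizer is more involved, so this is not how the notebook computes its lemmas.

```python
# Hypothetical lookup table built from the word pairs in the output above
LEMMA_TABLE = {
    "problems": "problem",
    "children": "child",
    "students": "student",
    "studied": "study",
}

def toy_lemmatize(tokens):
    """Map each token to its lemma if known, otherwise leave it unchanged."""
    return [LEMMA_TABLE.get(t.lower(), t) for t in tokens]

tokens = "Mental health behavioral problems Hampshire children students studied June".split()
print(" ".join(toy_lemmatize(tokens)))
```

Unlike spaCy, this toy version does not lowercase "Mental", and it cannot handle any word missing from the table.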


## Part of speech (POS)

In [206]:
nlp = en_core_web_sm.load()
doc = nlp(text)
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

displacy.render(doc, style="dep", jupyter=True)

Mental mental ADJ JJ amod Xxxxx True False
health health NOUN NN nmod xxxx True False
and and CCONJ CC cc xxx True True
behavioral behavioral ADJ JJ conj xxxx True False
problems problem NOUN NNS ROOT xxxx True False
in in ADP IN prep xx True True
New New PROPN NNP compound Xxx True True
Hampshire Hampshire PROPN NNP compound Xxxxx True False
children child NOUN NNS pobj xxxx True False
and and CCONJ CC cc xxx True True
students student NOUN NNS conj xxxx True False
, , PUNCT , punct , False False
as as ADP IN mark xx True True
studied study VERB VBN advcl xxxx True False
in in ADP IN prep xx True True
June June PROPN NNP pobj Xxxx True False
. . PUNCT . punct . False False
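
The sixth column above (`token.shape_`) is the least self-explanatory. The sketch below approximates how such a shape string can be derived: letters become `X` or `x`, digits become `d`, and runs of the same symbol are capped at four. The function `word_shape` is an approximation written for illustration, not spaCy's implementation.

```python
def word_shape(text):
    """Approximate a spaCy-style shape string for a single token."""
    shape = []
    last = ""
    run = 0
    for ch in text:
        if ch.isalpha():
            mapped = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            mapped = "d"
        else:
            mapped = ch  # punctuation is kept as-is
        # Count how long the current run of identical symbols is
        run = run + 1 if mapped == last else 1
        last = mapped
        # Emit at most four symbols per run
        if run <= 4:
            shape.append(mapped)
    return "".join(shape)

for word in ["Mental", "health", "and", "New", "June", ","]:
    print(word, word_shape(word))
```

This explains why "Mental" and "Hampshire" share the shape `Xxxxx` even though they differ in length.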


## Sentiment

In [194]:
from nltk.sentiment import SentimentIntensityAnalyzer
vader_analyzer = SentimentIntensityAnalyzer()
res = vader_analyzer.polarity_scores(str(text))

print(text)
print(res)

Mental health and behavioral problems in New Hampshire children and students, as studied in June.
{'neg': 0.162, 'neu': 0.838, 'pos': 0.0, 'compound': -0.4019}
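
To turn the VADER scores into a single label, a common convention is to threshold the `compound` score. The sketch below assumes cutoffs of ±0.05, which is a widely used rule of thumb rather than something this notebook defines; the function name `label_compound` is hypothetical.

```python
def label_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse label (assumed cutoffs)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Scores copied from the output above
res = {'neg': 0.162, 'neu': 0.838, 'pos': 0.0, 'compound': -0.4019}
print(label_compound(res['compound']))
```

The `neg`, `neu` and `pos` values are proportions of the text that add up to roughly 1, while `compound` is a normalized overall score. Here the sentence is mostly neutral, and the negative share most likely comes from a word such as "problems".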


## Named entity recognition (NER)

In [195]:
nlp = en_core_web_sm.load()
doc = nlp(text)
displacy.render(doc, style="ent", jupyter=True)

## Summarization

In [196]:
# A larger bit of text
text_large = """In the Year of Our Lord Two Thousand Nineteen  AN ACT establishing a committee to study the effect of the opioid crisis, substance misuse, adverse childhood experiences (ACEs), and domestic violence as a cause of post-traumatic stress disorder syndrome (PTSD) and other mental health and behavioral problems in New Hampshire children and students.  Be it Enacted by the Senate and House of Representatives in General Court convened:   19:1  Committee Established.  There is established a committee to study the effect of the opioid crisis, substance misuse, adverse childhood experiences (ACEs), and domestic violence as a cause of post-traumatic stress disorder syndrome (PTSD) and other mental health and behavioral problems in New Hampshire children and students. 19:2  Membership and Compensation. I.  The members of the committee shall be as follows: (a)  Three members of the house of representatives, one of whom shall be from the health, human services and elderly affairs committee and one of whom shall be from the children and family law committee, appointed by the speaker of the house of representatives. (b)  One member of the senate, appointed by the president of the senate. II.  Members of the committee shall receive mileage at the legislative rate when attending to the duties of the committee.  19:3  Duties.  The committee shall study the effect of the opioid crisis, substance misuse, adverse childhood experiences (ACEs), and domestic violence as a cause of post-traumatic stress disorder syndrome (PTSD) and other mental health and behavioral problems in New Hampshire children and students.  19:4  Chairperson; Quorum.  The members of the study committee shall elect a chairperson from among the members.  The first meeting of the committee shall be called by the first-named house member.  The first meeting of the committee shall be held within 45 days of the effective date of this section.  Three members of the committee shall constitute a quorum.  
19:5  Report.  The committee shall report its findings and any recommendations for proposed legislation to the speaker of the house of representatives, the president of the senate, the house clerk, the senate clerk, the governor, and the state library on or before November 1, 2019.  19:6  Effective Date.  This act shall take effect upon its passage.  Approved: May 15, 2019 Effective Date: May 15, 2019 """

parser = PlaintextParser.from_string(text_large,Tokenizer("english"))

print("Original text:")
print(text_large)

print("\nSummary:")
#for sentence in LexRankSummarizer()(parser.document, 2):
#    print(sentence)
for sentence in LsaSummarizer()(parser.document,2):
    print(sentence)

Original text:
In the Year of Our Lord Two Thousand Nineteen  AN ACT establishing a committee to study the effect of the opioid crisis, substance misuse, adverse childhood experiences (ACEs), and domestic violence as a cause of post-traumatic stress disorder syndrome (PTSD) and other mental health and behavioral problems in New Hampshire children and students.  Be it Enacted by the Senate and House of Representatives in General Court convened:   19:1  Committee Established.  There is established a committee to study the effect of the opioid crisis, substance misuse, adverse childhood experiences (ACEs), and domestic violence as a cause of post-traumatic stress disorder syndrome (PTSD) and other mental health and behavioral problems in New Hampshire children and students. 19:2  Membership and Compensation. I.  The members of the committee shall be as follows: (a)  Three members of the house of representatives, one of whom shall be from the health, human services and elderly affairs comm