## Part 4: NLP mini-projects (Liza).

**Text** is what we use to communicate. It documents how we talk and write. News in text form is widely available and relatively easy to scrape. Video and audio news can be converted to text with the help of APIs.

### <span style="color:coral">Speech to text</span>

Speech Recognition using **Google Speech API**

We will perform a live demo, but will not do this exercise together during the workshop, since it requires additional installations. Here is how to set them up:

Installations (on Mac)

```git clone http://people.csail.mit.edu/hubert/git/pyaudio.git```

```cd pyaudio```

```brew install portaudio```

```pip install pyAudio```

```pip install SpeechRecognition```

Save the following code in a file named `speech2text.py` and run it in the Terminal with

```python speech2text.py```

```python
# speech2text.py
# Requires PyAudio and SpeechRecognition.
import speech_recognition as sr
 
# Record Audio
r = sr.Recognizer()
with sr.Microphone() as source:
    print("Say something!")
    audio = r.listen(source)
 
# Speech recognition using Google Speech Recognition
try:
    # for testing purposes, we're just using the default API key
    # to use another API key, use `r.recognize_google(audio, key="GOOGLE_SPEECH_RECOGNITION_API_KEY")`
    # instead of `r.recognize_google(audio)`
    print("You said: " + r.recognize_google(audio))
except sr.UnknownValueError:
    print("Google Speech Recognition could not understand audio")
except sr.RequestError as e:
    print("Could not request results from Google Speech Recognition service; {0}".format(e))
```

The result should look as follows:

![alt text](pics/banana.png)
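Recorded audio news can be handled in a similar way. The sketch below assumes the same SpeechRecognition package and a recording saved as a WAV, AIFF or FLAC file. The function name `transcribe_file` and the file name `news_clip.wav` are hypothetical and chosen only for illustration.

```python
import os

def transcribe_file(path):
    """Transcribe a WAV, AIFF or FLAC file with Google Speech Recognition."""
    if not os.path.isfile(path):
        raise FileNotFoundError(path)

    # Imported here so that the file check above works even if the
    # package is not installed yet
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile(path) as source:
        audio = recognizer.record(source)  # read the whole file into memory
    return recognizer.recognize_google(audio)

if __name__ == "__main__":
    # "news_clip.wav" is a hypothetical file name
    try:
        print(transcribe_file("news_clip.wav"))
    except FileNotFoundError as err:
        print("Audio file not found:", err)
```

As in the microphone example, `sr.UnknownValueError` and `sr.RequestError` can be caught around the call if the audio is unclear or the service is unreachable.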

### <span style="color:coral">Information retrieval (with spaCy)</span>

As discussed before, we can represent each text as a vector and evaluate the proximity of two texts as the proximity of their vectors. The [spaCy](https://spacy.io/) library can help us do this neatly.

Given a piece of news and a set of emails, we will extract the emails most relevant to the news based on the **similarity** of the texts. Similarity is determined by comparing word vectors, or "word embeddings", which are multi-dimensional representations of a word's meaning. Word vectors can be generated using an algorithm like word2vec. Documentation can be found [here](https://spacy.io/usage/vectors-similarity).
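To make the idea of vector proximity concrete, here is a minimal sketch in plain Python. It assumes that a document vector is the average of its word vectors and that proximity is measured by cosine similarity, which is how the spaCy documentation describes its default behaviour. The three-dimensional vectors below are invented for illustration; real word vectors have hundreds of dimensions.

```python
import math

def average_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy "word vectors", invented for illustration
toy_vectors = {
    "alice": [0.9, 0.1, 0.0],
    "sister": [0.8, 0.3, 0.1],
    "tax": [0.0, 0.2, 0.9],
}

doc_a = average_vector([toy_vectors["alice"], toy_vectors["sister"]])
doc_b = average_vector([toy_vectors["alice"]])
doc_c = average_vector([toy_vectors["tax"]])

print(round(cosine_similarity(doc_a, doc_b), 2))  # related toy documents
print(round(cosine_similarity(doc_a, doc_c), 2))  # unrelated toy documents
```

Documents whose averaged vectors point in similar directions score close to 1, while unrelated ones score closer to 0.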

In [8]:
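# Install and download:
# conda install spacy
# python3 -m spacy download en
# (spaCy 3 and later dropped the 'en' shortcut; there, download and load
# a full model name such as en_core_web_sm instead)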

In [9]:
import spacy

nlp = spacy.load('en')

In [10]:
def similarity(text1, text2):
    doc1 = nlp(text1)
    doc2 = nlp(text2)
    sim = round(doc1.similarity(doc2),2)
    return sim

In [11]:
# a trivial example
text_name_1 = 'data/easy_text1.txt'
text_name_2 = 'data/easy_text1.txt'
text1 = open(text_name_1).read()
text2 = open(text_name_2).read()
print('TEXT1: ===================================================================\n' + text1)
print('TEXT2: ===================================================================\n' + text2)

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do
Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do


In [12]:
sim = similarity(text1, text2)
print('Similarity of documents = ' +  str(sim))

Similarity of documents = 1.0


In [13]:
# two different texts
text_name_1 = 'data/easy_text1.txt'
text_name_2 = 'data/easy_text2.txt'
text1 = open(text_name_1).read()
text2 = open(text_name_2).read()
print('TEXT1: ===================================================================\n' + text1)
print('TEXT2: ===================================================================\n' + text2)
sim = similarity(text1, text2)
print('Similarity of documents = ' +  str(sim))

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do
Alice was beginning to get very tired
Similarity of documents = 0.72


In [14]:
# read in the news text
text_name = 'data/news1.txt'
text_news = open(text_name).read()
print(text_news)

imply that Hillary has a life- threatening condition called sinus thrombosis, helped create ISIS, and was responsible for the death of Americans in Benghazi


In [15]:
# read emails 
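import pandas as pd

def load_json_data(path_to_file):
    # transpose so that each row is one email
    raw = pd.read_json(path_to_file, encoding='ascii')
    data_DF = raw.T
    # normalize sender addresses to lower case
    data_DF['from'] = data_DF['from'].str.lower()
    # collapse runs of whitespace (including newlines) in the body into single spaces
    data_DF['body'] = data_DF['body'].apply(lambda x: " ".join(str(x).split()))
    return data_DF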

In [19]:
import pandas as pd

### Downloading the dataset
```
cd data
curl "https://www.dropbox.com/s/20suwbl2l287r54/fulldatastuff.json?dl=0" -L -o fulldatastuff.json
```

In [20]:
emails_pd = load_json_data('data/fulldatastuff.json')

In [21]:
emails_pd.head()

| | body | date | from | from_name | subject | to |
|---|---|---|---|---|---|---|
| 0 | How many more states can we get to follow Conn... | 2016-05-17T19:51:22-07:00 | gardem@dnc.org | Maureen Garde | Re: CT To Automatically Register 400,000 Voters | [["Davis, Marilyn", DavisM@dnc.org]] |
| 1 | She maxed out to us earlier this year total un... | 2016-05-04T06:58:23-07:00 | shapiroa@dnc.org | "Shapiro, Alexandra" | What about asking Toni Bush to host? | [["Kaplan, Jordan", KaplanJ@dnc.org]] |
| 10 | | 2016-05-04T16:49:31-04:00 | postmaster@my.democrats.org | Contribution | Contribution: Finance - Tristate 2016 / Judith... | [[,, allenz@dnc.org], [,, parrishd@dnc.org], [... |
| 100 | Jordan KaplanNational Finance DirectorDemocrat... | 2016-04-25T14:54:17-04:00 | kaplanj@dnc.org | Jordan Kaplan | Re: Paris reception | [["Rauscher, Rachel", RauscherR@dnc.org]] |
| 1000 | | 2016-05-04T05:47:44-06:00 | illinoisplaybook@politico.com | Natasha Korecki | POLITICO Illinois Playbook, presented by Nucle... | [] |


In [22]:
emails = emails_pd['body'].tolist()
print(emails[0])
emails = emails[0:100]
print(len(emails))

How many more states can we get to follow Connecticut? Way to go!
100


In [23]:
# for a given news piece iterate through all emails and find the ntop most relevant ones
from heapq import nlargest

def similar_emails(text_news, emails, ntop=1):
    # score every email against the news text; a fresh dict per call
    # keeps scores from earlier calls from leaking into this one
    similarities = {}
    for i, email in enumerate(emails):
        similarities[i] = similarity(text_news, email)
    # indices of the ntop highest-scoring emails, best first
    similarities_index = nlargest(ntop, similarities, key=similarities.get)
    return [emails[i] for i in similarities_index]

relevant_emails = similar_emails(text_news, emails, ntop=1)
print(relevant_emails[0][0:1000])

IN CASE YOU MISSED ITThe Only Time Donald Trump Undersells: Tax TimeABC NewsBy​ ​Brian RossMay 16, 2016http://abcnews.go.com/Politics/time-donald-trump-undersells-tax-time/story?id=39133709The Trump National Golf Club in Westchester County, New York, with its lovingly-manicured golf course, gently winding streams, stone bridges, 101-foot waterfall and an expansive clubhouse is, according to Donald Trump, reflective of “a true luxury lifestyle.”Creating such a “memorable club” is not cheap -- Trump wrote on a candidate disclosure form that the sprawling 147-acre private club bearing his name is worth “more than $50 million.”But when it came time to value the property for tax purposes, his lawyers have argued that Trump National is really only worth $1.35 million. The proposed valuation has bewildered officials in the small town of Ossining, who said the new figure would cut Trump’s tax burden by 90 percent and dump that burden on everyone else.“Trump says he represents the little guy, b
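The ranking step itself does not depend on spaCy. The sketch below shows the same "score every document, keep the best `ntop`" pattern with a deliberately simple stand-in similarity (word overlap) instead of word vectors, so that it runs without any model download. The names `word_overlap`, `top_matches` and the toy emails are hypothetical.

```python
from heapq import nlargest

def word_overlap(text1, text2):
    """Stand-in similarity: Jaccard overlap of lowercase word sets (not spaCy)."""
    words1 = set(text1.lower().split())
    words2 = set(text2.lower().split())
    if not words1 or not words2:
        return 0.0
    return len(words1 & words2) / len(words1 | words2)

def top_matches(query, documents, score, ntop=1):
    """Return (score, document) pairs for the ntop best-scoring documents."""
    scored = [(score(query, doc), doc) for doc in documents]
    return nlargest(ntop, scored, key=lambda pair: pair[0])

# Toy data invented for illustration
toy_emails = [
    "Way to go Connecticut",
    "Trump tax valuation for the golf club",
    "Paris reception schedule",
]
query = "news about the Trump golf club tax"

for s, email in top_matches(query, toy_emails, word_overlap, ntop=2):
    print(round(s, 2), email)
```

Returning the score together with the text makes it easier to judge whether the best match is actually a good one or merely the least bad.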

### Documentation:
https://spacy.io/

### <span style="color:coral">Text summarization (with NLTK)</span>

Another great library for working with text is [NLTK](http://www.nltk.org/), which stands for Natural Language Toolkit. Suppose we have extracted a large set of news on a given topic and would like to keep only the most informative parts: out of each text we would like to extract the most informative sentence. This type of summarization is called **extractive**.

Take a look at this video: [Hillary Clinton's concession speech](https://www.vox.com/2016/11/9/13570328/hillary-clinton-concession-speech-full-transcript-2016-presidential-election). The webpage already provides the full transcript.



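We will approach this task as follows: split the text into sentences and the sentences into words, count how often each non-stop word occurs, normalize the counts, filter the words by frequency cut-offs, and score each sentence by summing the frequencies of its words. The highest-scoring sentence serves as the summary.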

In [25]:
# import text

text=open('data/trainhillary.txt').read().lower().replace('\xa0',' ')
print('corpus length:', len(text))
text[0:300]

corpus length: 6390


'thank you. thank you all very much. thank you so much.\n\nvery rowdy group. thank you, my friends. thank you. thank you. thank you so very much for being here. i love you all, too.\n\nlast night i congratulated donald trump and offered to work with him on behalf of our country.\n\ni hope that he will be a'

In [26]:
import nltk
from nltk.corpus import stopwords
from collections import defaultdict
stopWords = set(stopwords.words("english"))

In [27]:
sentences = nltk.sent_tokenize(text)
len(sentences)
sentences[0:10]

['thank you.',
 'thank you all very much.',
 'thank you so much.',
 'very rowdy group.',
 'thank you, my friends.',
 'thank you.',
 'thank you.',
 'thank you so very much for being here.',
 'i love you all, too.',
 'last night i congratulated donald trump and offered to work with him on behalf of our country.']

In [28]:
word_sent = [nltk.word_tokenize(s.lower()) for s in sentences]

In [29]:
word_sent[0:5]

[['thank', 'you', '.'],
 ['thank', 'you', 'all', 'very', 'much', '.'],
 ['thank', 'you', 'so', 'much', '.'],
 ['very', 'rowdy', 'group', '.'],
 ['thank', 'you', ',', 'my', 'friends', '.']]
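Note that the token lists contain punctuation marks such as `'.'` and `','`. Stop-word lists normally contain words only, so punctuation would be counted like any other token. If that is not wanted, the tokens can be filtered first. A minimal sketch, using a few of the token lists shown above and a hypothetical helper name `drop_punctuation`:

```python
import string

# a few of the tokenized sentences shown above
word_sent = [
    ['thank', 'you', '.'],
    ['thank', 'you', 'all', 'very', 'much', '.'],
    ['thank', 'you', ',', 'my', 'friends', '.'],
]

def drop_punctuation(tokens):
    """Keep tokens that contain at least one non-punctuation character."""
    return [t for t in tokens
            if not all(ch in string.punctuation for ch in t)]

cleaned = [drop_punctuation(sentence) for sentence in word_sent]
print(cleaned)
```

This step is optional and is not part of the workflow in this section; applying it would change the frequency counts reported here.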

In [30]:
#compute frequencies 
freq = defaultdict(int)
for sentence in word_sent:
    for word in sentence:
        if word not in stopWords:
            freq[word] +=1
len(freq)

353

In [31]:
m = float(max(freq.values()))
m

77.0

In [32]:
for word in freq.keys():
    freq[word] = freq[word]/m

In [33]:
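min_cut=0.2
max_cut=0.8
freq_new = defaultdict(int)
for word in freq.keys():
    # evaluated as (not f > max_cut) or (f < min_cut): with min_cut < max_cut
    # this keeps every word with frequency <= max_cut, so rare words stay in
    if (not freq[word] > max_cut) or (freq[word] < min_cut):
        freq_new[word] = freq[word]
freq = freq_new
del freq_new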

In [34]:
len(freq)

351
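A filter that enforces both cut-offs keeps only the words whose normalized frequency lies between `min_cut` and `max_cut`. The sketch below shows this on hypothetical toy frequencies; the helper name `keep_in_band` is invented here. Applying it to the speech would likely change the word count and the selected sentence shown in this section.

```python
def keep_in_band(freq, min_cut, max_cut):
    """Keep words whose normalized frequency lies in [min_cut, max_cut]."""
    return {word: f for word, f in freq.items() if min_cut <= f <= max_cut}

# Toy normalized frequencies, invented for illustration
toy_freq = {".": 1.0, "thank": 0.5, "country": 0.3, "rowdy": 0.01}

filtered = keep_in_band(toy_freq, min_cut=0.2, max_cut=0.8)
print(sorted(filtered))
```

Here the very frequent `"."` and the very rare `"rowdy"` are dropped, while the mid-frequency words remain.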

In [35]:
ranking = defaultdict(int)
for i, sentence in enumerate(word_sent):
    for word in sentence:
        if word in freq:
            ranking[i] += freq[word]

In [36]:
from heapq import nlargest
sentences_index = nlargest(1, ranking, key=ranking.get)
print(sentences_index)
sentences[sentences_index[0]]

[35]


'we’ve spent a year and a half bringing together millions of people from every corner of our country to say with one voice that we believe that the american dream is big enough for everyone—for people of all races, and religions, for men and women, for immigrants, for lgbt people, and people with disabilities.'

In [37]:
## All of it in one function
from heapq import nlargest
import nltk
from nltk.corpus import stopwords
from collections import defaultdict

stopWords = set(stopwords.words("english"))

In [38]:
def summarize_text(text, stopWords, min_cut, max_cut, ntop=1):
   
    sentences = nltk.sent_tokenize(text)
    
    word_sent = [nltk.word_tokenize(s.lower()) for s in sentences]
    
    # compute frequencies 
    freq = defaultdict(int)
    for sentence in word_sent:
        for word in sentence:
            if word not in stopWords:
                freq[word] +=1

    # normalize frequencies
    m = float(max(freq.values()))
    for word in freq.keys():
        freq[word] = freq[word]/m

    # filter words by frequency; the test is evaluated as
    # (not f >= max_cut) or (f <= min_cut), so with min_cut < max_cut it keeps
    # every word with frequency < max_cut and rare words stay in
    freq_new = defaultdict(int)
    for word in freq.keys():
        if (not freq[word] >= max_cut) or (freq[word] <= min_cut):
            freq_new[word] = freq[word]
    freq = freq_new
    del freq_new
    
    # rank sentences
    ranking = defaultdict(int)
    for i, sentence in enumerate(word_sent):
        for word in sentence:
            if word in freq:
                ranking[i] +=freq[word]
                
    sentences_index = nlargest(ntop, ranking, key=ranking.get)
    summary = [sentences[i] for i in sentences_index]
    return summary

In [39]:
min_cut = 0.2
max_cut = 0.8
ntop = 1
summary = summarize_text(text, stopWords, min_cut, max_cut, ntop)
print('SUMMARIZING SENTENCE : \n' + str(summary[0]))

SUMMARIZING SENTENCE : 
we’ve spent a year and a half bringing together millions of people from every corner of our country to say with one voice that we believe that the american dream is big enough for everyone—for people of all races, and religions, for men and women, for immigrants, for lgbt people, and people with disabilities.
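When more than one sentence is requested, it can read better to return the selected sentences in their original order rather than by score. The sketch below illustrates this with the standard library only. As assumptions, it replaces the NLTK tokenizers with simple regular expressions, uses a tiny hand-written stop-word list, and skips the frequency cut-offs; the function name `summarize` and the toy text are hypothetical.

```python
import re
from collections import Counter
from heapq import nlargest

# Tiny hand-written stop-word list, an assumption standing in for NLTK's
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is", "in",
              "for", "we", "our", "it"}

def summarize(text, ntop=2):
    # Naive sentence split after ., ! or ? (not nltk.sent_tokenize)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [re.findall(r"[a-z']+", s.lower()) for s in sentences]

    # Count non-stop words and normalize by the largest count
    counts = Counter(w for sent in tokenized for w in sent if w not in STOP_WORDS)
    if not counts:
        return []
    top = max(counts.values())
    freq = {w: c / top for w, c in counts.items()}

    # Score each sentence by the summed frequencies of its words
    scores = {i: sum(freq.get(w, 0.0) for w in sent)
              for i, sent in enumerate(tokenized)}
    best = nlargest(ntop, scores, key=scores.get)
    return [sentences[i] for i in sorted(best)]  # restore document order

toy_text = (
    "Thank you. "
    "Our country needs every voter. "
    "Every voter in our country deserves a voice. "
    "Goodbye."
)
print(summarize(toy_text, ntop=2))
```

Sorting the selected indices before building the result is the only change needed to keep the summary in reading order.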
