# Module 21 - Text Mining in Python - Topic Modeling


**_Author: Jessica Cervi_**

**Expected time = 3 hours**

**Total points = 120 points**
    



    

    
## Assignment Overview

In this assignment, we will continue working with Text Mining to explore a few examples similar to those in the lectures from this week. First, we will review how to tokenize, tag, and chunk some text in Python. Next, we will use named entities and sentiment analysis to extract sentiment in news articles for given entities. Finally, we will use text summarization techniques to shorten long pieces of text.


This assignment is designed to build your familiarity and comfort coding in Python while also helping you review key topics from each module. As you progress through the assignment, answers will get increasingly complex. It is important that you adopt a data scientist's mindset when completing this assignment. **Remember to run your code from each cell before submitting your assignment.** Running your code beforehand will notify you of errors and give you a chance to fix your errors before submitting. You should view your Vocareum submission as if you are delivering a final project to your manager or client. 

***Vocareum Tips***
- Do not add arguments or options to functions unless you are specifically asked to. This will cause an error in Vocareum.
- Do not use a library unless you are explicitly asked to in the question.
- You can download the Grading Report after submitting the assignment. This will include feedback and hints on incorrect questions. 




### Learning Objectives

- Summarize texts and isolate topics in sample data 
- Perform named entity analysis using Python


## Index:

#### Module 21: Text Mining in Python

- [Question 1](#q1)
- [Question 2](#q2)
- [Question 3](#q3)
- [Question 4](#q4)
- [Question 5](#q5)
- [Question 6](#q6)

## Module 21: Text Mining in Python - Topic Modeling



In the first part of this assignment, we will be testing your knowledge of the topics covered in Module 12, such as tokenizing, tagging, and chunking data.


We will use a dataset of articles gathered from the New York Times API relating to elections.  

Before proceeding, ensure that you have the following packages installed on your machine:
- [nltk](https://www.nltk.org) - The leading platform for building Python programs to work with human language data.

- [gensim](https://pypi.org/project/gensim/) - An open-source library for unsupervised topic modeling and natural language processing.

### Reading the dataset and tokenizing data

We will begin this assignment by reading the dataset into a DataFrame `df` and by performing some data cleaning. Next, you will be asked to tokenize your data. Remember, tokenization is the process by which a large quantity of text is divided into smaller parts called tokens.

As usual, we begin by importing the `pandas` library and by reading the dataset into a DataFrame `df`. Next, because they won't be useful in our analysis, we drop the rows in the `Briefing` section and convert the values stored in the column `date` to datetime objects.

In [1]:
import pandas as pd
df = pd.read_csv('data/nyt_headlines.csv')
df.drop(index=df[df['section'] == 'Briefing'].index, inplace=True)
df['date'] = pd.to_datetime(df['date'])
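
The two cleaning steps can be tried on a small hand-made frame first. The sketch below assumes a toy DataFrame `toy` (hypothetical, not the NYT data) with the same column names. It uses a boolean mask, which has the same effect as the `drop` call, and passes `utc=True` as an illustrative choice so the parsed dates are timezone-aware.

```python
import pandas as pd

# Hypothetical toy data standing in for the NYT dataset
toy = pd.DataFrame({
    'date': ['2019-07-25T21:48:55+0000',
             '2019-08-30T11:15:22+0000',
             '2019-05-06T12:37:46+0000'],
    'section': ['U.S.', 'Briefing', 'World'],
    'lead_paragraph': ['First text.', 'Daily briefing.', 'Second text.'],
})

# Keep every row whose section is not 'Briefing' (same effect as dropping those rows)
toy = toy[toy['section'] != 'Briefing'].copy()

# Parse the date strings into timezone-aware datetime values
toy['date'] = pd.to_datetime(toy['date'], utc=True)

print(toy.index.tolist())   # original labels are kept, so the index has a gap
print(toy['date'].dtype)
```

Because the original index labels survive the filtering, the index of the cleaned frame has gaps. This is why `df.info()` can report fewer entries than the largest index label suggests.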


Finally, we visualize the first five rows of `df` and extract some information using the function `.info()`.

In [2]:
df.head()

Unnamed: 0,date,section,lead_paragraph
0,2019-07-25 21:48:55+00:00,U.S.,WASHINGTON — The Senate Intelligence Committee...
1,2019-08-30 11:15:22+00:00,World,JERUSALEM — The leader of the main Arab factio...
2,2019-08-29 18:00:28+00:00,U.S.,"BEAVER DAM, Wis. — Democratic Wisconsin Gov. T..."
3,2019-08-29 16:57:23+00:00,World,JERUSALEM — A small Israeli ultranationalist p...
4,2019-05-06 12:37:46+00:00,World,NEW DELHI — Violence disrupted the Indian elec...


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 932 entries, 0 to 963
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype              
---  ------          --------------  -----              
 0   date            932 non-null    datetime64[ns, UTC]
 1   section         932 non-null    object             
 2   lead_paragraph  932 non-null    object             
dtypes: datetime64[ns, UTC](1), object(2)
memory usage: 29.1+ KB


Next, we import the function `sent_tokenize()` from the library `NLTK`.

In [4]:
from nltk import sent_tokenize
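
To build intuition for what sentence tokenization does, here is a minimal sketch that splits text with a regular expression. The helper `naive_sent_split` is hypothetical and only an illustration. It is not how `sent_tokenize` works internally, and it fails on abbreviations.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (illustration only)."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

print(naive_sent_split("One fish. Two fish. Red fish."))

# The naive rule breaks on abbreviations such as 'Gov.'
print(naive_sent_split("Gov. Tony Evers spoke."))
```

The second call splits after `Gov.`, which is one reason to prefer a trained tokenizer such as `sent_tokenize` for real text.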

[Back to top](#Index:) 
<a id='q1'></a>

### Question 1:

*20 points*

Create a list of sentences by tokenizing the data from the `lead_paragraph` Series. To do so, define a function `make_sents` which takes a `pandas` `Series` as an argument. Your function should use the function `sent_tokenize` to return a *nested list* that contains, for each entry of the input `Series`, the list of its sentences.
        
Consider the example below:

|Sample Series Input|
| --- |
|"The cat. In the hat."|
|"One fish. Two fish. Red fish."|

Output:
`[['The cat.', 'In the hat.'], ['One fish.', 'Two fish.', 'Red fish.']]`

In [None]:
### GRADED

### YOUR SOLUTION HERE
def make_sents(s):
    return

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


### Tag and chunk the Data

Extracting the named entities using the library `NLTK` requires tagging each word with a part of speech using the function `pos_tag()` and passing the tagged words to the function `ne_chunk()`. From here, we can obtain a tree representation of the sentence that includes any relevant named entity tags. A full list of the part-of-speech tags can be found [here](https://pythonprogramming.net/part-of-speech-tagging-nltk-tutorial/).

Below, we import the necessary functions and examine a simple example.

In [None]:
from nltk import word_tokenize, pos_tag, ne_chunk
sample_sent = 'Today Google won the election.'

In [None]:
w = word_tokenize(sample_sent)
pos = pos_tag(w)
ne = ne_chunk(pos)

[Back to top](#Index:) 
<a id='q2'></a>
### Question 2:

*20 points*

Define a function `make_chunks` that accepts, as input, a function which generates a list of structured sets of texts (corpora). Default the input of the function to `make_sents`. Your function should:

- Iterate through the list of corpora and assign each one an integer dictionary key, starting at 0 (this is the _row_ key).
- Iterate through the sentences of each corpus and assign each one an integer dictionary key, starting at 0 (this is the _sentence_ key).

Your function should return the dictionary built in these iterations.

**Hint: use the functions `word_tokenize`, `pos_tag`, `ne_chunk` and two nested `for` loops**
    
Consider the example below (before chunking):

```python
df['lead_paragraph'][6]
```
Output:
```
'ATHENS — Four years ago almost to this day, Greece came a breath away from leaving the euro. People formed quiet, somber lines outside banks to take out small amounts of cash, as a lockdown on the financial system barred them from accessing their savings. They stockpiled tinned food and toilet paper.'
```

The same text, after chunking, should be:

```python
print(make_chunks(ms=make_sents)[6])
```
Output:
```
{0: [Tree('GPE', [('ATHENS', 'NNP')]), ('—', 'NNP'), ('Four', 'CD'), ('years', 'NNS'), ('ago', 'RB'), ('almost', 'RB'), ('to', 'TO'), ('this', 'DT'), ('day', 'NN'), (',', ','), Tree('GPE', [('Greece', 'NNP')]), ('came', 'VBD'), ('a', 'DT'), ('breath', 'NN'), ('away', 'RB'), ('from', 'IN'), ('leaving', 'VBG'), ('the', 'DT'), ('euro', 'NN'), ('.', '.')], 1: [('People', 'NNS'), ('formed', 'VBD'), ('quiet', 'JJ'), (',', ','), ('somber', 'JJ'), ('lines', 'NNS'), ('outside', 'JJ'), ('banks', 'NNS'), ('to', 'TO'), ('take', 'VB'), ('out', 'RP'), ('small', 'JJ'), ('amounts', 'NNS'), ('of', 'IN'), ('cash', 'NN'), (',', ','), ('as', 'IN'), ('a', 'DT'), ('lockdown', 'NN'), ('on', 'IN'), ('the', 'DT'), ('financial', 'JJ'), ('system', 'NN'), ('barred', 'VBD'), ('them', 'PRP'), ('from', 'IN'), ('accessing', 'VBG'), ('their', 'PRP$'), ('savings', 'NNS'), ('.', '.')], 2: [('They', 'PRP'), ('stockpiled', 'VBD'), ('tinned', 'VBN'), ('food', 'NN'), ('and', 'CC'), ('toilet', 'NN'), ('paper', 'NN'), ('.', '.')]}
```

In [None]:
### GRADED

### YOUR SOLUTION HERE
def make_chunks(ms=make_sents):
    return

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


Finally, we will use the tree representation to generate a dictionary of named entities.  We do this by examining each tree based on whether or not a named entity label has been provided.  

Below, we reconsider the simple example given above: note that only the word `Google` is identified as a named entity.  

In [None]:
#Test sentence
sample_sent = 'Today Google won the election.'
#Tokenize, tag and chunk
w = word_tokenize(sample_sent)
pos = pos_tag(w)
ne = ne_chunk(pos)

Next, we examine the output with visual representations.

In [None]:
ne.pretty_print()

In [None]:
ne[1].pretty_print()

In [None]:
ne[1].label()

In [None]:
ne[1].leaves()
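
The inspection above can also be done programmatically. The sketch below builds a small `Tree` by hand as a stand-in for the output of `ne_chunk`, so it runs without any NLTK data downloads. The `ORGANIZATION` label and the part-of-speech tags are assumptions for illustration, and the real chunker may assign different ones.

```python
from nltk import Tree

# Hand-built stand-in for ne_chunk output; labels and tags are assumptions
toy_ne = Tree('S', [
    ('Today', 'NN'),
    Tree('ORGANIZATION', [('Google', 'NNP')]),
    ('won', 'VBD'),
    ('the', 'DT'),
    ('election', 'NN'),
    ('.', '.'),
])

entities = []
for node in toy_ne:
    # Named entities are nested Tree objects; ordinary tokens are (word, tag) tuples
    if isinstance(node, Tree):
        for word, tag in node.leaves():
            entities.append((word, node.label(), tag))

print(entities)
```

Checking `isinstance(node, Tree)` is one way to tell a named entity apart from a plain token, because only the entities are wrapped in a subtree that carries a label.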

[Back to top](#Index:) 
<a id='q3'></a>

### Question 3:

*20 points*

Define a function `get_pos` which takes as input a function that returns a chunked dictionary. Default the input of the function to `make_chunks`. Your function should return a dictionary where:
- The key is the integer value of the source dataset row and the value is a _list_ of _dictionaries_, where:
  - The key is the string value of the named entity and the value is a tuple, where:
    - The first element is the entity tag and the second element is the part-of-speech tag.

Observe the example below:

```python
get_pos(mc=make_chunks)[6]
```

Returns:
```
[{'ATHENS': ('GPE', 'NNP')}, {'Greece': ('GPE', 'NNP')}]
```

In [None]:
### GRADED

### YOUR SOLUTION HERE
def get_pos(mc=make_chunks):
    return

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


### Sentiment of an Entity
In this part of the assignment, we use VADER to investigate the sentiment of headlines containing a specific entity. Remember, sentiment analysis is the process of *computationally* determining whether a piece of writing is positive, negative, or neutral.

For this part we import the following necessary function:

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
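
The compound score reported by VADER always lies between -1 and 1. To our understanding, the reference implementation obtains it by squashing the summed word valences with the function sketched below. The constant `alpha=15` is quoted from memory of that implementation and should be treated as an assumption, and the valence sums are hypothetical.

```python
import math

def normalize(score, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return score / math.sqrt(score * score + alpha)

# Hypothetical valence sums for three sentences
for total in (-4.0, 0.0, 2.5):
    print(total, round(normalize(total), 4))
```

A sum of zero stays at zero, which matches the neutral sentences that receive a compound score of `0.0000`.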

[Back to top](#Index:) 
<a id='q4'></a>

### Question 4

*20 points*

Define a function `get_sentiment_df` that takes as input a case-insensitive search word (a string; default this to the word `Putin`) and a function that generates the list of sentences. Default this second input to `make_sents`.

Your function should search through the list of sentences returned by `make_sents` for the input search word and return a DataFrame with three columns:

 - `lead_paragraph_index`: the originating row number (parent index) of the sentence list
 - `sentence`: the sentence (a string) whose sentiment is evaluated
 - `compound_sentiment`: the float value of the compound sentiment returned by the function `.polarity_scores()`
 
 *Hint:* It is possible to have multiple rows on the output per index, as each parent row may contain multiple sentences
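
One way to keep track of the parent row while searching is sketched here on a hypothetical nested list. It shows only the case-insensitive matching and the bookkeeping of the parent position, not the sentiment scoring or the DataFrame construction.

```python
# Hypothetical nested sentence list: one inner list per source row
nested = [
    ['The cat sat.', 'PUTIN spoke today.'],
    ['Nothing relevant here.'],
    ['Mr. Putin left.', 'Then putin returned.'],
]

search_word = 'Putin'
matches = []
for row_idx, sentences in enumerate(nested):
    for sentence in sentences:
        # Lower-case both sides so the comparison ignores case
        if search_word.lower() in sentence.lower():
            matches.append((row_idx, sentence))

print(matches)
```

Row 2 contributes two matches, which illustrates why the same parent index can appear more than once in the output.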

Consider the example below:

```python
get_sentiment_df('Putin', ms=make_sents)
```

Example Output:
<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>lead_paragraph_index</th>      <th>sentence</th>      <th>compound_sentiment</th>    </tr>  </thead>  <tbody>    <tr>      <th>0</th>      <td>150</td>      <td>President Trump and President Vladimir V. Puti...</td>      <td>0.0000</td>    </tr>    <tr>      <th>1</th>      <td>219</td>      <td>And then President Trump brushed off Russia’s ...</td>      <td>0.5267</td>    </tr>    <tr>      <th>2</th>      <td>768</td>      <td>A day after President Trump’s remarks alongsid...</td>      <td>-0.1027</td>    </tr>    <tr>      <th>3</th>      <td>772</td>      <td>During a news conference with President Vladim...</td>      <td>0.0000</td>    </tr>    <tr>      <th>4</th>      <td>775</td>      <td>A day after President Trump’s remarks alongsid...</td>      <td>-0.1027</td>    </tr>    <tr>      <th>5</th>      <td>795</td>      <td>RUSSIAN ROULETTE The Inside Story of Putin’s W...</td>      <td>-0.5994</td>    </tr>    <tr>      <th>6</th>      <td>805</td>      <td>SARAJEVO, Bosnia and Herzegovina — Just before...</td>      <td>0.4576</td>    </tr>    <tr>      <th>7</th>      <td>826</td>      <td>WASHINGTON — Russians working for a close ally...</td>      <td>0.0772</td>    </tr>    <tr>      <th>8</th>      <td>845</td>      <td>On an October afternoon before the 2016 electi...</td>      <td>0.3182</td>    </tr>    <tr>      <th>9</th>      <td>871</td>      <td>Maria Butina, a Russian woman who tried to bro...</td>      <td>0.2500</td>    </tr>    <tr>      <th>10</th>      <td>871</td>      <td>During a news conference in Helsinki, Finland,...</td>      <td>0.4767</td>    </tr>    <tr>      <th>11</th>      <td>871</td>      <td>Here are three books that provide insight into...</td>      <td>0.4767</td>    </tr>    <tr>      <th>12</th>      <td>915</td>      <td>HELSINKI, Finland — President Trump stood next...</td>      <td>0.2023</td>    </tr>    <tr>    
  <th>13</th>      <td>930</td>      <td>President Vladimir Putin’s real challenge in S...</td>      <td>-0.1872</td>    </tr>    <tr>      <th>14</th>      <td>932</td>      <td>WASHINGTON — In 2016, American intelligence ag...</td>      <td>0.6808</td>    </tr>  </tbody></table>



In [None]:
### GRADED

### YOUR SOLUTION HERE
def get_sentiment_df(sw='Putin', ms=make_sents):
    return

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


### Summarization 

Now, we turn to the problem of text summarization. Remember, text summarization refers to the technique of shortening long pieces of text in order to create a coherent and fluent summary containing only the main points of the document.

We will work through an example using the `gensim.summarization` module (note that this module was removed in Gensim 4.0, so a 3.x release is required). Our goal is to build on the earlier example, where we extracted meaningful sentences, and provide a summary of the headlines related to a given entity. Below, we demonstrate the `gensim.summarization.summarize_corpus` function on our existing sentences.


In [None]:
import gensim.summarization

In [None]:
gensim.summarization.summarize_corpus(make_sents(df['lead_paragraph']))[:5]
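
To build intuition for extractive summarization, the sketch below ranks sentences with a simple word-frequency heuristic and keeps the best ones. This is a deliberately simpler technique than the graph-based ranking used by `gensim`; the helper `toy_summary` and the sample sentences are hypothetical.

```python
import re
from collections import Counter

def toy_summary(sentences, n=1):
    """Return the n sentences whose words are most frequent across all sentences."""
    tokenized = [re.findall(r'[a-z]+', s.lower()) for s in sentences]
    freq = Counter(word for words in tokenized for word in words)
    # Score each sentence by the average corpus frequency of its words
    scores = [sum(freq[w] for w in words) / len(words) if words else 0.0
              for words in tokenized]
    # Pick the indices of the n best sentences, then restore document order
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(best)]

sents = [
    'The election results were announced.',
    'Voters waited for the election results.',
    'It rained in the afternoon.',
]
print(toy_summary(sents, n=1))
```

The summary is built only from sentences that already exist in the text, which is what makes the approach extractive rather than abstractive.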

[Back to top](#Index:) 
<a id='q5'></a>

### Question 5: 

*20 points*

Write a function `get_summary` which takes as input a case-insensitive search word (a string; default to `Putin`) and the DataFrame `df` defined in Question 4. Your function should return a list of lists containing the summaries, as returned by `summarize_corpus()`, of all articles that mention the search word.

Consider the example below:

```python
get_summary('Putin', df=df)
```

Example Output:
```
[['During a news conference with President Vladimir V. Putin of Russia, President Trump would not say whether he believed Russia meddled with the 2016 presidential election.'],
 ['RUSSIAN ROULETTE The Inside Story of Putin’s War on America and the Election of Donald Trump By Michael Isikoff and David Corn 338 pp. Twelve. $30.']]
```

In [None]:
### GRADED

### YOUR SOLUTION HERE
def get_summary(sw='Putin', df=df):
    return

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


### Topic Modeling

In this part of the assignment, we demonstrate the topic modeling capabilities of the library `sklearn`. Here, we rely on the `LatentDirichletAllocation` class, which implements the LDA algorithm as demonstrated in the lectures. This class expects a vectorized array, which can be produced with `CountVectorizer` or `TfidfVectorizer`.

As usual, we begin by importing the necessary libraries.

In [None]:
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [None]:
#instantiate LDA and CountVectorizer class
lda = LatentDirichletAllocation()
cvect = CountVectorizer(stop_words='english')

#Transform lead_paragraph into document term matrix
dtm = cvect.fit_transform(df['lead_paragraph'])

#generate list of topics
topics = lda.fit_transform(dtm)
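
The shapes involved are easier to see on a tiny corpus. The sketch below uses four hypothetical documents, with `n_components=2` and `random_state=0` as illustrative choices (the cell above keeps the defaults, which give 10 topics and a run-dependent result).

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus standing in for the lead paragraphs
docs = [
    'senate election vote ballot',
    'election vote candidate campaign',
    'rain weather storm forecast',
    'storm weather wind rain',
]

toy_vect = CountVectorizer()
toy_dtm = toy_vect.fit_transform(docs)        # shape: (documents, vocabulary terms)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = toy_lda.fit_transform(toy_dtm)   # shape: (documents, topics)

print(toy_dtm.shape, doc_topics.shape)
print(toy_lda.components_.shape)              # shape: (topics, vocabulary terms)
print(doc_topics.sum(axis=1))                 # each row sums to 1
```

Each row of `doc_topics` is a probability distribution over topics for one document, while each row of `components_` holds the word weights of one topic.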

In [None]:
#function to print top words in each topic
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()
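
The slice `topic.argsort()[:-n_top_words - 1:-1]` used in `print_top_words` is compact but easy to misread. The sketch below walks through it on a hypothetical array of four weights.

```python
import numpy as np

weights = np.array([0.1, 0.7, 0.3, 0.9])   # hypothetical topic-word weights
n_top_words = 2

order = weights.argsort()                  # indices from smallest to largest weight
top = order[:-n_top_words - 1:-1]          # walk backwards from the end for n_top_words steps

print(order.tolist())   # [0, 2, 1, 3]
print(top.tolist())     # [3, 1]
```

The result lists the positions of the largest weights first, so indexing `feature_names` with it yields the top words of the topic in descending order.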

In [None]:
#create feature names variable and use in print_top_words function below.
#note: scikit-learn 1.0 added get_feature_names_out(); get_feature_names() was removed in 1.2
feature_names = cvect.get_feature_names()
print_top_words(lda, feature_names, n_top_words=10)

From here, we can estimate the probability of each topic for a given text using the fit instance of `LatentDirichletAllocation`. We feed the fit instance a vectorized sample text and it returns an array of topic probabilities.

In [None]:
#sample headline to determine topic probability with
sample_headline = df['lead_paragraph'][12]
print(sample_headline)

In [None]:
#transform and view probabilities for topics
import numpy as np
samp_cvect = cvect.transform(np.array([sample_headline]))
lda.transform(samp_cvect)

In [None]:
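#sort top tokens in the topic with the highest probability for the sample headline
feats = cvect.get_feature_names()
#pick the most probable topic instead of hard-coding its index, since it changes between runs
top_topic = lda.transform(samp_cvect).argmax()
topic_weights = lda.components_[top_topic]
pd.DataFrame({'prob': topic_weights, 'features': feats}).nlargest(10,  'prob')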

[Back to top](#Index:) 
<a id='q6'></a>

### Question 6: 

*20 points*

Define a function `topic_frame` that takes, as inputs:
  - headline (str): text of headline to determine topic inclusion
  - model (sklearn estimator): fit estimator from sklearn.decomposition (LDA or NMF)
  - vectorizer (sklearn transformer): fit vectorizer with vocabulary (CountVectorizer, TfidfVectorizer, or HashingVectorizer)
  - n (int): number of tokens to include in the returned DataFrame. Default this value to 10.
  
Your function should return a DataFrame containing the top `n` words of the topic to which the input model assigns the highest probability for the input headline.
    
Consider the example below:

```python
topic_frame('Putin', lda, cvect)
```

Example Output:    
<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>prob</th>      <th>features</th>    </tr>  </thead>  <tbody>    <tr>      <th>3391</th>      <td>34.284904</td>      <td>president</td>    </tr>    <tr>      <th>1486</th>      <td>34.272390</td>      <td>election</td>    </tr>    <tr>      <th>1488</th>      <td>25.862762</td>      <td>elections</td>    </tr>    <tr>      <th>978</th>      <td>22.230037</td>      <td>congressional</td>    </tr>    <tr>      <th>3693</th>      <td>22.128197</td>      <td>republican</td>    </tr>    <tr>      <th>728</th>      <td>21.695774</td>      <td>carolina</td>    </tr>    <tr>      <th>3013</th>      <td>20.821177</td>      <td>north</td>    </tr>    <tr>      <th>3726</th>      <td>17.187756</td>      <td>results</td>    </tr>    <tr>      <th>3962</th>      <td>15.357398</td>      <td>senate</td>    </tr>    <tr>      <th>4593</th>      <td>15.255583</td>      <td>trump</td>    </tr>  </tbody></table>

In [None]:
### GRADED

### YOUR SOLUTION HERE
def topic_frame(headline, model, vectorizer, n=10):
    return

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###
