# Module 21 - Text Mining in Python - Topic Modeling


**_Author: Jessica Cervi_**

**Expected time = 3 hours**

**Total points = 120 points**
    



    

    
## Assignment Overview

In this assignment, we will continue working with Text Mining to explore a few examples similar to those in the lectures from this week. First, we will review how to tokenize, tag, and chunk some text in Python. Next, we will use named entities and sentiment analysis to extract sentiment in news articles for given entities. Finally, we will use text summarization techniques to shorten long pieces of text.


This assignment is designed to build your familiarity and comfort coding in Python while also helping you review key topics from each module. As you progress through the assignment, answers will get increasingly complex. It is important that you adopt a data scientist's mindset when completing this assignment. **Remember to run your code from each cell before submitting your assignment.** Running your code beforehand will notify you of errors and give you a chance to fix your errors before submitting. You should view your Vocareum submission as if you are delivering a final project to your manager or client. 

***Vocareum Tips***
- Do not add arguments or options to functions unless you are specifically asked to. This will cause an error in Vocareum.
- Do not use a library unless you are expicitly asked to in the question. 
- You can download the Grading Report after submitting the assignment. This will include feedback and hints on incorrect questions. 




### Learning Objectives

- Summarize texts and isolate topics in sample data 
- Perform named entity analysis using Python


## Index:

#### Module 21: Text Mining in Python

- [Question 1](#q1)
- [Question 2](#q2)
- [Question 3](#q3)
- [Question 4](#q4)
- [Question 5](#q5)
- [Question 6](#q6)

## Module 21: Text Mining in Python - Topic Modeling



In the first part of this assignment, we will be testing your knowledge of the topics covered in Module 12, such as tokenizing, tag, and chunk our data 


We will use a dataset of articles gathered from the New York Times API relating to elections.  

Before proceeding, ensure that you have the following packages installed on your machine:
- [nltk](https://www.nltk.org)- The leading platform for building Python programs to work with human language data.

- [gensim](https://pypi.org/project/gensim/) - An open-source library for unsupervised topic modeling and natural language processing.

### Reading the dataset and tokenizing data

We will begin this assignment by reading the dataset in a DataFrame `df` and by performing some data claning. Next, you will be asked to tokenize your data. Remember, tokenization  is the process by which big quantity of text is divided into smaller parts called tokens.

As usual, we begin by importing the `pandas` library and by reading the dataset into a DataFrame `df`. Next, because it won't be useful in out analysis, we drop the advertisements section and convert the values stored in the column `date` to floats.

In [1]:
import pandas as pd
df = pd.read_csv('data/nyt_headlines.csv')
df.drop(index=df[df['section'] == 'Briefing'].index, inplace=True)
df['date'] = pd.to_datetime(df['date'])


In [60]:
import nltk

In [91]:
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /home/alexei/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /home/alexei/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

Finally, we visualize the first five rown of `df` and extract some information using the function `.info()`.

In [62]:
df.head()

Unnamed: 0,date,section,lead_paragraph
0,2019-07-25 21:48:55+00:00,U.S.,WASHINGTON — The Senate Intelligence Committee...
1,2019-08-30 11:15:22+00:00,World,JERUSALEM — The leader of the main Arab factio...
2,2019-08-29 18:00:28+00:00,U.S.,"BEAVER DAM, Wis. — Democratic Wisconsin Gov. T..."
3,2019-08-29 16:57:23+00:00,World,JERUSALEM — A small Israeli ultranationalist p...
4,2019-05-06 12:37:46+00:00,World,NEW DELHI — Violence disrupted the Indian elec...


In [63]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 932 entries, 0 to 963
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype              
---  ------          --------------  -----              
 0   date            932 non-null    datetime64[ns, UTC]
 1   section         932 non-null    object             
 2   lead_paragraph  932 non-null    object             
dtypes: datetime64[ns, UTC](1), object(2)
memory usage: 69.1+ KB


Next, we import  the function the function `.sent_tokenize()` from the library `NLTK`.

In [64]:
from nltk import sent_tokenize

[Back to top](#Index:) 
<a id='q1'></a>

### Question 1:

*20 points*

Create a list of sentences (tokenizing) the data from  the `lead_paragraph` series. To do so, define function `make_sents` which takes a `pandas` `Series`as an argument. Your function should use the function ``sent_tokenize`` to return a *nested list* with all sentences from the input `Series` collapsed into a single list.
        
Consider the example below:

|Sample Series Input|
| --- |
|"The cat. In the hat."|
|"One fish. Two fish. Red fish."|

Output:
`[['The cat.', 'In the hat.'], ['One fish.', 'Two fish.', 'Red fish.']]`

In [116]:
### GRADED

### YOUR SOLUTION HERE
def make_sents(s=df['lead_paragraph']):
    from nltk import sent_tokenize
    s2 = s.apply(sent_tokenize)
    return s2.tolist()

###
### YOUR CODE HERE
###


In [117]:
s3 = make_sents(df['lead_paragraph'])

In [118]:
a = pd.Series(["The cat. In the hat.","One fish. Two fish. Red fish."])

In [119]:
type(a)

pandas.core.series.Series

In [120]:
for line in a:
    print(line)
    print("----------")

The cat. In the hat.
----------
One fish. Two fish. Red fish.
----------


In [121]:
b = make_sents(a)

In [122]:
b

[['The cat.', 'In the hat.'], ['One fish.', 'Two fish.', 'Red fish.']]

In [123]:
a

0             The cat. In the hat.
1    One fish. Two fish. Red fish.
dtype: object

In [124]:
for line in s3:
    print(line)
    print("-----")

['WASHINGTON — The Senate Intelligence Committee concluded Thursday that election systems in all 50 states were targeted by Russia in 2016, an effort more far-reaching than previously acknowledged and one largely undetected by the states and federal officials at the time.']
-----
["JERUSALEM — The leader of the main Arab faction in parliament has shaken up Israel's election campaign by offering to sit in a moderate coalition government — a development that would end decades of Arab political marginalization and could potentially bring down Prime Minister Benjamin Netanyahu."]
-----
['BEAVER DAM, Wis. — Democratic Wisconsin Gov.', "Tony Evers says he's awaiting a recommendation from his legal team about when to call a special election for the congressional district being vacated by retiring Republican Rep. Sean Duffy."]
-----
["JERUSALEM — A small Israeli ultranationalist party has agreed to drop out of upcoming national elections to support Prime Minister Benjamin Netanyahu's ruling Li

['LAUDERHILL, Fla. — For nearly a week, the parking lot behind the Broward County elections office has been the scene of an unfolding postelection drama, with protesters demanding the arrest of the local elections supervisor and politicians claiming fraud in the ballot-counting process.', 'Gov.', 'Rick Scott has fueled the fury, sending his lawyers to court in a bid to call in the police to prevent any possible tampering with ballot-counting machines.']
-----
['They are a familiar sight at farmers’ markets and public squares across California every election season: workers gathering signatures needed to place voter initiatives on the ballot.']
-----
['Once, in his days as New York’s chief federal prosecutor and later as the city’s mayor, Rudolph W. Giuliani was a master of releasing damaging leaks aimed at the kneecaps of opponents.', 'Sometimes, they were true.']
-----
['The Democratic primary between Gov.', 'Andrew M. Cuomo and Cynthia Nixon was over in about the time it takes to wat

In [125]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [126]:
df['lead_paragraph'][6]

'ATHENS — Four years ago almost to this day, Greece came a breath away from leaving the euro. People formed quiet, somber lines outside banks to take out small amounts of cash, as a lockdown on the financial system barred them from accessing their savings. They stockpiled tinned food and toilet paper.'

### Tag and chunk the Data

Extracting the named entities using the library `NLTK` requires tagging each word with a part of speech using the function `.pos_tag()` and passing these to the `ne_chunk` method.  From here, we can obtain a tree representation of the sentence that includes any relevant named entity tags.  A full list of the entity tags can be found [here](https://pythonprogramming.net/part-of-speech-tagging-nltk-tutorial/).

Below, import the necessary libraries to we examine a simple example.

In [274]:
from nltk import word_tokenize, pos_tag, ne_chunk
sample_sent = 'Today Google won the election.'

In [275]:
new_sample = 'ATHENS — Four years ago almost to this day, Greece came a breath away from leaving the euro. People formed quiet, somber lines outside banks to take out small amounts of cash, as a lockdown on the financial system barred them from accessing their savings. They stockpiled tinned food and toilet paper.'

In [276]:
w = word_tokenize(sample_sent)
pos = pos_tag(w)
ne = ne_chunk(pos)

In [277]:
print(ne)

(S Today/NN (PERSON Google/NNP) won/VBD the/DT election/NN ./.)


In [280]:
w = word_tokenize(new_sample)
pos = pos_tag(w)
ne = ne_chunk(pos)

In [281]:
print(pos)

[('ATHENS', 'NNP'), ('—', 'NNP'), ('Four', 'CD'), ('years', 'NNS'), ('ago', 'RB'), ('almost', 'RB'), ('to', 'TO'), ('this', 'DT'), ('day', 'NN'), (',', ','), ('Greece', 'NNP'), ('came', 'VBD'), ('a', 'DT'), ('breath', 'NN'), ('away', 'RB'), ('from', 'IN'), ('leaving', 'VBG'), ('the', 'DT'), ('euro', 'NN'), ('.', '.'), ('People', 'NNS'), ('formed', 'VBD'), ('quiet', 'JJ'), (',', ','), ('somber', 'JJ'), ('lines', 'NNS'), ('outside', 'JJ'), ('banks', 'NNS'), ('to', 'TO'), ('take', 'VB'), ('out', 'RP'), ('small', 'JJ'), ('amounts', 'NNS'), ('of', 'IN'), ('cash', 'NN'), (',', ','), ('as', 'IN'), ('a', 'DT'), ('lockdown', 'NN'), ('on', 'IN'), ('the', 'DT'), ('financial', 'JJ'), ('system', 'NN'), ('barred', 'VBD'), ('them', 'PRP'), ('from', 'IN'), ('accessing', 'VBG'), ('their', 'PRP$'), ('savings', 'NNS'), ('.', '.'), ('They', 'PRP'), ('stockpiled', 'VBD'), ('tinned', 'VBN'), ('food', 'NN'), ('and', 'CC'), ('toilet', 'NN'), ('paper', 'NN'), ('.', '.')]


[Back to top](#Index:) 
<a id='q2'></a>
### Question 2:

*20 points*

Define a function, `make_chunks` that accepts, as input, a function which generates a list of  structured set of texts (corpora). Default the input of the function to `make_sents`. Your function should:

- Iterate through the list of corpora and assign a dictionary integer key, starting at 0 (this is the _row_ key).
- Iterates through the sentences of corpora and assign a dictionary integer key, starting at 0 (this is the _sentence_ key).

Your function should return the dictonary defined in the previous iterations.

**Hint: use the functions `word_tokenize`, `pos_tag`, `ne_chunk` and two nested `for` loops**
    
Consider the example below (before chunking):

```python
df['lead_paragraph'][6]
```
Output:
```
'ATHENS — Four years ago almost to this day, Greece came a breath away from leaving the euro. People formed quiet, somber lines outside banks to take out small amounts of cash, as a lockdown on the financial system barred them from accessing their savings. They stockpiled tinned food and toilet paper.'
```

The same text, after chunking, should be:

```python
print(make_chunks(ms=make_sents)[6])
```
Output:
```
{0: [Tree('GPE', [('ATHENS', 'NNP')]), ('—', 'NNP'), ('Four', 'CD'), ('years', 'NNS'), ('ago', 'RB'), ('almost', 'RB'), ('to', 'TO'), ('this', 'DT'), ('day', 'NN'), (',', ','), Tree('GPE', [('Greece', 'NNP')]), ('came', 'VBD'), ('a', 'DT'), ('breath', 'NN'), ('away', 'RB'), ('from', 'IN'), ('leaving', 'VBG'), ('the', 'DT'), ('euro', 'NN'), ('.', '.')], 1: [('People', 'NNS'), ('formed', 'VBD'), ('quiet', 'JJ'), (',', ','), ('somber', 'JJ'), ('lines', 'NNS'), ('outside', 'JJ'), ('banks', 'NNS'), ('to', 'TO'), ('take', 'VB'), ('out', 'RP'), ('small', 'JJ'), ('amounts', 'NNS'), ('of', 'IN'), ('cash', 'NN'), (',', ','), ('as', 'IN'), ('a', 'DT'), ('lockdown', 'NN'), ('on', 'IN'), ('the', 'DT'), ('financial', 'JJ'), ('system', 'NN'), ('barred', 'VBD'), ('them', 'PRP'), ('from', 'IN'), ('accessing', 'VBG'), ('their', 'PRP$'), ('savings', 'NNS'), ('.', '.')], 2: [('They', 'PRP'), ('stockpiled', 'VBD'), ('tinned', 'VBN'), ('food', 'NN'), ('and', 'CC'), ('toilet', 'NN'), ('paper', 'NN'), ('.', '.')]}
```

In [297]:
df['lead_paragraph'][6]

'ATHENS — Four years ago almost to this day, Greece came a breath away from leaving the euro. People formed quiet, somber lines outside banks to take out small amounts of cash, as a lockdown on the financial system barred them from accessing their savings. They stockpiled tinned food and toilet paper.'

In [309]:
### GRADED

### YOUR SOLUTION HERE
def make_chunks(ms=make_sents):
    row_key = 0
    sentence_key = 0
    row_dict = {}
    sentence_dict = {}
    for line in ms():
        sentence_key = 0
        for sentence in line:
            w = word_tokenize(sentence)
            pos = pos_tag(w)
            sentence_dict[sentence_key] = pos
            sentence_key += 1
        row_dict[row_key] = sentence_dict
        row_key += 1
        

    return row_dict

###
### YOUR CODE HERE
###


In [308]:

make_chunks(ms=make_sents)

{0: {0: [('As', 'IN'),
   ('the', 'DT'),
   ('November', 'NNP'),
   ('midterm', 'JJ'),
   ('elections', 'NNS'),
   ('approach', 'NN'),
   (',', ','),
   ('we', 'PRP'),
   ('invited', 'VBD'),
   ('Times', 'NNP'),
   ('readers', 'NNS'),
   ('to', 'TO'),
   ('ask', 'VB'),
   ('our', 'PRP$'),
   ('politics', 'NNS'),
   ('editor', 'NN'),
   (',', ','),
   ('Patrick', 'NNP'),
   ('Healy', 'NNP'),
   (',', ','),
   ('about', 'IN'),
   ('our', 'PRP$'),
   ('current', 'JJ'),
   ('political', 'JJ'),
   ('coverage', 'NN'),
   ('and', 'CC'),
   ('our', 'PRP$'),
   ('plans', 'NNS'),
   ('for', 'IN'),
   ('the', 'DT'),
   ('2020', 'CD'),
   ('presidential', 'JJ'),
   ('race', 'NN'),
   ('.', '.')],
  1: [('We', 'PRP'),
   ('quickly', 'RB'),
   ('heard', 'VBD'),
   ('from', 'IN'),
   ('more', 'JJR'),
   ('than', 'IN'),
   ('200', 'CD'),
   ('readers', 'NNS'),
   ('.', '.')],
  2: [('A', 'DT'),
   ('teenage', 'NN'),
   ('boy', 'NN'),
   ('killed', 'VBN'),
   ('by', 'IN'),
   ('a', 'DT'),
   ('stray', 

In [306]:
make_chunks()

WASHINGTON — The Senate Intelligence Committee concluded Thursday that election systems in all 50 states were targeted by Russia in 2016, an effort more far-reaching than previously acknowledged and one largely undetected by the states and federal officials at the time.
JERUSALEM — The leader of the main Arab faction in parliament has shaken up Israel's election campaign by offering to sit in a moderate coalition government — a development that would end decades of Arab political marginalization and could potentially bring down Prime Minister Benjamin Netanyahu.
BEAVER DAM, Wis. — Democratic Wisconsin Gov.
Tony Evers says he's awaiting a recommendation from his legal team about when to call a special election for the congressional district being vacated by retiring Republican Rep. Sean Duffy.
JERUSALEM — A small Israeli ultranationalist party has agreed to drop out of upcoming national elections to support Prime Minister Benjamin Netanyahu's ruling Likud party.
NEW DELHI — Violence dis

(Reuters) - Facebook Inc is tightening its political ad rules in the United States, it said on Wednesday, requiring new disclosures for its site and photo-sharing platform Instagram ahead of the U.S. presidential election in November 2020.
PRISTINA — Kosovo lawmakers voted to dissolve parliament on Thursday, paving the way for a parliamentary election after Prime Minister Ramush Haradinaj resigned last month.
STALOWA WOLA, Poland — Poland must resist the "traveling theater" of gay pride marches, the leader of its conservative ruling party said on Sunday, as the staunchly Roman Catholic country gears up for a parliamentary election on Oct. 13.
LONDON — New British finance minister Sajid Javid said he will announce higher public spending on health, education and the police next week in order to "clear the decks for Brexit," a move seen by many as preparation for an early election.
ROME — Italy's League leader Matteo Salvini said on Tuesday he was ready to keep the coalition government al

KIEV, Ukraine — Minutes after taking office on Monday, Ukraine’s new president, Volodymyr Zelensky, announced a snap parliamentary election that he hopes will consolidate his power and help him deliver on campaign promises to end endemic corruption and a prolonged separatist conflict.
Two of my children were born in socialist France.
They survived.
In fact, their births were great experiences: excellent medical care, wonderful postnatal follow-up, near-zero cost.
My son’s bris, in a Paris deserted through the August exodus, was another story, but I won’t get into that.
GAROWE, Somalia — The southern Somali state of Jubbaland has blocked access to the capital city Kismayo and its main airport ahead of Thursday's vote to elect a president of the semi-autonomous region, a senior regional official said on Tuesday.
CHARLOTTE, N.C. — Election meddling needn’t be a foreign power planting fictions on Facebook and courting the ragtag disciples of a real estate magnate.
On Thursday, India will a

Are your friends and relatives voting in the midterm elections this year?
Why or why not?
Do you agree with those who think the stakes are especially high right now?
It may be best known as the home of Cedar Point, the famous amusement park, but Sandusky, Ohio, on the shores of Lake Erie, has lately been getting attention for its politics.
WASHINGTON — His plan for ending a 35-day government shutdown failed to deliver a compromise.
On April 9, voters will decide whether Benjamin Netanyahu will remain prime minister of Israel.
The national election comes five weeks after Israel’s attorney general announced plans to indict Mr. Netanyahu on charges stemming from a yearslong corruption investigation.
Prime Minister Benjamin Netanyahu, who is facing a series of corruption investigations, said Israel would hold early elections in April.
DAKAR, Senegal — Hotels, a stadium and a conference center are rising in the capital, Dakar.
A train line will soon zip commuters from downtown to the new $5

Want this newsletter in your inbox?
Sign up here.
KABUL, Afghanistan — A group of lawmakers, many of them women, blocked the Afghan Parliament’s newly appointed speaker from taking his seat on Sunday, and security forces were dispatched after a scuffle broke out.
See full results and maps from the Nevada midterm elections.
JAKARTA, Indonesia — When Joko Widodo, the incumbent president of Indonesia, last year chose Ma’ruf Amin as his running mate for the general election this April, it became clear that Indonesian politics is now backed into a corner.
Mr. Ma’ruf is an Islamic cleric and scholar, and Mr. Joko was perhaps hoping to dampen attacks from conservative and radical Islamic groups that have called him anti-Islam (even though he is Muslim himself).
Instead, he has built a Trojan horse for his opponents outside the walls of his own city.
See full results and maps from the District of Columbia midterm elections.
JERUSALEM — Ehud Barak, a former Israeli prime minister and longtime c

See full results and maps from the New Mexico midterm elections.
To the Editor:
See full results and maps from the Maine midterm elections.
JERUSALEM — Frustrated over their choices in the Israeli election, and at Arab politicians whom they describe as ineffective, some Palestinian citizens of Israel have been gathering popular support for a boycott of Tuesday’s ballot.
WASHINGTON — Senator Lamar Alexander, Republican of Tennessee and one of the last bridges to bipartisanship in the Senate, announced on Monday that he would not seek re-election in 2020, citing a desire to leave the Senate “at the top of my game.”
See full results and maps from the Rhode Island midterm elections.
The investigation into a congressional seat narrowly won by a Republican reveals a detailed playbook for how election fraud can happen in the United States.
See full results and maps from the West Virginia midterm elections.
JOHANNESBURG — The “war room” for the African National Congress candidates running in l

It was the day before the 2016 presidential election, and at the Volusia County elections office, near Florida’s Space Coast, workers were so busy that they had fallen behind on their correspondence.
PHOENIX —  As primaries roll by and the midterms approach, it’s worth remembering that for Republicans 2016 represented an opportunity more than a victory.
It was a chance for them to help the country break the 30-year-spell the Clintons and the Bushes cast, President Barack Obama notwithstanding.
For Florida and Georgia, two outsize states in the Deep South, Election Day hasn’t actually ended.
JERUSALEM — Israelis were confronted with a rude new reality on Friday: a prime minister running for re-election while facing indictment for corruption.
WASHINGTON — The United States Cyber Command is targeting individual Russian operatives to try to deter them from spreading disinformation to interfere in elections, telling them that American operatives have identified them and are tracking their w

RALEIGH, N.C. — North Carolina officials on Thursday ordered a new contest in the Ninth Congressional District after the Republican candidate, confronted by evidence that his campaign had financed an illegal voter-turnout effort, called for a new election.
WASHINGTON — The American military took down a Russian troll farm last Election Day in a cyberattack that continued for several days after the vote, part of what United States officials have said is a persistent campaign to block and deter interference in American democracy.
WASHINGTON — In June 2016, five months before the American presidential election, Julian Assange made a bold prediction during a little-noticed interview with a British television show.
Watching HBO’s “Brexit” as an American is like going to a movie with the knowledge that you’re living in the sequel.
BERLIN — For years, Chancellor Angela Merkel of Germany has been fighting to expand the digital skills of the nation’s work force and to get more youths engaged in 

A who’s who of New York’s political power structure — and those who aspire to crash it — descended on a small stretch of Eastern Parkway in Brooklyn on Monday, seeking votes amid the bright and bedazzled costumes and pulsating beats of the city’s annual West Indian American Day Parade.
President Trump recently accused the Chinese of interfering in American politics ahead of the midterm elections.
“They do not want me or us to win because I am the first president to ever challenge China on trade,” he said, addressing the United Nations Security Council.
He provided no evidence, and appeared to be complaining mostly about retaliatory tariffs by the Chinese government, which may hurt constituencies that support him, and an advertorial touting U.S.-China trade in an Iowa newspaper.
ABUJA, Nigeria — The weather.
Sabotage of buildings storing election materials.
A raft of court challenges.
Welcome to a special election weekend edition of California Today.
SAN FRANCISCO — The Federal Election

In an effort to shed more light on how we work, The Times is running a series of short posts explaining some of our journalistic practices.
Read more of this series here.
This is part of a series on what is at stake in New York’s primary elections on Thursday, and in the general election on Nov. 6.
See full results and maps from the New York primaries.
See full results and maps from the District of Columbia primaries.
The results of the Ohio special election are too close to call.
Why did the Republican party have to fight so hard to defend this long-held House seat?
In a stark display of the nation’s divide, the lower and upper chamber of every legislature but one — Minnesota — will be controlled by the same party following Tuesday’s midterm elections.
It will be the first time in 104 years that only one state will have a divided legislature.
The day after Mr. DeSantis won the Republican primary, he said voters should not “monkey this up” by electing Mr. Gillum.
Mr. Gillum, who is bla

This district includes Virginia Beach, the Norfolk Naval Base and the Virginia portion of Delmarva.
It has been in Republican hands for all but one term of the last two decades.
Donald J. Trump carried the district by three points.
High Stakes for the Country, and Erdogan
This relatively wealthy and well-educated district is the <a href="https://www.nytimes.com/2017/12/18/us/politics/house-control-2018-suburbs-trump-republicans-democrats.html">sort of place</a> that Democrats are hoping to flip because of displeasure with President Trump.
Outside of Utah, no battleground district swung more from Mitt Romney to Hillary Clinton.
Mayor Bill de Blasio’s emphasis on small donors in last year’s mayoral race helped fuel a resurgence of donations of $175 or less to candidates participating in New York City’s matching funds program, according to a new report by the Campaign Finance Board.
After <a href="https://www.nytimes.com/interactive/2018/02/19/upshot/pennsylvania-new-house-districts-gerry

This coal country district in the heart of Appalachia has the lowest number of college graduates of any competitive House race.
A Democrat has not held this seat since the mid-1980s, but the retirement of a long-serving Republican, Rodney Frelinghuysen, created an opening for a Democrat in this political environment.
The district is home to many of the kind of educated, affluent suburban voters who are wary of the president, and the change in tax law<a href="https://www.nytimes.com/2018/04/05/nyregion/new-jersey-house-democrats-campaigns.html"> </a><a href="https://www.nytimes.com/2018/04/05/nyregion/new-jersey-house-democrats-campaigns.html">is likely to be particularly unpopular</a> here.
This majority Hispanic border district <a href="https://www.nytimes.com/2018/06/20/us/politics/republicans-family-separation.html">stretches along the Rio Grande</a> to the suburbs of San Antonio.
Historically, it has some of the lowest turnout in the country.
This open seat has drawn national inter

See full results and maps from the Kentucky primaries.
This geographically large, mostly rural district encompasses the entire northern part of the state as well as the coast east of Rockland.
No incumbent here has lost a race since 1916, according to Ballotpedia.
In normal political times, a glowing report on the nation’s economy just before Election Day would be a gift to the party in power and a uniform talking point for its candidates.
But entering the final weekend before Tuesday’s midterm vote, President Trump’s blistering message of nativist fear has become the dominant theme of the campaign’s last days, threatening to overshadow the good economic news.
SAN SALVADOR — Salvadorans elected Nayib Bukele, the media-savvy former mayor of the capital, as their next president on Sunday, delivering a sharp rebuke to the two parties that emerged from the country’s brutal civil war in the 1980s and have held power ever since.
LAUDERHILL, Fla. — For nearly a week, the parking lot behind th

Mr. Lance, a moderate, won re-election in this district by 11 points in 2016 even as it swung toward a narrow victory for Hillary Clinton.
The battle for control of Congress was front and center on Tuesday night, with races taking shape in several intensely contested House seats in California and New Jersey.
But there were revealing elections in the Midwest and the South, too, underscoring President Trump’s power in the Republican Party and the different ways Democrats hope to loosen his hold on red-state America.
See full results and maps from the South Carolina primaries.
Just a month and a half away from national elections, Pakistan’s powerful military establishment has mounted a fearsome campaign against its critics in the news media, on social networks, and in mainstream political movements.
This strongly Hispanic South Florida district was created in 2012 and encompasses most of southern Miami-Dade County, Key West and all three of Florida's national parks.
See full results and m

ISTANBUL — President Recep Tayyip Erdogan of Turkey on Wednesday called elections for June 24, almost a year and a half earlier than scheduled, saying the situation in Syria and Iraq, as well as economic stability, demanded it.
Within the last two weeks, the investigation of Russia’s interference in the 2016 presidential election has escalated.
Maria Butina, a Russian woman who tried to broker a back-channel meeting between Donald J. Trump and Russian president Vladimir V. Putin during the 2016 election, was charged with conspiring to influence a United States election, and 12 Russian officials were indicted.
During a news conference in Helsinki, Finland, after a private meeting with Mr. Putin, President Trump said he believed Mr. Putin’s denial of meddling, raising concerns that Mr. Trump was siding with a foreign power over his own intelligence agencies.
He backtracked several days later.
Here are three books that provide insight into Mr. Putin’s rise to power, Russia’s involvement i

That could complicate Republicans’ plans to make their economic record a central argument in their case for re-election.
Gov.
Steve Bullock of Montana, a Democrat who has crusaded against the loosening of campaign finance rules, is suing the Trump administration to block it from eliminating a mandate that politically active nonprofit groups disclose the identities of their major donors to the government.
NEW DELHI — When the leaders of the world’s two most populous nations meet on Friday in the Chinese city of Wuhan, Prime Minister Narendra Modi of India will be pushing to get less from President Xi Jinping of China.
CAIRO — Days before Egypt’s presidential election, President Abdel Fattah el-Sisi’s beaming visage adorns billboards across Egypt.
His rivals are in jail, the news media is in his pocket and his sole challenger — a politician so obscure many Egyptians would struggle to name him — hasn’t bothered to campaign.
President Vladimir Putin’s real challenge in Sunday’s presidentia

{0: {0: [('As', 'IN'),
   ('the', 'DT'),
   ('November', 'NNP'),
   ('midterm', 'JJ'),
   ('elections', 'NNS'),
   ('approach', 'NN'),
   (',', ','),
   ('we', 'PRP'),
   ('invited', 'VBD'),
   ('Times', 'NNP'),
   ('readers', 'NNS'),
   ('to', 'TO'),
   ('ask', 'VB'),
   ('our', 'PRP$'),
   ('politics', 'NNS'),
   ('editor', 'NN'),
   (',', ','),
   ('Patrick', 'NNP'),
   ('Healy', 'NNP'),
   (',', ','),
   ('about', 'IN'),
   ('our', 'PRP$'),
   ('current', 'JJ'),
   ('political', 'JJ'),
   ('coverage', 'NN'),
   ('and', 'CC'),
   ('our', 'PRP$'),
   ('plans', 'NNS'),
   ('for', 'IN'),
   ('the', 'DT'),
   ('2020', 'CD'),
   ('presidential', 'JJ'),
   ('race', 'NN'),
   ('.', '.')],
  1: [('We', 'PRP'),
   ('quickly', 'RB'),
   ('heard', 'VBD'),
   ('from', 'IN'),
   ('more', 'JJR'),
   ('than', 'IN'),
   ('200', 'CD'),
   ('readers', 'NNS'),
   ('.', '.')],
  2: [('A', 'DT'),
   ('teenage', 'NN'),
   ('boy', 'NN'),
   ('killed', 'VBN'),
   ('by', 'IN'),
   ('a', 'DT'),
   ('stray', 

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


Finally, we will use the tree representation to generate a dictionary of named entities.  We do this by examining each tree based on whether or not a named entity label has been provided.  

Below, we reconsider the simple example given above: note that only the word `Google` is identified as a named entity.  

In [310]:
#Test sentence
sample_sent = 'Today Google won the election.'
#Tokenize, tag and chunk
w = word_tokenize(sample_sent)
pos = pos_tag(w)
ne = ne_chunk(pos)

Next, we examine the output with visual representations

In [311]:
ne.pretty_print()

                   S                              
    _______________|________________________       
   |        |      |         |       |    PERSON  
   |        |      |         |       |      |      
Today/NN won/VBD the/DT election/NN ./. Google/NNP



In [312]:
ne[1].pretty_print()

  PERSON  
    |      
Google/NNP



In [313]:
ne[1].label()

'PERSON'

In [314]:
ne[1].leaves()

[('Google', 'NNP')]

[Back to top](#Index:) 
<a id='q3'></a>

### Question 3:

*20 points*

Define a function `get_pos` which takes as input a function which returns a chunked dictionary.  Default the input of the function to `make_sents`. Your function should return a dictionary where:
- The key is the integer value of the source dataset row and  value is a _list_ of _dictionaries_ where:
- The key is a string value of the named entity and the value is a tuple where:
- The first element is the entity tag and the second element is the part of speech tag

Obberve the example below:

```python
get_pos(mc=make_chunks)[6]
```

Returns:
```
[{'ATHENS': ('GPE', 'NNP')}, {'Greece': ('GPE', 'NNP')}]
```

In [315]:
### GRADED

### YOUR SOLUTION HERE
def get_pos(mc=make_chunks):
    return

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


### Sentiment of an Entity
In this part of the assignment we use Vader to investigate the sentiment of headlines containing a specific entity. Remember, sentiment analysis is the process of *computationally* determining whether a piece of writing is positive, negative or neutral.

For this part we import the following necessary function:

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

[Back to top](#Index:) 
<a id='q4'></a>

### Question 4

*20 points*

Define a function `get_sentiment_df` that takes, as input a required case-insensitive string search word (default this to the word `Putin`), and sentence list generator function. Default this second input to `make_sents`. 

Your function should searches through the resultant list of sentences output from `make_sents` for the input search word and return a DataFrame with three columns:

 - `lead_paragraph_index`: the originating row number (parent index) of the sentence list
 - `sentence`: the string sentence to which sentiment is evaluated against
 - `compound_sentiment`: the float value which is the compound sentiment returned by  the function`.polarity_scores()`
 
 *Hint:* It is possible to have multiple rows on the output per index, as each parent row may contain multiple sentences

Consider the example below:

```python
get_sentiment_df('Putin', ms=make_sents)
```

Example Output:
<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>lead_paragraph_index</th>      <th>sentence</th>      <th>compound_sentiment</th>    </tr>  </thead>  <tbody>    <tr>      <th>0</th>      <td>150</td>      <td>President Trump and President Vladimir V. Puti...</td>      <td>0.0000</td>    </tr>    <tr>      <th>1</th>      <td>219</td>      <td>And then President Trump brushed off Russia’s ...</td>      <td>0.5267</td>    </tr>    <tr>      <th>2</th>      <td>768</td>      <td>A day after President Trump’s remarks alongsid...</td>      <td>-0.1027</td>    </tr>    <tr>      <th>3</th>      <td>772</td>      <td>During a news conference with President Vladim...</td>      <td>0.0000</td>    </tr>    <tr>      <th>4</th>      <td>775</td>      <td>A day after President Trump’s remarks alongsid...</td>      <td>-0.1027</td>    </tr>    <tr>      <th>5</th>      <td>795</td>      <td>RUSSIAN ROULETTE The Inside Story of Putin’s W...</td>      <td>-0.5994</td>    </tr>    <tr>      <th>6</th>      <td>805</td>      <td>SARAJEVO, Bosnia and Herzegovina — Just before...</td>      <td>0.4576</td>    </tr>    <tr>      <th>7</th>      <td>826</td>      <td>WASHINGTON — Russians working for a close ally...</td>      <td>0.0772</td>    </tr>    <tr>      <th>8</th>      <td>845</td>      <td>On an October afternoon before the 2016 electi...</td>      <td>0.3182</td>    </tr>    <tr>      <th>9</th>      <td>871</td>      <td>Maria Butina, a Russian woman who tried to bro...</td>      <td>0.2500</td>    </tr>    <tr>      <th>10</th>      <td>871</td>      <td>During a news conference in Helsinki, Finland,...</td>      <td>0.4767</td>    </tr>    <tr>      <th>11</th>      <td>871</td>      <td>Here are three books that provide insight into...</td>      <td>0.4767</td>    </tr>    <tr>      <th>12</th>      <td>915</td>      <td>HELSINKI, Finland — President Trump stood next...</td>      <td>0.2023</td>    </tr>    <tr>      <th>13</th>      <td>930</td>      <td>President Vladimir Putin’s real challenge in S...</td>      <td>-0.1872</td>    </tr>    <tr>      <th>14</th>      <td>932</td>      <td>WASHINGTON — In 2016, American intelligence ag...</td>      <td>0.6808</td>    </tr>  </tbody></table>



In [None]:
### GRADED

### YOUR SOLUTION HERE
def get_sentiment_df(sw, ms=make_sents):
    return

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


### Summarization 

Now, we turn to the problem of text summarization. Remember, text summarization refers to the technique of shortening long pieces of text with the intention  to create a coherent and fluent summary having only the main points outlined in the document.

We will use the example using the `gensim.summarization` module.  Our goal will be to build on our example where  we extracted meaningful sentences and provide a summary of the headlines related to a given entity. Below, we demonstrate an example of  the `gensim.summarization.summarize` method on our existing meaningful sentences.


In [None]:
import gensim.summarization

In [None]:
gensim.summarization.summarize_corpus(make_sents())[:5]

[Back to top](#Index:) 
<a id='q5'></a>

### Question 5: 

*20 points*

Write a function `get_summary` which takes as input a case-insensitive, string search word (default to `Putin`) and the DataFrame `df` defined in Question 4. Your function should return a list of lists showing all articles summarizing the string search word returned by `summarize_corpus()` 

Consider the example below :

```python
get_summary('Putin', df=df)
```

Example Output:
```
[['During a news conference with President Vladimir V. Putin of Russia, President Trump would not say whether he believed Russia meddled with the 2016 presidential election.'],
 ['RUSSIAN ROULETTE The Inside Story of Putin’s War on America and the Election of Donald Trump By Michael Isikoff and David Corn 338 pp. Twelve. $30.']]
```

In [None]:
### GRADED

### YOUR SOLUTION HERE
def get_summary(sw, df):
    return

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


### Topic Modeling

In this part of the assignment we demonstrate the use of the library `sklearn` topic modeling  capabilities.  Here, we rely on the `LatentDirichletAllocation` class that implements the LDA algorithm as demonstrated in the lectured.  This class expects a  vectorized array, accomplished  with the `CountVectorizer` or `TfidfVectorizer`.

As usual, we begin by importing the necessary libraries

In [None]:
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [None]:
#instantiate LDA and CountVectorizer class
lda = LatentDirichletAllocation()
cvect = CountVectorizer(stop_words='english')

#Transform lead_paragraph into document term matrix
dtm = cvect.fit_transform(df['lead_paragraph'])

#generate list of topics
topics = lda.fit_transform(dtm)

In [None]:
#function to print top words in each topic
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)
    print()

In [None]:
#create feature names variable and use in print_top_words function below.
feature_names = cvect.get_feature_names()
print_top_words(lda, feature_names, n_top_words=10)

From here, we are able to identify the probability of a given topic use the fit instance  of `LatentDirichletAlocation`.  We feed the fit instance a sample text and are returned an array of probabilities for topic relation.

In [None]:
#sample headline to determine topic probability with
sample_headline = df['lead_paragraph'][12]
print(sample_headline)

In [None]:
#transform and view probabilities for topics
import numpy as np
samp_cvect = cvect.transform(np.array([sample_headline]))
lda.transform(samp_cvect)

In [None]:
#sort top tokens in topic with highest probability
feats = cvect.get_feature_names()
topics = lda.components_[4]
pd.DataFrame({'prob': topics, 'features': feats}).nlargest(10,  'prob')

[Back to top](#Index:) 
<a id='q6'></a>

### Question 6: 

*20 points*

Define a function `topic_frame` that takes, as inputs:
  - headline (str): text of headline to determine topic inclusion
  - model (sklearn estimator): fit estimator from sklearn.decomposition (LDA or NMF)
  - vectorizer (sklearn transformer): fit vectorizer with vocabulary (CountVectorizer, TfidfVectorizer, or HashingVectorizer)
  - n (int): number of tokens to include in the returned DataFrame. Default this value to 10.
  
Your function should return a DataFrame containing top n words relating to input headline topics using input model.
    
Consider the example below:

```python
topic_frame('Putin', lda, cvect)
```

Example Output:    
<table border="1" class="dataframe">  <thead>    <tr style="text-align: right;">      <th></th>      <th>prob</th>      <th>features</th>    </tr>  </thead>  <tbody>    <tr>      <th>3391</th>      <td>34.284904</td>      <td>president</td>    </tr>    <tr>      <th>1486</th>      <td>34.272390</td>      <td>election</td>    </tr>    <tr>      <th>1488</th>      <td>25.862762</td>      <td>elections</td>    </tr>    <tr>      <th>978</th>      <td>22.230037</td>      <td>congressional</td>    </tr>    <tr>      <th>3693</th>      <td>22.128197</td>      <td>republican</td>    </tr>    <tr>      <th>728</th>      <td>21.695774</td>      <td>carolina</td>    </tr>    <tr>      <th>3013</th>      <td>20.821177</td>      <td>north</td>    </tr>    <tr>      <th>3726</th>      <td>17.187756</td>      <td>results</td>    </tr>    <tr>      <th>3962</th>      <td>15.357398</td>      <td>senate</td>    </tr>    <tr>      <th>4593</th>      <td>15.255583</td>      <td>trump</td>    </tr>  </tbody></table>

In [None]:
### GRADED

### YOUR SOLUTION HERE
def topic_frame(headline, model, vectorizer, n=10):
    return

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###
