In [1]:
import pyarrow.parquet as pq
import pandas as pd
from pymongo import MongoClient

In [2]:
table = pq.read_table('sdsc_data/output.parquet')
df = table.to_pandas()

In [4]:
df.head()

Unnamed: 0,news,id,collectiondate,title,url,publishdate,author,keywords,src,language,newsindex
0,"Countering Collins’ claims, Democrats said Tru...",1330233,2019-12-18,Partisan rage envelops Congress on the eve of...,http://www.nydailynews.com/news/politics/ny-...,2019-12-18,"[Dave Goldiner, Chris Sommerfeldt]","[rivals, expected, eve, president, volodymyr, ...",http://www.nydailynews.com/,,
1,After Cizikas` goal put the Islanders ahead by...,1330234,2019-12-18,Predators score 7 straight goals to beat Isla...,https://www.nydailynews.com/sports/hockey/is...,2019-12-18,[Allan Kreda],"[york, straight, islanders, goal, goals, sweep...",http://www.nydailynews.com/,,
2,"It was a tough call over David Fizdale, but Fi...",1330235,2019-12-18,The Daily News’ Knicks all-decade team for th...,http://www.nydailynews.com/sports/basketball...,2019-12-18,[Stefan Bondy],"[mother, daily, knicks, matt, practice, wins, ...",http://www.nydailynews.com/,,
3,And it’s a far cry from tactics such as “stamp...,1330236,2019-12-18,Kindergarten student earns enough money from ...,http://www.nydailynews.com/news/national/ny-...,2019-12-18,[Theresa Braine],"[debt, money, students, pay, meal, childs, thr...",http://www.nydailynews.com/,,
4,The Hawkeyes had slogged through 17 consecutiv...,1330237,2019-12-18,"Hayden Fry, Texan who turned around Iowa foot...",https://www.nydailynews.com/sports/football/...,2019-12-18,[Ralph D. Russo],"[hawkeyes, texan, unveiled, familiar, worn, un...",http://www.nydailynews.com/,,


In [3]:
client = MongoClient('mongodb://localhost:27017/')
db = client['mydatabase']
collection = db['sdsc-data']
#collection.delete_many({}) # Clears out collection

In [40]:
data = df.to_dict(orient = 'records') # Converts DataFrame to list of dictionaries (JSON-like)

In [None]:
## ONLY RUN THIS SECTION IF YOU HAVE NOT POPULATED MONGODB

In [41]:
from bson import BSON
from datetime import datetime
 
for doc in data: # Have to preprocess some column to fit datatype for MongoDB
    publish_date = datetime.combine(doc['publishdate'], datetime.min.time())
    collection_date = datetime.combine(doc['collectiondate'], datetime.min.time()
                                      )
    publish_date_timestamp = int(publish_date.timestamp() * 1000)
    collection_date_timestamp = int(collection_date.timestamp() * 1000)
    
    doc['publishdate'] = publish_date_timestamp
    doc['collectiondate'] = collection_date_timestamp
    
    doc['author'] = doc['author'].tolist()
    doc['keywords'] = doc['keywords'].tolist()

In [42]:
collection.insert_many(data)

<pymongo.results.InsertManyResult at 0x7f75508f19d0>

In [None]:
## END OF POPULATING MONGODB

In [4]:
query = {'publishdate': 1576656000000} # Example Query to get documents with publish date of 2019-12-18
result = collection.find(query)
for document in result[:5]:
    print(document)

{'_id': ObjectId('646c51b5ccfe8af7df2fd923'), 'news': 'Countering Collins’ claims, Democrats said Trump’s attempts to strong-arm Ukrainian President Volodymyr Zelensky into launching investigations of his political rivals while holding up $391 million in U.S. military aid amounted to several federal crimes, including bribery and wire fraud. They said they filed an impeachment article on abuse of power as opposed to bribery because they want to be able to point to a broader pattern of alleged misconduct that dates back to Trump’s invitation of Russian interference in the 2016 election.', 'id': 1330233, 'collectiondate': 1576656000000, 'title': ' Partisan rage envelops Congress on the eve of Trump’s expected impeachment ', 'url': '  http://www.nydailynews.com/news/politics/ny-house-rules-panel-parameters-debate-impeach-trump-20191217-p6nkbq2bwjhhja3u6dxetmeclu-story.html#nt=barker ', 'publishdate': 1576656000000, 'author': ['Dave Goldiner', 'Chris Sommerfeldt'], 'keywords': ['rivals', 'e

In [5]:
#pip3 install pytextrank
#spacy download en_core_web_sm

import spacy
import pytextrank

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe('textrank')

<pytextrank.base.BaseTextRankFactory at 0x7fd91ab1cbe0>

In [5]:
# example text
text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types."
doc = nlp(text)
# examine the top-ranked phrases in the document
for phrase in doc._.phrases[:10]:
    print(phrase.text)

mixed types
systems
minimal generating sets
nonstrict inequations
strict inequations
linear Diophantine equations
natural numbers
solutions
linear constraints
all the considered types systems


In [40]:
pipeline = [
    {"$match": {"keywords": {"$in": ['election']}}},
]

result = collection.aggregate(pipeline)

for document in list(result)[:5]:
    print('Title:')
    print(document['title'])
    print('News:')
    print(document['news'])
    print('')
    print('Top-Ranked Phrases:')
    doc = nlp(document['news'])
    
    for phrase in doc._.phrases[:10]:
        print(phrase.text)
    print(document['keywords'])
    print('')

TypeError: 'CommandCursor' object is not subscriptable

In [6]:
#pipeline = [
#    {"$match": {"news": {"$regex": r'\bCalifornia\b.*\bSenate\b'}}}
#]

pipeline = [
    {"$match": {"news": {"$regex": r'\bTrump\b'}}}
]

election_list = []

result = collection.aggregate(pipeline)

for document in result:
    election_list.append(document)

In [7]:
len(election_list)

227413

In [18]:
text = election_list[1]

print(text)
doc = nlp(text['news'])
# examine the top-ranked phrases in the document
for phrase in doc._.phrases[:10]:
    print(phrase.text)

{'_id': ObjectId('646c51b5ccfe8af7df2fda59'), 'news': 'I was all in for Illinois House Bill 3904 — the Student Athlete Endorsement Act, which is similar to a new law in California that will allow intercollegiate sports competitors to hire agents and make money off the commercial use of their own names, images or likenesses, just like the pros do, and also similar to an idea OK’d in October by the NCAA board of governors. It passed the House but went nowhere in the Senate.', 'id': 1330628, 'collectiondate': 1576656000000, 'title': ' Column: Good news about Pedway signage and fake meat, plus other year-end updates ', 'url': '  http://www.chicagotribune.com/columns/eric-zorn/ct-column-pedway-impossible-whopper-zorn-20191217-aivryvlnuzd3rolsexiu22jqfu-story.html ', 'publishdate': 1576656000000, 'author': ['Eric Zorn'], 'keywords': ['names', 'signage', 'fake', 'okd', 'house', 'column', 'similar', 'went', 'meat', 'senate', 'yearend', 'pros', 'ncaa', 'student', 'passed', 'pedway', 'plus', 'up

In [28]:
import datetime

date_grouped = {}
for i in election_list[35000:35500]:
    doc = nlp(i['news'])
    timestamp = datetime.datetime.fromtimestamp(i['publishdate'] / 1000)
    year = timestamp.year
    month = timestamp.month
    date = f'{month}-{year}'
    
    if date not in date_grouped:
        date_grouped[date] = {}
    for phrase in doc._.phrases[:10]:
        if phrase.text in date_grouped[date]:
            date_grouped[date][phrase.text] += 1
        else:
            date_grouped[date][phrase.text] = 1

In [29]:
date_grouped.keys()

dict_keys(['8-2018', '9-2018'])

In [32]:
my_dict = date_grouped['9-2018']

sorted_dict = dict(sorted(my_dict.items(), key=lambda x: x[1]))

# Print the sorted dictionary
for key, value in sorted_dict.items():
    print(key, value)

federal pay 1
federal civilian employees 1
federal spending 1
federal employees 1
pay 1
such increases 1
Federal agency budgets 1
player protests 1
NFL games 1
NFL programming 1
global marketing executive roles 1
chief marketing officer 1
TV executives 1
players 1
Mr. Ellis 1
Tim Ellis 1
President Roosevelt 1
Judge Rogers 1
Mr. Humphrey 1
President Obama 1
other independent agencies 1
the Labor Day holiday weekend 1
salary 1
the Labor Day 1
an event 1
Mr. Simpson 1
Mr. Schiff 1
Glenn Simpson 1
Justice Department official Bruce Ohr 1
Simpson 1
NRA lawyer 1
Cleta Mitchell 1
U.S. production 1
car companies 1
U.S. carmakers 1
auto tariffs 1
new vehicles 1
auto makers 1
other small cars 1
several car lines 1
Focus 1
AI models 1
AI 1
AI Channel 1
AI coverage 1
Facebook AI Research 1
AI Staff Writer 1
UC Berkeley AI 1
UC Berkeley AI researchers 1
Google employees 1
multiple employer plans 1
open multiple employer plans 1
small employers 1
more small employers 1
employers 1
unrelated employers

In [43]:
search_string = 'Kevin De Leon'

regex_pattern = f'.*{search_string}.*'

query = {'news': {'$regex': regex_pattern}}

result = collection.find(query)

kevin_list = []

for document in result:
    kevin_list.append(document)

In [44]:
len(kevin_list)

5

In [46]:
for document in kevin_list:
    print('Title:')
    print(document['title'])
    print('News:')
    print(document['news'])
    print('')
    print('Top-Ranked Phrases:')
    doc = nlp(document['news'])
    
    for phrase in doc._.phrases[:10]:
        print(phrase.text)
    print(document['keywords'])
    print('')

Title:
 California GOP risks shutout in Tuesday vote 
News:
Transcript

>> President`s choice.>> Republicans in California are on the ropes, bracing for a possible shutout at the top of the ticket in primary elections on Tuesday. I`m Andy Sullivan in San Francisco, perhaps the most liberal city in the state where Republicans have been pushed to the margins. Tuesday could bring more bad news due to the state`s quirky election rules, Republicans could end up without a candidate for senator or governor in the fall elections.

And that could cause trouble for other GOP candidates around the state. In most states, the primary election narrows down the field to one Democrat and one Republican. But in California`s so-called jungle primary, it`s the top two vote-getters in any given race who face each other in November, even if they`re from the same party.

That means incumbent Democratic Senator, Dianne Feinstein could face Democratic State Senator, Kevin De Leon, rather than a Republican. Re

state Sen. Kevin de Leon
State Sen. Kevin De Leon
de Leon
Kevin de Leon
Kevin De Leon
Today`s vote
U.S. Sen. Dianne Feinstein
De Leon
Sen. Feinstein
Saturday night
['dianne', 'shot', 'democratic', 'long', 'feinstein', 'sen', 'rival', 'establishment', 'california', 'state', 'senate', 'vote', 'party', 'washington', 'leon', 'rejection', 'snub']

Title:
 Emily Doe writing memoir to `reclaim the story` of sexual assault by Brock Turner 
News:
CLOSE A former Stanford University swimmer whose sexual assault of an incapacitated woman drew national headlines and widespread scorn lost his bid for a new trial, pushing him closer to having to register as a sex offender for the rest of his life. USA TODAY

Her victim impact statement four years ago went viral, changing the way many think about sexual assault, sparking widespread outrage and inspiring millions of survivors. Now Emily Doe, the anonymous woman then-Stanford University student Brock Turner sexually assaulted in 2015, is writing a memoi