<a href="https://colab.research.google.com/github/brockmanmatt/gdelt_news_exploration/blob/master/LabelTweets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Status:

|Section|Status|ToDo|AdditionalRefs|
|---|---|---|---|
|--Tweets--||||
|Load Tweets|Done|||
|Create Digraphs|Done|Identify Variants||
|Other NLP||||
|--Articles--||||
|Load Articles|Done|||
|Extract Labels|Done|||
|Create Custom Labels||||
|--Combined Sets--||||
|Check People Overlap|Done|||
|Check Location Overlap||||



##### This is for Colab: it mounts Google Drive in the notebook. The articles and GDELT data are already on my Drive

In [0]:
from google.colab import drive
drive.mount('/content/gdrive')



Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [0]:
import pandas as pd
import datetime as dt
import pytz
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
from collections import Counter


In [0]:
cd /content/gdrive/My Drive/TrumpTweets

/content/gdrive/My Drive/TrumpTweets


##### Acknowledging at the start: this approach is going to miss important issues, e.g. the cat tweet with the hurricane, so that sort of thing might need to be labeled by hand.

### Loading all of Trump's tweets through 21 December 2019, then keeping only those from January 2017 onward

In [0]:
tweets = pd.read_json("Data/TrumpTweets.json")

In [0]:
# Checked against Twitter (e.g. for row 1, https://twitter.com/realDonaldTrump/status/1208494102062477312): created_at is indeed the UTC time of the tweet
tweets.loc[1, "created_at"]

Timestamp('2019-12-21 21:07:01+0000', tz='UTC')

In [0]:
tweets = tweets[tweets["created_at"] >= dt.datetime(2017,1,1, 0,0,0,0, pytz.UTC)]

In [0]:
tweets.head()

Unnamed: 0,source,text,created_at,retweet_count,favorite_count,is_retweet,id_str
0,Twitter for iPhone,RT @WhiteHouse: LIVE: President @realDonaldTru...,2019-12-22 00:15:33+00:00,4668,0,1.0,1208541550424264704
1,Twitter for iPhone,https://t.co/h5bAKuoyV2,2019-12-21 21:07:01+00:00,19483,66120,0.0,1208494102062477312
2,Twitter for iPhone,Last night I was so proud to have signed the l...,2019-12-21 19:38:25+00:00,19265,88649,0.0,1208471806815997952
3,Twitter for iPhone,https://t.co/aVE8FY0eP0 https://t.co/5iTkl6q9oQ,2019-12-21 05:39:23+00:00,13942,50093,0.0,1208260654571896832
4,Twitter for iPhone,The great Democrat disgrace. But we are winnin...,2019-12-21 04:50:58+00:00,13039,49823,0.0,1208248471200899072


In [0]:
"This has {} tweets".format(len(tweets))

'This has 13778 tweets'

##### So, what do I get when I run TF-IDF on the tweets? Now that I think about it, all I actually care about is labeling the digraphs (bigrams) with stopwords removed. (There are probably better ways to label tweets, e.g. pulling issues from articles, which I'll do below, but let's try!) NOTE: THIS INCLUDES RETWEETS; not sure whether they should be removed
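If the retweets turn out to skew the counts, one option is to filter on the `is_retweet` column before vectorizing. The snippet below is only a sketch on a made-up three-row frame (`tweets_sample` and `original_only` are hypothetical names), and it assumes `is_retweet` is the float flag shown in `tweets.head()` above.

```python
import pandas as pd

# Made-up rows standing in for the real tweets DataFrame
tweets_sample = pd.DataFrame({
    "text": ["RT @WhiteHouse: LIVE ...", "Fake news again!", "Great rally tonight"],
    "is_retweet": [1.0, 0.0, 0.0],
})

# Keep everything not flagged as a retweet.
# Rows where is_retweet is missing (NaN) are kept by this comparison.
original_only = tweets_sample[tweets_sample["is_retweet"] != 1.0]

print("{} of {} rows kept".format(len(original_only), len(tweets_sample)))
```

Running the vectorizer on both versions and comparing the top phrases would show how much the retweets matter.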

In [0]:
stemmer = SnowballStemmer("english")
tweets['stemmed'] = tweets.text.map(lambda x: ' '.join([stemmer.stem(word) for word in x.lower().split(' ')]))

In [0]:
tweets.head(2)

Unnamed: 0,source,text,created_at,retweet_count,favorite_count,is_retweet,id_str,stemmed
0,Twitter for iPhone,RT @WhiteHouse: LIVE: President @realDonaldTru...,2019-12-22 00:15:33+00:00,4668,0,1.0,1208541550424264704,rt @whitehouse: live: presid @realdonaldtrump ...
1,Twitter for iPhone,https://t.co/h5bAKuoyV2,2019-12-21 21:07:01+00:00,19483,66120,0.0,1208494102062477312,https://t.co/h5bakuoyv2


In [0]:
count_vectorizer = CountVectorizer(stop_words='english', max_df=.3, ngram_range=(1,3))
count_vectorizer.fit(tweets.stemmed)


CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=0.3, max_features=None, min_df=1,
                ngram_range=(1, 3), preprocessor=None, stop_words='english',
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [0]:
"{} total unique terms identified".format(len(count_vectorizer.vocabulary_))


'293695 total unique terms identified'

In [0]:
vec_counts = count_vectorizer.transform(tweets.stemmed)


In [0]:
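# Total occurrences of each term across all tweets
counts = np.asarray(vec_counts.sum(axis=0)).ravel().tolist()
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
counts_df = pd.DataFrame({'phrase': count_vectorizer.get_feature_names_out(), 'TweetCounts': counts})
# Sort once, then print the top 100 phrases ten per line
top_phrases = counts_df.sort_values(by='TweetCounts', ascending=False)["phrase"].to_list()
for i in range(10):
  print("; ".join(top_phrases[10*i:10*(i+1)]))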


https; rt; great; amp; presid; democrat; trump; veri; just; realdonaldtrump
peopl; thank; news; state; big; fake; new; border; want; american
time; today; work; mani; america; make; job; year; vote; republican
countri; fake news; look; media; good; unit; don; nation; like; impeach
day; total; bad; deal; come; country; china; report; onli; dem
win; crime; becaus; know; rt realdonaldtrump; hous; tax; trade; noth; wall
need; way; strong; whi; said; fbi; did; senat; say; meet
russia; thing; love; hunt; help; witch; witch hunt; world; honor; obama
congress; hard; whitehouse; presid trump; law; high; north; support; watch; unit state
ani; campaign; elect; mueller; people; schiff; realli; far; right; years


## So with uni-, bi-, and trigrams alone we get a bunch of terms, which is nice (I should add https, rt, and realdonaldtrump to the stopwords). Now that I think about it more, there's no point in adding the IDF part, although I might be wrong about that. Anyway, here's part of the list if you want to look
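A minimal sketch of how the extra stopwords could be added, assuming scikit-learn's built-in `ENGLISH_STOP_WORDS` set is a reasonable base. The names `extra_stop_words` and `sample` are hypothetical, the sample strings are made up, and `co` and `amp` are my guesses at other Twitter debris rather than anything decided above.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Hypothetical additions: "https", "rt", "realdonaldtrump" come from the note above;
# "co" (from t.co links) and "amp" (HTML-escaped "&") are assumptions.
extra_stop_words = {"https", "rt", "realdonaldtrump", "co", "amp"}
stop_words = list(ENGLISH_STOP_WORDS.union(extra_stop_words))

# Made-up text standing in for tweets.stemmed
sample = [
    "rt @realdonaldtrump: fake news https://t.co/abc123",
    "the fake news media &amp; the witch hunt",
]

# Stop words are dropped before n-grams are built, so phrases such as
# "rt realdonaldtrump" disappear along with the single tokens.
vectorizer = CountVectorizer(stop_words=stop_words, ngram_range=(1, 3))
vectorizer.fit(sample)

print(sorted(vectorizer.vocabulary_))
```

One caveat: the stop list holds unstemmed words, so stemmed forms such as `veri` or `becaus` in the output above would still get through unless the stop list is stemmed too.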

In [0]:
counts_df.set_index("phrase", inplace=True)

In [0]:
counts_df.sort_values(by='TweetCounts', ascending=False)[50:80]

phrase,TweetCounts
win,445
crime,421
becaus,421
know,417
rt realdonaldtrump,406
hous,394
tax,389
trade,386
noth,381
wall,380


## So the problem we're trying to address here is how to actually get a label for each tweet. For broad topics, that's hard to do without some existing set of labels. So let's see what the labels we get from the GDELT set look like

## Here's 3 months of GDELT V2 data for a few publications, which I'd previously aggregated into a Python pickle

In [0]:
articles = pd.read_pickle("Data/GDELT_Select_Publications.pkl")

In [0]:
articles.head()

Unnamed: 0,GKGRECORDID,DATE,SourceCollectionIdentifier,SourceCommonName,DocumentIdentifier,Counts,V2Counts,Themes,V2Themes,Locations,V2Locations,Persons,V2Persons,Organizations,V2Organizations,V2Tone,Dates,GCAM,SharingImage,RelatedImages,SocialImageEmbeds,SocialVideoEmbeds,Quotations,AllNames,Amounts,TranslationInfo,Extras
4,20190905000000-4,20190900000000.0,1.0,reuters.com,https://in.reuters.com/article/uber-brazil/bra...,,,TAX_FNCACT;TAX_FNCACT_DRIVERS;TAX_FNCACT_EMPLO...,"IDEOLOGY,2527;ECON_STOCKMARKET,331;WB_696_PUBL...","4#Sao Paulo, SãPaulo, Brazil#BR#BR27#-23.5333#...","4#Minas Gerais, Acre, Brazil#BR#BR01#18585#-8....",richard chang;ricardo brito;anthony boadle,"Richard Chang,2472;Ricardo Brito,2410;Anthony ...",york stock exchange;thomson reuters trust prin...,"York Stock Exchange,340;Thomson Reuters Trust ...","0.961538461538461,2.16346153846154,1.201923076...",,"wc:391,c1.2:2,c12.1:24,c12.10:21,c12.12:4,c12....",https://s3.reutersmedia.net/resources/r/?m=02&...,,,https://youtube.com/user/ReutersVideo;https://...,,"Uber Technologies Inc,281;New York Stock Excha...","250,cases,1773;",,<PAGE_LINKS>http://thomsonreuters.com/en/about...
34,20190905000000-34,20190900000000.0,1.0,reuters.com,https://in.reuters.com/article/usa-election-cl...,,,LEADER;TAX_FNCACT;TAX_FNCACT_PRESIDENT;USPEC_P...,"IDEOLOGY,4424;BAN,3286;TAX_FNCACT_CANDIDATES,2...","3#South Bend, Indiana, United States#US#USIN#4...","2#Pennsylvania, United States#US#USPA##40.5773...",donald trump;cory booker;barack obama;elizabet...,"Donald Trump,63;Cory Booker,3956;Barack Obama,...",thomson reuters trust principles;georgetown un...,"Thomson Reuters Trust Principles,4424;Georgeto...","-2.4113475177305,2.26950354609929,4.6808510638...",,"wc:641,c1.2:1,c1.3:23,c12.1:31,c12.10:62,c12.1...",https://s2.reutersmedia.net/resources/r/?m=02&...,,,https://youtube.com/user/ReutersVideo;https://...,,"Donald Trump,47;White House,134;Elizabeth Warr...","10,Democratic presidential contenders,668;7000...",,<PAGE_LINKS>http://thomsonreuters.com/en/about...
87,20190905000000-87,20190900000000.0,1.0,washingtonpost.com,https://www.washingtonpost.com/national/couple...,,,TRIAL;TAX_WORLDMAMMALS;TAX_WORLDMAMMALS_CATS;T...,"GENERAL_HEALTH,586;MEDICAL,586;FOOD_SECURITY,6...",,,jennifer klein,"Jennifer Klein,231",associated press,"Associated Press,19;Associated Press,841","-7.05128205128205,0.641025641025641,7.69230769...",,"wc:136,c12.1:8,c12.10:10,c12.12:4,c12.13:1,c12...",,,,,,"Jennifer Klein,227;Oakland County,425;West Blo...","178,cats were removed earlier,105;4,dollars ,2...",,<PAGE_ALTURL_AMP>https://beta.washingtonpost.c...
90,20190905000000-90,20190900000000.0,1.0,washingtonpost.com,https://www.washingtonpost.com/national/some-m...,,,TAX_FNCACT;TAX_FNCACT_MERCHANT;TAX_FNCACT_JUDG...,"GENERAL_GOVERNMENT,120;GENERAL_GOVERNMENT,337;...","3#Houston, Texas, United States#US#USTX#29.763...","2#Texas, United States#US#USTX##31.106#-97.647...",dana sabraw,"Dana Sabraw,305",associated press,"Associated Press,939","-1.21951219512195,1.82926829268293,3.048780487...",,"wc:151,c12.1:10,c12.10:15,c12.12:2,c12.13:3,c1...",,,,,,"Nomaan Merchant,19;Judge Dana Sabraw,292;David...","11,parents who were deported,108;400,parents w...",,<PAGE_ALTURL_AMP>https://beta.washingtonpost.c...
101,20190905000000-101,20190900000000.0,1.0,washingtonpost.com,https://www.washingtonpost.com/local/dc-politi...,,,CRISISLEX_CRISISLEXREC;UNGP_CRIME_VIOLENCE;USP...,"KILL,1531;IMMIGRATION,1643;TAX_FNCACT_IMMIGRAN...","3#Brightwood Park, District Of Columbia, Unite...","3#Brightwood Park, District Of Columbia, Unite...",chidi anyanwutaku;rashad m young;james g walke...,"Chidi Anyanwutaku,4654;Fitsum Kebede,1576;Erne...",emergency medical services department;washingt...,"Emergency Medical Services Department,461;Regu...","-3.51758793969849,0.879396984924623,4.39698492...",,"wc:748,c12.1:29,c12.10:56,c12.12:24,c12.13:17,...",https://www.washingtonpost.com/resizer/8_Fu4i5...,,,,1547|25||contributed to the deaths,"Brightwood Park,233;Emergency Medical Services...","2,at the fire department,1826;2,at DCRA,1851;7...",,<PAGE_LINKS>https://www.washingtonpost.com/loc...


## These are the publishers in this set. I don't think it's the set I'll end up with, but this is preliminary

In [0]:
articles.SourceCommonName.unique()

array(['reuters.com', 'washingtonpost.com', 'cnn.com', 'nytimes.com',
       'breitbart.com', 'cbsnews.com', 'foxnews.com', 'thehill.com',
       'msnbc.com', 'politico.com', 'nbcnews.com'], dtype=object)

In [0]:
len(articles)

138585

In [0]:
# Oh, right: DATE is stored as a number, so convert it to actual datetimes to get a sense of what dates are covered here
articles.DATE = articles.DATE.apply(lambda x: str(int(x)))
articles.DATE = pd.to_datetime(articles.DATE)
articles.DATE.min(), articles.DATE.max()

(Timestamp('2019-09-05 00:00:00'), Timestamp('2019-12-08 23:45:00'))
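The conversion above lets pandas infer the layout of the date string. As a sketch, assuming every `DATE` value follows the 14-digit `YYYYMMDDHHMMSS` layout seen in the record IDs, the format can be passed explicitly so that a malformed value raises an error instead of being misread. The variable names and the two sample values are illustrative.

```python
import pandas as pd

# Two sample values in the same float form that articles.DATE shows on load
raw_dates = pd.Series([20190905000000.0, 20191208234500.0])

# float -> int -> 14-character string, then parse with an explicit format
date_strings = raw_dates.astype("int64").astype(str)
parsed_dates = pd.to_datetime(date_strings, format="%Y%m%d%H%M%S")

print(parsed_dates.min(), parsed_dates.max())
```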

In [0]:
articles.fillna("",inplace=True)

In [0]:
# Pull all of the persons (and later orgs) out of the GDELT labels to see which ones overlap with the tweet phrases

In [0]:
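def strip_people(gdelt_list):
  # V2Persons entries look like "Name,offset" separated by ";"
  # keep only the lower-cased name from each entry
  rslt = []
  for x in gdelt_list.lower().split(";"):
    rslt.append(x.split(",")[0])
  return ";".join(rslt)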

In [0]:
articles["persons_stripped"] = articles.V2Persons.map(lambda x: strip_people(x))

In [0]:
articles.head()

Unnamed: 0,GKGRECORDID,DATE,SourceCollectionIdentifier,SourceCommonName,DocumentIdentifier,Counts,V2Counts,Themes,V2Themes,Locations,V2Locations,Persons,V2Persons,Organizations,V2Organizations,V2Tone,Dates,GCAM,SharingImage,RelatedImages,SocialImageEmbeds,SocialVideoEmbeds,Quotations,AllNames,Amounts,TranslationInfo,Extras,persons_stripped
4,20190905000000-4,2019-09-05,1.0,reuters.com,https://in.reuters.com/article/uber-brazil/bra...,,,TAX_FNCACT;TAX_FNCACT_DRIVERS;TAX_FNCACT_EMPLO...,"IDEOLOGY,2527;ECON_STOCKMARKET,331;WB_696_PUBL...","4#Sao Paulo, SãPaulo, Brazil#BR#BR27#-23.5333#...","4#Minas Gerais, Acre, Brazil#BR#BR01#18585#-8....",richard chang;ricardo brito;anthony boadle,"Richard Chang,2472;Ricardo Brito,2410;Anthony ...",york stock exchange;thomson reuters trust prin...,"York Stock Exchange,340;Thomson Reuters Trust ...","0.961538461538461,2.16346153846154,1.201923076...",,"wc:391,c1.2:2,c12.1:24,c12.10:21,c12.12:4,c12....",https://s3.reutersmedia.net/resources/r/?m=02&...,,,https://youtube.com/user/ReutersVideo;https://...,,"Uber Technologies Inc,281;New York Stock Excha...","250,cases,1773;",,<PAGE_LINKS>http://thomsonreuters.com/en/about...,richard chang;ricardo brito;anthony boadle
34,20190905000000-34,2019-09-05,1.0,reuters.com,https://in.reuters.com/article/usa-election-cl...,,,LEADER;TAX_FNCACT;TAX_FNCACT_PRESIDENT;USPEC_P...,"IDEOLOGY,4424;BAN,3286;TAX_FNCACT_CANDIDATES,2...","3#South Bend, Indiana, United States#US#USIN#4...","2#Pennsylvania, United States#US#USPA##40.5773...",donald trump;cory booker;barack obama;elizabet...,"Donald Trump,63;Cory Booker,3956;Barack Obama,...",thomson reuters trust principles;georgetown un...,"Thomson Reuters Trust Principles,4424;Georgeto...","-2.4113475177305,2.26950354609929,4.6808510638...",,"wc:641,c1.2:1,c1.3:23,c12.1:31,c12.10:62,c12.1...",https://s2.reutersmedia.net/resources/r/?m=02&...,,,https://youtube.com/user/ReutersVideo;https://...,,"Donald Trump,47;White House,134;Elizabeth Warr...","10,Democratic presidential contenders,668;7000...",,<PAGE_LINKS>http://thomsonreuters.com/en/about...,donald trump;cory booker;barack obama;elizabet...
87,20190905000000-87,2019-09-05,1.0,washingtonpost.com,https://www.washingtonpost.com/national/couple...,,,TRIAL;TAX_WORLDMAMMALS;TAX_WORLDMAMMALS_CATS;T...,"GENERAL_HEALTH,586;MEDICAL,586;FOOD_SECURITY,6...",,,jennifer klein,"Jennifer Klein,231",associated press,"Associated Press,19;Associated Press,841","-7.05128205128205,0.641025641025641,7.69230769...",,"wc:136,c12.1:8,c12.10:10,c12.12:4,c12.13:1,c12...",,,,,,"Jennifer Klein,227;Oakland County,425;West Blo...","178,cats were removed earlier,105;4,dollars ,2...",,<PAGE_ALTURL_AMP>https://beta.washingtonpost.c...,jennifer klein
90,20190905000000-90,2019-09-05,1.0,washingtonpost.com,https://www.washingtonpost.com/national/some-m...,,,TAX_FNCACT;TAX_FNCACT_MERCHANT;TAX_FNCACT_JUDG...,"GENERAL_GOVERNMENT,120;GENERAL_GOVERNMENT,337;...","3#Houston, Texas, United States#US#USTX#29.763...","2#Texas, United States#US#USTX##31.106#-97.647...",dana sabraw,"Dana Sabraw,305",associated press,"Associated Press,939","-1.21951219512195,1.82926829268293,3.048780487...",,"wc:151,c12.1:10,c12.10:15,c12.12:2,c12.13:3,c1...",,,,,,"Nomaan Merchant,19;Judge Dana Sabraw,292;David...","11,parents who were deported,108;400,parents w...",,<PAGE_ALTURL_AMP>https://beta.washingtonpost.c...,dana sabraw
101,20190905000000-101,2019-09-05,1.0,washingtonpost.com,https://www.washingtonpost.com/local/dc-politi...,,,CRISISLEX_CRISISLEXREC;UNGP_CRIME_VIOLENCE;USP...,"KILL,1531;IMMIGRATION,1643;TAX_FNCACT_IMMIGRAN...","3#Brightwood Park, District Of Columbia, Unite...","3#Brightwood Park, District Of Columbia, Unite...",chidi anyanwutaku;rashad m young;james g walke...,"Chidi Anyanwutaku,4654;Fitsum Kebede,1576;Erne...",emergency medical services department;washingt...,"Emergency Medical Services Department,461;Regu...","-3.51758793969849,0.879396984924623,4.39698492...",,"wc:748,c12.1:29,c12.10:56,c12.12:24,c12.13:17,...",https://www.washingtonpost.com/resizer/8_Fu4i5...,,,,1547|25||contributed to the deaths,"Brightwood Park,233;Emergency Medical Services...","2,at the fire department,1826;2,at DCRA,1851;7...",,<PAGE_LINKS>https://www.washingtonpost.com/loc...,chidi anyanwutaku;fitsum kebede;ernest chrappah


In [0]:
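# Count every name occurrence across all articles (a name listed twice in one article counts twice)
peopleCount = Counter(";".join(articles.persons_stripped).split(";"))
# sourceCount was never defined; the counts come from peopleCount itself
people_dict = dict(peopleCount)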


In [0]:
tmp = pd.DataFrame([people_dict])

In [0]:
tmp = tmp.T

In [0]:
tmp.columns=["ArticleCounts"]

In [0]:
tmp.sort_values(by="ArticleCounts", ascending=False)[:50]

Unnamed: 0,ArticleCounts
donald trump,27440
joe biden,12458
,9464
boris johnson,6972
los angeles,6491
elizabeth warren,5296
hunter biden,4674
nancy pelosi,4110
bernie sanders,3633
rudy giuliani,3349


In [0]:
tmp.loc["donald trump", "ArticleCounts"]

27440

In [0]:
print("wait, {} percent of articles are labelled Trump?!?!".format(round(100*tmp.loc["donald trump", "ArticleCounts"]/len(articles),3)))

wait, 19.8 percent of articles are labelled Trump?!?!
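One thing that could inflate this figure: the count above tallies every entry in `persons_stripped`, so a name listed more than once in a single article is counted more than once. I have not checked whether `V2Persons` repeats names in this pickle, but the `V2Organizations` sample above does (`Associated Press,19;Associated Press,841`). A minimal sketch of counting each person at most once per article, on made-up strings, with hypothetical names `unique_people` and `sample_articles`:

```python
from collections import Counter

def unique_people(v2_persons):
    """Return the set of lower-cased names in a V2Persons-style string."""
    names = (entry.split(",")[0].strip().lower() for entry in v2_persons.split(";"))
    # Drop the empty name produced by articles with no persons at all
    return {name for name in names if name}

# Made-up V2Persons values: "Name,offset" entries separated by ";"
sample_articles = [
    "Donald Trump,63;Joe Biden,120;Donald Trump,400",
    "Joe Biden,15",
    "",
]

# Every mention counts, which is how the notebook counts above
mention_counts = Counter(
    entry.split(",")[0].lower()
    for article in sample_articles
    for entry in article.split(";")
)

# Each person counts once per article
article_counts = Counter(
    name for article in sample_articles for name in unique_people(article)
)

print(mention_counts["donald trump"], article_counts["donald trump"])
```

Filtering out the empty name also removes the blank row that shows up in the top-50 table, which comes from articles whose persons field was empty after `fillna("")`.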


In [0]:
pd.concat([tmp, counts_df], axis=1).dropna().sort_values(by="TweetCounts", ascending=False)[:50]


Unnamed: 0,ArticleCounts,TweetCounts
donald trump,27440.0,150.0
adam schiff,2830.0,123.0
white house,59.0,83.0
joe biden,12458.0,69.0
kim jong,11.0,58.0
robert mueller,1192.0,43.0
chuck schumer,258.0,43.0
hurrican dorian,2.0,35.0
mark levin,87.0,34.0
bob mueller,43.0,27.0
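The `concat(...).dropna()` step above amounts to an inner join on the phrase index. As a sketch, the same overlap can be written as an explicit join, which avoids the sort warning and keeps the counts as integers. The two small frames below stand in for `tmp` and `counts_df`, using a few values from the outputs above; `article_counts`, `tweet_counts`, and `overlap` are hypothetical names.

```python
import pandas as pd

# Stand-in for tmp: counts from the article side, indexed by name
article_counts = pd.DataFrame(
    {"ArticleCounts": [27440, 12458, 6972]},
    index=["donald trump", "joe biden", "boris johnson"],
)

# Stand-in for counts_df: counts from the tweet side, indexed by phrase
tweet_counts = pd.DataFrame(
    {"TweetCounts": [150, 69, 445]},
    index=["donald trump", "joe biden", "win"],
)

# Keep only the index values present in both frames
overlap = article_counts.join(tweet_counts, how="inner")
overlap = overlap.sort_values(by="TweetCounts", ascending=False)

print(overlap)
```

Note that the tweet phrases are stemmed while the GDELT names are not, so a name only matches when stemming leaves it unchanged.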
