## NLP Entity Analysis of Trump Tweets

This is part of my Trump Tweet Analysis project that analyzed tweets between May 2009 and September 2017. 

I used spaCy to run an entity analysis on these tweets.

I then visualized the entity analysis in Tableau. A video overview of this visualization is available at https://youtu.be/mml56SAz2sk

Import the required packages, including spaCy, and load its English model as nlp.

In [2]:
import spacy 
nlp = spacy.load('en')
import dill
import os
import pandas as pd
os.chdir('/Users/donajstewart/Master_JSON_Files')


I previously created the merged_df dataframe from JSON files each containing 1 year of tweets, dropped columns I did not need, and pickled it. There are 31,935 tweets in the dataset.

In [3]:
# Load the pickled dataframe; the with-block closes the file afterwards
with open('merged_df.pkg', 'rb') as f:
    merged_df = dill.load(f)

In [4]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 31935 entries, 0 to 1742
Data columns (total 5 columns):
id_str            31935 non-null int64
created_at        31935 non-null datetime64[ns]
retweet_count     31935 non-null int64
favorite_count    31935 non-null int64
text              31935 non-null object
dtypes: datetime64[ns](1), int64(3), object(1)
memory usage: 1.5+ MB


The dataframe contains the unique id_str for each tweet, its creation timestamp, the retweet count, the favorite count, and the text of the tweet.

In [5]:
merged_df.head(10)

Unnamed: 0,id_str,created_at,retweet_count,favorite_count,text
0,6971079756,2009-12-23 17:38:18,28,12,From Donald Trump: Wishing everyone a wonderfu...
1,6312794445,2009-12-03 19:39:09,33,6,Trump International Tower in Chicago ranked 6t...
2,6090839867,2009-11-26 19:55:38,13,11,Wishing you and yours a very Happy and Bountif...
3,5775731054,2009-11-16 21:06:10,5,3,Donald Trump Partners with TV1 on New Reality ...
4,5364614040,2009-11-02 14:57:56,7,6,"--Work has begun, ahead of schedule, to build ..."
5,5203117820,2009-10-27 15:31:48,4,5,"--From Donald Trump: ""Ivanka and Jared’s weddi..."
6,5069623974,2009-10-22 13:57:04,2,2,"Hear Donald Trump discuss big gov spending, ba..."
7,4862580190,2009-10-14 14:13:17,4,10,Watch video of Ivanka Trump sharing business a...
8,4629116949,2009-10-05 14:37:38,1,4,- Read what Donald Trump has to say about daug...
9,4472353826,2009-09-29 15:28:23,23,30,"""A lot of people have imagination, but can't e..."


## Part of Speech Tagging

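We can iterate through the text column and tag parts of speech. The loop below prints the first token of each tweet together with its tag, for example NNP = proper noun, IN = preposition or subordinating conjunction, VB = verb (base form). A list of spaCy annotations can be found here: https://spacy.io/api/annotation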

In [6]:
for tweets_doc in nlp.pipe(iter(merged_df['text']), batch_size=1, n_threads=4):
    # Print the first token of each tweet with its fine-grained POS tag
    print(tweets_doc[0].text, tweets_doc[0].tag_)

From IN
Trump NNP
Wishing VBG
Donald NNP
--Work NN
--From IN
Hear VB
Watch VB
- :
" ``
Read VB
- :
- :
Reminder NN
Watch VB
-- :
Ivanka NNP
" ``
Browse NNP
Check VB
Congrats NNS
Donald NNP
" ``
Here RB
Watch VB
RE NN
Donald NNP
“ NFP
- :
RE NN
Thanks NNS
Today NN
Last JJ
“ NFP
Check VB
" ``
Read VB
" ``
Did VBD
Do VB
" ``
Read VB
" ``
" ``
" ``
" ``
" ``
Enter VB
" ``
Listen VB
Miss NNP
" ``
New JJ
Donald NNP
Donald NNP
Be VB
WIshing VBG
Wishing VBG
Do VB
I PRP
... :
Those DT
All PDT
Congratulations NNS
Tonight NN
Tonight NN
Congratulations NNS
My PRP$
Tomorrow NN
Watch VB
Staff NNP
" ``
We PRP
My PRP$
Tonight NN
Tonight NN
I PRP
I PRP
Be VB
Do VB
An DT
Be VB
Scotland NNP
Do VB
I PRP
Read VB
Tune NN
See VB
Do VB
Be VB
Do VB
Eric NNP
I PRP
Coming VBG
There EX
I PRP
Check VB
Check VB
I PRP
Went VBD
Spent NNP
The DT
Congratulations NNS
It PRP
Four CD
The DT
So RB
That DT
Friday NNP
Check VB
Check VB
Eric NNP
Mark NNP
Performing VBG
The DT
Enter VB
and CC
I PRP
Could MD
I PRP
and CC
and CC
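
Because the loop above prints only the first token of each tweet, one way to summarize that output is to tally how often each tag opens a tweet. The sketch below is a minimal illustration using only the standard library. Here first_tokens is a hypothetical, hand-copied sample of the (token, tag) pairs printed above, not the full result. In the notebook, the same list could be built by appending (tweets_doc[0].text, tweets_doc[0].tag_) inside the loop instead of printing.

```python
from collections import Counter

# Hypothetical sample: a few (first token, tag) pairs copied from the output above.
first_tokens = [
    ("From", "IN"), ("Trump", "NNP"), ("Wishing", "VBG"), ("Donald", "NNP"),
    ("Hear", "VB"), ("Watch", "VB"), ("Read", "VB"), ("Check", "VB"),
    ("Ivanka", "NNP"), ("Congratulations", "NNS"),
]

# Count how many tweets in the sample open with each tag.
tag_counts = Counter(tag for _, tag in first_tokens)

for tag, count in tag_counts.most_common():
    print(tag, count)
```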

## Named Entities

We can also create a list of named entities, which differ from parts of speech. Named entities include persons, organizations, and geographical locations; see https://spacy.io/api/annotation#named-entities. We run the tweets through the nlp pipeline again and collect each entity's label and text.

In [7]:
tweetentities = []
for tweets_doc in nlp.pipe(iter(merged_df['text']), batch_size=1, n_threads=4):
    for ent in tweets_doc.ents:
        tweetentities.append([ent.label_,ent.text]) 
print(tweetentities) 
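
Each element of tweetentities is a two-item list of [label, text]. As a small illustration of working with that shape before moving to pandas, the sketch below groups entity texts by label using the standard library. Here sample_entities is a hypothetical stand-in holding the first five pairs the notebook reports, and by_label is an illustrative name.

```python
from collections import defaultdict

# Hypothetical stand-in for tweetentities: the first five [label, text] pairs.
sample_entities = [
    ["PERSON", "Donald Trump"], ["EVENT", "New Year"], ["DATE", "2010"],
    ["ORG", "Trump International Tower"], ["GPE", "Chicago"],
]

# Group entity texts under their label.
by_label = defaultdict(list)
for label, text in sample_entities:
    by_label[label].append(text)

print(dict(by_label))
```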




We can place these entities in a new dataframe for further analysis using pandas.

In [8]:
# Pass the column names as a list: a set has no defined order, so the labels could be swapped
tweetents_df = pd.DataFrame(tweetentities, columns=['entity', 'text'])
tweetents_df.head(5)

Unnamed: 0,entity,text
0,PERSON,Donald Trump
1,EVENT,New Year
2,DATE,2010
3,ORG,Trump International Tower
4,GPE,Chicago


We now have 52,458 entities extracted from the tweets.

In [9]:
tweetents_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52458 entries, 0 to 52457
Data columns (total 2 columns):
entity    52458 non-null object
text      52458 non-null object
dtypes: object(2)
memory usage: 819.7+ KB


The entity list can also be saved to a CSV file.

In [10]:
tweetents_df.to_csv('tweetents.csv', index=False)


Using groupby we see that PERSON is the most common entity type, followed by organizations (ORG) and geopolitical entities (GPE).

In [11]:

tweetents_df.groupby('entity').size().sort_values(ascending=False) \
  .reset_index(name='text')

Unnamed: 0,entity,text
0,PERSON,13579
1,ORG,11281
2,GPE,9176
3,DATE,4935
4,CARDINAL,3416
5,TIME,2158
6,WORK_OF_ART,2072
7,NORP,1991
8,MONEY,1288
9,ORDINAL,660
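
To put the counts above in proportion, each one can be divided by the 52,458 total entities. The sketch below hard-codes the ten counts shown in the table; label_counts, total_entities, and shares are illustrative names, not part of the notebook. With the dataframe in hand, tweetents_df['entity'].value_counts(normalize=True) should give the same shares as fractions for every label.

```python
# Counts copied from the groupby output above (top ten labels only).
label_counts = {
    "PERSON": 13579, "ORG": 11281, "GPE": 9176, "DATE": 4935,
    "CARDINAL": 3416, "TIME": 2158, "WORK_OF_ART": 2072,
    "NORP": 1991, "MONEY": 1288, "ORDINAL": 660,
}
total_entities = 52458  # total rows reported by tweetents_df.info()

# Share of all extracted entities per label, as a percentage.
shares = {label: 100 * n / total_entities for label, n in label_counts.items()}

for label, pct in shares.items():
    print(f"{label:<12} {pct:5.1f}%")
```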


We can also see that Trump and Donald Trump are the most common values in the text column, followed by America.

In [12]:

tweetents_df['text'].value_counts()

Trump                                         1827
Donald Trump                                   873
America                                        830
Obama                                          809
today                                          619
tonight                                        583
one                                            528
China                                          477
U.S.                                           425
2016                                           416
Donald                                         384
ObamaCare                                      349
American                                       341
first                                          321
tomorrow                                       261
Hillary                                        251
US                                             245
TRUMP                                          227
Iran                                           213
Republicans                    
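
The counts above treat Trump and TRUMP, or U.S. and US, as different values. One possible cleanup, sketched below under the assumption that case and periods carry no meaning for this tally, is to normalize each string before counting. Here raw_counts holds a few values copied from the output, and normalize is a hypothetical helper, not something the notebook defines.

```python
from collections import Counter

# A few counts copied from the value_counts() output above.
raw_counts = {"Trump": 1827, "TRUMP": 227, "U.S.": 425, "US": 245, "America": 830}

def normalize(text):
    """Hypothetical helper: lowercase and drop periods so variants collapse."""
    return text.lower().replace(".", "")

# Sum the counts of all variants that normalize to the same string.
merged = Counter()
for text, count in raw_counts.items():
    merged[normalize(text)] += count

print(merged.most_common())
```

Lowercasing is lossy: US and the pronoun us would also collapse into one key, so a more careful cleanup might use an explicit alias table instead.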

Among the organizations identified are 'Trump International Tower', 'Ivanka', and 'Tall Buildings & Urban Habitat'. Ivanka is a person rather than an organization, which illustrates some of the limitations of NLP analysis. Overall there are 11,281 ORG entities.

In [13]:
tweetents_df.loc[tweetents_df['entity'] == 'ORG'] 

Unnamed: 0,entity,text
3,ORG,Trump International Tower
6,ORG,Council
7,ORG,Tall Buildings & Urban Habitat
11,ORG,Omarosa
12,ORG,Ultimate Merger
13,ORG,Trump International – Scotland
20,ORG,GMA
22,ORG,Ivanka
23,ORG,The Trump Card
27,ORG,DSRL


NORP stands for nationalities or religious or political groups. There are 1,991 NORP entities.

In [14]:
tweetents_df.loc[tweetents_df['entity'] == 'NORP']

Unnamed: 0,entity,text
507,NORP,Scottish
558,NORP,American
600,NORP,Americans
601,NORP,Indy Presidential
608,NORP,Iraqis
656,NORP,Americans
661,NORP,Iraqis
672,NORP,American
674,NORP,Republicans
735,NORP,European


GPE covers countries, cities, and states. Again, however, the tagging is not perfect: Melania and Facebook both show up as places.

In [15]:
tweetents_df.loc[tweetents_df['entity'] == 'GPE'] 

Unnamed: 0,entity,text
4,GPE,Chicago
33,GPE,Bahamas
88,GPE,Facebook
96,GPE,Trumpative
135,GPE,Las Vegas
148,GPE,Apprentice
216,GPE,Scotland
223,GPE,Aberdeen
224,GPE,Scotland
236,GPE,Melania


To visualize this data and perform additional analysis, I imported it into Tableau. A video walkthrough is available at https://youtu.be/mml56SAz2sk