# Twitter Diplomacy Study:
### An Analysis of President Trump and Secretary Pompeo's Twitter Activity from 08/07/2019 to 11/18/2019
  
  
**Ellie Frith**
  
 **November 11, 2019**

**Part I: Analysis of General Twitter Diplomacy Use**
- Reading in data
- Extracting summary statistics & examining data set:
    - number and proportion of diplomacy-related tweets for each user
        - permutation test for difference in proportions
    - number and names of diplomatic entities/ subject areas mentioned by each user
        - permutation test for difference in # of unique diplomatic entities mentioned by each user

**Read in data:**

In [1]:
import pandas as pd

In [2]:
pompeo_tweets = pd.read_csv('pompeo.csv')
trump_tweets = pd.read_csv('trump.csv')

**Number and Proportion of Diplomatic Tweets for each user:**

In [3]:
sum(trump_tweets.is_diplomatic)

334

In [4]:
sum(pompeo_tweets.is_diplomatic)

305

In [5]:
len(trump_tweets)

1583

In [6]:
len(pompeo_tweets)

407

In [7]:
# Difference in proportion of diplomacy-related tweets between trump and pompeo:
test_stat = (sum(trump_tweets.is_diplomatic)/len(trump_tweets))-(sum(pompeo_tweets.is_diplomatic)/len(pompeo_tweets))
test_stat

-0.538393961640961

**Permutation test for difference in proportion of diplomatic tweets:**

In [8]:
# Combine dataframes, including only key column
sub_df = [trump_tweets[['is_diplomatic']],pompeo_tweets[['is_diplomatic']]]
combined = pd.concat(sub_df, ignore_index = True)

In [9]:
# Perform permutation test (n = 10,000)
import numpy as np
import random

trump_count = len(trump_tweets)

prop_perm_test = np.zeros(10000)
  
for i in range(10000):
    
    # Select tweet indices randomly and assign to trump/ pompeo
    random_indices = random.sample(range(len(combined)),len(combined))
    trump_df = combined.loc[random_indices[0:trump_count]]
    pompeo_df = combined.loc[random_indices[trump_count:]]
 
    # Find proportions of diplomatic tweets for each and difference:
    trump_prop = trump_df.is_diplomatic.sum()/len(trump_df)
    pompeo_prop = pompeo_df.is_diplomatic.sum()/len(pompeo_df)
    
    prop_perm_test[i] = trump_prop-pompeo_prop

In [10]:
# Probablility of seeing this large or larger of a difference in proportion of diplomacy-related tweets is 
# essentially zero
sum(abs(prop_perm_test)>=abs(test_stat))/10000

0.0

**Names of diplomatic entities mentioned by each user:**

In [11]:
# Functions to reformat entities from tweets: (have as single string with multiple entities, want list):
    # Current format = single string, with multiple entities separated by comma
    # Desired format = list, in which each entity is a single string
    
def entity_reformat(entities):
    good_chars = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890,- '
    if pd.isnull(entities)==False:
        string = ''.join(i for i in entities if i in good_chars)
        string_list = string.split(',')
        for string in string_list:
            string = string.strip()
    else:
        string_list = entities   
    return(string_list)

trump_tweets.entities_mentioned = trump_tweets.entities_mentioned.apply(entity_reformat)
pompeo_tweets.entities_mentioned = pompeo_tweets.entities_mentioned.apply(entity_reformat)

In [12]:
# Function to create list of all entity mentions by each user (includes duplicates)

def entity_list(df):
    entity_list = []
    for i in df.entities_mentioned:
        if type(i)!=float:
            for item in i:
                string = item.strip()
                entity_list.append(string)      
    return entity_list

trump_entity_list = entity_list(trump_tweets)
pompeo_entity_list = entity_list(pompeo_tweets)

**Number of Unique Entities Mentioned by Each User:**

In [13]:
# Number of unique entities mentioned by Trump
trump_entity_set = set(trump_entity_list)
len(trump_entity_set)

52

In [14]:
# Number of unique entities mentioned by Pompeo 
pompeo_entity_set = set(pompeo_entity_list)
len(pompeo_entity_set)

111

**Permutation Test for Difference in Number of Unique Entities Mentioned:**

In [16]:
test_stat = len(trump_entity_set)-len(pompeo_entity_set)
test_stat

-59

In [17]:
# Create sub-dataframe with only desired column:
sub_df = [trump_tweets[['entities_mentioned']],pompeo_tweets[['entities_mentioned']]]

# Combine the tweet dataframes of the two users so that tweets can be permuted:
combined = pd.concat(sub_df, ignore_index = True)

# Perform Permutation Test (n=10,000)
unique_entity_perm_test = np.zeros(10000)
    
trump_count = len(trump_tweets)

  
for i in range(10000):
    
    # Select tweet indices randomly and assign to trump/ pompeo
    random_indices = random.sample(range(len(combined)),len(combined))
    trump_df = combined.loc[random_indices[0:trump_count]]
    pompeo_df = combined.loc[random_indices[trump_count:]]
 
    # Find proportions of diplomatic tweets for each and difference:
    trump_unique_count = len(set(entity_list(trump_df)))
    pompeo_unique_count = len(set(entity_list(pompeo_df)))
    
    unique_entity_perm_test[i] = trump_unique_count-pompeo_unique_count

In [18]:
# Calculate p-value:
sum(abs(unique_entity_perm_test)>=abs(test_stat))/10000

0.0579

**Part II: Relative Frequency of Specific Diplomatic Entity References in Diplomatic Tweets for Trump vs. Pompeo**
- calculate number of times each diplomatic entity was referenced by Trump vs Pompeo
- caluclate proportion of user's diplomatic tweets in which entity was referenced (proxy for importance of entity)
- calculate test statistic:
    - difference in proportion of mentions for each entity b/w users (trump-pompeo)
- Permutation tests for significance of test statistic for each entity:
    - do this 1000 times:
        - permute tweets between users 
        - calculate difference in proportions between users for each entity
    - find proportion of times difference was greater than or equal to test statistic for each test stat & entity
        - = p-value of difference in proportions for each entity 

In [19]:
# Create new dataframe to fill with diplomatic entity reference information
entity_df = pd.DataFrame(pompeo_entity_set.union(trump_entity_set))
entity_df.rename(columns = {0:'entity'}, inplace = True)

**Calculate the number of times each entity was mentioned by Pompeo vs Trump:**

In [20]:
def entity_count(entity,entity_list):
    count = 0
    for i in entity_list:
        if entity in i:
            count = count+1
    return count

In [21]:
entity_df['trump_count'] = entity_df.entity.apply(lambda x: entity_count(x, trump_entity_list))
entity_df['pompeo_count'] = entity_df.entity.apply(lambda x: entity_count(x, pompeo_entity_list))

**Calculate proportion of user's diplomatic tweets that reference each entity:**

In [22]:
# Find proportion of diplomatic tweets in which each entity is mentioned: (relative importance)
entity_df['trump_prop'] = entity_df['trump_count']/len(trump_tweets[trump_tweets.is_diplomatic==1])
entity_df['pompeo_prop'] = entity_df['pompeo_count']/len(pompeo_tweets[pompeo_tweets.is_diplomatic==1])

**Calculate test statistic: difference in proportions between users for each entity**

In [23]:
entity_df['prop_difference'] = entity_df.trump_prop-entity_df.pompeo_prop

**Combine tweet dataframes in preparation for permutation test:**

In [28]:
# First: combine datasets of trump diplomatic tweets and pompeo diplomatic tweets into one
pompeo_tweets['user'] = 'pompeo'
trump_tweets['user'] = 'trump'

diplom_sub = [trump_tweets[trump_tweets.is_diplomatic==1],pompeo_tweets[pompeo_tweets.is_diplomatic==1]]

combined_tweets = pd.concat(diplom_sub, ignore_index = True)
combined_tweets = combined_tweets[['entities_mentioned','user']]

**Permutation test for difference in mention counts:**

In [29]:
# Perform permutation test for difference in mention counts (n=10,000):

entity_prop_perm_test = np.zeros(shape=(len(entity_df),10000))

entities = entity_df[['entity']]
num_trump = len(combined_tweets[combined_tweets.user=='trump'])
    
for i in range(10000):
    
    # Select tweet indices randomly and assign to trump/ pompeo
    random_indices = random.sample(range(len(combined_tweets)),len(combined_tweets))
    trump_df = combined_tweets.loc[random_indices[0:num_trump]]
    pompeo_df = combined_tweets.loc[random_indices[num_trump:]]
        
    # Find list of entities_mentioned:
    trump_list = entity_list(trump_df)
    pompeo_list = entity_list(pompeo_df)
    
    # Find proportion of tweets mentioning each entity
    trump_prop = entities.entity.apply(lambda x: entity_count(x, trump_list))/len(trump_df)
    pompeo_prop = entities.entity.apply(lambda x: entity_count(x, pompeo_entity_list))/len(pompeo_df)
     
    # Find difference
    entity_prop_perm_test[:,i] = trump_prop-pompeo_prop

**Calculate permutation test p-value:**

In [37]:
# Calculate proportion of permutations in which absolute value of difference in mentions was greater than or equal to
# the absolute value of the observed difference in data (p-value for each entity)
entity_prop_perm_test = pd.DataFrame(abs(entity_prop_perm_test))
test_stat = abs(entity_df.prop_difference)

# Compare row-wise to see if permutation proportion is greater than or equal to test stat
entity_prop_perm_test = entity_prop_perm_test.ge(test_stat, axis = 0)

entity_df['prop_p_val'] = entity_prop_perm_test.sum(axis=1)/10000

In [42]:
entity_df[entity_df.prop_p_val<=0.05]
# Entities with significant p-values are those with the biggest difference in count/ proportion (not surprising)

Unnamed: 0,entity,trump_count,pompeo_count,trump_prop,pompeo_prop,prop_difference,prop_p_val
0,China,78,14,0.233533,0.045902,0.187631,0.0
11,Iran,10,63,0.02994,0.206557,-0.176617,0.0
14,Iraq,2,8,0.005988,0.02623,-0.020241,0.0366
22,Pacific Islands,0,4,0.0,0.013115,-0.013115,0.0497
26,Venezuela,1,26,0.002994,0.085246,-0.082252,0.0
28,Saudi Arabia,3,11,0.008982,0.036066,-0.027084,0.0183
31,Turkey,33,12,0.098802,0.039344,0.059458,0.0025
42,Cuba,1,8,0.002994,0.02623,-0.023235,0.0138
44,Kurds,24,0,0.071856,0.0,0.071856,0.0
49,UN,5,27,0.01497,0.088525,-0.073555,0.0


In [47]:
entity_df[entity_df.prop_p_val>0.05]
# Entities with insignificant p-values are those with smaller observed difference in count/ proportion of tweets
# mentioning a given entity (also unsurprising)

Unnamed: 0,entity,trump_count,pompeo_count,trump_prop,pompeo_prop,prop_difference,prop_p_val
1,Hizballah,0,2,0.000000,0.006557,-0.006557,0.2273
2,NATO,7,11,0.020958,0.036066,-0.015107,0.1745
3,United Arab Emirates,0,1,0.000000,0.003279,-0.003279,0.4747
4,Serbia,0,1,0.000000,0.003279,-0.003279,0.4874
5,Colombia,0,3,0.000000,0.009836,-0.009836,0.1069
...,...,...,...,...,...,...,...
112,Ecuador,0,3,0.000000,0.009836,-0.009836,0.1080
113,Israel,12,11,0.035928,0.036066,-0.000137,1.0000
114,Chile,1,2,0.002994,0.006557,-0.003563,0.4670
116,Slovakia,0,1,0.000000,0.003279,-0.003279,0.4749


In [51]:
# Number of entities with significant p-values:
sum(entity_df.prop_p_val<=0.05)

24

In [53]:
# Number of entities with insiginificant p-values:
sum(entity_df.prop_p_val>0.05)

94

In [43]:
# 20.33% of entities have statistically significant p-value for difference in proportion of mentions
sum(entity_df.prop_p_val<=0.05)/len(entity_df)

0.2033898305084746

In [45]:
# List of entities with insignificant p-values:
list(entity_df.entity[entity_df.prop_p_val>0.05])

['Hizballah',
 'NATO',
 'United Arab Emirates',
 'Serbia',
 'Colombia',
 'Ireland',
 'Saudia Arabia',
 'South Sudan',
 'Poland',
 'Montenegro',
 'North Korea',
 'New Zealand',
 'Guinea',
 'Brazil',
 'Greece',
 'India',
 'Gibraltar',
 'IAEA',
 'Tunisia',
 'Mauritius',
 'South Korea',
 'Mexico',
 'Central Europe',
 'Belgium',
 'ICC',
 'Guatemala',
 'Papua New Guinea',
 'Caribbean',
 'Europe',
 'Taliban',
 'Japan',
 'Egypt',
 'Balkans',
 'Romania',
 'Angola',
 'Botswana',
 'al-Qaeda',
 'Western Balkans',
 'OECD',
 'Bolivia',
 'Panama',
 'Russia',
 'Asia',
 'North Macedonia',
 'GCC',
 'Federated States of Micronesia',
 'Greenland',
 'Hong Kong',
 'Paris Climate Agreement',
 'Norway',
 'Indonesia',
 'EU',
 'Bahamas',
 'Denmark',
 'Zimbabwe',
 'Haiti',
 'Latin America',
 'Hamas',
 'APEC',
 'Palestine',
 'El Salvador',
 'Nicaragua',
 'United Kingdom',
 'Bahrain',
 'Scotland',
 'Uganda',
 'Jordan',
 'Kosovo',
 'Bolvia',
 'Qatar',
 'Central Asia',
 'Cyprus',
 'North America',
 'Sri Lanka',
 'Pa

In [98]:
# Export entity dataframe to csv, so I can make appropriate visuals in R:
entity_df.to_csv('./entity_df.csv', index = False)