# Twitter's Information Operation Dataset Review

The purpose of this analysis is to review the October 2020 dataset release from Twitter's Information Operations Transparency Center. Twitter has identified the accounts in this dataset as engaged in platform manipulation tied to a government or state-backed actor. This activity is considered an information operation and violates Twitter's policy.

The October 2020 data includes accounts tied to the following regions:
- Iran
- Russia
- Thailand
- Saudi Arabia
- Cuba

Each region consists of the following files:

- tweets_csv_hashed.zip - all tweets and metadata
- users_csv_hashed.zip - list of users and profile metadata
- media_file_list_hashed.txt - a list of URLs to .zip files, each containing tweet media for a number of users (the number of .zip files varies with the volume of media)

The analysis below reviews the users_csv and tweets_csv files to answer the following questions:

- The average number of followers for these accounts.
- The average number of accounts these malicious accounts were following back.
- The max/min number of followers for these accounts.
- The max/min number of accounts these malicious accounts were following back.
- The most popular month and year these accounts were created.
- A profile review of the account with the most followers.
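
The first four questions reduce to a handful of column aggregates. A minimal sketch on a made-up three-row frame (the values are illustrative, not taken from the dataset), assuming the `follower_count` and `following_count` column names from users_csv:

```python
import pandas as pd

# Toy stand-in for the users file; values are invented for illustration.
toy_users = pd.DataFrame({
    "follower_count": [0, 10, 50],
    "following_count": [5, 5, 20],
})

# One call returns mean, min and max for both columns.
summary = toy_users[["follower_count", "following_count"]].agg(["mean", "min", "max"])
print(summary)
```

The result is a small frame indexed by statistic name, so a single value can be read with `summary.loc["mean", "follower_count"]`.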


In [2]:
import pandas as pd
import glob
import datetime as dt
# combine all csv's into one dataframe

path = r'/Users/latoya/Desktop/twitter/'
all_files = glob.glob(path + "*.csv")

data = []

for f in all_files:
    
    df = pd.read_csv(f, index_col=None, header=0)
    data.append(df)

final = pd.concat(data, axis=0, ignore_index=True)

print("Original data and column information:\n")


print('dataset info')

print(final.info())


Original data and column information:

dataset info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1594 entries, 0 to 1593
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   userid                    1594 non-null   object
 1   user_display_name         1594 non-null   object
 2   user_screen_name          1594 non-null   object
 3   user_reported_location    381 non-null    object
 4   user_profile_description  812 non-null    object
 5   user_profile_url          83 non-null     object
 6   follower_count            1594 non-null   int64 
 7   following_count           1594 non-null   int64 
 8   account_creation_date     1594 non-null   object
 9   account_language          1594 non-null   object
dtypes: int64(2), object(8)
memory usage: 124.7+ KB
None
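
The combined frame has no column recording which file each row came from, so region-level questions cannot be answered from `final` alone. One possible approach, sketched here with in-memory CSV text instead of files on disk, is to tag each frame before concatenating. The `source_file` column name and the file names are hypothetical.

```python
import io
import pandas as pd

# Hypothetical file names mapped to tiny in-memory CSVs (stand-ins for real files).
fake_files = {
    "region_a_users.csv": "userid,follower_count\nu1,10\nu2,20\n",
    "region_b_users.csv": "userid,follower_count\nu3,30\n",
}

frames = []
for name, text in fake_files.items():
    df = pd.read_csv(io.StringIO(text))
    df["source_file"] = name  # remember where each row came from
    frames.append(df)

tagged = pd.concat(frames, ignore_index=True)

# With the tag in place, per-file statistics are a simple groupby.
per_file = tagged.groupby("source_file")["follower_count"].mean()
print(per_file)
```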


In [3]:
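# Change the account_creation_date column to a datetime dtype.
# pd.to_datetime is used because newer pandas versions reject a unit-less 'datetime64' cast.

final['account_creation_date'] = pd.to_datetime(final['account_creation_date'])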

#new datetime dtype
print("Columns with changed data type for account_creation_date:\n")
final.info()

Columns with changed data type for account_creation_date:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1594 entries, 0 to 1593
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   userid                    1594 non-null   object        
 1   user_display_name         1594 non-null   object        
 2   user_screen_name          1594 non-null   object        
 3   user_reported_location    381 non-null    object        
 4   user_profile_description  812 non-null    object        
 5   user_profile_url          83 non-null     object        
 6   follower_count            1594 non-null   int64         
 7   following_count           1594 non-null   int64         
 8   account_creation_date     1594 non-null   datetime64[ns]
 9   account_language          1594 non-null   object        
dtypes: datetime64[ns](1), int64(2), object(7)
memory usage: 124.7+ KB


In [4]:
#Creation of new column with just month and year

copy = final.copy()  # a real copy, so `final` keeps only the original columns

dy = copy['account_creation_date']

copy['month_year_creation'] = dy.dt.strftime('%B %Y')

print("Updated data with new column just containing month and year of creation:")
print(copy.info())

Updated data with new column just containing month and year of creation:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1594 entries, 0 to 1593
Data columns (total 11 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   userid                    1594 non-null   object        
 1   user_display_name         1594 non-null   object        
 2   user_screen_name          1594 non-null   object        
 3   user_reported_location    381 non-null    object        
 4   user_profile_description  812 non-null    object        
 5   user_profile_url          83 non-null     object        
 6   follower_count            1594 non-null   int64         
 7   following_count           1594 non-null   int64         
 8   account_creation_date     1594 non-null   datetime64[ns]
 9   account_language          1594 non-null   object        
 10  month_year_creation       1594 non-null   object        
dtypes: dateti
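
Formatting with `'%B %Y'` yields strings such as `January 2020`, which sort alphabetically rather than chronologically. If chronological ordering matters, a monthly period is one alternative. The sketch below uses invented dates and a hypothetical `creation_period` column.

```python
import pandas as pd

# Invented creation dates, for illustration only.
toy = pd.DataFrame({
    "account_creation_date": pd.to_datetime(
        ["2020-01-15", "2019-12-03", "2020-01-20", "2019-04-09"]
    )
})

# Monthly periods keep calendar order, unlike formatted strings.
toy["creation_period"] = toy["account_creation_date"].dt.to_period("M")

# Count accounts per month, then order the months chronologically.
counts = toy["creation_period"].value_counts().sort_index()
print(counts)
```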

## Analysis

- The average number of followers for these accounts.
    - The average number of followers these accounts had was 1,892.
    
- The average number of accounts these malicious accounts were following back.
    - The average number of accounts these accounts followed back was 810.
    
- The max/min number of followers for these accounts.
    - The min/max number of followers these accounts had was 0 / 1,197,574.
    
- The max/min number of accounts these malicious accounts were following back.
    - The min/max number of accounts these accounts were following back was 0 / 206,563.
    
- The most popular month and year these accounts were created.
    - January 2020 (512 accounts) and December 2019 (361 accounts)
    
- A profile review of the account with the most followers.
    - User Screen Name: fahad1althani
    - Account Created: July 2013
    - Account Language: Arabic
    - Follower Count: 1,197,574
    - Country: Saudi Arabia


In [5]:
'''Calculation of various trends for accounts included in dataset'''

avg_follower = copy['follower_count'].mean()
min_follower = copy['follower_count'].min()    
max_follower = copy['follower_count'].max()
avg_following=copy['following_count'].mean()
min_following=copy['following_count'].min()    
max_following = copy['following_count'].max()

print("Avg followers:",avg_follower)
print("Lowest followers:",min_follower)
print("Maximum followers:", max_follower)

print("Avg following:",avg_following)
print("Lowest following:",min_following)
print("Maximum following:", max_following)
um_mask = copy['follower_count'] == max_follower



language_counts = copy['account_language'].value_counts()


top_20_months = copy['month_year_creation'].value_counts().head(20)

print("\nlanguage counts\n",language_counts,"\n")

print("top 20 year and months:\n", top_20_months)

print("\nMost Popular Account:")

most_pop= copy.loc[um_mask]

print(most_pop[['user_screen_name','month_year_creation','account_language','follower_count','user_display_name']])


Avg followers: 1892.3036386449185
Lowest followers: 0
Maximum followers: 1197574
Avg following: 810.2528230865746
Lowest following: 0
Maximum following: 206563

language counts
 en       940
es       382
th       252
ar        17
zh-cn      2
ro         1
Name: account_language, dtype: int64 

top 20 year and months:
 January 2020      512
December 2019     361
February 2020     103
April 2019         58
November 2019      57
September 2019     39
February 2019      35
April 2020         33
March 2020         29
March 2019         29
June 2019          29
October 2019       26
July 2019          19
August 2019        14
May 2019           13
May 2020           11
September 2015     10
September 2018      9
December 2018       8
June 2018           7
Name: month_year_creation, dtype: int64

Most Popular Account:
   user_screen_name month_year_creation account_language  follower_count  \
15    fahad1althani           July 2013               ar         1197574   

         user_display_na

In [6]:
''' Review the most popular profile outlined above to identify its most popular tweet '''

tweets = pd.read_csv('qatar_tweets.csv',index_col=None, header=0)

user = 'fahad1althani'
print(tweets.info())

mask = tweets['user_screen_name'] == user

profile = tweets[mask]

print(profile.shape)

hashtags = profile['hashtags']

#mask_2 = hashtags != '[]'
print(profile.head(1))




<class 'pandas.core.frame.DataFrame'>
RangeIndex: 220254 entries, 0 to 220253
Data columns (total 30 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   tweetid                   220254 non-null  int64  
 1   userid                    220254 non-null  object 
 2   user_display_name         220254 non-null  object 
 3   user_screen_name          220254 non-null  object 
 4   user_reported_location    25015 non-null   object 
 5   user_profile_description  217903 non-null  object 
 6   user_profile_url          2913 non-null    object 
 7   follower_count            220254 non-null  int64  
 8   following_count           220254 non-null  int64  
 9   account_creation_date     220254 non-null  object 
 10  account_language          220254 non-null  object 
 11  tweet_language            219959 non-null  object 
 12  tweet_text                220254 non-null  object 
 13  tweet_time                220254 non-null  o

In [86]:
'''Installation of the google_trans_new library to aid in translation of tweet data'''
#!pip install google_trans_new

Collecting google_trans_new
  Downloading google_trans_new-1.1.9-py3-none-any.whl (9.2 kB)
Installing collected packages: google-trans-new
Successfully installed google-trans-new-1.1.9


In [89]:

from google_trans_new import google_translator  

translator = google_translator()

n_empty = hashtags != '[]'#to disregard tweets without hashtags



f_hashtags = hashtags[n_empty]

hash_list = list(f_hashtags.unique()) 

translated =[]

for i in hash_list:
    translated.append(translator.translate(i))


# print(hashtags_translated)
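
The `hashtags` column appears to hold each tweet's tags as the text of a Python list (the cell above compares it against the string `'[]'`), so `unique()` deduplicates whole lists, not individual tags. A sketch that parses each cell with `ast.literal_eval` and counts individual tags, which would also reduce the number of strings sent to the translator. The sample values are made up, and the sketch assumes no missing values in the column.

```python
import ast
from collections import Counter

import pandas as pd

# Invented sample mimicking the string-encoded lists in the hashtags column.
sample = pd.Series(["['qatar', 'doha']", "[]", "['qatar']", "['doha', 'qatar']"])

tag_counts = Counter()
for cell in sample:
    tags = ast.literal_eval(cell)  # safely turn "['a', 'b']" into ['a', 'b']
    tag_counts.update(tags)

# Each distinct tag appears once here, regardless of which list it came from.
unique_tags = sorted(tag_counts)
print(tag_counts.most_common())
```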

In [157]:
#Translated hashtags


for i in translated:
    print(i)


['Qatari', 'Umrah'] 
['The Hamdeen Organization', 'Preventing Qataris from Umrah'] 
['Qatari', 'The Boycott', 'The Gulf', 'The Hamdeen Organization'] 
['The Hamdeen_ terrorist organization', 'Saudi Arabia', 'Al-Houthi', 'the mullahs' regime, 'Iran', 'Qatar'] 
['Forgiveness',' Meet_Fahd_al_2_twitter '') 
['Iran', 'Tamim_Ben_Hamad', 'Saudi Arabia', 'Emirates', 'Hamdan_Terrorist Organization', 'Qatar'] 
['Al-Udeid Base', 'Tamim', 'Qatar', 'Al-Arabiya'] 
['Meet_Fahd_al_second_twitter' ') 
['Qatar'] 
['King Abdulaziz_Label Festival', 'Qatar', 'Doha', 'Saudi Arabia'] 
['Abdullah 
['Qatar', 'Qatar', 'Doha'] 
['National_Qatari Day', 'Qatar', 'Good news', 'Tamim drains Qatar', 'Doha'] 
['Emirates', 'Jebel Jais', 'The Gulf', 'Al Hamdeen'] 
['Saudi Arabia', 'Qatar', 'Doha', 'Iran', 'Revolutionary Guard', 'Al-Houthi'] 
['Qatarelix', 'Qatar', 'Doha', 'Qatar_Qatar_Sics'] 
['Iran', 'Qatar', 'Saudi Arabia', 'The Emirates', 'Bahrain', 'Egypt', 'Al Jazeera', 'Al Arabiya'] 
['Human Rights', 'Qatar', 'The

In [215]:
# Identify the most retweeted, liked, quoted, and replied-to tweets
retweets = profile[['tweet_time','tweet_text','retweet_count','like_count','quote_count','reply_count','hashtags']]

retweets = retweets[n_empty]


top_retweet = retweets.sort_values(by=['retweet_count'],ascending=False).head(1)

top_like= retweets.sort_values(by=['like_count'],ascending=False).head(1)

top_quote= retweets.sort_values(by=['quote_count'],ascending=False).head(1)

top_reply= retweets.sort_values(by=['reply_count'],ascending=False).head(1)


#top_retweet
print("\ntop retweet\n",top_retweet[['hashtags','retweet_count']],"\n")
print("Translation: General Leadership ,Qatar")

print("\ntop like\n",top_like[['hashtags','like_count']],"\n")

print("Translation: General leadership , Qatar '")

print("\ntop quote\n",top_quote[['hashtags','quote_count']],"\n")

print("Translation: General leadership , Qatar '")

print("\ntop reply\n",top_reply[['hashtags','reply_count']],"\n")

print("Translation: General leadership , Qatar '")


#top retweeted tweet

t_tweet = top_retweet[['tweet_time','tweet_text']].to_string()

print("\ntop retweeted tweet: \n",t_tweet)



top retweet
                          hashtags  retweet_count
171842  ['القياده_العامه', 'قطر']           4289 

Translation: General Leadership ,Qatar

top like
                          hashtags  like_count
171842  ['القياده_العامه', 'قطر']        4571 

Translation: General leadership , Qatar '

top quote
                          hashtags  quote_count
171842  ['القياده_العامه', 'قطر']          202 

Translation: General leadership , Qatar '

top reply
                          hashtags  reply_count
171842  ['القياده_العامه', 'قطر']          506 

Translation: General leadership , Qatar '

top retweeted tweet: 
               tweet_time                                                                                                                                                                  tweet_text
171842  2020-05-06 19:52  تصاعد أدخنة من مبنى #القياده_العامه للجيش في #قطر.\nوزارة الداخلية وتلفزيون قطر وقناة الجزيرة والشيراتون و المخابرات قريبة جدًا من هذا المبنى. (؟) https:
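
Sorting the whole frame four times works, but `idxmax` finds the row label of each maximum directly. A sketch on invented engagement numbers, using the count column names from the cell above:

```python
import pandas as pd

# Invented engagement numbers; the index stands in for the tweet row labels.
toy_tweets = pd.DataFrame(
    {
        "retweet_count": [3, 40, 7],
        "like_count": [9, 35, 50],
        "quote_count": [1, 0, 2],
        "reply_count": [4, 8, 6],
    },
    index=[101, 102, 103],
)

metrics = ["retweet_count", "like_count", "quote_count", "reply_count"]

# Map each metric to the index label of the tweet that maximises it.
top_rows = {m: toy_tweets[m].idxmax() for m in metrics}
print(top_rows)
```

Looking up `toy_tweets.loc[top_rows["like_count"]]` then returns the full row for that tweet, which makes it easy to see whether one tweet leads on every metric or only some.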

In [139]:
'''Export of all data used in dataset'''

#final.to_csv('original_data_combined.csv')
#copy.to_csv('final_dataset.csv')
#top_20_months.to_csv('topmonthyear.csv')