# Dataset Spotlight: How ISIS Uses Twitter

### Description

This dataset includes 17,000 tweets from 100+ pro-ISIS fanboys from all over the world since the November 2015 Paris Attacks. 

The columns describe the following:

**Name**
- Username
- Description
- Location
- Number of followers at the time the tweet was downloaded
- Number of statuses by the user when the tweet was downloaded
- Date and timestamp of the tweet
- The tweet itself


#### Based on this data, these are some examples for deriving insights and analysis:

- **Social Network Cluster Analysis:** Who are the major players in the pro-ISIS twitter network? Ideally, we would like this visualized via a cluster network with the biggest influencers scaled larger than smaller influencers.
- **Keyword Analysis:** Which keywords derived from the name, username, description, location, and tweets were the most commonly used by ISIS fanboys? Examples include: "baqiyah", "dabiq", "wilayat", "amaq"
- **Data Categorization of Links:** Which websites are pro-ISIS fanboys linking to? Categories include: Mainstream Media, Altermedia, Jihadist Websites, Image Upload, Video Upload,
- **Sentiment Analysis:** Which clergy do pro-ISIS fanboys quote the most and which ones do they hate the most? Search the tweets for names of prominent clergy and classify the tweet as positive, negative, or neutral and if negative, include the reasons why. Examples of clergy they like the most: "Anwar Awlaki", "Ahmad Jibril", "Ibn Taymiyyah", "Abdul Wahhab". Examples of clergy that they hate the most: "Hamza Yusuf", "Suhaib Webb", "Yaser Qadhi", "Nouman Ali Khan", "Yaqoubi".
- **Timeline View:** Visualize all the tweets over a timeline and identify peak moments


*Source:*
- http://blog.kaggle.com/2016/06/03/dataset-spotlight-how-isis-uses-twitter/
- https://www.kaggle.com/kzaman/how-isis-uses-twitter

*Sources on network analysis*
- https://www.slideshare.net/koorukuroo/network-analysis-with-networkx-fundamentals-of-network-theory1
- https://www.slideshare.net/BenjaminBengfort/social-network-analysis-with-python
- https://www.datacamp.com/courses/network-analysis-in-python-part-1

In [26]:
import pandas as pd

In [27]:
# import dataset as pandas dataframe

df = pd.read_csv('Twitter_ISIS_data/how-isis-uses-twitter/tweets.csv')
df

Unnamed: 0,name,username,description,location,followers,numberstatuses,time,tweets
0,GunsandCoffee,GunsandCoffee70,ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews,,640,49,1/6/2015 21:07,ENGLISH TRANSLATION: 'A MESSAGE TO THE TRUTHFU...
1,GunsandCoffee,GunsandCoffee70,ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews,,640,49,1/6/2015 21:27,ENGLISH TRANSLATION: SHEIKH FATIH AL JAWLANI '...
2,GunsandCoffee,GunsandCoffee70,ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews,,640,49,1/6/2015 21:29,ENGLISH TRANSLATION: FIRST AUDIO MEETING WITH ...
3,GunsandCoffee,GunsandCoffee70,ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews,,640,49,1/6/2015 21:37,ENGLISH TRANSLATION: SHEIKH NASIR AL WUHAYSHI ...
4,GunsandCoffee,GunsandCoffee70,ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews,,640,49,1/6/2015 21:45,ENGLISH TRANSLATION: AQAP: 'RESPONSE TO SHEIKH...
5,GunsandCoffee,GunsandCoffee70,ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews,,640,49,1/6/2015 21:51,THE SECOND CLIP IN A DA'WAH SERIES BY A SOLDIE...
6,GunsandCoffee,GunsandCoffee70,ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews,,640,49,1/6/2015 22:04,ENGLISH TRANSCRIPT : OH MURABIT! : http://t.co...
7,GunsandCoffee,GunsandCoffee70,ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews,,640,49,1/6/2015 22:06,ENGLISH TRANSLATION: 'A COLLECTION OF THE WORD...
8,GunsandCoffee,GunsandCoffee70,ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews,,640,49,1/6/2015 22:17,Aslm Please share our new account after the pr...
9,GunsandCoffee,GunsandCoffee70,ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews,,640,49,1/10/2015 0:05,ENGLISH TRANSLATION: AQAP STATEMENT REGARDING ...


In [28]:
df.iloc[8]['tweets']

'Aslm Please share our new account after the previous one was suspended.@KhalidMaghrebi @seifulmaslul123 @CheerLeadUnited'

### Return all tagged profiles per tweet

#### Tokenize Tweet text

Consider @taggedprofile and #word as one token.

In [29]:
import re
 
emoticons_str = r"""
    (?:
        [:=;] # Eyes
        [oO\-]? # Nose (optional)
        [D\)\]\(\]/\\OpP] # Mouth
    )"""
 
regex_str = [
    emoticons_str,
    r'<[^>]+>', # HTML tags
    r'(?:@[\w_]+)', # @-mentions
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", # hash-tags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&amp;+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+', # URLs
 
    r'(?:(?:\d+,?)+(?:\.?\d+)?)', # numbers
    r"(?:[a-z][a-z'\-_]+[a-z])", # words with - and '
    r'(?:[\w_]+)', # other words
    r'(?:\S)' # anything else
]
    
tokens_re = re.compile(r'('+'|'.join(regex_str)+')', re.VERBOSE | re.IGNORECASE)
emoticon_re = re.compile(r'^'+emoticons_str+'$', re.VERBOSE | re.IGNORECASE)
 
def tokenize(text):
    return tokens_re.findall(text)
 
def preprocess(text, lowercase=False):
    tokens = tokenize(text)
    if lowercase:
        tokens = [token if emoticon_re.search(token) else token.lower() for token in tokens]
    return tokens

In [30]:
# create new column for tokenized tweets that returns @taggedprofile as one token 
df['tokenized_tweets'] = df['tweets'].apply(lambda x: preprocess(x))

In [31]:
# create new column that returns only the tokens that refer to a tagged profile
df['tagged_profiles'] = df['tokenized_tweets'].map(lambda x: [token for token in x if token.startswith('@')])

### Add topic to tweet

In [32]:
# remove urls from tweets
import re
df['tweets'] = df['tweets'].apply(lambda x: re.sub(r'http\S+', '', x))

In [33]:
# add index name to dataframe
df = df.reset_index()

In [34]:
# create a list of tuples of tweets and index of tweet
tweets = df['tweets'].tolist()
index = df['index'].values.tolist()
tweets_index = zip(index, tweets)
tweets_index[:5]

[(0,
  "ENGLISH TRANSLATION: 'A MESSAGE TO THE TRUTHFUL IN SYRIA - SHEIKH ABU MUHAMMED AL MAQDISI:  "),
 (1,
  "ENGLISH TRANSLATION: SHEIKH FATIH AL JAWLANI 'FOR THE PEOPLE OF INTEGRITY, SACRIFICE IS  EASY'  "),
 (2,
  'ENGLISH TRANSLATION: FIRST AUDIO MEETING WITH SHEIKH FATIH AL JAWLANI (HA):  '),
 (3,
  "ENGLISH TRANSLATION: SHEIKH NASIR AL WUHAYSHI (HA), LEADER OF AQAP: 'THE PROMISE OF VICTORY':  "),
 (4,
  "ENGLISH TRANSLATION: AQAP: 'RESPONSE TO SHEIKH BAGHDADIS STATEMENT 'ALTHOUGH THE DISBELIEVERS DISLIKE IT.' ")]

In [35]:
# import modules
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np

In [36]:
def get_topic_relevancy_per_message(list_of_document_index_and_message, no_features=1000, no_topics=4): # include parameters in function (no_features, no_topics)

    document_index = [row[0] for row in list_of_document_index_and_message]
    documents = [row[1] for row in list_of_document_index_and_message]

    # get term counts 
    tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')
    tf = tf_vectorizer.fit_transform(documents)
    tf_feature_names = tf_vectorizer.get_feature_names()

    # convert to arrays
    tf_array = tf_vectorizer.fit_transform(documents).toarray()
    tf_feature_names_array = np.array(tf_vectorizer.get_feature_names())

    # run model
    lda = LatentDirichletAllocation(n_topics=no_topics, max_iter=5, learning_method='online', learning_offset=50.,random_state=0).fit(tf)
    doctopic = lda.fit_transform(tf_array)

    # scale the document-component matrix such that the component values associated with each document sum to one
    doctopic = doctopic / np.sum(doctopic, axis=1, keepdims=True)

    # give each document a name (name of the index) 
    # and convert the name of each document to an array
    documents_names = np.asarray(document_index)
    doctopic_orig = doctopic.copy()

    # create a matrix that returns the relevance of a topic per document
    num_groups = len(set(documents_names))
    doctopic_grouped = np.zeros((num_groups, no_topics))
    for i, name in enumerate(sorted(set(documents_names))):
        doctopic_grouped[i, :] = np.mean(doctopic[documents_names == name, :], axis=0)

    # get the importantance of each topic per document
    doctopic = doctopic_grouped


    # print documents
    # print display_topics(nmf, tfidf_feature_names, no_top_words) 
    # print doctopic

    return doctopic

In [37]:
def return_document_and_most_important_topic_and_value(list_of_document_index_and_message, no_features=1000, no_topics=4):
    
    doctopic = get_topic_relevancy_per_message(list_of_document_index_and_message, no_features=no_features, no_topics=no_topics) 

    document_index = [row[0] for row in list_of_document_index_and_message]
    documents = [row[0] for row in list_of_document_index_and_message]
    
    # give each document a name (name of the index) 
    # and convert the name of each document to an array

    doctopic_list = doctopic.tolist()
    documents_names = np.asarray(document_index)
    documents_names_list = documents_names.tolist()

    documents_names_doctopics = zip(documents_names_list, doctopic_list)
    
    # create dataframe of array with topic names as column names
    column_names = range(len(doctopic[0]))
    topic_names = ["topic" + str(i) for i in column_names]

    df = pd.DataFrame(doctopic, index=documents_names, columns=topic_names)

    # return max value for each row
    # convert column to list
    max_value = df.max(axis=1).tolist()

    # create new column that states the topic that is most important for the document 
    # convert column to list
    topic = df.idxmax(axis=1).tolist()

    # create list of most important topic and highest value per document
    document_topic_value = zip(documents_names, topic, max_value)

    return document_topic_value

In [38]:
def return_messages_with_clear_topic(list_of_document_index_and_message, no_features=1000, no_topics=4, threshold=0):
    """return documents where topic relevancy is > a certain level"""
    document_topic_value = return_document_and_most_important_topic_and_value(list_of_document_index_and_message, no_features=no_features, no_topics=no_topics)
    
    df = pd.DataFrame(document_topic_value, columns = ['documents_names', 'topic', 'max_value'])

    # keep only the rows where value > threshold
    df_relevant = df[df['max_value'] > threshold]

    # convert this dataframe to a list
    relevant_doctopics = df_relevant.values.tolist()

    return relevant_doctopics

In [39]:
def return_topwords_topdocuments_per_topic(list_of_document_index_and_message, no_topics=0, no_top_words=0, no_top_documents=0): 
    
    documents = [row[1] for row in list_of_document_index_and_message]
    
    # import modules
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    import numpy as np

    # prepare data for model
    tf_vectorizer = CountVectorizer(max_df=0.95, min_df=0.01, stop_words='english')
    tf = tf_vectorizer.fit_transform(documents)
    tf_feature_names = tf_vectorizer.get_feature_names()

    lda_model = LatentDirichletAllocation(n_topics=no_topics, max_iter=5, learning_method='online', learning_offset=50.,random_state=0).fit(tf)
    lda_W = lda_model.transform(tf)
    lda_H = lda_model.components_

    def display_topwords_topdocuments_per_topic(H, W, feature_names, documents, no_top_words, no_top_documents):
        for topic_idx, topic in enumerate(H):
            print "Topic %d:" % (topic_idx)
            print " ".join([feature_names[i]
                            for i in topic.argsort()[:-no_top_words - 1:-1]])
            top_doc_indices = np.argsort( W[:,topic_idx] )[::-1][0:no_top_documents]
            for doc_index in top_doc_indices:
                print documents[doc_index]
    
    return display_topwords_topdocuments_per_topic(lda_H, lda_W, tf_feature_names, documents, no_top_words, no_top_documents)

In [40]:
%%time
tweet_topic = return_document_and_most_important_topic_and_value(tweets_index, no_topics=5)

CPU times: user 41.8 s, sys: 301 ms, total: 42.1 s
Wall time: 42.3 s


In [43]:
tweet_topic[:5]

[(0, 'topic4', 0.37178027166032135),
 (1, 'topic1', 0.5106151896958239),
 (2, 'topic1', 0.8337247690729706),
 (3, 'topic1', 0.4920825812772331),
 (4, 'topic1', 0.8626235165628258)]

In [364]:
%%time
return_messages_with_clear_topic(tweets_index, threshold=0.8)[:5]

CPU times: user 47.5 s, sys: 459 ms, total: 47.9 s
Wall time: 48.8 s


[[0, 'topic3', 0.9053934038959924],
 [1, 'topic3', 0.8744746384392372],
 [2, 'topic3', 0.8491833584587678],
 [3, 'topic3', 0.9059774387788857],
 [4, 'topic3', 0.8742651239245751]]

In [44]:
%%time
return_topwords_topdocuments_per_topic(tweets_index, no_topics=4, no_top_words=20, no_top_documents=1)

Topic 0:
rt isis syria amp iraq ramiallolah city people assad army video new west fallujah attack russia libya mosul muslim war
Topic 1:
al allah today islamicstate abu regime damascus just warreporter1 islam claims don support news la south attack nusra baghdad killed
@3diyah1000: Most popular attacks in Somalia last years : #dayniile attack, #Lego attack, #Jannale attack, #Yurkud attack.
#Somalia
Topic 2:
state islamic breaking amaqagency fighters ypg area turkey like military fsa الله fight east في group know control says new
الله أكبر الله أكبر الله أكبر ولله الحمد
الله أكبر الله أكبر الله أكبر ولله الحمد
الله أكبر الله أكبر الله أكبر ولله الحمد
Topic 3:
rt killed forces near army aleppo iraqi soldiers syrian rebels usa destroyed assad palmyra ramadi north muslims airstrikes saudi russian
RT @Nidalgazaui: #Shiite militants killed a number of #Saudi soldiers near Saudi #Iraqi border in an assault on #Saudi army positions 
CPU times: user 17.3 s, sys: 106 ms, total: 17.4 s
Wall time:

In [45]:
# add topic to df with tweets
row_topic = [(row[0], row[1]) for row in tweet_topic]

# convert list of tuples to dataframe
df_tweet_topic = pd.DataFrame(row_topic, columns=['index', 'topic'])

In [46]:
# merge df_tweet_topic with df
df = df.merge(df_tweet_topic, on='index', how='left')
df.head()

Unnamed: 0,index,name,username,description,location,followers,numberstatuses,time,tweets,tokenized_tweets,tagged_profiles,topic
0,0,GunsandCoffee,GunsandCoffee70,ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews,,640,49,1/6/2015 21:07,ENGLISH TRANSLATION: 'A MESSAGE TO THE TRUTHFU...,"[ENGLISH, TRANSLATION, :, ', A, MESSAGE, TO, T...",[],topic4
1,1,GunsandCoffee,GunsandCoffee70,ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews,,640,49,1/6/2015 21:27,ENGLISH TRANSLATION: SHEIKH FATIH AL JAWLANI '...,"[ENGLISH, TRANSLATION, :, SHEIKH, FATIH, AL, J...",[],topic1
2,2,GunsandCoffee,GunsandCoffee70,ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews,,640,49,1/6/2015 21:29,ENGLISH TRANSLATION: FIRST AUDIO MEETING WITH ...,"[ENGLISH, TRANSLATION, :, FIRST, AUDIO, MEETIN...",[],topic1
3,3,GunsandCoffee,GunsandCoffee70,ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews,,640,49,1/6/2015 21:37,ENGLISH TRANSLATION: SHEIKH NASIR AL WUHAYSHI ...,"[ENGLISH, TRANSLATION, :, SHEIKH, NASIR, AL, W...",[],topic1
4,4,GunsandCoffee,GunsandCoffee70,ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews,,640,49,1/6/2015 21:45,ENGLISH TRANSLATION: AQAP: 'RESPONSE TO SHEIKH...,"[ENGLISH, TRANSLATION, :, AQAP, :, ', RESPONSE...",[],topic1


### Make a selection of the columns (user, tagged profiles, tweets, tweet topic) and reduce the dataset to only active users and connections

In [47]:
# create new dataframe that contains username, tagged_profiles and topic
df_profiles_topic = df[['index', 'username', 'tagged_profiles', 'topic']]

# keep only the tweets that contain a tagged profile
df_profiles_topic = df_profiles_topic[df_profiles_topic.astype(str)['tagged_profiles'] != '[]']
df_profiles_topic.head()

Unnamed: 0,index,username,tagged_profiles,topic
8,8,GunsandCoffee70,"[@KhalidMaghrebi, @seifulmaslul123, @CheerLead...",topic3
10,10,GunsandCoffee70,"[@KhalidMaghrebi, @seifulmaslul123, @CheerLead...",topic0
14,14,GunsandCoffee70,"[@IbnNabih1, @MuwMedia, @Dawlat_islam7]",topic0
15,15,GunsandCoffee70,"[@IbnNabih1, @KhalidMaghrebi_, @MuwMedia, @Pol...",topic3
16,16,GunsandCoffee70,"[@IbnNabih1, @KhalidMaghrebi_, @MuwMedia, @Pol...",topic0


In [48]:
# remove @ character from tagged profiles
df_profiles_topic['tagged_profiles'] = df_profiles_topic['tagged_profiles'].apply(lambda x: [token.replace('@','') for token in x])
df_profiles_topic.head()

Unnamed: 0,index,username,tagged_profiles,topic
8,8,GunsandCoffee70,"[KhalidMaghrebi, seifulmaslul123, CheerLeadUni...",topic3
10,10,GunsandCoffee70,"[KhalidMaghrebi, seifulmaslul123, CheerLeadUni...",topic0
14,14,GunsandCoffee70,"[IbnNabih1, MuwMedia, Dawlat_islam7]",topic0
15,15,GunsandCoffee70,"[IbnNabih1, KhalidMaghrebi_, MuwMedia, Polder_...",topic3
16,16,GunsandCoffee70,"[IbnNabih1, KhalidMaghrebi_, MuwMedia, Polder_...",topic0


In [49]:
set(df_profiles_topic.iloc[0]['tagged_profiles'])

{'CheerLeadUnited', 'KhalidMaghrebi', 'seifulmaslul123'}

In [50]:
# retrieve list of unique tagged profiles
tagged_profiles = list(set([pair for row in df_profiles_topic['tagged_profiles'] for pair in row]))
print len(tagged_profiles)
tagged_profiles[:5]

3301


['', 'Jazrawi_3uud', '5629haqq', 'VSOForever', 'AsebAlRasWenak']

In [51]:
# retrieve list of unique usernames
usernames = list(df_profiles_topic['username'].unique())
print len(usernames)
usernames[:5]

107


['GunsandCoffee70',
 'YazeedDhardaa25',
 'BaqiyaIs',
 'abubakerdimshqi',
 'WhiteCat_7']

In [52]:
# get number of unique tagged profiles and unique usernames
print '# unique tagged profiles: ', len(tagged_profiles)
print '# unique usernames: ', len(usernames)

# unique tagged profiles:  3301
# unique usernames:  107


In [53]:
# check on passive accounts
print 'Jazrawi_3uud' in tagged_profiles
print 'Jazrawi_3uud' in usernames

True
False


Some Twitter users are tagged, but don't post tweets themselves.
This could imply that those users that are tagged, but don't post tweets themselves, refer to passive accounts.

We may be interested in the profiles of those passive accounts.

In [381]:
len([account for account in tagged_profiles if account in usernames])

77

77 profiles out of 3301 profiles are members that both post a tweet as well as get tagged by other members.

In [382]:
len([account for account in tagged_profiles if account not in usernames])

3224

This implies indeed that 3224 profiles are passive accounts.

Store both the profiles of active accounts and passive accounts in a separate list. 
For the network analysis we willn not use the passive profiles.

In [54]:
active_accounts = [account for account in tagged_profiles if account in usernames]
passive_accounts = [account for account in tagged_profiles if account not in usernames]
print '# of active accounts:', len(active_accounts)
print '# of passive accounts:', len(passive_accounts)

# of active accounts: 77
# of passive accounts: 3224


In [55]:
# create a dataframe that only contains the active accounts
df_profiles_topic['tagged_profiles'] = df_profiles_topic['tagged_profiles'].apply(lambda x: [account for account in x if account in usernames])
print len(df_profiles_topic)

# keep only the tweets that contain a tagged profile
df_profiles_topic = df_profiles_topic[df_profiles_topic.astype(str)['tagged_profiles'] != '[]']
print len(df_profiles_topic)

df_profiles_topic.head()

9878
1953


Unnamed: 0,index,username,tagged_profiles,topic
32,32,YazeedDhardaa25,[YazeedDhardaa25],topic0
174,174,YazeedDhardaa25,[RamiAlLolah],topic3
180,180,YazeedDhardaa25,[RamiAlLolah],topic4
181,181,YazeedDhardaa25,[RamiAlLolah],topic4
186,186,YazeedDhardaa25,[YazeedDhardaa25],topic1


### Prepare data for network analysis

#### - Create a list of connections as a pair of tuples

Output should look like this:
    
    [(username, tagged_profile1), (username, tagged_profile2), (username, tagged_profile3)]
    
    
    
    
#### - Add attributes to connections

A connection can have multiple topics
   
Output should look like this:

    {(username, tagged_profile1): topic1, 
     (username, tagged_profile2): topic1,
     (username, tagged_profile3): topic1,
     (username, tagged_profile1): topic2,
     (username, tagged_profile2): topic2,
     (username, tagged_profile1): topic3,}

In [56]:
# create a dataframe that only stores the username and tagged profiles
df_connections = df_profiles_topic[['username', 'tagged_profiles']]
print len(df_connections)
df_connections.head()

1953


Unnamed: 0,username,tagged_profiles
32,YazeedDhardaa25,[YazeedDhardaa25]
174,YazeedDhardaa25,[RamiAlLolah]
180,YazeedDhardaa25,[RamiAlLolah]
181,YazeedDhardaa25,[RamiAlLolah]
186,YazeedDhardaa25,[YazeedDhardaa25]


In [57]:
# create another dataframe that only stores the username and topic
df_username_topic = df_profiles_topic[['username', 'topic']]
print len(df_username_topic)
df_username_topic.head()

1953


Unnamed: 0,username,topic
32,YazeedDhardaa25,topic0
174,YazeedDhardaa25,topic3
180,YazeedDhardaa25,topic4
181,YazeedDhardaa25,topic4
186,YazeedDhardaa25,topic1


In [386]:
# # get number of tags per tagged profile

# usernames_tuple = list(set([username[0] for username in edges]))
# print usernames_tuple[:5]
# print '---'
# tagged_profiles_tuple = [tagged_profile[1] for tagged_profile in edges]
# print tagged_profiles_tuple[:5]
# print '---'
# print 'Bedouin127' in tagged_profiles_tuple
# print '---'
# from collections import Counter
# number_of_tags_per_tagged_profile_dict = Counter(tagged_profiles_tuple)

# print number_of_tags_per_tagged_profile_dict.get('Bedouin127')
# print '---'
# print number_of_tags_per_tagged_profile_dict

In [58]:
# convert both dataframes to dictionary

# df connection
connections_dict = df_connections.set_index('username').T.to_dict(orient='list')
print len(connections_dict.keys())
print connections_dict
print '---'
# df username topic
username_topic_dict = df_username_topic.set_index('username').T.to_dict(orient='list')
print len(username_topic_dict.keys())
print username_topic_dict

87
{'murasil1': [['WarReporter1']], 'Suspend_Me_fags': [['Suspend_Me_fags']], 'Uncle_SamCoco': [['WarReporter1']], 'maisaraghereeb': [['maisaraghereeb']], '432Mryam': [['abuayisha102']], 'Witness_alHaqq': [['WarReporter1']], 'ismailmahsud': [['ismailmahsud']], 'ks48a174031': [['RamiAlLolah']], 'Afriqqiya_252': [['Afriqqiya_252', 'Afriqqiya_252', 'Afriqqiya_252', 'Afriqqiya_252']], 'wayf44rerr': [['AbuMusab_110']], 'freelance_112': [['freelance_112']], 'Nidalgazaui': [['warrnews']], 'ro34th': [['Uncle_SamCoco']], 'MilkSheikh2': [['AsimAbuMerjem']], '_IshfaqAhmad': [['Nidalgazaui']], '__alfresco__': [['WarReporter1']], 'Fidaee_Fulaani': [['Fidaee_Fulaani']], 'WhiteCat_7': [['mobi_ayubi']], 'abuhanzalah10': [['RamiAlLolah']], 'melvynlion': [['melvynlion']], 'Jazrawi_Joulan': [['MaghrebiHD']], 'Mountainjjoool': [['nvor85j']], 'al_nusra': [['al_nusra']], 'st3erer': [['AbuMusab_110']], 'MaghrebiWM': [['RamiAlLolah']], 'fahadslay614': [['fahadslay614', 'fahadslay614', 'fahadslay614', 'fahadsl



In [None]:
# # convert dataframe to dictionary
# user_topics_dict = df_.set_index('username').T.to_dict(orient='list')
# user_topics_dict

In [59]:
# create the edges (connections) of the graph

# convert the connections dictionary to a list of tuples where each key corresponds to its value in the dictionary

def get_connections(dictionary_with_list_of_values):
    connections = []
    for user,value in dictionary_with_list_of_values.iteritems():
        for item in value:
            for profile in item:
                connections.append((user, profile))
    return connections

edges = get_connections(connections_dict)
print len(edges)
print edges

101
[('murasil1', 'WarReporter1'), ('Suspend_Me_fags', 'Suspend_Me_fags'), ('Uncle_SamCoco', 'WarReporter1'), ('maisaraghereeb', 'maisaraghereeb'), ('432Mryam', 'abuayisha102'), ('Witness_alHaqq', 'WarReporter1'), ('ismailmahsud', 'ismailmahsud'), ('ks48a174031', 'RamiAlLolah'), ('Afriqqiya_252', 'Afriqqiya_252'), ('Afriqqiya_252', 'Afriqqiya_252'), ('Afriqqiya_252', 'Afriqqiya_252'), ('Afriqqiya_252', 'Afriqqiya_252'), ('wayf44rerr', 'AbuMusab_110'), ('freelance_112', 'freelance_112'), ('Nidalgazaui', 'warrnews'), ('ro34th', 'Uncle_SamCoco'), ('MilkSheikh2', 'AsimAbuMerjem'), ('_IshfaqAhmad', 'Nidalgazaui'), ('__alfresco__', 'WarReporter1'), ('Fidaee_Fulaani', 'Fidaee_Fulaani'), ('WhiteCat_7', 'mobi_ayubi'), ('abuhanzalah10', 'RamiAlLolah'), ('melvynlion', 'melvynlion'), ('Jazrawi_Joulan', 'MaghrebiHD'), ('Mountainjjoool', 'nvor85j'), ('al_nusra', 'al_nusra'), ('st3erer', 'AbuMusab_110'), ('MaghrebiWM', 'RamiAlLolah'), ('fahadslay614', 'fahadslay614'), ('fahadslay614', 'fahadslay614')

In [60]:
# add topics to connections

def get_connections_topic(connections_dict, username_topic_dict):
    """Get a separate list of the topic per tweet and per connection
    between user and tagged profile."""
    
    connections_topic_dict = {}
    for user1,value1 in connections_dict.iteritems():
        for user2,value2 in username_topic_dict.iteritems():
            if user1 == user2:

                for item in value1:
                    for profile in item:

                        for topic in value2:
                            connections_topic_dict[(user1, profile)] = topic
    return connections_topic_dict

In [61]:
connection_topic = get_connections_topic(connections_dict, username_topic_dict)
print len(connection_topic)
connection_topic

89


{('04_8_1437', '04_8_1437'): 'topic3',
 ('06230550_IS', 'MaghrebiHD'): 'topic0',
 ('1515Ummah', 'mustaklash'): 'topic2',
 ('432Mryam', 'abuayisha102'): 'topic0',
 ('AbdusMujahid149', 'AbdusMujahid149'): 'topic0',
 ('AbuMusab_110', 'kIakishini5'): 'topic0',
 ('AbuNaseeha_03', 'MaghrebiQM'): 'topic3',
 ('Abu_Azzzam25', 'RamiAlLolah'): 'topic4',
 ('Abu_Ibn_Taha', 'Baqiyah_Khilafa'): 'topic1',
 ('Afriqqiya_252', 'Afriqqiya_252'): 'topic1',
 ('Al_Battar_Engl', 'Al_Battar_Engl'): 'topic1',
 ('Alwala_bara', 'RamiAlLolah'): 'topic4',
 ('AsimAbuMerjem', 'Nidalgazaui'): 'topic4',
 ('Bajwa47online', 'Bajwa47online'): 'topic2',
 ('Baqiyah_Khilafa', 'Jazrawi_Joulan'): 'topic1',
 ('Battar_English', 'Battar_English'): 'topic1',
 ('BilalIbnRabah1', 'Nidalgazaui'): 'topic2',
 ('DabiqsweetsMan', 'RamiAlLolah'): 'topic3',
 ('DawlaWitness11', 'MaghrebiWM'): 'topic3',
 ('Dieinurage308', 'Dieinurage308'): 'topic0',
 ('EPlC24', 'MaghrebiHD'): 'topic0',
 ('FidaeeFulaani', 'RamiAlLolah'): 'topic2',
 ('Fidaee_F

## Perform Network Analysis

In [62]:
import networkx as nx

In [63]:
# create an undirected graph
graph = nx.Graph()

# create an undirected graph
d_graph = nx.DiGraph()

In [64]:
# add edges to graph
graph.add_edges_from(edges)

# add edges to directed graph
d_graph.add_edges_from(edges)

In [65]:
print "# of profiles: ", graph.number_of_nodes()
print "# of connections: ", graph.number_of_edges()

# of profiles:  90
# of connections:  86


In [38]:
# draw graph
import matplotlib.pyplot as plt
nx.draw(graph)
plt.show()

# # ways of drawing the graph
# nx.draw_random(graph)
# nx.draw_circular(graph)
# nx.draw_spectral(graph)

In [28]:
nodes = graph.nodes()

#### Create the exact same graph as integer values

In [83]:
# # convert nodes and edges to integer values 
# import numpy as np
# from sklearn import preprocessing

# le = preprocessing.LabelEncoder()

# # usernames
# le.fit(usernames)
# usernames_int = le.transform(usernames) 
# # convert numpy array to list
# usernames_int = usernames_int.tolist()

# # usernames
# le.fit(tagged_profiles)
# tagged_profiles_int = le.transform(tagged_profiles) 
# # convert numpy array to list
# tagged_profiles_int = tagged_profiles_int.tolist()

## Analyse network

### Analyse important members of network

One way to define "importance" is the individual's betweenness centrality. The betweenness centrality is a measure of how many shortest paths pass through a particular vertex. The more shortest paths that pass through the vertex, the more central the vertex is to the network.

The following approach computes the betweenness centrality of each vertex in the network using a parallelized algorithm.

*Source:* https://blog.dominodatalab.com/social-network-analysis-with-networkx/

In [69]:
import networkx as nx
from multiprocessing import Pool
import itertools
import matplotlib.pyplot as plt
 
def partitions(nodes, n):
    "Partitions the nodes into n subsets"
    nodes_iter = iter(nodes)
    while True:
        partition = tuple(itertools.islice(nodes_iter,n))
        if not partition:
            return
        yield partition

def btwn_pool(G_tuple):
    return nx.betweenness_centrality_source(*G_tuple)
 

def between_parallel(G, processes = None):
    p = Pool(processes=processes)
    part_generator = 4*len(p._pool)
    node_partitions = list(partitions(G.nodes(), int(len(G)/part_generator)))
    num_partitions = len(node_partitions)

    bet_map = p.map(btwn_pool,
                    zip([G]*num_partitions,
                        [True]*num_partitions,
                        [None]*num_partitions,
                        node_partitions))
 
    bt_c = bet_map[0]
    for bt in bet_map[1:]:
        for n in bt:
            bt_c[n] += bt[n]
    return bt_c
 

# Runs the parallel betweenness algorithm on the network and plots the "top" vertices
def plotBetweeness(G_fb, top = 10):
 
    bt = between_parallel(G_fb)

    max_nodes =  sorted(bt.iteritems(), key = lambda v: -v[1])[:top]
    bt_values = [5]*len(G_fb.nodes())
    bt_colors = [0]*len(G_fb.nodes())
    for max_key, max_val in max_nodes:
        bt_values[max_key] = 150
        bt_colors[max_key] = 2

    plt.axis("off")
    nx.draw_networkx(G_fb, cmap = plt.get_cmap("rainbow"), node_color = bt_colors, node_size = bt_values, with_labels = False)
    plt.savefig("Twitter_ISIS_data/Between_network.png", format="PNG")
    plt.show()
    
plotBetweeness(graph)

TypeError: list indices must be integers, not str

### Community Detection

The criteria for finding good communities is similar to that for finding good clusters. We want to maximize intra-community edges while minimizing inter-community edges. Formally, the algorithm tries to maximize the modularity of network, or the fraction of edges that fall within the community minus the expected fraction of edges if the edges were distributed by random. Good communities should have a high number of intra-community edges, so by maximizing the modularity, we detect dense communities that have a high fraction of intra-community edges.

*Source:* https://blog.dominodatalab.com/social-network-analysis-with-networkx/

In [29]:
import networkx as nx
import matplotlib.pyplot as plt
import community
 
def detectCommunities(G):
 
    parts = community.best_partition(G)
    values = [parts.get(node) for node in G.nodes()]

    plt.axis("off")
    nx.draw_networkx(G, cmap = plt.get_cmap("jet"), node_color = values, node_size = 35, with_labels = False)
    plt.savefig("Twitter_ISIS_data/communities.png", format = "PNG")
    plt.show()
    
detectCommunities(graph)

### Analyse importancy of profiles

In [42]:
# find k-clique communities in graph 
# return the members of a clique of 7 members

list(nx.k_clique_communities(graph, 7))

[frozenset({'NaseemAhmed50',
            'Nidalgazaui',
            'RamiAlLolah',
            'Uncle_SamCoco',
            'WarReporter1',
            'mobi_ayubi',
            'warrnews'})]

In [43]:
# find the maximal number of cliques in this graph
nx.graph_number_of_cliques(graph)

148

In [44]:
# return the clique number (size of the largest clique) for this graph
nx.graph_clique_number(graph)

7

In [48]:
# Return the size of the largest maximal clique containing each given node.
size_of_largest_clique_per_node = nx.node_clique_number(graph)
# print size_of_largest_clique_per_node

# return only those profiles that are a member of clique that is larger than 6 members
profiles_of_large_cliques_dict = {k:v for (k,v) in size_of_largest_clique_per_node.items() if v > 6}
profiles_of_large_cliques_dict

{'NaseemAhmed50': 7,
 'Nidalgazaui': 7,
 'RamiAlLolah': 7,
 'Uncle_SamCoco': 7,
 'WarReporter1': 7,
 'mobi_ayubi': 7,
 'warrnews': 7}

In [49]:
# return a list of the profiles that are a member of cliques that are larger than 7 members
profiles_of_large_cliques = profiles_of_large_cliques_dict.keys()
profiles_of_large_cliques

['Uncle_SamCoco',
 'WarReporter1',
 'RamiAlLolah',
 'Nidalgazaui',
 'mobi_ayubi',
 'warrnews',
 'NaseemAhmed50']

In [71]:
# Return the number of maximal cliques for each node.
number_of_cliques_per_node = nx.number_of_cliques(graph)

# return only those profiles that are a member of more than 5 different cliques
number_of_more_then_5_cliques_per_node = {k:v for (k,v) in number_of_cliques_per_node.items() if v > 5}
number_of_more_then_5_cliques_per_node

{'AbuNaseeha_03': 6,
 'EPlC24': 7,
 'FidaeeFulaani': 6,
 'Freedom_speech2': 13,
 'IbnKashmir_': 6,
 'Jazrawi_Joulan': 9,
 'Jazrawi_Saraqib': 9,
 'MaghrabiArabi': 16,
 'MaghrebiHD': 7,
 'MaghrebiQM': 9,
 'MilkSheikh2': 8,
 'NaseemAhmed50': 6,
 'Nidalgazaui': 40,
 'QassamiMarwan': 7,
 'RamiAlLolah': 62,
 'Uncle_SamCoco': 25,
 'WarReporter1': 33,
 '_IshfaqAhmad': 6,
 '__alfresco__': 16,
 'mobi_ayubi': 17,
 'warrnews': 16,
 'wayf44rerr': 8}

Some accounts are a member of many different cliques. 
    
*How many of these accounts are active accounts?*

In [72]:
print len([account for account in number_of_more_then_5_cliques_per_node.keys() if account in passive_accounts])

0


None of the accounts that are part of a large clique are passive accounts. This means that cliques only consist of active users.

In [56]:
# Return a list of cliques containing the given node.
members_of_clique_per_node = nx.cliques_containing_node(graph)
members_of_clique_per_node

{'04_8_1437': [['04_8_1437']],
 '06230550_IS': [['MaghrebiHD', '06230550_IS'],
  ['06230550_IS', 'pleaoftheummah']],
 '1515Ummah': [['RamiAlLolah',
   'Nidalgazaui',
   'Uncle_SamCoco',
   'Fidaee_Fulaani',
   '1515Ummah'],
  ['RamiAlLolah', 'mustaklash', '1515Ummah']],
 '1Dawlah_III': [['1Dawlah_III', 'AbuNaseeha_03'],
  ['1Dawlah_III', 'Alwala_bara']],
 '432Mryam': [['abuayisha102', '432Mryam']],
 'AbdusMujahid149': [['AbdusMujahid149']],
 'AbuMusab_110': [['wayf44rerr',
   '__alfresco__',
   'Uncle_SamCoco',
   'AbuMusab_110',
   'st3erer'],
  ['freelance_112', '__alfresco__', 'AbuMusab_110'],
  ['freelance_112', 'kIakishini5', 'AbuMusab_110'],
  ['kIakishini5', 'st3erer', 'AbuMusab_110']],
 'AbuNaseeha_03': [['RamiAlLolah',
   'Nidalgazaui',
   'Uncle_SamCoco',
   'MaghrebiQM',
   'AbuNaseeha_03'],
  ['RamiAlLolah', 'abuhumayra4', 'AbuNaseeha_03'],
  ['RamiAlLolah',
   'Abu_Azzzam25',
   'Uncle_SamCoco',
   'MaghrebiQM',
   'AbuNaseeha_03'],
  ['QassamiMarwan', 'Nidalgazaui', 'Magh

In [57]:
# return only those members of a clique who are a member of a clique that is larger than 6 members 
# look at profiles_of_large_cliques

members_of_large_cliques = dict((key,value) for key,value in members_of_clique_per_node.iteritems() if key in profiles_of_large_cliques)
members_of_large_cliques

{'NaseemAhmed50': [['RamiAlLolah', 'Bajwa47online', 'NaseemAhmed50'],
  ['RamiAlLolah',
   'Nidalgazaui',
   'IbnKashmir_',
   'WarReporter1',
   'warrnews',
   'NaseemAhmed50'],
  ['RamiAlLolah',
   'Nidalgazaui',
   'Uncle_SamCoco',
   'WarReporter1',
   'mobi_ayubi',
   'NaseemAhmed50',
   'warrnews'],
  ['RamiAlLolah',
   'Nidalgazaui',
   'darulhijrateyni',
   'warrnews',
   'NaseemAhmed50'],
  ['QassamiMarwan', 'Nidalgazaui', 'NaseemAhmed50', 'darulhijrateyni'],
  ['DawlaWitness11', 'NaseemAhmed50']],
 'Nidalgazaui': [['Suspend_Me_fags', 'Nidalgazaui', 'safiyaimback'],
  ['BilalIbnRabah1', 'Nidalgazaui', 'MilkSheikh2'],
  ['BilalIbnRabah1', 'Nidalgazaui', 'WarReporter1'],
  ['saifulakhir', 'Nidalgazaui', 'war_analysis'],
  ['RamiAlLolah',
   'Nidalgazaui',
   'IbnKashmir_',
   'WarReporter1',
   'Freedom_speech2'],
  ['RamiAlLolah',
   'Nidalgazaui',
   'IbnKashmir_',
   'WarReporter1',
   'warrnews',
   '__alfresco__'],
  ['RamiAlLolah',
   'Nidalgazaui',
   'IbnKashmir_',
   'W

In [336]:
# # return only the names of the members of a clique that is larger than 7 members
# for key,value in members_of_large_cliques.iteritems():
#     for clique in value:
#         if len(clique) > 7:
#             print {key: clique}
            
# for key, value in members_of_large_cliques.items():
#     for clique in value:
#         if len(clique) > 7:
#             print clique

#### Get number of connections per user

In [75]:
degree = nx.degree(graph)
print '# of profiles with connections in undirected graph: ', len(degree.keys())
print degree
print '---'
degree_d_graph = nx.degree(d_graph)
print '# of profiles with connections in directed graph: ', len(degree_d_graph.keys())
print degree_d_graph

# of profiles with connections in undirected graph:  98
{'murasil1': 6, 'Suspend_Me_fags': 5, 'Uncle_SamCoco': 26, 'maisaraghereeb': 3, '432Mryam': 1, 'Witness_alHaqq': 4, 'ismailmahsud': 6, 'ks48a174031': 2, 'MaghrebiQ': 1, 'Afriqqiya_252': 3, 'wayf44rerr': 12, 'freelance_112': 7, 'Nidalgazaui': 34, 'ro34th': 4, 'MilkSheikh2': 8, '_IshfaqAhmad': 9, '__alfresco__': 15, 'Fidaee_Fulaani': 8, 'WhiteCat_7': 2, 'squadsquaaaaad': 1, 'abuhanzalah10': 1, 'melvynlion': 6, '1Dawlah_III': 2, 'Jazrawi_Joulan': 11, 'Mountainjjoool': 6, 'al_nusra': 2, 'nvor85j': 1, 'pleaoftheummah': 1, 'st3erer': 10, 'emran_getu': 2, 'MaghrebiWM': 6, 'fahadslay614': 2, 'darulhijrateyni': 7, 'Bajwa47online': 4, 'AbdusMujahid149': 2, 'klakishinki': 2, 'Jazrawi_Saraqib': 11, 'Freedom_speech2': 11, 'abuhumayra4': 7, 'wayyf44rer': 2, '04_8_1437': 2, 'moustiklash': 6, 'CXaafada2': 1, 'FidaeeFulaani': 10, 'Al_Battar_Engl': 2, 'YazeedDhardaa25': 3, 'QassamiMarwan': 10, 'JoinISNation102': 4, 'Abu_Ibn_Taha': 2, 'AbuMusab_110'

#### Show the density of the graph

In [76]:
print 'density undirected graph: ', nx.density(graph)

density undirected graph:  0.0675362928677


This seems like a very low density. Which means that there are possibly many profiles that have only one connection

In [81]:
# show proportion of ingoig edges per profile
ingoing_edges = nx.in_degree_centrality(d_graph)
print 'ingoing edges per node: '
print ingoing_edges
print '---'

# return only those profiles that show a high proportion of the ingoing edges
{k:v for (k,v) in ingoing_edges.items() if v > 0.1}

ingoing edges per node: 
{'murasil1': 0.010309278350515464, 'Suspend_Me_fags': 0.020618556701030927, 'Uncle_SamCoco': 0.15463917525773196, 'maisaraghereeb': 0.010309278350515464, '432Mryam': 0.010309278350515464, 'Witness_alHaqq': 0.0, 'ismailmahsud': 0.030927835051546393, 'ks48a174031': 0.0, 'MaghrebiQ': 0.010309278350515464, 'Afriqqiya_252': 0.020618556701030927, 'wayf44rerr': 0.09278350515463918, 'freelance_112': 0.061855670103092786, 'Nidalgazaui': 0.3402061855670103, 'ro34th': 0.0, 'MilkSheikh2': 0.07216494845360824, '_IshfaqAhmad': 0.020618556701030927, '__alfresco__': 0.10309278350515463, 'Fidaee_Fulaani': 0.030927835051546393, 'WhiteCat_7': 0.0, 'squadsquaaaaad': 0.010309278350515464, 'abuhanzalah10': 0.0, 'melvynlion': 0.010309278350515464, '1Dawlah_III': 0.020618556701030927, 'Jazrawi_Joulan': 0.09278350515463918, 'Mountainjjoool': 0.0, 'al_nusra': 0.010309278350515464, 'nvor85j': 0.010309278350515464, 'pleaoftheummah': 0.010309278350515464, 'st3erer': 0.05154639175257732, 'e

{'MaghrebiQM': 0.10309278350515463,
 'Nidalgazaui': 0.3402061855670103,
 'RamiAlLolah': 0.5257731958762887,
 'Uncle_SamCoco': 0.15463917525773196,
 'WarReporter1': 0.2268041237113402,
 '__alfresco__': 0.10309278350515463,
 'warrnews': 0.14432989690721648}

In [85]:
[account for account in {k:v for (k,v) in ingoing_edges.items() if v > 0.1}.keys() if account in active_accounts]

['__alfresco__',
 'Uncle_SamCoco',
 'WarReporter1',
 'RamiAlLolah',
 'Nidalgazaui',
 'MaghrebiQM',
 'warrnews']

The profiles with many ingoing edges are still profiles of active accounts. 

In [82]:
# show proportion of outgoing edges per profile
outgoing_edges = nx.out_degree_centrality(d_graph)
print 'outgoing edges per node: '
print outgoing_edges

# return only those profiles that show a high proportion of the outgoing edges
{k:v for (k,v) in outgoing_edges.items() if v > 0.1}

outgoing edges per node: 
{'murasil1': 0.061855670103092786, 'Suspend_Me_fags': 0.030927835051546393, 'Uncle_SamCoco': 0.16494845360824742, 'maisaraghereeb': 0.020618556701030927, '432Mryam': 0.010309278350515464, 'Witness_alHaqq': 0.041237113402061855, 'ismailmahsud': 0.030927835051546393, 'ks48a174031': 0.020618556701030927, 'MaghrebiQ': 0.0, 'Afriqqiya_252': 0.010309278350515464, 'wayf44rerr': 0.061855670103092786, 'freelance_112': 0.010309278350515464, 'Nidalgazaui': 0.020618556701030927, 'ro34th': 0.041237113402061855, 'MilkSheikh2': 0.030927835051546393, '_IshfaqAhmad': 0.07216494845360824, '__alfresco__': 0.09278350515463918, 'Fidaee_Fulaani': 0.05154639175257732, 'WhiteCat_7': 0.020618556701030927, 'squadsquaaaaad': 0.0, 'abuhanzalah10': 0.010309278350515464, 'melvynlion': 0.05154639175257732, '1Dawlah_III': 0.0, 'Jazrawi_Joulan': 0.020618556701030927, 'Mountainjjoool': 0.061855670103092786, 'al_nusra': 0.010309278350515464, 'nvor85j': 0.0, 'pleaoftheummah': 0.0, 'st3erer': 0.0

{'IbnKashmir_': 0.1134020618556701,
 'MaghrabiArabi': 0.20618556701030927,
 'Uncle_SamCoco': 0.16494845360824742,
 'WarReporter1': 0.14432989690721648,
 'mobi_ayubi': 0.16494845360824742}

While the accounts of RamiAlLolah and Nidalgazaui show many ingoing edges, proportions of 0.53 and 0.34 respectively, these profiles are not among the top tweeting profiles.

In contrast, the account of MaghrabiArabi does not appear in the top receiving profiles, scores at the top of those accounts that show a high proportion of outgoing tweets.

#### Split profiles into two groups:
- 1 group of those profiles who have more than 1 connection
- 1 group of those profiles who have only 1 connection

In [57]:
print 'total # of profiles: ', len(degree.keys())

total # of profiles:  98


In [56]:
# get only the keys of which the value is > 1

profiles_with_multiple_connections = []
for k,v in degree.iteritems():
    if v > 1:
        profiles_with_multiple_connections.append(k)
        
print '# of profiles with multiple connections: ', len(profiles_with_multiple_connections)
print profiles_with_multiple_connections[:5]

# of profiles with multiple connections:  85
['murasil1', 'Suspend_Me_fags', 'Uncle_SamCoco', 'maisaraghereeb', 'Witness_alHaqq']


In [55]:
# get only the keys of which the value is = 1

profiles_with_one_connection = []
for k,v in degree.iteritems():
    if v == 1:
        profiles_with_one_connection.append(k)
        
print '# of profiles with only one connection: ', len(profiles_with_one_connection)
print profiles_with_one_connection[:5]

# of profiles with only one connection:  13


['432Mryam', 'MaghrebiQ', 'squadsquaaaaad', 'abuhanzalah10', 'nvor85j']

#### Analyse the number of occurences per connection

In [66]:
from collections import Counter
occurences_per_connection = Counter(edges)
occurences_per_connection

Counter({('04_8_1437', '04_8_1437'): 3,
         ('06230550_IS', 'MaghrebiHD'): 1,
         ('1515Ummah', 'mustaklash'): 1,
         ('432Mryam', 'abuayisha102'): 1,
         ('AbdusMujahid149', 'AbdusMujahid149'): 1,
         ('AbuMusab_110', 'kIakishini5'): 1,
         ('AbuNaseeha_03', 'MaghrebiQM'): 1,
         ('Abu_Azzzam25', 'RamiAlLolah'): 1,
         ('Abu_Ibn_Taha', 'Baqiyah_Khilafa'): 1,
         ('Afriqqiya_252', 'Afriqqiya_252'): 4,
         ('Al_Battar_Engl', 'Al_Battar_Engl'): 1,
         ('Alwala_bara', 'RamiAlLolah'): 1,
         ('AsimAbuMerjem', 'Nidalgazaui'): 1,
         ('Bajwa47online', 'Bajwa47online'): 1,
         ('Baqiyah_Khilafa', 'Jazrawi_Joulan'): 1,
         ('Battar_English', 'Battar_English'): 1,
         ('BilalIbnRabah1', 'Nidalgazaui'): 1,
         ('DabiqsweetsMan', 'RamiAlLolah'): 1,
         ('DawlaWitness11', 'MaghrebiWM'): 1,
         ('Dieinurage308', 'Dieinurage308'): 4,
         ('EPlC24', 'MaghrebiHD'): 1,
         ('FidaeeFulaani', 'RamiAlL

In some cases a user sends a message to his own profile. This sounds as a remarkable thing to do. 
This could explain why some nodes don't seem to have a connection when the connections are represented as a graph.

Even, in some cases a user has only one connection, which is his own profile, and sends multiple tweets to his own profile. For example: 
    
    ('Al_Battar_Engl', 'Al_Battar_Engl'): 20 
    
This could imply that either this person posts tweets to him self as a reminder, possibly a reminder of another tweet. Or multiple users are the owner of this profile and possibly collect retweets. This could still be explored in more detail.

### Add the number of occurences per connection as an attribute to the edge

In [67]:
nx.set_edge_attributes(graph, 'occurences_per_connection_att', occurences_per_connection)
print 'example of # of occurences between AbuMusab_110 and freelance_112: ' 
graph['AbuMusab_110']['freelance_112']['occurences_per_connection_att']

example of # of occurences between AbuMusab_110 and freelance_112: 


KeyError: 'freelance_112'

### Find the users that only send tweets to themselves

In [104]:
# return only those profiles that are a member of a clique that is smaller than 2 members
profiles_of_small_cliques = {k:v for (k,v) in size_of_largest_clique_per_node.items() if v < 2}
profiles_of_small_cliques

{'04_8_1437': 1,
 'AbdusMujahid149': 1,
 'Al_Battar_Engl': 1,
 'Battar_English': 1,
 'Dieinurage308': 1,
 'al_nusra': 1,
 'btt_ar': 1,
 'fahadslay614': 1}

In [105]:
# return the connections of those profiles 

# return only those members of a clique who are a member of a clique that is smaller than 2 members 
dict((key,value) for key,value in members_of_clique_per_node.iteritems() if key in profiles_of_small_cliques)

{'04_8_1437': [['04_8_1437']],
 'AbdusMujahid149': [['AbdusMujahid149']],
 'Al_Battar_Engl': [['Al_Battar_Engl']],
 'Battar_English': [['Battar_English']],
 'Dieinurage308': [['Dieinurage308']],
 'al_nusra': [['al_nusra']],
 'btt_ar': [['btt_ar']],
 'fahadslay614': [['fahadslay614']]}

In [119]:
# get the connections of one user with himself
identical_user_connection = [('04_8_1437', '04_8_1437'),
                             ('AbdusMujahid149', 'AbdusMujahid149'),
                             ('Al_Battar_Engl', 'Al_Battar_Engl'),
                             ('Battar_English', 'Battar_English'),
                             ('Dieinurage308', 'Dieinurage308'),
                             ('al_nusra', 'al_nusra'),
                             ('btt_ar', 'btt_ar'),
                             ('fahadslay614', 'fahadslay614')]

In [127]:
# get the number of times that a profile sends a tweet to his own account
for k,v in occurences_per_connection.iteritems():
    if k in identical_user_connection:
        print k,v

('btt_ar', 'btt_ar') 13
('fahadslay614', 'fahadslay614') 10
('al_nusra', 'al_nusra') 1
('Dieinurage308', 'Dieinurage308') 4
('Al_Battar_Engl', 'Al_Battar_Engl') 20
('Battar_English', 'Battar_English') 18
('04_8_1437', '04_8_1437') 3
('AbdusMujahid149', 'AbdusMujahid149') 1


### Find similarities between user names

Some profile names are very similar to each other. In some cases only one letter has changed in comparison to another exising profile name. 
For example:

    ('MaghrebiHD', 'MaghrebiQ')

This could implicate that one and the same user is the owner of multiple profiles.
In order to find the users that may tweet from multiple accounts, we will search for those profiles that are very similar to each other.

I will use the Levenshtein distance to calculate the similarity between profile names.
https://en.wikipedia.org/wiki/Levenshtein_distance#Computing_Levenshtein_distance

In [128]:
import Levenshtein 

The algorythm counts the number of letters that are different between a pair of words.
For example:

    Levenshtein.distance('thisis', 'thisisnot')
    = 3

In [180]:
# creat pairs of usernames
# output [(username1, username1), (username1, username2)]
import itertools
username_combinations = list(itertools.combinations(usernames, 2))

In [171]:
# get similarity between pairs of usernames
word_similarity = []
for username1, username2 in username_combinations:
    word_similarity.append(Levenshtein.distance(username1, username2))
    
username_similarity = zip(username_combinations, word_similarity)
username_similarity[:5]

[(('GunsandCoffee70', 'AbuLaythAlHindi'), 14),
 (('GunsandCoffee70', 'YazeedDhardaa25'), 15),
 (('GunsandCoffee70', 'abubakerdimshqi'), 14),
 (('GunsandCoffee70', 'BaqiyaIs'), 14),
 (('GunsandCoffee70', 'WhiteCat_7'), 13)]

In [179]:
# create keys for username_combinations
# create values for word_similarity list
username_similarity_dict = dict(username_similarity)

In [178]:
# retrieve usernames that are very similar to each other
{k:v for (k,v) in username_similarity_dict.items() if v < 5}

{('Fidaee_Fulaani', 'FidaeeFulaani'): 1,
 ('MaghrebiHD', 'MaghrebiQ'): 2,
 ('MaghrebiHD', 'MaghrebiWM'): 2,
 ('MaghrebiQ', 'MaghrebiWM'): 2,
 ('MaghrebiQM', 'MaghrebiHD'): 2,
 ('MaghrebiQM', 'MaghrebiQ'): 1,
 ('MaghrebiQM', 'MaghrebiWM'): 1,
 ('Mosul_05', 'Abdul__05'): 4,
 ('abuayisha102', 'abuayisha108'): 1,
 ('abutariq040', 'abutariq041'): 1,
 ('klakishinki', 'kIakishini5'): 3,
 ('mustaklash', 'moustiklash'): 2,
 ('mustaklash', 'mustafaklash56'): 4,
 ('warreporter2', 'WarReporter1'): 3,
 ('wayf44rerr', 'wayff44rer'): 2,
 ('wayyf44rer', 'wayf44rerr'): 2,
 ('wayyf44rer', 'wayff44rer'): 1}

It will be interesting to explore the impact on the degree density when the profile names that are very similar, are grouped to one profile name.

### Add topic as attribute to edges

In [39]:
nx.set_edge_attributes(graph, 'tweettopic_per_connection_att', connection_topic)
print 'example of topic of tweet between wayyf44rer and  RamiAlLolah: ' 
graph['wayyf44rer']['RamiAlLolah']['tweettopic_per_connection_att']

example of topic of tweet between wayyf44rer and  RamiAlLolah: 


'topic4'

In [47]:
# show all connections per topic
topic0 = [key for key,value in connection_topic.iteritems() if value =='topic3']
topic1 = [key for key,value in connection_topic.iteritems() if value =='topic3']
topic2 = [key for key,value in connection_topic.iteritems() if value =='topic3']
topic3 = [key for key,value in connection_topic.iteritems() if value =='topic3']
topic4 = [key for key,value in connection_topic.iteritems() if value =='topic4']

print '# connections with topic 0: ', len(topic0)
print '# connections with topic 1: ', len(topic1)
print '# connections with topic 2: ', len(topic2)
print '# connections with topic 3: ', len(topic3)
print '# connections with topic 4: ', len(topic4)

 # connections with topic 0:  22
# connections with topic 1:  22
# connections with topic 2:  22
# connections with topic 3:  22
# connections with topic 4:  16
