# Compsci 690V Homework 5

For this homework we chose Mini Challenge 1 of the VAST Challenge 2008.

**Description:** The Paraiso movement is controversial and is having considerable social impact in a specific area of the world. We have extracted a segment of the Paraiso (the movement) Wikipedia edits page. Please note this is not the Paraiso Manifesto Wiki page which is part of the background materials, but a related different page. Please use visual analytics to describe the social relationships of the editors (those that have edited/modified the Wikipedia page) as they are reflected in these files.



To obtain a dataset similar to the one given in the mini challenge, we scraped the edit history of the Wikipedia page **"2016–17 Kashmir unrest"** (link: https://en.wikipedia.org/w/index.php?title=2016%E2%80%9317_Kashmir_unrest&limit=500&action=history). We chose this page because it concerns a recent revolutionary movement similar to the one described in the challenge. We scraped the text using the script in the file **getEdits.py**.

We parsed the text of each edit into a dataframe with the following columns:
* **timestamp** - the timestamp of the edit
* **user** - the user or ip of the editor
* **minorEdit** - True if the edit is marked as minor, False otherwise
* **pageLength** - length of the page after the edit (in bytes)
* **editDiff** - number of bytes changed by the edit
* **comment** - comment or description of the edit
* **tags** - tags attached to the edit
* **entireEdit** - entire raw edit text
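
As an illustration of the parsing step, here is a minimal sketch that pulls a few of these columns out of one raw history line. The sample line, the user name, and the regular expressions are assumptions about the layout of the scraped text; they are not taken from **getEdits.py**.

```python
import re

# Hypothetical history line; the real scraped text may be laid out differently.
sample = "10:42, 5 March 2017 ExampleUser (talk | contribs) m . . (181,204 bytes) (+35) . . (fix typo)"

TIMESTAMP = re.compile(r"^(\d{2}:\d{2}, \d{1,2} \w+ \d{4})")
PAGE_LENGTH = re.compile(r"\(([\d,]+) bytes\)")
# Negative diffs may use the Unicode minus sign (U+2212) rather than "-".
EDIT_DIFF = re.compile(r"\(([+\-\u2212]?[\d,]+)\)")

def parse_edit(line):
    """Extract a few of the columns from one raw edit line."""
    timestamp = TIMESTAMP.search(line)
    length = PAGE_LENGTH.search(line)
    diff = EDIT_DIFF.search(line)
    return {
        "timestamp": timestamp.group(1) if timestamp else None,
        "pageLength": int(length.group(1).replace(",", "")) if length else None,
        "editDiff": int(diff.group(1).replace(",", "").replace("\u2212", "-")) if diff else None,
        # Assumed marker for minor edits in the flattened text.
        "minorEdit": " m . . " in line,
        "entireEdit": line,
    }

parsed = parse_edit(sample)
```

Fields that cannot be found are left as `None`, so a malformed line does not stop the whole parse.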



(1338, 8)


Next we look at the users who have made the most edits:

In [2]:
contributions = df['user'].value_counts()
print(contributions.head())

DinoBambinoNFS     437
Kautilya3          101
Support2016         56
117.214.245.178     29
Thnidu              26
Name: user, dtype: int64


Some of the edits were made by Wikipedia bots. These bots usually have **'Bot'** at the end of their names:

In [3]:
for name in contributions.index.tolist():
    if('Bot' in name):
        print(name, contributions[name])

ClueBot NG 2
MusikBot 2
Bender the Bot 2
KolbertBot 1


We look at the edits that were just reverts of previous edits:

In [4]:
revertEdits = []
for row in df.iterrows():
    if('Revert' in row[1]['comment'] or 'revert' in row[1]['comment']):
        revertEdits.append(row)

print('Number of edits that are reverts: ',len(revertEdits))



Number of edits that are reverts:  83
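
The check above only matches the literal strings 'Revert' and 'revert'. A slightly broader test could be sketched as follows, assuming that summaries starting with "Undid revision" (the default undo summary) and the shorthand "rv" should also count as reverts. The pattern and the sample summaries are illustrative, and this variant would give a different count than the one reported above.

```python
import re

# Assumed patterns: "revert", "reverted", "reverting", the shorthand "rv",
# and the default undo summary "Undid revision ...".
REVERT_PATTERN = re.compile(r"\b(revert\w*|rv|undid revision)\b", re.IGNORECASE)

def is_revert(comment):
    """Return True if an edit summary looks like a revert."""
    return bool(REVERT_PATTERN.search(comment))

# Made-up edit summaries for illustration.
comments = [
    "Reverted edits by ExampleUser (talk) to last version by OtherUser",
    "Undid revision 123456 by ExampleUser (talk)",
    "rv unsourced claim",
    "fix typo",
]
revert_flags = [is_revert(c) for c in comments]
```

On the real data the same function could be applied with `df['comment'].apply(is_revert)`.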


## Method 1: Term Frequency-Inverse Document Frequency feature extraction and clustering
We use TF-IDF feature extraction and cluster the users with k-means to see whether any distinct clusters form. To do that, we first create a dataframe **userWiseComments** that stores all the comments from all the edits of each user.

In [5]:
def getUserWiseComments(dataframe):
    userWiseComments = pd.DataFrame(columns=['user','allComments'])
    users = list(set(dataframe['user']))
    i = 0
    for user in users:
        userRows = dataframe.loc[dataframe['user'] == user]
        allComments = ''
        for row in userRows.iterrows():
            allComments += row[1]['comment']
        userWiseComments.loc[i] = [user,allComments]
        i += 1
    return userWiseComments
userWiseComments = getUserWiseComments(df)
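
The same per-user aggregation can be sketched with a pandas `groupby`. This is an alternative, not the function used above: it joins comments with a space, whereas the loop above concatenates them without a separator, so the resulting tokens can differ. The toy dataframe is made up.

```python
import pandas as pd

# Toy stand-in for the scraped edits; users and comments are made up.
toy = pd.DataFrame({
    "user": ["A", "B", "A", "C"],
    "comment": ["fix typo", "add source", "reword lead", ""],
})

# Join each user's comments with a space so that the last word of one
# comment is not fused with the first word of the next.
user_comments = (
    toy.groupby("user")["comment"]
       .agg(" ".join)
       .reset_index(name="allComments")
)
```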


We now create a TF-IDF matrix with TfidfVectorizer, using English stems of the tokens in each user's comments:

In [7]:
stemmer = SnowballStemmer("english")

#take comments from dataframe
commentsDf = userWiseComments['allComments']

#tokenize the given text and stem each token
def tokenizeAndStem(text):
    commentsTokens = nltk.word_tokenize(text)
    #keep only tokens that contain at least one letter or digit (drops pure punctuation)
    filtered_tokens = []
    for token in commentsTokens:
        if re.search('[a-zA-Z0-9]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems


#build the term-frequency matrix; with use_idf=False no inverse document frequency weighting is applied
tfidf_vectorizer = TfidfVectorizer(max_df=0.8,
                                 min_df=0, stop_words='english',
                                 use_idf=False,tokenizer=tokenizeAndStem, ngram_range=(1,2))  #features are unigrams and bigrams

tfidf_matrix = tfidf_vectorizer.fit_transform(commentsDf)
print (tfidf_matrix.shape)
terms = tfidf_vectorizer.get_feature_names()
#terms are list of features in the matrix

(289, 12447)
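
Because the vectorizer above is created with `use_idf=False`, each row holds L2-normalized term frequencies rather than full TF-IDF weights. The sketch below shows the difference on toy documents in plain Python. It assumes the smoothed idf formula ln((1 + n) / (1 + df)) + 1 that scikit-learn documents for its default settings; the documents are made up, and stop-word removal, stemming, and bigrams are left out.

```python
import math
from collections import Counter

# Toy "documents": one string of concatenated comments per user (made up).
docs = ["fix typo fix link", "revert vandalism", "fix vandalism"]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(t for doc in tokenized for t in set(doc))

def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

def tf_vector(doc):
    """Normalized raw term counts (what use_idf=False produces)."""
    counts = Counter(doc)
    return l2_normalize([counts[t] for t in vocab])

def tfidf_vector(doc):
    """Term counts weighted by the smoothed idf, then normalized."""
    counts = Counter(doc)
    weights = [counts[t] * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1) for t in vocab]
    return l2_normalize(weights)

tf_rows = [tf_vector(doc) for doc in tokenized]
tfidf_rows = [tfidf_vector(doc) for doc in tokenized]
```

In the first toy document, "fix" appears in two of the three documents, so the idf weighting lowers its weight relative to the rarer "link". With plain term frequencies that down-weighting does not happen.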


Next we create clusters:

In [8]:
num_clusters = 4
km = KMeans(n_clusters=num_clusters)
km.fit(tfidf_matrix)

clusters = km.labels_.tolist()

We look at the top 10 features of each cluster:

In [9]:
order_centroids = km.cluster_centers_.argsort()[:,::-1]

for centres in order_centroids:
    topFeatures=[]
    for words in centres[:10]:
        topFeatures.append(terms[words])
    print(topFeatures)

['fix', 'reaction', 'typo', 'domest reaction', 'domest', 'fix typo', 'unrest', 'ad', 'use', 'link']
['background', 'septemb', 'background background', 'novemb', 'death', 'septemb novemb', 'background death', 'septemb background', 'reword', 'background reword']
['unrest', 'talk', 'use', 'edit', 'kashmir', 'revis', 'ad', 'link', 'casualti', 'revert']


We can see that some features are shared between clusters. We plot the users in two dimensions using multidimensional scaling (MDS), a manifold learning method for dimensionality reduction:

In [10]:
#dimensionality reduction of the TF-IDF matrix with multidimensional scaling (MDS)

# reduce to two components as we're plotting points in a two-dimensional plane
# we will also specify `random_state` so the plot is reproducible.
mds = MDS(n_components=2, random_state=1)

#toarray() converts sparse array to dense numpy array
pos = mds.fit_transform(tfidf_matrix.toarray())  # shape (n_samples, n_components)

#store the dimensions in xs, ys
xs, ys = pos[:, 0], pos[:, 1]


In [11]:
def get_colors(clusters):
    #one color per cluster label, so every point gets a color for up to four clusters
    palette = ['red', 'blue', 'green', 'orange']
    return [palette[i % len(palette)] for i in clusters]

colors = get_colors(clusters)

source = ColumnDataSource(data=dict(x=xs,y=ys,colors = colors, user = list(userWiseComments['user']), 
                                   comment = userWiseComments['allComments']))

hover = HoverTool(tooltips=[
    ("user",'@user'),
    ("comment","@comment")
])




b=figure( plot_height=500, plot_width=900, title='clustering groups based on edits', tools = [hover])


b.square('x','y',fill_color='colors',line_color='colors',source=source,size=10)
show(b)

### Result:
As we can see, this method is not very successful at grouping users: there are no clear clusters.

## Method 2: Creating groups via reverts and mentions

In this method we hypothesize that the users split into four major groups:
1. Bots.
2. Neutral users (unbiased).
3. Users aligned with Government of India, Indian Army and Central Reserve Police.
4. Users aligned with Kashmiri protesters and separatists.

Wikipedia bots usually have "Bot" or "bot" in their names, so building group 1 is easy:

In [6]:
group1 = []
group2 = []
group3 = []
group4 = []
data = df
for name in contributions.index.tolist():
    if('bot' in name or 'Bot' in name):
        group1.append(name)
        data = data.loc[data['user'] != name]
    
pprint.pprint(group1)

['Yobot',
 'GreenC bot',
 'ClueBot NG',
 'BG19bot',
 'MusikBot',
 'Bender the Bot',
 'KolbertBot']


For group 3 and group 4 we analyze the reverts and mentions of usernames in the comments and try to find the opposing members from that. We extract phrases in which the usernames have been mentioned: 

In [7]:
userWiseComments2 = getUserWiseComments(data)
userMentions = pd.DataFrame(columns=['user','mention','phrase', 'to'])
listOfUserNames = list(userWiseComments2['user'])
#sia = SentimentIntensityAnalyzer()
for row in userWiseComments2.iterrows():
    tokens = row[1]['allComments'].split()
    for name in listOfUserNames:
        indexes = [i for i,token in enumerate(tokens) if token==name]
        for i in indexes:
            start = max(0, i-5)
            phrase = " ".join(tokens[start:i+1])
            printText = ""
            to = False
            for word in tokens[start:i+1]:
                if(word == 'to' or word == 'To'):
                    printText +="\033[43;3m"+word+"\033[m "
                    to = True
                else:
                    printText +=word + " "
            #print(printText)
            userMentions.loc[len(userMentions)] = [row[1]['user'],name,phrase,to]
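
The window extraction in the cell above can be sketched as a small standalone function. The function name, the sample summary, and the user names are made up for illustration. Like the cell above, the sketch only finds user names that appear as a single whitespace-delimited token.

```python
def mention_windows(comment_text, usernames, width=5):
    """Yield (mentioned_name, phrase, has_to) for each user name found in the text.

    The phrase holds up to `width` tokens before the mention plus the mention itself.
    """
    tokens = comment_text.split()
    for i, token in enumerate(tokens):
        if token in usernames:
            window = tokens[max(0, i - width): i + 1]
            has_to = any(word in ("to", "To") for word in window)
            yield token, " ".join(window), has_to

# Made-up summary and user names for illustration.
text = "Reverted edits by UserA (talk) to last version by UserB"
found = list(mention_windows(text, {"UserA", "UserB"}))
```

In this example the user whose edits were reverted is found without "to" in the window, while the user whose version was restored is found with it.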
            
          

From the above phrases, we can see that when the word "to" appears in the phrase, the user is usually mentioned in a positive sense (for example, when the page is restored to that user's version); otherwise the mention is usually negative. We build a relationship matrix using the following criteria:
1. if the mention is used in a positive way, we add 1 to the relationship
2. if the mention is used in a negative way, we subtract 1 from the relationship
3. All relationships start at 0
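
The scoring rule above can be sketched compactly with a `Counter` keyed by an unordered pair of users. The function name and the sample mentions are hypothetical. Pairs that never mention each other are simply absent, which is equivalent to a score of 0.

```python
from collections import Counter

def score_mentions(mentions):
    """Sum +1 for each positive and -1 for each negative mention per unordered pair.

    `mentions` is an iterable of (user, mentioned_user, is_positive) tuples.
    """
    scores = Counter()
    for user, mentioned, is_positive in mentions:
        pair = tuple(sorted((user, mentioned)))  # (A, B) and (B, A) share one entry
        scores[pair] += 1 if is_positive else -1
    return scores

# Made-up mentions for illustration.
example = [
    ("UserA", "UserB", False),
    ("UserB", "UserA", False),
    ("UserA", "UserC", True),
]
scores = score_mentions(example)
```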

In [8]:
relationshipMatrix = {}
for name1 in listOfUserNames:
    relationshipMatrix[name1] = {}
    for name2 in listOfUserNames:
        relationshipMatrix[name1][name2] = 0
        
for row in userMentions.iterrows():
    if(row[1]['to'] == True):
        if(row[1]['user']<row[1]['mention']):
            relationshipMatrix[row[1]['user']][row[1]['mention']] += 1
        else:
            relationshipMatrix[row[1]['mention']][row[1]['user']] += 1
    else:
        if(row[1]['user']<row[1]['mention']):
            relationshipMatrix[row[1]['user']][row[1]['mention']] -= 1
        else:
            relationshipMatrix[row[1]['mention']][row[1]['user']] -= 1

relationships = pd.DataFrame(columns=['user1','user2','relationship'])
i=0
for key1,dictionary in relationshipMatrix.items():
        for key2,value in dictionary.items():
            if(value != 0):
                relationships.loc[i] = [key1,key2,value]
                i+=1
                


To create groups, we start by taking the most active users in the **relationships** dataframe and grouping according to them.

In [15]:
relationshipUsers = list(relationships['user1'])+list(relationships['user2'])
uniqueUsers = set(relationshipUsers)
userCount = {}
for user in uniqueUsers:
    userCount[user] = relationshipUsers.count(user)
    
userCount = pd.Series(userCount)
userCount = userCount.sort_values(ascending=False)
for user in userCount.index:
    
    if(user not in group3 and user not in group4):
        group3.append(user)
        
    if(user in group3):
        rows1 = relationships.loc[relationships['user1'] == user]
        rows2 = relationships.loc[relationships['user2'] == user]
        for row in rows1.iterrows():
            user2 = row[1]['user2']
            if(user2 in group3 or user2 in group4):
                continue
            elif(row[1]['relationship']>0):
                group3.append(user2)
            else:
                group4.append(user2)
                
        for row in rows2.iterrows():
            user2 = row[1]['user1']
            if(user2 in group3 or user2 in group4):
                continue
            elif(row[1]['relationship']>0):
                group3.append(user2)
            else:
                group4.append(user2)
        
    if(user in group4):
        rows1 = relationships.loc[relationships['user1'] == user]
        rows2 = relationships.loc[relationships['user2'] == user]
        for row in rows1.iterrows():
            user2 = row[1]['user2']
            if(user2 in group3 or user2 in group4):
                continue
            elif(row[1]['relationship']>0):
                group4.append(user2)
            else:
                group3.append(user2)
                
        for row in rows2.iterrows():
            user2 = row[1]['user1']
            if(user2 in group3 or user2 in group4):
                continue
            elif(row[1]['relationship']>0):
                group4.append(user2)
            else:
                group3.append(user2)
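
The idea behind this grouping is that positive relationships keep two users in the same group and negative ones put them in opposite groups. It can also be sketched as a breadth-first traversal of the signed relationship graph. This is a swapped-in variant, not the exact loop above, and it may assign some users differently; the function name and the sample edges are made up.

```python
from collections import deque

def split_by_sign(edges):
    """Split users into two camps from signed (user1, user2, score) edges.

    A positive score keeps the neighbour in the same camp, a non-positive
    score puts it in the opposite camp. The first assignment wins, so
    conflicting evidence met later is ignored.
    """
    neighbours = {}
    for u, v, score in edges:
        neighbours.setdefault(u, []).append((v, score))
        neighbours.setdefault(v, []).append((u, score))

    camp = {}
    # Start from the best-connected users, as in the prose above.
    for start in sorted(neighbours, key=lambda n: len(neighbours[n]), reverse=True):
        if start in camp:
            continue
        camp[start] = 0
        queue = deque([start])
        while queue:
            user = queue.popleft()
            for other, score in neighbours[user]:
                if other not in camp:
                    camp[other] = camp[user] if score > 0 else 1 - camp[user]
                    queue.append(other)
    return camp

# Made-up relationships for illustration.
edges = [("UserA", "UserB", -2), ("UserA", "UserC", 1), ("UserB", "UserD", 1)]
camps = split_by_sign(edges)
```

The camp labels 0 and 1 are arbitrary. Deciding which camp corresponds to which faction still requires reading the users' edits.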

Next we put all the remaining users in **group 2**:

In [16]:
for user in listOfUserNames:
    if(user not in group3 and user not in group4):
        group2.append(user)

The groups that we thus get are:

In [17]:
group1Source = ColumnDataSource(dict(names = group1))
group2Source = ColumnDataSource(dict(names = group2))
group3Source = ColumnDataSource(dict(names = group3))
group4Source = ColumnDataSource(dict(names = group4))


data_table1 = DataTable(source=group1Source, columns=[TableColumn(field="names", title="Group 1 (Bots)")], width=180, height=480,
                        reorderable = False, row_headers= False)
data_table2 = DataTable(source=group2Source, columns=[TableColumn(field="names", title="Group 2 (Neutral)")], width=200, height=480,
                        reorderable = False, row_headers= False)
data_table3 = DataTable(source=group3Source, columns=[TableColumn(field="names", title="Group 3")], width=200, height=480,
                        reorderable = False, row_headers= False)
data_table4 = DataTable(source=group4Source, columns=[TableColumn(field="names", title="Group 4")], width=200, height=480,
                        reorderable = False, row_headers= False)

show(Row(widgetbox(data_table1, width=200),
         widgetbox(data_table2, width=220),
         widgetbox(data_table3, width=220),
         widgetbox(data_table4, width=220)))

### Result:
This method is the most promising of the three, since two opposing groups are likely to revert and mention each other frequently. The resulting partition is therefore plausible.

## Method 3: Creating groups via sentiment analysis
In this method we try to analyze the attitude of the users through the sentiment of their comments. We use the **VADER sentiment analysis tool**:

Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.

Again we remove all the bots first and then group the remaining users into neutral, positive and negative.

In [17]:
group1 = []
group2 = []
group3 = []
group4 = []
compound = {}
data = df
for name in contributions.index.tolist():
    if('bot' in name or 'Bot' in name):
        group1.append(name)
        data = data.loc[data['user'] != name]
        
sia = SentimentIntensityAnalyzer()
userWiseComments2 = getUserWiseComments(data)
for row in userWiseComments2.iterrows():
    allComments = row[1]['allComments']
    res = sia.polarity_scores(allComments)
    compound[row[1]['user']] = res['compound']
    if res['compound'] > 0.2:
        group3.append(row[1]['user'])
    elif res['compound'] < -0.2:
        group4.append(row[1]['user'])
    else:
        group2.append(row[1]['user'])
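
The thresholding used above can be isolated in a small helper, shown here without the VADER dependency. The `mean_compound` helper is an assumed alternative that averages per-comment scores instead of scoring one concatenated string; it is not what the cell above does, and the sample scores are made up.

```python
def sentiment_group(compound, threshold=0.2):
    """Map a compound score in [-1, 1] to 'positive', 'negative' or 'neutral'."""
    if compound > threshold:
        return "positive"
    if compound < -threshold:
        return "negative"
    return "neutral"

def mean_compound(scores):
    """Average per-comment compound scores; an empty list counts as 0.0."""
    return sum(scores) / len(scores) if scores else 0.0

# Made-up per-comment compound scores for one user.
per_comment = [0.6, -0.1, 0.4]
group = sentiment_group(mean_compound(per_comment))
```

Averaging per comment keeps a single long, strongly worded summary from dominating a user's score, which may or may not be desirable for this dataset.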


In [18]:
group1Source = ColumnDataSource(dict(names = group1))
group2Source = ColumnDataSource(dict(names = group2))
group3Source = ColumnDataSource(dict(names = group3))
group4Source = ColumnDataSource(dict(names = group4))


data_table1 = DataTable(source=group1Source, columns=[TableColumn(field="names", title="Group 1 (Bots)")], width=180, height=480,
                        reorderable = False, row_headers= False)
data_table2 = DataTable(source=group2Source, columns=[TableColumn(field="names", title="Group 2 (Neutral)")], width=200, height=480,
                        reorderable = False, row_headers= False)
data_table3 = DataTable(source=group3Source, columns=[TableColumn(field="names", title="Group 3 (Positive)")], width=200, height=480,
                        reorderable = False, row_headers= False)
data_table4 = DataTable(source=group4Source, columns=[TableColumn(field="names", title="Group 4 (Negative)")], width=200, height=480,
                        reorderable = False, row_headers= False)

show(Row(widgetbox(data_table1, width=200),
         widgetbox(data_table2, width=220),
         widgetbox(data_table3, width=220),
         widgetbox(data_table4, width=220)))

### Result:
This method is successful in grouping users by the sentiment of their comments. However, sentiment alone cannot tell us which faction a user belongs to. Combining this method with Method 2 would help improve the result. Overall, Method 2 is the most promising method for solving VAST 2008 Mini Challenge 1.