# Sentiment analysis of open-source software communities

This Jupyter notebook includes the data preparation and analysis
for our project exploring open-source software communities.

**Code last updated**: 6 November 2018

***

## Table of contents

* [Preliminaries](#Preliminaries)
* [Data preparation](#Data-preparation)
* [Data analysis](#Data-analysis)
* [Code testing ground](#Code-testing-ground)

***

## Preliminaries

### Load libraries and functions

In [None]:
import os, glob, string

In [None]:
import nltk
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [None]:
from utils import annotate, project_features

***

## Data preparation

Cycle through all GitHub project files to clean the data and prepare the datasets needed for analysis.
For a complete list of downloaded variables and newly created variables, see the `metadata.md` file.

In [None]:
# list all projects' raw data
project_list = os.listdir('../../data/raw_data')
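
`os.listdir` returns every entry in the directory, including hidden files such as `.DS_Store`, and any such entry would make the per-project `read_csv` calls fail. A minimal sketch, assuming each project lives in its own subdirectory, of restricting the listing to directories (the helper name `list_project_dirs` is hypothetical and not part of `utils`):

```python
import os
import tempfile

def list_project_dirs(raw_data_dir):
    """Return the sorted names of the subdirectories of raw_data_dir."""
    return sorted(
        entry for entry in os.listdir(raw_data_dir)
        if os.path.isdir(os.path.join(raw_data_dir, entry))
    )

# demonstrate on a throwaway directory rather than the real data folder
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, 'mayavi'))
    os.mkdir(os.path.join(tmp, 'another_project'))
    open(os.path.join(tmp, '.DS_Store'), 'w').close()  # stray hidden file
    demo_projects = list_project_dirs(tmp)

print(demo_projects)
```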

In [None]:
# load in the lists needed
bot_list = pd.read_csv('../bot_names.txt')['bot_name']
gratitude_list = set(pd.read_csv('./utils/gratitude.txt')['expressions_of_gratitude'])
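
How `annotate.add_gratitude` uses `gratitude_list` is not shown in this notebook. As a rough sketch only, assuming it flags text that contains any listed expression, a case-insensitive whole-phrase match could look like this (the function `contains_gratitude` and the example expressions are hypothetical):

```python
import re

def contains_gratitude(text, expressions):
    """Return True if any expression occurs in text as a whole word or phrase."""
    lowered = text.lower()
    return any(
        re.search(r'\b' + re.escape(expression.lower()) + r'\b', lowered)
        for expression in expressions
    )

# hypothetical stand-in for the contents of gratitude.txt
demo_expressions = {'thanks', 'thank you', 'much appreciated'}

print(contains_gratitude('Thank you for the quick fix!', demo_expressions))
print(contains_gratitude('This crashes on startup.', demo_expressions))
```

The word boundaries keep a short expression such as `thanks` from matching inside a longer word such as `Thanksgiving`.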

In [None]:
# create a container for the bus factor ratings
bus_factor_df = pd.DataFrame()

# cycle through all raw data projects
for project in project_list:
    
    # read in the next project's files
    temp_comments = pd.read_csv('../../data/raw_data/'+project+'/comments.tsv',
                                sep='\t', index_col=0).sort_index()
    temp_issues = pd.read_csv('../../data/raw_data/'+project+'/issues.tsv',
                              sep='\t', index_col=0).sort_index()
    temp_commits = pd.read_csv('../../data/raw_data/'+project+'/commits.tsv',
                              sep='\t', index_col=0).sort_index()
    
    # append the current project to each
    temp_comments['project'] = project
    temp_issues['project'] = project
    temp_commits['project'] = project
    
    # annotate each file
    temp_comments, temp_issues = annotate.annotate_logs(temp_comments,
                                                        temp_issues)
    
    # drop columns we don't need
    temp_comments = temp_comments.drop(columns=['node_id','updated_at','author_id'])
    temp_issues = temp_issues.drop(columns=['node_id','organization','author_id','locked'])
#     temp_commits = temp_commits.drop(columns=['author_id','sha'])
    
    # clean up the text body
    temp_comments = annotate.body_cleanup(temp_comments, bot_list)
    temp_issues = annotate.body_cleanup(temp_issues, bot_list)
#     temp_commits = annotate.body_cleanup(temp_commits, bot_list)
    
    # run sentiment analysis
    temp_comments = annotate.add_sentiment(temp_comments)
    temp_issues = annotate.add_sentiment(temp_issues)
#     temp_commits = annotate.add_sentiment(temp_commits)
    
    # add gratitude info
    temp_comments = annotate.add_gratitude(temp_comments, gratitude_list)
    temp_issues = annotate.add_gratitude(temp_issues, gratitude_list)
#     temp_commits = annotate.add_gratitude(temp_commits, gratitude_list)

    # join the dataframes
    temp_joined_frame = (temp_comments.join(temp_issues, 
                                            lsuffix='_comment',
                                            rsuffix='_issue',
                                            on='ticket_id')
                                       .reset_index(drop=True)
                                       .drop(columns='project_comment')
                                       .rename(columns={'project_issue': 'project'}))
    
    # calculate bus factor
    temp_bus_factor = project_features.compute_bus_factor(temp_commits)
    bus_factor_df = pd.concat([bus_factor_df,
                               pd.DataFrame([{'project': project,
                                              'bus_factor': temp_bus_factor}])],
                              ignore_index=True)
    temp_joined_frame['bus_factor'] = temp_bus_factor
    
    # save cleaned data to intermediary folders
    temp_comments.to_csv('../../data/processed_data/'+project+'-processed-comments.csv',
                         index=False, header=True)
    temp_issues.to_csv('../../data/processed_data/'+project+'-processed-issues.csv',
                         index=False, header=True)
    temp_commits.to_csv('../../data/processed_data/'+project+'-processed-commits.csv',
                         index=False, header=True)
    temp_joined_frame.to_csv('../../data/processed_data/'+project+'-processed-joined.csv',
                         index=False, header=True)
    
    # use identical bin sizes for all histograms
    bin_number = 50    
    fig_dpi = 150
    y_label_text = 'Density'
    density_choice = True
    alpha_level = .5
    
    # create overlapping histograms for emotion in comment text
    plt.figure()
    plt.hist(temp_comments['negative_emotion'], 
             bin_number, density=density_choice, facecolor='r', alpha=alpha_level)
    plt.hist(temp_comments['positive_emotion'], 
             bin_number, density=density_choice, facecolor='g', alpha=alpha_level)
    plt.hist(temp_comments['neutral_emotion'], 
             bin_number, density=density_choice, facecolor='grey', alpha=alpha_level)
    plt.title('Histogram of emotion proportions in comment bodies\nfor '+project)
    plt.xlabel('Proportion of emotion words to total words')
    plt.ylabel(y_label_text)
    plt.grid(True)

    # save comment emotion histogram
    plt.savefig('../../figures/emotion_histograms/'+project+'-comment_body.png',
               dpi=fig_dpi)
    plt.close()
    
    # create overlapping histograms for emotion in issue text
    plt.figure()
    plt.hist(temp_issues['negative_emotion'], 
             bin_number, density=density_choice, facecolor='r', alpha=alpha_level)
    plt.hist(temp_issues['positive_emotion'], 
             bin_number, density=density_choice, facecolor='g', alpha=alpha_level)
    plt.hist(temp_issues['neutral_emotion'], 
             bin_number, density=density_choice, facecolor='grey', alpha=alpha_level)
    plt.title('Histogram of emotion proportions in issue bodies\nfor '+project)
    plt.xlabel('Proportion of emotion words to total words')
    plt.ylabel(y_label_text)
    plt.grid(True)

    # save issue emotion histogram
    plt.savefig('../../figures/emotion_histograms/'+project+'-issue_body.png',
               dpi=fig_dpi)
    plt.close()
    
# save bus factor file
bus_factor_df.to_csv('../../data/processed_data/all-bus_factor.csv',
                         index=False, header=True)
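
The definition used by `project_features.compute_bus_factor` is not shown in this notebook. One common heuristic, assumed for this sketch only, is the smallest number of contributors who together account for more than half of all commits. The function name `sketch_bus_factor` and the author names are hypothetical:

```python
from collections import Counter

def sketch_bus_factor(commit_authors, threshold=0.5):
    """Smallest number of top authors whose commits exceed `threshold` of the total."""
    counts = Counter(commit_authors)
    total = sum(counts.values())
    covered = 0
    # walk through authors from most to fewest commits
    for rank, (_, n_commits) in enumerate(counts.most_common(), start=1):
        covered += n_commits
        if covered > threshold * total:
            return rank
    return 0  # no commits at all

demo_authors = ['ana'] * 6 + ['ben'] * 3 + ['cy'] * 1
print(sketch_bus_factor(demo_authors))  # one author covers 60% of commits
print(sketch_bus_factor(['ana', 'ben', 'cy', 'dee']))
```

A low value under this heuristic means that the commit history is concentrated in very few people.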

### Model preparation

In [None]:
# get project names again
project_list = os.listdir('../../data/raw_data')

In [None]:
# create empty frame
joined_frame = pd.DataFrame()

# read in joined frames for all projects
for project in project_list:

    # read in the next project's prepared files
    temp_joined_frame = pd.read_csv('../../data/processed_data/'+project+'-processed-joined.csv',
                                    sep=',').sort_index()

    # append to dataframe
    joined_frame = pd.concat([joined_frame, temp_joined_frame]).reset_index(drop=True)

In [None]:
# remove any bots and the bot columns
joined_frame = (joined_frame.loc[(joined_frame['bot_flag_comment']==False) &
                                   (joined_frame['bot_flag_issue']==False)]
                            .reset_index(drop=True)
                            .drop(columns=['bot_flag_comment','bot_flag_issue']))

In [None]:
# identify the timestamp of the author's most recent issue and comment in this group
most_recent_comment = (joined_frame.groupby(['project',
                                            'author_name_comment'])
                                   [['created_at_comment', 'ticket_id_issue']]
                                   .max())
most_recent_issue = (joined_frame.groupby(['project','author_name_issue'])
                                 [['created_at_issue', 'ticket_id_issue']]
                                 .max())

In [None]:
# add the most recent timestamps to dataframe
joined_frame = (joined_frame.join(most_recent_comment, on=['project', 'author_name_issue'], rsuffix='_last')
                               .rename(columns={"created_at_comment_last": "issue_author_last_comment_stamp",
                                                "ticket_id_issue_last": "issue_author_last_comment_ticket"})
                               .join(most_recent_issue, on=['project', 'author_name_issue'], rsuffix='_last')
                               .rename(columns={"created_at_issue_last": "issue_author_last_issue_stamp",
                                                "ticket_id_issue_last": "issue_author_last_issue_ticket"}))

In [None]:
# is this the first ticket that the ticket author submitted?
joined_frame['first_ticket'] = (joined_frame['num_PR_created_issue']==0) & (joined_frame['num_issue_created_issue']==0)

In [None]:
# is this issue the last one that the issue author submitted?
joined_frame['issue_author_last_issue'] = joined_frame['ticket_id_issue']==joined_frame['issue_author_last_issue_ticket']

In [None]:
# is this issue the last thing that the author worked on?
joined_frame['issue_author_last_comment'] = joined_frame['ticket_id_issue']==joined_frame['issue_author_last_comment_ticket']

In [None]:
# if they've never commented, make sure we note that the issue was their last activity
joined_frame.loc[joined_frame['issue_author_last_comment_ticket'].isnull()==True, 'issue_author_last_comment'] = True
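
The lookup used in the cells above (latest comment per commenter, joined back onto each issue's author) can be sketched on toy data. The frame below is made up for illustration and keeps only the columns the pattern needs. The timestamps are ISO-formatted strings, so taking their maximum as text gives the latest one:

```python
import pandas as pd

# toy stand-in for the joined frame: one row per comment on an issue
demo = pd.DataFrame({
    'project': ['p', 'p', 'p', 'p'],
    'author_name_issue': ['ana', 'ana', 'ben', 'dee'],
    'author_name_comment': ['ben', 'ana', 'ana', 'cy'],
    'created_at_comment': ['2018-01-02', '2018-01-03', '2018-02-01', '2018-02-02'],
})

# latest comment timestamp per (project, commenter)
last_comment = (demo.groupby(['project', 'author_name_comment'])[['created_at_comment']]
                    .max())

# look up the *issue author's* latest comment; authors who never commented get NaN
demo = (demo.join(last_comment, on=['project', 'author_name_issue'], rsuffix='_last')
            .rename(columns={'created_at_comment_last': 'issue_author_last_comment_stamp'}))

print(demo[['author_name_issue', 'issue_author_last_comment_stamp']])
```

In this toy frame `dee` opened an issue but never commented, so her row has a missing stamp. That is the case the final `isnull()` cell handles.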

In [None]:
# save to file
joined_frame.to_csv('../../data/analysis_data/all-sentiment_frame.csv',
                         index=False, header=True)

In [None]:
# save one without the comment/ticket bodies for analysis in R
joined_frame_for_r = joined_frame.drop(columns=['body_comment','body_issue', 'title', 'labels'])
joined_frame_for_r.to_csv('../../data/analysis_data/all-sentiment_frame-for_r.csv',
                         index=False, header=True)

In [None]:
joined_frame_for_r.head(10)

***

## Data analysis

*Currently porting to R for speed. Will later move back to Python.*

***

# Code testing ground

### Data preparation

In [None]:
project = 'mayavi'

In [None]:
bus_factor = pd.read_csv('../../data/processed_data/all-bus_factor.csv',
                         sep=',').sort_index()

In [None]:
temp_comments = pd.read_csv('../../data/raw_data/'+project+'/comments.tsv',
                          sep='\t', index_col=0).sort_index()

In [None]:
temp_issues = pd.read_csv('../../data/raw_data/'+project+'/issues.tsv',
                          sep='\t', index_col=0).sort_index()

In [None]:
temp_commits = pd.read_csv('../../data/raw_data/'+project+'/commits.tsv',
                               sep='\t', index_col=0).sort_index()

### Annotate the files with new columns

In [None]:
temp_comments, temp_issues = annotate.annotate_logs(temp_comments,temp_issues)

### Remove unnecessary columns

In [None]:
temp_comments = temp_comments.drop(columns=['node_id','updated_at','author_id'])

In [None]:
temp_issues = temp_issues.drop(columns=['node_id','organization','author_id','locked'])

### Clean up body

In [None]:
bot_list = pd.read_csv('../bot_names.txt')['bot_name']

In [None]:
temp_comments = annotate.body_cleanup(temp_comments, bot_list)

In [None]:
temp_issues = annotate.body_cleanup(temp_issues, bot_list)

### Sentiment analysis

In [None]:
temp_comments = annotate.add_sentiment(temp_comments)

In [None]:
temp_issues = annotate.add_sentiment(temp_issues)
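
The histograms in this notebook label their x-axis "Proportion of emotion words to total words". The scoring itself happens inside `annotate.add_sentiment`, which is not shown here. As a toy illustration only, using a tiny made-up lexicon rather than VADER's, such proportions could be computed as follows (`emotion_proportions` and both word sets are hypothetical):

```python
import string

# made-up lexicons for illustration; VADER's lexicon and rules are far richer
POSITIVE_WORDS = {'great', 'thanks', 'love', 'nice'}
NEGATIVE_WORDS = {'broken', 'crash', 'hate', 'bug'}

def emotion_proportions(text):
    """Return the share of positive, negative, and neutral words in text."""
    words = [w.strip(string.punctuation).lower() for w in text.split()]
    words = [w for w in words if w]  # drop tokens that were only punctuation
    if not words:
        return {'positive_emotion': 0.0, 'negative_emotion': 0.0, 'neutral_emotion': 0.0}
    n_pos = sum(w in POSITIVE_WORDS for w in words)
    n_neg = sum(w in NEGATIVE_WORDS for w in words)
    total = len(words)
    return {'positive_emotion': n_pos / total,
            'negative_emotion': n_neg / total,
            'neutral_emotion': (total - n_pos - n_neg) / total}

demo_scores = emotion_proportions('Thanks, great patch! One bug left.')
print(demo_scores)
```

The three values always sum to one for non-empty text, which is why the neutral histogram tends to sit far to the right of the other two.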

### Gratitude

In [None]:
gratitude_list = set(pd.read_csv('./utils/gratitude.txt')['expressions_of_gratitude'])

In [None]:
temp_comments = annotate.add_gratitude(temp_comments, gratitude_list)

In [None]:
temp_issues = annotate.add_gratitude(temp_issues, gratitude_list)

### Plot

In [None]:
# use identical bin sizes for all histograms
bin_number = 50    
fig_dpi = 150
y_label_text = 'Density'
density_choice = True
alpha_level = .5

In [None]:
# create overlapping histograms for emotion in comment text
plt.figure()
plt.hist(temp_comments['negative_emotion'], 
         bin_number, density=density_choice, facecolor='r', alpha=alpha_level)
plt.hist(temp_comments['positive_emotion'], 
         bin_number, density=density_choice, facecolor='g', alpha=alpha_level)
plt.hist(temp_comments['neutral_emotion'], 
         bin_number, density=density_choice, facecolor='grey', alpha=alpha_level)
plt.title('Histogram of emotion proportions in comment bodies\nfor '+project)
plt.xlabel('Proportion of emotion words to total words')
plt.ylabel(y_label_text)
plt.grid(True)

In [None]:
# create overlapping histograms for emotion in issue text
plt.figure()
plt.hist(temp_issues['negative_emotion'], 
         bin_number, density=density_choice, facecolor='r', alpha=alpha_level)
plt.hist(temp_issues['positive_emotion'], 
         bin_number, density=density_choice, facecolor='g', alpha=alpha_level)
plt.hist(temp_issues['neutral_emotion'], 
         bin_number, density=density_choice, facecolor='grey', alpha=alpha_level)
plt.title('Histogram of emotion proportions in issue bodies\nfor '+project)
plt.xlabel('Proportion of emotion words to total words')
plt.ylabel(y_label_text)
plt.grid(True)

### Analyses

In [None]:
comments_df = temp_comments

In [None]:
issues_df = temp_issues

In [None]:
joined_frame = comments_df.join(issues_df, 
                                lsuffix='_comment',
                                rsuffix='_issue',
                                on='ticket_id')

In [None]:
# identify the timestamp of the author's most recent issue and comment in this group
most_recent_comment = joined_frame.groupby('author_name_comment').max()['created_at_comment']
most_recent_issue = joined_frame.groupby('author_name_issue').max()['created_at_issue']

In [None]:
# add the most recent timestamps to dataframe
joined_frame = (joined_frame.join(most_recent_comment, on='author_name_issue', rsuffix='_last')
                               .rename(columns={"created_at_comment_last": "issue_author_last_comment_stamp"})
                               .join(most_recent_issue, on='author_name_issue', rsuffix='_last')
                               .rename(columns={"created_at_issue_last": "issue_author_last_issue_stamp"}))

In [None]:
# is this the first ticket that the ticket author submitted?
joined_frame['first_ticket'] = (joined_frame['num_PR_created_issue']==0) & (joined_frame['num_issue_created_issue']==0)

In [None]:
# is this issue the last one that the issue author submitted?
joined_frame['issue_author_last_issue'] = joined_frame['created_at_issue']==joined_frame['issue_author_last_issue_stamp']

In [None]:
# is this issue the last thing that the author worked on?
joined_frame['issue_author_last_contribution'] = joined_frame['created_at_issue'] > joined_frame['issue_author_last_comment_stamp']

In [None]:
# if they've never commented, make sure we note that the issue was their last activity
joined_frame.loc[joined_frame['issue_author_last_comment_stamp'].isnull(), 'issue_author_last_contribution'] = True

In [None]:
joined_frame.head(10)

We get an error about joining `object` and `int64` when we use
`pd.DataFrame.join` on the `project` variable, so for now we look up
the project's bus factor directly instead.

**Edit**: Still unsure why this is happening, but it happens whenever you
load back in the edited dataframe files and then try to merge them. I've 
circumvented this issue for now by simply joining the dataframes as soon as 
they've been edited.
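
One possible cause, stated here as an assumption because the failing call is not shown: `DataFrame.join(other, on='project')` matches the caller's `project` column against the *index* of `other`, not against its `project` column. A frame freshly loaded with `read_csv` has a default integer index, so string project names would be compared with `int64` labels. A minimal sketch, with made-up data, of two ways to match on the column instead:

```python
import pandas as pd

# made-up stand-ins for the joined frame and the bus factor table
demo_frame = pd.DataFrame({'project': ['mayavi', 'mayavi', 'other'],
                           'ticket_id': [1, 2, 7]})
demo_bus_factor = pd.DataFrame({'project': ['mayavi', 'other'],
                                'bus_factor': [2, 5]})

# merge matches the 'project' columns of both frames directly
merged = demo_frame.merge(demo_bus_factor, on='project', how='left')

# equivalent with join: move 'project' into the index of the right-hand frame first
joined = demo_frame.join(demo_bus_factor.set_index('project'), on='project')

print(merged)
```

Either form attaches the bus factor to every row of a project without the single-project lookup used in the next cell.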

In [None]:
joined_frame['bus_factor'] = (bus_factor[bus_factor['project']==project]
                                      .reset_index()['bus_factor'][0])

#### Survivor curves by emotional tenor

### Ideas

Do comments, generally, get more friendly or more hostile over time?

Does the emotional valence of a contributor's first ticket predict whether they'll come back to make a second one?

Are requesters more or less polite?

Does friendliness bring people back?

Does the number and intensity of negative and positive comments on a first-time contributor's issue 
change whether they come back to make another ticket?

***