# Sentiment analysis of open-source software communities

This Jupyter notebook includes the data preparation and analysis
for our project exploring open-source software communities.

To run this notebook, you will need the following files and directories:

* `../../data/processed_data/`: Directory of files produced by `./extract_features.py`
* `../bot_names.txt`: File of usernames identified as being bots
* `./utils/gratitude.txt`: List of words identified as gratitude-related
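
Before running the cells below, it can help to confirm that these inputs are present. The following is a minimal sketch, assuming the notebook is run from its own directory; the helper name `missing_inputs` is hypothetical and not part of the project code.

```python
import os

# Paths listed above, relative to the notebook's working directory.
REQUIRED_PATHS = [
    "../../data/processed_data/",
    "../bot_names.txt",
    "./utils/gratitude.txt",
]


def missing_inputs(paths=REQUIRED_PATHS, base_dir="."):
    """Return the subset of `paths` that does not exist under `base_dir`."""
    return [p for p in paths if not os.path.exists(os.path.join(base_dir, p))]


if __name__ == "__main__":
    missing = missing_inputs()
    if missing:
        print("Missing inputs:", ", ".join(missing))
    else:
        print("All required inputs found.")
```

An empty list from `missing_inputs()` means every listed file and directory was found.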

The most significant output of this notebook 
(`../../data/analysis_data/all-sentiment_frame-for_r.csv`) will be imported
into `./oss_community-language_dynamics.Rmd` for data analysis.

**Code last updated**: 07 October 2019

**Code written by**: A. Paxton (University of Connecticut) & N. Varoquaux
(CNRS)

***

## Table of contents

* [Preliminaries](#Preliminaries)
* [Data preparation](#Data-preparation)
* [Data analysis](#Data-analysis)
* [Future directions](#Future-directions)

***

## Preliminaries

### Load libraries and functions

In [1]:
import os, glob, string

In [2]:
import nltk
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [4]:
from utils import annotate, project_features

***

## Data preparation

### Initial file cleaning

Cycle through all GitHub project files to clean the data and prepare datasets as needed for analysis.
For a complete list of downloaded variables and newly created variables, see the `metadata.md` file.

In [5]:
# list all projects' raw data
project_list = os.listdir('../../data/processed_data/dataset_scip')

In [6]:
# load in the lists needed
bot_list = pd.read_csv('../bot_names.txt')['bot_name']
gratitude_list = set(pd.read_csv('./utils/gratitude.txt')['expressions_of_gratitude'])
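
The `grateful_count` and `grateful_list` columns in the processed files are presumably produced upstream by `./extract_features.py`, which is not shown in this notebook. As a rough sketch of how a gratitude word list could be matched against text, assuming simple case-insensitive whole-word matching (the function name `find_gratitude` and the sample terms are hypothetical, and the project's actual matching rules may differ):

```python
import re


def find_gratitude(text, gratitude_terms):
    """Return the gratitude terms found in `text`, one entry per occurrence.

    Matching is case-insensitive and on whole words or phrases only.
    """
    found = []
    lowered = text.lower()
    for term in sorted(gratitude_terms):
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        found.extend(term for _ in re.finditer(pattern, lowered))
    return found


# Illustrative terms only; the project's list lives in ./utils/gratitude.txt.
sample_terms = {"thanks", "thank you", "appreciate"}
matches = find_gratitude("Thanks for the fix, I appreciate it!", sample_terms)
print(len(matches), matches)
```

The length of the returned list plays the role of a count, and the list itself records which expressions matched.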

In [7]:
# create a container for the bus factor ratings
# bus_factor_df = pd.DataFrame()

# cycle through all raw data projects
for project in project_list:
    
    # automatically build paths for tickets and comments
    tickets_filename = os.path.join(
        "../../data/processed_data/dataset_scip", project, "processed-tickets.csv")
    comments_filename = os.path.join(
        "../../data/processed_data/dataset_scip", project, "processed-comments.csv")
    
    # read in the next project's prepared files
    temp_tickets = pd.read_csv(tickets_filename, sep=",").sort_index()
    temp_comments = pd.read_csv(comments_filename, sep=",").sort_index()
    
    # use identical bin sizes for all histograms
    bin_number = 50
    fig_dpi = 150
    y_label_text = 'Density'
    density_choice = True
    alpha_level = .5
    
    # create overlapping histograms for emotion in comment text
    plt.figure()
    plt.hist(temp_comments['negative_emotion'], 
             bin_number, density=density_choice, facecolor='r', alpha=alpha_level)
    plt.hist(temp_comments['positive_emotion'], 
             bin_number, density=density_choice, facecolor='g', alpha=alpha_level)
    plt.hist(temp_comments['neutral_emotion'], 
             bin_number, density=density_choice, facecolor='grey', alpha=alpha_level)
    plt.title('Histogram of emotion proportions in comment bodies\nfor '+project)
    plt.xlabel('Proportion of emotion words to total words')
    plt.ylabel(y_label_text)
    plt.grid(True)

    # save comment emotion histogram
    plt.savefig('../../figures/emotion_histograms/'+project+'-comment_body.png',
               dpi=fig_dpi)
    plt.close()
    
    # create overlapping histograms for emotion in tickets text
    plt.figure()
    plt.hist(temp_tickets['negative_emotion'], 
             bin_number, density=density_choice, facecolor='r', alpha=alpha_level)
    plt.hist(temp_tickets['positive_emotion'], 
             bin_number, density=density_choice, facecolor='g', alpha=alpha_level)
    plt.hist(temp_tickets['neutral_emotion'], 
             bin_number, density=density_choice, facecolor='grey', alpha=alpha_level)
    plt.title('Histogram of emotion proportions in issue bodies\nfor '+project)
    plt.xlabel('Proportion of emotion words to total words')
    plt.ylabel(y_label_text)
    plt.grid(True)

    # save issue emotion histogram
    plt.savefig('../../figures/emotion_histograms/'+project+'-issue_body.png',
               dpi=fig_dpi)
    plt.close()
    
# save bus factor file
# bus_factor_df.to_csv('../../data/processed_data/all-bus_factor.csv',
#                         index=False, header=True)
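
With `density=True`, `plt.hist` rescales each bar to count / (total count × bin width), so the bars of each histogram cover a total area of 1. The sketch below reproduces that normalization by hand on toy values. The helper name `histogram_density` is hypothetical, and it assumes a fixed range of 0 to 1, whereas matplotlib by default takes the range from the data's minimum and maximum.

```python
def histogram_density(values, bin_number, lower=0.0, upper=1.0):
    """Return per-bin densities: count / (n * bin_width), so areas sum to 1."""
    bin_width = (upper - lower) / bin_number
    counts = [0] * bin_number
    for value in values:
        # Values on the upper edge fall into the last bin.
        index = min(int((value - lower) / bin_width), bin_number - 1)
        counts[index] += 1
    n = len(values)
    return [count / (n * bin_width) for count in counts]


# Toy proportions, not project data.
densities = histogram_density([0.0, 0.1, 0.1, 0.6, 1.0], bin_number=2)
area = sum(d * 0.5 for d in densities)  # each of the 2 bins is 0.5 wide
print(densities, area)
```

Because each histogram is normalized separately, bar heights can exceed 1 when bins are narrow; only the area is constrained.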



### Model preparation

In [8]:
# get project names again
project_list = os.listdir('../../data/processed_data/dataset_scip')

In [9]:
# collect each project's frames
tickets_frames = []
comments_frames = []

# read in prepared files for all projects
for project in project_list:
    # build paths to the next project's prepared files
    tickets_filename = os.path.join(
        "../../data/processed_data/dataset_scip", project, "processed-tickets.csv")
    comments_filename = os.path.join(
        "../../data/processed_data/dataset_scip", project, "processed-comments.csv")
    
    tickets_frames.append(pd.read_csv(tickets_filename))
    comments_frames.append(pd.read_csv(comments_filename))

# join all projects into single frames
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
joined_tickets = pd.concat(tickets_frames, sort=False, ignore_index=True)
joined_comments = pd.concat(comments_frames, sort=False, ignore_index=True)

In [10]:
# identify each author's latest timestamp and highest ticket ID per project
# (each column's maximum is taken separately within the group)
group_keys = ['project', 'author_name']
last_columns = ['created_at', 'ticket_id']
most_recent_comment = joined_comments.groupby(group_keys)[last_columns].max()
most_recent_ticket = joined_tickets.groupby(group_keys)[last_columns].max()
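
Note that a grouped `.max()` takes the maximum of each column independently, so the latest `created_at` and the largest `ticket_id` for an author need not come from the same row. The sketch below shows this on toy data and contrasts it with an `idxmax`-based alternative that pulls both values from the author's most recent row. The toy frame is hypothetical, and the alternative is offered as an option to consider, not as the method used in this notebook.

```python
import pandas as pd

# Toy comments, not project data: alice's latest comment is on an older ticket.
toy_comments = pd.DataFrame({
    "project": ["numpy", "numpy", "numpy"],
    "author_name": ["alice", "alice", "bob"],
    "created_at": pd.to_datetime(["2019-01-01 10:00:00",
                                  "2019-03-01 09:00:00",
                                  "2019-02-01 12:00:00"]),
    "ticket_id": [20, 10, 30],
})

keys = ["project", "author_name"]

# Column-wise maxima: each column is maximized independently.
columnwise = toy_comments.groupby(keys)[["created_at", "ticket_id"]].max()

# Row-wise alternative: keep the row holding each author's latest timestamp.
latest_rows = toy_comments.groupby(keys)["created_at"].idxmax()
rowwise = (toy_comments.loc[latest_rows.to_numpy()]
                       .set_index(keys)[["created_at", "ticket_id"]])

print(columnwise)
print(rowwise)
```

For `alice`, the column-wise result pairs the March timestamp with ticket 20, while the row-wise result pairs it with ticket 10, the ticket she actually commented on last.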

In [11]:
# add the most recent timestamps to dataframe
joined_comments = (joined_comments.join(most_recent_comment, on=['project', 'author_name'], rsuffix='_last')
                               .rename(columns={"created_at_last": "ticket_author_last_comment_stamp",
                                                "ticket_id_last": "ticket_author_last_comment_ticket"})
                               .join(most_recent_ticket, on=['project', 'author_name'], rsuffix='_last')
                               .rename(columns={"created_at_last": "ticket_author_last_ticket_stamp",
                                                "ticket_id_last": "ticket_author_last_ticket_ticket"}))

# note: the column name 'ticket_author_last__stamp' (double underscore) is kept
# as written because the saved output files use it
joined_tickets = (joined_tickets.join(most_recent_comment, on=['project', 'author_name'], rsuffix='_last')
                               .rename(columns={"created_at_last": "ticket_author_last_comment_stamp",
                                                "ticket_id_last": "ticket_author_last_comment_ticket"})
                               .join(most_recent_ticket, on=['project', 'author_name'], rsuffix='_last')
                               .rename(columns={"created_at_last": "ticket_author_last__stamp",
                                                "ticket_id_last": "ticket_author_last_ticket"}))

In [12]:
# is this the first ticket that the ticket author submitted?
joined_tickets['first_ticket'] = ((joined_tickets['num_PR_created'] == 0) &
                                  (joined_tickets['num_issue_created'] == 0))

In [13]:
# is this ticket the last one that the issue author submitted?
joined_tickets['ticket_author_last_ticket'] = (
    joined_tickets['ticket_id'] == joined_tickets['ticket_author_last_ticket'])

In [14]:
# is this ticket the last thing that the author worked on?
joined_tickets['ticket_author_last_comment'] = (
    joined_tickets['ticket_id'] == joined_tickets['ticket_author_last_comment_ticket'])

In [15]:
# if they've never commented, make sure we note that the issue was their last activity
joined_tickets.loc[
    joined_tickets['ticket_author_last_comment_ticket'].isnull(), 'ticket_author_last_comment'] = True
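
The three flags built above can be traced on a small example. This is a sketch on hypothetical toy values; the column names follow the cells above, and it assumes that `num_PR_created` and `num_issue_created` count the author's activity before the ticket in question.

```python
import numpy as np
import pandas as pd

# Toy tickets, not project data. The author of ticket 103 has never commented.
toy_tickets = pd.DataFrame({
    "ticket_id": [101, 102, 103],
    "num_PR_created": [0, 3, 0],
    "num_issue_created": [0, 1, 2],
    "ticket_author_last_ticket": [101, 105, 103],
    "ticket_author_last_comment_ticket": [101.0, 110.0, np.nan],
})

# First ticket: no earlier PRs or issues recorded for the author.
toy_tickets["first_ticket"] = ((toy_tickets["num_PR_created"] == 0) &
                               (toy_tickets["num_issue_created"] == 0))

# Last ticket: this ticket's ID equals the author's last ticket ID.
toy_tickets["ticket_author_last_ticket"] = (
    toy_tickets["ticket_id"] == toy_tickets["ticket_author_last_ticket"])

# Last activity: the author's last comment was on this ticket...
toy_tickets["ticket_author_last_comment"] = (
    toy_tickets["ticket_id"] == toy_tickets["ticket_author_last_comment_ticket"])

# ...or the author never commented at all (comparison with NaN is False).
toy_tickets.loc[toy_tickets["ticket_author_last_comment_ticket"].isnull(),
                "ticket_author_last_comment"] = True

print(toy_tickets[["ticket_id", "first_ticket",
                   "ticket_author_last_ticket", "ticket_author_last_comment"]])
```

The final step matters because comparing a ticket ID with a missing value yields `False`, which would otherwise mark authors who never commented as having later activity.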

In [16]:
# save to file (tab-separated, matching the .tsv extension)
os.makedirs("../../data/analysis_data/dataset_scip", exist_ok=True)
joined_tickets.to_csv('../../data/analysis_data/dataset_scip/sentiment_frame_tickets.tsv',
                         index=False, header=True, sep="\t")
joined_comments.to_csv('../../data/analysis_data/dataset_scip/sentiment_frame_comments.tsv',
                         index=False, header=True, sep="\t")

In [17]:
# save one without the comment/ticket bodies for analysis in R
joined_tickets_for_r = joined_tickets.drop(columns=['body', 'title', 'labels'])
joined_tickets_for_r.to_csv('../../data/analysis_data/dataset_scip/sentiment_frame_tickets-for_r.tsv',
                         index=False, header=True, sep="\t")
joined_comments_for_r = joined_comments.drop(columns=['body'])
joined_comments_for_r.to_csv('../../data/analysis_data/dataset_scip/sentiment_frame_comments-for_r.tsv',
                         index=False, header=True, sep="\t")


In [18]:
joined_tickets_for_r.head(10)

Unnamed: 0,assignees,author_association,closed_at,comments,created_at,id,state,updated_at,scip_dataset,project,...,positive_emotion,compound_emotion,grateful_count,grateful_list,ticket_author_last_comment_stamp,ticket_author_last_comment_ticket,ticket_author_last__stamp,ticket_author_last_ticket,first_ticket,ticket_author_last_comment
0,,CONTRIBUTOR,2012-02-12 18:49:18,3,2012-01-29 03:12:55,3009564,closed,2019-08-29 22:47:04,,numpy,...,0.0,-0.34,0.0,[],2019-07-06 04:57:54,13917.0,2019-07-05 07:01:57,False,False,False
1,,NONE,2019-08-15 14:17:56,10,2012-03-07 08:06:48,3539198,closed,2019-08-15 14:17:56,,numpy,...,0.0,0.0,0.0,[],2012-05-21 21:52:01,230.0,2012-03-07 08:06:48,True,True,True
2,,CONTRIBUTOR,2012-05-20 23:47:12,2,2012-04-29 07:28:17,4339575,closed,2019-08-29 22:46:29,,numpy,...,0.0,0.0,0.0,[],2019-07-06 04:57:54,13917.0,2019-07-05 07:01:57,False,False,False
3,,NONE,2019-08-18 16:14:13,8,2012-10-19 15:08:21,7718518,closed,2019-08-18 16:14:13,,numpy,...,0.03,-0.2263,0.0,[],2012-10-23 02:47:05,2690.0,2012-10-19 22:36:00,False,False,False
4,,NONE,2019-12-04 11:59:37.855112,7,2012-10-19 15:08:41,7718542,open,2019-09-17 20:13:10,,numpy,...,0.0,-0.5267,0.0,[],2012-10-23 02:47:05,2690.0,2012-10-19 22:36:00,False,False,False
5,,NONE,2019-12-04 11:59:37.855112,17,2012-10-19 15:08:47,7718550,open,2019-09-25 16:53:15,,numpy,...,0.073,-0.3716,0.0,[],2012-10-23 02:47:05,2690.0,2012-10-19 22:36:00,False,False,False
6,,NONE,2019-12-04 11:59:37.855112,7,2012-10-19 15:09:11,7718576,open,2019-09-16 22:16:30,,numpy,...,0.0,0.0,0.0,[],2012-10-23 02:47:05,2690.0,2012-10-19 22:36:00,False,False,False
7,cournape,NONE,2019-08-18 16:15:33,5,2012-10-19 15:09:58,7718624,closed,2019-08-18 16:15:33,,numpy,...,0.031,-0.3612,0.0,[],2012-10-23 02:47:05,2690.0,2012-10-19 22:36:00,False,False,False
8,,NONE,2019-12-04 11:59:37.855112,8,2012-10-19 19:25:31,7726270,open,2019-06-02 14:18:37,,numpy,...,0.0,0.0,0.0,[],2012-10-23 02:47:05,2690.0,2012-10-19 22:36:00,False,False,False
9,,NONE,2019-12-04 11:59:37.855112,7,2012-10-19 19:28:58,7726427,open,2019-06-02 16:42:19,,numpy,...,0.085,0.34,0.0,[],2012-10-23 02:47:05,2690.0,2012-10-19 22:36:00,False,False,False


In [19]:
joined_comments_for_r.head(10)

Unnamed: 0,author_association,created_at,id,scip_dataset,ticket_id,author_name,project,was_updated,num_PR_created,num_issue_created,...,negative_emotion,neutral_emotion,positive_emotion,compound_emotion,grateful_count,grateful_list,ticket_author_last_comment_stamp,ticket_author_last_comment_ticket,ticket_author_last_ticket_stamp,ticket_author_last_ticket_ticket
0,NONE,2019-09-27 17:34:37,536031478,,14602,Christopher-Bradshaw,numpy,False,1,0,...,0.25,0.75,0.0,-0.4588,0.0,[],2019-09-27 17:34:37,14602,2019-09-27 01:44:43,14602.0
1,MEMBER,2019-09-27 16:26:21,536008554,,14609,seberg,numpy,False,203,62,...,0.06,0.791,0.149,0.446,0.0,[],2019-09-27 16:26:21,14609,2019-09-23 20:07:00,14585.0
2,MEMBER,2019-09-27 16:25:37,536008306,,14609,seberg,numpy,False,203,62,...,0.0,1.0,0.0,0.0,0.0,[],2019-09-27 16:26:21,14609,2019-09-23 20:07:00,14585.0
3,CONTRIBUTOR,2019-09-27 16:20:24,536006562,,2880,WarrenWeckesser,numpy,False,58,32,...,0.0,0.888,0.112,0.4404,1.0,['thanks'],2019-09-27 16:20:24,14601,2019-09-26 21:30:38,14600.0
4,MEMBER,2019-09-27 14:30:53,535964275,,14605,mattip,numpy,True,308,63,...,0.0,0.356,0.644,0.6633,0.0,[],2019-09-27 14:30:53,14608,2019-09-27 14:23:35,14608.0
5,MEMBER,2019-09-27 14:24:09,535961446,,14608,mattip,numpy,False,308,63,...,0.0,1.0,0.0,0.0,0.0,[],2019-09-27 14:30:53,14608,2019-09-27 14:23:35,14608.0
6,MEMBER,2019-09-27 14:23:33,535961193,,14605,seberg,numpy,False,203,62,...,0.064,0.866,0.069,0.0795,1.0,['thanks'],2019-09-27 16:26:21,14609,2019-09-23 20:07:00,14585.0
7,CONTRIBUTOR,2019-09-27 14:17:10,535958494,,14606,ewmoore,numpy,False,20,8,...,0.0,1.0,0.0,0.0,0.0,[],2019-09-27 14:17:10,14606,2019-08-30 22:07:10,14402.0
8,CONTRIBUTOR,2019-09-27 14:12:57,535956730,,14606,ewmoore,numpy,False,20,8,...,0.0,1.0,0.0,0.0,0.0,[],2019-09-27 14:17:10,14606,2019-08-30 22:07:10,14402.0
9,CONTRIBUTOR,2019-09-27 14:00:28,535951797,,14607,Kai-Striega,numpy,False,7,2,...,0.0,0.746,0.254,0.7783,0.0,[],2019-09-27 14:00:28,14607,2019-09-27 12:16:46,14607.0


In [20]:
# Output the number of rows and columns of each dataframe
# to check that R and Python report the same numbers
# of rows and columns.
print(joined_comments_for_r.shape)
print(joined_tickets_for_r.shape)

(524062, 29)
(90117, 36)
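
The same check can be repeated on a saved file without pandas, which gives an independent count to compare against both the shapes above and the frame loaded in R. This is a minimal sketch using the standard-library `csv` module; the helper name `tsv_shape` is hypothetical, and it assumes a tab-separated file with a single header row.

```python
import csv


def tsv_shape(path):
    """Return (n_rows, n_columns) for a tab-separated file with a header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        # Count parsed records, so quoted fields containing newlines count once.
        n_rows = sum(1 for _ in reader)
    return n_rows, len(header)
```

Calling `tsv_shape` on one of the `-for_r.tsv` files written above should return the same pair as the corresponding `.shape` if the file was written and parsed intact.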


## Data analysis

*Currently porting to R for speed. Will later move back to Python.*

***

## Future directions

Do comments, generally, get more friendly or more hostile over time?

Does the emotional valence of a contributor's first ticket predict whether they'll come back to make a second one?

Are requesters more or less polite?

Does friendliness bring people back?

Do the number and intensity of negative and positive comments on a first-time contributor's issue
affect whether they come back to make another ticket?

Do the trajectories of conversations (in each community) change over time?

***