# Sentiment analysis of open-source software communities

This Jupyter notebook includes the data preparation and analysis
for our project exploring open-source software communities.

**Code last updated**: 27 October 2018

***

## Table of contents

* [Preliminaries](#Preliminaries)
* [Data preparation](#Data-preparation)

***

## Preliminaries

### Load libraries and functions

In [1]:
import os
import nltk
import pandas as pd
from utils import annotate

### Read in data

**Original columns in `comments.tsv`**
* `author_association`: Comment author's role in the project
    * `NONE`: No association with the project
    * `FIRST_TIMER`: Has not previously committed to GitHub
    * `FIRST_TIME_CONTRIBUTOR`: First time contributing to this repository
    * `COLLABORATOR`: Invited to collaborate on repository
    * `MEMBER`: Member of the organization that owns the repository
    * `CONTRIBUTOR`: Has previously committed to repository
    * `OWNER`: Owner of repository
* `body`: Comment content
* `created_at`: Time of comment creation
* `id`: Unique identifier of comment
* `node_id`: Unique identifier of comment in the GraphQL API
* `updated_at`: Time of comment update
* `ticket_id`: Sequential identifier of ticket (issue or PR) in repository
* `author_name`: Commenter's GitHub username
* `author_id`: Commenter's unique identifier
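
The column list above can be turned into a quick sanity check when loading the file. The sketch below is illustrative only: it uses a small in-memory stand-in rather than the real `comments.tsv` (the first comment body is a placeholder), parses the two timestamp columns as datetimes, and checks that every `author_association` value is one of the roles listed above. The names `sample_tsv`, `sample_df`, `known_roles`, and `unexpected` are assumptions introduced for this example.

```python
import io

import pandas as pd

# Tiny stand-in for comments.tsv; only a subset of columns, placeholder body in row 0.
sample_tsv = (
    "\tauthor_association\tbody\tcreated_at\tupdated_at\tticket_id\tauthor_name\n"
    "0\tCOLLABORATOR\tPlaceholder body\t2011-04-25 15:45:36\t2011-04-25 15:45:36\t5\tGaelVaroquaux\n"
    "1\tCONTRIBUTOR\tYes of course.\t2011-04-25 15:46:44\t2011-04-25 15:46:44\t5\tSnegovikufa\n"
)

# Same read_csv options as the notebook, plus datetime parsing for the timestamps.
sample_df = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    index_col=0,
    parse_dates=["created_at", "updated_at"],
)

# Roles documented in the column list above.
known_roles = {
    "NONE", "FIRST_TIMER", "FIRST_TIME_CONTRIBUTOR",
    "COLLABORATOR", "MEMBER", "CONTRIBUTOR", "OWNER",
}

# Any value outside the documented set would show up here.
unexpected = set(sample_df["author_association"]) - known_roles
print(unexpected)
```

Parsing the timestamps at load time is optional, but it makes later time-based comparisons less error-prone than comparing strings.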

In [2]:
comments_df = pd.read_csv('../../data/mayavi/comments.tsv',
                          sep='\t', index_col=0).sort_index()

In [3]:
comments_df.head(5)

Unnamed: 0,author_association,body,created_at,id,node_id,updated_at,ticket_id,author_name,author_id
0,COLLABORATOR,Very nice. Thanks a lot. Could you integrate y...,2011-04-25 15:45:36,1053358,MDEyOklzc3VlQ29tbWVudDEwNTMzNTg=,2011-04-25 15:45:36,5,GaelVaroquaux,208217
1,CONTRIBUTOR,Yes of course.,2011-04-25 15:46:44,1053363,MDEyOklzc3VlQ29tbWVudDEwNTMzNjM=,2011-04-25 15:46:44,5,Snegovikufa,413925
2,NONE,It would also be nice to report this bug upstr...,2011-04-25 15:53:18,1053385,MDEyOklzc3VlQ29tbWVudDEwNTMzODU=,2011-04-25 15:53:18,5,epatters,316610
3,CONTRIBUTOR,I'm not sure: is this merge request correct no...,2011-04-25 16:07:27,1053437,MDEyOklzc3VlQ29tbWVudDEwNTM0Mzc=,2011-04-25 16:07:27,5,Snegovikufa,413925
4,COLLABORATOR,@epatters: +1 @Snegovikufa: is QT_API a stand...,2011-04-25 16:07:59,1053442,MDEyOklzc3VlQ29tbWVudDEwNTM0NDI=,2011-04-25 16:07:59,5,GaelVaroquaux,208217


**Original columns in `issues.tsv`**
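* `assignees`: Users assigned to the ticket
* `author_association`: Ticket creator's role in the project (same values as in `comments.tsv`)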
* `body`: Content of ticket
* `closed_at`: Date and time when ticket was closed
* `comments`: Number of comments made on ticket
* `created_at`: Date and time of ticket creation
* `id`: Unique identifier for ticket
* `labels`: Labels applied to the ticket
* `locked`: Whether the ticket's conversation is locked
* `node_id`: Unique identifier of ticket in the GraphQL API
* `project`: Name of repository
* `organization`: Name of organization that owns the repository
* `author_name`: Ticket creator's GitHub username
* `author_id`: Ticket creator's unique identifier
* `ticket_id`: Sequential identifier of ticket (issue or PR) in repository
* `type`: Type of ticket (`issue` or `pull_request`)
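
Because issues and pull requests share one table, the `type` column is what separates them. As a minimal sketch, assuming a toy frame that holds only the `ticket_id` and `type` values visible in the preview of `issues_df` in this notebook, the tickets can be counted and filtered by type like this (`tickets`, `type_counts`, and `pull_requests` are names made up for the example):

```python
import pandas as pd

# Toy frame with the ticket_id/type pairs from the first five previewed rows.
tickets = pd.DataFrame({
    "ticket_id": [725, 724, 723, 722, 721],
    "type": ["issue", "pull_request", "pull_request", "pull_request", "issue"],
})

# Number of tickets of each type.
type_counts = tickets["type"].value_counts()

# Boolean mask keeps only the pull requests.
pull_requests = tickets[tickets["type"] == "pull_request"]

print(type_counts)
print(pull_requests)
```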

In [4]:
issues_df = pd.read_csv('../../data/mayavi/issues.tsv',
                        sep='\t', index_col=0).sort_index()

In [5]:
issues_df.head(5)

Unnamed: 0,assignees,author_association,body,closed_at,comments,created_at,id,labels,locked,node_id,...,title,updated_at,project,organization,author_name,author_id,ticket_id,type,num_PR_created,num_issue_created
0,,NONE,,2018-10-11 07:08:17,0,2018-10-11 07:08:10,368981124,,False,MDU6SXNzdWUzNjg5ODExMjQ=,...,python3,2018-10-11 07:08:17,mayavi,enthought,icevoicey,30742101,725,issue,0,0
1,,MEMBER,If OSMesa is available and user requests an of...,2018-10-11 04:00:31,1,2018-10-11 03:49:59,368941830,,False,MDExOlB1bGxSZXF1ZXN0MjIxOTk4MzU5,...,Try and fix #477.,2018-10-11 04:00:34,mayavi,enthought,prabhuramachandran,272585,724,pull_request,93,7
2,,MEMBER,Creating a renderwindow in some configurations...,2018-10-09 19:36:26,1,2018-10-09 18:24:08,368337412,,False,MDExOlB1bGxSZXF1ZXN0MjIxNTM4Nzgw,...,Improve offscreen window creation.,2018-10-09 19:36:29,mayavi,enthought,prabhuramachandran,272585,723,pull_request,92,7
3,,NONE,This bug manifests when the SurfaceSource obje...,,2,2018-10-09 15:08:38,368259788,,False,MDExOlB1bGxSZXF1ZXN0MjIxNDc3Nzk0,...,Fix bug related to SurfaceSource.scalars,2018-10-09 15:48:02,mayavi,enthought,rahulporuri,1926457,722,pull_request,0,1
4,,NONE,"Hi, I am new to Mayavi. I have just installed ...",,1,2018-10-09 11:39:39,368167895,,False,MDU6SXNzdWUzNjgxNjc4OTU=,...,from mayavi import mlab not working,2018-10-10 21:33:01,mayavi,enthought,Love-Chrissie,31875095,721,issue,0,0


***

## Data preparation

**Cleanup for `comments.tsv`**
* Remove `node_id`
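
One way to carry out this cleanup step is `DataFrame.drop`. The sketch below runs on a toy frame (`toy_comments` and `cleaned` are hypothetical names) and is not necessarily how the project performs the removal.

```python
import pandas as pd

# Toy stand-in for comments_df with just enough columns to show the step.
toy_comments = pd.DataFrame({
    "id": [1053358, 1053363],
    "node_id": [
        "MDEyOklzc3VlQ29tbWVudDEwNTMzNTg=",
        "MDEyOklzc3VlQ29tbWVudDEwNTMzNjM=",
    ],
    "ticket_id": [5, 5],
})

# drop returns a new frame; errors="ignore" makes the cell safe to re-run
# even if the column has already been removed.
cleaned = toy_comments.drop(columns=["node_id"], errors="ignore")

print(cleaned.columns.tolist())
```

Returning a new frame rather than dropping in place keeps the original available for comparison.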

### Annotate the files with new columns

We annotate both `comments_df` and `issues_df` with the new columns listed below.

**Columns added to `comments_df`**
* `num_PR_created`: Number of pull requests created by the commenter before this comment
* `num_issue_created`: Number of issues created by the commenter before this comment
* `was_updated`: Whether the comment body was updated after posting
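
A plausible way to derive `was_updated` is to compare the two timestamps. This is an assumption about what `annotate` does, not a description of its code, and the frame `toy` below is hypothetical, with timestamps borrowed from the previews in this notebook.

```python
import pandas as pd

# Two rows: one never touched after creation, one updated seven seconds later.
toy = pd.DataFrame({
    "created_at": pd.to_datetime(["2011-04-25 15:45:36", "2018-10-11 07:08:10"]),
    "updated_at": pd.to_datetime(["2011-04-25 15:45:36", "2018-10-11 07:08:17"]),
})

# Assumed rule: an entry counts as updated when updated_at is later than created_at.
toy["was_updated"] = toy["updated_at"] > toy["created_at"]

print(toy)
```

A timestamp comparison flags any change recorded in `updated_at`, so on its own it cannot tell a body edit apart from other kinds of updates.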

**Columns added to `issues_df`**
* `num_PR_created`: Number of pull requests created by the ticket creator before this ticket
* `num_issue_created`: Number of issues created by the ticket creator before this ticket
* `was_updated`: Whether the ticket body was updated after posting
* `num_comments`: Total number of comments on ticket (duplicate of `comments`)
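
The "created before this ticket" counts can be sketched as a per-author running total that excludes the current row. The example below is a minimal illustration under stated assumptions: the authors `alice` and `bob`, the dates, and the frame `toy_tickets` are invented, and the real `annotate.annotate_comments_tickets` may compute these columns differently.

```python
import pandas as pd

# Hypothetical tickets, sorted chronologically so that "before" is well defined.
toy_tickets = pd.DataFrame({
    "author_name": ["alice", "bob", "alice", "alice", "bob"],
    "type": ["issue", "pull_request", "pull_request", "issue", "issue"],
    "created_at": pd.to_datetime([
        "2018-01-01", "2018-01-02", "2018-01-03", "2018-01-04", "2018-01-05",
    ]),
}).sort_values("created_at")

# 0/1 indicators for each ticket type.
is_pr = (toy_tickets["type"] == "pull_request").astype(int)
is_issue = (toy_tickets["type"] == "issue").astype(int)

# Running total per author, minus the current row, counts strictly earlier tickets.
authors = toy_tickets["author_name"]
toy_tickets["num_PR_created"] = is_pr.groupby(authors).cumsum() - is_pr
toy_tickets["num_issue_created"] = is_issue.groupby(authors).cumsum() - is_issue

print(toy_tickets)
```

The same idea would extend to comments by counting, for each comment, the author's tickets with an earlier `created_at`; that requires joining the two tables on author and time and is not shown here.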

In [6]:
comment_df, issues_df = annotate.annotate_comments_tickets(comments_df, issues_df)

In [7]:
comment_df.head(5)

Unnamed: 0,author_association,body,created_at,id,node_id,updated_at,ticket_id,author_name,author_id,was_updated,num_PR_created,num_issue_created
0,COLLABORATOR,Very nice. Thanks a lot. Could you integrate y...,2011-04-25 15:45:36,1053358,MDEyOklzc3VlQ29tbWVudDEwNTMzNTg=,2011-04-25 15:45:36,5,GaelVaroquaux,208217,False,0,3
1,CONTRIBUTOR,Yes of course.,2011-04-25 15:46:44,1053363,MDEyOklzc3VlQ29tbWVudDEwNTMzNjM=,2011-04-25 15:46:44,5,Snegovikufa,413925,False,1,0
2,NONE,It would also be nice to report this bug upstr...,2011-04-25 15:53:18,1053385,MDEyOklzc3VlQ29tbWVudDEwNTMzODU=,2011-04-25 15:53:18,5,epatters,316610,False,0,0
3,CONTRIBUTOR,I'm not sure: is this merge request correct no...,2011-04-25 16:07:27,1053437,MDEyOklzc3VlQ29tbWVudDEwNTM0Mzc=,2011-04-25 16:07:27,5,Snegovikufa,413925,False,1,0
4,COLLABORATOR,@epatters: +1 @Snegovikufa: is QT_API a stand...,2011-04-25 16:07:59,1053442,MDEyOklzc3VlQ29tbWVudDEwNTM0NDI=,2011-04-25 16:07:59,5,GaelVaroquaux,208217,False,0,3


In [8]:
issues_df.head(5)

Unnamed: 0,assignees,author_association,body,closed_at,comments,created_at,id,labels,locked,node_id,...,project,organization,author_name,author_id,ticket_id,type,num_PR_created,num_issue_created,was_updated,num_comments
0,,NONE,,2018-10-11 07:08:17,0,2018-10-11 07:08:10,368981124,,False,MDU6SXNzdWUzNjg5ODExMjQ=,...,mayavi,enthought,icevoicey,30742101,725,issue,0,0,True,0
1,,MEMBER,If OSMesa is available and user requests an of...,2018-10-11 04:00:31,1,2018-10-11 03:49:59,368941830,,False,MDExOlB1bGxSZXF1ZXN0MjIxOTk4MzU5,...,mayavi,enthought,prabhuramachandran,272585,724,pull_request,93,7,True,1
2,,MEMBER,Creating a renderwindow in some configurations...,2018-10-09 19:36:26,1,2018-10-09 18:24:08,368337412,,False,MDExOlB1bGxSZXF1ZXN0MjIxNTM4Nzgw,...,mayavi,enthought,prabhuramachandran,272585,723,pull_request,92,7,True,1
3,,NONE,This bug manifests when the SurfaceSource obje...,,2,2018-10-09 15:08:38,368259788,,False,MDExOlB1bGxSZXF1ZXN0MjIxNDc3Nzk0,...,mayavi,enthought,rahulporuri,1926457,722,pull_request,0,1,True,2
4,,NONE,"Hi, I am new to Mayavi. I have just installed ...",,1,2018-10-09 11:39:39,368167895,,False,MDU6SXNzdWUzNjgxNjc4OTU=,...,mayavi,enthought,Love-Chrissie,31875095,721,issue,0,0,True,1
