# Sentiment analysis of open-source software communities

This Jupyter notebook includes the data preparation and analysis
for our project exploring open-source software communities.

**Code last updated**: 27 October 2018

***

## Table of contents

* [Preliminaries](#Preliminaries)
* [Data preparation](#Data-preparation)

***

## Preliminaries

### Load libraries and functions

In [None]:
import os
import nltk
import pandas as pd
from utils import annotate

### Read in data

**Original columns in `comments.tsv`**
* `author_association`: Comment author's role in the project
    * `NONE`: No association with the project
    * `FIRST_TIMER`: Has not previously committed to GitHub
    * `FIRST_TIME_CONTRIBUTOR`: First time contributing to this repository
    * `COLLABORATOR`: Invited to collaborate on repository
    * `MEMBER`: Member of the organization that owns the repository
    * `CONTRIBUTOR`: Has previously committed to repository
    * `OWNER`: Owner of repository
* `body`: Comment content
* `created_at`: Time of comment creation
* `id`: Unique identifier of comment
* `node_id`: Unique identifier of comment for GraphQL
* `updated_at`: Time of comment update
* `ticket_id`: Sequential identifier of ticket (issue or PR) in repository
* `author_name`: Commenter's GitHub username
* `author_id`: Commenter's unique identifier

In [None]:
comments_df = pd.read_csv('../../data/mayavi/comments.tsv',
                          sep='\t', index_col=0).sort_index()

In [None]:
comments_df.head(5)

**Original columns in `issues.tsv`**
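* `assignees`: Users assigned to the ticket
* `author_association`: Ticket creator's role in the project (same values as in `comments.tsv`)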
* `body`: Content of ticket
* `closed_at`: Date and time when ticket was closed
* `comments`: Number of comments made on ticket
* `created_at`: Date and time of ticket creation
* `id`: Unique identifier for ticket
* `labels`: Labels applied to the ticket
* `locked`: Whether the ticket's conversation is locked
* `node_id`: Unique identifier of ticket for GraphQL
* `project`: Name of repository
* `organization`: Name of organization that owns the repository
* `author_name`: Ticket creator's GitHub username
* `author_id`: Ticket creator's unique identifier
* `ticket_id`: Sequential identifier of ticket (issue or PR) in repository
* `type`: Type of ticket (`issue` or `pull_request`)

In [None]:
issues_df = pd.read_csv('../../data/mayavi/issues.tsv',
                        sep='\t', index_col=0).sort_index()

In [None]:
issues_df.head(5)

***

## Data preparation

### Annotate the files with new columns

**Columns added to `comments_df` (returned as `comment_df`)**
* `num_PR_created`: Number of pull requests created by the commenter before this comment
* `num_issue_created`: Number of issues created by the commenter before this comment
* `was_updated`: Whether the comment body was updated after posting
* `comment_order`: The index of the comment within the ticket

**Columns added to `issues_df`**
* `num_PR_created`: Number of pull requests created by the ticket creator before this ticket
* `num_issue_created`: Number of issues created by the ticket creator before this ticket
* `was_updated`: Whether the ticket body was updated after posting
* `is_closed`: Whether the ticket has been closed
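The sketch below illustrates how columns like these could be derived with pandas. It is not the code in `annotate.annotate_comments_tickets`, whose implementation is not shown here. The toy values are made up. It assumes that "updated" means `updated_at` is later than `created_at` and that `comment_order` is 0-based, and it covers only the per-author counts for tickets.

```python
import pandas as pd

# Toy comments; column names follow `comments.tsv`, values are invented.
comments = pd.DataFrame({
    "ticket_id": [1, 1, 2, 1],
    "created_at": pd.to_datetime(["2018-01-01 10:00", "2018-01-01 11:00",
                                  "2018-01-02 09:00", "2018-01-03 08:00"]),
    "updated_at": pd.to_datetime(["2018-01-01 10:00", "2018-01-01 12:30",
                                  "2018-01-02 09:00", "2018-01-03 08:00"]),
})

# Assumption: a comment was updated if its update time is after its creation time.
comments["was_updated"] = comments["updated_at"] > comments["created_at"]

# Chronological position of each comment within its ticket (0-based here).
comments = comments.sort_values(["ticket_id", "created_at"])
comments["comment_order"] = comments.groupby("ticket_id").cumcount()

# Toy tickets; column names follow `issues.tsv`, values are invented.
issues = pd.DataFrame({
    "author_name": ["ann", "bob", "ann", "ann"],
    "type": ["issue", "pull_request", "issue", "pull_request"],
    "created_at": pd.to_datetime(["2018-01-01", "2018-01-02",
                                  "2018-01-03", "2018-01-04"]),
    "closed_at": pd.to_datetime([None, "2018-01-05", None, "2018-01-07"]),
})

# A ticket is closed if it has a closing timestamp.
issues["is_closed"] = issues["closed_at"].notna()

# Running per-author counts, minus the current row, give "created before this ticket".
issues = issues.sort_values("created_at")
is_issue = (issues["type"] == "issue").astype(int)
is_pr = (issues["type"] == "pull_request").astype(int)
issues["num_issue_created"] = is_issue.groupby(issues["author_name"]).cumsum() - is_issue
issues["num_PR_created"] = is_pr.groupby(issues["author_name"]).cumsum() - is_pr
```

Computing the same counts for `comments_df` would also require comparing each comment's timestamp with the creation times of its author's tickets, which this sketch leaves out.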

In [None]:
comment_df, issues_df = annotate.annotate_comments_tickets(comments_df, issues_df)

### Clean up dataframes

In both dataframes, remove unnecessary columns.

In [None]:
comment_df = comment_df.drop(columns=['node_id', 'created_at', 'updated_at', 'author_id'])

In [None]:
issues_df = issues_df.drop(columns=['node_id', 'organization', 'author_id', 'locked'])

In [None]:
comment_df.head(5)

### Clean up body

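**NB**: Quoted text (lines beginning with `>`) still needs to be removed from `body`. One option is to delete every match of the regular expression `^>.*$`, applied in multiline mode.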
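A minimal sketch of this quote-removal step, assuming the bodies are Markdown strings (possibly with missing values). The helper name `strip_quotes` and the sample bodies are hypothetical. The pattern also consumes the trailing newline so that no blank lines are left behind.

```python
import re
import pandas as pd

# Whole lines that start with '>' (Markdown block quotes), plus their line break.
QUOTE_LINE = re.compile(r"^>.*\n?", flags=re.MULTILINE)

def strip_quotes(body):
    """Return `body` without quoted lines; non-strings (e.g. NaN) pass through."""
    if not isinstance(body, str):
        return body
    return QUOTE_LINE.sub("", body).strip()

# Invented example bodies, including a missing value.
bodies = pd.Series(["> old text\nI agree.", "No quote here.", None])
cleaned = bodies.map(strip_quotes)
```

This approach also drops lines starting with `>` inside code blocks (for example, shell prompts), so those cases may need separate handling.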

**NB**: Perhaps add a column listing all users referenced (`@`-mentioned) in each body?
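One way such a column could be built is sketched below, assuming that users are referenced with GitHub-style `@username` mentions. The regular expression, the helper `find_mentions`, and the column name `mentioned_users` are illustrative assumptions, not part of the project code.

```python
import re
import pandas as pd

# Assumed pattern: '@' followed by alphanumerics with single hyphens allowed inside.
# The lookbehind skips e-mail addresses such as name@example.com.
MENTION = re.compile(r"(?<![\w@])@([A-Za-z0-9](?:-?[A-Za-z0-9])*)")

def find_mentions(body):
    """Return the usernames mentioned in `body`, in order of appearance."""
    if not isinstance(body, str):
        return []
    return MENTION.findall(body)

# Invented example bodies.
df = pd.DataFrame({"body": ["Thanks @alice and @bob-smith!",
                            "Mail me at x@example.com",
                            None]})
df["mentioned_users"] = df["body"].map(find_mentions)
```

Team mentions of the form `@org/team` would be recorded as `org` only, and mentions inside code blocks are not excluded.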

***