# Sentiment analysis of open-source software communities

This Jupyter notebook includes the data preparation and analysis
for our project exploring open-source software communities.

To run this notebook, you will need the following files and directories:

* `../../data/processed_data/`: Directory of files produced by `./extract_features.py`
* `../bot_names.txt`: File of usernames identified as being bots
* `./utils/gratitude.txt`: List of words identified as gratitude-related
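
Before running the cells below, it can help to confirm that these inputs exist. The following is a minimal sketch, not part of the original pipeline: the helper name `find_missing_inputs` is hypothetical, and the paths are simply copied from the list above and resolved relative to the notebook's working directory.

```python
from pathlib import Path

# Paths copied from the list above, relative to the notebook's directory.
REQUIRED_PATHS = [
    "../../data/processed_data/",
    "../bot_names.txt",
    "./utils/gratitude.txt",
]


def find_missing_inputs(paths, base_dir="."):
    """Return the entries of `paths` that do not exist under `base_dir`."""
    base = Path(base_dir)
    return [p for p in paths if not (base / p).exists()]


missing = find_missing_inputs(REQUIRED_PATHS)
if missing:
    print("Missing inputs:", ", ".join(missing))
else:
    print("All required inputs found.")
```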

The most significant output of this notebook 
(`../../data/analysis_data/all-sentiment_frame-for_r.csv`) will be imported
into `./oss_community-language_dynamics.Rmd` for data analysis.

**Code last updated**: 30 May 2019

**Code written by**: A. Paxton (University of Connecticut) & N. Varoquaux
(University of California, Berkeley)

***

## TODO for Nelle

Extract bus_factor_df

***

## Table of contents

* [Preliminaries](#Preliminaries)
* [Data preparation](#Data-preparation)
* [Data analysis](#Data-analysis)
* [Future directions](#Future-directions)

***

## Preliminaries

### Load libraries and functions

In [None]:
import os, glob, string

In [None]:
import nltk
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [None]:
from utils import annotate, project_features

***

## Data preparation

### Initial file cleaning

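Cycle through all GitHub project files to clean the data and prepare the datasets needed for analysis.
The loop below reads each project's processed issue and comment files and saves per-project emotion histograms to `../../figures/emotion_histograms/`.
For a complete list of the downloaded variables and the new variables created, see the `metadata.md` file.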

In [None]:
# list all projects' raw data
project_list = os.listdir('../../data/processed_data/dataset_upto2019')

In [None]:
# load in the lists needed
bot_list = pd.read_csv('../bot_names.txt')['bot_name']
gratitude_list = set(pd.read_csv('./utils/gratitude.txt')['expressions_of_gratitude'])
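
The cells shown in this notebook load these two lists but do not use them further. As an illustration only, the sketch below shows one way such lists could be applied to a comment frame: dropping rows written by bots and flagging comments that contain a gratitude expression. The toy data, the `has_gratitude` column, and the `contains_gratitude` helper are all assumptions made for this example; they are not necessarily how `./extract_features.py` handles either list.

```python
import pandas as pd

# Toy stand-ins for the real inputs; all names and values here are illustrative.
bot_list = pd.Series(["ci-bot"], name="bot_name")
gratitude_list = {"thanks", "thank you"}

comments = pd.DataFrame({
    "author_name": ["alice", "ci-bot", "bob"],
    "body": ["Thanks for the quick fix!", "Build passed.", "This still fails."],
})

# Drop rows written by known bots.
human_comments = comments[~comments["author_name"].isin(bot_list)].copy()


def contains_gratitude(text, expressions=gratitude_list):
    """Case-insensitive substring match against the gratitude expressions."""
    lowered = text.lower()
    return any(expression in lowered for expression in expressions)


human_comments["has_gratitude"] = human_comments["body"].apply(contains_gratitude)
print(human_comments)
```

Substring matching is the simplest choice here; it will also match inside longer words, so a tokenized match may be preferable for real data.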

In [None]:
# create a container for the bus factor ratings
# bus_factor_df = pd.DataFrame()

# cycle through all raw data projects
for project in project_list:
    
    # automatically build paths for issues and comments
    issues_filename = os.path.join(
        "../../data/processed_data/dataset_upto2019", project, "processed-issues.csv")
    comments_filename = os.path.join(
        "../../data/processed_data/dataset_upto2019", project, "processed-comments.csv")
    
    # read in the next project's prepared files
    temp_issues = pd.read_csv(issues_filename, sep=",").sort_index()
    temp_comments = pd.read_csv(comments_filename, sep=",").sort_index()
    
    # use identical bin counts and plot settings for all histograms
    bin_number = 50    
    fig_dpi = 150
    y_label_text = 'Density'
    density_choice = True
    alpha_level = .5
    
    # create overlapping histograms for emotion in comment text
    plt.figure()
    plt.hist(temp_comments['negative_emotion'], 
             bin_number, density=density_choice, facecolor='r', alpha=alpha_level)
    plt.hist(temp_comments['positive_emotion'], 
             bin_number, density=density_choice, facecolor='g', alpha=alpha_level)
    plt.hist(temp_comments['neutral_emotion'], 
             bin_number, density=density_choice, facecolor='grey', alpha=alpha_level)
    plt.title('Histogram of emotion proportions in comment bodies\nfor '+project)
    plt.xlabel('Proportion of emotion words to total words')
    plt.ylabel(y_label_text)
    plt.grid(True)

    # save comment emotion histogram
    plt.savefig('../../figures/emotion_histograms/'+project+'-comment_body.png',
               dpi=fig_dpi)
    plt.close()
    
    # create overlapping histograms for emotion in issue text
    plt.figure()
    plt.hist(temp_issues['negative_emotion'], 
             bin_number, density=density_choice, facecolor='r', alpha=alpha_level)
    plt.hist(temp_issues['positive_emotion'], 
             bin_number, density=density_choice, facecolor='g', alpha=alpha_level)
    plt.hist(temp_issues['neutral_emotion'], 
             bin_number, density=density_choice, facecolor='grey', alpha=alpha_level)
    plt.title('Histogram of emotion proportions in issue bodies\nfor '+project)
    plt.xlabel('Proportion of emotion words to total words')
    plt.ylabel(y_label_text)
    plt.grid(True)

    # save issue emotion histogram
    plt.savefig('../../figures/emotion_histograms/'+project+'-issue_body.png',
               dpi=fig_dpi)
    plt.close()
    
# save bus factor file
# bus_factor_df.to_csv('../../data/processed_data/all-bus_factor.csv',
#                         index=False, header=True)
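
The comment and issue histograms above follow the same steps, so the plotting could be factored into one function. The sketch below is one possible refactoring under stated assumptions: `save_emotion_histogram` is a hypothetical helper, the toy frame stands in for a project's processed data, and the output goes to the system temporary directory rather than `../../figures/emotion_histograms/`.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def save_emotion_histogram(frame, title, out_path, bins=50, dpi=150, alpha=0.5):
    """Overlay negative/positive/neutral emotion histograms and save to out_path."""
    colors = {"negative_emotion": "r",
              "positive_emotion": "g",
              "neutral_emotion": "grey"}
    fig, ax = plt.subplots()
    for column, color in colors.items():
        # dropna() keeps missing values from reaching the histogram
        ax.hist(frame[column].dropna(), bins, density=True,
                facecolor=color, alpha=alpha)
    ax.set_title(title)
    ax.set_xlabel("Proportion of emotion words to total words")
    ax.set_ylabel("Density")
    ax.grid(True)
    fig.savefig(out_path, dpi=dpi)
    plt.close(fig)


# Toy data standing in for one project's processed comments.
toy = pd.DataFrame({
    "negative_emotion": [0.0, 0.1, 0.2],
    "positive_emotion": [0.1, 0.3, 0.2],
    "neutral_emotion": [0.9, 0.6, 0.6],
})
out_path = os.path.join(tempfile.gettempdir(), "toy-comment_body.png")
save_emotion_histogram(toy, "Histogram of emotion proportions\nfor toy data", out_path)
```

Inside the project loop, the same function could then be called once with `temp_comments` and once with `temp_issues`.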

### Model preparation

In [None]:
# get project names again
project_list = os.listdir('../../data/processed_data/dataset_upto2019')

In [None]:
# collect each project's frames, then concatenate once at the end
issues_frames = []
comments_frames = []

# read in frames for all projects
for project in project_list:
    # build paths to the next project's prepared files
    issues_filename = os.path.join(
        "../../data/processed_data/dataset_upto2019", project, "processed-issues.csv")
    comments_filename = os.path.join(
        "../../data/processed_data/dataset_upto2019", project, "processed-comments.csv")

    issues_frames.append(pd.read_csv(issues_filename))
    comments_frames.append(pd.read_csv(comments_filename))

# DataFrame.append was removed in pandas 2.0, so join the frames with pd.concat
joined_issues = pd.concat(issues_frames, ignore_index=True)
joined_comments = pd.concat(comments_frames, ignore_index=True)


In [None]:
# identify each author's most recent timestamp and highest ticket ID in each project
# (select the columns first so that max() only runs on the ones we need)
most_recent_comment = (joined_comments.groupby(['project', 'author_name'])
                                      [['created_at', 'ticket_id']].max())
most_recent_issue = (joined_issues.groupby(['project', 'author_name'])
                                  [['created_at', 'ticket_id']].max())
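
One detail worth checking in the cell above: `max()` is computed per column, so the latest `created_at` and the highest `ticket_id` for an author are taken independently and need not come from the same row. The toy example below illustrates the difference and shows a row-wise alternative. It assumes that `created_at` holds ISO 8601 strings (which sort chronologically); the data and variable names are invented for illustration.

```python
import pandas as pd

toy_comments = pd.DataFrame({
    "project": ["proj", "proj", "proj"],
    "author_name": ["alice", "alice", "alice"],
    "created_at": ["2019-01-01", "2019-03-01", "2019-02-01"],
    "ticket_id": [30, 10, 20],
})

keys = ["project", "author_name"]

# Column-wise max: latest timestamp and highest ticket ID, taken independently.
columnwise = toy_comments.groupby(keys)[["created_at", "ticket_id"]].max()

# Row-wise alternative: sort by time and keep each author's final row.
latest_row = (toy_comments.sort_values("created_at")
                          .groupby(keys)[["created_at", "ticket_id"]]
                          .last())

print(columnwise)
print(latest_row)
```

If ticket IDs always increase over time within a project, the two approaches agree; whether that holds for this dataset is not something the notebook states.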

In [None]:
# add the most recent timestamps to dataframe
joined_comments = (joined_comments.join(most_recent_comment, on=['project', 'author_name'], rsuffix='_last')
                               .rename(columns={"created_at_last": "issue_author_last_comment_stamp",
                                                "ticket_id_last": "issue_author_last_comment_ticket"})
                               .join(most_recent_issue, on=['project', 'author_name'], rsuffix='_last')
                               .rename(columns={"created_at_last": "issue_author_last_issue_stamp",
                                                "ticket_id_last": "issue_author_last_issue_ticket"}))

joined_issues = (joined_issues.join(most_recent_comment, on=['project', 'author_name'], rsuffix='_last')
                               .rename(columns={"created_at_last": "issue_author_last_comment_stamp",
                                                "ticket_id_last": "issue_author_last_comment_ticket"})
                               .join(most_recent_issue, on=['project', 'author_name'], rsuffix='_last')
                               .rename(columns={"created_at_last": "issue_author_last_issue_stamp",
                                                "ticket_id_last": "issue_author_last_issue_ticket"}))

In [None]:
# is this the first ticket that the ticket author submitted?
joined_issues['first_ticket'] = ((joined_issues['num_PR_created']==0) &
                                 (joined_issues['num_issue_created']==0))

In [None]:
# is this issue the last one that the issue author submitted?
joined_issues['issue_author_last_issue'] = (
    joined_issues['ticket_id'] == joined_issues['issue_author_last_issue_ticket'])

In [None]:
# is this issue the last thing that the author worked on?
joined_issues['issue_author_last_comment'] = (
    joined_issues['ticket_id'] == joined_issues['issue_author_last_comment_ticket'])

In [None]:
# if they've never commented, make sure we note that the issue was their last activity
joined_issues.loc[
    joined_issues['issue_author_last_comment_ticket'].isnull(), 'issue_author_last_comment'] = True
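
To make the last-activity flags concrete, here is a small worked example on invented data. It mirrors the comparisons above, using the same column names, and shows why authors who never commented need the explicit correction: a comparison against `NaN` evaluates to `False`.

```python
import numpy as np
import pandas as pd

toy_issues = pd.DataFrame({
    "author_name": ["alice", "alice", "bob"],
    "ticket_id": [1, 2, 3],
    # highest ticket each author opened / commented on (NaN: never commented)
    "issue_author_last_issue_ticket": [2, 2, 3],
    "issue_author_last_comment_ticket": [2.0, 2.0, np.nan],
})

toy_issues["issue_author_last_issue"] = (
    toy_issues["ticket_id"] == toy_issues["issue_author_last_issue_ticket"])
toy_issues["issue_author_last_comment"] = (
    toy_issues["ticket_id"] == toy_issues["issue_author_last_comment_ticket"])

# Comparisons against NaN are False, so authors with no comments need an explicit fix.
never_commented = toy_issues["issue_author_last_comment_ticket"].isnull()
toy_issues.loc[never_commented, "issue_author_last_comment"] = True

print(toy_issues[["author_name", "ticket_id",
                  "issue_author_last_issue", "issue_author_last_comment"]])
```

Note that the correction marks every issue by a never-commenting author, not only their final one. If only the final issue should be flagged, the mask could be combined with `issue_author_last_issue`.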

In [None]:
# save to file
os.makedirs("../../data/analysis_data/", exist_ok=True)
joined_issues.to_csv('../../data/analysis_data/sentiment_frame_tickets.csv',
                         index=False, header=True)
joined_comments.to_csv('../../data/analysis_data/sentiment_frame_comments.csv',
                         index=False, header=True)

In [None]:
# save one without the comment/ticket bodies for analysis in R
joined_issues_for_r = joined_issues.drop(columns=['body', 'title', 'labels'])
joined_issues_for_r.to_csv('../../data/analysis_data/sentiment_frame_issues-for_r.csv',
                         index=False, header=True)
joined_comments_for_r = joined_comments.drop(columns=['body'])
joined_comments_for_r.to_csv('../../data/analysis_data/sentiment_frame_comments-for_r.csv',
                         index=False, header=True)


In [None]:
joined_issues_for_r.head(10)

In [None]:
joined_comments_for_r.head(10)

***

## Data analysis

*Currently porting to R for speed. Will later move back to Python.*

***

## Future directions

Do comments, generally, get more friendly or more hostile over time?

Does the emotional valence of a contributor's first ticket predict whether they'll come back to make a second one?

Are requesters more or less polite?

Does friendliness bring people back?

Do the number and intensity of negative and positive comments on a first-time contributor's issue
affect whether they come back to make another ticket?

Do the trajectories of conversations (in each community) change over time?

***