# Sentiment analysis of open-source software communities

This Jupyter notebook includes the data preparation and analysis
for our project exploring open-source software communities.

**Code last updated**: 6 November 2018

***

## Table of contents

* [Preliminaries](#Preliminaries)
* [Data preparation](#Data-preparation)

***

## Preliminaries

### Load libraries and functions

In [None]:
import glob
import os
import nltk

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string

In [None]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [None]:
from utils import annotate

***

## Data preparation

Cycle through all GitHub project files to clean the data and prepare the datasets needed for analysis.
For a complete list of the downloaded variables and the new variables created, see the `metadata.md` file.

In [None]:
# list all projects
project_list = os.listdir('../../data/raw_data')
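
`os.listdir` returns every entry in the folder, including stray files such as `.DS_Store`, which would make the loop below fail when it looks for `comments.tsv`. A minimal sketch of a stricter listing, assuming each project lives in its own subdirectory of the raw data folder; `list_project_dirs` is a hypothetical helper, not part of `utils`:

```python
import os
import tempfile


def list_project_dirs(raw_data_dir):
    """Return sorted names of the non-hidden subdirectories of raw_data_dir."""
    return sorted(
        entry for entry in os.listdir(raw_data_dir)
        if not entry.startswith('.')
        and os.path.isdir(os.path.join(raw_data_dir, entry))
    )


# demonstration on a throwaway folder (the entries are illustrative only)
with tempfile.TemporaryDirectory() as demo_dir:
    os.mkdir(os.path.join(demo_dir, 'mayavi'))
    os.mkdir(os.path.join(demo_dir, 'another_project'))
    open(os.path.join(demo_dir, '.DS_Store'), 'w').close()
    demo_projects = list_project_dirs(demo_dir)
```

In the notebook this could replace the call above as `project_list = list_project_dirs('../../data/raw_data')`.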

In [None]:
# load in the list of bots
bot_list = pd.read_csv('../bot_names.txt')['bot_name']

# cycle through all projects
for project in project_list:
    
    # read in the next comments and issues files
    temp_comments = pd.read_csv('../../data/raw_data/'+project+'/comments.tsv',
                                sep='\t', index_col=0).sort_index()
    temp_issues = pd.read_csv('../../data/raw_data/'+project+'/issues.tsv',
                              sep='\t', index_col=0).sort_index()
    
    # append the current project to each
    temp_comments['project'] = project
    temp_issues['project'] = project
    
    # annotate each file
    temp_comments, temp_issues = annotate.annotate_comments_tickets(temp_comments,
                                                                    temp_issues)
    
    # drop columns we don't need
    temp_comments = temp_comments.drop(columns=['node_id','created_at',
                                                'updated_at','author_id'])
    temp_issues = temp_issues.drop(columns=['node_id','organization',
                                          'author_id','locked'])
    
    # clean up the text body
    temp_comments = annotate.body_cleanup(temp_comments, bot_list)
    temp_issues = annotate.body_cleanup(temp_issues, bot_list)
    
    # run sentiment analysis
    temp_comments = annotate.add_sentiment(temp_comments)
    temp_issues = annotate.add_sentiment(temp_issues)
    
    # save cleaned data to intermediary folders
    temp_comments.to_csv('../../data/processed_data/'+project+'-processed-comments.csv',
                         index=False, header=True)
    temp_issues.to_csv('../../data/processed_data/'+project+'-processed-issues.csv',
                         index=False, header=True)
    
    # create overlapping histograms
    bin_number = 10
    plt.hist(temp_comments['negative_emotion'], 
             bin_number, facecolor='r', alpha=0.5)
    plt.hist(temp_comments['positive_emotion'], 
             bin_number, facecolor='g', alpha=0.5)
    plt.hist(temp_comments['neutral_emotion'], 
             bin_number, facecolor='grey', alpha=0.75)

    # create labels
    plt.title('Histogram of emotion proportions in comment bodies\nfor '+project)
    plt.xlabel('Proportion of emotion words to total words')
    plt.ylabel('Counts')
    plt.grid(True)

    # plot it
    plt.savefig('../../figures/emotion_histograms/'+project+'.png',
               dpi=150)
    plt.close()

In [None]:
# initialize empty master dataframes (concatenation of the processed files is not yet implemented)
comments_df = pd.DataFrame()
issues_df = pd.DataFrame()
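
The cell above only creates two empty dataframes. One possible way to build the master files is to read back the per-project CSVs written by the loop and stack them. This is a sketch under the assumption that the processed files follow the `<project>-processed-comments.csv` and `<project>-processed-issues.csv` naming used above; `concatenate_processed` is a hypothetical helper:

```python
import glob
import os
import tempfile

import pandas as pd


def concatenate_processed(processed_dir, suffix):
    """Stack every CSV in processed_dir whose name ends with suffix."""
    paths = sorted(glob.glob(os.path.join(processed_dir, '*' + suffix)))
    frames = [pd.read_csv(path) for path in paths]
    if not frames:
        # nothing processed yet: return an empty frame rather than failing
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True, sort=False)


# demonstration with two tiny, made-up processed files
with tempfile.TemporaryDirectory() as demo_dir:
    pd.DataFrame({'project': ['a', 'a'], 'body': ['x', 'y']}).to_csv(
        os.path.join(demo_dir, 'a-processed-comments.csv'), index=False)
    pd.DataFrame({'project': ['b'], 'body': ['z']}).to_csv(
        os.path.join(demo_dir, 'b-processed-comments.csv'), index=False)
    demo_comments = concatenate_processed(demo_dir, '-processed-comments.csv')
```

With the real data, `comments_df = concatenate_processed('../../data/processed_data', '-processed-comments.csv')` and the matching call for issues would fill the two master dataframes.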

***

## Code testing ground

### Data preparation

In [None]:
comments_df = pd.read_csv('../../data/raw_data/mayavi/comments.tsv',
                          sep='\t', index_col=0).sort_index()

In [None]:
issues_df = pd.read_csv('../../data/raw_data/mayavi/issues.tsv',
                          sep='\t', index_col=0).sort_index()

### Annotate the files with new columns

In [None]:
comment_df, issues_df = annotate.annotate_comments_tickets(comments_df, issues_df)

### Remove unnecessary columns

In [None]:
comment_df = comment_df.drop(columns=['node_id','created_at','updated_at','author_id'])

In [None]:
issues_df = issues_df.drop(columns=['node_id','organization','author_id','locked'])

### Clean up body

In [None]:
bot_list = pd.read_csv('../bot_names.txt')['bot_name']

In [None]:
comment_df = annotate.body_cleanup(comment_df, bot_list)

In [None]:
issues_df = annotate.body_cleanup(issues_df, bot_list)

### Sentiment analysis

In [None]:
comment_df = annotate.add_sentiment(comment_df)

In [None]:
issues_df = annotate.add_sentiment(issues_df)

### Gratitude

**Note**: This step still needs to be added to the pipeline above.

In [None]:
gratitude_list = set(pd.read_csv('./utils/gratitude.txt')['expressions_of_gratitude'])

In [None]:
from nltk.tokenize import RegexpTokenizer
from collections import Counter

In [None]:
# split each lowercased comment body into word tokens, then count each token
tokenizer = RegexpTokenizer(r'\w+')
comment_df['tokenized'] = comment_df['body'].apply(str.lower).apply(tokenizer.tokenize)
comment_df['word_count'] = comment_df['tokenized'].apply(Counter)

In [None]:
# total occurrences of gratitude terms per comment (built-in sum keeps the counts as integers)
comment_df['grateful_count'] = (comment_df['word_count']
                                .apply(lambda x: sum(v for k, v in x.items()
                                                     if k in gratitude_list)))

In [None]:
comment_df['grateful_list'] = (comment_df['word_count']
                                   .apply(lambda z: [k for k in z if k in gratitude_list]))

In [None]:
comment_df.sort_values(by='grateful_count',
                      ascending=False)
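
The gratitude steps above could be folded into the pipeline as a single function. The sketch below is one way to do that, not the project's implementation: `count_gratitude` is a hypothetical helper, and `re.findall(r'\w+', ...)` stands in for the `RegexpTokenizer(r'\w+')` used above so that the example has no dependencies beyond the standard library.

```python
import re
from collections import Counter


def count_gratitude(text, gratitude_terms):
    """Return (total matches, sorted distinct matches) of gratitude terms in text."""
    # lowercase and split into word tokens, mirroring the tokenizer pattern above
    tokens = re.findall(r'\w+', str(text).lower())
    counts = Counter(tokens)
    matched = sorted(term for term in counts if term in gratitude_terms)
    total = sum(counts[term] for term in matched)
    return total, matched


# illustrative terms only; these are not the contents of gratitude.txt
demo_terms = {'thanks', 'thank', 'appreciate'}
demo_total, demo_matched = count_gratitude(
    'Thanks a lot! Thank you, thanks again.', demo_terms)
```

Because matching is done token by token, a multi-word expression in the gratitude list (for example "thank you") would never match; if the list contains such entries, they would need separate handling. On a dataframe, the function could be applied with `comment_df['body'].apply(lambda body: count_gratitude(body, gratitude_list))`.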

### Plot

In [None]:
project = 'mayavi'

In [None]:
# set a bin number
bin_number = 10

# create overlapping histograms
plt.hist(comment_df['negative_emotion'], 
         bin_number, facecolor='r', alpha=0.5)
plt.hist(comment_df['positive_emotion'], 
         bin_number, facecolor='g', alpha=0.5)
plt.hist(comment_df['neutral_emotion'], 
         bin_number, facecolor='grey', alpha=0.75)

# create labels
plt.title('Histogram of emotion proportions in comment bodies\nfor '+project)
plt.xlabel('Proportion of emotion words to total words')
plt.ylabel('Counts')
plt.grid(True)

# plot it
plt.savefig('../../figures/emotion_histograms/'+project+'.png',
           dpi=150)
plt.close()

***