# The Google Product Manager Project <a class="anchor" id="top"></a>

## Table of contents: 
* [Goal](#goal)
* [Method](#method)
* [Requirements](#requirements)
* [Process](#process)
    * [Authenticate](#auth)
    * [List issues](#list)
    * [Get issue details and clean up the data](#details)
    * [Corpus](#corpus)
    * [Cluster](#cluster)
    * [Prioritize](#prioritize)

## Goal <a class="anchor" id="goal"></a>

The goal of this Jupyter notebook is to mine the information in Google's issue tracker, create groups of similar feature requests, and prioritize these groups.

[Go to top](#top)

## Method <a class="anchor" id="method"></a>

The high-level steps to reach the goal are as follows:

1. Pull raw data about feature requests from Google's issue tracker.
2. Extract and clean up the "description" of the issue.
3. Use a tokenizer and vectorizer to extract the text features.
4. Use the K-means unsupervised clustering algorithm to identify clusters of issues.
5. Identify the most popular cluster of features using an index based on issue counts and aggregated votes.
6. Identify the scenario, persona, and stories if possible.

[Go to top](#top)

## Requirements <a class="anchor" id="requirements"></a>

This Jupyter Notebook was created using the following:

* Debian 9.54
* Python 3.5.3
* Jupyter Notebook 4.2.3
* All Python requirements listed here: https://raw.githubusercontent.com/danieldsj/google-pm/master/requirements.txt

[Go to top](#top)

## Process <a class="anchor" id="process"></a>

### Authenticate <a class="anchor" id="auth"></a>

Google's issue tracker is public, but it does not appear to have a documented and supported API. Using a browser's developer tools, one can observe XML HTTP Requests (XHR) that return strings that appear to be JSON objects.

The XHRs are made to URLs similar to the following:

https://issuetracker.google.com/action/issues?count=25&p=1&q=status:open+type:feature_request&s=created_time:desc

Since the HTTP request requires authentication and there is no official API or SDK for the issue tracker, I used the browsercookie library for Python to reuse the cookies from my desktop browser for authentication.

In [1]:
import browsercookie

cookies = browsercookie.firefox()

# NOTE: The warning below occurs because Firefox is open and being used for Jupyter Notebook.
# for more information, see...
# https://bitbucket.org/richardpenman/browsercookie/issues/14/sessionstorejs-does-not-exist-if-firefox

Firefox session filename does not exist: /home/daniel/.mozilla/firefox/edv7ja09.default/sessionstore.js


[Go to top](#top)

### Get list of issues <a class="anchor" id="list"></a>

The XHR response string has the following characteristics:
* The response is multi-line and the first line is not valid JSON.
* The URL uses the "+" character and may require special encoding to be used with requests.
* The results are paginated.
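
The first two characteristics can be handled with the standard library alone. The following is a minimal sketch, not the notebook's actual code: `parse_xhr_body` is a hypothetical helper, the prefix line in `sample_body` is a made-up placeholder, and I am assuming that the "+" in the query stands for a space, as is conventional in query strings.

```python
import json
from urllib.parse import quote_plus

def parse_xhr_body(body):
    """Parse the JSON found on the second line of a multi-line response body."""
    lines = body.splitlines()
    if len(lines) < 2:
        raise ValueError("expected a prefix line followed by a JSON line")
    return json.loads(lines[1])

# Placeholder body: a non-JSON first line followed by the JSON payload.
sample_body = "not-json-prefix\n" + '{"numTotalResults": 3, "issues": []}'
data = parse_xhr_body(sample_body)
print(data["numTotalResults"])  # 3

# Encode the search query; spaces become "+" and ":" is left untouched.
query = quote_plus("status:open type:feature_request", safe=":")
print(query)  # status:open+type:feature_request
```

Keeping the parsing in one function means the "skip the first line" workaround lives in a single place.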

Since the request supports pagination and we are looking at thousands of issues, we should define some functions that we can use with Python multiprocessing libraries.
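
The number of pages is the total number of results divided by the page size, rounded up. A minimal sketch of that calculation, where `page_count` is a hypothetical helper that is not part of the notebook:

```python
def page_count(total_results, page_size):
    """Number of pages needed to cover total_results items."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    # Ceiling division using integers only.
    return -(-total_results // page_size)

print(page_count(1750, 25))  # 70
print(page_count(1751, 25))  # 71
```

Rounding up, rather than truncating and adding one, avoids asking for an extra empty page when the total is an exact multiple of the page size.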

In [2]:
import requests
import json

def generate_urls(cookies=cookies, count=25):
    """ Generate all of the URLs required due to the pagination of the results. """
    url = 'https://issuetracker.google.com/action/issues?count={count}&p=' + \
          '{page}&q=status:open+type:feature_request&s=created_time:desc'

    # Make a small request to get the number of issues.
    response = requests.get(url.format(page=1, count=1), cookies=cookies)
    lines = [x for x in response.iter_lines()] # workaround for first line.
    data = json.loads(lines[1].decode())

    # Calculate the number of pages required to get all of the results.
    # Ceiling division avoids an extra, empty page when the total is an
    # exact multiple of the page size.
    number_of_pages = -(-data['numTotalResults'] // count)

    # Use the number of pages to generate the urls for all of the requests.
    for page in range(1,number_of_pages + 1):
        yield url.format(page=page, count=count)

def get_raw_issue(url):
    """ Pull down the issues returned for one page URL. """
    response = requests.get(url, cookies=cookies)
    lines = [x for x in response.iter_lines()] # workaround for first line.
    data = json.loads(lines[1].decode())
    return data['issues']

Now, let's use the Multiprocessing Pool class to make the requests in parallel...

In [3]:
import multiprocessing

urls = [x for x in generate_urls()]
print("The number of URLS generated: " + str(len(urls)))
print("A sample of the urls: ")
for u in urls[:3]:
    print(" " + u)

raw_issues = list()
pool = multiprocessing.Pool(4)
pool_output = pool.map(get_raw_issue, urls)
for batch in pool_output:
    raw_issues.extend(batch)
pool.close()
    
print("The number of issues detected: " + str(len(raw_issues)))
print("An example issue: ")
print(raw_issues[0])

The number of URLS generated: 70
A sample of the urls: 
 https://issuetracker.google.com/action/issues?count=25&p=1&q=status:open+type:feature_request&s=created_time:desc
 https://issuetracker.google.com/action/issues?count=25&p=2&q=status:open+type:feature_request&s=created_time:desc
 https://issuetracker.google.com/action/issues?count=25&p=3&q=status:open+type:feature_request&s=created_time:desc
The number of issues detected: 1750
An example issue: 
{'aggregatedData': {'voteCount': 0, 'createdTimeMicros': 1521908332752000, 'modifiedTimeMicros': 1521908332752000}, 'userData': {'voted': False, 'starred': False}, 'issueId': 76201362, 'snapshot': [{'severity': 'S2', 'createdTimeMicros': 1521908332752000, 'significanceOverride': 'UNKNOWN', 'version': 0, 'isDeleted': False, 'title': 'Add-ons for Google Classroom', 'user': 'le...@videmantay.net', 'componentId': 191645, 'type': 'FEATURE_REQUEST', 'cc': ['gr...@google.com', 'ho...@google.com', 'le...@videmantay.net', 'ob...@google.com'], 'isA
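
Because these requests spend most of their time waiting on the network, a thread pool is one alternative to a process pool. This is a sketch of that swapped-in technique, not the approach used in this notebook. `fetch_page` is a hypothetical stand-in that fabricates two issues per page so the example runs without network access, and the URLs are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_page(url):
    """Hypothetical stand-in for downloading and parsing one page of issues."""
    page = int(url.rsplit("=", 1)[1])
    return [{"issueId": page * 10 + offset} for offset in range(2)]

def fetch_all(urls, fetch=fetch_page, workers=4):
    """Fetch every URL concurrently and flatten the per-page lists."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        batches = executor.map(fetch, urls)  # results keep the input order
        return [issue for batch in batches for issue in batch]

sample_urls = ["https://example.invalid/issues?p={}".format(p) for p in range(1, 4)]
sample_issues = fetch_all(sample_urls)
print(len(sample_issues))  # 6
```

The `with` block also shuts the pool down once all of the results have been collected.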

[Go to top](#top)

### Get issue details and clean up the data <a class="anchor" id="details"></a>

The responses to these requests provide some valuable information, but the only description of the feature being requested is a very short title. Clicking on an issue on the website triggers additional XML HTTP Requests to different URLs similar to the following:

https://issuetracker.google.com/action/issues/74163608

Again, let's create functions to pull the details and leverage the Multiprocessing Pool class to make the requests in parallel...

In [4]:
# Named differently from get_raw_issue so that the earlier cell can still be re-run.
def get_raw_issue_details(issue_id, cookies=cookies):
    """ Pull down the full details of a single issue. """
    response = requests.get("https://issuetracker.google.com/action/issues/{}".format(issue_id), cookies=cookies)
    lines = [x for x in response.iter_lines()] # workaround for first line.
    return [json.loads(lines[1].decode())]

raw_issue_details = []
issue_ids = [x['issueId'] for x in raw_issues]
pool = multiprocessing.Pool(4)
pool_output = pool.map(get_raw_issue_details, issue_ids)
for batch in pool_output:
    raw_issue_details.extend(batch)
pool.close()

print("Example raw issue: ")
print(raw_issue_details[0])

Example raw issue: 
{'issue': {'aggregatedData': {'voteCount': 0, 'createdTimeMicros': 1521908332752000, 'modifiedTimeMicros': 1521908332752000}, 'userData': {'voted': False, 'starred': False}, 'issueId': 76201362, 'snapshot': [{'severity': 'S2', 'createdTimeMicros': 1521908332752000, 'snapshotNumber': 1, 'attachmentId': [16665195], 'significanceOverride': 'UNKNOWN', 'version': 0, 'type': 'FEATURE_REQUEST', 'title': 'Add-ons for Google Classroom', 'comment': "Before filing an issue, please read and follow these instructions carefully. \n\nFirst, please search through existing issues to ensure that the feature request has not already been reported. You can start the search here: https://issuetracker.google.com/savedsearches/566256\n\nIf the feature has already been requested, you can click the star next to the issue number to subscribe and receive updates. We prioritize responding to the issues with the most stars. You can also comment on the issue to provide any context of how the feat

There is quite a bit of information here. Some things to note:
* We will be aggregating the voteCount values and counting the number of issues to help prioritize our clusters.
* We will be extracting the first comment of each issue to use as the corpus for our clustering.
* The corpus contains HTML formatting and URLs, which we will likely strip out.
* There may be other strings within the first comment that have a negative impact on the clustering algorithms.
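
The clean-up described in these notes can be sketched as a single function. This is a minimal sketch under my own assumptions rather than the exact code used below: `clean_description` is a hypothetical name, the sample string is made up, and tags are replaced with a space (instead of nothing) so that neighboring words do not run together.

```python
import html
import re

def clean_description(raw):
    """Strip HTML tags and URLs, unescape entities, and tidy whitespace."""
    text = re.sub(r"<[^<]+?>", " ", raw)       # tags become spaces so words stay apart
    text = re.sub(r"https?://\S+", " ", text)  # drop each URL up to the next whitespace
    text = html.unescape(text)                 # &quot; becomes ", &#39; becomes '
    return re.sub(r"\s+", " ", text).strip()

raw = "1) What&#39;s the feature?<br>See https://example.com/docs for &quot;details&quot;"
print(clean_description(raw))
# 1) What's the feature? See for "details"
```

Matching the URL with `\S+` removes only the URL itself, so any text that follows it on the same line is kept.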

Let's extract the information we need and try to clean up the text...

In [5]:
import html
import re

issues = []

for i in raw_issue_details:
    issue_id = i['issue']['issueId']
    issue_votes = i['issue']['aggregatedData']['voteCount']
    issue_dirty_description = i['events'][0].get('comment', "")
    issue_description = issue_dirty_description
    
    # There appears to be HTML formatting support in comments.
    issue_description = re.sub('<[^<]+?>', '', issue_description)
    
    # Remove URLs (only the URL itself, not the rest of the line).
    issue_description = re.sub(r'https?://\S+', '', issue_description)
    
    # Escaped characters too!
    issue_description = html.unescape(issue_description)
    
    # There seems to be other boilerplate text that should be removed. The following are some examples, quoted
    # from multiple attempts at clustering:
    #     Before filing an issue, please read and follow these instructions carefully...
    #     Build: 3.0.1, AI-171.4443003, 201711091821, AI-171.4443003, JRE 1.8.0_152...
    #     # WARNING: DO NOT INCLUDE YOUR API KEY OR CLIENT ID CREDENTIALS # It is OK ...
    #     As Google Maps was recently updated with a fresh new look [1], we're also updating...
    #     [REMEMBER: Vote on existing features using the "Me Too!" button...
    
    issues.append({'id': issue_id, 'votes': issue_votes, 'dirty_dsecription': issue_dirty_description, 
                   'description': issue_description})

print("Here is an example issue with the data we will use for clustering: ")
print(json.dumps(issues[35], indent=2, sort_keys=True))

Here is an example issue with the data we will use for clustering: 
{
  "description": "[REMEMBER: Vote on existing features using the \"Me Too!\" button. This will help us prioritize the most requested features.]1) What's the feature?Currently the only way to display date is in a table and hide the table header with the same color font as the theme background2) How do you plan on using this feature?I want to show the maximum date (again a feature not available) at the top of the report3) What's the impact that this feature will have on you or your business? (Time saved, new users, etc.)Time Saved and Provide latest date at the top of the report without the workaround.",
  "dirty_dsecription": "[REMEMBER: Vote on existing features using the &quot;Me Too!&quot; button. This will help us prioritize the most requested features.]<br><br>1) What&#39;s the feature?<br>Currently the only way to display date is in a table and hide the table header with the same color font as the theme backgrou
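
The template text visible in this example could be stripped before clustering. The following is a minimal sketch of one way to do that, assuming an exact string match is good enough. `BOILERPLATE` and `strip_boilerplate` are hypothetical names, and the fragments are copied from the example descriptions in this notebook, so the list is illustrative rather than exhaustive.

```python
import re

# Template fragments copied from example descriptions in this notebook.
BOILERPLATE = [
    '[REMEMBER: Vote on existing features using the "Me Too!" button. '
    'This will help us prioritize the most requested features.]',
    "1) What's the feature?",
    "2) How do you plan on using this feature?",
    "3) What's the impact that this feature will have on you or your "
    "business? (Time saved, new users, etc.)",
]

def strip_boilerplate(description, fragments=BOILERPLATE):
    """Remove known template fragments and collapse leftover whitespace."""
    for fragment in fragments:
        description = description.replace(fragment, " ")
    return re.sub(r"\s+", " ", description).strip()

template_only = "".join(BOILERPLATE)
print(repr(strip_boilerplate(template_only)))  # ''
```

A description that consists only of the unfilled template becomes an empty string, which makes such issues easy to filter out before vectorizing.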

[Go to top](#top)

### Cluster <a class="anchor" id="cluster"></a>

Now that we have some text, we can try to cluster it to see what groups emerge. I will be using this guide as inspiration: https://pythonprogramminglanguage.com/kmeans-text-clustering/

Let's apply k-means clustering to produce a list of the top 10 features (terms) in each cluster.

In [6]:
number_of_clusters = 50  # How many clusters?
ngram_range = (1, 2) # Single words or two-word phrases?

# The following words appear to throw off the clustering and produce clusters that are not valuable.
extra_stop_words = ["feature", "features", 
                    "issue", "issues", 
                    "request", "requests", 
                    "thing", "things", 
                    "google"]

from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

feature_documents = [x['description'] for x in issues]

# A list keeps this compatible with scikit-learn releases that reject a frozenset here.
stop_words = list(text.ENGLISH_STOP_WORDS.union(extra_stop_words))
vectorizer = TfidfVectorizer(stop_words=stop_words, ngram_range=ngram_range)
x = vectorizer.fit_transform(feature_documents)

feature_model = KMeans(n_clusters=number_of_clusters, init='k-means++', max_iter=100, n_init=1)
feature_model.fit(x)

order_centroids = feature_model.cluster_centers_.argsort()[:, ::-1]
# On scikit-learn 1.0 and later, use vectorizer.get_feature_names_out() instead.
terms = vectorizer.get_feature_names()

clusters = []

for i in range(number_of_clusters):
    clusters.append({
            'index': i,
            'features': [terms[x] for x in order_centroids[i, :10]],
            'votes': 0,
            'issues': 0
        })    

print("Here is a dict describing a cluster that was detected: ")
print(json.dumps(clusters[35], indent=2, sort_keys=True))

Here is a dict describing a cluster that was detected: 
{
  "features": [
    "resolution",
    "face recognition",
    "recognition",
    "face",
    "higher",
    "resolutions",
    "640x480",
    "dp",
    "preview",
    "output"
  ],
  "index": 35,
  "issues": 0,
  "votes": 0
}
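
To make the vectorizer step more concrete, here is a small pure-Python TF-IDF. It is a simplified sketch: scikit-learn's `TfidfVectorizer` additionally smooths the IDF term and L2-normalizes each row by default, so its numbers will differ. The `tf_idf` function and the three sample documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document (simplified TF-IDF)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    "export report to pdf",
    "export report to csv",
    "dark theme for editor",
]
weights = tf_idf(docs)
# "pdf" appears in one document, "export" in two, so "pdf" is weighted higher.
print(weights[0]["pdf"] > weights[0]["export"])  # True
```

A term that appears in every document gets a weight of zero, which is why shared template text contributes little on its own but can still dominate when most documents contain nothing else.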


Now I can make predictions of which cluster a string belongs to...

In [7]:
import random

idx = random.randrange(len(issues))  # randint's upper bound is inclusive and could overrun the list

y = vectorizer.transform([issues[idx]['description']])
prediction = feature_model.predict(y)

print("The following command belongs in cluster {}:".format(prediction[0]))
print(issues[idx]['description'])

print("")
print("Feature cluster %d:" % prediction[0])
for ind in order_centroids[prediction[0], :10]:
    print(' %s' % terms[ind])

The following command belongs in cluster 31:
apps-scripts-notifications@google.com send me mail with text (russian):Не удалось выполнить скрипт Gmail Meter. Ниже приведена сводка сбоев. Чтобы настроить триггеры для этого скрипта или изменить параметры получения уведомлений о сбоях, нажмите здесь.Начало  Функция Сообщение об ошибке     Триггер Конец29.12.17 1:20   activityReport  Для выполнения этого действия необходима авторизация.   time-based      29.12.17 1:20С уважением,Google Apps ScriptНужна помощь? Ознакомьтесь с документацией по скриптам Google Apps. Не отвечайте на это сообщение. (c) Google, 2017.

Feature cluster 31:
 using
 data
 time
 new
 users
 like
 use
 saved
 time saved
 business
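
Conceptually, the prediction step assigns a vector to the cluster whose centroid is closest. A minimal pure-Python sketch of that idea, using made-up two-dimensional centroids and a hypothetical `nearest_centroid` helper (the real model works on the much larger TF-IDF vectors):

```python
import math

def nearest_centroid(vector, centroids):
    """Return the index of the centroid closest to `vector` (Euclidean distance)."""
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    distances = [distance(vector, c) for c in centroids]
    return distances.index(min(distances))

centroids = [[1.0, 0.0], [0.0, 1.0]]  # made-up two-term centroids
print(nearest_centroid([0.9, 0.2], centroids))  # 0
```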


[Go to top](#top)

### Prioritize  <a class="anchor" id="prioritize"></a>

Now that we have a model with 50 clusters, we can start prioritizing them. How should we prioritize? Let's iterate over all of the issues and aggregate data into the list of clusters we created earlier. We will aggregate the number of issues and the combined votes.

In [8]:
for cluster in clusters:
    cluster['votes'] = 0
    cluster['issues'] = 0

for issue in issues:
    vectorized_description = vectorizer.transform([issue['description']])
    prediction = feature_model.predict(vectorized_description)
    issue['prediction'] = int(prediction[0])  # plain int; no need to look the issue up again
    clusters[issue['prediction']]['votes'] += issue['votes']
    clusters[issue['prediction']]['issues'] += 1
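
The aggregation itself does not depend on the model. As a minimal sketch, assuming the cluster assignments are already available as (cluster index, votes) pairs, it could be written as follows. `aggregate_by_cluster` and the sample pairs are hypothetical.

```python
from collections import defaultdict

def aggregate_by_cluster(assignments):
    """Sum votes and count issues per cluster.

    `assignments` is an iterable of (cluster_index, votes) pairs.
    """
    totals = defaultdict(lambda: {"votes": 0, "issues": 0})
    for cluster_index, votes in assignments:
        totals[cluster_index]["votes"] += votes
        totals[cluster_index]["issues"] += 1
    return dict(totals)

sample = [(31, 5), (7, 2), (31, 0), (7, 1), (43, 0)]
totals = aggregate_by_cluster(sample)
print(totals[31]["votes"], totals[31]["issues"])  # 5 2
```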

Now let's sort the clusters by the number of issues or the number of votes to see the ordered lists...

In [9]:
clusters_by_issues = sorted(clusters, key=lambda k: k['issues'], reverse=True) 
clusters_by_votes = sorted(clusters, key=lambda k: k['votes'], reverse=True) 

print("Top clusters sorted by issue count: ")
for c in clusters_by_issues[:10]:
    print("Cluster: {}, issues: {}".format(c['index'], c['issues']))
    
print()
print("Top clusters sorted by vote count: ")
for c in clusters_by_votes[:10]:
    print("Cluster: {}, votes: {}".format(c['index'], c['votes']))

Top clusters sorted by issue count: 
Cluster: 31, issues: 1481
Cluster: 7, issues: 129
Cluster: 43, issues: 40
Cluster: 17, issues: 19
Cluster: 1, issues: 14
Cluster: 16, issues: 12
Cluster: 33, issues: 6
Cluster: 30, issues: 4
Cluster: 48, issues: 4
Cluster: 0, issues: 1

Top clusters sorted by vote count: 
Cluster: 31, votes: 1495
Cluster: 7, votes: 27
Cluster: 15, votes: 20
Cluster: 48, votes: 14
Cluster: 45, votes: 13
Cluster: 17, votes: 10
Cluster: 34, votes: 5
Cluster: 0, votes: 3
Cluster: 42, votes: 2
Cluster: 43, votes: 2


Although the sorted lists look similar, they are not in the same order. Since we have two sorting criteria, we should come up with an index that combines the vote and issue counts.

In [10]:
# Customize these if one is more important than the other.
vote_multiplier = 1
issue_multiplier = 1

# In order to normalize the votes and issue counts, we need the min/max
min_votes = min([c['votes'] for c in clusters])
max_votes = max([c['votes'] for c in clusters])
min_issues = min([c['issues'] for c in clusters])
max_issues = max([c['issues'] for c in clusters])

# Calculate the scores per cluster using the min/max.  Average the two scores.
for cluster in clusters:
    cluster['vote_score'] = vote_multiplier * ((cluster['votes'] - min_votes) / (max_votes - min_votes))
    cluster['issue_score'] = issue_multiplier * ((cluster['issues'] - min_issues) / (max_issues - min_issues))
    cluster['score'] = (cluster['vote_score'] + cluster['issue_score']) / 2

# Sort by the new score value.
clusters_by_score = sorted(clusters, key=lambda k: k['score'], reverse=True)

print("Top 10 clusters sorted by ratio: ")
for c in clusters_by_score[:10]:
    print("    Cluster: {0:5d}, score: {1:8.3f}, issue_score: {2:8.3f}, vote_score: {3:8.3f}".format(
            c['index'], c['score'], 
            clusters[c['index']]['issue_score'],
            clusters[c['index']]['vote_score']))
                                                                                       

Top 10 clusters sorted by ratio: 
    Cluster:    31, score:    1.000, issue_score:    1.000, vote_score:    1.000
    Cluster:     7, score:    0.052, issue_score:    0.086, vote_score:    0.018
    Cluster:    43, score:    0.014, issue_score:    0.026, vote_score:    0.001
    Cluster:    17, score:    0.009, issue_score:    0.012, vote_score:    0.007
    Cluster:    15, score:    0.007, issue_score:    0.000, vote_score:    0.013
    Cluster:    48, score:    0.006, issue_score:    0.002, vote_score:    0.009
    Cluster:     1, score:    0.005, issue_score:    0.009, vote_score:    0.001
    Cluster:    45, score:    0.004, issue_score:    0.000, vote_score:    0.009
    Cluster:    16, score:    0.004, issue_score:    0.007, vote_score:    0.001
    Cluster:    33, score:    0.002, issue_score:    0.003, vote_score:    0.001
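
The score is a min-max normalization of each count followed by an average. Here is a small worked sketch using the numbers reported for cluster 7. `min_max` and `combined_score` are hypothetical helpers, and the ranges assume a minimum of 0 votes and 1 issue across the clusters, which is my inference from the printed scores rather than something shown in the output. The guard for identical minimum and maximum values is an addition that avoids a division by zero.

```python
def min_max(value, low, high):
    """Scale value into [0, 1]; return 0.0 when every value is identical."""
    if high == low:
        return 0.0
    return (value - low) / (high - low)

def combined_score(votes, issues, vote_range, issue_range,
                   vote_multiplier=1, issue_multiplier=1):
    """Average the weighted, normalized vote and issue counts."""
    vote_score = vote_multiplier * min_max(votes, *vote_range)
    issue_score = issue_multiplier * min_max(issues, *issue_range)
    return (vote_score + issue_score) / 2

# Cluster 7: 129 issues and 27 votes; the largest cluster has 1481 issues and 1495 votes.
# Assumed minimums: 0 votes and 1 issue.
score = combined_score(27, 129, vote_range=(0, 1495), issue_range=(1, 1481))
print(round(score, 3))  # 0.052
```

Because one cluster holds most of the issues and votes, every other cluster is squeezed close to zero on both scales, which is visible in the table above.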


Let's dump some information about this top cluster...

In [11]:
print(json.dumps(clusters_by_score[0], indent=2, sort_keys=True))

{
  "features": [
    "using",
    "data",
    "time",
    "new",
    "users",
    "like",
    "use",
    "saved",
    "time saved",
    "business"
  ],
  "index": 31,
  "issue_score": 1.0,
  "issues": 1481,
  "score": 1.0,
  "vote_score": 1.0,
  "votes": 1495
}


Let's take a sample of the issues that are part of one of the ranked clusters (the entry at index 6 of clusters_by_score)...

In [18]:
for issue in [i for i in issues if i['prediction'] == clusters_by_score[6]['index']]:
    print("-----")
    print(issue['description'])

-----
[REMEMBER: Vote on existing features using the "Me Too!" button. This will help us prioritize the most requested features.]1) What's the feature?2) How do you plan on using this feature?3) What's the impact that this feature will have on you or your business? (Time saved, new users, etc.)
-----
[REMEMBER: Vote on existing features using the "Me Too!" button. This will help us prioritize the most requested features.]1) What's the feature?2) How do you plan on using this feature?3) What's the impact that this feature will have on you or your business? (Time saved, new users, etc.)
-----
[REMEMBER: Vote on existing features using the "Me Too!" button. This will help us prioritize the most requested features.]1) What's the feature?2) How do you plan on using this feature?3) What's the impact that this feature will have on you or your business? (Time saved, new users, etc.)
-----
[REMEMBER: Vote on existing features using the "Me Too!" button. This will help us prioritize the most req