# The Google Product Manager Project <a class="anchor" id="top"></a>

## Table of contents: 
* [Goal](#goal)
* [Method](#method)
* [Requirements](#requirements)
* [Process](#process)
    * [Authenticate](#auth)
    * [List issues](#list)
    * [Get issue details and clean up the data](#details)
    * [Corpus](#corpus)
    * [Cluster](#cluster)
    * [Prioritize](#prioritize)

## Goal <a class="anchor" id="goal"></a>

The goal of this Jupyter notebook is to mine Google's issue tracker to identify the most popular feature request and to identify the who, what, where, when, and why of the problem that feature solves.

[Go to top](#top)

## Method <a class="anchor" id="method"></a>

The following is a high-level strategy for attaining the above goal:

1. Pull raw data about feature requests from Google's issue tracker.
2. Extract the "description" of the issue.
3. Use a tokenizer and vectorizer to extract the text features (unigrams and bigrams).
4. Use the K-means unsupervised clustering algorithm to identify clusters of issues.
5. Identify the most popular cluster of features using an index.
6. Identify the scenario (W5H).
7. Identify the persona.
8. Identify stories.
9. Prioritize the stories.

[Go to top](#top)

## Requirements <a class="anchor" id="requirements"></a>

1. Clone this repository.
2. Ensure you have python3 and jupyter-notebook installed on your system.
3. Create a virtual environment.
4. Activate the virtual environment.
5. Install the dependencies in the requirements.txt file.
6. Run the command jupyter-notebook.
7. Open the google-product-manager.ipynb document.

[Go to top](#top)

## Process <a class="anchor" id="process"></a>

### Authenticate <a class="anchor" id="auth"></a>

Google's issue tracker is public. It allows anyone to report issues and request features. As far as I can tell, the issue tracker has no supported API or SDK. However, loading the pages with the browser's developer tools open reveals a request that returns a JSON structure with the details needed for post-processing.

With the developer tools open, we can see that an XHR request is made to the following URL:

https://issuetracker.google.com/action/issues?count=25&p=1&q=status:open+type:feature_request&s=created_time:desc

Since the HTTP request requires authentication and there is no official API or SDK for the issue tracker, I used the browsercookie library in Python to reuse my browser's session cookies rather than implement authentication myself (I tried using APIs, but had difficulty).

In [1]:
import browsercookie

cookies = browsercookie.firefox()

# NOTE: The warning below occurs because Firefox is open and being used for Jupyter Notebook.
# for more information, see...
# https://bitbucket.org/richardpenman/browsercookie/issues/14/sessionstorejs-does-not-exist-if-firefox

Firefox session filename does not exist: /home/daniel/.mozilla/firefox/edv7ja09.default/sessionstore.js


[Go to top](#top)

### Get list of issues <a class="anchor" id="list"></a>

After looking at the response, I found several caveats:
* The response is multi-line and the first line is not valid JSON.
* The query syntax requires a literal "+" character, which the Python requests library encodes by default.
* The results are paginated.
* There are thousands of results, which may take time to fetch over HTTP.
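
The first three caveats can be sketched with the standard library alone. This is a minimal illustration, not the tracker's documented behavior: the helper names, the sample body, and its first line are all made up for the example, and `parse_prefixed_json` assumes that everything after the first line is a single JSON document.

```python
import json
import math
from urllib.parse import urlencode

def parse_prefixed_json(body):
    """Drop the first (non-JSON) line and parse the remainder as JSON."""
    _, _, rest = body.partition(b"\n")
    return json.loads(rest.decode())

def build_query(page, count=25):
    """Build the query string while leaving ':' and '+' unencoded."""
    params = {
        "count": count,
        "p": page,
        "q": "status:open+type:feature_request",
        "s": "created_time:desc",
    }
    # safe=":+" stops urlencode from turning ':' into %3A and '+' into %2B.
    return urlencode(params, safe=":+")

def page_count(total, per_page):
    """Number of pages needed to cover `total` results."""
    return math.ceil(total / per_page)

# Hypothetical body: a junk first line followed by one JSON document.
sample = b"<not json>\n" + json.dumps({"numTotalResults": 1725, "issues": []}).encode()
data = parse_prefixed_json(sample)

print(build_query(page=1))
print(page_count(data["numTotalResults"], 25))
```

Building the query string by hand, rather than passing a `params` dict to requests, is one way to keep the "+" characters exactly as the tracker's own XHR request sends them.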

Since we are going to make multiple HTTP requests and it may take some time, let's first define some functions that will later be used with the multiprocessing library.

In [2]:
import requests
import json

def generate_urls(cookies=cookies, count=25):
    """ Generate all of the URLs required due to the pagination of the results. """
    url = 'https://issuetracker.google.com/action/issues?count={count}&p=' + \
          '{page}&q=status:open+type:feature_request&s=created_time:desc'

    # Make a small request to get the number of issues.
    response = requests.get(url.format(page=1, count=1), cookies=cookies)
    lines = [x for x in response.iter_lines()] # workaround for first line.
    data = json.loads(lines[1].decode())

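    # Calculate the number of pages required to get all of the results.
    # Ceiling division avoids requesting an extra, empty page when the total
    # is an exact multiple of count.
    number_of_pages = -(-data['numTotalResults'] // count)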

    # Use the number of pages to generate the urls for all of the requests.
    for page in range(1,number_of_pages + 1):
        yield url.format(page=page, count=count)

def get_raw_issue(url):
    """ Pull down the list of issues returned by a single URL. """
    response = requests.get(url, cookies=cookies)
    lines = [x for x in response.iter_lines()] # workaround for first line.
    data = json.loads(lines[1].decode())
    return data['issues']

Now, let's use multiprocessing to pull down the information.

In [3]:
import multiprocessing

urls = [x for x in generate_urls()]
print("The number of URLS generated: " + str(len(urls)))
print("A sample of the urls: ")
for u in urls[:3]:
    print(" " + u)

raw_issues = list()
pool = multiprocessing.Pool(4)
pool_output = pool.map(get_raw_issue, urls)
for batch in pool_output:
    raw_issues.extend(batch)
pool.close()
    
print("The number of issues detected: " + str(len(raw_issues)))
print("An example issue: ")
print(raw_issues[0])

The number of URLS generated: 69
A sample of the urls: 
 https://issuetracker.google.com/action/issues?count=25&p=1&q=status:open+type:feature_request&s=created_time:desc
 https://issuetracker.google.com/action/issues?count=25&p=2&q=status:open+type:feature_request&s=created_time:desc
 https://issuetracker.google.com/action/issues?count=25&p=3&q=status:open+type:feature_request&s=created_time:desc
The number of issues detected: 1725
An example issue: 
{'aggregatedData': {'voteCount': 0, 'modifiedTimeMicros': 1521908332752000, 'createdTimeMicros': 1521908332752000}, 'snapshot': [{'type': 'FEATURE_REQUEST', 'title': 'Add-ons for Google Classroom', 'reporter': 'le...@videmantay.net', 'isArchived': False, 'significanceOverride': 'UNKNOWN', 'priority': 'P2', 'componentId': 191645, 'inProd': False, 'status': 'NEW', 'user': 'le...@videmantay.net', 'severity': 'S2', 'isDeleted': False, 'version': 0, 'cc': ['gr...@google.com', 'ho...@google.com', 'le...@videmantay.net', 'ob...@google.com'], 'cr
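
Because each request spends most of its time waiting on the network, a thread pool is a possible alternative to a process pool for this step. This is a swapped-in technique, not what the notebook uses. In the sketch below, `fetch_page` is a hypothetical stand-in for `get_raw_issue` and the URLs are placeholders, so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_page(url):
    """Hypothetical stand-in for get_raw_issue: returns a list of issues."""
    page = int(url.rsplit("=", 1)[1])
    return [{"issueId": page * 10 + n} for n in range(2)]

def fetch_all(urls, fetch=fetch_page, workers=4):
    """Fetch every URL concurrently and flatten the per-page lists."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        batches = pool.map(fetch, urls)  # results come back in the order of urls
        return [issue for batch in batches for issue in batch]

# Placeholder URLs; only the trailing page number matters to fetch_page.
sample_urls = ["https://example.invalid/issues?p={}".format(p) for p in range(1, 4)]
sample_issues = fetch_all(sample_urls)
print(len(sample_issues))
```

The `with` block shuts the pool down even if one of the requests raises, which the explicit `pool.close()` pattern does not guarantee.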

[Go to top](#top)

### Get issue details and clean up the data <a class="anchor" id="details"></a>

Although we have a list of issues and some information about the issues, we only have the "title" value for each issue to use as our corpus. In order to get more details, we need to make additional requests to URLs like this one...

https://issuetracker.google.com/action/issues/74163608

...to get more details about each issue. Since we need to repeat this operation once per issue (1,725 times in this run), we should use multiprocessing to make the requests in parallel.

In [4]:
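# Named differently from get_raw_issue above so that re-running the earlier
# cell does not pick up this definition by accident.
def get_raw_issue_details(issue_id, cookies=cookies):
    """ Pull down the full details of a single issue. """
    response = requests.get("https://issuetracker.google.com/action/issues/{}".format(issue_id), cookies=cookies)
    lines = [x for x in response.iter_lines()]  # workaround for first line.
    return [json.loads(lines[1].decode())]

raw_issue_details = []
issue_ids = [x['issueId'] for x in raw_issues]
pool = multiprocessing.Pool(4)
pool_output = pool.map(get_raw_issue_details, issue_ids)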
for batch in pool_output:
    raw_issue_details.extend(batch)
pool.close()

print("Example raw issue: ")
print(raw_issue_details[0])

Example raw issue: 
{'issue': {'aggregatedData': {'voteCount': 0, 'modifiedTimeMicros': 1521908332752000, 'createdTimeMicros': 1521908332752000}, 'snapshot': [{'isDeleted': False, 'priority': 'P2', 'snapshotNumber': 1, 'reporter': 'le...@videmantay.net', 'isArchived': False, 'significanceOverride': 'UNKNOWN', 'type': 'FEATURE_REQUEST', 'componentId': 191645, 'comment': "Before filing an issue, please read and follow these instructions carefully. \n\nFirst, please search through existing issues to ensure that the feature request has not already been reported. You can start the search here: https://issuetracker.google.com/savedsearches/566256\n\nIf the feature has already been requested, you can click the star next to the issue number to subscribe and receive updates. We prioritize responding to the issues with the most stars. You can also comment on the issue to provide any context of how the feature would benefit you.\n\nAlso, please verify that the functionality you are requesting is 

There is quite a bit of information here. Some things to note:
* We will aggregate the voteCount values and tally the number of issues to prioritize the clusters.
* We will extract the first comment of each issue to use as the corpus for our clustering.
* The corpus contains HTML formatting and URLs, which we will strip out.

Let's extract only the parts we need and clean up the text that will serve as our corpus.

In [14]:
import html
import re

issues = []

for i in raw_issue_details:
    issue_id = i['issue']['issueId']
    issue_votes = i['issue']['aggregatedData']['voteCount']
    issue_dirty_description = i['events'][0].get('comment', "")
    issue_description = issue_dirty_description
    
    # There appears to be HTML formatting support in comments.
    issue_description = re.sub('<[^<]+?>', '', issue_description) 
    # Remove URLs (only the URL itself, not the rest of the line).
    issue_description = re.sub(r'https?://\S+', '', issue_description)
    # Escaped characters too!
    issue_description = html.unescape(issue_description)
    issues.append({'id': issue_id, 'votes': issue_votes, 'dirty_dsecription': issue_dirty_description, 
                   'description': issue_description})

print("Here is an example issue with the data we will use for clustering: ")
print(json.dumps(issues[35], indent=2, sort_keys=True))

Here is an example issue with the data we will use for clustering: 
{
  "description": "[REMEMBER: Vote on existing features using the \"Me Too!\" button. This will help us prioritize the most requested features.]1) What's the feature?Currently the only way to display date is in a table and hide the table header with the same color font as the theme background2) How do you plan on using this feature?I want to show the maximum date (again a feature not available) at the top of the report3) What's the impact that this feature will have on you or your business? (Time saved, new users, etc.)Time Saved and Provide latest date at the top of the report without the workaround.",
  "dirty_dsecription": "[REMEMBER: Vote on existing features using the &quot;Me Too!&quot; button. This will help us prioritize the most requested features.]<br><br>1) What&#39;s the feature?<br>Currently the only way to display date is in a table and hide the table header with the same color font as the theme backgrou

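
Deleting tags outright joins the words on either side of a `<br>`, which is why the cleaned text contains runs such as "feature?Currently". A possible variation, sketched below under the assumption that whitespace is acceptable in place of markup, substitutes a space for each tag and URL and then collapses repeated whitespace. The function name and the sample string are illustrative only.

```python
import html
import re

TAG_RE = re.compile(r'<[^<]+?>')      # same tag pattern as the notebook
URL_RE = re.compile(r'https?://\S+')  # a URL up to the next whitespace
WS_RE = re.compile(r'\s+')

def clean_description(raw):
    """Strip tags and URLs, unescape entities, and normalize whitespace."""
    text = TAG_RE.sub(' ', raw)   # a space keeps neighbouring words apart
    text = URL_RE.sub(' ', text)
    text = html.unescape(text)
    return WS_RE.sub(' ', text).strip()

# Illustrative input, not taken from the tracker.
sample = "What&#39;s the feature?<br><br>See https://example.com/docs for details."
print(clean_description(sample))
```

Keeping word boundaries intact matters for the later steps, because the vectorizer would otherwise treat "feature?Currently" style runs as adjacent tokens of one sentence.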
[Go to top](#top)

### Cluster <a class="anchor" id="cluster"></a>

Now that we have some text, we can try to cluster it to see what groups emerge. I will be using this guide as inspiration: https://pythonprogramminglanguage.com/kmeans-text-clustering/

Let's apply k-means clustering to produce a list of the top 10 features in each cluster.

In [6]:
number_of_clusters = 50  # How many clusters?
ngram_range = (1, 2) # Single words or two-word phrases?
extra_stop_words = ["feature", "features", 
                    "issue", "issues", 
                    "request", "requests", 
                    "thing", "things", 
                    "google"]


from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

feature_documents = [x['description'] for x in issues]

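# Convert to a list: recent scikit-learn releases expect a list of stop
# words here rather than a frozenset.
stop_words = list(text.ENGLISH_STOP_WORDS.union(extra_stop_words))
vectorizer = TfidfVectorizer(stop_words=stop_words, ngram_range=ngram_range)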
x = vectorizer.fit_transform(feature_documents)

feature_model = KMeans(n_clusters=number_of_clusters, init='k-means++', max_iter=100, n_init=1)
feature_model.fit(x)

order_centroids = feature_model.cluster_centers_.argsort()[:, ::-1]
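# Note: scikit-learn 1.0 added get_feature_names_out(), and
# get_feature_names() was removed in 1.2; use the former on newer versions.
terms = vectorizer.get_feature_names()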

clusters = []

for i in range(number_of_clusters):
    clusters.append({
            'index': i,
            'features': [terms[x] for x in order_centroids[i, :10]],
            'votes': 0,
            'issues': 0
        })    

print("Here is a dict describing a cluster that was detected: ")
print(json.dumps(clusters[35], indent=2, sort_keys=True))

Here is a dict describing a cluster that was detected: 
{
  "features": [
    "emulator",
    "bringing solution",
    "familiarize",
    "tests allow",
    "solution production",
    "help making",
    "bigtable datastore",
    "sub looking",
    "looking know",
    "spanner help"
  ],
  "index": 35,
  "issues": 0,
  "votes": 0
}
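
Features such as "looking know" are bigrams formed after stop words have been dropped, which is why they do not read like natural phrases. The sketch below is a simplified illustration of that behavior, not scikit-learn's implementation: it lowercases the text, keeps tokens of two or more word characters, removes stop words, and then emits unigrams and bigrams. The function name, sample sentence, and stop-word set are made up for the example.

```python
import re

def word_ngrams(text, stop_words, ngram_range=(1, 2)):
    """Return the n-grams of `text` after stop-word removal."""
    tokens = re.findall(r'\b\w\w+\b', text.lower())
    tokens = [t for t in tokens if t not in stop_words]
    shortest, longest = ngram_range
    grams = []
    for n in range(shortest, longest + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[start:start + n]))
    return grams

# Illustrative sentence and stop words.
print(word_ngrams("I am looking to know more", {"am", "to", "more"}))
```

Because "to" is removed before the bigrams are built, "looking" and "know" become neighbours even though they were not adjacent in the original sentence.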


Now I can make predictions of which cluster a string belongs to...

In [15]:
import random

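# randrange excludes the upper bound; randint(0, len(issues)) could return
# len(issues) and raise an IndexError.
idx = random.randrange(len(issues))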

y = vectorizer.transform([issues[idx]['description']])
prediction = feature_model.predict(y)

print("The following command belongs in cluster {}:".format(prediction[0]))
print(issues[idx]['description'])

print("")
print("Feature cluster %d:" % prediction[0])
for ind in order_centroids[prediction[0], :10]:
    print(' %s' % terms[ind])

The following command belongs in cluster 15:
We're attempting to use Stackdriver Logging to log custom logs. Our log names look like this, depending on the type of process logging them, and the class inside the code that logged it:- projects/hivemp-dev/logs/hivemp%2Fbilling-paid-invoice-aggregation-scheduler%2FHiveMP.Api.Framework.Cluster.Cluster- projects/hivemp-dev/logs/hivemp%2Fbilling-paid-invoice-aggregation-processor%2FHiveMP.Api.Framework.Cluster.Cluster- projects/hivemp-dev/logs/hivemp%2Fbilling-paid-invoice-aggregation-processor%2FRedpoint.CommonFramework.Event.PubSub.GooglePubSubetc.The problem is that the UI cuts the "hivemp/<target>/" part off the "hivemp/<target>/<class>" string, and when you try and select "Cluster" across multiple targets, the dropdown doesn't work properly at all because it incorrectly goes "oh, the second Cluster selection is a duplicate (same text)", even though the prefix is different. This makes the Stackdriver logging interface pretty unusable if w

[Go to top](#top)

### Prioritize  <a class="anchor" id="prioritize"></a>

Now that we have a model with 50 clusters, we can start prioritizing them. How should we prioritize? Let's iterate over all of the issues and aggregate data into the list of clusters we created earlier. We will aggregate the number of issues and the combined votes.

In [35]:
for cluster in clusters:
    cluster['votes'] = 0
    cluster['issues'] = 0

for issue in issues:
    vectorized_description = vectorizer.transform([issue['description']])
    prediction = feature_model.predict(vectorized_description)
    # Store the label on the issue directly; int() keeps it a plain Python int.
    issue['prediction'] = int(prediction[0])
    clusters[issue['prediction']]['votes'] += issue['votes']
    clusters[issue['prediction']]['issues'] += 1
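
The loop above calls `predict` once per issue. Assuming the cluster labels for all issues are already available as a list (for example from the fitted model's `labels_` attribute, or from one `predict` call on the whole matrix), the tally itself needs only the standard library. The function name and the sample labels and votes below are hypothetical.

```python
from collections import defaultdict

def aggregate_by_cluster(labels, votes):
    """Count issues and sum votes for each cluster label."""
    totals = defaultdict(lambda: {"issues": 0, "votes": 0})
    for label, vote_count in zip(labels, votes):
        totals[label]["issues"] += 1
        totals[label]["votes"] += vote_count
    return dict(totals)

# Hypothetical labels and vote counts for four issues.
sample_labels = [15, 9, 15, 7]
sample_votes = [3, 0, 1, 2]
print(aggregate_by_cluster(sample_labels, sample_votes))
```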

Now let's sort the clusters by the number of issues or the number of votes to see the ordered lists...

In [36]:
clusters_by_issues = sorted(clusters, key=lambda k: k['issues'], reverse=True) 
clusters_by_votes = sorted(clusters, key=lambda k: k['votes'], reverse=True) 

print("Top clusters sorted by issue count: ")
for c in clusters_by_issues[:10]:
    print("Cluster: {}, issues: {}".format(c['index'], c['issues']))
    
print()
print("Top clusters sorted by vote count: ")
for c in clusters_by_votes[:10]:
    print("Cluster: {}, votes: {}".format(c['index'], c['votes']))

Top clusters sorted by issue count: 
Cluster: 15, issues: 1470
Cluster: 9, issues: 126
Cluster: 7, issues: 35
Cluster: 12, issues: 18
Cluster: 33, issues: 14
Cluster: 20, issues: 10
Cluster: 23, issues: 4
Cluster: 28, issues: 3
Cluster: 32, issues: 3
Cluster: 3, issues: 2

Top clusters sorted by vote count: 
Cluster: 15, votes: 1511
Cluster: 9, votes: 27
Cluster: 23, votes: 14
Cluster: 12, votes: 10
Cluster: 45, votes: 8
Cluster: 6, votes: 3
Cluster: 31, votes: 3
Cluster: 16, votes: 2
Cluster: 0, votes: 1
Cluster: 7, votes: 1


Although the sorted lists look similar, they are not in the same order. Since we have two criteria for sorting, we should come up with some kind of index that uses both the votes and the issue count values.

In [37]:
# Customize these if one is more important than the other.
vote_multiplier = 1
issue_multiplier = 1

# In order to normalize the votes and issue counts, we need the min/max
min_votes = min([c['votes'] for c in clusters])
max_votes = max([c['votes'] for c in clusters])
min_issues = min([c['issues'] for c in clusters])
max_issues = max([c['issues'] for c in clusters])

# Calculate the scores per cluster using the min/max.  Average the two scores.
for cluster in clusters:
    cluster['vote_score'] = vote_multiplier * ((cluster['votes'] - min_votes) / (max_votes - min_votes))
    cluster['issue_score'] = issue_multiplier * ((cluster['issues'] - min_issues) / (max_issues - min_issues))
    cluster['score'] = (cluster['vote_score'] + cluster['issue_score']) / 2

# Sort by the new score value.
clusters_by_score = sorted(clusters, key=lambda k: k['score'], reverse=True)

print("Top 10 clusters sorted by ratio: ")
for c in clusters_by_score[:10]:
    print("    Cluster: {0:5d}, score: {1:8.3f}, issue_score: {2:8.3f}, vote_score: {3:8.3f}".format(
            c['index'], c['score'], 
            clusters[c['index']]['issue_score'],
            clusters[c['index']]['vote_score']))
                                                                                       

Top 10 clusters sorted by ratio: 
    Cluster:    15, score:    1.000, issue_score:    1.000, vote_score:    1.000
    Cluster:     9, score:    0.051, issue_score:    0.085, vote_score:    0.018
    Cluster:     7, score:    0.012, issue_score:    0.023, vote_score:    0.001
    Cluster:    12, score:    0.009, issue_score:    0.012, vote_score:    0.007
    Cluster:    23, score:    0.006, issue_score:    0.002, vote_score:    0.009
    Cluster:    33, score:    0.005, issue_score:    0.009, vote_score:    0.001
    Cluster:    20, score:    0.003, issue_score:    0.006, vote_score:    0.000
    Cluster:    45, score:    0.003, issue_score:    0.000, vote_score:    0.005
    Cluster:    32, score:    0.001, issue_score:    0.001, vote_score:    0.001
    Cluster:     6, score:    0.001, issue_score:    0.000, vote_score:    0.002
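
The scoring used here is min-max normalization of each criterion followed by a weighted average. The sketch below restates it as two small functions and adds a guard for the case where every cluster has the same value, which would otherwise divide by zero. The function names and the sample numbers are illustrative, not taken from the notebook's data.

```python
def min_max(values):
    """Scale values to the range 0..1; all zeros if they are all equal."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]

def combined_scores(votes, issues, vote_multiplier=1, issue_multiplier=1):
    """Average the weighted, normalized vote and issue scores per cluster."""
    vote_scores = min_max(votes)
    issue_scores = min_max(issues)
    return [(vote_multiplier * v + issue_multiplier * i) / 2
            for v, i in zip(vote_scores, issue_scores)]

# Illustrative values for three clusters.
sample_votes = [10, 0, 5]
sample_issues = [4, 2, 2]
print(combined_scores(sample_votes, sample_issues))
```

With min-max scaling, a single dominant cluster pushes every other score toward zero, which matches the results above where cluster 15 scores 1.000 and the rest fall below 0.1.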


Let's dump some information about this top cluster...

In [38]:
print(json.dumps(clusters_by_score[0], indent=2, sort_keys=True))

{
  "features": [
    "using",
    "data",
    "time",
    "new",
    "users",
    "use",
    "like",
    "saved",
    "android",
    "time saved"
  ],
  "index": 15,
  "issue_score": 1.0,
  "issues": 1470,
  "ratio": 1.027891156462585,
  "score": 1.0,
  "vote_score": 1.0,
  "votes": 1511
}


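Let's take a sample of the issues that belong to one of the ranked clusters. The cell below uses position 8 in the score-sorted list (the ninth-ranked cluster); change the index to inspect another cluster, for example 0 for the top cluster...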

In [55]:
for issue in [i for i in issues if i['prediction'] == clusters_by_score[8]['index']]:
    print("-----")
    print(issue['description'])

-----
* Which Developer Preview build are you using? See Settings > About phone > Build number (for example OPP5.170921.005).sdk_gphone_x86-userdebug P PPP1.180208.011 4624533 dev-keys    * What device are you using? (for example, Pixel XL)Emulator (Nexus 5X, x86, Google APIs)    * What are the steps to reproduce the problem? (Please provide the minimal reproducible test case.)(on a clean emulator image — permissions must not have been granted before)1. Create any app that uses SliceManager to bind to a SliceProvider (host) — using the slice compat library 28.0.0-alpha12. In the host app, observe that there is no way to query for available SliceProviders and discover what Slices they offerThis means that there is no direct way to check if a certain slice provider is available without going to the implementation detail of slice providers being content providers and using the package manager to check.That is limited in that there is no way to check if a content provider is a slice provid