# The Google Product Manager Project

## Goal
The goal of this Jupyter notebook is to mine Google's issue tracker and produce two prioritized lists: one of features and one of issues.

## Method
The strategy for attaining this goal is as follows:

1. Mine Google's public issue tracker for raw data for each issue and feature.
2. Divide the work into features and issues (the next steps are repeated for each).
3. Use some kind of natural language toolkit to identify common terms and bigrams.
4. Use a machine learning unsupervised clustering algorithm to identify clusters of issues and features.
5. Perform a brief case study on the top 5 clusters.
6. Identify the scenario (W5H) for each cluster.
7. Identify the persona for each cluster.
8. Identify stories for each cluster.
9. Prioritize the stories.

## Process

### Mine Data

Google's issue tracker is public.  It allows anyone to report issues and request features.  The issue tracker does not have a supported API or SDK (as far as I can tell), but when loading the pages with the browser's developer mode enabled I can find a request that returns a JSON structure with the details needed for post-processing.

When looking at the issue tracker in developer mode, we can see that an XHR request is made using URLs like the following:

* Bugs: https://issuetracker.google.com/action/issues?count=25&p=1&q=status:open+type:bug&s=created_time:desc
* Features: https://issuetracker.google.com/action/issues?count=25&p=1&q=status:open+type:feature_request&s=created_time:desc

I did some research and noticed the following caveats:

* The response is multi-line, and the first line is not valid JSON.
* The HTTP request requires authentication, and there is no official API or SDK for the issue tracker.
* The query syntax requires the "+" character, which the Python requests library percent-encodes by default.
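
The third caveat can be reproduced without any network access. The following is a minimal sketch, assuming that requests encodes a `params` dictionary the same way the standard library's `urllib.parse.urlencode` does (I believe this is the case, but treat it as an assumption). `build_issue_url` is a hypothetical helper introduced only for illustration; it is not part of any library.

```python
from urllib.parse import urlencode

BASE_URL = "https://issuetracker.google.com/action/issues"

# Passing the query as a parameter dictionary percent-encodes ":" and "+".
encoded = urlencode({"q": "status:open+type:bug"})
print(encoded)  # q=status%3Aopen%2Btype%3Abug


def build_issue_url(issue_type, page=1, count=25):
    """Hypothetical helper: assemble the URL as a plain string so "+" is kept."""
    return (
        f"{BASE_URL}?count={count}&p={page}"
        f"&q=status:open+type:{issue_type}&s=created_time:desc"
    )


print(build_issue_url("bug"))
```

Building the whole URL as a string keeps the query exactly as the browser sent it, and making the page number an argument avoids editing the URL text later.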

First, I used the browsercookie library in Python to reuse my browser's session cookies for the authentication piece (I tried using APIs, but had difficulty).

In [3]:
import requests
import browsercookie
import json

cookies = browsercookie.load()

Firefox session filename does not exist: /home/daniel/.mozilla/firefox/edv7ja09.default/sessionstore.js


I wrote a bit of code to extract the 5,000 most recently created open bugs from the issue tracker:

In [4]:
# The requests library percent-encodes the "+" character when it is passed as a
# parameter, so as a workaround the whole URL is kept as a plain string.
bug_url = "https://issuetracker.google.com/action/issues?count=1000&p=1&q=status:open+type:bug&s=created_time:desc"
page = 1
bugs = []

while True:
    response = requests.get(bug_url, cookies=cookies)

    # The first line of the response is not valid JSON; the payload is on the second line.
    lines = [x for x in response.iter_lines()]
    issues = json.loads(lines[1].decode())['issues']

    if len(issues) == 0:
        break

    elif len(bugs) >= 5000:
        break

    else:
        bugs.extend(issues)
        # str.replace returns a new string, so assign it back; otherwise
        # the same page is requested on every iteration.
        bug_url = bug_url.replace("&p=" + str(page), "&p=" + str(page + 1))
        page += 1

print("Found {} bugs".format(len(bugs)))

Found 5000 bugs
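
The paging loop can also be written once and reused for both issue types. This is only a sketch under stated assumptions: `parse_tracker_response`, `fetch_all`, and `fake_fetch_page` are hypothetical names, the network call is replaced by a stand-in function so the example runs offline, and the junk first line in the fake response is made up (the real first line may look different).

```python
import json


def parse_tracker_response(body):
    """Skip the first (non-JSON) line and parse the second line as JSON."""
    lines = body.splitlines()
    return json.loads(lines[1])


def fetch_all(fetch_page, limit=5000):
    """Collect issues page by page until a page is empty or `limit` is reached.

    `fetch_page` is a hypothetical callable: page number -> raw response text.
    """
    collected = []
    page = 1
    while len(collected) < limit:
        issues = parse_tracker_response(fetch_page(page))["issues"]
        if not issues:
            break
        collected.extend(issues)
        page += 1  # advance explicitly so every request asks for a new page
    return collected[:limit]


def fake_fetch_page(page):
    """Stand-in for the HTTP request: three pages of two issues, then nothing."""
    issues = [{"issueId": page * 10 + i} for i in range(2)] if page <= 3 else []
    return "<<not json>>\n" + json.dumps({"issues": issues})


all_issues = fetch_all(fake_fetch_page, limit=5)
print(len(all_issues), [issue["issueId"] for issue in all_issues])
```

In the notebook, `fetch_page` would wrap `requests.get(..., cookies=cookies)` and return `response.text`. Checking that the collected issue IDs are unique is a quick way to confirm the page number really advances.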


I repeated the same process for features...

In [5]:
feature_url = "https://issuetracker.google.com/action/issues?count=1000&p=1&q=status:open+type:feature_request&s=created_time:desc"
page = 1
features = []

while True:
    response = requests.get(feature_url, cookies=cookies)

    # The first line of the response is not valid JSON; the payload is on the second line.
    lines = [x for x in response.iter_lines()]
    issues = json.loads(lines[1].decode())['issues']

    if len(issues) == 0:
        break

    elif len(features) >= 5000:
        break

    else:
        features.extend(issues)
        # str.replace returns a new string, so assign it back; otherwise
        # the same page is requested on every iteration.
        feature_url = feature_url.replace("&p=" + str(page), "&p=" + str(page + 1))
        page += 1

print("Found {} features".format(len(features)))

Found 5000 features


If we inspect a single issue, we can look at some of the data within, but the only text field we see is the title.

In [6]:
print(features[0]['snapshot'][0]['title'])

Custom sizing for each individual page


It appears that we are able to open each issue using a similar request that makes use of the issue ID.  An example for the issue above is:

https://issuetracker.google.com/action/issues/74163608

Let's try making an HTTP request using the same trick as above to get a description for each issue...

In [7]:
response = requests.get("https://issuetracker.google.com/action/issues/74163608", cookies=cookies)
lines = [x for x in response.iter_lines()]
the_issue = json.loads(lines[1].decode())

print(the_issue['events'][0]['comment'])

Is it possible to get the feature added to get individual sizing for each page on a datastudio report.<br>We have one page on a report that requires 900+ length and another that requires 600 to avoid scrolling an empty canvas.<br><br>This will be especially useful when it comes to embedding pages on a site or elsewhere for readability.


Now that we have some text, we can try to cluster it to see what groups emerge.  I will be using this guide as inspiration: https://pythonprogramminglanguage.com/kmeans-text-clustering/

Let's start by creating a corpus to cluster and cleaning up the text so that it does not contain HTML or escaped characters.  This operation is very time-consuming, so I have written it so that the results are cached in a local pickle file.

In [None]:
import re
import html
import pickle

try:
    # Attempt to open a pre-existing file with feature descriptions.
    with open('feature_documents.pickle', 'rb') as f:
        feature_documents = pickle.load(f)
        
    print("Found {} saved documents.".format(len(feature_documents)))

except FileNotFoundError:

    # If a file does not exist, then pull it manually.
    feature_documents = []
    
    for feature in features:
        url = "https://issuetracker.google.com/action/issues/" + str(feature['issueId'])
        response = requests.get(url, cookies=cookies)
        lines = [x for x in response.iter_lines()]
        first_comment = json.loads(lines[1].decode())['events'][0]['comment']
        document = re.sub('<[^<]+?>', '', first_comment) # Comments can contain HTML formatting, so strip the tags.
        document = re.sub(r'^https?:\/\/.*[\r\n]*', '', document, flags=re.MULTILINE) # Remove lines that start with a URL (URLs in the middle of a line are kept).
        document = html.unescape(document)
        feature_documents.append(document)
        
    with open('feature_documents.pickle', 'wb') as f:
        pickle.dump(feature_documents, f)
        
    print("Downloaded and saved {} documents.".format(len(feature_documents)))
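
The cleaning step can be tried on its own, without downloading anything. The sketch below is a variation on the cell above, not the same code: `clean_comment` is a hypothetical helper that replaces tags with a space (so words on either side of a `<br>` do not fuse together) and removes URLs anywhere in the text rather than only at the start of a line. The sample string is invented for illustration.

```python
import html
import re

TAG_RE = re.compile(r"<[^<]+?>")       # same tag pattern as the notebook cell
URL_RE = re.compile(r"https?://\S+")   # assumption: a URL ends at whitespace


def clean_comment(raw):
    """Strip HTML tags and URLs, unescape entities, and collapse whitespace."""
    text = TAG_RE.sub(" ", raw)        # a space keeps neighbouring words apart
    text = html.unescape(text)         # e.g. "&amp;" becomes "&"
    text = URL_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


sample = "Read this:<br>https://example.com/docs<br>Sizing &amp; layout per page."
print(clean_comment(sample))  # Read this: Sizing & layout per page.
```

Whether URLs should be dropped entirely is a judgment call; host names such as `cloud.google.com` may carry useful signal for clustering.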

Now, let's apply k-means clustering to produce a list of the top 10 terms in each feature cluster.

In [36]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Weight unigrams and bigrams by TF-IDF, ignoring common English stop words.
vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1,2))
#vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words='english', min_df=1)
x = vectorizer.fit_transform(feature_documents)

number_of_clusters = 100
feature_model = KMeans(n_clusters=number_of_clusters, init='k-means++', max_iter=100, n_init=1)
feature_model.fit(x)

print("Top terms per feature cluster:")
print()
# For each cluster, sort term indices by descending weight in the centroid.
order_centroids = feature_model.cluster_centers_.argsort()[:, ::-1]
# get_feature_names_out() requires scikit-learn 1.0+; older versions use get_feature_names().
terms = vectorizer.get_feature_names_out()
for i in range(number_of_clusters):
    print("Feature cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])
    print()

Top terms per feature cluster:

Feature cluster 0:
 send
 current
 incredibly
 screen send
 app feedback
 feedback page
 power
 shortcut
 shortcut holding
 option current

Feature cluster 1:
 feature
 startup
 issues
 issue
 provide
 existing
 feature requested
 receive
 search
 new feature

Feature cluster 2:
 table
 features
 feature
 effectively
 readers
 reports descriptive
 definitions
 definitions metrics
 example3 impact
 example3

Feature cluster 3:
 drive api
 drive
 id
 id export
 type true
 using id
 identifies pages
 identifies
 id exported
 specific mime

Feature cluster 4:
 feature
 issue
 apps script
 provide
 apps
 existing
 script
 feature requested
 search
 issues

Feature cluster 5:
 sql
 com sql
 https cloud
 cloud google
 cloud
 maintenance
 sql docs
 mysql
 docs mysql
 google com

Feature cluster 6:
 features
 feature
 using
 feature impact
 feature plan
 features feature
 vote existing
 remember
 remember vote
 help prioritize

Feature cluster 7:
 bugs carefully
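
To make the `ngram_range=(1,2)` setting more concrete, here is a standard-library sketch of how unigrams and bigrams can be extracted and counted. It only approximates what `TfidfVectorizer` does: the stop-word list is a tiny stand-in for scikit-learn's English list, the tokenizer is simplified, and raw counts are shown instead of TF-IDF weights. `unigrams_and_bigrams` and the two sample documents are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; scikit-learn's 'english' list is much longer.
STOP_WORDS = {"a", "an", "the", "to", "for", "of", "on", "is", "it", "and"}


def unigrams_and_bigrams(text):
    """Return lowercase word unigrams plus adjacent-word bigrams.

    Stop words are removed before the bigrams are formed, so a bigram can
    join two words that were not adjacent in the original sentence.
    """
    tokens = [t for t in re.findall(r"[a-z0-9]{2,}", text.lower())
              if t not in STOP_WORDS]
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


docs = [
    "Add an option to export the report",
    "Export option for the Drive API",
]
counts = Counter(term for doc in docs for term in unigrams_and_bigrams(doc))
print(counts.most_common(3))
```

This is why the cluster output contains pairs such as "screen send" or "option current": the stop words between them were dropped before the pairs were built.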


Now I can make predictions of which cluster a string belongs to...

In [31]:
import random

# Pick a random feature; randrange excludes the upper bound, so the index is always valid.
idx = random.randrange(len(features))
feature = features[idx]

url = "https://issuetracker.google.com/action/issues/" + str(feature['issueId'])
response = requests.get(url, cookies=cookies)
lines = [x for x in response.iter_lines()]
first_comment = json.loads(lines[1].decode())['events'][0]['comment']
document = re.sub('<[^<]+?>', '', first_comment) # Comments can contain HTML formatting, so strip the tags.
document = re.sub(r'^https?:\/\/.*[\r\n]*', '', document, flags=re.MULTILINE) # Remove lines that start with a URL (URLs in the middle of a line are kept).
document = html.unescape(document)

y = vectorizer.transform([document])
prediction = feature_model.predict(y)

print("The following command belongs in cluster {}:".format(prediction[0]))
print(document)

The following command belongs in cluster 25:
I recently read about these Changes:https://android-review.googlesource.com/c/platform/system/sepolicy/+/577303https://android-review.googlesource.com/c/platform/system/sepolicy/+/588493and oppose them, as they will basically impose the same restrictions as an iPhone.The reason: There seem to be no way for the user to indicate that background recording/filming should be allowed. There should be an per-app permission setting, where the user can change this.Why I Think a user-changeable setting should be available:For camera:Some lockscreens have the ability to take the photo of the user at attempts of entering the incorrect password, and then email this photo to the phone owner.Since such apps will Always be at the background while the phone is locked, I Think there should be a possibility for the user to indicate that he/her want to allow the app to record video/take photos while idle.For mic:The mic change is a bit more aggressive. This cha

I have created a model that has used a sample of 100 features to 