# The Google Product Manager Project <a class="anchor" id="top"></a>

## Table of contents: 
* [Goal](#goal)
* [Method](#method)
* [Requirements](#requirements)
* [Process](#process)
    * [Authenticate](#auth)
    * [List issues](#list)
    * [Corpus](#corpus)
    * [Cluster](#cluster)
    * [Prioritize](#prioritize)

## Goal <a class="anchor" id="goal"></a>

The goal of this Jupyter notebook is to mine Google's issue tracker and produce two prioritized lists: one of features and one of bugs.

[Go to top](#top)

## Method <a class="anchor" id="method"></a>

The following is a high-level strategy for attaining the above goal:

1. Mine Google's public issue tracker for raw data on features and bugs (the next steps are repeated for each).
2. Extract the "description" of the issue.
3. Use a vectorizer or natural language library to extract the text features (bigrams and unigrams)
4. Use an unsupervised clustering algorithm to identify clusters of issues.
5. Prioritize the top 3 clusters using an index generated by the number of issues and votes.
6. Identify the scenario (W5H) for each cluster.
7. Identify the persona for each cluster.
8. Identify stories for each cluster.
9. Prioritize the stories.
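
Step 5 does not spell out how the index is computed, so the following is only a sketch under an assumed formula: each cluster's score is its number of issues plus its total votes. The function name `rank_clusters`, the equal weighting, and the sample data are all hypothetical and are not taken from the notebook.

```python
from collections import defaultdict


def rank_clusters(labels, votes, top_n=3):
    """Rank clusters by an ASSUMED index: issue count + total votes."""
    totals = defaultdict(lambda: {"issues": 0, "votes": 0})
    for label, vote_count in zip(labels, votes):
        totals[label]["issues"] += 1
        totals[label]["votes"] += vote_count

    # Hypothetical index: equal weight for issue count and vote total.
    scored = [(stats["issues"] + stats["votes"], label)
              for label, stats in totals.items()]
    # Highest score first; ties are broken by the lower cluster label.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [label for _, label in scored[:top_n]]


# Made-up data: one cluster label and one vote count per issue.
labels = [0, 0, 1, 2, 2, 2, 3]
votes = [5, 1, 20, 0, 1, 0, 2]
print(rank_clusters(labels, votes))
```

A different weighting (for example, scaling votes down so that a single heavily voted issue does not dominate) would change the ranking, so the formula should be treated as a tunable choice.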

[Go to top](#top)

## Requirements <a class="anchor" id="requirements"></a>

1. Clone this repository.
2. Ensure you have python3 and jupyter-notebook installed on your system.
3. Create a virtual environment.
4. Activate the virtual environment.
5. Install the dependencies in the requirements.txt file.
6. Run the command jupyter-notebook.
7. Open the google-product-manager.ipynb document.

[Go to top](#top)

## Process <a class="anchor" id="process"></a>

### Authenticate <a class="anchor" id="auth"></a>

Google's issue tracker is public. It allows anyone to report issues and request features. The issue tracker does not have a supported API or SDK (as far as I can tell), but when loading the pages with the browser's developer tools open, I can find a request that returns a JSON structure with the details necessary for post-processing.

Looking at the issue tracker with the developer tools open, we can see that an XHR request is made to one of the following URLs:

* Bugs: https://issuetracker.google.com/action/issues?count=25&p=1&q=status:open+type:bug&s=created_time:desc
* Features: https://issuetracker.google.com/action/issues?count=25&p=1&q=status:open+type:feature_request&s=created_time:desc

Since the HTTP request requires authentication and there is no official API or SDK for the issue tracker, I used the browsercookie library in Python to reuse my browser's session cookies instead of implementing authentication myself (I tried using APIs, but had difficulty).

In [1]:
import requests
import browsercookie
import json

cookies = browsercookie.load()

Firefox session filename does not exist: /home/daniel/.mozilla/firefox/edv7ja09.default/sessionstore.js


[Go to top](#top)

### Get list of issues <a class="anchor" id="list"></a>

After looking at the response, I found several caveats:
* The response is multi-line and the first line is not valid JSON.
* The query syntax requires the character "+", which the Python requests library encodes by default.
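
The first caveat can be handled by skipping the leading line before parsing. The sketch below shows the idea on a made-up payload; the prefix `)]}'` and the helper name `parse_prefixed_json` are assumptions for illustration, since the notebook does not record what the first line actually contains.

```python
import json


def parse_prefixed_json(body):
    """Parse a bytes payload whose first line is a non-JSON prefix."""
    lines = body.splitlines()
    if len(lines) < 2:
        raise ValueError("expected a prefix line followed by a JSON line")
    # The second line is assumed to hold the whole JSON document.
    return json.loads(lines[1].decode())


# Hypothetical response body: a junk first line, then the JSON payload.
sample_body = b")]}'\n{\"issues\": [{\"issueId\": 1}]}"
data = parse_prefixed_json(sample_body)
print(len(data["issues"]))
```

Raising an explicit error for a one-line body makes an unexpected response (for example, an expired session) easier to spot than a bare IndexError.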

I created two queries that produce a list of bugs and a list of feature requests, each sorted by creation date. I then wrote some code that pulls down the issues for each query and works around these limitations.

In [14]:
def pull_issues(issue_type, issue_limit=5000):
    # Only the two issue types used in this notebook are supported.
    assert issue_type in ["feature_request", "bug"]

    # The "+" separators are written directly into the URL instead of being passed as parameters.
    generic_url = "https://issuetracker.google.com/action/issues?count=999&p={page}&q=status:open+type:{issue_type}&s=created_time:desc"

    page = 1
    collection = []

    # Keep requesting pages until the limit is reached or a page comes back empty.
    while len(collection) < issue_limit:
        url = generic_url.format(issue_type=issue_type, page=page)
        response = requests.get(url, cookies=cookies)
        lines = [x for x in response.iter_lines()] # workaround: the first line is not valid JSON.

        data = json.loads(lines[1].decode())

        if not data.get('issues'):
            break

        collection.extend(data['issues'])
        page += 1

    return collection[:issue_limit]
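
The "+" caveat can be illustrated without any network access. In the sketch below, the standard library's `urlencode` stands in for the encoding that requests applies to query parameters (an assumption about how closely the two match), and `build_issue_url` is a hypothetical helper that writes the query into the URL literally, as the function above does.

```python
from urllib.parse import urlencode

BASE_URL = "https://issuetracker.google.com/action/issues"


def build_issue_url(issue_type, page, count=25):
    """Format the query into the URL by hand so "+" stays literal."""
    query = "status:open+type:{}".format(issue_type)
    return "{}?count={}&p={}&q={}&s=created_time:desc".format(
        BASE_URL, count, page, query)


# Encoding the query as a parameter escapes both ":" and "+".
encoded_query = urlencode({"q": "status:open+type:bug"})
# Formatting the URL by hand keeps the characters untouched.
literal_url = build_issue_url("bug", page=1)

print(encoded_query)
print(literal_url)
```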

Now we use the function to pull down the bug and feature lists.

In [15]:
bugs = pull_issues('bug')
features = pull_issues('feature_request')

print("Found {} bugs.".format(len(bugs)))
print("Found {} features.".format(len(features)))

Found 5000 bugs.
Found 5000 features.


[Go to top](#top)

### Corpus <a class="anchor" id="corpus"></a>

If we inspect a single issue, we can take a look at some of the data within, but the only text data that we see is the title.

In [16]:
print(features[0]['snapshot'][0]['title'])

Checksums for private Compute Engine Images


It appears that we are able to open each issue using a similar request that makes use of the issue ID. An example for the issue above is:

https://issuetracker.google.com/action/issues/74163608

In the response, the zero-indexed event holds the initial comment, which reads like a verbose description of the problem or feature request. Let's try making an HTTP request using the same trick as above to get a description for each issue...

In [17]:
response = requests.get("https://issuetracker.google.com/action/issues/74163608", cookies=cookies)
lines = [x for x in response.iter_lines()]
the_issue = json.loads(lines[1].decode())

print(the_issue['events'][0]['comment'])

Is it possible to get the feature added to get individual sizing for each page on a datastudio report.<br>We have one page on a report that requires 900+ length and another that requires 600 to avoid scrolling an empty canvas.<br><br>This will be especially useful when it comes to embedding pages on a site or elsewhere for readability.


Now that we have some text, we can try to cluster it to see what groups come out. I will be using this guide as inspiration: https://pythonprogramminglanguage.com/kmeans-text-clustering/

Let's start by creating a corpus to cluster and cleaning up the text so that it does not contain HTML or escaped characters.  

**WARNING:** This operation is very time consuming, so I have written it so that the results are stored in a local pickle file and reused on later runs. If you want to limit the number of issues you are clustering, add a slice to the "for feature in features:" line; for example, "for feature in features[:5]:" will only download the details for 5 issues.

In [18]:
import re
import html
import pickle

try:
    # Attempt to open a pre-existing file with feature descriptions.
    with open('feature_documents.pickle', 'rb') as f:
        feature_documents = pickle.load(f)
        
    print("Found {} saved documents.".format(len(feature_documents)))

except FileNotFoundError:

    # If a file does not exist, then pull it manually.
    feature_documents = []
    
    for feature in features:
        url = "https://issuetracker.google.com/action/issues/" + str(feature['issueId'])
        response = requests.get(url, cookies=cookies)
        lines = [x for x in response.iter_lines()]
        try:
            first_comment = json.loads(lines[1].decode())['events'][0]['comment']
            document = re.sub('<[^<]+?>', '', first_comment) # There appears to be HTML formatting support in comments.
            document = re.sub(r'^https?:\/\/.*[\r\n]*', '', document, flags=re.MULTILINE) # Remove lines that start with a URL.
            document = html.unescape(document)
            feature_documents.append(document)
        except (KeyError, IndexError):
            # Skip issues whose response has no second line, no events, or no comment.
            continue
        
    with open('feature_documents.pickle', 'wb') as f:
        pickle.dump(feature_documents, f)
        
    print("Downloaded and saved {} documents.".format(len(feature_documents)))

Found 4990 saved documents.
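
The cleaning steps in the cell above can be tried in isolation on a small string. The sketch below wraps them in a hypothetical helper, `clean_comment`, with two deliberate variations that are not in the notebook: tags are replaced with a space rather than removed, so that words on either side of a `<br>` do not run together, and whitespace is collapsed at the end. The sample text is made up.

```python
import html
import re


def clean_comment(raw_comment):
    """Strip HTML tags, URL-only lines, and HTML entities from a comment."""
    # Variation on the notebook: substitute a space so words stay separated.
    text = re.sub(r'<[^<]+?>', ' ', raw_comment)
    # Drop lines that start with a URL (same pattern as the notebook).
    text = re.sub(r'^https?://.*[\r\n]*', '', text, flags=re.MULTILINE)
    # Turn entities such as &amp; back into plain characters.
    text = html.unescape(text)
    # Variation on the notebook: collapse runs of whitespace.
    return re.sub(r'\s+', ' ', text).strip()


sample = ("Per-page sizing?<br>One page needs 900+ &amp; another 600.\n"
          "https://example.com/report\n"
          "Thanks")
cleaned = clean_comment(sample)
print(cleaned)
```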


[Go to top](#top)

### Cluster <a class="anchor" id="cluster"></a>

Now, let's apply k-means clustering to produce a list of the top 10 terms (text features) in each cluster.

In [19]:
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Extend the built-in English stop words with terms that appear in almost every issue.
# A list is used because some newer scikit-learn releases expect a list rather than a frozenset.
stop_words = list(text.ENGLISH_STOP_WORDS.union(["feature", "features", "issue", "issues", "requests", "thing", "things"]))

# Build TF-IDF features from unigrams and bigrams.
vectorizer = TfidfVectorizer(stop_words=stop_words, ngram_range=(1,2))
x = vectorizer.fit_transform(feature_documents)

number_of_clusters = 50
feature_model = KMeans(n_clusters=number_of_clusters, init='k-means++', max_iter=100, n_init=1)
feature_model.fit(x)

print("Top terms per feature cluster:")
print()
# Sort each centroid's term weights in descending order.
order_centroids = feature_model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names() # On scikit-learn 1.0+, use get_feature_names_out() instead.
for i in range(number_of_clusters):
    print("Feature cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind])
    print()

Top terms per feature cluster:

Feature cluster 0:
 e3
 support
 format
 bq
 load bq
 tracking support
 orc
 orc format
 level tracking
 format load

Feature cluster 1:
 github
 github com
 https github
 container
 com
 https
 container builder
 repo
 libphonenumber
 builder

Feature cluster 2:
 send
 forward
 tel
 students
 apk
 student
 request
 mail
 ble
 material

Feature cluster 3:
 animal
 post
 grocery
 list
 shelters
 animal shelters
 location
 descriptionis
 quoted phpbb
 way quote

Feature cluster 4:
 data studio
 studio
 data
 using
 help
 language
 business
 queries
 time saved
 saved

Feature cluster 5:
 quota
 engine
 app engine
 app
 quotas
 job
 limit
 quota limit
 serviceaccountsperproject
 serviceaccountsperproject quota

Feature cluster 6:
 log
 stackdriver
 logs
 logging
 log entries
 stackdriver logging
 hivemp
 applog
 chatty
 pagerduty

Feature cluster 7:
 android
 app
 screen
 phone
 pi
 orange
 mode
 like
 auto
 used

Feature cluster 8:
 elements
 selection
 ab
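
Terms such as "format load" and "load bq" in cluster 0 look odd until the effect of ngram_range=(1,2) combined with stop-word removal is considered. The plain-Python sketch below approximates what I understand the vectorizer's default analyzer to do (lowercase, keep tokens of two or more word characters, drop stop words, then form bigrams from what remains). It is not scikit-learn's implementation; the helper name, the tiny stop-word set, and the sample sentence are made up.

```python
import re

# Tiny illustrative stop-word set, not scikit-learn's English list.
EXAMPLE_STOP_WORDS = {"to", "into", "the", "a"}


def unigrams_and_bigrams(document, stop_words=EXAMPLE_STOP_WORDS):
    """Return unigrams followed by bigrams built after stop-word removal."""
    tokens = re.findall(r"\b\w\w+\b", document.lower())
    tokens = [token for token in tokens if token not in stop_words]
    # Bigrams pair neighbours in the filtered list, so words that were
    # separated by a stop word end up adjacent.
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


terms_found = unigrams_and_bigrams("Support ORC format to load into BQ")
print(terms_found)
```

Because "to" and "into" are removed before pairing, "format" and "load" become neighbours, which would explain bigrams that never appear verbatim in the original text.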

Now I can make predictions of which cluster a string belongs to...

In [20]:
import random

idx = random.randrange(len(features)) # randint's upper bound is inclusive and could index past the end of the list.

url = "https://issuetracker.google.com/action/issues/" + str(features[idx]['issueId'])
response = requests.get(url, cookies=cookies)
lines = [x for x in response.iter_lines()]
first_comment = json.loads(lines[1].decode())['events'][0]['comment']
document = re.sub('<[^<]+?>', '', first_comment) # There appears to be HTML formatting support in comments.
document = re.sub(r'^https?:\/\/.*[\r\n]*', '', document, flags=re.MULTILINE) # Remove lines that start with a URL.
document = html.unescape(document)

y = vectorizer.transform([document])
prediction = feature_model.predict(y)

print("The following command belongs in cluster {}:".format(prediction[0]))
print(document)

The following command belongs in cluster 25:
Issue summary: Provide support for multi regional Managed instance groups Business impact for users: wants to ensure cookie session affinity stays intact, even if client changes IP, uses a proxy or a VPN. Task the customer wishes to accomplish: Either have multi-regional instance groups, or have session affinity take precedence over the LB location algorithm. Current functionality if applicable: Multi-zone Current customer workaround(s): none, must create multiple managed instance groups and have multiple backend services


[Go to top](#top)

### Prioritize  <a class="anchor" id="prioritize"></a>

I have created a model that has used a sample of 100 features to 

