## User Feedback Data Collection

This notebook is divided into two parts:
<br/><br/>
(A) Training Data <br/>
The model that classifies Reddit posts as user feedback was trained on positive examples from User Voice and on negative examples labelled manually from a set of Reddit posts. The data is stored in the text files *positive_training_data.txt* and *negative_training_data.txt*, respectively. Running these code cells updates those files. This is only needed occasionally, when you want to add fresh data. Recommended: also add newly classified posts to the training data to improve the model.
<br/><br/>
(B) Social Media Raw Feedback <br/>
Reddit contains a vast amount of user data, some of which is extremely valuable user feedback, including requests for new features and improvements to existing ones. Hence, the trained model is applied to the bulk of the Reddit data to extract the relevant information. These cells extract unfiltered data from Reddit for a given query. Two queries are included. This should be run every couple of days to keep a constant influx of data. Here, the data is written to a new file rather than appended to the old one, so that the same data is not classified multiple times.
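
Since each run is meant to write to a new file, it can help to pick the next unused file number automatically. The following is a minimal sketch, not part of the original notebook: `next_file_iteration` is a hypothetical helper, and it assumes the files follow the `unfiltered_reddit_data<n>.txt` naming used further down.

```python
from pathlib import Path

def next_file_iteration(directory=".", prefix="unfiltered_reddit_data"):
    """Return the next unused iteration number (as a string) for <prefix><n>.txt files.

    Hypothetical helper: assumes the numeric suffix directly follows the prefix.
    """
    used = []
    for path in Path(directory).glob(prefix + "*.txt"):
        suffix = path.stem[len(prefix):]
        # Ignore files whose suffix is not a plain number, e.g. "<prefix>_old.txt".
        if suffix.isdecimal():
            used.append(int(suffix))
    return str(max(used, default=0) + 1)
```

The result is a string, so it could be passed directly as the `file_iteration` argument of `get_unfiltered_reddit_data`.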

In [10]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
import re

### Training Data

#### User Voice Data: 

In [1]:
def get_uservoice_page(page_number):
    # Fetch one page of the "hot" ideas list from the PowerPoint for the web forum.
    link = "https://powerpoint.uservoice.com/forums/270149-powerpoint-for-the-web?filter=hot&page="+str(page_number)
    html_uservoice = requests.get(link)
    return html_uservoice

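def get_page_feedback(html_uservoice):
    # Parse the page and collect the text of every idea header.
    soup = BeautifulSoup(html_uservoice.content, 'html.parser')
    feedback = soup.select('div[class="uvIdeaHeader"]')
    all_feedback = []

    for f in feedback:
        text = f.get_text()
        # Flatten each idea to a single line.
        text = text.replace("\n", "").strip()
        # Runs of four spaces separate the parts of the header; join them with ". ".
        text = text.replace("    ", ". ").strip()
        all_feedback.append(text)

    return all_feedback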

def get_uservoice_feedback():
    # Append one idea per line to the positive training file.
    with open("positive_training_data.txt", "a") as file_object:
        # Pages 1 to 28 of the forum.
        for i in range(1, 29):
            html_uservoice = get_uservoice_page(i)
            all_feedback = get_page_feedback(html_uservoice)
            for f in all_feedback:
                file_object.write(f + "\n")

In [2]:
get_uservoice_feedback()
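
Because the training file is opened in append mode, running the cell above more than once writes the same ideas again. A minimal sketch of one way to clean this up afterwards follows; `dedupe_lines` is a hypothetical helper that is not part of the original notebook, and it assumes one example per line.

```python
def dedupe_lines(path):
    """Rewrite the file at `path`, keeping the first occurrence of each non-blank line.

    Hypothetical helper. Returns the number of lines that were dropped.
    """
    with open(path) as file_object:
        lines = file_object.read().splitlines()
    # dict.fromkeys keeps insertion order, so the original ordering is preserved.
    unique = list(dict.fromkeys(line for line in lines if line.strip()))
    with open(path, "w") as file_object:
        file_object.write("".join(line + "\n" for line in unique))
    return len(lines) - len(unique)
```

For example, `dedupe_lines("positive_training_data.txt")` could be run after `get_uservoice_feedback()`.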

#### Reddit Data: 

In [3]:
# After running this, go through the file and remove any positive examples, then copy the remaining negative examples over to negative_training_data.txt (and optionally delete this file).
# Matches Markdown links of the form [text](url "optional title") so they can be stripped.
link_pattern = re.compile(r'\[(.+)\]\(([^ ]+)( "(.+)")?\)')
with open("unlabelled_training_data.txt", "a") as file_object:
    q_powerpoint = requests.get('https://www.reddit.com/r/powerpoint/search.json?q=powerpoint&sort=new&restrict_sr=0&limit=100', headers = {'User-agent': 'your bot 0.1'})
    reddit_data = q_powerpoint.json()["data"]["children"]
    for r in reddit_data:
        title = link_pattern.sub('', r["data"]["title"].replace("\n", " ").strip())
        selftext = link_pattern.sub('', r["data"]["selftext"].replace("\n", " ").strip())
        feedback = title + " " + selftext + "\n"
        # Skip very long posts.
        if len(feedback) < 2200:
            file_object.write(feedback)
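
The link-stripping pattern in the cell above uses greedy `.+` groups, so when a post contains two Markdown links on the same line, the text between them can be removed as well. The sketch below shows a tighter variant under the assumption that only the links themselves should be dropped. The names `MARKDOWN_LINK`, `strip_markdown_links`, and `build_feedback_line` are hypothetical and do not appear in the original notebook.

```python
import re

# Assumed variant of the notebook's pattern: character classes instead of greedy ".+",
# so each match stops at the end of a single [text](url "title") link.
MARKDOWN_LINK = re.compile(r'\[([^\]]+)\]\(([^ )]+)( "([^"]*)")?\)')

def strip_markdown_links(text):
    """Flatten newlines to spaces and remove Markdown links (link text included)."""
    return MARKDOWN_LINK.sub("", text.replace("\n", " ").strip())

def build_feedback_line(title, selftext):
    """Combine a post's title and body into one newline-terminated line."""
    return strip_markdown_links(title) + " " + strip_markdown_links(selftext) + "\n"
```

Like the original, this removes the link text along with the URL, so the surrounding spaces remain.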

### Social Media Raw Data

In [11]:
def get_unfiltered_reddit_data(query, file_iteration):
    # Each run writes to its own file, e.g. unfiltered_reddit_data1.txt.
    filename = "unfiltered_reddit_data" + str(file_iteration) + ".txt"
    # Matches Markdown links of the form [text](url "optional title") so they can be stripped.
    link_pattern = re.compile(r'\[(.+)\]\(([^ ]+)( "(.+)")?\)')
    with open(filename, "a") as file_object:
        q_powerpoint = requests.get(query, headers = {'User-agent': 'your bot 0.1'})
        reddit_data = q_powerpoint.json()["data"]["children"]
        for r in reddit_data:
            title = link_pattern.sub('', r["data"]["title"].replace("\n", " ").strip())
            selftext = link_pattern.sub('', r["data"]["selftext"].replace("\n", " ").strip())
            feedback = title + " " + selftext + "\n"
            # Skip very long posts.
            if len(feedback) < 2200:
                file_object.write(feedback)
                

In [12]:
# Run get_unfiltered_reddit_data
# Query 1 (typically has better feature feedback): https://www.reddit.com/r/powerpoint/search.json?q=features&sort=new&restrict_sr=1&limit=100
# Query 2: https://www.reddit.com/r/powerpoint/search.json?q=powerpoint&sort=new&restrict_sr=0&limit=100


In [13]:
get_unfiltered_reddit_data("https://www.reddit.com/r/powerpoint/search.json?q=powerpoint&sort=new&restrict_sr=0&limit=100", "1")
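
The two queries listed above differ only in the search term and the `restrict_sr` flag, so the URL can also be assembled from parameters rather than typed out. This is a minimal sketch using the standard library; `build_search_url` is a hypothetical helper, and the parameter names are taken from the query strings above.

```python
from urllib.parse import urlencode

def build_search_url(query, restrict_to_subreddit, subreddit="powerpoint", limit=100):
    """Build a Reddit search.json URL in the same shape as the queries above.

    Hypothetical helper; defaults mirror the values used in this notebook.
    """
    params = {
        "q": query,
        "sort": "new",
        "restrict_sr": 1 if restrict_to_subreddit else 0,
        "limit": limit,
    }
    return "https://www.reddit.com/r/" + subreddit + "/search.json?" + urlencode(params)
```

With this, `build_search_url("features", True)` reproduces Query 1 and `build_search_url("powerpoint", False)` reproduces Query 2, and either result could be passed to `get_unfiltered_reddit_data`.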
