# Analyzing Overwatch Forum Posts

This is an analysis of data I scraped from Blizzard's official Overwatch general discussion <a href="https://us.battle.net/forums/en/overwatch/22813879/">forum</a>. For more information on how the scraping was performed, or about this project in general, visit my repo <a href="https://github.com/dskarbrevik/W266_Project_Skarbrevik">here</a>.

<a id="toc"></a>

## Table of Contents
<ol>
    <li>[Setup the environment](#section1)</li>
    <li>[Preprocess the data](#section2)</li>
    <li>[Basic EDA](#section3)</li>
    <li>[Exploring "per month" slices](#section4)</li>
    <li>[Tokenize the text](#section5)</li>
</ol>

***

<a id='section1'></a>

## 1) Setup the environment



#### Import libraries

In [1]:
import pandas as pd
import numpy as np
import utils # utils.py can be found in repo; contains helper functions used in this notebook
import nltk
import re
import sys
import string
from nltk.corpus import stopwords
from nltk import word_tokenize, sent_tokenize
from collections import Counter, defaultdict

In [37]:
# in case I need to reload utils after editing it...

#import importlib
#importlib.reload(utils)

<module 'utils' from 'C:\\Users\\skarb\\Desktop\\GitHub\\Analyzing-Forum-Text\\utils.py'>

Note: NLTK data must be downloaded before running the later cells in this notebook (the tokenizer and the stopword list both depend on it). Run `nltk.download()` to install the necessary data.

#### Import data

In [2]:
ow_db1 = pd.read_csv("E:\\Data Science Playgrounds\\Blizzard Forums Project\\Web Scrapes\\11-18-17\\overwatch_database_1_5248.csv", encoding = "ISO-8859-1")
ow_db2 = pd.read_csv("E:\\Data Science Playgrounds\\Blizzard Forums Project\\Web Scrapes\\11-18-17\\overwatch_database_5249_7691.csv", encoding = "ISO-8859-1")
ow_db3 = pd.read_csv("E:\\Data Science Playgrounds\\Blizzard Forums Project\\Web Scrapes\\11-18-17\\overwatch_database_7692_8655.csv", encoding = "ISO-8859-1")

#### Concatenate data files

In [3]:
all_db = [ow_db1, ow_db2, ow_db3]
master_db = pd.concat(all_db)
before_preprocess = master_db.head() # save to compare later for fun

***

<div align="right">
    [back to top](#toc)
</div>

<a id='section2'></a>

## 2) Preprocess the data

#### a) Remove any rows with more than 1 NaN value
Setting `thresh` to one less than the number of columns keeps only the rows that have at most 1 NaN value. This way we keep forum posts where the user entered a topic line but no body text (as this may still be valuable data).

In [4]:
rows_before = master_db.shape[0]
master_db = master_db.dropna(thresh=master_db.shape[1] - 1) # keep rows with at most one NaN
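
To make the `thresh` behavior concrete, here is a minimal sketch on a made-up three-column frame (the toy values are assumptions for illustration, not rows from the dataset). `thresh` is the minimum number of non-NaN values a row needs in order to be kept.

```python
import numpy as np
import pandas as pd

# Toy frame with three columns; values are invented for illustration.
toy = pd.DataFrame({
    "topic": ["Hello", "Only a title", np.nan],
    "text": ["Full post", np.nan, np.nan],
    "user": ["a", "b", "c"],
})

# With 3 columns, thresh=2 keeps rows that have at most one NaN.
kept = toy.dropna(thresh=toy.shape[1] - 1)
print(kept)
```

The row with a topic but no body text survives, while the row missing both topic and text is dropped.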

#### b) Remove users that are "community managers" (i.e. employees)

This removes all forum posts from users that are employees of Blizzard. These posts are not representative of the thoughts, motivation, or language of a non-affiliated forum user, so I believe removing them is a necessary step. I compiled the list of employee users by hand, so it may not be complete, but to the best of my knowledge it is. Ultimately, this removes a very small number of rows from the dataset, so it is unlikely to be consequential either way.

In [5]:
admin_list = ["Lylirra", "Tom Powers", "Josh Engen", "Jeff Kaplan", "Scott Mercer", "Michael Chu", "Geoff Goodman", "Zoevia"]

admins_in_df = utils.check_for_admins(master_db, "user", admin_list)

master_db = utils.remove_admins(master_db, "user", admins_in_df)

There were admins in the dataframe.
Found 5 admins in the dataframe. Removing those users dropped 59 rows from the dataframe.
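
The helper functions live in `utils.py` and are not shown here. As a rough sketch of the idea only, assuming the filtering is done with pandas' `isin`: the function names `find_admins` and `drop_admin_rows` below are hypothetical stand-ins, not the actual helpers from the repo.

```python
import pandas as pd

def find_admins(df, user_col, admin_names):
    """Return the admin names that actually occur in df[user_col]."""
    present = set(df[user_col].unique())
    return [name for name in admin_names if name in present]

def drop_admin_rows(df, user_col, admins):
    """Return a copy of df without the rows posted by any name in admins."""
    return df[~df[user_col].isin(admins)].copy()

# Toy data; the posts are invented for illustration.
toy = pd.DataFrame({"user": ["Lylirra", "Ralavik", "Lylirra", "Omni"],
                    "text": ["a", "b", "c", "d"]})
found = find_admins(toy, "user", ["Lylirra", "Jeff Kaplan"])
cleaned = drop_admin_rows(toy, "user", found)
print("Found {} admins, dropped {} rows".format(len(found), len(toy) - len(cleaned)))
```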


#### c) Convert 'date' and 'time' columns to datetime objects and combine them

(This could take a minute)

In [6]:
master_db['date'] = pd.to_datetime(master_db['date'])
master_db['time'] = pd.to_timedelta(master_db['time'] + ':00') # 'HH:MM' -> 'HH:MM:00'; no unit argument, since the strings already carry their own units

In [7]:
master_db['date'] = master_db['date'].add(master_db['time']) # combine features into one date/time feature
master_db = master_db.drop(['time'], axis=1)
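
A small self-contained sketch of the same date/time combination, using two raw values in the format shown in the "before" table further down. The explicit `format` string is my assumption about the raw date layout (month/day/two-digit year).

```python
import pandas as pd

toy = pd.DataFrame({"date": ["5/23/16", "11/18/17"],
                    "time": ["21:32", "04:20"]})

# Parse the calendar date (midnight timestamps).
dates = pd.to_datetime(toy["date"], format="%m/%d/%y")
# "HH:MM" becomes "HH:MM:00" so pandas reads it as hours:minutes:seconds.
offsets = pd.to_timedelta(toy["time"] + ":00")
# Adding the offset to the midnight timestamp gives the full date/time.
stamps = dates + offsets
print(stamps)
```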

#### d) Sort dataframe by date

In [8]:
master_db = master_db.sort_values(by='date')
master_db = master_db.reset_index(drop=True)

Unnamed: 0,date,num_replies,text,time_last_reply,topic,user
0,2016-05-23 23:34:00,1,A new topic?,"Jun 19, 2016",Could this be...,Ralavik
1,2016-05-23 23:37:00,48,I'm curious what kind of exodus Battleborn wil...,"Sep 10, 2016",Who else dropped Battleborn like a brick,umadbro
2,2016-05-23 23:38:00,1,Want to find a decent team to hang out and put...,"May 27, 2016",Looking for skilled players,Bammer
3,2016-05-23 23:40:00,4,Best Monday of the year,"Aug 15, 2016",Game is live...,Omni
4,2016-05-23 23:41:00,3,Apparently all of us were supposed to receive ...,Dec 11,Free Battletag Change?,SandyCheeks


#### e) Combine "topic" and "text" features

(This could take a minute)

To create one unified chunk of text for each post without sacrificing data, I am going to make a new feature in the dataframe that treats the "topic" of a post as the first sentence of the post. In my experience, topic lines are a significant part of posts on this forum, so this seems like a reasonable assumption.

In [9]:
master_db = utils.combine_text_cols(master_db)
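
`utils.combine_text_cols` is defined in `utils.py` and not shown here. The sketch below is only a guess at the general approach: the function name `combine_topic_and_text`, the rule of adding a period when the topic lacks closing punctuation, and the handling of missing body text are all assumptions on my part.

```python
import pandas as pd

def combine_topic_and_text(df, topic_col="topic", text_col="text"):
    """Return a copy of df with an 'all_text' column: topic first, then body."""
    def join_row(row):
        topic = "" if pd.isna(row[topic_col]) else str(row[topic_col]).strip()
        text = "" if pd.isna(row[text_col]) else str(row[text_col]).strip()
        # Add a period only when the topic does not already end a sentence.
        if topic and topic[-1] not in ".!?":
            topic += "."
        return " ".join(part for part in (topic, text) if part)

    out = df.copy()
    out["all_text"] = out.apply(join_row, axis=1)
    return out

# Toy rows; the last one has a topic but no body text.
toy = pd.DataFrame({
    "topic": ["Could this be...", "Looking for skilled players", "Free Battletag Change?"],
    "text": ["A new topic?", "Want to find a team", None],
})
combined = combine_topic_and_text(toy)
print(combined["all_text"].tolist())
```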

#### f) Rearrange columns for better viewing

In [10]:
old_cols = master_db.columns.tolist()
new_cols = ['date', 'user', 'all_text', 'topic', 'text', 'num_replies', 'time_last_reply']

if set(old_cols) == set(new_cols):
    master_db = master_db[new_cols]


### What the data looked like BEFORE pre-processing:

In [11]:
before_preprocess

Unnamed: 0,date,num_replies,text,time,time_last_reply,topic,user
0,5/23/16,1,Welcome to the General Discussion forum! We en...,21:32,"May 23, 2016",Welcome to General Discussion - Please Read!,Lylirra
1,5/23/16,0,To help us get the most of out of your feedbac...,21:36,"May 23, 2016",Bug Report and Technical Support Forums,Lylirra
2,11/16/17,817,"Hi everyone, We've seen a heavy influx of conv...",22:34,5m,"[Feedback Thread] Mercy Updates - Nov 16, 2017",Tom Powers
3,11/18/17,22,Why 3 days? Wouldn't 1 day be enough?,04:20,1m,My issue with the XQC ban,Aldwyn
4,11/18/17,25,You should not leave voice chat just cause 1 g...,03:04,1m,I will report if you leave voice chat,Budda?


### What the data looks like AFTER pre-processing:

In [12]:
master_db.head()

Unnamed: 0,date,user,all_text,topic,text,num_replies,time_last_reply
0,2016-05-23 23:34:00,Ralavik,Could this be... A new topic?,Could this be...,A new topic?,1,"Jun 19, 2016"
1,2016-05-23 23:37:00,umadbro,Who else dropped Battleborn like a brick. I'm ...,Who else dropped Battleborn like a brick,I'm curious what kind of exodus Battleborn wil...,48,"Sep 10, 2016"
2,2016-05-23 23:38:00,Bammer,Looking for skilled players. Want to find a de...,Looking for skilled players,Want to find a decent team to hang out and put...,1,"May 27, 2016"
3,2016-05-23 23:40:00,Omni,Game is live... Best Monday of the year,Game is live...,Best Monday of the year,4,"Aug 15, 2016"
4,2016-05-23 23:41:00,SandyCheeks,Free Battletag Change? Apparently all of us we...,Free Battletag Change?,Apparently all of us were supposed to receive ...,3,Dec 11


**Note on final features:** Some features (i.e. "num_replies" and "time_last_reply") will not be used in subsequent analysis and modeling. However, these features are included because they may be useful for future work.

***

<div align="right">
    [back to top](#toc)
</div>


<a id='section3'></a>

## 3) Basic EDA
(This may also take a minute)

In [14]:
total_posts = master_db.shape[0]
total_users = len(set(master_db['user']))
avg_post = total_posts/total_users

# count users with at least 50 posts (the >= check includes users with exactly 50)
counter = Counter(master_db['user'])
over_50_posts = sum(1 for n_posts in counter.values() if n_posts >= 50)

print("Size of dataset = {0:.2f} MB".format(sys.getsizeof(master_db)/1000000))
print("Number of forum posts = {}".format(total_posts))
print("Number of posts dropped due to NaN: {}".format(rows_before-master_db.shape[0]))
print("Feature names: {}".format(list(master_db)))
print("Most replies a post received: {}".format(max(master_db['num_replies'])))
print("Average number of replies per post: {0:.2f}".format(np.mean(master_db['num_replies'])))
print("Number of unique users: {}".format(len(set(master_db['user']))))
print("Number of users with over 50 posts: {}".format(over_50_posts))
print("Average number of posts per user: {0:.2f}".format(avg_post))
print("Oldest Post in Dataset: {}".format(master_db.head(n=1).index.tolist()[0]))
print("Newest Post in Dataset: {}".format(master_db.tail(n=1).index.tolist()[0]))

Size of dataset = 622.17 MB
Number of forum posts = 428092
Number of posts dropped due to NaN: 68
Feature names: ['date', 'user', 'all_text', 'topic', 'text', 'num_replies', 'time_last_reply']
Most replies a post received: 8749
Average number of replies per post: 9.95
Number of unique users: 85308
Number of users with over 50 posts: 1370
Average number of posts per user: 5.02
Oldest Post in Dataset: 0
Newest Post in Dataset: 428091
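
As an aside, the per-user counts can also be computed with pandas alone. This is a sketch of that alternative on toy data, not what the cell above does; the threshold of 2 stands in for the 50 used in the notebook.

```python
import pandas as pd

# Toy column of usernames, one entry per post.
users = pd.Series(["a", "b", "a", "c", "a", "b"], name="user")

posts_per_user = users.value_counts()   # number of posts for each user
threshold = 2                           # the notebook uses 50
heavy_posters = int((posts_per_user >= threshold).sum())
avg_posts = len(users) / users.nunique()

print("Users with at least {} posts: {}".format(threshold, heavy_posters))
print("Average number of posts per user: {0:.2f}".format(avg_posts))
```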


**Note on quantity of data:** This dataset covers only the text of each original forum post, not the text of the replies to it. However, posts receive about 10 replies on average, so if we assume that each reply has about as much text as the post itself (which seems plausible for this forum), gathering the replies would increase our dataset by roughly an order of magnitude. Something to keep in mind if "needing more data" seems like a problem down the line.

***

<div align="right">
    [back to top](#toc)
</div>

<a id="section4"></a>

## 4) Exploring "per month" slices

Because dates range from 2016-05-23 to 2017-11-18, we can make 17 "per month" slices covering the full months from June 2016 through October 2017.

In [34]:
month_1_df = utils.df_for_month(master_db, 6, 2016)
month_1_users = len(set(month_1_df['user'])) # num of unique users this month
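
`utils.df_for_month` is defined in `utils.py`. A minimal sketch of how such a slice could be taken, assuming the `date` column is already a datetime column; the function name `rows_in_month` is hypothetical.

```python
import pandas as pd

def rows_in_month(df, month, year, date_col="date"):
    """Return the rows of df whose date falls in the given month and year."""
    dates = df[date_col]
    mask = (dates.dt.month == month) & (dates.dt.year == year)
    return df[mask]

# Toy posts on either side of the June 2016 boundaries.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2016-05-31 23:59", "2016-06-01 00:00",
                            "2016-06-30 23:59", "2016-07-01 00:00"]),
    "user": ["a", "b", "b", "c"],
})
june = rows_in_month(toy, 6, 2016)
june_users = june["user"].nunique()  # unique users in the slice
print(june)
```

The original index labels are preserved in the slice, which matches the non-zero starting index seen in the monthly outputs.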

In [29]:
month_1_df.head(n=3)

Unnamed: 0,date,user,all_text,topic,text,num_replies,time_last_reply
4157,2016-06-01 00:00:00,Kissker,When did they vanish? (velvet ropes) Did the v...,When did they vanish? (velvet ropes),Did the velvet ropes do something wrong? I don...,3,"Jun 1, 2016"
4158,2016-06-01 00:02:00,Requiem,Consecutive Match Bonus is Unhealthy. There is...,Consecutive Match Bonus is Unhealthy,There is one thing that I have noticed since s...,10,Apr 30
4159,2016-06-01 00:02:00,DeminRamst,What's your patch note dream? I think it's typ...,What's your patch note dream?,"I think it's typical of any game, you get hook...",117,"Oct 18, 2016"


In [30]:
month_1_df.tail(n=3)

Unnamed: 0,date,user,all_text,topic,text,num_replies,time_last_reply
23857,2016-06-30 23:57:00,Cabskee,"""View Career Profile"" should change based on g...","""View Career Profile"" should change based on g...","In a Ranked game, when you click ""View Career ...",0,"Jun 30, 2016"
23858,2016-06-30 23:58:00,TravokWolf,Competitive Reward. Easy fix to allow people t...,Competitive Reward,Easy fix to allow people to feel progression (...,2,"Jul 1, 2016"
23859,2016-06-30 23:59:00,MoonPunisher,Your ranked system is unbearable. The amount o...,Your ranked system is unbearable.,The amount of matches with leavers is ridiculo...,0,"Jun 30, 2016"


So we see that we've isolated one month of posts...

In [31]:
print("Unique users this month: {}".format(month_1_users))
print("Average posts per user this month: {:.2f}".format(month_1_df.shape[0]/month_1_users))

Unique users this month: 8843
Average posts per user this month: 2.23


#### Separating ALL months

Now that we see how splitting one month looks, let's create separate dataframes for each FULL month of posts. This covers the timespan of 06/2016 to 10/2017, a total of 17 months. Just short of a year and a half! Compared to the RateBeer/BeerReview datasets (around 10 years), this is a very small time frame. However, the density of posts in this short time is very high, so we're hopeful that something may come of this :)

In [42]:
df_by_month = utils.make_all_months(master_db)
month_keys = {"06_2016":0, "07_2016":1, "08_2016":2, "09_2016":3, 
              "10_2016":4, "11_2016":5, "12_2016":6, "01_2017":7, 
              "02_2017":8, "03_2017":9, "04_2017":10, "05_2017":11,
              "06_2017":12, "07_2017":13, "08_2017":14, "09_2017":15, "10_2017":16}

In [43]:
df_by_month[month_keys["06_2016"]].head()

Unnamed: 0,date,user,all_text,topic,text,num_replies,time_last_reply
4157,2016-06-01 00:00:00,Kissker,When did they vanish? (velvet ropes) Did the v...,When did they vanish? (velvet ropes),Did the velvet ropes do something wrong? I don...,3,"Jun 1, 2016"
4158,2016-06-01 00:02:00,Requiem,Consecutive Match Bonus is Unhealthy. There is...,Consecutive Match Bonus is Unhealthy,There is one thing that I have noticed since s...,10,Apr 30
4159,2016-06-01 00:02:00,DeminRamst,What's your patch note dream? I think it's typ...,What's your patch note dream?,"I think it's typical of any game, you get hook...",117,"Oct 18, 2016"
4160,2016-06-01 00:03:00,HaPpY,Ranked MM Algorithm Proposal. TLDR: soloing vs...,Ranked MM Algorithm Proposal,TLDR: soloing vs groups frustrating... MM twea...,2,"Jun 1, 2016"
4161,2016-06-01 00:03:00,Zeriu,Deflecting Hanzos Ult back? I had no idea you ...,Deflecting Hanzos Ult back?,I had no idea you could deflect Hanzos ult dir...,0,"Jun 1, 2016"
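
`utils.make_all_months` is not shown in this notebook. One possible way to do the split, sketched under the assumption that `date` is a datetime column, is to compare each post's monthly period against a range of full months. The function name `split_full_months` is hypothetical, and unlike the cell above it returns a dict keyed by "MM_YYYY" rather than a list plus a separate key lookup.

```python
import pandas as pd

def split_full_months(df, first="2016-06", last="2017-10", date_col="date"):
    """Return {'MM_YYYY': sub-DataFrame} for each month from first to last."""
    periods = df[date_col].dt.to_period("M")
    frames = {}
    for period in pd.period_range(first, last, freq="M"):
        frames[period.strftime("%m_%Y")] = df[periods == period]
    return frames

# Toy posts: one before the first full month, two inside the range, one after.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2016-05-23 23:34", "2016-06-01 00:00",
                            "2017-10-31 23:59", "2017-11-18 04:20"]),
    "user": ["a", "b", "c", "d"],
})
by_month = split_full_months(toy)
print(len(by_month), list(by_month)[:3])
```

Keying the dict by month string avoids having to keep a separate index mapping in sync with the list order.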


<div align=right>
    [back to top](#toc)
</div>

<a id="section5"></a>

## 5) Tokenize the text

In [25]:
freqdist = count_tokens(month_1_text["text"])
freqdist.most_common(50)

[('.', 101440),
 (',', 79464),
 ('I', 50959),
 ('?', 17133),
 ("'s", 16519),
 ('game', 15562),
 ("n't", 15179),
 (')', 11164),
 ('team', 10580),
 ('like', 9973),
 ('(', 9892),
 (':', 9578),
 ('would', 9233),
 ('get', 8516),
 ('play', 8397),
 ('people', 7414),
 ('``', 6844),
 ("''", 6746),
 ('!', 6381),
 ('one', 6375),
 ('time', 5937),
 ("'m", 5853),
 ('...', 5611),
 ('players', 5607),
 ('damage', 5210),
 ('even', 5032),
 ('think', 5022),
 ('It', 4894),
 ('The', 4817),
 ('know', 4497),
 ('-', 4468),
 ('games', 4436),
 ('--', 4362),
 ('really', 4333),
 ('see', 4213),
 ('This', 4121),
 ('could', 3977),
 ('make', 3918),
 ('playing', 3832),
 ('good', 3799),
 ('want', 3702),
 ('much', 3617),
 ('hero', 3535),
 ('way', 3533),
 ('match', 3401),
 ('%', 3393),
 ('If', 3378),
 ('So', 3245),
 ('2', 3181),
 ("'ve", 3167)]
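
The `count_tokens` helper is not defined in the cells shown here. As a rough idea of what a token counter like this might do, here is a self-contained sketch. Everything in it is an assumption: it uses a simple regex tokenizer and a tiny hand-written stopword set as stand-ins for NLTK's `word_tokenize` and `stopwords`, and the name `count_tokens_sketch` is hypothetical. The stopword check is case-sensitive, which would be consistent with capitalized words such as 'The' and 'It' appearing in the counts above.

```python
import re
from collections import Counter

# Tiny stand-in stopword list; the notebook imports NLTK's English stopwords.
STOPWORDS = {"the", "a", "is", "to", "and", "of", "it", "this"}
# A token is either a run of word characters or a single punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def count_tokens_sketch(texts, stopwords=STOPWORDS):
    """Count tokens over an iterable of strings, skipping lowercase stopwords."""
    counts = Counter()
    for text in texts:
        for token in TOKEN_RE.findall(str(text)):
            if token not in stopwords:  # case-sensitive: "The" is kept
                counts[token] += 1
    return counts

freq = count_tokens_sketch(["The game is fun.", "I like the game, a lot."])
print(freq.most_common(5))
```

Lowercasing tokens before the stopword check and filtering out punctuation would give a cleaner frequency list.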

***

In [45]:
sc2_db = sc2_db.drop('Unnamed: 0', axis=1)

In [59]:
test = sc2_db.loc[sc2_db['num_replies'] == 2762]["text"]

In [69]:
test.iloc[0]

'Is there anyway to get your online character name changed? The game was set up for me, and they chose the name for me.'

In [72]:
sc2_db_test = sc2_db

In [73]:
sc2_db_test = sc2_db_test.set_index(['date'])

In [79]:
sc2_db_test['num_replies'].mean()

15.189090946168474