# Analyzing Overwatch Forum Posts

This is an analysis of data I scraped from Blizzard's official Overwatch general discussion <a href="https://us.battle.net/forums/en/overwatch/22813879/">forum</a>. For more information on how the scraping was performed, or about this project in general, visit my repo <a href="https://github.com/dskarbrevik/W266_Project_Skarbrevik">here</a>.

<a id="toc"></a>

## Table of Contents
<ol>
    <li>[Set up the environment](#section1)</li>
    <li>[Preprocess the data](#section2)</li>
    <li>[Basic EDA](#section3)</li>
    <li>[Exploring "per month" slices](#section4)</li>
    <li>[Tokenize the text](#section5)</li>
</ol>

***

<a id='section1'></a>

## 1) Set up the environment



#### Import libraries

In [1]:
import pandas as pd
import numpy as np
import nltk
import re
import sys
from nltk.corpus import stopwords
from nltk import word_tokenize, sent_tokenize
from collections import Counter, defaultdict

Note: NLTK data must be downloaded before running the later cells in this notebook (the helper below relies on the `stopwords` corpus and the Punkt tokenizer models). Run `nltk.download()` to install the necessary data.

#### Helper functions

In [2]:
# stop_words is not defined in the cells shown; NLTK's English list is assumed here.
# The list is lowercase, so capitalized forms such as "The" are not filtered out.
stop_words = set(stopwords.words('english'))

def count_tokens(posts):
    """Count the non-stop-word tokens across a collection of post texts."""
    fdist = nltk.FreqDist()
    for post in posts:
        if not isinstance(post, str):  # skip NaN / missing text
            continue
        for sentence in sent_tokenize(post):
            # try to get rid of urls in text (the ^ anchor only catches a url at the start)
            text = re.sub(r'^https?:\/\/.*[\r\n]*', '', sentence)
            for word in word_tokenize(text):
                if word not in stop_words:
                    fdist[word] += 1
    return fdist  # FreqDist is a Counter subclass


# def month_text(db, start_date):
#     end_date = 
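
Because the URL pattern in `count_tokens` starts with `^`, it only removes a URL that sits at the very beginning of a sentence. The following is a minimal standard-library sketch of an unanchored alternative. The names `URL_RE` and `strip_urls` and the sample sentence are illustrative assumptions, not part of the original notebook.

```python
import re

# Hypothetical pattern: match http(s) URLs anywhere, up to the next whitespace.
URL_RE = re.compile(r'https?://\S+')

def strip_urls(text):
    """Remove every http(s) URL from text and collapse leftover whitespace."""
    return ' '.join(URL_RE.sub(' ', text).split())

sample = "See https://example.com/patch-notes for the Mercy changes"
anchored = re.sub(r'^https?:\/\/.*[\r\n]*', '', sample)  # pattern used in count_tokens
unanchored = strip_urls(sample)

print(anchored)    # the URL is still present: it is not at the start of the string
print(unanchored)
```

Note that `\S+` also swallows any punctuation glued to the end of a URL, which is usually acceptable for frequency counts.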


#### Import data

In [3]:
ow_db1 = pd.read_csv("E:\\Data Science Playgrounds\\Blizzard Forums Project\\Web Scrapes\\11-18-17\\overwatch_database_1_5248.csv", encoding = "ISO-8859-1")
ow_db2 = pd.read_csv("E:\\Data Science Playgrounds\\Blizzard Forums Project\\Web Scrapes\\11-18-17\\overwatch_database_5249_7691.csv", encoding = "ISO-8859-1")
ow_db3 = pd.read_csv("E:\\Data Science Playgrounds\\Blizzard Forums Project\\Web Scrapes\\11-18-17\\overwatch_database_7692_8655.csv", encoding = "ISO-8859-1")

#### Concatenate data files

In [4]:
all_db = [ow_db1, ow_db2, ow_db3]
master_db = pd.concat(all_db)
before_preprocess = master_db.head() # save to compare later for fun

***

<div align="right">
    [back to top](#toc)
</div>

<a id='section2'></a>

## 2) Preprocess the data

#### a) Convert 'date' and 'time' columns to datetime objects and combine them

(This could take a minute)

In [6]:
master_db['date'] = pd.to_datetime(master_db['date'])
master_db['time'] = pd.to_timedelta(master_db['time'] + ':00') # append seconds; `unit` must not be passed for string input

In [7]:
master_db['date'] = master_db['date'].add(master_db['time']) # combine features into one date/time feature
master_db = master_db.drop(['time'], axis=1)
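
To see what this step does, here is a small sketch on made-up rows. The input formats are assumptions (ISO dates and `HH:MM` times); the real scraped columns may be formatted differently.

```python
import pandas as pd

# Made-up rows; the real column formats are an assumption (ISO dates, 'HH:MM' times).
df = pd.DataFrame({
    "date": ["2016-05-23", "2016-05-23"],
    "time": ["21:32", "23:34"],
})

dates = pd.to_datetime(df["date"])             # midnight timestamps
offsets = pd.to_timedelta(df["time"] + ":00")  # 'HH:MM:00' parsed as a duration
df["date"] = dates + offsets                   # timestamp + timedelta = timestamp
df = df.drop(columns=["time"])

print(df)
```

Adding a timedelta to a midnight timestamp yields the full date and time in a single column, which is what the index in the next step is built from.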

#### b) Make the 'date' our new index (and sort it)

This makes it easy to separate forum posts into month-long chunks.

In [8]:
master_db = master_db.set_index(['date'])
master_db = master_db.sort_index()

#### c) Remove any rows with more than 1 NaN value
The table has 5 columns, so `thresh=4` keeps only rows with at least 4 non-NaN values, i.e. at most 1 NaN per row. This lets us keep forum posts where the user entered a topic line but no body text (as this may still be valuable data).

In [9]:
rows_before = master_db.shape[0]
master_db = master_db.dropna(thresh=4)
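
A quick sketch of how `thresh` behaves, using made-up rows with the same five column names (the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Made-up rows with the same five columns as the forum table.
df = pd.DataFrame({
    "user": ["a", "b", "c"],
    "topic": ["t1", "t2", np.nan],
    "text": ["body", np.nan, np.nan],
    "num_replies": [1, 0, 3],
    "time_last_reply": ["May 23, 2016", "May 24, 2016", "May 25, 2016"],
})

kept = df.dropna(thresh=4)  # keep rows with at least 4 non-NaN values

print(kept["user"].tolist())
```

Row `b` (one missing value) survives, while row `c` (two missing values) is dropped.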

#### d) Rearrange columns for better viewing

In [12]:
old_cols = master_db.columns.tolist()
new_cols = ['user', 'topic', 'text', 'num_replies', 'time_last_reply']

print(old_cols)

['num_replies', 'text', 'time', 'time_last_reply', 'topic', 'user']


In [14]:
old_cols = master_db.columns.tolist()
new_cols = ['user', 'topic', 'text', 'num_replies', 'time_last_reply']

if set(old_cols) == set(new_cols):
    master_db = master_db[new_cols]


Just for fun...

### What the data looked like BEFORE pre-processing:

In [None]:
before_preprocess

### What the data looks like AFTER pre-processing:

In [18]:
master_db.head()

| date | user | topic | text | num_replies | time_last_reply |
|---|---|---|---|---|---|
| 2016-05-23 21:32:00 | Lylirra | Welcome to General Discussion - Please Read! | Welcome to the General Discussion forum! We en... | 1 | May 23, 2016 |
| 2016-05-23 21:36:00 | Lylirra | Bug Report and Technical Support Forums | To help us get the most of out of your feedbac... | 0 | May 23, 2016 |
| 2016-05-23 23:34:00 | Ralavik | Could this be... | A new topic? | 1 | Jun 19, 2016 |
| 2016-05-23 23:37:00 | umadbro | Who else dropped Battleborn like a brick | I'm curious what kind of exodus Battleborn wil... | 48 | Sep 10, 2016 |
| 2016-05-23 23:38:00 | Bammer | Looking for skilled players | Want to find a decent team to hang out and put... | 1 | May 27, 2016 |


***

<div align="right">
    [back to top](#toc)
</div>


<a id='section3'></a>

## 3) Basic EDA
(This may also take a minute)

In [19]:
total_posts = master_db.shape[0]
total_users = len(set(master_db['user']))
avg_post = total_posts/total_users

# count users with at least 50 posts (the check below is inclusive)
counter = Counter(master_db['user'])
over_50_posts = 0
for values in counter.items():
    if values[1] >= 50:
        over_50_posts += 1

print("Size of dataset = {} MB".format(sys.getsizeof(master_db)/1000000))
print("Number of forum posts = {}".format(total_posts))
print("Number of posts dropped due to NaN: {}".format(rows_before-master_db.shape[0]))
print("Feature names: {}".format(list(master_db)))
print("Most replies a post received: {}".format(max(master_db['num_replies'])))
print("Average number of replies per post: {}".format(np.mean(master_db['num_replies'])))
print("Number of unique users: {}".format(len(set(master_db['user']))))
print("Number of users with over 50 posts: {}".format(over_50_posts))
print("Average number of posts per user: {}".format(avg_post))
print("Oldest Post in Dataset: {}".format(master_db.head(n=1).index.tolist()[0]))
print("Newest Post in Dataset: {}".format(master_db.tail(n=1).index.tolist()[0]))
print("\n")
print("Here's a sample of the data:")
master_db.head()

Size of dataset = 353.467733 MB
Number of forum posts = 425720
Number of posts dropped due to NaN: 2440
Feature names: ['user', 'topic', 'text', 'num_replies', 'time_last_reply']
Most replies a post received: 8749
Average number of replies per post: 9.973618810485766
Number of unique users: 84977
Number of users with over 50 posts: 1362
Average number of posts per user: 5.009826188262706
Oldest Post in Dataset: 2016-05-23 21:32:00
Newest Post in Dataset: 2017-11-18 04:53:00


Here's a sample of the data:


| date | user | topic | text | num_replies | time_last_reply |
|---|---|---|---|---|---|
| 2016-05-23 21:32:00 | Lylirra | Welcome to General Discussion - Please Read! | Welcome to the General Discussion forum! We en... | 1 | May 23, 2016 |
| 2016-05-23 21:36:00 | Lylirra | Bug Report and Technical Support Forums | To help us get the most of out of your feedbac... | 0 | May 23, 2016 |
| 2016-05-23 23:34:00 | Ralavik | Could this be... | A new topic? | 1 | Jun 19, 2016 |
| 2016-05-23 23:37:00 | umadbro | Who else dropped Battleborn like a brick | I'm curious what kind of exodus Battleborn wil... | 48 | Sep 10, 2016 |
| 2016-05-23 23:38:00 | Bammer | Looking for skilled players | Want to find a decent team to hang out and put... | 1 | May 27, 2016 |
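
The per-user figures above come from counting posts per author. Here is a small standard-library sketch on made-up data. It also shows that a `>=` check (as in the cell above) counts users sitting exactly on the cutoff, while a strict `>` does not. The author list and the `threshold` value are invented for illustration.

```python
from collections import Counter

# Made-up author column; each entry is the author of one post.
users = ["ana", "ana", "ana", "bob", "cy", "cy"]
threshold = 2  # hypothetical cutoff, standing in for 50

posts_per_user = Counter(users)
total_posts = sum(posts_per_user.values())
unique_users = len(posts_per_user)

at_least = sum(1 for n in posts_per_user.values() if n >= threshold)  # inclusive
more_than = sum(1 for n in posts_per_user.values() if n > threshold)  # strict

print(total_posts / unique_users, at_least, more_than)
```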


**Note on quantity of data:** This dataset covers only the text of each original forum post, not the text of its replies. Posts average about 10 replies each, so if a reply carries roughly as much text as the post it answers (which seems plausible for this forum), gathering the replies would increase the dataset by about an order of magnitude. Something to keep in mind if "needing more data" becomes a problem down the line.
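
The back-of-the-envelope estimate in this note can be worked through with the figures printed above. The only assumption is the one already stated: a reply is about as long as a post.

```python
# Figures copied from the EDA output above.
total_posts = 425720
avg_replies_per_post = 9.973618810485766

# Assumption (from the note): each reply carries about as much text as a post.
estimated_replies = total_posts * avg_replies_per_post
growth_factor = (total_posts + estimated_replies) / total_posts

print("Estimated replies: {:,.0f}".format(estimated_replies))
print("Dataset growth: about {:.1f}x".format(growth_factor))
```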

***

<div align="right">
    [back to top](#toc)
</div>

<a id="section4"></a>

## 4) Exploring "per month" slices

Because dates range from 2016-05-23 to 2017-11-18, we can make 17 full "per month" slices, starting with June 2016 and ending with October 2017 (the partial months of May 2016 and November 2017 are left out).
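
As a check on the count of 17, the month boundaries can be enumerated with the standard library. The helper name `month_slices` is hypothetical and not part of the original notebook.

```python
import calendar
from datetime import date

def month_slices(first, last):
    """Yield (start, end) ISO date strings for each full month from first to last.

    first and last are (year, month) tuples, both inclusive.
    """
    year, month = first
    while (year, month) <= last:
        last_day = calendar.monthrange(year, month)[1]  # number of days in the month
        yield date(year, month, 1).isoformat(), date(year, month, last_day).isoformat()
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)

slices = list(month_slices((2016, 6), (2017, 10)))
print(len(slices))
print(slices[0], slices[-1])
```

Each pair has the same form as the bounds used for the first month below, so the list could drive a loop over all months.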

In [23]:
month_1_text = master_db["2016-06-01":"2016-06-30"][["user","topic","text"]]
month_1_users = len(set(month_1_text['user'])) # num of unique users this month

In [24]:
month_1_text.head(n=3)

| date | user | topic | text |
|---|---|---|---|
| 2016-06-01 | NigDodge | Losing a game 5 seconds after joining | Happens to me a lot. And I !@#$ing hate it. |
| 2016-06-01 | DragonLight | Why PS4 announcement appearing on the PC version | Look at the breaking news . says PS4 issues. w... |
| 2016-06-01 | Deus | Why not give "Fan the Hammer" a cooldown? | Yeah he can still flash-fan but he won't be th... |


In [25]:
month_1_text.tail(n=3)

| date | user | topic | text |
|---|---|---|---|
| 2016-06-30 | Neko | why the bug-PTR bug forum rarely got respond | I will post this here. if it seem inappropriat... |
| 2016-06-30 | MackeySackey | Got a loss on a win | I was playing a game of competitive play and m... |
| 2016-06-30 | Paso | Feedback on people leaving during Competitive ... | This post is just some personal experience in ... |


So we see that we've isolated one month of posts...

In [26]:
print("Unique users this month: {}".format(month_1_users))
print("Average posts per user this month: {}".format(month_1_text.shape[0]/month_1_users))

Unique users this month: 8817
Average posts per user this month: 2.2253600998071907


<div align="right">
    [back to top](#toc)
</div>

<a id="section5"></a>

## 5) Tokenize the text

In [102]:
print(counts.most_common(50))

[('.', 107771), (',', 84363), ('I', 54400), ('?', 18793), ("'s", 17386), ('game', 16595), ("n't", 16186), (')', 11828), ('team', 11460), (':', 11161), ('like', 10567), ('(', 10503), ('would', 9683), ('get', 9163), ('play', 9038), ('people', 7929), ('``', 7250), ("''", 7138), ('!', 6882), ('one', 6820), ('time', 6308), ("'m", 6248), ('...', 6133), ('players', 6028), ('damage', 5400), ('even', 5355), ('think', 5314), ('It', 5190), ('The', 5123), ('know', 4814), ('games', 4799), ('-', 4729), ('really', 4598), ('see', 4432), ('--', 4407), ('This', 4393), ('could', 4175), ('make', 4152), ('playing', 4066), ('good', 4021), ('want', 3946), ('match', 3830), ('much', 3830), ('way', 3742), ('hero', 3678), ('%', 3659), ('If', 3624), ('So', 3472), ('2', 3446), ('player', 3407)]


In [168]:
freqdist = count_tokens(month_1_text["text"])
freqdist.most_common(50)

[('.', 107356),
 (',', 84073),
 ('I', 54177),
 ('?', 18334),
 ("'s", 17340),
 ('game', 16535),
 ("n't", 16133),
 (')', 11783),
 ('team', 11442),
 ('like', 10533),
 ('(', 10453),
 (':', 10052),
 ('would', 9658),
 ('get', 9143),
 ('play', 9011),
 ('people', 7904),
 ('``', 7221),
 ("''", 7115),
 ('!', 6824),
 ('one', 6799),
 ('time', 6293),
 ("'m", 6219),
 ('...', 6096),
 ('players', 6012),
 ('damage', 5387),
 ('even', 5346),
 ('think', 5287),
 ('It', 5177),
 ('The', 5108),
 ('know', 4791),
 ('games', 4785),
 ('-', 4708),
 ('really', 4582),
 ('see', 4401),
 ('--', 4400),
 ('This', 4362),
 ('could', 4164),
 ('make', 4138),
 ('playing', 4060),
 ('good', 4006),
 ('want', 3935),
 ('much', 3821),
 ('match', 3819),
 ('way', 3730),
 ('hero', 3663),
 ('%', 3622),
 ('If', 3609),
 ('So', 3458),
 ('2', 3438),
 ('player', 3399)]
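
The top of this list is dominated by punctuation and by capitalized words such as 'It' and 'The', which suggests the stop-word check is case-sensitive. One possible cleanup, sketched with the standard library, is to lowercase tokens and drop punctuation-only tokens before counting. The `normalize` helper and the three-word stop list are illustrative assumptions; the counts are copied from the output above.

```python
from collections import Counter

# A few (token, count) pairs copied from the output above.
raw_counts = Counter({
    ".": 107356, "I": 54177, "game": 16535, "``": 7221,
    "It": 5177, "The": 5108, "games": 4785, "--": 4400,
})

# Tiny illustrative stop list (an assumption; a real list would be much longer).
stop_words = {"i", "it", "the"}

def normalize(counts, stop_words):
    """Lowercase tokens, drop punctuation-only tokens and stop words, merge counts."""
    merged = Counter()
    for token, n in counts.items():
        lowered = token.lower()
        if not any(ch.isalnum() for ch in lowered):
            continue  # punctuation-only, e.g. '.', '``', '--'
        if lowered in stop_words:
            continue
        merged[lowered] += n
    return merged

print(normalize(raw_counts, stop_words).most_common())
```

Lowercasing merges case variants of the same word, which may or may not be desirable for hero names and other proper nouns.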

***

In [45]:
sc2_db = sc2_db.drop('Unnamed: 0', axis=1)

In [59]:
test = sc2_db.loc[sc2_db['num_replies'] == 2762]["text"]

In [69]:
test.iloc[0]

'Is there anyway to get your online character name changed? The game was set up for me, and they chose the name for me.'

In [72]:
sc2_db_test = sc2_db

In [73]:
sc2_db_test = sc2_db_test.set_index(['date'])

In [79]:
sc2_db_test['num_replies'].mean()

15.189090946168474