# Data Molding

Shape the data to fit each step of the modeling process.
1. Classification Model
    - Type of Data: Prepared CSV file created from manual classification of WoWChatLog.txt.
2. Sentiment Model
    - Type of Data: Classified CSV file with only 'Game' categorized text.
3. Topic Model
    - Type of Data: Only negative sentiment text pertaining to the game/patch.

Three dataframes are saved for modeling, one per model.

## Imports

In [1]:
# imports
import pandas as pd
import numpy as np
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords


# instancing TradeChat class from functions.py
from functions import TradeChat
tc = TradeChat()

In [2]:
# load up the CSV file of WoWChatLog.txt
original_df = pd.read_csv("data/trade_chat_v3.csv", index_col='index')

In [22]:
original_df.sentiment.value_counts()

Other       3883
Negative     566
Name: sentiment, dtype: int64

In [23]:
original_df.target.value_counts()

Chat     992
Boost    750
Patch    713
LFM      688
Game     640
Trade    484
LFG      150
Bug       32
Name: target, dtype: int64

## 1. Classification Model (cm)

The classification model will separate the text into three categories:
1. Game - Text that pertains to the game, World of Warcraft. Groups the Patch, Game, and Bug labels.
2. Chat - Common chatter between players, meaning everything that is not game related.
3. Service - Players looking for guilds, guild recruitment, boosting, and players selling/buying items. Groups the Boost, LFM, LFG, and Trade labels.

In [24]:
# Creating a copy of the original dataframe
cm_df = original_df.copy()

In [25]:
# grouping the 8 original labels into 3 general labels
cm_df['target'] = cm_df['target'].replace(['Patch', 'Bug', 'Game'], 'Game')
cm_df['target'] = cm_df['target'].replace(['Boost', 'LFM', 'LFG', 'Trade'], 'Service')

In [26]:
cm_df.target.value_counts()

Service    2072
Game       1385
Chat        992
Name: target, dtype: int64
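
The grouped totals can be cross-checked against the original label counts. The sketch below is only a sanity check, using the counts printed earlier in this notebook; the `LABEL_GROUPS` lookup table is a hypothetical name that mirrors the two `replace()` calls and is not part of the original code.

```python
from collections import Counter

# Label counts as printed by original_df.target.value_counts() earlier.
original_counts = {
    "Chat": 992, "Boost": 750, "Patch": 713, "LFM": 688,
    "Game": 640, "Trade": 484, "LFG": 150, "Bug": 32,
}

# Hypothetical lookup table mirroring the two replace() calls.
LABEL_GROUPS = {
    "Patch": "Game", "Bug": "Game", "Game": "Game",
    "Boost": "Service", "LFM": "Service", "LFG": "Service", "Trade": "Service",
    "Chat": "Chat",
}

# Sum the original counts into their general label.
grouped_counts = Counter()
for label, count in original_counts.items():
    grouped_counts[LABEL_GROUPS[label]] += count

# The grouped totals should match the value_counts() output of cm_df.
assert dict(grouped_counts) == {"Service": 2072, "Game": 1385, "Chat": 992}
```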

In [9]:
# Separating the Service class to drop duplicates, since the text
# here is mainly repetitive and would cause major overfitting down the line
service_target = cm_df[cm_df.target == 'Service'].copy()
rest_target = cm_df[cm_df.target != 'Service'].copy()

service_target.drop_duplicates('text', inplace=True)

# combining the dataframes back together (row-wise, with a fresh index)
cm_df = pd.concat([rest_target, service_target], ignore_index=True)
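
To illustrate the effect of de-duplicating only one class, here is a minimal sketch on made-up toy data (the rows below are invented for illustration and are not from the chat log). It recombines the pieces with `pd.concat`, which is one way to stack them back together; repeated text in the other classes is left alone.

```python
import pandas as pd

# Toy data (made up): repeated ads in 'Service',
# plus a repeated 'Chat' line that should survive.
toy = pd.DataFrame({
    "text": ["wts boost", "wts boost", "wts boost", "lol", "lol", "patch is bad"],
    "target": ["Service", "Service", "Service", "Chat", "Chat", "Game"],
})

# Split on the class, then drop repeated text only inside 'Service'.
is_service = toy["target"] == "Service"
service_unique = toy[is_service].drop_duplicates(subset="text")
rest = toy[~is_service]

# pd.concat stacks the two pieces row-wise; ignore_index renumbers the rows.
deduped = pd.concat([rest, service_unique], ignore_index=True)

print(deduped["target"].value_counts().to_dict())
```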

Now we tokenize the data. This step lowercases the text and removes any nonstandard characters.

In [12]:
# loading in stop words and adding data-related stop words
sw = stopwords.words('english')
sw.extend(['u','ur','im','dont','thats'])

# Instantiating my tokenizer; the dot in 9.2 is escaped so it only matches the literal patch number
tokenizer = RegexpTokenizer(r"(?u)\b([a-z]+|9\.2)\w*\b")

In [13]:
# Calling my TradeChat class to create new columns for tokenization and joined tokens
cm_df_token = tc.nlp_tokenizer(cm_df, tokenizer, sw, stem='lemmatizer ')
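
`TradeChat.nlp_tokenizer` is defined in `functions.py` and is not shown here, so its exact steps are unknown. The sketch below is a hypothetical stand-in, not the author's implementation: it lowercases the text, extracts tokens with a pattern close to the one above (written with a non-capturing group and an escaped dot, both assumptions), and drops stop words. Lemmatization is left out, and a tiny hand-written stop-word set stands in for the NLTK list.

```python
import re

# Assumption: a variant of the notebook's pattern with a non-capturing group
# and an escaped dot, so "9.2" is matched literally.
TOKEN_PATTERN = re.compile(r"\b(?:[a-z]+|9\.2)\w*\b")

# Assumption: a tiny stand-in for the NLTK English list plus the custom additions.
STOP_WORDS = {"the", "is", "a", "u", "ur", "im", "dont", "thats"}

def simple_tokenize(text):
    """Lowercase the text, extract regex tokens, and drop stop words."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]

tokens = simple_tokenize("Patch 9.2 is the WORST, thats why im leaving!!!")
joined = " ".join(tokens)  # joined form, as a vectorizer would consume it
print(tokens)  # ['patch', '9.2', 'worst', 'why', 'leaving']
```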

In [14]:
cm_df_token.target.value_counts()

Game       1385
Chat        992
Service     578
Name: target, dtype: int64

## 2. Sentiment Model (sm)

The sentiment model will classify text pertaining to the game, World of Warcraft, as negative or non-negative.

In [15]:
# Selecting only Game labeled data points
sm_df_token = cm_df_token[cm_df_token.target == 'Game']

In [16]:
sm_df_token.sentiment.value_counts()

Other       968
Negative    417
Name: sentiment, dtype: int64

## 3. Topic Model (tm)

The topic model will only be fed negative sentiment text, to find which game features should be looked at in order to retain players.

In [19]:
# Selecting only negative sentiment text to produce topics to focus on for improvement
tm_df_token = sm_df_token[sm_df_token.sentiment == "Negative"]

In [21]:
tm_df_token.sentiment.value_counts()

Negative    417
Name: sentiment, dtype: int64

## Storage

Store all dataframes to be used in modeling.

In [None]:
%store cm_df_token
%store sm_df_token
%store tm_df_token
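
In the modeling notebook, the stored dataframes can be read back with the `-r` flag of `%store`. This is a minimal sketch, assuming that notebook runs under the same IPython profile, since `%store` keeps its data in the profile's database.

```python
# Run in the modeling notebook (IPython/Jupyter only; not plain Python).
%store -r cm_df_token
%store -r sm_df_token
%store -r tm_df_token
```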