1) a. Visit the Discussion Forum: In order to implement supervised learning, you will need a set of data for which you know the true “labels” of your documents. Each student must start a thread on the discussion forum describing their data and the labels they will use. You must start this thread on Thursday, and must comment on at least two other students’ threads by Saturday (noon EST). Posts should describe the research question, the data, and how the labels were chosen. Replies by classmates should make constructive comments on other possible labels, anticipate problems, etc. I (the instructor) will also comment.

b. Next, code a subset of your data (at least 10%). This is a time-consuming but necessary part of supervised methods; don't, however, spend more than 1 hour on this portion of the assignment. You may, however, have information that you can treat as “labels.” For instance, the “date” of publication can be treated as a label, where the classification task is effectively a test of whether or not the time of publication can be predicted from the terms. You may use either handcoded labels or metadata as your training set during our section on supervised learning, but you are encouraged to think of substantively interesting tasks that might be part of your project. For this portion of the assignment, the deliverable is a description of the labels you chose and why they’re relevant to your research question.

    - I spent 3 hours, got to about 3% labeled

Labels are from: http://www.crystalfeel.socialanalyticsplus.net/

tValence:	

- 0 - this text expresses highest possible intensity of unpleasant feelings; 1 - this text expresses highest possible intensity of pleasant feelings.
- In psychology, valence refers to the degree of overall unpleasantness (negative valence) or pleasantness (positive valence) of an emotional experience or expression. This term is often used interchangeably with “sentiment”.

tJoy:

- 0 - this text expresses lowest possible intensity of joy; 1 - this text expresses highest possible intensity of joy
- Joy is a pleasant emotional state in response to a pleasant observation or a remembrance thereof. In our emotion scheme, tJoy covers a broad range of related positive feelings such as contentment, pleasure, happiness, ecstasy and excitement, as well as more subtle sense of hope, pride, gratitude, and compassion.

tAnger:	
- 0 - this text expresses lowest possible intensity of anger; 1 - this text expresses highest possible intensity of anger	
- Anger is an unpleasant emotional state involving a strong uncomfortable and hostile response to a perceived provocation, hurt or threat. Anger usually has many physical and mental consequences. In our emotion scheme, tAnger covers a range of related negative feelings such as annoyance, irritation, aggravation, fury and rage.

tFear:
- 0 - this text expresses lowest possible intensity of fear; 1 - this text expresses highest possible intensity of fear
- Fear is an unpleasant emotional state arising from the threat of danger, pain or harm. Fear often leads to confrontation with or escape from/avoiding the threat. In our emotion scheme, tFear covers a range of related negative feelings such as concern, anxiety, worry, fright, dread, horror and terror.

tSadness:
- 0 - this text expresses lowest possible intensity of sadness; 1 - this text expresses highest possible intensity of sadness
- Sadness is an unpleasant emotional state characterized by feelings of disadvantage, loss and despair. Sadness may lead to silence, inaction and withdrawal from others. In our emotion scheme, tSadness covers a range of related negative feelings such as helplessness, disappointment, melancholy, sorrow and grief.

2) Using the e1071 library in R, train a naive Bayes classifier to predict your labels from your data, as shown in the code example provided. Interpret the results. Which terms were strongly predictive? Which were not?

Visit the Discussion Forum for part 1.a. of this assignment.

Submit your responses to sections 1b and 2 in a pdf. This should be between 1 and 2 pages.

# Setup

In [1]:
import numpy as np
import pandas as pd

from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# Load Data

In [2]:
df_game_reviews = pd.read_csv(r'../data/processed/game_reviews_processed.csv', low_memory=False)

In [3]:
df_game_reviews.columns

Index(['recommendationid', 'language', 'Valence Intensity', 'Joy Intensity',
       'Anger Intensity', 'Fear Intensity', 'Sadness Intensity', 'Sentiment',
       'Emotion', 'review',
       ...
       'Unnamed: 93', 'Unnamed: 94', 'Unnamed: 95', 'Unnamed: 96',
       'Unnamed: 97', 'Unnamed: 98', 'Unnamed: 99', 'Unnamed: 100',
       'Unnamed: 101', 'Unnamed: 102'],
      dtype='object', length=103)

In [4]:
# clean
# keep only the hand-coded rows and the columns needed for modeling
def prune_cols(df):
    keep = ['recommendationid', 'Sentiment', 'Emotion', 'review']
    # rows with a Sentiment value are the hand-coded subset
    # (note: no filter on the 'language' column is applied here)
    df = df.loc[df['Sentiment'].notnull(), keep]
    # drop rows with a missing review; copy so later column assignments
    # act on an independent frame rather than a slice
    df = df.loc[df['review'].notnull(), keep].copy()
    return df

df_game_reviews = prune_cols(df_game_reviews)
df_game_reviews

Unnamed: 0,recommendationid,Sentiment,Emotion,review
0,71182846,Negative,Anger,"When I just joined the battle, my squad was at..."
1,71182712,Negative,Sadness,Game is fun and exciting when it works. Very s...
5,26578229,Positive,Joy,Excellent game. \n\nPlenty of fun to be had in...
7,71178247,Negative,Anger,Players are too serious and you can't have fun...
8,71177738,Positive,Joy,great simulation game
...,...,...,...,...
472,70608305,Positive,Joy,the best\n
473,70608216,Positive,Joy,"Most fun when people work together, it is what..."
474,70606979,Negative,Anger,"The game has a lot to learn, but being able to..."
476,70606193,Positive,Joy,"9/10 would recommend, bring friends for 10/10"


In [5]:
# clean
# encode emotions as an integer code and as one-hot columns

def code_cols(df):
    # Joy = 1, Anger = 2, Fear = 3; anything else is coded 4 (Sadness)
    codes = {'Joy': 1, 'Anger': 2, 'Fear': 3}
    df['emotion_code'] = df['Emotion'].map(codes).fillna(4).astype('int64')

    # one-hot columns derived from the integer code
    df['emotion_joy'] = (df['emotion_code'] == 1).astype('int64')
    df['emotion_fear'] = (df['emotion_code'] == 3).astype('int64')
    df['emotion_anger'] = (df['emotion_code'] == 2).astype('int64')
    df['emotion_sadness'] = (df['emotion_code'] == 4).astype('int64')

    return df

code_cols(df_game_reviews)

Unnamed: 0,recommendationid,Sentiment,Emotion,review,emotion_code,emotion_joy,emotion_fear,emotion_anger,emotion_sadness
0,71182846,Negative,Anger,"When I just joined the battle, my squad was at...",2,0,0,1,0
1,71182712,Negative,Sadness,Game is fun and exciting when it works. Very s...,4,0,0,0,1
5,26578229,Positive,Joy,Excellent game. \n\nPlenty of fun to be had in...,1,1,0,0,0
7,71178247,Negative,Anger,Players are too serious and you can't have fun...,2,0,0,1,0
8,71177738,Positive,Joy,great simulation game,1,1,0,0,0
...,...,...,...,...,...,...,...,...,...
472,70608305,Positive,Joy,the best\n,1,1,0,0,0
473,70608216,Positive,Joy,"Most fun when people work together, it is what...",1,1,0,0,0
474,70606979,Negative,Anger,"The game has a lot to learn, but being able to...",2,0,0,1,0
476,70606193,Positive,Joy,"9/10 would recommend, bring friends for 10/10",1,1,0,0,0
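
As a possible alternative for the one-hot columns, pandas provides `get_dummies`. The sketch below runs on a small invented frame (`toy` is a hypothetical stand-in, not the review data) and lower-cases the generated column names so they match the `emotion_*` naming used above.

```python
import pandas as pd

# Hypothetical stand-in for the 'Emotion' column of the review data.
toy = pd.DataFrame({'Emotion': ['Joy', 'Anger', 'Sadness', 'Fear', 'Joy']})

# One indicator column per distinct emotion, stored as integers.
dummies = pd.get_dummies(toy['Emotion'], prefix='emotion', dtype='int64')
dummies.columns = dummies.columns.str.lower()  # emotion_Joy -> emotion_joy

toy = toy.join(dummies)
print(toy)
```

Unlike the `else` branch above, this approach creates a column only for emotions that actually appear, so an unexpected label would show up as its own column rather than being counted as sadness.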


# Vectorize DF

In [6]:
vectorizer = CountVectorizer(stop_words='english')

#build features
all_features = vectorizer.fit_transform(df_game_reviews['review'])

# Build Model 

In [7]:
#split up dataset
X_train, X_test, y_train, y_test = train_test_split(all_features, df_game_reviews.emotion_code, test_size=0.3, random_state=42)

#call model 
classifier = MultinomialNB()

#fit model
classifier.fit(X_train, y_train)

# count correctly predicted docs
nr_correct = (y_test == classifier.predict(X_test)).sum()
print(f'{nr_correct} documents classified correctly')

# count incorrectly predicted docs
nr_incorrect = y_test.size - nr_correct
print(f'{nr_incorrect} documents classified incorrectly')

# return percent success
fraction_wrong = nr_incorrect / (nr_correct + nr_incorrect)
print(f'The (testing) accuracy of the model is {1-fraction_wrong:.2%}')

51 documents classified correctly
37 documents classified incorrectly
The (testing) accuracy of the model is 57.95%
