### NLP team Data Challenge!!!
Hello interns! This is your first day working at BigBlueSteam Co!

Until 17.00 o’clock you have to use your Steam Game Review Dataset to create a model that predicts from text whether the user recommends a game title or not.

Until now our platform hosts reviews and game publishers accuse us of hosting comments that are often  unjustifiable and not based on evidence but on rage. Our company will use your model real-time so positive comments can be publicly available instantly. On the other hand, negative ones have to be detected in order to be reviewed before becoming public so we can avoid excessive language and unjustifiable accusations.

# Data Understanding

In [None]:
#!pip install better-profanity

In [1]:
#Imports
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from plotly.offline import iplot, init_notebook_mode
from ipywidgets import interact, fixed
import ipywidgets as widgets
from io import BytesIO
import re
from re import sub
from PIL import Image
import textwrap
import urllib.request
from wordcloud import WordCloud, ImageColorGenerator
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Angel\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Angel\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [19]:
from better_profanity import profanity

In [2]:
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: '%.2f' % x)

In [3]:
train = pd.read_csv('.\\data\\train_gr\\train.csv')
train.head(5)

Unnamed: 0,review_id,title,year,user_review,user_suggestion
0,1,Spooky's Jump Scare Mansion,2016.0,I'm scared and hearing creepy voices. So I'll...,1
1,2,Spooky's Jump Scare Mansion,2016.0,"Best game, more better than Sam Pepper's YouTu...",1
2,3,Spooky's Jump Scare Mansion,2016.0,"A littly iffy on the controls, but once you kn...",1
3,4,Spooky's Jump Scare Mansion,2015.0,"Great game, fun and colorful and all that.A si...",1
4,5,Spooky's Jump Scare Mansion,2015.0,Not many games have the cute tag right next to...,1


In [4]:
games = pd.read_csv('.\\data\\train_gr\\game_overview.csv')
games.head(5)

Unnamed: 0,title,developer,publisher,tags,overview
0,Spooky's Jump Scare Mansion,Lag Studios,Lag Studios,"['Horror', 'Free to Play', 'Cute', 'First-Pers...",Can you survive 1000 rooms of cute terror? Or ...
1,Sakura Clicker,Winged Cloud,Winged Cloud,"['Nudity', 'Anime', 'Free to Play', 'Mature', ...",The latest entry in the Sakura series is more ...
2,WARMODE,WARTEAM,WARTEAM,"['Early Access', 'Free to Play', 'FPS', 'Multi...",Free to play shooter about the confrontation o...
3,Fractured Space,Edge Case Games Ltd.,Edge Case Games Ltd.,"['Space', 'Multiplayer', 'Free to Play', 'PvP'...",Take the helm of a gigantic capital ship and g...
4,Counter-Strike: Global Offensive,"Valve, Hidden Path Entertainment",Valve,"['FPS', 'Multiplayer', 'Shooter', 'Action', 'T...",Counter-Strike: Global Offensive (CS: GO) expa...


In [5]:
test = pd.read_csv('.\\data\\test_gr\\test.csv')
test.head(5)

Unnamed: 0,review_id,title,year,user_review
0,1603,Counter-Strike: Global Offensive,2015.0,"Nice graphics, new maps, weapons and models. B..."
1,1604,Counter-Strike: Global Offensive,2018.0,I would not recommend getting into this at its...
2,1605,Counter-Strike: Global Offensive,2018.0,Edit 11/12/18I have tried playing CS:GO recent...
3,1606,Counter-Strike: Global Offensive,2015.0,The game is great. But the community is the wo...
4,1607,Counter-Strike: Global Offensive,2015.0,I thank TrulyRazor for buying this for me a lo...


# Data Preparation

In [6]:
# function for performing a summarising, preliminary EDA
def initial_eda(df):
    """ Prints the dataFrames's column names, data types, 
    number of distinct values, number of missing values
    @param df: pandas DataFrame
    """
    if isinstance(df, pd.DataFrame):
        total_na = df.isna().sum().sum()
        print("Dimensions : %d rows, %d columns" % (df.shape[0], df.shape[1]))
        print("Total NA Values : %d " % (total_na))
        print("%35s %20s   %10s %10s" % ("Column Name", "Data Type", "#Distinct", "NA Values"))
        col_name = df.columns
        dtyp = df.dtypes
        uniq = df.nunique()
        na_val = df.isna().sum()
        for i in range(len(df.columns)):
            print("%35s %20s   %10s %10s" % (col_name[i], dtyp[i], uniq[i], na_val[i]))
    else:
        print("Expect a DataFrame but got a %15s" % (type(df)))

In [7]:
initial_eda(games)

Dimensions : 64 rows, 5 columns
Total NA Values : 0 
                        Column Name            Data Type    #Distinct  NA Values
                              title               object           64          0
                          developer               object           59          0
                          publisher               object           54          0
                               tags               object           64          0
                           overview               object           62          0


In [8]:
initial_eda(train)

Dimensions : 17494 rows, 5 columns
Total NA Values : 178 
                        Column Name            Data Type    #Distinct  NA Values
                          review_id                int64        17494          0
                              title               object           44          0
                               year              float64            8        178
                        user_review               object        17490          0
                    user_suggestion                int64            2          0


In [9]:
initial_eda(test)

Dimensions : 8045 rows, 4 columns
Total NA Values : 67 
                        Column Name            Data Type    #Distinct  NA Values
                          review_id                int64         8045          0
                              title               object           20          0
                               year              float64            8         67
                        user_review               object         8045          0


In [10]:
# cleaning
def clean_text(comment):
    return ''.join(re.sub(r"(@[A-Za-z0-9]+)|(http\S+)|(\$[A-Za-z0-9]+)|([0-9]+)","",comment))

In [11]:
def remove_special_chars(comment):# unrolls hastags and also removes symbols
    for remove in map(lambda r: re.compile(re.escape(r)), [",", ":", "\"", "=", "&", ";", "%", "$",
                                                                     "@", "%", "^", "*", "(", ")", "{", "}",
                                                                     "[", "]", "|", "/", "\\", ">", "<", "-",
                                                                     "!", "?", ".", "'",
                                                                     "--", "---", "#"]):
        comment.replace(remove, "", inplace=True)
    return comment

In [12]:
badwords1 = pd.read_csv('badwords.txt', header=None)
badwords2 = pd.read_csv('bad-words.csv', header=None)

In [13]:
bad = badwords1.append(badwords2)
bad.drop_duplicates(inplace = True)

## 1. Before anything else we need at some point all bad-language to be removed.

In [14]:
def remove_stopwords(input_text):
    stopwords_list = stopwords.words('english')
    # Some words which might indicate a certain sentiment are kept via a whitelist
    whitelist = ["n't", "not", "no"]
    words = input_text.split() 
    clean_words = [word for word in words if (word not in stopwords_list 
                                              or word in whitelist) and len(word) > 1] 
    return " ".join(clean_words) 

In [15]:
def preprocessing(df):
    df["clean_text"]=df['user_review'].apply(lambda row: remove_stopwords(row))
    df["clean_text"]=df['clean_text'].apply(lambda row: clean_text(row))
    remove_special_chars(df.clean_text)
    df["badwordcount"] = df['user_review'].apply(
        lambda comment: sum(comment.count(str(w)) for w in bad))
    df['num_words'] = df['user_review'].apply(
            lambda comment: len(comment.split()))
    df['num_chars'] = df['user_review'].apply(len)
    df["normword_badwords"] = df["badwordcount"]/df['num_words']
    
    return df

In [16]:
train = preprocessing(train)
train

Unnamed: 0,review_id,title,year,user_review,user_suggestion,clean_text,badwordcount,num_words,num_chars,normword_badwords
0,1,Spooky's Jump Scare Mansion,2016.00,I'm scared and hearing creepy voices. So I'll...,1,Im scared hearing creepy voices So Ill pause m...,1,132,710,0.01
1,2,Spooky's Jump Scare Mansion,2016.00,"Best game, more better than Sam Pepper's YouTu...",1,Best game better Sam Peppers YouTube account W...,2,44,335,0.05
2,3,Spooky's Jump Scare Mansion,2016.00,"A littly iffy on the controls, but once you kn...",1,littly iffy controls know play easy master Ive...,4,70,397,0.06
3,4,Spooky's Jump Scare Mansion,2015.00,"Great game, fun and colorful and all that.A si...",1,Great game fun colorful thatA side note though...,0,47,280,0.00
4,5,Spooky's Jump Scare Mansion,2015.00,Not many games have the cute tag right next to...,1,Not many games cute tag right next horror tag ...,4,67,334,0.06
...,...,...,...,...,...,...,...,...,...,...
17489,25535,EverQuest II,2012.00,Arguably the single greatest mmorp that exists...,1,Arguably single greatest mmorp exists today fr...,0,175,984,0.00
17490,25536,EverQuest II,2017.00,"An older game, to be sure, but has its own cha...",1,An older game sure charm holds special place h...,0,269,1472,0.00
17491,25537,EverQuest II,2011.00,When I frist started playing Everquest 2 it wa...,1,When frist started playing Everquest amazing i...,11,312,1642,0.04
17492,25538,EverQuest II,,cool game. THe only thing that REALLY PISSES M...,1,cool game THe thing REALLY PISSES ME OFF ridab...,0,46,264,0.00


In [None]:
train['profanity_flag'] = train['clean_text'].apply(lambda x: profanity.contains_profanity(x))

In [None]:
train['profanity_flag']

## 2.    It would be a great idea to accompany your script with a spell checking mechanism. (Demonstrate  it on single reviews not all dataset it may too take looong!!!)

## 3. Try to capture the comments with the most intense feelings of joy or anger.    Provide us with a model and script to run you classification from command line (you could keep a portion of data as a new csv to try it as external data)

## 4. Last but not least, any analysis on the reviews would be valuable and highly appreciated!


In [None]:
# angel ................