Data Wrangling and Feature Engineering:
    
In this step, I break the reviews out into a list of words (tokenizing) and reduce each word to its root form (lemmatizing). From there I can determine what language the reviews are in, label the words, and so on.  

1. First, I need to clean the data  
    i. determine the language  
    ii. remove any non-alphanumeric characters  
    iii. remove stop words, as they are not useful for predictions but add a lot of noise to the data  
2. Tokenize: create a list of all the words (referred to as tokens)
3. Lemmatize: reduce the words to their root
4. N-gram labels

I will use these steps to extract predictive features that the model can use to determine the sentiment of a given review.
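Step 4 can be sketched in plain Python. This is only a minimal illustration, not necessarily the approach I end up using: the helper name `ngrams_from_tokens` and the sample tokens are made up for the example, and NLTK's own `nltk.ngrams` utility does the same job.

```python
def ngrams_from_tokens(tokens, n=2):
    """Return the list of n-grams (as tuples) found in a token list."""
    # zip over n staggered views of the list; zip stops at the shortest view
    return list(zip(*(tokens[i:] for i in range(n))))

sample_tokens = ["nice", "sound", "quality"]  # hypothetical tokens
bigrams = ngrams_from_tokens(sample_tokens, n=2)
print(bigrams)  # [('nice', 'sound'), ('sound', 'quality')]
```

A token list shorter than `n` simply yields an empty list, so very short reviews need no special handling.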

In [56]:
PATH = 'C:/Users/Anam/Documents/DataScience/'

In [132]:
import csv
import re

import nltk
import numpy as np
import pandas as pd
from langdetect import detect
from nltk import pos_tag
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Silence pandas' SettingWithCopyWarning for chained assignments
pd.options.mode.chained_assignment = None

lemmatizer = WordNetLemmatizer()

In [58]:
reviews_file = PATH +"datascience/Projects/SentenceSentiments/Data/reviews.csv"
reviews_df = pd.read_csv(reviews_file, sep='\t', header = 0, quoting=csv.QUOTE_NONE)
print("Reviews records: " + str(len(reviews_df.index)))

Reviews records: 3000


In [59]:
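from langdetect.lang_detect_exception import LangDetectException

def detect_lang(text_series):
    languages = []
    for s in text_series:
        try:
            languages.append(detect(s))
        except LangDetectException:  # raised for text with no detectable features, e.g. an empty string
            languages.append(None)
    return languages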

In [124]:
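def clean_text(s):
    # Raw strings keep regex escapes such as \W from being read as string escapes
    s = re.sub(r"[']+", '', s)      # drop apostrophes
    s = re.sub(r'[\W_]+', ' ', s)   # replace runs of non-alphanumerics with one space
    return s.lower()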
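To see what the two substitutions do, here is a small check on a made-up review string (the sample text is hypothetical). The function is repeated so the snippet runs on its own.

```python
import re

def clean_text(s):
    s = re.sub(r"[']+", "", s)     # drop apostrophes so "don't" becomes "dont"
    s = re.sub(r"[\W_]+", " ", s)  # collapse every other non-alphanumeric run into one space
    return s.lower()

sample = "Don't buy this_phone!!  It's NOT worth $50."  # hypothetical review text
cleaned = clean_text(sample)
print(repr(cleaned))  # 'dont buy this phone its not worth 50 '
```

Removing apostrophes first keeps contractions together as one token ("dont") instead of splitting them ("don t"). The trailing space left by the final period does no harm, since the tokenizer ignores surrounding whitespace.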

In [134]:
reviews_df['tokens'] = reviews_df['review'].transform(lambda x: clean_text(x))
reviews_df['tokens'] = reviews_df['tokens'].transform(lambda x: pos_tag(word_tokenize(x)))
reviews_df['tokens'].head()

0    [(shot, NN), (in, IN), (the, DT), (southern, J...
1    [(my, PRP$), (order, NN), (was, VBD), (not, RB...
2    [(this, DT), (place, NN), (should, MD), (hones...
3    [(it, PRP), (makes, VBZ), (very, RB), (strange...
4                            [(nice, JJ), (sound, NN)]
Name: tokens, dtype: object

In [133]:
def convert_to_wordnet_tag(tokens):
    """Replace Penn Treebank tags with WordNet tags; drop tokens that have none."""
    converted = []
    for word, tag in tokens:
        if tag.startswith('J'):
            converted.append((word, wordnet.ADJ))
        elif tag.startswith('N'):
            converted.append((word, wordnet.NOUN))
        elif tag.startswith('R'):
            converted.append((word, wordnet.ADV))
        elif tag.startswith('V'):
            converted.append((word, wordnet.VERB))
        # Other tags (determiners, pronouns, ...) have no WordNet equivalent;
        # the lemmatization step discards them anyway, so they are dropped here.
    return converted

In [135]:
reviews_df['tokens'] = reviews_df['tokens'].transform(lambda x: convert_to_wordnet_tag(x))
reviews_df['tokens'].head()
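The mapping from Penn Treebank tags to WordNet tags depends only on the first letter of the tag, so it can also be written as a dictionary lookup. This is a sketch with hypothetical names (`PENN_TO_WORDNET`, `penn_to_wordnet`); it uses the literal letters that NLTK's `wordnet.ADJ`, `wordnet.NOUN`, `wordnet.ADV` and `wordnet.VERB` constants stand for, so it runs without NLTK. The sample pairs are taken from the tagged tokens shown earlier.

```python
# WordNet POS constants are single letters:
# 'a' (adjective), 'n' (noun), 'r' (adverb), 'v' (verb)
PENN_TO_WORDNET = {"J": "a", "N": "n", "R": "r", "V": "v"}

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None if there is no match."""
    return PENN_TO_WORDNET.get(tag[:1])

tagged = [("nice", "JJ"), ("sound", "NN"), ("the", "DT")]
converted = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(converted)  # [('nice', 'a'), ('sound', 'n'), ('the', None)]
```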

In [117]:
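stop_words = set(stopwords.words('english'))
# Each token is a (word, tag) pair, so test the word rather than the whole tuple
reviews_df['tokens'] = reviews_df['tokens'].transform(lambda x: [(word, tag) for word, tag in x if word not in stop_words])
reviews_df['tokens'].head()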
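Because every token at this point is a `(word, tag)` pair, the membership test has to look at the word element. The sketch below shows the difference on one of the tagged rows from earlier; the small `stop_words` set is a hypothetical stand-in for NLTK's English list.

```python
stop_words = {"my", "was", "not", "the", "in"}  # hypothetical subset of a stop word list
tagged = [("my", "PRP$"), ("order", "NN"), ("was", "VBD"), ("not", "RB")]

# A tuple is never equal to a string, so testing the whole pair filters nothing
unfiltered = [pair for pair in tagged if pair not in stop_words]

# Testing the word element removes the stop words and keeps each tag with its word
filtered = [(word, tag) for word, tag in tagged if word not in stop_words]

print(len(unfiltered))  # 4
print(filtered)         # [('order', 'NN')]
```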

In [121]:
def lemmatize(tokens):
    """Lemmatize (word, tag) pairs, keeping only adjectives, adverbs, nouns and verbs."""
    lemmas = []
    for word, tag in tokens:
        # Use the first letter of the tag; Penn Treebank adjectives ('JJ') map to WordNet 'a'.
        # Slicing with [:1] avoids an IndexError on an empty tag.
        pos = tag[:1].lower().replace('j', 'a')
        if pos in ('a', 's', 'r', 'n', 'v'):
            lemmas.append(lemmatizer.lemmatize(word, pos))
    return lemmas

In [122]:
reviews_df['tokens'] = reviews_df['tokens'].transform(lambda x: lemmatize(x))
reviews_df['tokens'].head()