<img style="float:left;margin-left: -12px; margin-top: -10px" src="yelp-logo-27.png"  width=50>

# Part 2: Natural Language Processing

In this notebook we work through the text data in the reviews and the class/business descriptions. By preprocessing this text and applying the NLP tools provided by *__spaCy__* and __*NLTK*__, we can derive some meaning from it and *hopefully* improve our models.

The steps involved in this are as follows: 

1. word count
2. character count
3. Number of numerics
4. Number of upper case
5. Number of Exclamation Points (!)
6. Count of stop words
7. drop stop words
8. lemmatize our words
9. TF-IDF
10. Class Imbalance and Sentiment Analysis

#### Import needed libraries:

In [36]:
import pandas as pd
import numpy as np
import spacy
import pickle
from Mod_5_functions import pickle_file,open_pickle,clean_text_column
from nltk.corpus import stopwords
from Mod_5_functions import return_lemma
from sklearn.feature_extraction.text import TfidfVectorizer
import seaborn as sns
import matplotlib.pyplot as plt 
from textblob import TextBlob
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
%matplotlib inline

# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load("en_core_web_sm")



#### Import the pickled DataFrames:

In [2]:
user_reviews_df = open_pickle('Data/filtered_user_data.pkl')

#### 1. word count:

In [3]:
user_reviews_df['word_count'] = user_reviews_df.rev_comp_reviews.apply(lambda x: len(str(x).split(' ')))

#### 2. character count


In [4]:
user_reviews_df['char_count'] = user_reviews_df.rev_comp_reviews.str.len() #this includes the spaces

#### 3. Number of numerics


In [5]:
user_reviews_df['numerics'] = user_reviews_df.rev_comp_reviews.apply(lambda rev: sum(tok.isdigit() for tok in rev.split()))

#### 4. Number of upper case


In [6]:
user_reviews_df['upper'] = user_reviews_df.rev_comp_reviews.apply(lambda rev: sum(tok.isupper() for tok in rev.split()))

#### 5. Number of Exclamation Points (!)


In [7]:
user_reviews_df['bangs'] = user_reviews_df.rev_comp_reviews.str.count('!') #number of '!' characters in each review

#### 6. Count of stop words


In [8]:
stop = set(stopwords.words('english')) #a set makes membership checks fast

user_reviews_df['stp_wrd_cnt'] = user_reviews_df.rev_comp_reviews.apply(
    lambda rev: sum(tok in stop for tok in rev.split()))
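
The six counts above can also be collected in one pass over a review. The following is a minimal, dependency-free sketch rather than the notebook's actual pipeline: `surface_features` and the tiny `DEMO_STOP_WORDS` set are hypothetical names introduced for illustration (the real stop word list comes from NLTK), and the word count here splits on any whitespace.

```python
# Hypothetical stand-in for stopwords.words('english'); illustration only.
DEMO_STOP_WORDS = {"the", "is", "a", "and", "it", "to", "i"}


def surface_features(text, stop_words=DEMO_STOP_WORDS):
    """Return the six surface counts for a single review string."""
    tokens = text.split()
    return {
        "word_count": len(tokens),
        "char_count": len(text),  # includes spaces
        "numerics": sum(tok.isdigit() for tok in tokens),
        "upper": sum(tok.isupper() for tok in tokens),
        "bangs": text.count("!"),
        # case-sensitive match, like the stop word count in the cell above
        "stp_wrd_cnt": sum(tok in stop_words for tok in tokens),
    }


features = surface_features("I LOVE the 6 am class and it is GREAT!!")
print(features)
```

Because the stop word match is case-sensitive, counting stop words after lowercasing the text would likely give different numbers than counting them on the raw text.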

In [38]:
user_reviews_df.head(3)

| | comapny_source | company_loc | rev_comp_rating | rev_comp_reviews | rev_comp_url | rev_company_name | userUrl | word_count | char_count | numerics | upper | bangs | stp_wrd_cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Peloton | 370 Canal St New York, NY 10013 | 3.0 | planet fitness affordable frills gym happy opt... | https://www.yelp.com/biz/planet-fitness-manhat... | Planet Fitness - Manhattan - Canal St - NY | https://www.yelp.com/user_details?userid=exPhu... | 219 | 1189 | 0 | 5 | 0 | 100 |
| 1 | Peloton | 90 E 10th St New York, NY 10003 | 2.0 | purchased groupon friend calling book receptio... | https://www.yelp.com/biz/montauk-salt-cave-new... | Montauk Salt Cave | https://www.yelp.com/user_details?userid=exPhu... | 791 | 4417 | 2 | 19 | 4 | 331 |
| 2 | Peloton | 1841 Broadway New York, NY 11023 | 3.0 | enjoyed class one least favorite barre studios... | https://www.yelp.com/biz/pure-barre-new-york-c... | Pure Barre - New York Columbus Circle - 60th &... | https://www.yelp.com/user_details?userid=exPhu... | 88 | 480 | 0 | 2 | 0 | 39 |


### Data Preprocessing

Next, we move on to data cleaning. This section is important for the remainder of this project and for the models we run. In the next few cells we will:
1. create a function to remove all punctuation
2. lowercase all of the words in our reviews
3. remove all words shorter than 3 characters
4. remove stop words
5. check for spelling and correct where needed
6. remove frequent words
7. remove rare/uncommon words


#### 1) and 2) get rid of special characters and lower case:

Use the function *clean_text_column*, which we imported above.

In [10]:
user_reviews_df.rev_comp_reviews = user_reviews_df.rev_comp_reviews.apply(clean_text_column)
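
`clean_text_column` lives in `Mod_5_functions` and its source is not shown in this notebook. As a rough sketch of what a helper like it might do, assuming it lowercases the text, strips everything except letters, digits and whitespace, and drops words shorter than 3 characters (`clean_text_sketch` is a hypothetical name, not the real function):

```python
import re


def clean_text_sketch(text, min_len=3):
    """Lowercase, strip punctuation/special characters, drop very short words."""
    text = str(text).lower()
    # replace anything that is not a letter, digit or whitespace with a space
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # keep only words with at least `min_len` characters
    words = [word for word in text.split() if len(word) >= min_len]
    return " ".join(words)


cleaned = clean_text_sketch("Great gym!!! It's $10/month, so I'm happy.")
print(cleaned)
```

Replacing special characters with a space, rather than deleting them, is a design choice in this sketch: it avoids gluing neighbouring words together when they are separated only by punctuation.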

#### 3. drop stop words


In [11]:
stop = set(stopwords.words('english')) #loads the stop words for the english language
#keep only the words that are not in the set of stop words
user_reviews_df.rev_comp_reviews = user_reviews_df.rev_comp_reviews.apply(
    lambda rev: " ".join(word for word in rev.split() if word not in stop))

In [51]:
!pip install -U symspellpy

Collecting symspellpy
  Downloading https://files.pythonhosted.org/packages/4c/d5/9cf41f05a30f205c00489e3d37639c348349ba6f8d0e1005f26dc9a9ac60/symspellpy-6.3.8-py3-none-any.whl
Installing collected packages: symspellpy
Successfully installed symspellpy-6.3.8


#### 4. Correct Spelling:

Check the words that have only been used once; some of these will definitely be misspelled!

In [152]:
#count how often each word occurs across the whole corpus:
word_lists = list(user_reviews_df.rev_comp_reviews.apply(lambda x: x.split(' ')))
all_words = [word for rev in word_lists for word in rev]
#rename_axis/reset_index(name=...) give the same column names across pandas versions
corpus_word_counts_df = pd.Series(all_words).value_counts()\
.rename_axis('words').reset_index(name='counts')


In [153]:
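#words that occur only once; .copy() so new columns can be added without a SettingWithCopyWarning
corpus_word_counts_df_1 = corpus_word_counts_df[corpus_word_counts_df['counts'] == 1].copy()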
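
The same idea, isolating the words that occur exactly once, can be sketched without pandas. This is only an illustration on a made-up three-review corpus; `reviews` and `singletons` are hypothetical names.

```python
from collections import Counter

# made-up mini corpus standing in for user_reviews_df.rev_comp_reviews
reviews = [
    "great class great instructor",
    "instructor was greaat",
    "class felt rushed",
]

# count every word across all reviews
word_counts = Counter(word for rev in reviews for word in rev.split())

# words seen exactly once are the candidates for spelling correction
singletons = sorted(word for word, count in word_counts.items() if count == 1)
print(singletons)
```

Even in this toy corpus the once-only words mix a typo (`greaat`) with perfectly valid rare words, which is why the candidates are passed to a spell checker instead of being dropped outright.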

In [161]:
corpus_word_counts_df_1.head()

| | words | counts |
|---|---|---|
| 21691 | firstlast | 1 |
| 21692 | 5075 | 1 |
| 21693 | strong3 | 1 |
| 21694 | ladderi | 1 |
| 21695 | articulates | 1 |


In [162]:
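from tqdm.notebook import tqdm_notebook #adds progress_apply to pandas objects
tqdm_notebook.pandas(desc="Progress: ")

#`sym_spell` is assumed to be a symspellpy SymSpell instance with a frequency dictionary already loaded
corpus_word_counts_df_1['corrected'] = corpus_word_counts_df_1.words.progress_apply(
    lambda w: sym_spell.word_segmentation(w)[0]) #[0] is the segmented string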



In [165]:
corpus_word_counts_df_1.head()

| | words | counts | corrected |
|---|---|---|---|
| 21691 | firstlast | 1 | first last |
| 21692 | 5075 | 1 | 5075 |
| 21693 | strong3 | 1 | strong 3 |
| 21694 | ladderi | 1 | ladder i |
| 21695 | articulates | 1 | articulates |
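
`word_segmentation` splits run-together tokens such as `firstlast` into `first last`. To illustrate the idea only, here is a toy dictionary-based segmenter using dynamic programming. It is not SymSpell's algorithm, which also uses word frequencies and can correct spelling; `segment` and `VOCAB` are hypothetical names, and the dictionary is made up.

```python
VOCAB = {"first", "last", "strong", "ladder", "i", "3"}  # tiny made-up dictionary


def segment(token, vocab=VOCAB):
    """Split `token` into the fewest dictionary words; return it unchanged if impossible."""
    n = len(token)
    # best[i] holds the shortest list of words that spells token[:i], or None
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            piece = token[start:end]
            if best[start] is not None and piece in vocab:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return " ".join(best[n]) if best[n] is not None else token


for token in ["firstlast", "strong3", "ladderi", "articulates"]:
    print(token, "->", segment(token))
```

A token that cannot be fully covered by dictionary words is returned as-is, which mirrors how `articulates` and `5075` pass through unchanged in the table above.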


In [166]:
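def replace_fixed_words(rev,df):
    """Swap every word of `rev` found in df.words for its df.corrected value."""
    # build the lookup once per review instead of scanning df.words for every word
    corrections = dict(zip(df.words, df.corrected))
    # words without a correction are kept unchanged
    return ' '.join(corrections.get(word, word) for word in rev.split(' '))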

In [168]:
user_reviews_df['rev_comp_reviews_corrections'] = user_reviews_df\
.rev_comp_reviews.progress_apply(lambda x: replace_fixed_words(x,corpus_word_counts_df_1))

In [169]:
pickle_file(user_reviews_df, 'Data/spelling_corrections_df')

'Pickled object!'

In [170]:
word_lists_new = list(user_reviews_df.rev_comp_reviews_corrections.apply(lambda x: x.split(' ')))
all_words_corr = [word for rev in word_lists_new for word in rev]

count_1= corpus_word_counts_df.shape[0]
count_2 = pd.Series(all_words_corr).value_counts().shape[0]

print(f'Before the spell checker we had {count_1} words, and now we have: {count_2}',
      f'{corpus_word_counts_df_1.shape[0]} were corrected for',sep='\n\n')



Before the spell checker we had 53476 words, and now we have: 32334

31785 were corrected for


#### 5. lemmatize our words


In [None]:
def return_lemma(review,nlp):
    doc = nlp(review)
    return ' '.join([word.lemma_ for word in doc])

In [171]:
nlp = spacy.load('en_core_web_sm')
user_reviews_df['rev_comp_reviews_corrections_new'] = user_reviews_df.rev_comp_reviews_corrections.\
progress_apply(lambda x: return_lemma(x,nlp))


In [172]:
pickle_file(user_reviews_df, 'Data/sp_and_lemm_df')


'Pickled object!'

In [175]:
user_reviews_df.rev_comp_reviews_corrections[:3]

0    planet fitness affordable frills gym happy opt...
1    purchased groupon friend calling book receptio...
2    enjoyed class one least favorite barre studios...
Name: rev_comp_reviews_corrections, dtype: object

#### 6. TF-IDF


In [None]:
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(user_reviews_df.rev_comp_reviews)

In [None]:
tfidf_matrix.shape
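
To make the numbers inside `tfidf_matrix` less of a black box, here is a small pure-Python sketch of the weighting. It assumes scikit-learn's documented defaults for `TfidfVectorizer` (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, L2-normalised rows) but uses plain whitespace tokenisation, so it is an approximation and not a replacement for the library. `tfidf_rows` is a hypothetical helper and the three documents are made up.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights = {term: tf * idf[term] for term, tf in counts.items()}
        # guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows


docs = ["great class great instructor", "class felt rushed", "great gym"]
rows = tfidf_rows(docs)
print(rows[1])
```

In the second document, `felt` and `rushed` appear in only one document while `class` appears in two, so the rarer words receive the larger weights.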

In [None]:
test = [word for rev in user_reviews_df.rev_comp_reviews for word in rev.split(' ')]
len(set(test)), len(test)

In [None]:
tester = pd.Series(test)
tester.value_counts()

# UPDATE!!!!

We can see that out of the **911876** words in our corpus (all the words in all of the reviews), only **50 of the words are unique**. That is striking, but also telling of the reviews being left. We can use this to our advantage. With such a small list of words, let's explore them!

In [None]:
fig = plt.figure(figsize=(10,5))
sns.countplot(x=test,color='red') # one bar per unique word
plt.title('Unique Word Count',fontsize=15)
plt.xticks(rotation='vertical')
plt.xlabel('Unique Review Words');
# sns.set(font_scale=1)

#### 11. Class Imbalance and Sentiment Analysis:

If we look at the distributions above, we can see that some users appear more frequently than others, and that our star ratings are positively skewed.

In the next few cells we will try to address this issue by augmenting the scores given by users with the sentiment scores of their reviews.

**An example of how this works:**

In [None]:
sid = SentimentIntensityAnalyzer()
ss = sid.polarity_scores('I love this place so much! It is my favoirte place ever!!')
ss_2 = sid.polarity_scores('this place is literally the worst ever, it deserves a zero!')
ss_3 = sid.polarity_scores('this place is average, got the job done.')
ss_4 = sid.polarity_scores(user_reviews_df.rev_comp_reviews[100])

In [None]:
print(f'I love this place so much! It is my favoirte place ever!!: {ss}',
     f'this place is literally the worst ever, it deserves a zero!: {ss_2}',
     f'this place is average, got the job done.: {ss_3}',
     f'{user_reviews_df.rev_comp_reviews[100]}: {ss_4}', sep='\n\n')

**Based on the results of this exploration, maybe the best way to handle this is the following:**
- subtract the negative score from the positive score
- multiply this by the original rating
- add this new rating to our original rating
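
Written out, the three steps give `new_rating = rating + rating * (pos - neg)`, which is the same as `rating * (1 + pos - neg)`. A small sketch with made-up sentiment scores (the `pos`/`neg` values below are invented for illustration and are not VADER output; `adjusted_rating` is a hypothetical helper):

```python
def adjusted_rating(rating, pos, neg):
    """Scale a star rating by the net sentiment (pos - neg) and add it to the original."""
    net_sentiment = pos - neg            # step 1: lies between -1 and 1
    adjustment = net_sentiment * rating  # step 2: scale by the original rating
    return rating + adjustment           # step 3: add back to the original rating


print(adjusted_rating(4.0, pos=0.5, neg=0.0))   # positive review pushes the rating up
print(adjusted_rating(4.0, pos=0.0, neg=0.25))  # negative review pulls it down
print(adjusted_rating(3.0, pos=0.2, neg=0.2))   # balanced sentiment leaves it unchanged
```

Because `pos` and `neg` each lie between 0 and 1, the adjusted rating ranges from 0 to twice the original, so a 5-star review can score as high as 10.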

In [None]:
user_reviews_df['sentiment_score'] = user_reviews_df.rev_comp_reviews.apply(sid.polarity_scores)
user_reviews_df['pos_neg'] = user_reviews_df.sentiment_score.apply(lambda sent: sent['pos']-sent['neg'])
user_reviews_df['new_rating'] = user_reviews_df.pos_neg*user_reviews_df.rev_comp_rating + user_reviews_df.rev_comp_rating

In [None]:
sns.distplot(user_reviews_df['new_rating'],color='red',bins=15)
plt.title('New Star Rating Histogram',fontsize=15);

#### Train-Test Split:

We did not find a well-established way to test our models. So, here we will hold out a few users with multiple reviews so we can test on them.

In [None]:
tot_users = len(set(user_reviews_df.userUrl))
print(f'There are a total of {tot_users} users, and 20% of that is {int(tot_users*.2)}.')

**This is the random list of users we will select; we grab their user IDs and then all of their rows:**

In [None]:
users_select = np.random.randint(0,15722,766)
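
One way to make this holdout reproducible and to guarantee that no user is picked twice is to sample unique user IDs without replacement and then split rows by membership. The sketch below uses only the standard library and made-up rows; `holdout_split`, the 20% default fraction and the seed are assumptions for illustration.

```python
import random


def holdout_split(rows, user_key="userUrl", frac=0.2, seed=42):
    """Hold out all rows belonging to a random `frac` of the users."""
    users = sorted({row[user_key] for row in rows})
    n_test = int(len(users) * frac)
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    test_users = set(rng.sample(users, n_test))  # sampling without replacement
    train = [row for row in rows if row[user_key] not in test_users]
    test = [row for row in rows if row[user_key] in test_users]
    return train, test


# made-up data: 10 users with 3 reviews each
rows = [{"userUrl": f"user_{u}", "rev_comp_rating": 3.0} for u in range(10) for _ in range(3)]
train, test = holdout_split(rows)
print(len(train), len(test))
```

Splitting by user rather than by row keeps every review of a held-out user out of the training data. With a DataFrame, the same membership test could be written with `isin` on the `userUrl` column.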

Citations:


Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.

https://opensourceforu.com/2016/12/analysing-sentiments-nltk/
