<img style="float: left; margin-left: -10px; margin-top: -10px" src="yelp-logo-27.png"  width=50>

# Natural Language Processing

In this notebook we work through the text of the reviews and the class/business descriptions. By preprocessing this data with the NLP tools in *__spaCy__* and __*NLTK*__, we aim to extract features from the text that will *hopefully* improve our models.

The steps involved in this are as follows: 

1. word count
2. character count
3. Number of numerics
4. Number of upper case
5. Number of Exclamation Points (!)
6. Count of stop words
7. remove special characters and lower case
8. drop stop words
9. lemmatize our words
10. TF-IDF
11. Sentiment Analysis

#### Import needed libraries:

In [1]:
import pandas as pd
import numpy as np
import spacy
import pickle
from Mod_5_functions import pickle_file, open_pickle, clean_text_column
from nltk.corpus import stopwords

# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load("en_core_web_sm")

#### Import the pickled DataFrames:

In [2]:
bus_reviews_df = open_pickle('Data/filtered_data_2.pkl')
user_reviews_df = open_pickle('Data/filtered_user_data.pkl')

#### 1. word count:

In [3]:
user_reviews_df['word_count'] = user_reviews_df.rev_comp_reviews.apply(lambda x: len(str(x).split(' ')))
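
One caveat with splitting on a single space: consecutive spaces produce empty strings, and each of those is counted as a word. The short sketch below compares `split(' ')` with `split()`, which splits on any run of whitespace. The sample string is made up for illustration.

```python
sample = "Great class!  Would go again."  # made-up text with a double space

count_single_space = len(sample.split(' '))  # the empty string between the two spaces is counted
count_whitespace = len(sample.split())       # runs of whitespace act as one separator

print(count_single_space, count_whitespace)  # prints: 6 5
```

If the reviews contain repeated spaces, the `word_count` column will be slightly higher than the true number of words.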

#### 2. character count


In [4]:
user_reviews_df['char_count'] = user_reviews_df.rev_comp_reviews.str.len() #this includes the spaces

#### 3. Number of numerics


In [5]:
# count the whitespace-separated tokens made up entirely of digits
user_reviews_df['numerics'] = user_reviews_df.rev_comp_reviews.apply(lambda text: sum(word.isdigit() for word in text.split()))

#### 4. Number of upper case


In [6]:
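# count the tokens written entirely in upper case (this includes one-letter words such as "I")
user_reviews_df['upper'] = user_reviews_df.rev_comp_reviews.apply(lambda text: sum(word.isupper() for word in text.split()))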

#### 5. Number of Exclamation Points (!)


In [7]:
user_reviews_df['bangs'] = user_reviews_df.rev_comp_reviews.str.count('!')  # number of '!' characters per review

#### 6. Count of stop words


In [8]:
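stop = set(stopwords.words('english'))  # a set makes the membership test fast

# NLTK's stop words are lower case and the text has not been lower-cased yet,
# so capitalized stop words such as "The" are not counted here
user_reviews_df['stp_wrd_cnt'] = user_reviews_df.rev_comp_reviews.apply(
    lambda text: sum(word in stop for word in text.split()))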
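
The six counts above can also be computed by one helper, which makes them easy to check on a single string before applying them to the whole column. This is only a sketch: `count_features` and the small `demo_stop_words` set are hypothetical names introduced here, and the set stands in for NLTK's full English list.

```python
def count_features(text, stop_words):
    """Return the six count features for one review as a dict."""
    words = text.split()  # splits on any whitespace
    return {
        "word_count": len(words),
        "char_count": len(text),  # includes spaces
        "numerics": sum(w.isdigit() for w in words),
        "upper": sum(w.isupper() for w in words),
        "bangs": text.count("!"),
        # case-sensitive comparison: "The" does not match "the"
        "stp_wrd_cnt": sum(w in stop_words for w in words),
    }

demo_stop_words = {"the", "was", "and", "i"}  # hypothetical subset, for illustration only
features = count_features("The class was GREAT and I paid 20 dollars!", demo_stop_words)
print(features)
```

On this sample, only "was" and "and" are counted as stop words, because "The" and "I" are capitalized while the stop-word set is lower case.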

In [9]:
user_reviews_df.head()

| Unnamed: 0 | comapny_source | company_loc | rev_comp_rating | rev_comp_reviews | rev_comp_url | rev_company_name | userUrl | word_count | char_count | numerics | upper | bangs | stp_wrd_cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Peloton | 370 Canal St New York, NY 10013 | 3.0 | Planet Fitness is an affordable, no frills gym... | https://www.yelp.com/biz/planet-fitness-manhat... | Planet Fitness - Manhattan - Canal St - NY | https://www.yelp.com/user_details?userid=exPhu... | 219 | 1189 | 0 | 5 | 0 | 100 |
| 1 | Peloton | 90 E 10th St New York, NY 10003 | 2.0 | I purchased a Groupon for a friend and I. When... | https://www.yelp.com/biz/montauk-salt-cave-new... | Montauk Salt Cave | https://www.yelp.com/user_details?userid=exPhu... | 791 | 4417 | 2 | 19 | 4 | 331 |
| 2 | Peloton | 1841 Broadway New York, NY 11023 | 3.0 | I enjoyed my class, but this was one of my lea... | https://www.yelp.com/biz/pure-barre-new-york-c... | Pure Barre - New York Columbus Circle - 60th &... | https://www.yelp.com/user_details?userid=exPhu... | 88 | 480 | 0 | 2 | 0 | 39 |
| 3 | Peloton | 19 W 45th St New York, NY 10036 | 4.0 | I came in for their Pilates Mat Fundamental cl... | https://www.yelp.com/biz/return-to-life-center... | Return To Life Center - Pilates and Functional... | https://www.yelp.com/user_details?userid=exPhu... | 106 | 584 | 0 | 2 | 2 | 39 |
| 4 | Peloton | 140 W 23rd St New York, NY 10011 | 4.0 | I came in for my first Peloton class awhile ba... | https://www.yelp.com/biz/peloton-new-york | Peloton | https://www.yelp.com/user_details?userid=exPhu... | 206 | 1137 | 0 | 9 | 1 | 91 |


### Data Preprocessing

Next, we move on to data cleaning. This section matters for the remainder of the project and for the models we run. In the next few cells we will:
1. create a function to remove all punctuation
2. lower case all of the words in our reviews
3. remove stop words
4. check for spelling and correct where needed
5. remove frequent words
6. remove rare/uncommon words
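
For steps 5 and 6, one possible approach is to count every word across the corpus, then drop the most common words and the words that appear only rarely. The sketch below assumes the reviews are already cleaned and can be tokenized on whitespace. The function name, the `top_n` and `min_count` parameters, and the sample reviews are all assumptions made for illustration.

```python
from collections import Counter

def drop_frequent_and_rare(reviews, top_n=1, min_count=2):
    """Remove the top_n most common words and words seen fewer than min_count times."""
    counts = Counter(word for review in reviews for word in review.split())
    frequent = {word for word, _ in counts.most_common(top_n)}
    rare = {word for word, n in counts.items() if n < min_count}
    drop = frequent | rare
    return [" ".join(w for w in review.split() if w not in drop) for review in reviews]

# made-up, already cleaned reviews
reviews = ["class great class fun", "class fun instructor", "great class studio"]
print(drop_frequent_and_rare(reviews))  # prints: ['great fun', 'fun', 'great']
```

Here "class" is dropped as the most frequent word, and "instructor" and "studio" are dropped because each appears only once. Suitable thresholds depend on the size of the corpus.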


#### 1) and 2) get rid of special characters and lower case:

Use the function *clean_text_column*, which we imported above.

In [11]:
user_reviews_df.rev_comp_reviews = user_reviews_df.rev_comp_reviews.apply(clean_text_column)
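
The body of `clean_text_column` lives in `Mod_5_functions` and is not shown in this notebook. As a rough idea of what a function like it might do, here is a hypothetical stand-in (`clean_text_sketch` is a name invented here, not the imported function) that lower-cases the text and replaces everything except letters, digits, and whitespace.

```python
import re

def clean_text_sketch(text):
    """Hypothetical stand-in: lower-case the text and strip special characters."""
    text = str(text).lower()
    # replace anything that is not a letter, digit, or whitespace with a space
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # collapse the runs of whitespace left behind by the substitution
    return re.sub(r"\s+", " ", text).strip()

cleaned = clean_text_sketch("Loved it!!! 10/10, would go AGAIN.")
print(cleaned)  # prints: loved it 10 10 would go again
```

Note that a rule like this also splits contractions ("don't" becomes "don t"), so the real function may treat apostrophes differently.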

#### 8. drop stop words


In [13]:
stop = set(stopwords.words('english'))  # load the English stop words into a set for fast lookups
# keep only the words that are not in the set of stop words
user_reviews_df.rev_comp_reviews = user_reviews_df.rev_comp_reviews.apply(
    lambda text: " ".join(word for word in text.split() if word not in stop))

#### 9. lemmatize our words


In [15]:
def return_lemma(review, nlp):
    """Return the review with every token replaced by its lemma."""
    doc = nlp(review)
    return ' '.join(token.lemma_ for token in doc)

In [16]:
# `nlp` is the en_core_web_sm pipeline loaded in the imports cell
user_reviews_df.rev_comp_reviews = user_reviews_df.rev_comp_reviews.apply(lambda review: return_lemma(review, nlp))


#### 10. TF-IDF
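
TF-IDF weights a term by how often it appears in a review and down-weights terms that appear in many reviews. As an illustration only, the sketch below computes the plain textbook form, term frequency times log(N / document frequency), using the standard library. It is not this notebook's implementation, and libraries such as scikit-learn apply additional smoothing and normalization by default, so their numbers will differ. The sample reviews are made up.

```python
import math
from collections import Counter

def tf_idf(tokenized_docs):
    """Return one {term: score} dict per document using tf * log(N / df)."""
    n_docs = len(tokenized_docs)
    # document frequency: the number of documents that contain each term
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    scores = []
    for doc in tokenized_docs:
        term_counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return scores

# made-up, already cleaned and lemmatized reviews
docs = [review.split() for review in
        ["barre class love", "yoga class", "barre class studio"]]
scores = tf_idf(docs)
print(scores[0])
```

In this toy corpus "class" appears in every review, so its score is zero, while "love" appears in only one review and gets the highest score in the first document.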


#### 11. Sentiment Analysis

In [17]:
user_reviews_df.rev_comp_reviews

0        planet fitness affordable frill gym happy opti...
1        purchase groupon friend call book receptionist...
2        enjoy class one least favorite barre studio si...
3        come pilate mat fundamental class love -PRON- ...
4        come first peloton class awhile back completel...
5        find sonic class pass donation base class happ...
6        think think stereotypical yoga studio mean goo...
7        jesus sign burn barre class nicole classpass d...
8        wow place really gorgeous come flow yoga basic...
9        think favorite barre class yet sign classpass ...
10       two star may seem harsh -PRON- will see nothin...
11       -PRON- have use classpass week rating sort bas...
12       sign first ever barre class classpass great ti...
13       second barre class love pure barre classic sop...
14       jesus much fun come see force majeure vaudevil...
15       buy fitness pass one gym include pass rock fri...
16       grow come make sad know well even go sleep awa.