# Binary Classification of Park Reviews
By scraping Google Maps for reviews of parks in Montréal, a dataset of 45k+ reviews was collected. The next step after data collection is to build a classifier that determines whether a review expresses positive or negative sentiment about a park. To arrive at a classifier, several models are tested and compared. 

This notebook is organized into the following sections: 
<br>1. [Data Exploration](#data-exp)
<br>&nbsp;&nbsp; 1.1 [Read in data](#read-data)
<br>&nbsp;&nbsp; 1.2 [Calculating summary statistics](#sum-stats)
<br>&nbsp;&nbsp; 1.3 [Examining the most frequent words](#common-terms)
<br>2. [Testing different classifiers](#classifiers)
<br>&nbsp;&nbsp; 2.1 [Text vectorization](#pre-process)
<br>&nbsp;&nbsp; 2.2 [Testing various classifiers](#test-class)

<a id="data-exp"></a> 
## 1. Data Exploration 
Before building classifiers, the first step is to get familiar with the datasets. This section covers loading the data, looking at the distribution of stars across the reviews, and looking for patterns in the text to see whether any terms appear more often in negative or in positive reviews. 

<a id="read-data"></a> 
### 1.1 Read in data
The data is stored in two CSV files. The first (allEnReviews.csv) contains all of the reviews in English. The second (ParkReviewsLang.csv) contains reviews in all languages, plus a column with the language each review is written in. Using the second dataset, the French reviews can be extracted and analyzed on their own. 

In [1]:
import pandas as pd

In [3]:
# read in english reviews 
parkReviews = pd.read_csv('allEnReviews.csv', index_col=0)
parkReviews.head()

Unnamed: 0,review_for,review_id,username,user_url,published,date_retrieved,num_stars,num_reviews,review_text,label
0,Parc de la Capture-d'Ethan-Allen,ChdDSUhNMG9nS0VJQ0FnSUNpeGF6TTNnRRAB,Claudia,https://www.google.com/maps/contrib/1001449741...,7 months ago,2021-06-20 22:04:09.211296,4.0,107.0,One of the nicest entry points to this invitin...,1
1,Parc de la Capture-d'Ethan-Allen,ChdDSUhNMG9nS0VJQ0FnSURDOGEyMGpnRRAB,Nate Neel,https://www.google.com/maps/contrib/1121030547...,8 months ago,2021-06-20 22:04:09.212245,5.0,121.0,"Waterfront to fish or just relax, great place ...",1
2,Parc de la Capture-d'Ethan-Allen,ChdDSUhNMG9nS0VJQ0FnSUM4Nk9Ya3lnRRAB,Yucel Salimoglu,https://www.google.com/maps/contrib/1034180738...,11 months ago,2021-06-20 22:04:09.213178,4.0,79.0,Everything except the parking is good here.,1
3,Parc de la Capture-d'Ethan-Allen,ChZDSUhNMG9nS0VJQ0FnSUNVdWNUbE9REAE,COCO BEADZ,https://www.google.com/maps/contrib/1036060504...,a year ago,2021-06-20 22:04:09.214115,4.0,128.0,"Defenely the best park in Montreal East, Tetre...",1
4,Parc de la Capture-d'Ethan-Allen,ChdDSUhNMG9nS0VJQ0FnSUMwdHJDTm1nRRAB,Anna Maria Fiore,https://www.google.com/maps/contrib/1016779009...,a year ago,2021-06-20 22:04:09.215069,5.0,39.0,It's so peaceful and happy place near the water,1


There are a total of 3312 three-star reviews. Because the number of stars serves as the cutoff for whether a review is classified as positive or negative, it helps to first get a sense of what kind of reviews fall into the three-star category. In the output below, the three-star reviews carry the label 0 (negative). 

In [45]:
# take a look at 3 star reviews 
parkReviews[parkReviews['num_stars'] == 3.0]

Unnamed: 0,review_for,review_id,username,user_url,published,date_retrieved,num_stars,num_reviews,review_text,label,word_count,uppercase_char_count,special_char_count
13,Parc de la Capture-d'Ethan-Allen,ChdDSUhNMG9nS0VJQ0FnSUNnd0xqYmhnRRAB,Marc-André Maurice,https://www.google.com/maps/contrib/1054968570...,4 years ago,2021-06-20 22:04:09.223866,3.0,32.0,If you want to have a beer by the St-Laurence ...,0,12,3,1
15,Parc de la Capture-d'Ethan-Allen,ChdDSUhNMG9nS0VJQ0FnSURnd2R1bW5nRRAB,Carismé Pierre,https://www.google.com/maps/contrib/1056897378...,3 years ago,2021-06-20 22:04:09.225823,3.0,97.0,Correct...,0,1,1,3
18,Parc de la Capture-d'Ethan-Allen,ChZDSUhNMG9nS0VJQ0FnSUNxMWM2bVFREAE,Steve Huard,https://www.google.com/maps/contrib/1165278777...,6 days ago,2021-06-20 22:04:09.228581,3.0,3.0,Très beau mes j'aimerais pouvoir descendre au ...,0,21,1,3
121,Parc Mohawk,ChdDSUhNMG9nS0VJQ0FnSUNpbVB6SjJBRRAB,Miguel Veliz,https://www.google.com/maps/contrib/1101192075...,8 months ago,2021-06-22 11:54:53.183611,3.0,61.0,"Field is uneven, represents risks for players....",0,20,4,5
124,Parc Mohawk,ChdDSUhNMG9nS0VJQ0FnSUM0bVBIazB3RRAB,Fonaq,https://www.google.com/maps/contrib/1076101585...,2 years ago,2021-06-22 11:54:53.191064,3.0,17.0,"Great place to go play tennis, soccer or half ...",0,11,1,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
41749,Square Dézéry,ChdDSUhNMG9nS0VJQ0FnSURDbU9lRnhnRRAB,Andre Gagnon,https://www.google.com/maps/contrib/1019125653...,9 months ago,2021-06-22 20:31:39.563398,3.0,72.0,Too many itinerants next to Notre Dame campsite,0,8,3,0
41758,Square Dézéry,ChdDSUhNMG9nS0VJQ0FnSUNRMW8tbXRBRRAB,Francis Tanguay,https://www.google.com/maps/contrib/1170721416...,3 years ago,2021-06-22 20:31:39.573976,3.0,18.0,A little too expensive for some article but ve...,0,11,1,1
41761,Square Dézéry,ChdDSUhNMG9nS0VJQ0FnSURRMU1fM3pBRRAB,Johanne gauvreau,https://www.google.com/maps/contrib/1162172471...,3 years ago,2021-06-22 20:31:39.580552,3.0,139.0,Very good xx,0,3,1,0
41783,Parc Paul-Séguin,ChdDSUhNMG9nS0VJQ0FnSURnLS16Um93RRAB,Denise Le Blanc,https://www.google.com/maps/contrib/1040043322...,a year ago,2021-06-23 16:12:02.475158,3.0,12.0,"A small park ideal for young families, games, ...",0,13,2,4
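
The `label` column is 0 for the three-star rows above and 1 for the four- and five-star rows shown earlier, which is consistent with a cutoff at four stars. The following is a minimal sketch of such a rule. It assumes, without confirmation from the notebook, that labels were derived as `num_stars >= 4`; the toy frame and the name `POSITIVE_MIN_STARS` are hypothetical.

```python
import pandas as pd

# Hypothetical threshold: reviews with at least this many stars count as positive.
POSITIVE_MIN_STARS = 4

# Toy data standing in for the scraped reviews.
toy = pd.DataFrame({"num_stars": [1.0, 3.0, 4.0, 5.0]})

# 1 = positive, 0 = negative (three-star reviews fall on the negative side).
toy["label"] = (toy["num_stars"] >= POSITIVE_MIN_STARS).astype(int)
print(toy)
```

Moving the threshold to three stars would shift all 3312 three-star reviews into the positive class, so the choice has a visible effect on the class balance.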


In [4]:
# read in park reviews with language labels
parkReviews = pd.read_csv('ParkReviewsLang.csv', index_col=0)
parkReviews.shape

(41822, 11)

In [3]:
# take a look at some park reviews
parkReviews.head()

Unnamed: 0,review_for,review_id,username,user_url,published,date_retrieved,num_stars,num_reviews,review_text,label,lang
0,Parc de la Capture-d'Ethan-Allen,ChdDSUhNMG9nS0VJQ0FnSUNpeGF6TTNnRRAB,Claudia,https://www.google.com/maps/contrib/1001449741...,7 months ago,2021-06-20 22:04:09.211296,4.0,107.0,One of the nicest entry points to this invitin...,1,en
1,Parc de la Capture-d'Ethan-Allen,ChdDSUhNMG9nS0VJQ0FnSURDOGEyMGpnRRAB,Nate Neel,https://www.google.com/maps/contrib/1121030547...,8 months ago,2021-06-20 22:04:09.212245,5.0,121.0,"Waterfront to fish or just relax, great place ...",1,en
2,Parc de la Capture-d'Ethan-Allen,ChdDSUhNMG9nS0VJQ0FnSUM4Nk9Ya3lnRRAB,Yucel Salimoglu,https://www.google.com/maps/contrib/1034180738...,11 months ago,2021-06-20 22:04:09.213178,4.0,79.0,Everything except the parking is good here.,1,en
3,Parc de la Capture-d'Ethan-Allen,ChZDSUhNMG9nS0VJQ0FnSUNVdWNUbE9REAE,COCO BEADZ,https://www.google.com/maps/contrib/1036060504...,a year ago,2021-06-20 22:04:09.214115,4.0,128.0,"Defenely the best park in Montreal East, Tetre...",1,en
4,Parc de la Capture-d'Ethan-Allen,ChdDSUhNMG9nS0VJQ0FnSUMwdHJDTm1nRRAB,Anna Maria Fiore,https://www.google.com/maps/contrib/1016779009...,a year ago,2021-06-20 22:04:09.215069,5.0,39.0,It's so peaceful and happy place near the water,1,en


In [6]:
# extract French reviews only 
frenchReviews = parkReviews[parkReviews['lang'] == 'fr']
frenchReviews.shape

(17149, 11)

Google Maps reviews written in a language other than English also include an automatic English translation. To obtain only the original French text, the English translation has to be removed. 

In [12]:
# extract French text and save in additional column 
frenchReviewsDf = frenchReviews.copy()
frenchReviewText = frenchReviews['review_text'].apply(lambda x: x.split('(Original)')[-1].strip())
frenchReviewsDf['french_text'] = frenchReviewText

In [13]:
frenchReviewsDf.head()

Unnamed: 0,review_for,review_id,username,user_url,published,date_retrieved,num_stars,num_reviews,review_text,label,lang,french_text
22684,Parc de la Capture-d'Ethan-Allen,ChdDSUhNMG9nS0VJQ0FnSUNxM3JIQjhBRRAB,Claude Gagnon,https://www.google.com/maps/contrib/1182846684...,a week ago,2021-06-20 22:04:09.230541,4.0,86.0,(Translated by Google) Very beautiful park to ...,1,fr,Tres beau parc pour faire un pinic et profiter...
22685,Parc de la Capture-d'Ethan-Allen,ChdDSUhNMG9nS0VJQ0FnSURLckpMTm5nRRAB,Guy Durand,https://www.google.com/maps/contrib/1056233036...,a month ago,2021-06-20 22:04:09.232736,5.0,53.0,(Translated by Google) The people are all very...,1,fr,Les gens sont tous très sociables.\nExceptionnel.
22686,Parc de la Capture-d'Ethan-Allen,ChZDSUhNMG9nS0VJQ0FnSUNBNktMclNREAE,Dania Pascual,https://www.google.com/maps/contrib/1064940459...,2 years ago,2021-06-20 22:04:09.233660,5.0,41.0,"(Translated by Google) Nice place to walk, jog...",1,fr,"Belle place pour marcher, jogger, promener le ..."
22687,Parc de la Capture-d'Ethan-Allen,ChdDSUhNMG9nS0VJQ0FnSUNxNWFEUzBBRRAB,Stéphane Lessard,https://www.google.com/maps/contrib/1061684432...,a week ago,2021-06-20 22:04:09.234580,4.0,7.0,(Translated by Google) Excellent food!\n\n(Ori...,1,fr,Nourriture excellente !
22688,Parc de la Capture-d'Ethan-Allen,ChZDSUhNMG9nS0VJQ0FnSUNLaTZ2YVFBEAE,Pitchou Kasongo,https://www.google.com/maps/contrib/1176531246...,2 months ago,2021-06-20 22:04:09.235577,4.0,93.0,(Translated by Google) I like to get some fres...,1,fr,J aime bien pour prendre de l air frais
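
To make the extraction step concrete, here is a small sketch of how splitting on the `(Original)` marker behaves on one translated and one untranslated review. The helper name `extract_original` is hypothetical, and the exact layout of the translated string (in particular the line break after `(Original)`) is an assumption based on the truncated `review_text` values shown above.

```python
def extract_original(review_text: str) -> str:
    """Return the text after the last '(Original)' marker, or the whole text if absent."""
    return review_text.split("(Original)")[-1].strip()

# Assumed layout of a translated review as scraped from Google Maps.
translated = "(Translated by Google) Excellent food!\n\n(Original)\nNourriture excellente !"
untranslated = "Everything except the parking is good here."

print(extract_original(translated))    # only the French part remains
print(extract_original(untranslated))  # unchanged, since there is no marker
```

Because `split` returns the whole string as a single element when the marker is missing, reviews without a translation pass through untouched.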


<a id="sum-stats"></a> 
### 1.2 Calculating summary statistics 
This section calculates summary statistics for each review: the number of words, the number of special characters and the number of uppercase characters. These values can then be used to check whether positive and negative reviews differ noticeably, and whether English and French reviews differ. 

In [6]:
import string

In [4]:
# determine the number of words, special and uppercase characters per review

parkReviews['word_count'] = [len(review.split()) for review in parkReviews['review_text']]

parkReviews['uppercase_char_count'] = [sum(char.isupper() for char in review) \
                              for review in parkReviews['review_text']]                           

parkReviews['special_char_count'] = [sum(char in string.punctuation for char in review) \
                            for review in parkReviews['review_text']]       

In [14]:
# compute the same statistics on the original French text
frenchReviewsDf['word_count'] = [len(review.split()) for review in frenchReviewsDf['french_text']]

frenchReviewsDf['uppercase_char_count'] = [sum(char.isupper() for char in review) \
                              for review in frenchReviewsDf['french_text']]                           

frenchReviewsDf['special_char_count'] = [sum(char in string.punctuation for char in review) \
                            for review in frenchReviewsDf['french_text']]       
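
The two cells above repeat the same three computations for different columns. One possible way to avoid the duplication is a small helper, sketched below. The function name `add_text_stats` and the toy frame are hypothetical; the counting rules are the same as in the cells above.

```python
import string
import pandas as pd

def add_text_stats(df: pd.DataFrame, text_col: str) -> pd.DataFrame:
    """Return a copy of df with word, uppercase and punctuation counts for text_col."""
    out = df.copy()
    text = out[text_col]
    # whitespace-separated tokens per review
    out["word_count"] = text.str.split().str.len()
    # str.isupper() also recognises accented capitals such as 'É'
    out["uppercase_char_count"] = text.apply(lambda s: sum(ch.isupper() for ch in s))
    # "special" characters are the ASCII punctuation characters
    out["special_char_count"] = text.apply(lambda s: sum(ch in string.punctuation for ch in s))
    return out

toy = pd.DataFrame({"review_text": ["Very good xx", "Correct..."]})
stats_df = add_text_stats(toy, "review_text")
print(stats_df)
```

The same helper could then be called once with `'review_text'` and once with `'french_text'`.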

In [9]:
# create a breakdown of English positive and negative reviews
pos_reviews = parkReviews[parkReviews['label'] == 1] 
neg_reviews = parkReviews[parkReviews['label'] == 0]

In [15]:
# create a breakdown of French positive and negative reviews 
pos_reviewsFr = frenchReviewsDf[frenchReviewsDf['label'] == 1]
neg_reviewsFr = frenchReviewsDf[frenchReviewsDf['label'] == 0]

After splitting the dataset into positive and negative reviews, one can compare the length of the reviews in each group. 88.28% of the reviews are positive, so the dataset is imbalanced. At the same time, review length differs only slightly between the groups: negative reviews are roughly two words longer on average in the English case and roughly four words longer in the French case. 

In [10]:
pos_reviews['word_count'].describe()

count    36920.000000
mean        13.435049
std         17.949938
min          1.000000
25%          4.000000
50%          8.000000
75%         16.000000
max        634.000000
Name: word_count, dtype: float64

In [11]:
neg_reviews['word_count'].describe()

count    4902.000000
mean       15.740718
std        21.407586
min         1.000000
25%         4.000000
50%         9.000000
75%        19.000000
max       490.000000
Name: word_count, dtype: float64

In [48]:
36920/(4902 + 36920)

0.8827889627468797

In [16]:
# positive word count for French reviews
pos_reviewsFr['word_count'].describe()

count    14635.000000
mean        12.701537
std         16.127711
min          1.000000
25%          4.000000
50%          8.000000
75%         15.000000
max        474.000000
Name: word_count, dtype: float64

In [17]:
# negative reviews word count for French reviews
neg_reviewsFr['word_count'].describe()

count    2514.000000
mean       16.305091
std        19.940490
min         1.000000
25%         5.000000
50%        10.000000
75%        20.000000
max       258.000000
Name: word_count, dtype: float64

In [18]:
14635/(14635 + 2514) # 85.3% positive reviews in French dataset

0.8534025307598111

Next we compare the number of uppercase letters used in the positive and negative reviews. Here, too, there is no real difference between the two groups, in both the English and the French case. 

In [12]:
pos_reviews['uppercase_char_count'].describe()

count    36920.000000
mean         2.075433
std          3.525758
min          0.000000
25%          1.000000
50%          1.000000
75%          2.000000
max        259.000000
Name: uppercase_char_count, dtype: float64

In [13]:
neg_reviews['uppercase_char_count'].describe()

count    4902.000000
mean        2.157487
std         3.377507
min         0.000000
25%         1.000000
50%         1.000000
75%         2.000000
max        87.000000
Name: uppercase_char_count, dtype: float64

In [19]:
pos_reviewsFr['uppercase_char_count'].describe()

count    14635.000000
mean         1.819542
std          2.511317
min          0.000000
25%          1.000000
50%          1.000000
75%          2.000000
max        107.000000
Name: uppercase_char_count, dtype: float64

In [20]:
neg_reviewsFr['uppercase_char_count'].describe()

count    2514.000000
mean        2.029037
std         3.391481
min         0.000000
25%         1.000000
50%         1.000000
75%         2.000000
max        80.000000
Name: uppercase_char_count, dtype: float64

Finally, let's take a look at the special characters present in the positive and negative groups of reviews.

In [14]:
pos_reviews['special_char_count'].describe()

count    36920.000000
mean         2.111430
std          3.324735
min          0.000000
25%          0.000000
50%          1.000000
75%          3.000000
max        135.000000
Name: special_char_count, dtype: float64

In [15]:
neg_reviews['special_char_count'].describe()

count    4902.000000
mean        2.436557
std         4.232427
min         0.000000
25%         0.000000
50%         1.000000
75%         3.000000
max       109.000000
Name: special_char_count, dtype: float64

In [21]:
pos_reviewsFr['special_char_count'].describe()

count    14635.000000
mean         2.345268
std          3.710336
min          0.000000
25%          0.000000
50%          1.000000
75%          3.000000
max         94.000000
Name: special_char_count, dtype: float64

In [22]:
neg_reviewsFr['special_char_count'].describe()

count    2514.000000
mean        2.945505
std         4.274390
min         0.000000
25%         0.000000
50%         2.000000
75%         4.000000
max        63.000000
Name: special_char_count, dtype: float64

Overall, the summary statistics of positive and negative reviews differ only slightly in average review length and in special character counts; negative reviews are marginally longer and contain slightly more special characters. Whether these small differences are statistically significant cannot be judged from the tables alone. A more exact way to determine this would be to perform a two-sample t-test. 
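
A two-sample t-test of this kind can be computed directly from the `describe()` outputs above, without reloading the data. The sketch below uses `scipy.stats.ttest_ind_from_stats` with the word-count mean, standard deviation and count of the English positive and negative reviews. Choosing Welch's variant (`equal_var=False`) is an assumption made here because the two groups have different spreads; word counts are also heavily skewed, so the result should be read as a rough check rather than a definitive answer.

```python
from scipy import stats

# Word-count summary statistics copied from the describe() outputs above
# (English positive vs. negative reviews).
pos_mean, pos_std, pos_n = 13.435049, 17.949938, 36920
neg_mean, neg_std, neg_n = 15.740718, 21.407586, 4902

# Welch's t-test: does not assume equal variances in the two groups.
t_stat, p_value = stats.ttest_ind_from_stats(
    pos_mean, pos_std, pos_n,
    neg_mean, neg_std, neg_n,
    equal_var=False,
)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
```

With samples this large, even a difference of about two words in the means can come out as statistically significant, so the size of the difference matters more than the p-value when judging whether review length is a useful feature.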

<a id="common-terms"></a> 
### 1.3 Examining the most frequent words
For both the English and French datasets, the most common words can be determined. Examining both datasets shows that the most common words in the positive and negative review categories are very similar. 

In [18]:
from collections import Counter

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/andreamock/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [17]:
def getMostCommonWords(reviews, n_most_common, stopwords=None):
    
    '''Given a list of reviews, extracts the n most common words and, if applicable, removes stopwords. 
    A list of (word, count) tuples for the n most common words is returned'''
    # flatten review column into a list of words, and set each to lowercase
    flattened_reviews = [word for review in reviews for word in \
                         review.lower().split()]


    # remove punctuation from reviews
    flattened_reviews = [''.join(char for char in review if \
                                 char not in string.punctuation) for \
                         review in flattened_reviews]


    # remove stopwords, if applicable
    if stopwords:
        flattened_reviews = [word for word in flattened_reviews if \
                             word not in stopwords]


    # remove any empty strings that were created by this process
    flattened_reviews = [review for review in flattened_reviews if review]

    return Counter(flattened_reviews).most_common(n_most_common)

### 1.3.1 Most common words in English reviews
The most common words in the positive and negative English reviews are very similar and contain a lot of stopwords. Repeating the same process with stopwords removed extracts words that are more meaningful. In both cases the words in the negative and positive review categories are very similar. One reason for this could be that negative reviews contain negations, e.g. "not beautiful", and counting single words separates the negation from the word it modifies.

In [20]:
# most common words in positive English reviews
getMostCommonWords(pos_reviews['review_text'], 15)

[('the', 23250),
 ('a', 17043),
 ('and', 16999),
 ('park', 15169),
 ('to', 13713),
 ('for', 11101),
 ('of', 10785),
 ('place', 8822),
 ('in', 8482),
 ('nice', 8142),
 ('beautiful', 7812),
 ('is', 7354),
 ('very', 6595),
 ('with', 6271),
 ('great', 5028)]

In [21]:
# most common words in negative English reviews
getMostCommonWords(neg_reviews['review_text'], 15)

[('the', 4240),
 ('a', 2299),
 ('and', 2045),
 ('to', 1913),
 ('park', 1826),
 ('of', 1594),
 ('is', 1485),
 ('for', 1299),
 ('in', 1109),
 ('not', 1039),
 ('it', 991),
 ('but', 943),
 ('there', 789),
 ('nice', 723),
 ('i', 701)]

In [49]:
getMostCommonWords(pos_reviews['review_text'], 10, stopwords.words('english'))

[('park', 15169),
 ('place', 8822),
 ('nice', 8142),
 ('beautiful', 7812),
 ('great', 5028),
 ('good', 3680),
 ('kids', 2177),
 ('walk', 2147),
 ('montreal', 2019),
 ('water', 1977)]

In [23]:
getMostCommonWords(neg_reviews['review_text'], 15, stopwords.words('english'))

[('park', 1826),
 ('nice', 723),
 ('place', 686),
 ('beautiful', 446),
 ('good', 431),
 ('people', 357),
 ('small', 354),
 ('little', 301),
 ('children', 297),
 ('water', 277),
 ('kids', 261),
 ('many', 245),
 ('go', 233),
 ('great', 203),
 ('lot', 196)]
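
The negation hypothesis for the English reviews could be probed by counting pairs of adjacent words (bigrams) instead of single words, so that "not beautiful" is kept together. The sketch below is one way to do this. The function name `most_common_bigrams` and the toy reviews are hypothetical, and the cleaning steps mirror `getMostCommonWords` only loosely.

```python
from collections import Counter
import string

def most_common_bigrams(reviews, n_most_common):
    """Count adjacent word pairs within each review (lowercased, punctuation stripped)."""
    strip_table = str.maketrans("", "", string.punctuation)
    counts = Counter()
    for review in reviews:
        words = review.lower().translate(strip_table).split()
        # zip pairs every word with its successor inside the same review
        counts.update(zip(words, words[1:]))
    return counts.most_common(n_most_common)

toy_reviews = ["Not beautiful, not clean.", "Beautiful park!", "Not clean at all"]
print(most_common_bigrams(toy_reviews, 3))
```

Run on `neg_reviews['review_text']`, a function like this would show whether pairs such as ("not", "nice") account for the positive-looking words in the negative group.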

### 1.3.2 Most common words in French reviews

As with the English reviews, the most common words in the French reviews include a lot of stopwords. Even without stopwords the same pattern as in the English reviews appears: the most common words in positive and negative reviews are very similar.

In [26]:
# most common words in positive French reviews
getMostCommonWords(pos_reviewsFr['french_text'], 15)

[('de', 8125),
 ('parc', 6350),
 ('pour', 6085),
 ('et', 5374),
 ('les', 3795),
 ('très', 3631),
 ('un', 3510),
 ('beau', 3264),
 ('la', 3110),
 ('le', 2947),
 ('des', 2541),
 ('en', 2281),
 ('à', 2241),
 ('avec', 1934),
 ('bien', 1706)]

In [27]:
# most common words in negative French reviews
getMostCommonWords(neg_reviewsFr['french_text'], 15)

[('de', 2048),
 ('parc', 1001),
 ('les', 930),
 ('et', 930),
 ('pour', 920),
 ('le', 811),
 ('pas', 736),
 ('la', 720),
 ('un', 666),
 ('des', 499),
 ('à', 496),
 ('a', 476),
 ('mais', 471),
 ('en', 433),
 ('il', 426)]

In [28]:
getMostCommonWords(pos_reviewsFr['french_text'], 10, stopwords.words('french'))

[('parc', 6350),
 ('très', 3631),
 ('beau', 3264),
 ('bien', 1706),
 ('endroit', 1702),
 ('a', 1697),
 ('enfants', 1500),
 ('jeux', 1299),
 ('belle', 1267),
 ('cest', 1036)]

In [30]:
getMostCommonWords(neg_reviewsFr['french_text'], 15, stopwords.words('french'))

[('parc', 1001),
 ('a', 476),
 ('très', 387),
 ('beau', 330),
 ('cest', 296),
 ('enfants', 278),
 ('bien', 269),
 ('petit', 225),
 ('beaucoup', 218),
 ('plus', 206),
 ('trop', 195),
 ('jeux', 192),
 ('endroit', 186),
 ('terrain', 174),
 ('peu', 160)]

<a id="classifiers"></a> 

## 2. Testing different classifiers
Having looked at the dataset in general, the next step is to create a classifier that determines whether a review is negative or positive. The accuracy of binary classification on the English and on the French reviews is also compared. As seen in the sections below, the classifiers trained on the French reviews are less accurate than those trained on the English reviews.

<a id="pre-process"></a> 
### 2.1 Text vectorization 
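Using the tf-idf vectorizer (`TfidfVectorizer`), each review is converted into a feature vector in which every term is weighted by its tf-idf score. With `min_df=15`, terms that appear in fewer than 15 reviews are dropped.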

In [34]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split

In [148]:
vectorizer = TfidfVectorizer(min_df=15)
bow = vectorizer.fit_transform(list(parkReviews['review_text']))
labels = parkReviews['label']

In [149]:
bow.shape

(41822, 2127)
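
To illustrate what the weights in this matrix mean, the toy example below recomputes one inverse document frequency by hand and compares it with the value stored by `TfidfVectorizer`. It relies on scikit-learn's default settings (`smooth_idf=True`, `norm="l2"`); the three toy documents are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["nice park", "nice place", "dirty park"]

vec = TfidfVectorizer()          # defaults: smooth_idf=True, norm="l2"
tfidf = vec.fit_transform(docs)  # sparse matrix, one row per document

# With smooth_idf=True, scikit-learn uses idf(t) = ln((1 + n) / (1 + df(t))) + 1.
n_docs = len(docs)
df_nice = 2                      # "nice" occurs in two of the three documents
idf_nice = np.log((1 + n_docs) / (1 + df_nice)) + 1

nice_col = vec.vocabulary_["nice"]  # column index of the term "nice"
print(vec.idf_[nice_col], idf_nice)

# Each row is scaled to unit Euclidean length after weighting.
row_norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1)).ravel())
print(row_norms)
```

Because of the row normalisation, a term's score in a given review depends on the other terms in that review as well as on its own frequency.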

In [150]:
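# names of features
# (get_feature_names() was removed in scikit-learn 1.2; newer versions provide get_feature_names_out())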
vectorizer.get_feature_names()

['00',
 '10',
 '100',
 '11',
 '12',
 '13',
 '15',
 '18',
 '19',
 '20',
 '2017',
 '2018',
 '2019',
 '2020',
 '2021',
 '25',
 '30',
 '45',
 '50',
 '67',
 'abandoned',
 'able',
 'about',
 'above',
 'absolutely',
 'access',
 'accessible',
 'accommodate',
 'according',
 'across',
 'active',
 'activities',
 'activity',
 'actually',
 'adapted',
 'add',
 'added',
 'addicts',
 'adding',
 'addition',
 'adds',
 'adequate',
 'adjacent',
 'adjusted',
 'admire',
 'adorable',
 'adore',
 'adult',
 'adults',
 'advantage',
 'advise',
 'affordable',
 'after',
 'afternoon',
 'again',
 'against',
 'age',
 'ages',
 'ago',
 'air',
 'alcohol',
 'alike',
 'alive',
 'all',
 'allow',
 'allowed',
 'allows',
 'almost',
 'alone',
 'along',
 'alot',
 'already',
 'also',
 'although',
 'always',
 'am',
 'amazing',
 'ambiance',
 'ambience',
 'amenities',
 'among',
 'amount',
 'ample',
 'amusement',
 'an',
 'and',
 'angrignon',
 'angus',
 'animal',
 'animals',
 'animated',
 'animation',
 'anjou',
 'annoying',
 'another'

In [99]:
# total number of features
len(vectorizer.get_feature_names())

2127

In [113]:
# create a dictionary of features and their tfidf scores in the first review (row 0 of the matrix)
tfidfDict = dict(zip(vectorizer.get_feature_names(), bow.toarray()[0]))
tfidfDict['appreciated'] # sample score for one word

0.2007442684284577

In [119]:
# create a dataframe that contains each feature and its tfidf score in the first review
featureDf = pd.DataFrame.from_dict(tfidfDict, 
                                   orient='index', columns=['tfidf'])
featureDf.reset_index(inplace=True)
featureDf = featureDf.rename(columns = {'index':'feature'})

In [122]:
featureDf.sort_values('tfidf')[-10:] 

Unnamed: 0,feature,tfidf
1895,today,0.183384
575,enjoyable,0.190613
1214,nicest,0.196902
1302,pandemic,0.199586
105,appreciated,0.200744
1397,points,0.208271
1445,promenade,0.2156
1101,mask,0.220245
584,entry,0.221551
918,inviting,0.222929
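
The table above ranks terms by their score in row 0 of the matrix, i.e. in a single review. If a corpus-wide ranking is wanted instead, one option is to average each column over all reviews. The sketch below shows this on toy data; the variable names are hypothetical, and using the mean (rather than, say, the sum or the maximum) is just one possible choice.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

toy_reviews = ["nice park", "nice place", "dirty park", "nice nice park"]
toy_vec = TfidfVectorizer()
toy_bow = toy_vec.fit_transform(toy_reviews)

# Column means over all documents; works on the sparse matrix without toarray().
mean_scores = np.asarray(toy_bow.mean(axis=0)).ravel()

# Recover the terms in column order from the fitted vocabulary (term -> column index).
terms = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)

mean_tfidf = pd.Series(mean_scores, index=terms).sort_values(ascending=False)
print(mean_tfidf.head())
```

Avoiding `toarray()` also matters at full scale, since a dense 41822 x 2127 matrix is considerably larger than its sparse counterpart.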


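Next, the same lookup is repeated separately for the positive and the negative reviews. Note that `bow_pos.toarray()[0]` and `bow_neg.toarray()[0]` hold the tfidf scores of the first review in each subset only, so the tables below rank the terms of that single review rather than of the whole subset; terms that do not occur in it have a score of 0.0.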

In [124]:
vectorizer_pos = TfidfVectorizer(min_df=15)
bow_pos = vectorizer_pos.fit_transform(list(pos_reviews['review_text']))
labels_pos = pos_reviews['label']

In [125]:
vectorizer_neg = TfidfVectorizer(min_df=15)
bow_neg = vectorizer_neg.fit_transform(list(neg_reviews['review_text']))
labels_neg = neg_reviews['label']

In [126]:
tfidfDictPos = dict(zip(vectorizer_pos.get_feature_names(), bow_pos.toarray()[0]))
tfidfDictNeg = dict(zip(vectorizer_neg.get_feature_names(), bow_neg.toarray()[0]))

In [127]:
posFeatureDf = pd.DataFrame.from_dict(tfidfDictPos, 
                                   orient='index', columns=['tfidf'])
posFeatureDf.reset_index(inplace=True)
posFeatureDf = posFeatureDf.rename(columns = {'index':'feature'})

In [128]:
negFeatureDf = pd.DataFrame.from_dict(tfidfDictNeg, 
                                   orient='index', columns=['tfidf'])
negFeatureDf.reset_index(inplace=True)
negFeatureDf = negFeatureDf.rename(columns = {'index':'feature'})

In [132]:
posFeatureDf.sort_values('tfidf')[-15:]

Unnamed: 0,feature,tfidf
1611,the,0.175157
1643,toilet,0.178911
284,chalet,0.183008
605,forget,0.184969
862,lawrence,0.187326
1077,offers,0.188328
1764,waterfront,0.196256
1624,this,0.198067
1639,today,0.201076
492,enjoyable,0.202747


In [131]:
negFeatureDf.sort_values('tfidf')[-15:]

Unnamed: 0,feature,tfidf
188,friendly,0.0
189,friends,0.0
190,from,0.0
187,free,0.0
481,the,0.127299
498,to,0.154861
567,you,0.243902
215,have,0.251447
232,if,0.277831
71,by,0.282278


In [137]:
# select 200 best features 
selected_features = SelectKBest(chi2, k=200).fit(bow, labels).get_support(indices=True)

In [146]:
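# use selected features for vectorizer
# NOTE: selected_features holds integer column indices, while `vocabulary` expects the terms
# themselves, so no token matches and the matrix below has 0 stored elements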
vectorizer = TfidfVectorizer(min_df=15, vocabulary=selected_features)

bow2 = vectorizer.fit_transform(list(parkReviews['review_text']))
bow2

<41822x200 sparse matrix of type '<class 'numpy.float64'>'
	with 0 stored elements in Compressed Sparse Row format>
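
The empty matrix above arises because `get_support(indices=True)` returns column indices, not terms. Two ways of actually restricting the features to the selected ones are sketched below on toy data: transforming the existing matrix with the fitted selector, or mapping the indices back to terms before passing them as `vocabulary`. All names and the toy reviews are hypothetical, and `k=2` stands in for the `k=200` used above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

toy_reviews = ["nice park", "dirty park", "nice place",
               "dirty place", "nice clean park", "dirty broken park"]
toy_labels = [1, 0, 1, 0, 1, 0]

toy_vec = TfidfVectorizer()
toy_bow = toy_vec.fit_transform(toy_reviews)

selector = SelectKBest(chi2, k=2).fit(toy_bow, toy_labels)

# Option 1: keep only the selected columns of the existing matrix.
toy_bow_selected = selector.transform(toy_bow)

# Option 2: translate column indices back to terms before building a new vectorizer.
terms = np.array(sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get))
selected_terms = terms[selector.get_support(indices=True)]
toy_vec_small = TfidfVectorizer(vocabulary=selected_terms.tolist())
toy_bow_small = toy_vec_small.fit_transform(toy_reviews)

print(selected_terms, toy_bow_selected.shape, toy_bow_small.nnz)
```

The two options do not give identical values: option 1 keeps the original weights, while option 2 re-normalises each row over the reduced vocabulary.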

In [151]:
X_train, X_test, y_train, y_test = train_test_split(bow, labels, test_size=0.33)

In [152]:
# check out the dataset 
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(28020, 2127)
(13802, 2127)
(28020,)
(13802,)
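
Given the class imbalance noted earlier, a plain random split can leave the training and test sets with slightly different shares of negative reviews, and the scores change from run to run. A possible variation, sketched on toy data, is to stratify on the labels and fix a seed. The seed value and the toy arrays are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins: 100 samples, 90% positive, mirroring the class imbalance.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 90 + [0] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.33,
    stratify=y_toy,     # keep the positive/negative ratio the same in both parts
    random_state=42,    # hypothetical seed, fixed only for reproducibility
)
print(y_tr.mean(), y_te.mean())
```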


### 2.1.1 Text vectorization of French reviews
Similar to the English reviews, one can use text vectorization to pre-process the French reviews.

In [32]:
vectorizerFr = TfidfVectorizer(min_df=15)
bowFr = vectorizerFr.fit_transform(list(frenchReviewsDf['french_text']))
labelsFr = frenchReviewsDf['label']

In [33]:
bowFr.shape

(17149, 1180)

In [35]:
len(vectorizerFr.get_feature_names())

1180

In [36]:
# use the French matrix (bowFr), not the English one (bow); row 0 is the first French review
tfidfDictFr = dict(zip(vectorizerFr.get_feature_names(), bowFr.toarray()[0]))

In [38]:
tfidfDictFr['améliorer'] # tfidf score of a French word in the first review

0.0

In [39]:
featureDfFr = pd.DataFrame.from_dict(tfidfDictFr, 
                                   orient='index', columns=['tfidf'])
featureDfFr.reset_index(inplace=True)
featureDfFr = featureDfFr.rename(columns = {'index':'feature'})

In [40]:
# words with highest tfidf scores in the first French review
featureDfFr.sort_values('tfidf')[-10:]

Unnamed: 0,feature,tfidf
827,pour,0.16261
379,et,0.173683
152,beau,0.197146
1089,un,0.200496
398,faire,0.308118
295,de,0.313332
1079,tres,0.359868
664,nature,0.371318
558,la,0.425115
848,profiter,0.447984


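Next, the same lookup is repeated separately for the positive and the negative French reviews. As before, `bow_posFr.toarray()[0]` and `bow_negFr.toarray()[0]` hold the tfidf scores of the first review in each subset only, so the tables below rank the terms of that single review; terms that do not occur in it have a score of 0.0.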

In [43]:
vectorizer_posFr = TfidfVectorizer(min_df=15)
bow_posFr = vectorizer_posFr.fit_transform(list(pos_reviewsFr['french_text']))
labels_posFr = pos_reviewsFr['label']

In [44]:
vectorizer_negFr = TfidfVectorizer(min_df=15)
bow_negFr = vectorizer_negFr.fit_transform(list(neg_reviewsFr['french_text']))
labels_negFr = neg_reviewsFr['label']

In [45]:
tfidfDictPosFr = dict(zip(vectorizer_posFr.get_feature_names(), bow_posFr.toarray()[0]))
tfidfDictNegFr = dict(zip(vectorizer_negFr.get_feature_names(), bow_negFr.toarray()[0]))

In [46]:
posFeatureFrDf = pd.DataFrame.from_dict(tfidfDictPosFr, 
                                   orient='index', columns=['tfidf'])
posFeatureFrDf.reset_index(inplace=True)
posFeatureFrDf = posFeatureFrDf.rename(columns = {'index':'feature'})

In [47]:
negFeatureFrDf = pd.DataFrame.from_dict(tfidfDictNegFr, 
                                   orient='index', columns=['tfidf'])
negFeatureFrDf.reset_index(inplace=True)
negFeatureFrDf = negFeatureFrDf.rename(columns = {'index':'feature'})

In [48]:
posFeatureFrDf.sort_values('tfidf')[-15:] # words with highest tfidf scores in the first positive review

Unnamed: 0,feature,tfidf
339,faites,0.0
340,familial,0.0
323,eu,0.0
322,ete,0.0
625,parc,0.149239
712,pour,0.161498
320,et,0.173906
129,beau,0.193378
930,un,0.20258
336,faire,0.309099


In [49]:
negFeatureFrDf.sort_values('tfidf')[-15:] # words with highest tfidf scores in the first negative review

Unnamed: 0,feature,tfidf
108,gens,0.0
107,gazon,0.0
97,fait,0.0
313,être,0.0
105,fontaine,0.0
104,font,0.0
103,fois,0.0
102,fleuve,0.0
101,fin,0.0
100,fermé,0.0


In [137]:
# select 200 best features 
selected_features = SelectKBest(chi2, k=200).fit(bowFr, labelsFr).get_support(indices=True)

In [146]:
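# use selected features for vectorizer
# NOTE: selected_features holds integer column indices rather than terms, and this cell still
# vectorizes parkReviews['review_text'] instead of the French text, hence the empty matrix below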
vectorizer = TfidfVectorizer(min_df=15, vocabulary=selected_features)

bow2 = vectorizer.fit_transform(list(parkReviews['review_text']))
bow2

<41822x200 sparse matrix of type '<class 'numpy.float64'>'
	with 0 stored elements in Compressed Sparse Row format>

In [50]:
X_trainFr, X_testFr, y_trainFr, y_testFr = train_test_split(bowFr, labelsFr, test_size=0.33)

In [51]:
# check out the dataset 
print(X_trainFr.shape)
print(X_testFr.shape)
print(y_trainFr.shape)
print(y_testFr.shape)

(11489, 1180)
(5660, 1180)
(11489,)
(5660,)


<a id="test-class"></a> 
### 2.2 Testing various classifiers 


### 2.2.1 Random Forest classifier 

In [38]:
from sklearn.ensemble import RandomForestClassifier as rfc
from sklearn.model_selection import RandomizedSearchCV
from scipy import stats

In [153]:
classifier = rfc()
classifier.fit(X_train,y_train)
classifier.score(X_test,y_test)

0.8958846543979133

In [53]:
# random forest classifier on french reviews
classifierFr = rfc()
classifierFr.fit(X_trainFr,y_trainFr)
classifierFr.score(X_testFr,y_testFr)

0.8740282685512367
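
Since 88.28% of the reviews are positive, a model that always predicts "positive" would already reach an accuracy of about 0.88, which puts the scores above into perspective. The sketch below shows such a majority-class baseline on toy labels with a similar imbalance, together with balanced accuracy as one metric that is less sensitive to the imbalance. The toy arrays are hypothetical and do not reproduce the notebook's data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import balanced_accuracy_score

# Toy labels with roughly the same imbalance as the reviews (about 88% positive).
y_toy = np.array([1] * 88 + [0] * 12)
X_toy = np.zeros((len(y_toy), 1))   # features are ignored by the dummy model

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
y_pred = baseline.predict(X_toy)

print("accuracy:", baseline.score(X_toy, y_toy))
print("balanced accuracy:", balanced_accuracy_score(y_toy, y_pred))
```

Fitting the same baseline on `X_train, y_train` and scoring it on `X_test, y_test` would give the reference value against which the classifiers in this section can be compared.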

### 2.2.2 Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier

In [156]:
%%time

clfdt = DecisionTreeClassifier(min_samples_split=30,max_depth=10)
clfdt.fit(X_train, y_train)

print("train shape: " + str(X_train.shape))
print("score on test: "  + str(clfdt.score(X_test, y_test)))
print("score on train: " + str(clfdt.score(X_train, y_train)))

train shape: (28020, 2127)
score on test: 0.8871902622808289
score on train: 0.9016773733047823
CPU times: user 1 s, sys: 77.9 ms, total: 1.08 s
Wall time: 3.08 s


In [56]:
%%time

clfdtFr = DecisionTreeClassifier(min_samples_split=30,max_depth=10)
clfdtFr.fit(X_trainFr, y_trainFr)

print("train shape: " + str(X_trainFr.shape))
# score the French tree (clfdtFr), not the English one (clfdt)
print("score on test: "  + str(clfdtFr.score(X_testFr, y_testFr)))
print("score on train: " + str(clfdtFr.score(X_trainFr, y_trainFr)))

train shape: (11489, 1180)
score on test: 0.8643109540636043
score on train: 0.8885890852119419
CPU times: user 200 ms, sys: 13.1 ms, total: 213 ms
Wall time: 228 ms


In [157]:
%%time

bg=BaggingClassifier(DecisionTreeClassifier(min_samples_split=10,max_depth=3),max_samples=0.5,max_features=1.0,n_estimators=10)
bg.fit(X_train, y_train)

print("train shape: " + str(X_train.shape))
print("score on test: " + str(bg.score(X_test, y_test)))
print("score on train: "+ str(bg.score(X_train, y_train)))

train shape: (28020, 2127)
score on test: 0.8836400521663527
score on train: 0.8860099928622412
CPU times: user 1.61 s, sys: 125 ms, total: 1.73 s
Wall time: 6.37 s


In [57]:
%%time

bgFr=BaggingClassifier(DecisionTreeClassifier(min_samples_split=10,max_depth=3),max_samples=0.5,max_features=1.0,n_estimators=10)
bgFr.fit(X_trainFr, y_trainFr)

print("train shape: " + str(X_trainFr.shape))
print("score on test: " + str(bgFr.score(X_testFr, y_testFr)))
print("score on train: "+ str(bgFr.score(X_trainFr, y_trainFr)))

train shape: (11489, 1180)
score on test: 0.8591872791519435
score on train: 0.8603881974062146
CPU times: user 463 ms, sys: 14.3 ms, total: 477 ms
Wall time: 507 ms


In [158]:
# boosting decision tree

adb = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2),n_estimators=100,learning_rate=0.5)
adb.fit(X_train, y_train)

print("train shape: " + str(X_train.shape))
print("score on test: " + str(adb.score(X_test, y_test)))
print("score on train: "+ str(adb.score(X_train, y_train)))

train shape: (28020, 2127)
score on test: 0.8921170844805101
score on train: 0.9065310492505353


In [58]:
# boosting decision tree with french reviews

adbFr = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2),n_estimators=100,learning_rate=0.5)
adbFr.fit(X_trainFr, y_trainFr)

print("train shape: " + str(X_trainFr.shape))
print("score on test: " + str(adbFr.score(X_testFr, y_testFr)))
print("score on train: "+ str(adbFr.score(X_trainFr, y_trainFr)))

train shape: (11489, 1180)
score on test: 0.8743816254416961
score on train: 0.8938114718426321


### 2.2.3 Naive Bayes

In [168]:
from sklearn.naive_bayes import MultinomialNB

In [169]:
%%time
mnb = MultinomialNB().fit(X_train, y_train)

CPU times: user 19.8 ms, sys: 35.7 ms, total: 55.5 ms
Wall time: 299 ms


In [170]:
print("train shape: " + str(X_train.shape))
print("score on test: " + str(mnb.score(X_test, y_test)))
print("score on train: "+ str(mnb.score(X_train, y_train)))

train shape: (28020, 2127)
score on test: 0.8926242573540066
score on train: 0.8975374732334047


In [61]:
%%time
# classification on French reviews
mnbFr = MultinomialNB().fit(X_trainFr, y_trainFr)

CPU times: user 6.85 ms, sys: 11.9 ms, total: 18.7 ms
Wall time: 36.9 ms


In [62]:
print("train shape: " + str(X_trainFr.shape))
print("score on test: " + str(mnbFr.score(X_testFr, y_testFr)))
print("score on train: "+ str(mnbFr.score(X_trainFr, y_trainFr)))

train shape: (11489, 1180)
score on test: 0.873321554770318
score on train: 0.8766646357385325


### 2.2.4 Logistic Regression 

In [171]:
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier

In [172]:
%%time

lr=LogisticRegression(max_iter=5000)
lr.fit(X_train, y_train)

CPU times: user 333 ms, sys: 42.2 ms, total: 375 ms
Wall time: 489 ms


LogisticRegression(max_iter=5000)

In [173]:
print("train shape: " + str(X_train.shape))
print("score on test: " + str(lr.score(X_test, y_test)))
print("score on train: "+ str(lr.score(X_train, y_train)))

train shape: (28020, 2127)
score on test: 0.8945080423127083
score on train: 0.9053533190578158


In [174]:
%%time

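# linear classifier trained with stochastic gradient descent
# note: the default loss is hinge (a linear SVM); pass loss='log_loss' ('log' in older scikit-learn) for logistic regression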
sgd=SGDClassifier()
sgd.fit(X_train, y_train)

CPU times: user 68 ms, sys: 5.54 ms, total: 73.5 ms
Wall time: 83.4 ms


SGDClassifier()

In [175]:
print("train shape: " + str(X_train.shape))
print("score on test: " + str(sgd.score(X_test, y_test)))
print("score on train: "+ str(sgd.score(X_train, y_train)))

train shape: (28020, 2127)
score on test: 0.8908129256629475
score on train: 0.8955745895788723


For the French reviews the accuracy is slightly lower: logistic regression reaches about 87.6% on the test set, compared with about 89.5% for the English reviews.

In [64]:
%%time

lrFr=LogisticRegression(max_iter=5000)
lrFr.fit(X_trainFr, y_trainFr)

CPU times: user 513 ms, sys: 69.2 ms, total: 582 ms
Wall time: 757 ms


LogisticRegression(max_iter=5000)

In [65]:
print("train shape: " + str(X_trainFr.shape))
print("score on test: " + str(lrFr.score(X_testFr, y_testFr)))
print("score on train: "+ str(lrFr.score(X_trainFr, y_trainFr)))

train shape: (11489, 1180)
score on test: 0.8763250883392226
score on train: 0.881016624597441


In [66]:
%%time

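# linear classifier trained with stochastic gradient descent
# note: the default loss is hinge (a linear SVM); pass loss='log_loss' ('log' in older scikit-learn) for logistic regression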
sgdFr=SGDClassifier()
sgdFr.fit(X_trainFr, y_trainFr)

CPU times: user 25.7 ms, sys: 7.11 ms, total: 32.9 ms
Wall time: 43.3 ms


SGDClassifier()

In [67]:
print("train shape: " + str(X_trainFr.shape))
print("score on test: " + str(sgdFr.score(X_testFr, y_testFr)))
print("score on train: "+ str(sgdFr.score(X_trainFr, y_trainFr)))

train shape: (11489, 1180)
score on test: 0.876678445229682
score on train: 0.8839759770214988
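
As a rough illustration of what logistic regression trained with stochastic gradient descent does, here is a minimal pure-Python sketch on made-up toy data. The function name `sgd_logistic`, the learning rate, the number of epochs and the data are assumptions for illustration only; this is not scikit-learn's implementation. Note that `SGDClassifier()` with default arguments uses the hinge loss, whereas the sketch uses the logistic (log) loss.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_logistic(X, y, lr=0.1, epochs=200, seed=0):
    """Fit weights w and bias b by updating on one example at a time."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a new random order each epoch
        for i in order:
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            err = p - y[i]  # gradient of the log loss with respect to the logit
            w = [wj - lr * err * xj for wj, xj in zip(w, X[i])]
            b -= lr * err
    return w, b

# Hypothetical, linearly separable toy data (not the review features).
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (2.0, 2.0), (2.0, 3.0), (3.0, 2.0)]
y = [0, 0, 0, 1, 1, 1]

w, b = sgd_logistic(X, y)
preds = [int(sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5) for x in X]
print(preds)
```

Each update only touches a single example, which is why this style of training is fast on large, sparse feature matrices.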


### 2.2.5 K-nearest neighbors
A k-nearest neighbors classifier is trained on the English and French reviews. On the English reviews the test accuracy is about 87.47%, and on the French reviews it is 84.54%. The brute-force neighbor search is slow: the English run took about 20 minutes of wall time.
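
To make the idea concrete, the following is a minimal pure-Python sketch of brute-force k-nearest-neighbor classification: every query is compared against every training point and the majority label among the `k` closest wins. The helper `knn_predict` and the toy points are hypothetical and only illustrate the technique; scikit-learn's `KNeighborsClassifier` (default `n_neighbors=5`) is what the notebook actually uses.

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k=5):
    """Return the majority label among the k training points closest to query."""
    # Brute force: Euclidean distance from the query to every training point.
    dists = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    dists.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in dists[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical 2-feature toy points (not the review features).
train_X = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (1.0, 0.9), (0.9, 1.1), (1.2, 1.0)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, (0.1, 0.1), k=3))
print(knn_predict(train_X, train_y, (1.0, 1.0), k=3))
```

Because each prediction scans the whole training set, the cost grows with both the number of training rows and the number of queries, which is consistent with the long run times reported for this section.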

In [176]:
%%time

from sklearn.neighbors import KNeighborsClassifier

#knn = KNeighborsClassifier(n_neighbors=5,algorithm = 'ball_tree')
knn = KNeighborsClassifier(algorithm = 'brute', n_jobs=-1)

knn.fit(X_train, y_train)

print("train shape: " + str(X_train.shape))
print("score on test: " + str(knn.score(X_test, y_test)))
print("score on train: "+ str(knn.score(X_train, y_train)))

train shape: (28020, 2127)
score on test: 0.8747283002463411
score on train: 0.9011063526052819
CPU times: user 2min 37s, sys: 3min 47s, total: 6min 25s
Wall time: 20min 13s


In [68]:
%%time
# knn on french reviews
knnFr = KNeighborsClassifier(algorithm = 'brute', n_jobs=-1)

knnFr.fit(X_trainFr, y_trainFr)

print("train shape: " + str(X_trainFr.shape))
print("score on test: " + str(knnFr.score(X_testFr, y_testFr)))
print("score on train: "+ str(knnFr.score(X_trainFr, y_trainFr)))

train shape: (11489, 1180)
score on test: 0.8454063604240283
score on train: 0.876403516406998
CPU times: user 21.9 s, sys: 12.5 s, total: 34.3 s
Wall time: 32.9 s


### 2.3 RNN for classification
An RNN can take more information into account than the classifiers above. The following is a first attempt at using one; it is only a starting point and needs to be explored further.
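
One kind of information that a sequence model can use is word order. Assuming the features used by the earlier classifiers are word counts or similar bag-of-words weights (an assumption, since the feature extraction is not shown in this part of the notebook), two sentences with the same words in a different order look identical to them. The small sketch below uses made-up sentences to show this.

```python
from collections import Counter

# Two hypothetical reviews with the same words but opposite meaning.
a = "the park is good not bad".split()
b = "the park is bad not good".split()

print(Counter(a) == Counter(b))  # same bag of words
print(a == b)                    # different word sequence
```

An RNN reads the tokens one after another, so in principle it can tell these two inputs apart.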

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [None]:
train, test = train_test_split(allReviewEn, test_size = 0.3, random_state=42)

# clean the indexing (reset_index returns a new DataFrame, so reassign)
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)

# save train and test in csv files 
train[['review_text', 'label']].to_csv('all_en_train.csv', index=False)
test[['review_text', 'label']].to_csv('all_en_test.csv', index=False)

In [None]:
### Using torchtext to process text data

import numpy as np 

import torch 
import torchtext

from torchtext.legacy.data import Field, BucketIterator, TabularDataset, LabelField

import nltk 
nltk.download('punkt') # for punkt tokenizer

from nltk import word_tokenize 

In [None]:
# torchtext field parameter specifies how data should be processed, here tokenized
TEXT = Field(tokenize = word_tokenize)

LABEL = LabelField(dtype = torch.float) # store labels as floats, as expected by the loss used below

datafields = [ ('review_text', TEXT), ('label', LABEL)] 

# specify which data to work with: load the train and test splits and map each column to its field
trn, tst = TabularDataset.splits(path = '/Users/andreamock/Documents/review_datasets',
                               train = 'all_en_train.csv', test = 'all_en_test.csv', format = 'csv',
                               skip_header = True, fields = datafields)


# training examples 
trn[:5]

print(f'Number of training examples: {len(trn)}')
print(f'Number of testing examples: {len(tst)}')

# each example has label and text
trn[5].__dict__.keys()

trn[1].review_text # text has been tokenized in individual words

trn[1].label

# limit the vocabulary to the 15000 most frequent words; each word is mapped to an integer index
TEXT.build_vocab(trn, max_size = 15000)

LABEL.build_vocab(trn)

print(f'Unique tokens in TEXT vocabulary: {len(TEXT.vocab)}')
print(f'Unique tokens in LABEL vocabulary: {len(LABEL.vocab)}')
# two additional tokens were added to vocab, one for unknown words and another for padding to make sentences equal lengths

print(TEXT.vocab.freqs.most_common(50)) 

print(TEXT.vocab.itos[:10]) # integer to string mapping 0 and 1 to unknown and padding

batch_size = 64 

# returns a batch of examples where each example is of similar length (thus minimizing padding for each example)
train_iterator, test_iterator = BucketIterator.splits(
    (trn,tst), batch_size = batch_size, sort_key = lambda x: len(x.review_text), sort_within_batch = False
)
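
The vocabulary and padding steps above can be sketched in plain Python. This is only an illustration of the idea under stated assumptions, not torchtext's implementation: the helpers `build_vocab` and `numericalize` and the toy documents are hypothetical, and the sketch returns batch-first lists, whereas the iterators above yield tensors of shape [sentence length, batch size] by default.

```python
from collections import Counter

UNK, PAD = "<unk>", "<pad>"  # special tokens placed at indices 0 and 1

def build_vocab(token_lists, max_size):
    """Keep the max_size most frequent tokens, plus the two special tokens."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    itos = [UNK, PAD] + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(batch, stoi):
    """Map tokens to indices and pad every example to the longest one in the batch."""
    max_len = max(len(toks) for toks in batch)
    return [
        [stoi.get(tok, stoi[UNK]) for tok in toks] + [stoi[PAD]] * (max_len - len(toks))
        for toks in batch
    ]

# Hypothetical tokenized reviews.
docs = [["nice", "park"], ["nice", "big", "park", "nice"], ["dirty"]]
itos, stoi = build_vocab(docs, max_size=2)

print(itos)
print(numericalize(docs, stoi))
```

Words outside the kept vocabulary all collapse to the unknown index, and grouping examples of similar length into a batch keeps the number of padding indices small.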

In [None]:
## Designing an RNN for binary text classification 

import torch.nn as nn

class RNN(nn.Module):
    
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
        # input_dim = size of the vocabulary (number of distinct word indices)
        # embedding_dim = dimension of word embeddings, dense word representation for training RNN
        # hidden_dim = dimension of hidden state of RNN
        # output_dim = dimension of the model output (1 for binary classification)
        
        super().__init__()
        # map each word index to a dense embedding vector
        self.embedding = nn.Embedding(input_dim, embedding_dim)
        # input to rnn is current word's embedding and previous hidden state, one word per time instance (memory cell)
        self.rnn = nn.LSTM(embedding_dim, hidden_dim)
        # fully connected layer to classify as positive or negative 
        self.fc = nn.Linear(hidden_dim, output_dim)
        
        self.dropout = nn.Dropout(0.3)
        
    def forward(self, text):
        # input sentence (a tensor of word indices) is represented using its embeddings
        embedded = self.embedding(text)
        
        embedded_dropout = self.dropout(embedded)
        
        # output = hidden state for every time step (i.e. word), shape [sentence length, batch size, hidden_dim]
        # hidden = final hidden state, fed into the linear layer
        output, (hidden, _) = self.rnn(embedded_dropout)
        
        hidden_1D = hidden.squeeze(0) # get rid of unnecessary dimension 
        
        assert torch.equal(output[-1, :, :], hidden_1D) # confirm that it is indeed last hidden state 
        
        return self.fc(hidden_1D) # last hidden state fed into fully connected layer

In [None]:
# setting dimensions 
input_dim = len(TEXT.vocab)
embedding_dim = 100
hidden_dim = 256
output_dim=1

model = RNN(input_dim, embedding_dim, hidden_dim, output_dim)

model # see what our model looks like

In [None]:
# train with optimizer
import torch.optim as optim 

optimizer = optim.Adam(model.parameters(), lr=1e-6)

# binary cross entropy with logits (cross-entropy for binary classification, 
# w/ sigmoid activation func to predict in range of 0 and 1)
criterion = nn.BCEWithLogitsLoss()
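
As a small aside on the loss, the sketch below computes binary cross-entropy directly from raw logits in plain Python and checks it against the naive "sigmoid first, then cross-entropy" version. The function names and the example values are hypothetical; it only illustrates the formula that a loss of this kind evaluates and is not PyTorch's code.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed from raw logits in a stable form."""
    total = 0.0
    for z, t in zip(logits, targets):
        # Equivalent to -[t*log(sigmoid(z)) + (1-t)*log(1-sigmoid(z))],
        # rearranged so that exp() never receives a large positive argument.
        total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

def naive_bce(logits, targets):
    """Same quantity, computed by applying the sigmoid explicitly."""
    total = 0.0
    for z, t in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(logits)

# Hypothetical logits and 0/1 labels.
logits = [2.0, -1.0, 0.5]
targets = [1.0, 0.0, 1.0]

print(bce_with_logits(logits, targets))
print(naive_bce(logits, targets))
```

Because the sigmoid is folded into the loss, the model's `forward` can return raw scores, and the sigmoid only needs to be applied explicitly when predictions are rounded to 0 or 1.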

In [None]:
def train(model, iterator, optimizer, criterion): # helper function for training process
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:  # iterator over all batches of training data
        
        optimizer.zero_grad() # zero out gradients of optimizer
                
        predictions = model(batch.review_text).squeeze(1) # make predictions, squeeze [batch size, 1] to [batch size]
        
        loss = criterion(predictions, batch.label) # calculate loss
        
        rounded_preds = torch.round(torch.sigmoid(predictions))
        correct = (rounded_preds == batch.label).float() # how many were correct
        
        acc = correct.sum() / len(correct)
        
        loss.backward() # backward pass on rnn
        
        optimizer.step()
        
        epoch_loss += loss.item() # keep track of epoch loss and accuracy
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [None]:
num_epochs = 5

for epoch in range(num_epochs):
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    print(f' Epoch: {epoch+1}, Train loss: {train_loss}, Train Acc: {train_acc*100:.2f}%')


Now we can test the accuracy on our test data.

In [None]:
# don't want to update the parameters when evaluating the accuracy
epoch_loss = 0
epoch_acc = 0

model.eval()

with torch.no_grad():

    for batch in test_iterator:

        predictions = model(batch.review_text).squeeze(1)

        loss = criterion(predictions, batch.label)

        rounded_preds = torch.round(torch.sigmoid(predictions))
        
        correct = (rounded_preds == batch.label).float() 
        acc = correct.sum() / len(correct)

        epoch_loss += loss.item()
        epoch_acc += acc.item()

test_loss = epoch_loss / len(test_iterator)
test_acc  = epoch_acc / len(test_iterator)

print(f'| Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}% |')
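
One detail worth keeping in mind: the loop above averages the per-batch accuracies. If the batches do not all have the same size (for example a smaller final batch), this can differ slightly from the accuracy over all test examples. The following sketch with made-up results shows the difference; the function names and the data are hypothetical.

```python
def mean_of_batch_accuracies(batches):
    """Average the accuracy of each batch, as the evaluation loop does."""
    return sum(sum(b) / len(b) for b in batches) / len(batches)

def sample_weighted_accuracy(batches):
    """Count correct predictions over all examples, regardless of batch."""
    correct = sum(sum(b) for b in batches)
    total = sum(len(b) for b in batches)
    return correct / total

# Hypothetical per-example results: 1 = correct prediction, 0 = wrong.
batches = [[1, 1, 1, 1], [1, 1, 1, 0], [0]]

print(mean_of_batch_accuracies(batches))
print(sample_weighted_accuracy(batches))
```

With a batch size of 64 and thousands of test examples the gap is usually tiny, but summing the correct predictions and dividing by the number of examples avoids it entirely.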