# exploring different areas of text analytics.
_Looking to generate ideas for improving SICK_

## methods I'm exploring 

### * Bag of Words method
> - take each word in a sentence and derive a measure that tells us whether the word appears in another sentence, and how important it is
> - each sentence in the Google reviews is treated as a bag of words; in other words, each sentence is a **document** and all the documents together make up a **corpus**

#### goals
1. create a dictionary of all the unique words in the corpus (excluding words like "the", "an", "is", etc.)
1. convert all documents into vectors which will represent the presence of words from our dictionary in the documents.
* *We're going to do this with the Tf-idf vectorizer in sklearn*
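To make the two goals concrete, here is a hand-rolled sketch in plain Python. It is not how sklearn implements its vectorizers; the toy reviews, the tiny stop word set, and the helper names `tokenize` and `to_vector` are all made up for illustration.

```python
from collections import Counter

# Toy corpus: each string is one document (these reviews are made up).
corpus = [
    "the fries were cold",
    "the service is fast and the fries are hot",
    "cold coffee",
]

# Tiny illustrative stop word set, not the NLTK list.
stop_words = {"the", "an", "is", "and", "are", "were"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in stop_words]


# Goal 1: dictionary of all the unique words in the corpus.
vocabulary = sorted({w for doc in corpus for w in tokenize(doc)})


# Goal 2: one count vector per document, aligned with the vocabulary.
def to_vector(text):
    counts = Counter(tokenize(text))
    return [counts[w] for w in vocabulary]


vectors = [to_vector(doc) for doc in corpus]

print(vocabulary)
for vec in vectors:
    print(vec)
```

Each position in a vector lines up with one vocabulary word, so a 0 simply means that word does not occur in that document.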

#### steps
1. Count the number of times each word appears in each document (review)
    1. create feature vector and find out how many zeroes are in my feature vector.
    1. calculate the non-zero value density in the vector
    1. remove stopwords and non-english characters
    1. test stemming or lemmatization
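As a rough idea of what the "remove stopwords and non-english characters" step could look like, here is a minimal sketch. It assumes that "non-english" means anything outside the letters a-z, and it uses a hypothetical mini stop word set and a hypothetical helper `clean_review`; the standard library `re` module is used here instead of the `regex` package.

```python
import re

# Hypothetical mini stop word set, for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "was", "and"}


def clean_review(text):
    """Lowercase, keep only a-z letters and spaces, then drop stop words."""
    text = text.lower()
    # Replace anything that is not a-z or whitespace (digits, punctuation,
    # mis-encoded characters such as 'ý') with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


cleaned = clean_review("The food was GREAT ýýýý and cheap!! 10/10")
print(cleaned)  # food great cheap
```

One trade-off to keep in mind: this pattern also strips digits, apostrophes, and legitimate accented letters, so it may be too aggressive for some reviews.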

In [5]:
# import dependencies
from datetime import datetime
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random
import regex as re
import warnings
warnings.filterwarnings('ignore')

# load data

In [7]:
initData = pd.read_csv('resources/McDonald_s_Reviews.csv', encoding='ISO-8859-1')

## let's do some quick initial cleaning and exploring of the Google data

In [8]:
initData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33396 entries, 0 to 33395
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   reviewer_id    33396 non-null  int64  
 1   store_name     33396 non-null  object 
 2   category       33396 non-null  object 
 3   store_address  33396 non-null  object 
 4   latitude       32736 non-null  float64
 5   longitude      32736 non-null  float64
 6   rating_count   33396 non-null  object 
 7   review_time    33396 non-null  object 
 8   review         33396 non-null  object 
 9   rating         33396 non-null  object 
dtypes: float64(2), int64(1), object(7)
memory usage: 2.5+ MB


In [9]:
# we can get a quick overview of how many reviews share the same text with value_counts function
initData['review'].value_counts()

review
Excellent                                2148
Good                                     1264
Neutral                                   942
Poor                                      315
Terrible                                  292
                                         ... 
then the worst burger here in the US        1
it's a McDonalds                            1
There are employees who speak Spanish       1
What you order                              1
they took good care of me                   1
Name: count, Length: 22285, dtype: int64

## * Bag of Words method start

In [10]:
# create the countvectorizer 
count_vector = CountVectorizer()

In [11]:
# create dictionary of words from the corpus
features = count_vector.fit(initData['review'])

In [12]:
# I want to check out some words that got extracted from the corpus
feature_names = features.get_feature_names_out()
feature_names

array(['00', '000', '0000000', ..., 'ýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýý',
       'ýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýý',
       'ýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýýý'],
      dtype=object)

In [13]:
print(f'Total number of features extracted are: {len(feature_names)}')

Total number of features extracted are: 14378


**ok bet, we have 14378 unique features identified**

In [19]:
# creating a random sampler to show 10 feature names
random.sample(sorted(feature_names),10)

['tge',
 'aak',
 'warden',
 'cameras',
 'forgiveness',
 'specialist',
 'ofen',
 'dammmm',
 'effert',
 '73rd']

In [21]:
# creating feature vector
feature_vector = count_vector.transform(initData['review'])
feature_vector.shape

(33396, 14378)

#### the above shows us that all 33,396 documents are represented by 14,378 features (unique words)
> each feature carries the number of times that word appeared in the document. If the word is not present, the feature gets a 0 value

In [23]:
# count the non-zero entries in my feature vector (every other cell is a zero)
feature_vector.getnnz()

576061

In [24]:
# gets the non-zero value density of the feature matrix
feature_vector.getnnz()/(feature_vector.shape[0]*feature_vector.shape[1])

0.0011997079653556363

wayyy too many zeroes in my feature vector **sadnesss -_-**
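To put numbers on that, the arithmetic can be redone by hand with the values printed above. The variable names below are just for this sketch.

```python
# Numbers copied from the notebook outputs above.
n_docs, n_features = 33396, 14378
non_zero = 576061

total_cells = n_docs * n_features      # every (document, word) pair
zero_cells = total_cells - non_zero    # cells where the word is absent
density = non_zero / total_cells       # share of cells that hold a count
avg_terms_per_review = non_zero / n_docs

print(f"total cells:          {total_cells:,}")
print(f"zero cells:           {zero_cells:,}")
print(f"density:              {density:.4%}")
print(f"avg distinct terms:   {avg_terms_per_review:.2f} per review")
```

So only about 0.12% of the cells are non-zero, which works out to roughly 17 distinct vocabulary words per review out of 14,378 possible ones.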

In [25]:
# show the sparse matrix in dense form
feature_vector.todense()

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]])
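A quick back-of-the-envelope check on why the dense form is expensive. This sketch assumes the counts are stored as 8-byte integers (`int64` is the default `dtype` of `CountVectorizer`).

```python
# Shape copied from the feature_vector.shape output above.
n_docs, n_features = 33396, 14378
bytes_per_cell = 8  # assumption: int64 counts

dense_bytes = n_docs * n_features * bytes_per_cell
dense_gib = dense_bytes / 1024**3

print(f"dense matrix size: {dense_bytes:,} bytes (about {dense_gib:.2f} GiB)")
```

That is roughly 3.6 GiB spent almost entirely on zeroes. For a quick look, converting only a small slice, for example `feature_vector[:5].toarray()`, is probably a cheaper option.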

#### need to fix the dataset before moving forward

#### * getting rid of stopwords

In [28]:
# create stopwords variable set to english
all_stopwords = set(stopwords.words('english'))