
# Grammarly App User Review
Analyze the topics and sentiment of user reviews for Grammarly to identify candidate ideas for A/B testing and for defining novel metrics.

## Data
- review_id: unique identifier of review
- title: summary of review
- author: unique identifier of author
- author_url: author url
- version: version of the app
- rating: numerical rating of the app
- review: blob of user review of the app
- vote_count: numerical count of unique users who like the review
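
To make the field list concrete, here is a minimal sketch of how one raw feed entry could be flattened into a record with these columns. The nested `{'label': ...}` layout follows what the scraper in the Data Collection section reads; the sample entry and the helper name `flatten_entry` are made up for illustration and are not part of the original notebook.

```python
from typing import Any, Dict

def flatten_entry(entry: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten one feed entry into the columns listed above (hypothetical helper)."""
    return {
        'review_id': entry['id']['label'],
        'title': entry['title']['label'],
        'author': entry['author']['name']['label'],
        'author_url': entry['author']['uri']['label'],
        'version': entry['im:version']['label'],
        # The feed delivers numbers as strings, so cast them explicitly
        'rating': int(entry['im:rating']['label']),
        'review': entry['content']['label'],
        'vote_count': int(entry['im:voteCount']['label']),
    }

# Made-up entry for illustration only
sample_entry = {
    'id': {'label': '1'},
    'title': {'label': 'Example title'},
    'author': {'name': {'label': 'example_user'},
               'uri': {'label': 'https://example.com/user'}},
    'im:version': {'label': '1.8.1'},
    'im:rating': {'label': '5'},
    'content': {'label': 'Example review text.'},
    'im:voteCount': {'label': '0'},
}

record = flatten_entry(sample_entry)
print(record)
```

Casting `rating` and `vote_count` at this point is a design choice; the notebook itself converts them later, during the initial data analysis.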

# Data Collection

In [9]:
# Import libraries
import os
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns;sns.set()

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
import spacy
import unicodedata

from scipy.stats import shapiro

import pprint
import time
import typing
import requests

# Visual Formatting
sns.set_style(rc={
 'axes.axisbelow': True,
 'axes.edgecolor': '.8',
 'axes.facecolor': 'white',
 'axes.grid': True,
 'axes.labelcolor': '.15',
 'axes.spines.bottom': True,
 'axes.spines.left': True,
 'axes.spines.right': True,
 'axes.spines.top': True,
 'figure.facecolor': 'white',
 'xtick.bottom': False,
 'xtick.color': '.15',
 'xtick.direction': 'out',
 'xtick.top': False,
 'ytick.color': '.15',
 'ytick.direction': 'out',
 'ytick.left': False,
 'ytick.right': False})

In [2]:
# Web scraping of user reviews 
def is_error_response(http_response, seconds_to_sleep: float = 1) -> bool:
    """
    Returns False if status_code is 503 (system unavailable) or 200 (success),
    otherwise it will return True (failed). This function should be used
    after calling the commands requests.post() and requests.get().

    :param http_response:
        The response object returned from requests.post or requests.get.
    :param seconds_to_sleep:
        The sleep time used if the status_code is 503. This is used to not
        overwhelm the service since it is unavailable.
    """
    if http_response.status_code == 503:
        time.sleep(seconds_to_sleep)
        return False

    return http_response.status_code != 200

def get_json(url) -> typing.Optional[dict]:
    """
    Returns the json response, if any. Returns None if the request failed
    or the body is not valid json.

    :param url:
        The url to get the json from.
    """
    response = requests.get(url)
    if is_error_response(response):
        return None
    try:
        return response.json()
    except ValueError:
        # Raised when the body is not valid json, e.g. past the last page
        # or after a 503 response.
        return None

def get_reviews(app_id, page=1) -> typing.List[dict]:
    """
    Returns a list of dictionaries with each dictionary being one review. 
    
    :param app_id:
        The app_id you are searching. 
    :param page:
        The page id to start the loop. Once it reaches the final page + 1, the 
        app will return a non valid json, thus it will exit with the current 
        reviews. 
    """
    reviews: typing.List[dict] = []

    while True:
        # Use the app_id argument rather than a hard-coded id
        url = (f'https://itunes.apple.com/rss/customerreviews/id={app_id}/'
               f'page={page}/sortby=mostrecent/json')
        payload = get_json(url)

        if not payload:
            return reviews

        data_feed = payload.get('feed', {})

        # A page without entries means there is nothing left to collect
        if not data_feed.get('entry'):
            return reviews

        reviews += [
            {
                'review_id': entry.get('id').get('label'),
                'title': entry.get('title').get('label'),
                'author': entry.get('author').get('name').get('label'),
                'author_url': entry.get('author').get('uri').get('label'),
                'version': entry.get('im:version').get('label'),
                'rating': entry.get('im:rating').get('label'),
                'review': entry.get('content').get('label'),
                'vote_count': entry.get('im:voteCount').get('label')
            }
            for entry in data_feed.get('entry')
            if not entry.get('im:name')
        ]

        page += 1
reviews_dict = get_reviews('1158877342')

In [7]:
reviews=pd.DataFrame(reviews_dict)
reviews.head()

Unnamed: 0,author,author_url,rating,review,review_id,title,version,vote_count
0,Chollathon,https://itunes.apple.com/us/reviews/id154691946,1,I mostly rely on Grammarly because English is ...,5758601188,It's not as smart as I used this app a year ago.,1.8.1,0
1,undercover pizzq,https://itunes.apple.com/us/reviews/id542732992,3,I was hoping that it would be more than a slig...,5757083923,It could be better,1.8.1,0
2,Crystal(mug shot),https://itunes.apple.com/us/reviews/id351819855,3,It's good but continuously messes up some thin...,5754680966,Meh,1.8.1,0
3,Vee.0222,https://itunes.apple.com/us/reviews/id1017024576,5,Grammarly is amazing. It makes it incredibly e...,5752622807,I LOVE IT,1.8.1,0
4,Damaris 🥰☺️😆😀,https://itunes.apple.com/us/reviews/id947625305,3,"Well, I need extra help with spelling but we h...",5752078991,Money,1.8.1,0
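
The scraping helpers above sleep when the service answers 503 but do not retry the request. One possible way to add a bounded retry with exponential backoff is sketched below. The function name `fetch_with_retry`, its parameters, and the simulated responses are assumptions for illustration; the sketch works on bare status codes so that it runs without any network access.

```python
import time
from typing import Callable, Optional

def fetch_with_retry(fetch: Callable[[], int],
                     max_attempts: int = 3,
                     base_delay: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> Optional[int]:
    """
    Call `fetch` until it returns a status other than 503 or the attempts
    run out. Returns the final status code, or None if every attempt
    returned 503.
    """
    for attempt in range(max_attempts):
        status = fetch()
        if status != 503:
            return status
        # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
        sleep(base_delay * 2 ** attempt)
    return None

# Simulated service: unavailable twice, then successful (no network involved).
# Recording the delays instead of sleeping keeps the example instant.
responses = iter([503, 503, 200])
delays = []
result = fetch_with_retry(lambda: next(responses), sleep=delays.append)
print(result, delays)
```

In the scraper, `fetch` would wrap the `requests.get` call and return `response.status_code`; that wiring is left out here on purpose.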


# Text Pre-processing

## Data Cleaning

In [11]:
# Strip accent marks and other non-ASCII characters from the text columns
# (np.object was removed in NumPy 1.24; the string 'object' works everywhere)
cols = reviews.select_dtypes(include=['object']).columns
def remove_accented_chars(reviews, cols):
    reviews[cols] = reviews[cols].apply(lambda x: x.str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8'))
    return reviews
remove_accented_chars(reviews, cols)

Unnamed: 0,author,author_url,rating,review,review_id,title,version,vote_count
0,Chollathon,https://itunes.apple.com/us/reviews/id154691946,1,I mostly rely on Grammarly because English is ...,5758601188,It's not as smart as I used this app a year ago.,1.8.1,0
1,undercover pizzq,https://itunes.apple.com/us/reviews/id542732992,3,I was hoping that it would be more than a slig...,5757083923,It could be better,1.8.1,0
2,Crystal(mug shot),https://itunes.apple.com/us/reviews/id351819855,3,It's good but continuously messes up some thin...,5754680966,Meh,1.8.1,0
3,Vee.0222,https://itunes.apple.com/us/reviews/id1017024576,5,Grammarly is amazing. It makes it incredibly e...,5752622807,I LOVE IT,1.8.1,0
4,Damaris,https://itunes.apple.com/us/reviews/id947625305,3,"Well, I need extra help with spelling but we h...",5752078991,Money,1.8.1,0
5,bskzksbsocm,https://itunes.apple.com/us/reviews/id1149389630,5,I am staring to download it I am downloading i...,5749512641,Great app,1.8.1,0
6,ellie2346,https://itunes.apple.com/us/reviews/id147426054,5,I love Grammarly so much for my school docs an...,5746486204,Grammarly for school,1.8.1,0
7,,https://itunes.apple.com/us/reviews/id32541662,4,I really like this app and I was very excited ...,5745481290,Premium costs money,1.8.1,0
8,Boulder Beast,https://itunes.apple.com/us/reviews/id220640647,1,"Absolutely horrible. The spellcheck is trash, ...",5744315100,Horrible,1.8.1,0
9,Yshergyshsbdbdjdj,https://itunes.apple.com/us/reviews/id460635613,1,Don't do it. I wasted 139.99 and it only works...,5743059930,If you are planning to buy premium....,1.8.1,0


In [None]:
def remove_special_characters(text, remove_digits=False):
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    text = re.sub(pattern, '', text)
    return text

remove_special_characters("Well this was fun! What do you think? 123#@!", 
                          remove_digits=True)
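
A note on the letter ranges in such a pattern: a range written `A-z` spans ASCII codes 65 to 122, which includes the characters `[`, `\`, `]`, `^`, `_` and the backtick in addition to letters, whereas `A-Z` covers upper-case letters only. The short comparison below uses a made-up input string to show the difference.

```python
import re

text = "snake_case [ok]^ 123"  # made-up input for illustration

# 'A-z' also lets [ \ ] ^ _ and ` through
loose = re.sub(r'[^a-zA-z\s]', '', text)
# 'A-Z' keeps letters and whitespace only
strict = re.sub(r'[^a-zA-Z\s]', '', text)

print(repr(loose))
print(repr(strict))
```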

## Tokenization

In [None]:
# Assumes the NLTK stopwords corpus is available: nltk.download('stopwords')
from nltk.corpus import stopwords
tokenizer = ToktokTokenizer()
stopword_list = stopwords.words('english')

def remove_stopwords(text, is_lower_case=False):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text

remove_stopwords("The, and, if are stopwords, computer is not")

## Stemming

In [None]:
def simple_stemmer(text):
    ps = nltk.stem.PorterStemmer()
    text = ' '.join([ps.stem(word) for word in text.split()])
    return text

simple_stemmer("My system keeps crashing his crashed yesterday, ours crashes daily")

## Lemmatizing

In [None]:
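# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

def lemmatize_text(text):
    doc = nlp(text)
    # spaCy v2 returns the placeholder lemma '-PRON-' for pronouns; keep the original word then
    return ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in doc])

lemmatize_text("My system keeps crashing! his crashed yesterday, ours crashes daily")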

In [None]:
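from sklearn.feature_extraction.text import CountVectorizer
count=CountVectorizer()

def tokenize(reviews):
    """
    Fit the bag-of-words vectorizer on the reviews

    Params:
        - reviews, iterable of review strings

    Returns:
        - Vocabulary, dictionary mapping each token to its column index

    """
    review_doc=np.array(reviews)
    count.fit(review_doc)
    return count.vocabulary_

# neg_reviews (the negative-review subset) must be defined before running this cell
tokenize(neg_reviews.review)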

# Text Parsing

# Initial Data Analysis

In [None]:
def initial_analysis(df):
    """
    Report for initial data analysis for given dataframe
    
    Params:
        - df
    
    Returns:
         - Report of dataset shape and datatypes of all columns
         
    """
    print('Report of Initial Data Analysis:')
    print(f'Shape of dataframe: {df.shape}')
    print(f'Features and Data Types: \n {df.dtypes}')
initial_analysis(reviews)

In [None]:
def percent_missing(df):
    """
    Calculate percent of missing values for each column in dataset
    
    Params:
        - df
        
    Returns:
        - Dictionary of columns and percent of missing records
    
    """
    col=list(df.columns)
    perc=[round(df[c].isna().mean()*100,2) for c in col]
    miss_dict=dict(zip(col,perc))
    return miss_dict
percent_missing(reviews)

In [None]:
version_lst=list(reviews.version.unique())
version_dict={}
def version_sub(version_lst):
    """
    Given a list of unique app versions, split each version string on the
    delimiter '.' and parse the major and minor components as integers so
    that versions can be compared numerically (slicing the first 3
    characters would misread a version such as '1.10.0').

    Params:
        - version_lst, list of unique versions

    Returns:
        - Dictionary mapping each version string to its (major, minor) tuple

    """
    for v in version_lst:
        parts=v.split('.')
        major=int(parts[0])
        minor=int(parts[1]) if len(parts) > 1 else 0
        version_dict[v]=(major,minor)
    return version_dict
version_sub(version_lst)

In [None]:
curr_version=[]
def app_version(version_dict):
    """
    Given a dictionary of sorted unique app versions and the first two values as int determine the
    most recent version and all versions containing the same first two values.
    
    Params:
        - version_dict
        
    Returns:
        - List of app versions to retain in dataset
    
    """
    max_version=max(list(version_dict.values()))
    for k in list(version_dict.keys()):
        if version_dict[k] == max_version:
            curr_version.append(k)
    return curr_version
app_version(version_dict)
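
When comparing versions, a fixed-width string slice breaks as soon as a component has two digits, while a tuple of integers keeps every component intact. The sketch below illustrates this on made-up version strings; `parse_version` is a hypothetical helper, not part of the notebook.

```python
def parse_version(version: str) -> tuple:
    """Turn '1.10.0' into (1, 10, 0) so that versions compare numerically."""
    return tuple(int(part) for part in version.split('.'))

versions = ['1.8.1', '1.9.0', '1.10.0']  # made-up versions for illustration

# Slicing the first three characters conflates 1.10 with 1.1
sliced = {v: int(v[:3].replace('.', '')) for v in versions}

# Tuples compare component by component
latest = max(versions, key=parse_version)
same_minor = [v for v in versions
              if parse_version(v)[:2] == parse_version(latest)[:2]]

print(sliced)
print(latest, same_minor)
```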

In [None]:
def filter_reviews(df,curr_version):
    """
    Filter the reviews based on the version number
    Params:
        - df
        - curr_version
    
    Returns:
        - filtered df
        
    """
    df=df[df.version.isin(curr_version)]
    return df
filter_reviews(reviews,curr_version)

In [None]:
num_feat=[]
def numerical_features(df):
    """
    Cast the rating and vote count columns to integers and determine the
    columns that are numerical
    
    Params:
        - df
     
    Returns:
        - List of columns that are numerical
     
    """
    col=['rating','vote_count']
    df[col]=df[col].astype('int')
    
    for c in df.columns:
        # Skip columns already recorded so re-running the cell adds no duplicates
        if pd.api.types.is_numeric_dtype(df[c]) and c not in num_feat:
            num_feat.append(c)
    return num_feat
numerical_features(reviews)

In [None]:
norm_dict={}
def sample_normality(df,col_list):
    """
    Given a dataframe, tests whether each numerical column is Gaussian
    using the Shapiro-Wilk test

    Ho = Distribution is Gaussian
    Ha = Distribution is not Gaussian

    Params:
        - df
        - col_list, list of numerical columns to test

    Returns:
        - Dictionary of non-Gaussian columns and their W-statistic

    """
    non_gauss=[]
    w_stat=[]
    
    # Determine if each sample of numerical feature is gaussian
    alpha = 0.05
    for f in col_list:
        stat,p=shapiro(df[f])
        if p <= alpha: # Reject Ho -- Distribution is not normal
            non_gauss.append(f)
            w_stat.append(stat)
            
    # Dictionary of numerical features not gaussian and W-statistic 
    temp_dict=dict(zip(non_gauss,w_stat))
    norm_dict.update(temp_dict)
    return norm_dict
sample_normality(reviews,num_feat)
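
The null hypothesis of the Shapiro-Wilk test is that the sample comes from a normal distribution, so a p-value at or below alpha is evidence against normality. The sketch below applies the same decision rule to two synthetic samples; the sample sizes, seed and distributions are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)
samples = {
    'gaussian': rng.normal(loc=0.0, scale=1.0, size=200),
    'skewed': rng.exponential(scale=1.0, size=200),
}

alpha = 0.05
for name, values in samples.items():
    stat, p = shapiro(values)
    verdict = 'reject H0 (not Gaussian)' if p <= alpha else 'fail to reject H0'
    print(f'{name}: W={stat:.3f}, p={p:.4f} -> {verdict}')
```

Failing to reject H0 does not prove that a sample is Gaussian; it only means the test found no strong evidence against it.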

# Exploratory Data Analysis

## Ratings

In [None]:
ax=sns.countplot(x='rating',data=reviews); # Ordinal
ax.set_title('Frequency of User Review Ratings')
ax.set_ylabel('Count of Rating')
ax.set_xlabel('Rating')

## Vote Count

In [None]:
print(reviews.vote_count.value_counts()) # No votes on reviews

## App Version

In [None]:
print(reviews.version.value_counts()) # Ordinal 

# Text Representation

In [None]:
def sentiment_label(df):
    """
    Define labels for sentiment from the user review rating by converting the string to 
    integer. Create a new column to define the sentiment: 1 (positive) for ratings of
    4 or 5, 0 (negative) otherwise.
    
    Params:
        - df
        
    Returns:
        - df with the added sentiment column
        
    """
    df['rating']=df['rating'].astype('int')
    df['sentiment']=np.where(df['rating']>=4,1,0)
    return df
reviews=sentiment_label(reviews)

In [None]:
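def subset_reviews(df):
    """
    Keep only the review text and sentiment columns of the negative reviews

    Params:
        - dataframe
    
    Returns:
        - negative reviews
        
    """
    df=df[['review','sentiment']]
    # Copy so that later column assignments do not act on a view of the original
    return df[df.sentiment==0].copy()
neg_reviews=subset_reviews(reviews)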

In [None]:
def clean_text(text):
    """
    Remove punctuation and other non-word characters from a review,
    return the text in lower case, and append any emoticons found.
    
    Params:
        - Reviews
        
    Returns:
        - Revised reviews
        
    """
    # Raw strings avoid invalid escape sequences; emoticons end in ), (, D or P
    emoticons=re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',text)
    text=(re.sub(r'[\W]+',' ',text.lower()) + ' '.join(emoticons).replace('-',''))
    return text

neg_reviews['review']=neg_reviews['review'].apply(clean_text)
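
To see what this cleaning step produces, the following self-contained sketch repeats the same logic under the hypothetical name `clean_text_demo` and applies it to a made-up sentence. Emoticons are collected from the original text before the non-word characters are stripped, then appended at the end without the "nose" character.

```python
import re

def clean_text_demo(text: str) -> str:
    """Lower-case the text, strip non-word characters, and append emoticons."""
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    cleaned = re.sub(r'[\W]+', ' ', text.lower())
    return cleaned + ' '.join(emoticons).replace('-', '')

example = 'This :) is :( a test :-)!'  # made-up input
result = clean_text_demo(example)
print(repr(result))
```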

# Modeling

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf=TfidfTransformer(use_idf=True,
                       norm='l2',
                       smooth_idf=True)

In [None]:
np.set_printoptions(precision=2)

In [None]:
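# review_doc was only defined inside tokenize(), so rebuild it here
review_doc=np.array(neg_reviews.review)
print(tfidf.fit_transform(count.fit_transform(review_doc)).toarray())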
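
With `use_idf=True`, `smooth_idf=True` and `norm='l2'`, scikit-learn documents the weighting as idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation of each document row. The hand-rolled sketch below reproduces that formula on a made-up three-document corpus using only the standard library; it is meant to illustrate the arithmetic, not to replace `TfidfTransformer`.

```python
import math
from collections import Counter

docs = ['app crashes daily', 'app is great', 'great keyboard app']  # made-up corpus
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(docs)

# Document frequency: number of documents that contain each term
doc_freq = {t: sum(t in doc for doc in tokenized) for t in vocab}
# Smoothed inverse document frequency, as with smooth_idf=True
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    """Raw term counts times idf, scaled to unit L2 length."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

matrix = [tfidf_row(doc) for doc in tokenized]
for term, weight in zip(vocab, matrix[0]):
    print(f'{term:>9}: {weight:.2f}')
```

A term that occurs in every document, such as "app" here, gets the minimum idf of 1 and therefore a lower weight than rarer terms like "crashes".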

## Sentiment Analysis
- Segment app reviews by the polarity of the users' sentiment to prioritize negative reviews
- The rating can be used to validate whether a review is classified as positive, negative, or neutral
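
One way the rating-based validation could look is sketched below. The three-way cut-offs (4 and 5 stars positive, 3 neutral, 1 and 2 negative) are an assumption that extends the binary label used earlier, and the ratings and predictions are made up; `rating_to_sentiment` and `agreement` are hypothetical helpers.

```python
from typing import List

def rating_to_sentiment(rating: int) -> str:
    """Map a 1-5 star rating to a polarity label (the cut-offs are an assumption)."""
    if rating >= 4:
        return 'positive'
    if rating == 3:
        return 'neutral'
    return 'negative'

def agreement(predicted: List[str], ratings: List[int]) -> float:
    """Share of reviews whose predicted polarity matches the rating-derived label."""
    expected = [rating_to_sentiment(r) for r in ratings]
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / len(ratings)

# Made-up ratings and model predictions for illustration
ratings = [1, 3, 3, 5, 1]
predicted = ['negative', 'negative', 'neutral', 'positive', 'negative']
print(agreement(predicted, ratings))
```

Reviews where the predicted polarity and the rating disagree may be worth reading manually, since they can point to mislabeled ratings or to text the model misreads.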

## Topic Modeling

# Evaluation

# Conclusion