# 1. App Review NLP work

This question uses the Apple App review dataset that you generated in the "Pulling online data" workshop. Your dataset should have at least 3-5 different applications, with data from a few countries.

**1.1** Using the bag-of-words or TF-IDF vector model (from SKLearn), cluster the reviews into 5 clusters. Measure the accuracy of the cluster overlap against the real review scores.

**1.2** Now use a sentence embedding using on of the `gensim` pre-trained word embedding models to achieve the same in clustering. Get the best classification accuracy score you can on the 5-star review scale targer using all unsupervised methods you want.

**1.3** Using any method you want (pre-trained models, dimensionality reduction, feature engineering, etc.) make the best **regression** model you can to predict the 5 star rating. Rate the accuracy in regression terms (mean squared error) and in classification terms (accuracy score, etc.)

**1.4** Do the same as in 1.3, but use a classification model. Are classification models better or worse to predict a 5-point rating scale? Explain in a few paragraphs and justify with metrics.


In [1]:
import requests
import json
import pandas as pd

"""
From my 4.5 workshop

"""

app_ids={
    'amazon': '297606951',
    'messenger': '454638411',
    'wise': '612261027',
}
country_codes=[
    'ca',
    'us',
    'aus',
    'nz',
    'ie',
    'gb'
]

def is_error_response(http_response, seconds_to_sleep: float = 1) -> bool:
    """
    Returns False if status_code is 503 (system unavailable) or 200 (success),
    otherwise it will return True (failed). This function should be used
    after calling the commands requests.post() and requests.get().

    :param http_response:
        The response object returned from requests.post or requests.get.
    :param seconds_to_sleep:
        The sleep time used if the status_code is 503. This is used to not
        overwhelm the service since it is unavailable.
    """
    if http_response.status_code == 503:
        time.sleep(seconds_to_sleep)
        return False

    return http_response.status_code != 200

def get_json(url):# -> typing.Union[dict, None]:
    """
    Returns json response if any. Returns None if no json found.

    :param url:
        The url go get the json from.
    """
    response = requests.get(url)
    if is_error_response(response):
        return None
    json_response = response.json()
    return json_response


apps_r = []

def get_reviews_for(app_name, in_country, at_page=1):
    
    global app_ids
    app_id = app_ids[app_name]
    reviews = []
    
    while True:
        url = (f'https://itunes.apple.com/{in_country}/rss/customerreviews/page={at_page}/id={app_id}/sortby=mostrecent/json')
        json = get_json(url)

        if not json:
            return reviews

        feed = json.get('feed')
        
        try:
            if not feed.get('entry'):
                get_reviews_for(app_id, in_country, at_page + 1)
            reviews += [
                {
                    'review_id': entry.get('id').get('label'),
                    'app': app_name,
                    'title': entry.get('title').get('label'),
                    'author': entry.get('author').get('name').get('label'),
                    'author_url': entry.get('author').get('uri').get('label'),
                    'version': entry.get('im:version').get('label'),
                    'rating': entry.get('im:rating').get('label'),
                    'review': entry.get('content').get('label'),
                    'vote_count': entry.get('im:voteCount').get('label'),
                    'page': at_page
                }
                for entry in feed.get('entry')
                if not entry.get('im:name')
            ]
            at_page += 1
        except Exception as e:
            return reviews


In [2]:
for country in country_codes:
    for app in app_ids.keys():
        print("Fetching", app, "for", country)
        apps_r += get_reviews_for(app, country)
print("Done")

Fetching amazon for ca
Fetching messenger for ca
Fetching wise for ca
Fetching amazon for us
Fetching messenger for us
Fetching wise for us
Fetching amazon for aus
Fetching messenger for aus
Fetching wise for aus
Fetching amazon for nz
Fetching messenger for nz
Fetching wise for nz
Fetching amazon for ie
Fetching messenger for ie
Fetching wise for ie
Fetching amazon for gb
Fetching messenger for gb
Fetching wise for gb
Done


In [3]:
apps = pd.DataFrame()
for app in apps_r:
    apps = apps.append(app, ignore_index=True)
print('Done')
apps['rating'] = apps['rating'].astype(float)
apps

Done


Unnamed: 0,app,author,author_url,page,rating,review,review_id,title,version,vote_count
0,amazon,mrkachi,https://itunes.apple.com/ca/reviews/id705420305,1.0,4.0,Having a hard time finding where I can reply t...,7091392821,Can’t respond to sellers,17.4.0,0
1,amazon,Seleena1995,https://itunes.apple.com/ca/reviews/id457357629,1.0,1.0,The new update won’t let me select other optio...,7091347680,Awful,17.4.0,0
2,amazon,Vavalulu citron,https://itunes.apple.com/ca/reviews/id301949810,1.0,1.0,I downloaded the app solely for the purpose of...,7090032691,Useless,17.4.0,0
3,amazon,Zikkom,https://itunes.apple.com/ca/reviews/id47695021,1.0,1.0,"You search for a certain brand, it shows you o...",7089279513,NO relevance!,17.4.0,0
4,amazon,vkkhgf,https://itunes.apple.com/ca/reviews/id999320194,1.0,5.0,Good,7088143871,Good,17.4.0,0
...,...,...,...,...,...,...,...,...,...,...
5685,wise,maaj1970,https://itunes.apple.com/gb/reviews/id175421421,10.0,5.0,A complete satisfaction in terms of client pea...,6299764481,Excellent app well designed,5.35,0
5686,wise,Evdybfhbfhbfthfgbfhko,https://itunes.apple.com/gb/reviews/id216475711,10.0,1.0,"Great when it works, but on numerous occasions...",6295435367,Unreliable,5.35,0
5687,wise,thailandia123456J,https://itunes.apple.com/gb/reviews/id1143975714,10.0,5.0,Muito boa já uso há muito tempo espetacular ob...,6295305850,Ótimo,5.35,0
5688,wise,LEPJ26,https://itunes.apple.com/gb/reviews/id1017044021,10.0,5.0,This is absolutely a brilliant option to trans...,6293821255,Best app to transfer money overseas,5.35,0


In [4]:
# """
# Checking language of reviews
# """
# import polyglot
# from polyglot.text import Text, Word

# apps['lang'] = apps['review'].apply(lambda x: Text(x).language.name)
# apps

**1.1** Using the bag-of-words or TF-IDF vector model (from SKLearn), cluster the reviews into 5 clusters. Measure the accuracy of the cluster overlap against the real review scores.

In [7]:
print(pd.__version__)
pd.Panel

1.2.3


AttributeError: module 'pandas' has no attribute 'Panel'

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer
# import statsmodels.api as sm
# from sklearn.decomposition import PCA
# from sklearn.metrics import r2_score

COMPRESSED_SIZE = 200

tf = TfidfVectorizer()
vectors = tf.fit_transform(apps['review'])

feature_names = tf.get_feature_names()
dense = vectors.todense()
denselist = dense.tolist()
df = pd.DataFrame(denselist, columns=feature_names)
df

Unnamed: 0,00,000,00pm,01,02,03,05,06,08,0872464276,...,สม,สะดวกมากค,ำเสมอนะคะ,แต,แบบน,ảnh,ấy,必要費用がわかりやすくリーズナブルと思います,海外送金に利用させて頂いてます,非常好好
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5685,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5686,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5687,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5688,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## 2. Face data

Here let's apply manifold learning on some face data.

Use the following code:

```
from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people(min_faces_per_person=30)
```

To get the `faces` dataset.

Use dimensionality reduction so that the darkness of the image is sorted in the first dimension as seen in this picture:

![](isofaces.png)

Then produce a picture similar to this one with your result