# NLP Assignment
---

## Apple Store App Review

The Apple App Store exposes a `GET` endpoint that returns customer reviews for an app. The URL template is:

```
https://itunes.apple.com/{COUNTRY_CODE}/rss/customerreviews/id={APP_ID_HERE}/page={PAGE_NUMBER}/sortby=mostrecent/json
```

Note that you need to provide:

- The country code (`'us'`, `'fr'`, `'ca'`, `'au'`). Use these four, or any other four countries of your choice.
- The app ID. It appears in the URL of the app's web page, right after `id`.
    - You will need to use the IDs for the apps of your choice.
- The page number. The feed is paginated: each request returns a single page, and at most 10 pages are available.

For example, Candy Crush's US webpage is `https://apps.apple.com/us/app/candy-crush-saga/id553834731`, which means that the ID is `553834731`.
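
As a minimal sketch of the two steps above, the snippet below pulls the ID out of an app's web page URL and fills in the feed template. The helper names `extract_app_id` and `build_feed_url` are hypothetical and not part of any Apple API; only the URL shapes come from this assignment.

```python
import re

# Feed template copied from the assignment text.
FEED_TEMPLATE = (
    "https://itunes.apple.com/{country}/rss/customerreviews/"
    "id={app_id}/page={page}/sortby=mostrecent/json"
)


def extract_app_id(app_page_url):
    """Return the digits that follow 'id' in an App Store page URL."""
    match = re.search(r"/id(\d+)", app_page_url)
    if match is None:
        raise ValueError(f"No app ID found in {app_page_url!r}")
    return match.group(1)


def build_feed_url(country, app_id, page):
    """Fill in the review-feed template; the feed is limited to pages 1..10."""
    if not 1 <= page <= 10:
        raise ValueError("page must be between 1 and 10")
    return FEED_TEMPLATE.format(country=country, app_id=app_id, page=page)


app_id = extract_app_id("https://apps.apple.com/us/app/candy-crush-saga/id553834731")
print(app_id)
print(build_feed_url("us", app_id, 1))
```

Keeping the page-range check in one place means a loop over pages fails early instead of sending requests the feed will not serve.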

**Your goal is to use any predictive model you want to predict the 5-star rating for a particular app from the text of the review.**

Requirements:
1. Scrape the Apple Store to obtain reviews for your chosen apps and countries.
2. Save your results in a DataFrame. The head of your DataFrame should look like this:

<img src="../Data/df_example.png" width="500">

3. Using any method you want (pretrained models, dimensionality reduction, TF-IDF vectorization, etc.), build the best model you can to predict the 5-star rating.
4. Test your model with a "new" review.

In [1]:
# 1.
import requests
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Mapping of country codes and app IDs
country_codes = {'USA': 'us', 'France': 'fr', 'Canada': 'ca', 'Australia': 'au'}
app_identifiers = {'Candy Crush': '553834731', 'Facebook': '284882215', 'X': '333903271', 'Tinder': '547702041'}

def fetch_reviews(app_name, app_id, country_name, country_code):
    review_entries = []
    for page_num in range(1, 11):
        url = f"https://itunes.apple.com/{country_code}/rss/customerreviews/id={app_id}/page={page_num}/sortby=mostrecent/json"
        # A timeout keeps one unresponsive page from stalling the whole scrape
        response = requests.get(url, timeout=10)

        if response.status_code == 200:
            try:
                json_data = response.json()
                entries = json_data.get('feed', {}).get('entry', [])
                for entry in entries:
                    review_entries.append({
                        'App': app_name,
                        'Country': country_name,
                        'Rating': entry['im:rating']['label'],
                        'Review': entry['content']['label']
                    })
            except requests.exceptions.JSONDecodeError:
                print(f"Error decoding JSON from {url}")
        else:
            print(f"Failed to fetch data from {url} with status code {response.status_code}")
    return review_entries

def gather_all_reviews(app_identifiers, country_codes):
    all_reviews = []
    for app_name, app_id in app_identifiers.items():
        for country_name, country_code in country_codes.items():
            reviews = fetch_reviews(app_name, app_id, country_name, country_code)
            all_reviews.extend(reviews)
    return all_reviews
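
The parsing step in `fetch_reviews` indexes `entry['im:rating']` and `entry['content']` directly, which raises a `KeyError` if an entry lacks either field. The sketch below is one possible defensive variant, not the feed's documented contract. It assumes that `entry` may arrive as a single object rather than a list when a page holds one review, and that some entries may not be reviews at all. The function name `parse_entries` and the sample payload are hypothetical.

```python
def parse_entries(json_data, app_name, country_name):
    """Turn one decoded feed page into a list of review rows."""
    entries = json_data.get("feed", {}).get("entry", [])
    # Assumption: a page with a single review may expose `entry`
    # as one object instead of a list, so normalize to a list.
    if isinstance(entries, dict):
        entries = [entries]

    rows = []
    for entry in entries:
        rating = entry.get("im:rating", {}).get("label")
        review = entry.get("content", {}).get("label")
        if rating is None or review is None:
            continue  # skip entries that carry no rating or no text
        rows.append({
            "App": app_name,
            "Country": country_name,
            "Rating": rating,
            "Review": review,
        })
    return rows


# Hypothetical payload shaped like the fields the scraper reads.
sample_page = {
    "feed": {
        "entry": [
            {"im:rating": {"label": "5"}, "content": {"label": "So fun"}},
            {"title": {"label": "entry without a rating"}},
        ]
    }
}
rows = parse_entries(sample_page, "Candy Crush", "USA")
print(rows)
```

An empty or missing `feed` simply yields an empty list, so the caller can keep looping over pages without special cases.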

In [2]:
# 2.
# Fetch all reviews
all_reviews_data = gather_all_reviews(app_identifiers, country_codes)

# Convert the list of reviews to a DataFrame
df = pd.DataFrame(all_reviews_data)
df.head()

|   | App | Country | Rating | Review |
|---|-----|---------|--------|--------|
| 0 | Candy Crush | USA | 1 | Don’t waste your time. First few levels are e... |
| 1 | Candy Crush | USA | 1 | What’s up with the ads that don’t complete to ... |
| 2 | Candy Crush | USA | 1 | Mid- isle equipment add keeps freezing at 29 s... |
| 3 | Candy Crush | USA | 5 | So fun and I love it helps your stress |
| 4 | Candy Crush | USA | 4 | Like the game but your tiny screen version tak... |
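
Before modeling, it can help to look at how the ratings are distributed. The sketch below uses a small hypothetical frame with the same columns as `df`. It rests on two assumptions: `Rating` arrives as a string, as the later comparison `df['Rating'] == '5'` suggests, and the same review may be scraped more than once.

```python
import pandas as pd

# Hypothetical rows with the same columns as the scraped DataFrame.
sample_df = pd.DataFrame({
    "App": ["Candy Crush"] * 4,
    "Country": ["USA"] * 4,
    "Rating": ["1", "5", "5", "4"],
    "Review": ["Too many ads", "So fun", "So fun", "Nice but small screen"],
})

# Drop repeated reviews, then store the rating as an integer.
clean_df = sample_df.drop_duplicates(subset=["App", "Country", "Review"]).copy()
clean_df["Rating"] = clean_df["Rating"].astype(int)

# Share of each star rating, ordered from 1 to 5.
rating_share = clean_df["Rating"].value_counts(normalize=True).sort_index()
print(rating_share)
```

If the ratings are converted to integers this way, the later comparison must use `5` rather than `'5'`.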


In [3]:
# 3
# Preprocessing
df['Review'] = df['Review'].str.lower().str.replace(r'[^\w\s]', '', regex=True)
df['is_five_star'] = (df['Rating'] == '5').astype(int)

# Feature Engineering - TFIDF
tfidf = TfidfVectorizer(stop_words='english')
X = tfidf.fit_transform(df['Review'])
y = df['is_five_star']

# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model Building
model = LogisticRegression()
model.fit(X_train, y_train)

# Prediction and Evaluation
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.83      0.97      0.89      1167
           1       0.85      0.44      0.58       423

    accuracy                           0.83      1590
   macro avg       0.84      0.70      0.74      1590
weighted avg       0.83      0.83      0.81      1590
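
To read the report, it helps to recompute a few figures by hand. The sketch below starts from the rounded precision, recall, and support values printed above, so its results are approximate. The majority-class baseline is an added comparison point, not part of the original output.

```python
# Rounded values copied from the classification report above.
report = {
    0: {"precision": 0.83, "recall": 0.97, "support": 1167},
    1: {"precision": 0.85, "recall": 0.44, "support": 423},
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1_five_star = f1(report[1]["precision"], report[1]["recall"])
total = sum(row["support"] for row in report.values())

# Accuracy equals the support-weighted average of per-class recall.
accuracy = sum(row["recall"] * row["support"] for row in report.values()) / total

# Accuracy of always predicting the larger class ("not five stars").
majority_baseline = max(row["support"] for row in report.values()) / total

print(round(f1_five_star, 2))       # 0.58
print(round(accuracy, 2))           # 0.83
print(round(majority_baseline, 2))  # 0.73
```

The model beats the 0.73 baseline, but a recall of 0.44 on class 1 means it misses more than half of the actual 5-star reviews.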



In [4]:
# 4
# Test with a new review
new_review = ["This game is awesome and I love it!"]
new_review_tfidf = tfidf.transform(new_review)
predicted_rating = model.predict(new_review_tfidf)
if predicted_rating[0] == 1:
    print("Predicted 5-star rating")
else:
    print("Not a 5-star rating")

Predicted 5-star rating


In [5]:
# Test with a new review
new_review = ["This game is bad!"]
new_review_tfidf = tfidf.transform(new_review)
predicted_rating = model.predict(new_review_tfidf)
if predicted_rating[0] == 1:
    print("Predicted 5-star rating")
else:
    print("Not a 5-star rating")

Not a 5-star rating
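
One possible refinement, sketched under assumptions rather than as the solution's actual method: wrap the vectorizer and classifier in a scikit-learn pipeline. The pipeline takes raw strings directly, and fitting it on the training rows alone keeps test reviews from influencing the IDF weights, whereas the cell above fits `TfidfVectorizer` on all reviews before splitting. `class_weight="balanced"` is an optional choice that may improve recall on the rarer 5-star class. The toy corpus below is hypothetical and stands in for `df['Review']` and `df['is_five_star']`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; in the notebook these would be the training rows.
train_reviews = [
    "awesome game love it",
    "love this fun game",
    "awesome and fun",
    "best game ever love it",
    "bad game too many ads",
    "terrible and bad",
    "keeps crashing bad update",
    "waste of time terrible ads",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = five stars, 0 = otherwise

# The vectorizer is fitted only on what is passed to fit().
review_model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
review_model.fit(train_reviews, train_labels)

new_reviews = ["This game is awesome and I love it!", "This game is bad!"]
# Column 1 holds the estimated probability of the five-star class.
probabilities = review_model.predict_proba(new_reviews)[:, 1]
for text, p in zip(new_reviews, probabilities):
    print(f"{p:.2f}  {text}")
```

Printing probabilities instead of hard labels shows how confident the model is about each new review.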


## Great Job! 