# Scraping Apple App Store Reviews

In this quick mini-project, we find out how to scrape App Store reviews and save them in a dataframe. The library that we will be using is ```app-store-scraper```, which is documented [here](https://pypi.org/project/app-store-scraper/#quickstart).

Install the library: ```pip3 install app-store-scraper```.

In [1]:
# Basic imports
import pandas as pd
import numpy as np

# Library used to scrape app store
from app_store_scraper import AppStore

# For date time formatting
import datetime as dt

## Scraping the App Store

Visit the target app's App Store page and get its app name and app ID from the web address. For the following example, we will be scraping the reviews of ABillion. As a side note, ABillion is a social media platform for reviews of mainly vegan food and products. Visit their website [here](https://www.abillion.com). For ABillion, the link to the App Store page is ```https://apps.apple.com/us/app/abillionveg/id1377119949```.

From the web address, we get ```app_name: abillionveg``` and ```app_id: 1377119949```.
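
The same extraction can be done in code. The following is a minimal sketch that assumes the web address always follows the ```/<country>/app/<app_name>/id<app_id>``` pattern seen above; the helper name ```parse_app_store_url``` is hypothetical and is not part of ```app-store-scraper```.

```python
import re
from urllib.parse import urlparse


def parse_app_store_url(url):
    """Return (country, app_name, app_id) from an App Store page URL.

    Assumes the path looks like /<country>/app/<app_name>/id<app_id>.
    """
    path = urlparse(url).path
    match = re.fullmatch(r"/([a-z]{2})/app/([^/]+)/id(\d+)/?", path)
    if match is None:
        raise ValueError(f"Unrecognised App Store URL: {url}")
    country, app_name, app_id = match.groups()
    return country, app_name, int(app_id)


country, app_name, app_id = parse_app_store_url(
    "https://apps.apple.com/us/app/abillionveg/id1377119949"
)
print(country, app_name, app_id)
```

The three values map directly onto the ```country```, ```app_name``` and ```app_id``` arguments used in the next cell.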

Set ```how_many``` to the number of reviews you want to fetch. In the following example, we set it to 200, but since the app does not have that many reviews, we get 161 reviews instead.

In [2]:
app = AppStore(country="us", app_name="abillionveg", app_id=1377119949)
app.review(how_many=200)

2022-02-03 17:36:13,081 [INFO] Base - Initialised: AppStore('us', 'abillionveg', 1377119949)
2022-02-03 17:36:13,083 [INFO] Base - Ready to fetch reviews from: https://apps.apple.com/us/app/abillionveg/id1377119949
2022-02-03 17:36:15,930 [INFO] Base - [id:1377119949] Fetched 161 reviews (161 fetched in total)


From viewing one review, we see that each review is stored as a dictionary, which can itself contain a nested dictionary (the developer's response). We need to convert this into a dataframe with the appropriate column names.

In [3]:
app.reviews[0]

{'review': 'I found this from an instagram ad and decided to try it out. I was playing around with it and reviewing some products/dishes that I had saved in my phone, and 10 reviews later, I “unlocked” a $10 donation to a sanctuary of my choice! I was SO surprised and instantly started loving this app even more. Check out their website: 1 review = $1 donated. Write 10 reviews and donate $10. This is not money out of your own pocket, it’s directly from abillionveg. I will continue writing reviews so I can unlock more donations. LOVE their mission of bringing the veggie community together while also supporting rescue sanctuaries and orgs. I’d give them 100 stars if possible.\n\nUpdate: I wish there was some type of “search” bar on your own profile so you can easily find one of your past reviews to update it. Your profile list can get pretty long once you review dozens of products/dishes. Also, they will make you remove your review of a product that is not 100% plant based (I don’t agree 

In [4]:
# Convert the list of review dictionaries to a dataframe

reviews = pd.DataFrame.from_dict(app.reviews)
reviews.head(2)

|   | review | rating | isEdited | title | date | developerResponse | userName |
|---|---|---|---|---|---|---|---|
| 0 | I found this from an instagram ad and decided ... | 5 | True | Reviews = donations to sanctuaries!!! | 2020-01-14 19:10:13 | {'id': 11122362, 'body': 'Hey Sarah, Means so... | Sarahhhhhhho |
| 1 | Reviewing is actually a hobby of mine, and kno... | 5 | False | My favorite “social” app, game changer for veg... | 2021-08-22 03:35:01 | {'id': 24846835, 'body': 'TurdsInRocks, thank ... | TurdsInRocks |


In [6]:
# Renaming columns for concision and lowercase everything for uniformity.

reviews.rename(columns={'date': 'review_datetime', 'developerResponse': 'dev'}, inplace=True)
reviews.columns = reviews.columns.str.lower()
reviews.head(2)

|   | review | rating | isedited | title | review_datetime | dev | username |
|---|---|---|---|---|---|---|---|
| 0 | I found this from an instagram ad and decided ... | 5 | True | Reviews = donations to sanctuaries!!! | 2020-01-14 19:10:13 | {'id': 11122362, 'body': 'Hey Sarah, Means so... | Sarahhhhhhho |
| 1 | Reviewing is actually a hobby of mine, and kno... | 5 | False | My favorite “social” app, game changer for veg... | 2021-08-22 03:35:01 | {'id': 24846835, 'body': 'TurdsInRocks, thank ... | TurdsInRocks |


## Unpacking Nested Dictionary

The ```dev``` column still holds a nested dictionary. We unpack it into separate columns of the dataframe with the function below:

In [7]:
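def devresponse(row):
    """Copy the fields of the nested developer response into their own columns."""
    # Rows without a developer response hold NaN instead of a dictionary.
    if pd.notnull(row['dev']):
        row['id'] = row['dev']['id']
        row['dev_response'] = row['dev']['body']
        row['dev_response_datetime'] = row['dev']['modified']
    return row

# Apply the function to every row (axis=1).
reviews = reviews.apply(devresponse, axis=1)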
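
As an alternative to a row-wise function, ```pd.json_normalize``` can flatten nested dictionaries while the dataframe is being built. The sketch below uses two made-up records shaped like the ```app.reviews``` entries (the sample values are illustrative, not scraped data); nested keys become dotted column names such as ```developerResponse.body```, which are then renamed to the names used in this walkthrough.

```python
import pandas as pd

# Illustrative records mimicking the structure of app.reviews;
# the second one has no developer response.
sample_reviews = [
    {
        "review": "Great app",
        "rating": 5,
        "developerResponse": {
            "id": 1,
            "body": "Thank you!",
            "modified": "2020-01-14T19:10:13Z",
        },
    },
    {"review": "Could be better", "rating": 3},
]

# Nested keys are expanded into columns named <parent>.<child>.
flat = pd.json_normalize(sample_reviews)

# Rename the dotted columns to the names used in this walkthrough.
flat = flat.rename(columns={
    "developerResponse.id": "id",
    "developerResponse.body": "dev_response",
    "developerResponse.modified": "dev_response_datetime",
})
print(flat.columns.tolist())
```

Records without a developer response end up with missing values in the three flattened columns, which matches the behaviour of the row-wise function.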

## Converting developer's response date and time

Viewing the top 2 entries, notice that ```dev_response_datetime``` contains a ```T``` and a ```Z```. The ```T``` is the separator that the ISO 8601 combined date-time format places between the date and the time; you can read it as an abbreviation for Time. The ```Z``` marks a zero offset from Coordinated Universal Time (UTC), meaning the timestamp is expressed in UTC.

If the developer has not responded, the values are ```NaN```.
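
To make the ```T``` and ```Z``` markers concrete, here is a small sketch that parses one such string with Python's standard library alone. It assumes Python 3.7 or later, where the ```%z``` directive accepts a literal ```Z``` as a zero UTC offset; the variable names are only illustrative.

```python
from datetime import datetime, timedelta

raw = "2020-01-14T19:10:13Z"

# "T" is matched literally; %z reads the trailing "Z" as a zero UTC offset.
parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S%z")

# The parsed value is timezone-aware with no offset from UTC.
is_utc = parsed.utcoffset() == timedelta(0)
print(is_utc)

# Re-format without the "T" separator and the timezone marker.
formatted = parsed.strftime("%Y-%m-%d %H:%M:%S")
print(formatted)
```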

In [8]:
reviews.head(2)

|   | dev | dev_response | dev_response_datetime | id | isedited | rating | review | review_datetime | title | username |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | {'id': 11122362, 'body': 'Hey Sarah, Means so... | Hey Sarah,\n\nMeans so much to us that you got... | 2020-01-14T19:10:13Z | 11122362.0 | True | 5 | I found this from an instagram ad and decided ... | 2020-01-14 19:10:13 | Reviews = donations to sanctuaries!!! | Sarahhhhhhho |
| 1 | {'id': 24846835, 'body': 'TurdsInRocks, thank ... | TurdsInRocks, thank you so much for your warm ... | 2021-09-01T10:23:23Z | 24846835.0 | False | 5 | Reviewing is actually a hobby of mine, and kno... | 2021-08-22 03:35:01 | My favorite “social” app, game changer for veg... | TurdsInRocks |


Next, we convert the column into datetime format. Simply converting with ```pd.to_datetime``` keeps all the available information, including the timezone, which for UTC is shown as ```+00:00```. To drop it, we format the converted values using ```dt.strftime``` with a specified format. In this case, we want to mimic the date and time format of the initial review, as shown in ```review_datetime```.

In [9]:
pd.to_datetime(reviews['dev_response_datetime'])

0     2020-01-14 19:10:13+00:00
1     2021-09-01 10:23:23+00:00
2     2019-08-02 03:38:59+00:00
3     2019-08-02 04:07:39+00:00
4                           NaT
                 ...           
156   2021-12-14 17:14:20+00:00
157                         NaT
158   2019-08-02 04:10:06+00:00
159   2019-09-03 10:31:13+00:00
160   2019-08-02 04:59:21+00:00
Name: dev_response_datetime, Length: 161, dtype: datetime64[ns, UTC]

In [10]:
reviews['dev_response_datetime'] = pd.to_datetime(reviews['dev_response_datetime']).dt.strftime('%Y-%m-%d %H:%M:%S')
reviews.head(2)

|   | dev | dev_response | dev_response_datetime | id | isedited | rating | review | review_datetime | title | username |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | {'id': 11122362, 'body': 'Hey Sarah, Means so... | Hey Sarah,\n\nMeans so much to us that you got... | 2020-01-14 19:10:13 | 11122362.0 | True | 5 | I found this from an instagram ad and decided ... | 2020-01-14 19:10:13 | Reviews = donations to sanctuaries!!! | Sarahhhhhhho |
| 1 | {'id': 24846835, 'body': 'TurdsInRocks, thank ... | TurdsInRocks, thank you so much for your warm ... | 2021-09-01 10:23:23 | 24846835.0 | False | 5 | Reviewing is actually a hobby of mine, and kno... | 2021-08-22 03:35:01 | My favorite “social” app, game changer for veg... | TurdsInRocks |


Note that ```dt.strftime``` returns strings, so the data type of ```dev_response_datetime``` is still ```object```.

In [11]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 161 entries, 0 to 160
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   dev                    100 non-null    object        
 1   dev_response           100 non-null    object        
 2   dev_response_datetime  100 non-null    object        
 3   id                     100 non-null    float64       
 4   isedited               161 non-null    bool          
 5   rating                 161 non-null    int64         
 6   review                 161 non-null    object        
 7   review_datetime        161 non-null    datetime64[ns]
 8   title                  161 non-null    object        
 9   username               161 non-null    object        
dtypes: bool(1), datetime64[ns](1), float64(1), int64(1), object(6)
memory usage: 11.6+ KB


We then convert it back into datetime format with the following code:

In [12]:
reviews['dev_response_datetime'] = pd.to_datetime(reviews['dev_response_datetime'])

In [13]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 161 entries, 0 to 160
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   dev                    100 non-null    object        
 1   dev_response           100 non-null    object        
 2   dev_response_datetime  100 non-null    datetime64[ns]
 3   id                     100 non-null    float64       
 4   isedited               161 non-null    bool          
 5   rating                 161 non-null    int64         
 6   review                 161 non-null    object        
 7   review_datetime        161 non-null    datetime64[ns]
 8   title                  161 non-null    object        
 9   username               161 non-null    object        
dtypes: bool(1), datetime64[ns](2), float64(1), int64(1), object(5)
memory usage: 11.6+ KB
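
The round trip through strings (format with ```dt.strftime```, then parse again) is one way to drop the timezone. A possible shortcut, sketched below on made-up values in the same ISO 8601 format, is to parse the column as UTC and then remove the timezone with ```dt.tz_localize(None)```, which keeps the UTC clock time. This is an alternative to the cells above, not what they do.

```python
import pandas as pd

# Illustrative values in the same ISO 8601 format; None stands for a
# review without a developer response.
raw = pd.Series(["2020-01-14T19:10:13Z", None, "2021-09-01T10:23:23Z"])

# Parse as timezone-aware UTC timestamps; missing values become NaT.
aware = pd.to_datetime(raw, utc=True)

# Drop the timezone while keeping the UTC clock time.
naive = aware.dt.tz_localize(None)

print(naive)
```

The result is a timezone-naive datetime column in a single step, with ```NaT``` wherever the developer has not responded.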


## Saving data to CSV

Lastly, save the reviews to a CSV file.

In [14]:
reviews.to_csv('abillionveg_app_store_reviews.csv', index=False)
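
Keep in mind that a CSV file stores timestamps as plain text, so the datetime columns need to be parsed again when the file is loaded. The sketch below illustrates this on a small made-up frame, writing to an in-memory buffer instead of a file on disk; the column names follow this walkthrough, while the values are only illustrative.

```python
import io

import pandas as pd

# Small illustrative frame with one missing developer response.
frame = pd.DataFrame({
    "rating": [5, 4],
    "review_datetime": pd.to_datetime(
        ["2020-01-14 19:10:13", "2021-08-22 03:35:01"]
    ),
    "dev_response_datetime": pd.to_datetime(["2020-01-14 19:10:13", None]),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)

# Ask pandas to parse the timestamp columns while loading.
restored = pd.read_csv(
    buffer, parse_dates=["review_datetime", "dev_response_datetime"]
)
print(restored.dtypes)
```

Columns that hold Python dictionaries, such as ```dev```, are written as their string representation, so they come back as text rather than dictionaries.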

And we're done. The ```reviews``` dataset is ready to be used however you please.