# Combine Scraped Datasets from Google Play Store and Apple App Store

## Import Libraries

In [27]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

from nltk.tokenize import TweetTokenizer, RegexpTokenizer

import enchant

## Import Datasets

In [28]:
google_reviews = pd.read_csv('data/scrapper_102582_18_Dec_0757.csv')

In [29]:
google_reviews.shape

(102582, 10)

In [30]:
apple_reviews = pd.read_csv('data/app_store_12_13.csv')

In [31]:
apple_reviews.shape

(11617, 7)

The Apple App Store has far fewer reviews (11,617) than the Google Play Store (102,582).

### Google Play Store Reviews

In [32]:
google_reviews.head()

Unnamed: 0,reviewId,userName,userImage,content,score,thumbsUpCount,reviewCreatedVersion,at,replyContent,repliedAt
0,gp:AOqpTOHCvgqdJ2qLwgZ-dXETf4Q-aRiFEaHKC9vmNZ2...,Hailey Hawthorne,https://play-lh.googleusercontent.com/a-/AOh14...,"The graphics are gorgeous, the gameplay is inc...",5,881,1.1.1_1437351_1398019,2020-12-11 13:55:29,,
1,gp:AOqpTOHF4rxce7qwlomwP23shwoxC8I19SkLm26yC6C...,Ratul Kumar Biswas,https://play-lh.googleusercontent.com/a-/AOh14...,"If you love free world RPG, then this is it! T...",5,48,1.1.1_1437351_1398019,2020-12-16 18:10:47,,
2,gp:AOqpTOExfYs5EJj-TnaLYPjWfsUzQBE82Djf9nsN5-t...,fang theforgotten,https://play-lh.googleusercontent.com/a-/AOh14...,I absolutely love the game! No complaints besi...,5,185,1.1.1_1437351_1398019,2020-12-16 00:18:43,,
3,gp:AOqpTOEZuEm3oA8vH-NI9GtWCwcrqS1QpcBoFX8Ds-t...,Dohwon Kim,https://play-lh.googleusercontent.com/-9bHV96D...,Great game. Regardless of ripoff issues and su...,4,1,1.1.1_1437351_1398019,2020-12-16 12:46:11,,
4,gp:AOqpTOHiRvGrFsLKwnH7JOpzSaHP45OE4v7wV_wJbdf...,Ruby Daniel,https://play-lh.googleusercontent.com/a-/AOh14...,Amazing graphics and gameplay although a lot o...,5,186,1.1.1_1437351_1398019,2020-12-10 22:51:08,,


Columns such as reviewId, userName, and userImage are not needed for the analysis and will be dropped. reviewCreatedVersion will be dropped as well, but I will later engineer a comparable feature for both datasets.

In [33]:
# Number of non-null values in each column
google_reviews.count()

reviewId                102582
userName                102578
userImage               102582
content                 102582
score                   102582
thumbsUpCount           102582
reviewCreatedVersion     87335
at                      102582
replyContent                34
repliedAt                   34
dtype: int64

Only **34 out of 102582** reviews have received a reply. I will take a look at these replies.

In [34]:
replied = google_reviews[google_reviews['replyContent'].notnull()][['content', 'score', 'thumbsUpCount', 'at', 'replyContent', 'repliedAt']]

replied.head()

Unnamed: 0,content,score,thumbsUpCount,at,replyContent,repliedAt
13517,"Great graphics, fluid movement, open world fre...",5,1,2020-09-28 03:30:58,"Hello Traveller, currently the Genshin Impact ...",2020-09-27 09:53:07
18231,Editd Review: after a few days playing I do li...,4,35,2020-09-29 14:07:20,"Hello Traveller, currently the Genshin Impact ...",2020-09-27 09:52:11
21285,"I was a beta-tester, 5 stars are legit. I just...",2,2,2020-09-27 12:48:22,"Hello Traveller, currently the Genshin Impact ...",2020-09-27 10:08:18
23628,The game is really good it has nice graphics n...,5,1,2020-09-28 08:36:57,"Hello Traveller, currently the Genshin Impact ...",2020-09-27 09:54:41
25160,EDIT: Very bad customer service. I have sent a...,1,1,2020-10-03 15:25:00,"We sincerely apologize for the trouble caused,...",2020-10-03 14:31:53


In [35]:
# Print each replied review next to the developer's reply
for _, row in replied.iterrows():
    # 'at' is still a string here, so [:-3] trims the trailing ':SS'
    print(f"On {row['at'][:-3]}")
    print("====================")
    print(f"User Sent: (Score {row['score']})")
    print(f"{row['content']}")
    print("")
    print("Mihoyo Replied:")
    print(f"{row['replyContent']}")
    print("")
    print("")

On 2020-09-28 03:30
User Sent: (Score 5)
Great graphics, fluid movement, open world freedom, game-wise it's stunning, I like element combination, like wind + fire will make difference in combat, affordable IAP products, looks like i will stuck here for a very long time. Plus the tutorial is just appeared as we doing something so you won't stuck on it for long

Mihoyo Replied:
Hello Traveller, currently the Genshin Impact is available for pre-download service, the oficial open will be at September 28th, 2020, then let's start a new journey together.

Please visit our oficial website for more information: https://genshin.mihoyo.com/en.


On 2020-09-29 14:07
User Sent: (Score 4)
Editd Review: after a few days playing I do like it a lot. The only downsides I can give are: 1. Uses a fair amount of battery power. 2. The controls can be cumbersome causing you to move or act in a way that could cause you to do something you dont mean to. 3. Sometimes the game lags during cutscene. I am using a

In general, most of Mihoyo's replies were posted in the first few days to inform reviewers about the official opening of Genshin Impact. A few replies direct reviewers to customer service or advise on the best devices and settings for playing the game.

The **columns for replies will be dropped** as there are too few records and they do not affect the reviews themselves.

In [12]:
google_reviews_cleaned = google_reviews[['content', 'score', 'at', 'thumbsUpCount']].copy()

google_reviews_cleaned['source'] = 'google_play_store'

google_reviews_cleaned.columns = ['content', 'score', 'date', 'thumbsUp', 'source']

google_reviews_cleaned.head(10)

Unnamed: 0,content,score,date,thumbsUp,source
0,"The graphics are gorgeous, the gameplay is inc...",5,2020-12-11 13:55:29,881,google_play_store
1,"If you love free world RPG, then this is it! T...",5,2020-12-16 18:10:47,48,google_play_store
2,I absolutely love the game! No complaints besi...,5,2020-12-16 00:18:43,185,google_play_store
3,Great game. Regardless of ripoff issues and su...,4,2020-12-16 12:46:11,1,google_play_store
4,Amazing graphics and gameplay although a lot o...,5,2020-12-10 22:51:08,186,google_play_store
5,"This game is phenomenal, as it works cross pla...",5,2020-12-11 10:30:33,19,google_play_store
6,This game is pretty great. Until you hit late ...,3,2020-12-15 21:34:40,7,google_play_store
7,It's an amazing game overall. I would rate it ...,4,2020-12-12 05:33:00,27,google_play_store
8,"At the first glance the game looks perfect, gr...",1,2020-12-17 01:29:09,4,google_play_store
9,"I've tried a lot of mobile games, and I'm goin...",5,2020-12-15 18:01:03,6,google_play_store


In [18]:
google_reviews_cleaned.shape

(102582, 5)
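
Assigning a list to `.columns` as above depends on the column order matching the list exactly. As a minimal sketch of an alternative, the same selection and renaming can be written with an explicit name mapping. The `toy` frame below is made-up illustration data, not the scraped dataset.

```python
import pandas as pd

# Toy stand-in for google_reviews; the values are invented for illustration.
toy = pd.DataFrame({
    'content': ['great game', 'too laggy'],
    'score': [5, 2],
    'at': ['2020-12-11 13:55:29', '2020-12-16 18:10:47'],
    'thumbsUpCount': [881, 48],
    'replyContent': [None, None],
})

# Keep the needed columns, rename by name rather than by position,
# then tag every row with its source.
toy_cleaned = (
    toy[['content', 'score', 'at', 'thumbsUpCount']]
    .rename(columns={'at': 'date', 'thumbsUpCount': 'thumbsUp'})
    .assign(source='google_play_store')
)

print(toy_cleaned.columns.tolist())
```

With a mapping, reordering the selected columns later cannot silently attach the wrong name to a column.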

### Apple App Store Reviews

In [19]:
apple_reviews.head()

Unnamed: 0.1,Unnamed: 0,userName,review,rating,title,date,isEdited
0,0,Joy SQ.,"Besides everything that people say, this game ...",5,My Favorite Game ❤️❤️,2020-10-21 23:57:01,False
1,1,DiggingDiva,Words cannot describe how impeccable this game...,5,The most impressive game I’ve ever played.,2020-12-03 03:05:55,False
2,2,dontunderestimateme,I wish I found out about this game later when ...,5,Super addicting,2020-11-26 06:03:03,False
3,3,Tronscream,In my many years of trying to find games to pl...,5,The most successful open world game,2020-10-01 18:31:58,False
4,4,Hello Peoples And Creators,So. The only problem I have with this game is ...,5,Amazing! But...,2020-10-01 21:47:12,False


The userName column will be dropped. I will check whether the isEdited column provides any useful information.

In [21]:
apple_reviews['isEdited'].value_counts()

False    11617
Name: isEdited, dtype: int64

All the records have False for isEdited, so it will be dropped as well.

For the datasets from the Apple App Store and Google Play Store to be used together, I will **combine the title and review** so that each record holds a single review text, matching the Google Play format.

In [22]:
apple_reviews['combined'] = apple_reviews['title'] + ' ' + apple_reviews['review']

apple_reviews['source'] = 'apple_app_store'

apple_reviews_cleaned = apple_reviews[['combined', 'rating', 'date', 'source']].copy()

apple_reviews_cleaned.columns = ['content', 'score', 'date', 'source']

apple_reviews_cleaned.head()

Unnamed: 0,content,score,date,source
0,My Favorite Game ❤️❤️ Besides everything that ...,5,2020-10-21 23:57:01,apple_app_store
1,The most impressive game I’ve ever played. Wor...,5,2020-12-03 03:05:55,apple_app_store
2,Super addicting I wish I found out about this ...,5,2020-11-26 06:03:03,apple_app_store
3,The most successful open world game In my many...,5,2020-10-01 18:31:58,apple_app_store
4,Amazing! But... So. The only problem I have wi...,5,2020-10-01 21:47:12,apple_app_store
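
One thing to watch when concatenating string columns with `+`: if either part is missing, the whole result becomes NaN. The notebook does not show whether any title or review is missing in the scraped data, so the sketch below only illustrates the behaviour on a made-up frame, along with one possible way to guard against it.

```python
import numpy as np
import pandas as pd

# Made-up rows; the second one has a missing title.
toy = pd.DataFrame({
    'title': ['Amazing! But...', np.nan],
    'review': ['So. The only problem is lag.', 'Runs well on my phone.'],
})

# Plain concatenation: a missing title turns the combined text into NaN.
naive = toy['title'] + ' ' + toy['review']

# Filling missing parts with '' keeps the surviving text;
# str.strip() removes the stray space left by an empty title.
safe = (toy['title'].fillna('') + ' ' + toy['review'].fillna('')).str.strip()

print(naive.isna().tolist())
print(safe.tolist())
```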


In [23]:
apple_reviews_cleaned['date'].max()

'2020-12-11 20:07:00'

In [24]:
apple_reviews_cleaned['date'].min()

'2020-09-27 03:04:13'
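
At this point the date column still holds strings, so `max()` and `min()` compare text. That agrees with chronological order only because the timestamps are zero-padded in year-month-day order. A small sketch on three timestamps taken from the outputs above shows the comparison and how parsing to datetimes also gives the length of the period covered:

```python
import pandas as pd

# Three timestamps copied from the Apple App Store outputs above.
dates = pd.Series([
    '2020-12-11 20:07:00',
    '2020-09-27 03:04:13',
    '2020-10-01 18:31:58',
])

# String min/max: correct here because the format sorts lexicographically.
earliest_str, latest_str = dates.min(), dates.max()

# Parsing makes the ordering independent of the text format
# and allows date arithmetic.
parsed = pd.to_datetime(dates)
span = parsed.max() - parsed.min()

print(earliest_str, latest_str)
print(span)
```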

## Combine the Dataframes

In [25]:
combined_df = pd.concat([google_reviews_cleaned, apple_reviews_cleaned])

combined_df.reset_index(drop=True, inplace=True)

combined_df['date'] = pd.to_datetime(combined_df['date'])

combined_df.head()

Unnamed: 0,content,score,date,thumbsUp,source
0,"The graphics are gorgeous, the gameplay is inc...",5,2020-12-11 13:55:29,881.0,google_play_store
1,"If you love free world RPG, then this is it! T...",5,2020-12-16 18:10:47,48.0,google_play_store
2,I absolutely love the game! No complaints besi...,5,2020-12-16 00:18:43,185.0,google_play_store
3,Great game. Regardless of ripoff issues and su...,4,2020-12-16 12:46:11,1.0,google_play_store
4,Amazing graphics and gameplay although a lot o...,5,2020-12-10 22:51:08,186.0,google_play_store
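
Note that thumbsUp now shows as 881.0 rather than 881. The Apple App Store frame has no thumbsUp column, so `pd.concat` fills those rows with NaN and the column is upcast to float. The sketch below reproduces this on made-up frames. Converting to pandas' nullable `Int64` dtype is one optional way to keep whole numbers and is not a step taken in this notebook.

```python
import pandas as pd

# Made-up frames: only the Google one has a thumbsUp column.
google = pd.DataFrame({'content': ['a', 'b'], 'thumbsUp': [881, 48]})
apple = pd.DataFrame({'content': ['c']})

# ignore_index=True renumbers the rows 0..n-1, like reset_index(drop=True).
combined = pd.concat([google, apple], ignore_index=True)

# The missing values force an upcast from int64 to float64.
dtype_after_concat = combined['thumbsUp'].dtype
print(dtype_after_concat)

# Optional: nullable integers keep 881 as 881 and mark the gap as <NA>.
combined['thumbsUp'] = combined['thumbsUp'].astype('Int64')
print(combined)
```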


In [26]:
combined_df.to_pickle('data/combined_df.p')
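
A pickle keeps column dtypes, so the parsed date column comes back as datetimes when the file is reloaded with `pd.read_pickle`. A minimal round-trip sketch on a made-up frame, written to a temporary directory instead of the `data/` folder:

```python
import os
import tempfile

import pandas as pd

# Made-up single-row frame with a parsed datetime column.
df = pd.DataFrame({
    'content': ['great game'],
    'date': pd.to_datetime(['2020-12-11 13:55:29']),
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'combined_df.p')  # temporary path for the demo
    df.to_pickle(path)
    restored = pd.read_pickle(path)

# The date column is still a datetime dtype after reloading.
print(restored.dtypes)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.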