# Select random reviews for coding

We want to select a random sample of reviews for a first round of coding.

Possible sampling methods:
- by app version
- by week / day

Let's aim for a 50/50 split for App Store / Play Store reviews.

Since the daily review counts drop off quickly, we need to find out where to stop sampling.

## Review Counts

In [4]:
import pandas as pd

reviews = pd.read_pickle('data/combined.pkl')

In [5]:
reviews['date'] = pd.to_datetime(reviews['date'], utc=True)

# groups for both PlayStore and AppStore
by_source = reviews.groupby('source')

appS = by_source.get_group('AppStore')
playS = by_source.get_group('PlayStore')

# group into days
appS_by_day = appS.groupby(appS['date'].dt.date)
playS_by_day = playS.groupby(playS['date'].dt.date)

In [6]:
import plotly.graph_objects as go

# Number of reviews per calendar day for each store
appS_daily = appS_by_day['date'].count()
playS_daily = playS_by_day['date'].count()

fig = go.Figure(
    data=[go.Scatter(x=appS_daily.index.values, y=appS_daily, name='AppStore'),
          go.Scatter(x=playS_daily.index.values, y=playS_daily, name='PlayStore')],
    layout_title_text="New Daily Reviews"
)
fig.show()

April 27th is the last day on which the App Store has more than 20 reviews.
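The cutoff can also be read off the data rather than the plot. The following is a minimal sketch, assuming a frame with a timezone-aware `date` column like `appS` above; the helper name `last_day_above` and its `threshold` parameter are hypothetical and not part of the original notebook.

```python
import pandas as pd

def last_day_above(store_reviews: pd.DataFrame, threshold: int = 20):
    """Return the latest calendar day with more than `threshold` reviews, or None."""
    # Count reviews per calendar day
    daily_counts = store_reviews.groupby(store_reviews['date'].dt.date).size()
    # Keep only the days that exceed the threshold
    busy_days = daily_counts[daily_counts > threshold]
    return max(busy_days.index) if len(busy_days) > 0 else None
```

Calling `last_day_above(appS)` would then give the candidate cutoff date for the App Store sample.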

In [10]:
by_version_appS = appS.groupby('version')
by_version_playS = playS.groupby('version')
by_version = reviews.groupby('version')

# One bar per store (and combined) for each version; bar labels show the row count
fig = go.Figure(data=[
    go.Bar(name='PlayStore', x=by_version_playS['score'].count().index,
           y=by_version_playS['score'].count(), text=by_version_playS['score'].size()),
    go.Bar(name='AppStore', x=by_version_appS['score'].count().index,
           y=by_version_appS['score'].count(), text=by_version_appS['score'].size()),
    go.Bar(name='Combined', x=by_version['score'].count().index,
           y=by_version['score'].count(), text=by_version['score'].size())
], layout_title_text="Review count by app version and store")
# Show the bars side by side instead of stacked
fig.update_layout(barmode='group')
fig.show()

Versions 1.0.1 and 1.0.5 would need to be excluded if we wanted to split the reviews 50/50 between stores, since they were released for only one of the stores.
Version 1.1.0 has only 27 App Store reviews.
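To make the exclusion reproducible, the versions that are usable for a 50/50 split could be derived from a version-by-store count table. This is only a sketch: `versions_in_both_stores` and `min_per_store` are hypothetical names, and the column names `version` and `source` are taken from the cells above.

```python
import pandas as pd

def versions_in_both_stores(reviews: pd.DataFrame, min_per_store: int = 1) -> list:
    """List versions that have at least `min_per_store` reviews in every store."""
    # Rows: versions, columns: stores, cells: review counts (0 if absent)
    counts = pd.crosstab(reviews['version'], reviews['source'])
    # A version qualifies only if every store reaches the minimum
    keep = (counts >= min_per_store).all(axis=1)
    return counts.index[keep].tolist()
```

Raising `min_per_store` would additionally drop versions that exist in both stores but have too few reviews in one of them.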

In [34]:
# One trace per app version: number of reviews per calendar day
scatters = []
for version, group in by_version:
    daily = group.groupby(group['date'].dt.date)['date'].count()
    scatters.append(go.Scatter(x=daily.index.values, y=daily, name=version))

fig = go.Figure(
    data=scatters,
    layout_title_text="New Daily Reviews by Version"
)
fig.show()

Here we see which version was reviewed at which point in time. From this data, it probably makes sense to exclude versions 1.0.9 and 1.1.0.
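One way the sample itself might then be drawn is sketched below. It assumes the `source` and `version` columns used above; the function `sample_balanced`, its parameters, and the fixed seed are illustrative choices, not something the notebook prescribes.

```python
import pandas as pd

def sample_balanced(reviews: pd.DataFrame, n_per_store: int,
                    exclude_versions=(), seed: int = 42) -> pd.DataFrame:
    """Draw the same number of random reviews from each store."""
    # Drop the versions we decided not to sample from
    eligible = reviews[~reviews['version'].isin(list(exclude_versions))]
    groups = [group for _, group in eligible.groupby('source')]
    # Cap the sample size at the smallest store so the split stays equal
    n = min([n_per_store] + [len(group) for group in groups])
    return pd.concat(group.sample(n=n, random_state=seed) for group in groups)
```

For example, `sample_balanced(reviews, 100, exclude_versions=['1.0.9', '1.1.0'])` would return up to 100 reviews per store. Fixing `random_state` keeps the selection stable across reruns, which helps if several coders need to work on the same sample.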