# Data Importing, Exploration, Cleaning and Visualising

### Importing libraries

In [133]:
# Importing the libraries needed for this part of the project 

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
import pymongo
from pymongo import MongoClient

### Reading CSV files

In [134]:
# Getting and storing the data in a dataframe
df = pd.read_csv('df_all_2020_2021.csv')



Columns (7) have mixed types.Specify dtype option on import or set low_memory=False.
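The warning above means that pandas inferred more than one type for the column at position 7 while reading the file in chunks. Below is a minimal sketch of the two options the warning suggests, using a small in-memory CSV as a stand-in for `df_all_2020_2021.csv`. The column names `reviewId` and `appVersion` are assumptions for illustration only; the real column at position 7 may be a different one.

```python
import io
import pandas as pd

# Tiny stand-in for the real file; 'appVersion' mixes numbers and text,
# which is the kind of column that can trigger the mixed-types warning.
csv_text = "reviewId,appVersion,score\n1,13,5\n2,13.0.4,4\n3,,3\n"

# Option 1: read the file in a single pass so the type is inferred once
df_one_pass = pd.read_csv(io.StringIO(csv_text), low_memory=False)

# Option 2: state the type of the offending column explicitly
df_typed = pd.read_csv(io.StringIO(csv_text), dtype={"appVersion": "string"})

print(df_typed.dtypes)
```

Option 2 is usually the safer choice when the column is known, because the values are then read consistently as text.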



In [135]:
# Checking the current value counts of each app name
df['appId'].value_counts()

Clash of Clans    757133
Brawl Stars       246871
Clash Royale      207964
Hay Day            67808
Boom Beach         18013
Clash  Quest         722
Name: appId, dtype: int64

### Now we group the data by month to compare all of Supercell's apps between April 2020 and May 2021

In [136]:
# Here I will do a monthly aggregation of the data to use it for the visualisation
# First I will create a column for the month and the year, parsing the date only once
review_date = pd.to_datetime(df['at'])
df['month'] = review_date.dt.month
df['year'] = review_date.dt.year




In [137]:
# Here I will create a new dataframe of the monthly values
# We will group by the year, month and the appId
# Aggregate as: mean of score, sum of thumbsUpCount and sum of count
# String names are used for the aggregations because newer pandas versions warn when numpy functions are passed
df_month = df.groupby(['year', 'month', 'appId'], as_index=False).agg({'score': 'mean', 'thumbsUpCount': 'sum', 'count': 'sum'})

In [138]:
df_month.head()

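|   | year | month | appId | score | thumbsUpCount | count |
|---|------|-------|-------|-------|---------------|-------|
| 0 | 2020 | 4 | Boom Beach | 4.197368 | 2996 | 2584 |
| 1 | 2020 | 4 | Brawl Stars | 4.238264 | 9327 | 10459 |
| 2 | 2020 | 4 | Clash Royale | 3.828147 | 14580 | 13785 |
| 3 | 2020 | 4 | Clash of Clans | 4.445508 | 1702 | 3239 |
| 4 | 2020 | 4 | Hay Day | 4.103312 | 10918 | 7821 |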
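The same monthly aggregation can also be written with pandas named aggregation, which makes the output column names explicit. The sketch below runs on a few made-up review rows that reuse the notebook's column names; the values are illustrative and not taken from the real dataset.

```python
import pandas as pd

# Toy review-level data with the same column names as the notebook
reviews = pd.DataFrame({
    "at": ["2020-04-01", "2020-04-15", "2020-05-02", "2020-04-20"],
    "appId": ["Hay Day", "Hay Day", "Hay Day", "Boom Beach"],
    "score": [5, 3, 4, 2],
    "thumbsUpCount": [10, 0, 7, 1],
    "count": [1, 1, 1, 1],
})

# Parse the timestamp once, then derive both calendar fields from it
parsed = pd.to_datetime(reviews["at"])
reviews["year"] = parsed.dt.year
reviews["month"] = parsed.dt.month

# One row per (year, month, app): mean score, summed thumbs-up and review count
monthly = reviews.groupby(["year", "month", "appId"], as_index=False).agg(
    score=("score", "mean"),
    thumbsUpCount=("thumbsUpCount", "sum"),
    count=("count", "sum"),
)
print(monthly)
```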


### Merging the year and month in one column

In [139]:
# Here we merge the month and year columns to get a monthly observation of the app
df_month['Monthly date'] = df_month['year'].astype(str) + '-' + df_month['month'].astype(str)
df_month

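|   | year | month | appId | score | thumbsUpCount | count | Monthly date |
|---|------|-------|-------|-------|---------------|-------|--------------|
| 0 | 2020 | 4 | Boom Beach | 4.197368 | 2996 | 2584 | 2020-4 |
| 1 | 2020 | 4 | Brawl Stars | 4.238264 | 9327 | 10459 | 2020-4 |
| 2 | 2020 | 4 | Clash Royale | 3.828147 | 14580 | 13785 | 2020-4 |
| 3 | 2020 | 4 | Clash of Clans | 4.445508 | 1702 | 3239 | 2020-4 |
| 4 | 2020 | 4 | Hay Day | 4.103312 | 10918 | 7821 | 2020-4 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 67 | 2021 | 5 | Brawl Stars | 3.917051 | 1226 | 651 | 2021-5 |
| 68 | 2021 | 5 | Clash Quest | 3.666667 | 5 | 6 | 2021-5 |
| 69 | 2021 | 5 | Clash Royale | 3.765957 | 944 | 423 | 2021-5 |
| 70 | 2021 | 5 | Clash of Clans | 3.997102 | 7225 | 3106 | 2021-5 |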


In [140]:
# Checking the value counts of each app after aggregating per month

df_month['appId'].value_counts()

Hay Day           14
Clash of Clans    14
Clash Royale      14
Brawl Stars       14
Boom Beach        14
Clash  Quest       2
Name: appId, dtype: int64

#### Clash Quest probably has only 2 entries because the app was launched very recently, so it does not yet have a full year of data.
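One way to check this explanation is to look at the first and last review date of each app. The sketch below assumes the review timestamp is in the `at` column, as in the notebook; the rows and dates are made up for illustration, and `first_review` and `last_review` are hypothetical column names.

```python
import pandas as pd

# Made-up rows; only the column names follow the notebook
reviews = pd.DataFrame({
    "appId": ["Hay Day", "Hay Day", "Clash Quest", "Clash Quest"],
    "at": ["2020-04-03", "2021-05-10", "2021-04-06", "2021-05-12"],
})
reviews["at"] = pd.to_datetime(reviews["at"])

# Earliest and latest review per app: a short span means few monthly rows
span = reviews.groupby("appId")["at"].agg(first_review="min", last_review="max")
print(span)
```

An app whose first review falls late in the observation window will naturally contribute only a few monthly rows.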

In [141]:
# If we check the data types we find that 'Monthly date' is an object (string) column, not a datetime
# This happened because the date was split into year and month columns, which were then joined back together as strings
# In this case we will keep it as is, since it is good enough for the visualisation

df_month.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 72 entries, 0 to 71
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   year           72 non-null     int64  
 1   month          72 non-null     int64  
 2   appId          72 non-null     object 
 3   score          72 non-null     float64
 4   thumbsUpCount  72 non-null     int64  
 5   count          72 non-null     int64  
 6   Monthly date   72 non-null     object 
dtypes: float64(1), int64(4), object(2)
memory usage: 4.5+ KB
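If a real date type were needed later, the year and month columns could be combined into a timestamp instead of a string. The sketch below is one possible approach on a toy frame, not what the notebook does; `month_start` is a hypothetical column name. It also shows a zero-padded label, which sorts correctly as plain text (unlike `2020-4` versus `2020-12`).

```python
import pandas as pd

# Toy frame with the same year/month columns as df_month
df_month = pd.DataFrame({"year": [2020, 2020, 2021], "month": [4, 12, 5]})

# Build a timestamp for the first day of each month from year/month/day columns
df_month["month_start"] = pd.to_datetime(df_month[["year", "month"]].assign(day=1))

# Zero-padded text label derived from the timestamp
df_month["Monthly date"] = df_month["month_start"].dt.strftime("%Y-%m")
print(df_month.dtypes)
```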


### Creating summary statistics to get an idea of the monthly data

In [142]:
# Here we create summary statistics to check the date range, so we know where the data starts and ends
df_month.describe(include = 'all')

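|   | year | month | appId | score | thumbsUpCount | count | Monthly date |
|---|------|-------|-------|-------|---------------|-------|--------------|
| count | 72.0 | 72.0 | 72 | 72.0 | 72.0 | 72.0 | 72 |
| unique |  |  | 6 |  |  |  | 14 |
| top |  |  | Hay Day |  |  |  | 2021-5 |
| freq |  |  | 14 |  |  |  | 6 |
| mean | 2020.375 | 6.166667 |  | 4.148183 | 28911.083333 | 18034.875 |  |
| std | 0.48752 | 3.267424 |  | 0.235614 | 36201.343745 | 24187.504233 |  |
| min | 2020.0 | 1.0 |  | 3.660916 | 5.0 | 6.0 |  |
| 25% | 2020.0 | 4.0 |  | 4.006792 | 3756.0 | 2437.25 |  |
| 50% | 2020.0 | 5.0 |  | 4.197695 | 15752.5 | 10842.5 |  |
| 75% | 2021.0 | 9.0 |  | 4.360618 | 33363.25 | 21569.5 |  |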


## Creating visualisation for the monthly data

In [143]:
# Here we import the library that we will be using for visualisation
import plotly.express as px



#### This is an interactive chart: hovering the cursor over a line displays the average score for the month, the date, the thumbs up count and the number of players who left a review. The thumbs up count reflects how much players engage with the posted reviews; a higher thumbs up count means that many people agree with a given review. The score is out of 5.

#### You can switch each app on and off depending on what you want to see. In this case I am comparing Supercell's apps between April 2020 and May 2021 to see which of them has the highest score, the most reviews and the most engagement with those reviews, which is expressed as the thumbs up count.

#### As expected, Clash of Clans leads the chart on every measure. Clash Royale comes last in terms of score, although its reviews have a very high count and engagement level. In the mid range, Boom Beach is quite stable, with its score fluctuating little around 4.3.

#### It is important to note that these are only the scores of players who left a review; this chart doesn't include the scores of players who didn't leave a review. The reviews will be used to run a sentiment analysis later in this project.

In [144]:
# Here we use the plotly syntax to plot a line chart of all apps
px.line(df_month, x='Monthly date', y='score', hover_data={'thumbsUpCount': True, 'count': True},
        color='appId', range_y=[1, 5.1],
        labels=dict(score='Average Score per month', thumbsUpCount='Thumbs Up Count',
                    appId='App Name'))

#### The second chart shows a histogram of the scores, weighted by the number of players leaving a score. It is clear from the histogram that most of the reviews are positive, so we can say that Supercell's apps did not score low during the Covid-19 period. It would be interesting to compare these results with the period before the pandemic to see whether there are any major differences. Unfortunately, because of the time and space needed to scrape the data, I did not do that for all the apps, but we can explore some of the apps for which I managed to get more data.

In [145]:
fig = px.histogram(df_month, x = 'score', y = 'count', nbins = 20)
fig.show()

In [146]:
# This code will create a histogram for each app together with its violin chart
fig = px.histogram(df_month, x="score", y = 'count', color="appId", marginal="violin", # can be `box`, `violin`
                         hover_data=df_month.columns, labels = dict(score ='Average Score per month',
                                                                    thumbsUpCount = 'Thumbs Up Count', 
                                                                    appId = 'App Name', count = 'Count per month'))
fig.show()


#### The chart above shows the histogram of each app. Looking at the violin chart, we can see that Boom Beach has had the most stable scores and Clash of Clans has the highest score per review. Clash Royale, on the other hand, is a bit on the low side compared to the other apps, although it still holds a good score. We clearly don't have enough data on Clash Quest: the app was released in April 2021, so we only have about a month and a half of data. That is why its scores are so widely spread, given the low volume of reviews and the differences in players' opinions. We will need more time to see whether Clash Quest will be a successful game.
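The stability read off the violin chart can also be checked numerically, for example with the standard deviation of the monthly scores per app. The sketch below uses placeholder app names and made-up scores, so it illustrates the calculation only and says nothing about the real results.

```python
import pandas as pd

# Made-up monthly scores for two placeholder apps
df_month = pd.DataFrame({
    "appId": ["A", "A", "A", "B", "B", "B"],
    "score": [4.3, 4.3, 4.4, 3.5, 4.5, 4.0],
})

# A smaller standard deviation means the monthly score moved less
stability = (
    df_month.groupby("appId")["score"]
    .agg(["mean", "std", "min", "max"])
    .sort_values("std")
)
print(stability)
```

Apps with only a couple of monthly rows will have an unreliable standard deviation, so the figure should be read together with the number of months available.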

#### When you look at the histograms, note that Hay Day has a much lower sum of count than Clash of Clans; it is displayed just above Clash of Clans only because the bars are stacked and it would not be visible otherwise. If you hover the cursor over the chart you can see the correct sum of count for each app, which I recommend for gaining more insight. You can also switch each app on and off in the chart depending on which one you want to look at.
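The per-app totals that the hover labels reveal can also be computed directly, which avoids misreading the stacked bars. This is a minimal sketch on made-up numbers that reuses the `appId` and `count` column names from `df_month`.

```python
import pandas as pd

# Made-up monthly review counts; only the column names follow the notebook
df_month = pd.DataFrame({
    "appId": ["Hay Day", "Hay Day", "Clash of Clans", "Clash of Clans"],
    "count": [100, 150, 900, 1100],
})

# Total number of reviews per app over the whole period, largest first
totals = df_month.groupby("appId")["count"].sum().sort_values(ascending=False)
print(totals)
```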