# Analysis

**Problem Statement:**

There are many apps on the App Store, some paid and some free. I want to explore a few questions:
 - What makes an app successful, and which metric captures that?
 - What other metrics are available?
 - Is there a correlation between this success metric and the other metrics?
 - What do the clusters look like if we try to cluster the apps?
 
This might sound vague at the moment, but essentially I want to analyse the apps in the App Store.

### Task 1: In this notebook we mainly preprocess the data to make it more convenient to use

In [None]:
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.set_style("whitegrid")
sns.set_palette("Set1", 8, .75)

In [None]:
df = pd.read_csv("../dataset/data.csv")

In [None]:
# list of all columns
df.columns

In [None]:
print ("No of rows in dataset: %d, No of columns: %d" %  (df.shape[0], df.shape[1]))

**Note that:** the `trackName` property is the name of the application and `bundleId` is its unique name; we can use both of these as unique properties.
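
Whether a column really is unique is easy to verify. Here is a minimal sketch on a made-up toy frame (the values are invented, not taken from the dataset); the same two calls could be run on `df['bundleId']`:

```python
import pandas as pd

# Toy stand-in for the real dataset; the values are made up for illustration.
toy = pd.DataFrame({
    "bundleId": ["com.example.a", "com.example.b", "com.example.b"],
    "trackName": ["App A", "App B", "App B"],
})

# True only if no bundleId value repeats.
bundle_is_unique = toy["bundleId"].is_unique

# Rows whose bundleId already appeared earlier in the frame.
repeated = toy[toy["bundleId"].duplicated()]

print(bundle_is_unique)
print(repeated)
```

If `repeated` comes back non-empty on the real data, the duplicates would need to be dropped or merged before using the column as a key.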


In [None]:
# let's define a new dataframe where we will store more processed information
data = pd.DataFrame()
data['bundleId'] = df['bundleId']
data['trackName'] = df['trackName']

## Pre processing

### (1) Kind

In [None]:
df.kind.value_counts()

**Summary**: all rows in the dataset are of the same kind, hence we will ignore this property.
### (2) features

In [None]:
df.features.value_counts(normalize=True) * 100

**Summary:** since there are only two possible outcomes, I'll convert this to a boolean column called `IsUniversal`.

In [None]:
def getIsUniversal(x):
    if x == '[]':
        return False
    return True

data['IsUniversal'] = df['features'].apply(lambda x: getIsUniversal(x))

data.head(10)

### (3) advisories

In [None]:
# df.advisories.value_counts()
# TODO: this is categorical data and one row can carry one or more advisories;
# think about how to encode it, since it could turn out to be important

advisories = {}
for advisoriesString in df.advisories.values:
    ads = json.loads(advisoriesString)
    for a in ads:
        if a not in advisories:
            advisories[a] = 0
        advisories[a] = advisories[a] + 1

In [None]:
adf = pd.DataFrame([[k,v] for k,v in advisories.items()], columns=['advisory', 'count'])
adf.head(10)

In [None]:
adf = adf.sort_values(["count"], axis=0, ascending=False)

In [None]:
plt.figure(figsize=(10, 5))
fx = sns.barplot(x='advisory', y='count', data=adf)
fx.set_xticklabels(rotation=90, labels=adf['advisory'])
plt.title("Distribution by advisory")

In [None]:
data["advisories"] = df.advisories
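
One possible way to handle the multi-valued advisories mentioned in the TODO is a multi-hot encoding: one 0/1 column per advisory label. This is only a sketch on invented label strings, assuming each cell is a JSON-encoded list as the counting loop above implies:

```python
import json
import pandas as pd

# Toy advisories column; each cell is a JSON-encoded list.
# The label strings are invented for illustration.
toy = pd.Series(['["Violence"]', '[]', '["Violence", "Horror"]'])

# Decode each JSON string into a Python list.
parsed = toy.apply(json.loads)

# explode() gives one row per label (NaN for empty lists), get_dummies()
# one-hot encodes those rows, and the max per original index folds them
# back into a single row per app.
multi_hot = (
    pd.get_dummies(parsed.explode())
    .astype(int)
    .groupby(level=0)
    .max()
)
print(multi_hot)
```

Apps with an empty advisory list end up as an all-zero row, so they stay in the frame rather than being dropped.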

### (4) trackCensoredName

This seems to be very similar to the name of the track itself; let's see the number of rows where these values differ.

In [None]:
print (df.loc[df['trackCensoredName'] != df['trackName']].shape)

df.loc[df['trackCensoredName'] != df['trackName']][['trackCensoredName', 'trackName']]

**Summary:** only four rows have a `trackCensoredName` that differs from `trackName`, so this feature is of no use for now.

### (5) fileSizeBytes
This seems to be an interesting property; let's take a look.

In [None]:
meanSize = df.fileSizeBytes.mean()
medianSize = df.fileSizeBytes.median()


print ("Average file size in MB: %0.2f " % (meanSize / (1024 * 1024)))
print ("Median file size in MB: %0.2f " % (medianSize / (1024 * 1024)))

df.fileSizeBytes.plot(figsize=(10,6))

In [None]:
data['fileSizeInMB'] = df['fileSizeBytes'].apply(lambda x: x / (1024 * 1024))
data.head(10)
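
If the mean and median printed above turn out to be far apart, the sizes are skewed and a log scale may be easier to read. A small sketch on made-up sizes (not values from the dataset), which also shows the conversion to MB done without `apply`:

```python
import numpy as np
import pandas as pd

BYTES_PER_MB = 1024 * 1024

# Toy sizes in bytes (made up): three small apps and one very large one.
sizes = pd.Series([10, 20, 30, 2000]) * BYTES_PER_MB

# Vectorized conversion; equivalent to applying x / (1024 * 1024) per row.
size_mb = sizes / BYTES_PER_MB

# A mean far above the median signals a long right tail.
print("mean MB:", size_mb.mean(), "median MB:", size_mb.median())

# log10 compresses the tail so a histogram is easier to read.
log_size = np.log10(size_mb)
```

On the real frame, `log_size.hist()` would be one way to look at the distribution instead of plotting the raw series against the row index.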

### (6) contentAdvisoryRating

In [None]:
print (df.contentAdvisoryRating.value_counts())
print ()
print ("Normalized distribution %")
print (df.contentAdvisoryRating.value_counts(normalize =True) * 100)

_ct = pd.crosstab(df.contentAdvisoryRating, "count")
_ct.plot(kind="pie", subplots=True)

_ct.plot(kind="bar")

In [None]:
data['contentAdvisoryRating'] = df['contentAdvisoryRating']

**Summary**: while nearly `75%` of the apps are rated `4+`, the other ratings seem to have a meaningful distribution as well.
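
Since these ratings are ordered, they could later be turned into a number. A minimal sketch, assuming every label has the form `<number>+`; only `4+` is confirmed by the text above, the other labels and the helper name `rating_to_min_age` are illustrative:

```python
def rating_to_min_age(rating):
    """Turn a rating label such as '12+' into the integer 12."""
    return int(rating.rstrip("+"))

# '4+' appears in the data above; the other labels are assumed for illustration.
labels = ["4+", "9+", "12+", "17+"]
min_ages = [rating_to_min_age(r) for r in labels]
print(min_ages)
```

On the dataframe this would be `data['contentAdvisoryRating'].apply(rating_to_min_age)`, provided no label breaks the assumed pattern.
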
### (7) genreIds

This feature intuitively seems to be highly correlated with the `genres` property, so we will look at that instead.

### (8) currentVersionReleaseDate
Release date of the current version; it seems an interesting property that doesn't need much preprocessing, so we will keep it as is.
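
If a real datetime is needed later, the string can be parsed on demand. A sketch, assuming this column uses the same `%Y-%m-%dT%H:%M:%SZ` pattern that this notebook uses for `releaseDate` (not verified here); `parse_release_date` is a hypothetical helper name and the timestamp is made up:

```python
from datetime import datetime

# Assumed format: the same pattern this notebook uses for releaseDate.
DATE_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

def parse_release_date(date_string):
    """Parse a timestamp string into a naive datetime (read as UTC)."""
    return datetime.strptime(date_string, DATE_FORMAT)

# Made-up example value.
parsed = parse_release_date("2017-03-15T08:30:00Z")
print(parsed.year, parsed.month, parsed.day)
```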

In [None]:
data['currentVersionReleaseDate'] = df.currentVersionReleaseDate

### (9) currency

In [None]:
df.currency.value_counts()

All values are USD, hence we will ignore this property.
### (10) wrapperType

In [None]:
df.wrapperType.value_counts()

All values are `software`, hence we will ignore this property.
### (11) version

In [None]:
df.version.value_counts().head(20)

**Summary**
 - This seems to be a free-form string property which can have any value;
 we can maybe extract features like the major version and the sub-version.

In [None]:
def getMajorVersion(ver):
    return ver.split('.')[0]

def getMajorSubVersion(ver):
    splt = ver.split('.')
    if len(splt) >= 2:
        return ".".join(splt[0:2])
    return splt[0] +".0"

data['version'] = df['version'].apply(lambda x: getMajorVersion(x))
data['subversion'] = df['version'].apply(lambda x: getMajorSubVersion(x))

data[['bundleId', 'version', 'subversion']].head(10)

### (12) artistName

In [None]:
print ("artists with max applications")
df.artistName.value_counts().head(20)

In [None]:
data['artist'] = df['artistName']

### (13) artistId:
This is likely highly correlated with `artistName`, hence we will ignore it.

### (14) genres
This is one of the most important properties; let's do some analysis here.

In [None]:
import json
genres = {}
for genreString in df.genres.values:
    gs = json.loads(genreString)
    for g in gs:
        if g not in genres:
            genres[g] = 0
        genres[g] = genres[g] + 1
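
The same tally can be written more compactly with `collections.Counter` from the standard library. A sketch on made-up genre strings, assuming each cell is a JSON-encoded list as above:

```python
import json
from collections import Counter

# Toy genres column (made-up values); each entry is a JSON-encoded list.
genre_strings = ['["Games", "Action"]', '["Games"]', '["Education"]']

# Counter tallies every genre across all rows in one pass.
genre_counts = Counter(
    genre for s in genre_strings for genre in json.loads(s)
)

# most_common() returns (genre, count) pairs sorted by descending count.
print(genre_counts.most_common(2))
```

`pd.DataFrame(genre_counts.most_common(), columns=['genre', 'count'])` would then give an already sorted frame.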

In [None]:
gdf = pd.DataFrame([[k,v] for k,v in genres.items()], columns=['genre', 'count'])
gdf.head(10)

In [None]:
gs = gdf.sort_values(["count"], axis=0, ascending=False)

In [None]:
plt.figure(figsize=(10, 5))
fx = sns.barplot(x='genre', y='count', data=gs.head(15))
fx.set_xticklabels(rotation=90, labels=gs['genre'].head(15))
plt.title("Distribution by genres")

In [None]:
data['genres'] = df['genres']
data.head(10)

### (15) Price
One of the most important fields.

In [None]:
percentageFree = df[df.price == 0].shape[0] / df.shape[0] * 100

print ("%% Free: %0.3f %%" % percentageFree)
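
The same percentage can also be read off a boolean mask, which is handy for splitting free and paid apps afterwards. A sketch on made-up prices (not values from the dataset):

```python
import pandas as pd

# Toy prices in USD (made up): three free apps and two paid ones.
prices = pd.Series([0.0, 0.0, 0.0, 0.99, 4.99])

is_free = prices == 0

# The mean of a boolean Series is the fraction of True values.
percentage_free = is_free.mean() * 100

# Median price among paid apps only.
median_paid = prices[~is_free].median()

print("%% Free: %0.3f %%" % percentage_free)
```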

In [None]:
data['price'] = df['price']

In [None]:
data.head(3)

### (16) Description
This is a plain-text field; we will just copy it to the new dataframe and use it for text features later.

In [None]:
data['description'] = df.description

### (17) isVppDeviceBasedLicensingEnabled
I don't fully know what this feature means (TODO).
99.2% of the samples have a true value and the rest are false, but we still store it in `data`.

In [None]:
df.isVppDeviceBasedLicensingEnabled.value_counts(normalize=True) * 100

In [None]:
data['isVppDeviceBasedLicensingEnabled'] = df.isVppDeviceBasedLicensingEnabled

### (17) primaryGenreName
This is an interesting feature

In [None]:
df.primaryGenreName.value_counts(normalize=True) * 100

In [None]:
_ct = pd.crosstab(df.primaryGenreName, "count")
_ct.plot(kind='bar', figsize=(10, 5))

In [None]:
data['primaryGenreName'] = df.primaryGenreName

### (17) releaseDate
Release date of the app; this can be used to derive how old the app is.

In [None]:
# data['releaseDate'] = df.releaseDate
from datetime import datetime
def getAgeInDays(datestring):
    return (datetime.now() - datetime.strptime(datestring, "%Y-%m-%dT%H:%M:%SZ")).days

data['releaseDate'] = df.releaseDate
data['ageInDays'] = df.releaseDate.apply(lambda x: getAgeInDays(x))
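
Because `datetime.now()` changes on every run, `ageInDays` will differ each time the notebook is executed. One option is to measure age against a fixed reference date instead. A sketch; `age_in_days` and `snapshot` are hypothetical names and the dates are made up:

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

def age_in_days(date_string, reference):
    """Whole days between the release timestamp and a fixed reference date."""
    released = datetime.strptime(date_string, DATE_FORMAT)
    return (reference - released).days

# Hypothetical snapshot date; in practice, the day the data was collected.
snapshot = datetime(2018, 1, 1)

print(age_in_days("2017-12-25T00:00:00Z", snapshot))
```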

In [None]:
data['ageInDays'].hist()

### (18) minimumOsVersion
We will keep just the major version number for simplicity.

In [None]:
def getMajorOSVersion(osver):
    return osver.split(".")[0]
data['minimumOsVersion'] = df.minimumOsVersion.apply(lambda x: getMajorOSVersion(x))

In [None]:
# % distribution
data['minimumOsVersion'].value_counts(normalize=True) * 100

### (19) formattedPrice
Since we have price we will ignore this

### (20) primaryGenreId
Since we have string of primary Genre we will ignore this

### (21, 22) averageUserRating & avgUserRatingCV
This is another very important feature.

In [None]:
# Actual
df.averageUserRating.value_counts(normalize=True) * 100

In [None]:
# current version
df.avgUserRatingCV.value_counts(normalize=True) * 100

In [None]:
_ct = pd.crosstab(df.averageUserRating, "Count")
_ct.plot(kind='bar', figsize=(15, 7))
plt.title("Plot for average user rating")

_ct = pd.crosstab(df.avgUserRatingCV, "Count")
_ct.plot(kind='bar', figsize=(15, 7))
plt.title("Plot for average user rating for current version")

In [None]:
data['averageUserRating'] = df.averageUserRating
data['avgUserRatingCV'] = df.avgUserRatingCV

**Summary**: most apps have no user rating at all.
### (23, 24) userRatingCount & userRatingCountCV

In [None]:
data['userRatingCount'] = df.userRatingCount.replace('null', 0)
data.userRatingCount = data.userRatingCount.apply(lambda x: int(x))
data.userRatingCount.hist(figsize=(10, 5))
plt.title("Histogram for user rating count")

plt.figure()
data['userRatingCountCV'] = df.userRatingCountCV.replace('null', 0)
data.userRatingCountCV = data.userRatingCountCV.apply(lambda x: int(x))
data.userRatingCountCV.hist(figsize=(10, 5))
plt.title("Histogram for user rating count for current version")
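
A slightly more forgiving conversion uses `pd.to_numeric`, which also copes with other unparseable values and not only the literal `'null'`. A sketch on made-up values; the `log1p` step is an optional suggestion for heavy-tailed counts:

```python
import numpy as np
import pandas as pd

# Toy column mixing numeric strings with the literal 'null' (made-up values).
raw = pd.Series(["12", "null", "3400", "7"])

# errors="coerce" turns anything unparseable into NaN instead of raising.
counts = pd.to_numeric(raw, errors="coerce").fillna(0).astype(int)

# log1p (log(1 + x)) keeps zero counts defined while compressing large values.
log_counts = np.log1p(counts)

print(counts.tolist())
```

Note that filling with 0 treats "no data" the same as "zero ratings", which is what the cell above does as well.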

### (25) sellerUrl
 - Not all sellers have a URL, so one interesting property would be whether a website exists in the first place.

In [None]:
data['hasUrl'] = df.sellerUrl.apply(lambda x: x != 'null')

print ("% apps with and without seller url")
print (data.hasUrl.value_counts(normalize=True) * 100)

data['hasUrl'].head(10)

In [None]:
def getDomain(url):
    if url == 'null':
        return 'null'
    
    _splt = url.split('/')
    if len(_splt) >= 3:
        return _splt[2]
    return 'null'

data['sellerUrl'] = df.sellerUrl
data['sellerUrlDomain'] = df.sellerUrl.apply(lambda x: getDomain(x))
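
As an alternative to splitting on `/`, the standard library's `urllib.parse.urlparse` can pull out the host. A sketch with a hypothetical helper `get_domain` and made-up URLs:

```python
from urllib.parse import urlparse

def get_domain(url):
    """Return the host part of a URL, or 'null' when it cannot be determined."""
    if url == "null":
        return "null"
    # netloc is empty when the string has no scheme (e.g. 'example.com/path').
    return urlparse(url).netloc or "null"

print(get_domain("https://www.example.com/apps/list"))
print(get_domain("null"))
```

`netloc` keeps any port or user info that is part of the URL, so it may need further trimming if only the bare domain is wanted.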

In [None]:
data.head(10)

### (26) releaseNotes

In [None]:
data['releaseNotes'] = df.releaseNotes

## Write new data frame to disk

In [None]:
data.to_csv("../dataset/processed.data.csv", encoding='utf-8', index=False)

### Let's have a look at the summary so far for paid apps