# Analysing data about Playstore and AppStore applications

This project presents a data analysis, and its findings, for various _Android_ and _iOS_ mobile apps.

The goal of this project is to analyze data to __help developers understand what type of apps are likely to attract more users__.

Below, we `import` the `csv` module, open the 2 datasets that we need and transform both of them into a `list of lists`.


In [1]:
import csv
appstore_data = open("AppleStore.csv", encoding="utf-8")
playstore_data = open("googleplaystore.csv", encoding="utf-8")

In [2]:
ios_read = csv.reader(appstore_data)
ios = list(ios_read)

In [3]:
android_read = csv.reader(playstore_data)
android = list(android_read)
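
The loading cells above leave both file handles open. A minimal sketch of the same step with a `with` block, which closes the file once the rows are read, could look like the following. The helper name `load_csv` and the small in-memory sample are assumptions made for illustration; they are not part of the original notebook.

```python
import csv
import io

def load_csv(file_obj):
    """Read an open CSV file object into a list of lists (header row first)."""
    return list(csv.reader(file_obj))

# Hypothetical two-line sample standing in for AppleStore.csv.
sample = io.StringIO("id,track_name,price\n284882215,Facebook,0.0\n")
with sample as f:
    rows = load_csv(f)

print(rows[0])   # header row
print(rows[1:])  # data rows
```

With the real files, the same helper would be called inside `with open("AppleStore.csv", encoding="utf-8", newline="") as f:`.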

Then, we define a `function` called __explore_data__ having 4 parameters.

This `function` can be used to `print` a chosen slice of rows from any dataset and, optionally, its total number of rows and columns.

We also test this `function` on our 2 datasets.

In [4]:
def explore_data(dataset, start, end, rows_and_columns = False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print("\n")
        
    if rows_and_columns:
        print("Number of rows: ", len(dataset))
        print("Number of columns: ", len(dataset[0]))

In [5]:
explore_data(ios, 0, 3, rows_and_columns = True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows:  7198
Number of columns:  16


In [6]:
explore_data(android, 0, 3, rows_and_columns = True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows:  10842
Number of columns:  13


Now, we check if our data has any _empty rows_ or _short rows_ and we keep only the rows that are neither in our new lists __androidC__ and __iosC__.

In [7]:
androidC = []

for app in android:
    if len(app) > 0 and len(app) == 13:
        androidC.append(app)
    elif len(app) > 0 and len(app) < 13:
        print("short row")
    else:
        print("empty row")

short row


In [8]:
iosC = []

for app in ios:
    if len(app) > 0 and len(app) == 16:
        iosC.append(app)
    elif len(app) > 0 and len(app) < 16:
        print("short row")
    else:
        print("empty row")
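
The two cells above differ only in the expected column count. One possible way to share the check is sketched below: it compares every row against the length of the header row. The function name `split_rows_by_length` and the toy rows are hypothetical.

```python
def split_rows_by_length(dataset):
    """Return (good_rows, bad_rows), where bad_rows holds (index, row) pairs
    whose length differs from the header row's length."""
    expected = len(dataset[0])
    good, bad = [], []
    for index, row in enumerate(dataset):
        if len(row) == expected:
            good.append(row)
        else:
            bad.append((index, row))
    return good, bad

toy = [
    ["App", "Category", "Rating"],
    ["Some App", "TOOLS", "4.1"],
    ["Broken App", "1.9"],   # short row: one column missing
    [],                      # empty row
]
good, bad = split_rows_by_length(toy)
print(len(good), [index for index, _ in bad])
```

Returning the row indexes, rather than only printing a message, makes it easier to inspect the offending rows afterwards.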

After that, we check if our datasets contain any __duplicate__ rows. First, we check how many instances of Instagram are in the `name column` of both datasets and we see _four entries_ in the __android__ data but only one in the __ios__ data.

Then, we count the duplicate names in both datasets and find that only __2 rows__ are duplicates in __ios__ while __1181 rows__ are duplicates in the __android__ dataset.

In [9]:
for app in androidC:
    name = app[0]
    if name == "Instagram":
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [10]:
for app in iosC:
    name = app[1]
    if name == "Instagram":
        print(app)

['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Below, we add each `unique` app name from both datasets to _oai (original_app_ios)_ and _oaa (original_app_android)_ respectively, and each repeated name to _dai_ and _daa_.

In [11]:
dai = []
oai = []


for i in iosC:
    name = i[1]
    if name in oai:
        dai.append(name)
    else:
        oai.append(name)
        
print(len(dai))
print(len(oai))

2
7196


In [12]:
daa = []
oaa = []


for i in androidC:
    name = i[0]
    if name in oaa:
        daa.append(name)
    else:
        oaa.append(name)
        
print(len(daa))
print(len(oaa))

1181
9660
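
Membership tests such as `name in oai` scan a list, so the cells above slow down as the lists grow. A possible alternative, sketched here on toy data, counts names with `collections.Counter`. The helper `count_duplicates` is hypothetical, and the rows are a toy sample rather than the real dataset.

```python
from collections import Counter

def count_duplicates(rows, name_index):
    """Return (number of unique names, number of surplus duplicate rows)."""
    counts = Counter(row[name_index] for row in rows)
    unique = len(counts)
    surplus = sum(n - 1 for n in counts.values())
    return unique, surplus

toy = [
    ["Instagram", "SOCIAL", "66577313"],
    ["Instagram", "SOCIAL", "66577446"],
    ["Instagram", "SOCIAL", "66509917"],
    ["Toy App", "TOOLS", "10"],
]
unique, surplus = count_duplicates(toy, 0)
print(unique, surplus)
```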


Below, we create a dictionary `reviews_max` which maps each app name to its number of `reviews` and keeps the greatest `value` when a name is duplicated.

In [13]:
reviews_max = {}

for i in androidC[1:]:
    name = i[0]
    n_reviews = float(i[3])
    if (name in reviews_max) and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif (name not in reviews_max):
        reviews_max[name] = n_reviews
        
print(len(reviews_max))

9659


Below, we create 2 empty `lists`.

Then, we `loop` through the __android dataset__ (without headers) and assign the _name_ and _number of reviews_ to the `name` and `n_reviews` variables respectively.

Then, if `n_reviews` is equal to the __reviews from the dictionary__ (that we created previously), and the `name` is not yet in one of our empty lists (`already_added`), we `append` the __entire row__ to `android_clean` and the `name` to `already_added`.

Finally, we `print` the `length` of the `android_clean` list and the `already_added` list.

In [15]:
android_clean = []
already_added = []

for i in androidC[1:]:
    name = i[0]
    n_reviews = float(i[3])
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(i)
        already_added.append(name)
        
# for i in android_clean:
#     print(i)
    
print(len(android_clean))
print(len(already_added))

9659
9659
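
The two-pass approach above (build `reviews_max`, then filter) can also be collapsed into one pass that stores the best row per name. This is only a sketch, under the assumption that the name sits at index 0 and the review count at index 3 as in the Android data; `keep_most_reviewed` is a hypothetical helper and the rows are a toy sample.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # Strict '>' keeps the earliest row when review counts tie.
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

toy = [
    ["Instagram", "SOCIAL", "4.5", "66577313"],
    ["Instagram", "SOCIAL", "4.5", "66577446"],
    ["Instagram", "SOCIAL", "4.5", "66509917"],
    ["Toy App", "TOOLS", "4.0", "10"],
]
clean = keep_most_reviewed(toy)
print(clean)
```

One difference from the cells above: the result is ordered by the first appearance of each name, not by the position of the row that was kept.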


In [16]:
def english_or_not(string):
    not_ascii = 0
    for i in string:
        if (ord(i) > 127) or (ord(i) < 0):
            not_ascii += 1     
    if not_ascii > 3:
        return False
    else:
        return True

In [17]:
print(english_or_not('Instagram'))
print(english_or_not('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_or_not('Docs To Go™ Free Office Suite'))
print(english_or_not('Instachat 😜'))

True
False
True
True


In [18]:
english_apps = []

for i in android_clean:
    name = i[0]
    if (english_or_not(name)):
        english_apps.append(i)

In [19]:
print(len(english_apps))
print(english_apps[1])

9614
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


In [20]:
free_apps = []

for i in english_apps:
    if_free = i[6]
    price = i[7]
    if (if_free == "Free") and (price == "0"):
        free_apps.append(i)
        
print(len(free_apps))   


8863


In [21]:
print(len(iosC))

english_apps_ios = []

for i in iosC:
    name = i[1]
    if (english_or_not(name)):
        english_apps_ios.append(i)
        
print(len(english_apps_ios))

7198
6184


In [22]:
free_apps_ios = []

for i in english_apps_ios:
    price = i[4]
    
    if price == "0.0":
        free_apps_ios.append(i)

print(len(free_apps_ios))
print(free_apps_ios[1])

3222
['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


We completed cleaning the data above. In `free_apps_ios`, we have __3222__ apps and in `free_apps`, we have __8863__ apps.

# Analysis

Here, we need to make sure that the app profile fits both App Store and Google Play. This is because our strategy goes like this:

- We create the minimal version of an app
- We add it to Google Play Store
- We wait 6 months
- If the app is successful, we build and add it to the App Store

We need to find apps successful in both markets because that would help us raise our profits.

In [23]:
free_apps_android = free_apps

print(free_apps_android[0])
print(free_apps_ios[0])

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


In [24]:
android_freq = {}
ios_freq = {}

for i in free_apps_android:
    genre = i[1]
    if genre in android_freq:
        android_freq[genre] += 1
    else:
        android_freq[genre] = 1
        

for i in free_apps_ios:
    genre = i[-5]
    if genre in ios_freq:
        ios_freq[genre] += 1
    else:
        ios_freq[genre] = 1
    
# for i,j in android_freq.items():
#     print(i + " : " + str(j))
# for i,j in ios_freq.items():
#     print(i + " : " + str(j))

print(next(iter(android_freq.items())))
print(next(iter(ios_freq.items())))

('ART_AND_DESIGN', 57)
('Social Networking', 106)


In [25]:
max_val_and = 0
max_key_and = ''
for i,j in android_freq.items():
    if(max_val_and < j):
        max_val_and = j
        max_key_and = i
    
print(max_key_and + " : " + str(max_val_and))

max_val_ios = 0
max_key_ios = ''
for i,j in ios_freq.items():
    if(max_val_ios < j):
        max_val_ios = j
        max_key_ios = i
    
print(max_key_ios + " : " + str(max_val_ios))

FAMILY : 1675
Games : 1874


In [26]:
def freq_table(dataset, index):
    # Count how often each value appears in the given column.
    counts = {}
    for row in dataset:
        value = row[index]
        counts[value] = counts.get(value, 0) + 1

    total_values = sum(counts.values())
    print(total_values)

    # Map each value to a (count, percentage) tuple.
    new_dict = {}
    for key, count in counts.items():
        new_dict[key] = (count, (count / total_values) * 100)

    return new_dict

In [27]:
new = freq_table(free_apps_android, 1)

8863


In [28]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        

In [29]:
display_table(free_apps_ios, -5)

3222
Games : (1874, 58.16263190564867)
Entertainment : (254, 7.883302296710118)
Photo & Video : (160, 4.9658597144630665)
Education : (118, 3.662321539416512)
Social Networking : (106, 3.2898820608317814)
Shopping : (84, 2.60707635009311)
Utilities : (81, 2.5139664804469275)
Sports : (69, 2.1415270018621975)
Music : (66, 2.0484171322160147)
Health & Fitness : (65, 2.0173805090006205)
Productivity : (56, 1.7380509000620732)
Lifestyle : (51, 1.5828677839851024)
News : (43, 1.3345747982619491)
Travel : (40, 1.2414649286157666)
Finance : (36, 1.1173184357541899)
Weather : (28, 0.8690254500310366)
Food & Drink : (26, 0.8069522036002483)
Reference : (18, 0.5586592178770949)
Business : (17, 0.5276225946617008)
Book : (14, 0.4345127250155183)
Navigation : (6, 0.186219739292365)
Medical : (6, 0.186219739292365)
Catalogs : (4, 0.12414649286157665)


Above, we see that `Games` is the most common genre with over __1800 apps__ while `Catalogs` is the least common genre with just __4 apps__. `Games` can be seen as an `outlier` here as __58%__ of all free English apps are `games`. We also see that there are more apps designed for `entertainment purposes`: after `Games`, the second highest is `Entertainment` with __7.88%__, followed by `Photo & Video` in third place with __4.97%__.

Apps designed for practical purposes lie mostly in the middle.
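
The counts and percentages discussed above can also be produced with `collections.Counter` from the standard library. The sketch below is not the notebook's `freq_table`; `genre_percentages` and the toy rows are hypothetical names used only to illustrate the idea.

```python
from collections import Counter

def genre_percentages(rows, index):
    """Return (genre, count, percent) tuples sorted by count, highest first."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return [(genre, n, n / total * 100) for genre, n in counts.most_common()]

toy = [["a", "Games"], ["b", "Games"], ["c", "Games"], ["d", "Education"]]
table = genre_percentages(toy, 1)
for genre, n, pct in table:
    print(f"{genre} : {n} ({pct:.2f}%)")
```

`Counter.most_common()` already returns the entries sorted by count, so no separate sorting step is needed.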

In [32]:
display_table(free_apps_android, 1)

8863
FAMILY : (1675, 18.898792733837304)
GAME : (862, 9.725826469592688)
TOOLS : (750, 8.462146000225657)
BUSINESS : (407, 4.592124562789123)
LIFESTYLE : (346, 3.9038700214374367)
PRODUCTIVITY : (345, 3.8925871601038025)
FINANCE : (328, 3.7007785174320205)
MEDICAL : (313, 3.5315355974275078)
SPORTS : (301, 3.396141261423897)
PERSONALIZATION : (294, 3.317161232088458)
COMMUNICATION : (287, 3.2381812027530184)
HEALTH_AND_FITNESS : (273, 3.0802211440821394)
PHOTOGRAPHY : (261, 2.944826808078529)
NEWS_AND_MAGAZINES : (248, 2.798149610741284)
SOCIAL : (236, 2.6627552747376737)
TRAVEL_AND_LOCAL : (207, 2.335552296062281)
SHOPPING : (199, 2.245289405393208)
BOOKS_AND_REFERENCE : (190, 2.1437436533904997)
DATING : (165, 1.8616721200496444)
VIDEO_PLAYERS : (159, 1.7939749520478394)
MAPS_AND_NAVIGATION : (124, 1.399074805370642)
FOOD_AND_DRINK : (110, 1.241114746699763)
EDUCATION : (103, 1.1621347173643235)
ENTERTAINMENT : (85, 0.9590432133589079)
LIBRARIES_AND_DEMO : (83, 0.9364774906916393)
AU

Above, we can see that in Android's case, FAMILY tops the chart with 1675 apps, which is 18.90% of the free English apps in the Play Store data. In second place, we have GAME at 9.73%, roughly half of the former. We can see COMICS and BEAUTY at the end with 0.62% and 0.59% respectively.

FAMILY can be taken as an outlier here, and most categories lie at around 1%, 2% or 3%.

In [33]:
display_table(free_apps_android, -4)

8863
Tools : (749, 8.450863138892023)
Entertainment : (538, 6.070179397495204)
Education : (474, 5.348076272142616)
Business : (407, 4.592124562789123)
Productivity : (345, 3.8925871601038025)
Lifestyle : (345, 3.8925871601038025)
Finance : (328, 3.7007785174320205)
Medical : (313, 3.5315355974275078)
Sports : (307, 3.463838429425702)
Personalization : (294, 3.317161232088458)
Communication : (287, 3.2381812027530184)
Action : (275, 3.102786866749408)
Health & Fitness : (273, 3.0802211440821394)
Photography : (261, 2.944826808078529)
News & Magazines : (248, 2.798149610741284)
Social : (236, 2.6627552747376737)
Travel & Local : (206, 2.324269434728647)
Shopping : (199, 2.245289405393208)
Books & Reference : (190, 2.1437436533904997)
Simulation : (181, 2.042197901387792)
Dating : (165, 1.8616721200496444)
Arcade : (164, 1.8503892587160102)
Video Players & Editors : (157, 1.771409229380571)
Casual : (156, 1.7601263680469368)
Maps & Navigation : (124, 1.399074805370642)
Food & Drink : (11

The `Genres` column of the Google Play Store dataset, seen above, shows us that Tools has the highest share at 8.45%, while Adventure;Education has the lowest with 1 app (0.01%).

When comparing the two datasets above, entertainment applications appear extremely popular in both markets, followed by lifestyle and educational apps, and finally medical, beauty and similar apps.

After analyzing all the patterns above, it is quite clear that entertainment as a genre (whether it is games, photography or social media) would be a crowded place to start.

In [38]:
prime_genre_freq = freq_table(free_apps_ios, -5)
print(prime_genre_freq)

3222
{'Social Networking': (106, 3.2898820608317814), 'Photo & Video': (160, 4.9658597144630665), 'Games': (1874, 58.16263190564867), 'Music': (66, 2.0484171322160147), 'Reference': (18, 0.5586592178770949), 'Health & Fitness': (65, 2.0173805090006205), 'Weather': (28, 0.8690254500310366), 'Utilities': (81, 2.5139664804469275), 'Travel': (40, 1.2414649286157666), 'Shopping': (84, 2.60707635009311), 'News': (43, 1.3345747982619491), 'Navigation': (6, 0.186219739292365), 'Lifestyle': (51, 1.5828677839851024), 'Entertainment': (254, 7.883302296710118), 'Food & Drink': (26, 0.8069522036002483), 'Sports': (69, 2.1415270018621975), 'Book': (14, 0.4345127250155183), 'Finance': (36, 1.1173184357541899), 'Education': (118, 3.662321539416512), 'Productivity': (56, 1.7380509000620732), 'Business': (17, 0.5276225946617008), 'Catalogs': (4, 0.12414649286157665), 'Medical': (6, 0.186219739292365)}


In [56]:
empt_dict = {}
for i in prime_genre_freq:
    total = 0
    len_genre = 0
    for j in iosC:
        genre_app = j[-5]
        if i == genre_app:
            usr_Rat = float(j[5])
            total += usr_Rat
            len_genre += 1
    user_Ratings = total / len_genre
    empt_dict[i] = user_Ratings
#     print(empt_dict)

sorted_dict = sorted(empt_dict.items(), key=lambda x: x[1], reverse = True)

for genre, avgRating in sorted_dict:
    print(f"{genre}: {avgRating}")

Social Networking: 45498.89820359281
Music: 28842.021739130436
Reference: 22410.84375
Weather: 22181.027777777777
Shopping: 18615.32786885246
Photo & Video: 14352.280802292264
Travel: 14129.444444444445
Sports: 14026.929824561403
Food & Drink: 13938.619047619048
Games: 13691.996633868463
News: 13015.066666666668
Navigation: 11853.95652173913
Finance: 11047.653846153846
Health & Fitness: 9913.172222222222
Productivity: 8051.3258426966295
Entertainment: 7533.678504672897
Utilities: 6863.822580645161
Lifestyle: 6161.763888888889
Book: 5125.4375
Business: 4788.087719298245
Education: 2239.2295805739514
Catalogs: 1732.5
Medical: 592.7826086956521


According to the results obtained above, Social Networking is the most popular genre, with the highest average number of ratings per app (about 45,499), followed by Music, which was expected. However, Reference in third place was a surprise. At the end, we can see Education, Catalogs and Medical, with Medical having the lowest average at only 592.78. Note that these averages are computed over every row in `iosC`, not only over `free_apps_ios`.

If we come up with a Social Networking or a Music app, it is more likely to be popular among users.
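
The nested loop used for these averages rescans the whole dataset once per genre. A single-pass sketch with running sums is given below. The helper `average_by_key` and the toy rows are hypothetical, and the column positions are passed in as arguments rather than taken from the real dataset.

```python
from collections import defaultdict

def average_by_key(rows, key_index, value_index):
    """Return {key: mean of float(row[value_index])} computed in one pass."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[key_index]] += float(row[value_index])
        counts[row[key_index]] += 1
    return {key: totals[key] / counts[key] for key in totals}

toy = [
    ["Toy Chat", "100", "Social Networking"],
    ["Toy Feed", "300", "Social Networking"],
    ["Toy Tunes", "50", "Music"],
]
averages = average_by_key(toy, key_index=2, value_index=1)
for genre, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(f"{genre}: {avg}")
```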

In [58]:
unique_genres = freq_table(free_apps_android, 1)

8863


In [68]:
category_dict = {}
for category in unique_genres:
    total = 0
    len_category = 0
    for j in androidC:
        category_app = j[1]
        if category_app == category:
            no_of_installs = j[5]
            n_installs = no_of_installs.replace('+','')
            ntwo_installs = n_installs.replace(',','')
            no_o_i = float(ntwo_installs)
            total += no_o_i
            len_category += 1
            
    avg_no_of_installs = total / len_category
    category_dict[category] = avg_no_of_installs
   
# print(category_dict)

sorted_categories = sorted(category_dict.items(), key = lambda x: x[1], reverse = True)

for genre, installs in sorted_categories:
    print(f"{genre} : {installs}")

COMMUNICATION : 84359886.95348836
SOCIAL : 47694467.46440678
VIDEO_PLAYERS : 35554301.25714286
PRODUCTIVITY : 33434177.75707547
GAME : 30669601.761363637
PHOTOGRAPHY : 30114172.10447761
TRAVEL_AND_LOCAL : 26623593.58914729
NEWS_AND_MAGAZINES : 26488755.335689045
ENTERTAINMENT : 19256107.382550336
TOOLS : 13585731.809015421
SHOPPING : 12491726.096153846
BOOKS_AND_REFERENCE : 8318050.112554112
PERSONALIZATION : 5932384.647959184
EDUCATION : 5586230.769230769
MAPS_AND_NAVIGATION : 5286729.124087592
FAMILY : 5201959.181034483
WEATHER : 5196347.804878049
HEALTH_AND_FITNESS : 4642441.3841642225
SPORTS : 4560350.255208333
FINANCE : 2395215.120218579
BUSINESS : 2178075.7934782607
FOOD_AND_DRINK : 2156683.0787401577
HOUSE_AND_HOME : 1917187.0568181819
ART_AND_DESIGN : 1912893.8461538462
LIFESTYLE : 1407443.8193717278
DATING : 1129533.3632478632
COMICS : 934769.1666666666
LIBRARIES_AND_DEMO : 741128.3529411765
AUTO_AND_VEHICLES : 625061.305882353
PARENTING : 525351.8333333334
BEAUTY : 513151.886

Above, in the case of _Android apps_ on the _Google Play Store_, we see that `COMMUNICATION` is the most installed category with __84359886.95348836__ average installs, followed by `SOCIAL` with __47694467.46440678__ average installs. `EVENTS` and `MEDICAL` are the least installed categories, with __249580.640625__ and __115026.86177105832__ average installs respectively.
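
The install figures above come from strings such as `'10,000+'`, which have to be cleaned before they can be averaged. A small sketch of that conversion, pulled out into a hypothetical helper named `parse_installs`, is shown below.

```python
def parse_installs(text):
    """Convert an install bucket such as '10,000+' to the integer 10000."""
    return int(text.replace(",", "").replace("+", ""))

small = parse_installs("10,000+")
large = parse_installs("1,000,000,000+")
print(small, large)
```

Because each value is the lower bound of a bucket ("10,000+" means at least 10,000), the averages computed from them are approximations rather than exact install counts.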