# Analyzing Mobile App Data: Unveiling Insights for User Acquisition
## Introduction

The analysis aims to uncover the factors driving user engagement in our Android and iOS mobile apps. By understanding these factors, we can optimize app development strategies and enhance user acquisition, leading to increased revenue from in-app advertisements.

The approach uses two datasets, one from the Google Play Store and one from the Apple App Store. We explore both datasets, clean them, identify the most common app genres, and analyze user ratings and reviews. We also examine the relationship between app categories and user installs.

#### Summary of results

A key finding is the disparity in the most common app genres between the Google Play Store and the Apple App Store. Gaming apps dominate the App Store, while the Google Play Store showcases a more diverse range of practical and entertainment-focused genres. This highlights the importance of tailoring app development and marketing strategies for each platform to maximize user engagement and revenue potential.

#### Open both datasets and save as list of lists:

In [1]:
from csv import reader

# Read each CSV into a list of lists; the with-block closes the file afterwards.
# encoding='utf8' assumes the files are UTF-8 encoded.
with open('AppleStore.csv', encoding='utf8') as opened_applestore:
    applestore_data = list(reader(opened_applestore))

with open('googleplaystore.csv', encoding='utf8') as opened_googleplay:
    googleplay_data = list(reader(opened_googleplay))
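
Both lists keep the header as their first row, which is why later cells slice with `[1:]`. One way to make that less error-prone is to separate the header from the data rows at load time. The following is only a sketch of that idea, not part of the original notebook: `split_header` is a hypothetical helper, and the sample rows are made up so the block runs without the CSV files.

```python
import csv
import io


def split_header(lines):
    """Parse CSV lines and return (header, data_rows) as separate objects."""
    rows = list(csv.reader(lines))
    return rows[0], rows[1:]


# Tiny in-memory stand-in for a CSV file (illustrative data only).
sample_csv = io.StringIO(
    "App,Category,Price\n"
    '"Notes, Lists & More",PRODUCTIVITY,0\n'
    "Sky Map,TOOLS,0\n"
)

header, apps = split_header(sample_csv)
print(header)      # column names only
print(len(apps))   # number of data rows, header excluded
```

With a real file, the open file object can be passed in place of `sample_csv`, since `csv.reader` accepts any iterable of lines.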

#### Define function to print rows in a readable way:

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

#### Explore datasets with explore_data function (add True to get number of rows and columns of each dataset):

In [3]:
print("Googleplay excerpt with number of rows and columns:")
print("\n")
explore_data(googleplay_data, 0, 3, True)
print("\n")
print("Applestore excerpt with number of rows and columns:")
print("\n")
explore_data(applestore_data, 0, 3, True)

Googleplay excerpt with number of rows and columns:


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


Applestore excerpt with number of rows and columns:


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+',

## Identifying helpful columns for analysis

For the Google Play Store, the helpful columns are: App (0), Category (1), Rating (2), Reviews (3), Installs (5), Type (6), Price (7), Content Rating (8), and Genres (9).

See the documentation for details: [Playstore Documentation](https://www.kaggle.com/datasets/lava18/google-play-store-apps)

For the App Store, the helpful columns are: track_name (1), price (4), rating_count_tot (5), user_rating (7), and prime_genre (11).

See the documentation for details: [Applestore Documentation](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps)


In [4]:
explore_data(googleplay_data, 0, 1)
explore_data(applestore_data, 0, 1)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']




## Data cleaning

* Find rows that are shorter than the header row (missing entries) and delete them. Detected row index: 10473


In [5]:
header_length = len(googleplay_data[0])

for row in googleplay_data:
    if len(row) != header_length:
    
        print(row)
        print(len(row))
        print(header_length)
        print(googleplay_data.index(row))

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
12
13
10473


In [6]:
print(googleplay_data[10473])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [7]:
del googleplay_data[10473]
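
`list.index(row)` returns the position of the first row equal to `row`, and deleting by a hard-coded index removes a valid row if the cell is run twice. A minimal sketch of an alternative, assuming the same list-of-lists layout with the header in row 0: collect the indices with `enumerate` and build a filtered copy. The helper names `find_malformed` and `drop_malformed` and the sample rows are hypothetical.

```python
def find_malformed(dataset):
    """Return indices of rows whose length differs from the header's."""
    expected = len(dataset[0])
    return [i for i, row in enumerate(dataset) if len(row) != expected]


def drop_malformed(dataset):
    """Return a new list without the malformed rows (input is not modified)."""
    expected = len(dataset[0])
    return [row for row in dataset if len(row) == expected]


# Illustrative data: the third row is missing its 'Category' value.
sample = [
    ['App', 'Category', 'Rating'],
    ['Sky Map', 'TOOLS', '4.5'],
    ['Photo Frame', '1.9'],
]

print(find_malformed(sample))        # [2]
print(len(drop_malformed(sample)))   # 2
```

Because `drop_malformed` returns a new list, running it repeatedly gives the same result.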

### Duplicate entries
* The Google Play dataset contains duplicate entries:

In [8]:
total_apps = []
duplicate_apps = []

# Note: this loop also visits the header row, so 'App' counts as one unique name
for app in googleplay_data:
    name = app[0]
    if name in total_apps:
        duplicate_apps.append(name)      
    else:
        total_apps.append(name)
        
        
print(len(total_apps))
print(len(duplicate_apps))
print("Duplicate app examples: ", duplicate_apps[:10])

9660
1181
Duplicate app examples:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']
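
The `name in total_apps` test scans a list on every iteration, so the loop above grows quadratically with the number of apps. As a sketch of a linear-time alternative, `collections.Counter` can count the names in one pass. The function name `count_duplicates` and the sample rows are illustrative assumptions.

```python
from collections import Counter


def count_duplicates(rows, name_index=0):
    """Return (number of unique names, number of surplus duplicate rows)."""
    counts = Counter(row[name_index] for row in rows)
    unique = len(counts)
    surplus = sum(n - 1 for n in counts.values())
    return unique, surplus


# Illustrative rows (header already removed): 'Box' appears three times.
sample_rows = [['Box'], ['Slack'], ['Box'], ['Zenefits'], ['Box']]

unique, surplus = count_duplicates(sample_rows)
print(unique, surplus)  # 3 2
```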


### Deletion process
For each duplicated app, only one entry is kept. We assume that the entry with the highest number of reviews is the most recent one, so that is the entry we keep.

Steps to remove the duplicates:
* Create a dictionary with the app name as key and the highest number of reviews for that app as value

In [9]:
reviews_max = {}

for app in googleplay_data[1:]:
    name = app[0]
    n_reviews = float(app[3])

    if name in reviews_max and n_reviews > reviews_max[name]:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

print(len(reviews_max))
#print(reviews_max)

9659


#### Duplicate rows removal with dictionary reviews_max:

In [10]:
android_clean = []
already_added = []

for app in googleplay_data[1:]:
    
    name = app[0] 
    n_reviews = float(app[3])
    
    if n_reviews == reviews_max[name] and name not in already_added:
        # second condition to account for duplicate entries 
        # with same review_max
        
        android_clean.append(app)
        already_added.append(name)
        
print(android_clean[:5])
print(len(android_clean))
        

[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'], ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']]
9659
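
The two loops above can be folded into a single pass that stores the best row per app name directly. This is a sketch under the same assumption as the notebook (the entry with the most reviews is the most recent one); `dedupe_by_max_reviews` is a hypothetical name and the sample rows carry only the columns needed here. One difference to be aware of: the result is ordered by the first appearance of each name, not by the position of the kept row.

```python
def dedupe_by_max_reviews(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # Strict '>' keeps the first row seen when review counts tie.
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Illustrative rows: name, category, rating, reviews.
sample_rows = [
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Box', 'BUSINESS', '4.2', '159904'],
]

clean = dedupe_by_max_reviews(sample_rows)
print(len(clean))    # 2
print(clean[0][3])   # 159904
```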


### Removing Non-English Apps

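* Characters commonly used in English text fall in the ASCII range (code points 0 to 127)
* Task: write a function that checks whether a string consists mostly of such characters; up to three non-ASCII characters are tolerated so that names containing emoji or symbols like ™ are kept
* Use it to remove non-English apps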

In [11]:
def english_check(string):
    
    count = 0
    
    for letter in string:
        if ord(letter) > 127:
            count += 1
            
            if count > 3:
                return False       
    return True

#Check function:           
       
print(english_check('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_check('Docs To Go™ Free Office Suite'))
print(english_check("Instagram"))
print(english_check('Instachat 😜'))

False
True
True
True
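
To see what the threshold of three non-ASCII characters does, it helps to count those characters per name. The sketch below restates the check in two small functions; `non_ascii_count` and `is_probably_english` are hypothetical names, and `max_non_ascii=3` mirrors the threshold used in `english_check`.

```python
def non_ascii_count(text):
    """Count characters whose code point lies outside the ASCII range 0-127."""
    return sum(1 for ch in text if ord(ch) > 127)


def is_probably_english(text, max_non_ascii=3):
    """Heuristic: tolerate a few symbols or emoji, reject mostly non-ASCII names."""
    return non_ascii_count(text) <= max_non_ascii


for name in ['Instagram', 'Docs To Go™ Free Office Suite', 'Instachat 😜']:
    print(name, non_ascii_count(name), is_probably_english(name))
```

The check is only a heuristic: a name written in another language that uses mostly ASCII letters still passes, and an English name with more than three emoji is rejected.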


Now the function can filter out non-English apps:

In [12]:
googleplay_english = []
applestore_english = []


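# Caution: android_clean has no header row, so this [1:] slice drops its
# first app ('Photo Editor & Candy Camera & Grid & ScrapBook'); the counts
# below reflect that.
for app in android_clean[1:]:
    name = app[0]
    if english_check(name):
        googleplay_english.append(app)

# applestore_data still starts with its header row, so [1:] is needed here
for app in applestore_data[1:]:
    name = app[1]
    if english_check(name):
        applestore_english.append(app)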
            
explore_data(googleplay_english, 0, 2, True)
print("\n")
explore_data(applestore_english, 0, 2, True)

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9613
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 6183
Number of columns: 16


We are left with 9613 English Google Play apps (already free of duplicates) and 6183 English App Store apps.

### Isolating free from non-free apps:
* Identify columns for app price
* Loop through the googleplay and applestore datasets and add free apps to a separate list
* Check length of free app datasets

In [13]:
explore_data(googleplay_data, 0, 3)
print("\n")
explore_data(applestore_data, 0, 3)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']




['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Ph

* Googleplay app price column: 7
* Applestore app price column: 4

In [14]:
googleplay_final = []
applestore_final = []

for app in googleplay_english:
    price_googleplay = app[7]
    
    if price_googleplay == "0":
        googleplay_final.append(app)
        
print("Apps in googleplay_final: " + str(len(googleplay_final)))
print("Apps in googleplay_english: " + str(len(googleplay_english)))
print("\n")

for app in applestore_english:
    price_applestore = app[4]
    
    if price_applestore == "0.0":
        applestore_final.append(app)
        
print("Apps in applestore_final: " + str(len(applestore_final)))
print("Apps in applestore_english: " + str(len(applestore_english)))


Apps in googleplay_final: 8863
Apps in googleplay_english: 9613


Apps in applestore_final: 3222
Apps in applestore_english: 6183
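
Comparing price strings exactly (`"0"` for Google Play, `"0.0"` for the App Store) works for these two files but depends on each store's formatting. A more defensive sketch parses the price as a number first. It assumes that paid prices may carry a leading `$`; `is_free` is a hypothetical helper, not part of the notebook.

```python
def is_free(price_text):
    """Return True when a price string represents zero, e.g. '0', '0.0', '$0.00'."""
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Unparseable prices are treated as not free rather than guessed at.
        return False


for text in ['0', '0.0', '$0.00', '$4.99', 'Everyone']:
    print(repr(text), is_free(text))
```

The same function could then be applied to column 7 of the Google Play rows and column 4 of the App Store rows.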


A significant amount of time was dedicated to data cleaning for our Google Play Store and Apple Store datasets. The cleaning process involved:

* Removing inaccurate data
* Eliminating duplicate app entries
* Filtering out non-English apps
* Isolating the free apps

These steps resulted in refined datasets suitable for analysis.
After filtering, 8863 Android apps and 3222 iOS apps remain, which gives us an adequate dataset for our analysis.

## Analysis: Most Common Apps by Genre

Our objective is to find app profiles that can attract a significant user base, as our revenue is directly influenced by app popularity. To validate app ideas efficiently, we follow a three-step strategy:

* Begin by launching a basic Android version of the app on Google Play and evaluate user response.
* If the initial Android version garners positive feedback and engagement, we proceed with further development and enhancements.
* After assessing profitability over a six-month period, we consider building an iOS version of the app for the App Store.

Our aim is to identify app profiles that can thrive on both Google Play and the App Store. For our analysis, we will construct frequency tables to determine the most prevalent genres in each market.

#### Identifying suitable columns for frequency tables:

In [15]:
explore_data(googleplay_data, 0, 2)
explore_data(applestore_data, 0,2)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']




* Googleplay Store columns: Category(1), Genres(9)
* Applestore columns: Prime Genre(11)

##### Function to create frequency tables:

In [30]:
def freq_table(dataset, index):
    # Count how often each value occurs in the given column
    ftable = {}

    for app in dataset:
        table_key = app[index]

        if table_key in ftable:
            ftable[table_key] += 1
        else:
            ftable[table_key] = 1

    # Convert the counts to percentages of all rows
    table_percentages = {}
    total = len(dataset)
    for key in ftable:
        percentage = (ftable[key] / total) * 100
        table_percentages[key] = percentage

    # Return the percentage table; without a return statement the
    # function would return None
    return table_percentages
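
The analysis text further down quotes raw app counts, while `freq_table` returns percentages. A sketch of a variant that returns both, built on `collections.Counter`, is shown here; `freq_table_both` is a hypothetical name and the sample rows are illustrative.

```python
from collections import Counter


def freq_table_both(dataset, index):
    """Return {value: (count, percentage)} ordered by count, largest first."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {
        key: (count, count / total * 100)
        for key, count in counts.most_common()
    }


# Illustrative rows: name, genre.
sample = [
    ['A', 'Games'], ['B', 'Games'], ['C', 'Games'], ['D', 'Music'],
]

table = freq_table_both(sample, 1)
for genre, (count, pct) in table.items():
    print(f'{genre}: {count} apps ({pct:.1f}%)')
```

`Counter.most_common()` already yields the entries sorted by count, so no separate sorting step is needed for display.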

#### Function to sort frequency tables:

In [33]:
def display_table(dataset, index):
    ftable = freq_table(dataset, index)
    ftable_display = []
    for key in ftable:
        key_val_as_tuple = (ftable[key], key)
        ftable_display.append(key_val_as_tuple)

    table_sorted = sorted(ftable_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        
display_table(googleplay_final, 1)
print("\n")
display_table(googleplay_final, 9)
print("\n")
display_table(applestore_final, 11)

FAMILY : 18.910075595170937
GAME : 9.725826469592688
TOOLS : 8.462146000225657
BUSINESS : 4.592124562789123
LIFESTYLE : 3.9038700214374367
PRODUCTIVITY : 3.8925871601038025
FINANCE : 3.7007785174320205
MEDICAL : 3.5315355974275078
SPORTS : 3.396141261423897
PERSONALIZATION : 3.317161232088458
COMMUNICATION : 3.2381812027530184
HEALTH_AND_FITNESS : 3.0802211440821394
PHOTOGRAPHY : 2.944826808078529
NEWS_AND_MAGAZINES : 2.798149610741284
SOCIAL : 2.6627552747376737
TRAVEL_AND_LOCAL : 2.335552296062281
SHOPPING : 2.245289405393208
BOOKS_AND_REFERENCE : 2.1437436533904997
DATING : 1.8616721200496444
VIDEO_PLAYERS : 1.7939749520478394
MAPS_AND_NAVIGATION : 1.399074805370642
FOOD_AND_DRINK : 1.241114746699763
EDUCATION : 1.1621347173643235
ENTERTAINMENT : 0.9590432133589079
LIBRARIES_AND_DEMO : 0.9364774906916393
AUTO_AND_VEHICLES : 0.9251946293580051
HOUSE_AND_HOME : 0.8236488773552973
WEATHER : 0.8010831546880289
EVENTS : 0.7108202640189552
PARENTING : 0.6544059573507841
ART_AND_DESIGN : 0

### Applestore Analysis:

Upon analyzing the frequency table for the prime_genre column in the App Store dataset, several patterns emerge:

* The most prevalent genre among the free English apps is "Games," with a count of 1874. This genre dominates the market by a substantial margin.
* Following games, the next most common genres are "Entertainment" with 254 apps and "Photo & Video" with 160 apps, although they are significantly less prevalent compared to games.
* The majority of the app profiles in the dataset appear to be designed for entertainment purposes, encompassing gaming, photo and video editing, and social networking.
* Genres focused on practical purposes, such as education, shopping, utilities, productivity, lifestyle, and health & fitness, also demonstrate a notable presence, albeit to a lesser extent than entertainment genres.
* Genres like weather, finance, food & drink, reference, business, and travel exhibit a relatively smaller number of apps, indicating a more niche market segment.

Based solely on this frequency table, it would be smart to consider developing an app profile that aligns with the dominant genres, particularly games or entertainment-related offerings. However, it is essential to note that the high number of apps within a genre does not guarantee a corresponding large user base. Further analysis is required to assess user engagement, competition, and overall market demand before making a conclusive recommendation for an app profile.

### Googleplay Store Analysis

Based on the frequency tables generated for the Category and Genres columns in the Google Play dataset, we can observe the following patterns:

##### Most common genres in the Google Play market (Category):

* "Family" is the most common category, with 1676 apps (about 18.9% of the free English apps).
* Following "Family," the next most common categories are "Game" (862 apps, about 9.7%) and "Tools" (750 apps, about 8.5%).
* Other prevalent categories include "Business," "Lifestyle," "Productivity," and "Finance."

##### Most common genres in the Google Play market (Genres):

* The most common genre is "Tools" with 749 apps.
* Other prominent genres include "Entertainment," "Education," "Business," and "Productivity."
* We also observe a variety of genres such as "Sports," "Communication," "Health & Fitness," "Photography," and "News & Magazines."

#### Comparison with the Applestore market:

In the Applestore, the dominant genre is "Games," while the Google Play market shows a more diverse range of genres, including "Family," "Tools," "Entertainment," and "Education."
The Google Play market appears to have a broader selection of categories and genres compared to the App Store, which is more focused on entertainment genres.

#### App profile recommendation
Based on the frequency tables alone, we could consider developing a family-oriented app or a tool/utility app for the Google Play market. However, it is important to note that the frequency tables reveal the most common genres, but they do not directly indicate the number of users. Further analysis is required to understand user engagement, competition, and market demand to make a more informed recommendation for an app profile.

### Decoding Genre Popularity: User Rating Analysis

To estimate which genres are most popular, we calculate the average number of user ratings per app for each genre on the App Store. Because the App Store dataset contains no install counts, the total number of user ratings (the rating_count_tot column) is used as a proxy.

A nested loop groups the apps by genre, sums the rating counts, and divides the sum by the number of apps in each genre, which yields the average.

In [19]:
genres_applestore = freq_table(applestore_final, 11)

for genre in genres_applestore:
    total_ratings = 0
    len_genre = 0
    
    for app in applestore_final:
        app_genre = app[11]
        
        
        if genre == app_genre:
            total_ratings += float(app[5])
            len_genre += 1
            
    avg_ratings = total_ratings / len_genre
    print(genre, ":", avg_ratings)

Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0
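
The nested loop rescans every app once per genre. A single pass that keeps a running total and a count per genre gives the same averages with less work. The sketch below uses `collections.defaultdict`; `average_by_group` is a hypothetical name and the sample rows are made up.

```python
from collections import defaultdict


def average_by_group(rows, group_index, value_index):
    """Return {group: mean of float(row[value_index])} computed in one pass."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Illustrative rows: genre, rating count.
sample = [
    ['Games', '100'], ['Games', '300'], ['Music', '50'],
]

averages = average_by_group(sample, 0, 1)
# Print the genres from highest to lowest average
for genre, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', avg)
```

For the App Store data, the equivalent call would be `average_by_group(applestore_final, 11, 5)`. Sorting the result makes the ranking easier to read than the unsorted output above.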


Based on the average number of user ratings per app, "Navigation" (about 86,090), "Reference" (about 74,942), "Social Networking" (about 71,548), "Music" (about 57,327), and "Weather" (about 52,280) show the highest user engagement in the App Store. Among these, the **"Reference" genre**, with an average of 74,942 user ratings per app, stands out as a **promising app profile recommendation**. Developing an educational or informative app in this genre could tap into its popularity and meet user demand. Further market research is advised to validate this recommendation and to account for factors like competition and target audience.

For Google Play, the Installs column is available, so the same calculation is repeated with the average number of installs per category:

In [43]:
googleplay_categories = freq_table(googleplay_final, 1)

for category in googleplay_categories:
    total_installs = 0
    len_category = 0
    
    for app in googleplay_final:
        app_category = app[1]
        
        
        if category == app_category:
            installs = float(app[5].replace("+", "").replace(",", ""))
            
            total_installs += installs           
            len_category += 1
            
    avg_installs = total_installs / len_category
    print(category, ":", avg_installs)

ART_AND_DESIGN : 2021626.7857142857
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_
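
The `Installs` column holds open-ended buckets such as `'10,000+'`, so the converted number is a lower bound for each app, not an exact install count. A small sketch of the conversion step on its own follows; `parse_installs` is a hypothetical helper name.

```python
def parse_installs(text):
    """Convert an install bucket such as '1,000,000+' to its lower bound as an int."""
    return int(text.replace(',', '').replace('+', ''))


for bucket in ['0', '10,000+', '1,000,000+']:
    print(bucket, '->', parse_installs(bucket))
```

Because every app in a bucket is counted at the bucket's lower bound, the per-category averages above are best read as rough estimates.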

### App Genre Popularity Analysis: Insights from Installs

After analyzing the average number of installs per app genre on Google Play, we can derive the following insights:

* Among the categories listed, **communication apps** show the **highest average installs** (about 38.5 million per app), followed by video players (about 24.7 million) and social apps (about 23.3 million), indicating a **substantial user base and potential profitability**.
* Photography, productivity, games, travel and local, entertainment, and tools also exhibit significant popularity, each with more than 10 million average installs.
* Books and reference (about 8.8 million) and shopping (about 7.0 million) show solid average install figures, suggesting further opportunities for profitable app profiles.

Based on these findings, developing a **communication or social networking app appears to be a recommended app profile** for Google Play. These genres offer a wide user reach and a consistent demand for innovative applications. However, further market research, competitor analysis, and consideration of user preferences are essential to ensure a successful app profile aligned with market trends and potential profitability.