### Introduction:  
Our aim in this project is to find mobile app profiles that are profitable for the Google Play and App Store markets.  
  
We will analyze apps that are free to download and install, for which the main source of revenue is in-app ads. This means that the revenue of any given free app is mostly influenced by the number of users who use it. Our goal is to analyze the data to understand what kinds of apps are likely to attract more users.

#### Reading a CSV file and separating data from header:

In [1]:
import csv

In [2]:
def read_csv_file(filename):
    with open(filename, encoding='utf8') as fd:
        all_data = list(csv.reader(fd))
        header = all_data[0]
        data = all_data[1:]
    
    return header, data

In [3]:
android_header, android_data = read_csv_file('googleplaystore.csv')
print('Total length: {}'.format(len(android_data)))
print('Android data:')
for row in android_data[:5]:
    print(row)

Total length: 10841
Android data:
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']
['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']
['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


In [4]:
ios_header, ios_data = read_csv_file('AppleStore.csv')
print('Total length: {}'.format(len(ios_data)))
print('iOS data:')
for row in ios_data[:5]:
    print(row)

Total length: 7197
iOS data:
['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']
['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']
['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']
['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']
['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37', '5', '45', '1']
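
The later cells refer to columns by position (for example, index `0` for the Android app name and index `2` for the iOS app name), so it can help to list each header with its indices first. The following is a minimal sketch; `describe_columns` is a hypothetical helper and `sample_header` is illustrative data, not the real header returned by `read_csv_file`.

```python
def describe_columns(header):
    """Return (index, name) pairs so positional indices can be checked by eye."""
    return list(enumerate(header))

# Illustrative header only; in the notebook, pass android_header or ios_header.
sample_header = ['App', 'Category', 'Rating', 'Reviews']
for index, name in describe_columns(sample_header):
    print(index, name)
```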


#### Deleting wrong data:  
In the [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion) of the Google Play dataset, [this discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) reports an error in row 10472. Printing the row shows that the category value is missing, so every later value is shifted one column to the left.

In [5]:
print(android_data[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [6]:
del android_data[10472]

In [7]:
print(android_data[10472])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
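
A row like the one deleted above could also be located without relying on the forum thread, by comparing each row's length with the header's length. This is only a sketch on toy data; `find_malformed_rows` is a hypothetical helper, and it assumes that a shifted row always has fewer (or more) fields than the header.

```python
def find_malformed_rows(header, rows):
    """Return the indices of rows whose field count differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(rows) if len(row) != expected]

# Toy data: the second row is missing its 'Category' field.
header = ['App', 'Category', 'Rating']
rows = [
    ['App A', 'TOOLS', '4.2'],
    ['App B', '1.9'],
    ['App C', 'SOCIAL', '4.5'],
]
print(find_malformed_rows(header, rows))
```

If several indices are returned, deleting them in descending order keeps the remaining indices valid.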


#### Removing duplicate entries:  
As we explore the Google Play dataset further, we will find that some apps have more than one entry. For example, Instagram has 4 entries.

In [8]:
for row in android_data:
    if row[0] == 'Instagram':
        print(row)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [9]:
def check_duplicate_data(dataset, name_ind):
    unique_apps = list()
    duplicate_apps = list()
    
    for row in dataset:
        app_name = row[name_ind]
        if app_name in unique_apps:
            duplicate_apps.append(app_name)
        else:
            unique_apps.append(app_name)
    
    return len(duplicate_apps)

In [10]:
# Another way to check duplicate entries using sets
def check_duplicate_entries(dataset, name_ind):
    unique_apps = set()
    
    for row in dataset:
        unique_apps.add(row[name_ind])
    
    return len(dataset) - len(unique_apps)

In [11]:
print('Number of duplicate apps in Google Play dataset: {}'.format(check_duplicate_data(android_data, 0)))

Number of duplicate apps in Google Play dataset: 1181


In [12]:
print('Number of duplicate apps in AppStore dataset: {}'.format(check_duplicate_data(ios_data, 2)))

Number of duplicate apps in AppStore dataset: 2
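
The counts above tell us how many extra entries exist, but not which apps they belong to. As a rough sketch, `collections.Counter` can list the names that occur more than once; `duplicated_names` is a hypothetical helper and the rows below are toy data.

```python
from collections import Counter

def duplicated_names(dataset, name_ind):
    """Map each app name that appears more than once to its number of entries."""
    counts = Counter(row[name_ind] for row in dataset)
    return {name: n for name, n in counts.items() if n > 1}

# Toy rows: (name, number of reviews)
toy_rows = [
    ['Instagram', '66577313'],
    ['Instagram', '66577446'],
    ['Bible', '985920'],
]
print(duplicated_names(toy_rows, 0))
```

Summing `n - 1` over the returned values gives the same kind of "extra entries" count that `check_duplicate_data` reports.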


In total there are ```1181``` duplicate app entries in the Google Play dataset.  
We don't want to count any app more than once when we analyze the data, so we need to remove the duplicate entries and keep only one entry per app.  
  
When we examine the duplicate entries for Instagram, the main difference is in the fourth column of each row, which holds the number of reviews. The different numbers show that the data was collected at different times. We can use this to build a criterion for keeping rows: we will keep the row with the highest number of reviews, because the higher the number of reviews, the more reliable the rating.  
  
To remove the duplicate entries, we will:
* Create a dictionary where each key is a unique app name and the value is the highest number of reviews for that app.
* Use that dictionary to create a new dataset, which will have only one entry per app.

In [13]:
def remove_duplicate_apps(dataset, name_ind, review_ind):
    # Nothing to do when every app name is already unique
    if check_duplicate_data(dataset, name_ind) == 0:
        return dataset

    # First pass: record the highest review count seen for each app
    reviews_max = dict()
    for row in dataset:
        app = row[name_ind]
        review = float(row[review_ind])
        if app not in reviews_max or reviews_max[app] < review:
            reviews_max[app] = review

    # Second pass: keep the first row that matches each app's maximum
    clean_data = list()
    already_added = set()
    for row in dataset:
        app = row[name_ind]
        review = float(row[review_ind])
        if reviews_max[app] == review and app not in already_added:
            clean_data.append(row)
            already_added.add(app)

    return clean_data

In [14]:
android_data = remove_duplicate_apps(android_data, 0, 3)
print('Length after removing duplicates: {}'.format(len(android_data)))
print('Android data:')
for row in android_data[:5]:
    print(row)

Length after removing duplicates: 9659
Android data:
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']
['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']
['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']
['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


In [15]:
ios_data = remove_duplicate_apps(ios_data, 2, 6)
print('Length after removing duplicates: {}'.format(len(ios_data)))
print('iOS data:')
for row in ios_data[:5]:
    print(row)

Length after removing duplicates: 7195
iOS data:
['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']
['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']
['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']
['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']
['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37', '5', '45', '1']
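
The same "keep the most-reviewed entry" criterion can also be applied in a single pass by storing the best row per app directly. This is an alternative sketch, not the function used above; `keep_most_reviewed` is a hypothetical name, and the result is ordered by each app's first appearance, which may differ from the order produced by `remove_duplicate_apps`.

```python
def keep_most_reviewed(dataset, name_ind, review_ind):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}
    for row in dataset:
        name = row[name_ind]
        # Strict '>' means ties keep the earlier row
        if name not in best or float(row[review_ind]) > float(best[name][review_ind]):
            best[name] = row
    # dicts preserve insertion order (Python 3.7+)
    return list(best.values())

# Toy rows: (name, category, number of reviews)
toy_rows = [
    ['Instagram', 'SOCIAL', '66577313'],
    ['Bible', 'REFERENCE', '10'],
    ['Instagram', 'SOCIAL', '66577446'],
    ['Instagram', 'SOCIAL', '66509917'],
]
for row in keep_most_reviewed(toy_rows, 0, 2):
    print(row)
```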


#### Removing non-English apps:  
As we explore the datasets further, we notice that the names of some apps suggest they are not directed towards an English-speaking audience. Below are two examples:

In [16]:
print(android_data[4412][0])
print(android_data[7940][0])

中国語 AQリスニング
لعبة تقدر تربح DZ


We are not interested in keeping these kinds of apps, so we will remove them. One way is to remove each app whose name contains a symbol that is not commonly used in English text. English text usually includes the letters of the English alphabet (a-z, A-Z), digits (0-9), punctuation marks (.,;?! etc.), and other symbols (+-* / etc.).  
  
All of these characters are encoded by the ASCII standard. Each ASCII character has a corresponding number in the range 0-127, and we can take advantage of that to build a function that checks an app name and tells us whether it contains non-ASCII characters.  
However, some English app names use emojis or other symbols (™, 😜 etc.) that fall outside the ASCII range. To minimize data loss, we will only remove an app if its name has more than 3 non-ASCII characters.
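
To make the threshold concrete, here is a small sketch that counts the characters whose code point is above 127 in a few names. `count_non_ascii` is a hypothetical helper, and `'Some App\u2122'` is a made-up name used only to show a single trademark symbol.

```python
def count_non_ascii(text):
    """Count the characters whose Unicode code point lies outside 0-127."""
    return sum(1 for char in text if ord(char) > 127)

names = [
    'Instagram',          # plain ASCII
    'Some App\u2122',     # made-up name with one trademark symbol
    '中国語 AQリスニング',  # one of the names printed earlier
]
for name in names:
    count = count_non_ascii(name)
    print(name, count, 'kept' if count <= 3 else 'removed')
```

This heuristic is deliberately rough: a non-English name written mostly with ASCII letters would still be kept.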

In [17]:
def is_english(app):
    non_ascii_count = 0
    
    for char in app:
        if ord(char) > 127:
            non_ascii_count += 1
    
    if non_ascii_count > 3:
        return False
    return True

In [18]:
def english_dataset(dataset, name_ind):
    clean_data = list()
    
    for row in dataset:
        app = row[name_ind]
        if is_english(app):
            clean_data.append(row)
    
    return clean_data

In [19]:
android_data = english_dataset(android_data, 0)
print('Length after removing non-English apps: {}'.format(len(android_data)))
print('Android data:')
for row in android_data[:5]:
    print(row)

Length after removing non-English apps: 9614
Android data:
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']
['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']
['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']
['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


In [20]:
ios_data = english_dataset(ios_data, 2)
print('Length after removing non-English apps: {}'.format(len(ios_data)))
print('iOS data:')
for row in ios_data[:5]:
    print(row)

Length after removing non-English apps: 6181
iOS data:
['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']
['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']
['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']
['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']
['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37', '5', '45', '1']


#### Isolating the free apps:  
As mentioned in the introduction, we will analyze only apps that are free to download and install, whose main source of revenue is in-app ads. Since our datasets contain both free and non-free apps, we need to isolate the free apps for our analysis.
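
Prices are stored as strings, so they must be converted before they can be compared with zero. The sketch below assumes that some prices carry a leading dollar sign (for example `'$4.99'`), while others are plain numbers such as `'0'` or `'3.99'`; `parse_price` and `is_free` are hypothetical helpers.

```python
def parse_price(raw):
    """Convert a price string such as '0', '3.99' or '$4.99' to a float."""
    # Assumption: the only non-numeric decoration is an optional leading '$'
    return float(raw.strip().lstrip('$'))

def is_free(raw):
    """Return True when the parsed price is exactly zero."""
    return parse_price(raw) == 0.0

for raw in ['0', '3.99', '$4.99']:
    print(raw, is_free(raw))
```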

In [21]:
def get_free_apps(dataset, price_ind):
    free_apps = list()
    
    for row in dataset:
        price = row[price_ind].replace('$', '')
        if float(price) == 0.0:
            free_apps.append(row)
    
    return free_apps

In [22]:
android_data = get_free_apps(android_data, 7)
print('Length after isolating free apps: {}'.format(len(android_data)))
print('Android data:')
for row in android_data[:5]:
    print(row)

Length after isolating free apps: 8864
Android data:
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']
['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']
['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']
['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


In [23]:
ios_data = get_free_apps(ios_data, 5)
print('Length after isolating free apps: {}'.format(len(ios_data)))
print('iOS data:')
for row in ios_data[:5]:
    print(row)

Length after isolating free apps: 3220
iOS data:
['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']
['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']
['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']
['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37', '5', '45', '1']
['7', '283646709', 'PayPal - Send and request money safely', '227795968', 'USD', '0', '119487', '879', '4', '4.5', '6.12.0', '4+', 'Finance', '37', '0', '19', '1']


We are now left with ```8864``` Android apps and ```3220``` iOS apps.  
  
#### Most common apps by genre:  
Let us begin our analysis by getting a sense of the most common genres for each market. For this, we will build a frequency table.  
  
We will build two functions to analyze the frequency tables:
* One function to generate a frequency table that shows percentages
* Another function to display the percentages in descending order
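
Both steps can be tried on a tiny example first. The sketch below uses `collections.Counter` rather than a hand-built dictionary; `percentage_table` is a hypothetical name and the rows are toy data.

```python
from collections import Counter

def percentage_table(dataset, index):
    """Return {value: percentage of rows}, ordered from most to least common."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    # most_common() yields (value, count) pairs in descending order of count
    return {key: count / total * 100 for key, count in counts.most_common()}

# Toy rows: (name, genre)
toy_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Weather'], ['d', 'Games']]
for genre, pct in percentage_table(toy_rows, 1).items():
    print('{}: {:.2f}'.format(genre, pct))
```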

In [24]:
def frequency_table(dataset, genre_ind):
    freq_table = dict()
    
    for row in dataset:
        genre = row[genre_ind]
        freq_table[genre] = freq_table.get(genre, 0) + 1
    
    total = len(dataset)
    table_percentage = dict()
    for key in freq_table:
        table_percentage[key] = (freq_table[key] / total) * 100
    
    return table_percentage

In [25]:
def display_table(dataset, index):
    table_percentage = frequency_table(dataset, index)
    
    # sort in descending order
    table_sorted = sorted(table_percentage.items(), key=lambda x: x[1], reverse=True)
    for entry in table_sorted:
        print('{}: {:.2f}'.format(entry[0], round(entry[1], 2)))

Examining the frequency table for the ```prime_genre``` column of the App Store dataset.

In [26]:
display_table(ios_data, -5) #display frequency table for 'prime_genre'

Games: 58.14
Entertainment: 7.89
Photo & Video: 4.97
Education: 3.66
Social Networking: 3.29
Shopping: 2.61
Utilities: 2.52
Sports: 2.14
Music: 2.05
Health & Fitness: 2.02
Productivity: 1.74
Lifestyle: 1.58
News: 1.34
Travel: 1.24
Finance: 1.12
Weather: 0.87
Food & Drink: 0.81
Reference: 0.56
Business: 0.53
Book: 0.43
Navigation: 0.19
Medical: 0.19
Catalogs: 0.12


We can see that among the free English apps, more than half (```58.14%```) are ```games```. ```Entertainment``` apps account for close to ```8%```, followed by ```photo and video``` apps at close to ```5%```. Only ```3.66%``` of the apps are designed for ```education```, followed by ```social networking``` apps, which amount to ```3.29%``` of the apps in our dataset.  
  
The general impression is that the App Store (at least the part containing free English apps) is dominated by apps that are designed for fun (games, entertainment, photo and video, social networking, sports, music, etc.), while apps with practical purposes (education, shopping, utilities, productivity, lifestyle, etc.) are rarer.  
  
Examining the ```Genres``` and ```Category``` columns of the Google Play dataset.

In [29]:
display_table(android_data, 1) #display frequency table for 'Category'

FAMILY: 18.91
GAME: 9.72
TOOLS: 8.46
BUSINESS: 4.59
LIFESTYLE: 3.90
PRODUCTIVITY: 3.89
FINANCE: 3.70
MEDICAL: 3.53
SPORTS: 3.40
PERSONALIZATION: 3.32
COMMUNICATION: 3.24
HEALTH_AND_FITNESS: 3.08
PHOTOGRAPHY: 2.94
NEWS_AND_MAGAZINES: 2.80
SOCIAL: 2.66
TRAVEL_AND_LOCAL: 2.34
SHOPPING: 2.25
BOOKS_AND_REFERENCE: 2.14
DATING: 1.86
VIDEO_PLAYERS: 1.79
MAPS_AND_NAVIGATION: 1.40
FOOD_AND_DRINK: 1.24
EDUCATION: 1.16
ENTERTAINMENT: 0.96
LIBRARIES_AND_DEMO: 0.94
AUTO_AND_VEHICLES: 0.93
HOUSE_AND_HOME: 0.82
WEATHER: 0.80
EVENTS: 0.71
PARENTING: 0.65
ART_AND_DESIGN: 0.64
COMICS: 0.62
BEAUTY: 0.60


At first glance, apps designed for fun are less prominent on Google Play, and a good number of apps seem to be designed for practical purposes (family, tools, business, lifestyle, etc.). However, if we investigate this further, we can see that the ```FAMILY``` category consists mostly of games for kids.
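
One way to check what a category contains is to count the genre values among its rows only. The following is a sketch on toy data; `genres_in_category` is a hypothetical helper, and its default indices (`1` for the category, `-4` for the genre) mirror the ones passed to `display_table` in this notebook.

```python
from collections import Counter

def genres_in_category(dataset, category, category_ind=1, genre_ind=-4):
    """Count the genre values among rows of one category, most common first."""
    genres = Counter(
        row[genre_ind] for row in dataset if row[category_ind] == category
    )
    return genres.most_common()

# Toy rows (name, category, genre); real rows have more columns.
toy_rows = [
    ['App A', 'FAMILY', 'Casual;Pretend Play'],
    ['App B', 'FAMILY', 'Education'],
    ['App C', 'FAMILY', 'Casual;Pretend Play'],
    ['App D', 'TOOLS', 'Tools'],
]
print(genres_in_category(toy_rows, 'FAMILY', category_ind=1, genre_ind=2))
```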

In [28]:
display_table(android_data, -4) #display frequency table for 'Genres'

Tools: 8.45
Entertainment: 6.07
Education: 5.35
Business: 4.59
Lifestyle: 3.89
Productivity: 3.89
Finance: 3.70
Medical: 3.53
Sports: 3.46
Personalization: 3.32
Communication: 3.24
Action: 3.10
Health & Fitness: 3.08
Photography: 2.94
News & Magazines: 2.80
Social: 2.66
Travel & Local: 2.32
Shopping: 2.25
Books & Reference: 2.14
Simulation: 2.04
Dating: 1.86
Arcade: 1.85
Video Players & Editors: 1.77
Casual: 1.76
Maps & Navigation: 1.40
Food & Drink: 1.24
Puzzle: 1.13
Racing: 0.99
Libraries & Demo: 0.94
Role Playing: 0.94
Auto & Vehicles: 0.93
Strategy: 0.91
House & Home: 0.82
Weather: 0.80
Events: 0.71
Adventure: 0.68
Comics: 0.61
Art & Design: 0.60
Beauty: 0.60
Parenting: 0.50
Card: 0.45
Casino: 0.43
Trivia: 0.42
Educational;Education: 0.39
Board: 0.38
Educational: 0.37
Education;Education: 0.34
Word: 0.26
Casual;Pretend Play: 0.24
Music: 0.20
Entertainment;Music & Video: 0.17
Puzzle;Brain Games: 0.17
Racing;Action & Adventure: 0.17
Casual;Brain Games: 0.14
Casual;Action & Adventure:

The difference between the *Genres* and *Category* columns is not clear, but one thing we can notice is that the *Genres* column has more distinct values.  
Up to this point, we have found that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and for-fun apps.
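
One visible reason the *Genres* column has more distinct values is that some entries combine two labels with a semicolon, such as `Casual;Pretend Play` in the output above. As a possible follow-up, and only as a sketch, the first label could be treated as the primary genre; `primary_genre` is a hypothetical helper, and treating the first label as the main one is an assumption rather than something the dataset documents.

```python
def primary_genre(genres_value):
    """Return the part of a genre string before the first semicolon."""
    # Assumption: the first label is the main genre
    return genres_value.split(';')[0]

for value in ['Tools', 'Casual;Pretend Play', 'Racing;Action & Adventure']:
    print(value, '->', primary_genre(value))
```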