# Profitable App Profiles for the App Store and Google Play Markets
We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means our revenue for any given app is mostly influenced by the number of users who use our app: the more users who see and engage with the ads, the better.

Our goal for this project is to analyze data to help our developers understand what type of apps are likely to attract more users.

(Full disclosure: this is a guided project.)

## Opening and Exploring the Data
You can download the Google Play Store data set [here](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv) and the Apple App Store data set [here](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).

In [1]:
from csv import reader

with open("googleplaystore.csv") as f:
  android = list(reader(f))
android_header, android_data = android[0], android[1:]

with open("AppleStore.csv") as f:
  ios = list(reader(f))
ios_header, ios_data = ios[0], ios[1:]
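
If the files are not in the working directory, `open` raises `FileNotFoundError`. Here is a minimal sketch of a loader that fails with a clearer message. `load_csv` is a hypothetical helper (it is not part of the guided project), and the `utf-8` encoding is my assumption about how the files were saved.

```python
from csv import reader
from pathlib import Path


def load_csv(path, encoding="utf-8"):
    """Return (header, rows) for a CSV file, or raise a descriptive error."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found; download it and place it next to the notebook"
        )
    # newline="" is what the csv module recommends for files it reads.
    with open(path, encoding=encoding, newline="") as f:
        rows = list(reader(f))
    return rows[0], rows[1:]
```

With the files in place, usage would look like `android_header, android_data = load_csv("googleplaystore.csv")`.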



In [None]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [None]:
print(android_header)
print('\n')
explore_data(android_data, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


In [None]:
print(ios_header)
print('\n')
explore_data(ios_data, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


## A Little Cleanup
### Stage 1
I trust no one to provide a perfectly clean data set, so I am going to do a quick check (to the best of my ability) to verify that the data doesn't contain any inaccuracies or duplicate rows.
#### Stage 1a: Remove Inaccuracies
Working smart is *almost* always better than working hard. So instead of jumping in headfirst, tilting at errors, I am going to browse the Kaggle discussion boards for both data sets. There is a chance that those who came before me found errors that might trip me up. ([Google Play Store](https://www.kaggle.com/lava18/google-play-store-apps/discussion)/[Apple App Store](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion))

Turns out the Google Play Store data does have an error on row 10472. The rating (index 2) is listed as **19**, but all Google Play Store ratings lie in a range from 0 to 5 (unless no rating exists, in which case the value is "NaN"). This was more than likely caused by a missing entry somewhere in this row, which would shift every later value one column to the left.

Just to verify I am going to print that row and see if it has a different number of values than it should.

In [None]:
print(android_header, "\nRow Length:", len(android_header))
print(android_data[10472], "\nRow Length:", len(android_data[10472]))

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 
Row Length: 13
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up'] 
Row Length: 12


Yep, looks bad, man. On closer inspection, it is the Category value (index 1) that is missing, so every value to its right shifted one column to the left: the 1.9 is the real rating and the 19 at index 2 is really the review count. Since we don't know what the category was, it's safer to just drop this row, as category will be a factor in our analysis. I'm also going to make sure that the row is removed by printing the new row at index 10472.

In [None]:
print(android_data[10472])
del android_data[10472]
print(android_data[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
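
The row-length comparison above can be generalized to scan the whole data set instead of one known index. This is a minimal sketch on toy data; `find_malformed_rows` is a hypothetical helper name, and the rows below are made up for illustration.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


# Toy data for illustration only; not taken from the real data set.
header = ["App", "Category", "Rating"]
rows = [
    ["A", "TOOLS", "4.1"],
    ["B", "3.9"],            # Category missing, so the values shifted left
    ["C", "GAME", "4.5"],
]
print(find_malformed_rows(header, rows))  # [(1, ['B', '3.9'])]
```

If more than one row turns up, deleting from the highest index down keeps the remaining indexes valid.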


At this point the guided project is pushing me to move on to the next step, but we can do a quick and easy check here to ensure all of the rating values are within the appropriate range.

In [None]:
out_of_range_ratings = 0
problem_values = []
for row in android_data:
  if not row[2] == "NaN" and not (0 <= float(row[2]) <= 5):
    out_of_range_ratings += 1
    problem_values.append(row[2])
print("Out of Range Ratings:", out_of_range_ratings)
if out_of_range_ratings > 0:
  print("Problem Values:", problem_values)

Out of Range Ratings: 0


Alright, we're now showing 0 out of range values for the ratings column. Let's move on.

#### Stage 1b: Remove Duplicates

In [None]:
unique_apps = []
duplicate_apps = []
for row in android_data:
  app_name = row[0]
  if app_name in unique_apps:
    duplicate_apps.append(app_name)
  else:
    unique_apps.append(app_name)
print("Number of Duplicate Apps:", len(duplicate_apps))
print("Example Duplicates:", duplicate_apps[:10])


Number of Duplicate Apps: 1181
Example Duplicates: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']
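
A side note on the counting loop: `in` on a list is a linear scan, so the loop above does quadratic work overall. A possible single-pass alternative uses `collections.Counter`. This is only a sketch on toy rows, and `count_duplicates` is a hypothetical name.

```python
from collections import Counter


def count_duplicates(rows, name_index=0):
    """Return (number of extra entries, names that occur more than once)."""
    counts = Counter(row[name_index] for row in rows)
    extra = sum(n - 1 for n in counts.values())
    repeated = [name for name, n in counts.items() if n > 1]
    return extra, repeated


# Toy rows for illustration; only the first column matters here.
rows = [["Slack"], ["Box"], ["Slack"], ["Slack"], ["Zoom"], ["Box"]]
print(count_duplicates(rows))  # (3, ['Slack', 'Box'])
```

The first number counts every repeat beyond the first occurrence, which is the same quantity the loop above reports.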


In [None]:
print(android_header)
for row in android_data:
  if row[0] == "Slack":
    print(row)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51510', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']


In [None]:
print(android_header)
for row in android_data:
  if row[0] == "Instagram":
    print(row)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [None]:
android_data_dict = {}
for row in android_data:
  name = row[0]
  reviews = float(row[3])
  # Keep the first row seen for each app, then replace it only when a later
  # duplicate has more reviews (a higher count suggests a more recent snapshot).
  if name not in android_data_dict or reviews > float(android_data_dict[name][3]):
    android_data_dict[name] = row
# One row per app name remains.
android_data_clean = list(android_data_dict.values())
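
To convince myself the keep-the-highest-review-count rule behaves as intended, here is a small sketch run on the four Instagram rows, reduced to name and review count. `keep_most_reviewed` is a hypothetical helper that mirrors the cell above; it is not code from the guided project.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the one with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = int(row[reviews_index])
        if name not in best or reviews > int(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Review counts copied from the Instagram rows printed earlier; other columns omitted.
rows = [
    ["Instagram", "66577313"],
    ["Instagram", "66577446"],
    ["Instagram", "66577313"],
    ["Instagram", "66509917"],
]
print(keep_most_reviewed(rows, reviews_index=1))  # [['Instagram', '66577446']]
```

As a sanity check on the real data, 10,840 rows minus 1,181 duplicates should leave 9,659 rows in `android_data_clean`.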

### Stage 2
Our company's niche is free apps targeted at an English-speaking audience.

This means I am going to find and drop:
*   Apps with non-English names
*   Non-free apps

#### Stage 2a: Remove Non-English Apps

In [None]:
def is_english(string):
  non_ascii_count = 0
  for char in string:
    if ord(char) > 127:
      non_ascii_count += 1
  
  if non_ascii_count > 3:
    return False
  else:
    return True
print(is_english("TEST"))
print(is_english("TEST 測試"))
print(is_english("TEST TEST 測試 測試"))

True
True
False
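
The threshold of three non-ASCII characters exists so that names containing a trademark sign or an emoji are not thrown out. A compact sketch of the same rule follows; the `max_non_ascii` parameter is my addition, and the sample names are illustrative strings rather than rows pulled from the data.

```python
def is_english(string, max_non_ascii=3):
    """Treat a name as English unless it has more than max_non_ascii non-ASCII characters."""
    non_ascii = sum(1 for char in string if ord(char) > 127)
    return non_ascii <= max_non_ascii


print(is_english("Docs To Go™ Free Office Suite"))  # True: one non-ASCII character
print(is_english("Instachat 😜"))                    # True: one non-ASCII character
print(is_english("爱奇艺PPS -《欢乐颂2》电视剧热播"))  # False: many non-ASCII characters
```

This is a heuristic, not language detection: a non-English name written entirely in ASCII still passes, and a short name with four or more emoji is rejected.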


In [None]:
android_data_english = []
for row in android_data_clean:
  if is_english(row[0]):
    android_data_english.append(row)
print("Android Length Before:", len(android_data_clean))
print("Android Length After:", len(android_data_english))

ios_data_english = []
for row in ios_data:
  if is_english(row[1]):
    ios_data_english.append(row)
print("iOS Length Before:", len(ios_data))
print("iOS Length After:", len(ios_data_english))

Android Length Before: 9659
Android Length After: 9614
iOS Length Before: 7197
iOS Length After: 6183


#### Stage 2b: Remove Non-Free Apps

In [None]:
android_data_final = []
for row in android_data_english:
  if row[7] == "0":
    android_data_final.append(row)
print("Android Length Before:", len(android_data_english))
print("Android Length After:", len(android_data_final))

ios_data_final = []
for row in ios_data_english:
  if row[4] == "0.0":
    ios_data_final.append(row)
print("iOS Length Before:", len(ios_data_english))
print("iOS Length After:", len(ios_data_final))

Android Length Before: 9614
Android Length After: 8864
iOS Length Before: 6183
iOS Length After: 3222
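
Comparing the price against the exact strings `"0"` and `"0.0"` works for these two files, but it would silently miss a free app written as `"0.00"`. A slightly more defensive sketch parses the price first. `parse_price` and `is_free` are hypothetical helpers, and the `$` prefix on paid Android prices is an assumption about the file's format.

```python
def parse_price(text):
    """Convert a price string such as '0', '0.0', or '$4.99' to a float."""
    # Assumption: paid prices may carry a leading dollar sign.
    return float(text.strip().lstrip("$"))


def is_free(text):
    """Return True when the parsed price is zero."""
    return parse_price(text) == 0.0


for raw in ["0", "0.0", "0.00", "$4.99"]:
    print(raw, is_free(raw))
```

With this in place, the same filter could be used for both stores, for example `is_free(row[7])` for Android and `is_free(row[4])` for iOS.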


## Most Common Apps by Genre

In [None]:
def freq_table(dataset, index):
  table = {}
  count = 0
  for row in dataset:
    count += 1
    value = row[index]
    if value in table:
      table[value] += 1
    else:
      table[value] = 1
  
  for key in table:
    table[key] = (table[key] / count) * 100
  return table

In [None]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
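
For comparison, the two helpers above can be collapsed into one with `collections.Counter`, whose `most_common()` method already returns values sorted by count. This is a sketch of an alternative, not a replacement for the cells above; `freq_table_pct` is a hypothetical name and the rows are toy data.

```python
from collections import Counter


def freq_table_pct(dataset, index):
    """Return {value: percentage of rows}, ordered from most to least common."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: n / total * 100 for value, n in counts.most_common()}


# Toy rows for illustration only.
rows = [["a", "Games"], ["b", "Games"], ["c", "Music"], ["d", "Games"]]
print(freq_table_pct(rows, 1))  # {'Games': 75.0, 'Music': 25.0}
```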

In [None]:
display_table(ios_data_final, 11) # Prime Genre

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


In [None]:
display_table(android_data_final, 1) # Category

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

In [None]:
display_table(android_data_final, 9) # Genre

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

### Most Popular Apps by Genre on the Apple App Store

In [None]:
genres_ios = freq_table(ios_data_final, 11) # Prime Genre

for genre in genres_ios:
  total = 0
  len_genre = 0
  for app in ios_data_final:
    if app[11] == genre:
      total += float(app[5])
      len_genre += 1
  genres_ios[genre] = total / len_genre
for genre in sorted(genres_ios, key=genres_ios.get, reverse=True):
  print (genre, ":", genres_ios[genre])

Navigation : 86090.33333333333
Reference : 74942.11111111111
Social Networking : 71548.34905660378
Music : 57326.530303030304
Weather : 52279.892857142855
Book : 39758.5
Food & Drink : 33333.92307692308
Finance : 31467.944444444445
Photo & Video : 28441.54375
Travel : 28243.8
Shopping : 26919.690476190477
Health & Fitness : 23298.015384615384
Sports : 23008.898550724636
Games : 22788.6696905016
News : 21248.023255813954
Productivity : 21028.410714285714
Utilities : 18684.456790123455
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Business : 7491.117647058823
Education : 7003.983050847458
Catalogs : 4004.0
Medical : 612.0


In [None]:
print("Navigation Apps")
for app in ios_data_final:
  if app[11] == "Navigation":
    print(app)

print("\nReference Apps")
for app in ios_data_final:
  if app[11] == "Reference":
    print(app)

print("\nSocial Networking Apps")
for app in ios_data_final:
  if app[11] == "Social Networking":
    print(app)

Navigation Apps
['323229106', 'Waze - GPS Navigation, Maps & Real-time Traffic', '94139392', 'USD', '0.0', '345046', '3040', '4.5', '4.5', '4.24', '4+', 'Navigation', '37', '5', '36', '1']
['585027354', 'Google Maps - Navigation & Transit', '120232960', 'USD', '0.0', '154911', '1253', '4.5', '4.0', '4.31.1', '12+', 'Navigation', '37', '5', '34', '1']
['329541503', 'Geocaching®', '108166144', 'USD', '0.0', '12811', '134', '3.5', '1.5', '5.3', '4+', 'Navigation', '37', '0', '22', '1']
['504677517', 'CoPilot GPS – Car Navigation & Offline Maps', '82534400', 'USD', '0.0', '3582', '70', '4.0', '3.5', '10.0.0.984', '4+', 'Navigation', '38', '5', '25', '1']
['344176018', 'ImmobilienScout24: Real Estate Search in Germany', '126867456', 'USD', '0.0', '187', '0', '3.5', '0.0', '9.5', '4+', 'Navigation', '37', '5', '3', '1']
['463431091', 'Railway Route Search', '46950400', 'USD', '0.0', '5', '0', '3.0', '0.0', '3.17.1', '4+', 'Navigation', '37', '0', '1', '1']

Reference Apps
['282935706', 'Bibl

<Analyze the results and try to come up with at least one app profile recommendation for the App Store. Note that there's no fixed answer here, and it's perfectly fine if the app profile you recommended is different than the one recommended in the solution notebook.>

In [None]:
display_table(android_data_final, 5)

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343
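
The install counts are open-ended buckets stored as strings, so they need converting before any arithmetic. Here is a small sketch of that conversion using plain string methods; `parse_installs` is a hypothetical helper. Each result is the lower bound of its bucket, not an exact count.

```python
def parse_installs(text):
    """Convert '1,000,000+' style strings to an int lower bound."""
    return int(text.replace(",", "").replace("+", ""))


for raw in ["0", "0+", "10,000+", "1,000,000,000+"]:
    print(raw, "->", parse_installs(raw))
```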


In [None]:
import re
categories_android = freq_table(android_data_final, 1) # Category

for category in categories_android:
  total = 0
  len_category = 0
  for app in android_data_final:
    if app[1] == category:
      n_installs = re.sub("[^0-9]", "", app[5])
      total += int(n_installs)
      len_category += 1
  categories_android[category] = total / len_category

for category in sorted(categories_android, key=categories_android.get, reverse=True):
  print (category, ":", categories_android[category])

COMMUNICATION : 38456119.167247385
VIDEO_PLAYERS : 24727872.452830188
SOCIAL : 23253652.127118643
PHOTOGRAPHY : 17840110.40229885
PRODUCTIVITY : 16787331.344927534
GAME : 15588015.603248259
TRAVEL_AND_LOCAL : 13984077.710144928
ENTERTAINMENT : 11640705.88235294
TOOLS : 10801391.298666667
NEWS_AND_MAGAZINES : 9549178.467741935
BOOKS_AND_REFERENCE : 8767811.894736841
SHOPPING : 7036877.311557789
PERSONALIZATION : 5201482.6122448975
WEATHER : 5074486.197183099
HEALTH_AND_FITNESS : 4188821.9853479853
MAPS_AND_NAVIGATION : 4056941.7741935486
FAMILY : 3695641.8198090694
SPORTS : 3638640.1428571427
ART_AND_DESIGN : 1986335.0877192982
FOOD_AND_DRINK : 1924897.7363636363
EDUCATION : 1833495.145631068
BUSINESS : 1712290.1474201474
LIFESTYLE : 1437816.2687861272
FINANCE : 1387692.475609756
HOUSE_AND_HOME : 1331540.5616438356
DATING : 854028.8303030303
COMICS : 817657.2727272727
AUTO_AND_VEHICLES : 647317.8170731707
LIBRARIES_AND_DEMO : 638503.734939759
PARENTING : 542603.6206896552
BEAUTY : 51315
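
One caveat when reading these averages: a mean can be dominated by a handful of apps in the 1,000,000,000+ bucket. As a hedged cross-check, the median per category is less sensitive to such outliers. The sketch below is my own addition rather than a step from the guided project; `median_installs_by_category` is a hypothetical name and the rows are toy data.

```python
from collections import defaultdict
from statistics import median


def median_installs_by_category(rows, category_index, installs_index):
    """Return {category: median of the install lower bounds}."""
    groups = defaultdict(list)
    for row in rows:
        installs = int(row[installs_index].replace(",", "").replace("+", ""))
        groups[row[category_index]].append(installs)
    return {category: median(values) for category, values in groups.items()}


# Toy rows for illustration only; not taken from the real data set.
rows = [
    ["COMMUNICATION", "1,000,000,000+"],
    ["COMMUNICATION", "10,000+"],
    ["COMMUNICATION", "5,000+"],
    ["BEAUTY", "10,000+"],
    ["BEAUTY", "50,000+"],
]
print(median_installs_by_category(rows, 0, 1))
```

In the toy data the COMMUNICATION mean would be over 300 million because of one giant app, while the median is 10,000.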

<Analyze the results and try to come up with at least one app profile recommendation for Google Play. Remember, our aim is to recommend an app genre that shows potential for being profitable on both the App Store and Google Play. Note that there's no fixed answer here, and it's perfectly fine if the app profile you recommended is different than the one recommended in the solution notebook.>

<In this project, we went through a complete data science workflow:

* We started by clarifying the goal of our project.
* We collected relevant data.
* We cleaned the data to prepare it for analysis.
* We analyzed the cleaned data.

In the solution notebook, we concluded that taking a very popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store market. The markets are already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc. You might have reached a different conclusion, which is perfectly fine as long as you managed to build a data-driven argumentation for your recommendation.

These are a few next steps you could take:

* Analyze the frequency table for the Genre column of the Google Play data set, and see whether you can find useful patterns.
* Assume we could also make revenue via in-app purchases and subscriptions, and try to find out which genres seem to be liked the most by users — you could examine app ratings here.
* Refine your project using our data science project style guide.

If you're going to work on the next steps above independently, you'll almost inevitably face some problems like not knowing how to fix an error, or not knowing what code to write to perform a certain task. In situations like these, the best thing to do is to start with a Google search (or any other search engine). In most situations, there will always be people who already ran into the same kind of problem, and you'll be able to piggyback on the solution they came up with.

As you search for solutions to your problems, you'll notice that one particular site will constantly show up in the first few results of your query: Stack Overflow. The community on Stack Overflow is very active, and the answers you'll find there are almost always accurate and up-to-date. One important tip when you're searching on Google is to start with the word "python". For instance, if you want to find out how to remove a character from a string, search for "python how to remove a character from a string" (not just "how to remove a character from a string"); otherwise you'll most likely get results for other programming languages.>