# Profitable App Profiles for the App Store and Google Play Markets

In this project, I take the role of a data analyst for a company that builds Android and iOS mobile apps. We make our apps available on Google Play and the App Store, and every app we create is free to download. Since the company's only source of revenue is in-app ads, my goal is to help our developers understand which kinds of apps are likely to attract the most users.

## Opening our datasets


In [None]:
# Import the CSV reader once for both datasets
from csv import reader

# Opening the iOS apps dataset
opened_file = open("AppleStore.csv", encoding="utf8")
read_file = reader(opened_file)
apps_data_IOS = list(read_file)
opened_file.close()

# Opening the Android apps dataset
opened_file_2 = open("googleplaystore.csv", encoding="utf8")
read_file_2 = reader(opened_file_2)
apps_data_android = list(read_file_2)
opened_file_2.close()

## Columns That Could Help Us With Our Analysis and Exploring the Data
 - For the Google Play dataset, we have 10,841 rows and 13 columns in total. Some columns that could help us with our analysis are: 'Price', 'Category', 'App', 'Genres', 'Type', 'Installs' and 'Reviews'.
 - For the iOS apps dataset, we have 7,197 rows and 16 columns in total. Some columns that might be useful for our analysis are: 'track_name', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', and 'prime_genre'.

In [None]:
#Here we define the function 'explore_data()'
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset[1:]))
        print('Number of columns:', len(dataset[0]))

# Print the header row of the Android dataset, then the rows at indexes 2 and 3
print(apps_data_android[0])
print('\n')
explore_data(dataset=apps_data_android, start=2, end=4, rows_and_columns=True)

# Blank line to separate the output of the two datasets
print('\n')

# Print the header row of the iOS dataset, then the row at index 3
print(apps_data_IOS[0])
print("\n")
explore_data(dataset=apps_data_IOS, start=3, end=4, rows_and_columns=True)

## Deleting Wrong Data
 - According to the dataset's discussion section, one row of the Google Play dataset contains an error. In our list, which still includes the header row, that row sits at index 10473. Printing it shows a rating of 19, but the maximum rating on Google Play is 5, so I removed the row.
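
A check like this can also be automated instead of relying on a reported row number. The sketch below runs on made-up rows rather than the real dataset and flags any row whose rating is not a number between 0 and 5. The helper name `find_bad_ratings` and the sample rows are assumptions for illustration; it also assumes a header row at index 0 and the rating at index 2.

```python
def find_bad_ratings(dataset, rating_index=2, max_rating=5.0):
    """Return the indexes of rows whose rating is not a number in [0, max_rating].

    Assumes dataset[0] is the header row. Missing ratings ('NaN' or '') are
    skipped rather than flagged, since they are absent, not wrong.
    """
    bad_rows = []
    for index, row in enumerate(dataset[1:], start=1):
        raw = row[rating_index]
        if raw in ('', 'NaN'):
            continue
        try:
            rating = float(raw)
        except ValueError:
            # The value is not numeric at all
            bad_rows.append(index)
            continue
        if not 0 <= rating <= max_rating:
            bad_rows.append(index)
    return bad_rows


# Hypothetical sample rows (not taken from the real file)
sample = [
    ['App', 'Category', 'Rating'],
    ['Good App', 'TOOLS', '4.5'],
    ['Broken App', 'TOOLS', '19'],
    ['Unrated App', 'GAME', 'NaN'],
]
bad_rows = find_bad_ratings(sample)
print(bad_rows)
```

The returned indexes count the header row, so they can be used directly with `del` on the original list.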

In [None]:
# Print the row just before index 10473 for comparison
print(apps_data_android[10472])
print("\n")

# Print the faulty row, then remove it because its "Rating" is wrong
# Note: run this cell only once, otherwise a valid row gets deleted as well
print(apps_data_android[10473])
print("\n")
del apps_data_android[10473]

# After the deletion, the next row has moved up to index 10473
print(apps_data_android[10473])
print("\n")

# Print how many rows remain (header excluded)
print(len(apps_data_android[1:]))


## Removing Duplicate Entries: Part One

In [None]:
# Confirming that the Google Play dataset has multiple duplicate entries

# Entries for Instagram: running this code shows that Instagram appears four times
duplicate_entry_1 = []
for apps in apps_data_android:
    name = apps[0]
    if name == 'Instagram':
        duplicate_entry_1.append(name)
print(duplicate_entry_1)

print("\n")

# Entries for Clash of Clans: running this code shows that Clash of Clans appears four times
duplicate_entry_2 = []
for apps in apps_data_android:
    name = apps[0]
    if name == "Clash of Clans":
        duplicate_entry_2.append(name)
print(duplicate_entry_2)

print("\n")

# Entries for Amazon Kindle: running this code shows that Amazon Kindle appears two times
duplicate_entry_3 = []

for apps in apps_data_android:
    name = apps[0]
    if name == "Amazon Kindle":
        duplicate_entry_3.append(name)
print(duplicate_entry_3)


In the code above, we create an empty list outside the for loop and then append an app's name to it every time that name matches a specific string. If the list ends up with more than one element, we know the app has duplicate entries. When you run the code, you'll notice that Instagram and Clash of Clans each have four entries, meaning three extra entries each, while Amazon Kindle has two, meaning one extra entry.
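
The same check can be done for every app at once rather than one name at a time. The sketch below uses `collections.Counter` from the standard library on a made-up list of names; `sample_names` and `extra_entries` are illustrative names, not part of the project code.

```python
from collections import Counter

# Hypothetical sample of app names (the first column of each row)
sample_names = ['Instagram', 'Slack', 'Instagram', 'Amazon Kindle',
                'Instagram', 'Amazon Kindle', 'Instagram']

# Count how many times each name occurs
name_counts = Counter(sample_names)

# Keep only names that occur more than once, with the number of extra entries
extra_entries = {name: count - 1 for name, count in name_counts.items() if count > 1}
print(extra_entries)
```

On the real data, the names would come from something like `[row[0] for row in apps_data_android[1:]]`.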

In [None]:
# Counting number of duplicate apps for Google Play store apps

duplicate_apps_name = []
unique_apps_name = []

for apps in apps_data_android:
    name = apps[0]
    if name in unique_apps_name:
        duplicate_apps_name.append(name)
    else:
        unique_apps_name.append(name)
        
print("There are", len(duplicate_apps_name), "duplicate app names!")
print("\n")
print("Some of the app names are: ", duplicate_apps_name[:5])
print("\n")
print("There are", len(unique_apps_name), "unique app names!")
print("\n")
print("Some of the app names are: ", unique_apps_name[:5])

Now let's print some of the duplicate entries, along with the header row of the Google Play dataset, to decide how to remove them.

In [None]:
print(apps_data_android[0])
print("\n")
for apps in apps_data_android:
    name = apps[0]
    if name == "Instagram":
        print(apps)

When we examine the duplicate entries, we can see that the value in the 'Reviews' column differs from one duplicate to the next. A higher number of reviews suggests that the data was collected more recently, so for each app we'll keep the entry with the most reviews and delete the other duplicates.
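
One detail matters when picking the entry with the most reviews: values read from a CSV file are strings, and strings compare alphabetically rather than numerically. The sketch below shows the difference on made-up rows and then keeps the highest review count per name; `sample_rows` and `most_reviews` are illustrative names only.

```python
# Hypothetical duplicate rows: [name, reviews]
sample_rows = [
    ['Example App', '9'],
    ['Example App', '10'],
    ['Other App', '250'],
]

# Comparing the raw strings is alphabetical, so '9' sorts after '10'
print('9' > '10')                 # string comparison
print(float('9') > float('10'))   # numeric comparison

most_reviews = {}
for name, reviews in sample_rows:
    n_reviews = float(reviews)
    # Keep the largest review count seen so far for each name
    if name not in most_reviews or n_reviews > most_reviews[name]:
        most_reviews[name] = n_reviews

print(most_reviews)
```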

## Removing Duplicate Entries: Part Two

In [None]:
reviews_max = {}

for apps in apps_data_android[1:]:
    name = apps[0]
    # Convert to a number so that reviews are compared numerically, not as text
    n_reviews = float(apps[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

print(len(reviews_max))

Earlier we counted 1,181 duplicate entries. The dictionary keeps one entry per app name, mapped to that app's highest number of reviews, so its length tells us how many unique apps there are; it should equal the number of rows in the dataset minus the 1,181 duplicates.

In [None]:
android_clean = []
already_added = []

for apps in apps_data_android[1:]:
    name = apps[0]
    n_reviews = float(apps[3])
    if (n_reviews == reviews_max[name]) and name not in already_added:
        android_clean.append(apps)
        already_added.append(name)

# Print only a short sample and the length; printing the full lists floods the output
print(android_clean[:3])
print(already_added[:3])
print(len(android_clean))

- First, we created two empty lists and assigned them to the variables 'android_clean' and 'already_added', where 'android_clean' holds our cleaned dataset and 'already_added' holds the names of the apps we have already kept.
- After that, we looped over the dataset, skipping the header row. For each row, we assigned the app's name to a variable named 'name' and its number of reviews to a variable named 'n_reviews'.
- Then we used an if statement: if 'n_reviews' is equal to the number of reviews stored in our dictionary (reviews_max[name]) and 'name' is not already in the list 'already_added', we append the row to 'android_clean' and the app's name to 'already_added'. The second condition matters because two duplicates can share the same highest number of reviews.


## Removing Non-English Apps: Part One

Our company is only interested in developing apps for an English-speaking audience. Before we analyze the data, we need to remove the non-English apps so that our results reflect the apps our company would actually build.

In [None]:
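def is_english(string):
    # Iterating over a string yields one character at a time
    for character in string:
        # Code points above 127 fall outside the ASCII range
        if ord(character) > 127:
            return False

    return True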
check_word_number_1 = is_english(string = 'Instagram')
print(check_word_number_1)
print("\n")
check_word_number_2 = is_english(string = '爱奇艺PPS -《欢乐颂2》电视剧热播')
print(check_word_number_2)
print("\n")
check_word_number_3 = is_english(string = 'Docs To Go™ Free Office Suite')
print(check_word_number_3)
print("\n")
check_word_number_4 = is_english(string = 'Instachat 😜')
print(check_word_number_4)
print("\n")
print(ord("™"))
print("\n")
print(ord("😜"))


Above we created a function that checks whether an app name is English. It works for the first two names, but it wrongly rejects the names that contain ™ and 😜. The code points of these characters (8482 and 128540) fall outside the 0 to 127 range that the American Standard Code for Information Interchange (ASCII) uses for common English text, even though the app names themselves are English. In the next cell we will fix this!
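
To see exactly which characters cause a name to be rejected, it can help to list them next to their code points. This is a small sketch for inspection only; the helper name `is_ascii_only` is made up here and is not part of the project code.

```python
def is_ascii_only(text):
    """Return True when every character has a code point in the range 0-127."""
    return all(ord(character) <= 127 for character in text)


for name in ['Instagram', 'Docs To Go™ Free Office Suite', 'Instachat 😜']:
    # Collect the characters that fall outside the ASCII range
    outside = [(character, ord(character)) for character in name if ord(character) > 127]
    print(name, '->', is_ascii_only(name), outside)
```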

## Removing Non-English Apps: Part Two

In [None]:
def is_english(string):
    non_ascii = 0
    
    for word in string:
        if ord(word) > 127:
            non_ascii += 1
    
    if non_ascii > 3:
        return False
    else:
        return True

print(is_english("Docs To Go™ Free Office Suite"))
print(is_english("Instachat 😜"))

- Above we edited the function from the previous cell to account for things like emojis and the trademark symbol.
- The function now counts the characters that fall outside the ASCII range (0-127). A name with up to three such characters is still treated as English; the function returns 'False' only when there are more than three.
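
The threshold of three is a judgment call, so it can be convenient to make it a parameter. The following is a minimal sketch of that idea; the names `count_non_ascii`, `is_probably_english`, and `max_non_ascii` are assumptions introduced for illustration.

```python
def count_non_ascii(text):
    """Count the characters whose code point is above 127."""
    return sum(1 for character in text if ord(character) > 127)


def is_probably_english(text, max_non_ascii=3):
    """Treat a name as English when it has at most max_non_ascii non-ASCII characters."""
    return count_non_ascii(text) <= max_non_ascii


print(is_probably_english('Docs To Go™ Free Office Suite'))
print(is_probably_english('Instachat 😜'))
print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
```

This is still only a heuristic: an English name with four emojis would be rejected, and a non-English name written mostly in ASCII characters would be accepted.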



In [None]:
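android_english = []
IOS_english = []

for apps in android_clean:
    name = apps[0]
    if is_english(name):
        android_english.append(apps)

# Skip the header row so that both lists contain only app rows
for apps in apps_data_IOS[1:]:
    name = apps[1]
    if is_english(name):
        IOS_english.append(apps)

# explore_data() prints its output itself, so it is not wrapped in print().
# Neither list has a header row, so the row counts come from len() directly.
explore_data(android_english, 3, 4)
print('Number of rows:', len(android_english))
print('\n')
explore_data(IOS_english, 1, 2)
print('Number of rows:', len(IOS_english))

We used the is_english() function to filter out more apps, and we ended up with 9,614 Google Play apps and 6,183 App Store apps.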

In [None]:
print(apps_data_android[0])
print("\n")
print(apps_data_IOS[0])
print("\n")

android_final_data = []
IOS_final_data = []

for apps in android_english:
    price = apps[7]
    if price == "0":
        android_final_data.append(apps)
#print(android_final_data)

for apps in IOS_english:
    price = apps[4]
    if price == "0.0":
        IOS_final_data.append(apps)
#print(IOS_final_data)
print("The length of the Google Play data set is",len(android_final_data),".")
print("\n")
print("The length of the App store data set is",len(IOS_final_data),'.')

After finishing our data cleaning process, we ended up with 8,862 Google Play apps and 3,222 App Store apps. We started with more than 10,000 Google Play apps and more than 7,000 App Store apps!
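
The exact-match checks against "0" and "0.0" work only as long as every free app writes its price exactly that way. A slightly more tolerant option is to convert the price to a number first. The sketch below assumes that prices look like '0', '0.0' or '$4.99'; both that format and the helper name `is_free` are assumptions, not something taken from the datasets.

```python
def is_free(price_text):
    """Return True when a price string represents zero.

    Assumes prices look like '0', '0.0' or '$4.99'; a leading '$' is dropped
    before converting to a number. Non-numeric text raises ValueError.
    """
    cleaned = price_text.strip().lstrip('$')
    return float(cleaned) == 0.0


for price in ['0', '0.0', '$4.99', '1.99']:
    print(price, '->', is_free(price))
```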

## Most Common Apps by Genre: Part One

As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users, because our revenue depends heavily on the number of people using our apps. To keep risk low, we follow a three-step process to find out whether an app is successful:
- First, we build a minimal Android version of the app and add it to Google Play.
- If the app gets a good response from users, we develop it further.
- Lastly, if the app is still profitable after six months, we build an iOS version and release it on the App Store too.

Since our end goal is to release apps on both Google Play and the App Store, we need to find app profiles that are successful in both markets. We will start by building frequency tables for the 'prime_genre' column of the App Store dataset and the 'Genres' and 'Category' columns of the Google Play dataset.


## Most Common Apps by Genre: Part Two

In [None]:
#Printing out header row
print(apps_data_android[0])
print("\n")
print(apps_data_IOS[0])
print("\n")

In [None]:
# Creating a function which allows us to create a frequency dictionary
def freq_table(dataset, index):
    table = {}
    total_number = 0
    
    for row in dataset:
        total_number = total_number +1
        value = row[index]
        if value in table:
            table[value] = table[value]+1
        else:
            table[value] = 1
    # Creating the percentage dictionary
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total_number) * 100
        table_percentages[key] = percentage 
    
    return table_percentages

# Provided helper function: prints a frequency table in descending order
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0]) 

# Printing the table for the iOS dataset ('prime_genre' column)
# display_table() prints its output itself, so it is not wrapped in print()
display_table(dataset=IOS_final_data, index=11)
print("\n")

# Printing the table for the Google Play data ('Genres' column)
display_table(dataset=android_final_data, index=9)
print("\n")

# Printing the table for the Google Play data ('Category' column)
display_table(dataset=android_final_data, index=1)

Above we built two functions:
- One function to generate frequency tables that show percentages
- Another function we can use to display the percentages in descending order

In the next section, we will start our analysis!
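
The same two steps, counting and then sorting by percentage, can be sketched more compactly with `collections.Counter` from the standard library. This is an alternative shown on made-up data, not a replacement for the functions above; `percentage_table` and `sample_genres` are illustrative names.

```python
from collections import Counter


def percentage_table(values):
    """Return (value, percentage) pairs sorted from most to least frequent."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(value, count / total * 100) for value, count in counts.most_common()]


# Hypothetical genre column, not taken from the real dataset
sample_genres = ['Games', 'Games', 'Education', 'Games', 'Music', 'Education',
                 'Games', 'Games']

for genre, percentage in percentage_table(sample_genres):
    print(genre, ':', round(percentage, 2))
```

On the real data, the input would be one column of the cleaned dataset, for example `[row[11] for row in IOS_final_data]`.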

## Most Common Apps by Genre: Part Three

In [None]:
# Printing the table for the iOS dataset ('prime_genre' column)

display_table(dataset=IOS_final_data, index=11)

- We can see that among the free English apps, more than half (58.16%) are games. Entertainment apps are close to 8%, followed by photo and video apps, which are close to 5%. Only 3.66% of the apps are designed for education, followed by social networking apps, which account for 3.29% of the apps in our dataset.
- The general impression is that the App Store (at least the part containing free English apps) is dominated by apps designed for fun (games, entertainment, photo and video, social networking, sports, music, etc.), while apps with practical purposes (education, shopping, utilities, productivity, lifestyle, etc.) are rarer. However, the fact that fun apps are the most numerous doesn't imply that they also have the greatest number of users: demand might not match supply.
- Next, we will analyze the Google Play dataset.

In [None]:
# Printing the table for the Google Play data ('Category' column)

display_table(dataset=android_final_data, index=1)

In the Google Play dataset, Family is the largest category, followed by Games at 9.69%; Productivity apps account for 3.89%. Browsing the Family category on Google Play shows that most of its apps are games for kids.

![Image](https://camo.githubusercontent.com/9bf24b9efc3d88a3d55f5c09e314987941f0bab5/68747470733a2f2f73332e616d617a6f6e6177732e636f6d2f64712d636f6e74656e742f3335302f7079316d385f66616d696c792e706e67)

So although the Family category makes up about 19% of the apps, much of it consists of children's games. Overall, the Google Play dataset looks more balanced than the App Store dataset, because no single category dominates all the others.

In [None]:
# Printing the table for the Google Play data ('Genres' column)

display_table(dataset=android_final_data, index=9)

Looking at the 'Genres' column, the difference between it and the 'Category' column in the Google Play dataset isn't obvious, but one thing is clear: 'Genres' has many more distinct values than either 'prime_genre' in the App Store dataset or 'Category' in the Google Play dataset. In conclusion, the App Store dataset is dominated by one category (games), while the Google Play dataset is more balanced.

## Most Popular Apps by Genre on the App Store

In [None]:
#Used the previously built function to generate a frequency table
IOS_genres = freq_table(IOS_final_data, -5)

# Using a for loop to compute the average number of user ratings for each genre
for genre in IOS_genres:
    total = 0
    len_genre = 0
    for apps in IOS_final_data:
        genre_apps = apps[-5]
        if genre_apps == genre:            
            number_of_ratings = float(apps[5])
            total += number_of_ratings
            len_genre += 1
    # Compute and print the average number of user ratings for this genre
    avg_number_of_ratings = total / len_genre
    print(genre, ':', avg_number_of_ratings)

The genre with the highest average number of user ratings is Navigation. Let's list the Navigation apps and see which ones are there.
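
The nested loop above rescans the whole dataset once for every genre. The averages can also be computed in a single pass by keeping a running total and a count per genre. The sketch below shows this on made-up rows; `sample_apps` and its numbers are illustrative, not values from the App Store dataset.

```python
from collections import defaultdict

# Hypothetical rows: (genre, number of ratings)
sample_apps = [
    ('Navigation', 300000),
    ('Navigation', 100),
    ('Games', 5000),
    ('Games', 7000),
    ('Games', 3000),
]

totals = defaultdict(float)   # sum of rating counts per genre
counts = defaultdict(int)     # number of apps per genre

for genre, n_ratings in sample_apps:
    totals[genre] += n_ratings
    counts[genre] += 1

averages = {genre: totals[genre] / counts[genre] for genre in totals}

# Print genres from highest to lowest average
for genre, average in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', average)
```

Sorting the output also makes the top genre easy to spot without reading the whole list.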


In [None]:
for apps in IOS_final_data:
    if apps[-5] == 'Navigation':
        print(apps[1], ':', apps[5])

Once we list the Navigation apps, we see that the average is heavily influenced by Waze and Google Maps; the other apps have very few ratings by comparison.

## Most Popular Apps by Genre on Google Play

In [None]:
# Printing header row
print(apps_data_android[0])
print("\n")

# Creating a frequency table
android_categories = freq_table(android_final_data, 1)

for category in android_categories:
    total = 0
    len_category = 0
    for apps in android_final_data:
        category_app = apps[1]
        if category_app == category:            
            n_installs = apps[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_number_of_installs = total / len_category
    print(category, ':', avg_number_of_installs)
print("\n")
    
# Printing the communication apps with at least 100 million installs
for apps in android_final_data:
    if apps[1] == 'COMMUNICATION' and (apps[5] == '1,000,000,000+'
                                      or apps[5] == '500,000,000+'
                                      or apps[5] == '100,000,000+'):
        print(apps[0], ':', apps[5])

Communication apps have the most installs on average: 38,456,119. This number is heavily skewed by a few apps with over one billion installs (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts) and a few others with over 100 and 500 million installs. This doesn't seem like a good app profile to recommend to our developers, because the category is already crowded and dominated by a few giants.
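
One way to see how strongly a few giants skew an average is to compute it twice, once with all apps and once without the apps at or above 100 million installs. The sketch below does this on made-up install counts; `parse_installs` and `sample_installs` are illustrative names, and the 100 million cutoff is an assumption chosen to match the filter used above.

```python
def parse_installs(installs_text):
    """Convert a string such as '1,000,000+' to the integer 1000000."""
    return int(installs_text.replace(',', '').replace('+', ''))


# Hypothetical install counts for one category
sample_installs = ['1,000,000,000+', '500,000,000+', '10,000,000+',
                   '1,000,000+', '500,000+', '100,000+']

all_values = [parse_installs(text) for text in sample_installs]
# Leave out apps at or above 100 million installs to see how much they skew the mean
without_giants = [value for value in all_values if value < 100_000_000]

mean_all = sum(all_values) / len(all_values)
mean_without_giants = sum(without_giants) / len(without_giants)

print(mean_all)
print(mean_without_giants)
```

A large gap between the two means suggests that the category average says more about a handful of apps than about a typical app.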


## More Analysis

Let's explore more app profiles so we can decide which kinds of apps could be most profitable.


In [None]:
# Exploring the Books and Reference category
for apps in android_final_data:
    if apps[1] == 'BOOKS_AND_REFERENCE':
        print(apps[0], ':', apps[5])

The Books and Reference category includes a wide range of apps, such as tools for processing and reading ebooks, various library collections, dictionaries, programming or language tutorials, and so on. There appears to be a small number of highly popular apps that distort the average. Next, let's list the most-installed apps in this category.

In [None]:
# Listing the Books and Reference apps with at least 100 million installs
for apps in android_final_data:
    if apps[1] == 'BOOKS_AND_REFERENCE' and (apps[5] == '1,000,000,000+'
                                            or apps[5] == '500,000,000+'
                                            or apps[5] == '100,000,000+'):
        print(apps[0], ':', apps[5])

It appears that there are only a few extremely successful applications, indicating that there is still room for growth in this genre/category. 


## Conclusion

In this project, we analyzed data about App Store and Google Play apps with the goal of recommending an app profile that can be profitable in both markets. I concluded that building an app in the Books and Reference genre/category could be profitable on both the App Store and Google Play. The category does not appear to be crowded with extremely popular apps, which suggests that there is room for growth. One thing to note is that our app should offer something unique that other apps don't have, for example daily quotes, quizzes, and a section for book reviews where people can voice their opinions.