# Project DataQuest 01 (DRAFT)

This is the first of several self guided projects from [DataQuest](https://dataquest.io). The goal of this project to show what type of apps are likely to attract more users on Google Play and the App Store.

* [Dataset for Google Play Store](https://www.kaggle.com/datasets/lava18/google-play-store-apps)
* [Dataset for Apple Store](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps)

We kick this off by downloading both datasets:

In [14]:
!wget -O googleplaystore.csv https://dq-content.s3.amazonaws.com/350/googleplaystore.csv
!wget -O AppleStore.csv https://dq-content.s3.amazonaws.com/350/AppleStore.csv

--2024-08-05 19:54:45--  https://dq-content.s3.amazonaws.com/350/googleplaystore.csv
Resolving dq-content.s3.amazonaws.com (dq-content.s3.amazonaws.com)... 3.5.30.157, 52.217.121.241, 3.5.17.187, ...
Connecting to dq-content.s3.amazonaws.com (dq-content.s3.amazonaws.com)|3.5.30.157|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1349314 (1.3M) [text/csv]
Saving to: ‘googleplaystore.csv’


2024-08-05 19:54:46 (3.08 MB/s) - ‘googleplaystore.csv’ saved [1349314/1349314]

--2024-08-05 19:54:46--  https://dq-content.s3.amazonaws.com/350/AppleStore.csv
Resolving dq-content.s3.amazonaws.com (dq-content.s3.amazonaws.com)... 3.5.2.146, 52.217.121.241, 3.5.17.187, ...
Connecting to dq-content.s3.amazonaws.com (dq-content.s3.amazonaws.com)|3.5.2.146|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 725761 (709K) [text/csv]
Saving to: ‘AppleStore.csv’


2024-08-05 19:54:47 (2.59 MB/s) - ‘AppleStore.csv’ saved [725761/725761]



Now let's open the datasets and take a look at the first few rows of each dataset. The following function will help us with that:

In [15]:
def explore_data(dataset, start=0, end=1, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')  # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

We now start converting our data into a list of lists and then explore the data. 

In [16]:
play_store_buffer = open('googleplaystore.csv')
app_store_buffer = open('AppleStore.csv')

from csv import reader

play_store_raw = reader(play_store_buffer)
app_store_raw = reader(app_store_buffer)

play_store_ds = list(play_store_raw)
app_store_ds = list(app_store_raw)

Here is an exploration of the first three rows of the Google Play Store dataset:

In [17]:
explore_data(play_store_ds, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


Columns that currently exist for play_store dataset:

| Index | Column Name      | Description                                                           |
| ----- | ---------------- | --------------------------------------------------------------------- |
| 0     | App              | The name of the app.                                                  |
| 1     | Category         | The category to which the app belongs.                                |
| 2     | Rating           | The average user rating of the app out of 5.                          |
| 3     | Reviews          | The total number of user reviews.                                     |
| 4     | Size             | The size of the app.                                                  |
| 5     | Installs         | The total number of times the app has been installed.                 |
| 6     | Type             | Indicates whether the app is free or paid.                            |
| 7     | Price            | The cost of the app (usually 0 for free apps).                        |
| 8     | Content Rating   | The age rating for the app (e.g., Everyone, Teen, Mature 17+).        |
| 9     | Genres           | The genres associated with the app.                                   |
| 10    | Last Updated     | The date when the app was last updated.                               |
| 11    | Current Ver      | The current version of the app.                                       |
| 12    | Android Ver      | The minimum Android version required to run the app.                  |


In [18]:
explore_data(app_store_ds, 0, 4, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7198
Number of columns: 16


Columns that currently exist for app_store dataset:

| Index | Column Name        | Description                                                                 |
| ----- | ------------------ | --------------------------------------------------------------------------- |
| 0     | id                 | Unique identifier for the app.                                              |
| 1     | track_name         | The name of the app.                                                        |
| 2     | size_bytes         | The size of the app in bytes.                                               |
| 3     | currency           | The currency in which the price is listed.                                  |
| 4     | price              | The cost of the app.                                                        |
| 5     | rating_count_tot   | The total number of user ratings.                                           |
| 6     | rating_count_ver   | The number of user ratings for the current version.                         |
| 7     | user_rating        | The average user rating out of 5.                                           |
| 8     | user_rating_ver    | The average user rating for the current version out of 5.                   |
| 9     | ver                | The current version of the app.                                             |
| 10    | cont_rating        | The age rating for the app (e.g., 4+, 9+, 12+).                             |
| 11    | prime_genre        | The main category of the app.                                               |
| 12    | sup_devices.num    | The number of devices the app supports.                                     |
| 13    | ipadSc_urls.num    | The number of iPad screenshots available.                                   |
| 14    | lang.num           | The number of languages the app supports.                                   |
| 15    | vpp_lic            | Indicates if the app is available for Volume Purchase Program (VPP) licensing. |


## Cleaning play store data

First thing we need to do is clean the play store and app store of it's:
- duplicates
- non-free
- non-english
- incompatible data

First we look for "lengths", meaning rows that do not match the header length. This is corrupt data. Once we clean it, we print the length of play_store_ds. Once we see that the row count went down by one, we can move on to cleaning duplicates. Duplicates of App Name constitute bad data. We remove them destructively in favor of the first instance of the name in our dataset. Next, we isolate free apps from this dataset, also destructive against non-free apps. We look for text `Free` for applicable rows. Finally, we loop through text and, if we find non-English characters, we delete said row. The next set of code will illustrate this.

We repeat the process for the App Store, which should be cleaned in the same way. What we see is a drastic reduction in rows, with playstore starting at 10842 records and appstore starting at 7197 records with both decreasing to 4783 and 2263 records respectively.

In [19]:
def clean_lengths(ds):
    """Finds mismatched row lenghts. This is a destructive action!
    """
    for idx in range(len(ds)):
        if 0 < idx < len(ds):
            if len(ds[0]) != len(ds[idx]):
                del ds[idx]

def clean_dupes(ds, dupe_idx):
    """Cleans up rows with duplicate value. This is a destructive action!
    """
    seen = set()
    for idx in range(len(ds)):
        if 0 < idx < len(ds):
            dupe = ds[idx][dupe_idx]
            if dupe in seen:
                del ds[idx]
            else:
                seen.add(dupe)

def clean_non_free(ds, price_idx):
    """Finds non-free items based on price index provided. This is a destructive action!
    """
    # Given a price index, remove non-free items
    for idx in range(len(ds)):
        if 0 < idx < len(ds):
            price = ds[idx][price_idx]
            if price in ('0', '0.0', 'Free'):
                del ds[idx]

def clean_non_english(ds, text_idx):
    """Finds non-english items based on a textual index provided. This is a destructive action!
    """
    for idx in range(len(ds)):
        if 0 < idx < len(ds):
            text = ds[idx][text_idx]
            for t in text:
                if(ord(t)) > 127:
                    try:
                        del ds[idx]
                    except:
                        pass

In [20]:
# We clean the playstore data
explore_data(play_store_ds, 0, 0, True)
clean_lengths(play_store_ds)
print("Cleaning lengths:")
explore_data(play_store_ds, 0, 0, True)
print("Cleaning dupes:")
clean_dupes(play_store_ds, 0)
explore_data(play_store_ds, 0, 0, True)
print("Cleaning non-free:")
clean_non_free(play_store_ds, 6)
explore_data(play_store_ds, 0, 0, True)
print("Cleaning non-english:")
clean_non_english(play_store_ds, 0)
explore_data(play_store_ds, 0, 3, True)

Number of rows: 10842
Number of columns: 13
Cleaning lengths:
Number of rows: 10841
Number of columns: 13
Cleaning dupes:
Number of rows: 10033
Number of columns: 13
Cleaning non-free:
Number of rows: 5252
Number of columns: 13
Cleaning non-english:
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 4783
Number of columns: 13


Let's explore the cleaned data. We will start with the Apple Store data. We start by:

- exploring
- deleting nonconforming data
- more exploration
- clean duplicates
- more exploration
- isolate free apps
- more exploration
- isolate English apps

We do this for both Google Play Store and Apple Store datasets.

In [21]:
# We clean the playstore data
explore_data(app_store_ds, 0, 0, True)
print("Cleaning lengths:")
clean_lengths(app_store_ds)
explore_data(app_store_ds, 0, 0, True)
print("Cleaning dupes:")
clean_dupes(app_store_ds, 1)
explore_data(app_store_ds, 0, 0, True)
print("Cleaning non-free:")
clean_non_free(app_store_ds, 4)
explore_data(app_store_ds, 0, 0, True)
print("Cleaning non-english:")
clean_non_english(app_store_ds, 1)
explore_data(app_store_ds, 0, 3, True)

Number of rows: 7198
Number of columns: 16
Cleaning lengths:
Number of rows: 7198
Number of columns: 16
Cleaning dupes:
Number of rows: 7196
Number of columns: 16
Cleaning non-free:
Number of rows: 4620
Number of columns: 16
Cleaning non-english:
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 3018
Number of columns: 16


## Analysis

We want to start getting some useful insights from our data. Let's start by looking at the most common genres for each market. We will do this by creating a frequency table for the prime_genre column of the App Store dataset and the Genres and Category columns of the Google Play Store dataset.

In [22]:
def freq_table(ds, idx):
    table = {}
    total = 0
    for row in ds[1:]:
        total += 1
        if row[idx] in table:
            table[row[idx]] += 1
        else:
            table[row[idx]] = 1
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage
    return table_percentages

In [23]:
freq = freq_table(app_store_ds, 11)
freq

{'Photo & Video': 5.63473649320517,
 'Games': 55.61816373881339,
 'Social Networking': 1.690420947961551,
 'Health & Fitness': 2.8505137553861455,
 'Weather': 1.2595293337752733,
 'Shopping': 0.8617832283725556,
 'News': 0.8617832283725556,
 'Navigation': 0.49718263175339744,
 'Entertainment': 6.662247265495526,
 'Finance': 0.563473649320517,
 'Travel': 0.6960556844547563,
 'Reference': 0.8286377195889957,
 'Sports': 1.2263838249917136,
 'Productivity': 3.513423931057342,
 'Music': 2.485913158766987,
 'Food & Drink': 0.8617832283725556,
 'Lifestyle': 1.3589658601259529,
 'Utilities': 3.3476963871395427,
 'Education': 7.1594298972489225,
 'Book': 0.596619158104077,
 'Business': 0.9612197547232351,
 'Medical': 0.39774610540271793,
 'Catalogs': 0.06629101756711965}

As we can see, the most popular genre for apps on Apple Store is `Games`. Now let's see the same for Google Play Store:

In [24]:
freq = freq_table(play_store_ds, 1)
freq

{'ART_AND_DESIGN': 0.6482643245503974,
 'AUTO_AND_VEHICLES': 0.7737348389795065,
 'BEAUTY': 0.5018820577164366,
 'BOOKS_AND_REFERENCE': 1.8820577164366372,
 'BUSINESS': 4.224173985780008,
 'COMICS': 0.37641154328732745,
 'COMMUNICATION': 3.220409870347135,
 'DATING': 1.9238812212463405,
 'EDUCATION': 1.3383521539104977,
 'ENTERTAINMENT': 1.2128816394813886,
 'EVENTS': 0.6482643245503974,
 'FINANCE': 3.408615641990799,
 'FOOD_AND_DRINK': 1.0455876202425762,
 'HEALTH_AND_FITNESS': 3.1994981179422837,
 'HOUSE_AND_HOME': 0.6900878293601004,
 'LIBRARIES_AND_DEMO': 0.8155583437892095,
 'LIFESTYLE': 3.0112923462986196,
 'GAME': 10.623170221664575,
 'FAMILY': 19.259723964868254,
 'MEDICAL': 5.060644081974069,
 'SOCIAL': 2.300292764533668,
 'SHOPPING': 1.944792973651192,
 'PHOTOGRAPHY': 3.304056879966541,
 'SPORTS': 3.617733166039314,
 'TRAVEL_AND_LOCAL': 2.2166457549142615,
 'TOOLS': 8.07193642827269,
 'PERSONALIZATION': 3.9732329569217897,
 'PRODUCTIVITY': 4.077791718946048,
 'PARENTING': 0.5

Two observations between App Store and Play Store are that the most popular genre for apps on App Store is `Games` while the most popular genre for apps on Play Store is `Family` followed by `Games`.

Out of all the genres, which one has the highest rating? Let's look at the dataset and understand ratings. 

In [25]:
# Get the average rating per genre
def avg_rating(ds, genre_idx, rating_idx):
    table = {}
    for row in ds[1:]:
        genre = row[genre_idx]
        rating_str = row[rating_idx]
        # Replace all instances of +,$ in the string and convert to float
        rating = float(rating_str.replace('+', '').replace(',', '').replace('$', ''))
        if genre in table:
            table[genre][0] += rating
            table[genre][1] += 1
        else:
            table[genre] = [rating, 1]
    for key in table:
        table[key].append(table[key][0] / table[key][1])
    return table

In [26]:
app_store_ratings = avg_rating(app_store_ds, 11, 7)
app_store_ratings_max = max(
    app_store_ratings, key=lambda x: app_store_ratings[x][2])
print("Ratings", app_store_ratings)
print("App Store Highest rated genre", app_store_ratings_max)
print("App Store Highest rated genre score", app_store_ratings[app_store_ratings_max][2])

play_store_rating = avg_rating(play_store_ds, 1, 5)
play_store_rating_max = max(
    play_store_rating, key=lambda x: play_store_rating[x][2])
print("Play Store Highest rating count genre", play_store_rating_max)
print("Play Store Highest rating count genre average", play_store_rating[play_store_rating_max][0])

Ratings {'Photo & Video': [690.0, 170, 4.0588235294117645], 'Games': [6977.0, 1678, 4.15792610250298], 'Social Networking': [186.0, 51, 3.6470588235294117], 'Health & Fitness': [364.0, 86, 4.232558139534884], 'Weather': [151.5, 38, 3.986842105263158], 'Shopping': [108.5, 26, 4.173076923076923], 'News': [84.5, 26, 3.25], 'Navigation': [54.0, 15, 3.6], 'Entertainment': [735.0, 201, 3.656716417910448], 'Finance': [60.0, 17, 3.5294117647058822], 'Travel': [81.0, 21, 3.857142857142857], 'Reference': [103.5, 25, 4.14], 'Sports': [125.5, 37, 3.391891891891892], 'Productivity': [427.5, 106, 4.033018867924528], 'Music': [312.5, 75, 4.166666666666667], 'Food & Drink': [93.5, 26, 3.5961538461538463], 'Lifestyle': [144.0, 41, 3.5121951219512195], 'Utilities': [369.5, 101, 3.6584158415841586], 'Education': [824.5, 216, 3.8171296296296298], 'Book': [73.0, 18, 4.055555555555555], 'Business': [119.0, 29, 4.103448275862069], 'Medical': [47.0, 12, 3.9166666666666665], 'Catalogs': [8.0, 2, 4.0]}
App Stor

In the app store we see a high average rating of 4.23 while the play store has Social rating count (meaning amount of people who found time in their day to write something about the app) is 53562688. 

## Conclusion

The following output show's where our data is leading us.

* App Store Highest rated genre Health & Fitness
* App Store Highest rated genre score 4.232558139534884
* Play Store Highest rating count genre SOCIAL
* Play Store Highest rating count genre average 5891895686.0

If the goal is to make an app that is popular, the best bet is to make a social app for Google Play Store. If the goal is to make an app that is highly rated, the best bet is to make a Health & Fitness app for the Apple Store. But if you want success in both, you should make a social health and fitness app for both platforms.

Of course, this is simplistic, but the assignment was rather sparse. The data is not very deep, and the analysis is not very deep. But it is a good start to understanding the data and what it can tell us.