# App Store Analysis

## Using Data Analytics to Inform Business Strategy

## 1. Project Design

### 1.1 Company Background

The company XYZ is in the business of building Android and iOS mobile apps. These apps are free to download and install, and they are distributed through Google Play Store and the Apple App Store.

The company's main source of revenue is in-app advertising. This posits the business model as volume driven - scale in terms of the number of users becomes very important. The more the number of users who see and engage with the ads, proportionally better is the revenue opportunity.

### 1.2 Business Challenge

The senior management team is meeting for the annual strategy event to decide on allocation of resources and future app development roadmap. The team is seeking inputs from the business strategy group that will help the company maximize return-on-investment(ROI) opportunities. 

### 1.3 Project Scope

Our goal for this project is to offer actionable insights that are backed by data. Based on our understanding of the company's business model, we know that the biggest driver of ROI is the number of users for an app - the revenue opportunity is directly proportional. We will focus our exploration on this topic. 

Our project scope is to analyze app store data and identify the type of apps that are likely to attract more users. Such actionable intelligence can help optimize revenue and the company can focus on creating the kind of apps that are popular.

Our key requirements are as follows:

- We are interested in free apps only
- We are interested in apps in English language only


### 1.4 Sources of data

Apple Store Data: https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps. This dataset contains ~7000 iOS apps as of July 2017.

Google Play Store Data: https://www.kaggle.com/lava18/google-play-store-apps. This dataset contains approximately 10,000 Android apps as of August 2018.

## 2. Data Preparation

### 2.1 Open Apple Play Store and Google Play Store data sets

In [60]:
#Open Google Play Store and Apple Store datasets and save them as list of lists.

from csv import reader

#Apple Store Dataset
opened_file = open("./AppleStore.csv")
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]

#Google Play Store Dataset
opened_file = open("./googleplaystore.csv")
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android = android[1:]

### 2.2 Create explore_data() function

In [61]:
# Defining a function to make it easy to print data

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    
    for row in dataset_slice:
        print(row)
        print("\n") #Add a new empty line after each row
        
    if rows_and_columns:
        print("Number of rows: " + str(len(dataset)))
        print("Number of columns: " + str(len(dataset[0])))

### 2.3 Exploring the datasets

Let's look at the structure of the two datasets that we have created. For each dataset, we would like to know the following:

- Number of columns in each dataset to learn about the headers
- Number of rows in each dataset to learn about the total number of entries

Let's use the explore_data() function that we created to gather these insights.

#### 2.3.1 Exploring the iOS dataset

In [62]:
print("ios header:\n", ios_header, "\n")
explore_data(ios, 0, 2, True)

ios header:
 ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7197
Number of columns: 16


#### 2.3.2 Exploring the android dataset

In [68]:
print("android header:\n", android_header, "\n")
explore_data(android, 0, 2, True)

android header:
 ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10840
Number of columns: 13


#### 2.3.3 Results of the data structure analysis

| Data set | Number of Rows | Column Names |
| ------   | ------         | ------       |
| iOS      | 7197           |'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'     |
| Android  | 10841          |'App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'|

### 3. Cleaning our datasets

Our two datasets, in their current format, are a list of lists. However, we cannot use them right away. The data needs to be cleaned and prepared so that we do not get any wrong results in our analysis. As per our requirements, we need to remove all paid apps and non-English language apps too.

We will focus on the following three steps that are integral to any Data Cleansing process:

- remove or correct wrong data
- remove duplicate data
- modify the data to fit the purpose of our analysis

#### Finding erroneous data

[This discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) on Google Play Store describes missing data for row 10472. Let's check if this is the case by matching the length of the entry 10472 to the length of the header.

In [64]:
# Checking the index of the entry with missing data

for row in android:
    if len(row) != len(android_header):
        print(android.index(row))
        print(row)

10472
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Let's remove the erroneous row by using the del statement

In [65]:
print(len(android))
del android[10472] #do not run this more than once
print(len(android))

10841
10840


[This discussion](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion/106176) on App Store data mentions presence of duplicate data. Let's find the duplicate data in our ios dataset.

In [66]:
print(ios_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In [67]:


# Create two empty lists - one list contains unique data and the other list contains duplicates.

ios_clean = []
ios_duplicate_apps = []

print("Number of rows in the original dataset: ", len(ios))
for row in ios:
    if row[1] in ios_clean:
        ios_duplicate.append(row)
    else:
        ios_clean.append(row)
        
print("Number of rows in the cleaned dataset: ", len(ios_clean))
print("Number of rows in the duplicate dataset: ", len(ios_duplicate))

Number of rows in the original dataset:  7197
Number of rows in the cleaned dataset:  7197
Number of rows in the duplicate dataset:  0
