# Profitable App Profiles on Apple and Android Store

The goal of this project is to analyze app data to help developers understand which app profiles are likely to attract more users. Most of the apps are free to download, so the main source of revenue is in-app ads. The more users engage with an app and see its ads, the more revenue it generates.

## Exploring the data

As of September 2018, there were about 2 million iOS apps available on the Apple Store and 2.1 million Android apps on the Google Play Store. The datasets used here are samples of that data: the Google Play Store dataset contains about 10,800 apps and the Apple Store dataset contains about 7,200 apps.

In [29]:
# Open Apple Store data 
from csv import reader
opened_file= open('AppleStoreData.csv',encoding='UTF-8') #Important to specify the encoding of the Input file
read_file = reader(opened_file)
ios_data = list(read_file)
ios_header=ios_data[0] # Specify header of ios data
ios = ios_data[1:] # All the ios data except for the header 

In [30]:
# Open Google Play Store data 
from csv import reader
opened_file= open('googleplaystore.csv',encoding='UTF-8') #Important to specify the encoding of the Input file
read_file = reader(opened_file)
playstore_data = list(read_file)
android_header=playstore_data[0] # Header row of the Android data
android = playstore_data[1:] # All the Android data except for the header
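
The two loading cells above repeat the same steps and leave the files open. A minimal sketch of a shared helper is shown here; `load_csv` is a hypothetical name that is not part of the original notebook, and the sketch assumes each file has a single header row.

```python
from csv import reader

def load_csv(path, encoding='UTF-8'):
    """Return (header, rows) for the CSV file at `path` (hypothetical helper)."""
    # The with block closes the file even if reading fails.
    # newline='' is what the csv module recommends for files passed to reader().
    with open(path, encoding=encoding, newline='') as opened_file:
        data = list(reader(opened_file))
    # Assumption: the first row is the header and the rest are data rows
    return data[0], data[1:]
```

With this helper, the two cells could be reduced to `ios_header, ios = load_csv('AppleStoreData.csv')` and `android_header, android = load_csv('googleplaystore.csv')`.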

In [31]:
# Print a slice of a dataset and, optionally, its number of rows and columns
def explore_data(dataset,start,end,rows_columns = False):
    dataset_slice=dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # Adds a new empty line after each row
    if rows_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:',len(dataset[0]))
        

In [32]:
# Explore ios data
print(ios_header)
print('\n')
explore_data(ios,0,3,True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


The Apple Store data has 7197 records with 16 variables. The variables most useful for this analysis are 'track_name', 'price', 'rating_count_tot', 'user_rating', and 'prime_genre'.

In [33]:
# Explore playstore data
print(android_header)
print('\n')
explore_data(android,0,3,True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


The Google Play Store data has 10841 records with 13 variables. The variables most useful for this analysis are 'App', 'Genres', 'Rating', 'Reviews', 'Type', 'Installs', and 'Price'.

### ios data dictionary

## Data Cleaning

### Playstore data cleaning
Discussions in the Play Store dataset community point to the following cleaning steps:
* Remove the row at index 10472 (header not counted): its Category value is missing, so the later columns are shifted
* Remove non-English apps
* Remove apps that are not free
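
The last two steps are not implemented in this part of the notebook. One possible way to approach them is sketched below, under stated assumptions: the helper names are hypothetical, the threshold of three non-ASCII characters is an arbitrary choice (a strict ASCII-only test would reject names such as 'U Launcher Lite – FREE Live Cool Themes, Hide Apps', which contains a dash outside the ASCII range), and prices are assumed to be numeric strings that may carry a leading '$'.

```python
def is_english(name, max_non_ascii=3):
    """Heuristic: a name counts as English if it has at most
    `max_non_ascii` characters outside the ASCII range (assumed threshold)."""
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii

def keep_free_english(rows, name_index, price_index):
    """Return the rows whose name looks English and whose price is zero."""
    kept = []
    for row in rows:
        # Assumption: prices look like '0', '0.0' or '$4.99'
        price = float(row[price_index].lstrip('$'))
        if is_english(row[name_index]) and price == 0.0:
            kept.append(row)
    return kept

print(is_english('Instagram'))  # True
print(is_english('U Launcher Lite – FREE Live Cool Themes, Hide Apps'))  # True
```

For the Android data this would be called as `keep_free_english(android, 0, 7)`, and for the iOS data as `keep_free_english(ios, 1, 4)`. The malformed row has to be removed first, because its shifted Price column does not hold a number.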

In [34]:
print(android_header)
print('\n')
explore_data(android,10472,10473)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']




As shown above, row 10472 has no Category value, so the remaining columns are shifted one position to the left: the Category column holds '1.9', which is actually the rating. This row needs to be deleted to clean up the Play Store data.
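
A column shift like this one also leaves the row with fewer fields than the header, so it can be found without knowing its index in advance. The following is a minimal sketch, with `find_malformed_rows` as a hypothetical helper name; it only catches rows whose length is wrong, not rows with wrong values in the right number of columns.

```python
def find_malformed_rows(rows, header):
    """Return (index, row) pairs whose column count differs from the header's."""
    expected = len(header)
    return [(index, row) for index, row in enumerate(rows) if len(row) != expected]

# Small demonstration using the header and the shifted row printed above
header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
          'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
          'Android Ver']
shifted_row = ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M',
               '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018',
               '1.0.19', '4.0 and up']
for index, row in find_malformed_rows([shifted_row], header):
    print(index, len(row), 'columns instead of', len(header))
```

On the full dataset the call would be `find_malformed_rows(android, android_header)`.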

In [35]:
# Delete the malformed row at index 10472 (run this cell only once, otherwise another row is deleted)
del android[10472]

In [37]:
explore_data(android,0,2,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10840
Number of columns: 13


After deleting that row, the Android dataset has 10840 rows.

## Removing Duplicate Entries
### Part One
Check whether the Android data contains duplicate entries for the same app and decide on the best way to remove them.

In [41]:
#Instagram has 4 entries in android
for app in android:
    name = app[0]
    if name == "Instagram":
        print(app)


['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


As shown above, the rows differ only in the fourth column, which holds the number of reviews. Rather than removing duplicates at random, it is better to keep the row with the highest number of reviews, since a higher count suggests the data was collected more recently, and remove the others.
In total, 1,181 rows are duplicate entries of an app that already appears in the data.

In [44]:
#Counting the duplicate apps
unique_apps=[]
duplicate_apps=[]

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of Duplicate apps:', len(duplicate_apps))
print('\n')    
print('Example of Duplicate apps:', duplicate_apps[:5])



Number of Duplicate apps: 1181


Example of Duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']


In [46]:
#Number of rows remaining in the data which are unique
print("Number of Unique apps in Android data:", len(android)-len(duplicate_apps))

Number of Unique apps in Android data: 9659
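
These counts can be cross-checked with a set, which also avoids the repeated list lookups of the loop above. This is a minimal sketch with a hypothetical helper name; it assumes the app name is in the first column, as in the Android data.

```python
def count_unique_and_duplicates(rows, name_index=0):
    """Return (number of distinct names, number of surplus duplicate rows)."""
    names = [row[name_index] for row in rows]
    unique_count = len(set(names))  # a set keeps each name only once
    return unique_count, len(names) - unique_count

# Tiny illustration with made-up rows
sample = [['Box'], ['Instagram'], ['Box'], ['Instagram'], ['Instagram']]
print(count_unique_and_duplicates(sample))  # (2, 3)
```

Called as `count_unique_and_duplicates(android)`, it should reproduce the unique and duplicate counts printed above.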


In [None]:
#Removing duplicates
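
The cell above is still empty. The following is a minimal sketch of one way to implement the rule described earlier (keep the row with the highest number of reviews for each app). The function name is hypothetical, and the sketch assumes that the app name is in column 0 and that the Reviews column (index 3) holds numeric strings in every remaining row.

```python
def remove_duplicates(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    # First pass: record the highest review count seen for each app name
    reviews_max = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])  # assumption: numeric string
        if name not in reviews_max or n_reviews > reviews_max[name]:
            reviews_max[name] = n_reviews

    # Second pass: keep the first row that matches the maximum for its name
    clean = []
    already_added = set()
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        if n_reviews == reviews_max[name] and name not in already_added:
            clean.append(row)
            already_added.add(name)
    return clean
```

The `already_added` set matters when two rows of the same app share the maximum review count: without it, both rows would be kept. Calling `android_clean = remove_duplicates(android)` should leave one row per app, so `len(android_clean)` can be compared with the number of unique apps computed earlier.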
