# Data Cleaning - Google Playstore and Apple Store

The main goal of this project is to clean the datasets and prepare them for visualisation in Power BI.
The datasets contain the app name, rating, total number of ratings, type, price, size, reviews and other attributes. `for` loops, conditional statements and dictionaries are used to build the final tables for visualisation.

Two datasets are imported into the Jupyter Notebook environment for cleaning:
- A dataset [googlestore](https://www.kaggle.com/datasets/lava18/google-play-store-apps) containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018.
- A dataset [applestore](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017.

#### Step 1: 
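- Defining the function `dataset` to open a CSV file, read it and convert it to a list of rows.
- Defining the function `explore_data` to print a slice of the imported data between the `start` and `end` indexes. The first row of the slice is skipped, so the header is not printed when `start` is 0.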

In [1]:
from csv import reader

def dataset(file_name):
    # Open the file in a `with` block so it is closed once it has been read
    with open(file_name, encoding="utf8") as opened_file:
        list_file = list(reader(opened_file))

    return list_file
    
applestore = dataset('AppleStore.csv')
googlestore = dataset('googleplaystore.csv')

In [2]:
def explore_data(list_file, start, end, rows_and_columns=False):
    data_slice = list_file[start:end]
    for row in data_slice[1:]:  # skips the first row of the slice (the header when start is 0)
        print(row)
        print('\n') # adds blank lines after each row
        
    if rows_and_columns:
        print('Number of rows:', len(list_file))
        print('Number of columns:', len(list_file[0]))

            
    

Exploring the imported datasets, `applestore` and `googlestore`.


In [3]:
explore_data(applestore, 0, 5, True)  # explore_data only prints; it returns nothing worth storing

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7198
Number of columns: 16


In [4]:
explore_data(googlestore, 0, 5, True)  # explore_data only prints; it returns nothing worth storing

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10842
Number of columns: 13


#### Step 2:
Finding the row with a missing value. To do so, I compare the length of every row with the length of the dataset's header. Any row whose length differs is printed with `print`, together with its length and its index position.

In [5]:
header = googlestore[0]

for row in googlestore:
    if len(row) != len(header):
        print('The length of header:', len(header), '\n')
        print(row, '\n')
        print('The length of row:', len(row))
        print('The index position:', googlestore.index(row))

The length of header: 13 

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up'] 

The length of row: 12
The index position: 10473
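
The check above can also be written with `enumerate`, which yields each row's position directly. `list.index(row)` returns the position of the first row equal to `row`, so it would report the wrong index if a malformed row happened to appear twice. The sketch below is only an illustration: the function name `find_malformed_rows` and the sample rows are made up and are not part of the notebook.

```python
def find_malformed_rows(rows):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(rows[0])  # the header defines the expected column count
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


# Tiny made-up sample, not taken from the real dataset
sample = [
    ['App', 'Category', 'Rating'],
    ['Alpha', 'TOOLS', '4.1'],
    ['Beta', '4.5'],              # the 'Category' value is missing
    ['Gamma', 'GAME', '3.9'],
]

for index, row in find_malformed_rows(sample):
    print('Index:', index, '| Length:', len(row), '| Row:', row)
```

Returning the indexes instead of printing them also makes it possible to delete the rows programmatically rather than hard-coding a position.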


#### Step 3:
Deleting the row with the missing value.

In [6]:
del googlestore[10473]

In the next step I am going to examine the duplicate apps.
> I will use two empty lists, `unique_apps` and `duplicate_apps`. If an app name is already in `unique_apps`, it is appended to `duplicate_apps`; otherwise it is appended to `unique_apps`.

In [7]:
duplicate_apps = []
unique_apps = []

for row in googlestore[1:]:
    name = row[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
    
print('Count of duplicate apps:', len(duplicate_apps))
print('\n')
print('Some names of duplicate apps:', duplicate_apps[:15])

Count of duplicate apps: 1181


Some names of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']
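
Checking `name in unique_apps` on a list scans the list every time, which gets slow as the list grows. A possible variant, sketched below under the assumption that only the names matter, tracks the names already seen in a `set`. The helper name `split_duplicates` and the short sample list are hypothetical.

```python
def split_duplicates(names):
    """Split names into the set of unique names and the list of repeats."""
    seen = set()       # set membership checks do not scan the whole collection
    duplicates = []
    for name in names:
        if name in seen:
            duplicates.append(name)
        else:
            seen.add(name)
    return seen, duplicates


# Made-up sample list for illustration
sample_names = ['Box', 'Slack', 'Box', 'Zenefits', 'Box', 'Slack']
unique, duplicates = split_duplicates(sample_names)

print('Count of unique apps:', len(unique))
print('Count of duplicate apps:', len(duplicates))
print('Duplicate names:', duplicates)
```

As in the cell above, an app that appears three times contributes two entries to the duplicates list.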


#### Step 4: 
Retaining the rows with the highest `reviews`.
> After running the `for` loop on the googlestore data I found 1181 duplicate entries.
I will discard the duplicates in the next step using the number of `reviews` as the criterion: for each app, the row with the highest number of reviews is retained and the remaining rows are discarded.

In [8]:
reviews_max = {}

for row in googlestore[1:]:
    name = row[0]
    n_reviews = float(row[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews


Verifying the length of the `reviews_max` dictionary by comparing it with the *expected length of the dictionary*.

In [9]:
print('Expected length of the dictionary:', len(googlestore[1:]) - len(duplicate_apps))
print(len(reviews_max))


Expected length of the dictionary: 9659
9659


#### Step 5:
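Creating a list `android_clean` to store the cleaned dataset. An `if` statement appends a row only when its review count equals the maximum recorded for that app and the app has not been added yet; the `already_added` list covers duplicate rows that share the same maximum review count.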

In [10]:
android_clean = []
already_added = []

for row in googlestore[1:]:
    name = row[0]
    n_reviews = float(row[3])
    
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(row)
        already_added.append(name)
    
    



Printing the length of the `android_clean` dataset to verify that it contains one row per unique app.

In [11]:
print(len(android_clean))

9659
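
The two passes above (building `reviews_max`, then filtering) could be merged into one by storing the best row per app in a dictionary. This is a sketch rather than the notebook's method: `keep_most_reviewed` and its sample rows are hypothetical. On ties it keeps the first row seen, but the result is ordered by each app's first appearance, which may differ from the order produced by the loop above.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the row with the highest review count."""
    best = {}  # app name -> best row seen so far (dicts keep insertion order)
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' means the first row wins when review counts are tied
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Made-up rows: name, category, rating, reviews
sample_rows = [
    ['Box', 'BUSINESS', '4.2', '100'],
    ['Slack', 'BUSINESS', '4.4', '50'],
    ['Box', 'BUSINESS', '4.2', '250'],
    ['Slack', 'BUSINESS', '4.4', '50'],
]

for row in keep_most_reviewed(sample_rows):
    print(row)
```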


Removing the '$' sign from the price column.

In [12]:
for row in android_clean:
    price = row[7]
    price = price.replace('$', '')
    row[7] = price
    
print(android_clean[218:221])

[['TurboScan: scan documents and receipts in PDF', 'BUSINESS', '4.7', '11442', '6.8M', '100,000+', 'Paid', '4.99', 'Everyone', 'Business', 'March 25, 2018', '1.5.2', '4.0 and up'], ['Tiny Scanner Pro: PDF Doc Scan', 'BUSINESS', '4.8', '10295', '39M', '100,000+', 'Paid', '4.99', 'Everyone', 'Business', 'April 11, 2017', '3.4.6', '3.0 and up'], ['Zenefits', 'BUSINESS', '4.2', '296', '14M', '50,000+', 'Free', '0', 'Everyone', 'Business', 'June 15, 2018', '3.2.1', '4.1 and up']]


In [13]:
explore_data(android_clean, 0,5, True)

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 9659
Number of columns: 13


#### Step 6:
Discarding the apps whose names are not in English.

> To retain English apps I have used the code point of each character in the app name. The characters commonly used in English text fall in the ASCII range (0 to 127), so any character whose `ord` value is greater than 127 is counted as non-ASCII.
However, some English app names contain special symbols such as ™, so an allowance is made: apps whose names contain at most 3 non-ASCII characters are kept in the clean dataset.

In [14]:
def english(value):
    non_ascii = 0
    for char in value:
        if ord(char) > 127:
            non_ascii += 1
        
    if non_ascii > 3:
        return False
    else:
        return True

print(english('Instagram'))
print(english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english('Docs To Go™ Free Office Suite'))

True
False
True
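
The same rule can be expressed more compactly by counting the non-ASCII characters with `sum` and a generator expression. The version below is a sketch; the name `is_mostly_ascii` and the `max_non_ascii` parameter are assumptions introduced here so that the threshold of 3 can be adjusted.

```python
def is_mostly_ascii(text, max_non_ascii=3):
    """Return True when text has at most max_non_ascii non-ASCII characters."""
    non_ascii = sum(1 for char in text if ord(char) > 127)
    return non_ascii <= max_non_ascii


print(is_mostly_ascii('Instagram'))
print(is_mostly_ascii('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_mostly_ascii('Docs To Go™ Free Office Suite'))
```

This heuristic only looks at character codes, so a name written in another language with plain Latin letters would still pass, while an English name with more than three symbols would be rejected.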


#### Step 7:
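Two lists, `googlestore_clean_english` and `applestore_clean_english`, are populated to create the clean datasets for visualisation. The function `english` is used to keep the English apps and append them to the lists. Note that `applestore` still contains its header row at this point; the header passes the `english` check, so it is carried over into `applestore_clean_english` as its first row.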

In [15]:
googlestore_clean_english = []
applestore_clean_english = []

for row in android_clean:
    name = row[0]
    
    if english(name):
        googlestore_clean_english.append(row)
        
for row in applestore:
    name = row[1]
    
    if english(name):
        applestore_clean_english.append(row)


**Printing the first 3 rows of the clean dataset.**

In [16]:
print(googlestore_clean_english[:3])

[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']]


In [17]:
print(applestore_clean_english[:3])

[['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'], ['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']]


Importing the NumPy and pandas libraries to convert the lists into CSV files for visualising the data.

In [18]:
import numpy as np
import pandas as pd

arr = np.array(googlestore_clean_english)
DF = pd.DataFrame(arr)
DF.to_csv("Googlestore_final.csv")

arr_1 = np.array(applestore_clean_english)
DF = pd.DataFrame(arr_1)
DF.to_csv("Applestore_final.csv")
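
Because the DataFrames above are built without column names, pandas writes its default integer labels as the header, plus an index column. If named columns are preferred in the visualisation tool, one option is to write the header explicitly. The sketch below uses only the standard-library `csv` module and writes to an in-memory buffer; `rows_to_csv_text`, the sample header and the sample rows are hypothetical stand-ins for the real lists.

```python
import csv
import io


def rows_to_csv_text(header, rows):
    """Serialise a header and a list of rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(header)   # column names go first
    writer.writerows(rows)    # fields containing commas are quoted automatically
    return buffer.getvalue()


# Made-up data for illustration
sample_header = ['App', 'Category', 'Rating']
sample_rows = [['Alpha, the app', 'TOOLS', '4.1'], ['Beta', 'GAME', '3.9']]

csv_text = rows_to_csv_text(sample_header, sample_rows)
print(csv_text)
```

To write to disk instead, the same `csv.writer` calls can be pointed at a file opened with `open(path, 'w', newline='', encoding='utf8')`.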


## Using the clean data to further analyze the profitable free apps.

In [19]:
print(len(googlestore_clean_english))
print(len(applestore_clean_english))

9614
6184


#### Step 1:

In this step the dataset is further filtered to retain the apps which are free.

In [20]:
googlestore_final = []
applestore_final = []

for row in googlestore_clean_english:
    price = row[7]
    
    if price == '0':
        googlestore_final.append(row)

for row in applestore_clean_english[1:]:
    price = row[4]
    
    if price == '0.0':
        applestore_final.append(row)
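
The filter above compares price strings, so it relies on free apps being written exactly as `'0'` in one dataset and `'0.0'` in the other. A more tolerant check, sketched here as an assumption rather than as the notebook's method, converts the text to a number first. The helper name `is_free` is hypothetical.

```python
def is_free(price_text):
    """Return True when the price text represents zero."""
    cleaned = price_text.replace('$', '').strip()
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Not a number at all (for example a shifted column); treat as not free
        return False


for text in ['0', '0.0', '$0.00', '4.99', 'Everyone']:
    print(repr(text), '->', is_free(text))
```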


In [21]:
explore_data(googlestore_final,0,5,True)
print('\n')
explore_data(applestore_final,0,5,True)

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 8864
Number of columns: 13


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5',

#### Step 2:

**Defining two functions `freq_table` and `display_table`.**

- The `freq_table` function builds a frequency table of the values in a given column, expressed as percentages.
- The `display_table` function converts that table into a list of `(percentage, value)` tuples, sorts it in descending order and prints each entry.

In [22]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
            
    table_per = {}
    
    for row in table:
        percentage = (table[row] / total) * 100
        table_per[row] = percentage
        
    return table_per

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
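
For comparison, the standard library's `collections.Counter` can do the counting and the sorting in a few lines. This is a sketch of an alternative, not a replacement for the functions above; `freq_table_counter` and the sample rows are made up. Note that `most_common` orders tied values by first appearance, whereas the tuple sort in `display_table` breaks ties by the value's name.

```python
from collections import Counter


def freq_table_counter(rows, index):
    """Return {value: percentage} for one column, most frequent first."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.most_common()}


# Made-up rows: name, genre
sample_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Tools'], ['d', 'Games']]

for value, percentage in freq_table_counter(sample_rows, 1).items():
    print(value, ':', percentage)
```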

## Most Common Free Apps by Genre

Displaying the frequency table of `Genre` from the googlestore_final list.

In [23]:
display_table(googlestore_final, -4)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

Displaying the frequency table of `category` from googlestore_final list.

In [24]:
display_table(googlestore_final, 1)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

>After analyzing the frequency tables of Google Play apps by `Genre` and `category`, no single genre or category clearly dominates. What we can say is that the Google Play store has a wider variety of genres than the Apple Store.

Displaying frequency table of `prime_genre` from applestore_final list.

In [25]:
display_table(applestore_final, -5)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


> After analyzing the `prime_genre` frequency table of Apple Store, we can conclude that **Gaming Apps** dominate the Apple Store with approximately 58%. 

## Most Common Genre of the Apps by User Ratings

Displaying the average number of user ratings (`rating_count_tot`, column index 5) for each genre in the applestore_final list.

In [26]:
prime_genre_freq = freq_table(applestore_final, -5)

#print(prime_genre_freq)

for genre in prime_genre_freq:
    total = 0
    len_genre = 0
    
    for row in applestore_final:
        genre_app = row[-5]
        if genre_app == genre:
            user_ratings = float(row[5])
            total += user_ratings
            len_genre += 1
    Average = total / len_genre
    print(genre,':', Average)

Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0


> After looking at the results we can conclude that the most popular apps on applestore belong to the **Navigation** `Genre`, with an average of approximately **86,000** user ratings per app.
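
The nested loops above scan the whole dataset once per genre. A single pass that accumulates a running total and a count per genre gives the same averages. The sketch below assumes the same list-of-rows layout; `average_by_key` and the sample rows are hypothetical.

```python
from collections import defaultdict


def average_by_key(rows, key_index, value_index):
    """Average the numeric column at value_index, grouped by key_index."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        key = row[key_index]
        totals[key] += float(row[value_index])
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Made-up rows: genre, number of ratings
sample_rows = [['Games', '100'], ['Games', '300'], ['Weather', '50']]

averages = average_by_key(sample_rows, 0, 1)
# Print the groups from highest to lowest average
for genre, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', avg)
```

Sorting the result before printing makes the top genre easier to spot than in an unsorted listing.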

In [27]:
category_freq = freq_table(googlestore_final,1)

for category in category_freq:
    total = 0
    len_category = 0
    
    for row in googlestore_final:
        category_app = row[1]
        if category_app == category:
            n_installs = row[5]
            n_installs = n_installs.replace('+','')
            n_installs = n_installs.replace(',','')
            n_installs = float(n_installs)
            total += n_installs
            len_category += 1
    average = total / len_category
    print(category,':', average)

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

After analyzing the average `n_installs` per category on the Google Play Store, we find that **Communication** apps have the highest average number of installs, approximately **38,456,119**.
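
The install counts are stored as text such as `'10,000+'`, which is why the loop above strips `+` and `,` before converting. That step can be isolated in a small helper, sketched below with the hypothetical name `parse_installs`. The trailing `+` suggests each value is a lower bound, so the averages are probably best read as approximate.

```python
def parse_installs(text):
    """Convert an install bucket such as '10,000+' to an integer."""
    return int(text.replace('+', '').replace(',', ''))


for text in ['10,000+', '5,000,000+', '0']:
    value = parse_installs(text)
    # The ',' format specifier groups digits in thousands when printing
    print(text, '->', value, '->', f'{value:,}')
```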