---
This project gathers data to help developers determine the characteristics of applications that have a large number of downloads. The analysis focuses on applications that are free to download and install and are aimed at English-speaking audiences.


Reading and transforming .csv files to iterables
---

---

The `explore_data` function takes 4 arguments (data set, start index, end index, and a boolean for showing the total number of rows and columns).

The `csv_to_list` function takes a string (the name of the csv file) as an argument and returns a list of rows.

In [8]:
from csv import reader

# this function returns a list from csv files
def csv_to_list(csv_file: str) -> list:
    with open(csv_file) as opened_file:
        read_file = reader(opened_file)
        data_set = list(read_file)
    return data_set

google = csv_to_list('googleplaystore.csv')
apple = csv_to_list('AppleStore.csv')
android_header = google[0]
android_apps = google[1:]
ios_header = apple[0]
ios_apps = apple[1:]
    
def explore_data(data_set, start, end, rows_and_columns=False) -> None:
    dataset_slice = data_set[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
    if rows_and_columns:
        print('Number of rows: ', len(data_set))
        print('Number of columns: ', len(data_set[0]))


The code below prints the data extracted from the csv file; the first row shows the name of each column.

You can follow these links for the [Google Play Store csv file](https://www.kaggle.com/lava18/google-play-store-apps) and the [Apple App Store csv file](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).

In [13]:
explore_data(google, 0, 4)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']




A [Kaggle discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) reports that one row is missing a field in one of its columns.

We can use the code below to see the contents of that row:
`explore_data(google, 10473, 10474)`

Reliable results in data analysis need accurate data, so the row with the faulty contents has to be deleted.
The cell below prints that row (index 10472 in `android_apps`, because the header has been removed), deletes it, and prints the neighbouring rows to check the result.
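
A row with a missing field can also be found without knowing its index in advance, by comparing each row's length with the header's length. The sketch below is one possible way to do that; `find_malformed_rows` is a hypothetical helper that is not part of the original notebook, and the sample rows are made up for illustration.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose column count differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


# Made-up sample: the second data row is missing its category field.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
    ['App C', 'GAME', '3.8'],
]

malformed = find_malformed_rows(sample_header, sample_rows)
print(malformed)

# Delete from the end so that earlier indexes stay valid during removal.
for index, _ in reversed(malformed):
    del sample_rows[index]
print(len(sample_rows))
```

Unlike a hard-coded `del`, this can be run more than once without removing a valid row, because a second run finds nothing left to delete.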


In [12]:
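# print the faulty row, then delete it
# (run this cell only once: a second run would delete the row that moved into index 10472)
explore_data(android_apps, 10472, 10473)
del android_apps[10472]
# print the neighbouring rows to confirm the faulty row is gone
explore_data(android_apps, 10470, 10479)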

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['Jazz Wi-Fi', 'COMMUNICATION', '3.4', '49', '4.0M', '10,000+', 'Free', '0', 'Everyone', 'Communication', 'February 10, 2017', '0.1', '2.3 and up']


['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


['Sat-Fi Voice', 'COMMUNICATION', '3.4', '37', '14M', '1,000+', 'Free', '0', 'Everyone', 'Communication', 'November 21, 2014', '2.2.1.5', '2.2 and up']


['Wi-Fi Visualizer', 'TOOLS', '3.9', '132', '2.6M', '50,000+', 'Free', '0', 'Everyone', 'Tools', 'May 17, 2017', '0.0.9', '2.3 and up']


['Lennox iComfort Wi-Fi', 'LIFESTYLE', '3.0', '552', '7.6M', '50,000+', 'Free', '0', 'E

Removing duplicate data entries
---

---
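Duplicates are removed in two steps. `clean_dict` builds a dictionary that maps each app name to its highest review count, and `clean_data` keeps, for each app, the first row whose review count matches that maximum. This works on the assumption that the row with the most reviews is the most recent one, so that only the updated data entry remains.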

In [14]:
# build a dict that maps each app name to its highest review count
def clean_dict(data_set):
    reviews_max = {}
    for row in data_set:
        name = row[0]
        n_reviews = float(row[3])
        # store the count for a new name, or replace a lower stored count
        if name not in reviews_max or reviews_max[name] < n_reviews:
            reviews_max[name] = n_reviews
    # data_set must not include the header row
    return reviews_max

new_rev = clean_dict(android_apps)
print(len(new_rev))
print(len(android_apps))

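# keep one row per app: the first row whose review count equals the app's maximum
# note: relies on the module-level dict new_rev built above
def clean_data(data_set):
    cleaned_data = []
    already_added = []

    for row in data_set:
        name = row[0]
        n_reviews = float(row[3])

        # already_added guards against duplicates that tie on the maximum count
        if name not in already_added and n_reviews == new_rev[name]:
            cleaned_data.append(row)
            already_added.append(name)
    # returns [unique rows, their app names]
    return [cleaned_data, already_added]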

new_clean = clean_data(android_apps)
print(len(new_clean[0]), len(new_clean[1]))
print(len(android_apps))

9659
10840
9659 9659
10840
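
The same de-duplication can also be done in a single pass by storing the whole row, not just the review count, under each app name. This is only a sketch, assuming the app name is in column 0 and the review count in column 3 as in the cells above; `dedupe_by_max_reviews` is a hypothetical name and the sample rows are invented.

```python
def dedupe_by_max_reviews(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the first row seen when review counts tie.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Invented sample: 'App A' appears twice with different review counts,
# 'App B' appears twice with identical rows.
sample = [
    ['App A', 'TOOLS', '4.0', '100'],
    ['App B', 'GAME', '3.5', '20'],
    ['App A', 'TOOLS', '4.1', '250'],
    ['App B', 'GAME', '3.5', '20'],
]
unique = dedupe_by_max_reviews(sample)
print(unique)
```

A dictionary lookup takes constant time on average, so this avoids the repeated list scans of `already_added`. The rows come back in the order each name first appears, which may differ from the order produced by `clean_data`.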


In [12]:
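# test: find the highest review count in a small sample of rows
def sort_by_review_count(rating_list, index):
    # despite its name, this returns the largest value in the column, not a sorted list
    return max(float(row[index]) for row in rating_list)

dum = google[1:7]    # first six data rows (header skipped)
dum2 = list(dum)     # shallow copy of the sample
new_max = sort_by_review_count(dum2, 3)
print(dum, dum2, new_max)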

[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'], ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', '

In [5]:
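# test: exact duplicates collapse to one row when rows are converted to a set of tuples
dummy = []
for i in range(20):
    dummy.append(google[2])
print(len(dummy))

def delete_duplicates(data_set):
    # lists are unhashable, so each row is converted to a tuple first
    new_list = list(set([tuple(row) for row in data_set]))
    return new_list

print(delete_duplicates(dummy))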

20
[('Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up')]
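
Converting to a `set` removes exact duplicate rows but does not keep the rows in their original order. If order matters, `dict.fromkeys` is one alternative, since dictionaries preserve insertion order in Python 3.7 and later. The sketch below uses made-up rows, and `delete_exact_duplicates` is a hypothetical name.

```python
def delete_exact_duplicates(rows):
    """Drop rows identical in every field, keeping first occurrences in order."""
    # Lists are unhashable, so each row becomes a tuple before use as a dict key.
    unique = dict.fromkeys(tuple(row) for row in rows)
    return [list(row) for row in unique]


# Made-up sample: the third row repeats the first one exactly.
sample_rows = [['a', '1'], ['b', '2'], ['a', '1'], ['a', '3']]
deduplicated = delete_exact_duplicates(sample_rows)
print(deduplicated)
```

This only catches rows that match in every field, so two entries for the same app with different review counts are both kept.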
