# First Python guided project

The goal of this project is to put the concepts and lessons learned so far into practice, and to run into errors that deepen our understanding of Python's structures. For that purpose, we'll use two sample datasets in *.csv* format containing app data: one from the App Store and one from Google Play.

In [1]:
from csv import reader

# App Store dataset: read every row into a list of lists
with open("D:/MEGA/UC3M/CSS/Lenguajes/Python/AppleStore.csv", encoding="UTF-8") as raw:
    data = list(reader(raw))

# Google Play dataset: read every row into a list of lists
with open("D:/MEGA/UC3M/CSS/Lenguajes/Python/googleplaystore.csv", encoding="UTF-8") as raw2:
    data2 = list(reader(raw2))

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
explore_data(data, 1, 5, rows_and_columns=True)

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']


Number of rows: 7198
Number of columns: 17


In [4]:
explore_data(data2, 1, 5, rows_and_columns=True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10842
Number of columns: 13


In [5]:
print(data[0])
print(data2[0])

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


From the column names printed above, we can identify the ones we'll be using from now on: the app name, the number of reviews, the price (or type) and the category (or genre).
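
Since the rest of the notebook addresses columns by position, a small lookup from column name to index can help avoid off-by-one mistakes. The following is only a sketch: `column_indexes` is a hypothetical helper that does not appear in the notebook, and the two header lists are copied from the output printed above.

```python
# Hypothetical helper (not part of the notebook): map each column name
# to its position in the header row.
def column_indexes(header):
    return {name: position for position, name in enumerate(header)}

# Header rows copied from the output of print(data[0]) and print(data2[0]).
ios_header = ['', 'id', 'track_name', 'size_bytes', 'currency', 'price',
              'rating_count_tot', 'rating_count_ver', 'user_rating',
              'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
              'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
android_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']

ios_cols = column_indexes(ios_header)
android_cols = column_indexes(android_header)

print(ios_cols['track_name'], ios_cols['price'], ios_cols['prime_genre'])  # 2 5 12
print(android_cols['App'], android_cols['Reviews'], android_cols['Type'])  # 0 3 6
```

Note that the iOS header starts with an unnamed column, so every iOS index is shifted by one compared with what the column names alone might suggest.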

In [6]:
print(data2[10473]) # Malformed row: 12 fields instead of 13 (the category is missing)

del data2[10473]

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
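
The row above stands out because it has fewer fields than the header. One possible way to locate such rows, rather than knowing the index in advance, is to compare each row's length with the header's. This is a sketch on made-up toy rows; `find_malformed_rows` is a hypothetical helper, not something the notebook defines.

```python
# Hypothetical helper: return the indexes of rows whose number of fields
# differs from the header row (assumed to be the first row).
def find_malformed_rows(dataset):
    expected = len(dataset[0])
    return [index for index, row in enumerate(dataset) if len(row) != expected]

# Toy data for illustration only: the last row lacks its category.
sample = [
    ['App', 'Category', 'Rating'],
    ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9'],
]

bad = find_malformed_rows(sample)
print(bad)  # [2]

# Delete from the end so that the remaining indexes stay valid.
for index in reversed(bad):
    del sample[index]

print(len(sample))  # 2
```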


After removing the corrupt row, it's time to check for duplicates: entries that refer to the same app but were collected at different times. We'll check which apps appear more than once and clean them until only one observation per app remains. For each duplicated app, we keep the row with the highest number of reviews.

In [7]:
duplicates = []
unique = []

for app in data2[1:]: 
    if app[0] in unique:
        duplicates.append(app[0])
    else:
        unique.append(app[0])

print(len(duplicates))

1181


In [8]:
reviews_max = {}

for app in data2[1:]:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

In [9]:
android_clean = []
already_added = []

for app in data2[1:]:
    name = app[0]
    n_reviews = float(app[3])
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
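
The two cells above first record the highest review count per app and then keep the first row that matches it. As a sketch only, the same rule can be expressed in a single pass with a dictionary that stores the best row seen so far. `dedupe_by_reviews` and the toy rows are hypothetical, and the result is ordered by each app's first appearance, which may differ from the order produced above.

```python
# Hypothetical single-pass variant: keep, for every app name, the row with
# the highest review count. On ties, the earliest row is kept.
def dedupe_by_reviews(rows, name_index=0, reviews_index=3):
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Toy rows for illustration only (name, category, rating, reviews).
sample = [
    ['App A', 'SOCIAL', '4.5', '100'],
    ['App A', 'SOCIAL', '4.5', '120'],
    ['App B', 'TOOLS', '4.0', '50'],
    ['App A', 'SOCIAL', '4.5', '90'],
]

clean = dedupe_by_reviews(sample)
print(clean)
# [['App A', 'SOCIAL', '4.5', '120'], ['App B', 'TOOLS', '4.0', '50']]
```

A dictionary lookup also avoids repeatedly scanning a list such as `already_added`, which becomes slow as the dataset grows.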

After cleaning duplicate entries, we also need to keep only English apps, since that's the market in which we'll compete against other brands. Discarding every name that contains a single non-ASCII character would be too strict, because English app names often include emojis or symbols such as ™ that fall outside the ASCII range. Instead, we discard a name only when it contains more than 3 non-ASCII characters.

In [10]:
def lang_filter(string):
    non_ascii = 0
    
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
    
    if non_ascii > 3:
        return False
    else:
        return True

print(lang_filter('Docs To Go™ Free Office Suite'))
print(lang_filter('Instachat 😜'))

True
True
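
Both test names above pass because each contains a single non-ASCII character. To see where the threshold actually bites, the sketch below counts non-ASCII characters explicitly. `count_non_ascii` and `looks_english` are hypothetical names that mirror the logic of `lang_filter`, and the sample strings are made up for illustration.

```python
# Hypothetical helpers mirroring lang_filter: a character is treated as
# non-ASCII when its code point is above 127.
def count_non_ascii(text):
    return sum(1 for character in text if ord(character) > 127)

def looks_english(text, max_non_ascii=3):
    return count_non_ascii(text) <= max_non_ascii

print(count_non_ascii('Instachat 😜'))  # 1
print(looks_english('Café ★★'))         # True  (3 non-ASCII characters)
print(looks_english('Café ★★★'))        # False (4 non-ASCII characters)
print(looks_english('Приложение'))      # False (10 non-ASCII characters)
```

The rule is a heuristic: a short non-English name with three or fewer non-ASCII characters would still pass.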


In [11]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if lang_filter(name):
        android_english.append(app)
        
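for app in data[1:]:
    # NOTE: per the header row printed earlier, index 1 is `id` and
    # `track_name` is index 2, so this check is applied to the numeric id
    name = app[1]
    if lang_filter(name):
        ios_english.append(app)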

In [12]:
print(android_english[0])
print(ios_english[1:5])

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
[['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1'], ['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1'], ['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1'], ['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37', '5', '45', '1']]


Next, we keep only the free apps:

In [13]:
android_eng_free = []
ios_eng_free = []

for app in android_english:
    if app[6] == "Free":
        android_eng_free.append(app)
    
for app in ios_english:
    if app[5] == "0":
        ios_eng_free.append(app)

In [14]:
print(len(android_eng_free))
print(len(ios_eng_free))

8863
4056


Now that we've cleaned and pre-processed the data, it's time to analyze it. Let's build frequency tables to find the most common genres:

In [15]:
print(android_eng_free[1:3])
print(ios_eng_free[1:3])

[['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']]
[['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1'], ['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']]


In [16]:
def freq_table(dataset, index):
    # Count how many apps fall into each value of the given column
    table = {}
    for app in dataset:
        category = app[index]
        if category not in table:
            table[category] = 1
        else:
            table[category] += 1

    # Convert the absolute counts into percentages of the whole dataset
    relative_freq_table = {}
    for key in table:
        percentage = (table[key] / len(dataset)) * 100
        relative_freq_table[key] = percentage
    return relative_freq_table

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
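
As an alternative sketch of the same idea, the standard library's `collections.Counter` can do the counting, and `sorted` with a `key` function can order the result without building tuples first. `percentage_table`, `print_sorted` and the toy rows are hypothetical and only meant to illustrate the technique.

```python
from collections import Counter

# Hypothetical variant of freq_table: percentage of rows per column value.
def percentage_table(rows, index):
    counts = Counter(row[index] for row in rows)
    total = len(rows)
    return {key: count / total * 100 for key, count in counts.items()}

# Hypothetical variant of display_table: print entries by descending percentage.
def print_sorted(table):
    ordered = sorted(table.items(), key=lambda item: item[1], reverse=True)
    for key, value in ordered:
        print(key, ':', value)

# Toy rows for illustration only (name, genre).
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Weather'], ['d', 'Games']]

table = percentage_table(sample, 1)
print_sorted(table)
# Games : 75.0
# Weather : 25.0
```

One difference from the tuple-based version is tie handling: sorting on the value alone keeps tied entries in their original order, whereas sorting `(value, key)` tuples in reverse orders ties by key in descending order.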

In [17]:
# See the columns in data
print(android_eng_free[0]) # indexes 1 (Category) and 9 (Genres)
print(ios_eng_free[0]) # index 12 (prime_genre)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


In [19]:
freq_android = freq_table(android_eng_free, 1)
freq_ios = freq_table(ios_eng_free, 12)

In [24]:
display_table(ios_eng_free, 12)

Games : 55.64595660749507
Entertainment : 8.234714003944774
Photo & Video : 4.117357001972387
Social Networking : 3.5256410256410255
Education : 3.2544378698224854
Shopping : 2.983234714003945
Utilities : 2.687376725838264
Lifestyle : 2.3175542406311638
Finance : 2.0710059171597637
Sports : 1.947731755424063
Health & Fitness : 1.8737672583826428
Music : 1.6518737672583828
Book : 1.6272189349112427
Productivity : 1.5285996055226825
News : 1.4299802761341223
Travel : 1.3806706114398422
Food & Drink : 1.0601577909270217
Weather : 0.7642998027613412
Reference : 0.4930966469428008
Navigation : 0.4930966469428008
Business : 0.4930966469428008
Catalogs : 0.22189349112426035
Medical : 0.19723865877712032


Looking at the results for the App Store, it's clear that games are by far the most common type of app in the English market (about 56% of free apps), followed at a great distance by entertainment apps (about 8%). At the bottom of the table, we'll also notice that non-leisure categories such as Medical or Business are the least common. However, these frequencies reflect the number of free apps available, not the number of people who use them, so the insight could actually point the other way: if we were to develop a new app, the games and entertainment segment looks overcrowded, which could make it harder to establish ourselves.
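
The difference between how many apps exist in a genre and how many users those apps attract can be made concrete with a small sketch. It assumes that a column such as `rating_count_tot` is an acceptable proxy for the number of users, which the analysis above does not establish. `average_by_group` and the toy rows are hypothetical.

```python
# Hypothetical helper: average of a numeric column within each group.
def average_by_group(rows, group_index, value_index):
    totals = {}
    counts = {}
    for row in rows:
        group = row[group_index]
        totals[group] = totals.get(group, 0.0) + float(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}

# Toy rows for illustration only (genre, number of ratings).
sample = [
    ['Games', '100'],
    ['Games', '300'],
    ['Weather', '1000'],
]

averages = average_by_group(sample, 0, 1)
print(averages)  # {'Games': 200.0, 'Weather': 1000.0}
```

In this toy example the genre with the most apps has the lower average, which is the kind of reversal the paragraph above warns about.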

In [37]:
display_table(android_eng_free, 1)

FAMILY : 18.898792733837304
GAME : 9.725826469592688
TOOLS : 8.462146000225657
BUSINESS : 4.592124562789123
LIFESTYLE : 3.9038700214374367
PRODUCTIVITY : 3.8925871601038025
FINANCE : 3.7007785174320205
MEDICAL : 3.5315355974275078
SPORTS : 3.396141261423897
PERSONALIZATION : 3.317161232088458
COMMUNICATION : 3.2381812027530184
HEALTH_AND_FITNESS : 3.0802211440821394
PHOTOGRAPHY : 2.944826808078529
NEWS_AND_MAGAZINES : 2.798149610741284
SOCIAL : 2.6627552747376737
TRAVEL_AND_LOCAL : 2.335552296062281
SHOPPING : 2.245289405393208
BOOKS_AND_REFERENCE : 2.1437436533904997
DATING : 1.8616721200496444
VIDEO_PLAYERS : 1.7939749520478394
MAPS_AND_NAVIGATION : 1.399074805370642
FOOD_AND_DRINK : 1.241114746699763
EDUCATION : 1.1621347173643235
ENTERTAINMENT : 0.9590432133589079
LIBRARIES_AND_DEMO : 0.9364774906916393
AUTO_AND_VEHICLES : 0.9251946293580051
HOUSE_AND_HOME : 0.8236488773552973
WEATHER : 0.8010831546880289
EVENTS : 0.7108202640189552
PARENTING : 0.6544059573507841
ART_AND_DESIGN : 0

Regarding the Google Play Store, the results differ significantly: games are no longer the most common type of app among creators, and the 'Family' category takes their place (about 19%, against roughly 10% for games). Categories that barely registered on the App Store, such as Business, Productivity or Lifestyle, now sit near the top of the table. This means that if we were to target only the Android store, producing a game wouldn't be a bad choice either. Nonetheless, we should also take into account that there's far more variety in this store than in the App Store, which makes for a more balanced landscape.