# Profitable App Profiles for the App Store and Google Play Markets

![header.jpg](header.jpg)

### **Context**:

We are data analysts at a company that **builds only free apps**, available on Google Play and in the App Store. Our main source of revenue is in-app ads.

This means that the revenue of any given app depends mostly on its number of users: the more users who see and engage with the ads, the better.

### **Our goal**  

Collect and analyze data from each of the online stores to understand what type of apps are likely to attract more users.

* * *

The datasets come as two CSV (**comma-separated values**) files. To read their content, we import the `reader` function from Python's `csv` module.

In [1]:
from csv import reader

## Exploring datasets:

We are going to explore both datasets, but first we need to know which [character encoding](https://en.wikipedia.org/wiki/Character_encoding) each of them uses.

The [file](https://www.man7.org/linux/man-pages/man1/file.1.html) command reports the type of a file regardless of its extension. Knowing the encoding before opening the files helps us avoid a `UnicodeDecodeError`.

In [2]:
! file -i AppleStore.csv

AppleStore.csv: application/csv; charset=utf-8


In [3]:
! file -i googleplaystore.csv

googleplaystore.csv: application/csv; charset=utf-8


Both files use the same charset: [UTF-8](https://en.wikipedia.org/wiki/UTF-8).


Quick note:

- **charset:** the set of characters you can use.
- **encoding:** the way these characters are stored in memory.

[source](https://stackoverflow.com/questions/2281646/whats-the-difference-between-encoding-and-charset)
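
To make the distinction concrete, here is a minimal sketch. The sample character and the variable names are illustrative and not part of the original notebook. It shows that one character has a single code point in the character set but different byte sequences depending on the encoding, and that decoding bytes with an unsuitable encoding is what raises a `UnicodeDecodeError`.

```python
# Illustrative only: one character, one code point, several possible encodings.
sample = "爱"

code_point = ord(sample)                  # position of the character in Unicode
utf8_bytes = sample.encode("utf-8")       # byte sequence under UTF-8
utf16_bytes = sample.encode("utf-16-le")  # byte sequence under UTF-16 (little endian)

print(code_point, len(utf8_bytes), len(utf16_bytes))

# Decoding with an encoding that cannot interpret these bytes fails.
try:
    utf8_bytes.decode("ascii")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print("ASCII decoding failed:", decode_failed)
```

This is the reason the notebook passes `encoding='utf8'` explicitly when opening both files.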

In [4]:
with open('AppleStore.csv', encoding='utf8') as opened_file:  # closes the file when done
    AppleStore = list(reader(opened_file))
header_apple = AppleStore[:1]  # one-row list: the column names are header_apple[0]
header_apple  # columns in AppleStore

[['id',
  'track_name',
  'size_bytes',
  'currency',
  'price',
  'rating_count_tot',
  'rating_count_ver',
  'user_rating',
  'user_rating_ver',
  'ver',
  'cont_rating',
  'prime_genre',
  'sup_devices.num',
  'ipadSc_urls.num',
  'lang.num',
  'vpp_lic']]

In [5]:
dataset_apple = AppleStore[1:] 

In [6]:
with open('googleplaystore.csv', encoding='utf8') as opened_file:  # closes the file when done
    GooglePlay = list(reader(opened_file))
header_google = GooglePlay[:1]  # one-row list: the column names are header_google[0]
header_google  # columns in GooglePlay

[['App',
  'Category',
  'Rating',
  'Reviews',
  'Size',
  'Installs',
  'Type',
  'Price',
  'Content Rating',
  'Genres',
  'Last Updated',
  'Current Ver',
  'Android Ver']]

In [7]:
dataset_google = GooglePlay[1:]
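
The loading steps are the same for both files, so they could be wrapped in a small helper. The sketch below is one possible way to do it. `load_csv` is a hypothetical name that does not appear in the notebook, and the temporary file only stands in for the real datasets so that the example runs on its own.

```python
import os
import tempfile
from csv import reader


def load_csv(path, encoding="utf8"):
    """Return (header, rows) for a CSV file. Hypothetical helper."""
    with open(path, encoding=encoding, newline="") as opened_file:
        rows = list(reader(opened_file))
    return rows[0], rows[1:]


# Tiny stand-in file so the sketch runs without the real datasets.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 encoding="utf8", newline="") as tmp:
    tmp.write("App,Category\nInstagram,SOCIAL\n")
    demo_path = tmp.name

demo_header, demo_rows = load_csv(demo_path)
os.remove(demo_path)

print(demo_header)
print(demo_rows)
```

Note that this helper returns the header as a flat list of column names, whereas `header_apple` and `header_google` above are one-row lists that wrap the column names in an outer list.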

### Function to explore datasets

To make the datasets easier to explore, we define a function named `explore_data()` that we can call repeatedly to print rows in a readable way.

In [8]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')  # leaves two blank lines after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [9]:
%%html
<style>
table {float:left}
</style>

### Data dictionary:

Below are the column names of each dataset, with a short description of every column:


#### AppleStore.csv

 - `header_apple`

 - `dataset_apple` 

|#|Name|Description|
|:--|:---|:--|
|1|`id`|App ID|
|2|`track_name`|App name|
|3|`size_bytes`|Size (in bytes)|
|4|`currency`|Currency type|
|5|`price`|Price amount|
|6|`rating_count_tot`|User rating count (for all versions)|
|7|`rating_count_ver`|User rating count (for the current version)|
|8|`user_rating`|Average user rating (for all versions)|
|9|`user_rating_ver`|Average user rating (for the current version)|
|10|`ver`|Latest version code|
|11|`cont_rating`|Content rating|
|12|`prime_genre`|Primary genre|
|13|`sup_devices.num`|Number of supported devices|
|14|`ipadSc_urls.num`|Number of screenshots shown for display|
|15|`lang.num`|Number of supported languages|
|16|`vpp_lic`|VPP device-based licensing enabled|


#### googleplaystore.csv

 - `header_google`

 - `dataset_google`


|#|Name|Description|
|:--|:---|:--|
|1|`App`|Application name|
|2|`Category`|Category the app belongs to|
|3|`Rating`|Overall user rating of the app (as when scraped)|
|4|`Reviews`|Number of user reviews for the app (as when scraped)|
|5|`Size`|Size of the app (as when scraped)|
|6|`Installs`|Number of user downloads/installs for the app (as when scraped)|
|7|`Type`|Paid or Free|
|8|`Price`|Price of the app (as when scraped)|
|9|`Content Rating`|Age group the app is targeted at - Children / Mature 21+ / Adult|
|10|`Genres`|An app can belong to multiple genres (apart from its main category). For example, a musical family game will belong to the Music, Game and Family genres.|

#### Size of both datasets

In [10]:
header_apple

[['id',
  'track_name',
  'size_bytes',
  'currency',
  'price',
  'rating_count_tot',
  'rating_count_ver',
  'user_rating',
  'user_rating_ver',
  'ver',
  'cont_rating',
  'prime_genre',
  'sup_devices.num',
  'ipadSc_urls.num',
  'lang.num',
  'vpp_lic']]

In [11]:
explore_data(dataset_apple, 2, 4, True)

['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7197
Number of columns: 16


In [12]:
header_google

[['App',
  'Category',
  'Rating',
  'Reviews',
  'Size',
  'Installs',
  'Type',
  'Price',
  'Content Rating',
  'Genres',
  'Last Updated',
  'Current Ver',
  'Android Ver']]

In [13]:
explore_data(dataset_google, 2, 4, True)

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10841
Number of columns: 13


### Columns that can help us in our analysis:

Columns marked with `!` are the ones most likely to be useful.

    Content: AppleStore.csv
    
        "id": App ID
     !  "track_name": App name
     !  "size_bytes": Size (in bytes)
     !  "currency": Currency type
     !  "price": Price amount
     !  "rating_count_tot": User rating count (for all versions)
     !  "rating_count_ver": User rating count (for the current version)
        "user_rating": Average user rating (for all versions)
        "user_rating_ver": Average user rating (for the current version)
        "ver": Latest version code
        "cont_rating": Content rating
     !  "prime_genre": Primary genre
        "sup_devices.num": Number of supported devices
        "ipadSc_urls.num": Number of screenshots shown for display
        "lang.num": Number of supported languages
        "vpp_lic": VPP device-based licensing enabled

    Content: googleplaystore.csv
    
     !  App: Application name
     !  Category: Category the app belongs to
        Rating: Overall user rating of the app (as when scraped)
     !  Reviews: Number of user reviews for the app (as when scraped)
        Size: Size of the app (as when scraped)
     !  Installs: Number of user downloads/installs for the app (as when scraped)
     !  Type: Paid or Free
     !  Price: Price of the app (as when scraped)
        Content Rating: Age group the app is targeted at - Children / Mature 21+ / Adult
     !  Genres: An app can belong to multiple genres (apart from its main category).
        For example, a musical family game will belong to the Music, Game and Family genres.

## Data cleaning

Before analyzing the data, we need to clean it. This means that we need to:

1. Detect inaccurate data, and correct or remove it.
2. Detect duplicate data, and remove the duplicates.
3. Remove non-English apps like 爱奇艺PPS -《欢乐颂2》电视剧热播.
4. Remove apps that aren't free.


The Google Play dataset has a dedicated [discussion section](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion?sort=votes), and we can see that [one of the discussions](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion/66015) describes an error for a certain row.

### 1. Detect inaccurate data, and correct or remove it.

We start from the idea that every row must contain as many fields as there are columns in the header, i.e. **13** for the Google Play dataset. If a row has a different number of fields, data is missing or shifted in that row.

In [14]:
expected_fields = len(header_google[0])  # header_google is a one-row list, so take its first item
for index, row in enumerate(dataset_google):
    if len(row) != expected_fields:
        print("index:", index, '\n', "Row:", row)

index: 10472 
 Row: ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [15]:
len(dataset_google[10472]) #discussion row

12

In [16]:
dataset_google[10472] #header not included

['Life Made WI-Fi Touchscreen Photo Frame',
 '1.9',
 '19',
 '3.0M',
 '1,000+',
 'Free',
 '0',
 'Everyone',
 '',
 'February 11, 2018',
 '1.0.19',
 '4.0 and up']

In [17]:
del dataset_google[10472]

Checking that the row has been removed properly:

In [18]:
dataset_google[10472][0]

'osmino Wi-Fi: free WiFi'

**Yes:** index 10472 now holds a different app, so the faulty row is gone.
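
Deleting by position works, but it is not repeatable: running `del dataset_google[10472]` a second time would silently drop a valid row. As an alternative, the sketch below filters rows by field count instead. `split_by_field_count` and the toy rows are hypothetical names and data introduced only for illustration.

```python
def split_by_field_count(rows, expected_fields):
    """Separate well-formed rows from rows with a different field count."""
    valid, malformed = [], []
    for index, row in enumerate(rows):
        if len(row) == expected_fields:
            valid.append(row)
        else:
            malformed.append((index, row))
    return valid, malformed


# Toy rows standing in for dataset_google (3 expected fields here, not 13).
toy_rows = [
    ["App A", "TOOLS", "4.1"],
    ["App B", "4.5"],            # category missing, so the fields are shifted
    ["App C", "GAME", "3.9"],
]

valid, malformed = split_by_field_count(toy_rows, 3)
print(len(valid), malformed)

# Running the filter again on the cleaned rows changes nothing.
valid_again, malformed_again = split_by_field_count(valid, 3)
```

Because the filter depends on the content of each row rather than on its position, re-running the cell is safe.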

### 2. Detect duplicate data, and remove the duplicates.

In [19]:
def detect_duplicated(dataset):
    unique_apps = set()   # identifiers seen so far (a set keeps lookups fast)
    duplicate_apps = []   # one entry per repeated occurrence

    for app in dataset:
        # first column: the app name in Google Play, the app id in the App Store data
        name = app[0]
        if name in unique_apps:
            duplicate_apps.append(name)
        else:
            unique_apps.add(name)

    if len(duplicate_apps) == 0:
        print("In this dataset there are no duplicate apps")
    else:
        print("There are {repeated} repeated apps".format(repeated=len(duplicate_apps)))
        print("\n")
        print('Examples of duplicate apps:', "\n" + "\n", duplicate_apps[:5])

In [20]:
detect_duplicated(dataset_apple)

In this dataset there are no duplicate apps


In [21]:
detect_duplicated(dataset_google)

There are 1181 repeated apps


Examples of duplicate apps: 

 ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']


These are just five of the repeated applications in `dataset_google`.
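
Note that `detect_duplicated()` counts every extra occurrence, not the number of distinct apps that are repeated. To see how many times each name appears, `collections.Counter` from the standard library can help. The following is only a sketch: `count_repeats` and the toy rows are hypothetical and not part of the notebook.

```python
from collections import Counter


def count_repeats(rows, name_index=0):
    """Map each name that occurs more than once to its number of occurrences."""
    counts = Counter(row[name_index] for row in rows)
    return {name: n for name, n in counts.items() if n > 1}


# Toy rows holding only the app name.
toy_rows = [["Instagram"], ["Box"], ["Instagram"], ["Instagram"], ["Slack"], ["Box"]]

repeats = count_repeats(toy_rows)
extra_rows = sum(n - 1 for n in repeats.values())  # rows a de-duplication would drop

print(repeats)
print(extra_rows)
```

In the toy data two names are repeated, but three rows are surplus. The figure printed by `detect_duplicated()` corresponds to the second kind of count.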

* * * 

### What differentiates duplicate applications?

If we take a repeated application as an example, in this case `Instagram`, we see that only one of the fields in its rows differs:

In [22]:
for app in dataset_google:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


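The field that differs is the number of `'Reviews'`. A higher review count probably means the record was collected more recently, so this is the criterion we use to eliminate duplicates: for each app, we keep the row with the most reviews.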

In [23]:
reviews_max = {}

for app in dataset_google:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
print('length of the dictionary:',len(reviews_max))

length of the dictionary: 9659


This is the length our dataset should have once the repeated applications are removed: the original length of the dataset minus the number of repeated entries.

In [24]:
print('expected length', len(dataset_google) - 1181)

expected length 9659


In [25]:
android_clean = []
already_added = []

for row in dataset_google:
    name = row[0]
    n_reviews = float(row[3])
    
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(row) #row list
        already_added.append(name) #name list
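
The two passes above (build `reviews_max`, then filter with `already_added`) can also be folded into a single pass that stores the best row per name. This is a sketch under assumptions, not the notebook's method: `keep_most_reviewed` is a hypothetical name, the Instagram rows are shortened to four fields, and the `Box` row uses made-up values.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row holding the highest review count."""
    best = {}  # name -> row; dicts preserve insertion order in Python 3.7+
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Shortened rows: name, category, rating, reviews (the Box values are made up).
toy_rows = [
    ["Instagram", "SOCIAL", "4.5", "66577313"],
    ["Instagram", "SOCIAL", "4.5", "66577446"],
    ["Box", "BUSINESS", "4.2", "1000"],
    ["Instagram", "SOCIAL", "4.5", "66509917"],
]

clean = keep_most_reviewed(toy_rows)
print(clean)
```

The strict `>` comparison plays the same role as `already_added`: when several rows tie on the maximum, only the first one is kept. One difference is the ordering of the result, which follows the first appearance of each name rather than the position of the row that is kept.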

In [26]:
android_clean[0:2]

[['Photo Editor & Candy Camera & Grid & ScrapBook',
  'ART_AND_DESIGN',
  '4.1',
  '159',
  '19M',
  '10,000+',
  'Free',
  '0',
  'Everyone',
  'Art & Design',
  'January 7, 2018',
  '1.0.0',
  '4.0.3 and up'],
 ['U Launcher Lite – FREE Live Cool Themes, Hide Apps',
  'ART_AND_DESIGN',
  '4.7',
  '87510',
  '8.7M',
  '5,000,000+',
  'Free',
  '0',
  'Everyone',
  'Art & Design',
  'August 1, 2018',
  '1.2.4',
  '4.0.3 and up']]

In [27]:
already_added[0:3] #example apps added 

['Photo Editor & Candy Camera & Grid & ScrapBook',
 'U Launcher Lite – FREE Live Cool Themes, Hide Apps',
 'Sketch - Draw & Paint']

In [28]:
print('length of the android_clean list:', len(android_clean))

length of the android_clean list: 9659


### Removing Non-English Apps

We managed to remove the duplicate app entries from the Google Play dataset. It is not necessary to do the same with the App Store data because it contains no duplicates.

We use English for the applications we develop in our company, and we would like to analyze only applications aimed at an English-speaking audience. However, if we explore the data enough, we will find that **both datasets** have applications with names that suggest they are not aimed at an English-speaking audience.

In [29]:
print(AppleStore[813][1])
print(AppleStore[6731][1])
print('\n')
print(android_clean[4412][0])
print(android_clean[7940][0])

BATTLE BEARS -1
Beast Poker


中国語 AQリスニング
لعبة تقدر تربح DZ


**One way to filter these apps out is to remove every application whose name contains a symbol that is not commonly used in English text**.

English text typically includes letters of the English alphabet, numbers composed of digits from 0 to 9, punctuation marks (., !, ?, ;) and other symbols (+, *, /).

Each character we use in a string has a corresponding number associated with it. For example, the number for the character "a" is 97, for "A" it is 65, and for "爱" it is 29,233.

We can get the corresponding number of each character using the built-in function **ord()**.

In [30]:
print(ord('a'))
print(ord('A'))
print(ord('爱'))
print(ord('5'))
print(ord('+'))

97
65
29233
53
43


The numbers corresponding to the characters that we usually use in an English text are all in the range of 0 to 127, according to the ASCII system (American Standard Code for Information Interchange). 

Based on this range of numbers, we can construct a function that detects whether a character belongs to the common English character set or not. **If the number is equal to or less than 127, the character belongs to the set of common English characters. If an app's name contains a character whose number is greater than 127, it probably means that the app has a non-English name.**

**However, the names of our applications are stored as strings, so how could we take each individual character in a string and check its corresponding number?**

In [31]:
def english_speak(string):
    for character in string:
        valor_character = ord(character)
        if valor_character > 127:
            return False  # found a non-ASCII character: not English
    return True  # every character is ASCII: English

In [32]:
english_speak('Instagram')

True

In [33]:
english_speak('爱奇艺PPS -《欢乐颂2》电视剧热播')

False

In [34]:
english_speak('Docs To Go™ Free Office Suite')  # an English name, yet flagged as non-English

False

In [35]:
english_speak('Instachat 😜')  # an English name, yet flagged as non-English

False

In [36]:
print(ord('™'))
print(ord('😜'))

8482
128540


If we use the function as it stands, we will lose useful data: it labels English app names such as 'Docs To Go™ Free Office Suite' and 'Instachat 😜' as non-English, because characters like ™ and emoji fall outside the ASCII range, and many other English apps would be mislabeled in the same way.
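
To see which characters cause a name to be rejected, a small helper can list every character whose number is above 127. This is only an illustrative sketch; the `non_ascii_chars` name is an assumption and does not appear elsewhere in the notebook.

```python
def non_ascii_chars(name):
    """Return the characters in name whose code point is above 127."""
    return [character for character in name if ord(character) > 127]


print(non_ascii_chars('Docs To Go™ Free Office Suite'))
print(non_ascii_chars('Instachat 😜'))
print(non_ascii_chars('Instagram'))
```

Each of the first two names contains exactly one character outside the ASCII range, which is why a rule based on a single such character is too strict.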


To minimize the data loss, we need a more lenient criterion: **we will only remove an application if its name contains three or more characters whose numbers fall outside the ASCII range**.

This means that English apps with up to two emoji or other special characters will still be labeled as English.

In [37]:
def english_speak(cadena):
    non_validchar = []
    for character in cadena:
        valor_character = ord(character)
        if valor_character > 127:
            non_validchar.append(valor_character)
    if len(non_validchar) >= 3:
        return False
    else:
        return True

In [38]:
english_speak('Docs To Go™ Free Office Suite')

True

In [39]:
english_speak('爱奇艺PPS -《欢乐颂2》电视剧热播')

False

In [40]:
english_speak('Instachat 😜 😜')

True

In [41]:
english_speak('Instachat 😜 😜 😜')

False
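
The same check can be written more compactly by counting the non-ASCII characters and comparing the count against a threshold. The sketch below is a hypothetical variant: the `is_probably_english` name and the `max_non_ascii` parameter are assumptions, and the default of 2 is chosen to match the `>= 3` test in the function above.

```python
def is_probably_english(name, max_non_ascii=2):
    """Return True if name has at most max_non_ascii characters above code point 127."""
    non_ascii_count = sum(1 for character in name if ord(character) > 127)
    return non_ascii_count <= max_non_ascii


print(is_probably_english('Instachat 😜 😜'))
print(is_probably_english('Instachat 😜 😜 😜'))
```

Making the threshold a parameter keeps the rule easy to loosen or tighten later without rewriting the loop.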

We can now use the new function to filter non-English applications out of both datasets.

Loop through each dataset. If an app's name is identified as English, add the entire row to a separate list.

In [42]:
# The app name is in column 1 of AppleStore and in column 0 of android_clean.

list_AppleStore = []

for row in AppleStore:
    nombre = row[1]
    if english_speak(nombre):
        list_AppleStore.append(row)

list_android_clean = []

for row in android_clean:
    nombre = row[0]
    if english_speak(nombre):
        list_android_clean.append(row)

----

Let's explore the datasets and see how many rows remain in each one. Note that `list_AppleStore` still includes the header row, as the first output below shows.

In [43]:
list_AppleStore[:1]

[['id',
  'track_name',
  'size_bytes',
  'currency',
  'price',
  'rating_count_tot',
  'rating_count_ver',
  'user_rating',
  'user_rating_ver',
  'ver',
  'cont_rating',
  'prime_genre',
  'sup_devices.num',
  'ipadSc_urls.num',
  'lang.num',
  'vpp_lic']]

In [44]:
len(list_AppleStore)

6156

In [45]:
len(list_android_clean)

9597

## 8. Isolating the Free Apps

**Our datasets contain both free and non-free applications; we will have to isolate only the free applications for our analysis.**

We then check the length of each dataset to see how many apps are left.

In [46]:
free_AppleStore = []

for row in list_AppleStore:
    precio = row[4]  # 'price' column of the App Store dataset

    if precio == '0.0' or precio == '0':
        free_AppleStore.append(row)
    
len(free_AppleStore)

3203

In [47]:
free_android_clean = []

for row in list_android_clean:
    precio = row[7]  # 'Price' column of the Google Play dataset
    if precio == '0.0' or precio == '0':
        free_android_clean.append(row)
    
len(free_android_clean) 

8848

## 9. Most Common Apps by Genre: Part One

Because our ultimate goal is to publish an app on both Google Play and the App Store, we need to find app profiles that are successful in both markets.

For example, a profile that works well in both markets could be a productivity app that makes use of gamification. 

Let's start the analysis by having an idea of what are the most common genres for each market. To do this, we will have to build frequency tables for some columns of our datasets.

**Instructions:**

1 Give readers more context on why we want to find an app profile that fits both the App Store and Google Play. Explain our validation strategy for an app idea.

2 Inspect both datasets and identify the columns that could be used to generate frequency tables and find out the most common genres in each market.


In [None]:
header_GooglePlay

           "id": App ID
     ! yes "track_name": App Name
     !     "size_bytes": Size (in Bytes)
     !     "currency": Currency Type
     !     "price": Price amount
     !     "rating_count_tot": User Rating counts (for all versions)
     ! yes "rating_count_ver": User Rating counts (for current version)
           "user_rating": Average User Rating value (for all versions)
           "user_rating_ver": Average User Rating value (for current version)
           "ver": Latest version code
           "cont_rating": Content Rating
     !     "prime_genre": Primary Genre
           "sup_devices.num": Number of supporting devices
           "ipadSc_urls.num": Number of screenshots shown for display
           "lang.num": Number of supported languages
           "vpp_lic": Vpp Device Based Licensing Enabled

     [['App',
       'Category',
       'Rating',
       'Reviews',
       'Size',
       'Installs',
       'Type',
       'Price',
       'Content Rating',
       'Genres',
       'Last Updated',
       'Current Ver',
       'Android Ver']]

## 10. Most Common Apps by Genre: Part Two

In the previous section, we looked at our validation strategy for an app idea, and **then inspected the datasets to identify the columns that could help us find the most common genres in each market.**

Our conclusion was:

We need to build a frequency table for:

- **prime_genre** from the App Store dataset,

- the **Genres and Category columns** from the **Google Play** dataset.

We will build two functions that we can use to analyze the frequency tables:

- One function to generate frequency tables that show percentages

- Another function that we can use to display the percentages in descending order

We have already learned how to generate frequency tables that show percentages, and we will build a function for that in the next exercise. However, dictionaries are not sorted by their values, which makes the frequency tables hard to analyze. We need a second function that displays the entries of the frequency table in descending order.

To do that, we will use the built-in function **sorted()**. It takes an iterable (such as a list, a dictionary or a tuple) and returns a list of that iterable's elements sorted in ascending or descending order (the `reverse` parameter controls the direction).

In [None]:
a_list = [50,20,100]
print(sorted(a_list))
print(sorted(a_list, reverse = True))

The sorted() function doesn't work too well with dictionaries because it only considers and returns the dictionary keys.

In [None]:
freq_table_1 = {'Genre_1': 50,'Genre_3': 20,'Genre_2': 100}
sorted(freq_table_1)

However, the sorted() function works well if we transform the dictionary into a list of tuples, where each tuple contains a dictionary key together with its corresponding value. For the sorting to work correctly, the dictionary value comes first and the dictionary key comes second:

In [None]:
freq_table_1 = {'Genre_1': 50,'Genre_3': 20,'Genre_2': 100}
freq_table_1_as_tuple = [(50,'Genre_1'),(20,'Genre_3'),(100, 'Genre_2')]
sorted(freq_table_1_as_tuple)

This is a bit of overkill just to sort a dictionary, but there are much simpler ways to do it once we learn more advanced techniques.
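
One such simpler way, shown here only as a sketch, is to sort the dictionary's `(key, value)` pairs directly and tell `sorted()` to compare them by value through its `key` parameter. The variable name `sorted_entries` is an illustrative choice.

```python
freq_table_1 = {'Genre_1': 50, 'Genre_3': 20, 'Genre_2': 100}

# items() yields (key, value) pairs; the key function makes sorted() compare by value.
sorted_entries = sorted(freq_table_1.items(), key=lambda entry: entry[1], reverse=True)

for genre, count in sorted_entries:
    print(genre, ':', count)
```

This prints the genres from the largest value to the smallest without building the intermediate list of `(value, key)` tuples by hand.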

Using the workaround above, we wrote a helper function called **display_table()**, which can be combined with the function written in the next exercise. The display_table() function shown below:

- Takes two parameters: **dataset** and **index**. The dataset is expected to be a list of lists, and the index an integer.

- Generates a frequency table using the **freq_table()** function (written as an exercise below).

- Transforms the frequency table into a list of tuples, then sorts the list in descending order.

- Prints the entries of the frequency table in descending order.

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])


Now let's create a function to generate frequency tables, and use it in combination with the display_table() function.

**Instructions**

1 Create a function called **freq_table()** that takes two inputs: **dataset** (expected to be a list of lists) and **index** (expected to be an integer).

- The function should return the frequency table (as a dictionary) for any column we want. The frequencies should also be expressed as percentages.

- We already learned how to build frequency tables in the mission on dictionaries.

2 Copy the **display_table()** function we wrote above. Use it to display the frequency table of the columns:


   **prime_genre**

   **Genres**

   **Category**


We will analyze the resulting tables in the next section.

In [None]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total +=1
        value = row[index]
        if value in table:
            table[value] +=1
        else:
            table[value] = 1
        
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total ) * 100
        table_percentages[key]=percentage
            
    return table_percentages
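
As a quick sanity check, a function like this can be run on a tiny hand-made dataset where the percentages are easy to verify by eye. The snippet below is a sketch: it repeats a compact, equivalent version of `freq_table()` so that it runs on its own, and `toy_rows` is made-up data rather than rows from either store.

```python
def freq_table(dataset, index):
    """Return {value: percentage} for the column at position index."""
    counts = {}
    for row in dataset:
        value = row[index]
        counts[value] = counts.get(value, 0) + 1
    total = len(dataset)
    return {value: count / total * 100 for value, count in counts.items()}


# Hypothetical rows: app name at index 0, genre at index 1.
toy_rows = [
    ['App A', 'Games'],
    ['App B', 'Games'],
    ['App C', 'Games'],
    ['App D', 'Education'],
]

toy_table = freq_table(toy_rows, 1)
print(toy_table)
print(sum(toy_table.values()))
```

Three of the four toy apps are games, so `Games` should come out at 75% and `Education` at 25%, and the percentages should add up to 100.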

In [None]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [None]:
# free_android_clean free_AppleStore

In [None]:
display_table(free_AppleStore, 11)  # prime_genre (App Store)

In [None]:
display_table(free_android_clean, 9)  # Genres (Google Play)

In [None]:
display_table(free_android_clean, 1)  # Category (Google Play)

## 11. Most Common Apps by Genre: Part Three

In the previous section, we generated frequency tables for the prime_genre, Genres and Category columns. Now we will focus on analyzing these frequency tables.

Remember that our dataset only contains free English apps, so we should be careful not to extend our conclusions beyond that scope. If gaming apps turn out to be the most numerous among the free English apps on Google Play, that does not mean we would see the same pattern on Google Play as a whole.


**Instructions**

Analyze the frequency table generated for the prime_genre column of the App Store dataset.

- What is the most common genre?
- What is the runner-up?
- What other patterns do you see?
- What is the general impression: are most apps designed for practical purposes (education, shopping, utilities, productivity, lifestyle) or rather for entertainment (games, photo and video, social networking, sports, music)?

- Can you recommend an app profile for the App Store market based on this frequency table alone?

- If there is a large number of apps for a particular genre, does that also imply that apps of that genre generally have a large number of users?

Analyze the frequency tables generated for the Category and Genres columns of the Google Play dataset.

- What are the most common genres?
- What other patterns do you see?
- Compare the patterns you see for the Google Play market with those you saw for the App Store market.
- Can you recommend an app profile based on what you have found so far?
- Do the frequency tables you generated reveal the most frequent app genres, or which genres have the most users?

In [None]:
display_table(free_AppleStore,11)   

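Original App Store dataset:

- applestore: 7197
- free_applestore: 3203
- The free English apps make up 44.5% of the original dataset

Original Google Play dataset:

- GooglePlay: 10841
- free_android_clean: 8848

- The free, English, de-duplicated apps make up 81.6% of the original dataset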
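
The two percentages can be reproduced from the counts listed above. This is a small sketch of the arithmetic; the variable names are illustrative.

```python
# Counts taken from the lists above.
original_ios, free_ios = 7197, 3203
original_android, free_android = 10841, 8848

share_ios = free_ios / original_ios * 100
share_android = free_android / original_android * 100

print(round(share_ios, 1))
print(round(share_android, 1))
```

Note that the numerators are the counts left after all the cleaning steps, so these shares describe the filtered datasets relative to the original ones.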

## 12. Most Popular Apps by Genre on the App Store

The frequency tables we analyzed in the previous section showed that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and fun apps.

Now we would like to get an idea of which kinds of apps have the most users.

One way to find out which genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play dataset, we can find this information in the Installs column, but it is missing from the App Store dataset.

As a workaround, we will take the total number of user ratings as a proxy, which we can find in the **rating_count_tot** column.

Let's start by calculating the average number of user ratings per app genre on the App Store. To do that, we need to:

- Isolate the apps of each genre.
- Add up the user ratings for the apps of that genre.
- Divide the sum by the number of apps belonging to that genre (not by the total number of apps).

To calculate the average number of user ratings for each genre, we will use a for loop inside another for loop. Here is an example of a for loop used inside another for loop:

In [None]:
some_strings = ['FIRST','SECOND']
some_integers = [1,2,3,4,5]

for string in some_strings:
    print(string)
    
    for integer in some_integers:
        print(integer)

Above, we can see that:

- First we iterate over the some_strings list, and for each iteration:
    - We print string (the iteration variable).
    - We start another iteration over the some_integers list.
    - For each iteration over this list, we print integer (the iteration variable).

For each of the two iterations over some_strings (there are two because some_strings has only two list elements), an inner iteration happens over the some_integers list.

The second iteration over some_strings starts only when the iteration over some_integers has completed. Note that all the elements of some_integers are printed for each of the two iterations over some_strings.

A loop inside another loop is called a nested loop. We will use a nested loop to calculate the averages mentioned earlier.

**Instructions**

1. Start by generating a frequency table for the prime_genre column to get the unique app genres (below, we will need to loop over the unique genres). You can use the freq_table() function written in an earlier section.

2. Loop over the unique genres of the App Store dataset. For each iteration (below, we assume the iteration variable is named genre):

- Initialize a variable named total with a value of 0. This variable will store the sum of user ratings (the number of ratings, not the actual ratings) specific to each genre.

- Initialize a variable named len_genre with a value of 0. This variable will store the number of apps specific to each genre.

- Loop over the App Store dataset, and for each iteration:

    - Save the app's genre to a variable named genre_app.

    - If genre_app is the same as genre (the iteration variable of the main loop), then:

        - Save the app's number of user ratings as a float.

        - Add the number of user ratings to the total variable.

        - Increment the len_genre variable by 1.


- Compute the average number of user ratings by dividing total by len_genre. This should be done outside the nested loop.

- Print the app genre and the average number of user ratings. This should also be done outside the nested loop.

3. Analyze the results and try to come up with at least one app profile recommendation for the App Store. Keep in mind that there is no fixed answer here, and it is fine if the app profile you recommend differs from the one recommended in the solution notebook.
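
The steps above can be sketched as follows. This is a minimal illustration on made-up rows, not the notebook's result: `make_row`, `sample_ios` and `avg_ratings` are hypothetical names, and the unique genres are collected with a plain loop instead of `freq_table()` so that the snippet runs on its own. The column positions (5 for `rating_count_tot`, 11 for `prime_genre`) follow the App Store header shown earlier.

```python
def make_row(name, rating_count, genre):
    """Build a hypothetical 16-column row with the App Store column layout."""
    row = [''] * 16
    row[1] = name          # track_name
    row[5] = rating_count  # rating_count_tot
    row[11] = genre        # prime_genre
    return row


sample_ios = [
    make_row('App A', '100', 'Games'),
    make_row('App B', '300', 'Games'),
    make_row('App C', '50', 'Education'),
]

# Step 1: collect the unique genres.
unique_genres = []
for app in sample_ios:
    if app[11] not in unique_genres:
        unique_genres.append(app[11])

# Step 2: for each genre, sum the rating counts and count the apps.
avg_ratings = {}
for genre in unique_genres:
    total = 0
    len_genre = 0
    for app in sample_ios:
        genre_app = app[11]
        if genre_app == genre:
            total += float(app[5])
            len_genre += 1
    # Outside the nested loop: average number of ratings for this genre.
    avg_ratings[genre] = total / len_genre
    print(genre, ':', avg_ratings[genre])
```

With the real data, `sample_ios` would be replaced by `free_AppleStore`, and the genres could be taken from the keys of `freq_table(free_AppleStore, 11)`.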

In [None]:
display_table(free_AppleStore, 11)  # prime_genre is column 11 of the App Store data