# Lab Assignment One: Exploring Table Data
Arely Alcantara, Emily Fashenpour

## 1. Business Understanding

New iOS applications are constantly being developed, and their creators want them to succeed. The table data we found contains each application's name, ratings, genre, version, and other descriptive information. This data could be useful to iOS developers building apps of a similar genre or content rating, since it shows what comparable apps did well and what they did poorly.

The Mobile App Store data we found was collected in July 2017 and features 7,197 apps with 18 features (excluding duplicate ID and name fields). The dataset can be downloaded from kaggle.com. The data was collected to show how similar apps stand out or perform relative to one another, and it was gathered from the App Store's API.

In analyzing this dataset, we hope to see what makes an app successful. This information could be valuable to companies that are rebranding their apps or planning a future release. It could also show startups the mistakes common to the least successful apps, so they know what to avoid and what they can do to be successful.

Dataset URL: https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps#AppleStore.csv

Questions we're trying to address: What makes an app successful in the App Store? How do apps in different markets compare?


## 2. Data Understanding

### 2.1 Data Defining & Description

Our dataset consists of two files: one contains the general app information, and the other contains the app descriptions. Both files share the id, track_name, and size_bytes columns, so we merged them on those columns to have one central data source.

In addition, we renamed some columns whose names were not descriptive enough. For instance, the first column contains the index of each entry, so we renamed it from 'Unnamed: 0' to 'index'. 'ipadSc_urls.num' doesn't tell us much, so we renamed it to 'screenshots', since this column lists the number of screenshots displayed on the App Store page for that app.

In [54]:
import pandas as pd

# read both CSV files with pandas
appStore = pd.read_csv('data/AppleStore.csv')
description = pd.read_csv('data/appleStore_description.csv')

# merge the two datasets: one file holds the general information, the other the descriptions
outer_merge = pd.merge(appStore, description, on=['id', 'track_name', 'size_bytes'], how="outer", indicator=False)

# rename 'track_name' to 'app_name' and 'prime_genre' to 'genre',
# remove the '.num' suffix on some column names, and rename 'Unnamed: 0' to 'index'
outer_merge = outer_merge.rename(columns={
    'track_name': 'app_name',
    'prime_genre': 'genre',
    'sup_devices.num': 'sup_devices',
    'ipadSc_urls.num': 'screenshots',
    'lang.num': 'sup_lang',
    'Unnamed: 0': 'index',
    'app_desc': 'app_desc_count',
})

print(outer_merge.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7197 entries, 0 to 7196
Data columns (total 18 columns):
index               7197 non-null int64
id                  7197 non-null int64
app_name            7197 non-null object
size_bytes          7197 non-null int64
currency            7197 non-null object
price               7197 non-null float64
rating_count_tot    7197 non-null int64
rating_count_ver    7197 non-null int64
user_rating         7197 non-null float64
user_rating_ver     7197 non-null float64
ver                 7197 non-null object
cont_rating         7197 non-null object
genre               7197 non-null object
sup_devices         7197 non-null int64
screenshots         7197 non-null int64
sup_lang            7197 non-null int64
vpp_lic             7197 non-null int64
app_desc_count      7197 non-null object
dtypes: float64(3), int64(9), object(6)
memory usage: 1.0+ MB
None
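
The merge above can be sanity-checked before it is relied on. The following is a minimal sketch, not part of the original notebook: it uses two small made-up frames in place of the real CSV files and assumes each (id, track_name, size_bytes) combination should appear once per file. The `validate` and `indicator` arguments of `pd.merge` flag repeated keys and rows that exist in only one file.

```python
import pandas as pd

# Toy stand-ins for AppleStore.csv and appleStore_description.csv (made-up values).
apps = pd.DataFrame({
    "id": [1, 2, 3],
    "track_name": ["A", "B", "C"],
    "size_bytes": [100, 200, 300],
    "price": [0.0, 0.99, 2.99],
})
descriptions = pd.DataFrame({
    "id": [1, 2, 4],
    "track_name": ["A", "B", "D"],
    "size_bytes": [100, 200, 400],
    "app_desc": ["first", "second", "fourth"],
})

keys = ["id", "track_name", "size_bytes"]

# validate="one_to_one" raises a MergeError if a key combination repeats in either frame;
# indicator=True adds a "_merge" column recording which frame each row came from.
checked = pd.merge(apps, descriptions, on=keys, how="outer",
                   validate="one_to_one", indicator=True)

# Count rows by origin and list the rows that did not match in both frames.
merge_counts = checked["_merge"].value_counts()
unmatched = checked[checked["_merge"] != "both"]
print(merge_counts)
print(unmatched[keys + ["_merge"]])
```

On the real data, the `info()` output reports 7,197 non-null values in every column, which is consistent with every row having a match in both files, since an outer merge would otherwise leave missing values in the unmatched rows.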


In [53]:
# drop the 'index' column since it only contains the literal index of each entry
# drop 'id' because it is not useful to us
# drop 'currency' because it has only one value (all prices are in US dollars)
# drop 'vpp_lic' (VPP device-based licensing enabled) since it is not relevant to us
outer_merge.drop(['index', 'id', 'currency', 'vpp_lic'], axis=1, inplace=True)
outer_merge

| | app_name | size_bytes | price | rating_count_tot | rating_count_ver | user_rating | user_rating_ver | ver | cont_rating | genre | sup_devices | screenshots | sup_lang | app_desc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PAC-MAN Premium | 100788224 | 3.99 | 21292 | 26 | 4.0 | 4.5 | 6.3.5 | 4+ | Games | 38 | 5 | 10 | SAVE 20%, now only $3.99 for a limited time!\n... |
| 1 | Evernote - stay organized | 158578688 | 0.00 | 161065 | 26 | 4.0 | 3.5 | 8.2.2 | 4+ | Productivity | 37 | 5 | 23 | Let Evernote change the way you organize your ... |
| 2 | WeatherBug - Local Weather, Radar, Maps, Alerts | 100524032 | 0.00 | 188583 | 2822 | 3.5 | 4.5 | 5.0.0 | 4+ | Weather | 37 | 5 | 3 | Download the most popular free weather app pow... |
| 3 | eBay: Best App to Buy, Sell, Save! Online Shop... | 128512000 | 0.00 | 262241 | 649 | 4.0 | 4.5 | 5.10.0 | 12+ | Shopping | 37 | 5 | 9 | The eBay app is the best way to find anything ... |
| 4 | Bible | 92774400 | 0.00 | 985920 | 5320 | 4.5 | 5.0 | 7.5.1 | 4+ | Reference | 37 | 5 | 45 | On more than 250 million devices around the wo... |
| 5 | Shanghai Mahjong | 10485713 | 0.99 | 8253 | 5516 | 4.0 | 4.0 | 1.8 | 4+ | Games | 47 | 5 | 1 | ★ WINNER "BEST GAME" 2009\n★ 3rd PLACE WINNER ... |
| 6 | PayPal - Send and request money safely | 227795968 | 0.00 | 119487 | 879 | 4.0 | 4.5 | 6.12.0 | 4+ | Finance | 37 | 0 | 19 | Description\nTAP INTO YOUR MONEY\nSend money o... |
| 7 | Pandora - Music & Radio | 130242560 | 0.00 | 1126879 | 3594 | 4.0 | 4.5 | 8.4.1 | 12+ | Music | 37 | 4 | 1 | Find the music you love and let the music you ... |
| 8 | PCalc - The Best Calculator | 49250304 | 9.99 | 1117 | 4 | 4.5 | 5.0 | 3.6.6 | 4+ | Utilities | 37 | 5 | 1 | PCalc is the powerful choice for scientists, e... |
| 9 | Ms. PAC-MAN | 70023168 | 3.99 | 7885 | 40 | 4.0 | 4.0 | 4.0.4 | 4+ | Games | 38 | 0 | 10 | Now with MFi controller support!\n\nMs. PAC-MA... |
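
The claim that a column such as currency holds a single value can be checked before the column is dropped. This is a small sketch on a made-up frame that borrows the column names from the merged data; the values are illustrative only. It also shows a non-in-place `drop`, an assumption about preference rather than the notebook's method, which leaves the original frame intact if the cell is re-run.

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the merged data.
df = pd.DataFrame({
    "index": [1, 2, 3],
    "id": [101, 102, 103],
    "app_name": ["A", "B", "C"],
    "currency": ["USD", "USD", "USD"],
    "vpp_lic": [1, 1, 0],
    "price": [0.0, 0.99, 3.99],
})

# Columns with a single distinct value carry no information for comparing apps.
n_unique = df.nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()
print(constant_cols)

# drop() returns a new frame here, so df itself is unchanged.
trimmed = df.drop(columns=["index", "id", "currency", "vpp_lic"])
print(trimmed.columns.tolist())
```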


### 2.2 Data Quality

Explain any missing or duplicate data.

Visualize the entries that are missing (are they mistakes? why do these quality issues exist?).

How should these issues be dealt with?


Choose elimination or imputation and justify the choice.
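
The checks listed above could start from something like the following sketch. It runs on a small made-up frame, not the App Store data, and the choice of the median for imputation is only an illustrative assumption. It counts missing values per column, counts fully duplicated rows, and contrasts elimination with imputation.

```python
import numpy as np
import pandas as pd

# Toy frame with one duplicated row and one missing price (made-up values).
df = pd.DataFrame({
    "app_name": ["A", "B", "B", "C"],
    "price": [0.0, 0.99, 0.99, np.nan],
    "user_rating": [4.0, 3.5, 3.5, 4.5],
})

# Number of missing values in each column.
missing_per_column = df.isnull().sum()

# Number of rows that exactly repeat an earlier row.
duplicate_rows = int(df.duplicated().sum())

# Remove exact duplicates first, then compare the two strategies.
deduplicated = df.drop_duplicates()

# Elimination: discard any row that still has a missing value.
eliminated = deduplicated.dropna()

# Imputation: fill the missing price with the median of the observed prices.
imputed = deduplicated.fillna({"price": deduplicated["price"].median()})

print(missing_per_column)
print(duplicate_rows, len(eliminated), len(imputed))
```

For the merged data itself, the `info()` output in section 2.1 reports 7,197 non-null values for all 18 columns, so a check like this would mainly serve to confirm that nothing is missing and to look for duplicates.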

## 3. Data Visualization

### 3.1 Data Exploration

Choose and visualize distributions for a subset of single attributes.


Use histograms, kernel density estimation, box plots, etc.


Describe anything meaningful found from the visualizations.


Other sources can be used to boost the visualizations.

Visualize at least 5 attributes (at least one categorical and one numerical).
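
A possible starting point for these single-attribute plots is sketched below. It assumes matplotlib is available and uses a made-up frame with a few of the dataset's column names; the attributes and plot types chosen here are illustrative, not the final selection for the lab.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy frame with made-up values; column names follow the merged data.
df = pd.DataFrame({
    "genre": ["Games", "Games", "Music", "Finance", "Games", "Music"],
    "price": [0.0, 3.99, 0.0, 0.0, 0.99, 9.99],
    "user_rating": [4.0, 4.5, 3.5, 4.0, 5.0, 2.5],
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Histogram of a numerical attribute.
axes[0].hist(df["user_rating"].to_numpy(), bins=5)
axes[0].set_title("user_rating (histogram)")

# Box plot of another numerical attribute.
axes[1].boxplot(df["price"].to_numpy())
axes[1].set_title("price (box plot)")

# Bar chart of counts for a categorical attribute.
genre_counts = df["genre"].value_counts()
axes[2].bar(genre_counts.index.tolist(), genre_counts.to_numpy())
axes[2].set_title("genre (counts)")

fig.tight_layout()
```

In a notebook the figure renders at the end of the cell; in a plain script, `plt.show()` would be needed after the last line.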

Visualize relationships between a subset of attributes.

Use a visualization method.


Explain any interesting relationships.

Visualize at least 3 subsets.
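
Before plotting relationships, the numbers behind them can be computed directly with pandas. The sketch below uses a made-up frame and an illustrative choice of attribute pairs (price against rating, rating by genre, paid against free); none of the values or results come from the App Store data.

```python
import pandas as pd

# Toy frame with made-up values; column names follow the merged data.
df = pd.DataFrame({
    "genre": ["Games", "Games", "Music", "Music"],
    "price": [0.0, 2.0, 0.0, 4.0],
    "user_rating": [4.0, 5.0, 3.0, 4.0],
    "rating_count_tot": [100, 300, 50, 150],
})

# Pairwise Pearson correlations between the numerical attributes.
numeric_cols = ["price", "user_rating", "rating_count_tot"]
corr = df[numeric_cols].corr()

# Average rating for each value of a categorical attribute.
mean_rating_by_genre = df.groupby("genre")["user_rating"].mean()

# Average rating for free apps compared with paid apps.
pricing = df["price"].gt(0).map({True: "paid", False: "free"}).rename("pricing")
mean_rating_by_pricing = df.groupby(pricing)["user_rating"].mean()

print(corr)
print(mean_rating_by_genre)
print(mean_rating_by_pricing)
```

Each of these tables maps onto a plot: the correlation matrix onto a heatmap, and the grouped means onto bar charts or grouped box plots.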