# Lab Assignment One: Exploring Table Data
Arely Alcantara, Emily Fashenpour

## 1. Business Understanding

New iOS applications are constantly being developed, and their creators want them to succeed. The table data we found contains each application's name, rating, genre, version, and other information that describes an app. This data could be useful to other iOS developers building apps similar in genre, content rating, etc., who want to see what comparable apps did well and what they did poorly.

The Mobile App Store data we found was collected in July 2017 and features roughly 7,000 different apps (7,197 rows in the merged table) with 18 features (excluding duplicate ID and name fields). This dataset can be accessed and downloaded from kaggle.com. The purpose of collecting this data was to see how similar apps stand out, or how they perform relative to one another. The data was collected from the App Store's API.

In analyzing this dataset, we hope to see what makes an app successful. This information could be extremely valuable to companies trying to rebrand their apps or planning a future release. It could also show startups the common mistakes made by the least successful apps, so they know what to avoid and what they can do to be successful.

Dataset URL: https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps#AppleStore.csv

Questions we're trying to address: What makes an app successful in the App Store? How do apps in different markets compare?


## 2. Data Understanding

Our dataset consists of 2 files: one contains the general app info, and the other contains the actual app descriptions. Both files share the id, track_name, and size_bytes columns, so we decided to merge them on those columns and have one central data source.

In addition, we renamed some columns whose names were not descriptive enough. For instance, the first column contains the index of each entry, so we renamed it from 'Unnamed: 0' to 'index'. 'ipadSc_urls.num' doesn't tell us much, so we renamed it to 'screenshots', as this column lists the number of screenshots displayed on the App Store page for that app.

In [119]:
import pandas as pd

#read data from csv using pandas
appStore = pd.read_csv('data/AppleStore.csv')
description = pd.read_csv('data/appleStore_description.csv')

#merge the 2 datasets, since the general information and the descriptions are in separate files
outer_merge = pd.merge(appStore, description, on=['id', 'track_name', 'size_bytes'], how="outer", indicator=False)

# rename 'track_name' to 'app_name', 'prime_genre' to 'genre', 'Unnamed: 0' to 'index', 'app_desc' to 'app_desc_count', and replace the '.num' column names
outer_merge = outer_merge.rename(columns = {'track_name': 'app_name', 'prime_genre': 'genre', 'sup_devices.num': 'sup_devices', 'ipadSc_urls.num':'screenshots', 'lang.num':'sup_lang', 'Unnamed: 0': 'index', 'app_desc': 'app_desc_count'})

print(outer_merge.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7197 entries, 0 to 7196
Data columns (total 18 columns):
index               7197 non-null int64
id                  7197 non-null int64
app_name            7197 non-null object
size_bytes          7197 non-null int64
currency            7197 non-null object
price               7197 non-null float64
rating_count_tot    7197 non-null int64
rating_count_ver    7197 non-null int64
user_rating         7197 non-null float64
user_rating_ver     7197 non-null float64
ver                 7197 non-null object
cont_rating         7197 non-null object
genre               7197 non-null object
sup_devices         7197 non-null int64
screenshots         7197 non-null int64
sup_lang            7197 non-null int64
vpp_lic             7197 non-null int64
app_desc_count      7197 non-null object
dtypes: float64(3), int64(9), object(6)
memory usage: 1.0+ MB
None
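
A quick way to confirm that the two files line up on the three shared columns is to let pandas check the merge itself. The following is only a sketch on made-up rows (the frames `apps` and `descriptions` are hypothetical stand-ins for the two CSV files); it uses the `validate` and `indicator` options of `pd.merge` to flag repeated keys and rows that appear in only one file.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (made-up rows, not from the real dataset).
apps = pd.DataFrame({
    "id": [1, 2, 3],
    "track_name": ["A", "B", "C"],
    "size_bytes": [100, 200, 300],
    "price": [0.0, 0.99, 2.99],
})
descriptions = pd.DataFrame({
    "id": [1, 2, 3],
    "track_name": ["A", "B", "C"],
    "size_bytes": [100, 200, 300],
    "app_desc": ["first", "second", "third"],
})

keys = ["id", "track_name", "size_bytes"]

# validate="one_to_one" raises a MergeError if a key repeats on either side;
# indicator=True adds a "_merge" column saying which file each row came from.
checked = pd.merge(apps, descriptions, on=keys, how="outer",
                   validate="one_to_one", indicator=True)

# Rows found in only one file are marked "left_only" or "right_only".
unmatched = checked[checked["_merge"] != "both"]
print(checked["_merge"].value_counts())
print("unmatched rows:", len(unmatched))
```

If `unmatched` is empty on the real files, the outer merge keeps exactly one row per app and the `_merge` column can be dropped.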


##### Removing columns that are not needed
Some columns are not necessary for our analysis, so we decided to remove them: the index column, which just lists the index of each entry; the id column, which is a unique identifier and adds nothing to our findings; the currency column, whose only value is 'USD'; the app_name column, since the name of the app is not relevant to us; and the vpp_lic column, a 0/1 flag that tells us whether VPP Device Based Licensing is enabled.

In [120]:
import numpy as np

#drop unneeded columns
outer_merge.drop(['index', 'id', 'currency', 'app_name', 'vpp_lic'], axis=1, inplace=True)

#change ordinal features to ints - ratings come in .5 steps, so multiply by 2 to make them manageable
#values for ratings and version ratings will be between 0 and 10
outer_merge['user_rating'] = (outer_merge['user_rating'] * 2).astype(np.int64)
outer_merge['user_rating_ver'] = (outer_merge['user_rating_ver'] * 2).astype(np.int64)
#strip the '+' from the content rating; regex=False treats '+' as a literal character
outer_merge['cont_rating'] = outer_merge['cont_rating'].str.replace('+', '', regex=False).astype(np.int64)
#replace each description with its length in characters
outer_merge['app_desc_count'] = outer_merge['app_desc_count'].str.len()

outer_merge



| | size_bytes | price | rating_count_tot | rating_count_ver | user_rating | user_rating_ver | ver | cont_rating | genre | sup_devices | screenshots | sup_lang | app_desc_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100788224 | 3.99 | 21292 | 26 | 8 | 9 | 6.3.5 | 4 | Games | 38 | 5 | 10 | 1533 |
| 1 | 158578688 | 0.00 | 161065 | 26 | 8 | 7 | 8.2.2 | 4 | Productivity | 37 | 5 | 23 | 3952 |
| 2 | 100524032 | 0.00 | 188583 | 2822 | 7 | 9 | 5.0.0 | 4 | Weather | 37 | 5 | 3 | 2090 |
| 3 | 128512000 | 0.00 | 262241 | 649 | 8 | 9 | 5.10.0 | 12 | Shopping | 37 | 5 | 9 | 3997 |
| 4 | 92774400 | 0.00 | 985920 | 5320 | 9 | 10 | 7.5.1 | 4 | Reference | 37 | 5 | 45 | 2998 |
| 5 | 10485713 | 0.99 | 8253 | 5516 | 8 | 8 | 1.8 | 4 | Games | 47 | 5 | 1 | 2769 |
| 6 | 227795968 | 0.00 | 119487 | 879 | 8 | 9 | 6.12.0 | 4 | Finance | 37 | 0 | 19 | 1183 |
| 7 | 130242560 | 0.00 | 1126879 | 3594 | 8 | 9 | 8.4.1 | 12 | Music | 37 | 4 | 1 | 1563 |
| 8 | 49250304 | 9.99 | 1117 | 4 | 9 | 10 | 3.6.6 | 4 | Utilities | 37 | 5 | 1 | 481 |
| 9 | 70023168 | 3.99 | 7885 | 40 | 8 | 8 | 4.0.4 | 4 | Games | 38 | 0 | 10 | 964 |
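
The reasoning behind the dropped columns (a single repeated value, or one unique value per row) can be checked mechanically before dropping anything. Below is a minimal sketch on a made-up frame; `toy` and its values are hypothetical and only borrow a few column names from the table above.

```python
import pandas as pd

# Toy frame with made-up values, reusing a few column names from the dataset.
toy = pd.DataFrame({
    "id": [11, 12, 13, 14],
    "currency": ["USD", "USD", "USD", "USD"],
    "vpp_lic": [1, 1, 0, 1],
    "price": [0.0, 0.99, 0.0, 3.99],
})

# Number of distinct values in each column.
distinct = toy.nunique()

# A column with a single distinct value carries no information.
constant_cols = distinct[distinct == 1].index.tolist()

# A column with one distinct value per row behaves like an identifier.
id_like_cols = distinct[distinct == len(toy)].index.tolist()

print("constant:", constant_cols)
print("identifier-like:", id_like_cols)

# For a 0/1 flag, the share of each value shows how unbalanced it is.
print(toy["vpp_lic"].value_counts(normalize=True))
```

Running the same three checks on the merged table would give a data-driven justification for each column that is removed.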


### 2.2 Data Quality

explain missing data, or duplicate data. 

visualize entries that are missing (mistakes? why do these quality issues exist?)

how to deal with these issues? 


CHOOSE: elimination or imputation and JUSTIFY
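
One possible starting point for this section is sketched below on a small made-up frame (`toy` is hypothetical, with gaps added on purpose). It counts missing values and duplicate rows, then shows elimination and imputation side by side. Note that the `info()` output earlier reports 7197 non-null values in every column, so on the real table the missing-value counts may well all be zero.

```python
import numpy as np
import pandas as pd

# Made-up rows with deliberate gaps and repeats, for illustration only.
toy = pd.DataFrame({
    "genre": ["Games", "Games", None, "Music", "Games"],
    "price": [0.0, 0.0, 1.99, np.nan, 0.0],
    "user_rating": [8, 8, 9, 7, 8],
})

# Missing values per column, and number of fully repeated rows.
missing_per_column = toy.isna().sum()
duplicate_rows = int(toy.duplicated().sum())
print(missing_per_column)
print("duplicate rows:", duplicate_rows)

deduplicated = toy.drop_duplicates()

# Option 1, elimination: drop every row that still has a missing value.
eliminated = deduplicated.dropna()

# Option 2, imputation: fill numeric gaps with the median and
# categorical gaps with an explicit placeholder label.
imputed = deduplicated.copy()
imputed["price"] = imputed["price"].fillna(imputed["price"].median())
imputed["genre"] = imputed["genre"].fillna("Unknown")

print(len(eliminated), "rows after elimination")
print(len(imputed), "rows after imputation")
```

Elimination is simpler but discards rows; imputation keeps them at the cost of introducing estimated values, so the choice should be justified by how many rows are affected.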

## 3. Data Visualization

### 3.1 Data Exploration

CHOOSE and VISUALIZE distributions for a subset of single attributes


use histograms, kernel density estimation, box plots, etc. 


describe anything meaningful found from the visualizations


**CAN USE other sources to boost visualizations**

VISUALIZE at least 5 attributes (one categorical, one numerical)
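
As a rough sketch of what these single-attribute plots could look like, the example below draws a histogram, a box plot, and a bar chart with matplotlib. The frame `toy` holds made-up values and is only an assumption for illustration; in the notebook the merged table would take its place, and the `Agg` backend line can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values that mimic three columns of the merged table.
toy = pd.DataFrame({
    "user_rating": [8, 9, 7, 8, 10, 0, 9],
    "price": [0.0, 3.99, 0.0, 0.99, 9.99, 0.0, 0.0],
    "genre": ["Games", "Games", "Music", "Games", "Music", "Weather", "Games"],
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Numerical (ordinal) attribute: one histogram bin per doubled rating value 0..10.
axes[0].hist(toy["user_rating"], bins=range(0, 12))
axes[0].set_title("user_rating (0-10)")

# Numerical attribute: the box plot shows median, quartiles, and outliers.
axes[1].boxplot(toy["price"].to_numpy())
axes[1].set_title("price (USD)")

# Categorical attribute: number of apps per genre.
toy["genre"].value_counts().plot.bar(ax=axes[2])
axes[2].set_title("apps per genre")

fig.tight_layout()
```

The same pattern extends to the other attributes: histograms or box plots for counts such as sup_lang and screenshots, and bar charts for categories such as cont_rating.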

In [None]:
Visualize Relationships between a subset of attributes.

use visualization method


explain any interesting relationships

visualize at least 3 subsets
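
Before plotting relationships, it can help to tabulate them first. The sketch below works on made-up rows (`toy` and the derived `pricing` column are hypothetical) and computes three candidate subsets: pairwise correlations between numeric attributes, rating by genre, and genre by free versus paid.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; column names follow the merged table.
toy = pd.DataFrame({
    "genre": ["Games", "Games", "Music", "Music", "Weather", "Weather"],
    "price": [0.0, 3.99, 0.0, 9.99, 0.0, 0.99],
    "user_rating": [8, 9, 8, 9, 7, 8],
    "rating_count_tot": [21000, 8000, 1100000, 1100, 190000, 500],
})

# Hypothetical helper column: label each app as free or paid.
toy["pricing"] = np.where(toy["price"] == 0, "free", "paid")

# Subset 1: Pearson correlation between the numeric attributes.
corr = toy[["price", "user_rating", "rating_count_tot"]].corr()

# Subset 2: rating summary per genre.
by_genre = toy.groupby("genre")["user_rating"].agg(["mean", "median", "count"])

# Subset 3: how free and paid apps are distributed across genres,
# plus the mean rating of each pricing group.
genre_by_pricing = pd.crosstab(toy["genre"], toy["pricing"])
rating_by_pricing = toy.groupby("pricing")["user_rating"].mean()

print(corr.round(2))
print(by_genre)
print(genre_by_pricing)
print(rating_by_pricing)
```

Each table maps onto a plot: the correlation matrix onto a heatmap, the per-genre summary onto grouped box plots, and the crosstab onto a stacked bar chart.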