## 📱 Free/Paid App Classification

Given *data about Apple store app rankings*, let's try to predict whether a given app is listed in the **free** feed or the **paid** feed.

We will use a logistic regression model to make our predictions.

Data source: https://www.kaggle.com/datasets/iamsk7/apple-store-ranks-2019

### Importing Libraries

In [3]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

In [4]:
data = pd.read_csv('ranks.csv')
data

Unnamed: 0,appid,date,category,feed,name,publisher,price,ranking,change,sub_ranking,sub_change,comment_rating,comment_num,keyword_cover
0,691828408,2019-11-30,5000,free,微视-短视频创作与分享,Tencent Technology (Beijing) Company Limited,0.0,1,12,1,5,4.5,124000,17503
1,1458072671,2019-11-30,5000,free,剪映 - 轻而易剪,深圳市脸萌科技有限公司,0.0,2,-1,2,-1,4.9,632000,19978
2,1448327606,2019-11-30,5000,free,刷宝短视频,成都力奥文化传播有限公司,0.0,3,-1,3,-1,4.8,599000,6759
3,1472502819,2019-11-30,5000,free,快手极速版,华艺汇龙,0.0,4,-1,4,-1,4.9,228000,38549
4,1142110895,2019-11-30,5000,free,抖音短视频,"Beijing Microlive Vision Technology Co., Ltd",0.0,5,-1,5,-1,4.9,23870000,21441
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
112622,1142972033,2019-12-06,5000,grossing,漫画岛-快看二次元动漫漫画神器APP,上海元聚网络科技有限公司,0.0,634,9999,62,3,4.6,2747,5457
112623,1196953001,2019-12-06,5000,grossing,SuperPads打击垫：音乐魔器,Opala Studios Solucoes Tecnologicas Ltda,0.0,635,9999,26,4,4.7,30500,5297
112624,487115059,2019-12-06,5000,grossing,7M即时比分-足球探索预测分析体育网,IEXIN Technology Development Limited,0.0,636,9999,19,0,4.9,3552,17844
112625,493145008,2019-12-06,5000,grossing,Headspace: Meditation & Sleep,Headspace Inc.,0.0,637,9999,20,3,4.9,3658,1968


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112627 entries, 0 to 112626
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   appid           112627 non-null  int64  
 1   date            112627 non-null  object 
 2   category        112627 non-null  int64  
 3   feed            112627 non-null  object 
 4   name            112627 non-null  object 
 5   publisher       112627 non-null  object 
 6   price           112627 non-null  float64
 7   ranking         112627 non-null  int64  
 8   change          112627 non-null  int64  
 9   sub_ranking     112627 non-null  int64  
 10  sub_change      112627 non-null  int64  
 11  comment_rating  112627 non-null  float64
 12  comment_num     112627 non-null  int64  
 13  keyword_cover   112627 non-null  int64  
dtypes: float64(2), int64(8), object(4)
memory usage: 12.0+ MB


### Cleaning

In [6]:
unneeded_columns = ['appid', 'name', 'publisher']

data.drop(unneeded_columns, axis=1, inplace=True)

In [7]:
data = data.replace(-1, np.nan)

In [8]:
data.isna().sum()

date                 0
category             0
feed                 0
price                0
ranking              0
change             818
sub_ranking          0
sub_change        8209
comment_rating       0
comment_num          0
keyword_cover        0
dtype: int64

In [9]:
for column in ['change', 'sub_change']:
    data[column] = data[column].fillna(data[column].mean())

In [10]:
print("Total missing values: ", data.isna().sum().sum())

Total missing values:  0
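
The mean used for imputation is computed over every non-missing value, including the `9999` entries visible in `change`. If `9999` is a placeholder rather than a real rank change (an assumption; the dataset description would need to confirm it), it pulls the mean upward. A minimal sketch on a toy Series, not the notebook's data, comparing mean and median imputation:

```python
import numpy as np
import pandas as pd

# Toy values only; 9999 stands in for a suspected sentinel value.
change = pd.Series([12.0, np.nan, -3.0, 9999.0, 4.0, np.nan])

mean_filled = change.fillna(change.mean())
median_filled = change.fillna(change.median())

print(change.mean())    # pulled upward by the 9999 entry
print(change.median())  # much less sensitive to one extreme value
```

Whether the median, or masking `9999` first, is the better choice depends on what the value actually encodes in this dataset.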


### Feature Engineering

In [11]:
data

Unnamed: 0,date,category,feed,price,ranking,change,sub_ranking,sub_change,comment_rating,comment_num,keyword_cover
0,2019-11-30,5000,free,0.0,1,12.000000,1,5.000000,4.5,124000,17503
1,2019-11-30,5000,free,0.0,2,210.013711,2,52.753089,4.9,632000,19978
2,2019-11-30,5000,free,0.0,3,210.013711,3,52.753089,4.8,599000,6759
3,2019-11-30,5000,free,0.0,4,210.013711,4,52.753089,4.9,228000,38549
4,2019-11-30,5000,free,0.0,5,210.013711,5,52.753089,4.9,23870000,21441
...,...,...,...,...,...,...,...,...,...,...,...
112622,2019-12-06,5000,grossing,0.0,634,9999.000000,62,3.000000,4.6,2747,5457
112623,2019-12-06,5000,grossing,0.0,635,9999.000000,26,4.000000,4.7,30500,5297
112624,2019-12-06,5000,grossing,0.0,636,9999.000000,19,0.000000,4.9,3552,17844
112625,2019-12-06,5000,grossing,0.0,637,9999.000000,20,3.000000,4.9,3658,1968


In [13]:
data['year'] = data['date'].apply(lambda x: int(x[0:4]))
data['month'] = data['date'].apply(lambda x: int(x[5:7]))
data['day'] = data['date'].apply(lambda x: int(x[-2:]))

data = data.drop('date', axis=1)
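
The string slicing above relies on every date being in exactly `YYYY-MM-DD` form. As an alternative sketch, `pd.to_datetime` parses the column and fails loudly on malformed values. The toy frame below is illustrative and only mirrors the format of the `date` column:

```python
import pandas as pd

# Toy frame with the same 'YYYY-MM-DD' string format as the `date` column.
df = pd.DataFrame({'date': ['2019-11-30', '2019-12-06']})

# Parse once, then pull the components through the .dt accessor.
parsed = pd.to_datetime(df['date'], format='%Y-%m-%d')
df['year'] = parsed.dt.year
df['month'] = parsed.dt.month
df['day'] = parsed.dt.day

df = df.drop('date', axis=1)
print(df)
```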

In [14]:
data['category'].unique()

array([5000, 6000, 6001, 6002, 6003, 6004, 6005, 6006, 6007, 6008, 6009,
       6010, 6011, 6012, 6013, 6015, 6016, 6017, 6018, 6020, 6021, 6023,
       6024, 6061])

In [15]:
category_dummies = pd.get_dummies(data['category'], prefix='cat')

data = pd.concat([data, category_dummies], axis=1)
data = data.drop('category', axis=1)
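
To make the one-hot step concrete, here is a small sketch on made-up category codes. Recent pandas versions return boolean dummy columns, which is why the tables further down show `False` values; passing `dtype=int` (an optional choice, not something the notebook does) yields 0/1 columns instead:

```python
import pandas as pd

# Toy category codes; the real column holds the 24 codes listed above.
df = pd.DataFrame({'category': [5000, 6000, 6001, 6000]})

# One indicator column per distinct code, named cat_<code>.
dummies = pd.get_dummies(df['category'], prefix='cat', dtype=int)
df = pd.concat([df.drop('category', axis=1), dummies], axis=1)

print(df)
```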

### Encoding Labels 

The `feed` column has three values, but we only want to separate *free* apps from *paid* ones, so we drop the `grossing` rows.

In [16]:
data['feed'].value_counts()

feed
free        40677
paid        36709
grossing    35241
Name: count, dtype: int64

In [19]:
grossing_indices = data.query('feed == "grossing"').index

data = data.drop(grossing_indices, axis=0)

In [20]:
data = data.reset_index(drop=True)

In [21]:
data

Unnamed: 0,feed,price,ranking,change,sub_ranking,sub_change,comment_rating,comment_num,keyword_cover,year,...,cat_6013,cat_6015,cat_6016,cat_6017,cat_6018,cat_6020,cat_6021,cat_6023,cat_6024,cat_6061
0,free,0.0,1,12.000000,1,5.000000,4.5,124000,17503,2019,...,False,False,False,False,False,False,False,False,False,False
1,free,0.0,2,210.013711,2,52.753089,4.9,632000,19978,2019,...,False,False,False,False,False,False,False,False,False,False
2,free,0.0,3,210.013711,3,52.753089,4.8,599000,6759,2019,...,False,False,False,False,False,False,False,False,False,False
3,free,0.0,4,210.013711,4,52.753089,4.9,228000,38549,2019,...,False,False,False,False,False,False,False,False,False,False
4,free,0.0,5,210.013711,5,52.753089,4.9,23870000,21441,2019,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
77381,paid,6.0,786,9999.000000,128,35.000000,0.0,0,685,2019,...,False,False,False,False,False,False,False,False,False,False
77382,paid,1.0,787,-79.000000,32,-4.000000,0.0,0,1142,2019,...,False,False,False,False,False,False,False,False,False,False
77383,paid,25.0,788,-72.000000,22,-2.000000,4.5,132,1240,2019,...,False,False,False,False,False,False,False,False,False,False
77384,paid,12.0,789,-54.000000,84,-7.000000,4.8,744,477,2019,...,False,False,False,False,False,False,False,False,False,False


In [24]:
print("Class Distribution:")
print(data['feed'].value_counts() / len(data['feed']))

Class Distribution:
feed
free    0.525638
paid    0.474362
Name: count, dtype: float64


In [25]:
label_mapping = {'free': 0, 'paid': 1}

data['feed'] = data['feed'].map(label_mapping)


In [26]:
data

Unnamed: 0,feed,price,ranking,change,sub_ranking,sub_change,comment_rating,comment_num,keyword_cover,year,...,cat_6013,cat_6015,cat_6016,cat_6017,cat_6018,cat_6020,cat_6021,cat_6023,cat_6024,cat_6061
0,0,0.0,1,12.000000,1,5.000000,4.5,124000,17503,2019,...,False,False,False,False,False,False,False,False,False,False
1,0,0.0,2,210.013711,2,52.753089,4.9,632000,19978,2019,...,False,False,False,False,False,False,False,False,False,False
2,0,0.0,3,210.013711,3,52.753089,4.8,599000,6759,2019,...,False,False,False,False,False,False,False,False,False,False
3,0,0.0,4,210.013711,4,52.753089,4.9,228000,38549,2019,...,False,False,False,False,False,False,False,False,False,False
4,0,0.0,5,210.013711,5,52.753089,4.9,23870000,21441,2019,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
77381,1,6.0,786,9999.000000,128,35.000000,0.0,0,685,2019,...,False,False,False,False,False,False,False,False,False,False
77382,1,1.0,787,-79.000000,32,-4.000000,0.0,0,1142,2019,...,False,False,False,False,False,False,False,False,False,False
77383,1,25.0,788,-72.000000,22,-2.000000,4.5,132,1240,2019,...,False,False,False,False,False,False,False,False,False,False
77384,1,12.0,789,-54.000000,84,-7.000000,4.8,744,477,2019,...,False,False,False,False,False,False,False,False,False,False


### Splitting/Scaling

In [27]:
y = data['feed'].copy()
X = data.drop('feed', axis=1).copy()
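
Note that `X` still contains `price`. In the rows shown above, free-feed apps have a price of 0.0 and paid-feed apps have a positive price, so this column may encode the label almost directly and could account for much of the accuracy reported later. A hedged sketch on a toy frame (values are illustrative, and `X_no_price` is a hypothetical name) showing how to exclude it and how to check what price alone achieves:

```python
import pandas as pd

# Toy rows shaped like the notebook's frame; values are illustrative only.
data = pd.DataFrame({
    'feed': [0, 0, 1, 1],
    'price': [0.0, 0.0, 6.0, 1.0],
    'ranking': [1, 2, 786, 787],
    'comment_rating': [4.5, 4.9, 0.0, 0.0],
})

y = data['feed'].copy()

# Drop the target and the column that may almost directly encode it.
X_no_price = data.drop(['feed', 'price'], axis=1).copy()

# Baseline: accuracy of the rule "price > 0 means paid".
price_only_acc = ((data['price'] > 0).astype(int) == y).mean()
print(list(X_no_price.columns), price_only_acc)
```

Running the same baseline on the real frame would show how much of the task is left once price is removed.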

In [28]:
scaler = StandardScaler()

X = scaler.fit_transform(X)

In [30]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=34)
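
The cells above scale all rows before splitting, so the scaler's mean and standard deviation include test rows. A common alternative, sketched here on synthetic data rather than the app data, is to split first and fit the scaler on the training rows only:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and labels.
rng = np.random.default_rng(34)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=34
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics on the test rows

print(X_train_scaled.mean(axis=0).round(6))
```

With a dataset this large the effect on the scores is probably small, but the split-then-fit order keeps the test set untouched by preprocessing.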

### Training/Results

In [31]:
base_model = LogisticRegression()
base_model.fit(X_train, y_train)

base_acc = base_model.score(X_test, y_test)

print('Accuracy: {:.4f}'.format(base_acc))

Accuracy: 0.9758


In [32]:
cv_model = LogisticRegressionCV()
cv_model.fit(X_train, y_train)

cv_acc = cv_model.score(X_test, y_test)

print("Accuracy: {:.4f}".format(cv_acc))

Accuracy: 0.9992
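
The two models differ in how the regularization strength `C` is set: `LogisticRegression` uses a fixed default, while `LogisticRegressionCV` picks `C` from a grid by cross-validation. A minimal sketch on synthetic data (the `Cs`, `cv`, and `max_iter` values are illustrative choices, not the notebook's settings) showing how to inspect the selected value:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split

# Synthetic binary problem standing in for the app data.
X, y = make_classification(n_samples=300, n_features=5, random_state=34)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=34
)

# Cs=10 tries ten C values on a log scale; cv=5 uses 5-fold cross-validation.
cv_model = LogisticRegressionCV(Cs=10, cv=5, max_iter=1000)
cv_model.fit(X_train, y_train)

chosen_C = float(np.ravel(cv_model.C_)[0])  # selected regularization strength
test_acc = cv_model.score(X_test, y_test)
print(chosen_C, test_acc)
```

Printing `cv_model.C_` in the notebook would show which regularization strength produced the higher score.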
