# Google Play Store - Association Rules
This notebook answers the following question:
##### What characteristics of a paid app can help in improving the rating of an app?

The question will be answered with the help of Association Rules.


## Import
Import **numpy**, **pandas** and **matplotlib**.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
pd.options.mode.chained_assignment = None

## Google Play Store Dataset
For this case study, the researchers chose the `Google Play Store Apps` dataset. It contains 10841 rows, each of which represents an app listed on the Google Play Store, and 13 columns.

The dataset is provided as `googleplaystore.csv`. Therefore, we must read the file.

In [2]:
apps_df = pd.read_csv('googleplaystore.csv')
apps_df

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10836,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up
10837,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education,"July 6, 2018",1.0,4.1 and up
10838,Parkinson Exercices FR,MEDICAL,,3,9.5M,"1,000+",Free,0,Everyone,Medical,"January 20, 2017",1.0,2.2 and up
10839,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device


## Data Cleaning and Pre-processing

For data cleaning in this dataset, the researchers decided on these modifications.
1. Remove `Last Updated`, `Current Ver`, `Android Ver`
2. Include Main `Genres` Only
3. Include Main `Content Rating` Only
4. Numerical data for `Installs`, `Size`, `Price`
5. Binning `Rating`, `Reviews`, `Size`, `Installs`
6. Remove/Modify NaN and duplicate observations

### Removing `Last Updated`, `Current Ver`, `Android Ver`

In this case study, the columns `Last Updated`, `Current Ver`, and `Android Ver` are not needed and will be removed.

In [3]:
apps_df = apps_df.drop(["Last Updated", "Current Ver", "Android Ver"], axis=1)
apps_df

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity
...,...,...,...,...,...,...,...,...,...,...
10836,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education
10837,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education
10838,Parkinson Exercices FR,MEDICAL,,3,9.5M,"1,000+",Free,0,Everyone,Medical
10839,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference


### Including Main Genre Only in `Genres`

The researchers noticed that `Genres` has too many unique values because many apps have combined genres. The unique values can be seen below.

In [4]:
apps_df['Genres'].nunique(), apps_df['Genres'].unique()

(120,
 array(['Art & Design', 'Art & Design;Pretend Play',
        'Art & Design;Creativity', 'Art & Design;Action & Adventure',
        'Auto & Vehicles', 'Beauty', 'Books & Reference', 'Business',
        'Comics', 'Comics;Creativity', 'Communication', 'Dating',
        'Education;Education', 'Education', 'Education;Creativity',
        'Education;Music & Video', 'Education;Action & Adventure',
        'Education;Pretend Play', 'Education;Brain Games', 'Entertainment',
        'Entertainment;Music & Video', 'Entertainment;Brain Games',
        'Entertainment;Creativity', 'Events', 'Finance', 'Food & Drink',
        'Health & Fitness', 'House & Home', 'Libraries & Demo',
        'Lifestyle', 'Lifestyle;Pretend Play',
        'Adventure;Action & Adventure', 'Arcade', 'Casual', 'Card',
        'Casual;Pretend Play', 'Action', 'Strategy', 'Puzzle', 'Sports',
        'Music', 'Word', 'Racing', 'Casual;Creativity',
        'Casual;Action & Adventure', 'Simulation', 'Adventure', 'Board',
  

To solve this problem, the researchers decided to include only the main genre provided in `Genres`. This divides the apps into simpler genres and allows easier visualization of the categories in this column.

The first genre which comes before the character `;` for multi-genre apps will be considered the main genre. Genres that come after will be removed via string manipulation.

In [5]:
apps_df["Genres"] = apps_df["Genres"].str.split(";", n=1).str[0]  # keep only the text before the first ';'
apps_df["Genres"].unique()

array(['Art & Design', 'Auto & Vehicles', 'Beauty', 'Books & Reference',
       'Business', 'Comics', 'Communication', 'Dating', 'Education',
       'Entertainment', 'Events', 'Finance', 'Food & Drink',
       'Health & Fitness', 'House & Home', 'Libraries & Demo',
       'Lifestyle', 'Adventure', 'Arcade', 'Casual', 'Card', 'Action',
       'Strategy', 'Puzzle', 'Sports', 'Music', 'Word', 'Racing',
       'Simulation', 'Board', 'Trivia', 'Role Playing', 'Educational',
       'Music & Audio', 'Video Players & Editors', 'Medical', 'Social',
       'Shopping', 'Photography', 'Travel & Local', 'Tools',
       'Personalization', 'Productivity', 'Parenting', 'Weather',
       'News & Magazines', 'Maps & Navigation', 'Casino',
       'February 11, 2018'], dtype=object)

The `Genres` column contains an unexpected value, 'February 11, 2018', so the researchers inspected the rows that have it.

In [6]:
apps_df[apps_df["Genres"] == "February 11, 2018"]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres
10472,Life Made WI-Fi Touchscreen Photo Frame,1.9,19.0,3.0M,"1,000+",Free,0,Everyone,,"February 11, 2018"


The researchers chose to drop this row, since it is a single observation whose values are shifted into the wrong columns.

In [7]:
apps_df = apps_df[apps_df['Genres'] != "February 11, 2018"]
apps_df

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design
...,...,...,...,...,...,...,...,...,...,...
10836,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education
10837,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education
10838,Parkinson Exercices FR,MEDICAL,,3,9.5M,"1,000+",Free,0,Everyone,Medical
10839,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference


### Including Main `Content Rating` Only

As seen below, the `Content Rating` values include both 'Everyone' and 'Everyone 10+'. The researchers decided to drop the age qualifier and keep only the main content rating. In this case, the ratings would be 'Everyone', 'Teen', 'Mature', 'Adults', and 'Unrated'.

In [8]:
apps_df['Content Rating'].unique()

array(['Everyone', 'Teen', 'Everyone 10+', 'Mature 17+',
       'Adults only 18+', 'Unrated'], dtype=object)

Simply splitting the strings on whitespace and keeping the first substring reduces the content ratings to the desired categories.

In [9]:
apps_df["Content Rating"] = apps_df["Content Rating"].str.split(" ", n=1).str[0]  # keep only the first word
apps_df["Content Rating"].unique()

array(['Everyone', 'Teen', 'Mature', 'Adults', 'Unrated'], dtype=object)
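The first-token splitting used for `Genres` and `Content Rating` can be checked on a few toy values. This is a minimal sketch: the sample strings are copied from the outputs above, and the variable names are illustrative rather than part of the notebook.

```python
import pandas as pd

# Toy values mirroring the formats seen in `Genres` and `Content Rating`.
genres = pd.Series(["Art & Design;Pretend Play", "Education", "Casual;Action & Adventure"])
content = pd.Series(["Everyone 10+", "Adults only 18+", "Teen"])

# n=1 limits the split to the first delimiter; .str[0] keeps the leading part.
main_genre = genres.str.split(";", n=1).str[0]
main_content = content.str.split(" ", n=1).str[0]

print(main_genre.tolist())    # ['Art & Design', 'Education', 'Casual']
print(main_content.tolist())  # ['Everyone', 'Adults', 'Teen']
```

Values without the delimiter, such as 'Education' or 'Teen', pass through unchanged.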

### Assigning Numerical Data for `Installs`

Looking at the `Installs` column below, the values are stored as strings rather than floats. Therefore, the researchers will use string manipulation to prepare this column for conversion to float.

In [10]:
apps_df["Installs"].unique()

array(['10,000+', '500,000+', '5,000,000+', '50,000,000+', '100,000+',
       '50,000+', '1,000,000+', '10,000,000+', '5,000+', '100,000,000+',
       '1,000,000,000+', '1,000+', '500,000,000+', '50+', '100+', '500+',
       '10+', '1+', '5+', '0+', '0'], dtype=object)

First, remove the '+' and ',' symbols so that the values can be converted.

In [11]:
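# regex=False treats '+' and ',' as literal characters, not as regular expressions
apps_df['Installs'] = apps_df['Installs'].str.replace("+", "", regex=False)
apps_df['Installs'] = apps_df['Installs'].str.replace(",", "", regex=False)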
apps_df['Installs'].unique()

array(['10000', '500000', '5000000', '50000000', '100000', '50000',
       '1000000', '10000000', '5000', '100000000', '1000000000', '1000',
       '500000000', '50', '100', '500', '10', '1', '5', '0'], dtype=object)

Next, it is possible to convert them into float using the pandas `to_numeric()` function.

In [12]:
apps_df['Installs'] = pd.to_numeric(apps_df['Installs'], downcast="float")
apps_df['Installs'].unique()

array([1.e+04, 5.e+05, 5.e+06, 5.e+07, 1.e+05, 5.e+04, 1.e+06, 1.e+07,
       5.e+03, 1.e+08, 1.e+09, 1.e+03, 5.e+08, 5.e+01, 1.e+02, 5.e+02,
       1.e+01, 1.e+00, 5.e+00, 0.e+00], dtype=float32)
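The two cleaning steps for `Installs` can also be written as a single pass. The sketch below works on a toy Series with hypothetical variable names; it is an alternative formulation, not the notebook's own code.

```python
import pandas as pd

# Toy values in the same format as the `Installs` column.
installs = pd.Series(["10,000+", "1,000,000,000+", "0+", "0"])

# One character class removes both '+' and ','. regex=True is passed
# explicitly because the default of this parameter differs between pandas versions.
cleaned = installs.str.replace(r"[+,]", "", regex=True)

# Convert the cleaned strings to numbers, requesting a float type as in the notebook.
as_number = pd.to_numeric(cleaned, downcast="float")
print(as_number.tolist())
```

Passing `regex` explicitly keeps the behavior the same whichever pandas version is installed.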

### Assigning Numerical Data for `Size`

The `Size` column below is also stored as strings. Therefore, the researchers will again use string manipulation before converting this column to float.

In [13]:
apps_df["Size"].unique()

array(['19M', '14M', '8.7M', '25M', '2.8M', '5.6M', '29M', '33M', '3.1M',
       '28M', '12M', '20M', '21M', '37M', '2.7M', '5.5M', '17M', '39M',
       '31M', '4.2M', '7.0M', '23M', '6.0M', '6.1M', '4.6M', '9.2M',
       '5.2M', '11M', '24M', 'Varies with device', '9.4M', '15M', '10M',
       '1.2M', '26M', '8.0M', '7.9M', '56M', '57M', '35M', '54M', '201k',
       '3.6M', '5.7M', '8.6M', '2.4M', '27M', '2.5M', '16M', '3.4M',
       '8.9M', '3.9M', '2.9M', '38M', '32M', '5.4M', '18M', '1.1M',
       '2.2M', '4.5M', '9.8M', '52M', '9.0M', '6.7M', '30M', '2.6M',
       '7.1M', '3.7M', '22M', '7.4M', '6.4M', '3.2M', '8.2M', '9.9M',
       '4.9M', '9.5M', '5.0M', '5.9M', '13M', '73M', '6.8M', '3.5M',
       '4.0M', '2.3M', '7.2M', '2.1M', '42M', '7.3M', '9.1M', '55M',
       '23k', '6.5M', '1.5M', '7.5M', '51M', '41M', '48M', '8.5M', '46M',
       '8.3M', '4.3M', '4.7M', '3.3M', '40M', '7.8M', '8.8M', '6.6M',
       '5.1M', '61M', '66M', '79k', '8.4M', '118k', '44M', '695k', '1.6M',
     

Replacing 'k' with 'e+3' and 'M' with 'e+6' turns the values into strings in scientific notation that `to_numeric()` can parse. However, there is also a value named 'Varies with device'. The researchers decided to convert it to NaN and deal with the NaN values in a later step.

In [14]:
# regex=False makes the literal-substring replacement explicit
apps_df["Size"] = apps_df["Size"].str.replace('k', 'e+3', regex=False)
apps_df["Size"] = apps_df["Size"].str.replace('M', 'e+6', regex=False)
apps_df["Size"] = apps_df["Size"].replace('Varies with device', np.nan)
apps_df["Size"].unique()

array(['19e+6', '14e+6', '8.7e+6', '25e+6', '2.8e+6', '5.6e+6', '29e+6',
       '33e+6', '3.1e+6', '28e+6', '12e+6', '20e+6', '21e+6', '37e+6',
       '2.7e+6', '5.5e+6', '17e+6', '39e+6', '31e+6', '4.2e+6', '7.0e+6',
       '23e+6', '6.0e+6', '6.1e+6', '4.6e+6', '9.2e+6', '5.2e+6', '11e+6',
       '24e+6', nan, '9.4e+6', '15e+6', '10e+6', '1.2e+6', '26e+6',
       '8.0e+6', '7.9e+6', '56e+6', '57e+6', '35e+6', '54e+6', '201e+3',
       '3.6e+6', '5.7e+6', '8.6e+6', '2.4e+6', '27e+6', '2.5e+6', '16e+6',
       '3.4e+6', '8.9e+6', '3.9e+6', '2.9e+6', '38e+6', '32e+6', '5.4e+6',
       '18e+6', '1.1e+6', '2.2e+6', '4.5e+6', '9.8e+6', '52e+6', '9.0e+6',
       '6.7e+6', '30e+6', '2.6e+6', '7.1e+6', '3.7e+6', '22e+6', '7.4e+6',
       '6.4e+6', '3.2e+6', '8.2e+6', '9.9e+6', '4.9e+6', '9.5e+6',
       '5.0e+6', '5.9e+6', '13e+6', '73e+6', '6.8e+6', '3.5e+6', '4.0e+6',
       '2.3e+6', '7.2e+6', '2.1e+6', '42e+6', '7.3e+6', '9.1e+6', '55e+6',
       '23e+3', '6.5e+6', '1.5e+6', '7.5e+6', '51

After this, applying `to_numeric()` will now be possible.

In [15]:
apps_df["Size"] = pd.to_numeric(apps_df["Size"], downcast="float")
apps_df["Size"].unique()

array([1.90e+07, 1.40e+07, 8.70e+06, 2.50e+07, 2.80e+06, 5.60e+06,
       2.90e+07, 3.30e+07, 3.10e+06, 2.80e+07, 1.20e+07, 2.00e+07,
       2.10e+07, 3.70e+07, 2.70e+06, 5.50e+06, 1.70e+07, 3.90e+07,
       3.10e+07, 4.20e+06, 7.00e+06, 2.30e+07, 6.00e+06, 6.10e+06,
       4.60e+06, 9.20e+06, 5.20e+06, 1.10e+07, 2.40e+07,      nan,
       9.40e+06, 1.50e+07, 1.00e+07, 1.20e+06, 2.60e+07, 8.00e+06,
       7.90e+06, 5.60e+07, 5.70e+07, 3.50e+07, 5.40e+07, 2.01e+05,
       3.60e+06, 5.70e+06, 8.60e+06, 2.40e+06, 2.70e+07, 2.50e+06,
       1.60e+07, 3.40e+06, 8.90e+06, 3.90e+06, 2.90e+06, 3.80e+07,
       3.20e+07, 5.40e+06, 1.80e+07, 1.10e+06, 2.20e+06, 4.50e+06,
       9.80e+06, 5.20e+07, 9.00e+06, 6.70e+06, 3.00e+07, 2.60e+06,
       7.10e+06, 3.70e+06, 2.20e+07, 7.40e+06, 6.40e+06, 3.20e+06,
       8.20e+06, 9.90e+06, 4.90e+06, 9.50e+06, 5.00e+06, 5.90e+06,
       1.30e+07, 7.30e+07, 6.80e+06, 3.50e+06, 4.00e+06, 2.30e+06,
       7.20e+06, 2.10e+06, 4.20e+07, 7.30e+06, 9.10e+06, 5.50e
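As a cross-check of the `Size` conversion, the same result can be reached by separating the number from its unit suffix. This is a sketch under the assumption that every valid size is a number followed by 'k' or 'M'. The multiplier table and variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy values in the same format as the `Size` column.
sizes = pd.Series(["19M", "8.7M", "201k", "Varies with device"])

# Split each value into a numeric part and a unit suffix (k or M).
# Rows that do not match the pattern get NaN in both columns.
parts = sizes.str.extract(r"^(?P<value>[\d.]+)(?P<unit>[kM])$")

# Hypothetical multiplier table mirroring the e+3 / e+6 substitution.
multiplier = parts["unit"].map({"k": 1e3, "M": 1e6})

# 'Varies with device' ends up as NaN, as in the notebook.
size_numeric = pd.to_numeric(parts["value"]) * multiplier
print(size_numeric)
```

Unlike the string substitution, this variant sends any value with an unexpected format to NaN instead of raising an error in `to_numeric()`, which may or may not be the desired behavior.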

### Assigning Numerical Data for `Price`

The same can also be said for the `Price` column. The researchers again use string manipulation, this time removing the '$' symbol, before conversion to float.

In [16]:
apps_df['Price'] = apps_df['Price'].str.replace("$", "", regex=False)  # '$' is a regex anchor, so match it literally
apps_df["Price"].unique()

array(['0', '4.99', '3.99', '6.99', '1.49', '2.99', '7.99', '5.99',
       '3.49', '1.99', '9.99', '7.49', '0.99', '9.00', '5.49', '10.00',
       '24.99', '11.99', '79.99', '16.99', '14.99', '1.00', '29.99',
       '12.99', '2.49', '10.99', '1.50', '19.99', '15.99', '33.99',
       '74.99', '39.99', '3.95', '4.49', '1.70', '8.99', '2.00', '3.88',
       '25.99', '399.99', '17.99', '400.00', '3.02', '1.76', '4.84',
       '4.77', '1.61', '2.50', '1.59', '6.49', '1.29', '5.00', '13.99',
       '299.99', '379.99', '37.99', '18.99', '389.99', '19.90', '8.49',
       '1.75', '14.00', '4.85', '46.99', '109.99', '154.99', '3.08',
       '2.59', '4.80', '1.96', '19.40', '3.90', '4.59', '15.46', '3.04',
       '4.29', '2.60', '3.28', '4.60', '28.99', '2.95', '2.90', '1.97',
       '200.00', '89.99', '2.56', '30.99', '3.61', '394.99', '1.26',
       '1.20', '1.04'], dtype=object)

After this, using the `to_numeric()` function will now be possible.

In [17]:
apps_df['Price'] = pd.to_numeric(apps_df["Price"], downcast="float")
apps_df['Price'].unique()

array([  0.  ,   4.99,   3.99,   6.99,   1.49,   2.99,   7.99,   5.99,
         3.49,   1.99,   9.99,   7.49,   0.99,   9.  ,   5.49,  10.  ,
        24.99,  11.99,  79.99,  16.99,  14.99,   1.  ,  29.99,  12.99,
         2.49,  10.99,   1.5 ,  19.99,  15.99,  33.99,  74.99,  39.99,
         3.95,   4.49,   1.7 ,   8.99,   2.  ,   3.88,  25.99, 399.99,
        17.99, 400.  ,   3.02,   1.76,   4.84,   4.77,   1.61,   2.5 ,
         1.59,   6.49,   1.29,   5.  ,  13.99, 299.99, 379.99,  37.99,
        18.99, 389.99,  19.9 ,   8.49,   1.75,  14.  ,   4.85,  46.99,
       109.99, 154.99,   3.08,   2.59,   4.8 ,   1.96,  19.4 ,   3.9 ,
         4.59,  15.46,   3.04,   4.29,   2.6 ,   3.28,   4.6 ,  28.99,
         2.95,   2.9 ,   1.97, 200.  ,  89.99,   2.56,  30.99,   3.61,
       394.99,   1.26,   1.2 ,   1.04], dtype=float32)

### Dealing with NaN and duplicate values

Checking the null values below shows that the `Rating`, `Size`, and `Type` columns contain missing values and need preprocessing.

In [18]:
apps_df.isnull().sum()

App                  0
Category             0
Rating            1474
Reviews              0
Size              1695
Installs             0
Type                 1
Price                0
Content Rating       0
Genres               0
dtype: int64

The single observation where `Type` is null is inspected and then dropped from the dataset.

In [19]:
apps_df[apps_df["Type"].isnull()]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres
9148,Command & Conquer: Rivals,FAMILY,,0,,0.0,,0.0,Everyone,Strategy


In [20]:
apps_df = apps_df[apps_df["Type"].notnull()]
apps_df["Type"].unique()

array(['Free', 'Paid'], dtype=object)

For `Rating` and `Size`, the researchers filled missing values with the average of the apps in the same `Genres` group. `Genres` was used instead of `Category` because it has more unique values, making it more specific to the apps' capabilities.

In [21]:
apps_df.groupby("Genres").mean(numeric_only=True)  # average only the numeric columns

Genres,Rating,Size,Installs,Price
Action,4.286667,50575344.0,24686020.0,0.24144
Adventure,4.219101,40134116.0,14647750.0,0.734615
Arcade,4.308072,41984836.0,45541520.0,0.281688
Art & Design,4.349231,12492308.0,1825190.0,0.086522
Auto & Vehicles,4.190411,20037146.0,625061.3,0.158471
Beauty,4.278571,13795745.0,513151.9,0.0
Board,4.3,24589678.0,2873244.0,0.872063
Books & Reference,4.344444,13701160.0,8211456.0,0.541667
Business,4.121452,14472162.0,2178076.0,0.402761
Card,4.102083,29967022.0,3410316.0,0.380784


After checking that the means for `Rating` and `Size` are reasonable, a lambda function is applied per `Genres` group to replace the NaN values with the mean of that group.

In [22]:
# transform() keeps the original row index, so the result aligns on assignment
apps_df['Rating'] = apps_df.groupby(['Genres'], sort=False)['Rating'].transform(lambda x: x.fillna(x.mean()))
apps_df['Rating'].unique()

array([4.1       , 3.9       , 4.7       , 4.5       , 4.3       ,
       4.4       , 3.8       , 4.2       , 4.6       , 3.2       ,
       4.        , 4.34923077, 4.8       , 4.9       , 3.6       ,
       3.7       , 4.27857143, 3.3       , 4.34444444, 3.4       ,
       3.5       , 3.1       , 4.12145215, 4.15517241, 5.        ,
       2.6       , 3.97076923, 3.        , 1.9       , 2.5       ,
       2.8       , 2.7       , 1.        , 2.9       , 4.31243339,
       4.43555556, 4.16697248, 4.19736842, 4.17846154, 4.18914286,
       2.3       , 4.04741144, 4.3       , 4.06319018, 4.33598726,
       2.2       , 4.19824561, 1.7       , 2.        , 4.19211356,
       1.8       , 4.25559846, 4.30807175, 4.2379822 , 4.15866261,
       4.25416667, 4.03928571, 4.21139601, 4.10970874, 2.4       ,
       4.10138648, 4.19041096, 4.13188854, 4.09555556, 1.6       ,
       4.10929204, 4.25966387, 4.27725753, 4.244     , 4.17216981,
       4.1851145 , 4.13218884, 4.0516129 , 4.28666667, 4.10208

In [23]:
# transform() keeps the original row index, so the result aligns on assignment
apps_df['Size'] = apps_df.groupby(['Genres'], sort=False)['Size'].transform(lambda x: x.fillna(x.mean()))
apps_df['Size'].unique()

array([1.9000000e+07, 1.4000000e+07, 8.7000000e+06, 2.5000000e+07,
       2.8000000e+06, 5.6000000e+06, 2.9000000e+07, 3.3000000e+07,
       3.1000000e+06, 2.8000000e+07, 1.2000000e+07, 2.0000000e+07,
       2.1000000e+07, 3.7000000e+07, 2.7000000e+06, 5.5000000e+06,
       1.7000000e+07, 3.9000000e+07, 3.1000000e+07, 4.2000000e+06,
       7.0000000e+06, 2.3000000e+07, 6.0000000e+06, 6.1000000e+06,
       4.6000000e+06, 9.2000000e+06, 5.2000000e+06, 1.1000000e+07,
       2.4000000e+07, 1.2492308e+07, 9.4000000e+06, 1.5000000e+07,
       1.0000000e+07, 1.2000000e+06, 2.6000000e+07, 8.0000000e+06,
       7.9000000e+06, 5.6000000e+07, 5.7000000e+07, 2.0037150e+07,
       3.5000000e+07, 5.4000000e+07, 2.0100000e+05, 3.6000000e+06,
       5.7000000e+06, 8.6000000e+06, 2.4000000e+06, 2.7000000e+07,
       2.5000000e+06, 1.6000000e+07, 3.4000000e+06, 8.9000000e+06,
       3.9000000e+06, 2.9000000e+06, 3.8000000e+07, 3.2000000e+07,
       5.4000000e+06, 1.8000000e+07, 1.1000000e+06, 2.2000000e
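The group-wise imputation applied to `Rating` and `Size` can be illustrated on a small frame. The sketch below uses made-up values and an equivalent formulation based on `transform("mean")`, which is an assumption about style rather than the notebook's exact code.

```python
import numpy as np
import pandas as pd

# Toy frame with missing ratings in two genres (values are illustrative).
toy = pd.DataFrame({
    "Genres": ["Tools", "Tools", "Tools", "Medical", "Medical"],
    "Rating": [4.0, np.nan, 3.0, 4.5, np.nan],
})

# transform("mean") returns the group mean for every row,
# aligned to the original index; NaN values are skipped when averaging.
group_mean = toy.groupby("Genres")["Rating"].transform("mean")

# Fill only the missing entries with the mean of their own genre.
toy["Rating"] = toy["Rating"].fillna(group_mean)
print(toy["Rating"].tolist())  # [4.0, 3.5, 3.0, 4.5, 4.5]
```

The missing 'Tools' rating becomes 3.5 (the mean of 4.0 and 3.0), while the missing 'Medical' rating becomes 4.5.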

For duplicated rows, the researchers decided to simply drop these observations.

In [24]:
apps_df.duplicated().sum()

485

In [25]:
apps_df = apps_df.drop_duplicates()
apps_df

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.100000,159,19000000.0,10000.0,Free,0.0,Everyone,Art & Design
1,Coloring book moana,ART_AND_DESIGN,3.900000,967,14000000.0,500000.0,Free,0.0,Everyone,Art & Design
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.700000,87510,8700000.0,5000000.0,Free,0.0,Everyone,Art & Design
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.500000,215644,25000000.0,50000000.0,Free,0.0,Teen,Art & Design
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.300000,967,2800000.0,100000.0,Free,0.0,Everyone,Art & Design
...,...,...,...,...,...,...,...,...,...,...
10836,Sya9a Maroc - FR,FAMILY,4.500000,38,53000000.0,5000.0,Free,0.0,Everyone,Education
10837,Fr. Mike Schmitz Audio Teachings,FAMILY,5.000000,4,3600000.0,100.0,Free,0.0,Everyone,Education
10838,Parkinson Exercices FR,MEDICAL,4.189143,3,9500000.0,1000.0,Free,0.0,Everyone,Medical
10839,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.500000,114,13701158.0,1000.0,Free,0.0,Mature,Books & Reference


### Dropping Impossible Data

In [26]:
apps_df[apps_df["Installs"] < 1]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres
4465,Popsicle Launcher for Android P 9.0 launcher,PERSONALIZATION,4.335987,0,5500000.0,0.0,Paid,1.49,Everyone,Personalization
5307,Ak Parti Yardım Toplama,SOCIAL,4.255598,0,8700000.0,0.0,Paid,13.99,Teen,Social
5486,AP Series Solution Pro,FAMILY,4.312433,0,7400000.0,0.0,Paid,1.99,Everyone,Education
5945,Ain Arabic Kids Alif Ba ta,FAMILY,4.312433,0,33000000.0,0.0,Paid,2.99,Everyone,Education
6692,cronometra-br,PRODUCTIVITY,4.211396,0,5400000.0,0.0,Paid,154.990005,Everyone,Productivity
7434,Pekalongan CJ,SOCIAL,4.255598,0,5900000.0,0.0,Free,0.0,Teen,Social
8081,CX Network,BUSINESS,4.121452,0,10000000.0,0.0,Free,0.0,Everyone,Business
8614,Sweden Newspapers,NEWS_AND_MAGAZINES,4.132189,0,2100000.0,0.0,Free,0.0,Everyone,News & Magazines
8871,Test Application DT 02,ART_AND_DESIGN,4.349231,0,1200000.0,0.0,Free,0.0,Everyone,Art & Design
9337,EG | Explore Folegandros,TRAVEL_AND_LOCAL,4.109292,0,56000000.0,0.0,Paid,3.99,Everyone,Travel & Local


As seen from the truncated dataframe above, some applications in the dataset have zero `Installs` and zero `Reviews` but still show a rating. Ratings normally come from users, and the values shown here match the per-genre means imputed in the previous step, so they do not reflect real user feedback. These rows are therefore treated as impossible data.

In [27]:
len(apps_df[apps_df["Installs"] < 1])

14

Given that the number of entries with this condition (14) is less than 1% of the original dataset, we can drop these rows before proceeding.

In [28]:
apps_df = apps_df[apps_df["Installs"] >= 1]
apps_df[apps_df["Installs"] < 1]  # should now be empty

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres


### Binning `Rating`, `Reviews`, `Size`, `Installs` into Appropriate Quantiles

In the case of `Rating`, the researchers needed a new column that divides the ratings into categories for use in the association rules. The new column is called `Binned Rating`. For this binning process, the researchers decided to use the `cut()` function, since it separates the ratings into bins based on the actual values rather than on their distribution.

In [29]:
apps_df[apps_df["Rating"] < 1]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres


The bins were finalized as (0, 1], (1, 2], (2, 3], (3, 4], and (4, 5], each including its upper bound. These bins cover every value, since no rating is 0, below 1, or above 5.

In [30]:
bins = [0, 1, 2, 3, 4, 5]

The new `Binned Rating` column is then integrated into the dataset.

In [31]:
apps_df["Binned Rating"] = pd.cut(apps_df['Rating'], bins, labels=['Rating(0,1]', 'Rating(1,2]', 'Rating(2,3]', 'Rating(3,4]', 'Rating(4,5]' ])
apps_df["Binned Rating"]

0        Rating(4,5]
1        Rating(3,4]
2        Rating(4,5]
3        Rating(4,5]
4        Rating(4,5]
            ...     
10836    Rating(4,5]
10837    Rating(4,5]
10838    Rating(4,5]
10839    Rating(4,5]
10840    Rating(4,5]
Name: Binned Rating, Length: 10340, dtype: category
Categories (5, object): [Rating(0,1] < Rating(1,2] < Rating(2,3] < Rating(3,4] < Rating(4,5]]
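How `cut()` treats values that sit exactly on a bin edge can be seen on a few sample ratings. This is a minimal sketch with illustrative values. The bins and labels are the ones used above.

```python
import pandas as pd

# Sample ratings, including values that fall exactly on bin edges.
ratings = pd.Series([1.0, 2.0, 3.9, 4.0, 4.1, 5.0])
bins = [0, 1, 2, 3, 4, 5]
labels = ['Rating(0,1]', 'Rating(1,2]', 'Rating(2,3]', 'Rating(3,4]', 'Rating(4,5]']

# With the default right=True, each bin is open on the left and closed
# on the right, so an edge value belongs to the lower bin.
binned = pd.cut(ratings, bins, labels=labels)
print(binned.tolist())
# ['Rating(0,1]', 'Rating(1,2]', 'Rating(3,4]', 'Rating(3,4]', 'Rating(4,5]', 'Rating(4,5]']
```

A rating of exactly 4.0 therefore lands in `Rating(3,4]`, and a rating of exactly 1.0 lands in `Rating(0,1]`.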

For `Reviews`, `Installs`, and `Size`, however, the researchers decided to divide the values into quantiles so that each bin holds a roughly equal share of the dataset. The new columns are named `Binned Reviews`, `Binned Installs`, and `Binned Size` respectively.

The researchers chose 5 quantiles, giving 5 categories: very small, small, average, large, and very large. `Reviews` is also converted to float so that statistical computations can be applied to it.

In [32]:
apps_df["Reviews"] = pd.to_numeric(apps_df["Reviews"], downcast='float')
apps_df['Binned Reviews'] = pd.qcut(apps_df['Reviews'], 5, labels=['Reviews(very small)', 'Reviews(small)', 'Reviews(average)', 'Reviews(large)', 'Reviews(very large)'])
apps_df['Binned Reviews'].unique()

[Reviews(small), Reviews(average), Reviews(very large), Reviews(large), Reviews(very small)]
Categories (5, object): [Reviews(very small) < Reviews(small) < Reviews(average) < Reviews(large) < Reviews(very large)]

In [33]:
apps_df['Binned Size'] = pd.qcut(apps_df['Size'], 5, labels=['Size(very small)', 'Size(small)', 'Size(average)', 'Size(large)', 'Size(very large)'])
apps_df['Binned Size'].unique()

[Size(large), Size(average), Size(small), Size(very small), Size(very large)]
Categories (5, object): [Size(very small) < Size(small) < Size(average) < Size(large) < Size(very large)]

In [34]:
apps_df['Binned Installs'] = pd.qcut(apps_df['Installs'], 5, labels=['Installs(very small)', 'Installs(small)', 'Installs(average)', 'Installs(large)', 'Installs(very large)'])
apps_df['Binned Installs'].unique()

[Installs(small), Installs(average), Installs(large), Installs(very large), Installs(very small)]
Categories (5, object): [Installs(very small) < Installs(small) < Installs(average) < Installs(large) < Installs(very large)]
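The quantile binning can be inspected by asking `qcut()` to return the bin edges it computed. The sketch below uses made-up review counts and shortened labels, so the names and numbers are assumptions for illustration only.

```python
import pandas as pd

# Ten distinct, heavily skewed toy review counts.
reviews = pd.Series([0, 3, 10, 40, 150, 900, 5000, 20000, 100000, 2000000])
labels = ["very small", "small", "average", "large", "very large"]

# retbins=True additionally returns the quantile edges that define the bins.
binned, edges = pd.qcut(reviews, 5, labels=labels, retbins=True)

# Each of the 5 bins receives the same number of observations here.
print(binned.value_counts())
print(edges)
```

With skewed data the edges are far from evenly spaced, which is the point of using quantiles instead of fixed-width bins. When many values are tied, several quantile edges can coincide, in which case `qcut()` raises an error unless its `duplicates` parameter is set to `"drop"`.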

In [35]:
apps_df

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Binned Rating,Binned Reviews,Binned Size,Binned Installs
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.100000,159.0,19000000.0,10000.0,Free,0.0,Everyone,Art & Design,"Rating(4,5]",Reviews(small),Size(large),Installs(small)
1,Coloring book moana,ART_AND_DESIGN,3.900000,967.0,14000000.0,500000.0,Free,0.0,Everyone,Art & Design,"Rating(3,4]",Reviews(average),Size(average),Installs(average)
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.700000,87510.0,8700000.0,5000000.0,Free,0.0,Everyone,Art & Design,"Rating(4,5]",Reviews(very large),Size(small),Installs(large)
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.500000,215644.0,25000000.0,50000000.0,Free,0.0,Teen,Art & Design,"Rating(4,5]",Reviews(very large),Size(large),Installs(very large)
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.300000,967.0,2800000.0,100000.0,Free,0.0,Everyone,Art & Design,"Rating(4,5]",Reviews(average),Size(very small),Installs(average)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10836,Sya9a Maroc - FR,FAMILY,4.500000,38.0,53000000.0,5000.0,Free,0.0,Everyone,Education,"Rating(4,5]",Reviews(small),Size(very large),Installs(small)
10837,Fr. Mike Schmitz Audio Teachings,FAMILY,5.000000,4.0,3600000.0,100.0,Free,0.0,Everyone,Education,"Rating(4,5]",Reviews(very small),Size(very small),Installs(very small)
10838,Parkinson Exercices FR,MEDICAL,4.189143,3.0,9500000.0,1000.0,Free,0.0,Everyone,Medical,"Rating(4,5]",Reviews(very small),Size(small),Installs(very small)
10839,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.500000,114.0,13701158.0,1000.0,Free,0.0,Mature,Books & Reference,"Rating(4,5]",Reviews(small),Size(average),Installs(very small)


**Converting the data for compatibility with Association Rules**

To get association rules, we will follow the market-basket model. In this case study, each basket is a mobile app (a row), and the items in the basket are the characteristics of that app. However, each characteristic is currently stored as a value under a categorical column. To use the `Rule Miner` class, the dataset should only contain boolean values (0s and 1s) that denote whether a basket contains a certain item.

The dataset will be converted so that the columns are the unique values instead of the categories. All unique values except those from the `App` column are taken to build the `items` for the market-basket model.
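The basket representation described above can be sketched with pandas' `get_dummies()`. The toy frame, the `support()` and `confidence()` helpers, and all values below are hypothetical; they stand in for the idea only and are not the `Rule Miner` class.

```python
import pandas as pd

# Toy stand-in for the binned dataframe (values are illustrative only).
toy = pd.DataFrame({
    "Type": ["Paid", "Paid", "Free", "Paid", "Free"],
    "Content Rating": ["Everyone", "Teen", "Everyone", "Everyone", "Teen"],
    "Binned Rating": ["Rating(4,5]", "Rating(3,4]", "Rating(4,5]",
                      "Rating(4,5]", "Rating(3,4]"],
})

# One boolean column per unique value; the empty prefix keeps the raw
# value as the column name, so each column is one "item".
baskets = pd.get_dummies(toy, prefix="", prefix_sep="").astype(bool)

def support(items):
    """Fraction of baskets that contain every item in `items`."""
    return baskets[list(items)].all(axis=1).mean()

def confidence(antecedent, consequent):
    """Share of baskets with the antecedent that also hold the consequent."""
    return support(list(antecedent) + list(consequent)) / support(antecedent)

print(support(["Paid", "Rating(4,5]"]))       # 2 of 5 baskets
print(confidence(["Paid"], ["Rating(4,5]"]))  # 2 of the 3 paid apps
```

In this toy data, two of the three paid apps fall in `Rating(4,5]`, so the rule {Paid} -> {Rating(4,5]} has a confidence of about 0.67.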

First, the `Price` column will be excluded from the association rules. Instead of binning the numerical price values, it is much simpler to use the `Type` column, which describes whether the app is `Paid` or `Free`.

Because the dataset now has binned columns, the original columns must also be removed.

In [36]:
copy_df = apps_df.copy()

del copy_df['Rating']
del copy_df['Reviews']
del copy_df['Size']
del copy_df['Price']
del copy_df['Installs']
copy_df

Unnamed: 0,App,Category,Type,Content Rating,Genres,Binned Rating,Binned Reviews,Binned Size,Binned Installs
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,Free,Everyone,Art & Design,"Rating(4,5]",Reviews(small),Size(large),Installs(small)
1,Coloring book moana,ART_AND_DESIGN,Free,Everyone,Art & Design,"Rating(3,4]",Reviews(average),Size(average),Installs(average)
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,Free,Everyone,Art & Design,"Rating(4,5]",Reviews(very large),Size(small),Installs(large)
3,Sketch - Draw & Paint,ART_AND_DESIGN,Free,Teen,Art & Design,"Rating(4,5]",Reviews(very large),Size(large),Installs(very large)
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,Free,Everyone,Art & Design,"Rating(4,5]",Reviews(average),Size(very small),Installs(average)
...,...,...,...,...,...,...,...,...,...
10836,Sya9a Maroc - FR,FAMILY,Free,Everyone,Education,"Rating(4,5]",Reviews(small),Size(very large),Installs(small)
10837,Fr. Mike Schmitz Audio Teachings,FAMILY,Free,Everyone,Education,"Rating(4,5]",Reviews(very small),Size(very small),Installs(very small)
10838,Parkinson Exercices FR,MEDICAL,Free,Everyone,Medical,"Rating(4,5]",Reviews(very small),Size(small),Installs(very small)
10839,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,Free,Mature,Books & Reference,"Rating(4,5]",Reviews(small),Size(average),Installs(very small)


In [37]:
items = np.ndarray(shape=(1), dtype=object)

for i in range(1, len(apps_df.columns)):
    items = np.concatenate( (items, apps_df[apps_df.columns[i]].unique()), axis=0)

items = np.delete(items, [0])

for i in range(len(items)):
    print(items[i])


ART_AND_DESIGN
AUTO_AND_VEHICLES
BEAUTY
BOOKS_AND_REFERENCE
BUSINESS
COMICS
COMMUNICATION
DATING
EDUCATION
ENTERTAINMENT
EVENTS
FINANCE
FOOD_AND_DRINK
HEALTH_AND_FITNESS
HOUSE_AND_HOME
LIBRARIES_AND_DEMO
LIFESTYLE
GAME
FAMILY
MEDICAL
SOCIAL
SHOPPING
PHOTOGRAPHY
SPORTS
TRAVEL_AND_LOCAL
TOOLS
PERSONALIZATION
PRODUCTIVITY
PARENTING
WEATHER
VIDEO_PLAYERS
NEWS_AND_MAGAZINES
MAPS_AND_NAVIGATION
4.1
3.9
4.7
4.5
4.3
4.4
3.8
4.2
4.6
3.2
4.0
4.349230769230768
4.8
4.9
3.6
3.7
4.278571428571428
3.3
4.344444444444446
3.4
3.5
3.1
4.121452145214522
4.155172413793104
5.0
2.6
3.9707692307692306
3.0
1.9
2.5
2.8
2.7
1.0
2.9
4.312433392539962
4.435555555555557
4.1669724770642205
4.197368421052633
4.178461538461538
4.18914285714286
2.3
4.047411444141691
4.300000000000001
4.06319018404908
4.335987261146501
2.2
4.198245614035089
1.7
2.0
4.192113564668767
1.8
4.255598455598457
4.308071748878922
4.237982195845699
4.158662613981761
4.254166666666666
4.039285714285714
4.211396011396012
4.109708737864079
2.4
4.10

1468638.0
344585.0
7614407.0
234110.0
48615.0
470694.0
14774.0
12753.0
33983.0
20267.0
5761.0
11618.0
12948.0
9779.0
369378.0
11436.0
2150.0
382.0
24936.0
1109.0
974.0
108795.0
1455.0
1024.0
1014822.0
86961.0
7320.0
269194.0
59843.0
18616.0
11950.0
4289.0
68286.0
11716.0
3323.0
36606.0
328619.0
10216997.0
46741.0
530854.0
7050.0
235906.0
17753.0
6294400.0
520609.0
432.0
32029.0
4207.0
64.0
96.0
82471.0
496.0
29436.0
19230.0
11126.0
23671.0
9652.0
9626.0
29319.0
1791.0
9199.0
14014.0
110877.0
10366.0
530792.0
12137.0
6404.0
6356.0
169.0
4450855.0
48701.0
15246.0
520654.0
4076.0
530904.0
106750.0
33785.0
58795.0
3235.0
47031.0
131.0
673203.0
2178.0
175625.0
8508.0
3484.0
379415.0
19245.0
65.0
24877.0
765.0
10088.0
525552.0
3762.0
141363.0
472584.0
1329192.0
148295.0
41273.0
392596.0
43060.0
514088.0
41867.0
1014846.0
23060.0
112080.0
15489.0
51895.0
623398.0
66661.0
10447.0
1574197.0
19170.0
169609.0
6188.0
1369.0
2952.0
9856.0
10753.0
154.0
288523.0
4522.0
3328.0
31100.0
854.0
560.0
631

817.0
313.0
37789.0
3570.0
48929.0
89947.0
466.0
30630.0
7462.0
8600.0
29505.0
106.0
6187.0
659.0
3965.0
4656.0
205.0
1475.0
148826.0
5150801.0
2473795.0
354.0
1699.0
11393.0
401530.0
925.0
671.0
274.0
140.0
30443.0
22401.0
324.0
14832.0
2059.0
826.0
180697.0
589.0
428268.0
298041.0
7590099.0
29.0
230.0
2026.0
86956.0
1129.0
108002.0
213.0
147.0
3062845.0
1162.0
720.0
502.0
1486.0
6627.0
4383.0
680.0
24668.0
13788.0
313403.0
26893.0
591411.0
2194.0
2012.0
32.0
657.0
4264.0
21107.0
3642.0
495971.0
697939.0
7357.0
4443407.0
944.0
5369.0
135.0
1852.0
6367.0
259.0
5682.0
7687.0
51068.0
2925.0
1655.0
1696.0
7775146.0
11244.0
16771865.0
14224.0
5178.0
628.0
12435.0
972574.0
464900.0
15097.0
146913.0
22503.0
1503544.0
5785.0
334.0
16111.0
2953886.0
2789775.0
174827.0
2133047.0
482630.0
69115.0
3128611.0
80816.0
38606.0
3044.0
1820.0
10067.0
480.0
2300.0
53144.0
7820775.0
22775.0
370.0
41502.0
963.0
21592.0
103.0
138129.0
18522.0
6454.0
17988.0
1771.0
8465.0
146.0
21943.0
1468.0
1088.0
29756.0

1072565.0
120494.0
637.0
43677.0
79132.0
39682.0
18478.0
32879.0
34612.0
253207.0
23348.0
46242.0
40467.0
16192.0
58981.0
148715.0
24565.0
59632.0
8780.0
38607.0
942.0
5985.0
426.0
505.0
794.0
905.0
820.0
11872.0
69013.0
364013.0
13079.0
4856.0
3745.0
2032.0
456474.0
267395.0
45359.0
25427.0
14432.0
54520.0
253155.0
154108.0
155998.0
72161.0
43088.0
6320.0
271214.0
14089.0
26452.0
6120.0
7801.0
57449.0
7566.0
4649.0
10484.0
2537.0
4441.0
86172.0
7969.0
56664.0
2295.0
1290.0
15618.0
266434.0
11402.0
1007.0
401648.0
8193.0
1115.0
1853.0
1283.0
4813.0
3003.0
666.0
12111.0
8432.0
7812.0
9659.0
2576.0
19373.0
3358.0
145646.0
1911.0
28140.0
5485.0
11250.0
5093.0
8450.0
13492.0
2362.0
139432.0
1638.0
6512.0
7896.0
68270.0
21381.0
58575.0
32881.0
10225.0
441.0
455496.0
475369.0
358633.0
1094094.0
23347.0
1626.0
36578.0
14253.0
15829.0
101738.0
372553.0
6716.0
3345.0
200450.0
42182.0
2700.0
2310.0
594406.0
85317.0
1013.0
2311.0
97209.0
4518.0
3397.0
3580.0
11748.0
1205.0
138872.0
1091.0
11788.0

The unique items now become the columns of a new dataframe. This dataframe is a matrix that represents the market-basket model and is compatible with the `RuleMiner` class.

In [38]:
assoc_df = pd.DataFrame(0, index=np.arange(len(apps_df.index)), columns=items)
assoc_df

Unnamed: 0,ART_AND_DESIGN,AUTO_AND_VEHICLES,BEAUTY,BOOKS_AND_REFERENCE,BUSINESS,COMICS,COMMUNICATION,DATING,EDUCATION,ENTERTAINMENT,...,Size(large),Size(average),Size(small),Size(very small),Size(very large),Installs(small),Installs(average),Installs(large),Installs(very large),Installs(very small)
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10335,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10336,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10337,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10338,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [39]:
columns = copy_df.columns
columns

Index(['App', 'Category', 'Type', 'Content Rating', 'Genres', 'Binned Rating',
       'Binned Reviews', 'Binned Size', 'Binned Installs'],
      dtype='object')

To complete the market-basket model matrix, we now change the value of a cell from `0` to `1` if the application has that characteristic. This can take some time because the dataframe is very large, but the code only needs to be executed once. For reference, it takes around 11 seconds to complete on a 7th-gen i7 laptop; the run recorded below took about 31 seconds.

In [40]:
import time
start_time = time.time()

for i in range(len(assoc_df.index)):
    for j in range(1, len(columns)):
        assoc_df.loc[assoc_df.index[i], copy_df.loc[copy_df.index[i], columns[j]]] = 1
        
print ("The program took ", time.time() - start_time, " to run")

assoc_df

The program took  31.490821361541748  to run


Unnamed: 0,ART_AND_DESIGN,AUTO_AND_VEHICLES,BEAUTY,BOOKS_AND_REFERENCE,BUSINESS,COMICS,COMMUNICATION,DATING,EDUCATION,ENTERTAINMENT,...,Size(large),Size(average),Size(small),Size(very small),Size(very large),Installs(small),Installs(average),Installs(large),Installs(very large),Installs(very small)
0,1,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,1,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,1,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,1,0,0
3,1,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,1,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10335,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,1,0,0,0,0
10336,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
10337,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,1
10338,0,0,0,1,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,1
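
The cell-by-cell loop above can often be replaced by a vectorized one-hot encoding. The following is a minimal sketch on a made-up three-row frame (`toy_df` and its values are illustrative, not taken from the dataset). It assumes that no two source columns share an item label; otherwise the result would contain duplicate column names.

```python
import pandas as pd

# Hypothetical toy data; the column names mirror copy_df, the values are made up.
toy_df = pd.DataFrame({
    'App': ['app_a', 'app_b', 'app_c'],
    'Category': ['GAME', 'TOOLS', 'GAME'],
    'Type': ['Free', 'Paid', 'Free'],
})

# One 0/1 column per distinct value; the empty prefix keeps the raw labels
# as column names instead of names like 'Category_GAME'.
toy_matrix = pd.get_dummies(
    toy_df.drop(columns='App'), prefix='', prefix_sep=''
).astype(int)

print(toy_matrix)
```

On a frame the size of `copy_df` this approach avoids the Python-level double loop, although the column order may differ from the one produced by the loop.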


## Rule Miner
Open the `rule_miner.py` file. Some of the functions in the `RuleMiner` class are not yet implemented. We will implement the missing parts of this class.

In [41]:
from rule_miner import RuleMiner

Running `RuleMiner` also takes considerably longer as the thresholds are lowered.

For reference (i7 7th gen laptop):
- RuleMiner(300, 0.5) took 25 seconds to run
- RuleMiner(100, 0.5) took 100 seconds to run

For the first trial, let us use a support threshold of 300 and a confidence threshold of 50%. These values were not chosen for any particular reason; they are simply parameters that can be adjusted.
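
To make the two thresholds concrete, here is a minimal sketch of how support and confidence are usually computed on a 0/1 matrix. The implementation of `RuleMiner` is not shown in this notebook, so this assumes it follows the standard definitions: the support of an itemset is the number of rows containing all of its items, and the confidence of `x` -> `y` is the support of `x` and `y` together divided by the support of `x`. The `lift` helper is not used anywhere in this notebook; it is included only as an optional check of whether a rule does better than the base rate of its consequent. The data and the function names are hypothetical.

```python
import pandas as pd

# Hypothetical 0/1 matrix laid out like assoc_df; the values are made up.
toy_matrix = pd.DataFrame({
    'Free':        [1, 1, 1, 0, 1],
    'GAME':        [1, 0, 1, 1, 0],
    'Rating(4,5]': [1, 0, 1, 1, 1],
})

def support_count(matrix, itemset):
    """Number of rows (apps) that contain every item in `itemset`."""
    return int(matrix[list(itemset)].all(axis=1).sum())

def confidence(matrix, x, y):
    """Share of the rows containing `x` that also contain `y`."""
    return support_count(matrix, list(x) + list(y)) / support_count(matrix, x)

def lift(matrix, x, y):
    """Confidence of x -> y divided by the overall share of rows containing `y`."""
    return confidence(matrix, x, y) / (support_count(matrix, y) / len(matrix))

x, y = ['Free'], ['Rating(4,5]']
print("support count:", support_count(toy_matrix, x + y))
print("confidence:", confidence(toy_matrix, x, y))
print("lift:", lift(toy_matrix, x, y))
```

Under these assumptions, a lift close to 1 would suggest that the antecedent adds little beyond how common the consequent already is.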

In [42]:
rule_miner = RuleMiner(300, 0.5)

In [43]:
start1_time = time.time()

rules = rule_miner.get_association_rules(assoc_df)
#print(rules)
# if you print this, it will look very ugly and may take up a lot of the screen

print ("The program took ", time.time() - start1_time, " to run")

The program took  25.199606895446777  to run


These are the resulting rules:

In [44]:
for i in range(0, len(rules), 2):
    print(rules[i], " -> ", rules[i+1], "\n")

['GAME', 'Size(very large)', 'Free', 'Installs(very large)', 'Reviews(very large)']  ->  ['Rating(4,5]'] 

['GAME', 'Size(very large)', 'Free', 'Installs(very large)', 'Rating(4,5]']  ->  ['Reviews(very large)'] 

['GAME', 'Size(very large)', 'Free', 'Reviews(very large)', 'Rating(4,5]']  ->  ['Installs(very large)'] 

['GAME', 'Size(very large)', 'Installs(very large)', 'Reviews(very large)', 'Rating(4,5]']  ->  ['Free'] 

['GAME', 'Free', 'Installs(very large)', 'Reviews(very large)', 'Rating(4,5]']  ->  ['Size(very large)'] 

['Size(very large)', 'Free', 'Installs(very large)', 'Reviews(very large)', 'Rating(4,5]']  ->  ['GAME'] 

['Everyone', 'Free', 'Size(average)', 'Installs(very large)', 'Reviews(very large)']  ->  ['Rating(4,5]'] 

['Everyone', 'Free', 'Size(average)', 'Installs(very large)', 'Rating(4,5]']  ->  ['Reviews(very large)'] 

['Everyone', 'Free', 'Size(average)', 'Reviews(very large)', 'Rating(4,5]']  ->  ['Installs(very large)'] 

['Everyone', 'Size(average)', 'Ins

Specifically, we want to see association rules (`x` -> `y`) such that `y` is a category of `Binned Rating`, to see which app characteristics are most likely to belong to a certain rating range.

First, we take the set of rating categories.

In [45]:
ratingset = copy_df['Binned Rating'].unique()
ratingset = ratingset.tolist()
ratingset

['Rating(4,5]', 'Rating(3,4]', 'Rating(2,3]', 'Rating(1,2]', 'Rating(0,1]']

In [46]:
# rules is a flat list: antecedent at even positions, consequent at odd positions
for i in range(0, len(rules), 2):
    x = rules[i]
    y = rules[i+1]
    if y[0] in ratingset:
        print(x, " -> ", y, "\n")

['GAME', 'Size(very large)', 'Free', 'Installs(very large)', 'Reviews(very large)']  ->  ['Rating(4,5]'] 

['Everyone', 'Free', 'Size(average)', 'Installs(very large)', 'Reviews(very large)']  ->  ['Rating(4,5]'] 

['Everyone', 'Size(very large)', 'Free', 'Installs(very large)', 'Reviews(very large)']  ->  ['Rating(4,5]'] 

['Everyone', 'Free', 'Installs(very small)', 'Reviews(very small)', 'Size(small)']  ->  ['Rating(4,5]'] 

['Everyone', 'Size(very small)', 'Free', 'Installs(very small)', 'Reviews(very small)']  ->  ['Rating(4,5]'] 
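
The same filter can be written without index arithmetic by pairing up the flat list first. This is only a sketch: it assumes, as the loop above does, that `rules` alternates antecedent and consequent. The function name `rules_with_consequent_in` and the toy rules are made up for illustration.

```python
def rules_with_consequent_in(flat_rules, targets):
    """Pair up a flat [x0, y0, x1, y1, ...] list and keep the (x, y) pairs
    whose first consequent item is one of `targets`."""
    targets = set(targets)
    pairs = zip(flat_rules[0::2], flat_rules[1::2])
    return [(x, y) for x, y in pairs if y[0] in targets]

# Hypothetical rules in the same flat layout as the output of RuleMiner.
toy_rules = [
    ['GAME', 'Free'], ['Rating(4,5]'],
    ['GAME', 'Rating(4,5]'], ['Free'],
]

for x, y in rules_with_consequent_in(toy_rules, ['Rating(4,5]']):
    print(x, " -> ", y)
```

Returning the pairs instead of printing them makes it easier to reuse the filtered rules in later cells.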



#### Key Observations

- 5 out of 5 rules have the `Free` characteristic, which pertains to a free app.
- 4 out of 5 rules have the `Everyone` characteristic, which pertains to an app that is suitable for all ages.
- 3 out of 5 rules have the `Installs(very large)` and `Reviews(very large)` characteristics, which pertain to an app that has a very large number of installs and reviews relative to the distribution of data in the dataset.
- The only `Category` characteristic among the 5 rules is `GAME`, which pertains to a game app.
    - That rule is also the only one of the 5 rules without the `Everyone` characteristic.

- A characteristic such as `Installs(very large)` is somewhat obvious, because a highly rated app is very likely to be installed.
- The `Free` characteristic might also simply reflect the large number of free apps in the dataset.
- It might be worth trying a less strict threshold.
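
The counts in these observations can be reproduced by tallying how often each characteristic occurs among the antecedents. A small sketch follows, with the five rules printed above copied in by hand; the variable names are illustrative.

```python
from collections import Counter

# Antecedents of the five rules printed above; each has the consequent 'Rating(4,5]'.
antecedents = [
    ['GAME', 'Size(very large)', 'Free', 'Installs(very large)', 'Reviews(very large)'],
    ['Everyone', 'Free', 'Size(average)', 'Installs(very large)', 'Reviews(very large)'],
    ['Everyone', 'Size(very large)', 'Free', 'Installs(very large)', 'Reviews(very large)'],
    ['Everyone', 'Free', 'Installs(very small)', 'Reviews(very small)', 'Size(small)'],
    ['Everyone', 'Size(very small)', 'Free', 'Installs(very small)', 'Reviews(very small)'],
]

# Count in how many antecedents each item occurs.
item_counts = Counter(item for rule in antecedents for item in rule)

for item, count in item_counts.most_common():
    print(f"{item}: {count} of {len(antecedents)}")
```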

In [49]:
rule_miner = RuleMiner(100, 0.5)

In [50]:
start1_time = time.time()

rules = rule_miner.get_association_rules(assoc_df)
#print(rules)
# if you print this, it will look very ugly and may take up a lot of the screen

print ("The program took ", time.time() - start1_time, " to run")

The program took  106.53138589859009  to run


In [51]:
# Keep only the rules whose consequent is a rating category
for i in range(0, len(rules), 2):
    x = rules[i]
    y = rules[i+1]
    if y[0] in ratingset:
        print(x, " -> ", y, "\n")

['Everyone', 'Business', 'Free', 'BUSINESS', 'Installs(very small)', 'Reviews(very small)']  ->  ['Rating(4,5]'] 

['Everyone', 'GAME', 'Size(very large)', 'Free', 'Installs(very large)', 'Reviews(very large)']  ->  ['Rating(4,5]'] 

['GAME', 'Action', 'Size(very large)', 'Free', 'Installs(very large)', 'Reviews(very large)']  ->  ['Rating(4,5]'] 

['Everyone', 'FAMILY', 'Education', 'Free', 'Installs(very small)', 'Reviews(very small)']  ->  ['Rating(4,5]'] 

['MEDICAL', 'Everyone', 'Medical', 'Free', 'Installs(very small)', 'Reviews(very small)']  ->  ['Rating(4,5]'] 

['MEDICAL', 'Medical', 'Everyone', 'Free', 'Installs(very small)', 'Reviews(very small)']  ->  ['Rating(4,5]'] 

['Everyone', 'Photography', 'Free', 'Installs(very large)', 'Reviews(very large)', 'PHOTOGRAPHY']  ->  ['Rating(4,5]'] 

['Everyone', 'Tools', 'Installs(very large)', 'Free', 'TOOLS', 'Reviews(very large)']  ->  ['Rating(4,5]'] 

['Everyone', 'Tools', 'Free', 'Installs(very small)', 'TOOLS', 'Reviews(very sm

#### Key Observations

- `Everyone` and `Free` are still the dominant characteristics.
- Aside from `GAME`, the rules now also involve the `FAMILY`, `BUSINESS`, `PHOTOGRAPHY`, `MEDICAL`, and `TOOLS` categories, which may also correlate with a high app rating.
- Out of 10 rules, all of which have a `Reviews` and an `Installs` characteristic, `5` are very small and `5` are very large.
     - All rules with `Reviews(very small)` have `Installs(very small)`.
     - All rules with `Reviews(very large)` have `Installs(very large)`.
- The `GAME` category and the `Action` genre appear together in a rule.
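
The pairing of review and install sizes can be checked mechanically. The sketch below uses only the eight rules that are fully visible in the printed output above (the rest of that output is cut off), so it illustrates the check and does not verify all of the rules. The helper `always_together` is a made-up name.

```python
# Antecedents of the eight fully printed rules; each has the consequent 'Rating(4,5]'.
antecedents = [
    ['Everyone', 'Business', 'Free', 'BUSINESS', 'Installs(very small)', 'Reviews(very small)'],
    ['Everyone', 'GAME', 'Size(very large)', 'Free', 'Installs(very large)', 'Reviews(very large)'],
    ['GAME', 'Action', 'Size(very large)', 'Free', 'Installs(very large)', 'Reviews(very large)'],
    ['Everyone', 'FAMILY', 'Education', 'Free', 'Installs(very small)', 'Reviews(very small)'],
    ['MEDICAL', 'Everyone', 'Medical', 'Free', 'Installs(very small)', 'Reviews(very small)'],
    ['MEDICAL', 'Medical', 'Everyone', 'Free', 'Installs(very small)', 'Reviews(very small)'],
    ['Everyone', 'Photography', 'Free', 'Installs(very large)', 'Reviews(very large)', 'PHOTOGRAPHY'],
    ['Everyone', 'Tools', 'Installs(very large)', 'Free', 'TOOLS', 'Reviews(very large)'],
]

def always_together(rules, a, b):
    """True if every rule that contains item `a` also contains item `b`."""
    return all(b in rule for rule in rules if a in rule)

for size in ('very small', 'very large'):
    paired = always_together(antecedents, f'Reviews({size})', f'Installs({size})')
    print(f"Reviews({size}) always with Installs({size}):", paired)
```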


#### Analysis / Conclusion

- There may be many apps with a high rating because they have a low number of installs and reviews.
    - This is likely to apply to `MEDICAL` and `BUSINESS` apps.
- In the `GAME` category, `Action` games are more likely to be highly rated and are likely not rated `Everyone`.
- `PHOTOGRAPHY` apps are also likely to be rated highly, and this is supported by a high number of installs and reviews.
- `TOOLS` apps are also rated highly with the support of a high number of installs and reviews; however, a `TOOLS` app with a low number of installs and reviews is still likely to have a high rating.
