## Import
Import **numpy**, **pandas**, and **matplotlib**.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
pd.options.mode.chained_assignment = None

## Google Play Store Dataset
For this notebook, we will work on the Google Play Store dataset (`googleplaystore.csv`). Each row describes one mobile app, with attributes such as `Category`, `Rating`, `Reviews`, `Size`, `Installs`, `Type`, `Price`, `Content Rating`, and `Genres`. The dataset contains more than 10,000 apps.

The dataset is provided to you as a `.csv` file. `.csv` means comma-separated values. You can open the file in Notepad to see how it is exactly formatted.

If you view the `.csv` file in Excel, you can see that each row holds the attributes of a single app, one attribute per column.

In [2]:
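# Read every row as data (header=None), so the values are loaded as strings.
df_gps = pd.read_csv("googleplaystore.csv", header=None)
# Promote the first row to column names, then drop it from the data.
df_gps.columns = df_gps.iloc[0]
df_gps = df_gps.drop([0], axis=0)
df_gps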


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
1,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
2,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
3,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
4,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
5,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10837,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up
10838,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education,"July 6, 2018",1.0,4.1 and up
10839,Parkinson Exercices FR,MEDICAL,,3,9.5M,"1,000+",Free,0,Everyone,Medical,"January 20, 2017",1.0,2.2 and up
10840,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device


## Data Cleaning and Pre-processing


**Duplicates and NaN values**

First, we check for null values and duplicates so that we can remove them before doing further cleaning.
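
The check described above is not shown in a cell, so here is a minimal sketch of what it could look like. The `toy` frame and its values are made up for illustration and only stand in for `df_gps`; `isna`, `duplicated`, `drop_duplicates`, and `dropna` are standard pandas methods.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_gps (values are invented for illustration).
toy = pd.DataFrame({
    "App": ["A", "B", "B", "C"],
    "Rating": [4.1, np.nan, np.nan, 3.9],
    "Installs": ["10,000+", "500+", "500+", "1,000+"],
})

null_counts = toy.isna().sum()          # missing values per column
n_duplicates = toy.duplicated().sum()   # rows identical to an earlier row

print(null_counts)
print("duplicate rows:", n_duplicates)

# Drop exact duplicates first, then rows that still contain a missing value.
cleaned = toy.drop_duplicates().dropna()
print(cleaned)
```

Whether to drop rows with missing values this early is a judgment call; this notebook only calls `dropna()` later, after binning.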

**Remove unnecessary columns**

In this case study, the columns `Last Updated`, `Current Ver`, and `Android Ver` are not needed, so we keep only the first ten columns.


In [3]:
df_gps = df_gps.iloc[:, 0:10]
df_gps

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres
1,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design
2,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play
3,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design
4,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design
5,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity
...,...,...,...,...,...,...,...,...,...,...
10837,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education
10838,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education
10839,Parkinson Exercices FR,MEDICAL,,3,9.5M,"1,000+",Free,0,Everyone,Medical
10840,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference


**String Manipulation for `Installs`, `Size`, `Price`, and `Genres` columns**

Next, we remove the characters '+' and ',' from each value in the `Installs` column so that the values can be converted to numbers later on.

In [4]:
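# regex=False treats "+" and "," as literal characters, not regex syntax.
df_gps['Installs'] = df_gps['Installs'].str.replace("+", "", regex=False)
df_gps['Installs'] = df_gps['Installs'].str.replace(",", "", regex=False)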
df_gps


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres
1,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,10000,Free,0,Everyone,Art & Design
2,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,500000,Free,0,Everyone,Art & Design;Pretend Play
3,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,5000000,Free,0,Everyone,Art & Design
4,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,50000000,Free,0,Teen,Art & Design
5,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,100000,Free,0,Everyone,Art & Design;Creativity
...,...,...,...,...,...,...,...,...,...,...
10837,Sya9a Maroc - FR,FAMILY,4.5,38,53M,5000,Free,0,Everyone,Education
10838,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100,Free,0,Everyone,Education
10839,Parkinson Exercices FR,MEDICAL,,3,9.5M,1000,Free,0,Everyone,Medical
10840,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,1000,Free,0,Mature 17+,Books & Reference


We also replace the unit suffixes in the `Size` column with exponents: 'M' becomes 'e+6' and 'k' becomes 'e+3'. The resulting strings can then be parsed as numbers later on.

In [5]:
df_gps['Size'] = df_gps['Size'].str.replace("M", "e+6", regex=False)
df_gps['Size'] = df_gps['Size'].str.replace("k", "e+3", regex=False)
#df_gps['Size'] = df_gps['Size'].str.replace(".", "")
df_gps

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres
1,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19e+6,10000,Free,0,Everyone,Art & Design
2,Coloring book moana,ART_AND_DESIGN,3.9,967,14e+6,500000,Free,0,Everyone,Art & Design;Pretend Play
3,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7e+6,5000000,Free,0,Everyone,Art & Design
4,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25e+6,50000000,Free,0,Teen,Art & Design
5,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8e+6,100000,Free,0,Everyone,Art & Design;Creativity
...,...,...,...,...,...,...,...,...,...,...
10837,Sya9a Maroc - FR,FAMILY,4.5,38,53e+6,5000,Free,0,Everyone,Education
10838,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6e+6,100,Free,0,Everyone,Education
10839,Parkinson Exercices FR,MEDICAL,,3,9.5e+6,1000,Free,0,Everyone,Medical
10840,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,1000,Free,0,Mature 17+,Books & Reference
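
The suffix-to-exponent conversion above can be tried on a small sample. This is a sketch, not the notebook's own cell: the `sizes` values are picked by hand, and `errors="coerce"` is an optional choice that turns entries such as `Varies with device` into NaN instead of raising an error.

```python
import pandas as pd

sizes = pd.Series(["19M", "8.7M", "201k", "Varies with device"])

# Rewrite the unit suffix as an exponent so the string parses as a float.
as_text = (sizes.str.replace("M", "e+6", regex=False)
                .str.replace("k", "e+3", regex=False))

# errors="coerce" turns unparseable entries into NaN instead of raising.
size_numeric = pd.to_numeric(as_text, errors="coerce")
print(size_numeric)
```

With coercion, the non-numeric rows can be dropped afterwards with `dropna` rather than by matching the literal text.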


Next, we remove the `$` sign from the `Price` column.

In [6]:
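# regex=False is needed here because "$" is a regex anchor, not a literal, in regex mode.
df_gps['Price'] = df_gps['Price'].str.replace("$", "", regex=False)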
df_gps

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres
1,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19e+6,10000,Free,0,Everyone,Art & Design
2,Coloring book moana,ART_AND_DESIGN,3.9,967,14e+6,500000,Free,0,Everyone,Art & Design;Pretend Play
3,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7e+6,5000000,Free,0,Everyone,Art & Design
4,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25e+6,50000000,Free,0,Teen,Art & Design
5,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8e+6,100000,Free,0,Everyone,Art & Design;Creativity
...,...,...,...,...,...,...,...,...,...,...
10837,Sya9a Maroc - FR,FAMILY,4.5,38,53e+6,5000,Free,0,Everyone,Education
10838,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6e+6,100,Free,0,Everyone,Education
10839,Parkinson Exercices FR,MEDICAL,,3,9.5e+6,1000,Free,0,Everyone,Medical
10840,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,1000,Free,0,Mature 17+,Books & Reference


The `Genres` column has many unique values because a lot of apps list combined genres, as shown below.

In [7]:
df_gps['Genres'].nunique()

120

In [8]:
df_gps['Genres'].unique()

array(['Art & Design', 'Art & Design;Pretend Play',
       'Art & Design;Creativity', 'Art & Design;Action & Adventure',
       'Auto & Vehicles', 'Beauty', 'Books & Reference', 'Business',
       'Comics', 'Comics;Creativity', 'Communication', 'Dating',
       'Education;Education', 'Education', 'Education;Creativity',
       'Education;Music & Video', 'Education;Action & Adventure',
       'Education;Pretend Play', 'Education;Brain Games', 'Entertainment',
       'Entertainment;Music & Video', 'Entertainment;Brain Games',
       'Entertainment;Creativity', 'Events', 'Finance', 'Food & Drink',
       'Health & Fitness', 'House & Home', 'Libraries & Demo',
       'Lifestyle', 'Lifestyle;Pretend Play',
       'Adventure;Action & Adventure', 'Arcade', 'Casual', 'Card',
       'Casual;Pretend Play', 'Action', 'Strategy', 'Puzzle', 'Sports',
       'Music', 'Word', 'Racing', 'Casual;Creativity',
       'Casual;Action & Adventure', 'Simulation', 'Adventure', 'Board',
       'Trivia', 'Role 

To address this, we are going to consider only the main genre of the app. The first genre which comes before the character `;` for multi-genre apps will be considered the main genre. Genres that come after will be removed via string manipulation.

In [9]:
df_gps['Genres'] = df_gps['Genres'].str.split(';').str.get(0)
df_gps


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres
1,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19e+6,10000,Free,0,Everyone,Art & Design
2,Coloring book moana,ART_AND_DESIGN,3.9,967,14e+6,500000,Free,0,Everyone,Art & Design
3,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7e+6,5000000,Free,0,Everyone,Art & Design
4,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25e+6,50000000,Free,0,Teen,Art & Design
5,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8e+6,100000,Free,0,Everyone,Art & Design
...,...,...,...,...,...,...,...,...,...,...
10837,Sya9a Maroc - FR,FAMILY,4.5,38,53e+6,5000,Free,0,Everyone,Education
10838,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6e+6,100,Free,0,Everyone,Education
10839,Parkinson Exercices FR,MEDICAL,,3,9.5e+6,1000,Free,0,Everyone,Medical
10840,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,1000,Free,0,Mature 17+,Books & Reference


In [10]:
df_gps['Genres'].nunique()

49

From 120, the number of unique values for `Genres` is now 49.

**Shifted Columns**

While checking the unique values of each column, we found signs that a row has shifted values. The `Installs` column contains the value **Free**, which supposedly belongs to the `Type` column.

In [11]:
df_gps['Installs'].unique()

array(['10000', '500000', '5000000', '50000000', '100000', '50000',
       '1000000', '10000000', '5000', '100000000', '1000000000', '1000',
       '500000000', '50', '100', '500', '10', '1', '5', '0', 'Free'],
      dtype=object)

In [12]:
df_gps.loc[df_gps['Installs'] == 'Free']

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres
10473,Life Made WI-Fi Touchscreen Photo Frame,1.9,19,3.0M,"1,000+",Free,0,Everyone,,"February 11, 2018"


It seems there is only one row with shifted values. It is missing its `Category` value, so this row will be dropped.

In [13]:
df_gps = df_gps.drop( df_gps[df_gps['Installs'] == 'Free'].index)
df_gps.loc[df_gps['Installs'] == 'Free']

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres


It is now confirmed that the row with shifted values has been dropped.
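
A more general way to spot such rows, sketched here on invented data, is to try parsing a column that should be numeric and look at what fails. The `toy` frame is hypothetical; `pd.to_numeric(..., errors="coerce")` is the standard pandas call.

```python
import pandas as pd

# Hypothetical sample: the second row has its values shifted one column left.
toy = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Installs": ["10000", "Free", "500"],
    "Type": ["Free", "0", "Paid"],
})

# Values that fail numeric parsing become NaN under errors="coerce".
parsed = pd.to_numeric(toy["Installs"], errors="coerce")
suspect = toy[parsed.isna()]
print(suspect)
```

This flags any non-numeric entry, not only the value `Free`, so it would also catch other kinds of malformed rows.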

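**Other rows that may also be dropped**

Some apps list their `Size` as `Varies with device`, which cannot be converted to a number. These rows are dropped as well.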

In [14]:
df_gps['Size'].unique()

array(['19e+6', '14e+6', '8.7e+6', '25e+6', '2.8e+6', '5.6e+6', '29e+6',
       '33e+6', '3.1e+6', '28e+6', '12e+6', '20e+6', '21e+6', '37e+6',
       '2.7e+6', '5.5e+6', '17e+6', '39e+6', '31e+6', '4.2e+6', '7.0e+6',
       '23e+6', '6.0e+6', '6.1e+6', '4.6e+6', '9.2e+6', '5.2e+6', '11e+6',
       '24e+6', 'Varies with device', '9.4e+6', '15e+6', '10e+6',
       '1.2e+6', '26e+6', '8.0e+6', '7.9e+6', '56e+6', '57e+6', '35e+6',
       '54e+6', '201e+3', '3.6e+6', '5.7e+6', '8.6e+6', '2.4e+6', '27e+6',
       '2.5e+6', '16e+6', '3.4e+6', '8.9e+6', '3.9e+6', '2.9e+6', '38e+6',
       '32e+6', '5.4e+6', '18e+6', '1.1e+6', '2.2e+6', '4.5e+6', '9.8e+6',
       '52e+6', '9.0e+6', '6.7e+6', '30e+6', '2.6e+6', '7.1e+6', '3.7e+6',
       '22e+6', '7.4e+6', '6.4e+6', '3.2e+6', '8.2e+6', '9.9e+6',
       '4.9e+6', '9.5e+6', '5.0e+6', '5.9e+6', '13e+6', '73e+6', '6.8e+6',
       '3.5e+6', '4.0e+6', '2.3e+6', '7.2e+6', '2.1e+6', '42e+6',
       '7.3e+6', '9.1e+6', '55e+6', '23e+3', '6.5e+6', '1.5e+

In [15]:
df_gps = df_gps.drop( df_gps.loc[df_gps['Size'] == 'Varies with device'].index)
df_gps.loc[df_gps['Size'] == 'Varies with device']

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres


**Data Binning**

The numerical columns contain a large number of unique values. To reduce them, we perform data binning, which groups numerical values into a small number of categories. This step is important before the data is converted into a format compatible with the `RuleMiner` class.

First, we convert the columns with numerical values to float (they were loaded as strings).

In [16]:
df_gps['Size'] = pd.to_numeric(df_gps['Size'], downcast='float')

In [17]:
df_gps = df_gps.astype({"Rating": float, "Reviews": float, "Installs":float, "Price":float})
df_gps

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres
1,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159.0,19000000.0,10000.0,Free,0.0,Everyone,Art & Design
2,Coloring book moana,ART_AND_DESIGN,3.9,967.0,14000000.0,500000.0,Free,0.0,Everyone,Art & Design
3,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510.0,8700000.0,5000000.0,Free,0.0,Everyone,Art & Design
4,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644.0,25000000.0,50000000.0,Free,0.0,Teen,Art & Design
5,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967.0,2800000.0,100000.0,Free,0.0,Everyone,Art & Design
...,...,...,...,...,...,...,...,...,...,...
10836,FR Forms,BUSINESS,,0.0,9600000.0,10.0,Free,0.0,Everyone,Business
10837,Sya9a Maroc - FR,FAMILY,4.5,38.0,53000000.0,5000.0,Free,0.0,Everyone,Education
10838,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4.0,3600000.0,100.0,Free,0.0,Everyone,Education
10839,Parkinson Exercices FR,MEDICAL,,3.0,9500000.0,1000.0,Free,0.0,Everyone,Medical


In [18]:
## create duplicate of df_gps for binned version
df_binned = df_gps.copy()
df_binned

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres
1,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159.0,19000000.0,10000.0,Free,0.0,Everyone,Art & Design
2,Coloring book moana,ART_AND_DESIGN,3.9,967.0,14000000.0,500000.0,Free,0.0,Everyone,Art & Design
3,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510.0,8700000.0,5000000.0,Free,0.0,Everyone,Art & Design
4,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644.0,25000000.0,50000000.0,Free,0.0,Teen,Art & Design
5,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967.0,2800000.0,100000.0,Free,0.0,Everyone,Art & Design
...,...,...,...,...,...,...,...,...,...,...
10836,FR Forms,BUSINESS,,0.0,9600000.0,10.0,Free,0.0,Everyone,Business
10837,Sya9a Maroc - FR,FAMILY,4.5,38.0,53000000.0,5000.0,Free,0.0,Everyone,Education
10838,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4.0,3600000.0,100.0,Free,0.0,Everyone,Education
10839,Parkinson Exercices FR,MEDICAL,,3.0,9500000.0,1000.0,Free,0.0,Everyone,Medical


In [19]:
df_gps

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres
1,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159.0,19000000.0,10000.0,Free,0.0,Everyone,Art & Design
2,Coloring book moana,ART_AND_DESIGN,3.9,967.0,14000000.0,500000.0,Free,0.0,Everyone,Art & Design
3,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510.0,8700000.0,5000000.0,Free,0.0,Everyone,Art & Design
4,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644.0,25000000.0,50000000.0,Free,0.0,Teen,Art & Design
5,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967.0,2800000.0,100000.0,Free,0.0,Everyone,Art & Design
...,...,...,...,...,...,...,...,...,...,...
10836,FR Forms,BUSINESS,,0.0,9600000.0,10.0,Free,0.0,Everyone,Business
10837,Sya9a Maroc - FR,FAMILY,4.5,38.0,53000000.0,5000.0,Free,0.0,Everyone,Education
10838,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4.0,3600000.0,100.0,Free,0.0,Everyone,Education
10839,Parkinson Exercices FR,MEDICAL,,3.0,9500000.0,1000.0,Free,0.0,Everyone,Medical


In [20]:
bins_rating = [0, 1, 2, 3, 4, 5, np.inf]
bins_reviews = [0, 1000, 10000, 100000, 500000, df_gps['Reviews'].max()]
bins_size = [0, df_gps['Size'].max()]
bins_installs = [0, 1000, 10000, 100000, 500000, df_gps['Installs'].max()]
bins_price = [0, 1, 5, 25, 50, 100, df_gps['Price'].max()]

df_binned['Rating'] = pd.cut(df_gps['Rating'], bins_rating, include_lowest =True)
df_binned['Reviews'] = pd.cut(df_gps['Reviews'], bins_reviews, include_lowest =True)
df_binned['Size'] = pd.cut(df_gps['Size'], bins_size, include_lowest =True)
df_binned['Installs'] = pd.cut(df_gps['Installs'], bins_installs, include_lowest =True)
df_binned['Price'] = pd.cut(df_gps['Price'], bins_price, include_lowest =True)

df_binned

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres
1,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,"(4.0, 5.0]","(-0.001, 1000.0]","(-0.001, 100000000.0]","(1000.0, 10000.0]",Free,"(-0.001, 1.0]",Everyone,Art & Design
2,Coloring book moana,ART_AND_DESIGN,"(3.0, 4.0]","(-0.001, 1000.0]","(-0.001, 100000000.0]","(100000.0, 500000.0]",Free,"(-0.001, 1.0]",Everyone,Art & Design
3,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,"(4.0, 5.0]","(10000.0, 100000.0]","(-0.001, 100000000.0]","(500000.0, 1000000000.0]",Free,"(-0.001, 1.0]",Everyone,Art & Design
4,Sketch - Draw & Paint,ART_AND_DESIGN,"(4.0, 5.0]","(100000.0, 500000.0]","(-0.001, 100000000.0]","(500000.0, 1000000000.0]",Free,"(-0.001, 1.0]",Teen,Art & Design
5,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,"(4.0, 5.0]","(-0.001, 1000.0]","(-0.001, 100000000.0]","(10000.0, 100000.0]",Free,"(-0.001, 1.0]",Everyone,Art & Design
...,...,...,...,...,...,...,...,...,...,...
10836,FR Forms,BUSINESS,,"(-0.001, 1000.0]","(-0.001, 100000000.0]","(-0.001, 1000.0]",Free,"(-0.001, 1.0]",Everyone,Business
10837,Sya9a Maroc - FR,FAMILY,"(4.0, 5.0]","(-0.001, 1000.0]","(-0.001, 100000000.0]","(1000.0, 10000.0]",Free,"(-0.001, 1.0]",Everyone,Education
10838,Fr. Mike Schmitz Audio Teachings,FAMILY,"(4.0, 5.0]","(-0.001, 1000.0]","(-0.001, 100000000.0]","(-0.001, 1000.0]",Free,"(-0.001, 1.0]",Everyone,Education
10839,Parkinson Exercices FR,MEDICAL,,"(-0.001, 1000.0]","(-0.001, 100000000.0]","(-0.001, 1000.0]",Free,"(-0.001, 1.0]",Everyone,Medical


In [21]:
df_binned = df_binned.dropna()
df_binned

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres
1,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,"(4.0, 5.0]","(-0.001, 1000.0]","(-0.001, 100000000.0]","(1000.0, 10000.0]",Free,"(-0.001, 1.0]",Everyone,Art & Design
2,Coloring book moana,ART_AND_DESIGN,"(3.0, 4.0]","(-0.001, 1000.0]","(-0.001, 100000000.0]","(100000.0, 500000.0]",Free,"(-0.001, 1.0]",Everyone,Art & Design
3,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,"(4.0, 5.0]","(10000.0, 100000.0]","(-0.001, 100000000.0]","(500000.0, 1000000000.0]",Free,"(-0.001, 1.0]",Everyone,Art & Design
4,Sketch - Draw & Paint,ART_AND_DESIGN,"(4.0, 5.0]","(100000.0, 500000.0]","(-0.001, 100000000.0]","(500000.0, 1000000000.0]",Free,"(-0.001, 1.0]",Teen,Art & Design
5,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,"(4.0, 5.0]","(-0.001, 1000.0]","(-0.001, 100000000.0]","(10000.0, 100000.0]",Free,"(-0.001, 1.0]",Everyone,Art & Design
...,...,...,...,...,...,...,...,...,...,...
10834,Chemin (fr),BOOKS_AND_REFERENCE,"(4.0, 5.0]","(-0.001, 1000.0]","(-0.001, 100000000.0]","(-0.001, 1000.0]",Free,"(-0.001, 1.0]",Everyone,Books & Reference
10835,FR Calculator,FAMILY,"(3.0, 4.0]","(-0.001, 1000.0]","(-0.001, 100000000.0]","(-0.001, 1000.0]",Free,"(-0.001, 1.0]",Everyone,Education
10837,Sya9a Maroc - FR,FAMILY,"(4.0, 5.0]","(-0.001, 1000.0]","(-0.001, 100000000.0]","(1000.0, 10000.0]",Free,"(-0.001, 1.0]",Everyone,Education
10838,Fr. Mike Schmitz Audio Teachings,FAMILY,"(4.0, 5.0]","(-0.001, 1000.0]","(-0.001, 100000000.0]","(-0.001, 1000.0]",Free,"(-0.001, 1.0]",Everyone,Education
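
The bins above are stored as `Interval` objects, and bins from different columns can look identical, for example `(-0.001, 1.0]` for both `Rating` and `Price`. One possible remedy, sketched below on made-up ratings, is to pass explicit `labels` to `pd.cut`. The label format `Rating lo-hi` is an assumption chosen for illustration, not something the notebook defines.

```python
import numpy as np
import pandas as pd

ratings = pd.Series([4.1, 3.9, 1.0, 5.0, np.nan])

edges = [0, 1, 2, 3, 4, 5]
# One label per bin. The column name is part of the label so that bins from
# different columns cannot collide once they become item names.
labels = [f"Rating {lo}-{hi}" for lo, hi in zip(edges[:-1], edges[1:])]

# Bins are right-closed; include_lowest=True also admits the value 0.
binned = pd.cut(ratings, bins=edges, labels=labels, include_lowest=True)
print(binned)
```

Missing ratings stay missing after binning, so a later `dropna()` behaves the same as before.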


**Converting the data for compatibility with Association Rules**

To get association rules, we will follow the market-basket model. In this case study, each basket is a mobile app (a row), and the items in the basket are the characteristics of that app. However, each characteristic is currently stored as a value under a column such as `Category` or `Rating`. To use the `RuleMiner` class, the dataset should contain only binary values (0s and 1s) that denote whether a basket contains a certain item.

The dataset will be converted so that the columns are the unique values instead of the categories. First, all unique values from every column except `App` are collected to build the `items` for the market-basket model.

In [23]:
items = np.ndarray(shape=(1), dtype=object)

for i in range(1, len(df_binned.columns)):
    items = np.concatenate( (items, df_binned[df_binned.columns[i]].unique()), axis=0)

items = np.delete(items, [0])
items

array(['ART_AND_DESIGN', 'AUTO_AND_VEHICLES', 'BEAUTY',
       'BOOKS_AND_REFERENCE', 'BUSINESS', 'COMICS', 'COMMUNICATION',
       'DATING', 'EDUCATION', 'ENTERTAINMENT', 'EVENTS', 'FINANCE',
       'FOOD_AND_DRINK', 'HEALTH_AND_FITNESS', 'HOUSE_AND_HOME',
       'LIBRARIES_AND_DEMO', 'LIFESTYLE', 'GAME', 'FAMILY', 'MEDICAL',
       'SOCIAL', 'SHOPPING', 'PHOTOGRAPHY', 'SPORTS', 'TRAVEL_AND_LOCAL',
       'TOOLS', 'PERSONALIZATION', 'PRODUCTIVITY', 'PARENTING', 'WEATHER',
       'VIDEO_PLAYERS', 'NEWS_AND_MAGAZINES', 'MAPS_AND_NAVIGATION',
       Interval(4.0, 5.0, closed='right'),
       Interval(3.0, 4.0, closed='right'),
       Interval(2.0, 3.0, closed='right'),
       Interval(1.0, 2.0, closed='right'),
       Interval(-0.001, 1.0, closed='right'),
       Interval(-0.001, 1000.0, closed='right'),
       Interval(10000.0, 100000.0, closed='right'),
       Interval(100000.0, 500000.0, closed='right'),
       Interval(1000.0, 10000.0, closed='right'),
       Interval(500000.0, 44893

In [24]:
df_assoc = pd.DataFrame(0, index=np.arange(len(df_binned.index)), columns=items)
df_assoc

Unnamed: 0,ART_AND_DESIGN,AUTO_AND_VEHICLES,BEAUTY,BOOKS_AND_REFERENCE,BUSINESS,COMICS,COMMUNICATION,DATING,EDUCATION,ENTERTAINMENT,...,Photography,Travel & Local,Tools,Personalization,Productivity,Parenting,Weather,News & Magazines,Maps & Navigation,Casino
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7724,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7725,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7726,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7727,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [25]:
columns = df_binned.columns
columns

Index(['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
       'Price', 'Content Rating', 'Genres'],
      dtype='object', name=0)

In [26]:
import time
start_time = time.time()

for i in range(len(df_assoc.index)):
    for j in range(1, len(columns)):
        df_assoc.loc[df_assoc.index[i], df_binned.loc[df_binned.index[i], columns[j]]] = 1
        
print ("My program took ", time.time() - start_time, " to run")

df_assoc

My program took  16.010921955108643  to run


Unnamed: 0,ART_AND_DESIGN,AUTO_AND_VEHICLES,BEAUTY,BOOKS_AND_REFERENCE,BUSINESS,COMICS,COMMUNICATION,DATING,EDUCATION,ENTERTAINMENT,...,Photography,Travel & Local,Tools,Personalization,Productivity,Parenting,Weather,News & Magazines,Maps & Navigation,Casino
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7724,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7725,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7726,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7727,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [48]:
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 200)
df_assoc

Unnamed: 0,ART_AND_DESIGN,AUTO_AND_VEHICLES,BEAUTY,BOOKS_AND_REFERENCE,BUSINESS,COMICS,COMMUNICATION,DATING,EDUCATION,ENTERTAINMENT,EVENTS,FINANCE,FOOD_AND_DRINK,HEALTH_AND_FITNESS,HOUSE_AND_HOME,LIBRARIES_AND_DEMO,LIFESTYLE,GAME,FAMILY,MEDICAL,SOCIAL,SHOPPING,PHOTOGRAPHY,SPORTS,TRAVEL_AND_LOCAL,TOOLS,PERSONALIZATION,PRODUCTIVITY,PARENTING,WEATHER,VIDEO_PLAYERS,NEWS_AND_MAGAZINES,MAPS_AND_NAVIGATION,"(4.0, 5.0]","(3.0, 4.0]","(2.0, 3.0]","(1.0, 2.0]","(-0.001, 1.0]","(-0.001, 1000.0]","(10000.0, 100000.0]","(100000.0, 500000.0]","(1000.0, 10000.0]","(500000.0, 44893888.0]","(-0.001, 100000000.0]","(1000.0, 10000.0].1","(100000.0, 500000.0].1","(500000.0, 1000000000.0]","(10000.0, 100000.0].1","(-0.001, 1000.0].1",Free,Paid,"(-0.001, 1.0].1","(1.0, 5.0]","(5.0, 25.0]","(50.0, 100.0]","(25.0, 50.0]","(100.0, 400.0]",Everyone,Teen,Everyone 10+,Mature 17+,Adults only 18+,Unrated,Art & Design,Auto & Vehicles,Beauty,Books & Reference,Business,Comics,Communication,Dating,Education,Entertainment,Events,Finance,Food & Drink,Health & Fitness,House & Home,Libraries & Demo,Lifestyle,Adventure,Arcade,Casual,Card,Strategy,Action,Puzzle,Sports,Word,Racing,Simulation,Board,Trivia,Role Playing,Educational,Music,Music & Audio,Video Players & Editors,Medical,Social,Shopping,Photography,Travel & Local,Tools,Personalization,Productivity,Parenting,Weather,News & Magazines,Maps & Navigation,Casino
0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,1,0,0,1,0,1,1,0,0,0,1,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,1,0,1,0,0,1,0,1,0,0,1,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,1,0,0,1,1,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,0,1,0,1,1,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,1,1,0,0,0,1,0,0,0,1,1,1,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7724,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,1,0,0,0,0,1,0,0,0,0,1,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7725,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,1,0,0,0,0,1,0,0,0,0,1,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7726,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,1,0,0,1,0,1,1,0,0,0,1,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7727,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,1,0,0,0,0,1,0,0,0,0,1,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
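
The `.1` suffixes in the output above mark duplicate column labels: identical intervals from different columns (for example `Reviews` and `Installs`) end up with the same item name, so setting one of them to 1 sets both. A possible alternative, sketched here on an invented frame, is `pd.get_dummies`, which prefixes each indicator column with its source column name and avoids the row-by-row loop. The `toy` data and the `=` separator are assumptions for illustration.

```python
import pandas as pd

# Hypothetical binned frame; the string bins stand in for the Interval values.
toy = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Reviews": ["(0, 1000]", "(0, 1000]", "(1000, 10000]"],
    "Installs": ["(1000, 10000]", "(0, 1000]", "(1000, 10000]"],
    "Type": ["Free", "Paid", "Free"],
})

# get_dummies prefixes every indicator column with its source column name,
# so "Reviews=(0, 1000]" and "Installs=(0, 1000]" stay separate items.
basket = pd.get_dummies(toy.drop(columns="App"), prefix_sep="=", dtype=int)
print(basket.columns.tolist())
print(basket)
```

Because the item names would change, any code that looks items up by their old names would need to be adjusted accordingly.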


## Rule Miner
Open the `rule_miner.py` file. Some of the functions in the `RuleMiner` class are not yet implemented. We will implement the missing parts of this class.
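
The contents of `rule_miner.py` are not shown here, so the following is only a sketch of the two quantities that association rule mining usually thresholds on: support and confidence. The helper names `support_count` and `confidence`, and the toy `baskets` frame, are hypothetical and are not taken from the `RuleMiner` class.

```python
import pandas as pd

# Toy 0/1 basket matrix: rows are baskets, columns are items.
baskets = pd.DataFrame({
    "Free":     [1, 1, 1, 0],
    "Everyone": [1, 1, 0, 1],
    "GAME":     [1, 0, 0, 1],
})

def support_count(df, itemset):
    """Number of baskets (rows) that contain every item in `itemset`."""
    return int(df[list(itemset)].all(axis=1).sum())

def confidence(df, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    both = support_count(df, list(antecedent) + list(consequent))
    return both / support_count(df, antecedent)

print(support_count(baskets, ["Free", "Everyone"]))
print(confidence(baskets, ["Free"], ["Everyone"]))
```

A rule X -> Y is typically kept when the support of X and Y together and the confidence both meet their thresholds. The sketch does not guard against an antecedent with zero support.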

In [49]:
from rule_miner import RuleMiner

In [50]:
rule_miner = RuleMiner(90, 0.8)

In [51]:
start1_time = time.time()

rules = rule_miner.get_association_rules(df_assoc)
print(rules)

print ("My program took ", time.time() - start1_time, " to run")

[[Interval(500000.0, 44893888.0, closed='right'), 'Everyone', 'Free', 'GAME', Interval(-0.001, 100000000.0, closed='right'), Interval(4.0, 5.0, closed='right'), Interval(500000.0, 1000000000.0, closed='right')], [Interval(-0.001, 1.0, closed='right')], [Interval(500000.0, 44893888.0, closed='right'), 'Everyone', 'Free', 'GAME', Interval(-0.001, 100000000.0, closed='right'), Interval(4.0, 5.0, closed='right'), Interval(-0.001, 1.0, closed='right')], [Interval(500000.0, 1000000000.0, closed='right')], [Interval(500000.0, 44893888.0, closed='right'), 'Everyone', 'Free', 'GAME', Interval(-0.001, 100000000.0, closed='right'), Interval(500000.0, 1000000000.0, closed='right'), Interval(-0.001, 1.0, closed='right')], [Interval(4.0, 5.0, closed='right')], [Interval(500000.0, 44893888.0, closed='right'), 'Everyone', 'Free', 'GAME', Interval(4.0, 5.0, closed='right'), Interval(500000.0, 1000000000.0, closed='right'), Interval(-0.001, 1.0, closed='right')], [Interval(-0.001, 100000000.0, closed='r

In [56]:
rule_miner = RuleMiner(1000, 1)

In [58]:
start2_time = time.time()

rules = rule_miner.get_association_rules(df_assoc)
print(rules)

print ("My program took ", time.time() - start2_time, " to run")

[[Interval(-0.001, 1000.0, closed='right'), 'Everyone', 'Free', Interval(-0.001, 100000000.0, closed='right'), Interval(4.0, 5.0, closed='right')], [Interval(-0.001, 1.0, closed='right')], [Interval(-0.001, 1000.0, closed='right'), 'Everyone', 'Free', Interval(4.0, 5.0, closed='right'), Interval(-0.001, 1.0, closed='right')], [Interval(-0.001, 100000000.0, closed='right')], [Interval(10000.0, 100000.0, closed='right'), 'Free', Interval(-0.001, 100000000.0, closed='right'), Interval(4.0, 5.0, closed='right'), Interval(500000.0, 1000000000.0, closed='right')], [Interval(-0.001, 1.0, closed='right')], [Interval(10000.0, 100000.0, closed='right'), 'Free', Interval(4.0, 5.0, closed='right'), Interval(500000.0, 1000000000.0, closed='right'), Interval(-0.001, 1.0, closed='right')], [Interval(-0.001, 100000000.0, closed='right')], [Interval(10000.0, 100000.0, closed='right'), 'Everyone', 'Free', Interval(-0.001, 100000000.0, closed='right'), Interval(4.0, 5.0, closed='right')], [Interval(-0.00

In [62]:
for i in range(0, len(rules), 2):
    print(rules[i], " -> ", rules[i+1], "\n")

[Interval(-0.001, 1000.0, closed='right'), 'Everyone', 'Free', Interval(-0.001, 100000000.0, closed='right'), Interval(4.0, 5.0, closed='right')]  ->  [Interval(-0.001, 1.0, closed='right')] 

[Interval(-0.001, 1000.0, closed='right'), 'Everyone', 'Free', Interval(4.0, 5.0, closed='right'), Interval(-0.001, 1.0, closed='right')]  ->  [Interval(-0.001, 100000000.0, closed='right')] 

[Interval(10000.0, 100000.0, closed='right'), 'Free', Interval(-0.001, 100000000.0, closed='right'), Interval(4.0, 5.0, closed='right'), Interval(500000.0, 1000000000.0, closed='right')]  ->  [Interval(-0.001, 1.0, closed='right')] 

[Interval(10000.0, 100000.0, closed='right'), 'Free', Interval(4.0, 5.0, closed='right'), Interval(500000.0, 1000000000.0, closed='right'), Interval(-0.001, 1.0, closed='right')]  ->  [Interval(-0.001, 100000000.0, closed='right')] 

[Interval(10000.0, 100000.0, closed='right'), 'Everyone', 'Free', Interval(-0.001, 100000000.0, closed='right'), Interval(4.0, 5.0, closed='right'

Notes/Concerns:

- Incredibly slow when thresholds are low (10.5 minutes to execute with support = 90 and confidence = 0.8).
- The binning intervals need to be changed.
- The bin names also need to be changed (all of them start with the word 'Interval').
- How do we find x -> y rules such that y is one of the rating intervals?
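
For the last question, one possible approach is to filter the mined rules by their consequent. The sketch below assumes the flat list layout printed earlier (antecedent, consequent, antecedent, ...) and assumes the items have been renamed so that rating bins are distinguishable from other bins. The item names and the helper `rules_with_consequent_in` are hypothetical.

```python
# Flat list in the shape printed earlier: antecedent, consequent, antecedent, ...
rules = [
    ["Free", "GAME"], ["Rating 4-5"],
    ["Everyone", "Rating 4-5"], ["Free"],
    ["Paid"], ["Rating 3-4"],
]

def rules_with_consequent_in(flat_rules, targets):
    """Keep (antecedent, consequent) pairs whose consequent lies within `targets`."""
    targets = set(targets)
    pairs = zip(flat_rules[0::2], flat_rules[1::2])
    return [(x, y) for x, y in pairs if set(y) <= targets]

rating_items = {"Rating 0-1", "Rating 1-2", "Rating 2-3", "Rating 3-4", "Rating 4-5"}
rating_rules = rules_with_consequent_in(rules, rating_items)
for x, y in rating_rules:
    print(x, "->", y)
```

With the current `Interval` item names this filter would not be reliable, because the `Rating` bin `(-0.001, 1.0]` and the `Price` bin `(-0.001, 1.0]` compare as equal.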