# A Case Study on Google Play Store Application Ratings

**DELA CRUZ, Alexis Louis L. <br>
NILL, Byron Ethelbert V. <br>
UY, Geosef Viktor C.** <br>

**28 September 2020**

This case study is built on a dataset of applications from the Google Play Store, and its objectives center on questions about an application's user rating. The first question deals with characteristics of mobile applications that may affect their user rating, such as, but not limited to, size, genre, and price. The second question may be considered a continuation of the first, as it seeks to “assign” a rating to an unrated application based on similar applications.

## Import
Import **numpy**, **pandas**, **matplotlib**, **time**, **LinearRegression**, **LabelEncoder**, **CollaborativeFiltering**, and **RuleMiner**.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import time

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder
from collaborative_filtering import CollaborativeFiltering
from rule_miner import RuleMiner

pd.options.mode.chained_assignment = None

## Google Play Store Dataset
For this case study, the dataset chosen by the researchers is the `Google Play Store Apps` dataset. This dataset contains 10841 rows, each of which represents an application listed on the Google Play Store. The dataset contains 13 unique columns.

The dataset is provided as `googleplaystore.csv`. Therefore, we must read the file.

In [None]:
apps_df = pd.read_csv('googleplaystore.csv')
apps_df

## Data Cleaning and Pre-processing

For data cleaning in this dataset, the researchers decided on the following modifications.
1. Remove `Last Updated`, `Current Ver`, `Android Ver`
2. Include Main `Genres` Only
3. Include Main `Content Rating` Only
4. Numerical data for `Installs`, `Size`, `Price`
5. Binning `Rating`, `Reviews`, `Size`, `Installs`
6. Remove/Modify NaN and duplicate observations

### Removing `Last Updated`, `Current Ver`, `Android Ver`

In this case study, the columns `Last Updated`, `Current Ver`, and `Android Ver` are not needed and will be removed.

In [None]:
apps_df = apps_df.drop(["Last Updated", "Current Ver", "Android Ver"], axis=1)
apps_df

### Including Main Genre Only in `Genres`

The researchers noticed that `Genres` has too many unique values because many apps have combined genres. The unique values can be seen below.

In [None]:
apps_df['Genres'].nunique(), apps_df['Genres'].unique()

To solve this problem, the researchers decided to include only the main genres provided in `Genres`. This divides the apps into simpler genres and allows easier visualization of categories for this column.

The first genre which comes before the character `;` for multi-genre apps will be considered the main genre. Genres that come after will be removed via string manipulation.

In [None]:
apps_df["Genres"] = apps_df["Genres"].str.split(";", n=1).str[0]
apps_df["Genres"].unique()

It is seen that the `Genres` section contained a bizarre genre of 'February 11, 2018', so the researchers decided to see the values of these apps.

In [None]:
apps_df[apps_df["Genres"] == "February 11, 2018"]

The researchers have chosen to drop this since it only contains one observation in the dataset. 

In [None]:
apps_df = apps_df[apps_df['Genres'] != "February 11, 2018"]
apps_df

### Including Main `Content Rating` Only

It is seen below that the `Content Rating` values contained some with 'Everyone' and 'Everyone 10+'. The researchers decided to exclude the age rating and only include the main content rating as well. In this case, the ratings would be 'Everyone', 'Teen', 'Mature', 'Adults', and 'Unrated'.

In [None]:
apps_df['Content Rating'].unique()

Simply splitting the strings on whitespace and keeping the first substring divides the content ratings into the desired categories.

In [None]:
apps_df["Content Rating"] = apps_df["Content Rating"].str.split(" ", n=1).str[0]
apps_df["Content Rating"].unique()
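
The "keep the text before the first delimiter" step used for both `Genres` and `Content Rating` can be illustrated in isolation. The following is a minimal sketch on hypothetical sample values (the `ratings` series is made up for illustration and is not read from the dataset):

```python
import pandas as pd

# Hypothetical sample values, not taken from the dataset
ratings = pd.Series(["Everyone", "Everyone 10+", "Mature 17+", "Adults only 18+", "Teen"])

# Split at most once on the first space and keep the part before it
main_ratings = ratings.str.split(" ", n=1).str[0]
print(main_ratings.tolist())
```

Passing `n=1` as a keyword limits the split to the first delimiter, which is all that is needed when only the leading substring is kept.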

### Assigning Numerical Data for `Installs`

Looking at the `Installs` column below, it can be seen that the values are stored as strings rather than as floats. Therefore, the researchers will use string manipulation on this column before converting it to float.

In [None]:
apps_df["Installs"].unique()

First, remove the '+' and ',' symbols so that the values can be converted.

In [None]:
apps_df['Installs'] = apps_df['Installs'].str.replace("+", "", regex=False)
apps_df['Installs'] = apps_df['Installs'].str.replace(",", "", regex=False)
apps_df['Installs'].unique()

Next, it is possible to convert them into float using the pandas `to_numeric()` function.

In [None]:
apps_df['Installs'] = pd.to_numeric(apps_df['Installs'], downcast="float")
apps_df['Installs'].unique()

### Assigning Numerical Data for `Size`

The same can be said for the `Size` column below. Therefore, the researchers will also use string manipulation for this column for conversion to float.

In [None]:
apps_df["Size"].unique()

Replacing 'k' with 'e+3' and 'M' with 'e+6' turns each value into a string in scientific notation that the `to_numeric()` function can parse. However, there is also a value named 'Varies with device'. The researchers decided to convert that into NaN and deal with the NaN values in a later step.

In [None]:
apps_df["Size"] = apps_df["Size"].str.replace('k', 'e+3')
apps_df["Size"] = apps_df["Size"].str.replace('M', 'e+6')
apps_df["Size"] = apps_df["Size"].replace('Varies with device', np.nan)
apps_df["Size"].unique()

After this, implementing the function will now be possible.

In [None]:
apps_df["Size"] = pd.to_numeric(apps_df["Size"], downcast="float")
apps_df["Size"].unique()
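
The suffix replacement works because `to_numeric()` accepts scientific notation. A minimal sketch on hypothetical size strings (the `sizes` series is illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, not taken from the dataset
sizes = pd.Series(["19M", "201k", "Varies with device", "8.7M"])

# Rewrite the unit suffixes as exponents, then mark the non-numeric label as missing
as_text = (
    sizes.str.replace("k", "e+3", regex=False)
         .str.replace("M", "e+6", regex=False)
         .replace("Varies with device", np.nan)
)

# "19e+6" parses as nineteen million, "201e+3" as two hundred one thousand
size_numeric = pd.to_numeric(as_text)
print(size_numeric)
```

The label `Varies with device` contains neither `k` nor `M`, so it passes through the two replacements unchanged and can be matched exactly afterwards.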

### Assigning Numerical Data for `Price`

The same can also be said for the `Price` column below. Therefore, the researchers will also use string manipulation on this column before converting it to float.

In [None]:
apps_df['Price'] = apps_df['Price'].str.replace("$", "", regex=False)
apps_df["Price"].unique()

After this, using the `to_numeric()` function will now be possible.

In [None]:
apps_df['Price'] = pd.to_numeric(apps_df["Price"], downcast="float")
apps_df['Price'].unique()

### Dealing with duplicate and NaN values

For duplicated rows, the researchers decided to simply drop these observations.

In [None]:
apps_df.duplicated().sum()

In [None]:
apps_df = apps_df.drop_duplicates()
apps_df

Before imputing missing values, the researchers saved a copy of the original `Rating` column, which is needed to answer a specific question later in the case study. The copy is only used to find apps whose rating is NaN, which serve as the test case for that question.

In [None]:
collab_ratings = apps_df[['App', 'Rating']].copy()
collab_ratings = collab_ratings.reset_index(drop=True)
collab_ratings

Checking the null values below shows that the `Rating` and `Size` columns need preprocessing.

In [None]:
apps_df.isnull().sum()

For `Rating` and `Size`, the researchers used the average of the apps per `Genres`. The researchers decided to group by this column instead of `Category` because `Category` has fewer unique values, which makes `Genres` more specific to the apps' capabilities.

In [None]:
apps_df.groupby("Genres")[["Rating", "Size"]].mean()

After confirming that the means for `Rating` and `Size` are reasonable, `apply()` is used with a lambda function that fills the NaN values in each group with the mean of that group.

In [None]:
apps_df['Rating'] = apps_df.groupby(['Genres'], sort=False, group_keys=False)['Rating'].apply(lambda x: x.fillna(x.mean()))
apps_df['Rating'].unique()

In [None]:
apps_df['Size'] = apps_df.groupby(['Genres'], sort=False, group_keys=False)['Size'].apply(lambda x: x.fillna(x.mean()))
apps_df['Size'].unique()
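
Group-wise mean imputation can also be written with `transform`, which returns a result aligned to the original rows. This is an alternative sketch on a hypothetical toy frame (`toy` is made up for illustration), not the approach used in the cells above:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing rating per genre
toy = pd.DataFrame({
    "Genres": ["Action", "Action", "Action", "Tools", "Tools"],
    "Rating": [4.0, np.nan, 5.0, 3.0, np.nan],
})

# Mean rating of each row's genre, aligned row by row with `toy`
genre_mean = toy.groupby("Genres")["Rating"].transform("mean")

# Fill only the missing entries with their genre's mean
toy["Rating"] = toy["Rating"].fillna(genre_mean)
print(toy)
```

Because `transform` keeps the original index, the filled column can be assigned back without any concern about how group keys are attached to the result.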

### Dropping Impossible Data

In [None]:
apps_df[apps_df["Installs"] < 1]

As seen from the filtered dataframe above, there are applications in the dataset that have zero `Installs` and zero `Reviews` but still have a rating. Ratings usually come from users of an application, but since these applications have no reviews and no installs, it can be concluded that their rating values are either untruthful or initial ratings given by the developers.

In [None]:
len(apps_df[apps_df["Installs"] < 1])

Given that the number of entries with this condition is less than 1% of the original dataset, we can drop these rows before proceeding.

In [None]:
apps_df = apps_df[apps_df["Installs"] >= 1]
apps_df[apps_df["Installs"] < 1]

### Binning `Rating`, `Reviews`, `Size`, `Installs` into Appropriate Quantiles

In the case of `Rating`, the researchers needed a new column that divides the rating into categories, which will mainly be used for association rules. The new column will be called `Binned Rating`. For this binning process, the researchers decided to use the `cut()` function, since it separates the ratings into bins based on the actual rating values.

In [None]:
apps_df[apps_df["Rating"] < 1]

The bins were finalized as (0, 1], (1, 2], (2, 3], (3, 4], and (4, 5], each inclusive of its upper bound. These bins cover the data because no rating is below 1 or above 5, and no app has a rating of exactly 0.

In [None]:
bins = [0, 1, 2, 3, 4, 5]

The new `Binned Rating` column is then integrated into the dataset.

In [None]:
apps_df["Binned Rating"] = pd.cut(apps_df['Rating'], bins, labels=['Rating(0,1]', 'Rating(1,2]', 'Rating(2,3]', 'Rating(3,4]', 'Rating(4,5]' ])
apps_df["Binned Rating"]

However, for `Reviews`, `Installs`, and `Size`, the researchers decided that it was appropriate to divide the values into quantiles so that each bin holds a comparable share of the dataset. The new columns will be named `Binned Reviews`, `Binned Installs`, and `Binned Size` respectively.

The researchers chose 5 as the number of quantiles, dividing each column into 5 categories: very small, small, average, large, and very large. `Reviews` will also be converted to float to allow statistical computations.

In [None]:
apps_df["Reviews"] = pd.to_numeric(apps_df["Reviews"], downcast='float')
apps_df['Binned Reviews'] = pd.qcut(apps_df['Reviews'], 5, labels=['Reviews(very small)', 'Reviews(small)', 'Reviews(average)', 'Reviews(large)', 'Reviews(very large)'])
apps_df['Binned Reviews'].unique()

In [None]:
apps_df['Binned Size'] = pd.qcut(apps_df['Size'], 5, labels=['Size(very small)', 'Size(small)', 'Size(average)', 'Size(large)', 'Size(very large)'])
apps_df['Binned Size'].unique()

In [None]:
apps_df['Binned Installs'] = pd.qcut(apps_df['Installs'], 5, labels=['Installs(very small)', 'Installs(small)', 'Installs(average)', 'Installs(large)', 'Installs(very large)'])
apps_df['Binned Installs'].unique()
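
The difference between `cut()` (equal-width bins) and `qcut()` (equal-population bins) is easiest to see on skewed data. A minimal sketch with a hypothetical series containing one extreme value:

```python
import pandas as pd

# Hypothetical, heavily skewed values (one outlier), not taken from the dataset
reviews = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000])

# cut: five bins of equal width across the value range
equal_width = pd.cut(reviews, 5)
width_counts = equal_width.value_counts()

# qcut: five bins holding (roughly) the same number of observations
equal_count = pd.qcut(reviews, 5)
quantile_counts = equal_count.value_counts()

print(width_counts.sort_index())
print(quantile_counts.sort_index())
```

With equal-width bins, nine of the ten values fall into the first bin, while the quantile bins hold two values each. This is the motivation for using quantiles on long-tailed columns such as `Reviews` and `Installs`.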

In [None]:
apps_df

## 1. What is/are the variables that can mostly affect the rating of an app?

For this research question, the proponents attempted to answer three specific questions derived from the general question:
- Are ratings affected by application pricing?
- What characteristics of a paid app can help in improving the rating of an app?
- Are ratings relevant in user interest (no. of installs) of an app?

### Are ratings affected by application pricing?

This question asks whether ratings differ significantly between `Free` and `Paid` apps, which can be checked with hypothesis testing. Because both variables are categorical, a chi-square test of independence is appropriate. The test will use `Binned Rating` and `Type` as the columns.

In this test, the hypotheses will be as follows:

$H_0$ (null hypothesis): `Binned Rating` is independent of `Type`. There is NO significant difference in the rating distribution of the two categories.

$H_A$ (alternative hypothesis): `Binned Rating` is not independent of `Type`. There IS a significant difference in the rating distribution of the two categories.

In [None]:
from scipy.stats import chi2_contingency

First we group the apps' ratings according to their `Type`, and find the count for each rating category.

In [None]:
rating_counts = apps_df.groupby("Type")["Binned Rating"].value_counts()
rating_counts

Next, the data will be converted to a table to make it suitable for chi-square testing.

In [None]:
table = pd.DataFrame([rating_counts["Free"], rating_counts["Paid"]], index=["Free", "Paid"]).transpose()
table

Finally, implement the chi-square test onto the table.

In [None]:
chi2_contingency(table)

Since the test yields a p-value of $4.5 \times 10^{-4}$, we reject the null hypothesis, meaning that there IS a significant difference between `Free` and `Paid` category apps.
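
To make the test less of a black box, the expected frequencies and the Pearson chi-square statistic can be computed by hand. The sketch below uses a hypothetical 3×2 table of counts (not the notebook's actual table) and plain `numpy`, without any continuity correction:

```python
import numpy as np

# Hypothetical counts: rows are rating bins, columns are Free / Paid
observed = np.array([[30, 10],
                     [50, 10],
                     [20,  5]])

# Expected count of a cell under independence:
# (row total * column total) / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Pearson chi-square statistic and its degrees of freedom
chi2_stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print(expected)
print(chi2_stat, dof)
```

Since each expected count is proportional to its column total, the column with fewer apps overall receives smaller expected counts in every row, regardless of how the ratings are distributed.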

**Conclusion**

However, this result alone is not enough to characterize the difference in ratings between `Free` and `Paid` apps. In the expected frequencies returned by the test, the counts for paid apps are lower than those for free apps in every rating bin. This reflects the sheer number of free apps available on the Google Play Store relative to paid ones, which leaves the two groups heavily imbalanced in the dataset. To improve on this result, possible recommendations include narrowing the scope to individual genres and adding more paid apps to balance the counts between `Free` and `Paid` apps.

### What characteristics of a paid app can help in improving the rating of an app?

Since the first subquestion determined that the pricing of an app is significantly related to its rating, the researchers decided to include pricing as a characteristic for this question as well, since it may give better insight into the features that highly rated apps have in common.

To get association rules, we will follow the market-basket model. In this case study, a basket is represented by a mobile app (a row). The items in the basket are the characteristics of the mobile app, each of which belongs to a certain category. To use the `RuleMiner` class, the dataset should only contain boolean values (0s and 1s) that denote whether a basket contains a certain item.

The dataset will be converted so that the columns are the unique values instead of the categories. All unique values except those of the `App` column are taken to build the `items` for the market-basket model.

First, the `Price` column will be excluded from the association rules. Instead of binning numerical values of price, it is much simpler to use the `Type` column, which describes whether the app is `Paid` or `Free`.

Because the dataset now has binned columns, the original columns must also be removed.

In [None]:
copy_df = apps_df.copy()

del copy_df['Rating']
del copy_df['Reviews']
del copy_df['Size']
del copy_df['Price']
del copy_df['Installs']
copy_df

In [None]:
# Collect the unique values of every column in copy_df except `App`,
# so that the dropped numerical columns do not become items
items = np.concatenate([
    np.asarray(copy_df[col].unique(), dtype=object)
    for col in copy_df.columns[1:]
])

for item in items:
    print(item)


The unique items will now be the columns of the dataframe. The dataframe is now a matrix that can represent the market-basket model and is compatible with the `RuleMiner` class.

In [None]:
assoc_df = pd.DataFrame(0, index=np.arange(len(apps_df.index)), columns=items)
assoc_df

In [None]:
columns = copy_df.columns
columns

To complete the market-basket model matrix, we now change the value of a cell from `0` to `1` if the application has that characteristic. This could take some time due to the very large size of the dataframe, but this code only needs to be executed once. For reference, it takes around 11 seconds to complete on a 7th gen i7 laptop.

In [None]:
start_time = time.time()

for i in range(len(assoc_df.index)):
    for j in range(1, len(columns)):
        assoc_df.loc[assoc_df.index[i], copy_df.loc[copy_df.index[i], columns[j]]] = 1
        
print ("The program took ", time.time() - start_time, " to run")

assoc_df
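
As an alternative to the nested loop, a similar 0/1 basket matrix could be built in one step with `pd.get_dummies`. The following is a sketch on a hypothetical toy frame (`toy` and `basket` are illustrative names), and it assumes that no value appears in more than one column, since duplicate values would produce duplicate column names:

```python
import pandas as pd

# Hypothetical toy data standing in for the categorical columns
toy = pd.DataFrame({
    "App": ["a", "b", "c"],
    "Category": ["GAME", "TOOLS", "GAME"],
    "Type": ["Free", "Paid", "Free"],
})

# One indicator column per unique value; empty prefix and separator
# keep the raw values as column names
basket = pd.get_dummies(
    toy.drop(columns="App"), prefix="", prefix_sep=""
).astype(int)
print(basket)
```

Each row then contains exactly one `1` per original column, which matches what the loop produces cell by cell.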

Running Rule Miner will also take a lot of time as we lower the thresholds. 

For reference (i7 7th gen laptop):
- RuleMiner(300, 0.5) took 25 seconds to run
- RuleMiner(100, 0.5) took 100 seconds to run

In the first trial, let us try a support threshold of 300 and a confidence threshold of 50%. There is no particular reason for these values; they can be adjusted.

In [None]:
rule_miner = RuleMiner(300, 0.5)
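
To make the two thresholds concrete, support and confidence can be computed by hand on a tiny basket matrix. This sketch assumes support is an absolute count of baskets, as the threshold of 300 suggests; the `basket` frame and the `support` helper are hypothetical and are not part of the `RuleMiner` class:

```python
import pandas as pd

# Hypothetical 0/1 basket matrix: one row per app, one column per item
basket = pd.DataFrame({
    "Free":        [1, 1, 1, 0, 1],
    "Everyone":    [1, 1, 0, 0, 1],
    "Rating(4,5]": [1, 1, 0, 1, 0],
})

def support(df, items):
    """Number of baskets (rows) that contain every item in `items`."""
    return int(df[list(items)].all(axis=1).sum())

# Rule x -> y: confidence = support(x and y) / support(x)
x = ["Free", "Everyone"]
y = ["Rating(4,5]"]
support_xy = support(basket, x + y)
confidence = support_xy / support(basket, x)

print(support_xy, confidence)
```

A rule is kept only if its itemset meets the support threshold and its confidence meets the confidence threshold, so lowering the support threshold admits rarer combinations at the cost of longer running time.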

In [None]:
start1_time = time.time()

rules = rule_miner.get_association_rules(assoc_df)
#print(rules)
# if you print this, it will look very ugly and may take up a lot of the screen

print ("The program took ", time.time() - start1_time, " to run")

These are the rules.

In [None]:
for i in range(0, len(rules), 2):
    print(rules[i], " -> ", rules[i+1], "\n")

Specifically, we want to see association rules (`x` -> `y`) such that `y` is a category of `Binned Rating`, to see which app characteristics are most likely to belong to a certain rating range.

First, we take the set of rating categories.

In [None]:
ratingset = copy_df['Binned Rating'].unique()
ratingset = ratingset.tolist()
ratingset

In [None]:
for i in range(0, len(rules), 2):
    x = rules[i]
    y = rules[i+1]
    if y[0] in ratingset:
        print(rules[i], " -> ", rules[i+1], "\n")

#### Key Observations

- 5 out of 5 rules have the `Free` characteristic, which pertains to a free app
- 4 out of 5 rules have the `Everyone` characteristic, which pertains to an app that is suitable for all ages
- 3 out of 5 rules have the `Installs(very large)` and `Reviews(very large)` characteristics, which pertain to an app with a very large number of reviews and installs relative to the distribution of data in the dataset
- The only `Category` characteristic among the 5 rules is `GAME`, which pertains to a game app.
    - That rule is also the only one among the 5 rules without `Everyone` as a characteristic

- A characteristic such as `Installs(very large)` is somewhat expected because a highly rated app is very likely to be installed
- A characteristic such as `Free` might also be caused by the large number of free apps.
- It might be worth trying a less strict threshold

In [None]:
rule_miner = RuleMiner(100, 0.5)

In [None]:
start1_time = time.time()

rules = rule_miner.get_association_rules(assoc_df)
#print(rules)
# if you print this, it will look very ugly and may take up a lot of the screen

print ("The program took ", time.time() - start1_time, " to run")

In [None]:
for i in range(0, len(rules), 2):
    x = rules[i]
    y = rules[i+1]
    if y[0] in ratingset:
        print(rules[i], " -> ", rules[i+1], "\n")

#### Key Observations

- `Everyone` and `Free` are still dominant characteristics
- Aside from `GAME`, there are also apps from the `FAMILY`, `BUSINESS`, `PHOTOGRAPHY`, `MEDICAL`, and `TOOLS` categories, which may also correlate with high app ratings
- Out of 10 rules that all have a `Reviews` and an `Installs` characteristic, there are `5` very small and `5` very large
     - All rules with `Reviews(very small)` have `Installs(very small)`
     - All rules with `Reviews(very large)` have `Installs(very large)`
- The `GAME` category and the `Action` genre appear together in a rule


#### Analysis / Conclusion

- There may be a lot of apps with a high rating due to having a low number of installs and reviews
    - This statement is likely to apply to `MEDICAL` and `BUSINESS` apps
- For the `GAME` category, `Action` games are more likely to be highly rated and are likely not to be rated for `Everyone`
- `PHOTOGRAPHY` apps are also likely to be rated high, supported by a high number of installs and reviews
- `TOOLS` apps are also rated high and are supported by a high number of installs and reviews; however, there is still a good likelihood of a high rating arising from a low number of installs and reviews.


### Are ratings relevant in user interest (no. of installs) of an app?

The third subquestion of this case study deals with the relationship of application rating and user interest. While ratings themselves may be considered measures of user interest, another field in this dataset may be considered as well: the number of installations an application has.

To see the relationship between these variables, the researchers have decided to employ linear regression. The original dataframe will be reduced to two columns for ease in processing.

In [None]:
linreg_df = apps_df[['Rating', 'Installs']].copy()
linreg_df

In [None]:
linreg_df.isnull().values.any()

Next, the individual points will be plotted using `matplotlib`'s scatterplot function. A regression line will also be plotted with this scatterplot to see the relationship between the `Rating` and `Installs` variables.

For the plot, the application `Rating` will be the independent variable on the x-axis and the number of `Installs` an application has will be the dependent variable on the y-axis.

In [None]:
x = linreg_df.iloc[:, 0].values.reshape(-1, 1)
y = linreg_df.iloc[:, 1].values.reshape(-1, 1)

linreg = LinearRegression()
linreg.fit(x, y)
y_pred = linreg.predict(x)

plt.figure(figsize=(12, 9))
plt.scatter(x, y)
plt.plot(x, y_pred, color='red')
plt.title("Rating vs User Interest")
plt.xlabel("Rating")
plt.ylabel("User Interest")
plt.grid(True)
plt.show()

As can be seen from the above figure, there are large gaps along the user interest (`Installs`) axis compared to the `Rating` axis. This may be attributed to the small set of values that `Installs` actually takes.

In [None]:
install_vals = linreg_df['Installs'].unique()
install_vals.sort()
install_vals.astype(int)

The original dataset listed the values for `Installs` as strings of the form $x+$, where $x$ is any number from the above array. After EDA and data cleaning, the values for this variable have been turned into floating-point numbers.

Evidently, however, this left the values heavily spread out, with the distinct values having a standard deviation of *242403260* as seen below.

In [None]:
install_vals.std()

This is the reason why the scatterplot showed high variability in the dependent values, specifically how the larger values (1 billion, 500 million, 100 million, 50 million) can easily be distinguished while the other values are crowded at the bottom of the plot.

To remove, or at least minimize, this variability, the researchers have decided to assign each application a random value within its install bracket.

For example, an application with an original `Installs` value of `500000+`, translated to `500000.0` after EDA, will be assigned a random value between 500000 and 1000000. This is in accordance with its original value of at least 500000 (`500000+`) but less than 1000000.

In [None]:
np.random.seed(1)

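def getInstallSampleVal(base):
    # The top bracket (1 billion+) has no upper bound, so it is kept as-is
    if base == 1000000000:
        return float(base)

    # Sample between this install threshold and the next one (exclusive)
    base_index = np.where(install_vals == base)[0][0]
    min_val = int(install_vals[base_index])
    max_val = int(install_vals[base_index + 1])

    return float(np.random.randint(min_val, max_val))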
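
The bracket-sampling idea can be tried on a small, self-contained example. The sketch below uses a hypothetical subset of thresholds (`brackets`) and NumPy's `default_rng` generator rather than the notebook's `install_vals` and global seed:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical subset of install thresholds, sorted in ascending order
brackets = np.array([1, 5, 10, 50, 100, 500, 1000])

def sample_within_bracket(base):
    """Draw an integer in [base, next threshold); keep the top bracket as-is."""
    idx = int(np.searchsorted(brackets, base))
    if idx == len(brackets) - 1:
        return float(base)
    return float(rng.integers(brackets[idx], brackets[idx + 1]))

samples = [sample_within_bracket(b) for b in [1, 50, 500, 1000]]
print(samples)
```

Every sampled value stays consistent with the original `x+` label, since it is at least the threshold and strictly below the next one.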

The generated values may be seen below.

In [None]:
sample_vals = []

for val in linreg_df['Installs']:
    sample_vals.append(getInstallSampleVal(val))

linreg_df['Installs_sample'] = sample_vals
linreg_df

In [None]:
linreg_df['Installs_sample'].std()

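The standard deviation of the generated values is 107232458.6841. This is still a large number, but it is lower than the earlier standard deviation of more than 200 million. Note that the earlier figure was computed over the distinct `Installs` values only, whereas this one is computed over all rows, so the comparison is indicative rather than exact.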

The same process as before will be applied. The generated values will be plotted as the variable dependent on the original rating values.

In [None]:
x_s = linreg_df.iloc[:, 0].values.reshape(-1, 1)
y_s = linreg_df.iloc[:, 2].values.reshape(-1, 1)

linreg_s = LinearRegression()
linreg_s.fit(x_s, y_s)
y_s_pred = linreg_s.predict(x_s)

plt.figure(figsize=(12, 9))
plt.scatter(x_s, y_s)
plt.plot(x_s, y_s_pred, color='green', label="regression line")
plt.plot([1, 5], [100000000, 100000000], color="red", label="y = 100 million")
plt.title("Rating vs User Interest")
plt.xlabel("Rating")
plt.ylabel("User Interest")
plt.grid(True)
plt.legend()
plt.show()

As can be seen from the figure above, the variability has somewhat decreased and the regression line follows the data more closely than before. However, across the entire graph the overall variability is still evident, considering that the range of the dataset itself is (1000000000 - 1) or 999999999.

In [None]:
print("Applications with less than 100 million installs: " +  str(len(linreg_df[linreg_df['Installs'] < 100000000])))
print("Applications with at least 100 million installs: " + str(len(linreg_df[linreg_df['Installs'] >= 100000000])))

The regression line, albeit more defined than before, can only be observed on the lower portion of the graph which may be due to the fact that a majority of the observations (n = 9861 / 10340) have `Installs` and, therefore, `Installs_sample` values of less than 100 million.

To conclude this section, the researchers have decided to repeat the previous processes on subsets of the dataset.

In [None]:
linreg_1_df = linreg_df[linreg_df['Installs'] < 100000000]
linreg_1_df

In [None]:
x_1 = linreg_1_df.iloc[:, 0].values.reshape(-1, 1)
y_1 = linreg_1_df.iloc[:, 2].values.reshape(-1, 1)

linreg_1 = LinearRegression()
linreg_1.fit(x_1, y_1)
y_1_pred = linreg_1.predict(x_1)

plt.figure(figsize=(12, 9))
plt.scatter(x_1, y_1)
plt.plot(x_1, y_1_pred, color='orange')
plt.title("Rating vs User Interest (Installs < 100000000)")
plt.xlabel("Rating")
plt.ylabel("User Interest")
plt.grid(True)
plt.show()

Restricting the dataset to the majority of observations (`Installs` < 100000000) still yields a positive regression line (higher ratings yield higher installations).
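
The direction of a fitted line can also be checked numerically instead of being read off the plot. The sketch below uses `numpy.polyfit` on hypothetical data as a stand-in for the notebook's `LinearRegression` model; with scikit-learn, the same information is available from the fitted model's `coef_` and `intercept_` attributes.

```python
import numpy as np

# Hypothetical ratings and install counts, not taken from the dataset
rating = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
installs = np.array([100.0, 250.0, 600.0, 650.0, 900.0])

# Least-squares line: installs ~ slope * rating + intercept
slope, intercept = np.polyfit(rating, installs, deg=1)

# Squared Pearson correlation, which equals R^2 for a one-variable linear fit
r_squared = np.corrcoef(rating, installs)[0, 1] ** 2

print(slope, intercept, r_squared)
```

A positive slope corresponds to the upward regression line, while $R^2$ indicates how much of the variation in installs the line accounts for.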

In [None]:
linreg_100_df = linreg_df[linreg_df['Installs'] >= 100000000]
linreg_100_df

In [None]:
x_100 = linreg_100_df.iloc[:, 0].values.reshape(-1, 1)
y_100 = linreg_100_df.iloc[:, 2].values.reshape(-1, 1)

linreg_100 = LinearRegression()
linreg_100.fit(x_100, y_100)
y_100_pred = linreg_100.predict(x_100)

plt.figure(figsize=(12, 9))
plt.scatter(x_100, y_100)
plt.plot(x_100, y_100_pred, color='yellow')
plt.title("Rating vs User Interest (Installs >= 100000000)")
plt.xlabel("Rating")
plt.ylabel("User Interest")
plt.grid(True)
plt.show()

Further minimizing the dataset to include only observations with `Installs` values of at least 100000000 results in the figure above. Contrary to the figure from before, a negative relationship may be observed from this figure as evidenced by the regression line above.

**Conclusion**

To answer the subquestion for this section, ratings are not significantly relevant to user interest. The contrasting relationships shown in this section are evidence of this.

While the scatterplots initially suggest that a higher rating will yield more installations for an application, the contrasting regression lines suggest otherwise. The `Rating vs User Interest (Installs < 100000000)` plot shows that a higher rating may indeed mean more installs for the application, but the `Rating vs User Interest (Installs >= 100000000)` plot shows the reverse: higher ratings mean lower installation numbers.

## 2. Is it possible to suggest the rating of an app given the variables?

To answer this question, the researchers aimed to find the items most similar to an app with no rating. For this reason, the `collab_ratings` variable will be used. However, unlike a typical use of cosine similarity on numerical data, the researchers decided to compute it strictly on categorical data.

To make the dataframe suitable for cosine similarity, one-hot encoding will be applied to the categorical columns.

In [None]:
le = LabelEncoder()

The specific columns that are strictly categorical will be taken.

In [None]:
collab_df = apps_df[['App', 'Category', 'Type', 'Content Rating', 'Genres', 'Binned Reviews', 'Binned Size', 'Binned Installs']]
collab_df

Each category column is then transformed into numerical categories.

In [None]:
collab_df['Type']= le.fit_transform(collab_df['Type'])
collab_df['Category']= le.fit_transform(collab_df['Category'])
collab_df['Content Rating']= le.fit_transform(collab_df['Content Rating'])
collab_df['Genres']= le.fit_transform(collab_df['Genres'])
collab_df['Binned Reviews']= le.fit_transform(collab_df['Binned Reviews'])
collab_df['Binned Size']= le.fit_transform(collab_df['Binned Size'])
collab_df['Binned Installs']= le.fit_transform(collab_df['Binned Installs'])
collab_df

Each category is then converted using one-hot encoding and assigned to their respective columns.

In [None]:
one_hot_category = pd.get_dummies(collab_df['Category'], prefix = 'category')
one_hot_type = pd.get_dummies(collab_df['Type'], prefix = 'type')
one_hot_content_rating = pd.get_dummies(collab_df['Content Rating'], prefix = 'content_rating')
one_hot_genre = pd.get_dummies(collab_df['Genres'], prefix = 'genre')
one_hot_review = pd.get_dummies(collab_df['Binned Reviews'], prefix = 'review')
one_hot_size = pd.get_dummies(collab_df['Binned Size'], prefix = 'size')
one_hot_install = pd.get_dummies(collab_df['Binned Installs'], prefix = 'install')

After this, concatenate the new columns and drop the previous columns, so that the new dataframe contains only binary data.

In [None]:
collab_df = collab_df.join([one_hot_category, one_hot_type, one_hot_content_rating, one_hot_genre, one_hot_review, one_hot_size, one_hot_install])
collab_df

In [None]:
collab_df = collab_df.drop(columns=['Category', 'Type', 'Content Rating', 'Genres', 'Binned Reviews', 'Binned Size', 'Binned Installs'])
collab_df

Setting of the index into the app name itself makes it possible to use collaborative filtering functions in the following cells.

In [None]:
collab_df = collab_df.set_index('App')
collab_df

For this question, the researchers set the number of similar apps to 500, which they considered a reasonable starting point for the filtering process.

In [None]:
cfilter = CollaborativeFiltering(500)

Using the `collab_ratings` dataframe, it is possible to search for a NaN value in its `Rating` column and use that app as the test case.

In [None]:
collab_ratings

It is seen that index 10337 does not contain a rating. Let us try and suggest a rating using that as a basis.

In [None]:
index = 10337

In [None]:
collab_df['index'] = range(0, len(collab_df))
collab_df

The cosine similarity, $S_c$, between two vectors $A$ and $B$ is computed as:
$$S_c(A, B)=\dfrac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}}$$
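
As a worked instance of the formula, the sketch below computes the cosine similarity of two hypothetical one-hot vectors with `numpy`. The helper `cosine_similarity` is illustrative only and is not the `get_cosine_similarity()` method of the `CollaborativeFiltering` class:

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical one-hot encoded apps sharing two of their three characteristics
app_a = [1, 0, 1, 1, 0]
app_b = [1, 0, 1, 0, 1]

sim_ab = cosine_similarity(app_a, app_b)
print(sim_ab)
```

For binary vectors, the numerator counts the shared characteristics and each norm is the square root of the number of characteristics an app has, so here the similarity is $2 / (\sqrt{3}\sqrt{3}) = 2/3$.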

For example, let us use the first app as a basis, and use the `get_cosine_similarity()` function:

In [None]:
sim, ind = cfilter.get_cosine_similarity(collab_df.iloc[0, :], collab_df.iloc[1:, :])
print('Photo Editor & Candy Camera & Grid & ScrapBook:', [round(x, 2) for x in collab_df.iloc[0, :]])
print('\nCosine similarities:\n' + str(sim))

It is possible to find the k similar apps using the `get_k_similar()` function. In this case, the k was instantiated to be 500.

In [None]:
main_app = collab_df.iloc[index, :]
other_apps = pd.concat([collab_df.iloc[:index, :], collab_df.iloc[index+1:, :]])
similar_index, similar_apps = cfilter.get_k_similar(other_apps, main_app)
print(similar_apps)

The rating of user `x` to item `i`, represented as $r_{xi}$, given the set of similar items `N`, is computed as:

$$r_{xi}=\dfrac{\sum_{y \in N}^{}s_{xy}r_{yi}}{\sum_{y \in N}^{}s_{xy}}$$

First, we reset the index of the similarities to clean the instance and assign the result to `sum_sim`.

In [None]:
sum_sim = similar_apps[0]
sum_sim = sum_sim.reset_index()[0]
sum_sim

Next, the ratings used for prediction are taken from `apps_df`. Note that the researchers did not take the ratings from `collab_ratings`, because it contains NaN values that would make the computation inaccurate. The preprocessed data is used instead.

In [None]:
rating_for_prediction = apps_df[['App', 'Rating']].reset_index(drop=True)
rating_for_prediction

After obtaining the ratings, the indices of the 500 most similar apps are taken from `similar_apps` and used to look up the rating that corresponds to each similarity value. The index is again reset to clean the instance.

In [None]:
rating_sim = rating_for_prediction.iloc[similar_apps['index']]
rating_sim = rating_sim['Rating'].reset_index(drop=True)
rating_sim

With the top cosine similarities and the ratings of the corresponding apps in hand, the formula can be applied.

In [None]:
(sum_sim * rating_sim).sum() / sum_sim.sum()

`(sum_sim * rating_sim).sum()` computes the numerator of the equation, $\sum_{y \in N} s_{xy} r_{yi}$, and `sum_sim.sum()` computes the denominator, $\sum_{y \in N} s_{xy}$.
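
A small worked example may help in checking the arithmetic by hand. The three similarities and ratings below are made up for illustration and do not come from the dataset.

```python
import pandas as pd

# Hypothetical similarities and ratings for three neighboring apps
sum_sim = pd.Series([0.9, 0.8, 0.5])
rating_sim = pd.Series([4.5, 4.0, 3.0])

numerator = (sum_sim * rating_sim).sum()   # 0.9*4.5 + 0.8*4.0 + 0.5*3.0
denominator = sum_sim.sum()                # 0.9 + 0.8 + 0.5
predicted = numerator / denominator
print(round(predicted, 2))
```

Because pandas aligns Series on their index labels when multiplying, both Series need the same index for each similarity to meet its own rating, which is why the index is reset on both before the formula is applied. The result is a weighted mean, so it always lies between the lowest and highest neighbor rating.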

**Conclusion**

It is possible to suggest a rating using the dataset presented. Using cosine similarities of one-hot encoded data, the rating of an app can be predicted from the preprocessed data. This prediction can be extended to other apps by **assigning the newly found rating** to its respective app and then **solving for the ratings of the other preprocessed apps**. Repeating these steps over many iterations is expected to refine the predicted ratings further.

## Conclusions and Recommendations

This case study was geared towards determining the characteristics that affect a mobile application’s user satisfaction rating and “guessing” the rating of an unrated application given these characteristics.

The first question was further divided into three subquestions, each tackling a separate matter in relation to factors affecting the application rating. The first subquestion aimed to find differences between the ratings of paid and free applications and, upon completion of a chi-square statistical test, a difference between the two was indeed found. However, the researchers deemed the test to be lacking in integrity because of the large disparity between the number of free and the number of paid applications in the dataset. The second subquestion was answered using the association rules data mining technique. Here, the researchers determined the application characteristics that yielded rating values in the higher range of `(4, 5]`. An association was found between a lower number of installations and reviews and a higher rating, which is plausible: newer applications have fewer rating entries, so a few high individual ratings can pull the overall rating to a higher value. Lastly, the third subquestion dealt with the significance of application rating in the number of installations an application has. This was examined through linear regression, and the contrasting regression analyses showed that ratings do not significantly affect whether a user installs a specific application. This may also be viewed in the opposite direction: the more installations an application gets, the more users may give less-than-average ratings that pull its overall rating down. In general, applications do have characteristics that affect their overall rating, such as their installation and review quantities, genre, content rating, and pricing.

The second question was mainly answered through the implementation of collaborative filtering and the use of cosine similarity on categorical data. One-hot encoding was applied to make the data suitable for cosine similarity, and the most similar apps were then selected. The ratings of these apps were weighted by their corresponding cosine similarities to produce a predicted rating. Repeating this process over a significant number of iterations is expected to refine the ratings assigned to the dataset.

To give further depth to the study and obtain more accurate results, the researchers recommend increasing the number of paid apps in the dataset. Since the dataset contained mostly free apps, adding more data on paid apps would give more insight into this category and more accurate results when comparing them to free apps. Allowing the support and confidence thresholds to be adjusted may also give new insights for the categories, as it may introduce new association rules. Initially, the second subquestion aimed to determine which characteristics of a paid app can help in improving its rating. However, the scope was changed to all apps in general instead of paid apps alone because it was determined that the pricing of an app does hold significance in the rating. The researchers decided to consider free apps as well, as this seemed to provide better insight into the features that highly rated apps have in common. Furthermore, introducing either ordinal or logistic regression can provide a better understanding of the relevance of the variables. Ordinal regression suits the ordered variables in the dataset, specifically the application rating, while logistic regression suits the categorical variables. Lastly, the researchers recommend that collaborative filtering be executed differently, such that the data used in suggesting a rating for an unrated application is not preprocessed. That is, rating suggestions should be based on the original ratings of similar applications and, if any of these similar applications are also unrated, the next most similar application should be used instead, and so on.
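
The last recommendation can be sketched as follows. This is one possible reading of it rather than an implementation used in this study: the function name `suggest_rating`, the sample similarities, and the sample ratings are hypothetical, and unrated apps are assumed to be marked with NaN.

```python
import numpy as np
import pandas as pd

def suggest_rating(similarities, ratings, k):
    """Similarity-weighted mean rating of the k most similar *rated* apps."""
    rated = ratings.dropna()
    # Unrated apps are dropped first, so the next most similar rated app takes their place
    top = similarities.loc[rated.index].nlargest(k)
    if top.sum() == 0:
        return np.nan  # no usable neighbor
    return float((top * rated.loc[top.index]).sum() / top.sum())

# Hypothetical neighbors: app 'a' is the most similar one but has no original rating
sims = pd.Series([0.95, 0.90, 0.70, 0.40], index=['a', 'b', 'c', 'd'])
ratings = pd.Series([np.nan, 4.0, 5.0, 3.0], index=['a', 'b', 'c', 'd'])
suggested = suggest_rating(sims, ratings, k=2)
print(round(suggested, 4))
```

In this example the unrated app `a` is skipped, so the suggestion is built from `b` and `c`, the next two most similar apps that do have original ratings.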