This project analyzes a dataset of Google Play Store apps in order to find interesting insights and to predict an app's rating from some of its features.

The project can be divided into the following sections:<br>
- Dataset cleaning + Feature Engineering<br>
- Exploratory Data Analysis: answering interesting questions about the data<br>
- Data preparation for ML (encoding, scaling) for Rating prediction<br>
- ML modeling<br>
- Results<br>

# *Main results Summary*

![image.png](attachment:95dca130-c151-4b43-ac18-7bd6ac858cc8.png)

In [None]:
# import os
# for dirname, _, filenames in os.walk('/kaggle/input'):
#     for filename in filenames:
#         print(os.path.join(dirname, filename))

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split

from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

from google.colab import drive


In [None]:
!pip install -U -q PyDrive

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# 1. Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [None]:
drive

<pydrive.drive.GoogleDrive at 0x7fdbce554190>

In [None]:
file_list = drive.ListFile({'q': "'1LDb37kQstSWOP_x7rSqW7XRehPUzzgmt' in parents and trashed=false"}).GetList()

In [None]:
for file1 in file_list:
  print('title: %s, id: %s' % (file1['title'], file1['id']))

title: googleplaystore.csv, id: 11zqy3UVL596QePrcX5PInEa6QOdY-Ye0


In [None]:
import pandas as pd
import io
googleplaystore_data = drive.CreateFile({'id': '1aTuuYzxj5TZR8slFs7C0wWZ5NfYNUOSz'})
googleplaystore_data.GetContentFile('googleplaystore.csv')

In [None]:
df = pd.read_csv('googleplaystore.csv')
df.head()

| Unnamed: 0 | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0 | Everyone | Art & Design | August 1, 2018 | 1.2.4 | 4.0.3 and up |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Free | 0 | Teen | Art & Design | June 8, 2018 | Varies with device | 4.2 and up |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100,000+ | Free | 0 | Everyone | Art & Design;Creativity | June 20, 2018 | 1.1 | 4.4 and up |


In [None]:
df.info()

We can already see that some columns that should be numerical are stored as objects (strings).<br>
We will convert these columns to numeric types.
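
As a small illustration of this kind of conversion, here is a minimal sketch using `pd.to_numeric`. The sample values are made up for illustration; `errors='coerce'` turns anything that cannot be parsed into NaN, which makes the problematic rows easy to locate.

```python
import pandas as pd

# Hypothetical review counts stored as strings; the last entry is not a number.
reviews = pd.Series(['159', '967', '3.0M'])

# Unparseable entries become NaN instead of raising an error.
as_num = pd.to_numeric(reviews, errors='coerce')

# The NaN positions point back at the original offending strings.
bad_rows = reviews[as_num.isna()]
print(bad_rows)
```

Inspecting `bad_rows` before converting avoids silently losing information.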

# Numerical Features Cleaning and Analysis

In this section, some basic cleaning will be performed, then every feature will be analyzed.

Moreover, we will define a custom function to get boxplot and histogram for numerical features.

In [None]:
def num_plots(df, col, title, xlabel):
    fig, ax = plt.subplots(2, 1, sharex=True, figsize=(8,5),gridspec_kw={"height_ratios": (.2, .8)})
    ax[0].set_title(title,fontsize=18)
    sns.boxplot(x=col, data=df, ax=ax[0])
    ax[0].set(yticks=[])
    sns.histplot(x=col, data=df, ax=ax[1])
    ax[1].set_xlabel(xlabel, fontsize=16)
    plt.tight_layout()
    plt.show()

Before analyzing each feature, we will convert the column names to lower case.

In [None]:
df = df.rename(columns=str.lower)

In [None]:
df.columns

Are there duplicate values?

In [None]:
df[df.duplicated(subset='app')]

There appear to be 1181 duplicated apps. They will be dropped.

In [None]:
df.drop_duplicates(subset='app', inplace=True, ignore_index=True)
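
To make the behavior of the two calls above explicit, here is a minimal sketch on a made-up dataframe (the app names and review counts are assumptions). By default both `duplicated` and `drop_duplicates` treat the first occurrence as the original and every later one as the duplicate.

```python
import pandas as pd

# Hypothetical data: app 'A' appears twice with different review counts.
toy = pd.DataFrame({'app': ['A', 'B', 'A'], 'reviews': [10, 5, 12]})

# Only the second 'A' row is flagged as a duplicate.
dups = toy[toy.duplicated(subset='app')]

# The first occurrence of each app is kept; the index is renumbered from 0.
deduped = toy.drop_duplicates(subset='app', ignore_index=True)
print(deduped)
```

Note that the row that survives is simply the first one in file order, which is not necessarily the one with the most recent review count.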

Moreover, we will create a copy of the original dataframe, called 'df_clean', from which any outliers will be removed.

In [None]:
df_clean = df.copy()

## Reviews column cleaning

Are all the reviews actually numbers?

In [None]:
print('Number of non numeric reviews :', len(df_clean) - df_clean.reviews.str.isnumeric().sum())

It looks like there is one non-numeric review count. Which row is it?

In [None]:
df[pd.to_numeric(df_clean.reviews, errors='coerce').isna()]

This row looks a bit odd: all of its columns apparently contain wrong entries. It can be corrected as follows:

In [None]:
df.at[9300,'category'] = np.nan
df.at[9300,'rating'] = 1.9
df.at[9300,'reviews'] = 19.0
df.at[9300,'size'] = '3.0M'
df.at[9300,'installs'] = '1,000+'
df.at[9300,'type'] = 'Free'
df.at[9300,'price'] = 0
df.at[9300,'content rating'] = 'Everyone'
df.at[9300,'genres'] = np.nan
df.at[9300,'last updated'] = 'February 11, 2018'
df.at[9300,'current ver'] = '1.0.19'
df.at[9300,'android ver'] = '4.0 and up'


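However, the category is still missing, and since the dataframe contains more than 9,000 apps, losing one row is negligible. For this reason, we will drop this row from `df_clean` (the manual correction above was applied to `df` only).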

In [None]:
df_clean = df_clean.drop(9300)
df_clean = df_clean.reset_index(drop=True)

In [None]:
print('Number of non numeric reviews :', len(df_clean) - df_clean.reviews.str.isnumeric().sum())

Now the reviews column will be converted to the int64 data type.

In [None]:
df_clean['reviews'] = df_clean['reviews'].astype('int64')

## App size column cleaning and analysis

In [None]:
df_clean['size']

We can see that many app sizes end in 'M', which stands for MB (megabytes).<br>
There is also the value 'Varies with device'. We will now check with a regex whether there are other non-numerical sizes.

In [None]:
df_clean[~df_clean['size'].str.contains('M', regex= True, na=False)].head()

We can see that several entries have 'Varies with device' as size. Moreover, some apps have a size in KB (labeled as 'k').

We will now check whether there are other strings or characters besides 'k', 'M' and 'Varies with device'.

In [None]:
# Rows whose size does not end in 'k', 'M' or 'Varies with device'
df_clean[~df_clean['size'].str.contains(r'(?:k|M|Varies with device)$', regex=True, na=False)].head()

OK, the sizes are either in KB ('k'), in MB ('M') or 'Varies with device'.

First, we will replace the size values equal to 'Varies with device' with 'NaN'.

In [None]:
df_clean['size'] = df_clean['size'].replace('Varies with device', 'NaN', regex=True)

In [None]:
df_clean['size']

Now, we will convert the sizes to MB.



In [None]:
size =[]

for i in df_clean['size']:
    if i == 'NaN':
        size.append('NaN')
    elif i[-1] == 'k':
        size.append(float(i[:-1])/1000)
    else:
        size.append(float(i[:-1]))

In [None]:
df_clean['size'] = size
df_clean['size'] = df_clean['size'].astype(float)
# Rename by name rather than by position, so the cell does not depend on the column order
df_clean.rename(columns={'size': 'size(MB)'}, inplace=True)
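
The conversion loop above can also be written without explicit iteration. The following is a minimal sketch on a small, made-up Series (the sample values are assumptions, not rows taken from the dataset). Like the loop, it treats 1 MB as 1000 KB.

```python
import pandas as pd

# Hypothetical raw size strings, mimicking the formats found in the dataset.
raw = pd.Series(['19M', '8.7M', '201k', 'Varies with device'])

# Split each entry into its numeric part and its unit suffix ('k' or 'M').
parts = raw.str.extract(r'^(?P<value>[\d.]+)(?P<unit>[kM])$')

# Divisor that expresses each value in MB.
divisor = parts['unit'].map({'k': 1000.0, 'M': 1.0})

# Entries that do not match the pattern (e.g. 'Varies with device') become NaN.
size_mb = parts['value'].astype(float) / divisor
print(size_mb)
```

Because non-matching entries turn into NaN automatically, no separate replacement step for 'Varies with device' is needed in this variant.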

In [None]:
df_clean.head()

In [None]:
num_plots(df_clean,'size(MB)','App Size distribution','Size (MB)')

The distribution of app size is right-skewed with a long tail.

Mean, Median and Mode can be computed as follows:

In [None]:
print('Average app size is: ', df_clean['size(MB)'].mean())
print('Median app size is: ', df_clean['size(MB)'].median())
print('Mode app size is: ', df_clean['size(MB)'].mode()[0])

Most apps are smaller than 20 MB, while some are around 100 MB!

## App Rating analysis

In [None]:
num_plots(df_clean,'rating','App rating distribution','Rating')

In [None]:
print('Average app rating is: ', df_clean['rating'].mean())
print('Median app rating is: ', df_clean['rating'].median())
print('Mode app rating is: ', df_clean['rating'].mode()[0])

Most apps have a rating around 4.2. We can also see that some apps have a 1-star rating. Which apps are those?

In [None]:
df_clean[df_clean['rating'] <= 1.0]

In [None]:
print('Apps with rating equal or lower than 1 star: ',len(df_clean[df_clean['rating'] <= 1.0]))

## App price analysis

In [None]:
df_clean['price'].isnull().sum()

There are no missing values in the price column

In [None]:
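# regex=False: treat '$' as a literal character, not as the regex end-of-string anchor
df_clean['price'] = df_clean['price'].str.replace('$', '', regex=False).astype(float)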
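
A note on removing the `$` sign: in a regular expression `$` is the end-of-string anchor, and the default value of the `regex` argument of `Series.str.replace` has changed between pandas versions. The sketch below uses made-up prices and passes `regex=False`, so that `$` is always treated as a literal character.

```python
import pandas as pd

# Hypothetical price strings in the same format as the dataset.
prices = pd.Series(['0', '$4.99', '$399.99'])

# regex=False forces a literal match, independent of the pandas version.
as_float = prices.str.replace('$', '', regex=False).astype(float)
print(as_float)
```

The same consideration applies to other characters with a special meaning in regular expressions, such as `+`.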

In [None]:
df_clean['price'].value_counts()

We can see that most of the apps are free!<br>

In [None]:
# Round the percentage to 2 decimals (the 2 must be an argument of np.round, not of format)
print('Free apps are {}% of the total apps in the dataset'.format(np.round(len(df_clean[df_clean['price']==0])*100/len(df_clean), 2)))

In [None]:
num_plots(df_clean,'price','Price Distribution','price')

We can confirm that the great majority of apps in the store are free. However, there are some apps with a price over \$50, and some of them even cost \$400!

To better visualize the price distribution, we will separately analyze apps with a price lower than \$10 and apps with a price higher than \$10.

In [None]:
num_plots(df_clean[(df_clean['price']>0) & (df_clean['price']<10)],'price','Price Distribution of apps between 0-10$','price')

In [None]:
num_plots(df_clean[(df_clean['price']>10)],'price','Price Distribution of apps over 10$','price')

There are some apps with a price close to \$400. Let's investigate further.

In [None]:
df_clean[df_clean['price']>350]

It looks like these apps are 'meme apps'. They do not do anything. They just cost a lot of money.

What about the other apps that cost more than \$50 but less than \$350?

In [None]:
df_clean[(df_clean['price']>50) & (df_clean['price']<350)]

Among these apps there is still one meme app, 'I am rich VIP', while the others look like 'serious apps'. However, these apps do not have any reviews or ratings.

In [None]:
print('Number of apps with price higher than 50$: ', len(df_clean.loc[df_clean['price']>50]))

We will remove apps with a price of 50 dollars or more, since there are very few of them and they make the price distribution heavily right-skewed: they can be considered outliers.

In [None]:
df_clean = df_clean.loc[df_clean['price'] < 50]

What is the distribution of the paid apps?

In [None]:
num_plots(df_clean.loc[df_clean['price'] > 0],'price','Price Distribution of Paid Apps','price')

We still have a right skewed distribution, but we can see the median and the IQR in the boxplot now!

Looking at this plot, we decide to consider only apps with a price lower than \$20. We will drop apps with a price of \$20 or more.

In [None]:
df_clean = df_clean.loc[df_clean['price']<20]

## App installs

In [None]:
df_clean['installs']

We will remove the '+' from the rows and add it to the feature name!

In [None]:
# regex=False: '+' is a regex quantifier, so match it literally
df_clean['installs'] = df_clean['installs'].str.replace('+', '', regex=False).str.replace(',', '', regex=False).astype(float)

In [None]:
# Rename by name rather than by position, so the cell does not depend on the column order
df_clean.rename(columns={'installs': 'Installs(+)'}, inplace=True)

In [None]:
sns.kdeplot(x='Installs(+)', data=df_clean)

Most apps have a relatively small number of installs compared to the maximum, which is around 1e9 (a billion).

### Which are the apps with these high numbers of installs?

In [None]:
df_clean[df_clean['Installs(+)']> 0.8e9 ]

The apps with most installs are famous social network apps like Facebook, Instagram and Google apps.

In [None]:
sns.boxplot(x='Installs(+)', data=df_clean)

It is becoming difficult to visualize the box fences! We'll define a function to obtain them

In [None]:
def iqr_fence(x):
    Q1 = x.quantile(0.25)
    Q3 = x.quantile(0.75)
    IQR = Q3 - Q1
    Lower_Fence = Q1 - (1.5 * IQR)
    Upper_Fence = Q3 + (1.5 * IQR)
    u = max(x[x<Upper_Fence])
    l = min(x[x>Lower_Fence])
    return [u,l]

In [None]:
upper, lower = iqr_fence(df_clean['Installs(+)'])
print('Upper Fence:', upper)
print('Lower Fence:', lower)

We can see that the lower fence is 0 installs, while the upper fence is 1 million installs.
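
To make the fence computation concrete, here is a minimal sketch on a tiny made-up Series (the values are assumptions chosen for illustration). It computes the Tukey fences, Q1 - 1.5·IQR and Q3 + 1.5·IQR, and then, like `iqr_fence`, the most extreme observations that still fall inside them, which is where the boxplot whiskers end.

```python
import pandas as pd

# Hypothetical install counts with one extreme value.
x = pd.Series([0, 10, 100, 1000, 10000, 1000000])

q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whisker ends: the most extreme observations inside the fences.
upper_whisker = x[x < upper_fence].max()
lower_whisker = x[x > lower_fence].min()

# Observations outside the fences are the ones a boxplot draws as outliers.
outliers = x[(x < lower_fence) | (x > upper_fence)]
print(lower_whisker, upper_whisker, outliers.tolist())
```

The values reported by `iqr_fence` are therefore actual data points, not the theoretical fence positions themselves.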

In [None]:
print('Total apps', len(df_clean))
no_installs = [1e9, 1e8, 1e7, 1e6, 1e5, 1e4, 1e3, 1e2, 1e1]
for n in no_installs:
    print('Number of apps with less than ' + str(n) + ' installs:', len(df_clean.loc[df_clean['Installs(+)']<n]))

We can see that there are quite few apps with fewer than 10 or 100 installs.

# Categorical Features Cleaning and Analysis

## Category analysis

In [None]:
print('In total there are {} different app categories'.format(len(df_clean['category'].value_counts())))

In [None]:
plt.subplots(figsize=(13,5))
sns.countplot(x='category', data=df_clean, order = df_clean['category'].value_counts().index)
plt.xticks(rotation=90);
plt.xlabel('')
plt.title('App category counts');

The most popular categories are family, game and tools

### Is there any difference between the category column and genres?

In [None]:
df_clean[['category','genres']]

It looks like they are slightly different

## App type (free or paid)

In [None]:
sns.countplot(x='type', data=df_clean)
plt.title('Paid vs Free apps')
plt.xlabel('App type')
plt.show()

Most of the apps are free, as we saw during the price column analysis

## Content rating

In [None]:
sns.countplot(x='content rating', data=df_clean)
plt.title('App counts per content rating')
plt.xticks(rotation=60)
plt.show()

Most of the apps are for everyone

In [None]:
fig, ax = plt.subplots(figsize=(8,4))
ax=sns.boxplot(x='content rating', y='rating', data=df_clean)
ax.set_title('Content rating vs rating')
plt.show()

'Adults only 18+' apps are concentrated around ratings of 4-4.5, whereas the other groups show a higher variance, more outliers, and at least some apps with a rating of 5.

In [None]:
fig, ax = plt.subplots(figsize=(8,4))
ax=sns.boxplot(x='content rating', y='Installs(+)', hue='type', data=df_clean)
ax.set_title('Installs(+) per content rating by type')
ax.set_yscale('log')
plt.show()

From this plot we can see that free apps have more installs than paid apps, and that 'Everyone 10+' apps have more installs.

## App Genres column analysis

In [None]:
df_clean['genres'].value_counts()

It looks like there are some apps with multiple genres as well.

In [None]:
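# Restrict to numeric columns: recent pandas versions raise an error on string columns in corr()
sns.heatmap(df_clean.select_dtypes('number').corr(), annot=True, cmap='Blues')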
plt.title('Correlation Matrix')
plt.show()

There is a fairly high positive correlation between installs and reviews. This means that apps with more reviews tend to have more installs.

## Last updated

In [None]:
df_clean['last updated']=pd.to_datetime(df_clean['last updated'])

In [None]:
plt.figure(figsize=(10,4))
sns.histplot(x='last updated', data=df_clean)
plt.show()

Most apps have been updated recently (relative to the dataset's publication date). There are also apps that were last updated before 2014.<br>
Moreover, we extract the year from this feature, since it could be interesting to analyze.

In [None]:
df_clean['last_up_year']=df_clean['last updated'].dt.year

In [None]:
plt.figure(figsize=(10,4))
sns.histplot(x='last_up_year', data=df_clean)
plt.show()

We can better visualize the situation here. We can see that the great majority of apps have been updated recently (2018).

## Current version column analysis

In [None]:
df_clean['current ver']

In [None]:
df_clean['current ver'] = df_clean['current ver'].replace('Varies with device', 'NaN', regex=True)

In [None]:
df_clean['current ver'].value_counts()

There are a lot of possible current versions among the apps in the store.<br>
To simplify the further analysis, we will approximate the current version with just the first number of the version.

In [None]:
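# Extract the leading digits only; an unescaped '.' after the group would match any character
df_clean['current vers']=df_clean['current ver'].str.extract(r'^(\d+)', expand=False).astype(float)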
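
The effect of the extraction pattern is easiest to see on a few made-up version strings (assumed examples, not taken from the dataset). The sketch compares `r'^(\d+).'`, where the unescaped `.` matches any character, with `r'^(\d+)'`, which only anchors on the leading digits.

```python
import pandas as pd

# Hypothetical version strings, including one without any dot.
versions = pd.Series(['1.0.19', '10', '2', 'NaN'])

# The trailing '.' must consume one character after the digits.
with_dot = versions.str.extract(r'^(\d+).', expand=False).astype(float)

# Anchoring on the leading digits alone keeps the full first number.
digits_only = versions.str.extract(r'^(\d+)', expand=False).astype(float)

print(pd.DataFrame({'version': versions,
                    'with_dot': with_dot,
                    'digits_only': digits_only}))
```

With the trailing `.`, a version such as '10' is read as 1 and a single-digit version such as '2' is not matched at all, so the choice of pattern changes which rows survive later filters on the version number.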

In [None]:
df_clean['current vers'].value_counts()

It looks like some apps have very high numbers as the first number of the version. This seems odd; the authors probably did this as a joke.

In [None]:
sns.boxplot(x='current vers', data=df_clean.loc[df_clean['current vers']<100])
plt.show()

We can see that by considering only apps with a version number lower than 100, we already have lots of outliers.

In [None]:
sns.boxplot(x='current vers', data=df_clean.loc[df_clean['current vers']<10])
plt.show()

By considering only apps with version lower than 10 we can start understanding the version feature distribution

In [None]:
print('Total apps', len(df_clean))
print('Number of apps with current version lower than 1000:', len(df_clean.loc[df_clean['current vers']<1000]))
print('Number of apps with current version lower than 100:', len(df_clean.loc[df_clean['current vers']<100]))
print('Number of apps with current version lower than 10:', len(df_clean.loc[df_clean['current vers']<10]))
print('Number of apps with current version lower than 6:', len(df_clean.loc[df_clean['current vers']<6]))

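According to the boxplot, we could consider apps with a version of 6 or higher as outliers.

We drop the categorical 'current ver' column and keep only apps with a current version lower than 6. Note that this filter also removes rows whose version could not be parsed (NaN), because a comparison with NaN evaluates to False.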

In [None]:
df_clean=df_clean.drop('current ver', axis=1)

In [None]:
df_clean=df_clean.loc[df_clean['current vers']<6]

## Android version

In [None]:
df_clean['android ver']

We will remove 'and up' from every row.

In [None]:
df_clean['android vers']=df_clean['android ver'].replace('and up', '', regex=True)

In [None]:
df_clean.drop('android ver', axis=1,inplace=True)

In [None]:
df_clean['android vers'].value_counts()

We can see that this column still needs some cleaning.
In particular, we will remove 'Varies with device' and the 'W' that appears in '4.4W', and drop the few rows where the compatible Android version is a range such as '5.0 - 7.1.1'.

In [None]:
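# Blank out 'Varies with device', strip the 'W' suffix, and turn empty strings into NaN
df_clean['android vers']=df_clean['android vers'].replace('Varies with device', '', regex=True).replace('W', '', regex=True).replace('', np.nan)
# Keep only single versions: version ranges and missing values are both dropped (na=True)
df_clean=df_clean.loc[~df_clean['android vers'].str.contains('-', regex=False, na=True)]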
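
One subtlety of this filter is how missing values are handled: `str.contains` returns a missing value for a missing input, so the way the mask is built decides whether those rows are kept. A minimal sketch on made-up values (assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical versions: a plain version, a range, and a missing value.
s = pd.Series(['4.0.3', '5.0 - 7.1.1', np.nan])

# Without `na`, the result for the missing entry is itself missing.
has_range = s.str.contains('-', regex=False)

# Comparing with False drops both the range and the missing entry.
kept_implicit = s[has_range == False]

# The same outcome, stated explicitly through the `na` argument.
kept_explicit = s[~s.str.contains('-', regex=False, na=True)]

# With na=False the missing entry would be kept instead.
kept_with_nan = s[~s.str.contains('-', regex=False, na=False)]
print(len(kept_implicit), len(kept_explicit), len(kept_with_nan))
```

Dropping the missing entries at this point is what allows the later string slicing on this column to run without errors.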

In [None]:
df_clean['android vers']=df_clean['android vers'].str.strip()

In [None]:
df_clean['android vers'].value_counts()

We can substitute subversions such as 4.0.3, 2.3.3 and 2.0.1 with 4.0, 2.3 and 2.0 respectively.

In [None]:
df_clean['android vers'] = df_clean['android vers'].apply(lambda x: x[:3])

In [None]:
sns.histplot(x='android vers', data=df_clean)
plt.xticks(rotation=90);
plt.title('Android Version Distribution over apps')
plt.show()

Moreover, we can group these Android versions by their main number: for example, versions 4.0.3 and 4.1 will both be labeled as '4', and so on for the other versions.

In [None]:
# Extract the leading digits as the main (major) Android version
df_clean['android vers_main']=df_clean['android vers'].str.extract(r'^(\d+)', expand=False).astype(float)

In [None]:
df_clean['android vers_main'].value_counts()

In [None]:
sns.histplot(x='android vers_main', data=df_clean)
plt.xticks(rotation=90);
plt.title('Main Android Version by apps')

In [None]:
# Keep only apps whose main Android version is 3 or higher
df_clean = df_clean.loc[(df_clean['android vers_main'] >= 3)]

In [None]:
df_clean.info()

We can also change the type of 'android vers' to float64:

In [None]:
df_clean['android vers'] = df_clean['android vers'].astype(float)

There are some rows with missing values for rating and size. We can impute these quantities with KNNImputer later.

In the following, we will try to answer some questions by analyzing the data.

# Q1 Do expensive apps have higher rating?

In [None]:
sns.regplot(x='price', y='rating', data=df_clean)
plt.title('Price VS Rating')
plt.show()

From this plot we can see a slight positive trend between price and rating: apps with higher prices tend to be rated slightly higher.

# Q2 Do apps with high rating have more reviews?

In [None]:
sns.regplot(y='rating', x='reviews', data=df_clean)
plt.title('No. Reviews VS Rating')
plt.show()

We can see a positive trend between rating and number of reviews: apps with more reviews tend to have a higher rating.

# Q3 Which category has more reviews?

In [None]:
plt.figure(figsize=(14,4))
sns.boxplot(x='category', y='reviews', data=df_clean, hue='type')
plt.yscale('log')
plt.ylabel('')
plt.xticks(rotation=90);
plt.title('No. Reviews (log) vs Category')
plt.show()

We can see that apps in the game, entertainment, education and photography categories have more reviews than apps in other categories.<br>
In particular, free apps seem to have more reviews than paid apps in most categories, with the exception of business and weather, where paid apps have more reviews overall.

# Q4 Which category has higher rating?

In [None]:
plt.figure(figsize=(14,4))
sns.boxplot(x='category', y='rating', data=df_clean, hue='type')
plt.xticks(rotation=90);
plt.title('Rating vs Category')
plt.ylabel('')
plt.show()

From this plot we can see that in most categories paid apps have a higher rating than free apps. It is also interesting to notice that free apps have many more outlier values than paid apps.

# Q5 Is there any relationship between the category and app size?

In [None]:
plt.figure(figsize=(14,4))
sns.boxplot(x='category', y='size(MB)', data=df_clean, hue='type')
plt.xticks(rotation=90);
plt.title('size vs category')
plt.show()

We can see that the categories with the largest apps are 'game', 'travel and local' (paid apps only), 'education' (paid apps only) and 'family'.<br>
In particular, free apps seem to be larger than paid apps in almost all categories.

# Q6 Is there any relationship between app rating and size?

In [None]:
sns.regplot(x='rating', y='size(MB)', data=df_clean)

We can see that apps with higher ratings span a wider range of sizes than apps with lower ratings (<3.0), whose size is almost always under 40 MB.

# Q7 Is there any relationship between No. Installs and Reviews?

In [None]:
sns.regplot(x='Installs(+)',y='reviews', data=df_clean)
plt.yscale('log')
plt.xscale('log')

From this plot we can see that apps with more installs tend to have more reviews.

# App rating prediction

In [None]:
df_clean.info()

In [None]:
plt.figure(figsize=(8,4))
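# Restrict to numeric columns: recent pandas versions raise an error on string columns in corr()
sns.heatmap(df_clean.select_dtypes('number').corr(), cmap='Blues', annot=True)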
plt.title('Correlation Matrix')

In [None]:
df1=df_clean.copy()

In [None]:
df1=df1.drop(['app','last updated'], axis=1)

In [None]:
df1.info()

In [None]:
df1.head()

To improve the ML algorithm prediction performance, we will log-transform the column 'Installs(+)', in order to make it more 'Normal'.
In particular, we should transform it with log(x+1) transform, since there are apps with 0 installs.

In [None]:
df1.describe()

In [None]:
# log1p(x) computes log(1 + x), which is defined for apps with 0 installs
df1['Installs(+)']=np.log1p(df_clean['Installs(+)'])
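
A minimal sketch of the log(x+1) transform on made-up install counts (the values are assumptions for illustration). It uses NumPy's `np.log1p`, which computes log(1 + x), and `np.expm1`, its inverse, which is handy when a transformed value has to be mapped back to an install count.

```python
import numpy as np

# Hypothetical install counts spanning many orders of magnitude, including 0.
installs = np.array([0, 10, 1000, 1_000_000, 1_000_000_000], dtype=float)

# log(1 + x): maps 0 to 0 and compresses the long right tail.
log_installs = np.log1p(installs)

# expm1 undoes the transform: exp(y) - 1.
recovered = np.expm1(log_installs)
print(log_installs)
```

After the transform, the largest value is only about 21, compared to a billion on the original scale.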

In [None]:
sns.displot(df1['Installs(+)'])

We can clearly see the benefit of the log transform: the distribution now looks more like a normal distribution.

Moreover, we can drop the column 'android vers_main' and keep 'android vers' only.

In [None]:
df1.drop('android vers_main', axis = 1, inplace=True)

## Categorical features Encoding

First, we replace Free and Paid with 0 and 1, respectively, in the type column.

In [None]:
df1['type'] = df1['type'].replace({'Free':0, "Paid":1})

In [None]:
df1.info()

Category, Content Rating and Genres will be encoded with a label encoder, since one-hot encoding (OHE) would create too many columns.

In [None]:
en = LabelEncoder()
catCols =  ['category','content rating','genres']
for cols in catCols:
    df1[cols] = en.fit_transform(df1[cols])
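
To illustrate the trade-off between the two encodings, here is a minimal sketch on a made-up genres column (the values are assumptions). Label encoding produces a single integer column, while one-hot encoding, done here with `pd.get_dummies`, produces one column per distinct value.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical genres column with three distinct values.
toy = pd.DataFrame({'genres': ['Art & Design', 'Tools',
                               'Art & Design;Creativity', 'Tools']})

# Label encoding: integer codes assigned in sorted order of the labels.
codes = LabelEncoder().fit_transform(toy['genres'])

# One-hot encoding: one indicator column per distinct genre.
one_hot = pd.get_dummies(toy['genres'])
print(codes, one_hot.shape)
```

The integer codes impose an arbitrary ordering on the categories. Tree-based models can usually cope with that, but it is worth keeping in mind when interpreting the results.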

In [None]:
df1.info()

Finally, we can impute the missing values in rating and size with KNNImputer!

In [None]:
imputer = KNNImputer(n_neighbors=3)
df1 = pd.DataFrame(imputer.fit_transform(df1),columns = df1.columns)
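
How `KNNImputer` fills a gap can be seen on a tiny made-up table (values are assumptions; `n_neighbors=1` is used here only to make the result easy to verify by hand, whereas the cell above uses 3). A missing entry is replaced using the rows that are closest on the features that are present.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical data: the second app has no rating.
toy = pd.DataFrame({
    'reviews': [100.0, 110.0, 5000.0, 5200.0],
    'rating': [4.0, np.nan, 3.0, 3.2],
})

# With one neighbor, the missing rating is copied from the closest row by 'reviews'.
toy_imputer = KNNImputer(n_neighbors=1)
filled = pd.DataFrame(toy_imputer.fit_transform(toy), columns=toy.columns)
print(filled)
```

Since distances are computed on the raw feature values, features with large ranges dominate the choice of neighbors unless the data is scaled first.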

In [None]:
df1.info()

In [None]:
df1.head()

In [None]:
X=df1.drop('rating', axis = 1).values

In [None]:
y=df1['rating'].values

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

In [None]:
scaler = MinMaxScaler()

In [None]:
scaler.fit(X_train)

In [None]:
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)            
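
The scaler is fitted on the training set only and then applied to both sets. A minimal sketch with made-up one-feature arrays (the names `X_train_toy`, `X_test_toy` and `scaler_toy` are hypothetical) shows what this implies: test values are scaled with the training minimum and maximum, so they can fall outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature training and test sets.
X_train_toy = np.array([[0.0], [5.0], [10.0]])
X_test_toy = np.array([[2.5], [12.0]])

# Learn min and max from the training data only.
scaler_toy = MinMaxScaler().fit(X_train_toy)

train_scaled = scaler_toy.transform(X_train_toy)
test_scaled = scaler_toy.transform(X_test_toy)
print(test_scaled.ravel())
```

Fitting on the training data alone keeps information about the test set out of the preprocessing step.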

In [None]:
rf = RandomForestRegressor()

In [None]:
rf.fit(X_train,y_train)

In [None]:
y_pred_rf = rf.predict(X_test)

In [None]:
mse_rf = mean_squared_error(y_test, y_pred_rf)
print("RMSE using RF: ", np.sqrt(mse_rf))
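
For reference, RMSE is the square root of the mean of the squared prediction errors, so it is expressed in the same unit as the rating. A minimal sketch with made-up values (assumptions for illustration) that computes it by hand and compares it with the scikit-learn route used above:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true and predicted ratings.
y_true_toy = np.array([4.0, 3.5, 5.0])
y_pred_toy = np.array([4.2, 3.0, 4.6])

# By hand: square the errors, average them, take the square root.
rmse_manual = np.sqrt(np.mean((y_true_toy - y_pred_toy) ** 2))

# Same quantity via scikit-learn (arguments are y_true, then y_pred).
rmse_sklearn = np.sqrt(mean_squared_error(y_true_toy, y_pred_toy))
print(rmse_manual, rmse_sklearn)
```

A useful sanity check is to compare the model's RMSE with that of a constant prediction, such as the mean training rating.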

In [None]:
feature_name_list=df1.drop('rating', axis = 1).columns

In [None]:
rf.feature_names = feature_name_list

In [None]:
plt.barh(rf.feature_names,rf.feature_importances_)
plt.xticks(rotation=90);
plt.title('Feature Importance by Random Forest')
plt.xlabel('Feature Importance (fraction of total)')
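
A note on the scale of these values: the `feature_importances_` of a scikit-learn random forest are normalized so that they sum to 1, so they are fractions rather than percentages. A minimal sketch on synthetic data (generated here purely for illustration) where only the first feature drives the target:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends almost entirely on feature 0.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = 3 * X_toy[:, 0] + 0.1 * rng.normal(size=200)

model = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_toy, y_toy)

# One value per feature; together they add up to 1.
importances = model.feature_importances_
print(importances, importances.sum())
```

Multiplying the values by 100 would turn them into percentages if that scale is preferred for the plot.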

## XGBoost

In [None]:
xgb = XGBRegressor(n_estimators=2000, learning_rate=0.01)
xgb.fit(X_train, y_train)
y_pred_xgb = xgb.predict(X_test)
# Same argument order as for RF: true values first, predictions second
mse_xgb = mean_squared_error(y_test, y_pred_xgb)

# Rating prediction Summary

In [None]:
xgb.feature_names = feature_name_list

In [None]:
print(r"RMSE with RF: {:.3f}".format(np.sqrt(mse_rf)))
print(r"RMSE with XGBoost: {:.3f}".format(np.sqrt(mse_xgb)))

In [None]:
plt.barh(rf.feature_names,rf.feature_importances_, alpha=0.4, label='RF', color='red')
plt.barh(xgb.feature_names,xgb.feature_importances_, alpha=0.4, label='XGBoost', color='blue')
plt.legend(loc='upper right');
plt.title('Feature Importance to predict Rating by ML models')
plt.xlabel('Feature Importance (fraction of total)')
plt.show()

We can see that RF and XGBoost gave similar results in terms of RMSE.<br>
Regarding feature importance, we can see that RF gave more importance to reviews and size, while XGBoost spread the importance more evenly across all the features.