# Google Play Store Apps Data
This dataset consists of web-scraped data on more than 10,000 Google Play Store apps and 60,000 app reviews. `apps_data.csv` describes the apps, including category, number of installs, and price. `review_data.csv` holds reviews of the apps, including the review text and sentiment scores. You can join the two tables on the `App` column.

Not sure where to begin? Scroll to the bottom to find challenges!

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Importing Data
apps = pd.read_csv("apps_data.csv")
reviews = pd.read_csv('review_data.csv')
games_df = apps.merge(reviews, on='App', how='outer')

# Functions
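def convert_to_millions(value):
    # Convert a size string such as '19M' or '512k' to megabytes (float).
    # Missing values and 'Varies with device' become NaN instead of raising.
    if pd.isna(value) or value == 'Varies with device':
        return np.nan
    if value[-1] == 'k':
        return float(value[:-1]) / 1000
    if value[-1] == 'M':
        return float(value[:-1])
    return float(value)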

# Cleaning Data
na_threshold = len(games_df) * 0.05
drop_cols = games_df.columns[games_df.isna().sum() <= na_threshold]
games_df.dropna(subset=drop_cols, inplace=True)
games_df.drop(['Current Ver','Android Ver','Translated_Review'], axis=1, inplace=True)
games_df = games_df.drop_duplicates()  # assign the result; drop_duplicates() is not in-place by default

games_df['Category'] = games_df['Category'].str.replace('_', ' ')
games_df['Category'] = games_df['Category'].str.title()
games_df['Category'] = games_df['Category'].astype('category')

games_df['Size'] = games_df['Size'].apply(convert_to_millions)
games_df['Size'] = games_df['Size'].astype(float)

cat_list = ['Installs','Type','Content Rating']
games_df[cat_list] = games_df[cat_list].astype('category')

games_df['Price'] = games_df['Price'].str.strip('$')
games_df['Price'] = games_df['Price'].astype(float)

games_df['Genres'] = games_df['Genres'].str.replace('_',' ')
games_df['Genres'] = games_df['Genres'].str.replace(';',' ')
games_df['Genres'] = games_df['Genres'].astype('category')

games_df['Last Updated'] = pd.to_datetime(games_df['Last Updated'])  # infer_datetime_format is deprecated in pandas 2.x
games_df['Last Updated (Year)'] = games_df['Last Updated'].dt.year
games_df['Last Updated (Month)'] = games_df['Last Updated'].dt.strftime('%B')
games_df.drop('Last Updated', axis=1, inplace=True)

# Imputation
cat_group = games_df.groupby('Category')
mean_imputer = lambda x: x.fillna(x.mean().round(2))

games_df['Size'] = cat_group['Size'].transform(mean_imputer)
games_df['Size'] = games_df['Size'].astype(float)
games_df['Size'] = np.log(games_df['Size'])

games_df['Sentiment_Polarity'] = cat_group['Sentiment_Polarity'].transform(mean_imputer)
games_df['Sentiment_Polarity'] = games_df['Sentiment_Polarity'].astype(float)

games_df['Sentiment_Subjectivity'] = cat_group['Sentiment_Subjectivity'].transform(mean_imputer)
games_df['Sentiment_Subjectivity'] = games_df['Sentiment_Subjectivity'].astype(float)

games_df['Sentiment'] = cat_group['Sentiment'].transform(lambda x: x.fillna(x.mode().iloc[0]))
games_df['Sentiment'] = games_df['Sentiment'].astype('category')

# Machine Learning
X = games_df[['Size','Sentiment_Polarity']].values
y = games_df['Category'].values
X_new = np.array([[2.87,0.26],[3.21,0.45],[2.61,-0.02],[1.98,0.5]])
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3)

rfc = RandomForestClassifier()
param_grid = {
    'n_estimators': [100, 200, 300],  # Number of trees in the forest
    'max_depth': [10, 20, 30],  # Maximum depth of trees
    'min_samples_split': [2, 5, 10],  # Minimum samples required to split
    'min_samples_leaf': [1, 2, 4],  # Minimum samples required at a leaf node
}
grid_search = GridSearchCV(rfc, param_grid=param_grid, n_jobs=-1,verbose=1, cv=5, scoring='accuracy')
grid_search.fit(X_train,y_train)

rfc_estimator = grid_search.best_estimator_ # RandomForestClassifier(max_depth=30, n_estimators=300)
y_pred = rfc_estimator.predict(X_test)
print(accuracy_score(y_true=y_test,y_pred=y_pred))

Fitting 5 folds for each of 81 candidates, totalling 405 fits
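
The cell above defines `X_new` but never passes it to the fitted model. Below is a minimal sketch of how those four points could be scored. It assumes a small forest trained on synthetic data as a stand-in for `grid_search.best_estimator_`; the names `X_demo`, `y_demo`, `demo_model` and the two category labels are hypothetical and not part of the original notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real features (log size, sentiment polarity).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[2.5, 0.2], scale=[0.5, 0.2], size=(200, 2))
# Hypothetical labels that depend only on the first feature.
y_demo = np.where(X_demo[:, 0] > 2.5, "Game", "Tools")

# Small forest with a fixed seed; in the notebook this role is played
# by grid_search.best_estimator_.
demo_model = RandomForestClassifier(n_estimators=20, random_state=0)
demo_model.fit(X_demo, y_demo)

# The same four unseen points as X_new in the cell above.
X_new = np.array([[2.87, 0.26], [3.21, 0.45], [2.61, -0.02], [1.98, 0.5]])
new_pred = demo_model.predict(X_new)         # one predicted label per row
new_proba = demo_model.predict_proba(X_new)  # one probability per class per row

print(new_pred)
print(new_proba.round(2))
```

In the notebook itself, the equivalent call would presumably be `rfc_estimator.predict(X_new)`, since `X_new` uses the same two columns as `X`.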


## Data Dictionary

**apps_data.csv**

| variable       | class     | description                                                                  |
|:---------------|:----------|:-----------------------------------------------------------------------------|
| App            | character | The application name                                                         |
| Category       | character | The category the app belongs to                                              |
| Rating         | numeric   | Overall user rating of the app                                               |
| Reviews        | numeric   | Number of user reviews for the app                                           |
| Size           | character | The size of the app                                                          |
| Installs       | character | Number of user installs for the app                                          |
| Type           | character | Either "Paid" or "Free"                                                      |
| Price          | character | Price of the app                                                             |
| Content Rating | character | The age group the app is targeted at - "Children" / "Mature 21+" / "Adult"   |
| Genres         | character | Possibly multiple genres the app belongs to                                  |
| Last Updated   | character | The date the app was last updated                                            |
| Current Ver    | character | The current version of the app                                               |
| Android Ver    | character | The Android version needed for this app                                      |

**review_data.csv**

| variable               | class        | description                                           |
|:-----------------------|:-------------|:------------------------------------------------------|
| App                    | character    | The application name                                  |
| Translated_Review      | character    | User review (translated to English)                   |
| Sentiment              | character    | The sentiment of the user - Positive/Negative/Neutral |
| Sentiment_Polarity     | numeric      | The sentiment polarity score                          |
| Sentiment_Subjectivity | numeric      | The sentiment subjectivity score                      |

[Source](https://www.kaggle.com/lava18/google-play-store-apps) of dataset.
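
The data dictionary lists `Size`, `Installs`, and `Price` as character columns, so they need parsing before any numeric analysis. The sketch below shows one way to do that on a tiny made-up sample. The exact string formats (`"10,000+"`, `"$4.99"`, `"19M"`) are assumptions about the raw file, and `raw`, `clean`, and `size_to_mb` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical sample that mimics the assumed raw string formats.
raw = pd.DataFrame({
    "Installs": ["10,000+", "500,000+", "1,000,000+"],
    "Price": ["0", "$4.99", "$0.99"],
    "Size": ["19M", "512k", "Varies with device"],
})

def size_to_mb(value):
    """Convert '19M' or '512k' to megabytes; anything unparseable becomes NaN."""
    if pd.isna(value) or value == "Varies with device":
        return np.nan
    if value.endswith("k"):
        return float(value[:-1]) / 1000
    if value.endswith("M"):
        return float(value[:-1])
    return np.nan

clean = pd.DataFrame({
    # Strip thousands separators and the trailing '+' before casting.
    "Installs": raw["Installs"].str.replace(r"[,+]", "", regex=True).astype(int),
    # Remove the leading currency symbol, if present.
    "Price": raw["Price"].str.lstrip("$").astype(float),
    "Size_MB": raw["Size"].map(size_to_mb),
})

print(clean)
```

Parsing `Installs` to an integer keeps its ordering, which is lost when the column is cast to an unordered category.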

## Don't know where to start?

**Challenges are brief tasks designed to help you practice specific skills:**

- 🗺️ **Explore**: Which categories get the highest reviews from amongst the 10 most popular categories?
- 📊 **Visualize**: Create a plot visualizing the distribution of sentiment polarity, split by content rating.
- 🔎 **Analyze**: What impact does the content rating an app receives have on its sentiment and rating?
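
As a starting point for the Explore challenge, the sketch below ranks categories by popularity and then compares their average ratings. It assumes "popular" means total installs and that `Installs` has already been parsed to integers. The frame `apps_demo` and its values are made up for illustration, and `TOP_N` is set to 2 only because the sample has three categories.

```python
import pandas as pd

# Hypothetical sample; the real data would come from apps_data.csv.
apps_demo = pd.DataFrame({
    "Category": ["GAME", "GAME", "TOOLS", "TOOLS", "FAMILY", "FAMILY"],
    "Installs": [1_000_000, 500_000, 100_000, 50_000, 10_000, 5_000],
    "Rating": [4.5, 4.1, 4.4, 4.6, 4.8, 4.6],
})

TOP_N = 2  # use 10 on the full dataset

# Aggregate installs and ratings per category.
summary = apps_demo.groupby("Category").agg(
    total_installs=("Installs", "sum"),
    mean_rating=("Rating", "mean"),
)

# Keep the most-installed categories, then order them by average rating.
top_by_rating = (
    summary.nlargest(TOP_N, "total_installs")
    .sort_values("mean_rating", ascending=False)
)

print(top_by_rating)
```

Other definitions of popularity, such as the number of apps per category, would change which categories make the cut, so it is worth stating the choice in the write-up.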

**Scenarios are broader questions to help you develop an end-to-end project for your portfolio:**

You are working for an app developer. They are in the process of brainstorming a new app. They want to ensure that their next app scores a high review on the app store, as this can lead to the app being featured on the store homepage. They would like you to analyze what factors increase the rating an app will receive. They would also like to know what impact reviews have on the final score.

You will need to prepare a report that is accessible to a broad audience. It should outline your motivation, steps, findings, and conclusions.
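
One possible first step for the scenario is to check how strongly each numeric factor correlates with `Rating`. The sketch below does this on a small made-up frame. The values are hypothetical, and it assumes `Sentiment_Polarity` has already been averaged per app and joined on `App`.

```python
import pandas as pd

# Hypothetical per-app table after cleaning and joining the review scores.
demo = pd.DataFrame({
    "Rating": [4.5, 4.0, 3.5, 4.8, 3.0],
    "Reviews": [1200, 300, 50, 5000, 10],
    "Price": [0.0, 0.99, 4.99, 0.0, 9.99],
    "Sentiment_Polarity": [0.4, 0.2, 0.0, 0.5, -0.1],
})

# Pearson correlation of every other column with Rating, strongest first.
rating_corr = (
    demo.corr()["Rating"]
    .drop("Rating")
    .sort_values(ascending=False)
)

print(rating_corr)
```

Correlation only describes linear association and does not show that a factor causes higher ratings, so a report would need to treat these numbers as leads for further analysis.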