# Sentiment-analysis-on-Google-Play-store-apps-reviews

Perform sentiment analysis on reviews of Android apps on the Google Play Store.

Suppose you want to classify user reviews as positive or negative. Sentiment analysis is a common task for data scientists. This is a simple guide to building a Google Play Store review classifier (sentiment analysis) in Python, using a Naive Bayes classifier and scikit-learn.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
import joblib
from sklearn.feature_extraction.text import CountVectorizer

### Data Overview
Reviews of 23 popular mobile apps were scraped. To create the dataset, each review was manually labelled as a positive or negative example.

In [2]:
data = pd.read_csv('google_play_store_apps_reviews_training.csv')

In [3]:
data.head()


| Unnamed: 0 | package_name | review | polarity |
|---|---|---|---|
| 0 | com.facebook.katana | privacy at least put some option appear offli... | 0 |
| 1 | com.facebook.katana | messenger issues ever since the last update, ... | 0 |
| 2 | com.facebook.katana | profile any time my wife or anybody has more ... | 0 |
| 3 | com.facebook.katana | the new features suck for those of us who don... | 0 |
| 4 | com.facebook.katana | forced reload on uploading pic on replying co... | 0 |


### Pre-process Data
We drop the package name because it is not relevant to the sentiment of a review. Then we strip surrounding whitespace from the review text and convert it to lowercase. This is the data pre-processing stage.

In [4]:
def preprocess_data(data):
    # Remove package name as it's not relevant
    data = data.drop('package_name', axis=1)
    
    # Strip surrounding whitespace and convert text to lowercase
    data['review'] = data['review'].str.strip().str.lower()
    return data

In [5]:
data = preprocess_data(data)

Note: There are many different and more sophisticated ways to clean text data, and they would likely produce better results than what I did here. I kept the cleaning minimal to make this tutorial as easy as possible. I also generally think it's best to get baseline predictions with the simplest solution possible before spending time on transformations that may turn out to be unnecessary.
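
As an illustration of what slightly heavier cleaning could look like, here is a minimal sketch using regular expressions. The function name `clean_review` and the specific rules (dropping URLs, punctuation, and repeated whitespace) are assumptions made for this example and are not part of the tutorial's pipeline. Note that the character filter would also strip non-English letters.

```python
import re

import pandas as pd


def clean_review(text):
    """Hypothetical cleaning helper; the rules below are illustrative only."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # drop punctuation and symbols
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace


# Made-up sample reviews, not taken from the dataset
sample = pd.Series([
    "  Great app!!! Visit https://example.com :)  ",
    "Crashes... every\ttime",
])
cleaned = sample.apply(clean_review)
print(cleaned.tolist())
```

Whether rules like these actually help should be checked against the baseline score rather than assumed.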

### Splitting Data
First, separate the columns into independent and dependent variables (features and labels). Then split both into training and test sets, holding out 25% of the reviews for testing and stratifying by label.

In [6]:
# Split into training and testing data
x = data['review']
y = data['polarity']
x, x_test, y, y_test = train_test_split(x,y, stratify=y, test_size=0.25, random_state=42)
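
To see what `stratify` does in the split above, here is a small sketch on made-up data. The toy DataFrame and its 8-to-4 class balance are assumptions for illustration. The point is that both subsets keep the same ratio of negative to positive labels.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 negative (0) and 4 positive (1) reviews
toy = pd.DataFrame({
    "review": [f"review {i}" for i in range(12)],
    "polarity": [0] * 8 + [1] * 4,
})

toy_x_train, toy_x_test, toy_y_train, toy_y_test = train_test_split(
    toy["review"], toy["polarity"],
    stratify=toy["polarity"], test_size=0.25, random_state=42,
)

# Count how many examples of each class landed in each subset
train_counts = toy_y_train.value_counts().to_dict()
test_counts = toy_y_test.value_counts().to_dict()
print(train_counts, test_counts)
```

With 12 rows and `test_size=0.25`, the test set gets 3 rows, and the 2:1 class ratio is preserved in both parts.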

In [7]:
# Vectorize text reviews to numbers
vec = CountVectorizer(stop_words='english')
x = vec.fit_transform(x).toarray()
x_test = vec.transform(x_test).toarray()

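Vectorization: Machine learning algorithms operate on numbers, so each review must be converted to a numerical representation. This conversion is called vectorization. Here, `CountVectorizer` counts how often each vocabulary word occurs in a review, ignoring English stop words. The vectorizer is fitted on the training reviews only and then applied unchanged to the test reviews.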
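
A minimal sketch of what the vectorizer produces, using two made-up sentences (the sentences and the `toy_` names are assumptions for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical reviews
docs = ["love this app", "hate this app hate the ads"]

toy_vec = CountVectorizer(stop_words="english")
counts = toy_vec.fit_transform(docs).toarray()

# Vocabulary words ordered by their column index in the count matrix
vocab = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)

print(vocab)   # ['ads', 'app', 'hate', 'love']
print(counts)  # one row per review, one column per vocabulary word
```

The stop words "this" and "the" are dropped, and the second row holds a 2 in the "hate" column because that word appears twice.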

### Model Generation
After splitting the data and vectorizing the text reviews into numbers, we fit a multinomial Naive Bayes model on the training set and predict on the test set features.

In [8]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(x, y)

MultinomialNB()

After the model has been fitted, check it by comparing the predicted labels with the actual labels of the test set. This model is about 85.7% accurate.

In [9]:
model.score(x_test, y_test)


0.8565022421524664
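
Accuracy is a single number. A confusion matrix additionally shows which kind of mistakes the model makes. The sketch below uses a tiny made-up corpus, so all texts, labels, and `toy_` names are assumptions, not the tutorial's data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data, not taken from the real dataset
train_texts = ["great app love it", "awesome and useful",
               "terrible app crashes", "awful waste of money"]
train_labels = [1, 1, 0, 0]
test_texts = ["love this awesome app", "terrible app crashes constantly",
              "love the idea but app crashes"]
test_labels = [1, 0, 0]

toy_vec = CountVectorizer(stop_words="english")
toy_model = MultinomialNB().fit(toy_vec.fit_transform(train_texts), train_labels)
predicted = toy_model.predict(toy_vec.transform(test_texts))

# Rows are actual classes (0, 1); columns are predicted classes (0, 1)
cm = confusion_matrix(test_labels, predicted)
acc = accuracy_score(test_labels, predicted)
print(cm)
print(acc)
```

In this toy run the mixed review is predicted as positive, so it appears off the diagonal as a false positive. On the real data, the same two calls can be applied to `y_test` and `model.predict(x_test)`.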

In [10]:
model.predict(vec.transform(['Love this app simply awesome!']))

array([1], dtype=int64)
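
The notebook imports `joblib` but never uses it. One plausible use, shown here only as an assumption about the intent, is saving the fitted vectorizer and model so they can be reused without retraining. The sketch bundles both steps into a scikit-learn pipeline trained on made-up data. The file name `review_classifier.joblib` is hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the real training reviews
texts = ["great app love it", "awesome and useful",
         "terrible app crashes", "awful waste of money"]
labels = [1, 1, 0, 0]

# Bundle vectorizer and classifier so raw strings can be passed to predict()
pipeline = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
pipeline.fit(texts, labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "review_classifier.joblib")  # hypothetical name
    joblib.dump(pipeline, model_path)   # write the fitted pipeline to disk
    restored = joblib.load(model_path)  # read it back as a new object

prediction = restored.predict(["Love this app simply awesome!"])
print(prediction)
```

Saving the pipeline as one object keeps the vocabulary and the classifier in sync, which is easy to get wrong when they are stored separately.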

This is a very simple classifier that reaches a pretty decent accuracy of about 85.7% out of the box.