### IMPORT DEPENDENCIES

Import the modules needed for this project:
- numpy for linear algebra and numerical transformations.
- pandas for data processing and I/O.
- matplotlib for data visualization.
- seaborn for more advanced data visualizations.
- scikit-learn for modeling and transformations.
- Beautiful Soup for text parsing.
- nltk for tokenization and stopword lists.

In [1]:
import pandas as pd
import numpy as np
import nltk
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from bs4 import BeautifulSoup

%matplotlib inline


import warnings

warnings.filterwarnings("ignore")

In [None]:
data = pd.read_csv('googleplaystore_user_reviews.csv')
data.head(3)

In [None]:
data.shape

### DATA PROCESSING

OBJECTIVES:
- find missing values.
- drop missing values.
- encode labels manually and add to new column "Target".
- keep only the necessary columns, i.e. "Translated_Review" and "Target".

In [None]:
data.isnull().sum()

Many values are missing, but with 64k+ entries in total we can afford to drop the incomplete rows.

In [None]:
data = data.dropna()
data.head(3)

In [None]:
data.shape

We still have 37k+ entries to work with, which is plenty.

LABEL ENCODING:
We convert the ‘Sentiment’ column to integer scores from 0 to 2, where 0 means a negative review, 1 means a neutral review and 2 means a positive review. This is ordinary label encoding, but instead of a built-in encoder we write a small function, ‘to_target’, and apply it to every row to fill the new ‘Target’ column.

In [None]:
#Encode labels manually

def to_target(Sentiment):
  Sentiment = str(Sentiment)

  if Sentiment == "Positive":
    return 2
  elif Sentiment == "Neutral":
    return 1
  else:
    return 0
    
data['Target'] = data.Sentiment.apply(to_target)
data.head(3)
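
The same encoding can also be written as a dictionary lookup. The sketch below uses a tiny made-up DataFrame (`toy`) and a hypothetical `label_map` rather than the real reviews file. One difference worth knowing: `Series.map` yields NaN for any label missing from the dictionary, whereas ‘to_target’ treats every unrecognized label as negative.

```python
import pandas as pd

# Hypothetical stand-in for the real data; only the column name matches.
toy = pd.DataFrame({"Sentiment": ["Positive", "Neutral", "Negative", "Positive"]})

# Explicit label -> integer mapping, same scheme as to_target.
label_map = {"Negative": 0, "Neutral": 1, "Positive": 2}
toy["Target"] = toy["Sentiment"].map(label_map)

print(toy["Target"].tolist())
```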

In [None]:
#Visualize label

sns.countplot(x=data["Target"])  # pass the column as x; recent seaborn versions expect keyword arguments
plt.xlabel('Reviews', color = 'red')
plt.ylabel('Count', color = 'red')
plt.xticks([0,1,2],['Negative','Neutral','Positive'])
plt.title('COUNT PLOT', color = 'r')
plt.show()

In [None]:
#Create final dataset

final_dataset = data[['Translated_Review','Target']].copy()  # copy so new columns can be added safely later
final_dataset.head(3)

In [None]:
final_dataset["Target"].value_counts()

In [None]:
final_dataset.shape

OBSERVATION: Printing the shape of ‘final_dataset’ shows 37,427 rows and only 2 columns. The value counts show 23,998 positive reviews, 8,271 negative reviews and 5,158 neutral reviews. The difference between the classes is very large, so a model trained directly on this data is likely to be biased toward the positive class.

Therefore, we select only a subset of entries from final_dataset to reduce this imbalance. After several trials, I found 5000 reviews per class to work well. The idea is to create new variables ‘data_p’ and ‘data_n’ holding 5000 randomly chosen positive and negative reviews respectively. The cells below, currently commented out, are a draft of this step.

In [None]:
# datap = []
# datan = []
# dataneu = []

# for i in final_dataset['Target']:
#     if i == 2:                              
#         datap.append(i)
#     if i == 1:
#         dataneu.append(i)
#     if i == 0:
#         datan.append(i)

In [None]:
# datap = pd.DataFrame(datap)
# datan = pd.DataFrame(datan)
# dataneu = pd.DataFrame(dataneu)

In [None]:
# data_p = datap.iloc[np.random.randint(1,23998,5000), :]
# data_n = datan.iloc[np.random.randint(1, 8271,5000), :]
# data_neu = dataneu.iloc[np.random.randint(1, 5158,5000), :]

# len(data_n), len(data_p), len(data_neu)

In [None]:
# data = pd.concat([data_p, data_n, data_neu])
# len(data)
# data

In [None]:
# final_dataset['Target'].append(data)

# sns.countplot(final_dataset['Target'])
# plt.show()
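
The balanced sampling described above could be sketched with pandas' grouped sampling instead of manual loops. This is only an illustration on made-up data: `toy` and `n_per_class` are hypothetical names, and the per-class size here is simply the size of the smallest class rather than the 5000 chosen in the text. Unlike the commented-out draft, it keeps the review text together with the label and samples without replacement.

```python
import pandas as pd

# Hypothetical imbalanced data standing in for final_dataset.
toy = pd.DataFrame({
    "Review": [f"review {i}" for i in range(12)],
    "Target": [2] * 7 + [0] * 3 + [1] * 2,
})

# Sample the same number of rows from every class, without replacement.
n_per_class = toy["Target"].value_counts().min()
balanced = (
    toy.groupby("Target", group_keys=False)
       .sample(n=n_per_class, random_state=42)
       .reset_index(drop=True)
)

print(balanced["Target"].value_counts().to_dict())
```

Fixing `random_state` makes the sample reproducible; leaving it out gives a different subset on every run.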

### TOKENIZATION

Looking at the data, we find a few HTML tags, since the data was originally fetched from real e-commerce sites. These tags are not needed for sentiment analysis, so we remove them. BeautifulSoup with the ‘html.parser’ backend strips the unwanted tags from the reviews easily. I create a new column named ‘Review’ that stores the parsed text and drop the column named ‘Translated_Review’ to avoid redundancy. The parsing is done by a function named ‘strip_html’.

In [None]:
def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    
    return soup.get_text()

final_dataset['Review'] = final_dataset['Translated_Review'].apply(strip_html)
final_dataset = final_dataset.drop('Translated_Review',axis=1)
final_dataset.head(3)
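
To see what ‘strip_html’ does, here is a small sketch on a made-up review string (`sample` is not from the dataset). By default `get_text()` joins the text nodes with no separator, so words on either side of a tag such as `<br/>` end up adjacent.

```python
from bs4 import BeautifulSoup

def strip_html(text):
    # Parse the string as HTML and keep only the text nodes.
    return BeautifulSoup(text, "html.parser").get_text()

sample = "<p>Great app,<br/>works <b>really</b> well!</p>"  # made-up review
print(strip_html(sample))
```

If the missing space matters, `get_text(separator=" ")` is one option to consider.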

Before building the model, we need one more small step. Humans rely on articles, determiners, conjunctions, punctuation marks, etc. to understand a review clearly and then classify its sentiment.

This is not the case for machines. A model does not need these to classify the sentiment; they mostly add noise. To remove them, as in most sentiment analysis projects, we use the ‘nltk’ library.

NLTK stands for ‘Natural Language Toolkit’. It is one of the core libraries for sentiment analysis and other text-based ML projects. I first remove the punctuation marks and then use NLTK to remove the words that add no sentiment to the text. The first step uses a function named ‘punc_clean’, which removes the punctuation marks from every review.

In [None]:
import string as st

def punc_clean(text):
    # Keep every character that is not ASCII punctuation.
    return ''.join(w for w in text if w not in st.punctuation)

final_dataset['Review'] = final_dataset['Review'].apply(punc_clean)
final_dataset.head(3)

Next we remove the words that add no sentiment to the sentence. Such words are called ‘stopwords’. NLTK's list of stopwords also contains the word ‘not’. We must keep ‘not’ in the reviews, because it signals negative sentiment.

Hence we write the code so that it removes every stopword except ‘not’.

In [None]:
nltk.download('stopwords')

In [None]:
nltk.download('punkt')

In [None]:
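# Build the stopword set once, keeping 'not' because it carries negation.
stopword = set(nltk.corpus.stopwords.words('english')) - {'not'}

def remove_stopword(text):
    # Tokenize the review and drop every token found in the stopword set.
    a = [w for w in nltk.word_tokenize(text) if w not in stopword]

    return ' '.join(a)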

final_dataset['Review'] = final_dataset['Review'].apply(remove_stopword)
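
The two cleaning steps can be illustrated end to end without downloading any NLTK data. The sketch below is an assumption-laden stand-in, not the notebook's code: `STOPWORDS` is a tiny hypothetical list, tokens come from `str.split` instead of `nltk.word_tokenize`, and tokens are lower-cased before the lookup (the notebook's ‘remove_stopword’ compares them as they are).

```python
import string

# Tiny hypothetical stopword list; the notebook uses NLTK's English list.
STOPWORDS = {"the", "is", "a", "this", "and", "i", "it", "not"}
KEEP = {"not"}  # negation words that should survive filtering
ACTIVE_STOPWORDS = STOPWORDS - KEEP

def clean(text):
    # Drop punctuation, then drop stopwords (compared in lower case).
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    return " ".join(t for t in tokens if t.lower() not in ACTIVE_STOPWORDS)

print(clean("This app is not good, and it crashes!"))
```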

### VECTORIZATION 

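The next step is to turn every review into numeric features by giving each word (and each pair of adjacent words) a TF-IDF weight. The weight reflects how characteristic a term is of a review, not its sentiment; the sentiment is learned by the model afterwards. For this we use the ‘TfidfVectorizer’ class from the ‘feature_extraction.text’ module of ‘sklearn’.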

In [None]:
vectorizer = TfidfVectorizer(ngram_range = (1, 2), min_df = 1)
vectorizer.fit(final_dataset['Review'])
vectorizer_X = vectorizer.transform(final_dataset['Review'])
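
To make the effect of `ngram_range = (1, 2)` concrete, here is a small sketch on three made-up, already cleaned reviews (`docs` is hypothetical). It shows that the feature matrix gets one column for every unigram and every bigram seen in the corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["app not good", "good app", "not bad"]  # made-up cleaned reviews

vec = TfidfVectorizer(ngram_range=(1, 2), min_df=1)
X = vec.fit_transform(docs)

# One row per review, one column per unigram or bigram seen in the corpus.
print(X.shape)
print(sorted(vec.vocabulary_))
```

Because bigrams such as "not good" become features of their own, keeping ‘not’ during stopword removal lets the model see negated phrases.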

In [None]:
model = LogisticRegression()

clf = model.fit(vectorizer_X, final_dataset['Target'])
clf.score(vectorizer_X, final_dataset['Target']) * 100

The model scores around 96 to 97%; the exact value varies between runs when the reviews are sampled randomly. Note that this is the accuracy on the same data the model was trained on, so it is an optimistic estimate. With that caveat, we have successfully built our model with a good score.
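
The score above is computed on the same rows the model was fitted on. A hedged sketch of a held-out evaluation follows; it uses made-up reviews (`texts`, `labels`) and a scikit-learn pipeline, neither of which appears in the notebook, and the 25% test size is an arbitrary choice.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up reviews; in the notebook these would be the Review and Target columns.
texts = ["love it", "great app", "really good", "works great",
         "hate it", "bad app", "really bad", "crashes a lot"] * 5
labels = [2, 2, 2, 2, 0, 0, 0, 0] * 5

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# The pipeline fits the vectorizer on the training split only.
pipe = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    LogisticRegression(max_iter=1000),
)
pipe.fit(X_train, y_train)

print("train accuracy:", pipe.score(X_train, y_train))
print("test accuracy:", pipe.score(X_test, y_test))
```

Comparing the two numbers gives a rough idea of how much the training score overstates performance on unseen reviews.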

### PREDICTION

As a quick check of the model, I use two simple sentences, “I love machine learning down to its complexities” and “I so hate data analysis”, which clearly express positive and negative sentiment respectively.

In [None]:
clf.predict(vectorizer.transform(["I love machine learning down to its complexities"]))

In [None]:
clf.predict(vectorizer.transform(["I so hate data analysis"]))
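
`predict` returns the encoded integer labels, so it can help to translate them back to names. A minimal sketch, assuming the 0/1/2 scheme from the label encoding step; `preds` is a hard-coded stand-in for the output of `clf.predict(...)` and `target_names` is a hypothetical helper dictionary.

```python
import numpy as np

# Inverse of the manual encoding: integer label -> sentiment name.
target_names = {0: "Negative", 1: "Neutral", 2: "Positive"}

preds = np.array([2, 0])  # stand-in for the array returned by clf.predict(...)
decoded = [target_names[int(p)] for p in preds]
print(decoded)
```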