![wordcloud](wordcloud.png)

As a Data Scientist working for a mobile app company, you regularly use product analytics to understand user behavior, uncover patterns, and identify which features work well and which do not. Recently, the number of negative reviews on Google Play has increased, and as a result the app's rating has been falling. The team has asked you to analyze the situation and make sense of the negative reviews.

It's up to you to preprocess the text with NLTK and apply K-means clustering from scikit-learn to sort the negative Google Play Store reviews into categories!

## The Data

You have been given a dataset containing a sample of Google Play Store reviews and their respective scores (from 1 to 5). A summary and preview are provided below.

### reviews.csv

| Column     | Description              |
|------------|--------------------------|
| `'content'` | Content (text) of each review. |
| `'score'` | Score assigned to the review by the user as an integer (from 1 to 5). |

In [51]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [52]:
# Download necessary files from NLTK:
# punkt -> Tokenization
# stopwords -> Stop words removal
nltk.download("punkt")
nltk.download("stopwords")

[nltk_data] Downloading package punkt to /home/repl/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/repl/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [53]:
# Load the reviews dataset and preview it
reviews = pd.read_csv("reviews.csv")
reviews.head()

Unnamed: 0,content,score
0,I cannot open the app anymore,1
1,I have been begging for a refund from this app...,1
2,Very costly for the premium version (approx In...,1
3,"Used to keep me organized, but all the 2020 UP...",1
4,Dan Birthday Oct 28,1


In [54]:
# Keep only the text of negative reviews (score of 1 or 2)
negative_reviews = reviews[reviews["score"].isin([1, 2])]["content"]

stop_words = set(stopwords.words('english'))

# Function to tokenize and remove stopwords
def process_text(text):
    tokens = word_tokenize(text)
    # Drop stop words (case-insensitive) and non-alphabetic tokens such as numbers and punctuation
    filtered_tokens = [w for w in tokens if w.lower() not in stop_words and w.isalpha()]
    return " ".join(filtered_tokens)

# Apply the function to the content column
preprocessed_reviews = pd.DataFrame({'review': negative_reviews.apply(process_text)})
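
To see what the filter in `process_text` keeps and drops without needing the NLTK downloads, here is a minimal stand-in. The regex tokenizer and the tiny `DEMO_STOP_WORDS` set are assumptions made for illustration only; they are not NLTK's `word_tokenize` or its English stop-word list, so results on real reviews will differ.

```python
import re

# Hypothetical, deliberately tiny stop-word set (NOT NLTK's English list)
DEMO_STOP_WORDS = {"i", "the", "a", "to", "for", "from", "this", "have", "been"}

def demo_process_text(text):
    # Simplified tokenizer: runs of word characters, or single non-space symbols
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Same filter as process_text: drop stop words (case-insensitive) and
    # any token that is not purely alphabetic (numbers, punctuation)
    kept = [w for w in tokens if w.lower() not in DEMO_STOP_WORDS and w.isalpha()]
    return " ".join(kept)

print(demo_process_text("I have been begging for a refund from this app since 2020!"))
# begging refund app since
```

Note that the original casing of kept tokens is preserved; only the stop-word comparison is case-insensitive.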

In [55]:
preprocessed_reviews.head()

Unnamed: 0,review
0,open app anymore
1,begging refund app month nobody replying
2,costly premium version approx Indian Rupees pe...
3,Used keep organized UPDATES made mess things c...
4,Dan Birthday Oct


In [56]:
# Vectorize the negative reviews

# Create a TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Fit and transform the documents
tfidf_matrix = vectorizer.fit_transform(preprocessed_reviews['review'])
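
To make the numbers inside `tfidf_matrix` less opaque, the sketch below computes TF-IDF weights by hand using only the standard library. It assumes scikit-learn's default weighting (raw term counts, smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row). The three-review corpus is hypothetical, and whitespace splitting stands in for the vectorizer's own tokenizer.

```python
import math
from collections import Counter

# Hypothetical mini-corpus of already-preprocessed reviews
docs = ["app crashes", "app slow", "refund refund please"]

tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df = Counter(t for doc in tokenized for t in set(doc))

# Smoothed inverse document frequency (assumed to match smooth_idf=True)
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    # L2-normalize so every review vector has unit length
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm if norm else 0.0 for x in raw]

matrix = [tfidf_row(doc) for doc in tokenized]

for doc, row in zip(docs, matrix):
    weights = {t: round(w, 3) for t, w in zip(vocab, row) if w > 0}
    print(doc, "->", weights)
```

In the first review, "crashes" receives a higher weight than "app" because "app" also appears in another review and is therefore less distinctive.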


In [57]:
# K-means clustering
num_clusters = 5

# Initialize KMeans with the number of clusters
# n_init is set explicitly because its default differs across scikit-learn versions
kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=42)

# Fit the KMeans model to the TF-IDF matrix
kmeans.fit(tfidf_matrix)

# Get the cluster labels for each document
categories = kmeans.labels_.tolist()
preprocessed_reviews["category"] = categories
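
For intuition about what `kmeans.fit` does, here is a minimal sketch of the basic K-means loop (Lloyd's algorithm) on a handful of 2-D points. It is an illustration under simplifying assumptions, not scikit-learn's implementation: the starting centroids are fixed by hand, whereas `KMeans` uses k-means++ initialization by default and works on the sparse TF-IDF matrix. The names `toy_kmeans`, `assign`, and `update` are hypothetical.

```python
import math

def assign(points, centroids):
    # Label each point with the index of its nearest centroid
    return [
        min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        for p in points
    ]

def update(points, labels, centroids):
    # Move each centroid to the mean of the points assigned to it
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:
            new_centroids.append(old)  # keep an empty cluster's centroid as is
            continue
        new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
    return new_centroids

def toy_kmeans(points, centroids, max_iter=100):
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
    return labels, centroids

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = toy_kmeans(points, centroids=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The cluster numbers themselves carry no meaning; they only indicate which reviews were grouped together, which is why the top terms of each cluster are inspected afterwards.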

In [58]:
# Get the terms from the vectorizer
terms = vectorizer.get_feature_names_out()

topic_terms_list = []

for i in range(num_clusters):
    # Row positions of the reviews assigned to cluster i
    cluster_indices = [idx for idx, label in enumerate(categories) if label == i]
    
    # Sum the TF-IDF scores of each term across the reviews in this cluster
    cluster_tfidf_sum = tfidf_matrix[cluster_indices].sum(axis=0)
    cluster_term_freq = np.asarray(cluster_tfidf_sum).ravel()

    # Get the index of the term with the highest summed TF-IDF score
    top_term_index = cluster_term_freq.argsort()[::-1][0]

    # Note: 'frequency' holds the summed TF-IDF score, not a raw word count
    topic_terms_list.append({'category': i, 'term': terms[top_term_index], 'frequency': cluster_term_freq[top_term_index]})

topic_terms = pd.DataFrame(topic_terms_list)