# Project Overview
## Dataset explanation: Spotify User Review Dataset
Our initial dataset, after dropping irrelevant columns, will have 3 columns (before creating our own features). These features are: Review, Total_thumbsup, and Rating. The Review column contains a textual review that a user left, the Total_thumbsup column contains the number of 
thumbs up a review receives, and the rating contains the corresponding star rating that the user left on the app.

## Project Overview/Breakdown
Essentially, our project first classifies the reviews in the dataset into three categories: negative, positive, and neutral, and then takes all reviews classified as negative, and ranks the features from most important (the issue that users are complaining the most about) to least
important. Our project has two models: one for classifying the sentiment, and one for ranking.

## Things to note in regards to building our model/testing

### Part 1
The first model is pretty straightforward, we basically classified the reviews based on sentiment into good, positive, and neutral, and we verified our results by checking if a review that was classified as positive had 4-5 stars, checking if a review classified as neutral had 3 stars, and seeing if a review classified as negative had a 1-2 star rating. 

### Part 2
Our second part (the ranking step) falls more into the category of unsupervised learning. What we decided to do was use Latent Dirichlet Allocation (LDA), which is a technique used for topic modeling to extract different topics amongst the reviews. Each topic has a corresponding sentiment score, and we got the top 10 positive topics (topics with the highest sentiment scores), and top 10 negative topics (topics with the lowest sentiment scores), and assigned descriptive labels to the negative topics (because we are concerned with ranking issues that users complain about). The topic with the lowest sentiment score was assigned the highest ranking, and the topic with the highest sentiment score among all negative topics was assigned the lowest ranking. The tricky part of this milestone was within testing
the second part, since we had to go through our reviews and assign the negative reviews with the labels our own labels (we didn't really have a ground truth in the dataset that we could go off of, so we had to create it ourselves), so we used Hugging Face's zero-shot classifier to create a ground truth for us (we passed it the labels for the negative topics) and compared the rankings from this to the rankings our lda model came up with

In [62]:
# Import/pip install all of the necessary libraries
!pip install pandas
!pip install numpy
!pip install matplotlib
!pip install seaborn
!pip install scikit-learn 
!pip install transformers
!pip install torch
!pip install sentence-transformers
!pip install scipy



# Clone the repository
!git clone https://github.com/dregmi08/Milestone-2-Data-Exploration-Initial-Preprocessing.git

# Include all imports 
import sklearn
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util
from scipy.stats import kendalltau

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting textblob
  Downloading textblob-0.18.0.post0-py3-none-any.whl.metadata (4.5 kB)
Downloading textblob-0.18.0.post0-py3-none-any.whl (626 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m626.3/626.3 kB[0m [31m10.1 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: textblob
Successfully installed textblob-0.18.0.post0
fatal: destination path 'Milestone-2-Data-Exploration-Initial-Preprocessing' already exists and is not an empty directory.


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [2]:
# Create a pandas data frame from our csv file
data = pd.read_csv('Milestone-2-Data-Exploration-Initial-Preprocessing/reviews.csv')

# Dropping the unnecessary columns: timesubmitted is the time which the user submitted the review, which is irrelevant to our project
# The Reply column is over 99.6% null, so we will be dropping that as well since there is not much information that we can extract to 
#train our model
data = data.drop(columns=['Time_submitted', 'Reply'])

# Display the data frame
print(data)

                                                  Review  Rating  \
0      Great music service, the audio is high quality...       5   
1      Please ignore previous negative rating. This a...       5   
2      This pop-up "Get the best Spotify experience o...       4   
3        Really buggy and terrible to use as of recently       1   
4      Dear Spotify why do I get songs that I didn't ...       1   
...                                                  ...     ...   
61589  Even though it was communicated that lyrics fe...       1   
61590  Use to be sooo good back when I had it, and wh...       1   
61591  This app would be good if not for it taking ov...       2   
61592  The app is good hard to navigate and won't jus...       2   
61593  Its good but sometimes it doesnt load the musi...       4   

       Total_thumbsup  
0                   2  
1                   1  
2                   0  
3                   1  
4                   1  
...               ...  
61589          

In [74]:
# This cell contains basic preprocessing for our reviews, we want to remove stop words (words that contain/contribute no valuable information
# (like and, I, she, her), we do this to ensure our review text only has words that provide useful information), convert all text to
# lowercase, and removing numbers/punctuation
def preprocess_reviews(text_series):

    # Convert text to lowercase
    text_series = text_series.str.lower() 

    # Remove punctuation
    text_series = text_series.str.replace(r'[^\w\s]', '')  

    # Remove numbers
    text_series = text_series.str.replace(r'\d+', '')

    #Return filtered reviews
    return text_series

# Create a new feature in the dataset that now contains the preprocessed version of all reviews
data['Review_clean'] = preprocess_reviews(data['Review'])

In [75]:
# Creates a tool to turn text into a matrix of word/phrase counts, focusing on the 1000 most common ones, ignoring common stop words.
vectorizer = CountVectorizer(max_features=1000, ngram_range=(1, 2), stop_words='english')

# Convert the cleaned reviews into a matrix of numbers, where each row is a review and each column is a word or phrase
X = vectorizer.fit_transform(data['Review_clean'])

In [77]:
# Creates an LDA model to find 20 topics in the data, where each topic is a group of related words.
lda = LatentDirichletAllocation(n_components=20, random_state=0)

# Trains the LDA model on the numerical feature matrix (X) to learn the topics.
lda.fit(X)

# Transforms the data into a topic representation, showing the topic distribution for each document.
topics = lda.transform(X)

In [78]:
def get_sentiment_score(rating):
    if rating >= 4:
        return 1  # Positive
    elif rating <= 2:
        return -1  # Negative
    else:
        return 0  # Neutral

data['sentiment_score'] = data['Rating'].apply(get_sentiment_score)

In [79]:
# Number of top terms to display per topic (10 most frequent words for each topic)
n_top_terms = 20

# Get the list of all words (terms) from the vectorizer's feature names
terms = vectorizer.get_feature_names_out()

# Create a dictionary to store the top terms for each topic
topic_features = {}

# Loop through each topic and find the top terms (words) for that topic
for topic_idx, topic in enumerate(lda.components_):
    # Sort the terms in the topic by their importance and pick the top 10
    top_terms = [terms[i] for i in topic.argsort()[-n_top_terms:]]
    topic_features[f"Topic {topic_idx + 1}"] = top_terms

# Convert the topic features dictionary into a DataFrame for easier viewing
topic_df = pd.DataFrame(dict([(k, pd.Series(v)) for k, v in topic_features.items()]))

# Display the top words for each topic
print("Top words per topic:")
print(topic_df)

Top words per topic:
           Topic 1           Topic 2  Topic 3        Topic 4        Topic 5  \
0   keeps crashing            simple     went        forward  spotify great   
1            worse  highly recommend    wrong      currently          great   
2          spotify          app easy  won let  recent update          thing   
3      uninstalled          navigate    email            fix        artists   
4          getting           spotify      buy            bar            use   
5              fix             great     says           skip       download   
6            music       apple music     just       open app          music   
7        app keeps              lots  problem      close app           feel   
8             load            highly    login   song playing   like spotify   
9        excellent            really      fix         recent        premium   
10        stopping           app use   logged         update        pandora   
11            time       use sp

In [98]:
# Assuming `topics` is the topic distribution matrix (topics x documents)
# and `data['sentiment_score']` contains sentiment scores for each review

# Aggregate sentiment scores for each topic by multiplying the topic distribution matrix with sentiment scores of each review
topic_sentiment_scores = np.dot(topics.T, data['sentiment_score'])

# Get the frequency of each topic across all reviews
topic_frequencies = topics.sum(axis=0)

# Create a DataFrame to store topic scores, frequency, and sentiment scores for easy comparison
topic_score_df = pd.DataFrame({
    'topic': [f'Topic {i+1}' for i in range(len(topic_sentiment_scores))],
    'sentiment_score': topic_sentiment_scores,
    'frequency': topic_frequencies
})

# Sort topics by sentiment score (from most negative to least negative)
 sentiment_ranked = negative_topic_score_df.sort_values(by='sentiment_score', ascending=True)
 #freq_ranked = negative_topic_score_df_sorted_by_sentiment = sentiment_ranked.sort_values(by='frequency', ascending=True)
# Display the most negative topics (lowest sentiment score)
print("Most Negative Topics:")
print(negative_topic_score_df_sorted_by_sentiment[['topic', 'sentiment_score', 'frequency']].head(10))


IndentationError: unexpected indent (1997301025.py, line 18)

In [88]:
import pandas as pd
from sentence_transformers import SentenceTransformer, util

# Manually assign labels for the topics (negative reviews)
topic_labels = {
    'Topic 13': "Issues with music pausing/suddenly stopping",
    'Topic 3': "Login Issues",
    'Topic 8': "Issues with playlist and shuffle",
    'Topic 4': "Issues with playing music when closing app",
    'Topic 19': "Issues when connecting to different devices/platforms",
    'Topic 1': "App Crashes and Stability Problems",
    'Topic 18': "Connectivity and Podcast Issues",
    'Topic 7': "Ads and Ad-Free Experience",
    'Topic 11': "Playlist and Download Management",
    'Topic 6': "Free vs Premium App Experience"
}


candidate_labels = [
    'Issues with music pausing/suddenly stopping',
    'Login Issues',
    'Issues with playlist and shuffle',
    'Issues with playing music when closing app',
    'Issues when connecting to different devices/platforms',
    'App Crashes and Stability Problems',
    'Connectivity and Podcast Issues',
    'Ads and Ad-Free Experience',
    'Playlist and Download Management',
    'Free vs Premium App Experience'
]


# Load the Sentence-Transformer model
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Select a smaller subset for faster testing (negative reviews with low ratings, e.g., 1 or 2)
negative_reviews_subset = data[data['Rating'].isin([1, 2])]['Review_clean'].sample(n=24000, random_state=42).tolist()

# Function to classify each review using Sentence-Transformer embeddings (Batch Processing)
def classify_reviews_batch(reviews, candidate_labels, model, batch_size=32):
    # Generate embeddings for all candidate labels (just once, not per review)
    label_embeddings = model.encode(candidate_labels, convert_to_tensor=True)
    
    # Process reviews in batches
    all_classified_reviews = []
    
    for i in range(0, len(reviews), batch_size):
        batch_reviews = reviews[i:i+batch_size]
        
        # Generate embeddings for the batch of reviews
        review_embeddings = model.encode(batch_reviews, convert_to_tensor=True)
        
        # Calculate cosine similarities in batch
        cosine_similarities = util.pytorch_cos_sim(review_embeddings, label_embeddings)
        
        # Process each review in the batch
        for j, review in enumerate(batch_reviews):
            similarities = cosine_similarities[j]
            max_sim_idx = similarities.argmax()
            predicted_label = candidate_labels[max_sim_idx]
            score = similarities[max_sim_idx].item()  # Extract the scalar value from tensor
            
            all_classified_reviews.append({
                'Review': review,
                'Predicted Label': predicted_label,
                'Score': score
            })
    
    return all_classified_reviews

# Classify the reviews in batches
classified_reviews = classify_reviews_batch(negative_reviews_subset, candidate_labels, model)

# Create DataFrame with results
classified_reviews_df = pd.DataFrame(classified_reviews)

# Get the counts per label
label_counts = classified_reviews_df['Predicted Label'].value_counts()

print(label_counts)

Predicted Label
Issues with playing music when closing app               8451
App Crashes and Stability Problems                       3146
Playlist and Download Management                         2267
Issues with music pausing/suddenly stopping              2171
Ads and Ad-Free Experience                               2000
Connectivity and Podcast Issues                          1779
Issues with playlist and shuffle                         1760
Login Issues                                              977
Free vs Premium App Experience                            948
Issues when connecting to different devices/platforms     501
Name: count, dtype: int64


In [94]:
# Predicted order of topics (from your predicted ranking)
predicted_order = [
    'Topic 13', 'Topic 3', 'Topic 8', 'Topic 4', 'Topic 19',
    'Topic 1', 'Topic 18', 'Topic 7', 'Topic 11', 'Topic 6'
]

# Ground truth order of topics (from the actual label ranking you provided)
ground_truth_order = [
    'Topic 4', 'Topic 1', 'Topic 11', 'Topic 13', 'Topic 7', 'Topic 18', 'Topic 8', 'Topic 3', 'Topic 6'
]

# Create a rank mapping for both predicted and ground truth orders
predicted_ranks = [predicted_order.index(topic) for topic in ground_truth_order]
ground_truth_ranks = list(range(len(ground_truth_order)))

# Calculate Spearman's Rank Correlation
spearman_corr, _ = spearmanr(predicted_ranks, ground_truth_ranks)

# Calculate Kendall's Tau
kendall_corr, _ = kendalltau(predicted_ranks, ground_truth_ranks)

# Display the results
print(f"Kendall's Tau: {kendall_corr:.4f}")

Kendall's Tau: 0.0556
