# Sentiment Prediction Analysis

This notebook analyzes the results of sentiment prediction on Counter-Strike 2 reviews.
We focus on deriving insights from the model's performance and behavior.

**Theme:** Asiimov (Orange, Black, White)

In [8]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
import numpy as np

# Define Asiimov Color Palette
# Inspired by the CS:GO Asiimov skin: Distinctive Orange, Black, and White.
asiimov_colors = {
    'orange': '#ff9d00',
    'black': '#1a1a1a',
    'white': '#ffffff',
    'grey': '#5c5c5c',
    'light_grey': '#d1d1d1'
}

# Register the palette as a Plotly template and make it the default for all figures
pio.templates["asiimov"] = go.layout.Template(
    layout=go.Layout(
        colorway=[asiimov_colors['orange'], asiimov_colors['black'], asiimov_colors['grey']],
        plot_bgcolor=asiimov_colors['white'],
        paper_bgcolor=asiimov_colors['white'],
        font={'color': asiimov_colors['black']},
        title={'font': {'color': asiimov_colors['black']}},
    )
)
pio.templates.default = "asiimov"

print("Libraries loaded and Asiimov theme defined.")

Libraries loaded and Asiimov theme defined.


In [9]:
# Load the prediction results
df = pd.read_csv('cs2_10k_predictions.csv')

# Display first few rows to verify
df.head()

Unnamed: 0,recommendationid,language,review,timestamp_created,timestamp_updated,voted_up,votes_up,votes_funny,weighted_vote_score,comment_count,...,author_playtime_forever,author_playtime_last_two_weeks,author_playtime_at_review,author_last_played,detected_lang,clean_review,cleaned_review,label_numeric,predicted_label,predicted_prob
0,26982902,english,Killed a chicken and a chicken killed me 10/10,1479952338,1479952338,True,0,0,0.5,0,...,21969,0,19354.0,1684603450,en,Killed a chicken and a chicken killed me 10/10,Killed a chicken and a chicken killed me 10/10,1,1,0.999792
1,16670500,english,"this is very good game, but quite often with my game off when he reads a map, you do not know wh...",1435135418,1435135418,True,1,1,0.52381,0,...,52599,0,2083.0,1751639001,en,"this is very good game, but quite often with my game off when he reads a map, you do not know wh...","this is very good game, but quite often with my game off when he reads a map, you do not know wh...",1,1,0.999864
2,13842311,english,Dogshit game... Valve couldn't be more wrong.... these graphics are not worth breaking a game...,1420431242,1703460261,False,0,0,0.5,0,...,702615,5726,22057.0,1765778215,en,Dogshit game... Valve couldn't be more wrong.... these graphics are not worth breaking a game...,Dogshit game... Valve couldn't be more wrong.... these graphics are not worth breaking a game...,0,0,0.000118
3,53679961,english,great game,1562342756,1562342756,True,0,0,0.5,0,...,51608,0,17441.0,1683333774,en,great game,great game,1,1,0.973276
4,8846854,english,So a bit over 2 and a half years have passed since I reviewed the game for the first time. Back ...,1390682342,1479928937,True,8,0,0.62029,0,...,327574,17,131452.0,1765800361,en,So a bit over 2 and a half years have passed since I reviewed the game for the first time. Back ...,So a bit over 2 and a half years have passed since I reviewed the game for the first time. Back ...,1,1,0.999864


In [10]:
# Preprocessing for Analysis

# Create a column for correctness
# voted_up is True/False, predicted_label is 1/0.
df['actual_label'] = df['voted_up'].astype(int)
df['is_correct'] = df['actual_label'] == df['predicted_label']

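# Calculate review length in characters.
# Missing reviews count as length 0 rather than as the 3-character string 'nan'.
df['review_length'] = df['clean_review'].fillna('').astype(str).str.len()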

# Map numeric labels to string for better plotting
df['prediction_status'] = df['is_correct'].map({True: 'Correct', False: 'Incorrect'})
df['sentiment_label'] = df['actual_label'].map({1: 'Positive', 0: 'Negative'})

print("Preprocessing complete.")

Preprocessing complete.


## Insight 1: Model Prediction Distribution
We investigate how the model performs across positive and negative classes. Does it have a bias towards one sentiment?

In [11]:
# Confusion Matrix-style breakdown
confusion_data = df.groupby(['sentiment_label', 'predicted_label']).size().reset_index(name='count')
confusion_data['predicted_label_str'] = confusion_data['predicted_label'].map({1: 'Predicted Positive', 0: 'Predicted Negative'})

# Plotting with Asiimov colors
fig = px.bar(
    confusion_data,
    x='sentiment_label',
    y='count',
    color='predicted_label_str',
    title='Model Prediction Distribution by Actual Sentiment',
    color_discrete_map={
        'Predicted Positive': asiimov_colors['orange'],
        'Predicted Negative': asiimov_colors['black']
    },
    barmode='group'
)

fig.update_layout(
    xaxis_title="Actual Sentiment",
    yaxis_title="Count",
    legend_title="Prediction"
)

fig.show()

**Commentary:**
This chart visualizes the confusion matrix as grouped bars.
- If the orange bar is high for 'Positive' and the black bar is high for 'Negative', the model is doing well.
- A tall bar of the 'wrong' color shows which error type is more prevalent: orange over 'Negative' is a False Positive, black over 'Positive' is a False Negative.
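
The same breakdown can be read as numbers rather than bar heights. The following is a minimal sketch, assuming a small hypothetical frame `toy` that reuses the `actual_label` and `predicted_label` column names from `df`. It builds the 2x2 table with `pd.crosstab` and derives the two error rates.

```python
import pandas as pd

# Hypothetical toy data standing in for df; only the column names come from the notebook.
toy = pd.DataFrame({
    'actual_label':    [1, 1, 1, 1, 0, 0, 0, 1],
    'predicted_label': [1, 1, 0, 1, 0, 1, 0, 1],
})

# Rows are actual classes, columns are predicted classes.
cm = pd.crosstab(toy['actual_label'], toy['predicted_label'])
# Guarantee a full 2x2 table even if one class is never predicted.
cm = cm.reindex(index=[0, 1], columns=[0, 1], fill_value=0)

tn, fp = cm.loc[0, 0], cm.loc[0, 1]
fn, tp = cm.loc[1, 0], cm.loc[1, 1]

false_positive_rate = fp / (fp + tn)  # negative reviews predicted as positive
false_negative_rate = fn / (fn + tp)  # positive reviews predicted as negative

print(cm)
print(f"FPR = {false_positive_rate:.2f}, FNR = {false_negative_rate:.2f}")
```

Comparing the two rates, rather than the raw counts, avoids being misled when one class is much larger than the other.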

## Insight 2: Confidence Distribution
Is the model confident when it's wrong? We analyze the distribution of predicted probabilities.

In [12]:
# Histogram of probabilities
fig = px.histogram(
    df,
    x='predicted_prob',
    color='prediction_status',
    nbins=50,
    title='Distribution of Prediction Probabilities (Confidence)',
    color_discrete_map={
        'Correct': asiimov_colors['orange'],
        'Incorrect': asiimov_colors['black']
    },
    opacity=0.7,
    barmode='overlay'
)

fig.update_layout(
    xaxis_title="Predicted Probability (0=Negative, 1=Positive)",
    yaxis_title="Count"
)

fig.show()

**Commentary:**
- **Correct Predictions (Orange):** Should ideally cluster near 0 and 1 (high confidence).
- **Incorrect Predictions (Black):** 
    - If they cluster around 0.5, the model was uncertain.
    - If they cluster near 0 or 1, the model was "confidently wrong".
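
To put a number on "confidently wrong", one option is to count the errors whose probability lies close to either extreme. This is a sketch only: the frame `toy` is hypothetical, and the `margin` of 0.1 is an illustrative assumption, not a value taken from the analysis.

```python
import pandas as pd

# Hypothetical toy data; the column names match the notebook's df.
toy = pd.DataFrame({
    'predicted_prob': [0.99, 0.02, 0.55, 0.97, 0.45, 0.01],
    'is_correct':     [True, True, False, False, False, True],
})

# Assumed cutoff: a prediction is "confident" when its probability
# lies within `margin` of 0 or of 1.
margin = 0.1
confident = (toy['predicted_prob'] <= margin) | (toy['predicted_prob'] >= 1 - margin)

errors = toy[~toy['is_correct']]
confidently_wrong = toy[~toy['is_correct'] & confident]

# Share of all errors that were made with high confidence.
share = len(confidently_wrong) / len(errors)
print(f"{len(confidently_wrong)} of {len(errors)} errors were confident ({share:.0%})")
```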

## Insight 3: Playtime vs. Prediction Accuracy
Do players with more experience write reviews that are easier or harder to classify? Veterans might use more slang or sarcasm.

In [13]:
# Convert playtime (minutes) to hours
df['playtime_hours'] = df['author_playtime_at_review'] / 60

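# Create bins for playtime.
# The last bin is open-ended so '5k+ h' has no upper cap, and include_lowest=True
# keeps reviews written at exactly 0 hours (pd.cut excludes the left edge by default).
bins = [0, 10, 100, 500, 1000, 5000, np.inf]
labels = ['0-10h', '10-100h', '100-500h', '500-1k h', '1k-5k h', '5k+ h']
df['playtime_category'] = pd.cut(df['playtime_hours'], bins=bins, labels=labels, include_lowest=True)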

# Calculate accuracy per bin
accuracy_by_playtime = df.groupby('playtime_category', observed=True)['is_correct'].mean().reset_index()

fig = px.bar(
    accuracy_by_playtime,
    x='playtime_category',
    y='is_correct',
    title='Model Accuracy by Player Experience (Playtime)',
    color_discrete_sequence=[asiimov_colors['orange']]
)

fig.update_layout(
    xaxis_title="Playtime at Review",
    yaxis_title="Accuracy",
    yaxis_tickformat='.1%'
)

fig.show()
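
Per-bin accuracy is a proportion, so bins with few reviews can look better or worse than they are. One way to gauge this is a Wilson score interval per bin. The helper `wilson_interval` below is a hypothetical name introduced for illustration, and the counts in the usage lines are made up.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% coverage."""
    if total == 0:
        return (0.0, 1.0)  # no data: the proportion could be anything
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (centre - half, centre + half)

# Same observed accuracy (90%), very different sample sizes (made-up counts).
small = wilson_interval(9, 10)
large = wilson_interval(900, 1000)
print("n=10:  ", small)
print("n=1000:", large)
```

The interval for the small bin is much wider, so a difference between two bars should be read against the number of reviews behind each.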

## Insight 4: Review Length vs. Model Confidence
Is the model more confident when there is more text to analyze?

In [14]:
# Calculate 'confidence' as absolute distance from 0.5 (neutral).
df['model_confidence'] = (df['predicted_prob'] - 0.5).abs() * 2  # Scale 0 to 1

# Scatter plot of length vs confidence
fig = px.scatter(
    df,
    x='review_length',
    y='model_confidence',
    color='prediction_status',
    title='Review Length vs. Model Confidence',
    color_discrete_map={
        'Correct': asiimov_colors['orange'],
        'Incorrect': asiimov_colors['black']
    },
    opacity=0.6,
    log_x=True # Log scale for length
)

fig.update_layout(
    xaxis_title="Review Length (characters) - Log Scale",
    yaxis_title="Model Confidence (0=Unsure, 1=Sure)"
)

fig.show()
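
A scatter plot of thousands of points can hide or exaggerate a trend, so a single summary number is a useful companion. The sketch below computes a Spearman rank correlation as the Pearson correlation of ranks, which needs only pandas. The frame `toy` and its values are hypothetical; only the column names come from the notebook.

```python
import pandas as pd

# Hypothetical toy data; the column names match the notebook's df.
toy = pd.DataFrame({
    'review_length':    [5, 12, 40, 150, 600, 2500],
    'model_confidence': [0.60, 0.95, 0.80, 0.97, 0.99, 0.98],
})

# Spearman correlation equals the Pearson correlation of the ranks.
# Ranks are unchanged by monotonic transforms, so the log scale on the
# x-axis does not affect the result.
rho = toy['review_length'].rank().corr(toy['model_confidence'].rank())
print(f"Spearman rho = {rho:.3f}")
```

A value near +1 would indicate that longer reviews tend to receive more confident predictions; a value near 0 would indicate no monotonic relationship.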

## Insight 5: The Impact of "Funny" Reviews
Are reviews voted as "Funny" harder to predict? These reviews often contain sarcasm, ASCII art, or jokes.

In [15]:
# Binning votes_funny
df['is_funny'] = df['votes_funny'] > 0
accuracy_funny = df.groupby('is_funny')['is_correct'].mean().reset_index()
accuracy_funny['is_funny_str'] = accuracy_funny['is_funny'].map({True: 'Rated Funny', False: 'Not Funny'})

fig = px.bar(
    accuracy_funny,
    x='is_funny_str',
    y='is_correct',
    title='Model Accuracy: Funny vs Normal Reviews',
    color='is_funny_str',
    color_discrete_map={
        'Rated Funny': asiimov_colors['orange'],
        'Not Funny': asiimov_colors['black']
    }
)

fig.update_layout(
    xaxis_title="Review Type",
    yaxis_title="Accuracy",
    yaxis_tickformat='.1%',
    showlegend=False
)

fig.show()

## Insight 6: Temporal Analysis - Reviews per Day
We look at the volume of reviews over time to see trends in player engagement.

In [None]:
# Convert timestamp to date
df['date'] = pd.to_datetime(df['timestamp_created'], unit='s')
df['date_only'] = df['date'].dt.date

# Group by date
daily_counts = df.groupby('date_only').size().reset_index(name='count')

# Plot
fig = px.line(
    daily_counts, 
    x='date_only', 
    y='count', 
    title='Number of Reviews per Day',
    color_discrete_sequence=[asiimov_colors['orange']]
)
fig.update_layout(
    xaxis_title="Date", 
    yaxis_title="Number of Reviews",
    template="asiimov"
)
fig.show()
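
Grouping by `date_only` skips days that have no reviews, so a line chart connects the surrounding days directly, and a sample spread thinly over a long period gives very noisy daily counts. A possible alternative, sketched here with hypothetical timestamps, is to resample onto a continuous calendar and then aggregate by week.

```python
import pandas as pd

# Hypothetical Unix timestamps in seconds, standing in for timestamp_created.
toy = pd.DataFrame({'timestamp_created': [
    1700000000, 1700003600, 1700090000, 1700700000, 1701500000,
]})
toy['date'] = pd.to_datetime(toy['timestamp_created'], unit='s')

# Daily counts on a continuous calendar: days without reviews appear as 0
# instead of being left out.
daily = toy.set_index('date').resample('D').size()

# Weekly totals smooth out day-to-day noise.
weekly = daily.resample('W').sum()

print(daily.head())
print(weekly)
```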


## Insight 7: Reviews per Day by Sentiment
How does the volume of positive vs negative reviews change over time? Spikes in review volume often coincide with game updates.

In [None]:
# Group by date and sentiment
daily_sentiment = df.groupby(['date_only', 'sentiment_label']).size().reset_index(name='count')

# Plot
fig = px.line(
    daily_sentiment, 
    x='date_only', 
    y='count', 
    color='sentiment_label',
    title='Number of Reviews per Day by Sentiment',
    color_discrete_map={
        'Positive': asiimov_colors['orange'], 
        'Negative': asiimov_colors['black']
    }
)
fig.update_layout(
    xaxis_title="Date", 
    yaxis_title="Number of Reviews",
    template="asiimov"
)
fig.show()


## Insight 8: Average Sentiment Over Time
We track the model's average predicted probability of positive sentiment per day. This serves as a proxy for the overall community sentiment trend.

In [None]:
# Calculate daily average probability
daily_avg_prob = df.groupby('date_only')['predicted_prob'].mean().reset_index(name='avg_prob')

# Plot
fig = px.line(
    daily_avg_prob, 
    x='date_only', 
    y='avg_prob', 
    title='Daily Average Positive Probability',
    color_discrete_sequence=[asiimov_colors['orange']]
)

# Add global mean line
global_mean = df['predicted_prob'].mean()
fig.add_hline(
    y=global_mean, 
    line_dash="dash", 
    line_color=asiimov_colors['grey'], 
    annotation_text="Global Mean"
)

fig.update_layout(
    xaxis_title="Date", 
    yaxis_title="Average Probability (0-1)",
    yaxis_range=[0, 1],
    template="asiimov"
)
fig.show()
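
A daily mean gives a day with one review the same weight as a day with hundreds. One way to reduce that effect is a time-based rolling window over the individual reviews. This is a sketch with a hypothetical frame `toy`; the 30-day window is an assumption chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data; the column names match the notebook's df.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-10', '2024-02-15']),
    'predicted_prob': [1.0, 0.0, 0.5, 0.8],
})

# A time-based window needs a sorted DatetimeIndex.
by_time = toy.set_index('date').sort_index()

# Mean over all reviews in the trailing 30 days, so every review
# contributes equally regardless of how busy its day was.
rolling_mean = by_time['predicted_prob'].rolling('30D').mean()
print(rolling_mean)
```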


## Insight 9: Score Distribution by Sentiment
Comparing the distribution of predicted probabilities for Positive vs Negative reviews using a density plot.

In [None]:
# Helper to calculate density (simple histogram based).
# range=(0, 1) keeps the bins identical for both classes, since probabilities lie in [0, 1].
def get_density(series, bins=100):
    hist, bin_edges = np.histogram(series, bins=bins, range=(0, 1), density=True)
    bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2
    return bin_centers, hist

# Get data for Positive and Negative
pos_probs = df[df['sentiment_label'] == 'Positive']['predicted_prob']
neg_probs = df[df['sentiment_label'] == 'Negative']['predicted_prob']

x_pos, y_pos = get_density(pos_probs)
x_neg, y_neg = get_density(neg_probs)

fig = go.Figure()

fig.add_trace(go.Scatter(
    x=x_pos, y=y_pos, 
    mode='lines', 
    name='Positive', 
    line=dict(color=asiimov_colors['orange'], width=2),
    fill='tozeroy'
))

fig.add_trace(go.Scatter(
    x=x_neg, y=y_neg, 
    mode='lines', 
    name='Negative', 
    line=dict(color=asiimov_colors['black'], width=2),
    fill='tozeroy'
))

fig.update_layout(
    title='Score Distribution by Sentiment (Density)',
    xaxis_title='Predicted Probability',
    yaxis_title='Density',
    template="asiimov"
)
fig.show()
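
When reading the density plot, note that `np.histogram(..., density=True)` normalizes each curve so that its area is 1. Heights can therefore exceed 1, and the curves say nothing about how many reviews each class contains. A short check on hypothetical scores (drawn from a Beta distribution only to get a skewed sample):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical scores in [0, 1], skewed towards 1.
scores = rng.beta(5, 1, size=1000)

hist, edges = np.histogram(scores, bins=100, range=(0, 1), density=True)

# Area under the histogram: sum of height times bin width.
area = np.sum(hist * np.diff(edges))
print(f"area = {area:.6f}, max height = {hist.max():.2f}")
```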


## Insight 10: Word Frequency Analysis
We list the most frequent words in Positive and Negative reviews to see which topics drive sentiment.

In [None]:
from collections import Counter
import re

# Simple tokenizer and stopword list (since spacy/nltk might not be available)
def simple_tokenize(text):
    # Lowercase and keep only ASCII letters and whitespace
    # (digits and punctuation are dropped, so '2' or '10/10' never become tokens)
    if pd.isna(text):
        return []
    text = re.sub(r'[^a-zA-Z\s]', '', str(text).lower())
    return text.split()

STOPWORDS = set([
    'the', 'and', 'to', 'of', 'a', 'is', 'in', 'it', 'for', 'that', 'i', 'you', 'this', 'on', 'with', 'game', 'cs', 'csgo',
    'are', 'was', 'have', 'my', 'not', 'be', 'but', 'as', 'at', 'so', 'if', 'or', 'just', 'like', 'can', 'from', 'an', 'all',
    'me', 'your', 'one', 'they', 'about', 'has', 'out', 'what', 'do', 'get', 'no', 'up', 'when', 'good', 'play', 'will',
    'time', 'really', 'would', 'there', 'more', 'very', 'even', 'only', 'some', 'had', 'been', 'which', 'by', 'their',
    'who', 'dont', 'im', 'its', 'games', 'valve', 'counter', 'strike', '2', 'global', 'offensive', 'played', 'playing',
    'because', 'than', 'much', 'better', 'best', 'worst', 'bad', 'fun', 'love', 'hate', 'great', 'amazing', 'awesome',
    'sucks', 'shit', 'trash', 'garbage', 'fucking', 'fuck', 'u', 'r'
])

def get_top_words(texts, n=15):
    all_words = []
    for text in texts:
        words = simple_tokenize(text)
        # Filter stopwords and short words
        all_words.extend([w for w in words if w not in STOPWORDS and len(w) > 2])
    return Counter(all_words).most_common(n)

# Get top words for each sentiment
top_pos = get_top_words(df[df['sentiment_label'] == 'Positive']['cleaned_review'])
top_neg = get_top_words(df[df['sentiment_label'] == 'Negative']['cleaned_review'])

# Create DataFrames
df_pos_words = pd.DataFrame(top_pos, columns=['word', 'count'])
df_neg_words = pd.DataFrame(top_neg, columns=['word', 'count'])

# Plot Positive
fig_pos = px.bar(
    df_pos_words, 
    x='word', 
    y='count', 
    title='Top 15 Words in Positive Reviews',
    color_discrete_sequence=[asiimov_colors['orange']]
)
fig_pos.update_layout(template="asiimov")
fig_pos.show()

# Plot Negative
fig_neg = px.bar(
    df_neg_words, 
    x='word', 
    y='count', 
    title='Top 15 Words in Negative Reviews',
    color_discrete_sequence=[asiimov_colors['black']]
)
fig_neg.update_layout(template="asiimov")
fig_neg.show()
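
When interpreting the word lists, it helps to see what the tokenizer does to a typical review. The sketch below repeats the regex from `simple_tokenize` on a made-up sentence; the missing-value check is left out to keep it free of dependencies.

```python
import re

def simple_tokenize(text):
    # Same regex as in the cell above: keep ASCII letters and whitespace only.
    text = re.sub(r'[^a-zA-Z\s]', '', str(text).lower())
    return text.split()

# Made-up example review.
tokens = simple_tokenize("Don't buy CS2, 10/10 cheaters!")
print(tokens)  # ['dont', 'buy', 'cs', 'cheaters']
```

Apostrophes vanish ("don't" becomes "dont"), digits are stripped ("CS2" becomes "cs"), and scores such as "10/10" disappear entirely. This is why the stopword list contains forms like `dont` and `im`.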
