# Polymarket User Problems Analysis

This notebook analyzes Reddit and Twitter posts to identify the top problems users face with Polymarket.

## Objectives:
1. Collect posts about Polymarket from Reddit (last 3 months) and Twitter (last 7 days)
2. Filter for problem-related discussions
3. Categorize and analyze problems using NLP
4. Identify top 3-5 problems to build solutions for

**Data Collection:**
- **Reddit:** 300 posts per subreddit, last 90 days (3 months)
- **Twitter:** 10 tweets minimum (API limit), last 7 days
- **Total Expected:** ~100-200+ posts for comprehensive analysis

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Import custom modules
from collect_polymarket_data import PolymarketDataCollector
from analyze_polymarket_problems import PolymarketProblemAnalyzer

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✓ Libraries imported successfully!")

✓ Libraries imported successfully!


## Step 1: Collect Data from Reddit and Twitter
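
The debug log for this step shows posts being dropped as "too old" relative to a cutoff date. How `PolymarketDataCollector` applies that cutoff is not shown in this notebook; the following is a minimal sketch of the idea, assuming each post carries a timezone-aware UTC timestamp. The function name `filter_recent` and the sample posts are hypothetical.

```python
from datetime import datetime, timedelta, timezone

def filter_recent(posts, days_back, now=None):
    """Keep posts whose 'created_utc' falls within the last `days_back` days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days_back)
    return [p for p in posts if p["created_utc"] >= cutoff]

# Hypothetical sample data, for illustration only
now = datetime(2025, 11, 10, tzinfo=timezone.utc)
posts = [
    {"title": "recent post", "created_utc": datetime(2025, 11, 4, tzinfo=timezone.utc)},
    {"title": "old post", "created_utc": datetime(2025, 7, 1, tzinfo=timezone.utc)},
]

recent = filter_recent(posts, days_back=90, now=now)
print([p["title"] for p in recent])
```

Passing `now` explicitly keeps the cutoff reproducible when the notebook is re-run on a later date.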

In [None]:
# Initialize collector
collector = PolymarketDataCollector()

# Collect Reddit data for the last 90 days; Twitter is skipped below (quota exceeded)
print("🚀 Collecting comprehensive data...\n")
print("📊 Reddit: 300 posts/subreddit, last 90 days")
print("🐦 Twitter: 10 tweets minimum (skip if quota exceeded)\n")

df = collector.collect_all_data(days_back=90, skip_twitter=True)  # Skip Twitter due to quota

print(f"\n✅ Collected {len(df)} total posts")
print("\n📈 Platform breakdown:")
print(df['platform'].value_counts())


POLYMARKET DATA COLLECTION STARTING

Collecting Reddit posts about 'polymarket'

[DEBUG] Date filter: Posts after 2025-10-11 22:32:37 UTC
[DEBUG] Looking back 30 days

Searching r/polymarket...
  [DEBUG] Getting all posts (no search query needed), Limit: 100
  [DEBUG] Post #1: 'Using API for sports...' | Date: 2025-11-04 | Score: 1
  [DEBUG] Post #2: 'The Changing Landscape of Sports Betting...' | Date: 2025-10-30 | Score: 1
  [DEBUG] Post #3: 'MoonPay for deposits?...' | Date: 2025-10-24 | Score: 1
  [DEBUG] Filtered out (too old): 'Lord Miles dead yes or no?...' | Date: 2025-09-26
  [DEBUG] Filtered out (too old): 'Parliament, Polymarket and the perils of political betting...' | Date: 2025-09-19
  [DEBUG] Retrieved 51 posts from API
  [DEBUG] Filtered out 41 posts (too old)
  ✓ Found 10 posts matching criteria

Searching r/cryptocurrency...
  [DEBUG] Query: 'polymarket', Limit: 100, Time filter: month
  [DEBUG] Post #1: 'I built a website to track whales and insider/suspicious tra

In [None]:
# Preview the data
df.head(10)

In [None]:
# Basic statistics
print("Data Overview:")
print(f"Date range: {df['created_utc'].min()} to {df['created_utc'].max()}")
print(f"\nPlatforms: {df['platform'].unique()}")
print(f"\nColumns: {df.columns.tolist()}")

## Step 2: Identify Problem Posts
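
The logic inside `identify_problem_posts()` lives in `analyze_polymarket_problems` and is not shown here. One common approach, sketched below as an assumption rather than the module's actual method, is to flag posts that contain any word from a problem vocabulary. The keyword list and the function name `is_problem_post` are hypothetical.

```python
import re

# Hypothetical keyword list; the analyzer's real vocabulary is not shown in this notebook
PROBLEM_KEYWORDS = ["problem", "issue", "bug", "error", "stuck", "can't", "cannot", "scam", "failed"]

# Word boundaries avoid false hits such as "bug" inside "debugging"
PROBLEM_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in PROBLEM_KEYWORDS) + r")\b",
    re.IGNORECASE,
)

def is_problem_post(text):
    """Return True if the text contains any problem keyword (case-insensitive)."""
    return bool(PROBLEM_PATTERN.search(text or ""))

samples = [
    "My withdrawal is stuck for three days",
    "Great odds on the election market",
]
flags = [is_problem_post(s) for s in samples]
print(flags)
```

Keyword matching is cheap and transparent, but it misses complaints phrased without any listed word, so the resulting problem-post share is best read as a lower bound.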

In [None]:
# Initialize analyzer
analyzer = PolymarketProblemAnalyzer()

# Use the collected data
analyzer.df = df

# Identify problem posts
problem_df = analyzer.identify_problem_posts()

print(f"\nFound {len(problem_df)} problem-related posts out of {len(df)} total")
# Guard against an empty collection to avoid dividing by zero
if len(df) > 0:
    print(f"That's {len(problem_df) / len(df) * 100:.1f}% of all posts")

In [None]:
# Sample problem posts
print("Sample problem posts:\n")
for _, row in problem_df.head(5).iterrows():
    print(f"Platform: {row['platform']}")
    print(f"Date: {row['created_utc']}")
    print(f"Text: {row['full_text'][:200]}...")
    print("-" * 80)

## Step 3: Sentiment Analysis
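
The `sentiment_compound` column used in this step ranges from -1 to 1, which matches the compound score produced by VADER-style analyzers; whether `analyze_sentiment` actually uses VADER is an assumption. A minimal sketch of turning a compound score into a label, using the ±0.05 cutoffs conventionally recommended for VADER (the analyzer's own thresholds are not shown, and `label_from_compound` is a hypothetical name):

```python
def label_from_compound(score, threshold=0.05):
    """Map a compound score in [-1, 1] to 'positive', 'negative', or 'neutral'."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"

# Hypothetical scores, for illustration only
for score in (0.6, 0.0, -0.4):
    print(score, label_from_compound(score))
```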

In [None]:
# Analyze sentiment
problem_df = analyzer.analyze_sentiment(problem_df)

In [None]:
# Visualize sentiment distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

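# Sentiment counts. value_counts() sorts by frequency, so colors are keyed by label
# (assumes labels 'negative', 'neutral', 'positive'; anything else falls back to blue)
sentiment_counts = problem_df['sentiment'].value_counts()
color_map = {'negative': 'red', 'neutral': 'gray', 'positive': 'green'}
bar_colors = [color_map.get(str(label).lower(), 'steelblue') for label in sentiment_counts.index]
axes[0].bar(sentiment_counts.index, sentiment_counts.values, color=bar_colors)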
axes[0].set_title('Sentiment Distribution of Problem Posts')
axes[0].set_ylabel('Count')

# Sentiment scores
axes[1].hist(problem_df['sentiment_compound'], bins=30, edgecolor='black')
axes[1].set_title('Sentiment Score Distribution')
axes[1].set_xlabel('Compound Score (-1 to 1)')
axes[1].set_ylabel('Count')
axes[1].axvline(x=0, color='red', linestyle='--', label='Neutral')
axes[1].legend()

plt.tight_layout()
plt.show()

## Step 4: Categorize Problems
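
`category_counts` is used later with `.most_common(10)`, so it behaves like a `collections.Counter`. How `categorize_problems` assigns categories is not shown; the sketch below assumes simple keyword-to-category matching, where one post can count toward several categories. The category names and keywords are hypothetical.

```python
from collections import Counter

# Hypothetical categories and keywords, for illustration only
CATEGORY_KEYWORDS = {
    "deposits_withdrawals": ["deposit", "withdraw"],
    "fees": ["fee", "gas"],
    "market_resolution": ["resolution", "resolved", "dispute"],
}

def categorize(text):
    """Return every category whose keywords appear in the text (substring match)."""
    lowered = text.lower()
    return [
        category
        for category, keywords in CATEGORY_KEYWORDS.items()
        if any(keyword in lowered for keyword in keywords)
    ]

posts = [
    "Withdrawal stuck, and the fee was huge",
    "Market resolved incorrectly, opening a dispute",
    "Deposit never arrived",
]

category_counts = Counter()
for post in posts:
    category_counts.update(categorize(post))

print(category_counts.most_common(10))
```

Because a post can match more than one category, the counts are mentions rather than distinct posts, which is why the chart axis in this step is labeled "Number of Mentions".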

In [None]:
# Categorize problems
problem_df, category_counts = analyzer.categorize_problems(problem_df)

In [None]:
# Visualize problem categories
plt.figure(figsize=(12, 6))
top_10_categories = dict(category_counts.most_common(10))
plt.barh(list(top_10_categories.keys()), list(top_10_categories.values()),
         color=sns.color_palette("Reds_r", 10))
plt.xlabel('Number of Mentions')
plt.title('Top 10 Problem Categories')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

## Step 5: Extract Top 5 Problems
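
Each entry in `top_problems` exposes `rank`, `category`, `count`, `percentage`, and `avg_sentiment`, among other fields. The body of `extract_top_problems` is not shown, so the following is only a sketch of how such records could be derived, assuming each row holds a list of categories and a compound sentiment score. The function name `top_problem_records` and the sample rows are hypothetical.

```python
from collections import Counter
from statistics import mean

def top_problem_records(rows, category_counts, top_n=5):
    """Build ranked summary records for the most frequently mentioned categories."""
    total = len(rows)
    records = []
    for rank, (category, count) in enumerate(category_counts.most_common(top_n), start=1):
        # Sentiment scores of every row tagged with this category
        scores = [r["sentiment_compound"] for r in rows if category in r["categories"]]
        records.append({
            "rank": rank,
            "category": category,
            "count": count,
            "percentage": count / total * 100 if total else 0.0,
            "avg_sentiment": mean(scores) if scores else 0.0,
        })
    return records

# Hypothetical rows, for illustration only
rows = [
    {"categories": ["fees"], "sentiment_compound": -0.6},
    {"categories": ["fees", "deposits"], "sentiment_compound": -0.2},
    {"categories": ["deposits"], "sentiment_compound": 0.1},
    {"categories": ["fees"], "sentiment_compound": -0.4},
]
category_counts = Counter(c for r in rows for c in r["categories"])

top = top_problem_records(rows, category_counts, top_n=2)
for record in top:
    print(record)
```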

In [None]:
# Extract top 5 problems
top_problems = analyzer.extract_top_problems(problem_df, category_counts, top_n=5)

In [None]:
# Display detailed information for each top problem
for problem in top_problems:
    print("\n" + "="*80)
    print(f"RANK #{problem['rank']}: {problem['category'].upper().replace('_', ' ')}")
    print("="*80)
    print(f"Mentions: {problem['count']} ({problem['percentage']:.1f}% of problem posts)")
    print(f"Average Sentiment: {problem['avg_sentiment']:.3f}")
    print(f"Platforms: {problem['platforms']}")
    print("\nSample Complaints:")
    for i, complaint in enumerate(problem['sample_complaints'][:2], 1):
        print(f"\n{i}. {complaint[:300]}...")

## Step 6: Generate Comprehensive Visualization

In [None]:
# Create comprehensive visualization
viz_file = analyzer.visualize_results(top_problems, category_counts)
print(f"Visualization saved to: {viz_file}")

## Step 7: Generate Full Report

In [None]:
# Generate text report
report_file = analyzer.generate_report(top_problems)
print(f"Report saved to: {report_file}")

# Display report
with open(report_file, 'r', encoding='utf-8') as f:
    print(f.read())

## Step 8: Save Results for App Development

In [None]:
# Save processed data
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
problem_file = f'polymarket_problem_posts_{timestamp}.csv'
problem_df.to_csv(problem_file, index=False)
print(f"Problem posts saved to: {problem_file}")

# Save top problems as JSON for easy access
import json
json_file = f'polymarket_top_problems_{timestamp}.json'
with open(json_file, 'w', encoding='utf-8') as f:
    # Convert to serializable format
    problems_json = [{
        'rank': p['rank'],
        'category': p['category'],
        'count': p['count'],
        'percentage': p['percentage'],
        'avg_sentiment': p['avg_sentiment'],
        'platforms': p['platforms']
    } for p in top_problems]
    json.dump(problems_json, f, indent=2)
    
print(f"Top problems JSON saved to: {json_file}")

## Summary and Next Steps

### Key Findings:
The analysis reports the top 5 problem categories users raise about Polymarket, each with its mention count, share of problem posts, average sentiment, and platforms.

### Next Steps for App Development:
1. Review the top problems identified
2. Prioritize which problems to solve based on:
   - Frequency (number of mentions)
   - Sentiment severity (how negative)
   - Feasibility of a solution built on the Polymarket and Polygon APIs
3. Design and build apps to address these problems
4. Test solutions with the community
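
The three prioritization criteria above can be folded into a single score for a first-pass ranking. The following is a minimal sketch and not part of the analysis modules: the weights, the hand-assigned `feasibility` ratings (0 to 1), the function name `priority_score`, and the sample values are all hypothetical.

```python
def priority_score(percentage, avg_sentiment, feasibility, weights=(0.4, 0.3, 0.3)):
    """Combine frequency, sentiment severity, and feasibility into one score in [0, 1]."""
    frequency = percentage / 100          # share of problem posts, scaled to 0..1
    severity = max(0.0, -avg_sentiment)   # only negative sentiment adds severity
    w_freq, w_sev, w_feas = weights
    return w_freq * frequency + w_sev * severity + w_feas * feasibility

# Hypothetical candidates, for illustration only
candidates = [
    {"category": "fees", "percentage": 40.0, "avg_sentiment": -0.5, "feasibility": 0.2},
    {"category": "deposits", "percentage": 25.0, "avg_sentiment": -0.3, "feasibility": 0.9},
]

ranked = sorted(
    candidates,
    key=lambda c: priority_score(c["percentage"], c["avg_sentiment"], c["feasibility"]),
    reverse=True,
)
for c in ranked:
    score = priority_score(c["percentage"], c["avg_sentiment"], c["feasibility"])
    print(f"{c['category']}: {score:.2f}")
```

With these sample weights, a less frequent but highly feasible problem can outrank a more frequent one; adjusting `weights` shifts that trade-off.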

### Files Generated:
- Problem posts CSV (for further analysis)
- Text report (detailed findings)
- Visualization PNG (charts and graphs)
- JSON file (easy access to top problems)