This file helps identify the main files we are working with and the important data. 

How many unique tweets are viewed by users in the dataset?<br>
    Among them, how many tweets are valid?<br>
    Produce a user-viewed, content-valid, unique, GPT-predicted dataset of tweets.<br>
    => This is the dataset we want to work with.<br>

<br>
We are working with 384273 unique tweets that are 6+ words long, in English, and viewed by at least one user for 3s+.

In [3]:
from langdetect import detect, LangDetectException
import json
import pandas as pd
import re
import os

In [2]:
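# Remove tweets that were not viewed for 3s or more.
# Note: the filtered frame overwrites the input file, so re-running this cell
# on the already-filtered file removes 0 rows.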
df = pd.read_csv('../csvs/exposure_features.csv')
df_filtered = df[df['3s'] == 1]

df_filtered.to_csv('../csvs/exposure_features.csv', index=False)

print(f"Original dataset had {len(df)} rows")
print(f"After filtering for 3s=1: {len(df_filtered)} rows")
print(f"Removed {len(df) - len(df_filtered)} rows")


Original dataset had 1401643 rows
After filtering for 3s=1: 1401643 rows
Removed 0 rows


In [8]:
# Print the first 100 lines of tweets.json as a sample
with open('../tweets.json', 'r') as file:
    for i, line in enumerate(file):
        if i < 100:
            print(line.strip())
        else:
            break

{"position": 0, "sortIndex": "1810472417504526336", "text": "", "is_retweet": false, "card": {}, "quoting": {}, "tweet_id": "messageprompt-premium-plus-upsell-prompt", "batch_id": "00e9dbc788164dd79b3f1e2c8f12ecbd", "user_id": "C26E207D61F94497895EF589F2E6CF26"}
{"position": 1, "sortIndex": "1810472417504526334", "text": "", "is_retweet": false, "card": {}, "quoting": {}, "tweet_id": "promoted-tweet-1720034014164070423-a74eb45854f9d6d", "batch_id": "00e9dbc788164dd79b3f1e2c8f12ecbd", "user_id": "C26E207D61F94497895EF589F2E6CF26"}
{"position": 2, "sortIndex": "1810472417504526333", "text": "SKETCH DID WHAT??? https://t.co/xe9fKFsMXq", "is_retweet": false, "card": {}, "quoting": {}, "tweet_id": "tweet-1810210153385578638", "bookmark_count": 10910, "favorite_count": 119495, "lang": "en", "reply_count": 1399, "retweet_count": 2452, "is_quote_status": false, "quote_count": 182, "conversation_id_str": "1810210153385578638", "created_at": "Mon Jul 08 07:11:47 +0000 2024", "media": [{"type": "

The main cell that finds all valid tweet ids. <br>
<br>
The output of this cell is the set of valid tweets that we should process. <br>
All unique valid tweet-ids are saved in unique_tweet_ids.<br>
All excluded tweet-ids are saved in excluded_tweet_ids, with the reason for exclusion. <br>
<br>
Total included tweets: 384273 <br>
Total excluded tweets: 1069908
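
As a quick check of the id-extraction rule used in this notebook (the first standalone run of 10 or more digits in the raw `tweet_id`), the sketch below applies the same regex to the three raw ids from the sample lines of tweets.json. The helper name `extract_numeric_id` is an assumption made for illustration; it is not defined elsewhere in the notebook.

```python
import re

# Same pattern the notebook uses: a standalone run of 10+ digits
NUMERIC_ID = re.compile(r'\b\d{10,}\b')

def extract_numeric_id(raw_id):
    """Return the numeric tweet id embedded in raw_id, or None if absent."""
    match = NUMERIC_ID.search(raw_id)
    return int(match.group()) if match else None

# Raw ids taken from the sample lines of tweets.json
samples = [
    "messageprompt-premium-plus-upsell-prompt",
    "promoted-tweet-1720034014164070423-a74eb45854f9d6d",
    "tweet-1810210153385578638",
]
for raw_id in samples:
    print(raw_id, "->", extract_numeric_id(raw_id))
```

Under this rule the message prompt has no numeric id and is recorded as `invalid_id`, while the promoted tweet does yield a numeric id and is then judged by the remaining checks (length, language, viewed).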

In [19]:
# Create a set of tweet ids to work with
"""
Create a set of all unique tweet ids
-> exclude tweets whose text is 6 words or fewer
-> exclude tweets that are not English
-> exclude tweets that were not viewed by any user for 3s+
"""
unique_tweet_ids = set()
excluded_tweet_ids = dict()
viewed_tweet_ids = set(df_filtered['tweet_id_numeric'].values)
skipped_count = 0

with open('../tweets.json', 'r') as file:
    for i, line in enumerate(file):
        if i % 10000 == 0:
            print(f"Processed {i} lines. Included: {len(unique_tweet_ids)}, Excluded: {len(excluded_tweet_ids)}")
            print(f"Skipped lines due to Json error: {skipped_count}")
        
        try:
            line_object = json.loads(line)
            tweet_id_raw = line_object['tweet_id']
            line_text = line_object['text']
        except (json.JSONDecodeError, KeyError):
            skipped_count += 1
            continue

        # Extract tweet_id
        match = re.search(r'\b\d{10,}\b', tweet_id_raw)
        if not match:
            excluded_tweet_ids[tweet_id_raw] = {'invalid_id': 1}
            continue

        tweet_id_numeric = int(match.group())

        # Skip duplicates (excluded_tweet_ids is keyed by the raw id string, not the numeric id)
        if tweet_id_numeric in unique_tweet_ids or tweet_id_raw in excluded_tweet_ids:
            continue

        # Exclusion reasons
        reason = {}

        # Note: `<= 6` also excludes tweets of exactly 6 words
        if len(line_text.split()) <= 6:
            reason['short'] = 1

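        # langdetect is non-deterministic unless DetectorFactory.seed is set, so results for
        # ambiguous text can vary between runs; text it cannot classify (e.g. an empty string)
        # raises LangDetectException and is treated as non-English here
        try: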
            if detect(line_text) != 'en':
                reason['non_english'] = 1
        except LangDetectException:
            reason['non_english'] = 1

        if tweet_id_numeric not in viewed_tweet_ids:
            reason['not_viewed'] = 1

        if reason:
            excluded_tweet_ids[tweet_id_raw] = reason
        else:
            unique_tweet_ids.add(tweet_id_numeric)

print("\nFinished processing.")
print(f"Total included tweets: {len(unique_tweet_ids)}")
print(f"Total excluded tweets: {len(excluded_tweet_ids)}")

# Convert exclusions for analysis
excluded_tweets_df = pd.DataFrame([
    {'tweet_id': tid, **reason}
    for tid, reason in excluded_tweet_ids.items()
])
sample_excluded = list(excluded_tweet_ids.items())[:10]
print("\nSample of excluded tweets and reasons:")
for tid, reason in sample_excluded:
    print(f"Tweet ID: {tid}, Reason: {reason}")


Processed 0 lines. Included: 0, Excluded: 0
Skipped lines due to Json error: 0
Processed 10000 lines. Included: 1417, Excluded: 3557
Skipped lines due to Json error: 0
Processed 20000 lines. Included: 2442, Excluded: 6253
Skipped lines due to Json error: 0
Processed 30000 lines. Included: 3337, Excluded: 9002
Skipped lines due to Json error: 0
Processed 40000 lines. Included: 4347, Excluded: 11665
Skipped lines due to Json error: 0
Processed 50000 lines. Included: 5316, Excluded: 14304
Skipped lines due to Json error: 0
Processed 60000 lines. Included: 6204, Excluded: 17076
Skipped lines due to Json error: 0
Processed 70000 lines. Included: 7337, Excluded: 20077
Skipped lines due to Json error: 0
Processed 80000 lines. Included: 8273, Excluded: 22829
Skipped lines due to Json error: 0
Processed 90000 lines. Included: 9358, Excluded: 25745
Skipped lines due to Json error: 0
Processed 100000 lines. Included: 10160, Excluded: 28249
Skipped lines due to Json error: 0
Processed 110000 lines
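
Because a tweet can be excluded for several reasons at once, a per-reason tally is a natural follow-up to the included/excluded totals. The sketch below assumes a dict shaped like `excluded_tweet_ids` (raw id mapped to a dict of reason flags). The three entries are made-up examples, not rows from the dataset, and `tally_reasons` is a hypothetical helper name.

```python
from collections import Counter

# Toy stand-in for excluded_tweet_ids (hypothetical ids, same shape as in the notebook)
excluded_example = {
    "tweet-1000000000001": {"short": 1, "non_english": 1},
    "tweet-1000000000002": {"not_viewed": 1},
    "messageprompt-example": {"invalid_id": 1},
}

def tally_reasons(excluded):
    """Count how many excluded tweets carry each reason flag."""
    counts = Counter()
    for reasons in excluded.values():
        counts.update(reasons.keys())
    return counts

reason_counts = tally_reasons(excluded_example)
for reason, count in sorted(reason_counts.items()):
    print(f"{reason}: {count}")
```

The per-reason counts can sum to more than the number of excluded tweets, since one tweet may carry several flags.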

In [20]:
# save the unique tweet ids to a csv file
unique_tweet_ids_df = pd.DataFrame({'tweet_id_numeric': list(unique_tweet_ids)})
unique_tweet_ids_df.to_csv('../csvs/unique_tweet_ids.csv', index=False)

In [21]:
# save the excluded tweet ids and the excluded reasons to a csv file
excluded_tweets_df.to_csv('../csvs/excluded_tweet_ids.csv', index=False)

In [None]:
# Verify the unique tweet ids are indeed viewed by users
# Yes they are.
unique_tweet_ids_df = pd.read_csv('../csvs/unique_tweet_ids.csv')
viewed_tweet_ids = set(df_filtered['tweet_id_numeric'].values)
valid_tweet_ids = unique_tweet_ids_df[unique_tweet_ids_df['tweet_id_numeric'].isin(viewed_tweet_ids)]
print(f"Valid tweet ids that are viewed by users: {len(valid_tweet_ids)}")

Valid tweet ids that are viewed by users: 384273


In [None]:
# verify if all the unique ids have gpt predictions
# Missing GPT predictions for 36061 unique tweet ids.
gpt_predictions_df = pd.read_csv('../csvs/gpt_outputs.csv')
unique_tweet_ids_df = pd.read_csv('../csvs/unique_tweet_ids.csv')
missing_predictions = unique_tweet_ids_df[~unique_tweet_ids_df['tweet_id_numeric'].isin(gpt_predictions_df['tweet_id'])]
if not missing_predictions.empty:
    print(f"Missing GPT predictions for {len(missing_predictions)} unique tweet ids.")
else:
    print("All unique tweet ids have GPT predictions.")

Missing GPT predictions for 36061 unique tweet ids.


In [14]:
# Add a tweet_id_numeric column to unique_tweets.csv
unique_tweets_df = pd.read_csv('../csvs/unique_tweets.csv')
def extract_numeric_id(raw_id):
    match = re.search(r'\b\d{10,}\b', raw_id)  # first run of 10+ digits, if any
    return int(match.group()) if match else None
unique_tweets_df['tweet_id_numeric'] = unique_tweets_df['tweet_id'].apply(extract_numeric_id)
unique_tweets_df.to_csv('../csvs/unique_tweets.csv', index=False)

In [13]:
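# Verify if the corresponding tweet-text is included in unique_tweets.csv
# Missing tweet texts for 782 unique tweet ids
# unique_tweet_ids_df must be loaded first; without it this cell raises the NameError recorded below
unique_tweet_ids_df = pd.read_csv('../csvs/unique_tweet_ids.csv')
unique_tweets_df = pd.read_csv('../csvs/unique_tweets.csv')
missing_texts = unique_tweet_ids_df[~unique_tweet_ids_df['tweet_id_numeric'].isin(unique_tweets_df['tweet_id_numeric'])]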
if not missing_texts.empty:
    # make a set of missing tweet ids
    missing_tweet_ids = set(missing_texts['tweet_id_numeric'].values)
    print(f"Missing tweet texts for {len(missing_texts)} unique tweet ids: {missing_tweet_ids}")
else:
    print("All unique tweet ids have corresponding tweet texts in unique_tweets.csv.")

NameError: name 'unique_tweet_ids_df' is not defined

In [10]:
print(f"Number of tweet id: {len(unique_tweet_ids_df)}")

NameError: name 'unique_tweet_ids_df' is not defined

In [36]:
# Join unique_tweets.csv and unique_tweet_ids.csv based on tweet_id_numeric
valid_tweets_df = pd.merge(unique_tweet_ids_df, unique_tweets_df, on='tweet_id_numeric', how='left')
valid_tweets_df = valid_tweets_df.drop_duplicates(subset='tweet_id_numeric')
valid_tweets_df.to_csv('../csvs/valid_tweets.csv', index=False)

In [None]:
# Check the row number of valid_tweets.csv
valid_tweets_df = pd.read_csv('../csvs/valid_tweets.csv')
print(f"Number of valid tweets: {len(valid_tweets_df)}")

Number of valid tweets: 384273


In [39]:
# Count the number of rows that have empty string for tweet column
empty_tweets_count = valid_tweets_df['tweet'].isnull().sum()
print(f"Number of rows with empty tweet text: {empty_tweets_count}")

Number of rows with empty tweet text: 782


In [44]:
print(len(missing_tweet_ids))

782


In [53]:
# Go over the tweets.json and extract the tweet text for each missing tweet id
valid_tweets_df = pd.read_csv('../csvs/valid_tweets_updated.csv')
with open('../tweets.json', 'r') as file:
    for i, line in enumerate(file):
        
        line_object = json.loads(line)
        tweet_id_raw = line_object['tweet_id']
        line_text = line_object['text']
        
        # Extract tweet_id_numeric
        match = re.search(r'\b\d{10,}\b', tweet_id_raw)
        if not match:
            continue

        tweet_id_numeric = int(match.group())
        if tweet_id_numeric in missing_tweet_ids:
            # Record the tweet text in the 'text' column of valid_tweets_df
            missing_tweet_ids.remove(tweet_id_numeric)
            valid_tweets_df.loc[valid_tweets_df['tweet_id_numeric'] == tweet_id_numeric, 'text'] = line_text 
            print(f"Added tweet text for tweet_id: {tweet_id_numeric} {line_text}")

if missing_tweet_ids:
    print(f"Missing tweet ids after processing: {missing_tweet_ids}")

# Save the updated valid_tweets_df to csv
valid_tweets_df.to_csv('../csvs/valid_tweets_updated.csv', index=False)

Added tweet text for tweet_id: 1813387249305940039 


KeyboardInterrupt: 
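
The cell above was interrupted. One possible cost is the per-match `.loc` assignment, which scans the whole DataFrame each time. An alternative, sketched here under that assumption, is to collect the missing texts into a plain dict in a single pass over the file and stop once every id has been found. The function name `collect_missing_texts` and the two sample lines are hypothetical.

```python
import io
import json
import re

NUMERIC_ID = re.compile(r'\b\d{10,}\b')

def collect_missing_texts(lines, missing_ids):
    """Map each id in missing_ids to the first text found for it in a JSON-lines stream."""
    remaining = set(missing_ids)
    found = {}
    for line in lines:
        if not remaining:
            break  # every missing id has been resolved
        try:
            record = json.loads(line)
            raw_id, text = record['tweet_id'], record['text']
        except (json.JSONDecodeError, KeyError):
            continue  # skip malformed lines, as the main filtering cell does
        match = NUMERIC_ID.search(raw_id)
        if match and int(match.group()) in remaining:
            numeric_id = int(match.group())
            found[numeric_id] = text
            remaining.discard(numeric_id)
    return found

# Hypothetical two-line stand-in for tweets.json
sample = io.StringIO(
    '{"tweet_id": "tweet-1111111111111", "text": "first example text"}\n'
    '{"tweet_id": "tweet-2222222222222", "text": "second example text"}\n'
)
texts = collect_missing_texts(sample, {2222222222222})
print(texts)
```

The resulting dict could then be applied in one step, for example with `valid_tweets_df['tweet_id_numeric'].map(texts)`, instead of one `.loc` write per tweet.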

In [47]:
# fill the tweet column from the text column of valid_tweets_updated.csv where tweet is empty
valid_tweets_updated_df = pd.read_csv('../csvs/valid_tweets_updated.csv')
valid_tweets_updated_df['tweet'] = valid_tweets_updated_df['tweet'].fillna(valid_tweets_updated_df['text'])
valid_tweets_updated_df.to_csv('../csvs/valid_tweets_updated.csv', index=False)


In [17]:
# check if all missing tweet ids are resolved
valid_tweets_updated_df = pd.read_csv('../csvs/valid_tweets.csv')
missing_tweet_ids = set(valid_tweets_updated_df[valid_tweets_updated_df['tweet'].isnull()]['tweet_id_numeric'].values)
if not missing_tweet_ids:
    print("All missing tweet ids are resolved.")
else:
    print(f"Missing tweet ids: {missing_tweet_ids}")
    print(f"Number of missing tweet ids: {len(missing_tweet_ids)}")

All missing tweet ids are resolved.


In [55]:
# Make sure all tweets are strings in valid_tweets_updated.csv
valid_tweets_updated_df = pd.read_csv('../csvs/valid_tweets_updated.csv')
valid_tweets_updated_df['tweet'] = valid_tweets_updated_df['tweet'].astype(str)
valid_tweets_updated_df.to_csv('../csvs/valid_tweets_updated.csv', index=False)

In [None]:
# rename valid_tweets_updated.csv to valid_tweets.csv
os.rename('../csvs/valid_tweets_updated.csv', '../csvs/valid_tweets.csv')

In [18]:
# verify if all the tweet_id_numeric in valid_tweets.csv have gpt predictions
valid_tweets_df = pd.read_csv('../csvs/valid_tweets.csv')
gpt_predictions_df = pd.read_csv('../csvs/gpt_outputs.csv')
missing_predictions = valid_tweets_df[~valid_tweets_df['tweet_id_numeric'].isin(gpt_predictions_df['tweet_id'])]
print(f"Missing GPT predictions for {len(missing_predictions)} tweet ids.")

# print a few examples of missing tweets
if not missing_predictions.empty:
    print("Examples of missing tweets:")
    for index, row in missing_predictions.head(5).iterrows():
        print(f"Tweet ID: {row['tweet_id_numeric']}, Tweet: {row['tweet']}")


Missing GPT predictions for 36061 tweet ids.
Examples of missing tweets:
Tweet ID: 1815235435217502620, Tweet: My name is Grant Stern, I’m 47 years old and I’m supporting VP Kamala Harris in 2024!

Who’s with me?? https://t.co/uwOeN2cMLM
Tweet ID: 1825285715887849885, Tweet: HAKEEM JEFFRIES: Kamala Harris and Tim Walz are running a forward-looking, joyful campaign; the difference between Kamala Harris and Donald Trump is as wide as the Grand Canyon.

Boom!!! 💥 https://t.co/b35Fyz0OMZ
Tweet ID: 1812255619417575424, Tweet: This is why the media is the #1 enemy of the American people.  CNN Headline: "Trump injured in incident at Pennsylvania rally."

Scum! https://t.co/mRCkq7zjzW
Tweet ID: 1825710540372512768, Tweet: im crying someone caught NPC Miles Morales standing on the top shelf at a store during one of his TikTok lives 😭😭 https://t.co/52VEO94w5d
Tweet ID: 1824406775031861262, Tweet: This is EXACTLY what I’m picturing is happening when I spend all 700 of my recruiting hours to send 

In [31]:
# get rid of the text column in valid_tweets.csv
valid_tweets_df = pd.read_csv('../csvs/valid_tweets.csv')
valid_tweets_df = valid_tweets_df.drop(columns=['text'])
valid_tweets_df.to_csv('../csvs/valid_tweets.csv', index=False)

In [25]:
# load valid_tweets.csv and gpt_outputs.csv for the merge on tweet_id_numeric
valid_tweets_df = pd.read_csv('../csvs/valid_tweets.csv')
gpt_predictions_df = pd.read_csv('../csvs/gpt_outputs.csv')

In [32]:
# check for duplicates in valid_tweets_df and gpt_predictions_df
valid_tweets_duplicates = valid_tweets_df[valid_tweets_df.duplicated(subset='tweet_id_numeric', keep=False)]
gpt_predictions_duplicates = gpt_predictions_df[gpt_predictions_df.duplicated(subset='tweet_id', keep=False)]
print(f"Number of duplicate tweets in valid_tweets_df: {len(valid_tweets_duplicates)}")
print(f"Number of duplicate tweets in gpt_predictions_df: {len(gpt_predictions_duplicates)}")

Number of duplicate tweets in valid_tweets_df: 0
Number of duplicate tweets in gpt_predictions_df: 0


In [24]:
# drop duplicates in gpt_predictions_df
gpt_predictions_df = gpt_predictions_df.drop_duplicates(subset='tweet_id', keep='first')
gpt_predictions_df.to_csv('../csvs/gpt_outputs.csv', index=False)

In [33]:
merged_df = pd.merge(valid_tweets_df, gpt_predictions_df, left_on='tweet_id_numeric', right_on='tweet_id', how='inner')

In [34]:
# check number of rows in merged_df
print(f"Number of rows in merged_df: {len(merged_df)}")

Number of rows in merged_df: 348211
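
An inner merge silently drops the tweets that have no prediction. If it helps the follow-up work, a left merge with pandas' `indicator=True` keeps those rows and labels them. The sketch below uses two tiny made-up frames that share the key columns of `valid_tweets_df` and `gpt_predictions_df`; the values are illustrative only.

```python
import pandas as pd

# Tiny hypothetical stand-ins for valid_tweets_df and gpt_predictions_df
valid = pd.DataFrame({
    'tweet_id_numeric': [101, 102, 103],
    'tweet': ['a', 'b', 'c'],
})
preds = pd.DataFrame({
    'tweet_id': [101, 103, 999],
    'predicted_happy': [3, 5, 1],
})

# Left merge keeps every valid tweet; the _merge column marks whether a prediction was found
merged = pd.merge(
    valid, preds,
    left_on='tweet_id_numeric', right_on='tweet_id',
    how='left', indicator=True,
)
missing = merged[merged['_merge'] == 'left_only']
print(f"Missing predictions: {len(missing)}")
print(missing['tweet_id_numeric'].tolist())
```

Filtering on `_merge == 'both'` would recover the same rows an inner merge returns, so one merge can serve both the final dataset and the list of tweets still needing predictions.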


In [35]:
# check number of rows in valid_tweets_df and gpt_predictions_df
print(f"Number of rows in valid_tweets_df: {len(valid_tweets_df)}")
print(f"Number of rows in gpt_predictions_df: {len(gpt_predictions_df)}")

Number of rows in valid_tweets_df: 384272
Number of rows in gpt_predictions_df: 680092
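
The counts printed in this notebook can be cross-checked. Since neither table has duplicate ids, an inner join should keep exactly the valid tweets that have a prediction. The snippet below only restates the printed numbers, and it assumes the 36061 missing-prediction count applies to the same 384272-row table.

```python
# Row counts as printed by the notebook
valid_rows = 384272          # rows in valid_tweets_df
missing_predictions = 36061  # valid tweets without a GPT prediction
merged_rows = 348211         # rows in merged_df after the inner join

# With unique ids on both sides, inner-join rows = valid rows - rows without a match
expected_merged = valid_rows - missing_predictions
print(expected_merged == merged_rows)
```

Note that `valid_tweets_df` has 384272 rows at this point, one fewer than the 384273 reported earlier; the notebook does not show where that row was dropped.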


In [37]:
# drop tweet_id_y
# rename tweet_id_x to tweet_id
merged_df = merged_df.drop(columns=['tweet_id_y'])
merged_df = merged_df.rename(columns={'tweet_id_x': 'tweet_id'})

In [39]:
merged_df.head()

Unnamed: 0,tweet_id_numeric,tweet_id,tweet,predicted_nervous,predicted_sad,predicted_happy,predicted_calm,predicted_excited,predicted_aroused,predicted_angry,predicted_relaxed,predicted_fearful,predicted_enthusiastic,predicted_still,predicted_satisfied,predicted_bored,predicted_lonely
0,1816494698644520965,tweet-1816494698644520965,🧵A quick look at the emerging poll picture aft...,2,2,2,1,3,2,2,1,2,3,1,1,1,1
1,1812656619781488927,tweet-1812656619781488927,The maids of the Red Keep after the guards car...,2,1,3,2,4,3,1,1,1,3,1,1,1,1
2,1823015751248155008,tweet-1823015751248155008,It was an action-packed weekend for monarch bu...,1,1,5,2,4,3,1,2,1,4,1,3,1,1
3,1821396938492895358,tweet-1821396938492895358,So yall upset at Leslie over this ??? Ain’t no...,1,1,3,1,2,3,2,1,1,4,1,1,1,1
4,1822683652947423666,tweet-1822683652947423666,Incresibly sinister energy. I've seen dudes li...,3,2,1,1,2,2,4,1,4,1,1,1,1,1


In [38]:
merged_df.to_csv('../csvs/valid_tweets_with_gpt.csv', index=False)

Output: valid_tweets.csv and valid_tweets_with_gpt.csv <br>
The problem is that around 36k tweets (36061) still have no GPT predictions. <br>
<br>
TODO: check whether these tweets are in the JSON output. Otherwise, re-run the GPT predictions in batch for these tweets. 
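
If a re-run is needed, the missing ids would have to be split into batches first. The sketch below shows one simple way to do that; the helper name `make_batches`, the ids, and the batch size are all hypothetical, and how each batch is submitted for prediction is outside this sketch.

```python
def make_batches(ids, batch_size):
    """Split ids into consecutive sorted lists of at most batch_size items."""
    ordered = sorted(ids)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

# Hypothetical ids and batch size, for illustration only
missing_ids = {1005, 1001, 1004, 1002, 1003}
batches = make_batches(missing_ids, batch_size=2)
print(batches)
```

Sorting the ids first makes the batches reproducible across runs, which helps when a partially completed re-run has to be resumed.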