# Creative extension analysis notebook

---

**Authors**

- Jérémy Bensoussan
- Ekaterina Kryukova
- Jules Triomphe

---

## Abstract

While the paper examines the exposure hypothesis for all topics, we propose to analyze and compare political and food-related tweets. To do so, we plan to obtain egos, alters and their timelines from Twitter’s API by generating $3\times30,000$ random candidate user IDs in a range from $0$ to $3,000,000,000$. Then, we intend to identify egos’ retweets using the official RT, classify tweets by topic based on hashtags (and keywords if the dataset is lacking content) and create a follower/followee graph. To calculate the probability that egos retweet alters’ tweets, we will use a more robust approach than the one described in the paper “Differences in the Mechanics of Information Diffusion Across Topics”, where the probability is equal to the number of users who were exposed $k$ times to a hashtag and retweeted before the ($k+1$)-th exposure, divided by the number of users who were exposed $k$ times to the hashtag. Finally, we plan to visualize the results and to analyze the probability by breaking down users based on betweenness, clustering coefficient and number of followees.

## Research Questions

1. Is there a significant difference between the probability of retweeting when the tweet is about food and when it is about politics?
2. Is there a significant difference in the number of times a tweet is retweeted depending on whether it is about food or politics?
3. Is there a relation between a user's betweenness, cluster size or number of friends and the retweet probabilities?

## Proposed dataset

Self-collected (with the Twitter API) ego and alter timelines with all tweet fields from the [**GET /2/users** endpoint](https://developer.twitter.com/en/docs/twitter-api/users/lookup/api-reference/get-users) and all user fields except for `profile_image_url`.

## Methods

### Data collection

We will sign up for Twitter’s API to collect data. We will generate $3\times30,000$ random numbers in a range from $0$ to $3,000,000,000$ as in the paper and use the [**GET /2/users** endpoint](https://developer.twitter.com/en/docs/twitter-api/users/lookup/api-reference/get-users) to collect active and public user information with all tweet fields and all user fields except for `profile_image_url`.

### Building the network

We will use networkx to build a directed network of followers in which nodes are users (egos and alters) and edges are follow relationships (excluding the follow relationships among alters). Next, we will build a second network in which the relationships among the alters of active egos are included.
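The two networks described above could be built along the following lines. This is only a minimal sketch on toy data: the node names, the `role` attribute, the helper `build_ego_network` and the edge direction (follower to followee) are assumptions made for illustration, not the final implementation.

```python
import networkx as nx

# Hypothetical toy data: each ego maps to the alters (followees) it follows
ego_followees = {
    "ego_a": ["alter_1", "alter_2"],
    "ego_b": ["alter_2", "alter_3"],
}
# Hypothetical follow relationships among alters (used in the second network only)
alter_follows = [("alter_1", "alter_2"), ("alter_3", "alter_2")]


def build_ego_network(ego_followees, alter_edges=()):
    """Build a directed graph with one edge per follow relationship.

    Assumption: an edge (u, v) means "u follows v".
    """
    graph = nx.DiGraph()
    for ego, followees in ego_followees.items():
        graph.add_node(ego, role="ego")
        for alter in followees:
            # Do not overwrite the role of a node that is already known
            if alter not in graph:
                graph.add_node(alter, role="alter")
            graph.add_edge(ego, alter)
    # Optionally add the relationships among alters
    graph.add_edges_from(alter_edges)
    return graph


# First network: ego -> alter edges only
first_network = build_ego_network(ego_followees)
# Second network: relationships among alters are included as well
second_network = build_ego_network(ego_followees, alter_follows)

print(first_network.number_of_edges(), second_network.number_of_edges())
```

Keeping both graphs in one helper makes it easy to check that the second network only differs from the first by the alter-to-alter edges.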

### Calculating retweet probability

For each ego, we will count the number of followees who have retweeted a post (exposures) on a certain topic at a certain date. This yields information of the following form: an ego $i$ was exposed exactly once to $200$ posts about this topic, among which $i$ retweeted $50$ (probability is $50/200 = 25\%$); at the same time, $i$ was exposed twice to $100$ posts about this topic, among which $i$ retweeted $50$ (probability is $50/100 = 50\%$); … Finally, we will calculate a sequence of probabilities, one per exposure count, for each ego. (Same procedure as in the paper.) If there is sufficient data, we will apply a t-test to identify whether the distribution of retweets for each exposure count is significantly different from one topic to the other. We will also try to compare the distribution of retweet probabilities between topics.
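The worked example above can be reproduced with a few lines of Python. The sketch below assumes that the observations for one ego and one topic are available as `(exposure_count, retweeted)` pairs; this data layout and the function name `retweet_probabilities` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def retweet_probabilities(observations):
    """Return {exposure_count: retweet probability} for one ego and one topic.

    `observations` is an iterable of (exposure_count, retweeted) pairs, one
    pair per post the ego was exposed to (assumed layout).
    """
    exposed = Counter()
    retweeted = Counter()
    for exposure_count, did_retweet in observations:
        exposed[exposure_count] += 1
        if did_retweet:
            retweeted[exposure_count] += 1
    # Probability = retweeted posts / posts seen, per exposure count
    return {k: retweeted[k] / exposed[k] for k in sorted(exposed)}


# Numbers from the example: 200 posts seen once (50 retweeted),
# 100 posts seen twice (50 retweeted)
observations = (
    [(1, True)] * 50
    + [(1, False)] * 150
    + [(2, True)] * 50
    + [(2, False)] * 50
)
probabilities = retweet_probabilities(observations)
print(probabilities)  # {1: 0.25, 2: 0.5}
```

The resulting dictionary is the per-ego sequence of probabilities indexed by exposure count, which is the quantity to be compared across topics.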

### Community detection

We will use the second ego networks to compute clustering coefficients and betweenness of the active egos.
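As a rough sketch of this step, the metrics could be computed with networkx as shown below. The toy graph is hypothetical, and computing both metrics on the undirected projection of the follower graph is an assumption of this example, not a decision stated in the proposal.

```python
import networkx as nx

# Hypothetical second network: ego -> alter edges plus one alter -> alter edge
graph = nx.DiGraph()
graph.add_edges_from(
    [
        ("ego_a", "alter_1"),
        ("ego_a", "alter_2"),
        ("alter_1", "alter_2"),
        ("ego_b", "alter_2"),
        ("ego_b", "alter_3"),
    ]
)
egos = ["ego_a", "ego_b"]

# Assumption: ignore edge direction when measuring clustering and betweenness
undirected = graph.to_undirected()
clustering = nx.clustering(undirected, nodes=egos)
betweenness = nx.betweenness_centrality(undirected, normalized=True)

ego_metrics = {
    ego: {
        "clustering": clustering[ego],
        "betweenness": betweenness[ego],
        # Number of followees = outgoing edges in the directed graph
        "n_followees": graph.out_degree(ego),
    }
    for ego in egos
}
print(ego_metrics)
```

In this toy graph, the two alters of `ego_a` follow each other, so its clustering coefficient is maximal, while `ego_b` is the only link to `alter_3` and therefore has a non-zero betweenness.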

### Data analysis

We will compare the retweet probabilities based on betweenness, number of followers and clustering for each topic, much like in *Figure 6* in the paper.
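One possible way to set up this comparison is sketched below with pandas. The table of per-ego values is made up, and splitting egos into "low" and "high" groups at the per-topic median is an assumption of this example; the paper's figure may use a different grouping.

```python
import pandas as pd

# Hypothetical per-ego table: one row per ego and topic
egos = pd.DataFrame(
    {
        "topic": ["food"] * 4 + ["politics"] * 4,
        "betweenness": [0.0, 0.1, 0.4, 0.5, 0.0, 0.2, 0.3, 0.6],
        "retweet_probability": [0.10, 0.20, 0.30, 0.40, 0.20, 0.20, 0.50, 0.50],
    }
)


def compare_by_metric(df, metric):
    """Mean retweet probability of low vs high egos for `metric`, per topic."""
    # Split the egos of each topic at that topic's median value of the metric
    median = df.groupby("topic")[metric].transform("median")
    group = (df[metric] > median).map({False: "low", True: "high"})
    return (
        df.assign(group=group)
        .groupby(["topic", "group"])["retweet_probability"]
        .mean()
        .unstack("group")
    )


summary = compare_by_metric(egos, "betweenness")
print(summary)
```

The same helper can be called with a clustering or follower-count column to produce one table per metric.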

---

## Initialisation

In [1]:
# Import libraries
import random
import numpy as np
import pandas as pd
from tqdm.autonotebook import tqdm, trange
import requests
import time
from datetime import datetime



In [2]:
# Enable auto-formatting

%load_ext lab_black

In [3]:
# Define constants

# UserID range
LOWER_ID_N = 0
UPPER_ID_N = int(3e9)

# UserID number
N_UID_PER_REQUEST = int(3e4)
N_UID_REQUESTS = 3

# --------------------------------------------------

# Choose whether to generate new UserIDs
CREATE_NEW_UIDs = False

# Choose whether to collect user data
COLLECT_USER_DATA = False
# Select the batch to query if collecting user data
REQUEST_NUMBER = 2
# Define behaviour depending on the run number
FIRST_RUN = False

# --------------------------------------------------

# Data folder location
DATA_FOLDER = "./data/"
# UIDs file name
UIDS_FILE = DATA_FOLDER + "uids.csv"
USERS_FILE = DATA_FOLDER + "users.csv"
# File containing the bearer token
BEARER_TOKEN = DATA_FOLDER + "bearer_token.auth"

# API endpoints
API_USERS_ENDPOINT = "https://api.twitter.com/2/users?ids="
API_USER_FIELDS = "user.fields=created_at,description,entities,id,location,name,pinned_tweet_id,protected,public_metrics,url,username,verified,withheld"
API_TWEET_FIELDS = "tweet.fields=attachments,author_id,context_annotations,conversation_id,created_at,entities,geo,id,in_reply_to_user_id,lang,non_public_metrics,public_metrics,organic_metrics,promoted_metrics,possibly_sensitive,referenced_tweets,reply_settings,source,text,withheld"
API_USER_TIMELINE_ENDPOINT = (
    "https://api.twitter.com/1.1/statuses/user_timeline.json?user_id="
)

# Random seed
SEED = 30
random.seed(SEED)

---

## Data collection

In this part, we will generate random user IDs and collect their respective user information if they exist.

In [4]:
# Generate UIDs
if CREATE_NEW_UIDs:
    uids = pd.DataFrame(
        np.array(
            random.sample(
                range(LOWER_ID_N, UPPER_ID_N), N_UID_PER_REQUEST * N_UID_REQUESTS
            )
        ).reshape(N_UID_PER_REQUEST, N_UID_REQUESTS)
    )
    uids.to_csv(UIDS_FILE, index=False)
else:
    uids = pd.read_csv(UIDS_FILE)

In [5]:
# Load bearer token
with open(BEARER_TOKEN, "r") as file:
    token = file.readline().strip("\n")

# Define authentication header
headers = {"Authorization": "Bearer " + token}

In [6]:
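def wait_for_reset(r):
    """Sleep until the rate limit window given in the response headers resets."""
    # The header holds the reset time as a Unix timestamp (in seconds)
    ts = int(r.headers["x-rate-limit-reset"])
    # datetime.now() and datetime.fromtimestamp() both return local time
    print(
        "Current time: {} (local time)".format(
            datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        )
    )
    print(
        "Waiting until {} (local time) for the rate limit to reset.".format(
            datetime.fromtimestamp(ts).strftime("%Y-%m-%d %H:%M:%S")
        )
    )
    # time.sleep() rejects negative durations (the reset may already have passed)
    time.sleep(max(ts - time.time(), 0))
    print("Resuming user data collection")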


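def get_user_data(req, headers=headers, wait=True):
    """Query the endpoint; if no requests remain, wait for the reset and query again."""
    # Query the user data
    r = requests.get(req, headers=headers)
    # `and` is needed here: `&` binds more tightly than `==`
    if wait and int(r.headers["x-rate-limit-remaining"]) == 0:
        wait_for_reset(r)

        # Query the user data again once the limit has reset
        r = requests.get(req, headers=headers)

    return r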


def get_user_data_df(r):
    """Convert the "data" part of a user lookup response into a dataframe."""
    df = pd.DataFrame(
        # "data" is absent when none of the requested IDs could be returned
        r.json().get("data", []),
        columns=[
            "id",
            "username",
            "name",
            "protected",
            "withheld",
            "verified",
            "created_at",
            "location",
            "public_metrics",
            "description",
            "url",
            "entities",
            "pinned_tweet_id",
        ],
        # Replace NaNs by empty strings to facilitate pre-processing
    ).fillna("")
    return df


# Get user data
if COLLECT_USER_DATA:
    for i in trange(N_UID_PER_REQUEST // 100):
        # Get 100 UserIDs (limit per request as defined by Twitter)
        users = uids.values[i * 100 : (i + 1) * 100, REQUEST_NUMBER]
        # Define the request URL
        req = (
            API_USERS_ENDPOINT
            + ",".join([str(user) for user in users])
            + "&"
            + API_USER_FIELDS
            + "&"
            + API_TWEET_FIELDS
        )

        # Create the dataframe on the first iteration
        if i == 0:
            # Query the user data
            r = get_user_data(req, headers=headers)

            # If the rate limit is not maximal, then wait for the reset to occur
            # (max 15 minutes)
            if int(r.headers["x-rate-limit-remaining"]) != 299:
                wait_for_reset(r)

                # Query the user data
                r = get_user_data(req, headers=headers)

            df = get_user_data_df(r)
        # Append to existing dataframe on other iterations
        # but do not wait for reset for the last iteration
        elif i == 299:
            # Query the user data
            r = get_user_data(req, headers=headers, wait=False)
            # Append new data to existing dataframe
            # (pd.concat is used because DataFrame.append was removed in pandas 2.0)
            df = pd.concat([df, get_user_data_df(r)])
        else:
            # Query the user data
            r = get_user_data(req, headers=headers)
            # Append new data to existing dataframe
            df = pd.concat([df, get_user_data_df(r)])

In [7]:
# Preprocess the data


def get_key_val(x, key):
    """Get dictionary value from key if it exists, otherwise return an empty string."""
    if key in x:
        return x[key]
    else:
        return ""


def get_public_metrics(df):
    """Extract the data from the public_metrics column"""
    for metric in ["followers_count", "following_count", "tweet_count", "listed_count"]:
        df[metric] = df.public_metrics.apply(lambda x: get_key_val(x, metric))
    df.pop("public_metrics")
    return df


def get_entities(df):
    """Extract the data from the entity column"""
    for entity in ["url", "description"]:
        df["entities_" + entity] = df.entities.apply(lambda x: get_key_val(x, entity))
    df.pop("entities")
    return df


if COLLECT_USER_DATA:
    df = get_public_metrics(df)
    df = get_entities(df)
    df = df.astype({"id": int, "protected": bool, "verified": bool})
    print("Number of valid users: {:,}".format(df.shape[0]))
    df

In [9]:
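if not FIRST_RUN:
    # Reading "protected" and "verified" with dtype=bool raised
    # "ValueError: cannot safely convert passed user dtype of bool for object
    # dtyped data in column 3", so the flags are converted after loading.
    user_data = pd.read_csv(USERS_FILE, dtype={"id": int}, lineterminator="\n")
    for col in ["protected", "verified"]:
        # Assumption: flags are stored as "True"/"False"; any other value
        # (e.g. an empty cell) is treated as False
        user_data[col] = user_data[col].astype(str).str.strip().str.lower() == "true"
    if COLLECT_USER_DATA:
        user_data = pd.concat([user_data, df], ignore_index=True)
elif COLLECT_USER_DATA:
    # On the first run there is no file yet, so start from the collected data
    user_data = df

if COLLECT_USER_DATA:
    # Save data to disk
    user_data.to_csv(USERS_FILE, index=False)

# Public accounts are those that are not protected
is_public = ~user_data.protected.astype(bool)

# Print statistics
print("Total number of valid users: {:,}".format(user_data.shape[0]))
print("Number of public accounts: {:,}".format(user_data[is_public].shape[0]))
print(
    "--> With at least one tweet: {:,}".format(
        user_data[is_public & (user_data.tweet_count > 0)].shape[0]
    )
)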

---

In [None]:
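# Stop "Run all" here so that the test cells below only run on demand
raise RuntimeError("Stopping before the test section")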

## Test section

In [None]:
int(r.headers["x-rate-limit-remaining"])

In [None]:
req = (
    API_USERS_ENDPOINT
    + "783214,15994119,1320117356"
    + "&"
    + API_USER_FIELDS
    + "&"
    + API_TWEET_FIELDS
)
print(req)
r = requests.get(req, headers=headers)

In [None]:
# Reuse the helper from the data collection section to build the dataframe
df = get_user_data_df(r)
df