# Creative extension analysis notebook

---

**Authors**

- Jérémy Bensoussan
- Ekaterina Kryukova
- Jules Triomphe

---

## Abstract

While the paper examines the exposure hypothesis for all topics, we propose to analyze and compare political and food-related tweets. To do so, we plan to obtain egos, alters and their timelines from Twitter’s API by generating $3\times30,000$ random user IDs in the range from $0$ to $3,000,000,000$. Then, we intend to identify egos’ retweets using official RTs, classify tweets by topic based on hashtags (and keywords if the dataset is lacking content) and create a follower/followee graph. To calculate the probability that egos retweet alters’ tweets, we will use a more solid approach than the one described in the paper “Differences in the Mechanics of Information Diffusion Across Topics”, where the probability is equal to the number of users who were exposed $k$ times to a hashtag and retweeted before the ($k+1$)-th exposure, divided by the number of users who were exposed $k$ times to the hashtag. Finally, we plan to visualize the results and analyze the probability by breaking down users based on betweenness, clustering coefficient and number of followees.

## Research Questions

1. Is there a significant difference between the probability of retweeting when the tweet is about food and when it is about politics?
2. Is there a significant difference in the number of times a tweet is retweeted depending on whether it is about food or politics?
3. Is there a relation between the retweet probabilities and user betweenness, cluster size or number of friends?

## Proposed dataset

Self-collected (with the Twitter API) ego and alter timelines with all tweet fields from the [**GET /2/users** endpoint](https://developer.twitter.com/en/docs/twitter-api/users/lookup/api-reference/get-users) and all user fields except for `profile_image_url`.

## Methods

### Data collection

We will sign up for Twitter’s API to collect data. We will generate $3\times30,000$ random numbers in a range from $0$ to $3,000,000,000$ as in the paper and use the [**GET /2/users** endpoint](https://developer.twitter.com/en/docs/twitter-api/users/lookup/api-reference/get-users) to collect active and public user information with all tweet fields and all user fields except for `profile_image_url`.

### Building the network

We will use networkx to build a directed follower network in which nodes are users (egos and alters) and edges are the following relationships between egos and alters (the following relationships among alters are left out). Next, we will build a second network in which the relationships among the alters of active egos are included.
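
As a rough illustration of the difference between the two networks, here is a minimal, library-free sketch. The toy users, the `followees` mapping, the `alter_links` set and both helper functions are hypothetical names introduced for this example only; in the actual analysis the resulting edge lists could be passed to a networkx `DiGraph` via `add_edges_from`.

```python
# Hypothetical toy data: ego -> set of accounts the ego follows (its alters).
followees = {
    "ego_a": {"alter_1", "alter_2"},
    "ego_b": {"alter_2", "alter_3"},
}
# Hypothetical follow relationships among alters, as (follower, followee) pairs.
alter_links = {("alter_1", "alter_2"), ("alter_3", "alter_1")}


def ego_edges(followees):
    """Directed edges (follower, followee) between egos and their alters only."""
    return {(ego, alter) for ego, alters in followees.items() for alter in alters}


def with_alter_edges(followees, alter_links, active_egos):
    """Ego-alter edges plus the alter-alter edges inside each active ego's neighbourhood."""
    edges = ego_edges(followees)
    for ego in active_egos:
        alters = followees[ego]
        edges |= {(u, v) for (u, v) in alter_links if u in alters and v in alters}
    return edges


first_network = ego_edges(followees)
second_network = with_alter_edges(followees, alter_links, active_egos=["ego_a"])
print(len(first_network), len(second_network))
```

Only the alter-alter link whose two endpoints are both followed by the active ego is added to the second network, which keeps the number of extra follower queries bounded by the number of active egos.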

### Calculating retweet probability

For each ego, we will count the number of followees who have retweeted a post (exposures) on a certain topic at a certain date. This gives information of the following kind: an ego $i$ was exposed exactly once to $200$ posts about this topic, of which $i$ retweeted $50$ (probability $50/200 = 25\%$); at the same time, $i$ was exposed twice to $100$ posts about this topic, of which $i$ retweeted $50$ (probability $50/100 = 50\%$); … Finally, we will calculate a sequence of probabilities for each ego, following the same procedure as in the paper. If there is sufficient data, we will apply a t-test to identify whether the distribution of retweets for each exposure count is significantly different from one topic to the other. We will also try to compare the distribution of retweet probabilities between topics.
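
The per-ego computation described above can be sketched as follows. This is only an illustration under assumed inputs: the `observations` list (one `(exposure count, retweeted?)` pair per post) and the function name are hypothetical and do not come from the collected dataset.

```python
from collections import defaultdict

# Hypothetical records for one ego and one topic:
# (number of followees who shared the post, whether the ego retweeted it).
observations = [(1, True), (1, False), (1, False), (1, False), (2, True), (2, False)]


def retweet_probability_by_exposure(observations):
    """Return {k: retweeted / exposed} for each exposure count k."""
    exposed = defaultdict(int)
    retweeted = defaultdict(int)
    for k, did_retweet in observations:
        exposed[k] += 1
        if did_retweet:
            retweeted[k] += 1
    return {k: retweeted[k] / exposed[k] for k in sorted(exposed)}


probabilities = retweet_probability_by_exposure(observations)
print(probabilities)
```

With these toy records the ego retweets one of the four posts seen once and one of the two posts seen twice, reproducing the $25\%$ and $50\%$ values of the example at a smaller scale.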

### Community detection

We will use the second ego networks to compute clustering coefficients and betweenness of the active egos.
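
To make the clustering coefficient concrete, here is a small library-free sketch of the local clustering coefficient on an undirected toy graph. The graph and the `local_clustering` helper are hypothetical; treating the follower graph as undirected is a simplifying assumption of this example. In practice networkx offers `clustering` and `betweenness_centrality` for these measures.

```python
from itertools import combinations

# Hypothetical undirected toy graph as an adjacency mapping.
neighbours = {
    "ego": {"a", "b", "c"},
    "a": {"ego", "b"},
    "b": {"ego", "a"},
    "c": {"ego"},
}


def local_clustering(neighbours, node):
    """Fraction of pairs of a node's neighbours that are themselves linked."""
    nbrs = neighbours[node]
    if len(nbrs) < 2:
        # No pair of neighbours exists, so the coefficient is defined as 0 here.
        return 0.0
    pairs = list(combinations(sorted(nbrs), 2))
    linked = sum(1 for u, v in pairs if v in neighbours[u])
    return linked / len(pairs)


print(local_clustering(neighbours, "ego"))
```

Here the ego has three neighbours and only one of the three possible neighbour pairs is connected, so its coefficient is one third.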

### Data analysis

We will compare the retweet probabilities based on betweenness, number of followers and clustering for each topic, much like in *Figure 6* in the paper.

---

## Initialisation

Import modules.

In [1]:
# Import libraries
import random
import numpy as np
import pandas as pd
from tqdm.auto import tqdm, trange
import requests
import time
from datetime import datetime
import os
import errno

Set up automatic formatting (requires the `nb-black` package).

In [2]:
# Enable auto-formatting

%load_ext lab_black

### Control center

**This is the control center. All operations are decided here to avoid memory overflow and excessive computation times. This notebook should be run FROM THE TOP once these parameters have been set.** If in doubt, ask Jules ;)

In [3]:
# Define constants

# UserID range
LOWER_ID_N = 0
UPPER_ID_N = int(3e9)

# UserID number
N_UID_PER_REQUEST = int(3e4)
N_UID_REQUESTS = 3

# --------------------------------------------------

# Choose whether to generate new UserIDs
CREATE_NEW_UIDs = False

# Choose whether to collect user data
COLLECT_USER_DATA = False
# Select the batch to query if collecting user data
REQUEST_NUMBER = 2
# Define behaviour depending on the run number.
# If this is True then COLLECT_USER_DATA must be True
FIRST_RUN = False

# Choose whether to create user subset files
CREATE_USER_SUBSETS = False

# Choose whether to create/reset data pull status
# This also controls whether the PUBLIC_USERS_TIMELINES_FILE is overwritten
CREATE_DATA_PULL_STATUS = False
# Select the users to pull (from public_users_w_tweets)
PULL_START = 0
PULL_END = PULL_START + 1000
# Choose whether to pull new data and save it
PULL_NEW_TIMELINE_DATA = True
N_RUNS_TIMELINE_DATA = 100
# Choose whether to save newly pulled data
SAVE_PULLED_DATA = True

# --------------------------------------------------

# Data folder location
DATA_FOLDER = "./data/"
# UIDs
UIDS_FILE = DATA_FOLDER + "uids.csv"
# User files
USERS_FOLDER = DATA_FOLDER + "users/"
USERS_FILE = USERS_FOLDER + "users.csv"
PUBLIC_USERS_FILE = USERS_FOLDER + "public_users.csv"
PUBLIC_USERS_W_TWEETS_FILE = USERS_FOLDER + "public_users_w_tweets.csv"
# Pulled data
PUBLIC_USERS_PULL_STATUS_FILE = (
    DATA_FOLDER + f"public_users_pull_status_{PULL_START:05}_to_{PULL_END:05}.csv"
)
# Timeline files
TIMELINES_FOLDER = DATA_FOLDER + "timelines/"
PUBLIC_USERS_TIMELINES_FILE = (
    TIMELINES_FOLDER + f"public_users_timelines_{PULL_START:05}_to_{PULL_END:05}.csv"
)
# File containing the bearer token
BEARER_TOKEN = DATA_FOLDER + "bearer_token.auth"

# API endpoints
API_USERS_ENDPOINT = "https://api.twitter.com/2/users?ids="
API_USER_FIELDS = "user.fields=created_at,id,protected,public_metrics,username,verified"
# API_USER_FIELDS = "user.fields=created_at,description,entities,id,location,name,pinned_tweet_id,protected,public_metrics,url,username,verified,withheld"
API_TWEET_FIELDS = "tweet.fields="
# API_TWEET_FIELDS = "tweet.fields=attachments,author_id,context_annotations,conversation_id,created_at,entities,geo,id,in_reply_to_user_id,lang,non_public_metrics,public_metrics,organic_metrics,promoted_metrics,possibly_sensitive,referenced_tweets,reply_settings,source,text,withheld"
API_V1_RATE_LIMITS = "https://api.twitter.com/1.1/application/rate_limit_status.json?resources=application,statuses,followers,friends"
API_USER_TIMELINE_ENDPOINT = "https://api.twitter.com/1.1/statuses/user_timeline.json"

# Random seed
SEED = 30
random.seed(SEED)

Let's create the folders if they do not exist.

In [4]:
for folder in [DATA_FOLDER, USERS_FOLDER, TIMELINES_FOLDER]:
    if not os.path.exists(folder):
        os.makedirs(folder)

---

## Data collection

In this part, we will generate random user IDs and collect their respective user information if they exist.

### UID generation

Let's create random UIDs in the 0-3 billion range as discussed in the abstract.

We reshape them to simplify queries due to Twitter's API's rate limits.

If they have already been generated, we load them.

In [5]:
if CREATE_NEW_UIDs:
    uids = pd.DataFrame(
        np.array(
            random.sample(
                range(LOWER_ID_N, UPPER_ID_N), N_UID_PER_REQUEST * N_UID_REQUESTS
            )
        ).reshape(N_UID_PER_REQUEST, N_UID_REQUESTS)
    )
    uids.to_csv(UIDS_FILE, index=False)
else:
    uids = pd.read_csv(UIDS_FILE)
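
The sampling and reshaping step can be illustrated at a small scale without pandas or numpy. The sizes below and the variable names are hypothetical stand-ins for the notebook's constants, chosen so that the example runs instantly.

```python
import random

# Small hypothetical values; the notebook uses 30,000 IDs per batch,
# 3 batches and an upper bound of 3e9.
n_per_batch, n_batches, upper = 5, 3, 1_000

rng = random.Random(30)  # local generator, seeded for reproducibility
# Sampling from a range draws without replacement and does not build the full list.
flat = rng.sample(range(0, upper), n_per_batch * n_batches)

# Row-major reshape: row i holds the i-th ID of every batch, column j is batch j.
rows = [flat[i * n_batches : (i + 1) * n_batches] for i in range(n_per_batch)]
batch_2 = [row[2] for row in rows]
print(len(rows), len(rows[0]), batch_2)
```

Each column then corresponds to one batch of IDs, which is what selecting a column with `REQUEST_NUMBER` relies on.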

### Token load

To query Twitter's API, we need a bearer token which we load.

In [6]:
# Load bearer token
with open(BEARER_TOKEN, "r") as file:
    token = file.readline().strip("\n")

# Define authentication header
headers = {"Authorization": "Bearer " + token}

### User data collection

In this section, we will get user data from Twitter's API.

First we define a few helper functions.

In [7]:
def wait_for_reset(r):
    """Sleep until the rate limit window given in the response headers resets."""
    print(
        "Current time: {} (UTC)".format(datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S"))
    )
    # Get reset time (Unix format) and convert it to a UTC datetime so that it
    # is comparable with datetime.utcnow() (fromtimestamp would give local time)
    ts = int(r.headers["x-rate-limit-reset"])
    reset_dt = datetime.utcfromtimestamp(ts)
    ts_str = reset_dt.strftime("%Y-%m-%d %H:%M:%S")
    # Compute difference between current time and reset time
    sleep_time = (reset_dt - datetime.utcnow()).total_seconds()
    if sleep_time > 0:
        print("Waiting until {} (UTC) for the rate limit to reset.".format(ts_str))
        time.sleep(sleep_time)
    else:
        print("Reset time was: {} (UTC)".format(ts_str))
    print("Resuming user data collection")


def get_user_data(req, headers=headers, wait=True):
    # Query the user data
    r = requests.get(req, headers=headers)
    if wait and int(r.headers["x-rate-limit-remaining"]) == 0:
        wait_for_reset(r)

        # Query the user data
        r = requests.get(req, headers=headers)

    return r


def get_user_data_df(r):
    df = pd.DataFrame(
        r.json()["data"],
        columns=[
            "id",
            "username",
            "protected",
            "verified",
            "created_at",
            "public_metrics",
        ],
        # Replace NaNs by empty strings to facilitate pre-processing
    ).fillna("")
    return df
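
As a side note on the waiting logic, the remaining time can also be computed directly from epoch seconds, which avoids mixing local and UTC datetimes. The sketch below is an assumption-laden illustration: `seconds_until_reset` and the `fake_headers` values are hypothetical, and only the `x-rate-limit-reset` header name is taken from the helper functions.

```python
import time


def seconds_until_reset(headers, now=None):
    """Seconds left before the rate-limit window resets (never negative).

    `headers` is any mapping holding the `x-rate-limit-reset` epoch value;
    `now` is an epoch timestamp, injectable so the function can be tested.
    """
    now = time.time() if now is None else now
    reset_epoch = int(headers["x-rate-limit-reset"])
    return max(0.0, reset_epoch - now)


# Hypothetical header values, not a real API response.
fake_headers = {"x-rate-limit-remaining": "0", "x-rate-limit-reset": "1600000900"}
print(seconds_until_reset(fake_headers, now=1_600_000_000))
```

Because both values are Unix timestamps, the result does not depend on the time zone of the machine running the notebook.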

Now, let's query the API.

In [8]:
# Get user data
if COLLECT_USER_DATA:
    for i in trange(N_UID_PER_REQUEST // 100):
        # Get 100 UserIDs (limit per request as defined by Twitter)
        users = uids.values[i * 100 : (i + 1) * 100, REQUEST_NUMBER]
        # Define the request URL
        req = (
            API_USERS_ENDPOINT
            + ",".join([str(user) for user in users])
            + "&"
            + API_USER_FIELDS
            + "&"
            + API_TWEET_FIELDS
        )

        # Create the dataframe on the first iteration
        if i == 0:
            # Query the user data
            r = get_user_data(req)

            print(r)

            # If the rate limit is not maximal, then wait for the reset to occur
            # (max 15 minutes)
            if int(r.headers["x-rate-limit-remaining"]) != 299:
                wait_for_reset(r)

                # Query the user data
                r = get_user_data(req)

            raw_user_data = get_user_data_df(r)
        # Append to existing dataframe on other iterations
        # but do not wait for reset for the last iteration
        elif i == 299:
            # Query the user data
            r = get_user_data(req, wait=False)
            # Append new data to existing dataframe
            # (pd.concat is used because DataFrame.append was removed in pandas 2.0)
            raw_user_data = pd.concat([raw_user_data, get_user_data_df(r)])
        else:
            # Query the user data
            r = get_user_data(req)
            # Append new data to existing dataframe
            # (pd.concat is used because DataFrame.append was removed in pandas 2.0)
            raw_user_data = pd.concat([raw_user_data, get_user_data_df(r)])
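
The batching of user IDs into request URLs can be sketched in isolation, without any network call. The `batched` and `build_users_url` helpers and the toy ID list are hypothetical; the endpoint and field strings mirror the notebook's constants.

```python
def batched(items, size):
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start : start + size]


def build_users_url(ids, endpoint, user_fields):
    """Join the IDs with commas and append the user.fields parameter."""
    return endpoint + ",".join(str(i) for i in ids) + "&" + user_fields


# Hypothetical IDs; the endpoint and fields mirror the notebook's constants.
ids = list(range(1, 251))
endpoint = "https://api.twitter.com/2/users?ids="
user_fields = "user.fields=created_at,id,protected,public_metrics,username,verified"

urls = [build_users_url(chunk, endpoint, user_fields) for chunk in batched(ids, 100)]
print(len(urls))
```

Slicing by a step of 100 also handles a final, shorter batch, so the number of IDs does not need to be a multiple of the per-request limit.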

There are a few important data points we will need in the next parts, so we extract them here along with any others they are grouped with.

In [9]:
# Preprocess the data


def get_key_val(x, key):
    """Get dictionary value from key if it exists, otherwise return an empty string."""
    if key in x:
        return x[key]
    else:
        return ""


def get_public_metrics(df):
    """Extract the data from the public_metrics column"""
    for metric in ["followers_count", "following_count", "tweet_count", "listed_count"]:
        df[metric] = df.public_metrics.apply(lambda x: get_key_val(x, metric))
    df.pop("public_metrics")
    return df


def get_entities(df):
    """Extract the data from the entity column"""
    for entity in ["url", "description"]:
        df["entities_" + entity] = df.entities.apply(lambda x: get_key_val(x, entity))
    df.pop("entities")
    return df


if COLLECT_USER_DATA:
    raw_user_data = get_public_metrics(raw_user_data)
    #     raw_user_data = get_entities(raw_user_data)
    raw_user_data = raw_user_data.astype(
        {
            "id": int,
            "username": str,
            "protected": bool,
            "verified": bool,
            "created_at": str,
            "followers_count": int,
            "following_count": int,
            "tweet_count": int,
            "listed_count": int,
        }
    )
    print("Number of valid users: {:,}".format(raw_user_data.shape[0]))
    raw_user_data
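
The flattening of the nested `public_metrics` field can be shown on plain dictionaries. The records below are hypothetical and only shaped like entries of the API's `data` array; `flatten_public_metrics` is a name introduced for this sketch.

```python
# Hypothetical user records shaped like entries of the v2 `data` array.
users = [
    {"id": "1", "public_metrics": {"followers_count": 10, "tweet_count": 3}},
    {"id": "2", "public_metrics": {}},
]
METRICS = ["followers_count", "following_count", "tweet_count", "listed_count"]


def flatten_public_metrics(user, metrics=METRICS, default=""):
    """Copy the record, replacing `public_metrics` with one key per metric."""
    flat = {k: v for k, v in user.items() if k != "public_metrics"}
    nested = user.get("public_metrics") or {}
    for metric in metrics:
        flat[metric] = nested.get(metric, default)
    return flat


flat_users = [flatten_public_metrics(u) for u in users]
print(flat_users[0])
```

An empty-string default mirrors `get_key_val`, but a later cast to `int` would fail on it, so a default of `0` may be preferable if missing metrics are expected.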

We need all of the user data available for the next parts, so we append the generated data (if any) to pre-existing user data and we save the data frame.

In [10]:
# Load user data if it exists
if os.path.isfile(USERS_FILE) and not FIRST_RUN:
    user_data = pd.read_csv(
        USERS_FILE,
        dtype={
            "id": int,
            "username": str,
            "protected": bool,
            "verified": bool,
            "created_at": str,
            "followers_count": int,
            "following_count": int,
            "tweet_count": int,
            "listed_count": int,
        },
        lineterminator="\n",
    )
    if COLLECT_USER_DATA:
        # (pd.concat is used because DataFrame.append was removed in pandas 2.0)
        user_data = pd.concat([user_data, raw_user_data])
else:
    user_data = raw_user_data

if COLLECT_USER_DATA:
    # Save data to disk
    user_data.to_csv(USERS_FILE, index=False)

# Print statistics
print("Total number of valid users: {:,}".format(user_data.shape[0]))

Total number of valid users: 33,511


### User subset definition

We define and save groups of users to facilitate data manipulation later on.

In [11]:
# Extract user subsets
if CREATE_USER_SUBSETS:
    # Public users
    public_users = user_data[~user_data.protected].copy()
    # Public users with tweets
    # Tweet count includes retweets and deleted tweets
    public_users_w_tweets = user_data[
        ~user_data.protected & (user_data.tweet_count > 0)
    ].copy()

    print("Number of public users: {:,}".format(public_users.shape[0]))
    print(
        "Number of public users with tweets: {:,}".format(
            public_users_w_tweets.shape[0]
        )
    )

    public_users.to_csv(PUBLIC_USERS_FILE, index=False)
    public_users_w_tweets.to_csv(PUBLIC_USERS_W_TWEETS_FILE, index=False)

elif os.path.isfile(PUBLIC_USERS_W_TWEETS_FILE):
    public_users_w_tweets = pd.read_csv(
        PUBLIC_USERS_W_TWEETS_FILE,
        dtype={
            "id": int,
            "username": str,
            "protected": bool,
            "verified": bool,
            "created_at": str,
            "followers_count": int,
            "following_count": int,
            "tweet_count": int,
            "listed_count": int,
        },
        lineterminator="\n",
    )
else:
    raise FileNotFoundError(
        errno.ENOENT, os.strerror(errno.ENOENT), PUBLIC_USERS_W_TWEETS_FILE
    )

### Data pull status generation

As there are many queries to make, we create a dataframe here to keep track of which data has already been pulled and which data still needs to be pulled.

In [12]:
# Create pull status dataframe
if CREATE_DATA_PULL_STATUS:
    # Use public metrics to define limits
    user_data_pull_status = public_users_w_tweets[["id", "tweet_count"]][
        PULL_START:PULL_END
    ].copy()

    # Define parameters for API queries
    user_data_pull_status["timeline_lowest_id"] = 0
    user_data_pull_status["timeline_tweets_pulled"] = 0

    # Change column order for easier visualization
    user_data_pull_status = user_data_pull_status[
        [
            "id",
            "timeline_lowest_id",
            "timeline_tweets_pulled",
            "tweet_count",
        ]
    ]

    # Set id column to index
    user_data_pull_status = user_data_pull_status.set_index("id")
    # Save to file
    user_data_pull_status.to_csv(PUBLIC_USERS_PULL_STATUS_FILE)

elif os.path.isfile(PUBLIC_USERS_PULL_STATUS_FILE):
    user_data_pull_status = pd.read_csv(
        PUBLIC_USERS_PULL_STATUS_FILE,
        dtype=int,
        index_col="id",
    )
else:
    raise FileNotFoundError(
        errno.ENOENT, os.strerror(errno.ENOENT), PUBLIC_USERS_PULL_STATUS_FILE
    )

### User timeline collection

As part of our analysis, we need to collect users' timelines. This is what we do here.

First we define a few helper functions whose names are pretty explicit, then we move on to actually query the data before saving it along with the data pull status.

In [13]:
def get_user_timeline_rate_limit():
    r = requests.get(API_V1_RATE_LIMITS, headers=headers)
    limits = r.json()["resources"]["statuses"]["/statuses/user_timeline"]
    remaining = limits["remaining"]
    # Get Unix timestamp
    reset_ts = limits["reset"]
    # Convert to a UTC string (it is later compared with datetime.utcnow(),
    # whereas fromtimestamp would give local time)
    reset_time = datetime.utcfromtimestamp(reset_ts).strftime("%Y-%m-%d %H:%M:%S")
    return remaining, reset_time


def get_initial_timeline_df():
    return pd.DataFrame(
        columns=[
            "user_id",
            "id",
            "user_mentions",
        ]
    )


def get_user_timeline_df(r):
    df = pd.DataFrame(
        r.json(),
        columns=[
            "user_id",
            # User profile
            "user",
            "id",
            # This feature contains the user mentions
            "entities",
            # This feature is expected to be NaN
            "user_mentions",
        ],
        # Replace NaNs by empty strings to facilitate pre-processing
    ).fillna("")
    # Fill in user_id with tweet UserID
    df.user_id = df.user.apply(lambda x: x["id"])

    def get_user_mentions_ids(x):
        users_mentioned = x["user_mentions"]
        user_mentions = []
        for i in range(len(users_mentioned)):
            user_mentions.append(users_mentioned[i]["id"])
        return ";".join(str(x) for x in user_mentions)

    df.user_mentions = df.entities.apply(lambda x: get_user_mentions_ids(x))
    df.drop(columns=["user", "entities"], inplace=True)
    return df


def get_user_timeline(query_n, user_id, max_id, count):
    req = API_USER_TIMELINE_ENDPOINT
    params = {
        "user_id": str(user_id),
        "count": str(count),
        "include_rts": "1",
    }
    if max_id > 0:
        params.update({"max_id": str(max_id - 1)})

    r = requests.get(req, headers=headers, params=params)
    df = get_user_timeline_df(r)

    n_tweets_pulled = len(r.json())
    if n_tweets_pulled < count:
        print(r.url)
        print(
            "Query {:,} -- ".format(query_n + 1).ljust(15)
            + "User {}: got {:,} tweets instead of {:,}.".format(
                user_id, n_tweets_pulled, count
            )
        )
        lowest_id = -1
    if df.shape[0] > 0:
        lowest_id = int(df.id.min())

    return df, lowest_id, n_tweets_pulled
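
The `max_id` paging used by `get_user_timeline` can be illustrated offline. In this sketch, `TIMELINE`, `fake_fetch` and `pull_all` are hypothetical stand-ins for the API and are not part of the notebook; the only behaviour assumed from the API is that `max_id` is inclusive, which is why the helper above passes `max_id - 1`.

```python
# Hypothetical in-memory timeline: tweet IDs in descending order (newest first).
TIMELINE = list(range(1000, 0, -7))


def fake_fetch(max_id=None, count=200):
    """Return up to `count` tweet IDs that are <= max_id, newest first."""
    eligible = [t for t in TIMELINE if max_id is None or t <= max_id]
    return eligible[:count]


def pull_all(fetch, count):
    """Page backwards through a timeline using the max_id convention."""
    collected, max_id = [], None
    while True:
        page = fetch(max_id=max_id, count=count)
        if not page:
            break
        collected.extend(page)
        # max_id is inclusive, so subtract 1 to avoid re-fetching the oldest tweet.
        max_id = min(page) - 1
    return collected


tweets = pull_all(fake_fetch, count=50)
print(len(tweets), len(set(tweets)))
```

Storing the lowest ID seen so far, as the pull status dataframe does with `timeline_lowest_id`, is enough to resume paging in a later session.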

Having defined our helper functions, we now create an empty dataframe for our user timeline data and query the API for as much data as possible until we hit the rate limit. Similar sections are run multiple times, over several days, to query all of the necessary data.

In [14]:
tmp_user_timeline_data = get_initial_timeline_df()

if PULL_NEW_TIMELINE_DATA:
    for n in range(N_RUNS_TIMELINE_DATA):
        print("Starting pull sequence...")

        tmp_user_timeline_data = get_initial_timeline_df()

        # Get the number of available queries and rate limit reset time
        query_quota, reset_time = get_user_timeline_rate_limit()

        # Get users with tweets left to pull
        user_timelines_to_pull = user_data_pull_status[
            (user_data_pull_status.tweet_count > 0)
            & (
                user_data_pull_status.timeline_tweets_pulled
                < user_data_pull_status.tweet_count
            )
            # The API limits pulls to the 3.2k most recent tweets
            & (user_data_pull_status.timeline_tweets_pulled < 3200)
        ]

        skip_sleep_time = user_timelines_to_pull.shape[0] < query_quota
        n_queries = min(query_quota, user_timelines_to_pull.shape[0])
        if n_queries > 0:
            print("Executing {:,} queries.".format(n_queries))
        else:
            print("\nNo more tweets to pull!\n")
            break
        for query_n in trange(n_queries):

            # Get query parameters
            user_id = user_timelines_to_pull.index[query_n]
            max_id = user_timelines_to_pull.loc[user_id, "timeline_lowest_id"]
            # A 200-tweet limit is set by Twitter per request
            count = min(
                min(user_timelines_to_pull.loc[user_id, "tweet_count"], 3200)
                - user_timelines_to_pull.loc[user_id, "timeline_tweets_pulled"],
                200,
            )

            # Get user timeline data and statistics
            raw_user_timeline_data, lowest_id, n_tweets_pulled = get_user_timeline(
                query_n, user_id, max_id, count
            )

            # Append to existing user timeline data
            # (pd.concat is used because DataFrame.append was removed in pandas 2.0)
            tmp_user_timeline_data = pd.concat(
                [tmp_user_timeline_data, raw_user_timeline_data]
            )

            # Update pull status
            user_data_pull_status.loc[user_id, "timeline_lowest_id"] = lowest_id
            # Adding to 0 on the first pull is equivalent to setting the value
            user_data_pull_status.loc[user_id, "timeline_tweets_pulled"] += count

        print("Next reset time: {} (UTC)".format(reset_time))

        print("\nPull is done!\n")

        # Define user timelines dataframe
        if os.path.isfile(PUBLIC_USERS_TIMELINES_FILE) and not CREATE_DATA_PULL_STATUS:
            user_timeline_data = pd.read_csv(PUBLIC_USERS_TIMELINES_FILE)
            # (pd.concat is used because DataFrame.append was removed in pandas 2.0)
            user_timeline_data = pd.concat([user_timeline_data, tmp_user_timeline_data])
        else:
            user_timeline_data = tmp_user_timeline_data

        print(
            "Number of collected tweets: {:,} ({:,} unique) out of {:,}.\nNumber of unique users: {:,} (out of {:,}).".format(
                user_timeline_data.shape[0],
                len(np.unique(user_timeline_data.id.values)),
                np.sum(user_data_pull_status.tweet_count.values),
                len(np.unique(user_timeline_data.user_id.values)),
                public_users_w_tweets.shape[0],
            )
        )

        if SAVE_PULLED_DATA:
            # Save data to disk
            user_timeline_data.to_csv(PUBLIC_USERS_TIMELINES_FILE, index=False)
            user_data_pull_status.to_csv(PUBLIC_USERS_PULL_STATUS_FILE)
            print("Data saved!")

        # Get new reset time
        query_quota, reset_time = get_user_timeline_rate_limit()

        # Sleep until rate limit reset
        sleep_time = (
            datetime.strptime(reset_time, "%Y-%m-%d %H:%M:%S") - datetime.utcnow()
        ).total_seconds()

        if sleep_time > 0 and not skip_sleep_time:
            print(
                "Waiting for {:,.0f} seconds to continue (until {} (UTC)).\n".format(
                    sleep_time, reset_time
                )
            )
            time.sleep(sleep_time)

    print("Loop is done!")

Starting pull sequence...

No more tweets to pull!

Loop is done!
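
The per-request `count` used in the loop above combines two limits mentioned in the code comments: at most 3,200 reachable tweets per user and at most 200 tweets per request. A minimal sketch of that arithmetic, with hypothetical names (`next_count`, `API_TIMELINE_CAP`, `PAGE_SIZE`) and toy inputs:

```python
API_TIMELINE_CAP = 3200  # most recent tweets reachable per user, per the notebook
PAGE_SIZE = 200  # maximum tweets per request, per the notebook


def next_count(tweet_count, already_pulled):
    """Number of tweets to request next for one user (0 when nothing is left)."""
    reachable = min(tweet_count, API_TIMELINE_CAP)
    return max(0, min(reachable - already_pulled, PAGE_SIZE))


for tweet_count, pulled in [(50, 0), (450, 400), (10_000, 3_000), (10_000, 3_200)]:
    print(tweet_count, pulled, next_count(tweet_count, pulled))
```

The extra `max(0, ...)` guard is an addition of this sketch; the notebook instead filters out finished users before computing `count`.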


---

# Break

In [15]:
break

SyntaxError: 'break' outside loop (<ipython-input-15-6aaf1f276005>, line 1)

# Test section

In [None]:
user_data_pull_status[user_data_pull_status.tweet_count > 0].head(15)

In [None]:
get_initial_followers_df().shape[0]

In [None]:
r = requests.get(
    "https://api.twitter.com/1.1/followers/ids.json?user_id=555533734&cursor=-1&count=311",
    headers=headers,
)

In [None]:
test = pd.DataFrame(
    r.json(),
    columns=[
        "user_id",
        "ids",
        "next_cursor",
    ],
    dtype=int,
).fillna("")
test

In [None]:
len(r.json()["ids"])

In [None]:
np.unique(user_timeline_data.id.values).shape[0]

In [None]:
import os

os.path.isfile(PUBLIC_USERS_FILE)

In [None]:
pd.DataFrame(
    columns=[
        "user_id",
        "user",
        "id",
        "created_at",
        "text",
        "in_reply_to_status_id",
        "in_reply_to_user_id",
        "source",
        "truncated",
        "coordinates",
        "place",
        "is_quote_status",
        "quoted_status_id",
        "quoted_status",
        "quote_count",
        "retweeted_status",
        "retweet_count",
        "favorite_count",
        "entities",
        "extended_entities",
        "possibly_sensitive",
        "lang",
    ],
    # Replace NaNs by empty strings to facilitate pre-processing
)

In [None]:
user_data_pull_status = user_data[
    ["id", "followers_count", "following_count", "tweet_count"]
].copy()

In [None]:
user_data_pull_status["timeline_lowest_id"] = -1
user_data_pull_status["timeline_tweets_pulled"] = -1

user_data_pull_status["followers_cursor"] = -1
user_data_pull_status["followers_pulled"] = -1

user_data_pull_status["following_cursor"] = -1
user_data_pull_status["following_pulled"] = -1

In [None]:
user_data_pull_status = user_data_pull_status[
    [
        "id",
        "timeline_lowest_id",
        "timeline_tweets_pulled",
        "tweet_count",
        "followers_cursor",
        "followers_pulled",
        "followers_count",
        "following_cursor",
        "following_pulled",
        "following_count",
    ]
]
user_data_pull_status

In [None]:
req = (
    "https://api.twitter.com/1.1/application/rate_limit_status.json"
    #     + "783214,15994119,1320117356"
    #     + "&"
    #     + API_USER_FIELDS
    #     + "&"
    #     + API_TWEET_FIELDS
)
payload = {"resources": "application,statuses,followers,friends"}
print(req)
r = requests.get(req, headers=headers, params=payload)
print(r.url)

In [None]:
r.json()["resources"]["statuses"]["/statuses/user_timeline"]

In [None]:
req = API_USER_TIMELINE_ENDPOINT + "?user_id=783214" + "&count=10"
print(req)
r = requests.get(req, headers=headers)

In [None]:
r.json()

In [None]:
test = get_user_timeline_df(r)

In [None]:
test

In [None]:
test.created_at = pd.to_datetime(test.created_at)

In [None]:
test.sort_values(by="id").user[0]

In [None]:
int(r.headers["x-rate-limit-remaining"])

In [None]:
wait_for_reset(r)

In [None]:
df = pd.DataFrame(
    r.json()["data"],
    columns=[
        "id",
        "username",
        "name",
        "protected",
        "withheld",
        "verified",
        "created_at",
        "location",
        "public_metrics",
        "description",
        "url",
        "entities",
        "pinned_tweet_id",
    ],
).fillna("")
df

In [None]:
df.public_metrics[0]

In [None]:
783214 in user_data.id.values.astype(int)

In [None]:
timeline_cols = ["timeline_lowest_id", "timeline_tweets_pulled", "tweet_count"]
other_cols = [x for x in user_data_pull_status.columns if x not in timeline_cols]

In [None]:
user_data_pull_status_timeline = user_data_pull_status[timeline_cols].copy()
user_data_pull_status_timeline.head()

In [None]:
user_data_pull_status_timeline.to_csv(
    DATA_FOLDER + "public_users_pull_status_timeline.csv"
)

In [None]:
user_data_pull_status_ff = user_data_pull_status[other_cols].copy()
user_data_pull_status_ff.head()

In [None]:
user_data_pull_status_ff.to_csv(DATA_FOLDER + "public_users_pull_status_ff.csv")

In [None]:
user_data_pull_status_timeline.join(user_data_pull_status_ff)

In [None]:
user_data_pull_status.head()

In [None]:
np.sum(
    ~np.equal(
        user_data_pull_status.values,
        user_data_pull_status_timeline.join(user_data_pull_status_ff).values,
    )
)

In [None]:
user_timeline_data.user_id.value_counts()
# user_timeline_data.iloc[1024864]

In [None]:
test = user_timeline_data[user_timeline_data.user_id == 1256677638]
eval(
    user_timeline_data[
        ~user_timeline_data.retweeted_status.isna()
    ].retweeted_status.iloc[0]
)