# Tweet Deletes

When searching with the Twitter v2 API, responses sometimes don't come back with the requested number of results. You can use the twarc log to examine these gaps, which presumably are areas where tweets have been removed from search results because they were deleted or protected.

These show up as lines like this:

    2021-08-17 03:42:53,464 INFO archived 1267885644099002374
    
twarc should have been able to get 100 tweets but instead only got 1. It would be interesting to go through the log and examine these periods of missing tweets to get a sense of when things have been deleted. This is made a bit easier because the tweet id contains the time it was created.
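
As a quick check on that arithmetic, here is a minimal sketch that counts the ids in a single log line. It assumes, as the parsing code later in this notebook does, that the ids after `archived` are comma-separated and that a full response holds 100 tweets. `count_missing` and `EXPECTED` are hypothetical names used only for this illustration.

```python
import re

EXPECTED = 100  # assumed size of a full search response

def count_missing(line, expected=EXPECTED):
    """Return how many tweets are missing from one 'archived' log line.

    Returns None when the line is not an 'archived' message.
    """
    m = re.search(r'INFO archived (.+)', line)
    if not m:
        return None
    # the ids are assumed to be comma-separated on a single line
    ids = [s for s in m.group(1).strip().split(',') if s]
    return expected - len(ids)

line = "2021-08-17 03:42:53,464 INFO archived 1267885644099002374"
print(count_missing(line))
```

For the sample line above this reports 99 missing tweets, since only one id was archived.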

Here's a little trick for turning a tweet id into a datetime:

In [45]:
from datetime import datetime

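def id2datetime(id):
  # drop the low 22 bits to get milliseconds since the Twitter epoch
  shifted = id >> 22
  # 1288834974657 is the Twitter epoch expressed in Unix milliseconds
  timestamp = shifted + 1288834974657
  # the result is a naive datetime in UTC
  return datetime.utcfromtimestamp(timestamp/1000)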

We can try it on this tweet: https://twitter.com/jack/status/1424854924194729984

In [46]:
print(id2datetime(1424854924194729984))

2021-08-09 22:07:41.109000
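
The shift by 22 works because tweet ids follow Twitter's Snowflake layout: the high bits hold a millisecond timestamp relative to the Twitter epoch, and the low 22 bits hold a 10-bit machine id and a 12-bit sequence number. The sketch below splits an id into those parts and puts it back together. `split_id`, `join_id` and the field names are illustrative and are not part of twarc.

```python
TWITTER_EPOCH_MS = 1288834974657  # same constant used in id2datetime

def split_id(tweet_id):
    """Split a Snowflake-style tweet id into (timestamp_ms, machine, sequence)."""
    timestamp_ms = (tweet_id >> 22) + TWITTER_EPOCH_MS
    machine = (tweet_id >> 12) & 0x3FF  # next 10 bits
    sequence = tweet_id & 0xFFF         # lowest 12 bits
    return timestamp_ms, machine, sequence

def join_id(timestamp_ms, machine, sequence):
    """Inverse of split_id: rebuild the id from its parts."""
    return ((timestamp_ms - TWITTER_EPOCH_MS) << 22) | (machine << 12) | sequence

parts = split_id(1424854924194729984)
print(parts)
print(join_id(*parts) == 1424854924194729984)
```

Because the timestamp sits in the high bits, sorting ids also sorts tweets by creation time to millisecond resolution, which is why the smallest and largest ids in a response can stand in for the start and end of its time range.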


Seems to work. Now we can read a gzipped log file, look for the messages about which tweets were archived, find the min and max ids in each message, and then output the time range and the number of tweets that were missing (assuming that there should be 100).

In [54]:
import re
import gzip

def deletions(log_file):
    with gzip.open(log_file, 'rt') as log:
        for line in log:
            # each "archived" message lists the tweet ids from one response
            m = re.match(r'.+INFO archived (.+)', line)
            if not m:
                continue
            ids = sorted(map(int, m.group(1).split(',')))
            # assume a full response would have contained 100 tweets
            missing = 100 - len(ids)
            # the smallest and largest ids bound the time range of the response
            yield (id2datetime(ids[0]), id2datetime(ids[-1]), missing)
        
for start, end, deletes in deletions('data/twarc-search.log.gz'):
    print(start, end, deletes)

2021-08-12 23:38:24.980000 2021-08-13 12:26:25.651000 0
2021-08-12 13:00:12.680000 2021-08-12 23:35:06.333000 0
2021-08-12 08:39:35.335000 2021-08-12 13:00:09.484000 0
2021-08-11 10:16:24.924000 2021-08-12 07:46:05.513000 2
2021-08-10 17:17:02.215000 2021-08-11 10:13:19.141000 0
2021-08-10 09:08:27.202000 2021-08-10 17:14:30.483000 0
2021-08-10 00:02:21.354000 2021-08-10 09:05:32.419000 0
2021-08-09 21:48:08.067000 2021-08-10 00:02:21.054000 1
2021-08-09 01:43:31.300000 2021-08-09 21:44:34.025000 0
2021-08-08 12:50:28.054000 2021-08-09 00:40:57.135000 2
2021-08-08 00:14:48.016000 2021-08-08 12:50:07.159000 1
2021-08-07 14:19:01.034000 2021-08-08 00:09:24.483000 0
2021-08-07 03:49:25.412000 2021-08-07 14:18:42.718000 4
2021-08-06 22:15:20.097000 2021-08-07 03:43:39.825000 10
2021-08-06 19:57:17.632000 2021-08-06 22:15:04.495000 0
2021-08-06 17:09:48.038000 2021-08-06 19:56:17.888000 0
2021-08-06 14:24:18.022000 2021-08-06 17:08:48.552000 1
2021-08-05 21:49:45.971000 2021-08-06 14:21:25.

2021-05-02 20:44:26.929000 2021-05-03 13:04:09.347000 1
2021-05-02 14:06:51.291000 2021-05-02 20:43:06.262000 3
2021-05-02 11:19:34.173000 2021-05-02 14:04:05.705000 0
2021-05-01 23:01:13.618000 2021-05-02 11:19:33.257000 2
2021-05-01 15:06:00.461000 2021-05-01 22:54:16.627000 0
2021-05-01 09:49:07.824000 2021-05-01 15:05:57.418000 2
2021-04-30 18:08:43.830000 2021-05-01 09:45:05.203000 0
2021-04-30 10:13:53.367000 2021-04-30 17:36:16.670000 1
2021-04-29 16:56:36.573000 2021-04-30 10:12:46.295000 2
2021-04-29 13:05:58.898000 2021-04-29 16:56:26.282000 0
2021-04-28 15:20:49.283000 2021-04-29 13:03:15.722000 1
2021-04-28 05:13:09.725000 2021-04-28 15:13:41.109000 6
2021-04-27 13:53:00.736000 2021-04-28 04:52:26.416000 2
2021-04-27 00:59:54.621000 2021-04-27 13:51:28.905000 1
2021-04-26 18:11:44.378000 2021-04-27 00:59:37.730000 0
2021-04-26 13:58:41.062000 2021-04-26 18:11:11.722000 0
2021-04-26 11:47:05.674000 2021-04-26 13:58:39.017000 0
2021-04-26 06:36:59.457000 2021-04-26 11:44:47.0

2021-03-03 23:12:29.013000 2021-03-03 23:29:25.614000 1
2021-03-03 22:57:23.808000 2021-03-03 23:12:28.332000 1
2021-03-03 18:25:52.721000 2021-03-03 22:57:17.431000 0
2021-03-03 12:13:57.289000 2021-03-03 18:24:29.785000 3
2021-03-02 20:48:49.651000 2021-03-03 12:13:17.935000 2
2021-03-02 16:59:27.300000 2021-03-02 20:46:21.766000 1
2021-03-02 14:37:44.700000 2021-03-02 16:58:28.091000 0
2021-03-02 13:22:50.057000 2021-03-02 14:37:42.633000 0
2021-03-02 12:06:46.240000 2021-03-02 13:22:41.221000 0
2021-03-02 11:12:21.156000 2021-03-02 12:06:32.843000 2
2021-03-02 09:56:04.397000 2021-03-02 11:11:48.316000 1
2021-03-02 04:08:02.989000 2021-03-02 09:54:31.928000 2
2021-03-02 02:37:17.189000 2021-03-02 04:05:54.529000 2
2021-03-02 00:34:22.979000 2021-03-02 02:37:16.345000 0
2021-03-01 12:57:14.197000 2021-03-02 00:30:26.260000 1
2021-02-28 20:27:14.495000 2021-03-01 12:55:45.673000 2
2021-02-28 11:26:23.172000 2021-02-28 20:27:09.409000 2
2021-02-27 20:32:24.745000 2021-02-28 11:13:40.1

2020-11-21 02:38:52.730000 2020-11-21 02:41:41.213000 1
2020-11-21 02:34:59.043000 2020-11-21 02:38:52.686000 1
2020-11-21 02:21:44.298000 2020-11-21 02:34:38.706000 3
2020-11-21 02:10:37.038000 2020-11-21 02:21:37.225000 1
2020-11-21 02:03:15.686000 2020-11-21 02:10:35.947000 4
2020-11-21 01:57:23.769000 2020-11-21 02:03:10.834000 0
2020-11-21 01:44:04.460000 2020-11-21 01:57:22.417000 1
2020-11-21 00:51:40.528000 2020-11-21 01:43:48.982000 2
2020-11-21 00:02:25.722000 2020-11-21 00:51:30.305000 1
2020-11-20 23:28:23.757000 2020-11-21 00:01:31.933000 1
2020-11-20 22:58:10.274000 2020-11-20 23:27:36.464000 0
2020-11-20 22:20:44.754000 2020-11-20 22:57:37.866000 1
2020-11-20 21:49:15.552000 2020-11-20 22:19:51.001000 0
2020-11-20 21:26:07.962000 2020-11-20 21:48:41.715000 2
2020-11-20 20:24:12.131000 2020-11-20 21:26:05.549000 2
2020-11-20 18:12:45.890000 2020-11-20 20:22:23.126000 0
2020-11-20 16:32:18.575000 2020-11-20 18:12:13.723000 4
2020-11-20 15:06:34.270000 2020-11-20 16:31:18.5

2020-07-27 15:31:36.499000 2020-07-27 15:34:29.371000 0
2020-07-27 15:29:02.674000 2020-07-27 15:31:34.738000 2
2020-07-27 15:25:42.131000 2020-07-27 15:29:01.895000 2
2020-07-27 15:23:24.810000 2020-07-27 15:25:40.653000 0
2020-07-27 15:20:12.939000 2020-07-27 15:23:20.796000 2
2020-07-27 15:17:49.907000 2020-07-27 15:20:11.710000 1
2020-07-27 15:15:01.932000 2020-07-27 15:17:45.928000 1
2020-07-27 15:11:58.409000 2020-07-27 15:15:00.766000 1
2020-07-27 15:09:39.437000 2020-07-27 15:11:54.350000 2
2020-07-27 15:07:15.664000 2020-07-27 15:09:38.636000 2
2020-07-27 15:04:56.314000 2020-07-27 15:07:10.969000 2
2020-07-27 15:02:35.689000 2020-07-27 15:04:54.120000 0
2020-07-27 15:00:32.897000 2020-07-27 15:02:35.599000 0
2020-07-27 14:58:20.658000 2020-07-27 15:00:32.669000 1
2020-07-27 14:56:06.959000 2020-07-27 14:58:20.403000 0
2020-07-27 14:53:41.946000 2020-07-27 14:56:06.834000 0
2020-07-27 14:51:22.930000 2020-07-27 14:53:39.594000 0
2020-07-27 14:47:51.455000 2020-07-27 14:51:18.6

2020-06-03 01:10:15.640000 2020-06-03 01:12:28.541000 71
2020-06-03 01:07:57.033000 2020-06-03 01:10:13.611000 76
2020-06-03 01:05:36.058000 2020-06-03 01:07:45.412000 72
2020-06-03 01:02:32.754000 2020-06-03 01:05:16.218000 84
2020-06-03 01:00:22.738000 2020-06-03 01:02:31.116000 69
2020-06-03 00:57:21.302000 2020-06-03 01:00:04.136000 77
2020-06-03 00:55:07.345000 2020-06-03 00:57:07.373000 77
2020-06-03 00:53:22.466000 2020-06-03 00:54:57.068000 74
2020-06-03 00:49:55.432000 2020-06-03 00:52:41.020000 82
2020-06-03 00:47:17.340000 2020-06-03 00:49:48.113000 75
2020-06-03 00:44:39.461000 2020-06-03 00:47:01.984000 78
2020-06-03 00:42:14.050000 2020-06-03 00:44:23.325000 84
2020-06-03 00:40:03.249000 2020-06-03 00:42:05.429000 81
2020-06-03 00:37:07.539000 2020-06-03 00:39:39.898000 73
2020-06-03 00:34:57.236000 2020-06-03 00:36:35.252000 81
2020-06-03 00:31:43.040000 2020-06-03 00:34:35.269000 78
2020-06-03 00:28:56.747000 2020-06-03 00:31:33.655000 78
2020-06-03 00:26:55.255000 2020

2020-06-01 09:40:33.416000 2020-06-01 10:05:22.104000 7
2020-06-01 09:02:51.103000 2020-06-01 09:39:47.726000 9
2020-06-01 08:23:31.609000 2020-06-01 09:01:33.692000 20
2020-06-01 07:50:21.315000 2020-06-01 08:23:19.417000 15
2020-06-01 07:23:58.152000 2020-06-01 07:50:11.650000 14
2020-06-01 07:06:05.680000 2020-06-01 07:23:48.634000 17
2020-06-01 06:50:45.265000 2020-06-01 07:05:40.013000 11
2020-06-01 06:40:38.539000 2020-06-01 06:50:39.498000 8
2020-06-01 06:28:53.693000 2020-06-01 06:40:36.516000 12
2020-06-01 06:12:32.232000 2020-06-01 06:28:40.933000 8
2020-06-01 05:59:12.079000 2020-06-01 06:12:28.264000 14
2020-06-01 05:46:58.964000 2020-06-01 05:59:04.027000 21
2020-06-01 05:34:34.027000 2020-06-01 05:46:43.829000 12
2020-06-01 05:23:05.021000 2020-06-01 05:34:33.269000 15
2020-06-01 05:10:40.658000 2020-06-01 05:23:02.209000 15
2020-06-01 05:01:19.918000 2020-06-01 05:10:40.269000 17
2020-06-01 04:52:30.702000 2020-06-01 05:01:19.633000 12
2020-06-01 04:44:13.978000 2020-06-

2020-04-24 21:13:01.726000 2020-04-24 21:13:31.755000 1
2020-04-24 21:12:39.351000 2020-04-24 21:13:01.432000 2
2020-04-24 21:12:11.749000 2020-04-24 21:12:39.345000 1
2020-04-24 21:11:45.541000 2020-04-24 21:12:11.530000 1
2020-04-24 21:11:22.220000 2020-04-24 21:11:45.388000 0
2020-04-24 21:11:00.192000 2020-04-24 21:11:21.985000 0
2020-04-24 21:10:35.952000 2020-04-24 21:11:00.159000 2
2020-04-24 21:10:14.943000 2020-04-24 21:10:35.922000 2
2020-04-24 21:09:51.862000 2020-04-24 21:10:14.793000 0
2020-04-24 21:09:26.487000 2020-04-24 21:09:51.729000 1
2020-04-24 21:09:05.800000 2020-04-24 21:09:26.145000 3
2020-04-24 21:08:43.955000 2020-04-24 21:09:05.798000 1
2020-04-24 21:08:20.261000 2020-04-24 21:08:43.721000 2
2020-04-24 21:07:59.336000 2020-04-24 21:08:19.534000 0
2020-04-24 21:07:36.434000 2020-04-24 21:07:59.233000 0
2020-04-24 21:07:13.664000 2020-04-24 21:07:36.145000 1
2020-04-24 21:06:53.070000 2020-04-24 21:07:13.499000 1
2020-04-24 21:06:34.260000 2020-04-24 21:06:52.9

2020-03-14 21:53:26.207000 2020-03-14 21:55:29.075000 3
2020-03-14 21:50:53.789000 2020-03-14 21:53:21.472000 0
2020-03-14 21:48:46.249000 2020-03-14 21:50:52.090000 1
2020-03-14 21:46:34.798000 2020-03-14 21:48:46.082000 1
2020-03-14 21:44:54.595000 2020-03-14 21:46:34.634000 2
2020-03-14 21:42:49.048000 2020-03-14 21:44:54.294000 0
2020-03-14 21:40:59.249000 2020-03-14 21:42:45.759000 2
2020-03-14 21:38:55.014000 2020-03-14 21:40:55.565000 1
2020-03-14 21:36:59.762000 2020-03-14 21:38:54.838000 0
2020-03-14 21:34:56.771000 2020-03-14 21:36:59.635000 1
2020-03-14 21:33:01.414000 2020-03-14 21:34:55.081000 2
2020-03-14 21:31:27.254000 2020-03-14 21:33:00.275000 2
2020-03-14 21:29:18.490000 2020-03-14 21:31:26.138000 0
2020-03-14 21:27:33.919000 2020-03-14 21:29:17.582000 1
2020-03-14 21:25:39.606000 2020-03-14 21:27:33.626000 0
2020-03-14 21:23:52.322000 2020-03-14 21:25:38.268000 3
2020-03-14 21:21:56.695000 2020-03-14 21:23:51.344000 1
2020-03-14 21:19:43.472000 2020-03-14 21:21:55.8

2020-02-14 23:48:54.236000 2020-02-15 00:53:05.358000 0
2020-02-14 22:57:18.505000 2020-02-14 23:46:36.734000 0
2020-02-14 22:02:17.994000 2020-02-14 22:57:12.588000 1
2020-02-14 21:04:39.173000 2020-02-14 22:02:08.488000 0
2020-02-14 20:09:04.471000 2020-02-14 21:04:30.551000 0
2020-02-14 18:24:50.966000 2020-02-14 20:08:40.462000 0
2020-02-14 17:07:50.987000 2020-02-14 18:24:08.350000 0
2020-02-14 15:31:59.764000 2020-02-14 17:07:47.126000 2
2020-02-14 14:17:43.874000 2020-02-14 15:30:58.151000 2
2020-02-14 13:32:09.869000 2020-02-14 14:17:25.473000 2
2020-02-14 12:49:42.393000 2020-02-14 13:31:17.338000 1
2020-02-14 12:20:46.436000 2020-02-14 12:49:31.303000 2
2020-02-14 11:51:17.546000 2020-02-14 12:20:45.856000 0
2020-02-14 10:25:54.972000 2020-02-14 11:50:11.838000 4
2020-02-14 03:31:29.406000 2020-02-14 10:25:22.055000 1
2020-02-14 00:25:52.736000 2020-02-14 03:29:18.938000 2
2020-02-13 20:51:00.365000 2020-02-14 00:25:30.098000 6
2020-02-13 19:10:16.481000 2020-02-13 20:50:04.4

2019-11-21 11:22:47.845000 2019-11-21 11:30:30.876000 2
2019-11-21 11:15:15.379000 2019-11-21 11:22:46.547000 2
2019-11-21 11:08:39.825000 2019-11-21 11:15:15.294000 1
2019-11-21 11:02:46.232000 2019-11-21 11:08:39.706000 3
2019-11-21 10:55:29.109000 2019-11-21 11:02:44.729000 1
2019-11-21 10:46:55.839000 2019-11-21 10:55:23.109000 1
2019-11-21 10:38:39.958000 2019-11-21 10:46:41.771000 0
2019-11-21 10:30:58.448000 2019-11-21 10:38:38.738000 0
2019-11-21 10:21:14.236000 2019-11-21 10:30:54.380000 2
2019-11-21 10:11:15.718000 2019-11-21 10:21:12.409000 0
2019-11-21 09:59:57.878000 2019-11-21 10:11:04.973000 0
2019-11-21 09:49:52.867000 2019-11-21 09:59:57.865000 4
2019-11-21 09:39:22.430000 2019-11-21 09:49:32.356000 1
2019-11-21 09:29:34.043000 2019-11-21 09:39:21.618000 1
2019-11-21 09:15:59.450000 2019-11-21 09:29:23.793000 0
2019-11-21 09:00:53.648000 2019-11-21 09:15:44.244000 0
2019-11-21 08:44:29.128000 2019-11-21 09:00:32.824000 2
2019-11-21 08:18:42.455000 2019-11-21 08:43:42.4

2019-11-02 23:19:47.096000 2019-11-02 23:24:45.463000 4
2019-11-02 23:15:22.733000 2019-11-02 23:19:45.840000 3
2019-11-02 23:10:10.586000 2019-11-02 23:15:15.801000 5
2019-11-02 23:05:36.028000 2019-11-02 23:10:10.129000 2
2019-11-02 23:01:22.470000 2019-11-02 23:05:31.834000 1
2019-11-02 22:56:15.106000 2019-11-02 23:01:21.198000 2
2019-11-02 22:51:34.145000 2019-11-02 22:56:11.871000 2
2019-11-02 22:47:28.326000 2019-11-02 22:51:33.647000 6
2019-11-02 22:43:22.814000 2019-11-02 22:47:28.302000 6
2019-11-02 22:39:12.126000 2019-11-02 22:43:20.841000 4
2019-11-02 22:35:06.730000 2019-11-02 22:39:11.940000 5
2019-11-02 22:30:40.253000 2019-11-02 22:35:03.326000 6
2019-11-02 22:25:34.011000 2019-11-02 22:30:39.515000 8
2019-11-02 22:20:17.316000 2019-11-02 22:25:29.871000 3
2019-11-02 22:13:18.578000 2019-11-02 22:20:13.573000 7
2019-11-02 22:03:43.109000 2019-11-02 22:13:12.525000 6
2019-11-02 21:55:48.350000 2019-11-02 22:03:40.577000 9
2019-11-02 21:43:59.110000 2019-11-02 21:55:47.7

Let's try to use this data in pandas to see if we can get a sense of how many tweets are being deleted over time.

In [56]:
import pandas

df = pandas.DataFrame.from_records(
    data=deletions('data/twarc-search.log.gz'), 
    columns=['start', 'end', 'deleted']
)

df['start'] = pandas.to_datetime(df['start'])
df['end'] = pandas.to_datetime(df['end'])

df

|       | start                   | end                     | deleted |
|-------|-------------------------|-------------------------|---------|
| 0     | 2021-08-12 23:38:24.980 | 2021-08-13 12:26:25.651 | 0       |
| 1     | 2021-08-12 13:00:12.680 | 2021-08-12 23:35:06.333 | 0       |
| 2     | 2021-08-12 08:39:35.335 | 2021-08-12 13:00:09.484 | 0       |
| 3     | 2021-08-11 10:16:24.924 | 2021-08-12 07:46:05.513 | 2       |
| 4     | 2021-08-10 17:17:02.215 | 2021-08-11 10:13:19.141 | 0       |
| ...   | ...                     | ...                     | ...     |
| 12022 | 2019-10-31 15:10:28.295 | 2019-10-31 15:12:10.110 | 1       |
| 12023 | 2019-10-31 15:08:42.223 | 2019-10-31 15:10:23.203 | 5       |
| 12024 | 2019-10-31 15:07:22.844 | 2019-10-31 15:08:41.024 | 1       |
| 12025 | 2019-10-31 15:05:58.507 | 2019-10-31 15:07:22.085 | 0       |


We can add a new column with the midpoint of each time range:

In [59]:
df['middle'] = df['start'] + ((df['end'] - df['start']) / 2)
df

|       | start                   | end                     | deleted | middle                     |
|-------|-------------------------|-------------------------|---------|----------------------------|
| 0     | 2021-08-12 23:38:24.980 | 2021-08-13 12:26:25.651 | 0       | 2021-08-13 06:02:25.315500 |
| 1     | 2021-08-12 13:00:12.680 | 2021-08-12 23:35:06.333 | 0       | 2021-08-12 18:17:39.506500 |
| 2     | 2021-08-12 08:39:35.335 | 2021-08-12 13:00:09.484 | 0       | 2021-08-12 10:49:52.409500 |
| 3     | 2021-08-11 10:16:24.924 | 2021-08-12 07:46:05.513 | 2       | 2021-08-11 21:01:15.218500 |
| 4     | 2021-08-10 17:17:02.215 | 2021-08-11 10:13:19.141 | 0       | 2021-08-11 01:45:10.678000 |
| ...   | ...                     | ...                     | ...     | ...                        |
| 12022 | 2019-10-31 15:10:28.295 | 2019-10-31 15:12:10.110 | 1       | 2019-10-31 15:11:19.202500 |
| 12023 | 2019-10-31 15:08:42.223 | 2019-10-31 15:10:23.203 | 5       | 2019-10-31 15:09:32.713000 |
| 12024 | 2019-10-31 15:07:22.844 | 2019-10-31 15:08:41.024 | 1       | 2019-10-31 15:08:01.934000 |
| 12025 | 2019-10-31 15:05:58.507 | 2019-10-31 15:07:22.085 | 0       | 2019-10-31 15:06:40.296000 |
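
To see what the `middle` expression computes, here is the same arithmetic on row 4 using only the standard library. Subtracting two datetimes gives a `timedelta`, which can be halved and added back to the start. The `midpoint` helper name is just for illustration.

```python
from datetime import datetime

def midpoint(start, end):
    """Return the instant halfway between two datetimes."""
    return start + (end - start) / 2

# start and end values copied from row 4 of the output
start = datetime(2021, 8, 10, 17, 17, 2, 215000)
end = datetime(2021, 8, 11, 10, 13, 19, 141000)
print(midpoint(start, end))
```

The result should agree with the `middle` value pandas shows for that row, 2021-08-11 01:45:10.678000.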


Now we can bucket the middle time by day and sum the missing tweets in each bucket:

In [62]:
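# sum only the numeric column; recent pandas versions no longer drop
# the datetime columns automatically when summing
deletes_by_day = df.groupby(df['middle'].dt.floor('D'))[['deleted']].sum()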
deletes_by_day

| middle     | deleted |
|------------|---------|
| 2019-10-31 | 677     |
| 2019-11-01 | 421     |
| 2019-11-02 | 487     |
| 2019-11-03 | 469     |
| 2019-11-04 | 265     |
| ...        | ...     |
| 2021-08-09 | 1       |
| 2021-08-10 | 0       |
| 2021-08-11 | 2       |
| 2021-08-12 | 0       |
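
For readers who want to see the bucketing without pandas, the sketch below does the equivalent by hand: truncate each midpoint to its calendar date and add up the missing counts per date. The sample rows are rows 1 to 4 of the data frame shown earlier, with the midpoints rounded down to whole seconds. `sum_by_day` is a hypothetical helper name.

```python
from collections import defaultdict
from datetime import datetime

# (middle, deleted) pairs taken from rows 1-4 of the data frame
rows = [
    (datetime(2021, 8, 12, 18, 17, 39), 0),
    (datetime(2021, 8, 12, 10, 49, 52), 0),
    (datetime(2021, 8, 11, 21, 1, 15), 2),
    (datetime(2021, 8, 11, 1, 45, 10), 0),
]

def sum_by_day(rows):
    """Sum the deleted counts for each calendar day of the midpoint."""
    totals = defaultdict(int)
    for middle, deleted in rows:
        totals[middle.date()] += deleted
    return dict(sorted(totals.items()))

for day, total in sum_by_day(rows).items():
    print(day, total)
```

Note that a range spanning midnight is counted entirely on the day of its midpoint, so the daily totals are an approximation.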


What does the data look like?

In [65]:
deletes_by_day.describe()

|       | deleted     |
|-------|-------------|
| count | 653.0       |
| mean  | 187.500766  |
| std   | 3158.950173 |
| min   | 0.0         |
| 25%   | 5.0         |
| 50%   | 11.0        |
| 75%   | 27.0        |
| max   | 79700.0     |
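
The gap between the mean (about 187) and the median (11) suggests that a few extreme days dominate the total. As a small illustration of that effect, the sketch below uses only the handful of daily counts visible in the earlier output plus the reported maximum. It is not the full dataset, so the numbers it prints will not match the summary.

```python
from statistics import mean, median

# daily counts visible in the truncated output (not the full dataset)
sample = [677, 421, 487, 469, 265, 1, 0, 2, 0]
# the same sample with the reported maximum day added
with_outlier = sample + [79700]

print("median:", median(sample), "->", median(with_outlier))
print("mean:  ", mean(sample), "->", mean(with_outlier))
```

A single extreme day moves the mean far more than the median, which is one reason to look at the chart rather than rely on the mean alone.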


Maybe a chart helps?

In [71]:
from plotly import express as px

fig = px.line(deletes_by_day, title="Deletes per Day", labels={"middle": "Day", "value": "Deletes"})
fig

I wonder what happened on June 2, 2020?! For context this log is from a search for [Marielle Franco](https://en.wikipedia.org/wiki/Marielle_Franco).