# BTS concert twt data depersonalisation 

This notebook transforms tweet datasets collected with the Twitter Streaming API into minimal subsets so that the essential information can be shared openly.

Tweets were collected during four online broadcast BTS concerts in 2021. Streaming APIs capture tweets at the time of posting according to predefined monitoring criteria made up of user ids, keywords, and hashtags (Twitter Developer Platform, 2021). All public tweets matching the pre-established criteria were captured during each streaming interval (~4 hrs each), except where rate limits interfered. The tweet data was collected and stored by the Center for an Informed Public at the University of Washington. For each captured tweet, the API logged what was posted and when, along with information on tweets related by retweet, quote tweet, and reply, and account details for the posting user and the users of related tweets, such as user id, number of followers, and account language.

In keeping with the API research agreement and the CIP's best practices, and out of respect for the privacy of the users sampled, we cannot publish the full original datasets. As a compromise, this notebook records how two subsets were generated from the full records to make accessible the minimum information behind the analyses presented in "Audience Reconstructed".

## Filter Tweets per request
Tweets from the initial samples are first filtered to remove tweets by, and retweets of, accounts that ask to be excluded from off-platform uses.


## Filter Tweets of Official Accounts
Tweets by official accounts attract attention in different patterns than tweets by fans, and these dynamics can overwhelm the interactions of interest. Here we take out tweets from or pertaining to (replies to, RTs of) official accounts related to BTS.

## Depersonalisation of full tweet sets
We share the full set of filtered tweets after depersonalising entries: dropping unused fields, replacing potentially identifying fields with obscuring features, and replacing identifying id numbers such as user IDs and tweet IDs.

## Tweet datasets for content analysis.
A sample of 1200 tweets from these concert tweet datasets was evaluated for content. A subset of fields and the content codes are retained for publication.
	

In [1]:
import sys
import os
import time
import datetime as dt
import math
import numpy as np 
import scipy as sp
import pandas as pd
import gc

In [2]:
# import respy functions from twt.py file
%load_ext autoreload
%autoreload 1
%aimport twt

In [4]:
%reload_ext autoreload

In [21]:
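# read the root directory of the raw data from a local text file
with open('data_loc.txt','r') as f:
    raw_dir = f.readline().strip()  # strip any trailing newline so paths concatenate cleanly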

In [22]:
Concerts = pd.DataFrame(columns=['tag','raw_loc','fullfeilds_loc','dep_loc',
                                 'raw_twt_db','full_twt_db','fan_twt_db','dep_twt_db',
                                 'event_file','event_offset','event_reduction','Long_name','sampling','Program'])
Concerts.loc[0,:]={'tag': 'SWZ_D1','raw_loc':'data/','fullfeilds_loc': '../StreamData/', 'dep_loc':'./data/',
             'raw_twt_db':'Fan_tweets_H_Sowoozoo_D1.csv','full_twt_db':'All_Tweets_SWZ_D1.csv',
             'fan_twt_db':'fan_Tweets_SWZ_D1.csv','dep_twt_db':'fan_Tweets_SWZ_D1_reduced.csv',
             'event_file':'Setlists_sowoozoo_D1.csv',
             'event_offset':'6MIN','event_reduction':[1,2,3,6,8,9,10,12,13,15,16,19,20,21,22,23,25,26,27,28],
             'Long_name':'Sowoozoo Concert Day 1','sampling':'#SOWOOZOO','Program':'SWZ'}
Concerts.loc[1,:]={'tag': 'SWZ_D2','raw_loc':'data/','fullfeilds_loc': '../StreamData/', 'dep_loc':'./data/',
             'raw_twt_db':'Fan_tweets_H_Sowoozoo_D2.csv','full_twt_db':'All_Tweets_SWZ_D2.csv',
             'fan_twt_db':'fan_Tweets_SWZ_D2.csv','dep_twt_db':'fan_Tweets_SWZ_D2_reduced.csv',
             'event_file':'Setlists_sowoozoo_D2.csv',
             'event_offset':'108S','event_reduction':[1,2,3,6,8,9,10,12,13,15,16,19,20,21,22,23,25,26,27,28],
             'Long_name':'Sowoozoo Concert Day 2','sampling':'#SOWOOZOO','Program':'SWZ'}
Concerts.loc[2,:]={'tag': 'PTD_ON','raw_loc':'data/PTD/','fullfeilds_loc': '../StreamData/', 'dep_loc':'./data/',
             'raw_twt_db':'FullPTD_Fan_tweets_PTD_ON_STAGE.csv','full_twt_db':'All_Tweets_PTD_ON.csv',
             'fan_twt_db':'fan_Tweets_PTD_ON.csv','dep_twt_db':'fan_Tweets_PTD_ON_reduced.csv',
             'event_file':'Setlists_PTD_ON.csv','event_offset':'25S','event_reduction':[1,2,3,6,7,8,9,11,12,14,15,17,18,21,22,28,29,32,33,34,36,37,38,39],
             'Long_name':'Permission to Dance on Stage','sampling':'Kpop Stream','Program':'PTD'}
Concerts.loc[3,:]={'tag': 'PTD_LA4','raw_loc':'data/PTD/','fullfeilds_loc': '../StreamData/', 'dep_loc':'./data/',
             'raw_twt_db':'PTD_LA4_Fan_tweets_FULLSTREAM.csv','full_twt_db':'All_Tweets_PTD_LA4.csv',
             'fan_twt_db':'fan_Tweets_PTD_LA4.csv','dep_twt_db':'fan_Tweets_PTD_LA4_reduced.csv',
             'event_file':'Setlists_PTD_LA4.csv','event_offset':'40S','event_reduction':[1,2,3,6,7,8,9,11,12,14,15,17,18,21,22,28,29,32,33,34,36,37,38,39],
             'Long_name':'Permission to Dance LA Day 4','sampling':'Kpop Stream','Program':'PTD'}
Concerts.loc[4,:]={'tag': 'PTD_ON_Alt1','raw_loc':'data/PTD/','fullfeilds_loc': '../StreamData/', 'dep_loc':'./data/',
             'raw_twt_db':'Alt1PTD_Fan_tweets_PTD_ON_STAGE.csv','full_twt_db':'All_Tweets_PTD_ON_Alt1.csv',
             'fan_twt_db':'fan_Tweets_PTD_ON_Alt1.csv','dep_twt_db':'fan_Tweets_PTD_ON_Alt1_reduced.csv',
             'event_file':'','event_offset':'','event_reduction':[],
             'Long_name':'Week prior to PTD On Stage','sampling':'Kpop Stream','Program':''}
Concerts.loc[5,:]={'tag': 'PTD_ON_Alt2','raw_loc':'data/PTD/','fullfeilds_loc': '../StreamData/', 'dep_loc':'./data/',
             'raw_twt_db':'Alt2PTD_Fan_tweets_PTD_ON_STAGE.csv','full_twt_db':'All_Tweets_PTD_ON_Alt2.csv',
             'fan_twt_db':'fan_Tweets_PTD_ON_Alt2.csv','dep_twt_db':'fan_Tweets_PTD_ON_Alt2_reduced.csv',
             'event_file':'','event_offset':'','event_reduction':[],
             'Long_name':'Week following PTD On Stage','sampling':'Kpop Stream','Program':''}
Concerts

Unnamed: 0,tag,raw_loc,fullfeilds_loc,dep_loc,raw_twt_db,full_twt_db,fan_twt_db,dep_twt_db,event_file,event_offset,event_reduction,Long_name,sampling,Program
0,SWZ_D1,data/,../StreamData/,./data/,Fan_tweets_H_Sowoozoo_D1.csv,All_Tweets_SWZ_D1.csv,fan_Tweets_SWZ_D1.csv,fan_Tweets_SWZ_D1_reduced.csv,Setlists_sowoozoo_D1.csv,6MIN,"[1, 2, 3, 6, 8, 9, 10, 12, 13, 15, 16, 19, 20,...",Sowoozoo Concert Day 1,#SOWOOZOO,SWZ
1,SWZ_D2,data/,../StreamData/,./data/,Fan_tweets_H_Sowoozoo_D2.csv,All_Tweets_SWZ_D2.csv,fan_Tweets_SWZ_D2.csv,fan_Tweets_SWZ_D2_reduced.csv,Setlists_sowoozoo_D2.csv,108S,"[1, 2, 3, 6, 8, 9, 10, 12, 13, 15, 16, 19, 20,...",Sowoozoo Concert Day 2,#SOWOOZOO,SWZ
2,PTD_ON,data/PTD/,../StreamData/,./data/,FullPTD_Fan_tweets_PTD_ON_STAGE.csv,All_Tweets_PTD_ON.csv,fan_Tweets_PTD_ON.csv,fan_Tweets_PTD_ON_reduced.csv,Setlists_PTD_ON.csv,25S,"[1, 2, 3, 6, 7, 8, 9, 11, 12, 14, 15, 17, 18, ...",Permission to Dance on Stage,Kpop Stream,PTD
3,PTD_LA4,data/PTD/,../StreamData/,./data/,PTD_LA4_Fan_tweets_FULLSTREAM.csv,All_Tweets_PTD_LA4.csv,fan_Tweets_PTD_LA4.csv,fan_Tweets_PTD_LA4_reduced.csv,Setlists_PTD_LA4.csv,40S,"[1, 2, 3, 6, 7, 8, 9, 11, 12, 14, 15, 17, 18, ...",Permission to Dance LA Day 4,Kpop Stream,PTD
4,PTD_ON_Alt1,data/PTD/,../StreamData/,./data/,Alt1PTD_Fan_tweets_PTD_ON_STAGE.csv,All_Tweets_PTD_ON_Alt1.csv,fan_Tweets_PTD_ON_Alt1.csv,fan_Tweets_PTD_ON_Alt1_reduced.csv,,,[],Week prior to PTD On Stage,Kpop Stream,
5,PTD_ON_Alt2,data/PTD/,../StreamData/,./data/,Alt2PTD_Fan_tweets_PTD_ON_STAGE.csv,All_Tweets_PTD_ON_Alt2.csv,fan_Tweets_PTD_ON_Alt2.csv,fan_Tweets_PTD_ON_Alt2_reduced.csv,,,[],Week following PTD On Stage,Kpop Stream,


In [23]:
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html
# force data types for the loaded csv files, because pandas type inference is unreliable on these columns.

dtype_map = {'id': 'Int64', 'created_at':str, 'tweet':str, 'source':str, 'language':str, 'user_id': 'Int64',
       'user_screen_name':str, 'user_name':str, 'user_description':str, 'user_language':str,
       'user_location':str, 'user_created_at':str, 'user_followers_count': 'Int64',
       'user_friends_count': 'Int64', 'user_statuses_count': 'Int64', 'user_favorites_count': 'Int64',
       'user_verified':str, 'in_reply_to_status_id': 'Int64', 'in_reply_to_user_id': 'Int64',
       'in_reply_to_user_screen_name':str, 'retweeted_status_id': 'Int64',
       'retweeted_status_user_id': 'Int64', 'retweeted_status_user_screen_name':str,
       'retweeted_status_user_name':str, 'retweeted_status_user_description':str,
       'retweeted_status_user_friends_count': 'Int64',
       'retweeted_status_user_statuses_count': 'Int64',
       'retweeted_status_user_followers_count': 'Int64',
       'retweeted_status_retweet_count': 'Int64', 'retweeted_status_favorite_count': 'Int64',
       'retweeted_status_reply_count': 'Int64', 'quoted_status_id': 'Int64',
       'quoted_status_user_id': 'Int64', 'quoted_status_user_screen_name':str,
       'quoted_status_user_name':str, 'quoted_status_user_description':str,
       'quoted_status_user_friends_count': 'Int64', 'quoted_status_user_statuses_count': 'Int64',
       'quoted_status_user_followers_count': 'Int64', 'quoted_status_retweet_count': 'Int64',
       'quoted_status_favorite_count': 'Int64', 'quoted_status_reply_count': 'Int64'}
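A note on the id columns: tweet ids from 2021 (and many user ids) are larger than 2^53, so they cannot be stored exactly as 64-bit floats. When an integer column has missing values, `read_csv` parses it as `float64` by default, and a later `.astype('Int64')` cannot restore digits that were already rounded away. The sketch below illustrates the effect on a made-up two-row CSV, together with one possible workaround of reading id columns as text. The ids and the `ID_COLS` name are illustrative assumptions, not values from the datasets.

```python
import io
import pandas as pd

# Made-up ids of realistic magnitude (not taken from the datasets)
csv_text = (
    "id,retweeted_status_id\n"
    "1400000000000000003,\n"
    "1400000000000000005,1400000000000000003\n"
)

# Default parsing: the column with a missing value becomes float64,
# which cannot represent this id exactly
df_default = pd.read_csv(io.StringIO(csv_text))
rounded = int(df_default.loc[1, 'retweeted_status_id'])

# Reading id columns as text keeps every digit; missing values stay NaN
ID_COLS = ['id', 'retweeted_status_id']  # hypothetical subset of the id fields
df_text = pd.read_csv(io.StringIO(csv_text), dtype={c: str for c in ID_COLS})
exact = int(df_text.loc[1, 'retweeted_status_id'])

print(rounded, exact)
```

If the two printed numbers differ, ids that should match across fields (for example a retweet's `retweeted_status_id` and the original tweet's `id`) may no longer line up after loading.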

In [24]:
# read an original recording of twitter api csv files with forced type and correct datetime handling
df_alltwt=pd.read_csv(raw_dir + Concerts.loc[4,'raw_loc'] + Concerts.loc[4,'raw_twt_db']).astype(dtype_map)
df_alltwt["created_at"] = pd.to_datetime(df_alltwt["created_at"])

## Filtering out users per request

Some Twitter users explicitly refused consent to have their tweets cited or used for research purposes. While it is legally and technically easy to include their content, we have filtered out tweets by and retweets of users whose bio text includes key phrases such as "🚫 please do not cite my tweets w/o my express consent" and "💥This acct DOES NOT consent to being used for research purposes 💥".

The exclusion list was built by reviewing instances of key words in English and Korean to find representative phrases. The most common instances were artists specifying restrictions on reposting or reusing the media shared in their tweets. As the Streaming API does not collect media (images, videos), this was not a reason to exclude these accounts. Accounts identified as requesting exclusion mostly used forms of "DON'T USE" and mentions of consent. Examples of this search process can be found in the notebook Tweet_content_review.ipynb. This removed 107 entries from 12 accounts in one concert dataset.

This was not an exhaustive review of user consent to be included in research. Iterative keyword search is a limited strategy, thematically and crosslinguistically. However, the low number of these cases captured suggests these were not a concentrated concern. Additionally, our use and sharing of the depersonalised datasets should not pose any risk to such users. 
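
The keyword review described above could be sketched roughly as follows. The phrase list and the `flag_opt_out` helper are hypothetical illustrations, not the list or code actually used, and any account flagged this way would still need manual review before being added to the exclusion list.

```python
import pandas as pd

# Hypothetical key phrases for illustration only
PHRASES = ["do not cite", "does not consent", "don't use"]

def flag_opt_out(users, phrases=PHRASES):
    """Return the user ids whose bio contains any of the phrases (case-insensitive)."""
    bios = users['user_description']
    hits = pd.Series(False, index=users.index)
    for phrase in phrases:
        # na=False treats empty bios as non-matching
        hits |= bios.str.contains(phrase, case=False, regex=False, na=False)
    return users.loc[hits, 'user_id'].drop_duplicates()

# Toy data: one opt-out account appearing twice, one ordinary bio, one empty bio
users = pd.DataFrame({
    'user_id': [1, 2, 3, 1],
    'user_description': [
        "🚫 please do not cite my tweets w/o my express consent",
        "BTS fan since 2013",
        None,
        "🚫 please do not cite my tweets w/o my express consent",
    ],
})
candidates = flag_opt_out(users)
print(candidates.tolist())
```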

In [8]:
# output for databases after clearing out accounts identified for exclusion.
# this folder location is outside of this repo, as these databases are not classified as green data (open sharable)
data_loc = '../Stream_Data/'

In [9]:
dontuse_users = pd.read_csv(data_loc+'Exclude_accounts.csv',header = None)[0].values

In [10]:
len(dontuse_users)

12

In [16]:
for i in Concerts.index:
    df_alltwt=pd.read_csv(raw_dir + Concerts.loc[i,'raw_loc'] + Concerts.loc[i,'raw_twt_db'],
                 lineterminator='\n',low_memory=False)#.astype(dtype_map)
    df_alltwt["created_at"] = pd.to_datetime(df_alltwt["created_at"])
    data_name = Concerts.loc[i,'tag']
    print(data_name)
    # Clean up
    print('Size of full set: ' + str(len(df_alltwt)))
    
    df_fantwt = df_alltwt.copy()
    # remove tweets from or connected to identified users by user_id
    for acc in dontuse_users:
        df_fantwt = df_fantwt.loc[df_fantwt['user_id']!=acc,:].copy()
        df_fantwt = df_fantwt.loc[df_fantwt['retweeted_status_user_id']!=acc,:].copy()
        df_fantwt = df_fantwt.loc[df_fantwt['in_reply_to_user_id']!=acc,:].copy()
        df_fantwt = df_fantwt.loc[df_fantwt['quoted_status_user_id']!=acc,:].copy()

    print('Size of cleaned set: ' + str(len(df_fantwt)))
    df_alltwt = df_fantwt.sort_values('created_at').reset_index(drop=True)
    df_alltwt.to_csv(Concerts.loc[i,'fullfeilds_loc'] + Concerts.loc[i,'full_twt_db'])

SWZ_D1
Size of full set: 225993
Size of cleaned set: 225934
SWZ_D2
Size of full set: 114728
Size of cleaned set: 114724
PTD_ON
Size of full set: 277794
Size of cleaned set: 277794
PTD_LA4
Size of full set: 143837
Size of cleaned set: 143772
PTD_ON_Alt1
Size of full set: 16806
Size of cleaned set: 16802
PTD_ON_Alt2
Size of full set: 55676
Size of cleaned set: 55671
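
The per-account loop above can also be expressed as a single membership test. The following is a minimal sketch on toy data, assuming the same four user id fields; `drop_excluded` and the ids are hypothetical names and values, not part of the original pipeline.

```python
import pandas as pd

ID_FIELDS = ['user_id', 'retweeted_status_user_id',
             'in_reply_to_user_id', 'quoted_status_user_id']

def drop_excluded(df, excluded_ids, fields=ID_FIELDS):
    """Drop rows where any of the user id fields matches an excluded account."""
    linked = df[fields].isin(list(excluded_ids)).any(axis=1)
    return df.loc[~linked].copy()

nan = float('nan')
toy = pd.DataFrame({
    'user_id':                  [10, 11, 12, 13, 14],
    'retweeted_status_user_id': [nan, 10, nan, nan, nan],
    'in_reply_to_user_id':      [nan, nan, nan, 99, nan],
    'quoted_status_user_id':    [nan, nan, 10, nan, nan],
})

# Account 10 asked to be excluded: its own tweet, a retweet of it,
# and a quote of it are all removed
kept = drop_excluded(toy, [10])
print(kept['user_id'].tolist())
```

Rows with missing values in the related id fields are kept, matching the behaviour of the `!=` comparisons in the loop.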


## Filter out non-fan content
Tweets from or to official accounts are in these datasets, in both the hashtag and stream samples, but are not of interest to this research question. We are not interested in tweets to the artists or from the production company, as these have different dynamics.

Our list of exclusion: '@BTS_twt', '@bts_bighit', '@weverseofficial', '@weverseshop', '@BIGHIT_MUSIC', '@HYBE_MERCH', '@BT21_'



In [17]:
filtered_users = ['@BTS_twt','@bts_bighit','@weverseofficial','@weverseshop','@BIGHIT_MUSIC','@HYBE_MERCH','@BT21_']

In [19]:
for i in Concerts.index:
    df_alltwt=pd.read_csv(Concerts.loc[i,'fullfeilds_loc'] + Concerts.loc[i,'full_twt_db'],
                 lineterminator='\n',index_col = 0,low_memory=False).astype(dtype_map).reset_index(drop = True)
    df_alltwt["created_at"] = pd.to_datetime(df_alltwt["created_at"])
    
    data_name = Concerts.loc[i,'tag']
    print(data_name)
    # Clean up
    print('Size of full set: ' + str(len(df_alltwt)))
    df_fantwt = df_alltwt.copy()
    # retweets or quotes of these accounts (tweet text containing 'T @account:')
    for acc in filtered_users:
        twts = df_fantwt['tweet'] 
        df_fantwt = df_fantwt.loc[~(twts.str.contains('T '+acc+':', case=False,regex=False))].copy()
    # tweets posted by these accounts (screen name contains the account name)
    for acc in filtered_users:
        twts = df_fantwt['user_screen_name'] 
        df_fantwt = df_fantwt.loc[~(twts.str.contains(acc[1:], case=False,regex=False))].copy()

    # remove replies to these accounts
    df_notreplys = df_fantwt.loc[df_fantwt['in_reply_to_user_id'].isna(),:]
    df_replys = df_fantwt.loc[df_fantwt['in_reply_to_user_id'].notna(),:]
    print('Size of reply tweets: ' + str(len(df_replys)) )
    twts = df_replys['in_reply_to_user_screen_name']  # df_replys['in_reply_to_user_screen_name'].value_counts()
    for acc in filtered_users: # drop replies addressed to these accounts
        df_replys = df_replys.loc[~(twts.str.startswith(acc[1:]))]
    print('Size of replys: ' + str(len(df_replys)) )
    
    df_alltwt = pd.concat([df_notreplys,df_replys])
    print('Size of cleaned set: ' + str(len(df_alltwt)))
    
    df_alltwt = df_alltwt.sort_values('created_at').reset_index(drop=True)
    
    df_alltwt.to_csv(Concerts.loc[i,'fullfeilds_loc'] + Concerts.loc[i,'fan_twt_db'])

SWZ_D1
Size of full set: 225934
Size of reply tweets: 1730
Size of replys: 1484
Size of cleaned set: 224661
SWZ_D2
Size of full set: 114724
Size of reply tweets: 1303
Size of replys: 947
Size of cleaned set: 111152
PTD_ON
Size of full set: 277794
Size of reply tweets: 56674
Size of replys: 8213
Size of cleaned set: 228708
PTD_LA4
Size of full set: 143772
Size of reply tweets: 28387
Size of replys: 4069
Size of cleaned set: 116313
PTD_ON_Alt1
Size of full set: 16802
Size of reply tweets: 3605
Size of replys: 1235
Size of cleaned set: 14269
PTD_ON_Alt2
Size of full set: 55671
Size of reply tweets: 11997
Size of replys: 3444
Size of cleaned set: 46790
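
The three checks in the cell above (retweet prefix, posting account, reply target) can be summarised as one boolean mask. This is a minimal sketch on toy data that mirrors the same matching rules; `touches_official` and the example tweets are hypothetical and not part of the original pipeline.

```python
import pandas as pd

OFFICIAL = ['@BTS_twt', '@bts_bighit', '@weverseofficial', '@weverseshop',
            '@BIGHIT_MUSIC', '@HYBE_MERCH', '@BT21_']

def touches_official(df, accounts=OFFICIAL):
    """Boolean mask of rows that retweet, come from, or reply to the given accounts."""
    mask = pd.Series(False, index=df.index)
    for acc in accounts:
        name = acc[1:]  # account name without the leading '@'
        # tweet text carrying a '...T @account:' prefix, as in 'RT @account:'
        mask |= df['tweet'].str.contains('T ' + acc + ':', case=False, regex=False, na=False)
        # posting account whose screen name contains the account name
        mask |= df['user_screen_name'].str.contains(name, case=False, regex=False, na=False)
        # reply target starting with the account name (case-sensitive, as above)
        mask |= df['in_reply_to_user_screen_name'].str.startswith(name, na=False)
    return mask

toy = pd.DataFrame({
    'tweet': ['RT @BTS_twt: hello', 'what a show', 'so good', '@bts_bighit thanks'],
    'user_screen_name': ['fan_a', 'fan_b', 'BIGHIT_MUSIC', 'fan_c'],
    'in_reply_to_user_screen_name': [None, None, None, 'bts_bighit'],
})
fan_only = toy.loc[~touches_official(toy)]
print(fan_only['user_screen_name'].tolist())
```

Note that the screen name check is a substring match, so any account whose name contains one of the listed names is also removed.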


# Reduction 1: Depersonalised full tweet sets

Once the databases have been filtered for exclusions and for activity related to official accounts, we reduce the information saved per tweet to fields that are sufficiently depersonalised.

Unique identifier numbers, both tweet ids and user ids, are replaced via a hash table (a lookup dictionary) built per dataset. Relevant numerical statistics for tweets and users are retained, as they are non-searchable and non-unique descriptors of posts and accounts from a non-retrievable moment in the Twitter database's history. Tweet text, user names, and user descriptions are potentially unique and identifiable, and are thus removed.

List of fields retained (without alteration):

    - 'created_at'
    
    - 'retweeted_status_retweet_count'
    - 'retweeted_status_favorite_count'
    - 'retweeted_status_reply_count'
    - 'quoted_status_retweet_count'
    - 'quoted_status_favorite_count'
    - 'quoted_status_reply_count'
    
    - 'user_followers_count'
    - 'retweeted_status_user_followers_count'
    - 'quoted_status_user_followers_count'

Info to replace with hashtable:
    
    - 'id'
    - 'user_id'
    - 'in_reply_to_status_id'
    - 'in_reply_to_user_id'
    - 'retweeted_status_id'
    - 'retweeted_status_user_id'
    - 'quoted_status_id'
    - 'quoted_status_user_id'
    
    
Features extracted from tweet content:

    - Tweet length in characters
    - Media inclusion (embedded photo, video, or quoting another tweet)
    - Tweet type (Original, RT, QT, Reply)
    
List of fields discarded entirely:

    - 'tweet' 
    - 'source'
    - 'language'
    - 'user_screen_name'
    - 'user_name'
    - 'user_description'
    - 'user_language'
    - 'user_location'
    - 'user_created_at'
    - 'user_friends_count'
    - 'user_statuses_count'
    - 'user_favorites_count'
    - 'user_verified'
    
    - 'in_reply_to_user_screen_name'
    
    - 'retweeted_status_user_screen_name'
    - 'retweeted_status_user_name'
    - 'retweeted_status_user_description'
    - 'retweeted_status_user_friends_count'
    - 'retweeted_status_user_statuses_count'
    
    - 'quoted_status_user_screen_name'
    - 'quoted_status_user_name'
    - 'quoted_status_user_description'
    - 'quoted_status_user_friends_count'
    - 'quoted_status_user_statuses_count'


In [11]:
i = 0

In [12]:
df_alltwt=pd.read_csv(Concerts.loc[i,'fullfeilds_loc'] + Concerts.loc[i,'fan_twt_db'],
             lineterminator='\n',index_col = 0)#.astype(dtype_map)
data_name = Concerts.loc[i,'tag']
print(data_name)

df_alltwt["created_at"] = pd.to_datetime(df_alltwt["created_at"])
df_alltwt = df_alltwt.sort_values('created_at').reset_index(drop=True)

Feilds_to_keep = ['created_at','id','user_id','user_followers_count',
'retweeted_status_id','retweeted_status_user_id','retweeted_status_user_followers_count',
'retweeted_status_retweet_count', 'retweeted_status_favorite_count','retweeted_status_reply_count',
'quoted_status_id','quoted_status_user_id','quoted_status_user_followers_count',
'quoted_status_retweet_count','quoted_status_favorite_count','quoted_status_reply_count',
'in_reply_to_status_id','in_reply_to_user_id']

df_Reduced = df_alltwt.loc[:,Feilds_to_keep]#.astype('Int64')
twts = df_alltwt['tweet']
dates = df_alltwt['created_at']

SWZ_D1


In [13]:
df_Reduced.dtypes

created_at                               datetime64[ns, UTC]
id                                                     int64
user_id                                                int64
user_followers_count                                   int64
retweeted_status_id                                  float64
retweeted_status_user_id                             float64
retweeted_status_user_followers_count                float64
retweeted_status_retweet_count                       float64
retweeted_status_favorite_count                      float64
retweeted_status_reply_count                         float64
quoted_status_id                                     float64
quoted_status_user_id                                float64
quoted_status_user_followers_count                   float64
quoted_status_retweet_count                          float64
quoted_status_favorite_count                         float64
quoted_status_reply_count                            float64
in_reply_to_status_id   

In [14]:

# Hash the user ids across all user id fields.
Feilds_to_anon = ['user_id','in_reply_to_user_id','retweeted_status_user_id','quoted_status_user_id']
ids = []
tic = time.time()
for replace_feild in Feilds_to_anon:
    ids +=list(df_alltwt.loc[df_alltwt[replace_feild].notna(),replace_feild].astype('int64').values)
ids=pd.DataFrame(columns = ['index'],data = ids)
print(['set ids',time.time()-tic])
A = ids.value_counts().reset_index()
B = dict((v, k) for k, v in A['index'].to_dict().items())
print(['create hash',time.time()-tic])
for replace_feild in Feilds_to_anon:
    df_Reduced[replace_feild].replace(B,inplace = True)
    print([replace_feild,time.time()-tic])

['set ids', 0.24439120292663574]
['create hash', 0.4201631546020508]
['user_id', 85.2270519733429]
['in_reply_to_user_id', 146.00992608070374]
['retweeted_status_user_id', 207.81203508377075]
['quoted_status_user_id', 268.20725297927856]


In [15]:
# Hash the tweet ids across all tweet id fields.
Feilds_to_anon = ['id','in_reply_to_status_id', 'retweeted_status_id', 'quoted_status_id']
ids = []
tic = time.time()
for replace_feild in Feilds_to_anon:
    ids +=list(df_alltwt.loc[df_alltwt[replace_feild].notna(),replace_feild].astype('int64').values)
ids=pd.DataFrame(columns = ['index'],data = ids)
print(['set ids',time.time()-tic])
A = ids.value_counts().reset_index()
B = dict((v, k) for k, v in A['index'].to_dict().items())
print(['create hash',time.time()-tic])
for replace_feild in Feilds_to_anon:
    df_Reduced[replace_feild].replace(B,inplace = True)
    print([replace_feild,time.time()-tic])

['set ids', 0.28423118591308594]
['create hash', 0.5798389911651611]
['id', 183.05087304115295]
['in_reply_to_status_id', 319.56561303138733]
['retweeted_status_id', 456.2954912185669]
['quoted_status_id', 590.9220142364502]
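
The timings above show that replacing values through a large dictionary is the slow step. Below is a minimal sketch of the same idea (ids pooled across fields, ranked by frequency, then swapped for their rank) using a direct dictionary lookup per value, which may be considerably faster. The helper names and toy ids are hypothetical, and this is not the code that produced the published files.

```python
import pandas as pd

def build_id_map(df, fields):
    """Map each distinct id found in `fields` to a small integer, most frequent first."""
    pooled = pd.concat([df[f].dropna().astype('int64') for f in fields],
                       ignore_index=True)
    ranked = pooled.value_counts().index
    return {int(original): new for new, original in enumerate(ranked)}

def apply_id_map(df, fields, id_map):
    """Return a copy of `df` with ids replaced; missing ids stay missing."""
    out = df.copy()
    for f in fields:
        out[f] = pd.array(
            [id_map[int(v)] if pd.notna(v) else None for v in df[f]],
            dtype='Int64',
        )
    return out

nan = float('nan')
toy = pd.DataFrame({
    'user_id': [101, 102, 101, 103],
    'retweeted_status_user_id': [nan, 101, nan, 102],
})
fields = ['user_id', 'retweeted_status_user_id']
id_map = build_id_map(toy, fields)
anon = apply_id_map(toy, fields, id_map)
print(anon)
```

Because one map is shared across the fields, a user keeps the same replacement value whether they appear as the poster or as the retweeted account.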


In [16]:
# add in binary tweet type categories: RT, QT, Reply, Original
df_Reduced['Original'] = 1
df_Reduced.loc[df_alltwt['retweeted_status_id'].notna(),'Original'] = 0
df_Reduced.loc[df_alltwt['quoted_status_id'].notna(),'Original'] = 0
df_Reduced.loc[df_alltwt['in_reply_to_status_id'].notna(),'Original'] = 0
df_Reduced['RT'] = 0
df_Reduced.loc[df_alltwt['retweeted_status_id'].notna(),'RT'] = 1
df_Reduced['QT'] = 0
df_Reduced.loc[df_alltwt['quoted_status_id'].notna(),'QT'] = 1
df_Reduced['Reply'] = 0
df_Reduced.loc[df_alltwt['in_reply_to_status_id'].notna(),'Reply'] = 1

# add features of tweet content
df_Reduced['Media'] = 0
df_Reduced.loc[twts.str.contains('https://t.co', case=False,regex=False),'Media'] = 1
df_Reduced['Length'] = twts.str.len()

df_Reduced.to_csv(Concerts.loc[i,'dep_loc'] + Concerts.loc[i,'dep_twt_db'])
print(data_name)

SWZ_D1
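
As coded above, the RT, QT, and Reply flags are set independently, so a single tweet can carry more than one of them, while Original marks tweets with none. A compact sketch of the same logic on toy data follows; `tweet_type_flags` and the ids are hypothetical.

```python
import pandas as pd

def tweet_type_flags(df):
    """Derive 0/1 tweet type flags from which related-status ids are present."""
    flags = pd.DataFrame(index=df.index)
    flags['RT'] = df['retweeted_status_id'].notna().astype(int)
    flags['QT'] = df['quoted_status_id'].notna().astype(int)
    flags['Reply'] = df['in_reply_to_status_id'].notna().astype(int)
    # Original means no link to any other tweet
    flags['Original'] = (flags[['RT', 'QT', 'Reply']].sum(axis=1) == 0).astype(int)
    return flags

nan = float('nan')
toy = pd.DataFrame({
    'retweeted_status_id':   [nan, 5, nan, nan],
    'quoted_status_id':      [nan, nan, 7, nan],
    'in_reply_to_status_id': [nan, nan, 8, 9],
})
flags = tweet_type_flags(toy)
print(flags)
```

In the toy data the third row is both a quote tweet and a reply, so analyses that count tweet types should not assume the flags sum to one per row.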


# Check on coded subsets
Clean up coded subset for easy processing

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html

In [17]:
dloc='/Users/finn/Desktop/Current_Projects/BTS_twitter/twt_Analysis/data/'
subTwts = pd.read_csv(dloc + 'Packed_PTD_Subsets_Coded.csv')

data_name = subTwts.loc[0,'Concert']
subTwts['Last_RT_created_at'] = pd.to_datetime(subTwts['Last_RT_created_at'])
# overwrite the retweet id with the retweet timestamp cast to an integer,
# so the original tweet id is not carried into the shared file
subTwts['Last_RT_id'] = pd.to_datetime(subTwts['Last_RT_created_at'])
subTwts['Last_RT_id'] = subTwts['Last_RT_id'].astype('int64')
# convert content codes to boolean
codes = ['Affection', 'Intensifiers', 'Self', 'Members', 'recording', 'Stills',
       'Production', 'Music', 'Commentary', 'Army', 'Anticipation', 'Fanwork',
       'Information', 'Stream', 'Spam', 'Commercial']
for c in codes:
    # any non-null entry means the code was applied; blanks become False
    subTwts[c] = subTwts[c].notna()

subTwts.columns

Index(['RT_SubSet', 'Concert', 'Last_RT_created_at', 'Last_RT_id',
       'Last_RT_url', 'OriTwt_id', 'language', 'LastRT_user_followers_count',
       'Last_RT_user_friends_count', 'Last_RT_user_statuses_count',
       'Last_RT_user_favorites_count', 'Ori_Twt_user_friends_count',
       'Ori_Twt_user_statuses_count', 'Ori_Twt_user_followers_count',
       'Last_RT_retweet_count', 'Last_RT_favorite_count',
       'Last_RT_reply_count', 'Shout', 'Tweet Length', 'Tweet link/media',
       'Lost', 'Unrelated', 'Affection', 'Intensifiers', 'Self', 'Members',
       'recording', 'Stills', 'Production', 'Music', 'Commentary', 'Army',
       'Anticipation', 'Fanwork', 'Information', 'Stream', 'Spam',
       'Commercial'],
      dtype='object')

In [18]:
# check formatting
for col in subTwts.columns:
    N = subTwts[col].nunique()
    print(' '.join([col,'unique',str(N), str(subTwts[col].dtypes)]))
    if N<10:
        print(subTwts[col].unique())
        

RT_SubSet unique 4 object
['Top200RTd' 'Rand200_32t6RTd' 'Rand200_3t1RTd' 'Rand200_NoRT']
Concert unique 1 object
['PTD_ON1']
Last_RT_created_at unique 468 datetime64[ns, UTC]
Last_RT_id unique 468 int64
Last_RT_url unique 484 object
OriTwt_id unique 484 int64
language unique 24 object
LastRT_user_followers_count unique 287 int64
Last_RT_user_friends_count unique 314 int64
Last_RT_user_statuses_count unique 483 int64
Last_RT_user_favorites_count unique 469 int64
Ori_Twt_user_friends_count unique 340 int64
Ori_Twt_user_statuses_count unique 454 int64
Ori_Twt_user_followers_count unique 447 int64
Last_RT_retweet_count unique 149 int64
Last_RT_favorite_count unique 210 int64
Last_RT_reply_count unique 98 int64
Shout unique 2 bool
[False  True]
Tweet Length unique 170 int64
Tweet link/media unique 2 bool
[ True False]
Lost unique 1 float64
[nan  1.]
Unrelated unique 1 float64
[ 1. nan]
Affection unique 2 bool
[False  True]
Intensifiers unique 2 bool
[False  True]
Self unique 2 bool
[False 

In [19]:
subTwts.to_csv('./data/'+data_name+'_coded_subsets_with_replaced.csv')