In [160]:
import pandas as pd
# treat +/-inf as missing values ('mode.use_inf_as_na' is the full option name)
pd.set_option('mode.use_inf_as_na', True)
import csv

In [161]:
### ADDED DATA INTO DATA FOLDER ON GDRIVE ###

**Let's start with the user data and its features - any features created from the tweet data can be joined later**

The user data consists of the 5000 accounts randomly sampled from our 'climate emergency' dataset and 18000 accounts which have been labelled as either bots or genuine users. (source?) The next steps are:

- Load in the data and standardise the columns - are there features that appear in only one dataset, and can we source them for the other?
- Apply the features we have created from the user account details - again, we can create more of these

**Load in the tweet data (the 200 most recent tweets, ~ March 2020, for our sample; a variety of dates for our training data)**

- Load in the tweet data for both samples and standardise the columns
- Apply the tweet-based features we have created
- Join these back to the user data - for example, average words per tweet for each user
- We can then train a variety of supervised and unsupervised algorithms on the training data and apply them to our random sample of 5000 users who contributed to the climate emergency debate during the period in which we collected data.
- The literature suggests that unsupervised methods often produce better results:
 - fast greedy (Cresci et al., 2017)
 - digital DNA (Cresci et al., 2016)
 - graph clustering (Ahmed et al., 2013)
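
The "join back to the user data" step above could look roughly like the sketch below. It runs on toy data: the `tweets` and `users` frames, the `n_words` column and the feature names `avg_words_per_tweet` and `n_tweets` are illustrative assumptions, not the features actually used in this project.

```python
import pandas as pd

# Toy stand-ins for the real tables (assumed columns: username, text)
tweets = pd.DataFrame({
    "username": ["alice", "alice", "bob"],
    "text": ["climate emergency now", "act", "hello there"],
})
users = pd.DataFrame({
    "username": ["alice", "bob", "carol"],
    "followers": [10, 20, 30],
})

# Per-tweet feature: number of whitespace-separated words
tweets["n_words"] = tweets["text"].fillna("").str.split().str.len()

# Aggregate to one row per user
tweet_features = (
    tweets.groupby("username")
    .agg(avg_words_per_tweet=("n_words", "mean"), n_tweets=("n_words", "size"))
    .reset_index()
)

# A left join keeps users with no collected tweets (their features become NaN)
users_with_features = users.merge(
    tweet_features, on="username", how="left", validate="one_to_one"
)
print(users_with_features)
```

Because only the small per-user table is merged, the raw tweets never need to sit alongside the user data in memory.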
 
Happy hunting!

**Data folder in gdrive**

- *5000_accounts_climate.csv* - the 5000 accounts from climate emergency with user features
- *5000_tweets_climate.csv* - most recent 200 tweets from the 5000 accounts 
- *5000_tweets_frequency.csv* - features based on tweet frequency from the tweets of 5000 users
- *training_users_tag.csv* - the 18000 labelled training accounts for which we also have tweet data
- *training_tweets.txt* - most recent 200 tweets from the 18000 accounts, columns = ['dt','text','tweetid','username']

In [117]:
# load in the 5000 users without additional features
users_5000 = pd.read_csv("DATA/5000_accounts_climate.csv")
users_5000.head(1)

Unnamed: 0,id,name,username,location,url,description,verified,followers,friends,favourites_count,statuses_count,created_at,default_profile,default_profile_image
0,1098803589609189376,💧The Cranky Croation,JohnSarich2,,,My First ever vote was for Gough Whitlam. Left...,False,430,291,14866,6039,2019-02-22 04:36:05,True,False


In [148]:
# These are the accounts (from training_users_tag) which we could source tweet data from training_tweets_tag
users_train = pd.read_csv("DATA/training_users_tag.csv")
users_train.head(1)

Unnamed: 0.1,Unnamed: 0,id,name,screen_name,statuses_count,followers_count,friends_count,favourites_count,listed_count,url,...,description,created_at,class,tag_stock,tag_politics,tag_pronbot,tag_business,tag_fake_follower,tag_spambot,tag_traditional_spambot
0,0,418,Dennis Crowley,dens,69341,85422,2623,14990,4491,https://t.co/63fYABYs9J,...,"I like to build things (@Foursquare📱, @Stockad...",Wed Jul 05 19:52:46 +0000 2006,human,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [149]:
# See what columns match
a = users_5000.columns
b = users_train.columns

missing = list(set(a) - set(b))
print(missing,'are not in the training data!')

['username', 'followers', 'verified', 'friends'] are not in the training data!


 - username = screen_name
 - followers = followers_count
 - no verified field
 - friends = friends_count
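
The check above only looks in one direction (columns of the climate sample that are missing from the training data). A minimal sketch of a two-way comparison is shown below, using shortened, hand-typed column lists rather than the real DataFrames; the variable names are made up for illustration.

```python
import pandas as pd

# Shortened column lists standing in for users_5000.columns and users_train.columns
climate_cols = pd.Index(["id", "name", "username", "verified", "followers", "friends"])
training_cols = pd.Index(["id", "name", "screen_name", "followers_count", "friends_count", "class"])

only_in_climate = climate_cols.difference(training_cols)   # need renaming or dropping
only_in_training = training_cols.difference(climate_cols)  # candidates to map back
shared = climate_cols.intersection(training_cols)          # already standardised

print("Only in climate sample:", list(only_in_climate))
print("Only in training data: ", list(only_in_training))
print("Shared:                ", list(shared))
```

Seeing both lists side by side makes pairs such as `username` / `screen_name` easier to spot.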

In [150]:
# let's add the training-data names of the renamed columns to our matches variable
matches = list(set(a) & set(b))
matches = matches + ['screen_name','followers_count','friends_count']

In [151]:
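# 'class' is the eighth column from the end; the seven columns after it tag the type of bot
tag = [users_train.columns[-8]]        # just the class label
tags = list(users_train.columns[-8:])  # the class label plus the seven bot-type tags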

In [152]:
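# add your choice (tag or tags) to matches
matches = matches + tag
# apply this to the training data so we have standardised columns
# (.copy() so that later column changes do not act on a view of users_train)
users_train_2 = users_train[matches].copy()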

Unnamed: 0,name,statuses_count,id,favourites_count,url,created_at,default_profile,default_profile_image,location,description,screen_name,followers_count,friends_count,class
0,Dennis Crowley,69341,418,14990,https://t.co/63fYABYs9J,Wed Jul 05 19:52:46 +0000 2006,False,False,NYC / Kingston,"I like to build things (@Foursquare📱, @Stockad...",dens,85422,2623,human


In [153]:
# we'll drop verified for now, even though it will be useful - we can add back later.
users_5000_2 = users_5000.drop(columns=['verified'])

In [155]:
# rename the training columns to match the climate data
# (renaming by name rather than by position, because the order of `matches`
# comes from a set and is not guaranteed to be the same on every run)
users_train_2 = users_train_2.rename(columns={'screen_name': 'username',
                                              'followers_count': 'followers',
                                              'friends_count': 'friends'})
users_train_2.head(1)

Unnamed: 0,name,statuses_count,id,favourites_count,url,created_at,default_profile,default_profile_image,location,description,username,followers,friends,class
0,Dennis Crowley,69341,418,14990,https://t.co/63fYABYs9J,Wed Jul 05 19:52:46 +0000 2006,False,False,NYC / Kingston,"I like to build things (@Foursquare📱, @Stockad...",dens,85422,2623,human


We now have our training data and climate emergency accounts in a standardised format. The next job is to engineer features for both.
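
A shared feature function might be sketched as follows. The two features (`account_age_days`, `followers_per_friend`), the function name `add_account_features` and the reference date are assumptions chosen for illustration, not features the project has settled on. The date format is passed in because `created_at` is formatted differently in the two datasets; timestamps without an offset are assumed to be UTC.

```python
import pandas as pd

def add_account_features(users, date_format, reference_date="2020-03-31"):
    """Return a copy of `users` with two illustrative account-level features."""
    out = users.copy()
    # Parse created_at with an explicit format; naive timestamps are treated as UTC
    created = pd.to_datetime(out["created_at"], format=date_format, utc=True)
    reference = pd.Timestamp(reference_date, tz="UTC")
    out["account_age_days"] = (reference - created).dt.days
    # +1 in the denominator avoids division by zero for accounts with no friends
    out["followers_per_friend"] = out["followers"] / (out["friends"] + 1)
    return out

# One-row examples copied from the head(1) outputs above
climate = pd.DataFrame({"created_at": ["2019-02-22 04:36:05"],
                        "followers": [430], "friends": [291]})
training = pd.DataFrame({"created_at": ["Wed Jul 05 19:52:46 +0000 2006"],
                         "followers": [85422], "friends": [2623]})

climate_feat = add_account_features(climate, "%Y-%m-%d %H:%M:%S")
training_feat = add_account_features(training, "%a %b %d %H:%M:%S %z %Y")
print(climate_feat[["account_age_days", "followers_per_friend"]])
print(training_feat[["account_age_days", "followers_per_friend"]])
```

Running the same function over both tables keeps the feature definitions identical for the training data and the climate sample.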

In [None]:
# create functions for adding features to both datasets - we can add more features as more are developed!


**Let's load in the additional features created using the recent tweets of each user**

Don't load in the actual tweets - (5000 + 18000) * 200 = up to 4.6 million tweets, which is a bit too much
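
If a per-user feature ever has to be recomputed from the raw tweets, one option is to stream the file in chunks and keep only running totals. The sketch below uses a small in-memory CSV in place of the real file; the comma delimiter, the chunk size and the `avg_words_per_tweet` feature are assumptions, and only the column names follow *training_tweets.txt*.

```python
import io
import pandas as pd

# In-memory stand-in for a large tweet file (columns as in training_tweets.txt)
raw = io.StringIO(
    "dt,text,tweetid,username\n"
    "2020-03-01,climate emergency now,1,alice\n"
    "2020-03-02,hello there,2,bob\n"
    "2020-03-03,act,3,alice\n"
)

partials = []
# chunksize=2 is tiny for demonstration; something like 100_000 is more realistic
for chunk in pd.read_csv(raw, usecols=["text", "username"], chunksize=2):
    chunk["n_words"] = chunk["text"].fillna("").str.split().str.len()
    # Keep only per-user sums and counts from this chunk
    partials.append(chunk.groupby("username")["n_words"].agg(["sum", "count"]))

# Combine the partial results; a user can appear in several chunks
totals = pd.concat(partials).groupby(level=0).sum()
totals["avg_words_per_tweet"] = totals["sum"] / totals["count"]
print(totals)
```

Sums and counts are accumulated rather than per-chunk means, because averaging the chunk means would weight chunks unequally.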

In [None]:
# load in CSVs of tweet-related features aggregated by username, which we can join without loading a lot of tweets

In [159]:
# tweet_frequency for both datasets
frequency_5000 = pd.read_csv("DATA/5000_tweets_frequency.csv")
#frequency_users = pd.read_csv("DATA/training_tweets_frequency.csv")