## Content-based Features
<li>The percentage of Tweets containing URLs</li>
<li>The ratio of the number of unique URLs [3]</li>
<li>The hashtag ratio</li>
<li>The ratio of the number of @usernames [3]</li>
<li>The ratio of the number of unique @usernames [3]</li>
<li>Tweet similarity:   
(1) $S=\frac{\sum_{p\in P}c(p)}{l_a l_p}$    
where $P$ is the set of possible tweet-to-tweet combinations among any two tweets logged for a certain account, $p$ is a single pair, $c(p)$ is a function that counts the number of words the two tweets share, $l_a$ is the average length of the tweets posted by that user, and $l_p$ is the number of tweet combinations. A profile sending similar tweets will have a low value of $S$. [4]    
(2) $\sum_{a,b \in \text{set of pairs in tweets}}\frac{\text{similarity}(a,b)}{|\text{set of pairs in tweets}|}$    
where the content similarity is computed using the standard cosine similarity over the bag-of-words vector representation $\mathbf{V}(a)$ of the tweet content: $\text{similarity}(a,b)=\frac{\mathbf{V}(a)\cdot\mathbf{V}(b)}{|\mathbf{V}(a)|\,|\mathbf{V}(b)|}$ 
Since tweets are extremely short (140 characters or fewer), we consider a bag-of-words model and a sparse bigrams model. [3]</li>
<li>Duplicate tweet count</li>  
<li>User behavior: number of times the user was mentioned, number of times the user was replied to, number of times the user replied to someone</li>
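
The duplicate tweet count is listed above but not implemented in this notebook. The following is a minimal sketch under stated assumptions: a "duplicate" is taken to be a tweet whose text repeats an earlier tweet after lowercasing and collapsing whitespace, and the helper name `duplicate_tweet_count` is hypothetical.

```python
from collections import Counter


def duplicate_tweet_count(tweets):
    """Count the tweets that repeat the text of another tweet.

    Assumption: two tweets are duplicates when their texts are equal
    after lowercasing and collapsing runs of whitespace.
    """
    normalized = [" ".join(text.lower().split()) for text in tweets]
    counts = Counter(normalized)
    # every occurrence after the first one counts as a duplicate
    return sum(count - 1 for count in counts.values() if count > 1)


sample = ["Hi there", "hi  there", "other", "other", "other"]
print(duplicate_tweet_count(sample))
```

A stricter or looser notion of duplication (for example, ignoring URLs before comparing) would only change the normalization line.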

In [2]:
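from __future__ import division  # must be the first statement in the cell
import csv
import glob
import re
import time
from compiler.ast import flatten  # Python 2 only
from math import factorial

import numpy as np
import pandas as pd
import requests
import requests_cache
import tweepy
from nltk.corpus import stopwords

# cache the HTTP responses locally
requests_cache.install_cache('demo_cache')
# the local api module is expected to provide the `api` object used below
from api import *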

In [3]:
# read the ids of the genuine accounts
genuine = pd.read_csv(open('User dataset/genuine account.csv', 'rU'),
                      encoding='utf-8', usecols=['id'])

# load the ids of the fake accounts, one CSV file per dataset
path = r'./User dataset/fake data'
all_files = glob.glob(path + "/*.csv")
fake = pd.concat([pd.read_csv(open(file_, 'rU'), encoding='utf-8', usecols=['id'])
                  for file_ in all_files])

# merge all the ids
all_ids = pd.concat([genuine, fake])

In [4]:
# create a list of user ids, dropping the missing or invalid entries
invalid = ["False", "None", "nan"]
id_list = [str(i) for i in all_ids['id']]
final_id_list = [int(i) for i in id_list if i not in invalid]

In [5]:
def get_tweets_limit(id_list):
    """
    gets the most recent 200 tweets from each user timeline
    
    Argument: a list of twitter ids
    
    Return: a tweets dataframe
    """
    col_names = ['tweet_id', 'tweet_created_at', 'tweet_source', 'tweet_text', 'twitter_id']
    # create a list that stores the tweets for all the users
    all_tweets = []
    for twitter_id in id_list:
        # read the tweets from the timeline, 200 is the maximum allowed count
        tweets = api.user_timeline(user_id=twitter_id, count=200)
        # save one row per tweet, tagged with the id of its author
        rows = [[tweet.id_str, tweet.created_at, tweet.source,
                 tweet.text.encode("utf-8"), twitter_id] for tweet in tweets]
        all_tweets.extend(rows)
    # passing the column names here also works when no tweets were returned
    return pd.DataFrame(all_tweets, columns=col_names)


In [6]:
def create_tweets_df(user_file):
    """
    create a dataframe that stores the existing user tweets
    
    Argument: the name of a user info CSV file (with an 'id' column) in User_dataset/
    
    Return: a tweets dataframe
    """
    user_df = pd.read_csv("User_dataset/" + user_file)
    twitter_id = list(user_df['id'])
    tweets_df = get_tweets_limit(twitter_id)
    return tweets_df

In [7]:
# Use reference: https://gist.github.com/yanofsky/5436496#file-tweet_dumper-py

def get_all_tweets(twitter_id):
    """
    downloads the tweets of a user page by page and saves them to <twitter_id>_tweets.csv
    """
    all_tweets = []
    # request the most recent tweets, 200 is the maximum allowed count
    tweets = api.user_timeline(user_id=twitter_id, count=200)
    # keep paging backwards until no tweets are returned
    # (this also avoids an IndexError when the timeline is empty)
    while len(tweets) > 0:
        all_tweets.extend(tweets)
        # only ask for tweets older than the oldest one seen so far
        oldest = all_tweets[-1].id - 1
        tweets = api.user_timeline(user_id=twitter_id, count=200, max_id=oldest)
    rows = [[tweet.id_str, tweet.created_at, tweet.source,
             tweet.text.encode("utf-8"), twitter_id] for tweet in all_tweets]
    with open('%s_tweets.csv' % twitter_id, 'wb') as f:
        writer = csv.writer(f)
        # the tweet id and the user id get distinct column names
        writer.writerow(["tweet_id", "created_at", "source", "tweet_text", "twitter_id"])
        writer.writerows(rows)

In [52]:
get_all_tweets(33212890)

Reference: http://www.cs.wm.edu/~hnw/paper/tdsc12b.pdf

Users post tweets either manually or via auto-piloted tools. 

The devices used to post tweets are divided into the following categories:

1) Twitter Web Client

2) Mobile devices: Android and iOS

3) Registered third-party applications: website integrators (twitpic, bit.ly, Facebook), browser extensions (Tweetbar and Twitterfox for Firefox), desktop clients (TweetDeck and Seesmic Desktop), and RSS feeds/blog widgets (twitterfeed and Twitter for Wordpress)

4) APIs: third-party applications not registered or certified by Twitter, labeled as API 
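
The four categories above can be turned into a feature by mapping each value of the `tweet_source` column to a category. This is only a sketch: the lookup sets are hypothetical and incomplete (they contain names from the list above and from the sample outputs, plus "Twitter for Android" as an assumed label), and `categorize_source` and `source_shares` are illustrative names.

```python
from collections import Counter

# Hypothetical, incomplete lookup tables (all lowercase for matching).
WEB_SOURCES = {"twitter web client"}
MOBILE_SOURCES = {"twitter for iphone", "twitter for android"}
THIRD_PARTY_SOURCES = {
    "twitpic", "bit.ly", "facebook", "tweetdeck", "seesmic desktop",
    "twitterfeed", "twitter for wordpress", "waze",
}


def categorize_source(source):
    """Map a tweet source label to one of the four device categories."""
    name = source.strip().lower()
    if name in WEB_SOURCES:
        return "web"
    if name in MOBILE_SOURCES:
        return "mobile"
    if name in THIRD_PARTY_SOURCES:
        return "third_party"
    if name == "api":
        return "api"
    # anything not covered by the lookup tables is left unclassified
    return "unknown"


def source_shares(sources):
    """Return the fraction of tweets posted from each category."""
    counts = Counter(categorize_source(s) for s in sources)
    total = float(len(sources))
    return {category: count / total for category, count in counts.items()}


print(source_shares(["Facebook", "Facebook", "Waze", "Twitter Web Client"]))
```

Applied to the two sample outputs below, an account posting almost only through Facebook and Waze would get a high `third_party` share, while one posting from "Twitter for iPhone" would be mostly `mobile`.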

In [27]:
# genuine
get_tweets_limit([2492782375, 293212315])

Unnamed: 0,tweet_id,tweet_created_at,tweet_source,tweet_text,twitter_id
0,860434971432964096,2017-05-05 10:04:17,Twitter for iPhone,RT @Afro_spirits: そもそも財政難ではない。 https://t.co/8x...,2492782375
1,860245821119320064,2017-05-04 21:32:41,Twitter for iPhone,RT @namiheiAMURO: 人手不足というのは、現在の賃金が低すぎて、超過需要が生じ...,2492782375
2,860243247825772544,2017-05-04 21:22:27,Twitter for iPhone,奴らにとって憲法典は聖遺物とかモノリスみたいなモンなんでしょ,2492782375
3,860242909588631552,2017-05-04 21:21:06,Twitter for iPhone,RT @mollichane: 憲法学者と言う人種がいかにデタラメか分かる図。\n図１の「自...,2492782375
4,859520657855467521,2017-05-02 21:31:08,Twitter for iPhone,RT @saitohisanori: 「私が金融緩和を始めたとき株がバブルになるとかドルが暴...,2492782375
5,859280547377422337,2017-05-02 05:37:01,Twitter Web Client,RT @kyounoowari: 物価が再び下落する中、保険という詐欺みたいな言葉で増税主張...,2492782375
6,858935817628024832,2017-05-01 06:47:11,Twitter for iPhone,RT @kokoro_gif_bot: #心が乱れた時に見るgif\nケルベロス型加湿器！？...,2492782375
7,858363690512400385,2017-04-29 16:53:46,Twitter Web Client,RT @Pekaso: 講師「業務中に疑問に思ったことを何でも良いので思い浮かべてください」...,2492782375
8,858067988112523264,2017-04-28 21:18:45,Twitter for iPhone,RT @Black_Post_Bot: 身の回りにブラック企業が溢れているから忘れがちだけど...,2492782375
9,858065248451870720,2017-04-28 21:07:51,Twitter for iPhone,RT @washizutan: 取引先の担当者で、新人の事務員ちゃんにPCの使い方教えてると...,2492782375


In [28]:
# fake
get_tweets_limit([24858289, 33212890])

Unnamed: 0,tweet_id,tweet_created_at,tweet_source,tweet_text,twitter_id
0,860374756788494336,2017-05-05 06:05:01,Facebook,https://t.co/xXB9UH9wRv,24858289
1,860203786660458500,2017-05-04 18:45:39,Facebook,I posted a new video to Facebook https://t.co/...,24858289
2,860192258926235648,2017-05-04 17:59:50,Facebook,https://t.co/IXBfbR5IEs,24858289
3,860174783979368448,2017-05-04 16:50:24,Waze,Ho aiutato automobilisti nei paraggi segnalan...,24858289
4,860095488154116096,2017-05-04 11:35:18,Facebook,https://t.co/wgjhLB5oTs,24858289
5,860091656938377216,2017-05-04 11:20:05,Facebook,https://t.co/otKeRMrann,24858289
6,860007135513649153,2017-05-04 05:44:13,Waze,Ho aiutato automobilisti nei paraggi segnalan...,24858289
7,860005782548602880,2017-05-04 05:38:51,Waze,Ho aiutato automobilisti nei paraggi segnalan...,24858289
8,859837992470552584,2017-05-03 18:32:07,Facebook,https://t.co/rANNhPHLRO,24858289
9,859835250503012352,2017-05-03 18:21:13,Facebook,https://t.co/mclpa9sWbT,24858289


<li>Calculate the percentage of tweets containing URLs (|tweets containing a URL|/|tweets|).</li>

In [8]:
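def url_ratio(user_id):
    """
    calculate the percentage of Tweets containing URLs
    
    Argument: user_id (a string), used to locate <user_id>_tweets.csv
    
    Return: tweets_url_ratio, formatted as a percentage string
    """
    user_tweets = pd.read_csv(user_id + "_tweets.csv")
    # note: the literal "http:" does not match https:// links;
    # a pattern such as "https?://" would count both schemes
    has_url = user_tweets['tweet_text'].str.contains("http:", na=False)
    tweets_url_ratio = has_url.sum() / len(has_url)
    return '{0:.4f}%'.format(100 * tweets_url_ratio)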
    

In [9]:
url_ratio("33212890")

'0.0930%'

<li>Calculate the ratio of the number of unique URLs to the number of tweets (|unique URLs|/|tweets|).</li>

In [11]:
def url_unique_ratio(user_id):
    """
    calculate the ratio of the number of unique URLs
    
    Argument: user_id (a string), used to locate <user_id>_tweets.csv
    
    Return: url_ratio, formatted as a percentage string
    """
    user_tweets = pd.read_csv(user_id + "_tweets.csv")
    # find all the urls using regular expression
    urls = [re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', tweets) for tweets in user_tweets['tweet_text']]
    # flatten a list of lists
    urls_flatten = flatten(urls)
    # keep the first three '/'-separated parts of each url, i.e. the scheme and the host
    # (the path is dropped, so two links on the same host count as one url)
    urls_split = [u.split('/')[0:3] for u in urls_flatten]
    
    # find the unique urls: lists are not hashable, so each one is converted
    # to a tuple before the set removes the duplicates
    urls_unique = set(tuple(u) for u in urls_split)
    url_unique = len(urls_unique)
    tweet_total = len(user_tweets['tweet_text'])
    url_ratio = url_unique / tweet_total
    return '{0:.4f}%'.format(100 * url_ratio)


In [12]:
url_unique_ratio("33212890")

'0.1550%'
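
Both URL features are sensitive to two choices in the cells above: `url_ratio` matches only the literal `http:`, and `url_unique_ratio` compares URLs by scheme and host, so every `t.co` short link collapses into one. The sketch below shows an alternative under different assumptions: it works on an in-memory list of tweet texts, matches both `http://` and `https://`, and treats the full URL string as the unit of uniqueness. The name `url_features` and the simplified pattern `https?://\S+` are illustrative choices, not the notebook's method.

```python
import re

# Simplified pattern (an assumption): a URL runs until the next whitespace.
URL_RE = re.compile(r"https?://\S+")


def url_features(tweets):
    """Return (share of tweets with a URL, unique URLs per tweet)."""
    total = len(tweets)
    if total == 0:
        return 0.0, 0.0
    urls_per_tweet = [URL_RE.findall(text) for text in tweets]
    # number of tweets that contain at least one URL
    with_url = sum(1 for urls in urls_per_tweet if urls)
    # distinct full URLs across all the tweets
    unique_urls = set(url for urls in urls_per_tweet for url in urls)
    return with_url / float(total), len(unique_urls) / float(total)


sample = [
    "see http://a.com/x",
    "see https://a.com/y now",
    "no link here",
    "again https://a.com/y",
]
print(url_features(sample))
```

Whether shortened links should be compared as full strings, by host, or after expansion depends on what "unique URL" is meant to capture in [3]; the sketch only makes the choice explicit.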

<li> Calculate the hashtag ratio</li>

In [15]:
def hashtag_ratio(user_id):
    """
    calculate the hashtag ratio, i.e. the percentage of tweets containing a "#"
    
    Argument: user_id (a string), used to locate <user_id>_tweets.csv
    
    Return: hashtag ratio, formatted as a percentage string
    """
    user_tweets = pd.read_csv(user_id + "_tweets.csv")
    hashtag_ratio = 100 * (sum(user_tweets['tweet_text'].str.contains("#")) / len(user_tweets['tweet_text']))
    return '{0:.2f}%'.format(hashtag_ratio)


In [16]:
hashtag_ratio("33212890")

'56.23%'

<li> Calculate the ratio of the number of @usernames [3]</li>

In [17]:
def username_ratio(user_id):
    """
    calculate the username ratio, i.e. the percentage of tweets containing an "@"
    
    Argument: user_id (a string), used to locate <user_id>_tweets.csv
    
    Return: username ratio, formatted as a percentage string
    """
    user_tweets = pd.read_csv(user_id + "_tweets.csv")
    username_ratio = 100 * (sum(user_tweets['tweet_text'].str.contains("@")) / len(user_tweets['tweet_text']))
    return '{0:.2f}%'.format(username_ratio)

In [19]:
username_ratio("33212890")

'76.78%'

<li>The ratio of the number of unique @usernames [3]</li>

In [21]:
def username_unique_ratio(user_id):
    """
    calculate the ratio of the number of unique @usernames
    
    Argument: user_id (a string), used to locate <user_id>_tweets.csv
    
    Return: username_unique_ratio, formatted as a percentage string
    """
    user_tweets = pd.read_csv(user_id + "_tweets.csv")
    username = [re.findall('@([A-Za-z0-9_]+)', tweets) for tweets in user_tweets['tweet_text']]
    # flatten a list of lists
    username_flatten = flatten(username)
    username_unique = set(username_flatten)
    user_unique = len(username_unique)
    # the denominator is the total number of tweets,
    # not the total number of @usernames that were mentioned
    tweet_total = len(user_tweets['tweet_text'])
    user_ratio = user_unique / tweet_total
    return '{0:.4f}%'.format(100 * user_ratio)

In [22]:
username_unique_ratio("33212890")

'38.9337%'
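
The comment in `username_unique_ratio` raises the question of which denominator to use. A small sketch that reports both options side by side, assuming an in-memory list of tweet texts; the name `mention_features` is hypothetical, and lowercasing the usernames before comparing them is an added assumption (Twitter usernames are case-insensitive).

```python
import re

MENTION_RE = re.compile(r"@([A-Za-z0-9_]+)")


def mention_features(tweets):
    """Return (unique @usernames per tweet, unique @usernames per mention)."""
    # all the mentions in all the tweets, compared case-insensitively
    mentions = [name.lower() for text in tweets
                for name in MENTION_RE.findall(text)]
    unique = set(mentions)
    per_tweet = len(unique) / float(len(tweets)) if tweets else 0.0
    per_mention = len(unique) / float(len(mentions)) if mentions else 0.0
    return per_tweet, per_mention


sample = ["@a hi @b", "@A again", "no mention", "plain text"]
print(mention_features(sample))
```

The per-mention variant stays close to 1 for an account that addresses a new user in almost every tweet, such as the spambot sample shown below, regardless of how many tweets carry no mention at all.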

In [2]:
new_traditional_spambots_3_tweets = pd.read_csv("new_traditional_spambots_3_tweets.csv")

In [3]:
tweets_df = new_traditional_spambots_3_tweets

In [7]:
test = tweets_df[tweets_df['twitter_id'] == 325403988]

In [5]:
test

Unnamed: 0.1,Unnamed: 0,tweet_id,tweet_created_at,tweet_text,twitter_id
0,0,108826506024849408,2011-08-31 09:00:35,@ayesweetz http://t.co/muLal67,325403988
1,1,108826192748097537,2011-08-31 08:59:20,@Tawneeeee http://t.co/muLal67,325403988
2,2,108825876396908544,2011-08-31 08:58:05,@what_it_is_KP http://t.co/muLal67,325403988
3,3,108825529519579136,2011-08-31 08:56:42,@unrealfred http://t.co/muLal67,325403988
4,4,108825174769545216,2011-08-31 08:55:18,@beardedotp http://t.co/muLal67,325403988
5,5,108824855058710528,2011-08-31 08:54:01,@mtygris http://t.co/muLal67,325403988
6,6,108824494516338689,2011-08-31 08:52:36,@BandBAberfeldy http://t.co/muLal67,325403988
7,7,108824143327281152,2011-08-31 08:51:12,@Adam_steward http://t.co/muLal67,325403988
8,8,108823768335519745,2011-08-31 08:49:42,@aaAdeeNnnn http://t.co/muLal67,325403988
9,9,108823379582259200,2011-08-31 08:48:10,@s_nadan http://t.co/muLal67,325403988


In [54]:
def extract_words(tweets_df, user_id):
    """
    extract the important words from each tweet of a user
    
    Argument: tweets_df, user_id
    
    Return: a list of word lists, one per tweet
    """
    # select the tweets of the requested user (the id was hard-coded before)
    user_tweets = tweets_df.loc[tweets_df['twitter_id'] == user_id]
    stop = set(stopwords.words('english'))
    # pattern for the tokens to drop: urls, @usernames and #hashtags
    url_re = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    noise = re.compile(url_re + r'|@[A-Za-z0-9_]+|#\w+')
    result = []
    for tweet in user_tweets['tweet_text']:
        # delete the special characters, then the urls, usernames and hashtags
        text = tweet.decode('unicode_escape').encode('ascii', 'ignore')
        text = noise.sub(' ', text)
        # keep the words that are not stop words
        result.append([w for w in text.split() if w.lower() not in stop])
    return result

In [25]:
### Question 1: How to remove unnecessary words

# get the tweets for a certain user (User ID: 2310064794)
user_id = 2310064794
user_tweets = tweets_df.loc[tweets_df['twitter_id'] == user_id]

In [None]:
# use regular expression to find the username, hashtag, urls
username = [re.findall('@([A-Za-z0-9_]+)', tweets) for tweets in user_tweets['tweet_text']]
hashtag = [re.findall(r"#(\w+)", tweets) for tweets in user_tweets['tweet_text']]
urls = [re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', tweets) for tweets in user_tweets['tweet_text']]
# delete the special characters from the tweets
final_tweets = [tweets.decode('unicode_escape').encode('ascii','ignore') for tweets in user_tweets['tweet_text']]
# get a list of stop words
stop = stopwords.words('english')
delete_list = [username, hashtag, urls]
# create a list of items that need to be deleted
delete_list = delete_list + stop
delete_flatten = flatten(delete_list)
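# find the important words
# note: this compares each whole tweet (not its individual words) against
# delete_flatten, so the urls, hashtags and stop words stay inside the tweets
important_words = filter(lambda words: words not in delete_flatten, final_tweets)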

In [6]:
important_words

['All the Biggest names are on  https://t.co/xLO5DreL9u  https://t.co/Oc5xKL4x9B https://t.co/Kr4D4PIuJe',
 'Biggest names are on  https://t.co/xLO5DreL9u  https://t.co/4qQ3ZD2FcP https://t.co/eTdfvCB2c7',
 'Great fun watch  check it out   @  slappie.us ( https://t.co/ULvq9abRP4 ) #watch #fun  #ThursdayThoughts https://t.co/hGk1Fk0qW6',
 'JOB ALERT #IT #job #hiring     Senior .Net Developer    #developer  https://t.co/ZkOuqfK07V https://t.co/HuPQCyl0f9',
 ' https://t.co/xLO5DreL9u  https://t.co/W7v5EmtJs3 https://t.co/hXxH5CXxLS',
 'Top Companies are on  https://t.co/xLO5DreL9u  https://t.co/eMvmH55mtZ https://t.co/8HVvyuPk4E',
 'Big names are on  https://t.co/xLO5DreL9u  https://t.co/HuWDvD7goG https://t.co/1qQS9dvqY1',
 'Search and apply to thousands of tech jobs on https://t.co/xLO5DreL9u https://t.co/wwcm1ilRJQ https://t.co/V27zyNzLK9',
 'Featured by Entrepreneur Magazine https://t.co/IAOLd2nqWl https://t.co/3sG0461CNq',
 'Just look who uses  https://t.co/xLO5DreL9u  https://t.co/T

In [23]:
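def comb_2(num_tweets):
    """
    number of unordered tweet pairs, C(n, 2) = n * (n - 1) / 2
    """
    # integer arithmetic avoids the float overflow that factorial() with
    # true division causes for accounts with many tweets
    return num_tweets * (num_tweets - 1) // 2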
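
The pair count returned by `comb_2` is the $l_p$ term of similarity measure (1). The following is a minimal sketch of the whole measure, under assumptions that [4] may define differently: words are whitespace-separated lowercase tokens, $c(p)$ counts the distinct words the two tweets share, and the tweet length $l_a$ is measured in words. The name `message_similarity` is hypothetical.

```python
from itertools import combinations


def message_similarity(tweets):
    """S = sum of c(p) over all tweet pairs p, divided by (l_a * l_p)."""
    token_sets = [set(text.lower().split()) for text in tweets]
    lengths = [len(text.split()) for text in tweets]
    # P: every unordered pair of tweets, so l_p = n * (n - 1) / 2
    pairs = list(combinations(range(len(tweets)), 2))
    if not pairs:
        return 0.0
    # l_a: average tweet length, measured in words (an assumption)
    l_a = sum(lengths) / float(len(lengths))
    if l_a == 0:
        return 0.0
    # c(p): number of distinct words shared by the two tweets of a pair
    shared = sum(len(token_sets[i] & token_sets[j]) for i, j in pairs)
    return shared / (l_a * len(pairs))


sample = ["a b c", "a b d", "x y z"]
print(message_similarity(sample))
```

Using `combinations` visits each unordered pair once and never pairs a tweet with itself, which a doubly nested loop over the same column does not guarantee.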

In [40]:
user_tweets = tweets_df.loc[tweets_df['twitter_id'] == 2310064794]

<li>Tweet similarity:   
(1) $S=\frac{\sum_{p\in P}c(p)}{l_a l_p}$    
where $P$ is the set of possible tweet-to-tweet combinations among any two tweets logged for a certain account, $p$ is a single pair, $c(p)$ is a function that counts the number of words the two tweets share, $l_a$ is the average length of the tweets posted by that user, and $l_p$ is the number of tweet combinations. A profile sending similar tweets will have a low value of $S$. [4] 

In [38]:
for i in num_tweets['tweet_text']:
    for j in num_tweets['tweet_text']:
        print i + j

Biggest names are on ☆★☆ https://t.co/xLO5DreL9u ☆★☆ https://t.co/4qQ3ZD2FcP https://t.co/eTdfvCB2c7Biggest names are on ☆★☆ https://t.co/xLO5DreL9u ☆★☆ https://t.co/4qQ3ZD2FcP https://t.co/eTdfvCB2c7
Biggest names are on ☆★☆ https://t.co/xLO5DreL9u ☆★☆ https://t.co/4qQ3ZD2FcP https://t.co/eTdfvCB2c7Great fun watch ⌚ check it out 😍 👉 @  slappie.us ( https://t.co/ULvq9abRP4 ) #watch #fun  #ThursdayThoughts https://t.co/hGk1Fk0qW6
Biggest names are on ☆★☆ https://t.co/xLO5DreL9u ☆★☆ https://t.co/4qQ3ZD2FcP https://t.co/eTdfvCB2c7JOB ALERT #IT #job #hiring  💻   Senior .Net Developer   👨‍💻👩‍💻 #developer  https://t.co/ZkOuqfK07V https://t.co/HuPQCyl0f9
Great fun watch ⌚ check it out 😍 👉 @  slappie.us ( https://t.co/ULvq9abRP4 ) #watch #fun  #ThursdayThoughts https://t.co/hGk1Fk0qW6Biggest names are on ☆★☆ https://t.co/xLO5DreL9u ☆★☆ https://t.co/4qQ3ZD2FcP https://t.co/eTdfvCB2c7
Great fun watch ⌚ check it out 😍 👉 @  slappie.us ( https://t.co/ULvq9abRP4 ) #watch #fun  #ThursdayThoughts http

In [19]:
user_tweets = tweets_df.loc[tweets_df['twitter_id'] == 2310064794]

In [30]:
num_tweets = len(user_tweets['tweet_text'])

In [32]:
num_tweets = user_tweets[1:4]

In [35]:
num_tweets['tweet_text']

56501    Biggest names are on ☆★☆ https://t.co/xLO5DreL...
56502    Great fun watch ⌚ check it out 😍 👉 @  slappi...
56503    JOB ALERT #IT #job #hiring  💻   Senior .Net D...
Name: tweet_text, dtype: object

   
(2) $\sum_{a,b \in \text{set of pairs in tweets}}\frac{\text{similarity}(a,b)}{|\text{set of pairs in tweets}|}$    
where the content similarity is computed using the standard cosine similarity over the bag-of-words vector representation $\mathbf{V}(a)$ of the tweet content: $\text{similarity}(a,b)=\frac{\mathbf{V}(a)\cdot\mathbf{V}(b)}{|\mathbf{V}(a)|\,|\mathbf{V}(b)|}$ 
Since tweets are extremely short (140 characters or fewer), we consider a bag-of-words model and a sparse bigrams model. [3]</li>
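
Measure (2) is not implemented in this notebook. A minimal sketch of the bag-of-words variant follows, assuming whitespace-separated lowercase tokens and raw word counts as vector entries; the sparse bigrams model is not covered, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def bag_of_words(text):
    """Sparse bag-of-words vector V(a): word -> number of occurrences."""
    return Counter(text.lower().split())


def cosine_similarity(va, vb):
    """Cosine of the angle between two sparse word-count vectors."""
    # a Counter returns 0 for a missing word, so only shared words contribute
    dot = sum(count * vb[word] for word, count in va.items())
    norm_a = sqrt(sum(count * count for count in va.values()))
    norm_b = sqrt(sum(count * count for count in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def average_pairwise_similarity(tweets):
    """Mean cosine similarity over all unordered pairs of tweets."""
    vectors = [bag_of_words(text) for text in tweets]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


sample = ["a b", "a b", "c d"]
print(average_pairwise_similarity(sample))
```

Unlike $S$ in measure (1), this score is bounded between 0 and 1 for word counts, which makes accounts with very different tweet lengths easier to compare.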