# Loading Scraped Data
## Fanpage List
This site provides not only categories for the Twitter accounts but also subcategories, which will be useful for training user classifiers now and topic classifiers later.

In [1]:
import pandas as pd

fanpage_list = pd.read_csv('data/scraped/fanpagelist_scrapped_data_16_06_2015.csv', header=None, names=['name', 'handle','category','subcategory'])
fanpage_list.head()

Unnamed: 0,name,handle,category,subcategory
0,Facebook,facebook,Brand,Technology
1,YouTube,YouTube,Product,Technology
2,McDonald's,McDonalds,Brand,Dining
3,MTV,MTV,TV Show,MTV
4,Disney,Disney,Brand,Media


In [2]:
brand_categories = ['Organization', 'TV Show', 'Brand', 'Sports Team', 'News', 'Product', 'Movie', 'Game']
person_categories = ['Athlete', 'Musician', 'Politician', 'Actor', 'Model', 'Comedian', 'Reality Star', 'TV Host', 'Executive', 'Author', 'Pro Dancer']

print('Fanpage List has', len(fanpage_list), 'trained items.')
# DataFrame.sort was removed from pandas; sort_values is the current equivalent
fanpage_list[['handle','category']].groupby('category').count().sort_values('handle', ascending=False)

Fanpage List has 2319 trained items.


category,handle
Athlete,499
Organization,459
TV Show,274
Brand,256
Musician,202
Sports Team,183
News,137
Movie,84
Product,60
Politician,48


### News category might be dubious

The only dubious category is News: it could mix individual reporters with news outlets, so it needs a manual check.

In [3]:
fanpage_list[fanpage_list['category'] == 'News'].head(5)

Unnamed: 0,name,handle,category,subcategory
11,National Geographic,NatGeo,News,Travel Channel
55,CNN Breaking News,cnnbrk,News,World
56,CNN,CNN,News,U.S.
75,ESPN,espn,News,Sports
100,BBC World News,BBCWorld,News,World


After manual analysis, the 'News' category turns out to contain only news outlets.

In [4]:
fanpage_list_brands = fanpage_list[fanpage_list['category'].isin(brand_categories)]
fanpage_list_person = fanpage_list[fanpage_list['category'].isin(person_categories)]
print('Training brands count:', len(fanpage_list_brands))
print('Training person count:', len(fanpage_list_person))

Training brands count: 1490
Training person count: 817
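
The two counts add up to 1490 + 817 = 2307, fewer than the 2319 rows loaded, so a few rows seem to carry a category that appears in neither list. A minimal sketch of how such rows could be listed, using a small made-up frame instead of the scraped file (the rows and the 'Place' category are invented for illustration):

```python
import pandas as pd

# Toy stand-in for fanpage_list; these rows are invented, not scraped data.
toy = pd.DataFrame({
    'handle': ['facebook', 'espn', 'someathlete', 'somecity'],
    'category': ['Brand', 'News', 'Athlete', 'Place'],
})

brand_categories = ['Organization', 'TV Show', 'Brand', 'Sports Team',
                    'News', 'Product', 'Movie', 'Game']
person_categories = ['Athlete', 'Musician', 'Politician', 'Actor', 'Model',
                     'Comedian', 'Reality Star', 'TV Host', 'Executive',
                     'Author', 'Pro Dancer']

# Rows whose category is in neither list are left out of both subsets.
known = toy['category'].isin(brand_categories + person_categories)
unassigned = toy[~known]
print(unassigned['category'].value_counts())
```

Running the same filter on `fanpage_list` would show which categories the remaining rows belong to.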


## Social Brand Index data
A site with a collection of Twitter brand handles.

In [5]:
social_brand_index = pd.read_csv('data/scraped/socialbrandindex_scrapped_data_19_06_2015.csv', header=None, names=['name', 'handle','category','subcategory'])
social_brand_index.head()

Unnamed: 0,name,handle,category,subcategory
0,3M,3M,Business,
1,Abbott Laboratories,abbottnews,Business,
2,Actavis,Actavis,Business,
3,Advance Auto Parts,aapdeals,Business,
4,Advanced Micro Devices,amd,Business,


In [6]:
print('Social Brand Index has', len(social_brand_index), 'trained items.')
# DataFrame.sort was removed from pandas; sort_values is the current equivalent
social_brand_index[['handle','category']].groupby('category').count().sort_values('handle', ascending=False).head()

Social Brand Index has 400 trained items.


category,handle
Business,400


In [7]:
print('Training brands count:', len(social_brand_index['handle'].unique()))

Training brands count: 398
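
The file has 400 rows but only 398 distinct handles, so some handles are repeated. A minimal sketch of how the repeated rows could be inspected and dropped, on a made-up frame (the rows are invented for illustration):

```python
import pandas as pd

# Toy stand-in for social_brand_index; these rows are invented.
toy = pd.DataFrame({
    'name': ['3M', '3M Company', 'Advanced Micro Devices'],
    'handle': ['3M', '3M', 'amd'],
})

# keep=False marks every row that shares its handle with another row.
repeated = toy[toy['handle'].duplicated(keep=False)]

# Keep only the first row seen for each handle.
deduplicated = toy.drop_duplicates(subset='handle', keep='first')
print(len(toy), len(deduplicated))
```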


## Twibs data
A site with a collection of Twitter brand handles.

In [8]:
# The first column holds display names (e.g. 'Kathy Johnson') and the second holds handles
twibs = pd.read_csv('data/scraped/twibs_scrapped_data_17_06_2015.csv', header=None, names=['name', 'handle', 'category', 'subcategory'])
twibs.head()

Unnamed: 0,name,handle,category,subcategory
0,andriisedniev,andriisedniev,Bloggers Services,Business
1,danzarrella,danzarrella,Bloggers Services,Business
2,Kathy Johnson,Kathy_Johnson,Bloggers Services,Business
3,Jaquoneoqhdoqhd,Jaquoneoqhdoqhd,Bloggers Services,Business
4,pchaney,pchaney,Bloggers Services,Business


In [9]:
print('Twibs has', len(twibs), 'trained items.')
# DataFrame.sort was removed from pandas; sort_values is the current equivalent
twibs[['handle','category','subcategory']].groupby('category').count().sort_values('handle', ascending=False)

Twibs has 2127 trained items.


category,handle,subcategory
Bloggers Services,279,279
Media,225,225
Food and Restaurants,217,217
Community and Education,180,180
Tech,180,180
Health,180,180
Retail,160,160
Electronics,147,147
Travel and Recreation,147,147
Web,107,107


In [10]:
# DataFrame.sort was removed from pandas; sort_values is the current equivalent
twibs[['handle','category','subcategory']].groupby('subcategory').count().sort_values('handle', ascending=False)

subcategory,handle,category
Music,60,60
Gaming,40,40
Politics,40,40
Organizations,20,20
Realtors,20,20
Real Estate Investment,20,20
Radio,20,20
Programming Languages,20,20
Power Supply,20,20
Platform,20,20


Most of the Twibs data is a mix of brands and persons, so it has to be classified manually and will not be used in this notebook.

## Using Twitter to retrieve metadata
The scraped data consists mainly of handles and categories, so we need to retrieve extra actor information for training.

In [11]:
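import os
from twitter import OAuth, Twitter

# Credentials are read from environment variables instead of being hard-coded.
# The variable names below are illustrative; set them to your own app's values.
token = os.environ['TWITTER_ACCESS_TOKEN']
token_key = os.environ['TWITTER_ACCESS_TOKEN_SECRET']
con_secret = os.environ['TWITTER_CONSUMER_KEY']
con_secret_key = os.environ['TWITTER_CONSUMER_SECRET']

t = Twitter(auth=OAuth(token, token_key, con_secret, con_secret_key))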

In [12]:
print(social_brand_index.columns, len(social_brand_index['handle'].unique()))
print(fanpage_list_brands.columns, len(fanpage_list_brands['handle'].unique()))

brand_handles_list = pd.concat([social_brand_index['handle'], fanpage_list_brands['handle']]).unique()
print(len(brand_handles_list))

Index(['name', 'handle', 'category', 'subcategory'], dtype='object') 398
Index(['name', 'handle', 'category', 'subcategory'], dtype='object') 1398
1785
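
The two sources hold 398 + 1398 = 1796 distinct handles between them, while the combined list has 1785, so 11 handles appear in both. A minimal sketch of how the overlap could be listed, using made-up handles. Since Twitter screen names are case-insensitive, a lower-cased comparison may find matches that `unique()` treats as different:

```python
import pandas as pd

# Invented handles standing in for the two scraped sources.
sbi_handles = pd.Series(['3M', 'amd', 'Gap'])
fanpage_handles = pd.Series(['Gap', 'MTV', 'AMD'])

# Exact, case-sensitive overlap (what pd.concat(...).unique() collapses).
exact_overlap = set(sbi_handles) & set(fanpage_handles)

# Case-insensitive overlap, closer to how Twitter resolves screen names.
folded_overlap = set(sbi_handles.str.lower()) & set(fanpage_handles.str.lower())

print(sorted(exact_overlap), sorted(folded_overlap))
```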


In [13]:
import csv
import os.path

def export_extracted_data_to_csv(profiles, file_name):
    keys = profiles[0].keys()
    with open(file_name, 'w', newline='') as output_file:
        dict_writer = csv.DictWriter(output_file, keys)
        dict_writer.writeheader()
        dict_writer.writerows(profiles)
        
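def import_dict_from_csv(file_name):
    """Return the CSV rows as a list of dicts, or an empty list if the file is missing."""
    result = []
    if os.path.isfile(file_name):
        # Use a context manager so the file handle is closed after reading
        with open(file_name, newline='') as input_file:
            result = list(csv.DictReader(input_file))

    return result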

In [14]:
import numpy as np
import json

brand_profiles = import_dict_from_csv('data/csv/extracted_twitter_actor_info.csv')
if len(brand_profiles) == 0:
    indices = np.arange(len(brand_handles_list))
    max_mod = int((len(brand_handles_list)/100)+1)
    brand_profiles = []
    for x in range(0, max_mod):
        handles = brand_handles_list[(indices % max_mod) == x]
        actors = t.users.lookup(screen_name=','.join(handles), _timeout=3)
        for actor in actors:
            brand_profiles.append(actor)
        print("Extracted", len(actors), "handles from twitter API")

    export_extracted_data_to_csv(brand_profiles, 'data/csv/extracted_twitter_actor_info.csv')

print('Extracted', len(brand_profiles), 'brand actors from Twitter')

Extracted 1649 brand actors from Twitter
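
The lookup cell splits the handles by index modulus so that each request stays at or below 100 screen names, which, as far as I know, is the limit of the `users/lookup` endpoint. A minimal sketch of an alternative that cuts the list into consecutive slices; the `batches` helper and the handle names are hypothetical and only illustrate the batching arithmetic:

```python
import numpy as np

def batches(handles, size=100):
    """Yield consecutive slices holding at most `size` handles each."""
    for start in range(0, len(handles), size):
        yield handles[start:start + size]

# Made-up handles with the same length as brand_handles_list (1785).
handles = np.array(['handle_%d' % n for n in range(1785)])

sizes = [len(batch) for batch in batches(handles)]
print(len(sizes), max(sizes), sizes[-1])
```

Each slice could then be joined with `','.join(batch)` and passed as `screen_name`, exactly as in the cell above.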


### Write the complete downloaded profiles from the Twitter API to a CSV file
This data may be used later for feature selection, so it is important that we store it.

In [30]:
import time
import sys

# Stored tweets make very long fields, so raise the csv field size limit
csv.field_size_limit(sys.maxsize)

complete_profiles = import_dict_from_csv('data/csv/extracted_complete_twitter_actor_info.csv')
if len(complete_profiles) > 0:
    print('Loading complete profiles from csv file...')
    brand_profiles = complete_profiles

def persistToFile(count, data):
    # Save a checkpoint every 25 profiles
    if count % 25 == 0:
        export_extracted_data_to_csv(data, 'data/csv/extracted_complete_twitter_actor_info.csv')
        print('---- Persisting to CSV file')

# enumerate(..., start=1) keeps the printed counter 1-based, while `actor`
# always refers to the profile being updated (indexing brand_profiles[i]
# after incrementing i would modify the next profile instead)
for i, actor in enumerate(brand_profiles, start=1):
    try:
        if not actor.get('tweet'):
            # Retrieve their last 'posted' tweet to use in the feature selection
            actor_tweets = t.statuses.user_timeline(screen_name=actor['screen_name'])
            actor['tweets'] = actor_tweets
            actor['tweet'] = actor_tweets[0] if actor_tweets else None
            persistToFile(i, brand_profiles)
            print(i, '. Importing tweets for', actor['screen_name'])
            time.sleep(5)
        else:
            persistToFile(i, brand_profiles)
            print(i, '. Imported tweets for', actor['screen_name'])
    except Exception:
        print("Unexpected error:", sys.exc_info()[0])

export_extracted_data_to_csv(brand_profiles, 'data/csv/extracted_complete_twitter_actor_info.csv')

Loading complete profiles from csv file...
1 . Imported tweets for 3M
2 . Imported tweets for AmericanExpress
3 . Imported tweets for BestBuy
4 . Imported tweets for CBRE
5 . Imported tweets for DanaHoldingCorp
6 . Imported tweets for DrPepperSnapple
7 . Imported tweets for ExelisInc
8 . Imported tweets for Gap
9 . Imported tweets for Kohls
10 . Imported tweets for MarriottIntl
11 . Imported tweets for OwensCorning
12 . Imported tweets for PPLElectric
13 . Imported tweets for RyderPR
14 . Imported tweets for Staples
15 . Imported tweets for Thrivent
16 . Imported tweets for urscorp
17 . Imported tweets for windowslive
18 . Imported tweets for Xbox
19 . Imported tweets for cnnbrk
20 . Imported tweets for Macys
21 . Imported tweets for gameloft
22 . Imported tweets for London2012
23 . Imported tweets for OldNavy
24 . Imported tweets for kingfisherworld
---- Persisting to CSV file
25 . Imported tweets for ilovebeingblack
26 . Imported tweets for redbox
27 . Imported tweets for TheOnion
28
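
Because the profiles are written with `csv.DictWriter`, nested values such as the tweet dictionary end up in the file as their Python string representation, and reading the file back returns plain strings. A minimal sketch of that round trip on an invented profile, using an in-memory buffer instead of a file; `ast.literal_eval` is one way to turn the string back into a dictionary:

```python
import ast
import csv
import io

# Invented profile with a nested value, standing in for a Twitter API result.
profile = {'screen_name': 'example', 'tweet': {'id': 1, 'entities': {'urls': []}}}

# Write the profile to an in-memory CSV, as DictWriter would to a file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(profile))
writer.writeheader()
writer.writerow(profile)

# Read it back: every field is now a string.
buffer.seek(0)
row = next(csv.DictReader(buffer))
print(type(row['tweet']))

# literal_eval safely parses the string form of Python literals.
restored = ast.literal_eval(row['tweet'])
print(restored['id'])
```

Rows whose `tweet` field is empty cannot be parsed this way and would raise an error.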

### Write a CSV file in the default training format


In [31]:
import ast
import re
from html import escape

s = 0
e = 0
brand_profiles = import_dict_from_csv('data/csv/extracted_complete_twitter_actor_info.csv')
# newline='' lets the csv module control line endings, as its documentation recommends
with open('data/csv/brand_trainned.csv', 'w', newline='') as csv_file:
    tweets_writer = csv.writer(csv_file)
    tweets_writer.writerow([
        'actor_id',
        'actor_screen_name',
        'actor_name',
        'actor_verified',
        'actor_friends_count',
        'actor_followers_count',
        'actor_listed_count',
        'actor_statuses_count',
        'actor_favorites_count',
        'actor_summary',
        'actor_created_at',
        'actor_location',
        
        'tweet_id',
        'tweet_created_at',
        'tweet_generator',
        'tweet_body',
        'tweet_verb',
            
        'tweet_urls_count',
        'tweet_mentions_count',
        'tweet_hashtags_count',
        'tweet_trends_count',
        'tweet_symbols_count'])
    for profile in brand_profiles:
        try:
            tweet = ast.literal_eval((str(profile['tweet'])))
            tweets_writer.writerow([
                    tweet['user']['id'],
                    tweet['user']['screen_name'],
                    tweet['user']['name'],
                    tweet['user']['verified'],
                    tweet['user']['friends_count'],
                    tweet['user']['followers_count'],
                    tweet['user']['listed_count'],
                    tweet['user']['statuses_count'],
                    tweet['user']['favourites_count'],
                    tweet['user']['description'],
                    tweet['user']['created_at'],
                    tweet['user']['location'] if tweet['user'].get('location') else 'null',

                    tweet['id'],
                    tweet['created_at'],
                    re.findall('>(.*)<', tweet['source'])[0],
                    tweet['text'],
                    not tweet['retweeted'],
                    len(tweet['entities']['urls']),
                    len(tweet['entities']['user_mentions']),
                    len(tweet['entities']['hashtags']),
                    "",
                    len(tweet['entities']['symbols'])
                ])
            s += 1
            if s % 100 == 0:
                print('Already wrote', s, 'tweets to CSV')

        except Exception:
            # Profiles without a parsable tweet are counted and skipped
            e += 1
            if e % 100 == 0:
                print(e, 'errors saving the tweets to CSV')

Already wrote 100 tweets to CSV
Already wrote 200 tweets to CSV
Already wrote 300 tweets to CSV
Already wrote 400 tweets to CSV
Already wrote 500 tweets to CSV
Already wrote 600 tweets to CSV
Already wrote 700 tweets to CSV
Already wrote 800 tweets to CSV
Already wrote 900 tweets to CSV
Already wrote 1000 tweets to CSV
100 errors saving the tweets to CSV
Already wrote 1100 tweets to CSV
200 errors saving the tweets to CSV
Already wrote 1200 tweets to CSV
300 errors saving the tweets to CSV
Already wrote 1300 tweets to CSV
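
The `tweet_generator` column is taken from the text inside the anchor tag of the tweet's `source` field, and indexing `[0]` on an empty match raises an error that the loop counts as a failure. A minimal sketch of a more tolerant extraction; the `generator_name` helper and the sample string are illustrative, not part of the notebook:

```python
import re

def generator_name(source):
    """Return the text inside the anchor tag of `source`, or None if absent."""
    match = re.search(r'>(.*?)<', source or '')
    return match.group(1) if match else None

# Illustrative value shaped like the `source` field of a tweet.
sample = '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>'

print(generator_name(sample))
print(generator_name(''))
```

Returning `None` instead of raising would let a row be written with an empty generator, which the later `dropna` on `tweet_generator` would then filter out.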


## Exploring the brand training data
It is important to look at some characteristics of this data, since the classifier will be trained on it, together with the findings from the research already reviewed.


In [2]:
import pandas as pd

df_tweets = pd.read_csv('data/csv/brand_trainned.csv')
df_tweets = df_tweets.dropna(subset=['actor_summary', 'tweet_generator'])
print(len(df_tweets))
df_tweets.head()

1275


Unnamed: 0,actor_id,actor_screen_name,actor_name,actor_verified,actor_friends_count,actor_followers_count,actor_listed_count,actor_statuses_count,actor_favorites_count,actor_summary,...,tweet_id,tweet_created_at,tweet_generator,tweet_body,tweet_verb,tweet_urls_count,tweet_mentions_count,tweet_hashtags_count,tweet_trends_count,tweet_symbols_count
0,22919665,AirAsia,AirAsia,True,639,1537751,4916,35601,37,Welcome to our official Twitter account where ...,...,614395234659635200,Fri Jun 26 11:30:13 +0000 2015,Twitter Ads,为明年计划个周末出游吧！吉隆坡 - 新加坡，RM37起！快来抢购哦！http://t.co/...,True,1,0,0,,0
1,85399171,tvland,TV Land,True,18100,48050,795,29934,11320,"@GaffiganShow, @ImpastorTV & @The_Exes start W...",...,614312317313019904,Fri Jun 26 06:00:44 +0000 2015,Hootsuite,Nancy comes out on this episode of #Roseanne #...,True,0,0,2,,0
2,18360370,utahjazz,Utah Jazz,True,331,365180,3838,28392,3911,Official Twitter account of the Utah Jazz. Get...,...,614309402397573120,Fri Jun 26 05:49:09 +0000 2015,Tweetbot for iΟS,RT @TreyMambaLyles: Dreams do come true!!! Tha...,True,1,1,1,,0
3,52803520,astros,#VoteAltuve,True,423,228291,2799,32980,308,The Official Twitter of the Houston Astros. Ru...,...,614293312820740096,Fri Jun 26 04:45:13 +0000 2015,Adobe® Social,Career high in #whiff(s) for @kidkeuchy … and ...,True,1,2,1,,0
4,28173550,TBLightning,Tampa Bay Lightning,True,6976,275362,3590,52064,2215,Official Twitter of the 2015 Eastern Conferenc...,...,614234380534464512,Fri Jun 26 00:51:02 +0000 2015,Twitter for iPhone,RT @RHiggins_TBSC: THANK YOU to all who made t...,True,0,2,2,,0


In [3]:
device = df_tweets[['tweet_generator', 'tweet_id']]
posts_by_device = device.groupby('tweet_generator').count()
posts_by_device['percentage'] = (posts_by_device.tweet_id / posts_by_device.tweet_id.sum()) * 100
# DataFrame.sort was removed from pandas; sort_values is the current equivalent
posts_by_device = posts_by_device[['percentage']].sort_values('percentage', ascending=False)

print(posts_by_device.head(10))

                     percentage
tweet_generator                
Twitter Web Client    29.960784
Hootsuite             16.627451
TweetDeck             13.568627
Twitter for iPhone    11.215686
Sprinklr               4.862745
Adobe® Social          2.431373
Sprout Social          2.274510
SocialFlow             2.039216
Spredfast app          1.568627
Twitter for Android    1.019608
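
The same percentages can probably be obtained more directly with `value_counts(normalize=True)`, assuming no `tweet_id` values are missing. A minimal sketch on an invented frame (the rows are made up for illustration):

```python
import pandas as pd

# Toy stand-in for df_tweets; these rows are invented.
toy = pd.DataFrame({
    'tweet_generator': ['Hootsuite', 'TweetDeck', 'Hootsuite', 'Hootsuite'],
    'tweet_id': [1, 2, 3, 4],
})

# normalize=True returns fractions, already sorted in descending order.
share = toy['tweet_generator'].value_counts(normalize=True) * 100
print(share.head(10))
```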


In [4]:
df_tweets[df_tweets['tweet_generator'] == 'Twitter Web Client'].head()

Unnamed: 0,actor_id,actor_screen_name,actor_name,actor_verified,actor_friends_count,actor_followers_count,actor_listed_count,actor_statuses_count,actor_favorites_count,actor_summary,...,tweet_id,tweet_created_at,tweet_generator,tweet_body,tweet_verb,tweet_urls_count,tweet_mentions_count,tweet_hashtags_count,tweet_trends_count,tweet_symbols_count
8,343454456,TotalRecall,Total Recall,True,30,3651,50,310,0,Total Recall is out on DVD and Director’s Cut ...,...,306856395926544385,Wed Feb 27 20:00:40 +0000 2013,Twitter Web Client,Have u played the single player demo of God of...,True,1,0,1,,0
9,545373621,hopesprings,Hope Springs,False,187,902,30,284,30,Follow the official HOPE SPRINGS Twitter page ...,...,289873538586927104,Fri Jan 11 23:16:52 +0000 2013,Twitter Web Client,#HopeSprings' Meryl Streep won the @PeoplesCho...,True,1,1,1,,0
10,113707000,BudClaymanOC87,Bud Clayman,False,228,173,4,78,0,Can you make a movie while having mental illne...,...,24953461731,Sun Sep 19 17:40:59 +0000 2010,Twitter Web Client,Just posted on F.B. that we had a great time a...,True,0,0,0,,0
11,23504870,SouthPark,South Park,True,260,1710640,5394,9864,3408,The official South Park twitter. Watch full e...,...,614175793514549248,Thu Jun 25 20:58:14 +0000 2015,Twitter Web Client,No it wasn't me. IT WAS THE SPOOKY GHOST!!! #...,True,0,0,1,,0
14,9786732,ghostwhisperer,ghostwhisperer,False,102,15413,279,1467,11,The dead are still talking... and she's still...,...,603195879902867456,Tue May 26 13:47:59 +0000 2015,Twitter Web Client,#GhostWhisperer producers &amp; creator @JThom...,True,1,2,3,,0
