## Health and Wellness Influencers on Twitter

### Overview and Feature Justification

This project aims to find influencers in the health and wellness space. As we move forward with Metro Group, we will need to identify influencers in this space regardless, so this notebook should make that process easier.

### Process

The flow is as follows:

1. Personally curate 10 influencers in the health and wellness space (and run this by Steph and Tas)
2. Get the friends of the influencers on our list.
3. Take the top 20 accounts that are followed by the most influencers on our list.
4. Now we have a list of 30 influencers to repeat steps 2 and 3 on. Keep doing this until we have 90 influencers.
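
The steps above can be sketched offline as follows. This is a minimal illustration, not the notebook's actual pipeline: `expand_once`, `snowball`, and the toy `toy_friends_of` graph are hypothetical names, and the graph stands in for the friend lists that would come from the Twitter API.

```python
from collections import Counter

def expand_once(seeds, friends_of, top_n):
    """Return the top_n accounts followed by the most seeds, excluding the seeds."""
    seed_set = set(seeds)
    counts = Counter(
        friend
        for seed in seeds
        for friend in friends_of.get(seed, [])
        if friend not in seed_set
    )
    return [account for account, _ in counts.most_common(top_n)]

def snowball(seeds, friends_of, top_n, target_size):
    """Repeat the expansion step until the list reaches target_size."""
    influencers = list(seeds)
    while len(influencers) < target_size:
        new_accounts = expand_once(influencers, friends_of, top_n)
        if not new_accounts:
            break  # nothing left to add, so stop instead of looping forever
        influencers.extend(new_accounts)
    return influencers

# Toy graph (made up for illustration): account -> accounts it follows
toy_friends_of = {
    "a": ["c", "d"],
    "b": ["c"],
    "c": ["d"],
    "d": ["e"],
}

print(snowball(["a", "b"], toy_friends_of, top_n=1, target_size=4))
# ['a', 'b', 'c', 'd']
```

In the notebook's setting, the seed list has 10 accounts, `top_n` is 20, and `target_size` is 90.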

What's nice about this is that depending on how we do it, it can be *self-correcting*. Even the ones we initially find as influencers might get cycled out if they're not in the inner friend circle :)

Moreover, we can combine this with hashtag analysis to further classify our influencers (beyond simply being in the health and wellness space).
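
As a rough idea of what the hashtag analysis could look like, the sketch below counts hashtags across a list of tweet texts. The function name `top_hashtags` and the sample tweets are made up for illustration; fetching real tweets is left out.

```python
import re
from collections import Counter

HASHTAG_PATTERN = re.compile(r"#(\w+)")

def top_hashtags(tweet_texts, n=3):
    """Count hashtags case-insensitively and return the n most common."""
    counts = Counter(
        tag.lower()
        for text in tweet_texts
        for tag in HASHTAG_PATTERN.findall(text)
    )
    return counts.most_common(n)

# Made-up sample tweets, for illustration only
sample_tweets = [
    "New smoothie recipe! #vegan #breakfast",
    "Sunday meal prep #Vegan #mealprep",
    "Morning run done #running #vegan",
]

print(top_hashtags(sample_tweets, n=1))  # [('vegan', 3)]
```

An influencer's most frequent hashtags could then serve as tags for a finer classification within health and wellness.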

### Initial Setup 

We start by setting up Twitter authentication and importing necessary libraries.

In [4]:
# Twitter API setup
import tweepy
import json
import csv
from tweepy import OAuthHandler

# set up authentication for Twitter
# Credentials are read from environment variables instead of being hard-coded,
# so they are not exposed when the notebook is shared.
# (The variable names below are placeholders; use whatever names you export.)
import os
consumer_key = os.environ['TWITTER_CONSUMER_KEY']
consumer_secret = os.environ['TWITTER_CONSUMER_SECRET']
access_token = os.environ['TWITTER_ACCESS_TOKEN']
access_secret = os.environ['TWITTER_ACCESS_SECRET']

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
 
api = tweepy.API(auth)

# pandas setup
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")
import time

### General Helper Functions

In [15]:
"""
function get_users_info_from_screen_name
inputs: 
- screen_names: list of screen names I want to get more info about
returns: user_summary as a dataframe (with the Twitter id, screen name, follower and friend count)
"""
def get_users_info_from_screen_name(screen_names):
    # get users summary
    users_summary = []
    for influencer in screen_names:
        user = api.get_user(screen_name = influencer)
        #save relevant info to csv
        user_summary = [user.id, user.screen_name, user.followers_count, user.friends_count]
        users_summary.append(user_summary)
 
    # create df
    headers = ['user_id', 'screen_name', 'followers_count', 'friends_count']
    df = pd.DataFrame(users_summary, columns=headers) 
    
    return df

In [21]:
"""
function get_users_info_from_id
inputs: 
- ids: list of ids I want to get more info about
returns: user_summary as a dataframe (with the Twitter id, screen name, follower and friend count)
"""
def get_users_info_from_id(ids):
    # get users summary
    users_summary = []
    for influencer in ids:
        user = api.get_user(id = influencer)
        #save relevant info to csv
        user_summary = [user.id, user.screen_name, user.followers_count, user.friends_count]
        users_summary.append(user_summary)
 
    # create df
    headers = ['user_id', 'screen_name', 'followers_count', 'friends_count']
    df = pd.DataFrame(users_summary, columns=headers) 
    
    return df

In [19]:
"""
function get_friends_ids
inputs: 
- user_id: a user id
returns: ids, a list of friend ids
"""
def get_friends_ids(user_id):
    ids = []
    page_count = 0
    for page in tweepy.Cursor(api.friends_ids, id=user_id).pages():
        page_count += 1
        ids.extend(page)
        time.sleep(60)
    return ids
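
The helper above pauses for 60 seconds after every page, presumably to stay under the API rate limit. The same pattern can be sketched without the API so that it is testable offline. Here `collect_ids` and its `sleep` parameter are hypothetical names, and `fake_pages` stands in for the pages that `tweepy.Cursor(...).pages()` would yield.

```python
import time

def collect_ids(pages, pause_seconds=60, sleep=time.sleep):
    """Flatten an iterable of id pages into one list, pausing after each page."""
    ids = []
    for page in pages:
        ids.extend(page)
        sleep(pause_seconds)  # throttle before requesting the next page
    return ids

# Offline usage: record the pauses instead of actually sleeping
fake_pages = [[1, 2, 3], [4, 5]]
pauses = []
result = collect_ids(fake_pages, pause_seconds=60, sleep=pauses.append)

print(result)  # [1, 2, 3, 4, 5]
print(pauses)  # [60, 60]
```

Passing the sleep function in as a parameter keeps the throttling behaviour visible while letting a test run instantly.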

In [20]:
"""
function get_influencers_friends
inputs: 
- influencers_ids: list of influencer ids
returns: all the friends of those influencers (with duplicates)
"""
# get all the friend's ids in one big giant list
def get_influencers_friends(influencers_ids):
    all_friend_ids = []
    for influencer_id in influencers_ids:
        # get all friends of the influencer
        ids = get_friends_ids(influencer_id)
        # add to our comprehensive list
        all_friend_ids = all_friend_ids + ids
    return all_friend_ids

In [None]:
"""
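function get_most_popular_friends_no_duplicates
inputs: 
- all_friend_ids: list of all the friends' ids (with duplicates)
- original_ids: ids of the influencers already on our list
returns: up to 20 of the most frequently occurring friend ids that are not already in original_ids
"""
from collections import Counter
from itertools import groupby

def get_most_popular_friends_no_duplicates(all_friend_ids, original_ids):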
    # sort items given list test
    sorted_list = [item for items, c in Counter(all_friend_ids).most_common() for item in [items] * c]
    # top users
    top_ids= [key for key, group in groupby(sorted_list)]
    top_ids_no_duplicates = []
    # remove duplicates
    counter = 0
    for top_id in top_ids:
        if counter < 20:
            if top_id in list(original_ids):
                continue
            else:
                top_ids_no_duplicates.append(top_id) 
                counter = counter + 1
            
    return top_ids_no_duplicates
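
For comparison, the same ranking can be written more compactly. This is a sketch under the assumption that only the order of ids by frequency matters; `most_common_new_ids` is a hypothetical name and the ids are toy values.

```python
from collections import Counter
from itertools import islice

def most_common_new_ids(all_friend_ids, original_ids, limit=20):
    """Rank ids by frequency and return the first `limit` not in original_ids."""
    known = set(original_ids)  # set lookup avoids rebuilding a list on every check
    ranked = (friend_id for friend_id, _ in Counter(all_friend_ids).most_common())
    return list(islice((i for i in ranked if i not in known), limit))

# Toy data: id 5 appears 4 times, 7 three times, 8 twice, 9 once
friend_ids = [7, 7, 7, 8, 8, 9, 5, 5, 5, 5]

print(most_common_new_ids(friend_ids, original_ids=[5], limit=2))  # [7, 8]
```

`Counter.most_common()` already returns each distinct id once, ordered by count, so the expand-then-`groupby` step is not needed. `islice` stops as soon as `limit` ids have been collected.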

### Curating the initial top 10
We need a curated list of influencers in the health and wellness space to start with. The quality of these initial 10 likely matters, since every later cycle builds on them, so I went through two articles and collected the Twitter handles of the people mentioned. I'll then take the influencers with the most followers.

In [16]:
web_influencers_list = ['minimalistbaker', 'sproutedkitchen', 'ohsheglows', 'MikeTriathlon', 'ChocCoveredKt',
                        'CleanEatingGodd', 'acouplecooks', 'thehealthymaven', 'kaleandcaramel', 'marionnestle', 
                        'michaelpollan', 'bittman', 'ChefAnnFnd', 'JoelSalatin', 'jamieoliver', 'bryantterry',
                        'johnlapuma', 'robbwolf', 'ugwellness', '100daysrealfood']
web_influencer_info_df = get_users_info_from_screen_name(web_influencers_list)
web_influencer_info_df.head(5)

Unnamed: 0,user_id,screen_name,followers_count,friends_count
0,323872736,minimalistbaker,13347,69
1,21952537,sproutedkitchen,16369,83
2,18676526,ohsheglows,71802,3676
3,89796982,MikeTriathlon,1744,125
4,148521467,ChocCoveredKt,17817,30


Next we sort these by their followers_count and take the top 10 as our initial ten most influential people. We write these initial 10 to a csv.

In [17]:
# DataFrame.sort was removed from pandas; sort_values is the current equivalent
sorted_web_influencer_info_df = web_influencer_info_df.sort_values(['followers_count'], ascending=False)
ten_influencers_df = sorted_web_influencer_info_df[0:10]
ten_influencers_df.to_csv('wellness/teninfluencers.csv')
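
As a small check of the sort-and-slice step, the sketch below applies `DataFrame.nlargest` to the five rows shown in the output above. `sample_df` is a hypothetical name, and taking the top 3 instead of the top 10 is only because the sample has five rows.

```python
import pandas as pd

# The five rows displayed by web_influencer_info_df.head(5)
sample_df = pd.DataFrame(
    [
        [323872736, "minimalistbaker", 13347, 69],
        [21952537, "sproutedkitchen", 16369, 83],
        [18676526, "ohsheglows", 71802, 3676],
        [89796982, "MikeTriathlon", 1744, 125],
        [148521467, "ChocCoveredKt", 17817, 30],
    ],
    columns=["user_id", "screen_name", "followers_count", "friends_count"],
)

# nlargest sorts by the column in descending order and keeps the first n rows
top_three = sample_df.nlargest(3, "followers_count")

print(top_three["screen_name"].tolist())
# ['ohsheglows', 'ChocCoveredKt', 'sproutedkitchen']
```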

### Cycle 1: Curating the next 20

In [6]:
next_influencers_df=pd.read_csv("moretopinfluencers.csv", names=["id", "screen_name", 
                                                                       "follower_count", "friend_count"])
next_influencers_df

Unnamed: 0,id,screen_name,follower_count,friend_count
0,19212009,WhoWhatWear,2101444,1594
1,228379737,thecoveteur,155119,2766
2,64822927,sea_of_shoes,101586,570
3,17475920,BagSnob,165187,888
4,21190774,tuulavintage,38864,415
5,17809182,MattiasSwenson,8124,119
6,90963591,SincerelyJules,56593,436
7,24190981,TeenVogue,2588637,2819
8,73359920,MaisonValentino,1656069,975
9,15090453,gerihirsch,16550,384


#### Merge dfs

In [10]:
frames = [initial_influencers_df, next_influencers_df]
new_df = pd.concat(frames)
new_df.head(5)

Unnamed: 0,id,screen_name,follower_count,friend_count
0,40090727,ChiaraFerragni,273723,153
1,20298371,AIMEESONG,62329,359
2,16252512,wendynguyen,51012,378
3,157994424,Kayture,36128,493
4,90963591,SincerelyJules,56582,436


In [15]:
# save work by writing to a csv file
new_df.to_csv('top30influencers.csv')

#### Cycle 2
Start getting friends again

In [68]:
next_influencers_df=pd.read_csv("top30influencers.csv")
next_influencers_df.head(30)

Unnamed: 0.1,Unnamed: 0,id,screen_name,follower_count,friend_count
0,0,40090727,ChiaraFerragni,273723,153
1,1,20298371,AIMEESONG,62329,359
2,2,16252512,wendynguyen,51012,378
3,3,157994424,Kayture,36128,493
4,4,90963591,SincerelyJules,56582,436
5,5,22830044,rumineely,122310,304
6,6,112700733,garypeppergirl,34180,829
7,7,282389552,GalMeetsGlam,36106,1142
8,8,165572420,BlairEadieBEE,52657,313
9,9,54859060,BettyAutier,124774,146


In [18]:
# function to get friends ids without Twitter timing out
def get_friends_ids(user_id):
    ids = []
    page_count = 0
    print(user_id)
    for page in tweepy.Cursor(api.friends_ids, id=user_id).pages():
        page_count += 1
        print('Getting page {} for friends ids'.format(page_count))
        ids.extend(page)
        # pause between pages to avoid hitting the Twitter rate limit
        time.sleep(60)
    return ids

In [20]:
# get all the friend's ids in one big giant list
def get_influencers_friends(influencers_ids):
    all_friend_ids = []
    for influencer_id in influencers_ids:
        print(influencer_id)

        # get all friends of the influencer
        ids = get_friends_ids(influencer_id)
        # add to our comprehensive list
        all_friend_ids = all_friend_ids + ids
    return all_friend_ids

In [21]:
all_friend_ids = get_influencers_friends(next_influencers_df.id)

40090727
40090727
Getting page 1 for friends ids
20298371
20298371
Getting page 1 for friends ids
16252512
16252512
Getting page 1 for friends ids
157994424
157994424
Getting page 1 for friends ids
90963591
90963591
Getting page 1 for friends ids
22830044
22830044
Getting page 1 for friends ids
112700733
112700733
Getting page 1 for friends ids
282389552
282389552
Getting page 1 for friends ids
165572420
165572420
Getting page 1 for friends ids
54859060
54859060
Getting page 1 for friends ids
19212009
19212009
Getting page 1 for friends ids
228379737
228379737
Getting page 1 for friends ids
64822927
64822927
Getting page 1 for friends ids
17475920
17475920
Getting page 1 for friends ids
21190774
21190774
Getting page 1 for friends ids
17809182
17809182
Getting page 1 for friends ids
90963591
90963591
Getting page 1 for friends ids
24190981
24190981
Getting page 1 for friends ids
73359920
73359920
Getting page 1 for friends ids
15090453
15090453
Getting page 1 for friends ids
16271952
1

In [29]:
from collections import Counter
from itertools import groupby

def get_most_popular_friends(all_friend_ids):
    # rank friend ids by how many influencers follow them (most common first)
    top_ids = [friend_id for friend_id, count in Counter(all_friend_ids).most_common()]
    # truncate to the top forty
    top_forty_ids = top_ids[0:40]
    return top_forty_ids

In [97]:
def get_most_popular_friends_no_duplicates(all_friend_ids, original_ids):
    # sort items given list test
    sorted_list = [item for items, c in Counter(all_friend_ids).most_common() for item in [items] * c]
    # top users
    top_ids= [key for key, group in groupby(sorted_list)]
    top_ids_no_duplicates = []
    # remove duplicates
    counter = 0
    for top_id in top_ids:
        if counter < 20:
            if top_id in list(original_ids):
                continue
            else:
                top_ids_no_duplicates.append(top_id) 
                counter = counter + 1
            
    return top_ids_no_duplicates

In [51]:
original = [2,4]
a = [1,2,3,4,5]
top_ids_no_duplicates  = [top_id for top_id in a if top_id not in original]
top_ids_no_duplicates

[1, 3, 5]

In [98]:
top_twenty_no_duplicates = get_most_popular_friends_no_duplicates(all_friend_ids, next_influencers_df["id"])

In [60]:
def get_users_info_and_write_to_csv(top_twenty_ids):
    users_summary = []
    # open the csv file in append mode; the with block closes it,
    # so all rows are flushed to disk before the file is read back
    with open('new_twenty_influencers_no_duplicates.csv', 'a') as csvFileMore:
        # create csv writer
        csvWriterMore = csv.writer(csvFileMore)
        for influencer in top_twenty_ids:
            user = api.get_user(id = influencer)
            # save relevant info to csv
            user_summary = [user.id, user.screen_name, user.followers_count, user.friends_count]
            users_summary.append(user_summary)
            csvWriterMore.writerow(user_summary)
    return users_summary

In [31]:
top_forty_ids = get_most_popular_friends(all_friend_ids)

In [99]:
more_influencers = get_users_info_and_write_to_csv(top_twenty_no_duplicates)

In [65]:
# read the new twenty influencers (no duplicates) back in
twenty_influencers_df=pd.read_csv("new_twenty_influencers_no_duplicates.csv", names=["id", "screen_name", 
                                                                       "follower_count", "friend_count"])
twenty_influencers_df.head(2)

Unnamed: 0,id,screen_name,follower_count,friend_count
0,64822927,sea_of_shoes,101587,570
1,19212009,WhoWhatWear,2102743,1594


In [28]:
# gets the summary info for the top twenty influencers
more_influencers = get_users_info_and_write_to_csv(top_twenty_ids)

In [33]:
# gets the summary info for the top forty influencers
more_influencers = get_users_info_and_write_to_csv(top_forty_ids)

We can decide how we want to do this. That is, should we just take the accounts with the most overlap in general, or should we keep only the non-duplicates?

### We should go through and tag these

In [36]:
# read the new twenty influencers
twenty_influencers_df=pd.read_csv("new_twenty_influencers.csv", names=["id", "screen_name", 
                                                                       "follower_count", "friend_count"])
twenty_influencers_df.head(2)

Unnamed: 0,id,screen_name,follower_count,friend_count
0,64822927,sea_of_shoes,101586,570
1,19212009,WhoWhatWear,2102642,1594


In [38]:
# read the new forty influencers
forty_influencers_df=pd.read_csv("new_forty_influencers.csv", names=["id", "screen_name", 
                                                                       "follower_count", "friend_count"])
forty_influencers_df.head(2)

Unnamed: 0,id,screen_name,follower_count,friend_count
0,64822927,sea_of_shoes,101586,570
1,19212009,WhoWhatWear,2102650,1594


In [44]:
frames = [new_df, forty_influencers_df]
final_new_df = pd.concat(frames)
final_new_df.head(5)

Unnamed: 0.1,Unnamed: 0,follower_count,friend_count,id,screen_name
0,,273723,153,40090727,ChiaraFerragni
1,,62329,359,20298371,AIMEESONG
2,,51012,378,16252512,wendynguyen
3,,36128,493,157994424,Kayture
4,,56582,436,90963591,SincerelyJules


In [47]:
len(final_new_df)

80

In [49]:
sum(final_new_df.duplicated('screen_name'))

28
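
Since 28 of the 80 merged rows are duplicates, one option is to drop them after concatenating. The sketch below uses a few rows taken from the tables above. The frame names are hypothetical, and deduplicating on `id` instead of `screen_name` is an assumption, made because ids are stable while handles can change.

```python
import pandas as pd

first = pd.DataFrame({
    "id": [40090727, 90963591],
    "screen_name": ["ChiaraFerragni", "SincerelyJules"],
})
second = pd.DataFrame({
    "id": [90963591, 19212009],
    "screen_name": ["SincerelyJules", "WhoWhatWear"],
})

# ignore_index gives the merged frame a fresh 0..n-1 index
merged = pd.concat([first, second], ignore_index=True)
print(merged.duplicated("id").sum())  # 1

# keep the first occurrence of each id and drop the repeats
deduped = merged.drop_duplicates(subset="id", keep="first").reset_index(drop=True)
print(deduped["screen_name"].tolist())
# ['ChiaraFerragni', 'SincerelyJules', 'WhoWhatWear']
```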

#### Cycle 4
Start the cycle again

In [None]:
# read dfs

# merge dfs

# get all friend ids
# all_friend_ids = get_influencers_friends(next_influencers_df.id)

# get top twenty ids
# top_twenty_ids = get_most_popular_friends(all_friend_ids)

# get summary info for these influencers
# more_influencers = get_users_info_and_write_to_csv(top_twenty_ids)


In [100]:
#read dfs and merge
original_thirty=pd.read_csv("top30influencers.csv",names=["id", "screen_name", 
                                                                       "follower_count", "friend_count"])
new_twenty=pd.read_csv("new_twenty_influencers_no_duplicates.csv")
frames = [original_thirty, new_twenty]
final_new_df = pd.concat(frames)
final_new_df.head(5)

Unnamed: 0.1,1606965,21401466,637,Unnamed: 0,VogueRunway,follower_count,friend_count,id,screen_name
0,,,,0,,273723,153,40090727,ChiaraFerragni
1,,,,1,,62329,359,20298371,AIMEESONG
2,,,,2,,51012,378,16252512,wendynguyen
3,,,,3,,36128,493,157994424,Kayture
4,,,,4,,56582,436,90963591,SincerelyJules
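
The stray column names in the output above (`1606965`, `21401466`, `637`, `VogueRunway`) suggest that a file without a header row was read as if its first data row were the header. A minimal sketch of reading both kinds of file consistently is shown below. It uses in-memory stand-ins for the two csv files, with rows taken from earlier output, and assumes one file was written by `DataFrame.to_csv` (header plus index column) and the other by `csv.writer` (no header).

```python
import io
import pandas as pd

columns = ["id", "screen_name", "follower_count", "friend_count"]

# Stand-in for a file written row by row with csv.writer: no header row
headerless_csv = io.StringIO(
    "64822927,sea_of_shoes,101587,570\n"
    "19212009,WhoWhatWear,2102743,1594\n"
)
new_rows = pd.read_csv(headerless_csv, header=None, names=columns)

# Stand-in for a file written by DataFrame.to_csv: header row plus index column
with_header_csv = io.StringIO(
    ",id,screen_name,follower_count,friend_count\n"
    "0,40090727,ChiaraFerragni,273723,153\n"
)
existing_rows = pd.read_csv(with_header_csv, index_col=0)

combined = pd.concat([existing_rows, new_rows], ignore_index=True)
print(list(combined.columns))
# ['id', 'screen_name', 'follower_count', 'friend_count']
print(len(combined))  # 3
```

Writing with `to_csv(..., index=False)` would avoid the extra `Unnamed: 0` column in the first place.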
