# Machine Learning Techniques for Supervised Classification of Twitter Bot Accounts

## FIA LABDATA - Class 13

### Description of the Data Sets Used

Accounts labeled as genuine or bot, annotated through [CrowdFlower](https://en.wikipedia.org/wiki/Figure_Eight_Inc.), as described in the paper [The Paradigm-Shift of Social Spambots: Evidence, Theories, and Tools for the Arms Race](http://dl.acm.org/citation.cfm?doid=3041021.3055135).

Source: http://mib.projects.iit.cnr.it/dataset.html

***

In [146]:
import numpy as np
import pandas as pd
import seaborn as sns
import datetime

# Importing the Data Sets

`Genuine Data Set` - Genuine verified accounts that are human-operated

In [173]:
df_genuine = pd.read_csv('data/cresci-2017/genuine_accounts.csv/users.csv')
df_genuine = df_genuine.assign(classification='human', dataset='genuine')

# strip the timezone from the date
df_genuine['created_at'] = pd.to_datetime(df_genuine['created_at']).dt.tz_localize(None)

df_genuine.head()

Unnamed: 0,id,name,screen_name,statuses_count,followers_count,friends_count,favourites_count,listed_count,url,lang,...,contributors_enabled,following,created_at,timestamp,crawled_at,updated,test_set_1,test_set_2,classification,dataset
0,1502026416,TASUKU HAYAKAWA,0918Bask,2177,208,332,265,1,,ja,...,,,2013-06-11 11:20:35,2013-06-11 13:20:35,2015-05-02 06:41:46,2016-03-15 15:53:47,0,0,human,genuine
1,2492782375,ro_or,1120Roll,2660,330,485,3972,5,,ja,...,,,2014-05-13 10:37:57,2014-05-13 12:37:57,2015-05-01 17:20:27,2016-03-15 15:53:48,0,0,human,genuine
2,293212315,bearclaw,14KBBrown,1254,166,177,1185,0,,en,...,,,2011-05-04 23:30:37,2011-05-05 01:30:37,2015-05-01 18:48:28,2016-03-15 15:53:48,0,0,human,genuine
3,191839658,pocahontas farida,wadespeters,202968,2248,981,60304,101,http://t.co/rGV0HIJGsu,en,...,,,2010-09-17 14:02:10,2010-09-17 16:02:10,2015-05-01 13:55:16,2016-03-15 15:53:48,0,0,human,genuine
4,3020965143,Ms Kathy,191a5bd05da04dc,82,21,79,5,0,,en,...,,,2015-02-06 04:10:49,2015-02-06 05:10:49,2015-05-02 01:17:32,2016-03-15 15:53:48,0,0,human,genuine



`social spambots #1` - Retweeters of an Italian political candidate

In [174]:
df_social_bot_1 = pd.read_csv('data/cresci-2017/social_spambots_1.csv/users.csv')
df_social_bot_1 = df_social_bot_1.assign(classification='bot', dataset='social_spambots_1')

# strip the timezone from the date
df_social_bot_1['created_at'] = pd.to_datetime(df_social_bot_1['created_at']).dt.tz_localize(None)

df_social_bot_1.head()

Unnamed: 0,id,name,screen_name,statuses_count,followers_count,friends_count,favourites_count,listed_count,url,lang,...,description,contributors_enabled,following,created_at,timestamp,crawled_at,updated,test_set_1,classification,dataset
0,24858289,Davide Bertoli,davideb66,1299,22,40,1,0,,it,...,,,,2009-03-17 08:51:12,2009-03-17 09:51:12,2014-04-19 14:46:19,2016-03-15 14:12:22,1,bot,social_spambots_1
1,33212890,Elisa D'Ospina,ElisaDospina,18665,12561,3442,16358,110,http://t.co/ceK8TovxwI,it,...,Autrice del libro #unavitatuttacurve dal 9 apr...,,,2009-04-19 14:38:04,2009-04-19 16:38:04,2014-05-18 23:20:58,2016-03-15 14:17:13,1,bot,social_spambots_1
2,39773427,Donato Vincenzo,Vladimir65,22987,600,755,14,6,,it,...,[Live Long and Prosper],,,2009-05-13 15:34:41,2009-05-13 17:34:41,2014-05-13 23:21:54,2016-03-15 14:16:44,1,bot,social_spambots_1
3,57007623,Rafiela Morales L.,RafielaMorales,7975,398,350,11,2,,en,...,"Cuasi Odontologa*♥,#Bipolar, #Sarcastica & Som...",,,2009-07-15 12:55:03,2009-07-15 14:55:03,2014-05-19 23:24:18,2016-03-15 14:18:54,1,bot,social_spambots_1
4,63258466,§ h a † u r♄,FabrizioC_c,20218,413,405,162,8,http://t.co/PK5F0JDKcy,it,...,"I shall rise from my own death, to avenge hers...",,,2009-08-05 21:12:49,2009-08-05 23:12:49,2014-05-11 23:22:23,2016-03-15 14:17:05,1,bot,social_spambots_1


`social spambots #2` - Spammers of paid apps for mobile devices

In [175]:
df_social_bot_2 = pd.read_csv('data/cresci-2017/social_spambots_2.csv/users.csv')
df_social_bot_2 = df_social_bot_2.assign(classification='bot', dataset='social_spambots_2')

# strip the timezone from the date
df_social_bot_2['created_at'] = pd.to_datetime(df_social_bot_2['created_at']).dt.tz_localize(None)

df_social_bot_2.head()

Unnamed: 0,id,name,screen_name,statuses_count,followers_count,friends_count,favourites_count,listed_count,url,lang,...,notifications,description,contributors_enabled,following,created_at,timestamp,crawled_at,updated,classification,dataset
0,2372241176,Denna Mcsparren,DennaMcsparren,53,10,46,0,0,,en,...,,,,,2014-03-04 18:11:08,2014-03-04 19:11:08,2014-05-05 00:20:03,2016-03-15 15:02:07,bot,social_spambots_2
1,2368684734,Yukiko Tretter,YukikoTretter,68,4,40,0,0,,en,...,,,,,2014-03-02 10:38:13,2014-03-02 11:38:13,2014-05-05 00:20:47,2016-03-15 15:02:07,bot,social_spambots_2
2,2353855646,Rochel Amaro,RochelAmaro,79,9,39,0,0,,en,...,,,,,2014-02-20 22:28:03,2014-02-20 23:28:03,2014-05-05 00:20:03,2016-03-15 15:02:08,bot,social_spambots_2
3,2372322542,Brandi Babin,BrandiBabin,59,1,39,0,0,,en,...,,,,,2014-03-04 19:52:10,2014-03-04 20:52:10,2014-05-05 00:20:03,2016-03-15 15:02:08,bot,social_spambots_2
4,2352506778,Chung Posadas,ChungPosadas,73,7,36,0,0,,en,...,,,,,2014-02-20 01:34:19,2014-02-20 02:34:19,2014-05-05 00:20:03,2016-03-15 15:02:09,bot,social_spambots_2


`social spambots #3` - Spammers of products on sale at Amazon.com

In [176]:
df_social_bot_3 = pd.read_csv('data/cresci-2017/social_spambots_3.csv/users.csv')
df_social_bot_3 = df_social_bot_3.assign(classification='bot', dataset='social_spambots_3')

# strip the timezone from the date
df_social_bot_3['created_at'] = pd.to_datetime(df_social_bot_3['created_at']).dt.tz_localize(None)

df_social_bot_3.head()

Unnamed: 0,id,name,screen_name,statuses_count,followers_count,friends_count,favourites_count,listed_count,url,lang,...,description,contributors_enabled,following,created_at,timestamp,crawled_at,updated,test_set_2,classification,dataset
0,16282004,Enrique Kates,eckates,11405,819,2000,0,19,http://t.co/gAa6cVM0Fe,en,...,My name is Enrique! ! I'm a highly experienced...,,,2008-09-14 11:20:09,2008-09-14 13:20:09,2014-05-05 23:17:51,2016-03-15 15:41:18,1,bot,social_spambots_3
1,16740486,genebailey,genebailey,520,219,406,36,2,http://t.co/mag9oYulVZ,en,...,"Author, Speaker, Father, Friend, Motivator, Re...",,,2008-10-14 16:11:24,2008-10-14 18:11:24,2014-05-05 23:17:51,2016-03-15 15:41:15,1,bot,social_spambots_3
2,17132768,Patrick G Howard,patrickghoward,4671,38877,25953,6,228,http://t.co/0ukMNj4N3Y,en,...,Patrick G Howard is an experienced project & p...,,,2008-11-03 15:51:00,2008-11-03 16:51:00,2014-05-05 23:17:51,2016-03-15 15:41:15,1,bot,social_spambots_3
3,18013384,Doggie Cakes Bakery,DoggieCakes,8512,2069,1177,70,43,http://t.co/B4NRMJHH6Q,en,...,Dog Bakery and Boutique - Retail (Events and O...,,,2008-12-10 05:44:57,2008-12-10 06:44:57,2014-05-05 23:12:36,2016-03-15 15:41:04,1,bot,social_spambots_3
4,21331733,David Varrone,DavidVarrone,245,6656,7469,1,72,http://t.co/KLzAJ1yzmB,en,...,Home Based Business and Personal Development C...,,,2009-02-19 19:46:02,2009-02-19 20:46:02,2014-05-05 23:17:51,2016-03-15 15:41:10,1,bot,social_spambots_3


`traditional spambots #1` - Spammers

In [177]:
df_traditional_spambots_1 = pd.read_csv('data/cresci-2017/traditional_spambots_1.csv/users.csv')
df_traditional_spambots_1 = df_traditional_spambots_1.assign(classification='bot', dataset='traditional_spambots_1')
df_traditional_spambots_1.head()

Unnamed: 0,id,name,screen_name,statuses_count,followers_count,friends_count,favourites_count,listed_count,url,lang,...,notifications,description,contributors_enabled,following,created_at,timestamp,crawled_at,updated,classification,dataset
0,7248952,Bhuvan Chand,tarunkjuyal,1259,837,1978,3200,9,http://lifeofearth.org,,...,,Love Your Life,,,1183552203000L,2007-07-04 14:30:03,2010-11-07 11:10:52,2016-03-14 17:05:53,bot,traditional_spambots_1
1,7732472,Daniel Wagner,DanielWagner,770,3274,3595,8,22,http://www.yourinternetbuddies.com/go,,...,,I am an internet marketing coach and mentor wh...,,,1185440851000L,2007-07-26 11:07:31,2010-11-07 11:10:52,2016-03-14 17:05:54,bot,traditional_spambots_1
2,9524952,Andrew Lock,Andrewlock,1100,38849,34504,41,1014,http://www.helpmybusiness.com,,...,,Marketing Geek & Presenter of 'Help! My Busine...,,,1192725360000L,2007-10-18 18:36:00,2010-11-07 11:10:52,2016-03-14 17:05:54,bot,traditional_spambots_1
3,10788822,Tim Thompson,yourinsaneworld,6497,5902,5496,0,82,http://investing-information.com,,...,,I am a member of a network of stock investing ...,,,1196614406000L,2007-12-02 17:53:26,2010-11-07 11:10:52,2016-03-14 17:05:54,bot,traditional_spambots_1
4,14596967,fxgenie,fxgenie,3203,2570,2638,0,5,http://www.4xgenie.com/wp,,...,,forex trader,,,1209536534000L,2008-04-30 08:22:14,2010-11-07 11:10:52,2016-03-14 17:05:54,bot,traditional_spambots_1


In [178]:
# In this data set created_at is an epoch timestamp in milliseconds, stored as a string
# with a trailing 'L' (the old Python 2 marker for long integers)

# Keep only the 13 digits, dropping the trailing 'L'
df_traditional_spambots_1['created_at'] = df_traditional_spambots_1['created_at'].str[0:13]

In [179]:
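# convert the millisecond timestamp to datetime
# (pd.to_datetime with unit='ms' yields timezone-naive UTC values, independent of the
# machine's local timezone, unlike datetime.datetime.fromtimestamp)
df_traditional_spambots_1['created_at'] = pd.to_datetime(
    df_traditional_spambots_1['created_at'].astype('int64'), unit='ms'
)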
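As a worked example of this conversion, the sketch below takes two of the raw values shown in the preview above and converts them in a vectorized way. It assumes the digits are milliseconds since the Unix epoch; `raw`, `millis`, and `created_at` are illustrative names only.

```python
import pandas as pd

# Two raw created_at values copied from the traditional_spambots_1 preview.
raw = pd.Series(["1183552203000L", "1185440851000L"])

# Strip the trailing 'L' and interpret the digits as milliseconds since the epoch.
millis = raw.str.rstrip("L").astype("int64")

# unit="ms" gives timezone-naive timestamps expressed in UTC.
created_at = pd.to_datetime(millis, unit="ms")
print(created_at.iloc[0])  # 2007-07-04 12:30:03
```

The first result is two hours behind the `timestamp` column of the same row (2007-07-04 14:30:03), the same kind of offset that separates `created_at` from `timestamp` in the other subsets.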

`traditional spambots #2` - Spammers of scam URLs

In [180]:
df_traditional_spambots_2 = pd.read_csv('data/cresci-2017/traditional_spambots_2.csv/users.csv')
df_traditional_spambots_2 = df_traditional_spambots_2.assign(classification='bot', dataset='traditional_spambots_2')

# strip the timezone from the date
df_traditional_spambots_2['created_at'] = pd.to_datetime(df_traditional_spambots_2['created_at']).dt.tz_localize(None)

df_traditional_spambots_2.head()

Unnamed: 0,id,name,screen_name,statuses_count,followers_count,friends_count,favourites_count,listed_count,url,lang,...,notifications,description,contributors_enabled,following,created_at,timestamp,crawled_at,updated,classification,dataset
0,2355955040,Prize Giveaways,PrizeCrazy66450,716,86,78,0,4,http://t.co/CF0tmHKIHk,en-gb,...,,NOTICE TO WINNERS: A unique prize code is nece...,,,2014-02-22 07:22:34,2014-02-22 08:22:34,2014-05-05 21:59:19,2016-03-14 15:46:02,bot,traditional_spambots_2
1,2368624550,Prize Giveaways,Prizetopia43484,779,87,79,0,2,http://t.co/3agfNy4i4S,en-gb,...,,NOTICE TO WINNERS: A unique invite code is nee...,,,2014-03-02 09:38:19,2014-03-02 10:38:19,2014-05-05 21:59:19,2016-03-14 15:46:02,bot,traditional_spambots_2
2,2355950858,Prize Crazy,TweetWin57918,753,89,97,0,0,http://t.co/jlIaA4LovD,en-gb,...,,NOTICE TO WINNERS: A unique invitation code is...,,,2014-02-22 07:19:00,2014-02-22 08:19:00,2014-05-05 21:59:19,2016-03-14 15:46:02,bot,traditional_spambots_2
3,2357744766,Prize World,Prizetopia67432,743,93,104,0,4,http://t.co/lVWwVDdjr6,en-gb,...,,NOTICE TO WINNERS: A unique invitation code is...,,,2014-02-23 09:30:43,2014-02-23 10:30:43,2014-05-05 21:59:19,2016-03-14 15:46:03,bot,traditional_spambots_2
4,2362454995,Prize Rocket,PrizeFun52329,180,87,71,0,1,http://t.co/rPMDkKvchA,en-gb,...,,NOTE TO WINNERS: A unique invitation code is n...,,,2014-02-26 09:57:57,2014-02-26 10:57:57,2014-05-05 21:59:19,2016-03-14 15:46:03,bot,traditional_spambots_2


`traditional spambots #3` - automated accounts spamming job offers

In [181]:
df_traditional_spambots_3 = pd.read_csv('data/cresci-2017/traditional_spambots_3.csv/users.csv')
df_traditional_spambots_3 = df_traditional_spambots_3.assign(classification='bot', dataset='traditional_spambots_3')

# strip the timezone from the date
df_traditional_spambots_3['created_at'] = pd.to_datetime(df_traditional_spambots_3['created_at']).dt.tz_localize(None)

df_traditional_spambots_3.head()

Unnamed: 0,id,name,screen_name,statuses_count,followers_count,friends_count,favourites_count,listed_count,url,lang,...,notifications,description,contributors_enabled,following,created_at,timestamp,crawled_at,updated,classification,dataset
0,325403988,Borremans Bellman,borremanstpdri2,48,0,1,0,1,,en,...,,,,,2011-06-28 07:15:18,2011-06-28 09:15:18,2016-03-15 11:39:07,2016-03-15 11:39:07,bot,traditional_spambots_3
1,3298943021,Reward Crazy,CrazyPrize66244,52,0,6,0,0,http://t.co/wCBL5xhOdq,en,...,,Have a special code? Click the linky below:,,,2015-05-26 08:13:25,2015-05-26 10:13:25,2016-03-15 11:39:07,2016-03-15 11:39:07,bot,traditional_spambots_3
2,3305460917,Reward Patrol,CrazyPrize75229,39,0,4,0,0,http://t.co/c4Yxb1wFKj,en,...,,Have an invitation password? Click the link be...,,,2015-06-01 09:58:30,2015-06-01 11:58:30,2016-03-15 11:39:07,2016-03-15 11:39:07,bot,traditional_spambots_3
3,179562837,Dana Shemesh,danashemesh,304,0,0,0,1,,en,...,,,,,2010-08-17 16:09:13,2010-08-17 18:09:13,2016-03-15 11:39:07,2016-03-15 11:39:07,bot,traditional_spambots_3
4,179295032,Davina Vanwey,DavinaVanwey428,1883,0,0,0,1,,en,...,,,,,2010-08-17 00:02:18,2010-08-17 02:02:18,2016-03-15 11:39:07,2016-03-15 11:39:07,bot,traditional_spambots_3


`traditional spambots #4` - Another group of automated accounts spamming job offers

In [182]:
df_traditional_spambots_4 = pd.read_csv('data/cresci-2017/traditional_spambots_4.csv/users.csv')
df_traditional_spambots_4 = df_traditional_spambots_4.assign(classification='bot', dataset='traditional_spambots_4')

# strip the timezone from the date
df_traditional_spambots_4['created_at'] = pd.to_datetime(df_traditional_spambots_4['created_at']).dt.tz_localize(None)

df_traditional_spambots_4.head()

Unnamed: 0,id,name,screen_name,statuses_count,followers_count,friends_count,favourites_count,listed_count,url,lang,...,notifications,description,contributors_enabled,following,created_at,timestamp,crawled_at,updated,classification,dataset
0,21478911,TMJ- CLT Util Jobs,tmj_clt_util,4,344,295,0,5,https://t.co/DByWt45HZj,en,...,,Follow this account for geo-targeted Utilities...,,,2009-02-21 12:04:47,2009-02-21 13:04:47,2016-03-15 13:48:59,2016-03-15 13:48:59,bot,traditional_spambots_4
1,21479094,TMJ - SFO Util Jobs,tmj_sfo_util,3,353,322,0,16,https://t.co/DByWt45HZj,en,...,,Follow this account for geo-targeted Utilities...,,,2009-02-21 12:09:26,2009-02-21 13:09:26,2016-03-15 13:48:59,2016-03-15 13:48:59,bot,traditional_spambots_4
2,21479204,TMJ - WAS Util Jobs,tmj_dc_util,1,323,294,0,2,https://t.co/DByWt45HZj,en,...,,Follow this account for geo-targeted Utilities...,,,2009-02-21 12:12:01,2009-02-21 13:12:01,2016-03-15 13:48:59,2016-03-15 13:48:59,bot,traditional_spambots_4
3,21479275,TMJ - JAX Util Jobs,tmj_jax_util,4,311,292,0,4,https://t.co/DByWt45HZj,en,...,,Follow this account for geo-targeted Utilities...,,,2009-02-21 12:13:37,2009-02-21 13:13:37,2016-03-15 13:48:59,2016-03-15 13:48:59,bot,traditional_spambots_4
4,21479334,TMJ - CHI Util Jobs,tmj_chi_util,6,339,298,0,7,https://t.co/DByWt45HZj,en,...,,Follow this account for geo-targeted Utilities...,,,2009-02-21 12:15:21,2009-02-21 13:15:21,2016-03-15 13:48:59,2016-03-15 13:48:59,bot,traditional_spambots_4


`fake followers` - Simple accounts that inflate the number of followers of another account

In [183]:
df_fake_followers = pd.read_csv('data/cresci-2017/fake_followers.csv/users.csv')
df_fake_followers = df_fake_followers.assign(classification='bot', dataset='fake_followers')

# strip the timezone from the date
df_fake_followers['created_at'] = pd.to_datetime(df_fake_followers['created_at']).dt.tz_localize(None)

df_fake_followers.head()

Unnamed: 0,id,name,screen_name,statuses_count,followers_count,friends_count,favourites_count,listed_count,created_at,url,...,follow_request_sent,protected,verified,notifications,description,contributors_enabled,following,updated,classification,dataset
0,80479674,YI YUAN,yi_twitts,29,19,255,1,0,2009-10-07 03:19:21,http://www.jycondo.com,...,,,,,real estate sales,,,2013-06-12 18:38:35,bot,fake_followers
1,82487179,Marcos Perez C,marcos_peca,1408,208,866,138,0,2009-10-14 23:40:17,,...,,,,,,,,2013-06-12 18:38:35,bot,fake_followers
2,105830531,curti lorenzo,curtilorenzo,39,59,962,8,0,2010-01-17 16:46:52,http://www.valcavargna.com/,...,,,,,le corna del capro scappato dal gregge s'infil...,,,2013-06-12 18:38:35,bot,fake_followers
3,114488344,ruben dario toscano,gatito2710,59,7,49,4,0,2010-02-15 15:49:58,,...,,,,,,,,2013-06-12 18:38:35,bot,fake_followers
4,123222267,Malek Khalaf,MalekKhalaf,987,60,521,61,1,2010-03-15 11:38:55,http://www.facebook.com/Malek.AlBalawi,...,,,,,"MA student at JU, Interested in Juventus,Italy...",,,2013-06-11 17:39:44,bot,fake_followers
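The loading cells above repeat the same three steps for almost every subset: read `users.csv`, tag the rows, and normalize `created_at`. A minimal sketch of a helper that factors this out follows. `load_users` and its parameters are hypothetical names, not part of the original notebook, and the demo reads from an in-memory CSV rather than the real files.

```python
import io

import pandas as pd


def load_users(path_or_buffer, dataset, classification="bot"):
    """Read one users.csv file and tag it with its label and subset name."""
    df = pd.read_csv(path_or_buffer)
    df = df.assign(classification=classification, dataset=dataset)
    # Parse created_at and drop the timezone so every subset shares a naive dtype.
    # Epoch-style values such as '1183552203000L' still need separate handling.
    df["created_at"] = pd.to_datetime(df["created_at"]).dt.tz_localize(None)
    return df


# Demo on an in-memory CSV standing in for one of the users.csv files.
sample = io.StringIO(
    "id,screen_name,created_at\n"
    "2372241176,DennaMcsparren,2014-03-04 18:11:08+00:00\n"
)
demo = load_users(sample, dataset="social_spambots_2")
print(demo[["screen_name", "created_at", "classification", "dataset"]])
```

With such a helper each subset would load in one line, for example `load_users('data/cresci-2017/genuine_accounts.csv/users.csv', 'genuine', classification='human')`.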


# Concatenating the Data Sets

In [184]:
df_twitter_accounts = pd.concat([
    df_genuine,
    df_social_bot_1,
    df_social_bot_2,
    df_social_bot_3,
    df_traditional_spambots_1,
    df_traditional_spambots_2,
    df_traditional_spambots_3,
    df_traditional_spambots_4,
    df_fake_followers    
])
df_twitter_accounts.shape

(14368, 44)
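The subsets do not share exactly the same columns (for example, `test_set_1` and `test_set_2` appear only in some of them), yet `pd.concat` still produces a single 44-column frame. A small sketch on toy frames shows how that alignment behaves. `ignore_index=True` is an optional variation used here for illustration; the notebook call above keeps the per-file row labels.

```python
import pandas as pd

a = pd.DataFrame({"id": [1, 2], "test_set_1": [0, 1]})
b = pd.DataFrame({"id": [3], "verified": [1.0]})

# Columns are aligned by name; a column absent from one frame is filled with NaN.
# ignore_index=True replaces the repeated per-file row labels with a fresh 0..n-1 index.
combined = pd.concat([a, b], ignore_index=True, sort=False)

print(combined.shape)         # (3, 3)
print(combined.isna().sum())  # NaN counts created by the alignment
```

This alignment is one source of the NaN values that show up later in the flag columns.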

In [185]:
df_twitter_accounts.columns

Index(['id', 'name', 'screen_name', 'statuses_count', 'followers_count',
       'friends_count', 'favourites_count', 'listed_count', 'url', 'lang',
       'time_zone', 'location', 'default_profile', 'default_profile_image',
       'geo_enabled', 'profile_image_url', 'profile_banner_url',
       'profile_use_background_image', 'profile_background_image_url_https',
       'profile_text_color', 'profile_image_url_https',
       'profile_sidebar_border_color', 'profile_background_tile',
       'profile_sidebar_fill_color', 'profile_background_image_url',
       'profile_background_color', 'profile_link_color', 'utc_offset',
       'is_translator', 'follow_request_sent', 'protected', 'verified',
       'notifications', 'description', 'contributors_enabled', 'following',
       'created_at', 'timestamp', 'crawled_at', 'updated', 'test_set_1',
       'test_set_2', 'classification', 'dataset'],
      dtype='object')

# Dropping columns marked as deprecated by Twitter

In [186]:
deprecated_features_list = ['utc_offset','time_zone', 'geo_enabled', 'lang', 'contributors_enabled', 'is_translator', 
    'profile_background_color', 'profile_background_image_url', 
    'profile_background_image_url_https', 'profile_background_tile', 'profile_image_url',
    'profile_link_color', 'profile_sidebar_border_color','profile_sidebar_fill_color',
    'profile_text_color','profile_use_background_image','following',
    'follow_request_sent', 'notifications',
    # data set variables unrelated to the Twitter user object
    'timestamp','crawled_at','updated','test_set_1','test_set_2']

# drop the deprecated and data-set-specific columns
df_twitter_accounts = df_twitter_accounts.drop(deprecated_features_list, axis=1)

In [187]:
df_twitter_accounts.columns, df_twitter_accounts.shape

(Index(['id', 'name', 'screen_name', 'statuses_count', 'followers_count',
        'friends_count', 'favourites_count', 'listed_count', 'url', 'location',
        'default_profile', 'default_profile_image', 'profile_banner_url',
        'profile_image_url_https', 'protected', 'verified', 'description',
        'created_at', 'classification', 'dataset'],
       dtype='object'),
 (14368, 20))
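`DataFrame.drop` raises a `KeyError` if any listed label is missing, which can happen when a different combination of subsets is loaded. A hedged sketch of a more tolerant variant on a toy frame follows; the column names in `to_drop` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"id": [1], "lang": ["en"], "utc_offset": [3600], "verified": [None]})
to_drop = ["lang", "utc_offset", "time_zone"]  # 'time_zone' is absent on purpose

# Report labels that are requested but not present in the frame.
missing = sorted(set(to_drop) - set(df.columns))

# errors="ignore" skips labels that are not present instead of raising KeyError.
trimmed = df.drop(columns=to_drop, errors="ignore")

print(missing)                # ['time_zone']
print(list(trimmed.columns))  # ['id', 'verified']
```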

# Generating new variables

In [188]:
# The assignment is commented out, so the result below is only displayed
# and the new columns are not stored in df_twitter_accounts.
#df_twitter_accounts =
(
    df_twitter_accounts
        # total number of characters in the screen name (unique handle)
        .assign(screen_name_total_len = df_twitter_accounts['screen_name'].str.len())
        # number of digit characters in the screen name
        .assign(screen_name_num_len = df_twitter_accounts['screen_name'].str.count('[0-9]'))
        # total number of characters in the name
        .assign(name_total_len = df_twitter_accounts['name'].str.len())
        # number of digit characters in the name
        .assign(name_num_len = df_twitter_accounts['name'].str.count('[0-9]'))
        # True when the url is null, False otherwise
        .assign(is_url_null = df_twitter_accounts['url'].isnull())
        # True when the user did not provide a location
        .assign(is_location_null = df_twitter_accounts['location'].isnull())
        # True when the user did not set profile_banner_url
        .assign(profile_banner_url_null = df_twitter_accounts['profile_banner_url'].isnull())
        # True when the user did not set a profile image
        # NOTE: this reads profile_banner_url again, so it duplicates the column above;
        # profile_image_url_https is presumably the intended source
        .assign(profile_image_url_null = df_twitter_accounts['profile_banner_url'].isnull())
        # True when the user did not provide a profile description
        .assign(description_null = df_twitter_accounts['description'].isnull())
        # account age (not implemented)
        #.assign(account_age = ())
)

Unnamed: 0,id,name,screen_name,statuses_count,followers_count,friends_count,favourites_count,listed_count,url,location,...,dataset,screen_name_total_len,screen_name_num_len,name_total_len,name_num_len,is_url_null,is_location_null,profile_banner_url_null,profile_image_url_null,description_null
0,1502026416,TASUKU HAYAKAWA,0918Bask,2177,208,332,265,1,,Tokyo .Japan .,...,genuine,8,4,15.0,0.0,True,False,False,False,False
1,2492782375,ro_or,1120Roll,2660,330,485,3972,5,,神奈川県横浜市,...,genuine,8,4,5.0,0.0,True,False,False,False,False
2,293212315,bearclaw,14KBBrown,1254,166,177,1185,0,,,...,genuine,9,2,8.0,0.0,True,True,False,False,False
3,191839658,pocahontas farida,wadespeters,202968,2248,981,60304,101,http://t.co/rGV0HIJGsu,#freePalestine - rip paul,...,genuine,11,0,17.0,0.0,False,False,False,False,False
4,3020965143,Ms Kathy,191a5bd05da04dc,82,21,79,5,0,,Wichita KS,...,genuine,15,8,8.0,0.0,True,False,True,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3346,1391497074,Verda Marks,VerdaMarks1,1,0,17,0,0,,"Murphy, NC",...,fake_followers,11,1,11.0,0.0,True,False,True,True,False
3347,1391544607,Danial Campbell,DanialCampbell2,0,1,17,0,0,,,...,fake_followers,15,1,15.0,0.0,True,True,True,True,True
3348,1391622127,Maudie Meyer,MaudieMeyer1,2,0,15,0,0,,"Rome,Italy",...,fake_followers,12,1,12.0,0.0,True,False,True,True,True
3349,1391832212,Harriett Harvey,HarriettHarvey9,2,0,16,0,0,,,...,fake_followers,15,1,15.0,0.0,True,True,True,True,True
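A minimal sketch of how derived columns like these could be stored back on the frame, shown on a toy frame with two rows. It is not the notebook's code. Here the image flag reads `profile_image_url_https`, and `account_age_days` is a hypothetical definition (whole days between `created_at` and `crawled_at`), which assumes `crawled_at` is kept and parsed as a datetime. The image URLs are placeholders.

```python
import pandas as pd

accounts = pd.DataFrame({
    "screen_name": ["0918Bask", "wadespeters"],
    "url": [None, "http://t.co/rGV0HIJGsu"],
    "profile_image_url_https": ["https://example.org/a.png", None],  # placeholder values
    "created_at": pd.to_datetime(["2013-06-11 11:20:35", "2010-09-17 14:02:10"]),
    "crawled_at": pd.to_datetime(["2015-05-02 06:41:46", "2015-05-01 13:55:16"]),
})

# Assigning the result back is what makes the new columns persist.
accounts = accounts.assign(
    screen_name_total_len=accounts["screen_name"].str.len(),
    screen_name_num_len=accounts["screen_name"].str.count(r"[0-9]"),
    is_url_null=accounts["url"].isnull(),
    profile_image_url_null=accounts["profile_image_url_https"].isnull(),
    # Hypothetical account age: whole days between creation and crawl.
    account_age_days=(accounts["crawled_at"] - accounts["created_at"]).dt.days,
)

print(accounts[["screen_name", "screen_name_num_len", "is_url_null", "account_age_days"]])
```

A single `assign` call with several keyword arguments is equivalent to the chained calls when, as here, no new column depends on another new column.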


# Dropping high-cardinality variables

In [189]:
alta_card = ['url','location','profile_banner_url','profile_image_url_https','description']
df_twitter_accounts = df_twitter_accounts.drop(alta_card, axis=1)
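One way to check which columns count as high-cardinality is to compare the number of distinct values with the number of rows. The sketch below does this on a toy frame; the 0.5 cutoff is an arbitrary assumption for illustration, not a threshold taken from the notebook.

```python
import pandas as pd

df = pd.DataFrame({
    "description": ["a", "b", "c", "d"],
    "dataset": ["genuine", "genuine", "fake_followers", "fake_followers"],
})

# Share of distinct values per column; values near 1.0 mean almost every row is unique.
cardinality_ratio = df.nunique() / len(df)

# Arbitrary cutoff chosen for this example only.
high_card = cardinality_ratio[cardinality_ratio > 0.5].index.tolist()
print(high_card)  # ['description']
```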

## Correcting Variable Values

In [190]:
# check for null values
df_twitter_accounts.isnull().sum()

id                           0
name                         1
screen_name                  0
statuses_count               0
followers_count              0
friends_count                0
favourites_count             0
listed_count                 0
default_profile           9857
default_profile_image    14290
protected                14290
verified                 14357
created_at                   0
classification               0
dataset                      0
dtype: int64

In [191]:
# According to the Twitter documentation, verified has two possible values (True, False).
# The data sets contain only 1 and NaN, so NaN values are replaced with 0.
df_twitter_accounts['verified'] = df_twitter_accounts['verified'].fillna(0)

df_twitter_accounts['verified'].isnull().sum()

0

In [192]:
# According to the Twitter documentation, default_profile has two possible values (True, False).
# The data sets contain only 1 and NaN, so NaN values are replaced with 0.
df_twitter_accounts['default_profile'] = df_twitter_accounts['default_profile'].fillna(0)

df_twitter_accounts['default_profile'].isnull().sum()

0

In [193]:
# According to the Twitter documentation, default_profile_image has two possible values (True, False).
# The data sets contain only 1 and NaN, so NaN values are replaced with 0.
df_twitter_accounts['default_profile_image'] = df_twitter_accounts['default_profile_image'].fillna(0)

df_twitter_accounts['default_profile_image'].isnull().sum()

0

In [194]:
# According to the Twitter documentation, protected has two possible values (True, False).
# The data sets contain only 1 and NaN, so NaN values are replaced with 0.
df_twitter_accounts['protected'] = df_twitter_accounts['protected'].fillna(0)

df_twitter_accounts['protected'].isnull().sum()

0
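The four flag columns receive the same treatment, so they can also be filled together. A minimal sketch on a toy frame follows. Casting to `int` is an extra step not present in the notebook, added here because NaN forces these columns to a float dtype.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "verified": [1.0, np.nan, np.nan],
    "protected": [np.nan, np.nan, 1.0],
})
flag_columns = ["verified", "protected"]

# Missing flags are treated as "not set" (0), then the columns are cast to integers.
df[flag_columns] = df[flag_columns].fillna(0).astype(int)

print(df)
print(df.isnull().sum())
```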

In [195]:
# one user has no name; it is replaced with the screen name (the unique Twitter handle)
#df_twitter_accounts.query('name != name')
df_twitter_accounts['name'] = np.where(df_twitter_accounts['name'].isnull(), 
                                       df_twitter_accounts['screen_name'],
                                       df_twitter_accounts['name'])
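The same fallback can be written with `Series.fillna`, which accepts another Series as the source of replacement values. A small sketch on a toy frame follows; `no_name_user` is a made-up handle. Note that `fillna` matches rows by index label, whereas `np.where` works positionally, so this form is safest on a frame with unique row labels.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Ms Kathy", None],
    "screen_name": ["191a5bd05da04dc", "no_name_user"],  # second handle is hypothetical
})

# Fill missing names with the screen name from the same row (aligned on the index).
df["name"] = df["name"].fillna(df["screen_name"])
print(df)
```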

In [196]:
# check for null values
df_twitter_accounts.isnull().sum()

id                       0
name                     0
screen_name              0
statuses_count           0
followers_count          0
friends_count            0
favourites_count         0
listed_count             0
default_profile          0
default_profile_image    0
protected                0
verified                 0
created_at               0
classification           0
dataset                  0
dtype: int64

In [197]:
df_twitter_accounts.shape

(14368, 15)

# Exporting the cleaned data set to CSV

In [202]:
df_twitter_accounts.to_csv('data/cresci-2017/classified_twitter_accounts.csv', index=False)
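CSV stores every value as text, so the `created_at` datetimes have to be parsed again when the exported file is loaded in a later step. A minimal round-trip sketch follows, using an in-memory buffer in place of the file on disk; the single row is taken from the genuine accounts preview.

```python
import io

import pandas as pd

df = pd.DataFrame({
    "id": [1502026416],
    "created_at": pd.to_datetime(["2013-06-11 11:20:35"]),
    "classification": ["human"],
})

buffer = io.StringIO()          # stands in for the CSV file on disk
df.to_csv(buffer, index=False)  # index=False keeps the row labels out of the file
buffer.seek(0)

# parse_dates restores created_at as a datetime column instead of plain text.
restored = pd.read_csv(buffer, parse_dates=["created_at"])
print(restored.dtypes)
```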