# Spam Filter for Look2Social

## ~03/10

* Finished Collecting Data
* Communicated with Look2Social team: Figured out the goal of the program
* Communicated with instructors: Adjusted my approach to the project

## 03/14 ~ : Data Cleaning

### 03/14 - Overview of data
(screenshot of original data)
* 20,000 rows of data - not labeled for spam
* Six criteria for spam: 

    1) Marketing Focus (by social media marketing group)
    
    2) Bot Generated
    
    3) Known Spammer
    
    4) Corporate Posted
    
    5) Own Post
    
    6) Hijacking (uses the tag only; the content is unrelated)


Bot Generated - can be labeled by the pattern of the post text ("just closed a deal...") or by its URL

Corporate Posted - labeled from account names supplied by the user

Known Spammer, Own Post - labeled from account names supplied by the user

Hijacking, Marketing Focus - need manual review (eyeballing)

In [3]:
import pandas as pd
import numpy as np

### Suggestion from Miles

(pic of Miles)

"Use Machine Learning from the early phase! Use it for labeling!"

### Approach
1. Create 7 columns for spam: one column per criterion, plus
    one final spam column whose default value is 0 and becomes 1 based on the user's declaration of spam

2. Label bot-generated and corporate posts first, by their patterns

3. Separate out the rows with no label (~7,000 rows)

4. Hand-label the first dozen to a hundred rows for hijacking and marketing

5. Use machine learning (Naive Bayes / Logistic Regression / Random Forest) to label more rows, with a high confidence threshold

6. Eyeball the rest of them
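Steps 4 to 6 can be sketched as follows. This is a minimal, dependency-free illustration rather than the model used in the project: it fits a tiny Naive Bayes classifier on a hand-labeled seed and only accepts predictions whose probability clears a high threshold, leaving everything else for eyeballing. The function names, the seed texts, and the 0.9 threshold are all assumptions made for the example.

```python
import math
from collections import Counter

def train_nb(texts, labels):
    """Fit a tiny multinomial Naive Bayes model (word counts per class)."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter(labels)
    for text, label in zip(texts, labels):
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab

def spam_probability(model, text):
    """Return P(label == 1 | text) using add-one smoothing."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    log_scores = {}
    for label in (0, 1):
        score = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        log_scores[label] = score
    # Normalize the two log scores into a probability for class 1
    top = max(log_scores.values())
    exp0 = math.exp(log_scores[0] - top)
    exp1 = math.exp(log_scores[1] - top)
    return exp1 / (exp0 + exp1)

def auto_label(model, texts, threshold=0.9):
    """Label only confident rows; None marks rows left for manual review."""
    result = []
    for text in texts:
        p = spam_probability(model, text)
        if p >= threshold:
            result.append(1)
        elif p <= 1 - threshold:
            result.append(0)
        else:
            result.append(None)
    return result

# Hypothetical hand-labeled seed (1 = marketing spam, 0 = not spam)
seed_texts = [
    "sign up today for our free webinar",
    "download our free ebook today",
    "my signature was rejected again",
    "why was my signature rejected",
]
seed_labels = [1, 1, 0, 0]

model = train_nb(seed_texts, seed_labels)
labels = auto_label(model, ["free webinar today", "signature rejected", "hello world"])
```

Raising the threshold sends more rows to manual review but reduces the number of wrong automatic labels, which is the trade-off behind "set the threshold high".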

### 03/17 - Steps before labeling data
Created new columns and index
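The column setup could look roughly like the sketch below. The column names are taken from the `df.columns` output in this notebook; the helper names `add_spam_columns` and `apply_user_declaration`, and the idea of deriving the final `spam` flag as a logical OR over the criteria the user declares as spam, are assumptions for illustration.

```python
import pandas as pd

# Criterion columns, named as in the notebook's df.columns output
CRITERIA_COLS = ['spam_marketing', 'spam_hijack', 'spam_corporate',
                 'spam_bot', 'spam_known', 'spam_own']

def add_spam_columns(frame):
    """Return a copy with the six criterion flags and the final 'spam' flag set to 0."""
    out = frame.copy()
    for col in ['spam'] + CRITERIA_COLS:
        out[col] = 0
    return out

def apply_user_declaration(frame, declared):
    """Set 'spam' to 1 where any criterion the user declared as spam is flagged."""
    out = frame.copy()
    out['spam'] = out[list(declared)].any(axis=1).astype(int)
    return out

# Toy rows standing in for the real posts
toy = add_spam_columns(pd.DataFrame({'text': ['a', 'b', 'c']}))
toy.loc[0, 'spam_bot'] = 1
toy.loc[1, 'spam_corporate'] = 1

# This user only treats bot-generated posts as spam
final = apply_user_declaration(toy, declared=['spam_bot'])
```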

In [4]:
df = pd.read_excel('data/data_prepping_zeros_index.xlsx')

In [7]:
df

Unnamed: 0,index,text,retweet_count,favorite_count,spam,spam_marketing,spam_hijack,spam_corporate,spam_bot,spam_known,...,place_type,full_name,place_name,place_id,place_lat,place_lon,lat,lon,expanded_url,url
0,0,just closed a deal in 27 hours using #AdobeSig...,,,0,0,0,0,0,0,...,,,,,,,,,https://acrobat.adobe.com,https://t.co/Fe0YfarG31
1,1,just closed a deal in 2 days using #AdobeSign ...,,,0,0,0,0,0,0,...,,,,,,,,,https://acrobat.adobe.com,https://t.co/TTlCvaDE0V
2,2,just closed a deal in 2 hours using #AdobeSign...,,,0,0,0,0,0,0,...,,,,,,,,,https://acrobat.adobe.com,https://t.co/okmHDeM9bZ
3,3,just closed a deal in 26 hours using #AdobeSig...,,,0,0,0,0,0,0,...,,,,,,,,,https://acrobat.adobe.com,https://t.co/AAhUGDLZxH
4,4,just closed a deal in 6 days using #AdobeSign ...,,,0,0,0,0,0,0,...,,,,,,,,,https://acrobat.adobe.com,https://t.co/Oi57q6jdXP
5,5,just closed a deal in 35 minutes using #AdobeS...,,,0,0,0,0,0,0,...,,,,,,,,,https://acrobat.adobe.com,https://t.co/AAhUGDLZxH
6,6,just closed a deal in 3 days using #AdobeSign ...,,,0,0,0,0,0,0,...,,,,,,,,,https://acrobat.adobe.com,https://t.co/su9NFDvRSm
7,7,just closed a deal in 58 minutes using #AdobeS...,,,0,0,0,0,0,0,...,,,,,,,,,https://acrobat.adobe.com,https://t.co/Ir5AOw3Uv8
8,8,just closed a deal in 7 minutes using #AdobeSi...,,,0,0,0,0,0,0,...,,,,,,,,,https://acrobat.adobe.com,https://t.co/TUP7YE2FjQ
9,9,So apparently IU is catching on and won’t ju...,,,0,0,0,0,0,0,...,,,,,,,,,,


1. Corporate post

Criteria for corporate = the client (DocuSign) and its competitors (OneSpan, SignNow, Adobe Sign)

Explored the 'screen_name' column and came up with the following list of corporate account names:

In [8]:
adobe = ['Acrobat','Acrobat_GU','adobe_mabhatia','adobe1234567','AdobeCare','AdobeDocCloud','AdobeExpCloud','AdobeGov','adobemax','AdobeNews','AdobePartner','adobesignstx','AdobeStarProps','AdobeUK']
onespan = ['atOneSpan','OneSpan','OneSpanSign']
docusign = ['DocuSign','DocuSignAPAC','DocuSignAPI','DocuSignIMPACT','DocuSignUK','DocuSignING']
signnow = ['signnow']

From this list, it looks possible to label spam_corporate by checking whether the screen name contains one of the company names, which would be more efficient than maintaining the full list of accounts.
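A substring-based version of that idea might look like this. It is only a sketch: the keyword list and the helper name `flag_corporate` are assumptions, and substring matching can also catch personal or fan accounts that happen to contain a company name, so the result would still need a quick review.

```python
import pandas as pd

# Hypothetical keyword list; tune it against the real 'screen_name' values
COMPANY_KEYWORDS = ['adobe', 'acrobat', 'onespan', 'docusign', 'signnow']

def flag_corporate(frame, keywords=COMPANY_KEYWORDS):
    """Set spam_corporate to 1 where screen_name contains any keyword."""
    out = frame.copy()
    pattern = '|'.join(keywords)
    # case=False ignores capitalization; na=False treats missing names as no match
    mask = out['screen_name'].str.contains(pattern, case=False, na=False)
    out.loc[mask, 'spam_corporate'] = 1
    return out

toy = pd.DataFrame({
    'screen_name': ['AdobeCare', 'DocuSignUK', 'cmdavis4cap', None],
    'spam_corporate': [0, 0, 0, 0],
})
flagged = flag_corporate(toy)
```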

* Bonus finding: Adobe leverages its branch companies
* Q: Do we need multiple accounts or one account for social media marketing?

In [11]:
combined_lst = adobe+onespan+docusign+signnow
#combined_lst

In [6]:
#df[df['spam_']]

In [7]:
#df.loc[df['screen_name'] in (combined_lst), 'spam_corporate'] = 1

In [8]:
#df.loc[df['c1'] == 'Value', 'c2'] = 10

In [12]:
# Use .loc so the assignment writes to df itself instead of a temporary copy
df.loc[df['screen_name'].isin(combined_lst), 'spam_corporate'] = 1


2. Bot generated post

In this dataset, bot-generated posts come from Adobe Sign and follow the formats "Just closed a deal in..." / "Just signed an agreement in..." / "just connected my...". Their expanded_url value is always the Adobe Sign link.

For now, we can assume that a post is bot generated if its expanded_url value links to Adobe Sign. As a later refinement, it would be possible to process the text and compare its vectors (e.g., with clustering) to capture groups of bot-generated posts by their distinct patterns.

Given the time restriction, I will proceed with this hypothesis.
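As a cheaper alternative to clustering, the text patterns mentioned above could also be checked with a regular expression. The sketch below is an assumption-laden illustration: the three phrases come from this notebook, but the exact wording of real posts may vary, and `looks_bot_generated` is a hypothetical helper, not part of the project code.

```python
import re

# Opening phrases quoted in this notebook; real posts may use other variants
BOT_PATTERN = re.compile(
    r'^\s*just (closed a deal in|signed an agreement in|connected my)\b',
    re.IGNORECASE,
)

def looks_bot_generated(text):
    """Return True when the post starts with one of the known bot phrases."""
    return bool(BOT_PATTERN.search(text or ''))

examples = [
    "just closed a deal in 27 hours",
    "Just signed an agreement in 3 days",
    "So apparently IU is catching on",
]
flags = [looks_bot_generated(t) for t in examples]
```

Combining this check with the expanded_url rule would make it possible to see where the two disagree.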

In [13]:
# Use .loc so the assignment writes to df itself instead of a temporary copy
df.loc[df['expanded_url'] == 'https://acrobat.adobe.com', 'spam_bot'] = 1


In [11]:
df.head()

Unnamed: 0,index,text,retweet_count,favorite_count,spam,spam_marketing,spam_hijack,spam_corporate,spam_bot,spam_known,...,place_type,full_name,place_name,place_id,place_lat,place_lon,lat,lon,expanded_url,url
0,0,just closed a deal in 27 hours using #AdobeSig...,,,0,0,0,0,1,0,...,,,,,,,,,https://acrobat.adobe.com,https://t.co/Fe0YfarG31
1,1,just closed a deal in 2 days using #AdobeSign ...,,,0,0,0,0,1,0,...,,,,,,,,,https://acrobat.adobe.com,https://t.co/TTlCvaDE0V
2,2,just closed a deal in 2 hours using #AdobeSign...,,,0,0,0,0,1,0,...,,,,,,,,,https://acrobat.adobe.com,https://t.co/okmHDeM9bZ
3,3,just closed a deal in 26 hours using #AdobeSig...,,,0,0,0,0,1,0,...,,,,,,,,,https://acrobat.adobe.com,https://t.co/AAhUGDLZxH
4,4,just closed a deal in 6 days using #AdobeSign ...,,,0,0,0,0,1,0,...,,,,,,,,,https://acrobat.adobe.com,https://t.co/Oi57q6jdXP


* Q: Are bot-generated posts worth doing? Is that why Adobe is using them?

Known Spammer and Own Post spam would be labeled with a similar method, starting from user input like below:

In [12]:
x = input("Type in target user name: ")

Type in target user name: hi
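One way the typed-in names could then be applied is sketched below. The helper `flag_accounts`, the comma-separated input format, and the account names are all hypothetical; the notebook itself only shows the `input()` call.

```python
import pandas as pd

def flag_accounts(frame, names, column):
    """Set `column` to 1 for rows whose screen_name matches a given name (case-insensitive)."""
    out = frame.copy()
    wanted = {n.strip().lower() for n in names if n.strip()}
    mask = out['screen_name'].str.lower().isin(wanted)
    out.loc[mask, column] = 1
    return out

# Made-up accounts for illustration
toy = pd.DataFrame({
    'screen_name': ['regular_user', 'Spammer_One', 'OwnBrand'],
    'spam_known': [0, 0, 0],
    'spam_own': [0, 0, 0],
})

# In the notebook these strings would come from input(); here they are hard-coded
known_raw = "spammer_one"
own_raw = "OwnBrand, "
toy = flag_accounts(toy, known_raw.split(','), 'spam_known')
toy = flag_accounts(toy, own_raw.split(','), 'spam_own')
```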


In [13]:
df.columns

Index(['index', 'text', 'retweet_count', 'favorite_count', 'spam',
       'spam_marketing', 'spam_hijack', 'spam_corporate', 'spam_bot',
       'spam_known', 'spam_own', 'Docusign', 'onespan', 'signnow',
       'adobe sign', 'favorited', 'truncated', 'id_str',
       'in_reply_to_screen_name', 'source', 'retweeted', 'created_at',
       'in_reply_to_status_id_str', 'in_reply_to_user_id_str', 'lang',
       'listed_count', 'verified', 'location', 'user_id_str', 'description',
       'geo_enabled', 'user_created_at', 'statuses_count', 'followers_count',
       'favourites_count', 'protected', 'user_url', 'name', 'time_zone',
       'user_lang', 'utc_offset', 'friends_count', 'screen_name',
       'country_code', 'country', 'place_type', 'full_name', 'place_name',
       'place_id', 'place_lat', 'place_lon', 'lat', 'lon', 'expanded_url',
       'url'],
      dtype='object')

And... finally, eyeball time.
Let's separate out the rows where none of the spam category values is 1.

In [14]:
df.loc[df['spam_bot'] == 1 | df['spam_corporate'] == 1]

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
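The error comes from operator precedence: `|` binds more tightly than `==`, so Python first evaluates `1 | df['spam_corporate']` and then chains the two `==` comparisons with an implicit `and`, which needs a single True/False value. The toy frame below is made up to reproduce the behavior on a small scale.

```python
import pandas as pd

toy = pd.DataFrame({'spam_bot': [1, 0, 0], 'spam_corporate': [0, 1, 0]})

# Without parentheses, the chained comparison tries to turn a Series into one bool
try:
    toy[toy['spam_bot'] == 1 | toy['spam_corporate'] == 1]
    raised = False
except ValueError:
    raised = True

# With parentheses, each comparison yields a boolean Series before | combines them
either = toy[(toy['spam_bot'] == 1) | (toy['spam_corporate'] == 1)]
```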

In [None]:
# General pattern: df[(df['A'] == 1) | (df['B'] == 1)], with each comparison wrapped in parentheses

In [14]:
df[(df['spam_bot'] == 1) | (df['spam_corporate'] == 1)]

Unnamed: 0,index,text,retweet_count,favorite_count,spam,spam_marketing,spam_hijack,spam_corporate,spam_bot,spam_known,...,place_type,full_name,place_name,place_id,place_lat,place_lon,lat,lon,expanded_url,url
0,0,just closed a deal in 27 hours using #AdobeSig...,,,0,0,0,0,1,0,...,,,,,,,,,https://acrobat.adobe.com,https://t.co/Fe0YfarG31
1,1,just closed a deal in 2 days using #AdobeSign ...,,,0,0,0,0,1,0,...,,,,,,,,,https://acrobat.adobe.com,https://t.co/TTlCvaDE0V
2,2,just closed a deal in 2 hours using #AdobeSign...,,,0,0,0,0,1,0,...,,,,,,,,,https://acrobat.adobe.com,https://t.co/okmHDeM9bZ
3,3,just closed a deal in 26 hours using #AdobeSig...,,,0,0,0,0,1,0,...,,,,,,,,,https://acrobat.adobe.com,https://t.co/AAhUGDLZxH
4,4,just closed a deal in 6 days using #AdobeSign ...,,,0,0,0,0,1,0,...,,,,,,,,,https://acrobat.adobe.com,https://t.co/Oi57q6jdXP
5,5,just closed a deal in 35 minutes using #AdobeS...,,,0,0,0,0,1,0,...,,,,,,,,,https://acrobat.adobe.com,https://t.co/AAhUGDLZxH
6,6,just closed a deal in 3 days using #AdobeSign ...,,,0,0,0,0,1,0,...,,,,,,,,,https://acrobat.adobe.com,https://t.co/su9NFDvRSm
7,7,just closed a deal in 58 minutes using #AdobeS...,,,0,0,0,0,1,0,...,,,,,,,,,https://acrobat.adobe.com,https://t.co/Ir5AOw3Uv8
8,8,just closed a deal in 7 minutes using #AdobeSi...,,,0,0,0,0,1,0,...,,,,,,,,,https://acrobat.adobe.com,https://t.co/TUP7YE2FjQ
10,10,just closed a deal in 27 minutes using #AdobeS...,,,0,0,0,0,1,0,...,,,,,,,,,https://acrobat.adobe.com,https://t.co/PMDUd1jH0h


In [15]:
target_df = df[(df['spam_bot'] != 1) & (df['spam_corporate'] != 1)]
target_df

Unnamed: 0,index,text,retweet_count,favorite_count,spam,spam_marketing,spam_hijack,spam_corporate,spam_bot,spam_known,...,place_type,full_name,place_name,place_id,place_lat,place_lon,lat,lon,expanded_url,url
9,9,So apparently IU is catching on and won’t ju...,,,0,0,0,0,0,0,...,,,,,,,,,,
82,82,"The 11 Year Customer: Last week, I met a VP of...",,,0,0,0,0,0,0,...,,,,,,,,,http://dlvr.it/QWJnCz,https://t.co/knHc6lFzj5
107,107,7 Reasons to Choose Adobe Sign for E-Signature...,,,0,0,0,0,0,0,...,,,,,,,,,https://buff.ly/2LZaGMH,https://t.co/PrPAyMqtII
115,115,What is it with ink discrimination? I was just...,,,0,0,0,0,0,0,...,,,,,,,,,,
129,129,"Explore Adobe Sign, a web base tool used for d...",,,0,0,0,0,0,0,...,,,,,,,,,https://cornell.sabacloud.com/Saba/Web_spf/NA1...,https://t.co/siOQk3vgHD
132,132,New version of Adobe Flash Player for Windows ...,,,0,0,0,0,0,0,...,,,,,,,,,http://happyasis.com/forums/topic/824/adobe-fl...,https://t.co/61jXV17sJg
143,143,"Adobe Fill &amp; Sign users, you can get aroun...",,,0,0,0,0,0,0,...,,,,,,,,,,
151,151,Sign up for Coastline's Online DGA C164 Adobe ...,,,0,0,0,0,0,0,...,,,,,,,,,http://www.coastline.edu/,https://t.co/mMhtSaip96
163,163,Sapho Employee Experience Portal 4.9 is here! ...,,,0,0,0,0,0,0,...,,,,,,,,,http://bit.ly/2JjAcP0,https://t.co/dsTkn5EnBD
167,167,Sign up today for our summer 2018 classes! Top...,,,0,0,0,0,0,0,...,,,,,,,,,http://bit.ly/2GqpOiR,https://t.co/GiMClaF3pR


In [18]:
13227+6784

20011
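The same sanity check can be written against the DataFrames directly, which avoids retyping the counts. This is a sketch on a made-up frame; the point is that one boolean mask and its negation always split the rows into two parts that add up to the total.

```python
import pandas as pd

toy = pd.DataFrame({'spam_bot': [1, 0, 0, 1], 'spam_corporate': [0, 1, 0, 0]})

# One mask for the automatically labeled rows; its negation is everything else
auto_mask = (toy['spam_bot'] == 1) | (toy['spam_corporate'] == 1)
auto_part = toy[auto_mask]
hand_part = toy[~auto_mask]

# The two parts must add up to the full row count
assert len(auto_part) + len(hand_part) == len(toy)
```

By De Morgan's law, `~auto_mask` selects the same rows as the `!= 1` and `!= 1` filter combined with `&`.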

In [19]:
target_df = df[(df['spam_bot'] != 1) & (df['spam_corporate'] != 1)]

In [10]:
target_df.columns

Index(['index', 'text', 'retweet_count', 'favorite_count', 'spam',
       'spam_marketing', 'spam_hijack', 'spam_corporate', 'spam_bot',
       'spam_known', 'spam_own', 'Docusign', 'onespan', 'signnow',
       'adobe sign', 'favorited', 'truncated', 'id_str',
       'in_reply_to_screen_name', 'source', 'retweeted', 'created_at',
       'in_reply_to_status_id_str', 'in_reply_to_user_id_str', 'lang',
       'listed_count', 'verified', 'location', 'user_id_str', 'description',
       'geo_enabled', 'user_created_at', 'statuses_count', 'followers_count',
       'favourites_count', 'protected', 'user_url', 'name', 'time_zone',
       'user_lang', 'utc_offset', 'friends_count', 'screen_name',
       'country_code', 'country', 'place_type', 'full_name', 'place_name',
       'place_id', 'place_lat', 'place_lon', 'lat', 'lon', 'expanded_url',
       'url'],
      dtype='object')

In [13]:
target_df2 = target_df[['index','text','spam_marketing','spam_hijack','description','screen_name','expanded_url']]

In [14]:
target_df2.head()

Unnamed: 0,index,text,spam_marketing,spam_hijack,description,screen_name,expanded_url
9,9,So apparently IU is catching on and won’t ju...,0,0,BSU ‘17 - SIU ‘19 // Architecture & Interi...,cmdavis4cap,
82,82,"The 11 Year Customer: Last week, I met a VP of...",0,0,Changing the way management consulting service...,MHPAdvisory,http://dlvr.it/QWJnCz
107,107,7 Reasons to Choose Adobe Sign for E-Signature...,0,0,Making your business STAND OUT in the digital ...,GleesonDigital,https://buff.ly/2LZaGMH
115,115,What is it with ink discrimination? I was just...,0,0,Mythical conservative academic. Don't tell any...,LadyGrammarian,
129,129,"Explore Adobe Sign, a web base tool used for d...",0,0,Cornell Information Technologies. Get support ...,Cornell_IT,https://cornell.sabacloud.com/Saba/Web_spf/NA1...


In [1]:
#target_df2.to_excel("smalleyeball.xlsx", sheet_name='Sheet_name_1')

In [2]:
#target_df.to_excel("eyeball.xlsx", sheet_name='Sheet_name_1')

### 03/31 - Finally done with labeling!

In [16]:
auto_labeled_df = df[(df['spam_bot'] == 1) | (df['spam_corporate'] == 1)]

In [17]:
hand_labeled_df = df[(df['spam_bot'] != 1) & (df['spam_corporate'] != 1)]

In [19]:
hand_labeled_df

Unnamed: 0,index,text,retweet_count,favorite_count,spam,spam_marketing,spam_hijack,spam_corporate,spam_bot,spam_known,...,place_type,full_name,place_name,place_id,place_lat,place_lon,lat,lon,expanded_url,url
9,9,So apparently IU is catching on and won’t ju...,,,0,0,0,0,0,0,...,,,,,,,,,,
82,82,"The 11 Year Customer: Last week, I met a VP of...",,,0,0,0,0,0,0,...,,,,,,,,,http://dlvr.it/QWJnCz,https://t.co/knHc6lFzj5
107,107,7 Reasons to Choose Adobe Sign for E-Signature...,,,0,0,0,0,0,0,...,,,,,,,,,https://buff.ly/2LZaGMH,https://t.co/PrPAyMqtII
115,115,What is it with ink discrimination? I was just...,,,0,0,0,0,0,0,...,,,,,,,,,,
129,129,"Explore Adobe Sign, a web base tool used for d...",,,0,0,0,0,0,0,...,,,,,,,,,https://cornell.sabacloud.com/Saba/Web_spf/NA1...,https://t.co/siOQk3vgHD
132,132,New version of Adobe Flash Player for Windows ...,,,0,0,0,0,0,0,...,,,,,,,,,http://happyasis.com/forums/topic/824/adobe-fl...,https://t.co/61jXV17sJg
143,143,"Adobe Fill &amp; Sign users, you can get aroun...",,,0,0,0,0,0,0,...,,,,,,,,,,
151,151,Sign up for Coastline's Online DGA C164 Adobe ...,,,0,0,0,0,0,0,...,,,,,,,,,http://www.coastline.edu/,https://t.co/mMhtSaip96
163,163,Sapho Employee Experience Portal 4.9 is here! ...,,,0,0,0,0,0,0,...,,,,,,,,,http://bit.ly/2JjAcP0,https://t.co/dsTkn5EnBD
167,167,Sign up today for our summer 2018 classes! Top...,,,0,0,0,0,0,0,...,,,,,,,,,http://bit.ly/2GqpOiR,https://t.co/GiMClaF3pR


In [20]:
13227+6784 #confirming

20011

Combining two subsets of hand-labeled data

In [22]:
hand_partone = pd.read_excel('data/smalleyeball_partone.xlsx')
hand_parttwo = pd.read_excel('data/smalleyeball_parttwo.xlsx')

In [51]:
#hand_partone

In [52]:
#hand_parttwo

In [25]:
hand_combined = pd.concat([hand_partone,hand_parttwo])

In [53]:
#hand_combined

In [54]:
#hand_combined.columns

In [30]:
hand_combined.drop(['Unnamed: 0', 'Unnamed: 0.1'], axis=1,inplace=True)

In [31]:
hand_combined.columns

Index(['index', 'text', 'spam_marketing', 'spam_hijack', 'description',
       'screen_name', 'expanded_url'],
      dtype='object')

In [55]:
#hand_combined

In [56]:
#hand_labeled_df

In [57]:
#print(hand_labeled_df.columns)

In [58]:
#print(hand_combined.columns)

array([0, 0, 1, ..., 1, 0, 0])

In [42]:
# assign() returns a new DataFrame, so nothing is written to a slice of df.
# Copying by .values assumes both frames hold the same rows in the same order.
hand_labeled_df = hand_labeled_df.assign(spam_marketing=hand_combined['spam_marketing'].values)


In [44]:
# assign() returns a new DataFrame, so nothing is written to a slice of df.
# Copying by .values assumes both frames hold the same rows in the same order.
hand_labeled_df = hand_labeled_df.assign(spam_hijack=hand_combined['spam_hijack'].values)


In [59]:
#hand_labeled_df

I am not confident in the consistency of my labeling, or in my understanding of the definitions of marketing spam and hijacking. Look2Social would need to label the data once more against their own standard.

Labeling is clearly easier if you sort the rows by post text first: repeated or similar posts can then be labeled at once, which saves a lot of time.
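Going one step further than sorting, near-identical posts could be grouped under a shared template so that one decision labels the whole group. The sketch below is an assumption: the normalization rules (lowercase, URLs and digits replaced by placeholders) and the example posts are made up for illustration.

```python
import re
from collections import defaultdict

def template_key(text):
    """Collapse URLs, digits, case, and spacing so near-identical posts share one key."""
    key = re.sub(r'https?://\S+', '<url>', text.lower())
    key = re.sub(r'\d+', '<n>', key)
    return re.sub(r'\s+', ' ', key).strip()

def group_by_template(texts):
    """Map each template key to the positions of the posts that match it."""
    groups = defaultdict(list)
    for position, text in enumerate(texts):
        groups[template_key(text)].append(position)
    return dict(groups)

# Made-up posts for illustration
posts = [
    "Sign up for our webinar on 3/14 http://example.com/a",
    "Sign up for our webinar on 4/02 http://example.com/b",
    "My printer is out of ink",
]
groups = group_by_template(posts)
```

A labeler would then review one post per group and copy the label to the other positions in that group.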

Labeling the data produced many insights. I would personally recommend labeling data once every one or two quarters to follow the trend. It would capture qualitative insights about the client (DocuSign) and its competitors (Adobe Sign, OneSpan, SignNow).

In [48]:
combined_labeled_df = pd.concat([hand_labeled_df,auto_labeled_df])

In [49]:
combined_labeled_df.sort_values(by='index', inplace=True)

In [50]:
combined_labeled_df

Unnamed: 0,index,text,retweet_count,favorite_count,spam,spam_marketing,spam_hijack,spam_corporate,spam_bot,spam_known,...,place_type,full_name,place_name,place_id,place_lat,place_lon,lat,lon,expanded_url,url
0,0,just closed a deal in 27 hours using #AdobeSig...,,,0,0,0,0,1,0,...,,,,,,,,,https://acrobat.adobe.com,https://t.co/Fe0YfarG31
1,1,just closed a deal in 2 days using #AdobeSign ...,,,0,0,0,0,1,0,...,,,,,,,,,https://acrobat.adobe.com,https://t.co/TTlCvaDE0V
2,2,just closed a deal in 2 hours using #AdobeSign...,,,0,0,0,0,1,0,...,,,,,,,,,https://acrobat.adobe.com,https://t.co/okmHDeM9bZ
3,3,just closed a deal in 26 hours using #AdobeSig...,,,0,0,0,0,1,0,...,,,,,,,,,https://acrobat.adobe.com,https://t.co/AAhUGDLZxH
4,4,just closed a deal in 6 days using #AdobeSign ...,,,0,0,0,0,1,0,...,,,,,,,,,https://acrobat.adobe.com,https://t.co/Oi57q6jdXP
5,5,just closed a deal in 35 minutes using #AdobeS...,,,0,0,0,0,1,0,...,,,,,,,,,https://acrobat.adobe.com,https://t.co/AAhUGDLZxH
6,6,just closed a deal in 3 days using #AdobeSign ...,,,0,0,0,0,1,0,...,,,,,,,,,https://acrobat.adobe.com,https://t.co/su9NFDvRSm
7,7,just closed a deal in 58 minutes using #AdobeS...,,,0,0,0,0,1,0,...,,,,,,,,,https://acrobat.adobe.com,https://t.co/Ir5AOw3Uv8
8,8,just closed a deal in 7 minutes using #AdobeSi...,,,0,0,0,0,1,0,...,,,,,,,,,https://acrobat.adobe.com,https://t.co/TUP7YE2FjQ
9,9,So apparently IU is catching on and won’t ju...,,,0,0,0,0,0,0,...,,,,,,,,,,


In [60]:
combined_labeled_df.columns

Index(['index', 'text', 'retweet_count', 'favorite_count', 'spam',
       'spam_marketing', 'spam_hijack', 'spam_corporate', 'spam_bot',
       'spam_known', 'spam_own', 'Docusign', 'onespan', 'signnow',
       'adobe sign', 'favorited', 'truncated', 'id_str',
       'in_reply_to_screen_name', 'source', 'retweeted', 'created_at',
       'in_reply_to_status_id_str', 'in_reply_to_user_id_str', 'lang',
       'listed_count', 'verified', 'location', 'user_id_str', 'description',
       'geo_enabled', 'user_created_at', 'statuses_count', 'followers_count',
       'favourites_count', 'protected', 'user_url', 'name', 'time_zone',
       'user_lang', 'utc_offset', 'friends_count', 'screen_name',
       'country_code', 'country', 'place_type', 'full_name', 'place_name',
       'place_id', 'place_lat', 'place_lon', 'lat', 'lon', 'expanded_url',
       'url'],
      dtype='object')

In [61]:
# Drop columns that hold a single distinct value (no signal), but always keep the label columns
label_cols = ['spam', 'spam_own', 'spam_corporate', 'spam_known', 'spam_marketing', 'spam_hijack', 'spam_bot']
df2 = combined_labeled_df
for col in combined_labeled_df.columns:
    if col not in label_cols and combined_labeled_df[col].nunique() == 1:
        df2 = df2.drop(col, axis=1)
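The same filter can be written without an explicit loop, as sketched below on a toy frame. One detail worth knowing: `nunique()` ignores missing values by default, so a column that is entirely empty has 0 distinct values and survives a `== 1` test. The helper name `drop_constant_columns` and the toy data are assumptions.

```python
import pandas as pd

LABEL_COLS = ['spam', 'spam_own', 'spam_corporate', 'spam_known',
              'spam_marketing', 'spam_hijack', 'spam_bot']

def drop_constant_columns(frame, keep=LABEL_COLS):
    """Drop columns with exactly one distinct non-null value, except those in `keep`."""
    counts = frame.nunique()            # distinct non-null values per column
    constant = counts[counts == 1].index
    to_drop = [c for c in constant if c not in keep]
    return frame.drop(columns=to_drop)

toy = pd.DataFrame({
    'text': ['a', 'b', 'c'],
    'truncated': [False, False, False],  # constant, so it is dropped
    'spam_own': [0, 0, 0],               # constant, but a label column, so it stays
    'lat': [None, None, None],           # all missing: nunique() is 0, so it stays
})
trimmed = drop_constant_columns(toy)
```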

In [62]:
df2.columns

Index(['index', 'text', 'retweet_count', 'favorite_count', 'spam',
       'spam_marketing', 'spam_hijack', 'spam_corporate', 'spam_bot',
       'spam_known', 'spam_own', 'Docusign', 'onespan', 'signnow',
       'adobe sign', 'id_str', 'in_reply_to_screen_name', 'source',
       'created_at', 'in_reply_to_status_id_str', 'in_reply_to_user_id_str',
       'lang', 'listed_count', 'verified', 'location', 'user_id_str',
       'description', 'geo_enabled', 'user_created_at', 'statuses_count',
       'followers_count', 'favourites_count', 'user_url', 'name', 'time_zone',
       'user_lang', 'utc_offset', 'friends_count', 'screen_name',
       'country_code', 'country', 'place_type', 'full_name', 'place_name',
       'place_id', 'place_lat', 'place_lon', 'lat', 'lon', 'expanded_url',
       'url'],
      dtype='object')

In [63]:
df2.head()

Unnamed: 0,index,text,retweet_count,favorite_count,spam,spam_marketing,spam_hijack,spam_corporate,spam_bot,spam_known,...,place_type,full_name,place_name,place_id,place_lat,place_lon,lat,lon,expanded_url,url
0,0,just closed a deal in 27 hours using #AdobeSig...,,,0,0,0,0,1,0,...,,,,,,,,,https://acrobat.adobe.com,https://t.co/Fe0YfarG31
1,1,just closed a deal in 2 days using #AdobeSign ...,,,0,0,0,0,1,0,...,,,,,,,,,https://acrobat.adobe.com,https://t.co/TTlCvaDE0V
2,2,just closed a deal in 2 hours using #AdobeSign...,,,0,0,0,0,1,0,...,,,,,,,,,https://acrobat.adobe.com,https://t.co/okmHDeM9bZ
3,3,just closed a deal in 26 hours using #AdobeSig...,,,0,0,0,0,1,0,...,,,,,,,,,https://acrobat.adobe.com,https://t.co/AAhUGDLZxH
4,4,just closed a deal in 6 days using #AdobeSign ...,,,0,0,0,0,1,0,...,,,,,,,,,https://acrobat.adobe.com,https://t.co/Oi57q6jdXP


In [64]:
df2.dropna(thresh=(len(df)/2), axis=1, inplace=True)
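With `axis=1`, `thresh` is the minimum number of non-missing values a column needs in order to be kept, so this cell drops columns that are more than about half empty. The toy example below illustrates the rule; it uses integer division for the threshold, which is a small assumption on my part since the cell above passes `len(df)/2` as a float.

```python
import pandas as pd

toy = pd.DataFrame({
    'text': ['a', 'b', 'c', 'd'],
    'location': ['NJ', None, 'AZ', None],   # 2 of 4 values present
    'lat': [None, None, None, 1.0],         # 1 of 4 values present
})

min_present = len(toy) // 2                 # integer threshold: 2
trimmed = toy.dropna(thresh=min_present, axis=1)
```

Here `location` is kept because it has exactly 2 non-missing values, while `lat` is dropped.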

In [65]:
df2.head()

Unnamed: 0,index,text,spam,spam_marketing,spam_hijack,spam_corporate,spam_bot,spam_known,spam_own,Docusign,...,user_created_at,statuses_count,followers_count,favourites_count,name,user_lang,friends_count,screen_name,expanded_url,url
0,0,just closed a deal in 27 hours using #AdobeSig...,0,0,0,0,1,0,0,False,...,Fri Sep 09 20:40:00 +0000 2011,6750,108,1,Joe Zarroli,en,12,IslandRealtyGrp,https://acrobat.adobe.com,https://t.co/Fe0YfarG31
1,1,just closed a deal in 2 days using #AdobeSign ...,0,0,0,0,1,0,0,False,...,Fri May 25 01:27:15 +0000 2012,5018,47,2,Party Karacters,en,25,partykaracters,https://acrobat.adobe.com,https://t.co/TTlCvaDE0V
2,2,just closed a deal in 2 hours using #AdobeSign...,0,0,0,0,1,0,0,False,...,Thu Dec 18 22:06:40 +0000 2008,29,60,10,annpodlozny,en,52,annpodlozny,https://acrobat.adobe.com,https://t.co/okmHDeM9bZ
3,3,just closed a deal in 26 hours using #AdobeSig...,0,0,0,0,1,0,0,False,...,Wed Aug 08 20:11:12 +0000 2012,5332,66,0,Angela Dugan,en,182,AngelaRedondo1,https://acrobat.adobe.com,https://t.co/AAhUGDLZxH
4,4,just closed a deal in 6 days using #AdobeSign ...,0,0,0,0,1,0,0,False,...,Thu Aug 26 20:50:06 +0000 2010,1337,225,19,Phebe,en,348,Makingmeevents,https://acrobat.adobe.com,https://t.co/Oi57q6jdXP


In [66]:
df2.columns

Index(['index', 'text', 'spam', 'spam_marketing', 'spam_hijack',
       'spam_corporate', 'spam_bot', 'spam_known', 'spam_own', 'Docusign',
       'onespan', 'signnow', 'adobe sign', 'id_str', 'source', 'created_at',
       'lang', 'listed_count', 'verified', 'location', 'user_id_str',
       'description', 'geo_enabled', 'user_created_at', 'statuses_count',
       'followers_count', 'favourites_count', 'name', 'user_lang',
       'friends_count', 'screen_name', 'expanded_url', 'url'],
      dtype='object')

In [73]:
df2[['listed_count','geo_enabled','location','statuses_count']]

Unnamed: 0,listed_count,geo_enabled,location,statuses_count
0,1,True,the South Jersey Shore,6750
1,14,False,"Mission Viejo, Ca",5018
2,0,False,,29
3,2,False,Arizona,5332
4,8,False,"Laredo, Texas",1337
5,2,False,Arizona,5332
6,5,False,"San Jose, CA",419
7,0,False,,39
8,1374,True,"Washington, DC",335579
9,4,True,,20653
