# Tweet Poster Classifier

This is a basic Natural Language Processing (NLP) and supervised Machine Learning (ML) example that uses real tweets from Elon Musk and Jeff Bezos to demonstrate how common classification models can be applied to unstructured text data.

## Data Collection

The first step in this project is to collect tweets by Jeff Bezos and Elon Musk. This is done with the [Tweepy](https://docs.tweepy.org/en/stable/index.html) package, maintained by [Harmon758](https://github.com/Harmon758).

This first part was built using the [examples the documentation links to](https://github.com/bear/python-twitter/tree/master/examples) as a starting point, adapted to use the Tweepy wrapper instead. Credit also goes to Mike Roman for providing the `TweetMiner` class used below.



In [1]:
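#import the tweepy twitter api wrapper and the config file with secrets

import tweepy
#t_creds.py holds the four credentials used to authenticate below
from t_creds import CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN_KEY, ACCESS_TOKEN_SECRET
import time
import datetime
import pandas as pd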

In the following cell, a quick test of the Tweepy API grabs the most recent tweets from Elon Musk.

In [3]:
#screen name of the user whose posts we're interested in
screen_name = 'elonmusk'

#create tweepy's OAuth handler object
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN_KEY, ACCESS_TOKEN_SECRET)

#use the auth handler to instantiate the api object
api = tweepy.API(auth)

#grab the user's timeline, which consists of recent posts
#determine the earliest tweet (lowest id), which will be used in TweetMiner later
timeline = api.user_timeline(screen_name=screen_name)
earliest_tweet = min(timeline, key=lambda x: x.id).id

#print out timeline items
print("A list of status items from the timeline:")
for tweet in timeline:
    print(tweet, "\n")

#print earliest tweet ID
print(f"earliest tweet id is : {earliest_tweet}")

A list of status items from the timeline:
Status(_api=<tweepy.api.API object at 0x000002BFD09CC7C0>, _json={'created_at': 'Tue Aug 03 16:59:45 +0000 2021', 'id': 1422603106035118085, 'id_str': '1422603106035118085', 'text': '@ErcXspace Very close to real! Arms are able to move during descent to match exact booster position. \n\nCatch point… https://t.co/XKR1oja7Dw', 'truncated': True, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [{'screen_name': 'ErcXspace', 'name': 'Erc X', 'id': 1258538731054739456, 'id_str': '1258538731054739456', 'indices': [0, 10]}], 'urls': [{'url': 'https://t.co/XKR1oja7Dw', 'expanded_url': 'https://twitter.com/i/web/status/1422603106035118085', 'display_url': 'twitter.com/i/web/status/1…', 'indices': [116, 139]}]}, 'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>', 'in_reply_to_status_id': 1422591623427461120, 'in_reply_to_status_id_str': '1422591623427461120', 'in_reply_to_user_id': 12585387310547394
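
The `'truncated': True` flag and the trailing `…` in the output above indicate that the default mode returns a shortened `text` field for long tweets. With the v1.1 API, passing `tweet_mode='extended'` to `api.user_timeline` returns the whole tweet under a `full_text` key instead. The helper below is a minimal sketch of reading whichever key is present from a status's `_json` dictionary; the function name `get_tweet_text` and the sample dictionaries are hypothetical and not part of Tweepy or the original notebook.

```python
def get_tweet_text(status_json):
    """Return the untruncated text when available, else the default text."""
    # 'full_text' is only present when the request used tweet_mode='extended'
    if 'full_text' in status_json:
        return status_json['full_text']
    # otherwise fall back to the (possibly truncated) 'text' field
    return status_json.get('text', '')


# Illustrative dictionaries standing in for Status._json (not real tweets)
compat = {'id': 1, 'text': 'a long tweet that was cut…', 'truncated': True}
extended = {'id': 1, 'full_text': 'a long tweet that was cut off here', 'truncated': False}

print(get_tweet_text(compat))
print(get_tweet_text(extended))
```

Since the classifier works on tweet text, collecting the untruncated version may be worth considering, although the rest of this notebook uses the default mode.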

Note that because no `count` was specified, the API call returned 20 statuses, the default page size.

In [4]:
len(timeline)

20

In [16]:
#TweetMiner class from Mike Roman

class TweetMiner(object):

    #api should be instantiated prior to creating the TweetMiner object
    def __init__(self, api, result_limit = 20):

        self.api = api
        self.result_limit = result_limit

    #method to mine tweets given a user, whether or not to include retweets,
    # and the maximum number of pages (api calls) to request
    def mine_user_tweets(self, user="elonmusk", mine_retweets=False, max_pages=20):

        data           =  []
        last_tweet_id  =  None
        page           =  1

        while page <= max_pages:

            #arguments shared by every call
            kwargs = {'screen_name': user, 'count': self.result_limit, 'include_rts': mine_retweets}

            #after the first page, only ask for tweets older than the last one seen
            if last_tweet_id is not None:
                kwargs['max_id'] = last_tweet_id - 1

            statuses = [status._json for status in self.api.user_timeline(**kwargs)]

            #stop early once the timeline has no more tweets to return
            if not statuses:
                break

            for item in statuses:
                mined = {
                    'tweet_id':        item['id'],
                    'handle':          item['user']['screen_name'],
                    #fall back to 0 if the 'retweet_count' key is missing
                    'retweet_count':   item.get('retweet_count', 0),
                    #'full_text' takes the place of 'text' in extended tweet mode
                    'text':            item.get('text', item.get('full_text')),
                    'mined_at':        datetime.datetime.now(),
                    'created_at':      item['created_at'],
                }

                last_tweet_id = item['id']
                data.append(mined)

            page += 1

        return data
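
The paging logic in `mine_user_tweets` relies on `max_id`, which the v1.1 timeline endpoint treats as inclusive: each call asks for tweets at or below that id, so subtracting 1 from the oldest id already seen avoids fetching it twice. The sketch below reproduces that loop against an in-memory stand-in so it can be run without credentials. `FakeAPI` and `mine_ids` are hypothetical names used only for illustration; they are not part of Tweepy or of the class above.

```python
class FakeAPI:
    """Stand-in for tweepy.API that serves tweet ids from memory (illustration only)."""

    def __init__(self, tweet_ids):
        # timelines are returned newest first, i.e. highest id first
        self.tweet_ids = sorted(tweet_ids, reverse=True)

    def user_timeline(self, count=20, max_id=None):
        # max_id is inclusive: only tweets with id <= max_id are returned
        ids = [i for i in self.tweet_ids if max_id is None or i <= max_id]
        return [{'id': i} for i in ids[:count]]


def mine_ids(api, count, max_pages):
    """Walk backwards through a timeline, one page per call."""
    collected = []
    last_id = None
    for _ in range(max_pages):
        if last_id is None:
            page = api.user_timeline(count=count)
        else:
            # subtract 1 so the oldest tweet of the previous page is not repeated
            page = api.user_timeline(count=count, max_id=last_id - 1)
        if not page:
            break  # timeline exhausted
        collected.extend(item['id'] for item in page)
        last_id = page[-1]['id']
    return collected


api = FakeAPI(range(1, 8))  # seven fake tweets with ids 1..7
print(mine_ids(api, count=3, max_pages=20))  # [7, 6, 5, 4, 3, 2, 1]
```

Seven ids with a page size of three are collected in three calls, with no duplicates, and the fourth call returns an empty page that ends the loop.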

In [17]:
# result_limit is passed as the count parameter of api.user_timeline()
miner = TweetMiner(api, result_limit=200)

#mine both musk's and bezos's tweets
musk = miner.mine_user_tweets(user="elonmusk")
bezos = miner.mine_user_tweets(user="jeffbezos")

Note that Elon Musk has far more tweets than Jeff Bezos, as the counts below show. He is known for his heavy Twitter presence, especially for the CEO of a major tech company.

This imbalance will have to be addressed later when creating the classifier.

The JSON data will be converted into DataFrames to display it more clearly.

In [25]:
print(f'Number of tweets by Elon Musk: {len(musk)} tweets')
print(f'Number of tweets by Jeff Bezos: {len(bezos)} tweets')

Number of tweets by Elon Musk: 1532 tweets
Number of tweets by Jeff Bezos: 228 tweets
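
One common way to account for an imbalance like this is to weight each class inversely to its frequency, using `n_samples / (n_classes * class_count)`, which is the heuristic scikit-learn applies for `class_weight='balanced'`. The sketch below computes those weights from the counts printed above. It is only an illustration of one option (the function name `balanced_class_weights` is hypothetical), and the follow-up notebook may handle the imbalance differently, for example by resampling.

```python
def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * n) for label, n in counts.items()}


# tweet counts taken from the printed output in this notebook
counts = {'elonmusk': 1532, 'JeffBezos': 228}
weights = balanced_class_weights(counts)

for label, weight in weights.items():
    print(f'{label}: {weight:.3f}')
```

The minority class (Bezos) receives a weight several times larger than the majority class, so misclassifying one of his tweets would cost the model correspondingly more during training.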


In [26]:
musk_df = pd.DataFrame(musk)
musk_df.head()

| | tweet_id | handle | retweet_count | text | mined_at | created_at |
|---|---|---|---|---|---|---|
| 0 | 1422627025068695556 | elonmusk | 244 | @Erdayastronaut @ErcXspace We stole the idea f... | 2021-08-03 16:36:38.236180 | Tue Aug 03 18:34:48 +0000 2021 |
| 1 | 1422615364479897606 | elonmusk | 191 | @flcnhvy Pitch control requires more force tha... | 2021-08-03 16:36:38.236180 | Tue Aug 03 17:48:28 +0000 2021 |
| 2 | 1422612139160834050 | elonmusk | 144 | @TeslaFruit Thanks Sandy! | 2021-08-03 16:36:38.236180 | Tue Aug 03 17:35:39 +0000 2021 |
| 3 | 1422608233995382791 | elonmusk | 3527 | https://t.co/nNjhPIEhcZ | 2021-08-03 16:36:38.236180 | Tue Aug 03 17:20:08 +0000 2021 |
| 4 | 1422607954101084161 | elonmusk | 8561 | Super Heavy Booster moving to orbital launch m... | 2021-08-03 16:36:38.236180 | Tue Aug 03 17:19:01 +0000 2021 |


In [27]:
bezos_df = pd.DataFrame(bezos)
bezos_df.head()

| | tweet_id | handle | retweet_count | text | mined_at | created_at |
|---|---|---|---|---|---|---|
| 0 | 1233441223232245760 | JeffBezos | 2554 | Discussing climate, sustainability, and preser... | 2021-08-03 16:36:42.400558 | Fri Feb 28 17:17:58 +0000 2020 |
| 1 | 1224154674804084736 | JeffBezos | 13443 | I just took a DNA test, turns out I’m 100% @li... | 2021-08-03 16:36:42.400558 | Mon Feb 03 02:16:32 +0000 2020 |
| 2 | 1222572705066536961 | JeffBezos | 2775 | Hey, Alexa — show everyone our upcoming Super ... | 2021-08-03 16:36:42.400558 | Wed Jan 29 17:30:21 +0000 2020 |
| 3 | 1220059386694922240 | JeffBezos | 5441 | #Jamal https://t.co/8ej1rUBXVb | 2021-08-03 16:36:42.400558 | Wed Jan 22 19:03:20 +0000 2020 |
| 4 | 1219093283265138688 | JeffBezos | 9970 | Hey, India. We’re rolling out our new fleet of... | 2021-08-03 16:36:42.400558 | Mon Jan 20 03:04:23 +0000 2020 |
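
The `created_at` values in the output above are plain strings in Twitter's timestamp format rather than datetime objects. If later steps need to sort or filter by date, they can be parsed with the standard library, as in the minimal sketch below. The names `TWITTER_TIME_FORMAT` and `parse_created_at` are hypothetical, and the `%a` and `%b` directives assume an English locale.

```python
from datetime import datetime

# Format of the 'created_at' strings, e.g. 'Tue Aug 03 18:34:48 +0000 2021'
TWITTER_TIME_FORMAT = '%a %b %d %H:%M:%S %z %Y'


def parse_created_at(value):
    """Convert a 'created_at' string into a timezone-aware datetime."""
    return datetime.strptime(value, TWITTER_TIME_FORMAT)


created = parse_created_at('Tue Aug 03 18:34:48 +0000 2021')
print(created.isoformat())  # 2021-08-03T18:34:48+00:00
```

The parsed value carries its UTC offset, whereas `mined_at` comes from `datetime.datetime.now()` and is a naive local time, so the two columns are not directly comparable without a conversion.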


Finally, the data is saved to CSV files for use in the next notebook.

In [28]:
musk_df.to_csv("./data/musk_tweets.csv")
bezos_df.to_csv("./data/bezos_tweets.csv")