### (1) Demonstrating Python modules and data structures that can be used to efficiently work with Twitter data

The following python modules can be used to efficiently work with Twitter data:

tweepy (this module will be used in this notebook for our demonstration of the Twitter API for Python)

Python Twitter Tools

python-twitter

twython

TwitterAPI

TwitterSearch

Source: https://stackabuse.com/accessing-the-twitter-api-with-python/
  

Data structures used in the Python API for Twitter
    
Typically, Twitter data is pulled using the JSON data structure which you would have to parse either into csv or a pandas dataframe, depending on your purpose of the results.

In the module used in this demonstration (Tweepy), tweets are pulled as tweepy objects. These objects are then converted into json so that we can parse through keys and values easier to gather tweet metadata.

### (2) Using the Twitter API for Python to download tweets, search tweets by hashtags, extract metadata (i.e. number of reteweets, etc.)

#### Import the necessary libraries

The primary libraries used for Twitter API extraction and analysis are tweepy, csv, and json. Tweepy is the Twitter API library for Python, which is the most mature compared to all python libraries available for the Twitter API. The CSV library is used to save extracted tweets and underlying metadata into. The JSON library is used to parse and format tweet metadata into a format which is easy to manipulate because we can use dict keys and values to extract underlying metadata details.

In [59]:
import tweepy as tw
import datetime
import csv
import json

#### State the keys to authenticate to the Twitter API

You will need to setup and be approved for a Developer account in order to receive these keys. These access keys are necessary in order to authenticate into the Twitter API using the tweepy library.

In [3]:
consumer_key= 'crdecmmwhUaTV7oitShaB7xlV'
consumer_secret= 'pEE16H07j9ygOmaxPyJBlW9LUZIrkjOwSyBwhk3DWTS5yZKzEX'
access_token= '1242649299978256389-Ba9M1Nudxuue16nFtGAXuzPk5NNnja'
access_token_secret= 'Skqz04ZBTGAob4K61cHBSay3myFyGLJiCUFPjd7rxyEIk'


#### Authenticate to your Twitter App  

Pass the access key values into the OAuth handler, which is a function of the tweepy library that allows us to authenticate given acceptable credentials

In [4]:
auth = tw.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tw.API(auth, wait_on_rate_limit=True)

     
users = api.me()
print(users)

User(_api=<tweepy.api.API object at 0x000002128FC99CF8>, _json={'id': 1242649299978256389, 'id_str': '1242649299978256389', 'name': 'Rohith', 'screen_name': 'rohith_so', 'location': 'Toronto, Canada', 'profile_location': None, 'description': 'PhD researcher in Machine Learning @UofT and Cloud security Engineer @Deloitte', 'url': None, 'entities': {'description': {'urls': []}}, 'protected': True, 'followers_count': 3, 'friends_count': 62, 'listed_count': 0, 'created_at': 'Wed Mar 25 03:07:47 +0000 2020', 'favourites_count': 5, 'utc_offset': None, 'time_zone': None, 'geo_enabled': False, 'verified': False, 'statuses_count': 0, 'lang': None, 'contributors_enabled': False, 'is_translator': False, 'is_translation_enabled': False, 'profile_background_color': 'F5F8FA', 'profile_background_image_url': None, 'profile_background_image_url_https': None, 'profile_background_tile': False, 'profile_image_url': 'http://pbs.twimg.com/profile_images/1257200110250921984/a2rubQEa_normal.jpg', 'profile_im

#### Search through tweets by hashtags

We search for the hashtag #wildfires which we define as a search term. We then use Tweepy's Cursor function to pass this search term into Tweepy's api.search function which allows us to conduct queries on available public tweets from a specified date, which is defined by the variable "date_since". We defined this variable to extract data from November 16th 2018.

In [5]:
# Define the search term and the date_since date as variables
search_hashtag = "#wildfires"
date_since = "2018-11-16"

# Collect tweets
tweets = tw.Cursor(api.search,
              q=search_hashtag,
              lang="en",
              since=date_since).items(5)
print(tweets)

# Iterate and print tweets
for tweet in tweets:
    print(tweet.id, tweet.text)

<tweepy.cursor.ItemIterator object at 0x000002128FCDB6A0>
1323291868550078464 RT @i_ameztoy: Do you want a scary #Halloween? Here you go two months of CO evolution; Look at the tongues crossing oceans! 🧐

@CopernicusE…
1323291275320205312 #Wine country, fire country https://t.co/wtvGt9980N from @sfchronicle #winecountry #wildfires
1323290275586936832 RT @Alex_Bernhardt: A key reason for the increase in #wildfires is forest management — is the solution biomass? Learn more from Stan Parton…
1323287849899225093 RT @ClimateSignals: Climate change is causing bigger, more frequent #wildfires to burn hotter and spread faster. Scientists have identified…
1323287814193238016 RT @MarineGOfficial: #SavePantanal: The Pantanal is a terrestrial ecoregion of South America belonging to the prairie and flooded savannah…


#### Extract metadata (i.e. number of retweets etc.)

We use the function api.get_status to pull the full text of a retweeted status given a tweet ID then convert this object into JSON format in order to manipulate the underlying metadata elements.

In [49]:
#Extracting the full text of a retweeted status of a given tweet ID by first checking if the tweet has been retweeted

id = "1265889240300257280"
status = api.get_status(id, tweet_mode="extended")
try:
    print(status.retweeted_status.full_text)
except AttributeError: # Not a Retweet
    print(status.full_text)

#Convert the tweet status into JSON so we can parse the dict keys and gather underlying metadata
json_str = json.dumps(status._json)
metadata = (json.loads(json_str))
metadata


Are you a coding fanatic who wants to work with us and learn new technologies? 👨‍💻👩‍💻
Well then, we are looking just for you!

Register for our SDE Hiring Challenge right now!
https://t.co/Zg08gHhT0W  

#hiring #challenge #coding #programming https://t.co/1N7gXaH9eA


{'created_at': 'Thu May 28 06:14:48 +0000 2020',
 'id': 1265889240300257280,
 'id_str': '1265889240300257280',
 'full_text': 'Are you a coding fanatic who wants to work with us and learn new technologies? 👨\u200d💻👩\u200d💻\nWell then, we are looking just for you!\n\nRegister for our SDE Hiring Challenge right now!\nhttps://t.co/Zg08gHhT0W  \n\n#hiring #challenge #coding #programming https://t.co/1N7gXaH9eA',
 'truncated': False,
 'display_text_range': [0, 242],
 'entities': {'hashtags': [{'text': 'hiring', 'indices': [203, 210]},
   {'text': 'challenge', 'indices': [211, 221]},
   {'text': 'coding', 'indices': [222, 229]},
   {'text': 'programming', 'indices': [230, 242]}],
  'symbols': [],
  'user_mentions': [],
  'urls': [{'url': 'https://t.co/Zg08gHhT0W',
    'expanded_url': 'https://practice.geeksforgeeks.org/contest/hiring-challenge-sde',
    'display_url': 'practice.geeksforgeeks.org/contest/hiring…',
    'indices': [176, 199]}],
  'media': [{'id': 1265887151016812546,
    'id_str

As we can see, all elements of the metadata variable (JSON format of the retweeted status object) can be seen in a clean JSON format

Given that the metadata variable is now in JSON format, we can view the keys of the variable since it is a dict. 

In [42]:
#Gather the keys of the tweet's metadata
metadata.keys()

dict_keys(['created_at', 'id', 'id_str', 'full_text', 'truncated', 'display_text_range', 'entities', 'extended_entities', 'source', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', 'contributors', 'is_quote_status', 'retweet_count', 'favorite_count', 'favorited', 'retweeted', 'possibly_sensitive', 'possibly_sensitive_appealable', 'lang'])

Using these key values, we can now easily identify how we want to extract underlying metadata elements by searching through the keys of the metadata dict variable. For example, as seen below we can get the user metadata information by analyzing the name key within the user key.

We now use this to gather when the tweet was published, by which user, from wht country, and how many followeres and friends the user has.

In [57]:
#Gather the user of the tweet
user = metadata['user']['name']
#Gather the location of a user's tweet
user_location = metadata['user']['location']

#Gather the time the tweet was made
created_at = metadata['created_at']

#Gather details about the user's followers and friends
number_of_followers = metadata['user']['followers_count']
number_of_friends = metadata['user']['friends_count']

print("The tweet was created at",created_at,"by the user",user,"from",user_location,"\nThis user has",number_of_followers,"followers and",number_of_friends,"friends")



The tweet was created at Thu May 28 06:14:48 +0000 2020 by the user GeeksforGeeks from India 
This user has 20776 followers and 22 friends


In [24]:
# printing the screen names of the retweeters of the given tweet id
for retweet in retweets_list: 
    print(retweet.user.screen_name) 
    


harshitabambure
codedailybot
UVahalkar
codedailybot
ProjectLearn_io
codedailybot
AaronCuddeback


In [58]:
#printing the number of retweets for a tweet 
retweets_list = api.retweets(id) 

number_of_retweets = len(retweets_list)
print("\nBased on the Tweet ID, there were:", number_of_retweets, "retweets found")


Based on the Tweet ID, there were: 7 retweets found


### (3) Using the Twitter API to download tweets and save those as a csv file

Here, we search for the last 100 tweets made using the hashtag "#trump" and save these tweets to a csv along with metadata of those tweets: username/screen name, id of the tweet, whether it was retweeted, language of the tweet, number of followers of the user, whether the user is verified, location the tweet was made in, the tweet, and when it was created. Our search results are then saved to a csv file.

In [117]:
#method to get a user's last tweets
def get_tweets(hashtag):

 

    #set count to however many tweets you want
    number_of_tweets = 100

    #get tweets
    tweets_for_csv = []
    for tweet in tweepy.Cursor(api.search, q = hashtag).items(number_of_tweets):
        #create array of tweet information: username, tweet id, date/time, text
        tweets_for_csv.append([tweet.user.screen_name,tweet.retweeted,tweet.user.lang,tweet.user.followers_count,tweet.user.verified,tweet.user.location.encode("utf-8"),tweet.id_str, tweet.created_at, tweet.text.encode("utf-8")])

    #write to a new csv file from the array of tweets
    outfile = "hashtag_tweets.csv"
    print ("tweets have been saved to the following csv file:" + outfile)
    with open(outfile, 'w+') as file:
        writer = csv.writer(file, delimiter=',')
        writer.writerows(tweets_for_csv)

if __name__ == '__main__':
    get_tweets("#trump")

tweets have been saved to the following csv file:hashtag_tweets.csv


### (4) Basic feature extraction and basic text preprocessing on tweets from csv file