# Collect_Twitter_Data

## Install Python libraries

We need the [pymongo](https://pypi.org/project/pymongo/) to manage the MongoDB database, and [tweepy](https://www.tweepy.org/) to call the Twitter APIs.

In [3]:
pip install pymongo

Note: you may need to restart the kernel to use updated packages.


In [4]:
pip install tweepy

Note: you may need to restart the kernel to use updated packages.


## Secret Manager Function

In [5]:
import boto3
from botocore.exceptions import ClientError
import json

def get_secret(secret_name):
    region_name = "us-east-1"

    # Create a Secrets Manager client
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=region_name
    )

    try:
        get_secret_value_response = client.get_secret_value(
            SecretId=secret_name
        )
    except ClientError as e:
        raise e

    secret = get_secret_value_response['SecretString']
    
    return json.loads(secret)

## Import Python Libraries and Credentials  

In [6]:
import pymongo
from pymongo import MongoClient
import json
import tweepy

bearer_token   = get_secret('twitter_api')['bearer_token']

mongodb_connect = get_secret('mongodb')['connection_string']

## Connect to the MongoDB cluster

We will create a database named 'demo' and a collection named 'tweet_collection' in your MongoDB database.

In [7]:
mongo_client = MongoClient(mongodb_connect)
db = mongo_client.demo # use or create a database named demo
tweet_collection = db.tweet_collection #use or create a collection named tweet_collection
tweet_collection.create_index([("tweet.id", pymongo.ASCENDING)],unique = True) # make sure the collected tweets are unique

'tweet.id_1'

## Use the API to collect tweets

For more about Twitter API 2.0 query operators, please check [Search Tweets](https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query)

In [8]:
query = 'jmu'  #query tweets contain the word of 'generative ai'

Insert the collected Tweets into the MongoDB database. You can set a different max_result, but the max tweets we can collect is 100.

In [9]:
tweet_client = tweepy.Client(bearer_token)

tweets = tweet_client.search_recent_tweets(query=query, max_results=100,
                                    expansions=['author_id'], 
                                    tweet_fields = ['created_at','entities','lang','public_metrics','geo'],
                                    user_fields = ['id', 'location','name', 'public_metrics','username'])

next_token = tweets.meta['next_token']
for user, tweet in zip(tweets.includes['users'],tweets.data):
    tweet_json = {}
    tweet_json['tweet']= tweet.data
    tweet_json['user'] = user.data
    try:
        tweet_collection.insert_one(tweet_json)
        print(tweet_json['tweet']['created_at'])
    except:
        pass

2025-03-31T18:53:00.000Z
2025-03-31T18:51:38.000Z
2025-03-31T18:51:20.000Z
2025-03-31T18:51:10.000Z
2025-03-31T18:51:07.000Z
2025-03-31T18:51:04.000Z
2025-03-31T18:42:05.000Z
2025-03-31T18:39:38.000Z
2025-03-31T18:37:38.000Z
2025-03-31T18:31:38.000Z
2025-03-31T18:29:22.000Z
2025-03-31T18:29:22.000Z
2025-03-31T18:26:05.000Z
2025-03-31T18:25:56.000Z
2025-03-31T18:18:49.000Z
2025-03-31T18:16:11.000Z
2025-03-31T18:16:09.000Z
2025-03-31T18:15:26.000Z
2025-03-31T18:14:10.000Z
2025-03-31T18:13:36.000Z
2025-03-31T18:12:10.000Z
2025-03-31T18:07:31.000Z
2025-03-31T17:57:26.000Z
2025-03-31T17:56:39.000Z
2025-03-31T17:56:18.000Z
2025-03-31T17:55:31.000Z
2025-03-31T17:54:11.000Z
2025-03-31T17:47:31.000Z
2025-03-31T17:45:35.000Z
2025-03-31T17:43:46.000Z
2025-03-31T17:38:08.000Z
2025-03-31T17:35:16.000Z
2025-03-31T17:35:07.000Z
2025-03-31T17:34:41.000Z
2025-03-31T17:34:36.000Z
2025-03-31T17:34:23.000Z
2025-03-31T17:34:00.000Z
2025-03-31T17:33:36.000Z
2025-03-31T17:33:30.000Z
2025-03-31T17:33:01.000Z


Continue fetching early tweets with the same query. YOU WILL REACH YOUR RATE LIMIT VERY FAST!

In [None]:
for i in range(5): # change the number here to fetch more tweets
    tweets = tweet_client.search_recent_tweets(query=query, max_results=100,
                                        expansions=['author_id'], 
                                        tweet_fields = ['created_at','entities','lang','public_metrics','geo'],
                                        user_fields = ['id', 'location','name', 'public_metrics','username'],
                                        next_token=next_token)
    next_token = tweets.meta['next_token']
    for user, tweet in zip(tweets.includes['users'],tweets.data):
        tweet_json = {}
        tweet_json['tweet']= tweet.data
        tweet_json['user'] = user.data
        try:
            tweet_collection.insert_one(tweet_json)
            print(tweet_json['tweet']['created_at'])
        except:
            pass

## Close Database Connection

In [None]:
mongo_client.close()