# Collect_Twitter_Data

## Install Python libraries

We need [pymongo](https://pypi.org/project/pymongo/) to manage the MongoDB database and [tweepy](https://www.tweepy.org/) to call the Twitter APIs.

In [1]:
pip install pymongo

Collecting pymongo
  Downloading pymongo-4.11.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (22 kB)
Collecting dnspython<3.0.0,>=1.16.0 (from pymongo)
  Downloading dnspython-2.7.0-py3-none-any.whl.metadata (5.8 kB)
Downloading pymongo-4.11.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Downloading dnspython-2.7.0-py3-none-any.whl (313 kB)
Installing collected packages: dnspython, pymongo
Successfully installed dnspython-2.7.0 pymongo-4.11.3
Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install tweepy

Collecting tweepy
  Downloading tweepy-4.15.0-py3-none-any.whl.metadata (4.1 kB)
Collecting oauthlib<4,>=3.2.0 (from tweepy)
  Downloading oauthlib-3.2.2-py3-none-any.whl.metadata (7.5 kB)
Collecting requests-oauthlib<3,>=1.2.0 (from tweepy)
  Downloading requests_oauthlib-2.0.0-py2.py3-none-any.whl.metadata (11 kB)
Downloading tweepy-4.15.0-py3-none-any.whl (99 kB)
Downloading oauthlib-3.2.2-py3-none-any.whl (151 kB)
Downloading requests_oauthlib-2.0.0-py2.py3-none-any.whl (24 kB)
Installing collected packages: oauthlib, requests-oauthlib, tweepy
Successfully installed oauthlib-3.2.2 requests-oauthlib-2.0.0 tweepy-4.15.0
Note: you may need to restart the kernel to use updated packages.


## Secrets Manager Function

In [3]:
import boto3
from botocore.exceptions import ClientError
import json

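def get_secret(secret_name):
    """Fetch a secret from AWS Secrets Manager and return it as a dict."""
    region_name = "us-east-1"

    # Create a Secrets Manager client
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=region_name
    )

    try:
        get_secret_value_response = client.get_secret_value(
            SecretId=secret_name
        )
    except ClientError:
        # Pass AWS errors (missing secret, access denied, etc.) to the caller unchanged
        raise

    # The secret is stored as a JSON string; parse it into a dict
    secret = get_secret_value_response['SecretString']

    return json.loads(secret)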

## Import Python Libraries and Credentials  

In [4]:
import pymongo
from pymongo import MongoClient
import json
import tweepy

bearer_token   = get_secret('twitter_api')['bearer_token']

mongodb_connect = get_secret('mongodb')['connection_string']

## Connect to the MongoDB cluster

We will create a database named 'demo' and a collection named 'tweet_collection' in your MongoDB cluster. A unique index on `tweet.id` prevents the same tweet from being stored twice.

In [5]:
mongo_client = MongoClient(mongodb_connect)
db = mongo_client.demo # use or create a database named demo
tweet_collection = db.tweet_collection #use or create a collection named tweet_collection
tweet_collection.create_index([("tweet.id", pymongo.ASCENDING)],unique = True) # make sure the collected tweets are unique

'tweet.id_1'

## Use the API to collect tweets

For more about Twitter API v2 query operators, see [Search Tweets](https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query).

In [6]:
query = 'cyber threats'  # match tweets that contain both 'cyber' and 'threats'
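
Query operators can be combined with the search terms to narrow the results. The following is a minimal sketch of composing such a query string; `build_query` and `example_query` are hypothetical names, not part of tweepy, and the sketch only uses the `lang:` and `-is:retweet` operators plus double quotes for an exact phrase. Note that quoting changes the meaning: the unquoted query above matches tweets containing both words anywhere, while the quoted form matches the exact phrase.

```python
def build_query(phrase, lang=None, exclude_retweets=False):
    """Compose a search query string from a phrase and optional operators."""
    # Quote multi-word phrases so they are matched as an exact phrase
    parts = [f'"{phrase}"' if " " in phrase else phrase]
    if lang:
        parts.append(f"lang:{lang}")      # restrict to one language
    if exclude_retweets:
        parts.append("-is:retweet")       # negated operator: drop retweets
    return " ".join(parts)


example_query = build_query("cyber threats", lang="en", exclude_retweets=True)
print(example_query)  # "cyber threats" lang:en -is:retweet
```

The resulting string can be passed as the `query` argument in the cells that follow.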

Insert the collected tweets into the MongoDB database. You can set a different `max_results`, but a single request returns at most 100 tweets.

In [7]:
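tweet_client = tweepy.Client(bearer_token)

tweets = tweet_client.search_recent_tweets(query=query, max_results=100,
                                    expansions=['author_id'],
                                    tweet_fields = ['created_at','entities','lang','public_metrics','geo'],
                                    user_fields = ['id', 'location','name', 'public_metrics','username'])

# 'next_token' is absent from the metadata when there are no further pages
next_token = tweets.meta.get('next_token')
# includes['users'] lists each author only once, so look users up by ID instead of zipping
users = {user.data['id']: user.data for user in tweets.includes.get('users', [])}
for tweet in tweets.data or []:
    tweet_json = {}
    tweet_json['tweet'] = tweet.data
    tweet_json['user'] = users.get(tweet.data['author_id'])
    try:
        tweet_collection.insert_one(tweet_json)
        print(tweet_json['tweet']['created_at'])
    except pymongo.errors.DuplicateKeyError:
        pass  # tweet already stored; the unique index rejects duplicates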

2025-03-31T18:55:28.000Z
2025-03-31T18:55:01.000Z
2025-03-31T18:53:54.000Z
2025-03-31T18:53:20.000Z
2025-03-31T18:52:43.000Z
2025-03-31T18:51:48.000Z
2025-03-31T18:49:23.000Z
2025-03-31T18:49:16.000Z
2025-03-31T18:47:37.000Z
2025-03-31T18:46:37.000Z
2025-03-31T18:43:17.000Z
2025-03-31T18:41:27.000Z
2025-03-31T18:39:02.000Z
2025-03-31T18:38:45.000Z
2025-03-31T18:37:42.000Z
2025-03-31T18:37:28.000Z
2025-03-31T18:36:51.000Z
2025-03-31T18:35:49.000Z
2025-03-31T18:31:26.000Z
2025-03-31T18:31:19.000Z
2025-03-31T18:31:03.000Z
2025-03-31T18:30:41.000Z
2025-03-31T18:30:15.000Z
2025-03-31T18:30:07.000Z
2025-03-31T18:30:00.000Z
2025-03-31T18:29:32.000Z
2025-03-31T18:28:43.000Z
2025-03-31T18:27:54.000Z
2025-03-31T18:26:56.000Z
2025-03-31T18:22:21.000Z
2025-03-31T18:21:58.000Z
2025-03-31T18:21:16.000Z
2025-03-31T18:19:43.000Z
2025-03-31T18:18:39.000Z
2025-03-31T18:12:08.000Z
2025-03-31T18:12:03.000Z
2025-03-31T18:03:39.000Z
2025-03-31T18:01:38.000Z
2025-03-31T18:01:22.000Z
2025-03-31T18:00:00.000Z
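
Because `expansions=['author_id']` returns each distinct author only once in `includes['users']`, that list can be shorter than the tweet list whenever one account appears more than once in the results. Here is a minimal sketch of matching tweets to authors by ID rather than by position, using plain dictionaries shaped like `tweet.data` and `user.data`. The function name and the sample records are hypothetical.

```python
def pair_tweets_with_authors(tweets, users):
    """Attach to each tweet the user whose 'id' equals the tweet's 'author_id'."""
    users_by_id = {user["id"]: user for user in users}
    return [
        {"tweet": tweet, "user": users_by_id.get(tweet["author_id"])}
        for tweet in tweets
    ]


# Hypothetical sample: three tweets, but only two distinct authors
sample_tweets = [
    {"id": "1", "author_id": "10", "text": "first"},
    {"id": "2", "author_id": "20", "text": "second"},
    {"id": "3", "author_id": "10", "text": "third"},
]
sample_users = [
    {"id": "10", "username": "alice"},
    {"id": "20", "username": "bob"},
]

docs = pair_tweets_with_authors(sample_tweets, sample_users)
print([doc["user"]["username"] for doc in docs])  # ['alice', 'bob', 'alice']
```

Pairing the two lists positionally would stop after two items here and drop the third tweet, which is why the lookup by `author_id` is the safer choice.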


Continue fetching earlier tweets with the same query. YOU WILL REACH YOUR RATE LIMIT VERY FAST!

In [None]:
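for i in range(5): # change the number here to fetch more tweets
    if next_token is None:
        break  # no more pages for this query
    tweets = tweet_client.search_recent_tweets(query=query, max_results=100,
                                        expansions=['author_id'],
                                        tweet_fields = ['created_at','entities','lang','public_metrics','geo'],
                                        user_fields = ['id', 'location','name', 'public_metrics','username'],
                                        next_token=next_token)
    # 'next_token' is absent from the metadata on the last page
    next_token = tweets.meta.get('next_token')
    # includes['users'] lists each author only once, so look users up by ID
    users = {user.data['id']: user.data for user in tweets.includes.get('users', [])}
    for tweet in tweets.data or []:
        tweet_json = {}
        tweet_json['tweet'] = tweet.data
        tweet_json['user'] = users.get(tweet.data['author_id'])
        try:
            tweet_collection.insert_one(tweet_json)
            print(tweet_json['tweet']['created_at'])
        except pymongo.errors.DuplicateKeyError:
            pass  # tweet already stored; the unique index rejects duplicates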
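
The token-based pagination used above can be sketched independently of the Twitter API. In this minimal example, `fetch_all_pages`, `fake_fetch`, and `FAKE_PAGES` are hypothetical stand-ins: `fake_fetch` plays the role of one API request and returns a `(data, meta)` pair, where `meta` carries a `next_token` only while more pages remain.

```python
def fetch_all_pages(fetch_page, max_pages=5):
    """Follow next_token from page to page, up to max_pages requests."""
    items = []
    next_token = None
    for _ in range(max_pages):
        data, meta = fetch_page(next_token)
        items.extend(data or [])          # a page may come back empty
        next_token = meta.get("next_token")
        if next_token is None:            # last page reached
            break
    return items


# Hypothetical three-page result set keyed by the token that requests each page
FAKE_PAGES = {
    None: (["t1", "t2"], {"next_token": "a"}),
    "a": (["t3"], {"next_token": "b"}),
    "b": (["t4"], {}),
}


def fake_fetch(token):
    return FAKE_PAGES[token]


print(fetch_all_pages(fake_fetch))               # ['t1', 't2', 't3', 't4']
print(fetch_all_pages(fake_fetch, max_pages=2))  # ['t1', 't2', 't3']
```

Capping the number of requests with `max_pages` keeps the loop within the rate limit, and stopping when `next_token` is missing avoids an error on the final page. tweepy also provides a `Paginator` class that handles the token bookkeeping, which may be worth considering for longer collections.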

## Close Database Connection

In [None]:
mongo_client.close()