# Leveraging Advanced Data Acquisition Techniques for Real-time Twitter Analysis and AWS S3 Integration

Organizations that can collect and manage social media data quickly gain a timely view of public discussion. This guide shows how to use Twitter's API to fetch recent tweets on a specific topic, here COVID-19, in near real time. It then shows how to load the results into Amazon Web Services (AWS) S3 for storage, later analysis, and shared access.

## Detailed Explanation
### Twitter Data Acquisition
The process begins by authenticating with Twitter's API using the OAuth 2.0 application-only flow: the client key and secret are joined with a colon, Base64-encoded, and sent to the `oauth2/token` endpoint in exchange for a bearer token. That token authorizes requests to the v1.1 search endpoint, where a query such as "covid-19" returns recent matching tweets that can be analyzed for public sentiment and trends.
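The credential-encoding step can be sketched as a small helper. This is a minimal illustration, not the notebook's own code: the function name `build_basic_auth_header` and the placeholder credentials are assumptions. It percent-encodes each credential before joining them, as Twitter's application-only authentication documentation describes, and uses only the standard library.

```python
import base64
from urllib.parse import quote


def build_basic_auth_header(client_key: str, client_secret: str) -> dict:
    """Build the headers for an application-only token request."""
    # Percent-encode each credential before joining them with a colon.
    encoded_key = quote(client_key, safe="")
    encoded_secret = quote(client_secret, safe="")
    raw = f"{encoded_key}:{encoded_secret}".encode("ascii")

    # Base64-encode the "key:secret" pair and decode to str for the header value.
    b64_credentials = base64.b64encode(raw).decode("ascii")

    return {
        "Authorization": f"Basic {b64_credentials}",
        "Content-Type": "application/x-www-form-urlencoded;charset=UTF-8",
    }


# Placeholder credentials, for illustration only.
headers = build_basic_auth_header("my-key", "my-secret")
print(headers["Authorization"])
```

The returned dictionary can be passed as `headers=` to `requests.post`, together with `data={'grant_type': 'client_credentials'}`.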

### Data Handling and Transformation
Once the data is retrieved, it must be structured into a usable format. The JSON response holds a list of tweet objects under the `statuses` key, each with fields such as the tweet text, user information, and timestamps. Loading that list into a pandas DataFrame gives a tabular structure that is familiar to analysts and easy to manipulate, analyze, and store.
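Several tweet fields are nested dictionaries, which end up as opaque objects in a DataFrame column. The sketch below shows one way to flatten a tweet into a single-level row first. The helper name `flatten_tweet`, the choice of fields, and the sample values are assumptions for illustration; the field names follow the v1.1 tweet object.

```python
def flatten_tweet(tweet: dict) -> dict:
    """Pick a few top-level and nested fields from one tweet object."""
    user = tweet.get("user") or {}
    entities = tweet.get("entities") or {}
    return {
        "id_str": tweet.get("id_str"),
        "created_at": tweet.get("created_at"),
        "text": tweet.get("text"),
        "lang": tweet.get("lang"),
        "retweet_count": tweet.get("retweet_count", 0),
        "favorite_count": tweet.get("favorite_count", 0),
        # Nested values are pulled up into plain columns.
        "user_screen_name": user.get("screen_name"),
        "hashtags": [h.get("text") for h in (entities.get("hashtags") or [])],
    }


# Hypothetical sample shaped like an entry of twitter_data['statuses'].
sample = {
    "id_str": "1",
    "created_at": "Tue Apr 25 18:48:19 +0000 2023",
    "text": "example tweet",
    "lang": "en",
    "retweet_count": 2,
    "user": {"screen_name": "example_user"},
    "entities": {"hashtags": [{"text": "covid19"}]},
}

rows = [flatten_tweet(t) for t in [sample]]
print(rows[0]["user_screen_name"], rows[0]["hashtags"])
```

The resulting list of flat dictionaries can be passed to `pd.DataFrame(rows)` in place of the raw `statuses` list.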

### Integration with AWS S3
After structuring the data, the next step is to store it in a secure and scalable environment. AWS S3 is chosen for its robustness and flexibility in handling large datasets. The guide covers creating a new S3 bucket in a chosen region with boto3 and then writing the DataFrame straight to that bucket with `DataFrame.to_csv` and an `s3://` path, which pandas handles through the s3fs package. Writing directly from the DataFrame avoids managing an intermediate local file.

### Automation and Scalability
For ongoing analysis projects, the same steps can be scheduled to run without manual intervention, so that data is collected and stored continuously. Automation matters most when tracking how discussions and trends evolve over time, particularly for time-sensitive topics such as pandemics or other global events.
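One way to sketch continuous collection is a polling loop that remembers the newest tweet ID it has seen and asks only for newer tweets on the next pass, which matches the search endpoint's `since_id` parameter. The names `poll_tweets`, `fetch`, and `handle` are hypothetical, and the fetch function below is a stand-in rather than a real API call.

```python
import time
from typing import Callable, Optional


def poll_tweets(
    fetch: Callable[[Optional[int]], list],
    handle: Callable[[list], None],
    interval_seconds: float = 60.0,
    max_polls: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[int]:
    """Call fetch(since_id) repeatedly and pass each non-empty batch to handle()."""
    since_id = None
    for poll in range(max_polls):
        batch = fetch(since_id)
        if batch:
            handle(batch)
            # Remember the newest ID so the next poll only requests newer tweets.
            since_id = max(tweet["id"] for tweet in batch)
        if poll < max_polls - 1:
            sleep(interval_seconds)
    return since_id


def fake_fetch(since_id):
    """Stand-in for a search request; returns tweets newer than since_id."""
    tweets = [{"id": 1}, {"id": 2}, {"id": 3}]
    return [t for t in tweets if since_id is None or t["id"] > since_id]


collected = []
last_id = poll_tweets(
    fake_fetch, collected.extend, interval_seconds=0, max_polls=2, sleep=lambda s: None
)
print(last_id, len(collected))
```

In a real deployment, `fetch` would wrap the search request and `handle` would write each batch to S3. A scheduler such as cron could replace the in-process loop.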

## Practical Applications
The setup detailed in this guide is particularly useful for data scientists, market researchers, and social media analysts who require access to real-time data for rapid decision-making and trend analysis. By leveraging Twitter's extensive data and AWS's scalable infrastructure, they can perform complex analyses to extract actionable insights and respond proactively to changes in public opinion or market conditions.

In summary, this approach not only facilitates efficient data collection and management but also enhances the capability to perform advanced analytics on real-time data from one of the world's largest social media platforms.

In [7]:
import pandas as pd
import requests
import json
import base64
!pip install s3fs
import s3fs # documentation: https://s3fs.readthedocs.io/en/latest/
import time
import twitter_keys  # local module (not shown) that defines client_key and client_secret

%config IPCompleter.greedy=True


key_secret = '{}:{}'.format(twitter_keys.client_key, twitter_keys.client_secret).encode('ascii')
b64_encoded_key = base64.b64encode(key_secret)
b64_encoded_key = b64_encoded_key.decode('ascii')

#identify base url and oauth token path
base_url = 'https://api.twitter.com/' #base url for authentication
auth_url = '{}oauth2/token'.format(base_url)

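# request headers: Basic auth with the Base64-encoded key pair, and a form-encoded body
auth_headers = {
    'Authorization': 'Basic {}'.format(b64_encoded_key),
    'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8'
}

# request body: ask for an application-only token via the client_credentials grant
auth_data = {
    'grant_type': 'client_credentials'
}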

#send authentication using requests - POST request
auth_resp = requests.post(auth_url, headers=auth_headers, data=auth_data)

#check response status. 200 = OK
auth_resp.status_code

200

In [9]:
#Keys in data response are token_type (bearer) and access_token (your access token)
print(auth_resp.json().keys())

access_token = auth_resp.json()['access_token']


search_headers = {
    'Authorization': 'Bearer {}'.format(access_token)    
}

# search parameters for the coronavirus example: the 100 most recent English tweets matching "covid-19"
query_params = {
    'q': 'covid-19',
    'result_type': 'recent',
    'count': 100,  # tweets per request; this endpoint returns at most 100
    'lang': 'en'  # restrict results to English
}


#identify search url path and save 
search_url = '{}1.1/search/tweets.json'.format(base_url)


#run search using get request
search_resp = requests.get(search_url, headers=search_headers, params=query_params)

#check status code of GET request
search_resp.status_code

dict_keys(['token_type', 'access_token'])


200
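Because the search endpoint returns at most 100 tweets per request, collecting more means paging backwards with the `max_id` parameter, which limits results to IDs less than or equal to the given value. The sketch below shows that pattern under stated assumptions: `collect_tweets` and `fetch_page` are hypothetical names, and the fetch function is an in-memory stand-in rather than a real request.

```python
def collect_tweets(fetch_page, target_count, page_size=100):
    """Page backwards through search results with max_id until target_count is reached."""
    collected = []
    max_id = None
    while len(collected) < target_count:
        statuses = fetch_page(max_id=max_id, count=page_size)
        if not statuses:
            break  # no older tweets are available
        collected.extend(statuses)
        # Request only tweets strictly older than the oldest one seen so far.
        max_id = min(status["id"] for status in statuses) - 1
    return collected[:target_count]


def fake_fetch_page(max_id=None, count=100):
    """Stand-in for the search request: 250 tweets, newest first."""
    ids = range(250, 0, -1)
    eligible = [i for i in ids if max_id is None or i <= max_id]
    return [{"id": i} for i in eligible[:count]]


tweets = collect_tweets(fake_fetch_page, target_count=230)
print(len(tweets), tweets[0]["id"], tweets[-1]["id"])
```

In the notebook, `fetch_page` could wrap `requests.get(search_url, headers=search_headers, params=...)`, adding `max_id` to `query_params` when it is not `None` and returning the `statuses` list from the JSON response. Rate limits still apply to each request.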

In [10]:
# print the text of the first result to verify the response
twitter_data = search_resp.json()

for x in twitter_data['statuses']:
    print(x['text'] + '\n')
    break  # stop after the first tweet; remove this line to print every returned tweet (up to 100)

RT @crampell: Florida Surgeon General Joseph Ladapo personally altered a state-driven study about Covid-19 vaccines last year to suggest th…



In [11]:
# move data into data frame 
df = pd.DataFrame(twitter_data['statuses'])

# show one record to verify import 
df.head(1)

Unnamed: 0,created_at,id,id_str,text,truncated,entities,metadata,source,in_reply_to_status_id,in_reply_to_status_id_str,...,retweet_count,favorite_count,favorited,retweeted,lang,quoted_status_id,quoted_status_id_str,extended_entities,possibly_sensitive,quoted_status
0,Tue Apr 25 18:48:19 +0000 2023,1650934779720908801,1650934779720908801,RT @crampell: Florida Surgeon General Joseph L...,False,"{'hashtags': [], 'symbols': [], 'user_mentions...","{'iso_language_code': 'en', 'result_type': 're...","<a href=""http://twitter.com/download/iphone"" r...",,,...,1578,0,False,False,en,,,,,
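The `created_at` column arrives as a string such as `Tue Apr 25 18:48:19 +0000 2023`, which sorts alphabetically rather than chronologically. A minimal sketch of parsing it into a timezone-aware datetime with the standard library follows. The helper name `parse_created_at` is an assumption, and the `%a` and `%b` directives assume an English locale.

```python
from datetime import datetime, timezone

# Format of the created_at field, e.g. "Tue Apr 25 18:48:19 +0000 2023".
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_created_at(value: str) -> datetime:
    """Parse a created_at string into a timezone-aware datetime."""
    return datetime.strptime(value, TWITTER_TIME_FORMAT)


parsed = parse_created_at("Tue Apr 25 18:48:19 +0000 2023")
print(parsed.isoformat())  # 2023-04-25T18:48:19+00:00
```

The same format string can be passed to `pd.to_datetime(df['created_at'], format=TWITTER_TIME_FORMAT)` to convert the whole column at once.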


In [2]:
import os
import boto3
import aws_s3  # local module (not shown) that holds the AWS keys

In [3]:
def create_s3_bucket(bucket_name, region=None):
    """Create an S3 bucket; region=None uses the client's default region."""
    s3 = boto3.client('s3', region_name=region)

    # us-east-1 is S3's default region, so no LocationConstraint is sent for it
    if region is None or region == 'us-east-1':
        s3.create_bucket(Bucket=bucket_name)
    else:
        s3.create_bucket(Bucket=bucket_name, CreateBucketConfiguration={'LocationConstraint': region})

    print(f"S3 bucket '{bucket_name}' created in region '{region}'.")

In [4]:
bucket_name = 'information-arch-dengyi-liu-assignment-lab3'  # bucket names must be globally unique
region = 'us-east-1'
create_s3_bucket(bucket_name, region)

S3 bucket 'information-arch-dengyi-liu-assignment-lab3' created in region 'us-east-1'.
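Running the creation cell a second time would call `create_bucket` again for a bucket that already exists. One way to make the step repeatable is to check with `head_bucket` first. The sketch below assumes a boto3-style client is passed in. The name `ensure_bucket` and the in-memory stand-in classes are hypothetical and exist only so the example runs without AWS credentials.

```python
def ensure_bucket(s3_client, bucket_name, region="us-east-1"):
    """Create bucket_name only if head_bucket reports it missing. Returns True if created."""
    try:
        s3_client.head_bucket(Bucket=bucket_name)
        return False  # bucket already exists and is accessible
    except Exception as exc:
        # boto3 raises botocore.exceptions.ClientError here, whose .response dict
        # carries the error code. Catching ClientError directly is the narrower
        # choice when botocore is imported.
        code = getattr(exc, "response", {}).get("Error", {}).get("Code")
        if code not in ("404", "NoSuchBucket"):
            raise  # e.g. 403: the name exists but belongs to another account

    if region is None or region == "us-east-1":
        s3_client.create_bucket(Bucket=bucket_name)
    else:
        s3_client.create_bucket(
            Bucket=bucket_name,
            CreateBucketConfiguration={"LocationConstraint": region},
        )
    return True


class NotFoundError(Exception):
    """Demo stand-in for a ClientError carrying a 404 code."""
    response = {"Error": {"Code": "404"}}


class InMemoryS3:
    """Hypothetical stand-in for boto3.client('s3'), used only to exercise the sketch."""

    def __init__(self):
        self.buckets = set()

    def head_bucket(self, Bucket):
        if Bucket not in self.buckets:
            raise NotFoundError(Bucket)

    def create_bucket(self, Bucket, **kwargs):
        self.buckets.add(Bucket)


client = InMemoryS3()
print(ensure_bucket(client, "example-bucket"))  # True: the bucket was created
print(ensure_bucket(client, "example-bucket"))  # False: the bucket already exists
```

With real credentials, the call would look like `ensure_bucket(boto3.client('s3', region_name=region), bucket_name, region)`.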


In [12]:
import time

# build the target path: s3://<bucket>/<prefix><timestamp>.csv
pathname = 's3://{}/'.format(bucket_name)  # bucket created above
filename = 'twitter_api'  # file name prefix, e.g. the name of your group
timestamp = time.strftime("%Y%m%d%H%M%S")  # local time, e.g. 20230425144827
filenames3 = "%s%s%s.csv" % (pathname, filename, timestamp)  # full S3 path of the csv file

# write the DataFrame straight to S3. pandas hands s3:// paths to s3fs (installed above),
# which uses botocore underneath. On pandas < 1.5 the keyword is line_terminator.
df.to_csv(filenames3, header=True, lineterminator='\n')

# print success message
print("Successfully uploaded file to location: " + filenames3)

Successfully uploaded file to location: s3://information-arch-dengyi-liu-assignment-lab3/twitter_api20230425144827.csv
