<img src='images/gesis.png' style='height: 60px; float: left'>
<img src='images/social_comquant.png' style='height: 50px; float: left; margin-left: 40px'>
<img src='images/isi.png' style='height: 50px; float: left; margin-left: 20px'>  

Authors = N. Gizem Bacaksizlar Turbic and Haiko Lietz

Date = 19 July 2022

# 1. Introduction

Data collection is the process of gathering and measuring information from all relevant sources so that it can be analyzed to produce accurate insights for research. Researchers evaluate their research questions and hypotheses on the basis of the collected data. In most cases, data collection is the first and most important step of research, irrespective of the field of study. The approach to data collection varies across fields of study, depending on the information required.

Easy access to technology has made social media platforms popular as communication tools and, therefore, as sources of data. With the rise of social media as a data source, collecting data through APIs has become an in-demand skill. In this session, we aim to teach how to collect data from social media platforms such as Twitter and Reddit.

# 2. Social Media Platforms for Data Harvesting through API

<img src="./images/database.png"  width="150" height = "150" align="right"/>

In order to access APIs, you first need to create an account and apply for a developer account on the platform that you want to work with. With this developer account, platforms provide you with keys (e.g., secret, public, or access keys) to authenticate with their system.

While web scraping is a common way of collecting data from websites, many websites offer APIs to access the public data that they host. One reason for this is to avoid unnecessary traffic on the websites.

However, even though we have access to these APIs, as researchers we should respect the API access rules and always read the documentation before collecting data.
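
One common way to respect access rules is to pause and retry when a request is rejected, instead of sending requests as fast as possible. The following is a minimal sketch of that idea and not part of any platform's API: `call_with_backoff` and its parameters are hypothetical names, and `request` stands for any function of your own that performs one API call.

```python
import time


def call_with_backoff(request, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call `request()` and retry with exponentially growing pauses on failure.

    `request` is any zero-argument callable (hypothetical here). `sleep` can be
    swapped out so the pacing logic can be tried without actually waiting.
    """
    for attempt in range(max_retries + 1):
        try:
            return request()
        except Exception:
            # In real code, catch only the rate-limit error of the library you use.
            if attempt == max_retries:
                raise  # give up after the last allowed retry
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
```

Some wrappers already handle this for you, so a helper like this is mainly useful when you send requests yourself.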




## 2.1. A demonstration using Python to collect data from Twitter 

Twitter is one of the most used social media platforms in academic research. This microblogging and social networking service hosts users who can post and interact with messages known as "tweets". Registered users can post, like, and retweet tweets, but unregistered users can only read those that are publicly available. As of 2022, Twitter has 436 million active users worldwide (Statista, 2022*). 

<img src="./images/twitter.png"  width="200" height = "200" align="left"/>

Different access options for different purposes:

- Twitter Developer: https://developer.twitter.com/
- APIs: https://developer.twitter.com/en/docs
- GNIP: http://support.gnip.com/apis/
- Twitter Enterprise: https://developer.twitter.com/en/enterprise

It is important to note that the free APIs cover only Tweets from the last 7 days; Premium APIs exist for 30-day search and beyond. If you have the Academic Research access level, you can access even more data with the full-archive search endpoint. API policies, such as functionalities and user agreements, change over time. Limitations on volume and functions should also be considered. 

Before we start with our first project on Twitter, you need to sign up for Twitter and then create a Developer account: 

- Sign up from [here.](https://help.twitter.com/en/using-twitter/create-twitter-account)
- Create a Developer Account from [here.](https://developer.twitter.com)


\* Statista (2022): https://www.statista.com/statistics/272014/global-social-networks-ranked-by-number-of-users/

In [1]:
''' 
    Let's get started with our first project of collecting Tweets.
    Import the libraries below if you have already installed them.
    If you have not installed them yet, install them with pip from your command prompt, or with !pip from your Jupyter notebook.
'''
# import relevant packages
import pandas as pd # data manipulation library
import datetime # human readable date formats
import tweepy as tw # wrapper around Twitter API 
# Please make sure you have installed all of these libraries!

In [2]:
# Enter your keys registered with Twitter
# Obtain the access token and access token secret
# These can be generated in your Developer Portal, under the “Keys and tokens” tab for your Developer App.

apikey = 'YOURapikey' #25 alphanumeric characters
apisecretkey = 'YOURapisecretkey'
accesstoken = 'YOURaccesstoken'
accesstokensecret = 'YOURaccesstokensecret'
bearertoken = 'YOURbearertoken'

<img src="./images/developer_portal.png"  width="500" height = "500" align="center"/>

In [3]:
# Let's say you are sharing your scripts with others and do not want to show your keys. What can you do?
# We can first create a simple Python script called keys.py in which we store all keys. 
# Save this script in the same folder as this notebook and import your keys from it.

from keys import *

# Make sure the variable names for the keys in the keys.py script are the same as the variable names used here.
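
As an alternative to a keys.py file, keys can also be read from environment variables, so that they never appear in a shared file. The following is a minimal sketch, assuming you have exported the variables yourself. The environment variable names (such as `TWITTER_API_KEY`) and the helper `load_keys` are hypothetical and are not defined by Twitter or Tweepy.

```python
import os

# Hypothetical environment variable names; use whatever names you exported.
KEY_NAMES = {
    "apikey": "TWITTER_API_KEY",
    "apisecretkey": "TWITTER_API_SECRET_KEY",
    "accesstoken": "TWITTER_ACCESS_TOKEN",
    "accesstokensecret": "TWITTER_ACCESS_TOKEN_SECRET",
    "bearertoken": "TWITTER_BEARER_TOKEN",
}


def load_keys(env=os.environ):
    """Return the keys as a dictionary, or fail early if one is missing."""
    missing = [var for var in KEY_NAMES.values() if var not in env]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: env[var] for name, var in KEY_NAMES.items()}
```

With the variables set, `keys = load_keys()` gives you, for example, `keys["apikey"]` to pass on to the authentication step.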

In [4]:
# Set up your access with search terms
auth = tw.OAuthHandler(apikey, apisecretkey)
auth.set_access_token(accesstoken, accesstokensecret)
api = tw.API(auth, wait_on_rate_limit = True)
search_words = "ComputationalSocialScience OR GESIS OR SocialComQuant" # Words should be changed according to your search.
# If you want to remove retweets, then include -filter:retweets to the search_words.
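
The search string can also be assembled from a list of terms, which is convenient when the terms change often. This is a small sketch with a hypothetical helper, `build_query`. It simply follows the comment above and appends `-filter:retweets`; how the filter combines with `OR` is best checked against Twitter's search operator documentation.

```python
def build_query(terms, exclude_retweets=False):
    """Join search terms with OR and optionally append the retweet filter."""
    query = " OR ".join(terms)
    if exclude_retweets:
        query += " -filter:retweets"
    return query


# Same terms as above, this time without retweets.
query_without_retweets = build_query(
    ["ComputationalSocialScience", "GESIS", "SocialComQuant"],
    exclude_retweets=True,
)
print(query_without_retweets)
```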

In [5]:
# Collect Tweets. Be aware that attribute names may change over time with new versions of the packages.
# For a standard search, we use Tweepy.
tweets = tw.Cursor(api.search_tweets,  q=search_words, lang="en").items() # Possible to limit the number of search items

In [6]:
# First, check which Twitter attributes you collected from this search, visit:
# https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/tweet
# Then, create a dataframe with the columns you might need for your analysis

tweet_details = [[ tweet.user.screen_name, tweet.user.id, tweet.id_str, 
                  tweet.created_at, tweet.text, tweet.user.profile_image_url, tweet.user.location] 
                  for tweet in tweets]
tweet_df = pd.DataFrame(data=tweet_details, 
                        columns = ["user_name","user_id", "tweet_id", "tweet_date","tweet","user_image",
                                   "user_location"])

# For instance, you can see the values of one specific column with a code like this: tweet_df['user_image'].values

# Save df
tweet_df.to_csv("./data/test_tweets.csv", index = False)
print(tweet_df.head())
print('---------------------------------------------')

# print the length of the dataset
print('The length of the dataframe:', len(tweet_df['tweet_id'].unique()))

         user_name              user_id             tweet_id  \
0      CompCommLab   798181898031943680  1519270689823498241   
1         emcr_sna  1491847980617547779  1519231234358095872   
2        gesis_org            145554242  1519224965555535873   
3  ReligionFESTHD1  1186731643505184768  1519220526677467136   
4        gesis_org            145554242  1519211980896317440   

                 tweet_date  \
0 2022-04-27 11:02:13+00:00   
1 2022-04-27 08:25:26+00:00   
2 2022-04-27 08:00:32+00:00   
3 2022-04-27 07:42:53+00:00   
4 2022-04-27 07:08:56+00:00   

                                               tweet  \
0  RT @clauwa: Want to join @gesis_org and @HHU_d...   
1  RT @clauwa: Want to join @gesis_org and @HHU_d...   
2  Lights! Camera! Action! Teach! A #Handbook for...   
3  RT @gesis_org: #stellenangebot #job #openposit...   
4  RT @trovdimi: In case you missed my talk on Ge...   

                                          user_image            user_location  
0  http://p

In [7]:
# Let's check the first five user images. You can open the links in your browser.
tweet_df.user_image.values[:5]

array(['http://pbs.twimg.com/profile_images/839049954144436225/iZvx4Nbr_normal.jpg',
       'http://pbs.twimg.com/profile_images/1496253854937141257/J4Xdl0YN_normal.jpg',
       'http://pbs.twimg.com/profile_images/2840291739/926f900a36e46987ff8ac10c060f2c07_normal.png',
       'http://pbs.twimg.com/profile_images/1471199487288913922/wcvkmu9V_normal.jpg',
       'http://pbs.twimg.com/profile_images/2840291739/926f900a36e46987ff8ac10c060f2c07_normal.png'],
      dtype=object)

In [8]:
# Twitter API v2 (if you have full access)
client = tw.Client(bearer_token=bearertoken) # bearertoken is the variable defined with your keys above

# Replace with your own search query
query = 'from:SocialComquant -is:retweet' # Tweets from this username, without retweets; replace the username with your own choice

# Replace with the start of the time period of your choice
start_time = '2021-01-01T00:00:00Z'

# Replace with the end of the time period of your choice
end_time = '2022-01-01T00:00:00Z'

In [9]:
# Check the start_time yourself by typing its name
start_time

'2021-01-01T00:00:00Z'
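
Timestamps in this format can also be generated from `datetime` objects rather than typed by hand, which helps when the time window is computed (for example, "the last 30 days"). This is a minimal sketch, assuming timezone-aware datetimes; `to_api_timestamp` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone


def to_api_timestamp(moment):
    """Format a timezone-aware datetime as 'YYYY-MM-DDTHH:MM:SSZ' in UTC."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# The same time window as above, built from datetime objects.
window_start = to_api_timestamp(datetime(2021, 1, 1, tzinfo=timezone.utc))
window_end = to_api_timestamp(datetime(2022, 1, 1, tzinfo=timezone.utc))
print(window_start, window_end)

# A window relative to a given moment, here 30 days before the end.
thirty_days_earlier = to_api_timestamp(
    datetime(2022, 1, 1, tzinfo=timezone.utc) - timedelta(days=30)
)
print(thirty_days_earlier)
```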

In [10]:
'''
# You can search Tweets from the last 7 days or all Tweets with different functions. Check available functions in Tweepy!
Tweepy: https://docs.tweepy.org/en/stable/client.html#search-tweets
# A helpful link for setting up your query: 
https://github.com/twitterdev/getting-started-with-the-twitter-api-v2-for-academic-research/blob/main/modules/5-how-to-write-search-queries.md
'''
# Connect to the Twitter API and search all Tweets if you have Academic Research access
tweets = client.search_all_tweets(query=query, tweet_fields=['created_at','text', 'context_annotations','entities'],
                                  start_time=start_time,
                                  end_time=end_time, max_results=10) #set your max results between 10 and 500



In [11]:
# Let's see a fairly new field for context annotations.
for tweet in tweets.data:
    print(tweet.created_at)
    print(tweet.context_annotations) #context annotations (https://developer.twitter.com/en/docs/twitter-api/annotations/overview)

2021-12-08 10:26:14+00:00
[{'domain': {'id': '30', 'name': 'Entities [Entity Service]', 'description': 'Entity Service top level domain, every item that is in Entity Service should be in this domain'}, 'entity': {'id': '848920371311001600', 'name': 'Technology', 'description': 'Technology and computing'}}, {'domain': {'id': '66', 'name': 'Interests and Hobbies Category', 'description': 'A grouping of interests and hobbies entities, like Novelty Food or Destinations'}, 'entity': {'id': '848921413196984320', 'name': 'Computer programming', 'description': 'Computer programming'}}, {'domain': {'id': '66', 'name': 'Interests and Hobbies Category', 'description': 'A grouping of interests and hobbies entities, like Novelty Food or Destinations'}, 'entity': {'id': '898673391980261376', 'name': 'Web development', 'description': 'Web Development'}}]
2021-11-24 14:10:43+00:00
[{'domain': {'id': '30', 'name': 'Entities [Entity Service]', 'description': 'Entity Service top level domain, every item 
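
Each context annotation is a dictionary with a `domain` and an `entity`, as the output above shows. The following sketch shows one way to reduce such a list to (domain, entity) name pairs and count the domains. It uses a shortened copy of the first printed Tweet's annotations (descriptions left out); `annotation_pairs` is a hypothetical helper, not a Tweepy function.

```python
from collections import Counter


def annotation_pairs(context_annotations):
    """Return (domain name, entity name) pairs from one Tweet's annotations."""
    return [
        (item["domain"]["name"], item["entity"]["name"])
        for item in (context_annotations or [])  # tolerate Tweets without annotations
    ]


# Shaped like the printed output above, with the descriptions omitted.
sample = [
    {"domain": {"id": "30", "name": "Entities [Entity Service]"},
     "entity": {"id": "848920371311001600", "name": "Technology"}},
    {"domain": {"id": "66", "name": "Interests and Hobbies Category"},
     "entity": {"id": "848921413196984320", "name": "Computer programming"}},
    {"domain": {"id": "66", "name": "Interests and Hobbies Category"},
     "entity": {"id": "898673391980261376", "name": "Web development"}},
]

pairs = annotation_pairs(sample)
domain_counts = Counter(domain for domain, _ in pairs)
print(pairs)
print(domain_counts)
```

In the loop above, you could call `annotation_pairs(tweet.context_annotations)` for every Tweet and collect the pairs in a dataframe.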

## 2.2. A demonstration using Python to collect Reddit comments <img src="./images/reddit.svg"  width="150" height = "150" align="right"/>

Reddit is one of the oldest social media platforms that is still actively generating content through its users. Millions of users create content on a daily basis in the form of questions and comments. Reddit also offers an API that makes it easy to access this vast amount of data.

The first thing you need is a Reddit account. You can create one [here.](https://www.reddit.com/)
- [Official Reddit API](https://www.reddit.com/dev/api/)
    - [Collecting Reddit data](https://towardsdatascience.com/scrape-reddit-data-using-python-and-google-bigquery-44180b579892)
    
Alternative ways of getting Reddit data:
- [Google BigQuery](https://cloud.google.com/bigquery) (GBQ)
    - [Scraping Reddit data with GBQ](https://towardsdatascience.com/scrape-reddit-data-using-python-and-google-bigquery-44180b579892)
- [Pushshift.io](https://medium.com/@RareLoot/using-pushshifts-api-to-extract-reddit-submissions-fb517b286563)

Next, decide which subreddit you would like to collect data from. Let's say "Computational Social Science", and be creative :)

The fields available from the Reddit API include title, score, url, id, number of comments, date of creation, and body text. 
Here, we will focus on getting the body text (comments) from the subreddit. Refer to the [PRAW documentation](https://praw.readthedocs.io/en/latest/code_overview/models/subreddit.html) for different kinds of implementations. 
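
A minimal sketch of collecting comments with PRAW could look as follows. It is an illustration under assumptions, not a tested pipeline: the helper names `make_reddit_client` and `collect_comments` are hypothetical, the credentials are placeholders that you obtain from a Reddit app registered with your account, and the subreddit name is a placeholder as well.

```python
from datetime import datetime, timezone


def make_reddit_client(client_id, client_secret, user_agent):
    """Create a read-only PRAW client (requires: pip install praw)."""
    import praw  # imported here so the rest of the sketch works without PRAW
    return praw.Reddit(client_id=client_id, client_secret=client_secret,
                       user_agent=user_agent)


def collect_comments(reddit, subreddit_name, limit=100):
    """Return the most recent comments of a subreddit as a list of dictionaries."""
    rows = []
    for comment in reddit.subreddit(subreddit_name).comments(limit=limit):
        rows.append({
            "comment_id": comment.id,
            # The author is None when the account has been deleted.
            "author": comment.author.name if comment.author else None,
            "created": datetime.fromtimestamp(comment.created_utc, tz=timezone.utc),
            "score": comment.score,
            "body": comment.body,
        })
    return rows


# Usage (placeholders, replace with your own values):
# reddit = make_reddit_client("YOURclientid", "YOURclientsecret", "YOURuseragent")
# comments = collect_comments(reddit, "YOURsubreddit", limit=50)
# comment_df = pd.DataFrame(comments)  # with pandas imported as pd
```

The rows are plain dictionaries so that they can be turned into a dataframe in the same way as the Tweets above.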

## 2.3. More APIs and precollected datasets 

<img src="./images/datasets.jpg" width="500" height = "900" align="left"/>  

- __More APIs__

    [Facebook for Developers](https://developers.facebook.com/)  
    [Facebook Ads API](https://developers.facebook.com/docs/marketing-apis/)  
    [Instagram Developer](https://developers.facebook.com/docs/instagram-basic-display-api)  
    [YouTube Developers](https://developers.google.com/youtube/)  
    [Weibo API](http://open.weibo.com/wiki/API%E6%96%87%E6%A1%A3/en)  
    [CrowdTangle](https://www.crowdtangle.com/request)  
    [4chan](https://github.com/4chan/4chan-API)  
    [Gab](https://github.com/a-tal/gab)  
    [Github REST API](https://docs.github.com/en/rest)  
    [Github GraphQL](https://docs.github.com/en/graphql)  
    [Stackoverflow](https://api.stackexchange.com/docs)  
    [Facepager](https://github.com/strohne/Facepager)  


- __Precollected datasets__  
    https://datasetsearch.research.google.com  
    https://www.kaggle.com/datasets  
    https://data.gesis.org/sharing/#!Search  


- __Locating or Requesting Social Media Data__
    https://www.programmableweb.com

# 3. Challenges

Facebook has completely closed down many of its APIs, and it is now very hard to get Facebook data other than through CrowdTangle or FB Ads.

Twitter's API is now at version 2, which brought substantial changes. 

These challenges require us to stay vigilant and to continuously update our code to keep up with the APIs.

- More on Social Media data collection and data quality:
https://www.slideshare.net/suchprettyeyes/working-with-socialmedia-data-ethics-good-practice-around-collecting-using-and-storing-data

# 4. References

Zenk-Möltgen, Wolfgang (GESIS - Leibniz Institute for the Social Sciences), Python Script to rehydrate Tweets from Tweet IDs https://doi.org/10.7802/1504

Pfeffer, Morstatter (2016): Geotagged Twitter posts from the United States: A tweet collection to investigate representativeness. Dataset. http://dx.doi.org/10.7802/1166

Do not miss the Social Comquant Workshop 10 at: https://github.com/strohne/autocol

- Useful links for getting started with Twitter API v2
    - [Comprehensive Guide for Using the Twitter API v2](https://dev.to/twitterdev/a-comprehensive-guide-for-using-the-twitter-api-v2-using-tweepy-in-python-15d9#:~:text=Tweepy%20is%20a%20popular%20package,the%20academic%20research%20product%20track)
    - [Step by Step Guide to Making Your First Request to the Twitter API v2](https://developer.twitter.com/en/docs/tutorials/step-by-step-guide-to-making-your-first-request-to-the-twitter-api-v2)
    - [Getting Started with Data Collection Using Twitter API v2](https://towardsdatascience.com/getting-started-with-data-collection-using-twitter-api-v2-in-less-than-an-hour-600fbd5b5558#39c4)
    - [An Extensive Guide to Collecting Tweets from Twitter API v2 for Academic Research Using Python 3](https://towardsdatascience.com/an-extensive-guide-to-collecting-tweets-from-twitter-api-v2-for-academic-research-using-python-3-518fcb71df2a)
    - [What Python package is best for getting data from Twitter](https://towardsdatascience.com/what-python-package-is-best-for-getting-data-from-twitter-comparing-tweepy-and-twint-f481005eccc9)

- Useful links for getting started with Reddit API
    - https://www.reddit.com/r/TheoryOfReddit/wiki/collecting_data/
    - https://towardsdatascience.com/scrape-reddit-data-using-python-and-google-bigquery-44180b579892
    - https://github.com/akhilesh-reddy/Cable-cord-cutter-Sentiment-analysis-using-Reddit-data
    
<a href="https://www.flaticon.com/free-icons/database" title="database icons">Database icons created by Smashicons - Flaticon</a>

<a href="https://de.freepik.com/vektoren/logo">Logo vector created by rawpixel.com - de.freepik.com</a>

<a href="http://www.freepik.com">Designed by stories / Freepik</a>



### Note: Alternative Ways for Twitter Academic API or Premium Account

The search function requires both an environment label and a query argument. Label your application on the Twitter Developer page: https://developer.twitter.com/en/account/environments

You can optionally add the fromDate and toDate fields to filter search results by time.

The format of the dates should be "YYYYMMDDHHMM".
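
Date strings in this format can be produced from `datetime` objects with `strftime`. This is a small sketch; `to_premium_date` is a hypothetical helper name, and whether the times are interpreted as UTC should be checked in the Premium search documentation.

```python
from datetime import datetime


def to_premium_date(moment):
    """Format a datetime as 'YYYYMMDDHHMM' (minute granularity)."""
    return moment.strftime("%Y%m%d%H%M")


from_date = to_premium_date(datetime(2022, 2, 20, 10, 0))
to_date = to_premium_date(datetime(2022, 3, 1, 0, 0))
print(from_date, to_date)  # 202202201000 202203010000
```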

tweets_month = api.search_30_day(label='teaching', query=search_words, 
                                 fromDate="202202201000", toDate="202203010000")

Now, you can dump your results into JSON format (*don't forget to import json*): print(json.dumps(tweets_month[0]._json, indent=4, sort_keys=True))
                                 
For further interest, visit: https://towardsdatascience.com/how-to-use-twitter-premium-search-apis-for-mining-tweets-2705bbaddca

Also, there is another library called Twarc2 to explore for further data collection with Twitter v2 API:
https://twarc-project.readthedocs.io/en/latest/api/client2/

An academic research product:
https://github.com/twitterdev/getting-started-with-the-twitter-api-v2-for-academic-research/blob/main/modules/6a-labs-code-academic-python.md

A standard product: 
https://github.com/twitterdev/getting-started-with-the-twitter-api-v2-for-academic-research/blob/main/modules/6b-labs-code-standard-python.md