<img src='images/gesis.png' style='height: 60px; float: left'>
<img src='images/social_comquant.png' style='height: 50px; float: left; margin-left: 40px'>
<img src='images/isi.png' style='height: 50px; float: left; margin-left: 20px'>  

Authors = N. Gizem Bacaksizlar Turbic and Haiko Lietz

Date = 19 July 2022

## Introduction to Computational Social Science methods with Python

# Session 3: API Harvesting

Data collection is a procedure of gathering information from subjects (all relevant sources), measuring and analyzing accurate insights for research using various techniques. Researchers can evaluate their research questions and hypotheses on the basis of collected data. In most cases, data collection is the primary and most important step for research, irrespective of the field of study. The approach of data collection varies for different fields of study, depending on the required information.

The ease of access to the technology has made various social media platforms more popular as communication tools, therefore as a source of data. With this rise of social media use as a data source, data collection using APIs has become a demanding skill. Here, in this session, we aim to teach how to collect data from various social media platforms, such as Twitter and Reddit.

## 3.1. Social Media Platforms for Data Harvesting through API

<img src="./images/database.png"  width="150" height = "150" align="right"/>

In order to access APIs, you first need to create an account and apply to have a developer account on the platform that you want to work on. With this developer account, platforms provide you KEYS (e.g., secret, public, or access) to authenticate their system.

While web scraping is one of the common ways of collecting data from websites, a lot of websites offer APIs to access the public data that they host on their website. This is to avoid unnecessary traffic on the websites.

However, even though we have access to these APIs, as researchers, we should not forget to respect API access rules and always read the documents before collecting data.




## 3.2. A demonstration using Python to collect data from Twitter 

Twitter is one of the most used social media platforms in the academic research. This microblogging and social networking service hosts users who can post and interact with messages known as "tweets". Registered users can post, like, and retweet tweets, but unregistered users can only read those that are publicly available. As of 2022, Twitter has 436 million active users worldwide (Statista, 2022*). 

<img src="./images/twitter.png"  width="200" height = "200" align="left"/>

Different access options for different purposes:

- Twitter Developer: https://developer.twitter.com/
- APIs: https://developer.twitter.com/en/docs
- GNIP: http://support.gnip.com/apis/
- Twitter Enterprise: https://developer.twitter.com/en/enterprise

IMPORTANT to note that free APIs cover 7 days Tweets; Premium APIs exist for 30-day search and beyond. If you have an Academic Research access level, you can access even more data with full-archive search endpoint. There are changes to APIs policies over time, such as functionalities and user agreements. Also, limitations on volume and functions should be considered. 

Before we start with our first project on Twitter, first you need to sign up for Twitter and then, create a Developer account: 

- Sign up from [here.](https://help.twitter.com/en/using-twitter/create-twitter-account)
- Create a Developer Account from [here.](https://developer.twitter.com)


**https://www.statista.com/statistics/272014/global-social-networks-ranked-by-number-of-users/*

### 3.2.1 Getting started

In this section, we will begin with our first project of collecting tweets. Import the following libraries if you have already installed them. If you have not, install them using pip in your command prompt, or using !pip in your jupyter notebook.

We will be using `datetime` library for working with human readable date formats, and `tweepy` as the python wrapper of the twitter API.

In [None]:
import pandas as pd
import datetime
import tweepy as tw

You will have to use your twitter keys registered with your account. Create an account if you don't have one yet. Then obtain the access token and access token secret. They can be generated in your developer portal [here](https://developer.twitter.com/en/portal/dashboard), under "Keys and tokens" tab of your developer app.

After getting them ready, use the following variables to save them for further steps:

In [None]:
apikey = 'YOURapikey' #25 alphanumeric characters
apisecretkey = 'YOURapisecretkey'
accesstoken = 'YOURaccesstoken'
accesstokensecret = 'YOURaccesstokensecret'
bearertoken = 'YOURbearertoken'

<img src="./images/developer_portal.png"  width="500" height = "500" align="center"/>

If you are sharing your scripts with other people and want to keep your keys secret, you can follow the steps below instead of assigning your keys in the above 5 variables:

- Create a simple python script called `keys.py`
- Store all passwords the way you did in the notebook already, with the same names
- Save the script in the same folder as this notebook's
- Import the keys like the following:

In [1]:
from keys import *

The next step is setting up your access to the API:

In [None]:
auth = tw.OAuthHandler(apikey, apisecretkey)
auth.set_access_token(accesstoken, accesstokensecret)
api = tw.API(auth, wait_on_rate_limit = True)

### 3.2.2 Retrieving tweets with keyword search

Now we want to retrieve the tweets that contain certain words. Let's say we want to get the ones that contain *at least* one of the words **ComputationalSocialScience**, **GESIS** or **SocialComQuant**. We need to save them in a string, seperated with `OR`s like the following (You can try with any other search terms of your own choice):

In [None]:
search_words = "ComputationalSocialScience OR GESIS OR SocialComQuant"

If you want to remove retweets from your search results, you can include `-filter:retweets` in the `search_words` string.

Now we collect the desired tweets like this:

In [None]:
tweets = tw.Cursor(api.search_tweets,  q=search_words, lang="en").items()

**Note**: Be aware of the attribute names from the new version of the packages, they may change in time.

You can pass a number as an argument to the `.items()` at the end of the line to limit the number of search results.

The tweets that are now kept in the `tweets` object above contain a lot of information, in the form of dictionaries. You can check an overview of this information [here](https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/tweet).

We will take a look at some of this information and store it in a dataframe:

In [None]:
tweet_details = [[tweet.user.screen_name, tweet.user.id, tweet.id_str, 
                  tweet.created_at, tweet.text, tweet.user.profile_image_url, tweet.user.location] 
                  for tweet in tweets]

tweets_df = pd.DataFrame(data=tweet_details, 
                        columns = ["user_name","user_id", "tweet_id", "tweet_date","tweet","user_image",
                                   "user_location"])

tweets_df

Here is some information about the data each column keeps:

- user_name: The username of the user that has tweeted the desired tweet.
- user_id: Each user has a unique ID, this columns keeps those IDs.
- tweet_id: Each tweets has also a unique ID.
- tweet_date: The date that the tweet has been posted.
- tweet: The text of the tweet
- user_image: The profile photo of the user
- user_location: The location of the user

You can store the dataframe for later access, if you need to:

In [None]:
tweets_df.to_csv("./data/test_tweets.csv", index = False)

You can also get the number of the tweets retrieved:

In [None]:
print('The length of the dataframe:', len(tweets_df['tweet_id'].unique()))

To access the user photos, you can simply search the links in your browser:

In [None]:
tweets_df.user_image.values[:5]

### 3.2.3 Retrieving users' information

Let's say you have a list of users. This can be a list of IDs like the ones we have in the `user_id` column of the dataframe in the previous section, or it can be a list of usernames as shown on the screen for every user in twitter (Like the ones we have in the `user_name` column of the dataframe in the previous section).

We can access the infromation like profile description or profile photo of any of these users. As an example, take the following list of users. It contains the first 10 unique user IDs of the dataframe in the previous section:

In [None]:
users_ids = list(tweets_df['user_id'][:50].unique())
users_ids

We can get their profile information using `get_user()` and store it in the `information`:

In [None]:
information = []

for i in users_ids:
    user = api.get_user(user_id = i)
    information.append(user)

Next, we can extract information like users' location, profile description, profile photo, etc and make a dataframe to keep them:

In [None]:
names = []
locations = []
descriptions = []
profile_pics = []
background_pics = []
friends = []
followers = []

for i in information:   
    names.append(i.name)
    locations.append(i.location)
    descriptions.append(i.description)
    profile_pics.append(i.profile_image_url)
    background_pics.append(i.profile_background_image_url)
    friends.append(i.friends_count)
    followers.append(i.followers_count)
    
users_df = pd.DataFrame({'ID': users_ids, "name": names, "location": locations, "description": descriptions, "profile picture": profile_pics, "background picture": background_pics,
                                   "friends": friends, "followers": followers})

users_df

If, instead of list of users IDs, you have a list of usernames, you can still get the information above. You just need to change the argument name to `screen_name` in `get_user()` function.

For example, let's say your list of usernames is something like this:

In [None]:
usernames = list(tweets_df['user_name'][:50].unique())
usernames

It can be done like this:

In [None]:
information = []

for i in usernames:
    user = api.get_user(screen_name = i)
    information.append(user)

The rest is the same.

### 3.2.4 Rehydrating tweets

In research, sharing large tweet data sets is done through sharing tweet identifiers, since Twitter Terms of Service does not allow researchers to share the full tweets data. In order to get the tweets used in a research work, we need to retrieve/reconstruct the tweets data using those tweet identifiers. This is called hydrating/rehydrating tweets.

Since some of the tweets used in a research work migh have been deleted in time, we may not be able to access every single tweet used at the time when that research work has been done. We will see about that in more details later in this section.

In order to rehydrate tweets, we will be using Twarc library, which is a python wrapper for twitter API. You can install it with `pip`.

In [2]:
from twarc import Twarc2, expansions

We will rehydrate tweets from [this](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0237073) paper for our teaching purposes here. Download the `2021.csv` data set from [this](https://figshare.com/articles/dataset/The_Twitter_Parliamentarian_Database/10120685) link and read it like the following:

In [3]:
import pandas as pd

data = pd.read_csv('2021.csv', header = None)
data.columns = ['country', 'party', 'author', 'author_id', 'district','date','tweet_id']

  data = pd.read_csv('2021.csv', header = None)


In [4]:
data.head()

Unnamed: 0,country,party,author,author_id,district,date,tweet_id
0,United States,Republican,thom tillis,2964174789,,2021-01-01 05:01:00,1344871222073458690
1,United States,Republican,thom tillis,2964174789,,2021-01-03 22:19:37,1345857376155545601
2,United States,Republican,thom tillis,2964174789,,2021-01-04 19:31:13,1346177383225815052
3,United States,Republican,thom tillis,2964174789,,2021-01-05 19:46:42,1346543667570495488
4,United States,Republican,thom tillis,2964174789,,2021-01-06 17:29:55,1346871631101259776


We will take the tweets for Turkey and keep their IDs to rehydrate. We'll try rehydrating a random sample of 1000 of them:

In [5]:
turkey = data[data['country'] == 'Turkey']

tweet_ids = list(turkey.sample(1000, random_state = 2023)['tweet_id'])

We will use the following `rehydrate` [function](https://github.com/twitterdev/getting-started-with-the-twitter-api-v2-for-academic-research/blob/main/modules/6a-labs-code-academic-python.md) to rehydrate the tweets and keep their data in the `tweets` list:

In [6]:
import json

# Use your bearer token here
client = Twarc2(bearer_token=bearer_token)

tweets = []

def rehydrate(ids: list):
    # List of Tweet IDs you want to lookup
    tweet_ids = ids
    # The tweet_lookup function from twarc 
    lookup = client.tweet_lookup(tweet_ids=tweet_ids)
    for page in lookup:
        # The Twitter API v2 returns the Tweet information and the user, media etc.  separately
        # so we use expansions.flatten to get all the information in a single JSON
        result = expansions.flatten(page)
        for tweet in result:
            tweets.append(tweet)

Running the function for the 1000 tweet IDs takes around 30 seconds, since twarc sends a GET request for 100 tweet IDs every 3 seconds. More on twitter rate limits [here](https://developer.twitter.com/en/docs/twitter-api/rate-limits).

In [7]:
rehydrate(tweet_ids)

To see the information returned for each tweet ID, we can check the first item in `tweets` list:

In [8]:
tweets[0]

{'possibly_sensitive': False,
 'referenced_tweets': [{'type': 'replied_to',
   'id': '1367480542002876420',
   'possibly_sensitive': False,
   'text': '1- TBMM ‘ne dün itibari ile gelen fezlekelerden birinin de bana ait olduğunu öğrenmiş bulunmaktayım.\nHiç vakit kaybedilmeden ve bekletilmeden dokunulmazlığımın kaldırılması ile ilgili dilekçemi bugün itibari ile TBMM başkanlığına vermiş bulunmaktayım. https://t.co/bUXK4Wygi6',
   'edit_history_tweet_ids': ['1367480542002876420'],
   'conversation_id': '1367480542002876420',
   'author_id': '215618996',
   'created_at': '2021-03-04T14:22:22.000Z',
   'reply_settings': 'everyone',
   'attachments': {'media_keys': ['3_1367480531148042248'], 'media': [{}]},
   'lang': 'tr',
   'edit_controls': {'edits_remaining': 5,
    'is_edit_eligible': True,
    'editable_until': '2021-03-04T14:52:22.000Z'},
   'entities': {'urls': [{'start': 252,
      'end': 275,
      'url': 'https://t.co/bUXK4Wygi6',
      'expanded_url': 'https://twitter.com/lutfi

We can also check to see how many tweets could have been rehydrated from the IDs:

In [9]:
len(tweets)

923

As you can see, only 92 percent of tweets could have been rehydrated; others are not available anymore.

In [10]:
tweets[0]

{'possibly_sensitive': False,
 'referenced_tweets': [{'type': 'replied_to',
   'id': '1367480542002876420',
   'possibly_sensitive': False,
   'text': '1- TBMM ‘ne dün itibari ile gelen fezlekelerden birinin de bana ait olduğunu öğrenmiş bulunmaktayım.\nHiç vakit kaybedilmeden ve bekletilmeden dokunulmazlığımın kaldırılması ile ilgili dilekçemi bugün itibari ile TBMM başkanlığına vermiş bulunmaktayım. https://t.co/bUXK4Wygi6',
   'edit_history_tweet_ids': ['1367480542002876420'],
   'conversation_id': '1367480542002876420',
   'author_id': '215618996',
   'created_at': '2021-03-04T14:22:22.000Z',
   'reply_settings': 'everyone',
   'attachments': {'media_keys': ['3_1367480531148042248'], 'media': [{}]},
   'lang': 'tr',
   'edit_controls': {'edits_remaining': 5,
    'is_edit_eligible': True,
    'editable_until': '2021-03-04T14:52:22.000Z'},
   'entities': {'urls': [{'start': 252,
      'end': 275,
      'url': 'https://t.co/bUXK4Wygi6',
      'expanded_url': 'https://twitter.com/lutfi

We can show some of the useful information of tweets in a dataframe:

In [11]:
author_id = []
created_at = []
text = []
reply_count = []
like_count = []

for i in tweets:
    
    author_id.append(i['author_id'])
    created_at.append(i['created_at'])
    text.append(i['text'])
    reply_count.append(i['public_metrics']['reply_count'])
    like_count.append(i['public_metrics']['like_count'])
    
tweets_df = pd.DataFrame(data=[tweet_ids, author_id, created_at, text, reply_count, like_count]).transpose()

tweets_df.columns = ["tweet id","author id", "created at", "text", "reply count","like count"]

tweets_df.head()

Unnamed: 0,tweet id,author id,created at,text,reply count,like count
0,1367480557412769793,215618996,2021-03-04T14:22:26.000Z,2-2010 yılında Dörtyol ilçemizde 4 polis memur...,5,140
1,1375139722712977414,228446708,2021-03-25T17:37:13.000Z,"AK Parti MKYK Üyemiz, Genel Merkez Teşkilat Ba...",20,733
2,1367095084571852801,4472008409,2021-03-03T12:50:42.000Z,"İP’in Başkanı, yalanı bırak, tezviratı geç; aç...",45,1234
3,1363355151914983424,601912159,2021-07-27T17:29:45.000Z,#IBANGönderKurz https://t.co/WWD9n5zBS0,17,422
4,1420073937552252929,2161571388,2021-04-17T07:07:25.000Z,"RT @AKKADINGM: 81 İlde ""Kadın Emeği Türkiye'ni...",0,0


### 3.2.5 Getting users info

Consider the `turkey` dataframe that we created from the `2021.csv` data in the previous section. We want to get the profile information of a fraction of the politicians whose tweets are in that dataframe. First, we get the unique politicians in the data:

In [12]:
# Getting all IDs in the dataframe 
author_ids = turkey['author_id']

# Getting unique IDs
unique_ids = list(author_ids.unique())

With the following `get_user()` [function](https://github.com/twitterdev/getting-started-with-the-twitter-api-v2-for-academic-research/blob/main/modules/6a-labs-code-academic-python.md), we can get the users' information based on their IDs (It's a bit like tweet rehydration, but we use users' IDs this time), and save them in the `users_list` list.

In [13]:
from twarc import Twarc2, expansions
import json

users_list = []

# Replace your bearer token below
client = Twarc2(bearer_token=bearer_token)

def get_user(ids):
    # List of user IDs to lookup, add the ones you would like to lookup
    users = ids
    # The user_lookup function gets the hydrated user information for specified users
    lookup = client.user_lookup(users=users)
    for page in lookup:
        result = expansions.flatten(page)
        for user in result:
            # Here we are printing the full Tweet object JSON to the console
            users_list.append(user)

In [14]:
import random

some_ids = random.sample(unique_ids, 50)

get_user(some_ids)

Now we can make a dataframe and put some of this useful information of users' in it:

In [15]:
username = []
screen_name = []
profile_pic = []
followers = []
followings = []

for i in users_list:
    
    username.append(i['username'])
    screen_name.append(i['name'])
    profile_pic.append(i['profile_image_url'])
    followers.append(i['public_metrics']['followers_count'])
    followings.append(i['public_metrics']['following_count'])
    
users_df = pd.DataFrame(data=[some_ids, username, screen_name, profile_pic, followers, followings]).transpose()

users_df.columns = ["user id", "user name","screen name", "profile picture", "followers","followings"]

users_df.head()

Unnamed: 0,user id,user name,screen name,profile picture,followers,followings
0,564689563,DursunATAS38,Dursun ATAŞ,https://pbs.twimg.com/profile_images/127109640...,40815,773
1,145254257,ABabuscu,Aziz Babuşcu,https://pbs.twimg.com/profile_images/564473142...,225492,390
2,999759826518528000,mugokgoz32,Mehmet Uğur Gökgöz,https://pbs.twimg.com/profile_images/134828553...,19653,428
3,3710390297,yusufbeyazit60,Av. Yusuf Beyazıt,https://pbs.twimg.com/profile_images/148886252...,26924,334
4,281656898,Belmasatir,Av.M. Belma Satır 🇹🇷,https://pbs.twimg.com/profile_images/138699180...,147758,1414


### 3.2.6 Keyword search limited to a time window

We can use the the `search_all()` function of twarc to search for tweets in any time window of our choice. The following `search()` [function](https://github.com/twitterdev/getting-started-with-the-twitter-api-v2-for-academic-research/blob/main/modules/6a-labs-code-academic-python.md) looks for any tweet containing the query it takes, limited to a beginning and end time, and saves them into the `tweets` list.


You can find more information on writing search queries [here](https://github.com/twitterdev/getting-started-with-the-twitter-api-v2-for-academic-research/blob/main/modules/5-how-to-write-search-queries.md).

In [16]:
from twarc import Twarc2, expansions
import datetime
import json


# Replace your bearer token below
client = Twarc2(bearer_token=bearer_token)

def search(beginning, end, q):
    
    # Specify the start time in UTC for the time period you want Tweets from
    start_time = beginning

    # Specify the end time in UTC for the time period you want Tweets from
    end_time = end

    # This is where we specify our query
    query = q

    # The search_all method call the full-archive search endpoint to get Tweets based on the query, start and end times
    search_results = client.search_all(query=query, start_time=beginning, end_time=end, max_results=100)

    # Twarc returns all Tweets for the criteria set above, so we page through the results
    for page in search_results:
        # The Twitter API v2 returns the Tweet information and the user, media etc.  separately
        # so we use expansions.flatten to get all the information in a single JSON
        result = expansions.flatten(page)
        for tweet in result:
            # Here we are printing the full Tweet object JSON to the console
            tweets.append(tweet)

In [17]:
tweets = []

# Beginning time for the time window
beginning = datetime.datetime(2023, 1, 5, 0, 0, 0, 0, datetime.timezone.utc)

# End time for the time window
end = datetime.datetime(2023, 1, 8, 0, 0, 0, 0, datetime.timezone.utc)

# The query for searching
q = "ComputationalSocialScience"

search (beginning, end, q)

You can take a look at the number of retrieved tweets and the first tweet like this:

In [18]:
# Number of retrieved tweets
len(tweets)

15

In [19]:
# The information available for each tweet
tweets[0].keys()

dict_keys(['created_at', 'conversation_id', 'edit_controls', 'edit_history_tweet_ids', 'text', 'lang', 'entities', 'referenced_tweets', 'id', 'possibly_sensitive', 'public_metrics', 'author_id', 'reply_settings', 'author', '__twarc'])

In [20]:
# The overall information of the first tweet 
tweets[0]

{'created_at': '2023-01-06T21:18:39.000Z',
 'conversation_id': '1611472338427744256',
 'edit_controls': {'edits_remaining': 5,
  'is_edit_eligible': False,
  'editable_until': '2023-01-06T21:48:39.000Z'},
 'edit_history_tweet_ids': ['1611472338427744256'],
 'text': "RT @SzassTam: Hesaplamalı sosyal bilimler (#ComputationalSocialScience) Türkiye'de büyüyen bir alan. \n\n@MerihAngin @BalcSoy @uzay00 @Gundog…",
 'lang': 'tr',
 'entities': {'mentions': [{'start': 3,
    'end': 12,
    'username': 'SzassTam',
    'id': '35683594',
    'verified': False,
    'entities': {'url': {'urls': [{'start': 0,
        'end': 23,
        'url': 'https://t.co/8g443GTFql',
        'expanded_url': 'http://www.staff.science.uu.nl/~salah006/',
        'display_url': 'staff.science.uu.nl/~salah006/'}]},
     'description': {'mentions': [{'start': 38,
        'end': 45,
        'username': 'uubeta'},
       {'start': 46, 'end': 60, 'username': 'Bogazici_CmpE'},
       {'start': 61, 'end': 77, 'username': 'dat

We can show some of the useful information of tweets in a dataframe:

In [21]:
tweet_id = []
author_id = []
created_at = []
text = []
lang = []

for i in tweets:
    
    tweet_id.append(i['id'])
    author_id.append(i['author_id'])
    created_at.append(i['created_at'])
    text.append(i['text'])
    lang.append(i['lang'])
    
search_df = pd.DataFrame(data=[tweet_id, author_id, created_at, text, lang,]).transpose()

search_df.columns = ["tweet id","author id", "created at", "text", "language"]

search_df.head()

Unnamed: 0,tweet id,author id,created at,text,language
0,1611472338427744256,1173993379698552832,2023-01-06T21:18:39.000Z,RT @SzassTam: Hesaplamalı sosyal bilimler (#Co...,tr
1,1611412573605466114,14519511,2023-01-06T17:21:10.000Z,RT @SzassTam: Hesaplamalı sosyal bilimler (#Co...,tr
2,1611406068873052178,3377132271,2023-01-06T16:55:19.000Z,RT @SzassTam: Hesaplamalı sosyal bilimler (#Co...,tr
3,1611333811354181633,2400010513,2023-01-06T12:08:12.000Z,RT @SzassTam: Hesaplamalı sosyal bilimler (#Co...,tr
4,1611295709545783296,1474656871,2023-01-06T09:36:48.000Z,RT @SzassTam: Hesaplamalı sosyal bilimler (#Co...,tr


##### *Some other things from the initial notebook by Gizem:*

In [None]:
# Twitter API v2 (if you have a full access)
client = tw.Client(bearer_token=bearer_token)

# Replace with your own search query
query = 'from:SocialComquant -is:retweet' # you can change from with your own choice of username (without retweets)

# Replace with time period of your choice
start_time = '2021-01-01T00:00:00Z'

# Replace with time period of your choice
end_time = '2022-01-01T00:00:00Z'

In [None]:
# Check the start_time by yourself with writing
start_time

In [None]:
'''
# You can search Tweets from the last 7 days or all Tweets with different functions. Check available functions in Tweepy!
Tweepy: https://docs.tweepy.org/en/stable/client.html#search-tweets
# A helpful link for setting up your query: 
https://github.com/twitterdev/getting-started-with-the-twitter-api-v2-for-academic-research/blob/main/modules/5-how-to-write-search-queries.md
'''
# Connect to Twitter API and search all tweets if you have a full academic access
tweets = client.search_all_tweets(query=query, tweet_fields=['created_at','text', 'context_annotations','entities'],
                                  start_time=start_time,
                                  end_time=end_time, max_results=10) #set your max results between 10 and 500



In [None]:
# Let's see a fairly new field for context annotations.
for tweet in tweets.data:
    print(tweet.created_at)
    print(tweet.context_annotations) #context annotations (https://developer.twitter.com/en/docs/twitter-api/annotations/overview)

## 3.3. A demonstration using Python to collect Reddit comments <img src="./images/reddit.svg"  width="150" height = "150" align="right"/>

Reddit is one of the oldest social media platforms which is still generating content with its users. Millions of users are creating on a daily basis in the form of questions and comments. Reddit also offers such API which is easy to access this vast amount of data.

First thing you need to do is to have a Reddit account. You should create it from [here.](https://www.reddit.com/)
- [Official Reddit API](https://www.reddit.com/dev/api/)
    - [Collecting Reddit data](https://towardsdatascience.com/scrape-reddit-data-using-python-and-google-bigquery-44180b579892)
    
Alternative ways of getting Reddit data:
- [Google BigQuery](https://cloud.google.com/bigquery) (GBQ)
    - [Scraping Reddit data with GBQ](https://towardsdatascience.com/scrape-reddit-data-using-python-and-google-bigquery-44180b579892)
- [Pushshift.io](https://medium.com/@RareLoot/using-pushshifts-api-to-extract-reddit-submissions-fb517b286563)

We need to decide which subreddit you would like to focus on getting the data: Let's say "Computational Social Science" and be creative :)

title, score, url, id, number of comments, date of creation, body text are the fields that are available from Reddit API. 
Here, we will focus on getting the bodytext(comments) from the subreddit. Refer to [praw documentation](https://praw.readthedocs.io/en/latest/code_overview/models/subreddit.html) for different kinds of implementations. 

## 3.4. More APIs and precollected datasets 

<img src="./images/datasets.jpg" width="500" height = "900" align="left"/>  

- __More APIs__

    [Facebook for Developers](https://developers.facebook.com/)  
    [Facebook Ads API](https://developers.facebook.com/docs/marketing-apis/)  
    [Instagram Developer](https://developers.facebook.com/docs/instagram-basic-display-api)  
    [YouTube Developers](https://developers.google.com/youtube/)  
    [Weibo API](http://open.weibo.com/wiki/API%E6%96%87%E6%A1%A3/en)  
    [CrowdTangle](https://www.crowdtangle.com/request)  
    [4chan](https://github.com/4chan/4chan-API)  
    [Gab](https://github.com/a-tal/gab)  
    [Github REST API](https://docs.github.com/en/rest)  
    [Github GraphQL](https://docs.github.com/en/graphql)  
    [Stackoverflow](https://api.stackexchange.com/docs)  
    [Facepager](https://github.com/strohne/Facepager)  


- __Precollected datasets__  
    https://datasetsearch.research.google.com  
    https://www.kaggle.com/datasets  
    https://data.gesis.org/sharing/#!Search  


- __Locating or Requesting Social Media Data__
    https://www.programmableweb.com

## 3.5. Data harvesting from Wikipedia through API

<img src='images/wikipedia_logo.png' style='height: 190px; float: right; margin-left: 50px' >

Wikipedia is a rich source of data for social science research. Although we can access its data through other techniques like web scraping, there are also useful APIs that could ease collecting data from the website.

Since Wikipedia is built on [MediaWiki](https://en.wikipedia.org/wiki/MediaWiki), we will be using python wrappers written for its API, [Mediawiki Action API](https://www.mediawiki.org/wiki/API:Main_page). Each of these wrappers provide some useful methods, and we will try to go through the ones that are the most important to our data collection tasks.

We will also introduce two useful parsers for the Wikipedia markup language, and will see how they could be used for extracting clean data from the raw markup code.

### 3.5.1 Wikipedia library

https://wikipedia.readthedocs.io/en/latest/code.html#api

Installation and importing: 

In [None]:
pip install wikipedia

In [None]:
import wikipedia

Searching a query:

In [None]:
wikipedia.search("Barack")

In [None]:
wikipedia.suggest("Barak Obama")

Fewer or more results with a specific number:

In [None]:
wikipedia.search("Ford", results=3)

Getting the summary of an article:

In [None]:
wikipedia.summary("Barack Obama")

In [None]:
wikipedia.summary("Barack Obama", sentences=1)

wikipedia.summary will raise a DisambiguationError if the page is a disambiguation page, or a PageError if the page doesn’t exist (although by default, it tries to find the page you meant with suggest and search.)

In [None]:
wikipedia.summary("Mercury")

In [None]:
try:
    mercury = wikipedia.summary("Mercury")
except wikipedia.exceptions.DisambiguationError as e:
    print (e.options)

wikipedia.page enables you to load and access data from full Wikipedia pages. Initialize with a page title (keep in mind the errors listed above), and then access most properties using property methods:

In [None]:
bo = wikipedia.page("Barack Obama")

Getting the title of the page:

In [None]:
bo.title

Getting the url of the page:

In [None]:
bo.url

Getting the full text of the page:

In [None]:
bo.content

Getting the images of the page:

In [None]:
bo.images[0:5]

Getting the links in the page:

In [None]:
bo.links[:10]

To change the language of the Wikipedia you are accessing, use wikipedia.set_lang. Remember to search for page titles in the language that you have set, not English:

In [None]:
wikipedia.set_lang("fr")

In [None]:
wikipedia.summary("Francois Hollande")

List of URLs of the external links:

In [None]:
bo.references[:10]

Getting the plain text content of a section in the page:

In [None]:
bo.section('Early life and career')

List of section titles: an example of a bug!

In [None]:
bo.sections

### 3.5.2 Pywikibot & parsers

https://doc.wikimedia.org/pywikibot/stable/

https://mwparserfromhell.readthedocs.io/en/latest/index.html

https://github.com/5j9/wikitextparser

Using pywikibot to get the wikipedia markup code and then parse it with parsers like mwparserfromhell and wikitextparser.

Installation and importing:

In [None]:
pip install pywikibot

In [None]:
pip install mwparserfromhell

In [None]:
pip install wikitextparser

In [None]:
import pywikibot
import mwparserfromhell as mwp
import wikitextparser as wtp
import pandas as pd

Getting the markup code of the page [List of political parties in Germany]('https://en.wikipedia.org/wiki/List_of_political_parties_in_Germany'):

In [None]:
site = pywikibot.Site('en', 'wikipedia')
page = pywikibot.Page(site, "List of political parties in Germany")
# text = page.get()

In [None]:
revs = page.revisions()

In [None]:
wikicode = mwp.parse(text)

In [None]:
wikicode.get_sections()

In [None]:
templates[4]

In [None]:
revsl = []
for i in revs:
    revsl.append(i)

In [None]:
revsl[0]

In [None]:
rev1 = revsl[0].text
# page = wtp.parse(rev1)

In [None]:
rev1

In [None]:
page.sections

In [None]:
revsl[1000]['timestamp']

In [None]:
text

Parsing the page with wikitextparser, by first making a page object:

In [None]:
page = wtp.parse(text)

In [None]:
page

Getting page templates:

In [None]:
page.templates[:10]

Like in the previous section, we can get the links in the page, this time with a different order:

In [None]:
page.wikilinks[:10]

Getting sections, no bugs with wikitexmtparser!

In [None]:
page.sections[0]

Tables data:

In [None]:
data = page.tables[1].data()
data

Putting the data in a dataframe:

In [None]:
df = pd.DataFrame(data[1:])
df.columns = data[0]
df

Parsing each cells data with mwparserfromhell and then making the dataframe:

In [None]:
for i in range(len(data)):
    for j in range(len(data[i])):
        wikicode = mwp.parse(data[i][j])
        data[i][j] = wikicode.strip_code(data[i][j])

In [None]:
df = pd.DataFrame(data[1:])
df.columns = data[0]
df

### Alternatives for extracting tables data:

**1. wikitables library:** Small bugs need to be handled by hand:


In [None]:
from wikitables import import_tables

tables = import_tables('List of political parties in Germany')

In [None]:
tables

In [None]:
print(tables[0].rows[0]['Abbr.'])

**2. Introducing DBpedia:** www.dbpedia.org

### 3.5.3 Pywikibot & parsers 2: Main text of different revisions

Extracting the main text of the first revision of an article in each year since the beginning:

In [None]:
import pywikibot
import mwparserfromhell

In [None]:
site = pywikibot.Site('en', 'wikipedia')
page = pywikibot.Page(site, "Koç University")

In [None]:
revisions = page.revisions(content=True)

In [None]:
revisions_list = []
years = []

for i in revisions:
    revisions_list.append(i)
    years.append(int(str(i['timestamp'])[:4]))
years.reverse()
revisions_list.reverse()

In [None]:
# years

In [None]:
# revisions_list[-1]

In [None]:
yearly_revisions = []
for i in range(years[0], years[-1]+1):
    index = years.index(i)
    yearly_revisions.append(revisions_list[index])

In [None]:
# yearly_revisions[-1]

In [None]:
text = yearly_revisions[-1].text

In [None]:
parsed = mwparserfromhell.parse(text)

In [None]:
print(parsed.strip_code())

## 3.6. Challanges

Facebook completely closed down many of it’s APIs and it is not very hard to get Facebook data besides CrowdTangle or FB Ads.

Twitter’s API now has the version 2 with substantial changes. 

These challanges make us stay vigilant and continuously update our code to keep up with the APIs.

- More on Social Media data collection and data quality:
https://www.slideshare.net/suchprettyeyes/working-with-socialmedia-data-ethics-good-practice-around-collecting-using-and-storing-data

## 3.7. References

Zenk-Möltgen, Wolfgang (GESIS - Leibniz Institute for the Social Sciences), Python Script to rehydrate Tweets from Tweet IDs https://doi.org/10.7802/1504

Pfeffer, Morstatter (2016): Geotagged Twitter posts from the United States: A tweet collection to investigate representativeness. Dataset. http://dx.doi.org/10.7802/1166

Do not miss checking out the Social Comquant Workshop 10 at:https://github.com/strohne/autocol

- Useful links for getting started with Twitter API v2
    - [Comprehensive Guide for Using the Twitter API v2](https://dev.to/twitterdev/a-comprehensive-guide-for-using-the-twitter-api-v2-using-tweepy-in-python-15d9#:~:text=Tweepy%20is%20a%20popular%20package,the%20academic%20research%20product%20track)
    - [Step by Step Guide to Making Your First Request to the Twitter API v2](https://developer.twitter.com/en/docs/tutorials/step-by-step-guide-to-making-your-first-request-to-the-twitter-api-v2)
    - [Getting Started with Data Collection Using Twitter API v2](https://towardsdatascience.com/getting-started-with-data-collection-using-twitter-api-v2-in-less-than-an-hour-600fbd5b5558#39c4)
    - [An Extensive Guide to Collecting Tweets from Twitter API v2 for Academic REsearch Using Python 3](https://towardsdatascience.com/an-extensive-guide-to-collecting-tweets-from-twitter-api-v2-for-academic-research-using-python-3-518fcb71df2a)
    - [What Pythong package is best for getting data from Twitter](https://towardsdatascience.com/what-python-package-is-best-for-getting-data-from-twitter-comparing-tweepy-and-twint-f481005eccc9)

- Useful links for getting started with Reddit API
    - https://www.reddit.com/r/TheoryOfReddit/wiki/collecting_data/- 
    - https://towardsdatascience.com/scrape-reddit-data-using-python-and-google-bigquery-44180b579892
    - https://github.com/akhilesh-reddy/Cable-cord-cutter-Sentiment-analysis-using-Reddit-data
    
<a href="https://www.flaticon.com/free-icons/database" title="database icons">Database icons created by Smashicons - Flaticon</a>

<a href="https://de.freepik.com/vektoren/logo">Logo Vektor erstellt von rawpixel.com - de.freepik.com</a>

<a href="http://www.freepik.com">Designed by stories / Freepik</a>



### Note: Alternative Ways for Twitter Academic API or Premium Account

The search function mandatorily requires environment label and query argument. Label your Application on Twitter Developer page: https://developer.twitter.com/en/account/environments

You can optionally add the fromDate and toDate fields to filter search results by time.

The format of dates should "YYYYMMDDHHMM".

tweets_month = api.search_30_day(label='teaching', query=search_words, 
                                 fromDate="202202201000", toDate="202203010000")

Now, you can dump your results into json format *don't forget to import json*: print(json.dumps(tweet_results[0]._json, indent=4, sort_keys=True))
                                 
For further interest, visit: https://towardsdatascience.com/how-to-use-twitter-premium-search-apis-for-mining-tweets-2705bbaddca

Also, there is another library called Twarc2 to explore for further data collection with Twitter v2 API:
https://twarc-project.readthedocs.io/en/latest/api/client2/

An academic research product:
https://github.com/twitterdev/getting-started-with-the-twitter-api-v2-for-academic-research/blob/main/modules/6a-labs-code-academic-python.md

A standart product: 
https://github.com/twitterdev/getting-started-with-the-twitter-api-v2-for-academic-research/blob/main/modules/6b-labs-code-standard-python.md