<img src='images/gesis.png' style='height: 60px; float: left'>
<img src='images/social_comquant.png' style='height: 50px; float: left; margin-left: 40px'>
<img src='images/isi.png' style='height: 50px; float: left; margin-left: 20px'>  

Authors = N. Gizem Bacaksizlar Turbic and Haiko Lietz

Date = 19 July 2022

# 1. Introduction

Data collection is the process of gathering and measuring information from all relevant sources so that it can be analyzed to answer research questions. Researchers evaluate their research questions and hypotheses on the basis of the collected data. In most cases, data collection is the first and most important step of research, irrespective of the field of study. The approach to data collection varies across fields, depending on the information required.

Easy access to technology has made social media platforms popular communication tools and, therefore, popular sources of data. With the rise of social media as a data source, collecting data through APIs has become an in-demand skill. In this session, we aim to teach how to collect data from social media platforms such as Twitter and Reddit.

# 2. Social Media Platforms for Data Harvesting through API

<img src="./images/database.png"  width="150" height = "150" align="right"/>

To access an API, you first need to create an account on the platform you want to work with and apply for a developer account. With this developer account, the platform provides you with KEYS (e.g., secret, public, or access keys) to authenticate with its system.

While web scraping is one of the common ways of collecting data from websites, a lot of websites offer APIs to access the public data that they host on their website. This is to avoid unnecessary traffic on the websites.

However, even when we have access to these APIs, we as researchers should respect the API access rules and always read the documentation before collecting data.
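
One common way to respect access rules is to wait and retry when a request fails, instead of sending requests as fast as possible. The following is a minimal sketch of retrying with exponential backoff. The function name `call_with_backoff` and its parameters are hypothetical and not part of any platform's API. Catching every `Exception` is a simplifying assumption; in practice you would catch the rate-limit error of the library you use.

```python
import time


def call_with_backoff(request_fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call request_fn() and retry with exponentially growing pauses on failure.

    request_fn: any function without arguments that performs one API request.
    sleep: injectable so that the waiting behaviour can be checked without waiting.
    """
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except Exception:  # assumption: any error is treated as "try again later"
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            # wait 1, 2, 4, ... times base_delay seconds before the next attempt
            sleep(base_delay * 2 ** attempt)
```

You would wrap a single request in a small function and pass it in, for example `call_with_backoff(lambda: some_request())`, where `some_request` stands for whatever call your API wrapper offers.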




## 2.1. A demonstration using Python to collect data from Twitter 

Twitter is one of the most used social media platforms in academic research. This microblogging and social networking service hosts users who can post and interact with messages known as "tweets". Registered users can post, like, and retweet tweets, while unregistered users can only read those that are publicly available. As of 2022, Twitter has 436 million active users worldwide (Statista, 2022*). 

<img src="./images/twitter.png"  width="200" height = "200" align="left"/>

Different access options for different purposes:

- Twitter Developer: https://developer.twitter.com/
- APIs: https://developer.twitter.com/en/docs
- GNIP: http://support.gnip.com/apis/
- Twitter Enterprise: https://developer.twitter.com/en/enterprise

Note that the free APIs cover only Tweets from the last 7 days; Premium APIs offer 30-day search and beyond. If you have the Academic Research access level, you can access even more data with the full-archive search endpoint. API policies, such as functionalities and user agreements, change over time. Limitations on volume and functions should also be considered. 

Before we start with our first project on Twitter, you first need to sign up for Twitter and then create a Developer account: 

- Sign up from [here.](https://help.twitter.com/en/using-twitter/create-twitter-account)
- Create a Developer Account from [here.](https://developer.twitter.com)


\* Statista (2022): https://www.statista.com/statistics/272014/global-social-networks-ranked-by-number-of-users/

In [1]:
'''
    Let's get started with our first project of collecting Tweets.
    Import the libraries below if you have already installed them.
    If you have not installed them yet, install them with pip on your command prompt or with !pip in your Jupyter notebook.
'''
# import relevant packages
import pandas as pd # data manipulation library
import datetime # human readable date formats
import tweepy as tw # wrapper around Twitter API 
# Please make sure you have installed all of these libraries!

In [2]:
# Enter your keys registered with Twitter
# Obtain the access token and access token secret
# These can be generated in your Developer Portal, under the “Keys and tokens” tab for your Developer App.

apikey = 'YOURapikey' #25 alphanumeric characters
apisecretkey = 'YOURapisecretkey'
accesstoken = 'YOURaccesstoken'
accesstokensecret = 'YOURaccesstokensecret'
bearertoken = 'YOURbearertoken'

<img src="./images/developer_portal.png"  width="500" height = "500" align="center"/>

In [3]:
# Let's say you are sharing your scripts with others and do not want to show your keys. What can you do?
# You can create a simple Python script called keys.py in which you store all keys.
# Save this script in the same folder as this notebook and import your keys

from keys import *

# Make sure the variable names for the keys in keys.py are the same as the variable names used here.
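
As an alternative to a keys.py file, keys can also be read from environment variables, so that they never appear in any script you share. The sketch below assumes hypothetical variable names such as `TWITTER_API_KEY`; the helper `load_keys` is illustrative and not part of Tweepy. For demonstration, it reads from a stand-in dictionary rather than from the real environment.

```python
import os


def load_keys(names, environ=None):
    """Return a dict with the requested keys, read from the environment.

    Raises a KeyError that lists every missing variable, so that a missing
    key is noticed before any request is sent.
    """
    environ = os.environ if environ is None else environ
    missing = [name for name in names if name not in environ]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}


# Stand-in mapping for illustration; omit `environ` to read the real environment
example_env = {"TWITTER_API_KEY": "YOURapikey", "TWITTER_API_SECRET": "YOURapisecretkey"}
keys = load_keys(["TWITTER_API_KEY", "TWITTER_API_SECRET"], environ=example_env)
apikey = keys["TWITTER_API_KEY"]
apisecretkey = keys["TWITTER_API_SECRET"]
```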

In [4]:
# Set up your access with search terms
auth = tw.OAuthHandler(apikey, apisecretkey)
auth.set_access_token(accesstoken, accesstokensecret)
api = tw.API(auth, wait_on_rate_limit = True)
search_words = "ComputationalSocialScience OR GESIS OR SocialComQuant" # Words should be changed according to your search.
# If you want to remove retweets, then include -filter:retweets to the search_words.
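
When the list of search terms grows, it can help to assemble the query string programmatically. The following sketch joins terms with `OR` and optionally appends `-filter:retweets`, as described in the comment above. The helper name `build_query` is hypothetical.

```python
def build_query(terms, exclude_retweets=False):
    """Join search terms with OR and optionally exclude retweets."""
    query = " OR ".join(terms)
    if exclude_retweets:
        query += " -filter:retweets"
    return query


search_words = build_query(
    ["ComputationalSocialScience", "GESIS", "SocialComQuant"],
    exclude_retweets=True,
)
print(search_words)
```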

In [5]:
# Collect Tweets and be aware that attribute names may change over time with new versions of the package
# For a standard search, we use Tweepy.
tweets = tw.Cursor(api.search_tweets, q=search_words, lang="en").items() # Pass a number to items() to limit the number of results

In [6]:
# First, check which Twitter attributes you collected from this search, visit:
# https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/tweet
# Then, create a dataframe with the columns you might need for your analysis

tweet_details = [[ tweet.user.screen_name, tweet.user.id, tweet.id_str, 
                  tweet.created_at, tweet.text, tweet.user.profile_image_url, tweet.user.location] 
                  for tweet in tweets]
tweet_df = pd.DataFrame(data=tweet_details, 
                        columns = ["user_name","user_id", "tweet_id", "tweet_date","tweet","user_image",
                                   "user_location"])

# For instance, you can see the values of one specific column with a code like this: tweet_df['user_image'].values

# Save df
tweet_df.to_csv("./data/test_tweets.csv", index = False)
print(tweet_df.head())
print('---------------------------------------------')

# print the length of the dataset
print('The length of the dataframe:', len(tweet_df['tweet_id'].unique()))

         user_name              user_id             tweet_id  \
0      CompCommLab   798181898031943680  1519270689823498241   
1         emcr_sna  1491847980617547779  1519231234358095872   
2        gesis_org            145554242  1519224965555535873   
3  ReligionFESTHD1  1186731643505184768  1519220526677467136   
4        gesis_org            145554242  1519211980896317440   

                 tweet_date  \
0 2022-04-27 11:02:13+00:00   
1 2022-04-27 08:25:26+00:00   
2 2022-04-27 08:00:32+00:00   
3 2022-04-27 07:42:53+00:00   
4 2022-04-27 07:08:56+00:00   

                                               tweet  \
0  RT @clauwa: Want to join @gesis_org and @HHU_d...   
1  RT @clauwa: Want to join @gesis_org and @HHU_d...   
2  Lights! Camera! Action! Teach! A #Handbook for...   
3  RT @gesis_org: #stellenangebot #job #openposit...   
4  RT @trovdimi: In case you missed my talk on Ge...   

                                          user_image            user_location  
0  http://p

In [7]:
# Let's check the first five user images; open the links in a browser to view them
tweet_df.user_image.values[:5]

array(['http://pbs.twimg.com/profile_images/839049954144436225/iZvx4Nbr_normal.jpg',
       'http://pbs.twimg.com/profile_images/1496253854937141257/J4Xdl0YN_normal.jpg',
       'http://pbs.twimg.com/profile_images/2840291739/926f900a36e46987ff8ac10c060f2c07_normal.png',
       'http://pbs.twimg.com/profile_images/1471199487288913922/wcvkmu9V_normal.jpg',
       'http://pbs.twimg.com/profile_images/2840291739/926f900a36e46987ff8ac10c060f2c07_normal.png'],
      dtype=object)

In [8]:
# Twitter API v2 (if you have full access)
client = tw.Client(bearer_token=bearertoken) # bearertoken is the variable defined with your keys above

# Replace with your own search query
query = 'from:SocialComquant -is:retweet' # replace the username after from: with your own choice (retweets are excluded)

# Replace with the start of the time period of your choice
start_time = '2021-01-01T00:00:00Z'

# Replace with the end of the time period of your choice
end_time = '2022-01-01T00:00:00Z'

In [9]:
# Check the start_time yourself by evaluating it
start_time

'2021-01-01T00:00:00Z'
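
Instead of typing such timestamps by hand, they can be generated from `datetime` objects. The sketch below produces strings in the same format as `start_time` and `end_time` above. The helper name `to_api_timestamp` is hypothetical, and treating naive datetimes as UTC is an assumption.

```python
from datetime import datetime, timezone


def to_api_timestamp(dt):
    """Format a datetime as YYYY-MM-DDTHH:MM:SSZ in UTC."""
    if dt.tzinfo is None:
        # assumption: datetimes without a time zone are meant as UTC
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


start_time = to_api_timestamp(datetime(2021, 1, 1))
end_time = to_api_timestamp(datetime(2022, 1, 1))
print(start_time, end_time)
```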

In [10]:
'''
You can search Tweets from the last 7 days or all Tweets with different functions. Check the available functions in Tweepy:
https://docs.tweepy.org/en/stable/client.html#search-tweets
A helpful link for setting up your query:
https://github.com/twitterdev/getting-started-with-the-twitter-api-v2-for-academic-research/blob/main/modules/5-how-to-write-search-queries.md
'''
# Connect to the Twitter API and search all tweets if you have Academic Research access
tweets = client.search_all_tweets(query=query, tweet_fields=['created_at','text', 'context_annotations','entities'],
                                  start_time=start_time,
                                  end_time=end_time, max_results=10) # set your max results between 10 and 500



In [11]:
# Let's look at context annotations, a fairly new field.
for tweet in tweets.data:
    print(tweet.created_at)
    print(tweet.context_annotations) #context annotations (https://developer.twitter.com/en/docs/twitter-api/annotations/overview)

2021-12-08 10:26:14+00:00
[{'domain': {'id': '30', 'name': 'Entities [Entity Service]', 'description': 'Entity Service top level domain, every item that is in Entity Service should be in this domain'}, 'entity': {'id': '848920371311001600', 'name': 'Technology', 'description': 'Technology and computing'}}, {'domain': {'id': '66', 'name': 'Interests and Hobbies Category', 'description': 'A grouping of interests and hobbies entities, like Novelty Food or Destinations'}, 'entity': {'id': '848921413196984320', 'name': 'Computer programming', 'description': 'Computer programming'}}, {'domain': {'id': '66', 'name': 'Interests and Hobbies Category', 'description': 'A grouping of interests and hobbies entities, like Novelty Food or Destinations'}, 'entity': {'id': '898673391980261376', 'name': 'Web development', 'description': 'Web Development'}}]
2021-11-24 14:10:43+00:00
[{'domain': {'id': '30', 'name': 'Entities [Entity Service]', 'description': 'Entity Service top level domain, every item 
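
The nested structure of context annotations is easier to analyze once it is flattened. The following sketch reduces a list of annotations, shaped like the output above, to (domain name, entity name) pairs. The helper name `annotation_pairs` is hypothetical, and the sample data is a shortened copy of the first annotations printed above.

```python
def annotation_pairs(context_annotations):
    """Return (domain name, entity name) tuples for a list of context annotations."""
    # `or []` covers Tweets for which no annotations are available
    return [(a["domain"]["name"], a["entity"]["name"]) for a in context_annotations or []]


# Shortened copy of the annotations printed above (ids and descriptions omitted)
sample = [
    {"domain": {"name": "Entities [Entity Service]"}, "entity": {"name": "Technology"}},
    {"domain": {"name": "Interests and Hobbies Category"}, "entity": {"name": "Computer programming"}},
]

for domain, entity in annotation_pairs(sample):
    print(domain, "->", entity)
```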

## 2.2. A demonstration using Python to collect Reddit comments <img src="./images/reddit.svg"  width="150" height = "150" align="right"/>

Reddit is one of the oldest social media platforms whose users are still generating content. Millions of users create content on a daily basis in the form of questions and comments. Reddit also offers an API that makes it easy to access this vast amount of data.

The first thing you need is a Reddit account. You can create one [here.](https://www.reddit.com/)
- [Official Reddit API](https://www.reddit.com/dev/api/)
    - [Collecting Reddit data](https://towardsdatascience.com/scrape-reddit-data-using-python-and-google-bigquery-44180b579892)
    
Alternative ways of getting Reddit data:
- [Google BigQuery](https://cloud.google.com/bigquery) (GBQ)
    - [Scraping Reddit data with GBQ](https://towardsdatascience.com/scrape-reddit-data-using-python-and-google-bigquery-44180b579892)
- [Pushshift.io](https://medium.com/@RareLoot/using-pushshifts-api-to-extract-reddit-submissions-fb517b286563)

First, decide which subreddit you would like to collect data from: let's say "Computational Social Science", and be creative :)

The fields available from the Reddit API include title, score, url, id, number of comments, date of creation, and body text. 
Here, we will focus on getting the body text (comments) from the subreddit. Refer to the [PRAW documentation](https://praw.readthedocs.io/en/latest/code_overview/models/subreddit.html) for different kinds of implementations. 
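
As a minimal sketch of this step, the code below collects recent comments from a subreddit with PRAW and turns them into rows that can be passed to `pd.DataFrame`. The function names `comments_to_rows` and `fetch_recent_comments` are hypothetical, and the credentials and subreddit name you pass in are your own choices. It assumes that you have installed `praw` and registered an app with Reddit to obtain a client ID and secret.

```python
from datetime import datetime, timezone


def comments_to_rows(comments):
    """Turn comment objects into dictionaries with the fields we want to keep."""
    rows = []
    for comment in comments:
        rows.append({
            "comment_id": comment.id,
            "score": comment.score,
            # created_utc is a Unix timestamp; convert it to a readable UTC datetime
            "created": datetime.fromtimestamp(comment.created_utc, tz=timezone.utc),
            "body": comment.body,
        })
    return rows


def fetch_recent_comments(client_id, client_secret, user_agent, subreddit_name, limit=100):
    """Collect the most recent comments of one subreddit (needs valid credentials)."""
    import praw  # imported here so that comments_to_rows also works without praw

    reddit = praw.Reddit(client_id=client_id, client_secret=client_secret,
                         user_agent=user_agent)
    return comments_to_rows(reddit.subreddit(subreddit_name).comments(limit=limit))
```

Calling `fetch_recent_comments('YOURclientid', 'YOURclientsecret', 'YOURuseragent', 'YOURsubreddit')` returns a list of dictionaries that `pd.DataFrame` accepts directly.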

## 2.3. More APIs and precollected datasets 

<img src="./images/datasets.jpg" width="500" height = "900" align="left"/>  

- __More APIs__

    [Facebook for Developers](https://developers.facebook.com/)  
    [Facebook Ads API](https://developers.facebook.com/docs/marketing-apis/)  
    [Instagram Developer](https://developers.facebook.com/docs/instagram-basic-display-api)  
    [YouTube Developers](https://developers.google.com/youtube/)  
    [Weibo API](http://open.weibo.com/wiki/API%E6%96%87%E6%A1%A3/en)  
    [CrowdTangle](https://www.crowdtangle.com/request)  
    [4chan](https://github.com/4chan/4chan-API)  
    [Gab](https://github.com/a-tal/gab)  
    [GitHub REST API](https://docs.github.com/en/rest)  
    [GitHub GraphQL](https://docs.github.com/en/graphql)  
    [Stack Overflow](https://api.stackexchange.com/docs)  
    [Facepager](https://github.com/strohne/Facepager)  


- __Precollected datasets__  
    https://datasetsearch.research.google.com  
    https://www.kaggle.com/datasets  
    https://data.gesis.org/sharing/#!Search  


- __Locating or Requesting Social Media Data__
    https://www.programmableweb.com

# 3. Data harvesting from Wikipedia through API

<img src='images/wikipedia_logo.png' style='height: 190px; float: right; margin-left: 50px' >

Wikipedia is a rich source of data for social science research. Although we can access its data through other techniques like web scraping, there are also useful APIs that could ease collecting data from the website.

Since Wikipedia is built on [MediaWiki](https://en.wikipedia.org/wiki/MediaWiki), we will be using Python wrappers written for its API, the [MediaWiki Action API](https://www.mediawiki.org/wiki/API:Main_page). Each of these wrappers provides some useful methods, and we will go through the ones that are most important to our data collection tasks.

We will also introduce two useful parsers for the Wikipedia markup language, and will see how they could be used for extracting clean data from the raw markup code.
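
To give an idea of what such wrappers do for us, the following sketch builds the URL of a raw Action API request for the plain-text extract of a page. The helper name `build_extract_url` is hypothetical, and the parameter choice is only one possible request; the wrappers introduced below may build their requests differently. No request is sent here.

```python
from urllib.parse import urlencode


def build_extract_url(title, lang="en"):
    """Build an Action API URL that asks for the plain-text extract of one page."""
    params = {
        "action": "query",    # the query module of the Action API
        "prop": "extracts",   # ask for page extracts
        "explaintext": 1,     # plain text instead of HTML
        "titles": title,
        "format": "json",
    }
    return f"https://{lang}.wikipedia.org/w/api.php?" + urlencode(params)


url = build_extract_url("Barack Obama")
print(url)
```

Opening the printed URL in a browser shows the JSON response that a wrapper would parse for you.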

## 3.1 Wikipedia library

https://wikipedia.readthedocs.io/en/latest/code.html#api

Installation and importing: 

In [None]:
pip install wikipedia

In [2]:
import wikipedia

Searching a query:

In [4]:
wikipedia.search("Barack")

['Barack Obama',
 'Barack Obama Sr.',
 'Barack (disambiguation)',
 'Presidency of Barack Obama',
 'Family of Barack Obama',
 'Barack (brandy)',
 'Barack Obama "Hope" poster',
 'Early life and career of Barack Obama',
 'Barack Obama religion conspiracy theories',
 'Barack Obama presidential campaign']

In [5]:
wikipedia.suggest("Barak Obama")

'barack obama'

Getting fewer or more results by specifying a number:

In [6]:
wikipedia.search("Ford", results=3)

['Ford', 'Ford Motor Company', 'Gerald Ford']

Getting the summary of an article:

In [7]:
wikipedia.summary("Barack Obama")

'Barack Hussein Obama II ( (listen) bə-RAHK hoo-SAYN oh-BAH-mə; born August 4, 1961) is an American politician who served as the 44th president of the United States from 2009 to 2017. A member of the Democratic Party, he was the first African-American  president of the United States. Obama previously served as a U.S. senator from Illinois from 2005 to 2008 and as an Illinois state senator from 1997 to 2004. \nObama was born in Honolulu, Hawaii. After graduating from Columbia University in 1983, he worked as a community organizer in Chicago. In 1988, he enrolled in Harvard Law School, where he was the first black president of the Harvard Law Review. After graduating, he became a civil rights attorney and an academic, teaching constitutional law at the University of Chicago Law School from 1992 to 2004. Turning to elective politics, he represented the 13th district in the Illinois Senate from 1997 until 2004, when he ran for the U.S. Senate. Obama received national attention in 2004 with

In [8]:
wikipedia.summary("Barack Obama", sentences=1)

'Barack Hussein Obama II ( (listen) bə-RAHK hoo-SAYN oh-BAH-mə; born August 4, 1961) is an American politician who served as the 44th president of the United States from 2009 to 2017.'

wikipedia.summary will raise a DisambiguationError if the page is a disambiguation page, or a PageError if the page doesn’t exist (although by default, it tries to find the page you meant with suggest and search.)

In [9]:
wikipedia.summary("Mercury")





DisambiguationError: "Mercury" may refer to: 
Mercury (planet)
Mercury (element)
Mercury (mythology)
Mercury (toy manufacturer)
Mercury Communications
Mercury Drug
Mercury Energy
Mercury Filmworks
Mercury General
Mercury Interactive
Mercury Marine
Mercury Systems
Mercury (programming language)
Mercury (metadata search system)
Ferranti Mercury
Mercury Browser
Mercury Mail Transport System
Mercury (film)
Mercury (TV series)
Young Adult
character in the RWBY web series
Sailor Mercury
Mercury (Marvel Comics)
Makkari (comics)
Metal Men
Cerebro's X-Men
Amalgam Comics character
Mercury (magazine)
The American Mercury
The Mercury (Hobart)
The Mercury (South Africa)
The Mercury (Pennsylvania)
Mercury (Newport)
Reading Mercury
List of newspapers named Mercury
Mercury (Bova novel)
Mercury (Livesey novel)
Anna Kavan
Mercury Nashville
Mercury Records
Mercury Prize
The Planets
Mercury (American Music Club album)
Mercury (Longview album)
Mercury (Madder Mortem album)
Mercury – Act 1
Mercury – Act 2
"Mercury" (song)
Recovering the Satellites
Failer
Planetarium
Operation Mercury
Boeing E-6 Mercury
Miles Mercury
HMS Mercury
USS Mercury
Russian brig Mercury
Mercury (pigeon)
Mercury (name)
Mercury, Savoie
Mercury Bay
place in Alabama
Mercury, Nevada
Mercury, Texas
Mercury (plant)
Annual mercury
Blitum bonus-henricus
Mercury FM
Heart Hertfordshire
Edmonton Mercurys
Fujita Soccer Club Mercury
Memphis Mercury
Phoenix Mercury
Toledo Mercurys
Blackburn Mercury
Bristol Mercury
Mercury (automobile)
Mercury (cyclecar)
Mercury (train)
Mercury (ship)
Cape Cod Mercury 15
Mercury 18
Project Mercury
Mercury
Mercury (satellite)
Archer Maclean's Mercury
Mercury (cipher machine)
Mercury Boulevard
Mercury Cinema
Shuttle America
The Mercury Mall
All pages with titles beginning with Mercury 
The American Mercury
Mercuri
Mercury 1 (disambiguation)
Mercury 2 (disambiguation)
Mercury 3 (disambiguation)
Mercury 4 (disambiguation)
Mercury 5 (disambiguation)
Mercury 6 (disambiguation)
Mercury 7 (disambiguation)
Mercury 8 (disambiguation)
Mercury City (disambiguation)
Mercury FM (disambiguation)
Mercury House (disambiguation)
Mercury mission (disambiguation)
Mercury program (disambiguation)
Mercury project (disambiguation)
All pages with titles containing Mercury

In [10]:
try:
    mercury = wikipedia.summary("Mercury")
except wikipedia.exceptions.DisambiguationError as e:
    print(e.options)

['Mercury (planet)', 'Mercury (element)', 'Mercury (mythology)', 'Mercury (toy manufacturer)', 'Mercury Communications', 'Mercury Drug', 'Mercury Energy', 'Mercury Filmworks', 'Mercury General', 'Mercury Interactive', 'Mercury Marine', 'Mercury Systems', 'Mercury (programming language)', 'Mercury (metadata search system)', 'Ferranti Mercury', 'Mercury Browser', 'Mercury Mail Transport System', 'Mercury (film)', 'Mercury (TV series)', 'Young Adult', 'character in the RWBY web series', 'Sailor Mercury', 'Mercury (Marvel Comics)', 'Makkari (comics)', 'Metal Men', "Cerebro's X-Men", 'Amalgam Comics character', 'Mercury (magazine)', 'The American Mercury', 'The Mercury (Hobart)', 'The Mercury (South Africa)', 'The Mercury (Pennsylvania)', 'Mercury (Newport)', 'Reading Mercury', 'List of newspapers named Mercury', 'Mercury (Bova novel)', 'Mercury (Livesey novel)', 'Anna Kavan', 'Mercury Nashville', 'Mercury Records', 'Mercury Prize', 'The Planets', 'Mercury (American Music Club album)', 'Mer

wikipedia.page enables you to load and access data from full Wikipedia pages. Initialize with a page title (keep in mind the errors listed above), and then access most properties using property methods:

In [12]:
bo = wikipedia.page("Barack Obama")

Getting the title of the page:

In [13]:
bo.title

'Barack Obama'

Getting the url of the page:

In [14]:
bo.url

'https://en.wikipedia.org/wiki/Barack_Obama'

Getting the full text of the page:

In [15]:
bo.content

'Barack Hussein Obama II ( (listen) bə-RAHK hoo-SAYN oh-BAH-mə; born August 4, 1961) is an American politician who served as the 44th president of the United States from 2009 to 2017. A member of the Democratic Party, he was the first African-American  president of the United States. Obama previously served as a U.S. senator from Illinois from 2005 to 2008 and as an Illinois state senator from 1997 to 2004. \nObama was born in Honolulu, Hawaii. After graduating from Columbia University in 1983, he worked as a community organizer in Chicago. In 1988, he enrolled in Harvard Law School, where he was the first black president of the Harvard Law Review. After graduating, he became a civil rights attorney and an academic, teaching constitutional law at the University of Chicago Law School from 1992 to 2004. Turning to elective politics, he represented the 13th district in the Illinois Senate from 1997 until 2004, when he ran for the U.S. Senate. Obama received national attention in 2004 with

Getting the images of the page:

In [17]:
bo.images[0:5]

['https://upload.wikimedia.org/wikipedia/commons/3/34/2004_United_States_Senate_election_in_Illinois_results_map_by_county.svg',
 'https://upload.wikimedia.org/wikipedia/commons/a/a5/20090124_WeeklyAddress.ogv',
 'https://upload.wikimedia.org/wikipedia/commons/d/da/210120-D-WD757-1249_%2850861341397%29.jpg',
 'https://upload.wikimedia.org/wikipedia/commons/2/2b/58th_Presidential_Inaugural_Ceremony_170120-D-BP749-1327.jpg',
 'https://upload.wikimedia.org/wikipedia/commons/1/17/Balance%2C_by_David.svg']

Getting the links in the page:

In [18]:
bo.links[:10]

['109th United States Congress',
 '110th United States Congress',
 '14th Dalai Lama',
 '1828 United States presidential election',
 '1832 Democratic National Convention',
 '1835 Democratic National Convention',
 '1840 Democratic National Convention',
 '1844 Democratic National Convention',
 '1848 Democratic National Convention',
 '1852 Democratic National Convention']

To change the language of the Wikipedia you are accessing, use wikipedia.set_lang. Remember to search for page titles in the language that you have set, not English:

In [None]:
wikipedia.set_lang("fr")

In [20]:
wikipedia.summary("Francois Hollande")

"François Gérard Georges Nicolas Hollande (French: [fʁɑ̃swa ʒeʁaʁ ʒɔʁʒ nikɔla ɔlɑ̃d] (listen); born 12 August 1954) is a French politician who served as President of France from 2012 to 2017. He previously was First Secretary of the Socialist Party (PS) from 1997 to 2008, Mayor of Tulle from 2001 to 2008, and President of the General Council of Corrèze from 2008 to 2012. Hollande also served in the National Assembly twice for the 1st constituency of Corrèze from 1988 to 1993, and again from 1997 until 2012.\nBorn in Rouen and raised in Neuilly-sur-Seine, Hollande began his political career as a special advisor to newly elected President François Mitterrand, before serving as a staffer for Max Gallo, the government's spokesman. He became a member of the National Assembly in 1988 and was elected First Secretary of the PS in 1997. Following the 2004 regional elections won by the PS, Hollande was cited as a potential presidential candidate, but he resigned as First Secretary and was immedi

List of URLs of the external links:

In [21]:
bo.references[:10]

['http://www.theage.com.au/world/a-classic-orator-obama-learnt-from-the-masters-20081129-6nf1.html',
 'http://www.theaustralian.com.au/archive/news/obama-launches-afghanistan-surge/story-e6frg6t6-1111118893671',
 'http://www.abc.net.au/news/stories/2008/09/09/2360240.htm',
 'http://data.rero.ch/02-A013554091',
 'http://data.rero.ch/02-A013584632',
 'http://www.bncatalogo.cl/F?func=direct&local_base=red10&doc_number=000710697',
 'http://blogs.abcnews.com/politicalpunch/2010/09/president-obama-i-am-a-christian-by-choicethe-precepts-of-jesus-spoke-to-me.html',
 'http://america.aljazeera.com/opinions/2014/11/obama-deportationimmigrationreformmidtermelections.html',
 'http://www.aljazeera.com/news/middleeast/2014/09/obama-strike-wherever-it-exists-2014910223935601193.html',
 'http://corporate.ancestry.com/press/press-releases/2012/07/ancestry.com-discovers-president-obama-related-to-first-documented-slave-in-america/']

Getting the plain text content of a section in the page:

In [22]:
bo.section('Early life and career')

'Obama was born on August 4, 1961, at Kapiolani Medical Center for Women and Children in Honolulu, Hawaii. He is the only president born outside the contiguous 48 states. He was born to an American mother and a Kenyan father. His mother, Ann Dunham (1942–1995), was born in Wichita, Kansas and was mostly of English descent, though in 2007 it was discovered her great-great-grandfather Falmouth Kearney emigrated from the village of Moneygall, Ireland to the US in 1850. In July 2012, Ancestry.com found a strong likelihood that Dunham was descended from John Punch, an enslaved African man who lived in the Colony of Virginia during the seventeenth century. Obama\'s father, Barack Obama Sr. (1934–1982), was a married Luo Kenyan from Nyang\'oma Kogelo. Obama\'s parents met in 1960 in a Russian language class at the University of Hawaii at Manoa, where his father was a foreign student on a scholarship. The couple married in Wailuku, Hawaii, on February 2, 1961, six months before Obama was born.

Getting the list of section titles. The empty result below is an example of a bug in the library:

In [23]:
bo.sections

[]
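
One possible workaround is to read the section titles from the plain-text content itself. The sketch below assumes that `bo.content` marks headings as `== Title ==` (with more `=` signs for subsections); this is an assumption about the text format, and the helper name `section_titles` is hypothetical. It is shown on a short sample string instead of the real page.

```python
import re


def section_titles(content):
    """Return the heading titles found in plain-text page content."""
    # a heading line starts and ends with the same run of two or more '=' signs
    pattern = re.compile(r"^(={2,})[ \t]*(.+?)[ \t]*\1[ \t]*$", flags=re.MULTILINE)
    return [match.group(2) for match in pattern.finditer(content)]


# Short sample in the assumed format; replace with bo.content for a real page
sample_content = (
    "Intro text.\n\n\n"
    "== Early life and career ==\n"
    "Obama was born on August 4, 1961.\n\n\n"
    "=== Education ===\n"
    "More text."
)
print(section_titles(sample_content))
```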

## 3.2 Pywikibot & parsers

https://doc.wikimedia.org/pywikibot/stable/

https://mwparserfromhell.readthedocs.io/en/latest/index.html

https://github.com/5j9/wikitextparser

We use pywikibot to get the Wikipedia markup code of a page and then parse it with parsers like mwparserfromhell and wikitextparser.

Installation and importing:

In [None]:
pip install pywikibot

In [None]:
pip install mwparserfromhell

In [None]:
pip install wikitextparser

In [24]:
import pywikibot
import mwparserfromhell as mwp
import wikitextparser as wtp
import pandas as pd

Getting the markup code of the page [List of political parties in Germany](https://en.wikipedia.org/wiki/List_of_political_parties_in_Germany):

In [25]:
site = pywikibot.Site('en', 'wikipedia')
page = pywikibot.Page(site, "List of political parties in Germany")
text = page.get()

In [26]:
text

'{{Short description|None}}\n{{Politics of Germany}}\nThis is a \'\'\'list of [[political party|political parties]] in [[politics of Germany|Germany]]\'\'\'.\n\nThe [[Federal Republic of Germany]] has a plural [[multi-party]] system. The largest by members and parliament seats are the [[Christian Democratic Union (Germany)|Christian Democratic Union]] (CDU), with its sister party, the [[Christian Social Union of Bavaria|Christian Social Union]] (CSU) and [[Social Democratic Party of Germany]] (SPD).\n\nGermany also has a number of other parties, in recent history most importantly the [[Free Democratic Party (Germany)|Free Democratic Party]] (FDP), [[Alliance 90/The Greens]], [[The Left (Germany)|The Left]], and more recently the [[Alternative for Germany]] (AfD), founded in 2013. The federal government of Germany often consisted of a [[coalition government|coalition]] of a major and a minor party, specifically CDU/CSU and FDP or SPD and FDP, and from 1998 to 2005 SPD and Greens. From 1

Parsing the page with wikitextparser, by first making a page object:

In [27]:
page = wtp.parse(text)

Getting page templates:

In [28]:
page.templates[:10]

[Template('{{Short description|None}}'),
 Template('{{Politics of Germany}}'),
 Template('{{Cite web |url=http://www.dw-world.de/popups/popup_printcontent/0,,1647406,00.html |title=Chronik: Bundestagswahlen von 1949 bis 2002 &#124; Deutschland &#124; Deutsche Welle &#124; 02.10.2005 |access-date=2009-10-10 |archive-url=https://web.archive.org/web/20081108114315/http://www.dw-world.de/popups/popup_printcontent/0,,1647406,00.html |archive-date=2008-11-08 |url-status=dead }}'),
 Template('{{cite web |url=http://www.dw-world.de/dw/article/0,,4541120,00.html |title=Political parties form colorful spectrum in Germany |date=2009-08-18 |access-date=2009-09-12 |publisher=[[Deutsche Welle]] }}'),
 Template('{{citation |url=http://www.dw-world.de/dw/article/0,,4582700,00.html |title=The Green party: Getting used to opposition |date=2009-08-24 |access-date=2009-10-12 |publisher=[[Deutsche Welle]] |quote=This made a so-called [[Jamaica coalition (politics)|Jamaica coalition]] with the Christian Dem

Like in the previous section, we can get the links in the page, this time in a different order:

In [29]:
page.wikilinks[:10]

[WikiLink('[[political party|political parties]]'),
 WikiLink('[[politics of Germany|Germany]]'),
 WikiLink('[[Federal Republic of Germany]]'),
 WikiLink('[[multi-party]]'),
 WikiLink('[[Christian Democratic Union (Germany)|Christian Democratic Union]]'),
 WikiLink('[[Christian Social Union of Bavaria|Christian Social Union]]'),
 WikiLink('[[Social Democratic Party of Germany]]'),
 WikiLink('[[Free Democratic Party (Germany)|Free Democratic Party]]'),
 WikiLink('[[Alliance 90/The Greens]]'),
 WikiLink('[[The Left (Germany)|The Left]]')]

Getting sections, no bugs with wikitextparser!

In [31]:
page.sections[0]

Section("{{Short description|None}}\n{{Politics of Germany}}\nThis is a '''list of [[political party|political parties]] in [[politics of Germany|Germany]]'''.\n\nThe [[Federal Republic of Germany]] has a plural [[multi-party]] system. The largest by members and parliament seats are the [[Christian Democratic Union (Germany)|Christian Democratic Union]] (CDU), with its sister party, the [[Christian Social Union of Bavaria|Christian Social Union]] (CSU) and [[Social Democratic Party of Germany]] (SPD).\n\nGermany also has a number of other parties, in recent history most importantly the [[Free Democratic Party (Germany)|Free Democratic Party]] (FDP), [[Alliance 90/The Greens]], [[The Left (Germany)|The Left]], and more recently the [[Alternative for Germany]] (AfD), founded in 2013. The federal government of Germany often consisted of a [[coalition government|coalition]] of a major and a minor party, specifically CDU/CSU and FDP or SPD and FDP, and from 1998 to 2005 SPD and Greens. From

Table data:

In [33]:
data = page.tables[1].data()
data

[['Logo',
  'Logo',
  'Name',
  'Abbr.',
  'Leader',
  'Ideology',
  'Elected in state (Seats)',
  'Position',
  'Notes'],
 ['',
  '[[File:Bürger in Wut Logo.svg|60px]]',
  "[[Citizens in Rage]]<br /><small>''Bürger in Wut''</small>",
  'BIW',
  '[[Jan Timke]]',
  '[[Right-wing populism]]',
  '[[Bremen (state)|Bremen]] (1)',
  '[[Right-wing politics|Right-wing]]',
  ''],
 ['',
  '[[File:Logo-BVB-FREIE-WAEHLER.svg|60px]]',
  "[[Brandenburg United Civic Movements/Free Voters]]<br /><small>''Brandenburger Vereinigte Bürgerbewegungen / Freie Wähler''</small>",
  'BVB / FW',
  '[[Péter Vida]]',
  '[[Regionalism (politics)|Regionalism]]',
  '[[Brandenburg]] (3)',
  "<!-- Please do NOT add a political position here unless it can be or is cited with a reliable third-party source either here or on the party's page. -->",
  ''],
 ['',
  '[[File:Bürger für Thüringen Logo.png|64x64px]]',
  "[[Citizens for Thuringia]]<br /><small>''Bürger für Thüringen''</small>",
  'BfTh',
  '[[Ute Bergner]]',
  '

Putting the data in a dataframe:

In [34]:
df = pd.DataFrame(data[1:])
df.columns = data[0]
df

Unnamed: 0,Logo,Logo.1,Name,Abbr.,Leader,Ideology,Elected in state (Seats),Position,Notes
0,,[[File:Bürger in Wut Logo.svg|60px]],[[Citizens in Rage]]<br /><small>''Bürger in W...,BIW,[[Jan Timke]],[[Right-wing populism]],[[Bremen (state)|Bremen]] (1),[[Right-wing politics|Right-wing]],
1,,[[File:Logo-BVB-FREIE-WAEHLER.svg|60px]],[[Brandenburg United Civic Movements/Free Vote...,BVB / FW,[[Péter Vida]],[[Regionalism (politics)|Regionalism]],[[Brandenburg]] (3),<!-- Please do NOT add a political position he...,
2,,[[File:Bürger für Thüringen Logo.png|64x64px]],[[Citizens for Thuringia]]<br /><small>''Bürge...,BfTh,[[Ute Bergner]],[[Liberal conservatism]],[[Thuringia]] (4),[[Right-wing politics|Right-wing]],


Parsing each cell's data with mwparserfromhell and then rebuilding the dataframe:

In [36]:
for row in data:
    for j, cell in enumerate(row):
        # Parse the raw wikitext of the cell and keep only its plain text
        row[j] = mwp.parse(cell).strip_code()

In [37]:
df = pd.DataFrame(data[1:])
df.columns = data[0]
df

Unnamed: 0,Logo,Logo.1,Name,Abbr.,Leader,Ideology,Elected in state (Seats),Position,Notes
0,,60px,Citizens in RageBürger in Wut,BIW,Jan Timke,Right-wing populism,Bremen (1),Right-wing,
1,,60px,Brandenburg United Civic Movements/Free Voters...,BVB / FW,Péter Vida,Regionalism,Brandenburg (3),,
2,,64x64px,Citizens for ThuringiaBürger für Thüringen,BfTh,Ute Bergner,Liberal conservatism,Thuringia (4),Right-wing,
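
In the cleaned output above, the English and German party names run together (`Citizens in RageBürger in Wut`), because the `<br />` tag between them is dropped without leaving a separator. A minimal sketch of one way to avoid this, assuming it is acceptable to rewrite the raw cell text before parsing: replace line-break tags with a visible separator first. The helper name `separate_line_breaks` and the ` / ` separator are illustrative choices and not part of mwparserfromhell.

```python
import re

# Matches <br>, <br/> and <br /> in any letter case
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)

def separate_line_breaks(cell, sep=" / "):
    """Replace HTML line-break tags in a raw wikitext cell with a separator."""
    return BR_TAG.sub(sep, cell)

raw = "[[Citizens in Rage]]<br /><small>''Bürger in Wut''</small>"
print(separate_line_breaks(raw))
# [[Citizens in Rage]] / <small>''Bürger in Wut''</small>
```

Applied to each cell before `mwp.parse(...)`, this should keep the two names apart after the markup has been stripped.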


### Alternatives for extracting table data:

**1. wikitables library:** The import drops fields that it cannot match to a column, so small bugs need to be handled by hand:


In [39]:
from wikitables import import_tables

tables = import_tables('List of political parties in Germany')

List of political parties in Germany[0][0]: dropping field from unknown column: Centre-left
List of political parties in Germany[0][0]: dropping field from unknown column: S&D
List of political parties in Germany[0][1]: dropping field from unknown column: Centre-right
List of political parties in Germany[0][1]: dropping field from unknown column: EPP
List of political parties in Germany[0][2]: dropping field from unknown column: 
List of political parties in Germany[0][2]: dropping field from unknown column: EPP
List of political parties in Germany[0][3]: dropping field from unknown column: Centre-left
List of political parties in Germany[0][3]: dropping field from unknown column: Greens/EFA
List of political parties in Germany[0][4]: dropping field from unknown column: Centre to centre-right
List of political parties in Germany[0][4]: dropping field from unknown column: RE
List of political parties in Germany[0][5]: dropping field from unknown column: Far-right
List of political parti

List of political parties in Germany[3][10]: dropping field from unknown column: Conservatism Federalism
List of political parties in Germany[3][11]: dropping field from unknown column: Social liberalism Federalism Laïcité Parliamentarism
List of political parties in Germany[3][12]: dropping field from unknown column: Liberalism Federalism
List of political parties in Germany[3][13]: dropping field from unknown column: Liberalism Progressivism Parliamentarism Laïcité
List of political parties in Germany[3][14]: dropping field from unknown column: Democratic socialism Centrist Marxism Pacifism
List of political parties in Germany[3][15]: dropping field from unknown column: Liberalism Parliamentarism Classical liberalism Economic liberalism Conservative liberalism
List of political parties in Germany[3][16]: dropping field from unknown column: National liberalism
List of political parties in Germany[3][17]: dropping field from unknown column: Christian socialism Social liberalism Nationa

In [40]:
tables[3].json()

'[{"Name": "", "Abbr.": "Bavarian Peasants\' League", "Ideology": "BB"}, {"Name": "", "Abbr.": "Centre Party", "Ideology": "Zentrum"}, {"Name": "", "Abbr.": "Christian Social Party", "Ideology": "CSP"}, {"Name": "", "Abbr.": "Democratic Union", "Ideology": "DV"}, {"Name": "", "Abbr.": "Free Conservative Party", "Ideology": "FKP"}, {"Name": "", "Abbr.": "Free-minded People\'s Party", "Ideology": "FVP"}, {"Name": "", "Abbr.": "Free-minded Union", "Ideology": "FV"}, {"Name": "", "Abbr.": "General German Workers\' Association", "Ideology": "ADAV"}, {"Name": "", "Abbr.": "German Conservative Party", "Ideology": "DKP"}, {"Name": "", "Abbr.": "German Fatherland Party", "Ideology": "DVLP"}, {"Name": "", "Abbr.": "German-Hanoverian Party", "Ideology": "DHP"}, {"Name": "", "Abbr.": "German People\'s Party", "Ideology": "DtVP"}, {"Name": "", "Abbr.": "German Progress Party", "Ideology": "DFP"}, {"Name": "", "Abbr.": "German Free-minded Party", "Ideology": "DFP"}, {"Name": "", "Abbr.": "Independen

In [41]:
print(tables[0].rows[0]['Abbr.'])

60px
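
The `60px` printed above is an image size rather than an abbreviation, which suggests that the cells of this row are shifted relative to the header. One way to handle this by hand is to build the records directly from the list of lists returned by `page.tables[1].data()` earlier, where the header contains `Logo` twice. The following is only a sketch under that assumption: `unique_headers` and `rows_to_records` are hypothetical helper names, and the sample is shortened to four of the nine columns of the cleaned table.

```python
from collections import Counter

def unique_headers(headers):
    """Make repeated column names distinct, e.g. Logo, Logo -> Logo, Logo.1."""
    seen = Counter()
    result = []
    for name in headers:
        count = seen[name]
        result.append(name if count == 0 else f"{name}.{count}")
        seen[name] += 1
    return result

def rows_to_records(table):
    """Turn a list of lists (header row first) into a list of dicts."""
    header = unique_headers(table[0])
    return [dict(zip(header, row)) for row in table[1:]]

# Shortened sample in the shape of the cleaned `data` list
sample = [
    ['Logo', 'Logo', 'Name', 'Abbr.'],
    ['', '60px', 'Citizens in RageBürger in Wut', 'BIW'],
]
records = rows_to_records(sample)
print(records[0]['Abbr.'])  # BIW
```

Because every header name is unique, no cell is overwritten or dropped when the rows are turned into dictionaries.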


**2. Introducing DBpedia:** www.dbpedia.org
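
DBpedia exposes structured data extracted from Wikipedia through a public SPARQL endpoint. As a rough sketch of what a request could look like, the code below only builds the request URL and does not send it. The class `dbo:PoliticalParty`, the property `dbo:country`, and the `format` parameter are assumptions about the DBpedia ontology and endpoint that should be checked against the DBpedia documentation; `build_request_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"

# Assumed query: political parties whose country is Germany, with English labels
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?party ?label WHERE {
  ?party a dbo:PoliticalParty ;
         dbo:country dbr:Germany ;
         rdfs:label ?label .
  FILTER (lang(?label) = "en")
}
LIMIT 20
"""

def build_request_url(query, endpoint=DBPEDIA_ENDPOINT):
    """Encode a SPARQL query as a GET request URL asking for JSON results."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return f"{endpoint}?{urlencode(params)}"

url = build_request_url(QUERY)
print(url[:60])
```

Fetching this URL, for example with `requests.get(url).json()`, should return the matching parties as JSON if the assumptions above hold.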

# 4. Challenges

Facebook has closed down many of its APIs, and it is now very hard to get Facebook data other than through CrowdTangle or FB Ads.

Twitter's API is now at version 2, which introduced substantial changes.

These challenges require us to stay vigilant and to continuously update our code to keep up with the APIs.

- More on Social Media data collection and data quality:
https://www.slideshare.net/suchprettyeyes/working-with-socialmedia-data-ethics-good-practice-around-collecting-using-and-storing-data

# 5. References

Zenk-Möltgen, Wolfgang (GESIS - Leibniz Institute for the Social Sciences), Python Script to rehydrate Tweets from Tweet IDs https://doi.org/10.7802/1504

Pfeffer, Morstatter (2016): Geotagged Twitter posts from the United States: A tweet collection to investigate representativeness. Dataset. http://dx.doi.org/10.7802/1166

Do not miss the Social Comquant Workshop 10 at: https://github.com/strohne/autocol

- Useful links for getting started with Twitter API v2
    - [Comprehensive Guide for Using the Twitter API v2](https://dev.to/twitterdev/a-comprehensive-guide-for-using-the-twitter-api-v2-using-tweepy-in-python-15d9)
    - [Step by Step Guide to Making Your First Request to the Twitter API v2](https://developer.twitter.com/en/docs/tutorials/step-by-step-guide-to-making-your-first-request-to-the-twitter-api-v2)
    - [Getting Started with Data Collection Using Twitter API v2](https://towardsdatascience.com/getting-started-with-data-collection-using-twitter-api-v2-in-less-than-an-hour-600fbd5b5558#39c4)
    - [An Extensive Guide to Collecting Tweets from Twitter API v2 for Academic Research Using Python 3](https://towardsdatascience.com/an-extensive-guide-to-collecting-tweets-from-twitter-api-v2-for-academic-research-using-python-3-518fcb71df2a)
    - [What Python package is best for getting data from Twitter](https://towardsdatascience.com/what-python-package-is-best-for-getting-data-from-twitter-comparing-tweepy-and-twint-f481005eccc9)

- Useful links for getting started with Reddit API
    - https://www.reddit.com/r/TheoryOfReddit/wiki/collecting_data/
    - https://towardsdatascience.com/scrape-reddit-data-using-python-and-google-bigquery-44180b579892
    - https://github.com/akhilesh-reddy/Cable-cord-cutter-Sentiment-analysis-using-Reddit-data
    
<a href="https://www.flaticon.com/free-icons/database" title="database icons">Database icons created by Smashicons - Flaticon</a>

<a href="https://de.freepik.com/vektoren/logo">Logo Vektor erstellt von rawpixel.com - de.freepik.com</a>

<a href="http://www.freepik.com">Designed by stories / Freepik</a>



### Note: Alternative Ways for Twitter Academic API or Premium Account

The search function requires both an environment label and a query argument. Label your application on the Twitter Developer page: https://developer.twitter.com/en/account/environments

You can optionally add the fromDate and toDate fields to filter search results by time.

The format of the dates should be "YYYYMMDDHHMM".

tweets_month = api.search_30_day(label='teaching', query=search_words, 
                                 fromDate="202202201000", toDate="202203010000")
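
Rather than typing the "YYYYMMDDHHMM" strings by hand, they can be generated from `datetime` objects. A minimal sketch, in which `to_premium_timestamp` is a hypothetical helper name and not part of tweepy:

```python
from datetime import datetime

PREMIUM_DATE_FORMAT = "%Y%m%d%H%M"  # corresponds to YYYYMMDDHHMM

def to_premium_timestamp(moment):
    """Format a datetime the way the fromDate and toDate fields expect."""
    return moment.strftime(PREMIUM_DATE_FORMAT)

from_date = to_premium_timestamp(datetime(2022, 2, 20, 10, 0))
to_date = to_premium_timestamp(datetime(2022, 3, 1, 0, 0))
print(from_date, to_date)  # 202202201000 202203010000
```

The two strings can then be passed as `fromDate=from_date` and `toDate=to_date` in the call above. The timestamps are generally interpreted as UTC, so local times should be converted first.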

Now you can dump your results in JSON format (*don't forget to import json*): print(json.dumps(tweets_month[0]._json, indent=4, sort_keys=True))
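
Printing is useful for inspection, but for later analysis the results are usually stored on disk. A minimal sketch that writes one JSON object per line (JSON Lines), assuming each result has already been converted to a dictionary, for example with `[tweet._json for tweet in tweets_month]`. The records below are made-up placeholders, and the function names and file name are hypothetical.

```python
import json
import os
import tempfile

def save_as_json_lines(records, path):
    """Write each record as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False, sort_keys=True) + "\n")

def load_json_lines(path):
    """Read the records back into a list of dictionaries."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]

# Made-up placeholder records standing in for real tweet dictionaries
fake_tweets = [{"id": 1, "text": "first example"}, {"id": 2, "text": "zweites Beispiel"}]
path = os.path.join(tempfile.gettempdir(), "tweets_example.jsonl")
save_as_json_lines(fake_tweets, path)
print(load_json_lines(path) == fake_tweets)  # True
```

Storing one record per line makes it possible to append new results later without rewriting the whole file.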
                                 
For further interest, visit: https://towardsdatascience.com/how-to-use-twitter-premium-search-apis-for-mining-tweets-2705bbaddca

Another library worth exploring for further data collection with the Twitter API v2 is Twarc2:
https://twarc-project.readthedocs.io/en/latest/api/client2/

An academic research product:
https://github.com/twitterdev/getting-started-with-the-twitter-api-v2-for-academic-research/blob/main/modules/6a-labs-code-academic-python.md

A standard product:
https://github.com/twitterdev/getting-started-with-the-twitter-api-v2-for-academic-research/blob/main/modules/6b-labs-code-standard-python.md