# Scraping Social Media (e.g., Twitter, Reddit) with APIs
### Last Updated: 09/17/18

## What is an API?!

API stands for Application Programming Interface. For a social media site, it is essentially a defined set of endpoints and rules for requesting and interacting with the data that the site captures.

APIs are normally meant for developers who wish to integrate their apps with a social media site, but we are using them to collect data for academic purposes... FOR SCIENCE!

## Scraping Twitter

__Important note__: These days it is nearly impossible to scrape complete samples from Twitter without paying for access. The code below is just a taste of the kind of code you could use once you are granted access to the firehose (after paying for the API).

For more information: https://developer.twitter.com/en/pricing.html

### Creating a Twitter Application account

https://developer.twitter.com/en/apply/account
https://developer.twitter.com/en/account/get-started


### Importing required packages...

If the packages are not already installed, you can use either `pip` or `conda` to install them.

Note: There are a lot of Twitter API wrappers for Python; `twitter` is just the one I have used before and know works. Others (e.g., `tweepy`) may work better or be more suitable for your needs.

For more information on the `twitter` package: https://pypi.org/project/twitter/

In [81]:
from twitter import Twitter, OAuth
import pandas as pd

In [82]:
con_key = "biTvmyLGlsd96hwwKhuaSxBuY"
con_sec = "UiyuUH5d4MKXPkncE9WNFiyjVSo6hJTDwowHoZ2O9q4DKsiN1L"
access_token = "51968999-JOR8FivsOq3bHjm2nlCS6UTCR5VKPzcmxSCkRRaeu"
access_token_sec = "xw4IlcOnyLjwwAxOgEcZtNSlwkV9Y8CcKWOj428e2CWW3"

t = Twitter(auth=OAuth(access_token, access_token_sec,
                       con_key, con_sec))
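
Hard-coding keys in a notebook makes them easy to leak when the notebook is shared. As a minimal sketch of one alternative, the helper below reads the four OAuth values from environment variables. The function name and the variable names (`TWITTER_CONSUMER_KEY` and so on) are assumptions made for illustration, not something the `twitter` package requires.

```python
import os


def load_twitter_credentials(env=None):
    """Return the four OAuth values, read from environment variables.

    The environment variable names are hypothetical; use whatever names
    you export in your own shell.
    """
    env = os.environ if env is None else env
    names = {
        "con_key": "TWITTER_CONSUMER_KEY",
        "con_sec": "TWITTER_CONSUMER_SECRET",
        "access_token": "TWITTER_ACCESS_TOKEN",
        "access_token_sec": "TWITTER_ACCESS_TOKEN_SECRET",
    }
    # Fail early with a readable message if anything is missing
    missing = [var for var in names.values() if var not in env]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in names.items()}
```

With the variables exported, `creds = load_twitter_credentials()` returns a dictionary whose values can be passed to `OAuth(creds["access_token"], creds["access_token_sec"], creds["con_key"], creds["con_sec"])` in the same order as the cell above.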

### Searching for historical tweets on Twitter
For more information on how the Twitter API works: https://developer.twitter.com/en/docs/tweets/search/overview

Very important note: the free search API is not as powerful as it was in the past... unless you pay?

#### Kevin exercise

Accessing/collecting Tweets from Kevin's timeline:
https://twitter.com/kjs253


In [75]:
kjs_test = t.statuses.user_timeline(screen_name="kjs253",
                                    exclude_replies = True,
                                    include_rts = 1)

In [78]:
kjs_test

[{'contributors': None,
  'coordinates': None,
  'created_at': 'Sun Sep 16 17:47:25 +0000 2018',
  'entities': {'hashtags': [], 'symbols': [], 'urls': [], 'user_mentions': []},
  'favorite_count': 0,
  'favorited': False,
  'geo': None,
  'id': 1041383066227302400,
  'id_str': '1041383066227302400',
  'in_reply_to_screen_name': None,
  'in_reply_to_status_id': None,
  'in_reply_to_status_id_str': None,
  'in_reply_to_user_id': None,
  'in_reply_to_user_id_str': None,
  'is_quote_status': False,
  'lang': 'en',
  'place': None,
  'retweet_count': 0,
  'retweeted': False,
  'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
  'text': 'Hello, World!',
  'truncated': False,
  'user': {'contributors_enabled': False,
   'created_at': 'Wed Jul 11 04:16:36 +0000 2018',
   'default_profile': True,
   'default_profile_image': True,
   'description': '',
   'entities': {'description': {'urls': []}},
   'favourites_count': 0,
   'follow_request_sent': F

In [76]:
kevin = []

for tweet in kjs_test:
    single = {}
    single["id"] = tweet['id']
    single["created_at"] = tweet['created_at']
    single["text"] = tweet['text']
    kevin.append(single)

In [77]:
kevin

[{'created_at': 'Sun Sep 16 17:47:25 +0000 2018',
  'id': 1041383066227302400,
  'text': 'Hello, World!'}]
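
The same three-field loop comes up for every result set in this notebook, so it can be folded into a small helper. This is only a sketch: `tweets_to_records` is a hypothetical name, and it assumes each tweet is a plain dictionary like the ones printed above.

```python
def tweets_to_records(tweets, fields=("id", "created_at", "text")):
    """Keep only the requested top-level keys from each tweet dictionary."""
    # .get() returns None instead of raising if a key is absent
    return [{field: tweet.get(field) for field in fields} for tweet in tweets]


sample = [{
    "id": 1041383066227302400,
    "created_at": "Sun Sep 16 17:47:25 +0000 2018",
    "text": "Hello, World!",
    "lang": "en",
}]
records = tweets_to_records(sample)
print(records)
```

The resulting list of dictionaries can be handed to `pd.DataFrame(...)` exactly like `kevin`.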

In [83]:
kevin_data = pd.DataFrame(kevin)

In [84]:
kevin_data

| | created_at | id | text |
|---|---|---|---|
| 0 | Sun Sep 16 17:47:25 +0000 2018 | 1041383066227302400 | Hello, World! |
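
The `created_at` column is stored as a string, which makes sorting or filtering by time awkward. As a rough sketch, the standard library can parse that format; the format string below is my reading of the timestamps shown above, and `parse_created_at` is a hypothetical helper name.

```python
from datetime import datetime, timezone

# Matches strings such as 'Sun Sep 16 17:47:25 +0000 2018'
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_created_at(value):
    """Turn a Twitter created_at string into a timezone-aware datetime."""
    return datetime.strptime(value, TWITTER_TIME_FORMAT)


parsed = parse_created_at("Sun Sep 16 17:47:25 +0000 2018")
print(parsed.isoformat())
```

The same format string should also work with `pd.to_datetime(kevin_data["created_at"], format=TWITTER_TIME_FORMAT)` if you would rather convert the whole column at once.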


#### Zach Lowe exercise
Accessing and collecting tweets from Zach Lowe's timeline: https://twitter.com/ZachLowe_NBA

In [79]:
lowe = []
zl_nba = t.statuses.user_timeline(screen_name="ZachLowe_NBA",
                                    exclude_replies = True,
                                    count = 1000,
                                    include_rts = 1)

In [80]:
len(zl_nba)

168
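
Only 168 tweets came back even though `count` was set to 1000. As far as I understand the v1.1 timeline endpoint, it returns at most 200 tweets per request and applies `count` before replies are filtered out, so getting more means paging backwards with `max_id`. The sketch below shows that paging logic in isolation. `fetch_timeline` and `fetch_page` are hypothetical names, and the loop stops at the first empty page, which can end early when `exclude_replies` empties a page.

```python
def fetch_timeline(fetch_page, target=1000, page_size=200):
    """Collect up to `target` tweets by walking backwards with max_id.

    `fetch_page` is any callable that accepts count= and max_id=
    keyword arguments and returns a list of tweet dictionaries.
    """
    collected = []
    max_id = None
    while len(collected) < target:
        params = {"count": page_size}
        if max_id is not None:
            params["max_id"] = max_id
        page = fetch_page(**params)
        if not page:
            break  # nothing older is available
        collected.extend(page)
        # Ask for tweets strictly older than the oldest one seen so far
        max_id = min(tweet["id"] for tweet in page) - 1
    return collected[:target]
```

With the connection from earlier, a call might look like `fetch_timeline(lambda **kw: t.statuses.user_timeline(screen_name="ZachLowe_NBA", include_rts=1, **kw))`. Keep in mind that each page counts against the rate limit.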

### Activity 1
How would you capture the top 1000 tweets from Zach Lowe's timeline in a pandas table? Please include the `id`, `created_at`, and `text` of each tweet in this table. [10 Min]

In [87]:
# Insert answer to Activity 1 here...








### Search for Tweets

In [111]:
search = t.search.tweets(q="#NBA",
                         count=1000)

The search object contains two larger objects: 1) `search_metadata`, which tells you how long the search took, how many tweets were collected, the id strings, and so on; and 2) `statuses`, which is the meat of what you want from a query like this.

In [122]:
search

{'search_metadata': {'completed_in': 0.061,
  'count': 100,
  'max_id': 1041799276224020480,
  'max_id_str': '1041799276224020480',
  'next_results': '?max_id=1041791029098307583&q=%23NBA&count=100&include_entities=1',
  'query': '%23NBA',
  'refresh_url': '?since_id=1041799276224020480&q=%23NBA&include_entities=1',
  'since_id': 0,
  'since_id_str': '0'},
 'statuses': [{'contributors': None,
   'coordinates': None,
   'created_at': 'Mon Sep 17 21:21:17 +0000 2018',
   'entities': {'hashtags': [{'indices': [76, 84], 'text': 'Nba2k19'},
     {'indices': [85, 91], 'text': 'Nba2K'},
     {'indices': [92, 100], 'text': '2KGames'},
     {'indices': [101, 105], 'text': 'NBA'},
     {'indices': [106, 113], 'text': 'Ps4Pro'},
     {'indices': [114, 118], 'text': 'Ps4'},
     {'indices': [119, 128], 'text': 'XboxOneX'},
     {'indices': [129, 137], 'text': 'XboxOne'}],
    'symbols': [],
    'urls': [{'display_url': 'deadarticgames.com/2018/07/nba-2k…',
      'expanded_url': 'https://www.deadar

Because the search endpoint caps `count` (the `search_metadata` above reports `'count': 100`), I think a single request can only collect 100 of the most recent tweets that feature `#NBA`, even though we asked for 1000.

In [118]:
len(search['statuses'])

100
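
To go past the first 100 results, the `next_results` string in `search_metadata` already contains the parameters for the following page. Here is a minimal sketch of turning that string into keyword arguments; `next_page_params` is a hypothetical helper, and the sample value is copied from the output above.

```python
from urllib.parse import parse_qs


def next_page_params(search_metadata):
    """Convert search_metadata['next_results'] into a dict of parameters.

    Returns None when there is no further page.
    """
    next_results = search_metadata.get("next_results")
    if not next_results:
        return None
    parsed = parse_qs(next_results.lstrip("?"))
    # parse_qs wraps every value in a list; keep the single value
    return {key: values[0] for key, values in parsed.items()}


metadata = {
    "next_results": "?max_id=1041791029098307583&q=%23NBA&count=100&include_entities=1"
}
params = next_page_params(metadata)
print(params)
```

The dictionary can then be passed along as `t.search.tweets(**params)`, repeating until `next_results` is missing or the rate limit is reached.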

In [119]:
search_ls = []
for tweet in search['statuses']:
    single = {}
    single["id"] = tweet['id']
    single["created_at"] = tweet['created_at']
    single["text"] = tweet['text']
    search_ls.append(single)

In [120]:
search_pandas = pd.DataFrame(search_ls)

In [121]:
search_pandas

| | created_at | id | text |
|---|---|---|---|
| 0 | Mon Sep 17 21:21:17 +0000 2018 | 1041799276224020480 | RT @DeadarticGames: NBA 2K19 - Special Announc... |
| 1 | Mon Sep 17 21:21:17 +0000 2018 | 1041799275406127105 | RT @KingsCustoms_: Had an amazing opportunity ... |
| 2 | Mon Sep 17 21:21:08 +0000 2018 | 1041799239083286528 | RT @NBAJPN: 🇯🇵 日本代表として存在感を見せつけた渡邊雄太（グリズリーズ）。代表... |
| 3 | Mon Sep 17 21:20:53 +0000 2018 | 1041799175736885248 | RT @LoudCityPod: Russ’ knee, Jersey rants, Top... |
| 4 | Mon Sep 17 21:20:34 +0000 2018 | 1041799095441010689 | On this weeks eps we review “The Predator” wit... |
| 5 | Mon Sep 17 21:19:36 +0000 2018 | 1041798850829332481 | Nike Air Force 1 Jewel Mid Olive Gum \| \n👉 Inf... |
| 6 | Mon Sep 17 21:18:43 +0000 2018 | 1041798629336526848 | RT @TRConsulChicago: #NBA’s Turkish stars @ced... |
| 7 | Mon Sep 17 21:18:43 +0000 2018 | 1041798628489089024 | #DwyaneWade regresa para su temporada #16 en l... |
| 8 | Mon Sep 17 21:18:37 +0000 2018 | 1041798604959219719 | I would really really appreciate it if someone... |
| 9 | Mon Sep 17 21:18:34 +0000 2018 | 1041798591348715520 | RT @HankNoah: Another Monday, #AnotherNBAPodca... |


### Stream public tweets that are happening in real-time
We will use Twython to do this (no particular reason, just because Kevin learned to use this before).

For more info on Twython: https://twython.readthedocs.io/en/latest/

In [126]:
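from twython import TwythonStreamer

streamed_tweets = []  # collected tweets, reused by the cells below

class MyStreamer(TwythonStreamer):

    def on_success(self, data):
        # Not every stream message is a tweet, so .get() avoids a KeyError
        if data.get('lang') == 'en':
            streamed_tweets.append(data)
            print("received tweet", len(streamed_tweets))

        # Stop once five English-language tweets have been collected
        if len(streamed_tweets) >= 5:
            self.disconnect()

    def on_error(self, status_code, data):
        print(status_code, data)
        self.disconnect()  # call the method so the stream actually stops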

In [127]:
stream = MyStreamer(con_key, con_sec, access_token, access_token_sec)
stream.statuses.filter(track='iphone')

received tweet 1
received tweet 2
received tweet 3
received tweet 4
received tweet 5


In [128]:
streamed_tweets[1]

{'contributors': None,
 'coordinates': None,
 'created_at': 'Mon Sep 17 21:33:34 +0000 2018',
 'display_text_range': [12, 85],
 'entities': {'hashtags': [],
  'symbols': [],
  'urls': [],
  'user_mentions': [{'id': 963045220668583936,
    'id_str': '963045220668583936',
    'indices': [0, 11],
    'name': 'Miss DABS',
    'screen_name': 'dabs_tweet'}]},
 'favorite_count': 0,
 'favorited': False,
 'filter_level': 'low',
 'geo': None,
 'id': 1041802368151707649,
 'id_str': '1041802368151707649',
 'in_reply_to_screen_name': 'dabs_tweet',
 'in_reply_to_status_id': 1041634308535058433,
 'in_reply_to_status_id_str': '1041634308535058433',
 'in_reply_to_user_id': 963045220668583936,
 'in_reply_to_user_id_str': '963045220668583936',
 'is_quote_status': False,
 'lang': 'en',
 'place': None,
 'quote_count': 0,
 'reply_count': 0,
 'retweet_count': 0,
 'retweeted': False,
 'source': '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
 'text': "@dabs_tweet Buy hi

In [129]:
stream_ls = []
for tweet in streamed_tweets:
    single = {}
    single["id"] = tweet['id']
    single["created_at"] = tweet['created_at']
    single["text"] = tweet['text']
    stream_ls.append(single)

In [130]:
stream_pandas = pd.DataFrame(stream_ls)

In [131]:
stream_pandas

| | created_at | id | text |
|---|---|---|---|
| 0 | Mon Sep 17 21:33:34 +0000 2018 | 1041802367497330688 | @jacksfilms Have people already said IPhone X ... |
| 1 | Mon Sep 17 21:33:34 +0000 2018 | 1041802368151707649 | @dabs_tweet Buy him an iPhone Xs Max, and you'... |
| 2 | Mon Sep 17 21:33:36 +0000 2018 | 1041802373880922112 | RT @verge: Password Autofill in iOS 12 now wor... |
| 3 | Mon Sep 17 21:33:36 +0000 2018 | 1041802375583997958 | This is the real invente Roman invente |
| 4 | Mon Sep 17 21:33:37 +0000 2018 | 1041802376754212864 | This is the equivalent of a subtweet from Appl... |


### Activity 2

Please stream 100 tweets that feature the word "sociology." Present this data in a pandas dataframe that includes the following four columns: `created_at`, `id`, `text`, `screen_name`

In [None]:
# Please input your answers for activity 2 here










## Scraping Reddit

__Last Updated: 09/17/18__

This file is an implementation of the tutorial found here: http://www.storybench.org/how-to-scrape-reddit-with-python/ and https://github.com/aleszu/reddit-sentiment-analysis/blob/master/r_subreddit.py

For more information regarding `PRAW` please consult: https://praw.readthedocs.io/en/latest/getting_started/quick_start.html 

Thank you internet!

### Load the required packages

In [2]:
import praw
import pandas as pd

### Reddit API credentials
To scrape Reddit successfully, you will first need to create a Reddit account and register an "app" with Reddit:

https://www.reddit.com/prefs/apps

In [3]:
reddit = praw.Reddit(client_id='MUBUPDCB0p4Stw', \
                     client_secret='bTNn4exVwxyMISlW4L171d1_NsU', \
                     user_agent='compsoc_bootcamp_2018', \
                     username='soc_bootcamp2018', \
                     password='kevinshih')

### Scraping NBA Reddit

https://www.reddit.com/r/nba/

In [4]:
subreddit = reddit.subreddit('nba') # Set the subreddit of interest

In [5]:
for submission in subreddit.hot(limit=5):
    print(submission.title, submission.id)

Daily Locker Room and Free Talk + Game Threads Index (2018.09.17) 9gk8f5
[Announcement] More AMA's! 9gmsy2
[Wojnarowski] The San Antonio Spurs are hiring Brent Barry, a television analyst and 14-year NBA veteran, into a front-office position, league sources tell ESPN. 9gkez4
At this time last year, Kawhi and Boogie were considered loyal players while Paul George was considered a snake. 9gk6pt
[MacMahon] The word out of Dallas is that Doncic has often been the best player on the court in pickup games at the Mavs' facility despite not being in great shape by NBA standards... These games have included local products like LaMarcus Aldridge. 9gm4pz


In [6]:
# One list per column; line continuations are not needed inside braces
topics_dict = {"title": [],
               "score": [],
               "id": [],
               "url": [],
               "comms_num": [],
               "created": [],
               "body": []}

In [7]:
for submission in subreddit.hot(limit = 10):
    topics_dict["title"].append(submission.title)
    topics_dict["score"].append(submission.score)
    topics_dict["id"].append(submission.id)
    topics_dict["url"].append(submission.url)
    topics_dict["comms_num"].append(submission.num_comments)
    topics_dict["created"].append(submission.created)
    topics_dict["body"].append(submission.selftext)

In [8]:
len(topics_dict["id"])

10
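
The dictionary-of-lists approach above works, but every column needs its own `append` line. A possible alternative, sketched here under the assumption that each submission exposes the attributes used in the loop above, builds one dictionary per row instead. `COLUMNS` and `submissions_to_rows` are hypothetical names, and the `SimpleNamespace` object only stands in for a real PRAW submission.

```python
from types import SimpleNamespace

# Table column name -> attribute name on the submission object
COLUMNS = {
    "title": "title",
    "score": "score",
    "id": "id",
    "url": "url",
    "comms_num": "num_comments",
    "created": "created",
    "body": "selftext",
}


def submissions_to_rows(submissions, columns=COLUMNS):
    """Build one dictionary per submission, keyed by table column name."""
    return [
        {column: getattr(submission, attribute)
         for column, attribute in columns.items()}
        for submission in submissions
    ]


# Stand-in object so the sketch runs without a Reddit connection
fake = SimpleNamespace(title="[Announcement] More AMA's!", score=68, id="9gmsy2",
                       url="https://www.reddit.com/r/nba/", num_comments=26,
                       created=1537237000.0, selftext="")
rows = submissions_to_rows([fake])
print(rows[0]["title"], rows[0]["comms_num"])
```

With a live connection, `pd.DataFrame(submissions_to_rows(subreddit.hot(limit=10)))` should produce the same columns as `topics_dict`.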

In [9]:
topics_data = pd.DataFrame(topics_dict)

In [10]:
topics_data

| | body | comms_num | created | id | score | title | url |
|---|---|---|---|---|---|---|---|
| 0 | #[/r/NBA Rules](https://www.reddit.com/r/nba/w... | 38 | 1537219000.0 | 9gk8f5 | 24 | Daily Locker Room and Free Talk + Game Threads... | https://www.reddit.com/r/nba/comments/9gk8f5/d... |
| 1 | On Thursday, September 20th at 2 ET, we will b... | 26 | 1537237000.0 | 9gmsy2 | 68 | [Announcement] More AMA's! | https://www.reddit.com/r/nba/comments/9gmsy2/a... |
| 2 | | 254 | 1537221000.0 | 9gkez4 | 3674 | [Wojnarowski] The San Antonio Spurs are hiring... | https://twitter.com/wojespn/status/10416838633... |
| 3 | But now, Paul George is considered loyal for s... | 457 | 1537219000.0 | 9gk6pt | 3081 | At this time last year, Kawhi and Boogie were ... | https://www.reddit.com/r/nba/comments/9gk6pt/a... |
| 4 | | 291 | 1537233000.0 | 9gm4pz | 840 | [MacMahon] The word out of Dallas is that Donc... | http://www.espn.com/nba/story/_/id/24704418/nb... |
| 5 | | 397 | 1537227000.0 | 9glcuq | 943 | Shams - Karl Anthony Towns has not yet signed ... | https://streamable.com/nitqk |
| 6 | | 152 | 1537220000.0 | 9gkbvj | 840 | [Toronto Raptors] Fun fact: Leonard’s hand mea... | https://twitter.com/raptors/status/10416617977... |
| 7 | | 323 | 1537232000.0 | 9gm20f | 421 | [Mychal Thompson] If Klay came to L.A., which ... | https://www.ibtimes.com/lebron-james-affects-k... |
| 8 | Hi y'all! Just for fun, I've been trying to fi... | 105 | 1537227000.0 | 9gl9fm | 546 | [OC] Wine Pairings for each NBA team and fanbase | https://www.reddit.com/r/nba/comments/9gl9fm/o... |
| 9 | | 313 | 1537242000.0 | 9gnhpo | 246 | [Robinson] Spoke to a source today who confirm... | https://twitter.com/ScoopB/status/104176389512... |


In [11]:
topics_data.to_csv('x.csv')
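
The `created` column holds epoch timestamps (seconds since 1970), which are hard to read. As a small sketch, the standard library can turn one into a readable datetime. Treating the value as UTC is an assumption on my part; PRAW also exposes a `created_utc` attribute, which is probably the safer field to store. `epoch_to_utc` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def epoch_to_utc(seconds):
    """Interpret an epoch timestamp as a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


stamp = epoch_to_utc(1537219340.0)
print(stamp.isoformat())
```

For a whole column, `pd.to_datetime(topics_data["created"], unit="s")` does the same conversion in one step.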

#### Scraping the top comments in thread "9gk8f5"

https://www.reddit.com/r/nba/comments/9gk8f5/daily_locker_room_and_free_talk_game_threads/

In [12]:
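submission = reddit.submission(id="9gk8f5")
# Drop "load more comments" placeholders, which have no .body attribute
submission.comments.replace_more(limit=0)

for top_level_comment in submission.comments:
    print(top_level_comment.body, top_level_comment.created)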

This is going to be the longest month of my life holy shit... 1537219340.0
Gotta love school bookstores. They sell you an IClicker with non-functioning batteries and tell you its your damn fault that they're not working. Bet they wouldn't say the same shit if one of their books had missing pages. "Too fucking bad, that's what happens when you rent a used book." Fuck off, and fuck that swarmy prick sitting at the desk. 1537227634.0
idk if it's a sunday afternoon thing, but i cannot watch a NFL game without falling asleep. 1537225971.0
That Blake Bortles fuckin boomed me 1537226941.0
How we doing this morning gentlemen  1537223771.0
Is 30 teams in 30 days happening this year? 1537233058.0
Nfl robbed us the Packers of the W 1537220534.0
I can't believe Nick Young hasn't signed anywhere yet. Where do y'all think he's going?  1537231250.0
Instead of studying for university (tons of math), I just reorganize my living space, run some family errands and lurk on reddit.. 1537244681.0
Me and som

In [13]:
top_comms_dict = {"topic": [],
                  "body": [],
                  "comm_id": [],
                  "created": []}

In [14]:
for top_level_comment in submission.comments:
    top_comms_dict["topic"].append("9gk8f5")
    top_comms_dict["body"].append(top_level_comment.body)
    # Store the id string rather than the whole Comment object
    top_comms_dict["comm_id"].append(top_level_comment.id)
    top_comms_dict["created"].append(top_level_comment.created)

In [15]:
len(top_comms_dict["topic"])

19

In [16]:
top_comms_data = pd.DataFrame(top_comms_dict)

In [17]:
top_comms_data

| | body | comm_id | created | topic |
|---|---|---|---|---|
| 0 | This is going to be the longest month of my li... | e64p9b6 | 1537219000.0 | 9gk8f5 |
| 1 | Gotta love school bookstores. They sell you an... | e64y0wh | 1537228000.0 | 9gk8f5 |
| 2 | idk if it's a sunday afternoon thing, but i ca... | e64w3xe | 1537226000.0 | 9gk8f5 |
| 3 | That Blake Bortles fuckin boomed me | e64x7qr | 1537227000.0 | 9gk8f5 |
| 4 | How we doing this morning gentlemen | e64to5w | 1537224000.0 | 9gk8f5 |
| 5 | Is 30 teams in 30 days happening this year? | e654h4r | 1537233000.0 | 9gk8f5 |
| 6 | Nfl robbed us the Packers of the W | e64qd80 | 1537221000.0 | 9gk8f5 |
| 7 | I can't believe Nick Young hasn't signed anywh... | e652acg | 1537231000.0 | 9gk8f5 |
| 8 | Instead of studying for university (tons of ma... | e65inc9 | 1537245000.0 | 9gk8f5 |
| 9 | Me and some friends are going to Indianapolis ... | e64r6bz | 1537221000.0 | 9gk8f5 |


#### Scraping all comments in thread "9gk8f5"

In [18]:
all_comms_dict = {"topic": [],
                  "body": [],
                  "comm_id": [],
                  "created": []}

In [19]:
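# .list() flattens the comment tree, so replies are included as well
for all_level_comment in submission.comments.list():
    all_comms_dict["topic"].append("9gk8f5")
    all_comms_dict["body"].append(all_level_comment.body)
    # Store the id string rather than the whole Comment object
    all_comms_dict["comm_id"].append(all_level_comment.id)
    all_comms_dict["created"].append(all_level_comment.created)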

In [20]:
len(all_comms_dict["topic"])

38
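
The count went from 19 to 38 because `submission.comments.list()` flattens the reply tree instead of stopping at the top level. As far as I understand PRAW, the flattening is breadth-first, which would explain why the first rows of the table match the top-level comments. The sketch below imitates that traversal on plain dictionaries. The `replies` key and the sample ids are made up for illustration, and this is not PRAW's actual code.

```python
from collections import deque


def flatten_breadth_first(top_level):
    """Flatten a nested comment tree level by level.

    Each node is a dictionary with an optional 'replies' list.
    """
    flat = []
    queue = deque(top_level)
    while queue:
        node = queue.popleft()
        flat.append(node)
        # Replies join the back of the queue, after all earlier siblings
        queue.extend(node.get("replies", []))
    return flat


thread = [
    {"id": "a", "replies": [{"id": "a1", "replies": [{"id": "a1x"}]}]},
    {"id": "b", "replies": []},
]
ids = [node["id"] for node in flatten_breadth_first(thread)]
print(ids)
```

All top-level comments come out first, followed by their replies, then replies to replies.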

In [21]:
comments_data = pd.DataFrame(all_comms_dict)

In [22]:
comments_data

| | body | comm_id | created | topic |
|---|---|---|---|---|
| 0 | This is going to be the longest month of my li... | e64p9b6 | 1537219000.0 | 9gk8f5 |
| 1 | Gotta love school bookstores. They sell you an... | e64y0wh | 1537228000.0 | 9gk8f5 |
| 2 | idk if it's a sunday afternoon thing, but i ca... | e64w3xe | 1537226000.0 | 9gk8f5 |
| 3 | That Blake Bortles fuckin boomed me | e64x7qr | 1537227000.0 | 9gk8f5 |
| 4 | How we doing this morning gentlemen | e64to5w | 1537224000.0 | 9gk8f5 |
| 5 | Is 30 teams in 30 days happening this year? | e654h4r | 1537233000.0 | 9gk8f5 |
| 6 | Nfl robbed us the Packers of the W | e64qd80 | 1537221000.0 | 9gk8f5 |
| 7 | I can't believe Nick Young hasn't signed anywh... | e652acg | 1537231000.0 | 9gk8f5 |
| 8 | Instead of studying for university (tons of ma... | e65inc9 | 1537245000.0 | 9gk8f5 |
| 9 | Me and some friends are going to Indianapolis ... | e64r6bz | 1537221000.0 | 9gk8f5 |


### Activity 3

Please represent the 25 "hottest" topics from the "Sociology" subreddit (https://www.reddit.com/r/sociology) in a pandas table with the following columns: "Body", "Number of comments", "Date created", "Title", and "URL".

In [25]:
# Please insert your answers for activity 3 here....







### Activity 4 (If time allows)

IF THERE IS TIME... Please scrape all the comments in this thread: https://www.reddit.com/r/sociology/comments/9fba2z/what_important_sociological_ideas_are_in/

Represent this data in the form of a pandas table, with the following columns: body, comment id, and date created

In [33]:
# Please insert your answers for activity 4 here....




