## 1. Different ways to access data on the web
 - Scrape HTML web pages
 - Download data files directly
    * data files such as CSV or TXT
    * PDF files
 - Access data through an Application Programming Interface (API), e.g. The Movie DB, Twitter
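
As a rough illustration of the "download a data file directly" option, here is a minimal sketch using only the standard library. The helper names `parse_csv_text` and `download_csv` and the sample data are made up for this example; `download_csv` performs a real network request, so the demonstration only parses an inline string.

```python
import csv
import io
import urllib.request


def parse_csv_text(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def download_csv(url, timeout=10):
    """Fetch a CSV file over HTTP and parse it (performs a network request)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        text = response.read().decode("utf-8")
    return parse_csv_text(text)


# Offline demonstration: a small made-up sample stands in for a downloaded file.
sample = "name,followers\nalice,10\nbob,25\n"
rows = parse_csv_text(sample)
print(rows[1]["name"], rows[1]["followers"])
```

Note that `csv.DictReader` returns every value as a string, so numeric columns still need converting before analysis.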

Reference datasets:
- http://snap.stanford.edu/class/cs224w-2013/resources.html
- http://socialcomputing.asu.edu/pages/datasets
- https://www.yelp.com/dataset/challenge
- https://aminer.org/data-sna
- https://www.kaggle.com/datasets?sortBy=relevance&group=public&search=social+network&page=1&pageSize=20&size=sizeAll&filetype=fileTypeAll&license=licenseAll

## 2. Access data through an API (e.g. tweets)
- Online content providers usually provide APIs for you to access their data.
   * Python packages often wrap these APIs, e.g. tweepy, a third-party client library for the Twitter API
   
- You need to read the API documentation to figure out how to access the data
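
API documentation typically tells you the endpoint URL, the query parameters it accepts, and the shape of the JSON it returns. The sketch below shows those three pieces with the standard library only. The endpoint `https://api.example.com/v1/search`, its parameters, and the canned response are all hypothetical, and no request is actually sent.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, used only to illustrate how a request URL is assembled.
BASE_URL = "https://api.example.com/v1/search"


def build_request_url(base_url, **params):
    """Append URL-encoded query parameters to a base URL."""
    return base_url + "?" + urlencode(params)


url = build_request_url(BASE_URL, q="social network", page=1)
print(url)

# A made-up JSON body standing in for what such an API might return.
canned_response = '{"results": [{"id": 1, "title": "Example"}], "total": 1}'
payload = json.loads(canned_response)
print(payload["total"], payload["results"][0]["title"])
```

Client packages such as tweepy hide these steps behind method calls, but the documentation still determines which parameters are valid.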



## 2.1. Access tweets through the tweepy package

- Twitter terminology (https://support.twitter.com/articles/166337)
  - **@{username}**: mentions the account {username} in a tweet
  - **\#{topic}**: a hashtag marks a keyword or topic
  - **follow**: subscribing to a Twitter account
  - **reply**: a response to another person's tweet
  - **retweet (n.)**: a tweet that you forward to your followers
  - **like (n.)**: indicates appreciation of a tweet
  - **timeline**: a real-time stream of tweets. Your Home timeline, for instance, is where you see all the tweets shared by your friends and other people you follow.
  - **Twitter emoji**: a specific series of letters immediately preceded by the # sign that generates an icon on Twitter, such as a national flag or another small image.
- Register your app at http://apps.twitter.com


In [None]:
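# Install tweepy. The calls used below (api.search, api.friends, positional
# api.get_user) follow the tweepy 3.x API; tweepy 4 renamed some of them,
# so the version is pinned here.
!pip install "tweepy<4"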

In [2]:
import tweepy

# import twitter authentication module
from tweepy import OAuthHandler

# import the python package to handle datetime
import datetime

import csv

In [3]:
consumer_key = 'replace with your own consumer_key'
consumer_secret = 'replace with your own consumer_secret'
access_token = 'replace with your own access_token'
access_secret = 'replace with your own access_secret'
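
Hard-coding credentials in a notebook makes them easy to leak when the notebook is shared. One common alternative is to read them from environment variables. The sketch below assumes hypothetical variable names such as `TWITTER_CONSUMER_KEY` and a helper `load_credentials`. Neither is part of tweepy.

```python
import os

# Hypothetical variable names; choose whatever fits your setup.
REQUIRED_KEYS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_SECRET",
)


def load_credentials(env=None):
    """Return the four credentials from a mapping, defaulting to os.environ."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED_KEYS}


# Demonstration with a fake environment so that no real secrets are needed.
fake_env = {key: "dummy-" + key.lower() for key in REQUIRED_KEYS}
credentials = load_credentials(fake_env)
print(sorted(credentials))
```

The returned values could then be passed to `OAuthHandler` and `set_access_token` in place of the literal strings.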

**OAuth Authentication**

In [5]:
#Twitter requires all requests to use OAuth for authentication. 
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth)

### Print basic information about a user

In [6]:
user = api.get_user('realDonaldTrump')
print("Name:",user.name)
print("Location:",user.location)
print("Following:",user.friends_count)
print("Followers:",user.followers_count)

Name: Donald J. Trump
Location: Washington, DC
Following: 46
Followers: 55613953


In [7]:
for friend in user.friends():
    print(friend.screen_name)

VP
GOPChairwoman
parscale
PressSec
TuckerCarlson
JesseBWatters
WhiteHouse
Scavino45
KellyannePolls
Reince
RealRomaDowney
Trump
TrumpGolf
TiffanyATrump
IngrahamAngle
mike_pence
TeamTrump
DRUDGE_REPORT
MrsVanessaTrump
LaraLeaTrump


In [8]:
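import time
friends = []
for page in tweepy.Cursor(api.friends, screen_name='realDonaldTrump').pages():
    # Use a separate loop variable so the `user` object from above is not overwritten.
    for friend in page:
        friends.append(friend.screen_name)
    # Uncomment to pause between pages if you hit the API rate limit.
    # time.sleep(60)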

In [9]:
friends

['VP',
 'GOPChairwoman',
 'parscale',
 'PressSec',
 'TuckerCarlson',
 'JesseBWatters',
 'WhiteHouse',
 'Scavino45',
 'KellyannePolls',
 'Reince',
 'RealRomaDowney',
 'Trump',
 'TrumpGolf',
 'TiffanyATrump',
 'IngrahamAngle',
 'mike_pence',
 'TeamTrump',
 'DRUDGE_REPORT',
 'MrsVanessaTrump',
 'LaraLeaTrump',
 'seanhannity',
 'CLewandowski_',
 'AnnCoulter',
 'DiamondandSilk',
 'KatrinaCampins',
 'KatrinaPierson',
 'foxandfriends',
 'MELANIATRUMP',
 'GeraldoRivera',
 'ericbolling',
 'MarkBurnettTV',
 'garyplayer',
 'VinceMcMahon',
 'DanScavino',
 'TrumpDoral',
 'TrumpCharlotte',
 'TrumpLasVegas',
 'TrumpChicago',
 'TrumpGolfDC',
 'TrumpGolfLA',
 'EricTrump',
 'BillOReilly',
 'greta',
 'piersmorgan',
 'DonaldJTrumpJr',
 'IvankaTrump']
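
`user.friends()` returned only the first page (20 names), while the `Cursor` loop walked through every page and collected all 46. The commented-out `time.sleep(60)` is a manual guard against rate limits. The pattern can be sketched without any network access as follows. `collect_pages` and the fake pages are illustrative only, and the `sleep` argument exists so the example does not actually wait.

```python
import time


def collect_pages(pages, pause_seconds=0.0, sleep=time.sleep):
    """Flatten an iterable of pages into one list, pausing between pages."""
    items = []
    for index, page in enumerate(pages):
        # Pause before every page except the first to stay under a rate limit.
        if index > 0 and pause_seconds > 0:
            sleep(pause_seconds)
        items.extend(page)
    return items


# Fake pages standing in for what a paginated API call would yield.
fake_pages = [["VP", "WhiteHouse"], ["EricTrump"]]
naps = []  # records requested pauses instead of really sleeping
names = collect_pages(fake_pages, pause_seconds=60, sleep=naps.append)
print(names, naps)
```

tweepy can also handle this itself: passing `wait_on_rate_limit=True` to `tweepy.API` makes it sleep until the limit resets.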

### Print your home timeline tweets

For example, we can read our own timeline (i.e. our Twitter homepage) with:

In [10]:
public_tweets = api.home_timeline()
for tweet in public_tweets:
    print ("@"+tweet.user.screen_name+": "+tweet.text)

@ABC: A funeral Mass for notorious Boston gangster James "Whitey" Bulger has been held in South Boston.… https://t.co/4qUWXhBdMy
@ABC: Hundreds of people line the highways and overpasses, many waving flags, for the procession for Sgt. Ron Helus, who… https://t.co/1Bsb2LnhDq
@chrisalbon: Detect the brand of a piece of clothing being worn by someone from a grainy iPhone camera video: easy

Build a dong… https://t.co/kdEPgKPwmb
@ABC: NEW: U.S. appeals court keeps in place preliminary injunction blocking Pres. Trump's decision to phase out DACA.… https://t.co/y2k7MVnSHo
@ABC: Actress Tamera Mowry-Housley, husband Adam Housley trying to locate niece after Thousand Oaks shooting.… https://t.co/CsorZZuCuh
@ABC: The 2018 map looked more like 2012 than 2016, with Democrats performing quite well in Wisconsin, Michigan and Penns… https://t.co/tjMOCfcIPp
@GoodwinMJ: Come to Europe Frank... 

Sweden turnout 87%
Germany 76%
France Round 2 75%
Brexit referendum 72% https://t.co/43llmBFiDO
@GoodwinMJ:

In [11]:
for status in tweepy.Cursor(api.home_timeline,tweet_mode='extended').items(10):
    print(status.full_text)

A funeral Mass for notorious Boston gangster James "Whitey" Bulger has been held in South Boston. https://t.co/ZboJM9mbnw https://t.co/vcnBgGQKOJ
Hundreds of people line the highways and overpasses, many waving flags, for the procession for Sgt. Ron Helus, who was killed responding to Thousand Oaks mass shooting overnight.

Helus was a 29-year veteran who was about to retire. https://t.co/ITyE0YqOp1 https://t.co/8jwNq76kDj
Detect the brand of a piece of clothing being worn by someone from a grainy iPhone camera video: easy

Build a dongle with multiple USB-C ports: seemingly fucking impossible
NEW: U.S. appeals court keeps in place preliminary injunction blocking Pres. Trump's decision to phase out DACA. https://t.co/iTfL9W0o7I https://t.co/CMKgUMSEv9
Actress Tamera Mowry-Housley, husband Adam Housley trying to locate niece after Thousand Oaks shooting. https://t.co/Ocf0Zse42v https://t.co/YhJur2A3Lo
The 2018 map looked more like 2012 than 2016, with Democrats performing quite well in 
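
The two cells above differ in one detail: without `tweet_mode='extended'` the text lives in `.text` and may be cut off with "…", while in extended mode it lives in `.full_text`. A small helper can hide that difference. This is a sketch using stand-in objects built with `SimpleNamespace`, not real tweepy `Status` objects, and `tweet_text` is a name invented for the example.

```python
from types import SimpleNamespace


def tweet_text(status):
    """Return full_text when the status has it, otherwise fall back to text."""
    full_text = getattr(status, "full_text", None)
    return full_text if full_text is not None else status.text


# Stand-ins for statuses fetched in default and in extended mode.
compat_status = SimpleNamespace(text="Build a dong…")
extended_status = SimpleNamespace(full_text="Build a dongle with multiple USB-C ports")

print(tweet_text(compat_status))
print(tweet_text(extended_status))
```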

### Print another user's timeline tweets

In [12]:
for tweet in tweepy.Cursor(api.user_timeline,screen_name='FollowStevens',tweet_mode='extended').items(10):
    print (tweet.full_text)

Stevens researcher Stephanie Lee engineers green, portable, affordable power sources for a brighter future.
https://t.co/1d8D1vi4ZY
Mark your calendars! On Sunday, the @HobokenMuseum presents a lecture by Stevens Dir. of Technology Commercialization, David Zimmerman: “Made in Hoboken: Tales of Inventions with Ties to the Mile-Square City.”
https://t.co/FGHr8IIdiT
Show your Stevens pride this Halloween by carving a Stevens themed pumpkin! We've created a few spooky and not-so-spooky carving stencils to help you create your masterpiece! Don't forget to post a photo using #SpookyStevens

Stencils/Instructions: https://t.co/IRRpvQPyUc https://t.co/jlSxTI2r0x
Stevens researchers &amp; colleagues have modeled how toxic proteins spread throughout the brain to reproduce the atrophy patterns associated w/ Alzheimer’s, Parkinson’s &amp; ALS. 
https://t.co/fiq3jfodvL
Stevens researchers hope to improve the odds and pave the way for more effective breast cancer medications, treatments and therapie

### Finding Tweets based on a Query

In [13]:
for tweet in tweepy.Cursor(api.search,q='League of Legends').items(10):
    print('Tweet by: @' + tweet.user.screen_name)
    print('Tweet:'+ tweet.text)

Tweet by: @DonZarzoso
Tweet:RT @LVPibai: yo: hay que ser bastante inmaduro para enfadarse e insultar por perder una partida en el LoL por favor que estamos en pleno 20…
Tweet by: @gaahboo
Tweet:RT @LVPibai: yo: hay que ser bastante inmaduro para enfadarse e insultar por perder una partida en el LoL por favor que estamos en pleno 20…
Tweet by: @TeenWolf_02
Tweet:RT @LVPibai: yo: hay que ser bastante inmaduro para enfadarse e insultar por perder una partida en el LoL por favor que estamos en pleno 20…
Tweet by: @seoxxxxxx
Tweet:RT @LVPibai: yo: hay que ser bastante inmaduro para enfadarse e insultar por perder una partida en el LoL por favor que estamos en pleno 20…
Tweet by: @lookhackcom
Tweet:Go to https://t.co/YGyKx5TMsA and choose League of Legends image (you will be redirect to League of Legends Generat… https://t.co/wM1WEtexMu
Tweet by: @Fumasca_Gz
Tweet:RT @LVPibai: yo: hay que ser bastante inmaduro para enfadarse e insultar por perder una partida en el LoL por favor que estamos e
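
Most of the results above are retweets of the same original tweet. If only original tweets are of interest, one simple option is to filter on the "RT @" prefix after fetching. This is a sketch on plain strings with made-up sample texts, and `is_retweet` and `drop_retweets` are hypothetical helper names.

```python
def is_retweet(text):
    """Heuristic: classic retweets start with 'RT @username:'."""
    return text.startswith("RT @")


def drop_retweets(texts):
    """Keep only the texts that do not look like retweets."""
    return [text for text in texts if not is_retweet(text)]


# Made-up sample texts shaped like the search output.
sample_texts = [
    "RT @someone: first copy of a popular tweet",
    "RT @someone: first copy of a popular tweet",
    "An original tweet about League of Legends",
]
originals = drop_retweets(sample_texts)
print(originals)
```

The prefix test is only a heuristic, since a user can type "RT @" by hand.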

In [14]:
for tweet in tweepy.Cursor(api.search,
                           q='League of Legends',
                           since='2018-10-01',
                           until='2018-11-07').items(10):
    print('Tweet by: @' + tweet.user.screen_name)

Tweet by: @cosmicsky_ptg
Tweet by: @nicky2533_
Tweet by: @lucasmigueltel1
Tweet by: @hahaitsmedio
Tweet by: @SpecterXV
Tweet by: @angel_moctezuma
Tweet by: @AiyBaan
Tweet by: @RussoTollls
Tweet by: @Kyoutawave
Tweet by: @atype55
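
The `since` and `until` arguments are plain `YYYY-MM-DD` strings. Twitter's standard search has typically covered only about the most recent week, so a `since` date much older than that may have no effect. That is worth checking against the current API documentation. Rather than typing dates by hand, they can be derived with `datetime`. `search_window` is a helper name made up for this sketch.

```python
import datetime


def search_window(today, days_back):
    """Return (since, until) as ISO date strings covering the last days_back days."""
    since = today - datetime.timedelta(days=days_back)
    return since.isoformat(), today.isoformat()


# A fixed date keeps the example reproducible; use datetime.date.today() in practice.
since, until = search_window(datetime.date(2018, 11, 7), 7)
print(since, until)
```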


### Save to csv

In [15]:
users = ["FollowStevens"]

In [16]:
import unicodecsv
from unidecode import unidecode

with open('tweets.csv', 'wb') as file:
    writer = unicodecsv.writer(file, delimiter = ',', quotechar = '"')
    # Write header row.
    writer.writerow(["name",
                    "username",
                    "followers_count",
                    "listed_count",
                    "following",
                    "favorites",
                    "verified",
                    "default_profile",
                    "location",
                    "time_zone",
                    "statuses_count",
                    "description",
                    "geo_enabled",
                    "contributors_enabled",
                    "tweet_year",
                    "tweet_month",
                    "tweet_day",
                    "tweet_hour",
                    "tweet_text",
                    "tweet_lat",
                    "tweet_long",
                    "tweet_source",
                    "tweet_in_reply_to_screen_name",
                    "tweet_direct_reply",
                    "tweet_retweet_status",
                    "tweet_retweet_count",
                    "tweet_favorite_count",
                    "tweet_hashtags",
                    "tweet_hashtags_count",
                    "tweet_urls",
                    "tweet_urls_count",
                    "tweet_user_mentions",
                    "tweet_user_mentions_count",
                    "tweet_media_type",
                    "tweet_contributors"])

    for user in users:
        user_obj = api.get_user(user)
    
        # Gather info specific to the current user.
        user_info = [user_obj.name,
                     user_obj.screen_name,
                     user_obj.followers_count,
                     user_obj.listed_count,
                     user_obj.friends_count,
                     user_obj.favourites_count,
                     user_obj.verified,
                     user_obj.default_profile,
                     user_obj.location,
                     user_obj.time_zone,
                     user_obj.statuses_count,
                     user_obj.description,
                     user_obj.geo_enabled,
                     user_obj.contributors_enabled]

        # Get up to 1000 of the most recent tweets for the current user.
        for tweet in tweepy.Cursor(api.user_timeline, screen_name = user).items(1000):
            # Coordinates are stored GeoJSON-style as [longitude, latitude] within a dictionary.
            lat = tweet.coordinates['coordinates'][1] if tweet.coordinates is not None else None
            long = tweet.coordinates['coordinates'][0] if tweet.coordinates is not None else None
            # A tweet is a direct reply only if it is in reply to a screen name.
            # The attribute is None otherwise, so comparing it with "" was always True.
            direct_reply = tweet.in_reply_to_screen_name is not None
            # Retweets start with "RT ..."
            retweet_status = tweet.text.startswith("RT ")

            # Get info specific to the current tweet of the current user.
            tweet_info = [tweet.created_at.year,
                          tweet.created_at.month,
                          tweet.created_at.day,
                          tweet.created_at.hour,
                          unidecode(tweet.text),
                          lat,
                          long,
                          tweet.source,
                          tweet.in_reply_to_screen_name,
                          direct_reply,
                          retweet_status,
                          tweet.retweet_count,
                          tweet.favorite_count]

            # The entities below are stored as variable-length lists of dictionaries, if present.
            hashtags = []
            hashtags_data = tweet.entities.get('hashtags', None)
            if(hashtags_data != None):
                for i in range(len(hashtags_data)):
                    hashtags.append(unidecode(hashtags_data[i]['text']))

            urls = []
            urls_data = tweet.entities.get('urls', None)
            if(urls_data != None):
                for i in range(len(urls_data)):
                    urls.append(unidecode(urls_data[i]['url']))

            user_mentions = []
            user_mentions_data = tweet.entities.get('user_mentions', None)
            if(user_mentions_data != None):
                for i in range(len(user_mentions_data)):
                    user_mentions.append(unidecode(user_mentions_data[i]['screen_name']))

            media = []
            media_data = tweet.entities.get('media', None)
            if(media_data != None):
                for i in range(len(media_data)):
                    media.append(unidecode(media_data[i]['type']))

            contributors = []
            if(tweet.contributors != None):
                for contributor in tweet.contributors:
                    contributors.append(unidecode(contributor['screen_name']))

            more_tweet_info = [', '.join(hashtags),
                               len(hashtags),
                               ', '.join(urls),
                               len(urls),
                               ', '.join(user_mentions),
                               len(user_mentions),
                               ', '.join(media),
                               ', '.join(contributors)]

            # Write data to CSV.
            writer.writerow(user_info + tweet_info + more_tweet_info)

        # Show progress.
        print("Wrote tweets by %s to CSV." % user)

Wrote tweets by FollowStevens to CSV.
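
On Python 3 the built-in `csv` module handles Unicode directly, so the same kind of file can be written without `unicodecsv`. The sketch below writes to an in-memory buffer to stay self-contained. For a real file, `open('tweets.csv', 'w', newline='', encoding='utf-8')` would take the buffer's place. `join_entities` and `write_rows` are hypothetical helpers, and the entity dictionary is a hand-made stand-in for `tweet.entities`.

```python
import csv
import io


def join_entities(entities, key, field):
    """Join one field of each entity (e.g. hashtag text) into a single string."""
    return ", ".join(item[field] for item in entities.get(key, []))


def write_rows(handle, header, rows):
    """Write a header row followed by data rows to an open text handle."""
    writer = csv.writer(handle, delimiter=",", quotechar='"')
    writer.writerow(header)
    writer.writerows(rows)


# Hand-made stand-in for the `entities` dictionary of a tweet.
entities = {"hashtags": [{"text": "SpookyStevens"}], "urls": []}

buffer = io.StringIO()
write_rows(
    buffer,
    ["username", "tweet_text", "tweet_hashtags"],
    [["FollowStevens", "Hello, world", join_entities(entities, "hashtags", "text")]],
)
print(buffer.getvalue())
```

The writer quotes the field containing a comma automatically, which is why the joined hashtag and URL columns can safely use ", " as a separator.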


In [17]:
import pandas as pd

stevens_tweet=pd.read_csv('tweets.csv')
stevens_tweet.head()

| | name | username | followers_count | listed_count | following | favorites | verified | default_profile | location | time_zone | ... | tweet_retweet_count | tweet_favorite_count | tweet_hashtags | tweet_hashtags_count | tweet_urls | tweet_urls_count | tweet_user_mentions | tweet_user_mentions_count | tweet_media_type | tweet_contributors |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Stevens | FollowStevens | 9295 | 228 | 938 | 3408 | True | False | Hoboken, NJ | | ... | 6 | 12 | | 0 | https://t.co/1d8D1vi4ZY | 1 | | 0 | | |
| 1 | Stevens | FollowStevens | 9295 | 228 | 938 | 3408 | True | False | Hoboken, NJ | | ... | 3 | 14 | | 0 | https://t.co/nxOTC3RZbv | 1 | HobokenMuseum | 1 | | |
| 2 | Stevens | FollowStevens | 9295 | 228 | 938 | 3408 | True | False | Hoboken, NJ | | ... | 4 | 18 | | 0 | https://t.co/FQTxTPhVqg | 1 | | 0 | | |
| 3 | Stevens | FollowStevens | 9295 | 228 | 938 | 3408 | True | False | Hoboken, NJ | | ... | 4 | 18 | | 0 | https://t.co/9Rk2it4w93 | 1 | | 0 | | |
| 4 | Stevens | FollowStevens | 9295 | 228 | 938 | 3408 | True | False | Hoboken, NJ | | ... | 5 | 10 | | 0 | https://t.co/YygR9lmkrM | 1 | | 0 | | |
