# Getting Information from Social Media (Twitter)

<img src="Images/pic1.JPG" alt="Drawing"/>

+ **Web Crawling** is an automated program, system, or script that systematically scans the data available on a website.  
+ **Web Scraping** is the activity of extracting information from web pages. A scraper usually takes that information from the HTML of the page.
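
To make the distinction concrete, here is a minimal sketch using only Python's standard library. It parses a small hand-written HTML string (not a real page): the collected links are what a crawler would follow next, and the collected paragraph text is what a scraper would extract. The class name `LinkAndTextParser` and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class LinkAndTextParser(HTMLParser):
    """Collect link targets (crawling) and paragraph text (scraping)."""

    def __init__(self):
        super().__init__()
        self.links = []        # href values a crawler could visit next
        self.paragraphs = []   # text a scraper would extract
        self._in_paragraph = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "p":
            self._in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        # Only keep text that appears inside a <p> element.
        if self._in_paragraph:
            self.paragraphs[-1] += data


# Hand-written sample page, used instead of a network request.
SAMPLE_HTML = """
<html><body>
<p>Kolak is a sweet dessert.</p>
<a href="/page2">Next page</a>
<a href="/about">About</a>
</body></html>
"""

parser = LinkAndTextParser()
parser.feed(SAMPLE_HTML)
print(parser.paragraphs)
print(parser.links)
```

Tools such as Scweet do considerably more than this (they drive a real browser), but the extract-from-HTML idea is the same.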

### Example Scrapers/Crawlers

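**Official Scraper/Crawler**   : Tweepy (a client for the official Twitter API), Scrapy (a general-purpose crawling framework, not specific to Twitter)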

**Unofficial Scraper/Crawler** : Twitterscraper, Scweet


### Example Scraping

In this session, we will use Scweet as our example web scraper. Scweet scrapes pages of the Twitter website.

#### 1. Import the required libraries

The libraries needed for this scraping session are Scweet and pandas.

In [None]:
!pip install Scweet==1.0
!pip install pandas==1.1.3

In [1]:
from Scweet.scweet import scrap
from Scweet.user import get_user_information, get_users_following, get_users_followers
import pandas as pd

In [2]:
pd.__version__

'1.2.4'

#### 2. Scrape tweets containing certain words

With Scweet, we can retrieve the top tweets that contain certain words by calling the `scrap` function provided by the Scweet library.

In [7]:
# keywords
keywords = ['kolak']

# Date interval
initial_date = '2021-04-20'
finish_date = '2021-04-30'

all_datas = []
for x in keywords:
    data = scrap(words=x,
                 start_date=initial_date,  # important: start of the date range
                 max_date=finish_date,  # important: end of the date range
                 from_account=None,  # can be set to a specific account
                 interval=1, 
                 headless=True,
                 save_images=False,
                 display_type=None,
                 resume=False,
                 filter_replies=True,
                 proximity=True)
    
    data['keyword'] = x
    all_datas.append(data)

all_datas = pd.concat(all_datas)

Scraping on headless mode.
looking for tweets between 2021-04-20 and 2021-04-21 ...
 path : https://twitter.com/search?q=(kolak)%20until%3A2021-04-21%20since%3A2021-04-20%20%20-filter%3Areplies&src=typed_query&lf=on
scroll  1
scroll  2
Tweet made at: 2021-04-20T16:55:47.000Z is found.
scroll  3
scroll  4
looking for tweets between 2021-04-21 and 2021-04-22 ...
 path : https://twitter.com/search?q=(kolak)%20until%3A2021-04-22%20since%3A2021-04-21%20%20-filter%3Areplies&src=typed_query&lf=on
scroll  1
scroll  2
looking for tweets between 2021-04-22 and 2021-04-23 ...
 path : https://twitter.com/search?q=(kolak)%20until%3A2021-04-23%20since%3A2021-04-22%20%20-filter%3Areplies&src=typed_query&lf=on
scroll  1
scroll  2
looking for tweets between 2021-04-23 and 2021-04-24 ...
 path : https://twitter.com/search?q=(kolak)%20until%3A2021-04-24%20since%3A2021-04-23%20%20-filter%3Areplies&src=typed_query&lf=on
scroll  1
scroll  2
looking for tweets between 2021-04-24 and 2021-04-25 ...
 path : ht
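
The log above suggests that the date range is searched one window at a time, with each window `interval` days long and each search query carrying its own `since:` and `until:` bounds. The following is a minimal sketch of that windowing idea, written independently of Scweet. The helper `date_windows` is hypothetical and is not part of Scweet's API, and clipping the last window at the end date is an assumption of this sketch.

```python
from datetime import date, timedelta


def date_windows(start, end, interval_days=1):
    """Yield (since, until) pairs that cover start..end in fixed-size steps."""
    step = timedelta(days=interval_days)
    since = start
    while since < end:
        until = min(since + step, end)  # clip the last window at the end date
        yield since, until
        since = until


windows = list(date_windows(date(2021, 4, 20), date(2021, 4, 30), interval_days=1))

# Search text modelled on the (URL-decoded) query visible in the log.
queries = [
    f"(kolak) until:{until.isoformat()} since:{since.isoformat()} -filter:replies"
    for since, until in windows
]

print(len(windows))
print(queries[0])
```

A larger `interval` produces fewer windows and therefore fewer page loads, at the cost of more scrolling per window.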

In [8]:
all_datas

| Unnamed: 0 | UserScreenName | UserName | Timestamp | Text | Embedded_text | Emojis | Comments | Likes | Retweets | Image link | Tweet URL | keyword |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | brunns | @fikrialhakimm | 2021-04-20T16:55:47.000Z | Bisa bisanya manusia makan kolak durian, aneh | | | | | | [] | https://twitter.com/fikrialhakimm/status/13845... | kolak |
| 1 | Waroeng Snoepen | @wsnoepen | 2021-04-24T05:52:15.000Z | Selamat siang, 12:47 WIB dan sudah mulai bosen... | | | | 2.0 | | [https://pbs.twimg.com/media/Ezt4n7_VkAA_pcd?f... | https://twitter.com/wsnoepen/status/1385833948... | kolak |
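
The scraped values arrive fairly raw: `Timestamp` is an ISO 8601 string, and `Likes` is empty for some tweets. Below is a minimal sketch of tidying those two columns. The `sample` frame is built by hand from a subset of the columns in the two rows shown above. Treating an empty `Likes` value as zero is an assumption, and so is the expectation that the counts are numeric.

```python
import pandas as pd

# Hand-built subset of the two rows shown in the output above.
sample = pd.DataFrame({
    "UserName": ["@fikrialhakimm", "@wsnoepen"],
    "Timestamp": ["2021-04-20T16:55:47.000Z", "2021-04-24T05:52:15.000Z"],
    "Likes": [None, 2.0],
    "keyword": ["kolak", "kolak"],
})

# Parse the ISO 8601 strings into timezone-aware datetimes (UTC).
sample["Timestamp"] = pd.to_datetime(sample["Timestamp"], utc=True)

# Assumption: a missing count means zero likes.
sample["Likes"] = sample["Likes"].fillna(0).astype(int)

# Count tweets per calendar day.
tweets_per_day = sample.groupby(sample["Timestamp"].dt.date).size()

print(sample.dtypes)
print(tweets_per_day)
```

Once the timestamps are real datetimes, filtering by date or resampling by day becomes straightforward.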


In [None]:
# Save data to csv
filename = 'Data/all_keywordsv2.csv'
all_datas.to_csv(filename, index=False)
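
Note that `to_csv` raises an error if the target directory does not exist, so it can help to create it first. Here is a minimal sketch of a safe write followed by a read-back. It uses a temporary directory and a tiny hand-made frame so that it runs anywhere; in the notebook you would use the real `Data/` folder and `all_datas`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Tiny hand-made frame standing in for the scraped data.
frame = pd.DataFrame({"Text": ["example tweet"], "keyword": ["kolak"]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "Data" / "all_keywordsv2.csv"

    # Create the folder first; to_csv does not create missing directories.
    out_path.parent.mkdir(parents=True, exist_ok=True)

    # index=False keeps the row index out of the file.
    frame.to_csv(out_path, index=False)

    restored = pd.read_csv(out_path)

print(restored.columns.tolist())
```

Writing the index and then reading the file back is a common way to end up with an extra `Unnamed: 0` column.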

In [None]:
hashtag = 'kolak'

initial_date = '2021-04-28'
finish_date = '2021-04-30'

data = scrap(hashtag=hashtag,
             start_date=initial_date,
             max_date=finish_date,
             from_account=None,
             interval=5,
             headless=True,
             display_type="Top",
             save_images=False, 
             resume=False,
             filter_replies=False,
             proximity=True)

In [None]:
data.head()

### Get the main information of a given list of users

In [None]:
users = ['@raisa6690', '@isyanasarasvati']

# For each user, this function returns a list that contains:
# ["nb of following", "nb of followers", "join date", "birthdate", "location", "website", "description"]

users_info = get_user_information(users, headless=True)

In [None]:
users_df = pd.DataFrame(users_info, index = ["nb of following",
                                             "nb of followers",
                                             "join date", 
                                             "birthdate",
                                             "location",
                                             "website",
                                             "description"]).T
users_df
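
The cell above builds the table with the field names as the index and then transposes it, so that each user becomes a row. As a hedged alternative, `pd.DataFrame.from_dict` with `orient="index"` gives the same layout without the transpose. The sketch assumes `users_info` is a dict that maps each user to a list of seven values, which is inferred from how the cell above constructs the DataFrame. The user names and values below are placeholders, not real scraped data.

```python
import pandas as pd

fields = ["nb of following", "nb of followers", "join date",
          "birthdate", "location", "website", "description"]

# Placeholder data in the assumed shape: {user: [seven values]}.
users_info = {
    "@user_a": ["10", "200", "Joined January 2010", "", "Jakarta", "", "placeholder bio"],
    "@user_b": ["35", "1500", "Joined March 2012", "", "Bandung", "", "another placeholder"],
}

# Approach used in the notebook: field names as index, then transpose.
via_transpose = pd.DataFrame(users_info, index=fields).T

# Alternative: treat each dict key as a row directly.
via_from_dict = pd.DataFrame.from_dict(users_info, orient="index", columns=fields)

print(via_from_dict)
```

Both approaches produce one row per user and one column per field; which one reads better is a matter of taste.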