# Scraping Twitter Data

# Required Libraries

The following libraries are used to scrape data from Twitter:

1. Twint: scrapes data from Twitter.
See https://pypi.org/project/twint/
2. nest_asyncio: handles the event loop error that occurs while scraping data from Twitter.
See https://pypi.org/project/nest-asyncio/
3. PyMySQL: connects to a MySQL database so that the scraped Twitter data can be stored in SQL form.
See https://pypi.org/project/PyMySQL/

**Installing the Twint Library**

In [None]:
!pip install twint 
#!pip install --user --upgrade git+https://github.com/yunusemrecatalcam/twint.git@twitter_legacy2
!pip install --user --upgrade -e git+https://github.com/twintproject/twint.git@origin/master#egg=twint

Collecting twint
  Downloading twint-2.1.20.tar.gz (31 kB)
Collecting aiohttp
  Downloading aiohttp-3.7.4.post0-cp37-cp37m-manylinux2014_x86_64.whl (1.3 MB)
Collecting aiodns
  Downloading aiodns-3.0.0-py3-none-any.whl (5.0 kB)
Collecting cchardet
  Downloading cchardet-2.1.7-cp37-cp37m-manylinux2010_x86_64.whl (263 kB)
Collecting elasticsearch
  Downloading elasticsearch-7.15.0-py2.py3-none-any.whl (378 kB)
Collecting aiohttp_socks
  Downloading aiohttp_socks-0.6.0-py3-none-any.whl (9.2 kB)
Collecting schedule
  Downloading schedule-1.1.0-py2.py3-none-any.whl (10 kB)
Collecting fake-useragent
  Downloading fake-useragent-0.1.11.tar.gz (13 kB)
Collecting googletransx
  Downloading googletransx-2.4.2.tar.gz (13 kB)
Collecting pycares>=4.0.0
  Downloading pycares-4.0.0-cp37-cp37m-manylinux2010_x8

**Installing the nest_asyncio Library**

In [None]:
!pip install nest_asyncio



**Installing the PyMySQL Library**

In [None]:
!pip install pymysql
!pip install mysql-connector-python-rf

Collecting pymysql
  Downloading PyMySQL-1.0.2-py3-none-any.whl (43 kB)
Installing collected packages: pymysql
Successfully installed pymysql-1.0.2
Collecting mysql-connector-python-rf
  Downloading mysql-connector-python-rf-2.2.2.tar.gz (11.9 MB)
Building wheels for collected packages: mysql-connector-python-rf
  Building wheel for mysql-connector-python-rf (setup.py) ... done
  Created wheel for mysql-connector-python-rf: filename=mysql_connector_python_rf-2.2.2-cp37-cp37m-linux_x86_64.whl size=249476 sha256=be3880637227bdce89b1b99cee22d2bf1312bcbafe669dd5f

# Twitter Data Scraping Process

**Importing Libraries**

First, import the required libraries:

In [None]:
#import libraries
import twint
import nest_asyncio 
import os
import pandas as pd
import numpy as np
import glob

Next, use the **nest_asyncio** library to handle the event loop error that occurs while scraping data

In [None]:
#apply the nest_asyncio patch
nest_asyncio.apply()
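The error that nest_asyncio works around comes from asyncio itself: a notebook kernel already runs an event loop, and asyncio refuses to start a nested run on a loop that is already running. The minimal sketch below reproduces that refusal with the standard library only; the coroutine names `inner` and `outer` are made up for illustration and are not part of Twint.

```python
import asyncio


async def inner():
    # Stand-in for the coroutine a scraping library would run.
    return 42


async def outer():
    # Inside a running loop (as in a notebook), a nested run is rejected.
    loop = asyncio.get_running_loop()
    coroutine = inner()
    try:
        loop.run_until_complete(coroutine)
    except RuntimeError as error:
        coroutine.close()  # avoid a "never awaited" warning
        return str(error)
    return "no error"


message = asyncio.run(outer())
print(message)
```

`nest_asyncio.apply()` patches asyncio so that this kind of nested call is allowed.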

Create the "data" folder that will store the scraped Twitter data

In [None]:
#create folder data before covid-19
if not os.path.exists('data/noncovid'):
    os.makedirs('data/noncovid')
    
#create folder data during covid-19
if not os.path.exists('data/covid'):
    os.makedirs('data/covid')
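As an alternative sketch, `os.makedirs` accepts `exist_ok=True`, which makes the separate existence check unnecessary. The example runs inside a temporary directory (an assumption made here so that it does not touch the real `data` folder).

```python
import os
import tempfile

# Temporary base directory, used only for this illustration.
base_dir = tempfile.mkdtemp()

for condition in ('noncovid', 'covid'):
    folder = os.path.join(base_dir, 'data', condition)
    # exist_ok=True means no error is raised if the folder already exists.
    os.makedirs(folder, exist_ok=True)
    os.makedirs(folder, exist_ok=True)  # calling it again is harmless

created = sorted(os.listdir(os.path.join(base_dir, 'data')))
print(created)
```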

Declare a function that converts a tuple to a string

In [None]:
def convertTuple(parts):
    #join the tuple elements into one string (avoids shadowing the built-in str)
    return ''.join(parts)

**Configuring and Running the Twitter Scraping**

Several settings need to be configured before scraping the data:
1. Set the **tweet posting date** through the **Since and Until** parameters to retrieve data for a specific time period
2. Show post statistics through the **Stats and Count** parameters to display the number of posts, comments, and likes
3. Set the geocode of the target area through the **Geo** parameter to restrict the scraped Twitter data to a specific location. Also define the radius that bounds the data collection.
4. Set the data period: the data is split into two groups, data collected during the COVID-19 pandemic and data from before COVID-19

In [None]:
def general_twitter_scraping(start_date, until_date, latitude, longitude, radius, condition):
    
    #set geocode parameter value from latitude, longitude and radius
    geocode     = latitude,',',longitude,',',radius
    geocode_str = convertTuple(geocode).strip()
    
    if(condition == 'noncovid'):
        output_file = convertTuple(('data/noncovid/general_scrapping_result_',(start_date.replace('-', '')),'_',(until_date.replace('-', '')),'.csv')).strip()
    else:
        output_file = convertTuple(('data/covid/general_scrapping_result_',(start_date.replace('-', '')),'_',(until_date.replace('-', '')),'.csv')).strip()
        
    #configuration
    config = twint.Config()
    #config.Limit = 100
    config.Since = start_date
    config.Until = until_date
    config.Stats = True
    config.Count = True
    config.Favorites = True
    config.Retweets = True 
    config.Store_csv = True
    config.Hide_output = True
    config.Include_retweets = True
    config.Geo = geocode_str
    config.Output = output_file

    #running search
    twint.run.Search(config)

Declare the variables that define the parameters used for scraping Twitter data

In [None]:
latitude    = '-6.216657128974757'
longitude   = '106.83030289065285'
radius      = '10km'
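The geocode string and the output path that `general_twitter_scraping()` assembles from these values can also be sketched with f-strings. `build_geocode` and `build_output_path` are hypothetical helpers introduced only for this illustration; they mirror the string format used in the function above.

```python
latitude = '-6.216657128974757'
longitude = '106.83030289065285'
radius = '10km'


def build_geocode(latitude, longitude, radius):
    # Hypothetical helper: "latitude,longitude,radius".
    return f"{latitude},{longitude},{radius}"


def build_output_path(start_date, until_date, condition):
    # Hypothetical helper: same file naming scheme as general_twitter_scraping().
    folder = 'data/noncovid' if condition == 'noncovid' else 'data/covid'
    start = start_date.replace('-', '')
    until = until_date.replace('-', '')
    return f"{folder}/general_scrapping_result_{start}_{until}.csv"


geocode_str = build_geocode(latitude, longitude, radius)
output_file = build_output_path('2020-03-14', '2020-06-14', 'covid')
print(geocode_str)
print(output_file)
```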

Call general_twitter_scraping() to scrape Twitter data from the period before the COVID-19 pandemic

In [None]:
general_twitter_scraping('2021-10-11','2021-10-14',latitude,longitude,radius,'covid')

[+] Finished: Successfully collected 119 Tweets.


Call general_twitter_scraping() to scrape Twitter data from the period during the COVID-19 pandemic

In [None]:
general_twitter_scraping('2020-03-14','2020-06-14',latitude,longitude,radius,'covid')
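Because the output file name is derived from the start and end dates, a long period could also be scraped as several shorter windows, each saved to its own CSV file. The sketch below is one possible way to generate such windows; `date_windows` and the 30-day window length are assumptions for illustration, not part of the original notebook.

```python
from datetime import date, timedelta


def date_windows(start_date, until_date, days=7):
    # Hypothetical helper: split [start_date, until_date) into consecutive
    # windows of at most `days` days, returned as ISO date string pairs.
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(until_date)
    windows = []
    while start < end:
        stop = min(start + timedelta(days=days), end)
        windows.append((start.isoformat(), stop.isoformat()))
        start = stop
    return windows


windows = date_windows('2020-03-14', '2020-06-14', days=30)
for window_start, window_until in windows:
    print(window_start, window_until)
```

Each pair could then be passed to `general_twitter_scraping()` as its `start_date` and `until_date` arguments.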

Read the CSV files that contain the Twitter data scraped for the period before COVID-19

In [None]:
#read every CSV file in the pre-COVID-19 folder and combine them
tweet_data_before_covid_files = [pd.read_csv(files) for files in glob.glob("data/noncovid/*.csv")]
tweet_data_before_covid_df = pd.concat(tweet_data_before_covid_files) if tweet_data_before_covid_files else pd.DataFrame()
    
#remove duplicate rows
tweet_data_before_covid_df = tweet_data_before_covid_df.drop_duplicates()
#tweet_data_before_covid_df = tweet_data_before_covid_df[~tweet_data_before_covid_df.index.duplicated(keep='first')]

Read the CSV files that contain the Twitter data scraped for the period during COVID-19

In [None]:
#read every CSV file in the COVID-19 folder and combine them
tweet_data_during_covid_files = [pd.read_csv(files) for files in glob.glob("data/covid/*.csv")]
tweet_data_during_covid_df = pd.concat(tweet_data_during_covid_files) if tweet_data_during_covid_files else pd.DataFrame()
    
#remove duplicate rows
tweet_data_during_covid_df = tweet_data_during_covid_df.drop_duplicates()
#tweet_data_during_covid_df = tweet_data_during_covid_df[~tweet_data_during_covid_df.index.duplicated(keep='first')]
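Combining many CSV files is commonly done by collecting the frames in a list and calling `pd.concat` once (`DataFrame.append` was removed in pandas 2.0). A minimal sketch on made-up files in a temporary directory; the file names and column values are invented for illustration.

```python
import glob
import os
import tempfile

import pandas as pd

# Two small made-up CSV files that share one identical row (id 2).
tmp_dir = tempfile.mkdtemp()
pd.DataFrame({'id': [1, 2], 'username': ['a', 'b']}).to_csv(
    os.path.join(tmp_dir, 'part1.csv'), index=False)
pd.DataFrame({'id': [2, 3], 'username': ['b', 'c']}).to_csv(
    os.path.join(tmp_dir, 'part2.csv'), index=False)

# Read every CSV file, concatenate once, then drop exact duplicate rows.
frames = [pd.read_csv(path) for path in sorted(glob.glob(os.path.join(tmp_dir, '*.csv')))]
combined = pd.concat(frames, ignore_index=True).drop_duplicates()
print(len(combined))
```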

In [None]:
#before covid
number_of_rows_before_covid = len(tweet_data_before_covid_df.index)
print('Number of rows before COVID-19: ',number_of_rows_before_covid)

Number of rows before COVID-19:  652345


In [None]:
#during covid
number_of_rows_during_covid = len(tweet_data_during_covid_df.index)
print('Number of rows during COVID-19: ', number_of_rows_during_covid)

Number of rows during COVID-19:  658490


In [None]:
#all_tweets_data_df = pd.concat([tweet_data_before_covid_df, tweet_data_during_covid_df], axis=1, join='outer')
#all_tweets_data_df = pd.DataFrame(all_tweets_data_df);

View a few rows of data from before the COVID-19 pandemic

In [None]:
tweet_data_before_covid_df.head(5)

Unnamed: 0,id,conversation_id,created_at,date,time,timezone,user_id,username,name,place,...,geo,source,user_rt_id,user_rt,retweet_id,reply_to,retweet_date,translate,trans_src,trans_dest
0,1219771501609570305,1219771501609570305,2020-01-21 23:59:22 UTC,2020-01-21,23:59:22,0,56072466,harigelita,Dian Mochtar,,...,"-6.216657128974757,106.83030289065285,10km",,,,,[],,,,
1,1219771450304786432,1219439859128922112,2020-01-21 23:59:10 UTC,2020-01-21,23:59:10,0,239827943,1dg4fux,Aryawipu,,...,"-6.216657128974757,106.83030289065285,10km",,,,,"[{'screen_name': 'Tospotato', 'name': 'Kamisam...",,,,
2,1219771440943091712,1219763519119220736,2020-01-21 23:59:08 UTC,2020-01-21,23:59:08,0,80840926,iwanneh,.:::.,,...,"-6.216657128974757,106.83030289065285,10km",,,,,"[{'screen_name': 'NotesofMila', 'name': 'Mila🇮...",,,,
3,1219771347020083200,1219771347020083200,2020-01-21 23:58:46 UTC,2020-01-21,23:58:46,0,195299889,bys_king11,BODDAH,,...,"-6.216657128974757,106.83030289065285,10km",,,,,[],,,,
4,1219771311368523776,1219771311368523776,2020-01-21 23:58:37 UTC,2020-01-21,23:58:37,0,406069088,anggi_gunawan25,Anggi Gunawan S.Tr.Sos,,...,"-6.216657128974757,106.83030289065285,10km",,,,,[],,,,


View a few rows of data from during the COVID-19 pandemic

In [None]:
tweet_data_during_covid_df.head(5)

Unnamed: 0,id,conversation_id,created_at,date,time,timezone,user_id,username,name,place,...,geo,source,user_rt_id,user_rt,retweet_id,reply_to,retweet_date,translate,trans_src,trans_dest
0,1252748884301889538,1252733407282388993,2020-04-21 23:59:43 UTC,2020-04-21,23:59:43,0,134710864,budikotelawala,budi,,...,"-6.216657128974757,106.83030289065285,10km",,,,,"[{'screen_name': 'queennisme', 'name': 'queenn...",,,,
1,1252748846469283841,1252748846469283841,2020-04-21 23:59:34 UTC,2020-04-21,23:59:34,0,44393016,kembletwitt,Kemble,"{'type': 'Point', 'coordinates': [-6.1803, 106...",...,"-6.216657128974757,106.83030289065285,10km",,,,,[],,,,
2,1252748818405154817,1252748818405154817,2020-04-21 23:59:27 UTC,2020-04-21,23:59:27,0,924225714982809600,anitatrisia8,Anita Ita,,...,"-6.216657128974757,106.83030289065285,10km",,,,,[],,,,
3,1252748804723376128,1252748804723376128,2020-04-21 23:59:24 UTC,2020-04-21,23:59:24,0,42618344,adhis_lawliet,adhisty puji,,...,"-6.216657128974757,106.83030289065285,10km",,,,,[],,,,
4,1252748799497261058,1252748799497261058,2020-04-21 23:59:23 UTC,2020-04-21,23:59:23,0,1041540877,abdillahrifki12,imut,,...,"-6.216657128974757,106.83030289065285,10km",,,,,[],,,,


The scraping done so far only collects the original posts made by each user; it does not collect the users' retweet activity. Retweet data can only be obtained by scraping each user's timeline. The following is another mechanism for scraping that retweet data.

In [None]:
def profile_twitter_scraping(username, start_date, until_date, latitude, longitude, radius):
    
    #set geocode parameter value from latitude, longitude and radius
    geocode = latitude,',',longitude,',',radius
    geocode_str = convertTuple(geocode).strip()

    #configuration
    config = twint.Config()
    config.Username = username 
    config.Since = start_date
    config.Until = until_date
    config.Stats = True
    config.Count = True
    config.Favorites = True
    #config.Profile_full = True
    #config.User_full = True
    #config.All = True
    #config.Store_csv = True
    #config.Include_retweets = True
    #config.Filter_retweets = "include"
    config.Filter_retweets = True
    config.Retweet = True
    #config.Hide_output = True
    config.Native_retweets = True
    config.Pandas = True
    config.Geo = geocode_str
    #config.Output = "/content/data/profile_scraping_result.csv"

    #running search
    twint.run.Profile(config)

    profile = twint.storage.panda.Tweets_df

    return profile
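The DataFrame returned by `profile_twitter_scraping()` is later filtered on its `retweet` column so that only retweet activity is kept. A minimal sketch of that filter on a made-up frame (the rows below are invented and only the `retweet` column name is taken from the notebook):

```python
import pandas as pd

# Made-up timeline rows; 'retweet' marks whether a row is a retweet.
timeline = pd.DataFrame({
    'id': [1, 2, 3],
    'username': ['user_a', 'user_a', 'user_a'],
    'retweet': [False, True, True],
})

# Boolean mask: keep only the rows flagged as retweets.
retweets_only = timeline.loc[timeline['retweet'] == True]
print(retweets_only['id'].tolist())
```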

**Collecting Retweet Data from before the COVID-19 Pandemic**

Declare the variable **profile_before_covid** to store the list of Twitter usernames that posted in the period before COVID-19

In [None]:
#reset the dataframe index so that it is unique
tweet_data_before_covid_df = tweet_data_before_covid_df.reset_index()

#get the distinct usernames
profile_before_covid = pd.unique(tweet_data_before_covid_df['username'])

Store the Twitter user data in a DataFrame object

In [None]:
#store the distinct usernames in a dataframe
profile_before_covid = pd.DataFrame(profile_before_covid)
profile_before_covid = profile_before_covid.rename(columns={0: 'username'})
profile_before_covid = profile_before_covid.sort_values('username', ascending=True)
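The distinct-and-sort step can be tried on a small made-up frame. This is only a sketch: the usernames are invented, and note that `ascending` expects a boolean rather than the string `'True'`.

```python
import pandas as pd

# Made-up tweets with a repeated username.
tweets = pd.DataFrame({'username': ['budi', 'anita', 'budi', 'kemble']})

# Distinct usernames, stored in a DataFrame and sorted alphabetically.
profiles = pd.DataFrame({'username': pd.unique(tweets['username'])})
profiles = profiles.sort_values('username', ascending=True).reset_index(drop=True)
print(profiles['username'].tolist())
```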

Loop over the timeline of each user account that has already been scraped to collect its retweet activity. The data used here comes from before the COVID-19 pandemic

In [None]:
#CSV file that accumulates the retweet data of every profile
profile_output_file = 'data/noncovid/profile_scraping_result_0_10000.csv'

for index, row in profile_before_covid[100:10000].iterrows():
    try:
        all_profile_before_covid_df = profile_twitter_scraping(row['username'],'2020-01-01','2020-02-08', latitude, longitude, radius)

        if not all_profile_before_covid_df.empty:
            #keep only the retweet rows
            all_profile_before_covid_df = all_profile_before_covid_df.loc[all_profile_before_covid_df['retweet'] == True]

            #append to the CSV file, writing the header only when the file is created
            write_header = not os.path.isfile(profile_output_file)
            all_profile_before_covid_df.to_csv(profile_output_file, mode='a', index=True, header=write_header)
    except Exception:
        #skip accounts whose timeline cannot be scraped
        continue

1225452226165923840 2020-02-06 23:12:33 +0700 <2137ath_> @GiaPratamaMD Saya ISFP-T , dok  https://t.co/K10PHZMytz | 0 replies 0 retweets 0 likes
1221027511192018946 2020-01-25 18:10:18 +0700 <2137ath_> Sy open mind thd semua jenis pengobatan, selama tdk bertentangan dg syariat. Tapi kenapa sih para pengobat tradisional itu sering bgt mempertentangkan metodenya dg kedokteran modern? Negatif thinking thd kedokteran modern terus. Ini yg bikin sy antipati | 0 replies 0 retweets 0 likes
1219291621910429697 2020-01-20 23:12:30 +0700 <2137ath_> @republikaonline Rajam sampai mati | 0 replies 0 retweets 0 likes
1215688087646494720 2020-01-11 00:33:21 +0700 <2137ath_> @GiaPratamaMD NO GOOGLING ALLOWED. Every answer must start with the first letter of your FIRST name.   WEAR - Niqab DRINK - Nabeez FOOD - Nasi ANIMAL - Ngengat PROFESSION - Nanny for my kids SOMETHING IN YOUR HOME - Nampan BODY PART - Nails  COPY, PASTE, &amp; HAVE FUN 😆 | 0 replies 0 retweets 0 likes
1215462365522644994 2020-01-10
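Writing each profile's result to the same CSV file works cleanly when the header is written only once, at the moment the file is created. A minimal sketch, assuming a hypothetical helper `append_to_csv` and a temporary file path chosen only for illustration:

```python
import os
import tempfile

import pandas as pd


def append_to_csv(frame, path):
    # Hypothetical helper: write the header only if the file does not exist yet.
    write_header = not os.path.isfile(path)
    frame.to_csv(path, mode='a', index=False, header=write_header)


path = os.path.join(tempfile.mkdtemp(), 'profile_scraping_result.csv')

# Two made-up batches, as if they came from two different profiles.
append_to_csv(pd.DataFrame({'id': [1], 'retweet': [True]}), path)
append_to_csv(pd.DataFrame({'id': [2], 'retweet': [True]}), path)

result = pd.read_csv(path)
print(len(result))
```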

Read the CSV files that contain the Twitter data scraped for the period before COVID-19

In [None]:
#tweet_data_before_covid_df = tweet_data_before_covid_df.dropna(how='any',axis=0)

#read every CSV file in the pre-COVID-19 folder and combine them
tweet_data_before_covid_files = [pd.read_csv(files) for files in glob.glob("data/noncovid/*.csv")]
tweet_data_before_covid_df = pd.concat(tweet_data_before_covid_files) if tweet_data_before_covid_files else pd.DataFrame()
    
#remove duplicate rows
tweet_data_before_covid_df = tweet_data_before_covid_df.drop_duplicates()

**Collecting Retweet Data during the COVID-19 Pandemic**

Declare the variable **profile_during_covid** to store the list of Twitter usernames that posted in the period during COVID-19

In [None]:
#reset the dataframe index so that it is unique
tweet_data_during_covid_df = tweet_data_during_covid_df.reset_index()

#get the distinct usernames
profile_during_covid = pd.unique(tweet_data_during_covid_df['username'])

Store the Twitter user data in a DataFrame object

In [None]:
#store the distinct usernames in a dataframe
profile_during_covid = pd.DataFrame(profile_during_covid)
profile_during_covid = profile_during_covid.rename(columns={0: 'username'})
profile_during_covid = profile_during_covid.sort_values('username', ascending=True)

Loop over the timeline of each user account that has already been scraped to collect its retweet activity. The data used here comes from during the COVID-19 pandemic

In [None]:
#CSV file that accumulates the retweet data of every profile
profile_output_file = 'data/covid/profile_scraping_result_0_1000.csv'

for index, row in profile_during_covid[0:1000].iterrows():
    try:
        all_profile_during_covid_df = profile_twitter_scraping(row['username'],'2020-04-01','2020-05-08', latitude, longitude, radius)

        if not all_profile_during_covid_df.empty:
            #keep only the retweet rows
            all_profile_during_covid_df = all_profile_during_covid_df.loc[all_profile_during_covid_df['retweet'] == True]

            #append to the CSV file, writing the header only when the file is created
            write_header = not os.path.isfile(profile_output_file)
            all_profile_during_covid_df.to_csv(profile_output_file, mode='a', index=True, header=write_header)
    except Exception:
        #skip accounts whose timeline cannot be scraped
        continue

Read the CSV files that contain the Twitter data scraped for the period during COVID-19

In [None]:
#tweet_data_during_covid_df = tweet_data_during_covid_df.dropna(how='any',axis=0)

#read every CSV file in the COVID-19 folder and combine them
tweet_data_during_covid_files = [pd.read_csv(files) for files in glob.glob("data/covid/*.csv")]
tweet_data_during_covid_df = pd.concat(tweet_data_during_covid_files) if tweet_data_during_covid_files else pd.DataFrame()
    
#remove duplicate rows
tweet_data_during_covid_df = tweet_data_during_covid_df.drop_duplicates()

Combine the tweet data from before and during COVID-19

In [None]:
all_tweets_data_df = pd.concat([tweet_data_before_covid_df, tweet_data_during_covid_df], ignore_index=True)

Count the total number of rows

In [None]:
print('Total number of rows: ', len(all_tweets_data_df.index))

Total number of rows:  1310835
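The reported total of 1310835 equals the sum of the two counts reported earlier (652345 + 658490), which is what stacking the two frames row-wise should produce. A minimal sketch on made-up frames shows the same property, along with the fresh 0..n-1 index that `ignore_index=True` assigns:

```python
import pandas as pd

# Made-up stand-ins for the two periods.
before = pd.DataFrame({'id': [1, 2]})
during = pd.DataFrame({'id': [3, 4, 5]})

# Stack the rows; ignore_index=True renumbers the index from 0.
combined = pd.concat([before, during], ignore_index=True)
print(len(combined), combined.index.tolist())
```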


In [None]:
all_tweets_data_df.head(5)

Unnamed: 0,id,conversation_id,created_at,date,time,timezone,user_id,username,name,place,...,geo,source,user_rt_id,user_rt,retweet_id,reply_to,retweet_date,translate,trans_src,trans_dest
0,1219771501609570305,1219771501609570305,2020-01-21 23:59:22 UTC,2020-01-21,23:59:22,0,56072466,harigelita,Dian Mochtar,,...,"-6.216657128974757,106.83030289065285,10km",,,,,[],,,,
1,1219771450304786432,1219439859128922112,2020-01-21 23:59:10 UTC,2020-01-21,23:59:10,0,239827943,1dg4fux,Aryawipu,,...,"-6.216657128974757,106.83030289065285,10km",,,,,"[{'screen_name': 'Tospotato', 'name': 'Kamisam...",,,,
2,1219771440943091712,1219763519119220736,2020-01-21 23:59:08 UTC,2020-01-21,23:59:08,0,80840926,iwanneh,.:::.,,...,"-6.216657128974757,106.83030289065285,10km",,,,,"[{'screen_name': 'NotesofMila', 'name': 'Mila🇮...",,,,
3,1219771347020083200,1219771347020083200,2020-01-21 23:58:46 UTC,2020-01-21,23:58:46,0,195299889,bys_king11,BODDAH,,...,"-6.216657128974757,106.83030289065285,10km",,,,,[],,,,
4,1219771311368523776,1219771311368523776,2020-01-21 23:58:37 UTC,2020-01-21,23:58:37,0,406069088,anggi_gunawan25,Anggi Gunawan S.Tr.Sos,,...,"-6.216657128974757,106.83030289065285,10km",,,,,[],,,,
