# Scraping YouTube Data

## Comment data using the YouTube API
To collect comments from YouTube, we use the YouTube API, which makes retrieving them much easier. It exposes a range of fields for each comment, such as the account that posted it, the time it was posted, and so on.

In [26]:
!pip install emoji indoNLP --quiet

## Using the NLTK (Natural Language Toolkit) library
Used for text processing tasks such as tokenization and stemming.
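
As a rough illustration of tokenization, the sketch below splits a comment into word tokens with the standard-library `re` module, using the same `\w+` pattern that this notebook passes to NLTK's `RegexpTokenizer` in a later cell. The helper name `simple_tokenize` is made up for this example.

```python
import re

def simple_tokenize(text):
    # hypothetical helper: keep runs of word characters, drop punctuation and whitespace
    return re.findall(r"\w+", text)

tokens = simple_tokenize("Knpa gue gk tertarik ma ni orang")
print(tokens)
```

Stemming, by contrast, would go one step further and reduce each token to its base form.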

In [27]:
! pip install nltk

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Installing indoNLP


In [28]:
! pip install indoNLP

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [29]:
# import time
# from selenium import webdriver
# from selenium.webdriver import Chrome
# from selenium.webdriver.common.by import By
# from selenium.webdriver.common.keys import Keys
# from selenium.webdriver.support.ui import WebDriverWait
# from selenium.webdriver.support import expected_conditions as EC

## Importing libraries

In [30]:
import numpy as np
import pandas as pd
from googleapiclient.discovery import build

from indoNLP.preprocessing import pipeline, replace_word_elongation, replace_slang, remove_html, remove_url
import re, string
import emoji


import nltk
from nltk.tokenize import RegexpTokenizer
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
nltk.download('stopwords', quiet=True)
from nltk.corpus import stopwords
stopwords = stopwords.words('indonesian')


# ignore warnings
import warnings
warnings.filterwarnings('ignore')

## Scraping YouTube data

In [31]:
def video_comments(video_id, api_key):
	# list that collects top-level comments and their replies
	replies = []

	# creating youtube resource object
	youtube = build('youtube', 'v3', developerKey=api_key)

	# retrieve youtube video results
	video_response = youtube.commentThreads().list(part='snippet,replies', videoId=video_id).execute()

	# iterate video response
	while video_response:
		for item in video_response['items']:
			
			# extract the top-level comment fields
			published = item['snippet']['topLevelComment']['snippet']['publishedAt']
			user = item['snippet']['topLevelComment']['snippet']['authorDisplayName']
			comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
			likeCount = item['snippet']['topLevelComment']['snippet']['likeCount']

			replies.append([published, user, comment, likeCount])

			replycount = item['snippet']['totalReplyCount']

			# handle replies only when the API returned them with the thread
			if replycount>0 and 'replies' in item:
				# iterate through all reply
				for reply in item['replies']['comments']:
					
					# Extract reply
					published = reply['snippet']['publishedAt']
					user = reply['snippet']['authorDisplayName']
					repl = reply['snippet']['textDisplay']
					likeCount = reply['snippet']['likeCount']
					
					# store the reply in the list
					replies.append([published, user, repl, likeCount])

		# fetch the next page, if there is one
		if 'nextPageToken' in video_response:
			video_response = youtube.commentThreads().list(
					part = 'snippet,replies',
					pageToken = video_response['nextPageToken'], 
					videoId = video_id
				).execute()
		else:
			break
	#endwhile
	return replies
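
To make the nesting of the response easier to follow, here is a small offline sketch that flattens one thread item into rows. The `sample_item` dictionary is hand-written for illustration and only mimics the fields that `video_comments` reads; it is not real API output, and `flatten_thread` is a hypothetical helper.

```python
def flatten_thread(item):
    """Return [publishedAt, author, text, likeCount] rows for a thread and its replies."""
    rows = []
    top = item["snippet"]["topLevelComment"]["snippet"]
    rows.append([top["publishedAt"], top["authorDisplayName"],
                 top["textDisplay"], top["likeCount"]])
    # 'replies' may be missing, so fall back to an empty list
    for reply in item.get("replies", {}).get("comments", []):
        s = reply["snippet"]
        rows.append([s["publishedAt"], s["authorDisplayName"],
                     s["textDisplay"], s["likeCount"]])
    return rows

# hand-written sample shaped like a commentThreads item (illustrative values only)
sample_item = {
    "snippet": {
        "totalReplyCount": 1,
        "topLevelComment": {"snippet": {
            "publishedAt": "2023-01-01T00:00:00Z",
            "authorDisplayName": "user_a",
            "textDisplay": "first comment",
            "likeCount": 2,
        }},
    },
    "replies": {"comments": [{"snippet": {
        "publishedAt": "2023-01-01T01:00:00Z",
        "authorDisplayName": "user_b",
        "textDisplay": "a reply",
        "likeCount": 0,
    }}]},
}

rows = flatten_thread(sample_item)
print(rows)
```

Each row has the same four fields in the same order, which is what lets the result be loaded into a single table afterwards.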

## Getting the API key and video ID from YouTube

In [32]:
video_id = '_HwJW5f5jFQ'
Api_key = 'AIzaSyBZqJO_poUA57gg9UnaC9rZd_UdPHLl1fc'


comments = video_comments(video_id, Api_key)
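
Hard-coding a key in a notebook makes it easy to leak when the notebook is shared. One common alternative, sketched here, is to read it from an environment variable. The variable name `YOUTUBE_API_KEY` and the helper `load_api_key` are assumptions for this example, not something the YouTube client requires.

```python
import os

def load_api_key(var_name="YOUTUBE_API_KEY"):
    # hypothetical helper: read the key from the environment instead of the source code
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key

# demo only: set a placeholder value so the sketch runs on its own
os.environ.setdefault("YOUTUBE_API_KEY", "placeholder-key")
api_key = load_api_key()
```

The returned value could then be passed as the second argument to `video_comments`.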

## Loading into a DataFrame

In [33]:
# dataframe
df = pd.DataFrame(comments, columns=['publishedAt', 'user', 'comment', 'likeCount'])

df

Unnamed: 0,publishedAt,user,comment,likeCount
0,2023-06-02T09:58:39Z,Kamal Yusuf,Sebaiknya Bacawapres Bapak Ganjar Pranowo bera...,0
1,2023-06-02T00:56:09Z,Oemar Husain,Knpa gue gk tertarik ma ni orang,0
2,2023-06-01T15:39:36Z,akhiri yanto,<b>PETUGAS PARTAI</b><br><br><b>Menurut paham ...,0
3,2023-06-01T11:01:38Z,Kamal Yusuf,Apabila Bapak Ganjar Pranowo menjadi pemimpin ...,0
4,2023-06-01T07:55:09Z,M. Maulana Muhson,Aslinya pak ganjar masih menjadi gubernur jate...,0
...,...,...,...,...
10572,2023-04-23T06:08:09Z,Olive,"Halaaah. Timbang si Yaman, kampanye ke mana² b...",1
10573,2023-04-23T06:01:08Z,CAH NDESO NEWS,Gaaasssss 2024,3
10574,2023-04-23T06:00:59Z,BUZZTRUCK,Masih sepi,1
10575,2023-04-23T06:11:39Z,Mulyadi,Soalnya lagi makan ketupat...<br>Katanya malas...,0


## Cleaning Data

Remove the parts of the text that are not needed for analysis.

In [34]:
# Text Cleaning
def cleaning(text):
    # HTML Tag Removal
    text = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});').sub('', str(text))

    # Case folding
    text = text.lower()

    # Trim text
    text = text.strip()

    # Remove punctuation, special characters, and repeated spaces
    text = re.compile('<.*?>').sub('', text)
    text = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', text)
    text = re.sub(r'\s+', ' ', text)

    # Number removal
    text = re.sub(r'\[[0-9]*\]', ' ', text)
    text = re.sub(r'[^\w\s]', '', str(text).lower().strip())
    text = re.sub(r'\d', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    text = emoji.replace_emoji(text)

    return text


pipe = pipeline([replace_word_elongation, replace_slang, remove_html, remove_url])
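
One detail worth watching is what a removed tag gets replaced with. Substituting an empty string glues together the words on either side of a `<br>`, while substituting a space keeps them apart. The sketch below shows the difference using only the standard library; `strip_tags` is a hypothetical helper, not part of `indoNLP`.

```python
import re

def strip_tags(text, replacement=" "):
    # replace every HTML tag with `replacement`, then collapse repeated whitespace
    no_tags = re.sub(r"<.*?>", replacement, text)
    return re.sub(r"\s+", " ", no_tags).strip()

raw = "<b>PETUGAS PARTAI</b><br><br><b>Menurut paham"
glued = strip_tags(raw, replacement="")
spaced = strip_tags(raw, replacement=" ")
print(glued)
print(spaced)
```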

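## Stemming data

Reduce words to their base form. The cell below applies the indoNLP pipeline (word elongation, slang, HTML, and URL handling) followed by `cleaning`; it does not call a separate stemmer.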



In [35]:
df['comment (clean)'] = df['comment'].apply(lambda x: pipe(x))
df['comment (clean)'] = df['comment (clean)'].apply(lambda x: cleaning(x))
# convert empty strings to NaN
df['comment (clean)'] = df['comment (clean)'].replace('', np.nan)


## Checking for missing values



In [36]:
print(df.isna().sum())

publishedAt         0
user                0
comment             0
likeCount           0
comment (clean)    99
dtype: int64


## Dropping missing values

In [37]:
df.dropna(inplace=True)

In [38]:
print(df.isna().sum())

publishedAt        0
user               0
comment            0
likeCount          0
comment (clean)    0
dtype: int64


In [39]:
df

Unnamed: 0,publishedAt,user,comment,likeCount,comment (clean)
0,2023-06-02T09:58:39Z,Kamal Yusuf,Sebaiknya Bacawapres Bapak Ganjar Pranowo bera...,0,sebaiknya bacawapres bapak ganjar pranowo bera...
1,2023-06-02T00:56:09Z,Oemar Husain,Knpa gue gk tertarik ma ni orang,0,kenapa gue enggak tertarik sama nih orang
2,2023-06-01T15:39:36Z,akhiri yanto,<b>PETUGAS PARTAI</b><br><br><b>Menurut paham ...,0,petugas partaimenurut paham negara demokrasi m...
3,2023-06-01T11:01:38Z,Kamal Yusuf,Apabila Bapak Ganjar Pranowo menjadi pemimpin ...,0,apabila bapak ganjar pranowo menjadi pemimpin ...
4,2023-06-01T07:55:09Z,M. Maulana Muhson,Aslinya pak ganjar masih menjadi gubernur jate...,0,aslinya pak ganjar masih menjadi gubernur jate...
...,...,...,...,...,...
10572,2023-04-23T06:08:09Z,Olive,"Halaaah. Timbang si Yaman, kampanye ke mana² b...",1,halaaah timbang sih yaman kampanye ke mana² be...
10573,2023-04-23T06:01:08Z,CAH NDESO NEWS,Gaaasssss 2024,3,gaaas
10574,2023-04-23T06:00:59Z,BUZZTRUCK,Masih sepi,1,masih sepi
10575,2023-04-23T06:11:39Z,Mulyadi,Soalnya lagi makan ketupat...<br>Katanya malas...,0,soalnya lagi makan ketupat katanya malas ah so...


## Export to CSV

Once the result is ready, save it as a CSV file.

In [40]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [41]:
df.to_csv('/content/Comments_Data.csv', index=False)

In [42]:
df = pd.read_csv('/content/Comments_Data.csv')

df

Unnamed: 0,publishedAt,user,comment,likeCount,comment (clean)
0,2023-06-02T09:58:39Z,Kamal Yusuf,Sebaiknya Bacawapres Bapak Ganjar Pranowo bera...,0.0,sebaiknya bacawapres bapak ganjar pranowo bera...
1,2023-06-02T00:56:09Z,Oemar Husain,Knpa gue gk tertarik ma ni orang,0.0,kenapa gue enggak tertarik sama nih orang
2,2023-06-01T15:39:36Z,akhiri yanto,<b>PETUGAS PARTAI</b><br><br><b>Menurut paham ...,0.0,petugas partaimenurut paham negara demokrasi m...
3,2023-06-01T11:01:38Z,Kamal Yusuf,Apabila Bapak Ganjar Pranowo menjadi pemimpin ...,0.0,apabila bapak ganjar pranowo menjadi pemimpin ...
4,2023-06-01T07:55:09Z,M. Maulana Muhson,Aslinya pak ganjar masih menjadi gubernur jate...,0.0,aslinya pak ganjar masih menjadi gubernur jate...
...,...,...,...,...,...
10494,2023-04-23T06:08:09Z,Olive,"Halaaah. Timbang si Yaman, kampanye ke mana² b...",1.0,halaaah timbang sih yaman kampanye ke mana² be...
10495,2023-04-23T06:01:08Z,CAH NDESO NEWS,Gaaasssss 2024,3.0,gaaas
10496,2023-04-23T06:00:59Z,BUZZTRUCK,Masih sepi,1.0,masih sepi
10497,2023-04-23T06:11:39Z,Mulyadi,Soalnya lagi makan ketupat...<br>Katanya malas...,0.0,soalnya lagi makan ketupat katanya malas ah so...


In [43]:
df['comment (clean)'] = df['comment (clean)'].fillna('')

In [44]:
# build the document-term matrix
tokenizer = RegexpTokenizer(r'\w+')
vectorizer = TfidfVectorizer(lowercase=True,
                        stop_words=stopwords,
                        tokenizer = tokenizer.tokenize)

tfidf_matrix = vectorizer.fit_transform(df['comment (clean)']) 

# decompose the matrix with SVD
svd_model = TruncatedSVD(n_components=4)
lsa_matrix = svd_model.fit_transform(tfidf_matrix)
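
For intuition about what the decomposition produces, the sketch below runs a plain NumPy SVD on a tiny hand-made count matrix and keeps the top two components. It is a simplified stand-in: `TruncatedSVD` uses its own solver and this notebook feeds it TF-IDF weights, so the numbers here are only illustrative, and the toy matrix is invented for this example.

```python
import numpy as np

# toy document-term counts: 4 documents x 5 terms (invented values)
X = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 0],
    [0, 0, 1, 2, 1],
    [0, 0, 2, 1, 1],
], dtype=float)

k = 2  # number of topics to keep
U, s, Vt = np.linalg.svd(X, full_matrices=False)

doc_topic = U[:, :k] * s[:k]   # plays the role of lsa_matrix (documents x topics)
topic_term = Vt[:k, :]         # plays the role of svd_model.components_ (topics x terms)

# rank-k approximation of X rebuilt from the two factors
X_approx = doc_topic @ topic_term
print(doc_topic.shape, topic_term.shape)
```

Each row of `doc_topic` says how strongly a document loads on each topic, and each row of `topic_term` says how strongly each term contributes to that topic.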

## Word weights for each topic

In [45]:
# weight of each word within each topic
terms = vectorizer.get_feature_names_out()

for index, component in enumerate(svd_model.components_):
    zipped = zip(terms, component)
    top_terms_key=sorted(zipped, key = lambda t: t[1], reverse=True)[:3]
    print("Topic "+str(index)+": ",top_terms_key)

Topic 0:  [('ganjar', 0.7753927696143593), ('presiden', 0.26053510510489974), ('ri', 0.20229273069550835)]
Topic 1:  [('partai', 0.7116812695919013), ('petugas', 0.583256207396635), ('rakyat', 0.10229906277841277)]
Topic 2:  [('presiden', 0.5192154157718281), ('prabowo', 0.46844031240139905), ('anies', 0.18758521093211838)]
Topic 3:  [('ri', 0.47400799551175565), ('presiden', 0.43317755634821264), ('pranowo', 0.17340817984299473)]


## Topic weights for each document

In [46]:
# weight of each topic for each document
df_lsa = pd.DataFrame(lsa_matrix, columns=["Topik 0", "Topik 1", "Topik 2", "Topik 3"])
df_lsa = pd.concat([df["comment (clean)"], df_lsa], axis=1)
df_lsa['Topik']= df_lsa[['Topik 0', 'Topik 1', 'Topik 2', 'Topik 3']].apply(lambda x: x.argmax(), axis=1)

df_lsa

Unnamed: 0,comment (clean),Topik 0,Topik 1,Topik 2,Topik 3,Topik
0,sebaiknya bacawapres bapak ganjar pranowo bera...,1.559279e-01,-4.727330e-02,-3.513604e-02,2.320549e-02,0
1,kenapa gue enggak tertarik sama nih orang,2.674689e-02,1.196918e-02,1.919385e-02,-2.563696e-02,0
2,petugas partaimenurut paham negara demokrasi m...,1.207230e-01,2.585382e-01,3.069129e-02,-2.252631e-03,1
3,apabila bapak ganjar pranowo menjadi pemimpin ...,1.947403e-01,-2.327173e-02,3.542895e-02,3.599127e-02,0
4,aslinya pak ganjar masih menjadi gubernur jate...,8.571505e-02,6.145721e-03,5.059285e-02,2.357836e-02,0
...,...,...,...,...,...,...
10494,halaaah timbang sih yaman kampanye ke mana² be...,7.656482e-03,4.597682e-03,1.203065e-02,-1.419255e-02,2
10495,gaaas,1.246273e-14,-8.421682e-13,-4.967756e-12,1.009319e-11,3
10496,masih sepi,2.305728e-04,1.189862e-04,3.445038e-04,-4.132306e-04,2
10497,soalnya lagi makan ketupat katanya malas ah so...,2.731007e-03,3.552052e-03,3.053887e-03,-3.973749e-03,1


In [47]:
df_lsa['Topik'].value_counts()

0    6448
2    2494
1    1401
3     156
Name: Topik, dtype: int64
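
Because `argmax` always returns some index, comments whose weights are all close to zero (for example the `gaaas` row, where every value is on the order of 1e-11 or smaller) still receive a topic label. A possible refinement, sketched here in plain Python, is to leave such rows unassigned. The threshold value and the helper name `assign_topic` are assumptions for illustration.

```python
def assign_topic(weights, min_weight=1e-6):
    # hypothetical helper: return None when every weight is negligibly small
    if max(abs(w) for w in weights) < min_weight:
        return None
    # otherwise return the index of the largest weight, like argmax
    return max(range(len(weights)), key=lambda i: weights[i])

near_zero_row = [1.246273e-14, -8.421682e-13, -4.967756e-12, 1.009319e-11]
clear_row = [1.207230e-01, 2.585382e-01, 3.069129e-02, -2.252631e-03]

print(assign_topic(near_zero_row))
print(assign_topic(clear_row))
```

Whether such rows should be dropped, kept unlabeled, or grouped separately depends on how the topic counts will be used.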