## Data Loading and Preparation
This section loads the dataset into a pandas DataFrame for initial exploration and later analysis. Judging by its name, the file `twitter_cleaned_6.csv` holds Twitter data that has already been cleaned.


In [1]:
import pandas as pd
import numpy as np

# Load the dataset to see the structure of the data and perform necessary analyses
file_path = 'twitter_cleaned_6.csv'
data = pd.read_csv(file_path)

## Data Cleaning
Here, we rename the 'text' column to 'text_clean'. The new name makes the column's processing stage explicit and frees the name 'text' for the joined string column created in the next step.

In [2]:
# rename the 'text' column to 'text_clean'
data.rename(columns={'text':'text_clean'}, inplace=True)

Columns that held lists were saved to the CSV as strings such as "['a', 'b']". In this step, we parse 'text_clean', 'hastag', and 'text_ngrams' back into Python lists with `ast.literal_eval`. The parsed 'text_clean' lists are stored in a new 'text' column, and each list is then joined into a single space-separated string per record.

In [3]:
from ast import literal_eval

# Parse the stringified lists back into Python lists ('text_clean' is parsed into a new 'text' column)
data['text'] = data['text_clean'].apply(literal_eval)
data['hastag'] = data['hastag'].apply(literal_eval)
data['text_ngrams'] = data['text_ngrams'].apply(literal_eval)

# Join the lists into strings
data['text'] = data['text'].apply(' '.join)
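As a minimal sketch of what this conversion does, the example below runs the same two steps on a small made-up frame. The `demo` frame and its `tokens` column are hypothetical and only mimic how list columns look after a CSV round trip.

```python
from ast import literal_eval

import pandas as pd

# Hypothetical two-row frame: lists stored as strings, as they appear after read_csv
demo = pd.DataFrame({"text_clean": ["['free', 'palestine']", "['peace', 'now']"]})

# literal_eval only evaluates Python literals, so it cannot run arbitrary code
demo["tokens"] = demo["text_clean"].apply(literal_eval)

# Join each token list into one space-separated string
demo["text"] = demo["tokens"].apply(" ".join)
```

Keeping the parsed lists in a separate column, as sketched here, is one way to retain both the token list and the joined string.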

Next, we convert the 'date' column to pandas datetime. This enables date-based grouping and filtering through the `.dt` accessor, which the time-based steps below rely on.

In [4]:
data['date'] = pd.to_datetime(data['date'])

# Exploratory Data Analysis (EDA)
In this section, we perform exploratory data analysis (EDA) to understand the structure and trends within the dataset. This involves grouping the data by date, analyzing word frequencies, and identifying top users.

In [27]:
# count tweets per day of the month
grouped = data.groupby(data['date'].dt.day)['text'].count()

Here, we split the dataset at the 7th day of the month: tweets whose day is below 7 form the "before" set, and tweets on or after the 7th form the "after" set. Comparing the two sets helps identify changes in public sentiment or focus around events on that date.

In [29]:
# split the data into tweets before the 7th and tweets on or after the 7th
data_before = data[data['date'].dt.day < 7]
data_after = data[data['date'].dt.day >= 7]
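Note that `dt.day` compares only the day of the month. The sketch below, which uses made-up dates and a hypothetical `cutoff` timestamp, shows how the two approaches differ if the data ever spans more than one month.

```python
import pandas as pd

# Hypothetical dates spanning two months
demo = pd.DataFrame({"date": pd.to_datetime(["2023-10-05", "2023-10-07", "2023-11-03"])})

# Day-of-month filter: the November 3 row also counts as "before"
by_day = demo[demo["date"].dt.day < 7]

# Full-timestamp filter against a hypothetical cut-off date
cutoff = pd.Timestamp("2023-10-07")
by_timestamp = demo[demo["date"] < cutoff]
```

If all tweets fall within a single month, both filters select the same rows.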


In [None]:
# grouped.to_csv('before after.csv', index=True)

In this section, we measure user engagement by grouping the data by username and counting the tweets per user. Sorting the counts in descending order puts the most active participants first.


In [16]:
# group by username
grouped_user = data.groupby(data['username'])['text'].count()
grouped_user.sort_values(ascending=False, inplace=True)
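For comparison, a similar ranking can be sketched with `value_counts`, which counts rows per value and sorts in descending order by default. The `demo` frame is made up for illustration.

```python
import pandas as pd

# Hypothetical usernames, one row per tweet
demo = pd.DataFrame({
    "username": ["a", "b", "a", "c", "a", "b"],
    "text": ["t1", "t2", "t3", "t4", "t5", "t6"],
})

# Rows per username, most active first
top_users = demo["username"].value_counts()
```

One difference to keep in mind: `groupby(...)['text'].count()` skips rows where 'text' is missing, while `value_counts` counts every row with a username.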

In [125]:
# grouped_user.to_csv('grouped_user.csv', index=True)

This section focuses on analyzing the frequency of words used in tweets before and after the specified date. This analysis helps in understanding the shift in discussion topics or sentiment in response to external events.


In [None]:
from collections import Counter
# Flatten the n-gram lists of all tweets before the 7th into one list of words
all_words_before = [word for tokens in data_before["text_ngrams"] for word in tokens]

# Flatten the n-gram lists of all tweets on or after the 7th into one list of words
all_words_after = [word for tokens in data_after["text_ngrams"] for word in tokens]

# Count word occurrences in the "before" period
counts_before = Counter(all_words_before)

# Count word occurrences in the "after" period
counts_after = Counter(all_words_after)

# Create a dataframe of words and their counts for the "before" period, most frequent first
df_counts_before = pd.DataFrame.from_dict(counts_before, orient='index')
df_counts_before.rename(columns={0: 'counts'}, inplace=True)
df_counts_before.sort_values(by=['counts'], ascending=False, inplace=True)

# Create a dataframe of words and their counts for the "after" period, most frequent first
df_counts_after = pd.DataFrame.from_dict(counts_after, orient='index')
df_counts_after.rename(columns={0: 'counts'}, inplace=True)
df_counts_after.sort_values(by=['counts'], ascending=False, inplace=True)
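The count-then-sort pattern above can also be sketched with `Counter.most_common`, which returns (word, count) pairs already ordered by count. The token lists here are made up for illustration.

```python
from collections import Counter

import pandas as pd

# Hypothetical token lists, one per tweet
token_lists = [["free", "palestine"], ["palestine", "peace"], ["palestine"]]

# Count words across all tweets without building an intermediate list
word_counter = Counter(word for tokens in token_lists for word in tokens)

# most_common() yields pairs sorted by descending count
df_counts = pd.DataFrame(word_counter.most_common(), columns=["word", "counts"])
```

This variant keeps the word as a regular column instead of the index, which may be more convenient when exporting to CSV.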

In [None]:
# df_counts_before.to_csv('df_counts_before.csv')

In [None]:
# df_counts_after.to_csv('df_counts_after.csv')

This part of the analysis focuses on identifying the top words used across all tweets. This provides a general view of the most common themes or terms in the entire dataset.


In [None]:
# Flatten the n-gram lists of all tweets into one list of words
words = [word for day in data['text_ngrams'] for word in day]

# Count word occurrences across the whole dataset
word_counts = Counter(words)

# Create a dataframe from the word counts
counts = pd.DataFrame.from_dict(word_counts, orient='index', columns=['count'])

counts.sort_values(by='count', ascending=False, inplace=True)

In [131]:
# counts.to_csv('words counts.csv', index=True)

# Sampling
This section is dedicated to preparing the dataset for sampling. The aim is to create stratified samples representing tweets before and after the specified date, allowing for more accurate analysis of each period.


In [5]:
data_all = data.copy()

In [6]:
# label each row 'before' or 'after'
data['stratum'] = ['before' if x < 7 else 'after' for x in data['date'].dt.day]

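Next, we calculate how many tweets to sample from each stratum. The formula takes the stratum size N, an assumed proportion p, and a bound B on the error of estimation. We use p = 0.5, the most conservative choice because it maximizes p(1 - p), and B = 0.05. This matches the standard sample-size formula for estimating a proportion in a finite population.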

In [7]:
# split based on strata
stratum_1 = data[data['stratum'] == 'before']
stratum_2 = data[data['stratum'] == 'after']

# Now, we'll calculate the sample size for each stratum using the provided formula
# Given values
p = 0.5
B = 0.05
N_1 = len(stratum_1)  # Size of stratum 1
N_2 = len(stratum_2)  # Size of stratum 2

# Define the function to calculate the sample size using the provided formula
def calculate_sample_size(N, p, B):
    numerator = N * p * (1 - p)
    denominator = ((N - 1) * (B**2 / 4)) + (p * (1 - p))
    return numerator / denominator

# Calculate sample sizes for each stratum
n_1 = calculate_sample_size(N_1, p, B)
n_2 = calculate_sample_size(N_2, p, B)

(n_1, n_2)

(62.57928118393234, 364.15899393667183)
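To make the formula concrete, the sketch below evaluates n = N p (1 - p) / ((N - 1) B^2 / 4 + p (1 - p)) for a hypothetical stratum of 1,000 tweets. The stratum size is an assumption chosen for illustration and is not taken from this dataset.

```python
import math


def calculate_sample_size(N, p, B):
    """Sample size for estimating a proportion p in a population of N with bound B."""
    return (N * p * (1 - p)) / ((N - 1) * (B ** 2 / 4) + p * (1 - p))


# Hypothetical stratum size, for illustration only
n_raw = calculate_sample_size(N=1000, p=0.5, B=0.05)

# Rounding up never undershoots the computed requirement
n_up = math.ceil(n_raw)

# For very large N the result approaches p(1 - p) / (B^2 / 4) = 400
n_limit = calculate_sample_size(N=10**9, p=0.5, B=0.05)
```

A design note: `round()` can round down, as it does for 364.16 in the output above, so `math.ceil` may be the safer choice when the computed size is treated as a minimum.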

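In this step, we draw a systematic sample from each stratum. Systematic sampling picks a random starting row and then selects every k-th row from the ordered stratum, which spreads the sample evenly across it. Because the step k is obtained by floor division, the realized sample can be larger than the target. In the output below, the 'before' stratum has a step of 1, so every row is selected, and the combined sample has 442 rows rather than 63 + 364 = 427.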

In [8]:
# Round the sample sizes to the nearest whole number
n_1 = round(n_1)
n_2 = round(n_2)

# Calculate the steps for systematic sampling for each stratum
step_1 = N_1 // n_1
step_2 = N_2 // n_2

# Generate the starting point randomly between 0 and step-1 for each stratum
start_1 = np.random.randint(0, step_1)
start_2 = np.random.randint(0, step_2)

# Create the indices of the systematic sample for each stratum
indices_1 = range(start_1, N_1, step_1)
indices_2 = range(start_2, N_2, step_2)

# Take the systematic sample for each stratum
systematic_sample_1 = stratum_1.iloc[indices_1]
systematic_sample_2 = stratum_2.iloc[indices_2]

# Combine the samples from both strata
systematic_sample_combined = pd.concat([systematic_sample_1, systematic_sample_2])

# Output the number of samples from each stratum and the combined systematic sample
(start_1, start_2, step_1, step_2, n_1, n_2, len(systematic_sample_combined))

(0, 9, 1, 11, 63, 364, 442)
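If the sample must contain exactly n rows, one common variant uses a fractional step N / n instead of floor division. The sketch below is an alternative under that assumption, not the method used above; the helper `systematic_indices`, the sizes, and the seed are all hypothetical.

```python
import numpy as np


def systematic_indices(N, n, rng):
    """Return n distinct positions in range(N), spaced by the fractional step N / n."""
    step = N / n
    start = rng.uniform(0, step)  # random start in [0, step)
    return np.floor(start + step * np.arange(n)).astype(int)


# Fixed seed so the draw is reproducible
rng = np.random.default_rng(42)

# Hypothetical stratum of 74 rows with a target of 63
idx = systematic_indices(N=74, n=63, rng=rng)
```

The resulting positions can be passed to `DataFrame.iloc` in the same way as the `range` objects above. Seeding the generator also makes the sample repeatable across notebook runs.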

In [9]:
# # build a DataFrame with three columns: stratum, N, n
# stratum = ['before', 'after']
# N = [N_1, N_2]
# n = [n_1, n_2]

# df_N = pd.DataFrame({'stratum':stratum, 'N':N, 'n':n})

# df_N.to_csv('sample.csv', index = False)

In [10]:
data = systematic_sample_combined

# Sentiment Analysis with VADER
This section utilizes VADER (Valence Aware Dictionary and sEntiment Reasoner) for sentiment analysis. VADER is particularly well-suited for social media text and can effectively evaluate the sentiment of short texts. The analysis will categorize each tweet into negative, positive, or neutral sentiments.

In [11]:
from nltk.sentiment import SentimentIntensityAnalyzer
import nltk
# nltk.download('vader_lexicon')

# Initialize VADER sentiment intensity analyzer
sia = SentimentIntensityAnalyzer()

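# Define a function to calculate the sentiment scores of one tweet
def calculate_sentiment(text_list):
    # 'text' was already joined into a plain string in an earlier cell, so this
    # cleanup only matters if a stringified list slips through: it strips the
    # brackets and quotes and joins the items with spaces.
    # Note that replace("'", "") also removes apostrophes inside words.
    text_str = ' '.join(text_list.strip("[]").replace("'", "").split(", "))
    # Return VADER's dict of 'neg', 'neu', 'pos', and 'compound' scores
    return sia.polarity_scores(text_str)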

# Apply the function to the 'text' column to get the sentiment scores
sentiment_scores = data['text'].apply(calculate_sentiment)

# Add the sentiment scores to the dataframe
data = pd.concat([data, sentiment_scores.apply(pd.Series)], axis=1)
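Expanding the score dicts with `apply(pd.Series)` builds one Series per row, which can be slow on larger frames. The sketch below shows an alternative that builds all score columns at once. The score values are made-up stand-ins for `sia.polarity_scores` output, so no lexicon download is needed to run it.

```python
import pandas as pd

demo = pd.DataFrame({"text": ["good day", "bad day"]}, index=[10, 20])

# Hypothetical stand-in for VADER output (values are invented for illustration)
scores = pd.Series(
    [
        {"neg": 0.0, "neu": 0.4, "pos": 0.6, "compound": 0.44},
        {"neg": 0.6, "neu": 0.4, "pos": 0.0, "compound": -0.54},
    ],
    index=demo.index,
)

# Build the score columns in one step, aligned on the original index
expanded = pd.DataFrame(scores.tolist(), index=scores.index)
demo = demo.join(expanded)
```

Passing the original index matters here because the sampled frame no longer has a consecutive index.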

Here, we categorize each tweet into 'negative', 'positive', or 'neutral' based on the highest sentiment score among 'neu', 'neg', and 'pos'. This categorization helps in simplifying the sentiment analysis results for further interpretation and analysis.

In [19]:
# create a 'sentiment' column: negative, positive, or neutral, whichever of neu, neg, pos is largest
def sentiment_into_negative_positive_neutral(neu, neg, pos):
    if neu > neg and neu > pos:
        return 'neutral'
    elif neg > neu and neg > pos:
        return 'negative'
    elif pos > neu and pos > neg:
        return 'positive'
    # Note: an exact tie for the largest score matches no branch, so the function returns None
    
data['sentiment'] = data.apply(lambda x: sentiment_into_negative_positive_neutral(x['neu'], x['neg'], x['pos']), axis=1)
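The same labelling can be sketched without a row-wise `apply` by using `idxmax`, which also gives ties a defined outcome. The scores below are made up, and resolving ties in favour of the first listed column is an assumption of this sketch, not the behaviour of the function above.

```python
import pandas as pd

# Hypothetical scores; the last row ties 'neu' and 'pos'
demo = pd.DataFrame({
    "neu": [0.8, 0.2, 0.5],
    "neg": [0.1, 0.5, 0.0],
    "pos": [0.1, 0.3, 0.5],
})

labels = {"neu": "neutral", "neg": "negative", "pos": "positive"}

# idxmax returns the first column holding the row maximum, so ties resolve
# in the column order given here (neu, then neg, then pos)
demo["sentiment"] = demo[["neu", "neg", "pos"]].idxmax(axis=1).map(labels)
```

Reordering the column list changes which label wins a tie.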

In this step, we filter out tweets categorized as 'neutral' to focus our analysis on more clearly positive or negative sentiments. This helps in obtaining a more distinct sentiment perspective.

In [20]:
# drop tweets labelled neutral
data = data[data['sentiment'] != 'neutral']

Here, we further simplify the labels to 'positive' and 'negative' using VADER's 'compound' score. A score of 0.5 or higher is labelled 'positive'. Anything below 0.5 is labelled 'negative', which includes neutral and mildly positive tweets.

In [12]:
# create a sentiment column with just negative and positive, based on the compound column
# if compound >= 0.5, label it positive
# if compound < 0.5, label it negative

def sentiment_into_negative_positive(compound):
    if compound >= 0.5:
        return 'positive'
    else:
        return 'negative'
    
data['sentiment_2'] = data['compound'].apply(sentiment_into_negative_positive)
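To see how the cut-off affects the labels, the sketch below compares the 0.5 rule used above with the three-way thresholds of +0.05 and -0.05 commonly cited for VADER's compound score. The compound values are made up, and the three-way variant is shown only for comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical compound scores
compound = pd.Series([0.62, 0.30, 0.0, -0.40])

# Two-way rule from the cell above: everything below 0.5 is negative
two_way = np.where(compound >= 0.5, "positive", "negative")

# Three-way rule with the commonly cited +/-0.05 thresholds
three_way = np.select(
    [compound >= 0.05, compound <= -0.05],
    ["positive", "negative"],
    default="neutral",
)
```

In this example the score 0.30 is 'negative' under the two-way rule but 'positive' under the three-way rule, which illustrates how strict the 0.5 threshold is.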

# Sentiment Analysis with TextBlob
In addition to VADER, we employ TextBlob for sentiment analysis. TextBlob provides a simple API to access its methods for performing basic NLP tasks. It is particularly useful for obtaining a tweet's polarity (positive/negative sentiment) and subjectivity (objective/subjective measurement).

In [12]:
from textblob import TextBlob

# Function to apply sentiment analysis
def analyze_sentiment(text):
    # Create a TextBlob object
    analysis = TextBlob(text)
    # Return polarity and subjectivity
    return analysis.sentiment.polarity, analysis.sentiment.subjectivity

# Apply sentiment analysis to the 'text_clean' column, which still holds stringified lists of words.
# Each list is parsed and joined into a full sentence first.
# Note: this overwrites any existing 'sentiment' column with (polarity, subjectivity) tuples.
data['sentiment'] = data['text_clean'].apply(lambda x: ' '.join(literal_eval(x))).apply(analyze_sentiment)

Here, we split the (polarity, subjectivity) tuples into separate columns. Polarity ranges from -1 (negative) to 1 (positive), and subjectivity ranges from 0 (objective) to 1 (subjective).


In [13]:
# keep only the subjectivity score
data['subjectivity'] = data['sentiment'].apply(lambda x: x[1])

In [14]:
# keep only the polarity score
data['polarity'] = data['sentiment'].apply(lambda x: x[0])

# Emotion Analysis with NRCLex
This section performs emotion analysis with NRCLex, a lexicon-based tool that scores a text by looking up its words in an affect lexicon derived from the NRC Emotion Lexicon. For each tweet, we keep the affect category with the highest frequency.


In [15]:
from nrclex import NRCLex

# # Function to perform emotion analysis using NRCLex
# def emotion_analysis_with_nrclex(text):
#     # Create an NRCLex object for the text
#     emotion = NRCLex(text)
    
#     # Access the emotions and frequency score
#     emotions = emotion.affect_frequencies
    
#     # The emotion object also has other attributes and methods that can be useful:
#     # - emotion.raw_emotion_scores: Emotion scores before normalization.
#     # - emotion.top_emotions: The highest scoring emotions.
#     # - emotion.words: The words in the text associated with emotions.
    
#     return emotions

# Define a function to apply NRCLex to each text
def apply_nrclex(text):
    # Instantiate NRCLex with the text
    text_object = NRCLex(text)
    # Get the affect frequencies
    emotion_frequencies = text_object.affect_frequencies
    # Find the emotion with the highest frequency
    highest_emotion = max(emotion_frequencies, key=emotion_frequencies.get)
    # Return only the label of the highest-frequency emotion (the score is not returned)
    return highest_emotion

# Apply the function to the 'text' column
data['emotions'] = data['text'].apply(apply_nrclex)
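One edge case is worth a sketch: when every frequency in the dict is zero, Python's `max` still returns the first key, so a tweet with no scored words would receive an arbitrary label. The helper `top_emotion` and both dicts below are hypothetical and only mimic the shape of `affect_frequencies`.

```python
def top_emotion(frequencies):
    """Return the highest-frequency label, or None when nothing scored."""
    if not frequencies or max(frequencies.values()) == 0:
        return None
    return max(frequencies, key=frequencies.get)


# Hypothetical frequency dicts shaped like NRCLex's affect_frequencies
scored = {"fear": 0.25, "anger": 0.5, "joy": 0.25}
unscored = {"fear": 0.0, "anger": 0.0, "joy": 0.0}

label_scored = top_emotion(scored)
label_unscored = top_emotion(unscored)

# Without the guard, max() falls back to the first key
fallback = max(unscored, key=unscored.get)
```

Returning `None` for unscored tweets makes them easy to filter out or count separately.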

# Latent Dirichlet Allocation (LDA) for Topic Modeling
This section is dedicated to performing topic modeling using LDA. LDA is a popular method for extracting topics from a collection of documents. It helps in discovering abstract topics within the text data, which can be instrumental in understanding the underlying themes in large text corpora.

In [78]:
import pandas as pd
from gensim.models import LdaModel
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

# Prepare the data for the LDA model
documents = data['text_ngrams']

# Build a dictionary from the texts
dictionary = Dictionary(documents)

# Build the bag-of-words corpus
corpus = [dictionary.doc2bow(doc) for doc in documents]

# Run LDA for topic counts from 2 to 10
start_topics = 2
end_topics = 10

best_coherence = -1
best_lda_model = None
best_num_topics = 0

If an existing LDA model is saved, it can be loaded for further analysis or comparison. This step is useful for reusing models without needing to retrain them.


In [79]:
# # load lda model
# lda_model = LdaModel.load('lda_model')

## Evaluating Coherence Scores
To determine the optimal number of topics for the LDA model, we evaluate the coherence score for different numbers of topics. The coherence score measures the degree of semantic similarity between high scoring words in the topic, helping to choose the number of topics that make the most sense.

In [53]:
import matplotlib.pyplot as plt

# Lists to store the coherence scores and the topic counts
coherence_scores = []
num_topics_list = list(range(start_topics, end_topics + 1))

# Run LDA for each topic count and collect the coherence scores
for num_topics in num_topics_list:
    lda_model = LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=15)
    coherence_model = CoherenceModel(model=lda_model, texts=documents, dictionary=dictionary, coherence='c_v')
    coherence_score = coherence_model.get_coherence()
    coherence_scores.append(coherence_score)
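Once the scores are collected, the best-scoring topic count can be read off directly. The sketch below assumes made-up coherence values, which are not results from this dataset, and reuses the names `best_coherence` and `best_num_topics` defined earlier.

```python
import numpy as np

num_topics_list = list(range(2, 11))

# Made-up coherence scores for illustration only, not results from this dataset
coherence_scores = [0.31, 0.34, 0.33, 0.39, 0.36, 0.35, 0.37, 0.32, 0.30]

# Position of the highest score, then the matching topic count
best_idx = int(np.argmax(coherence_scores))
best_num_topics = num_topics_list[best_idx]
best_coherence = coherence_scores[best_idx]
```

The highest coherence is only a guide. Interpretability of the resulting topics is usually weighed as well when choosing the final number.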

This step involves visualizing the coherence scores for different numbers of topics. A graph is plotted to illustrate how coherence scores change with the number of topics, aiding in the selection of an optimal number for further analysis.


In [None]:
# Plot the coherence scores
plt.figure(figsize=(10, 6))
plt.plot(num_topics_list, coherence_scores, marker='o')
plt.title('Coherence Score vs Number of Topics')
plt.xlabel('Number of Topics')
plt.ylabel('Coherence Score')
plt.grid(False)
plt.show()

## Displaying Topics for a Specific Model
After determining the optimal number of topics, the model can be used to display the top words for each topic. This step is crucial for interpreting the themes and subjects that the LDA model has uncovered in the dataset.


In [80]:
# show the topics for a 4-topic model
# Run LDA with 4 topics
# lda_model = LdaModel(corpus, num_topics=4, id2word=dictionary, passes=15)

# # load lda model
# lda_model = LdaModel.load('lda_model')

# Display the topics
# Note: with both assignments above commented out, lda_model is whichever model was assigned last
for topic in lda_model.show_topics(num_topics=4, num_words=12, formatted=False):
    print('Topic', topic[0], ':', ', '.join([word[0] for word in topic[1]]))

Topic 0 : palestine, israel, palestinian, people, land, one, free, like, right, israeli, free_palestine, civilian
Topic 1 : palestine, israel, people, hamas, free, israel_palestine, palestinian, peace, would, supporter, one, free_palestine
Topic 2 : palestine, israel, hamas, people, war, terrorist, peace, israel_palestine, palestinian, like, israeli, know
Topic 3 : palestine, israel, israel_palestine, people, get, hamas, year, attack, palestinian, jew, free, like


Finally, we assign the most relevant topic to each document in the dataset. This step involves creating a new column in the DataFrame that contains the topic number with the highest contribution for each document.


In [21]:
# add a new column holding each document's dominant topic number
# The topic with the highest probability under lda_model is kept
data['topic'] = data['text_ngrams'].apply(lambda x: sorted(lda_model[dictionary.doc2bow(x)], key=lambda tup: tup[1], reverse=True)[0][0])
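The one-liner above sorts the whole topic distribution just to take its first element. A small helper with `max` does the same in a single pass and is easier to read. The helper `dominant_topic` and the example distribution are hypothetical and only mimic the (topic_id, probability) pairs that `lda_model[bow]` returns.

```python
def dominant_topic(topic_distribution):
    """Return the topic id with the largest probability, or None if the list is empty."""
    if not topic_distribution:
        return None
    return max(topic_distribution, key=lambda pair: pair[1])[0]


# Hypothetical output shaped like lda_model[bow]: (topic_id, probability) pairs
example = [(0, 0.10), (2, 0.72), (3, 0.18)]

top = dominant_topic(example)
empty = dominant_topic([])
```

In the notebook this could be applied as `data['text_ngrams'].apply(lambda x: dominant_topic(lda_model[dictionary.doc2bow(x)]))`. Keep in mind that gensim may omit low-probability topics from the returned list.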