## **Objective**
### Social Media Tweet Analysis on Twitter Dataset
*   Topic Modeling on Twitter Dataset


*   Reference for [Topic modeling](https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0)

*   Sentiment analysis on Twitter Dataset








### **Business understanding**

### **Topic modeling**
A topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of texts.
Topic modeling is an unsupervised approach for finding groups of words (called “topics”) in large collections of texts.
**Topic models** are built around the idea that the semantics of a document are governed by hidden, or “latent,” variables that we do not observe directly.

*   Our task here is to discover abstract topics from tweets.


### **Sentiment analysis**
Sentiment analysis is used in social media monitoring. It allows businesses to gain insight into how customers feel about certain topics and to detect urgent issues in real time, before they spiral out of control.


*   Our task here is to classify the sentiment of each tweet as positive or negative.
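As a minimal sketch of this classification task, assuming each tweet already carries a numeric polarity score in the range -1 to 1 (as the `polarity` column of the dataset does), a tweet can be labeled by the sign of its score. The function name `label_sentiment`, the `neutral_band` parameter, and the extra "neutral" label are illustrative assumptions, not part of the original notebook.

```python
def label_sentiment(polarity, neutral_band=0.0):
    """Map a polarity score in [-1, 1] to a sentiment label.

    neutral_band is a hypothetical tolerance: scores whose absolute value
    is at most neutral_band are labeled 'neutral'.
    """
    if polarity > neutral_band:
        return "positive"
    if polarity < -neutral_band:
        return "negative"
    return "neutral"


# Example polarity scores (illustrative values)
for score in (0.16, -0.40, 0.0):
    print(score, label_sentiment(score))
```

Widening `neutral_band` (for example to 0.1) treats weakly polar tweets as neutral, which may be preferable when the polarity scores are noisy.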




**Topic modeling** is a machine learning technique that automatically analyzes text data to find clusters of words that characterize a set of documents.


*   It is unsupervised machine learning because it doesn’t require a predefined list of tags or training data that has been previously classified by humans.
*   Because it doesn’t require labeled training data, it’s a quick and easy way to start analyzing your data.

## Data Understanding
### Loading necessary packages

In [58]:
!pip install pyLDAvis



In [1]:
import warnings
warnings.filterwarnings('ignore')  # suppress library warnings in notebook output
import os
import re
import string
from pprint import pprint

import gensim
import gensim.corpora as corpora
import matplotlib.pyplot as plt
import nltk
import pandas as pd
import plotly.express as px
import plotly.graph_objs as go
import seaborn as sns
import spacy
from gensim.models import CoherenceModel
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords
from wordcloud import STOPWORDS, WordCloud

### Data acquisition

For this example we have two options for data acquisition:

*   You can download the Twitter dataset directly from Twitter by registering as a developer [here](https://developer.twitter.com/en)

*   Or you can use the downloaded data found at Week0/data/cleaned_fintech_data.csv



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# Data loader class
class DataLoader:
  def __init__(self, dir_name, file_name):
    self.dir_name = dir_name
    self.file_name = file_name

  def read_csv(self):
    # Build the full path instead of changing the working directory
    tweets_df = pd.read_csv(os.path.join(self.dir_name, self.file_name))
    return tweets_df
  
    

In [74]:
#object creation
DataLoader_obj= DataLoader('/content/','processed_tweet_data.csv')


In [75]:
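tweets_df=DataLoader_obj.read_csv()
# dropna() returns a new DataFrame and leaves tweets_df unchanged; the result is only displayed here
tweets_df.dropna()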


Unnamed: 0,statuses_count,created_at,source,original_text,polarity,subjectivity,lang,favorite_count,retweet_count,screen_name,followers_count,friends_count,sensitivity,hashtags,user_mentions,place


In [None]:
tweets_df.head()

Unnamed: 0,statuses_count,created_at,source,original_text,polarity,subjectivity,lang,favorite_count,retweet_count,screen_name,followers_count,friends_count,sensitivity,hashtags,user_mentions,place
0,40,Fri Apr 22 22:20:18 +0000 2022,"<a href=""http://twitter.com/download/android"" ...",RT @nikitheblogger: Irre: Annalena Baerbock sa...,0.0,0.0,de,2356.0,355.0,McMc74078966,3,12,,,nikitheblogger,
1,40,Fri Apr 22 22:19:16 +0000 2022,"<a href=""http://twitter.com/download/android"" ...",RT @sagt_mit: Merkel schaffte es in 1 Jahr 1 M...,0.0,0.0,de,1985.0,505.0,McMc74078966,3,12,,,sagt_mit,
2,40,Fri Apr 22 22:17:28 +0000 2022,"<a href=""http://twitter.com/download/android"" ...",RT @Kryptonoun: @WRi007 Pharma in Lebensmittel...,0.0,0.0,de,16.0,4.0,McMc74078966,3,12,,,Kryptonoun WRi007,
3,40,Fri Apr 22 22:17:20 +0000 2022,"<a href=""http://twitter.com/download/android"" ...",RT @WRi007: Die #Deutschen sind ein braves Vol...,0.0,0.0,de,1242.0,332.0,McMc74078966,3,12,,Deutschen Spritpreisen inflation Abgaben,WRi007,
4,40,Fri Apr 22 22:13:15 +0000 2022,"<a href=""http://twitter.com/download/android"" ...",RT @RolandTichy: Baerbock verkündet mal so neb...,0.0,0.0,de,1329.0,386.0,McMc74078966,3,12,,,RolandTichy,


In [None]:
tweets_df['lang'].unique()

array(['de', 'und', 'en', 'fr', 'hu', 'nl', 'lt', 'ro', 'pt', 'fi', 'ja',
       'ar', 'in', 'tr', 'it', 'ca', 'ur', 'sl', 'hi', 'cs', 'es', 'pl',
       'tl', 'ht', 'et', 'ru', 'da', 'no', 'uk', 'sv', 'cy', 'th', 'ko',
       nan, 'Yujin_030901', 'zh', 'lv', 'te', 'ml', 'bn', 'GDSroy', 'mr',
       'ShivaKJSP', 'eu', 'kn', 'or', 'ta', 'ne', 'gu', 'pa', 'fa', 'km',
       'si'], dtype=object)
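The `lang` values above include entries such as `nan`, `'Yujin_030901'`, `'GDSroy'`, and `'ShivaKJSP'`, which look like screen names rather than language codes and may indicate misaligned rows in the CSV. A minimal sketch for flagging such values, assuming a valid code is two or three lowercase ASCII letters (the helper `is_lang_code` is hypothetical, not part of the notebook):

```python
import re

# Assumption: valid codes are two or three lowercase ASCII letters
LANG_CODE = re.compile(r"[a-z]{2,3}")


def is_lang_code(value):
    """Return True if value looks like a language code such as 'en' or 'und'."""
    return isinstance(value, str) and LANG_CODE.fullmatch(value) is not None


# A few values copied from the output above
sample = ["de", "und", "en", float("nan"), "Yujin_030901", "GDSroy", "ShivaKJSP"]
valid = [v for v in sample if is_lang_code(v)]
print(valid)
```

Filtering on `lang == 'en'`, as the next cell does, already excludes these rows; the check is mainly useful for counting how many rows are affected.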

In [76]:
# Keep only English tweets, then drop the now-redundant lang column
tweet_df_2 = tweets_df[tweets_df.lang == 'en'].drop('lang', axis=1).reset_index(drop = True)

# Check the most frequent values in the place column
tweet_df_2['place'].value_counts().head(30)

India                      519
United States              254
Sri Lanka                  228
London, England            195
Canada                     193
New Delhi                  177
Mumbai                     144
Mars                       143
Kenya                      133
Chennai, India             113
Hyderabad, India           103
San Francisco, CA          100
Ireland                     98
Boston, MA                  97
Texas                       96
United Kingdom              96
London                      94
Sydney, New South Wales     93
Melbourne, Victoria         93
South Africa                93
Nairobi Kenya               93
Bankura, India              92
UK                          89
Nairobi, Kenya              89
Dallas, TX                  86
Metaverse                   82
New Delhi, India            80
California, USA             76
Mumbai, India               74
England, United Kingdom     59
Name: place, dtype: int64

In [77]:
# Map city- and region-level place values to their country
def rename(first, second):
  tweet_df_2['place'] = tweet_df_2['place'].replace(first, second)

rename('England, United Kingdom', 'United Kingdom')
rename('London, England', 'United Kingdom')
rename('London', 'United Kingdom')
rename('London, UK', 'United Kingdom')
rename('East London', 'United Kingdom')
rename('England', 'United Kingdom')
rename('UK', 'United Kingdom')
rename('Texas, USA', 'United States')
rename('Dallas, TX', 'United States')
rename('Texas', 'United States')
rename('us', 'United States')
rename('USA', 'United States')
rename('Boston, MA', 'United States')
rename('San Francisco, CA', 'United States')
rename('New York, NY', 'United States')
rename('New York', 'United States')
rename('San Diego, CA', 'United States')
rename('New York, USA', 'United States')
rename('Florida, USA', 'United States')
rename('Los Angeles, CA', 'United States')
rename('California, USA', 'United States')
rename('Washington, DC', 'United States')
rename('Chicago, IL', 'United States')
rename('New Delhi, India', 'India')
rename('Chennai, India', 'India')
rename('Hyderabad, India', 'India')
rename('Mumbai, INDIA', 'India')
rename('Asansol, India', 'India')
rename('Mumbai, India', 'India')
rename('Madanapalle, India', 'India')
rename('New Delhi', 'India')
rename('Mumbai', 'India')
rename('Bankura, India', 'India')
rename('Nairobi, Kenya', 'Kenya')
rename('Nairobi Kenya', 'Kenya')
rename('Sydney, New South Wales', 'Australia')
rename('Melbourne, Victoria', 'Australia')
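
The same normalization can also be expressed as a single lookup table, which is easier to extend and review than a long series of calls. The sketch below uses plain Python; `PLACE_TO_COUNTRY` and `normalize_place` are hypothetical names, and the mapping shows only a subset of the variants listed above.

```python
from collections import Counter

# Hypothetical subset of the mapping; extend it with the remaining variants
PLACE_TO_COUNTRY = {
    "London, England": "United Kingdom",
    "UK": "United Kingdom",
    "Dallas, TX": "United States",
    "New Delhi": "India",
    "Nairobi Kenya": "Kenya",
    "Melbourne, Victoria": "Australia",
}


def normalize_place(place):
    """Return the mapped country, or the original value if no mapping exists."""
    return PLACE_TO_COUNTRY.get(place, place)


# Illustrative place values
places = ["UK", "London, England", "New Delhi", "India", "Mars"]
counts = Counter(normalize_place(p) for p in places)
print(counts.most_common())
```

With pandas, the same dictionary can be passed to `Series.replace`, for example `tweet_df_2['place'] = tweet_df_2['place'].replace(PLACE_TO_COUNTRY)`.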

In [7]:
tweet_df_2['place'].value_counts().head(30)

India                           1461
United States                    839
United Kingdom                   583
Kenya                            315
Sri Lanka                        228
Canada                           193
Australia                        186
Mars                             143
Ireland                           98
South Africa                      93
Metaverse                         82
Boston                            50
Burbank, CA                       50
Saddle Hills County, Alberta      50
Abuja & London                    50
Instagram: pastexpirycom          50
Scottsdale, Arizona               50
Limassol                          50
Get a FREE demo ⬇️⬇️              50
The Stock Market                  50
Michigan, USA                     50
Columbus, OH                      50
Palo Alto, CA                     50
Global                            50
sri lanka                         50
Bannockburn, IL                   50
Austin, Texas                     50
N

In [None]:
len(tweets_df)

5621

In [None]:
emoji_pattern = re.compile('['
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "]+", flags=re.UNICODE)

In [None]:
# Remove @names, links, and emojis because they don't convey any sentiment
class PrepareData:
  def __init__(self, df):
    self.df = df

  @staticmethod
  def clean_tweet(text):
    # Remove emojis
    text = emoji_pattern.sub('', text)
    # Remove retweet markers and user mentions
    text = re.sub(r'RT @\w+:', '', text)
    text = re.sub(r'@\w+', '', text)
    # Remove links
    text = re.sub(r'https?://\S+', '', text)
    return text

  def preprocess_data(self):
    # Keep English tweets only; copy so the original DataFrame is not modified
    tweets_df = self.df.loc[self.df['lang'] == "en"].copy()

    # Text preprocessing: clean first (the 'RT' pattern is case-sensitive), then lowercase
    tweets_df['original_text'] = tweets_df['original_text'].astype(str)
    tweets_df['original_text'] = tweets_df['original_text'].apply(self.clean_tweet)
    tweets_df['original_text'] = tweets_df['original_text'].apply(lambda x: x.lower())
    # Strip punctuation
    tweets_df['original_text'] = tweets_df['original_text'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))

    return tweets_df
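
To see what these cleaning steps do to a single tweet, here is a standalone sketch that works on a plain string. The name `clean_tweet`, the sample text, and the final whitespace-collapsing step are assumptions added for illustration; the patterns follow the steps named in the comments above.

```python
import re
import string

EMOJI = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags
    "]+"
)


def clean_tweet(text):
    """Strip emojis, retweet markers, mentions, links, and punctuation."""
    text = EMOJI.sub("", text)
    text = re.sub(r"RT @\w+:", "", text)      # retweet prefix
    text = re.sub(r"@\w+", "", text)          # user mentions
    text = re.sub(r"https?://\S+", "", text)  # links
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())             # collapse repeated whitespace


# Made-up tweet, for illustration only
print(clean_tweet("RT @user: Rates rise! 😀 https://t.co/abc123 @other"))
```

Links are removed before punctuation is stripped; in the opposite order the `://` would be deleted first and the URL pattern would no longer match.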


In [80]:
tweet_df_2 = tweets_df
tweet_df_2

Unnamed: 0,statuses_count,created_at,source,original_text,polarity,subjectivity,favorite_count,retweet_count,screen_name,followers_count,friends_count,sensitivity,hashtags,user_mentions,place
0,281,Fri Apr 22 22:17:05 +0000 2022,"<a href=""http://twitter.com/download/android"" ...",the 10-year yield is telling us that there's a...,0.16,0.540000,188.0,43.0,davideiacovozzi,18,55,,gold silver crypto,NorthstarCharts,
1,281,Fri Apr 22 13:44:53 +0000 2022,"<a href=""http://twitter.com/download/android"" ...","german 10y mortgage rate went from 0,8% to 2,5...",0.15,0.175000,179.0,32.0,davideiacovozzi,18,55,,,MichaelAArouet,
2,281,Fri Apr 22 06:10:34 +0000 2022,"<a href=""http://twitter.com/download/android"" ...",when? ko2ffhkazg,0.00,0.000000,193.0,26.0,davideiacovozzi,18,55,,,goldseek,
3,281,Thu Apr 21 17:22:09 +0000 2022,"<a href=""http://twitter.com/download/android"" ...",the 30-year mortgage rate in the us rises to 5...,0.00,0.183333,620.0,213.0,davideiacovozzi,18,55,,,charliebilello,
4,281,Thu Apr 21 10:32:26 +0000 2022,"<a href=""http://twitter.com/download/android"" ...",rates rise until something breaks … is anythin...,-0.40,0.400000,1787.0,417.0,davideiacovozzi,18,55,,,biancoresearch,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16460,21272,Fri Apr 22 15:22:56 +0000 2022,"<a href=""http://twitter.com/download/iphone"" r...",best wishes &amp; heartfelt congratulations to...,0.50,0.729630,2924.0,300.0,kitukalesatya,706,643,,,CHARANJITCHANNI RajaBrar_INC BB__Ashu,
16461,21272,Fri Apr 22 15:22:29 +0000 2022,"<a href=""http://twitter.com/download/iphone"" r...",thank you for this beautiful message of commu...,0.85,1.000000,14671.0,5006.0,kitukalesatya,706,643,,,pbhushan1 BajpayeeManoj,
16462,21272,Fri Apr 22 15:01:27 +0000 2022,"<a href=""http://twitter.com/download/iphone"" r...",agree ? r54zjw3kgb,0.00,0.000000,5056.0,973.0,kitukalesatya,706,643,,,s_shreyatweets,
16463,21272,Fri Apr 22 14:58:12 +0000 2022,"<a href=""http://twitter.com/download/iphone"" r...",1. peace yatra by late sunil dutt from mumbai ...,-0.30,0.600000,636.0,115.0,kitukalesatya,706,643,,,tejjINC,


In [9]:
# Top 15 places by number of tweets
place_counts = tweet_df_2['place'].value_counts()[:15]
x = place_counts.index
y = place_counts.values

fig = go.Figure()
fig.add_trace(go.Bar(x = x, y = y))

fig.update_layout(
    title = 'Tweets by Country',
    height = 600,
    width = 1300,
    )

fig.show(renderer = 'colab')

In [103]:
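# Sum the numeric engagement columns for each polarity value
group_data = tweet_df_2.groupby('polarity')[['subjectivity', 'favorite_count', 'retweet_count']].sum().reset_index()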
group_data.head(20)

Unnamed: 0,polarity,subjectivity,favorite_count,retweet_count
0,-1.0,58.657143,65615.0,8791.0
1,-0.9375,1.0,0.0,0.0
2,-0.915527,0.6,2974.0,338.0
3,-0.91,1.0,0.0,0.0
4,-0.9,9.5,4589.0,3200.0
5,-0.875,8.8,6770.0,5275.0
6,-0.85,0.95,0.0,0.0
7,-0.833333,1.916667,0.0,0.0
8,-0.825,1.0,0.0,0.0
9,-0.821429,0.928571,8.0,4.0


In [None]:
group_data.tail(20)

In [124]:
# Histogram over polarity, with bars weighted by the summed retweet count
fig = px.histogram(group_data, x='polarity', y='retweet_count')
fig.update_layout(
    title = 'Retweet count by polarity',
    height = 600,
    width = 1300,
    )

fig.show(renderer = 'colab')

In [126]:
# Histogram over polarity, with bars weighted by the summed favorite count
fig = px.histogram(group_data, x='polarity', y='favorite_count')
fig.update_layout(
    title = 'Favorite count by polarity',
    height = 600,
    width = 1300,
    )

fig.show(renderer = 'colab')

In [127]:
# Histogram over polarity, with bars weighted by the summed subjectivity
fig = px.histogram(group_data, x='polarity', y='subjectivity')
fig.update_layout(
    title = 'Subjectivity by polarity',
    height = 600,
    width = 1300,
    )

fig.show(renderer = 'colab')

In [81]:
nltk.download('stopwords')
stopwords_set = nltk.corpus.stopwords.words('english')
stopwords_set.extend(['from', 'subject', 're', 'edu', 'use'])

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [82]:
# Tokenize words and clean up text
def sent_to_words(sentences):
  for sentence in sentences:
    # simple_preprocess lowercases and tokenizes, dropping punctuation; deacc=True also strips accent marks
    yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

data_words = list(sent_to_words(tweet_df_2['original_text']))

print(data_words[:1])

[['the', 'year', 'yield', 'is', 'telling', 'us', 'that', 'there', 'high', 'risk', 'of', 'something', 'breaking', 'in', 'the', 'system', 'gold', 'silver', 'crypto']]
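The output above no longer contains "10", "a", or the "s" from "there's". A rough approximation of this tokenization in plain Python, assuming the behavior is: lowercase, keep runs of letters, and discard tokens shorter than 2 or longer than 15 characters (the defaults of `simple_preprocess` as I understand them). The function `rough_preprocess` is a hypothetical stand-in, not gensim's implementation, and it skips accent removal.

```python
import re


def rough_preprocess(text, min_len=2, max_len=15):
    """Approximate tokenization: lowercase, keep alphabetic runs, filter by length."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]


print(rough_preprocess("the 10-year yield is telling us that there's a high risk"))
```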


## Creating Bigram and Trigram Models

In [83]:
# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)  

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

# trigram
print(trigram_mod[bigram_mod[data_words[0]]])

['the', 'year', 'yield', 'is', 'telling', 'us', 'that', 'there', 'high', 'risk', 'of', 'something', 'breaking', 'in', 'the', 'system_gold_silver', 'crypto']
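To make the role of `min_count` and `threshold` more concrete, the sketch below computes a phrase score of the form used by gensim's default scorer as I understand it: (count(a b) - min_count) * vocabulary size / (count(a) * count(b)). A pair is merged into one token only if its score exceeds `threshold`. The function name and all counts are hypothetical, for illustration only.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count=5):
    """Score for treating the word pair (a, b) as a single phrase."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)


# Hypothetical counts, for illustration only
score = phrase_score(count_a=10, count_b=8, count_ab=7, vocab_size=1000)
print(score, score > 100)
```

Pairs that co-occur often relative to how frequent each word is on its own score highest, which is why a higher threshold yields fewer phrases.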


In [None]:
# Define functions to remove stopwords, make bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stopwords_set] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

# Remove Stop Words
data_words_nostops = remove_stopwords(data_words)

# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

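# Initialize the spaCy English model, disabling the parser and NER components (for efficiency)
# The 'en' shortcut was removed in spaCy 3, so the model is loaded by its full name
# (install it with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])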

# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

In [85]:
print(data_lemmatized[:1])

[['year', 'yield', 'tell', 'high', 'risk', 'break', 'system']]


In [86]:
# Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)

# Create Corpus
texts = data_lemmatized

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# View
print(corpus[:1])


[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)]]
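Each pair in the corpus is a (token id, count) entry for one document. The plain-Python sketch below illustrates the idea of a dictionary plus bag-of-words conversion; it is not gensim's implementation, and the id assignment order may differ from `corpora.Dictionary`. The helper names and the second document are made up for illustration.

```python
from collections import Counter


def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Convert a token list into sorted (token_id, count) pairs."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())


docs = [
    ["year", "yield", "tell", "high", "risk", "break", "system"],  # first lemmatized tweet
    ["risk", "risk", "yield"],                                     # hypothetical second document
]
vocab = build_vocab(docs)
print(to_bow(docs[1], vocab))
```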


In [87]:
# Human-readable format of the corpus (term, frequency)
read_format = [[(id2word[word_id], freq) for word_id, freq in cp] for cp in corpus]

### Topic Modeling using Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA) is based on the distributional hypothesis (i.e. similar topics make use of similar words) and the statistical mixture hypothesis (i.e. documents talk about several topics), for which a statistical distribution can be determined.

*  The purpose of LDA is to map each tweet in our corpus to a set of topics that covers a good deal of the words in the tweet.
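
The mixture hypothesis can be illustrated with a toy calculation: the probability of a word in a document is the sum, over topics, of the word's probability in that topic times the topic's share of the document. All topic names and probabilities below are made-up values, not output from the model in this notebook.

```python
# Hypothetical two-topic model over a three-word vocabulary
topic_word = {
    "finance": {"rate": 0.6, "gold": 0.3, "vote": 0.1},
    "politics": {"rate": 0.1, "gold": 0.1, "vote": 0.8},
}
# Hypothetical topic proportions for a single tweet
doc_topic = {"finance": 0.7, "politics": 0.3}


def word_probability(word):
    """P(word | doc) = sum over topics of P(word | topic) * P(topic | doc)."""
    return sum(doc_topic[t] * topic_word[t][word] for t in doc_topic)


for w in ("rate", "gold", "vote"):
    print(w, round(word_probability(w), 3))
```

LDA works in the other direction: given only the observed words, it estimates both the per-topic word distributions and the per-document topic proportions.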



In [None]:
# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus = corpus,
                                           id2word=id2word,
                                           num_topics=5, 
                                           random_state=100, 
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

In [71]:
pprint(lda_model.show_topics(formatted=False))

[(0,
  [('level', 0.023767931),
   ('government', 0.01615424),
   ('want', 0.014647724),
   ('even', 0.01436198),
   ('may', 0.011875476),
   ('story', 0.0113919005),
   ('start', 0.011198992),
   ('life', 0.01066228),
   ('world', 0.010436555),
   ('report', 0.009116626)]),
 (1,
  [('take', 0.022599876),
   ('power', 0.015510931),
   ('state', 0.013746052),
   ('thank', 0.012571241),
   ('price', 0.0123921),
   ('tell', 0.01051736),
   ('big', 0.010495168),
   ('tamil', 0.009828248),
   ('would', 0.009554321),
   ('watch', 0.009429148)]),
 (2,
  [('year', 0.014608484),
   ('amp', 0.01384011),
   ('today', 0.012212249),
   ('follow', 0.011946685),
   ('give', 0.01106696),
   ('high', 0.010144731),
   ('new', 0.009623735),
   ('still', 0.008687857),
   ('many', 0.008054487),
   ('become', 0.007511882)]),
 (3,
  [('go', 0.029821087),
   ('say', 0.022441491),
   ('people', 0.021179715),
   ('make', 0.019301184),
   ('read', 0.01667678),
   ('know', 0.015361498),
   ('day', 0.014921448),
 

Each entry is a topic with its top terms and their weights. Topic 0 can be labeled as climate change, and Topic 4 as government and carbon emissions, although the top terms shown above are fairly generic, so such labels should be treated as tentative.

# **Model Analysis**

Perplexity is a measure of model quality and, in natural language processing, is usually reported per word. It describes how well a model predicts a sample, i.e. how much it is “perplexed” by a sample from the observed data. The lower the score, the better the model for the given data.

Topic coherence is used to assess the quality of the topics themselves. It compares different topic models based on their human interpretability. The ‘c_v’ coherence score assigns a numerical value to the interpretability of the topics.
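
gensim's `log_perplexity` does not return the perplexity itself but a per-word likelihood bound, which is a negative number; according to the gensim documentation the perplexity is 2 raised to the negative of that bound. A minimal sketch of the conversion, using hypothetical bound values rather than results from this notebook:

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word log-likelihood bound (base 2) into perplexity."""
    return 2 ** (-per_word_bound)


# Hypothetical bound values, for illustration only
for bound in (-7.0, -8.0):
    print(bound, perplexity_from_bound(bound))
```

A bound closer to zero therefore corresponds to a lower perplexity, i.e. a better fit to the evaluated corpus.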

In [None]:
# Compute the per-word likelihood bound

# log_perplexity returns a negative per-word likelihood bound; perplexity is 2**(-bound), and lower perplexity is better
print('\nPer-word likelihood bound: ', lda_model.log_perplexity(corpus))
doc_lda = lda_model[corpus]


# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nLDA model coherence score on tweets: ', coherence_lda)

The basic LDA model has a coherence score of 0.58, which suggests that the model performed reasonably well at topic modeling.


**Analyzing results**

Exploring the Intertopic Distance Plot can help you learn how topics relate to each other, including potential higher-level structure between groups of topics.

In [90]:
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
# Visualize the topics
pyLDAvis.enable_notebook()

LDAvis_prepared = gensimvis.prepare(lda_model, corpus, id2word)
LDAvis_prepared