# Topic modelling and sentiment analysis
## Objective:
- write a code using scikit-learn, Gensim, or other packages and APIs to model the topics discussed in the tweets data and their sentiments. 
- word clouds, k-mean clustering, and the like model can for topic modelling.

In [1]:
import re
import warnings
import gensim
import string
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.pyplot as plt
from pprint import pprint
from gensim import corpora
from wordcloud import WordCloud, STOPWORDS
from gensim.models import CoherenceModel
from nltk.stem import WordNetLemmatizer

warnings.filterwarnings('ignore')

In [2]:
tweets = pd.read_csv("./my_clean_data.csv")
tweets[:5]

Unnamed: 0,created_at,original_text,polarity,subjectivity,lang,favorite_count,retweet_count,original_author,followers_count,friends_count,possibly_sensitive,hashtags,place,hashtags_in_tweets,screen_name,device
0,Fri Apr 22 22:17:05 +0000 2022,rt northstarcharts the year yield is tell...,0.16,0.54,en,188,43,davideiacovozzi,18,55,False,"#gold, #gold, #gold",,"#gold, #silver, #crypto",@northstarcharts,twitter for android
1,Fri Apr 22 13:44:53 +0000 2022,rt michaelaarouet german y mortgage rate w...,0.15,0.175,en,179,32,davideiacovozzi,18,55,False,,,,@michaelaarouet,twitter for android
2,Fri Apr 22 06:10:34 +0000 2022,rt goldseek when,0.0,0.0,en,193,26,davideiacovozzi,18,55,False,,,,@goldseek,twitter for android
3,Thu Apr 21 17:22:09 +0000 2022,rt charliebilello the year mortgage rate ...,0.0,0.183333,en,620,213,davideiacovozzi,18,55,False,,,,@charliebilello,twitter for android
4,Thu Apr 21 10:32:26 +0000 2022,rt biancoresearch rates rise until something...,-0.4,0.4,en,1787,417,davideiacovozzi,18,55,False,,,,@biancoresearch,twitter for android


In [3]:
tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16472 entries, 0 to 16471
Data columns (total 16 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   created_at          16472 non-null  object 
 1   original_text       16472 non-null  object 
 2   polarity            16472 non-null  float64
 3   subjectivity        16472 non-null  float64
 4   lang                16472 non-null  object 
 5   favorite_count      16472 non-null  int64  
 6   retweet_count       16472 non-null  int64  
 7   original_author     16472 non-null  object 
 8   followers_count     16472 non-null  int64  
 9   friends_count       16472 non-null  int64  
 10  possibly_sensitive  16472 non-null  bool   
 11  hashtags            16472 non-null  object 
 12  place               16472 non-null  object 
 13  hashtags_in_tweets  16472 non-null  object 
 14  screen_name         16472 non-null  object 
 15  device              16472 non-null  object 
dtypes: b

## 1. Feature Extraction

- **I have created a dataset containg only the columns important for topic modeling**

In [4]:
df = pd.DataFrame(columns=['clean_text'])
df['clean_text'] = tweets['original_text'].astype(str)
df[:5]

Unnamed: 0,clean_text
0,rt northstarcharts the year yield is tell...
1,rt michaelaarouet german y mortgage rate w...
2,rt goldseek when
3,rt charliebilello the year mortgage rate ...
4,rt biancoresearch rates rise until something...


## 2. Data Pre-processing

- **Get stop words from my data frame**

In [5]:
freqX = pd.Series(
    ' '.join(df['clean_text']).split()).value_counts()[:10]

print('FREQ X: \n', freqX)

FREQ X: 
 the    10229
rt      8286
to      6412
of      4737
a       4593
in      4204
and     3902
is      3876
s       3080
for     2773
dtype: int64


In [7]:
- **Remove stopwords

SyntaxError: invalid syntax (542772593.py, line 1)

In [6]:
custom_stopwords = ['t', 'rt', 'ti', 'vk', 'to', 'co',
                    'dqlw', 'z', 'nd', 'm', 's', 'kur', 'u', 'o', 'd']
STOP_WORDS = STOPWORDS.union(custom_stopwords)

- **Tokeniziation**

In [9]:
df['clean_text']

0        rt  northstarcharts  the    year yield is tell...
1        rt  michaelaarouet  german   y mortgage rate w...
2                                     rt  goldseek  when  
3        rt  charliebilello  the    year mortgage rate ...
4        rt  biancoresearch  rates rise until something...
                               ...                        
16467    rt  charanjitchanni  best wishes  amp  heartfe...
16468    rt  pbhushan   thank you  bajpayeemanoj for th...
16469                        rt  s shreyatweets  agree    
16470    rt  tejjinc     peace yatra by late sunil dutt...
16471    rt  parthtiwari    gujarat congress mla arrest...
Name: clean_text, Length: 16472, dtype: object

In [10]:
df['clean_text'] = df['clean_text'].apply(
    lambda x: [item for item in x.split() if item not in STOP_WORDS])

df['clean_text']

0        [northstarcharts, year, yield, telling, us, hi...
1        [michaelaarouet, german, y, mortgage, rate, we...
2                                               [goldseek]
3        [charliebilello, year, mortgage, rate, us, ris...
4        [biancoresearch, rates, rise, something, break...
                               ...                        
16467    [charanjitchanni, best, wishes, amp, heartfelt...
16468    [pbhushan, thank, bajpayeemanoj, beautiful, me...
16469                                [shreyatweets, agree]
16470    [tejjinc, peace, yatra, late, sunil, dutt, mum...
16471    [parthtiwari, gujarat, congress, mla, arrested...
Name: clean_text, Length: 16472, dtype: object

In [11]:
sentence_list = [sent for sent in df['clean_text']]
print(sentence_list[:5])

[['northstarcharts', 'year', 'yield', 'telling', 'us', 'high', 'risk', 'something', 'breaking', 'system', '#gold', '#silver', '#crypto', '#'], ['michaelaarouet', 'german', 'y', 'mortgage', 'rate', 'went', 'hear', 'sound', 'german', 'real', 'estate', 'bubble', 'bursting'], ['goldseek'], ['charliebilello', 'year', 'mortgage', 'rate', 'us', 'rises', 'highest', 'level', 'last', 'year', 'hit', 'time', 'low'], ['biancoresearch', 'rates', 'rise', 'something', 'breaks', 'anything', 'broken', 'yet']]


In [12]:
word_list = [sent for sent in sentence_list]
print(word_list[:5])

[['northstarcharts', 'year', 'yield', 'telling', 'us', 'high', 'risk', 'something', 'breaking', 'system', '#gold', '#silver', '#crypto', '#'], ['michaelaarouet', 'german', 'y', 'mortgage', 'rate', 'went', 'hear', 'sound', 'german', 'real', 'estate', 'bubble', 'bursting'], ['goldseek'], ['charliebilello', 'year', 'mortgage', 'rate', 'us', 'rises', 'highest', 'level', 'last', 'year', 'hit', 'time', 'low'], ['biancoresearch', 'rates', 'rise', 'something', 'breaks', 'anything', 'broken', 'yet']]


- **Lemmatization**

In [13]:
lemmatizer = WordNetLemmatizer()
word_list_lematized = []

for w in word_list:
    word_list_lematized.append([lemmatizer.lemmatize(x) for x in w])
print(word_list_lematized[:5])

[['northstarcharts', 'year', 'yield', 'telling', 'u', 'high', 'risk', 'something', 'breaking', 'system', '#gold', '#silver', '#crypto', '#'], ['michaelaarouet', 'german', 'y', 'mortgage', 'rate', 'went', 'hear', 'sound', 'german', 'real', 'estate', 'bubble', 'bursting'], ['goldseek'], ['charliebilello', 'year', 'mortgage', 'rate', 'u', 'rise', 'highest', 'level', 'last', 'year', 'hit', 'time', 'low'], ['biancoresearch', 'rate', 'rise', 'something', 'break', 'anything', 'broken', 'yet']]


- **Modeling**

In [14]:
id2word = corpora.Dictionary(word_list_lematized) #dictionery with ID and WORD
corpus = [id2word.doc2bow(tweet) for tweet in word_list]

In [15]:
print(np.array(word_list).shape)
print(np.array(id2word).shape)
print(np.array(corpus).shape)

(16472,)
(29723,)
(16472,)


- **Build my Latent Dirichlet Allocation (LDA)Model**

In [17]:
lda_model = gensim.models.ldamodel.LdaModel(corpus,
                                            id2word=id2word,
                                            num_topics=7,
                                            random_state=100,
                                            update_every=1,
                                            chunksize=100,
                                            passes=10,
                                            alpha='auto',
                                            per_word_topics=True)

- **see my pretty-print**

In [18]:
pprint(lda_model.show_topics(formatted=True))

[(0,
  '0.039*"amp" + 0.035*"follow" + 0.018*"go" + 0.016*"#" + 0.015*"man" + '
  '0.015*"world" + 0.013*"power" + 0.012*"need" + 0.010*"#srilanka" + '
  '0.009*"#covid"'),
 (1,
  '0.031*"one" + 0.017*"day" + 0.016*"read" + 0.013*"cartoon" + 0.013*"full" + '
  '0.012*"time" + 0.012*"country" + 0.012*"#srilanka" + 0.011*"minister" + '
  '0.010*"two"'),
 (2,
  '0.021*"make" + 0.018*"state" + 0.017*"president" + 0.016*"may" + '
  '0.014*"think" + 0.012*"thank" + 0.010*"delivery" + 0.010*"happy" + '
  '0.009*"way" + 0.007*"daily"'),
 (3,
  '0.019*"us" + 0.017*"sri" + 0.013*"lanka" + 0.011*"don" + 0.011*"government" '
  '+ 0.008*"#ukraine" + 0.008*"best" + 0.008*"tamil" + 0.008*"h" + '
  '0.007*"life"'),
 (4,
  '0.028*"will" + 0.018*"people" + 0.016*"back" + 0.016*"now" + 0.016*"today" '
  '+ 0.012*"please" + 0.011*"online" + 0.010*"know" + 0.009*"profile" + '
  '0.008*"even"'),
 (5,
  '0.043*"india" + 0.016*"new" + 0.014*"sec" + 0.012*"aitcofficial" + '
  '0.010*"last" + 0.009*"year" + 0.0

- **Check the model using Perplexity**
- Perplexity is a metric used to judge how good a language model is. for more use the follwing link 
- https://towardsdatascience.com/perplexity-in-language-models-87a196019a94

In [20]:
print('\nPerplexity: ', lda_model.log_perplexity(corpus))


Perplexity:  -10.968949746402618


- **Ceck the Coherence Score**

In [22]:
doc_lda = lda_model[corpus]
coherence_model_lda = CoherenceModel(
    model=lda_model, texts=word_list, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\n Ldamodel Coherence Score/Accuracy on Tweets: ', coherence_lda)


 Ldamodel Coherence Score/Accuracy on Tweets:  0.4959697369726013


## 3. Data Visualization

In [23]:
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis

In [24]:
vis_data = gensimvis.prepare(lda_model, corpus, id2word)
pyLDAvis.display(vis_data)

## 4. Analysis of Sentiments from the clean_text

- **𝐩𝐨𝐥𝐚𝐫𝐢𝐭𝐲 based analysys**

In [27]:
df = pd.DataFrame(columns=['clean_text', 'polarity'])
df['clean_text'] = tweets['original_text']
df['polarity'] = tweets['polarity']

def clean_tweet(tweet):
    clean_tweet = re.sub("[^a-zA-Z]",  " ",  tweet)
    return clean_tweet

df['clean_text'] = df['clean_text'].apply(clean_tweet)
df[:5]

Unnamed: 0,clean_text,polarity
0,rt northstarcharts the year yield is tell...,0.16
1,rt michaelaarouet german y mortgage rate w...,0.15
2,rt goldseek when,0.0
3,rt charliebilello the year mortgage rate ...,0.0
4,rt biancoresearch rates rise until something...,-0.4


- **duplication check**

In [30]:
print("duplicate count: {}".format(df.isnull().sum().sum()))

duplicate count: 0


  **get text catagories as:**
- Positive
- Negative 
- Neutral

In [36]:
def text_category(p):
  if p > 0:
    return "positive"
  elif p < 0:
    return "negative"
  else:
    return "neutral"

- **use the above function to test clean_texts with their polarity**

In [37]:
df["polarity"] = df["polarity"].apply(text_category)
df[:5]


TypeError: '>=' not supported between instances of 'str' and 'int'

- **use piechart and barchart to visualize better**

In [None]:
category = df.groupby(['polarity']).size()
category

build a classification model on the clean tweet.
Remove rows from cleanTweet where 𝐩𝐨𝐥𝐚𝐫𝐢𝐭𝐲 =0 (i.e where 𝐬𝐜𝐨𝐫𝐞 = Neutral) and reset the frame index.¶

In [None]:
df = df[df['polarity'] != 'neutral']
df


Construct a column  𝐬𝐜𝐨𝐫𝐞𝐦𝐚𝐩  Use the mapping {'positive':1, 'negative':0} on the  𝐬𝐜𝐨𝐫𝐞  column¶

In [None]:
df['scoremap'] = df["polarity"].map( lambda score: 1 if score == "positive" else 0)
df

Create feature and target variables (X,y) from  𝐜𝐥𝐞𝐚𝐧-𝐭𝐞𝐱𝐭  and  𝐬𝐜𝐨𝐫𝐞𝐦𝐚𝐩  columns respectively.

In [None]:
(X, y) = df['clean_text'], df['scoremap']

Use train_test_split function to construct (X_train, y_train) and (X_test, y_test) from (X,y)¶

In [None]:
from sklearn.model_selection import train_test_split