# Topic modelling and sentiment analysis
## Objective:
- write a code using scikit-learn, Gensim, or other packages and APIs to model the topics discussed in the tweets data and their sentiments. 
- word clouds, k-mean clustering, and the like model can for topic modelling.

In [34]:
import re
import warnings
import gensim
import string
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.pyplot as plt
from pprint import pprint
from gensim import corpora
from wordcloud import WordCloud, STOPWORDS
from gensim.models import CoherenceModel
from nltk.stem import WordNetLemmatizer

warnings.filterwarnings('ignore')

In [35]:
tweets = pd.read_csv("./my_clean_data.csv")
tweets[:5]
tweets.shape

(16472, 16)

In [36]:
tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16472 entries, 0 to 16471
Data columns (total 16 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   created_at          16472 non-null  object 
 1   original_text       16472 non-null  object 
 2   polarity            16472 non-null  float64
 3   subjectivity        16472 non-null  float64
 4   lang                16472 non-null  object 
 5   favorite_count      16472 non-null  int64  
 6   retweet_count       16472 non-null  int64  
 7   original_author     16472 non-null  object 
 8   followers_count     16472 non-null  int64  
 9   friends_count       16472 non-null  int64  
 10  possibly_sensitive  16472 non-null  bool   
 11  hashtags            16472 non-null  object 
 12  place               16472 non-null  object 
 13  hashtags_in_tweets  16472 non-null  object 
 14  screen_name         16472 non-null  object 
 15  device              16472 non-null  object 
dtypes: b

## 1. Feature Extraction

- **I have created a dataset containg only the columns important for topic modeling**

In [37]:
df = pd.DataFrame(columns=['clean_text'])
df['clean_text'] = tweets['original_text'].astype(str)
df[:5]

Unnamed: 0,clean_text
0,rt northstarcharts the year yield is tell...
1,rt michaelaarouet german y mortgage rate w...
2,rt goldseek when
3,rt charliebilello the year mortgage rate ...
4,rt biancoresearch rates rise until something...


## 2. Data Pre-processing

- **Get stop words from my data frame**

In [38]:
freqX = pd.Series(
    ' '.join(df['clean_text']).split()).value_counts()[:10]

print('FREQ X: \n', freqX)

FREQ X: 
 the    10229
rt      8286
to      6412
of      4737
a       4593
in      4204
and     3902
is      3876
s       3080
for     2773
dtype: int64


- **Remove stopwords**

In [40]:
custom_stopwords = ['t', 'rt', 'ti', 'vk', 'to', 'co',
                    'dqlw','y', 'mla','z', 'nd', 'm', 's', 'kur', 'u', 'o', 'd']
STOP_WORDS = STOPWORDS.union(custom_stopwords)

- **Tokeniziation**

In [41]:
df['clean_text'] 

0        rt  northstarcharts  the    year yield is tell...
1        rt  michaelaarouet  german   y mortgage rate w...
2                                     rt  goldseek  when  
3        rt  charliebilello  the    year mortgage rate ...
4        rt  biancoresearch  rates rise until something...
                               ...                        
16467    rt  charanjitchanni  best wishes  amp  heartfe...
16468    rt  pbhushan   thank you  bajpayeemanoj for th...
16469                        rt  s shreyatweets  agree    
16470    rt  tejjinc     peace yatra by late sunil dutt...
16471    rt  parthtiwari    gujarat congress mla arrest...
Name: clean_text, Length: 16472, dtype: object

In [42]:
df['clean_text'] = df['clean_text'].apply(
    lambda x: [item for item in x.split() if item not in STOP_WORDS])

df['clean_text']

0        [northstarcharts, year, yield, telling, us, hi...
1        [michaelaarouet, german, mortgage, rate, went,...
2                                               [goldseek]
3        [charliebilello, year, mortgage, rate, us, ris...
4        [biancoresearch, rates, rise, something, break...
                               ...                        
16467    [charanjitchanni, best, wishes, amp, heartfelt...
16468    [pbhushan, thank, bajpayeemanoj, beautiful, me...
16469                                [shreyatweets, agree]
16470    [tejjinc, peace, yatra, late, sunil, dutt, mum...
16471    [parthtiwari, gujarat, congress, arrested, twe...
Name: clean_text, Length: 16472, dtype: object

In [43]:
sentence_list = [sent for sent in df['clean_text']]
print(sentence_list[:5])

[['northstarcharts', 'year', 'yield', 'telling', 'us', 'high', 'risk', 'something', 'breaking', 'system', '#gold', '#silver', '#crypto', '#'], ['michaelaarouet', 'german', 'mortgage', 'rate', 'went', 'hear', 'sound', 'german', 'real', 'estate', 'bubble', 'bursting'], ['goldseek'], ['charliebilello', 'year', 'mortgage', 'rate', 'us', 'rises', 'highest', 'level', 'last', 'year', 'hit', 'time', 'low'], ['biancoresearch', 'rates', 'rise', 'something', 'breaks', 'anything', 'broken', 'yet']]


In [44]:
word_list = [sent for sent in sentence_list]
print(word_list[:5])

[['northstarcharts', 'year', 'yield', 'telling', 'us', 'high', 'risk', 'something', 'breaking', 'system', '#gold', '#silver', '#crypto', '#'], ['michaelaarouet', 'german', 'mortgage', 'rate', 'went', 'hear', 'sound', 'german', 'real', 'estate', 'bubble', 'bursting'], ['goldseek'], ['charliebilello', 'year', 'mortgage', 'rate', 'us', 'rises', 'highest', 'level', 'last', 'year', 'hit', 'time', 'low'], ['biancoresearch', 'rates', 'rise', 'something', 'breaks', 'anything', 'broken', 'yet']]


- **Lemmatization**

In [45]:
lemmatizer = WordNetLemmatizer()
word_list_lematized = []

for w in word_list:
    word_list_lematized.append([lemmatizer.lemmatize(x) for x in w])
print(word_list_lematized[:5])

[['northstarcharts', 'year', 'yield', 'telling', 'u', 'high', 'risk', 'something', 'breaking', 'system', '#gold', '#silver', '#crypto', '#'], ['michaelaarouet', 'german', 'mortgage', 'rate', 'went', 'hear', 'sound', 'german', 'real', 'estate', 'bubble', 'bursting'], ['goldseek'], ['charliebilello', 'year', 'mortgage', 'rate', 'u', 'rise', 'highest', 'level', 'last', 'year', 'hit', 'time', 'low'], ['biancoresearch', 'rate', 'rise', 'something', 'break', 'anything', 'broken', 'yet']]


- **Modeling**

In [46]:
id2word = corpora.Dictionary(word_list_lematized) #dictionery with ID and WORD
corpus = [id2word.doc2bow(tweet) for tweet in word_list]

In [47]:
print(np.array(word_list).shape)
print(np.array(id2word).shape)
print(np.array(corpus).shape)

(16472,)
(29722,)
(16472,)


- **Build my Latent Dirichlet Allocation (LDA)Model**

In [48]:
lda_model = gensim.models.ldamodel.LdaModel(corpus,
                                            id2word=id2word,
                                            num_topics=7,
                                            random_state=100,
                                            update_every=1,
                                            chunksize=100,
                                            passes=10,
                                            alpha='auto',
                                            per_word_topics=True)

- **see my pretty-print**

In [49]:
pprint(lda_model.show_topics(formatted=True)) 

[(0,
  '0.036*"will" + 0.033*"follow" + 0.023*"people" + 0.020*"today" + '
  '0.013*"know" + 0.012*"profile" + 0.011*"need" + 0.010*"even" + 0.009*"good" '
  '+ 0.009*"never"'),
 (1,
  '0.022*"sri" + 0.021*"man" + 0.017*"aitcofficial" + 0.017*"lanka" + '
  '0.013*"th" + 0.012*"still" + 0.012*"going" + 0.012*"may" + 0.009*"april" + '
  '0.009*"next"'),
 (2,
  '0.034*"one" + 0.026*"#srilanka" + 0.019*"day" + 0.016*"#" + 0.015*"online" '
  '+ 0.015*"cartoon" + 0.014*"#lka" + 0.012*"make" + 0.011*"two" + '
  '0.011*"don"'),
 (3,
  '0.041*"amp" + 0.014*"pm" + 0.012*"minister" + 0.011*"government" + '
  '0.010*"last" + 0.009*"#covid" + 0.009*"take" + 0.007*"p" + '
  '0.007*"mamataofficial" + 0.007*"narendramodi"'),
 (4,
  '0.020*"go" + 0.018*"new" + 0.017*"world" + 0.014*"power" + 0.014*"country" '
  '+ 0.010*"president" + 0.009*"covid" + 0.009*"first" + 0.008*"crisis" + '
  '0.008*"news"'),
 (5,
  '0.055*"india" + 0.022*"read" + 0.021*"please" + 0.018*"full" + '
  '0.013*"police" + 0.012*"b

- **Check the model using Perplexity**
- Perplexity is a metric used to judge how good a language model is. for more use the follwing link 
- https://towardsdatascience.com/perplexity-in-language-models-87a196019a94

In [50]:
print('\nPerplexity: ', lda_model.log_perplexity(corpus)) 


Perplexity:  -10.972058569345505


- The lower the score the better the model will be.
- https://www.tutorialspoint.com/gensim/gensim_using_lda_topic_model.htm

- **Ceck the Coherence Score**

In [51]:
doc_lda = lda_model[corpus]
coherence_model_lda = CoherenceModel(
    model=lda_model, texts=word_list, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\n Ldamodel Coherence Score/Accuracy on Tweets: ', coherence_lda) 


 Ldamodel Coherence Score/Accuracy on Tweets:  0.48474655206387596


The LDA model (lda_model) we have created above can be used to compute the model’s perplexity, 
i.e. how good the model is. The lower the score the better the model will be
The LDA model (lda_model) we have created above can be used to compute the model’s coherence score i.e. the average /median of the pairwise word-similarity scores of the words in the topic.

## 3. Data Visualization

In [52]:
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis

In [53]:
vis_data = gensimvis.prepare(lda_model, corpus, id2word)
pyLDAvis.display(vis_data)

- From the above output, the bubbles on the left-side represents a topic and larger the bubble, the more prevalent is that topic. The topic model will be good if the topic model has big, non-overlapping bubbles scattered throughout the chart

## 4. Analysis of Sentiments from the clean_text

- **𝐩𝐨𝐥𝐚𝐫𝐢𝐭𝐲 based analysys**

In [54]:
df = pd.DataFrame(columns=['clean_text', 'polarity'])
df['clean_text'] = tweets['original_text']
df['polarity'] = tweets['polarity']

def clean_tweet(tweet):
    clean_tweet = re.sub("[^a-zA-Z]",  " ",  tweet)
    return clean_tweet

df['clean_text'] = df['clean_text'].apply(clean_tweet)
df[:5]

Unnamed: 0,clean_text,polarity
0,rt northstarcharts the year yield is tell...,0.16
1,rt michaelaarouet german y mortgage rate w...,0.15
2,rt goldseek when,0.0
3,rt charliebilello the year mortgage rate ...,0.0
4,rt biancoresearch rates rise until something...,-0.4


- **duplication check**

In [55]:
print("duplicate count: {}".format(df.isnull().sum().sum()))

duplicate count: 0


  **get text catagories as:**
- Positive
- Negative 
- Neutral

In [56]:
def text_category(p):
  if p > 0:
    return "positive"
  elif p < 0:
    return "negative"
  else:
    return "neutral"

- **use the above function to test clean_texts with their polarity**

In [57]:
df["polarity"] = df["polarity"].apply(text_category)
df[:10]


Unnamed: 0,clean_text,polarity
0,rt northstarcharts the year yield is tell...,positive
1,rt michaelaarouet german y mortgage rate w...,positive
2,rt goldseek when,neutral
3,rt charliebilello the year mortgage rate ...,neutral
4,rt biancoresearch rates rise until something...,negative
5,rt lanceroberts buying opportunities like th...,negative
6,rt macroalf welcome to september bond...,positive
7,rt botbenfranklin the horse thinks one thing...,neutral
8,rt galactic trader global growth optimism at...,positive
9,rt andreassteno this is the most important c...,positive


- **use piechart and barchart to visualize better**

In [58]:
category = df.groupby(['polarity']).size()
category

polarity
negative    2691
neutral     7466
positive    6315
dtype: int64

**classification model to the clean tweet.**
- Here i assumed that 𝐩𝐨𝐥𝐚𝐫𝐢𝐭𝐲 =0 as 𝐬𝐜𝐨𝐫𝐞 = Neutral 

In [59]:
df = df[df['polarity'] != 'neutral']
df


Unnamed: 0,clean_text,polarity
0,rt northstarcharts the year yield is tell...,positive
1,rt michaelaarouet german y mortgage rate w...,positive
4,rt biancoresearch rates rise until something...,negative
5,rt lanceroberts buying opportunities like th...,negative
6,rt macroalf welcome to september bond...,positive
...,...,...
16465,rt ozamizcps pssg gedson casta eros mobile ...,positive
16466,rt salt project os free yourself from writin...,positive
16467,rt charanjitchanni best wishes amp heartfe...,positive
16468,rt pbhushan thank you bajpayeemanoj for th...,positive


- **Create a column  scoremap  Used for mapping {'positive':1, 'negative':0} to the  score  column**

In [60]:
df['scoremap'] = df["polarity"].map( lambda score: 1 if score == "positive" else 0)
df

Unnamed: 0,clean_text,polarity,scoremap
0,rt northstarcharts the year yield is tell...,positive,1
1,rt michaelaarouet german y mortgage rate w...,positive,1
4,rt biancoresearch rates rise until something...,negative,0
5,rt lanceroberts buying opportunities like th...,negative,0
6,rt macroalf welcome to september bond...,positive,1
...,...,...,...
16465,rt ozamizcps pssg gedson casta eros mobile ...,positive,1
16466,rt salt project os free yourself from writin...,positive,1
16467,rt charanjitchanni best wishes amp heartfe...,positive,1
16468,rt pbhushan thank you bajpayeemanoj for th...,positive,1


- **Create feature and target variables (X,y) from  clean_text  and  scoremap  columns respectively.**

In [61]:
(X, y) = df['clean_text'], df['scoremap']

- **split the data to train and test using split funcion from sklearn**

In [62]:
from sklearn.model_selection import train_test_split

In [63]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

**The name Stochastic Gradient Descent - Classifier (SGD-Classifier):**
- SGD Classifier is a linear classifier (SVM, logistic regression, a.o.) optimized by the SGD.
- While SGD is a optimization method, Logistic Regression or linear Support Vector Machine is a machine learning algorithm/model. 
- **Here** - I used it to vectorize my train text data

In [64]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from joblib import dump, load # used for saving and loading sklearn objects
from scipy.sparse import save_npz, load_npz # used for saving and loading sparse matrices
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

In [65]:
trigram_vectorizer = CountVectorizer(ngram_range=(1, 3))
trigram_vectorizer.fit_transform(X.values)
X_trigram = trigram_vectorizer.transform(X)


def train_and_show_scores(X, y, title: str) -> None:
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, train_size=0.75, stratify=y
    )

    clf = SGDClassifier()
    clf.fit(X_train, y_train)
    train_score = clf.score(X_train, y_train)
    valid_score = clf.score(X_valid, y_valid)
    print(f'{title}\nTrain score: {round(train_score, 2)} ; Validation score: {round(valid_score, 2)}\n')

In [66]:
train_and_show_scores(X_trigram, df['scoremap'], title="sentiment")

sentiment
Train score: 1.0 ; Validation score: 0.85

