# Topic Modeling and Sentiment Analysis

## Topic modeling

Topic models, in a nutshell, are a type of statistical language model used for uncovering hidden structure in a collection of texts. More practically and intuitively, you can think of topic modeling as a task of:

Dimensionality reduction, where rather than representing a text T in its feature space as {Word_i: count(Word_i, T) for Word_i in Vocabulary}, you can represent it in a topic space as {Topic_i: Weight(Topic_i, T) for Topic_i in Topics}.
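To make the two representations concrete, here is a toy sketch. The sample text and the hand-made topic word lists are hypothetical and only illustrate the shape of the data; a real topic model learns its topics and weights from the corpus.

```python
from collections import Counter

# Hypothetical toy text and hand-made topics (not learned from data).
text = "vaccine doses arrive as vaccine trial reports strong results"
topics = {
    "vaccination": {"vaccine", "doses", "trial"},
    "reporting": {"reports", "results"},
}

# Feature space: one entry per distinct word in the text.
tokens = text.split()
word_counts = Counter(tokens)

# Topic space: one entry per topic, here the share of tokens that
# belong to that topic's word list.
topic_weights = {
    name: sum(word_counts[w] for w in words) / len(tokens)
    for name, words in topics.items()
}

print(dict(word_counts))
print(topic_weights)
```

The word-count dictionary grows with the vocabulary, while the topic dictionary stays at one entry per topic, which is the dimensionality reduction described above.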

## Sentiment analysis

Sentiment analysis is used in social media monitoring, allowing businesses to gain insights into how customers feel about certain topics and to detect urgent issues in real time before they spiral out of control.

Our task here is to classify each tweet as positive or negative in sentiment.
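One simple way to get such labels is to threshold a polarity score like the one in the dataset's `polarity` column. The sketch below is only an assumption about how that could be done: the function name `label_sentiment`, the cut-off at zero, and the extra "neutral" label for a score of exactly zero are illustrative choices, not part of the original analysis.

```python
def label_sentiment(polarity: float) -> str:
    """Map a polarity score in [-1, 1] to a coarse sentiment label."""
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"


# Two polarity values from the sample rows plus a made-up zero score.
scores = [0.166667, -0.75, 0.0]
labels = [label_sentiment(s) for s in scores]
print(labels)
```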



There are several existing algorithms we can use to perform topic modeling. The most common are Latent Semantic Analysis (LSA/LSI), Probabilistic Latent Semantic Analysis (pLSA), and Latent Dirichlet Allocation (LDA).


In this article, we’ll take a closer look at LDA and implement our first topic model using the sklearn implementation in Python.

# LDA Implementation

The steps used to implement LDA:
   1) Loading data
   2) Data cleaning
   3) Exploratory analysis
   4) Preparing data for LDA analysis
   5) LDA model training
   6) Analyzing LDA model results

# Step 1: Loading Data 

In [14]:
# Importing modules
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import STOPWORDS,WordCloud
from gensim import corpora
import pandas as pd
import statistics
import string
import os
import re

#os.chdir('..')

# Read data into papers
papers = pd.read_csv('week_0_challenge_3/processed_tweet_data .csv')

# Print head
papers.head()

Unnamed: 0,created_at,source,original_text,polarity,subjectivity,lang,favorite_count,retweet_count,original_author,followers_count,friends_count,possibly_sensitive,hashtags,user_mentions,place
0,Fri Jun 18 17:55:49 +0000 2021,"<a href=""http://twitter.com/download/iphone"" r...","🚨Africa is ""in the midst of a full-blown third...",0.166667,0.188889,en,548,612,ketuesriche,551,351,,[],"[{'screen_name': 'TelGlobalHealth', 'name': 'T...",Mass
1,Fri Jun 18 17:55:59 +0000 2021,"<a href=""https://mobile.twitter.com"" rel=""nofo...","Dr Moeti is head of WHO in Africa, and one of ...",0.133333,0.455556,en,195,92,Grid1949,66,92,,[],"[{'screen_name': 'globalhlthtwit', 'name': 'An...","Edinburgh, Scotland"
2,Fri Jun 18 17:56:07 +0000 2021,"<a href=""http://twitter.com/download/iphone"" r...",Thank you @research2note for creating this ama...,0.316667,0.483333,en,2,1,LeeTomlinson8,1195,1176,,"[{'text': 'red4research', 'indices': [103, 116]}]","[{'screen_name': 'NHSRDForum', 'name': 'NHS R&...",
3,Fri Jun 18 17:56:10 +0000 2021,"<a href=""https://mobile.twitter.com"" rel=""nofo...","Former Pfizer VP and Virologist, Dr. Michael Y...",0.086111,0.197222,en,1580,899,RIPNY08,2666,2704,,[],"[{'screen_name': 'HighWireTalk', 'name': 'The ...",
4,Fri Jun 18 17:56:20 +0000 2021,"<a href=""http://twitter.com/download/android"" ...",I think it’s important that we don’t sell COVA...,0.28,0.62,en,72,20,pash22,28250,30819,,[],"[{'screen_name': 'PeterHotez', 'name': 'Prof P...",United Kingdom


# Step 2: Data Cleaning

Since the goal of this analysis is to perform topic modeling, let's focus on the text of each tweet and drop metadata columns we don't need. The cell below also keeps only a random sample of 100 tweets.

In [15]:
# Remove unneeded metadata columns and keep a random sample of 100 rows
papers = papers.drop(columns=['favorite_count','possibly_sensitive','user_mentions']).sample(100)
# Print out the first rows of papers
papers.head()

Unnamed: 0,created_at,source,original_text,polarity,subjectivity,lang,retweet_count,original_author,followers_count,friends_count,hashtags,place
4949,Sat Jun 19 03:48:23 +0000 2021,"<a href=""https://help.twitter.com/en/using-twi...",Pin Code:[411040] \nChatrapati Shahu Maharaj P...,-0.75,1.0,en,0,PuneUpdater,89,0,[],
1117,Fri Jun 18 19:21:12 +0000 2021,"<a href=""http://twitter.com/download/android"" ...",@AIPAC Will aipac tell the world that Israel h...,-0.125,0.725,en,28,Cavill08,227,261,[],
2867,Fri Jun 18 22:39:46 +0000 2021,"<a href=""http://twitter.com/download/android"" ...","Former Pfizer VP and Virologist, Dr. Michael Y...",0.086111,0.197222,en,904,scott48_scott,62,64,[],United States
1698,Fri Jun 18 20:17:14 +0000 2021,"<a href=""https://help.twitter.com/en/using-twi...",Novavax says COVID-19 vaccine highly effective...,0.6,0.8,en,0,CanNews24,180,448,[],Canada
3864,Sat Jun 19 01:15:34 +0000 2021,"<a href=""http://twitter.com/download/iphone"" r...",You can get a #COVID19 vaccine and other vacci...,0.09375,0.28125,en,128,HenryARose,385,492,"[{'text': 'COVID19', 'indices': [26, 34]}]","Pembroke Pines, Florida, USA"


## Remove punctuation/lower casing

In [None]:
# Load the regular expression library
import re

# Remove punctuation
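
The cell above stops at the comment. A minimal sketch of the cleaning step could look like the following; the helper name `clean_text` is hypothetical, and stripping all ASCII punctuation is an assumption about what the cleaning should cover.

```python
import re
import string


def clean_text(text: str) -> str:
    """Strip ASCII punctuation and lowercase the text."""
    no_punct = re.sub(f"[{re.escape(string.punctuation)}]", "", text)
    return no_punct.lower()


print(clean_text("Novavax says COVID-19 vaccine highly effective!"))
```

Assuming the tweet text lives in `original_text`, this could be applied with `papers['paper_text_processed'] = papers['original_text'].astype(str).map(clean_text)`, which creates the `paper_text_processed` column that the word-cloud cell reads. Note that this also removes `#` and `@`, and leaves non-ASCII characters such as curly apostrophes untouched.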



# Step 3: Exploratory Analysis

To verify that the preprocessing worked, we’ll make a word cloud using the wordcloud package to get a visual representation of the most common words. This is key to understanding the data, ensuring we are on the right track, and deciding whether any more preprocessing is necessary before training the model.

In [16]:
# Import the wordcloud library
from wordcloud import WordCloud
# Join the processed tweet texts into one long string.
long_string = ','.join(list(papers['paper_text_processed'].values))
# Create a WordCloud object
wordcloud = WordCloud(background_color="white", max_words=5000, contour_width=3, contour_color='steelblue')
# Generate a word cloud
wordcloud.generate(long_string)
# Visualize the word cloud
wordcloud.to_image()

KeyError: 'paper_text_processed'
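
A word cloud sizes each word by how often it occurs, so the same check can also be done numerically. Here is a minimal sketch using only the standard library; the `texts` list is made-up sample data, not the tweet dataset.

```python
from collections import Counter

# Made-up sample of cleaned texts; not taken from the tweet dataset.
texts = [
    "vaccine rollout continues across africa",
    "new vaccine trial results published",
    "africa faces third wave",
]

# Count every whitespace-separated token across all texts.
counts = Counter(word for text in texts for word in text.split())

# The most frequent words are the ones a word cloud would draw largest.
top_words = counts.most_common(2)
print(top_words)
```

If very common filler words dominate the counts, that is usually a sign that stop-word removal is still needed before training.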