# **ANALYZING YOUTUBE COMMENTS ON ENVIRONMENTAL SUSTAINABILITY TRENDS**
### A TDI Data Science Capstone Project Presented by Diane Christine Pelejo
   
   
In recent news cycles, we have been seeing environmental catastrophes unfold one after
another. This has increased awareness among everyday people, making us more conscientious
about our actions and how we can minimize the negative impacts of our life choices on Mother
Nature. Living an environmentally sustainable lifestyle is a goal that many of us wish to achieve.
Admittedly, it can get overwhelming to figure out what we can realistically do to move towards
this goal. Several sustainability trends have gained popularity, such as installing solar panels at
home, driving electric cars, composting, and using paper straws. But not all
of us can practice or pursue these trends, and some may even be problematic. In this capstone
project, I explore patterns in the way people talk about different sustainability trends.

I looked at comments on YouTube videos related to several environmental sustainability trends and analyzed the subtopics discussed by the commenters on these videos.

# Collecting Data


Here we call the Python script `scrape_yt_comments.py` to collect information on the top videos (a maximum of 150 videos) related to each environmental sustainability trend. The search keywords we use are listed in `keywords`.


```python
keywords=['solar energy', 'composting', 'paper straws', 'electric vehicles']

import traceback  # used in the except block below

import scrape_yt_comments
from scrape_yt_comments import *

num_vids=150
for keyword in keywords: 
    num=0
    while num<num_vids:
        try:
            print("Searching for videos related to:", keyword,"...") 
            vids=vid_by_kw(keyword)
            if len(vids)==0:  # stop if the search returns nothing, otherwise the while loop never ends
                break
            num+=len(vids)
            for i in vids:
                print(f"Retrieving comments on video {i[0]} which has {i[1]} comments" )
                com_by_vid(i[0],int(i[1]))
        except Exception as e:
            print('The error raised is:', e)
            traceback_output = traceback.format_exc()
            print(traceback_output)   
            break
```

The data is saved in an SQLite database file called `capstone.db`. We can look at the schema information of this database using the following code:

```python

import pandas as pd
import sqlalchemy
from sqlalchemy import create_engine
URL_DB = 'sqlite:///capstone.db'
db_engine = create_engine(URL_DB)
conn=db_engine.connect()

schema=pd.read_sql_table('sqlite_master',conn)
schema 
```
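
The schema can also be read with a plain SQL query against `sqlite_master`. The sketch below is only an illustration: it uses an in-memory database and a hypothetical `demo_videos` table rather than the real `capstone.db`.

```python
import sqlite3
import pandas as pd

# In-memory stand-in for capstone.db (assumption: illustration only)
demo_conn = sqlite3.connect(":memory:")
pd.DataFrame({"video_url": ["a", "b"], "num_views": [10, 20]}).to_sql(
    "demo_videos", demo_conn, index=False
)

# sqlite_master holds one row per table, index, view and trigger
demo_schema = pd.read_sql_query(
    "SELECT type, name, sql FROM sqlite_master WHERE type = 'table'", demo_conn
)
print(demo_schema)
demo_conn.close()
```

The `sql` column contains the `CREATE TABLE` statement for each table, which is a quick way to check column names and types.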

# Cleaning up

We need to do some pre-processing of the data to remove some redundant rows and to create separate tables for each trend. We use the following code.


```python
import pandas as pd
import sqlalchemy
from sqlalchemy import create_engine
URL_DB = 'sqlite:///capstone.db'
db_engine = create_engine(URL_DB)
conn=db_engine.connect()

#schema=pd.read_sql_table('sqlite_master',conn)

df_vids=pd.read_sql_table('youtube_videos',conn)
df_vids=df_vids.astype({'publishedAt': 'datetime64[ns]','num_views':int, 'num_likes':'int','num_comments':int})

df_comms=pd.read_sql_table('youtube_comments',conn)
df_comms=df_comms.astype({'published_date': 'datetime64[ns]'})

#Videos 1-150  -> solar energy (146 unique videos)
l1=list(range(150))
solar_vids=df_vids.iloc[l1].groupby(['video_url','title','description','publishedAt']).max().reset_index()
solar_vids.to_sql('solar_energy_videos',con=conn,if_exists='replace',index=False)

solar_comments=df_comms[df_comms['video_url'].isin(solar_vids['video_url'])]
solar_comments=solar_comments.groupby(['comment_id']).max().reset_index()
solar_comments.to_sql('solar_energy_comments',con=conn,if_exists='replace',index=False)



#151-300 #set aside 301-1844 -> composting (654 unique videos)
l2=list(range(150,300)) #l2=list(range(150,1845)) 
compost_vids=df_vids.iloc[l2].groupby(['video_url','title','description','publishedAt']).max().reset_index()
compost_vids.to_sql('composting_videos',con=conn,if_exists='replace',index=False)

compost_comments=df_comms[df_comms['video_url'].isin(compost_vids['video_url'])]
compost_comments=compost_comments.groupby(['comment_id']).max().reset_index()
compost_comments.to_sql('composting_comments',con=conn,if_exists='replace',index=False)


#1845-1994 -> paper straws (142 unique videos)
l3=list(range(1845,1995))
straw_vids=df_vids.iloc[l3].groupby(['video_url','title','description','publishedAt']).max().reset_index()
straw_vids.to_sql('paper_straws_videos',con=conn,if_exists='replace',index=False)

straw_comments=df_comms[df_comms['video_url'].isin(straw_vids['video_url'])]
straw_comments=straw_comments.drop_duplicates().groupby(['comment_id']).max().reset_index()
straw_comments.to_sql('paper_straws_comments',con=conn,if_exists='replace',index=False)

#1995-2144 -> electric vehicles (147 unique videos)
l4=list(range(1995,2144))
EV_vids=df_vids.iloc[l4].groupby(['video_url','title','description','publishedAt']).max().reset_index()
EV_vids.to_sql('electric_vehicles_videos',con=conn,if_exists='replace',index=False)

EV_comments=df_comms[df_comms['video_url'].isin(EV_vids['video_url'])]
EV_comments=EV_comments.drop_duplicates().groupby(['comment_id']).max().reset_index()
EV_comments.to_sql('electric_vehicles_comments',con=conn,if_exists='replace',index=False)
```
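
To make the de-duplication step concrete, here is a minimal sketch on a hypothetical three-row table (the values are made up). It applies the same `groupby(...).max().reset_index()` pattern used above.

```python
import pandas as pd

# Hypothetical scraped comments: comment c1 was collected twice
toy_comms = pd.DataFrame({
    "comment_id": ["c1", "c1", "c2"],
    "video_url": ["v1", "v1", "v2"],
    "num_likes": [3, 5, 1],
})

# One row per comment_id; each remaining column keeps its maximum value
toy_dedup = toy_comms.groupby(["comment_id"]).max().reset_index()
print(toy_dedup)
```

Because `max()` is taken column by column, the surviving row can combine values from different duplicates. That is harmless when the duplicates are identical copies or differ only in a counter that grows over time.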


# Loading the Data For Analysis

Now we are ready to do our analysis.

```python
import pandas as pd
import sqlalchemy
from sqlalchemy import create_engine
URL_DB = 'sqlite:///capstone.db'
db_engine = create_engine(URL_DB)
conn=db_engine.connect()

solar_vids=pd.read_sql_table('solar_energy_videos',conn)
solar_comms=pd.read_sql_table('solar_energy_comments',conn)
print(f'We collected {len(solar_vids)} videos and {len(solar_comms)} comments related to "solar energy".')

compost_vids=pd.read_sql_table('composting_videos',conn)
compost_comms=pd.read_sql_table('composting_comments',conn)
print(f'We collected {len(compost_vids)} videos and {len(compost_comms)} comments related to "composting".')

straw_vids=pd.read_sql_table('paper_straws_videos',conn)
straw_comms=pd.read_sql_table('paper_straws_comments',conn)
print(f'We collected {len(straw_vids)} videos and {len(straw_comms)} comments related to "paper straws".')

EV_vids=pd.read_sql_table('electric_vehicles_videos',conn)
EV_comms=pd.read_sql_table('electric_vehicles_comments',conn)
print(f'We collected {len(EV_vids)} videos and {len(EV_comms)} comments related to "electric vehicles".')
```

```
We collected 146 videos and 183530 comments related to "solar energy".
We collected 137 videos and 47133 comments related to "composting".
We collected 142 videos and 30619 comments related to "paper straws".
We collected 147 videos and 259811 comments related to "electric vehicles".
```


# Time-Series Analysis of Video Metrics (Comments, Likes, Views)

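First, for each trend, we plot the number of videos by month of publication and the number of comments by month of posting. We also plot the number of new comments in each month divided by the number of videos published so far, to see how many comments a trend gets each month on average per existing video.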

```python
import matplotlib
import matplotlib.pyplot as plt

import seaborn as sns
sns.set()


def plot_monthly(trend, vids_df, comms_df):
    # number of videos published in each calendar month
    # (rename_axis/reset_index(name=...) works across pandas versions)
    vid_count=(vids_df['publishedAt']
    .dt.to_period('M')
    .dt.to_timestamp()
    .value_counts()
    .sort_index()
    .rename_axis('month')
    .reset_index(name='num_videos')
    )

    # number of comments posted in each calendar month
    comm_count=(comms_df['published_date']
    .dt.to_period('M')
    .dt.to_timestamp()
    .value_counts()
    .sort_index()
    .rename_axis('month')
    .reset_index(name='num_comments')
    )

    summary=vid_count.merge(comm_count, on='month',how='outer').fillna(0)
    summary=summary.set_index('month')
    summary=summary.sort_index()
    # running total of videos published so far, used to normalize the comment counts
    summary['num_vids_pub']=summary['num_videos'].cumsum()
    summary['norm_comm']=summary['num_comments']/summary['num_vids_pub']

    fig, (ax1,ax2,ax3) = plt.subplots(nrows=3, sharex=True)
    fig.suptitle(trend, fontsize=16, fontweight='bold',y=0.92)
    fig.set_figheight(6.9)
    fig.set_figwidth(5)
    fig.align_ylabels()


    plt.subplots_adjust(hspace=.1)# remove vertical gap between subplots
    plt.xlabel('date published',fontsize=12, fontstyle='italic',fontweight='bold',labelpad=12)
    plt.xticks(rotation=45)
    
    line1, = ax1.plot(summary['norm_comm'], color='b', marker='.', markersize=3)
    line2, =  ax2.plot(summary['num_comments'], color='r', marker='.', markersize=3)
    line3, =  ax3.plot(summary['num_videos'], color='g', marker='.', markersize=3)
    line4, = ax3.plot(summary['num_vids_pub'],color='orange',marker='.', markersize=3)
    
    ax1.set_ylabel(r'$\bf\frac{no.\ new \ comments}{no.\ existing\ videos}$',fontsize=12)
    ax2.set_ylabel('no. new comments',fontsize=10, fontstyle='italic',fontweight='bold')
    ax3.set_ylabel('no. videos',fontsize=10, fontstyle='italic',fontweight='bold')
    ax3.xaxis.set_major_locator(matplotlib.dates.YearLocator(base=1))
    ax3.legend((line3,line4),('newly published','total so far'),loc='upper left')
    
    plt.savefig(trend.replace(" ", "_")+'.jpg',bbox_inches='tight')
    plt.close()
    
plot_monthly('solar energy', solar_vids, solar_comms)
plot_monthly('composting', compost_vids, compost_comms)
plot_monthly('paper straws', straw_vids, straw_comms)
plot_monthly('electric vehicles', EV_vids, EV_comms)
```
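
The normalization used for the top panel can be checked on a tiny example. The sketch below uses made-up dates (three videos, four comments) and a hypothetical helper `monthly_counts`; it reproduces only the table behind the plots, not the plots themselves.

```python
import pandas as pd

# Made-up publication dates, for illustration only
toy_vid_dates = pd.Series(pd.to_datetime(["2020-01-05", "2020-01-20", "2020-03-02"]))
toy_comm_dates = pd.Series(
    pd.to_datetime(["2020-01-06", "2020-02-10", "2020-02-11", "2020-03-03"])
)

def monthly_counts(dates, name):
    # Count events per calendar month, indexed by the first day of that month
    months = dates.dt.to_period("M").dt.to_timestamp()
    return months.value_counts().sort_index().rename_axis("month").rename(name)

toy_summary = pd.concat(
    [monthly_counts(toy_vid_dates, "num_videos"),
     monthly_counts(toy_comm_dates, "num_comments")],
    axis=1,
).fillna(0).sort_index()

# Videos published so far, then new comments per existing video
toy_summary["num_vids_pub"] = toy_summary["num_videos"].cumsum()
toy_summary["norm_comm"] = toy_summary["num_comments"] / toy_summary["num_vids_pub"]
print(toy_summary)
```

February has no new videos, so its two comments are divided by the two videos that already existed, giving one comment per existing video for that month.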

### Publication Dates of the Top Videos Collected and Their Comments

| ![](solar_energy.jpg) | ![](composting.jpg) |
|-|-|
| ![](paper_straws.jpg) | ![](electric_vehicles.jpg) |

Next, we have the number of likes, views and comments for each video, but no timestamps for when the likes and views were acquired. Here, we look at the individual videos and the number of likes, views and comments they had at the time of data collection, with the videos arranged by publication date.

```python
def metrics_trend(trend, vids_df):

    temp=vids_df[['publishedAt','num_views','num_likes','num_comments']].sort_values('publishedAt',ignore_index=True).set_index('publishedAt')
    # cumulative (expanding) mean and population standard deviation of each metric
    temp['views_cum_avg']=temp['num_views'].expanding().mean()
    temp['views_cum_std']=temp['num_views'].expanding().std(ddof=0)
    temp['likes_cum_avg']=temp['num_likes'].expanding().mean()
    temp['likes_cum_std']=temp['num_likes'].expanding().std(ddof=0)
    temp['comms_cum_avg']=temp['num_comments'].expanding().mean()
    temp['comms_cum_std']=temp['num_comments'].expanding().std(ddof=0)

    fig, (ax1,ax2,ax3) = plt.subplots(nrows=3, sharex=True)

    fig.suptitle(trend, fontsize=16, fontweight='bold',y=0.92)
    fig.set_figheight(6.9)
    fig.set_figwidth(5)
    fig.align_ylabels()


    plt.subplots_adjust(hspace=.1)# remove vertical gap between subplots
    plt.xlabel('video index (publication date-time)',fontsize=12, fontstyle='italic',fontweight='bold',labelpad=12)
    plt.xticks(rotation=45)

    # views
    line11, = ax1.plot(temp['num_views'], color='b', marker='.', markersize=3)
    line12, =ax1.plot(temp['views_cum_avg'],color='orange',marker='.', markersize=3)
    ax1.fill_between(temp.index,temp['views_cum_avg'] - temp['views_cum_std'], temp['views_cum_avg'] + temp['views_cum_std'], color='orange', alpha=0.2)
    ax1.legend((line11,line12),('individual video','cum. avg.\n (1 std range)'),loc='upper left',fontsize=8)
    ax1.set_ylabel('no. views',fontsize=12, fontstyle='italic',fontweight='bold')
    ax1.ticklabel_format(style='sci', scilimits=(0,0), axis='y', useOffset=True, useLocale=None, useMathText=True)

    # likes (the legend now uses this panel's own lines)
    line21, =  ax2.plot(temp['num_likes'], color='r', marker='.', markersize=3)
    line22, =ax2.plot(temp['likes_cum_avg'],color='orange',marker='.', markersize=3)
    ax2.fill_between(temp.index,temp['likes_cum_avg'] - temp['likes_cum_std'], temp['likes_cum_avg'] + temp['likes_cum_std'], color='orange', alpha=0.2)
    ax2.legend((line21,line22),('individual video','cum. avg. \n(1 std range)'),loc='upper left',fontsize=8)
    ax2.set_ylabel('no. likes',fontsize=12, fontstyle='italic',fontweight='bold')
    ax2.ticklabel_format(axis='y', style='sci', scilimits=(0,0), useOffset=True, useLocale=None, useMathText=True)

    # comments (the legend is created after the lines it refers to)
    line31, =  ax3.plot(temp['num_comments'], color='g', marker='.', markersize=3)
    line32, =ax3.plot(temp['comms_cum_avg'],color='orange',marker='.', markersize=3)
    ax3.fill_between(temp.index,temp['comms_cum_avg'] - temp['comms_cum_std'], temp['comms_cum_avg'] + temp['comms_cum_std'], color='orange', alpha=0.2)
    ax3.legend((line31,line32),('individual video','cum. avg. (1 std range)'),loc='upper left',fontsize=8)
    ax3.set_ylabel('no. comments',fontsize=12, fontstyle='italic',fontweight='bold')
    ax3.ticklabel_format(axis='y', style='sci', scilimits=(0,0), useOffset=True, useLocale=None, useMathText=True)
    ax3.xaxis.set_major_locator(matplotlib.dates.YearLocator(base=1))
    
    plt.savefig(trend.replace(" ", "_")+'_metrics.jpg',bbox_inches='tight')
    plt.close()

metrics_trend('solar energy', solar_vids)
metrics_trend('composting', compost_vids)
metrics_trend('paper straws', straw_vids)
metrics_trend('electric vehicles', EV_vids)
```
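
The orange line and shaded band in these plots come from pandas expanding-window statistics. A minimal sketch with four made-up view counts shows what they contain:

```python
import pandas as pd

# Made-up view counts for four videos, ordered by publication date
toy_views = pd.Series([100, 300, 200, 400])

# Each entry uses all videos published up to and including that one
cum_avg = toy_views.expanding().mean()
cum_std = toy_views.expanding().std(ddof=0)  # ddof=0: population standard deviation

print(pd.DataFrame({"views": toy_views, "cum_avg": cum_avg, "cum_std": cum_std}))
```

With `ddof=0` the first entry of `cum_std` is 0 rather than missing, so the band starts at the first video.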

| ![](solar_energy_metrics.jpg) | ![](composting_metrics.jpg) |
|-|-|
| ![](paper_straws_metrics.jpg) | ![](electric_vehicles_metrics.jpg) |

We then look at how the number of new comments a video gets changes with the number of days since its publication date.

```python
def days_after_vidpub(trend,vids_df,comms_df):
    # attach each video's publication date to its comments
    new_df=comms_df.merge(vids_df[['video_url','publishedAt']], on='video_url').rename(columns={'publishedAt':'video_pub_date'})
    # whole calendar days between video publication and comment; normalize() drops the time of day
    new_df['days_vid_to_comment']=(new_df['published_date'].dt.normalize()-new_df['video_pub_date'].dt.normalize()).dt.days
    # number of comments per video per day offset; lineplot then averages over videos
    temp3=new_df.groupby(['video_url','days_vid_to_comment'])['comments'].count().reset_index()
    grid=sns.lineplot(data=temp3, x='days_vid_to_comment', y='comments')
    grid.set(xscale='log',yscale='log')
    plt.title(trend,fontsize=16, fontweight='bold',y=0.92)
    plt.xlabel('no. of days after video publication',fontsize=10, fontweight='bold', fontstyle='italic',)
    plt.ylabel('avg. no. of new comments',fontsize=10, fontweight='bold', fontstyle='italic',)
    plt.savefig(trend.replace(" ", "_")+'_days.jpg',bbox_inches='tight')
    plt.close()
    
days_after_vidpub('solar energy', solar_vids, solar_comms)
days_after_vidpub('composting', compost_vids, compost_comms)
days_after_vidpub('paper straws', straw_vids, straw_comms)
days_after_vidpub('electric vehicles', EV_vids, EV_comms)
```
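
The day-offset calculation can be sketched on made-up data: one video and three comments. This version uses `dt.normalize()` to compare calendar dates, and the table and column names mirror the ones above, but the values are invented.

```python
import pandas as pd

toy_vids = pd.DataFrame({
    "video_url": ["v1"],
    "publishedAt": pd.to_datetime(["2021-05-01 23:30"]),
})
toy_comms = pd.DataFrame({
    "video_url": ["v1", "v1", "v1"],
    "published_date": pd.to_datetime(
        ["2021-05-02 00:10", "2021-05-02 18:00", "2021-05-04 09:00"]
    ),
})

merged = toy_comms.merge(toy_vids, on="video_url")
# normalize() drops the time of day, so differences are whole calendar days
merged["days_vid_to_comment"] = (
    merged["published_date"].dt.normalize() - merged["publishedAt"].dt.normalize()
).dt.days

# Number of comments per video and day offset
per_day = (
    merged.groupby(["video_url", "days_vid_to_comment"])
    .size()
    .reset_index(name="comments")
)
print(per_day)
```

A comment posted 40 minutes after a late-evening upload counts as day 1 here, because the comparison is between calendar dates rather than elapsed time.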

| ![](solar_energy_days.jpg) | ![](composting_days.jpg) |
|-|-|
| ![](paper_straws_days.jpg) | ![](electric_vehicles_days.jpg) |

# <center> **Topic Modeling** </center>

Finally, we look at subtopics that come up in the comments for a particular trend. We use four different models to get the top 5 subtopics for each one:
1. Non-Negative Matrix Factorization (NMF) model with Frobenius loss function,
2. NMF with Kullback-Leibler loss function,
3. MiniBatch NMF with Frobenius loss function, and
4. Latent Dirichlet Allocation (LDA) model.

First we do some preprocessing of the corpus of comments for each trend.

```python
import string
import re
import nltk
#nltk.download('averaged_perceptron_tagger') #run this once
#word_tokenize and WordNetLemmatizer also need the 'punkt' and 'wordnet' resources
from nltk import word_tokenize, pos_tag
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet


#extra stop words; apostrophes are already stripped, hence 'dont', 'isnt', etc.
my_words={'mine','myself','your','yours','yourself','yourselves', 'ours','ourselves','itself','himself','hers','herself', 'they','them','their','theirs',
           'themselves','youre','youve','youll','youd','this','that','itll','theyll','been','isnt','arent','hasnt','have','havent','hadnt','wasnt','were','werent',
           'dont','does','doesnt','didnt','must','mustnt','cant','could','couldnt','should','shouldnt','would','wouldnt','wont', 'shant','aint','onto','also','than','just','only'}


def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

def process_stp(stpwrds):
    """Return a text-cleaning function that uses stpwrds as its stop-word set"""
    def process_txt(txt):
        #this block replaces punctuation and control characters with spaces
        char_dict={ord(char):' ' for char in string.punctuation}
        char_dict.update({num:' ' for num in list(range(32))})
        ans=txt.translate(char_dict) 

        #this block removes short words (length up to three)
        shortword = re.compile(r'\W*\b\w{1,3}\b')
        ans=shortword.sub('',ans)
        
        #this block removes stop words from the text and lemmatizes the remaining words
        wnl=WordNetLemmatizer()
        return ' '.join([wnl.lemmatize(word,get_wordnet_pos(word)) for word in word_tokenize(ans) if word not in stpwrds])
    
    return process_txt

def get_corpus(trend,comms_df):
    #the search keywords themselves are treated as stop words
    search_words=set(trend.split())
    return  comms_df['comments'].apply(process_stp(my_words.union(search_words)))

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, MiniBatchNMF, LatentDirichletAllocation

def plot_topic_words(model,feature_names,title,file_name):
    """Plot the 10 highest-weighted words of each topic as horizontal bar charts"""
    n_top_words=10
    fig,axes=plt.subplots(1,5, figsize=(30,15),sharex=True)
    axes=axes.flatten()
    for topic_idx,topic in enumerate(model.components_):
        top_features_ind=topic.argsort()[: -n_top_words-1:-1]
        top_features=[feature_names[i] for i in top_features_ind]
        weights=topic[top_features_ind]
        ax=axes[topic_idx]
        ax.barh(top_features,weights)
        ax.set_title(f"Topic {topic_idx+1}", fontdict={"fontsize":30})
        ax.invert_yaxis()
        ax.tick_params(axis="both",which="major",labelsize=20)
        for i in "top right left".split():
            ax.spines[i].set_visible(False)
        fig.suptitle(title, fontsize=40)
    plt.subplots_adjust(top=0.9, bottom=0.5,wspace=0.9, hspace=0.3)
    plt.savefig(file_name+'.jpg',bbox_inches='tight')
    plt.close()
```
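
The first two cleaning steps need only the standard library, so they are easy to try in isolation. The sketch below wraps them in a hypothetical helper, `strip_and_shorten`, and leaves out the stop-word removal and lemmatization, which need NLTK data.

```python
import re
import string

def strip_and_shorten(txt):
    # Replace punctuation and ASCII control characters with spaces
    char_dict = {ord(char): ' ' for char in string.punctuation}
    char_dict.update({num: ' ' for num in range(32)})
    ans = txt.translate(char_dict)
    # Drop words of one to three characters, with any preceding non-word characters
    shortword = re.compile(r'\W*\b\w{1,3}\b')
    return shortword.sub('', ans).split()

toy_tokens = strip_and_shorten("Solar panels don't pay off\nfor me, sadly!")
print(toy_tokens)  # ['Solar', 'panels', 'sadly']
```

Note that `don't` becomes `don t` after the punctuation step, and both pieces are then dropped as short words. Nothing is lowercased in these steps, so capitalized words such as `Solar` pass through unchanged.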

```python
#build the cleaned corpus for each trend (slow: every comment is tokenized and lemmatized)
solar_corpus=get_corpus('solar energy',solar_comms)
compost_corpus=get_corpus('composting',compost_comms)
straw_corpus=get_corpus('paper straws', straw_comms)
EV_corpus=get_corpus('electric vehicles',EV_comms)
```

```python
#save the cleaned corpora so the preprocessing does not have to be repeated
solar_corpus.to_pickle('./solar_corpus.pkl')
compost_corpus.to_pickle('./compost_corpus.pkl')
straw_corpus.to_pickle('./straw_corpus.pkl')
EV_corpus.to_pickle('./EV_corpus.pkl')
```

```python
#reload the cleaned corpora
solar_corpus=pd.read_pickle('./solar_corpus.pkl')
compost_corpus=pd.read_pickle('./compost_corpus.pkl')
straw_corpus=pd.read_pickle('./straw_corpus.pkl')
EV_corpus=pd.read_pickle('./EV_corpus.pkl')
```

```python
n_features=500
n_components=5
batch_size=1000
init="nndsvda"

corpora={'solar energy': solar_corpus, 'composting': compost_corpus, 'paper straws': straw_corpus, 'electric vehicles': EV_corpus}

def run_topic_models(stop_words=None,prefix=''):
    """Fit the four topic models on every trend and save one figure per model.

    stop_words is passed to both vectorizers; prefix is prepended to each output file name.
    """
    for trend,corpus in corpora.items():
        base_title=f"Topic modeling for {trend}-related YouTube video comments Using "
        file_stem=prefix+trend.replace(' ','_')

        #TF-IDF features, shared by the three NMF variants
        vectorizer_tfidf=TfidfVectorizer(max_df=0.9,min_df=5,max_features=n_features,stop_words=stop_words)
        corpus_tfidf=vectorizer_tfidf.fit_transform(corpus)
        features_tfidf=vectorizer_tfidf.get_feature_names_out()

        #NMF model using Frobenius norm
        nmf1=NMF(n_components=n_components,random_state=1,init=init,beta_loss="frobenius",alpha_W=0.00005,alpha_H=0.00005,l1_ratio=1).fit(corpus_tfidf)
        plot_topic_words(nmf1,features_tfidf,base_title+"Non-Negative Matrix Factorization (NMF) with Frobenius loss function",file_stem+'_nmf_fro')

        #NMF model using Kullback-Leibler loss function
        nmf2=NMF(n_components=n_components,random_state=1,init=init,beta_loss="kullback-leibler",solver="mu",max_iter=1000,alpha_W=0.00005,alpha_H=0.00005,l1_ratio=0.5).fit(corpus_tfidf)
        plot_topic_words(nmf2,features_tfidf,base_title+"NMF with Kullback-Leibler loss function",file_stem+'_nmf_kl')

        #MiniBatch NMF
        mbnmf=MiniBatchNMF(n_components=n_components,random_state=1,batch_size=batch_size,init=init,beta_loss="frobenius",alpha_W=0.00005,alpha_H=0.00005,l1_ratio=0.5).fit(corpus_tfidf)
        plot_topic_words(mbnmf,features_tfidf,base_title+"MiniBatch NMF with Frobenius loss function",file_stem+'_mb_fro')

        #LDA on raw term counts
        vectorizer_cvec=CountVectorizer(max_df=0.95,min_df=2,max_features=15000,stop_words=stop_words)
        corpus_cvec=vectorizer_cvec.fit_transform(corpus)
        features_cvec=vectorizer_cvec.get_feature_names_out()

        lda=LatentDirichletAllocation(n_components=n_components,max_iter=5,learning_method="online",learning_offset=50.0,random_state=0)
        lda.fit(corpus_cvec)
        plot_topic_words(lda,features_cvec,base_title+"Latent Dirichlet Allocation (LDA)",file_stem+'_lda')

#without using further stopwords on the vectorizers (output files prefixed with 'ns_')
run_topic_models(stop_words=None,prefix='ns_')

#using the built-in English stopwords on the vectorizers
run_topic_models(stop_words="english")
```
    

# <center> **Topic Modeling for Solar Energy** </center>

1. Here we removed certain words from the corpus but did not remove additional stop_words while vectorizing

![](ns_solar_energy_nmf_fro.jpg)  
![](ns_solar_energy_nmf_kl.jpg)
![](ns_solar_energy_mb_fro.jpg)
![](ns_solar_energy_lda.jpg)

2. Here we used additional stop_words while vectorizing. 

![](solar_energy_nmf_fro.jpg)  
![](solar_energy_nmf_kl.jpg)
![](solar_energy_mb_fro.jpg)
![](solar_energy_lda.jpg)

# <center> **Topic Modeling for Composting** </center>

1. Here we removed certain words from the corpus but did not remove additional stop_words while vectorizing

![](ns_composting_nmf_fro.jpg)  
![](ns_composting_nmf_kl.jpg) 
![](ns_composting_mb_fro.jpg)
![](ns_composting_lda.jpg)

2. Here we used additional stop_words while vectorizing. 

![](composting_nmf_fro.jpg)  
![](composting_nmf_kl.jpg) 
![](composting_mb_fro.jpg)
![](composting_lda.jpg)


# <center> **Topic Modeling for Paper Straws** </center>

1. Here we removed certain words from the corpus but did not remove additional stop_words while vectorizing

![](ns_paper_straws_nmf_fro.jpg)  
![](ns_paper_straws_nmf_kl.jpg)
![](ns_paper_straws_mb_fro.jpg)
![](ns_paper_straws_lda.jpg)

2. Here we used additional stop_words while vectorizing. 

![](paper_straws_nmf_fro.jpg)  
![](paper_straws_nmf_kl.jpg)
![](paper_straws_mb_fro.jpg)
![](paper_straws_lda.jpg)

# <center> **Topic Modeling for Electric Vehicles** </center>

1. Here we removed certain words from the corpus but did not remove additional stop_words while vectorizing

![](ns_electric_vehicles_nmf_fro.jpg)  
![](ns_electric_vehicles_nmf_kl.jpg)
![](ns_electric_vehicles_mb_fro.jpg)
![](ns_electric_vehicles_lda.jpg)

2. Here we used additional stop_words while vectorizing. 

![](electric_vehicles_nmf_fro.jpg)  
![](electric_vehicles_nmf_kl.jpg)
![](electric_vehicles_mb_fro.jpg)
![](electric_vehicles_lda.jpg)