### Library import

In [3]:
from matplotlib import pyplot as plt
import pandas as pd
import glob

from string import punctuation
from nltk.corpus import stopwords
from collections import defaultdict, Counter
from wordcloud import WordCloud

!pip install textblob
from textblob import TextBlob

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import NMF, TruncatedSVD, LatentDirichletAllocation
import pyLDAvis
import pyLDAvis.lda_model
import pyLDAvis.gensim_models

from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.models import LdaModel

import warnings
warnings.filterwarnings('ignore')



In [4]:
df = pd.read_csv('data/news_cleaned.csv')
#df['tokens'] = df['tokens'].str.replace("'", "")
#df['tokens_no_climate'] = df['tokens_no_climate'].str.replace("'", "")

In [5]:
len(df)

90863

### Sentiment analysis with TextBlob

TextBlob polarity scores range from -1.0 to 1.0, where -1.0 is the most negative sentiment and 1.0 the most positive.

TextBlob subjectivity scores range from 0.0 to 1.0, where 0.0 is very objective and 1.0 is very subjective.
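
To make the polarity scale concrete, the sketch below maps a score to a coarse label. The function name `label_polarity` and the `neutral_band` cutoff of 0.05 are illustrative assumptions; they are not part of TextBlob or of this analysis.

```python
def label_polarity(score, neutral_band=0.05):
    """Map a polarity score in [-1.0, 1.0] to a coarse label.

    neutral_band is a hypothetical cutoff chosen only for illustration.
    """
    if not -1.0 <= score <= 1.0:
        raise ValueError("polarity must lie in [-1.0, 1.0]")
    if score > neutral_band:
        return "positive"
    if score < -neutral_band:
        return "negative"
    return "neutral"


# Made-up scores, not taken from the dataset
for score in (-0.6, 0.0, 0.3):
    print(score, label_polarity(score))
```

Once a `polarity` column exists, the same function could be applied with `df['polarity'].apply(label_polarity)`.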

In [6]:
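# score each snippet once, then unpack polarity and subjectivity
sentiments = df['snippet'].apply(lambda x: TextBlob(x).sentiment)
df['polarity'] = sentiments.apply(lambda s: s.polarity)
df['subjectivity'] = sentiments.apply(lambda s: s.subjectivity)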

In [7]:
df.sample(n=5)

Unnamed: 0,matchdatetime,station,snippet,tokens,snippet_no_climate,tokens_no_climate,polarity,subjectivity
77120,2019-04-14 19:22:49,FOXNEWS,they wanted to be more moderate rather than mo...,"['wanted', 'moderate', 'rather', 'liberal', 'p...",they wanted to be more moderate rather than mo...,"['wanted', 'moderate', 'rather', 'liberal', 'p...",0.089123,0.448864
11141,2019-08-14 14:56:20,BBCNEWS,-- mitigate. -- mitigate. that is of course th...,"['mitigate', 'mitigate', 'course', 'big', 'que...",-- mitigate. -- mitigate. that is of course th...,"['mitigate', 'mitigate', 'course', 'big', 'que...",0.1,0.15
59882,2013-08-16 18:44:26,FOXNEWS,end run around congress. the agency moving to ...,"['end', 'run', 'around', 'congress', 'agency',...",end run around congress. the agency moving to ...,"['end', 'run', 'around', 'congress', 'agency',...",0.278788,0.533333
45280,2015-11-09 19:44:45,MSNBC,"totally is. anyway, no. it is a missile. presi...","['totally', 'anyway', 'missile', 'president', ...","totally is. anyway, no. it is a missile. presi...","['totally', 'anyway', 'missile', 'president', ...",0.083333,0.361111
32942,2015-09-21 21:19:48,MSNBC,"popular pope. at the same time, he is somewhat...","['popular', 'pope', 'time', 'somewhat', 'contr...","popular pope. at the same time, he is somewhat...","['popular', 'pope', 'time', 'somewhat', 'contr...",0.33,0.595


In [8]:
# calculate average polarity and subjectivity for each station
station_stats = df.groupby('station').agg({'polarity': 'mean', 'subjectivity': 'mean'})

# rename columns
station_stats.columns = ['Average Polarity', 'Average Subjectivity']
print(station_stats)


         Average Polarity  Average Subjectivity
station                                        
BBCNEWS          0.088726              0.395586
CNN              0.097142              0.397627
FOXNEWS          0.075069              0.369970
MSNBC            0.099143              0.396647


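Average polarity is slightly positive for all four stations (roughly 0.08 to 0.10), with only small differences between them: FOXNEWS has the lowest mean (0.075) and MSNBC the highest (0.099).

Average subjectivity is also similar across stations (roughly 0.37 to 0.40). These values sit below the 0.5 midpoint of the scale, so coverage of climate change leans toward objective language on every station.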
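
One way to judge how small the gaps between stations are is to express a difference in means in units of the pooled standard deviation (Cohen's d); by convention, an absolute value below about 0.2 is considered small. The sketch below uses only the standard library. The helper name `cohens_d` and the sample values are illustrative assumptions, not results from this dataset.

```python
from statistics import mean, variance


def cohens_d(a, b):
    """Standardized mean difference between two samples, using the pooled SD."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5


# Made-up polarity scores for two hypothetical groups
group_a = [0.10, 0.05, 0.20, -0.10, 0.15]
group_b = [0.00, 0.10, 0.05, -0.05, 0.10]
print(round(cohens_d(group_a, group_b), 3))
```

In the notebook, the two inputs could be the polarity scores of two stations, for example `df.loc[df['station'] == 'MSNBC', 'polarity'].tolist()` and the equivalent for `'FOXNEWS'`.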

In [44]:
# see 10 most subjective snippets
for snippet in df.nlargest(10, 'subjectivity')['snippet']:
    print(snippet)

environmental catastrophe in another part of the world. so far, administration officials are not backing away from nuclear. which they said will reduce emissions and prevent climate
strict greenhouse gas reduction law. prop 23 would suspend that law and that, of course, would be awesome for companies that make a lot of money by making a lot of pollution. 97% of the funding for prop 23 so far comes from oil and chemical companies, including a
and state chapters of the naacp. the letters urged perriello to vote against climate change legislation. the letters were fake. tea party groups camped out mr. perriello's virginia office, one
targeting climate change, is there a bit of hypocrisy of it? i disagree with it. i think you find it funny. he made hundred million. he is 3500 votes of being
producing countries? yes, it is. so you al gore are doing business with this country. [ laughter ] that's enabling your ultimate foe, climate change? i think i understand what you are getting at. [ laug

In [43]:
# see 10 most negative snippets
for snippet in df.nsmallest(10, 'polarity')['snippet']:
    print(snippet)

by 2050, countries like that might not exist. closer to home, things like wildfires, devastating hurricanes, food shortages, migrations, they're all a host of awful things associated with climate change. we're already seeing the beginnings of this now. and this report just underscores
report warns of devastating effects from climate change. president trump suggested that he doesn't believe it, what's your response to the president? look, the climate den
we're going to have to build shelters so people can escape when these terrible fires get out of hand. and yes, we're going to have to deal with climate change. all of that. reporter: meanwhile, 145 evacuees and workers in shelters around butte county are suffering from norovirus.
published scientific literature. so what this report will tell us is that we are seeing the impact of climate change on our coastlines here in the united states, in terms of devastating superstorms. you add a foot of sea level rise and we could see six feet to


In [42]:
# see 10 most positive snippets
for snippet in df.nlargest(10, 'polarity')['snippet']:
    print(snippet)

issues, or pressing concerns - whether it be climate change, animal exploitation or refugees. at the forefront of films addressing the refugee crisis was 80-year-old legendary actress, vanessa redgrave,
some scientists have called climate change the greatest threat that humanity changes. president trump's defense secretar james mattis called it a challenge to national security. the president said he would make
that is not all. causation is the republican resolution that climate change is happening and we need to find a solution. while she has had an impressive start in congress, she does not plan to be there forever. i do think institutionally congress benefits from having a
candidates. by the way, in massachusetts they say the shape of the field determines the winner. here's the people that look like they may run against her. maybe ed markey, very impressive senior who did all this mark pushing the climate change and
truly greatest weapons. but the speech had nothing to say about clim

### Topic modeling


In [10]:
# Count Vectorizer
combined_tokens = [' '.join(sublist) for sublist in df['tokens']]
count_vectorizer = CountVectorizer(stop_words='english', min_df=3, max_df=0.7)
count_vectors = count_vectorizer.fit_transform(combined_tokens)
count_vectors
#count_vectors.shape

ValueError: empty vocabulary; perhaps the documents only contain stop words
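
A plausible cause of this error, though the output alone cannot confirm it, is that `read_csv` loaded the `tokens` column as strings such as `"['wanted', 'moderate', ...]"` rather than as lists. Joining a string with `' '` puts a space between every character, and scikit-learn's default token pattern only keeps tokens of two or more word characters, so nothing survives. The sketch below reproduces that effect with the standard library; `parse_tokens` is a hypothetical helper name and the sample cell is made up in the style of the data.

```python
import ast
import re

# Default token pattern of scikit-learn's CountVectorizer:
# a token needs at least two word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def parse_tokens(value):
    """Turn the string "['a', 'b']" back into ['a', 'b']; pass real lists through."""
    if isinstance(value, str):
        return ast.literal_eval(value)
    return list(value)


raw = "['wanted', 'moderate', 'rather', 'liberal']"  # a cell as read from CSV

# Joining the raw string separates every character with a space
broken = ' '.join(raw)
print(TOKEN_PATTERN.findall(broken))

# Parsing first restores whole words
fixed = ' '.join(parse_tokens(raw))
print(TOKEN_PATTERN.findall(fixed))
```

If this is the cause, running `df['tokens'] = df['tokens'].apply(parse_tokens)` before building `combined_tokens` should give both vectorizers real words to work with.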

In [None]:
# TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english', min_df=5, max_df=0.7)
tfidf_vectors = tfidf_vectorizer.fit_transform(combined_tokens)
tfidf_vectors.shape

ValueError: empty vocabulary; perhaps the documents only contain stop words

#### LDA

In [None]:
## takes >25 minutes
# Fitting LDA Model
lda_model = LatentDirichletAllocation(n_components=5, random_state=314)
W_lda_matrix = lda_model.fit_transform(count_vectors)
H_lda_matrix = lda_model.components_

# Display LDA Model
display_topics(lda_model, count_vectorizer.get_feature_names_out())

KeyboardInterrupt: 
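
`display_topics` is not defined in any of the cells shown here. A minimal sketch of what such a helper might look like follows; the signature, the `n_top_words` parameter, and the `top_terms` helper are assumptions, and the toy model is a stand-in object with a `components_` attribute rather than a fitted scikit-learn model.

```python
from types import SimpleNamespace


def top_terms(model, feature_names, n_top_words=10):
    """Return the highest-weighted terms for each row of model.components_."""
    topics = []
    for weights in model.components_:
        # indices of the largest weights, biggest first
        order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        topics.append([feature_names[i] for i in order[:n_top_words]])
    return topics


def display_topics(model, feature_names, n_top_words=10):
    """Print one line per topic listing its top terms."""
    for topic_idx, terms in enumerate(top_terms(model, feature_names, n_top_words)):
        print(f"Topic #{topic_idx + 1}: {', '.join(terms)}")


# Toy stand-in for a fitted model: 2 topics over a 3-word vocabulary
toy_model = SimpleNamespace(components_=[[0.1, 0.7, 0.2], [0.5, 0.1, 0.4]])
display_topics(toy_model, ['alpha', 'beta', 'gamma'], n_top_words=2)
```

The helper only relies on indexing and `len`, so it should also accept the NumPy arrays that `lda_model.components_` and `get_feature_names_out()` return.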

In [None]:
lda_display = pyLDAvis.lda_model.prepare(lda_model, count_vectors, count_vectorizer, sort_topics=False)
pyLDAvis.display(lda_display)

#### LDA w/ Gensim

In [None]:
gensim_tokens = df['tokens']

# initialize Gensim dictionary
dict_gensim = Dictionary(gensim_tokens)

# keep words that appear in at least 5 snippets but in no more than 70% of all snippets
dict_gensim.filter_extremes(no_below=5, no_above=0.7)

# calculate bag of words matrix
bow_gensim = [dict_gensim.doc2bow(tokens) for tokens in gensim_tokens]

# perform TF-IDF transformation
tfidf_gensim = TfidfModel(bow_gensim)
vectors_gensim = tfidf_gensim[bow_gensim]


KeyboardInterrupt
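
The document-frequency filter applied by `filter_extremes` can be illustrated in plain Python. This is a sketch of the idea only, not Gensim's implementation: `filter_by_doc_freq` is a hypothetical name, the toy documents are made up, and Gensim's additional `keep_n` cap on vocabulary size is not modeled.

```python
from collections import Counter


def filter_by_doc_freq(docs, no_below=5, no_above=0.7):
    """Keep tokens found in at least no_below documents and in at most
    a fraction no_above of all documents."""
    # count each token once per document
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return {tok for tok, count in doc_freq.items() if no_below <= count <= max_docs}


# Toy corpus of tokenized snippets
docs = [
    ['climate', 'change', 'report'],
    ['climate', 'weather'],
    ['climate', 'report', 'carbon'],
    ['climate', 'carbon'],
]
print(sorted(filter_by_doc_freq(docs, no_below=2, no_above=0.75)))
```

Here `climate` is dropped because it appears in every document, and `change` and `weather` are dropped because they appear in only one.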



In [None]:
# using LDA with Gensim
lda_gensim = LdaModel(corpus=bow_gensim,
                      id2word=dict_gensim,
                      chunksize=2000,
                      alpha='auto',
                      eta='auto',
                      iterations=400,
                      num_topics=5,
                      passes=20,
                      eval_every=None,
                      random_state=509)

In [None]:
# see word distribution of topics
#display_topics_gensim(lda_gensim)
lda_gensim_topics = lda_gensim.show_topics(num_topics=5, num_words=10)

# Display the topics
for topic_idx, topic in lda_gensim_topics:
    print(f"Topic #{topic_idx + 1}: {topic}")

Topic #1: 0.024*"world" + 0.018*"greenhouse" + 0.017*"new" + 0.017*"emissions" + 0.014*"carbon" + 0.013*"gas" + 0.011*"government" + 0.011*"action" + 0.010*"environment" + 0.010*"news"
Topic #2: 0.017*"weather" + 0.011*"facing" + 0.009*"record" + 0.008*"air" + 0.008*"water" + 0.008*"seen" + 0.008*"extreme" + 0.008*"across" + 0.007*"become" + 0.006*"finds"
Topic #3: 0.023*"president" + 0.021*"us" + 0.015*"said" + 0.014*"trump" + 0.010*"first" + 0.010*"today" + 0.009*"next" + 0.008*"two" + 0.008*"crisis" + 0.008*"hes"
Topic #4: 0.025*"global" + 0.023*"warming" + 0.020*"people" + 0.016*"think" + 0.014*"going" + 0.013*"like" + 0.013*"one" + 0.012*"say" + 0.010*"get" + 0.010*"would"
Topic #5: 0.024*"years" + 0.018*"year" + 0.017*"scientists" + 0.017*"could" + 0.012*"says" + 0.012*"report" + 0.010*"impact" + 0.009*"temperatures" + 0.008*"planet" + 0.008*"global"
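
Each topic above is a single string of `weight*"word"` terms. To tabulate or compare topics, the string can be split into (word, weight) pairs. The sketch below does that with a regular expression; `parse_topic` is a hypothetical helper, and the sample string is shortened from Topic #1. Gensim's `show_topics(formatted=False)` returns such pairs directly, which avoids parsing altogether.

```python
import re

# matches terms such as 0.024*"world"
TERM = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic(topic_string):
    """Split a formatted topic string into (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM.findall(topic_string)]


topic = '0.024*"world" + 0.018*"greenhouse" + 0.017*"new"'
for word, weight in parse_topic(topic):
    print(f"{word:<12}{weight:.3f}")
```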
