# 1. Introduction

In this project, my objective is to compare how climate change is viewed by the left and the right in the United States. This is a follow-up to a previous project in which I analyzed whether energy portfolios differed between red states and blue states (https://github.com/alab5037/Red-or-Blue/blob/master/Report%20(update).pdf). Here's an excerpt from the paper's conclusion:

"In conclusion, this study makes a strong case that blue states generate more electricity from biomass than red states. It can be argued that red states generate more power from coal compared to blue states, but the argument is a bit more tenuous considering the significance level for the covariate coal in Table 1 of the logistic regression model."

As stated above, I found evidence that energy policies differ between red and blue states as it pertains to clean/renewable energy, particularly the adoption of biomass energy (waste, biofuels, etc.). Renewable energy proliferation is directly related to climate change since renewable sources are considered "net-zero", meaning that net carbon dioxide and other greenhouse gas emissions are eliminated (Sims, 2003). To follow up on this research, I thought it would be interesting to further explore the political divide surrounding climate change. This time, I would like to get a better understanding of why it exists and where we stand today. To obtain insight on these questions, I explore the rhetoric of climate change articles from two mass media sites on different sides of the political spectrum: Fox News (right) and the Huffington Post (left). Fox News has been alleged by many to have a Republican Party bias in its news coverage; thus, it will be used as the right-leaning media outlet in this study. Conversely, the Huffington Post has been described as a liberal news and information site; hence, it will serve as the left-leaning media outlet in this study. Where these media outlets lie on the political spectrum can be viewed at the following link: https://guides.lib.umich.edu/c.php?g=637508&p=4462444.

Previous studies suggest that conservative media tend to downplay climate change. For example, an analysis of 2007 and 2008 coverage found that Fox News takes a more dismissive tone towards climate change than CNN and MSNBC, which are both left leaning (Feldman et al., 2011). Interestingly, this seems to have quite an influence on how conservatives perceive climate change. In a 2010 Gallup survey of 1,014 adults in the U.S., 74% of liberals agreed that “effects of global warming are already occurring,” whereas only 30% of conservatives concurred (Jones, 2010). In this study, I would like to check whether these findings still hold true in 2019. If the message surrounding climate change proves to be different between the two media outlets, then it would suggest that this divide still exists. This might also explain why energy portfolios differ between blue and red states, as evidence shows that the media impacts how people vote. For instance, a 2007 paper titled "The Fox News Effect: Media Bias and Voting" found that the introduction of Fox News to cable programming led to an increase in the Republican share of votes in Presidential elections between 1996 and 2000. In particular, Republicans gained 0.4 to 0.7 percentage points in the towns that broadcast Fox News (DellaVigna and Kaplan, 2007).

Ultimately, the goal of this study is not to point fingers at any one party. I think that both sides bear responsibility for the current state of affairs. As mentioned before, studies have shown that right-leaning media outlets tend to be more dismissive of climate change. However, it is also said that liberals push too hard and exaggerate climate change fears, which can be counterproductive (Kreutzer, 2016). I hope that my results can help to stress that tackling climate change requires political unity. If both sides of the political spectrum cannot align, it will be arduous to take effective action against climate change.

To begin my study, I will first scrape climate change related articles from both Fox News and the Huffington Post. Once I’ve gathered this information, I will build statistical models to determine whether I can predict the source of an article based on its entire text. A successful model would indicate that these two political factions use different language when it comes to climate change.

# 2.1 Scraping Fox News

First, I scraped articles from Fox News' climate change section. In the cell below, I create a function that loads one batch of articles from that section; different batches are requested by changing the offset parameter in the URL. I also use json_normalize to flatten the JSON response into a dataframe.

In [1]:
import pandas as pd
import requests
import json
from pandas.io.json import json_normalize

def fox_df(offset):
    url = ("https://www.foxnews.com/api/article-search?isCategory=false&isTag=true&isKeyword=false&"
    "isFixed=false&isFeedUrl=false&searchSelected=fox-news%2Fus%2Fenvironment%2Fclimate-change&contentTypes=%7B"
    "%22interactive%22:true,%22slideshow%22:true,%22video%22:true,%22article%22:true%7D&"
    "size=30&offset=0")
    
    url = url.replace('offset=0', 'offset=%s' % offset)
    
    r_fox = requests.get(url)
    new_df_fox = json_normalize(r_fox.json())
    
    return new_df_fox

Next, I write a loop that requests up to roughly 500 articles in batches of 30. The articles are in reverse chronological order, so the first 30 loaded, for example, will be the 30 most recently posted articles on Fox News. These articles are then all stored in a dataframe, foxdf.

In [2]:
from time import sleep # for pausing

# collect each batch in a list, then concatenate once at the end
# (DataFrame.append was removed in pandas 2.0)
batches = []

# Create a loop that counts up by 30.
for offset in range(0, 500, 30):
    batches.append(fox_df(offset))
    
    # Pause for three seconds to be polite to the web server
    sleep(3)

foxdf = pd.concat(batches, ignore_index=True)
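
The offset-based paging above can be sketched in a form that stops as soon as the source returns an empty batch, which is one way to see why only 284 rows come back even though the loop asks for more. This is a minimal sketch, not the notebook's actual code: `fetch_all`, `fake_fetch`, and `FAKE_ARCHIVE` are hypothetical names, and the fake archive stands in for the Fox News endpoint so the example runs without network access.

```python
from typing import Callable, List


def fetch_all(fetch_batch: Callable[[int], List[dict]],
              page_size: int = 30,
              max_items: int = 500) -> List[dict]:
    """Request batches at increasing offsets until the source runs dry."""
    items: List[dict] = []
    for offset in range(0, max_items, page_size):
        batch = fetch_batch(offset)
        if not batch:  # an empty batch means there are no more articles
            break
        items.extend(batch)
    return items


# Hypothetical stand-in for the real endpoint: 284 fake records in total.
FAKE_ARCHIVE = [{"url": f"/politics/story-{i}"} for i in range(284)]


def fake_fetch(offset: int) -> List[dict]:
    """Return up to 30 records starting at the given offset."""
    return FAKE_ARCHIVE[offset:offset + 30]


articles = fetch_all(fake_fetch)
print(len(articles))
```

In the notebook, `fetch_batch` could wrap `fox_df(offset)` together with the three-second pause; that wiring is left out here because it needs a live connection.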

284 articles were loaded in the dataframe, meaning that only 284 were available on Fox News. Nonetheless, this should be ample for this analysis. However, I don't find that the available variables will serve as very good predictors for my model. For example, descriptions and titles do not seem to contain enough information to differentiate between Fox News and Huffington Post. As a result, I will retrieve the text of the articles in subsequent steps since these will be rich with information. Lastly, in the dataframe above, publisher information is not clearly available. This is ultimately what I want to serve as my dependent variable in my model. Therefore, I will create a new column for publisher.

In [3]:
#add a column for publisher = 'fox'
foxdf['publisher']='fox'

To retrieve text from each individual article, I create a list containing the links from the 'url' column in the fox dataframe. I notice, however, that some of these links direct you to videos, which, of course, don't contain much text. Hence, I will first remove video links from the dataframe. These can be identified because their 'url' value is a full address containing "https", whereas article rows store only a path relative to the Fox News domain. After doing so, the dataframe has been reduced to 222 articles. I also need to change the index of the newly created dataframe, rm_video, to have consecutive numbers or else the loop won't work.

In [4]:
#remove rows that include video links in url
rm_video=foxdf[~foxdf.url.str.contains("https")]
#change index to have consecutive numbers 
rm_video.index = range(len(rm_video))

Below, I create a loop that stores links in a list.

In [5]:
url='https://www.foxnews.com' #create fox base url (domain name), we can append paths to this
links_fox=[]

#store the full url of every article in one list
for i in range(0,len(rm_video)): 
        row=rm_video['url'][i] #path of each article
        full_url=url+row #append path to domain
        print(full_url)
        links_fox.append(full_url)

https://www.foxnews.com/politics/sen-joe-manchin-bloombergs-carbon-initiative-would-punish-west-virginians
https://www.foxnews.com/politics/mini-aoc-releases-re-election-video-mocking-the-new-york-congresswoman
https://www.foxnews.com/politics/coal-state-dem-slams-bloomberg-for-climate-initiative
https://www.foxnews.com/politics/green-new-deal-architect-says-climate-change-can-lead-to-25-holocausts
https://www.foxnews.com/entertainment/bill-nye-world-climate-change-d-day
https://www.foxnews.com/opinion/liz-peek-joe-biden-climate-plan-ocasio-cortez-democrats-frontrunner
https://www.foxnews.com/politics/joe-biden-matt-gaetz-2020-democrat-old-tired
https://www.foxnews.com/politics/biden-unveils-wide-ranging-climate-change-plan-that-uses-green-new-deal-as-crucial-framework
https://www.foxnews.com/politics/green-new-deal-could-have-dems-facing-blue-collar-backlash-at-polls-some-say
https://www.foxnews.com/opinion/roy-spencer-tornadoes-ohio-ocasio-cortez-sanders
https://www.foxnews.com/polit
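
The two steps above (dropping rows whose 'url' is already a full address, then prefixing the domain to the remaining paths) can also be expressed with the standard library's URL helpers. The sketch below is only illustrative: `is_relative_path` is a hypothetical helper, and the video address is made up for the example.

```python
from urllib.parse import urljoin, urlparse

BASE_URL = "https://www.foxnews.com"


def is_relative_path(url: str) -> bool:
    """True when the URL has no scheme or host, i.e. it is a site-relative path."""
    parts = urlparse(url)
    return parts.scheme == "" and parts.netloc == ""


# Illustrative values only; the second entry is an invented video address.
sample_urls = [
    "/politics/coal-state-dem-slams-bloomberg-for-climate-initiative",
    "https://video.example.com/v/123456",
]

# Keep only relative paths and join them onto the base domain.
article_links = [urljoin(BASE_URL, u) for u in sample_urls if is_relative_path(u)]
print(article_links)
```

Checking for a missing scheme is slightly stricter than searching for the substring "https", since it would not be fooled by a path that happens to contain that text.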

Now that I have all article links, I can extract text from them. I do this using the Article class from the newspaper package, which makes it easy to pull the text from each link. In addition, I will extract article titles and descriptions. This should provide some background for each article, making it easier to identify. This data will be stored in a new dataframe, fox.

In [6]:
from newspaper import Article
from newspaper.article import ArticleException, ArticleDownloadState

df_fox=[] #combine data from all of these articles into one list then convert to df

for i in range(0,len(links_fox)): 
    article_fox = Article(links_fox[i]) #iterate through each article link
        
    slept = 0
    article_fox.download()
    while article_fox.download_state == ArticleDownloadState.NOT_STARTED:
        # Raise exception if article download state does not change after 10 seconds
        if slept > 9:
            raise ArticleException('Download never started')
        # wait inside the loop; otherwise the counter never advances
        sleep(1)
        slept += 1
    
    article_fox.parse() #makes it easy to identify main components of article
       
    #to retrieve title/description/full text from each article + append publisher name
    article_info_fox={'title' : article_fox.title, 
                      'description': article_fox.meta_description,
                      'text':article_fox.text,
                      'publisher' : 'fox'} 

    df_fox.append(article_info_fox)

#convert the list of dictionaries to a dataframe once, after the loop
fox = pd.DataFrame(df_fox)
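
A single failed download stops the loop above. One possible variation, sketched here under the assumption that skipping a few articles is acceptable, records the failing links and keeps going. `collect_articles` and `fake_fetch` are hypothetical names; the fake fetcher stands in for the newspaper download and parse steps so the example runs offline.

```python
from typing import Callable, Dict, List, Tuple


def collect_articles(links: List[str],
                     fetch: Callable[[str], Dict[str, str]],
                     publisher: str) -> Tuple[List[dict], List[str]]:
    """Fetch every link, tag each record with its publisher, and skip failures."""
    rows: List[dict] = []
    failed: List[str] = []
    for link in links:
        try:
            record = dict(fetch(link))
        except Exception:
            failed.append(link)  # remember the link so it can be retried later
            continue
        record["publisher"] = publisher
        rows.append(record)
    return rows, failed


# Hypothetical fetcher used only to exercise the loop without network access.
def fake_fetch(link: str) -> Dict[str, str]:
    if "broken" in link:
        raise ValueError("download failed")
    return {"title": f"Title for {link}", "text": "body"}


rows, failed = collect_articles(["/a", "/broken", "/b"], fake_fetch, "fox")
print(len(rows), failed)
```

In the notebook, `fetch` could wrap the `Article(...)`, `download()` and `parse()` calls, and `pd.DataFrame(rows)` would build the dataframe once after the loop.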

# 2.2 Scraping Huffington Post

Next, I will scrape articles from the Huffington Post. As mentioned previously, the Huffington Post represents a left-leaning media site and, thus, will serve as the counterpart to Fox News. The goal is to end up with the same type of dataframe as for Fox News, with columns for description, publisher, text, and title. In the cell below, I use the Article class from the newspaper package. Moreover, I use the BeautifulSoup package to parse each html page. I first download and parse each listing page in a loop and then collect the article links from the parsed pages. Since there are 26 articles per page, I need at least 9 pages to retrieve 222 articles, the same number as Fox News. I loop through 11 pages (286 articles) to make sure I have enough articles in case I have to remove any that are videos.

In [17]:
#Scraping Huffington Post

import requests
import json
import re
from newspaper import Article
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup

store_pages=[] #store html code for each page

for page_number in range(1,12): 
     
    page_number=str(page_number) #convert integer to string so it can be appended to the url
    base_url_huff='https://www.huffpost.com/impact/topic/climate-change?page='
    r_huff = Request(base_url_huff + page_number, headers={'User-Agent': 'Mozilla/5.0'}) #To get around block
    webpage = urlopen(r_huff).read()
    page_soup=soup(webpage,"html.parser")
    store_pages.append(page_soup) #store each html parsed page in the list

    print('Now on page' + page_number)
    sleep(3)

Now on page1
Now on page2
Now on page3
Now on page4
Now on page5
Now on page6
Now on page7
Now on page8
Now on page9
Now on page10
Now on page11


Similar to what was done in the Fox News scraping, I retrieve article titles and descriptions by first retrieving links to each article. The stored links (285) can be seen below:

In [18]:
links_huff = [] #empty list to store links
i=0
for i in range(0, len(store_pages)):
    for link in store_pages[i].findAll("a", {"class": "card__link yr-card-headline"}):
        links_huff.append(link.get('href'))

Next, I retrieve article titles and descriptions from each link. Again, the Article class makes it easy to identify the main components of the article. Here, I'm interested in obtaining the article title, description, and text. I also include a publisher column labeled 'huff', which will serve as my dependent variable in my analysis in the next section. Lastly, I combine data from all of these articles into one list and then convert it to a dataframe.

In [20]:
from newspaper.article import ArticleException, ArticleDownloadState

df_huff=[] #combine data from all of these articles into one list then convert to df

for i in range(0,len(links_huff)): 
    article_huff = Article(links_huff[i]) #iterate through each article link
        
    slept = 0
    article_huff.download()
    while article_huff.download_state == ArticleDownloadState.NOT_STARTED:
        # Raise exception if article download state does not change after 10 seconds
        if slept > 9:
            raise ArticleException('Download never started')
        # wait inside the loop; otherwise the counter never advances
        sleep(1)
        slept += 1
    
    article_huff.parse() 
        
    #to retrieve title/description/full text from each article + append publisher name
    article_info_huff={'title' : article_huff.title, 
                      'description': article_huff.meta_description,
                      'text': article_huff.text,
                      'publisher' : 'huff'} 

    df_huff.append(article_info_huff)

#convert the list of dictionaries to a dataframe once, after the loop
huff = pd.DataFrame(df_huff)

Now that I have dataframes for both Fox News and Huffington Post, I can combine them. However, before doing so I will remove some rows from the Huffington Post (285) to match the number of rows in the Fox News dataframe (222). I do this because I want a balanced dataset. The trimmed dataframe is called new_huff.

In [21]:
new_huff=huff[0:222]

Now that I have both dataframes for Fox News and the Huffington Post, each containing the same features, I can combine them into one dataframe. The combined dataframe can be seen below and will be used for my analysis. In total, there are 444 rows since the Fox News and Huffington Post dataframes each have 222 rows.

In [22]:
#concatenate both df (DataFrame.append was removed in pandas 2.0)
combine=pd.concat([fox, new_huff], ignore_index=True)

Below, I save the dataframe to a csv file for backup. 

In [23]:
#save to csv
combine.to_csv("/Users/halabanz/Desktop/fox_huff-scrape2.csv", sep=',')

In [24]:
combine

| | description | publisher | text | title |
|---|---|---|---|---|
| 0 | Democratic presidential candidate Beto O'Rourk... | fox | Remember Beto?\n\nMonday former congressman fr... | Gutfeld on Beto's $5 trillion plan to save the... |
| 1 | "The Five" host Jesse Watters had some tough w... | fox | "The Five" co-host Jesse Watters had some toug... | Watters on Beto's climate plan: He's a 'scam a... |
| 2 | Democrats who truly wish to address climate an... | fox | This week, Democrats in the House of Represent... | Mandy Gunasekara: Democrats should back Trump'... |
| 3 | In the next few weeks California Gov. Gavin Ne... | fox | In the next few weeks, California Gov. Gavin N... | California Gov. Newsom facing new pressure fro... |
| 4 | Climate change activists pulled an Earth Day s... | fox | Here’s a headline for you: California climate ... | Tomi Lahren: California is the Golden State of... |
| 5 | Actor Alec Baldwin said Tuesday that his passi... | fox | Actor Alec Baldwin said Tuesday that his passi... | Alec Baldwin calls for action on climate chang... |
| 6 | Little has changed over at CBS, with the netwo... | fox | CBS dug up for Earth Day a decades-old news cl... | CBS resurfaces 1970 'Act or Die' alarmist clim... |
| 7 | The hurricane pre-season hysteria is ramping u... | fox | The hurricane pre-season hysteria is ramping u... | Joe Bastardi: Hurricanes happen, whether they ... |
| 8 | Climate-change activists took over the Univers... | fox | Climate-change activists took over the Univers... | Protesters climb Universal Studios globe in Ca... |
| 9 | The protesters who glued themselves to trains ... | fox | The protesters who glued themselves to trains ... | Climate change protesters announce shift to po... |


# 3. Analysis

Now that I have my data, I can fit several statistical models to predict the source of an article, Fox News or the Huffington Post. I will first create a matrix of token counts using the count vectorizer. The count vectorizer counts the number of times a token shows up in a document and uses this value as its weight. I will also implement the tfidf vectorizer, which is similar to the count vectorizer, except that a token's weight is scaled down the more documents in the corpus contain it. With each of these matrices, I will try to predict whether a given article belongs to Fox News or the Huffington Post. In this study, I will use three models: 1) multinomial naive bayes, 2) k nearest neighbors, and 3) random forest.
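
To make the difference between the two weightings concrete, here is a small hand-computed sketch on three made-up sentences. It uses the smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, which is the variant scikit-learn documents for `smooth_idf=True`; the final L2 normalization that `TfidfVectorizer` also applies is left out to keep the arithmetic visible. The documents and helper names are assumptions for illustration only.

```python
import math
from collections import Counter

# Made-up mini corpus; not taken from the scraped articles.
docs = [
    "climate change threatens coral reefs",
    "climate change plan costs taxpayers",
    "climate activists protest",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
df = Counter(term for doc in tokenized for term in set(doc))


def count_weights(doc):
    """Raw term counts, as a count vectorizer would produce."""
    return dict(Counter(doc))


def tfidf_weights(doc):
    """Term count multiplied by the smoothed inverse document frequency."""
    counts = Counter(doc)
    return {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
            for t, c in counts.items()}


first = tokenized[0]
print(count_weights(first))
print(tfidf_weights(first))
```

Every word in the first sentence has a count of 1, but under tfidf the word 'climate', which appears in all three documents, ends up with a lower weight than 'coral', which appears in only one.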

In [1]:
import pandas as pd
combine=pd.read_csv("/Users/halabanz/Desktop/Big Data/fox_huff-scrape.csv")
combine.drop(combine.columns[[0]], axis=1, inplace=True)

In [2]:
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier

Before getting started, for my dependent variable, publisher, I convert from string type to numeric type. In this case, articles belonging to Fox News will be labeled '1' and articles belonging to Huffington Post will be labeled '0'.

In [3]:
combine.loc[combine['publisher']=='fox','publisher']=1
combine.loc[combine['publisher']=='huff','publisher']=0
combine['publisher'].value_counts()

1    222
0    222
Name: publisher, dtype: int64

Next, I randomly split my dataset into a training set (80% of the dataframe) and a test set (20% of the dataframe). The training set will be used to train my model (i.e. search for patterns) and the test set will be used to test the predictive strength and accuracy of my model. Because the split is random, I rerun it until the test set is roughly balanced (45 Huffington Post vs. 44 Fox News articles).

In [4]:
train, test = train_test_split(combine, test_size=0.2)
train['publisher'].value_counts()
test['publisher'].value_counts()

0    45
1    44
Name: publisher, dtype: int64
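
Rather than rerunning the split until the classes happen to balance, `train_test_split` accepts a `stratify` argument that preserves the class proportions in both parts. The sketch below uses a toy dataset of 100 labeled strings as a stand-in for the combined dataframe; the seed value is arbitrary.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy stand-in for the combined dataframe: 50 articles per publisher.
texts = [f"article {i}" for i in range(100)]
labels = [1] * 50 + [0] * 50

x_train, x_test, y_train, y_test = train_test_split(
    texts, labels,
    test_size=0.2,
    stratify=labels,   # keep the class proportions in both splits
    random_state=42,   # arbitrary seed so the split is repeatable
)
print(Counter(y_test))
```

For the notebook's data, the equivalent call would presumably pass `stratify=combine['publisher']`; fixing `random_state` also makes the reported accuracies reproducible from run to run.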

Next, I create a list of stop words. Removing stop words is important because common words like 'a' and 'the' are not really useful for the analysis. I also add other words that could distort the analysis. For example, I found that 'fox' often appears in Fox News articles. However, this word reveals the source directly and has nothing to do with the rhetoric regarding climate change; thus, it should not contribute to the model. The same goes for other words like 'getty', which simply appears in article photo credits.

In [5]:
from sklearn.feature_extraction import text 
list_stop=['___','getty','huffpost','huff','huffington','fox','https','click','facebook','twitter']
stop_words = text.ENGLISH_STOP_WORDS.union(list_stop)

# 3.1. Count Vectorizer 

Here, I create my array of variables using the count vectorizer.

In [6]:
cv=CountVectorizer(lowercase = True,
stop_words = stop_words,
min_df = 2,
ngram_range = (1,2)) 

In [7]:
x_traincv=cv.fit_transform(train['text'])
x_testcv=cv.transform(test['text'])

y_traincv=train['publisher']
y_traincv=y_traincv.astype('int')

y_testcv=test['publisher']
y_testcv=y_testcv.astype('int')

# 3.1.1 Multinomial Naive Bayes 

In [8]:
mnb=MultinomialNB()
mnb.fit(x_traincv,y_traincv)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

The results show that the multinomial naive bayes model using the count vectorizer array achieved nearly 98% accuracy. This, however, is on the training data, so it isn't so impressive. To truly test the model, I run it on the test data.

In [9]:
pred_mnbcv_train=mnb.predict(x_traincv)
print(accuracy_score(y_traincv, pred_mnbcv_train))

0.9774647887323944


We can see that the accuracy drops by about 11 percentage points (from 97.7% to 86.5%), but the predictive power is still quite good. The gap between training and test accuracy indicates some overfitting, although the test accuracy remains far above the roughly 50% expected from guessing on a balanced test set. Based on this model, I think it's fair to say that there is a difference in climate change rhetoric between Fox News and the Huffington Post.

In [10]:
predmnbcv_test=mnb.predict(x_testcv)
print(accuracy_score(y_testcv, predmnbcv_test))

0.8651685393258427


The confusion matrix is not so interesting in this case, as the numbers of correct classifications and misclassifications are similar for Fox News and the Huffington Post.

In [11]:
actual_values=np.array(y_testcv) 
actual_range = range(len(actual_values))

y_testcv.index=actual_range #need to reorder index so it matches with pred

actual = pd.Series(y_testcv, name='Actual')
prediction_mnbcv= pd.Series(mnb.predict(x_testcv), name='Predicted')
df_confusion_mnbcv = pd.crosstab(actual, prediction_mnbcv)
df_confusion_mnbcv

| Actual \ Predicted | 0 | 1 |
|---|---|---|
| 0 | 40 | 5 |
| 1 | 7 | 37 |
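
As a quick check, the per-class rates can be computed directly from the four counts in the confusion matrix above (rows are actual labels, columns are predictions, with Fox News as class 1). The variable names are chosen for this sketch only.

```python
# Counts read off the confusion matrix above (rows = actual, columns = predicted).
tn, fp = 40, 5   # actual Huffington Post (0)
fn, tp = 7, 37   # actual Fox News (1)

accuracy = (tp + tn) / (tp + tn + fp + fn)

# Precision: of the articles predicted as a class, how many really belong to it.
# Recall: of the articles that belong to a class, how many were found.
precision_fox = tp / (tp + fp)
recall_fox = tp / (tp + fn)
precision_huff = tn / (tn + fn)
recall_huff = tn / (tn + fp)

print(f"accuracy       {accuracy:.3f}")
print(f"Fox News       precision {precision_fox:.3f}  recall {recall_fox:.3f}")
print(f"Huffington Post precision {precision_huff:.3f}  recall {recall_huff:.3f}")
```

The accuracy reproduces the 0.865 reported for the test set, and precision and recall stay between roughly 0.84 and 0.89 for both publishers, which supports the observation that neither class is treated much worse than the other.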


Next, I explore how the language differs between the two sites by examining the coefficients of the model. Each coefficient is the log of the estimated probability of a feature given the positive class, which here is Fox News (labeled '1'). Log probabilities are always negative, so the sign alone is not informative; what matters is how low the value is. The features listed below all share the same minimum value of about -11.41, which is consistent with features that never occur in the Fox News training articles, where only the smoothing term (alpha=1.0) keeps the estimated probability above zero. Since min_df=2 requires every feature to appear in at least two training documents, such features would have to come from Huffington Post articles. In other words, an article containing 'commerce committee', for example, is less likely to be classified as a Fox News article. Because many features share this value, it is difficult to make a general statement about what they all mean with regard to climate change. However, if I pick out a few, such as 'overfishing', 'crisis threatens', 'plastic pollution', and 'coral bleaching', an argument can be made that they convey the adverse impacts of climate change. This makes sense, as we would expect the Huffington Post, a liberal media outlet, to push for climate change action.

In [12]:
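#log P(feature | Fox News): row 1 of feature_log_prob_ is the positive class
#(coef_ and get_feature_names() were removed in recent scikit-learn releases)
coeficients_mnbcv = pd.Series(mnb.feature_log_prob_[1],
index=cv.get_feature_names_out())
coeficients_mnbcv.sort_values()[:100]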

commerce committee       -11.409585
peddling                 -11.409585
pegged                   -11.409585
pegged number            -11.409585
peggy                    -11.409585
curb greenhouse          -11.409585
pelosi allies            -11.409585
pelosi plan              -11.409585
pelosi power             -11.409585
cultural                 -11.409585
cult claimed             -11.409585
cult                     -11.409585
pelosi said              -11.409585
cruz texas               -11.409585
pension                  -11.409585
crumbling                -11.409585
crown                    -11.409585
people care              -11.409585
current assessment       -11.409585
current climate          -11.409585
peddle                   -11.409585
peatlands                -11.409585
passionate               -11.409585
cutoff                   -11.409585
past century             -11.409585
past couple              -11.409585
past month               -11.409585
past months              -11
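
The long run of identical values in the output above follows from how additive (Laplace) smoothing works: every feature with a zero count in a class receives the same floor value, log(alpha / (N + alpha * V)), where N is the total feature count in that class and V is the vocabulary size. The sketch below reproduces that effect with made-up counts and also shows a log-odds ranking, which compares the two classes instead of looking at one in isolation. All counts and names here are illustrative assumptions.

```python
import math


def smoothed_log_probs(term_counts, alpha=1.0):
    """log P(term | class) with additive (Laplace) smoothing."""
    total = sum(term_counts.values())
    vocab_size = len(term_counts)
    return {
        term: math.log((count + alpha) / (total + alpha * vocab_size))
        for term, count in term_counts.items()
    }


# Made-up counts for illustration; not taken from the scraped data.
fox_counts = {"green new": 40, "taxpayers": 25, "overfishing": 0, "coral bleaching": 0}
huff_counts = {"green new": 15, "taxpayers": 5, "overfishing": 12, "coral bleaching": 9}

fox_lp = smoothed_log_probs(fox_counts)
huff_lp = smoothed_log_probs(huff_counts)

# Positive values lean Fox News, negative values lean Huffington Post.
log_odds = {term: fox_lp[term] - huff_lp[term] for term in fox_counts}
for term, value in sorted(log_odds.items(), key=lambda kv: kv[1]):
    print(f"{term:16s} {value:+.2f}")
```

With a fitted scikit-learn model, the same comparison could be obtained from `mnb.feature_log_prob_[1] - mnb.feature_log_prob_[0]`, which ranks features by how strongly they separate the two publishers.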

# 3.1.2 K Nearest Neighbor 

In this section, I implement the k nearest neighbor (knn) algorithm. 

In [13]:
knn_classifier = KNeighborsClassifier(n_neighbors = 3)

In [14]:
knn_classifier.fit(x_traincv, y_traincv)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=3, p=2,
           weights='uniform')

In [15]:
#on training data
knn_train_prediction_cv = knn_classifier.predict(x_traincv)
print(accuracy_score(y_traincv, knn_train_prediction_cv))

0.7746478873239436


In [16]:
train['prediction'] = knn_classifier.predict(x_traincv)
pd.crosstab(train['publisher'], train['prediction'])

| publisher \ prediction | 0 | 1 |
|---|---|---|
| 0 | 148 | 29 |
| 1 | 51 | 127 |


In [17]:
#on test data
knn_test_prediction_cv = knn_classifier.predict(x_testcv)
print(accuracy_score(y_testcv, knn_test_prediction_cv))

0.5955056179775281


In [18]:
test['prediction'] = knn_classifier.predict(x_testcv)
pd.crosstab(test['publisher'], test['prediction'])

| publisher \ prediction | 0 | 1 |
|---|---|---|
| 0 | 32 | 13 |
| 1 | 23 | 21 |


knn performs considerably worse than the naive bayes model, even after adjusting the number of neighbors. On the test data, its accuracy is about 60%, roughly 18 percentage points below its training accuracy and 27 points below naive bayes. The confusion matrix for the test data shows a large proportion of false negatives for Fox News articles. That is, the model incorrectly classified 23 of the 44 Fox News articles as Huffington Post articles.

# 3.1.3 Random Forest

For the last model in the count vectorized data, I use a random forest.

In [19]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(x_traincv, y_traincv)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [20]:
pred_rf=rf.predict(x_testcv)
pred_rf

array([1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1,
       1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1,
       0])

In [21]:
count=0
for i in range (len(pred_rf)):
    if pred_rf[i]==actual[i]:
        count=count+1
        
numerator=count
denominator=len(pred_rf)
numerator/denominator 

0.7865168539325843

In [22]:
feature_importances1 = pd.DataFrame(rf.feature_importances_,
                                   index = cv.get_feature_names_out(), #get_feature_names() was removed in scikit-learn 1.2
                                    columns=['importance']).sort_values('importance',ascending=False)
feature_importances1

| feature | importance |
|---|---|
| app | 0.019167 |
| president donald | 0.016822 |
| pic com | 0.013314 |
| pic | 0.012541 |
| news app | 0.011766 |
| green new | 0.011484 |
| new deal | 0.011263 |
| explained | 0.007918 |
| stories matter | 0.007577 |
| clear | 0.006179 |


The random forest model performs decently with an accuracy of around 78%. I also took a look at which features were most important in classifying the origin of an article. However, these features don't really make sense to me; several, such as 'app', 'pic com', and 'news app', look more like page boilerplate than climate rhetoric, so they don't provide much insight.
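
One possible explanation for features such as 'news app' and 'pic com' is the way bigrams are formed: the vectorizer drops stop words first and then pairs up whatever tokens remain, so removing 'fox' or 'twitter' can glue neighboring words together. The sketch below imitates that behavior with plain Python, using the same token pattern shown in the vectorizer output. The two input strings are hypothetical examples of page boilerplate, not text taken from the scraped articles, and the stop word sets are small stand-ins.

```python
import re

CUSTOM_STOP_WORDS = {"fox", "twitter", "click", "https"}  # subset of list_stop
GENERIC_STOP_WORDS = {"the", "to", "get", "here"}         # tiny stand-in for the English list


def unigrams_and_bigrams(text, stop_words):
    """Tokenize, drop stop words, then build bigrams from the surviving tokens."""
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    kept = [t for t in tokens if t not in stop_words]
    bigrams = [" ".join(pair) for pair in zip(kept, kept[1:])]
    return kept + bigrams


stop = CUSTOM_STOP_WORDS | GENERIC_STOP_WORDS
app_promo = unigrams_and_bigrams("CLICK HERE TO GET THE FOX NEWS APP", stop)
tweet_link = unigrams_and_bigrams("pic.twitter.com/abc123", stop)
print(app_promo)
print(tweet_link)
```

If boilerplate of this kind is present in the articles, stripping it before vectorizing (or adding 'app', 'pic', and 'com' to the stop list) might yield feature importances that say more about the rhetoric itself.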

# 3.2 TFIDF

In this section, we run the same three models, but this time on tfidf vectorized data.

In [23]:
#tfidf method
tf=TfidfVectorizer(min_df = 2,
stop_words=stop_words,
ngram_range = (1,2)) 

In [24]:
tf.fit(train['text'])

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=2,
        ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=frozenset({'inc', 'each', 'system', 'mill', 'sometime', 're', 'anyway', 'upon', 'have', 'third', 'their', 'others', 'any', 'take', 'eg', 'amoungst', 'every', 'see', 'seeming', 'up', 'nobody', 'hereby', 'therein', 'whose', 'almost', 'four', 'be', 'huffington', 'else', 'next', 'might', 'lat...omehow', 'towards', 'name', 'interest', 'he', 'became', 'to', 'fifteen', 'between', 'who', 'where'}),
        strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [25]:
x_traintf=tf.transform(train['text'])
x_testtf=tf.transform(test['text'])

y_traintf=train['publisher']
y_traintf=y_traintf.astype('int')

y_testtf=test['publisher']
y_testtf=y_testtf.astype('int')

# 3.2.1 Multinomial Naive Bayes 

This time, the multinomial naive bayes model is run on tfidf vectorized data.

In [26]:
mnb.fit(x_traintf,y_traintf)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [27]:
pred_mnbtf_train=mnb.predict(x_traintf)
print(accuracy_score(y_traintf, pred_mnbtf_train))

0.9577464788732394


In [40]:
pred_mnbtf_test=mnb.predict(x_testtf)
print(accuracy_score(y_testtf, pred_mnbtf_test))

0.8651685393258427


In [29]:
prediction_mnbtf = pd.Series(mnb.predict(x_testtf), name='Predicted')
df_confusion_mnbtf = pd.crosstab(actual, prediction_mnbtf)
df_confusion_mnbtf

Predicted,0,1
Actual,Unnamed: 1_level_1,Unnamed: 2_level_1
0,39,6
1,6,38


In [43]:
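#log P(feature | Fox News); coef_ and get_feature_names() were removed in recent scikit-learn releases
coeficients_mnbtf = pd.Series(mnb.feature_log_prob_[1],
index=tf.get_feature_names_out())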

The results show that the model performs similarly to the last in terms of accuracy. Furthermore, many of the same words have large coefficients.

# 3.2.2 K Nearest Neighbor

In [31]:
knn_classifier_tf = KNeighborsClassifier(n_neighbors = 3)

In [32]:
knn_classifier_tf.fit(x_traintf, y_traintf)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=3, p=2,
           weights='uniform')

In [33]:
knn_train_prediction_tf = knn_classifier_tf.predict(x_traintf)
print(accuracy_score(y_traintf, knn_train_prediction_tf))

0.8873239436619719


In [34]:
knn_test_prediction_tf = knn_classifier_tf.predict(x_testtf)
print(accuracy_score(y_testtf, knn_test_prediction_tf))

0.8089887640449438


This time, we see that the knn model performs much better than in the previous case: test accuracy has increased by roughly 21 percentage points (from about 60% to 81%). However, this model still underperforms compared to the naive bayes. Again, there are many false negatives in which Fox News articles are being classified as Huffington Post.
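
A plausible reason for the improvement, though not one verified on the scraped data, is that `TfidfVectorizer` rescales every document to unit length (`norm='l2'` in the output above), whereas raw counts grow with article length. Since knn here relies on Euclidean distance, two articles with the same word mix but different lengths look far apart on raw counts and identical after normalization. The vectors below are made up to illustrate this.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def l2_normalize(v):
    """Scale a vector so that its Euclidean length is 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


# Made-up term counts over a three-word vocabulary.
short_article = [2, 1, 0]
long_article = [20, 10, 0]   # same word proportions, ten times longer
other_article = [0, 1, 2]    # different word mix, similar length to the short one

raw_same = euclidean(short_article, long_article)
raw_other = euclidean(short_article, other_article)
norm_same = euclidean(l2_normalize(short_article), l2_normalize(long_article))
norm_other = euclidean(l2_normalize(short_article), l2_normalize(other_article))

print(f"raw counts:  same mix {raw_same:.2f}, different mix {raw_other:.2f}")
print(f"normalized:  same mix {norm_same:.2f}, different mix {norm_other:.2f}")
```

On raw counts the short article's nearest neighbor is the one with a different word mix; after normalization the article with the same mix is the closest, which is the behavior a text classifier based on distances needs.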

# 3.2.3 Random Forest

Lastly, the random forest is fit a second time, now on the tfidf vectorized data.

In [35]:
from sklearn.ensemble import RandomForestClassifier
rf.fit(x_traintf, y_traintf)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [36]:
pred_rf_tf=rf.predict(x_testtf)

In [37]:
count=0
for i in range (len(pred_rf_tf)):
    if pred_rf_tf[i]==actual[i]:
        count=count+1
        
numerator=count
denominator=len(pred_rf_tf)
numerator/denominator 

0.7752808988764045

In [44]:
feature_importances2 = pd.DataFrame(rf.feature_importances_,
                                   index = tf.get_feature_names_out(), #get_feature_names() was removed in scikit-learn 1.2
                                    columns=['importance']).sort_values('importance',ascending=False)

The random forest model performs similarly to the previous version.

# 4. Conclusion

Overall, my results show that Fox News and the Huffington Post use different rhetoric regarding climate change. When models were applied to tfidf vectorized data, they achieved an average test accuracy of approximately 82%. This is pretty good, especially when considering that the most successful model (multinomial naive bayes) predicted nearly 87% of test articles correctly.
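
The averages can be recomputed from the test accuracies printed in Section 3. The short sketch below does so for both feature sets; the dictionary names are chosen for this example only.

```python
# Test accuracies as printed in Sections 3.1 and 3.2.
count_accuracy = {
    "multinomial naive bayes": 0.8651685393258427,
    "k nearest neighbors": 0.5955056179775281,
    "random forest": 0.7865168539325843,
}
tfidf_accuracy = {
    "multinomial naive bayes": 0.8651685393258427,
    "k nearest neighbors": 0.8089887640449438,
    "random forest": 0.7752808988764045,
}

mean_count = sum(count_accuracy.values()) / len(count_accuracy)
mean_tfidf = sum(tfidf_accuracy.values()) / len(tfidf_accuracy)

print(f"count vectorizer mean accuracy: {mean_count:.3f}")
print(f"tfidf vectorizer mean accuracy: {mean_tfidf:.3f}")
```

The tfidf average comes out near 0.82 and the count vectorizer average near 0.75, with the difference driven almost entirely by the k nearest neighbors model.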

On the other hand, it is a little more difficult to pinpoint how exactly the rhetoric differs. Using the entire text of articles allowed me to leverage more data and build stronger models, but interpretation is a tall order due to the sheer number of features. From what I did find, however, it seems that the Huffington Post tends to focus more than Fox News on the negative impacts that climate change has on the environment. This aligns with what previous research has shown.

The takeaway from this study is that there still exists a considerable divide between the left and right as it pertains to climate change. This argument is strengthened when combined with the results from my previous study. Ultimately, I believe both sides need to align in order to see significant progress in tackling climate change.

# 5. References

DellaVigna, Stefano, and Ethan Kaplan. “The Fox News Effect: Media Bias and Voting.” OUP Academic, 1 Aug. 2007, https://www.nber.org/papers/w12169.pdf.

Feldman, Lauren, et al. “Climate on Cable: The Nature and Impact of Global Warming Coverage on Fox News, CNN, and MSNBC.” SAGE Journals, 2 Nov. 2011, journals.sagepub.com/doi/abs/10.1177/1940161211425410.

Jones, J. M. “Conservatives’ Doubts About Global Warming Grow.” Gallup Poll, 2010, http://www.gallup.com/poll/126563/conservatives-doubts-global-warming-grow.aspx. Accessed 3 Aug. 2014.

Kreutzer, David. “The State of Climate Science: No Justification for Extreme Policies.” The Heritage Foundation, 22 Apr. 2016, www.heritage.org/environment/report/the-state-climate-science-no-justification-extreme-policies.

Sims, Ralph. “Renewable Energy: A Response to Climate Change.” Solar Energy, Pergamon, 24 Apr. 2003, www.sciencedirect.com/science/article/pii/S0038092X03001014.

University of Michigan. “‘Fake News,’ Lies and Propaganda: How to Sort Fact from Fiction.” Research Guides, https://guides.lib.umich.edu/c.php?g=637508&p=4462444.