# Final Project Draft

Our current president loves to talk about climate change. While President Trump's tweets have lately been more focused on what one might consider to be more pressing matters for his administration (No collusion!), Trump has publicly expressed his opinions on climate science no fewer than 145 times in the past 6 years.

If you take a look at the content of these tweets, you'll find two key themes.

__1. Climate change / global warming is a conspiracy.__ President Trump is well known for his stance on climate change, notably backing out of the Paris Climate Accord early in his first term. Public opinion on this message is divisive--President Trump's supporters seem to echo this sentiment, and his opponents often use it as fodder for political attacks.

## Notes for what to do going forward:

1. Separate notebooks into two separate chunks (introduction / interest trends)
2. Exploring text data
3. Building classifier

For sections 1 and 2, provide clearer discussion and annotation for functions.

Add headings for different subsections, provide justification for each question or subquestion that is asked.

Make sure the introduction isn't too politically motivated (try to be as objective as possible)

Note where I got code chunks for tweets, etc.

Assemble corpus of text from titles

- Clean up corpus (follow steps from current homework)
- Construct word clouds for different labels
- Remove labels and produce classifier that predicts label used
- Also check whether the specific subreddit matters (perhaps look at whether you can predict whether text came from climatechange or climateskeptics).
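
The label-removal step in the list above could be sketched roughly as follows. This is a minimal sketch, not the project's final preprocessing: the helper name `mask_labels`, the `LABEL` placeholder, and the example titles are all hypothetical, and it assumes the two labels are exactly the phrases "climate change" and "global warming".

```python
import re

# Assumed label phrases and the short codes used to record which one was found
LABELS = {"climate change": "cc", "global warming": "gw"}
LABEL_PATTERN = re.compile(r"\b(climate\s+change|global\s+warming)\b", re.IGNORECASE)

def mask_labels(title, placeholder="LABEL"):
    """Return (masked_title, labels_found) for one post title (hypothetical helper)."""
    found = []

    def _swap(match):
        # Normalize case and internal whitespace before looking up the label code
        key = re.sub(r"\s+", " ", match.group(1).lower())
        found.append(LABELS[key])
        return placeholder

    masked = LABEL_PATTERN.sub(_swap, title)
    return masked, found

# Made-up titles, for illustration only
titles = ["Is Global Warming real?", "New climate  change report released"]
for t in titles:
    print(mask_labels(t))
```

Replacing the phrase with a placeholder, rather than deleting it, keeps the rest of the title intact, so a classifier would have to rely on the surrounding words instead of the label itself.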


In [1]:
class Tweet(object):
    def __init__(self, embed_str=None):
        self.embed_str = embed_str
    def _repr_html_(self):
        return self.embed_str
    
address = ("""
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">
The concept of global warming was created by and for the Chinese in order to make U.S. 
manufacturing non-competitive.</p>&mdash; Donald J. Trump (@realDonaldTrump) 
<a href="https://twitter.com/realDonaldTrump/status/265895292191248385?ref_src=twsrc%5Etfw">November 6, 2012</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
""")

Tweet(address)

__2. Current temperatures provide evidence for this claim.__ President Trump also likes to draw attention to cold weather patterns, using them to justify his attacks on climate science.

In [2]:
address = ("""
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">
In the East, it could be the COLDEST New Year’s Eve on record. 
Perhaps we could use a little bit of that good old Global Warming that our Country, 
but not other countries, was going to pay TRILLIONS OF DOLLARS to protect against. Bundle up!
</p>&mdash; Donald J. Trump (@realDonaldTrump) 
<a href="https://twitter.com/realDonaldTrump/status/946531657229701120?ref_src=twsrc%5Etfw">
December 29, 2017</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
""")

Tweet(address)

As an ecologist, I find this pattern of behavior troubling -- while I welcome any discussion of how best to address climate change through legislation or commercial activity, climate science is subject to rigorous peer review, and the underlying causes of climate change are well understood.

One might describe President Trump's actions as a form of recency bias or "ecological amnesia", a problem that many of us in environmental science grapple with when communicating research. Many ecological phenomena occur over long timescales that are slower to change than our perceptions of what is "normal". Climate trends, for example, can only be detected with 100+ year observations, rather than the short periods that form our frame of reference. As humans, it's much easier for us to focus on outliers than to detect a slow, gradual change.

Case in point: many of President Trump's tweets on climate change occur in winter months when temperatures *feel* cold, even if they fall within normal ranges of variation.

He also likes to use the term "global warming", which has fallen out of use among scientists, in large part because it poorly represents the effects of climate change. True, the climate is getting warmer on average, but experts often place greater emphasis on increased variability in temperature and precipitation.

Together, recency bias and poor nomenclature may form a substantial barrier to public understanding of global climate change. In this project, I aim to explore how prevalent these same patterns are in the general public.

__Hypotheses:__

1. US Google searches for the terms "climate change" and "global warming" will peak annually in winter months.

2. Use of the terms "climate change" and "global warming" correlates with political affiliation. User activity in social media forums more closely linked to the current presidential administration will use the term "global warming", while communities with stronger ties to Democratic politics will use the term "climate change".
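
One way Hypothesis 1 might be checked is to average search interest by calendar month and see which months sit above the overall mean. The sketch below is a minimal illustration under stated assumptions: the data are synthetic (invented values, not Google Trends output), and the column names `date_num` and `total` simply mirror those used later in this notebook. With real data, a long-term trend would likely need to be removed first.

```python
import pandas as pd

# Synthetic monthly values, invented for illustration (not Google Trends data):
# interest is 60 in December-February and 40 in every other month, for two years
dates = pd.date_range("2010-01-01", periods=24, freq="MS")
interest = [60 if d.month in (12, 1, 2) else 40 for d in dates]
demo = pd.DataFrame({"date_num": dates, "total": interest})

# Average interest by calendar month (1 = January, ..., 12 = December)
monthly_mean = demo.groupby(demo["date_num"].dt.month)["total"].mean()

# Months whose average exceeds the overall mean
peak_months = sorted(monthly_mean[monthly_mean > demo["total"].mean()].index)
print(peak_months)
```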


In [3]:
import pandas as pd
import numpy as np
import datetime
import statsmodels
import plotly
import plotly.plotly as py
from plotly.graph_objs import *
import statsmodels.api as sm
import sqlite3
import sqlalchemy

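# Read the Plotly API key from the environment rather than hard-coding it in the notebook
# (the variable name PLOTLY_API_KEY is an assumption; use whatever your setup defines)
import os
plotly.tools.set_credentials_file(username='ebatzer', api_key=os.environ['PLOTLY_API_KEY'])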

%matplotlib inline

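# Load monthly Google Trends interest for both search terms
gtrends = pd.read_csv("googletrends.csv")
gtrends.columns = ["month", "climate change", "global warming"]
# Append a day so each month parses as a full date (first of the month)
gtrends["month"] = gtrends["month"] + "/01"
gtrends["date_num"] = pd.to_datetime(gtrends["month"], yearfirst = True)
# Combine both terms, then rescale so the maximum combined interest is 100
gtrends["total"] = gtrends["climate change"] + gtrends["global warming"]
gtrends["total"] = (gtrends["total"] / gtrends["total"].max()) * 100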

# Create traces
data = Data([
    Scatter(
        y = gtrends["global warming"],
        x = gtrends["date_num"],
        line=Line(
            color = "red"
        ),
        mode='lines',
        name = 'Global Warming',
        showlegend = True),
    Scatter(
        y = gtrends["climate change"],
        x = gtrends["date_num"],
        line=Line(
            color = "blue"
        ),
        mode='lines',
        name = 'Climate Change',
        showlegend = True)
])

layout = Layout(
    title='Google Searches for "Climate Change" and "Global Warming"',
    xaxis=dict(
        title='Date',
        titlefont=dict(
            size=18,
            color='#7f7f7f'
        ),
        showgrid=False
    ),
    yaxis=dict(
        title='Frequency',
        titlefont=dict(
            size=18,
            color='#7f7f7f'
        ),
        showgrid=False
    )
)

fig = Figure(data = data, layout = layout)
py.iplot(fig)



In [4]:
weather = pd.read_csv("NOAAweatherdata.csv", skiprows = 4)
# "Date" is stored as a YYYYMM integer; split it into month and year
weather["Month"] = weather["Date"] % 100
weather["Year"] = weather["Date"].apply(str).apply(lambda x: x[:4])
weather["Date"] = weather["Year"] + "/" + weather["Month"].apply(str) + "/01"
weather["date_num"] = pd.to_datetime(weather["Date"], yearfirst = True)
# Join search interest to temperature on the shared monthly timestamp
gtrends = pd.merge(gtrends, weather, how = "inner", on = "date_num")

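# Fit a LOWESS smoother to total search interest over time to capture the long-term trend
lowess = sm.nonparametric.lowess
ys = lowess( gtrends["total"], pd.to_numeric(gtrends["date_num"]), frac = .3)[:,1]
# Note: despite the column name, 'polyfit' holds the LOWESS fit, not a polynomial fit
gtrends['polyfit'] = ys
# Residuals are the deviations of observed interest from the smoothed trend
gtrends['resid'] = gtrends["total"] - gtrends['polyfit']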

# Create traces
data = Data([
    Scatter(
        y = gtrends["total"],
        x = gtrends["date_num"],
        marker=Marker(
            size=12,
            cmax=80,
            cmin=30,
            color= gtrends["Value"],
            colorbar=ColorBar(
                title='Mean US Temperature'
            ),
            colorscale='Viridis'
        ),
        mode='lines+markers',
        showlegend = False),
    Scatter(
        y= gtrends["polyfit"],
        x = gtrends["date_num"],
        line=Line(
            color= "black"),
        mode='lines',
        showlegend = False),
])

layout = Layout(
    title='Google Searches for "Climate Change" and "Global Warming"',
    xaxis=dict(
        title='Date',
        titlefont=dict(
            size=18,
            color='#7f7f7f'),
        showgrid=False
    ),
    yaxis=dict(
        title='Frequency',
        titlefont=dict(
            size=18,
            color='#7f7f7f'),
        showgrid=False
    )
)

fig = Figure(data = data, layout = layout)
py.iplot(fig)

Some thoughts on why there is a peak in searches ca. 2007 -

- "An Inconvenient Truth" released in 2006 - wins Academy Award for best documentary feature (February 2007), Al Gore wins Nobel Peace Prize (October 2007).
- IPCC report published in 2007 (fourth since 1990, most substantial and comprehensive undertaken)
- Skeptical science - Global warming stopped in 2007? Cold climate conditions across the US (also in 2002, 2010)


In [187]:
gore_lines = ["2006-05-24","2007-02-26","2007-10-12"]
labels = ['"An Inconvenient Truth" Released', 'AIT Wins Best Documentary', 'Al Gore and IPCC win Nobel']

In [208]:
# Create traces
data = Data([
            Scatter(
                y = gtrends["total"],
                x = gtrends["date_num"],
                marker=Marker(
                    size=12,
                    cmax=80,
                    cmin=30,
                    color= gtrends["Value"],
                    colorbar=ColorBar(
                        title='Mean US Temperature'
                    ),
                    colorscale='Viridis'
                ),
                mode='lines+markers',
                showlegend = False),
            Scatter(
                y= gtrends["polyfit"],
                x = gtrends["date_num"],
                line=Line(
                    color= "black"),
                mode='lines',
                showlegend = False),
            Scatter(
                y = [0, 100],
                x = [gore_lines[0], gore_lines[0]],
                mode = "lines",
                line = Line(color = "red"),
                showlegend = False),
            Scatter(
                y = [0, 100],
                x = [gore_lines[1], gore_lines[1]],
                mode = "lines",
                line = Line(color = "red"),
                showlegend = False),
            Scatter(
                y = [0, 100],
                x = [gore_lines[2], gore_lines[2]],
                mode = "lines",
                line = Line(color = "red"),
                showlegend = False)
])

annotations = []

for label, xval, yadj in zip(labels, gore_lines, [0, 20, 40]):
    annotations.append(dict( x=xval, y=100, ay = -yadj,
                                      xanchor='right', yanchor='middle',
                                      text=label,
                                      font=dict(family='Arial',
                                                size=16),
                                      showarrow=True))
    
layout["annotations"] = annotations
fig = Figure(data = data, layout = layout)
py.iplot(fig)

In [5]:
f = np.poly1d(np.polyfit(gtrends["Value"], gtrends["resid"], 2))
predseq = np.linspace(30,80).tolist()
quadfit = f(predseq)

# Create traces
data = Data([
    Scatter(
        y= gtrends["resid"],
        x= gtrends["Value"],
        marker=Marker(
            size=12,
            cmax=80,
            cmin=0,
            color=  gtrends["Value"],
            colorbar=ColorBar(
                title='Temperature (F)'
            ),
            colorscale='Viridis'
        ),
        mode='markers',
        showlegend = False),
    Scatter(
        y = quadfit,
        x = predseq,
        line=Line(
            color= "black"),
        mode='lines',
        showlegend = False)
])


layout = Layout(
    title='LOWESS Residuals',
    xaxis=dict(
        title='Mean US Temperature',
        titlefont=dict(
            size=18,
            color='#7f7f7f'
        ),
        showgrid=False
    ),
    yaxis=dict(
        title='Residuals',
        titlefont=dict(
            size=18,
            color='#7f7f7f'
        ),
        showgrid=False
    )
)

fig = Figure(data = data, layout = layout)
py.iplot(fig)

In [6]:
import requests
import requests_cache
import time

requests_cache.install_cache('reddit_cache')

url = "https://api.pushshift.io/reddit/search/submission/"

# Sets dates - will run for all days between start and end
dstart = datetime.date(2008, 1, 1)
dend = datetime.date(2018, 1, 1)
steps = (dend - dstart).days

# Sets the start and end of the search window as Unix epoch seconds (UTC).
# calendar.timegm treats the parsed time as UTC; time.mktime would apply the local time zone.
import calendar
t0 = calendar.timegm(time.strptime("2008/01/01 00:00:00", "%Y/%m/%d %H:%M:%S"))
t1 = calendar.timegm(time.strptime("2018/01/01 00:00:00", "%Y/%m/%d %H:%M:%S"))

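# Define the search subs function:
def search_subs(url, searchterm, sub, size, after, before):
    """Query the PushShift submission search and return one batch of results.

    Returns a tuple of (DataFrame of submissions, created_utc of the last row),
    so the caller can resume the search from that timestamp.
    """
    # Requests PushShift API
    req = requests.get(url, params = {"title" : searchterm,
                                     "after": after,
                                     "before": before,
                                     "subreddit": sub,
                                     "size": size})
    # Stop on an HTTP error instead of failing later while parsing the body
    req.raise_for_status()

    # Selects data element
    subdata = pd.DataFrame(req.json()['data'])

    # Adds identifying columns (search window as local-time strings, plus the keyword)
    subdata["searchstart"] = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(after))
    subdata["searchend"] = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(before))
    subdata["keyword"] = searchterm

    return subdata, subdata["created_utc"].iloc[-1]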

In [7]:
engine = sqlalchemy.create_engine('sqlite:///subs.db')

# search_utc = int(t0)
# outputdf = dict()
# count = 1

# while search_utc < int(t1) :
#     subs, search_utc = search_subs( 
#             url = url, 
#             searchterm = "climate change", 
#             sub = None, 
#             after = search_utc,
#             before = None,
#             size = 500)
    
#     cols = ["author", "created_utc", "domain", "id", "num_comments",
#            "permalink", "score", "subreddit", "subreddit_id",
#            "title", "url", "searchstart", "searchend", "keyword"]
#     subs = subs[cols]
#     subs.to_sql("cctable", engine, if_exists = "append")
    
#     search_utc = search_utc + 1

#     if count % 25 == 0:
#         print("Current time is %s" % time.strftime('%Y-%m-%d %H:%M:%S',
#                                                    time.localtime(search_utc)))
        
#     count = count + 1
      
#     # Sleep to avoid over-requesting
#     time.sleep(1)

In [8]:
# search_utc = int(t0)
# outputdf = dict()
# count = 1

# while search_utc < int(t1) :
#     subs, search_utc = search_subs( 
#             url = url, 
#             searchterm = "global warming", 
#             sub = None, 
#             after = search_utc,
#             before = None,
#             size = 500)
    
#     cols = ["author", "created_utc", "domain", "id", "num_comments",
#            "permalink", "score", "subreddit", "subreddit_id",
#            "title", "url", "searchstart", "searchend", "keyword"]
#     subs = subs[cols]
#     subs.to_sql("gwtable", engine, if_exists = "append")
    
#     search_utc = search_utc + 1

#     if count % 25 == 0:
#         print("Current time is %s" % time.strftime('%Y-%m-%d %H:%M:%S',
#                                                    time.localtime(search_utc)))
        
#     count = count + 1
      
#     # Sleep to avoid over-requesting
#     time.sleep(1)
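
The paging logic in the commented-out loops above (request a batch, read the newest `created_utc`, then resume the search from there) can be illustrated without touching the network. This is a minimal sketch, assuming a hypothetical `fetch_batch` stand-in for `search_subs` and made-up timestamps. It also stops on an empty batch, which is an added guard that the loops above do not have.

```python
def fetch_batch(posts, after, size):
    """Hypothetical stand-in for an API call: up to `size` timestamps newer than `after`."""
    newer = sorted(p for p in posts if p > after)
    return newer[:size]

def collect_all(posts, start, end, size=2):
    """Page through timestamps between start and end in batches of `size`."""
    collected = []
    cursor = start
    while cursor < end:
        batch = fetch_batch(posts, cursor, size)
        if not batch:
            # Nothing left to fetch: stop instead of indexing an empty batch
            break
        # Keep only timestamps that fall before the end of the window
        collected.extend(t for t in batch if t < end)
        # Resume after the newest timestamp seen so far
        cursor = batch[-1]
    return collected

# Made-up epoch seconds, for illustration only
fake_posts = [105, 101, 110, 120, 115, 130]
print(collect_all(fake_posts, start=100, end=125))
```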

In [89]:
gwcounts = pd.read_sql("""SELECT subreddit, COUNT(subreddit) 
FROM gwtable
GROUP BY subreddit
ORDER BY COUNT(subreddit) DESC
""",
               con = engine)

cccounts = pd.read_sql("""SELECT subreddit, COUNT(subreddit) 
FROM cctable
GROUP BY subreddit
ORDER BY COUNT(subreddit) DESC
""",
               con = engine)

authors = pd.read_sql("""SELECT subreddit, COUNT(DISTINCT author) 
FROM gwtable
GROUP BY subreddit
ORDER BY COUNT(DISTINCT author) DESC
""",
               con = engine)

#################################
# Some basic summary statistics #
#################################

gwcounts.columns = ['subreddit', 'gw_counts']
cccounts.columns = ['subreddit', 'cc_counts']
authors.columns = ['subreddit', 'authors']

keywords = pd.merge(gwcounts, cccounts)
keywords = pd.merge(keywords, authors)

keywords["total"] = keywords["cc_counts"] + keywords["gw_counts"]
keywords["gw_frac"] = keywords["gw_counts"] / keywords["total"]
keywords["cc_frac"] = keywords["cc_counts"] / keywords["total"]
keywords["aut_frac"] = keywords["authors"] / keywords["total"]

Which subreddits talk the most about climate change and global warming?

In [152]:
trace1 = Scatter(
    x=keywords.sort_values("total", ascending = False)["subreddit"],
    y=np.log(keywords.sort_values("total", ascending = False)["total"])
)

trace2 = Scatter(
    x=keywords.sort_values("cc_counts", ascending = False)["subreddit"],
    y=np.log(keywords.sort_values("cc_counts", ascending = False)["cc_counts"])
)

trace3 = Scatter(
    x=keywords.sort_values("gw_counts", ascending = False)["subreddit"],
    y=np.log(keywords.sort_values("gw_counts", ascending = False)["gw_counts"])
)

fig = plotly.tools.make_subplots(rows=1, cols=3)

fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 2)
fig.append_trace(trace3, 1, 3)


fig['layout'].update(height=600, width=600, title='Total keyword mentions (log scale)')
py.iplot(fig, filename='simple-subplot')
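
Taking `np.log` of a count of zero yields `-inf` and a runtime warning. One possible source here (an assumption, not verified against `subs.db`) is a group of rows with a NULL subreddit, for which `COUNT(subreddit)` is 0. A minimal sketch of filtering such rows before the log transform, using a made-up table:

```python
import numpy as np
import pandas as pd

# Made-up counts for illustration; the None row mimics a NULL-subreddit group
demo = pd.DataFrame({
    "subreddit": ["a", "b", None],
    "total": [100, 10, 0],
})

# Keep only rows with a positive count so the logarithm is finite
positive = demo[demo["total"] > 0]
log_total = np.log(positive["total"])

print(np.isfinite(log_total).all())
```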





In [158]:
# Create traces
data = Data([
    Scatter(
        y= np.log(keywords.sort_values("total", ascending = False)["total"]),
        x= np.log(keywords.sort_values("total", ascending = False)["authors"]),
        marker=Marker(
            size= np.log(keywords.sort_values("total", ascending = False)["total"]) * 2,
            cmax=1,
            cmin=0,
            color= keywords.sort_values("total", ascending = False)["cc_frac"],
            colorbar=ColorBar(
                title='Climate Change Fraction'
            ),
            colorscale='Viridis'
        ),
        mode='markers',
        name='Markers',
        text=keywords.sort_values("total", ascending = False)["subreddit"],
        textposition='bottom',
        showlegend = False)
])


layout = Layout(
    title='Activity vs. Unique Number of Authors',
    xaxis=dict(
        title='Total Unique Authors (log scale)',
        titlefont=dict(
            size=18,
            color='#7f7f7f'
        ),
        showgrid=False
    ),
    yaxis=dict(
        title='Total Number of Posts (log scale)',
        titlefont=dict(
            size=18,
            color='#7f7f7f'
        ),
        showgrid=False
    )
)

fig = Figure(data = data, layout = layout)
py.iplot(fig)





In [153]:
keywords.sort_values("total", ascending = False).iloc[:20,]

| | subreddit | gw_counts | cc_counts | authors | total | gw_frac | cc_frac | aut_frac |
|---|---|---|---|---|---|---|---|---|
| 0 | environment | 6876 | 22549 | 2109 | 29425 | 0.233679 | 0.766321 | 0.071674 |
| 12 | EcoInternet | 1586 | 17293 | 2 | 18879 | 0.084009 | 0.915991 | 0.000106 |
| 2 | politics | 4343 | 12488 | 1800 | 16831 | 0.258036 | 0.741964 | 0.106946 |
| 7 | climate | 2357 | 9978 | 469 | 12335 | 0.191082 | 0.808918 | 0.038022 |
| 3 | reddit.com | 4148 | 7340 | 2316 | 11488 | 0.361072 | 0.638928 | 0.201602 |
| 4 | science | 4047 | 6955 | 1981 | 11002 | 0.367842 | 0.632158 | 0.180058 |
| 1 | climateskeptics | 4374 | 4626 | 526 | 9000 | 0.486 | 0.514 | 0.058444 |
| 10 | worldnews | 2046 | 6026 | 960 | 8072 | 0.253469 | 0.746531 | 0.11893 |
| 8 | POLITIC | 2072 | 4916 | 36 | 6988 | 0.296508 | 0.703492 | 0.005152 |
| 6 | AskReddit | 2811 | 4107 | 1784 | 6918 | 0.406331 | 0.593669 | 0.257878 |


In [121]:
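# Subreddits with more than 2000 posts and a non-trivial share of distinct authors,
# ranked by the fraction of posts that use "climate change"
cctop5 = (keywords[(keywords["total"] > 2000) & (keywords["aut_frac"] > .01)]
          .sort_values("cc_frac", ascending = False)
          .iloc[:5,])
cctop5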

| | subreddit | gw_counts | cc_counts | authors | total | gw_frac | cc_frac | aut_frac |
|---|---|---|---|---|---|---|---|---|
| 31 | climatechange | 383 | 1884 | 190 | 2267 | 0.168946 | 0.831054 | 0.083811 |
| 7 | climate | 2357 | 9978 | 469 | 12335 | 0.191082 | 0.808918 | 0.038022 |
| 27 | collapse | 512 | 1904 | 172 | 2416 | 0.211921 | 0.788079 | 0.071192 |
| 0 | environment | 6876 | 22549 | 2109 | 29425 | 0.233679 | 0.766321 | 0.071674 |
| 10 | worldnews | 2046 | 6026 | 960 | 8072 | 0.253469 | 0.746531 | 0.11893 |


In [122]:
# Same filter as above, ranked by the fraction of posts that use "global warming"
gwtop5 = (keywords[(keywords["total"] > 2000) & (keywords["aut_frac"] > .01)]
          .sort_values("gw_frac", ascending = False)
          .iloc[:5,])
gwtop5

| | subreddit | gw_counts | cc_counts | authors | total | gw_frac | cc_frac | aut_frac |
|---|---|---|---|---|---|---|---|---|
| 11 | Showerthoughts | 1897 | 1644 | 1340 | 3541 | 0.535724 | 0.464276 | 0.378424 |
| 1 | climateskeptics | 4374 | 4626 | 526 | 9000 | 0.486 | 0.514 | 0.058444 |
| 15 | conspiracy | 1338 | 1462 | 512 | 2800 | 0.477857 | 0.522143 | 0.182857 |
| 5 | askscience | 2887 | 3190 | 1958 | 6077 | 0.47507 | 0.52493 | 0.322198 |
| 16 | explainlikeimfive | 1090 | 1226 | 785 | 2316 | 0.470639 | 0.529361 | 0.338946 |


Where do these posts come from?

In [157]:
domaincounts = pd.read_sql("""SELECT domain, COUNT(domain) 
FROM gwtable
GROUP BY domain
ORDER BY COUNT(domain) DESC
""",
               con = engine)

# Drop self posts and Reddit/Imgur-hosted links; regex = False makes "." a literal dot
hosted = ["reddit.com", "imgur.com", "i.imgur.com", "i.redd.it"]
news = domaincounts[~domaincounts["domain"].str.contains("self.", regex = False) &
                    ~domaincounts["domain"].isin(hosted)]

news.head(10)

| | domain | COUNT(domain) |
|---|---|---|
| 0 | youtube.com | 2980 |
| 4 | theguardian.com | 1910 |
| 10 | dailycaller.com | 1053 |
| 11 | nytimes.com | 1024 |
| 12 | washingtonpost.com | 939 |
| 13 | independent.co.uk | 905 |
| 15 | wattsupwiththat.com | 830 |
| 16 | dailymail.co.uk | 725 |
| 17 | thinkprogress.org | 664 |
| 18 | scientificamerican.com | 648 |
