# Web request for words similar to "Climate change"

In [1]:
# Import libraries
import requests
from bs4 import BeautifulSoup
import json
import nltk
from nltk.stem import LancasterStemmer
import pandas as pd
import numpy as np
import re

We now query for words related to "climate change". The website we query uses multiple algorithms to extract the words most similar to "climate change". The further down the list a word appears, the less similar it is.

In [2]:
# Make the request
r = requests.get('https://relatedwords.org/relatedto/climate%20change')

# Parse content
soup = BeautifulSoup(r.text, 'html.parser')

# Extract words with score and source
for words in soup.find_all('script', attrs={"id" : "preloadedDataEl", "type" : "text/json"}):
    words_json_format = json.loads(words.string)
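
The extraction step above can be sketched as a small function that fails gracefully when the embedded JSON is missing (for example if the page layout changes). This is a minimal sketch, not the site's documented API: the `extract_terms` helper and the sample HTML are hypothetical, and the page structure (`preloadedDataEl` script tag holding a `terms` list) is assumed from the cell above.

```python
import json
from bs4 import BeautifulSoup


def extract_terms(html):
    """Return the list of term dicts embedded in the page, or [] if absent."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("script", attrs={"id": "preloadedDataEl", "type": "text/json"})
    if tag is None or not tag.string:
        return []
    return json.loads(tag.string).get("terms", [])


# Hypothetical miniature page mimicking the structure the notebook relies on
sample_html = """
<html><body>
<script id="preloadedDataEl" type="text/json">
{"terms": [{"word": "global warming", "score": 9.5},
           {"word": "greenhouse gas", "score": 7.1}]}
</script>
</body></html>
"""

terms = extract_terms(sample_html)
words = [term["word"] for term in terms]
```

With a real request, `extract_terms(r.text)` would play the role of the loop above, ideally after checking `r.status_code` or calling `r.raise_for_status()`.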

Now that we have extracted what we wanted, we will transform the content into something we can use for analysis.

In [3]:
# Transform into a Python list of (word, score) pairs, plus a list of the words alone
words_with_score = []
words_only = []

for term in words_json_format['terms']:
    words_with_score.append((term['word'], term['score']))
    words_only.append(term['word'])

# Stemming

In [5]:
# Stemming
def stem_list(words_list):
    stemmed_words = []
    lancaster = LancasterStemmer()

    for word in words_list:  # iterate over the argument, not the global words_only
        stemmed_words.append(lancaster.stem(word))
        
    return stemmed_words

# Extracting and stemming data from dataset

In [6]:
# Load data
df_reader = pd.read_json('data/quotebank/quotes-2020.json.bz2', lines=True, compression='bz2', chunksize =100000)

In [7]:
# Setup set of words we're using for searching
stemmed_words = stem_list(words_only)
stemmed_words = set(stemmed_words[:10]) #Only use 10 first words for now

### Extracting quotes related to the topic
We process the dataset in chunks, because loading the whole dataset at once would require a very large amount of memory.

In [8]:
%%time

climate_like_quotes = []

# Regex alternation of the 7 most similar words
reg_query = "|".join(words_only[:7])

for chunk in df_reader:
    df = chunk[chunk.quotation.str.contains(reg_query, case=False, na=False)]
    climate_like_quotes.append(df)

CPU times: user 5min 20s, sys: 6.78 s, total: 5min 27s
Wall time: 5min 27s
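
The chunked filtering above can be tried on a tiny in-memory sample before running it on the full file. The sketch below is an illustration under assumptions: the sample records and the two search terms are made up, and it additionally escapes each term and adds word boundaries, which the notebook's own query does not do.

```python
import io
import re
import pandas as pd

# Hypothetical stand-in for the Quotebank file: one JSON object per line
raw = "\n".join([
    '{"quoteID": "a", "quotation": "Climate change is real."}',
    '{"quoteID": "b", "quotation": "The match ended in a draw."}',
    '{"quoteID": "c", "quotation": "We must cut greenhouse gas emissions."}',
    '{"quoteID": "d", "quotation": null}',
])

terms = ["climate change", "greenhouse gas"]  # illustrative subset of words_only

# Escape each term so regex metacharacters are matched literally, and wrap
# the alternation in word boundaries to avoid matching inside longer words
pattern = r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b"

# chunksize makes read_json return an iterator of DataFrames
reader = pd.read_json(io.StringIO(raw), lines=True, chunksize=2)
matches = [
    chunk[chunk["quotation"].str.contains(pattern, case=False, na=False)]
    for chunk in reader
]
result = pd.concat(matches, ignore_index=True)
```

Only the rows that match are kept from each chunk, so the memory held at any time is one chunk plus the accumulated matches.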


### Adding notion of score and sorting according to it
We count the number of regex matches in each quotation and sort the quotations by that count, highest first. The higher the score, the more related a quotation is to our topic.

In [9]:
# Sorting quotes
df_climate = pd.concat(climate_like_quotes) # Transform the previous list of dataframes obtained into one dataframe
scores = [] # Create list to store scores

for index, row in df_climate.iterrows():
    scores.append(len(re.findall(reg_query, row['quotation'])))
    
df_climate['score'] = scores
df_climate = df_climate.sort_values(by=['score'], ascending=False)
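
Note that `re.findall` without flags is case-sensitive, whereas the filter in the previous cell used `case=False`, so some retained quotes may receive a lower count than expected. A possible alternative, shown here as a sketch on made-up data rather than as the notebook's method, is to count matches with the vectorized `Series.str.count` and pass `re.IGNORECASE`:

```python
import re
import pandas as pd

# Made-up quotations for illustration
df = pd.DataFrame({"quotation": [
    "Climate change and CLIMATE CHANGE again.",
    "Greenhouse gas emissions drive climate change.",
    "Nothing relevant here.",
]})
pattern = "climate change|greenhouse gas"  # illustrative stand-in for reg_query

# str.count applies the regex to every row at once; IGNORECASE keeps the
# count consistent with a case-insensitive filter
df["score"] = df["quotation"].str.count(pattern, flags=re.IGNORECASE)
df = df.sort_values(by="score", ascending=False)
```

This also avoids the row-by-row `iterrows` loop, which tends to be slow on large DataFrames.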

In [10]:
df_climate.head()

| | quoteID | quotation | speaker | qids | date | numOccurrences | probas | urls | phase | score |
|---|---|---|---|---|---|---|---|---|---|---|
| 3353517 | 2020-03-31-049406 | The abstract from Pongratz et. al. report coup... | | [] | 2020-03-31 21:42:09 | 1 | [[None, 0.5255], [Julia Pongratz, 0.4745]] | [https://www.commdiginews.com/environment/moth... | E | 10 |
| 1412716 | 2020-02-04-079117 | The climate crisis poses one of the most serio... | Amber Rudd | [Q479171] | 2020-02-04 12:36:00 | 1 | [[Amber Rudd, 0.9], [None, 0.1]] | [https://www.miragenews.com/amber-rudd-joins-c... | E | 8 |
| 1306081 | 2020-01-28-111450 | Who, in all history, ever suffered unpopularit... | Scientific American | [Q7433741] | 2020-01-28 00:00:00 | 1 | [[Scientific American, 0.6878], [None, 0.3122]] | [https://jamanetwork.com/journals/jama/fullart... | E | 8 |
| 1164670 | 2020-02-25-080564 | We try and look at the scientific evidence aro... | John Neal | [Q1315962, Q6250190, Q6250195] | 2020-02-25 00:00:00 | 1 | [[John Neal, 0.8669], [None, 0.1331]] | [http://www.wlrn.org/post/sunshine-economy-llo... | E | 7 |
| 3751810 | 2020-01-09-014127 | Companies should be increasingly focused on pu... | Andrew Winston | [Q42723260] | 2020-01-09 00:00:00 | 1 | [[Andrew Winston, 0.9269], [None, 0.0731]] | [https://www.forbes.com/sites/susanmcpherson/2... | E | 7 |


In [11]:
for i in range(10):
    print(df_climate.iloc[i]['quotation'])
    print()

The abstract from Pongratz et. al. report coupled climate -- carbon simulations that indicate minor global effects of wars and epidemics on atmospheric CO2 between ad 800 and 1850 reporting: Historic events such as wars and epidemics have been suggested as explanation for decreases in atmospheric CO2 reconstructed from ice cores because of their potential to take up carbon in forests regrowing on abandoned agricultural land. (emphasis added) READ ALSO: From GM creating ventilators, to new deals, the auto industry is reacting to COVID-19 Here, we use a coupled climate -- carbon cycle model to assess the carbon and climate effects of the Mongol invasion (~ 1200 to ~ 1380), the Black Death (~ 1347 to ~ 1400), the conquest of the Americas (~ 1519 to ~ 1700), and the fall of the Ming Dynasty (~ 1600 to ~ 1650). We calculate their impact on atmospheric CO2 including the response of the global land and ocean carbon pools. It has been hypothesized that these events have contributed to signific

# Storing file

In [12]:
df_climate.to_pickle('data/climate_df_score_related-2020.pkl')

# Next Steps
Each word has its own similarity score with respect to the term we searched for (i.e., "climate change"), and we will give more weight to quotes containing words with a higher similarity score. The current approach, however, already gives us a relevant scoring system.
We will also take the length of the quotes into account, because longer quotes contain more words and therefore have more chances of matching.
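
One way these two ideas could be combined is sketched below: each match contributes the similarity weight of its term, and the total is divided by the number of words in the quotation. The `weighted_score` function, the weights, and the choice of dividing by word count are all assumptions made for illustration; other normalizations are possible.

```python
import re


def weighted_score(quotation, term_weights):
    """Sum of (similarity weight x match count) over terms, divided by word count."""
    n_words = len(quotation.split())
    if n_words == 0:
        return 0.0
    total = 0.0
    for term, weight in term_weights.items():
        # Escape the term so it is matched literally, ignoring case
        hits = len(re.findall(re.escape(term), quotation, flags=re.IGNORECASE))
        total += weight * hits
    return total / n_words


# Hypothetical weights; in the notebook they would come from words_with_score
weights = {"climate change": 1.0, "greenhouse gas": 0.5}

short_quote = "Climate change is real."
long_quote = "Climate change is real, as many people have said many times before."

short_score = weighted_score(short_quote, weights)
long_score = weighted_score(long_quote, weights)
```

Both quotes contain one match, but the shorter one scores higher because the match makes up a larger share of its words.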

# Current situation
The rest of the notebook was based on an extraction of quotes containing "climate change", without any stemming or search for related words. We chose this because we did not want to lose time on computation and only wanted to show that our pipeline works.