<h3>Part One: Selecting Articles From Lexis-Nexis and Pre-processing</h3>

Below is the SQL query used to generate relevant results:

``select title, source, content
from `gcp-cset-projects.gcp_cset_lexisnexis.raw_news`
where (source.name = "The Times of India (TOI)"
or source.name = "Epoch Times"
or source.name = "The Epoch Times"
or source.name = "The New York Times"
or source.name = "Global Times (China)"
or source.name = "South China Morning Post")
and language = "English"
and id in (select id from gcp_cset_lexisnexis.unique_article_ids)
and regexp_contains(title, r"(?i)\b(?:chin(a|ese)|beijing|ccp)\b")``

The results of this query are stored in mrm311_sandbox.china_pubs_regexed.
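
As a rough illustration of what the title filter in the last line of the query matches, the same pattern can be tried with Python's `re` module. This is only a sketch: BigQuery evaluates `regexp_contains` with the RE2 engine rather than Python's, and the sample titles and the helper name `title_matches` are made up for the example.

```python
import re

# Same pattern as in the query; the inline (?i) flag makes it case-insensitive.
TITLE_PATTERN = re.compile(r"(?i)\b(?:chin(a|ese)|beijing|ccp)\b")

def title_matches(title):
    """Return True if the title contains china, chinese, beijing, or ccp as a whole word."""
    return TITLE_PATTERN.search(title) is not None

# Hypothetical titles, not taken from the dataset
sample_titles = [
    "Beijing now faces harder resistance",
    "CHINESE firms expand abroad",
    "A walk through Chinatown",
    "New trade rules announced",
]
matches = [title_matches(t) for t in sample_titles]
print(matches)
```

The word boundaries (`\b`) are what keep a title such as "A walk through Chinatown" from matching, while possessives such as "China's" still match because the apostrophe is not a word character.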

In [1]:
import pandas as pd
import numpy as np
import datetime

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

import openai
openai.api_key = "[ACCESS KEY GOES HERE]"

In [2]:
# Pull results into a dataframe

df = pd.read_gbq("""select *
                    from `gcp-cset-projects.mrm311_sandbox.china_pubs_regexed`""",
                 dialect='standard', project_id='gcp-cset-projects')

print(len(df))
df.head()

54089


Unnamed: 0,title,source,content
0,rocketing fares 'little help' for stricken air...,South China Morning Post,Airfares to the mainland soared in the past fe...
1,Killer who cut up tycoon into 108 pieces sente...,South China Morning Post,A killer who chopped a tycoon from China into ...
2,Year too far for many athletes; For the likes ...,South China Morning Post,Not all athletes were able to prolong their ca...
3,resistance hardens against china; Beijing now ...,South China Morning Post,When Australia first proposed an international...
4,Why enigma of the black hands is tired narrati...,South China Morning Post,US diplomats and politicians talk endlessly ab...


In [3]:
# Drop duplicates and see the breakdown by source

print(len(df))
df = df.drop_duplicates('title')
print(len(df))

df.value_counts('source')

54089
50992


source
South China Morning Post    17975
Global Times (China)        13197
The New York Times           7458
The Times of India (TOI)     5435
The Epoch Times              3515
Epoch Times                  3412
dtype: int64

In [4]:
# Inconsistent naming for the Epoch Times makes it convenient to use a source_num instead of source

source_to_source_num = {'Global Times (China)': 0,
                        'Epoch Times': 1,
                        'The Epoch Times': 1,
                        'The New York Times': 2,
                        'The Times of India (TOI)': 3,
                        'South China Morning Post': 4}

source_num_to_source = {0: 'Global Times',
                       1: 'Epoch Times',
                       2: 'New York Times',
                       3: 'Times of India', 
                       4: 'South China\nMorning Post'}

df['source_num'] = [source_to_source_num[source] for source in df.source]
df.head()

Unnamed: 0,title,source,content,source_num
0,rocketing fares 'little help' for stricken air...,South China Morning Post,Airfares to the mainland soared in the past fe...,4
1,Killer who cut up tycoon into 108 pieces sente...,South China Morning Post,A killer who chopped a tycoon from China into ...,4
2,Year too far for many athletes; For the likes ...,South China Morning Post,Not all athletes were able to prolong their ca...,4
3,resistance hardens against china; Beijing now ...,South China Morning Post,When Australia first proposed an international...,4
4,Why enigma of the black hands is tired narrati...,South China Morning Post,US diplomats and politicians talk endlessly ab...,4
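
The same mapping can also be applied with `Series.map`, which should give the same column as the list comprehension. A small sketch on made-up rows shows both Epoch Times spellings collapsing to the same `source_num`; the toy dataframe is an assumption for illustration only.

```python
import pandas as pd

source_to_source_num = {'Global Times (China)': 0,
                        'Epoch Times': 1,
                        'The Epoch Times': 1,
                        'The New York Times': 2,
                        'The Times of India (TOI)': 3,
                        'South China Morning Post': 4}

# Toy rows standing in for the real query results
toy = pd.DataFrame({'source': ['Epoch Times', 'The Epoch Times', 'South China Morning Post']})

# map() looks each source name up in the dictionary; unknown names would become NaN
toy['source_num'] = toy['source'].map(source_to_source_num)
print(toy['source_num'].tolist())
```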


In [5]:
# Preprocess the text: remove links and non-letter characters, lowercase, remove stop words, and lemmatize

stop_words = set(stopwords.words('english'))  # a set makes the membership test in tokenize() fast
lemmatizer = WordNetLemmatizer()

def regex(text):
    text = re.sub(r'http\S+', ' ', text)    # remove links
    text = re.sub(r'[^a-zA-Z]', ' ', text)  # replace every non-letter character with a space
    return text.lower()

def tokenize(text):
    tokens = re.split(r'\s+', text)
    tokens = [tok for tok in tokens if tok not in stop_words]
    return tokens

def lemmatize(tokens):
    lemms = [lemmatizer.lemmatize(tok) for tok in tokens]
    return lemms
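
To see what these helpers do to a single string, here is a self-contained sketch. It assumes a tiny hand-written stop word set (`TOY_STOP_WORDS`) in place of the NLTK list and skips lemmatization so that it runs without downloading any NLTK corpora; the `clean` helper and the sample sentence are likewise made up for the example. It also contrasts the character ranges `A-Z` and `A-z`: the latter additionally spans the six ASCII characters that sit between `Z` and `a` (square brackets, backslash, caret, underscore, and backtick), so those characters survive the substitution.

```python
import re

# Hypothetical stand-in for stopwords.words('english')
TOY_STOP_WORDS = {'the', 'to', 'in', 'a', 'of'}

def clean(text, letters='a-zA-Z'):
    """Remove links, replace every character outside `letters` with a space, lowercase."""
    text = re.sub(r'http\S+', ' ', text)
    text = re.sub('[^' + letters + ']', ' ', text)
    return text.lower()

def tokenize(text):
    """Split on whitespace; drop stop words and the empty strings left by stray spaces."""
    return [tok for tok in re.split(r'\s+', text) if tok and tok not in TOY_STOP_WORDS]

sample = "Airfares to the mainland soared_in 2020 [see https://example.com/x]"

strict = clean(sample)                   # only ASCII letters survive
loose = clean(sample, letters='a-zA-z')  # '_' and '[' also survive

print(tokenize(strict))
print(tokenize(loose))
```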

In [6]:
# Create columns holding the various stages of pre-processed text
# Also create a column for the word count (in post-preprocessing tokens) and title length (in characters)

print(datetime.datetime.now())
df['regexed'] = df['content'].apply(lambda x: regex(x))
print(datetime.datetime.now())
df['tokens'] = df['regexed'].apply(lambda x: tokenize(x))
print(datetime.datetime.now())
df['word_count'] = df['tokens'].apply(lambda x: len(x))
print(datetime.datetime.now())
df['title_len'] = df['title'].apply(lambda x: len(x))
print(datetime.datetime.now())

print(len(df))
df.head()

2021-04-26 14:44:56.682666
2021-04-26 14:45:05.595197
2021-04-26 14:46:08.023452
2021-04-26 14:46:08.053004
2021-04-26 14:46:08.075995
50992


Unnamed: 0,title,source,content,source_num,regexed,tokens,word_count,title_len
0,rocketing fares 'little help' for stricken air...,South China Morning Post,Airfares to the mainland soared in the past fe...,4,airfares to the mainland soared in the past fe...,"[airfares, mainland, soared, past, weeks, coro...",422,195
1,Killer who cut up tycoon into 108 pieces sente...,South China Morning Post,A killer who chopped a tycoon from China into ...,4,a killer who chopped a tycoon from china into ...,"[killer, chopped, tycoon, china, pieces, vanco...",309,143
2,Year too far for many athletes; For the likes ...,South China Morning Post,Not all athletes were able to prolong their ca...,4,not all athletes were able to prolong their ca...,"[athletes, able, prolong, careers, next, summe...",278,129
3,resistance hardens against china; Beijing now ...,South China Morning Post,When Australia first proposed an international...,4,when australia first proposed an international...,"[australia, first, proposed, international, in...",423,135
4,Why enigma of the black hands is tired narrati...,South China Morning Post,US diplomats and politicians talk endlessly ab...,4,us diplomats and politicians talk endlessly ab...,"[us, diplomats, politicians, talk, endlessly, ...",388,195


In [7]:
# Restrict to articles with an effective word count strictly between 100 and 500 tokens

df = df[df['word_count'] > 100]
df = df[df['word_count'] < 500]
print(len(df))

36099
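
The two filters can also be written as a single boolean mask. The sketch below uses a made-up `word_count` column to show that both bounds are exclusive, so articles with exactly 100 or exactly 500 tokens are dropped.

```python
import pandas as pd

# Toy word counts chosen to sit on and around the two bounds
toy = pd.DataFrame({'word_count': [99, 100, 101, 499, 500, 501]})

# Combine both conditions with &; each comparison needs its own parentheses
mask = (toy['word_count'] > 100) & (toy['word_count'] < 500)
kept = toy[mask]

print(kept['word_count'].tolist())
```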


In [8]:
# Check how many articles now exist from each source 

df.value_counts('source_num')

source_num
4    13146
0    10327
3     4838
1     4632
2     3156
dtype: int64

In [9]:
# Randomly sample 3,000 articles from each publication

df = df.groupby('source_num').sample(n=3000, replace=False)
df = df.reset_index(drop=True)

print(len(df))
df.head()

15000


Unnamed: 0,title,source,content,source_num,regexed,tokens,word_count,title_len
0,China inspires the true hustler in me,Global Times (China),Illustration: Luo Xuan/GT\n\n\n\nWho knew that...,0,illustration luo xuan gt who knew that aft...,"[illustration, luo, xuan, gt, knew, studying, ...",277,37
1,Nepal PM's China visit puts ties on a new road,Global Times (China),Illustration: Liu Rui/GT\n\n\n\nDuring Nepales...,0,illustration liu rui gt during nepalese pr...,"[illustration, liu, rui, gt, nepalese, prime, ...",427,46
2,China's delivery evolution under coronavirus,Global Times (China),"An SF Express plane Photo: IC\n\n\n\nWu Lian, ...",0,an sf express plane photo ic wu lian a de...,"[sf, express, plane, photo, ic, wu, lian, deli...",445,44
3,Flights between China and S. Korea expected to...,Global Times (China),A flight parks at Jeju International Airport i...,0,a flight parks at jeju international airport i...,"[flight, parks, jeju, international, airport, ...",184,91
4,China to protect own rights against US tariffs...,Global Times (China),China will take effective measures to firmly s...,0,china will take effective measures to firmly s...,"[china, take, effective, measures, firmly, saf...",204,54
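
Because the sample is drawn without a fixed seed, re-running the cell yields a different set of articles each time. One alternative is to pass `random_state` to `sample`. The sketch below uses a toy dataframe, a hypothetical helper `sample_per_source`, and an arbitrary seed of 0, none of which come from the original analysis.

```python
import pandas as pd

# Toy data: 5 rows for each of two hypothetical sources
toy = pd.DataFrame({'source_num': [0] * 5 + [1] * 5,
                    'title': ['title {}'.format(i) for i in range(10)]})

def sample_per_source(frame, n, seed):
    """Draw n rows per source_num without replacement, reproducibly for a given seed."""
    sampled = frame.groupby('source_num').sample(n=n, replace=False, random_state=seed)
    return sampled.reset_index(drop=True)

first = sample_per_source(toy, n=3, seed=0)
second = sample_per_source(toy, n=3, seed=0)

# The two draws are identical, and each source contributes exactly n rows
print(first.equals(second))
print(first['source_num'].value_counts().to_dict())
```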


In [10]:
# Finish pre-processing by lemmatizing the tokens for each article
# NOTE: this was not done previously because it is more efficient to do it when the results are pared down

df['lemmas'] = df['tokens'].apply(lambda x: lemmatize(x))

print(len(df))
df.head()

15000


Unnamed: 0,title,source,content,source_num,regexed,tokens,word_count,title_len,lemmas
0,China inspires the true hustler in me,Global Times (China),Illustration: Luo Xuan/GT\n\n\n\nWho knew that...,0,illustration luo xuan gt who knew that aft...,"[illustration, luo, xuan, gt, knew, studying, ...",277,37,"[illustration, luo, xuan, gt, knew, studying, ..."
1,Nepal PM's China visit puts ties on a new road,Global Times (China),Illustration: Liu Rui/GT\n\n\n\nDuring Nepales...,0,illustration liu rui gt during nepalese pr...,"[illustration, liu, rui, gt, nepalese, prime, ...",427,46,"[illustration, liu, rui, gt, nepalese, prime, ..."
2,China's delivery evolution under coronavirus,Global Times (China),"An SF Express plane Photo: IC\n\n\n\nWu Lian, ...",0,an sf express plane photo ic wu lian a de...,"[sf, express, plane, photo, ic, wu, lian, deli...",445,44,"[sf, express, plane, photo, ic, wu, lian, deli..."
3,Flights between China and S. Korea expected to...,Global Times (China),A flight parks at Jeju International Airport i...,0,a flight parks at jeju international airport i...,"[flight, parks, jeju, international, airport, ...",184,91,"[flight, park, jeju, international, airport, s..."
4,China to protect own rights against US tariffs...,Global Times (China),China will take effective measures to firmly s...,0,china will take effective measures to firmly s...,"[china, take, effective, measures, firmly, saf...",204,54,"[china, take, effective, measure, firmly, safe..."


In [None]:
# Save the results to a .csv for future work
# All future code should use this .csv instead of re-running the previous code
# This avoids randomness in the sampling above from carrying over into the results

df.to_csv('china_articles_cleaned.csv')

<h3>Part Two: Create GPT-3 Outputs</h3>

In [11]:
# Read in the .csv generated in Part One

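from ast import literal_eval

df = pd.read_csv('china_articles_cleaned.csv')
# The lemma lists were written to the .csv as strings, so parse them back into lists before joining
df['joined'] = df['lemmas'].apply(lambda x: ' '.join(literal_eval(x)))
df.drop(columns=['Unnamed: 0'], inplace=True)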

df.head()

Unnamed: 0,title,source,content,source_num,regexed,tokens,word_count,title_len,lemmas,joined
0,China urges greater efforts to ease Korean ten...,Global Times (China),China hopes all parties would do more to ease ...,0,china hopes all parties would do more to ease ...,"['china', 'hopes', 'parties', 'would', 'ease',...",276,51,"['china', 'hope', 'party', 'would', 'ease', 't...",china hope party would ease tension korean pen...
1,Multilateral cooperation involving China best ...,Global Times (China),The US and several Western powers including Au...,0,the us and several western powers including au...,"['us', 'several', 'western', 'powers', 'includ...",273,79,"['u', 'several', 'western', 'power', 'includin...",u several western power including australia su...
2,"Despite pandemic at home, US continues to make...",Global Times (China),Illustration: Liu Rui/GT\n\n\n\nOver the last ...,0,illustration liu rui gt over the last two ...,"['illustration', 'liu', 'rui', 'gt', 'last', '...",377,67,"['illustration', 'liu', 'rui', 'gt', 'last', '...",illustration liu rui gt last two week u govern...
3,Cut Chinese firms in eye of the storm a break:...,Global Times (China),"Photo: IC\n\n\n\n\nAccording to Reuters, Chine...",0,photo ic according to reuters chinese le...,"['photo', 'ic', 'according', 'reuters', 'chine...",438,69,"['photo', 'ic', 'according', 'reuters', 'chine...",photo ic according reuters chinese leading chi...
4,Finnair launches new flight services for China,Global Times (China),"Lars Olofsson, Greater China sales director at...",0,lars olofsson greater china sales director at...,"['lars', 'olofsson', 'greater', 'china', 'sale...",361,46,"['lars', 'olofsson', 'greater', 'china', 'sale...",lars olofsson greater china sale director finn...
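
Writing a list column to a .csv stores each list as its string representation, which is why the `lemmas` column has to be parsed again after reading. Below is a sketch of that round trip on toy data. It uses an in-memory buffer instead of a file, and `ast.literal_eval`, which only accepts Python literals, as a stricter way to parse the strings.

```python
import ast
import io
import pandas as pd

toy = pd.DataFrame({'lemmas': [['china', 'hope', 'party'], ['flight', 'park']]})

# Round trip through CSV text; index=False avoids the extra 'Unnamed: 0' column
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# After reading, each cell is a string such as "['china', 'hope', 'party']"
cell_type = type(restored['lemmas'][0]).__name__
print(cell_type)

# Parse the strings back into lists, then join the lemmas into one string per article
restored['lemmas'] = restored['lemmas'].apply(ast.literal_eval)
restored['joined'] = restored['lemmas'].apply(' '.join)
print(restored['joined'].tolist())
```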


In [12]:
# First select 25 headlines from each source where the headline is between 75 and 125 characters long

headlines_df = df[df['title_len'] > 75]
headlines_df = headlines_df[headlines_df['title_len'] < 125]
print(len(headlines_df))
headlines_df = headlines_df.groupby('source_num').sample(n=25, replace=False)
headlines = headlines_df.title.tolist()

headlines_df.value_counts('source_num')

2663


source_num
4    25
3    25
2    25
1    25
0    25
dtype: int64

In [None]:
# Create a dataframe to store headline:output pairs

# Note the parameters: temperature is set to 0.7 (treated here as the default amount of randomness),
# frequency_penalty is set to a non-default 0.2 to prevent repetitive lists, and max_tokens is set to 400

gpt3 = pd.DataFrame(headlines, columns=['title'])

def call_gpt3(headline):
    prompt = headline+'\n'
    response_full = openai.Completion.create(engine='davinci', prompt=prompt, max_tokens=400, n=1, temperature=0.7, frequency_penalty=0.2)
    response = response_full.get('choices')[0].text.strip()
    return response

outputs = []
for i in range(len(headlines)):
    outputs.append(call_gpt3(headlines[i]))
    if (i+1) % 10 == 0:
        print('{} iterations completed at {}.'.format(i+1, datetime.datetime.now()))

gpt3['output'] = outputs

gpt3.to_csv('gpt3_outputs.csv')
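
The loop stops at the first failed request, and the outputs collected so far are only written at the very end. As one possible safeguard, a generic retry helper could wrap the call. This is a sketch only: `with_retries`, its parameters, and the `flaky` stand-in function are hypothetical and do not come from the original notebook or from the `openai` library.

```python
import time

def with_retries(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Wrap func so that failed calls are retried with exponentially growing delays."""
    def wrapper(*args, **kwargs):
        for attempt in range(1, max_attempts + 1):
            try:
                return func(*args, **kwargs)
            except Exception:
                if attempt == max_attempts:
                    raise  # give up after the last attempt
                sleep(base_delay * 2 ** (attempt - 1))
    return wrapper

# Stand-in for call_gpt3: fails twice, then succeeds
calls = {'count': 0}

def flaky(headline):
    calls['count'] += 1
    if calls['count'] < 3:
        raise RuntimeError('temporary failure')
    return 'output for ' + headline

# sleep is replaced by a no-op so the demonstration runs instantly
safe_call = with_retries(flaky, max_attempts=3, sleep=lambda seconds: None)
result = safe_call('Example headline')
print(result, calls['count'])
```

In the notebook, `with_retries(call_gpt3)` could then stand in for `call_gpt3` inside the loop, under the assumption that a failed request raises an exception.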