## Getting Trends Data

This notebook demonstrates the workflow for obtaining trends data on keywords and topics, encompassing relevant terms such as destination cities and countries. The topic IDs have already been collected.

Please note, due to restrictions within the Google Trends API, this notebook may not run all at once. Multiple iterations may be required to complete the downloading process. Furthermore, the current notebook does not cover all conducted searches exhaustively; it serves as a sample to guide the process. The code may also fail when trying to import UNHCR datasets, which are not available in the repository because of privacy issues.

The downloading process encompasses the following:
- Downloads for each country of trends for relevant topics
- Downloads for each country of trends for relevant keywords, both in English and up to two origin languages. Keywords were obtain through a semantic link methodology by [Boss et al. (2022)](https://bse.eu/file/10163/download?token=ClMUlPt4), and are available in the repository.

In [None]:
#%pip install pytrends
#pip install igraph
# %pip install country_converter

Note: you may need to restart the kernel to use updated packages.


These functions below is how we make requests to google trends to return trends on keywords.

It gathers one term at a time from a list of trends and then if an error occurs (which often happens due to the fact that Google Trends is rate-limited), it sleeps for a minute and repeats the request. If 20 requests are made in a row that result in an error, it will skip that particular request and move on to the next term.

In [1]:
import pandas as pd
from pytrends.request import TrendReq
import numpy as np
import time
from tqdm import tqdm

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/112.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    # 'Accept-Encoding': 'gzip, deflate, br',
    'Referer': 'https://trends.google.com/',
    'Alt-Used': 'trends.google.com',
    'Connection': 'keep-alive',
    # 'Cookie': '__utma=10102256.699944976.1681467038.1683327769.1683363479.30; __utmz=10102256.1683363479.30.23.utmcsr=trends.google.com|utmccn=(referral)|utmcmd=referral|utmcct=/; __utmc=10102256; NID=511=GaXIe0Lwd1l8RAGkA2geWNynqviDUhjPBcVgHksJdTnugCvKuUPbm_bM-mT7DhT2jrBHT00aCt71oY7fZhydICB-HNWUzrDnonyPyOGmPTA75lOvpTiguXi3KiGJtRjK3BBH3e1ZcqQ_ywcsU5vHoxJFtH9HGhcLdOt7CL7AWKx8Jj9VSOI3cCwmjDl8gbj2PZ75BU_W4NqspBRMktcdhRitXCyOIqMdLMwZfSOOvFmRBTOJKg8M7UkUTwAVhXtxsKVlHfxPpiWx8HQ63Vr5SV_8qW9f4J0f8EbXWiofQLqpPKJzo0CMbyM-EcnRlR4YVqptEli6EgemOBUJAgH8951i7ANgVDSWy-vn3zXA5KPR5l0LtkriirFZPvsNAmV-_-Mtyuf6gYu8eYJL3g; CONSENT=PENDING+639; SID=WAjkbwUHGFuugy4Yy2rq46Op5ZjRIMvPaLQIAltzHSM35MU0x7YgYongisCrn5htv3RhAw.; __Secure-1PSID=WAjkbwUHGFuugy4Yy2rq46Op5ZjRIMvPaLQIAltzHSM35MU0YH1nUovIFt-jaEUUCx_SIQ.; __Secure-3PSID=WAjkbwUHGFuugy4Yy2rq46Op5ZjRIMvPaLQIAltzHSM35MU0Z8qXUg1LhjB4DtBZWFfNQg.; HSID=A8QJObb1Ve4vQOXFw; SSID=AJr3GRs7Jf_ctBT41; APISID=rNTBsHwZF0AVrKao/AoTWce3Qv8CyFykEc; SAPISID=vftmcyrgIFqWdYpV/AHlhj91rgxiQPlOq8; __Secure-1PAPISID=vftmcyrgIFqWdYpV/AHlhj91rgxiQPlOq8; __Secure-3PAPISID=vftmcyrgIFqWdYpV/AHlhj91rgxiQPlOq8; SIDCC=AP8dLtyjVDmjXvg3rEmTwoLfGyXkY0SDrIFQWqi1z9D1QOL5voioH1Uti_ANGJkiQCuzVd4Axww; __Secure-1PSIDCC=AP8dLtxxVvSKM2MgLGepw_20VZbYsJHar-zF5kvDajRKezVqui3YqxWUaT1e6meVcR9HTUP4lgo; __Secure-3PSIDCC=AP8dLtyyI8BLnakxZZ2OFmPTDfYzPW8jo13jnE34rpPuptgnFDFq-aKX5vfcZdtRDLLZswyAl3gv; 1P_JAR=2023-5-6-12; SOCS=CAISHAgCEhJnd3NfMjAyMjEwMDQtMF9SQzMaAmVuIAEaBgiAwY2aBg; AEC=AUEFqZchyarTzQblW5K5GOTGtYARrs8luJGdx84JVmSwETHSFqijMgs9FA; _ga_VWZPXDNJJB=GS1.1.1683458025.38.1.1683458061.0.0.0; _ga=GA1.3.699944976.1681467038; OTZ=6986051_48_52_123900_48_436380; ADS_VISITOR_ID=00000000-0000-0000-0000-000000000000/112727363205027642159; S=billing-ui-v3=wWfIrmncuOn4LfU6DArDU3LLPpCDgsAT:billing-ui-v3-efe=wWfIrmncuOn4LfU6DArDU3LLPpCDgsAT; __Secure-1PSIDTS=sidts-CjIBLFra0jgJEQyM4EqRZoyaN18X_Umt8M6GTvixMw1pDB_sj5P5XvQokN5dkVw1R2qAkRAA; __Secure-3PSIDTS=sidts-CjIBLFra0jgJEQyM4EqRZoyaN18X_Umt8M6GTvixMw1pDB_sj5P5XvQokN5dkVw1R2qAkRAA; _gid=GA1.3.1220682113.1683458025; _gat_gtag_UA_4401283=1',
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-User': '?1',
    # Requests doesn't support trailers
    # 'TE': 'trailers',
}

def pytrends_request(word_list, country, pytrends):
    
    pytrends.build_payload(kw_list=word_list, geo=country, timeframe='2005-01-01 2022-12-31')
    trends = pytrends.interest_over_time()
    if 'isPartial' in trends.columns:
        trends.drop('isPartial', axis=1, inplace=True)
    # print(word_list)
    return trends

def get_trends_data(country, keywords):
    pytrends = TrendReq(hl='en-US', tz=360, requests_args={'headers': headers})
    trends_df = pd.DataFrame()
    error_count = 0

    for keyword in keywords:
        while True:
            try:
                trends_df = pd.concat([trends_df, pytrends_request([keyword], country, pytrends)], axis=1)
                error_count = 0  # Reset error count if successful request
                break  # Exit the while loop if successful
            except:
                error_count += 1
                # print('Got an error. Trying again in 60 seconds.')
                time.sleep(60)

                if error_count == 20:
                    print('Reached maximum error count. Exiting loop.')
                    return trends_df  # Return the trends_df even if not complete

                continue

    return trends_df

### Semantic Link Topic Trends

In [6]:
semantic_topic_ids = pd.read_csv('topic_ids/semantic_topic_ids.csv')
countries = pd.read_csv('../../data/clean/unhcr.csv', engine='pyarrow').drop_duplicates('iso_o').Country_o
import country_converter as coco
iso2_countries = coco.convert(countries, to='iso2')

In [None]:
country_trends_list = []
for iso2country in tqdm(iso2_countries):
    a_country_trends = get_trends_data(iso2country, semantic_topic_ids.topic_id)
    a_country_trends['country'] = iso2country
    country_trends_list.append(a_country_trends)

 66%|██████▋   | 130/196 [15:39:10<22:20:28, 1218.61s/it]

In [None]:
semantic_dict = semantic_topic_ids[['keyword','topic_id']].set_index('topic_id')['keyword'].to_dict()

semantic_trends_df = pd.DataFrame()
for idx, a_country_semantic_trends in enumerate(country_trends_list):
    a_country = a_country_semantic_trends.copy()
    if a_country.index.name == 'date':
        a_country.reset_index(inplace=True)
    if 'index' in a_country.columns.values:
        a_country.drop('index',axis=1, inplace=True)
    a_country = a_country.loc[:, ~a_country.columns.duplicated()]
    # a_country.set_index(['date','country'], inplace=True)
    a_country.rename(columns=semantic_dict, inplace=True)
    semantic_trends_df = pd.concat([semantic_trends_df, a_country], axis=0, ignore_index=True)

semantic_trends_df.to_csv('data/semantic_topic_trends.csv')

## Semantic Link Keyword trends

In [15]:
semantic_topic_ids = pd.read_csv('topic_ids/semantic_topic_ids.csv')
countries = pd.read_csv('../../data/data.csv', engine='pyarrow').drop_duplicates('iso_o').Country_o
import country_converter as coco
iso2_countries = coco.convert(countries, to='iso2')

In [20]:
country_trends_list = []
for iso2country in tqdm(iso2_countries):
    a_country_trends = get_trends_data(iso2country, semantic_topic_ids.keyword)
    a_country_trends['country'] = iso2country
    country_trends_list.append(a_country_trends)

100%|██████████| 13/13 [05:55<00:00, 27.34s/it]


In [22]:
semantic_dict = semantic_topic_ids[['keyword','topic_id']].set_index('topic_id')['keyword'].to_dict()
semantic_trends_df = pd.DataFrame()
for idx, a_country_semantic_trends in enumerate(country_trends_list):
    a_country = a_country_semantic_trends.copy()
    if a_country.index.name == 'date':
        a_country.reset_index(inplace=True)
    if 'index' in a_country.columns.values:
        a_country.drop('index',axis=1, inplace=True)
    a_country = a_country.loc[:, ~a_country.columns.duplicated()]
    # a_country.set_index(['date','country'], inplace=True)
    a_country.rename(columns=semantic_dict, inplace=True)
    semantic_trends_df = pd.concat([semantic_trends_df, a_country], axis=0, ignore_index=True)



In [24]:
semantic_trends_df.to_csv('data/semantic_keywords_trends_EN_partial.csv')

## Semantic links - keywords - original lang

In [2]:
import pandas as pd

from helper_functions.country_abbrev import *
from helper_functions.country_language import *
from pytrends.request import TrendReq

import pycountry
import itertools

from googletrans import LANGCODES
import swifter

import helper_functions.trends_helpers as trends_helpers
import numpy as np

In [3]:
semantic_topic_ids = pd.read_csv('topic_ids/semantic_topic_ids.csv')
countries = pd.read_csv('../../data/data.csv', engine='pyarrow').drop_duplicates('iso_o').Country_o
import country_converter as coco
iso2_countries = coco.convert(countries, to='iso2')

In [4]:
# list of all unique languages:
unique_languages = pd.Series(list(set(list(itertools.chain(*country_language_dict.values())))), name='language')

# list of language codes from googletrans
langcodes = pd.DataFrame.from_dict(LANGCODES, orient='index', columns=['code'])
langcodes.index = langcodes.index.str.capitalize()

refugee_lang = unique_languages.to_frame().merge(langcodes, left_on='language', right_index=True, how='left')

refugee_lang.dropna(inplace=True)

refugee_lang_not_en = refugee_lang[refugee_lang['code'] != 'en']

In [5]:
translated_keywords = refugee_lang_not_en['code'].swifter.apply(lambda x: trends_helpers.translate_keywords_list(lst = semantic_topic_ids.keyword, lang= x))

Pandas Apply:   0%|          | 0/83 [00:00<?, ?it/s]

In [6]:
columns = list(refugee_lang_not_en['code'])

df = pd.concat([pd.DataFrame(sublist, columns=[col]) for sublist, col in zip(translated_keywords, columns)], axis=1)

df['en']=semantic_topic_ids.keyword

df.head()

Unnamed: 0,bn,fi,ur,ru,fr,uz,ar,ms,fa,ro,...,tr,hy,sd,no,he,ceb,sw,lt,th,en
0,পাসপোর্ট,passi,پاسپورٹ,заграничный пасспорт,passeport,pasport,جواز سفر,pasport,گذرنامه,pașaport,...,pasaport,անձնագիր,پاسپورٽ,pass,דַרכּוֹן,passport,pasipoti,pasas,หนังสือเดินทาง,passport
1,অভিবাসন,maahanmuutto,امیگریشن,иммиграция,immigration,immigratsiya,الهجرة,imigresen,مهاجرت,imigrare,...,göçmenlik,ներգաղթ,اميگريشن,innvandring,עלייה,imigrasyon,uhamiaji,imigracija,การตรวจคนเข้าเมือง,Immigration
2,ভ্রমণ ভিসা,matkaviisumi,سفری ویزا,туристическая виза,visa de voyage,sayohat vizasi,تأشيرة السفر,visa perjalanan,ویزای مسافرتی,viza de calatorie,...,seyahat vizesi,ճամփորդական վիզա,سفر ويزا,reisevisum,ויזת נסיעות,travel visa,visa ya kusafiri,kelionės viza,วีซ่าท่องเที่ยว,Travel Visa
3,উদ্বাস্তু,pakolainen,پناہ گزین,беженец,réfugié,qochoq,لاجئ,pelarian,پناهنده,refugiat,...,mülteci,փախստական,پناهگير,flyktning,פָּלִיט,kagiw,mkimbizi,pabėgėlis,ผู้ลี้ภัย,Refugee
4,দ্বন্দ্ব,konflikti,تنازعہ,конфликт,conflit,mojaro,صراع,konflik,تعارض,conflict,...,anlaşmazlık,կոնֆլիկտ,تڪرار,konflikt,סְתִירָה,panagbangi,migogoro,konfliktas,ขัดแย้ง,Conflict


In [7]:
# I'll only keep two original languages per country

country_language_dict_2 = {}

for key, values in country_language_dict.items():
    country_language_dict_2[key] = values[:2] 

max_length = max(map(len, country_language_dict_2.values()))

data_padded = {key: arr + [np.nan] * (max_length - len(arr)) for key, arr in country_language_dict_2.items()}

langs = pd.DataFrame(data_padded)
langs = langs.T
langs = langs.reset_index()

langs = langs.rename(columns={'index': 'Country', 0:'lang1', 1:'lang2', 2:'lang3'})

# Apply the function to the 'Country' column
langs['ISO2'] = langs['Country'].apply(trends_helpers.get_iso2_country_code)

langs = langs.drop(columns=["Country"])
langs_long = pd.melt(langs, id_vars=['ISO2'], var_name='numlang', value_name='lang')
langs_long = langs_long.dropna()

langs_long = pd.merge(langs_long, refugee_lang_not_en, left_on="lang", right_on="language")

In [8]:
# langs_long = langs_long[102:].reset_index()

In [10]:
country_trends_list = [] 

for i, country in tqdm(enumerate(langs_long["ISO2"])):
    languagecode= langs_long["code"][i]
    a_country_trends = get_trends_data(country, df[langs_long["code"][i]])
    a_country_trends['country'] = country
    country_trends_list.append(a_country_trends)

84it [1:41:35, 72.56s/it] 


In [53]:
# Update column names to english, but having a choice to 
# keep track of what the language of search was (wide==True)

def update_column_names(data_list, original_lang_code, wide:bool):
    
    mapping_dict = dict(zip(df[original_lang_code], df['en']))

    new_data_list = []
    for data in data_list:
        new_data = data.copy()  # Make a copy of the DataFrame
        new_columns = []

        if wide==True:

            for column in new_data.columns:
                if column in mapping_dict:
                    new_column = f"{mapping_dict[column]}_{original_lang_code}"
                else:
                    new_column = column
                new_columns.append(new_column)
            new_data.columns = new_columns
        
        else:

            for column in new_data.columns:
                if column in mapping_dict:
                    new_column = f"{mapping_dict[column]}"
                else:
                    new_column = column
                new_columns.append(new_column)
            new_data.columns = new_columns


        new_data_list.append(new_data)  # Add the modified DataFrame to the new list


    return new_data_list


In [14]:
# Short version

updated_data_list = country_trends_list.copy()
for lang in langs_long["code"][:len(country_trends_list)]:
    updated_data_list = update_column_names(updated_data_list, lang, wide=False)

# Detailed version

updated_data_list_detailed = country_trends_list.copy()
for lang in langs_long["code"][:len(country_trends_list)]:
    updated_data_list_detailed = update_column_names(updated_data_list_detailed, lang, wide=True)


In [15]:
# Save short
semantic_dict = semantic_topic_ids[['keyword','topic_id']].set_index('topic_id')['keyword'].to_dict()
semantic_trends_df = pd.DataFrame()
for idx, a_country_semantic_trends in enumerate(updated_data_list):
    a_country = a_country_semantic_trends.copy()
    if a_country.index.name == 'date':
        a_country.reset_index(inplace=True)
    if 'index' in a_country.columns.values:
        a_country.drop('index',axis=1, inplace=True)
    a_country = a_country.loc[:, ~a_country.columns.duplicated()]
    # a_country.set_index(['date','country'], inplace=True)
    a_country.rename(columns=semantic_dict, inplace=True)
    semantic_trends_df = pd.concat([semantic_trends_df, a_country], axis=0, ignore_index=True)



In [17]:
semantic_trends_df.to_csv("data/semantic_keywords_OL_2.csv", index=False)

In [18]:
# Save detailed

semantic_trends_df = pd.DataFrame()
for idx, a_country_semantic_trends in enumerate(updated_data_list_detailed):
    a_country = a_country_semantic_trends.copy()
    if a_country.index.name == 'date':
        a_country.reset_index(inplace=True)
    if 'index' in a_country.columns.values:
        a_country.drop('index',axis=1, inplace=True)
    a_country = a_country.loc[:, ~a_country.columns.duplicated()]
    # a_country.set_index(['date','country'], inplace=True)
    a_country.rename(columns=semantic_dict, inplace=True)
    semantic_trends_df = pd.concat([semantic_trends_df, a_country], axis=0, ignore_index=True)

In [19]:
semantic_trends_df.to_csv("data/semantic_keywords_OL_2_detailed.csv", index=False)

### Boss words - original language

In [3]:
semantic_keywords = pd.read_csv('helper_files/boss_words.csv')
countries = pd.read_csv('../../data/data.csv', engine='pyarrow').drop_duplicates('iso_o').Country_o
import country_converter as coco
iso2_countries = coco.convert(countries, to='iso2')

In [4]:
semantic_keywords = semantic_keywords[semantic_keywords["consider alone"]==1].reset_index()
len(semantic_keywords.list)

39

In [7]:
# list of all unique languages:
unique_languages = pd.Series(list(set(list(itertools.chain(*country_language_dict.values())))), name='language')

# list of language codes from googletrans
langcodes = pd.DataFrame.from_dict(LANGCODES, orient='index', columns=['code'])
langcodes.index = langcodes.index.str.capitalize()

refugee_lang = unique_languages.to_frame().merge(langcodes, left_on='language', right_index=True, how='left')

refugee_lang.dropna(inplace=True)

refugee_lang_not_en = refugee_lang[refugee_lang['code'] != 'en']

In [8]:
translated_keywords = refugee_lang_not_en['code'].swifter.apply(lambda x: trends_helpers.translate_keywords_list(lst = semantic_keywords.list, lang= x))

Pandas Apply:   0%|          | 0/83 [00:00<?, ?it/s]

In [9]:
columns = list(refugee_lang_not_en['code'])

df = pd.concat([pd.DataFrame(sublist, columns=[col]) for sublist, col in zip(translated_keywords, columns)], axis=1)

df['en']=semantic_keywords.list

df.head()

Unnamed: 0,ha,sr,nl,fa,no,pa,fi,bs,ca,yo,...,ny,az,el,ne,et,zu,mi,si,ja,en
0,mafaka,азилу,asiel,پناهندگی,asyl,ਸ਼ਰਣ,turvapaikka,azil,asil,ibi aabo,...,asylum,sığınacaq,άσυλο,शरण,varjupaiga,indawo yokukhosela,whakaruruhau,සරණාගතභාවය,亡命,asylum
1,mai neman mafaka,азилант,asielzoeker,پناهجو,asylsøker,ਸ਼ਰਣ ਮੰਗਣ ਵਾਲਾ,turvapaikanhakija,tražilac azila,sol · licitant d'asil,ibi aabo,...,wofunafuna chitetezo,sığınacaq axtaran,αιτών άσυλο,शरणार्थी,varjupaigataotleja,ofuna ukukhoseliswa,tangata rapu whakarurutanga,රැකවරණ පතන්නා,亡命希望者,asylum seeker
2,kula da iyakoki,граничне контроле,grenscontroles,کنترل های مرزی,grensekontroller,ਬਾਰਡਰ ਕੰਟਰੋਲ,rajatarkastukset,granične kontrole,controls fronterers,aala idari,...,zowongolera malire,sərhəd nəzarəti,συνοριακούς ελέγχους,सीमा नियन्त्रण,piirikontrolli,izilawuli zemingcele,mana whakahaere rohe,දේශසීමා පාලනය,国境管理,border controls+border control
3,kula da iyakoki,контрола граница,grens controle,کنترل مرزی,grensekontroll,ਸਰਹੱਦ ਕੰਟਰੋਲ,rajavalvonta,granična kontrola,control de fronteres,aala iṣakoso,...,kulamulira malire,sərhəd nəzarəti,έλεγχος συνόρων,सीमा नियन्त्रण,piirikontroll,ukulawula umngcele,mana rohe,දේශසීමා පාලනය,国境警備隊,bureau of immigration
4,ofishin shige da fice,биро за имиграцију,immigratiebureau,اداره مهاجرت,immigrasjonsbyrået,ਇਮੀਗ੍ਰੇਸ਼ਨ ਬਿਊਰੋ,maahanmuuttovirasto,biroa za imigraciju,oficina d'immigració,ajọ ti iṣilọ,...,ofesi ya immigration,immiqrasiya bürosu,γραφείο μετανάστευσης,अध्यागमन ब्यूरो,immigratsioonibüroo,ihhovisi labokufika,tari manene,ආගමන කාර්යාංශය,入国管理局,citizen


In [20]:
# I'll only keep two original languages per country

country_language_dict_2 = {}

for key, values in country_language_dict.items():
    country_language_dict_2[key] = values[:2] 

max_length = max(map(len, country_language_dict_2.values()))

data_padded = {key: arr + [np.nan] * (max_length - len(arr)) for key, arr in country_language_dict_2.items()}

langs = pd.DataFrame(data_padded)
langs = langs.T
langs = langs.reset_index()

langs = langs.rename(columns={'index': 'Country', 0:'lang1', 1:'lang2', 2:'lang3'})

# Apply the function to the 'Country' column
langs['ISO2'] = langs['Country'].apply(trends_helpers.get_iso2_country_code)

langs = langs.drop(columns=["Country"])
langs_long = pd.melt(langs, id_vars=['ISO2'], var_name='numlang', value_name='lang')
langs_long = langs_long.dropna()

langs_long = pd.merge(langs_long, refugee_lang_not_en, left_on="lang", right_on="language")

In [25]:
langs_long

Unnamed: 0,ISO2,numlang,lang,language,code
0,PR,lang1,Spanish,Spanish,es
1,ES,lang1,Spanish,Spanish,es
2,UY,lang1,Spanish,Spanish,es
3,VE,lang1,Spanish,Spanish,es
4,AD,lang2,Spanish,Spanish,es
...,...,...,...,...,...
131,RS,lang2,Hungarian,Hungarian,hu
132,SK,lang2,Hungarian,Hungarian,hu
133,ZA,lang2,Xhosa,Xhosa,xh
134,LK,lang2,Tamil,Tamil,ta


In [28]:
country_trends_list = [] 

for i, country in tqdm(enumerate(langs_long["ISO2"])):
    languagecode= langs_long["code"][i]
    a_country_trends = get_trends_data(country, df[langs_long["code"][i]])
    a_country_trends['country'] = country
    country_trends_list.append(a_country_trends)

25it [55:10, 132.44s/it]


KeyboardInterrupt: 

In [29]:
# Short version

updated_data_list = country_trends_list.copy()
for lang in langs_long["code"][:len(country_trends_list)]:
    updated_data_list = update_column_names(updated_data_list, lang, wide=False)

# Detailed version

updated_data_list_detailed = country_trends_list.copy()
for lang in langs_long["code"][:len(country_trends_list)]:
    updated_data_list_detailed = update_column_names(updated_data_list_detailed, lang, wide=True)

In [31]:
# Save short
# semantic_dict = semantic_topic_ids[['keyword','topic_id']].set_index('topic_id')['keyword'].to_dict()
semantic_trends_df = pd.DataFrame()
for idx, a_country_semantic_trends in enumerate(updated_data_list):
    a_country = a_country_semantic_trends.copy()
    if a_country.index.name == 'date':
        a_country.reset_index(inplace=True)
    if 'index' in a_country.columns.values:
        a_country.drop('index',axis=1, inplace=True)
    a_country = a_country.loc[:, ~a_country.columns.duplicated()]
    # a_country.set_index(['date','country'], inplace=True)
    a_country.rename(columns=semantic_keywords.list, inplace=True)
    semantic_trends_df = pd.concat([semantic_trends_df, a_country], axis=0, ignore_index=True)

semantic_trends_df.to_csv("data/semantic_ol_partial_4.csv", index=False)


In [32]:
# Save detailed

semantic_trends_df = pd.DataFrame()
for idx, a_country_semantic_trends in enumerate(updated_data_list_detailed):
    a_country = a_country_semantic_trends.copy()
    if a_country.index.name == 'date':
        a_country.reset_index(inplace=True)
    if 'index' in a_country.columns.values:
        a_country.drop('index',axis=1, inplace=True)
    a_country = a_country.loc[:, ~a_country.columns.duplicated()]
    # a_country.set_index(['date','country'], inplace=True)
    a_country.rename(columns=semantic_keywords.list, inplace=True)
    semantic_trends_df = pd.concat([semantic_trends_df, a_country], axis=0, ignore_index=True)

semantic_trends_df.to_csv("data/semantic_ol_detailed_partial_4.csv", index=False)

## Neighboring Countries

In [3]:
import pandas as pd
import igraph as ig
import country_converter as coco

# convert unhcr data to network format. To produce the unhcr.csv file, you will need to:
# # drag and drop the data.csv file from geraldine into the data/raw/ folder
# # open the clean_data.ipynb notebook in data/
# # run the section that cleans the unhcr data, which outputs unhcr.csv into data/clean/
unhcr = pd.read_csv('../../data/data.csv', engine='pyarrow').groupby(['iso_o','iso_d']).agg({'newarrival':'sum','contig':'first','Country_o':'first','Country_d':'first', 'island_o':'first'}).reset_index()

df_network = unhcr[unhcr.contig == 1]

graph = ig.Graph.TupleList(df_network[['Country_o','Country_d']].itertuples(index=False), directed=False)

# add island countries 
islands = unhcr.drop_duplicates('Country_o').sort_values('Country_o').Country_o[~unhcr.groupby('Country_o')['contig'].any().values].values

for i in islands:
    v = graph.add_vertex()
    # Set the name or other properties of the added vertex if needed
    v['name'] = i

graph.vs['name'] = coco.convert(graph.vs['name'], to='iso2')

In [113]:
# get country topic ids
country_topic_ids = pd.read_csv('topic_ids/country_topic_ids.csv')
country_topic_ids['iso2'] = coco.convert(country_topic_ids.search, to='iso2')
country_topic_dict = country_topic_ids[['topic_title', 'topic_mid']].set_index('topic_mid')['topic_title'].to_dict()

Get neighboring countries of order 1:

In [101]:
# list of countries
iso2_countries = coco.convert(unhcr.Country_o.unique(), to='iso2')

country_trends_list = []
last_index = iso2_countries.index('KG') + 1
for iso2country in tqdm(iso2_countries[last_index:]):
    # get neighbors of country
    neighboring_countries = graph.vs[graph.neighborhood(iso2country, order=1)]['name'][1:]

    order1_countries = country_topic_ids[country_topic_ids.iso2.isin(neighboring_countries)]

    a_country_trends = get_trends_data(iso2country, order1_countries.topic_mid)
    a_country_trends['country_o'] = iso2country
    # a_country_trends['country_d','city_d'] = order1_countries[['search_keyword','topic_title']]
    country_trends_list.append(a_country_trends)

100%|██████████| 105/105 [06:31<00:00,  3.73s/it]


In [102]:
# combine into a single dataframe
countries_trends_df = pd.DataFrame()
for _, a_country_trends in enumerate(country_trends_list):
    countries_trends_df = pd.concat([countries_trends_df, a_country_trends], axis=0)

# Before writing to a csv, make sure that the output makes sense, I think there should be a lot of nas and a lot of columns, one for each country/city topic. 
# The only ones that aren't na should be the neighboring countries for that country

# countries_trends_df = countries_trends_df.reset_index().rename({'index':'date'},axis=1).set_index(['country', 'date'])
countries_trends_df.to_csv('data/country_topic_trends_1.csv')

Then we can gather trends for countries of order 2, excluding order 1 

(I think we can skp this part for now. There needs to be some distance-based filter as well that omits far away countries, Looking at Afghanistan for example yields too many countries/cities).

This could be obtained by merging countries with the unhcr distance measurments between countries, and omitting countries above a certain threshold.

In [121]:
neighboring_countries_order2 = list(set(graph.vs[graph.neighborhood('AF', order=2)]['name']) - set(graph.vs[graph.neighborhood('AF', order=1)]['name']))

# too many countries for order 2.
country_topic_ids[country_topic_ids.iso2.isin(neighboring_countries_order2)]

Unnamed: 0,search,topic_title,topic_type,topic_mid,iso2
6,Armenia,Armenia,Country in Asia,/m/0jgx,AM
10,Azerbaijan,Azerbaijan,Country,/m/0jhd,AZ
19,Bhutan,Bhutan,Country in South Asia,/m/07bxhl,BT
73,Hong Kong SAR,Hong Kong,Special administrative regions of China,/m/03h64,HK
76,India,India,Country in South Asia,/m/03rk0,IN
79,Iraq,Iraq,Country in the Middle East,/m/0d05q4,IQ
86,Kazakhstan,Kazakhstan,Country in Central Asia,/m/047lj,KZ
92,Kyrgyz Republic,Kyrgyzstan,Country in Central Asia,/m/0jt3tjf,KG
93,Lao P.D.R.,Laos,Country in Asia,/m/04hhv,LA
101,Macao SAR,Macao,Special administrative regions of China,/m/04thp,MO


We can also gather trends for countries that are relevant but are not directly connected (i.e., Dominican Republic for Venezuela)

- For South American and Latin American countries, let’s say we add Spain, Chile, Argentina, USA, and Dominican Republic.
- For African countries + middle east, include the top 6 most receptive countries in Europe (Germany, France, Great Britain, Sweden, Austria, Hungary, Italy or something). In Carramia et al they also added likely countries that peopel in Africa would have to pass through to get to Europe which could be interesting to consider.
- China and India we can add the US.

There is room to refine this. Not sure how broad/specific this should be.

## Neighboring Cities

In [122]:
# get city topic ids
city_topic_ids = pd.read_csv('topic_ids/city_topic_id.csv')
city_topic_ids['iso2'] = coco.convert(city_topic_ids.search_country, to='iso2')

First we can gather all the cities of neighboring countries of order 1

In [123]:
# list of countries
iso2_countries = coco.convert(unhcr.Country_o.unique(), to='iso2')

city_trends_list = []
for iso2country in tqdm(iso2_countries):
    # get neighbors of country
    neighboring_countries = graph.vs[graph.neighborhood(iso2country, order=1)]['name'][1:]

    order1_cities = city_topic_ids[city_topic_ids.iso2.isin(neighboring_countries)]

    a_country_trends = get_trends_data(iso2country, order1_cities.topic_mid)
    a_country_trends['country_o'] = iso2country
    # a_country_trends['country_d','city_d'] = order1_cities[['search_keyword','topic_title']]
    city_trends_list.append(a_country_trends)

100%|██████████| 196/196 [2:30:52<00:00, 46.19s/it]   


In [148]:
city_topic_ids['citycountry'] = city_topic_ids.topic_title + ', ' + city_topic_ids.search_country
city_dict = city_topic_ids.set_index('topic_mid')['citycountry'].to_dict()

In [153]:
city_trends_df = pd.DataFrame()
for idx, city_trends in enumerate(city_trends_list):
    city_trends = city_trends.loc[:, ~city_trends.columns.duplicated()].copy()
    city_trends.rename(columns=city_dict, inplace=True)
    city_trends_df = pd.concat([city_trends_df, city_trends], axis=0)
#semantic_trends_df = semantic_trends_df.reset_index().rename({'index':'date'},axis=1).set_index(['country', 'date'])

# Before writing to a csv, make sure that the output makes sense, I think there should be a lot of nas and a lot of columns, one for each country/city topic. 
# The only ones that aren't na should be the neighboring countries for that country

city_trends_df.to_csv('data/city_topic_trends_1.csv')

Then we can gather trends for countries of order 2, excluding order 1 

In [62]:
neighboring_countries = list(set(graph.vs[graph.neighborhood('AF', order=2)]['name']) - set(graph.vs[graph.neighborhood('AF', order=1)]['name']))
order2_cities = city_topic_ids[city_topic_ids.iso2.isin(neighboring_countries)]

Unnamed: 0,search_country,search_keyword,topic_title,topic_type,topic_mid,iso2
3,India,Mumbai,Mumbai,City in India,/m/04vmp,IN
7,India,Delhi,Delhi,City in India,/m/09f07,IN
8,Russian Federation,Moscow,Moscow,Capital of Russia,/m/04swd,RU
12,Viet Nam,Ho Chi Minh City,Ho Chi Minh City,City in Vietnam,/m/0hn4h,VN
16,Viet Nam,Hanoi,Hanoi,Capital of Vietnam,/m/0fnff,VN
20,Iraq,Baghdad,Baghdad,Capital of Iraq,/m/01fqm,IQ
27,Russian Federation,Saint Petersburg,Saint Petersburg,City in Russia,/m/06pr6,RU
41,Turkey,Ankara,Ankara,Capital of Turkey,/m/0jyw,TR
67,Kazakhstan,Almaty,Almaty,City in Kazakhstan,/m/0151s1,KZ
86,Nepal,Kathmandu,Kathmandu,Capital of Nepal,/m/04cx5,NP


In [None]:
# list of countries
iso2_countries = coco.convert(unhcr.Country_o.unique(), to='iso2')

city_trends_list = []
for iso2country in tqdm(iso2_countries):
    # get neighbors of country of order 2
    neighboring_countries = list(set(graph.vs[graph.neighborhood(iso2country, order=2)]['name']) - set(graph.vs[graph.neighborhood(iso2country, order=1)]['name']))
    order2_cities = city_topic_ids[city_topic_ids.iso2.isin(neighboring_countries)]

    a_country_trends = get_trends_data(iso2country, order1_cities.topic_mid)
    a_country_trends['country_o'] = iso2country
    a_country_trends['country_d','city_d'] = order1_cities[['search_keyword','topic_title']]
    city_trends_list.append(a_country_trends)

## Interactions for every country - example with only one keyword (OL)

In [10]:
keywords = ["visa"]

In [11]:
countries = pd.read_csv('../../data/data.csv', engine='pyarrow').drop_duplicates('iso_o').Country_o
import country_converter as coco
iso2_countries = coco.convert(countries, to='iso2')

In [12]:
# list of all unique languages:
unique_languages = pd.Series(list(set(list(itertools.chain(*country_language_dict.values())))), name='language')

# list of language codes from googletrans
langcodes = pd.DataFrame.from_dict(LANGCODES, orient='index', columns=['code'])
langcodes.index = langcodes.index.str.capitalize()

refugee_lang = unique_languages.to_frame().merge(langcodes, left_on='language', right_index=True, how='left')

refugee_lang.dropna(inplace=True)

refugee_lang_not_en = refugee_lang[refugee_lang['code'] != 'en']




In [13]:
from itertools import product
tosearch = [f"{x} {y}" for x, y in product(keywords, countries)]


In [31]:
translated_keywords = refugee_lang_not_en['code'].swifter.apply(lambda x: trends_helpers.translate_keywords_list(lst = tosearch, lang= x))

Pandas Apply:   0%|          | 0/83 [00:00<?, ?it/s]

In [32]:
columns = list(refugee_lang_not_en['code'])

df = pd.concat([pd.DataFrame(sublist, columns=[col]) for sublist, col in zip(translated_keywords, columns)], axis=1)

df['en']=tosearch

df.head()


Unnamed: 0,be,sk,jw,he,lv,cy,az,bn,ru,it,...,ja,ko,so,af,sm,sn,am,ceb,tr,en
0,віза ў афганістан,vízum do afganistanu,visa afghanistan,ויזה לאפגניסטן,vīza afganistāna,fisa afghanistan,əfqanıstana viza,আফগানিস্তানের ভিসা,виза афганистан,visto afghanistan,...,アフガニスタンビザ,비자 아프가니스탄,fiisaha afgaanistaan,visum afghanistan,visa afghanistan,visa afghanistan,ቪዛ አፍጋኒስታን,visa sa afghanistan,afganistan vizesi,visa Afghanistan
1,віза албанія,vízum do albánska,visa albania,ויזה לאלבניה,vīza albānijai,fisa albania,viza albaniya,আলবেনিয়ার ভিসা,виза албания,visto albania,...,アルバニアビザ,비자 알바니아,fiisaha albania,visum albanië,visa albania,visa albania,ቪዛ አልባኒያ,visa sa albania,vize arnavutluk,visa Albania
2,віза алжыр,vízum do alžírska,visa aljazair,ויזה לאלג'יריה,vīza alžīrija,fisa algeria,əlcəzair vizası,ভিসা আলজেরিয়া,виза алжир,visto algeria,...,アルジェリアのビザ,비자 알제리,fiisaha algeria,visum algerië,visa algeria,visa algeria,ቪዛ አልጄሪያ,visa sa algeria,cezayir vizesi,visa Algeria
3,віза ў андору,vízum do andorry,visa andorra,ויזה אנדורה,vīza andora,fisa andorra,viza andorra,ভিসা অ্যান্ডোরা,виза андорра,visto andorra,...,アンドラビザ,비자 안도라,fiisaha andorra,visum andorra,visa andorra,visa andorra,ቪዛ andorra,visa sa andorra,vize andorra,visa Andorra
4,віза ангола,víza angola,visa angola,ויזה אנגולה,vīza uz angolu,fisa angola,viza anqola,ভিসা অ্যাঙ্গোলা,виза ангола,visto angola,...,アンゴラビザ,비자 앙골라,fiisaha angola,visum angola,visa angola,visa angola,ቪዛ አንጎላ,visa sa angola,vize angola,visa Angola


In [33]:
# I'll only keep two original languages per country

country_language_dict_2 = {}

for key, values in country_language_dict.items():
    country_language_dict_2[key] = values[:2] 

max_length = max(map(len, country_language_dict_2.values()))

data_padded = {key: arr + [np.nan] * (max_length - len(arr)) for key, arr in country_language_dict_2.items()}

langs = pd.DataFrame(data_padded)
langs = langs.T
langs = langs.reset_index()

langs = langs.rename(columns={'index': 'Country', 0:'lang1', 1:'lang2', 2:'lang3'})

# Apply the function to the 'Country' column
langs['ISO2'] = langs['Country'].apply(trends_helpers.get_iso2_country_code)

langs = langs.drop(columns=["Country"])
langs_long = pd.melt(langs, id_vars=['ISO2'], var_name='numlang', value_name='lang')
langs_long = langs_long.dropna()

langs_long = pd.merge(langs_long, refugee_lang, left_on="lang", right_on="language")

In [34]:
langs_long

Unnamed: 0,ISO2,numlang,lang,language,code
0,AF,lang1,Pashto,Pashto,ps
1,AL,lang1,Albanian,Albanian,sq
2,RS,lang1,Albanian,Albanian,sq
3,MK,lang2,Albanian,Albanian,sq
4,DZ,lang1,Arabic,Arabic,ar
...,...,...,...,...,...
235,RS,lang2,Hungarian,Hungarian,hu
236,SK,lang2,Hungarian,Hungarian,hu
237,ZA,lang2,Xhosa,Xhosa,xh
238,LK,lang2,Tamil,Tamil,ta


In [11]:
# country_trends_list = [] 
# last_index = 1
# for i, country in tqdm(enumerate(langs_long["ISO2"])):
#     print("Searching: " + country)
#     languagecode= langs_long["code"][i]
 
#     a_country_trends = get_trends_data(country, df2[languagecode])
#     a_country_trends['country'] = country
#     country_trends_list.append(a_country_trends)

#     updated_data_list = a_country_trends.copy()
#     updated_data_list = trends_helpers.update_column_names(updated_data_list, languagecode, wide=False)
#     updated_data_list = pd.DataFrame(updated_data_list)
#     pd.to_csv(updated_data_list, index=False)
    

# This approach is not ideal for these long searches bc we are not able to break the loop until a new country finishes
    

Nested loop approach (better to manually interrupt it at any time)

In [15]:
count_files=trends_helpers.file_counter("trends_downloads/interactions_full/")
print("{:.1f}% done.".format((count_files+55)/(len(langs_long["ISO2"]))*100))

97.5% done.


In [36]:
from pytrends.request import TrendReq
import time


# Set up the Google Trends API object
# pytrends = TrendReq()

results = pd.DataFrame() 

# Loop through regions and keywords

count_files=trends_helpers.file_counter("trends_downloads/interactions_full/")


# for i, country in enumerate(langs_long["ISO2"][count_files+55:], start=count_files+55): # update with new count_files values
for i, country in enumerate(langs_long["ISO2"][:1]):
    
    languagecode= langs_long["code"][i]

    for keyword in df[languagecode]:    
        print("Searching: " + str(keyword) + " in " + str(country))
        time.sleep(4)
        
        pytrends = TrendReq(tz=360, timeout=(10, 25), retries=2, backoff_factor=0.5,
            requests_args={'headers': {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}})
        
        # Build the payload with the selected region, keyword and date range
        pytrends.build_payload(kw_list=[keyword], timeframe='2005-01-01 2022-12-31', geo=str(country))
        # Get the interest over time data for the selected keyword
        interest_over_time_df = pytrends.interest_over_time()

        # Check if the DataFrame is not empty
        if not interest_over_time_df.empty:
            interest_over_time_df= interest_over_time_df.reset_index()
            interest_over_time_df['trends_index'] = interest_over_time_df.iloc[:, 1] 
            interest_over_time_df = interest_over_time_df.drop(interest_over_time_df.columns[1], axis=1)

            interest_over_time_df = interest_over_time_df.drop(columns=["isPartial"], axis=1)
            interest_over_time_df["keyword"]=keyword
            interest_over_time_df["region"] = country
            
            # Append the results to the main dataframe
            results = pd.concat([results, interest_over_time_df])
    
    if not results.empty:
        results_country = results[results["region"]==country]
        if not results_country.empty:
            translation = {'keyword': df[langs_long["code"][i]], 'keyword_en': df["en"]}
            translation = pd.DataFrame(translation)
            results_country = pd.merge(results_country, translation, left_on="keyword", right_on="keyword", how="left")
            file_path = 'trends_downloads/interactions_full/' + str(country) + '_' + languagecode + '.csv'
            results_country.to_csv(file_path, index=False)
        else:
            count_files = count_files+1 
    else:
        count_files = count_files+1
    time.sleep(300) # Wait a few minutes until starting with next country 

Searching: د افغانستان ویزه in AF
Searching: ویزه البانیا in AF
Searching: ویزه الجزایر in AF
Searching: ویزه اندورا in AF
Searching: ویزه انګولا in AF
Searching: ویزه انټيګوا او باربودا in AF
Searching: د ارجنټاین ویزه in AF
Searching: د ارمنستان ویزه in AF
Searching: ویزه اروبا in AF
Searching: ویزه استرالیا in AF
Searching: ویزه اتریش in AF
Searching: د اذربایجان ویزه in AF
Searching: د بهاماس ویزه in AF
Searching: ویزه بحرین in AF
Searching: د بنګله دیش ویزه in AF
Searching: ویزه بارباډوس in AF
Searching: بیلاروس ویزه in AF
Searching: د بلجیم ویزه in AF
Searching: بیلیز ویزه in AF
Searching: ویزه بینین in AF
Searching: د بوتان ویزه in AF
Searching: د بولیویا ویزه in AF
Searching: د بوسنیا او هرزیګوینا ویزه in AF
Searching: ویزه بوټسوانا in AF
Searching: د برازیل ویزه in AF
Searching: د برونای دارالسلام ویزه in AF
Searching: د بلغاریا ویزه in AF
Searching: د بورکینا فاسو ویزه in AF
Searching: ویزه برونډي in AF
Searching: ویزه cabo verde in AF
Searching: کمبوډیا ویزه in AF
Searching:

Importing all the files and compiling them into one:

In [2]:
import pandas as pd
import glob
import os

In [43]:
folder_path = "trends_downloads/interactions_full/"

dataframes = []

for filename in os.listdir(folder_path):
    if filename.endswith('.csv'):
        file_path = os.path.join(folder_path, filename)
        
        country = filename[:2]
        language = filename[3:5]
        
        df = pd.read_csv(file_path)
        df['country'] = country
        df['language'] = language
        
        dataframes.append(df)

combined_df = pd.concat(dataframes, ignore_index=True)


In [44]:
combined_df = combined_df.drop_duplicates()
combined_df.shape

(878040, 7)

In [45]:
cn_path = os.path.join(folder_path, "CN_zh-cn.csv")

country = "CN"
language = "zh-cn"
df = pd.read_csv(file_path)
df['country'] = country
df['language'] = language
        


In [46]:
combined_df = pd.concat([combined_df, df], ignore_index=True)
combined_df.region.value_counts()

region
NG    69768
US    52920
CA    36072
GB    30888
AE    29808
      ...  
CF      216
PK      216
ST      216
GQ      216
SB      216
Name: count, Length: 141, dtype: int64

In [47]:
af_path = os.path.join(folder_path, "AF_interactions.txt")

country = "TH"
language = "?"
df = pd.read_csv(file_path)
df['country'] = country
df['language'] = language

In [49]:
combined_df = pd.concat([combined_df, df], ignore_index=True)
combined_df  = combined_df.drop_duplicates()
combined_df.shape

(878472, 7)

In [50]:
combined_df.to_csv(os.path.join(folder_path, "interactions visa.csv"), index=False)