David Culhane
<br>
**Collecting and Processing API Data on Inflation**
<br>
<br>
The last data component this project will be working with is data acquired via API call. Since this term project is looking at inflation rates around the world over the years, the data acquired through API call will pertain to inflation. Our flat data source had quarterly values of the consumer price index according to the World Bank and our web source was a wikitable of annual changes in the value of the consumer price index according to the United Nations. 
<br>
<br>
This doesn't leave many options for additional inflation data beyond contextual information. That context can be found using NewsAPI, a REST API that can be used to search the internet for news articles. I have obtained a key for NewsAPI and can search for articles pertaining to inflation by using keywords and publication timeframes. After acquiring the articles, TFIDF vectorization can be applied to find the important words in them. This can be done for each country and the results - the TFIDF matrices and lists of important words - can be stored in a dataframe.
<br>
<br>
***Part 1: Helper Functions for Acquiring and Processing the Data from NewsAPI***
<br>
<br>
The first thing to do is to build a function that calls NewsAPI and acquires articles for processing. This function should take a list of keywords (country, inflation) and an optional timeframe (using YYYY-MM-DD format) as search parameters. The response from NewsAPI gives a list of articles in JSON format with information on the publisher, title, description of the article, link to the article, when the article was published, and the start of the article.

In [2]:
# Libraries to Import for use throughout the script
import requests
import urllib.parse, urllib.error
import json
from bs4 import BeautifulSoup
import unicodedata
import sys
from nltk.tokenize import word_tokenize
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np

In [3]:
# Writing a function to call NewsAPI for articles matching a list of keywords
def newsapi_crier(keywords, time_start=None, time_end=None):
    # Making the Query URL
    base_url = 'https://newsapi.org/v2/everything?'
    apikey = 'apiKey=*******************************'
    # Optional publication window, using YYYY-MM-DD format
    search_start = 'from=' + str(time_start) + '&' if time_start is not None else ''
    search_end = 'to=' + str(time_end) + '&' if time_end is not None else ''
    # To make sure the keywords are combined correctly, we join them with AND between them
    keystring = ' AND '.join(str(word) for word in keywords)
    if len(keywords) > 1:
        keystring = '(' + keystring + ')'
    q = urllib.parse.urlencode({'q' : keystring}) + '&'
    query_url = base_url + q + search_start + search_end + 'language=en&' + 'sortBy=relevancy&' + apikey
    # Making the Query for articles
    try:
        response = requests.get(query_url, timeout=30)
        news = json.loads(response.text)  # Parsed JSON response as a dictionary
        return news
    except (requests.exceptions.RequestException, ValueError) as e:
        # requests raises its own exceptions (not urllib's); ValueError covers a body that is not valid JSON
        print(f"Error: {e}")
        return None
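
As a point of comparison with the string concatenation above, here is a minimal sketch that lets `urllib.parse.urlencode` assemble every parameter at once. The helper name `build_query_url` and the placeholder key `YOUR_KEY` are assumptions made for illustration; the parameter names are the ones already used in `newsapi_crier`.

```python
from urllib.parse import urlencode

def build_query_url(keywords, time_start=None, time_end=None, api_key="YOUR_KEY"):
    """Assemble a NewsAPI 'everything' URL; api_key is a placeholder, not a real key."""
    # Join the keywords with AND and wrap them in parentheses when there are several
    query = " AND ".join(str(word) for word in keywords)
    if len(keywords) > 1:
        query = "(" + query + ")"
    params = {"q": query}
    # Only include the optional YYYY-MM-DD bounds when they are supplied
    if time_start is not None:
        params["from"] = time_start
    if time_end is not None:
        params["to"] = time_end
    params.update({"language": "en", "sortBy": "relevancy", "apiKey": api_key})
    # urlencode escapes spaces and parentheses and inserts the '&' separators
    return "https://newsapi.org/v2/everything?" + urlencode(params)

url = build_query_url(["Colombia", "inflation"], time_start="2024-07-01")
print(url)
```

Building the query from a dictionary avoids having to track trailing `&` characters by hand, at the cost of one extra helper.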

The newsapi_crier function returns the parsed JSON response as a dictionary; its 'articles' entry is a list of dictionaries, one per article found in the API call. One of the fields for each article is url, a link to the article itself. Once we have a set of articles, we will want to go to the article at the url and scrape its text. It makes sense to do that with a separate function.
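
To make the shape of that return value concrete, the sketch below uses a hand-written stand-in for a parsed response (the headlines and example.com links are invented, not real results) and pulls out the article links. The helper name `article_urls_from` is hypothetical.

```python
# Hand-written stand-in for a parsed NewsAPI response; the values are invented
sample_news = {
    "status": "ok",
    "totalResults": 2,
    "articles": [
        {
            "source": {"id": None, "name": "Example Source"},
            "title": "Example headline A",
            "description": "Short summary A",
            "url": "https://example.com/a",
            "publishedAt": "2024-07-19T12:00:00Z",
            "content": "Start of article A",
        },
        {
            "source": {"id": None, "name": "Example Source"},
            "title": "Example headline B",
            "description": "Short summary B",
            "url": "https://example.com/b",
            "publishedAt": "2024-07-20T08:30:00Z",
            "content": "Start of article B",
        },
    ],
}

def article_urls_from(news, limit=10):
    """Return up to `limit` article links, tolerating a failed or empty query."""
    articles = news.get("articles", []) if news else []
    return [item["url"] for item in articles if item.get("url")][:limit]

print(article_urls_from(sample_news))
```

Guarding against a missing 'articles' entry matters because an error response or a failed request would otherwise stop the run with an exception.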

In [5]:
# Writing a function to scrape the news text from an article by passing the article's url
def news_scraper(url):
    try:
        response = requests.get(url, timeout=30)  # Gets the HTML data from the supplied url
        if response.status_code == 200:
            # response.text decodes the body using the encoding that requests detected
            soup = BeautifulSoup(response.text, 'html.parser')  # Makes the soup
            paragraphs = soup.find_all('p')  # Finds all of the text using <p> tags
            # Sometimes, a line/paragraph can be repeated. We don't need those repeats,
            # so each paragraph's text is added to the list only once.
            lines = []
            for paragraph in paragraphs:
                text = paragraph.getText()
                if text not in lines:  # Compares text with text so duplicates are actually caught
                    lines.append(text)
            # Bringing all the lines/paragraphs of the article together as a single string
            article = ' '.join(lines)
            return article
        else:
            print('Unable to scrape from URL. Status Code:', response.status_code)
            return 400  # Sentinel value
    except requests.exceptions.RequestException:  # Connection problems, timeouts, invalid URLs, etc.
        return 400
    except Exception:  # Catch-all so that one bad page cannot stop the whole run
        return 400

news_scraper will be able to scrape the text from article urls fed into it and process it into a single string. Ideally, this string can be stored in a list of strings with each string being an article. To get closer to TFIDF vectorization though, we will need to process the text itself. All of the text will need to be of the same case, have punctuation and stop words removed, and then the words will need to be stemmed. Another helper function can be written to do that.

In [7]:
# Writing a function to process the scraped text
def text_processor(text):
    # Changing all text to lowercase
    lower = text.lower()  
    # Removing punctuation
    punctuation = dict.fromkeys(
        (i for i in range(sys.maxunicode)
        if unicodedata.category(chr(i)).startswith('P')
        ),
        ' ')
    runon = lower.translate(punctuation)
    # Tokenizing the words in the text
    tokens = word_tokenize(runon)
    # Updating stop words
    stop_words = stopwords.words('english')  
    # Removing stop words
    stopped = []
    for word in tokens:
        if word not in stop_words:
            stopped.append(word)  # Only adds the word if it is not a stop word
        else:
            continue
    # Stemming the list of words
    porter = PorterStemmer()
    stems = []
    for word in stopped:
        stem = porter.stem(word)
        stems.append(stem)
    # Recombining the stems into one string
    processed = str()
    for stem in stems:
        processed = processed + stem + ' '
    return processed
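
The punctuation step can be examined on its own. The sketch below builds the same translation table once, outside the function, so it is not rebuilt for every article; the names `PUNCT_TO_SPACE` and `strip_punctuation` and the sample sentence are assumptions made for illustration.

```python
import sys
import unicodedata

# Built a single time: maps every Unicode punctuation code point to a space
PUNCT_TO_SPACE = dict.fromkeys(
    (i for i in range(sys.maxunicode + 1)
     if unicodedata.category(chr(i)).startswith("P")),
    " ",
)

def strip_punctuation(text):
    """Lowercase the text and replace each punctuation character with a space."""
    return text.lower().translate(PUNCT_TO_SPACE)

sample = "Prices rose 3.2%, said the U.S. bank!"
tokens = strip_punctuation(sample).split()
print(tokens)
```

Two side effects are worth keeping in mind when reading the top words later: decimal numbers and abbreviations are split into separate pieces ("3.2" becomes "3" and "2"), and symbols such as "$" are left in place because Unicode classifies them as symbols rather than punctuation.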

The text_processor function takes a body of text and makes everything lowercase, strips punctuation, tokenizes, removes stop words, stems the remaining words, and puts the result back together. Once the desired number of articles has been processed, the list of processed texts is ready for TFIDF vectorization. Once that's been done, we will want the top words. A separate helper function can do that.

In [9]:
# Writing a function to apply TFIDF Vectorization to the processed text
def top_words_tfidf(corpus):
    tfidf = TfidfVectorizer()  # Instantiates the TFIDF Vectorizer
    tfidf_array = tfidf.fit_transform(corpus).toarray()  # Runs TFIDF on the corpus and turns it into an array
    feature_names = tfidf.get_feature_names_out()  # Creates a list of feature names (the vocabulary) for a dataframe
    df_tfidf = pd.DataFrame(tfidf_array, columns = feature_names)  # Creates the dataframe holding words and TFIDF values
    top_words = df_tfidf.max().sort_values(ascending=False).index[:10].to_list()  # Selects the 10 words with the largest TFIDF value in any article
    return top_words  # Returns list of words
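
To show what this ranking rewards, here is a dependency-free sketch of the weighting. It assumes scikit-learn's documented defaults (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each document) and, as a simplification, splits on whitespace instead of using the vectorizer's own tokenizer. The function names and the toy corpus are invented for illustration.

```python
import math
from collections import Counter

def tfidf_rows(corpus):
    """Return one {term: weight} dict per document, L2-normalized."""
    docs = [text.split() for text in corpus]
    n_docs = len(docs)
    # Number of documents each term appears in
    doc_freq = Counter(term for doc in docs for term in set(doc))
    rows = []
    for doc in docs:
        counts = Counter(doc)
        weights = {term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
                   for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()} if norm else {})
    return rows

def top_words(corpus, n=10):
    """Rank terms by their largest weight in any single document."""
    best = {}
    for row in tfidf_rows(corpus):
        for term, weight in row.items():
            best[term] = max(best.get(term, 0.0), weight)
    return sorted(best, key=lambda term: (-best[term], term))[:n]

toy_corpus = ["inflat price rise", "inflat bank rate", "inflat wage"]
ranked = top_words(toy_corpus)
print(ranked)
```

In this toy corpus the word shared by every document ("inflat") ranks last, while the distinctive word in the shortest document ("wage") ranks first. Taking the maximum over articles therefore favors terms concentrated in a single article, which is one plausible reason off-topic words can surface in a country's list.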

With the helper functions in place, the actual work can begin on organizing and storing the information in a dataframe that can be merged with the dataframes from the previous milestones.
<br>
<br>
***Part 2: Creating the Dataframe***
<br>
<br>
Now that we have functions that can grab news articles and process their text, we need to actually create a dataframe to store data in. Previous milestones used CSVs or a web table to supply the list of countries. The initial inflation dataset can be loaded in order to get that list, which can then be the basis of a new dataframe where data from these articles can be stored. The problem that arises is that NewsAPI allows for only 100 queries in 24 hours and a maximum of 50 in a 12 hour period. This means that for this assignment, we need to whittle the list of countries down to no more than *50* countries. The final product of the flat dataset, hcpiqr_129, can be loaded in and become the basis of the news words dataframe.

In [11]:
# Loading the inflation dataframe
hcpiqr_129 = pd.read_csv('hcpiqr_129.csv')

***Transformation 1: Dropping NAs***
<br>
<br>
The intention of the hcpiqr_129 dataframe was to allow for a large number of countries to be analyzed and allow for countries that had reported enough data since they came into existence to be included. The problem is that this dataframe has 126 countries listed in it, more than the 50 we can possibly search for news articles at a single time without taking breaks and storing data (a major inconvenience for academic assignments whose grading depends on the code working as intended at its initial running). To cut down on the number of countries in the dataframe, we can drop any/all countries with NA appearing in their data. It's unfortunate but necessary.

In [13]:
# Dropping NAs from hcpiqr_129
no_nas = hcpiqr_129.dropna(ignore_index=True)
no_nas

Unnamed: 0.1,Unnamed: 0,country,region,1970/3,1970/6,1970/9,1970/12,1971/3,1971/6,1971/9,...,2021/9,2021/12,2022/3,2022/6,2022/9,2022/12,2023/3,2023/6,2023/9,2023/12
0,6,Argentina,Latin America and the Caribbean,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,512.33,563.49,638.17,754.37,910.07,1080.86,1288.95,1606.81,2055.76,2948.51
1,9,Australia,Australia and New Zealand,8.92,9.01,9.1,9.29,9.38,9.47,9.75,...,111.17,112.65,115.07,117.11,119.25,121.48,123.15,124.17,125.66,126.4
2,10,Austria,Western Europe,22.48,22.69,23.02,23.23,23.45,23.69,24.17,...,111.52,113.29,115.59,119.16,122.44,125.29,127.63,129.69,130.77,132.1
3,17,Belgium,Western Europe,19.08,19.25,19.38,19.47,19.74,20.0,20.26,...,111.54,114.16,117.96,119.96,123.0,126.81,126.35,125.94,127.35,127.85
4,21,Bolivia,Latin America and the Caribbean,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,151.03,151.03,151.86,152.5,153.79,155.65,156.03,156.72,158.2,158.66
5,26,Burkina Faso,Sub-Saharan Africa,14.85,15.25,15.45,14.54,14.57,15.51,15.68,...,114.96,118.52,121.84,132.03,136.23,133.04,128.75,131.66,133.65,132.98
6,27,Burundi,Sub-Saharan Africa,1.67,1.66,1.66,1.67,1.71,1.72,1.72,...,213.21,218.96,230.08,247.0,255.54,274.21,298.9,319.44,324.8,334.93
7,30,Cameroon,Sub-Saharan Africa,7.75,7.91,7.95,7.91,8.1,8.08,8.2,...,124.85,125.84,128.15,130.97,133.56,135.59,138.36,141.43,143.37,144.11
8,31,Canada,Northern America,15.96,16.04,16.17,16.09,16.17,16.38,16.67,...,112.67,113.8,116.14,119.73,120.73,121.39,122.12,123.94,125.18,125.28
9,35,Chile,Latin America and the Caribbean,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,120.34,123.79,127.16,132.1,136.77,139.85,142.15,143.63,144.47,146.27


We now have 56 countries to work with, still too many. At least 6 more countries need to be dropped.
<br>
<br>
***Transformation 2: Randomly Removing Countries***
<br>
<br>
With at least 6 additional countries to remove, a criterion for removal has to be decided upon. The best way to do it without any sort of bias is to do it randomly. A random number generator can be used to draw 6 distinct row indices between 0 and 55, and those rows can then be dropped.

In [15]:
# Generating the random indices without replacement, so that exactly 6 different rows are dropped
drops = np.random.default_rng().choice(len(no_nas), size=6, replace=False)
drops

array([ 9, 30, 27, 37, 34, 21], dtype=int64)

In [16]:
# Dropping the countries with the indices generated
fifty_countries = no_nas.drop(drops)
fifty_countries

Unnamed: 0.1,Unnamed: 0,country,region,1970/3,1970/6,1970/9,1970/12,1971/3,1971/6,1971/9,...,2021/9,2021/12,2022/3,2022/6,2022/9,2022/12,2023/3,2023/6,2023/9,2023/12
0,6,Argentina,Latin America and the Caribbean,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,512.33,563.49,638.17,754.37,910.07,1080.86,1288.95,1606.81,2055.76,2948.51
1,9,Australia,Australia and New Zealand,8.92,9.01,9.1,9.29,9.38,9.47,9.75,...,111.17,112.65,115.07,117.11,119.25,121.48,123.15,124.17,125.66,126.4
2,10,Austria,Western Europe,22.48,22.69,23.02,23.23,23.45,23.69,24.17,...,111.52,113.29,115.59,119.16,122.44,125.29,127.63,129.69,130.77,132.1
3,17,Belgium,Western Europe,19.08,19.25,19.38,19.47,19.74,20.0,20.26,...,111.54,114.16,117.96,119.96,123.0,126.81,126.35,125.94,127.35,127.85
4,21,Bolivia,Latin America and the Caribbean,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,151.03,151.03,151.86,152.5,153.79,155.65,156.03,156.72,158.2,158.66
5,26,Burkina Faso,Sub-Saharan Africa,14.85,15.25,15.45,14.54,14.57,15.51,15.68,...,114.96,118.52,121.84,132.03,136.23,133.04,128.75,131.66,133.65,132.98
6,27,Burundi,Sub-Saharan Africa,1.67,1.66,1.66,1.67,1.71,1.72,1.72,...,213.21,218.96,230.08,247.0,255.54,274.21,298.9,319.44,324.8,334.93
7,30,Cameroon,Sub-Saharan Africa,7.75,7.91,7.95,7.91,8.1,8.08,8.2,...,124.85,125.84,128.15,130.97,133.56,135.59,138.36,141.43,143.37,144.11
8,31,Canada,Northern America,15.96,16.04,16.17,16.09,16.17,16.38,16.67,...,112.67,113.8,116.14,119.73,120.73,121.39,122.12,123.94,125.18,125.28
10,37,Colombia,Latin America and the Caribbean,0.13,0.14,0.14,0.14,0.15,0.15,0.16,...,128.1,129.37,134.27,138.59,141.97,145.71,152.12,155.82,158.16,160.23


***Transformation 3: Creating the New Dataframe***
<br>
<br>
Up until this point, we have been working with "prior" data. This was necessary because making API calls for contextual data requires inputs, and one of the inputs we need is a list of countries to work with. So we whittled the list of countries down to 50 in order to work within NewsAPI's limit of 50 queries per 12 hour period. This list of countries can be used to create the new dataframe that will hold the news information.

In [18]:
# Creating the list of countries
countries = fifty_countries['country'].to_list()

# Creating the new dataframe
news_data = pd.DataFrame({'Countries' : countries})
news_data.tail()

Unnamed: 0,Countries
45,Sweden
46,Switzerland
47,United Kingdom
48,United States
49,Uruguay


With the new dataframe created, we can work on acquiring the news data. Ideally, we can store the links that allowed scraping as well as the top words in the set of scraped articles according to TFIDF.

In [20]:
# Initializing lists for the news_data dataframe
article_urls = []  # Space to store lists of article urls for each country
keywords = [] # Space to store lists of important words from tfidf

# Looping through the countries to acquire recent news information on each country and inflation according to NewsAPI.
for country in countries:
    news = newsapi_crier([country, 'inflation'])  # Querying for the country and inflation
    articles = news.get('articles', []) if news else []  # Indexing for the articles; empty if the query failed
    country_urls = []  # Initializing a list to hold the urls used for scraping, processing, and tfidf
    processed_texts = []  # Initializing a list to hold the processed texts
    for item in articles:  # Looping through the articles to scrape text from urls and process the text
        if len(country_urls) == 10:  # Stops once 10 articles have been scraped successfully
            break
        url = item['url']  # Indexes for the url
        article = news_scraper(url)  # Tries to scrape from the url
        if article != 400:  # If scrape is successful
            country_urls.append(url)  # Stores the URL
            processed_texts.append(text_processor(article))  # Processes and stores the text
    # If fewer than 10 urls could be scraped, the loop simply ends with whatever was collected
    if processed_texts:  # Making sure that there are texts to work with
        try:
            top_words = top_words_tfidf(processed_texts)  # Applies TFIDF Vectorization and extracts top words as a list
        except ValueError:  # Raised by TfidfVectorizer when the texts contain no usable vocabulary
            top_words = np.nan
        article_urls.append(country_urls)  # Storing the urls used for TFIDF for the news_data dataframe
        keywords.append(top_words)  # Storing the top words from TFIDF for the news_data dataframe
    else:  # If there are no usable articles, append NA to article_urls and keywords
        article_urls.append(np.nan)
        keywords.append(np.nan)

Unable to scrape from URL. Status Code: 403

***Transformation 4: Adding the URLs Used for Each Country***
<br>
<br>
Now that we have the list of URLs used for scraping, we can add them to the news_data dataframe.

In [22]:
# Adding the URL field
news_data['URLs'] = article_urls
news_data.head()

Unnamed: 0,Countries,URLs
0,Argentina,[https://www.semafor.com/article/07/19/2024/ar...
1,Australia,[https://gizmodo.com/costco-raising-membership...
2,Austria,[https://money.com/how-bad-is-inflation-under-...
3,Belgium,[https://qz.com/national-french-fry-day-deal-m...
4,Bolivia,[https://www.bbc.com/news/articles/c25lpxzw9wl...


***Transformation 5: Adding the Top Words Found for Each Country***
<br>
<br>
We can also add the list of top words found for each country to the news_data dataframe.

In [24]:
news_data['Top Words'] = keywords
news_data

Unnamed: 0,Countries,URLs,Top Words
0,Argentina,[https://www.semafor.com/article/07/19/2024/ar...,"[soon, come, mortgag, inflat, price, crypto, a..."
1,Australia,[https://gizmodo.com/costco-raising-membership...,"[wage, drink, depart, membership, prime, store..."
2,Austria,[https://money.com/how-bad-is-inflation-under-...,"[soon, music, come, inflat, page, startribun, ..."
3,Belgium,[https://qz.com/national-french-fry-day-deal-m...,"[fri, email, expect, invest, 12, leader, kagam..."
4,Bolivia,[https://www.bbc.com/news/articles/c25lpxzw9wl...,"[best, biden, damodaran, aswath, credit, jul, ..."
5,Burkina Faso,"[https://removed.com, https://www.project-synd...","[soon, email, hiv, come, imf, meal, frequenc, ..."
6,Burundi,[https://finance.yahoo.com/news/poorest-countr...,"[soon, come, pbf, israel, hiv, imf, abus, chil..."
7,Cameroon,[https://journals.plos.org/plosone/article?id=...,"[hiv, israel, pbf, imf, food, health, tech, te..."
8,Canada,[https://gizmodo.com/costco-raising-membership...,"[percent, membership, inflat, store, bank, ms,..."
9,Colombia,"[https://biztoc.com/x/09b295058335e90b, https:...","[email, maduro, wage, secret, quarterli, team,..."


***Conclusion and Ethical Implications of Analysis***
<br>
<br>
This textual analysis had a number of hurdles that the flat-source and online-source analyses didn't have to deal with. First off, the articles used for analysis depend entirely on how relevant NewsAPI thinks the articles are. The initial testing of this script provides an excellent example. Prefixing a search term with '+' is supposed to make sure the term is included in the article, yet when 'colombia' and 'inflation' were used as the keywords, the top article for Colombia's search, with or without '+', was an article about the United States Secret Service budget. This link allowed for text scraping, and nothing short of manual exclusion would prevent the article, which had nothing to do with Colombia, from becoming part of the corpus of articles used in TFIDF analysis for Colombia's inflation articles. This means the script has less control than we would like over whether an article actually pertains to inflation and the country in question. This can be seen in the list of top words in the news_data dataframe.
<br>
<br>
A second concern is the limits on the number of calls that can be made to NewsAPI with free accounts. The initial purpose of this analysis was to see the scope of how inflation changed over time, going all the way back to 1970. A large number of countries didn't exist back then and experienced very different things at certain times, like post-Soviet and post-Yugoslavia countries. Being limited to only 50 API calls over a 12 hour period severely hampers the scope of this script as an academic assignment. If this analysis were being done over a long period of time, repeated sets of calls could be made on different days and the results could be saved to the computer running the function. This assignment is supposed to be run a single time and work as intended though, without saving data. This is a critical conflict that will limit the amount of contextual information allowed in the final product when all three datasets are put together (either as an intersection of all 3 datasets having only 50 countries or as an intersection of the first two datasets and union of that with this dataset).
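<br>
<br>
The multi-day approach described above could look something like the following sketch, which fetches at most one batch of uncached countries per run and saves everything to a JSON file. The function name `fetch_in_batches`, the file name, and the `fake_fetch` stand-in (used here in place of a real API call) are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path

def fetch_in_batches(countries, fetch, cache_path, batch_size=50):
    """Fetch up to batch_size countries that are not cached yet and save all results as JSON."""
    cache_file = Path(cache_path)
    # Start from whatever earlier runs already saved
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}
    pending = [c for c in countries if c not in cache][:batch_size]
    for country in pending:
        cache[country] = fetch(country)
    cache_file.write_text(json.dumps(cache))
    return cache

def fake_fetch(country):
    """Hypothetical stand-in for a real query; returns a made-up word list."""
    return [country.lower() + "-word"]

names = ["Armenia", "Canada", "Colombia", "Sweden", "Uruguay"]
sizes = []
with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "news_cache.json"
    for _ in range(3):  # Three simulated runs with a batch size of 2
        sizes.append(len(fetch_in_batches(names, fake_fetch, path, batch_size=2)))
print(sizes)
```

Each run picks up where the previous one stopped, so the query limit caps the work done per run rather than the total number of countries covered.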
<br>
<br>
The final concern with this script is its efficiency/timeliness. Looping through each country in a fifty country list and taking the steps being taken does require dedicated computing time. Unfortunately, I am not currently aware of ways in the code to force higher CPU usage in order to shorten computing time (like with the n_jobs parameter for various models). Taking the process used and applying it as a function with a lambda function mapped to the dataframe would be unlikely to remedy this issue.
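<br>
<br>
One option that might help, assuming most of the loop's time is spent waiting on web servers rather than computing, is to scrape several URLs at once with a thread pool from the standard library. The sketch below uses a hypothetical `fake_scraper` in place of `news_scraper` so that it runs without network access; the helper name `scrape_many` and the worker count are also assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def scrape_many(urls, scraper, max_workers=8):
    """Run scraper(url) for every url on a small thread pool, keeping the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(scraper, urls))

def fake_scraper(url):
    """Hypothetical stand-in: returns the sentinel 400 for a blocked page, text otherwise."""
    return 400 if "blocked" in url else "text from " + url

urls = ["https://example.com/a", "https://example.com/blocked", "https://example.com/b"]
results = scrape_many(urls, fake_scraper)
print(results)
```

This would not change the number of NewsAPI queries used, only how long the scraping portion takes, and the actual gain has not been measured here.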
<br>
<br>
If this were being done in a business/commercial setting, the second problem could be done away with. That would come at a cost of at least $499/month though, something that is prohibitively expensive in an academic setting like this one.
<br>
<br>
As of the writing of this conclusion, the process used for finding the top words according to TFIDF for each country's news query has hit its limit. I'm sincerely hoping that all possible errors have been identified, but I can only hope that the code listed is robust enough; there is no more time to test it fully. The limiting of data gathering to 50 countries is a by-product of running into the limits of a free account with NewsAPI. Before then, debugging work was being done on the code to make sure loops were being exited correctly for countries with fewer than 10 scrape-able articles, and Armenia was being an issue. After that was fixed, Canada gave some context for additional errors that could be encountered before the queries ran out for 12 hours. My current belief is that the code is now robust enough to move past errors, but that is yet to be seen, and it was not achieved in the cleanest manner (use of a catch-all except clause is discouraged but is being done to try to make sure the process can be completed). A future run of this code to prepare for the merging may identify other problems that could be fixed post-submission to make sure a clean dataframe with news information is able to be used for merging with the other dataframes as a final product.

In [60]:
# Writing the data to CSV
news_data.to_csv('news_data.csv', index=False)