# Keywords in News Articles

The goal of this project is to search the news articles on given websites for keywords and perform subsequent statistical analysis. The results are presented in the form of a Power BI report.

## Description

The following keywords are investigated: Elections, Black Friday, and NFL. The time period considered is November 1-15, 2024. The task is to see how often the keywords are mentioned in the articles' titles and descriptions.

The information we are going to collect:
- Total number of times the word was mentioned,
- Number of times the word was mentioned each day,
- The day when the keyword was mentioned the most,
- Three newspapers where the keyword is seen the most,
- How often the keyword is mentioned compared to other popular topics.

Access to the article storage is provided by News API, for which an API key (api_id) was obtained beforehand. Python is used to perform the analysis.

The results are presented in Power BI. NLTK Tokenizer is used for text analysis.

### Collecting the data

In this section we will discuss the code that is used to collect the data. The idea is to loop through all the dates, collect the data for each day separately, store that data, and then create a final dataset that will contain all of the information we need for the analysis.

We have to start with importing Python libraries that will be needed:

In [None]:
from newsapi import NewsApiClient
from collections import Counter
import nltk
import datetime
import csv

To access the article database, we need to create an object of class NewsApiClient using the api_id that we have. The **.get_everything()** method is then used to download the data according to the parameters we specify.

In [None]:
newsapi = NewsApiClient(api_key='your_api_id')

all_articles = newsapi.get_everything(domains=domains,
                                    from_param=ddate,
                                    to=ddate,
                                    language='en',
                                    sort_by='relevancy')
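
Since the api_id is a secret, one option is to read it from the environment instead of typing it into the notebook. The sketch below is only an illustration: the helper `load_api_key` and the variable name `NEWSAPI_KEY` are assumptions, not part of the original project.

```python
import os

def load_api_key(env_var="NEWSAPI_KEY"):
    """Return the API key stored in the given environment variable.

    Both the function name and the default variable name are hypothetical.
    """
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key

# Usage (commented out because it needs the newsapi package and a real key):
# newsapi = NewsApiClient(api_key=load_api_key())
```

This keeps the key out of the notebook file, so the notebook can be shared without exposing it.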

We will be using the **datetime** Python library to help us build the loop.

In [None]:
date = datetime.datetime(year, month, day)
date += datetime.timedelta(days=1)
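
As a minimal sketch of how such a loop could produce the date strings passed to the API, the generator below walks through consecutive days. The name `iter_days` is introduced here for illustration only.

```python
import datetime

def iter_days(year, month, day, n_days):
    """Yield n_days consecutive dates as 'YYYY-MM-DD' strings."""
    date = datetime.date(year, month, day)
    for _ in range(n_days):
        yield date.isoformat()
        date += datetime.timedelta(days=1)

days = list(iter_days(2024, 11, 1, 15))
print(days[0], days[-1])  # 2024-11-01 2024-11-15
```

Using `datetime.date` with `isoformat()` gives the same string as `str(date).split()[0]` on a `datetime.datetime`, without the time component.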

As mentioned earlier, we are going to use the NLTK library to tokenize the incoming data. Its functionality lets us keep only the nouns and discard the rest of the text, and also take care of things like the plural and singular forms of a noun. Some custom code is added to handle cases where the same word appears capitalized in one article and in lower case in another. We will also use a dictionary of stop words that we don't want in our analysis. It is mainly meant to hold special characters and prepositions that were missed by NLTK. In addition, it contains the first names of well-known public figures to prevent counting the same person twice. For example, we will be skipping the words Elon and Donald. The stop words can be found in the *stop_dict.py* file.

In [None]:
tokens = nltk.word_tokenize(whole_text)
tagged = nltk.pos_tag(tokens)
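
The filtering step described above can be sketched without NLTK by starting from a list of `(word, tag)` pairs in the shape that `nltk.pos_tag` returns. The tagged pairs below are written by hand for illustration, and `keep_nouns` and `stop_words` are hypothetical names.

```python
def keep_nouns(tagged, stop_words):
    """Keep singular and proper nouns, unify capitalization, drop stop words."""
    clean = []
    for word, tag in tagged:
        if tag not in ("NN", "NNP"):
            continue
        # Capitalize lower-case words so "election" and "Election" count together
        normalized = word.title() if word[0].islower() else word
        if normalized not in stop_words:
            clean.append(normalized)
    return clean

# Hand-written example input, not real tagger output
tagged = [("Donald", "NNP"), ("Trump", "NNP"), ("wins", "VBZ"),
          ("the", "DT"), ("election", "NN"), ("NFL", "NNP")]
stop_words = {"Donald", "Elon"}
print(keep_nouns(tagged, stop_words))  # ['Trump', 'Election', 'NFL']
```

Words that already start with a capital letter are left untouched, so acronyms such as NFL keep their original form.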

Now we are all set to get started.

In the first part of the project we are going to store the top mentioned words per day for November 1-15, 2024, together with the number of times each was mentioned. We define top mentioned words as the ones that we see 4 times or more that day on the news websites that we examine. The list of those websites is stored in the *domains.py* file.

The second piece of information we will be looking for is the number of times our three keywords - Elections, Black Friday, and NFL - appear in the news each day, regardless of whether they are on the list of top mentioned words.
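
The two pieces of information can be illustrated with a small sketch on a hand-made word list. The function name `summarize_day` and the sample words are assumptions used only for this example.

```python
from collections import Counter

def summarize_day(words, keyword, min_count=4):
    """Return (top words with their counts, count of the keyword) for one day."""
    counts = Counter(words)
    # Top words: seen min_count times or more that day
    top_words = {w: n for w, n in counts.items() if n >= min_count}
    # A Counter returns 0 for a word it has never seen
    return top_words, counts[keyword]

# Illustrative data, not taken from the real articles
words = ["Election"] * 5 + ["Trump"] * 4 + ["NFL"] * 2 + ["Weather"]
print(summarize_day(words, "NFL"))     # ({'Election': 5, 'Trump': 4}, 2)
print(summarize_day(words, "Friday"))  # ({'Election': 5, 'Trump': 4}, 0)
```

The keyword count is reported even when the keyword falls below the threshold, which is what keeps the daily series complete.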

Note: To investigate the keyword Black Friday, we will use only the word Friday and put the word Black on the stop list, as we did for the first names of famous people.

The complete code for this part of the project looks like this. Only one keyword is used in the code below. That is done to simplify the code for demonstration purposes. More keywords can be added in the same manner.

In [None]:
# List of news domains
from domains import domains
# We will use an additional list of stop words
from stop_dict import stop_dict

# Save results to file
def save_key_stats_to_file(filename, stats):
    d = datetime.datetime(year, month, day)
    with open(filename, 'w') as f:
        for line in stats:
            f.write(f"{str(d).split()[0]},{line}\n")
            d += datetime.timedelta(days=1)

# Parameters
year = 2024
month = 11
day = 1
n_days = 15
keyword_1 = 'Election'

date = datetime.datetime(year, month, day)
# Init
newsapi = NewsApiClient(api_key='your_api_id')  # use your own key; never commit a real key

keyword_1_stats = []
for _ in range(n_days):
    ddate = str(date).split()[0]

    all_articles = newsapi.get_everything(domains=domains,
                                        from_param=ddate,
                                        to=ddate,
                                        language='en',
                                        sort_by='relevancy')

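    # Join titles and descriptions with spaces so adjacent words do not merge;
    # a missing title or description (None) is treated as an empty string
    whole_text = ''
    for article in all_articles['articles']:
        whole_text += (article['title'] or '') + ' '
        whole_text += (article['description'] or '') + ' '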

    tokens = nltk.word_tokenize(whole_text)
    tagged = nltk.pos_tag(tokens)

    clean = []
    for tag in tagged:
        if tag[1] in ['NNP', 'NN']:
            w = tag[0].title() if tag[0][0].islower() else tag[0]
            if w not in stop_dict:
                clean.append(w)
                
    most_common = Counter(clean)

    keywords = {}
    total_n_keywords = 0
    for mc in most_common.items():
        if mc[1] > 3:
            keywords[mc[0]] = mc[1]
            total_n_keywords += 1
        if mc[0] == keyword_1:
            keyword_1_stats.append(mc[1])

    if keyword_1 not in most_common.keys():
        keyword_1_stats.append(0)

    # Save data to a CSV file
    with open(f"top_keywords_for_{ddate}.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for row in keywords.items():
            writer.writerow(row)

    date += datetime.timedelta(days=1)

save_key_stats_to_file(f'{keyword_1}_stats.csv', keyword_1_stats)


### Top Sources

In this part of the project we will be looking for the three sources where the keywords appear the most.

Here is the code that is used to accomplish that. The names of the three sources are saved to a separate file for each keyword.

In [None]:
keywords = ['Election', 'Friday', 'NFL']

for keyword in keywords:
    all_articles = newsapi.get_everything(q=keyword,
                                        domains=domains,
                                        from_param=date,
                                        to=date,
                                        language='en',
                                        sort_by='relevancy')

    papers = []
    for headline in all_articles['articles']:
        papers.append(headline['source']['name'])

    # most_common(3) returns (name, count) pairs sorted by count, highest first
    most_common_sources = [name for name, _ in Counter(papers).most_common(3)]

    with open(f'{keyword}_sources.txt', 'w') as f:
        for line in most_common_sources:
            f.write(f"{line}\n")
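
To illustrate how the three most frequent sources can be picked from a list of source names, here is a small sketch using `Counter.most_common`. The helper `top_sources` and the sample list are made up for the example.

```python
from collections import Counter

def top_sources(source_names, n=3):
    """Return the n most frequent source names, most frequent first."""
    return [name for name, _ in Counter(source_names).most_common(n)]

# Illustrative list of article sources, not real query results
papers = ["Wired", "ESPN", "CNET", "ESPN", "BBC News",
          "ESPN", "CNET", "BBC News", "CNET", "ESPN"]
print(top_sources(papers))  # ['ESPN', 'CNET', 'BBC News']
```

Note that the keys of a `Counter` are kept in insertion order, so ranking by frequency requires `most_common` rather than slicing the keys.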

### Final Dataset

The last thing we need to do is assemble the final table with all the data needed for the analysis. We are going to loop through the files with the most popular words for each day and calculate the sum of the mentions of all the top words. That is the first value we need for the final table. The second is the number of times the keyword appears in the news that day. The third is the number of times the keyword appears in the news if it was among the most popular words that day, and zero otherwise.

Here is the code to perform that. The final table is stored as a separate CSV file for each keyword. The **pandas** library will help us do all of that.

In [None]:
import csv
import pandas as pd

keyword = 'Election'
total_counts = []
keyword_stats_if_popular = []

for idx in range(1, 16):
    # Zero-pad the day to match the file names written earlier
    d = f"{idx:02d}"

    keywords = {}
    with open(f"top_keywords_for_2024-11-{d}.csv", newline="", encoding="utf-8") as file:
        # csv.reader handles quoted fields, unlike a plain split(',')
        for word, count in csv.reader(file):
            keywords[word] = int(count)
    total_counts.append(sum(keywords.values()))

    # Count for the keyword if it was among that day's top words, else 0
    keyword_stats_if_popular.append(keywords.get(keyword, 0))

df = pd.read_csv(f'{keyword}_stats.csv', names=['Date', 'KeyWCount'])
df['KeyWCountIfPop'] = keyword_stats_if_popular
df['Totals'] = total_counts
df.to_csv(f'{keyword}_final_stats.csv', index=False, header=False)
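
As a rough sketch of what can be derived from the final table, the function below computes the total number of mentions, the peak day, and the keyword's daily share among the top words. It assumes rows in the column order Date, KeyWCount, KeyWCountIfPop, Totals and that the share is KeyWCountIfPop divided by Totals; the numbers are invented, and the actual report is built in Power BI.

```python
def summarize(rows):
    """rows: (date, keyword_count, count_if_popular, total_top_mentions) tuples."""
    total = sum(count for _, count, _, _ in rows)
    # Day with the highest keyword count
    peak_date = max(rows, key=lambda r: r[1])[0]
    # Share of the keyword among all top-word mentions, in percent
    shares = {date: (100 * if_pop / tot if tot else 0.0)
              for date, _, if_pop, tot in rows}
    return total, peak_date, shares

# Invented numbers for illustration only
rows = [("2024-11-04", 6, 6, 120),
        ("2024-11-05", 30, 30, 200),
        ("2024-11-06", 3, 0, 150)]
total, peak_date, shares = summarize(rows)
print(total, peak_date, shares["2024-11-05"])  # 39 2024-11-05 15.0
```

Days with no top words at all get a share of 0.0 instead of raising a division error.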

### Analysis

Once we have all of the data, we are ready to plot some graphs and see if the data we have collected contains any interesting and remarkable patterns.

Power BI is used to plot the graphs we need. A PDF with the report for each keyword can be found in the root directory of this project.

Let's see if we can derive any conclusions looking at those reports.

#### Elections

As expected, the word Elections is seen the most on November 5 and 6, Election Day in the United States and the day after. At its peak it accounts for about 15% of all mentions of top words in the news that day.

After that the use of this word decreases drastically.

#### Black Friday

For the Black Friday keyword we can observe that the popularity of this keyword grows as the date approaches.

Also, its usage peaks on Saturdays, which may be explained by companies intensifying their advertising during the weekends.

#### NFL

The NFL keyword almost never appears in the news on working days. However, it attracts a substantial amount of attention on Sundays, taking up more than 7% of all the hot topics on November 9, for instance.

We can also notice that ESPN and CNET are the top sources where the NFL is discussed, which is, of course, natural.