This notebook processes the file "finance_news_de_2018-2022.csv", which was provided by Prof. Gertz and contains German economic and financial articles from various sources. Besides the article text, the CSV file holds metadata for each article, such as the website it was obtained from and the date. The notebook loads the CSV file as a DataFrame and writes each article to a separate txt file so that the articles can be loaded efficiently into a document store (Elasticsearch). Additionally, a random sample of the txt files is uploaded to the annotation tool for manual annotation. The most important metadata are captured in the name of each txt file.
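The sampling step for the annotation tool is not shown in this notebook. A minimal sketch of how such a sample could be drawn is given here, assuming the txt files already exist in one folder. The helper name `sample_txt_files`, the sample size, and the fixed seed are illustrative assumptions, not the procedure actually used.

```python
import random
from pathlib import Path
from typing import List


def sample_txt_files(folder: str, k: int, seed: int = 42) -> List[Path]:
    """Return up to k randomly chosen .txt files from folder (hypothetical helper)."""
    # Sort first so that the same seed always yields the same sample,
    # independent of the order in which the file system lists the files.
    files = sorted(Path(folder).glob("*.txt"))
    rng = random.Random(seed)
    return rng.sample(files, min(k, len(files)))
```

Fixing the seed makes the sample reproducible, which helps when the annotated subset has to be traced back later.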

In [1]:
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm
import os

In [6]:
# Directory that contains the CSV file
directory_path = "data/"

# Path to the CSV file with the German economic and financial articles
df_path = directory_path + "finance_news_de_2018-2022.csv"

# Directory in which the individual txt files are saved
split_files_path = directory_path + 'split_files/'

In [13]:
# Load the German economic and financial articles as a DataFrame
df = pd.read_csv(df_path)
print(
    f"Dataframe contains {df.shape[0]} articles and {df.shape[1]} attributes")

Dataframe contains 287527 articles and 8 attributes


In [14]:
df.head()

Unnamed: 0,URL,outlet,headline,body,crawltime,extracttime,summary,obtained_from
0,https://www.gruenderszene.de/business/exit-sca...,www.gruenderszene.de,"Hack, ICO-Betrug, PR-Stunt: Was ist bei Savedr...","Frankfurter Fintech\n\nHack, ICO-Betrug, PR-St...",2018-04-18 16:26:50.069000,2018-04-18 16:26:50.069000,,/business/
1,https://www.gruenderszene.de/business/linearit...,www.gruenderszene.de,"Dieser 18-Jährige entwickelt seine Grafik-App,...",Linearity\n\nDieser 18-Jährige entwickelt sein...,2018-04-19 08:33:12.533000,2018-04-19 08:33:12.533000,,/business/
2,https://www.gruenderszene.de/business/pilot-wa...,www.gruenderszene.de,Was kann der crowdfinanzierte Übersetzungskopf...,Produkttest\n\nWas kann der crowdfinanzierte Ü...,2018-05-14 07:06:20.726000,2018-05-14 07:06:20.726000,,/business/
3,https://www.gruenderszene.de/business/fluffy-f...,www.gruenderszene.de,100.000 Euro verdient dieses Startup pro Tag –...,Fluffy Fairy Games\n\n100.000 Euro verdient di...,2018-07-13 07:31:00.864000,2018-07-13 07:31:00.864000,,/business/
4,https://www.gruenderszene.de/business/airbnb-l...,www.gruenderszene.de,Professionelle Vollzeitvermieter treiben das W...,Die Vermietungsplattform Airbnb sieht sich ger...,2018-08-25 09:21:00.780000,2018-08-25 09:21:00.780000,,/business/


In [15]:
def write_txt(domain: str, df_index: str, content: str, category: str):
    """Write an article to a new txt file.

    The file name has the format "DOMAIN___index___category.txt". The DataFrame
    index makes each file name unique.

    Parameters
    ----------
    domain : str
        The domain the article was obtained from.
    df_index : str
        The row index of the article in the DataFrame.
    content : str
        The article text.
    category : str
        The category of the article. Equals the "obtained_from" column of the DataFrame.
    """

    # Name of the txt file
    name = domain.upper() + "___" + df_index + "___" + category + ".txt"
    txt_path = split_files_path + name

    # Write as UTF-8 explicitly so that German umlauts do not depend on the platform default encoding
    with open(txt_path, 'w', encoding='utf-8') as file:
        file.write(content)
 
def get_domain(url: str):
    """Get the domain from an outlet host name, e.g. "www.gruenderszene.de" returns "gruenderszene.de".

    Parameters
    ----------
    url : str
        Value of the "outlet" column in the DataFrame.

    Returns
    -------
    str
        The host name with every occurrence of "www." removed.
    """

    return url.replace("www.", "")
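Because the metadata live in the file name, they can be recovered later without the CSV. The sketch below rebuilds the naming scheme and parses it back. `build_name` and `parse_name` are hypothetical helpers that are not part of the notebook, and the parser assumes that the domain itself never contains the separator "___".

```python
def build_name(domain: str, df_index: int, category: str) -> str:
    """Build a file name in the DOMAIN___index___category.txt scheme."""
    return f"{domain.upper()}___{df_index}___{category}.txt"


def parse_name(name: str) -> dict:
    """Split a file name of that scheme back into its three metadata fields."""
    stem = name[:-len(".txt")] if name.endswith(".txt") else name
    # maxsplit=2 keeps any further "___" inside the category field
    domain, df_index, category = stem.split("___", 2)
    return {
        "domain": domain.lower(),  # undo the upper-casing applied when writing
        "df_index": int(df_index),
        "category": category,
    }


print(parse_name("GRUENDERSZENE.DE___0___business.txt"))
# {'domain': 'gruenderszene.de', 'df_index': 0, 'category': 'business'}
```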

In [17]:
# Write a separate txt file for each article, with a progress bar
N = df.shape[0]

with tqdm(total=N) as pbar:
    for i in range(N):

        url = df['outlet'].iloc[i]
        body = df['body'].iloc[i]
        # "obtained_from" looks like "/business/"; drop the slashes so it can be used in a file name
        category = df['obtained_from'].iloc[i]
        category = category.replace("/", "")

        write_txt(get_domain(url), str(i), body, category)

        pbar.update(1)
    # No explicit pbar.close() needed: the with block closes the progress bar

  0%|          | 0/287527 [00:00<?, ?it/s]
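The loop above looks up each cell with `.iloc`, which is slow for several hundred thousand rows, and it would raise an error on a row whose body is not a string (for example a missing value). As an alternative sketch, the function below iterates over plain tuples and skips such rows. `write_articles` is a hypothetical helper, and whether the dataset contains missing bodies at all is an assumption not verified here.

```python
import os
from typing import Iterable, List, Tuple


def write_articles(rows: Iterable[Tuple[str, object, str]], out_dir: str) -> List[int]:
    """Write one txt file per (outlet, body, category) row.

    Returns the positions of rows that were skipped because the body is not a string.
    """
    os.makedirs(out_dir, exist_ok=True)  # create the target folder if needed
    skipped = []
    for i, (outlet, body, category) in enumerate(rows):
        if not isinstance(body, str):  # e.g. NaN for a missing article text
            skipped.append(i)
            continue
        domain = str(outlet).replace("www.", "").upper()
        category = str(category).replace("/", "")
        name = f"{domain}___{i}___{category}.txt"
        with open(os.path.join(out_dir, name), "w", encoding="utf-8") as fh:
            fh.write(body)
    return skipped


# Possible call with the DataFrame from this notebook:
# skipped = write_articles(
#     zip(df["outlet"], df["body"], df["obtained_from"]), split_files_path)
```

Passing `tqdm(zip(...), total=len(df))` as `rows` would keep the progress bar.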

In [19]:
# Check that every article in the DataFrame was written to an individual txt file
_, _, files = next(os.walk(split_files_path))
file_count = len(files)
print(f"From {df.shape[0]} articles, {file_count} articles were written in txt")

print(f"Number Not written files: {df.shape[0]-file_count}")


From 287527 articles, 287527 articles were written in txt
Number Not written files: 0
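Comparing counts shows how many files are missing but not which ones, and leftover files from an earlier run could hide a gap. A stricter check, sketched below, compares the expected file names with the names actually present. `missing_files` is a hypothetical helper, and the expected names would have to be built with the same naming scheme as `write_txt`.

```python
import os
from typing import Iterable, Set


def missing_files(expected_names: Iterable[str], folder: str) -> Set[str]:
    """Return the expected file names that do not exist in folder."""
    existing = {entry.name for entry in os.scandir(folder) if entry.is_file()}
    return set(expected_names) - existing
```

An empty result means every expected article file is present.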
