# Loading News Data

Our news data is stored in the `news` directory as individual JSON files. We want to aggregate these into a single DataFrame, keeping only the publication date, title, and text of each article. Afterwards, we save the result to a CSV file for later use.

In [1]:
import os
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
os.chdir('news')

In [4]:
print(os.listdir()[:5])

['news_0000001(1).json', 'news_0000001(2).json', 'news_0000001(3).json', 'news_0000001(4).json', 'news_0000001(5).json']


In [5]:
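def get_data():
    """ Reads the date, title and text from each news json in the current directory (news). """
    total_data = {
        'date': [],
        'title': [],
        'text': [],
    }
    for file in os.listdir():
        with open(file, 'r') as json_file:
            try:
                data = json.load(json_file)
                # Look up all three fields before appending so the lists stay the same length.
                row = (data['published'], data['title'], data['text'])
            except (ValueError, KeyError):
                # Skip files that are not valid JSON or lack a required field.
                continue
        for key, value in zip(('date', 'title', 'text'), row):
            total_data[key].append(value)
    return total_data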
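
The loader above depends on the working directory having been changed to `news`. As a rough sketch of an alternative, the variant below takes the directory as an argument and counts the files it had to skip. The name `load_news_dir` and the skipped-file counter are hypothetical additions, not part of the original notebook, and the sketch assumes every article file ends in `.json`.

```python
import json
from pathlib import Path


def load_news_dir(directory):
    """Collect 'published', 'title' and 'text' from every *.json file in directory.

    Returns (records, skipped): a list of dicts, and the number of files
    that were not valid JSON or lacked one of the three keys.
    """
    records, skipped = [], 0
    for path in sorted(Path(directory).glob('*.json')):
        try:
            with open(path, 'r') as json_file:
                data = json.load(json_file)
            # Build the whole record first so a missing key drops the file entirely.
            records.append({
                'date': data['published'],
                'title': data['title'],
                'text': data['text'],
            })
        except (ValueError, KeyError):
            skipped += 1
    return records, skipped
```

Passing `records` to `pd.DataFrame` should give the same `date`, `title` and `text` columns as the dictionary of lists used in this notebook, and `skipped` makes silently dropped files visible.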

In [6]:
data = get_data()

In [7]:
df = pd.DataFrame(data).set_index('date').sort_index()

In [8]:
df.head()

date,title,text
2018-07-08T03:00:00.000+03:00,DGAP-News: Sixty North Gold's 2017 Prospecting...,\nAll assays by Bureau Veritas FAA550 1 assay ...
2018-07-08T03:00:00.000+03:00,DGAP-News: CeramTec GmbH: CeramTec appoints ne...,\nDGAP-News: CeramTec GmbH / Key word(s): Chan...
2018-07-08T03:00:00.000+03:00,Aladdin Blockchain Technologies Holding SE: Al...,printer\nDGAP-Media / 07.08.2018 / 10:07 Aladd...
2018-07-08T03:00:00.000+03:00,"DGAP-News: Luxoft Holding, Inc: Luxoft Acquire...","\nDGAP-News: Luxoft Holding, Inc / Key word(s)..."
2018-07-08T03:00:00.000+03:00,DGAP-News: Watts Miners Delivers the Most Powe...,\nDGAP-News: Watts Miners / Key word(s): Misce...


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 443738 entries, 2018-07-08T03:00:00.000+03:00 to 2019-06-01T04:14:00.000+03:00
Data columns (total 2 columns):
title    443738 non-null object
text     443738 non-null object
dtypes: object(2)
memory usage: 10.2+ MB
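
The `df.info()` output reports a plain `Index` rather than a `DatetimeIndex`, so the dates are still strings and `sort_index()` orders them lexicographically. That matches chronological order only while every timestamp carries the same UTC offset. A minimal sketch of converting such an index to timezone-aware datetimes, shown on a small made-up frame (`sample` and its titles are illustrative, not taken from the dataset):

```python
import pandas as pd

# Two illustrative rows using the same timestamp format as the news data.
sample = pd.DataFrame(
    {'title': ['later article', 'earlier article']},
    index=pd.Index(
        ['2019-06-01T04:14:00.000+03:00', '2018-07-08T03:00:00.000+03:00'],
        name='date',
    ),
)

# utc=True normalises every offset to UTC, which yields one consistent DatetimeIndex.
sample.index = pd.to_datetime(sample.index, utc=True)
sample = sample.sort_index()
```

Applied to `df`, the same two lines would allow date-based slicing and resampling, at the cost of expressing all timestamps in UTC instead of their original offset.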


In [10]:
os.chdir('..')
os.listdir()

['.ipynb_checkpoints',
 'get_news_data.ipynb',
 'news',
 'news_analysis.ipynb',
 'news_data.csv',
 'news_data.ipynb',
 'polygon_data.ipynb']

In [73]:
df.to_csv('news_data.csv')
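
For later use, the saved file can be read back with the `date` column restored as the index. The sketch below demonstrates the round trip on a tiny made-up frame written to a temporary directory, so it does not touch the real `news_data.csv`; the frame contents and the file name `news_data_demo.csv` are illustrative only.

```python
import os
import tempfile

import pandas as pd

# A tiny stand-in for df; the text contains a newline and a comma, as the real articles do.
df_demo = pd.DataFrame({
    'date': ['2018-07-08T03:00:00.000+03:00'],
    'title': ['Example title'],
    'text': ['\nExample body, with a comma'],
}).set_index('date')

path = os.path.join(tempfile.mkdtemp(), 'news_data_demo.csv')
df_demo.to_csv(path)

# index_col='date' turns the first CSV column back into the index.
restored = pd.read_csv(path, index_col='date')
```

Without `index_col`, `read_csv` would return `date` as an ordinary column alongside a fresh integer index.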