In [1]:
#General libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Read in Datasets
The data come from the Financial News Headlines dataset hosted on Kaggle (https://www.kaggle.com/notlucasp/financial-news-headlines).

In [2]:
#read in cnbc dataset
cnbc = pd.read_csv('../Data/cnbc_headlines.csv')
cnbc.head()

Unnamed: 0,Headlines,Time,Description
0,Jim Cramer: A better way to invest in the Covi...,"7:51 PM ET Fri, 17 July 2020","""Mad Money"" host Jim Cramer recommended buying..."
1,Cramer's lightning round: I would own Teradyne,"7:33 PM ET Fri, 17 July 2020","""Mad Money"" host Jim Cramer rings the lightnin..."
2,,,
3,"Cramer's week ahead: Big week for earnings, ev...","7:25 PM ET Fri, 17 July 2020","""We'll pay more for the earnings of the non-Co..."
4,IQ Capital CEO Keith Bliss says tech and healt...,"4:24 PM ET Fri, 17 July 2020","Keith Bliss, IQ Capital CEO, joins ""Closing Be..."


In [25]:
#read in guardian dataset
guard = pd.read_csv('../Data/guardian_headlines.csv')
guard.head()

Unnamed: 0,Time,Headlines
0,7/18/2020,Johnson is asking Santa for a Christmas recovery
1,7/18/2020,‘I now fear the worst’: four grim tales of wor...
2,7/18/2020,Five key areas Sunak must tackle to serve up e...
3,7/18/2020,Covid-19 leaves firms ‘fatally ill-prepared’ f...
4,7/18/2020,The Week in Patriarchy \n\n\n Bacardi's 'lad...


In [4]:
#read in reuters dataset
reuters = pd.read_csv('../Data/reuters_headlines.csv')
reuters.head()

Unnamed: 0,Headlines,Time,Description
0,TikTok considers London and other locations fo...,Jul 18 2020,TikTok has been in discussions with the UK gov...
1,Disney cuts ad spending on Facebook amid growi...,Jul 18 2020,Walt Disney has become the latest company to ...
2,Trail of missing Wirecard executive leads to B...,Jul 18 2020,Former Wirecard chief operating officer Jan M...
3,Twitter says attackers downloaded data from up...,Jul 18 2020,Twitter Inc said on Saturday that hackers were...
4,U.S. Republicans seek liability protections as...,Jul 17 2020,A battle in the U.S. Congress over a new coron...


A glance at the three datasets shows obvious discrepancies between them. The CNBC data contains full datetimes, whereas the Guardian and Reuters files have only the date. The Guardian dataset also lacks the article description field that is present in the other two sources.

Each dataset will require some additional cleaning before we can begin the rest of the EDA.
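
The column mismatch noted above can also be checked programmatically. The following is a minimal sketch that uses empty stand-in frames whose column names are copied from the previews; `frames` and `missing` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Empty stand-ins for the three raw frames; only the column names matter here
frames = {
    "cnbc": pd.DataFrame(columns=["Headlines", "Time", "Description"]),
    "guardian": pd.DataFrame(columns=["Time", "Headlines"]),
    "reuters": pd.DataFrame(columns=["Headlines", "Time", "Description"]),
}

# Union of every column seen in any source
all_columns = set().union(*(df.columns for df in frames.values()))

# For each source, list the columns it lacks relative to that union
missing = {name: sorted(all_columns - set(df.columns)) for name, df in frames.items()}
print(missing)
```

With the real frames, the same dictionary comprehension would flag `Description` as absent from the Guardian data.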

#### Checking data types and values

In [29]:
orig_df = [cnbc,guard,reuters]

for df in orig_df:
    print(df.info())
    print('+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3080 entries, 0 to 3079
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Headlines    2800 non-null   object
 1   Time         2800 non-null   object
 2   Description  2800 non-null   object
dtypes: object(3)
memory usage: 72.3+ KB
None
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17800 entries, 0 to 17799
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Time       17800 non-null  object
 1   Headlines  17800 non-null  object
dtypes: object(2)
memory usage: 278.2+ KB
None
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32770 entries, 0 to 32769
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Headlines    3277

In [30]:
for df in orig_df:
    print(pd.to_datetime(df.Time).describe())



count                    2800
unique                   2474
top       2019-01-29 19:12:00
freq                        6
first     2017-12-22 18:52:00
last      2020-07-17 19:51:00
Name: Time, dtype: object
count                   17800
unique                    774
top       2018-07-24 00:00:00
freq                       40
first     2017-12-17 00:00:00
last      2020-07-18 00:00:00
Name: Time, dtype: object
count                   32770
unique                    852
top       2020-03-19 00:00:00
freq                      126
first     2018-03-20 00:00:00
last      2020-07-18 00:00:00
Name: Time, dtype: object
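
The CNBC strings carry an `ET` marker and mix full and abbreviated month names, which generic parsing may handle differently across pandas versions. One possible approach, shown here only as a sketch, is to strip the marker and parse each string with `dateutil`; the two sample strings are taken from the CNBC previews, and the resulting timestamps are naive (implicitly Eastern Time).

```python
import pandas as pd
from dateutil import parser

# Two CNBC-style strings copied from the previews in this notebook
raw = pd.Series(["7:51 PM ET Fri, 17 July 2020", "6:52 PM ET Fri, 22 Dec 2017"])

# Remove the literal " ET" marker so only date and time tokens remain
stripped = raw.str.replace(" ET", "", regex=False)

# Parse element by element; dateutil accepts both "July" and "Dec"
parsed = pd.to_datetime(stripped.map(parser.parse))
print(parsed)
```

Parsing row by row is slower than a vectorized format string, but it does not assume a single month format across the file.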




In [31]:
#function to remove rows with missing data, convert date strings to type datetime,
#and drop duplicate headlines (keeping the earliest record of each article).
#index is set to the datetime of the article's publication and sorted
def data_cleaner(df):
    #copy so the new columns are not assigned onto a slice of the original frame
    df = df.dropna().copy()
    df['datetime'] = pd.to_datetime(df['Time'])
    df['Date'] = df['datetime'].dt.date
    df = df.set_index('datetime').sort_index()
    df = df.drop_duplicates(subset=['Headlines'], keep='first')
    return df

In [6]:
df = cnbc
cnbc2 = data_cleaner(df)
cnbc2.head()


Unnamed: 0_level_0,Headlines,Time,Description,Date
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2017-12-22 18:52:00,Cramer: Never buy a stock all at once — you'll...,"6:52 PM ET Fri, 22 Dec 2017",Jim Cramer doubled down on his key investing r...,2017-12-22
2017-12-22 19:07:00,Cramer: I helped investors through the 2010 fl...,"7:07 PM ET Fri, 22 Dec 2017","Jim Cramer built on his ""nobody ever made a di...",2017-12-22
2017-12-22 19:07:00,Cramer says owning too many stocks and too lit...,"7:07 PM ET Fri, 22 Dec 2017",Jim Cramer broke down why owning fewer stocks ...,2017-12-22
2017-12-26 10:15:00,Markets lack Christmas cheer,"10:15 AM ET Tue, 26 Dec 2017","According to Kensho, here's how markets have f...",2017-12-26
2017-12-27 10:13:00,S&P tends to start new year bullish after this...,"10:13 AM ET Wed, 27 Dec 2017",The S&P is on track to end the year up 20 perc...,2017-12-27


In [37]:
df = guard
guard2 = data_cleaner(df)
#add headline as article description
guard2['Description'] = guard2.Headlines
#reorder columns
cols_rodr = cnbc2.columns.to_list()
guard2 = guard2[cols_rodr]
guard2.head()

Unnamed: 0_level_0,Headlines,Time,Description,Date
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2017-12-17,The Guardian view on Ryanair’s model: a union-...,12/17/2017,The Guardian view on Ryanair’s model: a union-...,2017-12-17
2017-12-17,Peter Preston on press and broadcasting \n\n\...,12/17/2017,Peter Preston on press and broadcasting \n\n\...,2017-12-17
2017-12-17,Why business could prosper under a Corbyn gove...,12/17/2017,Why business could prosper under a Corbyn gove...,2017-12-17
2017-12-17,Youngest staff to be given UK workplace pensio...,12/17/2017,Youngest staff to be given UK workplace pensio...,2017-12-17
2017-12-17,Grogonomics \n\n\n This year has been about ...,12/17/2017,Grogonomics \n\n\n This year has been about ...,2017-12-17


In [34]:
df = reuters
reuters2 = data_cleaner(df)
reuters2.head()

Unnamed: 0_level_0,Headlines,Time,Description,Date
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2018-03-20,UK will always consider ways to improve data l...,Mar 20 2018,Britain will consider any suggestions to give ...,2018-03-20
2018-03-20,Senate Democrat wants Facebook CEO Zuckerberg ...,Mar 20 2018,"U.S. Senator Dianne Feinstein, the top Democra...",2018-03-20
2018-03-20,"Factbox: How United States, others regulate au...",Mar 20 2018,An Uber self-driving sport utility vehicle str...,2018-03-20
2018-03-20,Cambridge Analytica played key Trump campaign ...,Mar 20 2018,The suspended chief executive of UK-based poli...,2018-03-20
2018-03-20,Start of AT&T-Time Warner trial delayed until ...,Mar 20 2018,Opening statements in the trial to decide if A...,2018-03-20


Each dataset has now been cleaned so that it contains no duplicate headlines and no rows with missing values.
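
The claim above can be verified with a couple of quick checks. This is a sketch only: `check_clean` is a hypothetical helper, and `toy` is a small made-up frame standing in for `cnbc2`, `guard2`, or `reuters2`.

```python
import pandas as pd

def check_clean(df):
    """Summarize whether a frame still has missing values or repeated headlines."""
    return {
        "has_nulls": bool(df.isna().any().any()),
        "duplicate_headlines": int(df["Headlines"].duplicated().sum()),
    }

# Made-up stand-in for one of the cleaned frames
toy = pd.DataFrame(
    {"Headlines": ["A", "B"], "Description": ["a", "b"]},
    index=pd.DatetimeIndex(["2020-07-17 00:00:00", "2020-07-18 00:00:00"], name="datetime"),
)

report = check_clean(toy)
print(report)
```

Running the helper over each cleaned frame should report no nulls and zero duplicate headlines if `data_cleaner` behaved as intended.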

## Combine datasets

In [41]:
all_data = pd.concat([cnbc2,guard2,reuters2])
all_data

Unnamed: 0_level_0,Headlines,Time,Description,Date
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2017-12-22 18:52:00,Cramer: Never buy a stock all at once — you'll...,"6:52 PM ET Fri, 22 Dec 2017",Jim Cramer doubled down on his key investing r...,2017-12-22
2017-12-22 19:07:00,Cramer: I helped investors through the 2010 fl...,"7:07 PM ET Fri, 22 Dec 2017","Jim Cramer built on his ""nobody ever made a di...",2017-12-22
2017-12-22 19:07:00,Cramer says owning too many stocks and too lit...,"7:07 PM ET Fri, 22 Dec 2017",Jim Cramer broke down why owning fewer stocks ...,2017-12-22
2017-12-26 10:15:00,Markets lack Christmas cheer,"10:15 AM ET Tue, 26 Dec 2017","According to Kensho, here's how markets have f...",2017-12-26
2017-12-27 10:13:00,S&P tends to start new year bullish after this...,"10:13 AM ET Wed, 27 Dec 2017",The S&P is on track to end the year up 20 perc...,2017-12-27
...,...,...,...,...
2020-07-17 00:00:00,Exclusive: Pact to aid poor cocoa farmers in p...,Jul 17 2020,The steepest dive in cocoa demand in a decade ...,2020-07-17
2020-07-18 00:00:00,Twitter says attackers downloaded data from up...,Jul 18 2020,Twitter Inc said on Saturday that hackers were...,2020-07-18
2020-07-18 00:00:00,Trail of missing Wirecard executive leads to B...,Jul 18 2020,Former Wirecard chief operating officer Jan M...,2020-07-18
2020-07-18 00:00:00,Disney cuts ad spending on Facebook amid growi...,Jul 18 2020,Walt Disney has become the latest company to ...,2020-07-18
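
Once the frames are concatenated, nothing records which outlet a row came from, and the index follows concatenation order rather than time. If that matters for later analysis, one option (an assumption, not something the original notebook does) is to tag each frame with a `Source` column before concatenating and to sort the result. The frames below are made-up stand-ins.

```python
import pandas as pd

def toy_frame(headline, stamp):
    # Build a one-row frame indexed by publication datetime
    return pd.DataFrame(
        {"Headlines": [headline]},
        index=pd.DatetimeIndex([stamp], name="datetime"),
    )

# Made-up stand-ins for cnbc2, guard2 and reuters2
cleaned = {
    "CNBC": toy_frame("cnbc headline", "2017-12-26 10:15:00"),
    "Guardian": toy_frame("guardian headline", "2017-12-17 00:00:00"),
    "Reuters": toy_frame("reuters headline", "2018-03-20 00:00:00"),
}

# Label each frame with its outlet, stack them, then order by datetime
combined = pd.concat(
    [frame.assign(Source=name) for name, frame in cleaned.items()]
).sort_index()
print(combined)
```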


In [42]:
#write data to csv
all_data.to_csv('../Data/all_data.csv')
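
When the saved file is read back, the `datetime` index is plain text unless it is parsed again. A minimal round-trip sketch, using an in-memory buffer and a made-up two-row frame in place of `all_data`:

```python
import io
import pandas as pd

# Made-up stand-in for all_data, indexed by publication datetime
toy = pd.DataFrame(
    {"Headlines": ["first headline", "second headline"]},
    index=pd.DatetimeIndex(
        ["2017-12-22 18:52:00", "2020-07-18 00:00:00"], name="datetime"
    ),
)

# Write to an in-memory buffer instead of '../Data/all_data.csv'
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# Restore the index and ask pandas to parse it as datetimes
restored = pd.read_csv(buffer, index_col="datetime", parse_dates=True)
print(restored.index)
```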