# Data analysis
Analysis through charts

Publication = https://arxiv.org/abs/1801.07055
Dataset = https://archive.ics.uci.edu/ml/datasets/News+Popularity+in+Multiple+Social+Media+Platforms

VARIABLES OF NEWS DATA
* IDLink (numeric): Unique identifier of news items
* Title (string): Title of the news item according to the official media sources
* Headline (string): Headline of the news item according to the official media sources
* Source (string): Original news outlet that published the news item
* Topic (string): Query topic used to obtain the items in the official media sources
* PublishDate (timestamp): Date and time of the news items' publication
* SentimentTitle (numeric): Sentiment score of the text in the news items' title
* SentimentHeadline (numeric): Sentiment score of the text in the news items' headline
* Facebook (numeric): Final value of the news items' popularity according to the social media source Facebook
* GooglePlus (numeric): Final value of the news items' popularity according to the social media source Google+
* LinkedIn (numeric): Final value of the news items' popularity according to the social media source LinkedIn

VARIABLES OF SOCIAL FEEDBACK DATA
* IDLink (numeric): Unique identifier of news items
* TS1 (numeric): Level of popularity in time slice 1 (0-20 minutes upon publication)
* TS2 (numeric): Level of popularity in time slice 2 (20-40 minutes upon publication)
* TS... (numeric): Level of popularity in time slice ...
* TS144 (numeric): Final level of popularity after 2 days upon publication
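
The time-slice columns can be mapped back to elapsed time since publication. A minimal sketch, assuming every slice is 20 minutes wide (which agrees with the list above, since 144 × 20 minutes = 2 days); `slice_window`, `SLICE_MINUTES` and `N_SLICES` are hypothetical names, not part of the dataset:

```python
from datetime import timedelta
from typing import Tuple

SLICE_MINUTES = 20  # ASSUMPTION: all slices share the 20-minute width of TS1 and TS2
N_SLICES = 144      # TS1 .. TS144


def slice_window(k: int) -> Tuple[timedelta, timedelta]:
    """Return the (start, end) offsets from publication covered by time slice TS<k>."""
    if not 1 <= k <= N_SLICES:
        raise ValueError(f"k must be in 1..{N_SLICES}, got {k}")
    start = timedelta(minutes=(k - 1) * SLICE_MINUTES)
    end = timedelta(minutes=k * SLICE_MINUTES)
    return start, end


print(slice_window(1))    # first slice
print(slice_window(144))  # last slice, ends two days after publication
```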

In [1]:
# !pip install plotly
# !pip install nbformat

import pandas as pd

In [2]:
data = pd.read_csv(r"data\News_Final.csv", sep=",")
print("data.shape=",data.shape)
print(data.dtypes)
pd.concat([data.head(), data.tail()],axis=0)

data.shape= (93239, 11)
IDLink               float64
Title                 object
Headline              object
Source                object
Topic                 object
PublishDate           object
SentimentTitle       float64
SentimentHeadline    float64
Facebook               int64
GooglePlus             int64
LinkedIn               int64
dtype: object


Unnamed: 0,IDLink,Title,Headline,Source,Topic,PublishDate,SentimentTitle,SentimentHeadline,Facebook,GooglePlus,LinkedIn
0,99248.0,Obama Lays Wreath at Arlington National Cemetery,Obama Lays Wreath at Arlington National Cemete...,USA TODAY,obama,2002-04-02 00:00:00,0.0,-0.0533,-1,-1,-1
1,10423.0,A Look at the Health of the Chinese Economy,"Tim Haywood, investment director business-unit...",Bloomberg,economy,2008-09-20 00:00:00,0.208333,-0.156386,-1,-1,-1
2,18828.0,Nouriel Roubini: Global Economy Not Back to 2008,"Nouriel Roubini, NYU professor and chairman at...",Bloomberg,economy,2012-01-28 00:00:00,-0.42521,0.139754,-1,-1,-1
3,27788.0,Finland GDP Expands In Q4,Finland's economy expanded marginally in the t...,RTT News,economy,2015-03-01 00:06:00,0.0,0.026064,-1,-1,-1
4,27789.0,"Tourism, govt spending buoys Thai economy in J...",Tourism and public spending continued to boost...,The Nation - Thailand&#39;s English news,economy,2015-03-01 00:11:00,0.0,0.141084,-1,-1,-1
93234,61851.0,Stocks rise as investors key in on US economy ...,The June employment report is viewed as a cruc...,MarketWatch,economy,2016-07-07 15:31:05,0.104284,0.044943,-1,3,5
93235,61865.0,Russian PM proposes to use conservative and to...,"In addition, establish stimulating economic po...",TASS,economy,2016-07-07 15:31:10,0.072194,0.0,-1,0,1
93236,104793.0,Palestinian Government Uses Foreign Aid To Pay...,The Palestinian government spends nearly $140 ...,Daily Caller,palestine,2016-07-07 15:38:26,0.291667,-0.139754,5,1,0
93237,104794.0,Palestine Youth Orchestra prepares for first U...,Palestine Youth Orchestra prepares for first U...,Ahram Online,palestine,2016-07-07 15:59:22,0.121534,0.092313,0,0,0
93238,61870.0,Sausalito businesswoman wins $10000 in Microso...,"Goldstein, the proprietor of the TG Travel Gro...",East Bay Times,microsoft,2016-07-07 16:16:11,0.0,0.054554,-1,1,0


In [3]:
# Fix column types: parse PublishDate as datetime and cast IDLink to int
data['PublishDate'] = pd.to_datetime(data['PublishDate'])
data['IDLink'] = data['IDLink'].astype(int)
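
The preview above shows -1 in the `Facebook`, `GooglePlus` and `LinkedIn` columns. If -1 is a placeholder for "no measurement" rather than a real count (an assumption that should be checked against the dataset documentation), it lowers any sum of popularity. A small sketch on a toy frame with made-up values shows the effect of masking it first:

```python
import pandas as pd

# Toy frame standing in for News_Final.csv; the values are illustrative only.
toy = pd.DataFrame({
    "Topic": ["economy", "economy", "obama"],
    "Facebook": [-1, 10, 5],
    "GooglePlus": [-1, 3, -1],
    "LinkedIn": [-1, 5, 0],
})

popularity_cols = ["Facebook", "GooglePlus", "LinkedIn"]

# ASSUMPTION: -1 marks an unavailable measurement, so it is replaced by NaN.
clean = toy.assign(**{c: toy[c].where(toy[c] >= 0) for c in popularity_cols})

raw_sum = toy.groupby("Topic")["Facebook"].sum()      # -1 is counted as a value
clean_sum = clean.groupby("Topic")["Facebook"].sum()  # NaN is skipped by sum()

print(raw_sum)
print(clean_sum)
```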

# EDA

In [4]:
for x in set(data['Topic']):
    print(x, dict(data[data['Topic']==x]['PublishDate'].value_counts()[:5]))

economy {Timestamp('2016-05-19 00:00:00'): 31, Timestamp('2015-11-09 00:00:00'): 20, Timestamp('2016-01-08 00:00:00'): 13, Timestamp('2016-05-18 00:00:00'): 12, Timestamp('2015-11-08 00:00:00'): 12}
palestine {Timestamp('2016-05-19 00:00:00'): 27, Timestamp('2015-11-18 00:00:00'): 22, Timestamp('2015-11-12 00:00:00'): 17, Timestamp('2016-03-22 00:00:00'): 16, Timestamp('2015-11-19 00:00:00'): 15}
obama {Timestamp('2015-11-18 00:00:00'): 54, Timestamp('2016-05-19 00:00:00'): 36, Timestamp('2015-11-17 00:00:00'): 31, Timestamp('2016-01-06 00:00:00'): 26, Timestamp('2015-11-16 00:00:00'): 24}
microsoft {Timestamp('2016-05-19 00:00:00'): 18, Timestamp('2015-11-04 00:00:00'): 16, Timestamp('2016-05-18 00:00:00'): 15, Timestamp('2015-11-05 00:00:00'): 14, Timestamp('2015-11-09 00:00:00'): 14}


Each Topic has several news items with the same publication timestamp; the most frequent timestamps all fall at 00:00:00.
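
One way to quantify this is to count items per exact timestamp and to measure how many timestamps carry no time of day. The sketch below uses a toy frame with made-up rows; on the real data the same expressions would be applied to `data`:

```python
import pandas as pd

# Toy rows; illustrative only.
toy = pd.DataFrame({
    "Topic": ["economy", "economy", "economy", "obama", "obama"],
    "PublishDate": pd.to_datetime([
        "2016-05-19 00:00:00", "2016-05-19 00:00:00", "2016-05-19 08:30:00",
        "2015-11-18 00:00:00", "2015-11-18 00:00:00",
    ]),
})

# Number of items per (Topic, exact timestamp)
counts = toy.groupby(["Topic", "PublishDate"]).size()

# Keep only the timestamps shared by more than one item
shared = counts[counts > 1]

# Fraction of items whose timestamp is exactly midnight
at_midnight = (toy["PublishDate"] == toy["PublishDate"].dt.normalize()).mean()

print(shared)
print(at_midnight)
```

A high midnight fraction would suggest that some sources report only the date, which is a hypothesis rather than something the notebook confirms.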

In [5]:
data.groupby(['Topic'])['PublishDate'].agg(['min','max'])

Topic,min,max
economy,2008-09-20 00:00:00,2016-07-07 15:31:10
microsoft,2015-03-01 00:19:00,2016-07-07 16:16:11
obama,2002-04-02 00:00:00,2016-07-07 14:20:15
palestine,2015-03-01 01:20:00,2016-07-07 15:59:22


In [6]:
data.groupby(['Topic'])['PublishDate'].agg(['min','max']).apply(lambda x: x['max']-x['min'], axis=1)

Topic
economy     2847 days 15:31:10
microsoft    494 days 15:57:11
obama       5210 days 14:20:15
palestine    494 days 14:39:22
dtype: timedelta64[ns]
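
The same span can be obtained without a row-wise `apply`, by subtracting the two aggregated columns directly. A minimal sketch on a toy frame with made-up dates:

```python
import pandas as pd

# Toy rows; illustrative only.
toy = pd.DataFrame({
    "Topic": ["economy", "economy", "obama", "obama"],
    "PublishDate": pd.to_datetime([
        "2016-01-01 00:00:00", "2016-01-03 12:00:00",
        "2016-02-01 06:00:00", "2016-02-01 18:00:00",
    ]),
})

# First and last publication per Topic
bounds = toy.groupby("Topic")["PublishDate"].agg(["min", "max"])

# Column-wise subtraction yields one Timedelta per Topic
span = bounds["max"] - bounds["min"]

print(span)
```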

In [7]:
data[data['Topic']=='economy'].sort_values('Facebook',ascending=False)[:20]

Unnamed: 0,IDLink,Title,Headline,Source,Topic,PublishDate,SentimentTitle,SentimentHeadline,Facebook,GooglePlus,LinkedIn
44634,27780,Editorial: Welcome rain clouds issues for economy,"On the back of good grass growth, Dairy NZ rev...",New Zealand Herald,economy,2016-02-29 07:17:31,0.070868,0.18474,49211,0,0
19658,12561,"For the Wealthiest, a Private Tax System That ...","A. Winters, a political scientist at Northwes...",New York Times,economy,2015-12-29 17:00:23,-0.113067,-0.104257,29564,774,1514
36767,23117,"Under Sanders, income and jobs would soar, eco...",Those are just a few of the things that would ...,CNNMoney,economy,2016-02-08 17:40:27,-0.041667,0.069317,16993,1267,455
32931,20851,Venezuela is on the brink of a complete econom...,The only question now is whether Venezuela's g...,Washington Post,economy,2016-01-29 15:40:34,-0.044799,0.093803,11336,182,243
47265,29734,Revealed: the 30-year economic betrayal draggi...,In seven major economies in North America and ...,The Guardian,economy,2016-03-07 11:20:29,-0.019981,-0.22964,8950,361,1444
32027,20075,Sinking economy may lead to Trudeau ouster: O'...,"Our economy is now measured in """"""dollerettes""...",Toronto Sun,economy,2016-01-28 01:12:25,-0.15468,0.077286,8010,10,16
58683,39921,The Panama Papers Could Lead to Capitalism's G...,"As Global Financial Integrity recently found, ...",TIME,economy,2016-04-04 18:01:17,0.0,0.111803,7997,137,597
28428,18082,More plastic than fish in oceans by 2050,"&quot;After a short first-use cycle, 95% of pl...",CNNMoney,economy,2016-01-19 14:00:27,-0.359066,-0.0195,7045,67,193
28522,18148,The Oceans Will Contain More Plastic Than Fish...,The New Plastics Economy presents three key pl...,Fortune,economy,2016-01-19 16:40:21,-0.3,0.135096,6056,12,79
73514,49375,PH is best economy in Southeast Asia  Oxfor...,"Metro Manila (CNN Philippines) """""" The Philipp...",CNN Philippines,economy,2016-05-13 16:01:21,0.026352,-0.031092,5695,4,65


# Charts

In [8]:
import plotly.express as px

In [55]:
def make_plot_line(data,x_col,Title):
    filtro = data['PublishDate'].between(pd.to_datetime('2015-11-01'),pd.to_datetime('2016-06-30'))
    dfg1 = data[filtro].groupby([x_col, data['PublishDate'].dt.to_period('D').dt.to_timestamp()],dropna=False)['IDLink'].nunique().reset_index()
    dfg1[x_col] = dfg1[x_col].fillna('').replace('','Nulo')
    fig = px.line(dfg1, x="PublishDate", color=x_col, y="IDLink", title=Title)
    fig.show()

In [64]:
make_plot_line(data, 'Topic', 'News items published per Topic')
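
The cells that follow repeat the same filter, group and plot steps with a different column or aggregation each time. One possible refactoring, sketched here under the assumption that only the value column and the aggregation vary, separates the aggregation from the plotting. `daily_by_group` is a hypothetical helper, not part of the original notebook, and the toy rows are made up:

```python
import pandas as pd


def daily_by_group(df, group_col, value_col, how, start, end):
    """Aggregate value_col per (group_col, calendar day) for rows dated within [start, end]."""
    in_range = df["PublishDate"].between(pd.to_datetime(start), pd.to_datetime(end))
    subset = df[in_range]
    # Truncate timestamps to midnight so that each row falls on its calendar day
    day = subset["PublishDate"].dt.normalize().rename("PublishDate")
    # Missing group labels are shown as 'Nulo', as in the notebook
    group = subset[group_col].fillna("Nulo")
    return subset.groupby([group, day])[value_col].agg(how).reset_index()


# Toy rows; illustrative only.
toy = pd.DataFrame({
    "IDLink": [1, 2, 3, 4],
    "Topic": ["economy", "economy", "obama", None],
    "PublishDate": pd.to_datetime([
        "2016-01-01 08:00", "2016-01-01 21:00", "2016-01-02 09:00", "2016-01-02 10:00",
    ]),
    "Facebook": [10, 5, 7, 1],
})

per_day_items = daily_by_group(toy, "Topic", "IDLink", "nunique", "2015-11-01", "2016-06-30")
per_day_fb = daily_by_group(toy, "Topic", "Facebook", "sum", "2015-11-01", "2016-06-30")

print(per_day_items)
print(per_day_fb)
```

The returned frame has the columns `px.line` is given in the notebook (`x="PublishDate"`, `color=group_col`, `y=value_col`). As in the original filter, the end bound is compared as midnight of that date, so items published later on the end date are excluded.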

In [98]:
x_col, Title = 'Topic', 'News items published per Topic by day of month'
filtro = data['PublishDate'].between(pd.to_datetime('2015-11-01'),pd.to_datetime('2016-06-30'))
dfg1 = data[filtro].groupby([x_col, data['PublishDate'].dt.day],dropna=False)['IDLink'].nunique().reset_index()
dfg1[x_col] = dfg1[x_col].fillna('').replace('','Nulo')
fig = px.line(dfg1, x="PublishDate", color=x_col, y="IDLink", title=Title)
fig.show()


In [97]:
x_col, Title = 'Topic', 'News items published per Topic by day of week'
filtro = data['PublishDate'].between(pd.to_datetime('2015-11-01'),pd.to_datetime('2016-06-30'))
dfg1 = data[filtro].groupby([x_col, data['PublishDate'].dt.dayofweek],dropna=False)['IDLink'].nunique().reset_index()
dfg1[x_col] = dfg1[x_col].fillna('').replace('','Nulo')
fig = px.line(dfg1, x="PublishDate", color=x_col, y="IDLink", title=Title)
fig.show()
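
`dt.dayofweek` encodes Monday as 0 and Sunday as 6, so the x axis of this chart shows numbers. A possible variant, sketched on a toy frame with made-up rows, groups by day name and keeps the calendar order through an ordered categorical; the `Weekday` column name is an assumption:

```python
import pandas as pd

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Toy rows; 2016-01-04 was a Monday and 2016-01-10 a Sunday.
toy = pd.DataFrame({
    "IDLink": [1, 2, 3, 4],
    "Topic": ["economy", "economy", "obama", "obama"],
    "PublishDate": pd.to_datetime(["2016-01-04", "2016-01-05", "2016-01-04", "2016-01-10"]),
})

# An ordered categorical sorts Monday..Sunday instead of alphabetically
weekday = pd.Categorical(toy["PublishDate"].dt.day_name(), categories=WEEKDAYS, ordered=True)

per_weekday = (
    toy.assign(Weekday=weekday)
    .groupby(["Topic", "Weekday"], observed=True)["IDLink"]
    .nunique()
    .reset_index()
)

print(per_weekday)
```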

## Facebook

In [61]:
x_col, Title = 'Topic', 'News popularity per Topic on Facebook'
filtro = data['PublishDate'].between(pd.to_datetime('2015-11-01'),pd.to_datetime('2016-06-30'))
dfg1 = data[filtro].groupby([x_col, data['PublishDate'].dt.to_period('D').dt.to_timestamp()],dropna=False)['Facebook'].sum().reset_index()
dfg1[x_col] = dfg1[x_col].fillna('').replace('','Nulo')
fig = px.line(dfg1, x="PublishDate", color=x_col, y="Facebook", title=Title)
fig.show()

In [83]:
data[filtro].sort_values('Facebook', ascending=False).head(10)[['Title','PublishDate']].set_index('PublishDate').to_dict()

{'Title': {Timestamp('2016-02-29 07:17:31'): 'Editorial: Welcome rain clouds issues for economy',
  Timestamp('2016-01-13 05:44:47'): "Fact Check: Top 10 Lies in Obama's State of the Union",
  Timestamp('2016-02-09 08:32:36'): 'I Miss Barack Obama',
  Timestamp('2016-01-12 21:24:46'): "Obama's legacy is at stake",
  Timestamp('2015-12-29 17:00:23'): 'For the Wealthiest, a Private Tax System That Saves Them Billions',
  Timestamp('2015-11-20 19:24:35'): 'How the inner Obama fights ISIS',
  Timestamp('2015-12-17 05:44:45'): 'Paul Ryan Betrays America: $1.1 Trillion, 2000-Plus Page Omnibus ...',
  Timestamp('2016-03-24 09:53:54'): "Microsoft's 'teen girl' AI turns into a Hitler-loving sex robot within 24 ...",
  Timestamp('2016-03-04 17:40:36'): "GZ Roundtable: Is Microsoft's effort to bring Xbox One exclusives to ...",
  Timestamp('2016-02-09 01:20:28'): "Microsoft's Cortana won't take your crap"}}

In [90]:
Title = "Popularity of microsoft on Facebook"
filtro2 = data['Topic'].str.lower().str.contains('microsoft')
dfg1 = data[filtro & filtro2].groupby([x_col, data['PublishDate'].dt.to_period('D').dt.to_timestamp()],dropna=False)['Facebook'].sum().reset_index()
dfg1[x_col] = dfg1[x_col].fillna('').replace('','Nulo')
fig = px.line(dfg1, x="PublishDate", color=x_col, y="Facebook", title=Title)
fig.show()

In [91]:
data[filtro & filtro2].sort_values('Facebook', ascending=False).head(10)[['Title','PublishDate']].set_index('PublishDate').to_dict()

{'Title': {Timestamp('2016-03-24 09:53:54'): "Microsoft's 'teen girl' AI turns into a Hitler-loving sex robot within 24 ...",
  Timestamp('2016-03-04 17:40:36'): "GZ Roundtable: Is Microsoft's effort to bring Xbox One exclusives to ...",
  Timestamp('2016-02-09 01:20:28'): "Microsoft's Cortana won't take your crap",
  Timestamp('2016-03-14 16:48:36'): 'Microsoft will allow Xbox gamers to play against PS4 and PC players',
  Timestamp('2016-05-20 10:35:00'): "Microsoft's Nadella to Meet India's CEOs, Tech Leaders (MSFT)",
  Timestamp('2016-03-30 17:21:19'): 'Microsoft is adding the Linux command line to Windows 10',
  Timestamp('2016-03-30 17:26:19'): 'Microsoft is bringing the Bash shell to Windows 10',
  Timestamp('2016-02-02 22:00:27'): "Look out: Microsoft shifts Windows 10 to 'Recommended' update ...",
  Timestamp('2016-03-15 21:39:35'): "Sony Responds to Microsoft's Invite to Connect Xbox One and PS4 ...",
  Timestamp('2016-03-16 11:51:36'): "Sony responds to Microsoft's invite to 

In [99]:
Title = "Popularity of economy on Facebook"
filtro2 = data['Topic'].str.lower().str.contains('economy')
dfg1 = data[filtro & filtro2].groupby([x_col, data['PublishDate'].dt.to_period('D').dt.to_timestamp()],dropna=False)['Facebook'].sum().reset_index()
dfg1[x_col] = dfg1[x_col].fillna('').replace('','Nulo')
fig = px.line(dfg1, x="PublishDate", color=x_col, y="Facebook", title=Title)
fig.show()

In [101]:
data[filtro & filtro2].sort_values('Facebook', ascending=False).head(20)[['Title','PublishDate']].set_index('PublishDate').to_dict()

{'Title': {Timestamp('2016-02-29 07:17:31'): 'Editorial: Welcome rain clouds issues for economy',
  Timestamp('2015-12-29 17:00:23'): 'For the Wealthiest, a Private Tax System That Saves Them Billions',
  Timestamp('2016-02-08 17:40:27'): 'Under Sanders, income and jobs would soar, economist says',
  Timestamp('2016-01-29 15:40:34'): 'Venezuela is on the brink of a complete economic collapse',
  Timestamp('2016-03-07 11:20:29'): 'Revealed: the 30-year economic betrayal dragging down ...',
  Timestamp('2016-01-28 01:12:25'): "Sinking economy may lead to Trudeau ouster: O'Leary",
  Timestamp('2016-04-04 18:01:17'): "The Panama Papers Could Lead to Capitalism's Great Crisis",
  Timestamp('2016-01-19 14:00:27'): 'More plastic than fish in oceans by 2050',
  Timestamp('2016-01-19 16:40:21'): 'The Oceans Will Contain More Plastic Than Fish by 2050',
  Timestamp('2016-05-13 16:01:21'): 'PH is best economy in Southeast Asia \x9d\x9d\x9d Oxford Business Group',
  Timestamp('2015-12-30 10:00:19'

## Google+

In [62]:
x_col, Title = 'Topic', 'News popularity per Topic on Google+'
filtro = data['PublishDate'].between(pd.to_datetime('2015-11-01'),pd.to_datetime('2016-06-30'))
dfg1 = data[filtro].groupby([x_col, data['PublishDate'].dt.to_period('D').dt.to_timestamp()],dropna=False)['GooglePlus'].sum().reset_index()
dfg1[x_col] = dfg1[x_col].fillna('').replace('','Nulo')
fig = px.line(dfg1, x="PublishDate", color=x_col, y="GooglePlus", title=Title)
fig.show()

In [84]:
data[filtro].sort_values('GooglePlus', ascending=False).head(10)[['Title','PublishDate']].set_index('PublishDate').to_dict()

{'Title': {Timestamp('2016-02-08 17:40:27'): 'Under Sanders, income and jobs would soar, economist says',
  Timestamp('2016-03-30 17:21:19'): 'Microsoft is adding the Linux command line to Windows 10',
  Timestamp('2016-02-17 16:40:33'): 'Learning the Alphabet',
  Timestamp('2016-03-24 09:53:54'): "Microsoft's 'teen girl' AI turns into a Hitler-loving sex robot within 24 ...",
  Timestamp('2015-11-15 18:20:14'): 'Intervention by PM at G20 working session on Inclusive Growth ...',
  Timestamp('2016-03-29 23:41:18'): '\x9d\x9d\x9dMicrosoft and Canonical partner to bring Ubuntu to Windows 10',
  Timestamp('2015-12-29 17:00:23'): 'For the Wealthiest, a Private Tax System That Saves Them Billions',
  Timestamp('2016-02-02 13:13:30'): 'Michele Bachmann warns Obama will take over the United Nations ...',
  Timestamp('2015-12-10 23:24:39'): "Police Chief: 'Revolution' Coming if Obama Tries to Disarm ...",
  Timestamp('2016-01-16 23:11:26'): 'Microsoft says new processors will only work with Wi

## LinkedIn

In [63]:
x_col, Title = 'Topic', 'News popularity per Topic on LinkedIn'
filtro = data['PublishDate'].between(pd.to_datetime('2015-11-01'),pd.to_datetime('2016-06-30'))
dfg1 = data[filtro].groupby([x_col, data['PublishDate'].dt.to_period('D').dt.to_timestamp()],dropna=False)['LinkedIn'].sum().reset_index()
dfg1[x_col] = dfg1[x_col].fillna('').replace('','Nulo')
fig = px.line(dfg1, x="PublishDate", color=x_col, y="LinkedIn", title=Title)
fig.show()

In [85]:
data[filtro].sort_values('LinkedIn', ascending=False).head(10)[['Title','PublishDate']].set_index('PublishDate').to_dict()

{'Title': {Timestamp('2016-06-13 14:40:16'): 'Microsoft to buy LinkedIn for $26.2B in cash, makes big move into ...',
  Timestamp('2016-06-13 05:50:00'): 'Microsoft to buy LinkedIn for $26B in cash, makes big move into enterprise social media',
  Timestamp('2016-06-13 14:00:18'): 'Microsoft and LinkedIn: Together Changing the Way the World Works',
  Timestamp('2016-06-13 15:40:12'): "LinkedIn CEO: Here's Why I Sold the Company to Microsoft",
  Timestamp('2016-06-13 13:56:16'): 'Microsoft to buy LinkedIn for $26.2 billion; LNKD shares jump 48 pct',
  Timestamp('2016-06-13 13:35:15'): 'Microsoft to acquire LinkedIn for $26.2 billion',
  Timestamp('2016-06-13 13:59:16'): 'Microsoft to Buy LinkedIn for $26.2 Billion',
  Timestamp('2016-06-13 06:03:00'): 'Microsoft Agrees to Acquire LinkedIn for $26.2 Billion',
  Timestamp('2016-06-20 18:40:11'): "Obama on post-White House job plans: 'I'm gonna get on LinkedIn'",
  Timestamp('2016-06-13 14:00:14'): 'Microsoft Pays $26 Billion for LinkedIn i

In [102]:
x_col, Title = 'Topic', 'News popularity per Topic on LinkedIn (through 2016-05-31)'
filtro = data['PublishDate'].between(pd.to_datetime('2015-11-01'),pd.to_datetime('2016-05-31'))
dfg1 = data[filtro].groupby([x_col, data['PublishDate'].dt.to_period('D').dt.to_timestamp()],dropna=False)['LinkedIn'].sum().reset_index()
dfg1[x_col] = dfg1[x_col].fillna('').replace('','Nulo')
fig = px.line(dfg1, x="PublishDate", color=x_col, y="LinkedIn", title=Title)
fig.show()