### What is important when you write about data/computing science?

In this class we will try to work out which metrics matter when you write a science blog on a platform like medium.com.

We will take statistics exported from the platform's own stats page and apply our data science knowledge to them.

The idea is that even with a very simple dataset and a good question in mind, we can carry out a useful analysis with only a few lines of code.

In [1]:
# Let's import everything we need

# Data science imports
import pandas as pd
import numpy as np

%load_ext autoreload
%autoreload 2

# Options for pandas
pd.options.display.max_columns = 25

# Display all cell outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

import plotly.plotly as py
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly.offline import iplot

import cufflinks
cufflinks.go_offline()

# Silence warnings raised by some incompatible methods
import warnings
warnings.filterwarnings('ignore')

In [2]:
# read data
df = pd.read_csv("test-data/data.cvs")
df



Unnamed: 0.1,Unnamed: 0,claps,days_since_publication,fans,link,num_responses,publication,published_date,read_ratio,read_time,reads,started_date,...,title,title_word_count,type,views,word_count,claps_per_word,editing_days,<tag>Education,<tag>Data Science,<tag>Towards Data Science,<tag>Machine Learning,<tag>Python
0,121,2,716.053848,2,https://medium.com/p/screw-the-environment-but...,0,,2017-06-10 14:25:00,42.17,7,70,2017-06-10 14:24:00,...,"Screw the Environment, but Consider Your Wallet",8,published,166,1859,0.001076,0,0,0,0,0,0
1,132,18,708.735909,3,https://medium.com/p/the-vanquishing-of-war-pl...,0,,2017-06-17 22:02:00,29.51,14,54,2017-06-17 22:02:00,...,"The Vanquishing of War, Plague and Famine",8,published,183,3891,0.004626,0,0,0,0,0,0
2,119,52,696.116014,20,https://medium.com/p/capstone-project-mercedes...,0,,2017-06-30 12:55:00,19.98,42,224,2017-06-30 12:00:00,...,Capstone Project: Mercedes-Benz Greener Manufa...,7,published,1121,12025,0.004324,0,0,0,0,1,1
3,130,0,695.273421,0,https://medium.com/p/home-of-the-scared-5af0fe...,0,,2017-07-01 09:08:00,35.85,9,19,2017-06-30 18:21:00,...,Home of the Scared,4,published,53,2533,0.000000,0,0,0,0,0,0
4,123,0,691.285764,0,https://medium.com/p/the-triumph-of-peace-f485...,0,,2017-07-05 08:51:00,8.33,14,5,2017-07-03 20:18:00,...,The Triumph of Peace,4,published,60,3892,0.000000,1,0,0,0,0,0
5,117,0,676.772808,0,https://medium.com/p/nasa-internship-report-dd...,0,,2017-07-19 21:09:00,19.62,47,62,2017-07-19 19:07:00,...,NASA Internship Report,3,published,316,13048,0.000000,0,0,0,0,0,0
6,126,77,670.908352,20,https://medium.com/p/deep-neural-network-class...,1,,2017-07-25 17:54:00,17.65,14,1728,2017-07-24 11:31:00,...,Deep Neural Network Classifier,4,published,9791,1778,0.043307,1,0,0,0,1,1
7,128,252,668.767744,56,https://medium.com/p/object-recognition-with-g...,2,,2017-07-27 21:17:00,23.69,12,5738,2017-07-26 20:49:00,...,Object Recognition with Google’s Convolutional...,8,published,24225,2345,0.107463,1,0,0,0,1,1
8,118,22,665.911324,2,https://medium.com/p/make-an-effort-not-an-exc...,0,,2017-07-30 17:50:00,33.09,7,46,2017-07-30 09:06:00,...,"Make an Effort, Not an Excuse",7,published,139,1895,0.011609,0,0,0,0,0,0
9,124,0,664.056211,1,https://medium.com/p/the-ascent-of-humanity-54...,0,,2017-08-01 14:21:00,21.74,17,15,2017-07-15 20:04:00,...,The Ascent of Humanity,4,published,69,4684,0.000000,16,0,0,0,0,0


In [3]:
df.columns

Index(['Unnamed: 0', 'claps', 'days_since_publication', 'fans', 'link',
       'num_responses', 'publication', 'published_date', 'read_ratio',
       'read_time', 'reads', 'started_date', 'tags', 'text', 'title',
       'title_word_count', 'type', 'views', 'word_count', 'claps_per_word',
       'editing_days', '<tag>Education', '<tag>Data Science',
       '<tag>Towards Data Science', '<tag>Machine Learning', '<tag>Python'],
      dtype='object')

###### An explanation of some of the columns
- 'claps': the equivalent of a like for a given article
- 'fans': the number of distinct readers who clapped for the article
- 'num_responses': the number of replies or comments on the article
- 'publication': where the article was published
- 'read_time': the estimated time needed to read the article, in minutes
- 'title_word_count': the number of words in the title
- 'type': published or unpublished
- 'views': how many times the article was opened (the separate 'reads' column counts the views that Medium considers full reads)
- 'claps_per_word': the ratio of claps to word count
- '\<tag\>' columns: 0/1 indicators for the tags used as keywords for the article
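
The derived columns can be checked against the table above. The sketch below recomputes three of them for rows 0, 1 and 9; the formulas are assumptions inferred from the displayed numbers, not taken from the code that built the dataset.

```python
import pandas as pd

# Rows 0, 1 and 9 of the table above (only the columns needed here).
sample = pd.DataFrame({
    "claps": [2, 18, 0],
    "word_count": [1859, 3891, 4684],
    "reads": [70, 54, 15],
    "views": [166, 183, 69],
    "started_date": ["2017-06-10 14:24:00", "2017-06-17 22:02:00", "2017-07-15 20:04:00"],
    "published_date": ["2017-06-10 14:25:00", "2017-06-17 22:02:00", "2017-08-01 14:21:00"],
})

# Assumed definitions, inferred from the values in the table.
claps_per_word = sample["claps"] / sample["word_count"]
read_ratio = 100 * sample["reads"] / sample["views"]  # percentage of views that became reads
editing_days = (
    pd.to_datetime(sample["published_date"]) - pd.to_datetime(sample["started_date"])
).dt.days  # whole days between starting the draft and publishing

print(claps_per_word.round(6).tolist())
print(read_ratio.round(2).tolist())
print(editing_days.tolist())
```

The recomputed values agree with the 'claps_per_word', 'read_ratio' and 'editing_days' entries shown for those rows, which suggests these definitions are at least consistent with the data.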

In [4]:
df.dtypes

Unnamed: 0                     int64
claps                          int64
days_since_publication       float64
fans                           int64
link                          object
num_responses                  int64
publication                   object
published_date                object
read_ratio                   float64
read_time                      int64
reads                          int64
started_date                  object
tags                          object
text                          object
title                         object
title_word_count               int64
type                          object
views                          int64
word_count                     int64
claps_per_word               float64
editing_days                   int64
<tag>Education                 int64
<tag>Data Science              int64
<tag>Towards Data Science      int64
<tag>Machine Learning          int64
<tag>Python                    int64
dtype: object

In [5]:
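# Let's calculate the pairwise correlations between the numeric columns for the published articles
# (select_dtypes drops text columns such as 'link', which recent pandas versions refuse to correlate)
corrs = df[df['type'] == 'published'].select_dtypes(include='number').corr()
corrs.round(2)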

Unnamed: 0.1,Unnamed: 0,claps,days_since_publication,fans,num_responses,read_ratio,read_time,reads,title_word_count,views,word_count,claps_per_word,editing_days,<tag>Education,<tag>Data Science,<tag>Towards Data Science,<tag>Machine Learning,<tag>Python
Unnamed: 0,1.0,-0.11,0.98,-0.08,-0.09,0.05,0.33,0.14,-0.26,0.13,0.28,-0.1,-0.07,-0.75,-0.24,-0.3,0.05,0.11
claps,-0.11,1.0,-0.15,0.99,0.9,-0.04,-0.14,0.76,0.09,0.74,-0.14,0.77,-0.01,0.27,0.4,0.56,0.2,0.28
days_since_publication,0.98,-0.15,1.0,-0.12,-0.12,0.05,0.36,0.08,-0.27,0.07,0.32,-0.11,-0.09,-0.76,-0.31,-0.34,0.0,0.07
fans,-0.08,0.99,-0.12,1.0,0.87,-0.04,-0.13,0.76,0.11,0.75,-0.14,0.76,-0.0,0.25,0.39,0.55,0.21,0.29
num_responses,-0.09,0.9,-0.12,0.87,1.0,0.03,-0.14,0.78,0.06,0.74,-0.16,0.78,-0.05,0.21,0.36,0.52,0.13,0.3
read_ratio,0.05,-0.04,0.05,-0.04,0.03,1.0,-0.6,0.02,0.02,-0.15,-0.54,0.28,0.09,0.09,-0.01,-0.09,-0.28,-0.22
read_time,0.33,-0.14,0.36,-0.13,-0.14,-0.6,1.0,-0.1,-0.12,0.0,0.96,-0.26,-0.07,-0.43,-0.16,-0.17,0.15,0.21
reads,0.14,0.76,0.08,0.76,0.78,0.02,-0.1,1.0,0.02,0.94,-0.13,0.56,-0.07,-0.02,0.37,0.33,0.25,0.4
title_word_count,-0.26,0.09,-0.27,0.11,0.06,0.02,-0.12,0.02,1.0,0.01,-0.13,0.09,-0.01,0.32,0.11,0.29,0.26,0.24
views,0.13,0.74,0.07,0.75,0.74,-0.15,0.0,0.94,0.01,1.0,-0.04,0.4,-0.05,-0.04,0.35,0.32,0.33,0.43
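
Each entry in this table is a Pearson correlation coefficient (the pandas default). As a minimal sketch, the coefficient for claps and fans can be computed by hand on the ten rows displayed earlier and compared with pandas; the full dataset will give a somewhat different number than this small sample.

```python
import numpy as np
import pandas as pd

# claps and fans for the ten rows displayed at the top of the notebook
sample = pd.DataFrame({
    "claps": [2, 18, 52, 0, 0, 0, 77, 252, 22, 0],
    "fans": [2, 3, 20, 0, 0, 0, 20, 56, 2, 1],
})

# Pearson r: covariance of the two columns divided by the product of their spreads
x = sample["claps"] - sample["claps"].mean()
y = sample["fans"] - sample["fans"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

# The same quantity from pandas
r_pandas = sample.corr().loc["claps", "fans"]

print(round(r_manual, 4), round(r_pandas, 4))
```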


In [6]:
# Let's pick the claps and see how it correlates with the other variables
corrs['claps'].sort_values(ascending=False)

claps                        1.000000
fans                         0.988034
num_responses                0.900918
claps_per_word               0.774486
reads                        0.760699
views                        0.741683
<tag>Towards Data Science    0.559420
<tag>Data Science            0.404089
<tag>Python                  0.277659
<tag>Education               0.269488
<tag>Machine Learning        0.199788
title_word_count             0.086853
editing_days                -0.010198
read_ratio                  -0.036428
Unnamed: 0                  -0.107036
read_time                   -0.136584
word_count                  -0.144957
days_since_publication      -0.145061
Name: claps, dtype: float64
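
Sorting by the signed value puts the negative correlations at the bottom even when they are stronger than the weak positive ones. A small sketch of ranking by strength instead, using a handful of the values printed above (the variable names are illustrative):

```python
import pandas as pd

# A subset of the correlations with claps printed above
claps_corr = pd.Series({
    "claps": 1.000000,
    "fans": 0.988034,
    "reads": 0.760699,
    "title_word_count": 0.086853,
    "editing_days": -0.010198,
    "word_count": -0.144957,
    "days_since_publication": -0.145061,
})

# Drop the trivial self-correlation, then rank by absolute value
strength = claps_corr.drop("claps").abs().sort_values(ascending=False)
print(strength)
```

On the full result the same idea would be `corrs['claps'].drop('claps').abs().sort_values(ascending=False)`.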

In [7]:
# Named colorscales to try in the plots below
colorscales = ['Greys', 'YlGnBu', 'Greens', 'YlOrRd', 'Bluered', 'RdBu',
               'Reds', 'Blues', 'Picnic', 'Rainbow', 'Portland', 'Jet',
               'Hot', 'Blackbody', 'Earth', 'Electric', 'Viridis', 'Cividis']

In [8]:
# Let's plot a simple heatmap to visualize the correlations
figure = ff.create_annotated_heatmap(z=corrs.round(2).values,
                                     x=list(corrs.columns),
                                     y=list(corrs.index),
                                     colorscale='YlGnBu',
                                     annotation_text=corrs.round(2).values)
iplot(figure)

In [9]:
# Let's create a scatterplot matrix to see the distribution of read_time and claps, grouped by type
figure = ff.create_scatterplotmatrix(df[['read_time', 'claps', 'type']],
                                     index='type', colormap='Jet', title='Scatterplot Matrix by Type',
                                     diag='histogram', width=800, height=800)
iplot(figure)

In [10]:
# Let's plot the relationship between several metrics, grouped by where the article was published
figure = ff.create_scatterplotmatrix(df[['read_time', 'claps', 'views',
                                         'num_responses', 'publication']],
                                     index='publication',
                                     diag='histogram',
                                     size=8, width=1000, height=1000,
                                     title='Scatterplot Matrix by Publication')

iplot(figure)

###### Pay attention to which metrics are positively related and which are negatively related.
###### Also look at the distribution of each metric: is it skewed in either direction, or is it close to a normal distribution?
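
Skewness can also be checked numerically rather than by eye. A minimal sketch using the claps values from the ten rows displayed earlier: a mean well above the median and a positive `skew()` both point to a long right tail.

```python
import pandas as pd

# claps for the ten rows displayed at the top of the notebook
claps = pd.Series([2, 18, 52, 0, 0, 0, 77, 252, 22, 0], name="claps")

summary = {
    "mean": claps.mean(),
    "median": claps.median(),
    "skew": claps.skew(),  # > 0 means right-skewed, < 0 left-skewed, about 0 symmetric
}
print(summary)
```

For the whole dataset the same check is `df['claps'].skew()`, and likewise for any other numeric column.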

In [11]:
from visuals import make_hist

In [12]:
# Number of views by place of publication (where is it better to publish?)
figure = make_hist(df, x='views', category='publication')
iplot(figure)

In [13]:
# Distribution of claps
figure = make_hist(df, x='claps')
iplot(figure)

In [14]:
from visuals import make_cum_plot


In [15]:
# Cumulative views, which tell you how your user base is growing
figure = make_cum_plot(df, y='views')
iplot(figure)



In [16]:
# How productive are you?
figure = make_cum_plot(df, y='word_count')
iplot(figure)

In [17]:
# Where is my user base growing the most?
figure = make_cum_plot(df, y='views', category='publication')
iplot(figure)
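
The `visuals` module is not shown here, so the internals of `make_cum_plot` are unknown. The quantity it plots is presumably a running total ordered by publication date, which can be sketched with plain pandas. The toy data below is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: dates, publication names and view counts are invented
toy = pd.DataFrame({
    "published_date": pd.to_datetime(["2018-01-03", "2018-01-01", "2018-01-02", "2018-01-04"]),
    "publication": ["A", "A", "B", "B"],
    "views": [30, 10, 20, 40],
})

# A cumulative total only makes sense in chronological order
toy = toy.sort_values("published_date")

# Running total over all articles
toy["cum_views"] = toy["views"].cumsum()

# Running total within each publication
toy["cum_views_by_pub"] = toy.groupby("publication")["views"].cumsum()

print(toy)
```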

In [18]:
from visuals import make_scatter_plot

In [19]:
# How does read_ratio correlate with read_time? Do my readers like long reads?
figure = make_scatter_plot(df, x='read_time', y='read_ratio')
iplot(figure)
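
The trend in this scatter plot can be summarized with a straight-line fit. The sketch below uses only the ten rows displayed at the top of the notebook, so the numbers illustrate the method and are not the result for the full dataset.

```python
import numpy as np

# read_time (minutes) and read_ratio (%) for the ten rows displayed earlier
read_time = np.array([7, 14, 42, 9, 14, 47, 14, 12, 7, 17])
read_ratio = np.array([42.17, 29.51, 19.98, 35.85, 8.33, 19.62, 17.65, 23.69, 33.09, 21.74])

# Least-squares line: read_ratio is approximately slope * read_time + intercept
slope, intercept = np.polyfit(read_time, read_ratio, deg=1)

# Pearson correlation between the two variables
r = np.corrcoef(read_time, read_ratio)[0, 1]

print(f"slope={slope:.3f} percentage points per minute, r={r:.2f}")
```

A negative slope means the share of viewers who finish an article drops as the article gets longer, in line with the negative correlation between read_ratio and read_time in the correlation table.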

In [20]:
# How views and read time relate, grouped by type
figure = make_scatter_plot(df, x='read_time', y='views', ylog=True,
                           category='type')
iplot(figure)

In [21]:
# Views vs. read time again, with marker size scaled by read_ratio
figure = make_scatter_plot(df, x='read_time', y='views', ylog=True,
                           scale='read_ratio', sizeref=0.2)
iplot(figure)



In [22]:
# Fans vs. word count, with marker size scaled by claps
figure = make_scatter_plot(df, x='word_count', y='fans',
                           scale='claps', sizeref=5)
iplot(figure)

In [23]:
# Reads vs. word count (log x-axis), with marker size scaled by claps
figure = make_scatter_plot(df, x='word_count', y='reads', xlog=True,
                           scale='claps', sizeref=3)
iplot(figure)

In [24]:
# Show the first entry of each column for every publication
df.groupby('publication').first()

publication,Unnamed: 0,claps,days_since_publication,fans,link,num_responses,published_date,read_ratio,read_time,reads,started_date,tags,text,title,title_word_count,type,views,word_count,claps_per_word,editing_days,<tag>Education,<tag>Data Science,<tag>Towards Data Science,<tag>Machine Learning,<tag>Python
Engineering @ Feature Labs,38,393,243.083759,55,https://medium.com/feature-labs-engineering/fe...,2,2018-09-26 13:41:00,28.46,9,880,2018-09-20 14:21:00,"['Apache Spark', 'Python', 'Feature Engineerin...",Featuretools on Spark Distributed feature engi...,Featuretools on Spark,3,published,3092,2087,0.188309,5,1,1,0,0,1
,74,101,469.864577,23,https://medium.com/p/slow-tech-take-back-your-...,1,2018-02-11 18:57:00,57.94,3,135,2018-02-10 08:15:00,"['Social Media', 'Technology', 'Education', 'F...",Slow Tech: Take Back Your Mind Technology shou...,Slow Tech: Take Back Your Mind,6,published,233,634,0.159306,1,1,0,0,0,0
Noteworthy - The Journal Blog,40,1000,222.219749,136,https://blog.usejournal.com/the-power-of-i-don...,3,2018-10-17 10:26:00,25.83,9,559,2018-10-16 06:59:00,"['Growth Mindset', 'Education', 'Learning', 'S...",The Power of I Don’t Know Intellectual humilit...,The Power of I Don’t Know,7,published,2164,2172,0.460405,1,1,0,0,0,0
Towards Data Science,49,670,270.856669,117,https://towardsdatascience.com/how-to-put-full...,2,2018-08-29 19:08:00,74.24,1,1265,2018-08-29 08:40:00,"['Data Science', 'Programming', 'Education', '...","How to Put Fully Interactive, Runnable Code in...","How to Put Fully Interactive, Runnable Code in...",12,published,1704,163,4.110429,0,1,1,0,0,1
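
Note that `first()` returns the first non-null value of each column within a group, so it shows example entries rather than a summary. To compare publications, an aggregation is usually more informative. A minimal sketch on made-up data (the named-aggregation syntax assumes pandas 0.25 or later):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; publication names and numbers are invented
toy = pd.DataFrame({
    "publication": ["A", "A", "B"],
    "views": [10, 30, 20],
    "claps": [np.nan, 3, 2],
})

# first(): first non-null value per column, so 'claps' for A comes from the second row
first_rows = toy.groupby("publication").first()

# A per-publication summary instead of a single example entry
summary = toy.groupby("publication").agg(
    articles=("views", "size"),
    median_views=("views", "median"),
)

print(first_rows)
print(summary)
```

On the real data, something like `df.groupby('publication')['views'].median()` would show where a typical article gets the most views.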


# Conclusions for the client
 -
 - 
