## Data Science Workflow

Python -> R? -> D3

In [None]:
# import the d3_module
import d3_example

# typical imports
# import requirements
from IPython.display import Image
from IPython.display import display
from datetime import *
import json
from copy import *
from pprint import *
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import rpy2
%load_ext rpy2.ipython
%R require("ggplot2")
%matplotlib inline
from ggplot import *
randn = np.random.randn



### Pandas

Quick use cases with Pandas:

- Cleaning data
- View vs Copy 
- Datetime conversion
- Datetime Binning

#### Cleaning data
Common tools that I use:
- Try using [DataFrame.dropna()](http://stackoverflow.com/questions/13413590/how-to-drop-rows-of-pandas-dataframe-whose-value-of-certain-column-is-nan#answer-13434501) to remove rows with null values.
- The [pd.io.parsers.read_csv()](http://pandas.pydata.org/pandas-docs/dev/generated/pandas.io.parsers.read_csv.html) function has a `names` argument that lets you supply your own column names, a `skiprows` argument that takes either a list of row indices or a number of rows to skip, and a `parse_dates` argument that makes string-to-date conversion simple.
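
As a minimal sketch of how these arguments fit together, the cell below reads a small in-memory CSV instead of a file on disk. The CSV contents and the column names are made up for illustration; only `names`, `skiprows`, `parse_dates`, and `dropna` come from the notes above.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a file on disk.
raw = io.StringIO(
    "exported by some tool\n"
    "timeStamp,counts\n"
    "2014-01-05 03:07,12\n"
    ",7\n"
    "2014-02-10 14:30,30\n"
)

clean = pd.read_csv(
    raw,
    skiprows=2,                      # skip the banner line and the original header
    names=["timeStamp", "counts"],   # supply our own column names
    parse_dates=["timeStamp"],       # convert strings to datetimes while reading
)

# Drop the row whose timestamp is missing.
clean = clean.dropna(subset=["timeStamp"])
print(clean)
```

Passing `subset` keeps rows that are only missing values in other columns, which is usually what you want when one column is the key.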

In [None]:
sampleData = pd.io.parsers.read_csv(
    "../data/sampleData.csv",
    header=0
    #names=['timeStamp1','ts','counts']
    #,parse_dates=[0]
    )
sampleData = sampleData.dropna(subset=['timeStamp'])
display(sampleData.head())


#### View vs Copy
An easy frustration with a [simple solution](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy).

The first example uses chained indexing: `sampleData.iloc[0]` returns a __copy__ of the first row, and the assignment modifies that copy. The original `sampleData` remains unchanged.

In [None]:
sampleData.iloc[0]['healthy_index'] = 5555
sampleData.iloc[0]['timeStamp'] = "1/1/12 5:55"
sampleData.head()

The next example selects the row and the column in a single indexing call, so it works on a __view__ of `sampleData`. The original value is changed.

In [None]:
# .ix was removed in pandas 1.0; .loc is the label-based equivalent
sampleData.loc[0, 'healthy_index'] = 5555
sampleData.loc[0, 'timeStamp'] = "1/1/12 5:55"
sampleData.head()
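
The difference is easier to see on a tiny frame. This is a sketch on a hypothetical two-column frame (the `label` column is invented), assuming mixed dtypes as in `sampleData`; it is not a statement about every pandas version's internals.

```python
import warnings
import pandas as pd

# Hypothetical frame with mixed dtypes, like sampleData.
frame = pd.DataFrame({"healthy_index": [1, 2, 3], "label": ["a", "b", "c"]})

# Chained indexing: iloc[0] materialises the row as a separate Series, so the
# write goes to that temporary object. pandas usually warns about this, so the
# warning is silenced here to keep the output readable.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    frame.iloc[0]["healthy_index"] = 5555
unchanged = frame.loc[0, "healthy_index"]   # still the original value

# Single indexer: row and column are resolved in one call on frame itself.
frame.loc[0, "healthy_index"] = 5555
changed = frame.loc[0, "healthy_index"]

print(unchanged, changed)
```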

#### Datetime Conversion
When `parse_dates` is not an option, we can apply a transformation using `strptime`, which can handle non-zero-padded values (for example `1/1/12 5:55`).

In [None]:
sampleData = pd.io.parsers.read_csv(
    "../data/sampleData.csv",
    header=0
    #names=['timeStamp1','ts','counts']
    #,parse_dates=[0]
    )
sampleData = sampleData.dropna(subset=['timeStamp'])

# strptime already returns datetime objects, so to_datetime needs no format here
sampleData['timeStamp'] = pd.to_datetime(
    sampleData['timeStamp'].apply(lambda x:
        datetime.strptime(x, "%m/%d/%y %H:%M")))

sampleData.head()
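
For comparison, here is a small sketch on made-up timestamps showing that the row-by-row `strptime` route and a single vectorized `pd.to_datetime` call with an explicit `format` should give the same result. The sample strings are assumptions, chosen to include non-zero-padded fields.

```python
from datetime import datetime
import pandas as pd

# Hypothetical timestamps with non-zero-padded month, day and hour.
stamps = pd.Series(["1/5/14 3:07", "11/15/14 14:30"])
fmt = "%m/%d/%y %H:%M"

# Row-by-row conversion with the standard library.
parsed_apply = stamps.apply(lambda s: datetime.strptime(s, fmt))

# Vectorized conversion with pandas, using the same format string.
parsed_vector = pd.to_datetime(stamps, format=fmt)

print(parsed_vector)
```

The vectorized call is usually the faster of the two on large columns.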

#### Datetime Binning
A datetime column exposes properties such as year and month through its `.dt` accessor, which makes binning rather simple.

In [None]:
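# .ix was removed in pandas 1.0; copy so that adding columns later does not touch sampleData
donut = sampleData.loc[:, ["timeStamp", "donut"]].copy()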
display(donut.head())

Pull the year and month out of the datetime column to create new columns.

In [None]:
donut['Year'] = donut.timeStamp.dt.year
donut['Month'] = donut.timeStamp.dt.month
display(donut.head())
#display(sampleData.timeStamp.max())
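
The same idea on a self-contained frame, as a sketch: the `events` data is invented, and `hour` is added only to show that other `.dt` properties work the same way as `year` and `month`.

```python
import pandas as pd

# Hypothetical events with a datetime column.
events = pd.DataFrame({
    "timeStamp": pd.to_datetime(
        ["2014-01-05 03:07", "2014-01-20 10:00", "2015-03-02 08:15"]),
    "donut": [3, 4, 5],
})

# Each .dt property returns a Series aligned with the original rows.
events["Year"] = events["timeStamp"].dt.year
events["Month"] = events["timeStamp"].dt.month
events["Hour"] = events["timeStamp"].dt.hour

print(events)
```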


#### The `groupby` Method
By default, `groupby` uses the group keys as the index of the result; pass `as_index=False` to keep them as regular columns.

In [None]:
#donut_groupby_yearmonth = donut.groupby(['Year','Month'],as_index = False)
donut_groupby_yearmonth = donut.groupby(['Year','Month'])

display(donut_groupby_yearmonth.head(n=1))
print("\n")  # blank lines between outputs (works in Python 2 and 3)
#display(donut_groupby_yearmonth1.head(n=1))

display(sampleData.timeStamp.max())
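
To make the indexing behavior concrete, this sketch aggregates a small invented frame both ways. The frame and its values are assumptions; the column names mirror the `donut` frame above.

```python
import pandas as pd

# Hypothetical stand-in for the donut frame.
donut_demo = pd.DataFrame({
    "Year": [2014, 2014, 2014, 2015],
    "Month": [1, 1, 2, 1],
    "donut": [3, 4, 5, 6],
})

# Default: the group keys become a (Year, Month) MultiIndex.
indexed = donut_demo.groupby(["Year", "Month"]).sum()

# as_index=False: the group keys stay as ordinary columns.
flat = donut_demo.groupby(["Year", "Month"], as_index=False).sum()

print(indexed)
print(flat)
```

The flat form tends to be easier to hand to plotting code or to write to CSV, while the indexed form makes label lookups by group short.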

#### Operations on Groups (Maybe help a brother here?)
The method for accessing a group seems to depend on the kind of object you are holding:
-  a GroupBy object (which has `.indices`) requires: `GroupBy.get_group((<group>))`
-  an aggregated DataFrame or Series (which has `.index`) requires a label lookup: `Series.get_value((<group>))` in older pandas, or `.loc[(<group>)]`


In [None]:
# The .indices attribute of a GroupBy object is just a dictionary
display(type(donut_groupby_yearmonth.indices))
display(list(donut_groupby_yearmonth.indices.keys())[0:5])
#display(donut_groupby_yearmonth.index) #Error
print("\n")
display(donut_groupby_yearmonth.get_group((2014, 1)).head(n=5))
print("\n")
# select the numeric column before summing; datetime columns cannot be summed
donutTest = donut_groupby_yearmonth.get_group((2014, 1)).groupby(['Year','Month'])[['donut']].sum()
display(donutTest)
print("\n")
#display(donutTest.indices) #Error
display(donutTest.index)
# get_values() and get_value() were removed in pandas 1.0
display(donutTest.donut.values)
display(donutTest.donut.loc[(2014, 1)])
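
A compact sketch of the two access patterns side by side, on an invented frame: `get_group` works on the GroupBy object before aggregation, and `.loc` works on the aggregated result. The data is hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the donut frame.
demo = pd.DataFrame({
    "Year": [2014, 2014, 2014, 2015],
    "Month": [1, 1, 2, 1],
    "donut": [3, 4, 5, 6],
})
grouped = demo.groupby(["Year", "Month"])

# Before aggregation: .indices maps each group key to row positions,
# and get_group returns the rows of one group as a DataFrame.
group_keys = sorted(grouped.indices.keys())
jan_2014 = grouped.get_group((2014, 1))

# After aggregation: the result is indexed by group key, so use .loc.
totals = grouped["donut"].sum()
jan_total = totals.loc[(2014, 1)]

print(group_keys)
print(jan_2014)
print(jan_total)
```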

#### Plotting w/ Pandas
Getting the most out of the `DataFrame.plot()` method means reading documentation that offers few examples (next stop, Stack Overflow), but sometimes it's just fun to see what `DataFrame.plot()` produces without any additional arguments.

In [None]:
display(sampleData.plot())

Focus the story.

In [None]:
paleo = sampleData.plot(
    x='timeStamp'
    ,y='paleo'
    ,kind='line'
    ,xlim=['2011-11-15 00:00:00','2015-06-01 00:00:00']
    ,color='b')
paleo.legend(['paleo'], loc='best')
kale = sampleData.plot(
    x='timeStamp'
    ,y='kale'
    ,kind='line'
    ,xlim=['2011-11-15 00:00:00','2015-06-01 00:00:00']
    ,color='g')
kale.legend(['kale'], loc='best')
dairy = sampleData.plot(
    x='timeStamp'
    ,y='dairy'
    ,kind='line'
    ,xlim=['2011-11-15 00:00:00','2015-06-01 00:00:00']
    ,color='r')
dairy.legend(['dairy'], loc='best')
plt.show()

Narrow the focus.

In [None]:
paleo = sampleData.plot(
    x='timeStamp'
    ,y='paleo'
    ,kind='line'
    ,xlim=['2011-11-15 00:00:00','2015-06-01 00:00:00'])
paleo.legend(['paleo'], loc='best')
kale = sampleData.plot(
    x='timeStamp'
    ,y='kale'
    ,kind='line'
    ,xlim=['2011-11-15 00:00:00','2015-06-01 00:00:00']
    ,ax = paleo)
kale.legend(['paleo','kale'], loc='best')
dairy = sampleData.plot(
    x='timeStamp'
    ,y='dairy'
    ,kind='line'
    ,xlim = ['2014-11-15 00:00:00','2015-06-01 00:00:00']
    ,ylim = [0,20000]
    ,ax=kale
    ,figsize =(15,8))
dairy.legend(['paleo','kale','dairy'], loc='best')
dairy.set_xlabel("Time")
dairy.set_ylabel("Mention Count")
plt.show()
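
One alternative to chaining `ax=` across several calls is to pass a list of columns as `y`, which draws all series on one set of axes in a single call. This is a sketch on randomly generated data; the `mentions` frame and its values are made up, and only the column names echo the data used above.

```python
import numpy as np
import pandas as pd

# Hypothetical mention counts, generated at random for illustration.
rng = np.random.default_rng(0)
mentions = pd.DataFrame({
    "timeStamp": pd.date_range("2015-01-01", periods=30, freq="D"),
    "paleo": rng.integers(0, 20000, size=30),
    "kale": rng.integers(0, 20000, size=30),
    "dairy": rng.integers(0, 20000, size=30),
})

# A list for y puts every column on the same axes with one legend.
ax = mentions.plot(
    x="timeStamp",
    y=["paleo", "kale", "dairy"],
    kind="line",
    figsize=(15, 8),
)
ax.set_ylim(0, 20000)
ax.set_xlabel("Time")
ax.set_ylabel("Mention Count")
ax.legend(loc="best")
```

Chaining `ax=` is still the way to go when each series needs its own styling or comes from a different frame.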

#### Similar experiences for R
Familiarity with a language is important to consider when running on deadlines. There's always a tradeoff between learning the new tool and using what you know.

In [None]:
# find min date
minDate = sampleData['timeStamp'].min()
minDate = minDate.strftime('%Y-%m-%d %H:%M')

# find max date
maxDate = sampleData['timeStamp'].max()
maxDate = maxDate.strftime('%Y-%m-%d %H:%M')

display(minDate, maxDate)
# send dataframe and strings to R
%Rpush sampleData minDate maxDate


In [None]:
%%R
# convert to date object in R
minDate = as.POSIXct(strptime(minDate, '%Y-%m-%d %H:%M',tz='UTC'))
maxDate = as.POSIXct(strptime(maxDate, '%Y-%m-%d %H:%M',tz='UTC'))
xlim = list(as.POSIXct('2014-11-15 00:00:00',format='%Y-%m-%d %H:%M',tz='UTC')
         ,as.POSIXct('2015-06-01 00:00:00',format='%Y-%m-%d %H:%M',tz='UTC'))
class(c(maxDate,xlim[[1]]))

#### Plotting w/ R
I am more familiar with the ggplot library than I am with matplotlib/Pandas. 


In [None]:
%%R -w 900 -h 480 -u px

ggplot(data=sampleData) +
    geom_line(aes(x=timeStamp, y=paleo), color='blue') +
    geom_line(aes(x=timeStamp, y=kale), color='green') +
    geom_line(aes(x=timeStamp, y=gluten), color='red') +
    scale_x_datetime(limits=c(xlim[[1]], xlim[[2]])) +
    xlab('Time') + ylab('Mentions Count')


#### Legend via Melt; Spice via Theme

In [None]:
# .ix was removed in pandas 1.0; .loc with a list of labels is the equivalent
df = sampleData.loc[:, ['timeStamp','paleo','kale','dairy']]
display(df.head())
dfMelt = pd.melt(df, id_vars=['timeStamp'])
display(dfMelt.head())
%Rpush dfMelt
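
To show what `pd.melt` does to the shape of the data, here is a sketch on a tiny invented frame. The values are hypothetical; the `variable` and `value` column names are the pandas defaults that the ggplot code relies on.

```python
import pandas as pd

# Hypothetical wide frame: one column per series.
wide = pd.DataFrame({
    "timeStamp": pd.to_datetime(["2015-01-01", "2015-01-02"]),
    "paleo": [10, 12],
    "kale": [5, 7],
})

# Melt to long form: one row per (timeStamp, series) pair.
long_form = pd.melt(wide, id_vars=["timeStamp"])

print(long_form)
```

The long form is what lets ggplot map `variable` to color and build the legend on its own.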

In [None]:
%%R -w 900 -h 480 -u px
library(scales)  # provides date_format()
p1 = ggplot(data=dfMelt)
p1 = p1 + geom_line(aes(x=timeStamp,y=value,color=variable))
p1 = p1 + scale_x_datetime( limits = c(xlim[[1]],xlim[[2]]), labels=date_format("%b %Y"))
p1 = p1 + ylim(0,20000)+xlab('Time')+ylab('Mentions Count')
p1 = p1 + scale_color_manual(values=c('red','green','blue'))
p1 = p1 + theme(legend.title=element_blank(), axis.text.x = element_text(angle = 30, hjust = 1), legend.position = c(.9, .8))
print(p1)

In [None]:
%%R -i df
# save data (-i df pulls the Python DataFrame df into R; only dfMelt was pushed earlier)
write.csv(df,"../data/df.csv", row.names=FALSE)
write.csv(dfMelt, "../data/dfMelt.csv", row.names=FALSE)

In [None]:
%%R -w 900 -h 480 -u px
#line graph in R
library(scales)  # provides date_format()

summary(sampleData)
p1 <- ggplot(data=sampleData) +
    geom_line(stat="identity", aes(x=timeStamp, y=paleo)) +
    ggtitle("Line Graph Test") + xlab("Time") + ylab("Mention Counts") +
    theme(legend.position="none", text=element_text(size=20))

# NOTE: Day_subset (columns: time, observation) is not defined in this notebook;
# the block below only runs once such a data frame exists in the R session.
lower <- with(Day_subset,as.POSIXct(strftime(min(time),"%Y-%m-%d")))
upper <- with(Day_subset,as.POSIXct(strftime(as.Date(max(time))+1,"%Y-%m-%d"))-1)
limits = c(lower,upper)

print(ggplot( data=Day_subset, aes( x=time, y=observation) ) + 
        geom_point() +
        scale_x_datetime( breaks=("2 hour"), 
                          minor_breaks=("1 hour"), 
                          labels=date_format("%H:%M"),
                          limits=limits))
print(p1)

In [None]:
x_sum.plot(x='year', y='col_name_2', style='o')


In [None]:

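ts = sampleData.timeStamp.dt  # sampleData has no Year/Month columns, so derive the keys from timeStamp
sampleData['donut_sum'] = sampleData.groupby([ts.year.rename('Year'), ts.month.rename('Month')])['donut'].transform('sum')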
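
Unlike an aggregation, `transform` returns one value per original row, so the group total can be stored next to each observation. A minimal sketch on an invented frame with explicit `Year` and `Month` columns:

```python
import pandas as pd

# Hypothetical frame with explicit Year and Month columns.
demo = pd.DataFrame({
    "Year": [2014, 2014, 2014, 2015],
    "Month": [1, 1, 2, 1],
    "donut": [3, 4, 5, 6],
})

# Every row receives the sum of its own (Year, Month) group.
demo["donut_sum"] = demo.groupby(["Year", "Month"])["donut"].transform("sum")

print(demo)
```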

In [None]:
sampleData.to_csv("../data/test.csv",index=False)

In [None]:
data = [{'x': 10, 'y': 20, 'r': 15, 'name': 'circle one'}, 
        {'x': 40, 'y': 40, 'r': 5, 'name': 'circle two'},
        {'x': 20, 'y': 30, 'r': 8, 'name': 'circle three'},
        {'x': 25, 'y': 10, 'r': 10, 'name': 'circle four'}]


d3_example.plot_circle(data)

In [None]:
d3_example.plot_circle(data, id=2)