# A demo of Pandas code
This is a demonstration of my knowledge of Python Pandas. Several years ago I took a course on data analysis with R on Coursera. In this Jupyter notebook I apply the same principles in Python to recreate the analysis that I did in R. I could process the data in R if I had to, but I strongly prefer Pandas and Jupyter: I can explore and analyze the data and document my findings in one program, and I have access to the large number of Python libraries.

To start, I import the libraries that I will use, define a couple of functions to quickly generate plots and tell Bokeh to output the plots to Jupyter.

In [1]:
import pandas as pd, bokeh as bk, numpy as np
from bokeh.plotting import figure, show, output_notebook

In [2]:
def histplot(hist, title, edges, xlabel, ylabel):
    p = figure(title=title, tools='')
    p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],
           fill_color="red", line_color="black", alpha=0.5)
    p.y_range.start = 0
    p.xaxis.axis_label = xlabel
    p.yaxis.axis_label = ylabel
    p.grid.grid_line_color="white"
    return p

def lineplot(x, y, title, xlabel, ylabel, colorlist, datalegend):
    p = figure(title=title, tools='', x_axis_type='datetime')
    if isinstance(y, list):
        # One line per series, each with its own color and legend entry
        for series, color, label in zip(y, colorlist, datalegend):
            p.line(x, series, line_color=color, line_width=1, legend_label=label)
    else:
        p.line(x, y, line_color=colorlist, line_width=1, legend_label=datalegend)
    #p.y_range.start = 0
    p.legend.location = "top_right"
    p.legend.background_fill_color = "#fefefe"
    p.xaxis.axis_label = xlabel
    p.yaxis.axis_label = ylabel
    return p

output_notebook()

Here I read in the data and combine the Date and Time fields into a single Datetime field with the correct `datetime` type.

In [3]:
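powerdata = pd.read_table('household_power_consumption.txt', sep=';', na_values='?')
# Combine the Date and Time columns and parse them as day-first timestamps
powerdata.insert(0, 'Datetime', pd.to_datetime(powerdata['Date'] + ' ' + powerdata['Time'],
                                               format='%d/%m/%Y %H:%M:%S'))
powerdata = powerdata.drop(columns=['Date', 'Time'])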
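
The Date and Time combination can be tried out on a small in-memory sample before reading the full file. The sketch below assumes the same `;`-separated layout and `?` missing-value marker as the data file; the name `sample` and the second row are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same layout as the data file
raw = (
    "Date;Time;Global_active_power;Voltage\n"
    "16/12/2006;17:24:00;4.216;234.840\n"
    "1/2/2007;00:00:00;?;240.000\n"
)

sample = pd.read_csv(io.StringIO(raw), sep=';', na_values='?')

# Join the two text columns and parse them with an explicit day-first format
sample['Datetime'] = pd.to_datetime(sample['Date'] + ' ' + sample['Time'],
                                    format='%d/%m/%Y %H:%M:%S')

print(sample['Datetime'].dtype)   # a datetime64 dtype
print(sample['Datetime'][1])      # 1 February 2007, not 2 January
```

Spelling out the format avoids any guessing about whether `1/2/2007` means 1 February or 2 January.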

I check the headers and look at the structure to make sure that the data has been read in correctly.

In [4]:
powerdata.head()

|   | Datetime | Global_active_power | Global_reactive_power | Voltage | Global_intensity | Sub_metering_1 | Sub_metering_2 | Sub_metering_3 |
|---|---|---|---|---|---|---|---|---|
| 0 | 2006-12-16 17:24:00 | 4.216 | 0.418 | 234.84 | 18.4 | 0.0 | 1.0 | 17.0 |
| 1 | 2006-12-16 17:25:00 | 5.36 | 0.436 | 233.63 | 23.0 | 0.0 | 1.0 | 16.0 |
| 2 | 2006-12-16 17:26:00 | 5.374 | 0.498 | 233.29 | 23.0 | 0.0 | 2.0 | 17.0 |
| 3 | 2006-12-16 17:27:00 | 5.388 | 0.502 | 233.74 | 23.0 | 0.0 | 1.0 | 17.0 |
| 4 | 2006-12-16 17:28:00 | 3.666 | 0.528 | 235.68 | 15.8 | 0.0 | 1.0 | 17.0 |


In [5]:
powerdata.describe()

|   | Global_active_power | Global_reactive_power | Voltage | Global_intensity | Sub_metering_1 | Sub_metering_2 | Sub_metering_3 |
|---|---|---|---|---|---|---|---|
| count | 2049280.0 | 2049280.0 | 2049280.0 | 2049280.0 | 2049280.0 | 2049280.0 | 2049280.0 |
| mean | 1.091615 | 0.1237145 | 240.8399 | 4.627759 | 1.121923 | 1.29852 | 6.458447 |
| std | 1.057294 | 0.112722 | 3.239987 | 4.444396 | 6.153031 | 5.822026 | 8.437154 |
| min | 0.076 | 0.0 | 223.2 | 0.2 | 0.0 | 0.0 | 0.0 |
| 25% | 0.308 | 0.048 | 238.99 | 1.4 | 0.0 | 0.0 | 0.0 |
| 50% | 0.602 | 0.1 | 241.01 | 2.6 | 0.0 | 0.0 | 1.0 |
| 75% | 1.528 | 0.194 | 242.89 | 6.4 | 0.0 | 1.0 | 17.0 |
| max | 11.122 | 1.39 | 254.15 | 48.4 | 88.0 | 80.0 | 31.0 |
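
The `count` row reported by `describe()` includes only non-missing values, so comparing it with the number of rows shows how many readings were read in as missing. A minimal sketch on a made-up frame (the name `df` and its values are illustrative, not taken from the data set):

```python
import numpy as np
import pandas as pd

# Hypothetical readings; np.nan stands in for a value read from '?'
df = pd.DataFrame({'Voltage': [234.84, np.nan, 233.29, 233.74]})

total_rows = len(df)                    # every row, missing or not
non_missing = df['Voltage'].count()     # what describe() reports as count
missing = df['Voltage'].isna().sum()    # rows with no reading

print(total_rows, non_missing, missing)  # 4 3 1
```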


For this project we were instructed to use only the data from *2/1/2007* and *2/2/2007* (1 and 2 February 2007), so I assign those records to `febdata`.

In [6]:
febdata = powerdata[(powerdata['Datetime'] >= pd.Timestamp(2007, 2, 1)) &
                    (powerdata['Datetime'] < pd.Timestamp(2007, 2, 3))]
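
The filter is a half-open interval: midnight on 1 February is included and midnight on 3 February is excluded, which keeps exactly the two full days. The sketch below applies the same kind of mask to a hypothetical hourly series (the frame `demo` and its timestamps are assumptions for illustration).

```python
import pandas as pd

# Hypothetical hourly timestamps from 31 January to 3 February 2007
demo = pd.DataFrame({
    'Datetime': pd.date_range('2007-01-31', '2007-02-03', freq=pd.Timedelta(hours=1))
})

start = pd.Timestamp(2007, 2, 1)   # inclusive lower bound
end = pd.Timestamp(2007, 2, 3)     # exclusive upper bound

mask = (demo['Datetime'] >= start) & (demo['Datetime'] < end)
two_days = demo[mask]

print(len(two_days))   # 48 hourly rows, i.e. two full days
print(two_days['Datetime'].min(), two_days['Datetime'].max())
```

Using `<` on the day after the last wanted day avoids having to spell out the final timestamp of 2 February.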

## Chart 1
Chart 1 is a histogram of the distribution of values in the `Global_active_power` field.

In [7]:
hist, edges = np.histogram(febdata['Global_active_power'],  density=True, bins=20)
gap = histplot(hist, "Global Active Power", edges, 'Global Active Power (kilowatts)', 'Density')
show(bk.layouts.gridplot([gap], ncols=2, width=400, height=400, toolbar_location=None))
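
`np.histogram` returns the bar heights and the bin edges separately, which is why `histplot` slices `edges[:-1]` and `edges[1:]` for the left and right side of each bar. A small sketch with made-up values (the array `values` is an assumption for illustration) shows the shapes involved and the effect of `density=True`:

```python
import numpy as np

# Hypothetical readings, not taken from the data set
values = np.array([0.2, 0.3, 0.3, 1.4, 1.5, 2.6, 3.7, 4.2])

hist, edges = np.histogram(values, density=True, bins=4)

print(len(hist), len(edges))          # 4 bars, 5 edges
left, right = edges[:-1], edges[1:]   # one left/right pair per bar

# With density=True the bar areas sum to 1, so the heights are
# densities rather than raw counts
area = np.sum(hist * (right - left))
print(np.isclose(area, 1.0))          # True
```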

## Chart 2
Chart 2 shows global active power, in kilowatts, over the course of the two-day period.

In [8]:
gap2 = lineplot(febdata['Datetime'], febdata['Global_active_power'], 'Global Active Power', 'Time',
                'Global Active Power (kilowatts)', 'black', 'Global Active Power')
show(bk.layouts.gridplot([gap2], ncols=2, width=400, height=400, toolbar_location=None))

## Chart 3
Chart 3 shows the three sub-metering fields over the course of the two days.

In [9]:
gap3 = lineplot(febdata['Datetime'], [febdata['Sub_metering_1'],febdata['Sub_metering_2'],febdata['Sub_metering_3']],
                'Energy Sub Metering', 'Time', 'Energy Sub Metering', ['black','red','blue'],
                ['Sub Metering 1','Sub Metering 2','Sub Metering 3'])
show(bk.layouts.gridplot([gap3], ncols=2, width=400, height=400, toolbar_location=None))

## Chart 4
Chart 4 combines four plots in a two-by-two grid: the Chart 1 histogram generated earlier, voltage over the two days, Chart 3's sub-metering data, and global reactive power over the two days.

In [10]:
gap4 = lineplot(febdata['Datetime'], febdata['Voltage'], 'Voltage', 'Time', 'Voltage', 'black', 'Voltage')
gap5 = lineplot(febdata['Datetime'], febdata['Global_reactive_power'], 'Global Reactive Power', 'Time',
                'Global Reactive Power (kilowatts)', 'black', 'Global Reactive Power')
show(bk.layouts.gridplot([gap, gap4, gap3, gap5], ncols=2, width=400, height=400, toolbar_location=None))

Created by David Peterson on November 4th, 2018