Data Prototyping with Python (Slides & Resources from talk)

Follow me on twitter @clarecorthell!

September 9 2014 @ Hackbright Academy, SF

Prototyping in the Data World - Data Scripting Skills


  • numpy: multi-dimensional container of data
  • pandas: data structures and analysis tools
  • matplotlib: Python plotting library
  • iPython: browser-based code notebook / IDE (run blocks of code, not the whole program)

Notes accompanying the talk [Video & Slides]

All python code for this talk was run in the browser-based iPython interpreter

Import tools

import numpy as np
import pandas as pd

# Render our plots inline
%matplotlib inline
import matplotlib.pyplot as plt

Get Data

turn a csv into a DataFrame (for example, an export from excel in csv form)

mattermark_df = pd.read_csv('mattermark_data.csv') => Mattermark data about funding rounds in New York City in the last five years
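To try the loading step without the original file, here is a minimal sketch that reads a small in-memory CSV. The column names follow the ones used in these notes, but the company names and all values are invented for illustration, and `toy_df` is a hypothetical name.

```python
import io

import pandas as pd

# Hypothetical stand-in for mattermark_data.csv; every value here is invented.
csv_text = """name,series,amount,cached_uniques,cached_mobile_downloads
Acme,a,5000000,12000,300
Bolt,seed,750000,800,
Cask,b,,45000,9000
Dyno,a,3000000,,
"""

# read_csv accepts a file path or any file-like object.
toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df.shape)   # (number of rows, number of columns)
print(toy_df.dtypes)  # empty cells become NaN, so those columns load as floats
```

Empty cells in the CSV are loaded as NaN, which is what the null checks further down look for.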

What's in here?

sample different parts of the data

mattermark_df[:10] sample the first ten rows of our DataFrame

mattermark_df.iloc[0] use .iloc to index into row location 0

mattermark_df['cached_uniques'] sample the column

mattermark_df['cached_uniques'].describe() show some standard statistics about that column (for numeric data)

mattermark_df.describe() show some standard statistics about all numeric columns

mattermark_df.sort_values('amount', ascending=False) sort the entire table by amount of funding, in descending order (in older pandas versions this method was called .sort)
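The sampling calls above can be tried on a tiny DataFrame. This is a sketch on invented data (only the column names come from these notes), and it uses `sort_values`, which is how current pandas versions sort by a column.

```python
import numpy as np
import pandas as pd

# Invented rows; only the column names mirror the Mattermark data.
toy_df = pd.DataFrame({
    'name': ['Acme', 'Bolt', 'Cask', 'Dyno'],
    'amount': [5000000.0, 750000.0, np.nan, 3000000.0],
    'cached_uniques': [12000.0, 800.0, 45000.0, np.nan],
})

first_two = toy_df[:2]                       # slice the first two rows
first_row = toy_df.iloc[0]                   # row at position 0, as a Series
stats = toy_df['cached_uniques'].describe()  # count, mean, std, min, quartiles, max

# Largest amounts first; rows with a missing amount are placed last by default.
by_amount = toy_df.sort_values('amount', ascending=False)

print(stats)
print(by_amount)
```

Note that `describe()` reports a `count` that excludes missing values, so it doubles as a quick check for gaps in a column.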

What's not in here?

mattermark_df['amount'].isnull() In the column, is the value at a given index null? (True or False)

len(np.where(mattermark_df['amount'].isnull())[0]) Count the number of null values in the column
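As a sketch on invented data, the null count can be computed two ways: the `np.where` form used above, and summing the boolean mask directly (each True counts as 1). Both should give the same number.

```python
import numpy as np
import pandas as pd

# Invented amounts with two missing values.
toy_df = pd.DataFrame({'amount': [5000000.0, np.nan, 750000.0, np.nan]})

is_missing = toy_df['amount'].isnull()     # boolean Series, True where the value is NaN

count_where = len(np.where(is_missing)[0])  # positions of the True entries, then count them
count_sum = int(is_missing.sum())           # True counts as 1, False as 0

print(count_where, count_sum)
```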

Ask a few Questions (which lead to other questions)

What is the most common stage for funding?

mattermark_df['series'].value_counts() count the values in each category

mattermark_df['series'].value_counts().plot(kind='bar') plot in a bar graph (grouped by series) to get a quick idea of relative scale
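A minimal sketch of the counting step on invented rounds (only the `series` column name comes from these notes). The `normalize=True` variant is an addition here, shown because relative shares are often easier to compare than raw counts.

```python
import pandas as pd

# Invented funding rounds.
toy_df = pd.DataFrame({'series': ['a', 'seed', 'a', 'b', 'seed', 'a']})

counts = toy_df['series'].value_counts()                # sorted, most common first
shares = toy_df['series'].value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(counts.index[0])  # label of the most common stage
```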

Leads to Question: What is the typical funding amount by round?

by_series = mattermark_df.groupby('series') group records by series column (stored in a variable)

print(by_series['amount'].mean().astype(int)) within each grouping, calculate the mean (and do some explicit type conversion)
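The group-then-aggregate step can be sketched on invented data as follows. The median is not part of the talk notes; it is included here as an assumption-labeled extra, since a few very large rounds can pull the mean upward.

```python
import numpy as np
import pandas as pd

# Invented rounds; one amount is missing on purpose.
toy_df = pd.DataFrame({
    'series': ['a', 'a', 'a', 'seed', 'seed'],
    'amount': [2000000.0, 4000000.0, np.nan, 500000.0, 1000000.0],
})

by_series = toy_df.groupby('series')

mean_amount = by_series['amount'].mean()      # missing values are skipped
median_amount = by_series['amount'].median()  # extra, not in the original notes

# astype(int) only works when no group mean is NaN, i.e. every group
# has at least one known amount.
print(mean_amount.astype(int))
```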

How many of these are mobile companies?

mobile_df = mattermark_df.dropna(subset=['cached_mobile_downloads']) we do some brash inference that if a company doesn't have a monthly count of mobile downloads, it doesn't have a mobile application; using the .dropna function, we get rid of the rows that don't have a value for that column.


compare the shape of the two DataFrames to see roughly what proportion of companies (rows) have mobile app data
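A sketch of that comparison on invented data. The `.shape` attribute is a `(rows, columns)` tuple, so dividing the two row counts gives the rough proportion; the names `toy_df` and `proportion` are hypothetical.

```python
import numpy as np
import pandas as pd

# Invented companies; NaN means no mobile download count was recorded.
toy_df = pd.DataFrame({
    'name': ['Acme', 'Bolt', 'Cask', 'Dyno'],
    'cached_mobile_downloads': [300.0, np.nan, 9000.0, np.nan],
})

# Keep only rows that have a value in the mobile downloads column.
mobile_df = toy_df.dropna(subset=['cached_mobile_downloads'])

total_rows = toy_df.shape[0]
mobile_rows = mobile_df.shape[0]
proportion = mobile_rows / float(total_rows)

print(toy_df.shape, mobile_df.shape)
print(proportion)
```

As the notes say, this rests on the rough inference that a missing download count means no mobile app, so the proportion is an estimate at best.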

For more context, see the video & slides.

Great Resources for Getting Started
