
Follow me on twitter @clarecorthell!

September 9 2014 @ Hackbright Academy, SF

Prototyping in the Data World - Data Scripting Skills


  • numpy: multi-dimensional container of data
  • pandas: data structures and analysis tools
  • matplotlib: python plotting library
  • iPython: browser-based code notebook / IDE (run blocks of code, not the whole program)

Notes accompanying the talk [Video & Slides]

All python code for this talk was run in the browser-based iPython interpreter

Import tools

import numpy as np
import pandas as pd

# Render our plots inline
%matplotlib inline
import matplotlib.pyplot as plt

Get Data

turn a csv into a DataFrame (for example, an export from excel in csv form)

mattermark_df = pd.read_csv('mattermark_data.csv') => Mattermark data about funding rounds in New York City in the last five years
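If you don't have the Mattermark export at hand, the same step can be tried on a small stand-in. This is a minimal sketch: the column names amount, series, cached_uniques and cached_mobile_downloads come from these notes, while the company column and every value are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for mattermark_data.csv; all rows and values are invented.
csv_text = """company,series,amount,cached_uniques,cached_mobile_downloads
Acme,a,5000000,12000,
Bolt,seed,750000,3000,1500
Cask,b,20000000,90000,40000
Dune,seed,,800,
"""

# read_csv accepts a file path or any file-like object, so StringIO works here.
sample_df = pd.read_csv(io.StringIO(csv_text))

# Empty fields in the CSV become NaN (missing values) in the DataFrame.
print(sample_df.shape)
print(sample_df.dtypes)
```

With a real export you would pass the file path instead, as in the line above.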

What's in here?

sample different parts of the data

mattermark_df[:10] sample the first ten rows of our DataFrame

mattermark_df.iloc[0] use .iloc to index into row location 0

mattermark_df['cached_uniques'] sample the column

mattermark_df['cached_uniques'].describe() show some standard statistics about that column (for numeric data)

mattermark_df.describe() show some standard statistics about all numeric columns

mattermark_df.sort_values('amount', ascending=False) sort entire table (descending) by amount of funding
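The sampling calls above can be tried together on a tiny table. This is a sketch on made-up data (the company column and all values are hypothetical). It sorts with sort_values, the method current pandas provides for sorting by a column.

```python
import pandas as pd

# Invented data; only the column names 'amount' and 'cached_uniques' come from the notes.
df = pd.DataFrame({
    "company": ["Acme", "Bolt", "Cask"],
    "amount": [5000000, 750000, 20000000],
    "cached_uniques": [12000, 3000, 90000],
})

first_two = df[:2]                        # slice the first two rows
first_row = df.iloc[0]                    # row at integer position 0, returned as a Series
stats = df["cached_uniques"].describe()   # count, mean, std, min, quartiles, max

# Sort the whole table by funding amount, largest first.
by_amount = df.sort_values("amount", ascending=False)
print(by_amount)
```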

What's not in here?

mattermark_df['amount'].isnull() In the column, is the value at a given index null? (true or false)

len(np.where(mattermark_df['amount'].isnull())[0]) Count the number of null values in the column
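A short sketch of the null count on an invented column. It also shows an equivalent shortcut: summing the boolean mask, since each True counts as 1. The values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Invented funding amounts with two missing entries.
amounts = pd.Series([5000000, np.nan, 20000000, np.nan], name="amount")

is_missing = amounts.isnull()            # boolean Series: True where the value is null

# np.where returns a tuple of index arrays; the first holds the positions of True values.
count_via_where = len(np.where(is_missing)[0])

# Equivalent: True counts as 1 when summed.
count_via_sum = int(is_missing.sum())

print(count_via_where, count_via_sum)
```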

Ask a few Questions (which lead to other questions)

What is the most common stage for funding?

mattermark_df['series'].value_counts() count the values in each category

mattermark_df['series'].value_counts().plot(kind='bar') plot in a bar graph (grouped by series) to get a quick idea of relative scale

Leads to Question: What is the typical funding amount by round?

by_series = mattermark_df.groupby('series') group records by series column (stored in a variable)

print(by_series['amount'].mean().astype(int)) within each grouping, calculate the mean (and do some explicit type conversion)
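The counting and grouping steps can be sketched on a handful of invented funding rounds. The series labels and amounts below are hypothetical; only the column names come from the notes.

```python
import pandas as pd

# Invented funding rounds.
df = pd.DataFrame({
    "series": ["seed", "seed", "a", "a", "b"],
    "amount": [500000, 1000000, 4000000, 6000000, 20000000],
})

# How many rounds fall in each stage?
counts = df["series"].value_counts()

# Group rows by stage, then average the funding amount within each group.
by_series = df.groupby("series")
mean_by_series = by_series["amount"].mean().astype(int)

print(counts)
print(mean_by_series)
```

Note that astype(int) raises an error if any group's mean is missing, so with real data it may be necessary to drop rows with a null amount first.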

How many of these are mobile companies?

mobile_df = mattermark_df.dropna(subset=['cached_mobile_downloads']) we make the brash inference that if a company has no monthly count of mobile downloads, it has no mobile application; the .dropna function removes the rows that have no value for that column.


compare the shape of the two DataFrames (mattermark_df.shape and mobile_df.shape) to see how many companies (rows) have mobile app data and get a rough proportion
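A minimal sketch of the comparison, again on invented rows. The proportion it computes rests on the same rough assumption as above: a missing download count means no mobile app.

```python
import numpy as np
import pandas as pd

# Invented companies; two of the four have no download count.
df = pd.DataFrame({
    "company": ["Acme", "Bolt", "Cask", "Dune"],
    "cached_mobile_downloads": [np.nan, 1500, 40000, np.nan],
})

# Keep only the rows that have a value in the downloads column.
mobile_df = df.dropna(subset=["cached_mobile_downloads"])

# .shape is (rows, columns); compare the row counts of the two tables.
print(df.shape, mobile_df.shape)

# Rough share of companies with mobile app data.
mobile_share = len(mobile_df) / len(df)
print(mobile_share)
```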

For more context, see the video & slides.

Great Resources for Getting Started


Data Prototyping with Python (Slides & Resources from talk)






