clarecorthell/data-prototyping-talk

Follow me on twitter @clarecorthell!

September 9 2014 @ Hackbright Academy, SF

Prototyping in the Data World - Data Scripting Skills

Tools

  • numpy: multi-dimensional container of data
  • pandas: data structures and analysis tools
  • matplotlib: Python plotting library
  • IPython: browser-based code notebook / IDE (run blocks of code, not the whole program)

Notes accompanying the talk [Video & Slides]

All Python code for this talk was run in the browser-based IPython interpreter.

Import tools

import numpy as np
import pandas as pd

# Render our plots inline
%matplotlib inline
import matplotlib.pyplot as plt

Get Data

turn a csv into a DataFrame (for example, an export from excel in csv form)

mattermark_df = pd.read_csv('mattermark_data.csv') => Mattermark data about funding rounds in New York City in the last five years
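If you don't have the Mattermark export at hand, the same loading step can be tried on a tiny in-memory CSV. This is only a sketch: the rows below are made up, the `name` column is a hypothetical addition, and `sample_df` stands in for `mattermark_df`.

```python
import io

import pandas as pd

# Hypothetical stand-in for mattermark_data.csv; every row here is made up.
csv_text = u"""name,series,amount,cached_uniques,cached_mobile_downloads
Acme,a,5000000,12000,300
Birch,seed,750000,800,
Cedar,b,,45000,9000
"""

# read_csv accepts a file path or any file-like object
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)   # (rows, columns)
print(sample_df.dtypes)  # empty cells become NaN, so those columns load as floats
```

Empty cells in the CSV are read as missing values, which is what the null checks later in these notes look for.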

What's in here?

sample different parts of the data

mattermark_df[:10] sample the first ten rows of our DataFrame

mattermark_df.iloc[0] use .iloc to index into row location 0

mattermark_df['cached_uniques'] sample the column

mattermark_df['cached_uniques'].describe() show some standard statistics about that column (for numeric data)

mattermark_df.describe() show some standard statistics about all numeric columns
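The sampling calls above can be tried together on a small frame. A minimal sketch, assuming made-up rows; only the column names are taken from the talk's dataset.

```python
import pandas as pd

# Made-up rows; only the column names come from the talk's dataset.
sample_df = pd.DataFrame({
    'series': ['a', 'seed', 'b', 'a'],
    'amount': [5000000.0, 750000.0, 12000000.0, 3000000.0],
    'cached_uniques': [12000, 800, 45000, 7000],
})

first_two = sample_df[:2]              # slice rows by position
first_row = sample_df.iloc[0]          # one row, returned as a Series
uniques = sample_df['cached_uniques']  # one column, also a Series
stats = uniques.describe()             # count, mean, std, min, quartiles, max

print(first_row)
print(stats)
```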

mattermark_df.sort_values('amount', ascending=False) sort the entire table by amount of funding, largest first
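A small sketch of descending sorting with `sort_values`, which is the method name in current pandas releases. The rows and the `name` column are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows; 'name' is a hypothetical column used only to make the order visible.
sample_df = pd.DataFrame({
    'name': ['Acme', 'Birch', 'Cedar'],
    'amount': [5000000.0, np.nan, 12000000.0],
})

# Returns a new, sorted DataFrame; sample_df itself is left untouched.
by_amount = sample_df.sort_values('amount', ascending=False)

print(by_amount)
```

Rows with a missing amount are placed at the end by default, whichever direction you sort in.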

What's not in here?

mattermark_df['amount'].isnull() In the column, is the value at a given index null? (true or false)

len(np.where(mattermark_df['amount'].isnull())[0]) Count the number of null values in the column
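To see what the null count is doing, here is a sketch on a made-up column. It also shows a shorter spelling, summing the boolean mask, which gives the same number as the `np.where` version.

```python
import numpy as np
import pandas as pd

# Made-up funding amounts with two missing values.
amounts = pd.Series([5000000.0, np.nan, 12000000.0, np.nan], name='amount')

mask = amounts.isnull()               # True where the value is missing
count_where = len(np.where(mask)[0])  # positions of the True entries, then count them
count_sum = int(mask.sum())           # True counts as 1, so the sum is the null count

print(count_where, count_sum)
```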

Ask a few Questions (which lead to other questions)

What is the most common stage for funding?

mattermark_df['series'].value_counts() count the values in each category

mattermark_df['series'].value_counts().plot(kind='bar') plot in a bar graph (grouped by series) to get a quick idea of relative scale
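A quick sketch of `value_counts` on made-up round labels. The `normalize=True` variant is an addition not used in the talk; it returns shares instead of raw counts.

```python
import pandas as pd

# Made-up funding-round labels.
series_col = pd.Series(['a', 'seed', 'a', 'b', 'seed', 'a'], name='series')

counts = series_col.value_counts()                # most common category first
shares = series_col.value_counts(normalize=True)  # same order, as fractions of the total

print(counts)
print(shares)
# In a notebook, counts.plot(kind='bar') would draw the bar chart.
```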

Leads to Question: What is the typical funding amount by round?

by_series = mattermark_df.groupby('series') group records by series column (stored in a variable)

print(by_series['amount'].mean().astype(int)) within each grouping, calculate the mean (and do some explicit type conversion)
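The grouping step can be sketched on a few made-up rows. The median is shown next to the mean as an assumption about what "typical" might mean when a few very large rounds skew the average; the talk itself uses the mean.

```python
import pandas as pd

# Made-up rounds; only the column names come from the talk's dataset.
sample_df = pd.DataFrame({
    'series': ['a', 'a', 'seed', 'seed', 'b'],
    'amount': [4000000.0, 6000000.0, 500000.0, 1500000.0, 20000000.0],
})

by_series = sample_df.groupby('series')

mean_amount = by_series['amount'].mean().astype(int)  # one mean per series
median_amount = by_series['amount'].median()          # less sensitive to outliers

print(mean_amount)
print(median_amount)
```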

How many of these are mobile companies?

mobile_df = mattermark_df.dropna(subset=['cached_mobile_downloads']) we make a rough inference here: if a company has no monthly count of mobile downloads, we assume it has no mobile application. Using the .dropna function, we drop the rows that have no value in that column.

mattermark_df.shape
mobile_df.shape

compare the shapes of the two DataFrames to see how many companies (rows) have mobile app data, which gives a rough proportion
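The comparison can be turned into an explicit proportion. A minimal sketch on made-up rows, where `name` is a hypothetical column and `sample_df` stands in for `mattermark_df`:

```python
import numpy as np
import pandas as pd

# Made-up companies; two of the four have no mobile download count.
sample_df = pd.DataFrame({
    'name': ['Acme', 'Birch', 'Cedar', 'Dune'],
    'cached_mobile_downloads': [300.0, np.nan, 9000.0, np.nan],
})

# Keep only the rows that have a value in the chosen column.
mobile_df = sample_df.dropna(subset=['cached_mobile_downloads'])

total_rows = sample_df.shape[0]   # shape is (rows, columns)
mobile_rows = mobile_df.shape[0]
share = mobile_rows / float(total_rows)

print(sample_df.shape, mobile_df.shape, share)
```

The share is only as good as the inference behind it: a missing download count may also mean the data was never collected.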

For more context, see the video & slides.

Great Resources for Getting Started
