GitHub - clarecorthell/data-prototyping-talk: Data Prototyping with Python (Slides & Resources from talk)

Follow me on twitter @clarecorthell!

September 9 2014 @ Hackbright Academy, SF

Prototyping in the Data World - Data Scripting Skills

Tools

numpy: multi-dimensional container of data
pandas: data structures and analysis tools
matplotlib: Python plotting library
IPython: browser-based code notebook / IDE (run blocks of code, not the whole program)

Notes accompanying the talk [Video & Slides]

All Python code for this talk was run in the browser-based IPython notebook.

Import tools

import numpy as np
import pandas as pd

# Render our plots inline
%matplotlib inline
import matplotlib.pyplot as plt

Get Data

turn a CSV file into a DataFrame (for example, an export from Excel in CSV form)

mattermark_df = pd.read_csv('mattermark_data.csv') => Mattermark data about funding rounds in New York City in the last five years
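The Mattermark export itself is not part of these notes, so here is a minimal sketch of the same loading step on an in-memory CSV. The rows and the `name` column are invented for illustration; only `series` and `amount` are column names taken from the talk.

```python
import io

import pandas as pd

# Hypothetical stand-in for 'mattermark_data.csv'; every value is invented.
csv_text = """name,series,amount
Acme,a,5000000
Bolt,seed,750000
Cato,b,
"""

# pd.read_csv accepts a file path or any file-like object.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)   # (rows, columns)
print(sample_df.dtypes)  # the empty amount cell is read as NaN
```

With a real export, the `io.StringIO(...)` argument would simply be replaced by the path to the file.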

What's in here?

sample different parts of the data

mattermark_df[:10] sample the first ten rows of our DataFrame

mattermark_df.iloc[0] use .iloc to index into row location 0

mattermark_df['cached_uniques'] sample the column

mattermark_df['cached_uniques'].describe() show some standard statistics about that column (for numeric data)

mattermark_df.describe() show some standard statistics about all numeric columns

mattermark_df.sort_values('amount', ascending=False) sort the entire table by amount of funding, in descending order (older pandas versions used .sort, which has since been removed)
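To try the sampling calls from this section without the original data, the sketch below runs them on a small invented table. It assumes a recent pandas release, where sorting by a column is spelled `sort_values`; the rows and the `name` column are hypothetical.

```python
import pandas as pd

# Invented rows; only 'amount' and 'cached_uniques' are names from the talk.
df = pd.DataFrame({
    "name": ["Acme", "Bolt", "Cato", "Dune"],
    "amount": [5000000, 750000, 12000000, 2000000],
    "cached_uniques": [12000, 300, 45000, 8000],
})

first_two = df[:2]               # slice the first two rows
first_row = df.iloc[0]           # row at position 0, returned as a Series
uniques = df["cached_uniques"]   # a single column, returned as a Series
stats = uniques.describe()       # count, mean, std, min, quartiles, max

# Sort the whole table by funding amount, largest first.
by_amount = df.sort_values("amount", ascending=False)
print(by_amount)
```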

What's not in here?

mattermark_df['amount'].isnull() In the column, is the value at a given index null? (true or false)

len(np.where(mattermark_df['amount'].isnull())[0]) Count the number of null values in the column
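As a small sketch of the null count, the example below applies the `np.where` approach to an invented `amount` column and compares it with summing the boolean mask, which is a common shorter spelling. The data is hypothetical.

```python
import numpy as np
import pandas as pd

# Invented column with two missing funding amounts.
df = pd.DataFrame({"amount": [5000000.0, np.nan, 750000.0, np.nan]})

null_mask = df["amount"].isnull()           # True where the value is missing

count_where = len(np.where(null_mask)[0])   # approach used in the notes
count_sum = int(null_mask.sum())            # each True counts as 1

print(count_where, count_sum)
```

Both approaches count the same rows; summing the mask just skips the intermediate array of positions.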

Ask a few Questions (which lead to other questions)

What is the most common stage for funding?

mattermark_df['series'].value_counts() count the values in each category

mattermark_df['series'].value_counts().plot(kind='bar') plot in a bar graph (grouped by series) to get a quick idea of relative scale
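A minimal sketch of the counting step on invented data, to show what `value_counts` returns before it is plotted. The `series` labels here are made up.

```python
import pandas as pd

# Invented funding stages; only the column name 'series' is from the talk.
df = pd.DataFrame({"series": ["a", "seed", "a", "b", "seed", "a"]})

# One entry per distinct value, ordered from most to least frequent.
counts = df["series"].value_counts()
print(counts)

# In a notebook with matplotlib installed, counts.plot(kind='bar')
# would draw these counts as a bar chart.
```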

Leads to Question: What is the typical funding amount by round?

by_series = mattermark_df.groupby('series') group records by series column (stored in a variable)

print(by_series['amount'].mean().astype(int)) within each grouping, calculate the mean (and do some explicit type conversion)
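The group-then-aggregate step can be sketched on a few invented rows as follows. The amounts are hypothetical and chosen so the means are easy to check by hand.

```python
import pandas as pd

# Invented funding rounds; column names follow the talk.
df = pd.DataFrame({
    "series": ["a", "a", "b", "seed", "seed"],
    "amount": [4000000, 6000000, 15000000, 500000, 1000000],
})

by_series = df.groupby("series")   # one group per distinct series value

# Mean amount per group, converted from float to int for readability.
mean_amount = by_series["amount"].mean().astype(int)
print(mean_amount)
```

One caveat worth keeping in mind: `mean()` skips missing values, but if every amount in a group is missing the group mean is NaN, and `astype(int)` will then raise an error.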

How many of these are mobile companies?

mobile_df = mattermark_df.dropna(subset=['cached_mobile_downloads']) we make the brash inference that a company without a monthly count of mobile downloads has no mobile application; .dropna removes the rows that have no value in that column.

mattermark_df.shape
mobile_df.shape

compare the shapes of the two DataFrames to get a rough proportion of companies (rows) that have mobile app data
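The shape comparison can be turned into an explicit proportion. The sketch below does this on invented rows, under the same assumption as above that a missing download count means no mobile app.

```python
import numpy as np
import pandas as pd

# Invented companies; two of the four have no download count.
df = pd.DataFrame({
    "name": ["Acme", "Bolt", "Cato", "Dune"],
    "cached_mobile_downloads": [1000.0, np.nan, np.nan, 250.0],
})

# Keep only rows that have a value in the downloads column.
mobile_df = df.dropna(subset=["cached_mobile_downloads"])

# shape is (rows, columns), so shape[0] is the number of companies.
mobile_share = mobile_df.shape[0] / df.shape[0]
print(df.shape, mobile_df.shape, mobile_share)
```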

For more context, see the video & slides.

Great Resources for Getting Started

The Open Source Data Science Masters - A curated curriculum of open source resources to get you working with and understanding data
pandas cookbook - great beginning resource from Julia Evans
Python for Data Analysis / Book - the nuts and bolts of manipulating, processing, cleaning, and crunching data in Python (with numpy, pandas, and matplotlib)
