
# clarecorthell/data-prototyping-talk

Data Prototyping with Python (Slides & Resources from talk)

### README.md

Follow me on twitter @clarecorthell!

September 9 2014 @ Hackbright Academy, SF

## Prototyping in the Data World - Data Scripting Skills

### Tools

• numpy: multi-dimensional container of data
• pandas: data structures and analysis tools
• matplotlib: Python plotting library
• IPython: browser-based code notebook / IDE (run blocks of code, not the whole program)

### Notes accompanying the talk [Video & Slides]

All Python code for this talk was run in the browser-based IPython interpreter.

#### Import tools

``````
import numpy as np
import pandas as pd

# Render our plots inline
%matplotlib inline
import matplotlib.pyplot as plt
``````

#### Get Data

Turn a CSV into a DataFrame (for example, an export from Excel in CSV form).

`mattermark_df = pd.read_csv('mattermark_data.csv')` => Mattermark data about funding rounds in New York City in the last five years
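The Mattermark export itself is not included here, so the following is only a minimal sketch of the same loading step on an in-memory CSV. The column names `series`, `amount`, `cached_uniques` and `cached_mobile_downloads` are the ones used in these notes; the `name` column, the `toy_df` variable and every value are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for 'mattermark_data.csv': the values are made up,
# and the 'name' column is an assumption, not part of the original data.
csv_text = """name,series,amount,cached_uniques,cached_mobile_downloads
Acme,a,5000000,12000,300
Birch,b,20000000,45000,
Cedar,a,,8000,
Dune,seed,750000,1500,90
"""

# read_csv accepts a file path or any file-like object
toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df.shape)   # (rows, columns)
print(toy_df.dtypes)  # numeric columns with missing cells are read as float64
```

Empty cells in the CSV become `NaN` in the DataFrame, which is what the null checks later in these notes look for.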

#### What's in here?

sample different parts of the data

`mattermark_df[:10]` sample the first ten rows of our DataFrame

`mattermark_df.iloc[0]` use .iloc to index into row location 0

`mattermark_df['cached_uniques']` sample the column

`mattermark_df['cached_uniques'].describe()` show some standard statistics about that column (for numeric data)

`mattermark_df.describe()` show some standard statistics about all numeric columns

`mattermark_df.sort_values('amount', ascending=False)` sort the entire table by amount of funding, descending (the talk used `.sort`, which newer pandas versions replaced with `.sort_values`)
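To see what each of these sampling calls returns, here is a small sketch on a toy DataFrame. The values and the `toy_df` name are invented; only the column names come from the notes. It assumes a pandas version that provides `sort_values` (0.17 or later).

```python
import pandas as pd

# Invented values; only the column names are taken from the notes above.
toy_df = pd.DataFrame({
    'series': ['a', 'b', 'a', 'seed'],
    'amount': [5000000.0, 20000000.0, 3000000.0, 750000.0],
    'cached_uniques': [12000, 45000, 8000, 1500],
})

first_two = toy_df[:2]               # slice of the first two rows (a DataFrame)
first_row = toy_df.iloc[0]           # a single row, returned as a Series
uniques = toy_df['cached_uniques']   # a single column, returned as a Series

summary = uniques.describe()         # count, mean, std, min, quartiles, max
by_amount = toy_df.sort_values('amount', ascending=False)  # largest amount first

print(summary)
print(by_amount)
```

Sorting returns a new DataFrame and keeps the original row labels, so the index of `by_amount` shows where each row sat before sorting.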

#### What's not in here?

`mattermark_df['amount'].isnull()` In the column, is the value at a given index null? (true or false)

`len(np.where(mattermark_df['amount'].isnull())[0])` Count the number of null values in the column (`np.where` returns a tuple of index arrays, so take element `[0]` before measuring its length)
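One detail worth checking when counting nulls this way: with a single argument, `np.where` returns a tuple holding one index array, so the length of the tuple itself is always 1 for a one-dimensional column. The sketch below, on an invented `amounts` Series, counts missing values by indexing into that tuple and, as an alternative, by summing the boolean mask.

```python
import numpy as np
import pandas as pd

# Invented values standing in for the 'amount' column
amounts = pd.Series([5000000.0, np.nan, 750000.0, np.nan], name='amount')

mask = amounts.isnull()              # boolean Series, True where the value is missing
null_positions = np.where(mask)[0]   # np.where returns a tuple; [0] is the index array

n_null_where = len(null_positions)   # count via the index array
n_null_sum = int(mask.sum())         # True counts as 1, so the sum is the count

print(n_null_where, n_null_sum)
```

Summing the mask is the shorter of the two and does not need numpy at all.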

#### Ask a few Questions (which lead to other questions)

What is the most common stage for funding?

`mattermark_df['series'].value_counts()` count the values in each category

`mattermark_df['series'].value_counts().plot(kind='bar')` plot in a bar graph (grouped by series) to get a quick idea of relative scale
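As a rough illustration of what `value_counts` returns before it is plotted, the sketch below uses an invented `series_col` with made-up stage labels. The `normalize=True` option is an addition not used in the talk; it reports each category as a fraction of the total, which is another way to read relative scale.

```python
import pandas as pd

# Invented stage labels standing in for the 'series' column
series_col = pd.Series(['a', 'b', 'a', 'seed', 'a'], name='series')

counts = series_col.value_counts()                 # sorted, most frequent first
shares = series_col.value_counts(normalize=True)   # fractions that sum to 1

print(counts)
print(shares)
```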

Leads to Question: What is the typical funding amount by round?

`by_series = mattermark_df.groupby('series')` group records by series column (stored in a variable)

`print(by_series['amount'].mean().astype(int))` within each group, calculate the mean (with an explicit conversion to integers)
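The group-then-aggregate step can be sketched on a toy table as follows. The values and the `toy_df` and `mean_amount` names are invented for illustration; the column names are the ones used above.

```python
import pandas as pd

# Invented funding rounds; only the column names come from the notes
toy_df = pd.DataFrame({
    'series': ['a', 'b', 'a', 'seed', 'a'],
    'amount': [5000000.0, 20000000.0, 3000000.0, 750000.0, 4000000.0],
})

by_series = toy_df.groupby('series')   # one group per distinct series value

# mean() skips missing amounts within each group; astype(int) drops the decimals
mean_amount = by_series['amount'].mean().astype(int)

print(mean_amount)
```

The result is a Series indexed by the group labels. Note that `astype(int)` raises an error if any group's mean is `NaN`, which happens when every amount in that group is missing.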

How many of these are mobile companies?

`mobile_df = mattermark_df.dropna(subset=['cached_mobile_downloads'])` We make a brash inference: if a company has no monthly count of mobile downloads, it has no mobile application. The `.dropna` function removes the rows that have no value in that column.

``````
mattermark_df.shape
mobile_df.shape
``````

Compare the shapes of the two DataFrames to see how many companies (rows) have mobile app data, which gives a rough proportion.
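A minimal sketch of the row-count comparison, on invented data: the `name` column, the values, and the `mobile_share` variable are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Invented companies; NaN means no download count was recorded
toy_df = pd.DataFrame({
    'name': ['Acme', 'Birch', 'Cedar', 'Dune'],
    'cached_mobile_downloads': [300.0, np.nan, np.nan, 90.0],
})

# Keep only rows that have a value in the chosen column
mobile_df = toy_df.dropna(subset=['cached_mobile_downloads'])

n_all = toy_df.shape[0]        # shape is (rows, columns)
n_mobile = mobile_df.shape[0]
mobile_share = n_mobile / float(n_all)

print(toy_df.shape, mobile_df.shape, mobile_share)
```

Because `dropna` with `subset` only filters rows, the column count stays the same and only the first element of `shape` changes.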

(For more context, see the video & slides.)

### Great Resources for Getting Started
