Start Here: datascience Tutorial

This is a brief introduction to the functionality in the datascience package. For a complete reference guide, please see tables-overview.

For other useful tutorials and examples, see:

Table of Contents

Getting Started

The most important functionality in the package is the Table class, which is the structure used to represent columns of data. First, load the class:

python

from datascience import Table

In the IPython notebook, type Table. followed by the TAB-key to see a list of members.

Note that for the Data Science 8 class we also import additional packages and settings for all assignments and labs. This is so that plots and other available packages mirror the ones in the textbook more closely. The exact code we use is:

# HIDDEN

import matplotlib
matplotlib.use('Agg')
from datascience import Table
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
plt.style.use('fivethirtyeight')

In particular, the lines involving matplotlib allow for plotting within the IPython notebook.

Creating a Table

A Table is a sequence of labeled columns of data.

A Table can be constructed from scratch by extending an empty table with columns.

python

t = Table().with_columns(
    'letter', ['a', 'b', 'c', 'z'],
    'count',  [  9,   3,   3,   1],
    'points', [  1,   2,   2,  10],
)

print(t)


More often, a table is read from a CSV file (or an Excel spreadsheet). Here's the content of an example file:

python

cat sample.csv

And this is how we load it in as a Table using ~datascience.tables.Table.read_table:

python

Table.read_table('sample.csv')

CSVs from URLs are also valid inputs to ~datascience.tables.Table.read_table:


It's also possible to add columns from a dictionary, but this option is discouraged because dictionaries were not guaranteed to preserve column order before Python 3.7.

python

t = Table().with_columns({
    'letter': ['a', 'b', 'c', 'z'],
    'count':  [  9,   3,   3,   1],
    'points': [  1,   2,   2,  10],
})

print(t)

Accessing Values

To access values of columns in the table, use ~datascience.tables.Table.column, which takes a column label or index and returns an array. Alternatively, ~datascience.tables.Table.columns returns a list of columns (arrays).

python

t

t.column('letter')
t.column(1)

You can use bracket notation as a shorthand for this method:

python

t['letter'] # This is a shorthand for t.column('letter')
t[1]        # This is a shorthand for t.column(1)

To access values by row, ~datascience.tables.Table.row returns a row by index. Alternatively, ~datascience.tables.Table.rows returns a list-like ~datascience.tables.Table.Rows object that contains tuple-like ~datascience.tables.Table.Row objects.

python

t.rows
t.rows[0]
t.row(0)

second = t.rows[1]
second
second[0]
second[1]

To get the number of rows, use ~datascience.tables.Table.num_rows.

python

t.num_rows

Manipulating Data

Here are some of the most common operations on data. For the rest, see the reference (tables-overview).

Adding a column with ~datascience.tables.Table.with_column:

python

t
t.with_column('vowel?', ['yes', 'no', 'no', 'no'])
t # .with_column returns a new table without modifying the original

t.with_column('2 * count', t['count'] * 2) # A simple way to operate on columns

Selecting columns with ~datascience.tables.Table.select:

python

t.select('letter')
t.select(['letter', 'points'])

Renaming columns with ~datascience.tables.Table.relabeled:

python

t
t.relabeled('points', 'other name')
t
t.relabeled(['letter', 'count', 'points'], ['x', 'y', 'z'])

Selecting out rows by index with ~datascience.tables.Table.take and conditionally with ~datascience.tables.Table.where:

python

t
t.take(2) # the third row
t.take[0:2] # the first and second rows

python

t.where('points', 2) # rows where points == 2
t.where(t['count'] < 8) # rows where count < 8

t['count'] < 8 # .where actually takes in an array of booleans
t.where([False, True, True, True]) # same as the last line
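The boolean-array form of where can be pictured in plain Python. This is only an illustrative sketch, not the library's implementation: the helper name filter_rows is hypothetical, and ordinary lists stand in for Table columns.

```python
# Illustrative sketch only: plain lists stand in for Table columns.
letters = ['a', 'b', 'c', 'z']
counts = [9, 3, 3, 1]


def filter_rows(rows, mask):
    """Keep each row whose matching mask entry is True (hypothetical helper)."""
    return [row for row, keep in zip(rows, mask) if keep]


# One boolean per row, analogous to evaluating t['count'] < 8.
mask = [count < 8 for count in counts]

rows = list(zip(letters, counts))
kept = filter_rows(rows, mask)

print(mask)
print(kept)
```

The mask has exactly one entry per row, which is why a hand-written list of booleans and a comparison on a column are interchangeable as arguments.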

Operate on table data with ~datascience.tables.Table.sort, ~datascience.tables.Table.group, and ~datascience.tables.Table.pivot:

python

t
t.sort('count')
t.sort('letter', descending = True)

python

# You may pass a reducing function into the collect arg
# Note the renaming of the points column because of the collect arg
t.select(['count', 'points']).group('count', collect=sum)
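To make the effect of a reducing function concrete, here is a minimal plain-Python sketch of grouping by one column and summing another. It is not how the library implements group; the helper name group_sum is hypothetical, and the keys are sorted only to make the result easy to read.

```python
from collections import defaultdict

# The same 'count' and 'points' values used in the table t above.
counts = [9, 3, 3, 1]
points = [1, 2, 2, 10]


def group_sum(keys, values):
    """Sum the values that share a key (hypothetical helper)."""
    totals = defaultdict(int)
    for key, value in zip(keys, values):
        totals[key] += value
    # Sort by key so the result is deterministic and easy to scan.
    return dict(sorted(totals.items()))


grouped = group_sum(counts, points)
print(grouped)
```

The two rows with count 3 collapse into a single entry whose points are added together, which is the role the reducing function plays.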

python

other_table = Table().with_columns(
    'mar_status', ['married', 'married', 'partner', 'partner', 'married'],
    'empl_status', ['Working as paid', 'Working as paid', 'Not working', 'Not working', 'Not working'],
    'count', [1, 1, 1, 1, 1])

other_table

other_table.pivot('mar_status', 'empl_status', 'count', collect=sum)
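A pivot can be thought of as a two-key version of grouping: one column supplies the new column labels, another supplies the row labels, and the reducing function fills each cell. The sketch below works on plain lists and is only an illustration under that reading; pivot_sum is a hypothetical helper, and the nested-dictionary layout is an assumption made for display, not the library's return type.

```python
from collections import defaultdict

# Same values as other_table above, held in plain lists.
mar_status = ['married', 'married', 'partner', 'partner', 'married']
empl_status = ['Working as paid', 'Working as paid', 'Not working',
               'Not working', 'Not working']
count = [1, 1, 1, 1, 1]


def pivot_sum(columns, rows, values):
    """Cross-tabulate values by (row label, column label), summing each cell."""
    cells = defaultdict(int)
    for col, row, value in zip(columns, rows, values):
        cells[(row, col)] += value
    col_labels = sorted(set(columns))
    row_labels = sorted(set(rows))
    # Combinations that never occur are reported as 0.
    return {r: {c: cells.get((r, c), 0) for c in col_labels}
            for r in row_labels}


pivoted = pivot_sum(mar_status, empl_status, count)
for row_label, cells in pivoted.items():
    print(row_label, cells)
```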

Visualizing Data

We'll start with some data drawn at random from two normal distributions:

python

normal_data = Table().with_columns(
    'data1', np.random.normal(loc = 1, scale = 2, size = 100),
    'data2', np.random.normal(loc = 4, scale = 3, size = 100))

normal_data

Draw histograms with ~datascience.tables.Table.hist:

python

@savefig hist.png width=4in
normal_data.hist()

python

@savefig hist_binned.png width=4in
normal_data.hist(bins = range(-5, 10))

python

@savefig hist_overlay.png width=4in
normal_data.hist(bins = range(-5, 10), overlay = True)

If we treat the normal_data table as a set of x-y points, we can ~datascience.tables.Table.plot and ~datascience.tables.Table.scatter:

python

@savefig plot.png width=4in
normal_data.sort('data1').plot('data1') # Sort first to make plot nicer

python

@savefig scatter.png width=4in
normal_data.scatter('data1')

python

@savefig scatter_line.png width=4in
normal_data.scatter('data1', fit_line = True)

Use ~datascience.tables.Table.barh to display categorical data.

python

t
@savefig barh.png width=4in
t.barh('letter')

Exporting

Exporting to CSV is the most common operation and can be done by first converting to a pandas dataframe with ~datascience.tables.Table.to_df:

python

normal_data

# index = False prevents row numbers from appearing in the resulting CSV
normal_data.to_df().to_csv('normal_data.csv', index = False)
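For a sense of what the resulting file looks like, the sketch below writes labeled columns as CSV text using only the standard library csv module. This is a swapped-in alternative for illustration, not what to_df or to_csv do internally; the column values are the small letter/count/points example from earlier, and the text is written to an in-memory buffer rather than a file.

```python
import csv
import io

# Labeled columns, mirroring the small example table t.
labels = ['letter', 'count', 'points']
columns = [['a', 'b', 'c', 'z'], [9, 3, 3, 1], [1, 2, 2, 10]]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(labels)          # header row with the column labels
writer.writerows(zip(*columns))  # transpose the columns into rows

csv_text = buffer.getvalue()
print(csv_text)
```

There is no leading row-number column in this output, which is the layout that index = False requests from pandas.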

An Example

We'll recreate the steps in Chapter 12 of the textbook to see if there is a significant difference in birth weights between smokers and non-smokers using a bootstrap test.

For more examples, check out the TableDemos repo.

From the text:

The table baby contains data on a random sample of 1,174 mothers and their newborn babies. The column Birth Weight contains the birth weight of the baby, in ounces; Gestational Days is the number of gestational days, that is, the number of days the baby was in the womb. There is also data on maternal age, maternal height, maternal pregnancy weight, and whether or not the mother was a smoker.

python

baby = Table.read_table('https://www.inferentialthinking.com/data/baby.csv')
baby # Let's take a peek at the table

# Select out columns we want.
smoker_and_wt = baby.select(['Maternal Smoker', 'Birth Weight'])
smoker_and_wt

Let's compare the number of smokers to non-smokers.

python

smoker_and_wt.select('Maternal Smoker').group('Maternal Smoker')

We can also compare the distribution of birthweights between smokers and non-smokers.

python

# Non smokers
# We do this by grabbing the rows that correspond to mothers that don't
# smoke, then plotting a histogram of just the birthweights.
@savefig not_m_smoker_weights.png width=4in
smoker_and_wt.where('Maternal Smoker', 0).select('Birth Weight').hist()

# Smokers
@savefig m_smoker_weights.png width=4in
smoker_and_wt.where('Maternal Smoker', 1).select('Birth Weight').hist()

What's the difference in mean birth weight of the two categories?

python

nonsmoking_mean = smoker_and_wt.where('Maternal Smoker', 0).column('Birth Weight').mean()
smoking_mean = smoker_and_wt.where('Maternal Smoker', 1).column('Birth Weight').mean()

observed_diff = nonsmoking_mean - smoking_mean
observed_diff

Let's do the bootstrap test on the two categories.

python

num_nonsmokers = smoker_and_wt.where('Maternal Smoker', 0).num_rows
def bootstrap_once():
    """
    Computes one bootstrapped difference in means.
    The table.sample method lets us take random samples.
    We then split according to the number of nonsmokers in the original sample.
    """
    resample = smoker_and_wt.sample(with_replacement = True)
    bootstrap_diff = resample.column('Birth Weight')[:num_nonsmokers].mean() - resample.column('Birth Weight')[num_nonsmokers:].mean()
    return bootstrap_diff

repetitions = 1000
bootstrapped_diff_means = np.array(
    [ bootstrap_once() for _ in range(repetitions) ])

bootstrapped_diff_means[:10]

num_diffs_greater = (abs(bootstrapped_diff_means) > abs(observed_diff)).sum()
p_value = num_diffs_greater / len(bootstrapped_diff_means)
p_value
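The same procedure can be followed end to end without the table machinery. The sketch below uses only the standard library and made-up weights drawn from two normal distributions; the numbers are NOT the baby.csv data, and the group sizes, means, and spreads are assumptions chosen purely for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so repeated runs give the same result

# Made-up birth weights in ounces (assumed values, not the baby.csv data).
nonsmoker_weights = [random.gauss(123, 17) for _ in range(60)]
smoker_weights = [random.gauss(114, 18) for _ in range(40)]

observed_diff = (statistics.mean(nonsmoker_weights)
                 - statistics.mean(smoker_weights))

pooled = nonsmoker_weights + smoker_weights
num_nonsmokers = len(nonsmoker_weights)


def bootstrap_once():
    """Resample the pooled weights with replacement, then split the
    resample at the original number of nonsmokers and compare means."""
    resample = random.choices(pooled, k=len(pooled))
    return (statistics.mean(resample[:num_nonsmokers])
            - statistics.mean(resample[num_nonsmokers:]))


repetitions = 1000
diffs = [bootstrap_once() for _ in range(repetitions)]

# Fraction of resampled differences at least as extreme as the observed one.
num_diffs_greater = sum(abs(d) > abs(observed_diff) for d in diffs)
p_value = num_diffs_greater / repetitions
print(p_value)
```

Because each resample ignores the original group labels, the split halves differ only by chance, so a small p_value indicates that the observed difference is unusual under that assumption.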

Drawing Maps

The main class in the maps module is the Map class. In this code we create a default Map. Maps can be displayed or converted to HTML.

python

from datascience.maps import Map # import the Map class
default_map = Map() # generate a default Map
default_map.show() # display the Map

html = default_map.as_html() # generate the html
with open('map.html', 'w') as f: # make a file to store the html
    f.write(html) # write the html to the file

The maps module also allows you to make custom maps with markers, circles and regions.

python

from datascience.maps import Map, Marker, Circle, Region # import the Map, Marker, Circle and Region class

# generates markers with custom sets of coordinates, colors and popups
marker1 = Marker(37.372, -121.758, color="green", popup="My green marker")
marker2 = Marker(37.572, -121.758, color="orange", popup="My orange marker")

# generates a circle with a custom set of coordinates, color and popup
circle = Circle(37.5, -122, color="red", area=1000, popup="My Circle")

# make a geojson object which is needed when making a region
geojson = {
    "type": "Feature",
    "geometry": {
        "type": "Polygon",
        "coordinates": [ # specifies the coordinates
            [[-121,37],[-121.5,37],[-121.5,37.5],[-121,37.5],[-121,37]] # these coordinates make a rectangle
        ]
    }
}

# make a region with your geojson object
region = Region(geojson)

# Initialize the map
custom_map = Map(features=[marker1, marker2, circle, region], # specifies the features
                 width=800, # specifies a custom width
                 height=600 # specifies a custom height
                 )
custom_map.show() # display the map
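When writing a polygon by hand, two GeoJSON conventions are easy to get wrong: each position is listed as [longitude, latitude], the reverse of the latitude-first order used for the markers above, and the ring must end on the same position it starts from. The sketch below checks the second point with plain Python; ring_is_closed is a hypothetical helper, not part of the maps module.

```python
# The same rectangle used for the region above.
geojson = {
    "type": "Feature",
    "geometry": {
        "type": "Polygon",
        "coordinates": [
            [[-121, 37], [-121.5, 37], [-121.5, 37.5], [-121, 37.5], [-121, 37]]
        ],
    },
}


def ring_is_closed(ring):
    """A polygon ring needs at least four positions and must repeat
    its first position at the end (hypothetical helper)."""
    return len(ring) >= 4 and ring[0] == ring[-1]


outer_ring = geojson["geometry"]["coordinates"][0]
print(ring_is_closed(outer_ring))

# Each position is [longitude, latitude].
longitudes = [position[0] for position in outer_ring]
latitudes = [position[1] for position in outer_ring]
print(min(longitudes), max(longitudes), min(latitudes), max(latitudes))
```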