Matplotlib and Pandas Sampler
These short guides are meant to show you some practical examples of matplotlib and pandas, not serve as comprehensive walkthroughs.
- Introduction to matplotlib visualizations
- Basic matplotlib visualization of climate data
- Basic pandas data wrangling and matplotlib visualization of climate data
- Nicolas P. Rougier has an excellent and beautifully designed Matplotlib tutorial.
- How to make beautiful data visualizations in Python with matplotlib
- Matplotlib homepage
(Note: While the Matplotlib homepage is a place you eventually want to go to, some of the documentation may be more complicated for you than necessary...)
- 10 Minutes to pandas
- Pandas cookbook
- An Introduction to Pandas, via Michael Hansen
- 12 Useful Pandas Techniques in Python for Data Manipulation
- Things in Pandas I Wish I'd Known Earlier
The data folder contains several datasets, extracted and somewhat normalized for your convenience:
- NASA-aggregated data on global temperature and greenhouse gases
- Daily closing prices for top tech stocks, via Yahoo Finance.
Ad-hoc examples (to get their own notebook)
Typecasting dates during the pandas import:
from os.path import join
import matplotlib.pyplot as plt
import pandas as pd

fname = join('data', 'stocks', 'YHOO.csv')
# must specify that the 'Date' column is actually a date
# and pandas will try its best to convert it
df = pd.read_csv(fname, parse_dates=['Date'])
fig, ax = plt.subplots()
ax.plot(df['Date'], df['Adj Close'])
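To see what parse_dates changes, here is a minimal sketch that reads the same text twice and prints the resulting dtype of the 'Date' column. The rows are made-up sample values standing in for the real YHOO.csv file (they are not actual prices), and the names csv_text, raw, and parsed are only for illustration.

```python
import io
import pandas as pd

# Hypothetical sample rows standing in for data/stocks/YHOO.csv;
# the numbers are invented for illustration only.
csv_text = (
    "Date,Adj Close\n"
    "2015-01-02,50.17\n"
    "2015-01-05,49.13\n"
)

# Without parse_dates, the 'Date' column is read as plain text
raw = pd.read_csv(io.StringIO(csv_text))

# With parse_dates, pandas converts the column to datetimes
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'])

print(raw['Date'].dtype)
print(parsed['Date'].dtype)
```

Once the column holds real datetimes, matplotlib can lay out the x-axis chronologically instead of treating each date as an arbitrary label.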
Without pandas, here's what that typecasting would look like:
from os.path import join
from datetime import datetime
import csv

fname = join('data', 'stocks', 'YHOO.csv')
with open(fname, 'r') as rf:
    # DictReader needs the open file object, not the filename string
    data = list(csv.DictReader(rf))
for d in data:
    d['Date'] = datetime.strptime(d['Date'], '%Y-%m-%d')
    d['Adj Close'] = float(d['Adj Close'])
# then the visualization code...
Coercing numeric values with pandas
The 2014 SAT score data is an example of annoyingly difficult dirty data. The columns contain a mix of numbers and things like asterisks, which need to be cleared out if pandas is to typecast a column as all numbers/floats/etc.
The coercion can be done when read_csv() is called; check out the documentation for all of its arguments.
One argument is na_values, which lets us specify string values that should be considered "not-a-number" values, such as an asterisk (*).
Here's the import without specifying na_values (adf), and then with it (bdf):
from os.path import join
import pandas as pd

fname = join('data', 'schools', 'sat-2014.csv')
adf = pd.read_csv(fname)
bdf = pd.read_csv(fname, na_values=['*'])
Compare the dtypes attributes of adf and bdf: many more columns of the bdf dataframe are typecast as numbers.
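The difference is easy to reproduce on a tiny inline table. This is only a sketch: the school names, the avg_math column, and all values are hypothetical stand-ins for the real SAT file, and the _sample suffix marks dataframes that exist only in this example.

```python
import io
import pandas as pd

# Hypothetical rows imitating the dirty SAT data: '*' marks a missing value
csv_text = (
    "school,number_of_test_takers,avg_math\n"
    "Alpha High,120,510\n"
    "Beta High,*,*\n"
    "Gamma High,15,480\n"
)

# Default import: the '*' forces the affected columns to be read as text
adf_sample = pd.read_csv(io.StringIO(csv_text))

# With na_values, '*' becomes NaN and the columns can be numeric
bdf_sample = pd.read_csv(io.StringIO(csv_text), na_values=['*'])

print(adf_sample.dtypes)
print(bdf_sample.dtypes)
```

Because NaN is a floating-point value, a column that contains any missing entries ends up as a float column rather than an integer one.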
Now it's easy to filter the SAT results by schools that have a minimum number of test takers:
cdf = bdf[bdf['number_of_test_takers'] >= 20]
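One detail worth knowing about this filter is how it treats the values that na_values turned into NaN. The sketch below uses the same kind of made-up inline data (hypothetical school names and numbers, with mask as an illustrative variable name) to show that a comparison against NaN evaluates to False, so schools with a missing count are dropped along with the schools below the threshold.

```python
import io
import pandas as pd

# Hypothetical rows; 'Beta High' has '*' where the count is missing
csv_text = (
    "school,number_of_test_takers\n"
    "Alpha High,120\n"
    "Beta High,*\n"
    "Gamma High,15\n"
)
bdf = pd.read_csv(io.StringIO(csv_text), na_values=['*'])

# Boolean mask: NaN >= 20 is False, just like 15 >= 20
mask = bdf['number_of_test_takers'] >= 20
cdf = bdf[mask]

print(mask.tolist())           # [True, False, False]
print(cdf['school'].tolist())  # ['Alpha High']
```

If the rows with missing counts need to be kept for a separate look, something like bdf[bdf['number_of_test_takers'].isna()] would select them.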