Permalink
Switch branches/tags
Nothing to show
Find file
5cb7cbe Apr 8, 2015
231 lines (160 sloc) 4.12 KB
Session 1
=========
You should do install if you have not already.
We have Internet! But, I have USB keys too.
Format: lesson, exercises, solutions.
Open Exercise-1
Point out .from_csv functions
VIEWING
len(titles)
titles.head() and .head(20)
titles.tail() and .tail(10)
titles
FILTERING
h = titles.head()
h['year'] or h.year
h.year + 1000
h.year - 2000
h.year > 1960
h[h.year > 1960]
h[h.year > 1960 & h.year < 1970]
h[(h.year > 1960) & (h.year < 1970)]
t.year // 10 * 10
h[h.title == '...']
SORTING
titles.sort(['title'])
titles.sort(['year'])
titles.sort(['year', 'title'])
Session 2
=========
STRING METHODS
h.str.len()
h.str.startswith(s)
h.str.extract(RE)
AGGREGATION
titles.year.value_counts()
titles.year.value_counts().plot() whoops!
titles.year.index
titles.year.value_counts().sort_index().plot()
titles.year.value_counts().sort_index().plot(kind='bar')
c = cast
c = c[c.character == 'Kermit the Frog]
c.plot(x='year', y='n', kind='scatter')
COLUMNS
Can be hard to see data
c = cast
c = c[c.character == 'Kermit the Frog']
c = c[['year', 'n']]
c
Can also:
c[['year']]
Session 3
=========
INDEXES - SPEED
%%time cast[cast.title == 'Sleuth']
c = cast.set_index(['title'])
%%time c.loc['Sleuth']
c = cast.set_index(['title']).sort_index()
%%time c.loc['Sleuth']
c = cast.set_index(['title', 'year']).sort_index()
c.loc['Sleuth']
c.loc['Sleuth',1996]
c.loc[('Sleuth',1996),'character']
c.loc[('Sleuth',1996),('character','n')]
.reset_index('title')
.reset_index('year')
.reset_index(['title', 'year'])
.reset_index()
INDEXES - GROUP BY
c = cast
c = c[c.name == 'George Clooney']
c.groupby(['title', 'year', 'character']).size()
c = cast
c = c[c.name == 'George Clooney']
c.groupby(['character', 'title', 'year']).size()
c = cast
c = c[c.name == 'George Clooney']
c.groupby(['character']).size()
# How many times has he had two roles in the same film?
c = cast
c = c[c.name == 'George Clooney']
c = c.groupby(['year', 'title']).size()
c[c > 1]
c = cast
c = c[c.name == 'George Clooney']
c.groupby([c.year // 10 * 10, 'character']).size()
c = cast
c = c[c.name == 'George Clooney']
c.groupby(['character', c.year // 10 * 10]).size()
TODO: mean min max!
Session 4
=========
UNSTACK
c = cast
c = c[(c.character == 'Kermit the Frog') | (c.character == 'Oscar the Grouch')]
g = c.groupby(['character', c.year // 10 * 10]).size()
g
How can we compare years? Unstack!
g.unstack('year')
g.unstack('character')
u = g.unstack('character')
u['difference'] = u['Kermit the Frog'] - u['Oscar the Grouch']
u
But, NaN.
u = g.unstack('character').fillna(0)
u['difference'] = u['Kermit the Frog'] - u['Oscar the Grouch']
u
THE DANGERS OF UNSTACK
Do it again? Oh no, we get a series!
.stack() again to repair damage, BUT can devolve to series again.
PLOTTING
Ratio?
u = g.unstack('character')
total = u['Oscar the Grouch'] + u['Kermit the Frog']
u['difference'] = u['Oscar the Grouch'] / total
u.difference.plot(ylim=[0,1])
Indexing and grouping has been moving our data LEFT.
"Unstacking" moves it UP, to columns! Stacking, DOWN.
Session 5
=========
r = release_dates
r = r[r.title == 'Inception']
r.date.dt.year
year month date dayofweek dayofyear
MERGE
What if we were interested in fetching release dates,
NOT by information in that table itself,
but by information over in "cast"?
c = cast
c = c[c.name == 'Ellen Page']
c = c.merge(release_dates)
c
Session 6
=========
c = cast
c = c[c.n <= 2]
c = c[c.name == 'Cary Grant']
c = c.merge(cast, on=['title', 'year'])
c = c[c.n_y <= 2]
c = c[c.name_y != 'Cary Grant']
c = c[['title', 'year', 'name_x', 'name_y']]
c
c.groupby('name_y').size().order(ascending=False)
reindex? or what? yeah.
.dropna()
.info()
Pivot
r = release_dates
r = r[r.title.str.startswith('Star Wars: Episode')]
r = r[r.country.str.startswith('U')]
r.pivot('title', 'country', 'date')
which is the same as
r.set_index(['title', 'country'])[['date']].unstack()
.rename(columns={...})
.concat(df)
Thoughts for later
==================
(who had which co-stars how often)
(what pairs of co-stars have appeared the most often together)
Can you use merge to find who was in movies with each other?
Fix later: second exercise s/hamlet/batman/