Permalink
| Session 1 | |
| ========= | |
| You should do install if you have not already. | |
| We have Internet! But, I have USB keys too. | |
| Format: lesson, exercises, solutions. | |
| Open Exercise-1 | |
| Point out .from_csv functions | |
| VIEWING | |
| len(titles) | |
| titles.head() and .head(20) | |
| titles.tail() and .tail(10) | |
| titles | |
| FILTERING | |
| h = titles.head() | |
| h['year'] or h.year | |
| h.year + 1000 | |
| h.year - 2000 | |
| h.year > 1960 | |
| h[h.year > 1960] | |
| h[h.year > 1960 & h.year < 1970] | |
| h[(h.year > 1960) & (h.year < 1970)] | |
| t.year // 10 * 10 | |
| h[h.title == '...'] | |
| SORTING | |
| titles.sort(['title']) | |
| titles.sort(['year']) | |
| titles.sort(['year', 'title']) | |
| Session 2 | |
| ========= | |
| STRING METHODS | |
| h.str.len() | |
| h.str.startswith(s) | |
| h.str.extract(RE) | |
| AGGREGATION | |
| titles.year.value_counts() | |
| titles.year.value_counts().plot() whoops! | |
| titles.year.index | |
| titles.year.value_counts().sort_index().plot() | |
| titles.year.value_counts().sort_index().plot(kind='bar') | |
| c = cast | |
| c = c[c.character == 'Kermit the Frog] | |
| c.plot(x='year', y='n', kind='scatter') | |
| COLUMNS | |
| Can be hard to see data | |
| c = cast | |
| c = c[c.character == 'Kermit the Frog'] | |
| c = c[['year', 'n']] | |
| c | |
| Can also: | |
| c[['year']] | |
| Session 3 | |
| ========= | |
| INDEXES - SPEED | |
| %%time cast[cast.title == 'Sleuth'] | |
| c = cast.set_index(['title']) | |
| %%time c.loc['Sleuth'] | |
| c = cast.set_index(['title']).sort_index() | |
| %%time c.loc['Sleuth'] | |
| c = cast.set_index(['title', 'year']).sort_index() | |
| c.loc['Sleuth'] | |
| c.loc['Sleuth',1996] | |
| c.loc[('Sleuth',1996),'character'] | |
| c.loc[('Sleuth',1996),('character','n')] | |
| .reset_index('title') | |
| .reset_index('year') | |
| .reset_index(['title', 'year']) | |
| .reset_index() | |
| INDEXES - GROUP BY | |
| c = cast | |
| c = c[c.name == 'George Clooney'] | |
| c.groupby(['title', 'year', 'character']).size() | |
| c = cast | |
| c = c[c.name == 'George Clooney'] | |
| c.groupby(['character', 'title', 'year']).size() | |
| c = cast | |
| c = c[c.name == 'George Clooney'] | |
| c.groupby(['character']).size() | |
| # How many times has he had two roles in the same film? | |
| c = cast | |
| c = c[c.name == 'George Clooney'] | |
| c = c.groupby(['year', 'title']).size() | |
| c[c > 1] | |
| c = cast | |
| c = c[c.name == 'George Clooney'] | |
| c.groupby([c.year // 10 * 10, 'character']).size() | |
| c = cast | |
| c = c[c.name == 'George Clooney'] | |
| c.groupby(['character', c.year // 10 * 10]).size() | |
| TODO: mean min max! | |
| Session 4 | |
| ========= | |
| UNSTACK | |
| c = cast | |
| c = c[(c.character == 'Kermit the Frog') | (c.character == 'Oscar the Grouch')] | |
| g = c.groupby(['character', c.year // 10 * 10]).size() | |
| g | |
| How can we compare years? Unstack! | |
| g.unstack('year') | |
| g.unstack('character') | |
| u = g.unstack('character') | |
| u['difference'] = u['Kermit the Frog'] - u['Oscar the Grouch'] | |
| u | |
| But, NaN. | |
| u = g.unstack('character').fillna(0) | |
| u['difference'] = u['Kermit the Frog'] - u['Oscar the Grouch'] | |
| u | |
| THE DANGERS OF UNSTACK | |
| Do it again? Oh no, we get a series! | |
| .stack() again to repair damage, BUT can devolve to series again. | |
| PLOTTING | |
| Ratio? | |
| u = g.unstack('character') | |
| total = u['Oscar the Grouch'] + u['Kermit the Frog'] | |
| u['difference'] = u['Oscar the Grouch'] / total | |
| u.difference.plot(ylim=[0,1]) | |
| Indexing and grouping has been moving our data LEFT. | |
| "Unstacking" moves it UP, to columns! Stacking, DOWN. | |
| Session 5 | |
| ========= | |
| r = release_dates | |
| r = r[r.title == 'Inception'] | |
| r.date.dt.year | |
| year month date dayofweek dayofyear | |
| MERGE | |
| What if we were interested in fetching release dates, | |
| NOT by information in that table itself, | |
| but by information over in "cast"? | |
| c = cast | |
| c = c[c.name == 'Ellen Page'] | |
| c = c.merge(release_dates) | |
| c | |
| Session 6 | |
| ========= | |
| c = cast | |
| c = c[c.n <= 2] | |
| c = c[c.name == 'Cary Grant'] | |
| c = c.merge(cast, on=['title', 'year']) | |
| c = c[c.n_y <= 2] | |
| c = c[c.name_y != 'Cary Grant'] | |
| c = c[['title', 'year', 'name_x', 'name_y']] | |
| c | |
| c.groupby('name_y').size().order(ascending=False) | |
| reindex? or what? yeah. | |
| .dropna() | |
| .info() | |
| Pivot | |
| r = release_dates | |
| r = r[r.title.str.startswith('Star Wars: Episode')] | |
| r = r[r.country.str.startswith('U')] | |
| r.pivot('title', 'country', 'date') | |
| which is the same as | |
| r.set_index(['title', 'country'])[['date']].unstack() | |
| .rename(columns={...}) | |
| .concat(df) | |
| Thoughts for later | |
| ================== | |
| (who had which co-stars how often) | |
| (what pairs of co-stars have appeared the most often together) | |
| Can you use merge to find who was in movies with each other? | |
| Fix later: second exercise s/hamlet/batman/ |