When you are unsure about syntax or arguments, you can either go to the official documentation on the web or rely on Jupyter's tab autocompletion.


**Here is a fun trick: `SHIFT+TAB`!** Try it! Type `import pandas as pd`, then run that to load pandas. Then type `pd.merge(` like you want to merge two dataframes, except you don't remember the arguments to use. So type `SHIFT+TAB+TAB`. _1 TAB opens a little syntax screen, 2 opens a larger amount of syntax info, and 3 opens the whole help file at the bottom of the screen._ I've personally moved to `SHIFT+TAB` for help most of the time (instead of `help()` or `?`).
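
If you are working outside Jupyter, or just want the same information from code, here is a rough equivalent. It is only a sketch: `inspect.signature` comes from the standard library and lists the arguments, while `help(pd.merge)` prints the whole help file. The variable names `sig` and `merge_args` are mine.

```python
import inspect
import pandas as pd

# the call signature: roughly what the first SHIFT+TAB shows
sig = inspect.signature(pd.merge)
print(sig)

# just the argument names, in order
merge_args = list(sig.parameters)
print(merge_args)

# the whole help file (long, so it is commented out here)
# help(pd.merge)
```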


## Outline

1. Be patient and persistent: keep going when the scope of `pandas` and your first data analysis exercises stump you!
1. Essential functions for data wrangling, statistics, and exploration
1. Example: Data wrangling and exploration that is readable and powerful
1. Cookbook: Typical data analysis steps, table creation, and figures



## The essential data wrangling toolkit

I imagine you'll return to this page a lot. Today and tomorrow,
1. I'm going to make note of a few important tools we have available
2. I'll show you an example that illustrates how those tools can be applied
3. We'll practice together
4. You'll be in position to attack ASGN 02
5. Over time and with experience, `pandas` will become less overwhelming and more of a friend!

Note 1: "`df`" is often used as a name of a generic dataframe in examples. Generally, you should
give your dataframes better names!

Note 2: There are other ways to do many of these operations. For example (the sidebar illustrates)
- `df['feet']=df['height']//12` will create a new column called feet
- `df[['gender','height']]` will return just those two columns
- `df.loc[df['feet']==5,'feet'] = np.nan` will overwrite the feet variable only when feet equals 5
- More on indexing and selection [here](https://jakevdp.github.io/PythonDataScienceHandbook/03.02-data-indexing-and-selection.html). Highly recommended!


In [1]:
import pandas as pd # everyone imports pandas as pd
import numpy as np
df = pd.DataFrame({'height':[72,60,68],'gender':['M','F','M'],'weight':[175,110,150]})

df['feet']=df['height']//12
print("\n\n Original df:\n",df) 
print("\n\n Subset of vars:\n",df[['gender','height']])
df.loc[df['feet']==5,'feet'] = np.nan
print("\n\n Replace given condition:\n",df)



 Original df:
    height gender  weight  feet
0      72      M     175     6
1      60      F     110     5
2      68      M     150     5


 Subset of vars:
   gender  height
0      M      72
1      F      60
2      M      68


 Replace given condition:
    height gender  weight  feet
0      72      M     175   6.0
1      60      F     110   NaN
2      68      M     150   NaN
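
As Note 2 says, there are other ways to do these operations. Here is one possible sketch of the same three steps using `assign`, `filter`, and `mask`, which return new objects instead of modifying `df` in place. The names `with_feet`, `subset`, and `masked` are just for illustration.

```python
import pandas as pd

df = pd.DataFrame({'height':[72,60,68],'gender':['M','F','M'],'weight':[175,110,150]})

# assign() returns a new dataframe with the extra column; df itself is unchanged
with_feet = df.assign(feet=df['height']//12)

# filter() is another way to keep just a few columns
subset = with_feet.filter(['gender','height'])

# mask() replaces values with NaN wherever the condition is True
masked = with_feet.assign(feet=with_feet['feet'].mask(with_feet['feet']==5))
print(masked)
```

As in the output above, `feet` switches from integers to floats once it contains `NaN`.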


In [2]:
# AN ASIDE ON RESHAPING

df = pd.Series({   ('Ford',2000):10,
                   ('Ford',2001):12,
                   ('Ford',2002):14,
                   ('Ford',2003):16,
                   ('GM',2000):11,
                   ('GM',2001):13,
                   ('GM',2002):13,
                   ('GM',2003):15}).to_frame().rename(columns={0:'Sales'}).rename_axis(['Firm','Year'])
print("Original:\n",df)
wide = df.unstack(level=0)
print("\n\nUnstack (make it shorter+wider)\non level 0/Firm:\n",wide) # move index level 0 (firm name) to column
tall = wide.stack()
print("\n\nStack it back (make it tall):\n",tall) # move the Firm column level back into the index
print("\n\nStack it back and reorder \nindex as before (firm-year):\n",tall.swaplevel().sort_index())

print("\n\nUnstack level 1/Year:\n",df.unstack(level=1)) # move index level 1 (year) to columns

Original:
            Sales
Firm Year       
Ford 2000     10
     2001     12
     2002     14
     2003     16
GM   2000     11
     2001     13
     2002     13
     2003     15


Unstack (make it shorter+wider)
on level 0/Firm:
      Sales    
Firm  Ford  GM
Year          
2000    10  11
2001    12  13
2002    14  13
2003    16  15


Stack it back (make it tall):
            Sales
Year Firm       
2000 Ford     10
     GM       11
2001 Ford     12
     GM       13
2002 Ford     14
     GM       13
2003 Ford     16
     GM       15


Stack it back and reorder 
index as before (firm-year):
            Sales
Firm Year       
Ford 2000     10
     2001     12
     2002     14
     2003     16
GM   2000     11
     2001     13
     2002     13
     2003     15


Unstack level 1/Year:
      Sales               
Year  2000 2001 2002 2003
Firm                     
Ford    10   12   14   16
GM      11   13   13   15
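
`stack` and `unstack` work on index levels. When the identifiers are ordinary columns instead, a similar reshape can be sketched with `pivot` and `melt`. This uses the same Ford/GM numbers as above; the names `long_df`, `wide`, and `tall` are just for illustration.

```python
import pandas as pd

# same sales figures as above, but with Firm and Year as ordinary columns
long_df = pd.DataFrame({
    'Firm':  ['Ford']*4 + ['GM']*4,
    'Year':  [2000, 2001, 2002, 2003]*2,
    'Sales': [10, 12, 14, 16, 11, 13, 13, 15],
})

# long -> wide: one row per Year, one column per Firm
wide = long_df.pivot(index='Year', columns='Firm', values='Sales')
print(wide)

# wide -> long again: melt turns the Firm columns back into rows
tall = wide.reset_index().melt(id_vars='Year', var_name='Firm', value_name='Sales')
print(tall.sort_values(['Firm', 'Year']))
```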


### Creating summary tables 

- A big part of the challenge is knowing what "cuts" of the data are interesting. You'll figure this out as you get more knowledge in a specific area.
- In tables that cut data based on two variables, how do you decide on the "second variable" to slice? 
    1. Secondary slices might be on _a priori_ (known in advance) interesting variables ("everyone talks about growth potential so I'll cut by that too because they will wonder about it if I don't") 
    2. Or, choose second variables because you think that the impact of the first variable will depend on the second variable. Suppose you divide firms into "young and old" as a first slice. Your second slice might be industries: perhaps leverage is only a little lower for older railroad firms but leverage is much higher for older firms in tech industries.
    3. Another example: A common data science training dataset is about surviving the Titanic. Women survived at higher rates, but this is **especially** true in first class, where very few women died. The gender disparity was less severe in third class; while third class men were more likely to perish than first class men, 45% of women in third class died, compared to ~0 in first class. 
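
To make the "two cuts" idea concrete, here is a minimal sketch of the young/old by industry table using `pivot_table`. The firms and leverage numbers are made up purely for illustration (they are not real data), and the column names `age_group`, `industry`, and `leverage` are my own.

```python
import pandas as pd

# MADE-UP numbers, chosen only to mimic the pattern described above
firms = pd.DataFrame({
    'age_group': ['young', 'young', 'old', 'old', 'young', 'young', 'old', 'old'],
    'industry':  ['rail',  'rail',  'rail', 'rail', 'tech', 'tech',  'tech', 'tech'],
    'leverage':  [0.50,    0.54,    0.48,   0.50,   0.10,   0.14,    0.30,   0.34],
})

# first cut goes down the rows, second cut goes across the columns
table = firms.pivot_table(index='age_group', columns='industry',
                          values='leverage', aggfunc='mean')
print(table)
```

Reading across a row shows how the young/old gap changes by industry, which is the reason to add the second cut.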

**Essential reading.** Rather than me replicating these pages, read these after class, follow them, and save snippets of code you find useful to your growing cheat sheet.

- [Aggregation and grouping](https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html)
- [Pivot tables](https://jakevdp.github.io/PythonDataScienceHandbook/03.09-pivot-tables.html)

## Credits

Here are some links I found that I liked. They cover many aspects of `panda`ing:
1. [This covers indexing and slicing issues nicely](https://tomaugspurger.github.io/modern-1-intro)
2. [This covers method chaining so that Python can approximate R's elegance at data wrangling and plotting](https://stmorse.github.io/journal/tidyverse-style-pandas.html)
3. [This also has nice and quick examples showing Python matching R functions](https://gist.github.com/conormm/fd8b1980c28dd21cfaf6975c86c74d07)
4. [The most popular questions about `pandas` on Stack Overflow.](https://stackoverflow.com/questions/tagged/pandas?sort=votes&pageSize=50) This will give you an idea of common places others get stuck, and slicing and indexing issues are high on the list.
5. https://www.textbook.ds100.org


In [3]:
import pandas as pd
import pandas_datareader as pdr # IF NECESSARY, from terminal: pip install pandas_datareader 
import datetime
import datadotworld as dw # follow instructions for installing and using dw in accompanying lecture
import numpy as np

### A side lesson: What on earth is **`lambda`**? 

Can you survive without learning `lambda`? Yes. 

But knowing `lambda` will make you a more powerful programmer, because it lets you define functions very quickly. And this is very useful! For example, the `agg` function can apply any function, not just built-in ones, and sometimes you'll want to use non built-in functions!

So, generally, the syntax is `<fcn_name> = lambda <argument> : <expression>`. Whatever the expression evaluates to is what the function returns.

In [7]:
my_fcn = lambda a : a*5 # if I call my_fcn(7), Python will set a=7, then evaluate the expression a*5
print(my_fcn(7))

35


Now, in the example above inside `agg()` I never named the function. That's because a name is unnecessary in that context: `agg()` receives the function and calls it right away.
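
Here is a small sketch of an unnamed `lambda` inside `agg()`, next to the same thing written with a named function. It reuses the Ford/GM sales figures from the reshaping aside; the names `sales`, `spread`, and `max_minus_min` are just for illustration.

```python
import pandas as pd

# Ford/GM sales figures from the reshaping aside
sales = pd.DataFrame({
    'Firm':  ['Ford']*4 + ['GM']*4,
    'Sales': [10, 12, 14, 16, 11, 13, 13, 15],
})

# unnamed lambda: agg() calls it once per firm, passing that firm's Sales as s
spread = sales.groupby('Firm')['Sales'].agg(lambda s: s.max() - s.min())
print(spread)

# the same calculation with a named function
def max_minus_min(s):
    return s.max() - s.min()

spread_named = sales.groupby('Firm')['Sales'].agg(max_minus_min)
```

Both versions give the same numbers; the lambda just saves you from defining and naming a function you will only use once.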

### Maybe you say "I hate that and will never ever write a lambda. How can I accomplish the problem above anyways?"

Well, I'm sorry to hear that! Here is what you could do:

In [9]:
def return_first_element(df):
    return df.iloc[0]

(baby_names.sort_values(['year','sex','count'],ascending=False) # sort descending so most popular name first
     .groupby(['year','sex']) # group by year and gender 
     .agg(return_first_element) # keep the first (most popular) name each year
     ['name'] # keep only the name variable
     .unstack() # format wide
     [-20:]
)

sex       F      M
year              
1999  Emily  Jacob
2000  Emily  Jacob
2001  Emily  Jacob
2002  Emily  Jacob
2003  Emily  Jacob
2004  Emily  Jacob
2005  Emily  Jacob
2006  Emily  Jacob
2007  Emily  Jacob
2008   Emma  Jacob
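
One more option, offered as a sketch: the built-in `first()` can stand in for the hand-written function when there are no missing values. `baby_names` is not reloaded here, so the code below uses a tiny stand-in dataframe called `toy_names` with MADE-UP counts; only the column names match the example above.

```python
import pandas as pd

# tiny stand-in for baby_names; the counts are MADE UP for illustration
toy_names = pd.DataFrame({
    'year':  [2007, 2007, 2007, 2007, 2008, 2008, 2008, 2008],
    'sex':   ['F', 'F', 'M', 'M', 'F', 'F', 'M', 'M'],
    'name':  ['Emily', 'Emma', 'Jacob', 'Michael', 'Emma', 'Emily', 'Jacob', 'Michael'],
    'count': [4, 3, 5, 4, 5, 4, 6, 5],
})

top_names = (toy_names.sort_values(['year','sex','count'],ascending=False) # most popular name first
     .groupby(['year','sex']) # group by year and sex
     ['name'] # keep only the name variable
     .first() # first row in each group = most popular name
     .unstack() # format wide
)
print(top_names)
```

Note that `first()` returns the first non-missing value in each group, so it only matches `iloc[0]` when the data has no missing names.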
