# An Introduction to `pandas`

Pandas! They are adorable animals. You might think they are [the worst animal ever](https://www.reddit.com/r/todayilearned/comments/3azkqx/til_naturalist_chris_packham_said_he_would_eat/cshqy9y) but that is not true. You might sometimes think `pandas` is the worst library ever, and that is only *kind of* true.

The important thing is **use the right tool for the job**. `pandas` is good for some stuff, SQL is good for some stuff, writing raw Python is good for some stuff. You'll figure it out as you go along.

Now let's start coding. Hopefully you did `pip install pandas` before you started up this notebook.

In [2]:
# import pandas, but call it pd. Why? Because that's What People Do.

In [3]:
import pandas as pd

When you import pandas, you use `import pandas as pd`. That means instead of typing `pandas` in your code you'll type `pd`.

You don't *have* to, but every other person on the planet will be doing it, so you might as well.

Now we're going to read in a file. Our file is called `NBA-Census-10.14.2013.csv` because we're **sports moguls**. `pandas` can `read_` different types of files, so try to figure it out by typing `pd.read_` and hitting tab for autocomplete.

In [6]:
# We're going to call this df, which means "data frame"
# It isn't in UTF-8 (I saved it from my mac!) so we need to set the encoding
df = pd.read_csv("NBA-Census-10.14.2013.csv", encoding="mac_roman")
# This is a data frame (df)


**A dataframe is basically a spreadsheet**, except it lives in the world of Python or the statistical programming language R. They can't call it a spreadsheet because then people would think those programmers used Excel, which would make them boring and normal and they'd have to wear a tie every day.
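To make the spreadsheet comparison a little more concrete, here's a minimal sketch that builds a tiny dataframe by hand instead of reading a file. The three rows are copied from the top of the NBA data; the name `tiny` is made up for this example.

```python
import pandas as pd

# Three players copied from the first rows of the NBA file
tiny = pd.DataFrame({
    "Name": ["Gee, Alonzo", "Wallace, Gerald", "Williams, Mo"],
    "Age": [26, 31, 30],
    "Team": ["Cavaliers", "Celtics", "Trail Blazers"],
})

# Columns are like spreadsheet headers, the index is like the row numbers
print(tiny.columns.tolist())
print(tiny.index.tolist())
print(tiny.shape)  # (number of rows, number of columns)
```

Same idea as a sheet with three rows and three columns, except you poke at it with code instead of a mouse.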

# Selecting rows

Now let's look at our data, since that's what data is for.

In [7]:
# Let's look at all of it
df

Unnamed: 0,Name,Age,Team,POS,#,2013 $,Ht (In.),WT,EXP,1st Year,DOB,School,City,"State (Province, Territory, Etc..)",Country,Race,HS Only
0,"Gee, Alonzo",26,Cavaliers,F,33,"$3,250,000",78,219,4,2009,5/29/1987,Alabama,"Riviera Beach, FL",Florida,US,Black,No
1,"Wallace, Gerald",31,Celtics,F,45,"$10,105,855",79,220,12,2001,7/23/1982,Alabama,"Sylacauga, AL",Alabama,US,Black,No
2,"Williams, Mo",30,Trail Blazers,G,25,"$2,652,000",73,195,10,2003,12/19/1982,Alabama,"Jackson, MS",Mississippi,US,Black,No
3,"Gladness, Mickell",27,Magic,C,40,"$762,195",83,220,2,2011,7/26/1986,Alabama A&M,"Birmingham, AL",Alabama,US,Black,No
4,"Jefferson, Richard",33,Jazz,F,44,"$11,046,000",79,230,12,2001,6/21/1980,Arizona,"Los Angeles, CA",California,US,Black,No
5,"Hill, Solomon",22,Pacers,F,9,"$1,246,680",79,220,0,2013,3/18/1991,Arizona,"Los Angeles, CA",California,US,Black,No
6,"Budinger, Chase",25,Timberwolves,F,10,"$5,000,000",79,218,4,2009,5/22/1988,Arizona,"Encinitas, CA",California,US,White,No
7,"Williams, Derrick",22,Timberwolves,F,7,"$5,016,960",80,241,2,2011,5/25/1991,Arizona,"La Mirada, CA",California,US,Black,No
8,"Hill, Jordan",26,Lakers,F/C,27,"$3,563,600",82,235,1,2012,7/27/1987,Arizona,"Newberry, SC",South Carolina,US,Black,No
9,"Frye, Channing",30,Suns,F/C,8,"$6,500,000",83,245,8,2005,5/17/1983,Arizona,"White Plains, NY",New York,US,Black,No


If we scroll we can see all of it. But maybe we don't want to see all of it. Maybe we hate scrolling?

In [9]:
# Look at the first few rows
df.head() #shows first 5 rows


Unnamed: 0,Name,Age,Team,POS,#,2013 $,Ht (In.),WT,EXP,1st Year,DOB,School,City,"State (Province, Territory, Etc..)",Country,Race,HS Only
0,"Gee, Alonzo",26,Cavaliers,F,33,"$3,250,000",78,219,4,2009,5/29/1987,Alabama,"Riviera Beach, FL",Florida,US,Black,No
1,"Wallace, Gerald",31,Celtics,F,45,"$10,105,855",79,220,12,2001,7/23/1982,Alabama,"Sylacauga, AL",Alabama,US,Black,No
2,"Williams, Mo",30,Trail Blazers,G,25,"$2,652,000",73,195,10,2003,12/19/1982,Alabama,"Jackson, MS",Mississippi,US,Black,No
3,"Gladness, Mickell",27,Magic,C,40,"$762,195",83,220,2,2011,7/26/1986,Alabama A&M,"Birmingham, AL",Alabama,US,Black,No
4,"Jefferson, Richard",33,Jazz,F,44,"$11,046,000",79,230,12,2001,6/21/1980,Arizona,"Los Angeles, CA",California,US,Black,No


...but maybe we want to see more than a measly five results?

In [10]:
# Let's look at MORE of the first few rows
df.head(10)

Unnamed: 0,Name,Age,Team,POS,#,2013 $,Ht (In.),WT,EXP,1st Year,DOB,School,City,"State (Province, Territory, Etc..)",Country,Race,HS Only
0,"Gee, Alonzo",26,Cavaliers,F,33,"$3,250,000",78,219,4,2009,5/29/1987,Alabama,"Riviera Beach, FL",Florida,US,Black,No
1,"Wallace, Gerald",31,Celtics,F,45,"$10,105,855",79,220,12,2001,7/23/1982,Alabama,"Sylacauga, AL",Alabama,US,Black,No
2,"Williams, Mo",30,Trail Blazers,G,25,"$2,652,000",73,195,10,2003,12/19/1982,Alabama,"Jackson, MS",Mississippi,US,Black,No
3,"Gladness, Mickell",27,Magic,C,40,"$762,195",83,220,2,2011,7/26/1986,Alabama A&M,"Birmingham, AL",Alabama,US,Black,No
4,"Jefferson, Richard",33,Jazz,F,44,"$11,046,000",79,230,12,2001,6/21/1980,Arizona,"Los Angeles, CA",California,US,Black,No
5,"Hill, Solomon",22,Pacers,F,9,"$1,246,680",79,220,0,2013,3/18/1991,Arizona,"Los Angeles, CA",California,US,Black,No
6,"Budinger, Chase",25,Timberwolves,F,10,"$5,000,000",79,218,4,2009,5/22/1988,Arizona,"Encinitas, CA",California,US,White,No
7,"Williams, Derrick",22,Timberwolves,F,7,"$5,016,960",80,241,2,2011,5/25/1991,Arizona,"La Mirada, CA",California,US,Black,No
8,"Hill, Jordan",26,Lakers,F/C,27,"$3,563,600",82,235,1,2012,7/27/1987,Arizona,"Newberry, SC",South Carolina,US,Black,No
9,"Frye, Channing",30,Suns,F/C,8,"$6,500,000",83,245,8,2005,5/17/1983,Arizona,"White Plains, NY",New York,US,Black,No


But maybe we want to make a basketball joke and see the **final four?**

In [14]:
# Let's look at the final few rows
df.tail(4)

Unnamed: 0,Name,Age,Team,POS,#,2013 $,Ht (In.),WT,EXP,1st Year,DOB,School,City,"State (Province, Territory, Etc..)",Country,Race,HS Only
524,"Landry, Marcus",27,Lakers,F,14,"$788,872",79,225,17,1996,11/1/1985,Wisconsin,"Milwaukee, WI",Wisconsin,US,Black,No
525,"Harris, Devin",30,Mavericks,G,20,"$854,389",75,192,9,2004,2/27/1983,Wisconsin,"Milwaukee, WI",Wisconsin,US,Black,No
526,"West, David",33,Pacers,F,21,"$12,000,000",81,250,10,2003,8/29/1980,Xavier,"Teaneck, NJ",New Jersey,US,Black,No
527,"Crawford, Jordan",24,Celtics,G,27,"$2,162,419",76,195,3,2010,10/23/1988,Xavier,"Detroit, MI",Michigan,US,Black,No


So yes, `head` and `tail` work kind of like the terminal commands. That's nice, I guess.

But maybe we're incredibly demanding (which we are) and we want, say, **the 6th through the 8th row** (which we do). Don't worry (which I know you were), we can do that, too.

In [18]:
# Show the 6th through the 8th rows
df[5:8]

Unnamed: 0,Name,Age,Team,POS,#,2013 $,Ht (In.),WT,EXP,1st Year,DOB,School,City,"State (Province, Territory, Etc..)",Country,Race,HS Only
5,"Hill, Solomon",22,Pacers,F,9,"$1,246,680",79,220,0,2013,3/18/1991,Arizona,"Los Angeles, CA",California,US,Black,No
6,"Budinger, Chase",25,Timberwolves,F,10,"$5,000,000",79,218,4,2009,5/22/1988,Arizona,"Encinitas, CA",California,US,White,No
7,"Williams, Derrick",22,Timberwolves,F,7,"$5,016,960",80,241,2,2011,5/25/1991,Arizona,"La Mirada, CA",California,US,Black,No


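It's kind of like slicing a list, right? Except where with a list we could grab a single item with `some_list[0]`, this time we need to give it two numbers, the start and the end. The end isn't included, so `5:8` gives us rows 5, 6 and 7, and because counting starts at 0 those are the 6th through the 8th rows.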
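If the off-by-one business feels slippery, here's a small sketch on a stand-in dataframe. `demo` is hypothetical (just ten rows numbered 0 through 9), and `.iloc` is shown as an alternative spelling, not something the cells above used.

```python
import pandas as pd

# Hypothetical stand-in for df: ten rows numbered 0 through 9
demo = pd.DataFrame({"row_number": range(10)})

# Start at position 5, stop BEFORE position 8
sixth_to_eighth = demo[5:8]
print(sixth_to_eighth)

# .iloc does the same thing, but says out loud that we mean positions
same_rows = demo.iloc[5:8]
print(same_rows.equals(sixth_to_eighth))
```

Using `.iloc` is a bit more typing, but it makes it obvious you're asking for rows by position and not for columns by name.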

# Selecting columns

But jeez, my eyes don't want to go that far over the data. I only want to see, uh, name and age.

In [30]:
# Get the names of the columns, just because
#columns_we_want = ['Name', 'Age']
#df[columns_we_want]


In [23]:
# If we want to be "correct" we add .values on the end of it
df.columns

Index(['Name', 'Age', 'Team', 'POS', '#', '2013 $', 'Ht (In.)', 'WT', 'EXP',
       '1st Year', 'DOB', 'School', 'City',
       'State (Province, Territory, Etc..)', 'Country', 'Race', 'HS Only'],
      dtype='object')

In [None]:
# Select only name and age


In [27]:
# Combining that with .head() to see not-so-many rows
columns_we_want = ['Name', 'Age']
df[columns_we_want].head()

Unnamed: 0,Name,Age
0,"Gee, Alonzo",26
1,"Wallace, Gerald",31
2,"Williams, Mo",30
3,"Gladness, Mickell",27
4,"Jefferson, Richard",33


In [29]:
# We can also do this all in one line, even though it starts looking ugly
# (unlike the cute bears pandas looks ugly pretty often)
df[['Name', 'Age']].head()



Unnamed: 0,Name,Age
0,"Gee, Alonzo",26
1,"Wallace, Gerald",31
2,"Williams, Mo",30
3,"Gladness, Mickell",27
4,"Jefferson, Richard",33


**NOTE:** That was not `df['Name', 'Age']`, it was `df[['Name', 'Age']]`. You'll definitely type it wrong all of the time. When things break with pandas it's probably because you forgot to put in a million brackets.
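Here's a rough sketch of what the missing-brackets mistake looks like when it bites. The `people` dataframe is a hypothetical two-row sample using names and ages from the data above.

```python
import pandas as pd

# Hypothetical two-row sample shaped like the NBA data
people = pd.DataFrame({
    "Name": ["Gee, Alonzo", "Wallace, Gerald"],
    "Age": [26, 31],
    "Team": ["Cavaliers", "Celtics"],
})

# Two sets of brackets: a LIST of column names inside the selection brackets
both_columns = people[["Name", "Age"]]
print(both_columns)

# One set of brackets: pandas goes looking for ONE column called ('Name', 'Age')
try:
    people["Name", "Age"]
    raised_key_error = False
except KeyError as error:
    raised_key_error = True
    print("KeyError:", error)
```

So when you see a `KeyError` with a tuple of column names in it, count your brackets.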

# Describing your data

A powerful tool of pandas is being able to select a portion of your data, *because who ordered all that data anyway*.

In [31]:
df.head()

Unnamed: 0,Name,Age,Team,POS,#,2013 $,Ht (In.),WT,EXP,1st Year,DOB,School,City,"State (Province, Territory, Etc..)",Country,Race,HS Only
0,"Gee, Alonzo",26,Cavaliers,F,33,"$3,250,000",78,219,4,2009,5/29/1987,Alabama,"Riviera Beach, FL",Florida,US,Black,No
1,"Wallace, Gerald",31,Celtics,F,45,"$10,105,855",79,220,12,2001,7/23/1982,Alabama,"Sylacauga, AL",Alabama,US,Black,No
2,"Williams, Mo",30,Trail Blazers,G,25,"$2,652,000",73,195,10,2003,12/19/1982,Alabama,"Jackson, MS",Mississippi,US,Black,No
3,"Gladness, Mickell",27,Magic,C,40,"$762,195",83,220,2,2011,7/26/1986,Alabama A&M,"Birmingham, AL",Alabama,US,Black,No
4,"Jefferson, Richard",33,Jazz,F,44,"$11,046,000",79,230,12,2001,6/21/1980,Arizona,"Los Angeles, CA",California,US,Black,No


I want to know how **many people are in each position**. Luckily, pandas can tell me!

In [33]:
# Grab the POS column, and count the different values in it.
df['POS'].value_counts()

G      175
F      142
F/C     74
G/F     70
C       67
Name: POS, dtype: int64

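**Now that was a little weird, yes** - we used `df['POS']` instead of `df[['POS']]` when viewing the data's details. Single brackets hand you one column all by itself (a Series), which is what you want when you're asking questions about the values inside that column.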
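A quick sketch of the difference between the two bracket styles, assuming a hypothetical `positions` dataframe that holds the `POS` values from the first five rows of the data.

```python
import pandas as pd

# POS values copied from the first five rows of the NBA file
positions = pd.DataFrame({"POS": ["F", "F", "G", "C", "F"]})

one_column = positions["POS"]          # single brackets: a Series
one_column_table = positions[["POS"]]  # double brackets: a one-column DataFrame

print(type(one_column).__name__)
print(type(one_column_table).__name__)

# Counting values on the Series, like the cell above
counts = one_column.value_counts()
print(counts)
```

Both hold the same values; one is a bare column and the other is a very skinny table.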

But now I'm curious about numbers: **how old is everyone?** Maybe we could, I don't know, get some statistics about age? Some statistics to **describe** age?

In [35]:
# Same trick with the Race column, saved into a variable this time
race_counts = df['Race'].value_counts()
race_counts

Black       399
White        95
Hispanic     16
Mixed        16
Asian         1
Name: Race, dtype: int64

In [36]:
# Summary statistics for Age
df['Age'].describe()

count    528.000000
mean      26.242424
std        4.178868
min       18.000000
25%       23.000000
50%       25.000000
75%       29.000000
max       39.000000
Name: Age, dtype: float64

In [37]:
df.describe()

Unnamed: 0,Age,Ht (In.),WT,EXP,1st Year
count,528.0,528.0,528.0,528.0,528.0
mean,26.242424,79.119318,221.206439,4.772727,2008.227273
std,4.178868,3.431488,27.943169,4.325628,4.325628
min,18.0,69.0,20.0,0.0,1995.0
25%,23.0,77.0,200.0,1.0,2005.0
50%,25.0,80.0,220.0,4.0,2009.0
75%,29.0,82.0,240.0,8.0,2012.0
max,39.0,87.0,290.0,18.0,2013.0


In [39]:
# That's pretty good. Does it work for everything? How about the money?
df['2013 $'].describe()
# We get count/unique/top/freq instead of mean/std because the money is stored as a string.

count     528
unique    308
top       n/a
freq       43
Name: 2013 $, dtype: object

Unfortunately because that has dollar signs and commas it's thought of as a string. **We'll fix it in a second,** but let's try describing one more thing.
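If you want to check whether pandas thinks a column is numbers or text before you describe it, something like the sketch below works. The `sample` dataframe is a hypothetical three-row slice of the real data, and `pd.api.types.is_numeric_dtype` is just one way to ask the question.

```python
import pandas as pd

# Hypothetical three-row sample with one numeric and one string column
sample = pd.DataFrame({
    "Age": [26, 31, 30],
    "2013 $": ["$3,250,000", "$10,105,855", "$2,652,000"],
})

# dtypes shows how pandas is storing each column
print(sample.dtypes)

is_age_numeric = pd.api.types.is_numeric_dtype(sample["Age"])
is_money_numeric = pd.api.types.is_numeric_dtype(sample["2013 $"])
print(is_age_numeric, is_money_numeric)

# describe() on the string column gives counts, not averages
money_summary = sample["2013 $"].describe()
print(money_summary)
```

Anything that isn't numeric gets the count/unique/top/freq treatment from `describe()`.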

In [41]:
# Doing more describing
df['Ht (In.)'].describe()


count    528.000000
mean      79.119318
std        3.431488
min       69.000000
25%       77.000000
50%       80.000000
75%       82.000000
max       87.000000
Name: Ht (In.), dtype: float64

That's stupid, though, what's an inch even look like? What's 80 inches? I don't have a clue. If only there were some way to manipulate our data.

# Manipulating data

Oh wait there is, HA HA HA.

In [43]:
# Take another look at our inches, but only the first few
df['Ht (In.)'].head()


0    78
1    79
2    73
3    83
4    79
Name: Ht (In.), dtype: int64

In [46]:
# Divide those inches by 12
#number_of_inches = 300
#number_of_inches / 12
df['Ht (In.)'].head() / 12


0    6.500000
1    6.583333
2    6.083333
3    6.916667
4    6.583333
Name: Ht (In.), dtype: float64

In [48]:
# Let's divide ALL of them by 12
df['Ht (In.)'] / 12

0      6.500000
1      6.583333
2      6.083333
3      6.916667
4      6.583333
5      6.583333
6      6.583333
7      6.666667
8      6.833333
9      6.916667
10     6.250000
11     6.166667
12     6.250000
13     6.500000
14     6.833333
15     6.666667
16     6.750000
17     6.416667
18     6.500000
19     6.083333
20     6.083333
21     6.583333
22     6.583333
23     6.083333
24     6.750000
25     6.583333
26     6.916667
27     6.833333
28     6.250000
29     6.833333
         ...   
498    6.000000
499    6.166667
500    6.000000
501    6.916667
502    7.083333
503    6.500000
504    6.250000
505    5.750000
506    5.750000
507    6.500000
508    6.500000
509    6.500000
510    6.833333
511    6.583333
512    6.250000
513    6.666667
514    6.916667
515    6.750000
516    6.750000
517    6.583333
518    6.750000
519    6.416667
520    6.250000
521    6.416667
522    6.916667
523    6.833333
524    6.583333
525    6.250000
526    6.750000
527    6.333333
Name: Ht (In.), dtype: float64

In [51]:
# Can we get statistics on those?
height_in_feet = df['Ht (In.)'] / 12
height_in_feet.describe()

count    528.000000
mean       6.593277
std        0.285957
min        5.750000
25%        6.416667
50%        6.666667
75%        6.833333
max        7.250000
Name: Ht (In.), dtype: float64

In [52]:
# Let's look at our original data again
df.head(3)

Unnamed: 0,Name,Age,Team,POS,#,2013 $,Ht (In.),WT,EXP,1st Year,DOB,School,City,"State (Province, Territory, Etc..)",Country,Race,HS Only
0,"Gee, Alonzo",26,Cavaliers,F,33,"$3,250,000",78,219,4,2009,5/29/1987,Alabama,"Riviera Beach, FL",Florida,US,Black,No
1,"Wallace, Gerald",31,Celtics,F,45,"$10,105,855",79,220,12,2001,7/23/1982,Alabama,"Sylacauga, AL",Alabama,US,Black,No
2,"Williams, Mo",30,Trail Blazers,G,25,"$2,652,000",73,195,10,2003,12/19/1982,Alabama,"Jackson, MS",Mississippi,US,Black,No


Okay that was nice but unfortunately we can't do anything with it. It's just sitting there, separate from our data. If this were normal code we could do `blahblah['feet'] = blahblah['Ht (In.)'] / 12`, but since this is pandas, we can't. Right? **Right?**

In [55]:
# Store a new column
df['feet'] = df['Ht (In.)'] / 12
df.head()


Unnamed: 0,Name,Age,Team,POS,#,2013 $,Ht (In.),WT,EXP,1st Year,DOB,School,City,"State (Province, Territory, Etc..)",Country,Race,HS Only,feet
0,"Gee, Alonzo",26,Cavaliers,F,33,"$3,250,000",78,219,4,2009,5/29/1987,Alabama,"Riviera Beach, FL",Florida,US,Black,No,6.5
1,"Wallace, Gerald",31,Celtics,F,45,"$10,105,855",79,220,12,2001,7/23/1982,Alabama,"Sylacauga, AL",Alabama,US,Black,No,6.583333
2,"Williams, Mo",30,Trail Blazers,G,25,"$2,652,000",73,195,10,2003,12/19/1982,Alabama,"Jackson, MS",Mississippi,US,Black,No,6.083333
3,"Gladness, Mickell",27,Magic,C,40,"$762,195",83,220,2,2011,7/26/1986,Alabama A&M,"Birmingham, AL",Alabama,US,Black,No,6.916667
4,"Jefferson, Richard",33,Jazz,F,44,"$11,046,000",79,230,12,2001,6/21/1980,Arizona,"Los Angeles, CA",California,US,Black,No,6.583333


That's cool, maybe we could do the same thing with their salary? Take out the $ and the , and convert it to an integer?

In [None]:
# Can't just use .replace


In [None]:
# Need to use this weird .str thing


In [None]:
# Can't just immediately replace the , either


In [None]:
# Need to use the .str thing before EVERY string method


In [None]:
# Describe still doesn't work.


In [None]:
# Let's convert it to an integer using .astype(int) before we describe it


In [None]:
# Maybe we can just make them millions?


In [None]:
# Unfortunately one is "n/a" which is going to break our code, so we can make n/a be 0


In [None]:
# Remove the .head() piece and save it back into the dataframe
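
The cells above are left empty, so here's one possible way the steps in their comments could fit together. This is a sketch and not necessarily how the original notebook did it: `sample` is a hypothetical four-row stand-in for the real file, and the `salary` and `millions` column names are made up.

```python
import pandas as pd

# Hypothetical mini-sample shaped like the '2013 $' column (not the real file)
sample = pd.DataFrame({
    "2013 $": ["$3,250,000", "$10,105,855", "n/a", "$762,195"],
})

cleaned = (
    sample["2013 $"]
    .str.replace("$", "", regex=False)  # .str before EVERY string method
    .str.replace(",", "", regex=False)
    .replace("n/a", "0")                # plain .replace swaps whole values
    .astype(int)                        # now it's a real number
)

# Save it back into the dataframe as new columns
sample["salary"] = cleaned
sample["millions"] = sample["salary"] / 1_000_000

print(sample)
print(sample["millions"].describe())
```

Turning `n/a` into 0 drags the average down, so treat that as a shortcut to get the code running, not a statistical opinion.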


The average basketball player makes 3.8 million dollars and is a little over six and a half feet tall.

But who cares about those guys? I don't care about those guys. They're boring. I want the real rich guys!

# Sorting and sub-selecting

In [None]:
# This is just the first few guys in the dataset. Can we order it?




In [56]:
# Let's try to sort them, ascending value

df.sort_values('feet')

Unnamed: 0,Name,Age,Team,POS,#,2013 $,Ht (In.),WT,EXP,1st Year,DOB,School,City,"State (Province, Territory, Etc..)",Country,Race,HS Only,feet
506,"Robinson, Nate",29,Nuggets,G,10,"$2,016,000",69,180,8,2005,5/31/1984,Washington,"Seattle, WA",Washington,US,Black,No,5.750000
505,"Thomas, Isaiah",24,Kings,G,22,"$884,293",69,185,2,2011,2/7/1989,Washington,"Tacoma, WA",Washington,US,Black,No,5.750000
235,"Larkin, Shane",21,Mavericks,G,3,"$1,536,960",71,176,0,2013,10/2/1992,Miami (FL),"Cincinnati, OH",Ohio,US,Black,No,5.916667
362,"Lucas III, John",30,Jazz,G,5,"$1,600,000",71,165,8,2005,11/21/1982,Oklahoma State,"Washington, DC",DC,US,Black,No,5.916667
256,"Pressey, Phil",22,Celtics,G,26,"$490,180",71,175,0,2013,2/17/1991,Missouri,"Dallas, TX",Texas,US,Black,No,5.916667
336,"Lawson, Ty",25,Nuggets,G,3,"$10,786,517",71,195,4,2009,11/3/1987,North Carolina,"Clinton, MD",Maryland,US,Black,No,5.916667
388,"McConnell, Mickey",24,Mavericks,G,32,"$490,180",72,189,2,2011,4/14/1989,St. Mary's (CA),"Mesa, AZ",Arizona,US,White,No,6.000000
498,"Paul, Chris",28,Clippers,G,3,"$18,668,431",72,175,8,2005,5/6/1985,Wake Forest,"Forsyth County, NC",North Carolina,US,Black,No,6.000000
133,"Bynum, Will",30,Pistons,G,12,"$2,790,343",72,185,8,2005,1/4/1983,Georgia Tech,"Chicago, IL",Illinois,US,Black,No,6.000000
500,"Smith, Ish",25,Suns,G,30,"$951,463",72,175,3,2010,7/5/1988,Wake Forest,"Charlotte, NC",North Carolina,US,Black,No,6.000000


Those guys are the shortest in the league! If only there were a way to sort from high to low, a.k.a. descending instead of ascending.

In [57]:
# It isn't descending = True, unfortunately
df.sort_values('feet', ascending=False).head()

Unnamed: 0,Name,Age,Team,POS,#,2013 $,Ht (In.),WT,EXP,1st Year,DOB,School,City,"State (Province, Territory, Etc..)",Country,Race,HS Only,feet
54,"Thabeet, Hasheem",26,Thunder,C,34,"$1,200,000",87,263,4,2009,2/16/1987,Connecticut,Dar es Salaam,,Tanzania,Black,No,7.25
120,"Hibbert, Roy",26,Pacers,C,55,"$14,283,844",86,280,5,2008,12/11/1986,Georgetown,"New York City, NY",New York,US,Black,No,7.166667
502,"Hawes, Spencer",25,76ers,C,0,"$6,500,000",85,245,6,2007,4/28/1988,Washington,"Seattle, WA",Washington,US,White,No,7.083333
145,"Leonard, Meyers",21,Trail Blazers,C,11,"$2,222,160",85,245,1,2012,2/27/1992,Illinois,"Robinson, IIL",Illinois,US,White,No,7.083333
303,"Gasol, Marc",28,Grizzlies,C,33,"$14,860,524",85,265,5,2008,1/29/1985,,Barcelona,,Spain,Hispanic,No,7.083333


In [58]:
# We can use this to find the oldest guys in the league
df.sort_values('Age', ascending=False).head()

Unnamed: 0,Name,Age,Team,POS,#,2013 $,Ht (In.),WT,EXP,1st Year,DOB,School,City,"State (Province, Territory, Etc..)",Country,Race,HS Only,feet
392,"Nash, Steve",39,Lakers,G,10,"$9,300,500",75,178,7,2006,2/7/1974,Santa Clara,"Johannesburg, SA",,South Africa,White,No,6.25
225,"Camby, Marcus",39,Rockets,F/C,21,"$884,293",83,240,17,1996,3/22/1974,Massachusetts,"Hartford, CT",Connecticut,US,Black,No,6.916667
23,"Fisher, Derek",39,Thunder,G,6,"$884,293",73,210,17,1996,8/9/1974,Arkansas-Little Rock,"Little Rock, AR",Arkansas,US,Black,No,6.083333
63,"Allen, Ray",38,Heat,G,34,"$3,229,050",77,205,17,1996,7/20/1975,Connecticut,"Merced, CA",California,US,Black,No,6.416667
94,"James, Mike",38,Bulls,G,8,,74,188,15,1998,6/23/1975,Duquesne,"Copiague, NY",New York,US,Black,No,6.166667


In [59]:
# Or the shortest, by sorting on 'feet' and taking out 'ascending=False'
df.sort_values('feet').head()

Unnamed: 0,Name,Age,Team,POS,#,2013 $,Ht (In.),WT,EXP,1st Year,DOB,School,City,"State (Province, Territory, Etc..)",Country,Race,HS Only,feet
506,"Robinson, Nate",29,Nuggets,G,10,"$2,016,000",69,180,8,2005,5/31/1984,Washington,"Seattle, WA",Washington,US,Black,No,5.75
505,"Thomas, Isaiah",24,Kings,G,22,"$884,293",69,185,2,2011,2/7/1989,Washington,"Tacoma, WA",Washington,US,Black,No,5.75
235,"Larkin, Shane",21,Mavericks,G,3,"$1,536,960",71,176,0,2013,10/2/1992,Miami (FL),"Cincinnati, OH",Ohio,US,Black,No,5.916667
362,"Lucas III, John",30,Jazz,G,5,"$1,600,000",71,165,8,2005,11/21/1982,Oklahoma State,"Washington, DC",DC,US,Black,No,5.916667
256,"Pressey, Phil",22,Celtics,G,26,"$490,180",71,175,0,2013,2/17/1991,Missouri,"Dallas, TX",Texas,US,Black,No,5.916667


But sometimes instead of just looking at them, I want to do stuff with them. Play some games with them! Dunk on them! `describe` them! And we don't want to dunk on everyone, only the players above six and a half feet tall.

First, we need to check out **boolean things.**

In [60]:
# Get a big long list of True and False for every single row.
df['feet'] > 6.5

0      False
1       True
2      False
3       True
4       True
5       True
6       True
7       True
8       True
9       True
10     False
11     False
12     False
13     False
14      True
15      True
16      True
17     False
18     False
19     False
20     False
21      True
22      True
23     False
24      True
25      True
26      True
27      True
28     False
29      True
       ...  
498    False
499    False
500    False
501     True
502     True
503    False
504    False
505    False
506    False
507    False
508    False
509    False
510     True
511     True
512    False
513     True
514     True
515     True
516     True
517     True
518     True
519    False
520    False
521    False
522     True
523     True
524     True
525    False
526     True
527    False
Name: feet, dtype: bool

In [61]:
# We could use value counts if we wanted
above_or_below_six_five = df['feet'] > 6.5
above_or_below_six_five.value_counts()

True     317
False    211
Name: feet, dtype: int64

In [None]:
# But we can also apply this to every single row to say whether YES we want it or NO we don't


In [62]:
# Instead of putting column names inside of the brackets, we instead
# put the True/False statements. It will only return the players above
# six and a half feet tall

df[df['feet'] > 6.5]

Unnamed: 0,Name,Age,Team,POS,#,2013 $,Ht (In.),WT,EXP,1st Year,DOB,School,City,"State (Province, Territory, Etc..)",Country,Race,HS Only,feet
1,"Wallace, Gerald",31,Celtics,F,45,"$10,105,855",79,220,12,2001,7/23/1982,Alabama,"Sylacauga, AL",Alabama,US,Black,No,6.583333
3,"Gladness, Mickell",27,Magic,C,40,"$762,195",83,220,2,2011,7/26/1986,Alabama A&M,"Birmingham, AL",Alabama,US,Black,No,6.916667
4,"Jefferson, Richard",33,Jazz,F,44,"$11,046,000",79,230,12,2001,6/21/1980,Arizona,"Los Angeles, CA",California,US,Black,No,6.583333
5,"Hill, Solomon",22,Pacers,F,9,"$1,246,680",79,220,0,2013,3/18/1991,Arizona,"Los Angeles, CA",California,US,Black,No,6.583333
6,"Budinger, Chase",25,Timberwolves,F,10,"$5,000,000",79,218,4,2009,5/22/1988,Arizona,"Encinitas, CA",California,US,White,No,6.583333
7,"Williams, Derrick",22,Timberwolves,F,7,"$5,016,960",80,241,2,2011,5/25/1991,Arizona,"La Mirada, CA",California,US,Black,No,6.666667
8,"Hill, Jordan",26,Lakers,F/C,27,"$3,563,600",82,235,1,2012,7/27/1987,Arizona,"Newberry, SC",South Carolina,US,Black,No,6.833333
9,"Frye, Channing",30,Suns,F/C,8,"$6,500,000",83,245,8,2005,5/17/1983,Arizona,"White Plains, NY",New York,US,Black,No,6.916667
14,"Boateng, Eric",27,Lakers,C,12,,82,257,17,1996,11/20/1985,Arizona State,"London, ENG",,England,Black,No,6.833333
15,"Diogu, Ike",29,Knicks,F/C,50,"$792,377",80,255,8,2005,11/9/1983,Arizona State,"Buffalo, NY",New York,US,Black,No,6.666667


In [63]:
df['Race'] == 'Asian'

0      False
1      False
2      False
3      False
4      False
5      False
6      False
7      False
8      False
9      False
10     False
11     False
12     False
13     False
14     False
15     False
16     False
17     False
18     False
19     False
20     False
21     False
22     False
23     False
24     False
25     False
26     False
27     False
28     False
29     False
       ...  
498    False
499    False
500    False
501    False
502    False
503    False
504    False
505    False
506    False
507    False
508    False
509    False
510    False
511    False
512    False
513    False
514    False
515    False
516    False
517    False
518    False
519    False
520    False
521    False
522    False
523    False
524    False
525    False
526    False
527    False
Name: Race, dtype: bool

In [None]:
df[df['Race'] == 'Asian']

In [None]:
# Or only the guards
(df['POS'] == 'G').head()
# People below 6.5 feet
df['feet'] < 6.5

In [65]:
# Every condition you want to combine needs parentheses around it
# Guards that are shorter than 6.5 feet
# This is a combination of both conditions
df[(df['POS'] == 'G') & (df['feet'] < 6.5)].head()

Unnamed: 0,Name,Age,Team,POS,#,2013 $,Ht (In.),WT,EXP,1st Year,DOB,School,City,"State (Province, Territory, Etc..)",Country,Race,HS Only,feet
2,"Williams, Mo",30,Trail Blazers,G,25,"$2,652,000",73,195,10,2003,12/19/1982,Alabama,"Jackson, MS",Mississippi,US,Black,No,6.083333
10,"Bayless, Jerryd",25,Grizzlies,G,7,"$3,135,000",75,200,5,2008,8/20/1988,Arizona,"Phoenix, AZ",Arizona,US,Black,No,6.25
11,"Terry, Jason",36,Nets,G,31,"$5,625,313",74,180,14,1999,9/15/1977,Arizona,"Seattle, WA",Washington,US,Black,No,6.166667
12,"Fogg, Kyle",23,Nuggets,G,6,,75,183,0,2013,1/27/1990,Arizona,"Brea, CA",California,US,Black,No,6.25
17,"Harden, James",24,Rockets,G,13,"$13,701,250",77,220,4,2009,8/26/1989,Arizona State,"Los Angeles, CA",California,US,Black,No,6.416667


In [66]:
# We can save stuff
centers = df[df['POS'] == 'C']
guards = df[df['POS'] == 'G']

In [69]:
centers['feet'].describe()

count    67.000000
mean      6.962687
std       0.087381
min       6.750000
25%       6.916667
50%       7.000000
75%       7.000000
max       7.250000
Name: feet, dtype: float64

In [70]:
guards['feet'].describe()

count    175.000000
mean       6.263810
std        0.165729
min        5.750000
25%        6.166667
50%        6.250000
75%        6.416667
max        6.583333
Name: feet, dtype: float64

In [None]:
# It might be easier to break down the booleans into separate variables
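
Here's a sketch of what breaking the booleans into variables could look like. The four players and their heights are copied from rows shown earlier; the variable names (`is_guard`, `is_under_six_five` and friends) are made up for illustration.

```python
import pandas as pd

# Four players copied from the data: position and height in inches
players = pd.DataFrame({
    "Name": ["Williams, Mo", "Gladness, Mickell", "Harden, James", "Thabeet, Hasheem"],
    "POS": ["G", "C", "G", "C"],
    "Ht (In.)": [73, 83, 77, 87],
})
players["feet"] = players["Ht (In.)"] / 12

# Each condition gets its own name, so no parentheses pile-up
is_guard = players["POS"] == "G"
is_under_six_five = players["feet"] < 6.5

short_guards = players[is_guard & is_under_six_five]     # both must be True
guards_or_short = players[is_guard | is_under_six_five]  # either can be True
not_guards = players[~is_guard]                          # flip True and False

print(short_guards["Name"].tolist())
print(not_guards["Name"].tolist())
```

Named conditions read almost like a sentence, and you can reuse them without retyping the comparison.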


In [None]:
# We can save this stuff


In [None]:
# Maybe we can compare them to taller players?


# Drawing pictures

Okay okay enough code and enough stupid numbers. I'm visual. I want graphics. **Okay?????** Okay.

In [None]:
# This will scream we don't have matplotlib.


`matplotlib` is a graphing library. It's the Python way to make graphs!

In [None]:
# this will open up a weird window that won't do anything


In [None]:
# So instead you run this code


But that's ugly. There's a thing called ``ggplot`` for R that looks nice. We want to look nice. We want to look like ``ggplot``.
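
A rough sketch of switching styles and drawing a histogram, under a few assumptions: the `Agg` backend is picked here only so the code runs without opening a window (in a notebook, `%matplotlib inline` is the more usual choice), and `heights` is a hypothetical ten-row sample using the heights from the first ten rows of the data.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: draw off-screen so no weird window pops up
import matplotlib.pyplot as plt
import pandas as pd

# Heights in inches copied from the first ten rows of the NBA file
heights = pd.DataFrame({"Ht (In.)": [78, 79, 73, 83, 79, 79, 79, 80, 82, 83]})

# What styles are available? ggplot should be in the list
print(plt.style.available)
plt.style.use("ggplot")

# pandas hands the drawing off to matplotlib and gives back the axes
ax = heights["Ht (In.)"].hist(bins=5)
ax.set_xlabel("Height (inches)")
ax.set_ylabel("Number of players")
```

Swap `"ggplot"` for any other name from `plt.style.available` to try a different look.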

In [None]:
# Import matplotlib
# What's available?


In [None]:
# Use ggplot


In [None]:
# Make a histogram


In [None]:
# Try some other styles


That might look better with a little more customization. So let's customize it.

In [None]:
# Pass in all sorts of stuff!
# Most from http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.hist.html
# .range() is a matplotlib thing


I want more graphics! **Do tall people make more money?!?!**

In [None]:
# How does experience relate with the amount of money they're making?


In [None]:
# At least we can assume height and weight are related


In [None]:
# At least we can assume height and weight are related
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html
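
One way to eyeball the height and weight hunch is a scatter plot, sketched below. The `body` dataframe is a hypothetical ten-row sample (heights and weights copied from the first ten rows shown earlier), and the `Agg` backend is only an assumption so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: draw off-screen, no window needed
import pandas as pd

# Height (inches) and weight copied from the first ten rows of the NBA file
body = pd.DataFrame({
    "Ht (In.)": [78, 79, 73, 83, 79, 79, 79, 80, 82, 83],
    "WT": [219, 220, 195, 220, 230, 220, 218, 241, 235, 245],
})

# One dot per player: height across the bottom, weight up the side
ax = body.plot(kind="scatter", x="Ht (In.)", y="WT")

# A single number for "do these move together?"
correlation = body["Ht (In.)"].corr(body["WT"])
print(correlation)
```

Ten rows is far too few to conclude anything, but the same two lines work on the full dataframe.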


In [None]:
# We can also use plt separately
# It's SIMILAR but TOTALLY DIFFERENT
