# Objectives
- Use pandas to create your own data and work with data that already exists
- Use pandas to select specific values of a DataFrame or Series

In [38]:
import pandas as pd

# Creating, Reading and Writing
Of these three tasks, reading is the one you will use most. Most of the time we will read in a .csv file (either one we have downloaded or one available at a URL). Below, read in `Batting.csv` from the `Data` folder (`Data/Batting.csv`). Set the index column (`index_col`) to `playerID`, then display and inspect the first 5 rows of data.

In [39]:
mlb = pd.read_csv('Data/Batting.csv', index_col = 'playerID')
mlb.head()

playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,...,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
abercda01,1871,1,TRO,,1,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,,,,,
addybo01,1871,1,RC1,,25,118.0,30.0,32.0,6.0,0.0,...,13.0,8.0,1.0,4.0,0.0,,,,,
allisar01,1871,1,CL1,,29,137.0,28.0,40.0,4.0,5.0,...,19.0,3.0,1.0,2.0,5.0,,,,,
allisdo01,1871,1,WS3,,27,133.0,28.0,44.0,10.0,2.0,...,27.0,1.0,1.0,0.0,2.0,,,,,
ansonca01,1871,1,RC1,,25,120.0,29.0,39.0,11.0,3.0,...,16.0,6.0,2.0,2.0,1.0,,,,,
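
This section's title also mentions creating and writing, so here is a minimal sketch of the full round trip. The `toy` DataFrame and its player IDs are made up for illustration (they are not rows from `Batting.csv`), and `io.StringIO` stands in for a real file path or URL.

```python
import io

import pandas as pd

# Hypothetical toy data; these rows are invented and are not from Batting.csv.
toy = pd.DataFrame(
    {
        "playerID": ["aaa01", "bbb01", "ccc01"],
        "yearID": [2014, 2015, 2015],
        "HR": [3, 41, 12],
    }
)

# Writing: with no path given, to_csv returns the CSV text as a string.
csv_text = toy.to_csv(index=False)

# Reading: StringIO plays the role of 'Data/Batting.csv' here.
toy_back = pd.read_csv(io.StringIO(csv_text), index_col="playerID")
print(toy_back.head())
```

With a real file you would pass a path to both `to_csv` and `read_csv` instead of working with a string.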


# Indexing, Selecting and Assigning
Let's take a look at all of the column headers in our DataFrame `mlb` by using the attribute `columns`. 

In [40]:
mlb.columns

Index(['yearID', 'stint', 'teamID', 'lgID', 'G', 'AB', 'R', 'H', '2B', '3B',
       'HR', 'RBI', 'SB', 'CS', 'BB', 'SO', 'IBB', 'HBP', 'SH', 'SF', 'GIDP'],
      dtype='object')

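Now, let's look at the `HR` column in our DataFrame `mlb`. Attribute access (`mlb.HR`) and bracket access (`mlb['HR']`) return the same Series; the cell below runs both, but only the result of the last expression is displayed.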

In [41]:
mlb.HR
mlb['HR']

playerID
abercda01     0.0
addybo01      0.0
allisar01     0.0
allisdo01     2.0
ansonca01     0.0
             ... 
zitoba01      0.0
zobribe01     6.0
zobribe01     7.0
zuninmi01    11.0
zychto01      0.0
Name: HR, Length: 101332, dtype: float64

What type of object is `mlb.HR`?

In [42]:
type(mlb.HR)

pandas.core.series.Series
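
As a small sketch of the two ways to pull out a column, the example below uses a made-up `toy` DataFrame (hypothetical player IDs and values). It also shows one case where only brackets work: a column name such as `2B` is not a valid Python identifier.

```python
import pandas as pd

# Hypothetical toy data, indexed by playerID like mlb.
toy = pd.DataFrame(
    {"HR": [0, 2, 11], "2B": [6, 10, 4]},
    index=pd.Index(["aaa01", "bbb01", "ccc01"], name="playerID"),
)

hr_attr = toy.HR          # attribute access
hr_brackets = toy["HR"]   # bracket access, same Series
doubles = toy["2B"]       # brackets only: toy.2B would be a syntax error

print(hr_attr.equals(hr_brackets))
print(type(doubles))
```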

### Indexing
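Let's start with some basic indexing, just like we can do with list and tuple objects. Find the very first person's home runs in the `HR` column (the value in position 0). Because the index of `mlb` holds `playerID` strings rather than integers, plain brackets (`mlb.HR[0]`) rely on a positional fallback that recent pandas versions deprecate, so we ask for position 0 explicitly with `.iloc[0]`.

In [43]:
mlb.HR.iloc[0]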

0.0

Poor guy had 0 home runs! Is this for his entire career? Or just one year? Let's look at all the information for this row in our DataFrame `mlb` using index-based selection (`.iloc[]`).

In [44]:
mlb.iloc[0]

yearID    1871
stint        1
teamID     TRO
lgID       NaN
G            1
AB           4
R            0
H            0
2B           0
3B           0
HR           0
RBI          0
SB           0
CS           0
BB           0
SO           0
IBB        NaN
HBP        NaN
SH         NaN
SF         NaN
GIDP       NaN
Name: abercda01, dtype: object

Using `iloc`, display all of the rows in the `HR` column (the column at position 10, counting from 0).

In [45]:
mlb.iloc[:,10]

playerID
abercda01     0.0
addybo01      0.0
allisar01     0.0
allisdo01     2.0
ansonca01     0.0
             ... 
zitoba01      0.0
zobribe01     6.0
zobribe01     7.0
zuninmi01    11.0
zychto01      0.0
Name: HR, Length: 101332, dtype: float64

All of the techniques we learned about list slicing apply here! The cell below runs three slices, but a notebook displays only the result of the last expression.

In [46]:
mlb.iloc[-5:]
mlb.iloc[:5]
mlb.iloc[100:105, 7:11]

playerID,H,2B,3B,HR
stearbi01,0.0,0.0,0.0,0.0
stirega01,30.0,4.0,6.0,2.0
suttoez01,45.0,3.0,7.0,3.0
sweasch01,4.0,1.0,0.0,0.0
treacfr01,42.0,7.0,5.0,4.0
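
Here is a rough sketch of the same `iloc` patterns on a tiny, made-up DataFrame (`toy` and its values are hypothetical) so each result is small enough to check by eye.

```python
import pandas as pd

# Hypothetical toy data, indexed by playerID like mlb.
toy = pd.DataFrame(
    {"H": [0, 30, 45, 4], "2B": [0, 4, 3, 1], "HR": [0, 2, 3, 0]},
    index=pd.Index(["aaa01", "bbb01", "ccc01", "ddd01"], name="playerID"),
)

first_row = toy.iloc[0]      # a single row comes back as a Series
last_two = toy.iloc[-2:]     # negative positions count from the end
block = toy.iloc[1:3, 0:2]   # rows 1-2 and columns 0-1; the stop is excluded

print(block)
```

As with lists, the stop position of a slice is never included, so `1:3` means positions 1 and 2.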


`iloc` is really easy to use and understand because you can think about your position in the DataFrame! The other way to select things, label-based selection (`.loc[]`), depends on the `index` of the DataFrame. For `mlb` our index column is `playerID`, so we **CANNOT** use row numbers (0 - 101331) for the rows. Instead, we use the index column value. For example, let's say we wanted hits, doubles, triples, and home runs for player `stearbi01`. We would use the index label `stearbi01` to do the work for us! Because this player appears in several rows, we get one row back for each of them.

In [47]:
mlb.loc['stearbi01', ['H', '2B', '3B', 'HR']]

playerID,H,2B,3B,HR
stearbi01,0.0,0.0,0.0,0.0
stearbi01,12.0,1.0,0.0,0.0
stearbi01,24.0,0.0,0.0,0.0
stearbi01,21.0,1.0,0.0,0.0
stearbi01,20.0,0.0,0.0,0.0
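
The sketch below, on made-up data (the `toy` DataFrame and its IDs are hypothetical), shows two things about `loc` that are easy to miss: a label that appears more than once returns several rows, and label slices include the end label, unlike `iloc`.

```python
import pandas as pd

# Hypothetical toy data; "aaa01" appears twice, like a player with two seasons.
toy = pd.DataFrame(
    {
        "yearID": [1871, 1872, 1871],
        "H": [0, 12, 32],
        "2B": [0, 1, 6],
        "HR": [0, 0, 0],
    },
    index=pd.Index(["aaa01", "aaa01", "bbb01"], name="playerID"),
)

repeated = toy.loc["aaa01", ["H", "HR"]]   # label appears twice -> DataFrame
single = toy.loc["bbb01", ["H", "HR"]]     # label appears once -> Series
inclusive = toy.loc[:, "H":"HR"]           # label slice keeps the end label

print(repeated)
print(list(inclusive.columns))
```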


### Resetting the Index
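Be careful when you use `set_index` to set a new index column: it replaces your previous index, and the old values are lost. Calling `reset_index` first moves `playerID` back into a regular column so it survives. Note that `set_index` returns a new DataFrame, so because the result below is not assigned, `mlb` keeps the default integer index from `reset_index`.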

In [55]:
mlb = mlb.reset_index()
mlb.set_index('H')

H,playerID,yearID,stint,teamID,lgID,G,AB,R,2B,3B,...,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
0.0,abercda01,1871,1,TRO,,1,4.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,,,,,
32.0,addybo01,1871,1,RC1,,25,118.0,30.0,6.0,0.0,...,13.0,8.0,1.0,4.0,0.0,,,,,
40.0,allisar01,1871,1,CL1,,29,137.0,28.0,4.0,5.0,...,19.0,3.0,1.0,2.0,5.0,,,,,
44.0,allisdo01,1871,1,WS3,,27,133.0,28.0,10.0,2.0,...,27.0,1.0,1.0,0.0,2.0,,,,,
39.0,ansonca01,1871,1,RC1,,25,120.0,29.0,11.0,3.0,...,16.0,6.0,2.0,2.0,1.0,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
0.0,zitoba01,2015,1,OAK,AL,3,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
63.0,zobribe01,2015,1,OAK,AL,67,235.0,39.0,20.0,2.0,...,33.0,1.0,1.0,33.0,26.0,2.0,0.0,0.0,3.0,5.0
66.0,zobribe01,2015,2,KCA,AL,59,232.0,37.0,16.0,1.0,...,23.0,2.0,3.0,29.0,30.0,1.0,1.0,0.0,2.0,3.0
61.0,zuninmi01,2015,1,SEA,AL,112,350.0,28.0,11.0,0.0,...,28.0,0.0,1.0,21.0,132.0,0.0,5.0,8.0,2.0,6.0
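
To make the warning concrete, here is a small sketch on made-up data (the `toy` DataFrame is hypothetical) comparing `set_index` with and without a `reset_index` first.

```python
import pandas as pd

# Hypothetical toy data, indexed by playerID like mlb was at first.
toy = pd.DataFrame(
    {"H": [0, 32, 40], "HR": [0, 0, 2]},
    index=pd.Index(["aaa01", "bbb01", "ccc01"], name="playerID"),
)

# Without reset_index, set_index replaces playerID and the IDs are gone.
lost = toy.set_index("H")

# reset_index first turns playerID into a regular column, so it is kept.
kept = toy.reset_index().set_index("H")

print(list(lost.columns))
print(list(kept.columns))
print(toy.index.name)   # toy itself is unchanged; both calls return new objects
```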


### Selection
Think of selection as filtering an Excel workbook! We are going to only use/show all of the values where some condition is `True`. Maybe we want to show all of the players in a certain year. Maybe we want to show all of the players with more than 40 home runs. We can do that with selection!

First we need to write a conditional statement for the portion of the data we are trying to filter. In the examples above we were talking about the `yearID` column and the `HR` column.

In [65]:
mlb.yearID == 2015

0         False
1         False
2         False
3         False
4         False
          ...  
101327     True
101328     True
101329     True
101330     True
101331     True
Name: yearID, Length: 101332, dtype: bool

Now we can pass that Boolean Series to `loc` to subset (select) our DataFrame. Only the rows where the condition is `True` are kept.

In [66]:
mlb.loc[mlb.yearID == 2015]

,playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,...,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
99846,aardsda01,2015,1,ATL,NL,33,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
99847,abadfe01,2015,1,OAK,AL,62,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
99848,abreujo02,2015,1,CHA,AL,154,613.0,88.0,178.0,34.0,...,101.0,0.0,0.0,39.0,140.0,11.0,15.0,0.0,1.0,16.0
99849,achteaj01,2015,1,MIN,AL,11,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
99850,ackledu01,2015,1,SEA,AL,85,186.0,22.0,40.0,8.0,...,19.0,2.0,2.0,14.0,38.0,0.0,1.0,3.0,3.0,3.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
101327,zitoba01,2015,1,OAK,AL,3,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
101328,zobribe01,2015,1,OAK,AL,67,235.0,39.0,63.0,20.0,...,33.0,1.0,1.0,33.0,26.0,2.0,0.0,0.0,3.0,5.0
101329,zobribe01,2015,2,KCA,AL,59,232.0,37.0,66.0,16.0,...,23.0,2.0,3.0,29.0,30.0,1.0,1.0,0.0,2.0,3.0
101330,zuninmi01,2015,1,SEA,AL,112,350.0,28.0,61.0,11.0,...,28.0,0.0,1.0,21.0,132.0,0.0,5.0,8.0,2.0,6.0
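
The two steps above (build a Boolean Series, then pass it to `loc`) can be sketched on a tiny made-up DataFrame. The name `mask` and the `toy` data are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data with the default integer index.
toy = pd.DataFrame(
    {
        "playerID": ["aaa01", "bbb01", "ccc01"],
        "yearID": [2014, 2015, 2015],
        "HR": [3, 41, 12],
    }
)

mask = toy.yearID == 2015   # one True/False per row
subset = toy.loc[mask]      # keeps only the rows where mask is True

print(mask.tolist())
print(mask.sum())           # True counts as 1, so this counts matching rows
print(list(subset.index))   # original row labels are preserved
```

Notice that the subset keeps the original row labels, which is why the output above starts at 99846 rather than 0.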


Try the other one on your own! We want only the players where the number of home runs (`HR`) is greater than 40.

In [69]:
mlb.loc[mlb.HR > 40]

,playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,...,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
18517,ruthba01,1920,1,NYA,AL,142,457.0,158.0,172.0,36.0,...,137.0,14.0,14.0,150.0,80.0,,3.0,5.0,,
19037,ruthba01,1921,1,NYA,AL,152,540.0,177.0,204.0,44.0,...,171.0,17.0,13.0,145.0,81.0,,4.0,4.0,,
19379,hornsro01,1922,1,SLN,NL,154,623.0,141.0,250.0,46.0,...,152.0,17.0,12.0,65.0,50.0,,1.0,15.0,,
20094,ruthba01,1923,1,NYA,AL,152,522.0,151.0,205.0,45.0,...,131.0,17.0,21.0,170.0,93.0,,4.0,3.0,,
20194,willicy01,1923,1,PHI,NL,136,535.0,98.0,157.0,22.0,...,114.0,11.0,10.0,59.0,57.0,,7.0,3.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
100151,cruzne02,2015,1,SEA,AL,152,590.0,90.0,178.0,22.0,...,93.0,3.0,2.0,59.0,164.0,9.0,5.0,0.0,1.0,6.0
100165,davisch02,2015,1,BAL,AL,160,573.0,100.0,150.0,31.0,...,117.0,2.0,3.0,84.0,208.0,6.0,8.0,0.0,5.0,6.0
100209,donaljo02,2015,1,TOR,AL,158,620.0,122.0,184.0,41.0,...,123.0,6.0,0.0,73.0,133.0,0.0,6.0,2.0,10.0,16.0
100421,harpebr03,2015,1,WAS,NL,153,521.0,118.0,172.0,38.0,...,99.0,6.0,4.0,124.0,131.0,15.0,5.0,0.0,4.0,15.0


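What if we wanted only the rows that meet both conditions (2015 season and more than 40 home runs)? Wrap each condition in parentheses and join them with `&`, the element-wise "and".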

In [71]:
mlb.loc[(mlb.yearID == 2015) & (mlb.HR > 40)]

,playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,...,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
99891,arenano01,2015,1,COL,NL,157,616.0,97.0,177.0,43.0,...,130.0,2.0,5.0,34.0,110.0,13.0,4.0,0.0,11.0,17.0
100151,cruzne02,2015,1,SEA,AL,152,590.0,90.0,178.0,22.0,...,93.0,3.0,2.0,59.0,164.0,9.0,5.0,0.0,1.0,6.0
100165,davisch02,2015,1,BAL,AL,160,573.0,100.0,150.0,31.0,...,117.0,2.0,3.0,84.0,208.0,6.0,8.0,0.0,5.0,6.0
100209,donaljo02,2015,1,TOR,AL,158,620.0,122.0,184.0,41.0,...,123.0,6.0,0.0,73.0,133.0,0.0,6.0,2.0,10.0,16.0
100421,harpebr03,2015,1,WAS,NL,153,521.0,118.0,172.0,38.0,...,99.0,6.0,4.0,124.0,131.0,15.0,5.0,0.0,4.0,15.0
101210,troutmi01,2015,1,LAA,AL,159,575.0,104.0,172.0,32.0,...,90.0,11.0,7.0,92.0,158.0,14.0,10.0,0.0,5.0,11.0
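
A common stumbling block is reaching for Python's `and` instead of `&`. The sketch below uses made-up data (the `toy` DataFrame and the variable names are hypothetical) to show that `&` combines the conditions row by row, while `and` raises a `ValueError` because it asks each Series for a single True/False.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame(
    {
        "playerID": ["aaa01", "bbb01", "ccc01", "ddd01"],
        "yearID": [2014, 2015, 2015, 2014],
        "HR": [45, 41, 12, 7],
    }
)

in_2015 = toy.yearID == 2015
big_hr = toy.HR > 40

# & compares the two Boolean Series element by element.
both = toy.loc[in_2015 & big_hr]

# `and` needs one truth value per operand, which a Series cannot provide.
try:
    toy.loc[in_2015 and big_hr]
    and_failed = False
except ValueError:
    and_failed = True

print(both.playerID.tolist())
print(and_failed)
```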


**CHALLENGE** Find all of the players in the 2015 season who had either more than 40 home runs or more than 100 RBIs (runs batted in). Use the pipe character `|` for "or", and group the two batting conditions in their own parentheses so that the year condition applies to both.

In [72]:
mlb.loc[(mlb.yearID == 2015) & ((mlb.HR > 40) | (mlb.RBI > 100))]

,playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,...,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
99848,abreujo02,2015,1,CHA,AL,154,613.0,88.0,178.0,34.0,...,101.0,0.0,0.0,39.0,140.0,11.0,15.0,0.0,1.0,16.0
99891,arenano01,2015,1,COL,NL,157,616.0,97.0,177.0,43.0,...,130.0,2.0,5.0,34.0,110.0,13.0,4.0,0.0,11.0,17.0
99935,bautijo02,2015,1,TOR,AL,153,543.0,108.0,136.0,29.0,...,114.0,8.0,2.0,110.0,106.0,2.0,5.0,0.0,8.0,19.0
100151,cruzne02,2015,1,SEA,AL,152,590.0,90.0,178.0,22.0,...,93.0,3.0,2.0,59.0,164.0,9.0,5.0,0.0,1.0,6.0
100165,davisch02,2015,1,BAL,AL,160,573.0,100.0,150.0,31.0,...,117.0,2.0,3.0,84.0,208.0,6.0,8.0,0.0,5.0,6.0
100209,donaljo02,2015,1,TOR,AL,158,620.0,122.0,184.0,41.0,...,123.0,6.0,0.0,73.0,133.0,0.0,6.0,2.0,10.0,16.0
100244,encared01,2015,1,TOR,AL,146,528.0,94.0,146.0,31.0,...,111.0,3.0,2.0,77.0,98.0,5.0,9.0,0.0,10.0,14.0
100344,goldspa01,2015,1,ARI,NL,159,567.0,103.0,182.0,38.0,...,110.0,21.0,5.0,118.0,151.0,29.0,2.0,0.0,7.0,16.0
100421,harpebr03,2015,1,WAS,NL,153,521.0,118.0,172.0,38.0,...,99.0,6.0,4.0,124.0,131.0,15.0,5.0,0.0,4.0,15.0
100678,martijd02,2015,1,DET,AL,158,596.0,93.0,168.0,33.0,...,102.0,3.0,2.0,53.0,178.0,7.0,5.0,0.0,3.0,11.0
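
The extra parentheses around the `|` part matter because `&` binds more tightly than `|` in Python. Here is a small sketch on made-up data (the `toy` DataFrame and variable names are hypothetical) showing how the result changes when the grouping is left out.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame(
    {
        "playerID": ["aaa01", "bbb01", "ccc01", "ddd01", "eee01"],
        "yearID": [2014, 2015, 2015, 2015, 2014],
        "HR": [45, 41, 12, 20, 10],
        "RBI": [120, 99, 110, 60, 130],
    }
)

in_2015 = toy.yearID == 2015
big_hr = toy.HR > 40
big_rbi = toy.RBI > 100

# Grouped: 2015 rows that have either big HR or big RBI totals.
grouped = toy.loc[in_2015 & (big_hr | big_rbi)]

# Ungrouped: read as (in_2015 & big_hr) | big_rbi, so any year can slip in.
ungrouped = toy.loc[in_2015 & big_hr | big_rbi]

print(grouped.playerID.tolist())
print(ungrouped.playerID.tolist())
```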


### Pandas Conditional Selectors
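Pandas has a few VERY USEFUL built-in conditional selectors. `isin`, `isnull` and `notnull` are some of the most widely used: `isin` tests whether each value appears in a list you provide, while `isnull` and `notnull` test for missing and non-missing values.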

In [84]:
mlb.loc[mlb.teamID.isin(['SDN', 'COL'])]

,playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,...,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
47263,arciajo01,1969,1,SDN,NL,120,302.0,35.0,65.0,11.0,...,10.0,14.0,7.0,14.0,47.0,0.0,2.0,3.0,0.0,7.0
47264,arlinst01,1969,1,SDN,NL,4,2.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
47274,baldsja01,1969,1,SDN,NL,61,4.0,0.0,1.0,0.0,...,0.0,0.0,0.0,1.0,3.0,0.0,0.0,0.0,0.0,0.0
47344,brownol02,1969,1,SDN,NL,151,568.0,76.0,150.0,18.0,...,61.0,10.0,6.0,44.0,97.0,3.0,3.0,4.0,2.0,12.0
47368,cannich01,1969,1,SDN,NL,134,418.0,23.0,92.0,14.0,...,33.0,0.0,1.0,42.0,81.0,8.0,0.0,7.0,2.0,16.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
101223,uptonju01,2015,1,SDN,NL,150,542.0,85.0,136.0,26.0,...,81.0,19.0,5.0,68.0,159.0,5.0,4.0,0.0,5.0,10.0
101243,venabwi01,2015,1,SDN,NL,98,283.0,34.0,73.0,10.0,...,30.0,11.0,1.0,25.0,73.0,1.0,0.0,0.0,0.0,8.0
101256,vinceni01,2015,1,SDN,NL,26,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
101272,wallabr01,2015,1,SDN,NL,64,96.0,14.0,29.0,6.0,...,16.0,0.0,0.0,10.0,31.0,1.0,1.0,0.0,0.0,1.0


**SIDE NOTE**: In order to find the `teamID` for the Padres and Rockies above, I had to figure out what my options were. First, I found all of the rows where year is 2015. Then I found the `teamID` column. Then I used the `unique()` method to find all of the unique team names.

In [85]:
mlb.loc[mlb.yearID == 2015].teamID.unique()

array(['ATL', 'OAK', 'CHA', 'MIN', 'SEA', 'NYA', 'COL', 'CLE', 'SLN',
       'CIN', 'SFN', 'ARI', 'TOR', 'TEX', 'DET', 'CHN', 'KCA', 'SDN',
       'PHI', 'HOU', 'NYN', 'BAL', 'MIA', 'LAA', 'PIT', 'LAN', 'TBA',
       'BOS', 'MIL', 'WAS'], dtype=object)
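
As a rough sketch of the `unique()` and `isin` workflow described above, the example below uses a made-up DataFrame (the `toy` data is hypothetical). It also shows `~`, which flips a Boolean Series and is a handy way to get everything that is not in the list.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame(
    {
        "playerID": ["aaa01", "bbb01", "ccc01", "ddd01"],
        "yearID": [2014, 2015, 2015, 2015],
        "teamID": ["SDN", "COL", "ATL", "SDN"],
    }
)

# Step 1: see which team codes exist in 2015 (order of first appearance).
teams_2015 = toy.loc[toy.yearID == 2015].teamID.unique()

# Step 2: keep rows whose teamID is in the list we care about.
wanted = toy.loc[toy.teamID.isin(["SDN", "COL"])]

# ~ negates the condition: rows whose teamID is NOT in the list.
others = toy.loc[~toy.teamID.isin(["SDN", "COL"])]

print(list(teams_2015))
print(wanted.playerID.tolist())
print(others.playerID.tolist())
```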

In [88]:
mlb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101332 entries, 0 to 101331
Data columns (total 22 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   playerID  101332 non-null  object 
 1   yearID    101332 non-null  int64  
 2   stint     101332 non-null  int64  
 3   teamID    101332 non-null  object 
 4   lgID      100595 non-null  object 
 5   G         101332 non-null  int64  
 6   AB        96183 non-null   float64
 7   R         96183 non-null   float64
 8   H         96183 non-null   float64
 9   2B        96183 non-null   float64
 10  3B        96183 non-null   float64
 11  HR        96183 non-null   float64
 12  RBI       95759 non-null   float64
 13  SB        94883 non-null   float64
 14  CS        72729 non-null   float64
 15  BB        96183 non-null   float64
 16  SO        88345 non-null   float64
 17  IBB       59620 non-null   float64
 18  HBP       93373 non-null   float64
 19  SH        89845 non-null   float64
 20  SF  

In [87]:
mlb.loc[mlb.AB.isnull()]

,playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,...,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP
50858,abbotgl01,1973,1,OAK,AL,5,,,,,...,,,,,,,,,,
50864,alburvi01,1973,1,MIN,AL,14,,,,,...,,,,,,,,,,
50865,alexado01,1973,1,BAL,AL,29,,,,,...,,,,,,,,,,
50871,allenll01,1973,1,CAL,AL,5,,,,,...,,,,,,,,,,
50872,allenll01,1973,2,TEX,AL,23,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
79190,wengedo01,1999,1,KCA,AL,11,,,,,...,,,,,,,,,,
79192,wheelda01,1999,1,TBA,AL,6,,,,,...,,,,,,,,,,
79213,willito02,1999,1,SEA,AL,13,,,,,...,,,,,,,,,,
79226,wolcobo01,1999,1,BOS,AL,4,,,,,...,,,,,,,,,,
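
Here is a minimal sketch of `isnull` and `notnull` side by side on made-up data (the `toy` DataFrame is hypothetical). It also connects back to the non-null counts that `info()` reports: the number of rows minus the number of missing values.

```python
import pandas as pd

# Hypothetical toy data; None becomes NaN in a numeric column.
toy = pd.DataFrame(
    {
        "playerID": ["aaa01", "bbb01", "ccc01"],
        "AB": [4.0, None, 118.0],
    }
)

missing = toy.loc[toy.AB.isnull()]     # rows where AB is missing
present = toy.loc[toy.AB.notnull()]    # rows where AB has a value

n_missing = int(toy.AB.isnull().sum())
n_present = int(toy.AB.count())        # count() skips missing values

print(missing.playerID.tolist())
print(n_missing, n_present)
```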


**CHALLENGE QUESTION**: 
- What type of object is returned after all our indexing, slicing, and selecting? (assuming multiple columns)
- What type of object is returned after all our indexing, slicing, and selecting? (assuming a single column)