In [2]:
import requests

download_url = "https://raw.githubusercontent.com/fivethirtyeight/data/master/nba-elo/nbaallelo.csv"
target_csv_path = "nba_all_elo.csv"

response = requests.get(download_url)
response.raise_for_status()    # Check that the request was successful
with open(target_csv_path, "wb") as f:
    f.write(response.content)
print("Download ready.")

Download ready.


When you execute the script, it will save the file nba_all_elo.csv in your current working directory.

Now you can use the pandas Python library to take a look at your data:



In [3]:
import pandas as pd
nba = pd.read_csv("nba_all_elo.csv")
type(nba)

pandas.core.frame.DataFrame

In [4]:
len(nba)

126314

In [5]:
nba.shape

(126314, 23)

In [6]:
nba.head()

Unnamed: 0,gameorder,game_id,lg_id,_iscopy,year_id,date_game,seasongame,is_playoffs,team_id,fran_id,...,win_equiv,opp_id,opp_fran,opp_pts,opp_elo_i,opp_elo_n,game_location,game_result,forecast,notes
0,1,194611010TRH,NBA,0,1947,11/1/1946,1,0,TRH,Huskies,...,40.29483,NYK,Knicks,68,1300.0,1306.7233,H,L,0.640065,
1,1,194611010TRH,NBA,1,1947,11/1/1946,1,0,NYK,Knicks,...,41.70517,TRH,Huskies,66,1300.0,1293.2767,A,W,0.359935,
2,2,194611020CHS,NBA,0,1947,11/2/1946,1,0,CHS,Stags,...,42.012257,NYK,Knicks,47,1306.7233,1297.0712,H,W,0.631101,
3,2,194611020CHS,NBA,1,1947,11/2/1946,2,0,NYK,Knicks,...,40.692783,CHS,Stags,63,1300.0,1309.6521,A,L,0.368899,
4,3,194611020DTF,NBA,0,1947,11/2/1946,1,0,DTF,Falcons,...,38.864048,WSC,Capitols,50,1300.0,1320.3811,H,L,0.640065,


Unless your screen is quite large, your output probably won’t display all 23 columns. Somewhere in the middle, you’ll see a column of ellipses (...) indicating the missing data. If you’re working in a terminal, then that’s probably more readable than wrapping long rows. However, Jupyter notebooks will allow you to scroll. You can configure pandas to display all 23 columns like this:



In [7]:
pd.set_option("display.max_columns", None)

While it’s practical to see all the columns, you probably won’t need six decimal places! Change it to two:

In [8]:
pd.set_option("display.precision", 2)

To verify that you’ve changed the options successfully, you can execute .head() again, or you can display the last five rows with .tail() instead:



In [9]:
nba.tail()

Unnamed: 0,gameorder,game_id,lg_id,_iscopy,year_id,date_game,seasongame,is_playoffs,team_id,fran_id,pts,elo_i,elo_n,win_equiv,opp_id,opp_fran,opp_pts,opp_elo_i,opp_elo_n,game_location,game_result,forecast,notes
126309,63155,201506110CLE,NBA,0,2015,6/11/2015,100,1,CLE,Cavaliers,82,1723.41,1704.39,60.31,GSW,Warriors,103,1790.96,1809.98,H,L,0.55,
126310,63156,201506140GSW,NBA,0,2015,6/14/2015,102,1,GSW,Warriors,104,1809.98,1813.63,68.01,CLE,Cavaliers,91,1704.39,1700.74,H,W,0.77,
126311,63156,201506140GSW,NBA,1,2015,6/14/2015,101,1,CLE,Cavaliers,91,1704.39,1700.74,60.01,GSW,Warriors,104,1809.98,1813.63,A,L,0.23,
126312,63157,201506170CLE,NBA,0,2015,6/16/2015,102,1,CLE,Cavaliers,97,1700.74,1692.09,59.29,GSW,Warriors,105,1813.63,1822.29,H,L,0.48,
126313,63157,201506170CLE,NBA,1,2015,6/16/2015,103,1,GSW,Warriors,105,1813.63,1822.29,68.52,CLE,Cavaliers,97,1700.74,1692.09,A,W,0.52,


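Now you should see all the columns, and your data should show two decimal places. Both .head() and .tail() also accept the number of rows to display, so you can show just the last three rows: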

In [10]:
nba.tail(3)

Unnamed: 0,gameorder,game_id,lg_id,_iscopy,year_id,date_game,seasongame,is_playoffs,team_id,fran_id,pts,elo_i,elo_n,win_equiv,opp_id,opp_fran,opp_pts,opp_elo_i,opp_elo_n,game_location,game_result,forecast,notes
126311,63156,201506140GSW,NBA,1,2015,6/14/2015,101,1,CLE,Cavaliers,91,1704.39,1700.74,60.01,GSW,Warriors,104,1809.98,1813.63,A,L,0.23,
126312,63157,201506170CLE,NBA,0,2015,6/16/2015,102,1,CLE,Cavaliers,97,1700.74,1692.09,59.29,GSW,Warriors,105,1813.63,1822.29,H,L,0.48,
126313,63157,201506170CLE,NBA,1,2015,6/16/2015,103,1,GSW,Warriors,105,1813.63,1822.29,68.52,CLE,Cavaliers,97,1700.74,1692.09,A,W,0.52,
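
The pd.set_option() calls above change the display settings for the whole session. If you only want them for a single output, a context manager is one alternative. The following is a minimal sketch, assuming a small made-up DataFrame named sample rather than the NBA data:

```python
import pandas as pd

# Hypothetical two-row frame, used only to illustrate the display options
sample = pd.DataFrame({"forecast": [0.640065, 0.359935]})

# Inside the with block, floats are rendered with two decimal places
with pd.option_context("display.max_columns", None, "display.precision", 2):
    short_repr = repr(sample)

# Outside the block, the previous display settings apply again
later_repr = repr(sample)

print(short_repr)
```

Only the rendering changes here. The values stored in the DataFrame keep their full precision.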


# Getting to Know Your Data


You’ve imported a CSV file with the pandas Python library and had a first look at the contents of your dataset. So far, you’ve only seen the size of your dataset and its first and last few rows. Next, you’ll learn how to examine your data more systematically.

### Displaying Data Types

The first step in getting to know your data is to discover the different data types it contains. While you can put anything into a list, the columns of a DataFrame contain values of a specific data type. When you compare pandas and Python data structures, you’ll see that this behavior makes pandas much faster!

You can display all columns and their data types with .info():

In [11]:
nba.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 126314 entries, 0 to 126313
Data columns (total 23 columns):
gameorder        126314 non-null int64
game_id          126314 non-null object
lg_id            126314 non-null object
_iscopy          126314 non-null int64
year_id          126314 non-null int64
date_game        126314 non-null object
seasongame       126314 non-null int64
is_playoffs      126314 non-null int64
team_id          126314 non-null object
fran_id          126314 non-null object
pts              126314 non-null int64
elo_i            126314 non-null float64
elo_n            126314 non-null float64
win_equiv        126314 non-null float64
opp_id           126314 non-null object
opp_fran         126314 non-null object
opp_pts          126314 non-null int64
opp_elo_i        126314 non-null float64
opp_elo_n        126314 non-null float64
game_location    126314 non-null object
game_result      126314 non-null object
forecast         126314 non-null float64
notes            5424 non-null object

You’ll see a list of all the columns in your dataset and the type of data each column contains. Here, you can see the data types int64, float64, and object. pandas uses the NumPy library to work with these types. Later, you’ll meet the more complex categorical data type, which the pandas Python library implements itself.

The object data type is a special one. According to the pandas Cookbook, the object data type is “a catch-all for columns that pandas doesn’t recognize as any other specific type.” In practice, it often means that all of the values in the column are strings.

Although you can store arbitrary Python objects in the object data type, you should be aware of the drawbacks to doing so. Strange values in an object column can harm pandas’ performance and its interoperability with other libraries. For more information, check out the official getting started guide.
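
To see how a column ends up with the object data type, you can experiment with a tiny DataFrame. This is only a sketch: the games frame and its mixed column are made up for illustration and aren't part of the NBA dataset.

```python
import pandas as pd

# Hypothetical toy data: two numeric columns and one column of mixed values
games = pd.DataFrame({
    "pts": [66, 68, 63],
    "forecast": [0.64, 0.36, 0.63],
    "mixed": [1, "two", 3.0],
})

# One data type per column
dtypes = games.dtypes

# Names of the columns that pandas treats as numeric
numeric_columns = games.select_dtypes(include="number").columns.tolist()

print(dtypes)
print(numeric_columns)
```

The mixed column falls back to object because its values don't share a single numeric type, so it's left out of numeric_columns.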



### Showing Basic Statistics

Now that you’ve seen what data types are in your dataset, it’s time to get an overview of the values each column contains. You can do this with .describe():

In [12]:
nba.describe()

Unnamed: 0,gameorder,_iscopy,year_id,seasongame,is_playoffs,pts,elo_i,elo_n,win_equiv,opp_pts,opp_elo_i,opp_elo_n,forecast
count,126314.0,126314.0,126314.0,126314.0,126314.0,126314.0,126314.0,126314.0,126314.0,126314.0,126314.0,126314.0,126314.0
mean,31579.0,0.5,1988.2,43.53,0.06,102.73,1495.24,1495.24,41.71,102.73,1495.24,1495.24,0.5
std,18231.93,0.5,17.58,25.38,0.24,14.81,112.14,112.46,10.63,14.81,112.14,112.46,0.22
min,1.0,0.0,1947.0,1.0,0.0,0.0,1091.64,1085.77,10.15,0.0,1091.64,1085.77,0.02
25%,15790.0,0.0,1975.0,22.0,0.0,93.0,1417.24,1416.99,34.1,93.0,1417.24,1416.99,0.33
50%,31579.0,0.5,1990.0,43.0,0.0,103.0,1500.95,1500.95,42.11,103.0,1500.95,1500.95,0.5
75%,47368.0,1.0,2003.0,65.0,0.0,112.0,1576.06,1576.29,49.64,112.0,1576.06,1576.29,0.67
max,63157.0,1.0,2015.0,108.0,1.0,186.0,1853.1,1853.1,71.11,186.0,1853.1,1853.1,0.98


This method shows you some basic descriptive statistics for all numeric columns.

.describe() only analyzes numeric columns by default, but you can provide other data types if you use the include parameter:

In [13]:
import numpy as np
nba.describe(include=object)

Unnamed: 0,game_id,lg_id,date_game,team_id,fran_id,opp_id,opp_fran,game_location,game_result,notes
count,126314,126314,126314,126314,126314,126314,126314,126314,126314,5424
unique,63157,2,12426,104,53,104,53,3,2,231
top,195812120NYK,NBA,1/2/2009,BOS,Lakers,BOS,Lakers,H,W,at New York NY
freq,2,118016,30,5997,6024,5997,6024,63138,63157,440


.describe() won’t try to calculate a mean or a standard deviation for the object columns, since they mostly contain text strings. However, it still displays some descriptive statistics, as the output above shows: the number of non-null values (count), the number of distinct values (unique), the most frequent value (top), and how often that value occurs (freq).
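
The count, unique, top, and freq rows that .describe() reports for object columns can be reproduced by hand, which may help you interpret them. Here’s a minimal sketch on a made-up Series of game results (the values are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical game results, stored with the object data type
results = pd.Series(["W", "L", "W", "W", "L"], dtype=object)

# value_counts() sorts by frequency, so the first entry is the most common value
counts = results.value_counts()

summary = {
    "count": int(results.count()),     # number of non-null values
    "unique": int(results.nunique()),  # number of distinct values
    "top": counts.index[0],            # most frequent value
    "freq": int(counts.iloc[0]),       # how often the top value occurs
}

print(summary)
```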


Take a look at the team_id and fran_id columns. Your dataset contains 104 different team IDs, but only 53 different franchise IDs. Furthermore, the most frequent team ID is BOS, but the most frequent franchise ID is Lakers. How is that possible? You’ll need to explore your dataset a bit more to answer this question.



## Exploring Your Dataset

Exploratory data analysis can help you answer questions about your dataset. For example, you can examine how often specific values occur in a column:

In [15]:
nba["team_id"].value_counts()

BOS    5997
NYK    5769
LAL    5078
DET    4985
PHI    4533
CHI    4307
PHO    4080
ATL    4035
MIL    4034
POR    3870
HOU    3820
CLE    3810
GSW    3701
SEA    3547
SAS    3515
IND    3364
DEN    3312
UTA    3145
DAL    3013
NJN    2939
LAC    2563
SAC    2488
MIA    2371
ORL    2207
MIN    2131
WSB    1992
TOR    1634
WAS    1475
CIN    1230
MEM    1197
       ... 
MMP     172
PRO     168
PTC     168
NOP     168
MMT     168
NOK     164
HSM     159
TRI     135
WSA      91
MMS      89
TEX      89
CAP      89
MNM      88
MNP      85
NYN      82
CHO      82
CHP      80
CHZ      80
ANA      78
NJA      78
AND      72
SHE      65
CLR      63
WAT      62
DNN      62
DTF      60
PIT      60
TRH      60
INJ      60
SDS      11
Name: team_id, Length: 104, dtype: int64

In [16]:
nba["fran_id"].value_counts()

Lakers          6024
Celtics         5997
Knicks          5769
Warriors        5657
Pistons         5650
Sixers          5644
Hawks           5572
Kings           5475
Wizards         4582
Spurs           4309
Bulls           4307
Pacers          4227
Thunder         4178
Rockets         4154
Nuggets         4120
Nets            4106
Suns            4080
Bucks           4034
Trailblazers    3870
Cavaliers       3810
Clippers        3733
Jazz            3555
Mavericks       3013
Heat            2371
Pelicans        2254
Magic           2207
Timberwolves    2131
Grizzlies       1657
Raptors         1634
Hornets          894
Colonels         846
Squires          799
Spirits          777
Stars            756
Sounds           697
Baltimore        467
Floridians       440
Condors          430
Capitols         291
Olympians        282
Sails            274
Stags            260
Bombers          249
Steamrollers     168
Packers           72
Redskins          65
Rebels            63
Denver       

It seems that a team named "Lakers" played 6024 games, but only 5078 of those were played by the Los Angeles Lakers. Find out who the other "Lakers" team is:

In [17]:
nba.loc[nba["fran_id"] == "Lakers", "team_id"].value_counts()

LAL    5078
MNL     946
Name: team_id, dtype: int64

Indeed, the Minneapolis Lakers ("MNL") played 946 games. You can even find out when they played those games. For that, you’ll first define a column that converts the value of date_game to the datetime data type. Then you can use the min and max aggregate functions to find the first and last games of the Minneapolis Lakers:



In [18]:
nba["date_played"] = pd.to_datetime(nba["date_game"])

In [19]:
nba.loc[nba["team_id"] == "MNL", "date_played"].min()

Timestamp('1948-11-04 00:00:00')

In [20]:
nba.loc[nba["team_id"] == "MNL", "date_played"].max()

Timestamp('1960-03-26 00:00:00')

In [21]:
nba.loc[nba["team_id"] == "MNL", "date_played"].agg(("min", "max"))

min   1948-11-04
max   1960-03-26
Name: date_played, dtype: datetime64[ns]

It looks like the Minneapolis Lakers played between the years of 1948 and 1960. That explains why you might not recognize this team!
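
The pattern used in this section, a Boolean condition inside .loc followed by an aggregation, also works on a frame small enough to check by eye. The sketch below uses made-up rows and dates. Passing an explicit format to pd.to_datetime() is an optional choice here that assumes every date follows the month/day/year layout:

```python
import pandas as pd

# Hypothetical toy rows; the dates are illustrative, not real game dates
games = pd.DataFrame({
    "fran_id": ["Lakers", "Lakers", "Lakers", "Celtics"],
    "team_id": ["MNL", "LAL", "LAL", "BOS"],
    "date_game": ["11/4/1948", "10/18/1960", "6/11/2010", "11/5/1946"],
})

# Convert the text dates to the datetime data type
games["date_played"] = pd.to_datetime(games["date_game"], format="%m/%d/%Y")

# Boolean mask: True for every row that belongs to the Lakers franchise
is_lakers = games["fran_id"] == "Lakers"

# Which team IDs appear for that franchise, and how often?
team_counts = games.loc[is_lakers, "team_id"].value_counts()

# First and last date among the selected rows
date_range = games.loc[is_lakers, "date_played"].agg(["min", "max"])

print(team_counts)
print(date_range)
```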

You’ve also found out why the Boston Celtics team "BOS" played the most games in the dataset: the Lakers’ games are split between two team IDs. Now take a brief look at the Celtics’ history, too. Find out how many points the Boston Celtics scored across all the games in this dataset:

In [23]:
nba.loc[nba["team_id"] == "BOS", "pts"].sum()

626484

You’ve got a taste for the capabilities of a pandas DataFrame. In the following sections, you’ll expand on the techniques you’ve just used, but first, you’ll zoom in and learn how this powerful data structure works.



# Getting to Know pandas’ Data Structures

While a DataFrame provides functions that can feel quite intuitive, the underlying concepts are a bit trickier to understand. For this reason, you’ll set aside the vast NBA DataFrame and build some smaller pandas objects from scratch.

### Understanding Series Objects

Python’s most basic data structure is the list, which is also a good starting point for getting to know pandas.Series objects. Create a new Series object based on a list:

In [25]:
revenues = pd.Series([5555, 7000, 1980])

In [26]:
revenues

0    5555
1    7000
2    1980
dtype: int64

You’ve used the list [5555, 7000, 1980] to create a Series object called revenues. A Series object wraps two components:

A sequence of values

A sequence of identifiers, which is the index

You can access these components with .values and .index, respectively:



In [28]:
revenues.values

array([5555, 7000, 1980])

In [29]:
revenues.index


RangeIndex(start=0, stop=3, step=1)

revenues.values returns the values in the Series, whereas revenues.index returns the positional index.

Note: If you’re familiar with NumPy, then it might be interesting for you to note that the values of a Series object are actually n-dimensional arrays:

In [30]:
type(revenues.values)

numpy.ndarray

If you’re not familiar with NumPy, then there’s no need to worry! You can explore the ins and outs of your dataset with the pandas Python library alone. However, if you’re curious about what pandas does behind the scenes, then check out Look Ma, No for Loops: Array Programming With NumPy.
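
To make the two components of a Series concrete, you can pull them apart and put them back together. This is a small sketch that reuses the revenues example. The names values_list, index_list, and rebuilt are introduced here only for illustration:

```python
import pandas as pd

revenues = pd.Series([5555, 7000, 1980])

# The two components, converted to plain Python lists for inspection
values_list = revenues.values.tolist()
index_list = list(revenues.index)

# Building a new Series from both components gives an equal Series
rebuilt = pd.Series(values_list, index=index_list)

print(values_list)
print(index_list)
print(rebuilt.equals(revenues))
```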



While pandas builds on NumPy, a significant difference is in their indexing. Just like a NumPy array, a pandas Series also has an integer index that’s implicitly defined. This implicit index indicates the element’s position in the Series.

However, a Series can also have an arbitrary type of index. You can think of this explicit index as labels for a specific row:

In [31]:
city_revenues = pd.Series([4200, 8000, 6500], index=["Amsterdam", "Toronto", "Tokyo"])

In [32]:
city_revenues

Amsterdam    4200
Toronto      8000
Tokyo        6500
dtype: int64

Here, the index is a list of city names represented by strings. You may have noticed that Python dictionaries use string indices as well, and this is a handy analogy to keep in mind! You can use the code blocks above to distinguish between two types of Series:

revenues: This Series behaves like a Python list because it only has a positional index.

city_revenues: This Series acts like a Python dictionary because it features both a positional and a label index.

Here’s how to construct a Series with a label index from a Python dictionary:



In [33]:
city_employee_count = pd.Series({"Amsterdam": 5, "Tokyo": 8})

In [34]:
city_employee_count

Amsterdam    5
Tokyo        8
dtype: int64

The dictionary keys become the index, and the dictionary values are the Series values.



Just like dictionaries, Series also support .keys() and the in keyword:

In [35]:
city_employee_count.keys()

Index(['Amsterdam', 'Tokyo'], dtype='object')

In [36]:
"Tokyo" in city_employee_count

True

In [37]:
"New York" in city_employee_count

False
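
One consequence of this dictionary-like behavior is that the in keyword tests the index labels, not the values. The following sketch shows the difference. Checking against .values is just one possible way to test for a value:

```python
import pandas as pd

city_employee_count = pd.Series({"Amsterdam": 5, "Tokyo": 8})

# "in" looks at the index labels, just like it looks at dictionary keys
label_found = "Tokyo" in city_employee_count

# 8 is a value, not a label, so this membership test fails
value_as_label = 8 in city_employee_count

# To test for a value, check the underlying array instead
value_found = 8 in city_employee_count.values

print(label_found, value_as_label, value_found)
```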

### Understanding DataFrame Objects


While a Series is a pretty powerful data structure, it has its limitations. For example, you can only store one attribute per key. As you’ve seen with the nba dataset, which features 23 columns, the pandas Python library has more to offer with its DataFrame. This data structure is a sequence of Series objects that share the same index.



If you’ve followed along with the Series examples, then you should already have two Series objects with cities as keys:



city_revenues

city_employee_count


You can combine these objects into a DataFrame by providing a dictionary in the constructor. The dictionary keys will become the column names, and the values should contain the Series objects:

In [38]:
city_data = pd.DataFrame({"revenue": city_revenues, "employee_count": city_employee_count})

In [39]:
city_data

           revenue  employee_count
Amsterdam     4200             5.0
Tokyo         6500             8.0
Toronto       8000             NaN


Note how pandas replaced the missing employee_count value for Toronto with NaN.



The new DataFrame index is the union of the two Series indices:



In [40]:
city_data.index

Index(['Amsterdam', 'Tokyo', 'Toronto'], dtype='object')
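
If you’d like to confirm how the two Series were aligned, you can inspect the result programmatically. This sketch rebuilds city_data from the examples above. The variable missing is a name chosen here for illustration:

```python
import pandas as pd

city_revenues = pd.Series([4200, 8000, 6500], index=["Amsterdam", "Toronto", "Tokyo"])
city_employee_count = pd.Series({"Amsterdam": 5, "Tokyo": 8})

# pandas aligns both Series on their labels when building the DataFrame
city_data = pd.DataFrame({"revenue": city_revenues, "employee_count": city_employee_count})

# True wherever a label had no matching entry in city_employee_count
missing = city_data["employee_count"].isna()

print(list(city_data.index))
print(missing)
print(city_data.dtypes)
```

Because NaN is a floating-point value, the employee_count column is stored as float64 even though the original Series contained integers.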

Just like a Series, a DataFrame also stores its values in a NumPy array:



In [41]:
city_data.values

array([[4.2e+03, 5.0e+00],
       [6.5e+03, 8.0e+00],
       [8.0e+03,     nan]])

You can also refer to the two dimensions of a DataFrame as axes:



In [42]:
city_data.axes

[Index(['Amsterdam', 'Tokyo', 'Toronto'], dtype='object'),
 Index(['revenue', 'employee_count'], dtype='object')]

In [43]:
city_data.axes[0]

Index(['Amsterdam', 'Tokyo', 'Toronto'], dtype='object')

In [44]:
city_data.axes[1]

Index(['revenue', 'employee_count'], dtype='object')

The axis marked with 0 is the row index, and the axis marked with 1 is the column index. This terminology is important to know because you’ll encounter several DataFrame methods that accept an axis parameter.
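
As one illustration of the axis parameter, consider .dropna(), which removes entries that contain missing values. The sketch below rebuilds city_data with a missing employee_count for Toronto. The names without_incomplete_rows and without_incomplete_columns are made up for this example:

```python
import pandas as pd

city_data = pd.DataFrame(
    {"revenue": [4200, 6500, 8000], "employee_count": [5, 8, None]},
    index=["Amsterdam", "Tokyo", "Toronto"],
)

# axis=0 works along the row index: rows with missing values are dropped
without_incomplete_rows = city_data.dropna(axis=0)

# axis=1 works along the column index: columns with missing values are dropped
without_incomplete_columns = city_data.dropna(axis=1)

print(without_incomplete_rows)
print(without_incomplete_columns)
```

With axis=0, the Toronto row disappears. With axis=1, the whole employee_count column is removed instead.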

A DataFrame is also a dictionary-like data structure, so it also supports .keys() and the in keyword. However, for a DataFrame these don’t relate to the index, but to the columns:



In [45]:
city_data.keys()

Index(['revenue', 'employee_count'], dtype='object')

In [46]:
"Amsterdam" in city_data

False

In [47]:
"revenue" in city_data

True
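
If you want to test for a row label instead, you can check the index explicitly. Here’s a short sketch that rebuilds a simplified city_data without the employee_count column:

```python
import pandas as pd

city_data = pd.DataFrame(
    {"revenue": [4200, 6500, 8000]},
    index=["Amsterdam", "Tokyo", "Toronto"],
)

# "in" on the DataFrame itself checks the column names
in_columns = "Amsterdam" in city_data

# Checking the index answers the question about row labels
in_index = "Amsterdam" in city_data.index

print(in_columns, in_index)
```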

You can see these concepts in action with the bigger NBA dataset. Does it contain a column called "points", or was it called "pts"? To answer this question, display the index and the axes of the nba dataset:

In [48]:
nba.index

RangeIndex(start=0, stop=126314, step=1)

nba, like all DataFrame objects, has two axes:



In [49]:
nba.axes


[RangeIndex(start=0, stop=126314, step=1),
 Index(['gameorder', 'game_id', 'lg_id', '_iscopy', 'year_id', 'date_game',
        'seasongame', 'is_playoffs', 'team_id', 'fran_id', 'pts', 'elo_i',
        'elo_n', 'win_equiv', 'opp_id', 'opp_fran', 'opp_pts', 'opp_elo_i',
        'opp_elo_n', 'game_location', 'game_result', 'forecast', 'notes',
        'date_played'],
       dtype='object')]

You can check the existence of a column with .keys():



In [50]:
"points" in nba.keys()

False

In [51]:
"pts" in nba.keys()

True

The column is called "pts", not "points".



As you use these methods to answer questions about your dataset, be sure to keep in mind whether you’re working with a Series or a DataFrame so that your interpretation is accurate.

## Accessing Series Elements

In the section above, you’ve created a pandas Series based on a Python list and compared the two data structures. You’ve seen how a Series object is similar to lists and dictionaries in several ways. A further similarity is that you can use the indexing operator ([]) for Series as well.



You’ll also learn how to use two pandas-specific access methods:



.loc

.iloc

You’ll see that these data access methods can be much more readable than the indexing operator.

### Using the Indexing Operator


Recall that a Series has two indices:

