In [1]:
import pandas as pd

# Introduction

Since I've been working on a lot of Kaggle competitions, I use Pandas a lot. As you may know, Pandas (along with NumPy) is the go-to Python library for data science work. It helps with reading input data in formats such as CSV and with transforming that data into a form that can be fed into ML models. However, getting comfortable with dataframes, slicing, and so on was very tough for me in the beginning. Hopefully, this short tutorial can show you a range of commands that will help you get the most insight into your dataset.

# Loading in Data

The first step in any ML problem is identifying what format your data is in, and then loading it into whatever framework you're using. For Kaggle competitions, a lot of data comes in CSV files, so that's the example we're going to use.

Since I'm a huge sports fan, we're going to be looking at a sports dataset that shows the results of NCAA basketball games from 1985 to 2016. This dataset is in a CSV file, and the function we're going to use to read in the file is called **pd.read_csv()**. This function returns a **dataframe**. The dataframe is the crown jewel data structure of Pandas. It is defined as "a two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns)".

Just think of it as a table for now. We'll explain more about what makes it unique later on. 

In [2]:
df = pd.read_csv('RegularSeasonCompactResults.csv')
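
If you don't have the Kaggle file on hand, here is a minimal sketch of the same call on an in-memory CSV. The five rows are the first five games of the dataset; the names `csv_text` and `sample` are placeholders made up for this illustration.

```python
import io

import pandas as pd

# First five games of the dataset, typed out as a small stand-in "file".
csv_text = """Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot
1985,20,1228,81,1328,64,N,0
1985,25,1106,77,1354,70,H,0
1985,25,1112,63,1223,56,H,0
1985,25,1165,70,1432,54,H,0
1985,25,1192,86,1447,74,H,0
"""

# read_csv accepts a file path or a file-like object, so StringIO works too.
sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample).__name__)  # DataFrame
print(sample.dtypes)          # one inferred type per column
```

The first line of the CSV becomes the column names, and Pandas infers a type for each column from its contents.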

Now that we have our dataframe in our variable df, let's look at what it contains. We can use the function **head()** to see the first few rows of the dataframe (five by default), or the function **tail()** to see the last few rows.

In [3]:
df.head()

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot
0,1985,20,1228,81,1328,64,N,0
1,1985,25,1106,77,1354,70,H,0
2,1985,25,1112,63,1223,56,H,0
3,1985,25,1165,70,1432,54,H,0
4,1985,25,1192,86,1447,74,H,0
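
Both functions also take the number of rows you want. A small sketch, using a made-up mini-dataframe (`sample`) built from the scores shown above rather than the full dataset:

```python
import pandas as pd

# Hypothetical mini-dataframe holding just the two score columns.
sample = pd.DataFrame({
    'Wscore': [81, 77, 63, 70, 86],
    'Lscore': [64, 70, 56, 54, 74],
})

first_two = sample.head(2)  # rows 0 and 1
last_two = sample.tail(2)   # rows 3 and 4

print(first_two)
print(last_two)
```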


We can see the dimensions of the dataframe using the **shape** attribute.

In [4]:
df.shape

(145289, 8)

We can also get all the column names as a list by calling **tolist()** on the **columns** attribute.

In [6]:
df.columns.tolist()

['Season', 'Daynum', 'Wteam', 'Wscore', 'Lteam', 'Lscore', 'Wloc', 'Numot']
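
Since shape is just a (rows, columns) tuple, you can unpack it directly. A quick sketch on a made-up three-row dataframe (`sample` is a placeholder, not the real dataset):

```python
import pandas as pd

sample = pd.DataFrame({
    'Season': [1985, 1985, 1985],
    'Wscore': [81, 77, 63],
    'Lscore': [64, 70, 56],
})

n_rows, n_cols = sample.shape           # shape is a (rows, columns) tuple
column_names = sample.columns.tolist()  # plain Python list of column names

print(n_rows, n_cols)  # 3 3
print(column_names)    # ['Season', 'Wscore', 'Lscore']
print(len(sample))     # 3, the row count on its own
```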

To get a better idea of the type of data we are dealing with, we can call the **describe()** function to see statistics like the mean, min, and max of each column of the dataset.

In [10]:
df.describe()

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Numot
count,145289.0,145289.0,145289.0,145289.0,145289.0,145289.0,145289.0
mean,2001.574834,75.223816,1286.720646,76.600321,1282.864064,64.497009,0.044387
std,9.233342,33.287418,104.570275,12.173033,104.829234,11.380625,0.247819
min,1985.0,0.0,1101.0,34.0,1101.0,20.0,0.0
25%,1994.0,47.0,1198.0,68.0,1191.0,57.0,0.0
50%,2002.0,78.0,1284.0,76.0,1280.0,64.0,0.0
75%,2010.0,103.0,1379.0,84.0,1375.0,72.0,0.0
max,2016.0,132.0,1464.0,186.0,1464.0,150.0,6.0
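
Notice that Wloc is missing from the output above: by default, describe() only summarizes numeric columns. Here is a sketch on a made-up mini-dataframe (`sample`) showing how to include text columns and how to summarize a single column.

```python
import pandas as pd

sample = pd.DataFrame({
    'Wscore': [81, 77, 63, 70, 86],
    'Wloc': ['N', 'H', 'H', 'H', 'H'],
})

numeric_only = sample.describe()             # skips the text column Wloc
everything = sample.describe(include='all')  # also summarizes Wloc
wscore_stats = sample['Wscore'].describe()   # summary of a single column

print(numeric_only.columns.tolist())  # ['Wscore']
print(everything.columns.tolist())    # ['Wscore', 'Wloc']
print(wscore_stats['mean'])           # 75.4
```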


Okay, so now let's look at information that we want to extract from the dataframe. Let's say I wanted to know the max value of a certain column. The function **max()** will show you the maximum value of every column.

In [22]:
df.max()

Season    2016
Daynum     132
Wteam     1464
Wscore     186
Lteam     1464
Lscore     150
Wloc         N
Numot        6
dtype: object

Then, if you'd like the max value of one particular column, you select that column by name with the bracket indexing operator and call **max()** on it.

In [24]:
df['Wscore'].max()

186

But what if that's not enough? Let's say we want to actually see the game (row) where this max score happened. We can call the **argmax()** function to identify the row index.

In [36]:
df['Wscore'].argmax()

24970
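
One caveat worth knowing: in recent Pandas versions (1.0 and later, as far as I know), argmax() returns the integer position of the maximum, while **idxmax()** returns its index label. The two agree on a default 0, 1, 2, ... index like the one in df, but not otherwise. A sketch with made-up labels:

```python
import pandas as pd

# Five winning scores, with made-up labels 10..14 instead of 0..4.
scores = pd.Series([81, 77, 63, 70, 86], index=[10, 11, 12, 13, 14])

position = scores.argmax()  # integer position of the max (recent Pandas)
label = scores.idxmax()     # index label of the max

print(position)  # 4
print(label)     # 14
```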

Then, in order to get attributes about the game, we need to use **iloc[]**. Iloc is definitely one of the more important tools in Pandas (strictly speaking it is an indexer rather than a function, which is why it takes square brackets). The main idea is that you use it whenever you have the integer position of a row that you want to access. As per the Pandas documentation, iloc is "integer-location based indexing for selection by position."

In [35]:
df.iloc[[df['Wscore'].argmax()]]

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot
24970,1991,68,1258,186,1109,140,H,0
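
The double brackets above matter: a single integer gives you the row as a Series, while a list of integers gives you a dataframe. The sketch below shows both, plus **loc[]**, which selects by label instead of position. The dataframe and its 'a', 'b', 'c' labels are made up for illustration.

```python
import pandas as pd

sample = pd.DataFrame(
    {'Wscore': [81, 77, 63], 'Lscore': [64, 70, 56]},
    index=['a', 'b', 'c'],  # made-up labels to tell position and label apart
)

as_series = sample.iloc[1]   # single integer -> the row as a Series
as_frame = sample.iloc[[1]]  # list of integers -> a one-row dataframe
by_label = sample.loc['b']   # loc selects by label instead of position

print(type(as_series).__name__)  # Series
print(as_frame.shape)            # (1, 2)
print(by_label['Wscore'])        # 77
```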


Let's take this a step further. Let's say you want to know the game with the highest scoring winning team (this is what we just calculated), but you then want to know how many points the losing team scored. 

In [38]:
df.iloc[[df['Wscore'].argmax()]]['Lscore'].max()

140
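
The chained call above works, but the trailing max() is only there to pull the single value out of a one-row column. An alternative sketch (not the approach used in this notebook) is to pass a row label and a column name to loc in one step. `sample` is again a made-up mini-dataframe.

```python
import pandas as pd

sample = pd.DataFrame({
    'Wscore': [81, 77, 63, 70, 86],
    'Lscore': [64, 70, 56, 54, 74],
})

best_row = sample['Wscore'].idxmax()           # label of the highest-scoring win
loser_points = sample.loc[best_row, 'Lscore']  # one cell: that row, Lscore column

print(best_row)      # 4
print(loser_points)  # 74
```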

The bracket indexing operator is also the simplest way to extract specific columns from a dataframe: pass it a list of column names.

In [27]:
df[['Wscore', 'Lscore']]

Unnamed: 0,Wscore,Lscore
0,81,64
1,77,70
2,63,56
3,70,54
4,86,74
5,79,78
6,64,44
7,58,56
8,98,80
9,97,89
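
A single column name gives you a Series, while a list of names gives you a dataframe. A sketch on a made-up mini-dataframe (`sample`), including a derived winning margin:

```python
import pandas as pd

sample = pd.DataFrame({
    'Season': [1985, 1985, 1985],
    'Wscore': [81, 77, 63],
    'Lscore': [64, 70, 56],
})

one_column = sample['Wscore']               # single name -> Series
two_columns = sample[['Wscore', 'Lscore']]  # list of names -> dataframe

# Columns support element-wise arithmetic, e.g. the winning margin per game.
margin = sample['Wscore'] - sample['Lscore']

print(type(one_column).__name__)   # Series
print(type(two_columns).__name__)  # DataFrame
print(margin.tolist())             # [17, 7, 7]
```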


Now, let's say we want to find all of the rows that satisfy a particular condition. For example, I want to find all of the games where the winning team scored more than 150 points. The idea behind this command is that you access the column 'Wscore' of the dataframe df (df['Wscore']), check which entries are above 150 (df['Wscore'] > 150, which gives one True/False value per row), and then use that result to keep only the matching rows of the dataframe (df[df['Wscore'] > 150]).

In [33]:
df[df['Wscore'] > 150]

Unnamed: 0,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot
5269,1986,75,1258,151,1109,107,H,0
12046,1988,40,1328,152,1147,84,H,0
12355,1988,52,1328,151,1173,99,N,0
16040,1989,40,1328,152,1331,122,H,0
16853,1989,68,1258,162,1109,144,A,0
17867,1989,92,1258,181,1109,150,H,0
19653,1990,30,1328,173,1109,101,H,0
19971,1990,38,1258,152,1109,137,A,0
20022,1990,40,1116,166,1109,101,H,0
22145,1990,97,1258,157,1362,115,H,0
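
You can combine several conditions the same way. A sketch on a made-up mini-dataframe (`sample`), looking for home wins with more than 75 points; the threshold is arbitrary and only there for illustration.

```python
import pandas as pd

sample = pd.DataFrame({
    'Wscore': [81, 77, 63, 70, 86],
    'Wloc': ['N', 'H', 'H', 'H', 'H'],
})

high_scoring = sample['Wscore'] > 75  # boolean Series, one value per row
at_home = sample['Wloc'] == 'H'

# Combine masks with & (and), | (or), ~ (not). Written inline, each
# comparison needs its own parentheses:
#   sample[(sample['Wscore'] > 75) & (sample['Wloc'] == 'H')]
high_home_wins = sample[high_scoring & at_home]

print(high_scoring.tolist())          # [True, True, False, False, True]
print(high_home_wins.index.tolist())  # [1, 4]
```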


Each dataframe has a **values** attribute, which is useful because it gives you the contents of your dataframe as a NumPy array.

In [39]:
df.values

array([[1985, 20, 1228, ..., 64, 'N', 0],
       [1985, 25, 1106, ..., 70, 'H', 0],
       [1985, 25, 1112, ..., 56, 'H', 0],
       ..., 
       [2016, 132, 1246, ..., 77, 'N', 1],
       [2016, 132, 1277, ..., 62, 'N', 0],
       [2016, 132, 1386, ..., 74, 'N', 0]], dtype=object)

Now you can access elements just like you would in an array.

In [40]:
df.values[0][0]

1985
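
The dtype=object in the array above comes from mixing numbers with the text column Wloc. The sketch below, on a made-up mini-dataframe (`sample`), shows the difference when you select only numeric columns first. It uses **to_numpy()**, which newer Pandas documentation recommends over values.

```python
import pandas as pd

sample = pd.DataFrame({
    'Wscore': [81, 77, 63],
    'Lscore': [64, 70, 56],
    'Wloc': ['N', 'H', 'H'],
})

mixed = sample.to_numpy()                          # numbers + text -> object array
numeric = sample[['Wscore', 'Lscore']].to_numpy()  # numbers only -> integer array

print(mixed.dtype)    # object
print(numeric[0, 1])  # 64, row 0 and column 1 in one indexing step
print(numeric.shape)  # (3, 2)
```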

# Dataframe Iteration

In order to iterate through dataframes, we can use the 

# Lots of Other Resources

Pandas has been around for a while, and there are a lot of other good resources if you're still interested in getting the most out of this library.
* http://pandas.pydata.org/pandas-docs/stable/10min.html
* https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python
* http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/
* https://www.dataquest.io/blog/pandas-python-tutorial/
* https://drive.google.com/file/d/0ByIrJAE4KMTtTUtiVExiUGVkRkE/view