**Introduction to Pandas**

Pandas is a Python package that makes importing and analyzing data much easier. It builds on packages like NumPy and matplotlib to give you a single, convenient place to do most of your data analysis and visualization work.

Parts of this tutorial are adapted from https://www.dataquest.io/blog/pandas-python-tutorial/

In [1]:
import pandas as pd



In [3]:
# Pandas represents tabular data with the DataFrame data structure.
# read_csv loads the contents of a CSV file into a DataFrame.
df = pd.read_csv("diabetes.csv")
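
The call above assumes a file named diabetes.csv in the working directory. As a minimal sketch of what read_csv does, the following parses a small in-memory CSV instead of a file. The names `csv_text` and `sample` are illustrative, and the two rows reuse a few values from the dataset.

```python
import io

import pandas as pd

# Hypothetical stand-in for diabetes.csv: two rows and four of its columns.
csv_text = """id,chol,hdl,location
1000,203.0,56.0,Buckingham
1001,165.0,24.0,Buckingham
"""

# read_csv accepts a file path or any file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample).__name__)  # DataFrame
print(list(sample.columns))   # column names come from the header row
print(sample.dtypes)          # a dtype is inferred for each column
```

The header row becomes the column labels, and because no index column is specified, the rows get a default integer index starting at 0.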

In [4]:
# Once we have read in a DataFrame, Pandas gives us two methods that make it quick to review the data.
# pandas.DataFrame.head -- returns the first N rows of a DataFrame (5 by default).
df.head()

Unnamed: 0,id,chol,stab.glu,hdl,ratio,glyhb,location,age,gender,height,weight,frame,bp.1s,bp.1d,waist,hip
0,1000,203.0,82,56.0,3.6,4.31,Buckingham,46,female,62.0,121.0,medium,118.0,59.0,29.0,38.0
1,1001,165.0,97,24.0,6.9,4.44,Buckingham,29,female,64.0,218.0,large,112.0,68.0,46.0,48.0
2,1002,228.0,92,37.0,6.2,4.64,Buckingham,58,female,61.0,256.0,large,190.0,92.0,49.0,57.0
3,1003,78.0,93,12.0,6.5,4.63,Buckingham,67,male,67.0,119.0,large,110.0,50.0,33.0,38.0
4,1005,249.0,90,28.0,8.9,7.72,Buckingham,64,male,68.0,183.0,medium,138.0,80.0,44.0,41.0


In [5]:
# pandas.DataFrame.tail -- returns the last N rows of a DataFrame (5 by default).
df.tail()

Unnamed: 0,id,chol,stab.glu,hdl,ratio,glyhb,location,age,gender,height,weight,frame,bp.1s,bp.1d,waist,hip
398,41506,296.0,369,46.0,6.4,16.110001,Louisa,53,male,69.0,173.0,medium,138.0,94.0,35.0,39.0
399,41507,284.0,89,54.0,5.3,4.39,Louisa,51,female,63.0,154.0,medium,140.0,100.0,32.0,43.0
400,41510,194.0,269,38.0,5.1,13.63,Louisa,29,female,69.0,167.0,small,120.0,70.0,33.0,40.0
401,41752,199.0,76,52.0,3.8,4.49,Louisa,41,female,63.0,197.0,medium,120.0,78.0,41.0,48.0
402,41756,159.0,88,79.0,2.0,,Louisa,68,female,64.0,220.0,medium,100.0,72.0,49.0,58.0


In [6]:
# We can also access the pandas.DataFrame.shape attribute to see how many rows and columns are in a DataFrame:
df.shape

(403, 16)
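
As a small sketch of these three tools together, the following uses a made-up ten-row frame (`toy` is an illustrative name, not part of the dataset) and passes an explicit row count to head and tail.

```python
import pandas as pd

# Made-up data: 10 rows, 2 columns.
toy = pd.DataFrame({"age": range(20, 30), "weight": range(150, 160)})

first_three = toy.head(3)   # first 3 rows instead of the default 5
last_two = toy.tail(2)      # last 2 rows
n_rows, n_cols = toy.shape  # shape is a (rows, columns) tuple

print(first_three)
print(last_two)
print(n_rows, n_cols)  # 10 2
```

Note that shape is an attribute, so it is accessed without parentheses.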

In [7]:
# Indexing DataFrames with Pandas
# Above, we used the head() method to get the first 5 rows of df.
# We could accomplish the same thing using the pandas.DataFrame.iloc indexer.
# iloc allows us to retrieve rows and columns by integer position,
# similar to indexing a 2-dimensional NumPy array.

# The code below does the same thing as df.head():

df.iloc[0:5,:]


Unnamed: 0,id,chol,stab.glu,hdl,ratio,glyhb,location,age,gender,height,weight,frame,bp.1s,bp.1d,waist,hip
0,1000,203.0,82,56.0,3.6,4.31,Buckingham,46,female,62.0,121.0,medium,118.0,59.0,29.0,38.0
1,1001,165.0,97,24.0,6.9,4.44,Buckingham,29,female,64.0,218.0,large,112.0,68.0,46.0,48.0
2,1002,228.0,92,37.0,6.2,4.64,Buckingham,58,female,61.0,256.0,large,190.0,92.0,49.0,57.0
3,1003,78.0,93,12.0,6.5,4.63,Buckingham,67,male,67.0,119.0,large,110.0,50.0,33.0,38.0
4,1005,249.0,90,28.0,8.9,7.72,Buckingham,64,male,68.0,183.0,medium,138.0,80.0,44.0,41.0


In [8]:
# Here are some indexing examples. A notebook cell displays only the value of its
# last expression, so the output below is the result of df.iloc[9,:].

df.iloc[:5,:] # the first 5 rows, and all of the columns for those rows.
df.iloc[:,:] # the entire DataFrame.
df.iloc[5:,5:] # rows from position 5 onwards, and columns from position 5 onwards.
df.iloc[:,0] # the first column, and all of the rows for the column.
df.iloc[9,:] # the 10th row, and all of the columns for that row.

id                1022
chol               263
stab.glu            89
hdl                 40
ratio              6.6
glyhb             5.78
location    Buckingham
age                 55
gender          female
height              63
weight             202
frame            small
bp.1s              108
bp.1d               72
waist               45
hip                 50
Name: 9, dtype: object
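
To see what each kind of positional selection returns, here is a minimal sketch on a made-up four-row frame (`toy` and its columns are illustrative). It assigns each result to a name so that all of them can be inspected, not just the last one.

```python
import pandas as pd

# Made-up data with a default integer index 0..3.
toy = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [10, 20, 30, 40], "c": ["w", "x", "y", "z"]}
)

block = toy.iloc[1:3, 0:2]  # rows at positions 1-2, columns at positions 0-1 (DataFrame)
row = toy.iloc[2, :]        # a single row (Series indexed by column name)
col = toy.iloc[:, 0]        # a single column (Series indexed by row label)
cell = toy.iloc[3, 1]       # a single value (scalar)

print(block)
print(row)
print(col)
print(cell)  # 40
```

As with Python slices, the end position of an iloc slice is excluded, so `1:3` selects positions 1 and 2.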

In [9]:
# Now that we know how to index by position, let's remove the first column (patient id), 
# which doesn't have any useful information:

df = df.iloc[:,1:]
df.head()

Unnamed: 0,chol,stab.glu,hdl,ratio,glyhb,location,age,gender,height,weight,frame,bp.1s,bp.1d,waist,hip
0,203.0,82,56.0,3.6,4.31,Buckingham,46,female,62.0,121.0,medium,118.0,59.0,29.0,38.0
1,165.0,97,24.0,6.9,4.44,Buckingham,29,female,64.0,218.0,large,112.0,68.0,46.0,48.0
2,228.0,92,37.0,6.2,4.64,Buckingham,58,female,61.0,256.0,large,190.0,92.0,49.0,57.0
3,78.0,93,12.0,6.5,4.63,Buckingham,67,male,67.0,119.0,large,110.0,50.0,33.0,38.0
4,249.0,90,28.0,8.9,7.72,Buckingham,64,male,68.0,183.0,medium,138.0,80.0,44.0,41.0
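
Slicing by position is one way to drop a leading column. An alternative, sketched here on a made-up frame (`toy` is illustrative), is to drop the column by name with DataFrame.drop, which does not depend on where the column sits.

```python
import pandas as pd

# Made-up data with an "id" column in first position.
toy = pd.DataFrame(
    {"id": [1000, 1001], "chol": [203.0, 165.0], "hdl": [56.0, 24.0]}
)

by_position = toy.iloc[:, 1:]       # keep every column from position 1 onwards
by_name = toy.drop(columns=["id"])  # drop the column labelled "id"

print(by_position.equals(by_name))  # True
print(list(toy.columns))            # toy itself is unchanged
```

Both approaches return a new DataFrame, which is why the tutorial assigns the result back to df.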


In [11]:
# A major advantage of Pandas over NumPy is that each of the columns and rows has a label.
# Working with column positions makes it difficult to keep track of which number corresponds to which column.
# The pandas.DataFrame.loc indexer lets us find rows and columns using labels instead of positions.
# Unlike iloc, a loc slice includes its end label, so :5 below returns the rows labelled 0 through 5.

df.loc[:5, 'hdl']



0    56.0
1    24.0
2    37.0
3    12.0
4    28.0
5    69.0
Name: hdl, dtype: float64
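
The output above has six rows, not five, because loc slices by label and includes the end label. A minimal sketch of the difference from iloc, using a made-up single-column frame (`toy` is illustrative; the values are the first hdl readings from the dataset):

```python
import pandas as pd

# Made-up frame with a default integer index 0..6.
toy = pd.DataFrame({"hdl": [56.0, 24.0, 37.0, 12.0, 28.0, 69.0, 41.0]})

by_label = toy.loc[:5, "hdl"]  # labels 0 through 5, end label included
by_position = toy.iloc[:5, 0]  # positions 0 through 4, end position excluded

print(len(by_label))     # 6
print(len(by_position))  # 5
```

With the default integer index, labels and positions happen to coincide, which makes this off-by-one easy to miss.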

In [13]:
# We can also specify more than one column at a time by passing in a list:
df.loc[:5,["hdl", "ratio", "weight"]]

Unnamed: 0,hdl,ratio,weight
0,56.0,3.6,121.0
1,24.0,6.9,218.0
2,37.0,6.2,256.0
3,12.0,6.5,119.0
4,28.0,8.9,183.0
5,69.0,3.6,190.0


In [14]:
# There's an even easier way to retrieve a whole column. 
# We can just specify the column name in square brackets, like with a dictionary:
df["weight"]

0      121.0
1      218.0
2      256.0
3      119.0
4      183.0
5      190.0
6      191.0
7      170.0
8      166.0
9      202.0
10     156.0
11     195.0
12     170.0
13     165.0
14     183.0
15     157.0
16     183.0
17     159.0
18     126.0
19     196.0
20     178.0
21     230.0
22     288.0
23     185.0
24     113.0
25     118.0
26     252.0
27     100.0
28     145.0
29     189.0
       ...  
373    170.0
374    157.0
375    129.0
376    211.0
377    189.0
378    120.0
379    121.0
380    120.0
381    169.0
382    186.0
383    262.0
384    222.0
385    222.0
386    179.0
387    224.0
388    165.0
389    185.0
390    147.0
391    177.0
392    145.0
393    146.0
394    154.0
395    136.0
396    168.0
397    115.0
398    173.0
399    154.0
400    167.0
401    197.0
402    220.0
Name: weight, Length: 403, dtype: float64
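
Selecting a single column this way returns a pandas Series rather than a DataFrame. A short sketch on a made-up frame (`toy` is illustrative) suggests that the bracket form gives the same result as the longer loc form:

```python
import pandas as pd

# Made-up data with two columns.
toy = pd.DataFrame({"hdl": [56.0, 24.0], "weight": [121.0, 218.0]})

with_brackets = toy["weight"]    # dictionary-style column access
with_loc = toy.loc[:, "weight"]  # the same column through loc

print(type(with_brackets).__name__)    # Series
print(with_brackets.name)              # weight
print(with_brackets.equals(with_loc))  # True
```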

In [15]:
# We can also use lists of columns with this method:

df[["hdl", "ratio", "weight"]]

Unnamed: 0,hdl,ratio,weight
0,56.0,3.6,121.0
1,24.0,6.9,218.0
2,37.0,6.2,256.0
3,12.0,6.5,119.0
4,28.0,8.9,183.0
5,69.0,3.6,190.0
6,41.0,4.8,191.0
7,44.0,5.2,170.0
8,49.0,3.6,166.0
9,40.0,6.6,202.0
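
Passing a list of names always returns a DataFrame, even when the list holds a single name, and the columns come back in the order listed. A minimal sketch on a made-up frame (`toy` is illustrative):

```python
import pandas as pd

# Made-up data with three columns.
toy = pd.DataFrame(
    {"hdl": [56.0, 24.0], "ratio": [3.6, 6.9], "weight": [121.0, 218.0]}
)

as_series = toy["weight"]          # string key: a Series
as_frame = toy[["weight"]]         # one-item list: a one-column DataFrame
reordered = toy[["weight", "hdl"]] # columns follow the order of the list

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(list(reordered.columns))   # ['weight', 'hdl']
```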
