In [33]:
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
'imports complete'

'imports complete'

<h1>AN INTRO TO THE PANDAS LIBRARY</h1>
Source: Mostly pandas documentation (https://pandas.pydata.org/)

<h2>Data Frames and Series</h2>

To perform analysis, we first need to import data. How this is done depends on the format of your data. Our example uses a CSV file, so we choose "read_csv" as our function. This reads our CSV file in as a <b>data frame</b> (which can be thought of as similar to a table from DSCI 101 and 102).

In [34]:
df = pd.read_csv("framingham.csv") # read in csv file as data frame

Note: pandas also supports importing data in other formats. These are described here: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html
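
As a small illustration of the reading step that does not need a file on disk, the sketch below parses CSV text held in a string. The column names and values are copied from the first rows of the data set displayed by df.head(); `csv_text` and `small` are names made up for this example.

```python
import io
import pandas as pd

# A few rows in CSV form; in practice this text would live in a file.
csv_text = """AGE,SYSBP,DIABP
39,106.0,70.0
46,121.0,81.0
48,127.5,80.0
"""

# read_csv accepts a path or a file-like object, so StringIO works too.
small = pd.read_csv(io.StringIO(csv_text))
print(small.shape)   # (rows, columns)
print(small.dtypes)  # AGE is parsed as an integer column, the others as floats
```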

We can view parts of our data frame with head (to see the beginning) and tail (to see the end).

In [35]:
df.head() # view beginning of data set

Unnamed: 0,AGE,SYSBP,DIABP,TOTCHOL,CURSMOKE,DIABETES,GLUCOSE,DEATH,ANYCHD
0,39,106.0,70.0,195.0,0,0,77.0,0,1
1,46,121.0,81.0,250.0,0,0,76.0,0,0
2,48,127.5,80.0,245.0,1,0,70.0,0,0
3,61,150.0,95.0,225.0,1,0,103.0,1,0
4,46,130.0,84.0,285.0,1,0,85.0,0,0
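
Both head and tail accept the number of rows to return (the default is 5). The sketch below uses a toy frame built from the AGE and SYSBP values shown above rather than the full data set; `toy`, `first_two`, and `last_two` are illustrative names.

```python
import pandas as pd

# Toy frame with the first five AGE/SYSBP values from the output above.
toy = pd.DataFrame({
    "AGE": [39, 46, 48, 61, 46],
    "SYSBP": [106.0, 121.0, 127.5, 150.0, 130.0],
})

first_two = toy.head(2)  # first 2 rows
last_two = toy.tail(2)   # last 2 rows
print(first_two)
print(last_two)
```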


To view data from specific <b>columns</b> or <b>rows</b>, there are a few operations we can perform. First, let's look at accessing columns of data.

In [36]:
df['AGE'] # access a particular column

0       39
1       46
2       48
3       61
4       46
        ..
3837    51
3838    48
3839    52
3840    40
3841    39
Name: AGE, Length: 3842, dtype: int64

Accessing a single column returns a pandas <b>series</b>. A series is a one-dimensional ndarray with axis labels.

In [37]:
type(df['AGE']) # notice this is of type series

pandas.core.series.Series
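
To make "one-dimensional array with axis labels" concrete, here is a minimal sketch that builds a series by hand. The string labels "a", "b", "c" are invented for illustration; in the data frame above the labels are the default integers.

```python
import pandas as pd

# Three AGE values with hypothetical string labels.
ages = pd.Series([39, 46, 48], index=["a", "b", "c"], name="AGE")

print(ages.index)       # the axis labels
print(ages.to_numpy())  # the underlying values as a NumPy array
print(ages["b"])        # look up a value by its label
```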

Passing a list of labels selects multiple columns.

In [38]:
df[['AGE', 'DEATH']] # access particular column(s)

Unnamed: 0,AGE,DEATH
0,39,0
1,46,0
2,48,0
3,61,1
4,46,0
...,...,...
3837,51,1
3838,48,1
3839,52,0
3840,40,0


Doing so returns a data frame object.

In [39]:
type(df[['AGE', 'DEATH']]) # notice this is of type dataframe

pandas.core.frame.DataFrame
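
The difference comes from the brackets, not from the number of columns: a single label returns a series, while a list of labels returns a data frame even when the list has one entry. A minimal sketch on a toy frame (`toy`, `as_series`, and `as_frame` are illustrative names):

```python
import pandas as pd

toy = pd.DataFrame({"AGE": [39, 46, 48], "DEATH": [0, 0, 0]})

as_series = toy["AGE"]   # a single label returns a Series
as_frame = toy[["AGE"]]  # a list of labels returns a DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```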

In [40]:
df[['AGE', 'SYSBP']] # get columns (sort of similar to dictionary reference to keys)

Unnamed: 0,AGE,SYSBP
0,39,106.0
1,46,121.0
2,48,127.5
3,61,150.0
4,46,130.0
...,...,...
3837,51,126.5
3838,48,131.0
3839,52,133.5
3840,40,141.0


In [41]:
df.iloc[[0,1]] # get rows

Unnamed: 0,AGE,SYSBP,DIABP,TOTCHOL,CURSMOKE,DIABETES,GLUCOSE,DEATH,ANYCHD
0,39,106.0,70.0,195.0,0,0,77.0,0,1
1,46,121.0,81.0,250.0,0,0,76.0,0,0


In [42]:
df.iloc[[0,1], [0,1]] # get rows and columns by position (arg1 = rows and arg2 = columns)

Unnamed: 0,AGE,SYSBP
0,39,106.0
1,46,121.0


In [43]:
df.loc[[0,1], ['AGE', 'SYSBP']] # iloc selects by integer position whereas loc selects by label

Unnamed: 0,AGE,SYSBP
0,39,106.0
1,46,121.0


In [44]:
df.loc[0:5, ['AGE', 'SYSBP']] # slicing is possible too! note that with loc the end label 5 is INCLUSIVE, unlike regular slicing in Python

Unnamed: 0,AGE,SYSBP
0,39,106.0
1,46,121.0
2,48,127.5
3,61,150.0
4,46,130.0
5,43,180.0


In [45]:
df.loc[0:5, 'AGE':'DIABETES'] # it's also possible to slice column labels!

Unnamed: 0,AGE,SYSBP,DIABP,TOTCHOL,CURSMOKE,DIABETES
0,39,106.0,70.0,195.0,0,0
1,46,121.0,81.0,250.0,0,0
2,48,127.5,80.0,245.0,1,0
3,61,150.0,95.0,225.0,1,0
4,46,130.0,84.0,285.0,1,0
5,43,180.0,110.0,228.0,0,0
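
The endpoint rule differs between the two accessors: loc slices by label and includes the end label, while iloc slices by position and excludes the stop, like ordinary Python slicing. A small sketch on a toy frame built from the values above (`toy`, `by_label`, and `by_position` are illustrative names):

```python
import pandas as pd

toy = pd.DataFrame({
    "AGE": [39, 46, 48, 61, 46, 43],
    "SYSBP": [106.0, 121.0, 127.5, 150.0, 130.0, 180.0],
    "DIABP": [70.0, 81.0, 80.0, 95.0, 84.0, 110.0],
})

by_label = toy.loc[0:2, "AGE":"SYSBP"]  # label slices include both endpoints
by_position = toy.iloc[0:2, 0:2]        # position slices exclude the stop

print(by_label)
print(by_position)
```

With the default integer index the row labels and row positions happen to coincide, so the same `0:2` returns three rows with loc and two rows with iloc.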


It's easy to count the values of a series:

In [46]:
count1 = df['DIABETES'].value_counts() # how many people have/don't have diabetes?
count1

0    3737
1     105
Name: DIABETES, dtype: int64

In [47]:
count2 = df[['DIABETES', 'DEATH']].value_counts()
count2

DIABETES  DEATH
0         0        2529
          1        1208
1         1          80
          0          25
dtype: int64
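
If proportions are more useful than raw counts, value_counts takes a normalize=True argument. The sketch below uses four made-up values rather than the real DIABETES column; `diabetes`, `counts`, and `shares` are illustrative names.

```python
import pandas as pd

# Toy values for illustration only, not the real data.
diabetes = pd.Series([0, 0, 0, 1], name="DIABETES")

counts = diabetes.value_counts()                # raw counts, most frequent first
shares = diabetes.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares)
```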

<h2>Indexing</h2>

Maybe I feel like integers aren't the best identifiers for my rows. Pandas allows us to change that!

In [49]:
df.loc[0:5, 'AGE':'DIABETES'].set_index('AGE')

AGE,SYSBP,DIABP,TOTCHOL,CURSMOKE,DIABETES
39,106.0,70.0,195.0,0,0
46,121.0,81.0,250.0,0,0
48,127.5,80.0,245.0,1,0
61,150.0,95.0,225.0,1,0
46,130.0,84.0,285.0,1,0
43,180.0,110.0,228.0,0,0


In this case, age is a pretty poor index because it's not unique (note 46 appearing twice). Luckily, set_index returned a new data frame and did not modify the actual dataframe. To modify it, either assign the result back (<b>df = df.set_index('AGE')</b>) or pass in the argument inplace=True so that our method now looks like <b>.set_index('AGE', inplace=True)</b>. You can also specify the index when reading in the file: <b>df = pd.read_csv("framingham.csv", index_col='AGE')</b>
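
A quick way to check whether a column would make a good index is the index's is_unique attribute. The sketch below is a minimal example on a toy frame built from the values above; `toy` and `indexed` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({
    "AGE": [39, 46, 48, 61, 46],
    "SYSBP": [106.0, 121.0, 127.5, 150.0, 130.0],
})

indexed = toy.set_index("AGE")  # returns a new frame; toy is unchanged

print(toy.index.is_unique)      # the default integer index has no repeats
print(indexed.index.is_unique)  # 46 appears twice, so this index is not unique
print(indexed.loc[46])          # a repeated label returns every matching row
```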