# Python + Pandas = Easy access to all your data

In this notebook I show some simple examples of how to read data into a pandas DataFrame and then process or modify it by slicing, deleting, appending, extracting, and so on.

I try to make the code as self-explanatory as possible, and I add comments where I feel they are needed. Enjoy!

In [92]:
# import of needed packages
import io
import requests
import pandas as pd

### Fetching data from website with url

In [93]:
# This URL is from the Stanford Statweb: datasets for "The Elements of Statistical Learning"
url    = "https://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/SAheart.data"
source = requests.get(url).content
data   = pd.read_csv(io.StringIO(source.decode('utf-8')))


In [94]:
# Printing the first 10 rows of the data with slicing
data[:10]

Unnamed: 0,row.names,sbp,tobacco,ldl,adiposity,famhist,typea,obesity,alcohol,age,chd
0,1,160,12.0,5.73,23.11,Present,49,25.3,97.2,52,1
1,2,144,0.01,4.41,28.61,Absent,55,28.87,2.06,63,1
2,3,118,0.08,3.48,32.28,Present,52,29.14,3.81,46,0
3,4,170,7.5,6.41,38.03,Present,51,31.99,24.26,58,1
4,5,134,13.6,3.5,27.78,Present,60,25.99,57.34,49,1
5,6,132,6.2,6.47,36.21,Present,62,30.77,14.14,45,0
6,7,142,4.05,3.38,16.2,Absent,59,20.81,2.62,38,0
7,8,114,4.08,4.59,14.6,Present,62,23.11,6.72,58,1
8,9,114,0.0,3.83,19.4,Present,49,24.86,2.49,29,0
9,10,132,0.0,5.8,30.96,Present,69,30.11,0.0,53,1
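
The slice `data[:10]` selects rows by position, and `DataFrame.head` gives the same rows. A minimal sketch, assuming Python 3 and a small stand-in frame `df` whose values are copied from the output above (the variable names are illustrative, not part of the notebook):

```python
import pandas as pd

# Stand-in frame: five rows of two columns copied from the output above
df = pd.DataFrame(
    {"sbp": [160, 144, 118, 170, 134], "tobacco": [12.0, 0.01, 0.08, 7.5, 13.6]},
    columns=["sbp", "tobacco"],
)

# Two ways to take the first three rows
first_by_slice = df[:3]
first_by_head = df.head(3)

# equals() compares values, dtypes and labels in one call
print(first_by_slice.equals(first_by_head))
```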


### Loading data from local folder

In [95]:
# Reading the data into a pandas DataFrame from a local folder; this is the same dataset as the one above
data = pd.read_csv('data/SAheart.csv')

In [96]:
# Printing the first 10 rows of the data with slicing
# Notice how the two methods give exactly the same result
data[:10]

Unnamed: 0,row.names,sbp,tobacco,ldl,adiposity,famhist,typea,obesity,alcohol,age,chd
0,1,160,12.0,5.73,23.11,Present,49,25.3,97.2,52,1
1,2,144,0.01,4.41,28.61,Absent,55,28.87,2.06,63,1
2,3,118,0.08,3.48,32.28,Present,52,29.14,3.81,46,0
3,4,170,7.5,6.41,38.03,Present,51,31.99,24.26,58,1
4,5,134,13.6,3.5,27.78,Present,60,25.99,57.34,49,1
5,6,132,6.2,6.47,36.21,Present,62,30.77,14.14,45,0
6,7,142,4.05,3.38,16.2,Absent,59,20.81,2.62,38,0
7,8,114,4.08,4.59,14.6,Present,62,23.11,6.72,58,1
8,9,114,0.0,3.83,19.4,Present,49,24.86,2.49,29,0
9,10,132,0.0,5.8,30.96,Present,69,30.11,0.0,53,1
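
Rather than comparing the printed rows by eye, the two loading routes can be compared with `DataFrame.equals`. A minimal sketch, assuming Python 3, a three-row excerpt with a reduced set of columns copied from the output above, and a temporary file standing in for `data/SAheart.csv` (all names here are illustrative):

```python
import io
import os
import tempfile

import pandas as pd

# Three rows copied from the output above; a stand-in for the full dataset
csv_text = (
    "row.names,sbp,tobacco,famhist,chd\n"
    "1,160,12.0,Present,1\n"
    "2,144,0.01,Absent,1\n"
    "3,118,0.08,Present,0\n"
)

# Route 1: parse from an in-memory text buffer (as with the downloaded content)
from_memory = pd.read_csv(io.StringIO(csv_text))

# Route 2: write the same text to a temporary file and parse it from disk
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.csv")
    with open(path, "w") as handle:
        handle.write(csv_text)
    from_disk = pd.read_csv(path)

# True only if values, dtypes and labels all match
same = from_memory.equals(from_disk)
print(same)
```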


In [97]:
# Printing a list of the column headers and number of columns
print data.columns, "\nNumber of columns:", len(data.columns)

Index([u'row.names', u'sbp', u'tobacco', u'ldl', u'adiposity', u'famhist',
       u'typea', u'obesity', u'alcohol', u'age', u'chd'],
      dtype='object') 
Number of columns: 11


### Deleting column from dataframe
In the following step I delete the first column, ``row.names``, since it has no actual value for the analysis.

In [98]:
# Deleting the 1st column of the dataset (remember that the DataFrame is 0-indexed)
data.drop(data.columns[[0]], axis=1, inplace=True)
data[:10]

Unnamed: 0,sbp,tobacco,ldl,adiposity,famhist,typea,obesity,alcohol,age,chd
0,160,12.0,5.73,23.11,Present,49,25.3,97.2,52,1
1,144,0.01,4.41,28.61,Absent,55,28.87,2.06,63,1
2,118,0.08,3.48,32.28,Present,52,29.14,3.81,46,0
3,170,7.5,6.41,38.03,Present,51,31.99,24.26,58,1
4,134,13.6,3.5,27.78,Present,60,25.99,57.34,49,1
5,132,6.2,6.47,36.21,Present,62,30.77,14.14,45,0
6,142,4.05,3.38,16.2,Absent,59,20.81,2.62,38,0
7,114,4.08,4.59,14.6,Present,62,23.11,6.72,58,1
8,114,0.0,3.83,19.4,Present,49,24.86,2.49,29,0
9,132,0.0,5.8,30.96,Present,69,30.11,0.0,53,1


In [99]:
# Printing a list of the column headers and number of columns
print data.columns, "\nNumber of columns:", len(data.columns)

Index([u'sbp', u'tobacco', u'ldl', u'adiposity', u'famhist', u'typea',
       u'obesity', u'alcohol', u'age', u'chd'],
      dtype='object') 
Number of columns: 10


From the output in the cells above, we can verify that the column has been removed, either by printing part of the data or by checking that the total number of columns, ``len(data.columns)``, is now smaller.
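
The same verification can be written as explicit checks instead of reading printed output. A minimal sketch, assuming Python 3 and a small hypothetical frame `df` that only mirrors the first three columns of the dataset:

```python
import pandas as pd

# Small stand-in frame; only the column layout mirrors the dataset
df = pd.DataFrame(
    {"row.names": [1, 2, 3], "sbp": [160, 144, 118], "tobacco": [12.0, 0.01, 0.08]},
    columns=["row.names", "sbp", "tobacco"],
)

columns_before = len(df.columns)

# Drop the first column by position, here with reassignment instead of inplace=True
df = df.drop(df.columns[[0]], axis=1)

# Check both that the name is gone and that the count went down by one
removed = "row.names" not in df.columns
columns_after = len(df.columns)
print(removed)
print(columns_before - columns_after)
```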

#### Deleting multiple columns from dataframe

This can be done in several ways. Below are examples of deleting a series of columns, deleting specific columns and keeping specific columns.

I show both how to drop and how to keep columns, since each approach suits different situations. If you only need a few columns, it is easier to extract those than to list all the ones you want removed.
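
To make the trade-off concrete, here is a minimal sketch that reaches the same result by both routes. It assumes Python 3 and a hypothetical two-row frame `df` with four of the dataset's columns:

```python
import pandas as pd

# Two rows of four columns copied from the dataset; a stand-in for the full frame
df = pd.DataFrame(
    {"sbp": [160, 144], "typea": [49, 55], "obesity": [25.3, 28.87], "age": [52, 63]},
    columns=["sbp", "typea", "obesity", "age"],
)

# Route 1: name the columns to remove
dropped = df.drop(["sbp", "obesity"], axis=1)

# Route 2: name the columns to keep
kept = df[["typea", "age"]]

# Both routes leave the same two columns
print(dropped.equals(kept))
```

With only four columns the two routes are equally short; the keep route pays off as the number of unwanted columns grows.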

In [100]:
# Dropping several columns based on index position (tobacco, ldl)
data.drop(data.columns[[1,2]], axis=1, inplace=True)
data[:10]

Unnamed: 0,sbp,adiposity,famhist,typea,obesity,alcohol,age,chd
0,160,23.11,Present,49,25.3,97.2,52,1
1,144,28.61,Absent,55,28.87,2.06,63,1
2,118,32.28,Present,52,29.14,3.81,46,0
3,170,38.03,Present,51,31.99,24.26,58,1
4,134,27.78,Present,60,25.99,57.34,49,1
5,132,36.21,Present,62,30.77,14.14,45,0
6,142,16.2,Absent,59,20.81,2.62,38,0
7,114,14.6,Present,62,23.11,6.72,58,1
8,114,19.4,Present,49,24.86,2.49,29,0
9,132,30.96,Present,69,30.11,0.0,53,1


In [101]:
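# Dropping a range of columns (positions 3 and 4: typea, obesity)
# Note: data.index[3:5] would select rows, so data.columns with axis=1 is used here
# drop() returns a new DataFrame; without inplace=True, data itself stays unchanged
data.drop(data.columns[3:5], axis=1)
data[:10]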

Unnamed: 0,sbp,adiposity,famhist,typea,obesity,alcohol,age,chd
0,160,23.11,Present,49,25.3,97.2,52,1
1,144,28.61,Absent,55,28.87,2.06,63,1
2,118,32.28,Present,52,29.14,3.81,46,0
3,170,38.03,Present,51,31.99,24.26,58,1
4,134,27.78,Present,60,25.99,57.34,49,1
5,132,36.21,Present,62,30.77,14.14,45,0
6,142,16.2,Absent,59,20.81,2.62,38,0
7,114,14.6,Present,62,23.11,6.72,58,1
8,114,19.4,Present,49,24.86,2.49,29,0
9,132,30.96,Present,69,30.11,0.0,53,1
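
The output above is unchanged because `drop` returns a new DataFrame unless `inplace=True` is passed or the result is assigned. It also matters which labels are handed to `drop`: `index` holds row labels, `columns` holds column labels. A minimal sketch, assuming Python 3 and a hypothetical five-row frame `df` with three of the dataset's columns:

```python
import pandas as pd

# Five rows of three columns copied from the dataset
df = pd.DataFrame(
    {"sbp": [160, 144, 118, 170, 134],
     "typea": [49, 55, 52, 51, 60],
     "age": [52, 63, 46, 58, 49]},
    columns=["sbp", "typea", "age"],
)

# Row labels at positions 3 and 4 -> removes two rows
without_rows = df.drop(df.index[3:5])

# Column labels at positions 1 and 2 -> removes typea and age
without_cols = df.drop(df.columns[1:3], axis=1)

# df itself is untouched, because both results went to new names
print(df.shape)
print(without_rows.shape)
print(without_cols.shape)
```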


If all the positional indexing gets a little confusing, an easy way to remove a column (variable) is to use its name. Yes, pandas stores that information! This makes sure that only the column you want to delete is removed, whereas with index positions you risk deleting additional columns, because the positions shift after each run of the code.
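
The re-run risk can be reproduced on a small frame. A minimal sketch, assuming Python 3; `make_frame` is a hypothetical helper, and the loop stands in for running the same notebook cell twice:

```python
import pandas as pd


def make_frame():
    # Hypothetical helper: a fresh two-row frame with four of the dataset's columns
    return pd.DataFrame(
        {"sbp": [160, 144], "tobacco": [12.0, 0.01],
         "ldl": [5.73, 4.41], "adiposity": [23.11, 28.61]},
        columns=["sbp", "tobacco", "ldl", "adiposity"],
    )


# Positional drop run twice: the first run removes tobacco,
# the second removes ldl, which has moved into position 1
by_position = make_frame()
for _ in range(2):
    by_position.drop(by_position.columns[[1]], axis=1, inplace=True)

# Name-based drop run twice: errors="ignore" makes the second run a no-op
# (without it, the second run raises because tobacco no longer exists)
by_name = make_frame()
for _ in range(2):
    by_name.drop(["tobacco"], axis=1, inplace=True, errors="ignore")

print(list(by_position.columns))
print(list(by_name.columns))
```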

In [102]:
# Dropping a column using its name (the result is only displayed, not saved back to data)
data.drop('obesity', axis=1)[:10]

Unnamed: 0,sbp,adiposity,famhist,typea,alcohol,age,chd
0,160,23.11,Present,49,97.2,52,1
1,144,28.61,Absent,55,2.06,63,1
2,118,32.28,Present,52,3.81,46,0
3,170,38.03,Present,51,24.26,58,1
4,134,27.78,Present,60,57.34,49,1
5,132,36.21,Present,62,14.14,45,0
6,142,16.2,Absent,59,2.62,38,0
7,114,14.6,Present,62,6.72,58,1
8,114,19.4,Present,49,2.49,29,0
9,132,30.96,Present,69,0.0,53,1


### Keeping columns in dataframe

This can be done in many ways; I'm showing a few here.

In [103]:
# Keeping specific columns from the dataframe using column names
name_data = data[['typea','age']]
name_data[:10]

Unnamed: 0,typea,age
0,49,52
1,55,63
2,52,46
3,51,58
4,60,49
5,62,45
6,59,38
7,62,58
8,49,29
9,69,53


In [104]:
# Keeping specific columns from the dataframe by index position
index_data = data.iloc[:, [3,6]]
index_data[:10]

Unnamed: 0,typea,age
0,49,52
1,55,63
2,52,46
3,51,58
4,60,49
5,62,45
6,59,38
7,62,58
8,49,29
9,69,53


See how they both return the same thing: positions 3 and 6 are the ``typea`` and ``age`` columns.
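
The equivalence can also be checked programmatically. A minimal sketch, assuming Python 3 and a pandas version that provides the `.iloc` and `.loc` indexers; `df` is a three-row excerpt copied from the output above:

```python
import pandas as pd

# Three rows copied from the output above, with the same column order
df = pd.DataFrame(
    [[160, 23.11, "Present", 49, 25.3, 97.2, 52, 1],
     [144, 28.61, "Absent", 55, 28.87, 2.06, 63, 1],
     [118, 32.28, "Present", 52, 29.14, 3.81, 46, 0]],
    columns=["sbp", "adiposity", "famhist", "typea",
             "obesity", "alcohol", "age", "chd"],
)

# Selection by column name
by_name = df[["typea", "age"]]

# Selection by position: all rows, columns 3 and 6
by_position = df.iloc[:, [3, 6]]

# Selection by label with the explicit label indexer
by_label = df.loc[:, ["typea", "age"]]

print(by_name.equals(by_position))
print(by_name.equals(by_label))
```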

### Dropping a row

In [105]:
# Dropping rows that contain a certain value (here: typea equal to 49)
drop_row = data[data.typea != 49]
drop_row[:10]

Unnamed: 0,sbp,adiposity,famhist,typea,obesity,alcohol,age,chd
1,144,28.61,Absent,55,28.87,2.06,63,1
2,118,32.28,Present,52,29.14,3.81,46,0
3,170,38.03,Present,51,31.99,24.26,58,1
4,134,27.78,Present,60,25.99,57.34,49,1
5,132,36.21,Present,62,30.77,14.14,45,0
6,142,16.2,Absent,59,20.81,2.62,38,0
7,114,14.6,Present,62,23.11,6.72,58,1
9,132,30.96,Present,69,30.11,0.0,53,1
10,206,32.27,Absent,72,26.81,56.06,60,1
11,134,22.39,Present,65,23.09,0.0,40,1


Now there are no rows left where ``typea`` equals $49$; rows 0 and 8 are gone from the output.
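
The filter works by building a boolean mask and keeping only the rows where it is true. A minimal sketch, assuming Python 3 and a hypothetical five-row frame `df` whose values are taken from the dataset (the last row here corresponds to row 8 in the full data). It also shows that the surviving rows keep their original index labels unless the index is reset:

```python
import pandas as pd

# Five rows of two columns taken from the dataset; two of them have typea == 49
df = pd.DataFrame(
    {"sbp": [160, 144, 118, 170, 114], "typea": [49, 55, 52, 51, 49]},
    columns=["sbp", "typea"],
)

# Keep only the rows where the condition holds
drop_row = df[df.typea != 49]

# Count how many rows with the unwanted value remain
remaining = int((drop_row.typea == 49).sum())

# Original row labels survive the filter, leaving gaps
labels = list(drop_row.index)

# Optional: renumber the rows from 0 again
renumbered = drop_row.reset_index(drop=True)

print(remaining)
print(labels)
print(list(renumbered.index))
```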