In [1]:
# import the pandas module
import pandas as pd

In [2]:
# examine the built-in documentation (and in the process, confirm that pandas imported as expected)
help(pd)

Help on package pandas:

NAME
    pandas

DESCRIPTION
    pandas - a powerful data analysis and manipulation library for Python
    
    **pandas** is a Python package providing fast, flexible, and expressive data
    structures designed to make working with "relational" or "labeled" data both
    easy and intuitive. It aims to be the fundamental high-level building block for
    doing practical, **real world** data analysis in Python. Additionally, it has
    the broader goal of becoming **the most powerful and flexible open source data
    analysis / manipulation tool available in any language**. It is already well on
    its way toward this goal.
    
    Main Features
    -------------
    Here are just a few of the things that pandas does well:
    
      - Easy handling of missing data in floating point as well as non-floating
        point data.
      - Size mutability: columns can be inserted and deleted from DataFrame and
        higher dimensional objects
      - Automatic an

In [3]:
# if you used an alias when importing, remember to use it when using help
help(pandas)

NameError: name 'pandas' is not defined

Beyond using `help` to learn about the capabilities of an imported package, we can also use `dir` to list the names it makes available (classes, functions, and submodules)...

In [4]:
dir(pd)

['Categorical',
 'CategoricalDtype',
 'CategoricalIndex',
 'DataFrame',
 'DateOffset',
 'DatetimeIndex',
 'DatetimeTZDtype',
 'ExcelFile',
 'ExcelWriter',
 'Float64Index',
 'Grouper',
 'HDFStore',
 'Index',
 'IndexSlice',
 'Int16Dtype',
 'Int32Dtype',
 'Int64Dtype',
 'Int64Index',
 'Int8Dtype',
 'Interval',
 'IntervalDtype',
 'IntervalIndex',
 'MultiIndex',
 'NaT',
 'Panel',
 'Period',
 'PeriodDtype',
 'PeriodIndex',
 'RangeIndex',
 'Series',
 'SparseArray',
 'SparseDataFrame',
 'SparseDtype',
 'SparseSeries',
 'TimeGrouper',
 'Timedelta',
 'TimedeltaIndex',
 'Timestamp',
 'UInt16Dtype',
 'UInt32Dtype',
 'UInt64Dtype',
 'UInt64Index',
 'UInt8Dtype',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__docformat__',
 '__file__',
 '__git_version__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 '_hashtable',
 '_lib',
 '_libs',
 '_np_version_under1p13',
 '_np_version_under1p14',
 '_np_version_under1p15',
 '_tslib',
 '_version',
 'api',
 'array',
 'array
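
Because `dir` returns an ordinary list of strings, it can be filtered like any other list. The following is a minimal sketch, assuming we are looking for functions that read data; the `read_` prefix and the variable name `readers` are our own choices for illustration.

```python
import pandas as pd

# Keep only the names in the pandas namespace that start with "read_".
readers = [name for name in dir(pd) if name.startswith("read_")]

# The exact contents depend on the installed pandas version.
print(readers)
```

The resulting list is much shorter than the full `dir(pd)` output, which makes promising candidates easier to spot.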

Once we've found a function that looks promising, we can again use `help` for more information...

In [5]:
help(pd.read_csv)

Help on function read_csv in module pandas.io.parsers:

read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=None, error_bad_lines=True, warn_bad_lines=True, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None)
    Read a comma-separated values (csv) file into DataFrame.
    
    Also supports option
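
The signature that `help` prints on one long line can be hard to scan. One possible alternative, sketched here with the standard library's `inspect` module (not something the original notebook uses), is to list the parameters one per line. The variable name `params` is ours.

```python
import inspect

import pandas as pd

# inspect.signature exposes the same parameters that help() prints in one line.
params = inspect.signature(pd.read_csv).parameters

# Print each parameter name with its default value, one per line.
# The defaults shown depend on the installed pandas version.
for name, param in params.items():
    print(name, "=", param.default)
```

Two of these parameters, `index_col` and `parse_dates`, are the ones used later in this notebook.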

Armed with this knowledge, we can now put it into use...

In [6]:
# import our data from a csv file
data = pd.read_csv('patron.csv')

# see all the data
print(data)

        ID           Last          First               Address  \
0      169           Chan        Killian          10 BEECH AVE   
1      372       Humphrey         Quincy      10 FAIRMOUNT AVE   
2      493        Parrish      Charlotte       10 HAMILTON AVE   
3      498         Barron        Delilah       10 HAMILTON AVE   
4      773          Bowen          Miley       10 IRONWOOD CIR   
5      892           West           Avah      10 JOHNNYCAKE RD   
6     1007         Austin          Imani           10 LYNCH RD   
7     1134          Hardy          Abbie      10 MCCORMICK AVE   
8     1138           Case          Averi          10 MEADOW RD   
9     1220          Ayala         Darian             10 OAK DR   
10    1289        Meadows         Austin        10 PARSONS AVE   
11    1417          Moore          Amina        10 PARSONS AVE   
12    1421          Hodge          Chase           10 PENNY LN   
13    1422         Dodson         Boston    10 PHILADELPHIA RD   
14    1482

In [6]:
# see how the data is being interpreted by python
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3455 entries, 0 to 3454
Data columns (total 12 columns):
ID            3455 non-null int64
Last          3455 non-null object
First         3455 non-null object
Address       3455 non-null object
City          3454 non-null object
State         3455 non-null object
Zip           3455 non-null int64
DOB           3455 non-null object
College       3455 non-null object
Department    3455 non-null object
Major         3455 non-null object
Degree        3455 non-null object
dtypes: int64(2), object(10)
memory usage: 324.0+ KB


Notice that pandas has added a RangeIndex to our data to number the rows. Since we already have an ID column that identifies each row, we should use it as the index instead...

In [7]:
# re-read the data, using the ID column as the index
data = pd.read_csv('patron.csv', index_col='ID')

# check the result
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3455 entries, 169 to 604
Data columns (total 11 columns):
Last          3455 non-null object
First         3455 non-null object
Address       3455 non-null object
City          3454 non-null object
State         3455 non-null object
Zip           3455 non-null int64
DOB           3455 non-null object
College       3455 non-null object
Department    3455 non-null object
Major         3455 non-null object
Degree        3455 non-null object
dtypes: int64(1), object(10)
memory usage: 323.9+ KB
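
The effect of `index_col` is easier to see on a small example. This sketch assumes a tiny in-memory stand-in for `patron.csv` (three rows and four columns copied from the output above, read through `io.StringIO`); the variable names are ours.

```python
import io

import pandas as pd

# A tiny stand-in for patron.csv, using values from the output above.
csv_text = """ID,Last,First,City
169,Chan,Killian,BALTIMORE
372,Humphrey,Quincy,ARLINGTON
493,Parrish,Charlotte,BALTIMORE
"""

# Default: pandas numbers the rows itself and keeps ID as an ordinary column.
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col, the ID values become the row labels instead.
id_index = pd.read_csv(io.StringIO(csv_text), index_col="ID")

print(list(default_index.columns))  # ['ID', 'Last', 'First', 'City']
print(list(id_index.columns))       # ['Last', 'First', 'City']

# Rows can now be looked up by their ID.
print(id_index.loc[372, "Last"])    # Humphrey
```

Using ID as the index also means rows can be selected by patron ID with `.loc`, rather than by their position in the file.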


In [8]:
# and see how it prints out
print(data)

               Last          First               Address              City  \
ID                                                                           
169            Chan        Killian          10 BEECH AVE         BALTIMORE   
372        Humphrey         Quincy      10 FAIRMOUNT AVE         ARLINGTON   
493         Parrish      Charlotte       10 HAMILTON AVE         BALTIMORE   
498          Barron        Delilah       10 HAMILTON AVE          ROSEDALE   
773           Bowen          Miley       10 IRONWOOD CIR  MOUNT WASHINGTON   
892            West           Avah      10 JOHNNYCAKE RD         GWYNN OAK   
1007         Austin          Imani           10 LYNCH RD           DUNDALK   
1134          Hardy          Abbie      10 MCCORMICK AVE         BALTIMORE   
1138           Case          Averi          10 MEADOW RD         BALTIMORE   
1220          Ayala         Darian             10 OAK DR         GWYNN OAK   
1289        Meadows         Austin        10 PARSONS AVE        

With ID serving as the index, the separate row-number column is gone and there's a little more room for our data. We can do better, though...

In [9]:
# use the DataFrame's head method to see the first 5 rows of data (formatted a little more nicely...)
data.head()

ID,Last,First,Address,City,State,Zip,DOB,College,Department,Major,Degree
169,Chan,Killian,10 BEECH AVE,BALTIMORE,MD,21206,05/26/1977,School of Arts and Sciences,Anthropology,Anthropology,Ph.D.
372,Humphrey,Quincy,10 FAIRMOUNT AVE,ARLINGTON,MD,21215,02/09/2001,School of Arts and Sciences,Anthropology,Anthropology,Ph.D.
493,Parrish,Charlotte,10 HAMILTON AVE,BALTIMORE,MD,21237,05/25/1997,School of Arts and Sciences,Anthropology,Anthropology,Ph.D.
498,Barron,Delilah,10 HAMILTON AVE,ROSEDALE,MD,21237,08/22/1970,School of Arts and Sciences,Anthropology,Anthropology,Ph.D.
773,Bowen,Miley,10 IRONWOOD CIR,MOUNT WASHINGTON,MD,21209,03/17/1986,School of Arts and Sciences,Anthropology,Anthropology,Ph.D.
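
Five rows is only the default. As a small sketch, assuming a hand-built frame with a few names from the output above (the frame `sample` is ours, not the notebook's `data`), both `head` and `tail` accept the number of rows to return.

```python
import pandas as pd

# A hand-built frame standing in for the patron data.
sample = pd.DataFrame(
    {"Last": ["Chan", "Humphrey", "Parrish", "Barron"],
     "First": ["Killian", "Quincy", "Charlotte", "Delilah"]},
    index=pd.Index([169, 372, 493, 498], name="ID"),
)

# Ask for a specific number of rows from either end.
first_two = sample.head(2)
last_one = sample.tail(1)

print(first_two)
print(last_one)
```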


Let's take another look at our info, to see if we can make any other improvements in how we read our data...

In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3455 entries, 169 to 604
Data columns (total 11 columns):
Last          3455 non-null object
First         3455 non-null object
Address       3455 non-null object
City          3454 non-null object
State         3455 non-null object
Zip           3455 non-null int64
DOB           3455 non-null object
College       3455 non-null object
Department    3455 non-null object
Major         3455 non-null object
Degree        3455 non-null object
dtypes: int64(1), object(10)
memory usage: 323.9+ KB


Notice that our date-of-birth (DOB) column is stored as generic `object` data rather than as dates. We can fix this by telling pandas to parse that column. The `parse_dates` argument of `read_csv()` accepts a list of the columns to interpret as dates, given either by name or by position. Here we use the position. Remember that in Python, indexes start at zero and that ID is still the first column of the file (even though the output of `info` no longer counts it as a _data_ column), so DOB is at index 7.

In [11]:
# re-read our data, parsing the DOB column (which has index 7) as dates
data = pd.read_csv('patron.csv', index_col='ID', parse_dates=[7])

# confirm that DOB is now a datetime object
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3455 entries, 169 to 604
Data columns (total 11 columns):
Last          3455 non-null object
First         3455 non-null object
Address       3455 non-null object
City          3454 non-null object
State         3455 non-null object
Zip           3455 non-null int64
DOB           3455 non-null datetime64[ns]
College       3455 non-null object
Department    3455 non-null object
Major         3455 non-null object
Degree        3455 non-null object
dtypes: datetime64[ns](1), int64(1), object(9)
memory usage: 323.9+ KB
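
Counting column positions by hand is easy to get wrong, so it can be more readable to pass the column name to `parse_dates`. The sketch below assumes a small in-memory stand-in for `patron.csv` with only three columns, so DOB sits at position 2 here rather than 7; the variable names are ours.

```python
import io

import pandas as pd

# A tiny stand-in for patron.csv, using values from the output above.
csv_text = """ID,Last,DOB
169,Chan,05/26/1977
372,Humphrey,02/09/2001
"""

# Select the date column by its position in the file (ID is position 0)...
by_position = pd.read_csv(io.StringIO(csv_text), index_col="ID", parse_dates=[2])

# ...or by its name, which keeps working if columns are added or reordered.
by_name = pd.read_csv(io.StringIO(csv_text), index_col="ID", parse_dates=["DOB"])

# Both approaches should report a datetime dtype for DOB.
print(by_position["DOB"].dtype)
print(by_name["DOB"].dtype)
```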


In [12]:
# check our work, this time using tail to see the last 5 rows of our data
data.tail()

ID,Last,First,Address,City,State,Zip,DOB,College,Department,Major,Degree
3297,Howe,Ariel,8540 KAVANAGH RD,DUNDALK,MD,21222,1985-11-06,School of Public Health,"Population, Family and Reproductive Health",Public Health Studies,Ph.D.
3435,Burns,Miles,8548 Kavanagh Rd,Dundalk,MD,21222,1967-04-25,School of Public Health,"Population, Family and Reproductive Health",Public Health Studies,Ph.D.
1516,Mccann,Jane,8548 KAVANAGH RD,DUNDALK,MD,25601,1979-04-24,School of Public Health,"Population, Family and Reproductive Health",Social & Behavioral Sciences,Dr.P.H.
1384,Ball,Mathew,854 MILDRED AVE,DUNDALK,MD,21341,1995-05-23,School of Public Health,"Population, Family and Reproductive Health",Social & Behavioral Sciences,Ph.D.
604,Wright,Leo,8555 PULASKI HWY,ROSEDALE,MD,21237,1963-12-10,School of Public Health,"Population, Family, and Reproductive Health",Social & Behavioral Sciences,Ph.D.


From the way our DOB column is now formatted (year-month-day, rather than month/day/year), we can tell that it's being interpreted as date information, so we can now treat it as such.
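
As a brief illustration of what treating the column as dates allows, the sketch below builds a small Series from two birth dates shown in the `tail` output above (the Series `dob` and the 1970 cutoff are our own choices) and uses the `.dt` accessor and date comparisons.

```python
import pandas as pd

# Two birth dates from the tail() output above, labeled by patron ID.
dob = pd.Series(
    pd.to_datetime(["1985-11-06", "1963-12-10"]),
    index=pd.Index([3297, 604], name="ID"),
    name="DOB",
)

# The .dt accessor exposes individual date components.
print(dob.dt.year.tolist())  # [1985, 1963]

# Dates compare chronologically, so they can be used to filter rows.
born_before_1970 = dob[dob < pd.Timestamp("1970-01-01")]
print(born_before_1970.index.tolist())  # [604]
```

None of this would work while the column was still stored as plain strings.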