# Introduction to Pandas

### What is Pandas?

* A data analysis library; the name comes from "**pan**el **da**ta."
* Created by Wes McKinney in 2009.
* Implemented in highly optimized Python/Cython.
* Like Excel or R for Python!

### Pandas is used for

* Cleaning data/munging.
* Exploratory analysis.
* Structuring data for plots or tabular display.
* Joining disparate sources.
* Modeling.
* Filtering, extracting, or transforming.

### Importing Pandas

Import Pandas at the top of your notebook. Give it the nickname **pd** so you don't have to keep typing "pandas." (You can nickname it anything or leave out the nickname, but `pd` is the convention.)

In [2]:
import pandas as pd

### Loading a CSV as a DataFrame

Pandas can load many types of files, but one of the most common is .csv (comma-separated values).

In [3]:
titanic = pd.read_csv('data/titanic_train.csv')

This creates a Pandas object called a **DataFrame.**  

DataFrames are powerful containers that have lots of built-in functions for exploring and manipulating your data. 

### Exploring the data using DataFrames

#### Use .head() to examine the top of the DataFrame

In [4]:
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [5]:
titanic.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


#### Use .tail() to examine the bottom

In [6]:
titanic.tail()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


#### The .shape property will tell you how many rows and columns you have

In [7]:
titanic.shape

(891, 12)

#### You can look up the names of your columns using the .columns property.

In [8]:
titanic.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

#### You can access a specific column with bracket syntax (like with dictionaries) using the column's string name.

In [9]:
titanic['Name']

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
5                                       Moran, Mr. James
6                                McCarthy, Mr. Timothy J
7                         Palsson, Master. Gosta Leonard
8      Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
9                    Nasser, Mrs. Nicholas (Adele Achem)
10                       Sandstrom, Miss. Marguerite Rut
11                              Bonnell, Miss. Elizabeth
12                        Saundercock, Mr. William Henry
13                           Andersson, Mr. Anders Johan
14                  Vestrom, Miss. Hulda Amanda Adolfina
15                      Hewlett, Mrs. (Mary D Kingcome) 
16                                  Rice, Master. Eugene
17                          Wil

#### You can also access it using dot notation. (When might this not work?)

In [10]:
titanic.Name

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
5                                       Moran, Mr. James
6                                McCarthy, Mr. Timothy J
7                         Palsson, Master. Gosta Leonard
8      Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
9                    Nasser, Mrs. Nicholas (Adele Achem)
10                       Sandstrom, Miss. Marguerite Rut
11                              Bonnell, Miss. Elizabeth
12                        Saundercock, Mr. William Henry
13                           Andersson, Mr. Anders Johan
14                  Vestrom, Miss. Hulda Amanda Adolfina
15                      Hewlett, Mrs. (Mary D Kingcome) 
16                                  Rice, Master. Eugene
17                          Wil
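
One possible answer to the question above: dot notation only works when the column name is a valid Python identifier and does not collide with an existing DataFrame attribute or method. The sketch below uses a small made-up frame (its values and the `count` column are illustrative, not from the Titanic file).

```python
import pandas as pd

# Hypothetical frame whose column names cause trouble for dot notation.
df = pd.DataFrame({
    "Name": ["Ann", "Bob"],
    "alcohol-use": [3.9, 8.5],  # hyphen: not a valid Python identifier
    "count": [10, 20],          # same name as the DataFrame.count method
})

# Dot notation works for a plain identifier.
print(df.Name.tolist())

# Bracket syntax works for every column name.
print(df["alcohol-use"].tolist())
print(df["count"].tolist())

# df.count refers to the method, not the column.
print(callable(df.count))
```

Bracket syntax is the safer default when column names come from a file you do not control.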

In [11]:
titanic.Name.head()

0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
Name: Name, dtype: object

Notice that this looks a little different from our DataFrame above. That is because it is a Series object, not a DataFrame.

**What's the difference between Pandas' Series and DataFrame objects?**  
Essentially, a Series object contains the data for a single column, and a DataFrame object is a matrix-like container for the Series objects that make up your data. They mostly act like one another, but occasionally you'll run into methods that only work for one.
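
As a minimal sketch of that distinction, the made-up frame below (not the Titanic data) shows that single brackets return a Series while a list of column names returns a DataFrame.

```python
import pandas as pd

# Made-up data standing in for the Titanic file.
df = pd.DataFrame({"Name": ["Ann", "Bob", "Cy"], "Age": [22.0, 38.0, 26.0]})

one_column = df["Name"]    # single brackets -> Series
sub_frame = df[["Name"]]   # list of names -> DataFrame with one column

print(type(one_column).__name__)
print(type(sub_frame).__name__)

# A Series has one dimension; a DataFrame has two.
print(one_column.shape, sub_frame.shape)
```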

#### Examining Your Data With .info()  
Provides information about:

* The name of the column/variable attribute.
* The type of index (RangeIndex is default).
* The count of non-null values by column/attribute.
* The type of data contained in the column/attribute.
* The number of columns of each dtype (pandas data type).
* The memory usage of our data set.

In [12]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


Types affect the way data is represented in machine learning models, whether we can apply math operators to them, etc.   

Some common problems with working with a new dataset:  
* Missing values.
* Unexpected types (string/object instead of int/float).
* Dirty data (commas, dollar signs, unexpected characters, etc.).
* Blank values that are actually "non-null" or single white-space characters.
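
The checks below are one way to spot these problems. This is a sketch on made-up data; the column names `price` and `city` are illustrative assumptions, not part of any data set used here.

```python
import pandas as pd

# Made-up messy data: a missing value, numbers stored as text with
# dollar signs and commas, and a "blank" that is really a single space.
raw = pd.DataFrame({
    "price": ["$1,200", "$950", None],
    "city": ["Austin", " ", "Boston"],
})

# Missing values per column.
missing = raw.isna().sum()

# Strip the unwanted characters, then convert the text to numbers.
# errors="coerce" turns anything unparseable into NaN.
cleaned_price = pd.to_numeric(
    raw["price"].str.replace(r"[$,]", "", regex=True), errors="coerce"
)

# Whitespace-only strings count as non-null until we blank them out.
city = raw["city"].str.strip()
city = city.mask(city == "")

print(missing)
print(cleaned_price.tolist())
print(city.isna().sum())
```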

#### Summarize the data with .describe()
It gives us the following statistics:

* Count, which is the number of non-null values in the column (so it can be smaller than the number of rows).
* Mean, or, the average of the values in the column.
* Std, which is the standard deviation.
* Min, a.k.a., the minimum value.
* 25%, or, the 25th percentile of the values.
* 50%, or, the 50th percentile of the values (which is equivalent to the median).
* 75%, or, the 75th percentile of the values.
* Max, which is the maximum value.  

Let's try this on a single column as well as the entire DataFrame.

In [13]:
titanic['Age'].describe()

count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64

In [14]:
titanic.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


There are also built-in math functions that will work on all columns of a DataFrame at once, as well as subsets of the data.

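#### For example, I can use the .mean() method on the titanic DataFrame to get the mean for every numeric column.

In [15]:
# numeric_only=True skips text columns such as Name; recent pandas versions raise an error without it.
titanic.mean(numeric_only=True)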

PassengerId    446.000000
Survived         0.383838
Pclass           2.308642
Age             29.699118
SibSp            0.523008
Parch            0.381594
Fare            32.204208
dtype: float64
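
The same idea can be applied to one column or to a subset of columns. The sketch below uses a small made-up frame rather than the Titanic file, and also shows why `count` can be smaller than the number of rows.

```python
import pandas as pd

# Made-up data with one missing Age.
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy", "Di"],
    "Age": [22.0, None, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# count skips missing values, so it can be smaller than the row count.
print(len(df), df["Age"].count())

# Mean of one column, of a subset of columns, and of all numeric columns.
print(df["Age"].mean())
print(df[["Age", "Fare"]].mean())
print(df.mean(numeric_only=True))
```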

### Reading in trickier file types

This worked well above because `.read_csv()` expects a comma-separated file with a header row, and that is what we gave it. What happens when the file doesn't match those expectations?

In [16]:
golf = pd.read_csv('data/playgolf.csv')

In [17]:
golf.head()

Unnamed: 0,07-01-2014|sunny|85|85|false|Don't Play
0,07-02-2014|sunny|80|90|true|Don't Play
1,07-03-2014|overcast|83|78|false|Play
2,07-04-2014|rain|70|96|false|Play
3,07-05-2014|rain|68|80|false|Play
4,07-06-2014|rain|65|70|true|Don't Play


What happened here? Let's Google `pandas .read_csv` to look at the documentation and troubleshoot. 

In [18]:
golf = pd.read_csv('data/playgolf.csv', sep = '|')

In [19]:
golf.head()

Unnamed: 0,07-01-2014,sunny,85,85.1,false,Don't Play
0,07-02-2014,sunny,80,90,True,Don't Play
1,07-03-2014,overcast,83,78,False,Play
2,07-04-2014,rain,70,96,False,Play
3,07-05-2014,rain,68,80,False,Play
4,07-06-2014,rain,65,70,True,Don't Play


We fixed part of the problem, but we still need pandas to understand we don't have a header in this file.

In [20]:
golf_cols = ["Date", "Outlook", "Temperature", "Humidity", "Windy", "Result"]
golf = pd.read_csv('data/playgolf.csv', sep = '|', header = None, names = golf_cols)

In [21]:
golf.head()

Unnamed: 0,Date,Outlook,Temperature,Humidity,Windy,Result
0,07-01-2014,sunny,85,85,False,Don't Play
1,07-02-2014,sunny,80,90,True,Don't Play
2,07-03-2014,overcast,83,78,False,Play
3,07-04-2014,rain,70,96,False,Play
4,07-05-2014,rain,68,80,False,Play


The `skiprows` and `skipfooter` arguments may also be useful if you have collaborators who make extra notes in their data files that you need to ignore.
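
Here is a minimal sketch of those two arguments. The file contents are made up and read from an in-memory string instead of a file on disk, and the note lines are hypothetical.

```python
import io
import pandas as pd

# Made-up file contents: one note line at the top, pipe-delimited rows
# with no header, and one note line at the bottom.
text = "\n".join([
    "exported by a collaborator",
    "07-01-2014|sunny|85|85|false|Don't Play",
    "07-02-2014|sunny|80|90|true|Don't Play",
    "07-03-2014|overcast|83|78|false|Play",
    "remember to re-check the July readings",
])

cols = ["Date", "Outlook", "Temperature", "Humidity", "Windy", "Result"]
golf_demo = pd.read_csv(
    io.StringIO(text),
    sep="|",
    header=None,
    names=cols,
    skiprows=1,       # ignore the first line
    skipfooter=1,     # ignore the last line
    engine="python",  # skipfooter is not supported by the default C engine
)
print(golf_demo)
```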

# Independent Practice Time

Now that we have some basics down, let's practice some basic DataFrame use on a new data set.

**Pro tip:** When your cursor is in a string, you can use the "tab" key to browse file system resources and get a relative reference for the files that can be loaded in Jupyter notebook. Remember, you have to use your arrow keys to navigate the files populated in the UI.

1. Find and load the diamonds data set into a DataFrame.
2. Print out the columns.
3. What does the data set look like in terms of dimensions?
4. Check the types of each column.  
    a. What is the most common type?   
    b. How many entries are there?   
    c. How much memory does this data set consume?
5. Examine the summary statistics of the data set.

In [22]:
diamonds = pd.read_csv("data/diamonds.csv")

In [23]:
diamonds.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


In [24]:
diamonds.columns

Index(['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'price', 'x', 'y',
       'z'],
      dtype='object')

In [25]:
diamonds.shape

(53940, 10)

In [26]:
diamonds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
carat      53940 non-null float64
cut        53940 non-null object
color      53940 non-null object
clarity    53940 non-null object
depth      53940 non-null float64
table      53940 non-null float64
price      53940 non-null int64
x          53940 non-null float64
y          53940 non-null float64
z          53940 non-null float64
dtypes: float64(6), int64(1), object(3)
memory usage: 4.1+ MB


In [27]:
diamonds.describe()

Unnamed: 0,carat,depth,table,price,x,y,z
count,53940.0,53940.0,53940.0,53940.0,53940.0,53940.0,53940.0
mean,0.79794,61.749405,57.457184,3932.799722,5.731157,5.734526,3.538734
std,0.474011,1.432621,2.234491,3989.439738,1.121761,1.142135,0.705699
min,0.2,43.0,43.0,326.0,0.0,0.0,0.0
25%,0.4,61.0,56.0,950.0,4.71,4.72,2.91
50%,0.7,61.8,57.0,2401.0,5.7,5.71,3.53
75%,1.04,62.5,59.0,5324.25,6.54,6.54,4.04
max,5.01,79.0,95.0,18823.0,10.74,58.9,31.8


### Bonus Practice
Open `diamonds.csv` (or another of the data files) in Excel and save it with a different delimiter (e.g. as a `.txt` file), delete the header row, and add in some notes at the bottom of the file.  

#### OR

Read in one of your own data files. Make sure the path is correct. 

## Pandas Indexing

#### Let's read in the drug dataset for practicing indexing

In [28]:
drug = pd.read_csv("data/drug.csv")
drug.head()

Unnamed: 0,age,n,alcohol-use,alcohol-frequency,marijuana-use,marijuana-frequency,cocaine-use,cocaine-frequency,crack-use,crack-frequency,...,oxycontin-use,oxycontin-frequency,tranquilizer-use,tranquilizer-frequency,stimulant-use,stimulant-frequency,meth-use,meth-frequency,sedative-use,sedative-frequency
0,12,2798,3.9,3.0,1.1,4.0,0.1,5.0,0.0,-,...,0.1,24.5,0.2,52.0,0.2,2.0,0.0,-,0.2,13.0
1,13,2757,8.5,6.0,3.4,15.0,0.1,1.0,0.0,3.0,...,0.1,41.0,0.3,25.5,0.3,4.0,0.1,5.0,0.1,19.0
2,14,2792,18.1,5.0,8.7,24.0,0.1,5.5,0.0,-,...,0.4,4.5,0.9,5.0,0.8,12.0,0.1,24.0,0.2,16.5
3,15,2956,29.2,6.0,14.5,25.0,0.5,4.0,0.1,9.5,...,0.8,3.0,2.0,4.5,1.5,6.0,0.3,10.5,0.4,30.0
4,16,3058,40.1,10.0,22.5,30.0,1.0,7.0,0.0,1.0,...,1.1,4.0,2.4,11.0,1.8,9.5,0.3,36.0,0.2,3.0


Often we'll want to operate on a specific portion of our data. With indexing, we can pull out a specific part of our DataFrame.  

pandas has two main properties you can use for indexing:

* **.loc** indexes with the labels for rows and columns.
* **.iloc** indexes with the integer positions for rows and columns. 

#### Using the .loc indexer, let's pull out row 0 and all columns. The pattern is `dataframe.loc[rows, columns]`.

In [31]:
drug.loc[0, :]

age                          12
n                          2798
alcohol-use                 3.9
alcohol-frequency             3
marijuana-use               1.1
marijuana-frequency           4
cocaine-use                 0.1
cocaine-frequency           5.0
crack-use                     0
crack-frequency               -
heroin-use                  0.1
heroin-frequency           35.5
hallucinogen-use            0.2
hallucinogen-frequency       52
inhalant-use                1.6
inhalant-frequency         19.0
pain-releiver-use             2
pain-releiver-frequency      36
oxycontin-use               0.1
oxycontin-frequency        24.5
tranquilizer-use            0.2
tranquilizer-frequency       52
stimulant-use               0.2
stimulant-frequency           2
meth-use                      0
meth-frequency                -
sedative-use                0.2
sedative-frequency           13
Name: 0, dtype: object

#### What if I want multiple rows? Let's get rows 0, 1, and 2 by passing in a list

In [32]:
drug.loc[[0,1,2], :]

Unnamed: 0,age,n,alcohol-use,alcohol-frequency,marijuana-use,marijuana-frequency,cocaine-use,cocaine-frequency,crack-use,crack-frequency,...,oxycontin-use,oxycontin-frequency,tranquilizer-use,tranquilizer-frequency,stimulant-use,stimulant-frequency,meth-use,meth-frequency,sedative-use,sedative-frequency
0,12,2798,3.9,3.0,1.1,4.0,0.1,5.0,0.0,-,...,0.1,24.5,0.2,52.0,0.2,2.0,0.0,-,0.2,13.0
1,13,2757,8.5,6.0,3.4,15.0,0.1,1.0,0.0,3.0,...,0.1,41.0,0.3,25.5,0.3,4.0,0.1,5.0,0.1,19.0
2,14,2792,18.1,5.0,8.7,24.0,0.1,5.5,0.0,-,...,0.4,4.5,0.9,5.0,0.8,12.0,0.1,24.0,0.2,16.5


#### Can you think of a more efficient way to do this?

In [34]:
drug.loc[0:2, :]

Unnamed: 0,age,n,alcohol-use,alcohol-frequency,marijuana-use,marijuana-frequency,cocaine-use,cocaine-frequency,crack-use,crack-frequency,...,oxycontin-use,oxycontin-frequency,tranquilizer-use,tranquilizer-frequency,stimulant-use,stimulant-frequency,meth-use,meth-frequency,sedative-use,sedative-frequency
0,12,2798,3.9,3.0,1.1,4.0,0.1,5.0,0.0,-,...,0.1,24.5,0.2,52.0,0.2,2.0,0.0,-,0.2,13.0
1,13,2757,8.5,6.0,3.4,15.0,0.1,1.0,0.0,3.0,...,0.1,41.0,0.3,25.5,0.3,4.0,0.1,5.0,0.1,19.0
2,14,2792,18.1,5.0,8.7,24.0,0.1,5.5,0.0,-,...,0.4,4.5,0.9,5.0,0.8,12.0,0.1,24.0,0.2,16.5


Note that `.loc` is inclusive on both sides. This is different from the behavior of some other Python functions, like `range`.

#### Let's do the same thing for columns and just select the `sedative-use` and `sedative-frequency` columns

In [36]:
drug.loc[:, 'sedative-use':'sedative-frequency']

Unnamed: 0,sedative-use,sedative-frequency
0,0.2,13.0
1,0.1,19.0
2,0.2,16.5
3,0.4,30.0
4,0.2,3.0
5,0.5,6.5
6,0.4,10.0
7,0.3,6.0
8,0.5,4.0
9,0.3,9.0


#### We can pull out rows and columns. Let's pull out rows 0 through 2 and `sedative-use` and `sedative-frequency` columns.

In [38]:
drug.loc[0:2, 'sedative-use':'sedative-frequency']

Unnamed: 0,sedative-use,sedative-frequency
0,0.2,13.0
1,0.1,19.0
2,0.2,16.5


#### We can do the same thing with the .iloc indexer. This time we use integers for the position. Let's get all rows, and the columns in positions 0 and 3.

In [39]:
drug.iloc[:,[0,3]]

Unnamed: 0,age,alcohol-frequency
0,12,3.0
1,13,6.0
2,14,5.0
3,15,6.0
4,16,10.0
5,17,13.0
6,18,24.0
7,19,36.0
8,20,48.0
9,21,52.0


#### Let's get all of the rows and the first four columns (positions 0 through 3) using `.iloc`

In [40]:
drug.iloc[:, 0:4]

Unnamed: 0,age,n,alcohol-use,alcohol-frequency
0,12,2798,3.9,3.0
1,13,2757,8.5,6.0
2,14,2792,18.1,5.0
3,15,2956,29.2,6.0
4,16,3058,40.1,10.0
5,17,3038,49.3,13.0
6,18,2469,58.7,24.0
7,19,2223,64.6,36.0
8,20,2271,69.7,48.0
9,21,2354,83.2,52.0


Note that `.iloc` is inclusive of the first number but exclusive of the second number. This is more like `range`.
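
To make the contrast concrete, here is a small sketch on a made-up frame. The string row labels `a` through `d` are an assumption chosen so that labels and positions are clearly different things.

```python
import pandas as pd

# Made-up frame with string row labels so labels and positions differ.
df = pd.DataFrame(
    {"age": [12, 13, 14, 15], "n": [2798, 2757, 2792, 2956]},
    index=["a", "b", "c", "d"],
)

by_label = df.loc["a":"c", "age":"n"]  # labels: both endpoints included
by_position = df.iloc[0:2, 0:1]        # positions: stop is excluded

print(by_label.shape)     # (3, 2)
print(by_position.shape)  # (2, 1)
```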

#### Let's get the first four rows and the first two columns

In [43]:
drug.iloc[0:4, 0:2]

Unnamed: 0,age,n
0,12,2798
1,13,2757
2,14,2792
3,15,2956


### Creating DataFrames

You can create your own DataFrame without importing data from a file using pd.DataFrame() on a dictionary.  
Make sure the dictionary has lists of values that are all the same length. The keys correspond to the names of the columns, and the values correspond to the data in the columns.

In [45]:
mydata = pd.DataFrame({'Letters':['A','B','C'], 'Integers':[1,2,3], 'Floats':[2.2, 3.3, 4.4]})
mydata

Unnamed: 0,Floats,Integers,Letters
0,2.2,1,A
1,3.3,2,B
2,4.4,3,C
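
As a sketch of two related points, the block below builds the same kind of small frame from a dictionary of lists and from a list of row dictionaries, then shows what happens when the lists have different lengths. The variable names are illustrative.

```python
import pandas as pd

# The same small table built two ways.
from_dict = pd.DataFrame({"Letters": ["A", "B", "C"], "Integers": [1, 2, 3]})
from_rows = pd.DataFrame([
    {"Letters": "A", "Integers": 1},
    {"Letters": "B", "Integers": 2},
    {"Letters": "C", "Integers": 3},
])
print(from_dict.equals(from_rows))

# Lists of different lengths are rejected.
try:
    pd.DataFrame({"Letters": ["A", "B", "C"], "Integers": [1, 2]})
except ValueError as err:
    print("ValueError:", err)
```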


#### Examine the data types

Use .dtypes on your DataFrame.  

In [46]:
mydata.dtypes

Floats      float64
Integers      int64
Letters      object
dtype: object

Strings are stored as a type called "object," as they are not guaranteed to take up a set amount of space (strings can be any length).
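
When a numeric column arrives as text, `.astype()` is one way to convert it. This is a sketch on made-up data; depending on your pandas version, text columns may be reported as `object` or as a dedicated string type.

```python
import pandas as pd

# Made-up frame where a numeric column arrived as text.
df = pd.DataFrame({"Letters": ["A", "B", "C"], "Integers": ["1", "2", "3"]})
print(df.dtypes)

# Convert the text column to integers so math works on it.
df["Integers"] = df["Integers"].astype(int)
print(df.dtypes)
print(df["Integers"].sum())
```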

#### Rename columns

Change the column name Integers to Ints:

In [48]:
mydata.rename(columns={'Integers':'Ints'},inplace=True)
mydata

Unnamed: 0,Floats,Ints,Letters
0,2.2,1,A
1,3.3,2,B
2,4.4,3,C


Why did we have to use `inplace` this time? Let's check the documentation. See that `inplace=False` is the default for this method: `rename` returns a new DataFrame and leaves the original unchanged. It's pandas' way of protecting us from modifying our data by accident.
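
A short sketch of that behavior, using a made-up frame: without `inplace=True` the original keeps its column names, and reassigning the result is a common alternative.

```python
import pandas as pd

df = pd.DataFrame({"Integers": [1, 2, 3], "Floats": [2.2, 3.3, 4.4]})

# Without inplace=True, rename returns a new DataFrame and leaves df alone.
renamed = df.rename(columns={"Integers": "Ints"})
print(list(df.columns))       # ['Integers', 'Floats']
print(list(renamed.columns))  # ['Ints', 'Floats']

# Reassigning the result is a common alternative to inplace=True.
df = df.rename(columns={"Integers": "Ints"})
print(list(df.columns))       # ['Ints', 'Floats']
```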

#### Rename all of the columns by assigning a list to the .columns property

In [49]:
mydata.columns=['A','B','C']
mydata

Unnamed: 0,A,B,C
0,2.2,1,A
1,3.3,2,B
2,4.4,3,C
