# Introduction to pandas

---

In this notebook, we will learn the basics of pandas, a commonly used data science library.

### Table of Contents

1 - [Introduction](#intro)<br>

2 - [DataFrames](#dataframes)<br>

3 - [Tools for Examining DataFrames](#tools)<br>

4 - [Indexing into DataFrames](#indexing)<br>

5 - [Manipulating DataFrames](#manipulating)<br>

6 - [Wrap-Up](#wrapup)<br>

Estimated time: 30 minutes

---

## 1. Introduction <a id='intro'></a>

<i>What is pandas?</i>

Here is the Wikipedia definition of pandas: "pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series."

Basically, pandas is yet another tool we can use to manipulate data. 

Let's begin our exploration of pandas! First, we need to import it. The standard way of importing pandas is with this line of code:

In [2]:
import pandas as pd

---

## 2. DataFrames <a id='dataframes'></a>

A data structure is a particular way of representing and storing data in code. The <b>DataFrame</b> is the most used data structure in pandas. 

<b>Definition</b>: A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types.

Basically, a DataFrame is a table-like thing, with rows and columns, that stores data for us.

Let's create our very first DataFrame. One way we can create a DataFrame is to use pandas' read_csv function, passing in a file name. The data in this file will be turned into a DataFrame, which we can save into a variable called df.

In [3]:
df = pd.read_csv("https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/Titanic.csv")
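
If you want to try `read_csv` without a network connection, here is a minimal sketch that reads CSV text held in a string. The variable names `csv_text` and `toy` are made up for illustration, and the three rows are just a small sample in the same shape as the Titanic data.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk or at a URL.
csv_text = """Name,PClass,Age,Survived
"Allen, Miss Elisabeth Walton",1st,29.0,1
"Allison, Miss Helen Loraine",1st,2.0,0
"Anderson, Mr Harry",1st,47.0,1
"""

# read_csv accepts a path, a URL, or a file-like object such as StringIO.
toy = pd.read_csv(io.StringIO(csv_text))

print(type(toy))  # the result is a pandas DataFrame
print(toy.shape)  # (number of rows, number of columns)
```

The first line of the text becomes the column names, and each later line becomes one row.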

---

## 3. Tools for Examining DataFrames <a id ='tools'></a>

It is always good practice to get some preliminary information about your data before attempting analysis. Here are some pandas functions that will help us do that:

---
a) `head`

The `head` function will give us the first 5 rows in our DataFrame to examine. If we want to get a different number of rows, we can pass in a number when we call the function.

In [4]:
# Gets first 5 rows
df.head()

Unnamed: 0.1,Unnamed: 0,Name,PClass,Age,Sex,Survived,SexCode
0,1,"Allen, Miss Elisabeth Walton",1st,29.0,female,1,1
1,2,"Allison, Miss Helen Loraine",1st,2.0,female,0,1
2,3,"Allison, Mr Hudson Joshua Creighton",1st,30.0,male,0,0
3,4,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.0,female,0,1
4,5,"Allison, Master Hudson Trevor",1st,0.92,male,1,0


In [5]:
# Gets first 10 rows
df.head(10)

Unnamed: 0.1,Unnamed: 0,Name,PClass,Age,Sex,Survived,SexCode
0,1,"Allen, Miss Elisabeth Walton",1st,29.0,female,1,1
1,2,"Allison, Miss Helen Loraine",1st,2.0,female,0,1
2,3,"Allison, Mr Hudson Joshua Creighton",1st,30.0,male,0,0
3,4,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.0,female,0,1
4,5,"Allison, Master Hudson Trevor",1st,0.92,male,1,0
5,6,"Anderson, Mr Harry",1st,47.0,male,1,0
6,7,"Andrews, Miss Kornelia Theodosia",1st,63.0,female,1,1
7,8,"Andrews, Mr Thomas, jr",1st,39.0,male,0,0
8,9,"Appleton, Mrs Edward Dale (Charlotte Lamson)",1st,58.0,female,1,1
9,10,"Artagaveytia, Mr Ramon",1st,71.0,male,0,0


<b>Question 3.1:</b> How do you get the first 3 rows?

In [None]:
### YOUR CODE HERE

---

b) `tail`

The `tail` function gives us the last 5 rows of our DataFrame. Like `head`, it accepts a number if we want a different number of rows.

In [6]:
# Gets last 5 rows
df.tail()

Unnamed: 0.1,Unnamed: 0,Name,PClass,Age,Sex,Survived,SexCode
1308,1309,"Zakarian, Mr Artun",3rd,27.0,male,0,0
1309,1310,"Zakarian, Mr Maprieder",3rd,26.0,male,0,0
1310,1311,"Zenni, Mr Philip",3rd,22.0,male,0,0
1311,1312,"Lievens, Mr Rene",3rd,24.0,male,0,0
1312,1313,"Zimmerman, Leo",3rd,29.0,male,0,0
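
Both `head` and `tail` take an optional row count. Here is a small sketch on a made-up DataFrame (`toy` is a hypothetical name, not part of the Titanic data) so the selected rows are easy to check by eye.

```python
import pandas as pd

# Hypothetical toy data: eight rows numbered 0 through 7.
toy = pd.DataFrame({"n": range(8)})

first_two = toy.head(2)    # rows at positions 0 and 1
last_two = toy.tail(2)     # rows at positions 6 and 7
default_head = toy.head()  # no argument means 5 rows

print(first_two)
print(last_two)
```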


<b>Question 3.2:</b> How do you think we can get the last 3 rows?

In [8]:
### YOUR CODE HERE

---

c) `columns`

We can look at all of the column names of our DataFrame by using the `columns` attribute of our DataFrame.

In [12]:
# Note that we are not using parentheses. This is because columns is an attribute of df, not a function
df.columns

Index(['Unnamed: 0', 'Name', 'PClass', 'Age', 'Sex', 'Survived', 'SexCode'], dtype='object')
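
As a quick sketch of the attribute-versus-function difference, the snippet below reads `columns` without parentheses and converts the result to a plain Python list. The DataFrame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical two-column DataFrame.
toy = pd.DataFrame({"Name": ["A", "B"], "Age": [29.0, 2.0]})

cols = toy.columns      # attribute access: no parentheses
col_names = list(cols)  # convert the Index object to a regular list

print(col_names)
```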

---

d) `describe`

We can use the `describe` function to get a statistical summary of the numeric columns in our data.

In [11]:
df.describe()

Unnamed: 0.1,Unnamed: 0,Age,Survived,SexCode
count,1313.0,756.0,1313.0,1313.0
mean,657.0,30.397989,0.342727,0.351866
std,379.174762,14.259049,0.474802,0.477734
min,1.0,0.17,0.0,0.0
25%,329.0,21.0,0.0,0.0
50%,657.0,28.0,0.0,0.0
75%,985.0,39.0,1.0,1.0
max,1313.0,71.0,1.0,1.0
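
Notice that the `count` for `Age` is smaller than for the other columns. That is because `describe` skips missing values. The sketch below shows the same effect on made-up data; `toy` is a hypothetical name, and `.loc[row_label, column_label]` is used here only to pull single numbers out of the summary.

```python
import pandas as pd

# Hypothetical data with one missing age.
toy = pd.DataFrame({
    "Age": [20.0, 30.0, None, 40.0],
    "Survived": [1, 0, 0, 1],
})

summary = toy.describe()

age_count = summary.loc["count", "Age"]  # missing values are not counted
age_mean = summary.loc["mean", "Age"]    # mean of 20, 30 and 40
age_std = summary.loc["std", "Age"]      # sample standard deviation

print(summary)
```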


<b>Question 3.3:</b> Looking at our describe table, what was the mean age and standard deviation (in years) of passengers on the Titanic?

Mean (in years):

Standard Deviation (in years):


## 4. Indexing into DataFrames <a id='indexing'></a>

This next section will discuss how to actually access the values in our DataFrame. We call this <i>indexing</i>.

---

a) Selecting columns

In order to get a column of our DataFrame, we use bracket notation with the column name. <br><br>

<center>`data_frame_name[column_name]`</center>

Here's how we would get the Age column out of our DataFrame:

In [16]:
df['Age']

0       29.00
1        2.00
2       30.00
3       25.00
4        0.92
5       47.00
6       63.00
7       39.00
8       58.00
9       71.00
10      47.00
11      19.00
12        NaN
13        NaN
14        NaN
15      50.00
16      24.00
17      36.00
18      37.00
19      47.00
20      26.00
21      25.00
22      25.00
23      19.00
24      28.00
25      45.00
26      39.00
27      30.00
28      58.00
29        NaN
        ...  
1283    14.00
1284    22.00
1285      NaN
1286      NaN
1287      NaN
1288      NaN
1289      NaN
1290      NaN
1291    51.00
1292    18.00
1293    45.00
1294      NaN
1295      NaN
1296      NaN
1297    28.00
1298    21.00
1299    27.00
1300      NaN
1301    36.00
1302      NaN
1303    27.00
1304    15.00
1305      NaN
1306      NaN
1307      NaN
1308    27.00
1309    26.00
1310    22.00
1311    24.00
1312    29.00
Name: Age, Length: 1313, dtype: float64
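
The output above is not a DataFrame: selecting a single column gives back a pandas Series, which is one labeled column of values. Here is a minimal sketch on made-up data (`toy` and `ages` are hypothetical names).

```python
import pandas as pd

# Hypothetical three-row DataFrame.
toy = pd.DataFrame({"Name": ["A", "B", "C"], "Age": [29.0, 2.0, 47.0]})

ages = toy["Age"]  # a single column name gives back a Series

print(type(ages).__name__)  # the class name of the result
print(ages.name)            # the Series remembers its column name
```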

Notice that the column names are case sensitive! 

<b>Question 4.1:</b> What happens when you try to access the `age` column from our DataFrame?

In [18]:
### Try it out!

<b>Answer:</b>

<b>Question 4.2</b>: Index into our DataFrame to get the `Survived` column!

In [None]:
### YOUR CODE HERE

---

b) Selecting Rows

In order to select rows from our DataFrame, we use square brackets and specify the range of rows we want.

For example,

<center>`df[0:3]`</center>

would give us the first three rows of our DataFrame.

The range of rows we get back is inclusive of the first number, but not of the second. In the example above, this means that we get the 0th row, the 1st row, and the 2nd row, but <b>not</b> the 3rd row. 

Does that make sense?
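
If it helps, here is a small sketch of the same rule on a made-up five-row DataFrame (`toy` is a hypothetical name). The start of the range is included and the end is not.

```python
import pandas as pd

# Hypothetical data: five rows holding the values 10 through 14.
toy = pd.DataFrame({"n": [10, 11, 12, 13, 14]})

first_three = toy[0:3]  # rows at positions 0, 1 and 2
middle = toy[1:4]       # rows at positions 1, 2 and 3

print(first_three)
print(middle)
```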

<b>Question 4.3:</b> Get the 100th row to 103rd row of our DataFrame, including the 103rd row.

Hint: You should have 4 rows in your result.

In [19]:
### YOUR CODE HERE

---

c) Selecting a subset of the columns and the rows

We've learned so far how to get out certain columns and rows of our DataFrame, but let's say we want a few rows, and a few columns from our DataFrame. How do we do that?

The `iloc` indexer will do the trick! `iloc` is short for integer location, by the way. This is because we are using the integer positions of the rows and the columns to access them in the DataFrame.

Here's the syntax:

<center>`df.iloc[3:5, 0:2]`</center>

The range before the comma represents the rows we are selecting (in this case, the 3rd row and the 4th row), and the range after the comma represents the columns we are selecting (the 0th column and the 1st column).

Go ahead and run the cell below! Let's see what we get!

In [25]:
df.iloc[3:5, 0:2]

Unnamed: 0.1,Unnamed: 0,Name
3,4,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)"
4,5,"Allison, Master Hudson Trevor"
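
Here is one more sketch of `iloc` on a made-up DataFrame (`toy`, `block`, and `cell` are hypothetical names). It shows a range on both sides of the comma, and also a single row number and column number, which gives back one value.

```python
import pandas as pd

# Hypothetical data: four rows and three columns.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [5, 6, 7, 8],
    "c": [9, 10, 11, 12],
})

block = toy.iloc[1:3, 0:2]  # rows 1 and 2, columns 0 and 1 ("a" and "b")
cell = toy.iloc[0, 2]       # one value: row 0, column 2 ("c")

print(block)
print(cell)
```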


<b>Question 4.4:</b> Get the 1st row and 2nd and 3rd column out (all of these numbers are by index, remember that Python is 0-indexed)!

In [27]:
### YOUR CODE HERE

## 5. Manipulating DataFrames <a id='manipulating'></a>

Now, we're going to examine a few functions that will enable us to manipulate our DataFrame in various ways.

First, note that these functions do not modify the DataFrame they are called on. Say we have a DataFrame `d`. If we call one of these functions on it, the output is the result of applying the function to `d`, but `d` itself remains unchanged.

In order to get our changes to persist, we need to save the result of applying the function into a variable.
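
Here is a minimal sketch of that behavior. It uses `dropna`, one of the functions covered later in this section, on a made-up DataFrame; `d_clean` is a hypothetical variable name.

```python
import pandas as pd

# Hypothetical DataFrame with one missing value.
d = pd.DataFrame({"Age": [29.0, None, 47.0]})

d.dropna()            # the result is computed, then thrown away
rows_before = len(d)  # d still has all of its rows

d_clean = d.dropna()       # saving the result keeps the change
rows_after = len(d_clean)  # the row with the missing value is gone

print(rows_before, rows_after)
```

You could also save the result back into the same variable, as in `d = d.dropna()`, if you no longer need the original.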

---

a) `drop`

The `drop` function allows us to drop rows and columns that are not desired. 

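To drop a row we do not want:
<center>`df.drop(1)`</center>

Here, we are dropping the row whose index label is 1. With the default index, that is also the row at position 1.

If we want to drop a column:
<center>`df.drop(df.columns[1], axis=1)`</center>

Here, we are dropping the 1st-index column. `drop` looks up labels rather than positions, so we use `df.columns[1]` to get that column's name first. Note that we have to add `axis=1` to drop a column and not a row!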
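
One thing to keep in mind is that `drop` matches row and column labels, not positions, so a column is dropped by its name. Here is a sketch on made-up data (`toy` and the result names are hypothetical) that drops a row by label, a column by name, and a column by position via `toy.columns`.

```python
import pandas as pd

# Hypothetical three-row, three-column DataFrame.
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": [29.0, 2.0, 47.0],
    "Survived": [1, 0, 1],
})

without_row = toy.drop(1)              # drops the row labeled 1
without_col = toy.drop("Age", axis=1)  # drops the column named "Age"

# To drop by position, look up the label at that position first.
without_second_col = toy.drop(toy.columns[1], axis=1)

print(without_row)
print(without_col)
```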


<b>Question 5.1: </b> How would you drop the 27th (indexed) row?

In [32]:
### YOUR CODE HERE

<b>Question 5.2:</b> How would you drop the SexCode column in our DataFrame?

In [33]:
### YOUR CODE HERE

---

b) Filtering data with logical subsetting

Let's say you want to get the rows of our DataFrame where the passenger is older than 30. We would do that with logical subsetting, in which we use a boolean expression to select certain rows. The code would look like:

<center>`df[df['Age'] > 30]`</center>

In [39]:
### Run this line of code! See what happens!
df[df['Age'] > 30]

Unnamed: 0.1,Unnamed: 0,Name,PClass,Age,Sex,Survived,SexCode
5,6,"Anderson, Mr Harry",1st,47.0,male,1,0
6,7,"Andrews, Miss Kornelia Theodosia",1st,63.0,female,1,1
7,8,"Andrews, Mr Thomas, jr",1st,39.0,male,0,0
8,9,"Appleton, Mrs Edward Dale (Charlotte Lamson)",1st,58.0,female,1,1
9,10,"Artagaveytia, Mr Ramon",1st,71.0,male,0,0
10,11,"Astor, Colonel John Jacob",1st,47.0,male,0,0
15,16,"Baxter, Mrs James (Helene DeLaudeniere Chaput)",1st,50.0,female,1,1
17,18,"Beattie, Mr Thomson",1st,36.0,male,0,0
18,19,"Beckwith, Mr Richard Leonard",1st,37.0,male,1,0
19,20,"Beckwith, Mrs Richard Leonard (Sallie Monypeny)",1st,47.0,female,1,1
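
To see what is going on inside the brackets, here is a sketch that splits the filter into two steps on made-up data (`toy`, `mask`, and `older` are hypothetical names). The comparison produces one True or False per row, and only the True rows are kept. A missing age compares as False, so that row is left out.

```python
import pandas as pd

# Hypothetical data with one missing age.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age": [47.0, 2.0, None, 63.0],
})

mask = toy["Age"] > 30  # a Series of True/False values, one per row
older = toy[mask]       # keeps only the rows where mask is True

print(mask)
print(older)
```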


<b>Question 5.3:</b> It's your turn! Select only the rows of passengers who survived.

In [40]:
### YOUR CODE HERE

---

c) `dropna`

When dealing with data, we often have to deal with missing values. The `dropna` function drops rows in which there are missing values.

In [44]:
df.dropna()

Unnamed: 0.1,Unnamed: 0,Name,PClass,Age,Sex,Survived,SexCode
0,1,"Allen, Miss Elisabeth Walton",1st,29.00,female,1,1
1,2,"Allison, Miss Helen Loraine",1st,2.00,female,0,1
2,3,"Allison, Mr Hudson Joshua Creighton",1st,30.00,male,0,0
3,4,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.00,female,0,1
4,5,"Allison, Master Hudson Trevor",1st,0.92,male,1,0
5,6,"Anderson, Mr Harry",1st,47.00,male,1,0
6,7,"Andrews, Miss Kornelia Theodosia",1st,63.00,female,1,1
7,8,"Andrews, Mr Thomas, jr",1st,39.00,male,0,0
8,9,"Appleton, Mrs Edward Dale (Charlotte Lamson)",1st,58.00,female,1,1
9,10,"Artagaveytia, Mr Ramon",1st,71.00,male,0,0


If you look at the number of rows in this DataFrame, there are only 756 rows, when originally the DataFrame had 1313 rows. This matches the 756 non-missing `Age` values that `describe` reported earlier.
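
Here is a small sketch of `dropna` on made-up data (`toy` and `complete` are hypothetical names). It also uses `isna().sum()`, which is not covered in this notebook, to count the missing values in each column before dropping anything.

```python
import pandas as pd

# Hypothetical data with one missing age.
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": [29.0, None, 47.0],
})

missing_per_column = toy.isna().sum()  # number of missing values per column
complete = toy.dropna()                # keeps only rows with no missing values

print(missing_per_column)
print(len(toy), len(complete))
```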

---

d) `fillna`

Let's say that instead of getting rid of rows with NAs, we want to fill the NAs with a default value. We can do this with the `fillna` function. All we have to do is call fillna on the DataFrame, passing in what we want to fill in.

In [47]:
df.fillna(3)

Unnamed: 0.1,Unnamed: 0,Name,PClass,Age,Sex,Survived,SexCode
0,1,"Allen, Miss Elisabeth Walton",1st,29.00,female,1,1
1,2,"Allison, Miss Helen Loraine",1st,2.00,female,0,1
2,3,"Allison, Mr Hudson Joshua Creighton",1st,30.00,male,0,0
3,4,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.00,female,0,1
4,5,"Allison, Master Hudson Trevor",1st,0.92,male,1,0
5,6,"Anderson, Mr Harry",1st,47.00,male,1,0
6,7,"Andrews, Miss Kornelia Theodosia",1st,63.00,female,1,1
7,8,"Andrews, Mr Thomas, jr",1st,39.00,male,0,0
8,9,"Appleton, Mrs Edward Dale (Charlotte Lamson)",1st,58.00,female,1,1
9,10,"Artagaveytia, Mr Ramon",1st,71.00,male,0,0


Now, we've replaced all NAs with the number 3!
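
Here is a sketch of `fillna` on made-up numeric data (the column names `a` and `b` are hypothetical). Besides a single value, `fillna` also accepts a dictionary that maps column names to fill values, which is handy when only some columns should be filled.

```python
import pandas as pd

# Hypothetical data with one missing value in each column.
toy = pd.DataFrame({
    "a": [29.0, None, 47.0],
    "b": [None, 7.5, 8.0],
})

filled_everywhere = toy.fillna(3)    # every missing value becomes 3
filled_a_only = toy.fillna({"a": 0}) # only column "a" is filled

print(filled_everywhere)
print(filled_a_only)
```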

---

## 6. Wrap-up <a id ='wrapup'></a>

We've learned a lot about pandas in this intro notebook, so congrats! Here's a summary table of what we've learned.

|Name|Example|Purpose|
|-|-|-|
|`read_csv`|`pd.read_csv("filename.csv")`|Create a DataFrame with data from `filename.csv`|
|`head`|`df.head(5)`|Get first 5 rows of our DataFrame|
|`tail`|`df.tail(5)`|Get last 5 rows of our DataFrame|
|`columns`|`df.columns`|Get column names in our DataFrame|
|`iloc`|`df.iloc[1:3, 2:4]`|Get certain rows and columns of our DataFrame|
|`drop`|`df.drop(1), df.drop('Name', axis=1)`|Drop the row labeled 1 and drop the `Name` column|
|`dropna`|`df.dropna()`|Drop rows in our DataFrame that have NAs|
|`fillna`|`df.fillna(5)`|Fill NA values with something that you want (5 in this case)|
