# Intro to Pandas
By Carl Shan

This Jupyter Notebook introduces the `pandas` library and shows how to use it for working with data.

## What is `pandas`?
Pandas is a popular Python library with many tools that make it easier to visualize, inspect, or slice data. You may find it helpful in this class.

In [1]:
import pandas as pd
# the above line of code imports the pandas library, and renames it `pd` so we can type it more easily

### Loading in Data

To load in a .csv, we'll just use the following function:
```python 
pd.read_csv("some file path here")
```
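
As a minimal sketch of what `pd.read_csv` gives back, the example below reads a tiny made-up CSV from an in-memory string rather than from a file on disk. The `csv_text` contents and the `demo` variable are illustrative assumptions and have nothing to do with the Titanic dataset.

```python
import io

import pandas as pd

# Hypothetical three-row CSV, used only for illustration.
csv_text = """name,age,fare
Alice,29,211.34
Bob,41,7.25
Carol,35,26.55
"""

# read_csv accepts a file path or a file-like object such as StringIO.
demo = pd.read_csv(io.StringIO(csv_text))

print(type(demo))  # a pandas DataFrame
print(demo.shape)  # (number of rows, number of columns)
```

With a real file, you would pass the path string instead of the `StringIO` object.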

In [3]:
titanic_data = pd.read_csv('/Users/cshan/dev/spring_2019/data_analytics/datasets/titanic/titanic_dataset.csv')
# Replace the string above with the path to YOUR titanic csv

In [14]:
titanic_data[titanic_data['pclass'] == 3]['survived'].value_counts(normalize=True)

0.0    0.744711
1.0    0.255289
Name: survived, dtype: float64

### Taking a look at data

In [3]:
titanic_data
# Run this cell and see what happens.

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1.0,1.0,"Allen, Miss. Elisabeth Walton",female,29.0000,0.0,0.0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1.0,1.0,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,113781,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
2,1.0,0.0,"Allison, Miss. Helen Loraine",female,2.0000,1.0,2.0,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1.0,0.0,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1.0,2.0,113781,151.5500,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1.0,0.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1.0,2.0,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
5,1.0,1.0,"Anderson, Mr. Harry",male,48.0000,0.0,0.0,19952,26.5500,E12,S,3,,"New York, NY"
6,1.0,1.0,"Andrews, Miss. Kornelia Theodosia",female,63.0000,1.0,0.0,13502,77.9583,D7,S,10,,"Hudson, NY"
7,1.0,0.0,"Andrews, Mr. Thomas Jr",male,39.0000,0.0,0.0,112050,0.0000,A36,S,,,"Belfast, NI"
8,1.0,1.0,"Appleton, Mrs. Edward Dale (Charlotte Lamson)",female,53.0000,2.0,0.0,11769,51.4792,C101,S,D,,"Bayside, Queens, NY"
9,1.0,0.0,"Artagaveytia, Mr. Ramon",male,71.0000,0.0,0.0,PC 17609,49.5042,,C,,22.0,"Montevideo, Uruguay"


### Woah. What are we looking at?

The above spreadsheet you're looking at is called a `DataFrame`. It's one of the most important `pandas` objects to know.

Below I'll share some of the things you can do with `DataFrame` objects that would be more difficult to do without it.

### Handy `DataFrame` attributes

Let's look at three attributes that `DataFrame` objects have: `.columns`, `.shape` and `.dtypes`. Attributes are accessed without parentheses.

Copy the code below into various cells and see what they do.

Example:

**`.columns`**

```python
titanic_data.columns
```

**`.shape`**

```python
titanic_data.shape
```

**`.dtypes`**

```python
titanic_data.dtypes
```
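
If you don't have the Titanic file loaded yet, here is a small sketch of the same three attributes on a made-up `DataFrame`. The `toy` data and the variable names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic dataset.
toy = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "age": [29.0, 41.0, 35.0],
    "survived": [1, 0, 1],
})

column_names = list(toy.columns)  # column labels, in order
n_rows, n_cols = toy.shape        # a (rows, columns) tuple
column_types = toy.dtypes         # a Series mapping column name to dtype

print(column_names)
print(n_rows, n_cols)
print(column_types)
```

Because `.shape` is a plain tuple, you can unpack it into a row count and a column count as shown.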

In [5]:
### Your code here
titanic_data.columns


titanic_data.shape


titanic_data.dtypes



pclass       float64
survived     float64
name          object
sex           object
age          float64
sibsp        float64
parch        float64
ticket        object
fare         float64
cabin         object
embarked      object
boat          object
body         float64
home.dest     object
dtype: object

### Handy DataFrame methods

#### Taking a look at the `.head()` and `.tail()`

The `.head()` and `.tail()` methods allow you to see the first or last few rows of a `DataFrame`.

Try running the following code in your own cell and see what gets produced:

**`.head()`**

```python
titanic_data.head()
```

**`.tail()`**

```python
titanic_data.tail()
```
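
Both methods return 5 rows by default and also accept a row count. A small sketch on a made-up ten-row `DataFrame` (the `toy` data is an assumption for illustration):

```python
import pandas as pd

# Ten hypothetical rows, numbered 1 through 10.
toy = pd.DataFrame({"passenger_id": range(1, 11)})

first_five = toy.head()   # no argument: the first 5 rows
last_three = toy.tail(3)  # the last 3 rows

print(first_five)
print(last_three)
```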

In [11]:
### Your code here
titanic_data.head(150)

#titanic_data.tail()







Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1.0,1.0,"Allen, Miss. Elisabeth Walton",female,29.0000,0.0,0.0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1.0,1.0,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,113781,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
2,1.0,0.0,"Allison, Miss. Helen Loraine",female,2.0000,1.0,2.0,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1.0,0.0,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1.0,2.0,113781,151.5500,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1.0,0.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1.0,2.0,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
5,1.0,1.0,"Anderson, Mr. Harry",male,48.0000,0.0,0.0,19952,26.5500,E12,S,3,,"New York, NY"
6,1.0,1.0,"Andrews, Miss. Kornelia Theodosia",female,63.0000,1.0,0.0,13502,77.9583,D7,S,10,,"Hudson, NY"
7,1.0,0.0,"Andrews, Mr. Thomas Jr",male,39.0000,0.0,0.0,112050,0.0000,A36,S,,,"Belfast, NI"
8,1.0,1.0,"Appleton, Mrs. Edward Dale (Charlotte Lamson)",female,53.0000,2.0,0.0,11769,51.4792,C101,S,D,,"Bayside, Queens, NY"
9,1.0,0.0,"Artagaveytia, Mr. Ramon",male,71.0000,0.0,0.0,PC 17609,49.5042,,C,,22.0,"Montevideo, Uruguay"


### Handy DataFrame methods

#### `describe`, `count`, `min`, `max`, `std`, `corr`


Try running each of the commands above to see what happens.

Example:

```python
titanic_data.describe()
```
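
To make the other methods concrete, here is a rough sketch on a small, numeric-only `DataFrame` with one missing value. The `toy` data is made up for illustration.

```python
import pandas as pd

# Hypothetical numeric data; the None becomes a missing value (NaN).
toy = pd.DataFrame({
    "age": [20.0, 30.0, 40.0, None],
    "fare": [10.0, 20.0, 30.0, 40.0],
})

counts = toy.count()       # number of non-missing values in each column
minimums = toy.min()       # smallest value in each column
maximums = toy.max()       # largest value in each column
spreads = toy.std()        # sample standard deviation of each column
correlations = toy.corr()  # pairwise correlation between columns

print(counts)
print(spreads)
print(correlations)
```

Note that `count` skips missing values, which is why the `age` count in the `describe()` output can be lower than the number of rows. Depending on your `pandas` version, methods such as `corr` and `std` may raise an error on a `DataFrame` that contains text columns; selecting the numeric columns first, or passing `numeric_only=True`, is one way around that.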

In [7]:
### Your code here

titanic_data.describe()






Unnamed: 0,pclass,survived,age,sibsp,parch,fare,body
count,1309.0,1309.0,1046.0,1309.0,1309.0,1308.0,121.0
mean,2.294882,0.381971,29.881135,0.498854,0.385027,33.295479,160.809917
std,0.837836,0.486055,14.4135,1.041658,0.86556,51.758668,97.696922
min,1.0,0.0,0.1667,0.0,0.0,0.0,1.0
25%,2.0,0.0,21.0,0.0,0.0,7.8958,72.0
50%,3.0,0.0,28.0,0.0,0.0,14.4542,155.0
75%,3.0,1.0,39.0,1.0,0.0,31.275,256.0
max,3.0,1.0,80.0,8.0,9.0,512.3292,328.0


### Indexing into a `DataFrame`

You may now be curious how to get a specific column of a `DataFrame` object. Try this:

```python

fare = titanic_data['fare']

```

In [10]:
### Your code here


fare = titanic_data['fare']



#### What is the above object?

Try using the `type()` function on the `fare` variable above.

It should be a `Series`. This is the second important data structure to be aware of in the `pandas` library.

A `Series` is very much like a Python dictionary in that it has a number of `keys` and associated `values`. It is often used to represent one column or row of a `DataFrame`.
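
The dictionary comparison can be sketched with a small hand-built `Series`. The names and fares below are made up for illustration.

```python
import pandas as pd

# Hypothetical fares, labelled by passenger name.
fares = pd.Series(
    [211.34, 7.25, 26.55],
    index=["Alice", "Bob", "Carol"],
    name="fare",
)

print(type(fares))         # a pandas Series
print(list(fares.index))   # the "keys"
print(list(fares.values))  # the "values"
print(fares["Bob"])        # look up a value by its key
print("Carol" in fares)    # membership checks the keys, like a dict
```

When you pull a column out of a `DataFrame`, the keys are the row labels, which are the row numbers 0, 1, 2, ... unless you have set a different index.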

### Challenges:

Try to do each of the following:

1. Make a new DataFrame that is equal to only the first 125 rows of the data.

2. Make a new DataFrame that is equal to only the columns `fare` and `home.dest` of the original dataset.

3. Make a new DataFrame that is equal to the first 25 rows and the `sex` and `survived` columns.
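
As a hint for the challenges above, here is a sketch of the two building blocks on made-up data: `.head(n)` keeps the first `n` rows, and indexing with a list of column names keeps only those columns. The `toy` data is an assumption for illustration.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic dataset.
toy = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dan"],
    "age": [29, 41, 35, 52],
    "city": ["Oslo", "Lima", "Pune", "Nice"],
})

first_two_rows = toy.head(2)          # DataFrame with the first 2 rows
two_columns = toy[["name", "city"]]   # a list of names returns a DataFrame
both = toy[["name", "city"]].head(2)  # pick the columns, then the rows

print(both)
```

Note the double square brackets: the outer pair indexes the `DataFrame`, and the inner pair is an ordinary Python list of column names.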


In [22]:
### Your code here
titanic_data.head(125)
titanic_data['fare']

0       211.3375
1       151.5500
2       151.5500
3       151.5500
4       151.5500
5        26.5500
6        77.9583
7         0.0000
8        51.4792
9        49.5042
10      227.5250
11      227.5250
12       69.3000
13       78.8500
14       30.0000
15       25.9250
16      247.5208
17      247.5208
18       76.2917
19       75.2417
20       52.5542
21       52.5542
22       30.0000
23      227.5250
24      221.7792
25       26.0000
26       91.0792
27       91.0792
28      135.6333
29       26.5500
          ...   
1280      7.8958
1281      9.0000
1282      8.0500
1283      7.5500
1284      8.0500
1285      9.5000
1286      7.2292
1287      7.7500
1288      6.4958
1289      6.4958
1290      7.0000
1291      8.7125
1292      7.5500
1293      8.0500
1294     16.1000
1295      7.2500
1296      8.6625
1297      7.2500
1298      9.5000
1299     14.4542
1300     14.4542
1301      7.2250
1302      7.2250
1303     14.4583
1304     14.4542
1305     14.4542
1306      7.2250
1307      7.22

### Indexing by Conditions

Say you want only the rows of a `DataFrame` that meet some condition. You can do something like the following:

In [None]:
# What does this do?
is_female = titanic_data['sex'] == 'female'

In [None]:
# Let's inspect it.
is_female

### What's going on?

The above code has produced a `Series` object containing one `True` or `False` value per row. We call this a `Boolean Series` because the `Series` contains only `Boolean` (i.e., `True` or `False`) values.
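
Here is a small sketch of such a comparison on made-up data, so you can see the values without loading the Titanic file. The `toy` data and the `mask` name are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic dataset.
toy = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "sex": ["female", "male", "female"],
})

# Comparing a column to a value checks every row at once.
mask = toy["sex"] == "female"

print(mask.tolist())  # [True, False, True]
print(mask.sum())     # 2, because each True counts as 1
```

Summing a Boolean `Series` is a quick way to count how many rows satisfy a condition.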

Why is this useful?

Because we can do the following:

In [None]:
# Run this cell

only_female_passengers = titanic_data[is_female]

In [None]:
# Now let's inspect it. We'll use the same `.head()` from before. It also works with Series objects!
# Let's look at the first 10 rows
only_female_passengers.head(n=10)

## Multiple criteria: using `and`, `or` and `not` in conditions

Say you wanted passengers who were female, in class 1, and over the age of 30. Here's how you could write that in one line of code:


#### Multiple `and` conditions

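You use the `&` symbol to denote `and`. Wrap each condition in parentheses, because `&` binds more tightly than comparisons such as `==` and `>`.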

```python

titanic_data[ (titanic_data['sex'] == 'female') & (titanic_data['pclass'] == 1) & (titanic_data['age'] > 30) ]

```

In general, the syntax is:

```python

titanic_data[ (condition1) & (condition2) ... ]

```
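
One way to keep long conditions readable is to name each Boolean `Series` first and then combine them. This is only a sketch on made-up data; the `toy` values and the variable names are assumptions.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic dataset.
toy = pd.DataFrame({
    "sex": ["female", "male", "female", "female"],
    "pclass": [1, 1, 3, 1],
    "age": [45, 50, 38, 22],
})

is_female = toy["sex"] == "female"
is_first_class = toy["pclass"] == 1
is_over_30 = toy["age"] > 30

# A row is kept only if all three conditions are True for it.
matches = toy[is_female & is_first_class & is_over_30]

print(matches)  # only the first row satisfies all three conditions
```

Because each condition is already stored in its own variable, no extra parentheses are needed when combining them.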


#### Multiple `or` conditions

If you want to use an `or` condition, you simply use the `|` symbol:

```python

titanic_data[ (condition1) | (condition2) ... ]

```
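
A concrete sketch of an `or` filter on made-up data (the `toy` values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data, not the Titanic dataset.
toy = pd.DataFrame({
    "pclass": [1, 2, 3, 3],
    "age": [45, 8, 30, 5],
})

# Keep rows where the passenger is in class 1 OR is younger than 10.
either = toy[(toy["pclass"] == 1) | (toy["age"] < 10)]

print(either)
```

A row is kept if at least one of the conditions is `True`, so the 30-year-old in class 3 is the only row dropped here.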


#### Reversing a condition

Use the `~` symbol to reverse a condition.

For example, here's how you find all of the passengers that were **NOT** class 1.

```python

titanic_data[~(titanic_data['pclass'] == 1)]

```
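
The sketch below shows `~` flipping every `True` to `False` and vice versa, using made-up data. For a simple equality test like this one, writing the condition with `!=` selects the same rows.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic dataset.
toy = pd.DataFrame({"pclass": [1, 2, 3, 1]})

is_first_class = toy["pclass"] == 1

not_first = toy[~is_first_class]           # reverse the Boolean Series
also_not_first = toy[toy["pclass"] != 1]   # same rows, written with !=

print((~is_first_class).tolist())  # [False, True, True, False]
print(not_first)
```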

#### You can also use `.isin` to check if a column is one of multiple values

```python

values_to_check = ['S', 'C']

subset = titanic_data[titanic_data['embarked'].isin( values_to_check )]

```
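
To see what `.isin` returns, here is a sketch on made-up data. It also shows the longer `|` version that `.isin` replaces. The `toy` values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic dataset.
toy = pd.DataFrame({"embarked": ["S", "C", "Q", "S"]})

# .isin returns a Boolean Series: True where the value is in the list.
in_list = toy["embarked"].isin(["S", "C"])

# The same check written as two conditions joined with |.
chained = (toy["embarked"] == "S") | (toy["embarked"] == "C")

print(in_list.tolist())  # [True, True, False, True]
print(toy[in_list])
```

The `.isin` form stays short no matter how many values you need to check.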


### Challenges

Try the following:

1. Find the number of surviving passengers who paid a fare above 50 and were NOT in class 1.

In [None]:
### Your code here






### Now that you're done with the tutorial ...

Go to Canvas and try to work on some of the exercises listed in the assignment.

In [None]:
### You can use this cell and add more cells below to analyze more of the data.


















## Resources

If you want to learn more about `pandas`, here are some resources I suggest:

0. [The official documentation on the two core data structures, `DataFrame` and `Series`](https://pandas.pydata.org/pandas-docs/stable/dsintro.html)
1. [A lot of examples of awesome, crazy ways to filter and slice a `DataFrame`](https://medium.com/dunder-data/selecting-subsets-of-data-in-pandas-39e811c81a0c)
2. [A list of the most common `pandas` functionality from the official documentation](https://pandas.pydata.org/pandas-docs/stable/10min.html)
3. [The official `pandas` documentation](https://pandas.pydata.org/pandas-docs/stable/index.html) -- click on one of the topics on the left-hand side to navigate to it.