## Pandas
Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with labeled data easy. It is fundamental to practical data analysis in Python. <br/>
Before you jump into modeling or visualization, you need a good understanding of the nature of your dataset, and pandas is one of the best tools to help you build it.

In [2]:
import pandas as pd

### Core Data Structures
The two primary components that Pandas introduces are the **Series** and the **DataFrame**. <br/>

## Series
A Series is similar to a list; you can think of it as a one-dimensional array. It assigns an index label to each item it contains. By default, the items are labeled 0 to N-1, where N is the size of the Series.

In [3]:
# Create a series with an arbitrary list of names
s = pd.Series(['John', 'Rich', 'Michael', 'Sue', 'Kath'])
print(s)

# Integer indexing
print(s[0])         # Index to select a specific item from the Series

print(s[1:4])       # List slicing syntax to select a range of items

print(s[[0,1,3]])   # Double brackets to select more than one item

# Boolean indexing
param = s != 'John' # Select all names that are not 'John'
print(s[param])

0       John
1       Rich
2    Michael
3        Sue
4       Kath
dtype: object
John
1       Rich
2    Michael
3        Sue
dtype: object
0    John
1    Rich
3     Sue
dtype: object
1       Rich
2    Michael
3        Sue
4       Kath
dtype: object
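The default 0 to N-1 labels are only a starting point. As a minimal sketch (the labels 'a', 'b', 'c' and the variable names are made up for illustration), the cell below shows the default index next to a custom one, with .loc for label-based lookup and .iloc for position-based lookup.

```python
import pandas as pd

# Default index: labels run from 0 to N-1
names = pd.Series(['John', 'Rich', 'Michael'])
print(list(names.index))          # [0, 1, 2]

# Custom index: the labels 'a', 'b', 'c' are illustrative only
labeled = pd.Series(['John', 'Rich', 'Michael'], index=['a', 'b', 'c'])

print(labeled.loc['b'])           # label-based lookup
print(labeled.iloc[1])            # position-based lookup, same item here
```

Both lookups return the same item here because 'b' happens to sit at position 1.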


The Series constructor can also convert a dictionary, using the keys of the dictionary as its index.

In [16]:
d = {'New York': 1300, 
     'Chicago': 900, 
     'San Francisco': 1100, 
     'Austin': 450}

cities = pd.Series(d)
print(cities)

New York         1300
Chicago           900
San Francisco    1100
Austin            450
dtype: int64
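When building a Series from a dictionary, you can also pass an explicit index to choose which keys to keep and in what order. The sketch below assumes a hypothetical key, 'Boston', that is not in the dictionary, to show what happens to labels without a matching entry.

```python
import pandas as pd

d = {'New York': 1300, 'Chicago': 900, 'San Francisco': 1100, 'Austin': 450}

# Passing an explicit index selects and orders the keys;
# 'Boston' is a hypothetical label that is not in the dictionary
subset = pd.Series(d, index=['Austin', 'Chicago', 'Boston'])
print(subset)

# Labels with no matching key are filled with NaN (missing)
print(subset.isna())
```

Because NaN is a floating-point value, the resulting Series is stored as floats rather than integers.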


You can use the index labels to select specific items from the Series.

In [11]:
cities[['Chicago', 'San Francisco']]  # Note: You have to use double brackets for more than one item

Chicago           900
San Francisco    1100
dtype: int64
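Label-based selection comes in a few flavors. The following sketch rebuilds the same cities Series so it runs on its own (the names pair and span are illustrative) and compares a single label, a list of labels, and a label slice.

```python
import pandas as pd

cities = pd.Series({'New York': 1300, 'Chicago': 900,
                    'San Francisco': 1100, 'Austin': 450})

# A single label returns the stored value itself
print(cities['Chicago'])

# A list of labels returns a smaller Series
pair = cities.loc[['Chicago', 'San Francisco']]
print(pair)

# A label slice includes BOTH endpoints, unlike positional slicing
span = cities.loc['Chicago':'San Francisco']
print(span)
```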

Or you can use boolean indexing for selection.

In [15]:
cities[cities < 1000]   # Selecting all cities with population < 1000

Chicago    900
Austin     450
dtype: int64
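A comparison such as cities < 1000 produces a Series of True/False values, and that mask is what does the selecting. Here is a small sketch, rebuilding the same cities Series so it runs on its own (the names mask and mid_sized are illustrative), that prints the mask and combines two conditions.

```python
import pandas as pd

cities = pd.Series({'New York': 1300, 'Chicago': 900,
                    'San Francisco': 1100, 'Austin': 450})

# The comparison itself yields one boolean per item
mask = cities < 1000
print(mask)

# Combine conditions with & (and) or | (or);
# each condition needs its own parentheses
mid_sized = cities[(cities > 500) & (cities < 1200)]
print(mid_sized)
```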

You can also change values using boolean indexing. For example, we can set the value of every city with a population less than 1000 to 750.

In [18]:
cities[cities < 1000] = 750
print(cities)

New York         1300
Chicago           750
San Francisco    1100
Austin            750
dtype: int64
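Assigning through a boolean mask modifies the Series in place. If you would rather keep the original untouched, one option is the Series.mask method, which returns a new Series with the matching values replaced. A minimal sketch (the name capped is illustrative):

```python
import pandas as pd

cities = pd.Series({'New York': 1300, 'Chicago': 900,
                    'San Francisco': 1100, 'Austin': 450})

# mask() replaces values where the condition is True and
# returns a new Series; the original is left unchanged
capped = cities.mask(cities < 1000, 750)

print(capped)
print(cities)   # still holds 900 and 450
```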


## DataFrame
A DataFrame is a tabular data structure made up of rows and columns, similar to a spreadsheet or a database table. You can think of a DataFrame as a collection of Series that share the same index.

**CSV to DataFrame** <br>
Reading a CSV file of data is as simple as calling the read_csv function. By default, the function expects the column separator to be a comma, but you can change that using the sep parameter. <br>
We will be using the Titanic dataset, which has information about the people who were on board.

In [46]:
titanic = pd.read_csv('titanic.txt', sep='\t')

titanic.head()   # Displays the first five rows of the DataFrame

|   | pclass | age | sex | survived |
|---|---|---|---|---|
| 0 | 1st | adult | male | yes |
| 1 | 1st | adult | male | yes |
| 2 | 1st | adult | male | yes |
| 3 | 1st | adult | male | yes |
| 4 | 1st | adult | male | yes |
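To try the sep parameter without a file on disk, you can hand read_csv an in-memory text buffer. The sketch below uses a couple of made-up, tab-separated rows (not taken from the real dataset) wrapped in io.StringIO.

```python
import io
import pandas as pd

# Made-up tab-separated text standing in for a file on disk
raw = (
    "pclass\tage\tsex\tsurvived\n"
    "1st\tadult\tmale\tyes\n"
    "2nd\tchild\tfemale\tno\n"
)

# sep='\t' tells read_csv that columns are separated by tabs
df = pd.read_csv(io.StringIO(raw), sep='\t')

print(df.shape)            # (rows, columns)
print(list(df.columns))
```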


### Selecting
Each column of a DataFrame is essentially a Series. Selecting a single column from a DataFrame will return a Series object.

In [44]:
titanic['age']   # Selects the 'age' column, or series

0       adult
1       adult
2       adult
3       adult
4       adult
        ...  
2196    adult
2197    adult
2198    adult
2199    adult
2200    adult
Name: age, Length: 2201, dtype: object

To select multiple columns, pass a list of column names to the DataFrame. The output will be a DataFrame.

In [43]:
titanic[['age', 'pclass']]

|   | age | pclass |
|---|---|---|
| 0 | adult | 1st |
| 1 | adult | 1st |
| 2 | adult | 1st |
| 3 | adult | 1st |
| 4 | adult | 1st |
| ... | ... | ... |
| 2196 | adult | crew |
| 2197 | adult | crew |
| 2198 | adult | crew |
| 2199 | adult | crew |
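The type you get back depends on the brackets, not on how many columns you name. As a quick check on a tiny made-up frame (not the real Titanic data), a single name returns a Series, while a list returns a DataFrame even when it holds just one name.

```python
import pandas as pd

# Tiny illustrative frame; the values are made up
df = pd.DataFrame({'pclass': ['1st', '2nd'],
                   'age': ['adult', 'child'],
                   'sex': ['male', 'female']})

one = df['age']      # single name: returns a Series
two = df[['age']]    # list with one name: returns a one-column DataFrame

print(type(one).__name__)
print(type(two).__name__, two.shape)
```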


We can select rows by using boolean indexing. For example, we can select all the children who were in 2nd class.

In [42]:
titanic[(titanic.age=='child') & (titanic.pclass=='2nd')]

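|   | pclass | age | sex | survived |
|---|---|---|---|---|
| 586 | 2nd | child | male | yes |
| 587 | 2nd | child | male | yes |
| 588 | 2nd | child | male | yes |
| 589 | 2nd | child | male | yes |
| 590 | 2nd | child | male | yes |
| 591 | 2nd | child | male | yes |
| 592 | 2nd | child | male | yes |
| 593 | 2nd | child | male | yes |
| 594 | 2nd | child | male | yes |
| 595 | 2nd | child | male | yes |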
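The same row filter can be written in more than one way. The sketch below uses a small made-up frame (not the real Titanic data) to build the boolean mask step by step, and then expresses the same condition with DataFrame.query as an alternative spelling.

```python
import pandas as pd

# Small illustrative frame; the rows are made up
df = pd.DataFrame({'pclass': ['1st', '2nd', '2nd', 'crew'],
                   'age': ['child', 'child', 'adult', 'adult']})

# Each comparison gives a boolean Series; & keeps rows where both are True
mask = (df['age'] == 'child') & (df['pclass'] == '2nd')
selected = df[mask]

# query() takes the condition as a string and returns the same rows
queried = df.query("age == 'child' and pclass == '2nd'")

print(selected)
```

Bracket access such as df['age'] also works for column names that are not valid Python attribute names, which dot access cannot handle.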


We can select rows by position using regular list slicing syntax. For example, we can select the ten rows at positions 320 through 329.

In [47]:
titanic[320:330]   # Note: list slicing doesn't include the last position

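|   | pclass | age | sex | survived |
|---|---|---|---|---|
| 320 | 1st | child | male | yes |
| 321 | 1st | child | male | yes |
| 322 | 1st | child | male | yes |
| 323 | 1st | child | male | yes |
| 324 | 1st | child | female | yes |
| 325 | 2nd | adult | male | yes |
| 326 | 2nd | adult | male | yes |
| 327 | 2nd | adult | male | yes |
| 328 | 2nd | adult | male | yes |
| 329 | 2nd | adult | male | yes |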
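Positional slicing can also be spelled out with .iloc, which makes the intent explicit. The following sketch uses a made-up frame with string row labels so that positions and labels are easy to tell apart.

```python
import pandas as pd

# Illustrative frame with string row labels 'a' through 'e'
df = pd.DataFrame({'value': [10, 20, 30, 40, 50]},
                  index=['a', 'b', 'c', 'd', 'e'])

by_slice = df[1:3]            # positions 1 and 2; position 3 is excluded
by_iloc = df.iloc[1:3]        # the same rows, with position stated explicitly
by_label = df.loc['b':'c']    # label slice; both endpoints are included

print(by_slice)
```

All three selections return the rows labeled 'b' and 'c'.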


### Grouping
The pandas groupby method returns a GroupBy object with a variety of methods, many of which resemble standard SQL aggregate functions.

In [58]:
by_age = titanic.groupby('age')

If we want to see the total number of records, or rows, in each age group, we can use the size method.

In [60]:
by_age.size()   # We can see there are 2092 adults, and 109 children

age
adult    2092
child     109
dtype: int64
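For a single column, counting rows per group can also be done with value_counts. As a rough comparison on a made-up column (not the real data), the sketch below runs both and shows they agree on the counts.

```python
import pandas as pd

# Made-up column of age groups
toy = pd.DataFrame({'age': ['adult', 'child', 'adult', 'adult']})

# size() returns the number of rows in each group, indexed by group key
sizes = toy.groupby('age').size()
print(sizes)

# value_counts() gives the same counts, ordered from most to least frequent
counts = toy['age'].value_counts()
print(counts)
```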

We can do many different DataFrame groupings. For example, we can see the number of males compared to females, or we can also see the number of survivors for each gender.

In [80]:
# The number of males compared to females
print(titanic.groupby(['sex']).size())


# Survivors and non-survivors, broken down by sex
print(titanic.groupby(['survived', 'sex']).size())


# The number of 1st class males who did and did not survive
class1_males = titanic[(titanic.pclass=='1st') & (titanic.sex=='male')]
print(class1_males.groupby('survived').size())

sex
female     470
male      1731
dtype: int64
survived  sex   
no        female     126
          male      1364
yes       female     344
          male       367
dtype: int64
survived
no     118
yes     62
dtype: int64
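A two-key grouping is often easier to read as a small table. One way to get there, sketched on made-up rows (not the real Titanic counts), is to unstack the inner key into columns; pd.crosstab builds the same kind of table directly.

```python
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({'sex': ['male', 'male', 'female', 'female', 'male'],
                    'survived': ['no', 'yes', 'yes', 'yes', 'no']})

# Grouping by two keys gives a Series with a two-level index
counts = toy.groupby(['survived', 'sex']).size()

# unstack() moves the inner level ('sex') into columns;
# fill_value=0 fills combinations that never occur
table = counts.unstack(fill_value=0)
print(table)

# crosstab() computes the same counts in one call
cross = pd.crosstab(toy['survived'], toy['sex'])
```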
