# Import Pandas

The standard way to import pandas is under the alias `pd`:

In [2]:
import pandas as pd

In [3]:
pd.__version__

'0.24.2'

http://pandas.pydata.org/

pandas is a data analysis library built on top of NumPy. Its central data structure is the `DataFrame`. A `pandas.DataFrame` is a table with rows and columns, similar to an Excel spreadsheet. Each column is a `pandas.Series`, which is essentially a NumPy array with an index and some additional functionality.
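
As a minimal sketch of that relationship, the snippet below selects one column from a small made-up table and pulls out the NumPy array behind it. The names `table`, `column` and `values` are illustrative only; `to_numpy()` is available from pandas 0.24 onwards.

```python
import numpy as np
import pandas as pd

# A small illustrative table (the data is made up for this sketch).
table = pd.DataFrame({"x": [1, 2, 3], "y": [10.0, 20.0, 30.0]})

column = table["x"]         # selecting one column returns a Series
values = column.to_numpy()  # the data behind the Series as a NumPy array

print(type(column))
print(type(values))
print(column.mean(), values.mean())  # both compute the same mean
```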

# Create a pd.Series

The columns of a pandas DataFrame are Series (`pandas.Series`).

In [4]:
series1 = pd.Series([1,2,3])

In [5]:
series1  # the left column of the output is the index (row labels)

0    1
1    2
2    3
dtype: int64

In [6]:
names = pd.Series(['Manuel', 'Michael', 'Hugo'])

In [7]:
names  # dtype=object usually means the values are strings

0     Manuel
1    Michael
2       Hugo
dtype: object

In [8]:
type(names)

pandas.core.series.Series
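
The index shown on the left of a Series does not have to be a row number. As a hedged sketch, the example below passes explicit labels through the `index` argument of `pd.Series` and then looks a value up by label. The variable name `ages_by_name` is made up for this illustration.

```python
import pandas as pd

# Use the names as index labels instead of the default 0, 1, 2.
ages_by_name = pd.Series([33, 25, 14], index=["Manuel", "Michael", "Hugo"])

print(ages_by_name)
print(ages_by_name["Hugo"])  # look up a value by its label
```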

# Create a DataFrame

There are many ways to create a DataFrame. One is to pass a list of rows together with the column names:

In [9]:
rick_morty = pd.DataFrame(
    [
        ["Rick", "Sanchez", 60],
        ["Morty", "Smith", 14]
    ], columns = ["first_name", "last_name", "age"]
)
rick_morty

Unnamed: 0,first_name,last_name,age
0,Rick,Sanchez,60
1,Morty,Smith,14


In [10]:
type(rick_morty)

pandas.core.frame.DataFrame

In [11]:
age = pd.Series([33, 25, 14])

In [13]:
df2 = pd.DataFrame([names, age])  # each Series in the list becomes one row

In [35]:
df2.head()

Unnamed: 0,0,1,2
0,Manuel,Michael,Hugo
1,33,25,14
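
Note that passing a list of Series produces one row per Series, which is why the names and ages end up side by side in rows rather than in columns. If the intent is one column per Series, a dictionary that maps column names to Series is one way to get it. The sketch below contrasts the two; `by_rows` and `by_columns` are illustrative names.

```python
import pandas as pd

names = pd.Series(["Manuel", "Michael", "Hugo"])
age = pd.Series([33, 25, 14])

by_rows = pd.DataFrame([names, age])                    # each Series is a row
by_columns = pd.DataFrame({"name": names, "age": age})  # each Series is a column

print(by_rows.shape)     # (2, 3)
print(by_columns.shape)  # (3, 2)
print(by_columns)
```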


We can also create an empty DataFrame:

In [36]:
df3 = pd.DataFrame()

In [37]:
df3

# Adding columns to a DataFrame

We can add a column to a DataFrame the same way we would add a key to a dictionary:

In [38]:
df3['name'] = names

In [39]:
df3

Unnamed: 0,name
0,Manuel
1,Michael
2,Hugo


In [40]:
df3['age'] = age

In [41]:
df3

Unnamed: 0,name,age
0,Manuel,33
1,Michael,25
2,Hugo,14
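
The value on the right-hand side of the assignment does not have to be an existing Series. As a small sketch, the example below assigns a single scalar (which is repeated for every row) and a column computed from another column. The column names `country` and `age_in_months` are made up for this illustration.

```python
import pandas as pd

people = pd.DataFrame({"name": ["Manuel", "Michael", "Hugo"], "age": [33, 25, 14]})

# A scalar is repeated for every row (hypothetical column).
people["country"] = "unknown"

# A new column can be computed from an existing one (hypothetical column).
people["age_in_months"] = people["age"] * 12

print(people)
```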


We can see a dataframe's columns with `.columns`

In [42]:
df3.columns

Index(['name', 'age'], dtype='object')

We can select a column the same way we would look up a key in a dict:

In [43]:
df3['name']

0     Manuel
1    Michael
2       Hugo
Name: name, dtype: object

Selecting a column that does not exist raises a `KeyError` (the same error as when looking up a missing key in a dictionary):

In [44]:
df3["address"]

KeyError: 'address'
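
When a column may or may not be present, there are a few ways to avoid the exception. The sketch below shows three options on a small made-up table: a membership test on `.columns`, `DataFrame.get` (which returns `None` for a missing column, like `dict.get`), and catching the `KeyError` explicitly.

```python
import pandas as pd

people = pd.DataFrame({"name": ["Manuel", "Michael", "Hugo"], "age": [33, 25, 14]})

# Option 1: test for the column before selecting it.
has_address = "address" in people.columns

# Option 2: get() returns None instead of raising.
address = people.get("address")

# Option 3: catch the KeyError.
caught = False
try:
    people["address"]
except KeyError as error:
    caught = True
    print("missing column:", error)

print(has_address, address)
```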

# Editing the index

Every DataFrame has an index, which labels its rows and lets us look rows up by label. By default, the index is the row number.

In [45]:
df3.index  # the default index is a range of row numbers

RangeIndex(start=0, stop=3, step=1)

In [46]:
df3 = df3.set_index('name')

In [47]:
df3

name,age
Manuel,33
Michael,25
Hugo,14


In [48]:
df3.index

Index(['Manuel', 'Michael', 'Hugo'], dtype='object', name='name')
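
One practical benefit of a meaningful index is label-based selection with `.loc`. The following is a minimal sketch on a rebuilt copy of the same small table; the variable name `people` is an assumption for this example.

```python
import pandas as pd

people = pd.DataFrame(
    {"name": ["Manuel", "Michael", "Hugo"], "age": [33, 25, 14]}
).set_index("name")

# Select a single value by row label and column name.
print(people.loc["Hugo", "age"])

# Select several rows by label; the result is again a DataFrame.
print(people.loc[["Manuel", "Hugo"]])
```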

# Reset the index

In [49]:
df3 = df3.reset_index()

In [50]:
df3

Unnamed: 0,name,age
0,Manuel,33
1,Michael,25
2,Hugo,14
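
By default, `reset_index` moves the old index back into the table as a regular column. If the old index is not needed, passing `drop=True` discards it instead. A short sketch of the difference, using illustrative variable names:

```python
import pandas as pd

people = pd.DataFrame(
    {"name": ["Manuel", "Michael", "Hugo"], "age": [33, 25, 14]}
).set_index("age")

kept = people.reset_index()              # 'age' returns as a regular column
dropped = people.reset_index(drop=True)  # 'age' is discarded

print(list(kept.columns))     # ['age', 'name']
print(list(dropped.columns))  # ['name']
```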


In [51]:
df3 = df3.set_index('age')

In [52]:
df3

age,name
33,Manuel
25,Michael
14,Hugo


# Sorting the index

In [53]:
df3 = df3.sort_index()

In [54]:
df3

age,name
14,Hugo
25,Michael
33,Manuel


In [55]:
df3 = df3.sort_index(ascending=False)

In [56]:
df3

age,name
33,Manuel
25,Michael
14,Hugo


In [57]:
df3 = df3.reset_index()

In [58]:
df3

Unnamed: 0,age,name
0,33,Manuel
1,25,Michael
2,14,Hugo


# Sorting by a column

In [59]:
df3.sort_values(by="name")

Unnamed: 0,age,name
2,14,Hugo
0,33,Manuel
1,25,Michael


In [60]:
df3.sort_values(by="age", ascending=False)

Unnamed: 0,age,name
0,33,Manuel
1,25,Michael
2,14,Hugo
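
`sort_values` also accepts a list of columns, with a matching list for `ascending`, and it returns a new DataFrame rather than changing the original. The sketch below sorts by age (descending) and breaks ties by name (ascending). The extra row for "Ana" is made up so that there is a tie to break.

```python
import pandas as pd

people = pd.DataFrame(
    {"name": ["Manuel", "Michael", "Hugo", "Ana"], "age": [33, 25, 14, 25]}
)

# Sort by age, oldest first; rows with equal age are ordered by name.
ordered = people.sort_values(by=["age", "name"], ascending=[False, True])

print(ordered)
print(people)  # the original order is unchanged
```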


# Reading/Writing data with dataframes

pandas can read from and write to many file formats, including CSV, JSON, and Excel.

For example, we can read a CSV file with information about the Avengers (taken from [here](https://github.com/fivethirtyeight/data/tree/master/avengers)):

In [61]:
avengers = pd.read_csv("data/avengers.csv")

In [62]:
avengers.head()  # head() returns the first 5 rows by default

Unnamed: 0,URL,name,appearances,current,gender,starting_date,notes
0,http://marvel.wikia.com/Henry_Pym_(Earth-616),"Henry Jonathan ""Hank"" Pym",1269,YES,MALE,1963,Merged with Ultron in Rage of Ultron Vol. 1. A...
1,http://marvel.wikia.com/Janet_van_Dyne_(Earth-...,Janet van Dyne,1165,YES,FEMALE,1963,Dies in Secret Invasion V1:I8. Actually was se...
2,http://marvel.wikia.com/Anthony_Stark_(Earth-616),"Anthony Edward ""Tony"" Stark",3068,YES,MALE,1963,"Death: ""Later while under the influence of Imm..."
3,http://marvel.wikia.com/Robert_Bruce_Banner_(E...,Robert Bruce Banner,2089,YES,MALE,1963,"Dies in Ghosts of the Future arc. However ""he ..."
4,http://marvel.wikia.com/Thor_Odinson_(Earth-616),Thor Odinson,2402,YES,MALE,1963,Dies in Fear Itself brought back because that'...


We can save the DataFrame back to a CSV file with `to_csv`. This method writes the index as a separate column by default; passing `index=False` avoids this.

In [63]:
avengers.to_csv("avengers2.csv", index=False)  # index=False keeps the index out of the saved file

Or we can export to Excel using `to_excel` (in the pandas version used here, 0.24.2, writing an `.xls` file requires a separate package, `xlwt`):

In [64]:
avengers.to_excel("avengers.xls")

Likewise, we can easily read from an Excel file (in this pandas version, this requires the package `xlrd`):

In [65]:
avengers_reloaded = pd.read_excel("avengers.xls")

In [66]:
avengers_reloaded.head()

Unnamed: 0.1,Unnamed: 0,URL,name,appearances,current,gender,starting_date,notes
0,0,http://marvel.wikia.com/Henry_Pym_(Earth-616),"Henry Jonathan ""Hank"" Pym",1269,YES,MALE,1963,Merged with Ultron in Rage of Ultron Vol. 1. A...
1,1,http://marvel.wikia.com/Janet_van_Dyne_(Earth-...,Janet van Dyne,1165,YES,FEMALE,1963,Dies in Secret Invasion V1:I8. Actually was se...
2,2,http://marvel.wikia.com/Anthony_Stark_(Earth-616),"Anthony Edward ""Tony"" Stark",3068,YES,MALE,1963,"Death: ""Later while under the influence of Imm..."
3,3,http://marvel.wikia.com/Robert_Bruce_Banner_(E...,Robert Bruce Banner,2089,YES,MALE,1963,"Dies in Ghosts of the Future arc. However ""he ..."
4,4,http://marvel.wikia.com/Thor_Odinson_(Earth-616),Thor Odinson,2402,YES,MALE,1963,Dies in Fear Itself brought back because that'...
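
The extra `Unnamed: 0` column in the reloaded table is the index that was written to the file by default. The sketch below reproduces the same effect with a CSV round trip in memory, which avoids the Excel dependencies, and shows that `index_col=0` (a parameter of both `read_csv` and `read_excel`) turns that column back into the index. The table and variable names are made up for this example.

```python
import io

import pandas as pd

people = pd.DataFrame({"name": ["Manuel", "Michael", "Hugo"], "age": [33, 25, 14]})

buffer = io.StringIO()
people.to_csv(buffer)  # the index is written as an unnamed first column

buffer.seek(0)
with_extra = pd.read_csv(buffer)           # the index comes back as a data column

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)  # use the first column as the index

print(list(with_extra.columns))  # ['Unnamed: 0', 'name', 'age']
print(list(restored.columns))    # ['name', 'age']
```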


# Inspecting a dataframe

We can see the first rows of a dataframe with `head()`

In [67]:
avengers.head()

Unnamed: 0,URL,name,appearances,current,gender,starting_date,notes
0,http://marvel.wikia.com/Henry_Pym_(Earth-616),"Henry Jonathan ""Hank"" Pym",1269,YES,MALE,1963,Merged with Ultron in Rage of Ultron Vol. 1. A...
1,http://marvel.wikia.com/Janet_van_Dyne_(Earth-...,Janet van Dyne,1165,YES,FEMALE,1963,Dies in Secret Invasion V1:I8. Actually was se...
2,http://marvel.wikia.com/Anthony_Stark_(Earth-616),"Anthony Edward ""Tony"" Stark",3068,YES,MALE,1963,"Death: ""Later while under the influence of Imm..."
3,http://marvel.wikia.com/Robert_Bruce_Banner_(E...,Robert Bruce Banner,2089,YES,MALE,1963,"Dies in Ghosts of the Future arc. However ""he ..."
4,http://marvel.wikia.com/Thor_Odinson_(Earth-616),Thor Odinson,2402,YES,MALE,1963,Dies in Fear Itself brought back because that'...


and the last ones with `tail()`

In [68]:
avengers.tail()

Unnamed: 0,URL,name,appearances,current,gender,starting_date,notes
168,http://marvel.wikia.com/Eric_Brooks_(Earth-616)#,Eric Brooks,198,YES,MALE,2013,
169,http://marvel.wikia.com/Adam_Brashear_(Earth-6...,Adam Brashear,29,YES,MALE,2014,
170,http://marvel.wikia.com/Victor_Alvarez_(Earth-...,Victor Alvarez,45,YES,MALE,2014,
171,http://marvel.wikia.com/Ava_Ayala_(Earth-616)#,Ava Ayala,49,YES,FEMALE,2014,
172,http://marvel.wikia.com/Kaluu_(Earth-616)#,Kaluu,35,YES,MALE,2015,


We can see the size of a dataframe (n_rows, n_columns) with `shape`

In [69]:
avengers.shape

(173, 7)

We can see the data type of each column with `dtypes`

In [70]:
avengers.dtypes

URL              object
name             object
appearances       int64
current          object
gender           object
starting_date     int64
notes            object
dtype: object

We can use `describe` to get summary statistics for the DataFrame's numeric columns.

In [71]:
avengers.describe() #gives you a summary of numeric columns 

Unnamed: 0,appearances,starting_date
count,173.0,173.0
mean,414.052023,1988.445087
std,677.99195,30.374669
min,2.0,1900.0
25%,58.0,1979.0
50%,132.0,1996.0
75%,491.0,2010.0
max,4333.0,2015.0
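
By default, `describe` skips non-numeric columns such as `gender`. Passing `include="all"` adds them, reporting `count`, `unique`, `top` and `freq` for those columns and leaving the numeric statistics blank there. The sketch below uses a small made-up table rather than the Avengers file.

```python
import pandas as pd

# Made-up data with one text column and one numeric column.
heroes = pd.DataFrame(
    {
        "gender": ["MALE", "FEMALE", "MALE"],
        "appearances": [100, 200, 300],
    }
)

summary = heroes.describe(include="all")

print(summary)
print(summary.loc["unique", "gender"])     # number of distinct values
print(summary.loc["mean", "appearances"])  # average of the numeric column
```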
