![pandas logo](assets/Pandas_logo.png)


# pandas: Basics

This notebook contains code examples as an introduction to pandas.

The documentation of this package can be found here: https://pandas.pydata.org/docs/

## DataFrame

A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet, a SQL table, or a dict of Series objects. It is generally the most commonly used pandas object.

Doc: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
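
The "dict of Series" view can be sketched with a few lines. The column names and values below are made up for illustration and are not taken from any file used in this notebook.

```python
import pandas as pd

# Two Series that share the same index labels (illustrative values only).
ages = pd.Series([22.0, 38.0, 26.0], index=["a", "b", "c"])
sexes = pd.Series(["male", "female", "female"], index=["a", "b", "c"])

# Building a DataFrame from a dict of Series: the keys become column names.
people = pd.DataFrame({"age": ages, "sex": sexes})

print(people.dtypes)  # each column keeps its own dtype
print(people.shape)   # (number of rows, number of columns)
```

Selecting one column, for example `people["age"]`, gives back a Series, which is what makes the dict analogy useful.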

## First Steps

### Importing pandas
The next cell imports the pandas package and sets the maximum number of rows to display.

Doc: https://pandas.pydata.org/docs/user_guide

In [None]:
import pandas as pd
pd.options.display.max_rows = 10
pd.options.display.min_rows = None
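
If you prefer not to change the display settings for the whole session, one possible alternative is `pd.option_context`, which applies an option only inside a `with` block. This is a minimal sketch; the Series and the value 6 are arbitrary choices for illustration.

```python
import pandas as pd

long_series = pd.Series(range(100))

before = pd.get_option("display.max_rows")

# The option is changed only while the with-block is active.
with pd.option_context("display.max_rows", 6):
    inside = pd.get_option("display.max_rows")
    print(long_series)  # printed in truncated form

# After the block, the previous value is restored.
after = pd.get_option("display.max_rows")
print(before, inside, after)
```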

### Reading a csv
Now you will import a CSV file into a DataFrame. `read_csv` accepts many options; see the documentation for reference: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html?highlight=read_csv#pandas.read_csv

In [None]:
titanic = pd.read_csv("titanic.csv")
titanic

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,deck
0,0,3,male,22.0,1,0,7.2500,S,
1,1,1,female,38.0,1,0,71.2833,C,C
2,1,3,female,26.0,0,0,7.9250,S,
3,1,1,female,35.0,1,0,53.1000,S,C
4,0,3,male,35.0,0,0,8.0500,S,
...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,
887,1,1,female,19.0,0,0,30.0000,S,B
888,0,3,female,,1,2,23.4500,S,
889,1,1,male,26.0,0,0,30.0000,C,C


In [11]:
type(titanic)

pandas.core.frame.DataFrame

In [None]:
titanic_sep = pd.read_csv("titanic.csv", sep='|', header=None, na_values="other_null")
titanic_sep

Unnamed: 0,0
0,"survived,pclass,sex,age,sibsp,parch,fare,embar..."
1,"0,3,male,22.0,1,0,7.25,S,"
2,"1,1,female,38.0,1,0,71.2833,C,C"
3,"1,3,female,26.0,0,0,7.925,S,"
4,"1,1,female,35.0,1,0,53.1,S,C"
...,...
887,"0,2,male,27.0,0,0,13.0,S,"
888,"1,1,female,19.0,0,0,30.0,S,B"
889,"0,3,female,,1,2,23.45,S,"
890,"1,1,male,26.0,0,0,30.0,C,C"
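
In the cell above, the file is comma-separated but is read with `sep='|'` and `header=None`, so each line ends up as a single value in one column numbered `0`. The sketch below shows the same options on data that really is pipe-delimited. It reads from an in-memory string instead of a file, and the three-column sample is made up for illustration.

```python
import io
import pandas as pd

# Hypothetical pipe-delimited data; "other_null" stands for a missing value.
raw = "survived|age|deck\n0|22.0|other_null\n1|38.0|C\n"

# sep tells pandas how fields are separated; na_values adds a custom NaN marker.
parsed = pd.read_csv(io.StringIO(raw), sep="|", na_values="other_null")
print(parsed)

# With header=None the first line is treated as data and columns are numbered.
no_header = pd.read_csv(io.StringIO("0|22.0\n1|38.0\n"), sep="|", header=None)
print(no_header.columns.tolist())
```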


### Writing a csv
`to_csv` writes a DataFrame to a CSV file. There are plenty of options you can use to control the output: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html

In [None]:
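# Default settings: comma-separated, with the header row and the index column
titanic.to_csv('titanic_from_df.csv')
# Same path, so this overwrites the file above: pipe-separated, no header, no index, NaN written as "NAN"
titanic.to_csv('titanic_from_df.csv', sep='|', header=False, index=False, na_rep="NAN")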
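
To see what these options do without touching the disk, here is a small round-trip sketch that writes to an in-memory buffer and reads the text back. The two-row DataFrame is made up for illustration, and the header is kept so the columns can be recovered by name.

```python
import io
import pandas as pd

df_out = pd.DataFrame({"age": [22.0, None], "deck": ["C", None]})

# Write to a text buffer instead of a file path.
buffer = io.StringIO()
df_out.to_csv(buffer, sep="|", index=False, na_rep="NAN")
text = buffer.getvalue()
print(text)

# Read it back; "NAN" has to be declared as a missing-value marker.
df_back = pd.read_csv(io.StringIO(text), sep="|", na_values="NAN")
print(df_back)
```

Reading back with the same `sep` and declaring the `na_rep` string in `na_values` is what keeps the missing values intact in this sketch.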

## Basic DF Functions, Attributes and Methods

In [None]:
titanic.head(2)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,deck
0,0,3,male,22.0,1,0,7.25,S,
1,1,1,female,38.0,1,0,71.2833,C,C


In [None]:
titanic.tail(2)

In [None]:
titanic.columns

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'deck'],
      dtype='object')

In [None]:
titanic.index

RangeIndex(start=0, stop=891, step=1)

In [None]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   survived  891 non-null    int64  
 1   pclass    891 non-null    int64  
 2   sex       891 non-null    object 
 3   age       714 non-null    float64
 4   sibsp     891 non-null    int64  
 5   parch     891 non-null    int64  
 6   fare      891 non-null    float64
 7   embarked  889 non-null    object 
 8   deck      203 non-null    object 
dtypes: float64(2), int64(4), object(3)
memory usage: 62.8+ KB


In [10]:
titanic.dtypes

survived      int64
pclass        int64
sex          object
age         float64
sibsp         int64
parch         int64
fare        float64
embarked     object
deck         object
dtype: object

In [18]:
titanic.describe()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292


In [20]:
len(titanic)

891

In [21]:
round(titanic, 0).head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,deck
0,0,3,male,22.0,1,0,7.0,S,
1,1,1,female,38.0,1,0,71.0,C,C
2,1,3,female,26.0,0,0,8.0,S,
3,1,1,female,35.0,1,0,53.0,S,C
4,0,3,male,35.0,0,0,8.0,S,


In [32]:
titanic.size

8019
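
The attributes above are related: `size` is the number of rows times the number of columns, which is why the 891-row, 9-column `titanic` DataFrame reports 8019. A small sketch with a made-up DataFrame shows the same relationship, using the standard `shape` attribute.

```python
import pandas as pd

small = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

n_rows, n_cols = small.shape  # shape is a (rows, columns) tuple
print(len(small))             # number of rows
print(small.size)             # total number of cells
```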

## pandas.Series

A Series is a one-dimensional ndarray with axis labels (including time series).

Doc: https://pandas.pydata.org/docs/reference/api/pandas.Series.html?highlight=series#pandas.Series

In [35]:
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,deck
0,0,3,male,22.0,1,0,7.25,S,
1,1,1,female,38.0,1,0,71.2833,C,C
2,1,3,female,26.0,0,0,7.925,S,
3,1,1,female,35.0,1,0,53.1,S,C
4,0,3,male,35.0,0,0,8.05,S,


In [13]:
titanic["age"]

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: age, Length: 891, dtype: float64

In [38]:
type(titanic["age"])

pandas.core.series.Series

In [12]:
titanic[["age"]]

Unnamed: 0,age
0,22.0
1,38.0
2,26.0
3,35.0
4,35.0
...,...
886,27.0
887,19.0
888,
889,26.0


In [9]:
type(titanic[["age"]])

pandas.core.frame.DataFrame

In [11]:
# Raises a KeyError: the pair is treated as one key, not as two column names
# titanic["age", "sex"]
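
The cells above can be summarized in a short sketch: single brackets with one name return a Series, a list of names returns a DataFrame, and a bare pair of names fails. The two-row DataFrame below is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"age": [22.0, 38.0], "sex": ["male", "female"]})

as_series = df["age"]         # one name -> Series
as_frame = df[["age"]]        # list with one name -> DataFrame
two_cols = df[["age", "sex"]] # list with two names -> DataFrame

# A bare pair is passed as the single key ("age", "sex"), which is not a column.
try:
    df["age", "sex"]
    pair_raised = False
except KeyError as err:
    pair_raised = True
    print("KeyError:", err)
```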

In [14]:
titanic[["age", "sex"]]

Unnamed: 0,age,sex
0,22.0,male
1,38.0,female
2,26.0,female
3,35.0,female
4,35.0,male
...,...,...
886,27.0,male
887,19.0,female
888,,female
889,26.0,male


In [15]:
titanic[["sex", "age", "fare"]]

Unnamed: 0,sex,age,fare
0,male,22.0,7.2500
1,female,38.0,71.2833
2,female,26.0,7.9250
3,female,35.0,53.1000
4,male,35.0,8.0500
...,...,...,...
886,male,27.0,13.0000
887,female,19.0,30.0000
888,female,,23.4500
889,male,26.0,30.0000


## Creating series and DFs

Each element of a Series is identified by a label in its index. You can create new Series or DataFrames from lists, arrays, dictionaries, and existing Series objects.

In [13]:
data = [1000, 2000, 3000, 4000, 5000]
s = pd.Series(data)
print(s)

0    1000
1    2000
2    3000
3    4000
4    5000
dtype: int64


In [14]:
data = [1000, 2000, 3000, 4000, 5000]
df = pd.DataFrame(data, columns=['Column1'])
print(df)

   Column1
0     1000
1     2000
2     3000
3     4000
4     5000
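
The two cells above start from a list. As a sketch of the other inputs mentioned (dictionaries, arrays, and existing Series), with all names and values chosen only for illustration:

```python
import numpy as np
import pandas as pd

# From a dict: the keys become the index labels.
s_from_dict = pd.Series({"a": 1000, "b": 2000, "c": 3000})

# From a NumPy array, with an explicit index.
s_from_array = pd.Series(np.array([1.5, 2.5, 3.5]), index=["x", "y", "z"])

# DataFrame from a dict of lists: the keys become column names.
df_from_dict = pd.DataFrame({"Column1": [1000, 2000], "Column2": ["a", "b"]})

# DataFrame from existing Series: rows are aligned on the shared index.
df_from_series = pd.DataFrame({"amount": s_from_dict, "doubled": s_from_dict * 2})

print(df_from_series)
```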


In [16]:
titanic.age

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: age, Length: 891, dtype: float64

In [17]:
titanic.age.equals(titanic["age"])

True
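
Attribute access such as `titanic.age` is convenient, but it has limits that bracket access does not. The sketch below uses a made-up DataFrame whose column names (`count`, `fare paid`) were chosen to trigger those limits.

```python
import pandas as pd

df = pd.DataFrame({"age": [22.0, 38.0], "count": [1, 2], "fare paid": [7.25, 71.28]})

# For a plain name, both spellings return the same Series.
same = df.age.equals(df["age"])

# "count" clashes with the DataFrame.count method, so the attribute is the method.
attribute_is_method = callable(df.count)
count_column = df["count"]  # brackets still reach the column

# A name with a space cannot be written as an attribute at all; use brackets.
fare_column = df["fare paid"]

print(same, attribute_is_method)
```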

### Selecting Rows with Square Brackets (not advisable)

In [18]:
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,deck
0,0,3,male,22.0,1,0,7.25,S,
1,1,1,female,38.0,1,0,71.2833,C,C
2,1,3,female,26.0,0,0,7.925,S,
3,1,1,female,35.0,1,0,53.1,S,C
4,0,3,male,35.0,0,0,8.05,S,


In [19]:
titanic[0:1]

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,deck
0,0,3,male,22.0,1,0,7.25,S,


In [20]:
titanic[4:8]

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,deck
4,0,3,male,35.0,0,0,8.05,S,
5,0,3,male,,0,0,8.4583,Q,
6,0,1,male,54.0,0,0,51.8625,S,E
7,0,3,male,2.0,3,1,21.075,S,


In [21]:
titanic[:10]

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,deck
0,0,3,male,22.0,1,0,7.25,S,
1,1,1,female,38.0,1,0,71.2833,C,C
2,1,3,female,26.0,0,0,7.925,S,
3,1,1,female,35.0,1,0,53.1,S,C
4,0,3,male,35.0,0,0,8.05,S,
5,0,3,male,,0,0,8.4583,Q,
6,0,1,male,54.0,0,0,51.8625,S,E
7,0,3,male,2.0,3,1,21.075,S,
8,1,3,female,27.0,0,2,11.1333,S,
9,1,2,female,14.0,1,0,30.0708,C,


In [22]:
titanic[-10:]

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,deck
881,0,3,male,33.0,0,0,7.8958,S,
882,0,3,female,22.0,0,0,10.5167,S,
883,0,2,male,28.0,0,0,10.5,S,
884,0,3,male,25.0,0,0,7.05,S,
885,0,3,female,39.0,0,5,29.125,Q,
886,0,2,male,27.0,0,0,13.0,S,
887,1,1,female,19.0,0,0,30.0,S,B
888,0,3,female,,1,2,23.45,S,
889,1,1,male,26.0,0,0,30.0,C,C
890,0,3,male,32.0,0,0,7.75,Q,
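
One possible reading of "not advisable": square brackets mean different things depending on what you pass. A slice selects rows, while a single key is looked up among the columns. The sketch below uses a made-up DataFrame with string row labels to make that visible, and shows `.iloc[]` as the explicit alternative.

```python
import pandas as pd

df = pd.DataFrame({"age": [22.0, 38.0, 26.0]}, index=["x", "y", "z"])

# A slice inside [] selects rows, here by position.
first_two = df[0:2]

# A single key inside [] is looked up among the columns, so this fails.
try:
    df[0]
    single_key_raised = False
except KeyError:
    single_key_raised = True

# .iloc[] states the intent (rows by position) explicitly.
explicit = df.iloc[0:2]

print(first_two.equals(explicit), single_key_raised)
```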


### Indexing Operator iloc (location based indexing) 

#### Selecting Rows with iloc

`.iloc[]` provides integer-location based indexing for selection by position. It is primarily integer position based (from 0 to length-1 of the axis), but it may also be used with a boolean array.
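
As a minimal sketch of the boolean-array case, which the cells in this section do not cover, the made-up DataFrame below is indexed once with integer positions and once with a NumPy boolean array of the same length as the axis.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [22.0, 38.0, 26.0, 35.0]}, index=["w", "x", "y", "z"])

# Integer positions: rows 0 and 2, regardless of their labels.
by_position = df.iloc[[0, 2]]

# Boolean array: True keeps the row at that position.
mask = np.array([True, False, True, False])
by_mask = df.iloc[mask]

# The last valid position is length-1; -1 is a shorthand for it.
last_row = df.iloc[len(df) - 1]

print(by_mask)
```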

In [49]:
titanic.iloc[0]

survived       0
pclass         3
sex         male
age         22.0
sibsp          1
parch          0
fare        7.25
embarked       S
deck         NaN
Name: 0, dtype: object

In [23]:
type(titanic.iloc[0])

pandas.core.series.Series

In [24]:
titanic.iloc[-1]

survived       0
pclass         3
sex         male
age         32.0
sibsp          0
parch          0
fare        7.75
embarked       Q
deck         NaN
Name: 890, dtype: object

In [25]:
titanic.iloc[:5]

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,deck
0,0,3,male,22.0,1,0,7.25,S,
1,1,1,female,38.0,1,0,71.2833,C,C
2,1,3,female,26.0,0,0,7.925,S,
3,1,1,female,35.0,1,0,53.1,S,C
4,0,3,male,35.0,0,0,8.05,S,


In [54]:
titanic.iloc[-5:]

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,deck
886,0,2,male,27.0,0,0,13.0,S,
887,1,1,female,19.0,0,0,30.0,S,B
888,0,3,female,,1,2,23.45,S,
889,1,1,male,26.0,0,0,30.0,C,C
890,0,3,male,32.0,0,0,7.75,Q,


In [26]:
titanic.iloc[456:459]

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,deck
456,0,1,male,65.0,0,0,26.55,S,E
457,1,1,female,,1,0,51.8625,S,D
458,1,2,female,50.0,0,0,10.5,S,


In [27]:
titanic.iloc[[2,45,765]]

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,deck
2,1,3,female,26.0,0,0,7.925,S,
45,0,3,male,,0,0,8.05,S,
765,1,1,female,51.0,1,0,77.9583,S,D


In [28]:
titanic.iloc[0,0:3]

survived       0
pclass         3
sex         male
Name: 0, dtype: object

In [33]:
titanic.iloc[:,[0,2,6,8]]

Unnamed: 0,survived,sex,fare,deck
0,0,male,7.2500,
1,1,female,71.2833,C
2,1,female,7.9250,
3,1,female,53.1000,C
4,0,male,8.0500,
...,...,...,...,...
886,0,male,13.0000,
887,1,female,30.0000,B
888,0,female,23.4500,
889,1,male,30.0000,C


In [29]:
titanic.iloc[0,[0,2,6,8]]

survived       0
sex         male
fare        7.25
deck         NaN
Name: 0, dtype: object

In [62]:
titanic.iloc[34:39,[0,2,6,8]]

Unnamed: 0,survived,sex,fare,deck
34,0,male,82.1708,
35,0,male,52.0,
36,1,male,7.2292,
37,0,male,8.05,
38,0,female,18.0,


#### Selecting Columns with iloc

In [34]:
titanic.iloc[:, 0].equals(titanic.survived)

True

In [35]:
titanic["survived"]

0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    0
889    1
890    0
Name: survived, Length: 891, dtype: int64

### Indexing Operator loc (label based indexing)

In [36]:
medals = pd.read_csv("summer.csv", index_col="Athlete")

medals_wo_index = pd.read_csv("summer.csv")

In [37]:
medals_wo_index.head()

Unnamed: 0,Year,City,Sport,Discipline,Athlete,Country,Gender,Event,Medal
0,1896,Athens,Aquatics,Swimming,"HAJOS, Alfred",HUN,Men,100M Freestyle,Gold
1,1896,Athens,Aquatics,Swimming,"HERSCHMANN, Otto",AUT,Men,100M Freestyle,Silver
2,1896,Athens,Aquatics,Swimming,"DRIVAS, Dimitrios",GRE,Men,100M Freestyle For Sailors,Bronze
3,1896,Athens,Aquatics,Swimming,"MALOKINIS, Ioannis",GRE,Men,100M Freestyle For Sailors,Gold
4,1896,Athens,Aquatics,Swimming,"CHASAPIS, Spiridon",GRE,Men,100M Freestyle For Sailors,Silver


In [39]:
medals.head()

Athlete,Year,City,Sport,Discipline,Country,Gender,Event,Medal
"HAJOS, Alfred",1896,Athens,Aquatics,Swimming,HUN,Men,100M Freestyle,Gold
"HERSCHMANN, Otto",1896,Athens,Aquatics,Swimming,AUT,Men,100M Freestyle,Silver
"DRIVAS, Dimitrios",1896,Athens,Aquatics,Swimming,GRE,Men,100M Freestyle For Sailors,Bronze
"MALOKINIS, Ioannis",1896,Athens,Aquatics,Swimming,GRE,Men,100M Freestyle For Sailors,Gold
"CHASAPIS, Spiridon",1896,Athens,Aquatics,Swimming,GRE,Men,100M Freestyle For Sailors,Silver


#### Selecting Rows with loc

`.loc[]` accesses a group of rows and columns by label(s). It is primarily label based, but it may also be used with a boolean array.
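
A minimal sketch of both uses, on a made-up stand-in for the medals data (the athlete labels "A" and "B" and the events are hypothetical):

```python
import pandas as pd

medals_demo = pd.DataFrame(
    {"Event": ["100M", "200M", "Long Jump"], "Medal": ["Gold", "Silver", "Gold"]},
    index=pd.Index(["A", "B", "A"], name="Athlete"),
)

# Label lookup: a label that appears more than once returns all matching rows.
rows_a = medals_demo.loc["A"]

# Boolean array: a comparison yields one True/False per row, and .loc keeps the True rows.
is_gold = medals_demo["Medal"] == "Gold"
gold_events = medals_demo.loc[is_gold, "Event"]

print(rows_a)
print(gold_events)
```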

In [40]:
medals.loc["DRIVAS, Dimitrios"]

Year                                1896
City                              Athens
Sport                           Aquatics
Discipline                      Swimming
Country                              GRE
Gender                               Men
Event         100M Freestyle For Sailors
Medal                             Bronze
Name: DRIVAS, Dimitrios, dtype: object

In [42]:
medals.loc["PHELPS, Michael", "Medal"]

Athlete
PHELPS, Michael      Gold
PHELPS, Michael      Gold
PHELPS, Michael    Bronze
PHELPS, Michael      Gold
PHELPS, Michael      Gold
                    ...  
PHELPS, Michael    Silver
PHELPS, Michael      Gold
PHELPS, Michael    Silver
PHELPS, Michael      Gold
PHELPS, Michael      Gold
Name: Medal, Length: 22, dtype: object

In [41]:
medals.loc["PHELPS, Michael"].iloc[0]

Year                    2004
City                  Athens
Sport               Aquatics
Discipline          Swimming
Country                  USA
Gender                   Men
Event         100M Butterfly
Medal                   Gold
Name: PHELPS, Michael, dtype: object

#### Slicing Rows and Columns with loc

In [43]:
medals.loc["PHELPS, Michael", ["Event","Medal"]]

Athlete,Event,Medal
"PHELPS, Michael",100M Butterfly,Gold
"PHELPS, Michael",200M Butterfly,Gold
"PHELPS, Michael",200M Freestyle,Bronze
"PHELPS, Michael",200M Individual Medley,Gold
"PHELPS, Michael",400M Individual Medley,Gold
...,...,...
"PHELPS, Michael",200M Butterfly,Silver
"PHELPS, Michael",200M Medley,Gold
"PHELPS, Michael",4X100M Freestyle,Silver
"PHELPS, Michael",4X100M Medley,Gold


In [44]:
medals.loc[["PHELPS, Michael", "LEWIS, Carl"], ["Event","Medal"]]

Athlete,Event,Medal
"PHELPS, Michael",100M Butterfly,Gold
"PHELPS, Michael",200M Butterfly,Gold
"PHELPS, Michael",200M Freestyle,Bronze
"PHELPS, Michael",200M Individual Medley,Gold
"PHELPS, Michael",400M Individual Medley,Gold
...,...,...
"LEWIS, Carl",200M,Silver
"LEWIS, Carl",Long Jump,Gold
"LEWIS, Carl",4X100M Relay,Gold
"LEWIS, Carl",Long Jump,Gold


In [47]:
medals.loc["DRIVAS, Dimitrios":"BLAKE, Arthur"]

Athlete,Year,City,Sport,Discipline,Country,Gender,Event,Medal
"DRIVAS, Dimitrios",1896,Athens,Aquatics,Swimming,GRE,Men,100M Freestyle For Sailors,Bronze
"MALOKINIS, Ioannis",1896,Athens,Aquatics,Swimming,GRE,Men,100M Freestyle For Sailors,Gold
"CHASAPIS, Spiridon",1896,Athens,Aquatics,Swimming,GRE,Men,100M Freestyle For Sailors,Silver
"CHOROPHAS, Efstathios",1896,Athens,Aquatics,Swimming,GRE,Men,1200M Freestyle,Bronze
"HAJOS, Alfred",1896,Athens,Aquatics,Swimming,HUN,Men,1200M Freestyle,Gold
...,...,...,...,...,...,...,...,...
"CURTIS, Thomas",1896,Athens,Athletics,Athletics,USA,Men,110M Hurdles,Gold
"GOULDING, Grantley",1896,Athens,Athletics,Athletics,GBR,Men,110M Hurdles,Silver
"LERMUSIAUX, Albin",1896,Athens,Athletics,Athletics,FRA,Men,1500M,Bronze
"FLACK, Edwin",1896,Athens,Athletics,Athletics,AUS,Men,1500M,Gold


In [48]:
medals.loc["HAJOS, Alfred", "Year":"Discipline"]

Athlete,Year,City,Sport,Discipline
"HAJOS, Alfred",1896,Athens,Aquatics,Swimming
"HAJOS, Alfred",1896,Athens,Aquatics,Swimming

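One detail worth knowing when slicing with `.loc[]`: label slices include both endpoints, whereas position slices with `.iloc[]` exclude the stop position. That is why `"Year":"Discipline"` above returns the `Discipline` column too. A small sketch on a made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame(
    {"Year": [1896] * 4, "City": ["Athens"] * 4, "Sport": ["Aquatics"] * 4},
    index=["a", "b", "c", "d"],
)

# Label slice: both "b" and "c" are included.
label_slice = df.loc["b":"c"]

# Position slice: position 1 is included, position 2 is not.
position_slice = df.iloc[1:2]

# The same inclusive rule applies to column labels.
column_slice = df.loc["a", "Year":"City"]

print(label_slice.index.tolist(), position_slice.index.tolist())
```
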

## Slicing errors

If a row label or a column label is not found, `.loc[]` raises a `KeyError`.

In [49]:
medals.loc["PHELPS, Michael", ["Year", "Age"]]

KeyError: "['Age'] not in index"

In [50]:
medals.loc["Other", ["Year", "City"]]


KeyError: 'Other'
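
There are a few common ways to deal with labels that may be missing. The sketch below shows two of them: checking membership before the lookup, and catching the `KeyError`. The athlete labels and years are hypothetical stand-ins, not rows from `summer.csv`.

```python
import pandas as pd

df = pd.DataFrame({"Year": [2004, 1984]}, index=["ATHLETE, One", "ATHLETE, Two"])

# Option 1: test membership in the index before looking up a row label.
label = "Other"
row = df.loc[label] if label in df.index else None

# Option 2: catch the KeyError raised for a missing column label.
try:
    df.loc["ATHLETE, One", ["Year", "Age"]]
    missing_raised = False
except KeyError as err:
    missing_raised = True
    print("KeyError:", err)

# Keep only the requested columns that actually exist.
wanted = ["Year", "Age"]
present = [col for col in wanted if col in df.columns]
subset = df.loc["ATHLETE, One", present]

print(row, present)
```

Which option fits best depends on whether a missing label is an expected case or a real error in your data.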

# BONUS!

Check the included [iloc](pandas-iloc.pdf) and [loc](pandas-loc.pdf) cheat sheets!