---
# Introduction to Machine Learning
---
## Carlos Ramirez-Perez
#### Email: [cross224@hotmail.com](cross224@hotmail.com)
#### Github: [carap/uea4502027](github.com/carap/uea4502027)

# Introduction to Pandas

In this section, we will review **Pandas**, a Python data analysis library. 

+ Pandas is an open source library built on top of NumPy
+ It allows for fast data cleaning/preparation/analysis
+ It can work with data from a wide variety of sources

> **Pandas** can be thought of as a programmable, far more capable version of Excel.


# Introduction to Pandas

We will cover the following topics:
* Series
* DataFrames
* Data Input/Output and Visualization (*next session*)

---
# Pandas  `Series`
---

# Pandas `Series`

The first main data type we will learn about is the `Series`. 
+ **Pandas `Series`** data type is very similar to a **NumPy `array`** 
    + the former is built on top of the latter
+ What differentiates a Pandas `Series` from a NumPy `array` is that the `Series` can have `axis labels`
    + A `Series` can be indexed by *label*, not only by numeric position
    + A `Series` can hold arbitrary Python objects, not only numbers

# Pandas `Series`
Let's import Pandas and explore the `Series` object through some examples

In [1]:
import pandas as pd
# from pandas import Series

In [2]:
len(dir(pd.Series))

438

help(pd.Series)

    | Parameters
    | ----------
    |  data : array-like, dict, or scalar value
    |  index : array-like or Index (1d)
    |  dtype : numpy.dtype or None
    |  copy : boolean, default False

---
# Creating `Series` from `Arrays`
---

## Creating  Series from numpy Arrays

In [3]:
import numpy as np
import pandas as pd

In [4]:
my_narray = np.array([10,20,30])
my_labels = ['a','b','c']  # Python list

In [5]:
pd.Series( data= my_narray, index= my_labels )

a    10
b    20
c    30
dtype: int64

In [6]:
my_series = pd.Series( data= my_narray, index= my_labels )
my_series['a']

10

---
# Creating `Series` from `Lists`
---

---
# Data type: Lists [...]
---
A standard data structure in Computer Science is the **sequence container**.

+ They are a way to store collections of **ordered data**
+ In the C++ STL, **sequence containers** are divided into ***vector, deque, list, array***, but in Python this structure is called **List**

> Standard Template Library (STL), a software library for the C++ programming language, provides components called containers, algorithms, and iterators

## Creating `Series` from `Lists`
You can convert Python lists into Pandas series

In [7]:
my_list = [10,20,30]    # Python list (homogeneous)

In [8]:
pd.Series(data = my_list)  # using lists

0    10
1    20
2    30
dtype: int64

In [9]:
pd.Series(data = my_list)[0]  # Indexing (numeric)

10

In [10]:
my_series = pd.Series(data = my_list)  # using lists
my_series[0]

10

## Creating `Series` from `Lists`

In [11]:
my_labels = ['a','b','c']      # Python list of labels
my_list = [10, 3.14, True]     # Python list (non-homogeneous)

In [12]:
pd.Series(data= my_list, index= my_labels)

a      10
b    3.14
c    True
dtype: object

In [13]:
pd.Series( my_list , my_labels )['a']

10

In [14]:
my_series = pd.Series( my_list , my_labels )
my_series['a']

10
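
The `dtype` shown in the outputs above depends on what the list contains. The following is a minimal sketch of that inference, using made-up variable names (`ints`, `floats`, `mixed`); it is an illustration, not part of the original notebook.

```python
import pandas as pd

ints = pd.Series([10, 20, 30])          # all integers
floats = pd.Series([10, 3.14])          # int + float -> upcast to float
mixed = pd.Series([10, 3.14, True])     # number + bool -> generic Python objects

print(ints.dtype, floats.dtype, mixed.dtype)

# Each element of an object Series keeps its original Python type
print([type(v).__name__ for v in mixed])
```

A numeric `dtype` lets Pandas use NumPy's vectorized operations, so homogeneous data is usually preferable when it is available.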

---
# Creating `Series` from `Dictionary`
---

---
# Data type: Dictionary {...}
---
A standard data structure in Computer Science is the **associative container**
+ They are a way to store collections of data in the form of *key:value* (*input:output*) pairs, looked up by key rather than by position (since Python 3.7, a dictionary also preserves insertion order).
+ In the C++ STL, **associative containers** are divided into ***set/multiset, map/multimap***, but in Python this structure is called **Dictionary**.

## Creating `Series` from `Dictionary`
You can convert a Python dictionary into a Pandas Series

In [15]:
d = {'key1':10, 'key2': True,'key3': (123,456)}

In [16]:
pd.Series(d)

key1            10
key2          True
key3    (123, 456)
dtype: object

In [17]:
d = {'First':'a','Second':'b','Third':'c'}

In [18]:
pd.Series(d)

First     a
Second    b
Third     c
dtype: object
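
As the outputs above suggest, the dictionary keys become the index labels and the values become the data. A small sketch, assuming Python 3.7+ and a recent pandas; the `prices` dictionary is invented for illustration.

```python
import pandas as pd

prices = {'pear': 0.8, 'apple': 1.2, 'fig': 3.5}   # illustrative data
s = pd.Series(prices)

# Keys become the index, in the dictionary's insertion order
print(list(s.index))

# An explicit index selects and reorders keys; a missing key gives NaN
s2 = pd.Series(prices, index=['fig', 'apple', 'kiwi'])
print(s2)
```

Passing `index=` alongside a dictionary is a convenient way to pick a subset of keys, at the cost of `NaN` entries for labels the dictionary does not contain.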

## Creating `Series` from `Dictionary`
A pandas Series can hold a variety of object types

In [19]:
# Even functions (unlikely we will use this)
fseries = pd.Series({'a':sum,'b':print,'c':len})
fseries

a      <built-in function sum>
b    <built-in function print>
c      <built-in function len>
dtype: object

In [20]:
fseries['a']

<function sum>

# Using an Index
The key to using a Series is understanding its indexing. Pandas indexing allows for fast *lookups* of data (like a hash table)

In [21]:
ser1 = pd.Series(data = [1,2,3,4],
                 index = ['USA','Germany','Japan','USSR'])  
ser1

USA        1
Germany    2
Japan      3
USSR       4
dtype: int64

In [22]:
ser2 = pd.Series(data = [5,2,5,4],
                 index = ['USA','Germany','Japan','Italy'])                                   
ser2

USA        5
Germany    2
Japan      5
Italy      4
dtype: int64

In [23]:
np.array([ ser1['USA'], ser2['USA'] ])  # Array preparation 

array([1, 5])

# Using an Index
Operations are then also done based on the index

In [24]:
ser1 + ser2

Germany    4.0
Italy      NaN
Japan      8.0
USA        6.0
USSR       NaN
dtype: float64

In [25]:
type(ser1 + ser2)

pandas.core.series.Series

In [26]:
(ser1 + ser2)['Japan']  #Series indexing

8.0

---
# Pandas `DataFrames`
---

# Pandas `DataFrames`
DataFrames are the *workhorse* of Pandas and are directly inspired by the **R programming language**. 

> We can think of a `DataFrame` as a bunch of `Series` objects put together so that they share the same index.

# Pandas `DataFrames`
Let's import Pandas and explore the `DataFrame` object

In [27]:
import pandas as pd
#from pandas import DataFrame

In [28]:
len(dir(pd.DataFrame))

443

help(pd.DataFrame)

    |  Parameters
    |  ----------
    |  data :  numpy ndarray, dict , or DataFrame
    |  index : Index or array-like to use for resulting frame.
    |  columns : Index or array-like (column labels)
    |  dtype : dtype, default None
    |  copy : boolean, default False
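
Since `data` accepts a `dict`, a DataFrame can be assembled from several `Series`, one per column. The sketch below uses invented names (`a`, `b`, `df_demo`) to show that the columns are aligned on their labels.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10.0, 30.0], index=['x', 'z'])   # no entry for 'y'

# Each dict key becomes a column; rows are the union of the Series labels
df_demo = pd.DataFrame({'A': a, 'B': b})
print(df_demo)
```

The row for `'y'` gets `NaN` in column `B`, the same alignment behaviour seen when adding two `Series`.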

# DataFrames

In [29]:
import pandas as pd
import numpy as np
np.random.seed(101)

In [30]:
df = pd.DataFrame( data= np.random.randn(5,4),
                  index= '1 2 3 4 5'.split(),
                  columns= 'A B C D'.split() )
df

Unnamed: 0,A,B,C,D
1,2.70685,0.628133,0.907969,0.503826
2,0.651118,-0.319318,-0.848077,0.605965
3,-2.018168,0.740122,0.528813,-0.589001
4,0.188695,-0.758872,-0.933237,0.955057
5,0.190794,1.978757,2.605967,0.683509


# Selection and Indexing
Let's learn the various methods to grab data from a DataFrame

In [31]:
df['A']

1    2.706850
2    0.651118
3   -2.018168
4    0.188695
5    0.190794
Name: A, dtype: float64

In [32]:
type(df['A'])

pandas.core.series.Series

In [33]:
df.A  # Attribute (dot) access to a column (NOT RECOMMENDED!)

1    2.706850
2    0.651118
3   -2.018168
4    0.188695
5    0.190794
Name: A, dtype: float64

In [34]:
type(df['A']) == type(df.A)   # pandas.core.series.Series
                              # DataFrame Columns are Series

True
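
One reason dot access is discouraged: it breaks when a column name collides with an existing DataFrame attribute or is not a valid Python identifier. A minimal sketch with a made-up frame:

```python
import pandas as pd

df_demo = pd.DataFrame({'count': [3, 4], 'unit price': [1.5, 2.0]})

by_key = df_demo['count']    # always the column
by_attr = df_demo.count      # the DataFrame.count method, not the column
print(type(by_key).__name__, callable(by_attr))

# A column name containing a space is only reachable with brackets
print(df_demo['unit price'].sum())
```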

# Selection and Indexing
Let's learn the various methods to grab data from a DataFrame

In [35]:
df[['A']]   # Pass a list of column names

Unnamed: 0,A
1,2.70685
2,0.651118
3,-2.018168
4,0.188695
5,0.190794


In [36]:
type(df[['A']])   # Pass a list of column names

pandas.core.frame.DataFrame

# Selection and Indexing
Let's learn the various methods to grab data from a DataFrame

In [37]:
df[['A','D']]   # Pass a list of column names

Unnamed: 0,A,D
1,2.70685,0.503826
2,0.651118,0.605965
3,-2.018168,-0.589001
4,0.188695,0.955057
5,0.190794,0.683509


In [38]:
type(df[['A','D']])   # Pass a list of column names

pandas.core.frame.DataFrame

# Selection and Indexing
Select a range of columns 

In [39]:
df.loc[:,'B':'D']   # Label slice over columns (both ends included)

Unnamed: 0,B,C,D
1,0.628133,0.907969,0.503826
2,-0.319318,-0.848077,0.605965
3,0.740122,0.528813,-0.589001
4,-0.758872,-0.933237,0.955057
5,1.978757,2.605967,0.683509


# Selection and Indexing
Select a range of rows 

In [40]:
df.loc['3':'5']   # Label slice over rows (both ends included)

Unnamed: 0,A,B,C,D
3,-2.018168,0.740122,0.528813,-0.589001
4,0.188695,-0.758872,-0.933237,0.955057
5,0.190794,1.978757,2.605967,0.683509
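
Note that label slices with `.loc` include the end label, unlike ordinary Python slices. The sketch below contrasts this with position-based `.iloc` on a stand-in frame (`df_demo`, filled with `np.arange` instead of random numbers).

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(np.arange(20).reshape(5, 4),
                       index='1 2 3 4 5'.split(),
                       columns='A B C D'.split())

by_label = df_demo.loc['3':'5']     # labels '3', '4' and '5' (end included)
by_position = df_demo.iloc[2:5]     # positions 2, 3 and 4 (end excluded)
print(by_label.equals(by_position))

# The same rule applies to columns
print(list(df_demo.loc[:, 'B':'D'].columns))
print(list(df_demo.iloc[:, 1:3].columns))
```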


# Creating a new column

In [41]:
df['NEW'] = df['C'] + df['D']

In [42]:
df

Unnamed: 0,A,B,C,D,NEW
1,2.70685,0.628133,0.907969,0.503826,1.411795
2,0.651118,-0.319318,-0.848077,0.605965,-0.242112
3,-2.018168,0.740122,0.528813,-0.589001,-0.060187
4,0.188695,-0.758872,-0.933237,0.955057,0.021819
5,0.190794,1.978757,2.605967,0.683509,3.289476


# Removing columns

In [43]:
df.drop('NEW', axis=1)

Unnamed: 0,A,B,C,D
1,2.70685,0.628133,0.907969,0.503826
2,0.651118,-0.319318,-0.848077,0.605965
3,-2.018168,0.740122,0.528813,-0.589001
4,0.188695,-0.758872,-0.933237,0.955057
5,0.190794,1.978757,2.605967,0.683509


In [44]:
df  # Not 'inplace' unless specified!

Unnamed: 0,A,B,C,D,NEW
1,2.70685,0.628133,0.907969,0.503826,1.411795
2,0.651118,-0.319318,-0.848077,0.605965,-0.242112
3,-2.018168,0.740122,0.528813,-0.589001,-0.060187
4,0.188695,-0.758872,-0.933237,0.955057,0.021819
5,0.190794,1.978757,2.605967,0.683509,3.289476


# Removing columns

In [45]:
df.drop('NEW', axis=1, inplace=True)

In [46]:
df  # 'NEW' is gone because inplace=True was specified

Unnamed: 0,A,B,C,D
1,2.70685,0.628133,0.907969,0.503826
2,0.651118,-0.319318,-0.848077,0.605965
3,-2.018168,0.740122,0.528813,-0.589001
4,0.188695,-0.758872,-0.933237,0.955057
5,0.190794,1.978757,2.605967,0.683509
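
An alternative to `inplace=True` is to keep the returned copy under a new name (or reassign it to the same name). This is a sketch of that style on a small made-up frame; whether to prefer it is a matter of taste.

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'NEW': [4, 6]})

# drop returns a new DataFrame; columns='NEW' is the same as ('NEW', axis=1)
trimmed = df_demo.drop(columns='NEW')

print(list(trimmed.columns))
print(list(df_demo.columns))    # the original is untouched
```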


# Removing rows

In [47]:
df.drop( '5' , axis=0 )

Unnamed: 0,A,B,C,D
1,2.70685,0.628133,0.907969,0.503826
2,0.651118,-0.319318,-0.848077,0.605965
3,-2.018168,0.740122,0.528813,-0.589001
4,0.188695,-0.758872,-0.933237,0.955057


In [48]:
df.drop( '5' , axis=0, inplace= True)

In [49]:
df

Unnamed: 0,A,B,C,D
1,2.70685,0.628133,0.907969,0.503826
2,0.651118,-0.319318,-0.848077,0.605965
3,-2.018168,0.740122,0.528813,-0.589001
4,0.188695,-0.758872,-0.933237,0.955057


# Selecting Rows

In [50]:
df.loc['1']   # Purely label-LOCation based indexer for selection
              # Select based on label

A    2.706850
B    0.628133
C    0.907969
D    0.503826
Name: 1, dtype: float64

In [51]:
type(df.loc['1'])

pandas.core.series.Series

In [52]:
df.iloc[0]   # Purely integer-location based indexing for selection by position.
             # Select based on position, instead of label 

A    2.706850
B    0.628133
C    0.907969
D    0.503826
Name: 1, dtype: float64

In [53]:
type(df.iloc[0])

pandas.core.series.Series
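
The difference between `.loc` and `.iloc` is easiest to see when the labels are integers that do not match the row positions. A minimal sketch with a hypothetical three-row frame:

```python
import pandas as pd

# Integer labels deliberately out of order
df_demo = pd.DataFrame({'val': ['a', 'b', 'c']}, index=[2, 0, 1])

by_label = df_demo.loc[0, 'val']     # row whose label is 0
by_position = df_demo.iloc[0, 0]     # first row, first column
print(by_label, by_position)
```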

# Selecting subset of rows and columns 

In [54]:
df

Unnamed: 0,A,B,C,D
1,2.70685,0.628133,0.907969,0.503826
2,0.651118,-0.319318,-0.848077,0.605965
3,-2.018168,0.740122,0.528813,-0.589001
4,0.188695,-0.758872,-0.933237,0.955057


In [55]:
df.loc['1','A']

2.7068498393999381

In [56]:
type(df.loc['1','A'])

numpy.float64

# Selecting subset of rows and columns 

In [57]:
df

Unnamed: 0,A,B,C,D
1,2.70685,0.628133,0.907969,0.503826
2,0.651118,-0.319318,-0.848077,0.605965
3,-2.018168,0.740122,0.528813,-0.589001
4,0.188695,-0.758872,-0.933237,0.955057


In [58]:
df.loc[['1','2'],['C','D']]

Unnamed: 0,C,D
1,0.907969,0.503826
2,-0.848077,0.605965


In [59]:
type(df.loc[['1','2'],['C','D']])

pandas.core.frame.DataFrame

# Conditional Selection
An important feature of Pandas is conditional selection using bracket notation (*very similar to numpy*)

In [60]:
df

Unnamed: 0,A,B,C,D
1,2.70685,0.628133,0.907969,0.503826
2,0.651118,-0.319318,-0.848077,0.605965
3,-2.018168,0.740122,0.528813,-0.589001
4,0.188695,-0.758872,-0.933237,0.955057


In [61]:
df > 0  # returns a pandas.core.frame.DataFrame

Unnamed: 0,A,B,C,D
1,True,True,True,True
2,True,False,False,True
3,False,True,True,False
4,True,False,False,True


# Conditional Selection
An important feature of Pandas is conditional selection using bracket notation (*very similar to numpy*)

In [62]:
df

Unnamed: 0,A,B,C,D
1,2.70685,0.628133,0.907969,0.503826
2,0.651118,-0.319318,-0.848077,0.605965
3,-2.018168,0.740122,0.528813,-0.589001
4,0.188695,-0.758872,-0.933237,0.955057


In [63]:
df[ df > 0 ]   # Masking using a DataFrame

Unnamed: 0,A,B,C,D
1,2.70685,0.628133,0.907969,0.503826
2,0.651118,,,0.605965
3,,0.740122,0.528813,
4,0.188695,,,0.955057


# Conditional Selection
An important feature of Pandas is conditional selection using bracket notation (*very similar to numpy*)

In [64]:
df

Unnamed: 0,A,B,C,D
1,2.70685,0.628133,0.907969,0.503826
2,0.651118,-0.319318,-0.848077,0.605965
3,-2.018168,0.740122,0.528813,-0.589001
4,0.188695,-0.758872,-0.933237,0.955057


In [65]:
df[ df['A'] > 0 ]

Unnamed: 0,A,B,C,D
1,2.70685,0.628133,0.907969,0.503826
2,0.651118,-0.319318,-0.848077,0.605965
4,0.188695,-0.758872,-0.933237,0.955057


# Conditional Selection
An important feature of Pandas is conditional selection using bracket notation (*very similar to numpy*)

In [66]:
df

Unnamed: 0,A,B,C,D
1,2.70685,0.628133,0.907969,0.503826
2,0.651118,-0.319318,-0.848077,0.605965
3,-2.018168,0.740122,0.528813,-0.589001
4,0.188695,-0.758872,-0.933237,0.955057


In [67]:
df[ df['A']>0 ]['D']

1    0.503826
2    0.605965
4    0.955057
Name: D, dtype: float64

# Conditional Selection
An important feature of Pandas is conditional selection using bracket notation (*very similar to numpy*)

In [68]:
df

Unnamed: 0,A,B,C,D
1,2.70685,0.628133,0.907969,0.503826
2,0.651118,-0.319318,-0.848077,0.605965
3,-2.018168,0.740122,0.528813,-0.589001
4,0.188695,-0.758872,-0.933237,0.955057


In [69]:
df[ df['A']>0 ][['B','C']]

Unnamed: 0,B,C
1,0.628133,0.907969
2,-0.319318,-0.848077
4,-0.758872,-0.933237
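
The two-step selections above (filter rows, then pick columns) can also be written as a single `.loc` call. The sketch below uses rounded stand-in values in `df_demo`; the single-step form also matters when assigning, because assignment through a chained selection may act on a temporary copy.

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [2.7, 0.65, -2.0, 0.19],
                        'B': [0.63, -0.32, 0.74, -0.76],
                        'D': [0.50, 0.61, -0.59, 0.96]},
                       index=['1', '2', '3', '4'])

mask = df_demo['A'] > 0
chained = df_demo[mask][['B', 'D']]        # two selections in a row
single = df_demo.loc[mask, ['B', 'D']]     # rows and columns in one step
print(single.equals(chained))

# Assignment through .loc modifies df_demo itself
df_demo.loc[mask, 'D'] = 0.0
print(df_demo['D'].tolist())
```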


# Conditional Selection
An important feature of Pandas is conditional selection using bracket notation (*very similar to numpy*)

In [70]:
df

Unnamed: 0,A,B,C,D
1,2.70685,0.628133,0.907969,0.503826
2,0.651118,-0.319318,-0.848077,0.605965
3,-2.018168,0.740122,0.528813,-0.589001
4,0.188695,-0.758872,-0.933237,0.955057


In [71]:
df[(df['A']>0) & (df['D'] > 0.7)]

Unnamed: 0,A,B,C,D
4,0.188695,-0.758872,-0.933237,0.955057
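
Conditions are combined with `&` (and), `|` (or) and `~` (not), each wrapped in parentheses. Python's `and`/`or` keywords do not work here because they ask for a single truth value of a whole Series. A short sketch on rounded stand-in data:

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [2.7, 0.65, -2.0, 0.19],
                        'D': [0.50, 0.61, -0.59, 0.96]},
                       index=['1', '2', '3', '4'])

cond = (df_demo['A'] < 0) | (df_demo['D'] > 0.7)
either = df_demo[cond]      # rows where at least one condition holds
neither = df_demo[~cond]    # the complement

raised = False
try:
    df_demo[(df_demo['A'] > 0) and (df_demo['D'] > 0.7)]
except ValueError:
    raised = True           # truth value of a Series is ambiguous

print(list(either.index), list(neither.index), raised)
```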


# Multi-Index and Index Hierarchy
Let us go over how to work with a Multi-Index. First we'll create a quick example of what a Multi-Indexed DataFrame looks like:

In [72]:
# Index Levels
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))  # list from iterable object that returns tuples
hier_index, type(hier_index), type(hier_index[0])

([('G1', 1), ('G1', 2), ('G1', 3), ('G2', 1), ('G2', 2), ('G2', 3)],
 list,
 tuple)

In [73]:
hier_index = pd.MultiIndex.from_tuples(hier_index) # Convert list of tuples to MultiIndex
hier_index

MultiIndex(levels=[['G1', 'G2'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])
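
When the index is the full cross product of the outer and inner labels, as it is here, `pd.MultiIndex.from_product` is a possible shortcut that avoids building the list of tuples by hand. The sketch also names the levels at creation time, which is optional.

```python
import pandas as pd

outside = ['G1', 'G1', 'G1', 'G2', 'G2', 'G2']
inside = [1, 2, 3, 1, 2, 3]
from_tuples = pd.MultiIndex.from_tuples(list(zip(outside, inside)))

# Every combination of the two lists, in order
from_product = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]],
                                          names=['Group', 'Num'])

print(list(from_product) == list(from_tuples))
print(list(from_product.names))
```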

# Multi-Index and Index Hierarchy
Let us go over how to work with a Multi-Index. First we'll create a quick example of what a Multi-Indexed DataFrame looks like:

In [74]:
hier_index

MultiIndex(levels=[['G1', 'G2'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])

In [75]:
df = pd.DataFrame( np.random.randn(6,2),
                   index= hier_index, 
                   columns= ['A','B'])
df

Unnamed: 0,Unnamed: 1,A,B
G1,1,0.302665,1.693723
G1,2,-1.706086,-1.159119
G1,3,-0.134841,0.390528
G2,1,0.166905,0.184502
G2,2,0.807706,0.07296
G2,3,0.638787,0.329646


# Multi-Index and Index Hierarchy

In [76]:
df

Unnamed: 0,Unnamed: 1,A,B
G1,1,0.302665,1.693723
G1,2,-1.706086,-1.159119
G1,3,-0.134841,0.390528
G2,1,0.166905,0.184502
G2,2,0.807706,0.07296
G2,3,0.638787,0.329646


In [77]:
df.loc['G1']  # Purely label-LOCation based indexer for selection
              # returns pandas.core.frame.DataFrame

Unnamed: 0,A,B
1,0.302665,1.693723
2,-1.706086,-1.159119
3,-0.134841,0.390528


# Multi-Index and Index Hierarchy

In [78]:
df.loc['G1']  # Purely label-LOCation based indexer for selection
              # returns pandas.core.frame.DataFrame

Unnamed: 0,A,B
1,0.302665,1.693723
2,-1.706086,-1.159119
3,-0.134841,0.390528


In [79]:
df.loc['G1'].loc[1]   # Nested label-LOCation based indexer
                      # returns pandas.core.series.Series

A    0.302665
B    1.693723
Name: 1, dtype: float64

In [80]:
type(df.loc['G1'].loc[1])

pandas.core.series.Series
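
The nested `.loc['G1'].loc[1]` lookup can also be written as one `.loc` call with a tuple, one element per index level. A minimal sketch on a stand-in frame filled with `np.arange`:

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]])
df_demo = pd.DataFrame(np.arange(12).reshape(6, 2), index=idx, columns=['A', 'B'])

nested = df_demo.loc['G1'].loc[1]      # two lookups
direct = df_demo.loc[('G1', 1)]        # one lookup with a tuple key
value = df_demo.loc[('G1', 1), 'A']    # row tuple plus column label

print(list(nested) == list(direct), value)
```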

# Multi-Index and Index Hierarchy

In [81]:
df

Unnamed: 0,Unnamed: 1,A,B
G1,1,0.302665,1.693723
G1,2,-1.706086,-1.159119
G1,3,-0.134841,0.390528
G2,1,0.166905,0.184502
G2,2,0.807706,0.07296
G2,3,0.638787,0.329646


In [82]:
df.index.names

FrozenList([None, None])

# Multi-Index and Index Hierarchy

In [83]:
df.index.names = ['Group','Num']

In [84]:
df

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B
Group,Num,Unnamed: 2_level_1,Unnamed: 3_level_1
G1,1,0.302665,1.693723
G1,2,-1.706086,-1.159119
G1,3,-0.134841,0.390528
G2,1,0.166905,0.184502
G2,2,0.807706,0.07296
G2,3,0.638787,0.329646


# Index Hierarchy and Cross-section
`xs` method of a pandas DataFrame instance (`DataFrame.xs`)

`xs(key, axis=0, level=None, drop_level=True)`
+ Returns a cross-section, either row(s) or column(s)
+ Defaults to cross-section on the rows (axis=0).

help(pd.DataFrame.xs)

    | Parameters
    | ----------
    | key : object
    | axis : int, default 0
    | level : object, defaults to first n levels (n=1 or len(key))
    | drop_level : boolean, default True

# Index Hierarchy and Cross-section

In [85]:
df

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B
Group,Num,Unnamed: 2_level_1,Unnamed: 3_level_1
G1,1,0.302665,1.693723
G1,2,-1.706086,-1.159119
G1,3,-0.134841,0.390528
G2,1,0.166905,0.184502
G2,2,0.807706,0.07296
G2,3,0.638787,0.329646


In [86]:
df.xs( key= 'G1')   # pandas.core.frame.DataFrame

Unnamed: 0_level_0,A,B
Num,Unnamed: 1_level_1,Unnamed: 2_level_1
1,0.302665,1.693723
2,-1.706086,-1.159119
3,-0.134841,0.390528


# Index Hierarchy and Cross-section

In [87]:
df.xs( key= 'G1')   # pandas.core.frame.DataFrame

Unnamed: 0_level_0,A,B
Num,Unnamed: 1_level_1,Unnamed: 2_level_1
1,0.302665,1.693723
2,-1.706086,-1.159119
3,-0.134841,0.390528


In [88]:
df.xs( key= ('G1',1))  # tuple key (list keys are rejected by newer pandas); pandas.core.series.Series

A    0.302665
B    1.693723
Name: (G1, 1), dtype: float64

In [89]:
df.xs(('G1',1))['A']  # numpy.float64

0.30266544858518252

# Index Hierarchy and Cross-section

In [90]:
df

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B
Group,Num,Unnamed: 2_level_1,Unnamed: 3_level_1
G1,1,0.302665,1.693723
G1,2,-1.706086,-1.159119
G1,3,-0.134841,0.390528
G2,1,0.166905,0.184502
G2,2,0.807706,0.07296
G2,3,0.638787,0.329646


In [91]:
df.xs(key= 1, level='Num')

Unnamed: 0_level_0,A,B
Group,Unnamed: 1_level_1,Unnamed: 2_level_1
G1,0.302665,1.693723
G2,0.166905,0.184502
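
By default `xs` drops the level it selected on, which is why `Num` no longer appears above. The sketch below, on a stand-in frame, shows `drop_level=False` and a roughly equivalent `.loc` selection with `slice(None)`; the `.loc` form assumes the index is sorted, as it is here.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]], names=['Group', 'Num'])
df_demo = pd.DataFrame(np.arange(12).reshape(6, 2), index=idx, columns=['A', 'B'])

dropped = df_demo.xs(key=1, level='Num')                    # index: Group only
kept = df_demo.xs(key=1, level='Num', drop_level=False)     # index: (Group, Num)
via_loc = df_demo.loc[(slice(None), 1), :]                  # any Group, Num == 1

print(list(dropped.index))
print(list(kept.index))
print(list(via_loc.index))
```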


---
# Creating `DataFrames` from CSV files
---

# Creating `DataFrames` from CSV files

In [92]:
import pandas as pd
df = pd.read_csv('nyc_weather.csv')

In [93]:
df.head(5)

Unnamed: 0,EST,Temperature,DewPoint,Humidity,Sea Level PressureIn,VisibilityMiles,WindSpeedMPH,PrecipitationIn,CloudCover,Events,WindDirDegrees
0,1/1/2016,38,23,52,30.03,10,8.0,0,5,,281
1,1/2/2016,36,18,46,30.02,10,7.0,0,3,,275
2,1/3/2016,40,21,47,29.86,10,8.0,0,1,,277
3,1/4/2016,25,9,44,30.05,10,9.0,0,3,,345
4,1/5/2016,20,-3,41,30.57,10,5.0,0,0,,333
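
`read_csv` can also parse dates and set the index while loading. The sketch below reads from an in-memory string so it runs without `nyc_weather.csv`; the rows are a partly made-up sample in the same layout, and the choice of `EST` as the index is an assumption for illustration.

```python
import io
import pandas as pd

# Illustrative sample, not the real file contents
csv_text = """EST,Temperature,DewPoint,Events
1/1/2016,38,23,
1/2/2016,36,18,
1/9/2016,44,38,Rain
"""

df_demo = pd.read_csv(io.StringIO(csv_text),
                      parse_dates=['EST'],   # convert the strings to timestamps
                      index_col='EST')       # use the dates as row labels

print(df_demo.dtypes)
print(df_demo.loc[pd.Timestamp('2016-01-09'), 'Temperature'])
```

Empty fields, such as the missing `Events` values, are read as `NaN`.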


# Creating `DataFrames` from CSV files

In [94]:
import pandas as pd
df = pd.read_csv('nyc_weather.csv')

In [95]:
df.tail(5)

Unnamed: 0,EST,Temperature,DewPoint,Humidity,Sea Level PressureIn,VisibilityMiles,WindSpeedMPH,PrecipitationIn,CloudCover,Events,WindDirDegrees
26,1/27/2016,41,22,45,30.03,10,7.0,T,3,Rain,311
27,1/28/2016,37,20,51,29.9,10,5.0,0,1,,234
28,1/29/2016,36,21,50,29.58,10,8.0,0,4,,298
29,1/30/2016,34,16,46,30.01,10,7.0,0,0,,257
30,1/31/2016,46,28,52,29.9,10,5.0,0,0,,241


# Creating `DataFrames` from CSV files
We want to get some insights from the DataFrame

In [96]:
df['Temperature'].max()

50

In [97]:
df['EST'][ df['Events']=='Rain']

8      1/9/2016
9     1/10/2016
15    1/16/2016
26    1/27/2016
Name: EST, dtype: object

In [98]:
df['WindSpeedMPH'].mean()

6.8928571428571432
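
A few related queries, sketched on a small made-up frame that reuses the weather column names (the values are not from the real file). It shows `idxmax` to find where the maximum occurs, a mask and column selected in one `.loc` call, and the fact that `mean` skips `NaN` entries by default.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data
df_demo = pd.DataFrame({
    'EST': ['1/1/2016', '1/2/2016', '1/3/2016', '1/4/2016'],
    'Temperature': [38, 36, 40, 25],
    'WindSpeedMPH': [8.0, np.nan, 8.0, 5.0],
    'Events': [np.nan, 'Rain', np.nan, 'Snow'],
})

hottest_day = df_demo.loc[df_demo['Temperature'].idxmax(), 'EST']   # date of the maximum
rainy_days = df_demo.loc[df_demo['Events'] == 'Rain', 'EST']        # mask + column in one step
mean_wind = df_demo['WindSpeedMPH'].mean()                          # NaN rows are skipped

print(hottest_day, list(rainy_days), mean_wind)
```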

# That's it for now!... Thanks.