
<table>
<tr>
<td width=15%><img src="../../legacies.png"></img></td>
<td><center><h1>Introduction to Python for Data Sciences</h1></center></td>
<td width=15%><a href="https://www.linkedin.com/in/aymen-belkhair/" style="font-size: 16px; font-weight: bold">Aymen Belkhair</a> </td>
</tr>
</table>



<br/><br/>

<center><a style="font-size: 40pt; font-weight: bold">Chap. 2 - Introduction to Pandas </a></center> 

<br/><br/>


# 1- Pandas



In a previous chapter, we explored some features of NumPy and notably its arrays. Here we will take a look at the data structures provided by the **Pandas** library.

Pandas is a newer package built on top of NumPy which provides an efficient implementation of **DataFrames**. DataFrames are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data. As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations.



Just as we generally import NumPy under the alias ``np``, we will import Pandas under the alias ``pd``.


In [53]:
import pandas as pd
import numpy as np

## Pandas Series


A Pandas `Series` is a one-dimensional array of indexed data.

In [56]:
dataArr =[0.25, 0.5, 0.75, 1.0]
data = pd.Series(dataArr)
data
# dataArr

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

The contents can be accessed in the same way as for NumPy arrays, except that when more than one value is selected, the result remains a Pandas ``Series``.

In [61]:
# Label 5 is not in the index (valid labels are 0 to 3), so this lookup raises a KeyError
print(data[5],type(data[0]))

KeyError: 5

In [58]:
print(data[2:],type(data[2:]))

2    0.75
3    1.00
dtype: float64 <class 'pandas.core.series.Series'>
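
As a complement to the cells above, here is a minimal sketch of the difference between selecting one value and selecting several. The variable names (`s`, `single`, `several`) are illustrative and not part of the original notebook.

```python
import pandas as pd

s = pd.Series([0.25, 0.5, 0.75, 1.0])

single = s[0]      # one existing label -> a scalar (numpy.float64)
several = s[1:3]   # a slice -> still a Series, with the original labels kept

print(type(single).__name__)    # float64
print(type(several).__name__)   # Series
print(list(several.index))      # [1, 2]
```

Note that the slice keeps the labels `1` and `2` rather than renumbering from zero.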


The type ``Series`` wraps both a sequence of values and a sequence of indices, which we can access with the <tt>values</tt> and <tt>index</tt> attributes.

* ``values`` are the contents of the series as a NumPy array

In [59]:
print(data.values,type(data.values))

[0.25 0.5  0.75 1.  ] <class 'numpy.ndarray'>


* ``index`` are the indices of the series

In [60]:
print(data.index,type(data.index))

RangeIndex(start=0, stop=4, step=1) <class 'pandas.core.indexes.range.RangeIndex'>


### Power of series

In [62]:
# Mathematical operations

data_squared = data ** 2
print(data_squared)

0    0.0625
1    0.2500
2    0.5625
3    1.0000
dtype: float64


In [63]:
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

In [64]:
# Filter: keep only the values greater than 0.5
filtered_serie = data[data > 0.5]
print(filtered_serie)

2    0.75
3    1.00
dtype: float64


In [65]:
# Apply a function to each element
square_root_serie = data.apply(lambda x: x ** 0.5)
print(square_root_serie)

    

0    0.500000
1    0.707107
2    0.866025
3    1.000000
dtype: float64


In [67]:
data[0]**0.5

0.5

In [68]:
# A plain Python list does not support element-wise arithmetic, hence the TypeError
dataArr_squared = dataArr ** 2
print(dataArr_squared)


TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int'

In [70]:
# for i in dataArr:
#     print(i**2)


def filters(arr, x):
    """Print every element of arr that is strictly greater than x."""
    for i in arr:
        if i > x:
            print("->", i)


filters(dataArr, 0.5)

-> 0.75
-> 1.0
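
The contrast between the list and the Series in the cells above can be summarised in one sketch. It assumes the same four values; the variable names are illustrative.

```python
import pandas as pd

values = [0.25, 0.5, 0.75, 1.0]
series = pd.Series(values)

# Plain list: element-wise work needs an explicit loop or comprehension
squared_list = [v ** 2 for v in values]
filtered_list = [v for v in values if v > 0.5]

# Series: the same operations are written once for the whole container
squared_series = series ** 2
filtered_series = series[series > 0.5]

print(squared_list)               # [0.0625, 0.25, 0.5625, 1.0]
print(filtered_series.tolist())   # [0.75, 1.0]
```

Both approaches give the same numbers; the Series version simply avoids the explicit loop.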


In [71]:
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

### Series Indices

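The main difference between NumPy arrays and Pandas Series is the presence of this <tt>index</tt> field. By default, it is set (as in NumPy arrays) to <tt>0,1,..,size_of_the_series-1</tt>, but a Series index can be explicitly defined. The indices may be numbers but also strings. The contents of the series are then meant to be accessed using these defined indices. Note that in the cells below, integer positions such as `data[1]` still work as a fallback because the index contains only strings; recent Pandas versions deprecate this fallback, so `data.iloc[1]` is the safer way to access by position.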

In [72]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
print(data)

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64


In [74]:
data[1]

0.5

In [75]:
print(data['c'])
print(data[2])

0.75
0.75
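
To make the distinction between labels and positions explicit, here is a small sketch using the `.loc` (label-based) and `.iloc` (position-based) accessors on a Series with a string index. The variable names are illustrative.

```python
import pandas as pd

s = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

by_label = s.loc['c']           # label-based access
by_position = s.iloc[2]         # position-based access, same element
label_slice = s.loc['b':'c']    # label slices include the end label
pos_slice = s.iloc[1:2]         # position slices exclude the end, as in NumPy

print(by_label, by_position)        # 0.75 0.75
print(list(label_slice.index))      # ['b', 'c']
print(list(pos_slice.index))        # ['b']
```

Using the accessors removes any ambiguity about whether a key is a label or a position.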


### Series and Python Dictionaries [\*] 

Pandas Series and Python dictionaries are semantically close: both map keys to values. However, the implementation of Pandas Series is usually more efficient than dictionaries in the context of data science. Naturally, Series can be constructed from dictionaries.

In [76]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}

population = pd.Series(population_dict)
print(population_dict,type(population_dict))


{'California': 38332521, 'Texas': 26448193, 'New York': 19651127, 'Florida': 19552860, 'Illinois': 12882135} <class 'dict'>


In [77]:
print(population,type(population))

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64 <class 'pandas.core.series.Series'>


In [78]:
population['California']

38332521

In [79]:
population['California':'Florida']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
dtype: int64
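
The dictionary analogy can be pushed a little further. The sketch below, on a shortened version of the population data, shows a few dictionary-like operations that a Series supports, plus one (a whole-container sum) that a dictionary does not offer directly.

```python
import pandas as pd

population = pd.Series({'California': 38332521,
                        'Texas': 26448193,
                        'New York': 19651127})

has_texas = 'Texas' in population        # membership is tested on the index
labels = list(population.keys())         # keys() returns the index
population['Florida'] = 19552860         # assigning to a new label adds an entry
total = population.sum()                 # array-style operation on all values

print(has_texas)   # True
print(labels)      # ['California', 'Texas', 'New York']
print(total)       # 103984701
```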

## Pandas DataFrames

DataFrames are a fundamental object of Pandas that mimics what can be found in `R`, for instance. A DataFrame can be seen as an array of Series: to each `index` (corresponding to an individual, for instance, or a line in a table), a DataFrame maps multiple values; these values correspond to the `columns` of the DataFrame, each of which has a name (typically a string).


In the following example, we will construct a Dataframe from two Series with common indices. 

In [80]:
area = pd.Series( {'California': 423967, 'Texas': 695662, 'New York': 141297, 'Florida': 170312, 'Illinois': 149995})
population = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127, 'Florida': 19552860, 'Illinois': 12882135})

In [81]:
states = pd.DataFrame({'Population': population, 'Area': area})
print(states,type(states))

            Population    Area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995 <class 'pandas.core.frame.DataFrame'>


In Jupyter notebooks, DataFrames are displayed in a fancier way when the name of the DataFrame is typed on the last line of a cell (instead of using <tt>print</tt>).

In [82]:
states

            Population    Area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


DataFrames have three main attributes:
* <tt>index</tt>: the row labels, defined as in Series
* <tt>columns</tt>: the column names
* <tt>values</tt>: the contents, returned as a (2D) NumPy array

In [83]:
print(states.index)


Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')


In [84]:
print(states.columns)


Index(['Population', 'Area'], dtype='object')


In [85]:
print(states.values, type(states.values), states.values.shape)

[[38332521   423967]
 [26448193   695662]
 [19651127   141297]
 [19552860   170312]
 [12882135   149995]] <class 'numpy.ndarray'> (5, 2)


*Warning:* When accessing a DataFrame, `dataframe_name[column_name]` returns the corresponding column as a Series, whereas `dataframe_name[index_name]` raises a `KeyError`! We will see later how to access a specific index.

In [86]:
print(states['Area'],type(states['Area']))

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: Area, dtype: int64 <class 'pandas.core.series.Series'>


In [100]:
try:
    print(states['Area']['California'])
except KeyError as error: 
    print("KeyError: ",error)

423967
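
The warning above can be checked directly. The following sketch rebuilds a two-row version of `states` (an assumption made to keep the example self-contained) and shows that square brackets look up columns, not rows, while `.loc` retrieves a row by its label.

```python
import pandas as pd

states = pd.DataFrame(
    {'Population': [38332521, 26448193], 'Area': [423967, 695662]},
    index=['California', 'Texas'],
)

area = states['Area']            # a column name works and returns a Series

try:
    states['California']         # a row label is not a column name
    row_lookup_failed = False
except KeyError:
    row_lookup_failed = True

row = states.loc['California']   # label-based row access returns a Series

print(row_lookup_failed)   # True
print(row['Area'])         # 423967
```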


### Dataframe creation

To create DataFrames, the main methods are:
* from Series (as above)

In [41]:
print(population,type(population))


California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64 <class 'pandas.core.series.Series'>


In [43]:
print(area,type(area))

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64 <class 'pandas.core.series.Series'>


In [91]:
states = pd.DataFrame({'Population': population, 'Area': area})
states

            Population    Area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


* from NumPy arrays (rows and columns receive default integer labels <tt>0,1,...</tt>, matching the positions in the array)

In [92]:
A = np.random.randn(5,3)
print(A,type(A))
dfA = pd.DataFrame(A)
dfA

[[-1.85726276  0.89511149  0.33030723]
 [ 0.39944528 -0.13833845  2.03239074]
 [ 0.3182466   0.5045642   0.42700999]
 [ 1.48188821 -0.25296977 -1.4955578 ]
 [ 1.45667767  0.66986482  2.00902412]] <class 'numpy.ndarray'>


          0         1         2
0 -1.857263  0.895111  0.330307
1  0.399445 -0.138338  2.032391
2  0.318247  0.504564  0.427010
3  1.481888 -0.252970 -1.495558
4  1.456678  0.669865  2.009024
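
When building a DataFrame from a NumPy array, labels can also be supplied at creation time through the `index` and `columns` arguments of `pd.DataFrame`. The sketch below uses a small deterministic array and made-up labels (`r0`, `x`, ...) purely for illustration.

```python
import numpy as np
import pandas as pd

A = np.arange(6).reshape(3, 2)   # [[0, 1], [2, 3], [4, 5]]

df_default = pd.DataFrame(A)     # default integer labels for rows and columns
df_named = pd.DataFrame(A, index=['r0', 'r1', 'r2'], columns=['x', 'y'])

print(list(df_default.columns))    # [0, 1]
print(df_named.loc['r1', 'y'])     # 3
```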


* from a *list* of *dictionaries*. Be careful: each element of the list is one example (a row, with an automatic index 0,1,...), while each key of the dictionary corresponds to a column.

In [93]:
data = [{'a': i, 'b': 2 * i} for i in range(3)]
print(data,type(data))
print(data[0],type(data[0]))

[{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 4}] <class 'list'>
{'a': 0, 'b': 0} <class 'dict'>


In [94]:
df = pd.DataFrame(data)
df

   a  b
0  0  0
1  1  2
2  2  4


* from a *file*, typically a <tt>csv</tt> file (for comma-separated values), possibly with the names of the columns as the first line.


    col_1_name,col_2_name,col_3_name
    col_1_v1,col_2_v1,col_3_v1
    col_1_v2,col_2_v2,col_3_v2
    ...
    
For other file types (MS Excel, libSVM, any other separator), see this [part of the doc](https://pandas.pydata.org/pandas-docs/stable/api.html#input-output).
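
As an illustration of the two variations mentioned above (another separator, and a file without a header line), here is a minimal sketch using `pd.read_csv` with its `sep`, `header` and `names` parameters. To stay self-contained it reads from in-memory strings via `io.StringIO` instead of a file on disk; the names and values are made up for the example.

```python
import io
import pandas as pd

# Semicolon-separated text with a header line (hypothetical data)
raw = "name;height_cm\nAda;170\nGrace;165\n"
df = pd.read_csv(io.StringIO(raw), sep=';')

# Comma-separated text without a header line: column names are given explicitly
raw_no_header = "Ada,170\nGrace,165\n"
df2 = pd.read_csv(io.StringIO(raw_no_header), header=None,
                  names=['name', 'height_cm'])

print(list(df.columns))            # ['name', 'height_cm']
print(df2['height_cm'].tolist())   # [170, 165]
```

With a real file, the `io.StringIO(...)` argument would simply be replaced by the file path.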

In [95]:
# !head -4 president_heights.csv # Jupyter bash command to see the first 4 lines of the file
!more  data\president_heights.csv

order,name,height(cm)
1,George Washington,189
2,John Adams,170
3,Thomas Jefferson,189
4,James Madison,163
5,James Monroe,183
6,John Quincy Adams,171
7,Andrew Jackson,185
8,Martin Van Buren,168
9,William Henry Harrison,173
10,John Tyler,183
11,James K. Polk,173
12,Zachary Taylor,173
13,Millard Fillmore,175
14,Franklin Pierce,178
15,James Buchanan,183
16,Abraham Lincoln,193
17,Andrew Johnson,178
18,Ulysses S. Grant,173
19,Rutherford B. Hayes,174
20,James A. Garfield,183
21,Chester A. Arthur,183
23,Benjamin Harrison,168
25,William McKinley,170
26,Theodore Roosevelt,178
27,William Howard Taft,182
28,Woodrow Wilson,180
29,Warren G. Harding,183
30,Calvin Coolidge,178
31,Herbert Hoover,182
32,Franklin D. Roosevelt,188
33,Harry S. Truman,175
34,Dwight D. Eisenhower,179
35,John F. Kennedy,183
36,Lyndon B. Johnson,193
37,Richard Nixon,182
38,Gerald Ford,183
39,Jimmy Carter,177
40,Ronald Reagan,185
41,George H. W. Bush,188
42,Bill Clinton,188
43,George W. Bush,182
44,Barack Obama,185
45,Donald Tr

In [96]:
data = pd.read_csv('data/president_heights.csv')
data

   order                    name  height(cm)
0      1       George Washington         189
1      2              John Adams         170
2      3        Thomas Jefferson         189
3      4           James Madison         163
4      5            James Monroe         183
5      6       John Quincy Adams         171
6      7          Andrew Jackson         185
7      8        Martin Van Buren         168
8      9  William Henry Harrison         173
9     10              John Tyler         183


### Names and Values

Notice there can be missing values in DataFrames.

In [97]:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0
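
Missing entries are stored as `NaN`. A short sketch of how they might be detected and handled on the same small DataFrame, using the `isna`, `fillna` and `dropna` methods:

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

mask = df.isna()                    # boolean DataFrame, True where a value is missing
n_missing = int(mask.sum().sum())   # total number of missing cells
filled = df.fillna(0)               # replace missing values with 0
complete_rows = df.dropna()         # keep only rows without any missing value

print(n_missing)            # 2
print(len(complete_rows))   # 0
```

Here every row has one missing value, so `dropna` leaves nothing; replacing values with `fillna` is the gentler option in that case.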


You can set index and column names *a posteriori*.
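
A minimal sketch of this, assuming a small DataFrame created with default labels: the `index` and `columns` attributes can be assigned directly, and `rename` returns a relabeled copy. The label names are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]])   # default labels 0, 1 for rows and columns

df.columns = ['a', 'b']               # set column names after creation
df.index = ['first', 'second']        # set row labels after creation

# rename returns a new DataFrame and leaves df unchanged
renamed = df.rename(columns={'a': 'alpha'})

print(df.loc['second', 'b'])    # 4
print(list(renamed.columns))    # ['alpha', 'b']
```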

## Indexing




In [98]:
area = pd.Series( {'California': 423967, 'Texas': 695662, 'New York': 141297, 'Florida': 170312, 'Illinois': 149995})
population = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127, 'Florida': 19552860, 'Illinois': 12882135})
states = pd.DataFrame({'Population': population, 'Area': area})
states

            Population    Area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


You can access a column directly by its name, *then* access an individual within that column by its index.

In [99]:
states['Area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: Area, dtype: int64

In [101]:
states['Area']['Texas']

695662

To ease the access, Pandas offers dedicated methods:
* <tt>iloc</tt> gives access to subparts of the DataFrame by integer position, as if it were a NumPy array.

In [102]:
states.iloc[:2]

            Population    Area
California    38332521  423967
Texas         26448193  695662


In [103]:
states.iloc[:2,0]

California    38332521
Texas         26448193
Name: Population, dtype: int64

* <tt>loc</tt> does the same but with the explicit names (unlike with <tt>iloc</tt>, the last one is included)

In [111]:
states.loc[:'Texas']

            Population    Area
California    38332521  423967
Texas         26448193  695662


In [112]:
states.loc[:,'Population':]

            Population    Area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995
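
Both accessors also take a row selector and a column selector together, and <tt>loc</tt> accepts a boolean condition for the rows. The sketch below assumes a three-row version of `states`, rebuilt here to keep the example self-contained; the threshold of 20 million is arbitrary.

```python
import pandas as pd

states = pd.DataFrame(
    {'Population': [38332521, 26448193, 19651127],
     'Area': [423967, 695662, 141297]},
    index=['California', 'Texas', 'New York'],
)

cell = states.loc['Texas', 'Area']   # one cell, by row and column labels
by_pos = states.iloc[1, 1]           # the same cell, by integer positions

# Rows selected by a condition, restricted to one column
big = states.loc[states['Population'] > 20_000_000, 'Area']

print(cell, by_pos)       # 695662 695662
print(list(big.index))    # ['California', 'Texas']
```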
