<a href="https://colab.research.google.com/github/eluizaTsuda/fcc-data-analysis-with-python/blob/main/DataAnalysisWithPython_pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

***Pandas Series***

In [32]:
import pandas as pd
import numpy as np

We will start by analyzing the Group of Seven (G7), a political forum formed by Canada, France, Germany, Italy, Japan, the United Kingdom and the United States. We'll begin with population, and for that we'll use a `pandas.Series` object.

In [33]:
# in millions
g7_pop = pd.Series([35.467, 63.951, 80.948, 60.665, 127.061, 64.511, 318.253])

In [34]:
g7_pop

0     35.467
1     63.951
2     80.948
3     60.665
4    127.061
5     64.511
6    318.253
dtype: float64

Someone might not know we're representing population in millions of inhabitants. Series can have a name, to better document the purpose of the Series.

In [35]:
g7_pop.name = 'G7 Population in millions'

In [36]:
g7_pop

0     35.467
1     63.951
2     80.948
3     60.665
4    127.061
5     64.511
6    318.253
Name: G7 Population in millions, dtype: float64

Series are pretty similar to NumPy arrays:

In [37]:
g7_pop.dtype

dtype('float64')

In [38]:
g7_pop.values

array([ 35.467,  63.951,  80.948,  60.665, 127.061,  64.511, 318.253])

They're actually backed by numpy arrays:

In [39]:
type(g7_pop.values)

numpy.ndarray
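
As a small sketch of that relationship (the shortened `pop` Series and the variable names are illustrative, not part of the original notebook), `Series.to_numpy()` is another way to reach the underlying array, and NumPy functions accept a Series directly:

```python
import numpy as np
import pandas as pd

# Illustrative three-element Series, values in millions
pop = pd.Series([35.467, 63.951, 80.948], name='G7 Population in millions')

# to_numpy() returns the data as a NumPy array, as .values does here
arr = pop.to_numpy()
print(type(arr))   # <class 'numpy.ndarray'>
print(arr.dtype)   # float64

# NumPy reductions can be applied to the Series itself
total = np.sum(pop)
print(total)
```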

They look like simple Python lists or NumPy arrays, but they're actually more similar to Python **dicts**.
A Series has an **index**, which is similar to the automatic index assigned to Python's lists:

In [40]:
g7_pop

0     35.467
1     63.951
2     80.948
3     60.665
4    127.061
5     64.511
6    318.253
Name: G7 Population in millions, dtype: float64

In [41]:
g7_pop[0]

35.467

In [42]:
g7_pop[1]

63.951

In [43]:
g7_pop.index

RangeIndex(start=0, stop=7, step=1)

In [44]:
l = ['a', 'b',  'c']

But, in contrast to lists, we can explicitly define the index:

In [45]:
g7_pop.index = [
    'Canada',
    'France',
    'Germany',
    'Italy',
    'Japan',
    'United Kingdom',
    'United States',
]

In [46]:
g7_pop

Canada             35.467
France             63.951
Germany            80.948
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.253
Name: G7 Population in millions, dtype: float64
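
The comparison with dicts can be made concrete with a minimal sketch. The three-country Series below is a shortened stand-in for `g7_pop`, and the `0.0` default is an arbitrary choice for illustration:

```python
import pandas as pd

# Shortened stand-in for g7_pop, built from a dict
g7_pop = pd.Series({'Canada': 35.467, 'France': 63.951, 'Germany': 80.948})

# Membership tests look at the index (the "keys"), not the values
print('France' in g7_pop)   # True
print(35.467 in g7_pop)     # False

# get() returns a default for a missing label, like dict.get
print(g7_pop.get('Spain', 0.0))  # 0.0

# items() yields (label, value) pairs, like dict.items
pairs = list(g7_pop.items())
print(pairs[0])
```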

In [47]:
pd.Series({
    'Canada': 35.467,
    'France': 63.951, 
    'Germany': 80.948,
    'Italy':  60.665,
    'Japan': 127.061,
    'United Kingdom': 64.511,
    'United States': 318.253
}, name='G7 Population in millions')

Canada             35.467
France             63.951
Germany            80.948
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.253
Name: G7 Population in millions, dtype: float64

In [48]:
pd.Series(
    [35.467, 63.951, 80.948, 60.665, 127.061, 64.511, 318.253],
    index=['Canada', 'France', 'Germany', 'Italy', 'Japan', 'United Kingdom', 'United States'],
    name='G7 Population in millions')

Canada             35.467
France             63.951
Germany            80.948
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.253
Name: G7 Population in millions, dtype: float64

You can also create Series out of other series, specifying indexes:

In [49]:
pd.Series(g7_pop, index=['France', 'Germany', 'Italy', 'Spain'])

France     63.951
Germany    80.948
Italy      60.665
Spain         NaN
Name: G7 Population in millions, dtype: float64
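
A closely related way to get the same result is `Series.reindex`. The sketch below uses a shortened four-country Series for illustration and shows how labels with no match come back as `NaN`:

```python
import pandas as pd

# Shortened stand-in for g7_pop
g7_pop = pd.Series(
    {'Canada': 35.467, 'France': 63.951, 'Germany': 80.948, 'Italy': 60.665},
    name='G7 Population in millions')

# Labels missing from the source Series ('Spain') become NaN
subset = g7_pop.reindex(['France', 'Germany', 'Italy', 'Spain'])
print(subset)

# isna() flags the labels that had no match
missing = subset[subset.isna()].index.tolist()
print(missing)  # ['Spain']
```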

***Indexing***

Indexing works similarly to lists and dictionaries: you use the index of the element you're looking for:

In [50]:
g7_pop

Canada             35.467
France             63.951
Germany            80.948
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.253
Name: G7 Population in millions, dtype: float64

In [51]:
g7_pop['Canada']

35.467

In [52]:
g7_pop['Japan']

127.061

Numeric positions can also be used, with the ***iloc*** attribute:

In [53]:
g7_pop.iloc[0]

35.467

In [54]:
g7_pop.iloc[-1]

318.253
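
The label-based counterpart of `iloc` is `loc`. Here is a minimal sketch, using a shortened three-country Series for illustration, that puts the two side by side and shows what happens with a label that is not in the index:

```python
import pandas as pd

# Shortened stand-in for g7_pop
g7_pop = pd.Series(
    [35.467, 63.951, 80.948],
    index=['Canada', 'France', 'Germany'])

# loc selects by label, iloc by integer position
by_label = g7_pop.loc['France']
by_position = g7_pop.iloc[1]
print(by_label, by_position)  # 63.951 63.951

# A missing label raises KeyError rather than returning a default
try:
    g7_pop.loc['Spain']
except KeyError:
    print('Spain is not in the index')
```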

Selecting multiple elements at once:

In [55]:
g7_pop[['Italy', 'France']]

Italy     60.665
France    63.951
Name: G7 Population in millions, dtype: float64

The result is another Series

In [56]:
g7_pop.iloc[[0, 1]]

Canada    35.467
France    63.951
Name: G7 Population in millions, dtype: float64

Slicing also works, but **important**: when slicing by label in pandas, the upper limit is also included:

In [58]:
g7_pop['Canada': 'Italy']

Canada     35.467
France     63.951
Germany    80.948
Italy      60.665
Name: G7 Population in millions, dtype: float64

In [59]:
l

['a', 'b', 'c']

In [60]:
l[:2]

['a', 'b']
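
To make the difference explicit, this sketch (with a shortened four-country Series used only for illustration) contrasts a label slice, which includes its upper limit, with a positional `iloc` slice, which excludes it just like a list slice:

```python
import pandas as pd

# Shortened stand-in for g7_pop
g7_pop = pd.Series(
    [35.467, 63.951, 80.948, 60.665],
    index=['Canada', 'France', 'Germany', 'Italy'])

# Label slice: both endpoints are included
label_slice = g7_pop.loc['Canada':'Germany']
print(label_slice.index.tolist())     # ['Canada', 'France', 'Germany']

# Positional slice with iloc: the stop position is excluded, like a list
position_slice = g7_pop.iloc[0:2]
print(position_slice.index.tolist())  # ['Canada', 'France']
```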

***Conditional selection (boolean arrays)***

The same boolean array techniques we saw applied to NumPy arrays can be used for pandas **Series**:

In [62]:
g7_pop

Canada             35.467
France             63.951
Germany            80.948
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.253
Name: G7 Population in millions, dtype: float64

In [61]:
g7_pop > 70

Canada            False
France            False
Germany            True
Italy             False
Japan              True
United Kingdom    False
United States      True
Name: G7 Population in millions, dtype: bool

In [63]:
g7_pop[g7_pop > 70]

Germany           80.948
Japan            127.061
United States    318.253
Name: G7 Population in millions, dtype: float64

In [64]:
g7_pop.mean()

107.26514285714286

In [66]:
g7_pop[g7_pop > g7_pop.mean()]

Japan            127.061
United States    318.253
Name: G7 Population in millions, dtype: float64

In [67]:
g7_pop.std()

97.15187608829893

~ not
| or
& and

In [68]:
g7_pop[(g7_pop > g7_pop.mean() - g7_pop.std() / 2) | (g7_pop > g7_pop.mean() + g7_pop.std() / 2)]

France             63.951
Germany            80.948
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.253
Name: G7 Population in millions, dtype: float64
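
When combining conditions, each comparison needs its own parentheses, and Python's `and`/`or` keywords cannot be used on a Series. A minimal sketch, with a shortened Series and thresholds (60 and 100) chosen only for illustration:

```python
import pandas as pd

# Shortened stand-in for g7_pop
g7_pop = pd.Series(
    [35.467, 63.951, 80.948, 127.061],
    index=['Canada', 'France', 'Germany', 'Japan'])

# Parenthesize each comparison: & and | bind tighter than > and <
mid_sized = g7_pop[(g7_pop > 60) & (g7_pop < 100)]
print(mid_sized.index.tolist())      # ['France', 'Germany']

# ~ inverts a boolean Series
not_mid_sized = g7_pop[~((g7_pop > 60) & (g7_pop < 100))]
print(not_mid_sized.index.tolist())  # ['Canada', 'Japan']
```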

***Operations and methods***

In [69]:
g7_pop

Canada             35.467
France             63.951
Germany            80.948
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.253
Name: G7 Population in millions, dtype: float64

In [70]:
g7_pop * 1_000_000

Canada             35467000.0
France             63951000.0
Germany            80948000.0
Italy              60665000.0
Japan             127061000.0
United Kingdom     64511000.0
United States     318253000.0
Name: G7 Population in millions, dtype: float64

In [71]:
g7_pop.mean()

107.26514285714286

In [72]:
np.log(g7_pop)

Canada            3.568603
France            4.158117
Germany           4.393807
Italy             4.105367
Japan             4.844667
United Kingdom    4.166836
United States     5.762847
Name: G7 Population in millions, dtype: float64

In [73]:
g7_pop['France':'Italy'].mean()

68.52133333333333

***Boolean Arrays***

In [74]:
g7_pop

Canada             35.467
France             63.951
Germany            80.948
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.253
Name: G7 Population in millions, dtype: float64

In [75]:
g7_pop > 80

Canada            False
France            False
Germany            True
Italy             False
Japan              True
United Kingdom    False
United States      True
Name: G7 Population in millions, dtype: bool

In [76]:
g7_pop[g7_pop > 80]

Germany           80.948
Japan            127.061
United States    318.253
Name: G7 Population in millions, dtype: float64

In [77]:
g7_pop[(g7_pop > 80) | (g7_pop < 40)]

Canada            35.467
Germany           80.948
Japan            127.061
United States    318.253
Name: G7 Population in millions, dtype: float64

In [78]:
g7_pop[(g7_pop > 80) & (g7_pop < 200)]

Germany     80.948
Japan      127.061
Name: G7 Population in millions, dtype: float64

***Modifying Series***

In [79]:
g7_pop['Canada'] = 40.5

In [80]:
g7_pop

Canada             40.500
France             63.951
Germany            80.948
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.253
Name: G7 Population in millions, dtype: float64

In [81]:
g7_pop.iloc[-1] = 500

In [82]:
g7_pop

Canada             40.500
France             63.951
Germany            80.948
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     500.000
Name: G7 Population in millions, dtype: float64

In [83]:
g7_pop[g7_pop < 70] = 99.99

In [84]:
g7_pop

Canada             99.990
France             99.990
Germany            80.948
Italy              99.990
Japan             127.061
United Kingdom     99.990
United States     500.000
Name: G7 Population in millions, dtype: float64
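
All of the assignments above change `g7_pop` in place. If the original values need to survive, one option is to work on a copy first. This sketch assumes a shortened three-country Series, and the name `adjusted` is illustrative:

```python
import pandas as pd

# Shortened stand-in for g7_pop
g7_pop = pd.Series(
    [35.467, 63.951, 80.948],
    index=['Canada', 'France', 'Germany'])

# Work on an independent copy so the original values are untouched
adjusted = g7_pop.copy()
adjusted[adjusted < 70] = 99.99

print(adjusted.tolist())  # [99.99, 99.99, 80.948]
print(g7_pop.tolist())    # [35.467, 63.951, 80.948]
```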