<h1>Pandas Introduction</h1>


Pandas is one of the most widely used Python libraries for data analysis. A typical workflow looks like this:

1. Get all the data (DBs, Excel, CSV, JSON)
2. Process the data (Combine, Merge, Analyze)
3. Visualize the data 
4. Create Reports
5. Statistical Analysis

<h3> Pandas Data Structures </h3>


<b>Import Pandas and Numpy</b>

In [1]:
import pandas as pd
import numpy as np

<b>Pandas Series</b>

<p>We'll analyze the Group of Seven (G7), which is made up of Canada, France, Germany, Italy, Japan, the United Kingdom and the United States.
    We'll start by analyzing population using a <b>pandas.Series</b> object.</p>

In [15]:
#In millions
g7_pop = pd.Series([35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523])

In [16]:
g7_pop

0     35.467
1     63.951
2     80.940
3     60.665
4    127.061
5     64.511
6    318.523
dtype: float64

<p>We then add a <b>name</b> to the series to better document it.</p>

In [4]:
g7_pop.name = 'G7 Population in millions'

In [5]:
g7_pop

0     35.467
1     63.951
2     80.940
3     60.665
4    127.061
5     64.511
6    318.523
Name: G7 Population in millions, dtype: float64

<p>Series are similar to numpy arrays:</p>

In [6]:
g7_pop.dtype

dtype('float64')

In [7]:
g7_pop.values

array([ 35.467,  63.951,  80.94 ,  60.665, 127.061,  64.511, 318.523])

They're actually backed by numpy arrays: 

In [8]:
type(g7_pop.values)

numpy.ndarray
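As a quick check of that relationship, here is a minimal sketch (the variable names `pop` and `arr` are made up for illustration). It pulls the underlying array out of a small Series with `to_numpy()`, which the pandas documentation recommends over `.values`, and then uses it like any other NumPy array:

```python
import numpy as np
import pandas as pd

# A small illustrative Series (first three G7 values, in millions)
pop = pd.Series([35.467, 63.951, 80.940])

# to_numpy() returns the data as a NumPy array
arr = pop.to_numpy()

print(type(arr))   # the array type
print(arr.dtype)   # element type, same as pop.dtype
print(arr * 2)     # plain NumPy arithmetic works on it
```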

And they look like simple Python lists or NumPy arrays, but they're actually more similar to Python <b>dicts</b>.

A series has an <b>index</b>, which is similar to the automatic index assigned to Python's lists:

In [14]:
g7_pop

0     35.467
1     63.951
2     80.940
3     60.665
4    127.061
5     64.511
6    318.523
Name: G7 Population in millions, dtype: float64

In [9]:
g7_pop[0]

35.467

In [10]:
g7_pop[1]

63.951

In [11]:
g7_pop.index

RangeIndex(start=0, stop=7, step=1)

But in contrast to lists, we can <u><b>explicitly define the index</b></u>: 

In [18]:
g7_pop.index = [
    'Canada',
    'France',
    'Germany',
    'Italy',
    'Japan',
    'United Kingdom',
    'United States',
]

In [19]:
g7_pop

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64
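Labels are matched to values purely by position, so the index should come from an ordered container such as a list (a set has no guaranteed order). The sketch below uses a shortened, illustrative Series named `values` to show a correct assignment and what happens when the number of labels does not match:

```python
import pandas as pd

values = pd.Series([35.467, 63.951, 80.940])

# A list keeps its order, so each label lands on the intended value.
values.index = ['Canada', 'France', 'Germany']
print(values['France'])

# The number of labels must match the number of values;
# otherwise pandas raises a ValueError.
length_error = None
try:
    values.index = ['Canada', 'France']
except ValueError as exc:
    length_error = exc
    print('Rejected:', exc)
```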

We can say that Series look like 'ordered dictionaries'. In fact, we can <b>create Series out of dictionaries</b>: 

In [62]:
g7_pop = pd.Series({
    'Canada': 35.467,
    'France': 63.951, 
    'Germany': 80.94, 
    'Italy': 60.665,
    'Japan': 127.061,
    'United Kingdom': 64.511, 
    'United States': 318.523
})

In [63]:
g7_pop.name = 'G7 Population in millions'

In [64]:
g7_pop

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64
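To make the dictionary analogy concrete, here is a small sketch (the names `data` and `series` are illustrative). It assumes Python 3.7+ and a reasonably recent pandas, where the dictionary's insertion order becomes the index order, and it converts the Series back with `to_dict()`:

```python
import pandas as pd

data = {'Canada': 35.467, 'France': 63.951, 'Germany': 80.940}
series = pd.Series(data, name='G7 Population in millions')

# Keys become the index, in insertion order
print(list(series.index))

# Membership tests check the index, like dict keys
print('France' in series)

# And the Series converts back to a plain dict
round_trip = series.to_dict()
print(round_trip == data)
```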

You can also <b>create everything in one simple flow</b>: <br><br>
<b>(1) -</b> [Bracket values]<br>
<b>(2) -</b> index = [KEYS]<br>
<b>(3) -</b> name = 'NAME'

In [26]:
pd.Series(
    [1,2,3,4,5,6],
    index = ['First', 'Second', 'Third', 'Fourth', 'Fifth', 'Sixth'],
    name = 'Placement Names'
)

First     1
Second    2
Third     3
Fourth    4
Fifth     5
Sixth     6
Name: Placement Names, dtype: int64

<u>And you can also create pd.Series out of other series by specifying indexes</u>:

In [27]:
pd.Series(g7_pop, index=['France', 'Germany', 'Italy', 'Spain'])

France     63.951
Germany    80.940
Italy      60.665
Spain         NaN
Name: G7 Population in millions, dtype: float64
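Labels that don't exist in the source series come back as `NaN`. A minimal sketch of the same idea using `Series.reindex`, with a shortened, illustrative series named `g7`; it also shows how to find the missing labels and how to supply a default with `fill_value`:

```python
import pandas as pd

g7 = pd.Series({'France': 63.951, 'Germany': 80.940, 'Italy': 60.665})

# Labels missing from g7 (here 'Spain') are filled with NaN
subset = g7.reindex(['France', 'Italy', 'Spain'])
print(subset)

# Find which labels had no data
missing = subset[subset.isna()].index.tolist()
print(missing)

# Or provide a default value instead of NaN
filled = g7.reindex(['France', 'Italy', 'Spain'], fill_value=0.0)
print(filled)
```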

<i>You can then assign the result to a variable.</i>

<h3> Indexing </h3>

In [32]:
g7_pop['Canada'] #What's the population of Canada?

35.467

In [33]:
g7_pop['Japan'] #What's the population of Japan?

127.061

<p>Numeric positions can also be used with the <b>iloc</b> attribute:</p>

In [35]:
g7_pop.iloc[0] #iloc means 'integer location'

35.467

In [31]:
g7_pop.iloc[-1]

318.523
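The difference between label-based and position-based access is easiest to see when the index itself holds integers. The sketch below uses a made-up series `s` whose labels run in reverse, so `loc` (labels) and `iloc` (positions) return different elements:

```python
import pandas as pd

# Integer labels deliberately in reverse order
s = pd.Series([10, 20, 30], index=[2, 1, 0])

by_label = s.loc[0]      # the element whose label is 0
by_position = s.iloc[0]  # the first element, whatever its label

print(by_label)
print(by_position)
```

Using `loc` and `iloc` explicitly avoids any ambiguity about which of the two a plain `s[0]` would mean.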

<u>Select multiple elements at once</u>:

In [34]:
g7_pop[['Italy', 'France']]

Italy     60.665
France    63.951
Name: G7 Population in millions, dtype: float64

<i>The result is another series</i>

In [36]:
g7_pop.iloc[[0, 1]] # positional selection goes through iloc when the index holds labels

Canada    35.467
France    63.951
Name: G7 Population in millions, dtype: float64

In [37]:
g7_pop[['Canada', 'Italy']]

Canada    35.467
Italy     60.665
Name: G7 Population in millions, dtype: float64

<h3>Conditional Selection (boolean arrays)</h3>
<p>The same boolean array techniques in Numpy can be used for Pandas <b>series</b>:</p>

In [38]:
g7_pop

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

In [69]:
g7_pop > 70 # What countries have more than 70 mil citizens?

Canada            False
France            False
Germany            True
Italy             False
Japan              True
United Kingdom    False
United States      True
Name: G7 Population in millions, dtype: bool

In [70]:
g7_pop[g7_pop > 70] # Select only the countries with more than 70 million citizens

Germany           80.940
Japan            127.061
United States    318.523
Name: G7 Population in millions, dtype: float64

In [41]:
g7_pop.mean()

107.30257142857144

In [43]:
g7_pop.std()

97.24996987121581

In [44]:
g7_pop[(g7_pop > g7_pop.mean() - g7_pop.std() / 2) | (g7_pop > g7_pop.mean() + g7_pop.std() / 2)]

France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64
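Masks can be combined with `&` (and), `|` (or) and `~` (not). Each comparison needs its own parentheses because those operators bind more tightly than `>` and `<`. A minimal sketch with made-up numbers (the names `s`, `low`, `high`, `inside` and `outside` are illustrative) that keeps the values within half a standard deviation of the mean, and then the complement:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 100], index=list('abcde'))

low = s.mean() - s.std() / 2
high = s.mean() + s.std() / 2

# Both conditions must hold: strictly between low and high
inside = s[(s > low) & (s < high)]

# ~ negates a mask: everything that is NOT inside the band
outside = s[~((s > low) & (s < high))]

print(inside)
print(outside)
```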

<h3> Operations and Methods </h3>
<p>Series also support vectorized operations and aggregation functions, just like NumPy arrays:</p>

In [65]:
g7_pop

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

In [72]:
g7_pop * 1_000_000

Canada             35467000.0
France             63951000.0
Germany            80940000.0
Italy              60665000.0
Japan             127061000.0
United Kingdom     64511000.0
United States     318523000.0
Name: G7 Population in millions, dtype: float64

In [67]:
g7_pop.mean()

107.30257142857144

In [55]:
np.log(g7_pop)

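Canada            3.568603
France            4.158117
Germany           4.393708
Italy             4.105367
Japan             4.844667
United Kingdom    4.166836
United States     5.763695
Name: G7 Population in millions, dtype: float64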

In [68]:
g7_pop['France':'Italy'].mean()

68.51866666666666
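Note that slicing by label includes the end label, unlike positional slicing, which excludes the stop position (as with Python lists). A short sketch with an illustrative series `s`:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=list('abcde'))

# Label slice: both endpoints are included
by_label = s.loc['b':'d']

# Positional slice: the stop position is excluded
by_position = s.iloc[1:3]

print(list(by_label.index))
print(list(by_position.index))
```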

<h3>Boolean Arrays</h3>
(These work the same way as in NumPy.)

In [73]:
g7_pop > 80

Canada            False
France            False
Germany            True
Italy             False
Japan              True
United Kingdom    False
United States      True
Name: G7 Population in millions, dtype: bool

In [74]:
g7_pop[g7_pop > 80]

Germany           80.940
Japan            127.061
United States    318.523
Name: G7 Population in millions, dtype: float64

In [75]:
g7_pop[(g7_pop > 80) | (g7_pop < 40)]

Canada            35.467
Germany           80.940
Japan            127.061
United States    318.523
Name: G7 Population in millions, dtype: float64

<h3>Modifying Series</h3>

In [77]:
g7_pop['Canada'] = 40.5

In [78]:
g7_pop

Canada             40.500
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

In [79]:
g7_pop.iloc[-1] = 500

In [80]:
g7_pop

Canada             40.500
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     500.000
Name: G7 Population in millions, dtype: float64

In [81]:
g7_pop[g7_pop < 70] = 99.99

In [82]:
g7_pop

Canada             99.990
France             99.990
Germany            80.940
Italy              99.990
Japan             127.061
United Kingdom     99.990
United States     500.000
Name: G7 Population in millions, dtype: float64
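All of the assignments above change the series in place. If the original values need to be preserved, one option is to work on a copy, and another is `Series.mask`, which returns a new series with the matching values replaced. A minimal sketch with illustrative names (`original`, `edited`, `masked`):

```python
import pandas as pd

original = pd.Series({'Canada': 40.5, 'Germany': 80.94, 'United States': 500.0})

# Option 1: modify an explicit copy
edited = original.copy()
edited[edited < 70] = 99.99

# Option 2: mask() replaces values where the condition is True
# and returns a new Series, leaving `original` untouched
masked = original.mask(original < 70, 99.99)

print(original)
print(masked)
```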