<h1> Introduction to Pandas </h1>

<h3>Credit and Sources for this introduction: </h3>

- [freeCodeCamp.org Video](https://www.youtube.com/watch?v=r-uOLxNrNk8)
- [Source Notebooks](https://notebooks.ai/rmotr-curriculum/freecodecamp-intro-to-pandas-902ae59b/1+-+Pandas+-+Series.ipynb)
- [Pandas documentation](https://pandas.pydata.org/docs/pandas.pdf)

In [4]:
# Import
import pandas as pd
import numpy as np

<h3> What is Pandas?</h3>

Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python.

Pandas is well suited for many different kinds of data:
- Tabular data with heterogeneously typed columns, as in an SQL table or Excel spreadsheet.
- Ordered and unordered (not necessarily fixed-frequency) time series data.
- Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels.
- Any other form of observational/statistical data sets. The data need not be labeled at all to be placed into a pandas data structure.
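
As a small illustration of the first bullet, the sketch below builds a table whose columns have different types. The column names and the `is_g7` flag are made up for this example; the population figures are the ones used later in this notebook.

```python
import pandas as pd

# Hypothetical column names, chosen only for illustration
df = pd.DataFrame({
    "country": ["Canada", "France"],
    "population_millions": [35.467, 63.951],
    "is_g7": [True, True],
})

# Each column keeps its own data type
print(df.dtypes)
```

Each column is stored with its own dtype, which is what "heterogeneously typed" means in practice.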

There are two primary data structures in pandas: **Series** (1-dimensional) and **DataFrame** (2-dimensional). These data structures can handle almost all use cases in finance, statistics, social science, and many areas of engineering. The DataFrame structure can be compared to the data.frame (or data.table) in R. Pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other third-party libraries. Pandas is heavily optimized for performance: many of the low-level algorithmic bits have been extensively tweaked in Cython code. (Cython is a programming language that aims to be a superset of the Python programming language, designed to give C-like performance with code that is written mostly in Python with optional additional C-inspired syntax. It is a compiled language that is typically used to generate CPython extension modules; description from Wikipedia.)

Why more than one data structure?  
The pandas data structures are containers for lower-dimensional data. The DataFrame is a container for Series, and a Series is a container for scalars. We would like to be able to insert and remove objects from these containers in a dictionary-like fashion.
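
The container idea can be sketched as follows. This is a minimal example, not the only way to do it; the column name `population_thousands` is hypothetical and exists only to show insertion and removal.

```python
import pandas as pd

df = pd.DataFrame({"population": [35.467, 63.951]}, index=["Canada", "France"])

# Selecting a column from a DataFrame gives a Series
col = df["population"]
print(type(col).__name__)

# Selecting one element from a Series gives a scalar
print(col["Canada"])

# Insert a column with dictionary-style assignment (hypothetical column name)
df["population_thousands"] = df["population"] * 1000

# Remove it again with del, just like a dictionary key
del df["population_thousands"]
print(list(df.columns))
```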

<h3> Pandas Series </h3> 
Series is the first of the two pandas data structures. A Series is usually part of a DataFrame, and it can also be compared to a NumPy array.

In [5]:
# Some simple data of the population of G7 countries in millions
g7_pop = pd.Series([35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523])

In [7]:
# The Series has an associated data type: float64
g7_pop

0     35.467
1     63.951
2     80.940
3     60.665
4    127.061
5     64.511
6    318.523
dtype: float64

In [9]:
# We can name a series
g7_pop.name = "G7 Population in millions"

In [11]:
g7_pop

0     35.467
1     63.951
2     80.940
3     60.665
4    127.061
5     64.511
6    318.523
Name: G7 Population in millions, dtype: float64

In [15]:
# Shows the datatype of the series
g7_pop.dtype

dtype('float64')

In [16]:
# The .values attribute returns the data of the Series as a NumPy array.
g7_pop.values

array([ 35.467,  63.951,  80.94 ,  60.665, 127.061,  64.511, 318.523])

In [20]:
# Series look like simple Python lists or NumPy arrays, but they are actually more similar to Python dicts.
# A Series has an index, which is similar to the automatic index assigned to Python's lists.
# The underlying values are stored in a NumPy array:
type(g7_pop.values)

numpy.ndarray
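
Because the values live in a NumPy array, arithmetic on a Series is applied element-wise. The sketch below uses a shortened copy of the data under the assumed name `pop`, and uses `to_numpy()`, which the pandas documentation recommends over `.values`.

```python
import numpy as np
import pandas as pd

# Shortened copy of the population data, in millions
pop = pd.Series([35.467, 63.951, 80.940])

# to_numpy() returns the underlying data as a NumPy array
arr = pop.to_numpy()
print(type(arr))

# Arithmetic is applied to every element, as with a NumPy array
in_thousands = pop * 1000
print(in_thousands)

# Aggregations such as the mean are available directly on the Series
print(pop.mean())
```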

In [22]:
g7_pop

0     35.467
1     63.951
2     80.940
3     60.665
4    127.061
5     64.511
6    318.523
Name: G7 Population in millions, dtype: float64

In [25]:
# A Series has an index, and its values can be accessed in a similar way
print(g7_pop[0])
print(g7_pop[1])

35.467
63.951


In [27]:
# Shows the index of the series
g7_pop.index

RangeIndex(start=0, stop=7, step=1)

In [28]:
# A plain Python list for comparison: its positions 0, 1, 2 act as an implicit index
l = ['a', 'b', 'c']

In [32]:
# We can explicitly define labels for the index of a Series
g7_pop.index = [
    "Canada",
    "France",
    "Germany",
    "Italy",
    "Japan",
    "United Kingdom",
    "United States"
]

In [31]:
g7_pop

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

In [33]:
# We can say that Series look like "ordered dictionaries". We can actually create a Series from a dictionary:
pd.Series({
    'Canada': 35.467,
    'France': 63.951,
    'Germany': 80.94,
    'Italy': 60.665,
    'Japan': 127.061,
    'United Kingdom': 64.511,
    'United States': 318.523
}, name='G7 Population in millions')

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

In [35]:
# Create a Series from a list, with an explicit index
pd.Series(
    [35.467, 63.951, 80.94, 60.665, 127.061, 64.511, 318.523],
    index=['Canada', 'France', 'Germany', 'Italy', 'Japan', 'United Kingdom',
       'United States'],
    name='G7 Population in millions')

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64
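
To check that both constructions give the same result, one option is `Series.equals`, which compares values and index together. The names `from_dict` and `from_list` are chosen for this sketch, and the data is shortened to three countries.

```python
import pandas as pd

data = {"Canada": 35.467, "France": 63.951, "Germany": 80.94}

# Built from a dictionary: the keys become the index
from_dict = pd.Series(data, name="G7 Population in millions")

# Built from a list of values plus an explicit index
from_list = pd.Series(
    list(data.values()),
    index=list(data.keys()),
    name="G7 Population in millions",
)

# True when values and index labels match
print(from_dict.equals(from_list))
```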

<h3> Indexing </h3>

In [None]:
g7_pop

In [36]:
# Select values by index label
print(g7_pop['Canada'])
print(g7_pop['Japan'])

35.467
127.061
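
Once the index holds labels, it helps to state whether a lookup is by label or by position. A minimal sketch, using a shortened copy of the data under the assumed name `pop`: `.loc` selects by label and `.iloc` selects by integer position. Plain integer keys such as `pop[0]` on a label-indexed Series are ambiguous, and recent pandas versions may warn about them, so the explicit accessors are the safer choice.

```python
import pandas as pd

pop = pd.Series(
    [35.467, 63.951, 80.940],
    index=["Canada", "France", "Germany"],
)

# Label-based selection
print(pop.loc["France"])

# Position-based selection (0 is the first element)
print(pop.iloc[0])

# A list of labels returns a smaller Series
subset = pop.loc[["Canada", "Germany"]]
print(subset)
```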
