
# Understanding Pandas Series and DataFrames

## Introduction

In this lesson, we're digging into Pandas Series and DataFrames - the two main data types you'll work with.

## Objectives
You will be able to:
- Understand and explain what Pandas Series and DataFrames are and how they differ from dictionaries and lists
- Create Series & DataFrames from dictionaries and lists
- Manipulate columns in DataFrames (`df.rename()`, `df.drop()`) 
- Manipulate the index in DataFrames (`df.reindex()`, `df.drop()`, `df.rename()`) 
- Manipulate column datatypes 

## Pandas Data Types vs. Native Python Data Types

As we talk more about Object-Oriented Programming (OOP), you'll see that using Pandas Series and DataFrames instead of built-in Python datatypes has a range of benefits. One of the most important is that Series and DataFrames come with many built-in methods that streamline standard practices and procedures. Some of these methods can also yield dramatic performance gains. To read more about them, refer regularly to the [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/). It is impossible to know every pandas method at any given time, nor should you devote much time to memorization. We will not explain every Pandas method in depth in the upcoming lessons and labs, but a critical part of every Data Scientist's job is to investigate documentation and learn about the components of these tools on their own.
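
As a small illustration of the built-in methods mentioned above, the sketch below computes the same summary value with a plain Python list and with a Series. The variable names and sample values are made up for this example.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
prices = [10.5, 12.0, 9.75, 11.25]

# With a plain list, we write the calculation ourselves
list_mean = sum(prices) / len(prices)

# A Series offers the same calculation (and many others) as methods
price_series = pd.Series(prices)
series_mean = price_series.mean()
series_max = price_series.max()

print(list_mean, series_mean, series_max)
```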


**From the Pandas documentation:**

**pandas** is everyone's favorite data analysis library providing fast, flexible, and expressive data structures designed to work with *relational* or table-like data (such as a SQL table or Excel spreadsheet). It is a fundamental high-level building block for doing practical, real-world data analysis in Python.

pandas is well suited for:

- Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
- Ordered and unordered (not necessarily fixed-frequency) time series data.
- Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
- Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure

The two primary data structures of pandas, **Series** (1-dimensional) and **DataFrame** (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. Pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other 3rd party libraries.

<p>Here are just a few of the things that pandas does well:</p>
<blockquote>
<div><ul class="simple">
<li>Easy handling of <strong>missing data</strong> (represented as NaN) in floating point as
well as non-floating point data</li>
<li>Size mutability: columns can be <strong>inserted and deleted</strong> from DataFrame and
higher dimensional objects</li>
<li>Automatic and explicit <strong>data alignment</strong>: objects can be explicitly
aligned to a set of labels, or the user can simply ignore the labels and
let <cite>Series</cite>, <cite>DataFrame</cite>, etc. automatically align the data for you in
computations</li>
<li>Powerful, flexible <strong>group by</strong> functionality to perform
split-apply-combine operations on data sets, for both aggregating and
transforming data</li>
<li>Make it <strong>easy to convert</strong> ragged, differently-indexed data in other
Python and NumPy data structures into DataFrame objects</li>
<li>Intelligent label-based <strong>slicing</strong>, <strong>fancy indexing</strong>, and <strong>subsetting</strong>
of large data sets</li>
<li>Intuitive <strong>merging</strong> and <strong>joining</strong> data sets</li>
<li>Flexible <strong>reshaping</strong> and pivoting of data sets</li>
<li><strong>Hierarchical</strong> labeling of axes (possible to have multiple labels per
tick)</li>
<li>Robust IO tools for loading data from <strong>flat files</strong> (CSV and delimited),
Excel files, databases, and saving / loading data from the ultrafast <strong>HDF5
format</strong></li>
<li><strong>Time series</strong>-specific functionality: date range generation and frequency
conversion, moving window statistics, moving window linear regressions,
date shifting and lagging, etc.</li>
</ul>
</div></blockquote>
<p>Many of these principles are here to address the shortcomings frequently
experienced using other languages / scientific research environments. For data
scientists, working with data is typically divided into multiple stages:
munging and cleaning data, analyzing / modeling it, then organizing the results
of the analysis into a form suitable for plotting or tabular display. pandas
is the ideal tool for all of these tasks.</p>

# Introducing the most important objects: Series and DataFrames

## Setup

Let's take a little time to import the packages we need and to import and preview a dataset.

In [1]:
# import pandas under its conventional alias, pd, so we don't have to type the full name each time
import pandas as pd


## The Pandas Series

The **Series** data structure in Pandas is a <i>one-dimensional labeled array</i>. 

* Data in the array can be of any type (integers, strings, floating point numbers, Python objects, etc.).
* Data within a single Series is homogeneous: all of its values share one dtype.
* Pandas Series objects always have an index: this gives them both ndarray-like and dict-like properties.
    
<img src="../images/pandas_series1.jpg">
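
To make the "ndarray-like and dict-like" point concrete, here is a minimal sketch. The Series `s` and its values are made up for this example; `mixed` shows how values are coerced to a single dtype.

```python
import pandas as pd

# Hypothetical example values
s = pd.Series([40, 29, 15], index=['Mon', 'Tue', 'Wed'])

# dict-like: look up a value by its label (same as s.loc['Tue'])
by_label = s['Tue']

# ndarray-like: look up a value by its integer position
by_position = s.iloc[1]

# dict-like membership test checks the index labels, not the values
has_mon = 'Mon' in s

# ndarray-like slicing by position returns a new Series
first_two = s.iloc[:2]

# homogeneous data: an int and a float are stored together as float64
mixed = pd.Series([1, 2.5])
print(mixed.dtype)
```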

# Creating a Pandas Series

There are many ways to create a Pandas Series object. Some of the most common are:
- Creation from a list
- Creation from a dictionary
- Creation from a ndarray
- From an external source like a file

### From a List

In [2]:
# define the data and index as lists
temperature = [40, 29, 15, 19, 11, -15, 9]
days = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']

# create series 
series_from_list = pd.Series(temperature, index=days)
series_from_list

Mon    40
Tue    29
Wed    15
Thu    19
Fri    11
Sat   -15
Sun     9
dtype: int64

### From a Dictionary

In [3]:
my_dict = {'Mon': 40, 'Tue': 29, 'Wed': 15, 'Thu': 19, 'Fri': 11, 'Sat': -15, 'Sun': 9}
series_from_dict = pd.Series(my_dict)
series_from_dict

Mon    40
Tue    29
Wed    15
Thu    19
Fri    11
Sat   -15
Sun     9
dtype: int64

<img src="../images/pandas_series2.jpg">

### From a NumPy Array

In [7]:
import numpy as np
my_array = np.linspace(0,10,15)
series_from_ndarray = pd.Series(my_array)
series_from_ndarray

0      0.000000
1      0.714286
2      1.428571
3      2.142857
4      2.857143
5      3.571429
6      4.285714
7      5.000000
8      5.714286
9      6.428571
10     7.142857
11     7.857143
12     8.571429
13     9.285714
14    10.000000
dtype: float64
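
The list of creation methods above also mentions external sources such as files. A minimal sketch of that route follows. To keep it self-contained, `io.StringIO` stands in for a real file on disk, and the column names `day` and `temperature` are made up for this example.

```python
import io
import pandas as pd

# Stand-in for a file on disk: io.StringIO lets us treat a string as a file
csv_text = "day,temperature\nMon,40\nTue,29\nWed,15\n"

# read_csv returns a DataFrame; use 'day' as the index, then select one column
df_from_file = pd.read_csv(io.StringIO(csv_text), index_col='day')
series_from_file = df_from_file['temperature']
print(series_from_file)
```

With a real file, the first argument to `pd.read_csv` would be the file path instead of the `StringIO` object.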

# Vectorized operations also work on pandas Series

In [8]:
np.exp(series_from_list)

Mon    2.353853e+17
Tue    3.931334e+12
Wed    3.269017e+06
Thu    1.784823e+08
Fri    5.987414e+04
Sat    6.737947e-03
Sun    8.103084e+03
dtype: float64
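
Ordinary arithmetic and comparisons are vectorized in the same way as NumPy functions. The sketch below rebuilds `series_from_list` so that it runs on its own, and assumes, purely for illustration, that the values are degrees Celsius.

```python
import pandas as pd

temperature = [40, 29, 15, 19, 11, -15, 9]
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
series_from_list = pd.Series(temperature, index=days)

# Arithmetic is applied element by element, with no explicit loop
# (assumption for this example: the values are in Celsius)
fahrenheit = series_from_list * 9 / 5 + 32

# Comparisons are vectorized too and return a boolean Series,
# which can be used to filter the original data
below_freezing = series_from_list[series_from_list < 0]
print(below_freezing)
```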

# Pandas DataFrames

### We have the maximum and minimum temperatures in London for each month of the year. We would like to find a function to describe this data and show it graphically. The dataset is given below.

In [34]:
df_new = pd.DataFrame({'Max' : [39,41,43,47,49,51,45,38,37,29,27,25], 
                       'Min': [21,23,27,28,32,35,31,28,21,19,17,18]})
df_new.head()

|   | Max | Min |
|---|-----|-----|
| 0 | 39 | 21 |
| 1 | 41 | 23 |
| 2 | 43 | 27 |
| 3 | 47 | 28 |
| 4 | 49 | 32 |


DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects.

<img src="../images/dataframe1.jpg">

You can create a DataFrame from:

* Dict of 1D ndarrays, lists, dicts, or Series
* 2-D numpy.ndarray
* From text, CSV, Excel files or databases
* Many other ways
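
Two of the options listed above can be sketched as follows. The city values are borrowed from the temperature example in this lesson, and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# From a dict of Series: the Series are aligned on their index labels,
# and a label missing from one Series becomes NaN in that column
tokyo = pd.Series([15, 19], index=['12-1', '12-2'])
paris = pd.Series([-2, 0, 2], index=['12-1', '12-2', '12-3'])
df_from_series = pd.DataFrame({'Tokyo': tokyo, 'Paris': paris})

# From a 2-D ndarray: row and column labels are supplied separately
values = np.array([[15, -2], [19, 0]])
df_from_array = pd.DataFrame(values, index=['12-1', '12-2'],
                             columns=['Tokyo', 'Paris'])

print(df_from_series)
print(df_from_array)
```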

Here's an example where we have set the Dates column to be the index and label for the rows. 

<img src="../images/dataframe2.jpg">

# The above image can be represented as a pandas DataFrame as below

In [4]:
df_new = pd.DataFrame({"Dates": ["12-1", "12-2", "12-3", "12-4", "12-5", "12-6", "12-7"],
                       "Tokyo": [15, 19, 15, 11, 9, 8, 13],
                       "Paris": [-2, 0, 2, 5, 7, -5, -3],
                       "Mumbai": [20, 18, 23, 19, 25, 27, 23]})
df_new.head()

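|   | Dates | Tokyo | Paris | Mumbai |
|---|-------|-------|-------|--------|
| 0 | 12-1 | 15 | -2 | 20 |
| 1 | 12-2 | 19 | 0 | 18 |
| 2 | 12-3 | 15 | 2 | 23 |
| 3 | 12-4 | 11 | 5 | 19 |
| 4 | 12-5 | 9 | 7 | 25 |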
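
Note that this DataFrame still uses the default integer index, with `Dates` as an ordinary column. One way to make `Dates` the row labels, as in the image, is `set_index`. This is a minimal sketch, and the name `df_by_date` is made up for this example.

```python
import pandas as pd

df_new = pd.DataFrame({"Dates": ["12-1", "12-2", "12-3", "12-4", "12-5", "12-6", "12-7"],
                       "Tokyo": [15, 19, 15, 11, 9, 8, 13],
                       "Paris": [-2, 0, 2, 5, 7, -5, -3],
                       "Mumbai": [20, 18, 23, 19, 25, 27, 23]})

# set_index returns a new DataFrame whose row labels come from 'Dates'
df_by_date = df_new.set_index("Dates")

# Rows can now be selected by date label
print(df_by_date.loc["12-3"])
```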


# Let's create a pandas DataFrame with details of flight information

In [38]:
data = pd.DataFrame({
    "Dates": ["12-1", "12-2", "12-3", "12-4", "12-5"],
    "Airline": ["KLM", "AirFrance", "SwissAir", "RyanAir", "Emirates"],
    "Departure": ["Tokyo", "Madrid", "Mumbai", "London", "NewYork"],
    "Arrival": ["Paris", "Milan", "Stockholm", "Brussels", "Accra"],
    "FlightNumber": [10045, 10050, 10065, 10070, 10080],
    "RecentDelays": ["23-47 hours", "10-14 hours", "4-18 hours", "13 hours", "20-30 hours"],
})
data

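|   | Dates | Airline | Departure | Arrival | FlightNumber | RecentDelays |
|---|-------|---------|-----------|---------|--------------|--------------|
| 0 | 12-1 | KLM | Tokyo | Paris | 10045 | 23-47 hours |
| 1 | 12-2 | AirFrance | Madrid | Milan | 10050 | 10-14 hours |
| 2 | 12-3 | SwissAir | Mumbai | Stockholm | 10065 | 4-18 hours |
| 3 | 12-4 | RyanAir | London | Brussels | 10070 | 13 hours |
| 4 | 12-5 | Emirates | NewYork | Accra | 10080 | 20-30 hours |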


# Renaming columns

In [39]:
data.rename(columns={'RecentDelays': 'Delays'}, inplace=True)  # inplace=True modifies data directly instead of returning a copy

In [None]:
# Let's check our result and see if RecentDelays has been renamed to Delays

In [40]:
data.head()

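|   | Dates | Airline | Departure | Arrival | FlightNumber | Delays |
|---|-------|---------|-----------|---------|--------------|--------|
| 0 | 12-1 | KLM | Tokyo | Paris | 10045 | 23-47 hours |
| 1 | 12-2 | AirFrance | Madrid | Milan | 10050 | 10-14 hours |
| 2 | 12-3 | SwissAir | Mumbai | Stockholm | 10065 | 4-18 hours |
| 3 | 12-4 | RyanAir | London | Brussels | 10070 | 13 hours |
| 4 | 12-5 | Emirates | NewYork | Accra | 10080 | 20-30 hours |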
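
`rename` can also be used without `inplace=True`, and it can rename row labels as well as columns. The sketch below uses a made-up two-row stand-in for the flight data, and the labels `first` and `second` are hypothetical.

```python
import pandas as pd

# A two-row stand-in for the flight data above
flights = pd.DataFrame({"Airline": ["KLM", "AirFrance"],
                        "RecentDelays": ["23-47 hours", "10-14 hours"]})

# Without inplace=True, rename leaves the original untouched
# and returns a renamed copy
renamed = flights.rename(columns={"RecentDelays": "Delays"})

# The index keyword renames row labels in the same way
relabeled = flights.rename(index={0: "first", 1: "second"})

print(renamed.columns.tolist())
print(relabeled.index.tolist())
```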


In [45]:
# Reset the index to the default: the old index is added as a column,
# and a new sequential index is used
data.reset_index(inplace=True)

In [47]:
data

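|   | index | Dates | Airline | Departure | Arrival | FlightNumber | Delays |
|---|-------|-------|---------|-----------|---------|--------------|--------|
| 0 | 0 | 12-1 | KLM | Tokyo | Paris | 10045 | 23-47 hours |
| 1 | 1 | 12-2 | AirFrance | Madrid | Milan | 10050 | 10-14 hours |
| 2 | 2 | 12-3 | SwissAir | Mumbai | Stockholm | 10065 | 4-18 hours |
| 3 | 3 | 12-4 | RyanAir | London | Brussels | 10070 | 13 hours |
| 4 | 4 | 12-5 | Emirates | NewYork | Accra | 10080 | 20-30 hours |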
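
If the old index is not worth keeping, `reset_index` accepts `drop=True`. The related `reindex` method, listed in the objectives, conforms the rows to a given list of labels. Both are sketched below on a made-up three-row stand-in with date labels as its index.

```python
import pandas as pd

# A small stand-in with a non-default index
flights = pd.DataFrame({"Airline": ["KLM", "AirFrance", "SwissAir"]},
                       index=["12-1", "12-2", "12-3"])

# Default behavior: the old index is kept as a new column named 'index'
kept = flights.reset_index()

# drop=True discards the old index instead of adding it as a column
dropped = flights.reset_index(drop=True)

# reindex selects and orders rows by label; a label that does not
# exist yet ('12-4' here) produces a row of missing values
reordered = flights.reindex(["12-3", "12-1", "12-4"])

print(kept.columns.tolist())
print(dropped.columns.tolist())
print(reordered)
```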


# Dropping columns

In [48]:
# note: axis=1 refers to columns and axis=0 refers to rows
data.drop("FlightNumber", axis=1, inplace=True)

In [49]:
# dropping the new index column too
data.drop("index", axis=1 , inplace=True)
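
The same `drop` method has a few equivalent spellings, and it can remove rows as well as columns. This is a minimal sketch on a made-up three-row stand-in for the flight data, without `inplace=True` so that each call returns a new DataFrame.

```python
import pandas as pd

flights = pd.DataFrame({"Airline": ["KLM", "AirFrance", "SwissAir"],
                        "FlightNumber": [10045, 10050, 10065],
                        "Delays": ["23-47 hours", "10-14 hours", "4-18 hours"]})

# columns= is equivalent to passing the labels with axis=1
no_numbers = flights.drop(columns=["FlightNumber"])

# Several columns can be dropped at once by passing a list
airline_only = flights.drop(columns=["FlightNumber", "Delays"])

# index= (or axis=0) drops rows by their index label instead
without_first_row = flights.drop(index=0)

print(no_numbers.columns.tolist())
print(without_first_row.index.tolist())
```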

# Checking the final result
+ We can now see the columns have been removed from the DataFrame

In [50]:
data

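|   | Dates | Airline | Departure | Arrival | Delays |
|---|-------|---------|-----------|---------|--------|
| 0 | 12-1 | KLM | Tokyo | Paris | 23-47 hours |
| 1 | 12-2 | AirFrance | Madrid | Milan | 10-14 hours |
| 2 | 12-3 | SwissAir | Mumbai | Stockholm | 4-18 hours |
| 3 | 12-4 | RyanAir | London | Brussels | 13 hours |
| 4 | 12-5 | Emirates | NewYork | Accra | 20-30 hours |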
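
One of the objectives above is manipulating column datatypes. A minimal sketch follows, using a made-up two-row version of the flight data. Treating flight numbers as strings is an assumption made for this example, on the grounds that they are identifiers rather than quantities.

```python
import pandas as pd

flights = pd.DataFrame({"Airline": ["KLM", "AirFrance"],
                        "FlightNumber": [10045, 10050]})

# dtypes lists the datatype of each column
print(flights.dtypes)

# astype returns a converted copy of the column;
# assign it back to keep the change
flights["FlightNumber"] = flights["FlightNumber"].astype(str)

print(flights["FlightNumber"].iloc[0])
```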
