# 1. Introduction to Pandas

Pandas is an open-source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

- It is among the most important tools for data analysts and data scientists.
- It is one of the most widely used libraries for working with tabular data in Python.

To import it, simply do:

In [None]:
import pandas as pd

To check the version of Pandas you are running, do:

In [None]:
pd.__version__

# 2. From Series to DataFrame

To get started with pandas, you will need to get comfortable with its two workhorses: **Series** and **DataFrame**.

They provide a solid, easy-to-use basis for most applications.

Most objects returned by pandas operations are either a **Series** or a **DataFrame**.

**DataFrames** and **Series** are not simply storage containers. Pandas treats them similarly, and both have built-in support for a variety of data-wrangling operations, such as:

* Single-level and hierarchical indexing
* Handling missing data
* Arithmetic and Boolean operations on entire columns and tables
* Database-type operations (such as merging and aggregation)
* Plotting individual columns and whole tables
* Reading data from files and writing data to files
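As a rough illustration of a few of the operations listed above, here is a minimal sketch. The tables `visits` and `regions`, their column names, and all the numbers are invented for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for illustration only
visits = pd.DataFrame({
    "clinic": ["A", "B", "C"],
    "patients": [120, np.nan, 85],   # one missing value
})
regions = pd.DataFrame({
    "clinic": ["A", "B", "C"],
    "region": ["North", "South", "North"],
})

# Handling missing data: replace the missing count with 0
visits["patients"] = visits["patients"].fillna(0)

# Arithmetic on an entire column
visits["patients_per_day"] = visits["patients"] / 5

# Database-type operations: merge the two tables, then aggregate by region
merged = visits.merge(regions, on="clinic")
totals = merged.groupby("region")["patients"].sum()
print(totals)
```

Each step works on whole columns or tables at once, with no explicit loop.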

## 2.1. **Series**

A Series is a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its index.


You can create a simple Series from any sequence (**a list, a tuple, or a NumPy array**) or even from **a Python dictionary**.

#### From a List

**==> Of numbers**

In [None]:
s1 = pd.Series( [2 , 10 , 12]   )

In [None]:
s1

**==> Of strings**

In [None]:
s2 = pd.Series( ['Uganda' , 'Mali' , 'Chad' , 'Niger']   )

In [None]:
s2

**==> Of mixed types (stored with dtype `object`)**

In [None]:
s3 = pd.Series(  [ "£12" , 25 , "Banjul", "50km"  ])

In [None]:
s3

#### From a tuple


In [None]:
s4 = pd.Series( ("malaria" , "tuberculosis", "influenza") )

In [None]:
s4

#### From a NumPy array


In [None]:
#Let's import the numpy library first
import numpy as np

In [None]:
a1 = np.arange(0,10,2)
a1

In [None]:
s5 = pd.Series(a1)

In [None]:
s5

#### From a Python dictionary


In [None]:
d1 = {"Outbreak": 'Ebola' , 
      "City": 'Goma' , 
      "Country": 'DRC' ,
      "Continent": 'Africa'}

In [None]:
d1

In [None]:
s6 = pd.Series(d1)

In [None]:
s6

As you may have noticed, a column of labels always appears on the left when a Series is printed.

This is the index. By default it runs from 0 to n-1, where n is the length of the Series.

When a Series is built from a dictionary, the index is taken from the **keys** of the dictionary instead.

The **values** of the dictionary become the actual content of the Series.

You can verify this with the commands below:


In [None]:
s6.index

In [None]:
s6.values
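To compare the two kinds of index side by side, here is a small sketch. The names `from_list` and `from_dict` are chosen just for this example.

```python
import pandas as pd

# A Series built from a list gets the default integer index 0..n-1
from_list = pd.Series([2, 10, 12])

# A Series built from a dictionary uses the dictionary keys as its index
from_dict = pd.Series({"Outbreak": "Ebola", "City": "Goma"})

print(list(from_list.index))   # [0, 1, 2]
print(list(from_dict.index))   # ['Outbreak', 'City']
print(list(from_dict.values))  # ['Ebola', 'Goma']
```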

A Series can also be created with an explicit index:

In [None]:
# Information recorded from a patient during a survey
s7 = pd.Series( ["Traore" , "Senegalese" , "Single" , "Wolof"],
              index = ["Name" , "Nationality" , "Status" , "Language"])

In [None]:
s7

In [None]:
# The ages of the members of a family in Bouake, Ivory Coast,
# as recorded during a survey
s8 =  pd.Series([12 , 25 , 7 , 58 , 39],
               index = ["Yao" , "Kouassi" , "Senan", "Bony" , "Marguerite"])

In [None]:
s8

We can access each element directly by its label:

In [None]:
s8['Yao']

In [None]:
s8['Bony']
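Besides plain square brackets, pandas offers `.loc` for label-based access and `.iloc` for position-based access. The sketch below rebuilds the family Series under the name `ages` (a name chosen for this example) so that it runs on its own.

```python
import pandas as pd

ages = pd.Series([12, 25, 7, 58, 39],
                 index=["Yao", "Kouassi", "Senan", "Bony", "Marguerite"])

by_label = ages.loc["Bony"]           # access by index label
by_position = ages.iloc[3]            # access by integer position (0-based)
several = ages.loc[["Yao", "Senan"]]  # a list of labels returns a Series

print(by_label, by_position)          # 58 58
print(several)
```

Being explicit with `.loc` and `.iloc` avoids ambiguity when the index labels are themselves integers.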

We can check which people are **below the age of 20**:

In [None]:
s8 < 20

It returns Boolean values: True where the condition holds and False otherwise.

But what if we want the actual values?

In [None]:
s8[s8<20]
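If what we want is the count rather than the values, one option is to sum the Boolean mask, since True counts as 1 and False as 0. A minimal sketch, again using a standalone copy of the family Series named `ages` for this example:

```python
import pandas as pd

ages = pd.Series([12, 25, 7, 58, 39],
                 index=["Yao", "Kouassi", "Senan", "Bony", "Marguerite"])

mask = ages < 20                  # Boolean Series, one entry per person
n_under_20 = int(mask.sum())      # True counts as 1, False as 0
names = list(ages[mask].index)    # labels of the matching entries

print(n_under_20)                 # 2
print(names)                      # ['Yao', 'Senan']
```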

### Basic Statistics

**==> The sum of all the elements**

In [None]:
s8.sum()

**==> The average**

In [None]:
s8.mean()

**==> The lowest value**

In [None]:
s8.min()

**==> The largest value**

In [None]:
s8.max()

**==> The variance**

In [None]:
s8.var()

**==> The standard deviation**

In [None]:
s8.std()
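Two details are worth knowing about these statistics. First, `describe()` returns several of them in one call. Second, `var()` and `std()` in pandas default to the sample versions (`ddof=1`); passing `ddof=0` gives the population versions. The sketch below uses a standalone copy of the family Series named `ages` for this example.

```python
import pandas as pd

ages = pd.Series([12, 25, 7, 58, 39],
                 index=["Yao", "Kouassi", "Senan", "Bony", "Marguerite"])

# Count, mean, std, min, quartiles, and max in a single call
summary = ages.describe()
print(summary)

# Sample variance (default, divides by n - 1)
sample_var = ages.var()

# Population variance (divides by n)
population_var = ages.var(ddof=0)

print(sample_var, population_var)
```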

## 2.2. **DataFrame**

A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.).

It can also be viewed as a collection of Series that share the same index.
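To make the "collection of Series" view concrete, here is a small sketch that assembles a DataFrame from two Series. The names `heights`, `weights`, and `people` are chosen for this example.

```python
import pandas as pd

heights = pd.Series([125, 175], index=["Paul", "John"])
weights = pd.Series([68, 72], index=["Paul", "John"])

# Each Series becomes one column; rows are aligned on the index labels
people = pd.DataFrame({"Size(cm)": heights, "Weight(kg)": weights})

print(people)
print(type(people["Size(cm)"]))   # a single column is a Series
```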

There are numerous ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists or NumPy arrays:

In [None]:
# The temperature across African cities this summer
d2 = {"City": ['Tomboucto', 'Thies', 'Nouackshott', 'Niamey', 'Douala'],
      "Temperature(°C)": [29, 32, 27, 19, 35]}

In [None]:
df1 = pd.DataFrame(d2)

In [None]:
df1

In [None]:
# The birth rate at several hospitals in Cotonou, Benin, in a given week
d2 = {"Hospital": ["Hopital General" , "Institut Pasteur" , "Clinique Notre-Dame",
                   "Hopital Regional", "Hopital Jamot", "Liberty Clinic"],
    "Birth_Rate" :[ 0.4 , 0.25 , 0.98 , 0.18 , 0.62 , 0.16]
     }

In [None]:
df2   = pd.DataFrame(d2)

In [None]:
df2

To query the content of a DataFrame, we can select each column by name, which returns a Series:

In [None]:
df2['Hospital']

In [None]:
df2['Birth_Rate']
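The type of the result depends on how the selection is written: a single column name returns a Series, while a list of column names returns a DataFrame. A minimal sketch, using a shortened copy of the hospital table named `hospitals` for this example:

```python
import pandas as pd

hospitals = pd.DataFrame({
    "Hospital": ["Hopital General", "Institut Pasteur", "Clinique Notre-Dame"],
    "Birth_Rate": [0.4, 0.25, 0.98],
})

one_column = hospitals["Birth_Rate"]                   # Series
sub_table = hospitals[["Hospital", "Birth_Rate"]]      # DataFrame
single_col_table = hospitals[["Birth_Rate"]]           # still a DataFrame

print(type(one_column).__name__)         # Series
print(type(single_col_table).__name__)   # DataFrame
```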

We might decide to add another column, for instance whether or not there was light in each hospital that day:

In [None]:
df2['Light'] = ["Yes" , "Yes" , "No" , "Yes" , "No" , "No"]

In [None]:
df2
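A new column does not have to come from a hand-written list. It can also be a single value repeated on every row, or be computed from an existing column. The sketch below uses a shortened copy of the hospital table; the column names `City` and `Birth_Rate_pct` are hypothetical and chosen for this example.

```python
import pandas as pd

hospitals = pd.DataFrame({
    "Hospital": ["Hopital General", "Institut Pasteur"],
    "Birth_Rate": [0.4, 0.25],
})

# A scalar is repeated on every row
hospitals["City"] = "Cotonou"

# A column computed from an existing one, element by element
hospitals["Birth_Rate_pct"] = hospitals["Birth_Rate"] * 100

print(hospitals)
```

When a list is assigned instead, its length must match the number of rows, otherwise pandas raises a `ValueError`.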

We can get a subset of dataframe satisfying certain conditions.

#### Q: What are the hospitals where the birth rate is less than 0.50?

In [None]:
df2[df2['Birth_Rate'] < 0.5]

#### Q: What are the hospitals which had light on that day?

In [None]:
df2[df2['Light'] =="Yes"]
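Conditions can also be combined. In pandas this is done with `&` (and), `|` (or), and `~` (not), with each condition wrapped in parentheses; the Python keywords `and` and `or` do not work on Series. A minimal sketch, using a standalone copy of the hospital table named `hospitals` for this example:

```python
import pandas as pd

hospitals = pd.DataFrame({
    "Hospital": ["Hopital General", "Institut Pasteur", "Clinique Notre-Dame",
                 "Hopital Regional", "Hopital Jamot", "Liberty Clinic"],
    "Birth_Rate": [0.4, 0.25, 0.98, 0.18, 0.62, 0.16],
    "Light": ["Yes", "Yes", "No", "Yes", "No", "No"],
})

low_rate = hospitals["Birth_Rate"] < 0.5
had_light = hospitals["Light"] == "Yes"

# Hospitals with a birth rate below 0.50 AND light that day
low_and_lit = hospitals[low_rate & had_light]

# Hospitals with a birth rate of 0.50 or more OR no light that day
high_or_dark = hospitals[~low_rate | ~had_light]

print(low_and_lit["Hospital"].tolist())
print(high_or_dark["Hospital"].tolist())
```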

Another common way to build a DataFrame is from a nested dict of dicts. The outer keys become the column names and the inner keys become the row index.

In [None]:
data = {
    'Size(cm)' : {'Paul': 125 , 'John': 175 , 'Thomas': 186 , 'Julio':145 },
    'Weight(kg)' : {'Paul': 68 , 'John': 72 , 'Thomas': 102 , 'Julio' :98},
    'Age': {'Paul': 24 , 'John': 31 , 'Thomas': 18 , 'Julio' : 36}
        }

In [None]:
df3 = pd.DataFrame(data)

In [None]:
df3
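With this kind of DataFrame, individual cells can be addressed by row label and column name, and the table can be transposed if the rows and columns should be swapped. A short sketch, using a reduced copy of the data under the name `people` (chosen for this example):

```python
import pandas as pd

data = {
    "Size(cm)": {"Paul": 125, "John": 175},
    "Age": {"Paul": 24, "John": 31},
}
people = pd.DataFrame(data)

print(list(people.columns))   # outer keys: ['Size(cm)', 'Age']
print(sorted(people.index))   # inner keys: ['John', 'Paul']

# One cell, addressed by row label and column name
john_age = people.loc["John", "Age"]

# Transpose: people become columns, measurements become rows
flipped = people.T

print(john_age)               # 31
print(flipped)
```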