# Pandas -- Data Analysis Library
- Used extensively by data scientists and data analysts to perform EDA (Exploratory Data Analysis)
- EDA is about getting comfortable enough with the data that you can answer any question raised by your stakeholders or clients

In [1]:
import numpy as np
import pandas as pd

# Data Containers

## Series

A Series is a one-dimensional, labeled, array-like structure that can be visualized as **a single-column dataset**. It can hold values of any data type, such as integers, strings, or floats.

A Series can be created from several kinds of data input: an ndarray, a dict, a scalar, or a list. The sections below cover the list, ndarray, and dict cases.
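
The scalar input mentioned above can be sketched as follows. This is a minimal illustration; the name `scalar_series` and the labels are assumptions made for the example, not part of the original notebook.

```python
import pandas as pd

# A scalar is repeated once for every label in the supplied index
scalar_series = pd.Series(5, index=['a', 'b', 'c'])
print(scalar_series)
```

An index is needed here because the scalar alone does not say how many elements the Series should have.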

### List

A list is a basic Python data structure that can act as an input to create a Pandas Series. A list can hold values of multiple data types. So, if a dataset is already available as a list, pass it directly to pd.Series.

In [2]:
print (list('abcdef'))

['a', 'b', 'c', 'd', 'e', 'f']


In [3]:
# Pass list as an argument

first_series = pd.Series(list('abcdef'))
print (first_series)

0    a
1    b
2    c
3    d
4    e
5    f
dtype: object


***The output shows the index, the data values, and the data type.***

***We did not specify an index, so Pandas automatically assigned the default integer labels 0 to 5.***
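
The parts of a Series can be inspected individually, and the default labels can be replaced at construction time. The sketch below assumes the hypothetical name `labelled_series` and made-up label values.

```python
import pandas as pd

# Default case: pandas assigns the integer labels 0..n-1
first_series = pd.Series(list('abcdef'))
print(first_series.index)   # the automatically created index
print(first_series.values)  # the underlying data values
print(first_series.dtype)   # the data type of the elements

# Hypothetical custom labels supplied through the index argument
labelled_series = pd.Series(list('abcdef'), index=[10, 20, 30, 40, 50, 60])
print(labelled_series.loc[30])  # label-based lookup
```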

### ndarray
    
An ndarray can be used as an input to create a Pandas Series. Using an ndarray is recommended wherever the dataset is number-centric and requires complex numerical computing, although the example below uses an array of strings.

In [4]:
# ndarray for countries
np_countries = np.array(['Algeria','Angola','Argentina','Australia','Austria','Bahamas','Bangladesh','Belarus','Belgium',
                      'Bhutan','Brazil','Bulgaria','Cambodia','Cameroon','Chile','China','Colombia','Cyprus','Denmark'])
np_countries

# In the dtype '<U10' shown below, 'U' stands for Unicode string and 10 is the maximum length in characters

# Unicode is a standard encoding system that is used to represent characters from almost all languages.
# Every Unicode character is identified by a unique integer code point between 0 and 0x10FFFF.
# A Unicode string is a sequence of zero or more code points.

array(['Algeria', 'Angola', 'Argentina', 'Australia', 'Austria',
       'Bahamas', 'Bangladesh', 'Belarus', 'Belgium', 'Bhutan', 'Brazil',
       'Bulgaria', 'Cambodia', 'Cameroon', 'Chile', 'China', 'Colombia',
       'Cyprus', 'Denmark'], dtype='<U10')

In [5]:
# Pass ndarray as an argument

s_countries = pd.Series(np_countries)
print (s_countries)

0        Algeria
1         Angola
2      Argentina
3      Australia
4        Austria
5        Bahamas
6     Bangladesh
7        Belarus
8        Belgium
9         Bhutan
10        Brazil
11      Bulgaria
12      Cambodia
13      Cameroon
14         Chile
15         China
16      Colombia
17        Cyprus
18       Denmark
dtype: object
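
For the number-centric case described at the start of this section, a numeric ndarray keeps its dtype when it becomes a Series. The following is a small sketch with made-up values; `np_scores` and `s_scores` are hypothetical names.

```python
import numpy as np
import pandas as pd

np_scores = np.array([1.5, 2.0, 3.5])  # hypothetical numeric data
s_scores = pd.Series(np_scores)

print(s_scores.dtype)           # float64, carried over from the ndarray
print((s_scores * 2).tolist())  # arithmetic applies to every element
```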


### dict
A Pandas Series can also be created from a dictionary. The dictionary keys become the index labels, which makes it convenient for indexing or reindexing a dataset during data wrangling. A dict stores key-value pairs, so use it whenever the dataset is structured that way.

In [6]:
dictionary = {"A" : 20, "B" : 35, 'C': 100}
print (dictionary)

{'A': 20, 'B': 35, 'C': 100}


In [7]:
# Pass dictionary as an argument

series = pd.Series(dictionary)
print(series) 

A     20
B     35
C    100
dtype: int64
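
The reindexing mentioned above can be sketched by passing an explicit index together with the dictionary. The key 'D' is a hypothetical label that is deliberately missing from the dictionary.

```python
import pandas as pd

dictionary = {"A": 20, "B": 35, "C": 100}

# Select and reorder by label; a label with no matching key gets NaN
reordered = pd.Series(dictionary, index=["C", "A", "D"])
print(reordered)
```

Because NaN is a floating-point value, the resulting Series holds floats rather than integers.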


### Input values in pd.Series

In [8]:
# Series for countries and their gdp

country_gdp = pd.Series([2255.225482,629.9553062,11601.63022,25306.82494,27266.40335,19466.99052,588.3691778,2890.345675,
                           24733.62696,1445.760002,4803.398244,2618.876037,590.4521124,665.7982328,7122.938458,2639.54156,
                           3362.4656,15378.16704,30860.12808], 
                        index = ['Algeria','Angola','Argentina','Australia','Austria','Bahamas','Bangladesh','Belarus',
                                      'Belgium','Bhutan','Brazil','Bulgaria','Cambodia','Cameroon','Chile','China','Colombia',
                                      'Cyprus','Denmark'])
print (country_gdp)

Algeria        2255.225482
Angola          629.955306
Argentina     11601.630220
Australia     25306.824940
Austria       27266.403350
Bahamas       19466.990520
Bangladesh      588.369178
Belarus        2890.345675
Belgium       24733.626960
Bhutan         1445.760002
Brazil         4803.398244
Bulgaria       2618.876037
Cambodia        590.452112
Cameroon        665.798233
Chile          7122.938458
China          2639.541560
Colombia       3362.465600
Cyprus        15378.167040
Denmark       30860.128080
dtype: float64
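
Once a Series has meaningful labels, values can be selected by label or by condition. The sketch below uses a three-country subset of the values above to stay short; the name `gdp` and the threshold 3000 are assumptions for the example.

```python
import pandas as pd

# A three-country subset of the GDP values, to keep the sketch short
gdp = pd.Series([4803.398244, 7122.938458, 2639.54156],
                index=['Brazil', 'Chile', 'China'])

print(gdp['Chile'])              # single value by label
print(gdp[['Brazil', 'China']])  # several labels at once
print(gdp[gdp > 3000])           # boolean filter keeps Brazil and Chile
```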


### Vectorized operations

Vectorized operations apply an arithmetic operation, such as addition, to two or more Series element by element, without an explicit loop. The elements are paired by their index labels, not by their positions.

The first example adds the two Series ‘first_vector_series’ and ‘second_vector_series’, which share the same index labels.

In [9]:
first_vector_series = pd.Series([1,2,3,4], index = ['a','b','c','d']) 
second_vector_series = pd.Series([10,20,30,40], index = ['a','b','c','d'])

print (first_vector_series)
print ()
print (second_vector_series)

a    1
b    2
c    3
d    4
dtype: int64

a    10
b    20
c    30
d    40
dtype: int64


In [10]:
print (first_vector_series + second_vector_series)

a    11
b    22
c    33
d    44
dtype: int64


Let’s **shuffle the indices** and see what happens. In the second Series, the index labels are now ordered c, a, d, b. When we add the two Series, we get a different output because each data element is bound to its index label, not to its position.

In [11]:
first_vector_series = pd.Series([1,2,3,4], index = ['a','b','c','d']) 
second_vector_series = pd.Series([10,20,30,40], index = ['c','a','d','b'])

print (first_vector_series)
print ()
print (second_vector_series)

a    1
b    2
c    3
d    4
dtype: int64

c    10
a    20
d    30
b    40
dtype: int64


In [12]:
print (first_vector_series + second_vector_series)

a    21
b    42
c    13
d    34
dtype: int64


***Wherever an index label is present in only one of the Series, no addition takes place and the result holds NaN (Not a Number).***

In [13]:
first_vector_series = pd.Series([1,2,3,4], index = ['a','b','c','d']) 
second_vector_series = pd.Series([10.0,20,30,40], index = ['a','b','e','f'])

print (first_vector_series)
print ()
print (second_vector_series)

a    1
b    2
c    3
d    4
dtype: int64

a    10.0
b    20.0
e    30.0
f    40.0
dtype: float64


In [14]:
print (first_vector_series + second_vector_series)

a    11.0
b    22.0
c     NaN
d     NaN
e     NaN
f     NaN
dtype: float64
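
If NaN is not the desired result for unmatched labels, the `add` method with its `fill_value` parameter is one option. The sketch below reuses the two Series from the last example; treating a missing value as 0 is an assumption that suits addition but may not suit every dataset.

```python
import pandas as pd

first_vector_series = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
second_vector_series = pd.Series([10.0, 20, 30, 40], index=['a', 'b', 'e', 'f'])

# Treat a label missing from one side as 0 instead of producing NaN
filled_sum = first_vector_series.add(second_vector_series, fill_value=0)
print(filled_sum)
```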


## Dataframes

The DataFrame is the core data abstraction in Pandas.

A DataFrame is a two-dimensional labeled data structure with columns of potentially different data types.

A DataFrame looks like a spreadsheet or a SQL table, with rows and columns.

**Tabular data that you load or initialize using Pandas is represented as a DataFrame**

A DataFrame accepts several kinds of input, which we’ll go through in turn.

- To create a DataFrame, you can use either of the following two approaches
 1. Create DF using collection object
 2. Create DF by loading a file
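
The second approach, loading a file, can be sketched with `pd.read_csv`. To keep the example self-contained, the CSV content is held in a string and wrapped in `io.StringIO`; the content and the name `emp_from_file` are hypothetical, and in practice a file path would be passed instead.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = "eid,ename,esal\n1,Prashant,1000\n2,Arun,2000\n"

# read_csv uses the first line as column names by default
emp_from_file = pd.read_csv(io.StringIO(csv_text))
print(emp_from_file)
```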

### List

In [15]:
list1 = [[1,'Prashant',1000], [2,'Arun',2000]]
list1

[[1, 'Prashant', 1000], [2, 'Arun', 2000]]

In [16]:
empDataFrameFromList = pd.DataFrame(list1)
empDataFrameFromList

   0         1     2
0  1  Prashant  1000
1  2      Arun  2000


In [17]:
# DataFrame data is represented using Row and Column indexes
# You can replace Column indexes with column names

empDataFrameFromList.columns = ['eid', 'ename', 'esal'] # set the column names
print (empDataFrameFromList)

display (empDataFrameFromList)

empDataFrameFromList

   eid     ename  esal
0    1  Prashant  1000
1    2      Arun  2000


   eid     ename  esal
0    1  Prashant  1000
1    2      Arun  2000


   eid     ename  esal
0    1  Prashant  1000
1    2      Arun  2000


In [18]:
empDataFrameFromList.columns # get the column names

Index(['eid', 'ename', 'esal'], dtype='object')
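
The column names can also be supplied when the DataFrame is created, through the `columns` parameter of `pd.DataFrame`. A minimal sketch, with `emp_named` as a hypothetical variable name:

```python
import pandas as pd

list1 = [[1, 'Prashant', 1000], [2, 'Arun', 2000]]

# Name the columns at construction time instead of assigning .columns afterwards
emp_named = pd.DataFrame(list1, columns=['eid', 'ename', 'esal'])
print(emp_named)
```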

# Inclass Assignment

# 1


Use the list below to create the DataFrame "empDataFrameFromList2" and set the column names to 'empid', 'ename', 'esal'.

list2 = [[1,'Prashant',5000], [2,'Arun',8000], [3,'Aman',9899]]

### dict

A Pandas DataFrame can also be created using a ***dictionary of lists***. Each key becomes a column name, and each list supplies the values of that column.

In this example, we will create a dataset related to the Summer Olympics.

First, declare a dict ‘olympic_data’ whose keys are ‘HostCity’, ‘Year’, and ‘No of Participating Countries’, each mapped to a list of data elements.

Next, pass this dict to ‘pd.DataFrame’ to create a basic DataFrame. As you can observe in the output, it is a tabular representation of the data with rows and columns.

Note that data alignment is taken care of automatically. When we display the DataFrame ‘df_olympic_data’, the output shows all the rows with their corresponding indices.

In [19]:
olympic_data = {'HostCity':['London', 'Beijing', 'Athens', 'Sydney', 'Atlanta'], 
                'Year': [2012, 2008, 2004, 2000, 1996],
                'No of Participating Countries': [205, 205, 201, 200, 197]}

print (type(olympic_data))
print ()
olympic_data

<class 'dict'>



{'HostCity': ['London', 'Beijing', 'Athens', 'Sydney', 'Atlanta'],
 'Year': [2012, 2008, 2004, 2000, 1996],
 'No of Participating Countries': [205, 205, 201, 200, 197]}

In [20]:
# dictionary of list as an argument to pd.DataFrame

df_olympic_data = pd.DataFrame(olympic_data)
df_olympic_data

   HostCity  Year  No of Participating Countries
0    London  2012                            205
1   Beijing  2008                            205
2    Athens  2004                            201
3    Sydney  2000                            200
4   Atlanta  1996                            197


### Series

A Series can also be an input to a DataFrame.

Let’s learn how to create a DataFrame from Series.

Let’s create two Series first. The first, ‘olympic_series_participation’, holds the number of countries participating in each year. The second, ‘olympic_series_countries’, holds the cities that hosted the Olympics in those years. Both use the year as the index.
Now, create a DataFrame ‘df_olympic_series’ by passing both Series as the values of a dict, whose keys become the column names.

In [21]:
olympic_series_participation = pd.Series([205,205,201,200,197], index = [2012,2008,2004,2000,1996])
olympic_series_countries = pd.Series(['London', 'Beijing', 'Athens', 'Sydney', 'Atlanta'], index = [2012,2008,2004,2000,1996])

In [22]:
print (olympic_series_participation)
print ()
print (olympic_series_countries)

2012    205
2008    205
2004    201
2000    200
1996    197
dtype: int64

2012     London
2008    Beijing
2004     Athens
2000     Sydney
1996    Atlanta
dtype: object


In [23]:
# dictionary of Series

df_olympic_series = pd.DataFrame({'No. of Participating Countries': olympic_series_participation,
                                  'HostCity': olympic_series_countries})
df_olympic_series

      No. of Participating Countries  HostCity
2012                             205    London
2008                             205   Beijing
2004                             201    Athens
2000                             200    Sydney
1996                             197   Atlanta
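
When the Series passed to a DataFrame do not share exactly the same index, the rows are aligned by label, just as in Series addition. The sketch below uses shortened, deliberately mismatched versions of the two Series; the names `participation`, `host_city`, and `df_partial` are hypothetical.

```python
import pandas as pd

# 2004 appears only in the first Series, 2000 only in the second
participation = pd.Series([205, 205, 201], index=[2012, 2008, 2004])
host_city = pd.Series(['London', 'Beijing', 'Sydney'], index=[2012, 2008, 2000])

# The row index is the union of both indexes; missing cells become NaN
df_partial = pd.DataFrame({'No. of Participating Countries': participation,
                           'HostCity': host_city})
print(df_partial)
```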


### ndarray
    
An ndarray can be used as an input to create a Pandas DataFrame. Using ndarrays is recommended wherever the dataset is number-centric and requires complex numerical computing.


In [24]:
# Create ndarrays 

olympic_array_year = np.array([2012,2008,2004,2000,1996]) # array
olympic_array_participation = np.array([205,205,201,200,197])
olympic_array_countries = np.array(['London', 'Beijing', 'Athens', 'Sydney', 'Atlanta'])

In [25]:
# Create a DataFrame from a dict of ndarrays

df_olympic_array = pd.DataFrame({'No of Participating Countries': olympic_array_participation, 
                                 'HostCity': olympic_array_countries, 'Year' : olympic_array_year}) # dictionary of array
df_olympic_array

   No of Participating Countries  HostCity  Year
0                            205    London  2012
1                            205   Beijing  2008
2                            201    Athens  2004
3                            200    Sydney  2000
4                            197   Atlanta  1996


In [26]:
df_olympic_array.index = ['A', 'B', 'C', 'D', 'E'] # set the row index
df_olympic_array.index # get the row index

Index(['A', 'B', 'C', 'D', 'E'], dtype='object')

In [27]:
df_olympic_array

   No of Participating Countries  HostCity  Year
A                            205    London  2012
B                            205   Beijing  2008
C                            201    Athens  2004
D                            200    Sydney  2000
E                            197   Atlanta  1996
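
A single two-dimensional ndarray can also be passed directly, with each inner row becoming a DataFrame row. This is a sketch with a hypothetical array `olympic_matrix` and hypothetical column names; it suits data where all columns share one numeric type.

```python
import numpy as np
import pandas as pd

# Hypothetical 2-D array: one row per Games, columns are year and participation
olympic_matrix = np.array([[2012, 205], [2008, 205], [2004, 201]])

df_matrix = pd.DataFrame(olympic_matrix, columns=['Year', 'Participation'])
print(df_matrix)
```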


## Accessing columns in a DataFrame

In [28]:
df_olympic_data

   HostCity  Year  No of Participating Countries
0    London  2012                            205
1   Beijing  2008                            205
2    Athens  2004                            201
3    Sydney  2000                            200
4   Atlanta  1996                            197


In [29]:
df_olympic_data.HostCity # used for working with a single column

0     London
1    Beijing
2     Athens
3     Sydney
4    Atlanta
Name: HostCity, dtype: object

In [30]:
df_olympic_data[["HostCity", "Year"]] # used for accessing multiple columns

   HostCity  Year
0    London  2012
1   Beijing  2008
2    Athens  2004
3    Sydney  2000
4   Atlanta  1996


In [31]:
df_olympic_data.No of Participating Countries # fails: attribute access does not work for column names containing spaces

SyntaxError: invalid syntax (<ipython-input-31-3c17888e4c21>, line 1)

In [32]:
df_olympic_data[["No of Participating Countries"]]

   No of Participating Countries
0                            205
1                            205
2                            201
3                            200
4                            197
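
Bracket access works for any column name, including names with spaces, and the number of brackets decides the type of the result. The sketch below uses a shortened, hypothetical DataFrame `df` to show the difference.

```python
import pandas as pd

df = pd.DataFrame({'HostCity': ['London', 'Beijing'],
                   'No of Participating Countries': [205, 205]})

as_series = df['No of Participating Countries']   # single brackets -> Series
as_frame = df[['No of Participating Countries']]  # double brackets -> DataFrame

print(type(as_series).__name__, type(as_frame).__name__)
```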


In [35]:
# Replace spaces with underscores in the column names

df_olympic_data.columns = df_olympic_data.columns.str.replace(' ','_')
df_olympic_data.columns

Index(['HostCity', 'Year', 'No_of_Participating_Countries'], dtype='object')

In [36]:
df_olympic_data.No_of_Participating_Countries

0    205
1    205
2    201
3    200
4    197
Name: No_of_Participating_Countries, dtype: int64

## Data Operation with Statistical Functions

In [37]:
df_test_scores = pd.DataFrame({'Test1': [95,84,73,88,82,61], 'Test2': [74,85,82,73,77,79]}, 
                              index = ['Jack','Lewis','Patrick','Rich','Kelly','Paula'])

display (df_test_scores)

         Test1  Test2
Jack        95     74
Lewis       84     85
Patrick     73     82
Rich        88     73
Kelly       82     77
Paula       61     79


In [38]:
df_test_scores.shape

(6, 2)

In [39]:
df_test_scores.max() # default axis = 0: maximum of each column

Test1    95
Test2    85
dtype: int64

In [40]:
df_test_scores.max(axis = 1)  # axis = 1: maximum across the columns of each row

Jack       95
Lewis      85
Patrick    82
Rich       88
Kelly      82
Paula      79
dtype: int64

In [41]:
df_test_scores.mean()

Test1    80.500000
Test2    78.333333
dtype: float64

In [46]:
df_test_scores.Test1.mean()

80.5

In [42]:
df_test_scores.mean(axis = 1)

Jack       84.5
Lewis      84.5
Patrick    77.5
Rich       80.5
Kelly      79.5
Paula      70.0
dtype: float64
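
Beyond `max` and `mean`, a few other built-in methods answer common questions about the same data. The sketch below recreates the test-score DataFrame so that it runs on its own; the choice of `idxmax`, `agg`, and `describe` is illustrative rather than exhaustive.

```python
import pandas as pd

df_test_scores = pd.DataFrame({'Test1': [95, 84, 73, 88, 82, 61],
                               'Test2': [74, 85, 82, 73, 77, 79]},
                              index=['Jack', 'Lewis', 'Patrick', 'Rich', 'Kelly', 'Paula'])

print(df_test_scores.idxmax())             # label of the top scorer in each test
print(df_test_scores.agg(['min', 'max']))  # several statistics in one call
print(df_test_scores.describe())           # count, mean, std, min, quartiles, max
```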

## Creating a new column 

In [43]:
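# Attribute assignment does not create a new column; Pandas only emits a UserWarning
df_test_scores.Total_Scores = df_test_scores.Test1 + df_test_scores.Test2
df_test_scores

UserWarning: Pandas doesn't allow columns to be created via a new attribute name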


         Test1  Test2
Jack        95     74
Lewis       84     85
Patrick     73     82
Rich        88     73
Kelly       82     77
Paula       61     79


In [44]:
df_test_scores[["Total_Scores"]] = df_test_scores.Test1 + df_test_scores.Test2
df_test_scores # list-style key; noted as deprecated in newer Pandas versions, so prefer the single-label form

         Test1  Test2  Total_Scores
Jack        95     74           169
Lewis       84     85           169
Patrick     73     82           155
Rich        88     73           161
Kelly       82     77           159
Paula       61     79           140


In [45]:
df_test_scores['Total_Scores'] = df_test_scores.Test1 + df_test_scores.Test2
df_test_scores

         Test1  Test2  Total_Scores
Jack        95     74           169
Lewis       84     85           169
Patrick     73     82           155
Rich        88     73           161
Kelly       82     77           159
Paula       61     79           140
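
Another way to add a column is the `assign` method, which returns a new DataFrame and leaves the original unchanged. The sketch below recreates the test-score data so that it runs on its own; the column name `Average` and the variable `df_with_avg` are hypothetical.

```python
import pandas as pd

df_test_scores = pd.DataFrame({'Test1': [95, 84, 73, 88, 82, 61],
                               'Test2': [74, 85, 82, 73, 77, 79]},
                              index=['Jack', 'Lewis', 'Patrick', 'Rich', 'Kelly', 'Paula'])

# assign returns a copy with the extra column; df_test_scores itself is not modified
df_with_avg = df_test_scores.assign(
    Average=(df_test_scores.Test1 + df_test_scores.Test2) / 2)
print(df_with_avg)
```

This can be convenient when the original DataFrame should stay as it is, for example when several derived versions are compared side by side.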
