# Data Science Boot Camp

## Introduction to Pandas Part 1

* __Pandas__ is a Python package providing fast, flexible, and expressive data structures designed to work with both *relational* and *labeled* data.<br>
<br>
* It is a fundamental high-level building block for doing practical, real world data analysis in Python.<br>
<br>
* Python has always been great for prepping and munging data, but it's never been great for analysis - you'd usually end up using R or loading it into a database and using SQL. Pandas makes Python great for analysis.<br>

* Pandas is well suited for:<br>
<br>
    * Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet<br>
    <br>
    * Ordered and unordered (not necessarily fixed-frequency) time series data.<br>
    <br>
    * Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels<br>
    <br>
    * Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure<br>



* Key features of Pandas:<br>
<br>
    * Easy handling of __missing data__<br>
<br>
    * __Size mutability__: columns can be inserted and deleted from DataFrame and higher dimensional objects.<br>
<br>
    * Automatic and explicit __data alignment__: objects can be explicitly aligned to a set of labels, or the data can be aligned automatically.<br>
<br>
    * __Fast__ and __efficient__ DataFrame object with default and customized indexing.<br>
<br>
    * __Reshaping__ and __pivoting__ of data sets.<br>

* Key features of Pandas (Continued):<br>
<br>
    * Label-based __slicing__, __indexing__, __fancy indexing__ and __subsetting__ of large data sets.<br>
<br>
    * __Group by__ data for aggregation and transformations.<br>
<br>
    * High performance __merging__ and __joining__ of data.<br>
<br>
    * __IO Tools__ for loading data into in-memory data objects from different file formats.<br>
<br>
    * __Time Series__ functionality.<br>

* First, we import the pandas and NumPy libraries under the aliases `pd` and `np`.<br>
<br>
* Then we check our pandas version.<br>

In [1]:
%matplotlib inline

import pandas as pd
import numpy as np

print(pd.__version__)

0.22.0


* Let's set some options for `Pandas`

In [2]:
pd.set_option('display.notebook_repr_html', False)
# Fully qualified option names are unambiguous across Pandas versions
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_rows', 10)

## Pandas Objects

* At the very basic level, Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices.<br>
<br>
* There are three fundamental Pandas data structures: the Series, DataFrame, and Index.

### Series

* A __Series__ is a single vector of data (like a NumPy array) with an *index* that labels each element in the vector.<br><br>
* It can be created from a list or array as follows:

In [3]:
counts = pd.Series([15029231, 7529491, 7499740, 5445026, 2702492, 2742534, 4279677, 2133548, 2146129])
counts

0    15029231
1     7529491
2     7499740
3     5445026
4     2702492
5     2742534
6     4279677
7     2133548
8     2146129
dtype: int64

* If an index is not specified, a default sequence of integers is assigned as the index. A NumPy array comprises the values of the `Series`, while the index is a pandas `Index` object.

In [4]:
counts.values

array([15029231,  7529491,  7499740,  5445026,  2702492,  2742534,
        4279677,  2133548,  2146129])

In [5]:
counts.index

RangeIndex(start=0, stop=9, step=1)

* We can assign meaningful labels to the index, if they are available:

In [6]:
population = pd.Series([15029231, 7529491, 7499740, 5445026, 2702492, 2742534, 4279677, 2133548, 2146129], 
    index=['Istanbul Total', 'Istanbul Males', 'Istanbul Females', 'Ankara Total', 'Ankara Males', 'Ankara Females', 'Izmir Total', 'Izmir Males', 'Izmir Females'])
population

Istanbul Total      15029231
Istanbul Males       7529491
Istanbul Females     7499740
Ankara Total         5445026
Ankara Males         2702492
Ankara Females       2742534
Izmir Total          4279677
Izmir Males          2133548
Izmir Females        2146129
dtype: int64

* These labels can be used to refer to the values in the `Series`.

In [7]:
population['Istanbul Total']

15029231

In [8]:
mask = [city.endswith('Females') for city in population.index]
mask 

[False, False, True, False, False, True, False, False, True]

In [9]:
population[mask]

Istanbul Females    7499740
Ankara Females      2742534
Izmir Females       2146129
dtype: int64
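The list-comprehension mask above can also be built with the vectorized string methods on the index. The following is a minimal sketch on a shortened stand-in for the `population` Series; the name `female_mask` is only illustrative.

```python
import pandas as pd

# Shortened stand-in for the `population` Series used above.
population = pd.Series(
    [15029231, 7529491, 7499740, 5445026, 2702492, 2742534],
    index=['Istanbul Total', 'Istanbul Males', 'Istanbul Females',
           'Ankara Total', 'Ankara Males', 'Ankara Females'],
)

# Vectorized string test on the index labels; gives one boolean per label.
female_mask = population.index.str.endswith('Females')

# Boolean indexing keeps only the entries where the mask is True.
females = population[female_mask]
print(females)
```

Both approaches produce the same selection; the `.str` accessor simply avoids the explicit Python loop.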

* As you have seen, we can apply boolean masks to a `Series`.<br>
<br>
* We can also still use positional indexing, if we wish, even after assigning meaningful labels to the index.<br>

In [10]:
population[0]

15029231
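When labels and positions could be confused, the `loc` (label-based) and `iloc` (position-based) indexers make the intent explicit. A small sketch, using a three-element stand-in for the Series above:

```python
import pandas as pd

population = pd.Series(
    [15029231, 5445026, 4279677],
    index=['Istanbul Total', 'Ankara Total', 'Izmir Total'],
)

first_by_position = population.iloc[0]             # position 0
first_by_label = population.loc['Istanbul Total']  # label lookup

# Positional slices exclude the stop, label slices include it.
top_two = population.iloc[:2]
through_ankara = population.loc[:'Ankara Total']

print(first_by_position, first_by_label)
print(top_two.equals(through_ankara))
```

Note the slicing difference: `iloc[:2]` stops before position 2, while `loc[:'Ankara Total']` includes the `'Ankara Total'` entry.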

* We can give both the array of values and the index meaningful labels themselves:<br>

In [11]:
population.name = 'population'
population.index.name = 'city'
population

city
Istanbul Total      15029231
Istanbul Males       7529491
Istanbul Females     7499740
Ankara Total         5445026
Ankara Males         2702492
Ankara Females       2742534
Izmir Total          4279677
Izmir Males          2133548
Izmir Females        2146129
Name: population, dtype: int64

* Also, NumPy's math functions and other operations can be applied to Series without losing the data structure.<br>

In [12]:
np.ceil(population / 1000000) * 1000000

city
Istanbul Total      16000000.0
Istanbul Males       8000000.0
Istanbul Females     8000000.0
Ankara Total         6000000.0
Ankara Males         3000000.0
Ankara Females       3000000.0
Izmir Total          5000000.0
Izmir Males          3000000.0
Izmir Females        3000000.0
Name: population, dtype: float64

* We can also filter according to the values in the `Series`, just as with NumPy arrays:

In [13]:
population[population>3000000]

city
Istanbul Total      15029231
Istanbul Males       7529491
Istanbul Females     7499740
Ankara Total         5445026
Izmir Total          4279677
Name: population, dtype: int64

* A `Series` can be thought of as an ordered key-value store. In fact, we can create one from a `dict`:

In [14]:
populationDict = {'Istanbul Total': 15029231, 'Ankara Total': 5445026, 'Izmir Total': 4279677}
pd.Series(populationDict)

Ankara Total       5445026
Istanbul Total    15029231
Izmir Total        4279677
dtype: int64

* Notice that the `Series` is created in key-sorted order. (Newer Pandas versions running on Python 3.6+ keep the dict's insertion order instead.)<br>
<br>
* If we pass a custom index to `Series`, it will select the corresponding values from the dict, and treat indices without corresponding values as missing. Pandas uses the `NaN` (not a number) value to mark missing entries.<br>

In [15]:
population2 = pd.Series(populationDict, index=['Istanbul Total','Ankara Total','Izmir Total','Bursa Total', 'Antalya Total'])
population2

Istanbul Total    15029231.0
Ankara Total       5445026.0
Izmir Total        4279677.0
Bursa Total              NaN
Antalya Total            NaN
dtype: float64

In [16]:
population2.isnull()

Istanbul Total    False
Ankara Total      False
Izmir Total       False
Bursa Total        True
Antalya Total      True
dtype: bool
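Once missing values are flagged, they can be counted, dropped, or replaced. The sketch below rebuilds `population2` from the cell above; replacing missing values with `0` is only an illustration, not a recommendation for real population data.

```python
import pandas as pd

population_dict = {'Istanbul Total': 15029231,
                   'Ankara Total': 5445026,
                   'Izmir Total': 4279677}
population2 = pd.Series(
    population_dict,
    index=['Istanbul Total', 'Ankara Total', 'Izmir Total',
           'Bursa Total', 'Antalya Total'],
)

missing_count = population2.isnull().sum()  # number of NaN entries
known_only = population2.dropna()           # drop the missing entries
filled = population2.fillna(0)              # or replace them with a value

print(missing_count)
print(known_only)
print(filled)
```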

* Critically, the labels are used to **align data** when used in operations with other Series objects:

In [17]:
population + population2

Ankara Females           NaN
Ankara Males             NaN
Ankara Total      10890052.0
Antalya Total            NaN
Bursa Total              NaN
                     ...    
Istanbul Males           NaN
Istanbul Total    30058462.0
Izmir Females            NaN
Izmir Males              NaN
Izmir Total        8559354.0
Length: 11, dtype: float64

* Contrast this with NumPy arrays, where arrays of the same length combine values element-wise; adding Series combines values that share the same label. Notice also that the missing values were propagated by addition.
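If labels present in only one operand should not turn into `NaN`, the `add` method accepts a `fill_value`. A minimal sketch with two small, made-up Series (the labels `x`, `y`, `z`, `w` are purely illustrative):

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20], index=['y', 'w'])

# Plain addition: labels found in only one Series give NaN.
plain = a + b

# add() with fill_value treats a label missing on one side as 0.
filled = a.add(b, fill_value=0)

print(plain)
print(filled)
```

Only `y` appears in both Series, so `plain` has three missing values, while `filled` has none.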

### DataFrame

* A `DataFrame` represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.).<br>
<br>
* `DataFrame` has both a row and column index; it can be thought of as a dict of Series, all sharing the same index.

In [18]:
areaDict = {'Istanbul': 5461, 'Ankara': 25632, 'Izmir': 11891,
            'Bursa': 10813, 'Antalya': 20177}
area = pd.Series(areaDict)
area

Ankara      25632
Antalya     20177
Bursa       10813
Istanbul     5461
Izmir       11891
dtype: int64

In [19]:
populationDict = {'Istanbul': 15029231, 'Ankara': 5445026, 'Izmir': 4279677, 'Bursa': 2936803, 'Antalya': 2364396}
population3 = pd.Series(populationDict)
population3

Ankara       5445026
Antalya      2364396
Bursa        2936803
Istanbul    15029231
Izmir        4279677
dtype: int64

* Now that we have two Series, population by city and area by city, we can use a dictionary to construct a single two-dimensional object containing this information:

In [20]:
cities = pd.DataFrame({'population': population3, 'area': area})
cities

           area  population
Ankara    25632     5445026
Antalya   20177     2364396
Bursa     10813     2936803
Istanbul   5461    15029231
Izmir     11891     4279677

* Or we can create our cities `DataFrame` from lists, letting Pandas assign a default integer index.

In [21]:
cities = pd.DataFrame({
    'population':[15029231, 5445026, 4279677, 2936803, 2364396],
    'area':[5461, 25632, 11891, 10813, 20177],
    'city':['Istanbul', 'Ankara', 'Izmir', 'Bursa', 'Antalya']
    })
cities

    area      city  population
0   5461  Istanbul    15029231
1  25632    Ankara     5445026
2  11891     Izmir     4279677
3  10813     Bursa     2936803
4  20177   Antalya     2364396

* Notice that the columns of the `DataFrame` are sorted by name. We can change the order by indexing them in the order we desire:

In [22]:
cities[['city','area', 'population']]

       city   area  population
0  Istanbul   5461    15029231
1    Ankara  25632     5445026
2     Izmir  11891     4279677
3     Bursa  10813     2936803
4   Antalya  20177     2364396
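The column order can also be fixed at construction time through the `columns` argument of the `DataFrame` constructor. A short sketch using the same data as above:

```python
import pandas as pd

data = {
    'population': [15029231, 5445026, 4279677, 2936803, 2364396],
    'area': [5461, 25632, 11891, 10813, 20177],
    'city': ['Istanbul', 'Ankara', 'Izmir', 'Bursa', 'Antalya'],
}

# `columns` selects and orders the dict keys used as columns.
cities = pd.DataFrame(data, columns=['city', 'area', 'population'])
print(cities)
```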

* A `DataFrame` has a second index, representing the columns:

In [23]:
cities.columns

Index([u'area', u'city', u'population'], dtype='object')

* If we wish to access columns, we can do so either by dictionary-like indexing or by attribute:

In [24]:
cities['area']

0     5461
1    25632
2    11891
3    10813
4    20177
Name: area, dtype: int64

In [25]:
cities.area

0     5461
1    25632
2    11891
3    10813
4    20177
Name: area, dtype: int64

In [26]:
type(cities.area)

pandas.core.series.Series

In [27]:
type(cities[['area']])

pandas.core.frame.DataFrame

* Notice this is different from `Series`, where dictionary-like indexing retrieved a particular element (row). If we want access to a row in a `DataFrame` by position, we index its `iloc` attribute.


In [28]:
cities.iloc[2]

area            11891
city            Izmir
population    4279677
Name: 2, dtype: object

In [29]:
cities.iloc[0:2]

    area      city  population
0   5461  Istanbul    15029231
1  25632    Ankara     5445026
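`iloc` selects by position; its counterpart `loc` selects by label, which matters once the index holds something other than the default integers. A small sketch, assuming a frame indexed by city name:

```python
import pandas as pd

cities = pd.DataFrame(
    {'area': [5461, 25632, 11891],
     'population': [15029231, 5445026, 4279677]},
    index=['Istanbul', 'Ankara', 'Izmir'],
)

row_by_label = cities.loc['Izmir']     # row whose index label is 'Izmir'
row_by_position = cities.iloc[2]       # third row, whatever its label
cell = cities.loc['Ankara', 'area']    # a single value: row label, column label

print(row_by_label)
print(cell)
```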

Alternatively, we can create a `DataFrame` with a dict of dicts:

In [30]:
cities = pd.DataFrame({
    0: {'city': 'Istanbul', 'area': 5461, 'population': 15029231},
    1: {'city': 'Ankara', 'area': 25632, 'population': 5445026},
    2: {'city': 'Izmir', 'area': 11891, 'population': 4279677},
    3: {'city': 'Bursa', 'area': 10813, 'population': 2936803},
    4: {'city': 'Antalya', 'area': 20177, 'population': 2364396},
   
})
cities

                   0        1        2        3        4
area            5461    25632    11891    10813    20177
city        Istanbul   Ankara    Izmir    Bursa  Antalya
population  15029231  5445026  4279677  2936803  2364396

* We probably want this transposed:

In [31]:
cities = cities.T
cities

    area      city population
0   5461  Istanbul   15029231
1  25632    Ankara    5445026
2  11891     Izmir    4279677
3  10813     Bursa    2936803
4  20177   Antalya    2364396
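One side effect of building the frame column-wise and transposing is that every column ends up with dtype `object`. As a hedged alternative, a list of per-row dicts gives the same layout with inferred numeric dtypes, and `infer_objects` can repair an already transposed frame. The variable names below are illustrative.

```python
import pandas as pd

records = [
    {'city': 'Istanbul', 'area': 5461, 'population': 15029231},
    {'city': 'Ankara', 'area': 25632, 'population': 5445026},
    {'city': 'Izmir', 'area': 11891, 'population': 4279677},
]

# One dict per row: no transpose needed, dtypes inferred per column.
from_records = pd.DataFrame(records)

# Dict of dicts keyed by row label, then transposed: columns become object.
transposed = pd.DataFrame(dict(enumerate(records))).T

# infer_objects() tries to convert object columns to more specific dtypes.
repaired = transposed.infer_objects()

print(from_records.dtypes)
print(transposed.dtypes)
print(repaired.dtypes)
```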

* It's important to note that, in the Pandas version used here, the Series returned when a DataFrame is indexed is merely a **view** on the DataFrame, and not a copy of the data itself. <br>
<br>
* So you must be cautious when manipulating this data, just as with NumPy arrays.<br>

In [32]:
areas = cities.area
areas

0     5461
1    25632
2    11891
3    10813
4    20177
Name: area, dtype: object

In [33]:
areas[3] = 0
areas

0     5461
1    25632
2    11891
3        0
4    20177
Name: area, dtype: object

In [34]:
cities

    area      city population
0   5461  Istanbul   15029231
1  25632    Ankara    5445026
2  11891     Izmir    4279677
3      0     Bursa    2936803
4  20177   Antalya    2364396

* This behavior is useful for large data sets, but to avoid it you can use the `copy` method.<br>

In [35]:
areas = cities.area.copy()
areas[3] = 10813
areas

0     5461
1    25632
2    11891
3    10813
4    20177
Name: area, dtype: object

In [36]:
cities

    area      city population
0   5461  Istanbul   15029231
1  25632    Ankara    5445026
2  11891     Izmir    4279677
3      0     Bursa    2936803
4  20177   Antalya    2364396

* We can create or modify columns by assignment:<br>

In [37]:
# .loc sets the cell directly; chained indexing (cities.area[3] = ...) may act on a copy
cities.loc[3, 'area'] = 10813
cities

    area      city population
0   5461  Istanbul   15029231
1  25632    Ankara    5445026
2  11891     Izmir    4279677
3  10813     Bursa    2936803
4  20177   Antalya    2364396

In [38]:
cities['year'] = 2017
cities

    area      city population  year
0   5461  Istanbul   15029231  2017
1  25632    Ankara    5445026  2017
2  11891     Izmir    4279677  2017
3  10813     Bursa    2936803  2017
4  20177   Antalya    2364396  2017

* But note that we cannot use attribute access to add a new column:<br>

In [39]:
cities.projection2020 = 20000000
cities

    area      city population  year
0   5461  Istanbul   15029231  2017
1  25632    Ankara    5445026  2017
2  11891     Izmir    4279677  2017
3  10813     Bursa    2936803  2017
4  20177   Antalya    2364396  2017

* Instead, it creates an ordinary Python attribute on the `DataFrame` object.<br>

In [40]:
cities.projection2020 

20000000
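The difference can be checked directly against the column index. A minimal sketch on a two-row stand-in frame (the column name `projection_2020` is hypothetical):

```python
import pandas as pd

cities = pd.DataFrame({'city': ['Istanbul', 'Ankara'],
                       'population': [15029231, 5445026]})

# Attribute assignment with a new name only sets a Python attribute.
cities.projection2020 = 20000000
is_column_after_attribute = 'projection2020' in cities.columns

# Bracket assignment creates a real column, broadcast to every row.
cities['projection_2020'] = 20000000
is_column_after_brackets = 'projection_2020' in cities.columns

print(is_column_after_attribute, is_column_after_brackets)
```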

* Specifying a `Series` as a new column causes its values to be added according to the `DataFrame`'s index:

In [41]:
populationIn2000 = pd.Series([11076840, 3889199, 3431204, 2150571, 1430539])
populationIn2000

0    11076840
1     3889199
2     3431204
3     2150571
4     1430539
dtype: int64

In [42]:
cities['population_2000'] = populationIn2000
cities

    area      city population  year  population_2000
0   5461  Istanbul   15029231  2017         11076840
1  25632    Ankara    5445026  2017          3889199
2  11891     Izmir    4279677  2017          3431204
3  10813     Bursa    2936803  2017          2150571
4  20177   Antalya    2364396  2017          1430539
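Because the assignment aligns on the index, the order of the `Series` does not matter, and labels it lacks become missing values. A sketch with a deliberately incomplete, shuffled `Series` (the name `partial` is illustrative):

```python
import pandas as pd

cities = pd.DataFrame({'city': ['Istanbul', 'Ankara', 'Izmir']})

# Values for rows 2 and 0 only, given out of order.
partial = pd.Series([3431204, 11076840], index=[2, 0])

# Each value lands on the row with the matching index label; row 1 gets NaN.
cities['population_2000'] = partial
print(cities)
```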

* Other Python data structures (ones without an index) need to be the same length as the `DataFrame`:

In [43]:
populationIn2007 = [12573836, 4466756, 3739353, 2439876]
cities['population_2007'] = populationIn2007

ValueError: Length of values does not match length of index

* We can use `del` to remove columns, in the same way `dict` entries can be removed:

In [44]:
cities

    area      city population  year  population_2000
0   5461  Istanbul   15029231  2017         11076840
1  25632    Ankara    5445026  2017          3889199
2  11891     Izmir    4279677  2017          3431204
3  10813     Bursa    2936803  2017          2150571
4  20177   Antalya    2364396  2017          1430539

In [45]:
del cities['population_2000']
cities

    area      city population  year
0   5461  Istanbul   15029231  2017
1  25632    Ankara    5445026  2017
2  11891     Izmir    4279677  2017
3  10813     Bursa    2936803  2017
4  20177   Antalya    2364396  2017

* We can extract the underlying data as a simple `ndarray` by accessing the `values` attribute:<br>

In [46]:
cities.values

array([[5461, 'Istanbul', 15029231, 2017],
       [25632, 'Ankara', 5445026, 2017],
       [11891, 'Izmir', 4279677, 2017],
       [10813, 'Bursa', 2936803, 2017],
       [20177, 'Antalya', 2364396, 2017]], dtype=object)

* Notice that because of the mix of string and integer (and possibly `NaN`) values, the dtype of the array is `object`.

* The dtype will automatically be chosen to be as general as needed to accommodate all the columns.

In [47]:
df = pd.DataFrame({'integers': [1,2,3], 'floatNumbers':[0.5, -1.25, 2.5]})
df

   floatNumbers  integers
0          0.50         1
1         -1.25         2
2          2.50         3

In [48]:
print(df.values.dtype)
df.values

float64


array([[ 0.5 ,  1.  ],
       [-1.25,  2.  ],
       [ 2.5 ,  3.  ]])
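The upcasting only happens when the columns are combined into one array; inside the `DataFrame` each column keeps its own dtype, which the `dtypes` attribute reports. A short sketch (the `labels` column is added here just to show the `object` case):

```python
import pandas as pd

df = pd.DataFrame({'integers': [1, 2, 3],
                   'floatNumbers': [0.5, -1.25, 2.5],
                   'labels': ['a', 'b', 'c']})

# Per-column dtypes are preserved inside the DataFrame.
print(df.dtypes)

# Combining int and float columns upcasts to float64.
numeric_block = df[['integers', 'floatNumbers']].values

# Adding a string column forces the most general dtype, object.
mixed_block = df.values

print(numeric_block.dtype, mixed_block.dtype)
```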

* Pandas uses a custom data structure to represent the indices of Series and DataFrames.

In [49]:
cities.index

Int64Index([0, 1, 2, 3, 4], dtype='int64')

* Index objects are immutable:

In [50]:
cities.index[0] = 15

TypeError: Index does not support mutable operations

* This is so that Index objects can be shared between data structures without fear that they will be changed.
* That means you can safely reuse your meaningful labels in other `DataFrames`:

In [51]:
cities

    area      city population  year
0   5461  Istanbul   15029231  2017
1  25632    Ankara    5445026  2017
2  11891     Izmir    4279677  2017
3  10813     Bursa    2936803  2017
4  20177   Antalya    2364396  2017

In [52]:
cities.index = population2.index
cities

                 area      city population  year
Istanbul Total   5461  Istanbul   15029231  2017
Ankara Total    25632    Ankara    5445026  2017
Izmir Total     11891     Izmir    4279677  2017
Bursa Total     10813     Bursa    2936803  2017
Antalya Total   20177   Antalya    2364396  2017
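Although a single element of an index cannot be changed in place, the index can be replaced as a whole, as above, or derived from a column with `set_index`. A minimal sketch on a three-row stand-in frame:

```python
import pandas as pd

cities = pd.DataFrame({'city': ['Istanbul', 'Ankara', 'Izmir'],
                       'area': [5461, 25632, 11891]})

# Element-wise mutation of an Index raises TypeError.
raised = False
try:
    cities.index[0] = 15
except TypeError as err:
    raised = True
    print('Index is immutable:', err)

# set_index returns a new frame that uses the `city` column as its index.
by_city = cities.set_index('city')
print(by_city.loc['Ankara', 'area'])
```

`set_index` returns a new `DataFrame` by default, so `cities` itself keeps its integer index.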

## Importing data

* A key, but often underappreciated, step in data analysis is importing the data that we wish to analyze.<br>
<br>
* Though it is easy to load basic data structures into Python using built-in tools or those provided by packages like NumPy, it is non-trivial to import structured data well, and to easily convert this input into a robust data structure.<br>
<br>
* Pandas provides a convenient set of functions for importing tabular data in a number of formats directly into a `DataFrame` object. 

* Let's start with some more population data, stored in csv format.

In [53]:
!cat data/population.csv

Provinces;2000;2001;2002;2003;2004;2005;2006;2007;2008;2009;2010;2011;2012;2013;2014;2015;2016;2017
Total;64729501;65603160;66401851;67187251;68010215;68860539;69729967;70586256;71517100;72561312;73722988;74724269;75627384;76667864;77695904;78741053;79814871;80810525
Adana;1879695;1899324;1916637;1933428;1951142;1969512;1988277;2006650;2026319;2062226;2085225;2108805;2125635;2149260;2165595;2183167;2201670;2216475
Adıyaman;568432;571180;573149;574886;576808;578852;580926;582762;585067;588475;590935;593931;595261;597184;597835;602774;610484;615076
Afyonkarahisar;696292;698029;698773;699193;699794;700502;701204;701572;697365;701326;697559;698626;703948;707123;706371;709015;714523;715693
Ağrı;519190;521514;523123;524514;526070;527732;529417;530879;532180;537665;542022;555479;552404;551177;549435;547210;542255;536285
Amasya;333927;333768;333110;332271;331491;330739;329956;328674;323675;324268;334786;323079;322283;321977;321913;322167;326351;329888
Ankara;3889199;3971642;40503

* This table can be read into a DataFrame using `read_csv`:

In [54]:
populationDF = pd.read_csv("data/population.csv")
populationDF

   Provinces;2000;2001;2002;2003;2004;2005;2006;2007;2008;2009;2010;2011;2012;2013;2014;2015;2016;2017
0   Total;64729501;65603160;66401851;67187251;6801...                                                 
1   Adana;1879695;1899324;1916637;1933428;1951142;...                                                 
2   Adıyaman;568432;571180;573149;574886;576808;57...                                                 
3   Afyonkarahisar;696292;698029;698773;699193;699...                                                 
4   Ağrı;519190;521514;523123;524514;526070;527732...                                                 
..                                                ...                                                 
77  Yalova;144923;150027;155041;160099;165333;1707...                                                 
78  Karabük;205172;207241;209056;210812;212667;214...                                                 
79  Kilis;109698;111024;112219;113387;114615;11588...                    

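* Notice that `read_csv` automatically considered the first row in the file to be a header row. Because the default separator is a comma and this file uses semicolons, each line was read into a single column.<br>
<br>
* We can override default behavior by customizing some of the arguments, like `header`, `names` or `index_col`.<br>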

* `read_csv` is just a convenience function for `read_table`, since csv is such a common format:<br>

In [55]:
pd.set_option('display.max_columns', 5)
populationDF = pd.read_table("data/population_missing.csv", sep=';')
populationDF

   Provinces        2000     ...            2016        2017
0        NaN         NaN     ...             NaN         NaN
1        NaN         NaN     ...             NaN         NaN
2      Total  64729501.0     ...      79814871.0  80810525.0
3      Adana   1879695.0     ...       2201670.0   2216475.0
4   Adıyaman    568432.0     ...        610484.0    615076.0
..       ...         ...     ...             ...         ...
79    Yalova    144923.0     ...        241665.0    251203.0
80   Karabük    205172.0     ...        242347.0    244453.0
81     Kilis    109698.0     ...        130825.0    136319.0
82  Osmaniye    411163.0     ...        522175.0    527724.0
83     Düzce    296712.0     ...        370371.0    377610.0

[84 rows x 19 columns]
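The effect of `sep`, `header`, `names` and `index_col` can be tried without any file on disk by parsing an in-memory string. The sketch below uses a three-line excerpt in the same semicolon-separated layout; the replacement column names are hypothetical.

```python
import io
import pandas as pd

raw = ("Provinces;2016;2017\n"
       "Total;79814871;80810525\n"
       "Adana;2201670;2216475\n")

# Default comma separator: every line ends up in a single column.
wrong = pd.read_csv(io.StringIO(raw))

# Correct separator, with the province names used as the index.
right = pd.read_csv(io.StringIO(raw), sep=';', index_col='Provinces')

# header=0 discards the header row of the file, names supplies new labels.
renamed = pd.read_csv(io.StringIO(raw), sep=';', header=0,
                      names=['province', 'pop_2016', 'pop_2017'])

print(wrong.shape, right.shape)
print(renamed)
```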

* The `sep` argument can be customized as needed to accommodate arbitrary separators.<br>

* If we have sections of data that we do not wish to import (here, the empty rows), we can populate the `skiprows` argument:

In [56]:
populationDF = pd.read_csv("data/population_missing.csv", sep=';', skiprows=[1,2])
populationDF

         Provinces      2000    ...         2016      2017
0            Total  64729501    ...     79814871  80810525
1            Adana   1879695    ...      2201670   2216475
2         Adıyaman    568432    ...       610484    615076
3   Afyonkarahisar    696292    ...       714523    715693
4             Ağrı    519190    ...       542255    536285
..             ...       ...    ...          ...       ...
77          Yalova    144923    ...       241665    251203
78         Karabük    205172    ...       242347    244453
79           Kilis    109698    ...       130825    136319
80        Osmaniye    411163    ...       522175    527724
81           Düzce    296712    ...       370371    377610

[82 rows x 19 columns]

* For a more useful index, we can specify the first column, which provides a unique index to the data.

In [57]:
populationDF = pd.read_csv("data/population.csv", sep=';', index_col='Provinces')
populationDF.index

Index([u'Total', u'Adana', u'Adıyaman', u'Afyonkarahisar', u'Ağrı', u'Amasya',
       u'Ankara', u'Antalya', u'Artvin', u'Aydın', u'Balıkesir', u'Bilecik',
       u'Bingöl', u'Bitlis', u'Bolu', u'Burdur', u'Bursa', u'Çanakkale',
       u'Çankırı', u'Çorum', u'Denizli', u'Diyarbakır', u'Edirne', u'Elazığ',
       u'Erzincan', u'Erzurum', u'Eskişehir', u'Gaziantep', u'Giresun',
       u'Gümüşhane', u'Hakkari', u'Hatay', u'Isparta', u'Mersin', u'İstanbul',
       u'İzmir', u'Kars', u'Kastamonu', u'Kayseri', u'Kırklareli', u'Kırşehir',
       u'Kocaeli', u'Konya', u'Kütahya', u'Malatya', u'Manisa',
       u'Kahramanmaraş', u'Mardin', u'Muğla', u'Muş', u'Nevşehir', u'Niğde',
       u'Ordu', u'Rize', u'Sakarya', u'Samsun', u'Siirt', u'Sinop', u'Sivas',
       u'Tekirdağ', u'Tokat', u'Trabzon', u'Tunceli', u'Şanlıurfa', u'Uşak',
       u'Van', u'Yozgat', u'Zonguldak', u'Aksaray', u'Bayburt', u'Karaman',
       u'Kırıkkale', u'Batman', u'Şırnak', u'Bartın', u'Ardahan', u'Iğdır',
       u'Yalov

Conversely, if we only want to import a small number of rows from, say, a very large data file we can use `nrows`:

In [58]:
pd.read_csv("data/population.csv", sep=';', nrows=4)

        Provinces      2000    ...         2016      2017
0           Total  64729501    ...     79814871  80810525
1           Adana   1879695    ...      2201670   2216475
2        Adıyaman    568432    ...       610484    615076
3  Afyonkarahisar    696292    ...       714523    715693

[4 rows x 19 columns]

* Most real-world data is incomplete, with values missing due to incomplete observation, data entry or transcription error, or other reasons. Pandas will automatically recognize and parse common missing data indicators, including `NA`, `NaN`, `NULL`.

In [59]:
pd.read_csv("data/population_missing.csv", sep=';').head(10)

        Provinces        2000     ...            2016        2017
0             NaN         NaN     ...             NaN         NaN
1             NaN         NaN     ...             NaN         NaN
2           Total  64729501.0     ...      79814871.0  80810525.0
3           Adana   1879695.0     ...       2201670.0   2216475.0
4        Adıyaman    568432.0     ...        610484.0    615076.0
5  Afyonkarahisar    696292.0     ...        714523.0    715693.0
6            Ağrı    519190.0     ...        542255.0    536285.0
7          Amasya    333927.0     ...        326351.0    329888.0
8          Ankara   3889199.0     ...       5346518.0   5445026.0
9         Antalya   1430539.0     ...       2328555.0   2364396.0

[10 rows x 19 columns]

Above, Pandas recognized `NaN` and an empty field as missing data.

In [60]:
pd.isnull(pd.read_csv("data/population_missing.csv", sep=';')).head(10)

   Provinces   2000  ...     2016   2017
0       True   True  ...     True   True
1       True   True  ...     True   True
2      False  False  ...    False  False
3      False  False  ...    False  False
4      False  False  ...    False  False
5      False  False  ...    False  False
6      False  False  ...    False  False
7      False  False  ...    False  False
8      False  False  ...    False  False
9      False  False  ...    False  False

[10 rows x 19 columns]
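Data sets sometimes mark missing values with their own sentinel. The `na_values` argument of `read_csv` adds such markers to the defaults. In this sketch the sentinel `-999` and the gaps in the sample rows are invented for illustration; they are not part of the population file.

```python
import io
import pandas as pd

raw = ("Provinces;2016;2017\n"
       "Adana;-999;2216475\n"
       "Ankara;NA;5445026\n"
       "Antalya;2328555;\n")

# 'NA' and empty fields are recognized by default; '-999' is added here.
df = pd.read_csv(io.StringIO(raw), sep=';', na_values=['-999'])

print(df)
print(df.isnull().sum())
```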

### Microsoft Excel

* Since so much financial and scientific data ends up in Excel spreadsheets, Pandas' ability to directly import Excel spreadsheets is valuable. <br>
<br>
* This support is contingent on having one or two dependencies (depending on what version of Excel file is being imported) installed: `xlrd` and `openpyxl`.<br>
<br>
* Importing Excel data to Pandas is a two-step process. First, we create an `ExcelFile` object using the path of the file:                                             

In [61]:
excel_file = pd.ExcelFile('data/population.xlsx')
excel_file

<pandas.io.excel.ExcelFile at 0x10cfa1a50>

* Then, since modern spreadsheets consist of one or more "sheets", we parse the sheet with the data of interest:

In [62]:
excelDf = excel_file.parse("Sheet 1 ")
excelDf

         Provinces      2000    ...         2016      2017
0            Total  64729501    ...     79814871  80810525
1            Adana   1879695    ...      2201670   2216475
2         Adıyaman    568432    ...       610484    615076
3   Afyonkarahisar    696292    ...       714523    715693
4             Ağrı    519190    ...       542255    536285
..             ...       ...    ...          ...       ...
77          Yalova    144923    ...       241665    251203
78         Karabük    205172    ...       242347    244453
79           Kilis    109698    ...       130825    136319
80        Osmaniye    411163    ...       522175    527724
81           Düzce    296712    ...       370371    377610

[82 rows x 19 columns]

* Also, there is a `read_excel` convenience function in Pandas that combines these steps into a single call:

In [63]:
excelDf2 = pd.read_excel('data/population.xlsx', sheet_name='Sheet 1 ')
excelDf2.head(10)

        Provinces      2000    ...         2016      2017
0           Total  64729501    ...     79814871  80810525
1           Adana   1879695    ...      2201670   2216475
2        Adıyaman    568432    ...       610484    615076
3  Afyonkarahisar    696292    ...       714523    715693
4            Ağrı    519190    ...       542255    536285
5          Amasya    333927    ...       326351    329888
6          Ankara   3889199    ...      5346518   5445026
7         Antalya   1430539    ...      2328555   2364396
8          Artvin    167909    ...       168068    166143
9           Aydın    870460    ...      1068260   1080839

[10 rows x 19 columns]

* On the first day we learned how to read and write `JSON` files; in the same way, you can also import JSON files into `DataFrames`. 

* You can also connect to databases and import your data into `DataFrames` with the help of third-party libraries.
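Both routes can be sketched without external files. The example below parses a small JSON string with `read_json` and queries an in-memory SQLite database with `read_sql_query`; the table name `cities` and its columns are assumptions made for this illustration. SQLite ships with Python, while other databases typically need an additional driver or SQLAlchemy.

```python
import io
import sqlite3
import pandas as pd

# JSON: a list of records, one object per row.
json_text = ('[{"city": "Istanbul", "population": 15029231},'
             ' {"city": "Ankara", "population": 5445026}]')
from_json = pd.read_json(io.StringIO(json_text), orient='records')

# SQL: build a throwaway in-memory database, then query it into a DataFrame.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE cities (city TEXT, population INTEGER)')
conn.executemany('INSERT INTO cities VALUES (?, ?)',
                 [('Istanbul', 15029231), ('Ankara', 5445026)])
conn.commit()
from_sql = pd.read_sql_query(
    'SELECT city, population FROM cities ORDER BY population DESC', conn)
conn.close()

print(from_json)
print(from_sql)
```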

## Pandas Fundamentals

* This section introduces the new user to the key functionality of Pandas that is required to use the software effectively.<br>
<br>
* For some variety, we will leave our population data behind and employ some `Superhero` data.<br>

* The data comes from Marvel Wikia.<br>
<br>
* The file has the following variables:<br>

<table>
<thead>
<tr>
<th style="text-align:left;">Variable</th>
<th style="text-align:left;">Definition</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;">page_id</td>
<td style="text-align:left;">The unique identifier for that character's page within the wikia</td>
</tr>
<tr>
<td style="text-align:left;">name</td>
<td style="text-align:left;">The name of the character</td>
</tr>
<tr>
<td style="text-align:left;">urlslug</td>
<td style="text-align:left;">The unique url within the wikia that takes you to the character</td>
</tr>
<tr>
<td style="text-align:left;">ID</td>
<td style="text-align:left;">The identity status of the character (Secret Identity, Public Identity, No Dual Identity)</td>
</tr>
<tr>
<td style="text-align:left;">ALIGN</td>
<td style="text-align:left;">If the character is Good, Bad or Neutral</td>
</tr>
<tr>
<td style="text-align:left;">EYE</td>
<td style="text-align:left;">Eye color of the character</td>
</tr>
<tr>
<td style="text-align:left;">HAIR</td>
<td style="text-align:left;">Hair color of the character</td>
</tr>
<tr>
<td style="text-align:left;">SEX</td>
<td style="text-align:left;">Sex of the character (e.g. Male, Female, etc.)</td>
</tr>
<tr>
<td style="text-align:left;">GSM</td>
<td style="text-align:left;">If the character is a gender or sexual minority (e.g. Homosexual characters, bisexual characters)</td>
</tr>
<tr>
<td style="text-align:left;">ALIVE</td>
<td style="text-align:left;">If the character is alive or deceased</td>
</tr>
<tr>
<td style="text-align:left;">APPEARANCES</td>
<td style="text-align:left;">The number of appearances of the character in comic books (as of Sep. 2, 2014. Number will become increasingly out of date as time goes on.)</td>
</tr>
<tr>
<td style="text-align:left;">FIRST APPEARANCE</td>
<td style="text-align:left;">The month and year of the character's first appearance in a comic book, if available</td>
</tr>
<tr>
<td style="text-align:left;">YEAR</td>
<td style="text-align:left;">The year of the character's first appearance in a comic book, if available</td>
</tr>
</tbody>
</table>

In [64]:
pd.set_option('display.max_columns', 12)
pd.set_option('display.notebook_repr_html', True)
marvelDF = pd.read_csv("data/marvel-wikia-data.csv", index_col='page_id')
marvelDF.head(5)

page_id,name,urlslug,ID,ALIGN,EYE,HAIR,SEX,GSM,ALIVE,APPEARANCES,FIRST APPEARANCE,Year
1678,Spider-Man (Peter Parker),\/Spider-Man_(Peter_Parker),Secret Identity,Good Characters,Hazel Eyes,Brown Hair,Male Characters,,Living Characters,4043.0,Aug-62,1962.0
7139,Captain America (Steven Rogers),\/Captain_America_(Steven_Rogers),Public Identity,Good Characters,Blue Eyes,White Hair,Male Characters,,Living Characters,3360.0,Mar-41,1941.0
64786,"Wolverine (James \""Logan\"" Howlett)",\/Wolverine_(James_%22Logan%22_Howlett),Public Identity,Neutral Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,3061.0,Oct-74,1974.0
1868,"Iron Man (Anthony \""Tony\"" Stark)",\/Iron_Man_(Anthony_%22Tony%22_Stark),Public Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,2961.0,Mar-63,1963.0
2460,Thor (Thor Odinson),\/Thor_(Thor_Odinson),No Dual Identity,Good Characters,Blue Eyes,Blond Hair,Male Characters,,Living Characters,2258.0,Nov-50,1950.0


* Notice that we specified the `page_id` column as the index, since it appears to be a unique identifier. We could try to create a unique index ourselves by trimming `name`:

* First, import Python's regular expression module, `re`.<br>
<br>
* Then, trim the `name` column with a regular expression.<br>

In [65]:
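import re
# A raw string avoids invalid-escape warnings for \s and \.
# The pattern matches a run of letters, hyphens, whitespace, periods,
# or apostrophes that ends in a letter.
pattern = re.compile(r"([a-zA-Z]|-|\s|\.|')*([a-zA-Z])")
heroName = []
for name in marvelDF.name:
    match = pattern.search(name)
    if match:
        heroName.append(match.group())  # trimmed name, e.g. 'Spider-Man'
    else:
        heroName.append(name)           # no match: keep the original name
heroName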

['Spider-Man',
 'Captain America',
 'Wolverine',
 'Iron Man',
 'Thor',
 'Benjamin Grimm',
 'Reed Richards',
 'Hulk',
 'Scott Summers',
 'Jonathan Storm',
 'Henry McCoy',
 'Susan Storm',
 'Namor McKenzie',
 'Ororo Munroe',
 'Clinton Barton',
 'Matthew Murdock',
 'Stephen Strange',
 'Mary Jane Watson',
 'John Jonah Jameson',
 'Robert Drake',
 'Henry Pym',
 'Charles Xavier',
 'Warren Worthington III',
 'Piotr Rasputin',
 'Wanda Maximoff',
 'Nicholas Fury',
 'Janet van Dyne',
 'Jean Grey',
 'Natalia Romanova',
 'Kurt Wagner',
 'Vision',
 'May Reilly',
 'Katherine Pryde',
 'Carol Danvers',
 'Jennifer Walters',
 'Emma Frost',
 'Frank Castle',
 'Luke Cage',
 'Rogue',
 'Conan',
 'Joseph Robertson',
 'Pietro Maximoff',
 'Hercules',
 'Victor von Doom',
 'Max Eisenhardt',
 'Elizabeth Braddock',
 'Norrin Radd',
 'Norman Osborn',
 'Eugene Thompson',
 'Simon Williams',
 'Samuel Guthrie',
 'James Buchanan Barnes',
 'Remy LeBeau',
 'Daniel Rand',
 'Nathan Summers',
 'Elizabeth Brant',
 'Richard Jones'
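
* The same trimming can be sketched without an explicit loop by using pandas' vectorized string methods. The example below is an alternative technique, not the regular expression above: it simply keeps the text before the first parenthesis. The sample names are a hypothetical subset written in the format of the `name` column.

```python
import pandas as pd

# Hypothetical sample in the same format as the `name` column.
names = pd.Series([
    "Spider-Man (Peter Parker)",
    "Benjamin Grimm (Earth-616)",
    "Peter Parker (Ben Reilly) (Earth-616)",
])

# Keep everything before the first "(" and strip surrounding whitespace.
short_names = names.str.split("(").str[0].str.strip()
print(short_names.tolist())
```

* Names containing digits or other characters may be trimmed differently by the two approaches, so the results are only expected to agree on simple cases like these.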

* This looks reasonable, so let's copy '__marvelDF__' to '__marvelDF_newID__' and assign the trimmed names as its index.<br>

In [66]:
marvelDF_newID = marvelDF.copy()
marvelDF_newID.index = heroName
marvelDF_newID.head(5)

Unnamed: 0,name,urlslug,ID,ALIGN,EYE,HAIR,SEX,GSM,ALIVE,APPEARANCES,FIRST APPEARANCE,Year
Spider-Man,Spider-Man (Peter Parker),\/Spider-Man_(Peter_Parker),Secret Identity,Good Characters,Hazel Eyes,Brown Hair,Male Characters,,Living Characters,4043.0,Aug-62,1962.0
Captain America,Captain America (Steven Rogers),\/Captain_America_(Steven_Rogers),Public Identity,Good Characters,Blue Eyes,White Hair,Male Characters,,Living Characters,3360.0,Mar-41,1941.0
Wolverine,"Wolverine (James \""Logan\"" Howlett)",\/Wolverine_(James_%22Logan%22_Howlett),Public Identity,Neutral Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,3061.0,Oct-74,1974.0
Iron Man,"Iron Man (Anthony \""Tony\"" Stark)",\/Iron_Man_(Anthony_%22Tony%22_Stark),Public Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,2961.0,Mar-63,1963.0
Thor,Thor (Thor Odinson),\/Thor_(Thor_Odinson),No Dual Identity,Good Characters,Blue Eyes,Blond Hair,Male Characters,,Living Characters,2258.0,Nov-50,1950.0


* Let's check whether the new index is unique:

In [67]:
marvelDF_newID.index.is_unique

False

* So, indices need not be unique. Our trimmed names are not unique because some superheroes appear in several variations.

In [68]:
pd.Series(marvelDF_newID.index).value_counts()

Sentinel         16
Charlie          10
Peter Parker     10
M                 8
X                 8
                 ..
Wrogg             1
Lynch             1
Llan              1
Gudrun Tyburn     1
Iron Brain        1
Length: 15271, dtype: int64

* The most important consequence of a non-unique index is that indexing by label will return multiple values for some labels:

In [69]:
marvelDF_newID.loc['Peter Parker']

Unnamed: 0,name,urlslug,ID,ALIGN,EYE,HAIR,SEX,GSM,ALIVE,APPEARANCES,FIRST APPEARANCE,Year
Peter Parker,Peter Parker (Ben Reilly) (Earth-616),\/Peter_Parker_(Ben_Reilly)_(Earth-616),Secret Identity,Good Characters,Hazel Eyes,Blond Hair,Male Characters,,Deceased Characters,263.0,Oct-75,1975.0
Peter Parker,Peter Parker (Kaine) (Earth-616),\/Peter_Parker_(Kaine)_(Earth-616),Secret Identity,Good Characters,Hazel Eyes,Brown Hair,Male Characters,,Living Characters,137.0,Dec-94,1994.0
Peter Parker,Peter Parker (Doppelganger) (Earth-616),\/Peter_Parker_(Doppelganger)_(Earth-616),Secret Identity,Bad Characters,White Eyes,No Hair,Male Characters,,Living Characters,35.0,Jun-92,1992.0
Peter Parker,Peter Parker (Spidercide) (Earth-616),\/Peter_Parker_(Spidercide)_(Earth-616),Secret Identity,Bad Characters,Hazel Eyes,Brown Hair,Male Characters,,Deceased Characters,35.0,Mar-95,1995.0
Peter Parker,Peter Parker (Spider-Skeleton) (Earth-616),\/Peter_Parker_(Spider-Skeleton)_(Earth-616),Secret Identity,Neutral Characters,,,Male Characters,,Deceased Characters,12.0,Feb-96,1996.0
Peter Parker,Peter Parker (Jack) (Earth-616),\/Peter_Parker_(Jack)_(Earth-616),Secret Identity,Neutral Characters,Hazel Eyes,Bald,Male Characters,,Deceased Characters,8.0,Mar-95,1995.0
Peter Parker,Peter Parker (Skrull) (Earth-616),\/Peter_Parker_(Skrull)_(Earth-616),Secret Identity,Bad Characters,Green Eyes,No Hair,Male Characters,,Living Characters,6.0,Jun-08,2008.0
Peter Parker,Peter Parker (Guardian) (Earth-616),\/Peter_Parker_(Guardian)_(Earth-616),Secret Identity,Bad Characters,Hazel Eyes,Brown Hair,Male Characters,,Deceased Characters,3.0,Mar-95,1995.0
Peter Parker,Peter Parker (Counter-Earth) (Earth-616),\/Peter_Parker_(Counter-Earth)_(Earth-616),No Dual Identity,,,,Male Characters,,Deceased Characters,1.0,Oct-72,1972.0
Peter Parker,Peter Parker (Robot) (Earth-616),\/Peter_Parker_(Robot)_(Earth-616),No Dual Identity,,,,Male Characters,,Deceased Characters,1.0,,
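
* A minimal sketch of the same behaviour on a toy DataFrame (the data and column name are hypothetical): `index.duplicated` locates the repeated labels, and `.loc` returns a DataFrame for a repeated label but a Series for a label that occurs once.

```python
import pandas as pd

# Hypothetical toy data with one repeated label.
df = pd.DataFrame(
    {"appearances": [263, 137, 4043]},
    index=["Peter Parker", "Peter Parker", "Spider-Man"],
)

print(df.index.is_unique)

# keep=False marks every occurrence of a repeated label.
dup_mask = df.index.duplicated(keep=False)
print(df[dup_mask])

repeated = df.loc["Peter Parker"]  # two matching rows -> DataFrame
single = df.loc["Spider-Man"]      # one matching row  -> Series
print(type(repeated).__name__, type(single).__name__)
```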


* Let's create a truly unique index by using the untrimmed `name` column:

In [70]:
hero_id = marvelDF.name
marvelDF_newID = marvelDF.copy()
marvelDF_newID.index = hero_id
marvelDF_newID.head()

Unnamed: 0_level_0,name,urlslug,ID,ALIGN,EYE,HAIR,SEX,GSM,ALIVE,APPEARANCES,FIRST APPEARANCE,Year
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
Spider-Man (Peter Parker),Spider-Man (Peter Parker),\/Spider-Man_(Peter_Parker),Secret Identity,Good Characters,Hazel Eyes,Brown Hair,Male Characters,,Living Characters,4043.0,Aug-62,1962.0
Captain America (Steven Rogers),Captain America (Steven Rogers),\/Captain_America_(Steven_Rogers),Public Identity,Good Characters,Blue Eyes,White Hair,Male Characters,,Living Characters,3360.0,Mar-41,1941.0
"Wolverine (James \""Logan\"" Howlett)","Wolverine (James \""Logan\"" Howlett)",\/Wolverine_(James_%22Logan%22_Howlett),Public Identity,Neutral Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,3061.0,Oct-74,1974.0
"Iron Man (Anthony \""Tony\"" Stark)","Iron Man (Anthony \""Tony\"" Stark)",\/Iron_Man_(Anthony_%22Tony%22_Stark),Public Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,2961.0,Mar-63,1963.0
Thor (Thor Odinson),Thor (Thor Odinson),\/Thor_(Thor_Odinson),No Dual Identity,Good Characters,Blue Eyes,Blond Hair,Male Characters,,Living Characters,2258.0,Nov-50,1950.0


In [71]:
marvelDF_newID.index.is_unique

True

* We can create meaningful indices more easily using a hierarchical index.<br>
<br>
* For now, we will stick with numeric IDs as the index of the '__marvelDF_newID__' DataFrame, and index '__marvelDF__' by `name`.<br>

In [72]:
marvelDF_newID.index = range(len(marvelDF_newID))  # 0..16375, without hard-coding the row count
marvelDF.index = marvelDF['name']
marvelDF_newID.head(5)

Unnamed: 0,name,urlslug,ID,ALIGN,EYE,HAIR,SEX,GSM,ALIVE,APPEARANCES,FIRST APPEARANCE,Year
0,Spider-Man (Peter Parker),\/Spider-Man_(Peter_Parker),Secret Identity,Good Characters,Hazel Eyes,Brown Hair,Male Characters,,Living Characters,4043.0,Aug-62,1962.0
1,Captain America (Steven Rogers),\/Captain_America_(Steven_Rogers),Public Identity,Good Characters,Blue Eyes,White Hair,Male Characters,,Living Characters,3360.0,Mar-41,1941.0
2,"Wolverine (James \""Logan\"" Howlett)",\/Wolverine_(James_%22Logan%22_Howlett),Public Identity,Neutral Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,3061.0,Oct-74,1974.0
3,"Iron Man (Anthony \""Tony\"" Stark)",\/Iron_Man_(Anthony_%22Tony%22_Stark),Public Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,2961.0,Mar-63,1963.0
4,Thor (Thor Odinson),\/Thor_(Thor_Odinson),No Dual Identity,Good Characters,Blue Eyes,Blond Hair,Male Characters,,Living Characters,2258.0,Nov-50,1950.0


### Manipulating indices

* __Reindexing__ allows users to manipulate the data labels in a DataFrame. <br>
<br>
* It forces a DataFrame to conform to the new index, and optionally, fill in missing data if requested.<br>
<br>
* A simple use of `reindex` is to reverse the order of the rows:

In [73]:
marvelDF_newID.reindex(marvelDF_newID.index[::-1]).head()

Unnamed: 0,name,urlslug,ID,ALIGN,EYE,HAIR,SEX,GSM,ALIVE,APPEARANCES,FIRST APPEARANCE,Year
16375,Yologarch (Earth-616),\/Yologarch_(Earth-616),,Bad Characters,,,,,Living Characters,,,
16374,TK421 (Spiderling) (Earth-616),\/TK421_(Spiderling)_(Earth-616),Secret Identity,Neutral Characters,,,Male Characters,,Living Characters,,,
16373,Tinkerer (Skrull) (Earth-616),\/Tinkerer_(Skrull)_(Earth-616),Secret Identity,Bad Characters,Black Eyes,Bald,Male Characters,,Living Characters,,,
16372,Thane (Thanos' son) (Earth-616),\/Thane_(Thanos%27_son)_(Earth-616),No Dual Identity,Good Characters,Blue Eyes,Bald,Male Characters,,Living Characters,,,
16371,Ru'ach (Earth-616),\/Ru%27ach_(Earth-616),No Dual Identity,Bad Characters,Green Eyes,No Hair,Male Characters,,Living Characters,,,
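
* Beyond reordering, `reindex` can introduce labels that were not in the original index. The sketch below, on a hypothetical toy Series, shows the default `NaN` filling, the `fill_value` option, and the error raised when the existing index contains duplicate labels.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# Labels missing from the original index get NaN by default...
with_nan = s.reindex(["c", "b", "a", "z"])
# ...or a chosen filler when `fill_value` is given.
with_zero = s.reindex(["c", "b", "a", "z"], fill_value=0)
print(with_nan)
print(with_zero)

# reindex needs the existing index to be unique.
dup = pd.Series([1, 2], index=["a", "a"])
try:
    dup.reindex(["a", "b"])
    raised = False
except ValueError as err:
    raised = True
    print("reindex failed:", err)
```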


* Keep in mind that `reindex` raises an error when the existing index of the object contains duplicate labels.

* We can remove rows or columns via the `drop` method:

In [74]:
marvelDF_newID.shape

(16376, 12)

In [75]:
marvelDF_dropped = marvelDF_newID.drop([16375, 16374])

In [76]:
print(marvelDF_newID.shape)
print(marvelDF_dropped.shape)

(16376, 12)
(16374, 12)


In [77]:
marvelDF_dropped = marvelDF_newID.drop(['EYE','HAIR'], axis=1)

In [78]:
print(marvelDF_newID.shape)
print(marvelDF_dropped.shape)

(16376, 12)
(16376, 10)
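
* `drop` also accepts the `index` and `columns` keywords, which make the axis explicit and allow rows and columns to be removed in one call. A small sketch on hypothetical data:

```python
import pandas as pd

df = pd.DataFrame(
    {"EYE": ["Hazel", "Blue"], "HAIR": ["Brown", "White"], "Year": [1962, 1941]},
    index=[0, 1],
)

# Remove row 1 and column HAIR; `df` itself is left unchanged.
trimmed = df.drop(index=[1], columns=["HAIR"])
print(trimmed.shape)
print(df.shape)
```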


## Indexing and Selection

* Indexing works like indexing in NumPy arrays, except we can use the labels in the `Index` object to extract values in addition to arrays of integers.<br>

In [79]:
heroAppearances = marvelDF.APPEARANCES
heroAppearances

name
Spider-Man (Peter Parker)              4043.0
Captain America (Steven Rogers)        3360.0
Wolverine (James \"Logan\" Howlett)    3061.0
Iron Man (Anthony \"Tony\" Stark)      2961.0
Thor (Thor Odinson)                    2258.0
                                        ...  
Ru'ach (Earth-616)                        NaN
Thane (Thanos' son) (Earth-616)           NaN
Tinkerer (Skrull) (Earth-616)             NaN
TK421 (Spiderling) (Earth-616)            NaN
Yologarch (Earth-616)                     NaN
Name: APPEARANCES, Length: 16376, dtype: float64

* Let's start with NumPy-style indexing:

In [80]:
heroAppearances[:3]

name
Spider-Man (Peter Parker)              4043.0
Captain America (Steven Rogers)        3360.0
Wolverine (James \"Logan\" Howlett)    3061.0
Name: APPEARANCES, dtype: float64

* Indexing by Label:

In [81]:
heroAppearances[['Spider-Man (Peter Parker)','Hulk (Robert Bruce Banner)']]

name
Spider-Man (Peter Parker)     4043.0
Hulk (Robert Bruce Banner)    2017.0
Name: APPEARANCES, dtype: float64

* We can also slice with data labels, since they have an intrinsic order within the Index:

In [82]:
heroAppearances['Spider-Man (Peter Parker)':'Matthew Murdock (Earth-616)']

name
Spider-Man (Peter Parker)              4043.0
Captain America (Steven Rogers)        3360.0
Wolverine (James \"Logan\" Howlett)    3061.0
Iron Man (Anthony \"Tony\" Stark)      2961.0
Thor (Thor Odinson)                    2258.0
                                        ...  
Susan Storm (Earth-616)                1713.0
Namor McKenzie (Earth-616)             1528.0
Ororo Munroe (Earth-616)               1512.0
Clinton Barton (Earth-616)             1394.0
Matthew Murdock (Earth-616)            1338.0
Name: APPEARANCES, Length: 16, dtype: float64
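
* One difference from positional slicing is worth a quick sketch: a label slice includes its end label, while a positional slice stops before its end position. The short labels below are hypothetical stand-ins for the character names.

```python
import pandas as pd

s = pd.Series(
    [4043, 3360, 3061, 2961],
    index=["spider", "cap", "wolverine", "iron"],
)

by_position = s.iloc[0:2]            # positions 0 and 1 only
by_label = s["spider":"wolverine"]   # includes the end label "wolverine"

print(by_position.index.tolist())
print(by_label.index.tolist())
```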

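* We can assign to a slice selected by label. Because `heroAppearances` was taken from a column of `marvelDF`, pandas may emit a `SettingWithCopyWarning`. The Series is updated either way, but whether `marvelDF` changes as well depends on the pandas version, so call `.copy()` first if an independent Series is intended.<br>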

In [83]:
heroAppearances['Minister of Castile D\'or (Earth-616)':'Yologarch (Earth-616)'] = 0
heroAppearances

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


name
Spider-Man (Peter Parker)              4043.0
Captain America (Steven Rogers)        3360.0
Wolverine (James \"Logan\" Howlett)    3061.0
Iron Man (Anthony \"Tony\" Stark)      2961.0
Thor (Thor Odinson)                    2258.0
                                        ...  
Ru'ach (Earth-616)                        0.0
Thane (Thanos' son) (Earth-616)           0.0
Tinkerer (Skrull) (Earth-616)             0.0
TK421 (Spiderling) (Earth-616)            0.0
Yologarch (Earth-616)                     0.0
Name: APPEARANCES, Length: 16376, dtype: float64
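
* A hedged sketch of two ways to make the intent explicit, using a hypothetical three-row DataFrame: take an explicit copy when the Series should be independent, or assign through `.loc` on the DataFrame when the DataFrame itself should change.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"APPEARANCES": [4043.0, np.nan, np.nan]},
    index=["Spider-Man", "Ru'ach", "Yologarch"],
)

# 1) Independent Series: modifying the copy leaves `df` untouched.
appearances = df["APPEARANCES"].copy()
appearances["Ru'ach":"Yologarch"] = 0
untouched = int(df["APPEARANCES"].isna().sum())  # still 2 missing values in df

# 2) Modify the DataFrame itself with a single .loc assignment.
df.loc["Ru'ach":"Yologarch", "APPEARANCES"] = 0

print(appearances.tolist())
print(df["APPEARANCES"].tolist())
```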

* In a `DataFrame` we can slice along either or both axes:

In [84]:
marvelDF[['SEX','ALIGN']]

Unnamed: 0_level_0,SEX,ALIGN
name,Unnamed: 1_level_1,Unnamed: 2_level_1
Spider-Man (Peter Parker),Male Characters,Good Characters
Captain America (Steven Rogers),Male Characters,Good Characters
"Wolverine (James \""Logan\"" Howlett)",Male Characters,Neutral Characters
"Iron Man (Anthony \""Tony\"" Stark)",Male Characters,Good Characters
Thor (Thor Odinson),Male Characters,Good Characters
...,...,...
Ru'ach (Earth-616),Male Characters,Bad Characters
Thane (Thanos' son) (Earth-616),Male Characters,Good Characters
Tinkerer (Skrull) (Earth-616),Male Characters,Bad Characters
TK421 (Spiderling) (Earth-616),Male Characters,Neutral Characters


In [85]:
mask = marvelDF.APPEARANCES>50
marvelDF[mask]

Unnamed: 0_level_0,name,urlslug,ID,ALIGN,EYE,HAIR,SEX,GSM,ALIVE,APPEARANCES,FIRST APPEARANCE,Year
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
Spider-Man (Peter Parker),Spider-Man (Peter Parker),\/Spider-Man_(Peter_Parker),Secret Identity,Good Characters,Hazel Eyes,Brown Hair,Male Characters,,Living Characters,4043.0,Aug-62,1962.0
Captain America (Steven Rogers),Captain America (Steven Rogers),\/Captain_America_(Steven_Rogers),Public Identity,Good Characters,Blue Eyes,White Hair,Male Characters,,Living Characters,3360.0,Mar-41,1941.0
"Wolverine (James \""Logan\"" Howlett)","Wolverine (James \""Logan\"" Howlett)",\/Wolverine_(James_%22Logan%22_Howlett),Public Identity,Neutral Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,3061.0,Oct-74,1974.0
"Iron Man (Anthony \""Tony\"" Stark)","Iron Man (Anthony \""Tony\"" Stark)",\/Iron_Man_(Anthony_%22Tony%22_Stark),Public Identity,Good Characters,Blue Eyes,Black Hair,Male Characters,,Living Characters,2961.0,Mar-63,1963.0
Thor (Thor Odinson),Thor (Thor Odinson),\/Thor_(Thor_Odinson),No Dual Identity,Good Characters,Blue Eyes,Blond Hair,Male Characters,,Living Characters,2258.0,Nov-50,1950.0
...,...,...,...,...,...,...,...,...,...,...,...,...
Joshua Guthrie (Earth-616),Joshua Guthrie (Earth-616),\/Joshua_Guthrie_(Earth-616),Secret Identity,Good Characters,Green Eyes,Red Hair,Male Characters,,Deceased Characters,51.0,Nov-84,1984.0
Antonio Rodriguez (Earth-616),Antonio Rodriguez (Earth-616),\/Antonio_Rodriguez_(Earth-616),Secret Identity,Neutral Characters,Brown Eyes,No Hair,Male Characters,,Living Characters,51.0,Aug-85,1985.0
Jack Hammer (Earth-616),Jack Hammer (Earth-616),\/Jack_Hammer_(Earth-616),Secret Identity,Neutral Characters,Brown Eyes,Black Hair,Male Characters,,Living Characters,51.0,Aug-93,1993.0
Neal Shaara (Earth-616),Neal Shaara (Earth-616),\/Neal_Shaara_(Earth-616),Secret Identity,Good Characters,Brown Eyes,Black Hair,Male Characters,,Living Characters,51.0,May-00,2000.0
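
* Boolean masks can be combined with `&` (and), `|` (or), and `~` (not). The sketch below uses a hypothetical three-row DataFrame with the same column names.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "ALIGN": ["Good Characters", "Bad Characters", "Good Characters"],
        "APPEARANCES": [4043.0, 35.0, 12.0],
    },
    index=["Spider-Man", "Doppelganger", "Spider-Skeleton"],
)

# Parenthesize each comparison: & binds more tightly than > or ==.
mask = (df["APPEARANCES"] > 50) & (df["ALIGN"] == "Good Characters")
frequent_heroes = df[mask]

# ~ inverts the mask.
others = df[~mask]

print(frequent_heroes.index.tolist())
print(others.index.tolist())
```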


* The `loc` indexer allows us to select subsets of rows and columns by label in an intuitive way:

In [86]:
marvelDF.loc['Spider-Man (Peter Parker)', ['ID', 'EYE', 'HAIR']]

ID      Secret Identity
EYE          Hazel Eyes
HAIR         Brown Hair
Name: Spider-Man (Peter Parker), dtype: object

In [87]:
marvelDF.loc[['Spider-Man (Peter Parker)','Thor (Thor Odinson)'],['ID', 'EYE', 'HAIR']]

Unnamed: 0_level_0,ID,EYE,HAIR
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Spider-Man (Peter Parker),Secret Identity,Hazel Eyes,Brown Hair
Thor (Thor Odinson),No Dual Identity,Blue Eyes,Blond Hair
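
* For comparison, `iloc` makes the same kind of selection by integer position rather than by label. A small sketch on a hypothetical two-row DataFrame:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "ID": ["Secret Identity", "No Dual Identity"],
        "EYE": ["Hazel Eyes", "Blue Eyes"],
        "HAIR": ["Brown Hair", "Blond Hair"],
    },
    index=["Spider-Man", "Thor"],
)

by_label = df.loc["Thor", "EYE"]       # row and column labels
by_position = df.iloc[1, 1]            # second row, second column
all_rows = df.loc[:, ["EYE", "HAIR"]]  # ":" selects every row

print(by_label, by_position)
print(all_rows)
```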


## Operations

* `DataFrame` and `Series` objects allow for several operations to take place either on a single object, or between two or more objects.<br>
<br>
* For example, we can perform arithmetic on the elements of two objects, such as change in population across years:

In [88]:
populationDF

Unnamed: 0_level_0,2000,2001,2002,2003,2004,2005,...,2012,2013,2014,2015,2016,2017
Provinces,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
Total,64729501,65603160,66401851,67187251,68010215,68860539,...,75627384,76667864,77695904,78741053,79814871,80810525
Adana,1879695,1899324,1916637,1933428,1951142,1969512,...,2125635,2149260,2165595,2183167,2201670,2216475
Adıyaman,568432,571180,573149,574886,576808,578852,...,595261,597184,597835,602774,610484,615076
Afyonkarahisar,696292,698029,698773,699193,699794,700502,...,703948,707123,706371,709015,714523,715693
Ağrı,519190,521514,523123,524514,526070,527732,...,552404,551177,549435,547210,542255,536285
...,...,...,...,...,...,...,...,...,...,...,...,...,...
Yalova,144923,150027,155041,160099,165333,170705,...,211799,220122,226514,233009,241665,251203
Karabük,205172,207241,209056,210812,212667,214591,...,225145,230251,231333,236978,242347,244453
Kilis,109698,111024,112219,113387,114615,115886,...,124320,128586,128781,130655,130825,136319
Osmaniye,411163,417418,423214,428943,434930,441108,...,492135,498981,506807,512873,522175,527724


In [89]:
pop2000 = populationDF['2000']
pop2017 = populationDF['2017']

In [90]:
pop2000DF = pd.Series(pop2000.values, index=populationDF.index)
pop2017DF = pd.Series(pop2017.values, index=populationDF.index)

In [91]:
popDiff = pop2017DF - pop2000DF
popDiff

Provinces
Total             16081024
Adana               336780
Adıyaman             46644
Afyonkarahisar       19401
Ağrı                 17095
                    ...   
Yalova              106280
Karabük              39281
Kilis                26621
Osmaniye            116561
Düzce                80898
Length: 82, dtype: int64

* Let's suppose that the value for "Yalova" is missing from our '__pop2000DF__' Series:

In [92]:
pop2000DF["Yalova"] = np.nan
pop2000DF

Provinces
Total             64729501.0
Adana              1879695.0
Adıyaman            568432.0
Afyonkarahisar      696292.0
Ağrı                519190.0
                     ...    
Yalova                   NaN
Karabük             205172.0
Kilis               109698.0
Osmaniye            411163.0
Düzce               296712.0
Length: 82, dtype: float64

In [93]:
popDiff = pop2017DF - pop2000DF
popDiff

Provinces
Total             16081024.0
Adana               336780.0
Adıyaman             46644.0
Afyonkarahisar       19401.0
Ağrı                 17095.0
                     ...    
Yalova                   NaN
Karabük              39281.0
Kilis                26621.0
Osmaniye            116561.0
Düzce                80898.0
Length: 82, dtype: float64

* To access only the non-null elements, we can use the `notnull` method:

In [94]:
popDiff[popDiff.notnull()]

Provinces
Total             16081024.0
Adana               336780.0
Adıyaman             46644.0
Afyonkarahisar       19401.0
Ağrı                 17095.0
                     ...    
Iğdır                20490.0
Karabük              39281.0
Kilis                26621.0
Osmaniye            116561.0
Düzce                80898.0
Length: 81, dtype: float64

* We can pass the `fill_value` argument to `subtract` to substitute zero for `NaN` values before the operation:

In [95]:
pop2017DF.subtract(pop2000DF, fill_value=0)

Provinces
Total             16081024.0
Adana               336780.0
Adıyaman             46644.0
Afyonkarahisar       19401.0
Ağrı                 17095.0
                     ...    
Yalova              251203.0
Karabük              39281.0
Kilis                26621.0
Osmaniye            116561.0
Düzce                80898.0
Length: 82, dtype: float64
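
* The same mechanism applies when the two objects do not share all labels: arithmetic aligns on the index, and labels present on only one side produce `NaN` unless `fill_value` is given. A minimal sketch with hypothetical numbers:

```python
import pandas as pd

pop_a = pd.Series({"Adana": 100, "Kilis": 10})
pop_b = pd.Series({"Kilis": 15, "Yalova": 25})

# Only "Kilis" exists in both Series; the other labels become NaN.
diff = pop_b - pop_a

# fill_value=0 stands in for the value missing on one side.
filled = pop_b.subtract(pop_a, fill_value=0)

print(diff)
print(filled)
```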

* We can also apply functions to the values of a `Series`, or to each column or row of a `DataFrame`:

In [96]:
minPop = pop2017DF.values.min()
indexOfMinPop = pop2017DF.index[pop2017DF.values.argmin()]
print(indexOfMinPop + " -> " + str(minPop))

Bayburt -> 80417


In [97]:
populationDF['2000'] = np.ceil(populationDF['2000'] / 10000) * 10000
populationDF

Unnamed: 0_level_0,2000,2001,2002,2003,2004,2005,...,2012,2013,2014,2015,2016,2017
Provinces,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
Total,64730000.0,65603160,66401851,67187251,68010215,68860539,...,75627384,76667864,77695904,78741053,79814871,80810525
Adana,1880000.0,1899324,1916637,1933428,1951142,1969512,...,2125635,2149260,2165595,2183167,2201670,2216475
Adıyaman,570000.0,571180,573149,574886,576808,578852,...,595261,597184,597835,602774,610484,615076
Afyonkarahisar,700000.0,698029,698773,699193,699794,700502,...,703948,707123,706371,709015,714523,715693
Ağrı,520000.0,521514,523123,524514,526070,527732,...,552404,551177,549435,547210,542255,536285
...,...,...,...,...,...,...,...,...,...,...,...,...,...
Yalova,150000.0,150027,155041,160099,165333,170705,...,211799,220122,226514,233009,241665,251203
Karabük,210000.0,207241,209056,210812,212667,214591,...,225145,230251,231333,236978,242347,244453
Kilis,110000.0,111024,112219,113387,114615,115886,...,124320,128586,128781,130655,130825,136319
Osmaniye,420000.0,417418,423214,428943,434930,441108,...,492135,498981,506807,512873,522175,527724
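
* As a sketch of applying a function per column or per row, the example below uses `apply` and `idxmin` on a two-province excerpt of the population figures shown earlier. The lambda functions are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# Two rows copied from the population table above.
pop = pd.DataFrame(
    {"2000": [1879695, 109698], "2017": [2216475, 136319]},
    index=["Adana", "Kilis"],
)

# Label of the smallest value, without going through .values and argmin.
smallest = pop["2017"].idxmin()

# Default axis=0: the function receives one column at a time.
col_range = pop.apply(lambda col: col.max() - col.min())

# axis=1: the function receives one row at a time.
row_growth = pop.apply(lambda row: row["2017"] - row["2000"], axis=1)

print(smallest)
print(col_range)
print(row_growth)
```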


## Sorting and Ranking

* Pandas objects include methods for re-ordering data.

In [98]:
populationDF.sort_index(ascending=True).head()

Unnamed: 0_level_0,2000,2001,2002,2003,2004,2005,...,2012,2013,2014,2015,2016,2017
Provinces,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
Adana,1880000.0,1899324,1916637,1933428,1951142,1969512,...,2125635,2149260,2165595,2183167,2201670,2216475
Adıyaman,570000.0,571180,573149,574886,576808,578852,...,595261,597184,597835,602774,610484,615076
Afyonkarahisar,700000.0,698029,698773,699193,699794,700502,...,703948,707123,706371,709015,714523,715693
Aksaray,360000.0,353939,355942,357819,359834,361941,...,379915,382806,384252,386514,396673,402404
Amasya,340000.0,333768,333110,332271,331491,330739,...,322283,321977,321913,322167,326351,329888


In [99]:
populationDF.sort_index().head()

Unnamed: 0_level_0,2000,2001,2002,2003,2004,2005,...,2012,2013,2014,2015,2016,2017
Provinces,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
Adana,1880000.0,1899324,1916637,1933428,1951142,1969512,...,2125635,2149260,2165595,2183167,2201670,2216475
Adıyaman,570000.0,571180,573149,574886,576808,578852,...,595261,597184,597835,602774,610484,615076
Afyonkarahisar,700000.0,698029,698773,699193,699794,700502,...,703948,707123,706371,709015,714523,715693
Aksaray,360000.0,353939,355942,357819,359834,361941,...,379915,382806,384252,386514,396673,402404
Amasya,340000.0,333768,333110,332271,331491,330739,...,322283,321977,321913,322167,326351,329888


In [100]:
populationDF.sort_index(axis=1, ascending=False).head()

Unnamed: 0_level_0,2017,2016,2015,2014,2013,2012,...,2005,2004,2003,2002,2001,2000
Provinces,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
Total,80810525,79814871,78741053,77695904,76667864,75627384,...,68860539,68010215,67187251,66401851,65603160,64730000.0
Adana,2216475,2201670,2183167,2165595,2149260,2125635,...,1969512,1951142,1933428,1916637,1899324,1880000.0
Adıyaman,615076,610484,602774,597835,597184,595261,...,578852,576808,574886,573149,571180,570000.0
Afyonkarahisar,715693,714523,709015,706371,707123,703948,...,700502,699794,699193,698773,698029,700000.0
Ağrı,536285,542255,547210,549435,551177,552404,...,527732,526070,524514,523123,521514,520000.0


* We can also use `sort_values` to sort a `Series` by value, rather than by label (the older `order` method has been removed from pandas).

* For a `DataFrame`, we can sort according to the values of one or more columns using the `by` argument of `sort_values`:

In [101]:
populationDF[['2017','2001']].sort_values(by=['2017', '2001'],ascending=[False,True]).head(10)

Unnamed: 0_level_0,2017,2001
Provinces,Unnamed: 1_level_1,Unnamed: 2_level_1
Total,80810525,65603160
İstanbul,15029231,11292009
Ankara,5445026,3971642
İzmir,4279677,3477209
Bursa,2936803,2192169
Antalya,2364396,1480282
Adana,2216475,1899324
Konya,2180149,1855057
Gaziantep,2005515,1330205
Şanlıurfa,1985753,1294842
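
* Sorting a single `Series` by value works the same way, without the `by` argument. A short sketch using three 2017 figures from the table above:

```python
import pandas as pd

pop2017 = pd.Series({"Adana": 2216475, "Kilis": 136319, "Osmaniye": 527724})

ascending = pop2017.sort_values()                  # smallest first
descending = pop2017.sort_values(ascending=False)  # largest first

print(ascending.index.tolist())
print(descending.index.tolist())
```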


* __Ranking__ does not re-arrange data, but instead returns the rank of each value relative to the others in the Series.

In [102]:
populationDF['2010'].rank(ascending=False)

Provinces
Total              1.0
Adana              6.0
Adıyaman          36.0
Afyonkarahisar    32.0
Ağrı              39.0
                  ... 
Yalova            72.0
Karabük           68.0
Kilis             79.0
Osmaniye          43.0
Düzce             52.0
Name: 2010, Length: 82, dtype: float64

In [103]:
populationDF[['2017','2001']].sort_values(by=['2017', '2001'],ascending=[False,True]).rank(ascending=False)

Unnamed: 0_level_0,2017,2001
Provinces,Unnamed: 1_level_1,Unnamed: 2_level_1
Total,1.0,1.0
İstanbul,2.0,2.0
Ankara,3.0,3.0
İzmir,4.0,4.0
Bursa,5.0,5.0
...,...,...
Artvin,78.0,76.0
Kilis,79.0,80.0
Ardahan,80.0,78.0
Tunceli,81.0,81.0


* Ties are assigned the mean value of the tied ranks, which may result in decimal values.

In [104]:
pd.Series([50,60,50]).rank()

0    1.5
1    3.0
2    1.5
dtype: float64

* Alternatively, you can break ties via one of several methods, such as by the order in which they occur in the dataset:

In [105]:
pd.Series([100,50,100]).rank(method='first')

0    2.0
1    1.0
2    3.0
dtype: float64
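
* To compare the tie-breaking methods side by side, the sketch below ranks one hypothetical Series with each `method` option and collects the results in a DataFrame.

```python
import pandas as pd

s = pd.Series([100, 50, 100, 150])

# One column of ranks per tie-breaking method.
methods = ["average", "min", "max", "dense", "first"]
ranks = pd.DataFrame({m: s.rank(method=m) for m in methods})
print(ranks)
```

* The two tied values of 100 occupy ranks 2 and 3, so `average` gives both 2.5, `min` gives both 2, `max` gives both 3, `dense` gives both 2 without leaving a gap before the next rank, and `first` assigns 2 and 3 in order of appearance.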

* Calling the `DataFrame`'s `rank` method results in the ranks of all columns:

In [106]:
populationDF.rank(ascending=False)

Unnamed: 0_level_0,2000,2001,2002,2003,2004,2005,...,2012,2013,2014,2015,2016,2017
Provinces,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
Total,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0
Adana,6.0,6.0,6.0,6.0,6.0,6.0,...,6.0,7.0,7.0,7.0,7.0,7.0
Adıyaman,37.5,37.0,37.0,37.0,37.0,37.0,...,36.0,36.0,36.0,34.0,34.0,34.0
Afyonkarahisar,28.0,28.0,28.0,30.0,30.0,31.0,...,32.0,32.0,32.0,32.0,32.0,32.0
Ağrı,40.5,40.0,41.0,40.0,40.0,40.0,...,39.0,39.0,40.0,40.0,40.0,40.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
Yalova,77.0,77.0,77.0,77.0,77.0,76.0,...,71.0,70.0,69.0,69.0,69.0,67.0
Karabük,69.5,70.0,69.0,69.0,69.0,69.0,...,68.0,68.0,68.0,68.0,68.0,69.0
Kilis,80.0,80.0,80.0,80.0,80.0,79.0,...,79.0,79.0,79.0,79.0,79.0,79.0
Osmaniye,44.0,45.0,44.0,44.0,44.0,44.0,...,43.0,43.0,43.0,43.0,42.0,43.0


## Hierarchical indexing

* Hierarchical indexing is an important feature of pandas enabling you to have multiple (two or more) index levels on an axis.<br>
<br>
* Somewhat abstractly, it provides a way for you to work with higher dimensional data in a lower dimensional form.<br>

* Let’s create a Series with a list of lists or arrays as the index:

In [107]:
data = pd.Series(np.random.randn(10),
               index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                      [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])
data

a  1    1.315558
   2   -0.080769
   3   -0.835756
b  1    0.883245
   2    0.949988
   3    1.329186
c  1   -0.407536
   2   -0.247410
d  2   -1.293176
   3    0.708250
dtype: float64

In [108]:
data.index

MultiIndex(levels=[[u'a', u'b', u'c', u'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])

* With a hierarchically-indexed object, so-called partial indexing is possible, enabling you to concisely select subsets of the data:

In [109]:
data['b']

1    0.883245
2    0.949988
3    1.329186
dtype: float64

In [110]:
data['a':'c']

a  1    1.315558
   2   -0.080769
   3   -0.835756
b  1    0.883245
   2    0.949988
   3    1.329186
c  1   -0.407536
   2   -0.247410
dtype: float64

* Selection is even possible in some cases from an “inner” level:

In [111]:
data[:, 1]

a    1.315558
b    0.883245
c   -0.407536
dtype: float64
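
* These selections can also be written more explicitly with `.loc` and `xs`. The sketch below builds a small Series with named levels; the level names `outer` and `inner` are hypothetical.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("b", 1), ("b", 2)],
    names=["outer", "inner"],
)
data = pd.Series([10, 20, 30, 40], index=index)

one_value = data.loc[("a", 2)]       # a full key returns a single value
outer_b = data.loc["b"]              # partial key on the outer level
inner_1 = data.xs(1, level="inner")  # cross-section on the inner level

print(one_value)
print(outer_b)
print(inner_1)
```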

* Hierarchical indexing plays a critical role in reshaping data and group-based operations like forming a pivot table. For example, this data could be rearranged into a DataFrame using its `unstack` method:

In [112]:
dataDF = data.unstack()
dataDF

Unnamed: 0,1,2,3
a,1.315558,-0.080769,-0.835756
b,0.883245,0.949988,1.329186
c,-0.407536,-0.24741,
d,,-1.293176,0.70825


* The inverse operation of unstack is stack:

In [113]:
dataDF.stack()

a  1    1.315558
   2   -0.080769
   3   -0.835756
b  1    0.883245
   2    0.949988
   3    1.329186
c  1   -0.407536
   2   -0.247410
d  2   -1.293176
   3    0.708250
dtype: float64

## Missing data

* The occurrence of missing data is so prevalent that it pays to use tools like Pandas, which seamlessly integrates missing data handling so that it can be dealt with easily, and in the manner required by the analysis at hand.

* Missing data are represented in `Series` and `DataFrame` objects by the `NaN` floating point value. However, `None` is also treated as missing, since it is commonly used as such in other contexts (NumPy).

In [114]:
weirdSeries = pd.Series([np.nan, None, 'string', 1])
weirdSeries

0       NaN
1      None
2    string
3         1
dtype: object

In [115]:
weirdSeries.isnull()

0     True
1     True
2    False
3    False
dtype: bool

* Missing values may be dropped or indexed out:

In [116]:
population2

Istanbul Total    15029231.0
Ankara Total       5445026.0
Izmir Total        4279677.0
Bursa Total              NaN
Antalya Total            NaN
dtype: float64

In [117]:
population2.dropna()

Istanbul Total    15029231.0
Ankara Total       5445026.0
Izmir Total        4279677.0
dtype: float64

In [118]:
population2[population2.notnull()]

Istanbul Total    15029231.0
Ankara Total       5445026.0
Izmir Total        4279677.0
dtype: float64

In [119]:
dataDF

Unnamed: 0,1,2,3
a,1.315558,-0.080769,-0.835756
b,0.883245,0.949988,1.329186
c,-0.407536,-0.24741,
d,,-1.293176,0.70825


* By default, `dropna` drops entire rows in which one or more values are missing.

In [120]:
dataDF.dropna()

Unnamed: 0,1,2,3
a,1.315558,-0.080769,-0.835756
b,0.883245,0.949988,1.329186


* This can be overridden by passing the `how='all'` argument, which only drops a row when every field is a missing value.

In [121]:
dataDF.dropna(how='all')

Unnamed: 0,1,2,3
a,1.315558,-0.080769,-0.835756
b,0.883245,0.949988,1.329186
c,-0.407536,-0.24741,
d,,-1.293176,0.70825
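
* In `dataDF` no row is entirely missing, so `how='all'` returns it unchanged. A minimal sketch with a hypothetical frame `demoDF` (not part of the course data) that contains one fully missing row makes the difference from the default visible:

```python
import numpy as np
import pandas as pd

# Hypothetical frame for illustration: row 'c' is entirely missing.
demoDF = pd.DataFrame(
    {1: [1.0, np.nan, np.nan], 2: [2.0, 5.0, np.nan]},
    index=['a', 'b', 'c'],
)

anyDropped = demoDF.dropna()           # default how='any': only row 'a' survives
allDropped = demoDF.dropna(how='all')  # only the fully missing row 'c' is dropped

print(anyDropped.index.tolist())
print(allDropped.index.tolist())
```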


* This can be customized further with the `thresh` argument, which sets the minimum number of non-missing values a row must contain in order to be kept.

In [122]:
# Assign through .loc; chained indexing (dataDF[2]['c'] = ...) may not modify dataDF.
dataDF.loc['c', 2] = np.nan
dataDF

Unnamed: 0,1,2,3
a,1.315558,-0.080769,-0.835756
b,0.883245,0.949988,1.329186
c,-0.407536,,
d,,-1.293176,0.70825


In [123]:
dataDF.dropna(thresh=2)

Unnamed: 0,1,2,3
a,1.315558,-0.080769,-0.835756
b,0.883245,0.949988,1.329186
d,,-1.293176,0.70825
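
* To make the meaning of `thresh` concrete, here is a small sketch on a hypothetical frame `threshDF` whose rows hold three, two, one and zero non-missing values. With `thresh=2`, only rows with at least two non-missing values are kept:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: rows hold 3, 2, 1 and 0 non-missing values respectively.
threshDF = pd.DataFrame(
    [[1.0, 2.0, 3.0],
     [1.0, 2.0, np.nan],
     [1.0, np.nan, np.nan],
     [np.nan, np.nan, np.nan]],
    index=['a', 'b', 'c', 'd'],
    columns=[1, 2, 3],
)

# Keep every row that has at least two non-missing values.
kept = threshDF.dropna(thresh=2)
print(kept.index.tolist())
```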


* If we want to drop missing values column-wise instead of row-wise, we use `axis=1`.

In [124]:
# Assign a single random value through .loc instead of chained indexing.
dataDF.loc['d', 1] = np.random.randn()
dataDF

Unnamed: 0,1,2,3
a,1.315558,-0.080769,-0.835756
b,0.883245,0.949988,1.329186
c,-0.407536,,
d,0.227902,-1.293176,0.70825


In [125]:
dataDF.dropna(axis=1)

Unnamed: 0,1
a,1.315558
b,0.883245
c,-0.407536
d,0.227902


* Rather than omitting missing data from an analysis, in some cases it may be suitable to fill the missing value in, either with a default value (such as zero) or a value that is either imputed or carried forward/backward from similar data points. <br>
<br>
* We can do this programmatically in Pandas with the `fillna` method.<br>

In [126]:
dataDF

Unnamed: 0,1,2,3
a,1.315558,-0.080769,-0.835756
b,0.883245,0.949988,1.329186
c,-0.407536,,
d,0.227902,-1.293176,0.70825


In [127]:
dataDF.fillna(0)

Unnamed: 0,1,2,3
a,1.315558,-0.080769,-0.835756
b,0.883245,0.949988,1.329186
c,-0.407536,0.0,0.0
d,0.227902,-1.293176,0.70825


In [128]:
dataDF.fillna({2: 1.5, 3:0.50})

Unnamed: 0,1,2,3
a,1.315558,-0.080769,-0.835756
b,0.883245,0.949988,1.329186
c,-0.407536,1.5,0.5
d,0.227902,-1.293176,0.70825


* Notice that `fillna` by default returns a new object with the desired filling behavior, rather than changing the `Series` or  `DataFrame` in place.

In [129]:
dataDF

Unnamed: 0,1,2,3
a,1.315558,-0.080769,-0.835756
b,0.883245,0.949988,1.329186
c,-0.407536,,
d,0.227902,-1.293176,0.70825


* If you would rather modify the object itself, you can alter values in place by passing `inplace=True`.

In [130]:
dataDF.fillna({2: 1.5, 3:0.50}, inplace=True)
dataDF

Unnamed: 0,1,2,3
a,1.315558,-0.080769,-0.835756
b,0.883245,0.949988,1.329186
c,-0.407536,1.5,0.5
d,0.227902,-1.293176,0.70825
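
* The difference between the two behaviors can be checked on a small, hypothetical `Series` (the names below are illustrative only). The sketch shows that plain `fillna` leaves the original untouched, while `inplace=True` changes the object it is called on:

```python
import numpy as np
import pandas as pd

# Hypothetical series with one missing value.
original = pd.Series([1.0, np.nan, 3.0])

# Default behavior: a new Series is returned, the original keeps its NaN.
filled = original.fillna(0)

# In-place behavior, applied to a copy so that 'original' stays available.
inplaceCopy = original.copy()
inplaceCopy.fillna(0, inplace=True)

print(original.isnull().sum(), filled.isnull().sum(), inplaceCopy.isnull().sum())
```

* Reassigning the result (`s = s.fillna(0)`) is a common alternative to `inplace=True`.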


* Missing values can also be interpolated, using any one of a variety of methods:

In [131]:
# Reintroduce missing values through .loc (chained indexing may not modify dataDF).
dataDF.loc['c', 2] = np.nan
dataDF.loc['d', 3] = np.nan
dataDF

Unnamed: 0,1,2,3
a,1.315558,-0.080769,-0.835756
b,0.883245,0.949988,1.329186
c,-0.407536,,0.5
d,0.227902,-1.293176,
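
* As a minimal sketch of interpolation, the `interpolate` method fills gaps from the neighbouring values. The series `gappy` below is hypothetical, and only the default linear method is shown:

```python
import numpy as np
import pandas as pd

# Hypothetical series with two consecutive gaps between known values.
gappy = pd.Series([1.0, np.nan, np.nan, 4.0])

# The default method is linear: gaps are filled evenly between the neighbours.
linear = gappy.interpolate()
print(linear.tolist())
```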


* We can also propagate non-null values forward or backward.

In [132]:
# ffill() is equivalent to fillna(method='ffill'), which recent Pandas versions deprecate.
dataDF.ffill()

Unnamed: 0,1,2,3
a,1.315558,-0.080769,-0.835756
b,0.883245,0.949988,1.329186
c,-0.407536,0.949988,0.5
d,0.227902,-1.293176,0.5

* Another option is to fill the missing values in each column with that column's mean:


In [133]:
dataDF.fillna(dataDF.mean())

Unnamed: 0,1,2,3
a,1.315558,-0.080769,-0.835756
b,0.883245,0.949988,1.329186
c,-0.407536,-0.141319,0.5
d,0.227902,-1.293176,0.331143
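
* The filling strategies above can be compared side by side. This sketch uses a hypothetical series `s`. `bfill` is the backward counterpart of forward filling, and a leading missing value has nothing to be filled from when filling forward:

```python
import numpy as np
import pandas as pd

# Hypothetical series with missing values at the start and in the middle.
s = pd.Series([np.nan, 2.0, np.nan, 4.0])

forward = s.ffill()              # carries the last valid value forward
backward = s.bfill()             # pulls the next valid value backward
meanFilled = s.fillna(s.mean())  # mean of the non-missing values (2.0 and 4.0)

print(forward.tolist())
print(backward.tolist())
print(meanFilled.tolist())
```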


## Data summarization

* We often wish to summarize data in `Series` or `DataFrame` objects, so that they can more easily be understood or compared with similar data.<br>
<br>
* The NumPy package contains several functions that are useful here, but several summarization or reduction methods are built into Pandas data structures.<br>

In [134]:
marvelDF.sum()

name           Spider-Man (Peter Parker)Captain America (Stev...
urlslug        \/Spider-Man_(Peter_Parker)\/Captain_America_(...
APPEARANCES                                               260270
Year                                                 3.08878e+07
dtype: object

* Clearly, `sum` is more meaningful for some columns than others (for example, the total of `APPEARANCES`).<br>

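* For methods like `mean`, for which application to string variables is not just meaningless but impossible, these columns are excluded. Older Pandas versions do this automatically; from Pandas 2.0 onward you need to pass `numeric_only=True` explicitly:

In [135]:
marvelDF.mean(numeric_only=True)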

APPEARANCES      15.893381
Year           1984.951803
dtype: float64

* An important difference between NumPy's functions and Pandas' methods is how they handle missing data: NumPy provides separate functions such as `nansum` for this, whereas the Pandas methods skip missing values by default.

In [136]:
dataDF

Unnamed: 0,1,2,3
a,1.315558,-0.080769,-0.835756
b,0.883245,0.949988,1.329186
c,-0.407536,,0.5
d,0.227902,-1.293176,


In [137]:
dataDF.mean()

1    0.504793
2   -0.141319
3    0.331143
dtype: float64

* Sometimes we do not want to ignore missing values but instead let the `NaN` propagate:

In [138]:
dataDF.mean(skipna=False)

1    0.504793
2         NaN
3         NaN
dtype: float64
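
* The contrast between NumPy and Pandas can be seen on a small hypothetical array. This is only a sketch of the default behaviors described above:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value.
values = np.array([1.0, np.nan, 3.0])
s = pd.Series(values)

numpySum = np.sum(values)            # the NaN propagates
numpyNanSum = np.nansum(values)      # separate NumPy function that ignores NaN
pandasSum = s.sum()                  # Pandas skips NaN by default
pandasStrict = s.sum(skipna=False)   # the NaN propagates, as in np.sum

print(numpySum, numpyNanSum, pandasSum, pandasStrict)
```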

* A useful summarization that gives a quick snapshot of multiple statistics for a `Series` or `DataFrame` is `describe`:

In [139]:
dataDF.describe()

Unnamed: 0,1,2,3
count,4.0,3.0,3.0
mean,0.504793,-0.141319,0.331143
std,0.754891,1.122807,1.092304
min,-0.407536,-1.293176,-0.835756
25%,0.069043,-0.686972,-0.167878
50%,0.555574,-0.080769,0.5
75%,0.991324,0.43461,0.914593
max,1.315558,0.949988,1.329186


* `describe` can detect non-numeric data and sometimes yield useful information about it.
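
* As a small illustration, calling `describe` on a hypothetical `Series` of strings reports the number of values, the number of distinct values, the most frequent value and its frequency, instead of numeric statistics:

```python
import pandas as pd

# Hypothetical non-numeric data.
labels = pd.Series(['hero', 'villain', 'hero', 'hero'])

summary = labels.describe()
print(summary)
```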

## Writing Data to Files

* Pandas can also export data to a variety of storage formats.<br>
<br>
* We will bring your attention to just a couple of these.

In [140]:
myDF = populationDF['2000']
myDF.to_csv("data/roundedPopulation2000.csv")

* The `to_csv` method writes a `DataFrame` to a comma-separated values (csv) file.<br>
<br>
* You can specify custom delimiters (via the `sep` argument), how missing values are written (via the `na_rep` argument), whether the index is written (via the `index` argument), and whether the header is included (via the `header` argument), among other options.
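
* The options above can be tried without touching the file system, because `to_csv` returns the CSV text as a string when no path is given. The frame `exportDF` below is hypothetical and reuses two values from the population example:

```python
import numpy as np
import pandas as pd

# Hypothetical frame for illustration, with one missing population value.
exportDF = pd.DataFrame({
    'city': ['Istanbul', 'Bursa'],
    'population': [15029231.0, np.nan],
})

# Semicolon delimiter, missing values written as 'NA', index column left out.
csvText = exportDF.to_csv(sep=';', na_rep='NA', index=False)
print(csvText)
```

* Passing a file name as the first argument writes the same text to disk instead.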