---
<center><h1>Lesson 2 - Basic intro to pandas</h1></center>
---
---

<center><h2>Part 1. Introduction to pandas data structures</h2></center>

---

<img src="images/pandas_logo.png" width="500px">

[Pandas](http://pandas.pydata.org/pandas-docs/stable/) is a powerful and flexible open source Python library for data analysis. Python has long been great for data preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling us to carry out an entire data analysis workflow in Python without switching to a more domain-specific language like R, or loading working data into a database and using SQL (or worse, Excel). pandas makes Python great for analysis.

Library Highlights:

- A fast and efficient DataFrame object for data manipulation with integrated indexing;
- Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
- Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
- Flexible reshaping and pivoting of data sets;
- Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
- Columns can be inserted and deleted from data structures for size mutability;
- Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;
- High performance merging and joining of data sets;
- Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
- Time series functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. You can even create domain-specific time offsets and join time series without losing data;
- Highly optimized for performance, with critical code paths written in Cython or C;
- Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.


# Table of Contents
- [Introduction to pandas data structures](#Introduction-to-pandas-data-structures)
    * [Series](#Series)
    * [DataFrames](#DataFrames)
    * [Read data from files and write data to files](#Read-data-from-files-and-write-data-to-files)
    - [*Exercise 1.1*](#Exercise-1.1)
    - [*Exercise 1.2*](#Exercise-1.2)

To use pandas, simply import it as a Python module, usually together with NumPy.

In [1]:
import pandas as pd
import numpy as np
import random

## Introduction to pandas data structures

[[back to top]](#Table-of-Contents)

### Series

[[back to top]](#Table-of-Contents)

From here on we use Python 2.7 and assume that you have basic knowledge of Python.
pandas introduces two new data structures to Python, Series and DataFrame, both of which are built on top of NumPy.
A Series is a one-dimensional object similar to an array, a list, or a column in a table. It assigns an index label to each item. By default, the labels run from 0 to N, where N is the length of the Series minus one.

In [2]:
my_first_series = pd.Series([1, 'hello, world', np.nan, -1234567890, 3.14, 0])
my_first_series

0               1
1    hello, world
2             NaN
3     -1234567890
4            3.14
5               0
dtype: object

You can also specify an index when creating the Series.

In [3]:
my_first_series = pd.Series([1, 'hello, world', np.nan, -1234567890, 3.14, 0], index=['A', 'B', 'unknown', 0, 'C', 'D'])
my_first_series

A                     1
B          hello, world
unknown             NaN
0           -1234567890
C                  3.14
D                     0
dtype: object

The Series constructor can convert a Python dictionary, using the keys of the dictionary as its index.

In [4]:
my_dict = {'John': 10, 'Annet': 12, 'Robert': 5, 'Jack': 55}
my_first_series = pd.Series(my_dict)
my_first_series

John      10
Annet     12
Robert     5
Jack      55
dtype: int64

Here we replace the previous Series `my_first_series` with a new one. You can then use the index to select items from the Series:

In [5]:
my_first_series['Jack']

55

or

In [6]:
my_first_series[['Jack', 'Robert']]

Jack      55
Robert     5
dtype: int64

To see all index labels of the Series, use the `index` attribute:

In [7]:
my_first_series.index

Index(['John', 'Annet', 'Robert', 'Jack'], dtype='object')

Similarly, you can display only the values:

In [8]:
my_first_series.values

array([10, 12,  5, 55])

pandas provides the method [`rename({old_name_1: new_name_1, old_name_2: new_name_2, ... })`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html), which returns a new object with the given labels renamed, so changing index names is easy.
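
As a minimal sketch of `rename` on a Series (assuming Python 3 and a reasonably recent pandas; the data and the name `scores` are hypothetical and not part of the lesson):

```python
import pandas as pd

# Hypothetical example data, not taken from the lesson.
scores = pd.Series([10, 12, 5], index=['John', 'Annet', 'Robert'])

# rename() returns a new Series; labels missing from the mapping stay unchanged.
renamed = scores.rename({'John': 'Jonathan', 'Annet': 'Anna'})

print(renamed.index.tolist())  # ['Jonathan', 'Anna', 'Robert']
print(scores.index.tolist())   # ['John', 'Annet', 'Robert'] (the original is untouched)
```

Because the original object is left as it was, assign the result back to a variable if you want to keep the new labels.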

We can also filter the Series. Let’s find all items that are less than 10 and odd:

In [9]:
my_first_series[(my_first_series < 10) & 
(my_first_series % 2 != 0)]

Robert    5
dtype: int64

You can achieve the same result with the plain Python dictionary `my_dict` defined earlier, for example like this:

In [10]:
filtered = {}
for key, val in my_dict.items():
    if val < 10 and val % 2 != 0:
        filtered[key] = val
filtered

{'Robert': 5}

or using a dict comprehension:

In [11]:
filtered = {key: val for key, val in my_dict.items() if val < 10 and val % 2 != 0}
filtered

{'Robert': 5}

The previous example demonstrates just one of the many advantages of pandas over pure Python.
A Series is a mutable data structure, and you can easily change the value of any item:

In [12]:
print("Robert's previous value : {}".format(my_first_series['Robert']))
my_first_series['Robert'] = 15
print("Robert's new value : {}".format(my_first_series['Robert']))

Robert's previous value : 5
Robert's new value : 15


or add new values:

In [13]:
my_first_series['Susan'] = None
my_first_series = my_first_series.append(pd.Series({'Joshua': 0}))
my_first_series

John        10
Annet       12
Robert      15
Jack        55
Susan     None
Joshua       0
dtype: object
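
One caveat for newer installations: `Series.append` was deprecated in pandas 1.4 and removed in pandas 2.0. A minimal sketch of the same step with `pd.concat`, assuming Python 3 and a recent pandas version (the names `s` and `extended` are illustrative only):

```python
import pandas as pd

# Illustrative data mirroring the Series used above.
s = pd.Series([10, 12, 15, 55], index=['John', 'Annet', 'Robert', 'Jack'])

# pd.concat() takes a list of Series and returns a new, combined Series.
extended = pd.concat([s, pd.Series({'Joshua': 0})])

print(extended.index.tolist())  # ['John', 'Annet', 'Robert', 'Jack', 'Joshua']
print(extended['Joshua'])       # 0
```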

If you need to apply a mathematical operation to the Series items, you can do it as shown below:

In [14]:
my_first_series_2 = my_first_series / 1.5
my_first_series_2

John      6.66667
Annet           8
Robert         10
Jack      36.6667
Susan         NaN
Joshua          0
dtype: object

Thus the mathematical operation is applied to each Series item, much like looping over a Python list.
In the same way you can add, subtract, etc. two or more Series:

In [15]:
my_first_series_total = my_first_series + my_first_series_2
my_first_series_total

John      16.6667
Annet          20
Robert         25
Jack      91.6667
Susan         NaN
Joshua          0
dtype: object

`NULL`/`NaN` checking can be performed with `isnull()` and `notnull()`.

In [16]:
my_first_series_total.notnull()

John       True
Annet      True
Robert     True
Jack       True
Susan     False
Joshua     True
dtype: bool

In [17]:
my_first_series_total.isnull()

John      False
Annet     False
Robert    False
Jack      False
Susan      True
Joshua    False
dtype: bool

### DataFrames
[[back to top]](#Table-of-Contents)

A DataFrame is a tabular data structure made up of rows and columns, much like a spreadsheet or a database table. Along with Series, it is a primary data structure in pandas. We can think of a DataFrame as a group of Series objects, one per named column, that share a row index. Arithmetic operations align on both row and column labels.
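
The two ideas in this paragraph can be sketched as follows, assuming Python 3 and a recent pandas; the frames `df_a` and `df_b` are hypothetical and exist only for this illustration:

```python
import pandas as pd

# Hypothetical data used only for illustration.
df_a = pd.DataFrame({'x': [1, 2], 'y': [10, 20]}, index=['r1', 'r2'])
df_b = pd.DataFrame({'x': [100, 200], 'z': [5, 6]}, index=['r2', 'r3'])

# Each column is a Series, and all columns share the DataFrame's row index.
print(type(df_a['x']).__name__)            # Series
print(df_a['x'].index.equals(df_a.index))  # True

# Addition matches cells by row label and column label;
# cells that are not present in both frames become NaN.
total = df_a + df_b
print(total.loc['r2', 'x'])                # 102.0
print(total.isnull().sum().sum())          # 8
```

Only the cell `('r2', 'x')` exists in both frames, so it is the only one of the nine cells in the result that holds a number.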

One of the simplest ways to create a DataFrame from common Python data structures is to pass a dictionary of lists to the DataFrame constructor. To fix the column order we can use the `columns` parameter, because older pandas versions order the columns alphabetically by default (recent versions keep the insertion order of the dictionary).

Let’s create a DataFrame listing the World Cup finals since 1990, with the year, the winner, the runner-up and the final score:

In [18]:
data = {'year': [1990, 1994, 1998, 2002, 2006, 2010, 2014],
        'winner': ['Germany', 'Brazil', 'France', 'Brazil','Italy', 'Spain', 'Germany'],
        'runner-up': ['Argentina', 'Italy', 'Brazil','Germany', 'France', 'Netherlands', 'Argentina'],
        'final score': ['1-0', '0-0 (pen)', '3-0', '2-0', '1-1 (pen)', '1-0', '1-0'] }
world_cup = pd.DataFrame(data, columns=['year', 'winner', 'runner-up', 'final score'])
world_cup

   year   winner    runner-up final score
0  1990  Germany    Argentina         1-0
1  1994   Brazil        Italy   0-0 (pen)
2  1998   France       Brazil         3-0
3  2002   Brazil      Germany         2-0
4  2006    Italy       France   1-1 (pen)
5  2010    Spain  Netherlands         1-0
6  2014  Germany    Argentina         1-0


Another way to create a DataFrame is to use a Python list of dictionaries:

In [19]:
data_2 = [{'year': 1990, 'winner': 'Germany', 'runner-up': 'Argentina', 'final score': '1-0'}, 
          {'year': 1994, 'winner': 'Brazil', 'runner-up': 'Italy', 'final score': '0-0 (pen)'},
          {'year': 1998, 'winner': 'France', 'runner-up': 'Brazil', 'final score': '3-0'}, 
          {'year': 2002, 'winner': 'Brazil', 'runner-up': 'Germany', 'final score': '2-0'}, 
          {'year': 2006, 'winner': 'Italy','runner-up': 'France', 'final score': '1-1 (pen)'}, 
          {'year': 2010, 'winner': 'Spain', 'runner-up': 'Netherlands', 'final score': '1-0'}, 
          {'year': 2014, 'winner': 'Germany', 'runner-up': 'Argentina', 'final score': '1-0'}
         ]
world_cup = pd.DataFrame(data_2)
world_cup

  final score    runner-up   winner  year
0         1-0    Argentina  Germany  1990
1   0-0 (pen)        Italy   Brazil  1994
2         3-0       Brazil   France  1998
3         2-0      Germany   Brazil  2002
4   1-1 (pen)       France    Italy  2006
5         1-0  Netherlands    Spain  2010
6         1-0    Argentina  Germany  2014


If you want to see only the first 3 rows of the previous table, use the method `head(n)`, where `n` is the number of rows to return from the top of the table.

Note that `head()` is equivalent to `head(5)`.

In [20]:
world_cup.head(3)

  final score  runner-up   winner  year
0         1-0  Argentina  Germany  1990
1   0-0 (pen)      Italy   Brazil  1994
2         3-0     Brazil   France  1998


There is also the method `tail(n)`, which works like `head(n)` but returns the last `n` rows of the DataFrame:

In [21]:
world_cup.tail(2)

  final score    runner-up   winner  year
5         1-0  Netherlands    Spain  2010
6         1-0    Argentina  Germany  2014


You can also use the well-known Python slices here:

In [23]:
world_cup[2:3]

  final score runner-up  winner  year
2         3-0    Brazil  France  1998


### Read data from files and write data to files
[[back to top]](#Table-of-Contents)

Very often we need to work with a dataset saved in a text file of a specific format (txt, CSV, JSON, etc.) or in a database (MySQL, in particular). pandas lets us load data from such files or databases into a DataFrame (databases are covered in a later part of this series about pandas). Let’s see how to read and write a dataset for different file types:

1\. CSV file(s) (“Comma Separated Values” is a text format for tabular data; each line of the file corresponds to one row of the table, and the values within a row are separated by a delimiter, usually a comma):

**Reading:**

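    df = pd.read_csv("path/to/the/csv/file/for/reading")
    
**Writing:**
    
    df.to_csv("path/to/the/folder/where/you/want/save/csv/file")
    
where you should set the path to the CSV file, for example an absolute path like `"C:/User/csv_file_with_data.csv"`. Forward slashes are used here because in an ordinary Python string a backslash starts an escape sequence (`"\t"` is a tab character).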

2\. Excel file(s) (\*.xls and \*.xlsx):  

**Reading:**

    df = pd.read_excel("path/to/the/excel/file/for/reading", "sheet_name")
    
**Writing:**

    df.to_excel("path/to/the/folder/where/you/want/save/excel/file")
    
where you should set the path to the Excel file and the sheet name, like “Sheet1”.

3\. txt file(s) (a txt file can be read as a CSV file with a different separator (delimiter); below we assume that the columns are separated by tabs):

**Reading:**

    df = pd.read_csv("path/to/the/txt/file/for/reading", sep='\t')
    
**Writing:**

    df.to_csv("path/to/the/folder/where/you/want/save/txt/file", sep='\t')
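
A small round trip through a tab-separated file can be sketched like this, assuming Python 3 and a recent pandas. The table, the file name `people.txt` and the temporary directory are hypothetical; `index=False` simply skips writing the row labels:

```python
import os
import tempfile

import pandas as pd

# Hypothetical table used only for illustration.
df_out = pd.DataFrame({'name': ['Ann', 'Bob'], 'age': [31, 27]})

# Write to a temporary directory so that no real files are touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    txt_path = os.path.join(tmp_dir, 'people.txt')
    df_out.to_csv(txt_path, sep='\t', index=False)

    # Look at the raw header line to confirm the tab separator.
    with open(txt_path) as f:
        first_line = f.readline()

    # Read the file back using the same separator.
    df_in = pd.read_csv(txt_path, sep='\t')

print(repr(first_line))       # 'name\tage\n'
print(df_in['age'].tolist())  # [31, 27]
```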
    
4\. JSON files (JSON is an open-standard format that uses human-readable text to transmit data objects consisting of attribute–value pairs. It is widely used for asynchronous browser/server communication, and it looks very similar to a Python dictionary):

**Reading:**

    df = pd.read_json("path/to/the/json/file/for/reading")
    
**Writing:**

    df.to_json("path/to/the/folder/where/you/want/save/json/file")

Note that path separators differ between operating systems (OS): "/" on some and "\" on others. The best practice is to build paths with the Python `os` library. Suppose we want to save a DataFrame in the folder `target_folder` within the following hierarchy:

    main_folder
    |----folder1
    |----folder2
         |----sub_folder1
         |----target_folder
         |----sub_folder3
    |----folder3
         |----sub_folder1
         
and name the file `"my_file.csv"`. The code for saving with the `os` library is the following:

    import os
    df.to_csv(os.path.join("main_folder", "folder2", "target_folder", "my_file.csv"))
    
Thus, `os.path.join` joins all the directories into one path using the separator of the current OS.
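
To see what `os.path.join` produces without writing anything to disk, here is a minimal sketch (the variable name `target_path` is an assumption made for this example):

```python
import os

# Build the path from its parts; the separator is chosen by the OS.
target_path = os.path.join("main_folder", "folder2", "target_folder", "my_file.csv")

# os.sep is "/" on Linux and macOS and "\\" on Windows.
print(target_path)
print(target_path.split(os.sep))      # ['main_folder', 'folder2', 'target_folder', 'my_file.csv']
print(os.path.basename(target_path))  # my_file.csv
```

Keep in mind that `to_csv` does not create missing folders, so the target directory has to exist before saving.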

Let's save the `world_cup` DataFrame as CSV and JSON files:

In [24]:
world_cup.to_csv("world_cup.csv")
print("DataFrame was written")

# Check whether the "world_cup.csv" exists
import os
print(os.path.exists("world_cup.csv"))

# Read the file 
print
with open("world_cup.csv") as f:
    print(f.read())

DataFrame was written
True
,final score,runner-up,winner,year
0,1-0,Argentina,Germany,1990
1,0-0 (pen),Italy,Brazil,1994
2,3-0,Brazil,France,1998
3,2-0,Germany,Brazil,2002
4,1-1 (pen),France,Italy,2006
5,1-0,Netherlands,Spain,2010
6,1-0,Argentina,Germany,2014



To save a CSV file without the index, use the `index=False` parameter.
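
The effect of `index=False` can be checked with a short sketch, assuming Python 3. It writes into in-memory buffers instead of files, and the small table is hypothetical:

```python
import io

import pandas as pd

# Hypothetical table used only for illustration.
df = pd.DataFrame({'team': ['Spain', 'Italy'], 'titles': [1, 4]})

# to_csv() accepts any file-like object, so an in-memory buffer works too.
with_index = io.StringIO()
df.to_csv(with_index)

without_index = io.StringIO()
df.to_csv(without_index, index=False)

# Compare the header lines of the two outputs.
print(with_index.getvalue().splitlines()[0])     # ,team,titles
print(without_index.getvalue().splitlines()[0])  # team,titles
```

The leading comma in the first header line is the unnamed index column.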

In [25]:
world_cup.to_json("world_cup.json")
print("DataFrame was written")

# Check whether the "world_cup.json" exists
import os
print(os.path.exists("world_cup.json"))

# Read the file 
print
with open("world_cup.json") as f:
    print(f.read())

# Let's prettify print
import json
with open("world_cup.json") as f:
    content = json.load(f)
content

DataFrame was written
True
{"final score":{"0":"1-0","1":"0-0 (pen)","2":"3-0","3":"2-0","4":"1-1 (pen)","5":"1-0","6":"1-0"},"runner-up":{"0":"Argentina","1":"Italy","2":"Brazil","3":"Germany","4":"France","5":"Netherlands","6":"Argentina"},"winner":{"0":"Germany","1":"Brazil","2":"France","3":"Brazil","4":"Italy","5":"Spain","6":"Germany"},"year":{"0":1990,"1":1994,"2":1998,"3":2002,"4":2006,"5":2010,"6":2014}}


{'final score': {'0': '1-0',
  '1': '0-0 (pen)',
  '2': '3-0',
  '3': '2-0',
  '4': '1-1 (pen)',
  '5': '1-0',
  '6': '1-0'},
 'runner-up': {'0': 'Argentina',
  '1': 'Italy',
  '2': 'Brazil',
  '3': 'Germany',
  '4': 'France',
  '5': 'Netherlands',
  '6': 'Argentina'},
 'winner': {'0': 'Germany',
  '1': 'Brazil',
  '2': 'France',
  '3': 'Brazil',
  '4': 'Italy',
  '5': 'Spain',
  '6': 'Germany'},
 'year': {'0': 1990,
  '1': 1994,
  '2': 1998,
  '3': 2002,
  '4': 2006,
  '5': 2010,
  '6': 2014}}

Now let's read the CSV and JSON files we just saved into new DataFrames:

In [26]:
df_csv = pd.read_csv("world_cup.csv")
df_csv

  Unnamed: 0 final score    runner-up   winner  year
0           0         1-0    Argentina  Germany  1990
1           1   0-0 (pen)        Italy   Brazil  1994
2           2         3-0       Brazil   France  1998
3           3         2-0      Germany   Brazil  2002
4           4   1-1 (pen)       France    Italy  2006
5           5         1-0  Netherlands    Spain  2010
6           6         1-0    Argentina  Germany  2014


As you can see, `df_csv` contains an additional column `Unnamed: 0`, which holds the index that was saved to the file. You can avoid it by passing the `index_col=0` parameter.

In [27]:
df_csv = pd.read_csv("world_cup.csv", index_col=0)
df_csv

  final score    runner-up   winner  year
0         1-0    Argentina  Germany  1990
1   0-0 (pen)        Italy   Brazil  1994
2         3-0       Brazil   France  1998
3         2-0      Germany   Brazil  2002
4   1-1 (pen)       France    Italy  2006
5         1-0  Netherlands    Spain  2010
6         1-0    Argentina  Germany  2014


In [28]:
df_json = pd.read_json("world_cup.json")
df_json

  final score    runner-up   winner  year
0         1-0    Argentina  Germany  1990
1   0-0 (pen)        Italy   Brazil  1994
2         3-0       Brazil   France  1998
3         2-0      Germany   Brazil  2002
4   1-1 (pen)       France    Italy  2006
5         1-0  Netherlands    Spain  2010
6         1-0    Argentina  Germany  2014


> ### Exercise 1.1

> - Rename `John` to `Barbara` in `my_first_series` and change value of `Jack` from `55` to `-10`.

In [30]:
# type your code here
my_first_series = my_first_series.rename({'John':'Barbara'})
my_first_series['Jack'] = -10
print (my_first_series)

Barbara      10
Annet        12
Robert       15
Jack        -10
Susan      None
Joshua        0
dtype: object


In [49]:
from test_helper import Test
 
Test.assertEqualsHashed(my_first_series, 'f0087d59110eb5eb365ebef8a712ec876c5dedcd', 
                                         'Incorrect content of "my_first_series"', "Exercise 1.1 is successful")

1 test failed. Incorrect content of "my_first_series"


> ### Exercise 1.2

> - Find all positive values in `my_first_series` using filtering and write the resulting Series to the variable `positive`.

> - Create a new Series `new_series`, which contains all items from `my_first_series` and two new items `(Ashly, NaN)` and `(Lukas, -5)`. 

> - First add the Series `new_series` and `my_first_series`, and then multiply them. Try to explain the results you get.

> - Save `new_series` as a CSV file in the folder that contains the current IPython notebook. Name the file `"new_series.csv"`. 

In [31]:
# type your code here

positive = my_first_series[my_first_series > 0]
print (positive)
new_series = my_first_series.append(pd.Series({'Ashly': np.nan, 'Lukas': -5}))
print (new_series)
print (new_series+my_first_series)
print (new_series*my_first_series)
df = new_series.to_frame()
print (df)
df.to_csv('new_series.csv')

Barbara    10
Annet      12
Robert     15
dtype: object
Barbara      10
Annet        12
Robert       15
Jack        -10
Susan      None
Joshua        0
Ashly       NaN
Lukas        -5
dtype: object
Annet       24
Ashly      NaN
Barbara     20
Jack       -20
Joshua       0
Lukas      NaN
Robert      30
Susan      NaN
dtype: object
Annet      144
Ashly      NaN
Barbara    100
Jack       100
Joshua       0
Lukas      NaN
Robert     225
Susan      NaN
dtype: object
            0
Barbara    10
Annet      12
Robert     15
Jack      -10
Susan    None
Joshua      0
Ashly     NaN
Lukas      -5


In [45]:
Test.assertEqualsHashed(positive, 'cf6115bfe272789ec092c12878feb369009331eb', 
                                  'Incorrect content of "positive"', "Exercise 1.2.1 is successful")
Test.assertEqualsHashed(new_series, '9cebbc1f0b5d095414934775e1d778be51b44029', 
                                    'Incorrect content of "new_series"', "Exercise 1.2.2 is successful")
from os.path import exists
Test.assertEqualsHashed(exists('new_series.csv'), '88b33e4e12f75ac8bf792aebde41f1a090f3a612', 
                                    'File was not found', "Exercise 1.2.3 is successful")

1 test failed. Incorrect content of "positive"
1 test failed. Incorrect content of "new_series"
1 test passed. Exercise 1.2.3 is successful


<center><h3>Presented by <a target="_blank" rel="noopener noreferrer nofollow" href="http://datascience-school.com">datascience-school.com</a></h3></center>