# Data handling in Python

*Monday, 16 September*

## Contents

- [1. Introduction](#1.-Introduction)
- [2. pandas](#2.-pandas)
- [3. Matplotlib](#3.-Matplotlib)
  

### 1. Introduction

Python is a simple programming language that would be rather unexciting by itself. For researchers, its appeal comes from the vast and ever-growing collection of packages that let you do pretty much anything you want. Packages, libraries and modules (we use these terms interchangeably here) are collections of Python functions written by other users, which help you achieve what you want without having to code everything yourself.

There are four essential modules for empirical and mathematical projects: **NumPy**, **SciPy**, **pandas** and **matplotlib**. They all serve different purposes.
* **NumPy**, for mathematical functions and array-based numerical computing
* **SciPy**, for scientific computing routines (e.g. optimisation, integration, statistics) built on NumPy
* **pandas** (shorthand for "panel data") for data work
* **matplotlib**, for graphs

Beyond these common modules, we will use two other Python packages later in the course: **statsmodels** and **scikit-learn**.
* **statsmodels** provides functions for estimating most of the statistical models, statistical tests and descriptive statistics social scientists may be interested in. It generates outputs similar to those of Stata and R. **statsmodels** relies heavily on your data being structured as **pandas** data frames.
* **scikit-learn** offers off-the-shelf functions for data mining and machine learning. It builds on **NumPy**, **SciPy** and **matplotlib**. Classification or supervised learning, clustering or unsupervised learning, dimensionality reduction and model selection can all be performed with **scikit-learn**.

**To summarise.** At the beginning of each Notebook or Python script, you need to import the modules you will use. It is helpful to give short names to the modules in order to call them more easily later. To import and name modules, use the syntax shown below.

In [5]:
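import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt  # plotting functions live in the pyplot submodule

import statsmodels.api as sm     # the api submodule exposes the main models
import sklearn as sl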

You can call the function `help()` to get more information about a function. For instance, if we need explanations about NumPy's random number generator, we run

In [8]:
help(np.random.rand)

Help on built-in function rand:

rand(...) method of mtrand.RandomState instance
    rand(d0, d1, ..., dn)
    
    Random values in a given shape.
    
    Create an array of the given shape and populate it with
    random samples from a uniform distribution
    over ``[0, 1)``.
    
    Parameters
    ----------
    d0, d1, ..., dn : int, optional
        The dimensions of the returned array, should all be positive.
        If no argument is given a single Python float is returned.
    
    Returns
    -------
    out : ndarray, shape ``(d0, d1, ..., dn)``
        Random values.
    
    See Also
    --------
    random
    
    Notes
    -----
    This is a convenience function. If you want an interface that
    takes a shape-tuple as the first argument, refer to
    np.random.random_sample .
    
    Examples
    --------
    >>> np.random.rand(3,2)
    array([[ 0.14022471,  0.96360618],  #random
           [ 0.37601032,  0.25528411],  #random
           [ 0.49313049,  0.94909878]]) #random

To see all the functions of a module, write the module name followed by a `.`, then press `TAB`. An autocomplete menu will appear. You can do the same with a submodule such as `np.random`, and the menu will list everything it contains.

In [None]:
np.
np.random.
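
Outside an interactive session, where `TAB` completion is not available, Python's built-in `dir()` gives a similar listing. The snippet below is a minimal sketch; the variable name `public_names` is only illustrative.

```python
import numpy as np

# dir() returns the attribute names of an object as a list of strings.
# Names starting with an underscore are private by convention, so we skip them.
public_names = [name for name in dir(np.random) if not name.startswith('_')]

print(len(public_names))        # how many public names np.random exposes
print('rand' in public_names)   # the function documented above is among them
```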

### 2. pandas

The two basic objects we will manipulate when doing research with structured data are pandas **Series** and **DataFrames**.

**2.1. Series**

Series are one-dimensional arrays of integer, float, boolean or string values, each attached to an index label. By default, the values are indexed by integers starting from 0. <span style="color:Red"> MATLAB users beware </span>: indexing does not start at 1 here. You can, however, define your own index by passing an option such as `index=list('abcdef')` to `Series()`. We define a few series below.

In [60]:
S1 = pd.Series([1, 1, 2, 3, 4, 5])
S2 = pd.Series([0, 0.2 , 0.4, 0.6, 0.8, 1])
S3 = pd.Series([1, 0.8, 0.6, 0.4, 0.2, 0] ,index=list('abcdef'))
S4 = pd.Series(['hello','world','byebye'])
S5 = pd.Series([True, False, True, False])
S6 = pd.Series([0, 0.2, 'hello', True])

You can see the content and the indices of a series by typing its name and running the cell.

In [55]:
S1

0    1
1    1
2    2
3    3
4    4
5    5
dtype: int64

In [56]:
S2, S3, S4, S5, S6

(0    0.0
 1    0.2
 2    0.4
 3    0.6
 4    0.8
 5    1.0
 dtype: float64, a    1.0
 b    0.8
 c    0.6
 d    0.4
 e    0.2
 f    0.0
 dtype: float64, 0     hello
 1     world
 2    byebye
 dtype: object, 0     True
 1    False
 2     True
 3    False
 dtype: bool, 0        0
 1      0.2
 2    hello
 3     True
 dtype: object)
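
The `dtype` line at the bottom of each output reports how pandas stores the values. As a rough sketch (the series names below are made up for illustration), the type is inferred from the values, and a labelled series can be read either by label or by position:

```python
import pandas as pd

ints = pd.Series([1, 1, 2, 3])               # all integers: an integer dtype
floats = pd.Series([0, 0.2, 0.4])            # the integer 0 is stored as 0.0
mixed = pd.Series([0, 0.2, 'hello', True])   # mixed types fall back to "object"
labelled = pd.Series([1.0, 0.8, 0.6], index=list('abc'))

print(ints.dtype, floats.dtype, mixed.dtype)
print(labelled['b'])       # access by label
print(labelled.iloc[0])    # access by position, counting from 0
```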

**2.2. Indexing and functions of Series**

You can access the values of a series, its indices, a given element, a section of the series, or the elements that satisfy a given condition. See the examples below. Only the result of the last line of a cell is displayed.

In [64]:
S4.values       # display all the values of S4
S3.index        # display indices of S3
S5[0]           # first value of S5
S5[:]           # all values of S5
S5[1:3]         # second and third values of S5 (positions 1 and 2; position 3 is excluded)
S2[[5, 3, 1]]   # values of S2 indexed by 5, 3 and 1
S3[['f', 'b']]  # values of S3 indexed by f and b
S3 > 0.5        # returns a series of booleans, resulting from evaluating the condition at each value of S3
S3[S3 > 0.5]    # only returns the values of S3 satisfying the condition

a    1.0
b    0.8
c    0.6
dtype: float64
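
One detail worth illustrating is how slices behave. The sketch below uses the explicit `.iloc` (position-based) and `.loc` (label-based) accessors, which are not used elsewhere in this notebook: a slice by position excludes its end point, whereas a slice by label includes it.

```python
import pandas as pd

S3 = pd.Series([1, 0.8, 0.6, 0.4, 0.2, 0], index=list('abcdef'))

by_position = S3.iloc[1:3]    # positions 1 and 2: the end position is excluded
by_label = S3.loc['b':'c']    # labels 'b' to 'c': the end label is included
middle = S3[(S3 > 0.1) & (S3 < 0.9)]   # combine conditions with & (and) or | (or)

print(by_position)
print(by_label)
print(middle)
```

Each condition needs its own parentheses, because `&` binds more tightly than `>` and `<` in Python.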

Some Series functions are particularly useful. See below.

In [63]:
S1.size           # returns the number of values (scalar)
S1.prod()         # product of all values (scalar)
S1.sum()          # sum of all values (scalar)
S1.cumsum()       # cumulative sum (vector)
S1.max()          # maximum (scalar)
S1.idxmax()       # index of the maximum (scalar)
S2.round()        # series rounded to the nearest integer (float vector)
np.ceil(S2)       # series rounded up (float vector)
np.floor(S2)      # series rounded down (float vector)
S1.unique()       # series of unique values
S3.sort_values()  # values sorted in ascending order
S1.sort_index(ascending=False)  # sort in descending order of indices
S1.isin([1,3,5,7,9])            # returns series of booleans equal to "True" if the value is in the list of values provided

0     True
1     True
2    False
3     True
4    False
5     True
dtype: bool
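
To make a few of these functions concrete, here is a small worked example on the same values as `S1`. The variable names are illustrative only.

```python
import pandas as pd

S1 = pd.Series([1, 1, 2, 3, 4, 5])

total = S1.sum()                # 1 + 1 + 2 + 3 + 4 + 5
product = S1.prod()             # 1 * 1 * 2 * 3 * 4 * 5
running_total = S1.cumsum()     # partial sums, same length as S1
label_of_max = S1.idxmax()      # index label of the largest value
distinct = S1.unique()          # array of distinct values, in order of appearance

print(total, product, label_of_max)
print(list(running_total))
print(list(distinct))
```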

**2.3. Missing values**

In pandas, a missing value is coded `NaN`, for "Not a Number". It is also the result returned by NumPy after a forbidden mathematical operation such as the square root of a negative number:

In [65]:
np.sqrt(-2)

nan

The `count()` function does not count `NaN` as values. Let's see this by replacing a value of a series with the result of a forbidden mathematical operation.

In [66]:
S1.count()

6

In [67]:
S1[1] = np.inf - np.inf
S1.count()

5

In [68]:
S1

0    1.0
1    NaN
2    2.0
3    3.0
4    4.0
5    5.0
dtype: float64

The function `isnull()` returns a series of booleans equal to `True` if the corresponding value of the original series is `NaN`, and `False` otherwise.

In [69]:
S1.isnull()

0    False
1     True
2    False
3    False
4    False
5    False
dtype: bool
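
Because `True` counts as 1 when summed, `isnull()` also gives a quick way to count missing values. The sketch below uses a made-up series `s`; `notnull()` is the mirror image of `isnull()`.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.0, 3.0, np.nan])

n_missing = s.isnull().sum()    # True counts as 1, so this is the number of NaN
n_present = s.count()           # count() ignores NaN
present = s[s.notnull()]        # keep only the non-missing values

# NaN is not equal to itself, so "s == np.nan" cannot be used to find missing values
nan_equals_itself = (np.nan == np.nan)

print(n_missing, n_present, nan_equals_itself)
```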

When cleaning data in Python, we can delete all the `NaN` of a Series with the `dropna` function. This deletes the missing values, yet it keeps the old index labels.

In [72]:
S1 = S1.dropna()
S1

0    1.0
2    2.0
3    3.0
4    4.0
5    5.0
dtype: float64
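
If the gap left in the index is unwanted, one option is to renumber the series after dropping. The sketch below also shows `fillna`, an alternative that replaces missing values instead of deleting them; the replacement value 0 is an arbitrary choice made for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.0, 3.0])

dropped = s.dropna()                          # index labels are now 0, 2, 3
renumbered = dropped.reset_index(drop=True)   # index labels are 0, 1, 2 again
filled = s.fillna(0)                          # replace NaN rather than drop it

print(list(dropped.index))
print(list(renumbered.index))
print(list(filled))
```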

**2.4. Data frames**

In pandas, a data frame is simply a table of structured data, with variables as columns and observations as rows. Each column can have a **label**, which we can simply think of as a variable name here. By default, observations are indexed by integers, starting from 0, but we can re-index observations at will.

We create below a data frame of 10 observations of 3 random numbers drawn from the uniform distribution on $[0, 1)$.

In [78]:
randomTable = pd.DataFrame(np.random.rand(10,3))
randomTable

|  | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0.745625 | 0.503069 | 0.705821 |
| 1 | 0.048162 | 0.61289 | 0.697716 |
| 2 | 0.159873 | 0.670215 | 0.363837 |
| 3 | 0.636976 | 0.63698 | 0.597105 |
| 4 | 0.627611 | 0.50112 | 0.937945 |
| 5 | 0.560617 | 0.310853 | 0.217623 |
| 6 | 0.197659 | 0.982266 | 0.084955 |
| 7 | 0.933875 | 0.903979 | 0.144268 |
| 8 | 0.028542 | 0.111783 | 0.315832 |
| 9 | 0.382338 | 0.855089 | 0.591182 |
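
Since the numbers are random, re-running the cell gives a different table each time. If reproducible draws are needed, one possible approach is to seed NumPy's random generator, as sketched below with `np.random.default_rng` (a different interface from the `np.random.rand` call used above). The seed value 0 is arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)          # the seed value is arbitrary
table = pd.DataFrame(rng.random((10, 3)))    # 10 rows, 3 columns, values in [0, 1)

# A second generator created with the same seed reproduces the same numbers
same_table = pd.DataFrame(np.random.default_rng(seed=0).random((10, 3)))

print(table.shape)
print(table.equals(same_table))
```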


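We can index the rows differently, for instance by making them start from 1. To do so, we define a pandas Series and set it as the index. Note that `set_index` returns a new data frame; `randomTable` itself is unchanged unless the result is assigned back.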

In [79]:
obs = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
obs

0     1
1     2
2     3
3     4
4     5
5     6
6     7
7     8
8     9
9    10
dtype: int64

In [80]:
randomTable.set_index(obs)

|  | 0 | 1 | 2 |
|---|---|---|---|
| 1 | 0.745625 | 0.503069 | 0.705821 |
| 2 | 0.048162 | 0.61289 | 0.697716 |
| 3 | 0.159873 | 0.670215 | 0.363837 |
| 4 | 0.636976 | 0.63698 | 0.597105 |
| 5 | 0.627611 | 0.50112 | 0.937945 |
| 6 | 0.560617 | 0.310853 | 0.217623 |
| 7 | 0.197659 | 0.982266 | 0.084955 |
| 8 | 0.933875 | 0.903979 | 0.144268 |
| 9 | 0.028542 | 0.111783 | 0.315832 |
| 10 | 0.382338 | 0.855089 | 0.591182 |
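
The call above displays a re-indexed table but does not modify the original one. The sketch below, on a small placeholder table, contrasts this with assigning directly to the `index` attribute, which does change the table itself.

```python
import numpy as np
import pandas as pd

table = pd.DataFrame(np.zeros((3, 2)))    # small placeholder table
obs = pd.Series([1, 2, 3])

reindexed = table.set_index(obs)          # new data frame carrying the new index
index_before = list(table.index)          # the original still starts at 0

table.index = range(1, 4)                 # assigning to .index changes the table itself
index_after = list(table.index)

print(index_before, list(reindexed.index), index_after)
```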


We can add variable names by assigning a list (or Series) of labels to the `columns` attribute.

In [87]:
columnNames = pd.Series(['A', 'B', 'C'])
randomTable.columns = columnNames
randomTable

|  | A | B | C |
|---|---|---|---|
| 0 | 0.745625 | 0.503069 | 0.705821 |
| 1 | 0.048162 | 0.61289 | 0.697716 |
| 2 | 0.159873 | 0.670215 | 0.363837 |
| 3 | 0.636976 | 0.63698 | 0.597105 |
| 4 | 0.627611 | 0.50112 | 0.937945 |
| 5 | 0.560617 | 0.310853 | 0.217623 |
| 6 | 0.197659 | 0.982266 | 0.084955 |
| 7 | 0.933875 | 0.903979 | 0.144268 |
| 8 | 0.028542 | 0.111783 | 0.315832 |
| 9 | 0.382338 | 0.855089 | 0.591182 |


Let's rename variables separately, just like with Stata's `rename` command. Note that `rename` returns a new data frame rather than modifying `randomTable`, so the result must be assigned back for the change to persist.

In [None]:
# rename() returns a new data frame, so assign the result back to keep the change
randomTable = randomTable.rename(columns={'A': 'a', 'B': 'b', 'C': 'c'})
randomTable
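
As a complement, the sketch below uses a small made-up table to show two variants of `rename`: renaming a single column with a dictionary, and passing a function (here `str.lower`) that is applied to every label. In both cases the original table is left untouched.

```python
import pandas as pd

table = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})

partly = table.rename(columns={'A': 'a'})    # only column A is renamed
lowered = table.rename(columns=str.lower)    # the function is applied to every label

print(list(partly.columns))
print(list(lowered.columns))
print(list(table.columns))                   # the original labels are unchanged
```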

We can define data frames more explicitly, column by column, by giving each column a label and a series of values.

In [110]:
syllabus = pd.DataFrame({ 'Date' : pd.date_range('20190916', '20190927'),   # creates a series of dates from 16/09/2019 to 27/09/2019
    'Topic' : pd.Categorical(["Version control","Cloud computing","Intro to Python","Basics of data handling","OLS, GLS, IV and NLLS","WEEKEND","WEEKEND","MLE","Time Series","-No class-","GMM","Intro to machine learning"]), # string variable of topics
    'Alphabetical index' : list('abcdefghijkl'),                            # another index, defined as a list
    'Numerical index' : np.linspace(1, 12, 12)   })                         # another index, defined with a useful NumPy function

syllabus

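|  | Date | Topic | Alphabetical index | Numerical index |
|---|---|---|---|---|
| 0 | 2019-09-16 | Version control | a | 1.0 |
| 1 | 2019-09-17 | Cloud computing | b | 2.0 |
| 2 | 2019-09-18 | Intro to Python | c | 3.0 |
| 3 | 2019-09-19 | Basics of data handling | d | 4.0 |
| 4 | 2019-09-20 | OLS, GLS, IV and NLLS | e | 5.0 |
| 5 | 2019-09-21 | WEEKEND | f | 6.0 |
| 6 | 2019-09-22 | WEEKEND | g | 7.0 |
| 7 | 2019-09-23 | MLE | h | 8.0 |
| 8 | 2019-09-24 | Time Series | i | 9.0 |
| 9 | 2019-09-25 | -No class- | j | 10.0 |
| 10 | 2019-09-26 | GMM | k | 11.0 |
| 11 | 2019-09-27 | Intro to machine learning | l | 12.0 |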
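
Each column of a data frame keeps its own type. The sketch below builds a three-row version of the table above (the name `mini` is illustrative) and inspects the column types and a selection of columns.

```python
import numpy as np
import pandas as pd

mini = pd.DataFrame({
    'Date': pd.date_range('20190916', '20190918'),    # three consecutive days
    'Topic': pd.Categorical(['Version control', 'Cloud computing', 'Intro to Python']),
    'Alphabetical index': list('abc'),
    'Numerical index': np.linspace(1, 3, 3),          # 1.0, 2.0, 3.0
})

print(mini.dtypes)               # one type per column: dates, category, text, float
print(mini[['Date', 'Topic']])   # select several columns with a list of labels
```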


### 3. Matplotlib

### 4. SciPy

### 5. Statsmodels

### 6. Scikit-learn