**Introduction to NumPy**

NumPy (short for Numerical Python) is the fundamental package for scientific computing and data analysis in Python. Many other packages, such as pandas and statsmodels, are built on top of it, so it is an important package to learn. At the heart of NumPy is a data structure called **ndarray**. An ndarray is a multi-dimensional array built specifically for numerical data analysis. Python also has built-in sequence types such as lists, but they are more generic. The advantage of an ndarray is that it holds elements of a single data type, so operations on whole arrays run much faster than the equivalent Python loops.

You can perform standard mathematical operations on either individual elements or complete arrays. The functions cover linear algebra, statistical operations, and other specialized mathematical operations. For our purposes, we need to know about ndarray and the mathematical functions relevant to our research. If you already know languages such as C or Fortran, you can integrate NumPy code with code written in these languages and pass NumPy arrays to it.

From an overall perspective, an understanding of NumPy will help us use pandas effectively, since pandas is built on top of NumPy, and we will frequently use NumPy functions in research work. In the current session, we will look only at some of the most important features of NumPy. For a full listing of NumPy features, please visit http://wiki.scipy.org/Numpy_Example_List .

Possible applications of NumPy in research work are:

+ Algorithmic operations such as sorting, grouping and set operations
+ Performing repetitive operations on whole arrays of data without using loops
+ Data merging and alignment operations
+ Data indexing, filtering, and transformation on individual elements or whole arrays
+ Data summarization and descriptive statistics

**Installing NumPy**

To check whether NumPy is installed, go to Package Manager and type NumPy. You will get a list of packages whose names closely match NumPy. For our purposes, we need the package named numpy 1.xx. If the package is not installed, click on Install.

**Importing NumPy**

To use NumPy, first import it with an import statement:

In [17]:
import numpy as np

The above statement makes all of NumPy available under the prefix np, which is a good default. If you only need a few names and want to use them without the prefix, you can import them individually. Note that this still loads the whole NumPy package; it only changes which names appear in your workspace.

In [3]:
from numpy import array

**ndarray**

The most important data structure in NumPy is the n-dimensional array object, ndarray. Using ndarray, you can store large multidimensional datasets in Python. Because it is an array, you can perform mathematical operations either one element at a time or on complete arrays without using loops. An array object is initialized as follows:

In [18]:
a = array((1,2,3,4,5))    #initializes an array a and assigns values to it
b = array((10,20,30,40,50)) # initializes another array b
print(a) 
print(b)
print(a+b) 
print (a+5) 
print (a**2) 

[1 2 3 4 5]
[10 20 30 40 50]
[11 22 33 44 55]
[ 6  7  8  9 10]
[ 1  4  9 16 25]
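
To make "without using loops" concrete, the following sketch compares an explicit Python loop with the equivalent whole-array expression. The variable names `loop_sum` and `vector_sum` are illustrative and do not appear elsewhere in this session.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])
b = np.array([10, 20, 30, 40, 50])

# Element-by-element version using an explicit Python loop
loop_sum = np.empty_like(a)
for i in range(len(a)):
    loop_sum[i] = a[i] + b[i]

# Vectorized version: one expression operates on the whole arrays
vector_sum = a + b

print(loop_sum)
print(vector_sum)
print(np.array_equal(loop_sum, vector_sum))
```

Both versions give the same values; the vectorized form is shorter and is usually the faster of the two on large arrays.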


In [19]:
c = np.arange(15)                      # integers 0 through 14
anarray = np.arange(1, 15, 2)          # start at 1, stop before 15, step 2
onemorearray = np.linspace(1, 10, 15)  # 15 evenly spaced values from 1 to 10
print(c)
print(anarray)
print(onemorearray)

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]
[ 1  3  5  7  9 11 13]
[ 1.          1.64285714  2.28571429  2.92857143  3.57142857  4.21428571
  4.85714286  5.5         6.14285714  6.78571429  7.42857143  8.07142857
  8.71428571  9.35714286 10.        ]


Each ndarray has attributes that describe it, including its shape, its data type (dtype), and its size. The shape gives the length of the array along each dimension (for example, rows and columns), the dtype tells you the type of the elements it contains, and the size is the total number of elements.

In [20]:
data = np.array((32,45,123,756,23,2123))
print(data.shape)
print(data.dtype)
print(data.size)

(6,)
int64
6
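
As a small illustration of how these attributes change with dimensionality, the sketch below reshapes a one-dimensional array into a table. The names `flat` and `table` are made up for this example; the printed dtype depends on your platform.

```python
import numpy as np

flat = np.arange(12)            # 12 elements in one dimension
table = flat.reshape(3, 4)      # same data arranged as 3 rows x 4 columns

print(flat.shape, flat.ndim, flat.size)
print(table.shape, table.ndim, table.size)
print(table.dtype)
```

Reshaping changes the shape and the number of dimensions, while the size and dtype stay the same.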


In [21]:
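data2 = [[1,2,3,4],[5,6,7]]            # rows of unequal length
arr2 = np.array(data2, dtype=object)   # ragged input needs dtype=object; recent NumPy versions raise an error without it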
print(arr2)
print(arr2.shape)
print(arr2[1])

[list([1, 2, 3, 4]) list([5, 6, 7])]
(2,)
[5, 6, 7]


In [22]:
np.zeros(50)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [23]:
np.zeros((3,5))

array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])

In [24]:
np.ones(30)

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

In [25]:
np.ones((5,9))

array([[1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1.]])

In [26]:
np.eye(5) # creates a 5*5 identity matrix. 

array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]])

In [27]:
np.diag(array([1,3,5,3,4,5]))

array([[1, 0, 0, 0, 0, 0],
       [0, 3, 0, 0, 0, 0],
       [0, 0, 5, 0, 0, 0],
       [0, 0, 0, 3, 0, 0],
       [0, 0, 0, 0, 4, 0],
       [0, 0, 0, 0, 0, 5]])

**pandas**

pandas is the primary package for performing data analysis tasks in Python. pandas derives its name from panel data analysis and is the fundamental package that provides relational data structures (think Excel or SQL tables) and a host of capabilities for working with those data structures. It is the most widely used package in Python for data analysis tasks, and works very well with cross-sectional, time series, and panel data. pandas sits on top of NumPy and can be used with NumPy arrays and NumPy functions. pandas suits a researcher’s needs in the following ways:

+ Has a tabular data structure that can hold both homogeneous and heterogeneous data.
+ Very good indexing capabilities that make data alignment and merging easy.
+ Good time series functionality. No need to use different data structures for time series and cross sectional data. Allows for both ordered and unordered time-series data.
+ A host of statistical functions developed around NumPy and pandas that make a researcher’s task easy and fast.
+ Programming is a lot simpler and faster.
+ Easily handles data manipulation and cleaning.
+ Easy to expand and shorten data sets. Comprehensive merging, joins, and group by functionality to join multiple data sets.

**Installing pandas** 

In order to check if pandas is installed, go to Package Manager and type pandas. By default, pandas already comes installed with a distribution of Canopy. If the package is not installed, click on Install.

**Importing pandas**

To use pandas, first import it with an import statement:


In [29]:
import pandas as pd #this will import pandas into your workspace

In [30]:
import numpy as np  #we will be using numpy functions so import numpy

**Data Structures in pandas**

There are two basic data structures in pandas: Series and DataFrame

**Series:** It is similar to a NumPy 1-dimensional array. In addition to the values that are specified by the programmer, pandas attaches a label to each of the values. If the labels are not provided by the programmer, then pandas assigns labels ( 0 for first element, 1 for second element and so on). A benefit of assigning labels to data values is that it becomes easier to perform manipulations on the dataset as the whole dataset becomes more of a dictionary where each value is associated with a label. 


In [31]:
series1 = pd.Series([10,20,30,40])
series1

0    10
1    20
2    30
3    40
dtype: int64

In [34]:
series1.values

array([10, 20, 30, 40])

In [35]:
series1.index

RangeIndex(start=0, stop=4, step=1)

If you want to specify custom index values rather than the default ones provided, you can do so using the following command

In [75]:
series2 = pd.Series([10,20,30,40,50], index=['one','two','three','four','five'])
series2

one      10
two      20
three    30
four     40
five     50
dtype: int64

The ways of accessing elements in a Series object are similar to what we have seen in NumPy, and you can perform NumPy operations on Series data arrays.

In [76]:
series2['two']

20

In [72]:
series2['three']

30

In [74]:
series2[['one', 'three', 'five']]

one      10
three    30
five     50
dtype: int64

In [48]:
series2.iloc[[0,1,3]]   # select by position

one     10
two     20
four    40
dtype: int64

In [42]:
series2 + 4

one      14
two      24
three    34
four     44
five     54
dtype: int64

In [None]:
series2 ** 3

In [49]:
series2[series2>30]

four    40
five    50
dtype: int64

In [None]:
np.sqrt(series2)
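
Because every value carries a label, a Series also behaves much like a dictionary. The sketch below shows a few dictionary-style operations; the Series `eps` and its values are made up for illustration.

```python
import pandas as pd

eps = pd.Series([1.2, 0.8, 2.5], index=['AXP', 'CSCO', 'DIS'])  # made-up values

print('CSCO' in eps)      # membership is tested against the labels
print(eps['DIS'])         # look up a value by its label
print(eps.to_dict())      # convert back to a plain dictionary
```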

If you have a dictionary, you can create a Series from it. Suppose you are interested in EPS values for firms, and the values come from different sources and are not clean. If you pass an index along with the dictionary, pandas aligns the values to that index and marks the missing ones as NaN, so you do not have to align them by hand.

In [77]:
years = [90, 91, 92, 93, 94, 95]
f1 = {90:8, 91:9, 92:7, 93:8, 94:9, 95:11}
firm1 = pd.Series(f1,index=years)
firm1

90     8
91     9
92     7
93     8
94     9
95    11
dtype: int64

In [78]:
f2 = {90:14,92:9, 93:13, 94:5}
firm2 = pd.Series(f2,index=years)
firm2

90    14.0
91     NaN
92     9.0
93    13.0
94     5.0
95     NaN
dtype: float64

In [79]:
f3 = {93:10, 94:12, 95: 13}
firm3 = pd.Series(f3,index=years)
firm3

90     NaN
91     NaN
92     NaN
93    10.0
94    12.0
95    13.0
dtype: float64

NaN stands for missing or NA values in pandas. Use the isnull() function to find out whether there are any missing values in the data structure.

In [80]:
pd.isnull(firm3)

90     True
91     True
92     True
93    False
94    False
95    False
dtype: bool
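
Since isnull() returns a Series of booleans, it can be summed to count missing values or used as a filter. A minimal sketch, rebuilding a Series with the same values as firm3 under the illustrative name `firm`:

```python
import numpy as np
import pandas as pd

firm = pd.Series([np.nan, np.nan, np.nan, 10.0, 12.0, 13.0],
                 index=[90, 91, 92, 93, 94, 95])

print(firm.isnull().sum())     # number of missing values
print(firm.notnull().sum())    # number of observed values
print(firm[firm.notnull()])    # keep only the observed values
```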

A key feature of Series data structures is that you don't have to worry about data alignment. For example, suppose we have run a word count program on two different files and have the following data structures:

In [82]:
dict1 = {'finance': 10, 'earning': 5, 'debt':8}
dict2 = {'finance' : 8, 'compensation':4, 'earning': 9}
count1 = pd.Series(dict1)
count2 = pd.Series(dict2)
print(count1)
count2

finance    10
earning     5
debt        8
dtype: int64


finance         8
compensation    4
earning         9
dtype: int64

If we want the total count of the words common to both files, we do not have to worry about data alignment: adding the two Series matches values by label. A word that appears in only one file gets NaN (missing) in the result, because NaN propagates through arithmetic. If we want to include all words, we can treat the missing counts as zero, for example with count1.add(count2, fill_value=0).

In [83]:
count1+count2

compensation     NaN
debt             NaN
earning         14.0
finance         18.0
dtype: float64
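
The two ways of handling words that occur in only one file can be sketched as follows. The names `combined` and `common` are illustrative.

```python
import pandas as pd

count1 = pd.Series({'finance': 10, 'earning': 5, 'debt': 8})
count2 = pd.Series({'finance': 8, 'compensation': 4, 'earning': 9})

# Words missing from one file are treated as a count of 0 in that file
combined = count1.add(count2, fill_value=0)
print(combined)

# Alternative: keep only the words common to both files
common = (count1 + count2).dropna()
print(common)
```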

**DataFrame**

DataFrame is a tabular data structure in which data is laid out in rows and columns (similar to a CSV file or a SQL table), but it can also be used for higher-dimensional data sets. The DataFrame object can contain homogeneous and heterogeneous values, and can be thought of as a logical extension of the Series data structure. In contrast to a Series, which has one index, a DataFrame has one index for the rows and one for the columns. This allows flexibility in accessing and manipulating data.

In [84]:
data = pd.DataFrame({'price':[95, 25, 85, 41, 78],
                     'ticker':['AXP', 'CSCO', 'DIS', 'MSFT', 'WMT'],
                     'company':['American Express', 'Cisco', 'Walt Disney','Microsoft', 'Walmart']})
data

Unnamed: 0,price,ticker,company
0,95,AXP,American Express
1,25,CSCO,Cisco
2,85,DIS,Walt Disney
3,41,MSFT,Microsoft
4,78,WMT,Walmart
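
The row index and the column index of a DataFrame can be inspected directly. A minimal sketch on a two-column version of the frame above:

```python
import pandas as pd

data = pd.DataFrame({'price': [95, 25, 85, 41, 78],
                     'ticker': ['AXP', 'CSCO', 'DIS', 'MSFT', 'WMT']})

print(data.index)      # row labels
print(data.columns)    # column labels
print(data.dtypes)     # one data type per column
print(data.shape)      # (number of rows, number of columns)
```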


If a column is passed with no values, it will simply have NaN values

In order to access a column, simply mention the column name

In [85]:
data['company']

0    American Express
1               Cisco
2         Walt Disney
3           Microsoft
4             Walmart
Name: company, dtype: object

In [93]:
data.company

0    American Express
1               Cisco
2         Walt Disney
3           Microsoft
4             Walmart
Name: company, dtype: object

In [94]:
data.iloc[2]

price               85
ticker             DIS
company    Walt Disney
Name: 2, dtype: object

In [95]:
data.loc[data.ticker=='DIS']

Unnamed: 0,price,ticker,company
2,85,DIS,Walt Disney


To add a column, assign values to a new column name:

In [96]:
data['Year'] = 2014
data

Unnamed: 0,price,ticker,company,Year
0,95,AXP,American Express,2014
1,25,CSCO,Cisco,2014
2,85,DIS,Walt Disney,2014
3,41,MSFT,Microsoft,2014
4,78,WMT,Walmart,2014


In [97]:
data['pricesquared'] = data.price**2
data

Unnamed: 0,price,ticker,company,Year,pricesquared
0,95,AXP,American Express,2014,9025
1,25,CSCO,Cisco,2014,625
2,85,DIS,Walt Disney,2014,7225
3,41,MSFT,Microsoft,2014,1681
4,78,WMT,Walmart,2014,6084


In [98]:
del data['pricesquared']
data

Unnamed: 0,price,ticker,company,Year
0,95,AXP,American Express,2014
1,25,CSCO,Cisco,2014
2,85,DIS,Walt Disney,2014
3,41,MSFT,Microsoft,2014
4,78,WMT,Walmart,2014


In [99]:
data['pricesquared'] = np.nan   # np.NaN was removed in NumPy 2.0; np.nan works in all versions
data

Unnamed: 0,price,ticker,company,Year,pricesquared
0,95,AXP,American Express,2014,
1,25,CSCO,Cisco,2014,
2,85,DIS,Walt Disney,2014,
3,41,MSFT,Microsoft,2014,
4,78,WMT,Walmart,2014,


In [100]:
data['sequence'] = np.arange(1,6)
data

Unnamed: 0,price,ticker,company,Year,pricesquared,sequence
0,95,AXP,American Express,2014,,1
1,25,CSCO,Cisco,2014,,2
2,85,DIS,Walt Disney,2014,,3
3,41,MSFT,Microsoft,2014,,4
4,78,WMT,Walmart,2014,,5


In [101]:
data.values

array([[95, 'AXP', 'American Express', 2014, nan, 1],
       [25, 'CSCO', 'Cisco', 2014, nan, 2],
       [85, 'DIS', 'Walt Disney', 2014, nan, 3],
       [41, 'MSFT', 'Microsoft', 2014, nan, 4],
       [78, 'WMT', 'Walmart', 2014, nan, 5]], dtype=object)

In [102]:
newdata = data.drop(2)

In [103]:
newdata

Unnamed: 0,price,ticker,company,Year,pricesquared,sequence
0,95,AXP,American Express,2014,,1
1,25,CSCO,Cisco,2014,,2
3,41,MSFT,Microsoft,2014,,4
4,78,WMT,Walmart,2014,,5


In [104]:
years = [90, 91, 92, 93, 94, 95]
f1 = {90:8, 91:9, 92:7, 93:8, 94:9, 95:11}
firm1 = pd.Series(f1,index=years)
firm1
f2 = {90:14,92:9, 93:13, 94:5}
firm2 = pd.Series(f2,index=years)
firm2
f3 = {93:10, 94:12, 95: 13}
firm3 = pd.Series(f3,index=years)
firm3


df1 = pd.DataFrame(columns=['Firm1','Firm2','Firm3'],index=years)
df1

df1.Firm1 = firm1
df1.Firm2 = firm2
df1.Firm3 = firm3
df1


Unnamed: 0,Firm1,Firm2,Firm3
90,8,14.0,
91,9,,
92,7,9.0,
93,8,13.0,10.0
94,9,5.0,12.0
95,11,,13.0


In [105]:
dft = df1.T
dft
del dft[90]
dft


Unnamed: 0,91,92,93,94,95
Firm1,9.0,7.0,8.0,9.0,11.0
Firm2,,9.0,13.0,5.0,
Firm3,,,10.0,12.0,13.0


You can pass a number of data structures to DataFrame, such as an ndarray, lists, a dict, Series, or another DataFrame. You can also reindex to conform the data to a new index. Reindexing is a powerful feature that lets you access data in a number of different ways and conform data to a new time series or other index.
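
As one example of passing another data structure to the constructor, the sketch below builds a DataFrame from a dict of Series and then reindexes it. The small Series here are shortened, illustrative versions of the firm data used above.

```python
import pandas as pd

firm1 = pd.Series({90: 8, 91: 9, 92: 7})
firm2 = pd.Series({90: 14, 92: 9})        # no value for 91

# Each Series becomes a column; rows are aligned on the union of the labels
df = pd.DataFrame({'Firm1': firm1, 'Firm2': firm2})
print(df)

# reindex conforms the rows to a new set of labels
print(df.reindex([91, 92, 93]))
```

Labels that exist in the new index but not in the original data (93 here) come back as rows of NaN.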

In [106]:
df1

Unnamed: 0,Firm1,Firm2,Firm3
90,8,14.0,
91,9,,
92,7,9.0,
93,8,13.0,10.0
94,9,5.0,12.0
95,11,,13.0


In [107]:
reindexdf1 = df1.reindex([88,89,90,91,92,93,94,95,96,97,98])
reindexdf1

Unnamed: 0,Firm1,Firm2,Firm3
88,,,
89,,,
90,8.0,14.0,
91,9.0,,
92,7.0,9.0,
93,8.0,13.0,10.0
94,9.0,5.0,12.0
95,11.0,,13.0
96,,,
97,,,
98,,,


In [108]:
years1 = [90, 91, 92, 93, 94, 95]
f4 = {90:8, 91:9, 92:7, 93:8, 94:9, 95:11}
firm4 = pd.Series(f4,index=years)
f5 = {90:14,91:12, 92:9, 93:13, 94:5, 95:8}
firm5 = pd.Series(f5,index=years)
f6 = {90:8, 91: 9, 92:9,93:10, 94:12, 95: 13}
firm6 = pd.Series(f6,index=years)
df2 = pd.DataFrame(columns=['Firm1','Firm2','Firm3'],index=years1)
df2.Firm1 = firm4
df2.Firm2 = firm5
df2.Firm3 = firm6
df2


Unnamed: 0,Firm1,Firm2,Firm3
90,8,14,8
91,9,12,9
92,7,9,9
93,8,13,10
94,9,5,12
95,11,8,13


In [109]:
reindexdf2 = df2.reindex([88,89,90,91,92,93,94,95,96,97,98], fill_value=0)
reindexdf2

Unnamed: 0,Firm1,Firm2,Firm3
88,0,0,0
89,0,0,0
90,8,14,8
91,9,12,9
92,7,9,9
93,8,13,10
94,9,5,12
95,11,8,13
96,0,0,0
97,0,0,0
98,0,0,0


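Instead of a constant, you can pass method='ffill' to carry the last valid observation forward into the new index labels. Similarly, the backfill (bfill) method fills values backwards.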

In [110]:
df2

Unnamed: 0,Firm1,Firm2,Firm3
90,8,14,8
91,9,12,9
92,7,9,9
93,8,13,10
94,9,5,12
95,11,8,13


In [111]:
reindexdf3 = df2.reindex([88,89,90,91,92,93,94,95,96,97,98], method='ffill')
reindexdf3

Unnamed: 0,Firm1,Firm2,Firm3
88,,,
89,,,
90,8.0,14.0,8.0
91,9.0,12.0,9.0
92,7.0,9.0,9.0
93,8.0,13.0,10.0
94,9.0,5.0,12.0
95,11.0,8.0,13.0
96,11.0,8.0,13.0
97,11.0,8.0,13.0
98,11.0,8.0,13.0


In [113]:
reindexdf1

Unnamed: 0,Firm1,Firm2,Firm3
88,,,
89,,,
90,8.0,14.0,
91,9.0,,
92,7.0,9.0,
93,8.0,13.0,10.0
94,9.0,5.0,12.0
95,11.0,,13.0
96,,,
97,,,
98,,,


In [114]:
reindexdf3

Unnamed: 0,Firm1,Firm2,Firm3
88,,,
89,,,
90,8.0,14.0,8.0
91,9.0,12.0,9.0
92,7.0,9.0,9.0
93,8.0,13.0,10.0
94,9.0,5.0,12.0
95,11.0,8.0,13.0
96,11.0,8.0,13.0
97,11.0,8.0,13.0
98,11.0,8.0,13.0


In [115]:
reindexdf1+reindexdf3

Unnamed: 0,Firm1,Firm2,Firm3
88,,,
89,,,
90,16.0,28.0,
91,18.0,,
92,14.0,18.0,
93,16.0,26.0,20.0
94,18.0,10.0,24.0
95,22.0,,26.0
96,,,
97,,,
98,,,


In [117]:
reindexdf1.add(reindexdf3, fill_value=0)

Unnamed: 0,Firm1,Firm2,Firm3
88,,,
89,,,
90,16.0,28.0,8.0
91,18.0,12.0,9.0
92,14.0,18.0,9.0
93,16.0,26.0,20.0
94,18.0,10.0,24.0
95,22.0,8.0,26.0
96,11.0,8.0,13.0
97,11.0,8.0,13.0
98,11.0,8.0,13.0


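You can use NumPy functions on DataFrame objects. The frames below are filled with random draws, so the values change from run to run; some of the outputs shown come from a different draw and do not match the frame printed before them.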

In [125]:
dataframe = pd.DataFrame(np.random.randn(3,3),columns=['one','two','three'])
dataframe

Unnamed: 0,one,two,three
0,-0.249185,0.081035,1.636391
1,1.882915,0.30606,-0.134024
2,1.644701,-0.392341,-0.966508


In [119]:
np.abs(dataframe)

Unnamed: 0,one,two,three
0,1.062195,0.367188,0.114217
1,0.560258,0.471771,0.349899
2,1.362624,0.690582,3.403217


In [122]:
f = lambda x:x.max()-x.min()
dataframe.apply(f)

one      2.424819
two      1.162353
three    3.517434
dtype: float64

In [123]:
dataframe.apply(f,axis=1)

0    1.176412
1    0.910157
2    4.765840
dtype: float64

In [None]:
g = lambda x: x - np.mean(x)
dataframe.apply(g)

In [126]:
def f(x):
    return pd.Series([np.mean(x), x.max(), x.min()], index=['mean','max','min'])
dataframe.apply(f,axis=1)

Unnamed: 0,mean,max,min
0,0.489414,1.636391,-0.249185
1,0.684984,1.882915,-0.134024
2,0.095284,1.644701,-0.966508


In [128]:
dataframe = pd.DataFrame(np.random.randn(3,3),columns=['one','two','three'])
dataframe

Unnamed: 0,one,two,three
0,1.763656,0.291608,1.435963
1,-1.645361,-2.549766,-0.13153
2,-0.176816,-0.313602,-1.096868


In [129]:
dataframe.sort_values(by='one')

Unnamed: 0,one,two,three
1,-1.645361,-2.549766,-0.13153
2,-0.176816,-0.313602,-1.096868
0,1.763656,0.291608,1.435963


In [130]:
dataframe.sort_values(by=['one','two'])

Unnamed: 0,one,two,three
1,-1.645361,-2.549766,-0.13153
2,-0.176816,-0.313602,-1.096868
0,1.763656,0.291608,1.435963


In [131]:
dataframe.sum()

one     -0.058520
two     -2.571760
three    0.207565
dtype: float64

In [132]:
dataframe.sum(axis=1)

0    3.491227
1   -4.326657
2   -1.587285
dtype: float64

In [133]:
dataframe.cumsum(axis=1)

Unnamed: 0,one,two,three
0,1.763656,2.055265,3.491227
1,-1.645361,-4.195127,-4.326657
2,-0.176816,-0.490418,-1.587285


In [134]:
dataframe.describe()

Unnamed: 0,one,two,three
count,3.0,3.0,3.0
mean,-0.019507,-0.857253,0.069188
std,1.709944,1.496669,1.278289
min,-1.645361,-2.549766,-1.096868
25%,-0.911088,-1.431684,-0.614199
50%,-0.176816,-0.313602,-0.13153
75%,0.79342,-0.010997,0.652216
max,1.763656,0.291608,1.435963


If you have non-numeric data, the describe function produces statistics such as count, unique, and frequency instead. In addition, you can calculate skewness (skew), kurtosis (kurt), percent changes (pct_change), differences (diff), and other statistics.
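
A brief sketch of these additional statistics, using two columns of the firm data from earlier under the illustrative name `prices`:

```python
import pandas as pd

prices = pd.DataFrame({'Firm1': [8, 9, 7, 8, 9, 11],
                       'Firm2': [14, 12, 9, 13, 5, 8]},
                      index=[90, 91, 92, 93, 94, 95])

print(prices.skew())         # skewness of each column
print(prices.kurt())         # kurtosis of each column
print(prices.diff())         # change from the previous row
print(prices.pct_change())   # percent change from the previous row
```

The first row of diff() and pct_change() is NaN because there is no earlier row to compare with.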

**Missing Data**

pandas has a number of features for dealing with missing data. For example, descriptive statistics such as describe() and mean() leave out missing values when they are calculated. Missing data is denoted by NaN.

In [135]:
years = [90, 91, 92, 93, 94, 95]
f1 = {90:8, 91:9, 92:7, 93:8, 94:9, 95:11}
firm1 = pd.Series(f1,index=years)
firm1
f2 = {90:14,92:9, 93:13, 94:5}
firm2 = pd.Series(f2,index=years)
firm2
f3 = {93:10, 94:12, 95: 13}
firm3 = pd.Series(f3,index=years)
firm3
df3 = pd.DataFrame(columns=['Firm1','Firm2','Firm3'],index=years)
df3
df3.Firm1 = firm1
df3.Firm2 = firm2
df3.Firm3 = firm3
df3

Unnamed: 0,Firm1,Firm2,Firm3
90,8,14.0,
91,9,,
92,7,9.0,
93,8,13.0,10.0
94,9,5.0,12.0
95,11,,13.0


In [136]:
firm2

90    14.0
91     NaN
92     9.0
93    13.0
94     5.0
95     NaN
dtype: float64

In [137]:
nadeleted = firm2.dropna()
nadeleted

90    14.0
92     9.0
93    13.0
94     5.0
dtype: float64

In [138]:
df3

Unnamed: 0,Firm1,Firm2,Firm3
90,8,14.0,
91,9,,
92,7,9.0,
93,8,13.0,10.0
94,9,5.0,12.0
95,11,,13.0


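For a DataFrame, dropna by default drops every row that contains at least one missing value. Another option is to drop only those rows in which all values are NA (how='all'). To drop columns instead of rows, pass axis=1. The thresh argument keeps only the rows that have at least that many non-missing values.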

In [139]:
cleandf3 = df3.dropna()

In [140]:
cleandf3

Unnamed: 0,Firm1,Firm2,Firm3
93,8,13.0,10.0
94,9,5.0,12.0


In [141]:
clean2 = df3.dropna(how='all')
clean2

Unnamed: 0,Firm1,Firm2,Firm3
90,8,14.0,
91,9,,
92,7,9.0,
93,8,13.0,10.0
94,9,5.0,12.0
95,11,,13.0


In [142]:
columndrop = df3.dropna(axis=1)
columndrop

Unnamed: 0,Firm1
90,8
91,9
92,7
93,8
94,9
95,11


In [143]:
df3

Unnamed: 0,Firm1,Firm2,Firm3
90,8,14.0,
91,9,,
92,7,9.0,
93,8,13.0,10.0
94,9,5.0,12.0
95,11,,13.0


In [144]:
thresholddf = df3.dropna(thresh=2)
thresholddf

Unnamed: 0,Firm1,Firm2,Firm3
90,8,14.0,
92,7,9.0,
93,8,13.0,10.0
94,9,5.0,12.0
95,11,,13.0


In [145]:
fillna1 = df3.fillna(5)
fillna1

Unnamed: 0,Firm1,Firm2,Firm3
90,8,14.0,5.0
91,9,5.0,5.0
92,7,9.0,5.0
93,8,13.0,10.0
94,9,5.0,12.0
95,11,5.0,13.0


In [146]:
df3

Unnamed: 0,Firm1,Firm2,Firm3
90,8,14.0,
91,9,,
92,7,9.0,
93,8,13.0,10.0
94,9,5.0,12.0
95,11,,13.0


In [147]:
fillna2 = df3.fillna({'Firm1':8, 'Firm2': 10, 'Firm3':14})
fillna2

Unnamed: 0,Firm1,Firm2,Firm3
90,8,14.0,14.0
91,9,10.0,14.0
92,7,9.0,14.0
93,8,13.0,10.0
94,9,5.0,12.0
95,11,10.0,13.0


In [148]:
df3

Unnamed: 0,Firm1,Firm2,Firm3
90,8,14.0,
91,9,,
92,7,9.0,
93,8,13.0,10.0
94,9,5.0,12.0
95,11,,13.0


In [149]:
fillna3 = df3.ffill()   # forward fill; fillna(method='ffill') is deprecated in recent pandas
fillna3

Unnamed: 0,Firm1,Firm2,Firm3
90,8,14.0,
91,9,14.0,
92,7,9.0,
93,8,13.0,10.0
94,9,5.0,12.0
95,11,5.0,13.0


In [150]:
fillna4 = df3.bfill(limit=2)   # backward fill, at most 2 consecutive NaN values
fillna4

Unnamed: 0,Firm1,Firm2,Firm3
90,8,14.0,
91,9,9.0,10.0
92,7,9.0,10.0
93,8,13.0,10.0
94,9,5.0,12.0
95,11,,13.0


In [151]:
fillna5 = df3.fillna(df3.mean())
fillna5

Unnamed: 0,Firm1,Firm2,Firm3
90,8,14.0,11.666667
91,9,10.25,11.666667
92,7,9.0,11.666667
93,8,13.0,10.0
94,9,5.0,12.0
95,11,10.25,13.0


**Hierarchical Indexing**

Hierarchical indexing allows you to have an index on an index (a multiple index). It is an important feature of pandas that lets you select subsets of data and perform independent analyses on them. For example, suppose you have firm price data indexed by firm name. On top of that, you can index firms by industry. Thus, industry becomes an index on top of firms. You can then perform analyses on an individual firm, on a group of firms in an industry, or on the whole dataset.

In [152]:
h_i_data = pd.Series(np.random.randn(10),index=[['Ind1','Ind1','Ind1','Ind1','Ind2','Ind2','Ind2','Ind3','Ind3','Ind3'],
                                              [1,2,3,4,1,2,3,1,2,3]])
h_i_data

Ind1  1   -0.223898
      2    0.720853
      3    0.207692
      4   -1.671725
Ind2  1    0.539773
      2   -0.263149
      3   -0.758377
Ind3  1    1.079712
      2   -1.002728
      3    1.136453
dtype: float64

In [None]:
h_i_data['Ind3']

In [None]:
h_i_data['Ind1':'Ind3']

In [None]:
h_i_data[['Ind1','Ind3']]

In [None]:
h_i_data[:,3]

In [None]:
h_i_data[:,4]

In [None]:
h_i_data.unstack()

In [None]:
h_i_data.unstack().stack()

In [None]:
h_i_data.sum()

In [None]:
h_i_data.groupby(level=1).sum()   # sum(level=1) was removed in pandas 2.0

In [None]:
h_i_data.groupby(level=0).sum()   # sum(level=0) was removed in pandas 2.0

In [None]:
h_i_data
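
Since the cells above use random data, the following sketch repeats the main selections on a small Series with fixed values so the results are reproducible. The name `prices` and its values are made up for illustration.

```python
import pandas as pd

# Fixed values so the result is reproducible
prices = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                   index=[['Ind1', 'Ind1', 'Ind1', 'Ind2', 'Ind2', 'Ind2'],
                          [1, 2, 3, 1, 2, 3]])

print(prices['Ind2'])                    # all firms in one industry
print(prices.loc[:, 2])                  # firm 2 in every industry
print(prices.groupby(level=0).sum())     # total per industry
print(prices.unstack())                  # industries as rows, firms as columns
```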

**IO in pandas**

In this section, we will focus on I/O from text files, CSV, Excel, and SQL files, as well as getting data from the web, such as Yahoo! Finance. Using functions in pandas, you can read data as a DataFrame object.

**Reading a csv file**

In [None]:
import pandas as pd
import numpy as np

In [154]:
roedatacsv = pd.read_csv('roedata.csv')
#roedatacsv
roedatacsv.head()

Unnamed: 0,Industry Name,Number of firms,ROE
0,Advertising,65,16.51%
1,Aerospace/Defense,95,21.60%
2,Air Transport,25,42.68%
3,Apparel,70,17.87%
4,Auto & Truck,26,22.05%


If the file does not have a header, then you can either let pandas assign default headers or you can specify custom headers. If you want industry name to be the index of DataFrame, you can achieve that.

In [155]:
roedatacsv = pd.read_csv('roedata.csv', index_col = 'Industry Name' )
roedatacsv

Industry Name,Number of firms,ROE
Advertising,65,16.51%
Aerospace/Defense,95,21.60%
Air Transport,25,42.68%
Apparel,70,17.87%
Auto & Truck,26,22.05%
...,...,...
Transportation,22,14.75%
Trucking,28,16.01%
Utility (General),20,7.34%
Utility (Water),20,9.95%
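
For a file without a header row, the two options mentioned earlier (default headers or custom headers) can be sketched as follows. The file contents are a made-up in-memory stand-in, so no file on disk is needed.

```python
import io
import pandas as pd

# A small in-memory stand-in for a headerless file (contents are made up)
raw = io.StringIO("Advertising,65\nApparel,70\nTrucking,28\n")

default_names = pd.read_csv(raw, header=None)      # columns are labelled 0, 1
print(default_names)

raw.seek(0)                                        # rewind to read the text again
custom = pd.read_csv(raw, header=None,
                     names=['Industry Name', 'Number of firms'],
                     index_col='Industry Name')
print(custom)
```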


In [156]:
roedatacsv = pd.read_csv('roedata.csv', usecols = ['Industry Name','ROE'] )
roedatacsv

Unnamed: 0,Industry Name,ROE
0,Advertising,16.51%
1,Aerospace/Defense,21.60%
2,Air Transport,42.68%
3,Apparel,17.87%
4,Auto & Truck,22.05%
...,...,...
92,Transportation,14.75%
93,Trucking,16.01%
94,Utility (General),7.34%
95,Utility (Water),9.95%


In [159]:
capm_dem_data = pd.read_table('../Session01/capm_dem.dat', delimiter=' ',header = None)
capm_dem_data

Unnamed: 0,0,1,2,3
0,195710,880211,-0.012605,0.003871
1,195710,880212,-0.008511,0.007406
2,195710,880216,0.008584,0.001411
3,195710,880217,-0.004255,0.002414
4,195710,880218,0.000000,0.002845
...,...,...,...,...
269,811710,880524,0.024540,0.001700
270,811710,880525,0.008982,0.001598
271,811710,880526,0.000000,0.003151
272,811710,880527,0.002967,-0.000359


In [161]:
compustatdata = pd.read_csv('../Session01/compustat.csv', index_col = ['ggroup','gvkey'] )
compustatdata = compustatdata.sort_index(axis=0, ascending=False)   # axis must be passed by keyword in pandas 2.0 and later
compustatdata

ggroup,gvkey,datadate,fyear,indfmt,consol,popsrc,datafmt,tic,conm,curcd,fyr,act,artfs,at,ebitda,epsfi,ni,costat
5510.0,273902,20101231,2010.0,INDL,C,D,STD,PENGF,PRIMARY ENERGY RECYCLING CP,CAD,12.0,34.821,,329.365,34.071,0.10,13.382,A
5510.0,273902,20111231,2011.0,INDL,C,D,STD,PENGF,PRIMARY ENERGY RECYCLING CP,CAD,12.0,31.385,,309.250,28.511,0.00,0.037,A
5510.0,273902,20121231,2012.0,INDL,C,D,STD,PENGF,PRIMARY ENERGY RECYCLING CP,CAD,12.0,41.333,,300.047,23.890,0.20,9.042,A
5510.0,273902,20131231,2013.0,INDL,C,D,STD,PENGF,PRIMARY ENERGY RECYCLING CP,CAD,12.0,34.165,,293.968,21.166,-0.04,-2.026,A
5510.0,269005,20101231,2010.0,INDL,C,D,STD,CPL,CPFL ENERGIA SA,USD,12.0,2343.929,,12059.891,1953.276,5.77,924.948,A
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,1272,20100630,2010.0,INDL,C,D,STD,ATPT,ALL STATE PROPERTIES HLDGS,USD,6.0,0.000,,0.000,-1.637,-0.02,-2.625,A
,1119,20101231,2010.0,INDL,C,D,STD,ADX,ADAMS EXPRESS,USD,12.0,,,,,,,A
,1119,20111231,2011.0,INDL,C,D,STD,ADX,ADAMS EXPRESS,USD,12.0,,,,,,,A
,1119,20121231,2012.0,INDL,C,D,STD,ADX,ADAMS EXPRESS,USD,12.0,,,,,,,A


In [164]:
crsp_data = pd.read_table('../Session01/crsp.output', sep=r'\s+',header = None)   # raw string avoids an invalid-escape warning
crsp_data

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,00036110,19840831,19890929,3.311923,22.625,0.97,0.597,-0.172,0.248,0.153,0.193,0.081,0.163,135.908
1,00036110,19850830,19900928,0.378167,22.625,1.51,0.640,-0.028,0.185,0.149,0.187,0.095,0.148,136.474
2,00036110,19860829,19910930,0.104652,23.500,1.27,0.446,0.178,0.058,0.157,0.142,0.112,0.129,213.827
3,00036110,19870831,19920930,-0.391107,37.375,1.50,0.342,0.371,-0.097,0.157,0.101,0.122,0.078,392.400
4,00036110,19880831,19930930,-0.408846,24.625,1.34,0.385,0.286,-0.697,0.187,0.044,0.163,0.018,390.429
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
686,05652510,19910328,19960430,0.823062,17.875,1.45,0.927,0.125,0.138,0.062,0.044,0.062,0.070,30.388
687,05654310,19860331,19910430,1.244162,22.750,1.98,0.852,0.031,0.310,0.121,0.066,0.049,0.100,22.614
688,05654310,19870331,19920430,1.318356,23.125,0.59,0.828,-0.075,0.360,0.120,0.080,0.046,0.064,22.986
689,05709710,19840330,19890428,0.720602,38.500,3.58,0.556,0.201,-0.035,0.203,-0.008,0.201,0.072,238.199


**Reading Files in Chunks**

When dealing with very large files, it is sometimes handy to work with a subset of the file or to work through the file iteratively in smaller chunks. This can be done in pandas using the nrows and chunksize arguments of read_csv.

In [170]:
altdata = pd.read_csv('abcd.csv', chunksize=1000000 )
altdata   # a TextFileReader object rather than a DataFrame

In [166]:
#Try not to run this
altdata1 = pd.read_csv('abcd.csv')
altdata1.count()

TICKER         5006178
OFTIC          4985426
CNAME          4985426
ACTDATS        5006178
ESTIMATOR      5006178
FPI            5006178
MEASURE        5006178
VALUE          5006175
FPEDATS        5006178
ACTTIMS        5006178
ANNDATS        5006178
ANNDATS_ACT    3534509
ANNTIMS_ACT    3534509
dtype: int64

Note that when chunksize is passed, read_csv does not return a table. Instead it returns an object called a TextFileReader, which lets you iterate over the complete file in pieces of the chunksize you specified. Suppose we want to add up the ESTIMATOR values over the entire dataset.

In [168]:
total = 0
for chunks in altdata:
    total += chunks['ESTIMATOR'].sum()
    print(total)
total

1017039317
2086630305
3124599748
4195450974
5263040616
5267984450


5267984450
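
The same pattern can be tried without a large file. The sketch below uses a ten-row in-memory stand-in; the column name VALUE and the data are assumptions made for illustration.

```python
import io
import pandas as pd

# In-memory stand-in for a large file; the column name VALUE is illustrative
text = "VALUE\n" + "\n".join(str(i) for i in range(1, 11))   # rows 1..10

first_rows = pd.read_csv(io.StringIO(text), nrows=3)         # only the first 3 rows
print(first_rows)

total = 0
for chunk in pd.read_csv(io.StringIO(text), chunksize=4):    # chunks of 4, 4, and 2 rows
    total += chunk['VALUE'].sum()
print(total)
```

Each chunk is an ordinary DataFrame, so only one chunk needs to be in memory at a time.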

**Handling missing values**

Some markers for missing values, such as NA, NULL, and -1.#IND, are automatically read as NaN by pandas when importing data. You can also specify your own list of missing-value markers with na_values.

In [None]:
roemissing = pd.read_csv('/Users/suppi/Documents/Peeyush/CBA/B14/Pre-Term Python/roemissing.csv', na_values=['NULL',-999] )
roemissing

In [None]:
roemissing = pd.read_csv('/Users/suppi/Documents/Peeyush/CBA/B14/Pre-Term Python/roemissing.csv', na_values={'Number of firms':['NULL',-999],'ROE':['10000.00%']} )
roemissing
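
The effect of na_values can be checked on a small in-memory example. The file contents below are made up, with -999 standing in for a missing value.

```python
import io
import pandas as pd

# Made-up file contents: -999 marks a missing value
text = "Industry Name,Number of firms,ROE\nAdvertising,65,16.51%\nApparel,-999,17.87%\n"

plain = pd.read_csv(io.StringIO(text))
flagged = pd.read_csv(io.StringIO(text), na_values={'Number of firms': [-999]})

print(plain['Number of firms'].isnull().sum())     # -999 read as an ordinary number
print(flagged['Number of firms'].isnull().sum())   # -999 read as NaN
```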

**Writing Data**

In [None]:
roedata = pd.read_csv('/Users/suppi/Documents/Peeyush/CBA/B14/Pre-Term Python/roedata.csv')
roedata.to_csv('/Users/suppi/Documents/Peeyush/CBA/B14/Pre-Term Python/roedatawrite.csv')

In [None]:
roedata = pd.read_csv('/Users/suppi/Documents/Peeyush/CBA/B14/Pre-Term Python/roedata.csv')
roedata.to_csv('/Users/suppi/Documents/Peeyush/CBA/B14/Pre-Term Python/roedatawrite2.csv', index=False, columns=['Industry Name','ROE'])
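
To see what index=False changes, the sketch below writes a small made-up frame to in-memory buffers instead of files and prints the resulting text.

```python
import io
import pandas as pd

roe = pd.DataFrame({'Industry Name': ['Advertising', 'Apparel'],
                    'ROE': ['16.51%', '17.87%']})

with_index = io.StringIO()
roe.to_csv(with_index)                    # row labels are written as the first column
print(with_index.getvalue())

without_index = io.StringIO()
roe.to_csv(without_index, index=False)    # row labels are omitted
print(without_index.getvalue())
```

When a file written with the index is read back with read_csv, that unnamed first column shows up as a column called Unnamed: 0.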

**Merging Data**

In [None]:
import pandas as pd

In [None]:
left_frame = pd.DataFrame({'key': range(5), 
                           'left_value': ['a', 'b', 'c', 'd', 'e']})
right_frame = pd.DataFrame({'key': range(2, 7), 
                           'right_value': ['f', 'g', 'h', 'i', 'j']})
print(left_frame)
print('\n')
print(right_frame)

In [None]:
pd.merge(left_frame, right_frame, left_on='key', right_on = 'key', how='inner')

In [None]:
pd.merge(left_frame, right_frame, on='key', how='left')

In [None]:
pd.merge(left_frame, right_frame, on='key', how='right')

In [None]:
pd.merge(left_frame, right_frame, on='key', how='outer')

In [None]:
pd.concat([left_frame, right_frame])

In [None]:
pd.concat([left_frame, right_frame], axis=1)
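
The join types differ in which keys they keep. The sketch below counts the rows produced by each one for the two frames above, and uses the indicator argument of pd.merge to show where each row of the outer join came from.

```python
import pandas as pd

left_frame = pd.DataFrame({'key': range(5),
                           'left_value': ['a', 'b', 'c', 'd', 'e']})
right_frame = pd.DataFrame({'key': range(2, 7),
                            'right_value': ['f', 'g', 'h', 'i', 'j']})

# Number of rows produced by each join type
for how in ['inner', 'left', 'right', 'outer']:
    merged = pd.merge(left_frame, right_frame, on='key', how=how)
    print(how, len(merged))

# indicator=True adds a _merge column recording where each row came from
outer = pd.merge(left_frame, right_frame, on='key', how='outer', indicator=True)
print(outer[['key', '_merge']])
```

The keys 2, 3, and 4 occur in both frames, so the inner join has three rows; the outer join keeps all seven distinct keys.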