
<p><img align="left" src="https://www.cqf.com/themes/custom/creode/logo.svg" style="vertical-align: top; padding-top: 23px;" width="10%"/>
<img align="right" src="https://upload.wikimedia.org/wikipedia/commons/c/c3/Python-logo-notext.svg" style="vertical-align: middle;" width="12%"/>
<font color="#306998"><h1><center>Python Labs</center></h1></font><br/></p>
<p></p><h1><center>Introduction to Pandas</center></h1>
<center><h3>Kannan Singaravelu</h3></center>
<center>kannan.singaravelu@fitchlearning.com</center>



<h2 id="Pandas">Pandas<a class="anchor-link" href="#Pandas">¶</a></h2><p>Pandas is one of the most important Python libraries for data manipulation and is built on top of NumPy. Unlike NumPy, Pandas is designed for working with tabular or heterogeneous data. The two main data structures of Pandas are:</p>
<ul>
<li><strong>Series</strong> - 1D labeled homogeneous array, size immutable.</li>
<li><strong>DataFrame</strong> - 2D labeled, size-mutable tabular structure with potentially heterogeneously typed columns.</li>
</ul>
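<p>As a minimal sketch of the difference between the two structures, the cell below builds one of each and prints their dtypes. The tickers, quantities and prices are made-up illustrative values, not data used elsewhere in this lab.</p>

```python
import pandas as pd

# A Series holds a single dtype for all of its values
prices = pd.Series([101.5, 102.0, 99.8], index=['Mon', 'Tue', 'Wed'])

# A DataFrame can hold a different dtype in each column (hypothetical data)
trades = pd.DataFrame({
    'ticker': ['SPY', 'QQQ', 'IWM'],   # strings
    'quantity': [10, 25, 5],           # integers
    'price': [101.5, 102.0, 99.8],     # floats
})

print(prices.dtype)    # one dtype for the whole Series
print(trades.dtypes)   # one dtype per column
```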



<p><em>Note: To run all of the code cells in this example, select <strong>Run All</strong> from the <strong>Cell</strong> menu.</em></p>



<h3 id="Installation">Installation<a class="anchor-link" href="#Installation">¶</a></h3>



<p>We'll install the required libraries that we'll use in this example.</p>


In [None]:

# Install Pandas library
# ! pip install pandas




<h3 id="Importing">Importing<a class="anchor-link" href="#Importing">¶</a></h3><p>We'll import the required libraries that we'll use in this example.</p>


In [None]:

# Import required libraries
import pandas as pd
import numpy as np



In [None]:

# Check the version
pd.__version__




<h2 id="Data-Structure">Data Structure<a class="anchor-link" href="#Data-Structure">¶</a></h2>



<h3 id="Series">Series<a class="anchor-link" href="#Series">¶</a></h3><p>A Pandas Series is a one-dimensional labeled array capable of holding data of any type, such as integers, strings, floats, and Python objects. The axis labels are collectively called the index.</p>
<p>A Series can be created from an <code>array</code>, a <code>dict</code>, or a <code>scalar value</code>.</p>


In [None]:

# Create an Empty Series
s = pd.Series(dtype = float)
print(s)



In [None]:

# Create a Series from ndarray
s = np.arange(10,20)
pd.Series(s)



In [None]:

# Create a Series from dict
s = {'a' : 0., 'b' : 1., 'c' : 2.}
pd.Series(s)



In [None]:

# Create a Series from Scalar
pd.Series(123, index=[0, 1, 2, 3])




<h3 id="DataFrame">DataFrame<a class="anchor-link" href="#DataFrame">¶</a></h3><p>A DataFrame is a two-dimensional, size-mutable, heterogeneous tabular data structure with labeled axes, where arithmetic operations align on both row and column labels.</p>
<p>A pandas DataFrame can be created from <code>lists</code>, a <code>dict</code>, <code>Series</code>, <code>ndarrays</code>, or <code>another DataFrame</code>.</p>
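<p>Label alignment is easiest to see with two frames whose labels only partly overlap. The sketch below uses two small hypothetical frames, <code>left</code> and <code>right</code>, that are not part of the original lab.</p>

```python
import pandas as pd

# Two hypothetical frames whose row and column labels only partly overlap
left = pd.DataFrame({'x': [1, 2], 'y': [3, 4]}, index=['a', 'b'])
right = pd.DataFrame({'y': [10, 20], 'z': [30, 40]}, index=['b', 'c'])

# Addition matches cells by (row label, column label), not by position;
# a label present in only one operand produces NaN in the result
total = left + right
print(total)
```

<p>Only the cell at row <code>b</code>, column <code>y</code> exists in both frames, so it is the only cell in the result that holds a number.</p>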


In [None]:

# Create an Empty DataFrame
df = pd.DataFrame()
print(df)



In [None]:

# Create a DataFrame from Lists
data = [['Program','CQF'],['Module', 1],['School', 'Fitch'], ['City', 'London'], ['Country', 'UK']]
df = pd.DataFrame(data, columns=['Name', 'Details'])
df



In [None]:

# Create a DataFrame from Dictionary
data = {'Name': ['Program', 'Module', 'School', 'City', 'Country'],
           'Details': ['CQF', 1, 'Fitch', 'London', 'UK']}
df = pd.DataFrame(data)
df



In [None]:

# Create a DataFrame from Series
data = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
        'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(data)
df



In [None]:

# Create a DataFrame from NumPy arrays
data = np.ones((3,4))
df = pd.DataFrame(data)
df



In [None]:

# Create a DataFrame from another DataFrame
df = pd.DataFrame(df.iloc[:,0])
df




<h2 id="Indexing-&amp;-Selection">Indexing &amp; Selection<a class="anchor-link" href="#Indexing-&amp;-Selection">¶</a></h2>



<h3 id="Indexers:-loc,-iloc">Indexers: loc, iloc<a class="anchor-link" href="#Indexers:-loc,-iloc">¶</a></h3><p>Pandas has special indexing operators, <code>loc</code> and <code>iloc</code>, that enable us to select a subset of rows and columns. The basics of indexing are as follows:</p>
<table>
<thead><tr>
<th>Operation</th>
<th>Syntax</th>
<th>Result</th>
</tr>
</thead>
<tbody>
<tr>
<td>Select column</td>
<td>df[col]</td>
<td>Series</td>
</tr>
<tr>
<td>Select row by label</td>
<td>df.loc[label]</td>
<td>Series</td>
</tr>
<tr>
<td>Select row by integer location</td>
<td>df.iloc[loc]</td>
<td>Series</td>
</tr>
<tr>
<td>Slice rows</td>
<td>df[2:5]</td>
<td>DataFrame</td>
</tr>
</tbody>
</table>
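<p>The operations in the table can be tried on a small frame. This is a sketch on a hypothetical frame named <code>frame</code>. It also illustrates one detail the table does not state: label slices with <code>loc</code> include the end label, whereas position slices with <code>iloc</code> exclude the end position.</p>

```python
import pandas as pd

# Hypothetical frame with string row labels
frame = pd.DataFrame({'value': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])

col = frame['value']            # select column -> Series
row_by_label = frame.loc['b']   # select row by label -> Series
row_by_pos = frame.iloc[1]      # select row by integer location -> Series
rows = frame[1:3]               # slice rows by position -> DataFrame

# loc slices include the end label; iloc slices exclude the end position
by_label = frame.loc['b':'c']   # rows 'b' and 'c'
by_pos = frame.iloc[1:2]        # row 'b' only

print(type(col).__name__, type(rows).__name__)
```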



<h3 id="Data-Selection-in-Series">Data Selection in Series<a class="anchor-link" href="#Data-Selection-in-Series">¶</a></h3><p>Data in a Series can be accessed much like data in an ndarray and, in many ways, like a standard Python dictionary.</p>
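<p>A short sketch of both behaviours on one object may help. The Series <code>s_demo</code> is a hypothetical name chosen so that it does not overwrite the <code>s</code> used in the cells that follow.</p>

```python
import pandas as pd

s_demo = pd.Series({'a': 0.0, 'b': 1.0, 'c': 2.0})

# Dictionary-like behaviour
has_b = 'b' in s_demo           # membership tests the index labels, like dict keys
labels = list(s_demo.keys())    # keys() returns the index

# ndarray-like behaviour
first_two = s_demo.iloc[:2]     # positional slicing
doubled = s_demo * 2            # vectorised arithmetic, labels are preserved

print(has_b, labels)
print(doubled)
```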


In [None]:

# Selection from Series
s = pd.Series(np.arange(10,20))

print(f'First element in the Series: {s[0]}')  # element with label 0
print(f'First three elements in the Series:\n{s[:3]}')  # first three elements



In [None]:

s.loc[1]



In [None]:

# Selection from Series as dictionary
s = pd.Series({'a' : 0., 'b' : 1., 'c' : 2.})
s['a'], s['c']




<h3 id="DataFrame-Column-Selection">DataFrame Column Selection<a class="anchor-link" href="#DataFrame-Column-Selection">¶</a></h3>


In [None]:

# Column selection from a DataFrame
data = {'Name': ['Program', 'Module', 'School', 'City', 'Country'],
           'Details': ['CQF', 1, 'Fitch', 'London', 'UK']}

df = pd.DataFrame(data,  index = ['a','b','c','d','e'])
df['Name']



In [None]:

df.loc['d']



In [None]:

df.iloc[2]



In [None]:

df.iloc[4]



In [None]:

df[2:4]




<h2 id="Essential-Functionality">Essential Functionality<a class="anchor-link" href="#Essential-Functionality">¶</a></h2><p>We'll now see some of the essential functionality common to the Pandas data structures.</p>


In [None]:

df = pd.DataFrame(np.arange(1,21), columns=['Numeric'])




<p>To view a small sample of a Series or DataFrame object, use the head() and tail() methods. The default number of elements to display is five, but you may pass a custom number.</p>
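<p>Passing a custom number works as sketched below. The frame <code>sample</code> is a hypothetical stand-in with the same contents as the <code>df</code> created above, so the cell can be run on its own.</p>

```python
import numpy as np
import pandas as pd

# The integers 1 to 20 in a single column
sample = pd.DataFrame(np.arange(1, 21), columns=['Numeric'])

first_three = sample.head(3)   # first three rows instead of the default five
last_two = sample.tail(2)      # last two rows
print(first_three)
print(last_two)
```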


In [None]:

# First five values
df.head()



In [None]:

# Last five values
df.tail()



In [None]:

# DataFrame index object
df.index



In [None]:

# Metadata
df.info()



In [None]:

# Tuple representing the dimension of the DataFrame
df.shape



In [None]:

# NumPy representation of NDFrame
df.values




<h2 id="Descriptive-Statistics">Descriptive Statistics<a class="anchor-link" href="#Descriptive-Statistics">¶</a></h2><p>Pandas objects are equipped with common mathematical and statistical methods. A large number of methods collectively compute descriptive statistics and other related operations on a DataFrame. Unlike their NumPy counterparts, these Pandas methods skip missing values (<code>NaN</code>) by default.</p>
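<p>The difference in missing-data handling can be checked directly. This is a minimal sketch on a hypothetical Series named <code>values</code> that contains one <code>NaN</code>.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical series with one missing observation
values = pd.Series([1.0, 2.0, np.nan, 4.0])

pandas_sum = values.sum()                  # NaN is skipped by default (skipna=True)
strict_sum = values.sum(skipna=False)      # NaN propagates when skipping is disabled
numpy_sum = np.sum(values.to_numpy())      # plain NumPy sum propagates NaN
nan_aware = np.nansum(values.to_numpy())   # NumPy's NaN-aware variant

print(pandas_sum, strict_sum, numpy_sum, nan_aware)
```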


In [None]:

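# Basic statistics; each method returns a Series with one entry per column,
# so .iloc[0] picks the value for the single 'Numeric' column by position
print(f'Sum                : {df.sum().iloc[0]}')
print(f'Mean               : {df.mean().iloc[0]}')
print(f'Median             : {df.median().iloc[0]}')
print(f'Standard deviation : {df.std().iloc[0]:.2f}')
print(f'Minimum            : {df.min().iloc[0]}')
print(f'Maximum            : {df.max().iloc[0]}')
print(f'Index of Minimum   : {df.idxmin().iloc[0]}')
print(f'Index of Maximum   : {df.idxmax().iloc[0]}')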



In [None]:

# Cumulative Sum
df.cumsum().iloc[-1]



In [None]:

# Cumulative Product
df.cumprod().iloc[-1]



In [None]:

# Summary Statistics 
df.describe().T




<p>List of key aggregation functions available in Pandas:</p>
<table>
<thead><tr>
<th>Aggregation</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>count()</code></td>
<td>Total number of items</td>
</tr>
<tr>
<td><code>first()</code>, <code>last()</code></td>
<td>First and last item</td>
</tr>
<tr>
<td><code>mean()</code>, <code>median()</code></td>
<td>Mean and median</td>
</tr>
<tr>
<td><code>min()</code>, <code>max()</code></td>
<td>Minimum and maximum</td>
</tr>
<tr>
<td><code>std()</code>, <code>var()</code></td>
<td>Standard deviation and variance</td>
</tr>
<tr>
<td><code>mad()</code></td>
<td>Mean absolute deviation (deprecated in pandas 1.5 and removed in pandas 2.0)</td>
</tr>
<tr>
<td><code>prod()</code></td>
<td>Product of all items</td>
</tr>
<tr>
<td><code>sum()</code></td>
<td>Sum of all items</td>
</tr>
</tbody>
</table>
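<p>Several of these aggregations can be computed in one call with <code>agg()</code>. The sketch below assumes a frame like the one used in the cells above; the name <code>numbers</code> is hypothetical.</p>

```python
import numpy as np
import pandas as pd

# The integers 1 to 20 in a single column
numbers = pd.DataFrame(np.arange(1, 21), columns=['Numeric'])

# agg() accepts a list of aggregation names and returns one row per function
summary = numbers.agg(['count', 'min', 'max', 'mean', 'sum'])
print(summary)
```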


In [None]:

# Sort by Index
s = pd.Series(np.arange(1,6), index=['e', 'a', 'c', 'b', 'd'])
s.sort_index()



In [None]:

# Sort by values
s.sort_values()



In [None]:

np.random.seed(0)
df = pd.DataFrame(np.arange(10).reshape(2,5), index=['two', 'one'], columns=['d', 'e', 'a', 'c', 'b'])
df



In [None]:

# Sort DataFrame by column labels (axis=1)
df.sort_index(axis=1)



In [None]:

# Sort DataFrame by column labels in descending order
df.sort_index(axis=1, ascending=False)



In [None]:

dd = pd.DataFrame({'col1':[2,1,1,1],'col2':[1,3,2,4]})
dd



In [None]:

# Sort DataFrame by values
dd.sort_values(by=['col1', 'col2'])




<h2 id="Aggregation">Aggregation<a class="anchor-link" href="#Aggregation">¶</a></h2><h3 id="GroupBy">GroupBy<a class="anchor-link" href="#GroupBy">¶</a></h3><p>A GroupBy object is a flexible abstraction that is very useful in data manipulation: it splits the data into groups by key, applies a function to each group, and combines the results.</p>
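<p>Because the cells below depend on a local CSV file, here is a self-contained sketch of the same idea. The frame <code>toy</code> and its dates and volumes are made up for illustration.</p>

```python
import pandas as pd

# Hypothetical daily volumes spanning two calendar years
dates = pd.to_datetime(['2019-12-30', '2019-12-31', '2020-01-02', '2020-01-03'])
toy = pd.DataFrame({'Volume': [100, 150, 200, 250]}, index=dates)

# Group rows by the year of their date index, then sum each group
yearly = toy.groupby(toy.index.year)['Volume'].sum()
print(yearly)
```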


In [None]:

data = pd.read_csv('data/spy.csv', index_col=0, parse_dates=True)
data.head()



In [None]:

data[['Volume']].groupby(data.index.year).sum().head()



In [None]:

data.groupby(data.index.year).sum().head()




<h3 id="Pivot-Tables">Pivot Tables<a class="anchor-link" href="#Pivot-Tables">¶</a></h3><p>A pivot table is a summarization operation, familiar from spreadsheets, that operates on tabular data and can group along more than one dimension.</p>
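<p>A two-dimensional pivot can be sketched on a small table. The frame <code>records</code> and its column names are hypothetical and chosen only for illustration.</p>

```python
import pandas as pd

# Hypothetical long-format records: several observations per (year, quarter)
records = pd.DataFrame({
    'year':    [2019, 2019, 2019, 2020, 2020, 2020],
    'quarter': ['Q1', 'Q1', 'Q2', 'Q1', 'Q2', 'Q2'],
    'volume':  [10, 20, 30, 40, 50, 60],
})

# Rows are years, columns are quarters, and each cell holds the summed volume
table = pd.pivot_table(records, index='year', columns='quarter',
                       values='volume', aggfunc='sum')
print(table)
```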


In [None]:

pd.pivot_table(data=data, index=data.index.year, values='Volume', aggfunc='sum').head()



In [None]:

pd.pivot_table(data=data, index=[data.index.year, data.index.month], values='Volume', aggfunc='sum').head(13)




<h2 id="Filtering">Filtering<a class="anchor-link" href="#Filtering">¶</a></h2><p>Let's now slice and filter close prices above/below certain values.</p>
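<p>Filtering relies on boolean masks. The sketch below uses a small hypothetical frame, <code>quotes</code>, whose column names mirror the SPY data; the prices themselves are invented.</p>

```python
import pandas as pd

# Hypothetical prices; column names mirror the SPY data used in this section
quotes = pd.DataFrame({
    'Open':  [298.0, 301.0, 305.0, 299.0],
    'Close': [299.5, 303.0, 304.0, 300.5],
})

# A comparison yields a boolean Series; indexing with it keeps the True rows
above_300 = quotes[quotes['Close'] > 300]

# Combine conditions with & (and) or | (or); each condition needs parentheses
up_and_above = quotes[(quotes['Close'] > 300) & (quotes['Close'] > quotes['Open'])]

print(above_300)
print(up_and_above)
```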


In [None]:

# Filter dates on which the close price is above 300
data.loc[data['Close'] > 300, 'Close']



In [None]:

# Count the rows where Open equals High
data[data['Open']== data['High']].count()



In [None]:

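# Work on a copy so that the original data is left unchanged
df1 = data.copy()
# Flag days where the open was the day's high (-1) or the day's low (+1)
df1.loc[df1['Open'] == df1['High'], 'O=H'] = -1
df1.loc[df1['Open'] == df1['Low'], 'O=L'] = 1
# Unflagged rows are NaN at this point; fillna(0) replaces every NaN in the frame with 0
df1.fillna(0, inplace=True)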



In [None]:

df1.head()



In [None]:

# Add a new column with the open-to-close percentage change
df1['CHG'] = 100*(df1['Close'].sub(df1['Open']))/df1['Open']



In [None]:

# Total percentage change on days where Open equals Low
df1[df1['O=L']==1]['CHG'].sum()



In [None]:

# Total percentage change on days where Open equals High
df1[df1['O=H']==-1]['CHG'].sum()




<h2 id="File-Input-&amp;-Output-with-Arrays">File Input &amp; Output with Arrays<a class="anchor-link" href="#File-Input-&amp;-Output-with-Arrays">¶</a></h2><p>Accessing data is the first step in any data analysis. Pandas features a number of functions for reading tabular data as a DataFrame object.</p>
<p>The pandas I/O API is a set of top-level reader functions, accessed like <code>pandas.read_csv()</code>, that generally return a pandas object. The corresponding writer functions are object methods, accessed like <code>DataFrame.to_csv()</code>. The table below shows some of the key reader and writer functions.</p>
<table>
<thead><tr>
<th>Format</th>
<th>Description</th>
<th>Reader</th>
<th>Writer</th>
</tr>
</thead>
<tbody>
<tr>
<td>text</td>
<td>CSV</td>
<td>read_csv</td>
<td>to_csv</td>
</tr>
<tr>
<td>text</td>
<td>HTML</td>
<td>read_html</td>
<td>to_html</td>
</tr>
<tr>
<td>xls or xlsx</td>
<td>MS Excel</td>
<td>read_excel</td>
<td>to_excel</td>
</tr>
<tr>
<td>binary</td>
<td>Pickle</td>
<td>read_pickle</td>
<td>to_pickle</td>
</tr>
<tr>
<td>sql</td>
<td>SQL</td>
<td>read_sql</td>
<td>to_sql</td>
</tr>
</tbody>
</table>
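<p>A reader and writer pair can be tried without touching the file system. This is a sketch that writes a hypothetical frame to CSV text in memory and reads it back; the data values and the <code>Date</code> index name are illustrative assumptions.</p>

```python
import io
import pandas as pd

# Hypothetical price data indexed by date
frame = pd.DataFrame(
    {'Close': [300.5, 301.25]},
    index=pd.to_datetime(['2020-01-02', '2020-01-03']),
)
frame.index.name = 'Date'

# Writer: with no path argument, to_csv() returns the CSV text as a string
csv_text = frame.to_csv()

# Reader: parse the text back, using the first column as a date index
restored = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)
print(restored)
```

<p>Passing a file path instead of the in-memory buffer works the same way for both functions.</p>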


In [None]:

# Reading a csv file
data = pd.read_csv('data/spy.csv', index_col=0, parse_dates=True)
data.tail()



In [None]:

# Reading an Excel file
data = pd.read_excel('data/mystocks.xlsx',  sheet_name='AMZN', index_col=0, parse_dates=True)
data.tail()



In [None]:

# Writer functions
# data.to_csv(), data.to_excel




